Tar your files into a file called hw1.tar and email it as an attachment to mao@cs.fsu.edu, along with a cc to asriniva@cs.fsu.edu. Make sure that you do not include the executable, any object files, or core dump files!
tar xvf hw1.tar
make
query file1 file2
...
[queries]
Your program, query, takes a few file names as its command line arguments, as described below. The user will then input a few "queries", which should be handled as described below.
Details:
Your program will read the environment variable VOCABULARYLIST, which
we will set to the name of a file that will contain a list of words,
one word per line. (In order to test your program, you too will need
to create such a file, and set this variable to the appropriate file
name.) Your program will read this list of words and create a data
structure that permits efficient operations on this list, as needed by
the rest of the program. Let us call this data structure Vocabulary.
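One possible shape for the Vocabulary structure is sketched below, assuming a sorted array searched with bsearch. The names VocabEntry, vocab_load, vocab_index, and vocab_open_from_env are hypothetical, as is the 255-character limit on a word; the assignment does not prescribe any of them. Each entry keeps its position in the file so that word numbers still follow the order of the vocabulary list after sorting.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: entries sorted by word, each remembering its
   0-based position in the vocabulary file. */
typedef struct {
    char *word;
    size_t id;
} VocabEntry;

typedef struct {
    VocabEntry *entries;
    size_t count;
} Vocabulary;

static int cmp_entries(const void *a, const void *b)
{
    const VocabEntry *x = a;
    const VocabEntry *y = b;
    return strcmp(x->word, y->word);
}

/* Reads one word per line from fp. Returns 0 on success, -1 if memory
   runs out (a sketch: partial allocations are not freed here). */
int vocab_load(Vocabulary *v, FILE *fp)
{
    char line[256];                 /* assumed maximum word length */
    size_t cap = 0;

    assert(v != NULL && fp != NULL);
    v->entries = NULL;
    v->count = 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';    /* strip the newline */
        if (line[0] == '\0')
            continue;                           /* skip blank lines */
        if (v->count == cap) {
            size_t ncap = cap ? 2 * cap : 64;
            VocabEntry *tmp = realloc(v->entries, ncap * sizeof *tmp);
            if (tmp == NULL)
                return -1;
            v->entries = tmp;
            cap = ncap;
        }
        v->entries[v->count].word = malloc(strlen(line) + 1);
        if (v->entries[v->count].word == NULL)
            return -1;
        strcpy(v->entries[v->count].word, line);
        v->entries[v->count].id = v->count;
        v->count++;
    }
    if (v->count > 0)
        qsort(v->entries, v->count, sizeof *v->entries, cmp_entries);
    return 0;
}

/* Returns the file position of word, or -1 if it is not in the list. */
long vocab_index(const Vocabulary *v, const char *word)
{
    VocabEntry key;
    VocabEntry *hit;

    assert(v != NULL && word != NULL);
    if (v->count == 0)
        return -1;
    key.word = (char *)word;        /* the key is only read, never written */
    key.id = 0;
    hit = bsearch(&key, v->entries, v->count, sizeof *v->entries, cmp_entries);
    return hit ? (long)hit->id : -1;
}

/* Opens the file named by VOCABULARYLIST; NULL if unset or unreadable. */
FILE *vocab_open_from_env(void)
{
    const char *path = getenv("VOCABULARYLIST");
    return path ? fopen(path, "r") : NULL;
}
```

A sorted array gives O(log v) lookups with very little code; a hash table would be another reasonable choice if the vocabulary is large.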
Your program will read the files (documents) specified as command line
arguments, and represent this set of documents as a matrix having d
rows and v columns, where d = number of documents and v = number of
words in the vocabulary. Each row of the matrix will represent a
document, and each column will represent a word from the vocabulary
list. The (i,j)th element of this matrix will be the frequency of
word # j (of the vocabulary) in document # i. [Frequency = (Number of
occurrences of the word, in the document)/(Total number of words in
the document). Note that the total number of words includes those not
in the vocabulary.]
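As an illustration of the frequency definition, the sketch below fills one row of the document matrix. It assumes that a "word" is any whitespace-delimited token of at most 255 characters, which the assignment does not state, and it uses a simple linear search in place of a real Vocabulary lookup. The names find_word and fill_row are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a Vocabulary lookup: linear search that
   returns the column of word, or -1 if it is not in the vocabulary. */
static long find_word(const char *const *vocab, size_t v, const char *word)
{
    size_t j;
    for (j = 0; j < v; j++)
        if (strcmp(vocab[j], word) == 0)
            return (long)j;
    return -1;
}

/* Fills row[0..v-1] with the frequencies for the document read from fp
   and returns the total number of words seen in the document. */
size_t fill_row(FILE *fp, const char *const *vocab, size_t v, double *row)
{
    char word[256];
    size_t total = 0, j;

    assert(fp != NULL && vocab != NULL && row != NULL);
    for (j = 0; j < v; j++)
        row[j] = 0.0;
    while (fscanf(fp, "%255s", word) == 1) {
        long col = find_word(vocab, v, word);
        total++;                    /* out-of-vocabulary words count too */
        if (col >= 0)
            row[col] += 1.0;
    }
    if (total > 0)                  /* leave an empty document as all zeros */
        for (j = 0; j < v; j++)
            row[j] /= (double)total;
    return total;
}
```

For a document containing "the cat saw the dog" and a vocabulary of "the", "cat", "bird", the row would be 2/5, 1/5, 0, because the denominator is all five words and not only the three that appear in the vocabulary.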
The program will then read queries typed by the user, from stdin,
handling one query at a time. Each query will be a list of words,
possibly including words from outside the vocabulary, terminated by a
newline. The program will determine the document that best matches a
query by the following process. It will first create a vector of
length v, with the ith component of the vector equal to the number of
occurrences of the ith word (of the vocabulary) in the query. (This
will enable the user to emphasize certain words by typing them
multiple times.) The program will then multiply the document set
matrix by this vector, and choose the document corresponding to the
largest component of the resulting vector. The name of this document
will be printed to stdout. This process is repeated until the program
encounters end of file.
After untarring the examples archive, you will obtain 3 directories, ex1, ex2, and ex3. Under each of these directories, you will find (i) a file called 'Vocab', giving the vocabulary list, (ii) files 'Doc[1-8]', which are eight documents, (iii) a file called 'Query', which lists queries followed by the document that best matched, and (iv) a file called 'Matrix', which gives the document matrix, followed by the query vector for each query.
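The query-matching step (build the query vector, multiply the document matrix by it, pick the largest component) might be sketched as follows. The function names query_vector and best_document are hypothetical, the matrix is assumed to be stored row-major in a flat array, query words are assumed to be separated by whitespace, and ties are broken in favor of the earlier document; the assignment does not specify any of these choices.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sets q[j] to the number of times vocab[j] occurs in line.
   Note: strtok modifies line in place. */
void query_vector(char *line, const char *const *vocab, size_t v, double *q)
{
    const char *delims = " \t\r\n";
    char *tok;
    size_t j;

    assert(line != NULL && vocab != NULL && q != NULL);
    for (j = 0; j < v; j++)
        q[j] = 0.0;
    for (tok = strtok(line, delims); tok != NULL; tok = strtok(NULL, delims)) {
        for (j = 0; j < v; j++) {
            if (strcmp(vocab[j], tok) == 0) {
                q[j] += 1.0;        /* repeated words add extra weight */
                break;
            }
        }
    }
}

/* matrix holds d rows of v columns, row-major. Returns the index of
   the row with the largest dot product against q. */
size_t best_document(const double *matrix, size_t d, size_t v, const double *q)
{
    size_t i, j, best = 0;
    double best_score = -1.0;       /* real scores are never negative */

    assert(matrix != NULL && q != NULL && d > 0);
    for (i = 0; i < d; i++) {
        double score = 0.0;
        for (j = 0; j < v; j++)
            score += matrix[i * v + j] * q[j];
        if (score > best_score) {   /* strict: earlier document wins ties */
            best_score = score;
            best = i;
        }
    }
    return best;
}
```

The caller would map the returned row index back to the corresponding command line argument to print the document name, and repeat for each line read from stdin until end of file.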
Note:
Hints and reasons for using this procedure to select the best document will be provided later, in class. But meanwhile, please start working on the program design and then on the code.
Grading criteria
Your assignment will be graded by the criteria given below, which include using asserts to check validity of the program's assumptions, providing a function that automatically tests the correctness of each program module, reasonable variable names, comments explaining non-obvious aspects of the program, using guards in the header files to protect against multiple inclusions, facilitating use by C++ programmers by enabling C linkage, etc.
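Several of these practices can be shown together in one small module. The sketch below is only an illustration: the file names wordcount.h and wordcount.c and the functions count_words and count_words_selftest are hypothetical, and header and implementation are shown in a single listing for brevity.

```c
/* ---- wordcount.h (hypothetical header) ---- */
#ifndef WORDCOUNT_H                 /* guard against multiple inclusion */
#define WORDCOUNT_H

#include <stddef.h>

#ifdef __cplusplus
extern "C" {                        /* C linkage for C++ callers */
#endif

size_t count_words(const char *text);
int count_words_selftest(void);     /* returns 1 if all checks pass */

#ifdef __cplusplus
}
#endif

#endif /* WORDCOUNT_H */

/* ---- wordcount.c (hypothetical implementation) ---- */
#include <assert.h>
#include <ctype.h>

/* Counts whitespace-delimited words in text. */
size_t count_words(const char *text)
{
    size_t n = 0;
    int in_word = 0;

    assert(text != NULL);           /* states the function's assumption */
    for (; *text != '\0'; text++) {
        if (isspace((unsigned char)*text)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;            /* first character of a new word */
            n++;
        }
    }
    return n;
}

/* Module self-test: exercises empty, padded, and multi-word input. */
int count_words_selftest(void)
{
    return count_words("") == 0
        && count_words("  one ") == 1
        && count_words("two\twords\n") == 2;
}
```

A test driver could call each module's self-test function in turn and report any that return 0.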
Hints, clarifications, and example:
http://www.cs.fsu.edu/~asriniva/courses/aup02/hws/hw1-hints.html
Hints on design:
http://www.cs.fsu.edu/~asriniva/courses/aup02/hws/hw1-designhints.txt
Larger examples:
http://www.cs.fsu.edu/~asriniva/courses/aup02/hws/ex.tar
Final test files:
http://www.cs.fsu.edu/~asriniva/courses/aup02/hws/tests.hw1.tar
Last modified: 23 Sep 2002