Find the relevance of documents
Due: 7 Nov 2012
Educational objectives: Implement a self-restructuring binary search tree, develop test cases to test the correctness of your software, and empirically compare the performance for different data structures.
Statement of work: (i) Implement a self-restructuring binary search tree class that self-restructures as specified below and (ii) implement a document retrieval program, which is a modification of that in assignment 3, but using your self-restructuring binary search tree to store the list of words in each document. You should not use an STL map or set in your implementation.
Deliverables: Turn in a makefile and all header (*.h) and cpp (*.cpp) files that are needed to build your software. Turn in your development log too, which should be a plain ASCII text file called LOG.txt in your project directory. You will submit all of these as described in www.cs.fsu.edu/~asriniva/courses/DS12/HWinstructions.html.
Requirements:
- Create a subdirectory called proj4.
- You will need to have a makefile in this directory. In addition, all the header and cpp files needed to build your software must be present here, as well as the LOG.txt file.
- You should implement appropriate classes for the software. Your code should be well designed and object oriented.
- Your software's main task is as follows. A user will give it queries consisting of a set of words. For each query, the software should give the most relevant documents that contain all words in that query.
- The software is run by the user on the command line, as follows: Retrieve Filename-List, where Filename-List is a list of file names of zero or more ASCII text documents.
- The software first analyzes each file given on the command line. Details of the analysis are explained later. The software then waits for a series of user actions, and responds to each user action as described below.
Possible user actions and required software response:
a Filename: Analyze the ASCII text file Filename. If this file has already been analyzed previously, then replace the results of the previous analysis with the current one. If the file does not exist, then output File Filename does not exist to standard output (not to standard error). This command reads each word in the file. Each time a new word is encountered, it is inserted into your self-restructuring binary search tree (there is a distinct tree for each document).
q n Word-List: n is a positive integer and Word-List is a list of words (this is defined later) separated by one or more blanks. The software returns a list of the n most relevant documents containing all the words in Word-List. If fewer than n documents match the query, then all the matching documents are returned. Each line of the output will first give the name of a document, followed by a blank, followed by the document's relevance, which is a floating point number defined later. The documents are output in decreasing order of relevance (that is, most relevant document first). If no document matches, then output No matching document to standard output. This command also self-restructures the binary search tree for each document as follows. For each word in the word list that is found, that word is moved two steps up in the tree using rotations (unless it is already in the first or second level, in which case it is made the root).
p n Filename: n is a positive integer and Filename is the name of a file that has previously been analyzed. If this file has not been analyzed previously, then output File Filename has not been analyzed to standard error. Otherwise, output the first n words in a pre-order traversal of the tree storing the words in Filename.
x: Quit the program.
- Output Invalid command to standard output for any other command.
- Analysis of a document. In this process, the software will identify the set of words in the document. Each word will be given a floating point weight. The weight of a word is the ratio of the frequency of its occurrence to the total number of words in the document. You should store the weights of each word, of each document analyzed, in a suitable container.
A word is defined as a sequence of adjacent characters in the input file, separated by any of the following delimiters: whitespace (blanks, tabs, and newlines) or any of the following characters: ! ( ) - : ; " , . ? / (A delimiter cannot be a part of any word.) Note that abandon and Abandon are different words, as are car and car's.
- Relevance of a document. A document is relevant to a query only if it contains each word in the query. If a document contains each word in the query, then its relevance is defined as the sum of the weights, in this document, of all the words in the query.
- Result.txt: This is an ASCII text file. You should first describe how you tested your binary search tree code (for example, what types of possible errors you checked for). You should then compare the performance of your code against that of your code from assignment 3. Under what circumstances is the current code faster?
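The restructuring rule of the q command (a found word moves two steps up, or becomes the root if it is already on the first or second level) can be sketched as below. This is only a minimal illustration, not a reference solution. The names Node, WordTree, insert, access, and preorder are hypothetical, and the sketch assumes that "two steps up" means two consecutive single rotations of the found node over its current parent; other rotation orders are possible, so check the intended behaviour before relying on it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical node layout; the assignment does not prescribe one.
struct Node {
    std::string word;
    Node* left = nullptr;
    Node* right = nullptr;
    Node* parent = nullptr;
    explicit Node(const std::string& w) : word(w) {}
};

class WordTree {
public:
    WordTree() = default;
    WordTree(const WordTree&) = delete;
    WordTree& operator=(const WordTree&) = delete;
    ~WordTree() { destroy(root_); }

    // Plain BST insert with no restructuring; duplicates are ignored.
    void insert(const std::string& w) {
        Node** link = &root_;
        Node* parent = nullptr;
        while (*link) {
            if (w == (*link)->word) return;
            parent = *link;
            link = (w < parent->word) ? &parent->left : &parent->right;
        }
        *link = new Node(w);
        (*link)->parent = parent;
    }

    // Query-time lookup: if w is present, lift it by at most two rotations.
    // A root stays put, a child of the root becomes the root after one rotation.
    bool access(const std::string& w) {
        Node* x = root_;
        while (x && x->word != w) x = (w < x->word) ? x->left : x->right;
        if (!x) return false;
        for (int step = 0; step < 2 && x->parent; ++step) rotateUp(x);
        return true;
    }

    // First n words of a pre-order traversal (as needed by the p command).
    std::vector<std::string> preorder(std::size_t n) const {
        std::vector<std::string> out;
        visit(root_, n, out);
        return out;
    }

private:
    Node* root_ = nullptr;

    // Single rotation that lifts x above its parent and preserves BST order.
    void rotateUp(Node* x) {
        Node* p = x->parent;
        assert(p != nullptr);
        Node* g = p->parent;
        if (p->left == x) {            // right rotation at p
            p->left = x->right;
            if (x->right) x->right->parent = p;
            x->right = p;
        } else {                       // left rotation at p
            p->right = x->left;
            if (x->left) x->left->parent = p;
            x->left = p;
        }
        p->parent = x;
        x->parent = g;
        if (!g) root_ = x;
        else if (g->left == p) g->left = x;
        else g->right = x;
    }

    static void visit(const Node* t, std::size_t n, std::vector<std::string>& out) {
        if (!t || out.size() >= n) return;
        out.push_back(t->word);
        visit(t->left, n, out);
        visit(t->right, n, out);
    }

    static void destroy(Node* t) {
        if (!t) return;
        destroy(t->left);
        destroy(t->right);
        delete t;
    }
};
```

Under these assumptions, inserting d, b, f, a, c in that order and then accessing a lifts a from the third level to the root. Parent pointers keep the rotation code short; a recursive version without parent pointers would work equally well.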
Sample file and executable: No sample executable will be provided. You need to write good test cases to check the correctness of your program.
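One way to build such test cases is to compute the expected weights and relevance of a tiny document by hand and compare them with what the code produces. The sketch below illustrates the word, weight, and relevance definitions given in the requirements. The names WeightList, tokenize, computeWeights, and relevance are hypothetical, and the vector of pairs is only a stand-in for illustration; in the actual project the words are stored in your self-restructuring tree.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in container: (word, weight) pairs for one document.
typedef std::vector<std::pair<std::string, double> > WeightList;

// Split text on whitespace and on the delimiters ! ( ) - : ; " , . ? /
std::vector<std::string> tokenize(const std::string& text) {
    static const std::string delims = " \t\n!()-:;\",.?/";
    std::vector<std::string> words;
    std::string cur;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (delims.find(text[i]) != std::string::npos) {
            if (!cur.empty()) { words.push_back(cur); cur.clear(); }
        } else {
            cur += text[i];
        }
    }
    if (!cur.empty()) words.push_back(cur);
    return words;
}

// Weight of a word = number of occurrences / total number of words.
WeightList computeWeights(const std::vector<std::string>& words) {
    WeightList weights;
    for (std::size_t i = 0; i < words.size(); ++i) {
        std::size_t j = 0;
        while (j < weights.size() && weights[j].first != words[i]) ++j;
        if (j == weights.size()) weights.push_back(std::make_pair(words[i], 0.0));
        weights[j].second += 1.0;  // count occurrences first
    }
    for (std::size_t j = 0; j < weights.size(); ++j)
        weights[j].second /= static_cast<double>(words.size());
    return weights;
}

// Returns true and sets score only if every query word occurs in the document.
// Assumes a non-empty query (an assumption of this sketch).
bool relevance(const WeightList& weights, const std::vector<std::string>& query,
               double& score) {
    assert(!query.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < query.size(); ++i) {
        std::size_t j = 0;
        while (j < weights.size() && weights[j].first != query[i]) ++j;
        if (j == weights.size()) return false;  // a query word is missing
        sum += weights[j].second;
    }
    score = sum;
    return true;
}
```

For the text "the cat, the Cat; car's car" there are six words, so the has weight 2/6 and each of cat, Cat, car's, and car has weight 1/6. The query "the cat" therefore has relevance 3/6 = 0.5, while the query "the dog" does not match the document at all.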
Bonus points:
You will get 3 bonus points if your code is faster than our sample executable on some large tests which we will announce after the submission deadline. (Your code should also be correct.)
You may get up to 5 additional points for significant extra work, such as implementing more features (for example, determining that different forms of the same word, such as serve, served, serving, and serves, are equivalent) or providing a GUI. Please obtain written permission from me prior to doing this. If you wish to get bonus points, then please submit your work as usual, but send an email to John Nguyen. John will schedule a meeting with you, and you can demonstrate the special features of your software then.
Copyright: Ashok Srinivasan, Florida State University.
Last modified: 20 Oct 2012