5570: Advanced Unix Programming

Hints and clarifications on Programming assignment 1

Clarification

Word size

All words will be less than 100 characters long. A word is a sequence of characters delimited by white space (possibly several white-space characters in a row, in both documents and queries). You can treat upper-case and lower-case letters as different characters. In practice, I will probably filter the files so that they contain only lower-case letters (no punctuation, digits, etc.).
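One possible way to read words under these rules is sketched below. It assumes a 100-byte buffer (99 characters plus the terminating '\0') and relies on the fact that the %s conversion of fscanf skips any run of leading white space. The names next_word, count_words and MAX_WORD_LEN are only illustrative; they are not part of the assignment.

```c
#include <stdio.h>

#define MAX_WORD_LEN 100   /* words are shorter than 100 characters: 99 + '\0' */

/* Reads the next white-space-delimited word into buf, which must hold at
   least MAX_WORD_LEN bytes. Returns 1 on success, 0 when no word is left. */
int next_word(FILE *fp, char *buf)
{
    /* %99s skips leading white space and stores at most 99 characters */
    return fscanf(fp, "%99s", buf) == 1;
}

/* Counts the words remaining in fp (illustrative helper). */
long count_words(FILE *fp)
{
    char word[MAX_WORD_LEN];
    long n = 0;

    while (next_word(fp, word))
        n++;
    return n;
}
```

Because the field width is given in the format string, an unexpectedly long token cannot overflow the buffer; it would simply be split into pieces of 99 characters.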

Vocabulary list

You may not assume anything about the number of words. However, the vocabulary file will typically contain around 100 words. Words will not be repeated. The file may or may not be sorted. Your data structure should work efficiently with a sorted file too. A plain binary search tree will therefore not be efficient: given sorted input, it degenerates into the equivalent of a linked list. A balanced tree will be acceptable. You may also use other data structures, such as a hash table.
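As an illustration of the hash table option, here is a minimal sketch using separate chaining. The bucket count (1024), the hash function (a common multiply-by-33 string hash) and the names vocab_insert and vocab_lookup are all my assumptions, not requirements. Since each bucket is a chain that can grow, the table places no limit on the number of words, and its performance does not depend on whether the vocabulary file is sorted.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024          /* assumed table size; chains absorb any excess */

struct entry {
    char *word;                /* private copy of the vocabulary word */
    int index;                 /* position of the word in the vocabulary file */
    struct entry *next;        /* next entry hashing to the same bucket */
};

static struct entry *buckets[NBUCKETS];

/* Multiply-by-33 string hash, reduced to a bucket number. */
static unsigned long hash(const char *s)
{
    unsigned long h = 5381;

    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Returns the index stored for word, or -1 if it is not in the vocabulary. */
int vocab_lookup(const char *word)
{
    struct entry *e;

    for (e = buckets[hash(word)]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e->index;
    return -1;
}

/* Adds word with the given index. Returns 0 on success, -1 if out of memory.
   Does not check for duplicates, since vocabulary words are not repeated. */
int vocab_insert(const char *word, int index)
{
    unsigned long h = hash(word);
    struct entry *e = malloc(sizeof *e);

    if (e == NULL)
        return -1;
    e->word = malloc(strlen(word) + 1);
    if (e->word == NULL) {
        free(e);
        return -1;
    }
    strcpy(e->word, word);
    e->index = index;
    e->next = buckets[h];      /* push onto the front of the chain */
    buckets[h] = e;
    return 0;
}
```

With around 100 words and 1024 buckets, most chains hold at most one entry, so a lookup typically costs one hash computation and one string comparison.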

Documents

There is no limit on the size of a document. Words can, of course, be repeated, and will generally not be sorted. Note that the document will contain words outside the vocabulary too. You can ignore these; however, you need to include them in your count of the total number of words, since the frequency is defined as: Frequency = (Number of occurrences of the word in the document) / (Total number of words in the document). Note that the total number of words includes those not in the vocabulary. You can use wc -w, if you wish, to count the number of words in a file.
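For example, if a document contains 5 words in total and a vocabulary word occurs twice, its frequency is 2/5 = 0.4, regardless of how many of the other words are in the vocabulary. A rough sketch of this computation for a single word follows; the function name word_frequency and its interface are hypothetical, and a real solution would count all vocabulary words in one pass rather than one word per call.

```c
#include <stdio.h>
#include <string.h>

/* Computes the frequency of target in the document read from fp.
   Every word counts toward the total, whether or not it is in the
   vocabulary. If total_out is not NULL, the total word count is stored
   there. Returns 0.0 for an empty document. */
double word_frequency(FILE *fp, const char *target, long *total_out)
{
    char word[100];            /* words are shorter than 100 characters */
    long total = 0;
    long hits = 0;

    while (fscanf(fp, "%99s", word) == 1) {
        total++;
        if (strcmp(word, target) == 0)
            hits++;
    }
    if (total_out != NULL)
        *total_out = total;
    return total > 0 ? (double)hits / (double)total : 0.0;
}
```

The total stored through total_out should agree with what wc -w reports for the same file, which gives a quick way to check the counting loop.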

Query

You may assume the length of each query is less than 1000 characters. A query is terminated by a single newline and will contain at least one word. There will be no newline between the words of a single query. You can therefore assume that all the characters up to the next newline form a single query.
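Given these limits, one way to read a query is to fetch a whole line with fgets and split it on white space with strtok, as sketched below. The buffer size assumes at most 999 characters plus the newline and the terminating '\0'; the bound of 500 words is the most that can fit in 999 characters (one-letter words separated by single spaces). The names read_query, QUERY_BUF and MAX_QUERY_WORDS are illustrative only.

```c
#include <stdio.h>
#include <string.h>

#define QUERY_BUF 1001         /* up to 999 characters + '\n' + '\0' */
#define MAX_QUERY_WORDS 500    /* upper bound on words in 999 characters */

/* Reads one query (one line) from fp into line, which must hold QUERY_BUF
   bytes, and stores pointers to its words in words[], which must hold
   MAX_QUERY_WORDS pointers. The pointers refer to storage inside line.
   Returns the number of words, or -1 at end of input. */
int read_query(FILE *fp, char *line, char *words[])
{
    const char *delims = " \t\n";
    char *tok;
    int n = 0;

    if (fgets(line, QUERY_BUF, fp) == NULL)
        return -1;

    /* strtok treats any run of delimiters as a single separator */
    for (tok = strtok(line, delims);
         tok != NULL && n < MAX_QUERY_WORDS;
         tok = strtok(NULL, delims))
        words[n++] = tok;
    return n;
}
```

Note that strtok modifies the line in place and keeps internal state, so the words must be used (or copied) before the next call to read_query overwrites the buffer.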

Hints

  1. C linkage in C++ programs: Check the use of extern "C" in /usr/include/stdio.h on program. Then read about it, either in a C++ book, or on the web.

  2. Program design: You are free to choose your design. I might have used the following modules: (i) vocabulary, (ii) document, (iii) documentset, (iv) matrices, and (v) query. Each of these would have a header file specifying an interface, and a C file providing an implementation. You might also have a "utility" module to provide miscellaneous facilities.
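To make the two hints concrete, here is a small sketch of what a header and its C file might look like for a "utility" module. The module contents and the function util_is_valid_word are invented for illustration. The header and the implementation are shown in one listing, separated by comments; in practice they would be two files. The extern "C" block is only seen by a C++ compiler, where it asks for C linkage so that C++ code can call functions compiled as C.

```c
/* ---- utility.h (hypothetical module) ---- */
#ifndef UTILITY_H
#define UTILITY_H

#ifdef __cplusplus
extern "C" {                   /* C linkage when included from C++ code */
#endif

/* Returns 1 if w is non-empty and shorter than 100 characters, else 0. */
int util_is_valid_word(const char *w);

#ifdef __cplusplus
}
#endif

#endif /* UTILITY_H */

/* ---- utility.c ---- */
#include <string.h>

int util_is_valid_word(const char *w)
{
    size_t n = strlen(w);

    return n > 0 && n < 100;
}
```

A C compiler does not define __cplusplus, so it skips the extern "C" lines entirely, and the same header can be shared by C and C++ source files.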

Example


Last modified: 9 Sep 2002