Project 3: WordSmith
Analyzing vocabulary in texts
Revision dated 02/11/17
Educational Objectives:
On successful completion of this assignment, the student should be able to
- Define the concept of associative container as a re-usable component in programs
- State the distinction between unimodal and multimodal associative containers, and:
- Give examples of each type
- Describe use cases making each type appropriate
- State the distinction between ordered and unordered associative containers
- Give examples of each type
- Describe use cases making each type appropriate
- State the API for associative containers of these types:
- Unimodal Ordered Set
- Multimodal Ordered Set (aka Ordered Multiset)
- Unimodal Unordered Set
- Multimodal Unordered Set (aka Unordered Multiset)
Describe the behavior and state the runtime expectations for each operation.
- Describe various implementation plans for ordered associative containers,
and discuss whether and why runtime expectations are met by the implementation.
============================================
rubric to be used in assessment
--------------------------------------------
build [0..5]: x
test1 [0..10]: xx
test2 [0..10]: xx
test3 [0..10]: xx
log & report [0..5]: x
requirements and specs [0..5]: x
software engineering [-20..+5]: x
dated submission deduction [2 pts per]: ( x)
--
total: [0..50]: xx
============================================
Background Knowledge Required: Be sure that you have mastered the
material in these chapters before beginning the assignment:
Introduction to Sets,
Introduction to Maps.
Operational Objectives:
Create a client WordSmith of the Set API that serves as a text
analysis application.
Deliverables:
wordsmith.h,
wordsmith.cpp,
cleanup.cpp,
makefile.ws,
log.txt.
Procedural Requirements
The official development/testing/assessment environment is specified in the Course Organizer.
Begin by copying all of the files from the assignment distribution
directory, which will include:
LIB/proj3/main_ws.cpp # driver program for WordSmith
LIB/proj3/data* # sample word files
LIB/cleanup.rulz # rules for converting a string to a word
LIB/cleanup.eg # examples of applying the rules
LIB/cleanup.eg.com # command file to process cleanup.eg: "ws.x cleanup.eg.com"
LIB/proj3/deliverables.sh # submission configuration file
By now you should have "submit.sh" stored in your .bin directory and available
as a command in your system.
Define and implement the class WordSmith, placing the class API
in the header file wordsmith.h and implementations in the code file wordsmith.cpp.
Be sure to fully cite all references used for code and ideas, including
URLs for web-based resources. These citations should be in the file
documentation and if appropriate detailed in relevant code locations.
Test your API using the distributed client program main_ws.cpp.
Keep a text file log of your development and testing activities in log.txt.
Submit the assignment using command submit.sh.
Warning: Submit scripts do not work on the program and
linprog servers. Use shell.cs.fsu.edu to submit assignments. If you do
not receive the second confirmation with the contents of your assignment, there has
been a malfunction.
Functionality Requirements
- A makefile named makefile.ws is required to build the project
components main_ws.o, wordsmith.o, xstring.o and
assemble them into an executable ws.x.
- WordSmith can read an arbitrary text file on command and extract all of the
words in the file, maintaining the unique words, along with the frequency of
occurrence of each word, in a set. Letters are converted to lower case before
comparison and storage. A word is understood to be a string of letters and/or
digits, with certain other symbols allowed. Most non-alpha-numeric characters
are ignored. Exceptions are hyphens and apostrophes, which are considered part
of the word, so that contractions and hyphenated constructs are counted as
individual words. (Note: two adjacent apostrophes are not considered part of
a word, since they represent closing of a quotation.)
- WordSmith can write an analysis of its current stored words. This analysis
consists of a lexicographical listing of the unique words together with their
frequencies, followed by a count of the total number of words and the vocabulary
size (number of unique words). Note that this is a cumulative analysis over all
of the input files read since starting up WordSmith (or since the last clearing
operation).
- Note that a component of the analysis and summary is a listing of the files
whose contents contributed to the data.
- WordSmith must operate with the supplied driver program
LIB/proj3/main_ws.cpp, which has a user interface with the following options:
- Read a file. Read the words of the file into the structure
(and report summary to screen).
Note that the file read method has a bool argument such that when true, a
one-line progress statement is written to screen for each 65,536 words
read. (Note that 65,536 = 0x10000 = 2^16, so 65,535 = 0xFFFF = 2^16 - 1.)
- Write an analysis of the current data (including input file names) to a
file (and report summary to screen).
Note that the driver program has an option to output a report file to
screen. This action is independent of WordSmith.
- Clear current data: remove all data from the structure.
- Show current size and send a data summary to the screen.
- display Menu.
- eXit BATCH mode.
- Quit program.
Use the source code in the driver program main_ws.cpp to determine the
syntax requirements for the WordSmith public interface. Use the
executable in area51 to model expected behavior.
The following shows the exact syntax of the WordSmith API required by the driver program:
bool ReadText (const fsu::String& infile, bool showProgress = 0);
bool WriteReport (const fsu::String& outfile, unsigned short c1 = 15, unsigned short c2 = 15) const; // c1,c2 are column widths
void ShowSummary () const;
void ClearData ();
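The role of the column widths c1 and c2 in WriteReport can be illustrated with a small sketch. Everything here is an assumption made for illustration: WriteTable is a hypothetical function, std::map stands in for the ordered word set, and the header and summary wording are invented. The actual report layout should be modeled on the area51 executable.

```cpp
#include <iomanip>
#include <map>
#include <ostream>
#include <string>

// Hypothetical stand-in: std::map plays the role of the ordered word set.
// c1 is the width of the word column, c2 the width of the frequency column.
void WriteTable(std::ostream& out,
                const std::map<std::string, unsigned long>& words,
                unsigned short c1 = 15, unsigned short c2 = 15)
{
  unsigned long total = 0;
  out << std::left  << std::setw(c1) << "word"
      << std::right << std::setw(c2) << "frequency" << '\n';
  // std::map iterates in lexicographical order of its string keys.
  for (const auto& entry : words)
  {
    out << std::left  << std::setw(c1) << entry.first
        << std::right << std::setw(c2) << entry.second << '\n';
    total += entry.second;
  }
  out << "Number of words: " << total << '\n'
      << "Vocabulary size: " << words.size() << '\n';
}
```

Left-justifying the word column and right-justifying the frequency column keeps the numbers aligned on their last digit, which is a common choice for this kind of listing.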
- From any directory having access to the course library and containing your
submission files, entering "make -f makefile.ws" should result in an executable called
"ws.x".
Implementation Requirements.
- You should define a class WordSmith, declared in the file
wordsmith.h and implemented in the file wordsmith.cpp. An
object of type WordSmith is used by the driver program to create
the executable ws.x.
- Use the following to define internal types and private class variables for
WordSmith:
private:
// the internal class terminology:
typedef fsu::Pair < fsu::String, unsigned long > EntryType;
typedef fsu::LessThan < EntryType > PredicateType;
// choose one associative container class for SetType:
// typedef fsu::UOList < EntryType , PredicateType > SetType;
// typedef fsu::MOList < EntryType , PredicateType > SetType;
typedef fsu::UOVector < EntryType , PredicateType > SetType;
// typedef fsu::MOVector < EntryType , PredicateType > SetType;
// typedef fsu::RBLLT < EntryType , PredicateType > SetType;
// declare the two class variables:
SetType wordset_;
fsu::List < fsu::String > infiles_;
...
};
This will serve several useful purposes:
- Changing the structure used for SetType is as simple as
changing which typedef statement is uncommented in the WordSmith class
definition.
- It is ensured that you are writing to the Set API.
- The optimal choice (other than RBLLT) is UOVector - unimodal set API with
very fast search time. However, it is important that the project is functional
with all of the choices currently available, even if the results are not what
you want for the multimodal options. This tests the generality and genericity
of your code. Then, later, when you have created RBLLT, you can switch over
and have fast insert times along with fast search times.
- The list of filenames is an fsu::List of fsu::String objects.
You are free to add private helper methods to the class. You should not add any
class variables other than wordset_ and infiles_.
- Add a private helper method as follows:
private: // string cleaner-upper
static void Cleanup (fsu::String&);
Cleanup is used to "clean up" the string passed by reference according to the
processing rules above. The implementation of Cleanup should be in the separate
file cleanup.cpp. (This function may be used again in a future
assignment.)
- Note that the fsu::Pair template class has comparison operators
defined that emphasize the first coordinate of the pair (called "first_",
but playing the role of "key"), so that two pairs are considered equal, for
example, if they have equal keys.
- The application should function correctly in every respect using
fsu::UOList < EntryType > for SetType.
- The application should function correctly in every respect using
fsu::UOVector < EntryType > for SetType.
- As usual, you should employ good software design practice. Your application
should be completely robust and all classes you define should be thoroughly
tested for correct function, robust behavior, and against memory leaks. Your ws.x
should mimic, or improve upon, the behavior illustrated in area51/ws_i.x.
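The key-based comparison of pairs noted above is what makes frequency counting in a set workable: a probe entry with any frequency matches the stored entry for the same word. The following is a minimal sketch under stated assumptions. Entry is a simplified, hypothetical stand-in for fsu::Pair < fsu::String, unsigned long > (the real template may differ in detail), a sorted std::vector stands in for a unimodal ordered vector set, and Tally is an invented helper name.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Simplified, hypothetical stand-in for fsu::Pair<fsu::String, unsigned long>.
struct Entry
{
  std::string   first_;   // key: the word
  unsigned long second_;  // data: the frequency
};

// Comparisons look only at the key; the data field is ignored.
bool operator==(const Entry& a, const Entry& b) { return a.first_ == b.first_; }
bool operator< (const Entry& a, const Entry& b) { return a.first_ <  b.first_; }

// Records one occurrence of `word` in a vector kept sorted by key.
void Tally(std::vector<Entry>& sorted, const std::string& word)
{
  Entry probe{word, 1};
  // Binary search for the first entry whose key is not less than the probe's.
  auto pos = std::lower_bound(sorted.begin(), sorted.end(), probe);
  if (pos != sorted.end() && *pos == probe)
    ++pos->second_;             // word already stored: bump its frequency
  else
    sorted.insert(pos, probe);  // new word: insert at the sorted position
}
```

Because equality ignores second_, Entry{"the", 1} compares equal to Entry{"the", 99}, so the lookup finds the stored word regardless of its current count.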
Hints
It is critical to keep track of the various APIs you are dealing with. Here
is a partial list:
fsu::List
the Set API
fsu::Pair
fsu::String
WordSmith
In addition you have:
The user interface defined in main_ws.cpp, which amounts to a driver
program for WordSmith plus some control commands
You are tasked to implement the WordSmith API. In this implementation you
will need to write to the various fsu APIs.
Test your WordSmith by commenting and uncommenting the various possible SetType
definitions (except for the last one using RBLLT, which we will get to in a
later assignment). How does it behave with each of the unimodal implementations,
UOList and UOVector? How does it behave with the multimodal (MultiSet) versions,
MOList and MOVector?
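As a rough analogy for what to expect, the standard library containers std::set (unimodal) and std::multiset (multimodal) show the difference when the same key is inserted repeatedly. This sketch makes several assumptions: std::pair with a key-only comparator stands in for fsu::Pair, InsertAll is a hypothetical helper, and the fsu containers may treat a duplicate insert differently from std::set (for example, by overwriting the stored data rather than leaving it unchanged).

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>

using Entry = std::pair<std::string, unsigned long>;

// Order entries by key only, mirroring the key-based comparison of pairs.
struct KeyLess
{
  bool operator()(const Entry& a, const Entry& b) const { return a.first < b.first; }
};

// Inserts {word, 1} once per occurrence and returns the resulting size.
template <class SetType>
std::size_t InsertAll(SetType& s, const std::string& word, int occurrences)
{
  for (int i = 0; i < occurrences; ++i)
    s.insert(Entry(word, 1));
  return s.size();
}
```

With std::set < Entry, KeyLess >, three insertions of the same word leave one element; with std::multiset < Entry, KeyLess >, they leave three. A client that relies on insertion alone will therefore report different vocabulary sizes under the two kinds of container.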
Cleanup. There are some supporting files for figuring out the
string "cleanup" process:
cleanup.rulz # explains various aspects of cleaning up a string
cleanup.eg # several examples of strings and their cleaned up substring
cleanup.eg.com # command file to process cleanup.eg
When processed, cleanup.eg should result in a count of 4 for each string in the file.
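The general shape of a cleanup routine, based only on the rules summarized in the Functionality Requirements, might look like the sketch below. It is not the required implementation: CleanupSketch is a hypothetical name, std::string stands in for fsu::String, and the sketch covers only lower-casing, keeping letters, digits, hyphens, and single apostrophes, and dropping runs of adjacent apostrophes. The authoritative rules are in cleanup.rulz and may include cases this sketch does not handle.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical sketch; std::string stands in for fsu::String.
void CleanupSketch(std::string& s)
{
  std::string result;
  for (std::size_t i = 0; i < s.size(); ++i)
  {
    unsigned char c = static_cast<unsigned char>(s[i]);
    if (std::isalnum(c))
    {
      result += static_cast<char>(std::tolower(c));  // letters are lower-cased
    }
    else if (c == '-')
    {
      result += '-';                                 // hyphens are kept
    }
    else if (c == '\'')
    {
      // Measure the run of apostrophes starting at position i.
      std::size_t j = i;
      while (j < s.size() && s[j] == '\'') ++j;
      if (j - i == 1) result += '\'';  // a single apostrophe is kept
      i = j - 1;                       // a run of two or more is dropped
    }
    // Every other character is ignored.
  }
  s = result;
}
```

Under these assumptions, "Don't!" becomes "don't" and "''Hello,''" becomes "hello". Compare any routine of this kind against cleanup.eg before relying on it.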