Project 1: Enhanced wc program

Due: September 22, 11:59pm

Educational Objectives: Apply C++ knowledge to problem solving; gain experience with text processing techniques; gain experience using a makefile to organize and compile applications.

Statement of Work: Develop an enhanced wc program that collects usage statistics for the characters and words in a file (redirected to standard input).

Project Requirement:

The wc program is a UNIX utility that counts the number of lines, words, and characters in a text file. Try "wc < filename" on a UNIX system to see how wc behaves. In this assignment, you will develop an enhanced wc program that not only counts the number of lines, words, and characters, but also collects (and outputs) the usage statistics of characters, words, and numbers in the file.

The program should read the input until it reaches the end, keeping track of the number of lines, words, and characters while counting the number of times each character/word is used. A word can be either an identifier or a number. An identifier is defined as a letter followed by a sequence of letters or digits ('a'..'z', 'A'..'Z', or '0'..'9'). Identifiers are case insensitive ("AA00", "Aa00", "aA00", and "aa00" are considered the same). A number is defined as a sequence of digits ('0'..'9') that is not part of an identifier. Different digit sequences represent different numbers. For example, number "001" is different from number "1". Identifiers are separated by non-letter and non-digit characters. Numbers are separated by identifiers or by other non-letter and non-digit characters.
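The word rules above can be sketched as a small scanner. This is only an illustration under stated assumptions, not the reference implementation: the names Token and tokenize are hypothetical, the whole input is assumed to already be in a string, and a digit run directly followed by a letter (as in "12abc") is assumed to yield the number "12" and then the identifier "abc". Check borderline cases against the supplied 'proj1.x'.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type; the names are illustrative, not part of the assignment.
struct Token {
    bool isNumber;     // true for a number, false for an identifier
    std::string text;  // identifiers are stored in lower case
};

// Explicit ASCII range checks, matching 'a'..'z', 'A'..'Z', '0'..'9'.
bool isLetter(char c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }
bool isDigit(char c) { return c >= '0' && c <= '9'; }
char toLower(char c) { return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c; }

// Split the input into identifiers and numbers.
std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        if (isLetter(input[i])) {
            // Identifier: a letter followed by any run of letters or digits.
            std::string word;
            while (i < input.size() && (isLetter(input[i]) || isDigit(input[i]))) {
                word += toLower(input[i]);
                ++i;
            }
            assert(!word.empty());
            tokens.push_back({false, word});
        } else if (isDigit(input[i])) {
            // Number: a maximal run of digits; leading zeros are kept,
            // so "001" and "1" stay distinct.
            std::string number;
            while (i < input.size() && isDigit(input[i])) {
                number += input[i];
                ++i;
            }
            assert(!number.empty());
            tokens.push_back({true, number});
        } else {
            ++i;  // any other character is a separator
        }
    }
    return tokens;
}
```

Lowering identifiers as they are scanned means "AA00" and "aa00" produce the same text, so they can share one counter later.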

Your program should record the number of times each identifier/number/character occurs. It should first output the number of lines, words, and characters in the file. After that, it should output up to five of the most used characters, up to five of the most used identifiers, and up to five of the most used numbers, together with the number of times each is used. Since identifiers are case insensitive, the program outputs identifiers in lower case letters only.

The characters, numbers, and identifiers should be output in descending order of the number of times they are used. When two characters occur the same number of times, the character with the smaller ASCII value should be considered as being used more frequently. When two identifiers (or numbers) occur the same number of times, the identifier (number) that occurs earlier in the input should be considered as being used more frequently.

An example executable (for the linprog machines), 'proj1.x', is provided in the supplied program files. You should make the output of your program the same as that of 'proj1.x'. When printing characters, use '\t' for tab and '\n' for newline. All other characters should be output normally as they are.
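The ordering and tie-breaking rules above can be sketched with a stable sort. This is a minimal sketch under assumptions, not the reference implementation: the names WordStat, topWords, topChars, and displayChar are hypothetical, the input is assumed to be held in memory, and the exact output layout has to be taken from 'proj1.x'. A stable sort keeps equal-count entries in their original order, so storing words in first-seen order and characters in ascending ASCII order gives both tie-breaks without extra comparisons.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical record for one identifier or number.
struct WordStat {
    std::string text;
    long count;
};

// Most used words first; ties go to the word seen earlier in the input.
std::vector<WordStat> topWords(const std::vector<std::string>& words, std::size_t limit = 5) {
    assert(limit > 0);
    std::unordered_map<std::string, std::size_t> position;  // word -> index in stats
    std::vector<WordStat> stats;                             // kept in first-seen order
    for (const std::string& w : words) {
        auto it = position.find(w);
        if (it == position.end()) {
            position[w] = stats.size();
            stats.push_back({w, 1});
        } else {
            ++stats[it->second].count;
        }
    }
    std::stable_sort(stats.begin(), stats.end(),
                     [](const WordStat& a, const WordStat& b) { return a.count > b.count; });
    if (stats.size() > limit) stats.resize(limit);
    return stats;
}

// Most used characters first; ties go to the smaller ASCII value.
std::vector<std::pair<unsigned char, long>> topChars(const std::string& input, std::size_t limit = 5) {
    assert(limit > 0);
    long counts[256] = {0};
    for (unsigned char c : input) ++counts[c];
    std::vector<std::pair<unsigned char, long>> used;  // built in ascending ASCII order
    for (int c = 0; c < 256; ++c) {
        if (counts[c] > 0) used.push_back({static_cast<unsigned char>(c), counts[c]});
    }
    std::stable_sort(used.begin(), used.end(),
                     [](const std::pair<unsigned char, long>& a,
                        const std::pair<unsigned char, long>& b) { return a.second > b.second; });
    if (used.size() > limit) used.resize(limit);
    return used;
}

// Printable form of a character: tab and newline are shown as escapes.
std::string displayChar(unsigned char c) {
    if (c == '\t') return "\\t";
    if (c == '\n') return "\\n";
    return std::string(1, static_cast<char>(c));
}
```

Identifiers and numbers would each be passed to topWords as separate lists, since the assignment ranks them separately.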

Deliverables: Turn in the files proj1.cpp and makefile in a single tar file online via Canvas. When the two files (proj1.cpp and makefile) are placed in the same directory on linprog, typing 'make' should produce the executable 'proj1.x'.

Grading:

Extra credits:

Hints:

Miscellaneous: