Project 1: Hash Tables

Hash Table Upgrades & Analysis

Version 05/24/19

Educational Objectives: After completing this assignment, the student should be able to accomplish the following:

Describe and explain in detail the concept of a hash table
Implement hash tables as a vector of lists
Define and implement the ADT Table using a private hash table structure
Define and implement the ADT AssociativeArray using a private hash table structure
Define and implement a bidirectional iterator class for Table and AssociativeArray
Calculate the theoretical bucket size distribution for Simple Uniform Hashing and a given table size
Calculate the actual bucket size distribution for a given instance of a hash table
Add methods to fsu::HashTable performing these calculations

Operational Objectives:

Implement two methods in the class template HashTable<K,D,H>:
    size_t HashTable<K,D,H>::MaxBucketSize () const;
    void   HashTable<K,D,H>::Analysis (std::ostream& os) const;
conforming to the requirements and specifications given below.
Revise the HashTable class by adding a private class variable size_ where the number of elements in the table is maintained, and revise all class methods appropriately to accommodate this new class variable. Also add default arguments for the template parameter HashType and constructor argument Prime.

Background Knowledge Requirements: Before starting software development you should study and be familiar with the following:

The lecture notes on Hash Tables
The distributed code implementing fsu::HashTable in LIB/tcpp/hashtbl.h
The supplemental notes on Hash Table Analysis

Deliverables: Four files:

    aa.h         # contains revised implementation of HashTable class template
    hashtbl.cpp  # contains implementations of MaxBucketSize and Analysis
    makefile     # builds 12 executables (5 hasheval*.x, 5 fhtbl*.x, plus rantable.x and hashcalc.x)
    log.txt      # your experience log

Note that hashtbl.cpp is a slave file to both hashtbl.h and aa.h. The upgrade of HashTable in aa.h has the same API. The upgrades in aa.h are intended to make the class simpler to use in applications.

Be sure that your log.txt contains the date/time of each work session and a brief description of the activity during that session. There should be a testing diary in the log. The log should end with a brief discussion of your experience and the knowledge gained from testing various hash functions and load factors.

Procedural Requirements

The official development/testing/assessment environment is specified in the Course Organizer. Code should compile without warnings or errors.
In order not to confuse the submit system, create and work within a separate subdirectory cop4531/proj1.
Maintain your work log in the text file log.txt as documentation of effort, testing results, and development history. This file may also be used to report on any relevant issues encountered during project development.

Begin by copying all of the files in the course project directory into yours, along with a few others that will be helpful:

    LIB/tests/fhtbl.cpp             # test harness for hash tables
    LIB/tests/hashcalc.cpp          # calculates hash values interactively
    LIB/tests/hasheval.cpp          # test focusing specifically on Analysis
    LIB/tests/rantable.cpp          # creates random table data
    LIB/area51/fhtblKISS_i.x        # linprog/Intel/Linux executables
    LIB/area51/fhtblModP_i.x        # ...
    LIB/area51/fhtblMM_i.x
    LIB/area51/fhtblSimple_i.x
    LIB/area51/rantable_i.x
    LIB/area51/hashcalc_i.x
    LIB/area51/hashevalKISS_i.x
    LIB/area51/hashevalModP_i.x
    LIB/area51/hashevalMM_i.x
    LIB/area51/hashevalSimple_i.x

The executables in area51 are distributed only for your information and experimentation.

Next copy two files onto the names of your deliverables:
    cp LIB/tcpp/hashtbl.h aa.h
    cp LIB/tcpp/hashtbl.cpp .
This gives the starting points for both aa.h and hashtbl.cpp.
Code the deliverables by modifying the two new files:
- Complete the implementations of MaxBucketSize and Analysis in file hashtbl.cpp.
- Complete the upgrade of HashTable in the file aa.h.
These can be done in either order. The names and #included slave files in aa.h remain unchanged. The file hashtbl.cpp is used by both aa.h and hashtbl.h interchangeably.
Be sure that you have established the current submit script LIB/scripts/submit.sh as a command in your ~/.bin directory. The current one is version 2.0.

Warning: Submit scripts do not work on the program and linprog servers. Use shell.cs.fsu.edu or quake.cs.fsu.edu to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.

Code Requirements and Specifications - MaxBucketSize and Analysis

MaxBucketSize should return the size of the largest bucket in the hash table instance.
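
A minimal sketch of the idea, assuming the buckets are held as a vector of lists as described in the objectives. The free function max_bucket_size and the Entry type parameter are hypothetical names used only for illustration; the actual member function would scan the private bucket vector of fsu::HashTable instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical stand-alone version of the computation: returns the size of
// the largest bucket, or 0 when there are no buckets at all.
template <typename Entry>
std::size_t max_bucket_size(const std::vector<std::list<Entry>>& buckets)
{
  std::size_t largest = 0;
  for (const auto& bucket : buckets)
    largest = std::max(largest, bucket.size());  // std::list::size is O(1) since C++11
  return largest;
}
```

The scan visits every bucket once, so the cost is proportional to the number of buckets when the list reports its size in constant time.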

Analysis should result in a display (to the std::ostream passed in) as illustrated here:

table size: 998442
number of buckets: 999983
nonempty buckets: 631536
max bucket size: 8
load factor: 1.00
expected search time: 2.00
actual search time: 2.58
      bucket size distributions
      -------------------------
      size       actual         theory (uniform random distribution)
      ----       ------         ------
         0       368447       368440.3
         1       367924       367872.9
         2       183671       183653.0
         3        60895        61123.3
         4        15394        15257.2
         5         3095         3046.7
         6          470          507.0
         7           76           72.3
         8           11            9.0
         9                         1.0
        10                         0.1

This display shows the size of the table, the number of buckets, the number of non-empty buckets, the max bucket size, the load factor λ [= (table size)/(number of buckets)], the expected search time, and the actual average search time [= 1 + (table size)/(number of non-empty buckets)]. A tabular printout of the bucket size distribution follows, showing the bucket size, the actual number of buckets of that size, and the expected number of buckets of that size for simple uniform hashing. The table print terminates at bucket size n when there are no buckets of size > n and the theoretical number is < 0.05. Display the theoretical numbers to the nearest tenth as depicted above.
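
The summary quantities can be gathered in a single pass over the bucket sizes. The following is a hedged sketch, not the algorithm from the notes: the BucketStats record, its field names, and the summarize function are hypothetical, and the input is simply a vector holding the size of each bucket.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical summary record; field names are illustrative only.
struct BucketStats
{
  std::size_t tableSize = 0;           // total number of stored elements
  std::size_t numBuckets = 0;
  std::size_t nonEmpty = 0;            // buckets holding at least one element
  std::size_t maxBucketSize = 0;
  double loadFactor = 0.0;             // (table size)/(number of buckets)
  double actualSearchTime = 0.0;       // 1 + (table size)/(non-empty buckets)
  std::vector<std::size_t> sizeCount;  // sizeCount[k] = number of buckets of size k
};

BucketStats summarize(const std::vector<std::size_t>& bucketSizes)
{
  BucketStats s;
  s.numBuckets = bucketSizes.size();
  for (std::size_t k : bucketSizes)
  {
    s.tableSize += k;
    if (k > 0) ++s.nonEmpty;
    if (k > s.maxBucketSize) s.maxBucketSize = k;
  }
  // actual distribution: one counter per bucket size 0..maxBucketSize
  s.sizeCount.assign(s.maxBucketSize + 1, 0);
  for (std::size_t k : bucketSizes) ++s.sizeCount[k];
  // both ratios are left at 0 when undefined (an assumption of this sketch)
  if (s.numBuckets > 0)
    s.loadFactor = static_cast<double>(s.tableSize) / static_cast<double>(s.numBuckets);
  if (s.nonEmpty > 0)
    s.actualSearchTime = 1.0 + static_cast<double>(s.tableSize) / static_cast<double>(s.nonEmpty);
  return s;
}
```

For bucket sizes {0, 2, 1, 0, 3} this gives a table size of 6, 5 buckets, 3 non-empty buckets, a load factor of 1.2, and an actual search time of 1 + 6/3 = 3.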

Algorithm for MaxBucketSize and Analysis. Use the algorithms developed in the notes (see the Course Organizer).
Thoroughly test your implementation for correct functionality using the provided test clients fhtbl.cpp and hasheval.cpp using a variety of tables you create with rantable.cpp. Be sure to test using variations:
1. Tables of various sizes, small to very large (at least 1,000,000)
2. Varieties of hash functions (four are provided: KISS, ModP, MM, and Simple)
3. Load factor λ = n/b = ratio of table size to (approximate) number of buckets (0.1, 1.0, 10.0, 100.0 are suggested)
4. Prime / nonprime number of buckets
The test harnesses fhtbl.cpp and hasheval.cpp are easily changed via comment/uncomment of typedef blocks to accommodate the variations in hash functions. The prime/non-prime number of buckets is a constructor argument (default value "true", meaning a prime number of buckets). Note that prime is set to false for the Simple hash function.
Write a short summary giving your experience and lessons learned during the testing of variations as above. Turn this in as part of your test diary in log.txt.

Code Requirements and Specifications - HashTable upgrade (in aa.h)

There are two upgrades. The first is to add a private class variable
size_t size_;
whose purpose is to maintain the number of elements currently stored in the table. This addition then requires that it be maintained appropriately by the various non-const methods and also be used appropriately by all const methods. It will also need to be taken into account by constructors, the copy constructor, and the assignment operator. This upgrade will make it simpler to "load" files of unknown (but large) size by client programs.
The Check method should calculate the implicit size of the table by adding up the sizes of all of the buckets and compare the result with the size_ variable, returning true iff these are the same value.
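
The bookkeeping can be illustrated with a toy class. This is a sketch under stated assumptions, not the fsu::HashTable code: ToyTable, its int keys, its use of std::hash, and every member name other than size_ and Check are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <vector>

// Toy stand-in for a hash table stored as a vector of lists.
// numBuckets is assumed to be greater than zero.
class ToyTable
{
public:
  explicit ToyTable(std::size_t numBuckets) : buckets_(numBuckets), size_(0) {}

  void Insert(int key)
  {
    std::list<int>& bucket = buckets_[Index(key)];
    for (int k : bucket)
      if (k == key) return;          // key already present: size_ unchanged
    bucket.push_back(key);
    ++size_;                         // a new element was stored
  }

  bool Remove(int key)
  {
    std::list<int>& bucket = buckets_[Index(key)];
    for (auto i = bucket.begin(); i != bucket.end(); ++i)
      if (*i == key) { bucket.erase(i); --size_; return true; }
    return false;                    // nothing removed: size_ unchanged
  }

  std::size_t Size() const { return size_; }  // no scan of the buckets needed

  // true iff the stored count equals the sum of the bucket sizes
  bool Check() const
  {
    std::size_t total = 0;
    for (const auto& bucket : buckets_) total += bucket.size();
    return total == size_;
  }

private:
  std::size_t Index(int key) const { return std::hash<int>()(key) % buckets_.size(); }
  std::vector<std::list<int>> buckets_;
  std::size_t size_;
};
```

The point of the sketch is that size_ changes only where an element is actually added or removed, so Check can detect any method that forgets to update it.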

The second upgrade is to supply a default argument for the HashClass template parameter that will default to the KISS hash class for the KeyType. This upgrade will simplify declarations when a hash table is used in a client program: client programs should be able to declare tables like this:
    fsu::HashTable<KeyType,DataType> ht0;
    fsu::HashTable<KeyType,DataType> ht1(numbuckets);
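
A sketch of how default arguments make such declarations legal. Everything here is hypothetical and lives in its own namespace: DefaultHash stands in for the KISS hash class, the default of 100 buckets is a placeholder value, and the prime flag is only stored rather than used to adjust the bucket count.

```cpp
#include <cstddef>
#include <functional>

namespace sketch
{
  // Stand-in for a per-key hash class (not the course's KISS class).
  template <typename K>
  struct DefaultHash
  {
    std::size_t operator()(const K& k) const { return std::hash<K>()(k); }
  };

  // The third template parameter and both constructor arguments have defaults,
  // so clients may omit the hash class, the bucket count, and the prime flag.
  template <typename K, typename D, class H = DefaultHash<K>>
  class HashTable
  {
  public:
    explicit HashTable(std::size_t numBuckets = 100, bool prime = true)
      : numBuckets_(numBuckets), prime_(prime), hash_() {}

    std::size_t NumBuckets() const { return numBuckets_; }
    bool Prime() const { return prime_; }
    std::size_t Index(const K& k) const { return hash_(k) % numBuckets_; }

  private:
    std::size_t numBuckets_;
    bool prime_;
    H hash_;
  };
}
```

With these defaults, both sketch::HashTable<int,int> ht0; and sketch::HashTable<int,int> ht1(50); compile, while a client that wants a different hash class can still name it explicitly as the third template argument.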

Code Requirements and Specifications - Makefile

The makefile should have targets for all test executables:

    fhtbl.x
    fhtblKISS.x
    fhtblMM.x
    fhtblModP.x
    fhtblSimple.x
    hasheval.x
    hashevalKISS.x
    hashevalMM.x
    hashevalModP.x
    hashevalSimple.x
    rantable.x
    hashcalc.x

(presumably all of the executables you are using for testing).

There should be two collective targets:
    std: fhtbl.x hasheval.x
    all: std \
         fhtblKISS.x fhtblMM.x fhtblModP.x fhtblSimple.x \
         hashevalKISS.x hashevalMM.x hashevalModP.x hashevalSimple.x \
         rantable.x hashcalc.x
We will use only your default target std in our assessment test builds, and you can assume fhtbl.cpp and hasheval.cpp will be in your portfolio directory. (All executables we use for our testing are built with our assessment makefile.)

General "identical output" requirement: The output of all the test clients compiled with your code should be identical with the output from the corresponding area51 benchmark programs.

Hints

Use Hash Analysis Proposition 3 as a check for internal consistency during Analysis.
In calculating the theoretical distribution, you can restrict the size of the vector to be the same as that of the vector storing the actual distribution. There may be a few extra entries that need calculating for the display, but these can be done iteratively. This will save a huge amount of storage space, most of which would hold very small numbers (or zero).
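
One possible reading of this hint, sketched under the assumption that simple uniform hashing places each of n keys independently into one of b buckets with probability p = 1/b, so that the expected number of buckets of size k is b·C(n,k)·p^k·(1-p)^(n-k). The notes may formulate this differently; the function names theoretical_counts and display_counts are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Expected number of buckets holding exactly k of n keys, for k = 0..maxK,
// under the binomial model described above (an assumption of this sketch).
std::vector<double> theoretical_counts(std::size_t n, std::size_t b, std::size_t maxK)
{
  assert(b > 1);
  std::vector<double> expected(maxK + 1, 0.0);
  const double p = 1.0 / static_cast<double>(b);
  // k = 0 term is b * (1 - p)^n, computed through logarithms for large n
  expected[0] = static_cast<double>(b) * std::exp(static_cast<double>(n) * std::log1p(-p));
  // each later term follows from the previous one:
  //   E[k+1] = E[k] * (n - k)/(k + 1) * p/(1 - p)
  const double ratio = p / (1.0 - p);
  for (std::size_t k = 0; k < maxK && k < n; ++k)
    expected[k + 1] = expected[k] * static_cast<double>(n - k) / static_cast<double>(k + 1) * ratio;
  return expected;
}

// Same values, extended one entry at a time past the largest actual bucket
// size until the next entry would fall below the 0.05 display cutoff.
std::vector<double> display_counts(std::size_t n, std::size_t b, std::size_t maxActual)
{
  std::vector<double> e = theoretical_counts(n, b, maxActual);
  const double ratio = 1.0 / (static_cast<double>(b) - 1.0);  // equals p/(1 - p)
  for (std::size_t k = maxActual; k < n; ++k)
  {
    const double next = e[k] * static_cast<double>(n - k) / static_cast<double>(k + 1) * ratio;
    if (next < 0.05) break;
    e.push_back(next);
  }
  return e;
}
```

Only maxActual + 1 entries plus the few displayed extras are ever stored, which is the storage saving the hint describes. For extremely large load factors the k = 0 term can underflow to zero in double precision, in which case this simple recurrence would need a different starting point.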

These bits of code show the formatting of the table columns:

    // declare column widths
    int width0 = 10, width1 = 13, width2 = 15;
    int startheader = width0 - 4; // output ' ' at this width preceding table
    ...
    // details header
    os << std::setw(startheader) << ' ' << "bucket size distributions" << '\n'
       << std::setw(startheader) << ' ' << "-------------------------" << '\n';
    os << std::setw(width0) << "size"
       << std::setw(width1) << "actual"
       << std::setw(width2) << "theory"
       << " (uniform random distribution)\n";
    os << std::setw(width0) << "----"
       << std::setw(width1) << "------"
       << std::setw(width2) << "------" << '\n';
    ...

which might be difficult to come up with by just observing output from the area51 benchmarks.
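
The snippet above covers only the header. A hedged sketch of how the data rows might be written with the same column widths follows; print_rows and its parameters are hypothetical, and leaving the "actual" column blank for sizes beyond the largest actual bucket is an assumption based on the sample display. Always compare against the area51 benchmarks for the exact layout.

```cpp
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <vector>

// Writes one row per entry of theory; actual[k] is printed while it exists,
// otherwise the "actual" column is filled with blanks.
void print_rows(std::ostream& os,
                const std::vector<std::size_t>& actual,
                const std::vector<double>& theory)
{
  const int width0 = 10, width1 = 13, width2 = 15;
  // remember the caller's formatting so it can be restored afterwards
  const std::ios_base::fmtflags savedFlags = os.flags();
  const std::streamsize savedPrecision = os.precision();
  os << std::fixed << std::setprecision(1);  // theory to the nearest tenth
  for (std::size_t k = 0; k < theory.size(); ++k)
  {
    os << std::setw(width0) << k;
    if (k < actual.size())
      os << std::setw(width1) << actual[k];
    else
      os << std::setw(width1) << ' ';
    os << std::setw(width2) << theory[k] << '\n';
  }
  os.flags(savedFlags);
  os.precision(savedPrecision);
}
```

Restoring the flags and precision keeps the function from changing how the caller's later output is formatted.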