Project 1: Hash Tables

Hash Table Upgrades & Analysis

Version 05/24/19

Educational Objectives: After completing this assignment, the student should be able to accomplish the following:

  1. Describe and explain in detail the concept of a hash table
  2. Implement hash tables as a vector of lists
  3. Define and implement the ADT Table using a private hash table structure
  4. Define and implement the ADT AssociativeArray using a private hash table structure
  5. Define and implement a bidirectional iterator class for Table and AssociativeArray
  6. Calculate the theoretical bucket size distribution for Simple Uniform Hashing and a given table size
  7. Calculate the actual bucket size distribution for a given instance of a hash table
  8. Add methods to fsu::HashTable performing these calculations

Operational Objectives:

  1. Implement two methods in the class template HashTable<K,D,H>:
    size_t  HashTable<K,D,H>::MaxBucketSize () const;
    void    HashTable<K,D,H>::Analysis      (std::ostream& os) const;
    
    conforming to the requirements and specifications given below.
  2. Revise the HashTable class by adding a private class variable size_ where the number of elements in the table is maintained, and revise all class methods appropriately to accommodate this new class variable. Also add default arguments for the template parameter HashType and constructor argument Prime.

Background Knowledge Requirements: Before starting software development you should study and be familiar with the following:

  1. The lecture notes on Hash Tables
  2. The distributed code implementing fsu::HashTable in LIB/tcpp/hashtbl.h
  3. The supplemental notes on Hash Table Analysis

Deliverables: Four files:

aa.h         # contains revised implementation of HashTable class template
hashtbl.cpp  # contains implementations of MaxBucketSize and Analysis
makefile     # builds 12 executables (5 hasheval*.x, 5 fhtbl*.x, plus rantable.x and hashcalc.x)
log.txt      # your experience log

Note that hashtbl.cpp is a slave file to both hashtbl.h and aa.h. The upgrade of HashTable in aa.h has the same API. The upgrades in aa.h are intended to make the class simpler to use in applications.

Be sure that your log.txt contains the date/time of each work session and a brief description of the activity during that session. There should be a testing diary in the log. The log should end with a brief discussion of your experience and knowledge gained testing various hash functions and load factors.

Procedural Requirements

  1. The official development/testing/assessment environment is specified in the Course Organizer. Code should compile without warnings or errors.

  2. In order not to confuse the submit system, create and work within a separate subdirectory cop4531/proj1.

  3. Maintain your work log in the text file log.txt as documentation of effort, testing results, and development history. This file may also be used to report on any relevant issues encountered during project development.

  4. Begin by copying all of the files in the course project directory into yours, along with a few others that will be helpful:

    LIB/tests/fhtbl.cpp             # test harness for hash tables
    LIB/tests/hashcalc.cpp          # calculates hash values interactively
    LIB/tests/hasheval.cpp          # test focusing specifically on Analysis
    LIB/tests/rantable.cpp          # creates random  table data 
    LIB/area51/fhtblKISS_i.x        # linprog/Intel/Linux executables
    LIB/area51/fhtblModP_i.x        # ...
    LIB/area51/fhtblMM_i.x
    LIB/area51/fhtblSimple_i.x
    LIB/area51/rantable_i.x
    LIB/area51/hashcalc_i.x
    LIB/area51/hashevalKISS_i.x
    LIB/area51/hashevalModP_i.x
    LIB/area51/hashevalMM_i.x
    LIB/area51/hashevalSimple_i.x
    

    The executables in area51 are distributed only for your information and experimentation.

  5. Next copy two files onto the names of your deliverables:

    cp LIB/tcpp/hashtbl.h     aa.h
    cp LIB/tcpp/hashtbl.cpp   .
    

    This gives the starting points for both aa.h and hashtbl.cpp.

  6. Code the deliverables by modifying the two new files:

    These can be done in either order. The names and #included slave files in aa.h remain unchanged. The file hashtbl.cpp is used by both aa.h and hashtbl.h interchangeably.

  7. Be sure that you have established the current submit script LIB/scripts/submit.sh as a command in your ~/.bin directory. The current one is version 2.0.

    Warning: Submit scripts do not work on the program and linprog servers. Use shell.cs.fsu.edu or quake.cs.fsu.edu to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.

Code Requirements and Specifications - MaxBucketSize and Analysis

  1. MaxBucketSize should return the size of the largest bucket in the hash table instance.

  2. Analysis should result in a display (to the std::ostream passed in) as illustrated here:

          table size:           998442
          number of buckets:    999983
          nonempty buckets:     631536
          max bucket size:      8
          load factor:          1.00
          expected search time: 2.00
          actual search time:   2.58
    
          bucket size distributions
          -------------------------
          size       actual         theory (uniform random distribution)
          ----       ------         ------
             0       368447       368440.3
             1       367924       367872.9
             2       183671       183653.0
             3        60895        61123.3
             4        15394        15257.2
             5         3095         3046.7
             6          470          507.0
             7           76           72.3
             8           11            9.0
             9                         1.0
            10                         0.1
    

    This display shows the size of the table, number of buckets, number of non-empty buckets, max bucket size, load factor λ [= (table size)/(number of buckets)], expected search time [1 + λ in the sample above], and actual average search time [= 1 + (table size)/(number of non-empty buckets)]. A tabular printout of the bucket size distribution follows, showing the bucket size, the actual number of buckets of that size, and the expected number for simple uniform hashing. The table print terminates for bucket size n when there are no buckets of size > n and the theoretical count is < 0.05. Display the theoretical counts to the nearest tenth as depicted above.

  3. Algorithm for MaxBucketSize and Analysis. Use the algorithms developed in the notes (see the Course Organizer).

  4. Thoroughly test your implementation for correct functionality with the provided test clients fhtbl.cpp and hasheval.cpp, using a variety of tables you create with rantable.cpp. Be sure to test these variations:

    1. Tables of various sizes, small to very large (at least 1,000,000)
    2. Varieties of hash functions (four are provided: KISS, ModP, MM, and Simple)
    3. Load factor λ = n/b = ratio of table size to (approximate) number of buckets (0.1, 1.0, 10.0, 100.0 are suggested)
    4. Prime / nonprime number of buckets

    The test harnesses fhtbl.cpp and hasheval.cpp are easily changed via comment/uncomment of typedef blocks to accommodate the variations in hash functions. The prime/non-prime number of buckets is a constructor argument (default value "true", meaning a prime number of buckets). Note that prime is set to false for the Simple hash function.

  5. Write a short summary giving your experience and lessons learned during the testing of variations as above. Turn this in as part of your test diary in log.txt.
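
The quantities in the Analysis display can be illustrated with a minimal, self-contained sketch. It is not the course's algorithm (use the one from the notes). The names BucketVector, BucketStats, Collect, and ExpectedBuckets are hypothetical, and a plain std::vector of std::list<int> stands in for the table's private structure. The theoretical column is modeled here as a binomial count under simple uniform hashing, which is an assumption and may differ from the notes in detail. The expected search time is taken as 1 + λ, which agrees with the sample display.

```cpp
#include <cmath>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical stand-in for the hash table's private vector of buckets.
using BucketVector = std::vector<std::list<int>>;

// Hypothetical summary of one table instance.
struct BucketStats
{
  std::size_t tableSize = 0;        // total number of stored elements
  std::size_t numBuckets = 0;       // number of buckets
  std::size_t nonEmpty = 0;         // number of buckets holding at least one element
  std::size_t maxBucketSize = 0;    // size of the largest bucket
  std::vector<std::size_t> actual;  // actual[k] = number of buckets of size k
};

// One pass over the buckets gathers every "actual" quantity in the display.
BucketStats Collect(const BucketVector& buckets)
{
  BucketStats s;
  s.numBuckets = buckets.size();
  for (const auto& b : buckets)
  {
    const std::size_t k = b.size();
    s.tableSize += k;
    if (k > 0) ++s.nonEmpty;
    if (k > s.maxBucketSize) s.maxBucketSize = k;
    if (k >= s.actual.size()) s.actual.resize(k + 1, 0);
    ++s.actual[k];
  }
  return s;
}

// Load factor = (table size)/(number of buckets); 0 for a table with no buckets.
double LoadFactor(const BucketStats& s)
{
  return s.numBuckets == 0 ? 0.0 : static_cast<double>(s.tableSize) / s.numBuckets;
}

// Assumption: expected search time = 1 + load factor (matches the sample display).
double ExpectedSearchTime(const BucketStats& s) { return 1.0 + LoadFactor(s); }

// Actual search time = 1 + (table size)/(number of non-empty buckets).
// Returning 1 for an empty table is a choice made for this sketch only.
double ActualSearchTime(const BucketStats& s)
{
  return s.nonEmpty == 0 ? 1.0 : 1.0 + static_cast<double>(s.tableSize) / s.nonEmpty;
}

// Expected number of buckets of size k when n keys are hashed uniformly and
// independently into b buckets: b * C(n,k) * p^k * (1-p)^(n-k), with p = 1/b.
// Evaluated in log space so large n does not overflow the binomial coefficient.
double ExpectedBuckets(std::size_t n, std::size_t b, std::size_t k)
{
  if (b == 0 || k > n) return 0.0;
  if (b == 1) return (k == n) ? 1.0 : 0.0;  // a single bucket holds everything
  const double p = 1.0 / b;
  const double logProb = std::lgamma(n + 1.0) - std::lgamma(k + 1.0)
                       - std::lgamma(n - k + 1.0)
                       + k * std::log(p) + (n - k) * std::log1p(-p);
  return b * std::exp(logProb);
}
```

For example, a four-bucket table with bucket sizes 3, 0, 1, 0 has table size 4, two non-empty buckets, max bucket size 3, load factor 1.0, and actual search time 1 + 4/2 = 3.0. Collecting everything in a single pass keeps the cost linear in the number of buckets.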

Code Requirements and Specifications - HashTable upgrade (in aa.h)

  1. There are two upgrades. The first is to add a private class variable

    size_t size_;
    

    whose purpose is to maintain the number of elements currently stored in the table. This addition then requires that it be maintained appropriately by the various non-const methods and also be used appropriately by all const methods. It will also need to be taken into account by constructors, the copy constructor, and the assignment operator. This upgrade will make it simpler to "load" files of unknown (but large) size by client programs.

  2. The Check method should calculate the implicit size of the table by adding up the sizes of all of the buckets and compare the result with the size_ variable, returning true iff these are the same value.

  3. The second upgrade is to supply a default argument for the HashClass template parameter that will default to the KISS hash class for the KeyType. This upgrade will simplify declarations when a hash table is used in a client program: client programs should be able to declare tables like this:

    fsu::HashTable<KeyType,DataType> ht0;
    fsu::HashTable<KeyType,DataType> ht1(numbuckets);
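
The two upgrades can be pictured with a small, hypothetical class. MiniTable below is not fsu::HashTable: its method names, the default bucket count of 101, and the use of std::hash as the default hash class (standing in for the KISS hash class) are all assumptions made for illustration. It shows a size_ member kept in step by every mutating method, a Check method that compares size_ with the sum of the bucket sizes, and a defaulted third template parameter.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a hash table built as a vector of lists.
// Assumption: std::hash<K> plays the role of the default hash class.
template <typename K, typename D, class H = std::hash<K>>
class MiniTable
{
public:
  // Default argument lets clients write MiniTable<K,D> t; or MiniTable<K,D> t(n);
  explicit MiniTable(std::size_t numBuckets = 101)
    : buckets_(numBuckets ? numBuckets : 1), size_(0)
  {}

  // Insert or overwrite; size_ grows only when a new key is added.
  void Insert(const K& key, const D& data)
  {
    auto& bucket = buckets_[hash_(key) % buckets_.size()];
    for (auto& entry : bucket)
      if (entry.first == key) { entry.second = data; return; }
    bucket.emplace_back(key, data);
    ++size_;
  }

  // Returns true iff the key was present; size_ shrinks only on success.
  bool Remove(const K& key)
  {
    auto& bucket = buckets_[hash_(key) % buckets_.size()];
    for (auto it = bucket.begin(); it != bucket.end(); ++it)
      if (it->first == key) { bucket.erase(it); --size_; return true; }
    return false;
  }

  // Empties every bucket and resets the stored size.
  void Clear()
  {
    for (auto& b : buckets_) b.clear();
    size_ = 0;
  }

  // Constant time, because the count is maintained rather than recomputed.
  std::size_t Size() const { return size_; }

  // True iff the stored size agrees with the sum of the bucket sizes.
  bool Check() const
  {
    std::size_t total = 0;
    for (const auto& b : buckets_) total += b.size();
    return total == size_;
  }

private:
  std::vector<std::list<std::pair<K, D>>> buckets_;
  std::size_t size_;
  H hash_;
};
```

Because size_ is an ordinary data member, the compiler-generated copy constructor and assignment operator of this sketch copy it along with the buckets. A class with hand-written copy operations has to do the same explicitly.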
    

Code Requirements and Specifications - Makefile

  1. The makefile should have targets for all test executables:

    fhtbl.x
    fhtblKISS.x
    fhtblMM.x
    fhtblModP.x
    fhtblSimple.x
    hasheval.x
    hashevalKISS.x
    hashevalMM.x
    hashevalModP.x
    hashevalSimple.x
    rantable.x
    hashcalc.x
    

    (presumably all of the executables you are using for testing).

  2. There should be two collective targets:

    std: fhtbl.x hasheval.x
    all: std \
         fhtblKISS.x fhtblMM.x fhtblModP.x fhtblSimple.x \
         hashevalKISS.x hashevalMM.x hashevalModP.x hashevalSimple.x \
         rantable.x hashcalc.x
    

    We will use only your default target std in our assessment test builds, and you can assume fhtbl.cpp and hasheval.cpp will be in your portfolio directory. (All executables we use for our testing are built with our assessment makefile.)

General "identical output" requirement: The output of all the test clients compiled with your code should be identical with the output from the corresponding area51 benchmark programs.

Hints