Hash Functions

A hash function for type T is a mapping h, defined for all instances t of T, such that h(t) is a non-negative integer. The mapping h need not be one-to-one.

There are two quite distinct categories of application for hash functions in computing, both of the highest use and importance: security, where hash values serve as secure signatures, and fast-access tables (hash tables).

Technically, any hash function can be used in either type of application. But to make the application perform well, carefully selected hash functions must be used. Selecting a hash function that performs well is usually somewhat complicated, and always data dependent.

The performance of a hash function in security is directly related to the enigma attribute of the hash function. Note that a mapping that is not one-to-one is theoretically not invertible: given a hash value v, the equation

  h(t) = v

cannot be solved uniquely for t. But it is still possible to "reverse engineer" a hash function, that is, find specific key values that map to a given hash value, perhaps by guessing or trial-and-error. A hash function that is relatively difficult to reverse engineer has relatively high enigma.

The performance of a hash function in fast-access tables is related to the pseudo-random attribute. A function whose hash values appear relatively highly random with respect to the key values has relatively high pseudo-randomness. We explore these ideas in the following examples.

Example

The slide shows code for a function defined for String objects S.

unsigned int f(const String& S)
{
  unsigned int hval(0), i;
  for (i = 0; i < S.Size(); ++i)
    hval += S[i]- 'a';
  return hval;
}
    /* Sample function values:
    f(a) = 0
    f(b) = 1
    f(abcd) = 0 + 1 + 2 + 3 = 6
    f(badc) = 1 + 0 + 3 + 2 = 6
    f(dcba) = 6
    if S1 = permutation of S2 then f(S1) == f(S2) */
    

The return value for input S is the sum of the character values of the elements of S, normalized so that 'a' has value zero. This code defines a mapping f on type T = String that has non-negative integer values and is not one-to-one (because, for example, f(ab) = f(ba) while ab != ba). Therefore, f is indeed a hash function for type T.

This hash function is not very good for either fast access or security, however. Pseudo-randomness is very poor, because related String objects get mapped to the same hash value. (In fact, any permutation of string S maps to the same hash value as S.) Enigma is not good either, because the hash value gives easily derived hints and constraints on the key value. For example, if keys consist of lowercase letters, the hash value 6 cannot be the value of any key with more than six letters other than 'a'.
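The permutation collisions described above can be checked directly. The following is a minimal sketch that assumes std::string in place of the course's String class; the name additive_hash is chosen here only for illustration.

```cpp
#include <cstddef>
#include <string>

// Additive hash from the example, restated for std::string
// (an assumption; the notes use a course-specific String class).
unsigned int additive_hash(const std::string& s)
{
  unsigned int hval = 0;
  for (std::size_t i = 0; i < s.size(); ++i)
    hval += static_cast<unsigned int>(s[i] - 'a'); // 'a' contributes 0
  return hval;
}
```

Because addition is commutative, additive_hash("abcd"), additive_hash("badc"), and additive_hash("dcba") all return 6.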

In principle, though, this is a hash function and hence can be plugged in to any application where hashing of String objects is needed. We will proceed to find hash functions with improved performance.

Improved Example

This slide depicts an upgrade of the algorithm shown in the previous slide.

unsigned int f(const String& S)
{
  unsigned int hval(0), i;
  for (i = 0; i < S.Size(); ++i)
    hval += i * (S[i] - 'a');
  return hval;
}
    /* sample values:
    f(a) = 0
    f(b) = 0
    f(ab) = 1
    f(abcd) = (0 * 0) + (1 * 1) + (2 * 2) + (3 * 3) = 14
    f(badc) = (0 * 1) + (1 * 0) + (2 * 3) + (3 * 2) = 12
    f(dcba) = (0 * 3) + (1 * 2) + (2 * 1) + (3 * 0) = 4   */
    

An attempt has been made to address the low enigma and randomness of the original. The upgraded function sums the character values as before, except weighted by the index, so that the same character value will contribute differentially to the hash value, depending on its location in the string. For example, S[2] = 'f' will contribute 2 x 5 = 10, while S[3] = 'f' will contribute 3 x 5 = 15 to the hash value.

This upgrade still does not have excellent pseudo-randomness or enigma, although the former can be made acceptable by applying a prime divisor (discussed later in the chapter). Much more work would be needed to attain acceptable enigma.
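The sample values for the weighted version can be reproduced with a small sketch, again assuming std::string rather than the course's String class (weighted_hash is an illustrative name). It also exposes one remaining weakness: the character at index 0 is multiplied by zero and so never affects the result.

```cpp
#include <cstddef>
#include <string>

// Index-weighted hash from the improved example, restated for std::string
// (an assumption; the notes use a course-specific String class).
unsigned int weighted_hash(const std::string& s)
{
  unsigned int hval = 0;
  for (std::size_t i = 0; i < s.size(); ++i)
    hval += static_cast<unsigned int>(i)
            * static_cast<unsigned int>(s[i] - 'a'); // weight = position
  return hval;
}
```

Permutations now separate (weighted_hash("abcd") is 14 while weighted_hash("dcba") is 4), but any two strings that differ only in their first character still collide, for example "ab" and "bb".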

Useful Example

This slide depicts a second upgrade that replaces the accumulation operation

  hval += i * (S[i] - 'a');

with a process I have named the Marsaglia Mixer after its discoverer, Professor George Marsaglia of the FSU Department of Statistics. A few values of this hash function are also shown. From this limited evidence, it does appear to have significantly better pseudo-randomness than the previous versions.

unsigned int f(const String& S)
{
  unsigned int i;
  long unsigned int bigval = S.Element(0);
  for (i = 1; i < S.Size(); ++i)
    bigval = ((bigval & 65535) * 18000)
             + (bigval >> 16)
             + S[i];
  bigval = ((bigval & 65535) * 18000) + (bigval >> 16);
  return bigval & 65535;
}
    /* some values:
    f(a)     = 42064
    f(b)     = 60064
    f(abcd)  = 41195
    f(bacd)  = 39909
    f(dcba)  = 29480
    f(x)     = 62848
    f(xx)    = 44448
    f(xxx)   = 15118     
    f(xxxx)  = 28081
    f(xxxxx) = 45865   */
    

In fact, the mixer itself was shown by Marsaglia to be a very good generator of pseudo-random numbers, in principle about as good as can be achieved. (Statisticians use words like "about" without blinking. Nothing is absolute.) The relative pseudo-randomness of the Marsaglia Mixer does depend on the discovery of "magic numbers" such as 18000. Marsaglia uncovered two that he recommends: 18000 and 30103. We use 18000 for our hash functions and 30103 for pseudo-random number generation. (See xran.cpp.)

This hash function also seems to have decent enigma, certainly better than the previous examples, but there has been no formal study of the mixer in this direction. The use of the mixer for secure signatures in this class is for convenience and to make the use of hashing understandable. Neither I nor Marsaglia would recommend the mixer for signatures where real security is needed. We will use the mixer to define secure signatures in the Password Server project and to define hash tables in the Internet Router project. Let us now look at the details of the Marsaglia Mixer algorithm.

Useful Example -- The Algorithm

The first remarkable thing about this code is the use of unsigned long internally, while unsigned int is the return type. The unsigned long type is guaranteed to be at least 32 bits wide, while the unsigned int type is not guaranteed to be more than 16 bits. This hash function is intended to be used only for values in the 16-bit unsigned int range, 0 through 65535.

The mixer uses a 32-bit word space as its "mixing bowl", but returns only a 16-bit word. The 32-bit value bigval is tossed, turned, and folded like dough in the mixing bowl, and the return value is a small spoonful taken from the result.

The ASCII value of the first (index zero) element of the string is used to initialize bigval. Then for each succeeding element S[i], bigval is replaced with

  (low16 * 18000) + high16 + S[i]

where low16 and high16 are the low-order and high-order 16 bits of the current bigval. After all the elements have been inserted into the mix, the mixer algorithm is cranked one more time. Finally, the low-order 16 bits (officially, the truncation of bigval back down to a 16-bit word) are returned.

unsigned int f(const String& S)
{
  unsigned int i;
  long unsigned int bigval = S.Element(0); // S[0] or '\0'
  for (i = 1; i < S.Size(); ++i)
    bigval = ((bigval & 65535) * 18000)    // low16 * magic_number
             + (bigval >> 16)              // high16
             + S[i];
  bigval = ((bigval & 65535) * 18000) + (bigval >> 16);
  // bigval = low16 * magic_number + high16
  return bigval & 65535;
  // return low16
}

This algorithm was developed by Professor George Marsaglia of the FSU Statistics Department. It is one of the best-known pseudo-random number generators. The name "Marsaglia Mixer" is due to RC Lacher.
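For readers without the course's String class, the algorithm can be restated as a hedged sketch using std::string and a fixed 32-bit "mixing bowl". The names crank and mixer_hash are illustrative, and treating characters as unsigned char is an assumption that matches the original for ASCII input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// One turn of the mixer: low16 * magic_number + high16.
std::uint32_t crank(std::uint32_t bigval)
{
  return (bigval & 0xFFFFu) * 18000u + (bigval >> 16);
}

// Mixer-based hash restated for std::string (an assumption; the notes
// use a course-specific String class). An empty string starts from 0.
unsigned int mixer_hash(const std::string& s)
{
  std::uint32_t bigval = s.empty() ? 0u : static_cast<unsigned char>(s[0]);
  for (std::size_t i = 1; i < s.size(); ++i)
    bigval = crank(bigval) + static_cast<unsigned char>(s[i]);
  bigval = crank(bigval);   // one extra turn after the last character
  return bigval & 0xFFFFu;  // keep only the low-order 16 bits
}
```

With this sketch, mixer_hash("a") is 42064 and mixer_hash("xx") is 44448, in agreement with the sample values listed earlier. The intermediate value never exceeds 65535 * 18000 + 65535 + 255, so 32 bits are enough and no overflow occurs.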

Improving and Using Pseudo-Randomness

The pseudo-randomness of any hash function can be increased significantly, often dramatically, by the post-processing trick of dividing by a (fairly large) prime number and taking the remainder as a hash value. (Also, there is likely an increase in enigma, as long as the prime number is kept secret. What is desirable for security, however, is a hash function with such good enigma that the hashing algorithm can be made public without compromising the security of signatures.)

In fast-access table applications of hash functions, this post-processing trick is always recommended and, in this course, always used. An automated version of the Sieve of Eratosthenes makes the discovery of large prime numbers convenient, as you will see when we study hash tables. For hash tables, the hash functions used in this course, consisting of the Marsaglia Mixer followed by large prime division, are world class. The post-processing is simple: if f is the hash function and p is the chosen prime, the final hash value for t is f(t) % p.

The improved hash function has excellent pseudo-randomness for most types T.
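The prime-division step can be sketched as follows. This is an illustration under stated assumptions, not the course's implementation: the helper names largest_prime_at_most and bucket_index are hypothetical, and the prime is assumed to double as the number of slots in a table.

```cpp
#include <cstddef>
#include <vector>

// Largest prime <= n, found with the Sieve of Eratosthenes.
// Returns 0 when n < 2 (no prime exists in that range).
std::size_t largest_prime_at_most(std::size_t n)
{
  if (n < 2) return 0;
  std::vector<bool> composite(n + 1, false);
  for (std::size_t p = 2; p * p <= n; ++p)
    if (!composite[p])
      for (std::size_t m = p * p; m <= n; m += p)
        composite[m] = true;          // cross out multiples of p
  for (std::size_t k = n; k >= 2; --k)
    if (!composite[k]) return k;      // first uncrossed value from the top
  return 0;
}

// Post-processing: divide the raw hash value by the prime, keep the remainder.
std::size_t bucket_index(unsigned long raw_hash, std::size_t prime)
{
  return raw_hash % prime;
}
```

For example, largest_prime_at_most(100) is 97, and the raw value 42064 (the mixer value for "a") reduces to bucket_index(42064, 97), which is 63.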

Improving and Using Enigma

Improving enigma, from the level of the Marsaglia Mixer up to a level appropriate for modern computer security, is a full-time job -- somebody else's job. The National Institute of Standards and Technology (NIST) is a US government agency whose job (one of many) is to set standards and distribute technology for computer security products. The Secure Hash Algorithm (SHA) family of standards, published by NIST, is in wide use. Protocols such as SSH rely on SHA-family hash functions.

Exactly how does a secure signature work? Let's look at passwords. The login process for a computer system is designed to accomplish two things, identification and authentication. A user is identified by the username and authenticated by a password. Authentication is the process of proving to the system that the identity entered by the user is in fact the identity of that user. Authentication gives the system "trust" that you are who you say you are.

Authentication is accomplished by storing the hash value of the user's password. The key value, that is, the clear text of the user's password, is never stored. The system can only check passwords by computing the hash value of the entered password and comparing it with the stored value. If these two hash values agree, the user is authenticated by the system.
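The store-the-hash, compare-the-hash scheme can be sketched in a few lines. Everything named here (PasswordTable, set_password, authenticate, signature) is hypothetical, and the 16-bit mixer is used only as a stand-in signature function; as noted earlier, it is not suitable where real security is needed.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Stand-in signature function: the 16-bit mixer from these notes.
// An assumption for illustration only, not a secure hash.
std::uint32_t signature(const std::string& s)
{
  std::uint32_t v = s.empty() ? 0u : static_cast<unsigned char>(s[0]);
  for (std::size_t i = 1; i < s.size(); ++i)
    v = (v & 0xFFFFu) * 18000u + (v >> 16) + static_cast<unsigned char>(s[i]);
  v = (v & 0xFFFFu) * 18000u + (v >> 16);
  return v & 0xFFFFu;
}

class PasswordTable
{
public:
  // Only the hash value is stored; the clear text is discarded.
  void set_password(const std::string& user, const std::string& clear_text)
  {
    stored_[user] = signature(clear_text);
  }

  // Hash the entered password and compare with the stored hash value.
  bool authenticate(const std::string& user, const std::string& clear_text) const
  {
    auto it = stored_.find(user);
    return it != stored_.end() && it->second == signature(clear_text);
  }

private:
  std::map<std::string, std::uint32_t> stored_; // username -> hash value
};
```

Note that any password hashing to the same value would also be accepted, which is one reason a 16-bit signature is far too small for real use.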


Marsaglia KISS

As architectures have moved to full support of 64-bit words and these sizes:

Size of bool           = 1 bytes

Size of char           = 1 bytes
Size of short          = 2 bytes
Size of int            = 4 bytes
Size of long           = 8 bytes

Size of unsigned char  = 1 bytes
Size of unsigned short = 2 bytes
Size of unsigned int   = 4 bytes
Size of unsigned long  = 8 bytes

Size of float          = 4 bytes
Size of double         = 8 bytes
Size of long double    = 16 bytes

the need for 64-bit hash values has grown. The following is an upgrade of the Marsaglia Mixer (MM) to 64 bits:

  unsigned long Crank(unsigned long word)
  { 
    // requires sizeof(unsigned long) >= 8
    word ^= (word << 13); word ^= (word >> 17); word ^= (word << 5); 
    word = 69069*word + 12345;
    return word; // hand the mixed word back to the caller
  }
  unsigned long KISS (const char* S)
  {
    size_t i, length;
    unsigned long bigval = 0;
    if (S)
    {
      length = strlen(S);
      bigval = S[0];
      for (i = 1; i < length; ++i)
        bigval = Crank(bigval) + S[i];
      bigval = Crank(bigval);
    }
    return bigval;
  }
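The requirement sizeof(unsigned long) >= 8 does not hold everywhere; on 64-bit Windows, for instance, unsigned long is 4 bytes. A hedged variant that uses std::uint64_t removes that dependency. The names crank64 and kiss_hash are illustrative, and reading characters as unsigned char is an assumption that agrees with the code above for ASCII input.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Same mixing step as Crank, with an explicit 64-bit word.
std::uint64_t crank64(std::uint64_t word)
{
  word ^= (word << 13);
  word ^= (word >> 17);
  word ^= (word << 5);
  return 69069u * word + 12345u; // arithmetic wraps modulo 2^64
}

// Same structure as KISS: seed with the first character, crank once per
// remaining character, then crank one final time. Null input hashes to 0.
std::uint64_t kiss_hash(const char* s)
{
  std::uint64_t bigval = 0;
  if (s)
  {
    const std::size_t length = std::strlen(s);
    bigval = static_cast<unsigned char>(s[0]);
    for (std::size_t i = 1; i < length; ++i)
      bigval = crank64(bigval) + static_cast<unsigned char>(s[i]);
    bigval = crank64(bigval);
  }
  return bigval;
}
```

As a quick check, the empty string starts from 0 and is cranked once, so kiss_hash("") is 12345.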

Producing fast hash functions with good pseudo-randomness is important for applications such as hash tables as well as esoteric areas such as experimental high-energy physics. Producing hash functions with over-the-top excellent enigma is important for information security.

Both of these areas are rich and active areas of research, in government (NIST primarily interested in security, National Labs primarily interested in randomness), universities, and the private sector.