Hash Functions

Good hash functions should aim to satisfy the assumption of simple uniform hashing (SUH): each key is equally likely to hash into any of the m slots. If keys k are drawn from U with probability P(k), SUH states that

$$\sum_{k \,:\, h(k) = j} P(k) = \frac{1}{m}, \qquad j = 0, \ldots, m-1,$$

so all hash values are equally likely.

This is hard to use directly, since the P(k) are usually not known, even approximately. If the keys k are independent and uniformly distributed on [0, 1), then $h(k) = \lfloor k m \rfloor$ (m fixed) satisfies SUH.
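As a rough sanity check of the claim about uniform keys, the sketch below draws independent keys uniformly from [0, 1) and tallies how often each slot is hit. The names `floor_hash` and `slot_counts`, the seed, and the sample size are illustrative assumptions, not part of the notes.

```python
import random
from collections import Counter

def floor_hash(k, m):
    """h(k) = floor(k * m) for a key k in [0, 1)."""
    return int(k * m)  # int() truncates, which equals floor for k >= 0

def slot_counts(n, m, seed=0):
    """Hash n independent uniform keys and count the hits per slot."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    counts = Counter(floor_hash(rng.random(), m) for _ in range(n))
    return [counts[j] for j in range(m)]

counts = slot_counts(n=100_000, m=10)
print(counts)  # each entry should be close to n / m = 10000
```

The counts only illustrate SUH for this one distribution; for keys with an unknown or skewed P(k) the same h can behave badly.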

Hash functions are usually chosen to be as "independent" as possible of any patterns that may exist in the k's. Note: SUH is powerful, but often hash functions need to do "better than" just SUH. Sometimes "mixing" and "separation" are required, so that two close keys k₁ and k₂ are far apart in their hash values.
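To get a feel for "separation", the sketch below hashes two keys that differ in a single character and counts how many digest bits differ. SHA-256 from Python's `hashlib` is used purely as a stand-in (the notes do not name a particular hash function), and the two keys are made up for illustration.

```python
import hashlib

def digest_bits(data):
    """Return the SHA-256 digest of data as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def hamming_distance(a, b):
    """Count the digest bits that differ between two inputs."""
    return bin(digest_bits(a) ^ digest_bits(b)).count("1")

k1 = b"key-0001"
k2 = b"key-0002"   # differs from k1 in the last character only
print(hamming_distance(k1, k2))  # typically around half of the 256 bits
```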

Aside: Why would we want such a "separating" hash function? One reason is cryptographic verification of signatures.

My Will file → hash → digest → sign → certificate

What if my brother could change my will slightly (add the word "not") and not change the digest? Bad news for me!

Even though keys may not be integers, let's consider them to be integers.
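One possible way to treat a non-integer key as an integer is to read its bytes as the digits of a radix-256 number. This is only a sketch of one convention; the helper name `key_to_int` and the choice of UTF-8 and big-endian order are assumptions, not something the notes prescribe.

```python
def key_to_int(key):
    """Interpret the UTF-8 bytes of key as one big-endian integer (radix 256)."""
    return int.from_bytes(key.encode("utf-8"), "big")

print(key_to_int("ab"))  # 97 * 256 + 98 = 24930
```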

Division Method

$$h(k) = k \bmod m$$

What are good choices for m?

$m \neq 2^p$

$m \neq 2^p \pm c$ (close to a power of 2)

$m \neq 10^p$ (especially with decimal keys)

m = prime is a good choice.

Note: If your U is well known, you could try to optimize m experimentally. The scheme h(k) = k mod m is called the division method since

$$k \bmod m = k - m \left\lfloor \frac{k}{m} \right\rfloor.$$
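The following sketch writes the division method out exactly as in the identity above and shows why a power of two is a poor modulus: only the low bits of the key survive. The sample keys and the prime 17 are arbitrary illustrative choices.

```python
def division_hash(k, m):
    """h(k) = k mod m, written out as k - m * floor(k / m)."""
    return k - m * (k // m)

# With m = 2**4 the hash is just the low 4 bits of the key, so keys that
# differ only in their higher bits all collide.
keys = [0x13, 0x23, 0xF3]
print([division_hash(k, 16) for k in keys])  # [3, 3, 3]

# A prime modulus (17 here) depends on all the bits of the key.
print([division_hash(k, 17) for k in keys])  # [2, 1, 5]
```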

Multiplication Method

$$h(k) = \lfloor m \, (k A \bmod 1) \rfloor$$

A is a real number 0 < A < 1

m is usually an integer power of two, m = 2^p

$$k A \bmod 1 = k A - \lfloor k A \rfloor$$

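Example: m = 2^p, with w-bit keys k.

$$k \times \lfloor A \, 2^w \rfloor = 2^w r_1 + r_0$$

$$2^p \left( 2^w r_1 + r_0 \right) = 2^{w+p} r_1 + 2^p r_0$$

The p m.s.b.s of r₀ are h(k).

[Figure: the w-bit key k is multiplied by the w-bit value ⌊A 2^w⌋, giving a product with high word r₁ and low word r₀; the top p bits of r₀ form h(k).]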

Multiplying by 2^p shifts r₀ left by p bits, which creates an integer out of the p m.s.b.s of r₀.

What choices of A are best? Answer: A should be irrational. Which irrationals are the "most irrational"? Answer: those of the form

$$A = a + \cfrac{1}{b + \cfrac{1}{c + \cfrac{1}{c + \cdots}}}$$

with repeating continued fractions (solutions to quadratic equations).
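The shift-based example above can be sketched in a few lines. The word size w = 32 and the constant A = (√5 − 1)/2 ≈ 0.618 (whose continued fraction consists entirely of ones) are assumptions chosen for illustration; the notes do not fix either value, and the names `W`, `S`, and `multiplication_hash` are hypothetical.

```python
W = 32          # assumed word size in bits; keys must fit in W bits
S = 2654435769  # floor(A * 2**W) for the assumed A = (sqrt(5) - 1) / 2

def multiplication_hash(k, p):
    """Return the p most significant bits of r0, where k * S = 2**W * r1 + r0."""
    r0 = (k * S) % (1 << W)   # low word of the 2W-bit product
    return r0 >> (W - p)      # keep its top p bits, giving a slot in [0, 2**p)

print(multiplication_hash(123456, 14))  # 67
```

Here 123456 × 2654435769 = 76300 · 2^32 + 17612864, so r₀ = 17612864 and its top 14 bits give 67.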

Universal Hashing:

With a fixed hash function there are always keys that hash poorly. We previously fixed bad worst-case behavior by randomizing to obtain average-case behavior, so: choose your hash function randomly! This is called universal hashing. To do this we must construct a family of hash functions to choose from. Let H be such a family, where each $h \in H$ is a map $h : U \to \{0, \ldots, m-1\}$.
H is universal if, for each pair of distinct keys $x, y \in U$, the number of $h \in H$ such that h(x) = h(y) is |H|/m. This means that a randomly chosen $h \in H$ gives h(x) = h(y) (a collision) with probability 1/m. So on average (with regard to the functions in H) we get SUH.

Theorem: Let h be chosen at random from a universal family H, and use it to hash n keys into a table T of size m, with n ≤ m. Then the expected number of collisions for a key x is less than one.
Proof: For keys y ≠ z, let $c_{yz} = 1$ if h(y) = h(z), and 0 otherwise. Then
$E[c_{yz}] = 1/m$ (because h was chosen randomly from H).
Let $c_x$ = total number of collisions with x in T of size m with n keys. Then

$$E[c_x] = \sum_{y \in T,\; y \neq x} E[c_{xy}] = (n-1) \cdot \frac{1}{m} < 1,$$

where the sum runs over the n − 1 keys y ≠ x, and (n − 1)(1/m) < 1 since n ≤ m.
How can we design a universal class H?

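Example: |T| = m, m prime. Write each key as a sequence of bytes x = <x₀, x₁, x₂, ..., x_r>, where the maximum byte value is less than m.

Let a = <a₀, a₁, a₂, ..., a_r>, with each a_i randomly chosen from {0, 1, 2, ..., m-1}, and define

$$h_a(x) = \sum_{i=0}^{r} a_i x_i \pmod{m}.$$

$H = \bigcup_a \{h_a\}$ has $m^{r+1}$ members.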
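A minimal sketch of drawing one member of this family at random might look as follows. The factory name `make_hash`, the prime 257 (the smallest prime above every byte value), and the seed are illustrative assumptions.

```python
import random

def make_hash(m, r, rng):
    """Pick a = <a_0, ..., a_r> uniformly from {0, ..., m-1} and return h_a."""
    a = [rng.randrange(m) for _ in range(r + 1)]

    def h(x):
        # x must have exactly r + 1 components, each smaller than m
        assert len(x) == r + 1
        return sum(ai * xi for ai, xi in zip(a, x)) % m

    return h

M = 257                                   # a prime larger than any byte value
h = make_hash(M, r=3, rng=random.Random(1))
print(h(b"hash"))                         # some slot in range(257)
```

Choosing the coefficients once, when the table is created, and then reusing the same h for every key is what "choose your hash function randomly" amounts to in practice.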
Theorem: The class H is a universal class.
Proof: Consider distinct keys x and y; we can assume x₀ ≠ y₀. With a₁, ..., a_r given, writing down h(x) = h(y) and solving for a₀ gives

$$a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m},$$

which has only one a₀ that solves it. Since m is prime, x₀ − y₀ has an inverse (mod m), and

$$a_0 \equiv -\left( \sum_{i=1}^{r} a_i (x_i - y_i) \right) (x_0 - y_0)^{-1} \pmod{m}.$$

This means that for each choice of <a₁, ..., a_r> exactly one a₀ causes a collision. There are thus m^r vectors a for which x and y collide, one for each of the m^r choices of <a₁, a₂, a₃, ..., a_r>.
Since there are m^{r+1} vectors <a₀, a₁, a₂, ..., a_r> in total, x and y collide with probability m^r / m^{r+1} = 1/m ⇒ H is universal.
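The counting argument can be checked by brute force on a tiny instance. The sketch below enumerates every coefficient vector for m = 5 and r = 2 and counts those under which two fixed keys collide; the particular keys are arbitrary, and `colliding_vectors` is a hypothetical helper name.

```python
from itertools import product

def colliding_vectors(x, y, m):
    """Count the vectors a in {0..m-1}^(r+1) for which h_a(x) == h_a(y)."""
    count = 0
    for a in product(range(m), repeat=len(x)):
        hx = sum(ai * xi for ai, xi in zip(a, x)) % m
        hy = sum(ai * yi for ai, yi in zip(a, y)) % m
        count += (hx == hy)
    return count

m, x, y = 5, (1, 2, 3), (4, 2, 0)    # r = 2, so |H| = 5**3 = 125
print(colliding_vectors(x, y, m))    # 25, i.e. m**r, so the probability is 1/5
```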
An aside on modular inversion: $a^{-1} \pmod{m}$ is the integer that solves

$$a \, a^{-1} \equiv 1 \pmod{m}.$$

For a to have an inverse (mod m) it must be that gcd(a, m) = 1, i.e. a and m have no common factor.
Method 1: One computes gcd(a, m) via the Euclidean algorithm (we will analyze this later this term). A variant called the Extended Euclidean Algorithm, given a and m, produces integers x and y with gcd(a, m) = xa + ym. If gcd(a, m) = 1, then $x \equiv a^{-1} \pmod{m}$!
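The extended Euclidean approach to inversion can be sketched as below. This is one standard iterative formulation, not necessarily the version analyzed later in the course; the function names are illustrative.

```python
def extended_gcd(a, m):
    """Return (g, x, y) with g = gcd(a, m) = x*a + y*m."""
    old_r, r = a, m
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y

def mod_inverse(a, m):
    """Return a^-1 mod m, or raise ValueError if gcd(a, m) != 1."""
    g, x, _ = extended_gcd(a, m)
    if g != 1:
        raise ValueError("a has no inverse modulo m")
    return x % m  # reduce x into the range 0..m-1

print(mod_inverse(3, 7))  # 5, since 3 * 5 = 15 = 2 * 7 + 1
```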

Method 2: If m is prime, then $a^{m-1} \equiv 1 \pmod{m}$ for any a not divisible by m. (This fact is the basis for probabilistic primality testing.) Thus $a \cdot a^{m-2} \equiv 1 \pmod{m}$ and $a^{-1} \equiv a^{m-2} \pmod{m}$. If modular multiplication is Θ(1), then what is the cost of modular exponentiation?

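Writing the exponent in binary (s = square, s-m = square-multiply):

$$\begin{array}{llll}
a^{3} & = a^{11_2} & = (a^2)\,a & \text{(s-m)} \\
a^{4} & = a^{100_2} & = (a^2)^2 & \text{s, s} \\
a^{5} & = a^{101_2} & = (a^2)^2\,a & \text{s, (s-m)} \\
a^{22} & = a^{10110_2} & = (((a^2)^2 a)^2 a)^2 & \text{s, (s-m), (s-m), s}
\end{array}$$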

So, starting from the bit next to the m.s. bit and working towards the l.s. bit of the exponent: when you come to a '0' ⇒ square, and when you come to a '1' ⇒ square-multiply. Why does this work?
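Induction (exponents written in binary):

$$a^{1_2} = a, \qquad a^{10_2} = a^2, \qquad a^{11_2} = a^3$$

Assume a^p is correct. Appending one more bit to the exponent p gives q:

q = 2p + 1 → square-multiply, since $(a^p)^2 \, a = a^{2p+1}$

q = 2p → square, since $(a^p)^2 = a^{2p}$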

Cost: Θ(lg p) operations.
Note: this is the "giant-step" algorithm, more commonly known as square-and-multiply or binary exponentiation.
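The left-to-right bit scan described above can be sketched as follows, together with Method 2 for inversion built on top of it. The names `mod_pow` and `fermat_inverse` are illustrative; in practice Python's built-in three-argument `pow(a, e, m)` does the same job.

```python
def mod_pow(a, e, m):
    """Compute a**e mod m for e >= 1 by scanning the bits of e from the left."""
    bits = bin(e)[2:]            # binary digits of e, most significant first
    result = a % m               # the leading '1' bit contributes a itself
    for bit in bits[1:]:         # start from the next-to-most-significant bit
        result = (result * result) % m    # every bit: square
        if bit == "1":
            result = (result * a) % m     # '1' bit: also multiply by a
    return result

def fermat_inverse(a, m):
    """a^-1 = a^(m-2) mod m, assuming m is an odd prime that does not divide a."""
    return mod_pow(a, m - 2, m)

print(mod_pow(3, 22, 1000))   # 609
print(fermat_inverse(3, 7))   # 5
```

The loop runs once per bit of the exponent after the first, which is where the Θ(lg p) count of modular multiplications comes from.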
