>> All right, today an
exciting topic of hashing and hash tables and hash functions. So first let's talk a little bit about
hash functions, what are these things. They are mappings. They're functions, right? They're mappings from some type to the non-negative integers
first and foremost. So T is some key type, a string for example, and the value of the function is a hash value for that key. Sometimes these are called signatures and sometimes they are
called hash values. Another critical property defining
hash functions is that they are not one to one, which means that you can map the type T to non-negative integers but you cannot reverse that process. There is no inverse mapping from the non-negative integers back to the keys; it theoretically cannot exist. So what this means is that it's hard to
sort of reverse engineer a hash value. There are two main uses
for hash functions. One is in secure signatures, so
there are hash functions designed to take your name, for example, and create a number from it, and that number can be used in secure communications as a signature that the receiver can look at and convince themselves that the message really came from you. Note that it can't be decrypted, because hashing is not the same as encrypting. Encrypting is a reversible process, so when you encrypt a message, the receiver of the message can decrypt it and read it; but when you hash a signature, it cannot be reversed, and so the only way to check whether or not the signature legitimately came from the person sending it is to know the hash function, hash that person's name to a value, and compare that value with the value on the message. If they're the same, then that's evidence that the sender really was who they claim to be, and
that's all a very interesting thing, and the National Institute of Standards and Technology spends a lot of effort on having secure hash functions. These are used, for example, in the UNIX/Linux environment
to hash your password. If you take a string consisting of your user name and your clear-text password and hash that to a value, that value is what is stored on the UNIX system, and this is why the UNIX system administrator cannot tell you what your password is. They have no way of reversing that hash function, because it's impossible, and so all they can do is allow you to set another password if you forget the one you made. So anyway, that's the security
side but our main interest in them today is not security
but the pseudo-randomness. So the other principal use of hash functions is to generate efficient tables. Now the security we were just talking
about is related to an attribute of a hash function called its enigma, in other words how difficult it is to guess, from the hash value, the clear-text thing that created that hash value. That's the enigma attribute, and you would like that to be difficult, computationally intractable to discover. Even if you know the hash function, you should not be able to reverse engineer it and find the un-hashed values. Okay, the other thing, the
table efficiency is related to the pseudo-random
attribute of a hash function. This is the one, like I
said, that we're going to be principally interested in today. So let's just look at some
examples of hash functions. There are some that are simple. The nature of the hash function is
that the good ones are rather cryptic and hard to calculate and
so you make simple ones that are not really very good hash
functions but they do make this system of hashing a little bit less opaque. So here's a nice simple one and this by the way is the one we use
in test questions as well. It's going to take a string as input. More often than not our keys are strings of some sort, and so it takes a string as input. What this does is define a hash value
starting out to be 0 and then it goes through the string and adds the
letter offset value of each letter in the string to the hash value. Now the letter offset
value is just simply a goes to 0, b to 1, c to 2 and so on. So that's pretty easy to
calculate in your head. So, for example, the hash value of a is 0 and of b is 1; for a, b, c, d it's 0 for a, plus 1 for b, plus 2 for c, plus 3 for d, which is 6. And here you see the first reason why this is not such a good hash function: if I take any permutation of a, b, c, d, because the hash is just the sum of those character values, you're going to get 6 no matter what. But it is easy to calculate. So that's our simple hash function. Now you can improve this a little bit by doing things like multiplying by the index. So instead of what we did before, we're going to take the letter offset value but multiply it by the slot, the place in which that letter occurs in the string, and that really would give you considerably better enigma and pseudo-randomness. It's not one you would be able to guess without some scratch paper, and for example a, b, c, d goes to 14, but its permutations go to other values, such as 12, which is a good thing.
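To make those two examples concrete, here is a small sketch of what such functions might look like. This is illustrative code, not the exact code from the lecture slides, and the function names are mine.

```cpp
#include <iostream>
#include <string>

// Simple hash: sum of letter offsets ('a' -> 0, 'b' -> 1, ...), assuming lower-case letters.
// All permutations of a string get the same value.
size_t SimpleHash(const std::string& s)
{
  size_t hv = 0;
  for (char c : s) hv += static_cast<size_t>(c - 'a');
  return hv;
}

// Slightly better: weight each letter offset by its position in the string,
// so permutations usually hash to different values.
size_t WeightedHash(const std::string& s)
{
  size_t hv = 0;
  for (size_t i = 0; i < s.size(); ++i)
    hv += i * static_cast<size_t>(s[i] - 'a');
  return hv;
}

int main()
{
  std::cout << SimpleHash("abcd") << ' ' << SimpleHash("dcba") << '\n';     // 6 6
  std::cout << WeightedHash("abcd") << ' ' << WeightedHash("dcba") << '\n'; // 14 4
}
```

The weighted version gives the 14 for a, b, c, d quoted above, while the simple version collapses every permutation down to 6.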
So here's an actually useful example. This one was invented by an FSU statistics professor who is no longer with us, unfortunately, but nevertheless, let's talk just a little bit about it. This idea of a big value: that constant, 65535, is a mask for the low 16 bits of a number. What this does is initialize bigval to the first element of the string; then, for each element, it takes bigval bitwise-AND with that mask, which effectively gives you the lower 16 bits of bigval, multiplies that by 18,000, adds in bigval right-shifted by 16 to bring the high bits back down, and then adds in the value of that particular element of the string. Finally it does one more round of that bigval mixing step and bitwise-ANDs the result with the mask to keep just the low 16 bits, so this is a good hash function if you only have short integers, 16-bit integers.
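Here is a rough sketch of that multiply-with-carry style mixing, following the steps just described. This is not the exact code from the library's hashfunctions.cpp; the function name and details are illustrative assumptions.

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// A sketch of a 16-bit multiply-with-carry style mixer in the spirit of the
// steps described above. Not the FSU library's exact implementation.
uint16_t MixerSketch(const std::string& s)
{
  if (s.empty()) return 0;
  uint32_t bigval = static_cast<unsigned char>(s[0]);   // start with the first element
  for (size_t i = 1; i < s.size(); ++i)
  {
    // keep the low 16 bits, multiply by 18000, add back the carried-out high bits
    bigval = ((bigval & 65535) * 18000) + (bigval >> 16);
    bigval += static_cast<unsigned char>(s[i]);          // fold in the next character
  }
  // one more mixing round, then mask down to 16 bits
  bigval = ((bigval & 65535) * 18000) + (bigval >> 16);
  return static_cast<uint16_t>(bigval & 65535);
}

int main()
{
  std::cout << MixerSketch("abcd") << ' ' << MixerSketch("dcba") << '\n';
  // the two outputs should look unrelated, unlike the simple sums above
}
```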
Here's an example of the output, and you can see that, for example, a goes to one value, b to another, a, b, c to a third, b, a, c, d to a fourth, and the eye does not detect any structure relating those numbers. So this looks like a random list of
numbers, whereas with the clear text you can, of course, discern some intentional ordering of these things; but if you look at the hash values, it's hard to tell they're not just random numbers, and that's the property of pseudo-randomness, the property you want for making tables, which we'll come to in the next chapter. So this is George Marsaglia of the FSU stat
department, inventor of this thing, and it's called a Marsaglia mixer in our library; this is the way that thing goes. This is just the same algorithm shown on another slide. So we're going to use that, and we're also going to use an even fancier one from Marsaglia that does well on 32-bit numbers. So anyway, to improve the pseudo-randomness of
a hash function, a common way to do that is to divide the hash value by a prime and look at the remainder: take the remainder when the hash function value is divided by that prime, call that remainder the new hash value, and you're going to improve the pseudo-randomness of the hash function. So we'll have good hash functions, and then we're going to divide by a prime and take the remainder
to get even better hash functions. Now here's a slide on improving enigma, but first of all, where are these things used? I mentioned password authentication and message authentication. Only the hash value is transmitted; excellent enigma means that you can't even guess, much less reverse, the hash function. The Secure Hash Algorithm is a NIST standard now, and it's used in your Linux [inaudible] systems. So then there's the Marsaglia KISS. This is the 32-bit version
of Marsaglia [inaudible]. Here's the code for it and this
is just some sample outputs, so here's your input strings,
here's the Marsaglia mixer value and this is your KISS
value of these same things, none of those should be meaningful. They should look like
columns of random numbers. So I'm going to go straight
into hash tables. Let's remind ourselves of what a table
is, sometimes called a dictionary, sometimes called mapping,
sometimes called a map. We store key-data pairs in a table, just like when you did your associative array for homework 4. You access data in the table through the key value; it's an associative data structure. So you know the key, and that allows you to look up the data associated with that key in the table. An associative array is a table with an alternate interface: it's got a bracket operator with special insert semantics. This is just reminding you
of stuff you already know. It's not a bracket operator
for an iterator. You cannot iterate through
all possible keys for a table. If you found a way to do that, it would
not be something you would want to do, because typically there will be many more possible keys than there are actual keys in the table. So the bracket operator is not intended
to be used for that sort of purpose. But anyway, you've experienced
the fact that it is quite handy. Tables have unimodal semantics; that means duplicate keys are not allowed. So insert operations such as insert and put have a sort of dual personality. If the key is in the table, what insert does is overwrite the stored data with the incoming data. If the key is not in the table, then it inserts both the key and the data as a new pair. This is exactly, so far, like when you
did the ordered associative array, where this is going to differ
is in the ordering part. Now this collects the operations that we have for a table: insert, remove, retrieve, empty, and size, and then we have some auxiliary operations, and there are axioms associated with these operations. For example, if you insert (k1, d1), then retrieve(k1, d) returns true and d, passed by reference, comes back equal to d1: the item is stored in the table. After removing k1, retrieve returns false. After insert, empty returns false, and so forth. Empty returns true exactly when size returns 0. These are axioms for this abstract data type.
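As a quick illustration of those axioms, here is a hedged sketch of how they might be checked in code. The method names (Insert, Retrieve, Remove, Empty, Size) are assumptions for illustration, not necessarily the library's exact interface.

```cpp
#include <cassert>
#include <string>

// Exercise the table axioms on any table-like type with the assumed interface.
template <typename TableType>
void CheckAxioms(TableType& t)
{
  std::string d;
  t.Insert("k1", "d1");
  assert(t.Retrieve("k1", d) && d == "d1");  // insert then retrieve finds the pair
  assert(!t.Empty());                        // after insert, empty returns false
  t.Remove("k1");
  assert(!t.Retrieve("k1", d));              // after remove, retrieve returns false
  assert(t.Empty() == (t.Size() == 0));      // empty exactly when size is 0
}
```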
This is a slide where we kind of look at possible implementations and really kind of throw them out as not being quite what we want. We could, of course, just like when we made sets, have said, okay, we'll keep our items in a list and then we'll look for them, and that'll be our element operator; but what we're doing there is sequential
search and it's horrendously slow. If we have to do sequential
search, we do it but we certainly would
like something better. Using a sorted vector worked out not
too badly: if we maintain order by key, we can use binary search for lookup, and so we at least get logarithmic lookup time in this case, but our insert and remove times are still going to be linear, and that's slow. One thing that's often overlooked
is just a plain vector or array. If your key happens to be an unsigned integer, then you can store your data in the vector element at that index, and that gives direct, constant-time access; that's great if you're willing to have a data entry for each key value throughout a certain range. But I'm going to describe quickly an example where that would be preposterous and you would not do it. So, for example, suppose at FSU we wanted to keep student records, and the
key value for that record would be, I'm going to just say, your social security number; we don't really do that, but everybody's familiar with it. You have this thing called [inaudible] ID in the system, which is a substitute for that, but the same argument will hold. So that means we've got 100,000 student records, let's say, that we need to store, but we're doing it by key equal to social security number. Well, there are 10 to the 9th possible social security numbers but we only have 10 to the 5th records to store, so that means for every record we store, we've got 9,999 vector slots that are meaningless, and so we have wasted all but 0.01% of the required storage for that vector. That's clearly something you don't want to do when you start talking about large chunks of data, like all student records. So another thing we could do is just
make a set of key type, data type. This was your large associative
array that you built. This allows you to have no special
assumptions on key type, which is good. They don't have to be integers. All the operations are
logarithmic, that's good. Runspace overhead is
reasonable, that's good, and you get ordered traversals
included as part of the package, so set-based tables or associative
arrays or maps are good. With all those properties, including the ordered traversal, you really can't do any better; but for pure speed, we would like more. That slide was just going over why those alternatives are not so good, and I've already done that in my discussion. Here's what we want for a table, an unordered table. We want insert, remove, and retrieve to
have constant runtime, constant runtime. Think about that. We want modest space overhead
and no restrictions on key type. Okay, none of the things we've mentioned or studied so far meets those conditions; they all fail at least two of them. So what we really need is some sort of hybrid of a vector, which does have constant access time, and a list, which has constant insert time. The naive approach to doing that doesn't work, of course, or we would have taught it in the first intro course, but nevertheless that's kind of where we're headed, and believe it or not, we can actually pull this off. I'm now going to start talking about the technology and the way it's implemented, and it's through something called a hash table. So first of all, we use a hash function to convert the key type to
an unsigned integer. So this only works if we have a decent hash function on the key type. We're going to make a vector v indexed on the hash value of a key, so that means our vector elements are going to be lists of table entries. So here's our search algorithm. You first compute the
hash value for the key. That's a computation, so that'll be
an O(1) contribution to the search. Then we're going to directly access the bucket at the vector index, which, remember, is the hash value; directly accessing a vector element is a constant-time operation. Now we know we've found the bucket where this key-value pair would be, and that bucket, by the way (we call these lists buckets), is a list of entries. So to find the entry we're looking for, we have to do a sequential search in that list, and that looks like a little downer for our algorithm, so we have to figure out a way to make it not so bad. That means the sequential search is going to have a worst-case search time proportional to the length of the list at that vector position. Now insert, includes, retrieve, remove, get, put, and the bracket operator all use this same search logic in a hash table. It's very important to get a handle on how this search algorithm works.
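Here is a minimal sketch of that search, assuming the buckets are lists of key-data entries. The names used (Entry, HashString, Retrieve, bucketVector) are illustrative assumptions, not the library's exact identifiers.

```cpp
#include <list>
#include <string>
#include <vector>

struct Entry { std::string key; int data; };

// stand-in hash function on the key type
size_t HashString(const std::string& s)
{
  size_t hv = 0;
  for (char c : s) hv = 31 * hv + static_cast<unsigned char>(c);
  return hv;
}

bool Retrieve(const std::vector<std::list<Entry>>& bucketVector,
              const std::string& key, int& data)
{
  // O(1): compute the hash value and reduce it to a legitimate vector index
  size_t index = HashString(key) % bucketVector.size();
  // sequential search of the single bucket at that index
  for (const Entry& e : bucketVector[index])
  {
    if (e.key == key) { data = e.data; return true; }
  }
  return false;
}
```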
Now let's look at a sort of perfect-world analysis. Let's assume that our hash function
uniformly distributes keys in the range of 0 through the size of the vector. So from a statistics perspective,
that means that looks just like a uniform distribution, uniform
probability distribution in that range but it's not, of course, probabilistic because we can calculate
the hash values. The second assumption is that the vector size, that is, the number of lists or buckets underlying the table, is about the same as the number of items in the table, which is the size of the table. Okay, with those two assumptions, then
let's look at the search time. Well, it's the cost of calculating the hash function plus the cost of searching the bucket; remember, it's those two things. So that's 1 plus O of the size of the bucket at that particular hash value. But if the keys are uniformly distributed, then we can expect the size of a bucket to be, on average, the size of the table divided by the number of buckets. The size of the table is, by assumption 2, the same as the size of the vector, in other words the same as the number of buckets, so what we have is O(v.size) divided by O(v.size), which is 1, and so we have 1 plus O(1), which is O(1): constant runtime for the search part of this algorithm. Now, more generally, and
this is done in all textbooks, the load factor is defined
to be the size of the table divided by
the number of buckets. In our example, that load factor was 1, but if you substitute lambda for that ratio, what you get is that the expected search time in a hash table is 1 plus lambda.
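In symbols, with $n$ the number of items in the table and $b$ the number of buckets:
$$\lambda = \frac{n}{b}, \qquad E[T_{\text{search}}] = O(1) + O(\lambda) = O(1 + \lambda),$$
so when $b \approx n$, the load factor $\lambda \approx 1$ and the expected search time is constant.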
Now this is remarkable, and let me go back over that algorithm. You first compute the hash value for
the key and then you mod by the size of the vector to make sure that the
remainder is exactly a vector index, then you access the bucket
at that index, and then you search the bucket
sequentially for the actual key that you're looking for, not the hash value of the key; you know that already, it's the bucket index. You want to search for the key itself, and remember that, hash functions being what they are, more than one key is likely to have the same hash value, so that list will typically have more than one item. So that's the perfect-world analysis. Now for our class design: we're going to
have template parameters, key type, data type, and a hash class that
can go in as template parameters. The hash class is just a class wrapper around the hash function; in other words, we're overloading the paren (function call) operator. We'll implement the table
protocol plus clear and dump. We'll have iterator support. We may or may not choose to copy tables. It turns out I've had a little
bit of a change of heart on that. It used to be that the standard advice was: don't let people copy tables, they're going to be huge, they can use references to the same table. I had to rethink that when looking at sparse matrices, because with sparse matrices you really sometimes have to return a value from a function, and that value happens to be a sparse matrix. If a sparse matrix is represented by a hash table, then somewhere under the covers, under the hood, you're going to have to make a copy of a hash table, and so we're going to facilitate copies. So I should cross that restriction out on the slide; that's how it used to be done, but we've decided not to do it that way anymore. So in order to get better
pseudo-random distribution of keys, we're going to choose our vector
size to be a prime number. That means we need a prime number calculator, which is supplied in the library, so our hash table constructor needs to make sure that the vector size is a prime as close as possible to the parameter value given by the client program. At the moment we're going to talk about having no default constructor; we're going to have to throw one in by allowing default values for that one-argument constructor, just in order to make our sparse matrix [inaudible] okay. But anyway, the constructor needs to instantiate a hash object, and we'll have two versions: one that uses a default hash object and another that uses a hash object passed in as an argument. Notice, weirdly, that the hash class and
hash object is kind of taking the place of the predicate class and
predicate object for ordered tables. These are unordered tables and the
hash function is the facilitator, whereas the predicate object was
a facilitator for ordered tables. Now there's a primes.h in your library
and the implementation file primes.cpp, those are not template functions. In fact, these are just functions but
primes has a function called prime below and another one called prime above. Prime above I don't really recommend using because, for theoretical reasons, we don't have a guarantee of how far above a number the next prime is. Of course, for the kinds of numbers you're going to be making tables with, we do know it's close, so that's not really a big worry; but prime below guarantees that you'll choose the largest prime that's less than or equal to the expected table size, and that's going to be fine.
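To make the idea concrete, here is an illustrative stand-in for what a prime-below function does. This is not the library's primes.cpp code, just a sketch of the behavior described.

```cpp
#include <cstddef>

// trial-division primality test, fine for table-sized numbers
static bool IsPrime(size_t n)
{
  if (n < 2) return false;
  for (size_t d = 2; d * d <= n; ++d)
    if (n % d == 0) return false;
  return true;
}

// largest prime less than or equal to n
size_t PrimeBelow(size_t n)
{
  if (n < 2) return 2;       // no prime <= n; fall back to the smallest prime
  while (!IsPrime(n)) --n;   // n >= 2 and 2 is prime, so this terminates
  return n;
}
```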
Hashfunctions.h and .cpp are the prototypes and implementations of the hash functions. The ones we're going to have are KISS, the Marsaglia Mixer, and the simple one, so there are three different hash functions in there. Hash classes just wrap KISS, Mixer, and simple in function classes for the various types that might end up being keys, in particular string types.
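Here is a hedged sketch of what such a hash function class looks like: a class that overloads the function call operator so it can be supplied as a template argument and then called like a function. The class name is illustrative, not the library's.

```cpp
#include <cstddef>
#include <string>

class SimpleStringHash
{
public:
  // the letter-offset hash from earlier in the lecture, wrapped as a functor;
  // assumes lower-case letters
  size_t operator()(const std::string& s) const
  {
    size_t hv = 0;
    for (char c : s) hv += static_cast<size_t>(c - 'a');
    return hv;
  }
};

// usage: a table templated on <KeyType, DataType, HashType> can instantiate a
// HashType object and call it like a function:
//   SimpleStringHash h;
//   size_t hv = h("abcd");   // hv == 6
```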
Now we also have a functionality test for the hash table, a typical functionality test, and we have ranfile.cpp, which creates random files of string data. You used that before in homework 4 when you were doing your associative array implementation. So here we go on how to
build up a hash table. We have a notion of entry here. This is very, very much like the notion of pair; it's just adapted a little, technically, specifically for use in tables, and I'm introducing it here. One principal difference is in the terminology: instead of first and second for the two things in the pair, we're calling the first one key_ and the second one data_. That's just a terminology difference. The functional difference is that we make the key const, so what this means is that once an entry object is created, its key cannot be changed, and this pretty much ensures that whatever else happens in a table, nobody's going to be able to mess with the key, change it, and goof up the table structure. So we have a constructor
that takes a key and we have a constructor
that makes a key and data. We have no default constructor, by
the way, because key is constant. It wouldn't make sense to
try to make a pair like that. Since you don't know what the
key is, what use would it be? Anyway, we've got a copy constructor, assignment operator, equality, inequality, and the usual order operators, defined just as for pair: the key is used to determine less than or greater than and, by the way, is also used to determine equality, so two entries are considered equal if they have equal keys. The data again plays no role in
equality or order, just like the pair. So there's our equality, defined by equal keys, and there's our less-than, defined again by just looking at keys. I think I have in the narrative
a little something on entries. Let me see. Yes, I think you'll find this amusing. The assignment operator: well, if you've got a const data member, you can't assign to it, right? Even if you know the values are equal, if you try to assign to it, the compiler is not going to allow it. So what the assignment operator does is first check to see if the keys are equal, and if they're not, it throws an error at you. If the keys are equal, then the data is copied. It's a little bit different take on the assignment operator.
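Here is a minimal sketch of that Entry idea: a key-data pair with a const key and a key-checking assignment. The member names follow the key_/data_ convention mentioned above, but the details are illustrative assumptions, not the library's exact class.

```cpp
#include <stdexcept>

template <typename K, typename D>
class Entry
{
public:
  const K key_;   // const: once constructed, the key cannot change
  D       data_;

  explicit Entry(const K& k) : key_(k), data_() {}
  Entry(const K& k, const D& d) : key_(k), data_(d) {}

  // assignment cannot overwrite a const key, so it only copies data,
  // and only when the keys already agree
  Entry& operator=(const Entry& rhs)
  {
    if (!(key_ == rhs.key_))
      throw std::runtime_error("Entry assignment: unequal keys");
    data_ = rhs.data_;
    return *this;
  }
};

// equality and order are determined by the key alone
template <typename K, typename D>
bool operator==(const Entry<K,D>& a, const Entry<K,D>& b) { return a.key_ == b.key_; }
template <typename K, typename D>
bool operator<(const Entry<K,D>& a, const Entry<K,D>& b) { return a.key_ < b.key_; }
```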
So then how do you make a table? Well, this is what the interface looks like, with the usual terminology support. We've got key type, data type, hash type, bucket type; this is an ultra-fancy version because there's even the possibility of using something besides a list for the buckets. Then we have a value type and an iterator type, which by tradition is for some reason just called iterator, not iterator type. There is a potential
area of confusion here. Note that this value type is the same as the value type for the bucket, but what goes in the bucket is entries, key-data pairs. So the bucket will be a list, or whatever else capital C might be, of entry objects, and so the value type is an entry, not a key or a data item. You can kind of keep a grip on that factoid when you're going through this stuff. So we've got insert, remove,
retrieve, includes, and you'll find in the actual library distribution that we also have the get, put, and bracket operators for associative arrays as part of it. The usual kinds of things, clear and so on. There's rehash. You've had rehash now in
ordered tables or ordered sets. What rehash does here is very similar: it restructures a table to be more efficient, just like rehash for the left-leaning red-black tree does, but of course what it actually accomplishes internally is something entirely different from what it does for trees. We give it the same name. In fact, rehash, the name,
comes from hash tables. We borrowed that name to use in
the context of binary search trees. It's a good name to borrow. We have size, empty,
begin, and iterator support. Note that these are only the const versions, which kind of hints to you that these iterators are going to be const iterator types: you're not going to be able to dereference an iterator and assign through it. You've got a hashtable constructor, copy constructor, and so on. Unsurprisingly, we have a dump function so you can look under the hood of a table and see its structure, and here's something I
have not mentioned before. This is an Analysis method associated with hash tables, and it's an innovation of the FSU library, one that we believe should become standard practice in the implementation of hash tables, because what it does is give the user of the hash table a way to get a statistical picture of how well their hash function is performing on that particular set of data. You can have what theory tells you is a pretty good hash function, but your data can come in so skewed, because of your particular application, that that hash function gives you pretty awful table performance for your data. So it's important to be able to know a little bit about how your hash function is behaving on your particular data. That's what this analysis is for. And notice that, private or not, yeah, I went in there and made that public in this line. Sometimes we might need copies, or
need to make copies [inaudible]. Anyway, our protected data is the number of buckets and a vector called the bucket vector; notice that it's a vector of type-C elements, and those are typically lists of entries. We've got a stored hash object, and here we have an index function. The index function is defined in terms of the hash object and the number of buckets. The way you get an index is you take your hash object, evaluate it on the key, and look at the remainder when you divide by the number of buckets; that's your index. So your index is guaranteed to be a legitimate bucket number all the time, and by choosing the number of buckets to be prime, you get that extra little help making the hash values look random, be pseudo-random, by making the divisor a prime number.
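A sketch of that index calculation, with illustrative names (hashObject, numBuckets) rather than the library's exact members:

```cpp
#include <cstddef>

// hashObject is any hash function class; numBuckets is the (prime) vector size
template <typename KeyType, typename HashType>
size_t Index(const KeyType& key, const HashType& hashObject, size_t numBuckets)
{
  // the remainder mod the bucket count is always a legitimate bucket index
  return hashObject(key) % numBuckets;
}
```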
So that's what our table interface is going to look like; like I said, there will be a few more things [inaudible]. The iterator is going to be a bidirectional iterator that is const, so it has the same terminology support as the hash table does. It has a constructor, destructor, a valid method, assignment, increment, post-increment, dereference, and const dereference. Now, actually, that would probably not go in. Even though it would be safe, because of
the fact that our entries only allow you to copy data after checking that the keys are the same, we're probably not going to have that. Anyway, then there's the protected data. We have a pointer to the table, so an iterator in this instance knows which table it is iterating on, which comes in real handy. This is one of those sort of knowledgeable iterators, like we defined for deque, as opposed to lists, trees, and vectors, where we have non-knowledgeable iterators: they just have a pointer, and they don't have any way of knowing which list they are pointing into, for example. But here, as with a deque, you know which table you're pointing into, and therefore you can talk about what bucket number you're pointing at and where in that bucket you're pointing, by having a list iterator. So a table iterator is going to consist of a pointer to the table you're using, the bucket number you're pointing into, and a list iterator that points into the list at that bucket.
Now, what rehash does: bear in mind the desire is to have a load factor of about 1, which will give you constant runtime for all of your structural operations: search, insert, remove. What rehash does is simply create a new table with a new number of buckets based on the current size of the table that you have. So let's say you made a table with 1,000 buckets, and at some point you realize you've put 15,000 items into that table; well, you may want to call rehash some night, overnight, when you've got some time. Rehash would restructure that table with 15,000 buckets, or as I said, approximately the same number of buckets as the size of the table, which will lower your search time back down to about 1.
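Here is a hedged sketch of that rehash idea: pick a new (prime) bucket count close to the current number of entries and re-bucket everything. Names and parameter choices are illustrative, not the library's actual implementation.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <vector>

struct Entry { std::string key; int data; };

void Rehash(std::vector<std::list<Entry>>& bucketVector, size_t numEntries,
            size_t (*hash)(const std::string&), size_t (*primeBelow)(size_t))
{
  // choose a new bucket count so the load factor n/b comes back down to about 1
  size_t newBucketCount = primeBelow(numEntries > 2 ? numEntries : 2);
  std::vector<std::list<Entry>> newVector(newBucketCount);
  for (const std::list<Entry>& bucket : bucketVector)
    for (const Entry& e : bucket)
      newVector[hash(e.key) % newBucketCount].push_back(e);  // re-bucket each entry
  bucketVector.swap(newVector);  // the old vector is discarded
}
```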
Now, associative arrays are an interesting add-on. You have this associative array bracket operator, of course, and you're familiar with that, but we're going to have it for hash tables. There's no precondition on using this bracket operator. The post-condition is that the key appears in the table, the key you called it on; so just activating the bracket operator on a key ensures that that key is in the table, and what it does is return a reference to the data associated with that key. It will return a reference to the data box in the table, after ensuring that that data box exists.
So, for example, you can create an associative array expecting about 100 items and do aa["abc"] = 25. The first access of the bracket operator on "abc" will insert "abc" with a default data item into the table; it is default right after that access, and then it is reassigned to the value 25. If I just called it with a semicolon after it, that would have the effect of inserting "abc" into the table with a default data box. Of course, the second time you use it, there's no insert anymore; the key matches an existing entry, and this just becomes a basic retrieve.
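Here is a hedged sketch of those bracket-operator semantics on a little bucket-vector table. This is illustrative code with made-up names (TinyTable, HashString), not the FSU library's associative array.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <vector>

struct Entry { std::string key; int data; };

class TinyTable
{
public:
  explicit TinyTable(size_t buckets) : bucketVector_(buckets) {}

  // postcondition: key is in the table; returns a reference to its data box
  int& operator[](const std::string& key)
  {
    size_t index = HashString(key) % bucketVector_.size();
    for (Entry& e : bucketVector_[index])
      if (e.key == key) return e.data;              // already present: just return
    bucketVector_[index].push_back(Entry{key, 0});  // absent: insert with default data
    return bucketVector_[index].back().data;
  }

private:
  static size_t HashString(const std::string& s)
  {
    size_t hv = 0;
    for (char c : s) hv = 31 * hv + static_cast<unsigned char>(c);
    return hv;
  }
  std::vector<std::list<Entry>> bucketVector_;
};

// usage: TinyTable aa(100);  aa["abc"] = 25;   // first use inserts "abc", then assigns 25
```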
So you have a get and a put operator. If you define the bracket operator, you can define get in terms of the bracket operator and put in terms of the bracket operator; or you can supply put and get, define the bracket operator in terms of get, and in fact define put in terms of get as well. So you really have the choice of implementing the bracket operator directly and then using that to define put and get, or implementing get directly and using that to define put and the bracket operator.
So here's our runtime analysis. These are average-case runtimes. Let's just go to the table operations; we could have run all of this for sets instead of tables, storing values instead of key-data pairs, and the same technology works either way. So let's just talk about tables; tables are, I think, the more common use of this technology. So we have insert, includes, retrieve, put, get, lower bound, upper bound, remove, erase, and rehash. These are just about all the
operations we've talked about, and, for example, in an ordered vector, insert takes O(n) time, and in a red-black left-leaning tree, insert takes O(log n) time. In a hash table, it's going to be 1 plus n over b, where n is the number of items in the table and b is the number of buckets in the table. Includes is logarithmic for the first
two and constant for the last one. Retrieve - logarithmic,
logarithmic, constant. Put - linear, logarithmic,
constant, and so on. But when you get down to lower bound: lower bound is logarithmic in an ordered vector and logarithmic in a red-black left-leaning tree, but lower bound doesn't exist for a hash table, because lower bound is a concept that only makes sense when you have ordered data, and the table is not ordered. There is no effective way to traverse the hash table in an ordered, sorted fashion. Erase, however, resembles remove, so we won't bother with erase separately; we'll just take the thing out of the table, and we do that in constant time. And finally, this notion of rehash: we could define rehash
for an ordered vector. If we did, it would have
runtime theta of n, essentially making a copy of that vector. I don't want to go into why you might want to do that, but you could do it. For red-black left-leaning trees, of course, rehash plays a significant role: it restructures the tree to get rid of all the dead nodes. You could have the same notion of dead nodes in a vector, and rehash would just build a new vector with no dead nodes; it would be a smaller vector, just as you get a smaller tree after rehashing because there are no dead nodes in it. And so, you know, if your set starts filling up with dead
think they're going to stay dead, then you probably want to call rehash
and cut down on the memory requirements and even improve the runtime a
little bit for all your operations. And the same two reasons
are improving the efficiency of the runtime really is
the most important reason for calling rehash on a hash table. If the number of elements in the
table begins to far exceed the number of buckets in the table, so that
your lambda, this n over b thing, starts getting big, then you
want to call rehash and maybe get that lambda back down to about 1. Run space requirement is probably
worth taking a look at as well. So for example, the container
[inaudible] of the constructor operations but
let's look at something like insert, putting something in, I
have that plus theta of 1, that means it's a constant amount
of time to run the insert operation and if you think about it, you
do have some local variables in there that's your [inaudible]
of extra memory but you don't have to make a copy of data
or anything like that, which would greatly inflate the memory
requirements for the [inaudible]. So that's constant. It's also constant in
a binary search tree. In our red-black left-leaning tree,
our insert operation is recursive. Because it's recursive, it builds up
a small collection of recursive calls and the runtime stack
and so that is memory that is actually being used
as part of the algorithm. Now it's okay in red-black left-leaning
trees because we are assured that the height of the tree will
be logarithmic and so that means that the number, the depth of the
recursive calls will be logarithmic as a function of the size of the tree and so these are really not
spectacularly damaging uses of extra space but they're there. The tables, there's not going to
be any extra space used for any of the operations except for the space
required for creating a container. It's going to use b buckets and n
entries and so anyway you cut it, it's going to have a footprint on the [inaudible] size b plus n. I
may not have gotten around to talking about rehash runtimes here. Rehash for a table, I'm sorry, for a red-black left-leaning tree
effectively creates a new tree by inserting all the elements of
the old tree into the new tree and since there're n elements in the old
tree and it takes log n or log k time, where k is the size of the
new tree, to insert it, you can see why you get
n log n out of that. Tables, because tables does effectively
the same thing, the difference is that your table insert operation is
constant in time and so it's just going to be the number of items that you
have to traverse and that's the, that right there is the length
of a traversal of a hash table, the length of time to
traverse the hash table. So that wraps up our discussion
of hashing and hash tables. We're going to take a break and then
we'll talk about the assignment based on hash tables, which
I call sparse matrices.