>> All right, today an
exciting topic of hashing and hash tables and hash functions. So first let's talk a little bit about
hash functions, what are these things. They are mappings. They're functions, right? They're mappings from some type to the non-negative integers
first and foremost. So T is some key type, a string for example, and the value of the function is a hash value for that key. Sometimes these are called signatures and sometimes they are
called hash values. Another critical property defining
hash functions is that they are not one to one, which means that you can map the type T to non-negative integers but you cannot reverse that process. There is no inverse mapping from the non-negative integers back to the keys; it theoretically cannot exist. So what this means is that it's hard to
sort of reverse engineer a hash value. There are two main uses
for hash functions. One is in secure signatures, so
there are hash functions designed to take your name, for example, and create a number from it, and that number can be used in secure communications as a signature that the receiver can look at and convince themselves that the message really came from you. Note that it can't be decrypted, because hashing is not the same as encrypting. Encrypting is a reversible process, so when you encrypt a message, the receiver of the message can decrypt it and read it; but when you hash a signature, it cannot be reversed, and so the only way to check whether or not the signature legitimately came from the person sending it is to know the hash function, hash that person's name to a value, and compare that value with the value on the message. If they're the same, then that's evidence that the sender really was who they claim to be, and
that's all a very interesting thing, and the National Institute of Standards and Technology spends a lot of effort on having secure hash functions. These are used, for example, in the UNIX/Linux environment
to hash your password. If you take a string consisting of your user name and your clear-text password and hash that to a value, that value is what is stored on the UNIX system, and this is why the UNIX system administrator cannot tell you what your password is. They have no way of reversing that hash function, because it's impossible, and so all they can do is allow you to set another password if you forget the one you made. So anyway, that's the security
side but our main interest in them today is not security
but the pseudo-randomness. So the other principal use of hash functions is to generate efficient tables. Now the security we were just talking
about is related to an attribute of a hash function called its enigma, in other words how difficult it is to guess, from the hash value, the clear-text thing that created that hash value. That's the enigma attribute, and you would like that to be difficult, computationally intractable to discover. Even if you know the hash function, you should not be able to reverse engineer it and find the un-hashed values. Okay, the other thing, the
table efficiency is related to the pseudo-random
attribute of a hash function. This is the one, like I
said, that we're going to be principally interested in today. So let's just look at some
examples of hash functions. There are some that are simple. The nature of the hash function is
that the good ones are rather cryptic and hard to calculate and
so you make simple ones that are not really very good hash
functions but they do make this system of hashing a little bit less opaque. So here's a nice simple one and this by the way is the one we use
in test questions as well. It's going to take a string as input. More often than not our keys are strings of some sort, and so it takes a string as input. What this does is define a hash value
starting out to be 0 and then it goes through the string and adds the
letter offset value of each letter in the string to the hash value. Now the letter offset
value is just simply a goes to 0, b to 1, c to 2 and so on. So that's pretty easy to
calculate in your head. So, for example, the hash value of a is 0 and of b is 1; for a, b, c, d it's 0 for a, plus 1 for b, plus 2 for c, plus 3 for d, which is 6. And here you see the first reason why this is not such a good hash function: if I take any permutation of a, b, c, d, because the hash is just the sum of those character values, you're going to get 6 no matter what. But it is easy to calculate. So that's our simple hash function. Now you can improve this a little bit by doing things like multiplying by the index. So instead of what we did before, we're going to take the letter offset value but multiply it by the slot, the place in which that letter occurs in the string, and that really would give you considerably better enigma and pseudo-randomness. It's not one you would be able to guess without some scratch paper, and for example a, b, c, d goes to 14, but its permutations go to other values, such as 12, which is a good thing.
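To make those two examples concrete, here is a small sketch of what such functions might look like. This is illustrative code, not the exact code from the lecture slides, and the function names are mine.

```cpp
#include <iostream>
#include <string>

// Simple hash: sum of letter offsets ('a' -> 0, 'b' -> 1, ...), assuming lower-case letters.
// All permutations of a string get the same value.
size_t SimpleHash(const std::string& s)
{
  size_t hv = 0;
  for (char c : s) hv += static_cast<size_t>(c - 'a');
  return hv;
}

// Slightly better: weight each letter offset by its position in the string,
// so permutations usually hash to different values.
size_t WeightedHash(const std::string& s)
{
  size_t hv = 0;
  for (size_t i = 0; i < s.size(); ++i)
    hv += i * static_cast<size_t>(s[i] - 'a');
  return hv;
}

int main()
{
  std::cout << SimpleHash("abcd") << ' ' << SimpleHash("dcba") << '\n';     // 6 6
  std::cout << WeightedHash("abcd") << ' ' << WeightedHash("dcba") << '\n'; // 14 4
}
```

The weighted version gives the 14 for a, b, c, d quoted above, while the simple version collapses every permutation down to 6.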
So here's an actually useful example. This one was invented by an FSU statistics professor who is no longer with us, unfortunately, but nevertheless, let's talk just a little bit about it. This idea of a big value: that constant, 65535, is a mask for the low 16 bits of a number. What this does is initialize bigval to the first element of the string; then, for each element, it takes bigval bitwise-AND with that mask, which effectively gives you the lower 16 bits of bigval, multiplies that by 18,000, adds in bigval right-shifted by 16 to bring the high bits back down, and then adds in the value of that particular element of the string. Finally it does one more round of that bigval mixing step and bitwise-ANDs the result with the mask to keep just the low 16 bits, so this is a good hash function if you only have short integers, 16-bit integers.
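Here is a rough sketch of that multiply-with-carry style mixing, following the steps just described. This is not the exact code from the library's hashfunctions.cpp; the function name and details are illustrative assumptions.

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// A sketch of a 16-bit multiply-with-carry style mixer in the spirit of the
// steps described above. Not the FSU library's exact implementation.
uint16_t MixerSketch(const std::string& s)
{
  if (s.empty()) return 0;
  uint32_t bigval = static_cast<unsigned char>(s[0]);   // start with the first element
  for (size_t i = 1; i < s.size(); ++i)
  {
    // keep the low 16 bits, multiply by 18000, add back the carried-out high bits
    bigval = ((bigval & 65535) * 18000) + (bigval >> 16);
    bigval += static_cast<unsigned char>(s[i]);          // fold in the next character
  }
  // one more mixing round, then mask down to 16 bits
  bigval = ((bigval & 65535) * 18000) + (bigval >> 16);
  return static_cast<uint16_t>(bigval & 65535);
}

int main()
{
  std::cout << MixerSketch("abcd") << ' ' << MixerSketch("dcba") << '\n';
  // the two outputs should look unrelated, unlike the simple sums above
}
```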
Here's an example of the output, and you can see that, for example, a goes to one value, b to another, a, b, c to a third, b, a, c, d to a fourth, and the eye does not detect any structure relating those numbers. So this looks like a random list of
numbers, whereas with the clear text you can, of course, discern some intentional ordering of these things; but if you look at the hash values, it's hard to tell they're not just random numbers, and that's the property of pseudo-randomness, the property you want for making tables, which we'll come to in the next chapter. So this is George Marsaglia of the FSU stat
department, inventor of this thing, and it's called a Marsaglia mixer in our library; this is the way that thing goes. This is just the same algorithm shown on another slide. So we're going to use that, and we're also going to use an even fancier one from Marsaglia that does well on 32-bit numbers. So anyway, to improve the pseudo-randomness of
a hash function, a common way to do that is to divide the hash value by a prime and look at the remainder: take the remainder when the hash function value is divided by that prime, call that remainder the new hash value, and you're going to improve the pseudo-randomness of the hash function. So we'll have good hash functions, and then we're going to divide by a prime and take the remainder
to get even better hash functions. Now here's a slide on improving enigma, but first of all, where are these things used? I mentioned password authentication and message authentication. Only the hash value is transmitted; excellent enigma means that you can't even guess, much less reverse, the hash function. The Secure Hash Algorithm is a NIST standard now, and it's used in your Linux [inaudible] systems. So then there's the Marsaglia KISS. This is the 32-bit version
of Marsaglia [inaudible]. Here's the code for it and this
is just some sample outputs, so here's your input strings,
here's the Marsaglia mixer value and this is your KISS
value of these same things, none of those should be meaningful. They should look like
columns of random numbers. So I'm going to go straight
into hash tables. Let's remind ourselves of what a table
is, sometimes called a dictionary, sometimes called mapping,
sometimes called a map. We store key-data pairs in a table, just like when you did your associative array for homework 4. You access data in the table through the key value; it's an associative data structure. So you know the key, and that allows you to look up the data associated with that key in the table. An associative array is a table with an alternate interface: it's got a bracket operator with special insert semantics. This is just reminding you
of stuff you already know. It's not a bracket operator
for an iterator. You cannot iterate through
all possible keys for a table. If you found a way to do that, it would
not be something you would want to do, because typically there will be many more possible keys than there are actual keys in the table. So the bracket operator is not intended
to be used for that sort of purpose. But anyway, you've experienced
the fact that it is quite handy. Tables have unimodal semantics; that means duplicate keys are not allowed. So insert operations such as insert and put have a sort of dual personality. If the key is in the table, what insert does is overwrite the stored data with the incoming data. If the key is not in the table, then it inserts both the key and the data as a new pair. This is exactly, so far, like when you
did the ordered associative array, where this is going to differ
is in the ordering part. Now this collects the operations that we have for a table: insert, remove, retrieve, empty, and size, and then we have some auxiliary operations, and there are axioms associated with these operations. For example, if you insert (k1, d1), then retrieve(k1, d) returns true and d, passed by reference, comes back equal to d1: the item is stored in the table. After removing k1, retrieve returns false. After insert, empty returns false, and so forth. Empty returns true exactly when size returns 0. These are axioms for this abstract data type.
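As a quick illustration of those axioms, here is a hedged sketch of how they might be checked in code. The method names (Insert, Retrieve, Remove, Empty, Size) are assumptions for illustration, not necessarily the library's exact interface.

```cpp
#include <cassert>
#include <string>

// Exercise the table axioms on any table-like type with the assumed interface.
template <typename TableType>
void CheckAxioms(TableType& t)
{
  std::string d;
  t.Insert("k1", "d1");
  assert(t.Retrieve("k1", d) && d == "d1");  // insert then retrieve finds the pair
  assert(!t.Empty());                        // after insert, empty returns false
  t.Remove("k1");
  assert(!t.Retrieve("k1", d));              // after remove, retrieve returns false
  assert(t.Empty() == (t.Size() == 0));      // empty exactly when size is 0
}
```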
This is a slide where we kind of look at possible implementations and really kind of throw them out as not being quite what we want. We could, of course, just like when we made sets, have said, okay, we'll keep our items in a list and then we'll look for them, and that'll be our element operator; but what we're doing there is sequential
search and it's horrendously slow. If we have to do sequential
search, we do it but we certainly would
like something better. Using a sorted vector worked out not
too badly: if we maintain order by key, we can use binary search for lookup, and so we at least get logarithmic lookup time in this case, but our insert and remove times are still going to be linear, and that's slow. One thing that's often overlooked
is just a plain vector or array. If your key happens to be an unsigned integer, then you can store your data in the vector element at that index, and that gives direct, constant-time access; that's great if you're willing to have a data entry for each key value throughout a certain range. But I'm going to describe quickly an example where that would be preposterous and you would not do it. So, for example, suppose at FSU we wanted to keep student records, and the
key value for that record would be, I'm going to just say, your social security number; we don't really do that, but everybody's familiar with it. You have this thing called [inaudible] ID in the system, which is a substitute for that, but the same argument will hold. So that means we've got 100,000 student records, let's say, that we need to store, but we're doing it by key equal to social security number. Well, there are 10 to the 9th possible social security numbers but we only have 10 to the 5th records to store, so that means for every record we store, we've got 9,999 vector slots that are meaningless, and so we have wasted all but 0.01% of the required storage for that vector. That's clearly something you don't want to do when you start talking about large chunks of data, like all student records. So another thing we could do is just
make a set of key type, data type. This was your large associative
array that you built. This allows you to have no special
assumptions on key type, which is good. They don't have to be integers. All the operations are
logarithmic, that's good. Runspace overhead is
reasonable, that's good, and you get ordered traversals
included as part of the package, so set-based tables or associative
arrays or maps are good. With all those properties, including the ordered traversal, you really can't do any better; but for pure speed, we would like more. That slide was just going over why those alternatives are not so good, and I've already done that in my discussion. Here's what we want for a table, an unordered table. We want insert, remove, and retrieve to
have constant runtime, constant runtime. Think about that. We want modest space overhead
and no restrictions on key type. Okay, none of the things we've mentioned or studied so far meets those conditions; they all fail at least two of them. So what we really need is some sort of hybrid of a vector, which does have constant access time, and a list, which has constant insert time. The naive approach to doing that doesn't work, of course, or we would have taught it in the first intro course, but nevertheless that's kind of where we're headed, and believe it or not, we can actually pull this off. I'm now going to start talking about the technology and the way it's implemented, and it's through something called a hash table. So first of all, we use a hash function to convert the key type to
an unsigned integer. So this only works if we have a decent hash function on the key type. We're going to make a vector v indexed on the hash value of a key, so that means our vector elements are going to be lists of table entries. So here's our search algorithm. You first compute the
hash value for the key. That's a computation, so that'll be
an O(1) contribution to the search. Then we're going to directly access the bucket at the vector index, which, remember, is the hash value; directly accessing a vector element is a constant-time operation. Now we know we've found the bucket where this key-value pair would be, and that bucket, by the way (we call these lists buckets), is a list of entries. So to find the entry we're looking for, we have to do a sequential search in that list, and that looks like a little downer for our algorithm, so we have to figure out a way to make it not so bad. That means the sequential search is going to have a worst-case search time proportional to the length of the list at that vector position. Now insert, includes, retrieve, remove, get, put, and the bracket operator all use this same search logic in a hash table. It's very important to get a handle on how this search algorithm works.
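Here is a minimal sketch of that search, assuming the buckets are lists of key-data entries. The names used (Entry, HashString, Retrieve, bucketVector) are illustrative assumptions, not the library's exact identifiers.

```cpp
#include <list>
#include <string>
#include <vector>

struct Entry { std::string key; int data; };

// stand-in hash function on the key type
size_t HashString(const std::string& s)
{
  size_t hv = 0;
  for (char c : s) hv = 31 * hv + static_cast<unsigned char>(c);
  return hv;
}

bool Retrieve(const std::vector<std::list<Entry>>& bucketVector,
              const std::string& key, int& data)
{
  // O(1): compute the hash value and reduce it to a legitimate vector index
  size_t index = HashString(key) % bucketVector.size();
  // sequential search of the single bucket at that index
  for (const Entry& e : bucketVector[index])
  {
    if (e.key == key) { data = e.data; return true; }
  }
  return false;
}
```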
Now let's look at a sort of perfect-world analysis. Let's assume that our hash function
uniformly distributes keys in the range of 0 through the size of the vector. So from a statistics perspective,
that means that looks just like a uniform distribution, uniform
probability distribution in that range but it's not, of course, probabilistic because we can calculate
the hash values. The second assumption is that the vector size, that is, the number of lists or buckets underlying the table, is about the same as the number of items in the table, which is the size of the table. Okay, with those two assumptions, then
let's look at the search time. Well, it's the cost of calculating the hash function plus the cost of searching the bucket; remember, it's those two things. So that's 1 plus O of the size of the bucket at that particular hash value. But if the keys are uniformly distributed, then we can expect the size of a bucket to be, on average, the size of the table divided by the number of buckets. The size of the table is, by assumption 2, the same as the size of the vector, in other words the same as the number of buckets, so what we have is O(v.size) divided by O(v.size), which is 1, and so we have 1 plus O(1), which is O(1): constant runtime for the search part of this algorithm. Now, more generally, and
this is done in all textbooks, the load factor is defined
to be the size of the table divided by
the number of buckets. In our example, that load factor was 1, but if you substitute lambda for that ratio, what you get is that the expected search time in a hash table is 1 plus lambda.
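In symbols, with $n$ the number of items in the table and $b$ the number of buckets:
$$\lambda = \frac{n}{b}, \qquad E[T_{\text{search}}] = O(1) + O(\lambda) = O(1 + \lambda),$$
so when $b \approx n$, the load factor $\lambda \approx 1$ and the expected search time is constant.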
Now this is remarkable, and let me go back over that algorithm. You first compute the hash value for
the key and then you mod by the size of the vector to make sure that the
remainder is exactly a vector index, then you access the bucket
at that index, and then you search the bucket
sequentially for the actual key that you're looking for, not the hash value of the key; you know that already, it's the bucket index. You want to search for the key itself, and remember that, hash functions being what they are, more than one key is likely to have the same hash value, so that list will typically have more than one item. So that's the perfect-world analysis. Now for our class design: we're going to
have template parameters, key type, data type, and a hash class that
can go in as template parameters. The hash class is just a class wrapper around the hash function; in other words, we're overloading the paren (function call) operator. We'll implement the table
protocol plus clear and dump. We'll have iterator support. We may or may not choose to copy tables. It turns out I've had a little
bit of a change of heart on that. It used to be that the standard advice was: don't let people copy tables, they're going to be huge, they can use references to the same table. I had to rethink that when looking at sparse matrices, because with sparse matrices you really sometimes have to return a value from a function, and that value happens to be a sparse matrix. If a sparse matrix is represented by a hash table, then somewhere under the covers, under the hood, you're going to have to make a copy of a hash table, and so we're going to facilitate copies. So I should cross that restriction out on the slide; that's how it used to be done, but we've decided not to do it that way anymore. So in order to get better
pseudo-random distribution of keys, we're going to choose our vector
size to be a prime number. That means we need a prime number calculator, which is supplied in the library, so our hash table constructor needs to make sure that the vector size is a prime as close as possible to the parameter value given by the client program. At the moment we're going to talk about having no default constructor; we're going to have to throw one in by allowing default values for that one-argument constructor, just in order to make our sparse matrix [inaudible] okay. But anyway, the constructor needs to instantiate a hash object, and we'll have two versions: one that uses a default hash object and another that uses a hash object passed in as an argument. Notice, weirdly, that the hash class and
hash object is kind of taking the place of the predicate class and
predicate object for ordered tables. These are unordered tables and the
hash function is the facilitator, whereas the predicate object was
a facilitator for ordered tables. Now there's a primes.h in your library
and the implementation file primes.cpp, those are not template functions. In fact, these are just functions but
primes has a function called prime below and another one called prime above. Prime above I don't really recommend using because, for theoretical reasons, we don't have a guarantee of how far above a number the next prime is. Of course, for the kinds of numbers you're going to be making tables with, we do know it's close, so that's not really a big worry; but prime below guarantees that you'll choose the largest prime that's less than or equal to the expected table size, and that's going to be fine.
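To make the idea concrete, here is an illustrative stand-in for what a prime-below function does. This is not the library's primes.cpp code, just a sketch of the behavior described.

```cpp
#include <cstddef>

// trial-division primality test, fine for table-sized numbers
static bool IsPrime(size_t n)
{
  if (n < 2) return false;
  for (size_t d = 2; d * d <= n; ++d)
    if (n % d == 0) return false;
  return true;
}

// largest prime less than or equal to n
size_t PrimeBelow(size_t n)
{
  if (n < 2) return 2;       // no prime <= n; fall back to the smallest prime
  while (!IsPrime(n)) --n;   // n >= 2 and 2 is prime, so this terminates
  return n;
}
```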
Hashfunctions.h and .cpp are the prototypes and implementations of the hash functions. The ones we're going to have are KISS, the Marsaglia Mixer, and the simple one, so there are three different hash functions in there. Hash classes just wrap KISS, Mixer, and simple in function classes for the various types that might end up being keys, in particular string types.
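Here is a hedged sketch of what such a hash function class looks like: a class that overloads the function call operator so it can be supplied as a template argument and then called like a function. The class name is illustrative, not the library's.

```cpp
#include <cstddef>
#include <string>

class SimpleStringHash
{
public:
  // the letter-offset hash from earlier in the lecture, wrapped as a functor;
  // assumes lower-case letters
  size_t operator()(const std::string& s) const
  {
    size_t hv = 0;
    for (char c : s) hv += static_cast<size_t>(c - 'a');
    return hv;
  }
};

// usage: a table templated on <KeyType, DataType, HashType> can instantiate a
// HashType object and call it like a function:
//   SimpleStringHash h;
//   size_t hv = h("abcd");   // hv == 6
```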
Now we also have a functionality test for the hash table, a typical functionality test, and we have ranfile.cpp, which creates random files of string data. You used that before in homework 4 when you were doing your associative array implementation. So here we go on how to
build up a hash table. We have a notion of entry here. This is very, very much like the notion of pair; it's just adapted a little, technically, specifically for use in tables, and I'm introducing it here. One principal difference is in the terminology: instead of first and second for the two things in the pair, we're calling the first one key_ and the second one data_. That's just a terminology difference. The functional difference is that we make the key const, so what this means is that once an entry object is created, its key cannot be changed, and this pretty much ensures that whatever else happens in a table, nobody's going to be able to mess with the key, change it, and goof up the table structure. So we have a constructor
that takes a key and we have a constructor
that makes a key and data. We have no default constructor, by
the way, because key is constant. It wouldn't make sense to
try to make a pair like that. Since you don't know what the
key is, what use would it be? Anyway, we've got a copy constructor, assignment operator, equality, inequality, and the usual order operators, defined just as for pair: the key is used to determine less than or greater than and, by the way, is also used to determine equality, so two entries are considered equal if they have equal keys. The data again plays no role in
equality or order, just like the pair. So there's our equality, defined by equal keys, and there's our less-than, defined again by just looking at keys. I think I have in the narrative
a little something on entries. Let me see. Yes, I think you'll find this amusing. The assignment operator: well, if you've got a const data member, you can't assign to it, right? Even if you know the values are equal, if you try to assign to it, the compiler is not going to allow it. So what the assignment operator does is first check to see if the keys are equal, and if they're not, it throws an error at you. If the keys are equal, then the data is copied. It's a little bit different take on the assignment operator.
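Here is a minimal sketch of that Entry idea: a key-data pair with a const key and a key-checking assignment. The member names follow the key_/data_ convention mentioned above, but the details are illustrative assumptions, not the library's exact class.

```cpp
#include <stdexcept>

template <typename K, typename D>
class Entry
{
public:
  const K key_;   // const: once constructed, the key cannot change
  D       data_;

  explicit Entry(const K& k) : key_(k), data_() {}
  Entry(const K& k, const D& d) : key_(k), data_(d) {}

  // assignment cannot overwrite a const key, so it only copies data,
  // and only when the keys already agree
  Entry& operator=(const Entry& rhs)
  {
    if (!(key_ == rhs.key_))
      throw std::runtime_error("Entry assignment: unequal keys");
    data_ = rhs.data_;
    return *this;
  }
};

// equality and order are determined by the key alone
template <typename K, typename D>
bool operator==(const Entry<K,D>& a, const Entry<K,D>& b) { return a.key_ == b.key_; }
template <typename K, typename D>
bool operator<(const Entry<K,D>& a, const Entry<K,D>& b) { return a.key_ < b.key_; }
```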
So then how do you make a table? Well, this is what the interface looks like, with the usual terminology support. We've got key type, data type, hash type, bucket type; this is an ultra-fancy version because there's even the possibility of using something besides a list for the buckets. Then we have a value type and an iterator type, which by tradition is for some reason just called iterator, not iterator type. There is a potential
area of confusion here. Note that this value type is the same as the value type for the bucket, but what goes in the bucket is entries, key-data pairs. So the bucket will be a list, or whatever else capital C might be, of entry objects, and so the value type is an entry, not a key or a data item. You can kind of keep a grip on that factoid when you're going through this stuff. So we've got insert, remove,
retrieve, includes, and you'll find in the actual library distribution that we also have the get, put, and bracket operators for associative arrays as part of it. The usual kinds of things, clear and so on. There's rehash. You've had rehash now in
ordered tables or ordered sets. What rehash does here is very similar: it restructures a table to be more efficient, just like rehash for the left-leaning red-black tree does, but of course what it actually accomplishes internally is something entirely different from what it does for trees. We give it the same name. In fact, rehash, the name,
comes from hash tables. We borrowed that name to use in
the context of binary search trees. It's a good name to borrow. We have size, empty,
begin, and iterator support. Note that these are only the const versions, which kind of hints to you that these iterators are going to be const iterator types: you're not going to be able to dereference an iterator and assign through it. You've got a hashtable constructor, copy constructor, and so on. Unsurprisingly, we have a dump function so you can look under the hood of a table and see its structure, and here's something I
have not mentioned before. This is an Analysis method associated with hash tables, and it's an innovation of the FSU library, one that we believe should become standard practice in the implementation of hash tables, because what it does is give the user of the hash table a way to get a statistical picture of how well their hash function is performing on that particular set of data. You can have what theory tells you is a pretty good hash function, but your data can come in so skewed, because of your particular application, that that hash function gives you pretty awful table performance for your data. So it's important to be able to know a little bit about how your hash function is behaving on your particular data. That's what this analysis is for. And notice that, private or not, yeah, I went in there and made that public in this line. Sometimes we might need copies, or
need to make copies [inaudible]. Anyway, our protected data is the number of buckets and a vector called the bucket vector; notice that it's a vector of type-C elements, and those are typically lists of entries. We've got a stored hash object, and here we have an index function. The index function is defined in terms of the hash object and the number of buckets. The way you get an index is you take your hash object, evaluate it on the key, and look at the remainder when you divide by the number of buckets; that's your index. So your index is guaranteed to be a legitimate bucket number all the time, and by choosing the number of buckets to be prime, you get that extra little help making the hash values look random, be pseudo-random, by making the divisor a prime number.
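A sketch of that index calculation, with illustrative names (hashObject, numBuckets) rather than the library's exact members:

```cpp
#include <cstddef>

// hashObject is any hash function class; numBuckets is the (prime) vector size
template <typename KeyType, typename HashType>
size_t Index(const KeyType& key, const HashType& hashObject, size_t numBuckets)
{
  // the remainder mod the bucket count is always a legitimate bucket index
  return hashObject(key) % numBuckets;
}
```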
So that's what our table interface is going to look like; like I said, there will be a few more things [inaudible]. The iterator is going to be a bidirectional iterator that is const, so it has the same terminology support as the hash table does. It has a constructor, destructor, a valid method, assignment, increment, post-increment, dereference, and const dereference. Now, actually, that would probably not go in. Even though it would be safe, because of
the fact that our entries only allow you to copy data after checking that the keys are the same, we're probably not going to have that. Anyway, then there's the protected data. We have a pointer to the table, so an iterator in this instance knows which table it is iterating on, which comes in real handy. This is one of those sort of knowledgeable iterators, like we defined for deque, as opposed to lists, trees, and vectors, where we have non-knowledgeable iterators: they just have a pointer, and they don't have any way of knowing which list they are pointing into, for example. But here, as with a deque, you know which table you're pointing into, and therefore you can talk about what bucket number you're pointing at and where in that bucket you're pointing, by having a list iterator. So a table iterator is going to consist of a pointer to the table you're using, the bucket number you're pointing into, and a list iterator that points into the list at that bucket.
Now, what rehash does: bear in mind the desire is to have a load factor of about 1, which will give you constant runtime for all of your structural operations: search, insert, remove. What rehash does is simply create a new table with a new number of buckets based on the current size of the table that you have. So let's say you made a table with 1,000 buckets, and at some point you realize you've put 15,000 items into that table; well, you may want to call rehash some night, overnight, when you've got some time. Rehash would restructure that table with 15,000 buckets, or as I said, approximately the same number of buckets as the size of the table, which will lower your search time back down to about 1.
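Here is a hedged sketch of that rehash idea: pick a new (prime) bucket count close to the current number of entries and re-bucket everything. Names and parameter choices are illustrative, not the library's actual implementation.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <vector>

struct Entry { std::string key; int data; };

void Rehash(std::vector<std::list<Entry>>& bucketVector, size_t numEntries,
            size_t (*hash)(const std::string&), size_t (*primeBelow)(size_t))
{
  // choose a new bucket count so the load factor n/b comes back down to about 1
  size_t newBucketCount = primeBelow(numEntries > 2 ? numEntries : 2);
  std::vector<std::list<Entry>> newVector(newBucketCount);
  for (const std::list<Entry>& bucket : bucketVector)
    for (const Entry& e : bucket)
      newVector[hash(e.key) % newBucketCount].push_back(e);  // re-bucket each entry
  bucketVector.swap(newVector);  // the old vector is discarded
}
```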
Now, associative arrays are an interesting add-on. You have this associative array bracket operator, of course, and you're familiar with that, but we're going to have it for hash tables. There's no precondition on using this bracket operator. The post-condition is that the key appears in the table, the key you called it on; so just activating the bracket operator on a key ensures that that key is in the table, and what it does is return a reference to the data associated with that key. It will return a reference to the data box in the table, after ensuring that that data box exists.
So, for example, you can create an associative array expecting about 100 items and do aa["abc"] = 25. The first access of the bracket operator on "abc" will insert "abc" with a default data item into the table; it is default right after that access, and then it is reassigned to the value 25. If I just called it with a semicolon after it, that would have the effect of inserting "abc" into the table with a default data box. Of course, the second time you use it, there's no insert anymore; the key matches an existing entry, and this just becomes a basic retrieve.
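Here is a hedged sketch of those bracket-operator semantics on a little bucket-vector table. This is illustrative code with made-up names (TinyTable, HashString), not the FSU library's associative array.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <vector>

struct Entry { std::string key; int data; };

class TinyTable
{
public:
  explicit TinyTable(size_t buckets) : bucketVector_(buckets) {}

  // postcondition: key is in the table; returns a reference to its data box
  int& operator[](const std::string& key)
  {
    size_t index = HashString(key) % bucketVector_.size();
    for (Entry& e : bucketVector_[index])
      if (e.key == key) return e.data;              // already present: just return
    bucketVector_[index].push_back(Entry{key, 0});  // absent: insert with default data
    return bucketVector_[index].back().data;
  }

private:
  static size_t HashString(const std::string& s)
  {
    size_t hv = 0;
    for (char c : s) hv = 31 * hv + static_cast<unsigned char>(c);
    return hv;
  }
  std::vector<std::list<Entry>> bucketVector_;
};

// usage: TinyTable aa(100);  aa["abc"] = 25;   // first use inserts "abc", then assigns 25
```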
So you have a get and a put operator. If you define the bracket operator, you can define get in terms of the bracket operator and put in terms of the bracket operator; or you can supply put and get, define the bracket operator in terms of get, and in fact define put in terms of get as well. So you really have the choice of implementing the bracket operator directly and then using that to define put and get, or implementing get directly and using that to define put and the bracket operator.
So here's our runtime analysis. These are average-case runtimes. Let's just go to the table operations; we could have run all of this for sets instead of tables, storing values instead of key-data pairs, and the same technology works either way. So let's just talk about tables; tables are, I think, the more common use of this technology. So we have insert, includes, retrieve, put, get, lower bound, upper bound, remove, erase, and rehash. These are just about all the
operations we've talked about, and, for example, in an ordered vector, insert takes O(n) time, and in a red-black left-leaning tree, insert takes O(log n) time. In a hash table, it's going to be 1 plus n over b, where n is the number of items in the table and b is the number of buckets in the table. Includes is logarithmic for the first
two and constant for the last one. Retrieve - logarithmic,
logarithmic, constant. Put - linear, logarithmic,
constant, and so on. But when you get down to lower bound: lower bound is logarithmic in an ordered vector and logarithmic in a red-black left-leaning tree, but lower bound doesn't exist for a hash table, because lower bound is a concept that only makes sense when you have ordered data, and the table is not ordered. There is no effective way to traverse the hash table in an ordered, sorted fashion. Erase, however, resembles remove, so we won't bother with erase separately; we'll just take the thing out of the table, and we do that in constant time. And finally, this notion of rehash: we could define rehash
for an ordered vector. If we did, it would have
runtime theta of n, essentially making a copy of that vector. I don't want to go into why you might want to do that, but you could do it. For red-black left-leaning trees, of course, rehash plays a significant role: it restructures the tree to get rid of all the dead nodes. You could have the same notion of dead nodes in a vector, and rehash would just build a new vector with no dead nodes; it would be a smaller vector, just as you get a smaller tree after rehashing because there are no dead nodes in it. And so, you know, if your set starts filling up with dead
think they're going to stay dead, then you probably want to call rehash
and cut down on the memory requirements and even improve the runtime a
little bit for all your operations. And the same two reasons
are improving the efficiency of the runtime really is
the most important reason for calling rehash on a hash table. If the number of elements in the
table begins to far exceed the number of buckets in the table, so that
your lambda, this n over b thing, starts getting big, then you
want to call rehash and maybe get that lambda back down to about 1. Run space requirement is probably
worth taking a look at as well. So for example, the container
[inaudible] of the constructor operations but
let's look at something like insert, putting something in, I
have that plus theta of 1, that means it's a constant amount
of time to run the insert operation and if you think about it, you
do have some local variables in there that's your [inaudible]
of extra memory but you don't have to make a copy of data
or anything like that, which would greatly inflate the memory
requirements for the [inaudible]. So that's constant. It's also constant in
a binary search tree. In our red-black left-leaning tree,
our insert operation is recursive. Because it's recursive, it builds up
a small collection of recursive calls and the runtime stack
and so that is memory that is actually being used
as part of the algorithm. Now it's okay in red-black left-leaning
trees because we are assured that the height of the tree will
be logarithmic and so that means that the number, the depth of the
recursive calls will be logarithmic as a function of the size of the tree and so these are really not
spectacularly damaging uses of extra space but they're there. The tables, there's not going to
be any extra space used for any of the operations except for the space
required for creating a container. It's going to use b buckets and n
entries and so anyway you cut it, it's going to have a footprint on the [inaudible] size b plus n. I
may not have gotten around to talking about rehash runtimes here. Rehash for a table, I'm sorry, for a red-black left-leaning tree
effectively creates a new tree by inserting all the elements of
the old tree into the new tree and since there're n elements in the old
tree and it takes log n or log k time, where k is the size of the
new tree, to insert it, you can see why you get
n log n out of that. Tables, because tables does effectively
the same thing, the difference is that your table insert operation is
constant in time and so it's just going to be the number of items that you
have to traverse and that's the, that right there is the length
of a traversal of a hash table, the length of time to
traverse the hash table. So that wraps up our discussion
of hashing and hash tables. We're going to take a break and then
we'll talk about the assignment based on hash tables, which
I call sparse matrices.