Article 1417: : Hashing

Here's a problem that arises frequently in computing: You're going to be handed a lot of items of data, and then shown a new item and asked whether you've already been handed one identical to it, and if so, where you put it. This is a problem that arises in real life, and we can use real-life reasoning to analyze it. Suppose, for example, that you are a doctor with two or three hundred patients. You've got a file of information on each patient; when a patient comes in for an appointment you'd like to lay your hands on that patient's file. How do you store the files?

You could just have a room where all the files are thrown helter-skelter on the floor; but this would not be very efficient. It guarantees that all the files are in one place, but you might have to examine every single file in order to find the one you want. A better answer is, of course, a filing cabinet. For purposes of discussion, let's imagine a filing cabinet with 26 drawers. If each drawer is labelled with a letter of the alphabet, in order, then we could use the first letter of a patient's last name to tell us where to put that patient's file -- and where to find it again later. Instead of a big room containing all the files we've now got 26 little rooms, each containing just a few files.

A filing cabinet is a real-life example of hashing. Hashing is the algorithmic equivalent of the old saying, "A place for everything, and everything in its place." Hashing depends on two things. First, we must have a set of little rooms. Instead of the drawers of a filing cabinet you might like to think of these rooms as pigeonholes, as in one of those old-fashioned desks you may have seen in an antique shop. In the computer world these pigeonholes are likely to be the elements of an array, because each element is labelled, and can be accessed directly, through its index number. Second, we must have a rule for placing each piece of data in its pigeonhole, and for finding it there again later. This rule must depend on something about the data itself, and is called the hash function. By applying the hash function to a piece of data, we obtain a hash key which specifies the correct pigeonhole. If the pigeonholes are elements of an array, the hash key will need to be an integer within that array's bounds.

In the computer world, just as in the real world, the big questions are how many pigeonholes we'll need and what the hash function will be. For a few hundred patients, our 26-drawer filing cabinet and our hash function using the first letter of each patient's last name are likely to prove adequate. But, for example, when James Murray took over the work of editing the Oxford English Dictionary, he needed a room with over a thousand pigeonholes. In general, more is better (up to the point where storage becomes problematic, as it did for Murray), provided the hash function distributes typical data fairly evenly over the pigeonholes.

A related problem is what to do with multiple items that end up in the same pigeonhole (technically known as "collisions"). If these are many, we might consider maintaining them in some fixed arrangement, as when we order within a drawer the files of two patients whose last names begin with the same letter, by looking at the second letter. Surprisingly, however, if our hashing function is good this is usually not necessary; in fact, it's likely to be more trouble than it's worth. After all, in real life if we had just five or ten files per drawer in our filing cabinet we might not bother to alphabetize them, since we can "just see" the file we want; and a computer is so fast and accurate that it can scan through an ordered clump of, say, several dozen items and "just see" the right one. Thus, a perfectly good way of dealing with collisions might be to store them as a linked list, in whatever order they arrive (linked lists were the subject of this column in the August/September 2002 issue of REALbasic Developer Magazine). To learn whether an item is present already in a pigeonhole might require traversing its entire list, but the penalty will be negligible if the list is reasonably small.

Figure 1 illustrates such an implementation. The item "time" is just in the process of arriving. The hash function, represented by the inverted triangle, is assigning it to pigeonhole T; from there we traverse the list, discover that "time" is not already present, and append it to the end of the list.

Introducing the “Welcome to Xojo” Bundle!

Hashing

Pigeonhole Your Data