Studio 7: Hash Function Design

Today, we're going to look at some issues around the design of hash functions. The goal of a hash function is to map objects of some type to integers. It's important to understand both what must be true of hash functions to achieve correct behavior, and what should be true to achieve efficient behavior.

Before you begin, update your repository and open the studio7 package in your studios source folder.

Part A: Object Equality and Hashcodes

All Java objects have a method hashCode() that computes a 32-bit signed integer value based on the object. To study the default behavior of this method, let's look at the HashPoints class, which performs hashing on objects of type Point.

  • Open the HashPoints class and run it. Look at the code so you can interpret the printed output.

Question A1: is there anything worrisome about the output from HashPoints? If so, what?

By default, an object's hashCode() method returns a value computed from the object's address in memory, without actually looking at the object's contents. This means that the default hash codes do not reflect the equality relation that you would likely expect for objects such as Point.

Question A2: what is a reasonable definition of equality for Point objects?

To make a class's hashcodes match its equality relation, we need to override the default hashCode() method with one that does the right thing.
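The usual recipe is to compute the hashcode from exactly the fields that equals() compares, so that equal objects necessarily produce equal hashcodes. Here is a minimal sketch for a hypothetical 2-D point class -- the studio's actual Point class and its field names may differ, so adapt as needed:

```java
// Sketch of a hashCode() override for a hypothetical 2-D point class.
// Assumes int fields x and y; the studio's Point class may differ.
final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        // Combine exactly the fields equals() compares, so that
        // equal points are guaranteed to hash equally.
        return 31 * x + y;
    }
}
```

Any function of the fields used by equals() satisfies the hashCode() contract; multiplying by a small odd constant such as 31 before adding the next field is the conventional Java idiom.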

  • Go to the Point class and add your own implementation of public int hashCode() that ensures that two equivalent Point objects have the same hashcode.

Question A3: how did you map Points to their hashcodes?

Question A4: does every distinct Point object map to a distinct hashcode given your method? If not, explain why not and give an example of two Points that map to the same hashcode.

Question A5: what is the simplest possible hash function that would guarantee that two equivalent Point objects have the same hashcode? Would you actually want to use this function? Why or why not?

Part B: Visualizing Uniformity of Hash Values

We map objects to hash values so that we can use these values to index buckets in a hash table -- a dynamic dictionary structure that lets us quickly check if an object is present. As we discussed in class, a hash table is efficient only if our hash function distributes objects uniformly across buckets. Let's tackle a specific application that illustrates some of the practical challenges to achieving uniformity.

Java's AWT graphics library includes a class Color that represents the possible colors that a pixel of an image can have. Java colors are specified using the four coordinates "RGBA" for red, green, blue, and alpha. As you may know, the colors we see on a monitor are created by mixing the three primaries red, green, and blue. The fourth coordinate measures transparency -- 0 is a fully transparent pixel, while the maximum value is fully opaque. Each of these four coordinates is an integer between 0 and 255.

Question B1: how many bits are needed to represent a Color object?

The Color class includes a custom hash function. For a pixel with coordinates $(r, g, b, a)$, the corresponding hashcode (at least in version 8 of the JDK) is

\[ 2^{24} \times a + 2^{16} \times r + 2^8 \times g + b . \]
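Because each coordinate occupies 8 bits, multiplying by these powers of 2 is the same as shifting each byte into its own position. A sketch of the formula in Java (not the JDK's actual source):

```java
// The Color hashcode formula written with shifts: multiplying by
// 2^24, 2^16, and 2^8 just moves each 8-bit coordinate into its
// own byte of the 32-bit result.
class ColorHashDemo {
    static int colorHash(int r, int g, int b, int a) {
        return (a << 24) | (r << 16) | (g << 8) | b;
    }
}
```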

Question B2: Given that hashcodes are 32-bit integers, is every hashcode realizable by some Color object? Do any two distinct colors map to the same hashcode?

  • Open the file HashColors.java and run it.

This class generates hashcodes for a bunch of colors and writes them out to CSV files in the outputs folder. First, we'll look at the behavior of the hash function on color values where each of the four coordinates is chosen uniformly at random.

  • Open the file studio8randomcolors.csv from your outputs folder in Excel. (You will have to refresh as usual.)

The hash values may initially look like floating-point numbers in Excel, but they are integers -- just widen the column to see all the digits.

  • Plot a histogram of the hash values in the file. In Excel, select the column of hash values, go to the "Insert" tab, and select a histogram from the available chart types.

If your installation of Excel doesn't have histogram capability, you can install the Analysis ToolPak to get it. If this is not feasible, construct an X-Y scatter plot and visually estimate the density of values in each horizontal strip of the graph.

When checking for uniformity using a histogram, it's important to ensure that each bin contains the same number of possible hashcode values.

  • Right-click the horizontal axis, select 'Format Axis', and set the number of bins equal to 32.

You should get a histogram that looks something like the following (though of course, the random numbers may differ!):

[histogram of hash values for random colors]

Question B3: are there about the same number of hashcodes in each bin? About how big is the difference between the smallest and largest bin?

If we wanted to quantitatively test our hashcodes for uniformity, we'd have to use a statistical method such as a chi-square test. For this studio, we'll settle for eyeballing the histogram to look for gross deviations from uniformity.

If we use our hashcodes to index a table of some size $m$, we'll have to limit them to the range $[0, m-1]$. A quick way to do this is to compute an index which is the hashcode's value modulo $m$; that is, $i = h \bmod m$. Let's check the uniformity of the indices corresponding to our hash values for small $m$.
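If you were computing these indices in Java rather than Excel, one detail needs care: Java's % operator keeps the sign of the dividend, so a negative hashcode yields a negative index. Math.floorMod avoids this. A small sketch:

```java
class ModDemo {
    // Compute a bucket index from a (possibly negative) hashcode.
    // Java's % keeps the sign of the dividend, so -7 % 32 == -7;
    // Math.floorMod always returns a non-negative result for a
    // positive modulus, matching Excel's MOD.
    static int index(int h, int m) {
        return Math.floorMod(h, m);
    }
}
```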

  • Designate one cell of your spreadsheet to hold the modulus value $m$. I'll use cell D2 in what follows. Set this cell to $m=32$ initially.

  • In column B of the spreadsheet, compute for each hash value $h$ the quantity $h \bmod m$.

    The Excel formula you want for row 2 is =MOD(A2, $D$2). Copy and paste the formula to replicate it for every row with a value in column A. Note that unlike Java, Excel's modulo operation always gives a non-negative result.

  • Finally, plot a histogram of the indices in column B. Adjust the histogram's horizontal axis so that it contains 32 bins, so that each bin covers the same number of indices.

  • Inspect the histograms that arise when $m$ is each power of 2 between 32 and 256.

    Why do we commonly use a power of 2 for the hash table size? Computing the index $i = h \bmod m$ is an expensive operation for arbitrary $m$. A modulo operation costs as much as a division operation, which is typically implemented (even in hardware) using iterative approximation. In contrast, when $m = 2^k$, computing $h \bmod m$ is the same as extracting the $k$ least-significant bits of $h$ -- which can be done in a single machine instruction! (Note: if $m$ is fixed to a compile-time constant, we can compute values modulo $m$ faster than if $m$ varies at runtime.)
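The bit-extraction trick described in the note above can be written directly in Java; a sketch, valid only when $m$ is a power of 2:

```java
class MaskDemo {
    // When m == 2^k, m - 1 is a mask of k one-bits, so h & (m - 1)
    // extracts the k least-significant bits of h -- the same value
    // as Math.floorMod(h, m), computed with a single AND, and
    // correct even for negative h.
    static int index(int h, int m) {
        return h & (m - 1); // requires m to be a power of 2
    }
}
```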

Question B4: what is the greatest deviation from uniformity that you observe for this range of $m$ values? How would this deviation impact the performance of a hash table with $m$ bins?

Part C: Real Data Does Not Look Random

Of course, we want to use our hypothetical color hash table on real images, not just on white noise. As a small example, I extracted every unique color from the following image:

[painting used as the source image]

Obviously, this painting does not have a uniform distribution of colors! The HashColors program emits the hashcodes for each unique color in this image in the file studio8paintingcolors.csv in your output directory.

  • Construct the histogram of all hashcodes for this new file, as you did for the random colors above.

Question C1: how does the resulting histogram differ from the one you saw in Part B?

  • Now repeat the second set of tests for small table sizes $m$, again as you did above.

Question C2: how do these histograms differ from the corresponding ones for random colors? Could you have predicted these observations from the overall histogram in C1?

Question C3: what would you expect to happen if you tried to hash these colors into a table of size $m$ for the small $m$ values tested above? What is the implication for hash table performance?

Part D: Improving the Hash Function

The colors from our real-world image are not uniformly distributed. More importantly, their low-order bits are not uniform either. Hence, simply computing the table index for a color as the hashcode modulo the table size $m$ is not going to result in a uniform distribution -- at least not for the typical values of $m$ we tested.

One strategy to improve the distribution of indices is to add an extra "scrambling" step between hashcode and index. Instead of merely taking a few bits from the end of each hashcode, we want to somehow scramble all the hashcode's bits together, so that all of them influence the index value.

As we saw in class, integer multiplication can be used to create a simple yet effective scrambling function. The basic idea is to pick a constant $A$ and map each hashcode $h$ to an index $i$ by computing $i = (h \times A) \bmod m$. How well multiplication scrambles the bits depends on the choice of $A$.

Question D1: suppose we pick $A$ to be a power of 2, say $2^k$ for some $k$. Which bits of $h$ does the result depend on? Is this a good choice for $A$?

Don Knuth, a famous CS professor (and the inventor of TeX!), suggested the value $A = 1737350767$. (For more on Knuth's approach to picking constants in hashing, see his Art of Computer Programming, vol. 3, sec. 6.4.) However, you might also get good results by using other numbers which, when written in binary, do not have a regularly repeating bit pattern. (This $A$ in binary is $1100111100011011101111001101111$.)

  • In the HashColors.java class, find the function myColorHash(). Modify it to multiply the hashcode by Knuth's magic constant.

The result of the multiplication will likely overflow 32 bits, so we'll get only the low-order 32 bits of the result as our new hashcode.
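The modification might look like the following sketch (the actual myColorHash signature in HashColors.java may differ):

```java
class ScrambleDemo {
    // Knuth-style multiplier quoted in the text.
    static final int A = 1737350767;

    // Java int multiplication silently discards overflow, so this
    // keeps exactly the low-order 32 bits of the 64-bit product --
    // no explicit truncation step is needed.
    static int scramble(int h) {
        return h * A;
    }
}
```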

  • Run the HashColors program to generate new output, and construct the histogram of all hashcodes in studio8paintingcolors.csv as you did in Part C.

Question D2: does the distribution of hashcodes look more uniform after this change than before it?

  • Now generate histograms of your new hashcodes modulo $m$, for $m$ a power of 2 between 32 and 256, as you did in Part C.

Question D3: does the distribution of indices look more uniform?

The last result might not be what you expected!

Question D4: if we have two hashcodes $h$ and $h'$ for which $h \equiv h' \pmod m$, what can you say about $h \times A$ and $h' \times A$?

Question D5: is multiplication alone likely to produce values whose low-order bits are more uniform than those of the original hashcodes?

To avoid the issue raised in Questions D4 and D5, we need to use the middle bits of the product instead of the low-order bits. As shown in class, extracting the middle bits of the product is equivalent to doing a fixed-point approximation of floating-point multiplicative hashing. We do it by dividing the product by a power of 2, or equivalently by shifting away its low-order bits.

  • Modify the computation in myColorHash to keep only the middle 32 bits of the product. Here's some suggested code:
  return (int) (((long) c.hashCode() * A) >> 16);

We cast the hashcode to a long (64 bits) to avoid losing any intermediate bits of the product. The resulting hash value will contain bits 16-47 of the 64-bit product, whereas the previous computation gave bits 0-31. If you don't need more than 16 bits for your indices, you need not cast to long and back.

  • Run the HashColors program again and generate histograms of the new hashcodes modulo $m$, for $m$ a power of 2 between 32 and 256, as you did in Part C. Don't forget to adjust the number of histogram bins to 32!

Question D6: is the distribution of indices any more uniform after this change than before it?