Studio 9: Attack of the Hashcode Crackers

One of the assumptions behind the performance analysis of hashing is that, for the keys actually hashed, the distribution of keys into buckets looks uniform. But of course, a hash table cannot control which keys it is given to store. That means that an evil person could pick a bunch of keys that all happen to hash to the same bucket, with the intent of making the hash table run slowly.

Today, you are that evil person.

evil person

In this studio, you will gradually build up your understanding of how such an attack (called a `hashDOS' attack) could be done for various hash table and hash code implementations. Like much work in computer security, the analysis and techniques you will see today may seem tedious and pedantic. However, if the payoff is large enough, attackers will definitely spend the necessary time and effort to exploit the weaknesses you'll see, and we must do so as well in order to avoid these weaknesses and defend against possible attacks.

Part A: ASCII, and Ye Shall Receive

For many tasks in today's studio, we'll be hashing ASCII text strings that consist only of the printable ASCII characters. These characters are encoded by numerical values between 32 and 126, inclusive. As you work through the studio, you'll probably want to look at this table of the ASCII encoding to see how characters map to numerical values.

To warm up, let's suppose that the hashCode() function for strings encodes them into an integer using 7 bits per character. 7 bits lets us store any value between 0 and 127, so has plenty of room to store one printable ASCII character. Here is the source code for our hashcode computation on a string s:

int hashCode(String s)
{
   int h = 0;
   for (int i = 0; i < s.length(); i++)
     h = h * 128 + s.charAt(i);
   return h;
}

Question A1: which character of s determines the low-order 7 bits of the hashcode? (Recall that the ''low-order'' bits are those in the least-significant place value positions. As an example in decimal, '567' are the low-order 3 digits of the number 1234567.)

Question A2: how many characters can we hash into a 32-bit integer with this function before we start to lose information about previously hashed characters? (Hint: what would cause us to lose information about previous characters? What effect does multiplying by 128 have on binary numbers?)

Question A3: write a mathematical expression that gives the hashcode $h(s)$ computed for a four-character string $s[0..3]$ by this function.

Part B: Breaking Division Hashing with Bad Primes

Now suppose that your victim's hash table uses division hashing to map hashcodes to indices. Having heard that prime table sizes are a good idea for division hashing, the victim chose the Mersenne prime $m = 127 = 2^7 - 1$ for the table size.

Note: the following helpful equality will help you answer the following questions:

\[ ax \bmod b = ((a \bmod b) \cdot (x \bmod b)) \bmod b \]

Proof:

We observe that any $a$ can be written as a sum $cb + d$, where $d = a \bmod b$.
Hence, we have
$ax \bmod b = (cb + d) \cdot x \bmod b $
$ = cbx + dx \bmod b$
$ = (cx) \cdot b \bmod b + dx \bmod b$
$ = 0 + dx \bmod b$
$ = (a \bmod b) \cdot x \bmod b.$
QED.

Question B1: Use the equality above to compute 621 mod 8 by hand. Note that 621 = 27 x 23.

Question B2: write an expression for $h(s) \bmod m$. Simplify as much as you can using the equality above.

Question B3: using your result from B2, find five printable ASCII strings of length four that hash to the same hashcode.

Question B4: how many ways are there to choose four printable ASCII characters that sum to 222?

Hint 1: the number of ways to divide a positive integer $n$ into an ordered list of exactly $k$ positive integers that sum to $n$ is given by the binomial coefficient \[ \binom{n-1}{k-1}. \] See this page for a proof.

Hint 2: A printable ASCII character has code at least 32 and at most 126. If we choose characters wxyz, each of which is printable, then we can write the sum of their codes as (31 + a) + (31 + b) + (31 + c) + (31 + d), where each of the values a, b, c, d is between 1 and 126-31 = 95. To satisfy the condition in Question B4, what would these four values need to sum to?

Question B5: if you can insert and search for strings in the victim's hash table, how could you use the above results to make the victim's computer perform a large amount of work?

Breaking a hash table's performance in this way is called a hash denial of service (hashDoS) attack. It consumes all the target machine's time with useless hash bucket traversals, denying that time to anyone who wants to do useful work.

Your victim, noticing your attack on the hash table, quickly changes the table size $m$ to the next largest prime, $m = 131 = 2^7 + 4$.

Question B6: once again, write an expression for $h(s) \bmod m$ for inputs of length 4. Simplify as much as you can to reveal the hidden structure in the index computation.

Question B7: using your result from B6, find five printable ASCII strings of length four that hash to bucket 0 with this new $m$.

Hint: first, find some strings of length two that hash to bucket 0. How does this help?

Question B8: can you see a way to generate hundreds of thousands of strings that all hash to bucket 0 when $m=131$? Your strings may be longer than 4 characters but should still be short (20 characters or less).

Part C: Breaking Java String Hashcodes

As we discussed in class, hashing general objects requires two steps. First, the object is hashed to an integer hashcode; and second, the hashcode is mapped to an index in the table. In the previous section, you attacked a questionable choice of index computation to find many objects (strings) whose hashcodes all hashed to the same index. In this section, you'll focus on attacking the hashcode generation step in order to break even hash tables with nicely uniform index computations.

Here is the actual hashcode computation used by Java (current as of JDK 10) to hash Strings to integers:

int hashCode(String s)
{
   int h = 0;
   for (int i = 0; i < s.length(); i++)
     h = h * 31 + s.charAt(i);
   return h;
}

Question C1: why might it be a good idea to use a multiplier other than a power of two, such as the value 31 in the function above, to generate hashcodes for strings?

Suppose you know that your victim is hashing Java Strings to integers using the above hashcode computation, then using the hashcodes to generate indices into a hash table. You know nothing about how their hash table maps hashcodes to indices, but you still want to perform a hashDoS attack on their table.

Question C2: if you want to generate a bunch of strings to DoS the table, what property should these strings have?

Question C3: find at least four printable ASCII strings of length two that have the same hashcode under the above hash function.

Hint: the hashcode for a two-character string "$xy$" is $y + 31x$. As you did in Part B, pick a hashcode value $z$ such that you can form $z$ in different ways by varying $x$ and $y$.

Question C4: How many different strings of length $2n$ can you find with the same hashcode, using your result in C3? How long must your strings be to generate a million strings with the same hashcode?

The standard practice in Java for hashing an ordered $k$-tuple of objects $[o_1, o_2, \ldots o_k]$, encapsulated in the Array.hashcode() operator, is to compose the objects' hashcodes using the above hash function. More specifically, we replace the call to charAt(i) with a call to $o_i$.hashCode(). (Strings are simply $k$-tuples of characters, where $k$ is the string length.)

Question C5: how in general might you go about trying to DoS a hash table for object $k$-tuples if you knew that this was the hash function used? What could make your task of finding collisions with this function harder than finding colliding strings as we did above?

Part D: Defense Against Dark Arts

We've now seen some examples where hashcode and index computations known to an attacker can fairly easily be broken (in the sense of generating lots of collisions). We assume that a potential attacker has access to the source code of the hash computation, just as you were given the functions in the previous sections. With this assumption in mind, it's worth thinking about whether we, as hash table designers, should panic.

First, while every application of hashing benefits from approximately uniform mapping of input keys to table indices, not every application is a likely target for an attack. A key question is, to what extent are the keys being hashed under control of a potential attacker?

Question D1: give some examples of applications of the dictionary ADT where it is reasonable to assume that an attacker could not easily manipulate the keys inserted into the dictionary so as to cause a hashDoS.

Question D2: give some examples where it is reasonable to assume that an attacker could manipulate the keys inserted into the dictionary.

If you are using hashing in an application where a potential attacker can influence the choice of keys, then you need to worry about how collision-resistant your hashcode and index computations are.

Question D3: suppose that, as in Java, your hash table maps 32-bit hashcodes to indices in a range $[0,m)$ for $m < 2^{32}$. If an attacker can perform the index mapping computation on 10000 hashcodes per second, about how long would it take to find the largest possible set of colliding hashcodes for the table?

Question D4: given your answer to D3, and the speed at which even complex hash functions can be computed on modern hardware, is it reasonable to rely entirely on the uniformity of your table's index mapping to prevent hashDoS attacks?

One way to try to prevent hashDoS attacks is to make the generation of hashcodes resistant to collisions. Assuming your universe of keys is much larger than the number of hashcodes (this is certainly true for even fairly short alphabetic strings), you might hope that it would require an unreasonable number of guesses to find enough colliding strings to perform an effective hashDoS.

Suppose, for example, you have a hashcode generator $h(s)$ for strings that produces 32-bit integers, and you don't know of any way to find collisions in $h(s)$ short of brute-force enumeration and hashing of strings.

Question D5: how many strings might an attacker have to hash in the worst case to find at least one million strings that all hash to the same hashcode under $h$?

Question D6: if an attacker can hash and check one million strings every second, about how many years would it take to complete the task in D5?

The above calculations assume that the fastest way to find many colliding strings for your hash function is brute force. But we've already seen that this is not true for the string hash function used by Java. There are many hash functions available for strings and other composite objects that are more complex, but a sufficiently clever attacker can find ways to break these more complex functions as well; see, for example, this work about breaking the popular MurmurHash family of hash functions.

There are hash functions such as SHA-256, designed by cryptographers, for which it is extremely hard to find collisions faster than brute force. However, these functions are typically much slower than would be ideal for use in a high-performance hash table, and they achieve their collision resistance in part by producing very long hashcodes (128, 256, or even 512 bits).

An alternative strategy for making collisions difficult to engineer is to add an element of uncertainty to hashcode or index generation.

Question D7: suppose you could extend your string hash function $h(s)$ to become $h(s, q)$, where $s$ is the string and $q$ is a secret -- a second integer input that changes the resulting hashcode. If you can conceal $q$ from potential attackers, how could you use $h(s,q)$ to minimize the chance that an attacker could send your hash table a collection of strings that cause a hashDoS?

Most languages used to implement network-facing services, such as Perl, Python, and PHP, have implemented some form of additional secret input as part of their hash tables and have taken other measures to defend against hashDoS attacks. For index generation, these strategies are known as universal hashing.

For hashcode generation, adding a secret is effective only if there is no way for an attacker to guess that secret, either by seeing the hashcodes produced from the function or by inferring their distribution, e.g. by timing the cost of accesses to a hash table built from these hashcodes. There do exist reasonably efficient secret-using hash functions that are designed to resist collision-finding attacks even if the attacker can see both the inputs and outputs of the function, provided that the secret $q$ remains hidden. An example is the SipHash family.