Abstract

In this lab, you'll implement a simple hash table with collision resolution via chaining. You will use Java's built-in LinkedList class to construct the chains for each hash bucket and will apply a multiplicative hashing strategy to map items to buckets. To exercise your hash table, we'll use it to implement rapid similarity search in a database of DNA sequences.

This lab consists of three components:

a pre-lab, which you will prepare and submit through Gradescope;
a coding component, which you will complete and submit via your Bitbucket repository;
a post-lab writeup, which you will prepare and submit through Gradescope.

The pre-lab and post-lab are two separate assignments, due on different days. They should be prepared following the guidelines in the eHomework Guide.

Pre-Lab Component (20%)

The pre-lab is intended to be completed before the coding component; it will help ensure that you understand the code and can work out an implementation strategy. For this reason, the pre-lab is due earlier than the other two components of the lab.

To succeed at the pre-lab, you need to read the provided code described below and be able to answer basic questions about its methods. These questions are given in the pre-lab document.

What to Turn In (READ CAREFULLY)

Prepare a brief PDF document that answers each of the questions asked in the pre-lab, following the eHomework guidelines. Submit your document to the "Lab 7 Pre-Lab" assignment via Gradescope by its due date.

Coding Component (50%)

We've provided you with the skeleton of a hash table. In your repository, go to the hash package in the labs source folder. Do a Pull to make sure that you are looking at the latest version of your repository.

Code Outline

There are two main source code files you need to consider, both in the hash package:

Record.java contains a class Record that contains two fields: a key string and some auxiliary data (a list of sequence positions). For hashing, you'll care only about the key string.
StringTable.java implements a hash table that maps key strings to records containing these strings.

You will need to fill in some of the methods in the StringTable class to get your hash table working. In particular, you'll have to implement

StringTable() to construct a new, empty hash table;
insert() to insert a new record into the table, unless another record with the same key string already exists;
find() to locate the record (if any) in the table with a specified key string;
remove() to find and remove the record (if any) in the table with a specified key string;
toIndex() to map the hashcode for a key string to an index in the table.

We've provided a function stringToHashCode() that converts key strings to integer hashcodes. Read the comments in Record.java and StringTable.java to better understand what you need to implement.

In addition to the files described above, there is additional code to implement the DNA search application on top of your hash table. While you don't need to understand this code to complete the lab, we'll describe what it does at the end of this document.

Testing Your Implementation

We have provided you with a JUnit test TestStringTable to help you check that your hash table is behaving as intended, and to verify that the DNA similarity search application works. As always, passing the unit test does not guarantee that your table is correct -- you are ultimately responsible for correctness -- but failing it does indicate that there is a problem you need to fix.

What to Turn In (READ CAREFULLY)

To complete this portion of the lab, you must do the following:

Make sure your code passes the JUnit tests as we originally gave them to you. Our lab autograder script will use the output of these unit tests as part of the assessment of your code's correctness when grading.
Make sure you eliminate or disable any print statements or other extraneous code, other than what is needed to implement the table, that you may have used for debugging. Extraneous code may slow down your implementation to the point that our autograder will fail it for taking too long to run.
Make sure you do a Team/Pull on your repository before attempting to push your code. Remember, pull before push!
You must both commit and push all of your code to your repository. It's best to do this from the top-most level of your repository, which bears your name and student ID.

Please check that your code was pushed by logging into bitbucket.org, clicking on your repository, and viewing the source code to verify that it's the most up-to-date version.

Code that is not pushed to Bitbucket will not be graded, and you will not receive credit for it, even if your local copy is correct. Unless you have an unpushed commit with a timestamp prior to the due date, we will not accept it later, as we have no way to verify that the code was completed prior to the deadline.

Post-Lab Writeup Component (30%)

The post-lab writeup asks you to reflect on your implementation and to think about possible extensions.

What to Turn In (READ CAREFULLY)

Prepare a PDF document that answers each of the questions asked in the post-lab, following the eHomework guidelines. Submit your writeup via Gradescope by its due date.

About the DNA Similarity Search Application

A DNA sequence is a string of characters, called bases, from the alphabet ${a, c, g, t}$. Genomic DNA encodes a large collection of features, including:

genes -- the instructions for building proteins;
regulatory sites -- sequence markers recognized by cellular machinery that can increase or decrease the rate at which a given gene is used to make protein;
repeats -- junk left behind by transposable elements, pieces of DNA that can autonomously copy themselves and move around in the genome. Transposable elements proliferate, then die, leaving behind many inactive copies of themselves in the genome as repeats.

All DNA is subject to mutations that alter its sequence over time. However, functional sequence features like genes and regulatory sites are more resistant to mutation than DNA that doesn't code for anything (because natural selection usually kills off organisms with too many mutations in these features). We can therefore find these features by comparing DNA sequences from two different organisms and seeing which parts of the sequences have remained similar to each other since their lineages split from their last common ancestor.

Abstractly, we are given two long strings $s_1$ and $s_2$ of characters, and we want to find short substrings (parts of the features) common to $s_1$ and $s_2$. In general, the common substrings might not be exactly the same because even functional sequences mutate over time. However, we often find that $s_1$ and $s_2$ still exhibit perfectly matching substrings of at least 10 to 15 characters in the neighborhood of their shared features. By detecting these exact substring matches, we can find the most likely locations of the shared features in $s_1$ and $s_2$ and can then apply more sensitive but more expensive similarity search algorithms only at those locations.

Naively, we could find $k$-mers (i.e. substrings of length $k$) common to $s_1$ and $s_2$ by comparing every $k$-mer from one sequence to every $k$-mer from the other. Such an approach would take time $\Theta(|s_1| \cdot |s_2|)$, which is unacceptable because interesting DNA sequences range from thousands to billions of characters in length (your own genome is about 3.2 billion characters long). Fortunately, there is a much better approach.

Call $s_1$ the corpus string and $s_2$ the pattern string, and assume we are searching for common substrings of length $k$. We first construct a table $T$ of every $k$-mer in the pattern string, remembering where in the pattern it occurs. Then, for each $k$-mer in the corpus string, we check whether it occurs in the table $T$; if so, we have found a match between pattern and corpus. If $T$ supports constant-time insertions and lookups (e.g., if it is a hash table), we can process the entire pattern and corpus in time \[ \Theta(|s_1| + |s_2| + M), \] where $M$ is the number of common substrings actually found. In general, this time cost is much lower than the cost of the naive algorithm. Fast substring matching based on hashing therefore forms the core of many of today's high-speed biosequence matching algorithms.

To make the results of our search more useful, we can also specify a mask string that contains "uninteresting" DNA that should not be reported. For example, if we're looking for matching genes, we might not be interested in any common substrings that are part of known repeats. Any $k$-mer appearing in the mask string is removed from the table before searching for matches in the corpus.

The Similarity Search Code and Data

If you want to read the similarity search driver code, you can look at Main.java. The driver reads corpus, pattern, and (optionally) mask strings from files and accepts the match length $k$ as an argument. It builts a StringTable from the pattern, removes the $k$-mes in the mask, and then reports every $k$-mer in the corpus that matches the pattern minus the mask.

We specify a table size proportional to, but much less than, the full length of the pattern sequence. This ensures that our load factor is a constant. The table size is set to a power of 2, though your implementation might not exploit this property.

The data used in most of the test cases comes from real pairs of DNA sequences with interesting similarity between them. If you want to read about where these sequences come from, check out the README.txt file in the hash/test-cases subdirectory of your repo.