K Mer Counting Assignment
Contents
- 1 Background
- 2 Breakdown
- 2.1 Slice and Slices
- 2.2 String HashMap Implementation
- 2.3 String ConcurrentHashMap Implementation
- 2.4 SequenceSlice ConcurrentHashMap Implementation
- 2.5 Long ConcurrentHashMap Implementation
- 2.6 Int Array Implementation
- 2.7 AtomicIntegerArray Implementation
- 2.8 AtomicReferenceArrayDictionary Implementation
- 2.9 IsolatedBucketDictionary Implementation
- 2.10 Open-ended K-mer Counter
- 3 Research
Background
The term k-mer refers to a substring of length k and k-mer counting refers to the process of finding the number of occurrences of a k-mer in a given string. This computational problem has many real-world applications, the most notable being in the field of computational genomics. In this assignment, you will design a program that will ultimately take in a human X-chromosome and count the number of k-mers in the string of DNA. You will be making use of everything you have learned thus far, including atomic variables.
Breakdown
This assignment is broken into in-class studios, demos, and a take-home assignment. For the take-home portion, you will create a k-mer counting program using a Long ConcurrentHashMap, an int array, and an AtomicIntegerArray. In class, you will create different, but similar, implementations using a String HashMap, a String ConcurrentHashMap, and a SequenceSlice ConcurrentHashMap (using a SequenceSlices class you will build yourself). For extra credit, you can attempt to create your own unique implementation to solve this problem. Links to the documentation for all of these data types can be found below.
ConcurrentHashMap and AtomicIntegerArray.
Slice and Slices
For this portion of the assignment, you will use the SequenceSlice class provided to you in order to build a SequenceSlices class. SequenceSlice, as the name suggests, takes in an array of bytes and an upper/lower bound in order to break up a sequence into a specific slice. SequenceSlices has two methods which you will need to implement: addToCollectionKernel and createUnmodifiableList.
addToCollectionKernel takes in a collection of SequenceSlice (any type of collection including lists), an array of bytes, upper/lower bounds, and a sliceThreshold which will determine if the recursion has reached its base case. This method recursively adds slices to the collection by checking if the bounds are smaller than the threshold. If it reaches the base case, it should add a new SequenceSlice into the collection. If it has not reached the base case yet, the recursion should divide and conquer the bounds until the method reaches the base case.
createUnmodifiableList takes in a list of arrays of bytes, a k-mer length, and a threshold. The method should iterate over the list of arrays and then call addToCollectionKernel to create a list of SequenceSlice, with the amount of possible k-mers as the upper bound. It should then use the Collections.unmodifiableList() method in order to return an immutable version of the newly created list.
Hint: Given a string of length n, the amount of possible k-mers is equal to n - k + 1.
String HashMap Implementation
In this completely sequential implementation, you will have to write the parse method. The method takes in a list of arrays of bytes and a k-mer length. It should return a StringMapKmerCount (which takes in a map), a class provided to you which does exactly what its name suggests. In order to implement this method, we recommend taking a look at the KMerUtils class and more specifically the toString method, which converts a given byte array into a String in order to make it compatible with the String HashMap.
parse should go through the amount of possible k-mers for every byte array in the list of sequences. As it goes through the bytes in the array, use the KMerUtils.toString() method to create a string to use for the HashMap. The map should take in a String as the key and an Integer as the value. We recommend using the map.compute() method and reviewing how to use lambdas.
String ConcurrentHashMap Implementation
This implementation will make your sequential String HashMap implementation into a parallel one. To do so, you will be making use of Java’s atomic version of a HashMap: a ConcurrentHashMap. Like before, you will be need to complete the parse method. You can choose to finish this method with a forall loop or a finish/async approach, which is completely up to you.
SequenceSlice ConcurrentHashMap Implementation
This parallel implementation will overlap greatly with your previous two implementations, but it will use SequenceSlice instead of Strings. Refer to the SequenceSlice class in order to get a better idea of how to implement this method and instantiate an instance of a SequenceSlice.
Long ConcurrentHashMap Implementation
This parallel implementation will also overlap greatly with everything you have done previously, but instead of using Strings or SequenceSlice, we will be using Longs to represent DNA data. In order to do this, refer to the KMerUtils class, more specifically the toPackedLong method which will convert a sequence of bytes into a Long for our purposes. As the data that this implementation can take is much larger than the previous implementations, consider how to instantiate the ConcurrentHashMap in order to avoid dynamically resizing the map. KMerUtils has a method which will calculate this number for you, but think about which one to use, why you would use it, and how that would differ from simply calculating n - k + 1 yourself.
Hint: calculatePossibleKmers and calculateSumOfAllKmers are distinct and take different input parameters. calculateSumOfAllKmers performs the n - k + 1 calculation on a list of arrays of sequences and sums the total. calculatePossibleKmers only takes the length of the k-mer (k) as the input parameter, meaning it will calculate possibilities independent of the size of the sequence. Thus, the n - k + 1 calculation would yield a value ranging from 0 (inclusive) to the result of calculatePossibleKmers (exclusive).
Int Array Implementation
Like the String HashMap implementation, this implementation should be completed sequentially. However, unlike the previous implementations, we will be using an array instead of a map. This means that you must keep in mind what size to make the array in order to instantiate it. KMerUtils has a method which will calculate this number for you, but think about which one to use, why you would use it, and how that would differ from simply calculating n - k + 1 yourself. Think of using the int array as similar to using the map, but with ints as the key/value pairs. Use KMerUtils.toPackedInt() in order to convert the sequence into an int for our purposes.
Hint: calculatePossibleKmers and calculateSumOfAllKmers are distinct and take different input parameters. calculateSumOfAllKmers performs the n - k + 1 calculation on a list of arrays of sequences and sums the total. calculatePossibleKmers only takes the length of the k-mer (k) as the input parameter, meaning it will calculate possibilities independent of the size of the sequence. Thus, the n - k + 1 calculation would yield a value ranging from 0 (inclusive) to the result of calculatePossibleKmers (exclusive).
AtomicIntegerArray Implementation
This implementation is simply the parallel equivalent of the int array implementation. To do so, you will be making use of Java’s atomic version of an array: an AtomicIntegerArray. Like before, you will be need to complete the parse method. You can choose to finish this method with a forall loop or a finish/async approach, which is completely up to you.
AtomicReferenceArrayDictionary Implementation
IsolatedBucketDictionary Implementation
Open-ended K-mer Counter
This implementation is extra credit and (as the name suggests) completely open-ended. Approach it however you wish!
Research
paper: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers (Jellyfish)
paper: Multiple comparative metagenomics using multiset k-mer counting