K-MerCounting Assignment
credit for this assignment: John Gibson and Dennis Cosgrove
Contents
- 1 Motivation
- 2 Background
- 3 Demo
- 4 Code To Investigate
- 4.1 Runtime
- 4.2 Java Util Concurrent
- 4.3 KMerUtils
- 4.3.1 long calculatePossibleKMers(int kMerLength)
- 4.3.2 int toArrayLength(long possibleKMers)
- 4.3.3 int calculateSumOfAllKMers(List<byte[]> sequences, int kMerLength)
- 4.3.4 String toString(byte[] sequence, int offset, int kMerLength)
- 4.3.5 String toString(byte[] kMer)
- 4.3.6 int toPackedInt(byte[] sequence, int offset, int kMerLength)
- 4.3.7 int toPackedInt(byte[] kMer)
- 4.3.8 byte[] unpackInt(int kMer, int kMerLength)
- 4.3.9 long toPackedLong(byte[] sequence, int offset, int kMerLength)
- 4.3.10 long toPackedLong(byte[] sequence)
- 4.3.11 byte[] unpackLong(long kMer, int kMerLength)
- 4.4 KMerCounts
- 5 Warmups
- 6 Lab
- 7 Testing Your Solution
- 8 Rubric
- 9 Research
Motivation
K-mer counting is a real world application which provides the opportunity to build thread safe data structures which alleviate contention.
Pulling real data from the National Institutes of Health forces us to address balancing our work.
Finally, k-mer counting provides opportunities for performance engineering.
Background
The term k-mer refers to a substring of length k and k-mer counting refers to the process of finding the number of occurrences of a k-mer in a given string. This computational problem has many real-world applications, the most notable being in the field of computational genomics. In this assignment, you will design a program that will ultimately take in a human X-chromosome and count the number of k-mers in the string of DNA. You will be making use of everything you have learned thus far, including atomic variables. It is useful to know that given a string of length n, the amount of possible k-mers is equal to n - k + 1.
Demo
class: | OutputKmer.java | DEMO: ![]() |
methods: | main | |
package: | kmer.output | |
source folder: | src/main/java |
Code To Investigate
Runtime
Runtime.getRuntime().availableProcessors()
Java Util Concurrent
KMerUtils
The following methods should be useful as you build the assignment.
Check out the Javadocs either on the web, or in the comments in KMerUtils.java for further details on how they work.
long calculatePossibleKMers(int kMerLength)
public static long calculatePossibleKMers(int kMerLength) { int bitsPerBase = 2; int bitsPerKMer = kMerLength * bitsPerBase; long possibleKMers = 1L << bitsPerKMer; return possibleKMers; }
int toArrayLength(long possibleKMers)
public static int toArrayLength(long possibleKMers) { // maximum array length is VM dependent, but must be indexed by an int. // reportedly some vms reserve some header words (see ArrayList) which puts its // limit as just below 2^31. // we will choose 2^30 (inclusive) as the maximum since it will hold the largest // potential k-mer (4^15) final long MAX_K_MER_ARRAY_LENGTH_INCLUSIVE = 1 << 30; if (possibleKMers <= MAX_K_MER_ARRAY_LENGTH_INCLUSIVE) { return (int) possibleKMers; } else { throw new IllegalArgumentException(possibleKMers + " > " + MAX_K_MER_ARRAY_LENGTH_INCLUSIVE); } }
int calculateSumOfAllKMers(List<byte[]> sequences, int kMerLength)
public static int calculateSumOfAllKMers(List<byte[]> sequences, int kMerLength) { int sum = 0; for (byte[] sequence : sequences) { int i = (sequence.length - kMerLength + 1); if( i > 0 ) { sum += i; } } return sum; }
String toString(byte[] sequence, int offset, int kMerLength)
public static String toString(byte[] sequence, int offset, int kMerLength) { return new String(sequence, offset, kMerLength, StandardCharsets.UTF_8); }
String toString(byte[] kMer)
public static String toString(byte[] kMer) { return new String(kMer, StandardCharsets.UTF_8); }
int toPackedInt(byte[] sequence, int offset, int kMerLength)
public static int toPackedInt(byte[] sequence, int offset, int kMerLength) { int result = 0; for (int i = 0; i < kMerLength; i++) { switch (sequence[offset + i]) { // case A: // result |= (A_MASK_INT << (i + i)); // break; case T: result |= (T_MASK_INT << (i + i)); break; case C: result |= (C_MASK_INT << (i + i)); break; case G: result |= (G_MASK_INT << (i + i)); break; } } return result; }
int toPackedInt(byte[] kMer)
public static int toPackedInt(byte[] kMer) { return toPackedInt(kMer, 0, kMer.length); }
byte[] unpackInt(int kMer, int kMerLength)
public static byte[] unpackInt(int kMer, int kMerLength) { byte[] result = new byte[kMerLength]; for (int i = 0; i < kMerLength; i++) { int v = (kMer >> (i + i)) & 0x3; switch (v) { case A_MASK_INT: result[i] = A; break; case T_MASK_INT: result[i] = T; break; case C_MASK_INT: result[i] = C; break; case G_MASK_INT: result[i] = G; break; } } return result; }
long toPackedLong(byte[] sequence, int offset, int kMerLength)
public static long toPackedLong(byte[] sequence, int offset, int kMerLength) { long result = 0; for (int i = 0; i < kMerLength; i++) { switch (sequence[offset + i]) { // case A: // result |= (A_MASK_LONG << (i + i)); // break; case T: result |= (T_MASK_LONG << (i + i)); break; case C: result |= (C_MASK_LONG << (i + i)); break; case G: result |= (G_MASK_LONG << (i + i)); break; } } return result; }
long toPackedLong(byte[] sequence)
public static long toPackedLong(byte[] sequence) { return toPackedLong(sequence, 0, sequence.length); }
byte[] unpackLong(long kMer, int kMerLength)
public static byte[] unpackLong(long kMer, int kMerLength) { byte[] result = new byte[kMerLength]; for (int i = 0; i < kMerLength; i++) { long v = (kMer >> (i + i)) & 0x3; switch ((int) v) { case A_MASK_INT: result[i] = A; break; case T_MASK_INT: result[i] = T; break; case C_MASK_INT: result[i] = C; break; case G_MASK_INT: result[i] = G; break; } } return result; }
KMerCounts
MapKMerCount
In order to provide a KMerCount that works with different k-mer type representations we have added an interface KMerCodec which provides functions to go to an from the byte[] representation:
public interface KMerCodec<T> { BiFunction<T, Integer, byte[]> decoder(); Function<byte[], T> encoder(); }
As an example the instance for long -> byte[] and back simply utilizes the pack and unpack methods in KMerUtils:
public enum LongKMerCodec implements KMerCodec<Long> { INSTANCE; @Override public Function<byte[], Long> encoder() { return KMerUtils::toPackedLong; } @Override public BiFunction<Long, Integer, byte[]> decoder() { return KMerUtils::unpackLong; } }
Creating an instance of a MapKMerCount to return from your KMerCounters will need to use the appropriate codec. For example, if you have a map of type Map<Long, Integer>
which you wish to wrap and return, your code would look something like this:
return new MapKMerCount<>(k, map, LongKMerCodec.INSTANCE);
IntArrayKMerCount
Example usage:
return new IntArrayKMerCount(k, array);
AtomicIntegerArrayKMerCount
Example usage:
return new AtomicIntegerArrayKMerCount(k, atomicIntegerArray);
Warmups
The warm-ups associated with this assignment can be found under the kmer.warmup
packages.
String HashMap Implementation
class: | StringHashMapKMerCounter.java | ![]() |
methods: | parse | |
package: | kmer.warmup.stringmap | |
source folder: | src/main/java |
method: public KMerCount parse(List<byte[]> sequences, int k)
(sequential implementation only)
In this completely sequential implementation, you will have to write the parse method. The method takes in a list of arrays of bytes and a k-mer length. It should return an instance of MapKMerCount(which takes in a map), a class provided to you which does exactly what its name suggests. In order to implement this method, we recommend taking a look at the KMerUtils class and more specifically the toString method, which converts a given byte array into a String in order to make it compatible with the String HashMap.
parse should go through the amount of possible k-mers for every byte array in the list of sequences. As it goes through the bytes in the array, use the KMerUtils.toString() method to create a string to use for the HashMap. The map should take in a String as the key and an Integer as the value. We recommend using the map.compute() method and reviewing how to use lambdas.
String ConcurrentHashMap Implementation
class: | StringConcurrentHashMapKMerCounter.java | ![]() |
methods: | parse | |
package: | kmer.warmup.stringmap | |
source folder: | src/main/java |
method: public KMerCount parse(List<byte[]> sequences, int k)
(parallel implementation required)
This implementation will make your sequential String HashMap implementation into a parallel one. To do so, you will be making use of Java’s atomic version of a HashMap: a ConcurrentHashMap. Like before, you will be need to complete the parse method. You can choose to finish this method with a forall loop or a finish/async approach, which is completely up to you.
Lab
![]() |
ThresholdSlices
slicing
class: | ThresholdSlices.java | ![]() |
methods: | addToCollectionKernel createSlices calculateReasonableThreshold |
|
package: | kmer.lab.util | |
source folder: | src/main/java |
method: private static void addToCollectionKernel(Collection<Slice<byte[]>> slices, byte[] sequence, int min, int max, IntPredicate slicePredicate)
(sequential implementation only)
method: public static List<Slice<byte[]>> createSlices(List<byte[]> sequences, int k, IntPredicate slicePredicate)
(sequential implementation only)
This class is designed to take in a list of bytes and convert it into a list of slices. The slice class should be familiar to you from earlier in the semester, but feel free to refer back to the documentation on Slices for a quick recap. You should perform this task recursively and there are two methods you will need to implement. The createSlicesmethod should call the addToCollectionKernel once, but the kernel should take over the recursion from there.
In the createSlices method, you should call the recursive kernel for each sequence in the list of sequences. When defining the min and max values, keep the information mentioned in the "Background" section in mind. As for the kernel, it should add a new slice to the collection of slices once the length of the slice dips below the provided threshold, but it should otherwise split itself into two smaller slices using recursion.
In our previous uses of Slices we leveraged sliceIndexId (and not originalData) since it was a n-way split/coarsening scenario. Here we are recursively splitting each item in a collection of sub-sequences so originalData will be used (and sliceIndexId is unnecessary and can be specified as invalid (-1))..
![]() |
for N=10 and K=3, you should be slicing like this:
threshold calculation
method: public static int calculateReasonableThreshold(List<byte[]> sequences, int k)
(sequential implementation only)
Question to ask yourself: given a list of sequences which are to be k-mer counted, how would you calculate a reasonable threshold?
LongConcurrentMapKMerCounter
class: | LongConcurrentMapKMerCounter.java | ![]() |
methods: | parse | |
package: | kmer.lab.longconcurrentmap | |
source folder: | src/main/java |
method: public KMerCount parse(List<byte[]> sequences, int k)
(parallel implementation required)
This parallel implementation will also overlap greatly with everything you have done previously, but instead of using Strings or SequenceSlice, we will be using Longs to represent DNA data. In order to do this, refer to the KMerUtils class, more specifically the toPackedLong method which will convert a sequence of bytes into a Long for our purposes.
Maps in Java often have a way to specify an initialCapacity to the constructor. When used well, this parameter can give a hint as to how many Entries the Map should be expecting.
KMerUtils has methods which will calculate what you need to compute this value. calculatePossibleKmers returns 4^k and calculateSumOfAllKmers returns n-k+1. Think about how each method might be used to calculate an appropriate initial capacity.
Some questions to ask yourself:
- How many entries should the ConcurrentMap expect for a chromosome a million in length if k is 30?
- How many entries should the ConcurrentMap expect for a chromosome a million in length if k is 1?
In order to test LongConcurrentMapKMerCounter with both java.util.ConcurrentHashMap<K,V> and your ConcurrentBucketHashMap<K,V> implementation, we pass a concurrentMapFactory to LongConcurrentMapKMerCounter's constructor. You should use this factory to create your instance of ConcurrentMap via the apply method. The Optional<Integer> parameter to this apply method is the initialCapacity.
Note: anyone feeling the weight of a pandemic should feel free to pass Optional.empty() to the apply method.
IntArrayKMerCounter
class: | IntArrayKMerCounter.java | ![]() |
methods: | parse | |
package: | kmer.lab.intarray | |
source folder: | src/main/java |
method: public KMerCount parse(List<byte[]> sequences, int k)
(sequential implementation only)
Like the String HashMap implementation, this implementation should be completed sequentially. However, unlike the previous implementations, we will be using an array instead of a map. This means that you must keep in mind what size to make the array in order to instantiate it. KMerUtils has a method which will calculate this number for you, but think about which one to use, why you would use it, and how that would differ from simply calculating n - k + 1 yourself. Think of using the int array as similar to using the map, but with ints as the key/value pairs. Use KMerUtils.toPackedInt() in order to convert the sequence into an int for our purposes.
Hint: calculatePossibleKmers and calculateSumOfAllKmers are distinct and take different input parameters. calculateSumOfAllKmers performs the n - k + 1 calculation on a list of arrays of sequences and sums the total. calculatePossibleKmers only takes the length of the k-mer (k) as the input parameter, meaning it will calculate possibilities independent of the size of the sequence. Thus, the n - k + 1 calculation would yield a value ranging from 0 (inclusive) to the result of calculatePossibleKmers (exclusive).
AtomicIntegerArrayKMerCounter
class: | AtomicIntegerArrayKMerCounter.java | ![]() |
methods: | parse | |
package: | kmer.lab.atomicintegerarray | |
source folder: | src/main/java |
method: public KMerCount parse(List<byte[]> sequences, int k)
(parallel implementation required)
This implementation is simply the parallel equivalent of the int array implementation. To do so, you will be making use of Java’s atomic version of an array: an AtomicIntegerArray. Like before, you will be need to complete the parse method. You can choose to finish this method with a forall loop or a finish/async approach, which is completely up to you.
ConcurrentBucketHashMap
For this section of the assignment you will create a thread safe data structure. You have previously created a non-thread safe Map. Now, with the help of multiple ReentrantReadWriteLocks you will create a thread safe Map that both deals with contention and allows simultaneous gets.
class: | ConcurrentBucketHashMap.java | ![]() |
methods: | constructor getIndex getBucket getLock getEntry get put compute |
|
package: | kmer.lab.concurrentbuckethashmap | |
source folder: | src/main/java |
constructor
method: public ConcurrentBucketHashMap(int initialCapacity)
(sequential implementation only)
initialCapacity can be treated as a hint of how many entries the Map can expect to receive.
private support methods
method: private int getIndex(Object key)
(sequential implementation only)
![]() |
![]() |
method: private List<Entry<K, V>> getBucket(Object key)
(sequential implementation only)
method: private ReadWriteLock getLock(Object key)
(sequential implementation only)
method: private static <K, V> Entry<K, V> getEntry(List<Entry<K, V>> bucket, Object key)
(sequential implementation only)
public methods
Be sure to make each of these methods thread safe by acquiring the appropriate (read or write) lock for the appropriate bucket.
Which method, lock() or tryLock(), is the appropriate one to use for creating a thread safe data structure such as this?
Whenever you acquire a lock, but sure to leverage try finally and unlock() the lock within the finally block.
method: public V get(Object key)
(thread-safe required)
method: public V put(K key, V value)
(thread-safe required)
method: public V compute(K key, BiFunction<? super K, ? super V, ? extends V> remappingFunction)
(thread-safe required)
advice
![]() |
![]() |
![]() |
Testing Your Solution
Correctness
class: | KMerTestSuite.java | ![]() |
package: | kmer.lab | |
source folder: | src/test/java |
Rubric
Total points: 100
- Correct ThresholdSlices (10)
- Correct and Parallel Long ConcurrentHashMap (25)
- Correct Int Array (15)
- Correct and Parallel AtomicIntegerArray (15)
- Correct and ThreadSafe ConcurrentBucketHashMap implementation (25)
- Clarity and efficiency (10)
Research
paper: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers (Jellyfish)
paper: Multiple comparative metagenomics using multiset k-mer counting