credit for this assignment: John Gibson and Dennis Cosgrove

Motivation

K-mer counting is a real world application which provides the opportunity to build thread safe data structures which alleviate contention.

Pulling real data from the National Institutes of Health forces us to address balancing our work.

Finally, k-mer counting provides opportunities for performance engineering.

Background

The term k-mer refers to a substring of length k and k-mer counting refers to the process of finding the number of occurrences of a k-mer in a given string. This computational problem has many real-world applications, the most notable being in the field of computational genomics. In this assignment, you will design a program that will ultimately take in a human X-chromosome and count the number of k-mers in the string of DNA. You will be making use of everything you have learned thus far, including atomic variables. It is useful to know that given a string of length n, the amount of possible k-mers is equal to n - k + 1.

Lecture

Demo

class:	OutputKmer.java	DEMO:
methods:	main
package:	kmer.output
source folder:	src//java

This demo video shows the big picture of KMer Counting. It included example usage of calculateSumOfAllKMers and calculatePossibleKMers method in KMerUtils.

Research

wikipedia article on k-mers

paper: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers (Jellyfish)

paper: Multiple comparative metagenomics using multiset k-mer counting

Collection of approaches to k-mer counting

Burrows–Wheeler transform

Significance of k-mer Counting

Code To Investigate

Runtime

Runtime.getRuntime().availableProcessors()

BigInteger

Java Util Concurrent

ConcurrentHashMap

AtomicIntegerArray.

IntegerKMers

toPackedInt(sequence, offset, k)

/**
 * "Packs" a k-mer into an int. The bits of the int will be used to store
 * individual nucleobases.
 * 
 * @param sequence a sequence of nucleobases
 * @param offset   the index of the first nucleobase in the desired k-mer
 * @param k        the length of the desired k-mer
 * @return an integer uniquely representing this k-mer
 */
public static int toPackedInt(byte[] sequence, int offset, int k) {
	int result = 0;
	for (int i = 0; i < k; i++) {
		switch (sequence[offset + i]) {
		// case A:
		// result |= (A_MASK_INT << (i + i));
		// break;
		case T:
			result |= (T_MASK_INT << (i + i));
			break;
		case C:
			result |= (C_MASK_INT << (i + i));
			break;
		case G:
			result |= (G_MASK_INT << (i + i));
			break;
		}
	}
	return result;
}

LongKMers

toPackedLong(sequence, offset, k)

/**
 * "Packs" a k-mer into a long. The bits of the long will be used to store
 * individual nucleobases.
 * 
 * @param sequence a sequence of nucleobases
 * @param offset   the index of the first nucleobase in the desired k-mer
 * @param k        the length of the desired k-mer
 * @return a long integer uniquely representing this k-mer
 */
public static long toPackedLong(byte[] sequence, int offset, int k) {
	final long T_MASK_LONG = (long) IntegerKMers.T_MASK_INT;
	final long C_MASK_LONG = (long) IntegerKMers.C_MASK_INT;
	final long G_MASK_LONG = (long) IntegerKMers.G_MASK_INT;
	long result = 0;
	for (int i = 0; i < k; i++) {
		switch (sequence[offset + i]) {
		// case KMerIntegerUtils.A:
		// result |= (KMerIntegerUtils.A_MASK_LONG << (i + i));
		// break;
		case IntegerKMers.T:
			result |= (T_MASK_LONG << (i + i));
			break;
		case IntegerKMers.C:
			result |= (C_MASK_LONG << (i + i));
			break;
		case IntegerKMers.G:
			result |= (G_MASK_LONG << (i + i));
			break;
		}
	}
	return result;
}

KMerResults

interface KMerResults

abstract class AbstractMapKMerResults<T>

class LongMapKMerCount

abstract class AbstractArrayKMerResults

class IntArrayKMerCount

class AtomicIntegerArrayKMerCount

Group Warmup

Complete the group warm up first.

Exercise

DefaultByteArrayRange

DefaultByteArrayRangeCreator

DefaultKMerBalancer

Tip: Get DefaultKMerBalancer working and then use it to balance all of your parallel KMerCounter implementations.

class:	DefaultKMerBalancer.java
methods:	sliceKernel createSlices calculateReasonableThreshold
package:	kmer.exercise.balance
source folder:	student/src/main/java

slicing

method: private static void sliceKernel(List<ByteArrayRange> slices, byte[] sequence, int min, int max, IntPredicate slicePredicate) (sequential implementation only)

method: public static List<Slice<byte[]>> createSlices(List<byte[]> sequences, int k, IntPredicate slicePredicate) (sequential implementation only)

This class is designed to take in a list of bytes and convert it into a list of slices. The slice class should be familiar to you from earlier in the semester, but feel free to refer back to the documentation on Slices for a quick recap. You should perform this task recursively and there are two methods you will need to implement. The createSlices method should just call the kernel (the "helper" method), and the kernel performs the recursion.

In the createSlices method, you should call the recursive kernel for each sequence that is longer than k in the list of sequences. When defining the min and max values, keep the information mentioned in the "Background" section in mind. As for the kernel, it should add a new slice to the collection of slices once the slice length dips below the provided threshold, but it should otherwise split itself into two smaller slices using recursion.

In our previous uses of Slices we leveraged sliceIndexId (and not originalData) since it was a n-way split/coarsening scenario. Here we are recursively splitting each item in a collection of sub-sequences so originalData will be used (and sliceIndexId is unnecessary and can be specified as invalid (-1))..

Warning: be sure to slice the offset (a.k.a. startingIndex) space

for N=10 and K=3, you should be slicing like this:

threshold calculation

method: public static int calculateReasonableThreshold(List<byte[]> sequences, int k) (sequential implementation only)

Define the threashold of slice length used in slicePredicate.

Question to ask yourself: given a list of sequences which are to be k-mer counted, how would you calculate a reasonable threshold?

Individual Warmup

Complete the Balanced_String_Map_K_Mer_Counting_Assignment balanced individual warm up next.

Return of the Exercise

LongConcurrentMapKMerCounter

class:	LongConcurrentMapKMerCounter.java
methods:	parse
package:	kmer.exercise
source folder:	student/src/main/java

method: public KMerCount parse(List<byte[]> sequences, int k) (parallel implementation required)

This parallel implementation will also overlap greatly with everything you have done previously, but instead of using Strings or ByteArrayRange, we will be using Longs to represent DNA data. In order to do this, refer to the KMerUtils class, more specifically the toPackedLong method which will convert a sequence of bytes into a Long for our purposes. Don't forget to balance the workload in this parallel implementation with your ThresholdSlices. Make sure to select the most appropriate parallel fork loop.

Warning:Be sure to utilize the Supplier<ConcurrentMap<Long, Integer>> concurrentMapFactory instance variable to create your ConcurrentMap.

AbstractArrayKMerCounter

totalPossibleKMers(k)

IntArrayKMerCounter

class:	IntArrayKMerCounter.java
methods:	parse
package:	kmer.exercise
source folder:	student/src/main/java

method: public KMerCount parse(List<byte[]> sequences, int k) (sequential implementation only)

Like the String HashMap implementation, this implementation should be completed sequentially. However, unlike the previous implementations, we will be using an array instead of a map. This means that you must keep in mind what size to make the array in order to instantiate it. KMerUtils has a method which will calculate this number for you, but think about which one to use, why you would use it, and how that would differ from simply calculating n - k + 1 yourself. Think of using the int array as similar to using the map, but with ints as the key/value pairs. Use KMerUtils.toPackedInt() in order to convert the sequence into an int for our purposes.

Hint: calculatePossibleKmers and calculateSumOfAllKmers are distinct and take different input parameters. calculateSumOfAllKmers performs the n - k + 1 calculation on a list of arrays of sequences and sums the total. It calculate the total number of Kmers in the list of sequences. calculatePossibleKmers only takes the length of the k-mer (k) as the input parameter, meaning it will calculate all possibilities independent of the sequences. Thus, the n - k + 1 calculation would yield a value ranging from 0 (inclusive) to the result of calculatePossibleKmers (exclusive).

AtomicIntegerArrayKMerCounter

class:	AtomicIntegerArrayKMerCounter.java
methods:	parse
package:	kmer.exercise
source folder:	student/src/main/java