K Mer Counting Assignment

From CSE231 Wiki
Revision as of 02:02, 11 April 2023 by Cosgroved (talk | contribs)
Jump to navigation Jump to search

credit for this assignment: John Gibson and Dennis Cosgrove

Motivation

K-mer counting is a real world application which provides the opportunity to build thread safe data structures which alleviate contention.

Pulling real data from the National Institutes of Health forces us to address balancing our work.

Finally, k-mer counting provides opportunities for performance engineering.

Background

The term k-mer refers to a substring of length k and k-mer counting refers to the process of finding the number of occurrences of a k-mer in a given string. This computational problem has many real-world applications, the most notable being in the field of computational genomics. In this assignment, you will design a program that will ultimately take in a human X-chromosome and count the number of k-mers in the string of DNA. You will be making use of everything you have learned thus far, including atomic variables. It is useful to know that given a string of length n, the amount of possible k-mers is equal to n - k + 1.

Lecture

Demo

class: OutputKmer.java DEMO: Java.png
methods: main
package: kmer.output
source folder: src//java

This demo video shows the big picture of KMer Counting. It included example usage of calculateSumOfAllKMers and calculatePossibleKMers method in KMerUtils.

Research

wikipedia article on k-mers

paper: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers (Jellyfish)

paper: Multiple comparative metagenomics using multiset k-mer counting

Collection of approaches to k-mer counting

Burrows–Wheeler transform

Significance of k-mer Counting

Code To Investigate

Runtime

Runtime.getRuntime().availableProcessors()

BigInteger

ZERO

ONE

shiftLeft(n)

flipBit(n)

multiply(val)

pow(n)

intValueExact()

Java Util Concurrent

ConcurrentHashMap

AtomicIntegerArray.

IntegerKMers

toPackedInt(sequence, offset, k)

/**
 * "Packs" a k-mer into an int. The bits of the int will be used to store
 * individual nucleobases.
 * 
 * @param sequence a sequence of nucleobases
 * @param offset   the index of the first nucleobase in the desired k-mer
 * @param k        the length of the desired k-mer
 * @return an integer uniquely representing this k-mer
 */
public static int toPackedInt(byte[] sequence, int offset, int k) {
	int result = 0;
	for (int i = 0; i < k; i++) {
		switch (sequence[offset + i]) {
		// case A:
		// result |= (A_MASK_INT << (i + i));
		// break;
		case T:
			result |= (T_MASK_INT << (i + i));
			break;
		case C:
			result |= (C_MASK_INT << (i + i));
			break;
		case G:
			result |= (G_MASK_INT << (i + i));
			break;
		}
	}
	return result;
}

LongKMers

toPackedLong(sequence, offset, k)

/**
 * "Packs" a k-mer into a long. The bits of the long will be used to store
 * individual nucleobases.
 * 
 * @param sequence a sequence of nucleobases
 * @param offset   the index of the first nucleobase in the desired k-mer
 * @param k        the length of the desired k-mer
 * @return a long integer uniquely representing this k-mer
 */
public static long toPackedLong(byte[] sequence, int offset, int k) {
	final long T_MASK_LONG = (long) IntegerKMers.T_MASK_INT;
	final long C_MASK_LONG = (long) IntegerKMers.C_MASK_INT;
	final long G_MASK_LONG = (long) IntegerKMers.G_MASK_INT;
	long result = 0;
	for (int i = 0; i < k; i++) {
		switch (sequence[offset + i]) {
		// case KMerIntegerUtils.A:
		// result |= (KMerIntegerUtils.A_MASK_LONG << (i + i));
		// break;
		case IntegerKMers.T:
			result |= (T_MASK_LONG << (i + i));
			break;
		case IntegerKMers.C:
			result |= (C_MASK_LONG << (i + i));
			break;
		case IntegerKMers.G:
			result |= (G_MASK_LONG << (i + i));
			break;
		}
	}
	return result;
}

KMerResults

interface KMerResults

abstract class AbstractMapKMerResults<T>
class LongMapKMerCount
abstract class AbstractArrayKMerResults
class IntArrayKMerCount
class AtomicIntegerArrayKMerCount

Group Warmup

Complete the group warm up first.

Individual Warmup

Complete the balanced individual warm up next.

Exercise K-mer Counters

Map Based

LongConcurrentMapKMerCounter

class: LongConcurrentMapKMerCounter.java Java.png
methods: parse
package: kmer.exercise
source folder: student/src/main/java

method: public KMerCount parse(List<byte[]> sequences, int k) Parallel.svg (parallel implementation required)

This parallel implementation will also overlap greatly with what you have done previously in the group and individual warmups. In those assignments, we used a String representation for each k-mer. For this implementation, we will leverage the fact that since there are only 4 bases (A, T, C, and G), we can store each base in only 2 bits. It follows that for k<=32 we can store the k-mer in a 64-bit long integer. This will allow us to use less memory for each key in our Map.

In order to do this, refer to the LongKMers.toPackedLong(sequence, offset, k) method which will convert a sequence of bytes into a Long for our purposes.

You will need to balance the workload of this parallel implementation with the specified KMerBalancer.

Make sure to select the most appropriate parallel fork loop.

Attention niels epting.svg Warning:Be sure to utilize the Supplier<ConcurrentMap<Long, Integer>> concurrentMapFactory instance variable to create your ConcurrentMap.


Array Based

AbstractArrayKMerCounter

class: AbstractArrayKMerCounter.java Java.png
methods: totalPossibleKMers
package: kmer.exercise
source folder: student/src/main/java

In the following two array-based implementations, we will leverage the fact that if k<=15 we can use an array instead of a Map to store the k-mer counts.

For each implementation we will encode the k-mer with IntKMers.toPackedInt(byte[] sequence, int offset, int k). This encoded k-mer will be used as the index into the array. The value at each index will be used to hold the number of times that particular k-mer was encountered.

To make space for each possible k-mer we might find, we need to make an array large enough for each potential index. To do that we will need to implement totalPossibleKMers(k).

totalPossibleKMers(k)

method: protected BigInteger totalPossibleKMers(int k) Sequential.svg (sequential implementation only)

Calculate the total possible different k-mers given a particular k.

For example, for the 1-mers there are 4 possibilities:

  • A
  • T
  • C
  • G

For the 2-mers, there are 16 possibilities:

  • AA
  • AT
  • AC
  • AG
  • TA
  • TT
  • TC
  • TG
  • CA
  • CT
  • CC
  • GG
  • GA
  • GT
  • GC
  • GG

How many 3-mers are there? Given a particular k, how many k-mers are there?

IntArrayKMerCounter

class: IntArrayKMerCounter.java Java.png
methods: parse
package: kmer.exercise
source folder: student/src/main/java

method: public KMerCount parse(List<byte[]> sequences, int k) Sequential.svg (sequential implementation only)

This implementation will use an int array. In order to create an array, we will need to know what size it should be. If we need an element for every possible k-mer we might encounter how many would that be? Did you just build a method which calculates this number? Yes, you did.

Note: to convert from a BigInteger to an int, we recommend using intValueExact() as it will throw an ArithmeticException if the BigInteger is too big.

Since this implementation is sequential, there is no need to balance the workload. The continuation task will do all of the work.

AtomicIntegerArrayKMerCounter

class: AtomicIntegerArrayKMerCounter.java Java.png
methods: parse
package: kmer.exercise
source folder: student/src/main/java

method: public KMerCount parse(List<byte[]> sequences, int k) Parallel.svg (parallel implementation required)

This implementation is simply the parallel equivalent of the int array implementation. To do so, you will be making use of Java’s atomic version of an array: an AtomicIntegerArray. Like before, you will be need to complete the parse method. Don't forget to balance the workload in this parallel implementation with the specified KMerBalancer.

Make sure to select the most appropriate parallel fork loop.


Testing Your Solution

Correctness

class: __KMerTestSuite.java Junit.png
package: kmer.exercise
source folder: testing/src/test/java