K Mer Balance Assignment

From CSE231 Wiki
Revision as of 21:24, 19 April 2023 by Cosgroved (talk | contribs) (→‎DefaultByteArrayRange)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Motivation

Balancing the work load is not always straightforward. A sample chromosome from the NIH has enough inconclusive reads that it is split into 50 sub-sequences. The lengths of these sub-sequences vary wildly, from ~1,000 to ~10,000,00.

Chromosome sub sequence lengths.png

spreadsheet

While creating a task for each of the 50 sub-sequences would be in our target range of 2X to 10X the the number of processors for an 8 processor machine, the workload balance would be terrible. Over 40 of the sequences are tiny, and only 3 are of lengths of any real significance. Thus, most of the processors would be sitting around idle while the last 3 processors finished up the 3 longest sub-sequences... then only 2 would be working, and finally the last one would finish up. Once could, one supposes, use the Ranges exercise from before to slice up each of the sub-sequences, but we only really want to slice up the long sub-sequences and leave the short ones unsliced.

Code To Investigate

interface KMerBalancer provides a default implementation of createReasonableSlices(sequences, k) which you will use in each of your parallel k-mer counters. You will note, that createReasonableSlices invokes two methods you will implement: calculateReasonableThreshold and createSlices.

createReasonableSlices  
default List<ByteArrayRange> createReasonableSlices(List<byte[]> sequences, int k) {
	int reasonableThreshold = calculateReasonableThreshold(sequences, k);
	return createSlices(sequences, k, (rangeLength) -> {
		return rangeLength > reasonableThreshold;
	});
}

Code To Implement

DefaultByteArrayRange

As in our previous exercise where we implemented the class Range, we will need to keep track of the min and maxExclusive. Since we will also need to track the byte array which the range is associated with, we will need to handle originalCompleteSequence.

class: DefaultByteArrayRange.java Java.png
methods: constructor
originalCompleteSequence
min
maxExclusive
package: kmer.balance.exercise
source folder: student/src/main/java

constructor and instance variables

constructor: public DefaultByteArrayRange(byte[] originalCompleteSequence, int min, int maxExclusive)

Store each of the parameters in private final instance variables.

Attention niels epting.svg Alert:As the name suggests, we are to hang onto the originalCompleteSequence. Do not copy it or create a sub-array. Just hang onto the reference in an instance variable.

originalCompleteSequence

method: public byte[] originalCompleteSequence() Sequential.svg (sequential implementation only)

Return the originalCompleteSequence passed to the constructor.

min

method: public int min() Sequential.svg (sequential implementation only)

Return the min passed to the constructor.

maxExclusive

method: int maxExclusive() Sequential.svg (sequential implementation only)

Return the maxExclusive passed to the constructor.

DefaultKMerBalancer

DefaultKMerBalancer exists to balance the workload of a list of dna sub-sequences. We have balanced workload before by breaking up data into as equally sized slices in the Ranges assignment. We used the sliced up Ranges in everything from the Coarsening Nucleobase Count to Matrix MapReduce.

K-mer counting presents an additional challenge of having to deal with sub-sequences of radically different lengths.

class: DefaultKMerBalancer.java Java.png
methods: sliceKernel
createSlices
calculateReasonableThreshold
package: kmer.balance.exercise
source folder: student/src/main/java
Circle-information.svg Tip: Get DefaultKMerBalancer working and then use it to balance all of your parallel KMerCounter implementations.

createSlices

method: public static List<Slice<byte[]>> createSlices(List<byte[]> sequences, int k, IntPredicate slicePredicate) Sequential.svg (sequential implementation only)

Attention niels epting.svg Warning: be sure to slice the offset (a.k.a. startingIndex) space

for N=10 and K=3, you should be slicing like this:

ThresholdSlices.svg

Video: ThresholdSlices  

sliceKernel

method: private static void sliceKernel(List<ByteArrayRange> slices, byte[] sequence, int min, int max, IntPredicate slicePredicate) Sequential.svg (sequential implementation only)

Here we will employ a divide-and-conquer like strategy to fill the slices list with ByteArrayRanges that are all not too long. We leverage the provided slicePredicate to test whether or not a range length from min to max is long enough to warrant dividing further.

threshold calculation

method: public static int calculateReasonableThreshold(List<byte[]> sequences, int k) Sequential.svg (sequential implementation only)

Define the threshold of slice length used to define an appropriate slicePredicate.

Question to ask yourself: given a list of sequences which are to be k-mer counted, how would you calculate a reasonable threshold?

In addition to an estimation of the work required to count the specified sequences, knowing the number of available processors will also be necessary.

availableProcessors  
int processorCount = Runtime.getRuntime().availableProcessors();

Testing Your Solution

class: __BalanceTestSuite.java Junit.png
package: kmer.balance.exercise
source folder: testing/src/test/java

Pledge, Acknowledgments, Citations

file: kmer-balance-pledge-acknowledgments-citations.txt

More info about the Honor Pledge