K Mer Balance Assignment
Contents
Motivation
Balancing the work load is not always straightforward. A sample chromosome from the NIH has enough inconclusive reads that it is split into 50 sub-sequences. The lengths of these sub-sequences vary wildly, from ~1,000 to ~10,000,00.
While creating a task for each of the 50 sub-sequences would be in our target range of 2X to 10X the the number of processors for an 8 processor machine, the workload balance would be terrible. Over 40 of the sequences are tiny, and only 3 are of lengths of any real significance. Thus, most of the processors would be sitting around idle while the last 3 processors finished up the 3 longest sub-sequences... then only 2 would be working, and finally the last one would finish up. Once could, one supposes, use the Ranges exercise from before to slice up each of the sub-sequences, but we only really want to slice up the long sub-sequences and leave the short ones unsliced.
Code To Investigate
interface KMerBalancer provides a default implementation of createReasonableSlices(sequences, k) which you will use in each of your parallel k-mer counters. You will note, that createReasonableSlices invokes two methods you will implement: calculateReasonableThreshold and createSlices.
createReasonableSlices |
---|
default List<ByteArrayRange> createReasonableSlices(List<byte[]> sequences, int k) {
int reasonableThreshold = calculateReasonableThreshold(sequences, k);
return createSlices(sequences, k, (rangeLength) -> {
return rangeLength > reasonableThreshold;
});
}
|
Code To Implement
DefaultByteArrayRange
As in our previous exercise where we implemented the class Range, we will need to keep track of the min and maxExclusive. Since we will also need to track the byte array which the range is associated with, we will need to handle originalCompleteSequence.
class: | DefaultByteArrayRange.java | |
methods: | constructor originalCompleteSequence min maxExclusive |
|
package: | kmer.balance.exercise | |
source folder: | student/src/main/java |
constructor and instance variables
constructor: public DefaultByteArrayRange(byte[] originalCompleteSequence, int min, int maxExclusive)
Store each of the parameters in private final instance variables.
Alert:As the name suggests, we are to hang onto the originalCompleteSequence. Do not copy it or create a sub-array. Just hang onto the reference in an instance variable. |
originalCompleteSequence
method: public byte[] originalCompleteSequence()
(sequential implementation only)
Return the originalCompleteSequence passed to the constructor.
min
method: public int min()
(sequential implementation only)
Return the min passed to the constructor.
maxExclusive
method: int maxExclusive()
(sequential implementation only)
Return the maxExclusive passed to the constructor.
DefaultKMerBalancer
DefaultKMerBalancer exists to balance the workload of a list of dna sub-sequences. We have balanced workload before by breaking up data into as equally sized slices in the Ranges assignment. We used the sliced up Ranges in everything from the Coarsening Nucleobase Count to Matrix MapReduce.
K-mer counting presents an additional challenge of having to deal with sub-sequences of radically different lengths.
class: | DefaultKMerBalancer.java | |
methods: | sliceKernel createSlices calculateReasonableThreshold |
|
package: | kmer.balance.exercise | |
source folder: | student/src/main/java |
Tip: Get DefaultKMerBalancer working and then use it to balance all of your parallel KMerCounter implementations. |
createSlices
method: public static List<Slice<byte[]>> createSlices(List<byte[]> sequences, int k, IntPredicate slicePredicate)
(sequential implementation only)
Warning: be sure to slice the offset (a.k.a. startingIndex) space |
for N=10 and K=3, you should be slicing like this:
Video: ThresholdSlices |
---|
sliceKernel
method: private static void sliceKernel(List<ByteArrayRange> slices, byte[] sequence, int min, int max,
IntPredicate slicePredicate)
(sequential implementation only)
Here we will employ a divide-and-conquer like strategy to fill the slices list with ByteArrayRanges that are all not too long. We leverage the provided slicePredicate to test whether or not a range length from min to max is long enough to warrant dividing further.
threshold calculation
method: public static int calculateReasonableThreshold(List<byte[]> sequences, int k)
(sequential implementation only)
Define the threshold of slice length used to define an appropriate slicePredicate.
Question to ask yourself: given a list of sequences which are to be k-mer counted, how would you calculate a reasonable threshold?
In addition to an estimation of the work required to count the specified sequences, knowing the number of available processors will also be necessary.
availableProcessors |
---|
int processorCount = Runtime.getRuntime().availableProcessors();
|
Testing Your Solution
class: | __BalanceTestSuite.java | |
package: | kmer.balance.exercise | |
source folder: | testing/src/test/java |
Pledge, Acknowledgments, Citations
file: | kmer-balance-pledge-acknowledgments-citations.txt |
More info about the Honor Pledge