Difference between revisions of "Nucleobase Counting"

From CSE231 Wiki
Jump to navigation Jump to search
Line 64: Line 64:
 
It is hard to find 1D references for midpoint.  [http://mathworld.wolfram.com/ Wolfram MathWorld] has a breakdown of [http://mathworld.wolfram.com/Midpoint.html Midpoint in 2D and 3D].  Pick any dimension you like.
 
It is hard to find 1D references for midpoint.  [http://mathworld.wolfram.com/ Wolfram MathWorld] has a breakdown of [http://mathworld.wolfram.com/Midpoint.html Midpoint in 2D and 3D].  Pick any dimension you like.
 
===Count Range===
 
===Count Range===
{{Sequential|countRangeSequential}}
+
{{Sequential|public static int countRangeSequential(byte[] chromosome, Nucleobase targetNucleobase, int min, int maxExclusive)}}
  
 
This utility method should count the number of times a particular nucleobase occurs in the array between [min, maxExclusive).
 
This utility method should count the number of times a particular nucleobase occurs in the array between [min, maxExclusive).

Revision as of 16:32, 9 February 2018

Motivation

  • get our feet wet with parallel programming
  • gain experience with the async and finish constructs
  • take different approaches to splitting up the work

We will solve the problem sequentially, then split the work up into 2 tasks, then coarsen the work n-ways, and finally split up the work in a divide-and-conquer recursive style.

Background

Bioinformatics

For this assignment, you will be writing sequential and parallel code to count nucleobases in a human X chromosome.

DNA is made up of four nucleobases: cytosine, guanine, adenine, and thymine. A strand of DNA can thus be represented as a string of letters representing these nucleobases, for example: “ACCGCATAAAGTCC.” However, DNA sequencing is typically not 100% accurate, so some of the nucleobases are not read with high certainty. These bases can be represented as an “N.” A sequence then might look something like “NCCGCATNAAGTCC.” Your goal is to write code that counts the number of occurrences a particular nucleobase or uncertain reads.

We will be using actual data pulled from the US National Library of Medicine, a database maintained by the National Institute of Health. We have already provided you the code that you need to access the chromosome from the database and check your work. You must implement a sequential solution and three parallel solutions to count the given bases in these sequences.

For some more optional background on DNA and nucleotide bases, please check out

Parallel Programming

Parallel Constructs

Related Videos

Mistakes to Avoid

Attention niels epting.svg Warning: Be sure to remove each NotYetImplementedException as you implement your solutions.
Attention niels epting.svg Warning: Do NOT copy the data. We are simply reading the chromosome data, so there is no reason to copy it.
Attention niels epting.svg Warning: Do NOT share data beyond what is necessary. Do NOT use static class fields when you could use local variables instead.

Getting Started

Code to Implement

class: MidpointUtils.java Java.png
methods: caclulateMidpoint
package: midpoint.assignment
source folder: student/src/main/java
class: NucleobaseCounting.java Java.png
methods: countRange
countSequential
countParallelLowerUpperSplit
countParallelNWaySplit
countParallelDivideAndConquerKernel
package: count.assignment
source folder: student/src/main/java


You will find the starting point for this assignment is in the count folder. In the count.assignment package you will find NucleobaseCounting.java.

Utility Methods

Midpoint

method: public static int calculateMidpoint(int a, int b) Sequential.svg (sequential implementation only)

In this method, you will need to calculate the midpoint between two numbers. This is as simple as finding the average between the two numbers. There is no need to worry about rounding correctly, just drop everything after the decimal point if the midpoint is not automatically an int.

There are at least two correct ways to implement the midpoint:

Option A:
Option B:

It is hard to find 1D references for midpoint. Wolfram MathWorld has a breakdown of Midpoint in 2D and 3D. Pick any dimension you like.

Count Range

method: public static int countRangeSequential(byte[] chromosome, Nucleobase targetNucleobase, int min, int maxExclusive) Sequential.svg (sequential implementation only)

This utility method should count the number of times a particular nucleobase occurs in the array between [min, maxExclusive).


Use a for loop to iterate from min to maxExclusive.

Sequential Solution

method: countSequential Sequential.svg (sequential implementation only)

This solution should be achieved with a simple call to one of the utility methods you wrote.

Invoke countRangeSequential from 0 to the chromosome array length.

Parallel Solutions

You need to implement three different parallel solutions to this problem. The first will involve splitting the array into two equal halves, then going through each half of the array in parallel. The second will involve splitting the array into n different pieces, then going through each of those pieces of the array in parallel. The third and final implementation will recursively divide the array until a threshold is reached.

Lower/Upper Split

method: countParallelUpperLowerSplit Parallel.svg (parallel implementation required)

In order to start with then two equal halves solution, please modify the countParallelUpperLowerSplit method.

Coarsening N-Way Split

method: countParallelNWaySplit Parallel.svg (parallel implementation required)

After you’re finished with that, please modify the countParallelNWaySplit method for splitting the work up into n tasks solution.

If the chromosome was 1,000,000 long, and you were asked to split it up in 10 tasks, each task should obviously process 100,000. Things get slightly tricky if the chromosome were 1,000,005 long.

Divide and Conquer

In a pattern that we will use repeatedly in the course, we have a public method that is asked to count an entire chromosome. This method just calls the kernel method to do all of the work:

	public static int countParallelDivideAndConquer(byte[] chromosome, Nucleobase targetNucleobase, int threshold)
			throws InterruptedException, ExecutionException {
		return countParallelDivideAndConquerKernel(chromosome, targetNucleobase, threshold, 0, chromosome.length);
	}

method: countParallelDivideAndConquerKernel Parallel.svg (parallel implementation required)

The kernel method will recursively divide the work up in parallel until the range drops below the specified threshold. At that point it will simply use the sequential count range utility method.

Attention niels epting.svg Warning: threshold is INCORRECTLY documented as "The number of tasks to create."

Testing Your Solution

Correctness

class: CountTestSuite.java Junit.png
package: count.assignment
source folder: testing/src/test/java

Launch CountTestSuite.java as a JUnit Test to run all of the tests. CountTestSuite.java is located in src/test/java count.assignment package. You can initiate this via right clicking on CountTestSuite.java and selecting "Run As..." -> "JUnit Test". [screenshot]

Performance

class: NucleobaseCountTiming.java Noun Project stopwatch icon 386232 cc.svg
package: count.assignment
source folder: src/main/java

After your JUnit tests are passing and you are confident that your solutions are correct, move on to checking out the timing performance of your different solutions with different task counts.

Rubric

As always, please make sure to cite your work appropriately.

Total points: 100

  • Correct countRangeSequential (10)
  • Correct countSequential (5)
  • Correct calculateMidpoint (5)
  • Correct and Parallel countParallelLowerUpperSplit (20)
  • Correct and Parallel countParallelNWaySplit (25)
  • Correct and Parallel countParallelDivideAndConquerKernel (25)
  • Clarity and efficiency (10)