Difference between revisions of "Nucleobase Counting"

From CSE231 Wiki
Jump to navigation Jump to search
 
(86 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 
=Motivation=
 
=Motivation=
* a chance to get our feet wet with parallel programming
+
* get our feet wet with parallel programming
 
* gain experience with the async and finish constructs
 
* gain experience with the async and finish constructs
 +
* take different approaches to splitting up the work
  
 
We will solve the problem sequentially, then split the work up into 2 tasks, then coarsen the work n-ways, and finally split up the work in a divide-and-conquer recursive style.
 
We will solve the problem sequentially, then split the work up into 2 tasks, then coarsen the work n-ways, and finally split up the work in a divide-and-conquer recursive style.
Line 7: Line 8:
 
=Background=
 
=Background=
  
 +
==Bioinformatics==
 
For this assignment, you will be writing sequential and parallel code to count nucleobases in a human X chromosome.
 
For this assignment, you will be writing sequential and parallel code to count nucleobases in a human X chromosome.
 
   
 
   
DNA is made up of four nucleobases: cytosine, guanine, adenine, and thymine. A strand of DNA can thus be represented as a string of letters representing these nucleobases, for example: “ACCGCATAAAGTCC.” However, DNA sequencing is typically not 100% accurate, so some of the nucleobases are not read with high certainty. These bases can be represented as an “N.” A sequence then might look something like “NCCGCATNAAGTCC.” Your goal is to write code that counts the number of unknown nucleobases.
+
DNA is made up of four nucleobases: cytosine, guanine, adenine, and thymine. A strand of DNA can thus be represented as a string of letters representing these nucleobases, for example: “ACCGCATAAAGTCC.” However, DNA sequencing is typically not 100% accurate, so some of the nucleobases are not read with high certainty. These bases can be represented as an “N.” A sequence then might look something like “NCCGCATNAAGTCC.” Your goal is to write code that counts the number of occurrences a particular nucleobase or uncertain reads.
+
 
 
We will be using actual data pulled from the US National Library of Medicine, a database maintained by the National Institute of Health. We have already provided you the code that you need to access the chromosome from the database and check your work. You must implement a sequential solution and three parallel solutions to count the given bases in these sequences.
 
We will be using actual data pulled from the US National Library of Medicine, a database maintained by the National Institute of Health. We have already provided you the code that you need to access the chromosome from the database and check your work. You must implement a sequential solution and three parallel solutions to count the given bases in these sequences.
  
For some more optional background on DNA and nucleotide bases, please refer to the links under [[#Optional Reading]].
+
For some more optional background on DNA and nucleotide bases, please check out
 +
*https://en.wikipedia.org/wiki/DNA
 +
*https://en.wikipedia.org/wiki/FASTA_format
 +
 
 +
==Parallel Programming==
 +
===Parallel Constructs===
 +
* [https://www.cse.wustl.edu/~cosgroved/courses/cse231/current/apidocs/edu/wustl/cse231s/v5/V5.html#finish(edu.wustl.cse231s.v5.api.CheckedRunnable) finish]
 +
* [https://www.cse.wustl.edu/~cosgroved/courses/cse231/current/apidocs/edu/wustl/cse231s/v5/V5.html#async(edu.wustl.cse231s.v5.api.CheckedRunnable) async]
 +
* Feel free to use the [[Reference_Page#Async|course reference page]] if you need a reminder on how to use asyncs and finishes
  
=Mistakes to avoid=
+
===Related Videos===
{{Warning | Do NOT copy the data.  We are simply reading the chromosome data, so there is no reason to copy it.}}
+
<youtube>ShcnjoO89ZU</youtube>
  
=Where to Start=
+
<youtube>DjVep9485l4</youtube>
{{CodeToImplement|NucleobaseCounting|countRange<br>countSequential<br>caclulateMidpoint<br>countParallelUpperLowerSplit<br>countParallelNWaySplit|count.assignment}}
 
  
 +
<youtube>1x-jGwcW4WU</youtube>
  
You will find the starting point for this assignment is in the count folder.  In the count.assignment package you will find [[http://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/apidocs/count/assignment/NucleobaseCounting.html NucleobaseCounting.java]].
+
<youtube>0_JrsJrqgDg</youtube>
  
 +
<youtube>8oq9CKgIOiE</youtube>
  
==Sequential Solution==
+
<youtube>--GfpYtL0lI</youtube>
  
===countRangeSequential===
+
<!--
 +
* [https://wustl.app.box.com/s/2w4c9xn5du3sqwmmji7v8w35ntpq1j7z Finish & Async: Lower Upper Split]
 +
* [https://wustl.app.box.com/s/dx0f8ycz82e3r894rnpk0cjw8tj8jqzu Dealing With Final: Using Array Slots]
 +
* [https://wustl.app.box.com/s/wix18u57v7x1rdx1d1dgfedf6ibr0n92 Dealing With Final: IntegerRange]
 +
* [https://wustl.app.box.com/s/p3ooceekguiaygg4hggw76rbpt28kf6o CSE]
 +
* [https://wustl.app.box.com/s/smckzu78cvg8seocgluuo1afydgd0gmt Finish & Async Coarsening: N-Way Split]
 +
* [https://wustl.app.box.com/s/wqrauxpsz8xskt897qyukm6rkx9u4zmn Divide and Conquer: Array Sum Example]
 +
-->
  
===countSequential===
+
=The Core Questions=
 +
*What are the tasks?
 +
*What is the data?
 +
*Is the data mutable?
 +
*If so, how is it shared?
  
We recommend implementing a sequential solution before moving on to parallel solutions. In order to do this, please modify the <code>countSequential</code> method. When you’re ready to begin, delete the return statement and begin implementing your solution! Please refer to your notes from lecture for help, as we tackled a very similar problem in class.
+
=Mistakes to Avoid=
 +
{{Warning | Be sure to remove each NotYetImplementedException as you implement your solutions.}}
 +
{{Warning | Do NOT copy the data.  We are simply reading the chromosome data, so there is no reason to copy it.}}
 +
{{Warning | Do NOT share data beyond what is necessary.  Do NOT use static class fields when you could use local variables instead.}}
 +
{{Warning | Do NOT have the same for loop code duplicated throughout your solution.  Invoke countRangeSequential where appropriate.}}
 +
{{Warning | Do NOT not invoke countSequential to count the entire chromosome by mistake when you want to invoke countRangeSequential to count the range from [min,max).}}
  
Hint: maybe some loops might be best for counting things?
+
=Getting Started=
 +
*[[Initial_Setup]]
  
==Parallel Solutions==
+
=Code to Implement=
 +
==Midpoint==
 +
{{CodeToImplement|MidpointUtils|caclulateMidpoint|midpoint.lab}}
  
You need to implement two different parallel solutions to this problem. The first will involve splitting the array into two equal halves, then going through each half of the array in parallel. The second will involve splitting the array into n different pieces, then going through each of those pieces of the array in parallel.
+
{{Sequential|public static int calculateMidpoint(int a, int b)}}
 
===calculateMidpoint===
 
  
 
In this method, you will need to calculate the midpoint between two numbers. This is as simple as finding the average between the two numbers. There is no need to worry about rounding correctly, just drop everything after the decimal point if the midpoint is not automatically an int.
 
In this method, you will need to calculate the midpoint between two numbers. This is as simple as finding the average between the two numbers. There is no need to worry about rounding correctly, just drop everything after the decimal point if the midpoint is not automatically an int.
===Upper Lower Split===
 
In order to start with then two equal halves solution, please modify the <code>countParallelUpperLowerSplit</code> method. When you’re ready to begin, delete the return statement and begin implementing your solution! Again, please refer to your notes from lecture for help, as we tackled a very similar problem in class.
 
 
Hint: don’t forget the finish block! The format will be very similar to the examples.
 
  
===Coarsening N-Way Split===
+
There are at least two correct ways to implement the midpoint: 
After you’re finished with that, please modify the <code>countParallelNWaySplit</code> method for the n different pieces solution. As before, when you’re ready to begin, delete the return statement and begin implementing your solution! Again, please refer to your notes from lecture for help, as we tackled a very similar problem in class.
+
 
   
+
: Option A: <math>\frac{min+max}{2}</math>
Hint: make an array to store the results of each task. Split the array into n different chunks, each of which contains 1/n elements. Once each task has summed up the results from its chunk, add the results from each chunk to get your final answer.
+
 
 +
: Option B: <math>min+\frac{max-min}{2}</math>
 +
 
 +
It is hard to find 1D references for midpoint[http://mathworld.wolfram.com/ Wolfram MathWorld] has a breakdown of [http://mathworld.wolfram.com/Midpoint.html Midpoint in 2D and 3D].  Pick any dimension you like.
 +
 
 +
==Threshold==
 +
{{CodeToImplement|GreaterThanThresholdPredicate|constructor<br>test|threshold.lab}}
  
===Divide and Conquer===
+
{{Sequential|public GreaterThanThresholdPredicate(int threshold)}}
  
=Testing Your Solution=
+
{{Sequential|public boolean test(int value)}}
=== CountTestSuite ===
 
Launch CountTestSuite.java as a JUnit Test to run all of the tests.  CountTestSuite.java is located in src/test/java count.assignment package.  You can initiate this via right clicking on CountTestSuite.java and selecting "Run As..." -> "JUnit Test".
 
  
{{TestSuite|CountTestSuite|count.assignment}}
+
Class GreaterThanThresholdPredicate implements [https://docs.oracle.com/javase/8/docs/api/java/util/function/IntPredicate.html Java IntPredicate interface]. The test method should return a boolean value that indicates whether the input argument is '''greater than''' the threshold provided in the class constructor.
  
{{Performance|NucleobaseCountTiming|count.assignment}}
+
==Count Range==
 +
{{CodeToImplement|NucleobaseCounting|countRange<br>countSequential<br>countParallelLowerUpperSplit<br>countParallelNWaySplit<br>countParallelDivideAndConquerKernel|count.lab}}
  
=== VM Argument ===
+
{{Sequential|public static int countRangeSequential(byte[] chromosome, Nucleobase targetNucleobase, int min, int maxExclusive)}}
[[http://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/hw1/run-junit-test-suite.png screenshot]]
 
  
Habanero requires a special VM argument.  Running your solution the first time will result in the text below being printed to the Console:
+
This utility method should count the number of times a particular nucleobase occurs in the array between [min, maxExclusive).
 +
<spoiler show="spoiler" hide="spoiler">Use a for loop to iterate from min to maxExclusive.</spoiler>
  
<span style="font-family:Courier;color:red">'''You need to add Habanero-Java to your VM arguments.  Exiting.<br>The text below has been put in your copy buffer.  Paste it into VM Arguments.<br>-javaagent:"~/.m2/repository/edu/rice/hjlib-cooperative/0.1.14-SNAPSHOT/hjlib-cooperative-0.1.14-SNAPSHOT.jar"'''</span>
+
==Sequential Solution==
  
NOTE: do NOT copy and paste the line above.  The "~" will be your home directory when you run it on your machine.
+
{{Sequential|public static int countSequential(byte[] chromosome, Nucleobase targetNucleobase)}}
  
The text you need paste into the VM Arguments for the run configuration will automatically be placed into the copy buffer already. Follow [http://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/javaagent/index.html these steps] to give Habanero what it wants.
+
This solution should be achieved with a simple call to one of the utility methods you wrote.
 +
<spoiler show="spoiler" hide="spoiler">Invoke countRangeSequential from 0 to the chromosome array length.</spoiler>
  
Be sure to preserve the -ea and add a space after it when pasting the VM arg.
+
==Parallel Solutions==
 +
 
 +
You need to implement three different parallel solutions to this problem. The first will involve splitting the array into two equal halves, then going through each half of the array in parallel. The second will involve splitting the array into n different pieces, then going through each of those pieces of the array in parallel.  The third and final implementation will recursively divide the array until a threshold is reached.
 
   
 
   
If you need any further help or clarification, please don’t hesitate to post to piazza and/or reach out to one of the instructors.
+
===Lower/Upper Split===
 +
{{Parallel|public static int countParallelLowerUpperSplit(byte[] chromosome, Nucleobase targetNucleobase)}}
 +
 
 +
In order to start with then two equal halves solution, please modify the <code>countParallelUpperLowerSplit</code> method.
 +
 
 +
===Coarsening N-Way Split===
 +
{{Parallel|public static int countParallelNWaySplit(byte[] chromosome, Nucleobase targetNucleobase, int numTasks)}}
 +
 
 +
After you’re finished with that, please modify the <code>countParallelNWaySplit</code> method for splitting the work up into n tasks solution.
 +
 
 +
If the chromosome was 1,000,000 long, and you were asked to split it up in 10 tasks, each task should obviously process 100,000.  Things get slightly tricky if the chromosome were 1,000,005 long.
 +
 
 +
<youtube>hghmrLekmRs</youtube>
 +
 
 +
===Divide and Conquer===
 +
In a pattern that we will use repeatedly in the course, we have a public method that is asked to count an entire chromosome.  This method just calls the kernel method to do all of the work:
 +
 
 +
<nowiki> public static int countParallelDivideAndConquer(byte[] chromosome, Nucleobase targetNucleobase, IntPredicate isParallelRangeLengthPredicate)
 +
throws InterruptedException, ExecutionException {
 +
return countParallelDivideAndConquerKernel(chromosome, targetNucleobase, isParallelRangeLengthPredicate, 0, chromosome.length);
 +
}</nowiki>
  
=== Line Of Red Text On The Mac ===
+
{{Parallel|static int countParallelDivideAndConquerKernel(byte[] chromosome, Nucleobase targetNucleobase, IntPredicate isParallelRangeLengthPredicate,
In computer science (and life in general) you should not make a habit of ignoring red text.
+
int min, int maxExclusive)}}
  
However, there is a benign bug on the Mac which reports:
+
The kernel method will recursively divide the work up in parallel until the range drops below the threshold specified by isParallelRangeLengthPredicate.  At that point it will simply use the sequential count range utility method.
  
<span style="font-family:Courier;color:red">'''"objc[6477]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/bin/java and /Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will be used. Which one is undefined." '''</span>
+
=Testing Your Solution=
 +
How to effectively use JUnit:
  
Reportedly, this has finally been fixed and will be released soon:
+
<youtube>2pmgpWicfA0</youtube>
 +
 
 +
==Correctness==
 +
{{TestSuite|CountTestSuite|count.lab}}
 +
 
 +
Launch CountTestSuite.java as a JUnit Test to run all of the tests.  CountTestSuite.java is located in src/test/java count.lab package.  You can initiate this via right clicking on CountTestSuite.java and selecting "Run As..." -> "JUnit Test".
 +
[[http://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/hw1/run-junit-test-suite.png screenshot]]
  
https://stackoverflow.com/questions/43003012/class-javalaunchhelper-is-implemented-in-two-places
+
==Performance==
 +
{{Performance|NucleobaseCountTiming|count.lab}}
 +
After your JUnit tests are passing and you are confident that your solutions are correct, move on to checking out the timing performance of your different solutions with different task counts.
  
For now, you should ignore this.
+
=Pledge, Acknowledgments, Citations=
 +
{{Pledge|lab-count}}
  
 
=Rubric=
 
=Rubric=
Line 96: Line 158:
 
Total points: 100
 
Total points: 100
 
* Correct countRangeSequential (10)
 
* Correct countRangeSequential (10)
* Correct countSequential (5)
+
* Correct countSequential (10)
 
* Correct calculateMidpoint (5)
 
* Correct calculateMidpoint (5)
* Correct countParallelUpperLowerSplit (20)
+
* Correct countParallelLowerUpperSplit (10)
* Correct countParallelNWaySplit (25)
+
* Parallel countParallelLowerUpperSplit (10)
* Correct countParallelDivideAndConquerKernel (25)
+
* Correct countParallelNWaySplit (15)
* Clarity and efficiency (10)
+
* Parallel countParallelNWaySplit (10)
 
+
* Correct GreaterThanThresholdPredicate (5)
=Optional Reading=
+
* Correct countParallelDivideAndConquerKernel (15)
*https://en.wikipedia.org/wiki/DNA
+
* Parallel countParallelDivideAndConquerKernel (10)
*https://en.wikipedia.org/wiki/FASTA_format
 
 
 
=Related Videos=
 
LoopingAndFinal: https://wustl.box.com/s/wix18u57v7x1rdx1d1dgfedf6ibr0n92
 
 
 
SliceArrayDemo: https://wustl.box.com/s/p3ooceekguiaygg4hggw76rbpt28kf6o
 
  
AddingParallelism: https://wustl.box.com/s/smckzu78cvg8seocgluuo1afydgd0gmt
+
Penalties may be assessed for clarity and efficiency.

Latest revision as of 04:36, 22 January 2022

Motivation

  • get our feet wet with parallel programming
  • gain experience with the async and finish constructs
  • take different approaches to splitting up the work

We will solve the problem sequentially, then split the work up into 2 tasks, then coarsen the work n-ways, and finally split up the work in a divide-and-conquer recursive style.

Background

Bioinformatics

For this assignment, you will be writing sequential and parallel code to count nucleobases in a human X chromosome.

DNA is made up of four nucleobases: cytosine, guanine, adenine, and thymine. A strand of DNA can thus be represented as a string of letters representing these nucleobases, for example: “ACCGCATAAAGTCC.” However, DNA sequencing is typically not 100% accurate, so some of the nucleobases are not read with high certainty. These bases can be represented as an “N.” A sequence then might look something like “NCCGCATNAAGTCC.” Your goal is to write code that counts the number of occurrences a particular nucleobase or uncertain reads.

We will be using actual data pulled from the US National Library of Medicine, a database maintained by the National Institute of Health. We have already provided you the code that you need to access the chromosome from the database and check your work. You must implement a sequential solution and three parallel solutions to count the given bases in these sequences.

For some more optional background on DNA and nucleotide bases, please check out

Parallel Programming

Parallel Constructs

Related Videos


The Core Questions

  • What are the tasks?
  • What is the data?
  • Is the data mutable?
  • If so, how is it shared?

Mistakes to Avoid

Attention niels epting.svg Warning: Be sure to remove each NotYetImplementedException as you implement your solutions.
Attention niels epting.svg Warning: Do NOT copy the data. We are simply reading the chromosome data, so there is no reason to copy it.
Attention niels epting.svg Warning: Do NOT share data beyond what is necessary. Do NOT use static class fields when you could use local variables instead.
Attention niels epting.svg Warning: Do NOT have the same for loop code duplicated throughout your solution. Invoke countRangeSequential where appropriate.
Attention niels epting.svg Warning: Do NOT not invoke countSequential to count the entire chromosome by mistake when you want to invoke countRangeSequential to count the range from [min,max).

Getting Started

Code to Implement

Midpoint

class: MidpointUtils.java Java.png
methods: caclulateMidpoint
package: midpoint.lab
source folder: student/src/main/java

method: public static int calculateMidpoint(int a, int b) Sequential.svg (sequential implementation only)

In this method, you will need to calculate the midpoint between two numbers. This is as simple as finding the average between the two numbers. There is no need to worry about rounding correctly, just drop everything after the decimal point if the midpoint is not automatically an int.

There are at least two correct ways to implement the midpoint:

Option A:
Option B:

It is hard to find 1D references for midpoint. Wolfram MathWorld has a breakdown of Midpoint in 2D and 3D. Pick any dimension you like.

Threshold

class: GreaterThanThresholdPredicate.java Java.png
methods: constructor
test
package: threshold.lab
source folder: student/src/main/java

method: public GreaterThanThresholdPredicate(int threshold) Sequential.svg (sequential implementation only)

method: public boolean test(int value) Sequential.svg (sequential implementation only)

Class GreaterThanThresholdPredicate implements Java IntPredicate interface. The test method should return a boolean value that indicates whether the input argument is greater than the threshold provided in the class constructor.

Count Range

class: NucleobaseCounting.java Java.png
methods: countRange
countSequential
countParallelLowerUpperSplit
countParallelNWaySplit
countParallelDivideAndConquerKernel
package: count.lab
source folder: student/src/main/java

method: public static int countRangeSequential(byte[] chromosome, Nucleobase targetNucleobase, int min, int maxExclusive) Sequential.svg (sequential implementation only)

This utility method should count the number of times a particular nucleobase occurs in the array between [min, maxExclusive).

Use a for loop to iterate from min to maxExclusive.

Sequential Solution

method: public static int countSequential(byte[] chromosome, Nucleobase targetNucleobase) Sequential.svg (sequential implementation only)

This solution should be achieved with a simple call to one of the utility methods you wrote.

Invoke countRangeSequential from 0 to the chromosome array length.

Parallel Solutions

You need to implement three different parallel solutions to this problem. The first will involve splitting the array into two equal halves, then going through each half of the array in parallel. The second will involve splitting the array into n different pieces, then going through each of those pieces of the array in parallel. The third and final implementation will recursively divide the array until a threshold is reached.

Lower/Upper Split

method: public static int countParallelLowerUpperSplit(byte[] chromosome, Nucleobase targetNucleobase) Parallel.svg (parallel implementation required)

In order to start with then two equal halves solution, please modify the countParallelUpperLowerSplit method.

Coarsening N-Way Split

method: public static int countParallelNWaySplit(byte[] chromosome, Nucleobase targetNucleobase, int numTasks) Parallel.svg (parallel implementation required)

After you’re finished with that, please modify the countParallelNWaySplit method for splitting the work up into n tasks solution.

If the chromosome was 1,000,000 long, and you were asked to split it up in 10 tasks, each task should obviously process 100,000. Things get slightly tricky if the chromosome were 1,000,005 long.

Divide and Conquer

In a pattern that we will use repeatedly in the course, we have a public method that is asked to count an entire chromosome. This method just calls the kernel method to do all of the work:

	public static int countParallelDivideAndConquer(byte[] chromosome, Nucleobase targetNucleobase, IntPredicate isParallelRangeLengthPredicate)
			throws InterruptedException, ExecutionException {
		return countParallelDivideAndConquerKernel(chromosome, targetNucleobase, isParallelRangeLengthPredicate, 0, chromosome.length);
	}

method: static int countParallelDivideAndConquerKernel(byte[] chromosome, Nucleobase targetNucleobase, IntPredicate isParallelRangeLengthPredicate, int min, int maxExclusive) Parallel.svg (parallel implementation required)

The kernel method will recursively divide the work up in parallel until the range drops below the threshold specified by isParallelRangeLengthPredicate. At that point it will simply use the sequential count range utility method.

Testing Your Solution

How to effectively use JUnit:

Correctness

class: CountTestSuite.java Junit.png
package: count.lab
source folder: testing/src/test/java

Launch CountTestSuite.java as a JUnit Test to run all of the tests. CountTestSuite.java is located in src/test/java count.lab package. You can initiate this via right clicking on CountTestSuite.java and selecting "Run As..." -> "JUnit Test". [screenshot]

Performance

class: NucleobaseCountTiming.java Noun Project stopwatch icon 386232 cc.svg
package: count.lab
source folder: src/main/java

After your JUnit tests are passing and you are confident that your solutions are correct, move on to checking out the timing performance of your different solutions with different task counts.

Pledge, Acknowledgments, Citations

file: lab-count-pledge-acknowledgments-citations.txt

More info about the Honor Pledge

Rubric

As always, please make sure to cite your work appropriately.

Total points: 100

  • Correct countRangeSequential (10)
  • Correct countSequential (10)
  • Correct calculateMidpoint (5)
  • Correct countParallelLowerUpperSplit (10)
  • Parallel countParallelLowerUpperSplit (10)
  • Correct countParallelNWaySplit (15)
  • Parallel countParallelNWaySplit (10)
  • Correct GreaterThanThresholdPredicate (5)
  • Correct countParallelDivideAndConquerKernel (15)
  • Parallel countParallelDivideAndConquerKernel (10)

Penalties may be assessed for clarity and efficiency.