Nucleobase Counting

From CSE231 Wiki
Jump to navigation Jump to search

Motivation

  • get our feet wet with parallel programming
  • gain experience with the async and finish constructs
  • take different approaches to splitting up the work

We will solve the problem sequentially, then split the work up into 2 tasks, then coarsen the work n-ways, and finally split up the work in a divide-and-conquer recursive style.

Background

For this assignment, you will be writing sequential and parallel code to count nucleobases in a human X chromosome.

DNA is made up of four nucleobases: cytosine, guanine, adenine, and thymine. A strand of DNA can thus be represented as a string of letters representing these nucleobases, for example: “ACCGCATAAAGTCC.” However, DNA sequencing is typically not 100% accurate, so some of the nucleobases are not read with high certainty. These bases can be represented as an “N.” A sequence then might look something like “NCCGCATNAAGTCC.” Your goal is to write code that counts the number of a particular nucleobase.

We will be using actual data pulled from the US National Library of Medicine, a database maintained by the National Institute of Health. We have already provided you the code that you need to access the chromosome from the database and check your work. You must implement a sequential solution and three parallel solutions to count the given bases in these sequences.

For some more optional background on DNA and nucleotide bases, please refer to the links under #Optional Reading.

Mistakes to Avoid

Attention niels epting.svg Warning: Do NOT copy the data. We are simply reading the chromosome data, so there is no reason to copy it.

Code to Implement

class: NucleobaseCounting.java Java.png
methods: countRange
countSequential
caclulateMidpoint
countParallelLowerUpperSplit
countParallelNWaySplit
countParallelDivideAndConquerKernel
package: count.assignment
source folder: student/src/main/java


You will find the starting point for this assignment is in the count folder. In the count.assignment package you will find [NucleobaseCounting.java].

Utility Methods

method: countRangeSequential Sequential.svg (sequential implementation only)

Use a for loop to iterate from min to maxExclusive.

method: calculateMidpoint Sequential.svg (sequential implementation only)

In this method, you will need to calculate the midpoint between two numbers. This is as simple as finding the average between the two numbers. There is no need to worry about rounding correctly, just drop everything after the decimal point if the midpoint is not automatically an int.

There are at least two correct ways to implement the midpoint:

Option A:
Option B:

It is hard to find 1D references for midpoint. Wolfram MathWorld has a breakdown of Midpoint in 2D and 3D. Pick any dimension you like.


Sequential Solution

method: countSequential Sequential.svg (sequential implementation only)


Invoke countRangeSequential from 0 to the chromosome array length.

We recommend implementing a sequential solution before moving on to parallel solutions. In order to do this, please modify the countSequential method. When you’re ready to begin, delete the return statement and begin implementing your solution! Please refer to your notes from lecture for help, as we tackled a very similar problem in class.

Hint: maybe some loops might be best for counting things?

Parallel Solutions

You need to implement three different parallel solutions to this problem. The first will involve splitting the array into two equal halves, then going through each half of the array in parallel. The second will involve splitting the array into n different pieces, then going through each of those pieces of the array in parallel. The third and final implementation will recursively divide the array until a threshold is reached.

Lower/Upper Split

method: countParallelUpperLowerSplit Parallel.svg (parallel implementation required)

In order to start with then two equal halves solution, please modify the countParallelUpperLowerSplit method. When you’re ready to begin, delete the return statement and begin implementing your solution! Again, please refer to your notes from lecture for help, as we tackled a very similar problem in class.

Hint: don’t forget the finish block! The format will be very similar to the examples.

Coarsening N-Way Split

method: countParallelNWaySplit Parallel.svg (parallel implementation required)

After you’re finished with that, please modify the countParallelNWaySplit method for the n different pieces solution. As before, when you’re ready to begin, delete the return statement and begin implementing your solution! Again, please refer to your notes from lecture for help, as we tackled a very similar problem in class.

Hint: make an array to store the results of each task. Split the array into n different chunks, each of which contains 1/n elements. Once each task has summed up the results from its chunk, add the results from each chunk to get your final answer.

Divide and Conquer

method: countParallelDivideAndConquerKernel Parallel.svg (parallel implementation required)

Testing Your Solution

CountTestSuite

Launch CountTestSuite.java as a JUnit Test to run all of the tests. CountTestSuite.java is located in src/test/java count.assignment package. You can initiate this via right clicking on CountTestSuite.java and selecting "Run As..." -> "JUnit Test".

class: CountTestSuite.java Junit.png
package: count.assignment
source folder: testing/src/test/java
class: NucleobaseCountTiming.java Noun Project stopwatch icon 386232 cc.svg
package: count.assignment
source folder: src/main/java

VM Argument

[screenshot]

Habanero requires a special VM argument. Running your solution the first time will result in the text below being printed to the Console:

You need to add Habanero-Java to your VM arguments. Exiting.
The text below has been put in your copy buffer. Paste it into VM Arguments.
-javaagent:"~/.m2/repository/edu/rice/hjlib-cooperative/0.1.14-SNAPSHOT/hjlib-cooperative-0.1.14-SNAPSHOT.jar"

NOTE: do NOT copy and paste the line above. The "~" will be your home directory when you run it on your machine.

The text you need paste into the VM Arguments for the run configuration will automatically be placed into the copy buffer already. Follow these steps to give Habanero what it wants.

Be sure to preserve the -ea and add a space after it when pasting the VM arg.

If you need any further help or clarification, please don’t hesitate to post to piazza and/or reach out to one of the instructors.

Line Of Red Text On The Mac

In computer science (and life in general) you should not make a habit of ignoring red text.

However, there is a benign bug on the Mac which reports:

"objc[6477]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/bin/java and /Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will be used. Which one is undefined."

Reportedly, this has finally been fixed and will be released soon:

https://stackoverflow.com/questions/43003012/class-javalaunchhelper-is-implemented-in-two-places

For now, you should ignore this.

Rubric

As always, please make sure to cite your work appropriately.

Total points: 100

  • Correct countRangeSequential (10)
  • Correct countSequential (5)
  • Correct calculateMidpoint (5)
  • Correct and Parallel countParallelLowerUpperSplit (20)
  • Correct and Parallel countParallelNWaySplit (25)
  • Correct and Parallel countParallelDivideAndConquerKernel (25)
  • Clarity and efficiency (10)

Optional Reading

Related Videos

LoopingAndFinal: https://wustl.box.com/s/wix18u57v7x1rdx1d1dgfedf6ibr0n92

SliceArrayDemo: https://wustl.box.com/s/p3ooceekguiaygg4hggw76rbpt28kf6o

AddingParallelism: https://wustl.box.com/s/smckzu78cvg8seocgluuo1afydgd0gmt