Difference between revisions of "MapReduce Mapper Assignment"

From CSE231 Wiki

Revision as of 16:07, 21 February 2023

Motivation

In previous semesters the MapReduce exercise has proven to be the most challenging. We will start by building some Mappers on our way to the final boss.

Each of the Mappers built today can be paired with an Int Summing AccumulatorCombinerReducer:

  • a card mapper that matches the spec outlined in the prep video,
  • a simple word counting mapper, and
  • an analogous k-mer counting mapper.

Note: the k-mer counting mapper will prepare us for (and hopefully lessen the burden of) an exercise later in the semester.

Code To Use

Previous Exercise

DefaultEntry<K,V>

Provided

CardMapper Utilities

Deck implements Iterable<Card>

Card

card.rank()
card.suit()

Rank

rank.numericValue() note: returns Optional<Integer>

Suit

WordCount Mapper Utilities

TextSection

textSection.words()

K-mer Mapper Utilities

toStringKMer(sequence, offset, kMerLength)  
private static String toStringKMer(byte[] sequence, int offset, int kMerLength) {
	return new String(sequence, offset, kMerLength, StandardCharsets.UTF_8);
}

Code To Investigate

Note: each of the clients prints entries. The entries produced by the map methods of the Mappers are instances of DefaultEntry. The entries produced by the StreamFramework are instances of a different implementation of Entry. Their toString() methods might produce slightly different output, but rest assured they are all Entries.
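
To illustrate the note above, here is a minimal sketch of two Map.Entry implementations that hold the same key and value but print differently. SketchEntry is a hypothetical stand-in written for this example (it is not the course's DefaultEntry); only its toString() format is modeled on the client output shown below. AbstractMap.SimpleEntry is the standard JDK class.

```java
import java.util.AbstractMap;
import java.util.Map;

class EntryToStringSketch {
    // Hypothetical stand-in for DefaultEntry; only the toString() format imitates the client output.
    static final class SketchEntry<K, V> implements Map.Entry<K, V> {
        private final K key;
        private final V value;

        SketchEntry(K key, V value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public K getKey() {
            return key;
        }

        @Override
        public V getValue() {
            return value;
        }

        @Override
        public V setValue(V newValue) {
            // This sketch treats entries as immutable.
            throw new UnsupportedOperationException();
        }

        @Override
        public String toString() {
            return "DefaultEntry[" + key + "=>" + value + "]";
        }
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> a = new SketchEntry<>("you", 1);
        Map.Entry<String, Integer> b = new AbstractMap.SimpleEntry<>("you", 1);
        System.out.println(a); // prints DefaultEntry[you=>1]
        System.out.println(b); // prints you=1
        // Same key and value, regardless of how each entry prints.
        System.out.println(a.getKey().equals(b.getKey()) && a.getValue().equals(b.getValue())); // prints true
    }
}
```

Code that reads entries through getKey() and getValue() behaves the same for either implementation; only the printed form differs.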

Card Mapping Clients

CardMapperClient

class: CardMapperClient.java CLIENT
package: mapreduce.apps.cards.client
source folder: student/src/main/java
CardMapperClient  
Deck deck = Deck.createFull();
CardMapper mapper = new CardMapper();
List<Map.Entry<Suit, Integer>> keyValuePairs = mapper.map(deck);
keyValuePairs.forEach(kv -> {
	System.out.println(kv);
});
CardMapperClient Output  
DefaultEntry[SPADES=>10]
DefaultEntry[SPADES=>9]
DefaultEntry[SPADES=>8]
DefaultEntry[SPADES=>7]
DefaultEntry[SPADES=>6]
DefaultEntry[SPADES=>5]
DefaultEntry[SPADES=>4]
DefaultEntry[SPADES=>3]
DefaultEntry[SPADES=>2]
DefaultEntry[HEARTS=>10]
DefaultEntry[HEARTS=>9]
DefaultEntry[HEARTS=>8]
DefaultEntry[HEARTS=>7]
DefaultEntry[HEARTS=>6]
DefaultEntry[HEARTS=>5]
DefaultEntry[HEARTS=>4]
DefaultEntry[HEARTS=>3]
DefaultEntry[HEARTS=>2]
DefaultEntry[DIAMONDS=>10]
DefaultEntry[DIAMONDS=>9]
DefaultEntry[DIAMONDS=>8]
DefaultEntry[DIAMONDS=>7]
DefaultEntry[DIAMONDS=>6]
DefaultEntry[DIAMONDS=>5]
DefaultEntry[DIAMONDS=>4]
DefaultEntry[DIAMONDS=>3]
DefaultEntry[DIAMONDS=>2]
DefaultEntry[CLUBS=>10]
DefaultEntry[CLUBS=>9]
DefaultEntry[CLUBS=>8]
DefaultEntry[CLUBS=>7]
DefaultEntry[CLUBS=>6]
DefaultEntry[CLUBS=>5]
DefaultEntry[CLUBS=>4]
DefaultEntry[CLUBS=>3]
DefaultEntry[CLUBS=>2]

CardMapReduceClient

class: CardMapReduceClient.java CLIENT
package: mapreduce.apps.cards.client
source folder: student/src/main/java
CardMapReduceClient  
Deck[] decks = {
	Deck.createFull(),
	Deck.createFull(),
	Deck.createFull(),
	Deck.createFull(),
};
CardMapper mapper = new CardMapper();
AccumulatorCombinerReducer<Integer, ?, Integer> accumulatorCombinerReducer = StreamUtils.summingIntAccumulatorCombinerReducer();
MapReduceFramework<Deck, Suit, Integer, ?, Integer> framework = new StreamMapReduceFramework<>(mapper, accumulatorCombinerReducer);
Map<Suit, Integer> map = framework.mapReduceAll(decks);
map.entrySet().forEach(entry -> {
	System.out.println(entry);
});
CardMapReduceClient Output  
HEARTS=216
SPADES=216
DIAMONDS=216
CLUBS=216
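
As a sanity check on the totals above, the arithmetic can be sketched as follows. This assumes, based on the CardMapperClient output, that the numeric cards in each suit are 2 through 10; the class and method names are made up for this example.

```java
import java.util.stream.IntStream;

class SuitTotalSketch {
    // Sum of the numeric card values in one suit of one deck: 2 + 3 + ... + 10.
    static int numericSumPerSuit() {
        return IntStream.rangeClosed(2, 10).sum();
    }

    // Each additional full deck contributes the same per-suit sum.
    static int totalPerSuit(int deckCount) {
        return deckCount * numericSumPerSuit();
    }

    public static void main(String[] args) {
        System.out.println(numericSumPerSuit()); // prints 54
        System.out.println(totalPerSuit(4));     // prints 216
    }
}
```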

Word Count Mapping Clients

The word count mapping example clients use the beginning of If--- by Rudyard Kipling.

WordCountMapperClient

class: WordCountMapperClient.java CLIENT
package: mapreduce.apps.wordcount.client
source folder: student/src/main/java
WordCountMapperClient  
TextSection textSection = new TextSection("If you can keep your head when all about you");
WordCountMapper mapper = new WordCountMapper();
List<Map.Entry<String, Integer>> keyValuePairs = mapper.map(textSection);
keyValuePairs.forEach(kv -> {
    System.out.println(kv);
});
WordCountMapperClient Output  
DefaultEntry[if=>1]
DefaultEntry[you=>1]
DefaultEntry[can=>1]
DefaultEntry[keep=>1]
DefaultEntry[your=>1]
DefaultEntry[head=>1]
DefaultEntry[when=>1]
DefaultEntry[all=>1]
DefaultEntry[about=>1]
DefaultEntry[you=>1]

WordCountMapReduceClient

class: WordCountMapReduceClient.java CLIENT
package: mapreduce.apps.wordcount.client
source folder: student/src/main/java
WordCountMapReduceClient  
TextSection[] textSections = {
        new TextSection("If you can keep your head when all about you"),
        new TextSection("   Are losing theirs and blaming it on you,"),
};
WordCountMapper mapper = new WordCountMapper();
AccumulatorCombinerReducer<Integer, ?, Integer> accumulatorCombinerReducer = StreamUtils.summingIntAccumulatorCombinerReducer();
MapReduceFramework<TextSection, String, Integer, ?, Integer> framework = new StreamMapReduceFramework<>(mapper, accumulatorCombinerReducer);
Map<String, Integer> map = framework.mapReduceAll(textSections);
map.entrySet().forEach(entry -> {
    System.out.println(entry);
});
WordCountMapReduceClient Output  
all=1
theirs=1
about=1
it=1
your=1
when=1
losing=1
head=1
can=1
blaming=1
are=1
and=1
keep=1
if=1
you=3
on=1

K-mer Mapping Clients

The k-mer mapping example clients use short example DNA sequences, such as ACTCATGAG.

KMerMapperClient

class: KMerMapperClient.java CLIENT
package: mapreduce.apps.wordcount.client
source folder: student/src/main/java
KMerMapperClient  
byte[] sequence = "ACTCATGAG".getBytes(StandardCharsets.UTF_8);
KMerMapper mapper = new KMerMapper(3);
List<Map.Entry<String, Integer>> keyValuePairs = mapper.map(sequence);
keyValuePairs.forEach(kv -> {
    System.out.println(kv);
});
KMerMapperClient Output  
DefaultEntry[ACT=>1]
DefaultEntry[CTC=>1]
DefaultEntry[TCA=>1]
DefaultEntry[CAT=>1]
DefaultEntry[ATG=>1]
DefaultEntry[TGA=>1]
DefaultEntry[GAG=>1]

KMerMapReduceClient

class: KMerMapReduceClient.java CLIENT
package: mapreduce.apps.wordcount.client
source folder: student/src/main/java
KMerMapReduceClient  
byte[][] sequences = {
        "ACTCATGAG".getBytes(StandardCharsets.UTF_8),
        "CATGAAAAAA".getBytes(StandardCharsets.UTF_8),
};
KMerMapper mapper = new KMerMapper(3);
AccumulatorCombinerReducer<Integer, ?, Integer> accumulatorCombinerReducer = StreamUtils.summingIntAccumulatorCombinerReducer();
MapReduceFramework<byte[], String, Integer, ?, Integer> framework = new StreamMapReduceFramework<>(mapper, accumulatorCombinerReducer);
Map<String, Integer> map = framework.mapReduceAll(sequences);
map.entrySet().forEach(entry -> {
    System.out.println(entry);
});
KMerMapReduceClient Output  
AAA=4
ACT=1
TCA=1
CTC=1
ATG=2
GAA=1
CAT=2
GAG=1
TGA=2

Code To Implement

CardMapper

The specification for this mapper is outlined in the prep video:

Video: Learning MapReduce with Playing Cards  


Non-numeric cards are considered to be bad data and ignored. Numeric cards should be emitted with their suit as the key and the numeric value as the value. Emitted key-value pairs are returned in a list of Entries.
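
The "ignore bad data" rule pairs naturally with the Optional<Integer> returned by rank.numericValue(). The sketch below shows one way to act on an Optional only when a value is present. SketchSuit, SketchRank, and SketchCard are hypothetical, cut-down stand-ins for the provided Suit, Rank, and Card types, and AbstractMap.SimpleEntry stands in for DefaultEntry, so treat this as an illustration of the Optional idiom rather than the assignment's actual classes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class CardFilterSketch {
    enum SketchSuit { SPADES, HEARTS }

    // Hypothetical rank type: face cards and aces carry no numeric value.
    enum SketchRank {
        TWO(2), TEN(10), JACK(null), ACE(null);

        private final Integer value;

        SketchRank(Integer value) {
            this.value = value;
        }

        Optional<Integer> numericValue() {
            return Optional.ofNullable(value);
        }
    }

    static final class SketchCard {
        final SketchRank rank;
        final SketchSuit suit;

        SketchCard(SketchRank rank, SketchSuit suit) {
            this.rank = rank;
            this.suit = suit;
        }
    }

    static List<SketchCard> sampleCards() {
        return Arrays.asList(
                new SketchCard(SketchRank.TEN, SketchSuit.SPADES),
                new SketchCard(SketchRank.JACK, SketchSuit.SPADES),
                new SketchCard(SketchRank.TWO, SketchSuit.HEARTS),
                new SketchCard(SketchRank.ACE, SketchSuit.HEARTS));
    }

    // Keeps only cards whose Optional holds a value; empty Optionals are skipped.
    static List<Map.Entry<SketchSuit, Integer>> numericEntries(List<SketchCard> cards) {
        List<Map.Entry<SketchSuit, Integer>> result = new ArrayList<>();
        for (SketchCard card : cards) {
            card.rank.numericValue()
                    .ifPresent(v -> result.add(new AbstractMap.SimpleEntry<>(card.suit, v)));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(numericEntries(sampleCards())); // prints [SPADES=10, HEARTS=2]
    }
}
```

Using ifPresent avoids calling get() on an empty Optional, which would throw NoSuchElementException.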

class: CardMapper.java
methods: map
package: mapreduce.app.cards.exercise
source folder: student/src/main/java

method: List<Map.Entry<Suit, Integer>> map(Deck deck) (sequential implementation only)

Word Count Mapper

Counting occurrences of words in text is a classic example of MapReduce. We will ignore any zero-length words and convert the remaining words to lower case so as to get a case-insensitive count. Emitting each lower-cased word as the key with the value of 1 should do the trick here.

class: WordCountMapper.java
methods: map
package: mapreduce.apps.wordcount.exercise
source folder: student/src/main/java

method: public void map(TextSection textSection, BiConsumer<String, Integer> keyValuePairConsumer) (sequential implementation only)

The goal of this implementation is to count the number of times a word appears in a given text, using MapReduce. To accomplish this, you will need to create both the mapper and the reducer. Navigate to the WordCountMapper.java and IntSumListAccumulatingReducer.java classes. You will specifically define how the framework accomplishes the map and reduce methods.

The only method you will need to alter is the map method. In this method, you need to record every instance of a given word and feed it into the keyValuePairConsumer. To do this, access all of the words in the TextSection and if the length of the word is greater than zero (meaning it is not just blank space), convert it into lower-case and accept it into the consumer.

Hint: Look at the methods in TextSection and the toLowerCase() method for strings for assistance.
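
The filtering and lower-casing steps described above can be sketched in isolation. This example assumes the words arrive as a String[]; the actual return type of textSection.words() is not shown here, so that is a guess, and the class and method names are invented for the example. It stops short of emitting key-value pairs.

```java
import java.util.ArrayList;
import java.util.List;

class WordNormalizeSketch {
    // Drops zero-length words and lower-cases the rest.
    static List<String> normalizedWords(String[] rawWords) {
        List<String> result = new ArrayList<>();
        for (String word : rawWords) {
            if (word.length() > 0) {
                result.add(word.toLowerCase());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Empty strings model the zero-length words that extra whitespace can produce.
        String[] raw = {"", "Are", "losing", "", "theirs"};
        System.out.println(normalizedWords(raw)); // prints [are, losing, theirs]
    }
}
```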

K-mer Count Mapper

K-mer counting is a useful technique in bioinformatics: http://www.csbio.unc.edu/mcmillan/Comp555S17/Lecture02.pdf

Further background information on k-mer counting can be found here: https://en.wikipedia.org/wiki/K-mer

The 3-mers in the chromosome:

ACTCATGAG

are:

ACT
 CTC
  TCA
   CAT
    ATG
     TGA
      GAG
class: KMerMapper.java
methods: map
package: mapreduce.apps.kmer.studio
source folder: student/src/main/java

method: public void map(byte[] sequence, BiConsumer<String, Integer> keyValuePairConsumer) (sequential implementation only)

Be sure to use the provided toStringKMer(sequence, offset, kMerLength) method to generate your k-mers:

private static String toStringKMer(byte[] sequence, int offset, int kMerLength) {
	return new String(sequence, offset, kMerLength, StandardCharsets.UTF_8);
}

This mapper is similar to the Word Count Mapper except that the k-mers overlap with each other while words are separate.
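
The overlap can be made concrete with a sliding window: a sequence of length n contains n - k + 1 k-mers, one starting at each offset from 0 through n - k. The sketch below only enumerates the k-mer strings and leaves the emitting of key-value pairs aside. The toStringKMer method is copied from the provided utility; the class name and the kMers method are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class KMerWindowSketch {
    // Copied from the provided utility.
    private static String toStringKMer(byte[] sequence, int offset, int kMerLength) {
        return new String(sequence, offset, kMerLength, StandardCharsets.UTF_8);
    }

    // Slides a window of length k across the sequence, one offset at a time.
    static List<String> kMers(byte[] sequence, int k) {
        List<String> result = new ArrayList<>();
        // The last valid window starts at offset sequence.length - k.
        for (int offset = 0; offset + k <= sequence.length; offset++) {
            result.add(toStringKMer(sequence, offset, k));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] sequence = "ACTCATGAG".getBytes(StandardCharsets.UTF_8);
        System.out.println(kMers(sequence, 3)); // prints [ACT, CTC, TCA, CAT, ATG, TGA, GAG]
    }
}
```

If k is larger than the sequence length, the loop body never runs and the result is empty.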

As the emitted values for each key will be later summed up in the reduction phase, what value makes sense to emit with each key?

Testing Your Solution

class: _MappersSuitableForPairingWithIntSummingReducerTestSuite.java
package: mapreduce.apps
source folder: testing/src/test/java

Pledge, Acknowledgments, Citations

file: map-reduce-mapper-pledge-acknowledgments-citations.txt

More info about the Honor Pledge