Difference between revisions of "MapReduce Frameworks Lab"

From CSE231 Wiki

Revision as of 16:37, 27 October 2017

Background

As the name suggests, MapReduce refers to the process of mapping and then reducing some stream of data. At its core, all a MapReduce algorithm entails is transforming a list of one kind of item before collecting those items and reducing them down to a single value using some computation. As you can probably tell, the concept of MapReduce is extremely general and can apply to a wide range of problems. In this assignment, we will use MapReduce to simulate the Facebook mutual friends algorithm (finding the number of mutual friends between two people) and a word count algorithm. As studios, you will use MapReduce to find which infected wells are causing an outbreak of cholera in historic London and to map a deck of cards.

For more information on the general concept of MapReduce, refer to this article.

Where to Start

You can find all of the relevant files for this assignment under the mapreduce directory. From there, all of the classes you will need to implement can be found under mapreduce.assignment or mapreduce.studio. The core directories contain utility and building-block classes we created for you, and the viz directories contain visualization apps that might help you understand your code from a visual standpoint. Take a look at these classes to get a better understanding of how to use them for your assignment.

Java Interfaces

To allow our frameworks to work well with JDK8 Streams, we employ a couple of standard interfaces rather than creating our own custom ones.

BiConsumer

We use the standard BiConsumer<T,U> interface with its accept(t,u) method in place of a mythical, custom MapContext<K,V> interface with an emit(key,value) method.

public interface BiConsumer<T, U> {
    // invoke accept with each key and value you wish to emit
    void accept(T t, U u);
}

Collector READ THE JAVADOC

We use the standard Collector<T,A,R> interface in place of a mythical, custom AccumulatorReducer<V,A,R> interface.

Note: we strongly encourage you to read the Collector<T,A,R> Javadoc.

To repeat: we strongly encourage you to read the Collector<T,A,R> Javadoc.

public interface Collector<T, A, R> {

	// invoke supplier().get() to create a new mutable container
	Supplier<A> supplier();

	// invoke accumulator().accept(container, item) to add item to a container
	BiConsumer<A, T> accumulator();

	// invoke combiner().apply(containerA, containerB) to combine one container into the other
	BinaryOperator<A> combiner();

	// invoke finisher().apply(container) to reduce a container to its final form
	Function<A, R> finisher();
}

supplier get

We use supplier().get() to create a new mutable container. For classic map reduce this would be a List<V>.

mythical code analogy: container = collector.supplier().get() corresponds to container = accumulatorReducer.createContainer()

accumulator accept

We use accumulator().accept(container,item) to accumulate a value. For classic map reduce this would add an item to a list.

mythical code analogy: collector.accumulator().accept(container,item) corresponds to accumulatorReducer.accumulate(container,item)

combiner apply

We use combiner().apply(containerA,containerB) to combine two accumulators. You may combine containerB into containerA or containerA into containerB. Just return whichever is the combined result.

mythical code analogy: collector.combiner().apply(containerA,containerB) corresponds to accumulatorReducer.combine(containerA,containerB)

finisher apply

We use finisher().apply(container) to reduce an accumulator.

mythical code analogy: r = collector.finisher().apply(container) corresponds to r = accumulatorReducer.reduce(container)
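
The four Collector steps above can be exercised by hand. The following is a minimal sketch, not the course's reducer: it assumes a list-backed integer-sum collector built with the standard Collector.of factory, and the names CollectorWalkthrough, listSum, and collectInTwoHalves are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;

final class CollectorWalkthrough {
    // Hypothetical list-backed sum collector: T = Integer, A = List<Integer>, R = Integer.
    static Collector<Integer, List<Integer>, Integer> listSum() {
        return Collector.of(
                ArrayList::new,                       // supplier: create a new mutable container
                List::add,                            // accumulator: add one item to a container
                (a, b) -> { a.addAll(b); return a; }, // combiner: fold b into a and return a
                list -> {                             // finisher: reduce the container to its final form
                    int sum = 0;
                    for (int v : list) {
                        sum += v;
                    }
                    return sum;
                });
    }

    // Accumulates the two halves of items into separate containers, then combines and finishes.
    static <T, A, R> R collectInTwoHalves(List<T> items, Collector<T, A, R> collector) {
        A left = collector.supplier().get();
        A right = collector.supplier().get();
        int mid = items.size() / 2;
        for (T item : items.subList(0, mid)) {
            collector.accumulator().accept(left, item);
        }
        for (T item : items.subList(mid, items.size())) {
            collector.accumulator().accept(right, item);
        }
        A combined = collector.combiner().apply(left, right);
        return collector.finisher().apply(combined);
    }
}
```

Splitting the input into two containers is only there to give the combiner something to do; a sequential framework could accumulate everything into a single container and skip the combine step.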

A Path To Victory

Watch MapReduce with Playing Cards

Read Finding Mutual Friends

Read java.util.stream.Collector Javadoc

Implement Cards MapReduce Studio

Implement WordCountConcreteStaticMapReduce

Implement MutualFriendsConcreteStaticMapReduce

Implement WordCountMapper

Implement IntegerSumClassicReducer

Implement MutualFriendsMapper

Implement MutualFriendsReducer

Implement ClassicReducer

Implement Simple MapReduce Framework

Implement Cholera MapReduce Studio

Implement Matrix MapReduce Framework

Warm Up

WordCountConcreteStaticMapReduce

Test Suite: WordCountMapReduceWarmUpTestSuite

Mapper

void map(TextSection textSection, BiConsumer<String, Integer> keyValuePairConsumer)

Reducer

List<Integer> reduceCreateList()

void reduceAccumulate(List<Integer> list, int v)

void reduceCombine(List<Integer> a, List<Integer> b)

int reduceFinish(List<Integer> list)

Framework

List<KeyValuePair<String, Integer>>[] mapAll(TextSection[] input)

Map<String, List<Integer>> accumulateAll(List<KeyValuePair<String, Integer>>[] mapAllResults)

Map<String, Integer> finishAll(Map<String, List<Integer>> accumulateAllResult)

MutualFriendsConcreteStaticMapReduce

Test Suite: MutualFriendsMapReduceWarmUpTestSuite

Mapper

void map(Account account, BiConsumer<OrderedPair<AccountId>, Set<AccountId>> keyValuePairConsumer)

Reducer

List<Set<AccountId>> reduceCreateList()

void reduceAccumulate(List<Set<AccountId>> list, Set<AccountId> v)

void reduceCombine(List<Set<AccountId>> a, List<Set<AccountId>> b)

MutualFriendIds reduceFinish(List<Set<AccountId>> list)

Framework

List<KeyValuePair<OrderedPair<AccountId>, Set<AccountId>>>[] mapAll(Account[] input)

Map<OrderedPair<AccountId>, List<Set<AccountId>>> accumulateAll(List<KeyValuePair<OrderedPair<AccountId>, Set<AccountId>>>[] mapAllResults)

Map<OrderedPair<AccountId>, MutualFriendIds> finishAll(Map<OrderedPair<AccountId>, List<Set<AccountId>>> accumulateAllResult)

Assignment

Test Suite: MapReduceAssignmentTestSuite

Mutual Friends MapReduce Assignment

The goal of this implementation is to create Facebook’s mutual friends algorithm using MapReduce. To accomplish this, you will need to create the mapper and the reducer. Navigate to the MutualFriendsMapper.java class. In this class, you will specifically define how the framework accomplishes the map method.

MutualFriendsMapper

The only method you will need to alter is the map method. In this method, you will need to pair the account holder with each of his or her friends. To do this, create an ordered pair of the given account's ID and each friend's ID. You must then feed each ordered pair into the keyValuePairConsumer as the key, with the full set of the account holder's friends as the value.

Hint: check out the methods in the Account class for help.
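
The emit pattern described above can be sketched without the course classes. This is an illustration under assumptions: plain int ids and a Set<Integer> stand in for Account and AccountId, a two-element sorted List<Integer> stands in for OrderedPair<AccountId>, and MutualFriendsMapSketch is a made-up name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;

final class MutualFriendsMapSketch {
    // Hypothetical stand-in for OrderedPair<AccountId>: the two ids in ascending order,
    // so that (2, 1) and (1, 2) produce the same key.
    static List<Integer> orderedPair(int a, int b) {
        return Arrays.asList(Math.min(a, b), Math.max(a, b));
    }

    // For each friend, emits (ordered pair of owner and friend, the owner's full friend set).
    static void map(int accountId, Set<Integer> friendIds,
                    BiConsumer<List<Integer>, Set<Integer>> keyValuePairConsumer) {
        for (int friendId : friendIds) {
            keyValuePairConsumer.accept(orderedPair(accountId, friendId), friendIds);
        }
    }
}
```

Because both account holders in a friendship emit under the same ordered key, the reduce stage ends up with two friend sets per pair to work with.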

MutualFriendsClassicReducer

Investigate MutualFriendIds for clues on what you need to do.

Word Count MapReduce Assignment

The goal of this implementation is to count the number of times a word appears in a given text, using MapReduce. To accomplish this, you will need to create both the mapper and the reducer. Navigate to the WordCountMapper.java and IntegerSumClassicReducer.java classes. You will specifically define how the framework accomplishes the map and reduce methods.

WordCountMapper

The only method you will need to alter is the map method. In this method, you need to record every occurrence of each word by feeding it into the keyValuePairConsumer. To do this, go through all of the words in the TextSection and, if the length of a word is greater than zero (meaning it is not just blank space), convert it to lowercase and pass it to the consumer's accept method.

Hint: Look at the methods in TextSection and the toLowerCase() method for strings for assistance.
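
A rough sketch of this mapper follows. It assumes a String[] of words in place of TextSection (whose methods are not shown here) and emits a count of 1 per occurrence, which is what an integer-sum reducer would need; WordCountMapSketch and emitted are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

final class WordCountMapSketch {
    // Emits (lowercased word, 1) for every non-empty word.
    // The String[] parameter stands in for the words of a TextSection (an assumption).
    static void map(String[] words, BiConsumer<String, Integer> keyValuePairConsumer) {
        for (String word : words) {
            if (word.length() > 0) {
                keyValuePairConsumer.accept(word.toLowerCase(), 1);
            }
        }
    }

    // Demonstration helper: splits on whitespace and records each emitted pair as "key=value".
    static List<String> emitted(String text) {
        List<String> out = new ArrayList<>();
        map(text.split("\\s+"), (k, v) -> out.add(k + "=" + v));
        return out;
    }
}
```

Splitting text that starts with whitespace yields a leading empty string, which is exactly the case the length check filters out.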

IntegerSumClassicReducer

The only method you will need to alter is the apply method. All you need to do is sum the values in a list of integers and return that sum.

Simple MapReduce Framework

Navigate to the SimpleMapReduceFramework.java class, where there are three methods for you to complete: mapAll, accumulateAll, and finishAll. These frameworks are meant to be extremely general and applied to more specific uses of MapReduce.

mapAll Method

With this method, you will map each element of an array of data into a list of key-value pairs, producing a new array of lists the same size as the input. To do this, invoke the mapper's map() method on each element, passing it a BiConsumer that adds each emitted key-value pair to an initially empty list. That list should then be stored in the array of lists you previously created, completing the mapping stage of MapReduce. This should all be done in parallel.

Hint: if you are creating an array of lists equivalent in size to the original array, your lists should probably consist of just one item.
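
One possible shape for this stage is sketched below. It is not the course's implementation: a parallel IntStream stands in for whatever parallel construct the course provides, Map.Entry stands in for KeyValuePair, and the mapper is modeled as a BiConsumer that takes an item and an emitter. MapAllSketch is a made-up name.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.stream.IntStream;

final class MapAllSketch {
    // mapper.accept(item, emitter) is a hypothetical stand-in for the course Mapper's map method.
    static <E, K, V> List<Map.Entry<K, V>>[] mapAll(E[] input, BiConsumer<E, BiConsumer<K, V>> mapper) {
        @SuppressWarnings("unchecked")
        List<Map.Entry<K, V>>[] result = new List[input.length];
        // Each index writes only its own slot, so the iterations are independent.
        IntStream.range(0, input.length).parallel().forEach(i -> {
            List<Map.Entry<K, V>> list = new ArrayList<>();
            mapper.accept(input[i], (k, v) -> list.add(new SimpleImmutableEntry<>(k, v)));
            result[i] = list;
        });
        return result;
    }
}
```

Giving every iteration its own list avoids any shared mutable state, so no synchronization is needed in this stage.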

accumulateAll Method

This middle step is often excluded in more advanced MapReduce applications. In the simple framework, it is the only step that must be completed sequentially, even though the other steps run in parallel. In the matrix framework implementation, we will do away with this step altogether for the sake of performance.

In this method, you will take the array of lists you previously created and accumulate the key-value pairs in the lists into a newly created map. Unlike the mapping phase, this stage must account for the possibility of duplicate keys. To deal with this, make use of the Collector provided to you: obtain its accumulator by calling the accumulator() method, then call accept(container, value) to add each value to the container stored under its key. If a key has no container yet, obtain a new one from the Collector's supplier by calling supplier().get(). You probably noticed that the method must return a map of <K, A>, which differs from the <K, V> generics fed into the method. The framework is designed this way because the data originally fed into the mapping stage can be turned into a different form of data before reaching the reduce stage. Although we will not do this with any of our implementations, we designed the framework to allow this.

Hint: Look into the compute() method for maps.
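
The interplay of compute(), the supplier, and the accumulator can be sketched as follows. This is an illustration under assumptions: Map.Entry stands in for KeyValuePair, and AccumulateAllSketch is a made-up name rather than the course's class.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collector;

final class AccumulateAllSketch {
    // Map.Entry<K, V> stands in for the course's KeyValuePair<K, V> (an assumption).
    static <K, V, A> Map<K, A> accumulateAll(List<Map.Entry<K, V>>[] mapAllResults,
                                             Collector<V, A, ?> collector) {
        Map<K, A> accumulated = new HashMap<>();
        for (List<Map.Entry<K, V>> list : mapAllResults) {
            for (Map.Entry<K, V> pair : list) {
                accumulated.compute(pair.getKey(), (key, container) -> {
                    // First sighting of this key: ask the supplier for a fresh container.
                    A target = (container == null) ? collector.supplier().get() : container;
                    collector.accumulator().accept(target, pair.getValue());
                    return target;
                });
            }
        }
        return accumulated;
    }
}
```

The loop is sequential on purpose: every pair may touch the same HashMap entry, which is why this stage does not parallelize as easily as the others.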

finishAll Method

This final step reduces the accumulated data and returns the final map in its reduced form. Again, you may notice that the method returns a map of <K, R> instead of the <K, A> which was returned in the accumulateAll method. This happens for the exact same reason as the accumulateAll method, as the framework is designed to handle cases in which the reduced data differs in type from the accumulated data.

To reduce the data down, use the map returned from the accumulateAll stage and put the results of the reduction into a new map. The provided Collector will come in extremely handy for this stage, more specifically the finisher which can be called using the finisher() method. This step should run in parallel and will probably be the easiest of the three methods.

Hint: Use the entrySet() method to get all of the entries in the given map and remember to use a ConcurrentHashMap instead of a regular HashMap to ensure the method can run in parallel.
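
A minimal sketch of this stage, assuming a parallel stream over the entry set as the parallel construct (the course may provide its own); FinishAllSketch is a made-up name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collector;

final class FinishAllSketch {
    static <K, A, R> Map<K, R> finishAll(Map<K, A> accumulateAllResult, Collector<?, A, R> collector) {
        Map<K, R> finished = new ConcurrentHashMap<>();
        // Entries are independent, so they can be finished in parallel;
        // ConcurrentHashMap tolerates the concurrent put calls.
        accumulateAllResult.entrySet().parallelStream().forEach(entry ->
                finished.put(entry.getKey(), collector.finisher().apply(entry.getValue())));
        return finished;
    }
}
```

Note that ConcurrentHashMap rejects null keys and values, so this sketch assumes the finisher never returns null.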

Matrix MapReduce Framework

Navigate to the MatrixMapReduceFramework.java class, where there are two methods for you to complete: mapAndAccumulateAll and combineAndFinishAll. These frameworks are meant to be extremely general and applied to more specific uses of MapReduce.

The matrix framework is much more complex than the simple framework, but it boosts performance by grouping the map and accumulate stages so that everything can run in parallel. It does so by slicing the given data into mapTaskCount slices and assigning a reduce task number to each entry using the provided getReduceIndex() method. This, in effect, creates a matrix of maps, hence the name of the framework. In the combineAndFinishAll stage, the matrix comes in handy by allowing us to go directly down each column of the matrix (as each key is essentially grouped into a bucket), combining and reducing elements in one pass. This concept was explained in more depth during class.

mapAndAccumulateAll

In this stage, you will map and accumulate a given array of data into a matrix of maps. This method should run in parallel while combining the map and accumulate portions of the simple framework (which we recommend you attempt first). As mentioned previously, the input should be sliced into mapTaskCount slices, and each emitted key-value pair should then be accumulated into its rightful spot in the matrix. Although you can slice up the data into chunks yourself, we recommend using the Slice and SliceUtils classes introduced earlier in the course.

Warning: Prefer SliceUtils (https://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/apidocs/edu/wustl/cse231s/slices/SliceUtils.html#createNSlices-E:A-int-) over your own Slices implementation. Many students have run into testing failures due to differences in the slices.

For each slice, the mapper should map the input into its rightful spot in the matrix and accumulate it into that specific map. Essentially, you will need to nest the actions of the accumulate method inside the mapper. To find where an entry should go in the matrix, remember that each slice keeps track of its position relative to the other slices, which gives the row, and that the getReduceIndex method mentioned above gives the column.

Hint: The number of rows should match the number of slices.
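
The matrix layout can be sketched as below, with one row per slice and one column per reduce index. Everything here is an assumption made for illustration: the hand-rolled slice bounds stand in for SliceUtils (which the lab recommends using instead), reduceIndex is a hypothetical stand-in for the provided getReduceIndex method, a parallel IntStream stands in for the course's parallel constructs, and MapAndAccumulateAllSketch is a made-up name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.stream.Collector;
import java.util.stream.IntStream;

final class MapAndAccumulateAllSketch {
    // Hypothetical stand-in for the provided getReduceIndex method.
    static int reduceIndex(Object key, int reduceTaskCount) {
        return Math.floorMod(key.hashCode(), reduceTaskCount);
    }

    static <E, K, V, A> Map<K, A>[][] mapAndAccumulateAll(
            E[] input, BiConsumer<E, BiConsumer<K, V>> mapper, Collector<V, A, ?> collector,
            int mapTaskCount, int reduceTaskCount) {
        @SuppressWarnings("unchecked")
        Map<K, A>[][] matrix = new Map[mapTaskCount][reduceTaskCount];
        // One row per slice; a row is touched by exactly one task, so plain HashMaps suffice.
        IntStream.range(0, mapTaskCount).parallel().forEach(row -> {
            for (int col = 0; col < reduceTaskCount; col++) {
                matrix[row][col] = new HashMap<>();
            }
            // Hand-rolled contiguous slice bounds; the lab itself recommends SliceUtils.
            int min = (int) ((long) input.length * row / mapTaskCount);
            int max = (int) ((long) input.length * (row + 1) / mapTaskCount);
            for (int i = min; i < max; i++) {
                mapper.accept(input[i], (key, value) -> {
                    // Accumulate directly into the cell chosen by the key's reduce index.
                    Map<K, A> cell = matrix[row][reduceIndex(key, reduceTaskCount)];
                    A container = cell.computeIfAbsent(key, k -> collector.supplier().get());
                    collector.accumulator().accept(container, value);
                });
            }
        });
        return matrix;
    }
}
```

Since a given key always lands in the same column regardless of which slice emitted it, the later combine step only has to look down one column per key.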

combineAndFinishAll

In this stage, you will take the matrix you just completed and combine all of the separate rows down to one array. Afterward, you will convert this combined array of maps into one final map. This method should run in parallel.

As mentioned previously, you should go directly down the matrix to access the same bucket across the different slices you created in the mapAndAccumulateAll step. For all of the maps in a column, you should go through each entry and combine it down into one row. You will need to make use of the Collector’s finisher again, but you will also need to make use of the combiner. You can access the Collector’s combiner using the combiner() method. Although the combine step differs from the simple framework, the finish step should mirror what you did previously.

Hint: You can use the provided MultiWrapMap class to return the final row as a valid output. You should also combine before you finish.
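
The column-wise combine followed by finish can be sketched as below. This is not the course's solution: a single ConcurrentHashMap stands in for the MultiWrapMap mentioned in the hint, a parallel IntStream stands in for the course's parallel constructs, and CombineAndFinishAllSketch is a made-up name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collector;
import java.util.stream.IntStream;

final class CombineAndFinishAllSketch {
    static <K, A, R> Map<K, R> combineAndFinishAll(Map<K, A>[][] matrix, Collector<?, A, R> collector) {
        Map<K, R> result = new ConcurrentHashMap<>();
        int columnCount = (matrix.length == 0) ? 0 : matrix[0].length;
        // Each column owns a disjoint set of keys, so columns can be processed independently.
        IntStream.range(0, columnCount).parallel().forEach(col -> {
            Map<K, A> combined = new HashMap<>();
            for (Map<K, A>[] row : matrix) {
                // merge stores the container on first sight, otherwise applies the combiner.
                row[col].forEach((key, container) ->
                        combined.merge(key, container, collector.combiner()));
            }
            // Combine first, then finish each fully combined container.
            combined.forEach((key, container) ->
                    result.put(key, collector.finisher().apply(container)));
        });
        return result;
    }
}
```

The sketch reuses the containers already stored in the matrix, so a mutating combiner will modify those cells; that is acceptable here because the matrix is not needed afterward.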

Cards MapReduce Studio

Test Suite: CardMapReduceStudioTestSuite

Prep Video: Learn MapReduce with Playing Cards

Navigate to the CardMapper.java class and look at the map method. In this part of the studio, you are going to create a mapper for a given deck of cards. The mapper should accept a card’s suit and numeric value through the use of a BiConsumer. Iterate through all of the cards in the given deck and ensure they are valid numeric cards. If they are, then emit the suit and numeric value.

Hint: the Rank class has a useful method that checks whether a card is valid.

Cholera MapReduce Studio

Navigate to the CholeraMapper.java class and look at the map method. In this studio, you will attempt to find which water pumps are infected with cholera based on a given location of a reported case. You will do this by checking which specific WaterPump is closest to a given location and map that value to a BiConsumer. The BiConsumer will accept two arguments: the closest WaterPump and the number of occurrences to add onto the map. To find the closest WaterPump, you should go through a database of all available water pumps and compare the distances between the pumps and the location.

Hint: there are some useful methods in the WaterPump and Location classes that might help you with the studio.
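
The nearest-pump search can be sketched with plain coordinates. This is only an illustration: arrays of doubles and an integer pump index stand in for the WaterPump and Location classes, whose actual methods are not shown here, and ClosestPumpSketch is a made-up name.

```java
import java.util.function.BiConsumer;

final class ClosestPumpSketch {
    // Returns the index of the pump nearest to (x, y), where pumps[i] = {x, y}.
    // Assumes at least one pump.
    static int closestPump(double[][] pumps, double x, double y) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < pumps.length; i++) {
            double distance = Math.hypot(pumps[i][0] - x, pumps[i][1] - y);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }

    // Emits (closest pump index, 1) for one reported case.
    static void map(double[][] pumps, double caseX, double caseY, BiConsumer<Integer, Integer> consumer) {
        consumer.accept(closestPump(pumps, caseX, caseY), 1);
    }
}
```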

K-Mer MapReduce Studio

The term k-mer refers to a substring of length k and k-mer counting refers to the process of finding the number of occurrences of a k-mer in a given string. This computational problem has many real-world applications, the most notable being in the field of computational genomics. In this assignment, you will design a program that will ultimately take in a human X-chromosome and count the number of k-mers in the string of DNA.

Navigate to the KMerMapper.java class and look at the map method. For this studio your KMerMapper class will be constructed with the kMerLength. Your map method will be passed an array of chromosome data. Emit each kMerLength long k-mer as a String key with the value 1 so they may be counted in the reduce phase via integer sum.

We have provided private static String toStringKMer(byte[] sequence, int offset, int kMerLength) which should be useful.

It is useful to know that the number of k-mers of length k in a given string of length n is equal to n - k + 1.
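
The n - k + 1 count falls out of a sliding window over the sequence, as in the sketch below. It assumes ASCII-encoded bases and builds each String directly, as a stand-in for the provided toStringKMer helper whose body is not shown here; KMerMapSketch is a made-up name.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.BiConsumer;

final class KMerMapSketch {
    // Emits (k-mer, 1) for each of the n - k + 1 windows of length kMerLength.
    static void map(byte[] sequence, int kMerLength, BiConsumer<String, Integer> consumer) {
        for (int offset = 0; offset + kMerLength <= sequence.length; offset++) {
            // Stand-in for the provided toStringKMer(sequence, offset, kMerLength) helper.
            String kMer = new String(sequence, offset, kMerLength, StandardCharsets.US_ASCII);
            consumer.accept(kMer, 1);
        }
    }
}
```

For the six-base sequence ACGACG with k = 3, the windows are ACG, CGA, GAC, and ACG, which is 6 - 3 + 1 = 4 emissions.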

Wikipedia k-mer article

Rubric

As always, please make sure to cite your work appropriately.

Total points: 100

Mutual friends subtotal: 10

  • Correct mapper (5)
  • Correct reducer (5)

Word count subtotal: 10

  • Correct mapper (5)
  • Correct reducer (5)

Classic Reducer subtotal: 5

  • Correct reducer (5)

Simple framework subtotal: 25

  • Correct mapAll (5)
  • Correct accumulateAll (10)
  • Correct finishAll (10)

Matrix framework subtotal: 40

  • Correct mapAndAccumulateAll (20)
  • Correct combineAndFinishAll (20)

Whole project:

  • Clarity and efficiency (10)