Difference between revisions of "MapReduce Frameworks Lab"

From CSE231 Wiki

Latest revision as of 00:38, 5 March 2020

credit for this assignment: Finn Voichick and Dennis Cosgrove

Motivation

Dealing with big data is Hansel-level hot right now. We will build two implementations to better understand the inner workings of Hadoop and Spark-like frameworks. Your frameworks will actually be more general than just MapReduce.

The Matrix implementation gives us experience with dividing the work up into thread confined tasks whose results are then combined together.

Background

wikipedia article on MapReduce

As the name suggests, MapReduce refers to the process of mapping and then reducing some stream of data. At its core, a MapReduce algorithm transforms a list of one kind of item, then collects the results and reduces them down to a single value per key using some computation. As you can probably tell, the concept of MapReduce is extremely general and applies to a wide range of problems. In this assignment, we will use MapReduce to simulate the Facebook mutual friends algorithm (finding the mutual friends between all friend pairs) as well as to pinpoint the offending well in the 1854 Soho Cholera Outbreak.

For more information on the general concept of MapReduce, refer to this article.

Java Advice

Parameterized Type Array Tip

Creating arrays of parameterized types in Java is madness inducing. Some details are available in Java Generics Restrictions.

The example below creates an array of List<String>. The @SuppressWarnings annotation is optional.

@SuppressWarnings("unchecked")
List<String>[] result = new List[length];

Use map.entrySet()

Prefer the use of map.entrySet() over map.keySet() followed by looking up the value with map.get(key).
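
As a minimal sketch of the advice above (the class and method names are illustrative, not part of the course code), iterating over entries gives you each key and value together without a second lookup:

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class EntrySetExample {
    // Sums every value by walking the entries directly.
    // Each entry already carries its value, so no map.get(key) call is needed.
    static int sumValues(Map<String, Integer> map) {
        int sum = 0;
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            sum += entry.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("map", 2);
        counts.put("reduce", 3);
        System.out.println(sumValues(counts));
    }
}
```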

Use the appropriate version of forall

There are many overloaded versions of forall. Choose the correct one for each situation, where "correct" is often the one that produces the cleanest code.

Mistakes To Avoid

Warning: Arrays (and Matrices) are initially filled with null. You must fill them with instances.
Warning: Ensure that your Slices studio conforms to spec.

Code To Use

To allow our frameworks to work well with JDK8 Streams, we employ a couple of standard interfaces rather than creating our own custom ones.

BiConsumer

We use the standard BiConsumer<T,U> interface and its accept(t,u) method in place of a mythical, custom MapContext<K,V> interface with an emit(key,value) method.

public interface BiConsumer<T, U> {
    // invoke accept with each key and value you wish to emit
    void accept(T t, U u);
}
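
To make the "emit" idea concrete, here is a small sketch of a word-count style mapper. The class BiConsumerEmitExample and both of its methods are hypothetical, and a plain String stands in for the course's TextSection; only java.util.function.BiConsumer is the real JDK interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

final class BiConsumerEmitExample {
    // Hypothetical mapper: emits (word, 1) for each whitespace-separated word.
    // Calling accept(key, value) plays the role of emit(key, value).
    static void map(String text, BiConsumer<String, Integer> keyValuePairConsumer) {
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                keyValuePairConsumer.accept(word.toLowerCase(), 1);
            }
        }
    }

    // The caller decides what happens to each emitted pair.
    // Here each pair is recorded as a "key=value" string.
    static List<String> collectEmitted(String text) {
        List<String> emitted = new ArrayList<>();
        map(text, (key, value) -> emitted.add(key + "=" + value));
        return emitted;
    }
}
```

The mapper never learns where its pairs end up; a framework is free to pass in a consumer that appends to a list, or one that accumulates directly into a dictionary.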

Collector

read the javadoc for Collector

check out the Collector_MapReduce_Studio#Background
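
The following sketch shows how the four pieces of a Collector fit together when they are invoked by hand, which is roughly what a framework does on your behalf. The collector built here (gather Integers in a List, then sum them) is only an illustration and is not one of the collectors used in the lab.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;

final class CollectorPartsExample {
    // Illustrative collector: values are Integers (T), the mutable container
    // is a List<Integer> (A), and the finished result is an Integer sum (R).
    static Collector<Integer, List<Integer>, Integer> summingViaList() {
        return Collector.<Integer, List<Integer>, Integer>of(
                ArrayList::new,                          // supplier: new empty container
                List::add,                               // accumulator: add one value
                (a, b) -> { a.addAll(b); return a; },    // combiner: merge two containers
                list -> list.stream().mapToInt(Integer::intValue).sum()); // finisher
    }

    // Drives the collector manually, the way a framework would.
    static int demo() {
        Collector<Integer, List<Integer>, Integer> collector = summingViaList();
        List<Integer> left = collector.supplier().get();
        collector.accumulator().accept(left, 1);
        collector.accumulator().accept(left, 2);
        List<Integer> right = collector.supplier().get();
        collector.accumulator().accept(right, 3);
        List<Integer> combined = collector.combiner().apply(left, right);
        return collector.finisher().apply(combined);
    }
}
```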

IndexedRange

class IndexedRange

This class has everything you need for n-way split problems, specifically:

Slices

class Slices

List<Slice<C[]>> createNSlices(C[] data, int numSlices)

HashUtils

toIndex(key,N)
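
The course's HashUtils implementation is not shown on this page, so the sketch below is only an assumption about what a toIndex(key, N) style method could look like: it maps any key to a bucket in the range [0, N), even when the key's hash code is negative. Use the provided HashUtils in your lab code, not this stand-in.

```java
import java.util.Objects;

final class ToIndexSketch {
    // Hypothetical stand-in for HashUtils.toIndex(key, N).
    // Math.floorMod keeps the result non-negative for negative hash codes,
    // which a plain % would not.
    static int toIndex(Object key, int n) {
        return Math.floorMod(Objects.hashCode(key), n);
    }
}
```

The important property is that equal keys always land in the same bucket, which is what lets every occurrence of a key end up in the same column of the matrix.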

MultiWrapMap

class MultiWrapMap<K,V>

A Path To Victory

The Core Questions

  • What are the tasks?
  • What is the data?
  • Is the data mutable?
  • If so, how is it shared?

Optional Warm Up

WordCountConcreteStaticMapReduce

class: WordCountConcreteStaticMapReduce.java
methods: mapAll, accumulateAll, finishAll
package: mapreduce.framework.warmup.wordcount
source folder: src/main/java

Mapper

map (Provided)

	static void map(TextSection textSection, BiConsumer<String, Integer> keyValuePairConsumer) {
		mapper.map(textSection, keyValuePairConsumer);
	}

Reducer

reduceCreateList (Provided)

	static List<Integer> reduceCreateList() {
		return collector.supplier().get();
	}

reduceAccumulate (Provided)

	static void reduceAccumulate(List<Integer> list, int v) {
		collector.accumulator().accept(list, v);
	}

reduceCombine (Provided)

	static List<Integer> reduceCombine(List<Integer> a, List<Integer> b) {
		return collector.combiner().apply(a, b);
	}

reduceFinish (Provided)

	static int reduceFinish(List<Integer> list) {
		return collector.finisher().apply(list);
	}

Framework

mapAll

method: List<KeyValuePair<String, Integer>>[] mapAll(TextSection[] input) (parallel implementation required)

Warning: When first created, arrays of Objects are filled with null. You will need to assign a new List to each array index before you start adding key-value pairs.
Warning: Reminder: our course libraries consistently specify max to be exclusive. This includes the parallel forall loop.
Tip: You are encouraged to utilize the provided map method.

accumulateAll

method: static Map<String, List<Integer>> accumulateAll(List<KeyValuePair<String, Integer>>[] mapAllResults) (sequential implementation only)

Tip: You are encouraged to utilize the provided reduceCreateList and reduceAccumulate methods.

finishAll

method: static Map<String, Integer> finishAll(Map<String, List<Integer>> accumulateAllResult) (parallel implementation required)

Tip: You are encouraged to utilize the provided reduceFinish method.

Testing Your Solution

Correctness

class: WarmUpWordCountMapReduceTestSuite.java
package: mapreduce
source folder: src/test/java

MutualFriendsConcreteStaticMapReduce

class: MutualFriendsConcreteStaticMapReduce.java
methods: map, reduceCreateList, reduceAccumulate, reduceCombine, reduceFinish, mapAll, accumulateAll, finishAll
package: mapreduce.framework.warmup.friends
source folder: src/main/java

Mapper

map

method: static void map(Account account, BiConsumer<OrderedPair<AccountId>, Set<AccountId>> keyValuePairConsumer) (sequential implementation only)

Reducer

reduceCreateList

method: static List<Set<AccountId>> reduceCreateList() (sequential implementation only)

reduceAccumulate

method: static void reduceAccumulate(List<Set<AccountId>> list, Set<AccountId> v) (sequential implementation only)

reduceCombine

method: static void reduceCombine(List<Set<AccountId>> a, List<Set<AccountId>> b) (sequential implementation only)

reduceFinish

method: static MutualFriendIds reduceFinish(List<Set<AccountId>> list) (sequential implementation only)

note: creating an instance of MutualFriendIds set to the universe is currently a bit tricky. We have provided you with the required code:

		Set<AccountId> universe = null; // imagine this being a set of a billion account ids
		MutualFriendIds result = MutualFriendIds.createInitializedToUniverse(universe);

Framework

mapAll

method: static List<KeyValuePair<OrderedPair<AccountId>, Set<AccountId>>>[] mapAll(Account[] input) (parallel implementation required)

Warning: When first created, arrays of Objects are filled with null. You will need to assign a new List to each array index before you start adding key-value pairs.
Warning: Reminder: our course libraries consistently specify max to be exclusive. This includes the parallel forall loop.
Tip: You are encouraged to utilize the map method you implemented.

accumulateAll

method: static Map<OrderedPair<AccountId>, List<Set<AccountId>>> accumulateAll(List<KeyValuePair<OrderedPair<AccountId>, Set<AccountId>>>[] mapAllResults) (sequential implementation only)

Tip: You are encouraged to utilize the reduceCreateList and reduceAccumulate methods you implemented.

finishAll

method: static Map<OrderedPair<AccountId>, MutualFriendIds> finishAll(Map<OrderedPair<AccountId>, List<Set<AccountId>>> accumulateAllResult) (parallel implementation required)

Tip: You are encouraged to utilize the reduceFinish method you implemented.

Testing Your Solution

Correctness

class: WarmUpMutualFriendsMapReduceTestSuite.java
package: mapreduce
source folder: src/test/java

Required Lab

Bottlenecked MapReduce Framework

class: BottleneckedMapReduceFramework.java
methods: mapAll, accumulateAll, finishAll
package: mapreduce.framework.lab.bottlenecked
source folder: src/main/java

Navigate to the BottleneckedMapReduceFramework.java class and there will be three methods for you to complete: mapAll, accumulateAll, and finishAll. These frameworks are meant to be extremely general and applied to more specific uses of MapReduce. Whereas the warm ups for this lab serve to prepare you to build this required section of the lab, this bottlenecked framework is in many ways a warm up for the matrix implementation.

mapAll

NOTE: If you struggle to get through this method, you are strongly encouraged to try the warm-ups.

method: List<KeyValuePair<K, V>>[] mapAll(E[] input) (parallel implementation required)

Warning: When first created, arrays of Objects are filled with null. You will need to assign a new List to each array index before you start adding key-value pairs.
Warning: Reminder: our course libraries consistently specify max to be exclusive. This includes the parallel forall loop.

With this method, you will map all of the elements of an array of data into a new array of equivalent size, where each slot holds a list of key-value pairs. To do this, invoke the mapper's map() method on each element, passing a consumer that adds each emitted key-value pair to that element's initially empty list. Each list should then be stored in the array of lists you previously created, thereby completing the mapping stage of MapReduce. This should all be done in parallel.

Hint: you should create an array of lists equivalent in size to the original array. Each list will contain all of the emitted (key,value) pairs for its item.
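
The shape of this stage can be sketched with plain JDK classes. Everything below is a stand-in chosen for illustration: a parallel IntStream replaces the course's forall, Map.Entry replaces KeyValuePair, the mapper is modeled as a BiConsumer of (element, emitter), and the class name MapAllSketch is hypothetical. Your lab code should use the course libraries.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.stream.IntStream;

final class MapAllSketch {
    static <E, K, V> List<Map.Entry<K, V>>[] mapAll(E[] input, BiConsumer<E, BiConsumer<K, V>> mapper) {
        @SuppressWarnings("unchecked")
        List<Map.Entry<K, V>>[] result = new List[input.length]; // initially all null
        // The upper bound of range is exclusive; each index is handled by exactly one task.
        IntStream.range(0, input.length).parallel().forEach(i -> {
            List<Map.Entry<K, V>> list = new ArrayList<>();
            // Every emitted (key, value) pair is appended to this element's own list.
            mapper.accept(input[i], (k, v) -> list.add(new SimpleImmutableEntry<>(k, v)));
            result[i] = list;
        });
        return result;
    }

    // Small demonstration: emits (word, 1) for each word of each input string.
    static List<Map.Entry<String, Integer>>[] demo() {
        String[] input = { "a b", "c" };
        return mapAll(input, (String text, BiConsumer<String, Integer> emit) -> {
            for (String word : text.split(" ")) {
                emit.accept(word, 1);
            }
        });
    }
}
```

Because each task writes only to its own list and its own array slot, no mutable data is shared between tasks.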

Bottlenecked map all.png slide

accumulateAll

method: Map<K, A> accumulateAll(List<KeyValuePair<K, V>>[] mapAllResults) (sequential implementation only)

This middle step is often excluded in more advanced MapReduce applications. When run in parallel, it is the only step of the framework that must be completed sequentially. In the matrix framework implementation, we will do away with this step altogether for the sake of performance.

In this method, you will take in the array of lists you previously created and accumulate the key-value pairs in the lists into a newly defined map. To do this, you must make use of the Collector provided to you. More specifically, access the accumulator in the collector by calling the accumulator() method, and invoke its accept method with the key's mutable container and the value. You probably noticed that the method must return a map of <K, A>, which differs from the <K, V> generics fed into the method. The framework is designed this way because the values emitted in the mapping stage are collected into a mutable container of type A before reaching the finish/reduce stage. If the key has no associated container yet, obtain a new one from the Collector's supplier, which is accessed with the supplier() method.
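
A rough sketch of this sequential stage, using only JDK types, might look as follows. Map.Entry stands in for KeyValuePair, the collector is passed as a parameter instead of being a field of the framework, and the class AccumulateAllSketch and its helper methods are hypothetical.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collector;

final class AccumulateAllSketch {
    static <K, V, A> Map<K, A> accumulateAll(List<Map.Entry<K, V>>[] mapAllResults, Collector<V, A, ?> collector) {
        Map<K, A> result = new HashMap<>();
        for (List<Map.Entry<K, V>> list : mapAllResults) {
            for (Map.Entry<K, V> pair : list) {
                // First sighting of a key: ask the supplier for a fresh container.
                A container = result.computeIfAbsent(pair.getKey(), k -> collector.supplier().get());
                // Every sighting: accumulate the value into that key's container.
                collector.accumulator().accept(container, pair.getValue());
            }
        }
        return result;
    }

    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    // Small demonstration with an illustrative list-then-sum collector.
    static Map<String, List<Integer>> demo() {
        @SuppressWarnings("unchecked")
        List<Map.Entry<String, Integer>>[] mapAllResults = new List[2];
        mapAllResults[0] = Arrays.asList(pair("a", 1), pair("b", 1));
        mapAllResults[1] = Arrays.asList(pair("a", 1));
        Collector<Integer, List<Integer>, Integer> collector = Collector.<Integer, List<Integer>, Integer>of(
                ArrayList::new,
                List::add,
                (x, y) -> { x.addAll(y); return x; },
                values -> values.stream().mapToInt(Integer::intValue).sum());
        return accumulateAll(mapAllResults, collector);
    }
}
```

Note how the value type V (Integer) and the container type A (List<Integer>) differ, which is why the returned map is a Map<K, A>.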


Bottlenecked accumulate all.png slide

finishAll

method: Map<K, R> finishAll(Map<K, A> accumulateAllResult) (parallel implementation required)

This final step reduces the accumulated data and returns the final map in its reduced form. Again, you may notice that the method returns a map of <K, R> instead of the <K, A> which was returned in the accumulateAll method. This happens for the exact same reason as the accumulateAll method, as the framework is designed to handle cases in which the reduced data differs in type from the accumulated data.

To reduce the data down, use the map returned from the accumulateAll stage and put the results of the reduction into a new map. The provided Collector will come in extremely handy for this stage, more specifically the finisher which can be called using the finisher() method. This step should run in parallel and will probably be the easiest of the three methods.
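
One possible shape for this stage, sketched with JDK classes only, is shown below. A parallel stream over the entry set replaces the course's forall, and a ConcurrentHashMap is assumed as the thread-safe result map; neither choice is prescribed by the lab, and the class name FinishAllSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collector;

final class FinishAllSketch {
    static <K, A, R> Map<K, R> finishAll(Map<K, A> accumulateAllResult, Collector<?, A, R> collector) {
        // Tasks write to the result concurrently, so a thread-safe map is used.
        // ConcurrentHashMap rejects null keys and null values.
        Map<K, R> result = new ConcurrentHashMap<>();
        accumulateAllResult.entrySet().parallelStream().forEach(entry ->
                result.put(entry.getKey(), collector.finisher().apply(entry.getValue())));
        return result;
    }

    // Small demonstration: each key's list of values is finished into its sum.
    static Map<String, Integer> demo() {
        Map<String, List<Integer>> accumulated = new HashMap<>();
        accumulated.put("a", Arrays.asList(1, 1));
        accumulated.put("b", Arrays.asList(1));
        Collector<Integer, List<Integer>, Integer> collector = Collector.<Integer, List<Integer>, Integer>of(
                ArrayList::new,
                List::add,
                (x, y) -> { x.addAll(y); return x; },
                values -> values.stream().mapToInt(Integer::intValue).sum());
        return finishAll(accumulated, collector);
    }
}
```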

Bottlenecked finish all.png slide

Matrix MapReduce Framework

class: MatrixMapReduceFramework.java
methods: mapAndAccumulateAll, combineAndFinishAll
package: mapreduce.framework.lab.matrix
source folder: src/main/java

Navigate to the MatrixMapReduceFramework.java class and there will be two methods for you to complete: mapAndAccumulateAll and combineAndFinishAll. These frameworks are meant to be extremely general and applied to more specific uses of MapReduce.

The matrix framework is much more complex than the bottlenecked framework, but it boosts performance by grouping the map and accumulate stages so that everything can run in parallel. It does so by slicing up the given data into the specified mapTaskCount number of slices and assigning a reduce task number to each entry using the HashUtils toIndex() method. This, in effect, creates a matrix of dictionaries, hence the name of the framework. In the combineAndFinishAll stage, the matrix comes in handy by allowing us to go directly down the columns of the matrix (as each key is essentially grouped into a bucket), combining and reducing elements all-in-one. This concept was explained in more depth during class.

mapAndAccumulateAll

method: Map<K, A>[][] mapAndAccumulateAll(E[] input) (parallel implementation required)

In this stage, you will map and accumulate a given array of data into a matrix of dictionaries. This method should run in parallel while performing the map and accumulate portions of the bottlenecked framework (which we recommend you complete prior to embarking on this mission). As mentioned previously, the input should be sliced into a mapTaskCount number of IndexedRanges and then mapped/accumulated into its appropriate dictionary in the matrix. Although you could slice up the data into chunks yourself, we require using an algorithm identical to the one performed by the IndexedRange and Slices classes introduced earlier in the course. This allows us to provide better feedback so that you can pinpoint bugs sooner. What is the best way to perform an identical algorithm to your Slices studio? Use your Slices studio, of course.

For each slice, the mapper should map the input into its appropriate cell in the matrix and accumulate it into that specific dictionary. Essentially, you will need to nest the actions of the accumulate method inside the mapper. In order to find where the input should go in the matrix, remember that each slice keeps track of its index id and HashUtils has a toIndex method. Which is applicable to the row and which is applicable to the column?

Hint: The number of rows should match the number of slices.
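
The overall structure can be sketched with JDK classes as follows. This is a sketch under several assumptions, not the lab's implementation: the slice bounds are computed with simple arithmetic that is NOT guaranteed to match the required Slices algorithm, Math.floorMod on the hash code stands in for HashUtils.toIndex, a parallel IntStream stands in for forall, and the mapper, collector, and task counts are passed as parameters. The class name MatrixMapAccumulateSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.BiConsumer;
import java.util.stream.Collector;
import java.util.stream.IntStream;

final class MatrixMapAccumulateSketch {
    static <E, K, V, A> Map<K, A>[][] mapAndAccumulateAll(E[] input, int mapTaskCount, int reduceTaskCount,
            BiConsumer<E, BiConsumer<K, V>> mapper, Collector<V, A, ?> collector) {
        @SuppressWarnings("unchecked")
        Map<K, A>[][] matrix = new Map[mapTaskCount][reduceTaskCount]; // cells start as null
        // One task per row; a row's dictionaries are touched only by that row's task.
        IntStream.range(0, mapTaskCount).parallel().forEach(row -> {
            for (int c = 0; c < reduceTaskCount; c++) {
                matrix[row][c] = new HashMap<>();
            }
            // Stand-in slice bounds (max exclusive); the lab requires your Slices studio instead.
            int min = (int) ((long) row * input.length / mapTaskCount);
            int max = (int) ((long) (row + 1) * input.length / mapTaskCount);
            for (int i = min; i < max; i++) {
                mapper.accept(input[i], (k, v) -> {
                    // Stand-in for HashUtils.toIndex(key, N): equal keys share a column.
                    int col = Math.floorMod(Objects.hashCode(k), reduceTaskCount);
                    A container = matrix[row][col].computeIfAbsent(k, key -> collector.supplier().get());
                    collector.accumulator().accept(container, v);
                });
            }
        });
        return matrix;
    }

    // Counts how many values were accumulated for a word across the whole matrix.
    static int demoCountOf(String word) {
        String[] input = { "a b", "a", "b c", "a" };
        Collector<Integer, List<Integer>, Integer> collector = Collector.<Integer, List<Integer>, Integer>of(
                ArrayList::new, List::add, (x, y) -> { x.addAll(y); return x; }, List::size);
        Map<String, List<Integer>>[][] matrix = mapAndAccumulateAll(input, 2, 3,
                (String text, BiConsumer<String, Integer> emit) -> {
                    for (String w : text.split(" ")) {
                        emit.accept(w, 1);
                    }
                }, collector);
        int count = 0;
        for (Map<String, List<Integer>>[] row : matrix) {
            for (Map<String, List<Integer>> cell : row) {
                List<Integer> values = cell.get(word);
                if (values != null) {
                    count += values.size();
                }
            }
        }
        return count;
    }
}
```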

Matrix map accumulate all.png slide

Matrix map accumulate art.png slide

combineAndFinishAll

method: Map<K, R> combineAndFinishAll(Map<K, A>[][] input) (parallel implementation required)

In this stage, you will take the matrix you just completed and combine all of the separate rows down to one array. Afterward, you will convert this combined array of maps into one final map. This method should run in parallel.

As mentioned previously, you should go directly down the matrix to access the same bucket across the different slices you created in the mapAndAccumulateAll step. For all of the maps in a column, you should go through each entry and combine it down into one row. You will need to make use of the Collector’s finisher again, but you will also need to make use of the combiner. You can access the Collector’s combiner using the combiner() method. Although the combine step differs from the bottlenecked framework, the finish step should mirror what you did previously.

Hint: You can use the provided MultiWrapMap class to return the final row as a valid output. You should also combine before you finish.
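
A sketch of the column-wise combine followed by finish is given below, again using only JDK classes. The hint above suggests MultiWrapMap for the output; because its API is not shown on this page, this sketch assumes a ConcurrentHashMap as the result instead, and uses a parallel IntStream in place of forall. The class name MatrixCombineFinishSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collector;
import java.util.stream.IntStream;

final class MatrixCombineFinishSketch {
    static <K, A, R> Map<K, R> combineAndFinishAll(Map<K, A>[][] matrix, Collector<?, A, R> collector) {
        int columnCount = matrix.length == 0 ? 0 : matrix[0].length;
        Map<K, R> result = new ConcurrentHashMap<>();
        // One task per column; a given key only ever appears in one column.
        IntStream.range(0, columnCount).parallel().forEach(col -> {
            Map<K, A> combined = new HashMap<>();
            for (Map<K, A>[] row : matrix) {
                for (Map.Entry<K, A> entry : row[col].entrySet()) {
                    // Combine first: merge this row's container into the running one.
                    // The combiner may mutate containers that came from the matrix.
                    combined.merge(entry.getKey(), entry.getValue(), collector.combiner());
                }
            }
            // Finish second: reduce each fully combined container to its result.
            for (Map.Entry<K, A> entry : combined.entrySet()) {
                result.put(entry.getKey(), collector.finisher().apply(entry.getValue()));
            }
        });
        return result;
    }

    // Small demonstration on a hand-built 2x2 matrix.
    static Map<String, Integer> demo() {
        @SuppressWarnings("unchecked")
        Map<String, List<Integer>>[][] matrix = new Map[2][2];
        for (Map<String, List<Integer>>[] row : matrix) {
            for (int col = 0; col < row.length; col++) {
                row[col] = new HashMap<>();
            }
        }
        matrix[0][0].put("a", new ArrayList<>(Arrays.asList(1)));
        matrix[0][1].put("b", new ArrayList<>(Arrays.asList(1, 1)));
        matrix[1][0].put("a", new ArrayList<>(Arrays.asList(1, 1)));
        Collector<Integer, List<Integer>, Integer> collector = Collector.<Integer, List<Integer>, Integer>of(
                ArrayList::new,
                List::add,
                (x, y) -> { x.addAll(y); return x; },
                values -> values.stream().mapToInt(Integer::intValue).sum());
        return combineAndFinishAll(matrix, collector);
    }
}
```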

Matrix combine finish all.png slide

Testing Your Solution

Correctness

There is a top-level test suite comprised of sub test suites which can be invoked separately when you want to focus on one part of the assignment.

class: FrameworksLabTestSuite.java
package: mapreduce.framework.lab
source folder: src/test/java

Bottlenecked

class: BottleneckedFrameworkTestSuite.java
package: mapreduce.framework.lab.bottlenecked
source folder: src/test/java

MapAll

class: BottleneckedFrameworkTestSuite.java
package: mapreduce.framework.lab.bottlenecked
source folder: src/test/java

AccumulateAll

class: BottleneckedAccumulateAllTestSuite.java
package: mapreduce.framework.lab.bottlenecked
source folder: src/test/java

FinishAll

class: BottleneckedFinishAllTestSuite.java
package: mapreduce.framework.lab.bottlenecked
source folder: src/test/java

Holistic

class: BottleneckedHolisticTestSuite.java
package: mapreduce.framework.lab.bottlenecked
source folder: src/test/java

Matrix

class: MatrixFrameworkTestSuite.java
package: mapreduce.framework.lab.matrix
source folder: src/test/java

MapAccumulateAll

class: MatrixMapAccumulateAllTestSuite.java
package: mapreduce.framework.lab.matrix
source folder: src/test/java

CombineFinishAll

class: MatrixCombineFinishAllTestSuite.java
package: mapreduce.framework.lab.matrix
source folder: src/test/java

Holistic

class: MatrixHolisticTestSuite.java
package: mapreduce.framework.lab.matrix
source folder: src/test/java

Rubric

As always, please make sure to cite your work appropriately.

Total points: 100

Bottlenecked framework subtotal: 40

  • Correct mapAll (10)
  • Correct accumulateAll (20)
  • Correct finishAll (10)

Matrix framework subtotal: 60

  • Correct mapAndAccumulateAll (30)
  • Correct combineAndFinishAll (30)