Difference between revisions of "MapReduce Frameworks Lab"

From CSE231 Wiki
Jump to navigation Jump to search
 
(131 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 +
[[Bottleneck_MapReduce_Framework_Assignment|Bottleneck MapReduce Framework]]
 +
 +
[[Matrix_MapReduce_Framework_Assignment|Matrix MapReduce Framework]]
 +
 +
<!--
 +
credit for this assignment: Finn Voichick and Dennis Cosgrove
 
=Motivation=
 
=Motivation=
Dealing with [https://en.wikipedia.org/wiki/Big_data big data] is Hansel-level hot right now.  We will build two implementations to better understand the inner workings of Hadoop and Spark-like frameworks.  Your frameworks will actually be more support more than just MapReduce.
+
Dealing with [https://en.wikipedia.org/wiki/Big_data big data] is Hansel-level hot right now.  We will build two implementations to better understand the inner workings of Hadoop and Spark-like frameworks.  Your frameworks will actually be more general than just MapReduce.
  
 
The Matrix implementation gives us experience with dividing the work up into thread confined tasks whose results are then combined together.
 
The Matrix implementation gives us experience with dividing the work up into thread confined tasks whose results are then combined together.
Line 7: Line 13:
 
[https://en.wikipedia.org/wiki/MapReduce wikipedia article on MapReduce]
 
[https://en.wikipedia.org/wiki/MapReduce wikipedia article on MapReduce]
  
As the name suggests, MapReduce refers to the process of mapping then reducing some stream of data. At its core, all a MapReduce algorithm entails is transforming a list of one kind of item before collecting those items and reducing them down to a single value using some computation. As you can probably tell, the concept of MapReduce is extremely general and can apply to a wide berth of problems. In this assignment, we will use MapReduce to simulate the Facebook mutual friends algorithm (finding the number of mutual friends between two people) and word count algorithm. As studios, you will use MapReduce to find which infected well(s) is causing an outbreak of cholera in historic London and to map a deck of cards.
+
As the name suggests, MapReduce refers to the process of mapping then reducing some stream of data. At its core, all a MapReduce algorithm entails is transforming a list of one kind of item before collecting those items and reducing them down to a single values per key using some computation. As you can probably tell, the concept of MapReduce is extremely general and can apply to a wide berth of problems. In this assignment, we will use MapReduce to simulate the Facebook mutual friends algorithm (finding the mutual friends between all friend pairs) as well as pinpointing the offending well in the 1854 Soho Cholera Outbreak.
  
 
For more information on the general concept of MapReduce, refer to [https://stackoverflow.com/questions/28982/simple-explanation-of-mapreduce this] article.
 
For more information on the general concept of MapReduce, refer to [https://stackoverflow.com/questions/28982/simple-explanation-of-mapreduce this] article.
  
=Mistakes To Avoid=
+
=Java Advice=
{{Warning | Arrays (and Matrices) are initially filled with null. You must fill them with instances.}}
+
==Parameterized Type Array Tip==
 +
Creating arrays of parameterized types in Java is madness inducing.  Some details are available in [https://docs.oracle.com/javase/tutorial/java/generics/restrictions.html#createArrays Java Generics Restrictions].
 +
 
 +
The example below creates an array of List<String>.  The <code>@SuppressWarnings</code> annotation is optional.
 +
<nowiki>@SuppressWarnings("unchecked")
 +
List<String>[] result = new List[length];</nowiki>
 +
 
 +
==Use map.entrySet()==
 +
Prefer the use of [https://docs.oracle.com/javase/8/docs/api/java/util/Map.html#entrySet-- map.entrySet()] over map.keySet() followed by looking up the value with map.get(key).
  
{{Warning | Prefer the use [https://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/apidocs/edu/wustl/cse231s/slices/SliceUtils.html#createNSlices-E:A-int- SliceUtils] over your own Slices implementation.  Many students have run into testing failures due to differences in the slices.}}
+
==Use the appropriate version of forall==
 +
there are many overloaded versions of forall including:
  
{{Warning | Do NOT be fooled. slice.getData() returns the original complete data. Use [http://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/apidocs/slice/core/Slice.html#getMinInclusive-- getMinInclusive()]  
+
* [https://www.cse.wustl.edu/~cosgroved/courses/cse231/current/apidocs/edu/wustl/cse231s/v5/V5.html#forall-int-int-edu.wustl.cse231s.v5.api.CheckedIntConsumer- range of ints from min to maxExclusive]
and [http://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/apidocs/slice/core/Slice.html#getMaxExclusive-- getMaxExclusive()] when processing your slice.}}
+
* [https://www.cse.wustl.edu/~cosgroved/courses/cse231/current/apidocs/edu/wustl/cse231s/v5/V5.html#forall-T:A-edu.wustl.cse231s.v5.api.CheckedConsumer- array]
 +
* [https://www.cse.wustl.edu/~cosgroved/courses/cse231/current/apidocs/edu/wustl/cse231s/v5/V5.html#forall-java.lang.Iterable-edu.wustl.cse231s.v5.api.CheckedConsumer- Iterable]
  
 +
choose the correct one for each situation where "correct" is often the one that produces the cleanest code.
  
=Code To Use=
+
=Mistakes To Avoid=
 +
{{Warning | Arrays (and Matrices) are initially filled with null.  You must fill them with instances.}}
  
You can find all of the relevant files for this assignment under the '''mapreduce''' directory. From there, all of the classes you will need to implement can be found under <code>mapreduce.assignment</code> or <code>mapreduce.studio</code>. The '''core''' directories are utility and building block classes we created for you and the '''viz''' directories are visualization apps that might help you understand your code from a visual standpoint. Take a look at these classes to get a better understanding of how to use them for your assignment.
+
{{Warning | Ensure that your [[Slices|Slices studio]] is conforming to spec.}}
  
=Java Interfaces=
+
=Code To Use=
To allow our frameworks to work well with JDK8 Streams, we employ a couple of standard interfaces over creating out own custom ones.
+
To allow our frameworks to work well with JDK8 Streams, we employ a couple of standard interfaces over creating our own custom ones.
  
 
==BiConsumer==
 
==BiConsumer==
Line 35: Line 53:
 
}</nowiki>
 
}</nowiki>
  
==Collector [https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collector.html READ THE JAVADOC]==
+
==Collector==
We use the standard [https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collector.html Collector<T,A,R>] interface in place of a mythical, custom AccumulatorReducer<V,A,R> interface.
+
read [https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collector.html the javadoc for Collector]
  
Note: we strongly encourage you to read the [https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collector.html Collector<T,A,R> Javadoc].
+
check out the [[Collector_MapReduce_Studio#Background]]
  
To repeat: '''we strongly encourage you to read the [https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collector.html Collector<T,A,R> Javadoc].'''
+
==IndexedRange==
 +
[https://www.cse.wustl.edu/~cosgroved/courses/cse231/current/apidocs/slice/core/IndexedRange.html class IndexedRange]
  
<nowiki>public interface Collector<T, A, R> {
+
This class has everything you need for n-way split problems, specifically:
 +
* [https://www.cse.wustl.edu/~cosgroved/courses/cse231/current/apidocs/slice/core/IndexedRange.html#getSliceIndexId() getSliceIndexId()]
 +
* [https://www.cse.wustl.edu/~cosgroved/courses/cse231/current/apidocs/slice/core/IndexedRange.html#getMinInclusive() getMinInclusive()]
 +
* [https://www.cse.wustl.edu/~cosgroved/courses/cse231/current/apidocs/slice/core/IndexedRange.html#getMaxExclusive() getMaxExclusive()].
  
// invoke supplier().get() to create a new mutable container
+
==Slices==
Supplier<A> supplier();
+
[https://www.cse.wustl.edu/~cosgroved/courses/cse231/current/apidocs/slice/studio/Slices.html class Slices]
 +
: [https://www.cse.wustl.edu/~cosgroved/courses/cse231/current/apidocs/slice/studio/Slices.html#createNSlices(C%5B%5D,int) <nowiki>List<Slice<C[]>> createNSlices(C[] data, int numSlices)</nowiki>]
  
// invoke accumulator().accept(container, item) to add item to a container
+
==HashUtils==
BiConsumer<A, T> accumulator();
+
[https://www.cse.wustl.edu/~cosgroved/courses/cse231/s20/apidocs/hash/studio/HashUtils.html#toIndex(java.lang.Object,int) toIndex(key,N)]
  
// invoke combiner().apply(containerA, containerB) to combine one container into the other
+
==MultiWrapMap==
BinaryOperator<A> combiner();
+
[https://www.cse.wustl.edu/~cosgroved/courses/cse231/current/apidocs/edu/wustl/cse231s/util/MultiWrapMap.html class MultiWrapMap<K,V>]
  
// invoke finisher().apply(container) to reduce a container to its final form
+
=A Path To Victory=
Function<A, R> finisher();
+
* [[Int_Sum_MapReduce_Apps_Studio]]
}</nowiki>
+
* [[Collector_MapReduce_Studio]]
 +
* [[Mutual_Friends_MapReduce_Application]]
 +
* [[#Optional_Warm_Up]]
 +
* [[#Bottlenecked_MapReduce_Framework]]
 +
* [[Cholera_MapReduce_Application]]
 +
* [[#Matrix_MapReduce_Framework]]
  
===supplier get===
+
=The Core Questions=
We use supplier().get() to create a new mutable container.  For classic map reduce this would be a [https://docs.oracle.com/javase/8/docs/api/java/util/List.html List<V>].
+
*What are the tasks?
 +
*What is the data?
 +
*Is the data mutable?
 +
*If so, how is it shared?
  
mythical code analogy: <code>container = collector.supplier().get()</code> <math>\leftrightarrow</math> <code>container = accumulatorReducer.createContainer()</code>
+
=Optional Warm Up=
 +
==WordCountConcreteStaticMapReduce==
 +
{{CodeToImplement|WordCountConcreteStaticMapReduce|mapAll<br>accumulateAll<br>finishAll|mapreduce.framework.warmup.wordcount}}
 +
===Mapper===
 +
====map (Provided)====
 +
<nowiki> static void map(TextSection textSection, BiConsumer<String, Integer> keyValuePairConsumer) {
 +
mapper.map(textSection, keyValuePairConsumer);
 +
}</nowiki>
  
===accumulator accept===
+
===Reducer===
We use accumulator().accept(container,item) to accumulate a value.  For classic map reduce this would add an item to a list.
+
====createMutableContainer (Provided)====
 +
<nowiki> static List<Integer> createMutableContainer() {
 +
return collector.supplier().get();
 +
}</nowiki>
  
mythical code analogy: <code>collector.accumulator().accept(container,item);</code> <math>\leftrightarrow</math> <code>accumulatorReducer.accumulate(container,item)</code>
+
====accumulate (Provided)====
 +
<nowiki> static void accumulate(List<Integer> list, int v) {
 +
collector.accumulator().accept(list, v);
 +
}</nowiki>
  
===combiner apply===
+
====combine (Provided)====
We use combiner().apply(containerA,containerB) to combine two accumulators.  You may combine containerB into containerA or containerA into containerB.  Just return whichever is the combined result.
+
<nowiki> static List<Integer> combine(List<Integer> a, List<Integer> b) {
 +
return collector.combiner().apply(a, b);
 +
}</nowiki>
  
mythical code analogy: <code>collector.combiner().apply(containerA,containerB)</code> <math>\leftrightarrow</math> <code>accumulatorReducer.combine(containerA,containerB)</code>
+
====reduce (Provided)====
 +
<nowiki> static int reduce(List<Integer> list) {
 +
return collector.finisher().apply(list);
 +
}</nowiki>
  
===finisher apply===
+
===Framework===
We use finisher().apply(container) to reduce an accumulator.
+
====mapAll====
 +
mapAll can be performed in parallel.  A task should be created for each item in the input array.  Each task should accept the emitted (key, value) pairs and store them in its own List to avoid data races. These lists make up the array which is returned (one list for each item in the input array).
  
mythical code analogy: <code>collector.finisher().apply(container)</code> <math>\leftrightarrow</math> <code>r = accumulatorReducer.reduce(container)</code>
+
<youtube>-4prus5cNFo</youtube>
  
=A Path To Victory=
+
{{Parallel|List<KeyValuePair<String, Integer>>[] mapAll(TextSection[] input)}}
Watch [https://www.youtube.com/watch?v=bcjSe0xCHbE MapReduce with Playing Cards]
 
  
Read [http://stevekrenzel.com/finding-friends-with-mapreduce Finding Mutual Friends]
+
{{Warning| When first created arrays of Objects are filled with null. You will need to assign each array index to a new List before you start the process of adding key-value pairs}}
  
Read [https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collector.html java.util.stream.Collector Javadoc]
+
{{Warning| Reminder: our course libraries consistently specify max to be exclusive. This includes the parallel forall loop.}}
  
Implement [[#Cards MapReduce Studio]]
+
{{Tip|You are encouraged to utilize the provided [[#map_.28Provided.29|map]] method.}}
  
Implement [[#WordCountConcreteStaticMapReduce]]
+
[[File:Map_all.svg]]
  
Implement [[#MutualFriendsConcreteStaticMapReduce]]
+
====accumulateAll====
 +
{{Sequential|static Map<String, List<Integer>> accumulateAll(List<KeyValuePair<String, Integer>>[] mapAllResults)}}
  
Implement [[#WordCountMapper]]
+
{{Tip|You are encouraged to utilize the provided [[#createMutableContainer_.28Provided.29|createMutableContainer]] and [[#accumulate_.28Provided.29|accumulate]] methods.}}
  
Implement [[#IntegerSumClassicReducer]]
+
[[File:Accumulate_all.svg]]
  
Implement MutualFriendsMapper
+
====finishAll====
 +
{{Parallel|static Map<String, Integer> finishAll(Map<String, List<Integer>> accumulateAllResult)}}
  
Implement MutualFriendsReducer
+
{{Tip|You are encouraged to utilize the provided [[#reduce_.28Provided.29|reduce]] method.}}
  
Implement ClassicReducer
+
[[File:Finish_all.svg]]
  
Implement [[#Simple MapReduce Framework]]
+
===Testing Your Solution===
 
+
====Correctness====
Implement [[#Cholera MapReduce Studio]]
+
{{TestSuite|WarmUpWordCountMapReduceTestSuite|mapreduce}}
 
 
Implement [[#Matrix MapReduce Framework]]
 
 
 
=Warm Up=
 
==WordCountConcreteStaticMapReduce==
 
Test Suite: <code>WordCountMapReduceWarmUpTestSuite</code>
 
  
 +
==MutualFriendsConcreteStaticMapReduce==
 +
{{CodeToImplement|MutualFriendsConcreteStaticMapReduce|map<br>reduceCreateList<br>reduceAccumulate<br>reduceCombine<br>reduceFinish<br>mapAll<br>accumulateAll<br>finishAll|mapreduce.framework.warmup.friends}}
 
===Mapper===
 
===Mapper===
====void map(TextSection textSection, BiConsumer<String, Integer> keyValuePairConsumer)====
+
====map====
===Reducer===
+
{{Sequential|static void map(Account account, BiConsumer<OrderedPair<AccountId>, Set<AccountId>> keyValuePairConsumer)}}
====List<Integer> reduceCreateList()====
 
====void reduceAccumulate(List<Integer> list, int v)====
 
====void reduceCombine(List<Integer> a, List<Integer> b)====
 
====int reduceFinish(List<Integer> list)====
 
===Framework===
 
====List<KeyValuePair<String, Integer>>[] mapAll(TextSection[] input)====
 
====Map<String, List<Integer>> accumulateAll(List<KeyValuePair<String, Integer>>[] mapAllResults)====
 
====Map<String, Integer> finishAll(Map<String, List<Integer>> accumulateAllResult)====
 
  
==MutualFriendsConcreteStaticMapReduce==
 
Test Suite: <code>MutualFriendsMapReduceWarmUpTestSuite</code>
 
===Mapper===
 
====void map(Account account, BiConsumer<OrderedPair<AccountId>, Set<AccountId>> keyValuePairConsumer)====
 
 
===Reducer===
 
===Reducer===
====List<Set<AccountId>> reduceCreateList()====
+
====createMutableContainer====
====void reduceAccumulate(List<Set<AccountId>> list, Set<AccountId> v)====
+
{{Sequential|static List<Set<AccountId>> createMutableContainer()}}
====void reduceCombine(List<Set<AccountId>> a, List<Set<AccountId>> b)====
 
====MutualFriendIds reduceFinish(List<Set<AccountId>> list)====
 
===Framework===
 
====List<KeyValuePair<OrderedPair<AccountId>, Set<AccountId>>>[] mapAll(Account[] input)====
 
====Map<OrderedPair<AccountId>, List<Set<AccountId>>> accumulateAll(List<KeyValuePair<OrderedPair<AccountId>, Set<AccountId>>>[] mapAllResults)====
 
====Map<OrderedPair<AccountId>, MutualFriendIds> finishAll(Map<OrderedPair<AccountId>, List<Set<AccountId>>> accumulateAllResult)====
 
  
=Assignment=
+
====accumulate====
Test Suite: <code>MapReduceAssignmentTestSuite</code>
+
{{Sequential|static void accumulate(List<Set<AccountId>> list, Set<AccountId> v)}}
  
==Mutual Friends MapReduce Assignment==
+
====combine====
 +
{{Sequential|static void combine(List<Set<AccountId>> a, List<Set<AccountId>> b)}}
  
The goal of this implementation is to create Facebook’s mutual friends algorithm using MapReduce. To accomplish this, you will need to create the mapper and the reducer. Navigate to the <code>MutualFriendsMapper.java</code> class. In this class, you will specifically define how the framework accomplishes the map method.
+
====reduce====
 +
{{Sequential|static AccountIdMutableContainer reduce(List<Set<AccountId>> list)}}
  
===MutualFriendsMapper===
+
===Framework===
 +
====mapAll====
 +
{{Parallel|static List<KeyValuePair<OrderedPair<AccountId>, Set<AccountId>>>[] mapAll(Account[] input)}}
  
The only method you will need to alter is the map method. In this method, you will need to map every combination of the account holder to his/her friends. In order to do this, create ordered pairs of the given account’s ID and the IDs of the account holder’s friends. You must then feed each individual ordered pair into the keyValuePairConsumer along with the full set of the account holder’s friends.
+
{{Warning| When first created arrays of Objects are filled with null. You will need to assigned each array index to a new List before you start the process of adding key-value pairs}}
  
Hint: check out the methods in the Account class for help.
+
{{Warning| Reminder: our course libraries consistently specify max to be exclusive.  This includes the parallel forall loop.}}
  
===MutualFriendsClassicReducer===
+
{{Tip|You are encouraged to utilize the [[#map|map]] method you implemented.}}
 +
====accumulateAll====
 +
{{Sequential|static Map<OrderedPair<AccountId>, List<Set<AccountId>>> accumulateAll(List<KeyValuePair<OrderedPair<AccountId>, Set<AccountId>>>[] mapAllResults)}}
  
Investigate <code>MutualFriendIds</code> for clues on what you need to do.
+
{{Tip|You are encouraged to utilize the [[#reduceCreateList|reduceCreateList]] and [[#reduceAccumulate|reduceAccumulate]] methods you implemented.}}
  
==Word Count MapReduce Assignment==
+
====finishAll====
 +
{{Parallel|static Map<OrderedPair<AccountId>, MutualFriendIds> finishAll(Map<OrderedPair<AccountId>, List<Set<AccountId>>> accumulateAllResult)}}
  
The goal of this implementation is to count the number of times a word appears in a given text, using MapReduce. To accomplish this, you will need to create both the mapper and the reducer. Navigate to the <code>WordCountMapper.java</code> and <code>IntegerSumClassicReducer.java</code> classes. You will specifically define how the framework accomplishes the map and reduce methods.
+
{{Tip|You are encouraged to utilize the [[#reduceFinish|reduceFinish]] method you implemented.}}
  
===WordCountMapper===
+
===Testing Your Solution===
 +
====Correctness====
 +
{{TestSuite|WarmUpMutualFriendsMapReduceTestSuite|mapreduce}}
  
The only method you will need to alter is the map method. In this method, you need to record every instance of a given word and feed it into the keyValuePairConsumer. To do this, access all of the words in the TextSection and if the length of the word is greater than zero (meaning it is not just blank space), convert it into lower-case and accept it into the consumer.
+
=Required Lab=
  
Hint: Look at the methods in TextSection and the toLowerCase() method for strings for assistance.
+
==Bottlenecked MapReduce Framework==
  
===IntegerSumClassicReducer===
+
{{CodeToImplement|BottleneckedMapReduceFramework|mapAll<br>accumulateAll<br>finishAll|mapreduce.framework.lab.bottlenecked}}
  
The only method you will need to alter is the apply method. All you need to do is sum up the value of a list of integers and return that sum value.
+
Navigate to the <code>BottleneckedMapReduceFramework.java</code> class and there will be three methods for you to complete: mapAll, accumulateAll, and finishAll. These frameworks are meant to be extremely general and applied to more specific uses of MapReduce.  Whereas the warm ups for this lab serve to prepare you to build this required section of the lab, this bottlenecked framework is in many ways a warm up for the matrix implementation.
  
==Simple MapReduce Framework==
+
===mapAll===
 +
NOTE: If you struggle to get through this method, you are strongly encouraged to try the warm-ups.
  
Navigate to the <code>SimpleMapReduceFramework.java</code> class and there will be three methods for you to complete: mapAll, accumulateAll, and finishAll. These frameworks are meant to be extremely general and applied to more specific uses of MapReduce.
+
{{Parallel|List<KeyValuePair<K, V>>[] mapAll(E[] input)}}
  
===mapAll Method===
+
{{Warning| When first created arrays of Objects are filled with null.  You will need to assigned each array index to a new List before you start the process of adding key-value pairs}}
  
With this method, you will map all of the elements of an array of data into a new array of equivalent size consisting of a list of key value pairs. In order to do this, you must define the map() method for the mapper by specifying that it should add a new key value pair to a previously empty list. This list should then be added to the array of lists you previously defined, therefore completing the mapping stage of MapReduce. This should all be done in parallel.
+
{{Warning| Reminder: our course libraries consistently specify max to be exclusive.  This includes the parallel forall loop.}}
 +
 
 +
With this method, you will map all of the elements of an array of data into a new array of equivalent size consisting of Lists of key value pairs. We will leverage the Mapper which is a field/instance variable on this BottleneckedFramework instance.  When invoking the mapper's map method with an element of the input array and a BiConsumer which will  accept each key and value passed to it, adding a KeyValuePair to its List. This list should then be added to the array of lists you previously defined, therefore completing the mapping stage of MapReduce. This should all be done in parallel.
  
 
Hint: you should create an array of lists equivalent in size to the original array. Each list will contain all of the emitted (key,value) pairs for its item.
 
Hint: you should create an array of lists equivalent in size to the original array. Each list will contain all of the emitted (key,value) pairs for its item.
  
===accumulateAll Method===
+
[[File:Bottlenecked map all.png|400px]] [https://docs.google.com/presentation/d/1Gtpj6_5_8imUMccpwxmfrOxKSED791OKl4lZR6G2Bm4/pub?start=false&loop=false&delayms=3000&slide=id.g7ea283730e_0_130 slide]
 +
 
 +
===accumulateAll===
 +
{{Sequential|Map<K, A> accumulateAll(List<KeyValuePair<K, V>>[] mapAllResults)}}
  
 
This middle step is often excluded in more advanced MapReduce applications. When run in parallel, it is the only step of the framework that must be completed sequentially. In the matrix framework implementation, we will do away with this step altogether for the sake of performance.
 
This middle step is often excluded in more advanced MapReduce applications. When run in parallel, it is the only step of the framework that must be completed sequentially. In the matrix framework implementation, we will do away with this step altogether for the sake of performance.
  
In this method, you will take in the array of lists you previously created and accumulate the key value pairs in the lists into a newly defined map. Unlike the mapping phase, the map must account for the possibility of duplicates. To help deal with this issue, you must make use of the Collector provided to you. More specifically, access the accumulator in the collector by calling the <code>accumulator()</code> method and accept the key/value pair when you add it to the map. You probably noticed that the method must return a map of <K, A>, which differs from the <K, V> generics fed into the method. The framework is designed this way as the data originally fed into the mapping stage can be turned into a different form of data before reaching the reduce stage. Although we will not do this with any of our implementations, we designed the framework to allow this. In order to access the correct value for the map if the key has no associated value yet, use the supplier associated with the Collector with the <code>supplier()</code> method.
+
In this method, you will take in the array of lists you previously created and accumulate the key value pairs in the lists into a newly defined map. To help deal with this issue, you must make use of the Collector provided to you. More specifically, access the accumulator in the collector by calling the <code>accumulator()</code> method and accept the key/value pair when you add it to the map. You probably noticed that the method must return a map of <K, A>, which differs from the <K, V> generics fed into the method. The framework is designed this way as the data originally fed into the mapping stage can be collected into a mutable container before reaching the finish/reduce stage. In order to access the correct value for the map if the key has no associated value yet, use the supplier associated with the Collector with the <code>supplier()</code> method.
  
Hint: Look into the <code>compute()</code> method for maps.
+
[[File:Bottlenecked accumulate all.png|400px]] [https://docs.google.com/presentation/d/1Gtpj6_5_8imUMccpwxmfrOxKSED791OKl4lZR6G2Bm4/pub?start=false&loop=false&delayms=3000&slide=id.g7ea283730e_0_165 slide]
  
===finishAll Method===
+
===finishAll===
 +
{{Parallel|Map<K, R> finishAll(Map<K, A> accumulateAllResult)}}
  
 
This final step reduces the accumulated data and returns the final map in its reduced form. Again, you may notice that the method returns a map of <K, R> instead of the <K, A> which was returned in the accumulateAll method. This happens for the exact same reason as the accumulateAll method, as the framework is designed to handle cases in which the reduced data differs in type from the accumulated data.
 
This final step reduces the accumulated data and returns the final map in its reduced form. Again, you may notice that the method returns a map of <K, R> instead of the <K, A> which was returned in the accumulateAll method. This happens for the exact same reason as the accumulateAll method, as the framework is designed to handle cases in which the reduced data differs in type from the accumulated data.
Line 191: Line 236:
 
To reduce the data down, use the map returned from the accumulateAll stage and put the results of the reduction into a new map. The provided Collector will come in extremely handy for this stage, more specifically the finisher which can be called using the <code>finisher()</code> method. This step should run in parallel and will probably be the easiest of the three methods.
 
To reduce the data down, use the map returned from the accumulateAll stage and put the results of the reduction into a new map. The provided Collector will come in extremely handy for this stage, more specifically the finisher which can be called using the <code>finisher()</code> method. This step should run in parallel and will probably be the easiest of the three methods.
  
Hint: Use the <code>entrySet()</code> method to get all of the entries in the given map and remember to use a ConcurrentHashMap instead of a regular HashMap to ensure the method can run in parallel.
+
[[File:Bottlenecked finish all.png|400px]] [https://docs.google.com/presentation/d/1Gtpj6_5_8imUMccpwxmfrOxKSED791OKl4lZR6G2Bm4/pub?start=false&loop=false&delayms=3000&slide=id.g7ea283730e_0_221 slide]
  
 
==Matrix MapReduce Framework==
 
==Matrix MapReduce Framework==
 +
 +
 +
{{CodeToImplement|MatrixMapReduceFramework|mapAndAccumulateAll<br>combineAndFinishAll|mapreduce.framework.lab.matrix}}
  
 
Navigate to the <code>MatrixMapReduceFramework.java</code> class and there will be two methods for you to complete: mapAndAccumulateAll and combineAndFinishAll. These frameworks are meant to be extremely general and applied to more specific uses of MapReduce.
 
Navigate to the <code>MatrixMapReduceFramework.java</code> class and there will be two methods for you to complete: mapAndAccumulateAll and combineAndFinishAll. These frameworks are meant to be extremely general and applied to more specific uses of MapReduce.
  
The matrix framework is much more complex than the simple framework, but it boosts performance by grouping the map and accumulate stages so that everything can run in parallel. It does so by slicing up the given data into the specified mapTaskCount number of slices and assigns a reduce task number to each entry using the provided getReduceIndex() method. This, in effect, creates a matrix of maps, hence the name of the framework. In the combineAndFinishAll stage, the matrix comes in handy by allowing us to go directly down the matrix (as each key is essentially grouped into a bucket), combining and reducing elements all-in-one. This concept was explained in more depth during class.
+
The matrix framework is much more complex than the bottlenecked framework, but it boosts performance by grouping the map and accumulate stages so that everything can run in parallel. It does so by slicing up the given data into the specified mapTaskCount number of slices and assigns a reduce task number to each entry using the HashUtils toIndex() method. This, in effect, creates a matrix of dictionaries, hence the name of the framework. In the combineAndFinishAll stage, the matrix comes in handy by allowing us to go directly down the columns of the matrix (as each key is essentially grouped into a bucket), combining and reducing elements all-in-one. This concept was explained in more depth during class.
  
 
===mapAndAccumulateAll===
 
===mapAndAccumulateAll===
 +
{{Parallel|Map<K, A>[][] mapAndAccumulateAll(E[] input)}}
  
In this stage, you will map and accumulate a given array of data into a matrix of maps. This method should run in parallel while combining the map and accumulate portions of the simple framework (which we recommend you attempt first). As mentioned previously, the input should be sliced into a mapTaskCount number of slices and then mapped/accumulated into its rightful spot in the matrix. Although you can slice up the data into chunks yourself, we recommend using the <code>Slice</code> and <code>SliceUtils</code> classes introduced earlier in the course.
+
In this stage, you will map and accumulate a given array of data into a matrix of dictionaries. This method should run in parallel while performing the map and accumulate portions of the bottlenecked framework (which we recommend you complete prior to embarking on this mission). As mentioned previously, the input should be sliced into a mapTaskCount number of IndexedRanges and then mapped/accumulated into its appropriate dictionary in the matrix. Although you could slice up the data into chunks yourself, we require using an identical algorithm as performed the <code>IndexedRange</code> and <code>Slices</code> classes introduced earlier in the course.  This will allow us to provide better feedback to allow you to pinpoint bugs sooner.  What is the best way to perform an identical algorithm to your Slices studio?  Use your Slices studio, of course.
  
{{Warning | Prefer the use [https://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/apidocs/edu/wustl/cse231s/slices/SliceUtils.html#createNSlices-E:A-int- SliceUtils] over your own Slices implementationMany students have run into testing failures due to differences in the slices.}}
+
For each slice, the mapper should map the input into its appropriate cell in the matrix and accumulate it into that specific dictionary. Essentially, you will need to nestle the actions of the accumulate method into the mapper. In order to find where the input should go in the matrix, remember that each slice keeps track of its index id and HashUtils has a toIndex methodWhich is applicable to the row and which is applicable to the column?
  
For each slice, the mapper should map the input into its rightful spot in the matrix and accumulate it into that specific map. Essentially, you will need to nestle the actions of the accumulate method into the mapper. In order to find where the input should go in the matrix, remember that each slice keeps track of its position relative to the other slices and the getReduceIndex method, mentioned above.
+
Hint: The number of rows should match the number of slices.
  
Hint: The number of rows should match the number of slices.
+
[[File:Matrix map accumulate all.png|400px]] [https://docs.google.com/presentation/d/1bSLKsI5u2e_tFc0d-RSb0BIDwA-kD75U6yXfxvpMf6Y/pub?start=false&loop=false&delayms=3000&slide=id.g7ebdc248f6_0_347 slide]
 +
 
 +
[[File:Matrix map accumulate art.png|400px]] [https://docs.google.com/presentation/d/1bSLKsI5u2e_tFc0d-RSb0BIDwA-kD75U6yXfxvpMf6Y/pub?start=false&loop=false&delayms=3000&slide=id.g7ebdc248f6_0_388 slide]
  
 
===combineAndFinishAll===
 
===combineAndFinishAll===
 +
{{Parallel|Map<K, R> combineAndFinishAll(Map<K, A>[][] input)}}
  
 
In this stage, you will take the matrix you just completed and combine all of the separate rows down to one array. Afterward, you will convert this combined array of maps into one final map. This method should run in parallel.
 
In this stage, you will take the matrix you just completed and combine all of the separate rows down to one array. Afterward, you will convert this combined array of maps into one final map. This method should run in parallel.
  
As mentioned previously, you should go directly down the matrix to access the same bucket across the different slices you created in the mapAndAccumulateAll step. For all of the maps in a column, you should go through each entry and combine it down into one row. You will need to make use of the Collector’s finisher again, but you will also need to make use of the combiner. You can access the Collector’s combiner using the <code>combiner()</code> method. Although the combine step differs from the simple framework, the finish step should mirror what you did previously.
+
As mentioned previously, you should go directly down the matrix to access the same bucket across the different slices you created in the mapAndAccumulateAll step. For all of the maps in a column, you should go through each entry and combine it down into one row. You will need to make use of the Collector’s finisher again, but you will also need to make use of the combiner. You can access the Collector’s combiner using the <code>combiner()</code> method. Although the combine step differs from the bottlenecked framework, the finish step should mirror what you did previously.
  
 
Hint: You can use the provided MultiWrapMap class to return the final row as a valid output. You should also combine before you finish.
 
Hint: You can use the provided MultiWrapMap class to return the final row as a valid output. You should also combine before you finish.
  
 +
[[File:Matrix combine finish all.png|400px]] [https://docs.google.com/presentation/d/1bSLKsI5u2e_tFc0d-RSb0BIDwA-kD75U6yXfxvpMf6Y/pub?start=false&loop=false&delayms=3000&slide=id.g7ebdc248f6_0_425 slide]
  
=Associated Studios=
+
=Testing Your Solution=
==Int Sum Apps==
+
==Correctness==
[[Int_Sum_MapReduce_Apps_Studio]]
+
There is a top-level test suite comprised of sub test suites which can be invoked separately when you want to focus on one part of the assignment.
==Collector==
+
{{TestSuite|FrameworksLabTestSuite|mapreduce.framework.lab}}
[[Collector_MapReduce_Studio]]
+
===Bottlenecked===
==Mutual Friends App==
+
{{TestSuite|BottleneckedFrameworkTestSuite|mapreduce.framework.lab.bottlenecked}}
[[Mutual_Friends_MapReduce_Application]]
+
====MapAll====
==Cholera App==
+
{{TestSuite|BottleneckedFrameworkTestSuite|mapreduce.framework.lab.bottlenecked}}
[[Cholera_MapReduce_Application]]
+
====AccumulateAll====
 +
{{TestSuite|BottleneckedAccumulateAllTestSuite|mapreduce.framework.lab.bottlenecked}}
 +
====FinishAll====
 +
{{TestSuite|BottleneckedFinishAllTestSuite|mapreduce.framework.lab.bottlenecked}}
 +
====Holistic====
 +
{{TestSuite|BottleneckedHolisticTestSuite|mapreduce.framework.lab.bottlenecked}}
 +
===Matrix===
 +
{{TestSuite|MatrixFrameworkTestSuite|mapreduce.framework.lab.matrix}}
 +
====MapAccumulateAll====
 +
{{TestSuite|MatrixMapAccumulateAllTestSuite|mapreduce.framework.lab.matrix}}
 +
====CombineFinishAll====
 +
{{TestSuite|MatrixCombineFinishAllTestSuite|mapreduce.framework.lab.matrix}}
 +
====Holistic====
 +
{{TestSuite|MatrixHolisticTestSuite|mapreduce.framework.lab.matrix}}
  
 
=Rubric=
 
=Rubric=
Line 234: Line 300:
 
Total points: 100
 
Total points: 100
  
Simple framework subtotal: 40
+
Bottlenecked framework subtotal: 40
 
* Correct mapAll (10)
 
* Correct mapAll (10)
 
* Correct accumulateAll (20)
 
* Correct accumulateAll (20)
 
* Correct finishAll (10)
 
* Correct finishAll (10)
  
Matrix framework subtotal: 50
+
Matrix framework subtotal: 60
* Correct mapAndAccumulateAll (25)
+
* Correct mapAndAccumulateAll (30)
* Correct combineAndFinishAll (25)
+
* Correct combineAndFinishAll (30)
 
+
-->
Whole project:
 
* Clarity and efficiency (10)
 

Latest revision as of 08:23, 3 March 2022