Bottleneck MapReduce Framework Assignment
credit for this assignment: Finn Voichick and Dennis Cosgrove
Contents
Motivation
Dealing with big data is Hansel-level hot right now. We will build two implementations to better understand the inner workings of Hadoop and Spark-like frameworks. Your frameworks will actually be more general than just MapReduce.
The Matrix implementation gives us experience with dividing the work up into thread confined tasks whose results are then combined together.
Background
wikipedia article on MapReduce
As the name suggests, MapReduce refers to the process of mapping then reducing some stream of data. At its core, all a MapReduce algorithm entails is transforming a list of one kind of item before collecting those items and reducing them down to a single values per key using some computation. As you can probably tell, the concept of MapReduce is extremely general and can apply to a wide berth of problems. In this assignment, we will use MapReduce to simulate the Facebook mutual friends algorithm (finding the mutual friends between all friend pairs) as well as pinpointing the offending well in the 1854 Soho Cholera Outbreak.
For more information on the general concept of MapReduce, refer to this article.
Java Advice
Parameterized Type Array Tip
Creating arrays of parameterized types in Java is madness inducing. Some details are available in Java Generics Restrictions.
The example below creates an array of List<String>. The @SuppressWarnings
annotation is optional.
@SuppressWarnings("unchecked") List<String>[] result = new List[length];
Use map.entrySet()
Prefer the use of map.entrySet() over map.keySet() followed by looking up the value with map.get(key).
Use the appropriate version of forall
there are many overloaded versions of forall including:
choose the correct one for each situation where "correct" is often the one that produces the cleanest code.
Mistakes To Avoid
Warning: Arrays (and Matrices) are initially filled with null. You must fill them with instances. |
Warning: Ensure that your Slices studio is conforming to spec. |
Code To Use
To allow our frameworks to work well with JDK8 Streams, we employ a couple of standard interfaces over creating our own custom ones.
BiConsumer
We use the standard BiConsumer<T,U> interface with an BiConsumer's accept(t,u) method in the place of a mythical, custom MapContext<K,V> interface with an emit(key,value) method.
public interface BiConsumer<T, U> { // invoke accept with each key and value you wish to emit void accept(T t, U u); }
Collector
read the javadoc for Collector
check out the Collector_MapReduce_Studio#Background
IndexedRange
This class has everything you need for n-way split problems, specifically:
Slices
HashUtils
MultiWrapMap
A Path To Victory
- Int_Sum_MapReduce_Apps_Studio
- Collector_MapReduce_Studio
- Mutual_Friends_MapReduce_Application
- #Optional_Warm_Up
- #Bottlenecked_MapReduce_Framework
- Cholera_MapReduce_Application
- #Matrix_MapReduce_Framework
The Core Questions
- What are the tasks?
- What is the data?
- Is the data mutable?
- If so, how is it shared?
Warm Ups
[[]]
[[]]
Required Lab
Bottlenecked MapReduce Framework
class: | BottleneckedMapReduceFramework.java | |
methods: | mapAll accumulateAll finishAll |
|
package: | mapreduce.framework.lab.bottlenecked | |
source folder: | student/src/main/java |
Navigate to the BottleneckedMapReduceFramework.java
class and there will be three methods for you to complete: mapAll, accumulateAll, and finishAll. These frameworks are meant to be extremely general and applied to more specific uses of MapReduce. Whereas the warm ups for this lab serve to prepare you to build this required section of the lab, this bottlenecked framework is in many ways a warm up for the matrix implementation.
mapAll
NOTE: If you struggle to get through this method, you are strongly encouraged to try the warm-ups.
method: List<KeyValuePair<K, V>>[] mapAll(E[] input)
(parallel implementation required)
Warning: When first created arrays of Objects are filled with null. You will need to assigned each array index to a new List before you start the process of adding key-value pairs |
Warning: Reminder: our course libraries consistently specify max to be exclusive. This includes the parallel forall loop. |
With this method, you will map all of the elements of an array of data into a new array of equivalent size consisting of Lists of key value pairs. We will leverage the Mapper which is a field/instance variable on this BottleneckedFramework instance. When invoking the mapper's map method with an element of the input array and a BiConsumer which will accept each key and value passed to it, adding a KeyValuePair to its List. This list should then be added to the array of lists you previously defined, therefore completing the mapping stage of MapReduce. This should all be done in parallel.
Hint: you should create an array of lists equivalent in size to the original array. Each list will contain all of the emitted (key,value) pairs for its item.
accumulateAll
method: Map<K, A> accumulateAll(List<KeyValuePair<K, V>>[] mapAllResults)
(sequential implementation only)
This middle step is often excluded in more advanced MapReduce applications. When run in parallel, it is the only step of the framework that must be completed sequentially. In the matrix framework implementation, we will do away with this step altogether for the sake of performance.
In this method, you will take in the array of lists you previously created and accumulate the key value pairs in the lists into a newly defined map. To help deal with this issue, you must make use of the Collector provided to you. More specifically, access the accumulator in the collector by calling the accumulator()
method and accept the key/value pair when you add it to the map. You probably noticed that the method must return a map of <K, A>, which differs from the <K, V> generics fed into the method. The framework is designed this way as the data originally fed into the mapping stage can be collected into a mutable container before reaching the finish/reduce stage. In order to access the correct value for the map if the key has no associated value yet, use the supplier associated with the Collector with the supplier()
method.
finishAll
method: Map<K, R> finishAll(Map<K, A> accumulateAllResult)
(parallel implementation required)
This final step reduces the accumulated data and returns the final map in its reduced form. Again, you may notice that the method returns a map of <K, R> instead of the <K, A> which was returned in the accumulateAll method. This happens for the exact same reason as the accumulateAll method, as the framework is designed to handle cases in which the reduced data differs in type from the accumulated data.
To reduce the data down, use the map returned from the accumulateAll stage and put the results of the reduction into a new map. The provided Collector will come in extremely handy for this stage, more specifically the finisher which can be called using the finisher()
method. This step should run in parallel and will probably be the easiest of the three methods.
Testing Your Solution
Correctness
class: | BottleneckedFrameworkTestSuite.java | |
package: | mapreduce.framework.lab.bottlenecked | |
source folder: | testing/src/test/java |
MapAll
class: | BottleneckedFrameworkTestSuite.java | |
package: | mapreduce.framework.lab.bottlenecked | |
source folder: | testing/src/test/java |
AccumulateAll
class: | BottleneckedAccumulateAllTestSuite.java | |
package: | mapreduce.framework.lab.bottlenecked | |
source folder: | testing/src/test/java |
ReduceAll
class: | BottleneckedFinishAllTestSuite.java | |
package: | mapreduce.framework.lab.bottlenecked | |
source folder: | testing/src/test/java |
Holistic
class: | BottleneckedHolisticTestSuite.java | |
package: | mapreduce.framework.lab.bottlenecked | |
source folder: | testing/src/test/java |