Difference between revisions of "MapReduce Frameworks Lab"
[[Matrix_MapReduce_Framework_Assignment|Matrix MapReduce Framework]]
− | <-- | + | <!-- |
credit for this assignment: Finn Voichick and Dennis Cosgrove
=Motivation=
Revision as of 08:22, 3 March 2022
Bottleneck MapReduce Framework
The Core Questions
- What are the tasks?
- What is the data?
- Is the data mutable?
- If so, how is it shared?
Optional Warm Up
WordCountConcreteStaticMapReduce
class: WordCountConcreteStaticMapReduce.java
methods: mapAll, accumulateAll, finishAll
package: mapreduce.framework.warmup.wordcount
source folder: student/src/main/java
Mapper
map (Provided)
static void map(TextSection textSection, BiConsumer<String, Integer> keyValuePairConsumer) { mapper.map(textSection, keyValuePairConsumer); }
Reducer
createMutableContainer (Provided)
static List<Integer> createMutableContainer() { return collector.supplier().get(); }
accumulate (Provided)
static void accumulate(List<Integer> list, int v) { collector.accumulator().accept(list, v); }
combine (Provided)
static List<Integer> combine(List<Integer> a, List<Integer> b) { return collector.combiner().apply(a, b); }
reduce (Provided)
static int reduce(List<Integer> list) { return collector.finisher().apply(list); }
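The provided methods above all delegate to a `collector` field whose definition is not shown on this page. As a rough sketch only, assuming it behaves like a `java.util.stream.Collector<Integer, List<Integer>, Integer>` that gathers counts in a list and sums them, the four pieces could look like this (the class name `WordCountCollectorSketch` and the summing finisher are assumptions, not the lab's code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;

final class WordCountCollectorSketch {
    // Assumed stand-in for the lab's collector: the mutable container is a
    // List<Integer> and the reduced result is the sum of its values.
    static final Collector<Integer, List<Integer>, Integer> COLLECTOR = Collector.of(
            ArrayList::new,                       // supplier: new empty container
            (list, v) -> list.add(v),             // accumulator: add one value
            (a, b) -> { a.addAll(b); return a; }, // combiner: merge b into a
            list -> list.stream().mapToInt(Integer::intValue).sum()); // finisher

    public static void main(String[] args) {
        List<Integer> ones = COLLECTOR.supplier().get();  // like createMutableContainer()
        COLLECTOR.accumulator().accept(ones, 1);          // like accumulate(list, v)
        COLLECTOR.accumulator().accept(ones, 1);
        List<Integer> more = COLLECTOR.supplier().get();
        COLLECTOR.accumulator().accept(more, 1);
        List<Integer> merged = COLLECTOR.combiner().apply(ones, more); // like combine(a, b)
        System.out.println(COLLECTOR.finisher().apply(merged));        // prints 3
    }
}
```

Note that this combiner mutates and returns its first argument, which is allowed for a `Collector` but means the caller should not reuse `ones` as if it were unchanged.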
Framework
mapAll
mapAll can be performed in parallel. A task should be created for each item in the input array. Each task should accept the emitted (key, value) pairs and store them in its own List to avoid data races. These lists make up the array which is returned (one list for each item in the input array).
method: List<KeyValuePair<String, Integer>>[] mapAll(TextSection[] input)
(parallel implementation required)
Warning: When first created, arrays of Objects are filled with null. You will need to assign each array index to a new List before you start adding key-value pairs.
Warning: Reminder: our course libraries consistently specify max to be exclusive. This includes the parallel forall loop.
Tip: You are encouraged to utilize the provided map method.
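The two warnings above can be illustrated with plain Java. This is a minimal sketch, not the lab solution: it uses `IntStream.range(...).parallel()` as a stand-in for the course's parallel forall loop (both take an exclusive upper bound), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

final class PerIndexListsSketch {
    static List<String>[] fillInParallel(String[] input) {
        @SuppressWarnings("unchecked")
        List<String>[] result = new List[input.length]; // every slot starts out null
        // The upper bound of IntStream.range is exclusive, so i never equals input.length.
        IntStream.range(0, input.length).parallel().forEach(i -> {
            List<String> own = new ArrayList<>(); // touched only by the task for index i
            own.add(input[i].toLowerCase());
            result[i] = own;                      // each task writes a distinct slot
        });
        return result;
    }

    public static void main(String[] args) {
        List<String>[] lists = fillInParallel(new String[] {"Map", "Reduce"});
        System.out.println(Arrays.toString(lists)); // prints [[map], [reduce]]
    }
}
```

Because each task creates and fills its own list and writes only to its own array index, no two tasks ever share mutable data.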
accumulateAll
method: static Map<String, List<Integer>> accumulateAll(List<KeyValuePair<String, Integer>>[] mapAllResults)
(sequential implementation only)
Tip: You are encouraged to utilize the provided createMutableContainer and accumulate methods.
finishAll
method: static Map<String, Integer> finishAll(Map<String, List<Integer>> accumulateAllResult)
(parallel implementation required)
Tip: You are encouraged to utilize the provided reduce method.
Testing Your Solution
Correctness
class: WarmUpWordCountMapReduceTestSuite.java
package: mapreduce
source folder: testing/src/test/java
MutualFriendsConcreteStaticMapReduce
Mapper
map
method: static void map(Account account, BiConsumer<OrderedPair<AccountId>, Set<AccountId>> keyValuePairConsumer)
(sequential implementation only)
Reducer
createMutableContainer
method: static List<Set<AccountId>> createMutableContainer()
(sequential implementation only)
accumulate
method: static void accumulate(List<Set<AccountId>> list, Set<AccountId> v)
(sequential implementation only)
combine
method: static void combine(List<Set<AccountId>> a, List<Set<AccountId>> b)
(sequential implementation only)
reduce
method: static AccountIdMutableContainer reduce(List<Set<AccountId>> list)
(sequential implementation only)
Framework
mapAll
method: static List<KeyValuePair<OrderedPair<AccountId>, Set<AccountId>>>[] mapAll(Account[] input)
(parallel implementation required)
Warning: When first created, arrays of Objects are filled with null. You will need to assign each array index to a new List before you start adding key-value pairs.
Warning: Reminder: our course libraries consistently specify max to be exclusive. This includes the parallel forall loop.
Tip: You are encouraged to utilize the map method you implemented.
accumulateAll
method: static Map<OrderedPair<AccountId>, List<Set<AccountId>>> accumulateAll(List<KeyValuePair<OrderedPair<AccountId>, Set<AccountId>>>[] mapAllResults)
(sequential implementation only)
Tip: You are encouraged to utilize the createMutableContainer and accumulate methods you implemented.
finishAll
method: static Map<OrderedPair<AccountId>, MutualFriendIds> finishAll(Map<OrderedPair<AccountId>, List<Set<AccountId>>> accumulateAllResult)
(parallel implementation required)
Tip: You are encouraged to utilize the reduce method you implemented.
Testing Your Solution
Correctness
class: WarmUpMutualFriendsMapReduceTestSuite.java
package: mapreduce
source folder: testing/src/test/java
Required Lab
Bottlenecked MapReduce Framework
class: BottleneckedMapReduceFramework.java
methods: mapAll, accumulateAll, finishAll
package: mapreduce.framework.lab.bottlenecked
source folder: student/src/main/java
Navigate to the BottleneckedMapReduceFramework.java class and there will be three methods for you to complete: mapAll, accumulateAll, and finishAll. These frameworks are meant to be extremely general and applied to more specific uses of MapReduce. Whereas the warm-ups for this lab serve to prepare you to build this required section of the lab, this bottlenecked framework is in many ways a warm-up for the matrix implementation.
mapAll
NOTE: If you struggle to get through this method, you are strongly encouraged to try the warm-ups.
method: List<KeyValuePair<K, V>>[] mapAll(E[] input)
(parallel implementation required)
Warning: When first created, arrays of Objects are filled with null. You will need to assign each array index to a new List before you start adding key-value pairs.
Warning: Reminder: our course libraries consistently specify max to be exclusive. This includes the parallel forall loop.
With this method, you will map all of the elements of an array of data into a new array of equal size consisting of Lists of key-value pairs. We will leverage the Mapper, which is a field (instance variable) on this BottleneckedMapReduceFramework instance. For each element, invoke the mapper's map method with that element of the input array and a BiConsumer that accepts each key and value passed to it and adds a KeyValuePair to the element's List. Store that list at the element's index in the array of lists you previously created, thereby completing the mapping stage of MapReduce. This should all be done in parallel.
Hint: you should create an array of lists equal in size to the original array. Each list will contain all of the emitted (key, value) pairs for its item.
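The mapping stage described above can be sketched in plain Java. This is an illustration under stated assumptions rather than the course's implementation: the `Mapper` interface here is a hypothetical stand-in for the lab's Mapper, `Map.Entry` (via `Map.entry`, Java 9+) stands in for KeyValuePair, and a parallel `IntStream` stands in for the course's parallel forall loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.stream.IntStream;

final class MapAllSketch {
    /** Hypothetical stand-in for the lab's Mapper interface. */
    interface Mapper<E, K, V> {
        void map(E element, BiConsumer<K, V> keyValuePairConsumer);
    }

    static <E, K, V> List<Map.Entry<K, V>>[] mapAll(E[] input, Mapper<E, K, V> mapper) {
        @SuppressWarnings("unchecked")
        List<Map.Entry<K, V>>[] result = new List[input.length]; // slots start out null
        IntStream.range(0, input.length).parallel().forEach(i -> {
            List<Map.Entry<K, V>> pairs = new ArrayList<>();      // private to this task
            // The BiConsumer records every emitted (key, value) pair in this task's list.
            mapper.map(input[i], (k, v) -> pairs.add(Map.entry(k, v)));
            result[i] = pairs;
        });
        return result;
    }

    public static void main(String[] args) {
        Mapper<String, String, Integer> wordCount = (line, emit) -> {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    emit.accept(word.toLowerCase(), 1);
                }
            }
        };
        List<Map.Entry<String, Integer>>[] out = mapAll(new String[] {"a b a", "b"}, wordCount);
        System.out.println(out[0].size()); // prints 3
    }
}
```

One design point worth noticing: the output array has exactly one list per input element, so the result preserves which element produced which pairs.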
accumulateAll
method: Map<K, A> accumulateAll(List<KeyValuePair<K, V>>[] mapAllResults)
(sequential implementation only)
This middle step is often excluded in more advanced MapReduce applications. It is the only step of this framework that must be completed sequentially, even when the other steps run in parallel. In the matrix framework implementation, we will do away with this step altogether for the sake of performance.
In this method, you will take in the array of lists you previously created and accumulate the key-value pairs in the lists into a newly created map. To do so, you must make use of the Collector provided to you. More specifically, obtain the accumulator from the collector by calling the accumulator() method, and use it to accept each value into the mutable container associated with its key in the map. You probably noticed that the method must return a map of <K, A>, which differs from the <K, V> generics fed into the method. The framework is designed this way because the values emitted by the mapping stage are collected into a mutable container (of type A) before reaching the finish/reduce stage. If a key has no associated container in the map yet, create one with the collector's supplier, which you can obtain by calling the supplier() method.
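The accumulate stage described above might be sketched as follows, assuming the provided Collector is (or behaves like) a `java.util.stream.Collector<V, A, R>`. `Map.Entry` again stands in for KeyValuePair, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.stream.Collector;

final class AccumulateAllSketch {
    static <K, V, A> Map<K, A> accumulateAll(List<Map.Entry<K, V>>[] mapAllResults,
                                             Collector<V, A, ?> collector) {
        Map<K, A> accumulated = new HashMap<>();
        BiConsumer<A, V> accumulator = collector.accumulator();
        for (List<Map.Entry<K, V>> pairs : mapAllResults) {        // sequential on purpose
            for (Map.Entry<K, V> pair : pairs) {
                // First sighting of a key: ask the supplier for a fresh container.
                A container = accumulated.computeIfAbsent(
                        pair.getKey(), key -> collector.supplier().get());
                accumulator.accept(container, pair.getValue());
            }
        }
        return accumulated;
    }

    public static void main(String[] args) {
        Collector<Integer, List<Integer>, Integer> summing = Collector.of(
                ArrayList::new, (list, v) -> list.add(v),
                (a, b) -> { a.addAll(b); return a; },
                list -> list.stream().mapToInt(Integer::intValue).sum());
        @SuppressWarnings("unchecked")
        List<Map.Entry<String, Integer>>[] mapped = new List[2];
        mapped[0] = List.of(Map.entry("a", 1), Map.entry("b", 1));
        mapped[1] = List.of(Map.entry("a", 1));
        System.out.println(accumulateAll(mapped, summing).get("a")); // prints [1, 1]
    }
}
```

The map's value type is the container type A, not the emitted value type V, which is exactly the <K, V> to <K, A> change discussed above.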
finishAll
method: Map<K, R> finishAll(Map<K, A> accumulateAllResult)
(parallel implementation required)
This final step reduces the accumulated data and returns the final map in its reduced form. Again, you may notice that the method returns a map of <K, R> instead of the <K, A> which was returned by the accumulateAll method. The reason is the same as for accumulateAll: the framework is designed to handle cases in which the reduced data differs in type from the accumulated data.
To reduce the data down, use the map returned from the accumulateAll stage and put the results of the reduction into a new map. The provided Collector will come in extremely handy for this stage, more specifically its finisher, which you can obtain by calling the finisher() method. This step should run in parallel and will probably be the easiest of the three methods.
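A minimal sketch of the finish stage, again assuming a `java.util.stream.Collector`. Because several tasks write to the result map at once, this sketch uses a `ConcurrentHashMap` (which rejects null keys and values) and a parallel stream; both are swapped-in choices for illustration, not necessarily what the course libraries expect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.stream.Collector;

final class FinishAllSketch {
    static <K, A, R> Map<K, R> finishAll(Map<K, A> accumulateAllResult,
                                         Collector<?, A, R> collector) {
        Function<A, R> finisher = collector.finisher();
        Map<K, R> finished = new ConcurrentHashMap<>(); // safe for concurrent put calls
        accumulateAllResult.entrySet().parallelStream()
                .forEach(entry -> finished.put(entry.getKey(), finisher.apply(entry.getValue())));
        return finished;
    }

    public static void main(String[] args) {
        Collector<Integer, List<Integer>, Integer> summing = Collector.of(
                ArrayList::new, (list, v) -> list.add(v),
                (a, b) -> { a.addAll(b); return a; },
                list -> list.stream().mapToInt(Integer::intValue).sum());
        Map<String, List<Integer>> accumulated = Map.of("a", List.of(1, 1), "b", List.of(1));
        System.out.println(finishAll(accumulated, summing).get("a")); // prints 2
    }
}
```

Each key is reduced independently of every other key, which is why this stage parallelizes so easily.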
Matrix MapReduce Framework
class: MatrixMapReduceFramework.java
methods: mapAndAccumulateAll, combineAndFinishAll
package: mapreduce.framework.lab.matrix
source folder: student/src/main/java
Navigate to the MatrixMapReduceFramework.java class and there will be two methods for you to complete: mapAndAccumulateAll and combineAndFinishAll. These frameworks are meant to be extremely general and applied to more specific uses of MapReduce.
The matrix framework is much more complex than the bottlenecked framework, but it boosts performance by grouping the map and accumulate stages so that everything can run in parallel. It does so by slicing the given data into mapTaskCount slices and assigning a reduce task number to each entry using the HashUtils toIndex() method. This, in effect, creates a matrix of dictionaries, hence the name of the framework. In the combineAndFinishAll stage, the matrix comes in handy by allowing us to go directly down the columns of the matrix (as each key is essentially grouped into a bucket), combining and reducing elements all in one. This concept was explained in more depth during class.
mapAndAccumulateAll
method: Map<K, A>[][] mapAndAccumulateAll(E[] input)
(parallel implementation required)
In this stage, you will map and accumulate a given array of data into a matrix of dictionaries. This method should run in parallel while performing the map and accumulate portions of the bottlenecked framework (which we recommend you complete prior to embarking on this mission). As mentioned previously, the input should be sliced into mapTaskCount IndexedRanges and then mapped/accumulated into its appropriate dictionary in the matrix. Although you could slice up the data into chunks yourself, we require using the same algorithm as the IndexedRange and Slices classes introduced earlier in the course. This will allow us to provide better feedback so that you can pinpoint bugs sooner. What is the best way to perform an identical algorithm to your Slices studio? Use your Slices studio, of course.
For each slice, the mapper should map the input into its appropriate cell in the matrix and accumulate it into that specific dictionary. Essentially, you will need to nest the actions of the accumulate method inside the mapper. In order to find where the input should go in the matrix, remember that each slice keeps track of its index id and HashUtils has a toIndex method. Which is applicable to the row and which is applicable to the column?
Hint: The number of rows should match the number of slices.
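The shape of this stage can be sketched in plain Java. Several pieces here are assumptions made only for illustration: `toIndex` is a guess at what HashUtils.toIndex does (a non-negative bucket number from the key's hash), the slicing is a naive even split rather than the required Slices/IndexedRange algorithm, `Mapper` is a hypothetical interface, and `reduceTaskCount` is an assumed parameter name for the number of columns.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.stream.Collector;
import java.util.stream.IntStream;

final class MapAndAccumulateAllSketch {
    /** Hypothetical stand-in for the lab's Mapper interface. */
    interface Mapper<E, K, V> {
        void map(E element, BiConsumer<K, V> keyValuePairConsumer);
    }

    // Assumed behavior of HashUtils.toIndex: a bucket in [0, bucketCount).
    static int toIndex(Object key, int bucketCount) {
        return Math.floorMod(key.hashCode(), bucketCount);
    }

    static <E, K, V, A> Map<K, A>[][] mapAndAccumulateAll(E[] input, Mapper<E, K, V> mapper,
            Collector<V, A, ?> collector, int mapTaskCount, int reduceTaskCount) {
        @SuppressWarnings("unchecked")
        Map<K, A>[][] matrix = new Map[mapTaskCount][reduceTaskCount];
        IntStream.range(0, mapTaskCount).parallel().forEach(row -> {
            for (int col = 0; col < reduceTaskCount; col++) {
                matrix[row][col] = new HashMap<>(); // this row is owned by one task only
            }
            // Naive even split; the lab requires the course's Slices algorithm instead.
            int min = (int) ((long) row * input.length / mapTaskCount);
            int maxExclusive = (int) ((long) (row + 1) * input.length / mapTaskCount);
            for (int i = min; i < maxExclusive; i++) {
                mapper.map(input[i], (k, v) -> {
                    // Accumulate immediately instead of storing the pair in a list.
                    Map<K, A> cell = matrix[row][toIndex(k, reduceTaskCount)];
                    A container = cell.computeIfAbsent(k, key -> collector.supplier().get());
                    collector.accumulator().accept(container, v);
                });
            }
        });
        return matrix;
    }
}
```

Since every task writes only to the dictionaries in its own row, plain `HashMap`s suffice here and no locking is needed. A given key always lands in the same column regardless of which row produced it.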
combineAndFinishAll
method: Map<K, R> combineAndFinishAll(Map<K, A>[][] input)
(parallel implementation required)
In this stage, you will take the matrix you just completed and combine all of the separate rows down to one array. Afterward, you will convert this combined array of maps into one final map. This method should run in parallel.
As mentioned previously, you should go directly down the matrix to access the same bucket across the different slices you created in the mapAndAccumulateAll step. For all of the maps in a column, you should go through each entry and combine it down into one row. You will need to make use of the Collector’s finisher again, but you will also need to make use of the combiner. You can obtain the Collector’s combiner by calling the combiner() method. Although the combine step differs from the bottlenecked framework, the finish step should mirror what you did previously.
Hint: You can use the provided MultiWrapMap class to return the final row as a valid output. You should also combine before you finish.
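The column-wise combine and finish might look roughly like the sketch below, assuming a `java.util.stream.Collector`. It writes every column's results into a single `ConcurrentHashMap` rather than using the provided MultiWrapMap, whose API is not shown on this page, so treat that part as a swapped-in simplification.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collector;
import java.util.stream.IntStream;

final class CombineAndFinishAllSketch {
    static <K, A, R> Map<K, R> combineAndFinishAll(Map<K, A>[][] matrix,
                                                   Collector<?, A, R> collector) {
        int columnCount = matrix.length == 0 ? 0 : matrix[0].length;
        Map<K, R> result = new ConcurrentHashMap<>();
        // One task per column: all containers for a given key live in the same column.
        IntStream.range(0, columnCount).parallel().forEach(col -> {
            Map<K, A> combined = new HashMap<>();
            for (Map<K, A>[] row : matrix) {
                for (Map.Entry<K, A> entry : row[col].entrySet()) {
                    // Combine first: merge this row's container into the running one.
                    // The combiner may mutate containers taken from the matrix.
                    combined.merge(entry.getKey(), entry.getValue(), collector.combiner());
                }
            }
            // Finish second: reduce each fully combined container to its result.
            combined.forEach((key, container) ->
                    result.put(key, collector.finisher().apply(container)));
        });
        return result;
    }
}
```

Combining before finishing matters because the finisher's result type R generally cannot be merged, whereas containers of type A can.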
Testing Your Solution
Correctness
There is a top-level test suite comprised of sub test suites which can be invoked separately when you want to focus on one part of the assignment.
class: FrameworksLabTestSuite.java
package: mapreduce.framework.lab
source folder: testing/src/test/java
Bottlenecked
class: BottleneckedFrameworkTestSuite.java
package: mapreduce.framework.lab.bottlenecked
source folder: testing/src/test/java
MapAll
class: BottleneckedFrameworkTestSuite.java
package: mapreduce.framework.lab.bottlenecked
source folder: testing/src/test/java
AccumulateAll
class: BottleneckedAccumulateAllTestSuite.java
package: mapreduce.framework.lab.bottlenecked
source folder: testing/src/test/java
FinishAll
class: BottleneckedFinishAllTestSuite.java
package: mapreduce.framework.lab.bottlenecked
source folder: testing/src/test/java
Holistic
class: BottleneckedHolisticTestSuite.java
package: mapreduce.framework.lab.bottlenecked
source folder: testing/src/test/java
Matrix
class: MatrixFrameworkTestSuite.java
package: mapreduce.framework.lab.matrix
source folder: testing/src/test/java
MapAccumulateAll
class: MatrixMapAccumulateAllTestSuite.java
package: mapreduce.framework.lab.matrix
source folder: testing/src/test/java
CombineFinishAll
class: MatrixCombineFinishAllTestSuite.java
package: mapreduce.framework.lab.matrix
source folder: testing/src/test/java
Holistic
class: MatrixHolisticTestSuite.java
package: mapreduce.framework.lab.matrix
source folder: testing/src/test/java
Rubric
As always, please make sure to cite your work appropriately.
Total points: 100
Bottlenecked framework subtotal: 40
- Correct mapAll (10)
- Correct accumulateAll (20)
- Correct finishAll (10)
Matrix framework subtotal: 60
- Correct mapAndAccumulateAll (30)
- Correct combineAndFinishAll (30)
-->