Bottleneck MapReduce Framework Assignment

From CSE231 Wiki
Jump to navigation Jump to search

credit for this assignment: Finn Voichick and Dennis Cosgrove

Motivation

Dealing with big data is Hansel-level hot right now. We will build two implementations to better understand the inner workings of Hadoop and Spark-like frameworks. Your frameworks will actually be more general than just MapReduce.

This framework will help prepare you for building the Matrix Framework next.

Background

wikipedia article on MapReduce

As the name suggests, MapReduce refers to the process of mapping then reducing some stream of data. At its core, all a MapReduce algorithm entails is transforming a list of one kind of item before collecting those items and reducing them down to a single values per key using some computation. As you can probably tell, the concept of MapReduce is extremely general and can apply to a wide berth of problems. In this assignment, we will build a MapReduce framework which can be used to to simulate the Facebook mutual friends algorithm (finding the mutual friends between all friend pairs) as well as pinpointing the offending well in the 1854 Soho Cholera Outbreak.

For more information on the general concept of MapReduce, refer to this article.

Java Advice

Use map.entrySet()

Prefer the use of map.entrySet() over map.keySet() followed by looking up the value with map.get(key).

Use the appropriate version of join_fork_loop

there are many overloaded versions of join_fork_loop including:

choose the correct one for each situation where "correct" is often the one that produces the cleanest code.

A Path To Victory

Warm Ups

Card Only Sequential Map Reduce Framework

Word Count Only Parallel MapReduce Framework

Code To Investigate

WordCountOnlyParallelMapReduceFramework

You will build each of the three methods mapReduceAll(input) invokes. Much can be gleaned from the parameters and return values of these methods.

mapReduceAll(TextSection[] input)  
@Override
public Map<K, R> mapReduceAll(E[] input) throws InterruptedException, ExecutionException {
	List<List<Map.Entry<K, V>>> mapAllResult = mapAll(input);
	Map<K, A> accumulateAllResult = accumulateAll(mapAllResult);
	return reduceAll(accumulateAllResult);
}

mapAll

For Account[] input:

input[0] Account.parse("A", "B, C, D")
input[1] Account.parse("B", "A, C, D, E")
input[2] Account.parse("C", "A, B, D, E")
input[3] Account.parse("D", "A, B, C, E")
input[4] Account.parse("E", "B, C, D")

MutualFriends MapAllResult.svg

accumulateAll

MutualFriends AccumulateAllResult.svg

reduceAll

MutualFriends ReduceAllResult.svg

Client

class: WordCountOnlyParallelMapReduceFrameworkClient.java CLIENT
package: mapreduce.framework.parallel.client
source folder: student/src/main/java
WordCountOnlyParallelMapReduceFrameworkClient  
// credit: https://web.archive.org/web/20201109035737/http://stevekrenzel.com/finding-friends-with-mapreduce
Account[] input = {
		Account.parse("A", "B,C,D"),
		Account.parse("B", "A,C,D,E"),
		Account.parse("C", "A,B,D,E"),
		Account.parse("D", "A,B,C,E"),
		Account.parse("E", "B,C,D")
};
MapReduceFramework<Account, // E
		OrderedPair<AccountId>, // K
		Set<AccountId>, // V
		List<Set<AccountId>>, // A
		Set<AccountId> // R
> framework = new BottleneckedMapReduceFramework<>(
		new MutualFriendsMapper(),
		new MutualFriendsClassicReducer());
Map<OrderedPair<AccountId>, Set<AccountId>> map = framework.mapReduceAll(input);
for (Map.Entry<OrderedPair<AccountId>, Set<AccountId>> entry : map.entrySet()) {
	System.out.println(entry);
}
CardOnlySequentialMapReduceFrameworkClient Output  
(A, B)=[C, D]
(A, C)=[B, D]
(D, E)=[B, C]
(A, D)=[B, C]
(B, C)=[A, D, E]
(B, D)=[A, C, E]
(B, E)=[C, D]
(C, D)=[A, B, E]
(C, E)=[B, D]

Code To Implement

Bottlenecked MapReduce Framework

class: BottleneckedMapReduceFramework.java Java.png
methods: mapAll
accumulateAll
toArrayOfReductions
toReductionMap
package: mapreduce.framework.lab.bottlenecked
source folder: student/src/main/java

Navigate to the BottleneckedMapReduceFramework.java class and there will be three methods for you to complete: mapAll, accumulateAll, and finishAll. These frameworks are meant to be extremely general and applied to more specific uses of MapReduce. Whereas the warm ups for this lab serve to prepare you to build this required section of the lab, this bottlenecked framework is in many ways a warm up for the matrix implementation.

mapAll

NOTE: If you struggle to get through this method, you are strongly encouraged to try the warm-ups.

method: List<List<KeyValuePair<K, V>>> mapAll(E[] input) Parallel.svg (parallel implementation required)

With this method, you will map all of the elements of an array of data into a new array of equivalent size consisting of Lists of key value pairs. We will leverage the Mapper which is a field/instance variable on this BottleneckedFramework instance. When invoking the mapper's map method with an element of the input array and a BiConsumer which will accept each key and value passed to it, adding a KeyValuePair to its List. This list should then be added to the array of lists you previously defined, therefore completing the mapping stage of MapReduce. This should all be done in parallel.

Hint: you should create a List of Lists equivalent in size to the original array. Each inner list will contain all of the emitted (key,value) pairs for its item.

Bottlenecked map all.png slide

Circle-information.svg Tip:Think about which verion of the join_fork_loop would be the best to use here.

Example: Mutual Friends

Input

Mutual Friends Input.svg

Result of mapAll(input)

Mutual Friends Bottlenecked Map All Result.svg

accumulateAll

method: Map<K, A> accumulateAll(List<KeyValuePair<K, V>>[] mapAllResults) Sequential.svg (sequential implementation only)

This middle step is often excluded in more advanced MapReduce applications. This step introduces a long sequential strand. In the matrix framework implementation, we will do away with this step altogether for the sake of performance.

You will make use of the Reducer provided to you.

In this method, you will take in the list of lists you previously created in mapAll. You probably noticed that the method must return a map of <K, A>, which differs from the <K, V> lists fed into the method. The framework is designed this way as the data originally fed into the mapping stage can be collected into a mutable container before reaching the reduction stage. You will need to create a mutable container of type A for each unique key encountered and associate the key with that mutable container in a Map. Whether it is the first time a key has been encountered or not, the value must be accumulated into the mutable container.

Circle-information.svg Tip: Look into the computeIfAbsent(key, remappingFunction) method on Map

Bottlenecked accumulate all.png slide

reduceAll

    private Map<K, R> reduceAll(Map<K, A> accumulateAllResult) throws InterruptedException, ExecutionException {
        Entry<K, A>[] entries = toArrayOfEntries(accumulateAllResult);
        List<R> reductions = toReductions(entries);
        return toReductionMap(entries, reductions);
    }

toArrayOfEntries

    private static <K, A> Entry<K, A>[] toArrayOfEntries(Map<K, A> accumulateAllResult) {
        return accumulateAllResult.entrySet().toArray(Entry[]::new);
    }

toReductions and toReductionMap

method: List<R> toReductions(Entry<K, A>[] entries) Parallel.svg (parallel implementation required)

method: Map<K, R> toReductionMap(Entry<K, A>[] entries, R[] reductions) Sequential.svg (sequential implementation only)

This final step reduces the accumulated data and returns the final map in its reduced form. Again, you may notice that the method returns a map of <K, R> instead of the <K, A> which was returned in the accumulateAll method. This happens for the exact same reason as the accumulateAll method, as the framework is designed to handle cases in which the reduced data differs in type from the accumulated data.

To reduce the data down, use the map returned from the accumulateAll stage and put the results of the reduction into a new map. The provided Reducer will come in extremely handy for this stage, more specifically the finisher which can be called using the reduce() method. This step should run in parallel and will probably be the easiest of the three methods.

Bottlenecked finish all.png slide

Testing Your Solution

class: __BottleneckedFrameworkTestSuite.java Junit.png
package: mapreduce.framework.bottlenecked.exercise
source folder: testing/src/test/java

Pledge, Acknowledgments, Citations

file: mapreduce-bottleneck-framework-pledge-acknowledgments-citations.txt

More info about the Honor Pledge