Difference between revisions of "MatrixMultiply"

From CSE231 Wiki
m (Added note about auto-coarsening)
 
(72 intermediate revisions by 2 users not shown)
Line 10: Line 10:
 
* [http://mathworld.wolfram.com/MatrixMultiplication.html Wolfram MathWorld]
 
 
* [https://en.wikipedia.org/wiki/Matrix_multiplication Wikipedia]
 
==Math Definition==
+
 
 
If <math>\mathbf{A}</math> is an <math>n \times m</math> matrix and <math>\mathbf{B}</math> is an <math>m  \times p</math> matrix
 
  
Line 16: Line 16:
  
 
: <math>(\mathbf{A}\mathbf{B})_{ij} = \sum_{k=1}^m A_{ik}B_{kj}</math>
 
 +
 +
:<math>\mathbf{C}=\begin{pmatrix}
a_{11}b_{11} +\cdots + a_{1m}b_{m1} & a_{11}b_{12} +\cdots + a_{1m}b_{m2} & \cdots & a_{11}b_{1p} +\cdots + a_{1m}b_{mp} \\
a_{21}b_{11} +\cdots + a_{2m}b_{m1} & a_{21}b_{12} +\cdots + a_{2m}b_{m2} & \cdots & a_{21}b_{1p} +\cdots + a_{2m}b_{mp} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1}b_{11} +\cdots + a_{nm}b_{m1} & a_{n1}b_{12} +\cdots + a_{nm}b_{m2} & \cdots & a_{n1}b_{1p} +\cdots + a_{nm}b_{mp}
\end{pmatrix}
</math>
 +
 
source: [https://en.wikipedia.org/wiki/Matrix_multiplication#General_definition_of_the_matrix_product Matrix Multiplication on Wikipedia]
 
  
 +
=Code To Investigate=
 +
==Demo Video==
 +
<youtube>iEuYiy1Bx2A</youtube>
 
==SequentialMatrixMultiplier==
 
We provide the sequential iterative implementation so you can focus just on becoming familiar with using Habanero's forall loops.
+
{{CodeToInvestigate|SequentialMatrixMultiplier|multiply|matrixmultiply.demo|demo}}
 
 
The point of the required portion of this studio is not to struggle with matrix multiplication, but rather to get some experience with the parallel for loop constructs in Habanero.
 
 
 
Feel free to use the provided sequential implementation in SequentialMatrixMultiplier as a reference:
 
<nowiki> @Override
 public double[][] multiply(double[][] a, double[][] b) {
     double[][] result = MatrixUtils.createMultiplyResultBufferInitializedToZeros(a, b);
     int n = a.length;
     int m = a[0].length;
     int p = b[0].length;
     for (int i = 0; i < n; i++) {
         for (int j = 0; j < p; j++) {
             // NOTE: result is already initialized to 0.0
             // result[i][j] = 0.0;
             for (int k = 0; k < m; k++) {
                 result[i][j] += a[i][k] * b[k][j];
             }
         }
     }
     return result;
 }</nowiki>
 
 
 
=Code To Use=
 
==Javadocs==
 
[http://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/apidocs/edu/wustl/cse231s/rice/habanero/Habanero.html Habanero]
 
: [http://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/apidocs/edu/wustl/cse231s/rice/habanero/Habanero.html#forall-int-int-edu.rice.hj.api.HjSuspendingProcedureInt1D- forall(start, endExclusive, body)]
 
: [http://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/apidocs/edu/wustl/cse231s/rice/habanero/Habanero.html#forall-edu.wustl.cse231s.rice.habanero.options.ChunkedOption-int-int-edu.rice.hj.api.HjSuspendingProcedureInt1D- forall(chunked(), start, endExclusive, body)]
 
: [http://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/apidocs/edu/wustl/cse231s/rice/habanero/Habanero.html#forall2d-int-int-int-int-edu.rice.hj.api.HjSuspendingProcedureInt2D- forall2d(aMin, aMaxExclusive, bMin, bMaxExclusive, body)]
 
: [http://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/apidocs/edu/wustl/cse231s/rice/habanero/Habanero.html#forall2d-edu.wustl.cse231s.rice.habanero.options.ChunkedOption-int-int-int-int-edu.rice.hj.api.HjSuspendingProcedureInt2D- forall2d(chunked(), aMin, aMaxExclusive, bMin, bMaxExclusive, body)]
 
  
: [http://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/apidocs/edu/wustl/cse231s/rice/habanero/Habanero.html#chunked-- chunked()]
+
==SequentialMatrixMultiplierClient==
: [http://www.cse.wustl.edu/~cosgroved/courses/cse231/f17/apidocs/edu/wustl/cse231s/rice/habanero/Habanero.html#chunked-int- chunked(size)]
+
{{CodeToInvestigate|SequentialMatrixMultiplierClient|main|matrixmultiply.client|demo}}
  
==Habanero Documentation==
+
==MatrixMultiplyApp==
WARNING: Habanero loops from Rice are '''inclusive''' on the upper bound, so they are specified from 0 to n-1.  CSE 231's wrapper is '''exclusive on max''', so the same loop is specified from 0 to n.
+
{{Viz|MatrixMultiplyApp|matrixmultiply.viz|demo}}
  
[http://pasiphae.cs.rice.edu/#hjlib-for Habanero for loops]
+
[[File:Martix multiply app 3x5 X 5x4.png|800px]]
  
=Common Mistakes To Avoid=
+
=The Core Questions=
==maxExclusive==
+
*What are the tasks?
CSE 231 is exclusive on max.  When we wrapped Habanero, we changed every loop so that forall(0, n-1, body) becomes forall(0, n, body).
+
*What is the data?
 +
*Is the data mutable?
 +
*If so, how is it shared?
  
 
=Code To Implement=
 
  
There are three methods you will need to implement, all of which are different ways to use parallel for loops to solve the problem. To assist you, the sequential implementation has already been completed for you. We recommend starting from the top and working your way down. There is also an optional recursive implementation and a manual grouping implementation which has been done for you (this is just to demonstrate how chunking works behind the scenes).
+
There are three methods you will need to implement, all of which are different ways to use parallel for loops to solve the problem. To assist you, the [https://classes.engineering.wustl.edu/cse231/core/index.php?title=MatrixMultiply#SequentialMatrixMultiplier sequential implementation] is shown in a [[#Demo_Video|demo video]].
  
==ForallForallMatrixMultiplier==
+
==LoopLoopMatrixMultiplier==
{{CodeToImplement|ForallForallMatrixMultiplier|multiply|matrixmultiply.studio}}
+
{{CodeToImplement|LoopLoopMatrixMultiplier|multiply|matrixmultiply.exercise}}
  
 
{{Parallel|public double[][] multiply(double[][] a, double[][] b)}}
 
  
In this implementation, you will simply convert the sequential solution into a parallel one using two forall loops.
+
In this implementation, you will simply convert the sequential solution into a parallel one using two nested [https://www.cse.wustl.edu/~dennis.cosgrove/courses/cse231/spring22/apidocs/fj/FJ.html#join_void_fork_loop(int,int,fj.api.TaskIntConsumer) parallel fork loops].
  
==Forall2dMatrixMultiplier==
+
=== Computation Graph ===
{{CodeToImplement|Forall2dMatrixMultiplier|multiply|matrixmultiply.studio}}
 
  
{{Parallel|public double[][] multiply(double[][] a, double[][] b)}}
+
For 3x3 Matrix X 3x3 Matrix:
  
In this implementation, we will simplify the syntax of the two-forall implementation by using HJ’s <code>forall2d</code> method. Functionally, this method is equivalent to two nested forall loops, and its parameters are the bounds of those two loops. For example,
+
[[File:LoopLoopMatrixMultiplier_Computation_Graph.svg|800px]]
<nowiki>forall(aMin, aMax, (i) -> {
    forall(bMin, bMax, (j) -> {
        doSomething(i, j);
    });
});</nowiki>
 
Would appear as
 
<nowiki>forall2d(aMin, aMax, bMin, bMax, (i, j) -> {
    doSomething(i, j);
});</nowiki>
 
in the forall2d syntax.
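
To make the equivalence above concrete, here is a minimal sketch in plain Java. It does not use the Habanero or CSE 231 APIs: <code>nestedLoops</code>, <code>loop2d</code>, and <code>IntPairBody</code> are hypothetical stand-ins built on <code>java.util.stream</code>, and the flattening of the 2D index space into one range is only one possible way a 2D loop could be realized. The sketch records every (i, j) pair each form visits so the two can be compared.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

// Hypothetical stand-ins for illustration only; NOT the course's forall/forall2d.
final class Loop2dSketch {
    /** Hypothetical body type that receives both loop indices. */
    interface IntPairBody {
        void accept(int i, int j);
    }

    /** Two nested parallel loops, both exclusive on the upper bound. */
    static void nestedLoops(int aMin, int aMaxExclusive, int bMin, int bMaxExclusive, IntPairBody body) {
        IntStream.range(aMin, aMaxExclusive).parallel().forEach(i ->
                IntStream.range(bMin, bMaxExclusive).parallel().forEach(j -> body.accept(i, j)));
    }

    /** One parallel loop over the flattened (i, j) index space. */
    static void loop2d(int aMin, int aMaxExclusive, int bMin, int bMaxExclusive, IntPairBody body) {
        int rows = aMaxExclusive - aMin;
        int cols = bMaxExclusive - bMin;
        if (rows <= 0 || cols <= 0) {
            return; // empty range: nothing to visit
        }
        IntStream.range(0, rows * cols).parallel().forEach(flat ->
                body.accept(aMin + flat / cols, bMin + flat % cols));
    }

    /** Records which (i, j) pairs are visited for the ranges [0, 2) and [0, 3). */
    static Set<String> visitedPairs(boolean use2d) {
        Set<String> pairs = ConcurrentHashMap.newKeySet(); // thread-safe set
        IntPairBody record = (i, j) -> pairs.add(i + "," + j);
        if (use2d) {
            loop2d(0, 2, 0, 3, record);
        } else {
            nestedLoops(0, 2, 0, 3, record);
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(visitedPairs(false).equals(visitedPairs(true)));
    }
}
```

Both forms visit the same set of index pairs; only the way the iterations are expressed differs.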
 
  
==Forall2dChunkedMatrixMultiplier==
+
==Loop2dMatrixMultiplier==
{{CodeToImplement|Forall2dChunkedMatrixMultiplier|multiply|matrixmultiply.studio}}
+
{{CodeToImplement|Loop2dMatrixMultiplier|multiply|matrixmultiply.exercise}}
  
 
{{Parallel|public double[][] multiply(double[][] a, double[][] b)}}
 
  
In this implementation, we will add a minor performance boost by using the forall-chunked construct. It is similar to the traditional forall loop, but it improves performance by grouping (chunking) iterations. This topic is discussed in detail in this [https://edge.edx.org/courses/RiceX/COMP322/1T2014R/courseware/a900dd0655384de3b5ef01e508ea09d7/6349730bb2a149a0b33fa23db7afddee/?activate_block_id=i4x%3A%2F%2FRiceX%2FCOMP322%2Fsequential%2F6349730bb2a149a0b33fa23db7afddee Rice video] and explained in the [http://pasiphae.cs.rice.edu/#hjlib-for-parallelchunkediterationwithimplicitfinish HJ documentation]. There is no need to specify a chunk size; let the runtime determine the chunking.
+
<!--
 +
In this implementation, we will cut down the syntax of the two forall implementation with the use of V5’s <code>forall2d</code> method. Functionally, this method serves the purpose of using two forall loops. [[Reference_Page#Forall_2d|Take a look at the reference page]] if you have questions on how to utilize this loop.
 +
-->
  
NOTE: We also contemplated assigning a 1D forall chunked version. We deemed this more work than it was worth, given that you are already building the 2D version. Just know that forall(chunked(), ...) exists for 1D loops as well.
+
[https://www.cse.wustl.edu/~dennis.cosgrove/courses/cse231/spring22/apidocs/fj/FJ.html#join_void_fork_loop_2d(int,int,int,int,fj.api.TaskBiIntConsumer) join_void_fork_loop_2d]
  
Use chunking.  It is a nice feature.
+
=== Computation Graph ===
  
=Optional Fun Divide and Conquer Solutions=
+
For 3x3 Matrix X 3x3 Matrix:
In this implementation, you will solve the matrix multiply problem sequentially and in parallel using recursion. Although this class should be able to take in a matrix of any size, try to imagine this as a 2x2 matrix in order to make it easier to solve. Once you solve the sequential method, the parallel method should look very similar, with the exception of an async/finish block.
 
  
In order to obtain the desired result matrix, you will need to recursively multiply the correct pairs of submatrices for each of the four result submatrices. Imagining this as a 2x2 matrix, remember that the dot products of the rows of the first matrix and the columns of the second matrix create the result matrix.
+
[[File:Loop2dMatrixMultiplier_Computation_Graph.svg|800px]]
  
Hint: Each result submatrix should have two recursive calls, for a total of eight recursive calls.
+
==Loop2dAutoCoarsenMatrixMultiplier==
 +
{{CodeToImplement|Loop2dAutoCoarsenMatrixMultiplier|multiply|matrixmultiply.exercise}}
  
==SequentialDivideAndConquerMatrixMultiplier==
+
{{Parallel|public double[][] multiply(double[][] a, double[][] b)}}
In <code>class SubMatrix</code>, method <code>sequentialDivideAndConquerMultiplyKernel</code> you will find your base case and the sub matrices prepared for you. 
 
  
<nowiki>
+
[https://www.cse.wustl.edu/~dennis.cosgrove/courses/cse231/spring22/apidocs/fj/FJ.html#join_void_fork_loop_2d_auto_coarsen(int,int,int,int,fj.api.TaskBiIntConsumer) join_void_fork_loop_2d_auto_coarsen]
private static void sequentialDivideAndConquerMultiplyKernel(SubMatrix a, SubMatrix b, SubMatrix result) {
 
if( result.size == 1 ) {
 
result.values[result.row][result.col] += a.values[a.row][a.col] * b.values[b.row][b.col];
 
} else {
 
SubMatrix a11 = a.newSub11();
 
SubMatrix a12 = a.newSub12();
 
SubMatrix a21 = a.newSub21();
 
SubMatrix a22 = a.newSub22();
 
 
SubMatrix b11 = b.newSub11();
 
SubMatrix b12 = b.newSub12();
 
SubMatrix b21 = b.newSub21();
 
SubMatrix b22 = b.newSub22();
 
 
SubMatrix result11 = result.newSub11();
 
SubMatrix result12 = result.newSub12();
 
SubMatrix result21 = result.newSub21();
 
SubMatrix result22 = result.newSub22();
 
  
throw new NotYetImplementedException();
+
This implementation will look very similar to the previous one, so don't overthink it! The real benefit shows up in the performance difference between the two, which comes from the coarsening done behind the scenes.
}
+
<!--
}</nowiki>
+
In this implementation, we will add a minor performance boost to the process by using the forall-chunked construct. Although similar to the traditional forall loop, it increases performance using iteration grouping/chunking. This topic is discussed in detail in this [https://edge.edx.org/courses/RiceX/COMP322/1T2014R/courseware/a900dd0655384de3b5ef01e508ea09d7/6349730bb2a149a0b33fa23db7afddee/?activate_block_id=i4x%3A%2F%2FRiceX%2FCOMP322%2Fsequential%2F6349730bb2a149a0b33fa23db7afddee Rice video] and explained in the [[Reference_Page#Parallel_Loops|V5 documentation]]. There is no need to specify anything, allow the runtime to determine the chunking.
  
You simply need to make the appropriate recursive calls to compute the result on the right:
+
NOTE: We also contemplated assigning a 1D forall chunked version.  We deemed this more work than it was worth, given that you are already building the 2D version.  Just know that forall(chunked(), ...) exists for 1D loops as well.
  
:<math>\begin{pmatrix}
+
Use chunking.  It is a nice feature.
\mathbf{A}_{11} & \mathbf{A}_{12} \\
+
-->
\mathbf{A}_{21} & \mathbf{A}_{22} \\
 
\end{pmatrix} \begin{pmatrix}
 
\mathbf{B}_{11} & \mathbf{B}_{12} \\
 
\mathbf{B}_{21} & \mathbf{B}_{22} \\
 
\end{pmatrix} = \begin{pmatrix}
 
\mathbf{A}_{11} \mathbf{B}_{11} + \mathbf{A}_{12} \mathbf{B}_{21} & \mathbf{A}_{11} \mathbf{B}_{12} + \mathbf{A}_{12} \mathbf{B}_{22}\\
 
\mathbf{A}_{21} \mathbf{B}_{11} + \mathbf{A}_{22} \mathbf{B}_{21} & \mathbf{A}_{21} \mathbf{B}_{12} + \mathbf{A}_{22} \mathbf{B}_{22}\\
 
\end{pmatrix}
 
</math>
 
source: [https://en.wikipedia.org/w/index.php?title=Matrix_multiplication#Parallel_matrix_multiplication Wikipedia Parallel Matrix Multiplication]
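
The block formula above can be sketched in plain Java as follows. This is an illustrative sketch under stated assumptions, not the course's <code>SubMatrix</code> kernel: it assumes square matrices whose size is a power of two, and it tracks each submatrix with hypothetical row/column offsets instead of <code>SubMatrix</code> objects. Each of the four result quadrants receives two recursive products, for eight recursive calls in total.

```java
import java.util.Arrays;

// Illustrative sketch only; offsets stand in for the course's SubMatrix class.
final class DivideAndConquerSketch {
    /** Accumulates the size-by-size product of the a and b blocks into the c block. */
    static void kernel(double[][] a, int aRow, int aCol,
                       double[][] b, int bRow, int bCol,
                       double[][] c, int cRow, int cCol, int size) {
        if (size == 1) {
            c[cRow][cCol] += a[aRow][aCol] * b[bRow][bCol];
            return;
        }
        int h = size / 2; // half the block size
        // C11 += A11*B11 + A12*B21
        kernel(a, aRow, aCol, b, bRow, bCol, c, cRow, cCol, h);
        kernel(a, aRow, aCol + h, b, bRow + h, bCol, c, cRow, cCol, h);
        // C12 += A11*B12 + A12*B22
        kernel(a, aRow, aCol, b, bRow, bCol + h, c, cRow, cCol + h, h);
        kernel(a, aRow, aCol + h, b, bRow + h, bCol + h, c, cRow, cCol + h, h);
        // C21 += A21*B11 + A22*B21
        kernel(a, aRow + h, aCol, b, bRow, bCol, c, cRow + h, cCol, h);
        kernel(a, aRow + h, aCol + h, b, bRow + h, bCol, c, cRow + h, cCol, h);
        // C22 += A21*B12 + A22*B22
        kernel(a, aRow + h, aCol, b, bRow, bCol + h, c, cRow + h, cCol + h, h);
        kernel(a, aRow + h, aCol + h, b, bRow + h, bCol + h, c, cRow + h, cCol + h, h);
    }

    /** Assumes a and b are square with the same power-of-two size. */
    static double[][] multiply(double[][] a, double[][] b) {
        int size = a.length;
        double[][] result = new double[size][size]; // Java zero-initializes arrays
        kernel(a, 0, 0, b, 0, 0, result, 0, 0, size);
        return result;
    }

    public static void main(String[] args) {
        double[][] a = { { 1, 2 }, { 3, 4 } };
        double[][] b = { { 5, 6 }, { 7, 8 } };
        System.out.println(Arrays.deepToString(multiply(a, b)));
    }
}
```

Note that the two products feeding the same result quadrant both write to it, which is worth keeping in mind when deciding what can run in parallel.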
 
  
==ParallelDivideAndConquerMatrixMultiplier==
+
=Extra Credit Challenge Divide and Conquer=
Again, given the following:
+
[[Matrix_Multiply_Divide_and_Conquer_Assignment|Divide and Conquer Matrix Multiplication]]
  
:<math>\begin{pmatrix}
+
=Testing Your Solution=
\mathbf{A}_{11} \mathbf{B}_{11} + \mathbf{A}_{12} \mathbf{B}_{21} & \mathbf{A}_{11} \mathbf{B}_{12} + \mathbf{A}_{12} \mathbf{B}_{22}\\
+
==Correctness==
\mathbf{A}_{21} \mathbf{B}_{11} + \mathbf{A}_{22} \mathbf{B}_{21} & \mathbf{A}_{21} \mathbf{B}_{12} + \mathbf{A}_{22} \mathbf{B}_{22}\\
+
{{TestSuite|__MatrixMultiplyTestSuite|matrixmultiply.studio}}
\end{pmatrix}
+
 
</math>
+
==Performance==
source: [https://en.wikipedia.org/w/index.php?title=Matrix_multiplication#Parallel_matrix_multiplication Wikipedia Parallel Matrix Multiplication]
+
{{Performance|MatrixMultiplicationTiming|matrixmultiply.performance}}
  
What computation can be done in parallel? What computation must be performed sequentially?
+
Investigate the performance difference for your different implementations.  When you run MatrixMultiplicationTiming it will put a CSV of the timings into your copy buffer.  You can then paste them into a spreadsheet and chart the performance. Feel free to tune the parameters of the test to see the impacts of, for example, different matrix sizes.
<!--
 
=Provided Example Implementations=
 
==SequentialMatrixMultiplier==
 
We provide the sequential iterative implementation so you can focus just on becoming familiar with using Habanero's forall loops.
 
  
==ForallGroupedMatrixMultiplier==
+
[[File:Matrix multiply performance.png]]
For 231 we are encouraging you to prefer the use of chunked().  However, if you want greater control over how your loops are broken up, please look to this as an example of how to use forall grouped.
 
  
==Forall2dGroupedMatrixMultiplier==
+
=Pledge, Acknowledgments, Citations=
Same as above but with forall2d.
+
{{Pledge|matrix-multiply}}
-->
 
=Testing Your Solution=
 
==Correctness==
 
{{TestSuite|MatrixMultiplyTestSuite|matrixmultiply.studio}}
 
==Optional Fun Divide And Conquer Matrix Multiply Correctness==
 
{{TestSuite|MatrixMultiplyTestSuite|matrixmultiply.fun}}
 
==Performance==
 
{{Performance|MatrixMultiplicationTiming|matrixmultiply.studio}}
 

Latest revision as of 00:21, 14 February 2023

Motivation

We gain experience using the parallel for loop constructs.

Background

Matrix multiplication is a simple mathematical operation that we will implement in this studio. For our purposes, we will only deal with square matrices (same number of rows and columns), but we will approach the problem with several different parallel constructs.

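For those unfamiliar with how to multiply two matrices, take a look at these overviews:

* [http://mathworld.wolfram.com/MatrixMultiplication.html Wolfram MathWorld]
* [https://en.wikipedia.org/wiki/Matrix_multiplication Wikipedia]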

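If <math>\mathbf{A}</math> is an <math>n \times m</math> matrix and <math>\mathbf{B}</math> is an <math>m \times p</math> matrix, then

: <math>(\mathbf{A}\mathbf{B})_{ij} = \sum_{k=1}^m A_{ik}B_{kj}</math>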

for each i=[0..n) and for each j=[0..p)

source: Matrix Multiplication on Wikipedia
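
As a quick sanity check of the definition, here is a small worked example. The specific 2x2 matrices are chosen only for illustration and do not come from the course materials. Each entry of the product is the dot product of a row of the first matrix with a column of the second:

: <math>\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} = \begin{pmatrix} 1 \cdot 5 + 2 \cdot 7 & 1 \cdot 6 + 2 \cdot 8 \\ 3 \cdot 5 + 4 \cdot 7 & 3 \cdot 6 + 4 \cdot 8 \end{pmatrix} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}</math>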

Code To Investigate

Demo Video

SequentialMatrixMultiplier

class: SequentialMatrixMultiplier.java (DEMO)
methods: multiply
package: matrixmultiply.demo
source folder: src/demo/java

SequentialMatrixMultiplierClient

class: SequentialMatrixMultiplierClient.java (DEMO)
methods: main
package: matrixmultiply.client
source folder: src/demo/java

MatrixMultiplyApp

class: MatrixMultiplyApp.java VIZ
package: matrixmultiply.viz
source folder: student/src/demo/java

[Figure: Matrix multiply app, 3x5 X 5x4]

The Core Questions

  • What are the tasks?
  • What is the data?
  • Is the data mutable?
  • If so, how is it shared?

Code To Implement

There are three methods you will need to implement, all of which are different ways to use parallel for loops to solve the problem. To assist you, the sequential implementation is shown in a demo video.

LoopLoopMatrixMultiplier

class: LoopLoopMatrixMultiplier.java
methods: multiply
package: matrixmultiply.exercise
source folder: student/src/main/java

method: public double[][] multiply(double[][] a, double[][] b) (parallel implementation required)

In this implementation, you will simply convert the sequential solution into a parallel one using two nested parallel fork loops.
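
To illustrate the shape of such a conversion, here is a minimal sketch that uses the JDK's parallel <code>IntStream</code> ranges as a stand-in. It is not the course's fork-loop API, and the class name <code>NestedParallelLoopSketch</code> is hypothetical. The outer loop runs over rows and the inner loop over columns, while the innermost <code>k</code> loop stays sequential.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Stand-in sketch: java.util.stream parallel ranges, NOT the course's fork-loop API.
final class NestedParallelLoopSketch {
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        int m = a[0].length;
        int p = b[0].length;
        double[][] result = new double[n][p];
        // Outer parallel loop: one logical task per row i.
        IntStream.range(0, n).parallel().forEach(i -> {
            // Inner parallel loop: one logical task per column j.
            IntStream.range(0, p).parallel().forEach(j -> {
                double sum = 0.0;
                for (int k = 0; k < m; k++) {
                    sum += a[i][k] * b[k][j];
                }
                // Each (i, j) pair writes a distinct cell, so no two tasks write the same location.
                result[i][j] = sum;
            });
        });
        return result;
    }

    public static void main(String[] args) {
        double[][] a = { { 1, 2 }, { 3, 4 } };
        double[][] b = { { 5, 6 }, { 7, 8 } };
        System.out.println(Arrays.deepToString(multiply(a, b)));
    }
}
```

The inputs are only read and every task owns its own result cell, which is what makes the two outer loops safe to run in parallel.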

Computation Graph

For 3x3 Matrix X 3x3 Matrix:

[Figure: LoopLoopMatrixMultiplier computation graph]

Loop2dMatrixMultiplier

class: Loop2dMatrixMultiplier.java
methods: multiply
package: matrixmultiply.exercise
source folder: student/src/main/java

method: public double[][] multiply(double[][] a, double[][] b) (parallel implementation required)


join_void_fork_loop_2d

Computation Graph

For 3x3 Matrix X 3x3 Matrix:

[Figure: Loop2dMatrixMultiplier computation graph]

Loop2dAutoCoarsenMatrixMultiplier

class: Loop2dAutoCoarsenMatrixMultiplier.java
methods: multiply
package: matrixmultiply.exercise
source folder: student/src/main/java

method: public double[][] multiply(double[][] a, double[][] b) (parallel implementation required)

join_void_fork_loop_2d_auto_coarsen

This implementation will look very similar to the previous one, so don't overthink it! The real benefit shows up in the performance difference between the two, which comes from the coarsening done behind the scenes.
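
The source does not describe how the library coarsens its loops, so the following is only a rough sketch of the general idea, written by hand in plain Java. It assumes a hypothetical <code>chunkCount</code> parameter and coarsens over rows only: instead of one task per result cell, each parallel task handles a contiguous chunk of rows sequentially, which reduces the number of tasks that must be created and scheduled.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hand-rolled illustration of coarsening; NOT how the course library implements it.
final class CoarsenedLoopSketch {
    /** chunkCount is a hypothetical tuning knob and must be at least 1. */
    static double[][] multiply(double[][] a, double[][] b, int chunkCount) {
        int n = a.length;
        int m = a[0].length;
        int p = b[0].length;
        double[][] result = new double[n][p];
        int chunkSize = (n + chunkCount - 1) / chunkCount; // ceiling of n / chunkCount
        // One parallel task per chunk of rows rather than per cell.
        IntStream.range(0, chunkCount).parallel().forEach(chunk -> {
            int start = chunk * chunkSize;
            int end = Math.min(n, start + chunkSize);
            // Rows inside a chunk are processed sequentially by the same task.
            for (int i = start; i < end; i++) {
                for (int j = 0; j < p; j++) {
                    double sum = 0.0;
                    for (int k = 0; k < m; k++) {
                        sum += a[i][k] * b[k][j];
                    }
                    result[i][j] = sum;
                }
            }
        });
        return result;
    }

    public static void main(String[] args) {
        double[][] a = { { 1, 2 }, { 3, 4 } };
        double[][] b = { { 5, 6 }, { 7, 8 } };
        int chunks = Runtime.getRuntime().availableProcessors();
        System.out.println(Arrays.deepToString(multiply(a, b, chunks)));
    }
}
```

Using one chunk per available processor, as in <code>main</code>, is just one plausible choice; the result is the same for any chunk count of at least 1.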

Extra Credit Challenge Divide and Conquer

Divide and Conquer Matrix Multiplication

Testing Your Solution

Correctness

class: __MatrixMultiplyTestSuite.java
package: matrixmultiply.studio
source folder: testing/src/test/java

Performance

class: MatrixMultiplicationTiming.java
package: matrixmultiply.performance
source folder: src/main/java

Investigate the performance difference for your different implementations. When you run MatrixMultiplicationTiming it will put a CSV of the timings into your copy buffer. You can then paste them into a spreadsheet and chart the performance. Feel free to tune the parameters of the test to see the impacts of, for example, different matrix sizes.
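
For a sense of what such a timing run might involve, here is a minimal sketch that times a sequential multiply over a few matrix sizes and builds CSV text. It is not the provided MatrixMultiplicationTiming class: the class name, the <code>size,bestMillis</code> columns, and the choice to keep the best of several repetitions are all assumptions made for illustration, and the sketch returns the CSV as a string rather than copying it to the clipboard.

```java
import java.util.Random;

// Illustrative timing sketch; NOT the provided MatrixMultiplicationTiming class.
final class TimingCsvSketch {
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        int m = a[0].length;
        int p = b[0].length;
        double[][] result = new double[n][p];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < p; j++) {
                for (int k = 0; k < m; k++) {
                    result[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return result;
    }

    static double[][] randomSquare(int size, Random random) {
        double[][] matrix = new double[size][size];
        for (double[] row : matrix) {
            for (int j = 0; j < size; j++) {
                row[j] = random.nextDouble();
            }
        }
        return matrix;
    }

    /** Returns one CSV row per size; columns are hypothetical: size,bestMillis. */
    static String timingsAsCsv(int[] sizes, int repetitions) {
        Random random = new Random(231); // fixed seed so inputs are repeatable
        StringBuilder csv = new StringBuilder("size,bestMillis\n");
        for (int size : sizes) {
            double[][] a = randomSquare(size, random);
            double[][] b = randomSquare(size, random);
            long bestNanos = Long.MAX_VALUE;
            for (int r = 0; r < repetitions; r++) {
                long start = System.nanoTime();
                multiply(a, b);
                bestNanos = Math.min(bestNanos, System.nanoTime() - start);
            }
            csv.append(size).append(',').append(bestNanos / 1_000_000.0).append('\n');
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        System.out.print(timingsAsCsv(new int[] { 64, 128, 256 }, 5));
    }
}
```

Swapping a parallel implementation in for <code>multiply</code> and adding a column per implementation would give data in a form that pastes directly into a spreadsheet.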

[Figure: Matrix multiply performance]

Pledge, Acknowledgments, Citations

file: matrix-multiply-pledge-acknowledgments-citations.txt

More info about the Honor Pledge