Multiplying two matrices $M$ and $N$ involves repeatedly computing dot products between rows of $M$ and columns of $N$. Note that the dot product itself is a data-parallel task that falls into the class of computation described in [[Tiling for reduced memory traffic]].
Each dot product consists of element-wise multiplications followed by an accumulation of the results via addition. Each dot product produces one element of the output matrix $O$:
$$O_{0,0} = M_{0,*} \cdot N_{*,0}, \quad \text{where} \quad M_{0,*} \cdot N_{*,0} = \sum_{i=0}^{3} M_{0,i}\,N_{i,0}$$
![[Pasted image 20240613005657.png]]
The dot product itself can be parallelized: each thread is assigned one output element, computes the multiplications of that element's dot product, and accumulates the results in a variable private to that thread.
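For contrast, a minimal untiled kernel might look as follows (a sketch, not from the original text, using the same `A`/`B`/`C`/`N` naming as the full kernel at the end of this note). Each thread owns one output element and a private accumulator `P`, but fetches both operands from global memory on every iteration:
```cpp
// Untiled baseline: one thread per output element of C.
// Sketch only; assumes square N x N matrices.
__global__ void matrixMulNaive(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float P = 0;  // accumulator private to this thread
        for (int i = 0; i < N; ++i) {
            // two global memory loads for every multiply-add
            P += A[row * N + i] * B[i * N + col];
        }
        C[row * N + col] = P;
    }
}
```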
Tiling exploits this property to compute output elements in phases. Each phase computes part of the dot product that makes up an output element and accumulates it into that element's memory location.
In this case, each phase uses two tiles, one from each matrix:
![[Pasted image 20240613010425.png]]
For each phase:
1. A tile from M and a tile from N are loaded into shared memory all at once
2. All possible partial dot products are computed using both tiles and accumulated into their output elements
3. Repeat from step 1 with the next pair of tiles
In this case, we accumulate each partial result into a variable $P$:
$P_{0,0} \mathrel{+}= (M_{0,0}\cdot N_{0,0}) + (M_{0,1}\cdot N_{1,0})$
and so on, until we reach $P_{1,1}$ for this tile. Crucially, note that the element $M_{0,0}$ is reused for both $P_{0,0}$ and $P_{0,1}$.
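Writing out all four partial accumulations for this first pair of tiles makes the reuse explicit; every loaded element appears in exactly two of them:
$$
\begin{aligned}
P_{0,0} &\mathrel{+}= M_{0,0}N_{0,0} + M_{0,1}N_{1,0}\\
P_{0,1} &\mathrel{+}= M_{0,0}N_{0,1} + M_{0,1}N_{1,1}\\
P_{1,0} &\mathrel{+}= M_{1,0}N_{0,0} + M_{1,1}N_{1,0}\\
P_{1,1} &\mathrel{+}= M_{1,0}N_{0,1} + M_{1,1}N_{1,1}
\end{aligned}
$$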
**Step 1: Loading into shared memory:**
In step 1, we load elements into shared memory to prime them for low-latency reuse. $thread_{0,0}$ loads $M_{0,0}$ and $N_{0,0}$ into shared memory, and so on for all threads, until $thread_{1,1}$ loads $M_{1,1}$ and $N_{1,1}$. The same pattern repeats in every phase and every block; in the final phase, $thread_{1,1}$ of the last block loads $M_{3,3}$ and $N_{3,3}$.
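More generally (notation mine: tile width $K$, block index $(b_y, b_x)$, thread index $(t_y, t_x)$, phase $p$; this matches the indexing in the kernel at the end of this note), thread $(t_y, t_x)$ loads
$$
M_{row,\ pK + t_x} \quad \text{and} \quad N_{pK + t_y,\ col}, \qquad \text{where } row = b_y K + t_y, \quad col = b_x K + t_x
$$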
Since each thread is responsible for loading one element from each of $M$ and $N$, and elements are reused, a thread will depend on data loaded by other threads.
In particular, this means that all shared memory loads must complete before any computation begins, which is enforced via `__syncthreads()`. **For example:** $thread_{0,0}$ loads only $M_{0,0}$ and $N_{0,0}$, but computes $P_{0,0} \mathrel{+}= (M_{0,0}\cdot N_{0,0}) + (M_{0,1}\cdot N_{1,0})$, where $M_{0,1}$ is loaded by $thread_{0,1}$ and $N_{1,0}$ by $thread_{1,0}$; the operands of $P_{0,0}$ thus come from three different threads.
**Step 2: Accumulating dot products:**
> [!Step 2]
> All possible partial dot products are computed using both tiles and accumulated into their output elements
Recall that computations in subsequent tiles accumulate instead of reassigning values. For the second pair of tiles, we now have:
$P_{0,0} \mathrel{+}= (M_{0,2}\cdot N_{2,0}) + (M_{0,3}\cdot N_{3,0})$
Since this adds to the previous value, we get:
$$P_{0,0} = (M_{0,0} \cdot N_{0,0}) + (M_{0,1} \cdot N_{1,0}) + (M_{0,2} \cdot N_{2,0}) + (M_{0,3} \cdot N_{3,0})$$
which is the complete dot product:
$$\sum_{i=0}^{3} M_{0,i}\,N_{i,0} = O_{0,0}$$
![[Pasted image 20240613012042.png]]
Overall, tiled matrix multiplication **reduces global memory access** by **loading tiles into shared memory** and **re-using sub-computations of the dot product**.
Since each element from $M$ and $N$ is loaded once but used twice (e.g. $M_{0,0}$ is used for both $P_{0,0}$ and $P_{0,1}$), we double the **compute-to-global-memory-access** ratio.
More generally, for tiles of size $K \times K$, we reduce global memory accesses by a factor of $K$.
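A quick per-thread count makes this concrete: in each phase of the kernel below, a thread issues 2 global loads but performs $2K$ floating-point operations ($K$ multiplies and $K$ adds in the inner loop), whereas the naive kernel performs only 2 operations per 2 global loads:
$$
\text{tiled: } \frac{2K \ \text{FLOPs}}{2 \ \text{global loads}} = K \qquad \text{vs.} \qquad \text{naive: } \frac{2 \ \text{FLOPs}}{2 \ \text{global loads}} = 1
$$
With $K = 16$ (the `BLOCK_SIZE` used below), global memory traffic drops by a factor of 16.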
The code for this looks like:
```cpp
#define BLOCK_SIZE 16

// Tiled matrix multiply; assumes square N x N matrices with N a multiple of BLOCK_SIZE
__global__ void matrixMul(float *A, float *B, float *C, int N) {
    // Shared memory for the sub-matrices (tiles)
    __shared__ float Asub[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bsub[BLOCK_SIZE][BLOCK_SIZE];

    // Row and column of the C element this thread computes
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Variable to accumulate the result
    float P = 0;

    // Loop over all phases (one pair of tiles per phase)
    for (int i = 0; i < N / BLOCK_SIZE; ++i) {
        // Each thread loads one element of A and one of B into shared memory
        Asub[threadIdx.y][threadIdx.x] = A[row * N + i * BLOCK_SIZE + threadIdx.x];
        Bsub[threadIdx.y][threadIdx.x] = B[(i * BLOCK_SIZE + threadIdx.y) * N + col];

        // Synchronize to make sure both tiles are fully loaded
        __syncthreads();

        // Partial dot product over this pair of tiles
        for (int j = 0; j < BLOCK_SIZE; ++j) {
            P += Asub[threadIdx.y][j] * Bsub[j][threadIdx.x];
        }

        // Synchronize so no thread overwrites the tiles while others still read them
        __syncthreads();
    }

    // Each thread writes its one element of C to global memory
    C[row * N + col] = P;
}
```
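For completeness, a possible host-side launch (a sketch; `d_A`, `d_B`, `d_C` are hypothetical names for device buffers already allocated and filled via `cudaMalloc`/`cudaMemcpy`, and `N` must be a multiple of `BLOCK_SIZE`):
```cpp
const int N = 1024;  // example size, a multiple of BLOCK_SIZE

// One thread per element of C, BLOCK_SIZE x BLOCK_SIZE threads per block
dim3 block(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(N / BLOCK_SIZE, N / BLOCK_SIZE);  // assumes N % BLOCK_SIZE == 0
matrixMul<<<grid, block>>>(d_A, d_B, d_C, N);
cudaDeviceSynchronize();  // wait for the kernel to finish before reading d_C
```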