Multiplying two matrices $M$ and $N$ involves repeatedly computing the dot product of a row of $M$ with a column of $N$. Note that the dot product itself is a data-parallel task that falls into the class of computation mentioned in [[Tiling for reduced memory traffic]]: an element-wise multiplication, followed by an accumulation of the results via addition. Each dot product forms one element of the output matrix $O$. For $4 \times 4$ matrices:

$O_{0,0} = M_{0,*} \cdot N_{*,0} = \sum_{i=0}^{3} M_{0,i} \, N_{i,0}$

![[Pasted image 20240613005657.png]]

The dot product can be parallelized, with each thread computing one multiplication. After the multiplication step is complete, each thread accumulates its result in a variable private to that thread.

Tiling uses this property to compute output elements in phases. Each phase computes the part of the dot product covered by the current tiles and accumulates it into the output element's memory location. In this case, each phase uses two $2 \times 2$ tiles, one from each matrix:

![[Pasted image 20240613010425.png]]

For each phase:
1. Both tiles are loaded from $M$ and $N$ into shared memory all at once
2. All partial dot products possible with the two tiles are computed and accumulated into their output elements
3. Repeat from step 1 with the next pair of tiles

In this case, we accumulate each partial result into a variable $P$:

$P_{0,0} \mathrel{+}= (M_{0,0}\cdot N_{0,0}) + (M_{0,1}\cdot N_{1,0})$

and so on, until we get to $P_{1,1}$ for this tile. Crucially, note that the element $M_{0,0}$ is reused for both $P_{0,0}$ and $P_{0,1}$.

**Step 1: Loading into shared memory**

In step 1, we load elements into shared memory to prime them for low-latency reuse. $thread_{0,0}$ loads $M_{0,0}$ and $N_{0,0}$ into shared memory, and every other thread does likewise, until $thread_{1,1}$ loads $M_{1,1}$ and $N_{1,1}$. Later phases shift the tiles along the row of $M$ and the column of $N$; in the final phase, the loads come from the last columns of $M$ and the last rows of $N$ (for the block computing the bottom-right output tile, $thread_{1,1}$ loads $M_{3,3}$ and $N_{3,3}$).

Since each thread is responsible for loading one element (from each of $M$ and $N$), and elements are reused, a thread depends on data fetched by other threads. In particular, this means that all shared memory loads must complete before any computation begins, which is enforced via `__syncthreads()`. **For example:** $thread_{0,0}$ loads $M_{0,0}$ and $N_{0,0}$, but computes $P_{0,0} \mathrel{+}= (M_{0,0}\cdot N_{0,0}) + (M_{0,1}\cdot N_{1,0})$, where $M_{0,1}$ and $N_{1,0}$ are loaded by $thread_{0,1}$ and $thread_{1,0}$ respectively.

**Step 2: Accumulating dot products**

> [!Step 2]
> All partial dot products possible are computed using both tiles and accumulated into their output elements

Recall that computations in later phases accumulate into the output values instead of re-assigning them. After the second pair of tiles we have:

$P_{0,0} \mathrel{+}= (M_{0,2}\cdot N_{2,0}) + (M_{0,3}\cdot N_{3,0})$

Since this adds to the previous value, we get:

$P_{0,0} = (M_{0,0} \cdot N_{0,0}) + (M_{0,1} \cdot N_{1,0}) + (M_{0,2} \cdot N_{2,0}) + (M_{0,3} \cdot N_{3,0})$

which is:

$\sum_{i=0}^{3} M_{0,i} \, N_{i,0} = O_{0,0}$

![[Pasted image 20240613012042.png]]

Overall, tiled matrix multiplication **reduces global memory access** by **loading tiles into shared memory** and **re-using sub-computations of the dot product**. Since each element of $M$ and $N$ is loaded once but used twice (e.g. $M_{0,0}$ is used for both $P_{0,0}$ and $P_{0,1}$), we double the **compute to global memory access** ratio.
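For contrast, here is a minimal sketch of the untiled kernel (the name `matrixMulNaive`, the `width` parameter, and square $width \times width$ matrices are illustrative assumptions, not from the original). Every multiply-add performs two global memory loads, so the compute to global memory access ratio is stuck at one operation per load:

```cpp
// Naive (untiled) matrix multiplication: P = M * N.
// Each thread reads a full row of M and a full column of N directly
// from global memory, so every input element is fetched once per
// thread that needs it, rather than once per tile.
__global__ void matrixMulNaive(float *M, float *N, float *P, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    float Pvalue = 0;
    for (int k = 0; k < width; ++k) {
        // Two global loads (one from M, one from N) per multiply-add
        Pvalue += M[row * width + k] * N[k * width + col];
    }
    P[row * width + col] = Pvalue;
}
```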
More generally, for tiles of size $K \times K$, we reduce global memory accesses by a factor of $K$: each element brought into shared memory is read from global memory once but takes part in $K$ partial dot products. The code for this looks like:

```cpp
#define BLOCK_SIZE 16

// Tiled matrix multiplication: P = M * N, where all matrices are
// width x width and width is a multiple of BLOCK_SIZE
__global__ void matrixMul(float *M, float *N, float *P, int width) {
    // Shared memory for the current pair of tiles
    __shared__ float Msub[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Nsub[BLOCK_SIZE][BLOCK_SIZE];

    // Row and column of the output element this thread computes
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Variable to accumulate the result, private to this thread
    float Pvalue = 0;

    // Loop over all phases, one pair of tiles per phase
    for (int i = 0; i < width / BLOCK_SIZE; ++i) {
        // Step 1: each thread loads one element of M and one of N
        Msub[threadIdx.y][threadIdx.x] = M[row * width + i * BLOCK_SIZE + threadIdx.x];
        Nsub[threadIdx.y][threadIdx.x] = N[(i * BLOCK_SIZE + threadIdx.y) * width + col];

        // Synchronize to make sure both tiles are fully loaded
        __syncthreads();

        // Step 2: accumulate the partial dot product for this phase
        for (int j = 0; j < BLOCK_SIZE; ++j) {
            Pvalue += Msub[threadIdx.y][j] * Nsub[j][threadIdx.x];
        }

        // Synchronize so neither tile is overwritten in the next
        // phase while other threads are still reading it
        __syncthreads();
    }

    // Each thread writes its one finished output element
    P[row * width + col] = Pvalue;
}
```
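As a usage sketch (hypothetical sizes; `d_M`, `d_N`, and `d_P` are assumed device pointers whose allocation and host-to-device copies are omitted), the kernel is launched with one thread per output element and one block per output tile:

```cpp
int width = 1024;  // must be a multiple of BLOCK_SIZE for this kernel
dim3 block(BLOCK_SIZE, BLOCK_SIZE);                 // one thread per tile element
dim3 grid(width / BLOCK_SIZE, width / BLOCK_SIZE);  // one block per output tile
matrixMul<<<grid, block>>>(d_M, d_N, d_P, width);
```

Note that the two `__syncthreads()` calls guard against different hazards: the first ensures no thread consumes a tile before it is fully loaded (read-after-write), while the second ensures no thread overwrites a tile for the next phase while another is still reading it (write-after-read).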