Matrix multiplication with CUDA using multiple columns

Question

The code below computes C = A*B. How would I extend it to do multi-tiled multiplication? Currently, one row of A is multiplied with one column of B in each loop iteration. How would I modify it so that one row of A is multiplied with multiple columns of B, so that the row does not have to be reloaded for every column?

(code taken from: Dynamic matrix multiplication with CUDA)

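// CUDA Kernel
// Launch with block_size x block_size threads per block and
// 2 * block_size * block_size * sizeof(float) bytes of dynamic shared memory.
__global__ void matrixMul(float* C, float* A, float* B, int wA, int wB, size_t block_size)
{
    // One dynamic shared array holds both tiles: the A tile first, then the B tile.
    // (The original also declared As[] and Bs[], which were unused and alias the same memory.)
    extern __shared__ float smem[];

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Range of A tiles handled by this block: one strip of block_size rows.
    int aBegin = wA * block_size * by;
    int aEnd   = aBegin + wA - 1;
    int aStep  = block_size;

    // First B tile handled by this block, and the step down to the next one.
    int bBegin = block_size * bx;
    int bStep  = block_size * wB;

    // Partial result for the single element of C owned by this thread.
    float Csub = 0;

    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
    {
        // Each thread loads one element of the A tile and one of the B tile.
        smem[ty * block_size + tx] = A[a + wA * ty + tx];
        smem[block_size * block_size + ty * block_size + tx] = B[b + wB * ty + tx];

        // Wait until both tiles are fully loaded.
        __syncthreads();

        // Row ty of the A tile times column tx of the B tile.
        for (int k = 0; k < block_size; ++k)
            Csub += smem[ty * block_size + k] * smem[block_size * block_size + k * block_size + tx];

        // Wait until every thread is done with the tiles before overwriting them.
        __syncthreads();
    }

    // Write the result to C; each thread writes one element.
    int c = wB * block_size * by + block_size * bx;
    C[c + wB * ty + tx] = Csub;
}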

Thanks

Could you edit your question to explain what it is you are trying to do and why? You haven't given much to go on except some (very bad) code you didn't write and a vague description of some changes you seem to want others to make to it..... — talonmies, Apr 20 '12 at 06:03
This question, linked by Brendan Wood, seems to do blocked matrix multiplication. http://stackoverflow.com/questions/8813750/matrix-multiplication-cuda — Vanwaril, Apr 20 '12 at 06:18
@Vanwaril: That is effectively the same code copy/pasted into this question. — talonmies, Apr 20 '12 at 07:10
talonmies: Wasn't aware this was "very bad" code. I wasn't asking you to write me the code, I was asking how would I approach this. Rephrasing the question. — , Apr 20 '12 at 17:23
@JohnSmith: It seems like you might have misunderstood how this code works - If I have understood what you are trying to ask, it already does exactly that. — talonmies, Apr 22 '12 at 19:57

score 1 · Answer 1 · edited May 23 '17 at 11:55

There is a book by Kirk and Hwu that goes into great depth with developing an efficient matrix multiplication kernel:

http://www.amazon.com/Programming-Massively-Parallel-Processors-Hands/dp/0123814723

There have also previously been several questions here on Stack Overflow regarding tiled matrix multiplication on CUDA. See the following:

Matrix Multiplication CUDA

Non Square Matrix Multiplication in CUDA
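
As a rough illustration of what the question describes (each thread reusing one row of A for several columns of B), here is a minimal sketch. It is not taken from the book or from the linked questions. The names `matrixMulMultiCol`, `maxErrorVsCpu`, `TILE` and `COLS` are hypothetical, and the kernel assumes that the height and width of A are multiples of `TILE` and that the width of B is a multiple of `TILE * COLS`. CUDA error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // tile edge; each block has TILE x TILE threads
constexpr int COLS = 2;   // output columns per thread (illustrative choice)

// Assumes hA and wA are multiples of TILE, and wB is a multiple of TILE * COLS.
__global__ void matrixMulMultiCol(float* C, const float* A, const float* B, int wA, int wB)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE * COLS];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row  = blockIdx.y * TILE + ty;
    int col0 = blockIdx.x * TILE * COLS + tx;  // first column owned by this thread

    float acc[COLS] = {0.0f};

    for (int t = 0; t < wA; t += TILE) {
        // One A tile and COLS adjacent B tiles are staged in shared memory.
        As[ty][tx] = A[row * wA + t + tx];
        for (int j = 0; j < COLS; ++j)
            Bs[ty][tx + j * TILE] = B[(t + ty) * wB + col0 + j * TILE];
        __syncthreads();

        for (int k = 0; k < TILE; ++k) {
            float a = As[ty][k];  // read once, reused for every column
            for (int j = 0; j < COLS; ++j)
                acc[j] += a * Bs[k][tx + j * TILE];
        }
        __syncthreads();
    }

    for (int j = 0; j < COLS; ++j)
        C[row * wB + col0 + j * TILE] = acc[j];
}

// Runs the kernel on small test matrices and returns the largest absolute
// difference from a plain CPU triple loop.
float maxErrorVsCpu(int hA, int wA, int wB)
{
    std::vector<float> A(hA * wA), B(wA * wB), C(hA * wB);
    for (size_t i = 0; i < A.size(); ++i) A[i] = float(i % 7) - 3.0f;
    for (size_t i = 0; i < B.size(); ++i) B[i] = float(i % 5) - 2.0f;

    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(wB / (TILE * COLS), hA / TILE);  // fewer blocks in x: each covers COLS tiles
    matrixMulMultiCol<<<grid, block>>>(dC, dA, dB, wA, wB);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float maxErr = 0.0f;
    for (int i = 0; i < hA; ++i)
        for (int j = 0; j < wB; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < wA; ++k) ref += A[i * wA + k] * B[k * wB + j];
            maxErr = fmaxf(maxErr, fabsf(ref - C[i * wB + j]));
        }
    return maxErr;
}

int main()
{
    float err = maxErrorVsCpu(64, 48, 96);
    printf("max abs error: %g\n", err);
    return err < 1e-3f ? 0 : 1;
}
```

Compared with the kernel in the question, each value read from the A tile is held in a register and used for `COLS` multiply-adds instead of one, and the A tile is loaded once per `COLS` output tiles rather than once per output tile. Whether this is actually faster depends on the GPU and on register and shared memory pressure, so it is worth measuring before adopting it.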

answered Apr 20 '12 at 01:57

Brendan Wood


I would not recommend this book. While it may have been good in 2010, most of the information is now outdated or just plain wrong. – Andrew Guenther Apr 20 '12 at 05:42
It is dated, but the design patterns are still quite relevant. – Brendan Wood Apr 20 '12 at 13:00
