
The code below computes C = A*B. How would I extend it to do multi-tiled multiplication? On each pass of the inner loop, one row of A is multiplied with one column of B. How would I modify it so that one row of A is multiplied with multiple columns of B, so that the same row of A does not have to be reloaded for each column?

(code taken from: Dynamic matrix multiplication with CUDA)

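// CUDA kernel: C = A * B, one block_size x block_size tile of C per thread block.
// There are no bounds checks, so the matrix dimensions must be multiples of
// block_size, and the launch must supply 2 * block_size * block_size floats
// of dynamic shared memory.
__global__ void matrixMul(float* C, float* A, float* B, int wA, int wB, size_t block_size)
{
    // One dynamic shared-memory buffer: the A tile first, then the B tile.
    extern __shared__ float smem[];

    const int bs = static_cast<int>(block_size);  // avoid mixed signed/unsigned math

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Range of A tiles (along a row of tiles) handled by this block.
    int aBegin = wA * bs * by;
    int aEnd   = aBegin + wA - 1;
    int aStep  = bs;

    // Matching B tiles (down a column of tiles).
    int bBegin = bs * bx;
    int bStep  = bs * wB;

    float Csub = 0;

    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
    {
        // Each thread loads one element of the A tile and one of the B tile.
        smem[ty * bs + tx]           = A[a + wA * ty + tx];
        smem[bs * bs + ty * bs + tx] = B[b + wB * ty + tx];

        __syncthreads();  // both tiles are fully loaded

        // Partial dot product: row ty of the A tile with column tx of the B tile.
        for (int k = 0; k < bs; ++k)
            Csub += smem[ty * bs + k] * smem[bs * bs + k * bs + tx];

        __syncthreads();  // finish reading before the tiles are overwritten
    }

    // Each thread writes one element of the C tile.
    int c = wB * bs * by + bs * bx;
    C[c + wB * ty + tx] = Csub;
}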

Thanks

  • Could you edit your question to explain what it is you are trying to do and why? You haven't given much to go on except some (very bad) code you didn't write and a vague description of some changes you seem to want others to make to it..... – talonmies Apr 20 '12 at 06:03
  • This question, linked by Brendan Wood, seems to do blocked matrix multiplication. http://stackoverflow.com/questions/8813750/matrix-multiplication-cuda – Vanwaril Apr 20 '12 at 06:18
  • @Vanwaril: That is effectively the same code copy/pasted into this question. – talonmies Apr 20 '12 at 07:10
  • talonmies: Wasn't aware this was "very bad" code. I wasn't asking you to write me the code, I was asking how would I approach this. Rephrasing the question. –  Apr 20 '12 at 17:23
  • @JohnSmith: It seems like you might have misunderstood how this code works - If I have understood what you are trying to ask, it already does exactly that. – talonmies Apr 22 '12 at 19:57

1 Answer


The book Programming Massively Parallel Processors by Kirk and Hwu goes into great depth on developing an efficient matrix multiplication kernel:

http://www.amazon.com/Programming-Massively-Parallel-Processors-Hands/dp/0123814723

Several earlier Stack Overflow questions also cover tiled matrix multiplication in CUDA:

Matrix Multiplication CUDA

Non Square Matrix Multiplication in CUDA
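
As a rough illustration of what the question seems to be asking for, here is a minimal sketch in which each thread block loads a tile of A once and reuses it against several tiles of B (an approach often called thread coarsening). It is not taken from the linked questions. The kernel name matrixMulCoarse and the constants TILE and COARSE are made up for this example, it uses fixed-size static shared memory rather than the dynamic buffer in the question, and it assumes that the height of A and wA are multiples of TILE and that wB is a multiple of TILE * COARSE.

```cuda
#include <cuda_runtime.h>

#define TILE   16  // tile edge length (assumed value for this sketch)
#define COARSE 4   // number of B tiles reused against each A tile (assumed value)

// Hypothetical kernel: each block computes a TILE x (TILE * COARSE) strip of C.
// Each thread computes COARSE elements of C, all in the same row.
__global__ void matrixMulCoarse(float* C, const float* A, const float* B,
                                int wA, int wB)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int row      = blockIdx.y * TILE + ty;           // row of C for this thread
    int colStart = blockIdx.x * TILE * COARSE + tx;  // first column for this thread

    // One accumulator per output column handled by this thread.
    float Csub[COARSE];
    for (int c = 0; c < COARSE; ++c)
        Csub[c] = 0.0f;

    for (int t = 0; t < wA / TILE; ++t) {
        // Load the A tile once per step t ...
        As[ty][tx] = A[row * wA + t * TILE + tx];

        // ... and reuse it against COARSE consecutive B tiles.
        for (int c = 0; c < COARSE; ++c) {
            int col = colStart + c * TILE;
            Bs[ty][tx] = B[(t * TILE + ty) * wB + col];

            __syncthreads();  // As and Bs are fully written

            for (int k = 0; k < TILE; ++k)
                Csub[c] += As[ty][k] * Bs[k][tx];

            __syncthreads();  // all reads done before Bs (or As) is overwritten
        }
    }

    for (int c = 0; c < COARSE; ++c)
        C[row * wB + colStart + c * TILE] = Csub[c];
}
```

Under these assumptions the kernel would be launched with a block of TILE x TILE threads and a grid of (wB / (TILE * COARSE)) x (hA / TILE) blocks, where hA is the number of rows of A. Compared with the kernel in the question, each A tile is read from global memory once per COARSE output tiles instead of once per output tile, at the cost of more registers per thread and fewer blocks in flight, so whether it is actually faster should be measured.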

Brendan Wood