
The code I use for matrix multiplication in CUDA can handle both square and non-square matrices; however, both the width and the height MUST be multiples of blocksize.

So, for example, I can multiply [3][6] * [6][3] (using blocksize=3), but I can't multiply [3][2]*[2][3].

Does anyone know a way to do that? This is my kernel:

#include <stdio.h>
#include <limits.h>
#include <stdlib.h>

#define blocksize 3
#define HM (1*blocksize)
#define WM (2*blocksize)
#define WN (1*blocksize)
#define HN WM
#define WP WN
#define HP HM
#define PTH WM
#define PTW HM

__global__ void nonsquare(float *M, float *N, float *P, int uWM, int uWN)
{
    // One blocksize x blocksize shared-memory tile of M and of N
    __shared__ float MS[blocksize][blocksize];
    __shared__ float NS[blocksize][blocksize];

    int tx = threadIdx.x, ty = threadIdx.y, bx = blockIdx.x, by = blockIdx.y;
    int rowM = ty + by * blocksize;   // row of M (and of P) for this thread
    int colN = tx + bx * blocksize;   // column of N (and of P) for this thread
    float Pvalue = 0;

    // Walk along the shared dimension one tile at a time
    for (int m = 0; m < uWM / blocksize; ++m) {
        MS[ty][tx] = M[rowM * uWM + (m * blocksize + tx)];
        NS[ty][tx] = M[colN + uWN * (m * blocksize + ty)];
        __syncthreads();

        for (int k = 0; k < blocksize; k++)
            Pvalue += MS[ty][k] * NS[k][tx];
        __syncthreads();

        P[rowM * WP + colN] = Pvalue;
    }
}

Thanks in advance!

Bernardo

2 Answers


I think the easiest thing to do would be to just pad the blocks on the end with zeros:

for(int m=0; m< uWM/blocksize;++m){
    int colM = m*blocksize+tx;
    int rowN = m*blocksize+ty;
    if (rowM > uWN || rowN > uWM || colM > uWM || colN > uWN) {
        MS[ty][tx]=0.;
        NS[ty][tx]=0.;
    } else {
        MS[ty][tx]=M[rowM*uWM+colM];
        NS[ty][tx]=N[colN + uWN*rowN];
    }

plus or minus. (That NS line should reference N, not M, right?)
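Put together, a minimal sketch of the whole kernel with that padding folded in might look like the one below. The extra uHM parameter (the height of M), the ceil-division of the tile count, and the guarded final write of P are assumptions added here, not part of the original kernel:

__global__ void nonsquare_padded(float *M, float *N, float *P,
                                 int uHM, int uWM, int uWN)
{
    __shared__ float MS[blocksize][blocksize];
    __shared__ float NS[blocksize][blocksize];

    int tx = threadIdx.x, ty = threadIdx.y;
    int rowM = ty + blockIdx.y * blocksize;   // row of P
    int colN = tx + blockIdx.x * blocksize;   // column of P
    float Pvalue = 0;

    // Round the tile count up so a partial last tile is still processed
    int numTiles = (uWM + blocksize - 1) / blocksize;
    for (int m = 0; m < numTiles; ++m) {
        int colM = m * blocksize + tx;   // column of M being loaded
        int rowN = m * blocksize + ty;   // row of N being loaded

        // Out-of-range elements are loaded as zeros, so they add nothing to the sum
        MS[ty][tx] = (rowM < uHM && colM < uWM) ? M[rowM * uWM + colM] : 0.f;
        NS[ty][tx] = (rowN < uWM && colN < uWN) ? N[rowN * uWN + colN] : 0.f;
        __syncthreads();

        for (int k = 0; k < blocksize; k++)
            Pvalue += MS[ty][k] * NS[k][tx];
        __syncthreads();
    }

    // Only threads that map to a real element of P write a result
    if (rowM < uHM && colN < uWN)
        P[rowM * uWN + colN] = Pvalue;
}

Launched with a grid of ceil(WN/blocksize) x ceil(HM/blocksize) blocks, the padded edge tiles are then covered as well.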

But, since I seem to be the only one here advocating using existing tuned libraries when possible -- why not use CUBLAS or MAGMA instead of rolling your own? They're fast, and tested by hundreds of users.
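For what it's worth, a CUBLAS-based product is a single call once a handle exists. A rough sketch against the cublas_v2 API (the wrapper name matmul_cublas is made up here, handle creation and error checking are omitted, and since CUBLAS assumes column-major storage a row-major A*B is obtained by asking CUBLAS for B*A):

#include <cublas_v2.h>

// Computes C = A*B for row-major A (hm x wm), B (wm x wn), C (hm x wn),
// all already resident in device memory.  CUBLAS is column-major, so we
// ask it for B*A; what it writes is C transposed in its view, which is
// exactly our row-major C.
void matmul_cublas(cublasHandle_t handle,
                   const float *dA, const float *dB, float *dC,
                   int hm, int wm, int wn)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                wn, hm, wm,          // m, n, k as CUBLAS sees them
                &alpha,
                dB, wn,              // first operand: B, leading dimension wn
                dA, wm,              // second operand: A, leading dimension wm
                &beta,
                dC, wn);             // result: C, leading dimension wn
}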

Jonathan Dursi
  • That seems to be a good point, although since I am using matrices that represent medical images, I am not really sure if padding with zeros will affect the image. But I will give it a try. As for using CUBLAS, I have been reading about it, but I am a little worried about it having a lot of new syntax (is the learning curve high?). But I will give it a shot nonetheless. Thanks for the reply ;) – Bernardo Mar 28 '11 at 16:16
  • The arrays are only being padded within the matrix multiplication routine. They aren't passed back, and they can't affect the final result, since you're just adding zeros to the matrix elements. As for CUBLAS (or magma, or whatever) -- the learning curve is real, but afterwards you don't have to be writing your own linear algebra routines, and the performance is a selling point... – Jonathan Dursi Mar 28 '11 at 17:29
  • I see... One of the things I will be working with very soon is very large 3D matrices (to the point of not fitting in the GPU card's memory). Can the CUBLAS library remove the pain of 3D indexing and of memory allocation (I think the first problem is more doable than the second)? – Bernardo Mar 28 '11 at 17:41

The underlying performance requirement here is that either the first or second dimension of the shared memory "tile" be a round multiple of 16 - historically that is what is necessary to achieve optimal global memory bandwidth (i.e. half-warp coalesced transactions). Whether it should be the first or second dimension of the tile is dictated by whether the matrices are stored in column- or row-major order. There is nothing to say that the shared memory tile need be square, only that the leading dimension of the storage (LDA in BLAS notation) be a round multiple of 16.

You could easily template the kernel with the tile dimensions as template arguments and instantiate several versions, depending on matrix dimensions. For a given architecture, there is probably an optimal tile dimension which balances occupancy and instruction level parallelism. The "clever" way to solve this is probably to decompose the matrix multiplication into two operations - the first doing the bulk of the work at the optimal tile size, and the second at a different size for the remaining columns. If the result is going straight back to host memory after the product is completed, the second operation might best be done on the host using an optimised BLAS, overlapped with the GPU kernel. This is the approach that many of the routines in the UTK Magma library use.
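As an illustration of that templating idea, a sketch is below; the tile sizes and the dispatch rule are placeholders rather than tuned choices, and the kernel body just repeats the bounds-checked tiled loop from the first answer:

// Square tiles of side TILE, templated so several versions can be
// compiled and the host can pick one at run time.
template <int TILE>
__global__ void matmul_tiled(const float *M, const float *N, float *P,
                             int hm, int wm, int wn)
{
    __shared__ float MS[TILE][TILE];
    __shared__ float NS[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.f;

    for (int m = 0; m < (wm + TILE - 1) / TILE; ++m) {
        int c = m * TILE + threadIdx.x;
        int r = m * TILE + threadIdx.y;
        MS[threadIdx.y][threadIdx.x] = (row < hm && c < wm) ? M[row * wm + c] : 0.f;
        NS[threadIdx.y][threadIdx.x] = (r < wm && col < wn) ? N[r * wn + col] : 0.f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += MS[threadIdx.y][k] * NS[k][threadIdx.x];
        __syncthreads();
    }
    if (row < hm && col < wn)
        P[row * wn + col] = acc;
}

// Host-side dispatch: instantiate two tile sizes and choose by problem size.
void launch_matmul(const float *dM, const float *dN, float *dP,
                   int hm, int wm, int wn)
{
    if (wm % 16 == 0 && wn % 16 == 0) {
        dim3 block(16, 16), grid((wn + 15) / 16, (hm + 15) / 16);
        matmul_tiled<16><<<grid, block>>>(dM, dN, dP, hm, wm, wn);
    } else {
        dim3 block(8, 8), grid((wn + 7) / 8, (hm + 7) / 8);
        matmul_tiled<8><<<grid, block>>>(dM, dN, dP, hm, wm, wn);
    }
}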

talonmies