
I want to multiply two non-square matrices, (2000, 100) and (100, 100). I tried the block-submatrix (tiled) approach from the NVIDIA example, but the result is wrong. I found a solved question here: Non Square Matrix Multiplication in CUDA. It uses zero padding, so I changed the block size to 16, but then I get a "wrong work group size" error. I am using PyOpenCL and can't use BLAS or similar libraries.
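A minimal NumPy sketch of the zero-padding idea: pad both inputs so every dimension is a multiple of the block size, run the tiled kernel on the padded arrays, then slice the padding back off. Here `pad_to_block` is my own helper name, and the `dot` call stands in for the GPU kernel.

```python
import numpy as np

BLOCK = 16  # tile / work-group edge, as in the NVIDIA example

def pad_to_block(m, block=BLOCK):
    """Zero-pad a 2-D array so both dimensions are multiples of `block`."""
    rows, cols = m.shape
    pr = (-rows) % block   # rows of zeros to append
    pc = (-cols) % block   # cols of zeros to append
    return np.pad(m, ((0, pr), (0, pc)), mode="constant")

a = np.random.randint(1, 10, (2000, 100)).astype(np.int32)
b = np.random.randint(1, 10, (100, 100)).astype(np.int32)

ap, bp = pad_to_block(a), pad_to_block(b)   # (2000, 112) and (112, 112)
# ...here you would run the tiled OpenCL kernel on ap, bp...
cp = ap.dot(bp)                             # stand-in for the kernel result
c = cp[:a.shape[0], :b.shape[1]]            # strip the padding back off
```

Because the appended rows/columns are all zero, the padded product agrees with the unpadded one on the original region, so the final slice recovers the exact answer.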

    You haven't asked a question, only said in very, very broad terms what you want to achieve. So what, exactly, is it you are having problems with or don't understand? "The result is wrong" isn't nearly enough information for anyone to be able to help you. – talonmies Jun 30 '12 at 14:02
  • Here, all the data I use is int32 and > 0. I implemented it in two ways: one uses numpy's dot function, the other uses the submatrix method with pyopencl. For square matrices the two ways give the same answer, but for non-square matrices the first gives the right answer, while in the second only the first row of the matrix is correct; some elements are < 0 and roughly half of the matrix is zero. – user1492775 Jul 01 '12 at 01:52
  • I saw some solutions in the link I posted. The submatrix multiplication in the NVIDIA example only works well for square matrices; there is also a zero-padding method, which I think is a good idea, but if I set the block size to 16 it doesn't divide the dimensions of the result matrix, so it gives the error message "wrong work group size". So how do I do the zero padding? Or is there another method? Thank you! – user1492775 Jul 01 '12 at 02:00
  • Try reading [this answer](http://stackoverflow.com/a/9261675/681865). – talonmies Jul 01 '12 at 07:17
  • I tried the way you mentioned with smaller matrices and this time I got the correct result, but when I multiply the matrices mentioned above I get a segmentation fault. Apparently the global size ((2000, 100)) exceeds the max work item size ((1024, 1024, 1024)). What should I do now? I'm really confused by this. – user1492775 Jul 04 '12 at 03:09
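One point worth separating out from the last comment: in OpenCL, `CL_DEVICE_MAX_WORK_ITEM_SIZES` limits the *local* (work-group) size per dimension, not the global size. A global size of (2000, 112) is legal as long as each dimension is a multiple of the corresponding local dimension. A small sketch of rounding the global size up to satisfy that (the helper name is mine):

```python
def padded_global_size(shape, local_size):
    """Round each global dimension up to the nearest multiple of the
    corresponding local (work-group) dimension."""
    return tuple(-(-g // l) * l for g, l in zip(shape, local_size))

# For C = A(2000x100) . B(100x100) with 16x16 work-groups, the NDRange
# covering the result matrix would be:
print(padded_global_size((2000, 100), (16, 16)))  # -> (2000, 112)
```

Threads that land in the padded region simply compute (and later discard) zeros, so the kernel itself needs no bounds special-casing when the inputs are zero-padded to match.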

1 Answer


One of the best presentations I have seen on the topic to date was at AFDS 2011.

PDF presentation.

Video (stream)

Video (download)

Their matrices were huge (Linpack-sized) and non-square. You can scale their main GPU kernel's block size down from 1024 to something smaller (32, 64, 128?) to better fit your problem, and possibly even fit into LDS on your hardware. The presenters used the CPU to process the irregularly dimensioned areas that were untouched by the GPU.
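A quick way to see why the block size matters for LDS: the classic tiled kernel stages one block x block tile from each input matrix in local memory per work-group. A rough sizing sketch (the function name and the 32 KB figure are illustrative; check your device's `CL_DEVICE_LOCAL_MEM_SIZE`):

```python
def tile_lds_bytes(block, elem_bytes=4):
    """Local memory used by a tiled matmul kernel that keeps one
    block x block tile of A and one of B resident per work-group."""
    return 2 * block * block * elem_bytes

for b in (32, 64, 128, 1024):
    print(b, tile_lds_bytes(b))
```

On a device with 32 KB of local memory, a block size of 64 uses exactly 32768 bytes, 128 already needs 128 KB, and 1024 is far out of reach, which is one concrete reason to scale the presenters' block size down.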

mfa