Dynamic matrix multiplication with CUDA

Question

The idea of my simple program that I've been trying to write is to take input from the user to see how large of a matrix to multiply.

I am looking to take the input x by x, I am not currently looking to multiply two different sizes at the moment.

How would you guys suggest I go about accomplishing this?

I'm sorry my question was not clear enough, I want to modify this kernel so that it can handle a matrix of any size(where the x and y are equivalents to keep it simple). Instead of multiples of 16.

I'm not sure if you would need my current code but here is the kernel code:

// CUDA Kernel
__global__ void matrixMul( float* C, float* A, float* B, int wA, int wB,size_t block_size)
{
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int aBegin = wA * block_size * by;
    int aEnd   = aBegin + wA - 1;
    int aStep  = block_size;

    int bBegin = block_size * bx;

    int bStep  = block_size * wB;
    float Csub=0;

    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) 
    {
        extern __shared__ float As[];
        extern __shared__ float Bs[];
        extern __shared__ float smem[];

        smem[ty*block_size+tx] = A[a + wA * ty + tx];

        smem[block_size*block_size+ty*block_size+tx]  = B[b + wB * ty + tx];

        __syncthreads();

        for (int k = 0; k < block_size; ++k)
            Csub += smem[ty*block_size+k] * smem[block_size*block_size+k*block_size+tx] ;

        __syncthreads();
    }

    int c = wB * block_size * by + block_size * bx;
    C[c + wB * ty + tx] = Csub;


}

Update: I decided to go with the zero padding. However I am getting incorrect answers. Take matrix A 2x2, padded to 16x16:

5.000 0.000 9.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

Matrix B, 2x2 padded to 16x16:

7.000 4.000 8.000 7.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

So the result for C I get is correct:

35.000 20.000 40.000 35.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000

However if you strip away the zeros the matrix should be: A:

5.000 0.000
9.000 0.000

B:

7.000 4.000
8.000 7.000

C Should be:

35.000 20.000
63.000 36.000

However the two matrix Cs are not the same.

What is your question? Are you asking how to get input from the user? — harrism, Feb 13 '12 at 04:09
If I understood your earlier questions correctly, your real question is how to modify this kernel code (itself a _very lightly_ modified version of the CUDA SDK matrix multiplication example) so it could be used to multiply matrices of arbitrary size, as opposed to round multiples of the kernel block size. Could you edit your question to reflect this? At the moment it is very unclear what you are really asking. — talonmies, Feb 13 '12 at 10:37
@talonmies, you are correct. That is exactly what I am looking for — Dan, Feb 13 '12 at 14:47
Hi Dan, did you find the correct code for your implementation above? — , Apr 17 '12 at 20:45
@JohnSmith I did thanks. But I think it's limited to 16 blocks. I'm just looking over it to see if I can change that. — Dan, Apr 19 '12 at 03:45
Could you please update the code above? I am curious to know what the complete code is, including the strategy suggested by talonmies. Thanks — , Apr 19 '12 at 04:21
@JohnSmith Do you mean the zero padding for it to work with a block of 16? — Dan, Apr 19 '12 at 13:51

talonmies · Accepted Answer · 2012-02-13T14:18:40.230

This isn't a very clear question, so this answer is something of a guess based on what you have previous asked in several rather similar questions earlier.

A good starting point to understanding how to do this sort of operation is to go back to the beginning and think about the matrix-matrix multiplication problem from first principles. You are interested in code to calculate the dot product of two matrices, C = AB. The restriction you have is that the kernel you are using can only compute products of matrices which are round multiples of some internal block size. So what can you do?

One way to look at the problem is to imagine that A and B matrices were block matrices. The matrix multiply can be written like this:

enter image description here

and the resulting matrix C can then by formed by combinations of the products of the eight submatrices in A and B:

enter image description here

It might not be immediately obvious how this helps solve the problem, but let's consider a concrete example:

You have an optimal matrix multiplication kernel which uses an internal block size of 32, and is only correct when matrices are round multiples of that block size.
You have a pair of 1000x1000 square matrices to multiply.

These first facts implies that your kernel can only correctly solve either a 1024x1024 product, or a 992x992 product, but not the 1000x1000 operation you need.

If you decide to use a 1024x1024 product, you can use the block decomposition idea to formulate the problem like this:

enter image description here

where O_nn denotes a suitably sized matrix of zeros. Now you have a pair of 1024x1024 matrices, and their product will result in

enter image description here

ie. the left hand, upper block is a 1000x1000 matrix containing AB. This is effectively zero padding to achieve the correct result. In this example, it means that about 7% more computation is performed than is required. Whether than is important or not is probably application specific.

The second approach would be to use the basic kernel to compute a 992x992 product, then work out a strategy to deal with the other seven products in the block decomposed version of the calculation, something like this:

enter image description here

with A₁₁ and B₁₁ being 992x992 matrices, and O_nn are zero matrices as before. At first inspection this doesn't look very helpful, but it is worth remembering that all the calculations to make the right hand side matrix contain only about 1.2% of the total computations required to compute the matrix product. They could easily be done on the host CPU while the GPU is doing the main calculation, and then added to the GPU result to form the final matrix. Because the CUDA API is asynchronous, most of that host calculation can be completely hidden and is effectively free.

This answer contains two strategies for doing what it is you are asking for without changing more than single line of your current kernel code. There is obviously a third way, which is to more radically modify the kernel itself, but that is something you should try yourself first and then ask for help if your solution doesn't work.

Thank you for the well formed explanation. I am going with the padded solution for the sake of time. I am curious about where you came up with a "7% more computation." — Dan, Feb 13 '12 at 22:37
I tried zero padding it however it yields the wrong results. I spent some time trying to figure it out but it seems like it lies in the 0s during the padding. Please see my edited OP for the update. — Dan, Feb 14 '12 at 04:30
The way you have done the zero padding in your example is wrong. And the 7% figure comes from the difference between 2*1024^3 and 2*1000^3, which are the operation counts of the dot products at the two sizes. — talonmies, Feb 14 '12 at 04:59
Hello talonmies, thanks for the great answer. You mentioned that the strategy would need one line of code change. So this is not about creating a matrix within a matrix but just have an if statement to check the boundaries, correct? Given Dan's code above how would I figure out the boundaries? Thanks — , Apr 17 '12 at 18:01

Dynamic matrix multiplication with CUDA

1 Answers1

Linked