The idea of the simple program I've been trying to write is to take input from the user for how large a matrix to multiply:
dd@cuda-Linux:~/Desktop/multi$ ./program
What is the rowSize of a? 33
What is the colSize of a? 33
What is the rowSize of b? 33
What is the colSize of b? 33
Would you like to write the results to a file?(y or n)
y
Creating the random numbers now
Writing Matrix A to file now...
Writing Matrix B to file now...
Starting it on the device
Writing Matrix C to file now...
Finish
However, the problem lies in my thread calculations. I can go up to a 32x32 matrix and it runs fine and gives me the correct results. However, once I run a 33x33, I get results like the following:
[Matrix A] x [Matrix B] = [Matrix C] (linked instead of pasting several huge matrices into this post). With Matrix C you can see that halfway through it starts to write the wrong numbers. My graphics card has a limit of 1024 threads per block, which corresponds to a 32x32 block. Also, when I run a 100x100 matrix, Matrix C is all 0s.
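(For reference, the 1024 figure comes from the device properties. A minimal, self-contained sketch of how that limit can be queried at runtime; device 0 and the printf formatting are my additions, cudaGetDeviceProperties is the standard runtime call:)

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);  // 1024 on my card
    printf("maxThreadsDim: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}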
Let mem_size_X be sizeof(float) * size_X, where size_X is the height * width of the matrix. Right now the height and width have to be the same, hence 32x32. Also, "block_size" is just the height, so with a 32x32 matrix the block size is 32.
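In code, the sizing works out like this (a sketch reconstructed from the description above; the struct field names mirror the host code below):

int size_A = a.rowSize * a.colSize;       // height * width
int mem_size_A = sizeof(float) * size_A;  // bytes for Matrix A
int size_B = b.rowSize * b.colSize;
int mem_size_B = sizeof(float) * size_B;
int block_size = a.rowSize;               // block_size is just the height, e.g. 32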
Host code (launching):
float* deviceMatrixA;
float* deviceMatrixB;
cudaMalloc((void**) &deviceMatrixA, mem_size_A);  // allocate mem_size_A bytes on the device
cudaMalloc((void**) &deviceMatrixB, mem_size_B);
cudaMemcpy(deviceMatrixA, a.elements, mem_size_A, cudaMemcpyHostToDevice);  // copy inputs to the device
cudaMemcpy(deviceMatrixB, b.elements, mem_size_B, cudaMemcpyHostToDevice);

int size_C = c.rowSize * c.colSize;
int mem_size_C = sizeof(float) * size_C;
c.elements = (float*) malloc(mem_size_C);
float* deviceMatrixC;
cudaMalloc((void**) &deviceMatrixC, mem_size_C);

// block_size x block_size threads per block; grid sized by integer division
dim3 threads(block_size, block_size);
dim3 grid(c.colSize / threads.x, c.rowSize / threads.y);

// dynamic shared memory: two block_size x block_size float tiles (one for A, one for B)
matrixMul<<<grid, threads, 2 * block_size * block_size * sizeof(float)>>>(
    deviceMatrixC, deviceMatrixA, deviceMatrixB, a.colSize, b.colSize, block_size);
cudaThreadSynchronize();
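For debugging, one thing I've considered adding right after the launch is an error check, since an oversized block (e.g. 100x100 = 10,000 threads) should make the launch fail outright and leave C untouched. A minimal sketch, not yet in my program (cudaGetLastError and cudaGetErrorString are standard runtime calls; printf needs <cstdio>):

cudaError_t err = cudaGetLastError();  // reports invalid launch configurations
if (err != cudaSuccess)
    printf("matrixMul launch failed: %s\n", cudaGetErrorString(err));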
The kernel code:
// CUDA kernel: tiled matrix multiply, one block_size x block_size tile per block
__global__ void matrixMul(float* C, float* A, float* B, int wA, int wB, size_t block_size)
{
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int aBegin = wA * block_size * by;  // first tile of A for this block
    int aEnd   = aBegin + wA - 1;       // last tile of A (end of the row)
    int aStep  = block_size;            // step A one tile to the right
    int bBegin = block_size * bx;       // first tile of B for this block
    int bStep  = block_size * wB;       // step B one tile down

    float Csub = 0;
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
    {
        // dynamic shared memory: A's tile in the first half, B's tile in the second
        extern __shared__ float smem[];
        smem[ty * block_size + tx] = A[a + wA * ty + tx];
        smem[block_size * block_size + ty * block_size + tx] = B[b + wB * ty + tx];
        __syncthreads();

        for (int k = 0; k < block_size; ++k)
            Csub += smem[ty * block_size + k] * smem[block_size * block_size + k * block_size + tx];
        __syncthreads();
    }

    int c = wB * block_size * by + block_size * bx;  // top-left of this block's C tile
    C[c + wB * ty + tx] = Csub;
}
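To check the device output, I compare against a plain CPU multiply. A sketch of what that reference looks like (matMulHost is a hypothetical helper of mine, assuming square n x n matrices in row-major order, matching the layout above):

// hypothetical host-side reference, for verifying the device result
void matMulHost(const float* A, const float* B, float* C, int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
        {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}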
Thanks