
I am just beginning to learn some CUDA programming and I am interested in how to handle calculations on large matrices which surpass the block/thread sizes.

For example, I have seen code which shows how to perform tiled matrix multiplication, but it fails when the block size and grid size are too small. In the mentioned code, if the block size and grid size are each set to 1, then only the first element of the final matrix will be computed.

The answer is simple: call the kernel with larger block and grid sizes. But what happens when I want to perform a matrix multiplication with 8 million rows and 6 million columns - something arbitrarily large for which there cannot be a proper grid and block size for any modern GPU?

Where can I find example code or an algorithm for how to work with this sort of thing? I believe that the simple case should be a matrix multiplication algorithm which works if called with <<<1,1>>>, and any algorithm which can account for this call should be able to account for any larger matrix.
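For reference, the kind of kernel I have in mind would use a grid-stride loop, so that correctness does not depend on the launch configuration: with <<<1,1>>> the single thread simply walks every output element. This is only a naive sketch (row-major layout and the function name are my own assumptions):

```cuda
#include <cuda_runtime.h>

// Naive sketch: C (n x k) = A (n x m) * B (m x k), all row-major.
// The grid-stride loop decouples correctness from the launch configuration.
__global__ void matmul_gridstride(const float *A, const float *B, float *C,
                                  long long n, long long m, long long k)
{
    long long total  = n * k;  // one C element per loop iteration
    long long stride = (long long)blockDim.x * gridDim.x;
    for (long long idx = (long long)blockIdx.x * blockDim.x + threadIdx.x;
         idx < total; idx += stride) {
        long long row = idx / k;
        long long col = idx % k;
        float sum = 0.0f;
        for (long long i = 0; i < m; ++i)
            sum += A[row * m + i] * B[i * k + col];
        C[row * k + col] = sum;
    }
}
```

Launched as `matmul_gridstride<<<1,1>>>(...)` it computes every element sequentially; launched with a large grid, the very same kernel spreads the elements over all threads.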

drjrm3
    A dense float matrix of 6 million by 8 million elements would require approx. 192 TB of storage (double would be twice that.) If one thread were assigned per element, it would require "only" approx. 48 trillion threads -- ~46 bits of addressing. CUDA grid dims on cc3.0+ devices provide (theoretically) ~63 bits of *block* address space, not to mention the availability of 1024 threads per block (so over 70 bits of addressable thread-space.) This question seems to be based on false premises: "something arbitrarily large for which there cannot be a proper Grid and Block size for any modern GPU." – Robert Crovella Feb 12 '15 at 05:13

1 Answer


The main problem with a very large matrix is not the number of blocks or the number of threads. The main problem is that you cannot fit the whole matrix in the GPU's DRAM. So to do the multiplication, you need to manually use tiling: divide the input matrices into tiles that fit in the GPU's memory, run the matrix multiplication on each tile on the GPU with as many threads as you need, and then copy the tile result back to the host (CPU).
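A minimal sketch of what this host-side tiling loop might look like (the names, the tile size, and the naive per-tile kernel are all my own assumptions; in practice you would call cuBLAS on each tile). The inner dimension also has to be tiled, so the per-tile kernel accumulates partial products into the output tile:

```cuda
#include <cuda_runtime.h>

// Per-tile kernel: Ct (tr x tc) += At (tr x tm) * Bt (tm x tc), row-major.
__global__ void tile_matmul(const float *At, const float *Bt, float *Ct,
                            int tr, int tm, int tc)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < tr && col < tc) {
        float s = 0.0f;
        for (int i = 0; i < tm; ++i)
            s += At[row * tm + i] * Bt[i * tc + col];
        Ct[row * tc + col] += s;  // accumulate across inner-dimension tiles
    }
}

// Host loop: C (n x k) = A (n x m) * B (m x k), all row-major and resident
// in host memory. T is chosen so three T x T float tiles (256 MB each here)
// fit in device memory. Error checking is omitted for brevity.
void huge_matmul(const float *A, const float *B, float *C,
                 long long n, long long m, long long k)
{
    const long long T = 8192;
    float *dA, *dB, *dC;
    cudaMalloc(&dA, T * T * sizeof(float));
    cudaMalloc(&dB, T * T * sizeof(float));
    cudaMalloc(&dC, T * T * sizeof(float));

    for (long long r = 0; r < n; r += T) {
        long long tr = (n - r < T) ? n - r : T;      // rows of this C tile
        for (long long c = 0; c < k; c += T) {
            long long tc = (k - c < T) ? k - c : T;  // cols of this C tile
            cudaMemset(dC, 0, tr * tc * sizeof(float));
            for (long long i = 0; i < m; i += T) {   // walk the inner dim
                long long tm = (m - i < T) ? m - i : T;
                // strided 2D copies pull one tile out of each host matrix
                cudaMemcpy2D(dA, tm * sizeof(float), A + r * m + i,
                             m * sizeof(float), tm * sizeof(float), tr,
                             cudaMemcpyHostToDevice);
                cudaMemcpy2D(dB, tc * sizeof(float), B + i * k + c,
                             k * sizeof(float), tc * sizeof(float), tm,
                             cudaMemcpyHostToDevice);
                dim3 blk(16, 16);
                dim3 grd((unsigned)((tc + 15) / 16), (unsigned)((tr + 15) / 16));
                tile_matmul<<<grd, blk>>>(dA, dB, dC,
                                          (int)tr, (int)tm, (int)tc);
            }
            cudaMemcpy2D(C + r * k + c, k * sizeof(float), dC,
                         tc * sizeof(float), tc * sizeof(float), tr,
                         cudaMemcpyDeviceToHost);
        }
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```

The same structure works with overlapped transfers (streams and pinned memory) once the basic version is correct.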

When you are working on these big tiles on the GPU, you need to launch thousands of threads to get the performance you need. Launching only one thread does not help you in any way.
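For a sense of scale, covering a single 8192 x 8192 output tile with 16 x 16 thread blocks already launches tens of millions of threads (the tile shape and kernel here are illustrative stand-ins):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work() {}  // stand-in for a real per-tile kernel

int main()
{
    int tileRows = 8192, tileCols = 8192;           // assumed tile shape
    dim3 block(16, 16);                             // 256 threads per block
    dim3 grid((tileCols + block.x - 1) / block.x,
              (tileRows + block.y - 1) / block.y);  // 512 x 512 blocks
    long long threads = (long long)grid.x * grid.y * block.x * block.y;
    printf("%u x %u blocks, %lld threads total\n", grid.x, grid.y, threads);
    work<<<grid, block>>>();
    cudaDeviceSynchronize();
    return 0;
}
```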

For more information, you can look at this paper:

CUDA Based Fast Implementation of Very Large Matrix Computation

I just found it by googling "large matrix multiplication CUDA"

MehrZ