I was doing some experiments with CUDA, and I noticed that launching the same basic kernel:
__global__
void add(int n, float *x, float *y)
{
    // each thread starts at its index within the block and strides
    // by the block size (blockIdx / gridDim are not used)
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
with more thread blocks sometimes resulted in a slower overall execution time of the executable than launching a single block (always with 512 threads per block) on a not-too-huge array. Note that I always waited for the GPU work to finish before measuring.
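For context, the host side of my test looks roughly like this (a simplified sketch: the exact array size, initialization, and the way I time the runs are approximations, not my literal code):

int main(void)
{
    int N = 1 << 20;                  // "not-too-huge" array; exact size is an assumption
    float *x, *y;

    // unified (managed) memory, accessible from both host and device
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    int blockSize = 512;
    int numBlocks = 128;              // 1 in the "fast" runs, more in the "slow" ones

    add<<<numBlocks, blockSize>>>(N, x, y);

    // wait for the kernel to finish before the program exits
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}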
On the CPU, I would explain this by the overhead of creating the threads outweighing whatever advantage having more threads could provide.
However, on the GPU, IIRC, we don't have threads in the usual OS sense; we simply use physical cores that would otherwise sit idle. I don't think it can be a memory issue either, since the number of times the data is transferred to the GPU does not change, but maybe the unified (managed) memory I'm using is doing something under the hood that I don't fully understand.
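For instance, I considered ruling out on-demand page migration by prefetching the managed arrays to the GPU before the launch, along these lines (just a sketch, reusing x, y, N and the launch configuration from above), but I'm not sure that's even the right thing to check:

int device = 0;
cudaGetDevice(&device);

// move the managed pages to the GPU ahead of time, so the kernel
// shouldn't trigger page faults / migrations while it runs
cudaMemPrefetchAsync(x, N * sizeof(float), device, 0);
cudaMemPrefetchAsync(y, N * sizeof(float), device, 0);
cudaDeviceSynchronize();

add<<<numBlocks, blockSize>>>(N, x, y);
cudaDeviceSynchronize();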
So I was wondering: does launching more threads and thread blocks have more overhead in CUDA? Or is it the same, from the GPU's perspective, to launch 1 block or 128 blocks?