I was doing some experiments with CUDA, and I noticed that launching the same basic kernel:
__global__
void add(int n, float *x, float *y)
{
    // each thread starts at its index within the block and strides
    // by the block size (blockIdx / gridDim are not used)
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
with more thread blocks sometimes resulted in a slower overall execution time of the executable than launching a single block (always with 512 threads per block) on a not-too-huge array. Note that I always waited for the GPU work to finish before measuring.
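For context, the host side of my test looks roughly like this (a simplified sketch: the exact array size, initialization, and the way I time the runs are approximations, not my literal code):

int main(void)
{
    int N = 1 << 20;                  // "not-too-huge" array; exact size is an assumption
    float *x, *y;

    // unified (managed) memory, accessible from both host and device
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    int blockSize = 512;
    int numBlocks = 128;              // 1 in the "fast" runs, more in the "slow" ones

    add<<<numBlocks, blockSize>>>(N, x, y);

    // wait for the kernel to finish before the program exits
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}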
On the CPU, I would explain this by the overhead of creating the threads outweighing whatever advantage having more threads could provide.
However, on the GPU, IIRC, we don't have threads in the usual OS sense; we simply use physical cores that would otherwise sit idle. I don't think it can be a memory issue either, since the number of times the data is transferred to the GPU does not change, but maybe the unified (managed) memory I'm using is doing something under the hood that I don't fully understand.
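For instance, I considered ruling out on-demand page migration by prefetching the managed arrays to the GPU before the launch, along these lines (just a sketch, reusing x, y, N and the launch configuration from above), but I'm not sure that's even the right thing to check:

int device = 0;
cudaGetDevice(&device);

// move the managed pages to the GPU ahead of time, so the kernel
// shouldn't trigger page faults / migrations while it runs
cudaMemPrefetchAsync(x, N * sizeof(float), device, 0);
cudaMemPrefetchAsync(y, N * sizeof(float), device, 0);
cudaDeviceSynchronize();

add<<<numBlocks, blockSize>>>(N, x, y);
cudaDeviceSynchronize();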
So I was wondering: does launching more threads and thread blocks have more overhead in CUDA? Or is it the same, from the GPU's perspective, to launch 1 block or 128 blocks?