I'm following NVIDIA's "An Even Easier Introduction to CUDA" tutorial. Its first example of an add kernel that runs on the GPU is this:
__global__
void add(int n, float *x, float *y)
{
    // Each thread starts at its own index and strides forward by the
    // total number of threads in the block.
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
The kernel is launched with a single block of 256 threads, so thread t processes elements t, t+256, t+512, and so on:
add<<<1,256>>>(N, x, y);
It runs in about 1.5 ms on my GPU.
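My host code looks roughly like this (a minimal sketch, assuming N = 1<<20 and unified memory as in the tutorial; the CUDA-event timing is just one way to measure):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int N = 1 << 20;  // assumed problem size, as in the tutorial
    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Time the kernel with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    add<<<1, 256>>>(N, x, y);  // one block, 256 threads
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("add: %.3f ms\n", ms);

    cudaFree(x);
    cudaFree(y);
    return 0;
}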
As an experiment, I tried splitting the array into n/256 contiguous chunks, with each thread processing every element in its own chunk instead of striding through memory:
__global__
void add(int n, float *x, float *y)
{
    // Each thread handles one contiguous chunk of n / blockDim.x
    // elements (this assumes n is divisible by blockDim.x).
    int size = n / blockDim.x;
    int start = size * threadIdx.x;
    int end = start + size;
    for (int i = start; i < end; i++)
        y[i] = x[i] + y[i];
}
But this version runs in about 3.7 ms.
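To make the difference between the two access patterns concrete, here is a small sketch (my addition, assuming n = 1<<20 and 256 threads) that prints the element each of the first few threads reads on its first loop iteration under each scheme:

#include <cstdio>

int main()
{
    const int n = 1 << 20;
    const int threads = 256;
    for (int t = 0; t < 4; t++) {
        int strided = t;                  // version 1: index = threadIdx.x
        int chunked = (n / threads) * t;  // version 2: start = size * threadIdx.x
        printf("thread %d: strided -> element %d, chunked -> element %d\n",
               t, strided, chunked);
    }
    return 0;
}

So in the strided version, adjacent threads read adjacent elements on every iteration, while in the chunked version adjacent threads start 4096 elements apart.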
Why is having each thread access a contiguous range of global memory slower than having the threads stride through it together?