I'm following NVIDIA's "An Even Easier Introduction to CUDA" tutorial. Its first example of an add kernel that runs on the GPU is this:
__global__
void add(int n, float *x, float *y)
{
    // Each thread starts at its own index and strides forward by the
    // total number of threads in the block.
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
The kernel is launched with a single block of 256 threads, so thread t processes elements t, t+256, t+512, and so on:
add<<<1,256>>>(N, x, y);
It runs in about 1.5 ms on my GPU.
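My host code looks roughly like this (a minimal sketch, assuming N = 1<<20 and unified memory as in the tutorial; the CUDA-event timing is just one way to measure):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int N = 1 << 20;  // assumed problem size, as in the tutorial
    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Time the kernel with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    add<<<1, 256>>>(N, x, y);  // one block, 256 threads
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("add: %.3f ms\n", ms);

    cudaFree(x);
    cudaFree(y);
    return 0;
}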
As an experiment, I tried splitting the array into n/256 contiguous chunks, with each thread processing every element in its own chunk instead of striding through memory:
__global__
void add(int n, float *x, float *y)
{
    // Each thread handles one contiguous chunk of n / blockDim.x
    // elements (this assumes n is divisible by blockDim.x).
    int size = n / blockDim.x;
    int start = size * threadIdx.x;
    int end = start + size;
    for (int i = start; i < end; i++)
        y[i] = x[i] + y[i];
}
But this version runs in about 3.7 ms.
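To make the difference between the two access patterns concrete, here is a small sketch (my addition, assuming n = 1<<20 and 256 threads) that prints the element each of the first few threads reads on its first loop iteration under each scheme:

#include <cstdio>

int main()
{
    const int n = 1 << 20;
    const int threads = 256;
    for (int t = 0; t < 4; t++) {
        int strided = t;                  // version 1: index = threadIdx.x
        int chunked = (n / threads) * t;  // version 2: start = size * threadIdx.x
        printf("thread %d: strided -> element %d, chunked -> element %d\n",
               t, strided, chunked);
    }
    return 0;
}

So in the strided version, adjacent threads read adjacent elements on every iteration, while in the chunked version adjacent threads start 4096 elements apart.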
Why is having each thread access a contiguous range of global memory slower than having the threads stride through it together?