
I'm following An Even Easier Introduction to CUDA. The first version of the add kernel that runs on the GPU looks like this:

__global__
void add(int n, float *x, float *y)
{
  int index = threadIdx.x;
  int stride = blockDim.x;   // each thread steps through the array by the block size
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

The kernel is launched with a single block of 256 threads:

add<<<1,256>>>(N, x, y);

It runs in about 1.5ms on my GPU.
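
For reference, the host side follows the tutorial: N = 1<<20 elements allocated with cudaMallocManaged (unified memory), a cudaDeviceSynchronize() after the launch, and a max-error check on the CPU. Roughly (with one of the add kernels defined before main):

#include <iostream>
#include <cmath>

int main(void)
{
  int N = 1<<20;   // 1M elements, as in the tutorial
  float *x, *y;

  // Unified memory, accessible from both CPU and GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Single block of 256 threads
  add<<<1, 256>>>(N, x, y);

  // Wait for the GPU to finish before touching y on the host
  cudaDeviceSynchronize();

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i] - 3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  cudaFree(x);
  cudaFree(y);
  return 0;
}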

As an experiment, I tried splitting the array into 256 contiguous chunks of n/256 elements each, with every thread working through all the elements in its own chunk rather than striding through memory.

__global__
void add(int n, float *x, float *y)
{
  int size = n / blockDim.x;       // elements per thread (n is a multiple of blockDim.x here)
  int start = size * threadIdx.x;  // each thread owns one contiguous chunk
  int end = start + size;
  for (int i = start; i < end; i++)
      y[i] = x[i] + y[i];
}

But this runs in about 3.7ms.

Why is accessing contiguous global memory slower than striding?

Barry Brown
  • The first case uses coalesced access, the second does not. In the first case, adjacent threads in a warp are accessing adjacent elements in memory. In the second case, adjacent threads in a warp are accessing elements that are `size` elements apart. This is one of the most basic optimization principles for writing fast GPU code. – Robert Crovella Oct 27 '17 at 02:56
  • Thanks for the answers. Now I know the term I should have used: memory coalescing. – Barry Brown Oct 30 '17 at 19:25
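
A quick way to make the comment above concrete: work out which element each thread of the first warp (threads 0–31) reads on its first loop iteration under the two indexing schemes. A small host-only sketch, assuming the same n = 1<<20 and blockDim.x = 256 as in the launch above:

#include <cstdio>

int main(void)
{
  const int n = 1 << 20;         // array length used above
  const int threads = 256;       // blockDim.x in the <<<1,256>>> launch
  const int size = n / threads;  // chunk size in the second kernel (4096)

  // Element index touched by each thread of the first warp on its first iteration
  printf("thread | strided kernel | chunked kernel\n");
  for (int t = 0; t < 32; t++)
    printf("%6d | i = %9d  | i = %9d\n", t, t, t * size);

  return 0;
}

In the strided kernel the warp's 32 loads hit 32 consecutive floats (128 bytes), which the hardware can coalesce into a single memory request; in the chunked kernel the loads are 4096 elements (16 KB) apart, so each thread's load becomes its own transaction and the effective bandwidth collapses.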
