
What do I need to change in my program to be able to compute a higher limit of prime numbers?

  • Currently my algorithm only works with numbers up to 85 million. In my opinion, it should work with numbers up to 3 billion.

I'm writing my own implementation of the Sieve of Eratosthenes in CUDA and I've hit a wall. So far the algorithm seems to work fine for small numbers (below 85 million).

However, when I try to compute prime numbers up to 100 million, 2 billion, or 3 billion, the system freezes (while it's computing on the CUDA device). After a few seconds my Linux machine goes back to normal (unfrozen), but the CUDA program crashes with the following error message:

CUDA error at prime.cu:129 code=6(cudaErrorLaunchTimeout) "cudaDeviceSynchronize()"

I have a GTX 780 (3 GB) and I'm allocating the sieve as a char array, so to compute prime numbers up to 100,000 it would allocate 100,000 bytes on the device.

I assumed that the GPU would allow up to 3 billion numbers, since it has 3 GB of memory; however, it only lets me do 85 million tops (85 million bytes ≈ 0.08 GB).
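
To rule out memory as the problem, I also checked the free device memory before allocating. A minimal standalone sketch using cudaMemGetInfo:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
   size_t freeBytes = 0, totalBytes = 0;
   // ask the runtime how much device memory is free vs. total
   cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
   if (err != cudaSuccess)
   {
      printf("cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
      return 1;
   }
   printf("free: %zu bytes, total: %zu bytes\n", freeBytes, totalBytes);
   return 0;
}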

This is my prime.cu code:

#include <stdio.h>
#include <helper_cuda.h> // checkCudaErrors() - NVIDIA_CUDA-6.0_Samples/common/inc
// #include <cuda.h>
// #include <cuda_runtime_api.h>
// #include <cuda_runtime.h>

#include <stdint.h> // uint64_t (instead of hand-rolling the typedef)
#include <math.h>   // sqrt

/******************************************************************************
* kernel that initializes the 1st couple of values in the primes array.
******************************************************************************/
__global__ static void sieveInitCUDA(char* primes)
{
   primes[0] = 1; // value of 1 means the number is NOT prime
   primes[1] = 1; // numbers "0" and "1" are not prime numbers
}

/******************************************************************************
* kernel for sieving the even numbers starting at 4.
******************************************************************************/
__global__ static void sieveEvenNumbersCUDA(char* primes, uint64_t max)
{
   // one thread per even number: thread id t marks 2*t + 4 (i.e. 4, 6, 8, ...)
   uint64_t index = 2 * (blockIdx.x * (uint64_t)blockDim.x + threadIdx.x) + 4;
   if (index < max)
      primes[index] = 1;
}

/******************************************************************************
* kernel for finding prime numbers using the sieve of eratosthenes
 * - primes: an array of chars used as flags. initially all set to "0".
 *           A "0" value means that the number at that index is prime.
 * - max: the max size of the primes array
 * - maxRoot: the sqrt of max (the other input). we don't want every thread
 *   to compute this over and over again, so it's passed in
******************************************************************************/
__global__ static void sieveOfEratosthenesCUDA(char *primes, uint64_t max,
                                               const uint64_t maxRoot)
{
   // this thread's candidate number: one thread per integer starting at 3.
   // even numbers were already marked by sieveEvenNumbersCUDA, so they fail
   // the primes[index] == 0 test below and cost almost nothing
   uint64_t index = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x + 3;

   // make sure index won't go out of bounds, also don't start the execution
   // on numbers that are already composite
   if (index < maxRoot && primes[index] == 0)
   {
      // mark off the composite numbers
      for (uint64_t j = index * index; j < max; j += index) // uint64_t: index*index overflows an int past ~2^31
      {
         primes[j] = 1;
      }
   }
}

/******************************************************************************
* checkDevice()
******************************************************************************/
__host__ int checkDevice()
{
   // query the Device and decide on the block size
   int devID = 0; // the default device ID
   cudaError_t error;
   cudaDeviceProp deviceProp;
   error = cudaGetDevice(&devID);
   if (error != cudaSuccess)
   {
      printf("CUDA Device not ready or not supported\n");
      printf("%s: cudaGetDevice returned error code %d, line(%d)\n", __FILE__, error, __LINE__);
      exit(EXIT_FAILURE);
   }

   error = cudaGetDeviceProperties(&deviceProp, devID);
   if (error != cudaSuccess || deviceProp.computeMode == cudaComputeModeProhibited)
   {
      printf("CUDA device ComputeMode is prohibited or failed to getDeviceProperties\n");
      return EXIT_FAILURE;
   }

   // Use a larger block size for Fermi and above (see compute capability)
   return (deviceProp.major < 2) ? 16 : 32;
}

/******************************************************************************
* genPrimesOnDevice
 * - inputs: max    - sieve all numbers below this limit
 *           primes - an array of size [max], initialized to 0
******************************************************************************/
__host__ void genPrimesOnDevice(char* primes, uint64_t max)
{
   int blockSize = checkDevice();
   if (blockSize == EXIT_FAILURE)
      return;

   char* d_Primes = NULL;
   size_t sizePrimes = sizeof(char) * max; // size_t, not int: 3 billion does not fit in an int
   uint64_t maxRoot = (uint64_t)sqrt((double)max);

   // allocate the primes on the device and set them to 0
   checkCudaErrors(cudaMalloc(&d_Primes, sizePrimes));
   checkCudaErrors(cudaMemset(d_Primes, 0, sizePrimes));

   // make sure that there are no errors...
   checkCudaErrors(cudaPeekAtLastError());

   // setup the execution configuration
   dim3 dimBlock(blockSize);
   dim3 dimGrid((maxRoot + dimBlock.x) / dimBlock.x);
   dim3 dimGridEvens(((max + dimBlock.x) / dimBlock.x) / 2);

   //////// debug
   #ifdef DEBUG
   printf("dimBlock(%d, %d, %d)\n", dimBlock.x, dimBlock.y, dimBlock.z);
   printf("dimGrid(%d, %d, %d)\n", dimGrid.x, dimGrid.y, dimGrid.z);
   printf("dimGridEvens(%d, %d, %d)\n", dimGridEvens.x, dimGridEvens.y, dimGridEvens.z);
   #endif

   // call the kernel
   // NOTE: no need to synchronize after each kernel
   // http://stackoverflow.com/a/11889641/2261947
   sieveInitCUDA<<<1, 1>>>(d_Primes); // launch a single thread to initialize
   sieveEvenNumbersCUDA<<<dimGridEvens, dimBlock>>>(d_Primes, max);
   sieveOfEratosthenesCUDA<<<dimGrid, dimBlock>>>(d_Primes, max, maxRoot);

   // check for kernel errors
   checkCudaErrors(cudaPeekAtLastError());
   checkCudaErrors(cudaDeviceSynchronize());

   // copy the results back
   checkCudaErrors(cudaMemcpy(primes, d_Primes, sizePrimes, cudaMemcpyDeviceToHost));

   // no memory leaks
   checkCudaErrors(cudaFree(d_Primes));
}

to test this code:

#include <stdlib.h> // malloc/free
#include <string.h> // memset
#include <iostream> // std::cout

int main()
{
   uint64_t max = 85000000; // 85 million
   char* primes = (char*)malloc(max);
   if (primes == NULL) // check that it allocated correctly
      return EXIT_FAILURE;
   memset(primes, 0, max);

   genPrimesOnDevice(primes, max);

   // if you wish to display results:
   for (uint64_t i = 0; i < max; i++)
   {
      if (primes[i] == 0) // if the value is '0', then the number is prime
      {
         std::cout << i; // use printf if you are using C
         if ((i + 1) != max)
            std::cout << ", ";
      }
   }

   free(primes);
   return 0;
}
  • There are far more efficient sieving techniques for parallel computation of primes. Have a look at the [Atkin–Bernstein](http://en.wikipedia.org/wiki/Sieve_of_Atkin) sieve, for example – talonmies Jul 17 '14 at 08:16

1 Answer


This error:

CUDA error at prime.cu:129 code=6(cudaErrorLaunchTimeout) "cudaDeviceSynchronize()"

doesn't necessarily indicate anything other than that your kernel is taking too long. It's not a numerical limit or a computational error, but a system-imposed limit on the amount of time your kernel is allowed to run. Both Linux and Windows can have such watchdog timers.
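
You can confirm whether the watchdog applies to your device by checking the kernelExecTimeoutEnabled field of cudaDeviceProp. A quick standalone sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0); // device 0
   // non-zero means the OS can kill kernels that run too long
   printf("kernelExecTimeoutEnabled: %d\n", prop.kernelExecTimeoutEnabled);
   return 0;
}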

If you want to work around it in the Linux case, review this document.

You don't mention it, but I assume your GTX 780 is also hosting the display. In that case, there is a time limit on kernels by default. If you have another GPU that can drive the display, reconfigure your machine so that X does not use the GTX 780, as described in the link. If you do not have another GPU for the display, then the only option for running long kernels is to modify the interactivity setting indicated in the linked document.

Be aware that with that setting changed, the keyboard, mouse, and display will be unresponsive while a kernel is running. If a kernel runs for too long, it can be difficult to recover the machine, and a hard reboot may be required. (You could also SSH into the machine and kill the process that is using the GPU for CUDA.)
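
Another option, if you can restructure the code, is to break the sieving into several shorter kernel launches so that no single launch runs long enough to trip the watchdog. A rough sketch of the idea — sieveRangeCUDA here is a hypothetical variant of your sieveOfEratosthenesCUDA that only marks composites in [lo, hi):

// hypothetical variant of sieveOfEratosthenesCUDA that only marks
// composites inside [lo, hi), so each launch finishes quickly
__global__ static void sieveRangeCUDA(char *primes, uint64_t lo, uint64_t hi,
                                      uint64_t maxRoot)
{
   uint64_t index = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x + 3;
   if (index < maxRoot && primes[index] == 0)
   {
      // first multiple of index inside [lo, hi), but no smaller than index*index
      uint64_t start = index * index;
      if (start < lo)
         start = lo + (index - lo % index) % index;
      for (uint64_t j = start; j < hi; j += index)
         primes[j] = 1;
   }
}

// in genPrimesOnDevice, replace the single sieve launch with a loop:
const uint64_t chunk = 100000000ULL; // ~100M entries per launch; tune as needed
for (uint64_t lo = 0; lo < max; lo += chunk)
{
   uint64_t hi = (lo + chunk < max) ? (lo + chunk) : max;
   sieveRangeCUDA<<<dimGrid, dimBlock>>>(d_Primes, lo, hi, maxRoot);
   checkCudaErrors(cudaDeviceSynchronize()); // each launch gets its own timeout window
}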

– Robert Crovella