Finding lists of prime numbers with SIMD - SSE/AVX

Question

I'm curious if anyone has advice on how to use SIMD to find lists of prime numbers. Particularly I'm interested how to do this with SSE/AVX.

The two algorithms I have been looking at are trial division and the Sieve of Eratosthenes. I have managed to find a way to use SSE with trial division. I found a faster way to to division which works well for a vector/scalar "Division by Invariant Integers Using Multiplication"http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556 Each time I find a prime I form the results to do a fast division and save them. Then the next time I do the division it goes much quicker. Doing this I get a speedup of about 3x (out of 4x). With AVX2 maybe it would be faster as well.

However, trial division is much slower than the Sieve of Eratosthenes and I can't think of any way to use SIMD with the Sieve of Eratosthenes except with some kind of scatter instruction with not even AVX2 has yet. Would a scatter instruction helpful? Is anyone aware of this being used on GPUs for the Sieve of Eratosthenes?

Here is the fastest version of the Sieve of Eratosthenes using OpenMP I know of. Any ideas how to speed this up with SSE/AVX? http://create.stephan-brumme.com/eratosthenes/

Here is the function I use to determine if a number is prime. I operate on eight primes at once (will actually it's 4 at a time done twice without AVX2). I'm using Agner Fog's vectorclass. The idea is that for large values there are unlikely to be eight primes in sequence. If I find a prime among the eight I have to loop over the results in sequence.

inline int is_prime_vec8(Vec8ui num, Divisor_ui_s *buffer, int size) {
    Divisor_ui div = Divisor_ui(buffer[0].m, buffer[0].s1, buffer[0].s2);
    int val = buffer[0].d; 
    Vec8ui cmp = -1;

    for(int i=0; (val*val)<=num[7]; i++) {
        Divisor_ui div = Divisor_ui(buffer[i].m, buffer[i].s1, buffer[i].s2);
        val = buffer[i].d;
        Vec8ui q = num/div; 
        cmp &= (q*val != num);
        int cnt = _mm_movemask_epi8(cmp.get_low()) || _mm_movemask_epi8(cmp.get_high());
        if(cnt == 0) {
            return size;  //0 primes were found
        }
    }
    num &= cmp;  //at least 1 out of 8 values were found to be prime
    int tmp[8];
    num.store(tmp);

    for(int i=0; i<8; i++) {
        if(tmp[i]) {
            set_ui(tmp[i], &buffer[size++]);
        }
    }       
    return size;
}

Here is where I setup to the eight prime candidates. I skip multiples of 2, 3, and 5 doing this.

int find_primes_vec8(Divisor_ui_s *buffer, const int nmax) {
    int start[] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47};
    int size = sizeof(start)/4;

    for(int i=0; i<size; i++) {
        set_ui(start[i], &buffer[i]);
    }

    Vec8ui iv(49, 53, 59, 61, 67, 71, 73, 77);
    size-=3;
    for(int i=49; i<nmax; i+=30) {
        if((i-1)%100000==0) printf("i %d, %f %%\n", i, 100.f*i/(nmax/16));
        size = is_prime_vec8(iv, &buffer[3], size);
        iv += 30;
    }   
    size+=3;

    return size;
}

BTW, Granlund improved the division method (particularly for modern processors) in [this](http://gmplib.org/~tege/division-paper.pdf) 2011 paper. If you precomputed the reciprocals (after normalization) for trial division, you could do trial 'division' very quickly. Just the first 54 primes < 256 will eliminate ~ 80% of *all* odd candidates, and provide the *correct* result for any candidate < 65536. — Brett Hale, Jun 21 '13 at 18:25
I experimented with this a year ago. I found that a blocked approached to the Sieve of Eratosthenes was the fastest, but that it's also incompatible with SIMD. — Mysticial, Jun 21 '13 at 18:42
@Mysticial, wouldn't some kind of scatter instruction help with the Sieve of Eatosthenes? Xeon Phi and many GPUs have a scatter option. The link I mentioned uses a blocked approach for the Sieve - it gives a big improvement. — , Jun 21 '13 at 18:52
@BrettHale, thanks for the link! I check out your profile. Your a prime guy, maybe you could write up an answer. I already skip multiples of 2,3,5 (from wheel factorization). The removes most over 70% of the composites. Do you mean to make a sublist of prime candidates using the first 54 primes using SIMD and the fast 'division and then doing scalar division on the much smaller sublist? — , Jun 21 '13 at 18:55
@raxman - actually I was thinking about testing 4 candidates at a time. The problem is that they will bail out at different times. I thought about wheel factorization, but Mystical is right about all things SIMD. I don't see how to keep the pipeline doing useful work across all 4 or 8 elements. It also depends on what you're trying to achieve: generate all primes < N? How much memory will you use? How much pre-computation is acceptable? etc. — Brett Hale, Jun 21 '13 at 19:10
I already test 4 candidates at a time (well 8 in some sense). Maybe I should provide my code to show what I'm doing now. — , Jun 21 '13 at 19:11
I would just like to find a way to beat the sieve using trial division and SIMD (well even better would be to use SIMD with the sieve but I have no idea how to do that), currently I'm trying to generate the primes in the first billion numbers (about 50 million primes) but eventually I want to go to 64bit to go beyond 4 billion. — , Jun 21 '13 at 19:26
Do you think reciprocal division will help here? It's certainly a win for general purpose (integer) registers, but I don't see the potential here. The efficiency of Granlund's approach is predicated on an efficient, hardware NxN=2N bit multiplication. — Brett Hale, Jun 21 '13 at 19:37
I got the idea for the fast division form the vector class. I did a test dividing an array of integers with a scalar using scalar division and the fast division and the SIMD version is about 4x faster. That's the whole reason it's in the vectorclass. I don't get a speedup of 4x with my prime search but it's between 2x and 3x. I do this by saving four values each time I find a prime (four integers: the prime, a multipler, and two shift values) which makes the division fast. It's described in the paper and vectorclass — , Jun 21 '13 at 19:56
Well here's a truly psychotic challenge... how about a 4-way or 8-way Miller-Rabin test? Just one test, optimized for base (2) (*a 2-SPRP test*); the first composite to pass this test is (2047). There are only (4842) composites under *1 billion* that pass this test and need further attention. Of those, a further 7-SPRP and 61-SPRP test will detect *all* composites - in fact, this will find all composites for a bit over 2^32. — Brett Hale, Jun 21 '13 at 20:04
See, this is why I asked you to write up an answer. I'm not a prime person, this is only a hobby to help me learn about parallel programming. I'm just looking for suggestions. I think it would be easier to explain what you mean if you wrote up an answer. — , Jun 21 '13 at 20:09
Well, let me think about it. I've only really used SSE intrinsics for floating point (matrix / vector stuff); I haven't had much to do with integer, or logical instructions apart from or/xor stuff. In the mean time - I answered a [question](http://stackoverflow.com/questions/17080661/miller-rabin-primality-test-in-java) the other day, and provided some [code](http://pastebin.com/8GJC5s61) for the SPRP test. You could have a look to see if has any promise for parallel testing. I've got Fog's manuals, but I'll have a look at his vector library. — Brett Hale, Jun 21 '13 at 20:39
Thank you for those links and code! I'll read them. I have not used SSE for integers much (actually, I only started using SSE/AVX for the first time about 4 months ago) that's reason I started this prime project - to help me learn integer operations with SSE/AVX. — , Jun 21 '13 at 20:51
@BrettHale, I though a bit more about SIMD and prime numbers. Since the sieve is so much faster I don't think SIMD is useful for finding a list of primes. However, for testing if a number is prime it might be useful. What I could do is generated all the primes for N<2^32 (e.g. with a sieve) then I could test for any prime up to 2^64. If I could get this division algorithm working for scalar/vector I could test multiples primes at once. I would need 64bit though so the best I could do is two numbers at once with SSE/AVX or four numbers at once with AVX2. I'll look into this. — , Jun 23 '13 at 19:14

Finding lists of prime numbers with SIMD - SSE/AVX

0 Answers0

Linked