Broadcast entry of __m256d

Question

What is the fastest way to broadcast a single entry of a __m256d register to all the elements of another __m256d register using AVX?

For single precision this can be done with a single call to _mm256_shuffle_ps(). Moreover, with AVX2, _mm256_permute4x64_pd seems to do the trick. Thank you.
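For reference, the AVX2 variant mentioned above could look roughly like the sketch below. The function name `broadcast_elem_avx2` and the load/store wrapper are hypothetical and only there so the snippet can be compiled and checked on its own; the `target` attribute is a GCC/Clang extension used here in place of compiling with `-mavx2`. The immediate of `_mm256_permute4x64_pd` holds one 2-bit source index per destination element, so repeating the same index four times broadcasts that element.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: broadcast element idx (0..3) of src[0..3] into dst[0..3].
   Requires a CPU with AVX2. */
__attribute__((target("avx2")))
void broadcast_elem_avx2(const double *src, int idx, double *dst)
{
    assert(idx >= 0 && idx < 4);
    __m256d v = _mm256_loadu_pd(src);
    __m256d r;
    /* The immediate must be a compile-time constant, hence the switch. */
    switch (idx) {
    case 0:  r = _mm256_permute4x64_pd(v, 0x00); break;  /* 00 00 00 00 */
    case 1:  r = _mm256_permute4x64_pd(v, 0x55); break;  /* 01 01 01 01 */
    case 2:  r = _mm256_permute4x64_pd(v, 0xAA); break;  /* 10 10 10 10 */
    default: r = _mm256_permute4x64_pd(v, 0xFF); break;  /* 11 11 11 11 */
    }
    _mm256_storeu_pd(dst, r);
}
```

If the index is a constant at the call site and the function is inlined, the switch should collapse to a single `vpermpd`.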

`_mm256_shuffle_ps` = `VSHUFPS`, which operates within a lane. You can always use `VSHUFPS` on double-precision data, as long as the shuffle control mask keeps pairs of elements together. — Peter Cordes, Jun 09 '15 at 04:10
Is the position of the element to broadcast known at compile time? If so, that's easy. Broadcast the high or low 128 to the other 128 with `vinsertf128` or `vperm2f128`, then `vshufpd` to broadcast the element within each lane. (or `vpermilpd`, but it only has an advantage when the source is in memory.) But if your source is in memory, you'd just use `vbroadcastsd` if you had to load anyway. — Peter Cordes, Jun 09 '15 at 04:28
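
The two-step recipe from this comment (lane broadcast, then in-lane shuffle) can be sketched as follows, using only AVX. The name `broadcast_elem_avx` is hypothetical, and the load/store wrapper exists only to keep the snippet testable from plain C; in real code the vector would already be in a register. The `target` attribute is a GCC/Clang extension standing in for `-mavx`.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: broadcast element idx (0..3) of src[0..3] into dst[0..3]
   with AVX only (no AVX2). */
__attribute__((target("avx")))
void broadcast_elem_avx(const double *src, int idx, double *dst)
{
    assert(idx >= 0 && idx < 4);
    __m256d v = _mm256_loadu_pd(src);

    /* Step 1 (vperm2f128): copy the 128-bit lane that holds the element
       into both lanes. 0x00 selects the low lane twice, 0x11 the high lane. */
    __m256d lane = (idx < 2) ? _mm256_permute2f128_pd(v, v, 0x00)
                             : _mm256_permute2f128_pd(v, v, 0x11);

    /* Step 2 (vshufpd): duplicate the low (0x0) or high (0xF) double
       within each lane. */
    __m256d r = (idx & 1) ? _mm256_shuffle_pd(lane, lane, 0xF)
                          : _mm256_shuffle_pd(lane, lane, 0x0);

    _mm256_storeu_pd(dst, r);
}
```

With a constant index the two conditionals fold away, leaving one cross-lane permute and one in-lane shuffle.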

score 0 · Answer 1 · answered Nov 25 '13 at 09:45

I came up with a compact solution, but I'm not sure it is the fastest.

double pos0[4] = {0, 1, 2, 3};
__m256d a = _mm256_loadu_pd(pos0);

// Duplicate the low double of each 128-bit lane: a02 = {0,0,2,2}
__m256d a02 = _mm256_shuffle_pd(a, a, 0);
// permute2f128 with 1 swaps the lanes ({2,2,0,0}); the blend mask picks which lane to overwrite
__m256d a0 = _mm256_blend_pd(a02, _mm256_permute2f128_pd(a02, a02, 1), 0x0C);
__m256d a2 = _mm256_blend_pd(a02, _mm256_permute2f128_pd(a02, a02, 1), 0x03);
// Duplicate the high double of each 128-bit lane: a13 = {1,1,3,3}
__m256d a13 = _mm256_shuffle_pd(a, a, 0x0F);
__m256d a1 = _mm256_blend_pd(a13, _mm256_permute2f128_pd(a13, a13, 1), 0x0C);
__m256d a3 = _mm256_blend_pd(a13, _mm256_permute2f128_pd(a13, a13, 1), 0x03);

This would result in:

a0 = {0,0,0,0}
a1 = {1,1,1,1}
a2 = {2,2,2,2}
a3 = {3,3,3,3}

score 0 · Answer 2 · edited May 23 '17 at 11:43

Let's assume you have a __m256d register x4. You can broadcast each of its four elements into its own __m256d register like this:

// t1 = {x0,x1,x0,x1}: low 128-bit lane copied to both lanes
__m256d t1 = _mm256_permute2f128_pd(x4, x4, 0x0);
// t2 = {x2,x3,x2,x3}: high 128-bit lane copied to both lanes
__m256d t2 = _mm256_permute2f128_pd(x4, x4, 0x11);
// 0 duplicates the low double of each lane, 0xf the high double
__m256d broad1 = _mm256_permute_pd(t1, 0);
__m256d broad2 = _mm256_permute_pd(t1, 0xf);
__m256d broad3 = _mm256_permute_pd(t2, 0);
__m256d broad4 = _mm256_permute_pd(t2, 0xf);

For more information, see "Is there a more efficient way to broadcast 4 contiguous doubles into 4 YMM registers?"

score -2 · Answer 3 · edited Nov 06 '14 at 23:52


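// Note: m256d_f64 is an MSVC-specific member of __m256d.
// _mm256_set_pd takes its arguments from the highest element down to the lowest,
// so x selects the source for element 3 and w for element 0.
__m256d shuffle(__m256d V, int x, int y, int z, int w) {
    return _mm256_set_pd(V.m256d_f64[x], V.m256d_f64[y], V.m256d_f64[z], V.m256d_f64[w]);
}

This works, but it is very slow.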
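
When the index really is only known at run time, one alternative (not from this answer, just a sketch under stated assumptions) is to spill the vector to memory once and reload the wanted element with `_mm256_broadcast_sd`, which is the `vbroadcastsd` load mentioned in the comments on the question. The name `broadcast_elem_runtime` is hypothetical, the `target` attribute is a GCC/Clang extension standing in for `-mavx`, and whether this beats a branch over constant shuffles will depend on the surrounding code, so it is worth measuring.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: broadcast element idx (0..3) of src[0..3] into dst[0..3],
   with idx only known at run time. */
__attribute__((target("avx")))
void broadcast_elem_runtime(const double *src, int idx, double *dst)
{
    assert(idx >= 0 && idx < 4);
    __m256d v = _mm256_loadu_pd(src);

    /* Spill the register to a small stack buffer... */
    double tmp[4];
    _mm256_storeu_pd(tmp, v);

    /* ...and reload the selected double into all four elements (vbroadcastsd). */
    __m256d r = _mm256_broadcast_sd(&tmp[idx]);

    _mm256_storeu_pd(dst, r);
}
```

Unlike the `shuffle` function above, this uses only standard AVX intrinsics, so it does not rely on compiler-specific vector members.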

answered Nov 06 '14 at 23:05 by ArthurNeo · edited Nov 06 '14 at 23:52 by Alex K
