3

What is the fastest way to broadcast a single element of a __m256d register to all the elements of another __m256d register using AVX?

For single precision this can be done with a single call to _mm256_shuffle_ps(). Moreover, with AVX2, _mm256_permute4x64_pd seems to do the trick. Thank you.
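
To make the AVX2 remark concrete, here is a minimal sketch, assuming a GCC- or Clang-style compiler and an AVX2-capable CPU. The helper names and the `AVX2_TARGET` macro are hypothetical, introduced only for illustration. The immediate `0x55` selects source element 1 for all four destination slots.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. The target attribute lets this compile without -mavx2;
   with other compilers, enable AVX2 through the compiler's own options instead. */
#if defined(__GNUC__)
#define AVX2_TARGET __attribute__((target("avx2")))
#else
#define AVX2_TARGET
#endif

/* Hypothetical helper: broadcast element 1 of v to all four positions. */
AVX2_TARGET
static inline __m256d broadcast1_avx2(__m256d v) {
    /* Each 2-bit field of the immediate picks a source element; 0x55 = 01 01 01 01b. */
    return _mm256_permute4x64_pd(v, 0x55);
}

/* Array-based wrapper so the helper can be called from ordinary code. */
AVX2_TARGET
void broadcast1_avx2_array(const double in[4], double out[4]) {
    _mm256_storeu_pd(out, broadcast1_avx2(_mm256_loadu_pd(in)));
}
```

Other elements follow the same pattern: `0x00`, `0xAA` and `0xFF` as the immediate would select elements 0, 2 and 3.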

Mysticial
user1829358
  • `_mm256_shuffle_ps` = `VSHUFPS`, which operates within a lane. You can always use `VSHUFPS` on double-precision data, as long as the shuffle control mask keeps pairs of elements together. – Peter Cordes Jun 09 '15 at 04:10
  • Is the position of the element to broadcast known at compile time? If so, that's easy. Broadcast the high or low 128 to the other 128 with `vinsertf128` or `vperm2f128`, then `vshufpd` to broadcast the element within each lane. (or `vpermilpd`, but it only has an advantage when the source is in memory.) But if your source is in memory, you'd just use `vbroadcastsd` if you had to load anyway. – Peter Cordes Jun 09 '15 at 04:28
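
The two-step approach from the comment above (duplicate one 128-bit half, then shuffle within each lane) can be sketched as follows for a position known at compile time. This is only an illustration, assuming GCC or Clang. The helper names and the `AVX_TARGET` macro are hypothetical and not part of any library.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. The target attribute lets this compile without -mavx;
   with other compilers, enable AVX through the compiler's own options instead. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: broadcast element 1 (low lane, odd slot). */
AVX_TARGET
static inline __m256d broadcast_elem1(__m256d v) {
    __m256d lo = _mm256_permute2f128_pd(v, v, 0x00); /* {v0, v1, v0, v1} */
    return _mm256_shuffle_pd(lo, lo, 0xF);           /* {v1, v1, v1, v1} */
}

/* Hypothetical helper: broadcast element 2 (high lane, even slot). */
AVX_TARGET
static inline __m256d broadcast_elem2(__m256d v) {
    __m256d hi = _mm256_permute2f128_pd(v, v, 0x11); /* {v2, v3, v2, v3} */
    return _mm256_shuffle_pd(hi, hi, 0x0);           /* {v2, v2, v2, v2} */
}

/* Array-based wrapper so the helpers can be called from ordinary code. */
AVX_TARGET
void broadcast_elems_array(const double in[4], double out1[4], double out2[4]) {
    __m256d v = _mm256_loadu_pd(in);
    _mm256_storeu_pd(out1, broadcast_elem1(v));
    _mm256_storeu_pd(out2, broadcast_elem2(v));
}
```

Elements 0 and 3 would use the same two instructions with the other combination of immediates (`0x00` then `0x0`, and `0x11` then `0xF`).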

3 Answers

0

Okay, I came up with a compact solution, but I'm not sure whether it is the fastest.

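double pos0[4] = {0,1,2,3};
__m256d a = _mm256_loadu_pd(pos0);

// Duplicate the even element of each 128-bit lane: a02 = {0,0,2,2}
__m256d a02 = _mm256_shuffle_pd(a, a, 0);
// Swap the two lanes ({2,2,0,0}) and blend so that both halves hold the same value
__m256d a0 = _mm256_blend_pd(a02, _mm256_permute2f128_pd(a02, a02, 1), 0x0C);
__m256d a2 = _mm256_blend_pd(a02, _mm256_permute2f128_pd(a02, a02, 1), 0x03);
// Duplicate the odd element of each 128-bit lane: a13 = {1,1,3,3}
__m256d a13 = _mm256_shuffle_pd(a, a, 0x0F);
__m256d a1 = _mm256_blend_pd(a13, _mm256_permute2f128_pd(a13, a13, 1), 0x0C);
__m256d a3 = _mm256_blend_pd(a13, _mm256_permute2f128_pd(a13, a13, 1), 0x03);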

This would result in:

a0 = {0,0,0,0}
a1 = {1,1,1,1}
a2 = {2,2,2,2}
a3 = {3,3,3,3}
user1829358
0

Let's assume you have a __m256d register x4. You can broadcast each of its four elements into its own __m256d register like this:

__m256d t1 = _mm256_permute2f128_pd(x4, x4, 0x0);
__m256d t2 = _mm256_permute2f128_pd(x4, x4, 0x11);
__m256d broad1 = _mm256_permute_pd(t1,0);
__m256d broad2 = _mm256_permute_pd(t1,0xf);
__m256d broad3 = _mm256_permute_pd(t2,0);
__m256d broad4 = _mm256_permute_pd(t2,0xf);

For more information, see "Is there a more efficient way to broadcast 4 contiguous doubles into 4 YMM registers?"

Z boson
-2
__m256d shuffle(__m256d V, int x, int y, int z, int w) {
    return _mm256_set_pd(V.m256d_f64[x], V.m256d_f64[y], V.m256d_f64[z], V.m256d_f64[w]);
}

This works for arbitrary indices, but it is very slow.
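
The `m256d_f64` member used above is specific to MSVC. A more portable sketch of the same idea, for broadcasting a single element whose index is only known at run time, is to spill the vector to memory and reload one element with `_mm256_broadcast_sd`. This is an illustration under the assumption of GCC or Clang, not a tuned implementation. The helper names and the `AVX_TARGET` macro are hypothetical.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. The target attribute lets this compile without -mavx;
   with other compilers, enable AVX through the compiler's own options instead. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: broadcast element idx (0..3, may be a run-time value). */
AVX_TARGET
static inline __m256d broadcast_runtime(__m256d v, int idx) {
    double tmp[4];
    _mm256_storeu_pd(tmp, v);                  /* spill the vector to memory */
    return _mm256_broadcast_sd(&tmp[idx & 3]); /* reload one element; idx is masked to stay in bounds */
}

/* Array-based wrapper so the helper can be called from ordinary code. */
AVX_TARGET
void broadcast_runtime_array(const double in[4], int idx, double out[4]) {
    _mm256_storeu_pd(out, broadcast_runtime(_mm256_loadu_pd(in), idx));
}
```

The round trip through memory is likely slower than the register-only shuffles in the other answers when the index is a compile-time constant, so this is mainly of interest for run-time indices.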

Alex K