Demultiplex an AVX register into four registers each containing identical values

Question

I have an array double x[4] of four doubles stored contiguously in memory.

What would be the fastest (most efficient) way, using the AVX instruction set, to prepare four registers, say ymm0, ymm1, ymm2, ymm3, such that:

ymm0 = { x[0], x[0], x[0], x[0] }
ymm1 = { x[1], x[1], x[1], x[1] }
ymm2 = { x[2], x[2], x[2], x[2] }
ymm3 = { x[3], x[3], x[3], x[3] }

I can do it as:

ymm0 = _mm256_set1_pd(x[0]);
ymm1 = _mm256_set1_pd(x[1]);
ymm2 = _mm256_set1_pd(x[2]);
ymm3 = _mm256_set1_pd(x[3]);

but would there be a better way using a single _mm256_load_pd?
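
For comparison, the four `_mm256_set1_pd` calls above can be spelled as explicit broadcast loads. This is only a minimal sketch, assuming GCC or Clang on x86-64. The function name `splat4_broadcast` and the `out` array are hypothetical, introduced so the result can be inspected from ordinary code. `_mm256_broadcast_sd` is the AVX intrinsic for `vbroadcastsd` with a memory operand; a compiler may emit the same instruction for `_mm256_set1_pd(x[i])`, but that is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical helper: fills out[i][0..3] with four copies of x[i].
   The target attribute (GCC/Clang) enables AVX for this function only;
   compiling the whole file with -mavx would work as well. */
__attribute__((target("avx")))
void splat4_broadcast(const double x[4], double out[4][4])
{
    assert(x != NULL && out != NULL);

    /* One broadcast load per element: each reads a single double from
       memory and replicates it into all four lanes. */
    __m256d ymm0 = _mm256_broadcast_sd(&x[0]);
    __m256d ymm1 = _mm256_broadcast_sd(&x[1]);
    __m256d ymm2 = _mm256_broadcast_sd(&x[2]);
    __m256d ymm3 = _mm256_broadcast_sd(&x[3]);

    /* Unaligned stores, because nothing is assumed about out's alignment. */
    _mm256_storeu_pd(out[0], ymm0);
    _mm256_storeu_pd(out[1], ymm1);
    _mm256_storeu_pd(out[2], ymm2);
    _mm256_storeu_pd(out[3], ymm3);
}
```

In real code the four vectors would normally stay in registers and feed the following computation directly; the stores are here only to make the result observable.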

See [_mm256_permute4x64_pd](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_permute4x64_pd&expand=522,485,3891,3883). One load + 4 permutes should do it. — Paul R, Sep 02 '16 at 16:13
Thank you, @PaulR. Would that (1 load + 4 permutes) be faster than the method I describe? I cannot find any information about the latency of [_mm256_set1_pd](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=set1_pd&expand=522,485,3891,3883,4629,4629). — Tomas, Sep 02 '16 at 16:21
It depends on whether 4 load-port uops are cheaper (given the surrounding code) than one load and 4 shuffle uops. Intel CPUs can broadcast-load 32-bit elements without any ALU uops. IIRC, there's an existing question about broadcast-load throughput. Also see [the x86 tag wiki](http://stackoverflow.com/tags/x86/info), specifically Agner Fog's guides. — Peter Cordes, Sep 02 '16 at 16:22
It depends on your compiler - `_mm256_set1_pd` does not map to any particular instruction(s) - a good compiler might well do something similar to my load/permute suggestion, or something entirely different. — Paul R, Sep 02 '16 at 16:22
Did you actually mean 4 floats in an xmm register, or 4 doubles in a ymm register? Your code seems to show four 64-bit doubles in a 128-bit xmm register, or else an extremely bad choice of variable name. — Peter Cordes, Sep 02 '16 at 16:27
Sorry, I meant 4 doubles in a ymm register. I renamed the variables to reflect AVX. — Tomas, Sep 02 '16 at 18:01
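
The "one load + 4 permutes" suggestion from the comments could look roughly like the sketch below, again assuming GCC or Clang on x86-64. Note that `_mm256_permute4x64_pd` belongs to AVX2, not AVX, so a second variant shows one way to get the same result with AVX-only shuffles, at the cost of two extra instructions. The function names and the `out` array are hypothetical. The load is unaligned because the question does not say whether `x` is 32-byte aligned, which `_mm256_load_pd` would require. No claim is made about which variant is faster; as the comments point out, that depends on the surrounding code.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* AVX2 variant: one load, then four lane-crossing permutes.
   Each 2-bit field of the immediate selects the source element,
   so 0x00, 0x55, 0xAA, 0xFF pick element 0, 1, 2, 3 for every lane. */
__attribute__((target("avx2")))
void splat4_permute_avx2(const double x[4], double out[4][4])
{
    assert(x != NULL && out != NULL);

    __m256d v = _mm256_loadu_pd(x);                  /* {x0, x1, x2, x3} */
    __m256d ymm0 = _mm256_permute4x64_pd(v, 0x00);
    __m256d ymm1 = _mm256_permute4x64_pd(v, 0x55);
    __m256d ymm2 = _mm256_permute4x64_pd(v, 0xAA);
    __m256d ymm3 = _mm256_permute4x64_pd(v, 0xFF);

    _mm256_storeu_pd(out[0], ymm0);
    _mm256_storeu_pd(out[1], ymm1);
    _mm256_storeu_pd(out[2], ymm2);
    _mm256_storeu_pd(out[3], ymm3);
}

/* AVX-only variant: duplicate each 128-bit half across both halves,
   then use an in-lane permute to pick the low or high double. */
__attribute__((target("avx")))
void splat4_permute_avx(const double x[4], double out[4][4])
{
    assert(x != NULL && out != NULL);

    __m256d v  = _mm256_loadu_pd(x);                 /* {x0, x1, x2, x3} */
    __m256d lo = _mm256_permute2f128_pd(v, v, 0x00); /* {x0, x1, x0, x1} */
    __m256d hi = _mm256_permute2f128_pd(v, v, 0x11); /* {x2, x3, x2, x3} */

    /* Immediate bit i chooses the low (0) or high (1) double of the
       128-bit half that holds result element i. */
    _mm256_storeu_pd(out[0], _mm256_permute_pd(lo, 0x0));
    _mm256_storeu_pd(out[1], _mm256_permute_pd(lo, 0xF));
    _mm256_storeu_pd(out[2], _mm256_permute_pd(hi, 0x0));
    _mm256_storeu_pd(out[3], _mm256_permute_pd(hi, 0xF));
}
```

Both functions write the same values into `out`, so either can be benchmarked against the four-broadcast version in the context where the vectors are actually used.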

0 Answers