I have an array double x[4] of four doubles stored contiguously in memory.
What would be the most efficient way, using the AVX instruction set, to prepare four registers, say ymm0, ymm1, ymm2, ymm3, such that:
ymm0 = { x[0], x[0], x[0], x[0] }
ymm1 = { x[1], x[1], x[1], x[1] }
ymm2 = { x[2], x[2], x[2], x[2] }
ymm3 = { x[3], x[3], x[3], x[3] }
I can do it as:
ymm0 = _mm256_set1_pd(x[0]);
ymm1 = _mm256_set1_pd(x[1]);
ymm2 = _mm256_set1_pd(x[2]);
ymm3 = _mm256_set1_pd(x[3]);
but is there a better way that uses a single _mm256_load_pd?
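To make the comparison concrete, here is a minimal sketch of both variants: four broadcast loads, and a single load followed by in-register permutes. The helper names splat_broadcast and splat_single_load, the TARGET_AVX macro, and the out[4][4] result buffer are made up for illustration. The sketch uses _mm256_loadu_pd because _mm256_load_pd requires x to be 32-byte aligned, which is not guaranteed here. Which variant is faster is not something this sketch settles; it likely depends on the microarchitecture and on the surrounding code, so it should be measured.

```c
#include <immintrin.h>

/* Assumption: GCC or Clang. The target attribute enables AVX for these
   functions only, so the file also builds without -mavx. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: one broadcast load per element (four loads). */
TARGET_AVX
static void splat_broadcast(const double *x, double out[4][4])
{
    __m256d ymm0 = _mm256_broadcast_sd(&x[0]);
    __m256d ymm1 = _mm256_broadcast_sd(&x[1]);
    __m256d ymm2 = _mm256_broadcast_sd(&x[2]);
    __m256d ymm3 = _mm256_broadcast_sd(&x[3]);

    _mm256_storeu_pd(out[0], ymm0);
    _mm256_storeu_pd(out[1], ymm1);
    _mm256_storeu_pd(out[2], ymm2);
    _mm256_storeu_pd(out[3], ymm3);
}

/* Hypothetical helper: one 256-bit load, then six permutes. */
TARGET_AVX
static void splat_single_load(const double *x, double out[4][4])
{
    /* v = { x[0], x[1], x[2], x[3] }; unaligned load, no alignment assumed */
    __m256d v = _mm256_loadu_pd(x);

    /* Duplicate each 128-bit half across both lanes. */
    __m256d lo = _mm256_permute2f128_pd(v, v, 0x00); /* { x0, x1, x0, x1 } */
    __m256d hi = _mm256_permute2f128_pd(v, v, 0x11); /* { x2, x3, x2, x3 } */

    /* Within each lane, pick the even (0x0) or odd (0xF) element twice. */
    __m256d ymm0 = _mm256_permute_pd(lo, 0x0); /* { x0, x0, x0, x0 } */
    __m256d ymm1 = _mm256_permute_pd(lo, 0xF); /* { x1, x1, x1, x1 } */
    __m256d ymm2 = _mm256_permute_pd(hi, 0x0); /* { x2, x2, x2, x2 } */
    __m256d ymm3 = _mm256_permute_pd(hi, 0xF); /* { x3, x3, x3, x3 } */

    _mm256_storeu_pd(out[0], ymm0);
    _mm256_storeu_pd(out[1], ymm1);
    _mm256_storeu_pd(out[2], ymm2);
    _mm256_storeu_pd(out[3], ymm3);
}
```

The single-load variant replaces three memory reads with six shuffle instructions, so it trades load traffic for shuffle work rather than removing work outright. The stores to out are only there so the results can be inspected; in real code the four registers would be used directly.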