I have a grayscale image of uint8_t values. I want to load the data into SIMD registers: I load 16 values and convert them to two __m256 float registers.

I use:

uint8_t * data = .....
size_t index = ....

//load 16 uint8_t (16 * 8 = 128bit)
__m128i val = _mm_loadu_si128((const __m128i*)(data + index));

//convert to 16bit int
__m128i lo16 = _mm_unpacklo_epi8(val, _mm_setzero_si128());
__m128i hi16 = _mm_unpackhi_epi8(val, _mm_setzero_si128());

//convert to 32bit int
__m256i lo32 = _mm256_cvtepi16_epi32(lo16);
__m256i hi32 = _mm256_cvtepi16_epi32(hi16);

//convert to float
__m256 lo = _mm256_cvtepi32_ps(lo32);
__m256 hi = _mm256_cvtepi32_ps(hi32);

Is there a better way to do this (without AVX-512) than my solution?

Would it be better to load with _mm256_loadu_si256 and "split" it into 4 __m256 registers?

Martin Perry

1 Answer

It would cost even more shuffles to unpack the high lane of a _mm256_loadu_si256. Most AVX2 CPUs have more load throughput than shuffle throughput, and you already need at least 1 shuffle per vector of results, so your way of doing 1 load and 4 shuffles per 2 epi32 vectors is already a poor tradeoff.

If anything, it would be better to use 2x _mm256_cvtepu8_epi32 to get two vectors of inputs for cvtepi32_ps, with one load per shuffle.

It's a bit of a pain to use memory-source pmovzx/sx because you need to tell the compiler you're doing a narrow load into a __m128i (for safety), and some compilers won't fold that narrow load into a memory-source operand for vpmovzx. See Loading 8 chars from memory into an __m256 variable as packed single precision floats

But apparently things have improved some since I originally wrote that answer in 2015: GCC9 and later fixed that missed-optimization bug and now fold a _mm_loadl_epi64((const __m128i*)p) into a memory source for vpmovzx. clang and ICC are fine, even in old versions. MSVC still does poor code-gen with a separate vmovq, even with -arch:AVX2, even in v19.28 and "latest". (Godbolt).
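A minimal sketch of that approach (the helper name load16_u8_to_ps is mine; it assumes an AVX2 target, e.g. -mavx2 or the GCC target pragma below):

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* Assumes an AVX2 target; this pragma lets GCC/clang compile the
   intrinsics even without -mavx2 on the command line. */
#pragma GCC target("avx2")

/* Hypothetical helper: convert 16 uint8_t pixels at data+index into
   two __m256 float vectors, with one 64-bit load per shuffle. */
static inline void load16_u8_to_ps(const uint8_t *data, size_t index,
                                   __m256 *lo, __m256 *hi)
{
    /* _mm_loadl_epi64 tells the compiler only 8 bytes are read;
       GCC9+/clang/ICC fold it into vpmovzxbd ymm, qword [mem]. */
    __m128i lo8 = _mm_loadl_epi64((const __m128i*)(data + index));
    __m128i hi8 = _mm_loadl_epi64((const __m128i*)(data + index + 8));

    __m256i lo32 = _mm256_cvtepu8_epi32(lo8);  /* zero-extend 8 x u8 -> 8 x i32 */
    __m256i hi32 = _mm256_cvtepu8_epi32(hi8);

    *lo = _mm256_cvtepi32_ps(lo32);
    *hi = _mm256_cvtepi32_ps(hi32);
}
```

That's 2 loads + 2 shuffles per 2 vectors of floats, instead of 1 load + 4 shuffles.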


On Intel CPUs, vpmovzxbd ymm, qword [mem] is always 2 uops; the load can't micro-fuse (that only works with an xmm destination: https://uops.info/table.html). So you don't gain anything (except code-size) even if the compiler does manage to fold the 64-bit load into a memory operand instead of using a separate vmovq load.

But on Zen 2, that instruction has 2/clock throughput, vs. worse than 1/clock for vpmovzxbd ymm, xmm (same for the wd shuffle, and for the sign-extending version you're using, vpmovsxwd = _mm256_cvtepi16_epi32). So you really do want the compiler to get this right if you care about Zen CPUs, especially Zen 2.

Peter Cordes