
This question is similar to [1]. However, I didn't quite understand how it addressed inserting into the high quadwords of a ymm from a GPR. Additionally, I want the operation not to use any intermediate memory accesses.

Can it be done with AVX2 or below (I don't have AVX512)?

[1] How to move double in %rax into particular qword position on %ymm or %zmm? (Kaby Lake or later)


1 Answer


My answer on the linked question didn't show a way to do that because it can't be done very efficiently without AVX512F for a masked broadcast (vpbroadcastq zmm0{k1}, rax). But it's actually not all that bad using a scratch register, about the same cost as a vpinsrq + an immediate blend.

(On Intel, 3 uops total. 2 uops for port 5 (vmovq + broadcast), and an immediate blend that can run on any port. See https://agner.org/optimize/).
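
For comparison, with AVX-512VL the masked broadcast mentioned above is a single intrinsic / single instruction. This is only a sketch under that assumption (the question rules AVX-512 out), and merge_epi64_avx512 is a name invented here:

#include <immintrin.h>
#include <stdint.h>

// Requires AVX512F + AVX512VL: compiles to  vpbroadcastq ymm0{k1}, rdi
template<unsigned elem>
static inline
__m256i merge_epi64_avx512(__m256i v, int64_t newval)
{
    static_assert(elem <= 3, "a __m256i only has 4 qword elements");
    return _mm256_mask_set1_epi64(v, __mmask8(1u << elem), newval);  // masked broadcast from a GPR
}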

I updated my answer there with asm for this. In C++ with Intel's intrinsics, you'd do something like:

#include <immintrin.h>
#include <stdint.h>

// integer version.  An FP version would still use _mm256_set1_epi64x, then a cast
template<unsigned elem>
static inline
__m256i merge_epi64(__m256i v, int64_t newval)
{
    static_assert(elem <= 3, "a __m256i only has 4 qword elements");

    __m256i splat = _mm256_set1_epi64x(newval);

    constexpr unsigned dword_blendmask = 0b11 << (elem*2);  // vpblendd uses 2 bits per qword
    return  _mm256_blend_epi32(v, splat, dword_blendmask);
}

Clang compiles this nearly perfectly for all 4 possible element positions, which really shows off how nice its shuffle optimizer is: it takes advantage of all the special cases. And as a bonus, it comments its asm to show you which elements come from where in blends and shuffles.

From the Godbolt compiler explorer, here are some test functions to see what happens with args in registers:

__m256i merge3(__m256i v, int64_t newval) {
    return merge_epi64<3> (v, newval);
}
// and so on for 2..0

# clang7.0 -O3 -march=haswell
merge3(long long __vector(4), long):
    vmovq   xmm1, rdi
    vpbroadcastq    ymm1, xmm1
    vpblendd        ymm0, ymm0, ymm1, 192 # ymm0 = ymm0[0,1,2,3,4,5],ymm1[6,7]
                      # 192 = 0xC0 = 0b11000000
    ret

merge2(long long __vector(4), long):
    vmovq   xmm1, rdi
    vinserti128     ymm1, ymm0, xmm1, 1          # Runs on more ports than vbroadcast on AMD Ryzen
        #  But it introduced a dependency on  v (ymm0) before the blend for no reason, for the low half of ymm1.  Could have used xmm1, xmm1.
    vpblendd        ymm0, ymm0, ymm1, 48 # ymm0 = ymm0[0,1,2,3],ymm1[4,5],ymm0[6,7]
    ret

merge1(long long __vector(4), long):
    vmovq   xmm1, rdi
    vpbroadcastq    xmm1, xmm1           # only an *XMM* broadcast, 1c latency instead of 3.
    vpblendd        ymm0, ymm0, ymm1, 12 # ymm0 = ymm0[0,1],ymm1[2,3],ymm0[4,5,6,7]
    ret

merge0(long long __vector(4), long):
    vmovq   xmm1, rdi
           # broadcast optimized away, newval is already in the low element
    vpblendd        ymm0, ymm0, ymm1, 3 # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
    ret

Other compilers blindly broadcast to the full YMM and then blend, even for elem=0. You can specialize the template, or add if() conditions inside the template that will optimize away, e.g. splat = elem ? _mm256_set1_epi64x(newval) : _mm256_castsi128_si256(_mm_cvtsi64_si128(newval)); to save the broadcast for elem==0 (the blend only reads the low qword of splat there, so a bare vmovq is enough). You could capture the other optimizations, too, if you wanted.
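
A sketch of that idea using C++17 if constexpr (the structure and the merge_epi64_opt name are additions here, not from the original answer):

#include <immintrin.h>
#include <stdint.h>

template<unsigned elem>
static inline
__m256i merge_epi64_opt(__m256i v, int64_t newval)
{
    static_assert(elem <= 3, "a __m256i only has 4 qword elements");

    __m256i splat;
    if constexpr (elem == 0)
        splat = _mm256_castsi128_si256(_mm_cvtsi64_si128(newval));  // just vmovq; the blend ignores the upper elements
    else
        splat = _mm256_set1_epi64x(newval);                         // vmovq + vpbroadcastq

    constexpr unsigned dword_blendmask = 0b11 << (elem*2);  // vpblendd uses 2 bits per qword
    return _mm256_blend_epi32(v, splat, dword_blendmask);
}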


GCC 8.x and earlier use a normally-bad way of broadcasting the integer: they store/reload. This avoids using any ALU shuffle ports because broadcast-loads are free on Intel CPUs, but it introduces store-forwarding latency into the chain from the integer to the final vector result.

This is fixed in current trunk for gcc9, but I don't know if there's a workaround to get non-silly code-gen with earlier gcc. Normally -march=<an intel uarch> favours ALU instead of store/reload for integer -> vector and vice versa, but in this case the cost model still picks store/reload with -march=haswell.

# gcc8.2 -O3 -march=haswell
merge0(long long __vector(4), long):
    push    rbp
    mov     rbp, rsp
    and     rsp, -32          # align the stack even though no YMM is spilled/loaded
    mov     QWORD PTR [rsp-8], rdi
    vpbroadcastq    ymm1, QWORD PTR [rsp-8]   # 1 uop on Intel
    vpblendd        ymm0, ymm0, ymm1, 3
    leave
    ret

; GCC trunk: g++ (GCC-Explorer-Build) 9.0.0 20190103 (experimental)
; MSVC and ICC do this, too.  (For MSVC, make sure to compile with -arch:AVX2)
merge0(long long __vector(4), long):
    vmovq   xmm2, rdi
    vpbroadcastq    ymm1, xmm2
    vpblendd        ymm0, ymm0, ymm1, 3
    ret

For a runtime-variable element position, the shuffle still works but you'd have to create a blend mask vector with the high bit set in the right element. e.g. with a vpmovsxbq load from mask[3-elem] in alignas(8) int8_t mask[] = { 0,0,0,-1,0,0,0 };. But vpblendvb or vblendvpd is slower than an immediate blend, especially on Haswell, so avoid that if possible.
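
A sketch of that runtime-variable version (merge_epi64_var and the memcpy load are additions here, not from the answer; elem must be 0..3):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

__m256i merge_epi64_var(__m256i v, int64_t newval, unsigned elem)
{
    // sliding window: the -1 byte lands in qword 'elem' after sign-extension
    alignas(8) static const int8_t maskbytes[7] = { 0,0,0,-1,0,0,0 };

    __m256i splat = _mm256_set1_epi64x(newval);

    int32_t window;                                  // 4 bytes starting at maskbytes[3-elem]
    memcpy(&window, &maskbytes[3 - elem], sizeof(window));
    __m256i blendmask = _mm256_cvtepi8_epi64(_mm_cvtsi32_si128(window));  // vpmovsxbq: -1 byte -> all-ones qword
    return _mm256_blendv_epi8(v, splat, blendmask);  // vpblendvb: slower than an immediate vpblendd
}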

  • This is exactly what I was looking for. Thanks for the detailed answer. I have one follow-up question: for the low-order two elements, what's the advantage of using this approach over pinsrq? – budchan chao Jan 06 '19 at 00:20
  • @budchanchao: `vpinsrq` would zero the high 2 elements. `pinsrq` would cause an SSE/AVX stall on Haswell. Non-VEX `pinsrq` for element 1 of a YMM might be optimal on Skylake (and on AMD and Xeon Phi), but you'll never get a compiler to emit that without inline asm. For element 0, `vmovq` + `vpblendd` has better execution port pressure: 1p5 + 1p015 instead of 2p5 for `pinsrq xmm0, rax, 0`. – Peter Cordes Jan 06 '19 at 01:12