My answer on the linked question didn't show a way to do that, because it can't be done very efficiently without AVX512F for a masked broadcast (vpbroadcastq zmm0{k1}, rax). But it's actually not all that bad using a scratch register: about the same cost as a vpinsrq + an immediate blend. (On Intel, 3 uops total: 2 uops for port 5 (vmovq + broadcast), plus an immediate blend that can run on any port. See https://agner.org/optimize/.)
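With AVX512VL as well as AVX512F, the 256-bit form of that masked broadcast is available and the whole merge is a single instruction; a minimal sketch for comparison (the merge_epi64_avx512 name is just for illustration):

#include <immintrin.h>
#include <stdint.h>

// requires AVX512F + AVX512VL
template<unsigned elem>
static inline
__m256i merge_epi64_avx512(__m256i v, int64_t newval)
{
    static_assert(elem <= 3, "a __m256i only has 4 qword elements");
    // vpbroadcastq ymm0{k1}, rdi   with k1 = 1<<elem
    return _mm256_mask_set1_epi64(v, (__mmask8)(1u << elem), newval);
}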
I updated my answer there with asm for this scratch-register version. In C++ with Intel's intrinsics, you'd do something like:
#include <immintrin.h>
#include <stdint.h>
// integer version. An FP version would still use _mm256_set1_epi64x, then a cast
template<unsigned elem>
static inline
__m256i merge_epi64(__m256i v, int64_t newval)
{
    static_assert(elem <= 3, "a __m256i only has 4 qword elements");
    __m256i splat = _mm256_set1_epi64x(newval);
    constexpr unsigned dword_blendmask = 0b11 << (elem*2);  // vpblendd uses 2 bits per qword
    return _mm256_blend_epi32(v, splat, dword_blendmask);
}
Clang compiles this nearly perfectly for all 4 possible element positions, which really shows off how nice its shuffle optimizer is: it takes advantage of all the special cases. And as a bonus, it comments its asm to show you which elements come from where in blends and shuffles.
From the Godbolt compiler explorer, here are some test functions to see what happens with args in regs:
__m256i merge3(__m256i v, int64_t newval) {
    return merge_epi64<3> (v, newval);
}
// and so on for 2..0
# clang7.0 -O3 -march=haswell
merge3(long long __vector(4), long):
vmovq xmm1, rdi
vpbroadcastq ymm1, xmm1
vpblendd ymm0, ymm0, ymm1, 192 # ymm0 = ymm0[0,1,2,3,4,5],ymm1[6,7]
# 192 = 0xC0 = 0b11000000
ret
merge2(long long __vector(4), long):
vmovq xmm1, rdi
vinserti128 ymm1, ymm0, xmm1, 1 # Runs on more ports than vbroadcast on AMD Ryzen
# But it introduced a dependency on v (ymm0) before the blend for no reason, for the low half of ymm1. Could have used xmm1, xmm1.
vpblendd ymm0, ymm0, ymm1, 48 # ymm0 = ymm0[0,1,2,3],ymm1[4,5],ymm0[6,7]
ret
merge1(long long __vector(4), long):
vmovq xmm1, rdi
vpbroadcastq xmm1, xmm1 # only an *XMM* broadcast, 1c latency instead of 3.
vpblendd ymm0, ymm0, ymm1, 12 # ymm0 = ymm0[0,1],ymm1[2,3],ymm0[4,5,6,7]
ret
merge0(long long __vector(4), long):
vmovq xmm1, rdi
# broadcast optimized away, newval is already in the low element
vpblendd ymm0, ymm0, ymm1, 3 # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
ret
Other compilers blindly broadcast to the full YMM and then blend, even for elem=0. You can specialize the template, or add if() conditions in the template that optimize away at compile time, e.g. splat = elem ? _mm256_set1_epi64x(newval) : _mm256_castsi128_si256(_mm_cvtsi64_si128(newval)); to skip the broadcast for elem==0 (the blend only reads the low qword of splat in that case). You could capture the other optimizations, too, if you wanted.
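A minimal sketch of that special-case version, reusing the same blend as merge_epi64 above (the merge_epi64_v2 name is just for illustration; whether a given compiler then emits the ideal asm still depends on its cost model):

template<unsigned elem>
static inline
__m256i merge_epi64_v2(__m256i v, int64_t newval)
{
    static_assert(elem <= 3, "a __m256i only has 4 qword elements");
    // elem is a compile-time constant, so the ternary folds away.
    // For elem==0 the blend only reads the low qword of splat, so a plain
    // zero-extending vmovq is enough and the vpbroadcastq can be skipped.
    __m256i splat = elem ? _mm256_set1_epi64x(newval)
                         : _mm256_castsi128_si256(_mm_cvtsi64_si128(newval));
    constexpr unsigned dword_blendmask = 0b11 << (elem*2);  // 2 bits per qword
    return _mm256_blend_epi32(v, splat, dword_blendmask);
}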
GCC 8.x and earlier use a normally-bad way of broadcasting the integer: they store/reload. This avoids using any ALU shuffle ports because broadcast-loads are free on Intel CPUs, but it introduces store-forwarding latency into the chain from the integer to the final vector result.
This is fixed in current trunk for gcc9, but I don't know if there's a workaround to get non-silly code-gen with earlier gcc. Normally -march=<an intel uarch> favours ALU instead of store/reload for integer -> vector and vice versa, but in this case the cost model still picks store/reload with -march=haswell.
# gcc8.2 -O3 -march=haswell
merge0(long long __vector(4), long):
push rbp
mov rbp, rsp
and rsp, -32 # align the stack even though no YMM is spilled/loaded
mov QWORD PTR [rsp-8], rdi
vpbroadcastq ymm1, QWORD PTR [rsp-8] # 1 uop on Intel
vpblendd ymm0, ymm0, ymm1, 3
leave
ret
; GCC trunk: g++ (GCC-Explorer-Build) 9.0.0 20190103 (experimental)
; MSVC and ICC do this, too. (For MSVC, make sure to compile with -arch:AVX2)
merge0(long long __vector(4), long):
vmovq xmm2, rdi
vpbroadcastq ymm1, xmm2
vpblendd ymm0, ymm0, ymm1, 3
ret
For a runtime-variable element position, the shuffle still works, but you'd have to create a blend-mask vector with the high bit set in the right element, e.g. with a vpmovsxbq load from mask[3-elem] with alignas(8) int8_t mask[] = { 0,0,0,-1,0,0,0 }. But vpblendvb or vblendvpd is slower than an immediate blend, especially on Haswell, so avoid that if possible.
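For completeness, here's a minimal sketch of how that could look with intrinsics (the merge_epi64_var name and the memcpy-based 4-byte load are just illustrative choices):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Runtime-variable element position: slide a 4-byte window over a small byte
// table to build the blend mask, then use a variable blend.
// Caller must pass elem <= 3.
static inline
__m256i merge_epi64_var(__m256i v, int64_t newval, unsigned elem)
{
    alignas(8) static const int8_t maskbytes[7] = { 0,0,0,-1,0,0,0 };
    int32_t maskdword;
    memcpy(&maskdword, &maskbytes[3 - elem], sizeof(maskdword));  // -1 lands in byte elem
    __m256i blendmask = _mm256_cvtepi8_epi64(_mm_cvtsi32_si128(maskdword));  // vpmovsxbq: -1 -> all-ones qword
    __m256i splat = _mm256_set1_epi64x(newval);
    return _mm256_blendv_epi8(v, splat, blendmask);  // vpblendvb: 2 uops (p5 on Haswell)
}

The 4-byte memcpy keeps the load inside the 7-byte table for any elem in 0..3 and sidesteps strict-aliasing problems; the variable blend is still the expensive part, which is why the immediate-blend template is preferable whenever elem is a compile-time constant.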