7

With Intel compiler intrinsics, given a 128-bit register packing 8 16-bit elements, how do I access (cheaply) arbitrary elements from within the register, for subsequent use of _mm_cvtepi8_epi64 (sign-extend two 8-bit elements, packed in the lower 16 bits of the register, to two 64-bit elements)?


I'll explain why I ask:

  1. Input: An in-memory buffer with k bytes, each either 0x0 or 0xff.
  2. Desired output: For every two consecutive bytes of the input, a register packing two quad words (64 bits each), each holding 0x0 or 0xffff ffff ffff ffff according to the corresponding input byte.
  3. Ultimate goal: Sum a buffer of k doubles, masked according to the entries of the input buffer.

Note: The values 0x0 and 0xff of the input buffer may be changed to whatever is most helpful, provided that the effect of masking before the sum remains.
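
For reference, a plain scalar version of what I ultimately want to compute would look like this (names are only illustrative):

#include <stddef.h>

/* Scalar reference of the ultimate goal (illustrative only). */
double masked_sum_ref(const double *values, const unsigned char *mask, size_t k)
{
    double sum = 0.0;
    for (size_t i = 0; i < k; ++i)
        if (mask[i])            /* mask[i] is 0x00 or 0xff */
            sum += values[i];
    return sum;
}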

As may be apparent from my question, my current plan is as follows, streaming across the input buffers:

  1. Extend the input mask buffer from 8-bits to 64-bits.
  2. Mask the doubles buffer with the extended mask.
  3. Sum the masked doubles.

Thanks, Asaf

Whaa
    `pmovsxbq` can actually take a memory operand and directly load those two bytes from memory. But of course the MSVC team doesn't care about that. – harold Apr 01 '12 at 12:42
  • @harold Yes, there is actually an addressing mode missing for the intrinsics given by Intel. So actually Intel is to blame, not MS (as I hate to say it ;-)). The easy solution is using the `pmovsxbq` in inline assembly. Otherwise reading 16 bytes at once and some `pshufb` to get the bytes to the right places will do. – Gunther Piez Apr 01 '12 at 16:16
  • @drhirsch well that's unexpected.. thanks for letting me know – harold Apr 01 '12 at 16:48
  • @drhirsch, @harold: See my answer below - just use the intrinsic passing it a dereferenced pointer. At least `gcc` and `icc` figure out to do the right thing. – FrankH. Apr 04 '12 at 23:33

3 Answers

3

Rather a tangent to the question itself, this is more about filling in some information from the comments, because the comment section itself is too small to hold it (sic!):

At least gcc can deal with the following code:

#include <smmintrin.h>

extern int fumble(__m128i x);

int main(int argc, char **argv)
{
    __m128i foo;
    __m128i* bar = (__m128i*)argv;

    foo = _mm_cvtepi8_epi64(*bar);

    return fumble(foo);
}

It turns this into the following assembly:

Disassembly of section .text.startup:

0000000000000000 <main>:
   0:   66 0f 38 22 06          pmovsxbq (%rsi),%xmm0
   5:   e9 XX XX XX XX          jmpq   .....

This means that the intrinsics don't need to come in memory-argument form - the compiler handles dereferencing a mem argument transparently and uses the corresponding mem-operand instruction if possible. ICC does the same. I do not have a Windows machine / Visual C++ around to test whether MSVC does so as well, but I'd expect it to.
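
Applied to the question's mask buffer, you can therefore just write something like the following (a sketch, untested; it relies on the compiler performing the same folding as above, so only the two needed bytes end up being read):

#include <smmintrin.h>

/* Sketch: sign-extend two mask bytes (0x00/0xff) to two qword masks. */
static inline __m128i load_mask2(const unsigned char *mask)
{
    return _mm_cvtepi8_epi64(*(const __m128i *)mask);
}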

FrankH.
  • Not quite sure about that. The assembly form doesn't need any alignment, and it takes a pointer to a word (`movw` or `mov WORD PTR`). Will the compiler emit the `pmovsxbq` even if the pointer is unaligned? Anyway, a better answer than Paul R.'s, which is useless for this scenario. – Gunther Piez Apr 05 '12 at 07:47
  • OK, I see now that the pointer is actually unaligned. Sorry for the noise :-) – Gunther Piez Apr 05 '12 at 07:50
  • @drhirsch: The above is contrived of course - just to illustrate that the compiler will emit `pmovsxbq (...), %xmm..` if the intrinsic is given a dereferenced pointer as argument. I merely chose an arbitrary available non-`NULL` pointer ;-) – FrankH. Apr 05 '12 at 08:44
3

Each byte is the mask for an entire double, so PMOVSXBQ does exactly what we need: load two bytes from an m16 memory operand and sign-extend them into the two 64-bit (qword) halves of an xmm register.

# UNTESTED CODE
# (loop setup stuff)
# RSI: double pointer
# RDI: mask pointer
# RCX: loop counter = mask byte-count
    add   rdi, rcx
    lea   rsi, [rsi + rcx*8]  ; sizeof(double) = 8
    neg   rcx  ; point to the end and count up

    XORPS xmm0, xmm0  ; clear accumulator
      ; for real use: use multiple accumulators
      ; to hide ADDPD latency

ALIGN 16
.loop:
    PMOVSXBQ XMM1, [RDI + RCX]
    ANDPD    XMM1, [RSI + RCX * 8]
    ADDPD    XMM0, XMM1
    add      RCX, 2      ; 2 bytes / doubles per iter
    jl       .loop

    MOVHLPS  XMM1, XMM0    ; combine the two parallel sums
    ADDPD    XMM0, XMM1 
    ret

For real use, use multiple accumulators. Also see "Micro fusion and addressing modes" regarding indexed addressing modes.

Writing this with intrinsics should be easy. As others have pointed out, just use dereferenced pointers as args to the intrinsics.
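
A rough intrinsics translation of the loop above might look like this (untested sketch, single accumulator for clarity; the function and variable names are mine, n is assumed even, vals 16-byte aligned, and the dereferenced mask load is assumed to fold into pmovsxbq's memory operand as in the other answer):

#include <smmintrin.h>
#include <stddef.h>

/* Untested sketch of the loop above. */
double masked_sum(const double *vals, const unsigned char *mask, size_t n)
{
    __m128d acc = _mm_setzero_pd();                   /* clear accumulator */
    for (size_t i = 0; i < n; i += 2) {
        /* two mask bytes (0x00/0xff) -> two qword masks */
        __m128i m = _mm_cvtepi8_epi64(*(const __m128i *)(mask + i));
        /* ANDPD + ADDPD */
        acc = _mm_add_pd(acc, _mm_and_pd(_mm_castsi128_pd(m),
                                         _mm_load_pd(vals + i)));
    }
    /* combine the two partial sums */
    acc = _mm_add_sd(acc, _mm_unpackhi_pd(acc, acc));
    return _mm_cvtsd_f64(acc);
}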


To answer the other part of your question, about how to shift data around to line it up for PMOVSX:

On Sandybridge and later, using PMOVSXBQ from RAM is probably good. On earlier CPUs that can't handle two loads per cycle, load 16B of mask data at a time and shift it right by 2 bytes at a time with PSRLDQ xmm1, 2 to put the next 2 bytes of mask data in the low 2 bytes of the register. Or maybe PUNPCKHQDQ or PSHUFD to get two dependency chains going by moving the high 64 bits to the low 64 bits of another register. You'd have to check which port is used by which instruction (shift vs. shuffle/extract), and see which conflicts less with PMOVSX and ADDPD.

punpck and pshufd both use p1/p5 on SnB, and so does pmovsx. addpd can only run on p1; andpd can only run on p5. Hmm, maybe PAND would be better, since it can run on p0 (as well as p1/p5). Otherwise nothing in the loop will be using execution port 0. If there's a latency penalty for moving data from the integer to the FP domain, it's unavoidable if we use PMOVSX, since that gets the mask data in the int domain. Better to use more accumulators to make the loop longer than the longest dependency chain. But keep it under 28 uops or so to fit in the loop buffer, to make sure 4 uops can issue per cycle.
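
To make the 16B-at-a-time variant concrete, a rough intrinsics sketch of that idea (untested; assumes n is a multiple of 16 and vals is 16-byte aligned; names are mine):

#include <smmintrin.h>
#include <stddef.h>

/* Untested sketch: load 16 mask bytes at once, then shift the mask
   register right by 2 bytes per step (PSRLDQ) so PMOVSXBQ always sees
   the next 2 bytes in the low lanes. */
double masked_sum_16at(const double *vals, const unsigned char *mask, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    for (size_t i = 0; i < n; i += 16) {
        __m128i mbytes = _mm_loadu_si128((const __m128i *)(mask + i));
        for (int j = 0; j < 16; j += 2) {
            __m128i m = _mm_cvtepi8_epi64(mbytes);        /* low 2 bytes */
            acc = _mm_add_pd(acc, _mm_and_pd(_mm_castsi128_pd(m),
                                             _mm_load_pd(vals + i + j)));
            mbytes = _mm_srli_si128(mbytes, 2);           /* PSRLDQ xmm, 2 */
        }
    }
    acc = _mm_add_sd(acc, _mm_unpackhi_pd(acc, acc));
    return _mm_cvtsd_f64(acc);
}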

And more about optimizing the whole thing: aligning the loop isn't really needed, since on Nehalem and later it will fit in the loop buffer.

You should unroll the loop by 2 or 4, because pre-Haswell Intel CPUs don't have enough execution units to handle all 4 (fused) uops in a single cycle. (3 vector and one fused add/jl. The two loads fuse with the vector uops they're part of.) Sandybridge and later can execute both loads every cycle, so one iteration per cycle is doable, except loop overhead.

Oh, ADDPD has a latency of 3 cycles. So you need to unroll and use multiple accumulators to avoid the loop-carried dependency chain being the bottleneck. Probably unroll by 4, and then sum up the 4 accumulators at the end. You'll have to do that in the source code even with intrinsics, because that would change the order of operations for the FP math, so the compiler might not be willing to do that while unrolling.
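
In source, the unroll-by-4 / four-accumulator pattern might look roughly like this (untested sketch; assumes n is a multiple of 8, vals 16-byte aligned, and the same load-folding as before; names are mine):

#include <smmintrin.h>
#include <stddef.h>

/* Untested sketch: 4 independent accumulators hide the 3-cycle ADDPD latency. */
double masked_sum_unrolled(const double *vals, const unsigned char *mask, size_t n)
{
    __m128d a0 = _mm_setzero_pd(), a1 = a0, a2 = a0, a3 = a0;
    for (size_t i = 0; i < n; i += 8) {
        a0 = _mm_add_pd(a0, _mm_and_pd(
                 _mm_castsi128_pd(_mm_cvtepi8_epi64(*(const __m128i *)(mask + i))),
                 _mm_load_pd(vals + i)));
        a1 = _mm_add_pd(a1, _mm_and_pd(
                 _mm_castsi128_pd(_mm_cvtepi8_epi64(*(const __m128i *)(mask + i + 2))),
                 _mm_load_pd(vals + i + 2)));
        a2 = _mm_add_pd(a2, _mm_and_pd(
                 _mm_castsi128_pd(_mm_cvtepi8_epi64(*(const __m128i *)(mask + i + 4))),
                 _mm_load_pd(vals + i + 4)));
        a3 = _mm_add_pd(a3, _mm_and_pd(
                 _mm_castsi128_pd(_mm_cvtepi8_epi64(*(const __m128i *)(mask + i + 6))),
                 _mm_load_pd(vals + i + 6)));
    }
    /* sum the 4 accumulators at the end, then combine the two halves */
    __m128d acc = _mm_add_pd(_mm_add_pd(a0, a1), _mm_add_pd(a2, a3));
    acc = _mm_add_sd(acc, _mm_unpackhi_pd(acc, acc));
    return _mm_cvtsd_f64(acc);
}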

So each unrolled-by-4 loop would take 4 clock cycles, plus 1 uop for the loop overhead. On Nehalem, where you have a tiny loop cache but no uop cache, unrolling might mean you have to start caring about decoder throughput. On pre-Sandybridge, though, one load per clock will probably be the bottleneck anyway.

For decoder throughput, you can probably use ANDPS instead of ANDPD, which takes one less byte to encode. IDK if that would help.


Widening this to 256b ymm registers would require AVX2 for the most straightforward implementation (for VPMOVSXBQ ymm). You might get a speedup on AVX-only by doing two VPMOVSXBQ xmm and combining them with VINSERTF128 or something.
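
For what it's worth, an AVX2 sketch of the straightforward widening might look like this (untested; _mm256_cvtepi8_epi64 sign-extends the low 4 bytes of an xmm register to 4 qwords, so this processes 4 doubles per step; n is assumed to be a multiple of 4 and the names are mine):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Untested AVX2 sketch: 4 mask bytes -> 4 qword masks per iteration (VPMOVSXBQ ymm). */
double masked_sum_avx2(const double *vals, const unsigned char *mask, size_t n)
{
    __m256d acc = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 4) {
        int32_t m4;
        memcpy(&m4, mask + i, sizeof m4);             /* 4 mask bytes */
        __m256i m = _mm256_cvtepi8_epi64(_mm_cvtsi32_si128(m4));
        acc = _mm256_add_pd(acc, _mm256_and_pd(_mm256_castsi256_pd(m),
                                               _mm256_loadu_pd(vals + i)));
    }
    /* horizontal sum of the 4 partial sums */
    __m128d lo = _mm_add_pd(_mm256_castpd256_pd128(acc),
                            _mm256_extractf128_pd(acc, 1));
    lo = _mm_add_sd(lo, _mm_unpackhi_pd(lo, lo));
    return _mm_cvtsd_f64(lo);
}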

Peter Cordes
  • I'm looking for the following transformation. It is 2 XMM words to 4 XMM words with `X` meaning "don't care". Do you see an efficient way to do it? `[A1 A2 A3 A4][B1 B2 B3 B4] ... => [A1 B1 X X][A2 B2 X X][A3 B3 X X][A4 B4 X X]`. I've tried `pinsrd` and `pextrd` but they have more overhead than I can tolerate. – jww Jun 18 '18 at 23:32
  • @jww: `r0 = punpckldq(v0,v1)` / r1 = extract high half of r0 with `movhlps` or `punpckhqdq`. Repeat with `punpckHdq` for the A3/B3 and A4/B4 outputs. Without AVX you'll need a `movdqa` to avoid clobbering the first input so you can unpack low and high. – Peter Cordes Jun 18 '18 at 23:40
  • I was looking at some old notes where you helped me before. How do you feel about a `_mm_shuffle_ps` with a `_MM_SHUFFLE`? The downside is, I have to do it four times. – jww Jun 19 '18 at 00:18
  • @jww: You definitely need 4 instructions to produce 4 outputs, just like my 4-instruction solution. x86 doesn't have any multi-output SIMD instructions. `_mm_shuffle_ps` with AVX is nice: all 4 can read from the original sources instead of being dependent on a previous result. But without VEX 3-operand encoding, it doesn't appear to have any advantage over `_mm_unpacklo/hi_ps` and `_mm_movehi_ps(tmp, vec_with_high_half)` (or whatever the intrinsics are for `unpcklps` / `movhlps`). [Using `movhlps` with a well-chosen destination variable can save a `movaps`](https://stackoverflow.com/q/6996764) – Peter Cordes Jun 19 '18 at 00:32
  • We ended up using the unpacks. All paths seemed to need 4 of them. The unpacks won and did not require the `*_ps` casts. Plus, we needed `_mm_shuffle_epi8` from SSSE3, so we were not limited to MMX/SSE/SSE2. Also see [`cham-simd.cpp`](https://github.com/weidai11/cryptopp/blob/master/cham-simd.cpp). – jww Jun 20 '18 at 00:58
  • @jww: Did you notice the redundancy between outputs? It's weird the way you put the `_mm_unpacklo_epi32(a, b);` and so on into both `UnpackXMM<0>` and `UnpackXMM<1>`, but I guess if the compilers you care about all reliably CSE that away after inlining, it's not doing any harm. What won't optimize away in your code is using `kr = _mm_shuffle_epi8(k, _mm_set_epi8(7,6,5,4, 7,6,5,4, 7,6,5,4, 7,6,5,4));` and so on to broadcast 1 of 4 elements. gcc fails to transform that into a `pshufd` copy-and-shuffle (`_mm_shuffle_epi32`), so instead you get a `movdqa` + `pshufb` with a separate mask. – Peter Cordes Jun 20 '18 at 02:50
  • @jww: also, `__m128i a = UnpackXMM<0>(block0);` and so on (in the Enc/Dec 1 block functions) should optimize to one `pshufb`, just rearranging the data in `block0`, but I'm not sure that happens with gcc. Probably with clang, though. `RotateRight32<8>` can avoid using another mask by using `_mm_alignr_epi8(same,same,1)` (`palignr`). I tried to get the code onto Godbolt where it would be easy to see source->asm mappings, but it's hard to untangle from Crypto++ dependencies / helper functions. In the 4-block funcs, if you endian-swap before xpose, a couple of those insns could be blends. – Peter Cordes Jun 20 '18 at 02:59
2

Have you looked at _mm_extract_epi16 (PEXTRW) and _mm_insert_epi16 (PINSRW) ?

Paul R
  • I did, and I'm guessing that these would potentially output to memory instead of to a register, making everything slow. Am I wrong? Would the compiler (MSVC) fix this? – Whaa Apr 02 '12 at 06:17
  • No - these instructions operate directly between SSE (xmm) registers and normal registers. If you look at the code generated for e.g. `_mm_set_epi16` you'll see that it just generates a string of `PINSRW`s. – Paul R Apr 02 '12 at 08:17
  • @Anonymous down-voter: would you care to add a comment as to why you think the above answer is inappropriate or unhelpful ? – Paul R Apr 05 '12 at 08:36
  • The answer is unhelpful because you do not explain how to use `pextrw` and `pinsrw` in the scenario given by the OP. The only way I see is unrolling the loop 8 times (because `pextrw` only takes immediates), moving the 16-bit value to a GPR, back to an xmm register, and expanding with `pmovsxbq` to do the masking operation on the doubles. If there is a simple usage of these two instructions which I fail to see, please explain it. – Gunther Piez Apr 05 '12 at 12:49
  • @drhirsch: sorry - I thought it was obvious - I'm not sure that it deserves a down-vote though - the OP seems to have found it helpful and others have up-voted the answer. I guess you're entitled to your opinion though, however harsh it may be, but this kind of negativity tends to discourage people from offering help. – Paul R Apr 05 '12 at 13:16
  • I really doubt the answer was helpful to the OP, even if he accepted it - because of the reason I gave: the instructions you recommend are barely usable in the scenario of this question. Tell me, did you have an algorithm similar to what I described in mind as you wrote the answer? If not, this is just the kind of fast, cheap answer which sometimes works and is useful, and other times is not. This time it is not. And hey, you got 3 upvotes vs 1 downvote, don't tell me that you are now discouraged :-) – Gunther Piez Apr 05 '12 at 13:39
  • @drhirsch: The algorithm you described is exactly what I did. Regrettably, it's not as fast as I thought it would be. – Whaa Apr 08 '12 at 09:21
  • @Whaa I almost expected that. On most Intel CPUs there is a penalty for moving data from a GPR to SSE or back. You could try Frank's approach, which saves some register moves at the expense of additional memory accesses (which are all in L1 though). – Gunther Piez Apr 08 '12 at 09:41
  • @Whaa Possibly the fastest approach is unrolling the loop eight times, but instead of `pextrw` use `pshufb` or `psrldq` to get the bytes into the right places (needs one register move and one logic instruction, if you have AVX both get combined to one). – Gunther Piez Apr 08 '12 at 09:45