I have three ymm registers -- ymm4, ymm5 and ymm6 -- packed with double precision (qword) floats:

ymm4:   73  144 168 41
ymm5:   144 348 26  144
ymm6:   732 83  144 852

I want to gather each column of the matrix above into its own register. For example, for the first column:

-- extract ymm4[63:0] and insert it at ymm0[63:0]
-- extract ymm5[63:0] and insert it at ymm0[127:64]
-- extract ymm6[63:0] and insert it at ymm0[191:128]

so that ymm0 reads 73, 144, 732.

So far I have used:

mov rax,4
kmovq k6,rax
vpxor ymm1,ymm1
VEXPANDPD ymm1{k6}{z},ymm6

That causes ymm1 to read [ 0 0 732 ], so I've accomplished the first step because 732 is the element at [63:0] in ymm6.

For ymm4 and ymm5 I use vblendpd:

vblendpd ymm0,ymm1,ymm4,1

That causes ymm0 to read [ 73 0 732 ], so I've accomplished the second step because 73 is the element at [63:0] in ymm4.

Now I need to put ymm5[63:0] at ymm0[127:64]:

vblendpd ymm0,ymm0,ymm5,2

That causes ymm0 to read [ 73 144 732 ], so now I am finished with the first column [63:0].

But now I need to do the same thing with columns 2, 3 and 4 in the ymm registers. Before I add more instructions, is this the most efficient way to do what I described? Is there another, more effective way?

I have investigated unpckhpd (https://www.felixcloutier.com/x86/unpckhpd), vblendpd (https://www.felixcloutier.com/x86/blendpd), and vshufpd (https://www.felixcloutier.com/x86/shufpd). What I show above seems like the best solution, but it takes a lot of instructions, and the imm8 encodings shown in the docs are somewhat opaque. Is there a better way to extract the corresponding columns of three ymm registers?

RTC222
  • Some of your bit-ranges are backwards. `127:64` has the highest position first, like Intel manuals. But `0:63` is opposite. `vunpcklpd` looks like the way to go to combine the low doubles from 2 vectors into the low 128 bits of another register. You could even do that under a merge-mask with AVX512 if you have it to avoid a separate `vpblendd`, but you only tagged this AVX2. – Peter Cordes Sep 07 '20 at 19:06
  • (1) Based on your suggestion I'll focus on vunpcklpd; (2) I'm writing this for ymm, not zmm, so I avoided AVX-512. I can use AVX-512 instructions with ymm registers (using the zmm name) but will I get downclocking by mixing AVX2 with AVX-512 instructions? I've been warned about that, and even pure AVX-512 downclocks where I've used it (that's reportedly been solved by Ice Lake, but we're not there yet). – RTC222 Sep 07 '20 at 19:14
  • @RTC222 The sample code you give does use AVX-512 instructions. Just avoiding `zmm` registers doesn't ensure that you avoid AVX-512. Instead, make sure to only use AVX2 and earlier instructions. – fuz Sep 08 '20 at 14:17
  • Do I understand correctly that your end goal is to essentially transpose the 3x4 matrix into 4 registers of 3 entries each? – fuz Sep 08 '20 at 14:19
  • @fuz -- Yes, I want a ymm register filled with the first column (73,144,732), then the second column (144,348,83), then the third and fourth columns. That means taking from 63:0 for each register, then 127:64 for each register, etc. – RTC222 Sep 08 '20 at 15:46
  • @RTC222. Okay. Fast algorithms exist for this, let me sketch something for you. – fuz Sep 08 '20 at 16:08
  • Downclocking is based on using 512-bit vectors. 256-bit vectors are fine even using AVX512VL like you're doing for masked 256-bit shuffles. Since they're "light" instructions, not running on FMA units, they shouldn't affect turbo clocks at all: [SIMD instructions lowering CPU frequency](https://stackoverflow.com/q/56852812). But as fuz pointed out, you did *not* avoid AVX512. `VEXPANDPD` and `kmovq` only exist with AVX512, `{k1}` masking is an AVX512 feature, and even the `k1` register itself only exists with AVX512! – Peter Cordes Sep 08 '20 at 16:56

1 Answer

Let's name the matrix elements like this:

YMM0 = [A,B,C,D]
YMM1 = [E,F,G,H]
YMM2 = [I,J,K,L]

Eventually, you want a result like this, where * indicates a “don't care.”

YMM0 = [A,E,I,*]
YMM1 = [B,F,J,*]
YMM2 = [C,G,K,*]
YMM3 = [D,H,L,*]

To achieve this, we extend the matrix to 4×4 (imagine another row of just [*,*,*,*]) and then transpose the matrix. This is done in two steps: first, each 2×2 submatrix is transposed. Then, the top right and bottom left submatrices are exchanged:

[A,B,C,D]       [A,E,C,G]       [A,E,I,*]
[E,F,G,H]  --\  [B,F,D,H]  --\  [B,F,J,*]
[I,J,K,L]  --/  [I,*,K,*]  --/  [C,G,K,*]
[*,*,*,*]       [J,*,L,*]       [D,H,L,*]
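
The two steps in the diagram can be sketched in plain Python. This is only an illustrative model, not the assembly itself: nested lists stand in for the four rows, and the helper names `transpose_2x2_blocks` and `swap_off_diagonal_blocks` are made up for this example.

```python
# Illustrative model only: nested Python lists stand in for the four rows.
# The helper names are hypothetical and exist just for this sketch.

def transpose_2x2_blocks(m):
    """Step 1: transpose each 2x2 submatrix of a 4x4 matrix."""
    out = [row[:] for row in m]
    for r in (0, 2):
        for c in (0, 2):
            # Swapping the two off-diagonal entries transposes a 2x2 block.
            out[r][c + 1] = m[r + 1][c]
            out[r + 1][c] = m[r][c + 1]
    return out


def swap_off_diagonal_blocks(m):
    """Step 2: exchange the top-right and bottom-left 2x2 submatrices."""
    out = [row[:] for row in m]
    for r in (0, 1):
        out[r][2:4] = m[r + 2][0:2]
        out[r + 2][0:2] = m[r][2:4]
    return out


rows = [list("ABCD"), list("EFGH"), list("IJKL"), list("****")]
step1 = transpose_2x2_blocks(rows)
step2 = swap_off_diagonal_blocks(step1)
for row in step2:
    print("".join(row))
```

Each step moves data the same way one group of instructions does: step 1 stays inside 128-bit halves, and step 2 moves whole 128-bit halves between rows.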

For the first step in ymm0 and ymm1, we use a pair of unpack instructions:

vunpcklpd %ymm1, %ymm0, %ymm4         // YMM4 = [A,E,C,G]
vunpckhpd %ymm1, %ymm0, %ymm5         // YMM5 = [B,F,D,H]

Row 3 stays in ymm2 for the moment as it doesn't need to be changed. Row 4 is obtained by unpacking ymm2 with itself:

vunpckhpd %ymm2, %ymm2, %ymm6         // YMM6 = [J,*,L,*]
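
To see why the unpack instructions produce these layouts, here is a rough Python model of their per-lane behaviour on 4-element lists (element 0 is bits 63:0). The functions are a simplified stand-in for the real instructions, and the arguments are in Intel source order, so `vunpcklpd(a, b)` corresponds to `vunpcklpd %b, %a, %dst` in AT&T syntax. The sample values are the ones from the question.

```python
# Simplified software model of the 256-bit unpack instructions.
# Lists hold four doubles; index 0 corresponds to bits 63:0.

def vunpcklpd(a, b):
    """Per 128-bit lane: low element of a, then low element of b."""
    return [a[0], b[0], a[2], b[2]]


def vunpckhpd(a, b):
    """Per 128-bit lane: high element of a, then high element of b."""
    return [a[1], b[1], a[3], b[3]]


# Values taken from the question (rows of the 3x4 matrix).
row0 = [73, 144, 168, 41]
row1 = [144, 348, 26, 144]
row2 = [732, 83, 144, 852]

lo01 = vunpcklpd(row0, row1)   # plays the role of [A,E,C,G]
hi01 = vunpckhpd(row0, row1)   # plays the role of [B,F,D,H]
hi22 = vunpckhpd(row2, row2)   # plays the role of [J,*,L,*]
print(lo01, hi01, hi22)
```

Note that the unpacks never move data across the 128-bit lane boundary, which is why a separate lane-level step is still needed afterwards.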

The second step is achieved by recombining 128-bit lanes with four vperm2f128 instructions. Immediate 0x20 joins the low lanes of both sources; 0x31 joins their high lanes:

vperm2f128 $0x20, %ymm2, %ymm4, %ymm0 // YMM0 = [A,E,I,*]
vperm2f128 $0x20, %ymm6, %ymm5, %ymm1 // YMM1 = [B,F,J,*]
vperm2f128 $0x31, %ymm2, %ymm4, %ymm2 // YMM2 = [C,G,K,*]
vperm2f128 $0x31, %ymm6, %ymm5, %ymm3 // YMM3 = [D,H,L,*]
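
As a sanity check, the whole transpose can be modelled in Python on the values from the question. This is a sketch of the instruction semantics as I understand them, not real SIMD code: it uses vperm2f128-style lane selection (immediates 0x20 and 0x31) for all four outputs, and the function arguments follow Intel source order (first source, second source).

```python
# Software model of a 3x4 -> 4x3 transpose built from unpack + lane permute.
# Lists hold four doubles; index 0 corresponds to bits 63:0.

def vunpcklpd(a, b):
    return [a[0], b[0], a[2], b[2]]


def vunpckhpd(a, b):
    return [a[1], b[1], a[3], b[3]]


def vperm2f128(a, b, imm8):
    """imm8[1:0] picks the low result lane, imm8[5:4] the high one.
    Selector 0/1 = low/high lane of a, 2/3 = low/high lane of b;
    bit 3 of each 4-bit field zeroes that lane instead."""
    lanes = [a[0:2], a[2:4], b[0:2], b[2:4]]

    def pick(field):
        if field & 0x8:
            return [0, 0]
        return lanes[field & 0x3]

    return pick(imm8 & 0xF) + pick((imm8 >> 4) & 0xF)


row0 = [73, 144, 168, 41]
row1 = [144, 348, 26, 144]
row2 = [732, 83, 144, 852]

t4 = vunpcklpd(row0, row1)
t5 = vunpckhpd(row0, row1)
t6 = vunpckhpd(row2, row2)

col0 = vperm2f128(t4, row2, 0x20)
col1 = vperm2f128(t5, t6, 0x20)
col2 = vperm2f128(t4, row2, 0x31)
col3 = vperm2f128(t5, t6, 0x31)

# Only the first three elements of each result are meaningful.
for col in (col0, col1, col2, col3):
    print(col[:3])
```

The fourth element of each result is a leftover "don't care" value, matching the * in the diagrams.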

This achieves the desired permutation in 7 instructions.

Note that as none of these instructions require AVX2, this code will run on a Sandy Bridge processor with just AVX.

fuz
  • Those first two `vperm2f128` instructions can be `vblendpd` immediate blends (any ALU port) instead of lane-crossing shuffles that compete for the shuffle port. Basically never use `vperm2f128` when you could blend instead. (Or with AVX512VL, possibly merge mask as part of the `vunpcklpd` shuffle and avoid a separate blend instruction). clang's shuffle optimizer will do that optimization for you if you'd written this with intrinsics. IDK why the OP wants hand-written asm; intrinsics would let you benefit from clang's shuffle optimizer to find missed optimizations. – Peter Cordes Sep 08 '20 at 17:04
  • @PeterCordes Thanks, I was unaware of this possibility. – fuz Sep 08 '20 at 17:05
  • @fuz -- I want to clarify one thing. Before this operation starts the data are in ymm4, 5 and 6. I think you believe they begin in ymm0, 1 and 2, because you show the first two instructions unpacking ymm0 and ymm1 into ymm4 and 5 (assuming I have read this 3-element AT&T syntax correctly). I rewrote the first part to read vunpcklpd ymm0,ymm4,ymm5 and vunpckhpd ymm1,ymm4,ymm5. Am I correct you expected the starting data in ymm0, 1 and 2? Is my rewrite consistent with your plan? Thanks for your help. – RTC222 Sep 08 '20 at 17:53
  • @RTC222 Yes. My answer expects the input in `ymm0`, `ymm1`, and `ymm2` and places the result in `ymm0` to `ymm3`, trashing `ymm4`, `ymm5`, and `ymm6` in the process. Feel free to renumber the registers as needed. – fuz Sep 08 '20 at 18:58