I have three ymm registers -- ymm4, ymm5 and ymm6 -- each packed with four double-precision (qword) floats, shown here with the lowest element, bits [63:0], first:
ymm4: 73 144 168 41
ymm5: 144 348 26 144
ymm6: 732 83 144 852
I want to gather each column of the matrix above into its own register. For example, for the first column:
-- extract ymm4[63:0] and insert it at ymm0[63:0]
-- extract ymm5[63:0] and insert it at ymm0[127:64]
-- extract ymm6[63:0] and insert it at ymm0[191:128]
so that ymm0 reads 73, 144, 732.
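To pin down the result I am after, here is a scalar C model of the column gather. It is not SIMD code, only a statement of the intended data movement. The helper name gather_column and the choice to zero the unused fourth element are my own assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model only: each array stands in for one ymm register,
   with index 0 being bits [63:0]. gather_column is a hypothetical
   helper name, not an intrinsic. */
void gather_column(const double r4[4], const double r5[4],
                   const double r6[4], size_t col, double out[4])
{
    assert(col < 4);      /* a ymm register holds four doubles */
    out[0] = r4[col];     /* -> ymm0[63:0]    */
    out[1] = r5[col];     /* -> ymm0[127:64]  */
    out[2] = r6[col];     /* -> ymm0[191:128] */
    out[3] = 0.0;         /* ymm0[255:192] is unused; zeroing it is an assumption */
}
```

Calling it with col = 0 on the three rows above should fill out with 73, 144, 732, which is the first column I describe.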
So far I have used:
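mov rax,4                      ; mask 0b0100: write only element 2, bits [191:128]
kmovq k6,rax
vpxor ymm1,ymm1,ymm1           ; clear ymm1 (redundant with {z} below)
vexpandpd ymm1{k6}{z},ymm6     ; ymm6[63:0] -> ymm1[191:128], other elements zeroed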
That causes ymm1 to read [ 0 0 732 0 ] (lowest element first), so I've accomplished the ymm6 step: 732, the element at [63:0] in ymm6, now sits at [191:128].
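As I understand the documented behaviour of vexpandpd with zero-masking, it can be modelled in scalar C as below. This is only a sketch for four doubles, and expand_pd_z is a made-up helper name, not an intrinsic.

```c
#include <assert.h>

/* Scalar model of VEXPANDPD ymm{k}{z}, ymm on four doubles.
   Consecutive low elements of src are written to the positions whose
   mask bit is set; with {z}, all other positions become zero.
   expand_pd_z is a hypothetical name used only for illustration. */
void expand_pd_z(const double src[4], unsigned mask, double dst[4])
{
    assert(mask < 16u);          /* four elements, so four mask bits */
    int next = 0;                /* index of the next unread source element */
    for (int j = 0; j < 4; ++j) {
        if (mask & (1u << j))
            dst[j] = src[next++];  /* selected position: take next source element */
        else
            dst[j] = 0.0;          /* unselected position: zeroed by {z} */
    }
}
```

With src = ymm6 and mask = 4 the model gives 0, 0, 732, 0. Since every unselected position is zeroed by {z}, the vpxor before the vexpandpd looks unnecessary in this model.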
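For ymm4 I use vblendpd:
vblendpd ymm0,ymm1,ymm4,1
That causes ymm0 to read [ 73 0 732 ], so I've accomplished the second step because 73 is the element at [63:0] in ymm4.
Now I need to put ymm5[63:0] at ymm0[127:64]. vblendpd cannot do that, because it only selects between elements at the same position (an imm8 of 2 would pick ymm5[127:64], which is 348), so I use vunpcklpd to move the low element up:
vunpcklpd ymm0,ymm0,ymm5
That causes ymm0 to read [ 73 144 732 ] in its low three elements (the fourth, which I ignore, picks up 26 from ymm5[191:128]), so now I am finished with the first column [63:0].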
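Since the imm8 of vblendpd is the part I find opaque, here is how I read it, written as a scalar C model. This is a sketch of the 256-bit register form only, and blend_pd is a hypothetical helper name, not an intrinsic.

```c
#include <assert.h>

/* Scalar model of VBLENDPD ymm, ymm, ymm, imm8 on four doubles.
   blend_pd is a hypothetical name used only for illustration. */
void blend_pd(const double src1[4], const double src2[4],
              unsigned imm8, double dst[4])
{
    assert(imm8 < 16u);   /* the model accepts only bits 3:0, the ones the ymm form uses */
    for (int i = 0; i < 4; ++i) {
        /* bit i set: take element i from src2, otherwise from src1 */
        dst[i] = (imm8 & (1u << i)) ? src2[i] : src1[i];
    }
}
```

With src1 = [ 0 0 732 0 ], src2 = ymm4 and imm8 = 1, the model gives 73, 0, 732, 0. In this model bit i of imm8 only ever chooses between the two sources at position i; it never moves an element to a different position, so an imm8 of 2 selects element 1 of the second source.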
But now I need to do the same thing with columns 2, 3 and 4 in the ymm registers. Before I add more instructions, is this the most efficient way to do what I described? Is there another, more effective way?
I have investigated unpckhpd (https://www.felixcloutier.com/x86/unpckhpd), vblendpd (https://www.felixcloutier.com/x86/blendpd), and vshufpd (https://www.felixcloutier.com/x86/shufpd). What I show above seems like the best solution, but it takes a lot of instructions, and the imm8 encodings shown in the docs are somewhat opaque. Is there a better way to extract the corresponding columns of three ymm registers?