For example, have a look at this loop generated by gcc:
.L11:
vpand ymm0, ymm1, YMMWORD PTR [rax]
add rax, 224
vmovdqa YMMWORD PTR [rax-224], ymm0
vpand ymm0, ymm2, YMMWORD PTR [rax-192]
vmovdqa YMMWORD PTR [rax-192], ymm0
vpand ymm0, ymm3, YMMWORD PTR [rax-160]
vmovdqa YMMWORD PTR [rax-160], ymm0
vpand ymm0, ymm4, YMMWORD PTR [rax-128]
vmovdqa YMMWORD PTR [rax-128], ymm0
vpand ymm0, ymm5, YMMWORD PTR [rax-96]
vmovdqa YMMWORD PTR [rax-96], ymm0
vpand ymm0, ymm6, YMMWORD PTR [rax-64]
vmovdqa YMMWORD PTR [rax-64], ymm0
vpand ymm0, ymm7, YMMWORD PTR [rax-32]
vmovdqa YMMWORD PTR [rax-32], ymm0
cmp rax, rsi
jb .L11
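In C terms, the loop is roughly equivalent to the sketch below. It is not the actual source: the names `and_blocks`, `data`, `nblocks`, and `masks` are made up for illustration, and it assumes that the seven masks are the values held in ymm1 to ymm7 and that the buffer is a whole number of 224-byte blocks. The scalar version also ignores the 32-byte alignment that vmovdqa requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    WORDS_PER_VEC   = 4,  /* 4 x 64-bit words = one 32-byte ymm register */
    VECS_PER_BLOCK  = 7,  /* seven masks (ymm1..ymm7 in the listing) */
    WORDS_PER_BLOCK = WORDS_PER_VEC * VECS_PER_BLOCK  /* 28 words = 224 bytes */
};

/* Hypothetical helper: AND every 224-byte block of `data`, in place,
 * with the same seven 32-byte masks. One outer iteration corresponds
 * to one trip through the .L11 loop. */
void and_blocks(uint64_t *data, size_t nblocks,
                const uint64_t masks[WORDS_PER_BLOCK])
{
    assert(nblocks == 0 || (data != NULL && masks != NULL));

    for (size_t b = 0; b < nblocks; ++b) {
        uint64_t *block = data + b * WORDS_PER_BLOCK;

        /* Load, AND with the matching mask word, store back. */
        for (size_t i = 0; i < WORDS_PER_BLOCK; ++i)
            block[i] &= masks[i];
    }
}
```

Each load/AND/store triple is independent of the others; the only thing the seven pairs in the assembly share is the name of the temporary register.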
It does what I want, but I notice that every result is written to ymm0 before being stored to memory. I know that vpand cannot take a memory destination, so some register is needed for the result, but wouldn't something like this be more efficient?
.L11:
vpand ymm8, ymm1, YMMWORD PTR [rax]
add rax, 224
vmovdqa YMMWORD PTR [rax-224], ymm8
vpand ymm9, ymm2, YMMWORD PTR [rax-192]
vmovdqa YMMWORD PTR [rax-192], ymm9
vpand ymm10, ymm3, YMMWORD PTR [rax-160]
vmovdqa YMMWORD PTR [rax-160], ymm10
vpand ymm11, ymm4, YMMWORD PTR [rax-128]
vmovdqa YMMWORD PTR [rax-128], ymm11
vpand ymm12, ymm5, YMMWORD PTR [rax-96]
vmovdqa YMMWORD PTR [rax-96], ymm12
vpand ymm13, ymm6, YMMWORD PTR [rax-64]
vmovdqa YMMWORD PTR [rax-64], ymm13
vpand ymm0, ymm7, YMMWORD PTR [rax-32]
vmovdqa YMMWORD PTR [rax-32], ymm0
cmp rax, rsi
jb .L11
This way, I think more operations can be done in parallel because no dependency is carried through ymm0.
It uses more registers, but the function as a whole uses all 16 ymm registers anyway, and after this part new values are loaded into each register, so using more registers here is not really a problem.
I checked clang, and it produces the same code as gcc, just with different register numbers.
Can I expect a measurable speedup if I write this loop directly in assembly using the second register allocation?
I could test this myself, but writing assembly by hand is not easy for me, and there are other parts to work on. So I'm asking first to better understand whether using different registers for consecutive memory stores can actually improve performance.
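For the comparison itself, a minimal timing harness along these lines could be used once both variants exist. This is only a sketch under assumptions: `kernel_fn`, `seconds_per_call`, and `sample_kernel` are hypothetical names, `sample_kernel` is a stand-in for the real loop, and `clock()` measures CPU time at coarse resolution, so the buffer and repetition count would need to be large enough for the difference to show.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature shared by the variants being compared. */
typedef void (*kernel_fn)(uint64_t *data, size_t nwords);

/* Stand-in kernel: keeps only the low byte of every word. */
void sample_kernel(uint64_t *data, size_t nwords)
{
    for (size_t i = 0; i < nwords; ++i)
        data[i] &= UINT64_C(0xFF);
}

/* Average CPU seconds per call of `kernel` over `reps` calls.
 * Returns 0.0 if reps is not positive. */
double seconds_per_call(kernel_fn kernel, uint64_t *data,
                        size_t nwords, int reps)
{
    if (reps <= 0)
        return 0.0;

    kernel(data, nwords);  /* warm-up call so caches are primed */

    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        kernel(data, nwords);
    clock_t stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}
```

Calling `seconds_per_call` once per variant on the same buffer, with the same `reps`, gives two numbers to compare; several runs are needed before trusting a small difference.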