
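The exact C source isn't the point here; judging from the generated code, the loop ANDs a buffer in place with seven constant 32-byte masks, 224 bytes per iteration. A minimal sketch of that kind of loop (made-up names, AVX2 intrinsics, assuming the buffer is 32-byte aligned and its size is a multiple of 224 bytes; not the exact source) would be:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction: AND each of seven consecutive 32-byte chunks
   with its own constant mask, in place.  n is assumed to be a multiple of
   224 and buf to be 32-byte aligned. */
void and_masks_inplace(uint8_t *buf, size_t n, const __m256i mask[7])
{
    for (size_t i = 0; i < n; i += 224) {
        for (int j = 0; j < 7; j++) {
            __m256i v = _mm256_load_si256((const __m256i *)(buf + i + 32 * j));
            v = _mm256_and_si256(mask[j], v);
            _mm256_store_si256((__m256i *)(buf + i + 32 * j), v);
        }
    }
}
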
Have a look at this snippet of code that gcc generates for the loop.

.L11:
        vpand   ymm0, ymm1, YMMWORD PTR [rax]
        add     rax, 224
        vmovdqa YMMWORD PTR [rax-224], ymm0
        vpand   ymm0, ymm2, YMMWORD PTR [rax-192]
        vmovdqa YMMWORD PTR [rax-192], ymm0
        vpand   ymm0, ymm3, YMMWORD PTR [rax-160]
        vmovdqa YMMWORD PTR [rax-160], ymm0
        vpand   ymm0, ymm4, YMMWORD PTR [rax-128]
        vmovdqa YMMWORD PTR [rax-128], ymm0
        vpand   ymm0, ymm5, YMMWORD PTR [rax-96]
        vmovdqa YMMWORD PTR [rax-96], ymm0
        vpand   ymm0, ymm6, YMMWORD PTR [rax-64]
        vmovdqa YMMWORD PTR [rax-64], ymm0
        vpand   ymm0, ymm7, YMMWORD PTR [rax-32]
        vmovdqa YMMWORD PTR [rax-32], ymm0
        cmp     rax, rsi
        jb      .L11

It does what I want, but one thing I notice is that the result always goes through ymm0 before being stored to memory. I know that vpand cannot store its result directly to memory, but wouldn't this, for example, be more efficient?

.L11:
        vpand   ymm8, ymm1, YMMWORD PTR [rax]
        add     rax, 224
        vmovdqa YMMWORD PTR [rax-224], ymm8
        vpand   ymm9, ymm2, YMMWORD PTR [rax-192]
        vmovdqa YMMWORD PTR [rax-192], ymm9
        vpand   ymm10, ymm3, YMMWORD PTR [rax-160]
        vmovdqa YMMWORD PTR [rax-160], ymm10
        vpand   ymm11, ymm4, YMMWORD PTR [rax-128]
        vmovdqa YMMWORD PTR [rax-128], ymm11
        vpand   ymm12, ymm5, YMMWORD PTR [rax-96]
        vmovdqa YMMWORD PTR [rax-96], ymm12
        vpand   ymm13, ymm6, YMMWORD PTR [rax-64]
        vmovdqa YMMWORD PTR [rax-64], ymm13
        vpand   ymm0, ymm7, YMMWORD PTR [rax-32]
        vmovdqa YMMWORD PTR [rax-32], ymm0
        cmp     rax, rsi
        jb      .L11

This way, I think more operations can be done in parallel because no dependency is carried through ymm0.

It uses more registers, but this function as a whole uses all 16 ymm registers anyway, and after this part new values are loaded into each register, so using more registers is not really a problem.

I checked clang, and it produces the same code with just different register numbers.

Can I expect a visible speedup if I write the assembly directly, using the second style of register allocation?

I could test and see, but writing assembly directly is not an easy task for me, and there are other parts to work on, so I'm asking this question first to better understand whether using different registers for consecutive memory stores can actually improve performance.

xiver77
  • Modern processors employ *register renaming* and do not have false dependencies from reusing the same register for an unrelated calculation. You will likely not observe any speedup from choosing different registers. – fuz Jan 24 '22 at 17:02
  • Even if the renaming were somehow limited (due to fancy low-level processor tricks), the code will be bound by the store port on all Intel architectures prior to the quite recent Icelake/Sunny-Cove (due to saturation of the single store port). The same thing also applies to all AMD Zen processors prior to Zen3 (which is quite recent too). Still, I am wondering if this always results in no performance impact on modern processors (for example, the renaming may cause more transistors to switch, resulting in more heat and a lower frequency boost, although it should be a small effect if any). – Jérôme Richard Jan 24 '22 at 21:15
  • @JérômeRichard: Register renaming happens whenever you write a register, whether or not it's been used recently. The back end uses physical register numbers, the front-end uses architectural register numbers. Every instruction that writes a register has to update the RAT, and read the RAT to find out which physical reg its inputs are coming from. So I expect power is more or less independent of register reuse. – Peter Cordes Jan 24 '22 at 21:37
  • The only useful thing a compiler might have done is to run a few loads before any of the stores, so they can execute and retire sooner, i.e. do a bit of instruction scheduling to hide some load-use latency instead of leaving it all to out-of-order execution (see the sketch after these comments). Also letting the pipeline see new loads ASAP when it first starts to execute this loop. – Peter Cordes Jan 24 '22 at 21:40
  • @PeterCordes Thank you for the information! As for the loads, I expected the same thing in similar hot loops, but so far my attempts to improve performance in such cases have been a failure. I guess this is because the ordering of instructions does not matter a lot (at least with many iterations and no issues like tricky dependencies) and OOO execution does a very good job in hot loops on modern x86 processors. Isn't OOO execution supposed to be always active, like register renaming? – Jérôme Richard Jan 24 '22 at 22:52
  • @JérômeRichard: Right, mainstream "big core" CPUs will have little trouble hiding that latency, especially in cases that benchmark consistently (high iteration count, and the other logical thread doing nothing). Since this is AVX2 code, it can't run on early Atom / Silvermont cores with small OoO windows, or even Jaguar. Alder Lake E-cores have pretty big ROBs. If it ever matters, it might be just a couple cycles of extra / better overlap with surrounding code, out of thousands to millions. – Peter Cordes Jan 24 '22 at 23:33
  • Yes, OoO exec is always active; the idea is to help it "see" farther by statically placing instructions that can start sooner earlier in the code. Also maybe to reduce possible penalties for memory disambiguation. (Figuring out if a load depends on an earlier store or not.) Hmm, or does it actually *help* avoid 4k aliasing problems by interleaving load/store, rather than 4x load / 4x store? Anyway, with longer ALU dep chains, OoO has limits, see [Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths](https://stackoverflow.com/q/51986046) for 2x imul dep chains. – Peter Cordes Jan 24 '22 at 23:37
  • @JérômeRichard: In this case, code-size savings from using 2-byte VEX prefixes instead of 3-byte to handle ymm8..15 might be better than any minuscule speedups from helping out-of-order exec when it doesn't really need help. Small loops make scheduling pretty redundant; if this was just running once as part of something larger, 4 loads to start with could all execute in the first 2 cycles, and it's pointless to have the later VPAND and store instructions in the back end until they have data ready. (Except the store-address uop can execute early and start resolving a TLB miss if needed; hmm.) – Peter Cordes Jan 24 '22 at 23:42
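
As a follow-up to the instruction-scheduling point in the comments: at the C/intrinsics level, issuing all of an iteration's loads before any of the AND+store pairs could look roughly like the sketch below. This is only an illustration under the same assumptions as the earlier sketch (made-up names, buffer aligned, size a multiple of 224 bytes); the compiler is free to re-schedule it, and out-of-order execution normally hides this latency on its own, so no speedup is promised.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of the same loop: all seven loads are issued before
   any AND/store, i.e. the instruction scheduling is done by hand. */
void and_masks_loads_first(uint8_t *buf, size_t n, const __m256i mask[7])
{
    for (size_t i = 0; i < n; i += 224) {      /* n assumed multiple of 224 */
        __m256i v[7];
        for (int j = 0; j < 7; j++)            /* all loads first */
            v[j] = _mm256_load_si256((const __m256i *)(buf + i + 32 * j));
        for (int j = 0; j < 7; j++)            /* then AND + store */
            _mm256_store_si256((__m256i *)(buf + i + 32 * j),
                               _mm256_and_si256(mask[j], v[j]));
    }
}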

0 Answers