In x86 assembly, is it better to use two separate registers for imul?

Question

I am wondering, mostly out of curiosity, if using the same register for an operation is better than using two. What would be better, considering performance and/or other concerns?

mov %rbx, %rcx
imul %rcx, %rcx

or

mov %rbx, %rcx
imul %rbx, %rcx

Any tips for how to benchmark this, or resources where I could read about this type of thing would be appreciated, as I am new to assembly.

score 4 · Answer 1 · answered Jun 14 '16 at 06:51

On a modern processor, using one register for both source and destination versus using two different registers will never make any difference to performance. This is partly due to register renaming: if there were a performance difference, renaming would resolve it by mapping one of the registers to a different physical register and making your subsequent instructions use the new one (your processor actually has more registers than the instruction set can refer to, so that it can do things like this). It is also due to the nature of a pipelined processor's implementation: the contents of source registers are read at one pipeline stage and written at a later stage, which makes it difficult or impossible for register usage within a single instruction to cause the kind of interaction you're worrying about.

More problematic is an instruction that uses a value produced by the instruction just before it, but even that is (usually) solved by out-of-order execution.

This is a really vague way of saying "I don't think so" and linking to some useful wiki articles. Register renaming doesn't even enter into it: even without renaming, register read happens before writeback. This suggested code-sequence doesn't have multiple dep chains that reuse the same architectural register (which is when register renaming works its magic and breaks the dependency). — Peter Cordes, Jun 20 '16 at 08:23
Anyway, going to have to downvote this because it's actually wrong on both counts: 1. Intel P6-family CPUs like Nehalem are still widespread, and do have limited register-read ports. 2. the length of loop-carried dependency chains still matters with out-of-order execution. Both the OPs sequences have the same latency, but an alternative is possible. If you can reduce latency without making anything else worse, you should do it. If you're hand-writing asm in the first place (or reading compiler output, or writing a compiler), then everything potentially matters. — Peter Cordes, Jun 20 '16 at 08:25

score 4 · Accepted Answer · edited May 23 '17 at 10:33

resources where I could read about this type of thing

See Agner Fog's microarch pdf, and his optimizing assembly guide. Also other links in the x86 tag wiki (e.g. Intel's optimization manual).

The interesting option you didn't mention is:

mov   %rbx, %rcx
imul  %rbx, %rbx     # doesn't have to wait for mov to execute
# old value of %rbx is still available in %rcx

If the imul is on the critical path, and mov has non-zero latency (like on AMD CPUs, and Intel before IvyBridge), this is potentially better. The result of imul will be ready one cycle earlier, because it has no dependency on the result of the mov.

If, however, the old value is on the critical path and the squared value isn't, then this is worse because it adds a mov to the critical path.

Of course, it also means you have to keep track of the fact that your old variable is now live in a different register, and the old register has the squared value. If this is a problem in a loop, unroll it so you can end up with things where the top of the loop is expecting them. If you wanted this to be easy, you'd use a compiler instead of optimizing asm by hand.
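To make the register bookkeeping concrete, here is a minimal sketch that wraps two of the sequences in GNU extended inline asm and reports where the old and squared values end up. The names square_result, square_copy, square_original and check_variants_agree are hypothetical, invented for this illustration. It assumes GCC or Clang on x86-64 and falls back to plain C elsewhere. The compiler picks the actual registers, so %[x] and %[c] only stand in for %rbx and %rcx.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: where each value lives after the sequence. */
typedef struct {
    uint64_t old_value;  /* the unsquared input */
    uint64_t squared;    /* input * input, low 64 bits */
} square_result;

/* mov %rbx,%rcx ; imul %rcx,%rcx
 * The original register keeps the old value; the copy gets squared. */
square_result square_copy(uint64_t x)
{
    square_result r;
    uint64_t copy;
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__("mov  %[x], %[c]\n\t"
            "imul %[c], %[c]"
            : [c] "=&r"(copy)
            : [x] "r"(x)
            : "cc");
#else
    copy = x;
    copy *= copy;
#endif
    r.old_value = x;
    r.squared = copy;
    return r;
}

/* mov %rbx,%rcx ; imul %rbx,%rbx
 * Roles swap: the copy holds the old value and the original register
 * holds the square. The imul does not read the result of the mov. */
square_result square_original(uint64_t x)
{
    square_result r;
    uint64_t copy;
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__("mov  %[x], %[c]\n\t"
            "imul %[x], %[x]"
            : [x] "+r"(x), [c] "=&r"(copy)
            :
            : "cc");
#else
    copy = x;
    x *= x;
#endif
    r.old_value = copy;
    r.squared = x;
    return r;
}

/* Both orderings must produce the same pair of values. */
void check_variants_agree(uint64_t x)
{
    square_result a = square_copy(x);
    square_result b = square_original(x);
    assert(a.old_value == b.old_value);
    assert(a.squared == b.squared);
}
```

Calling check_variants_agree with a few inputs confirms that the two orderings differ only in which register holds which value, which is exactly the bookkeeping the surrounding code then has to account for.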

However, Intel P6-family CPUs (PPro/PII to Nehalem) have limited register-read ports, so it can be better to favour reading registers that you just wrote. If %rbx wasn't written in the last couple of cycles, it will have to be read from the permanent register file when the mov and imul uops go through the rename & issue stage (the RAT).

If they don't issue as part of the same group of 4, then they would each need to read %rbx separately. Since the register file in Core2/Nehalem only has 3 read ports, issue groups (quartets, as Agner Fog calls them) stall until all their not-recently-written input register values are read from the register file (at 3 per cycle, or 2 on Core2 if none of the 3 regs are index regs in an addressing mode).

For the full details, see Agner Fog's microarch pdf section 8.8. The Core2 section refers back to the PPro section. PPro has a 3-wide pipeline, so in that section Agner talks about triplets, not quartets.

If mov and imul issue together, they both share the same read of %rbx. There's a 3 in 4 chance of this happening on Core2/Nehalem.

Choosing just between the sequences you mention, the first one has a clear (but usually small) advantage over the second for Intel P6-family CPUs. There's no difference for other CPUs, AFAIK, so the choice is obvious.

mov   %rbx, %rcx
imul  %rcx, %rcx     # uses only the recently-written rcx; can't contribute to register-read stalls

worst of both worlds:

mov   %rbx, %rcx
imul  %rbx, %rcx     # can't execute until after the mov, but still reads a potentially-old register

If you're going to depend on a recently-written register, you might as well use only recently-written registers.

Intel Sandybridge-family uses a physical register file (like AMD Bulldozer-family), and doesn't have register-read stalls.

Ivybridge (2nd gen Sandybridge) and later also handle mov reg,reg at register rename time, with zero latency and no execution unit. This means that, as far as critical path length is concerned, it doesn't matter whether you imul rbx or rcx.

However, AMD Bulldozer-family can only handle xmm register moves in its rename stage; integer register moves still have 1c latency.

It's potentially still worth caring about which dependency chain the mov is part of, if latency is a limiting factor in the cycles per iteration of a loop.

how to benchmark this

I think you could put together a microbenchmark that has a register-read stall on Core2 with imul %rbx, %rcx, but not with imul %rcx, %rcx. However, that would require some trial and error to get the mov and imul to issue in different groups and, unless you're feeling really creative, probably some artificial-looking surrounding code that exists only to read lots of registers: e.g. lea (%rsi, %rdi, 1), %eax, or even add (%rsi, %rdi, 1), %eax, which has to read all three registers and does micro-fuse on Core2/Nehalem, so it only takes 1 uop slot in an issue group. (It doesn't micro-fuse on SnB-family.)
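A register-read stall is hard to provoke, but the latency effect of which dependency chain the mov sits on is easier to measure. The following is only a sketch, not a calibrated benchmark. The names chain_through_copy, chain_skipping_copy and seconds_for are hypothetical. It assumes GCC or Clang on x86-64 (with a plain C fallback elsewhere) and that you compile with optimization, so the loop-carried value stays in a register. Each loop squares a value repeatedly, unrolled by two; in the first the mov is part of the loop-carried chain, in the second it is not.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* The copy is ON the loop-carried chain: mov -> imul -> mov -> imul. */
uint64_t chain_through_copy(uint64_t x, long iterations)
{
    uint64_t tmp = 0;
    long i;
    assert(iterations % 2 == 0);  /* body performs two squarings */
    for (i = 0; i < iterations; i += 2) {
#if defined(__x86_64__) && defined(__GNUC__)
        __asm__ volatile("mov  %[x], %[t]\n\t"
                         "imul %[t], %[t]\n\t"
                         "mov  %[t], %[x]\n\t"
                         "imul %[x], %[x]"
                         : [x] "+r"(x), [t] "=&r"(tmp)
                         :
                         : "cc");
#else
        tmp = x; tmp *= tmp; x = tmp; x *= x;
#endif
    }
    (void)tmp;
    return x;
}

/* The copy is OFF the chain: only imul -> imul is loop-carried. */
uint64_t chain_skipping_copy(uint64_t x, long iterations)
{
    uint64_t tmp = 0;
    long i;
    assert(iterations % 2 == 0);
    for (i = 0; i < iterations; i += 2) {
#if defined(__x86_64__) && defined(__GNUC__)
        __asm__ volatile("mov  %[x], %[t]\n\t"  /* old value saved aside */
                         "imul %[x], %[x]\n\t"
                         "mov  %[x], %[t]\n\t"
                         "imul %[x], %[x]"
                         : [x] "+r"(x), [t] "=&r"(tmp)
                         :
                         : "cc");
#else
        tmp = x; x *= x; tmp = x; x *= x;
#endif
    }
    (void)tmp;
    return x;
}

/* CPU time for one run. The start value 3 is arbitrary; an odd value
 * converges to 1 modulo 2^64 after enough squarings. */
double seconds_for(uint64_t (*fn)(uint64_t, long), long iterations,
                   uint64_t *result)
{
    clock_t start = clock();
    *result = fn(3, iterations);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Run both with a large, even iteration count (for example a few hundred million) and compare the times. On a CPU where mov has non-zero latency, one would expect the first loop to take roughly one extra cycle per squaring; with mov handled at rename, the two should be about equal. Performance counters (e.g. via perf) give cleaner numbers than wall-clock or CPU time.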
