1

I've just started learning Assembly via Raspbian and have a quick question: how efficient is saving register space in Assembly? For example, if I wanted to do a quick addition, is there a meaningful difference in

mov r1, #5
mov r2, #3
add r1, r1, r2

and

mov r1, #5
mov r2, #3
add r3, r1, r2     @ destination in a new register that wasn't previously used

(except storing in different registers)?

Peter Cordes
Dong Jun
    Depends. Which microarchitecture? Are you using caller-saved registers? Do you need the previous values held in the source registers? Can you easily recompute those when needed? Too broad. – EOF Feb 06 '19 at 04:44
  • The basic efficiency ladder: no code < code using registers < code using memory loads < code using memory stores < code using peripheral reads < code using peripheral writes ... and the growth of inefficiency is closer to exponential than linear, i.e. it is far better to re-read values from memory (if register space is exhausted) than to re-read them from disk; and the other way round, if you can avoid a memory store at the price of a couple more register-only instructions, staying register-only will very likely be more efficient. Of course only profiling knows the true numbers; write concisely. – Ped7g Feb 06 '19 at 08:00
  • But AFAIK on ARM CPUs all general-purpose registers are "equal", i.e. there is no direct difference between `r1` and `r3`... so the indirect cost of how the surrounding code must change to cope with your register allocation is the only thing to consider in your particular example; if the surrounding code is ignored, your examples are equal in efficiency. – Ped7g Feb 06 '19 at 08:02
  • Also AFAIK ARM GPRs are equal; there is no microcoding, it's just logic. The key here has nothing to do with the architecture; it has to do with conservation of registers. Both of these code snippets are efficient and equal, depending on how many other registers are used and how they are used. If you are trying to stay within the four scratch registers of the popular calling convention, then you might want to save r3 for something else. – old_timer Feb 06 '19 at 15:47
  • @old_timer Whether the processor is microcoded or not is irrelevant. Also, note that particularly ARM has a nice example of registers mattering. In pre-ARMv6 (AFAIR), multiplication instructions may not have the destination register be the same as one of the source registers. – EOF Feb 06 '19 at 19:35
  • @EOF It does matter for processors that are. Understand that a number of those rules are due to bugs in cores already laid out for a foundry, which is how it worked back then. Some of the "unpredictable results" were in fact predictable, and were how they would test whether your clone was stolen and/or how well you cloned it... – old_timer Feb 07 '19 at 03:54

3 Answers

4

Using the same register for output as input has no inherent disadvantage on ARM^1. I don't think there's any inherent advantage either, though. Things can get more interesting in the general case when we're talking about writing registers that the instruction didn't already have to wait for (i.e. not inputs).

Use as many registers as you need to save instructions. (Be aware of the calling convention, though: if you use more than r0..r3 you'll have to save/restore the extra ones you use, if you want to call your function from C). Specifically, normally optimize for the lowest dynamic instruction count; doing a bit of extra setup / cleanup to save instructions inside loops is normally worth it.
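To sketch the calling-convention caveat (AAPCS assumed; the function and helper names here are made up for illustration):

```armasm
@ Hypothetical function that needs a value to survive a call: it uses
@ r4 (callee-saved under AAPCS), so it must preserve it for its caller.
my_func:
    push  {r4, lr}        @ save callee-saved r4 plus the return address
    mov   r4, r0          @ stash the argument somewhere call-proof
    bl    helper          @ helper may clobber r0-r3 and lr, but not r4
    add   r0, r4, r0      @ combine the saved value with helper's result
    pop   {r4, pc}        @ restore r4 and return in one instruction
```

Sticking to r0-r3 avoids the push/pop entirely, which is why leaf functions often try to live within the scratch registers.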

And not just to save instructions: software pipelining to hide load latency is potentially valuable on pipelined in-order execution CPUs. e.g. if you're looping over an array, load the value you'll need 2 iterations from now into a register, and don't touch it until then. (Unroll the loop). An in-order CPU can only start instructions in order, but they can potentially complete out of order. e.g. a load that misses in cache doesn't stall the CPU until you try to read it when it's not ready. I think you can assume that high-performance in-order CPUs like modern ARMs will have whatever scoreboarding is necessary to track which registers are waiting for an ALU or load result to be ready.
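A minimal sketch of that idea, summing an array while keeping one load "in flight" (real software pipelining might run further ahead; register choices here are arbitrary and the count is assumed to be at least 1):

```armasm
@ r0 = array pointer, r1 = element count (>= 1), r2 = running sum
    ldr   r3, [r0], #4    @ prime: start the first load before the loop
    subs  r1, r1, #1
    beq   done
loop:
    ldr   r4, [r0], #4    @ start the *next* load early...
    add   r2, r2, r3      @ ...while consuming the previously loaded value
    mov   r3, r4          @ rotate the in-flight value into place
    subs  r1, r1, #1
    bne   loop
done:
    add   r2, r2, r3      @ consume the last loaded value
```

The point is that each `ldr` has a whole loop iteration to complete before anything reads its result.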

Without actually going full software-pipelining, you can sometimes get similar results by doing a block of loads, then stuff, then a block of stores. e.g. a memcpy optimized for big copies might load 12 registers in its main unrolled loop, then store those 12 registers. So the distance between a load and store of the same register is still large enough to hide L1 cache load latency at least.
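A rough sketch of that load-block / store-block shape using load/store-multiple (classic 32-bit ARM; assumes the byte count in r2 is a non-zero multiple of 32, and that r4-r11 were already saved if the calling convention requires it):

```armasm
@ r0 = src, r1 = dst, r2 = byte count (multiple of 32 assumed)
copy_loop:
    ldmia r0!, {r4-r11}   @ load 8 registers (32 bytes) from src...
    stmia r1!, {r4-r11}   @ ...then store them all, so each register's
                          @ store is well separated from its load
    subs  r2, r2, #32
    bne   copy_loop
```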


Current(?) Raspberry Pi boards (RPi 3+) use ARM Cortex-A53 cores, a 2-wide superscalar in-order microarchitecture.

Any ARM core (like Cortex-A57) that does out-of-order execution will use register renaming to make WAW (write-after-write) and WAR hazards a non-issue. (https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Data_hazards).

On an in-order core like A53, WAR is definitely a non-issue: there's no way a later instruction can write a register before an earlier instruction has a chance to read its operand from there.

But a WAW hazard could limit the ability of the CPU to run two instructions at once. This would only be relevant when writing a register you didn't already read. add r1, r1, r2 has to wait for r1 to be ready before it can even start executing, because it's an input.

For example, if you had this code, we might actually see a negative performance effect from writing the same output register in 2 instructions that might run in the same cycle. I don't know how Cortex-A53 or any other in-order ARM handles this, but another dual-issue in-order CPU (Intel P5 Pentium from 1993) doesn't pair instructions that write to the same register (Agner Fog's x86 uarch guide). The 2nd one has to wait a cycle before starting (but can maybe pair with the instruction after that).

@ possible WAW hazard
adds  r3, r1, r2      @ set flags, we don't care about the r3 output
add   r3, r1, #5      @ now actually calculate an integer result we want

If you'd used a different dummy output register, these could both start in the same clock cycle. (Or if you'd used cmn r1, r2 (compare-negated), you could have set flags from r1 - (-r2) without writing an output register at all; according to the manual that's the same as setting flags from r1 + r2.) But there's probably some case you can come up with that can't be replaced with a cmp, cmn, tst (ANDS), or teq (EORS) instruction.
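The `cmn` rewrite of the snippet above would look something like this (a sketch; whether the pair actually dual-issues depends on the core):

```armasm
cmn   r1, r2          @ set flags as if from r1 + r2, no GPR written
add   r3, r1, #5      @ no WAW hazard now: no register written twice
```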

I'd expect that an out-of-order ARM could rename the same register multiple times in the same cycle (OoO x86 CPUs can do that) to fully avoid WAW hazards.


I'm not aware of any microarchitectural benefit to leaving some registers "cold".

On a CPU with register renaming, normally that's done with a physical register file, and even a not-recently-modified architectural register (like r3) will need a PRF entry to hold the value of whatever instruction last wrote it, no matter how long ago that was. So writing a register always allocates a new physical register and (eventually) frees up the physical register holding the old value, regardless of whether the old value was also just written or has been sitting there for a long time.

Intel P6-family did use a "retirement register file" that holds the retirement state separately from "live" values in the out-of-order back-end. But it kept those live register values right in the ROB with the uop that produced them (instead of a reference to a PRF entry), so it couldn't run out of physical registers for renaming before the back-end was full. See http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ for some interesting experiments measuring ROB vs. PRF limits on out-of-order window size for other x86 CPUs that do use a PRF.

In fact, due to limited read ports on the retirement register file, P6-family (PPro through Nehalem) can actually stall when reading too many registers that haven't been written recently, in one issue group. (See Agner Fog's microarch guide, register read stalls.) But I don't think this is a typical problem on other uarches, like any out-of-order ARM cores. Set up constants / loop invariants in registers outside loops, and freely use them inside.
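The usual shape of that pattern, sketched (register choices arbitrary; assumes r1/r2 point at input/output arrays and r3 holds the element count):

```armasm
    ldr   r5, =0x12345    @ loop-invariant constant, set up once outside
loop:
    ldr   r0, [r1], #4
    add   r0, r0, r5      @ read the long-lived register freely inside
    str   r0, [r2], #4
    subs  r3, r3, #1
    bne   loop
```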


Footnote 1: this is generally true across all architectures, but there are exceptions. The only one I know of is a pretty special case: on recent Intel x86 CPUs (in 64-bit mode), mov eax, eax (1 cycle latency) is slower than mov ecx, eax (0 cycle latency) for truncating a 64-bit register to 32 bits, because mov-elimination only works between different registers. (Can x86's MOV really be "free"? Why can't I reproduce this at all?)

Peter Cordes
  • I agree with your point that the procedure calling standard is relevant to this question. The OP should consult the AAPCS – Elliot Alderson Feb 06 '19 at 15:32
  • On old-ish Intel x86, leaving registers "cold" (rarely writing to them but frequently reading from them) can even *harm* performance, according to Agner Fog, since the register file can't support as many reads as the bypass network. – EOF Feb 06 '19 at 19:41
  • @EOF: oh yes, I was only thinking of the entirely-untouched case, good point about reg-read stalls for reading cold registers. It's a surprisingly low limit, like 2 or 3 cold regs per group of 3 or 4 uops, so base+idx for looping through an array could suck on P6-family, too, for totally different reasons than SnB-family. With 64-bit adding more regs, and more focus on FP performance (where it's not rare to have many constants in regs), and more non-destructive instructions like VEX encodings, that bottleneck definitely had to go. – Peter Cordes Feb 06 '19 at 22:39
2

At risk of being shot down by someone who knows a lot more about the theoretical aspects: using more registers can be faster. This is one reason there is pressure on an architecture design to include more registers (compare T32/A32/A64 for the range of addressable core registers as the architectural implementation cost increases).

At the architectural level, core registers are all equivalent, so long as the opcode can address them - e.g. some Thumb instructions only permit access to the lower 8 registers.

At the micro-architectural level, it would be very unusual to give certain registers preferential treatment. One example of preferential treatment at the architectural level is the exception entry push/pop behaviour of ARMv7-M and related profiles. A compiler can take advantage of this optimisation quite easily (by avoiding inserting some shim code).

The higher performance processors actually include more physical registers than architectural registers, and automatically allocate these to provide some of the performance benefits of having more logical registers.

In your example, the first code fragment explicitly indicates to the CPU that the old r1 value will never be used in the future. In the 2nd code fragment, you have left r1 == 5 live for the rest of time - there is no way to look ahead and predict whether you will ever use it again.

So:

  • More registers allows for more fast data (single-cycle), and potential out-of-order execution
  • Re-using a register might activate interlocks in a wide issue machine without register renaming
  • Re-using a register can break the dependency chains and free up more physical registers on higher performance processors.
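The dependency-chain point can be illustrated by summing four values (register choices are arbitrary):

```armasm
@ serial: each add must wait for the previous one through r1
add   r1, r1, r2
add   r1, r1, r3
add   r1, r1, r4

@ split into two independent chains using a spare register
add   r1, r1, r2
add   r5, r3, r4      @ independent of the first add: may issue in parallel
add   r1, r1, r5
```

Same work, but the second version has a shorter critical path on any core that can start two independent instructions at once.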

For A53, I guess there is no difference at all, until your software runs out of registers (unless you want that value of 5 later on).

Sean Houlihane
  • In computer-architecture terminology, register renaming avoids WAR (write-after-read) and WAW hazards for out-of-order execution. But yes, +1, use all the registers for high-performance loops, saving instructions is always a win over any minor microarchitectural effects from letting the CPU free a physical reg earlier. But remember, all the architectural regs have to map to a physical reg at any time, even if they're cold. (Unless there's a separate retirement register file. Intel P6-family used this, but kept OoO-exec reg values right *in* the ROB instead of a physical register file). – Peter Cordes Feb 06 '19 at 10:13
  • https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Data_hazards and https://en.wikipedia.org/wiki/Register_renaming. And BTW, avoiding write after write or read is only going to possibly help avoid stalls (interlocks) if you don't also read the register. e.g. `add r1, r1, r2` already has to stall until `r1` is ready for `add` to read its input, so the write is irrelevant. – Peter Cordes Feb 06 '19 at 10:16
1

With ARM the efficiency comes primarily from the calling convention, outside normal pipeline stuff (does add xx, r1, r2 have to stall for mov r2, xx to complete?).

With so little code, both chunks are the right solution; it depends on the problem. If you are trying to avoid using the stack and to stay within four registers' worth of information under the popular calling convention, re-using a register rather than burning another may or may not be right.

All other factors held constant, and not counting anything in the pipeline design, there is nothing magical about ARM that will limit you here. It's not a microcoded design like a CISC, where you may have specific performance rules for specific cores. Any processor can have pipeline rules even with a single register file and no microcoding, but the registers should be equal on ARM.

And ARM is easy to test to see whether you have a performance hit here, but you have to be careful with your benchmark not to end up measuring something else and thinking it's the instruction under test.

old_timer