
I was playing with some C code to inspect how my compiler (main-git Clang, in this case) handles function parameters with respect to the stack in the ARM ABI. I discovered that this function:

int test(int a, int b, int c, int d, int e, int f, int g) {
  return a + b + c + d + e + f + g;
}

gets translated, with -O0, into:

08000298 <test>:
 8000298:   b084        sub sp, #16
 800029a:   f8dd c018   ldr.w   ip, [sp, #24]
 800029e:   f8dd c014   ldr.w   ip, [sp, #20]
 80002a2:   f8dd c010   ldr.w   ip, [sp, #16]
 80002a6:   9003        str r0, [sp, #12]
 80002a8:   9102        str r1, [sp, #8]
 80002aa:   9201        str r2, [sp, #4]
 80002ac:   9300        str r3, [sp, #0]
 80002ae:   9803        ldr r0, [sp, #12]
 80002b0:   9902        ldr r1, [sp, #8]
 80002b2:   4408        add r0, r1
 80002b4:   9901        ldr r1, [sp, #4]
 80002b6:   4408        add r0, r1
 80002b8:   9900        ldr r1, [sp, #0]
 80002ba:   4408        add r0, r1
 80002bc:   9904        ldr r1, [sp, #16]
 80002be:   4408        add r0, r1
 80002c0:   9905        ldr r1, [sp, #20]
 80002c2:   4408        add r0, r1
 80002c4:   9906        ldr r1, [sp, #24]
 80002c6:   4408        add r0, r1
 80002c8:   b004        add sp, #16
 80002ca:   4770        bx  lr

Notice the initial loads into ip. Why are they there? They seem useless to me, even with -O0. (-O0 explains why it spills a to d from r0-r3 to the stack on function entry and reloads them later when needed. But these loads of the stack args aren't doing anything.)
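To make the offsets in the listing concrete, here is a minimal sketch of the arithmetic, assuming 4-byte ints, four register arguments (r0-r3), and the 16-byte frame reserved by `sub sp, #16`. The helper name `param_offset_after_prologue` is hypothetical and only mirrors what this particular disassembly shows; it is not a general model of clang's frame layout.

```c
/* Assumptions taken from the listing above: 4 register args,
   4-byte words, and a 16-byte local frame (sub sp, #16). */
enum { REG_ARGS = 4, WORD = 4, FRAME_BYTES = REG_ARGS * WORD };

/* Hypothetical helper: byte offset from sp (after the prologue) of the
   idx-th int parameter of test(), where 0 is a and 6 is g. */
static int param_offset_after_prologue(int idx)
{
    if (idx < REG_ARGS) {
        /* a..d arrive in r0..r3 and are spilled to sp+12, +8, +4, +0. */
        return (REG_ARGS - 1 - idx) * WORD;
    }
    /* e..g were stored by the caller, so they sit just above
       the callee's 16-byte frame: sp+16, +20, +24. */
    return FRAME_BYTES + (idx - REG_ARGS) * WORD;
}
```

With these assumptions, e, f and g land at sp+16, sp+20 and sp+24, which are exactly the three addresses the `ldr.w ip, ...` instructions read.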

I don't understand what ip is used for in this case, since it's never read afterwards and it isn't acting as a frame pointer (which, iiuc, is its normal use?). Thanks!

Peter Cordes
  • You have told the compiler to disable its brains with `-O0`. Why are you surprised that it generated poor code? – fuz Apr 24 '23 at 15:23
  • That's not my doubt, as I hinted in the last part of the question. My question is: why should it produce such a load in the first place? – Alessandro Bertulli Apr 24 '23 at 15:25
  • The `ip` register, aka `r12`, is a scratch register. (Call stubs use it, though just plain scratch for functions). Clang is gratuitously and harmlessly loading the 3 memory-based parameters, but why I couldn't say; I think you'd need a clang expert to explain. Clang on godbolt is doing the same except it names the register `r12` instead of `ip`. It loads them even if the body is `return a;` – Erik Eidt Apr 24 '23 at 16:26
  • The first 4 arguments are passed in registers according to the ABI. The final 3 are passed on the stack. There is probably some obscure standard issue that ensures parameters are accessible before the routine begins. I.e., a fault may occur to cause a load of memory from a disk, etc. It also goes and saves the registers to the stack and then reloads them. You said `-O0`; compilers used to generate code like this. If you just translate 'LALR()' type parsing to a stack machine, this is the code you get. I.e., no attempt to use registers. – artless noise Apr 24 '23 at 17:54
  • The way compilers work is to first apply some correct boilerplate, then improve that with optimization. It is probably visiting each argument to ensure that it is copied into memory, so doing *something* like a=a, b=b, c=c, d=d... where the reference on the right side of the `=` is to the register and the reference on the left side of `=` is to the memory location. But first it generates g=g, f=f, e=e, and it probably wanted to store them back to memory, though even at -O0, it realizes that there's no need to store them back to the memory they had just come from. – Erik Eidt Apr 24 '23 at 18:44
  • Remove `int g` and it should become two loads, then remove `f` and it becomes one ... – old_timer Apr 25 '23 at 02:24

1 Answer


Dummy loads of the stack args in the prologue seem to just be a quirk of clang's -O0 code-gen across multiple architectures. It's been around a long time (Godbolt's earliest clang version, 3.0, exhibits this behaviour). It's totally useless; there's no reason this is helpful, and gcc -O0 doesn't do it.

There's no ABI or asm reason to understand; this is just one of the many uninteresting inefficiencies in -O0 code-gen.

(GCC -O0 does set up a frame pointer even for leaf functions, unlike clang).


Clang targeting x86-64, i386, RISC-V, MIPS, and AArch64 does the same thing on all of them. (Also PowerPC, I think, if I'm reading it correctly.) Usually clang picks a different register for each stack arg, but clang 16 on Godbolt loads 4 times into EAX, overwriting it each time. I added more args, for a total of 10, since x86-64 SysV passes up to 6 integer args in registers, and some ABIs, like RISC-V and AArch64, pass up to 8.
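As a rough sketch of that experiment, here is a 10-argument test function plus a small helper for counting how many arguments end up on the stack. The names `test10` and `stack_arg_count` are made up for illustration, and the "one dummy load per stack arg" reading is only an assumption based on the listings discussed here, not something taken from clang's source.

```c
/* Ten int parameters: enough to force stack args even on ABIs
   that pass 8 integer args in registers. */
int test10(int a, int b, int c, int d, int e,
           int f, int g, int h, int i, int j)
{
    return a + b + c + d + e + f + g + h + i + j;
}

/* Hypothetical helper: number of integer args passed on the stack
   when the ABI passes the first n_reg_args of them in registers.
   Assumes plain int args only (no structs, floats, or varargs). */
static int stack_arg_count(int n_args, int n_reg_args)
{
    return n_args > n_reg_args ? n_args - n_reg_args : 0;
}
```

Under that assumption, `stack_arg_count(7, 4)` gives the 3 loads in the question's ARM listing, and `stack_arg_count(10, 6)` gives the 4 loads into EAX seen with x86-64 SysV.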

Another example: https://godbolt.org/z/x4vzcYdEn - the mov eax, [rbp + 24] and mov r10d, [rbp + 16] near the start are an example of clang for x86-64 using different registers.

Earlier clang for ARMv7 doesn't use ip for every load. It spreads them around to other regs, like R12, LR, R4, and R5 (after pushing those registers so it can overwrite them). With more function args, clang 11 even pushes more registers just so it can load all the stack args in the function prologue, the block of asm associated (in the debug metadata) with the opening {. So clang (trunk) has actually gotten less bad!

Older clang for 32-bit x86 also uses extra registers, and copies some stack args down to the space it reserved for locals. Or copies all of them, some in the prologue, some kept in registers until the epilogue? clang 8.0.1 on Godbolt is pretty wacky, for x86 with -m32.


I have no idea what detail of the compiler internals leads to this behaviour, or even whether it's already present in the LLVM-IR or only appears during final generation of the asm.

I'm curious whether it would speed up overall compilation time to run an optimization pass that got rid of those dummy loads, instead of emitting machine code for them (and having them around in the data structures while optimizing.)

I wonder if "touching" all the args is maybe related to how clang arranges to spill the register args in the prologue, which it does need to do for 100% consistent debugging.

Peter Cordes