7

I'm doing some experimenting with x86-64 assembly. Having compiled this dummy function:

long myfunc(long a, long b, long c, long d,
            long e, long f, long g, long h)
{
    long xx = a * b * c * d * e * f * g * h;
    long yy = a + b + c + d + e + f + g + h;
    long zz = utilfunc(xx, yy, xx % yy);
    return zz + 20;
}

with gcc -O0 -g, I was surprised to find the following at the beginning of the function's assembly:

0000000000400520 <myfunc>:
  400520:       55                      push   rbp
  400521:       48 89 e5                mov    rbp,rsp
  400524:       48 83 ec 50             sub    rsp,0x50
  400528:       48 89 7d d8             mov    QWORD PTR [rbp-0x28],rdi
  40052c:       48 89 75 d0             mov    QWORD PTR [rbp-0x30],rsi
  400530:       48 89 55 c8             mov    QWORD PTR [rbp-0x38],rdx
  400534:       48 89 4d c0             mov    QWORD PTR [rbp-0x40],rcx
  400538:       4c 89 45 b8             mov    QWORD PTR [rbp-0x48],r8
  40053c:       4c 89 4d b0             mov    QWORD PTR [rbp-0x50],r9
  400540:       48 8b 45 d8             mov    rax,QWORD PTR [rbp-0x28]
  400544:       48 0f af 45 d0          imul   rax,QWORD PTR [rbp-0x30]
  400549:       48 0f af 45 c8          imul   rax,QWORD PTR [rbp-0x38]
  40054e:       48 0f af 45 c0          imul   rax,QWORD PTR [rbp-0x40]
  400553:       48 0f af 45 b8          imul   rax,QWORD PTR [rbp-0x48]
  400558:       48 0f af 45 b0          imul   rax,QWORD PTR [rbp-0x50]
  40055d:       48 0f af 45 10          imul   rax,QWORD PTR [rbp+0x10]
  400562:       48 0f af 45 18          imul   rax,QWORD PTR [rbp+0x18]

gcc very strangely spills all argument registers onto the stack and then takes them from memory for further operations.

This only happens at -O0 (at -O1 the spills are gone), but still, why? This looks like an anti-optimization to me. Why would gcc do that?

Eli Bendersky
  • I think you might have it backwards. I'm pretty sure the above is how GCC always (initially) generates the code, it's just you won't normally see it as it's trivially optimized away (but of course only if optimizations are enabled). – user786653 Aug 26 '11 at 08:14
  • This isn't anti optimization, it's just no optimization. – Gunther Piez Aug 27 '11 at 21:08
  • I had just seen this example somewhere: http://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64/ :-) – Ciro Santilli OurBigBook.com Jul 16 '15 at 09:15
  • @GuntherPiez: I'd call it anti-optimization; there are enough registers to trivially hold all the locals, and they started in registers, so spilling them is only necessary to support consistent debugging. (And to make compiler internal algorithms simpler.) See also [Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](https://stackoverflow.com/q/53366394) – Peter Cordes Sep 04 '20 at 05:33
  • @user786653: GCC optimizes *before* generating asm, on an internal representation of the program like GIMPLE. An optimized build I think will realize that these variables don't need addresses and not give them one in the first place, rather than giving them one and then optimizing it away. – Peter Cordes Sep 04 '20 at 05:35
  • @PeterCordes: It's been a while since I looked at this question (9 years I guess - time flies), and I can agree that my initial comment isn't 100% accurate (I know, and think I knew, that GCC won't actually generate spilling code and then delete it later on), but I still think the spirit of my comment and answer are still true. Perhaps we just disagree on what "anti-optimization" means (in my mind that would be something like generating an `imul` instruction for `3*4`, not the compiler doing what's easiest/fastest/most debuggable) – user786653 Sep 04 '20 at 17:30

2 Answers

8

I am by no means a GCC internals expert, but I'll give it a shot. Unfortunately, most of the information on GCC's register allocation and spilling seems to be out of date (referencing files like local-alloc.c that don't exist anymore).

I'm looking at the source code of gcc-4.5-20110825.

In GNU C Compiler Internals it is mentioned that the initial function code is generated by expand_function_start in gcc/function.c. There we find the following for handling parameters:

  /* Initialize rtx for parameters and local variables.
     In some cases this requires emitting insns.  */
  assign_parms (subr);

In assign_parms, the code that decides where each argument is stored is the following:

      if (assign_parm_setup_block_p (&data))
        assign_parm_setup_block (&all, parm, &data);
      else if (data.passed_pointer || use_register_for_decl (parm))
        assign_parm_setup_reg (&all, parm, &data);
      else
        assign_parm_setup_stack (&all, parm, &data);

assign_parm_setup_block_p handles aggregate data types and does not apply in this case. Since the data is not passed as a pointer either, GCC goes on to check use_register_for_decl.

Here the relevant part is:

  if (optimize)
    return true;

  if (!DECL_REGISTER (decl))
    return false;

DECL_REGISTER tests whether the variable was declared with the register keyword. And now we have our answer: when optimizations are not enabled, use_register_for_decl returns false for any parameter not declared register, so most parameters live on the stack and are handled by assign_parm_setup_stack. The route taken through the source code before it ends up spilling the value is slightly more complicated for pointer arguments, but it can be traced in the same file if you're curious.
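To see the DECL_REGISTER check in action, here is a minimal sketch with two otherwise identical functions. The names sum_plain and sum_register are made up for illustration. Going by the excerpt above, the parameters of sum_plain should get stack slots at -O0, while those of sum_register pass the DECL_REGISTER test and may stay in registers; the exact output depends on your GCC version and target, so treat this as something to try rather than a guaranteed result.

```c
/* Hypothetical example functions, not taken from the GCC sources. */

/* Plain parameters: at -O0 these are expected to be spilled to the stack. */
long sum_plain(long a, long b)
{
    return a + b;
}

/* register-qualified parameters: DECL_REGISTER is set for them, so GCC
   may keep them in registers even with optimizations disabled. */
long sum_register(register long a, register long b)
{
    return a + b;
}
```

Compiling this with gcc -O0 -S and comparing the two function bodies should show whether the mov-to-stack sequence is present for one and absent for the other.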

Why does GCC spill all arguments and local variables with optimizations disabled? To help debugging. Consider this simple function:

1 extern int bar(int);
2 int foo(int a) {
3         int b = bar(a | 1);
4         b += 42;
5         return b;
6 }

Compiled with gcc -O1 -c this generates the following on my machine:

 0: 48 83 ec 08             sub    $0x8,%rsp
 4: 83 cf 01                or     $0x1,%edi
 7: e8 00 00 00 00          callq  c <foo+0xc>
 c: 83 c0 2a                add    $0x2a,%eax
 f: 48 83 c4 08             add    $0x8,%rsp
13: c3                      retq   

This is fine, except that if you break on line 5 and try to print the value of a, you get:

(gdb) print a
$1 = <value optimized out>

This happens because the register holding a gets overwritten: a is not used after the call to bar, so the optimized code has no reason to keep its value around.

user786653
7

A couple of reasons:

  1. In the general case, an argument to a function has to be treated like a local variable because it could be stored to or have its address taken within the function. Therefore, it is simplest to just allocate a stack slot for every argument.
  2. Debug information becomes much simpler to emit with stack locations: the argument's value is always at some specific location, instead of moving around between registers and memory.
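As a small illustration of the first point, the sketch below stores to one parameter and takes the address of another. The function names bump and store_and_address are hypothetical. Once &b is taken, b needs a memory location regardless of optimization level, and giving every argument a stack slot up front means the unoptimized compiler never has to decide which ones need it.

```c
/* Hypothetical helper: modifies a long through a pointer. */
static void bump(long *p)
{
    *p += 1;
}

long store_and_address(long a, long b)
{
    a = a * 2;   /* the parameter is assigned to, just like a local */
    bump(&b);    /* taking &b requires b to live in memory */
    return a + b;
}
```

For example, store_and_address(3, 4) doubles a to 6, bumps b to 5, and returns 11.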

When you're looking at -O0 code in general, consider that the compiler's top priorities are reducing compile-time as much as possible and generating high-quality debugging information.

servn
  • Yes, and with no optimizations, the compiler specifically makes all lines independent, always reloading from actual variables and storing out immediately, which allows you to move the CPU to another line, or change the value of any variable in the debugger, and have it behave correctly. – doug65536 Feb 06 '13 at 22:48
  • Yes, exactly. Spilling register args so they have an address is part of `-O0`, unless you declare them `register int foo` or whatever. [Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](https://stackoverflow.com/q/53366394) – Peter Cordes Sep 04 '20 at 05:37
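The point about independent lines can be tried out with a sketch like the following; step_by_step is a made-up name. At -O0 each statement is expected to reload its operands from the variables' stack slots and store the result back right away, so changing y in a debugger between two lines should affect the statements that follow. This is an assumption based on the comments above, not something checked against a particular GCC version.

```c
/* Hypothetical example: each statement should be compiled on its own at -O0. */
long step_by_step(long x)
{
    long y = x + 1;   /* load x from its slot, store the result to y's slot */
    y = y * 2;        /* reload y, store it back */
    return y - 3;     /* reload y once more for the return value */
}
```

Compiled with gcc -O0 -g, you could break on the second statement, run set var y = 100 in gdb, and continue; with optimizations enabled the intermediate y may not exist in memory at all.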