While writing some very basic C code for a Cortex-M0 device, I was surprised by the disassembly of this function:
void delay(void) {
    for (int x = 0; x < 0xffff; x++) ;
}
This becomes:
for (int x=0;x<0xffff;x++) ;
2300 movs r3, #0
9301 str r3, [sp, #4]
E002 b 0x0800026E
9B01 ldr r3, [sp, #4] //0x08000268
3301 adds r3, #1
9301 str r3, [sp, #4]
9B01 ldr r3, [sp, #4] //0x0800026E
4A03 ldr r2, =0x0000FFFE
4293 cmp r3, r2
DDF8 ble 0x08000268
--- main.c -- 8 --------------------------------------------
}
46C0 nop
46C0 nop
B002 add sp, sp, #8
4770 bx lr
46C0 nop
0000FFFE .word 0x0000FFFE
Now this seems awfully wasteful. I know the whole point of this simple delay function is to waste time, but gcc keeps x on the stack, loading and storing it on every iteration, and uses only two registers (r2 and r3) as temporaries.
This is stock Rowley Crossworks 4.10 with all default settings using the GCC compiler that came with it. The debug configuration adds no optimization flags.
Wouldn't something like this be significantly better?
# Counter reset
movs r0, #0x0
ldr r1, =0xffff
loopone:
adds r0,#0x1
cmp r0,r1
bne loopone
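As a sanity check that the hand-written loop does the same amount of work, here is a minimal C sketch that models both loops and counts their iterations. The function names are hypothetical, and the second function only mimics the increment-then-compare structure of the assembly above; it is not compiler output.

```c
/* Models the original C loop: x runs from 0 to 0xfffe. */
unsigned original_iterations(void) {
    unsigned n = 0;
    for (int x = 0; x < 0xffff; x++) {
        n++;
    }
    return n;
}

/* Models the proposed assembly: increment r0 first, then compare with r1. */
unsigned proposed_iterations(void) {
    unsigned r0 = 0;        /* movs r0, #0x0 */
    unsigned r1 = 0xffff;   /* ldr r1, =0xffff */
    unsigned n = 0;
    do {
        r0 += 1;            /* adds r0, #0x1 */
        n++;
    } while (r0 != r1);     /* cmp r0, r1 ; bne loopone */
    return n;
}
```

Both functions return 0xffff (65535), so the two loops run the same number of iterations.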
It seems that default unoptimized gcc output keeps local variables on the stack rather than in registers. But AAPCS gives us four scratch registers (r0-r3) that a function may use without saving them, so no pushes or pops beyond the usual ones would be needed. This function was also not inlined, which might explain some of this, but even saving the original register values to the stack once and restoring them afterwards would be better than going through the stack on every iteration.
Why does gcc prefer the stack over available registers?
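One experiment that might be worth trying, as a sketch rather than a definitive fix: the GCC manual says that at -O0 every variable without the register storage class gets its own stack slot, while a variable declared register may never be placed in memory. Assuming the GCC bundled with CrossWorks behaves as documented, declaring the counter register may keep it off the stack even in the debug configuration. The name delay_register and the returned count are hypothetical additions so the result can be checked.

```c
/* Hypothetical variant of delay(): same loop, counter declared register. */
int delay_register(void) {
    register int x;   /* a hint only; the compiler may still ignore it */
    for (x = 0; x < 0xffff; x++)
        ;
    return x;         /* 0xffff once the loop has finished */
}
```

Whether the generated code actually changes has to be confirmed in the disassembly. With optimization enabled, an empty loop like this may be removed entirely.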