
Yesterday I was looking at some 32-bit code generated by VC++ 2010 (most probably; I don't know the specific options, sorry) and I was intrigued by a curious recurring detail: in many functions, it zeroed out ebx in the prologue and always used it as a "zero register" (think $zero on MIPS). In particular, it often:

  • used it to zero out memory; this is not unusual in itself, as the encoding of mov mem,imm is 1 to 4 bytes bigger than mov mem,reg (the full immediate size has to be encoded even for 0), but usually (gcc) the necessary register is zeroed out "on demand" and kept for more useful purposes otherwise (see the encoding comparison right after this list);
  • used it for compares against zero - as in cmp reg,ebx. This is what struck me as really unusual, as it should be exactly the same as test reg,reg, but it adds a dependency on an extra register. Now, keep in mind that this happened in non-leaf functions, with ebx often being pushed on and off the stack by callees, so I would not trust this dependency to always be completely free. Also, it used test reg,reg in the exact same fashion elsewhere (test/cmp => jg).
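
For concreteness, here's the size difference I'm talking about (made-up operands, byte counts from the standard encoding tables):

    mov     dword ptr [ebp-8], 0    ; C7 45 F8 00 00 00 00 - 7 bytes, full imm32 even for 0
    mov     [ebp-8], ebx            ; 89 5D F8             - 3 bytes, with ebx already zero
    cmp     eax, ebx                ; 3B C3                - 2 bytes, but reads a second register
    test    eax, eax                ; 85 C0                - 2 bytes, reads only eax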

Most importantly, registers on "classic" x86 are a scarce resource; if you start having to spill registers you waste a lot of time for no good reason, so why tie one up for the whole function just to keep a zero in it? (Still, thinking about it, I don't remember seeing much register spilling in the functions that used this "zero-register" pattern.)

So: what am I missing? Is it a compiler blooper or some incredibly smart optimization that was particularly interesting in 2010?

Here's an excerpt:

    ; standard prologue: ebp/esp, SEH, overflow protection, ... then:
    xor     ebx, ebx
    mov     [ebp+4], ebx        ; zero out some locals
    mov     [ebp], ebx
    call    function_1
    xor     ecx, ecx            ; ebx _not_ used to zero registers
    cmp     eax, ebx            ; ... but used for compares?! why not test eax,eax?
    setnz   cl                  ; what? it goes through cl to check if eax is not zero?
    cmp     ecx, ebx            ; still, why not test ecx,ecx?
    jnz     function_body
    push    123456
    call    throw_something
function_body:
    mov     edx, [eax]
    mov     ecx, eax            ; it's not like it was interested in ecx anyway...
    mov     eax, [edx+0Ch]
    call    eax                 ; virtual method call; ebx is preserved but possibly pushed/popped
    lea     esi, [eax+10h]
    mov     [ebp+0Ch], esi
    mov     eax, [ebp+10h]
    mov     ecx, [eax-0Ch]
    xor     edi, edi            ; again, registers are zeroed as usual
    mov     byte ptr [ebp+4], 1
    mov     [ebp+8], ecx
    cmp     ecx, ebx            ; why not test ecx,ecx?
    jg      somewhere

label1:
    lea     eax, [esi-10h]
    mov     byte ptr [ebp+4], bl    ; ok, uses bl to write a zero to memory
    lea     ecx, [eax+0Ch]
    or      edx, 0FFFFFFFFh
    lock xadd [ecx], edx
    dec     edx
    test    edx, edx            ; now it's using the regular test reg,reg!
    jg      somewhere_else

Note: an earlier version of this question said that it used mov reg,ebx instead of xor ebx,ebx; that was just me not remembering things correctly. Sorry if anybody put too much thought into trying to understand that.

Matteo Italia
  • No wait, maybe yesterday I was too tired, it's not as I described it above at all. I checked again now and, while it's true that it keeps `ebx` at zero for the whole function (which is what struck me as unusual), it uses it to zero out *memory*, where the encoding for the `mov` is much more compact (compared to using an immediate zero), and for `cmp`, even against registers - which again struck me as unusual because `cmp reg,ebx` (with `ebx`==0) should be the same as plain `test reg,reg`. I'll update the question accordingly. – Matteo Italia Dec 31 '16 at 16:47
  • xD, should have clicked to expand that comment that popped up while typing my answer :P And yes, [`test reg,reg` sets flags the same as a `cmp` against zero](http://stackoverflow.com/a/38032818/224132) (immediate or register), except for leaving AF undefined. Perf-wise, `test` can macro-fuse in more cases on more CPUs, and avoids reading a dead register (helps P6, so this applies as early as 1995). – Peter Cordes Dec 31 '16 at 16:55
  • The compiler might be optimizing for size instead of speed. – Ross Ridge Dec 31 '16 at 19:10
  • It does not consistently do this, as your question seems to imply—at least not in optimized code. I've studied the object code generated by VS 2010's compiler in great detail for a vast array of different code sequences, and have never noticed such a pattern. So either you've come across some edge cases where the optimizer really does think this makes sense, or you are looking at unoptimized code. You're right in general, though, that Microsoft's compiler will often use a zero-register to zero out memory because this is faster and smaller, except under very high register pressure conditions. – Cody Gray - on strike Jan 02 '17 at 12:00
  • Can you post some examples of the code that you compiled to see these patterns? – Cody Gray - on strike Jan 02 '17 at 12:02
  • I finally updated my answer to say something about the actual code, so I can close this browser tab :P – Peter Cordes May 13 '17 at 21:15
  • @PeterCordes: hahaha, I have the same problem with browser tabs lingering open *way* too long. Thank you for keeping your mind on my question for all this time. :-) – Matteo Italia May 14 '17 at 13:48

1 Answer


Everything you commented on as odd looks sub-optimal to me. test eax,eax sets all flags (except AF) the same as cmp against zero, and is preferred for performance and code-size.
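
In flag terms (straight from the instruction-set reference):

    cmp     eax, ebx            ; with ebx = 0: computes eax - 0, so SF/ZF/PF reflect eax, CF = OF = AF = 0
    test    eax, eax            ; computes eax AND eax: SF/ZF/PF reflect eax, CF = OF = 0, AF undefined

Every flag a later jcc or setcc can actually test (ZF, SF, CF, OF, PF) ends up identical; only AF differs, and no conditional branch reads AF.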

On P6 (PPro through Nehalem), reading long-dead registers is bad because it can lead to register-read stalls. P6 cores can only read 2 or 3 not-recently-modified architectural registers from the permanent register file per clock (to fetch operands for the issue stage: the ROB holds operands for uops, unlike on SnB-family where it only holds references to the physical register file).
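
As a contrived illustration (not from the question's code): assume ebx, esi, edi and ebp were all last written long ago, so their values live only in the permanent register file rather than in the ROB:

    add     eax, ebx            ; reads cold ebx (eax was written recently)
    add     ecx, esi            ; reads cold esi
    cmp     edi, ebp            ; reads cold edi and ebp

If these three uops go through the rename/issue stage in the same cycle, they need four cold-register reads, more than the 2 or 3 read ports available, so the front-end stalls.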

Since this is from VS2010, Sandybridge wasn't released yet, so it should have put a lot of weight on tuning for Pentium II/III, Pentium-M, Core2, and Nehalem where reading "cold" registers is a possible bottleneck.

IDK if anything like this ever made sense for integer regs, but I don't know much about optimizing for CPUs older than P6.


The cmp / setnz / cmp / jnz sequence looks particularly braindead. Maybe it came from a compiler-internal canned sequence for producing a boolean value from something, and the compiler failed to optimize the test of that boolean back into just using the flags directly? That still doesn't explain the use of ebx as a zero-register, which is also completely useless there.
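
Assuming the zero-extended boolean in ecx really is dead afterwards (in the excerpt it's overwritten by mov ecx, eax immediately after the branch), the whole dance could collapse to:

    test    eax, eax            ; same ZF as cmp eax, ebx with ebx = 0
    jnz     function_body       ; the xor / setnz / cmp round-trip adds nothing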

Is it possible that some of that was from inline-asm that returned a boolean integer (using a silly convention that wanted a zero in a register)?

Or maybe the source code was comparing two unknown values, and it was only after inlining and constant-propagation that it turned into a compare against zero? Which MSVC failed to optimize fully, so it still kept 0 as a constant in a register instead of using test?


(the rest of this was written before the question included code).

Sounds weird, or like a case of CSE / constant-hoisting run amok, i.e. treating 0 like any other constant that you might want to load once and then reg-reg copy throughout the function.

Your analysis of the data-dependency behaviour is correct: moving from a register that was zeroed a while ago essentially starts a new dependency chain.


When gcc wants two zeroed registers, it often xor-zeroes one and then uses a mov or movdqa to copy to the other.
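
i.e. something like this (a generic sketch of the pattern, not taken from any particular compiler's output):

    xor     eax, eax            ; integer case: xor-zero one register...
    mov     edx, eax            ; ...and copy it to get the second zero
    pxor    xmm0, xmm0          ; vector case: xor-zero...
    movdqa  xmm1, xmm0          ; ...then copy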

This is sub-optimal on Sandybridge, where xor-zeroing doesn't even need an execution port, but it's a possible win on Bulldozer-family, where mov can run on an AGU as well as an ALU port while xor-zeroing still needs an ALU port.

For vector moves, it's a clear win on Bulldozer: handled in register rename with no execution unit. But xor-zeroing of an XMM or YMM register still needs an execution port on Bulldozer-family (or two for ymm, so always use xmm with implicit zero-extension).

Still, I don't think that justifies tying up a register for the duration of a whole function, especially not if it costs extra saves/restores. And not for P6-family CPUs where register-read stalls are a thing.

Peter Cordes
  • "Is it possible that some of that was from inline-asm that returned a boolean integer (using a silly convention that wanted a zero in a register)?" I would exclude inline assembly, although I cannot be sure, as I don't have the original C++ source. The second option looks more likely, but still, the generated assembly is really bad, beyond what I expected to be possible from a decently modern compiler. – Matteo Italia May 14 '17 at 21:28
  • @MatteoItalia: Last time I checked, MSVC didn't hoist loads of vector constants out of a loop after inlining a helper function. It might be decently modern, but this wouldn't be the only serious missed-optimization problem. Other compilers fall on their face in some cases, too, but MSVC makes worse code more often, from what I've looked at on a small scale. And yeah, it didn't really look like inline asm, because MSVC-style inline-asm has to store/reload any inputs to inline-asm, because the syntax doesn't allow asking for inputs in registers (/facepalm). – Peter Cordes May 15 '17 at 00:00