
I'm testing out a piece of a math library I'm working on, and have found a bit of an oddity while looking at the generated assembly.

Ignoring the mess of template code needed to get here, the code I'm testing essentially boils down to multiplying two floats and then XORing the sign bit of the result with some value.

I also compiled (-O3 in both cases) a boiled-down version of the code:

int main() {
    float a = *reinterpret_cast<float*>(0x20);
    float b = *reinterpret_cast<float*>(0x40);

    float tmp = a * b;
    *reinterpret_cast<int*>(&tmp) ^= 0x80000000;

    return tmp;
}

and got the same results. (Dereferencing the first two pointers just serves to stop GCC from precomputing everything; this will definitely not run.)
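
For what it's worth, the sign flip can also be written without the pointer-cast pun; here's a rough sketch using std::memcpy (the function name is just for illustration), which raises the same codegen question:

#include <cstdint>
#include <cstring>

// Sketch only: flip the sign bit of the product via a well-defined
// memcpy type pun instead of the pointer cast above.
float mul_flip_sign(float a, float b) {
    float tmp = a * b;
    std::uint32_t bits;
    std::memcpy(&bits, &tmp, sizeof bits);  // grab the bit pattern
    bits ^= 0x80000000u;                    // toggle the sign bit
    std::memcpy(&tmp, &bits, sizeof tmp);   // put it back
    return tmp;
}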

From my code and the example, GCC essentially generates the following:

mulss xmm0, xmm1
mov eax, 0x80000000
movd DWORD PTR SS:[rsp + 0x2c], xmm0
add eax, DWORD PTR SS:[rsp + 0x2c]
movd xmm2, eax
cvttss2si eax, xmm2

Looking at this, there are a couple of things I notice:

  • GCC seems to favor an add instead of an xor. An answer to a question I found while researching mentioned that an add should be slightly faster because consecutive instructions aren't blocked in the pipeline, but I don't see why an xor would have to wait (it's just as commutative and associative; am I missing something?).
  • Memory is used as a temporary instead of a register. Store forwarding should make the reload cheap, but that cache line still has to be written back to memory at some point; it just seems wasteful.
  • GCC seems insistent on moving the value into a general-purpose register (eax) to carry out the XOR (which is what necessitates the previous point), even though there are instructions that would work fine directly on the XMM register.

From what I can tell, something like this:

mulss xmm0, xmm1
mov eax, 0x80000000
movd xmm1, eax
pxor xmm0, xmm1
cvttss2si eax, xmm0

would make much more sense.

This avoids unnecessarily accessing memory and doesn't need to move the XORed value back into an XMM register for the conversion to an integer (I was a little worried that GCC wasn't using pxor as it requires SSE2, but -msse2 changed nothing).

I thought maybe pxor would be significantly slower than add, but according to https://www.agner.org/optimize/instruction_tables.pdf, on Skylake pxor xmm, xmm (compared to add r, m) has a slightly lower reciprocal throughput and generates fewer uops! (At least unfused; and to be fair, the speed difference is probably only due to the add's memory operand.)
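
For reference, the same intent can also be expressed from C++ with SSE intrinsics. This is a hypothetical sketch (not code I actually compiled, and with the mask hard-coded rather than being the runtime value it really is), just to show the multiply and sign-bit XOR staying in XMM registers at the source level:

#include <xmmintrin.h>

// Hypothetical sketch: do the multiply and the sign-bit XOR with SSE
// intrinsics so the value never has to round-trip through an integer
// register in the source.
float mul_xor_sign(float a, float b) {
    __m128 prod = _mm_set_ss(a * b);
    __m128 mask = _mm_set_ss(-0.0f);  // low lane = 0x80000000
    return _mm_cvtss_f32(_mm_xor_ps(prod, mask));
}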

So what's the deal? Why doesn't GCC generate something closer to my custom assembly? Is this a bug, or is GCC seeing something I'm not?

  • clang does better, `-march=native` helps a little too: https://godbolt.org/z/qMPcf8xTG – Alan Birtles Jul 03 '21 at 14:31
  • @AlanBirtles Not sure if it's godbolt or the compilers, but it's strange that they both use the triple operand versions of the math instructions. Clang output is definitely what I was expecting, but I have no idea if it's faster. – the4naves Jul 03 '21 at 14:40
  • 2
  • How does the assembly compare when doing `tmp = -tmp;` instead? – Eljay Jul 03 '21 at 15:32
  • @Eljay It generates instructions almost identical to clang's, which I expected. I was going to say I can't do that as `0x80000000` is just a placeholder for a runtime value (high-bit 0 or 1), but the value is pulled from a constexpr array, the index of which is determined by a template argument, so I should actually be able to use a constexpr `if` to do that. Still is odd that GCC doesn't generate optimal code for the example though. – the4naves Jul 03 '21 at 15:44
  • @Eljay Yep, using a constexpr `if` does exactly what I hoped (a rough sketch of the idea is below, after these comments). The generated template function is optimized to just a multiply/xor. I'll probably leave this question open for a little longer as I am genuinely curious as to whether there's any logic behind GCC's original solution. – the4naves Jul 03 '21 at 15:55
  • The most likely explanation is probably: just a missed optimization opportunity. No one coded it up yet. – Eljay Jul 03 '21 at 17:29
  • If you just want to look at the asm on Godbolt, write `float foo(float x, float y)` instead of `main` with insane pointer-casting from absolute addresses. [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116). (Or an `int` return value if you really want to force that.) – Peter Cordes Jul 03 '21 at 18:01
  • 1
  • @AlanBirtles: `-march=native` on Godbolt includes AVX-512. That's not very realistic for most use-cases yet. Godbolt even warns you against using `-march=native` because it depends on the AWS server hardware; use `-march=skylake-avx512` or `icelake-server` if that's what you really want to look at. With just `-march=haswell` it does a separate `vbroadcastss` load to feed `vxorps`, still not wastefully copying to integer and back. – Peter Cordes Jul 03 '21 at 18:03
  • 1
  • Seems like a GCC missed optimization, and not a recent regression; present in GCC4.9 for example https://godbolt.org/z/TEez77jnT. Even with `-fno-strict-aliasing` to avoid undefined behaviour on the type-pun via pointer-cast instead of a safe memcpy or (in GNU C++) union, or C++20 std::bit_cast. And BTW, the "3 operand" instructions are from enabling AVX (via `-march=native` on a machine new enough). GCC always uses the `v` versions of XMM instructions when AVX is enabled. SSE2 is baseline for x86-64, and SSE1 `xorps` is what you'd expect a good compiler to use. – Peter Cordes Jul 03 '21 at 18:16
  • You can report this on https://gcc.gnu.org/bugzilla/ if you want, with the missed-optimization keyword tag. – Peter Cordes Jul 03 '21 at 18:21
  • @PeterCordes Huh, I expected that putting the code in an unused function would just end with it being eliminated, but it looks like that only happens under certain circumstances (or with LTO). – the4naves Jul 03 '21 at 20:21
  • 1
  • @the4naves: It's not "unused", it's necessary to implement a stand-alone version of the function because it's not `static`, so other compilation units might call it. You're getting the compiler's asm output for that C source file, not a linked executable, so there is no whole program yet for whole-program optimization to consider removing definitions of functions not ultimately called from `main`. – Peter Cordes Jul 03 '21 at 20:58
  • @PeterCordes Ah, I assumed it would disassemble the linked file, but I suppose that could get overly verbose depending on what it would include. I've filed a bug report at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101311. – the4naves Jul 03 '21 at 21:13
  • Godbolt's "binary" option in the drop-down will do that. (With compiler-explorer supplying a dummy main if there isn't one.) Still, without `-fwhole-program` or `-flto`, it won't remove un-called non-`static` functions, and `-O3` doesn't imply either of those. – Peter Cordes Jul 03 '21 at 21:23
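
For anyone curious, here's a rough, hypothetical sketch of the constexpr `if` workaround described in the comments above (the table and names are made up; the real code indexes a constexpr array with a template argument):

#include <cstddef>

// Hypothetical sign table: true means "negate the product".
inline constexpr bool kFlipSign[] = { false, true, true, false };

template <std::size_t I>
float signed_mul(float a, float b) {
    float tmp = a * b;
    if constexpr (kFlipSign[I])
        return -tmp;  // per the comments, GCC optimizes this to just a multiply/xor
    else
        return tmp;
}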
