I'm testing out a piece of a math library I'm working on, and have found a bit of an oddity while looking at the generated assembly.
Ignoring the mess of template code to get here, the code I'm testing out essentially boils down to multiplying two floats, and then XORing the sign bit by some value.
I also compiled (`-O3` in both cases) a boiled-down version of the code:
int main() {
    float a = *reinterpret_cast<float*>(0x20);
    float b = *reinterpret_cast<float*>(0x40);
    float tmp = a * b;
    *reinterpret_cast<int*>(&tmp) ^= 0x80000000;
    return tmp;
}
and got the same results. (Dereferencing the first two pointers just serves to stop GCC from precomputing everything; this code obviously isn't meant to actually run.)
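For completeness, the same sign flip can also be written without the `int*` cast. This is just a sketch using `memcpy` for the type punning (the function name is mine; I'd expect GCC to fold the copies away, but I haven't checked whether it changes the codegen):

#include <cstdint>
#include <cstring>

// Aliasing-safe version of the same operation: multiply two floats,
// then flip the sign bit of the result via well-defined type punning.
float mul_flip(float a, float b) {
    float tmp = a * b;
    std::uint32_t bits;
    std::memcpy(&bits, &tmp, sizeof bits);
    bits ^= 0x80000000u;
    std::memcpy(&tmp, &bits, sizeof bits);
    return tmp;
}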
From my code and the example, GCC essentially generates the following:
mulss xmm0, xmm1
mov eax, 0x80000000
movd DWORD PTR SS:[rsp + 0x2c], xmm0
add eax, DWORD PTR SS:[rsp + 0x2c]
movd xmm2, eax
cvttss2si eax, xmm2
Looking at this, there are a couple of things I notice:
- GCC seems to favor an `add` instead of an `xor`. An answer to a question I found while researching mentioned that an `add` should be slightly faster because consecutive instructions wouldn't be blocked in the pipeline, but I don't see why an `xor` would have to wait (it's commutative and associative, so am I missing something?).
- Memory is used as a temporary instead of a register. There shouldn't be any latency between the write and the read thanks to store forwarding, but that cache line will have to be written back to memory at some point, which just seems wasteful.
- Necessitating the previous point, GCC seems insistent on using a different register (`eax`) to carry out the XOR, even though there are instructions that would work fine on the XMM register directly.
From what I can tell, something like this:
mulss xmm0, xmm1
mov eax, 0x80000000
movd xmm1, eax
pxor xmm0, xmm1
cvttss2si eax, xmm0
would make much more sense.
This avoids the unnecessary trip through memory and doesn't need to move the XORed value back into an XMM register for the conversion to an integer. (I was a little worried that GCC might be avoiding `pxor` because it requires SSE2, but adding `-msse2` changed nothing.)
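In case it matters, I've been inspecting the output with something along these lines (the file name is just a placeholder):

g++ -O3 -S -masm=intel test.cpp -o test.s    # adding -msse2 made no difference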
I thought maybe `pxor` would be significantly slower than `add`, but according to Agner Fog's instruction tables (https://www.agner.org/optimize/instruction_tables.pdf), on Skylake `pxor x, x` has a slightly lower reciprocal throughput than `add r, m` and generates fewer uops (at least unfused; to be fair, the speed difference is probably due mostly to the memory access included in the `add`).
So what's the deal? Why doesn't GCC generate something closer to my hand-written assembly? Is this a bug, or is GCC seeing something I'm not?