12

I am writing some code and trying to speed it up using SSE2/SSE3 SIMD intrinsics. My code needs to load some data into an XMM register and operate on it many times. When I look at the generated assembly, GCC seems to keep spilling the data back to memory in order to reload something else into XMM0 and XMM1. I am compiling for x86-64, so I have 16 XMM registers. Why is GCC using only two, and what can I do to make it use more? Is there any way to "pin" a value in a register? I added the "register" keyword to my variable definition, but the generated assembly is identical.

starblue
florin
  • I'm having the same problem, with ARM. AFAICT, the syntax I'm using is correct - it matches that specified in the GCC docs. However, I get the same error...I wonder if the latest GCCs are bugged in this regard. –  Nov 30 '09 at 21:43
  • Ah - sorry - my comment is in fact with regard to the problem florin describes in his comment to the reply below (using asm("regname") generates an error). –  Nov 30 '09 at 21:44
  • A common reason for this behavior is not enabling optimizations (-O1, -O2 or -O3). When intrinsics are used without optimization, the compiler flushes to memory every time and essentially uses only 2-3 SIMD registers. – jtaylor Mar 14 '13 at 00:12

3 Answers

2

Yes, you can. The "Explicit Reg Vars" section of the GCC manual describes the syntax you need to pin a variable to a specific register.
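As a rough illustration of that syntax, here is a minimal sketch assuming GCC on x86-64 with SSE2. The function name add_pinned and the choice of xmm5 are made up for the example. Note that GCC only promises the variable lives in the named register when it is used as an operand of an extended asm statement, which is why the sketch uses one instead of an intrinsic.

```c
#include <emmintrin.h>  /* SSE2 intrinsics and the __m128i type */

/* Hypothetical helper: adds four packed 32-bit ints, with the
 * accumulator tied to xmm5 through GCC's explicit register syntax. */
__m128i add_pinned(__m128i a, __m128i b)
{
    /* Local register variable: GCC places x in xmm5 whenever it is
     * used as an operand of the extended asm below. */
    register __m128i x __asm__("xmm5") = a;

    /* AT&T operand order: paddd src, dst computes dst += src.
     * "+x" means read-write in an SSE register, "x" means input. */
    __asm__("paddd %1, %0" : "+x"(x) : "x"(b));

    return x;
}
```

Outside of asm operands the compiler is still free to copy the value elsewhere, so this does not by itself stop spills in ordinary intrinsic code.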

C. K. Young
1

If you're getting to the point where you're specifying individual registers for each intrinsic, you might as well just write the assembly directly, especially given gcc's nasty habit of pessimizing intrinsics unnecessarily in many cases.

Dark Shikari
0

It sounds like you compiled with optimization disabled, so no variables are kept in registers between C statements, not even int.

Compile with gcc -O3 -march=native to let the compiler make non-terrible asm, optimized for your machine. The default is -O0 with a "generic" target ISA and tuning.
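To see the difference yourself, a small test case like the following can help. It is only a sketch: the function name sum_with_bias and its contract (n a multiple of 4) are invented for illustration, not taken from the question. It loads a constant once and reuses it on every iteration, which is the pattern the question describes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the sum of (src[i] + bias) over n
 * elements. Assumes n is a multiple of 4 to keep the sketch short. */
int32_t sum_with_bias(const int32_t *src, size_t n, int32_t bias)
{
    __m128i vbias = _mm_set1_epi32(bias);  /* loaded once, reused below */
    __m128i acc = _mm_setzero_si128();     /* running per-lane sums */

    for (size_t i = 0; i < n; i += 4) {
        /* Unaligned load, so src needs no special alignment. */
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        acc = _mm_add_epi32(acc, _mm_add_epi32(v, vbias));
    }

    /* Horizontal sum of the four lanes. */
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Compiling this once with gcc -O0 -S and once with gcc -O3 -march=native -S and comparing the loop bodies should show vbias and acc being stored and reloaded around each statement in the first case and staying in XMM registers in the second.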

See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why "debug" builds in general are like that, and the fact that register int foo; or register __m128 bar; can stay in a register even in a debug build. But it's much better to actually have the compiler optimize, as well as using registers, if you want your code to run fast overall!

Peter Cordes