Intrinsics: using __m128 registers

Question

I am playing with SIMD and thinking of using it for vector operations in 3D math. Instead of having

class Vec4f
{
    float val[4];
    // + operators here
};

I could have

class SimdVec4f
{
    __m128 val; // + operators here
};

But since there are only 8 registers available for __m128, what will happen if I want more than 8 instances of this class? Does the compiler handle loading from memory into registers and vice versa on its own, as it does for ordinary variables?

Thanks for your time and for giving me some insight into this.

All you need is to disassemble, try this: https://stackoverflow.com/questions/1289881/using-gcc-to-produce-readable-assembly — kbridge4096, Nov 04 '18 at 06:16

Peter Cordes · Answer 1 · 2019-12-30T04:01:49.193


It's exactly the same as when you have more int variables than there are integer registers: the compiler may have to spill them to memory if too many are live at the same time, and reload them later. Register allocation for vector registers is done pretty much the same way as register allocation for integer regs, analysing the data flow of a function and figuring out which variables are alive at the same time.
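To make that concrete, here is a minimal sketch (the function name `sum_of_many_vectors` is made up for illustration) that holds 20 `__m128` values, more than xmm0..15 can hold at once. It is still perfectly valid code. Whether a given value sits in a register or a stack slot is the register allocator's decision, and an optimizer may restructure the loops so that far fewer values are live at the same time.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_set1_ps, _mm_add_ps, _mm_cvtss_f32

// Hypothetical demo: 20 __m128 objects exist at once.
// Nothing here names a register; the compiler spills and reloads as needed.
float sum_of_many_vectors(float base) {
    __m128 v[20];
    for (int i = 0; i < 20; ++i)
        v[i] = _mm_set1_ps(base + static_cast<float>(i));  // broadcast to all 4 lanes

    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < 20; ++i)
        acc = _mm_add_ps(acc, v[i]);  // lane-wise add

    return _mm_cvtss_f32(acc);  // extract lane 0
}
```

Looking at the compiler's asm output for something like this with and without optimization is probably the quickest way to see the spills and reloads for yourself.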

Think of the _mm_load_ps/_mm_loadu_ps and _mm_store_ps/_mm_storeu_ps intrinsics as mainly describing type-punning to and from vector types. They are not the only thing that can compile to a vector load or store instruction, and they don't always compile to a load or store.
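As a small illustration, assuming a helper named `add4` (not from any library), the load/store intrinsics below mostly tell the compiler "treat these 4 floats as a `__m128`" and back again:

```cpp
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_storeu_ps, _mm_add_ps

// Hypothetical helper: out[i] = a[i] + b[i] for i in 0..3.
// The unaligned variants are used so any float pointer is acceptable.
void add4(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);             // float[4] -> __m128
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  // __m128 -> float[4]
}
```

After inlining, the compiler is free to keep the values in registers and skip the memory round trip entirely, or (for example when compiling with AVX enabled) to fold one of the loads into a memory operand of the add.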

And BTW, x86-64 has xmm0..15. Compile for 64-bit if you want code that needs several registers to be efficient.

SSE for 3D vectors:

Generally avoid keeping a single direction/geometry vector in a SIMD vector. You can add efficiently, but any cross- or dot-products or length calculations will require shuffling.
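For a sense of what that shuffling looks like, here is one way to write a dot product of two x,y,z,w vectors with plain SSE1 (the name `dot4` is just for this sketch; SSE3/SSE4.1 offer other options). Only one multiply does useful SIMD work; the rest is a horizontal sum.

```cpp
#include <xmmintrin.h>  // SSE: _mm_mul_ps, _mm_shuffle_ps, _mm_movehl_ps, ...

// Hypothetical helper: dot product of two 4-element vectors held in one __m128 each.
float dot4(__m128 a, __m128 b) {
    __m128 prod = _mm_mul_ps(a, b);  // [x, y, z, w] products
    // Swap neighbours within each pair: [y, x, w, z]
    __m128 shuf = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(prod, shuf);  // [x+y, x+y, z+w, z+w]
    shuf = _mm_movehl_ps(shuf, sums);      // bring z+w down to lane 0
    sums = _mm_add_ss(sums, shuf);         // lane 0 = x+y+z+w
    return _mm_cvtss_f32(sums);
}
```

Two shuffles and two extra adds for every dot product is the overhead that the layout described next avoids.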

It's better if you can use a vector of 4 x values, a vector of 4 y values, etc., so you can compute 4 lengths in parallel. See https://stackoverflow.com/tags/sse/info for more, especially these slides: SIMD at Insomniac Games (GDC 2015) which show how to lay out your data for efficient SIMD. (Struct of arrays, not array of structs).
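A rough sketch of that layout, with a made-up `Vec3x4` struct holding four 3D vectors as one `__m128` of x values, one of y values and one of z values:

```cpp
#include <xmmintrin.h>  // SSE: _mm_mul_ps, _mm_add_ps, _mm_sqrt_ps

// Hypothetical struct-of-arrays batch: lane i of x, y, z is the i-th 3D vector.
struct Vec3x4 {
    __m128 x, y, z;
};

// Lengths of all four vectors at once; purely vertical ops, no shuffles.
__m128 length4(const Vec3x4& v) {
    __m128 xx = _mm_mul_ps(v.x, v.x);
    __m128 yy = _mm_mul_ps(v.y, v.y);
    __m128 zz = _mm_mul_ps(v.z, v.z);
    return _mm_sqrt_ps(_mm_add_ps(_mm_add_ps(xx, yy), zz));
}
```

Every lane does useful work here, and a dot product of four vector pairs has the same shape: three multiplies and two adds.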

See also Parallel programming using Haswell architecture

Sometimes you can get a minor benefit for a single vector in cases where you can't reorganize to compute lots of things in parallel. _mm_setr_ps() can be slow if the source data isn't contiguous, though.
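If you do go with a single-vector wrapper like the `SimdVec4f` from the question, a sketch along these lines gives callers a contiguous-load path next to the `_mm_setr_ps` one. The constructor set and the `store` member are my own choices for illustration, not a recommendation of a particular API.

```cpp
#include <xmmintrin.h>  // SSE: _mm_setr_ps, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

class SimdVec4f {
public:
    // Four separate scalars: may cost several instructions if they come from scattered places.
    SimdVec4f(float x, float y, float z, float w) : val(_mm_setr_ps(x, y, z, w)) {}
    // Four contiguous floats: a single unaligned load.
    explicit SimdVec4f(const float* p) : val(_mm_loadu_ps(p)) {}
    explicit SimdVec4f(__m128 v) : val(v) {}

    friend SimdVec4f operator+(SimdVec4f a, SimdVec4f b) {
        return SimdVec4f(_mm_add_ps(a.val, b.val));
    }

    // Write the four lanes back to contiguous memory.
    void store(float* out) const { _mm_storeu_ps(out, val); }

private:
    __m128 val;
};
```

Usage is then something like `SimdVec4f sum = SimdVec4f(ptr) + SimdVec4f(1.0f, 2.0f, 3.0f, 4.0f); sum.store(out);`, where `ptr` and `out` point at 4 floats each.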

There are already several C++ wrapper libraries for SIMD, such as Agner Fog's ~~GPL~~ Apache-licensed VectorClass, and some others.

edited Dec 30 '19 at 04:01

answered Nov 04 '18 at 06:18

Peter Cordes


Sure, I am using _mm_setr_ps in the class constructor to set the value. But since the documentation says it sets the register value, I started worrying whether I have to make sure there are always enough free registers to do that. I checked the assembly and it looks like it's not always loading it into a register. Thanks for the help, I appreciate it! – Martin Nov 04 '18 at 06:27
@Martin: yup, a `__m128` variable is just a C variable; documentation that calls it a register is not strictly accurate. You *want* to write code that the compiler can optimize into registers when optimization is enabled, though. – Peter Cordes Nov 04 '18 at 06:54
@Martin: Make sure you provide a constructor that uses `_mm_loadu_ps` at least, for contiguous data. And see also the links I added about geometry vectors not mapping directly to SIMD vectors. – Peter Cordes Nov 04 '18 at 07:03

There's also [DirectXMath](https://github.com/Microsoft/DirectXMath) which is now available under a MIT license. – Chuck Walbourn Nov 06 '18 at 19:15

During 2019 Agner Fog changed the license of VectorClass to Apache License 2.0. See https://github.com/vectorclass/version2/blob/master/LICENSE – Erik Sjölund Dec 29 '19 at 22:20
@ErikSjölund: Thanks, I'd seen that but haven't searched through all my old answers to update them. If you see any more in future, feel free to edit yourself. – Peter Cordes Dec 30 '19 at 04:02
