It's exactly the same as when you have more int variables than there are integer registers: the compiler may have to spill some of them to memory if too many are live at the same time, and reload them later. Register allocation for vector registers is done pretty much the same way as for integer regs: the compiler analyses the data flow of a function and figures out which variables are live at the same time.
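You should think of the _mm_load_ps / _mm_loadu_ps and _mm_store_ps / _mm_storeu_ps intrinsics as describing type-punning between float memory and vector types, not as the only things that can compile to a vector load or store instruction. They also don't always compile to a load or store: the compiler can fold a load into another instruction's memory operand, or optimize it away entirely.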
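As a rough illustration of that point, here is a minimal sketch. The function names `add4` and `scale_and_add` are made up for this example. The first uses load/store intrinsics, but the compiler is free to fold a load into a memory operand or keep values in registers after inlining. The second uses no load/store intrinsics at all, yet the compiler may still emit vector stores and reloads for it if it runs out of registers in the surrounding code.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// Hypothetical helper: out[0..3] = a[0..3] + b[0..3].
// The loadu/storeu calls say "treat these 4 floats as a __m128";
// which instructions actually get emitted is up to the compiler.
void add4(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

// Hypothetical helper with no load/store intrinsics: x * s + y.
// If too many __m128 values are live where this gets inlined,
// the compiler can still spill some of them to the stack.
__m128 scale_and_add(__m128 x, __m128 y, float s) {
    return _mm_add_ps(_mm_mul_ps(x, _mm_set1_ps(s)), y);
}
```

Looking at the compiler's asm output for code like this (with optimization enabled) is the easiest way to see what actually became a load or store.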
And BTW, x86-64 has 16 vector registers (xmm0..15), while 32-bit mode only has xmm0..7. Compile for 64-bit if you want code that needs several registers to be efficient.
SSE for 3D vectors:
Generally avoid keeping a single direction/geometry vector in a SIMD vector. You can add efficiently, but any cross- or dot-products or length calculations will require shuffling.
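To show what that shuffling looks like, here is a sketch of a dot product on single vectors. It assumes each vector is stored as {x, y, z, 0} in one __m128; the name `dot3` is made up for this example. The multiply is one instruction, but reducing it to a scalar takes two shuffles and two adds:

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// Hypothetical helper: dot product of two vectors laid out as {x, y, z, 0}.
float dot3(__m128 a, __m128 b) {
    __m128 prod = _mm_mul_ps(a, b);           // {ax*bx, ay*by, az*bz, 0}
    __m128 hi   = _mm_movehl_ps(prod, prod);  // {p2, p3, p2, p3}
    __m128 sum  = _mm_add_ps(prod, hi);       // {p0+p2, p1+p3, ...}
    // Broadcast element 1 so it can be added to element 0.
    __m128 odd  = _mm_shuffle_ps(sum, sum, _MM_SHUFFLE(1, 1, 1, 1));
    sum = _mm_add_ss(sum, odd);               // low element = p0+p2+p1+p3
    return _mm_cvtss_f32(sum);
}
```

Only the multiply does useful work on more than one element at a time; the rest is overhead that goes away when the data layout doesn't need a horizontal sum.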
It's better if you can use a vector of 4 x values, a vector of 4 y values, etc., so you can compute 4 lengths in parallel. See https://stackoverflow.com/tags/sse/info for more, especially these slides:
SIMD at Insomniac Games (GDC 2015) which show how to lay out your data for efficient SIMD. (Struct of arrays, not array of structs).
See also Parallel programming using Haswell architecture
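As a sketch of the struct-of-arrays idea, assuming separate x, y and z arrays and an element count that's a multiple of 4 (the name `lengths_soa` and that restriction are just for this example), computing 4 lengths at a time needs no shuffles at all:

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics
#include <cstddef>

// Hypothetical helper: len[i] = sqrt(x[i]^2 + y[i]^2 + z[i]^2).
// Assumes n is a multiple of 4; a real version would need a cleanup loop.
void lengths_soa(const float* x, const float* y, const float* z,
                 float* len, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);  // 4 x values
        __m128 vy = _mm_loadu_ps(y + i);  // 4 y values
        __m128 vz = _mm_loadu_ps(z + i);  // 4 z values
        __m128 sq = _mm_add_ps(_mm_add_ps(_mm_mul_ps(vx, vx),
                                          _mm_mul_ps(vy, vy)),
                               _mm_mul_ps(vz, vz));
        _mm_storeu_ps(len + i, _mm_sqrt_ps(sq));  // 4 lengths at once
    }
}
```

Every instruction here does work on all 4 elements, which is the whole point of laying the data out this way.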
Sometimes you can get a minor benefit for a single vector in cases where you can't reorganize to compute lots of things in parallel. _mm_setr_ps() can be slow if the source data isn't contiguous, though.
There are already several C++ wrapper libraries for SIMD, such as Agner Fog's Apache-licensed VectorClass (older versions were GPL), and some others.