From the code sample you posted, it looks like you're just doing a variable-length memcpy. Depending on what the compiler does, and on the surrounding code, you might get better results from just actually calling memcpy. e.g. for aligned copies with a size that's a multiple of 16B, the break-even point between a vector loop and rep movsb is maybe as low as ~128 bytes on Intel Haswell. Check Intel's optimization manual for some implementation notes on memcpy, and a graph of size vs. cycles for a couple of different strategies. (Links in the x86 tag wiki.)
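For example (a minimal sketch; the wrapper name is mine), expressing the copy as a plain memcpy call lets the compiler pick the strategy, whether that's an inlined vector loop, rep movsb, or a call into the optimized libc implementation:

```c
#include <string.h>
#include <stddef.h>

// Hypothetical wrapper: let the compiler decide how to copy.
// Compilers expand memcpy inline for small or compile-time-known sizes,
// and call the tuned libc version for large ones.
static void copy_block(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}
```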
You didn't say what CPU, so I'm just assuming recent Intel.
I think you're too worried about registers. Loads that hit in L1 cache are extremely cheap. Haswell (and Skylake) can do two __m256 loads per clock (and a store in the same cycle). Before that, Sandybridge/IvyBridge can do two memory operations per clock, with at most one of them being a store; their load/store ports are only 16B wide, so a 256b load or store occupies a port for two cycles, and under ideal conditions they can sustain 2x 16B loaded and 1x 16B stored per clock. So loading/storing 256b vectors is more expensive than on Haswell, but still very cheap if they're aligned and hot in L1 cache.
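To illustrate (a minimal sketch, function name mine, compile with AVX enabled): a straightforward 32B-at-a-time loop like this only keeps data in a register within one iteration, and that's fine, because the loads hit L1:

```c
#include <immintrin.h>
#include <stddef.h>

// Sketch of a 32B-at-a-time copy loop (tail handling for len % 32 omitted).
// Each iteration's load/store is cheap when it hits L1 cache; there's no
// need to keep data resident in ymm registers any longer than this.
static void copy_avx(char *dst, const char *src, size_t len)
{
    for (size_t i = 0; i + 32 <= len; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
        _mm256_storeu_si256((__m256i *)(dst + i), v);
    }
}
```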
I mentioned in comments that GNU C global register variables might be a possibility, but mostly in a "this is technically possible in theory" sense. You probably don't want multiple vector registers dedicated to this purpose for the entire run-time of your program (including library function calls, so you'd have to recompile them).
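If you did want to try it anyway, the GNU C syntax looks something like this (an untested sketch; GCC-specific, and the variable name is mine). Every translation unit touching the value would have to see this declaration, and any code compiled without it is free to clobber the register:

```c
#include <immintrin.h>

// GNU C global register variable pinned to ymm15 (compile with -mavx).
// ymm15 is then off-limits to the register allocator in code that sees
// this declaration; library code compiled without it may still clobber it.
register __m256i g_pinned asm("ymm15");
```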
In reality, just make sure the compiler can inline (or at least see while optimizing) the definitions of every function you use inside any important loops. That way it can avoid spilling/reloading vector regs around function calls (since neither the Windows nor the System V x86-64 ABI has any call-preserved YMM (__m256) registers).
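Something like this (a hedged sketch; the helper name and the transform are placeholders, compile with -mavx2): because the helper's definition is visible, the compiler can inline it and keep v in a ymm register across the "call":

```c
#include <immintrin.h>
#include <stddef.h>

// Placeholder transform. "static inline" keeps the definition visible,
// so the compiler can inline it instead of emitting a real call, which
// would force a spill/reload of live ymm values around the call site.
static inline __m256i scale_bytes(__m256i v)
{
    return _mm256_add_epi8(v, v);
}

void process(char *buf, size_t len)
{
    for (size_t i = 0; i + 32 <= len; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(buf + i));
        _mm256_storeu_si256((__m256i *)(buf + i), scale_bytes(v));
    }
}
```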
See Agner Fog's microarch pdf to learn even more about the microarchitectural details of modern CPUs, at least the details that are possible to measure by experiment and tune for.