You are asking two questions: in what order the elements of an x86 SIMD register are read and written, and which methods to use for this.
You can set the values of a register in big-endian format (highest element first) like this:
__m128 x = _mm_set_ps(4.0, 3.0, 2.0, 1.0);
or in little-endian format (lowest element first) like this:
__m128 x = _mm_setr_ps(1.0, 2.0, 3.0, 4.0); // the r stands for reversed
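To convince yourself that the two calls above build the same register, a minimal sketch like the following can help. It assumes an x86 target with SSE available. The names lanes_to_array and set_and_setr_agree are made up for this illustration, and _mm_storeu_ps (covered further down) is used here only to inspect the result.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_set_ps, _mm_setr_ps, _mm_storeu_ps */

/* Hypothetical helper: copy the four lanes of v to out[0..3], lowest lane first. */
static void lanes_to_array(__m128 v, float out[4]) {
    _mm_storeu_ps(out, v);
}

/* Hypothetical check: returns 1 if both intrinsics give the same lanes, else 0. */
static int set_and_setr_agree(void) {
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);   /* highest lane first */
    __m128 b = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);  /* lowest lane first  */
    float fa[4], fb[4];
    lanes_to_array(a, fa);
    lanes_to_array(b, fb);
    for (int i = 0; i < 4; i++) {
        if (fa[i] != fb[i]) {
            return 0;
        }
    }
    return 1;
}
```

Calling set_and_setr_agree() should return 1: in both cases the value 1.0 ends up in the lowest lane and 4.0 in the highest.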
The confusion probably comes from arrays. We write and store arrays in little-endian order (lowest index first) irrespective of hardware. For example, we write
float xa[] = {1.0, 2.0, 3.0, 4.0};
(The x86 architecture stores the bytes of each float in little-endian format but that's a separate issue).
So, in terms of order, there is only one way to load an array:
__m128 x = _mm_loadu_ps(xa);
Now we can answer your other question. If you want to access multiple elements of the SSE register, the best method is to store the values to an array like this:
float t[4]; _mm_storeu_ps(t, x);
Since the store goes to an array, there is only one possible order, just as with the load.
I think using the store intrinsics is the best solution because it does not rely on a compiler-specific implementation. It works with GCC, ICC, Clang, and MSVC, in both C and C++. That's the point of intrinsics: they give you assembly-like features that don't depend on a particular compiler implementation or assembly syntax.
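As a rough sketch of the store-then-index approach, something like this should work on any of those compilers, assuming an x86 target with SSE. The helper name get_lane is hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_storeu_ps */

/* Hypothetical helper: return lane i of v (0 = lowest lane, 3 = highest)
   by storing the whole register to a temporary array and indexing it. */
static float get_lane(__m128 v, int i) {
    float t[4];
    assert(i >= 0 && i < 4);  /* only four float lanes in a __m128 */
    _mm_storeu_ps(t, v);      /* lowest lane is written to t[0] */
    return t[i];
}
```

With float xa[] = {1.0f, 2.0f, 3.0f, 4.0f} and x = _mm_loadu_ps(xa), get_lane(x, i) gives back xa[i], so the array order survives the round trip through the register. If you need several elements, store once and read the array directly instead of calling a helper like this per element.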
But if you only want the first (lowest) element, use _mm_cvtss_f32.
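For the single-element case, a small sketch could look like this (again assuming x86 with SSE; first_element is just an illustrative wrapper name, not part of any library):

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_cvtss_f32 */

/* Hypothetical wrapper: return the lowest lane of v without going through memory. */
static float first_element(__m128 v) {
    return _mm_cvtss_f32(v);
}
```

Note that "first" here means the lowest lane: for _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f) it is 1.0, the last argument, while for _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f) it is the first argument.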
This is also worth reading.
More on endianness.
If we wrote numbers in little-endian format, maybe there would be less confusion. Consider comparing integers written on separate lines the way we usually write numbers (big-endian style):
54321
 4321
  321
   21
    1
We end up right-justifying the numbers by inserting the appropriate spaces. If we had used little-endian format, it would have been
12345
1234
123
12
1
That requires less work to write but then we would probably read the numbers right-to-left.