
I was messing around with SSE when I discovered, to my surprise, that the following works. I was trying to access the individual floats in an __m128.

__m128 x = _mm_set_ps(1.0, 2.0, 3.0, 4.0);
cout << x[1] << endl;

Here I can read from it as if it were an array. x[0] prints 4 and x[1] prints 3, which is backwards from what I was expecting, but this may have to do with the byte order. What I am wondering is: is this a standard / recommended way to read or even write individual components of the vector? All the examples I have seen online seem to use a union between the vector type and an array.

chasep255
  • By convention, register parts and values, like numbers, are written in big-endian order with the most significant piece/digit on the left. The x86 architecture is little-endian internally, however, so the first element actually written to memory is the least significant. That's the general rule, but as you've experienced this can lead to some confusion, and the intrinsics library is not necessarily completely consistent. Incidentally, Arabic numerals were supposedly originally little-endian as well but weren't flipped when adopted by Western left-to-right writing systems. – doynax Oct 18 '15 at 07:07
  • You tagged this C, but you use `cout`, which is C++. Do you want an answer for both C and C++ or just C++? – Z boson Oct 18 '15 at 10:56
  • @doynax, I was not aware that Arabic numerals were originally little-endian. That's interesting. I think it makes more sense. – Z boson Oct 18 '15 at 17:06

2 Answers


No, that is not legal and the MSDN documentation explicitly warns not to do that.

In C++, you could use a valarray. In C, most compilers support the GCC extension __attribute__ ((vector_size(N))).

On Intel compilers since Parallel Studio 2011, MSVC for even longer, but not GCC (although GCC and Clang do support type-punning), you can write:

__m128 result = foo();
float f1 = result.m128_f32[0];

Whether or not this is undefined behavior in general, it’s supported on the compilers you probably care about, and it is unlikely that a future implementation will compile code that uses m128_f32 but silently break it. You might be able to use other intrinsics, such as _mm_store_ss(), to extract a float.

The maximally-portable solution is memcpy().
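A minimal sketch of the memcpy() approach, assuming only <xmmintrin.h> and the standard library. The helper name lane_via_memcpy is made up for illustration and is not part of any intrinsics header.

```cpp
#include <cstring>      // std::memcpy
#include <xmmintrin.h>  // SSE: __m128, _mm_set_ps

// Hypothetical helper (illustration only): returns lane i of v, where
// lane 0 is the lowest element, i.e. the first float in memory order.
inline float lane_via_memcpy(__m128 v, int i) {
    float tmp[4];
    std::memcpy(tmp, &v, sizeof tmp);  // copy all 16 bytes of the vector
    return tmp[i & 3];                 // mask keeps the index within 0..3
}
```

With the vector from the question, _mm_set_ps(1.0, 2.0, 3.0, 4.0), lane 0 comes back as 4 and lane 1 as 3, matching what x[0] and x[1] printed. Compilers typically turn a fixed-size memcpy like this into a plain register-to-stack store, so the portability should not cost much.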

Davislor
  • What I really want to know is if it is considered good practice to just use the array notation on the vector type like I did in my question. That code compiles and runs but I have never seen that before. – chasep255 Oct 18 '15 at 07:09
  • @Jason The question is tagged C and GCC, where type-punning is explicitly legal in the GCC documentation. If this is SSE code, it will only run on a little-endian CPU, as well. However, the code snippet is C++. – Davislor Oct 18 '15 at 07:58
  • I agree, although using unions to type pun isn't legal/safe in general. – Jason Oct 18 '15 at 16:22
  • @MarcGlisse Thanks for the information! Edited my post. – Davislor Oct 18 '15 at 19:04

You are asking two questions: the order to read/write x86 SIMD registers and the methods to do this.

You can set the values of a register in big-endian format like this

__m128 x = _mm_set_ps(4.0, 3.0, 2.0, 1.0);

or in little-endian format like this

__m128 x = _mm_setr_ps(1.0, 2.0, 3.0, 4.0); // r presumably means reverse

The confusion probably comes from arrays. We write/store arrays in little-endian format irrespective of hardware. I mean for example we write

float xa[] = {1.0, 2.0, 3.0, 4.0};

(The x86 architecture stores the bytes of each float in little-endian format but that's a separate issue).

So there is only one way to load an array in terms of order

__m128 x = _mm_loadu_ps(xa);
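To check that the three ways of building the vector really produce the same register contents, here is a small sketch. The helper names all_lanes_equal and three_ways_agree are made up for this example; the intrinsics themselves are the standard SSE ones.

```cpp
#include <xmmintrin.h>  // SSE: _mm_set_ps, _mm_setr_ps, _mm_loadu_ps, ...

// Hypothetical helper: true when all four lanes of a and b compare equal.
inline bool all_lanes_equal(__m128 a, __m128 b) {
    // _mm_cmpeq_ps sets a lane to all ones where the inputs are equal;
    // _mm_movemask_ps gathers the four sign bits into the low 4 bits.
    return _mm_movemask_ps(_mm_cmpeq_ps(a, b)) == 0xF;
}

// Builds the same vector three ways and reports whether they all match.
inline bool three_ways_agree() {
    float xa[] = {1.0f, 2.0f, 3.0f, 4.0f};
    __m128 from_set  = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);   // highest lane first
    __m128 from_setr = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);  // lowest lane first
    __m128 from_load = _mm_loadu_ps(xa);                     // array order
    return all_lanes_equal(from_set, from_setr) &&
           all_lanes_equal(from_setr, from_load);
}
```

Passing the same argument list to both _mm_set_ps and _mm_setr_ps, on the other hand, gives two different vectors, which is the surprise in the question.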

Now we can answer your other question. If you want to access multiple elements of the SSE register the best method is to store the values to an array like this

float t[4]; _mm_storeu_ps(t, x);

Since it stores to an array, there is only one possible order, just as with the load.

Using the store intrinsics is, I think, the best solution because it does not rely on a compiler-specific implementation. It will work with GCC, ICC, Clang, and MSVC in both C and C++. That's the point of intrinsics: they give you assembly-like features that don't depend on a particular compiler implementation or assembly syntax.
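As a sketch of how store and load can cover both reading and writing a component, assuming nothing beyond <xmmintrin.h>: the helper names to_array and with_lane are hypothetical and only serve the example.

```cpp
#include <xmmintrin.h>  // SSE: _mm_storeu_ps, _mm_loadu_ps

// Hypothetical helper: copies the four lanes of v into out, lowest lane first.
inline void to_array(__m128 v, float out[4]) {
    _mm_storeu_ps(out, v);  // unaligned store, so out needs no special alignment
}

// Hypothetical helper: returns a copy of v with lane i replaced by value.
inline __m128 with_lane(__m128 v, int i, float value) {
    float t[4];
    _mm_storeu_ps(t, v);     // spill the vector to a temporary array
    t[i & 3] = value;        // mask keeps the index within 0..3
    return _mm_loadu_ps(t);  // reload the modified array
}
```

Going through memory like this is simple and portable, but if it sits in a hot loop it may be worth checking the generated assembly, since a shuffle or blend can sometimes do the same job without leaving the registers.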

But if you only want the first element use _mm_cvtss_f32.
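A short sketch of that, plus one possible way to reach the other lanes without going through memory. The shuffle-based lane<I> template is my own addition for illustration, not something the intrinsics headers provide; it assumes the index is known at compile time, because _mm_shuffle_ps needs an immediate operand.

```cpp
#include <xmmintrin.h>  // SSE: _mm_cvtss_f32, _mm_shuffle_ps, _MM_SHUFFLE

// Returns the lowest lane (lane 0) of v.
inline float first_lane(__m128 v) {
    return _mm_cvtss_f32(v);
}

// Hypothetical helper: returns lane I of v for a compile-time index I.
template <int I>
inline float lane(__m128 v) {
    static_assert(I >= 0 && I < 4, "lane index must be 0..3");
    // Broadcast lane I into every position, then read the lowest lane.
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(I, I, I, I)));
}
```

For the vector in the question, first_lane returns 4 and lane<1> returns 3, the same values that x[0] and x[1] printed.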

This is also worth reading.


More on endianness.

If we wrote numbers in little-endian format maybe there would be less confusion. Consider comparing integers written in separate lines the way we usually write numbers (big-endian style):

54321
 4321
  321
   21
    1

We end up right justifying the numbers by entering appropriate spaces. If we had used little-endian format it would have been

12345
1234
123
12
1

That requires less work to write but then we would probably read the numbers right-to-left.

Z boson