How to constexpr initialize intrinsic SSE/AVX register?

Question

Consider something like __m128i xmm_stuff = _mm_set_epi32(1, 2, 3, 4); it could be const but not constexpr, because of an underlying reinterpret_cast in the compiler implementations and because intrinsics are functions that aren't declared constexpr.
For example, from clang-12's immintrin.h:

static __inline__ __m128i __DEFAULT_FN_ATTRS
_mm_set_epi32(int __i3, int __i2, int __i1, int __i0)
{
  return __extension__ (__m128i)(__v4si){ __i0, __i1, __i2, __i3};
}

Is there some cunning (and portable) way to initialize `__m128i` and such as `constexpr`?

Actual usage

alignas(16) constexpr std::array<int32_t,4> input = {1, 2, 3, 4};  // _mm_load_si128 requires 16-byte alignment
const __m128i xmm_input = _mm_load_si128(reinterpret_cast<const __m128i*>(input.data()));

Desired usage, more concise and clear:

constexpr __m128i xmm_input = {1, 2, 3, 4};

FYI: `__` at the start of an identifier is not allowed for normal user code. They can only be used by implementations of the standard library. — Nicol Bolas, Jan 23 '22 at 17:10
not sure I'm following... how you suggest to declare a `__m128i` type when referencing it in my code? — kreuzerkrieg, Jan 23 '22 at 18:27
I'm talking about the function parameters: `__i3` and such. This is illegal in C++ unless you're writing a standard library implementation, as such identifiers are reserved for the implementation. `__m128i` is an identifier provided by your implementation. — Nicol Bolas, Jan 23 '22 at 18:35
An inquisitive reader would notice that the example is taken from clang's intrinsic implementation, so it is the implementation of a library, provided by the compiler vendor. Perfectly valid code. — kreuzerkrieg, Jan 23 '22 at 18:43
What are you hoping to gain from `constexpr` with Intel intrinsics that you can't get from `const`? Intrinsics like `_mm_add_epi32` are (unfortunately) not `constexpr`, even though compilers do in practice know how to do constant-propagation through them. (At least gcc/clang do). And you pretty much never want to use a `static __m128i` or one in global scope, since compilers are dumb and will actually reserve space in the BSS and make a "constructor" that copies a constant from .rodata. (Although if there was a constexpr _mm_set we might avoid that!) — Peter Cordes, Jan 23 '22 at 22:48
@PeterCordes actually nothing, except clarity in declaration and just curiosity about why it couldn't be achieved. Your comment about `static` is very important; yesterday I found out the hard way that static actually degrades performance — kreuzerkrieg, Jan 24 '22 at 12:16

Peter Cordes · Accepted Answer · 2022-01-24T02:21:55.633

> Not that it would change anything semantically or performance-wise

Correct, just use const __m128i like most code does.
I don't see any benefit to constexpr for this use-case, just pain for no gain.

Maybe if there was a way, it would let you initialize vectors in static storage (global or static) without the usual mess you get if you use _mm_set, where the compiler reserves space in .bss and runs a constructor at run-time to copy from an anonymous constant in .rodata.

(Yes, it's really that bad with gcc/clang/MSVC; see Godbolt. Don't use static const __m128i, or a __m128i at global scope. Instead, do const __m128i foo = _mm_set_epi32() or whatever inside functions; compilers + linkers will eliminate duplicates, like with string literals. Or use plain arrays with alignas(16) and _mm_load_si128 from them inside functions if that works better.)
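A minimal sketch of the second suggestion above: keep the data as a plain constexpr array with alignas(16) and load it inside a function. The names kInput and doubled_first_lane are hypothetical, chosen only for illustration.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2: __m128i, _mm_load_si128, _mm_add_epi32

// Plain data can be constexpr; alignas(16) makes the aligned load below valid.
alignas(16) constexpr std::int32_t kInput[4] = {1, 2, 3, 4};

// Hypothetical helper: adds the vector to itself and returns the lowest lane.
inline int doubled_first_lane() {
    // Function-local const vector: no static-storage __m128i object involved.
    const __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(kInput));
    const __m128i sum = _mm_add_epi32(v, v);
    return _mm_cvtsi128_si32(sum);  // lowest 32-bit lane, i.e. kInput[0] * 2
}
```

The array is the only object with static storage, so nothing has to be constructed at run-time; the vector itself exists only inside the function.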

> just curious why in the year 2022 I can't declare constexpr __m128i

You can declare constexpr __m128i, you just can't portably initialize it¹, because Intel intrinsics like _mm_set_* were defined before the year 2000 (for MMX and then SSE1), and aren't constexpr. (And later intrinsics still follow the same pattern established for SSE1.) Remember, in C / C++ terms they're actual functions that just happen to inline. (Or macros around __builtin functions to get a compile-time constant for an operand that becomes an immediate.)

Footnote 1: In C++20, GCC lets you use constexpr auto y = std::bit_cast<__m128i>(x);, as shown in https://godbolt.org/z/YGMGM69qs. Other compilers accept bit_cast<float> or whatever, but not __m128, so this may be an implementation detail of GCC. In any case, it doesn't save typing, and wouldn't be useful for much even if it were portable to clang and MSVC.
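A hedged sketch of what the footnote describes, guarded so that it only takes the std::bit_cast path on GCC in C++20 mode and otherwise falls back to an ordinary run-time load. The names kLanes, lanes_vector and lowest_lane are hypothetical, and whether the constexpr path is accepted depends on the compiler version.

```cpp
#include <array>
#include <cstdint>
#include <emmintrin.h>
#if __cplusplus >= 202002L
#include <bit>  // std::bit_cast (C++20)
#endif

constexpr std::array<std::int32_t, 4> kLanes = {1, 2, 3, 4};

#if defined(__cpp_lib_bit_cast) && defined(__GNUC__) && !defined(__clang__)
// GCC-only path: relies on GCC accepting bit_cast to __m128i in a constant
// expression, which may be an implementation detail.
inline __m128i lanes_vector() {
    constexpr __m128i v = std::bit_cast<__m128i>(kLanes);
    return v;
}
#else
// Fallback: constexpr data, non-constexpr unaligned load.
inline __m128i lanes_vector() {
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(kLanes.data()));
}
#endif

// Returns the lowest 32-bit lane, which corresponds to kLanes[0].
inline int lowest_lane() { return _mm_cvtsi128_si32(lanes_vector()); }
```

Either way the caller sees the same value; the guard only changes whether the vector object itself is a constant expression.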

There's little point to it because intrinsic functions like _mm_add_epi32 are also not constexpr, and you can't portably do v1 += v2; (in GNU C/C++ that does compile, to a paddq).

Example with non-portable braced initializers; don't do this:

#include <immintrin.h>

__m128i foo() {
    // different meaning in GCC/clang vs. MSVC
    constexpr __m128i v = {1, 2};
    return v;
}

GCC 11.2 -O3 asm output (Godbolt) - two long long halves, as per the way GCC/clang define __m128i as typedef long long __m128i __attribute__((vector_size(16),may_alias))

foo():
        movdqa  xmm0, XMMWORD PTR .LC0[rip]
        ret
.LC0:
        .quad   1
        .quad   2

MSVC 19.30 - the first two bytes of 16x int8_t - MSVC defines __m128i as a union of arrays of various element-widths, apparently with the char[16] first.

__xmm@00000000000000000000000000000201 DB 01H, 02H, 00H, 00H, 00H, 00H, 00H
        DB      00H, 00H, 00H, 00H, 00H, 00H, 00H, 00H, 00H

__m128i foo(void) PROC                   ; foo, COMDAT
        movdqa  xmm0, XMMWORD PTR __xmm@00000000000000000000000000000201
        ret     0
__m128i foo(void) ENDP                   ; foo

So you could initialize a vector to {0} and get the same result on gcc/clang as on MSVC, or I guess any {0..255}. But that's still taking advantage of implementation details on each specific compiler, not strictly using Intel's documented intrinsics API.

And MS says you should never directly access those fields of the union (the way MSVC defines __m128i).

GCC does define semantics for GNU C native vectors; GCC / clang implement the Intel intrinsics API (including __m128i) on top of their portable vector extensions which work like a struct or class with operators like + - & | * / [] and so on.
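To illustrate the difference, here is a small sketch that uses the GNU C vector operator where it is available and the documented intrinsic elsewhere. The helper names add_epi64_lanes and low_lane_sum are hypothetical, and the __GNUC__ check is an assumption about which compilers accept operators on __m128i.

```cpp
#include <cstdint>
#include <emmintrin.h>

// Adds two vectors as a pair of 64-bit lanes.
inline __m128i add_epi64_lanes(__m128i a, __m128i b) {
#if defined(__GNUC__)
    // GNU C native vectors (GCC/clang): __m128i is a vector of two long long,
    // so operator+ is an element-wise 64-bit add.
    return a + b;
#else
    // Portable spelling using Intel's documented intrinsics API.
    return _mm_add_epi64(a, b);
#endif
}

// Puts x and y in the low lanes, adds them, and reads the low lane back.
inline std::int64_t low_lane_sum(std::int64_t x, std::int64_t y) {
    const __m128i r = add_epi64_lanes(_mm_set_epi64x(0, x), _mm_set_epi64x(0, y));
    std::int64_t out[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
    return out[0];
}
```

Code that needs to build with MSVC as well should stick to the intrinsic branch.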

See also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? re: what a __m128i object is and how it works.

Terminology: `__m128i` isn't a register.

It's a C++ object like an int that can fit in a register, and normally compilers will keep the variable's value in a register across statements, if you enable optimization.

But you can still take its address, memcpy into / out of (parts of) it, and otherwise mess with its object representation, all of which works according to the rules of the C++ abstract machine (including the vector extensions). (The resulting asm might not be very efficient vs. using shuffle intrinsics, though!)
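As a small sketch of that point, the following takes the address of a __m128i local and uses memcpy to overwrite one 32-bit lane and to copy the whole object out. The function name replace_and_read_lane1 is hypothetical.

```cpp
#include <cstdint>
#include <cstring>
#include <emmintrin.h>

inline std::int32_t replace_and_read_lane1(std::int32_t replacement) {
    __m128i v = _mm_set_epi32(4, 3, 2, 1);  // lanes from low to high: 1, 2, 3, 4
    // Overwrite bytes 4..7 (32-bit lane 1) through the object's address.
    std::memcpy(reinterpret_cast<char*>(&v) + 4, &replacement, sizeof replacement);
    // Copy the full object representation out into plain integers.
    std::int32_t lanes[4];
    std::memcpy(lanes, &v, sizeof v);
    return lanes[1];
}
```

With optimization enabled a compiler may still keep v in a register and turn the memcpy calls into an insert or shuffle, but nothing in the language requires that.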

You can make an array or even std::vector<__m128i> (with C++17 for aligned allocation), and obviously those __m128i objects can't all be in registers.

Better terminology: "initialize an AVX intrinsic vector". These types represent a SIMD vector of data, which can be loaded into a vector register. Just like an int represents a fixed-width integer that can be loaded into an integer register. It's common to write code using __m128i in ways that all such objects are locals that actually can live in registers, hopefully not even getting spilled/reloaded, but that's due to how it's used, not what it is.

When you talk about initializing an int object, you talk about the object, not the register. (Especially for constexpr; there are no registers in the C++ abstract machine.)

GCC seems to allow `std::bit_cast` to `__m128i` in a constant expression and Clang says that it is just not supported _yet_. I don't know whether GCC is doing "the right thing", but if the other compilers allowed it, it may be a method to initialize a `constexpr __m128i` without relying on the particular implementation, I think?: https://godbolt.org/z/YGMGM69qs — user17732522, Jan 24 '22 at 00:16
@user17732522: Oh yes, that can work. It doesn't actually help you keep your source neat & tidy, though. And if the goal is to `static_assert` some test of SIMD code (like [this Q&A seemed to be talking about](https://stackoverflow.com/questions/70777429/is-it-possible-to-do-an-aligned-alllocation-in-a-constexpr-context)), that's a non-starter because you can't portably do anything with a `__m128i` in a constexpr context: all the intrinsics like `_mm_add_epi32` are non-constexpr. Some compilers actually *don't* constant-propagate through them the way gcc/clang do. — Peter Cordes, Jan 24 '22 at 00:27
@user17732522: I edited in a note about bit_cast. I didn't take the time to rephrase other parts of the answer :/ — Peter Cordes, Jan 24 '22 at 00:34
@user17732522: Oh, I was in a rush earlier, didn't notice that it doesn't compile with other compilers. That might well be a happens-to-work implementation detail of GCC. OTOH, it does work to bit_cast an int to constexpr float (https://godbolt.org/z/3s578WE79), except with ICC where it seems to be missing from the headers (at least in Godbolt's hacky setup.) — Peter Cordes, Jan 24 '22 at 01:50
`std::bit_cast` is mandated by the standard to work in a constant expression between trivially copyable types of the same size (with some restrictions e.g. on reference members). Of course it is not clear how non-standard types play into this. However, I did forget that one of the restrictions is that the types are not a union type, so it will probably never work with an implementation of `__m128i` based on a union. — user17732522, Jan 24 '22 at 08:31

score 2 · Answer 2 · answered Jan 23 '22 at 17:10

Registers don't exist at compile-time. Whatever these AVX instructions are doing, the compile-time result is going to have to be loaded into a register at runtime. So you should just compute that compile-time value using normal C++ code (perhaps using if (std::is_constant_evaluated()) to fence off such blocks of code to allow you to put both in the same function) and then load that constexpr value into an AVX object.
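A minimal sketch of that approach, assuming C++14 or later: the values are computed by ordinary constexpr code and only the final load into a vector object is non-constexpr. The type Lanes4 and the functions make_squares, load_squares and sum_of_squares are hypothetical names for illustration.

```cpp
#include <cstdint>
#include <emmintrin.h>

struct Lanes4 { std::int32_t v[4]; };  // hypothetical plain-data holder

// Ordinary constexpr C++: the squares of 1..4.
constexpr Lanes4 make_squares() {
    Lanes4 r{};
    for (int i = 0; i < 4; ++i) r.v[i] = (i + 1) * (i + 1);
    return r;
}

constexpr Lanes4 kSquares = make_squares();
static_assert(kSquares.v[3] == 16, "value is computed at compile time");

// The only non-constexpr step: load the constant data into a vector object.
inline __m128i load_squares() {
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(kSquares.v));
}

// Stores the vector back to memory and sums the four lanes.
inline int sum_of_squares() {
    std::int32_t out[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), load_squares());
    return out[0] + out[1] + out[2] + out[3];
}
```

The static_assert checks the plain data, which is the part that can be tested at compile time without any constexpr intrinsics.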

Nicol Bolas

What do you mean by "Registers don't exist at compile-time."? Registers are just "registers", it is a POD struct, union or whatever the library decides to use to hold needed numbers of bytes. Example https://clang.llvm.org/docs/LanguageExtensions.html#vector-literals – kreuzerkrieg Jan 23 '22 at 18:33
@kreuzerkrieg: I mean that your compiler isn't going to put such objects in registers at compile-time because they don't exist. So your code is going to have to load the data for such things into a register. The only question is where it gets that data from. – Nicol Bolas Jan 23 '22 at 18:37
let me rephrase it: declare a constexpr instance of a type that would represent the content of a register at runtime. Meaning, I don't care what you are going to do later on to get the data into the register; when declaring, I want the declaration to be `constexpr` – kreuzerkrieg Jan 23 '22 at 18:47
@kreuzerkrieg: But it doesn't have to be a declaration of `__m128i`, does it? Could it not be a declaration of `constexpr std::array`, which you then non-`constexpr` convert to an `__m128i`? – Nicol Bolas Jan 23 '22 at 18:59
yes, exactly. this is what I'm doing now. but as you can imagine it is quite ugly in the code. but because of the `reinterpret_cast` which is (essentially) used to fill data into `__m128i` I can't declare `constexpr __m128i`. Not that it would change anything semantically or performance-wise, just curious why in the year 2022 I can't declare `constexpr __m128i` – kreuzerkrieg Jan 23 '22 at 19:06
Just edited the question with example to make it more clear – kreuzerkrieg Jan 23 '22 at 19:13
@kreuzerkrieg: `__m128i` isn't a register. It's a C++ object like an `int` that can fit in a register, and normally compilers will keep the variable's value in a register across statements, if you enable optimization. But you can still take its address, memcpy into / out of (parts of) it, and otherwise mess with its object representation, all of which works according to the rules of the C++ abstract machine (including the vector extensions). (The resulting asm might not be very efficient vs. using shuffle intrinsics, though!) So that's what Nicol's statement means. – Peter Cordes Jan 23 '22 at 23:03
Perhaps you can create a type which implicitly converts to `__m128i`? This would be good for initializing. Or do you want to have all intrinsic SSE/AVX instructions as constexpr, too? E.g. for simulation/testing? – Sebastian Jan 23 '22 at 23:28
@PeterCordes you are right, I should put the register in quotes, since it is not real machine register – kreuzerkrieg Jan 24 '22 at 12:09
@Sebastian hm... interesting. Will definitely try it – kreuzerkrieg Jan 24 '22 at 12:11
@Sebastian on second thought... whatever you do, it will end up using `_mm_load_si128` or something similar, so no implicitly convertible struct could be used as `constexpr`, IMHO – kreuzerkrieg Jan 24 '22 at 12:44
That is true. But you can prepare as `constexpr` everything up to the last step of constructing `__m128i` without explicit conversion needed then. And hopefully the optimizer will be able to make use of that constant initialization. It is not so different from the function example by Peter (as the `operator __m128i` *is* a function). – Sebastian Jan 24 '22 at 13:09
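A rough sketch of the wrapper idea discussed in these comments: a constexpr-friendly aggregate whose conversion operator performs the (non-constexpr) load. The type ConstVec4i, the constant kWrapped and the function first_lane_plus_itself are hypothetical names, not part of any library.

```cpp
#include <cstdint>
#include <emmintrin.h>

// Hypothetical wrapper: plain lanes that can be initialized as constexpr.
struct ConstVec4i {
    std::int32_t lanes[4];
    // Non-constexpr conversion; this is where the actual load happens.
    operator __m128i() const {
        return _mm_loadu_si128(reinterpret_cast<const __m128i*>(lanes));
    }
};

constexpr ConstVec4i kWrapped = {{1, 2, 3, 4}};

inline int first_lane_plus_itself() {
    const __m128i v = kWrapped;  // implicit conversion at the point of use
    // The wrapper can also be passed directly where a __m128i is expected.
    return _mm_cvtsi128_si32(_mm_add_epi32(v, kWrapped));
}
```

The declaration reads close to the desired `constexpr __m128i xmm_input = {1, 2, 3, 4};`, but the object that is constexpr is the wrapper, not the vector.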
