
So, I want to set an individual bit of a __m256i register.

Say, my __m256i contains: [ 1 0 1 0 | 1 0 1 0 | ... | 1 0 1 0 ], how do I set and unset the n-th bit?

asked by S.H (edited by phuclv)

4 Answers


Here is an implementation of a function that can set an individual bit inside a vector:

#include <immintrin.h>
#include <assert.h>
#include <stdint.h>   // uint8_t

void SetBit(__m256i & vector, size_t position, bool value)
{
    assert(position <= 255);
    uint8_t lut[32] = { 0 };
    lut[position >> 3] = 1 << (position & 7);
    __m256i mask = _mm256_loadu_si256((__m256i*)lut);
    if (value)
        vector = _mm256_or_si256(mask, vector);
    else
        vector = _mm256_andnot_si256(mask, vector);
}

int main(int argc, char* argv[])
{
    __m256i a = _mm256_set1_epi8(-1);
    SetBit(a, 54, false);

    __m256i b = _mm256_set1_epi8(0);
    SetBit(b, 54, true);

    return 0;
}
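The LUT trick above just maps a global bit position to a byte index and an in-byte mask. As a scalar sketch of that mapping (no AVX needed; the helper name is mine, not from the answer):

```c
#include <stdint.h>
#include <stddef.h>

/* Map a 0..255 bit position to the byte index and in-byte mask
   that the LUT above stores: byte = position / 8, mask = 1 << (position % 8). */
static inline void bit_to_byte_mask(size_t position, size_t *byte, uint8_t *mask)
{
    *byte = position >> 3;                   /* which of the 32 bytes */
    *mask = (uint8_t)(1u << (position & 7)); /* which bit within that byte */
}
```

For example, position 54 lands in byte 6 with mask 0x40, which is exactly the one non-zero entry the LUT gets.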
answered by ErmIg (edited by Paul R)
    Just to be clear for anyone else seeing this, this is not an efficient method due to the store-forwarding stall. So it's not something you're gonna want to put in a performance-critical loop. Unfortunately, SIMD isn't designed for this. So there probably are no efficient ways to do it. I imagine a shift+permute *might* be faster, but it's also a lot more complicated. – Mysticial Sep 14 '16 at 15:00
  • In Fortran90 there is IBSET, IBSHFT, IBTEST, etc, intrinsics. So this is a place where a mixed language solution may be worthwhile. – Holmz Sep 14 '16 at 20:42
  • The store forwarding is mostly a latency concern, not a throughput one, right? – BeeOnRope Mar 09 '17 at 17:39

There is another implementation:

#include <immintrin.h>
#include <assert.h>
#include <stdint.h>   // uint8_t

template <bool value> void SetMask(const __m256i & mask, __m256i & vector);

template <> inline void SetMask<true>(const __m256i & mask, __m256i & vector)
{
    vector = _mm256_or_si256(mask, vector);
}

template <> inline void SetMask<false>(const __m256i & mask, __m256i & vector)
{
    vector = _mm256_andnot_si256(mask, vector);
}

template <int position, bool value> void SetBit(__m256i & vector)
{
    const uint8_t mask8 = 1 << (position & 7);
    const __m128i mask128 = _mm_insert_epi8(_mm_setzero_si128(), mask8, (position >> 3)&15);
    const __m256i mask256 = _mm256_inserti128_si256(_mm256_setzero_si256(), mask128, position >> 7);
    SetMask<value>(mask256, vector);
}

int main(int argc, char* argv[])
{
    __m256i a = _mm256_set1_epi8(-1);
    SetBit<50, false>(a);

    __m256i b = _mm256_set1_epi8(0);
    SetBit<50, true>(b);

    return 0;
}
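The nested inserts decompose the bit position into a 128-bit lane, a byte within that lane, and a bit within that byte. A scalar sketch of the decomposition (the helper is mine):

```c
#include <stdint.h>

/* Decompose a 0..255 bit position the way the template above does:
   128-bit lane = position >> 7, byte within the lane = (position >> 3) & 15,
   bit within the byte = position & 7. */
typedef struct { int lane; int byte; uint8_t mask; } bit_pos;

static inline bit_pos decompose_bit(int position)
{
    bit_pos p;
    p.lane = position >> 7;
    p.byte = (position >> 3) & 15;
    p.mask = (uint8_t)(1u << (position & 7));
    return p;
}
```

Because `position` is a template parameter, all three indices are compile-time constants, so `_mm_insert_epi8` and `_mm256_inserti128_si256` get the immediates they require.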
answered by ErmIg (edited by phuclv)

If you'd like to avoid a LUT and/or store-forwarding stalls, you can do this to set the k-th bit of a 256-bit AVX register:

inline __m256i setbit_256(__m256i x,int k){
// constants that will (hopefully) be hoisted out of a loop after inlining  
  __m256i indices = _mm256_set_epi32(224,192,160,128,96,64,32,0);
  __m256i one = _mm256_set1_epi32(-1);
  one = _mm256_srli_epi32(one, 31);    // set1(0x1)


  __m256i kvec = _mm256_set1_epi32(k);  
// if 0<=k<=255 then kvec-indices has exactly one element with a value between 0 and 31
  __m256i shiftcounts = _mm256_sub_epi32(kvec, indices);
  __m256i kbit        = _mm256_sllv_epi32(one, shiftcounts);   // shift counts outside 0..31 shift the bit out of the element
                                                               // kth bit set, all 255 other bits zero.
  return _mm256_or_si256(kbit, x);                             // use _mm256_andnot_si256 to unset the k-th bit
}
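The key property here is that `_mm256_sllv_epi32` produces zero for shift counts outside 0..31, so exactly one lane keeps its bit. A scalar model of one 32-bit lane (mine, not from the answer):

```c
#include <stdint.h>

/* Scalar model of the sub + sllv step above: per 32-bit lane,
   the shift count is k - 32*lane; counts outside 0..31 produce 0
   (as _mm256_sllv_epi32 does), so only one lane ends up non-zero. */
static inline uint32_t lane_bit(int k, int lane)
{
    int count = k - 32 * lane;
    return (count >= 0 && count < 32) ? (1u << count) : 0;
}
```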



Below is my previous answer, which is less straightforward and now obsolete.

#include <immintrin.h>

inline __m256i setbit_256(__m256i x,int k){
  __m256i c1, c2, c3;
  __m256i t, y, msk;

  // constants that will (hopefully) be hoisted out of a loop after inlining
  c1=_mm256_set_epi32(7,6,5,4,3,2,1,0);
  c2=_mm256_set1_epi32(-1);
  c3=_mm256_srli_epi32(c2,27);     // set1(0x1f) mask for the shift within elements
  c2=_mm256_srli_epi32(c2,31);     // set1(0x1)

  // create a vector with the kth bit set
  t=_mm256_set1_epi32(k);
  y=_mm256_and_si256(c3,t);        // shift count % 32: distance within each elem
  y=_mm256_sllv_epi32(c2,y);       // set1( 1<<(k%32) )

  t=_mm256_srli_epi32(t,5);        // set1( k>>5 )
  msk=_mm256_cmpeq_epi32(t,c1);    // all-ones in the selected element
  y=_mm256_and_si256(y,msk);       // kth bit set, all 255 other bits zero.

  x=_mm256_or_si256(y,x);   /* use _mm256_andnot_si256 to unset the k-th bit */
  return x;
}

I'm not sure if this will be any faster than the approaches suggested in the other answers.

This compiles to pretty good asm with clang or gcc (Godbolt compiler explorer), considering that the constants will be hoisted out of loops. As usual, clang defeats the attempt to generate constants on the fly, and broadcast-loads them from memory (which is very efficient on modern CPUs).

answered by wim
  • Nice use of broadcast+cmpeq to select the vector element that should contain the `1` bit. This is probably optimal in terms of latency at least, for cases where the bit position is not a compile-time constant. (When it is, ErmIg's template answer hopefully compiles down to a single VPANDN or VPOR with a pre-calculated constant). – Peter Cordes Sep 20 '16 at 20:19
    I commented your code, since it was non-trivial to follow without descriptive variable names. Feel free to roll back the edit. SO doesn't give me the option of leaving it as a suggestion for you to approve, so I just went ahead and did it. – Peter Cordes Sep 20 '16 at 20:59
  • @PeterCordes : Thanks for editing and commenting to improve my answer! – wim Sep 20 '16 at 22:34
  • Nice update, great idea to take advantage of the saturating behaviour of vector shifts (unlike scalar shifts where the count is masked). I'm pretty sure that's optimal, except maybe using 64-bit elements to simplify the constant. (So the compiler could load it with PMOVZXBQ instead of BD, when hoisted out of the loop). You don't need to explicitly write `broadcastd_epi32(...)`, just use set1 and the compiler will use VPBROADCASTD. If `k` is in memory, it can broadcast directly from memory (but your code might fool a dumb compiler into using a MOVD first). – Peter Cordes Sep 21 '16 at 00:18
    I was still thinking about actually using `set1( 1U << (k&31) )` and `set1( k & ~31U )`, since scalar shifts can run on port6 (not competing with vector ALU uops). So it trades off some scalar insns and an extra MOVD + VPBROADCASTD against 3 vector instructions. I put all three versions up on Godbolt: https://godbolt.org/g/F3NqdW. You should make your final version the main part of your answer, it's definitely the best. (Instead of just an "update", reorder your answer to show it first, with your earlier idea as a footnote if you want. Or just say "see history for an earlier idea". – Peter Cordes Sep 21 '16 at 00:29
  • BTW, on that godbolt link I included a version of your good function with meaningful variable names. Have a look, see what you think of that coding style. You don't always need as many comments if your variable names have meaning. As you can see from the asm output, modern compilers don't make worse code when you make new named variables instead of reusing the same temporary. – Peter Cordes Sep 21 '16 at 00:34
  • The Godbolt link is very nice! Note that with icc `kvec=_mm256_broadcastd_epi32(_mm_cvtsi32_si128(k))` compiles to movd+vpbroadcastd while `kvec = _mm256_set1_epi32(k)` compiles to movd+pshufd+vinsertf128 . On gcc there is no difference – wim Sep 21 '16 at 09:53
  • Hmm, I'd forgotten that icc13 didn't optimize set1 properly, if I'd ever noticed. It's quite an old version of ICC, though, and AVX2 was probably pretty new when it came out. The "free for personal use" licensing isn't clear for providing icc output as an online service, so Matt Godbolt hasn't ever updated to a newer version of ICC. I'm sure newer versions make better code. – Peter Cordes Sep 21 '16 at 10:11
  • I didn't realise that icc13 is quite old. I have a small bias against using _mm_set1, which is ok as long as the asm output is fine. :) – wim Sep 21 '16 at 10:55
  • Yeah, it's not free, and that's an important thing to realize, but AFAICT there's no advantage to using the specific broadcast intrinsics. All you get is code that won't compile to a fallback (e.g. __m128 code without AVX). Or with AVX1 but not AVX2, potentially a store to memory and a broadcast-load of a float (since the register-source form of VBROADCASTPS is AVX2-only). (I forget if I ever saw this in compiler output while testing, though). What intrinsics are missing though is casting between scalar float and vector (leaving high garbage): http://stackoverflow.com/q/39318496/224132 – Peter Cordes Sep 21 '16 at 11:06
  • @PeterCordes : Yes, but we are already using other AVX2 intrinsics, so there is no point of avoiding `_mm256_broadcastd_epi32` here. Note that icc 17 (which is now on Godbolt) still compiles `_mm256_set1_epi32(k)` to movd+pshufd+vinsertf128 and `_mm256_broadcastd_epi32(_mm_cvtsi32_si128(k))` to movd+vpbroadcastd. Apparently icc 17 has no specific AVX2 optimized `set1`. On the other hand, gcc stores to memory followed by a broadcast-load from memory in both cases, which is not what I like. – wim Nov 11 '16 at 11:22
  • Hmm, too bad ICC still makes bad code. Fair enough, I guess you might as well use the broadcast intrinsic if it helps the asm and the surrounding code requires AVX2 anyway. But I don't see gcc 5.4, 6.2, or 7.0pre storing to memory with `-march=haswell` (https://godbolt.org/g/PRAO0m). I'm seeing VMOVD/VPBROADCASTD for the `_mm256_set1_epi32(k)`. Are you testing gcc4.9 or something? It's pretty old and not very good at AVX2. It also wastes a ton of instructions on aligning the stack to 32 when it doesn't end up spilling a YMM. – Peter Cordes Nov 11 '16 at 21:00
  • @PeterCordes : Thanks for your remark! It turns out that I was using the wrong compiler options. I used gcc 5.4 with -mavx2 which enables AVX2 and all older (SSEx,AVX) instruction sets. However, with -march=haswell much better code is generated here. For icc the 'right' compiler option is -march=core-avx2, which compiles `set1` to vmovd+vpbroadcastd. Compiler option -march=haswell (or skylake) leads to vmovd+vpshufd+vinsertf128. – wim Nov 12 '16 at 22:49
  • Huh, yeah I don't use ICC except on Godbolt, so I'm not very familiar with its options. But for gcc/clang, an important part of `-march` is setting `-mtune`, not just enabling instruction sets. – Peter Cordes Nov 12 '16 at 23:20
  • Yes. Indeed `-mavx2` with `-mtune=haswell` produces the same code as `-march=haswell` alone. – wim Nov 12 '16 at 23:43

If you'd like to avoid a LUT, you can use BTS for setting a single bit (or BTR for clearing it, respectively). There seems to be no intrinsic for this instruction (at least in GCC), so inline assembly is required (and therefore this works on x86 only).

0F AB /r --- BTS r/m32, r32 --- Store selected bit in CF flag and set.

They're very slow with memory operands, but these Bit-String instructions allow bit-offsets that go outside of the byte or dword referenced by the addressing mode. The manual explains:

Some assemblers support immediate bit offsets larger than 31 by using the immediate bit offset field in combination with the displacement field of the memory operand. In this case, the low-order 3 or 5 bits (3 for 16-bit operands, 5 for 32-bit operands) of the immediate bit offset are stored in the immediate bit offset field, and the high-order bits are shifted and combined with the byte displacement in the addressing mode by the assembler. The processor will ignore the high order bits if they are not zero.

When accessing a bit in memory, the processor may access 4 bytes starting from the memory address for a 32-bit operand size, using the following relationship:

Effective Address + (4 ∗ (BitOffset DIV 32))
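In other words, for a 32-bit operand size the processor splits the bit offset into a dword index and a bit within that dword. A quick sketch of that arithmetic for non-negative offsets (the helper name is mine):

```c
#include <stdint.h>

/* Which dword and which bit BTS with a memory operand actually touches,
   for a non-negative bit offset and 32-bit operand size:
   EA + 4 * (BitOffset DIV 32), bit = BitOffset MOD 32. */
static inline void bts_mem_target(uint32_t bit_offset,
                                  uint32_t *dword_index, uint32_t *bit_in_dword)
{
    *dword_index = bit_offset / 32;
    *bit_in_dword = bit_offset % 32;
}
```

So a bit offset of 254 touches the eighth dword (index 7), bit 30, even though the addressing mode names only the first dword.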

In pure assembler (Intel-MASM-syntax) this would look like this:

.data
  .align 32
  save db 32 dup(0)    ; 256 bit = 32 byte YMM/__m256i temp variable space
  bitNumber dd 254     ; use a UINT for the bit to set (here the second-to-last)
.code
  mov eax, bitNumber
  ...
  lea edx, save
  vmovdqa ymmword ptr [edx], ymm0   ; save the __m256i to memory
  bts dword ptr [edx], eax          ; set bit 254
  vmovdqa ymm0, ymmword ptr [edx]   ; read the __m256i back into the register
  ...

If the variable already is in memory, this would be even easier.


Using inline assembly, this would result in the following functions:

static inline
void set_m256i_bit(__m256i * value, uint32_t bit)
{
    // doesn't need to be volatile: we only want to run this for its effect on *value.
    __asm__ ("btsl %[bit], %[memval]\n\t"
             : [memval] "+m" (*value) : [bit] "ri" (bit));
}

static inline
void clear_m256i_bit(__m256i * value, uint32_t bit)
{
    __asm__ ( "btrl %[bit], %[memval]\n\t"
              : [memval] "+m" (*value) : [bit] "ri" (bit));
}

These compile to what you'd expect on the Godbolt compiler explorer.

And some test code similar to the assembler code above:

__m256i value = _mm256_set_epi32(0,0,0,0,0,0,0,0);
set_m256i_bit(&value,254);
clear_m256i_bit(&value,254);
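For comparison, on targets without x86 inline assembly the same in-memory update can be written portably on the vector's 32 raw bytes (a sketch, equivalent in effect but not using BTS/BTR):

```c
#include <stdint.h>
#include <stdbool.h>

/* Portable equivalent of the bts/btr helpers above, operating on the
   32 bytes a __m256i occupies in memory (little-endian bit numbering,
   matching BTS/BTR applied to a byte array). */
static inline void set_mem_bit(uint8_t *bytes, uint32_t bit, bool value)
{
    uint8_t mask = (uint8_t)(1u << (bit & 7));
    if (value)
        bytes[bit >> 3] |= mask;
    else
        bytes[bit >> 3] &= (uint8_t)~mask;
}
```

Unlike the asm version it compiles everywhere, but it still goes through memory, so the store-forwarding concern from the comments applies equally.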
answered by zx485 (edited by Peter Cordes)
  • You know BTS with a memory operand is over 10 uops on recent Intel CPUs, right? Precisely because of its insane bit-string addressing where the address of the byte (or dword) to be modified isn't determined by the addressing-mode alone. And it will still cause a store-forwarding stall on reload. Still, interesting to point out. – Peter Cordes Sep 15 '16 at 07:33
  • I'm pretty sure you could beat this easily with AVX2 by using a vector shift+shuffle (with a shuffle mask generated with integer instructions and expanded with pmovzx or something). Avoiding the store-forwarding stall is huge. – Peter Cordes Sep 15 '16 at 07:53
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/123414/discussion-between-peter-cordes-and-zx485), where we moved our off-topic comments about downvoting. Cheers, keep up the good work with your answers. – Peter Cordes Sep 15 '16 at 08:32