-3

So for the following sequence: 0001000111000

The desired result would be: 0001000000000

I am fully aware that this is achievable by finding the MSB's index with assembly BSRL (or similar bit-twiddling hack) then >> bit shifting the number by (index - 1), then << shifting back by (index - 1), but I want to know whether there is, specifically, an assembly instruction or a sequence of instructions with better performance, not a bit-twiddling hack that can do this.

Hadi Brais
  • 22,259
  • 3
  • 54
  • 95
metamorphosis
  • 1,972
  • 16
  • 25

2 Answers2

6

There's no single instruction that can do this. BMI1 blsi dst,src can isolate the lowest set bit, not the highest. i.e. x & -x. If x86 had a bit-reversed version of blsi, we could use that, but it doesn't.


But you can do much better than what you were suggesting. An all-zero input is always going to be a special case for bit-scan and shift. Otherwise our output has exactly 1 bit set. It's 1 << bsr(input).

;; input: x in RDI
;; output: result in RAX
isolate_msb:
    xor   eax, eax           ; tmp = 0
    bsr   rdi, rdi           ; edi = bit index of MSB in input
    jz    .input_was_zero
    bts   rax, rdi           ; rax |= 1<<edi

.input_was_zero:             ; return 0 for input=0
    ret

Obviously for 32-bit inputs, use only 32-bit registers. And if zero is not possible, omit the JZ. Using BSR instead of LZCNT gives us a bit-index, not 31-bitidx, so we can use it directly. But LZCNT is significantly faster on AMD.

The xor-zeroing is off the critical path, to prepare an input for BTS. xor-zero + BTS is the most efficient way to implement 1<<n on Intel CPUs. It's 2 uops with 2c latency on AMD, so mov rax,1 / shl rax,cl would be better there. But worse on Intel because variable-count shifts are 3 uops, unless you use BMI2 shlx.

Anyway, the real work here is BSR + BTS, so that's 3 cycle + 1 cycle latency on Intel SnB-family. (https://agner.org/optimize/)


In C / C++, you'd write this as

unsigned isolate_msb32(unsigned x) {
    unsigned bitidx = BSR32(x);
    //return 1ULL << bitidx;           // if x is definitely non-zero
    return x ? 1U << bitidx : x;
}

unsigned isolate_msb64(uint64_t x) {
    unsigned bitidx = BSR64(x);
    return x ? 1ULL << bitidx : x;
}

Where BSR32 is defined in terms of an intrinsic your compiler supports. This is where things get tricky, especially if you want a 64-bit version. There's no single portable intrinsic. GNU C provides count-leading-zeros intrinsics, but GCC and ICC suck at optimizing 63-__builtin_clzll(x) back into just BSR. Instead they negate twice. There are builtins for BSR specifically, but those are even more compiler-specific than just MSVC vs. compilers that support GNU extensions (gcc/clang/ICC).

#include <stdint.h>

// define BSR32() and BSR64()
#if defined(_MSC_VER) || defined(__INTEL_COMPILER)
    #ifdef __INTEL_COMPILER
        typedef unsigned int bsr_idx_t;
    #else
        #include <intrin.h>   // MSVC
        typedef unsigned long bsr_idx_t;
    #endif

    static inline
    unsigned BSR32(unsigned long x){
        bsr_idx_t idx;
        _BitScanReverse(&idx, x); // ignore bool retval
        return idx;
    }
    static inline
    unsigned BSR64(uint64_t x) {
        bsr_idx_t idx;
        _BitScanReverse64(&idx, x); // ignore bool retval
        return idx;
    }
#elif defined(__GNUC__)

  #ifdef __clang__
    static inline unsigned BSR64(uint64_t x) {
        return 63-__builtin_clzll(x);
      // gcc/ICC can't optimize this back to just BSR, but clang can and doesn't provide alternate intrinsics
    }
  #else
    #define BSR64 __builtin_ia32_bsrdi
  #endif

    #include <x86intrin.h>
    #define BSR32(x) _bit_scan_reverse(x)

#endif

On the Godbolt compiler explorer, clang and ICC compile this branchlessly, even when they don't know that x is non-zero.

All 4 compilers fail to use bts to implement 1<<bit. :( It's very cheap on Intel.

# clang7.0 -O3 -march=ivybridge   (for x86-64 System V)
# with -march=haswell and later it uses lzcnt and has to negate.  /sigh.
isolate_msb32(unsigned int):
        bsr     ecx, edi
        mov     eax, 1
        shl     rax, cl
        test    edi, edi
        cmove   eax, edi       # return 1<<bsr(x)  or  x (0) if x was zero
        ret

GCC and MSVC make branchy code. e.g.

# gcc8.2 -O3 -march=haswell
    mov     eax, edi
    test    edi, edi
    je      .L6
    bsr     eax, edi
    mov     edi, 1
    shlx    rax, rdi, rax    # BMI2:  1 uop instead of 3 for shl rax,cl
.L6:
    ret
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    The 32-bit version of MSVC will emit `BT*` instructions. I've never seen the 64-bit compiler do so. You can actually get it [here](https://godbolt.org/z/bJi9Xf). When you add the conditional back in, `BTS` is still used, but unfortunately so is the branch. The good news is, it looks like the MSVC team has *finally* fixed the bug in the codegen for the `_BitScanReverse` intrinsic. – Cody Gray - on strike Feb 21 '19 at 23:14
-1

There is not single instruction for what you ask, no.

But, if you want to avoid twiddling the bits of the variable, there is an alternative approach:

Declare a 2nd variable of the same type as the original variable, and set the 2nd variable to 0. Then loop through the bits of the original variable from highest bit to lowest bit, testing each bit with the & operator. If you find a bit set to 1, set the corresponding bit in the 2nd variable, then exit the loop. Assign the 2nd variable to the original variable, if needed.

Remy Lebeau
  • 555,201
  • 31
  • 458
  • 770
  • 1
    Why are you suggesting an alternative that's (significantly) worse than what the OP already suggested in the question, using `bsr` to find the highest set bit? (Their idea for how to use it is sub-par, though.) – Peter Cordes Feb 19 '19 at 04:01
  • @PeterCordes Because the original poster's BSR suggestion was given as an example of bit-twiddling hack that the OP doesn't want as an answer. – Ross Ridge Feb 19 '19 at 04:38
  • 2
    @RossRidge: I wouldn't call `bsr` + `bts` a "bit-twiddling hack" in the vein of Hackers Delight or https://graphics.stanford.edu/~seander/bithacks.html. I posted an efficient answer that creates a new output with 1 bit set (or 0), not based on repeated manipulation of the input. – Peter Cordes Feb 19 '19 at 06:40
  • @PeterCordes I said it was just an alternative, I didn't say it was an optimal solution. If the OP is resorting to writing manual assembly, it should optimally use as few instructions as needed to get the job done, and that is likely to involve manipulating bits, which the OP was trying to avoid. – Remy Lebeau Feb 19 '19 at 09:37