- PCMPEQB/W/D/Q against an all-zero register to get a vector with elements that are all-ones for the zero elements and all-zero for the non-zero elements.
- PMOVMSKB to turn the vector of all-ones / all-zero elements into an integer bitmask. (Or movmskps or movmskpd to get 1 bit per dword or qword instead of per byte, if that makes your bit-scan -> index calculation more efficient, e.g. if you want an element offset instead of a byte offset.)
- invert that (C ~ operator, asm NOT instruction) to get 1s in the bitmask for elements that were non-zero
- TZCNT or BSF that integer to find the first (lowest) set bit. Beware of BSF's behaviour if its input is all-zero. But fortunately that's not a problem here: ~bitmask can't be zero, because the high 16 bits of the 32-bit int were zero and become 1s after the NOT. (An AVX2 version of this with vpmovmskb ymm, which fills a whole uint32_t with possibly-1 bits, could use ~(uint64_t)bitmask, or just use tzcnt since AVX2 CPUs also have BMI1.)
For example with intrinsics:
#include <immintrin.h>

int first_nonzero_byte(__m128i v){
    //__m128i v = _mm_loadu_si128((const __m128i*)p);  // for a pointer arg
    __m128i vcmp = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    unsigned bitmask = _mm_movemask_epi8(vcmp);
#ifdef __GNUC__
    return __builtin_ctz(~bitmask);
#else
    return _tzcnt_u32( ~bitmask );
#endif
    // returns 16 if v was all zero, because then ~bitmask is 0xFFFF0000
}
Compiles on https://godbolt.org/z/Y8vYbsW69 to
# GCC11.2 -O3 -msse4.1
movdqa xmm1, xmm0 # missed optimization, should zero XMM1 instead
pxor xmm0, xmm0
pcmpeqb xmm0, xmm1
pmovmskb eax, xmm0
not eax
rep bsf eax, eax # tzcnt on new CPUs, BSF on old
ret
In GNU C, where _tzcnt_u32 won't compile without -march=haswell or similar, we use __builtin_ctz instead. As noted above, ~bitmask is guaranteed to be non-zero. tzcnt is encoded as rep bsf; old CPUs execute it as bsf, producing the same result for non-zero inputs, and new CPUs execute it as tzcnt, which is more efficient on AMD (2 uops instead of 7). Intel runs either as a single uop. GCC uses rep bsf aka tzcnt if you don't tell it a specific CPU to tune for.
Here's a related function like the one in JATothrim's answer, using only 4 single-uop instructions (well, 2 uops for tzcnt on AMD) instead of 8 instructions including a pblendvb (2 uops on Intel). The shuffle/horizontal-reduction idea in that answer could possibly be useful if you want the element index as a shuffle control vector for vpermilps, but it seems sub-optimal vs. this when you actually want a scalar int.
int equal_first_dword_bitscan(__m128i x, __m128i y)
{
    __m128i vcmp = _mm_cmpeq_epi32(x, y);
    unsigned bitmask = _mm_movemask_ps(_mm_castsi128_ps(vcmp));
    bitmask |= 1<<4;                 // return 4 if the low 4 bits are all 0
#ifdef __GNUC__
    return __builtin_ctz(bitmask);
#else
    return _tzcnt_u32( bitmask );    // runs as BSF on old CPUs, don't skip the OR
#endif
}
MSVC doesn't have __builtin_ctz, but will compile _tzcnt_u32 even if you haven't told it the target CPU supports BMI1. If you're definitely only running on CPUs with BMI1, you can leave out the bitmask |= 1<<4; so it will return 32 for not-found.
If you use trailing-zero count in multiple functions, it's best to wrap that ifdef stuff in a helper function instead of repeating it at each use-case.
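Something like this minimal sketch (the helper name ctz_nonzero is my own, not a standard function):

#include <immintrin.h>   // _tzcnt_u32 for MSVC

// Trailing-zero count of a value the caller guarantees is non-zero.
static inline int ctz_nonzero(unsigned x)
{
#ifdef __GNUC__
    return __builtin_ctz(x);     // undefined for x == 0, so callers must pass non-zero
#else
    return (int)_tzcnt_u32(x);   // runs as BSF on pre-BMI1 CPUs; same result for non-zero inputs
#endif
}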
If there's only one possible non-zero value (like 1), PCMPEQB against a vector of that value so you don't need to invert the mask afterwards.
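A sketch of that variant (my own example, assuming byte elements that are only ever 0 or 1):

// Compare against the known non-zero value directly; the mask already has 1s
// for the matching elements, so no NOT is needed before the bit-scan.
int first_one_byte(__m128i v)
{
    __m128i vcmp = _mm_cmpeq_epi8(v, _mm_set1_epi8(1));
    unsigned bitmask = _mm_movemask_epi8(vcmp);
#ifdef __GNUC__
    return __builtin_ctz(bitmask);   // caller must guarantee at least one element is 1
#else
    return _tzcnt_u32(bitmask);      // 32 if nothing matched (BMI1 tzcnt behaviour)
#endif
}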
If that's the case, consider storing your data in a bitmap in the first place, to decrease your cache footprint by a factor of 8. Then you just TZCNT 64-bit chunks of the array.
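For instance, a scan over such a bitmap might look like this (my own sketch; it assumes the data is padded out to a whole number of 64-bit chunks):

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>   // _tzcnt_u64 for x86-64 MSVC

// Return the bit-index of the first set bit, or -1 if the whole bitmap is zero.
long long first_set_bit(const uint64_t *bits, size_t nchunks)
{
    for (size_t i = 0; i < nchunks; i++) {
        if (bits[i] != 0) {
#ifdef __GNUC__
            return (long long)(i * 64) + __builtin_ctzll(bits[i]);
#else
            return (long long)(i * 64) + (long long)_tzcnt_u64(bits[i]);
#endif
        }
    }
    return -1;
}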
Or for a larger array of data, search for the first non-zero vector with SIMD, then TZCNT the first non-zero element of it, if you expect there to be multiple qwords of zeros before the first set bit. Like memcmp does for finding the mismatching byte position.
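Roughly like this sketch (my own example; it assumes the length is a multiple of 16 bytes, so a real version needs tail handling):

#include <immintrin.h>
#include <stddef.h>

// Scan 16 bytes at a time for the first chunk that isn't all-zero,
// then bit-scan within that chunk for the first non-zero byte.
long long first_nonzero_byte_in_array(const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        unsigned eqmask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
        if (eqmask != 0xFFFF) {          // at least one byte in this chunk is non-zero
#ifdef __GNUC__
            return (long long)i + __builtin_ctz(~eqmask);
#else
            return (long long)i + _tzcnt_u32(~eqmask);
#endif
        }
    }
    return -1;   // the whole array was zero
}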
See Efficiently find least significant set bit in a large array? and How to find the first nonzero in an array efficiently?
BTW, the asm instruction ref manual lists the relevant C intrinsics at the bottom of each entry, and you can search Intel's intrinsics finder by asm mnemonic. (See the x86 tag wiki for links).