I am having difficulty writing code for this seemingly easy problem.

Given a vector of packed 8-bit integers, substitute one byte value with another wherever it is present.

For instance, I want to substitute 0x06 with 0x01, so I can do the following with res as the input to find 0x06:

// Bytes to be manipulated
res = _mm_set_epi8(0x00, 0x03, 0x02, 0x06, 0x0F, 0x02, 0x02, 0x06, 0x0A, 0x03, 0x02, 0x06, 0x00, 0x00, 0x02, 0x06);

// Target value and substitution
val = _mm_set1_epi8(0x06);
sub = _mm_set1_epi8(0x01);

// Find the target
sse = _mm_cmpeq_epi8(res, val);

// Isolate target
sse = _mm_and_si128(res, sse);

// Isolate remaining bytes
adj = _mm_andnot_si128(sse, res);

Now I don't know how to proceed to OR those two parts: I need to remove the target and put the replacement byte in its place.

What SIMD instruction am I missing here?

As with other questions, I am limited to AVX; I have no newer processor.

senseiwa
  • 1
    See: [_mm_blendv_epi8](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_blendv_epi8&techs=SSSE3,SSE4_1,SSE4_2&cats=Mask,Swizzle&expand=448,446). – Paul R Jan 15 '19 at 18:39
  • 1
    You're clobbering `sse` with the "Isolate target" operation. The original value is needed in the "Isolate remaining bytes" operation. – aqrit Jan 15 '19 at 21:36

1 Answer

Essentially, you need to zero out the bytes of the input that you want to substitute, zero out all the other bytes of the substitution vector, and then OR the two results. You already have the mask for this from the _mm_cmpeq_epi8. Overall, this can be done like this:

__m128i mask = _mm_cmpeq_epi8(inp, val);
return _mm_or_si128(_mm_and_si128(mask, sub), _mm_andnot_si128(mask, inp));
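
For completeness, the and/andnot/or variant can be wrapped into a small function that only needs SSE2. This is a minimal sketch, not a definitive implementation: the function name `replace_bytes_sse2` and the array-based interface are assumptions made for illustration, and unaligned loads and stores are used so the caller does not have to care about alignment.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper (name and signature are illustrative):
 * copies 16 bytes from `in` to `out`, replacing every byte
 * equal to `from` with `to`. */
static void replace_bytes_sse2(const uint8_t in[16], uint8_t out[16],
                               uint8_t from, uint8_t to)
{
    __m128i inp  = _mm_loadu_si128((const __m128i *)in);
    __m128i val  = _mm_set1_epi8((char)from);
    __m128i sub  = _mm_set1_epi8((char)to);

    /* 0xFF in every byte lane that matches `from`, 0x00 elsewhere */
    __m128i mask = _mm_cmpeq_epi8(inp, val);

    /* (~mask) & inp: input with the matching lanes zeroed */
    __m128i keep = _mm_andnot_si128(mask, inp);

    /* mask & sub: replacement byte only in the matching lanes */
    __m128i repl = _mm_and_si128(mask, sub);

    _mm_storeu_si128((__m128i *)out, _mm_or_si128(keep, repl));
}
```

Calling it with `from = 0x06` and `to = 0x01` on the bytes from the question should leave every other byte untouched.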

Since the last combination of and/andnot/or is very common, SSE4.1 introduced an instruction which (essentially) combines these into one:

__m128i mask = _mm_cmpeq_epi8(inp, val);
return _mm_blendv_epi8(inp, sub, mask);
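
A self-contained sketch of the blend variant follows. The name `replace_bytes_sse41` and the `TARGET_SSE41` macro are assumptions for illustration; the macro uses the GCC/clang `target` function attribute so the function can be built even when the translation unit is not compiled with `-msse4.1`. On other compilers you would enable SSE4.1 through their own options.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_blendv_epi8 */
#include <stdint.h>

/* Illustrative macro: enable SSE4.1 for one function on GCC/clang. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Hypothetical helper: same contract as the and/andnot/or variant,
 * but the three logic operations are replaced by a single blend. */
TARGET_SSE41
static void replace_bytes_sse41(const uint8_t in[16], uint8_t out[16],
                                uint8_t from, uint8_t to)
{
    __m128i inp  = _mm_loadu_si128((const __m128i *)in);
    __m128i val  = _mm_set1_epi8((char)from);
    __m128i sub  = _mm_set1_epi8((char)to);
    __m128i mask = _mm_cmpeq_epi8(inp, val);

    /* Per byte: take `sub` where the top bit of `mask` is set,
     * otherwise take `inp`. */
    _mm_storeu_si128((__m128i *)out, _mm_blendv_epi8(inp, sub, mask));
}
```

Note the argument order: the first operand is what you keep where the mask is clear, the second is what you select where the mask is set.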

In fact, clang 5.0 and later are smart enough to replace the first variant with the second when compiling with optimization: https://godbolt.org/z/P-tcik


N.B.: If the substitution value is in fact 0x01, you can exploit the fact that each byte of the mask (the result of the comparison) is either 0x00 or 0xff (which is -0x01 as a signed byte), i.e., you can zero out the values you want to substitute and then subtract the mask:

__m128i val = _mm_set1_epi8(0x06);
__m128i mask = _mm_cmpeq_epi8(inp, val);
return _mm_sub_epi8(_mm_andnot_si128(mask, inp), mask);

This saves either loading the 0x01 vector from memory or dedicating a register to it. Depending on your architecture, it may also have slightly better throughput.
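
To make the arithmetic concrete, here is a minimal sketch of the subtract trick as a standalone SSE2 function. The name `replace_with_one_sse2` is hypothetical, and the function only works for a replacement value of 0x01, since it relies on the mask bytes being 0xFF (that is, -1).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: replaces every byte equal to `from` with 0x01,
 * without ever materializing a vector of 0x01. */
static void replace_with_one_sse2(const uint8_t in[16], uint8_t out[16],
                                  uint8_t from)
{
    __m128i inp  = _mm_loadu_si128((const __m128i *)in);
    __m128i mask = _mm_cmpeq_epi8(inp, _mm_set1_epi8((char)from));

    /* Matching lanes become 0x00, all other lanes keep their value */
    __m128i zeroed = _mm_andnot_si128(mask, inp);

    /* Matching lanes: 0 - (-1) = 1. Other lanes: x - 0 = x. */
    _mm_storeu_si128((__m128i *)out, _mm_sub_epi8(zeroed, mask));
}
```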

chtz
  • 3
    Nice trick with using `sub_epi8` to replace `and`/`or` (but unfortunately still won't save the MOVDQA, so `pblendvb` is still better, especially on Skylake and newer, especially the non-AVX version being 1 uop for any port). Other special cases are replacement=0xff: `OR(inp, mask)` because `x|0xFF = 0xFF` for any x, and of course replacement=0 where you can just AND instead of blending. (For some operations like ADD and XOR, `0` is the identity value so you can mask an input to `a+b` instead of blending the output.) Clang will find some of these for you, its SIMD optimizer is pretty great. – Peter Cordes Jan 16 '19 at 06:29
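
The two special cases named in this comment (replacement 0xff and replacement 0) can be sketched as follows. The function names are assumptions for illustration; both need only SSE2 and skip the blend entirely.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: replace every byte equal to `from` with 0xFF.
 * Works because x | 0xFF == 0xFF and x | 0x00 == x. */
static void replace_with_ff_sse2(const uint8_t in[16], uint8_t out[16],
                                 uint8_t from)
{
    __m128i inp  = _mm_loadu_si128((const __m128i *)in);
    __m128i mask = _mm_cmpeq_epi8(inp, _mm_set1_epi8((char)from));
    _mm_storeu_si128((__m128i *)out, _mm_or_si128(inp, mask));
}

/* Hypothetical helper: replace every byte equal to `from` with 0x00.
 * (~mask) & inp clears exactly the matching lanes. */
static void replace_with_zero_sse2(const uint8_t in[16], uint8_t out[16],
                                   uint8_t from)
{
    __m128i inp  = _mm_loadu_si128((const __m128i *)in);
    __m128i mask = _mm_cmpeq_epi8(inp, _mm_set1_epi8((char)from));
    _mm_storeu_si128((__m128i *)out, _mm_andnot_si128(mask, inp));
}
```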