I am trying to find the sum of all bytes in an __m128i register using only SSE and SSE2.
So far I have:
__m128i sum = _mm_sad_epu8(bytes, _mm_setzero_si128());
return _mm_cvtsi128_si32(sum) + _mm_extract_epi16(sum, 4);
where bytes is the __m128i value containing the bytes I want to sum.
This works, but I am getting a lot of overflows, which give me the wrong values. Is there a way to do this without overflowing?
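For reference, here is a self-contained sketch of the snippet above, assuming bytes is an __m128i and the result is returned as a 32-bit unsigned integer. _mm_sad_epu8 against zero produces two partial sums, one per 64-bit half, each at most 8 * 255 = 2040, so the total is at most 4080 and the return type needs at least 16 bits. The function name sum_bytes_sad is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: horizontal sum of 16 unsigned bytes using SSE2 only. */
static uint32_t sum_bytes_sad(__m128i bytes)
{
    /* Sum of absolute differences against zero:
       bits 0..15  hold the sum of bytes 0..7,
       bits 64..79 hold the sum of bytes 8..15,
       all other bits are zero. */
    __m128i sum = _mm_sad_epu8(bytes, _mm_setzero_si128());

    /* The low 32 bits carry the first partial sum;
       16-bit lane 4 carries the second. */
    return (uint32_t)_mm_cvtsi128_si32(sum)
         + (uint32_t)_mm_extract_epi16(sum, 4);
}
```

With every byte set to 0xFF this returns 4080, which does not fit in an 8-bit return type.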
Alternatively, I was thinking about storing the bytes to an array and summing them that way, but I haven't been able to find a store intrinsic for bytes.
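A minimal sketch of the array idea, assuming bytes is an __m128i. SSE2 has no byte-specific store, but _mm_storeu_si128 writes all 128 bits of the register to memory regardless of element width, so the bytes can be read back from a uint8_t buffer. The function name sum_bytes_via_store is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: spill the register to memory and sum in scalar code. */
static uint32_t sum_bytes_via_store(__m128i bytes)
{
    uint8_t buf[16];
    uint32_t total = 0;
    int i;

    /* Unaligned store of the whole register; buf needs no special alignment. */
    _mm_storeu_si128((__m128i *)buf, bytes);

    /* Accumulate in a 32-bit integer; each byte is read as unsigned (0..255). */
    for (i = 0; i < 16; ++i)
        total += buf[i];

    return total;
}
```

Accumulating into a type wider than 8 bits matters here, since 16 bytes can sum to as much as 4080.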
Unfortunately, I can only use SSE and SSE2 intrinsics.
Thank you for your help!