I have the following C++ function that sums all elements of a 128-bit SSE float register. It simply performs two horizontal adds, as in the code below:
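#include <pmmintrin.h>  // SSE3: _mm_hadd_ps

float sum4(__m128 x) {
    // First pass: [x0+x1, x2+x3, x0+x1, x2+x3]
    const __m128 hsum_0 = _mm_hadd_ps(x, x);
    // Second pass adds the partial sums, so lane 0 holds x0+x1+x2+x3
    const __m128 hsum_1 = _mm_hadd_ps(hsum_0, hsum_0);
    return _mm_cvtss_f32(hsum_1);
}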
Is this the most efficient way to sum all the elements of a 128-bit SSE register? I ask because I read that horizontal operations should be avoided in dense processing (http://wiki.ros.org/PatrickMihelich/pcl_simd#Horizontal_or_vertical.3F), so I am concerned that calling sum4() many times during the program's execution will noticeably hurt performance.
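For comparison, here is a minimal sketch of one alternative I am considering, which replaces the horizontal adds with shuffles and vertical adds. The name sum4_shuffle is my own (hypothetical), it assumes only SSE1 is available, and I have not measured whether it is actually faster than the version above:

```cpp
#include <xmmintrin.h>  // SSE1: _mm_shuffle_ps, _mm_movehl_ps, _mm_add_ss

// Hypothetical alternative to sum4(): no _mm_hadd_ps, only shuffles and adds.
static inline float sum4_shuffle(__m128 x) {
    // Swap neighbours within each pair: [x1, x0, x3, x2]
    const __m128 shuf = _mm_shuffle_ps(x, x, _MM_SHUFFLE(2, 3, 0, 1));
    // Pairwise sums: [x0+x1, x0+x1, x2+x3, x2+x3]
    const __m128 sums = _mm_add_ps(x, shuf);
    // Bring the upper pair's sum down to lane 0: [x2+x3, x2+x3, x3, x2]
    const __m128 high = _mm_movehl_ps(shuf, sums);
    // Scalar add in lane 0 gives x0+x1+x2+x3
    return _mm_cvtss_f32(_mm_add_ss(sums, high));
}
```

I would expect both versions to return the same value for the same input, so the open question is only which one is cheaper when called many times.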
Thanks in advance for any help!