What is the most performant way to split one AVX (AVX2) register into two SSE (SSE2) registers and, in the opposite direction, to join (concatenate) two SSE registers into one AVX register?
I need this for all register types: integer, float, and double.
For example, I wrote the following code for the float case:
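#include <immintrin.h>
// Returns the low 128 bits of a; stores the high 128 bits to *hi.
__m128 avx_to_sse_ps(__m256 a, __m128 * hi) {
    // imm 0x01: move the high lane of a into the low lane, then take the low half.
    *hi = _mm256_castps256_ps128(
        _mm256_permute2f128_ps(a, a, 0b0000'0001)
    );
    return _mm256_castps256_ps128(a);
}
// Builds a 256-bit register with a in the low lane and b in the high lane.
__m256 sse_to_avx_ps(__m128 a, __m128 b) {
    // imm 0x20: low lane = low half of first operand, high lane = low half of second.
    return _mm256_permute2f128_ps(
        _mm256_castps128_ps256(a),
        _mm256_castps128_ps256(b),
        0b0010'0000
    );
}
int main() {}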
Is it possible to make this code any faster? And what about the integer and double cases: would the optimal code for them be similar to this one?
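For comparison, here is a sketch of the same operations written with the dedicated 128-bit extract/insert intrinsics instead of `_mm256_permute2f128_ps`, covering the float, double, and integer cases. The function names (`split_ps`, `join_ps`, and so on), the `roundtrip_ps` helper, and the `AVX_TARGET` macro are hypothetical and exist only for this illustration. I have not benchmarked it, so whether it is actually faster than the code above depends on the compiler and the CPU.

```cpp
#include <immintrin.h>

// Assumption: AVX_TARGET only lets this sketch build without -mavx on
// GCC/Clang. With -mavx (or /arch:AVX on MSVC) it is not needed.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// float: the low half is a cast (no instruction), the high half is one extract.
AVX_TARGET inline __m128 split_ps(__m256 a, __m128* hi) {
    *hi = _mm256_extractf128_ps(a, 1);
    return _mm256_castps256_ps128(a);
}

// float: place lo in the low lane, then insert hi into the high lane.
AVX_TARGET inline __m256 join_ps(__m128 lo, __m128 hi) {
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}

// double: same pattern with the _pd intrinsics.
AVX_TARGET inline __m128d split_pd(__m256d a, __m128d* hi) {
    *hi = _mm256_extractf128_pd(a, 1);
    return _mm256_castpd256_pd128(a);
}

AVX_TARGET inline __m256d join_pd(__m128d lo, __m128d hi) {
    return _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
}

// integer: same pattern with the _si256 intrinsics (these need AVX only).
AVX_TARGET inline __m128i split_si(__m256i a, __m128i* hi) {
    *hi = _mm256_extractf128_si256(a, 1);
    return _mm256_castsi256_si128(a);
}

AVX_TARGET inline __m256i join_si(__m128i lo, __m128i hi) {
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}

// Hypothetical helper: split 8 floats into two halves and join them again,
// writing everything to plain arrays so the result can be inspected.
AVX_TARGET inline void roundtrip_ps(const float* in8, float* lo4,
                                    float* hi4, float* out8) {
    __m128 hi;
    __m128 lo = split_ps(_mm256_loadu_ps(in8), &hi);
    _mm_storeu_ps(lo4, lo);
    _mm_storeu_ps(hi4, hi);
    _mm256_storeu_ps(out8, join_ps(lo, hi));
}
```

The three type families differ only in the intrinsic suffix (`_ps`, `_pd`, `_si256`), so if this form turns out to be preferable for float, the double and integer versions would follow the same shape.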