If I know I have, e.g., at least 4 doubles sitting at a given (32-byte aligned) location in memory,
double *d, I can simply do __m256d x = _mm256_load_pd(&d[i]), i.e. load them into an AVX(2) register.
The question is: How do I correctly handle cases where there aren't 4 doubles left at the given location, i.e. I'd theoretically access the array out of bounds?
One solution that I have been using so far is to only allocate memory in multiples of 4 * 8 bytes in this specific case. Alternatively, for cases where I do not control the memory allocation completely, I have also been playing with stuff like this, although I assume that this is probably not the way to go:
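#include <immintrin.h>  /* AVX intrinsics */
#include <stddef.h>     /* size_t */

/* diff = number of doubles left starting at d[i] (1..4).
   Missing upper lanes are filled with 0.0. */
static inline __m256d _load_256d(size_t diff, size_t i, double *d){
    if (diff == 4) {
        /* full vector available; &d[i] must be 32-byte aligned */
        return _mm256_load_pd(&d[i]);
    }
    /* _mm256_set_pd takes its arguments from the highest lane to the lowest */
    if (diff == 3) {
        return _mm256_set_pd(0.0, d[i+2], d[i+1], d[i]);
    }
    if (diff == 2) {
        return _mm256_set_pd(0.0, 0.0, d[i+1], d[i]);
    }
    return _mm256_set_pd(0.0, 0.0, 0.0, d[i]);
}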
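For comparison, the padded-allocation approach I mentioned first can be sketched roughly as follows. This is only a minimal sketch under some assumptions: it relies on C11 `aligned_alloc` (not available on every toolchain), and `LANES`, `padded_count` and `alloc_padded` are hypothetical names I made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define LANES 4  /* doubles per 256-bit register (assumed vector width) */

/* Round n up to the next multiple of LANES (hypothetical helper). */
static size_t padded_count(size_t n) {
    return (n + LANES - 1) / LANES * LANES;
}

/* Allocate room for n doubles, padded to whole vectors and 32-byte aligned.
   The padding elements are zeroed so that a full-width load of the last
   block reads defined values. Returns NULL on failure; release with free(). */
static double *alloc_padded(size_t n) {
    if (n > SIZE_MAX / sizeof(double) - LANES) {
        return NULL;                      /* guard against size overflow */
    }
    size_t padded = padded_count(n);
    if (padded == 0) {
        padded = LANES;                   /* avoid a zero-sized request */
    }
    /* padded * 8 bytes is a multiple of 32, as aligned_alloc requires */
    double *p = aligned_alloc(32, padded * sizeof(double));
    if (p == NULL) {
        return NULL;
    }
    assert((uintptr_t)p % 32 == 0);       /* sanity check on the alignment */
    for (size_t k = n; k < padded; ++k) {
        p[k] = 0.0;                       /* zero the tail padding */
    }
    return p;
}
```

With a buffer obtained this way, a loop stepping i by 4 up to padded_count(n) can call _mm256_load_pd(&d[i]) for every block, including the last one, without reading past the allocation. It obviously does not help when the memory comes from somewhere else.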
What is the (a) canonical solution for a case like this?