I'm writing a kernel in which, among other things, each thread populates one variable with data constituting its lower bytes and pads the rest (assuming little-endianness). This is done repeatedly and non-uniformly across threads: some threads have more bytes to copy into their variable and fewer padding bytes, others fewer bytes to copy and more padding. Both the result and the unpadded data are either in a register (for smaller sizes), in shared memory, or in local memory (which should be covered by the L1 cache).
In other words, suppose each thread executes:
T padded;
pad_high_bytes(padded, my_low_bytes, my_num_low_bytes);
do_stuff(padded);
where we have:
template <typename T>
__device__ __forceinline__
void pad_high_bytes(
T& result,
const unsigned char* __restrict__ low_bytes,
unsigned num_low_bytes);
If T is large (say, a struct { int data[50]; }), then I guess I should probably just use CUDA's device-side memcpy(). However, that's usually not the case: T is usually of size 4 or 8, the number of low bytes is typically between 1 and 3, and even 0 is not rare.
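For the large-T case, here's roughly what I mean (just a sketch; I zero-fill for illustration, but the actual pad value doesn't matter to me):
template <typename T>
__device__ __forceinline__
void pad_high_bytes(
T& result,
const unsigned char* __restrict__ low_bytes,
unsigned num_low_bytes)
{
    // Copy the payload, then fill the remaining high bytes
    // (assumes num_low_bytes <= sizeof(T)).
    memcpy(&result, low_bytes, num_low_bytes);
    memset(reinterpret_cast<unsigned char*>(&result) + num_low_bytes,
           0, sizeof(T) - num_low_bytes);
}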
I can obviously loop over the bytes and hope for the best. I can also "loop over the ints" as long as at least 4 bytes remain to copy, then loop over the remaining bytes (see the sketch at the end). But:
- Can I do better?
- Should I account for alignment somehow? Or is it too much hassle to bother?
- Should the copying and the padding be completely separate in the code or should I combine them somehow?
- What would be a reasonable value for sizeof(T) at which to switch to memcpy() - if at all?
Remember the function is templated on the result type, so even if you only have something to say regarding smaller or larger T's, that's still helpful.
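For reference, here's the "loop over the ints, then over the bytes" fallback I have in mind (again zero-padding just for illustration; the 4-byte loads assume low_bytes and result are 4-byte aligned, which is exactly the alignment question above):
template <typename T>
__device__ __forceinline__
void pad_high_bytes(
T& result,
const unsigned char* __restrict__ low_bytes,
unsigned num_low_bytes)
{
    unsigned char* out = reinterpret_cast<unsigned char*>(&result);
    unsigned i = 0;
    // Copy 4 bytes at a time while at least 4 payload bytes remain
    // (assumes 4-byte alignment and num_low_bytes <= sizeof(T)).
    for (; i + 4 <= num_low_bytes; i += 4)
        *reinterpret_cast<unsigned*>(out + i) =
            *reinterpret_cast<const unsigned*>(low_bytes + i);
    // Copy any leftover payload bytes and pad the high bytes.
    for (; i < sizeof(T); ++i)
        out[i] = (i < num_low_bytes) ? low_bytes[i] : 0;
}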