I'm writing a kernel which, among other things, has each thread populate one variable with data constituting its lower bytes, and pad the rest (assuming little-endianness). This is done repeatedly and non-uniformly across threads: some threads may have more bytes to copy into their variable and less padding, others fewer bytes to copy and more padding. Both the result and the unpadded data are either in a register (for smaller sizes), in shared memory, or in local memory (which should be covered by the L1 cache).

In other words, suppose each thread executes:

T padded;
pad_high_bytes(padded, my_low_bytes, my_num_low_bytes);
do_stuff(padded);

where we have:

template <typename T>
__device__ __forceinline__
void pad_high_bytes(
    T&                                result,
    const unsigned char* __restrict__ low_bytes, 
    unsigned                          num_low_bytes);

If T is large (say, a struct { int data[50]; }) then I guess I should probably just use CUDA's device-code memcpy(). However, that's not usually the case - T is usually of size 4 or 8, and the number of low bytes is typically between 1 and 3, and even 0 is not rare.
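For that large case, the obvious fallback I have in mind is something along these lines (a sketch only; zero is a stand-in padding value, since the actual padding value doesn't matter for the question):

template <typename T>
__device__ __forceinline__
void pad_high_bytes_via_memcpy(
    T&                                result,
    const unsigned char* __restrict__ low_bytes,
    unsigned                          num_low_bytes)
{
    // Let the runtime copy the low bytes, then pad the rest
    memcpy(&result, low_bytes, num_low_bytes);
    memset(reinterpret_cast<unsigned char*>(&result) + num_low_bytes,
           0, sizeof(T) - num_low_bytes);
}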

I can obviously loop over the bytes and hope for the best. I can also "loop over the ints" as long as there are at least 4 bytes left to copy, then loop over the remaining bytes (both variants are sketched after the questions below). But:

  1. Can I do better?
  2. Should I account for alignment somehow? Or is it too much hassle to bother?
  3. Should the copying and the padding be completely separate in the code or should I combine them somehow?
  4. What would be a reasonable value for sizeof(T) at which to switch to memcpy() - if at all?

Remember the function is templated on the result type, so if you only have something to say regarding smaller/larger T's, that's helpful.
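For reference, the two naive variants I mentioned would look roughly like this (again with zero as a stand-in padding value; the int-wise variant assumes 4-byte alignment of both pointers, which is part of what Q2 is asking about):

// Variant 1: plain byte loop
template <typename T>
__device__ __forceinline__
void pad_high_bytes_bytewise(
    T&                                result,
    const unsigned char* __restrict__ low_bytes,
    unsigned                          num_low_bytes)
{
    unsigned char* out = reinterpret_cast<unsigned char*>(&result);
    unsigned i = 0;
    for (; i < num_low_bytes; i++) { out[i] = low_bytes[i]; }
    for (; i < sizeof(T);     i++) { out[i] = 0; }  // pad the high bytes
}

// Variant 2: copy int-sized chunks while possible, then the tail bytes
template <typename T>
__device__ __forceinline__
void pad_high_bytes_intwise(
    T&                                result,
    const unsigned char* __restrict__ low_bytes,
    unsigned                          num_low_bytes)
{
    unsigned char* out = reinterpret_cast<unsigned char*>(&result);
    unsigned i = 0;
    for (; i + 4 <= num_low_bytes; i += 4) {
        *reinterpret_cast<unsigned*>(out + i) =
            *reinterpret_cast<const unsigned*>(low_bytes + i);
    }
    for (; i < num_low_bytes; i++) { out[i] = low_bytes[i]; }
    for (; i < sizeof(T);     i++) { out[i] = 0; }
}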

einpoklum
  • @Eric: Sorry, I misspoke. See edit. – einpoklum Jun 12 '16 at 09:14
  • How are all your variables and padded data organized in memory? It's better to have a full picture, not just a per-thread view. – kangshiyin Jun 12 '16 at 09:18
  • @Eric: See edit. Does that suffice? I don't make guarantees regarding `do_stuff()` - it might be a computation or it might write `padded` to shared/global memory. – einpoklum Jun 12 '16 at 09:32
  • Hard to say. Maybe you want to redesign your kernel. It does not look like a [coalesced mem access](https://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels) if your do_stuff writes padded data to global mem. On the other hand, the 32 threads in a warp have to execute the same instruction at any given time. Since each of your threads has to deal with data of a different length, it will be very inefficient. – kangshiyin Jun 12 '16 at 09:46
  • 1
    @Eric: The memory access coalescing will not be the issue, both in reading and in writing. It's true that there's a risk of inefficiency if lengths diverge a lot, and I might need to redesign my kernel then. But - assume they don't, i.e. that they diverge by at most 1 or 2. – einpoklum Jun 12 '16 at 09:56
  • If you can accept the divergence, just try the simplest way then, as the divergence may be the worst part. – kangshiyin Jun 12 '16 at 10:16

1 Answer

I am really struggling to see how Q1-3, which could be paraphrased as "is my approach an optimal design for this vaguely described task", are really answerable. So I'm not going to even try.

Q4 is answerable:

Device-side memcpy() (or device-side cudaMemcpy(), which is just a thin wrapper around memcpy()) always emits a loop which performs a byte-by-byte copy. When you know the size of the type you are copying at compile time, you can always do better by writing a copy loop yourself which exploits that a priori knowledge about the size of the type (subject to alignment constraints, etc.). And if you know both the size of the type and the number of words you will copy, then you can do even better by exploiting loop unrolling in addition to larger-than-byte-size transactions.
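To illustrate the sort of thing I mean, here is a sketch (not a drop-in implementation) which copies in 32-bit words and relies on sizeof(T) being a compile-time constant so the loop fully unrolls. It assumes sizeof(T) is a multiple of 4, that both pointers are 4-byte aligned, that low_bytes is readable up to the next 4-byte boundary, and it uses zero as an arbitrary padding value:

template <typename T>
__device__ __forceinline__
void pad_high_bytes(
    T&                                result,
    const unsigned char* __restrict__ low_bytes,
    unsigned                          num_low_bytes)
{
    unsigned*       out = reinterpret_cast<unsigned*>(&result);
    const unsigned* in  = reinterpret_cast<const unsigned*>(low_bytes);
    #pragma unroll
    for (unsigned w = 0; w < sizeof(T) / 4; w++) {
        unsigned covered = (num_low_bytes > w * 4) ? (num_low_bytes - w * 4) : 0;
        if (covered >= 4)             // fully covered word: one 4-byte load/store
            out[w] = in[w];
        else if (covered == 0)        // pure padding
            out[w] = 0;
        else                          // partially covered word: mask off high bytes
            out[w] = in[w] & (0xFFFFFFFFu >> (8 * (4 - covered)));
    }
}

Because the trip count is a compile-time constant, the unrolled loop can become a short straight-line sequence of word writes rather than a data-dependent byte loop.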

If you don't know either of those two things, then memcpy() is still the best choice, just because it simplifies the code and opens up the possibility of idiomatic optimisations behind the scenes within the toolchain. The only time I would counsel against it is if you have an opportunity to fuse other operations with the copy (see the sketch below), in which case doing something yourself probably still makes sense.
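For example (purely hypothetical - consume_word() is a made-up stand-in for whatever do_stuff() does per word), if the consumer only needs the padded value one word at a time, fusing lets you avoid materializing padded at all:

template <typename T, typename F>
__device__ __forceinline__
void pad_and_consume(
    const unsigned char* __restrict__ low_bytes,
    unsigned                          num_low_bytes,
    F                                 consume_word)
{
    #pragma unroll
    for (unsigned w = 0; w < sizeof(T) / 4; w++) {
        unsigned covered = (num_low_bytes > w * 4) ? (num_low_bytes - w * 4) : 0;
        unsigned word = 0;
        for (unsigned b = 0; b < 4 && b < covered; b++)  // assemble little-endian
            word |= unsigned(low_bytes[w * 4 + b]) << (8 * b);
        consume_word(w, word);  // e.g. accumulate, hash, or store coalesced
    }
}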

talonmies