
Knowing hardware limits is useful for understanding whether your code is performing optimally. The global device memory bandwidth limits how many bytes you can read per second, and you can approach this limit if the chunks you are reading are large enough.

But suppose you are reading, in parallel, N chunks of D bytes each, scattered in random locations in global device memory. Is there a useful formula limiting how much of the bandwidth you'd be able to achieve then?

MWB
  • "and you can approach this limit if the chunks you are reading are large enough" <- Well, not quite. You typically only approach some large fraction of NVIDIA's published maximum bandwidth. – einpoklum Sep 11 '22 at 18:42

2 Answers


Let's assume:

  • we are talking about accesses from device code
  • a chunk of D bytes means D contiguous bytes
  • when reading a chunk, the read operation is fully coalesced - those bytes are read 4 bytes per thread, by as many adjacent threads in the block as D/4 predicts
  • the temporal and spatial characteristics are such that no two chunks are within 32 bytes of each other: either they are all gapped by at least that much, or else the distribution of loads in time is such that the L2 provides no benefit. Pretty much saying the L2 hit rate is zero. This seems evident in your statement "global device memory bandwidth" - if the L2 hit rate is not zero, you're not measuring (purely) global device memory bandwidth
  • we are talking about a relatively recent GPU architecture, say Pascal or newer; on an older architecture, assume the L1 is disabled for global loads. Pretty much saying the L1 hit rate is zero.
  • the overall footprint is not so large as to thrash the TLB
  • the starting address of each chunk is aligned to a 32-byte boundary (&)
  • your GPU is sufficiently saturated with warps and blocks to make full use of all resources (e.g. all SMs, all SM partitions, etc.)
  • the actual chunk access pattern (distribution of addresses) does not result in partition camping or some other hard-to-predict effect

In that case, you can simply round the chunk size D up to the next multiple of 32, and do a calculation based on that. What does that mean?

The predicted bandwidth (B) is:

Bd = the device memory bandwidth of your GPU as indicated by deviceQuery
B = Bd/(((D+31)/32)*32)

The resulting unit is chunks/sec (bytes/sec divided by bytes/chunk). The inner division, (D+31)/32, is integer division, i.e. any fractional part is dropped.

(&) In the case where we don't want this assumption, the worst case is to add an additional 32-byte segment per chunk. The formula then becomes:

B = Bd/((((D+31)/32)+1)*32)

Note that this worst case cannot occur for every chunk size: for example, a 33-byte chunk spans at most two 32-byte segments regardless of its starting offset, the same as in the aligned case.

All I am really doing here is calculating the number of 32-byte DRAM transactions that would be generated by a stream of such requests, and using that to "derate" the observed peak (100% coalesced/100% utilized) case.

Robert Crovella
  • Thanks! I'd like to *not* assume that the task itself is large enough (huge `N`) though. Since we are just looking for an upper bound on the bandwidth, I think we can get rid of this assumption, but the question is: Can we *tighten* your limit by, say, dividing it by `min(1, float(N)/number_of_SMs)` ? (or with some other appropriate constant) – MWB Sep 09 '22 at 22:53
  • Sorry, I don't know how to apportion bandwidth to individual SMs or consider the case where we are not saturating the GPU with such requests. I think the logic of a per-chunk rounding up is sound, however, even if you consider an individual chunk (given the other assumptions that are not in dispute.) – Robert Crovella Sep 09 '22 at 23:03
  • Your formula has the effective bandwidth drop the faster the indicated memory bandwidth increases. That's not right. – einpoklum Sep 11 '22 at 18:47
  • Why not just invert the formula? Also, suggest giving the "bandwidth as indicated" a symbol, so as not to overflow the line. – einpoklum Sep 11 '22 at 20:34
  • OK, I've made those changes. – Robert Crovella Sep 11 '22 at 23:34

Under @RobertCrovella's assumptions, and assuming the chunk sizes are multiples of 32 bytes and chunks are 32-byte aligned, you will get the same bandwidth as for a single chunk - as Robert's formula tells you. So, no benefit and no detriment.

But ensuring these assumptions hold is often not trivial (even merely ensuring coalesced memory reads).

einpoklum