let's assume:
- we are talking about accesses from device code
- a chunk of D bytes means D contiguous bytes
- when reading a chunk, the read operation is fully coalesced - those bytes are read 4 bytes per thread, by however many adjacent threads in the block are predicted by D/4 (see the kernel sketch after this list).
- the temporal and spatial characteristics are such that no two chunks are within 32 bytes of each other - either they are all gapped by at least that much, or else the distribution of loads in time is such that the L2 provides no benefit. This is pretty much saying the L2 hit rate is zero, which seems evident in your statement "global device memory bandwidth" - if the L2 hit rate is not zero, you're not measuring (purely) global device memory bandwidth
- we are talking about a relatively recent GPU architecture, say Pascal or newer, or else, for an older architecture, the L1 is disabled for global loads. This is pretty much saying the L1 hit rate is zero.
- the overall footprint is not so large as to thrash the TLB
- the starting address of each chunk is aligned to a 32-byte boundary (&)
- your GPU is sufficiently saturated with warps and blocks to make full use of all resources (e.g. all SMs, all SM partitions, etc.)
- the actual chunk access pattern (distribution of addresses) does not result in partition camping or some other hard-to-predict effect
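As a concrete picture of the access pattern those assumptions describe, here is a minimal kernel sketch. The kernel name, the chunk_base index array, and the float element type are my own assumptions for illustration, not anything from your code; each group of D/4 adjacent threads reads one D-byte chunk, 4 bytes per thread:

```
// Hypothetical sketch: each chunk of D bytes (D a multiple of 4) is read
// 4 bytes per thread by D/4 adjacent threads, fully coalesced.
template <int D>
__global__ void read_chunks(const float * __restrict__ src,
                            const size_t * __restrict__ chunk_base,  // starting float index of each chunk (assumed)
                            float *sink, size_t num_chunks)
{
    const int words_per_chunk = D / 4;                        // 4 bytes per thread
    size_t tid   = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t chunk = tid / words_per_chunk;                     // which chunk this thread helps read
    int    lane  = (int)(tid % words_per_chunk);              // which 4-byte word within that chunk
    float  acc   = 0.0f;
    if (chunk < num_chunks)
        acc = src[chunk_base[chunk] + lane];                  // adjacent threads -> adjacent addresses
    if (acc == -1.0f) *sink = acc;                            // dummy use so the compiler keeps the load
}
```

(The chunk_base lookup itself generates some additional traffic that the calculation below ignores; in a real measurement you'd want the chunk addresses to be computable or cached.)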
In that case, you can simply round the chunk size D up to the next multiple of 32, and do a calculation based on that. What does that mean? The predicted bandwidth (B) is:
Bd = the device memory bandwidth of your GPU as indicated by deviceQuery
B = Bd/(((D+31)/32)*32)
And the resulting units there are chunks/sec (bytes/sec divided by bytes/chunk). The second division operation shown, (D+31)/32, is integer division, i.e. it drops any fractional part.
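If it's useful, here is the same arithmetic in host code; the function name and the example numbers are mine, not yours (compile with nvcc or any C++ compiler):

```
#include <cstdio>

// Predicted chunk rate for the aligned case: Bd in bytes/sec, D in bytes.
double predicted_chunk_rate(double Bd, unsigned D)
{
    unsigned bytes_per_chunk = ((D + 31) / 32) * 32;  // integer division rounds D up to a multiple of 32
    return Bd / bytes_per_chunk;                      // chunks/sec
}

int main()
{
    double   Bd = 700e9;  // example only: ~700 GB/s device memory bandwidth
    unsigned D  = 40;     // example only: 40-byte chunks -> 64 bytes of DRAM traffic per chunk
    printf("predicted rate: %g chunks/sec\n", predicted_chunk_rate(Bd, D));  // 700e9/64 ~ 1.09e10
    return 0;
}
```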
(&) In the case where we don't want this assumption, the worst case is to add an additional 32-byte segment per chunk. The formula then becomes:
B = Bd/((((D+31)/32)+1)*32)
Note that this condition cannot apply when the chunk size is less than 34 bytes.
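The unaligned worst case is the same sketch with one more 32-byte segment per chunk:

```
// Worst case when chunk start addresses are not 32-byte aligned:
// each chunk may touch one additional 32-byte segment.
double predicted_chunk_rate_unaligned(double Bd, unsigned D)
{
    unsigned bytes_per_chunk = (((D + 31) / 32) + 1) * 32;  // extra 32-byte segment per chunk
    return Bd / bytes_per_chunk;                            // chunks/sec
}
```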
All I am really doing here is calculating the number of 32-byte DRAM transactions that would be generated by a stream of such requests, and using that to "derate" the observed peak (100% coalesced/100% utilized) case.