if (threadIdx.x < 128) {
  float reg[32];
  // do something with reg...
} else {
  return;
}

Let's say each block has 256 threads, but only half of them use the registers, while the other half does something else (in this case, nothing). My question is: how many registers will this thread block use (considering only reg)? 32 * 256 or 32 * 128?

omer sahban
  • Apart from your question (answered by Robert): be careful that arrays do not end up in (thread-)local memory! That happens if the array is accessed with indices that are not compile-time constants. – Sebastian Sep 26 '20 at 23:11

1 Answer


All threads use the same number of registers.

Register usage is a compile-time decision that has nothing to do with runtime behavior. The compiler determines a single per-thread register count for all threads in the grid (i.e. the kernel launch); it is not decided per thread in any way. At runtime, that number of registers must be allocated for each thread, whether the thread "uses" them or not. See here.

The answer to your question is that regardless of what the threads "do", the number of registers per block is equal to the number of registers per thread (determined at compile time) times the number of threads per block.

So in your example it would presumably be 32*256, not 32*128.

talonmies
Robert Crovella
  • The hardware has a per-warp register allocation granularity. The register allocation is documented in the GPU Data sheet of CUDA_Occupancy_Calculator.xls. The warp allocation granularity is 256 registers/warp for CC 3.0 - CC 8.x, so every thread (whether active or inactive) is allocated registers in multiples of 8: 8, 16, 24, 32, ... – Greg Smith Sep 26 '20 at 21:58