How is the number of registers per thread decided inside the GPU?
Actual (non-PTX-virtual) register assignments are determined when the ptxas tool (part of the nvcc compiler driver toolchain) runs on your code, or by the equivalent step in the driver API JIT loader or the NVRTC mechanism.
ptxas is the tool that converts PTX to SASS (machine code). SASS is what actually runs on the GPU; PTX is not. PTX must first be converted to SASS.
PTX and its virtual register system are not useful for understanding these concepts. There is essentially no limit to the number of virtual registers that can be defined in PTX, and the number of virtual registers defined in PTX tells you nothing at all about how actual registers will be used in the GPU hardware. PTX is not useful for this sort of study.
The register assignments are entirely statically determined at this point. You can get some evidence of this by passing the -Xptxas=-v compile switch to nvcc when your compile command specifies a valid SASS target. There is no runtime variability (ignoring the "variability" introduced by the CUDA JIT PTX->SASS conversion mechanism; the item in focus here is SASS, not PTX, and once the SASS is defined there is no runtime variability).
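For example, a command line along these lines (sm_80 is just an example SASS target, and kernel.cu is a placeholder for your source file) makes ptxas report per-kernel resource usage at compile time; the exact wording of the printed lines varies with toolkit version:

```
nvcc -arch=sm_80 -Xptxas=-v -c kernel.cu
# ptxas prints lines such as "ptxas info : Used NN registers, ..." per kernel
```

The register count reported there is the actual SASS register usage per thread, fixed at compile time.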
Do these registers all get allocated to the active thread block running on the SM?
The number of registers allocated will be determined by the registers per thread, the number of threads per threadblock, and some granularity/rounding effects (i.e. roughly the product of registers per thread and threads per block, adjusted for granularity). This quantity of registers will be "carved out" of the total available in the SM, at the point at which a threadblock is "deposited" on that SM by the CUDA Work Distributor (CWD, or CUDA block scheduler). The CWD will not deposit a block until a sufficient number of registers are available to be allocated.
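As a rough illustration (the 256-thread block size is made up here, and granularity/rounding is ignored): 40 registers per thread x 256 threads per block = 10240 registers carved out for that threadblock, so an SM with 65536 registers could accommodate at most 6 such blocks as far as the register limit is concerned (other limits, such as threads or shared memory, may bind first).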
The entire complement of registers (e.g. 65536, or whatever the SM capacity is) is not automatically or always allocated for a single threadblock. It will depend on the actual needs of that threadblock. Remaining/unallocated registers can be used in the future if the CWD decides to deposit another threadblock on that SM. CUDA SMs have the ability to support multiple threadblocks simultaneously, with registers allocated for each. Unless unallocated registers are available in sufficient quantity to meet the needs of a prospective threadblock, the CWD will not deposit a new threadblock on that SM.
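If you want to see this from code, here is a minimal sketch (illustrative only; my_kernel and the 256-thread block size are placeholders, not from the original question) using the CUDA occupancy API to report how many blocks of a given size could be co-resident on one SM, given the kernel's actual SASS register usage:

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel -- substitute your own. Its SASS register count is what matters.
__global__ void my_kernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i * 2.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Registers per thread as decided by ptxas (the SASS value, not PTX virtual registers)
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, my_kernel);

    // How many blocks of this size could the CWD deposit on one SM,
    // taking registers (and the other resource limits) into account?
    int threadsPerBlock = 256;   // assumed block size, for illustration
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, my_kernel,
                                                  threadsPerBlock, 0 /* dynamic smem */);

    printf("registers per thread (SASS): %d\n", attr.numRegs);
    printf("registers per SM           : %d\n", prop.regsPerMultiprocessor);
    printf("max co-resident blocks/SM  : %d (block size %d)\n", maxBlocksPerSM, threadsPerBlock);
    return 0;
}
```

The occupancy API folds in the same granularity/rounding effects and other limits that the CWD observes, so its answer can be lower than a naive registers-only calculation.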
My confusion is, the profiler says each thread only gets 40 registers. Another observation is that each thread actually makes use of exactly 64 registers in its assembly code,
The profiler-reported number is correct (and it includes the granularity/rounding effects, which may or may not be included in the -Xptxas=-v output). Your confusion is that you are attempting to understand what is happening via the PTX. Do not do that. It is irrelevant for this discussion.
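If you want to double-check the SASS register count independently of the profiler, the CUDA binary utilities can report it directly (my_app is a placeholder for your executable or cubin; the exact output format varies by CUDA version):

```
cuobjdump -res-usage my_app   # per-function resource usage, including the register count
cuobjdump -sass my_app        # dump the actual SASS machine code
```

Either of these reflects what ptxas actually assigned, not the PTX virtual registers.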