23

Register variables are a well-known way to get fast access (`register int i`). But why are registers at the top of the hierarchy (registers, cache, main memory, secondary memory)? What are all the things that make accessing registers so fast?

Timwi
Jagan
  • I don't quite understand what you are asking. Registers are at the top because there is nothing closer to the ALU, where the work is done. Keeping data in a register means no data transfer overhead. Incidentally, the `register` keyword doesn't do much with modern optimizing compilers. – Amardeep AC9MF Aug 19 '10 at 03:16
  • A register is stored directly in the CPU! – NoName Aug 03 '17 at 06:28
  • More info on ALU: https://en.wikipedia.org/wiki/Arithmetic_logic_unit – Bernard Vander Beken Oct 16 '17 at 09:17

6 Answers

24

Registers are circuits which are literally wired directly to the ALU, which contains the circuits for arithmetic. Every clock cycle, the register unit of the CPU core can feed a half-dozen or so variables into the other circuits. Actually, the units within the datapath (ALU, etc.) can feed data to each other directly, via the bypass network, which in a way forms a hierarchy level above registers — but they still use register-numbers to address each other. (The control section of a fully pipelined CPU dynamically maps datapath units to register numbers.)
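
To make the forwarding idea concrete, here is a minimal C sketch (the function and comments are mine, for illustration): two dependent adds that an optimizing compiler will keep entirely in registers, where the bypass network lets the second add consume the first add's result without waiting for a round trip through the register file.

```c
/* Two dependent additions: on a pipelined CPU, the result of the first
 * add is forwarded over the bypass network to the second add, so the
 * dependent instruction can issue the very next cycle instead of waiting
 * for write-back to the register file. */
long chain(long b, long c, long e) {
    long a = b + c;   /* e.g. add a, b, c -- result forwarded           */
    return a + e;     /* e.g. add r, a, e -- consumes `a` a cycle later */
}
```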

The register keyword in C does nothing useful and you shouldn't use it. The compiler decides what variables should be in registers and when.
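
A minimal example (illustrative only; the loop is arbitrary): with optimization enabled, the counter ends up in a register whether or not the keyword is present. Note that C still forbids taking the address of a `register` variable, and C++17 removed the keyword entirely.

```c
#include <stdio.h>

int main(void) {
    /* `register` is only a hint, and modern optimizing compilers ignore
     * it: with optimization on, `i` lives in a register either way.
     * (&i would be a compile error -- register variables have no address.) */
    register int i;
    long sum = 0;
    for (i = 0; i < 1000; i++)
        sum += i;
    printf("%ld\n", sum);   /* prints 499500 */
    return 0;
}
```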

BeeOnRope
Potatoswatter
  • The wires (and MUXes) connecting execution units directly to each other are called the forwarding or bypass network, because it bypasses the latency of write-back to registers and then reading from the register file. This is how an `add` instruction can have 1c latency even in a pipelined CPU. (See [Wikipedia's Classic RISC pipeline](https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing) article. The idea is the same even in an out-of-order superscalar CPU, but multiple execution units can be forwarding to each other in parallel.) – Peter Cordes Oct 16 '17 at 09:00
12

Registers are a core part of the CPU, and much of a CPU's instruction set is tailored to working against registers rather than memory locations. Accessing a register's value typically requires very few clock cycles (often just one). As soon as memory is accessed, things get more complex: cache controllers and memory buses get involved, and the operation takes considerably more time.
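
As a rough illustration (the cycle counts are typical ballpark figures, not from this answer): in the loop below an optimizing compiler keeps `sum` and `i` in registers, so those operations cost about a cycle each, while every `a[i]` is a load that must go through the cache hierarchy.

```c
/* `sum` and `i` typically live in registers for the whole loop
 * (~1 cycle per operation), while each a[i] is a load that goes
 * through the L1 cache (~4-5 cycles on a typical modern core,
 * and far more on a cache miss). */
long sum_array(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];   /* memory access: cache / memory bus involved */
    return sum;
}
```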

Will A
4

Several factors lead to registers being faster than cache.

Direct vs. Indirect Addressing

First, registers are directly addressed based on bits in the instruction. Many ISAs encode the source register addresses in a constant location, allowing them to be sent to the register file before the instruction has been decoded, speculating that one or both values will be used. The most common memory addressing modes indirect through a register. Because of the frequency of base+offset addressing, many implementations optimize the pipeline for this case. (Accessing the cache at different stages adds complexity.) Caches also use tagging and typically use set associativity, which tends to increase access latency. Not having to handle the possibility of a miss also reduces the complexity of register access.
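
For instance (my example, assuming the x86-64 System V calling convention), a struct field access is the classic base+offset case: the struct pointer supplies the base register and the field offset is a constant encoded in the instruction.

```c
struct point { int x, y; };

/* Compiles to a single base+offset load, e.g. on x86-64 something like
 * `mov eax, [rdi + 4]`: `p` is the base register, and the offset of `y`
 * (4 bytes here) is encoded directly in the instruction -- the common
 * case many pipelines are optimized for. */
int get_y(const struct point *p) {
    return p->y;
}
```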

Complicating Factors

Out-of-order implementations and ISAs with stacked or rotating registers (e.g., SPARC, Itanium, Xtensa) do rename registers, so register access is not always perfectly direct. Several design techniques also narrow the gap on the memory side:

  • Specialized caches such as Todd Austin's Knapsack Cache (which directly indexes the cache with the offset) and some stack cache designs (e.g., using a small stack frame number and directly indexing a chunk of the specialized stack cache with that frame number and the offset) avoid the register read and the addition.
  • Signature caches associate a register name and offset with a small chunk of storage, providing lower latency for accesses to the lower members of a structure.
  • Index prediction (e.g., XORing offset and base, avoiding carry-propagation delay) can reduce latency (at the cost of handling mispredictions).
  • Memory addresses could be provided earlier for simpler addressing modes like register indirect, but accessing the cache in two different pipeline stages adds complexity. (Itanium only provided register-indirect addressing, with optional post-increment.)
  • Way prediction (and hit speculation in the case of direct-mapped caches) can reduce latency (again with misprediction handling costs).
  • Scratchpad (a.k.a. tightly coupled) memories do not have tags or associativity and so can be slightly faster (as well as having lower access energy), and once an access is determined to be to that region, a miss is impossible.
  • The contents of a Knapsack Cache can be treated as part of the context, with the context not considered ready until that cache is filled.
  • Registers could also, in theory, be loaded lazily (particularly for Itanium stacked registers) and would then have to handle the possibility of a register miss.

Fixed vs. Variable Size

Registers are usually fixed size. This avoids the need to shift the data retrieved from aligned storage to place the actual least significant bit into its proper place for the execution unit. In addition, many load instructions sign extend the loaded value, which can add latency. (Zero extension is not dependent on the data value.)
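
A small sketch of the sign-extension point (the x86 mnemonics in the comments are illustrative): loading a signed sub-word value requires a sign extension whose result depends on the loaded data, while an unsigned load needs only zero extension.

```c
#include <stdint.h>

/* A signed byte load must be sign-extended (e.g. `movsx eax, byte [rdi]`
 * on x86), and the extension depends on the loaded value; an unsigned
 * byte load is zero-extended (e.g. `movzx eax, byte [rdi]`), which does
 * not depend on the data. Full-width register operands avoid both. */
int load_signed(const int8_t *p)    { return *p; }
int load_unsigned(const uint8_t *p) { return *p; }
```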

Complicating Factors

Some ISAs do support sub-registers, notably x86 and zArchitecture (descended from S/360), which can require pre-shifting. One could also provide fully aligned loads at lower latency (likely at the cost of one cycle of extra latency for other loads); subword loads are common enough and the added latency small enough that special casing is not common. Sign extension latency could be hidden behind carry propagation latency; alternatively sign prediction could be used (likely just speculative zero extension) or sign extension treated as a slow case. (Support for unaligned loads can further complicate cache access.)

Small Capacity

A typical register file for an in-order 64-bit RISC will be only about 256 bytes (32 8-byte registers). 8KiB is considered small for a modern cache. This means that multiplying the physical size and static power to increase speed has a much smaller effect on the total area and static power. Larger transistors have higher drive strength and other area-increasing design factors can improve speed.

Complicating Factors

Some ISAs have a large number of architected registers and may have very wide SIMD registers. In addition, some implementations add further registers for renaming or to support multithreading. GPUs, which use SIMD and support multithreading, can have especially high-capacity register files; GPU register files are also different from CPU register files in typically being single ported, accessing four times as many vector elements of one operand/result per cycle as can be used in execution (e.g., with 512-bit wide multiply-accumulate execution, reading 2 Kib (256 bytes) of each of three operands and writing 2 Kib of the result).

Common Case Optimization

Because register access is intended to be the common case, area, power, and design effort are more profitably spent improving the performance of this function. If 5% of instructions use no source registers (direct jumps and calls, register clearing, etc.), 70% use one source register (simple loads, operations with an immediate, etc.), 25% use two source registers, and 75% use a destination register, while 50% access data memory (40% loads, 10% stores), a rough approximation loosely based on data from SPEC CPU2000 for MIPS, then more than three times as many of the (more timing-critical) reads are from registers as from memory (1.3 per instruction vs. 0.4), and writes to registers outnumber writes to memory by more than seven to one (0.75 vs. 0.1 per instruction).
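
A back-of-the-envelope check of those figures (my arithmetic; the 1.3 assumes each store also reads its data from a register, which is one plausible reading of the mix above):

```c
#include <stdio.h>

int main(void) {
    double reg_reads = 0.05 * 0   /* no source register          */
                     + 0.70 * 1   /* one source register         */
                     + 0.25 * 2   /* two source registers        */
                     + 0.10 * 1;  /* stores read a data register */
    double mem_reads = 0.40;      /* loads */
    printf("register reads/instruction: %.2f\n", reg_reads); /* 1.30  */
    printf("memory reads/instruction:   %.2f\n", mem_reads); /* 0.40  */
    printf("ratio: %.2fx\n", reg_reads / mem_reads);         /* 3.25x */
    return 0;
}
```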

Complicating Factors

Not all processors are designed for "general purpose" workloads. E.g., a processor using in-memory vectors and targeting dot-product performance, using registers for the vector start address, vector length, and an accumulator, might have little reason to optimize register latency (extreme parallelism simplifies hiding latency), and memory bandwidth would be more important than register bandwidth.

Small Address Space

A last, somewhat minor advantage of registers is that the address space is small. This reduces the latency for address decode when indexing a storage array. One can conceive of address decode as a sequence of binary decisions (this half of a chunk of storage or the other). A typical cache SRAM array has about 256 wordlines (columns, index addresses) — 8 bits to decode — and the selection of the SRAM array will typically also involve address decode. A simple in-order RISC will typically have 32 registers — 5 bits to decode.

Complicating Factors

Modern high-performance processors can easily have 8-bit register addresses (Itanium had more than 128 general-purpose registers in a context, and higher-end out-of-order processors can have even more registers). This is also a less important consideration than those above, but it should not be ignored.

Conclusion

Many of the above considerations overlap, which is to be expected for an optimized design. If a particular function is expected to be common, not only will the implementation be optimized but the interface as well. Limiting flexibility (direct addressing, fixed size) naturally aids optimization and smaller is easier to make faster.

  • Indeed, on modern (Haswell?) Intel x86 using high-8 partial registers like AH (`RAX[15:8]`) as a source register increases latency by 1 cycle. `movsx edx, al` (low 8 bits) is faster than `movsx edx, ah`. (Even if the critical path isn't through AH! e.g. `add cl, ah` has 2-cycle latency from CL->CL as well as from AH->CL.) – Peter Cordes Jun 04 '20 at 21:37
  • In case anyone's wondering, [How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent](https://stackoverflow.com/q/45660139) has details on low-8 regs not being renamed separately on modern Intel, unlike P6 family and SnB. And on writes to AH/BH/CH/DH still being renamed, but with the merge uop maybe having to issue in a cycle by itself. – Peter Cordes Jun 04 '20 at 21:40
  • [Is there a penalty when base+offset is in a different page than the base?](https://stackoverflow.com/q/52351397) investigates some details of Sandybridge-family's AGU shortcut for addressing modes of the form `[reg + 0..2047]`. It seems they speculate that the final address will be in the same page as the base register, starting TLB access 1 cycle earlier. Apparently that's on the critical path. It seems this is only done when the base reg itself came from a load, not an ALU uop, so it only tries this for pointer-chasing workloads where load-use latency is critical. – Peter Cordes Jun 08 '20 at 05:07
2

Registers are essentially internal CPU memory, so accesses to registers are easier and quicker than any other kind of memory access.

Bill Forster
1

Smaller memories are generally faster than larger ones; they can also require fewer bits to address. A 32-bit instruction word can hold three four-bit register addresses and still have plenty of room for the opcode and other things; a single 32-bit memory address would completely fill an instruction word, leaving no room for anything else. Further, the time required to address a memory increases at a rate more than proportional to the log of the memory size. Accessing a word from a 4 GiB memory space will take dozens if not hundreds of times longer than accessing one from a 16-word register file.
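
To illustrate the encoding argument (a sketch using the real MIPS R-type layout with 5-bit register fields, rather than the 4-bit fields of the hypothetical 16-register machine above): three register numbers plus opcode and function fields all fit in one 32-bit word, whereas a single full 32-bit memory address would fill the word by itself.

```c
#include <stdint.h>
#include <stdio.h>

/* Pack a MIPS R-type instruction: 6-bit opcode, three 5-bit register
 * numbers, 5-bit shift amount (0 here), 6-bit function code.
 * 6+5+5+5+5+6 = 32 bits, with room to spare for everything besides
 * the register addresses. */
static uint32_t encode_rtype(uint32_t op, uint32_t rs, uint32_t rt,
                             uint32_t rd, uint32_t funct) {
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | funct;
}

int main(void) {
    /* add $3, $1, $2  ->  op=0, rs=1, rt=2, rd=3, funct=0x20 */
    printf("0x%08x\n", (unsigned)encode_rtype(0, 1, 2, 3, 0x20)); /* 0x00221820 */
    return 0;
}
```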

A machine that can handle most information requests from a small fast register file will be faster than one which uses a slower memory for everything.

supercat
0

Every microcontroller has a CPU which, as Bill mentioned, has the basic components: an ALU, some RAM, as well as other forms of memory to assist with its operations. The RAM is what you are referring to as main memory.

The ALU handles all of the arithmetic and logical operations. To perform these calculations, it loads the operands into registers, operates on them, and your program then accesses the stored result in these registers directly or indirectly.

Since registers are closest to the heart of the CPU (a.k.a. the brain of your processor), they are higher up in the chain, and of course operations performed directly on registers take the fewest clock cycles.

IntelliChick