
I have a program that performs pointer chasing and I'm trying to optimize the pointer chasing loop as much as possible. I noticed that perf record detects that ~20% of execution time in function myFunction() is spent executing the jump instruction (used to exit out of the loop after a specific value has been read).

Some things to note:

  • the pointer chasing path can comfortably fit in the L1 data cache
  • using __builtin_expect to avoid the cost of branch misprediction had no noticeable effect

perf record has the following output:

Samples: 153K of event 'cycles', 10000 Hz, Event count (approx.): 35559166926
myFunction  /tmp/foobar [Percent: local hits]
Percent│      endbr64
      ...
 80.09 │20:   mov     (%rdx,%rbx,1),%ebx
  0.07 │      add     $0x1,%rax
       │      cmp     $0xffffffff,%ebx
 19.84 │    ↑ jne     20
      ...

I would expect that most of the cycles spent in this loop are used for reading the value from memory, which is confirmed by perf. I would also expect the remaining cycles to be somewhat evenly spent executing the remaining instructions in the loop. Instead, perf is reporting that a large chunk of the remaining cycles are spent executing the jump.

I suspect that I can better understand these costs by understanding the micro-ops used to execute these instructions, but I'm a bit lost on where to start.
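
One possible starting point (an assumption about tooling, not mentioned in the original question): llvm-mca can statically model how a snippet of assembly decodes into uops and which execution ports they use on a chosen microarchitecture. Something like the following, where loop.s is a placeholder file containing just the loop's assembly:

```shell
# Statically analyze the hot loop's uop/port usage and a cycle-by-cycle
# timeline for a chosen uarch (loop.s is a placeholder filename).
llvm-mca -mcpu=skylake -iterations=100 -timeline loop.s
```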


1 Answer


Remember that the cycles event has to pick an instruction to blame, even when both the mov load and the macro-fused cmp-and-branch uop are waiting for the same result. It's not a matter of one or the other "costing cycles" while it's running; they're both waiting in parallel. (See Modern Microprocessors: A 90-Minute Guide! and https://agner.org/optimize/.)

But when the "cycles" event counter overflows, it has to pick one specific instruction to "blame", since you're using statistical sampling. An inaccurate picture of reality has to be invented by a CPU that has hundreds of uops in flight. Often it's the instruction waiting for a slow input that gets blamed, I think because it's often the oldest in the ROB or RS, blocking allocation of new uops by the front-end.

The details of exactly which instruction gets picked might tell us something about the internals of the CPU, but only very indirectly. Like perhaps something to do with how it retires groups of 4(?) uops, and this loop has 3, so which uop is oldest when the perf event exception is taken.

The 4:1 split is probably significant for some reason, perhaps because 4+1 = 5 matches the 5-cycle latency of a load with a non-simple addressing mode. (I assume this is an Intel Sandybridge-family CPU, perhaps Skylake-derived?) Maybe if data arrives from cache in the same cycle as the perf event overflows (and chooses to sample), the mov doesn't get the blame because it can actually execute and get out of the way.

IIRC, BeeOnRope or someone else found experimentally that Skylake CPUs would tend to let the oldest un-retired instruction retire after an exception arrives, at least if it's not a cache miss. In your case, that would be the cmp/jne at the bottom of the loop, which in program order appears before the load at the top of the next iteration.
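
As a side note (assuming a CPU and perf version with PEBS support), requesting a precise event can reduce this sampling skid, attributing each sample to the actual instruction rather than a nearby one:

```shell
# The :pp modifier requests higher precision (PEBS on Intel),
# reducing skid in which instruction each sample is attributed to.
perf record -e cycles:pp /tmp/foobar
perf report
```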

Peter Cordes
  • Do you think it could be that the load buffer is filling up as a bottleneck for the code, so the CPU will get to the jump and then not be able to make progress for a bit, leading to a higher probability that the jump will be the oldest non-retired instruction? Or does that not make sense as an explanation. – Noah Jan 16 '21 at 06:00
  • @Noah: Load / store buffer entries are allocated during issue/rename/alloc (in program order, before exec time), so reasoning like that could maybe tell us what the *oldest* un-executed uop is. This doesn't help us figure out retirement state on an interrupt. I'm not sure if the front-end will even issue a partial group of fewer than 4 uops when the first uop would fit but the later uops would need a resource that's already full. To save power, it might just wait until all 4 uops can be allocated. – Peter Cordes Jan 16 '21 at 09:01
  • https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Renaming_.26_Allocation - In Skylake (for example, since the OP hasn't said which uarch), the load buffer has 72 entries, while the RS has 97. (And IIRC, the RS *isn't* fully unified: only some of the entries can hold un-executed load uops.) Since un-executed compare-and-branch uops also occupy RS entries, the reason the front-end can't issue more uops is the RS filling up, not the load buffers. There's no memory-level parallelism to let multiple loads execute in parallel, with load-buffer entries waiting for results after exec. – Peter Cordes Jan 16 '21 at 09:06
  • The BOB (Branch Order Buffer) is 48 entries, very close to half the RS size, so un-executed branches won't fill it up. The ROB is large (224 uops) so it's not going to fill up, even though it also has to hold onto the `add` uops until they retire. That chain can execute far ahead, so only the mov loads and macro-fused cmp/jne will be stuck in the RS. The only instruction-level parallelism is the add chain, and the load + cmp/jne within a single iteration. – Peter Cordes Jan 16 '21 at 09:09
  • Re: performance counters blaming the instruction waiting for the slow result: [Unexpectedly slow performance on moving from register to frequently accessed variable](https://stackoverflow.com/a/76774051) is an attempt at a simple canonical answer for that. – Peter Cordes Jul 26 '23 at 18:43