I have a program that performs pointer chasing and I'm trying to optimize the pointer chasing loop as much as possible.
I noticed that perf record
detects that ~20% of execution time in function myFunction()
is spent executing the jump instruction (used to exit out of the loop after a specific value has been read).
Some things to take note:
- the pointer chasing path can comfortably fit in the L1 data cache
- using
__builtin_expect
to avoid the cost of branch misprediction had no noticeable effect
perf record
has the following output:
Samples: 153K of event 'cycles', 10000 Hz, Event count (approx.): 35559166926
myFunction /tmp/foobar [Percent: local hits]
Percent│ endbr64
...
80.09 │20: mov (%rdx,%rbx,1),%ebx
0.07 │ add $0x1,%rax
│ cmp $0xffffffff,%ebx
19.84 │ ↑ jne 20
...
I would expect that most of the cycles spent in this loop are used for reading the value from memory, which is confirmed by perf. I would also expect the remaining cycles to be somewhat evenly spent executing the remaining instructions in the loop. Instead, perf is reporting that a large chunk of the remaining cycles are spent executing the jump.
I suspect that I can better understand these costs by understanding the micro-ops used to execute these instructions, but I'm a bit lost on where to start.