
I am working on a system, written in C++, running on a Xeon on Linux, that needs to run as fast as possible. There is a large data structure (basically an array of structs) held in RAM, over 10 GB, and elements of it need to be accessed periodically. I want to revise the data structure to work with the system's caching mechanism as much as possible.

Currently, accesses are done mostly randomly across the structure, and each time 1-4 32-bit ints are read. It is a long time before another read occurs in the same place, so there is no benefit from the cache.

Now I know that when you read a byte from a random location in RAM, more than just that byte is brought into the cache. My question is how many bytes are brought in? Is it 16, 32, 64, 4096? Is this called a cache line?

I am looking to redesign the data structure to minimize random RAM accesses and work with the cache instead of against it. Knowing how many bytes are pulled into the cache on a random access will inform the design choices I make.

Update (October 2014): Shortly after I posed the question above, the project was put on hold. It has since resumed, and based on suggestions in the answers below, I performed some experiments around RAM access, because it seemed likely that TLB thrashing was happening. I revised the program to run with huge pages (2 MB instead of the standard 4 KB) and observed a small speedup, about 2.5%. I found great information about setting up for huge pages here and here.
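
For reference, a minimal sketch of this kind of allocation (the 1 GB size is illustrative only; it assumes huge pages were already reserved via /proc/sys/vm/nr_hugepages):

    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
        // 1 GB for illustration; the real structure is over 10 GB.
        const std::size_t size = 1ULL << 30;

        // MAP_HUGETLB requests 2 MB pages (on x86-64). The call fails
        // unless huge pages were reserved first, e.g.:
        //   echo 5200 > /proc/sys/vm/nr_hugepages
        void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }

        // ... place the array of structs in this region ...

        munmap(p, size);
        return 0;
    }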

Randall Cook
  • Yes, cache line. You can assume 64 bytes until you've figured out which of the dozens of Xeon processor models you have. The L2 and L3 caches play a role too. Focus on sequential memory accesses and don't assume anything. Measure. – Hans Passant Dec 23 '11 at 20:38
  • Thanks everyone for your answers. – Randall Cook Jan 08 '12 at 01:59

5 Answers


Today's CPUs fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line containing it is fetched from main memory into the cache.

More here: http://igoro.com/archive/gallery-of-processor-cache-effects/
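
A quick way to see this experimentally, along the lines of the examples in that article (the buffer size and strides here are arbitrary choices): stepping through an array 16 ints at a time costs nearly as much as touching every int, because both loops pull in the same number of 64-byte lines.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> a(64 * 1024 * 1024); // 256 MB, far larger than any cache

        for (int stride : {1, 16}) { // 16 ints * 4 bytes = one 64-byte line
            auto t0 = std::chrono::steady_clock::now();
            for (std::size_t i = 0; i < a.size(); i += stride)
                a[i] *= 3;
            auto t1 = std::chrono::steady_clock::now();
            long long ms = std::chrono::duration_cast<
                std::chrono::milliseconds>(t1 - t0).count();
            std::printf("stride %2d: %lld ms\n", stride, ms);
        }
        return 0;
    }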

Software_Designer

A cache line for any current Xeon processor is 64 bytes. One other thing that you might want to think about is the TLB. If you are really doing random accesses across 10 GB of memory then you are likely to have a lot of TLB misses, which can potentially be as costly as cache misses. You can work around this with large pages, but it's something to keep in mind.
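
(For scale: 10 GB with 4 KB pages is 2,621,440 distinct pages versus a TLB of at most a few hundred entries, so a random access is almost guaranteed to miss; with 2 MB pages the same 10 GB spans only 5,120 pages.)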

Gabriel Southern
  • Could you elaborate further, @Gabriel? What does TLB stand for, and how would large pages help? – Randall Cook Dec 23 '11 at 23:56
  • Translation Lookaside Buffers - they're involved in making hardware paging less slow (long story, google :)). One page, one TLB entry - you need fewer TLB entries if you map your memory with large pages (4 MB or 2 MB, depending on CPU mode) than with normal pages (4 KB). – snemarch Dec 24 '11 at 00:52
  • TLB stands for "translation lookaside buffer." When you access memory you have to translate the virtual address to a physical address. This is done at page granularity and the typical page size for x86 is 4KB. The TLB caches the translation from virtual to physical address but it is a relatively small structure (around 200 entries). – Gabriel Southern Dec 24 '11 at 00:58
  • Excellent advice; TLB misses can be even costlier than cache misses. – Stephen Canon Dec 24 '11 at 01:09
  • The data structure is really a set of arrays of structs. Each array is about 200 MB and is allocated in shared memory. Does that help with TLB misses? – Randall Cook Dec 26 '11 at 06:18
  • It depends on the access patterns. Also, shared memory implies that you have multiple processes, so these could conflict with each other. Ultimately, as someone else noted, the best thing to do if you aren't sure is to measure what's happening in your system when the application runs. Something like VTune or OProfile can help. Modern CPUs have performance counters that can be used to track things like cache misses. – Gabriel Southern Dec 26 '11 at 21:59
  • @Gabriel, your TLB suggestions provided some value to me. Read my update in the question above. Thanks for the knowledge boost. If I could vote you up twice, I would. – Randall Cook Oct 10 '14 at 16:34

An old SO question has some info that might be of use to you (in particular the first answer, which shows where to look for Linux CPU info; the responder doesn't mention the line size proper, but it appears in the 'other info' alongside associativity etc.). The question is about x86, but the answers are more general. Worth a look.

Where is the L1 memory cache of Intel x86 processors documented?
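
On Linux the line size can also be queried at runtime; a small sketch using glibc's sysconf extension (Linux/glibc-specific; it returns 0 if the value is unknown):

    #include <unistd.h>
    #include <cstdio>

    int main() {
        // glibc extension: L1 data cache line size in bytes, or 0 if unknown.
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        std::printf("L1D line size: %ld bytes\n", line);

        // The same value appears in sysfs, e.g.:
        //   /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
        return 0;
    }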

gnometorule

You might want to head over to http://agner.org/optimize/ and grab the optimization PDFs available there - there's a lot of good (low-level) information in them. They're pretty focused on the assembly-language level, but there are lessons to be learned for C/C++ programmers as well.

Volume 3, "The microarchitecture of Intel, AMD and VIA CPUs" should be of interest :-)

snemarch

A good (long) article from GNU's libc maintainer about organizing data structures to take the cache and RAM hierarchy into account: https://lwn.net/Articles/250967/ (full PDF here: http://www.akkadia.org/drepper/cpumemory.pdf)
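
Applying one of that article's central ideas to the question's array of structs (the field names and sizes below are hypothetical): split each element into the handful of hot ints that are actually read and the cold remainder, so a random access touches exactly one 64-byte line instead of dragging rarely-used bytes into the cache.

    #include <cstdint>
    #include <vector>

    // Hypothetical original layout: the hot ints are interleaved with a
    // cold payload, so each element straddles several cache lines.
    struct Element {
        std::int32_t hot[4];    // the 1-4 ints read on each access
        char         cold[240]; // rarely touched payload
    };

    // Split ("hot/cold") layout: hot fields packed contiguously. A random
    // access now loads one 64-byte line holding four elements' hot data.
    struct HotPart  { std::int32_t hot[4]; }; // 16 bytes, 4 per line
    struct ColdPart { char cold[240]; };

    struct Table {
        std::vector<HotPart>  hot;  // indexed randomly, cache-friendly
        std::vector<ColdPart> cold; // fetched only when actually needed
    };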

TazMainiac