
On x64, if you write the full contents of a cache line at a previously uncached address within a short period of time, and then soon after read from that address, can the CPU avoid having to read the old contents of that address from memory?

Effectively, it shouldn't matter what the memory previously contained, because a full cache line's worth of data was completely overwritten. I can understand that a partial cache-line write to an uncached address, followed by a read, would incur the overhead of having to synchronise with main memory, etc.

Looking at documentation regarding write-allocate, write-combining and snooping has left me a little confused about this matter. Currently I think that an x64 CPU cannot do this?

iam
  • It is my understanding that, even with AVX512, there is no way to perform a 64-byte (typical cache line size) transfer in a single memory access. Thus, while possible, I believe that no processor skips the line-fill operation before a write, if the caching type requires it. Plus, the MESI protocol requires a Request For Ownership (which appears as a read) when performing certain writes – Margaret Bloom May 17 '17 at 11:38
  • I wasn't sure if there would be a specific optimisation related to write combining, given consecutive writes over multiple instructions that fill a line (as you say, you can't fill a whole cache line in one operation). I imagine a protocol between multiple cores could account for this too, even if MESI currently does not. The more I read, the more I am pretty sure the answer to this is no, though. – iam May 17 '17 at 12:03
  • FWIW, Write combining doesn't use caches. I also would say "no" as an answer. Wait for the experts though ;) – Margaret Bloom May 17 '17 at 12:35
  • Oh I meant a feature 'similar' to write combining but not necessarily write combining itself :-) But then I guess such a feature would need to interact with the store buffer (I am not a hardware person so I don't really know). It would be interesting for software optimisation if the answer isn't a no though... – iam May 17 '17 at 13:57
  • @MargaretBloom - I'm curious why you mention AVX512 doesn't offer this ability? ISTM that an aligned 64-byte `mov` would fully overwrite the cache line (but whether implementing CPUs optimize it to avoid RFO is a different story). Perhaps the issue is that current hardware still splits it into two 32-byte accesses? – BeeOnRope May 17 '17 at 22:16
  • @BeeOnRope Yes, from what I can gather from a quick glance at the manual, it seems that AVX512 loads/stores are split in two. – Margaret Bloom May 18 '17 at 07:31

1 Answer


In general, the subsequent read should be fast - as long as store-to-load forwarding is able to work. In fact, it has nothing to do with writing an entire cache line at all: it should also work (with the same caveat) even for smaller writes!

Basically, what happens on normally mapped memory (i.e., WB memory regions) is that the store(s) will add several entries to the CPU's store buffer. Since the associated memory isn't currently cached, these entries are going to linger for some time while an RFO (Request For Ownership) request occurs to pull that line into cache so that it can be written.

In the meantime, you issue some loads that target the same memory just written, and these will usually be satisfied by store-to-load forwarding, which pretty much just notices that a store to the same address is already in the store buffer and uses it as the result of the load, without needing to go to memory.
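For concreteness, here's a minimal C sketch of the pattern in question (`buf` is a hypothetical pointer to previously uncached memory):

```c
#include <stdint.h>

// Sketch only: assumes buf points at memory that is not currently
// in any level of cache.
uint64_t write_then_read(uint64_t *buf) {
    buf[0] = 0x1122334455667788; // store enters the store buffer; an RFO
                                 // for the line is issued in parallel
    return buf[0];               // load is satisfied by store-to-load
                                 // forwarding, without waiting for the RFO
}
```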

Now, store forwarding doesn't always work. In particular, it never works on any Intel (or, likely, AMD) CPU when the load only partially overlaps the most recent involved store. That is, if you write 4 bytes to address 10, and then read 4 bytes from address 9, only 3 bytes come from that write, and the byte at address 9 has to come from somewhere else. In that case, all Intel CPUs simply wait for all the involved stores to be written and then resolve the load.

In the past, there were many other cases that would also fail. For example, a smaller read that was fully contained in an earlier store would often fail to forward: given a 4-byte write to address 10, a 2-byte read from address 12 is fully contained in the earlier write, but often would not forward because the hardware was not sophisticated enough to detect that case.
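Here's a sketch of those two overlap cases, using `memcpy` for the unaligned accesses (`buf` is a hypothetical byte buffer):

```c
#include <stdint.h>
#include <string.h>

uint32_t overlap_cases(uint8_t *buf) {
    uint32_t w = 0xdeadbeef;
    memcpy(buf + 10, &w, 4);  // 4-byte store covering offsets 10..13

    uint32_t r1;
    memcpy(&r1, buf + 9, 4);  // 4-byte load from offset 9: only 3 of its
                              // bytes come from the store, so forwarding
                              // fails and the load waits for the store

    uint16_t r2;
    memcpy(&r2, buf + 12, 2); // 2-byte load fully contained in the store:
                              // forwards on modern CPUs, often didn't on
                              // older ones
    return r1 + r2;
}
```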

The recent trend, however, is that all the cases other than the "not fully contained read" case mentioned above successfully forward on modern CPUs. The gory details are well covered, with pretty pictures, at stuffedcow, and Agner also covers it well in his microarchitecture guide.

From the above-linked document, here's what Agner says about store forwarding on Skylake:

The Skylake processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding is one clock cycle faster than on previous processors. A memory write followed by a read from the same address takes 4 clock cycles in the best case for operands of 32 or 64 bits, and 5 clock cycles for other operand sizes.

Store forwarding has a penalty of up to 3 clock cycles extra when an operand of 128 or 256 bits is misaligned.

A store forwarding usually takes 4 - 5 clock cycles extra when an operand of any size crosses a cache line boundary, i.e. an address divisible by 64 bytes.

A write followed by a smaller read from the same address has little or no penalty.

A write of 64 bits or less followed by a smaller read has a penalty of 1 - 3 clocks when the read is offset but fully contained in the address range covered by the write.

An aligned write of 128 or 256 bits followed by a read of one or both of the two halves or the four quarters, etc., has little or no penalty. A partial read that does not fit into the halves or quarters can take 11 clock cycles extra.

A read that is bigger than the write, or a read that covers both written and unwritten bytes, takes approximately 11 clock cycles extra.

The last case, where the read is bigger than the write, is definitely a case where the store forwarding stalls. The quote of 11 cycles probably applies when all of the involved bytes are in L1; in the case where some bytes aren't cached at all (your scenario), it could of course take on the order of a DRAM miss, which can be hundreds of cycles.
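For example, a sketch of that stalling pattern, with two narrow stores followed by one wide load (`buf` is hypothetical):

```c
#include <stdint.h>
#include <string.h>

uint64_t wide_read(uint32_t *buf) {
    buf[0] = 1;         // 4-byte store
    buf[1] = 2;         // adjacent 4-byte store
    uint64_t v;
    memcpy(&v, buf, 8); // 8-byte load covering both stores: it cannot be
                        // forwarded from either store alone, so it waits
                        // for them to commit (the ~11 extra cycles assume
                        // the line is already in L1)
    return v;
}
```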

Finally, note that none of the above has to do with writing an entire cache line - it works just as well if you write 1 byte and then read that same byte, leaving the other 63 bytes in the cache line untouched.

There is an effect similar to what you mention with full cache lines, but it deals with write-combining writes, which are available either by marking memory as write-combining (rather than the usual write-back) or by using the non-temporal store instructions. The NT instructions are mostly targeted at writing memory that won't be read again soon: they skip the RFO overhead, and they probably don't forward to subsequent loads.
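As a sketch, here's one way to fill a full 64-byte line with NT stores, assuming a 64-byte-aligned `dst` and AVX support:

```c
#include <immintrin.h>

// Sketch only: dst must be 64-byte aligned for this to cover exactly
// one cache line.
void nt_fill_line(void *dst) {
    __m256i zero = _mm256_setzero_si256();
    _mm256_stream_si256((__m256i *)dst, zero);     // bytes 0..31
    _mm256_stream_si256((__m256i *)dst + 1, zero); // bytes 32..63
    _mm_sfence(); // order the NT stores with respect to later stores
}
```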

BeeOnRope
  • Awesome answer. Do you know if this is something that started to be supported (Reads matching writes exactly) from Sandy Bridge onward? – iam May 18 '17 at 08:35
  • I'm not sure what you mean by "this" - but if you mean store forwarding, it has been supported for a lot longer than that. For example, Agner's guide that I [linked above](http://www.agner.org/optimize/#manual_microarch) already discusses store forwarding on the Pentium Pro, so it goes back at least a couple of decades. @iam – BeeOnRope May 20 '17 at 04:24
  • It's not clear to me how this works out in a multicore situation. Say core A overwrites an entire cache line that was not cached before, without reading any of the bytes, then sometime after core B tries to read that cache line. Will core A have sent the cache line contents to the cache for B to read (I assume core B can't read directly from core A's store buffer?) without having had to load the cache line itself and experiencing a cache miss? Also, what if B tries to read the cache line while A is still writing to it - can A experience a miss then? – Joseph Garvin Sep 26 '19 at 15:38