In general, the subsequent read should be fast - as long as store-to-load forwarding is able to work. In fact, it has nothing to do with writing an entire cache line at all: it should also work (with the same caveat) even for smaller writes!
Basically, what happens on normally mapped memory (i.e., WB memory regions) is that the store(s) add several entries to the CPU's store buffer. Since the associated memory isn't currently cached, these entries are going to linger for some time, because an RFO request has to pull that line into cache before it can be written.
In the meantime, you issue some loads that target the same memory just written, and these will usually be satisfied by store-to-load forwarding, which pretty much just notices that a store is already in the store buffer for the same address and uses it as the result of the load, without needing to go to memory.
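As a rough mental model of that lookup, here is a toy sketch in C. It is not how any real CPU is built: the names (`store_buffer`, `sb_push`, `sb_forward`), the 8-entry capacity, and the restriction to 4-byte stores with exact address matches are all assumptions made to keep the example small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SB_CAPACITY 8  /* arbitrary for this toy model */

/* One pending store. Hypothetical simplification: every store is 4 bytes. */
struct sb_entry {
    uint64_t addr;
    uint32_t value;
};

struct store_buffer {
    struct sb_entry entries[SB_CAPACITY];
    size_t count;  /* entries[count - 1] is the youngest store */
};

/* Queue a store. Returns false when the buffer is full
   (a real CPU would stall until an entry drains). */
bool sb_push(struct store_buffer *sb, uint64_t addr, uint32_t value)
{
    assert(sb != NULL);
    if (sb->count == SB_CAPACITY)
        return false;
    sb->entries[sb->count].addr = addr;
    sb->entries[sb->count].value = value;
    sb->count++;
    return true;
}

/* Try to satisfy a 4-byte load from the buffer, youngest store first.
   Returns true and writes *out when a store to the same address is pending. */
bool sb_forward(const struct store_buffer *sb, uint64_t addr, uint32_t *out)
{
    assert(sb != NULL && out != NULL);
    for (size_t i = sb->count; i > 0; i--) {
        if (sb->entries[i - 1].addr == addr) {
            *out = sb->entries[i - 1].value;
            return true;
        }
    }
    return false;  /* no match: the load has to go to cache or memory */
}
```

The point of the model is only that a load whose address matches a pending store gets its value from the youngest such store, without waiting for the line to arrive in cache.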
Now, store forwarding doesn't always work. In particular, it never works on any Intel (or, likely, AMD) CPU when the load only partially overlaps the most recent involved store. That is, if you write 4 bytes to address 10 and then read 4 bytes from address 9, only 3 bytes come from that write, and the byte at address 9 has to come from somewhere else. In that case, all Intel CPUs simply wait for all the involved stores to be written and then resolve the load.
In the past, there were many other cases that would also fail, for example, if you issued a smaller read that was fully contained in an earlier store, it would often fail. For example, given a 4-byte write to address 10, a 2-byte read from address 12 is fully contained in the earlier write - but often would not forward as the hardware was not sophisticated enough to detect that case.
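To make the two address examples concrete, the following C sketch classifies a load against a single earlier store by comparing byte ranges. The function name `classify_load` and the `enum overlap` values are made up for illustration; the code only does the interval arithmetic and says nothing about what a particular CPU will actually forward.

```c
#include <assert.h>
#include <stdint.h>

enum overlap { NO_OVERLAP, FULLY_CONTAINED, PARTIAL_OVERLAP };

/* Compare the byte range of a load with the byte range of an earlier store.
   Ranges are half-open: [addr, addr + size). */
enum overlap classify_load(uint64_t store_addr, uint64_t store_size,
                           uint64_t load_addr, uint64_t load_size)
{
    assert(store_size > 0 && load_size > 0);
    uint64_t store_end = store_addr + store_size;  /* one past the last byte */
    uint64_t load_end = load_addr + load_size;

    if (load_end <= store_addr || load_addr >= store_end)
        return NO_OVERLAP;       /* the load touches none of the stored bytes */
    if (load_addr >= store_addr && load_end <= store_end)
        return FULLY_CONTAINED;  /* every loaded byte comes from the store */
    return PARTIAL_OVERLAP;      /* some loaded bytes must come from elsewhere */
}
```

With this, `classify_load(10, 4, 9, 4)` gives `PARTIAL_OVERLAP` (the 4-byte write to 10 followed by a 4-byte read from 9), while `classify_load(10, 4, 12, 2)` gives `FULLY_CONTAINED` (the 2-byte read from 12). A read that is larger than the write, such as `classify_load(10, 4, 10, 8)`, also lands in `PARTIAL_OVERLAP`.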
The recent trend, however, is that all the cases other than the "not fully contained read" case mentioned above successfully forward on modern CPUs. The gory details are well covered, with pretty pictures, on stuffedcow, and Agner also covers it well in his microarchitecture guide.
From the above linked document, here's what Agner says about store-forwarding on Skylake:
The Skylake processor can forward a memory write to a subsequent read
from the same address under certain conditions. Store forwarding is
one clock cycle faster than on previous processors. A memory write
followed by a read from the same address takes 4 clock cycles in the
best case for operands of 32 or 64 bits, and 5 clock cycles for other
operand sizes.
Store forwarding has a penalty of up to 3 clock cycles extra when an
operand of 128 or 256 bits is misaligned.
A store forwarding usually takes 4 - 5 clock cycles extra when an
operand of any size crosses a cache line boundary, i.e. an address
divisible by 64 bytes.
A write followed by a smaller read from the same address has little or
no penalty.
A write of 64 bits or less followed by a smaller read has a penalty of
1 - 3 clocks when the read is offset but fully contained in the
address range covered by the write.
An aligned write of 128 or 256 bits followed by a read of one or both
of the two halves or the four quarters, etc., has little or no
penalty. A partial read that does not fit into the halves or quarters
can take 11 clock cycles extra.
A read that is bigger than the write, or a read that covers both
written and unwritten bytes, takes approximately 11 clock cycles
extra.
The last case, where the read is bigger than the write, is definitely a case where store forwarding stalls. The quoted 11 cycles probably applies when all of the involved bytes are in L1. When some of the bytes aren't cached at all (your scenario), it could of course take on the order of a DRAM miss, which can be hundreds of cycles.
Finally, note that none of the above has to do with writing an entire cache line - it works just as well if you write 1 byte and then read that same byte, leaving the other 63 bytes in the cache line untouched.
There is an effect similar to what you mention with full cache lines, but it deals with write-combining writes, which are available either by marking memory as write-combining (rather than the usual write-back) or by using the non-temporal store instructions. The NT instructions are mostly targeted at writing memory that won't be read again soon, skipping the RFO overhead, and probably don't forward to subsequent loads.