How does Docker know when to use the cache during a build and when not?

Question

I'm amazed at how well Docker's caching of layers works, but I'm also wondering how it determines whether it may use a cached layer or not.

Let's take these build steps for example:

Step 4 : RUN npm install -g   node-gyp
 ---> Using cache
 ---> 3fc59f47f6aa
Step 5 : WORKDIR /src
 ---> Using cache
 ---> 5c6956ba5856
Step 6 : COPY package.json .
 ---> d82099966d6a
Removing intermediate container eb7ecb8d3ec7
Step 7 : RUN npm install
 ---> Running in b960cf0fdd0a

For example, how does it know it can use the cached layer for npm install -g node-gyp but has to create a fresh layer for npm install?

Have you read https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#build-cache ? — Roman, Jul 29 '16 at 09:58

score 75 · Accepted Answer · edited Aug 27 '21 at 16:11

The build cache process is explained fairly thoroughly in the "Leverage build cache" section of Best practices for writing Dockerfiles:

Starting with a parent image that is already in the cache, the next instruction is compared against all child images derived from that base image to see if one of them was built using the exact same instruction. If not, the cache is invalidated.

In most cases, simply comparing the instruction in the Dockerfile with one of the child images is sufficient. However, certain instructions require more examination and explanation.

For the ADD and COPY instructions, the contents of the file(s) in the image are examined and a checksum is calculated for each file. The last-modified and last-accessed times of the file(s) are not considered in these checksums. During the cache lookup, the checksum is compared against the checksum in the existing images. If anything has changed in the file(s), such as the contents and metadata, then the cache is invalidated.

Aside from the ADD and COPY commands, cache checking does not look at the files in the container to determine a cache match. For example, when processing a RUN apt-get -y update command the files updated in the container are not examined to determine if a cache hit exists. In that case just the command string itself is used to find a match.

Once the cache is invalidated, all subsequent Dockerfile commands generate new images and the cache is not used.
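
To make the rules above concrete, here is a minimal Python sketch of the lookup. It is a toy model and not Docker's actual implementation: the function names, the key scheme (hash of parent key plus instruction string) and the "base-image" placeholder are all assumptions made for illustration, and the COPY checksum covers only file names and contents, whereas the quoted docs say some metadata is considered too.

```python
import hashlib


def content_checksum(files):
    """Checksum of file names and contents only; timestamps are never read."""
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode())
        digest.update(b"\0")
        digest.update(files[name])
        digest.update(b"\0")
    return digest.hexdigest()


def build(instructions, context, cache):
    """Walk the instructions and report 'cached' or 'built' for each step.

    instructions: list of (instruction_string, copied_file_names) pairs
    context:      dict mapping file name -> bytes (a stand-in for the build context)
    cache:        set of layer keys left behind by earlier builds (hypothetical)
    """
    parent = "base-image"  # placeholder for the parent image ID
    results = []
    for text, copied in instructions:
        digest = hashlib.sha256()
        digest.update(parent.encode())
        digest.update(text.encode())
        if copied:  # in this model only ADD/COPY look at file contents
            copied_files = {name: context[name] for name in copied}
            digest.update(content_checksum(copied_files).encode())
        key = digest.hexdigest()
        results.append("cached" if key in cache else "built")
        cache.add(key)
        parent = key  # a changed parent changes every later key
    return results


# The steps from the question, in simplified form.
steps = [
    ("RUN npm install -g node-gyp", []),
    ("WORKDIR /src", []),
    ("COPY package.json .", ["package.json"]),
    ("RUN npm install", []),
]

cache = set()
first = build(steps, {"package.json": b'{"version": "1.0.0"}'}, cache)
second = build(steps, {"package.json": b'{"version": "1.0.0"}'}, cache)
third = build(steps, {"package.json": b'{"version": "1.0.1"}'}, cache)

print(first)
print(second)
print(third)
```

In this model the first build creates every layer, the second is served entirely from the cache, and the third (with a modified package.json) reuses the first two steps but rebuilds the COPY and the RUN npm install after it, which matches the build output in the question.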

You will run into situations where OS packages, NPM packages or a Git repo have been updated to newer versions (say, a ~2.3 semver range in package.json), but because neither your Dockerfile nor package.json has changed, Docker will continue using the cache.

It's possible to programmatically generate a Dockerfile that busts the cache by changing lines based on smarter checks (e.g. retrieving the latest commit SHA of a git branch to use in the clone instruction). You can also periodically run the build with --no-cache=true to force updates.
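
A rough sketch of the generated-Dockerfile idea, in Python. Everything specific here is an assumption for illustration: the repository URL, the node:lts base image, the /src path and the function names are hypothetical. The only external command used is git ls-remote, which prints the commit SHA and ref name for each matching ref.

```python
import subprocess


def parse_ls_remote(output):
    """Take the SHA from the first '<sha> <ref>' line printed by git ls-remote."""
    lines = output.strip().splitlines()
    if not lines:
        raise ValueError("branch not found on remote")
    return lines[0].split()[0]


def latest_commit(repo_url, branch):
    """Ask the remote for the SHA at the tip of a branch (needs git and network)."""
    result = subprocess.run(
        ["git", "ls-remote", repo_url, f"refs/heads/{branch}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ls_remote(result.stdout)


def render_dockerfile(repo_url, sha):
    """Embed the SHA in the clone instruction so a new commit changes the string."""
    return "\n".join([
        "FROM node:lts",  # assumed base image
        f"RUN git clone {repo_url} /src && git -C /src checkout {sha}",
        "WORKDIR /src",
        "RUN npm install",
        "",
    ])


REPO = "https://example.com/app.git"  # hypothetical repository
old = render_dockerfile(REPO, "1111111")
new = render_dockerfile(REPO, "2222222")
print(old != new)  # the RUN line differs, so the cached layer no longer matches
```

In a real build script you would call latest_commit(REPO, "main") to get the SHA, write the rendered text to a Dockerfile, and then run docker build. As long as the branch has not moved, the instruction string is identical and the cache is still used.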

edited Aug 27 '21 at 16:11 by yaobin
answered Jul 29 '16 at 10:29 by Matt

Another option to break the cache is to use fixed versions for your packages and update them manually inside the Dockerfile. – Ohmen Jul 29 '16 at 13:59
@Ohmen True, you can use similar fixed versions in `package.json` too. It's still possible to get caught out by package dependencies even then. – Matt Jul 29 '16 at 14:35
But if you manually update the version in your Dockerfile, it is no longer the same command, which causes Docker to skip the cache and run the command again. – Ohmen Jul 29 '16 at 14:38
Yep, which fixes the direct dependencies and flushes the cache more often, limiting the problem. The trouble is that most package dependencies are deep trees now. Dependencies on the second and subsequent levels can have updates too, and those don't bust the cache. – Matt Jul 29 '16 at 15:15
This is more of a general (temporal) versioning issue, but the build cache highlights it, especially when you have multiple caches around. The problem is particularly prevalent in Node.js, where dependency trees are large and the versions of transitive dependencies are rarely strict, so the result of running `npm install` at different times on the same source varies, often significantly. – Matt Jul 29 '16 at 15:15
You are right, these dependencies can just be pinned if you explicitly include them in the install command. – Ohmen Jul 30 '16 at 22:08
The lockfiles in yarn or npm 5+ will pin versions for all dependencies transitively and could help out significantly here. – cliff.meyers May 17 '18 at 20:04
@cliff.meyers It's been fixed for top-level apps that can define a lock file. Unfortunately, module dependencies still have the temporal version creep, as neither npm nor yarn will respect a lock file in the module itself. – Matt Jun 03 '18 at 03:24
Does the Docker cache have some _expiration date_? For example, is the cache invalidated even if the checksums are equal, because it has been a couple of days/weeks since the layer was last used? – Taz Jun 09 '19 at 15:00
Don't forget, cached layers can come from `--cache-from` argument to `docker build` :+1: – Kingdon Aug 21 '19 at 04:04
@Taz No expiry date. You could add something like that by generating the Dockerfile or through the build process itself. – Matt Aug 21 '19 at 04:39
"the next instruction is compared against all child images derived from that base image": where are all these child images looked for? Let's say I use an 'ubuntu' base image, build a Docker image and save it in Google Container Registry. What happens in that case? Is it only from the `--cache-from` sources, or are there default locations? – Pavithra Vijay Oct 01 '20 at 20:57
@PavithraVijay If the sha256 sum exists in the local docker cache, which is any image/layer docker has used before. `--cache-from` allows you to specify remote registry images as cache sources. – Matt Oct 01 '20 at 23:36

score 8 · Answer 2 · answered Jul 29 '16 at 09:53

It's because your package.json file has been modified; the Removing intermediate container line after Step 6 shows that the COPY step was rebuilt instead of being served from the cache.

That's also usually the reason why package-manager (vendor/3rd-party) info files are copied first during docker build. After that you run the package-manager installation, and only then do you add the rest of your application, i.e. src.

If your dependencies haven't changed, these steps are served from the build cache.
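
As a rough illustration of why the ordering matters, here is a small Python model. It is a simplification under stated assumptions: the instruction strings never change between builds, so only COPY inputs can cause a miss, and the file names and the two example layouts are hypothetical.

```python
def rebuilt_steps(steps, changed_files):
    """Return the instructions that must rebuild when the given files changed.

    steps: list of (instruction, files_copied) pairs. The first COPY whose
    inputs changed misses the cache, and so does everything after it.
    """
    for index, (_, copied) in enumerate(steps):
        if copied & changed_files:
            return [instruction for instruction, _ in steps[index:]]
    return []


# Dependency manifest first, application source last.
deps_first = [
    ("COPY package.json .", {"package.json"}),
    ("RUN npm install", set()),
    ("COPY src ./src", {"src/app.js"}),
]

# Everything copied in one go before installing.
all_at_once = [
    ("COPY . .", {"package.json", "src/app.js"}),
    ("RUN npm install", set()),
]

source_edit = {"src/app.js"}
print(rebuilt_steps(deps_first, source_edit))
print(rebuilt_steps(all_at_once, source_edit))
```

With the manifest copied first, editing a source file only rebuilds the final COPY. With everything copied at once, the same edit also forces RUN npm install to run again.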

answered Jul 29 '16 at 09:53 by schmunk

That cache could also be invalidated by a change to the instruction in the `Dockerfile` – Matt Jul 29 '16 at 10:12
