This is fundamentally a business strategy problem on Docker's side.
Docker hasn't really managed to find product-market fit... and the one leverage point they have is that they're the central hub. So they're actively incentivized not to make this more efficient.
I was setting up local caches for Arch and Docker, and the difference is stunning. One is a 10-minute exercise to get flawless local LAN-speed caching; the other is a perilous journey.
The other issue is that it's not like they can go back and run this deduplication after the fact. Image layers are stored as a single gzipped tar of the contents on the layer. You can't just pull a single file out of that. If you go and reorganize as multiple gzip streams, you'll change the digest of the layer. A new registry could do that reorganization on import (and return the digest), or provide tooling to build the layers in the right format to begin with.
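If you want to poke at that structure yourself, something like this works (python:3.10 is just an example image you already have locally):

  # export an image to a tarball and list what's inside: a manifest plus
  # one blob per layer, each layer being a single tar of its files
  # (registries store these gzip-compressed and address them by sha256,
  # so repacking the same bytes differently changes the layer's digest)
  docker save -o python310.tar python:3.10
  tar -tf python310.tar | head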
Back in the early 2010s I couldn't bring up Docker images at all on my 2 Mbps DSL because any attempt to download images would time out.
Reminds me of OSTree and casync.
If you're interested in implementing this directly into your dockerfiles with some minimal changes, Docker already supports this to a degree:
https://docs.docker.com/reference/dockerfile/#copy---link
The TL;DR:
If you change your Dockerfile to use `COPY --link <foo> <bar>`, then Docker will create a layer containing only the files that would be copied, and that layer is treated as independent of the layers that come before it. The only caveat is that you need a build cache with previous builds and have to point `--cache-from` at it, which means saving build state.
That said, there are a lot of benefits you can get very quickly if you can implement it. For example, say you have a Dockerfile that creates a build container, builds your Go application in it, and then copies the result into a fresh alpine:3.23.3 image, and you use a local cache for that build. When you update to alpine 3.23.4, it will see that the build layers have not changed, and therefore the `COPY --link` layer has not changed, so it can apply that layer directly on top of the new alpine image without doing any extra work.
Apparently it can even be smart enough to realize that it doesn't need to pull down the new alpine:3.23.4 image at all: the new alpine layers are already in the registry, and the original 'my application' layers are already there, so it just creates a new manifest that references them and publishes it. No bandwidth used at all!
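A minimal sketch of that kind of Dockerfile (the Go base tag and paths are illustrative; `--link` needs BuildKit and Dockerfile syntax 1.4+):

  # syntax=docker/dockerfile:1
  # build stage: compile the application
  FROM golang:1.22 AS build
  WORKDIR /src
  COPY . .
  RUN go build -o /out/app .

  # runtime stage: the binary lands in its own independent layer, so a
  # later bump of the alpine tag can reuse that layer unchanged
  FROM alpine:3.23.3
  COPY --link --from=build /out/app /usr/local/bin/app
  ENTRYPOINT ["/usr/local/bin/app"]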
> How many copies of `python3.10` do I have floating around `/var/lib/docker`.
Well, if you use 'FROM python:3.10' for your images then only one.
If you're careful, you can sort of pull together the contents of multiple images by using `COPY --link`, and then even if you have 10 layers, changing from python:3.10 to python:3.14 only changes one of them.
Again, this does require that you maintain a cache, but that cache can live in a lot of places that don't have to be the local filesystem: https://docs.docker.com/reference/cli/docker/buildx/build/#c...
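For example, a build that reads and writes a registry-backed cache might look roughly like this (the registry and image names are placeholders):

  docker buildx build \
    --cache-from type=registry,ref=registry.example.com/myapp/buildcache \
    --cache-to type=registry,ref=registry.example.com/myapp/buildcache,mode=max \
    -t registry.example.com/myapp:latest \
    --push .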
I'm well aware of `COPY --link`; it doesn't solve the problem. I'm a heavy, heavy user of it, combined with throwaway build stages. `COPY --link` won't help my `apt install` commands.
The use case here isn't `FROM python:3.10`, it's `FROM ubuntu; RUN apt install -y vim wget curl software-properties-common python3.10`/`RUN rosdep install`/`RUN --mount=type=cache,target=/root/.cache/uv --mount=type=bind,source=uv.lock,target=uv.lock --mount=type=bind,source=pyproject.toml,target=pyproject.toml uv sync --locked --no-install-project`. All of those dependencies get merged onto a single layer that isn't shared with anything else. You'd better hope something like tensorflow isn't one of those dependencies.
Meta: I think your example code would benefit from being a code block; in HN this is done by prefixing with 2 spaces.
e.g.

  FROM ubuntu
  RUN apt install -y vim wget curl software-properties-common python3.10
  RUN rosdep install
  RUN --mount=type=cache,target=/root/.cache/uv --mount=type=bind,source=uv.lock,target=uv.lock --mount=type=bind,source=pyproject.toml,target=pyproject.toml uv sync --locked --no-install-project
They were intended to be three separate examples, but point taken; yes, I should have.
> Well, if you use 'FROM python:3.10' for your images then only one.
Negative, there can be multiple versions of an image with the same tag and different SHAs.
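You can see this on any long-lived host (the repository name is just an example):

  # the same tag can map to several different digests over time, and old
  # pulls stick around under /var/lib/docker until pruned
  docker images --digests python

Pinning by digest (`FROM python:3.10@sha256:...`) is the usual way to make the reference unambiguous if that matters to you.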