Building small Docker images faster

(sgt.hootr.club)

66 points | by steinuil 2 days ago ago

24 comments

grim_io a day ago ago

I've seen so many devs not know that things like multi stage even exists.
Multi gigabyte containers everywhere.
rapidlua 19 hours ago ago

For go specifically, I find ko-build handy. It builds on the host (leveraging go crosscompilation and taking advantage of caches) and outputs a Docker image.
zerotolerance a day ago ago

I always like finding people advocating for older sage knowledge and bringing it forward for new audiences. That said, as someone who wrote a book about Docker and has lived the full container journey I tend to skip the containerized build all together. Docker makes for great packaging. But containerizing ever step of the build process or even just doing it in one big container is a bit extra. Positioning it as a build scripting solution was silly.

[-]
- maccard a day ago ago
  
  I’m inclined to agree with you about not building containers. That said, I find myself going around in circles. We have an app that uses a specific toolchain version, how do we install that version on a build machine without requiring an SRE ticket to update our toolchain?
  Containers nicely solve this problem. Then your builds get a little slow, so you want to cache things. Now your docker file looks like this. You want to run some tests - now it’s even more complicated. How do you debug those tests? How do those tests communicate with external systems (database/redis). Eventually you end up back at “let’s just containerise the packaging”.
  
  [-]
  - cogman10 a day ago ago
    
    You can mount the current directory into docker and run an image of your tool.
    Here's an example of that from the docker maven.
    `docker run -it --rm --name my-maven-project -v "$(pwd)":/usr/src/mymaven -w /usr/src/mymaven maven:3.3-jdk-8 mvn clean install`
    You can get as fancy as you like with things like your `.m2` directory, this just gives you the basics of how you'd do that.
    
    [-]
    - maccard 5 hours ago ago
      
      Thanks - this is an interesting idea I had never considered. I do like the layer based caching of dockerfiles, which you give up entirely for this but it allows for things like running containerised builds cached SCM checkouts (our repository is 300GB…)
      
      [-]
      - cogman10 3 hours ago ago
        
        Yeah, it's basically tradeoffs all around.
        The benefit of this approach is it's a lot easier to make sure dependencies end up on the build node so you aren't redownloading and caching the same dependency for multiple artifacts. But then you don't get to take advantage of docker build caching to speed up things when something doesn't change.
        That's the part about docker I don't love. I get why it's this way, but I wish there was a better way to have it reuse files between images. The best you can do is a cache mount. But that can run into size issues as time goes on which is annoying.
  - pstuart a day ago ago
    
    Depending on how the container is structured, you could have the original container as a baseline default, and then have "enhanced" containers that use it as a base and overlay the caching and other errata to serve that specialized need.
    
    [-]
    - maccard 5 hours ago ago
      
      I’ve tried this in the past, but it pushes the dependency management pf the layers into whatever is orchestrating the container build, as opposed to multi stage builds which will parallelise!
      Not dismissing, but it’s just caveats every which way. I think in an ideal world I just want Bazel or Nixos without the baggage that comes with them - docker comes so close but yet falls so short of the finish line.
- yjftsjthsd-h 21 hours ago ago
  
  I quite strongly disagree; a Dockerfile is a fairly good way to describe builds, a uniform approach across ecosystems, and the self contained nature is especially useful for building software without cluttering the host with build dependencies or clashing with other things you want to build. I like it so much that I've started building binaries in docker even for programs that will actually run on the host!
  
  [-]
  - solatic 15 hours ago ago
    
    It can indeed be uniform across ecosystems, but it's slow. There's a very serious difference between being on a team where CI takes ~1 minute to run, vs. being on a team where CI takes a half hour or even, gasp, longer. A large part of that is the testing story, sure, but when you're really trying to optimize CI times, then every second counts.
    
    [-]
    - yjftsjthsd-h 8 hours ago ago
      
      If the difference is <1 minute vs >30 minutes, containers (per se) are not the problem. If I was guessing blindly, it sounds like you're not caching/reusing layers, effectively throwing out a super easy way to cache intermediate artifacts and trashing performance for no good reason. And in fact, this is also a place where I think docker - when used correctly - is quite good, because if you (re)use layers sensibly it's trivial to get build caching without having to figure out a per-(language|build system|project) caching system.
      
      [-]
      - solatic 5 hours ago ago
        
        I'm exaggerating somewhat. But I'm familiar with Docker's multi-stage builds and how to attempt to optimize cache layers. The first problem that you run into, with ephemeral runners, is where the Docker cache is supposed to be downloaded from, and it's often not faster at all compared to re-downloading artifacts (network calls are network calls, and files are files after all). This is fundamentally different from per-language caching systems where libraries are known to be a dumb mirror of upstream, often hash-addressed for modern packaging, and thus are safe to share between builds, which means that it is safe to keep them on the CI runner and not be forced to download the cache for a build before starting it.
        > without having to figure out a per-language caching systems
        But most companies, even large ones, tend to standardize on no more than a handful of languages. Typescript, Python, Go, Java... I don't need something that'll handle caching for PHP or Erlang or Nix (not that you can really work easily with Nix inside a container...) or OCaml or Haskell... Yeah I do think there's a lot of room for companies to say, this is the standardized supported stack, and we put in some time to optimize the shit out of it because the DX dividends are incredible.
    - maccard 12 hours ago ago
      
      You can have fast pipelines in containers - I’ve worked in quick containerised build environments and agonisingly slow non-containerised places, the difference is whether anyone actually cares and if there’s a culture of paying attention to this stuff.
- solatic 16 hours ago ago
  
  Agree, and I would go another step to suggest dropping Docker altogether for building the final container image. It's quite sad that Docker requires root to run, and all the other rootless solutions seem to require overcomplicated setups. Rootless is important because, unless you're providing CI as a public service and you're really concerned about malicious attackers, you will get way, way, way more value out of semi-permanent CI workers that can maintain persistent local caches compared to the overhead of VM enforced isolation. You just need an easy way to wipe the caches remotely, and a best-effort at otherwise isolating CI builds.
  A lot of teams should think long and hard about just taking build artifacts, throwing them into their expected places in a directory taking the place of chroot, generating a manifest JSON, and wrapping everything in a tar, which is indeed a container.
  
  [-]
  - OptionOfT 3 hours ago ago
    
    I like to build my stuff inside of Docker because it is my moat against changes of the environment.
    We have our base images, and in there we install dependencies by version. That package then is the base for our code build. (as apt seemingly doesn't have any lock file support?).
    In the subsequent built EVERYTHING is versioned, which allows us to establish provenance all the way up to the base image.
    And next to that when we promote images from PR -> main we don't even rebuild the code. It's the same image that gets retagged. All in the name of preserving provenance.
  - miladyincontrol 8 hours ago ago
    
    I mean personally I find nspawn to be a pretty simple way of doing rootless containers. Replace manifest JSON with a systemd service file and you've got a rootless container that can run on most linux systems without any non-systemd dependencies or strange configuration required. Dont even need to extract the tarball.
- dboreham a day ago ago
  
  Agree. Using a container to build the source that is then packaged as a "binary" in the resulting container always seemed odd to me. imho we should have stuck with the old ways : build the product on a regular computer. That outputs some build artifacts (binaries, libraries, etc). Docker should take those artifacts and not be hosting the compiler and what not.
  
  [-]
  - rcxdude a day ago ago
    
    If anything the build being in a container is the more valuable bit, though mainly because the container usually more repeatable by having a scripted setup itself. Though I dunno why the build and the host would be the _same_ container in the end.
    (and of course, nix kinda blows both out the water for consistency)
  - exe34 a day ago ago
    
    nix allows you to build docker containers with anything you can build in nix.
paulddraper 18 hours ago ago

A Bazel option is https://github.com/bazel-contrib/rules_oci
Doesn’t even need Docker, just writes the image files with a small Python script.
Can build from scratch, or use the very small Distroless images.
lrvick a day ago ago

For even smaller images that are always deterministic/reproducible with a multi-party signed supply chain, check out https://stagex.tools

[-]
- abound a day ago ago
  
  Might want to disclose that you built it.
  Also, I took a quick look and I don't understand how your tool could possibly produce "even smaller images". The article is using multi-stage builds to produce a final Docker image that is quite literally just the target binary in question (based on the scratch image), whereas your tool appears be a whole Linux distribution.
  
  [-]
  - lrvick 7 hours ago ago
    
    I am one of the maintainers at this point, fair.
    This would be a much smaller drop in replacement for the base images used in the post to give full source bootstrapped final binaries.
    You can still from scratch for the final layer though of course and that would be unlikely to change size much though, to your point.