i worked professionally as a linux distro maintainer from 2014 to 2021. this is a tale of what i learned about docker, and the ecosystem that grew up around it, over those years. treat this as folklore, not as a proper secondary source, because i am not wasting my time googling for open source drama.
docker essentially does two things:
- it lets you build, layer, and share OS images in a standard1 format.
- it lets you run linux containers that use those layered images as a filesystem.
the idea of a linux container is that the kernel creates separate namespaces for all of the features userspace programs use. there are a lot of them, but the one we'll talk about today is the PID namespace, which is how the kernel keeps track of all the processes on a system, each identified by a process ID.
i'm glossing over the details, but when you create a new PID namespace and put a process in it, that process becomes PID 1 inside the namespace. PID 1 is special; on normal systems it is usually the init system, which is primarily responsible for starting all of the other processes you care about running.
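if you want to poke at this yourself, here's a minimal hypothetical sketch (assuming linux, gcc, and root, since CLONE_NEWPID needs CAP_SYS_ADMIN) that creates a new PID namespace with clone(2) and shows the child coming up as PID 1:

```c
/* sketch: create a new PID namespace and observe that the child
   sees itself as PID 1, while the parent sees a normal pid.
   needs root; compile with: gcc -o pidns pidns.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];

static int child(void *arg) {
    /* inside the new namespace, this prints "child sees pid 1" */
    printf("child sees pid %d\n", (int)getpid());
    return 0;
}

int main(void) {
    /* clone() wants the *top* of the stack, since stacks grow down */
    pid_t pid = clone(child, child_stack + sizeof(child_stack),
                      CLONE_NEWPID | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); exit(1); }

    /* from the parent namespace, the same process has an ordinary pid */
    printf("parent sees pid %d\n", (int)pid);
    waitpid(pid, NULL, 0);
    return 0;
}
```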
ok well this definition was correct like 13 years ago. since that time some folks decided (imo correctly) that an init system should not just run a series of shell scripts in lexicographical order on boot and shutdown, and should instead understand the concept of a long-running service: a process that does things and should be restarted if it crashes. this led to upstart and later systemd, which do many more things, like avoiding running a process altogether until someone asks for it over a network socket. (or things like sandboxing, which look a whole lot like some features of docker!)
this box is called "the foreshadowing zone" because we will come back to the complexity of init in a sec.
because of implementation details, PID 1 also does a half dozen other tiny things, like becoming the new parent of any processes whose parents die, and being responsible for [stares at notecard] reaping zombie processes? listen, it's complicated, and this blog post about docker PID 1 zombies or whatever goes into detail just fine if you're interested.
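if it helps, the reaping duty fits in a few lines. here's a hypothetical minimal sketch: any child that exits stays a zombie until someone wait()s on it, and orphans get reparented to PID 1, so init ends up reaping children it never forked.

```c
/* tiny sketch of PID 1's reaping duty. the fork() here is just a
   stand-in for "some process in the container exited"; a real init
   would also be reaping orphans it never spawned. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    if (fork() == 0)
        _exit(0);   /* child dies immediately, becoming a zombie */

    /* the init loop: reap anything that dies */
    int status;
    pid_t pid;
    while ((pid = wait(&status)) > 0)
        printf("reaped pid %d\n", (int)pid);

    /* wait() returns -1 with ECHILD once nothing is left; a real
       init would keep running and keep reaping forever */
    return 0;
}
```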
docker was originally designed as a relatively lightweight system for assembling a filesystem and making a container on top of it, so it just exec'd whatever you asked it to — for example, mysqld — as the first PID in a new PID namespace: PID 1. mysqld doesn't expect to be run as PID 1 and does not know how to perform the responsibilities of PID 1, because mysqld is not an init system. in effect, docker has placed an unlicensed four-year-old in the driver's seat of a multimodal semi truck and said "ok buddy you can do this". this manifests as all sorts of weird problems, but the most notable is that if you docker run that image and then hit ^C, it will not exit: the process group is still alive, because it was never mysqld's job to terminate all of its children before exiting.
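another part of the story (beyond the process group thing) is a kernel rule: within its namespace, PID 1 only receives a signal if it has actually installed a handler for it; default dispositions like "die on SIGINT" are simply ignored. an app that never expected to be init typically installs no handler, so the signal just vanishes. here's a hypothetical sketch you can run as a container's entrypoint to see it:

```c
/* sketch: why ^C can vanish into a container. run this as PID 1
   (for example, as a container's entrypoint) and try ^C with and
   without the signal() line below. */
#include <signal.h>
#include <unistd.h>

static void on_sigint(int sig) {
    (void)sig;
    write(1, "got SIGINT, exiting\n", 20);
    _exit(0);
}

int main(void) {
    /* comment this out and ^C does nothing: the kernel drops the
       SIGINT instead of killing PID 1 with it */
    signal(SIGINT, on_sigint);

    for (;;)
        pause();   /* stand-in for mysqld doing mysqld things */
}
```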
hang on. why is mysqld in its own container anyway? the web app we want to containerize uses the database, and we're supposed to be able to package all our dependencies into one container, right? it's not like real-world shipping containers, the entire metaphor upon which docker invented itself, have little pipes that go between them so the containers can talk to each other, right?
maybe what we need is an init system for our containers. that'll solve all two of the problems we know about so far: running multiple processes, and reaping their zombies.
except that's not what happened:
- upstart and systemd are extremely complex (foreshadowing payoff), were never designed to run inside containers2, and didn't seem interested in changing fundamental design decisions to do so, instead focusing on the part of the system outside your containers.
- separate docker images can't be combined in any meaningful way; a stock mysql image might have different libraries in it than your web server because you used different distros as your base. so docker did the pragmatic thing and made it possible to link containers together and let them talk to each other.
these days, lots of docker containers do have an init process. it's called tini, and it runs only one process, but does so correctly (blah blah zombies). if you want to run multiple services inside a single container, you have to do it yourself, so it's not a surprise the entire ecosystem assumes you won't do that.
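for a sense of scale, here's a hypothetical boiled-down version of what a tini-style init does: run exactly one child, forward signals to it, reap every zombie it inherits, and exit with the child's status. real tini handles a pile of edge cases this sketch ignores.

```c
/* mini-init: a hypothetical, stripped-down tini.
   usage: ./mini-init mysqld [args...] */
#define _GNU_SOURCE
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile pid_t child_pid = -1;

static void forward(int sig) {
    /* pass terminal signals along to the real workload */
    if (child_pid > 0)
        kill(child_pid, sig);
}

int main(int argc, char **argv) {
    if (argc < 2)
        return 64;

    signal(SIGINT, forward);
    signal(SIGTERM, forward);

    child_pid = fork();
    if (child_pid == 0) {
        execvp(argv[1], &argv[1]);   /* the one real process */
        _exit(127);
    }

    /* reap everything; when the real child dies, mirror its status */
    for (;;) {
        int status;
        pid_t pid = wait(&status);
        if (pid == child_pid)
            return WIFEXITED(status) ? WEXITSTATUS(status)
                                     : 128 + WTERMSIG(status);
        /* anything else was an orphaned zombie we just cleaned up,
           or wait() got interrupted by a signal; loop either way */
    }
}
```

these days you don't even have to bake this into your image: `docker run --init` injects tini as PID 1 in front of your command.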
so, the answer: sometimes something is so groundbreaking that it changes the world without actually being ready yet, and we get stuck using a half-finished ecosystem for a decade or more.
---

1. by "standard" we of course mean "made-up and then retconned into a standard", which is how all standards are made; as you'd expect there are some rough edges even to this day.
2. if you want to run systemd in a container you can — some folks from red hat were even touting it at the time — but good lord is it a pain in the ass. it didn't catch on for a reason.