This year I spoke at FOSDEM about Live Atomic Updates. The source for the presentation and the demo program are available on GitHub, but slides aren't a useful format for reading, so I'm converting them into this article, where I can also talk about the result.

What are Live Atomic Updates?

A live update is one that doesn't interrupt service to install; if you have to restart your computer to apply an update, it is an offline update.

Atomic updates take you from one version of the system to another with no intermediate states: you get either the new version or the old version, never a state where some parts have been upgraded but others haven't.

Traditional embedded projects do atomic updates with either a recovery partition, where you boot into a special version of the system to apply updates, or A/B partitioning, where you apply the update to the partition you aren't running from and instruct the bootloader to use the new version after a reboot.

OSTree and Project Atomic do this by booting into a new hardlink tree.

Baserock does this by applying the update to another subvolume and instructing the bootloader to use the new subvolume as the root filesystem after a reboot.

Why we need Live Atomic Updates

We need live updates because we want to provide always-on services when building servers, and on the desktop you don't want to lose all your state whenever you need to update.

We need atomic updates because non-atomic updates aren't reliable: when one fails, it causes further downtime while the degraded system is recovered.

Non-atomic updates are not reliable because of all the inter-dependent components in a modern Linux system.

To make this work without an atomic update you need to have:

  1. Every file is backwards compatible.
  2. You have dependency information for every file in your system.
  3. You have a way to atomically update a single file.

You would make this work by atomically replacing every file in dependency order, which works because the old versions of dependent files can still work with the new versions of their dependencies.
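The "replace in dependency order" step amounts to a topological sort. The sketch below is mine, not from the talk, and the dependency map is hypothetical, standing in for the per-file metadata that point 2 would require packagers to provide:

```python
import graphlib

# Hypothetical dependency map: each file lists the files it depends on.
# In a real system this metadata would have to be supplied by packagers.
deps = {
    "libc.so": [],
    "libm.so": ["libc.so"],
    "librt.so": ["libc.so"],
    "bash": ["libc.so"],
}

def update_order(deps):
    """Return files so that each file's dependencies are replaced first."""
    return list(graphlib.TopologicalSorter(deps).static_order())

order = update_order(deps)
```

Here libc.so is replaced before libm.so and bash, so that the still-old dependents run against the new, backwards-compatible dependency.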

We can do point 3 by writing the new file to disk and using a rename(2) syscall to atomically replace the old version of the file with the new version.
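As an illustrative sketch of point 3 (not taken from the talk): Python's os.replace() maps to rename(2) on POSIX, so an atomic single-file replacement might look like this:

```python
import os
import tempfile

def atomic_replace(path, data):
    """Atomically replace `path` with `data`: write, fsync, then rename(2)."""
    dirname = os.path.dirname(path) or "."
    # Write the new contents to a temporary file in the same directory,
    # so the final rename(2) stays within one filesystem.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # data must be on disk before the rename
        os.replace(tmp, path)      # atomic rename(2) over the old file
    except BaseException:
        os.unlink(tmp)
        raise
    # fsync the directory so the rename itself is durable.
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Readers of `path` see either the complete old file or the complete new one, never a partial write.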

Point 2 is theoretically possible, but is a lot of extra work for packagers to do, and a slip-up results in someone's update breaking their system if they update at the wrong time.

Point 1 would be very difficult to make happen, since it's a lot of work to keep things backwards compatible when things need to change. glibc goes to great effort to make this work with symbol versioning.

However, this is usually only considered for inter-project dependencies (your shell depending on glibc), rather than intra-project dependencies. For example, within glibc itself, libm.so and librt.so depend on libc.so. The majority of the symbols they depend on are versioned, so they work correctly.

glibc symbol dependencies

However libm.so and librt.so also depend on symbols marked as GLIBC_PRIVATE, which cannot normally be used by external libraries.

glibc GLIBC_PRIVATE dependencies

This implies that glibc doesn't want to provide ABI compatibility for these private symbols and may change their ABI between releases, so not even glibc promises perfect backwards compatibility.

For the above reasons, some form of Live Atomic Update is necessary for updates to be reliable.

As a side-benefit, atomic updates mean you can drop all the hacks that are needed to make non-atomic updates work:

  1. Only the system as a whole needs to provide backwards-compatible interfaces, even if the internals have shuffled around what provides them; e.g. symbols can move out of libc.so into other libraries.
  2. Packagers no longer need to track dependencies between files internal to their project, nor maintain the machinery to apply changes to files in dependency order.
  3. We don't need the extra fsync(2) and rename(2) calls to atomically update individual files; we can write multiple files in parallel and defer the fsync(2) until after everything has been written, which can make the update a lot faster.
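As a rough illustration of point 3 (my sketch, with illustrative paths): when the whole tree is switched atomically, the new version can be written into a separate tree with no per-file rename(2) or fsync(2), followed by a single flush at the end:

```python
import os

def populate_new_tree(root, files):
    """Write updated files into a separate tree (e.g. a new snapshot or
    subvolume), with no per-file rename(2) or fsync(2)."""
    for relpath, data in files.items():
        dest = os.path.join(root, relpath)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with open(dest, "wb") as f:
            f.write(data)
    # A single flush once everything is written; only after this point
    # would the system be switched over to the new tree.
    os.sync()
```

The writes can proceed in any order, or in parallel, because nothing observes the new tree until the atomic switch.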

Proof of Concept program for Live Atomic Updates

The demo video shows the proof of concept upgrading from an older version of GCC to a newer version on a Baserock system, which only performs atomic updates.

The proof of concept creates a parallel mount tree with the new version mounted, ready to pivot_root(2) into.

Mount tree with new tree mounted

The initial attempt at upgrading fails because pivot_root(2) refuses to work if the mount propagation of / is shared. I believe this is so that it doesn't pivot other mount namespaces as a side effect, since the target mount may not exist in those namespaces.

After calling pivot_root(2) the mount tree looks more like this:

Mount tree after pivot

Unfortunately this is not sufficient to have all processes start using the new mount tree: the open files, root directories and working directories of processes point directly at the old files, rather than indirectly through the current state of the filesystem.

Process descriptors refer to old versions

pivot_root(2) is designed to move only the root of the calling process, and only in special circumstances, which is sufficient for moving out of an initramfs, but not for our case.

In my Proof of Concept, I use ptrace(2) to inject code into all the processes to make them migrate to the new root, but as you can see from the demo, it cannot work for every process.

Process descriptors refer to new versions

ptrace(2) is generally the wrong tool to use here, since:

  1. Not all processes are ptraceable.
  2. Most processes don't have the privileges to chroot(2) into the new root.
  3. Reopening directory fds messes with their state and changes the st_dev and st_ino fields, which are often used to check whether two files are identical; e.g. journald currently relies on this for its graceful restart logic.
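Point 3 refers to the common identity check based on (st_dev, st_ino). A minimal sketch of what such programs do:

```python
import os

def same_file(path_a, path_b):
    """The identity check many daemons use: two paths name the same file
    if and only if their (st_dev, st_ino) pairs match."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

A hardlink to a file compares equal under this check, while a byte-identical copy does not; a file reopened via a different mount of a different filesystem also gets a new st_dev, which is why the ptrace migration confuses this logic.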

Remaining approaches

File-system transactions

There have been a few attempts to get filesystem transactions into Linux. Btrfs already has a form of transaction, but it's for expert use only and effectively requires there to be only one process running on the system, as it's trivially easy to deadlock yourself unless you have deep knowledge of how the filesystem works.

There was also an attempt to add transactions by specifying a list of syscalls to run atomically, but that wasn't merged. An interesting idea would be to instead branch and merge the filesystem as if it were git: make a snapshot, make changes to the snapshot, then merge the changes back atomically.

Freeze userland during update

Instead of having a truly atomic update operation, we could fake it by freezing all of userland except the update process, so the update appears atomic from every other process's perspective.

This does result in the system being unresponsive during the update though.

We could make this work by:

  1. Make a snapshot
  2. Apply the update to the snapshot
  3. Instruct the bootloader to use that snapshot for future boots
  4. Freeze userland
  5. Mirror the updated snapshot in the currently running snapshot, using hardlinks or CoW clones to make this faster
  6. If the update failed, roll back the changes to the current system and instruct the bootloader to go back to a snapshot of the old version before unfreezing.
  7. If it succeeded, instruct the bootloader to use the current version again, and remove the snapshot of the new version.

Use a proxy filesystem

We could have a proxy filesystem swap its backing filesystem atomically to provide the update. This would need to keep the st_ino numbers stable between updates.

AUFS nearly fits the bill here, since it proxies inode numbers and allows a new backing filesystem to be added on top, but it doesn't allow you to remove the backing filesystem underneath while processes are still using it.

Only pivot pid 1 into the new version, and have it restart services

The idea is to have systemd be the process that does the migration; after calling pivot_root(2), it is running in the new version.

All the existing services remain running in the old version, but this is not necessarily a bad thing for the moment, as they will work reliably until they are restarted, since everything they depend on hasn't changed underneath them.

Soon after pivoting, systemd should trigger a graceful restart of all the services.

They can now do this by passing file descriptors back to systemd and getting them back on their next start, as with socket activation.

Services can store state that isn't associated with a file descriptor by writing it to a file in /run, or by writing it to a temporary file and also passing that file descriptor to systemd.
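The fd-passing underneath this is plain SCM_RIGHTS over a unix socket; a real service would send to systemd's $NOTIFY_SOCKET with sd_notify and "FDSTORE=1", but the mechanism can be sketched with both ends simulated:

```python
import array
import socket

def send_fd(sock, fd, message):
    """Pass a file descriptor over a unix socket with SCM_RIGHTS,
    the mechanism underlying systemd's fd store (sd_notify FDSTORE=1)."""
    sock.sendmsg(
        [message],
        [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))],
    )

def recv_fd(sock):
    """Receive a message and one file descriptor."""
    msg, ancdata, flags, addr = sock.recvmsg(1024, socket.CMSG_LEN(4))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds = array.array("i")
            # Truncate any padding so frombytes sees whole ints.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
            return msg, fds[0]
    raise RuntimeError("no fd received")
```

The receiving side gets a duplicate of the descriptor, so the kernel object (socket, pipe, open file) survives even after the sending process exits and is re-execed.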

So the services are gradually updated into a new system while keeping everything running.

If we can arrange for systemd to enter a new mount namespace when it migrates to the new root, then the old namespace, along with all the mounts from it that weren't kept in the new namespace, will be cleaned up when the last process in the old mount tree exits.

Some services may depend on other running services that have changed during the update and may not have backwards-compatible interfaces; however, systemd tracks dependencies between services, so it could restart them in dependency order.

Result of speaking at FOSDEM

The most difficult question I was asked at FOSDEM was why I cared about making sure processes are migrated into the new mount tree, when I have to restart them all soon anyway.

My only answers were:

  1. I want my shell sessions to appear in the new root, so I can instantly see it working, and systemd allows me to restart services because I am in the same root as it.
  2. The graceful restart logic of existing programs needs them to be in the new root for them to re-exec the new binary, since they rely on fd inheritance for the new version of the process to have access to the file descriptors.

Problem 1 could be solved by making systemd remember the (st_dev, st_ino) pairs of old root directories, so that service restart requests coming from those roots are permitted rather than refused. Currently systemd assumes that if I'm in a different root I'm in a chroot, and that I want isolation from a completely different userland.
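A minimal sketch of what that allow-list might look like, assuming a hypothetical RootAllowList inside systemd; in reality systemd would stat /proc/<pid>/root to find a requester's root directory:

```python
import os

class RootAllowList:
    """Hypothetical sketch: remember the (st_dev, st_ino) identity of
    previous root directories, so requests from processes still running
    in an old root can be treated as local rather than as chroots."""

    def __init__(self):
        self._roots = set()

    def remember(self, root_path):
        """Record a root directory's identity before pivoting away from it."""
        st = os.stat(root_path)
        self._roots.add((st.st_dev, st.st_ino))

    def is_known_root(self, root_path):
        """Check a requester's root (e.g. /proc/<pid>/root) against the list."""
        st = os.stat(root_path)
        return (st.st_dev, st.st_ino) in self._roots
```

Because the check uses (st_dev, st_ino) rather than a path, it keeps working after the old root is no longer reachable by its original name.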

Problem 2 could be handled by porting services to systemd's new way to gracefully re-exec by passing fds back to systemd.

Given these solutions, the most appealing approach is to extend systemctl switch-root with a mode that leaves existing services running, rather than killing them off.