distri: researching fast Linux package management
Michael Stapelberg @zekjur
2020-10-11
Overview
● 1-minute introduction
● demo videos: Arch vs. distri package installation speed
● Comparison with Arch Linux
● How does distri work?

Introduction: Michael Stapelberg
● Debian Developer for 7 years (2012-2019)
  ○ left Debian because of antique tooling and slow changes
● Using Arch Linux for 1 year
  ○ used Fedora and NixOS each for a few months
  ○ in a previous life, used Gentoo, Ubuntu and NetBSD
● Wrote the i3 tiling window manager in 2009
● other FOSS projects, too! Debian Code Search, RobustIRC, gokrazy, …

demo: installing “ack” (Arch vs. distri)

demo: installing “qemu” (Arch vs. distri)

Updates/package install: faster in distri
● transport compression
  → Arch switched to zstd on 2020-01-04
● mirror selection
  → Arch asks its users to maintain their mirror list
  → Why can’t Arch default to a CDN that’s fast everywhere?
● no hooks/triggers: maximum parallelism
  Arch is moving from package hooks to pacman hooks (e.g. sysusers)
● no unpacking stage: use images instead of archives

Updates/package install: more robust in distri
● Arch does not support partial upgrades
  → distri packages depend on the specific transitive closure, so can always be installed
● Arch upgrades frequently require manual intervention
  → distri packages use separate hierarchies: file conflicts impossible :)
  → distri packages are hermetic: not easily broken by other packages on the system

Debugging experience
● Installing gdb should be all that a user needs to do:
  Debug symbols and sources of any package should be fetched on demand!
● Arch does not (yet) provide debug infos for all packages
  Arch does not (yet) transparently make available symbols
  → will be solved with debuginfod
● (distri solves this on the package manager level)

Packaging experience
● quicker feedback → more engaging → more contributions
● isolating package builds from the host system should be the default
  Arch asks package maintainers to do manual chroot management

Changes over time
● declarative packaging is key to make changes happen
  the Arch package format is a custom format, not defined anywhere
  → want auto-formatting
  → want machines to be able to make edits (→ monorepo?)
  → express intents/end states, not mechanisms

How does distri work?

package manager speed: install “ack” (Perl)*

distribution  package manager  data    wall-clock time  rate
Fedora        dnf              114 MB  33s              3.4 MB/s
Debian        apt              16 MB   10s              1.6 MB/s
NixOS         Nix              15 MB   5s               3.0 MB/s
Arch Linux    pacman           6.5 MB  3s               2.1 MB/s
Alpine        apk              10 MB   1s               10.0 MB/s

rate = data ÷ wall-clock time
* standard installation, includes metadata & package download and dependencies
→ https://michael.stapelberg.ch/posts/2019-08-17-linux-package-managers-are-slow/

Why are package managers slow?
● 2 most widely used package formats:
  ○ deb (Debian package), tar(1) in ar(1)
  ○ rpm (Red Hat Package Manager), metadata around cpio(1)
  ○ (Arch: tar(1) with metadata)
● task: make package contents available
  → e.g. pacman -S nginx results in /usr/bin/nginx
● traditionally: resolve deps, download, extract, configure
  → need to carefully fsync(2) to make I/O as safe as possible

How can we go faster?
append-only package store of immutable images
1. use an image format (e.g. SquashFS) instead of an archive format
2. mount each image under its own path (“separate hierarchies”):
   e.g. /ro/nginx-amd64-1.14.1/…
   e.g. /ro/zsh-amd64-5.6.2/…
3. (rest of the system as usual, e.g.
/etc, /var/cache, …)

advantages
● mount instead of extract
  → faster package installation
  → faster build environment composition
● append-only: can use unsafe I/O
● immutable: no longer possible to screw up your installation

hermetic packages
● when run, use the same version of dependencies as when built
● a wrapper script sets e.g. LD_LIBRARY_PATH, PYTHONPATH, PERL5LIB, …

separate hierarchies: exchange dirs
● packages exchange data via directories with well-known paths, e.g.:
  man(1) ⟷ nginx(1) via /usr/share/man
  gcc(1) ⟷ libusb(3) via /usr/include
● prudent approach: emulate well-known paths
  e.g.: /usr/include/jpeglib.h is a symlink to
  /ro/libjpeg-turbo-amd64-2.0.0/out/include/jpeglib.h

separate hierarchies: exchange dirs (per package)
● loose coupling (global) vs. tight coupling (per package)
  → typically suitable for plugin mechanisms where ABI must match
● e.g. /ro/xorg-server-amd64-1.20.3/out/lib/xorg/modules/

separate hierarchies: advantages
● move conflicts from package installation to program execution
  → only need to resolve /bin/python (2.7 or 3?) when assembling /bin
● packages always co-installable
  e.g. zsh-amd64-5.6.2 and zsh-amd64-5.6.3
  → partial updates/rollbacks easily possible
● package manager can be version-agnostic!
  → entirely eliminates a large source of slowness
  → no need for global metadata, package-specific metadata sufficient

immutability
● package contents and exchange dirs are read-only
● rarely, programs expect the system to be writable
  e.g. GNOME’s gsettings wants a cache in the exchange directory
● such designs need to be improved upstream:
  1. good caches are not required (fallback to slow path)
  2. good caches are transparently created
  3. good caches are automatically updated when needed

no hooks/triggers (1)
● hook (or maintscript, postinst, …): program run after package installation
  trigger: program run after other package installation (e.g.
man-db)
  → work at package-installation time which may be unnecessary
● preclude concurrent package installation (not implemented in a concurrency-safe way)
● arbitrary code, can be slow

no hooks/triggers (2)
● claim: we can build a functioning system without hooks/triggers
● 1. packages declare what they need (e.g. sysusers)
● 2. move work from package installation to program execution
  e.g. ssh needs a hostkey: create it in sshd(8) wrapper script
● very few exceptions: bootloader or firmware (need to install them outside of the file system)

practicality
● FUSE file system for providing /ro
  → easier to implement than managing separate mounts, overlays, unions, …
  → faster (!), as kernel mounts are slow
● packages need to be built with --prefix=/ro/nginx-amd64-1.14.1 etc.
● a small number of packages need to be patched
  → path-related issues (e.g. service files, gcc, gobject, automake, …)
  → deep system integration (e.g. dracut)

practicality (2)
● removal of hooks is not for everyone
  → configuration layers (debconf, YAST, …) might be a feature to some

Why is distri faster?
● traditionally: resolve deps, download, extract, configure
  + careful fsync(2) to make I/O as safe as possible
● distri: resolve deps, download image; the extract and configure steps are dropped (unsafe I/O okay)
  → scales to 12+ GB/s (!) on 100 Gbit links using Go’s net/http

conclusion
1. append-only package stores are more elegant than mutable systems
   → simpler design, faster implementation
2. exchange directories make things seem normal to third-party software
   → can compile unpackaged software, can run closed-source binaries
3. all of these ideas are practical
   → live CDs (read-only) and cross-compilation paved the way

project goals
● Not trying to build a community or user base!
● Instead, distri enables (my) Linux distribution research, with regular proof of concept releases
● Now that you know the pain points and how fast it could be, maybe you can improve things? :)
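
appendix: exchange dir sketch
The exchange-directory idea from the slides (well-known paths such as /bin or /usr/include emulated as symlinks into per-package hierarchies under /ro) can be illustrated with a small Python sketch. This is not distri's actual implementation, which provides /ro through a FUSE file system. The function name link_exchange_dir, the out/bin layout (assumed by analogy with the out/include path on the slides), and the "lexicographically last package wins" conflict rule are all assumptions made for illustration. The demo builds a throwaway store in a temporary directory, so it never touches the real /ro or /bin.

```python
import os
import tempfile


def link_exchange_dir(store_root, exchange_dir, subdir):
    """Symlink every entry of <store_root>/<pkg>/out/<subdir> into exchange_dir.

    Returns a dict mapping each link name to the package that provides it.
    Conflict rule (an assumption of this sketch): packages are visited in
    lexicographic order and a later package replaces an earlier link. A real
    tool would compare versions properly.
    """
    os.makedirs(exchange_dir, exist_ok=True)
    owners = {}
    for pkg in sorted(os.listdir(store_root)):
        src_dir = os.path.join(store_root, pkg, "out", subdir)
        if not os.path.isdir(src_dir):
            continue  # this package contributes nothing to this exchange dir
        for name in sorted(os.listdir(src_dir)):
            link = os.path.join(exchange_dir, name)
            if os.path.lexists(link):
                os.remove(link)  # conflict is resolved here, not at install time
            os.symlink(os.path.join(src_dir, name), link)
            owners[name] = pkg
    return owners


def demo():
    """Create a temporary store with three packages and assemble a bin dir."""
    root = tempfile.mkdtemp()
    store = os.path.join(root, "ro")
    packages = [
        ("nginx-amd64-1.14.1", "nginx"),
        ("zsh-amd64-5.6.2", "zsh"),
        ("zsh-amd64-5.6.3", "zsh"),  # co-installed with 5.6.2
    ]
    for pkg, binary in packages:
        bindir = os.path.join(store, pkg, "out", "bin")
        os.makedirs(bindir)
        with open(os.path.join(bindir, binary), "w") as f:
            f.write("placeholder\n")  # stands in for the real executable
    exchange = os.path.join(root, "bin")
    owners = link_exchange_dir(store, exchange, "bin")
    return store, exchange, owners


if __name__ == "__main__":
    store, exchange, owners = demo()
    for name, pkg in sorted(owners.items()):
        print(name, "->", os.readlink(os.path.join(exchange, name)))
```

Both zsh versions stay in the store untouched; the only place where one has to be chosen is the moment the exchange directory is assembled, which is the "move conflicts from package installation to program execution" point made on the slides.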