Making Sense of the Audio Stack on Unix
Patrick Louis
2021-02-07
Published online on venam.nixers.net
© Patrick Louis 2021

This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of the rightful author.

First published in eBook format in 2021.

The author has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

Contents

Introduction
Hardware layer
Analog to Digital & Digital to Analog (ADC & DAC)
Libraries
Audio Driver
Advanced Linux Sound Architecture (ALSA)
Open Sound System (OSS) and SADA
Sound Servers
sndio
aRts (analog Real time synthesizer) and ESD or ESounD (Enlightened Sound Daemon)
PulseAudio
    PulseAudio — What Is It?
    PulseAudio — Overall Design
    PulseAudio — Sink, Sink Input, Source, and Source Input
    PulseAudio — Internal Concepts: Cards, Card Profile, Device Port, Device
    PulseAudio — Everything Is A Module Thinking
    PulseAudio — Startup Process And Configuration
    PulseAudio — Interesting Modules And Features
    PulseAudio — Tools
    PulseAudio — Suspending
JACK
PipeWire
Conclusion
Bibliography

Introduction

Come see my magical gramophone

Audio on Unix is a little zoo: there are so many acronyms for projects and APIs that it's easy to get lost. Let's tackle that issue! Most articles are confusing because they either use audio technical jargon, or because they barely scratch the surface and leave people clueless. A little knowledge can be dangerous.

In this article I'll try to bridge the gap by not requiring any prerequisite knowledge while also giving a good overview of the whole Unix audio landscape. There will be enough detail to remove the mysticism (oh so pernicious in web bubbles) and to see how the pieces fit together.

By the end of this article you should understand the following terms:

• ALSA
• OSS
• ESD
• aRts
• sndio
• PulseAudio
• PipeWire
• GStreamer
• LADSPA

We'll try to make sense of what each of them is for and how they relate to one another. The article will focus a bit more on the Linux stack, as it has more components than the others and is more advanced in that respect. We'll also skip non-open-source Unix-like systems. As usual, if you want to go in depth, there's a list of references at the bottom.

Overall, we've got:

• Hardware layer: the physical devices, input and output
• Kernel layer: interfacing with the different hardware and managing their specificities (ALSA, OSS)
• Libraries: used by software to interface with the hardware directly, to manipulate audio/video, to interface with an intermediate layer for creating streams (GStreamer, libcanberra, libpulse, libalsa, etc.), and to have a standard format (LADSPA)
• Sound servers: used to make the user-facing (user-level) interaction easier, more abstract, and higher level. These often act as glue, resolving the issue that different pieces of software speak different protocols (PulseAudio, ESD, aRts, PipeWire, sndio)

Let me preface this by saying that I am not a developer of any of these technologies, nor am I a sound engineer.
I am simply gathering my general understanding of these technologies so that anyone can get an overview of the pieces involved, and maybe a bit more.

Hardware layer

It's essential to have a look at the hardware at our disposal to understand the audio stack, because everything above it is a direct representation of it. There are many types of audio interfaces, be it input or output, with different varieties of sound cards, internal organizations, and capabilities. Because of this diversity of chipsets, it's simpler to group them into families when interacting with them. Let's list the most common logical components that these cards can have:

• An interface to communicate with the card connected to the bus, be it interrupts, IO ports, DMA (direct memory access), etc.
• Output devices (DAC: digital-to-analog converter)
• Input devices (ADC: analog-to-digital converter)
• An output amplifier, to raise the power of output devices
• An input amplifier, same as above but for input devices (e.g. microphones)
• A control mechanism to allow different settings
• A hardware mixer, which controls each device's volume and routing; volume is usually measured in decibels
• A MIDI (Musical Instrument Digital Interface) device/controller, a standard unified protocol to control output devices (called synthesizers) — think of them like keyboards for sounds
• A sequencer, a built-in MIDI synthesizer (the output of the above)
• A timer used to clock audio
• Any other special features such as a 3D spatializer

It is important to have a glance at these components because everything in the software layers attempts to make them easier to approach.

Analog to Digital & Digital to Analog (ADC & DAC)

A couple of concepts related to the interaction between the real and the digital world are also needed to kick-start our journey.

In the real world, the analog world, sound is made up of waves: air pressure variations that can be arbitrarily large. Speakers generating sound have a maximum volume/amplitude, usually represented by 0 dB (decibels). Volume lower than the maximum is represented by negative decibels: -10 dB, -20 dB, etc. No sound is thus -∞ dB.

This might be surprising, and it isn't really true either: a decibel doesn't mean much until it's tied to a specific absolute reference point; it's a relative scale. You pick a value for 0 dB that makes sense for what you are trying to measure.

[figure: a VU meter]

The measurement above is dBFS, the dB relative to digital full scale, aka digital 0. There are other measurements such as dB SPL and dBV.

One thing to note about decibels is that they follow a strictly exponential law, which matches human perception. What sounds like a constantly increasing volume is indicated by a constantly rising dB meter, corresponding to an exponentially rising output power. This is why you can hear both vanishingly soft sounds and punishingly loud sounds. The step from the loudest you can hear up to destroying your ears or killing you is only a few more dB.

While decibels are about loudness, tone is represented as a sine wave of a certain frequency, i.e. how fast the wave oscillates. For example, the note A is a 440 Hz sine wave.

Alright, we've got the idea of decibels and tone, but how do we get from waves to our computer, or the reverse? This is what we call going from analog to digital, or digital to analog. To do this we have to convert waves into discrete points in time, taking a number of samples per second — what we call the sample rate.
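To make this concrete, here is a minimal sketch (not from the article) that ties the ideas above together: it converts an illustrative dBFS level into a linear amplitude factor and samples a 440 Hz sine wave at 44100 samples per second. All names and values are my own, chosen purely for illustration; on a Unix system it should build with something like cc tone.c -lm.

    /* Illustration only: dBFS level -> linear amplitude, then sample a
     * 440 Hz sine wave at 44100 samples per second. Values are examples. */
    #include <math.h>
    #include <stdio.h>

    #define TWO_PI      6.283185307179586
    #define SAMPLE_RATE 44100.0  /* samples per second */
    #define FREQUENCY   440.0    /* the note A mentioned above */
    #define LEVEL_DBFS  -20.0    /* 0 dBFS is full scale; negative is quieter */

    int main(void)
    {
        /* Decibels are relative: amplitude = 10^(dBFS / 20) of full scale. */
        double amplitude = pow(10.0, LEVEL_DBFS / 20.0);

        /* One second of sound: one discrete point every 1/44100 of a second. */
        for (int n = 0; n < (int)SAMPLE_RATE; n++) {
            double t = n / SAMPLE_RATE;
            double sample = amplitude * sin(TWO_PI * FREQUENCY * t);
            if (n < 3) /* peek at the first few samples */
                printf("sample[%d] = %f\n", n, sample);
        }
        return 0;
    }

Each of those floating-point samples would then be quantized to a fixed number of bits before being handed to the hardware, which is what the next paragraphs cover.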
The higher the sample rate, the more accurate the representation of the analog sound (picture a lollipop graph of discrete points). Each sample also has a certain accuracy: how much information we store in it, i.e. the number of bits per sample — what we call the bit depth (the higher it is, the less quantization noise). For example, CDs use 16 bits. Which values you choose for your sample rate and bit depth will depend on a trade-off between quality and memory use.

NB: That's why it makes no sense to convert from a low digital sample rate to a higher one; you'll just be filling the gaps between the existing discrete points without adding any new information.

Additionally, you may need to represent how multiple channels play sound — multichannel. For example, mono, stereo, 3D, surround, etc.

It's important to note that if we want to play sounds from multiple sources at the same time, they will need to agree on the sample rate, bit depth, and sample format, otherwise it'll be impossible to mix them. That's something essential on a desktop.

The last part in this equation is how to implement the mechanism to send audio to the sound card. That highly depends on what the card itself supports, but the usual simple mechanism is to fill buffers with streams of sound, then let the hardware read the samples, passing them to the DAC (digital-to-analog converter) to then reach the speaker, and vice versa. Once the hardware has read enough samples it'll raise an interrupt to notify the software side that it needs more samples. This cyclic mechanism goes on and on, in a ring fashion.

If the buffer samples aren't filled fast enough we call this an underrun or drop-out (aka xrun), which results in a glitch: basically, audio stops for a short period before the buffer is filled again.

The audio is played at a certain sample rate and with a certain granularity, as we've said. Let's define a few terms:

• buffer-size: the number of samples that the cyclic buffer read by the hardware can contain
• fragment-size (or period-size): the number of samples after which an interrupt is generated
• number-fragments: the number of fragments that fit in the hardware buffer (buffer-size/fragment-size)
• sample-rate: the number of samples per second

Then the latency will be buffer-size/sample-rate. For example, if we can fit 100 samples in a buffer and the samples are played once every 1 ms, then that's a 100 ms latency; we'll have to wait 100 ms before the hardware finishes processing the buffer. From the software side, we'll be getting an interrupt every period-size/sample-rate. Thus, if our buffer-size is 1024 samples, our fragment-size is 512, and our sample-rate is 44100 samples/second, then we get an interrupt and need to refill part of the buffer every 512/44100 ≈ 11.6 ms, and our latency for this stage is up to 1024/44100 ≈ 23 ms.
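As a small sanity check, here is a sketch (again not from the article) that redoes the arithmetic above for the same illustrative values; the variable names mirror the terms defined in the list.

    /* Recompute the buffer arithmetic above for illustrative values. */
    #include <stdio.h>

    int main(void)
    {
        unsigned buffer_size   = 1024;   /* samples the hardware ring buffer holds */
        unsigned fragment_size = 512;    /* samples between two interrupts */
        unsigned sample_rate   = 44100;  /* samples per second */

        unsigned fragments = buffer_size / fragment_size;            /* 2 */
        double period_ms   = 1000.0 * fragment_size / sample_rate;   /* ~11.6 ms */
        double latency_ms  = 1000.0 * buffer_size / sample_rate;     /* ~23.2 ms */

        printf("number of fragments: %u\n", fragments);
        printf("interrupt every:     %.1f ms\n", period_ms);
        printf("buffer latency:      %.1f ms\n", latency_ms);
        return 0;
    }

Shrinking the fragment and buffer sizes lowers the latency, but it also leaves the software less time to refill the buffer before an underrun occurs, which is exactly the trade-off the layers above the hardware have to manage.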