Multimedia Systems, Giorgio Leonardi, A.A. 2014-2015

Lecture 17: Audio formats

Outline of this lecture
• Music representation formats:
– PCM
– Perceptual compression:
• Principles of psychoacoustics
• MPEG-1 Layer 3 (MP3)

Digital audio representation

• A sound wave can be represented as a time-varying signal, that is, as a signal whose pressure levels continuously change with time

Digital audio representation

• Digital audio refers to:
– Synthesized sounds, which are audio signals that originate entirely in the digital domain (e.g., by means of a digital synthesizer)
– Computerized (discrete) representations of real (natural) sounds (e.g., acquired by means of a microphone)

• Digitization is the process of transforming a real sound into a digital one

Digital audio representation

• Digital audio acquisition, recording and reproduction

Digital audio representation

• Digitization involves two steps

– Sampling
– Quantization

• Example: Audio Dynamic Range
– Audio quantized at 16 bit/sample
• Dynamic range ≈ 6⋅16 = 96 dB
– Audio quantized at 8 bit/sample
• Dynamic range ≈ 6⋅8 = 48 dB

• Be careful not to interpret this as meaning that more bits give you louder amplitudes. Rather, the dynamic range says that each additional bit adds about 6 dB of resolution, as the sketch below confirms
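A quick check of the 6 dB-per-bit rule in Python (the exact constant is 20·log10(2) ≈ 6.02 dB per bit):

```python
# Approximate dynamic range of a linear PCM quantizer.
# Rule of thumb from the slides: ~6 dB per bit (more precisely 6.02 dB).

def dynamic_range_db(bits_per_sample: int) -> float:
    """Return the approximate dynamic range in dB for a given bit depth."""
    return 6.02 * bits_per_sample  # 20 * log10(2) ≈ 6.02 dB per bit

for bits in (8, 16, 24):
    print(f"{bits:2d} bit/sample -> ~{dynamic_range_db(bits):.0f} dB")
# 8 bit -> ~48 dB, 16 bit -> ~96 dB, 24 bit -> ~144 dB
```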

Pulse Code Modulation (PCM)

• PCM is the standard way of coding digital, non-compressed (or losslessly compressed) audio.

• It is the standard form to represent digital audio in computers, digital telephone systems and in various digital storage media like CD, DVD, and Blu-ray

PCM
• We know all the building blocks to obtain this format:
– Sampling: the signal is sampled at discrete points in time
– Quantization: each sample is discretized in amplitude, using either a uniform or a non-uniform quantizer
– Encoding: each quantized sample is encoded with a binary codeword

• After these steps, the resulting bits are stored/transmitted as “pulses” representing 0s and 1s

PCM
• Example: Typical PCM Parameters for Single Channel Audio

PCM types
• Different variants of PCM
– Speech & Music
• Linear PCM
• DPCM
• Delta Modulation
• ADPCM

– Speech
• A-law PCM
• μ-law PCM

Linear PCM
• Linear PCM (LPCM) is PCM with uniform quantization
– LPCM represents sample amplitudes on a linear scale
– Common sampling rates: 44.1-192 kHz; common bit depths: 8, 16 or 24 bits
• Typically used in:
– WAV and AIFF audio file formats
– CD, DVD
– HDMI

Differential PCM
• Differential PCM (DPCM) is a PCM variant
– DPCM exploits the fact that most audio signals show significant correlation between successive sample amplitudes
– So, instead of encoding sample values, it only encodes the difference between two successive samples (remember coding DC values in JPEG?)

• The sequence 147, 150, 139, 142 becomes: 147, +3, -11, +3
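A minimal sketch of the encode/decode idea behind this example (a real DPCM codec would also quantize the differences):

```python
# Minimal DPCM sketch: encode samples as differences, decode by summation.

def dpcm_encode(samples):
    """First sample verbatim, then successive differences."""
    diffs = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        diffs.append(cur - prev)
    return diffs

def dpcm_decode(diffs):
    """Rebuild the original samples by running summation."""
    samples = [diffs[0]]
    for d in diffs[1:]:
        samples.append(samples[-1] + d)
    return samples

print(dpcm_encode([147, 150, 139, 142]))  # [147, 3, -11, 3]
print(dpcm_decode([147, 3, -11, 3]))      # [147, 150, 139, 142]
```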

• At the same sampling rate, DPCM generally requires about 25% fewer bits than LPCM

Delta Modulation
• Delta Modulation (DM) is the simplest form of DPCM, where only 1 bit is used to encode the difference between successive samples

• A single bit tells whether the next sample is “above” or “below” the previous one

Adaptive DPCM
• Adaptive DPCM (ADPCM) is a DPCM variant that uses non-uniform quantization and adaptively modifies the quantizer to suit the input signal

• Adaptation is obtained by changing the step size according to an adaptive algorithm (e.g. Lloyd-Max) to minimize the quantization error

Companding PCM
• Mostly used for voice quantization
– Voice levels are concentrated near zero
– Companding uses logarithmic compression/decompression to obtain more quantization intervals near zero

• The ITU-T Recommendation G.711 defines two PCM variants (the companding curve is sketched below):
– μ-law PCM companding, used in the digital communication systems of North America and Japan
– A-law PCM companding, used in the European digital communication systems and for international connections
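A minimal sketch of μ-law companding with μ = 255; note that the real G.711 codec uses a piecewise-linear 8-bit approximation of this curve rather than the exact formula:

```python
import math

# Minimal mu-law companding sketch (mu = 255, as in G.711).

MU = 255.0

def mu_law_compress(x: float) -> float:
    """Compress a sample in [-1, 1] with the mu-law characteristic."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# Small inputs get far more of the output range than large ones:
for x in (0.01, 0.1, 0.5, 1.0):
    print(f"x={x:4.2f} -> compressed {mu_law_compress(x):.3f}")
```

Running this shows that the quietest inputs occupy a disproportionately large share of the output range, which is exactly why more quantization intervals end up near zero.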

Audio compression

• Challenges of audio compression
– Reduced size of audio data
– Good quality with respect to uncompressed audio
– Low processing time
• In addition, when audio is to be streamed:
– Random access
– Platform independence

Audio compression
• Different audio compression techniques exist for speech and music
– E.g., in speech, the reduction in quality resulting from effectively reducing the resolution (bit depth) isn't objectionable

• Usually, you are interested in understanding what the speech sounds “say”
– Thus a noisy conversation can still be tolerated

• Conversely, in music, you are usually interested in hearing good quality sound Music compression

• Music compression refers to compression schemes particularly suited to audio signals more complex than human conversation
– Songs, nature sounds, instrument sounds, ...

• If a music audio signal is digitized in a straightforward way (e.g. PCM), data corresponding to sounds that are inaudible may be included in the digitized version – The signal records all the physical variations in air pressure that cause sound, but the perception of sound is a sensation produced in the brain

• Hearing is not a purely mechanical phenomenon of wave propagation, but is also a sensory and perceptual process

Perceptual coding
• Perceptual coding is based upon an analysis of how the ear and brain perceive sound, called psychoacoustical modeling, or simply psychoacoustics

• Perceptual coding exploits audio elements that the human ear cannot hear very well: sounds occurring together may cause some of them not to be heard, despite being physically present:
– A sound may be too quiet to be heard, or
– A sound may be obscured by some other sound

Perceptual coding

• The absolute threshold of hearing (ATH) characterizes the amount of energy needed in a pure (sinusoidal) tone such that it can be detected by a human listener in a noiseless environment (an analytical approximation is sketched below)
– The absolute threshold is typically expressed in terms of dB Sound Pressure Level (dB SPL)
– Practically, it is the minimum level (proportional to the volume) at which a sound can be heard
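For reference, a common analytical approximation of the ATH curve (Terhardt's formula) can be evaluated in a few lines of Python; this is a standard textbook approximation, not one prescribed by these slides:

```python
import math

# Terhardt's approximation of the absolute threshold of hearing,
# in dB SPL as a function of frequency in Hz.

def ath_db_spl(f_hz: float) -> float:
    f = f_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

for f in (100, 1000, 3300, 10000):
    print(f"{f:5d} Hz -> {ath_db_spl(f):6.1f} dB SPL")
# The ear is most sensitive around 3-4 kHz, where the curve dips lowest.
```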

Perceptual coding

• The reason for this is the way the ear works

• In fact, human hearing gives rise to several auditory phenomena, like
– Auditory masking
– Temporal masking

Auditory Masking

• Auditory masking (or frequency masking or simultaneous masking) is the phenomenon whereby loud tones mask softer tones at nearby frequencies
– Masking may occur either when these tones occur at the same time or when loud tones occur a little later or slightly earlier than softer tones

https://www.youtube.com/watch?v=k6DVywW5NR4

Auditory masking

• Masking can be conveniently described as a modification of the threshold of hearing curve in the region of a loud tone

• That is, the ATH curve increases in the presence of a dominant frequency component

• Thus, the effect of masking is strongest near the dominant frequency component and decreases as one moves away from it

Masking threshold

• The portion of ATH curve that is changed is called the masking threshold curve – All frequencies that appear at amplitudes beneath the masking threshold will be inaudible (even if they are above the original ATH and thus potentially audible)

• The width of the masking threshold curve is called the critical bandwidth

Masking threshold

Example: Auditory Masking
• In this example, the 1 kHz sound masks the sound at 1.1 kHz, but not the one at 3.1 kHz!

Auditory masking
• The effect of auditory masking varies with the critical band (sub-band)
– It occurs within the same critical band
– It also spreads to neighboring (sub-)bands
– Higher bands have larger masking effects

Temporal masking

• Temporal masking happens when a sudden loud tone causes the hearing receptors in the ear to become saturated, thus making inaudible other tones which immediately precede or follow the loud tone

• Thus, if we hear a loud sound and then it stops, it takes a little while until we can hear a soft tone nearby

Temporal masking
• Pre-masking: this sound masks other softer sounds nearby in frequency even a little while before it starts, but:

• Post-masking: this sound keeps masking other sounds for a little while even after it stops

• The temporal masking effect rises (pre-masking) and decays (post-masking) exponentially

Perceptual coding
• The heart of perceptual coding (and compression) is:

Removing masked sound leads to compression without altering the overall quality of sound

• Therefore, perceptual coders analyze the input PCM stream (dividing it into fixed-length frames) to detect the masking thresholds for each frame and re-code (and re-quantize) only the sound whose level is over the calculated threshold

Perceptual compression (MPEG)

• Intuitively, the masking phenomenon can be applied to compression in the following way:

• A small window of time (frame) is moved across a sound file. Samples in that frame are compressed as one unit (a sketch of this loop follows the list):
1. The samples in the current frame are loaded to be processed
2. Fourier analysis divides each frame into (usually 32) sub-bands of frequencies
3. Using the Fourier analysis, a masking curve for each band is calculated
4. A DCT is calculated on the samples loaded in Step 1. The DCT reveals the information that can be discarded or retained
5. The information to be retained is re-quantized by an adaptive quantizer, determining the lowest possible bit depth such that the resulting quantization noise remains under the masking curve
6. That is, where a masking sound is present, the signal can be quantized relatively coarsely, using fewer bits than would otherwise be needed, because the resulting quantization noise can be hidden under the masking curve
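Below is a runnable toy version of this loop, assuming numpy. The masking curve is a deliberately crude stand-in (a fixed 20 dB offset below each band's peak), and the DCT step is folded into the FFT analysis for brevity:

```python
import numpy as np

# Toy perceptual coding loop: frequency analysis, per-band masking curve,
# then the smallest bit depth whose noise stays under the mask.

FRAME, BANDS = 1024, 32
rng = np.random.default_rng(0)
frame = rng.standard_normal(FRAME)            # step 1: load one frame

spectrum = np.abs(np.fft.rfft(frame))[:512]   # step 2: frequency analysis
bands = spectrum.reshape(BANDS, -1)           # grouped into 32 sub-bands

peak_db = 20 * np.log10(bands.max(axis=1))    # per-band peak level
mask_db = peak_db - 20                        # step 3: toy masking curve

# steps 5-6: smallest bit depth whose quantization noise (~6 dB lower
# per extra bit) stays under the masking curve; here SMR is 20 dB,
# so 4 bits per band suffice.
smr_db = peak_db - mask_db
bits = np.ceil(smr_db / 6.0).astype(int)
print(bits)                                   # 4 bits for every band
```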

MPEG 1
• The MPEG 1 encoding technique allows you to select several options. First of all, the number of channels and their functionality:
– single channel (mono);
– two independent channels (e.g. 2 languages);
– stereo;
– joint-stereo: stereo channels are combined/down-mixed into one single (mono) channel.

• The sampling frequency can be set to 32 kHz, 44.1 kHz or 48 kHz, while the bit rate varies from 16 to 320 kbit/s

MPEG 1
• MPEG is divided into layers. Each layer defines a different coder, with increasing features and complexity:

• Layer I: the simplest of the three, designed to perform best at bitrates of about 192 kbit/s per channel. It provides compression factors of approximately 4:1.

• Layer II: more complex than the first, it is suitable for bitrates around 128 kbit/s per channel. Its compression factors range from 6:1 to 8:1.

• Layer III: the most complex of the three, it offers excellent performance at bitrates of about 64 kbit/s per channel. It is able to reduce the size up to 12 times.

• The quality obtained at 192 kbps per channel with Layer I only needs 128 kbps with Layer II, and 64 kbps with Layer III

MP3

• Example: MP3 (MPEG 1 Layer 3)

Compression Rate

• CD-quality audio is achieved with compression factors in the range of 11:1 to 7:1 (i.e., bitrates of 128 to 192 kbps); the arithmetic is checked in the sketch below
– Uncompressed CD-quality stereo audio would require 2 × 16 bits × 44100 samples/s ≈ 1.4 Mbit/s
– Compressed CD-quality stereo audio at 128 kbps or 192 kbps with MP3 yields a compression factor of 11:1 or 7:1, respectively
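Checking the numbers in Python:

```python
# Uncompressed CD-quality stereo vs. typical MP3 bitrates.

channels, bits, rate = 2, 16, 44_100
uncompressed_bps = channels * bits * rate      # 1,411,200 bit/s ≈ 1.4 Mbit/s

for mp3_kbps in (128, 192):
    factor = uncompressed_bps / (mp3_kbps * 1000)
    print(f"{mp3_kbps} kbit/s -> compression factor ≈ {factor:.0f}:1")
# 128 kbit/s -> ≈ 11:1, 192 kbit/s -> ≈ 7:1
```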

MP3 – Sample analysis
• From the PCM input stream, the samples in the current frame are loaded. Since MP3 analyzes the samples in 32 sub-bands, and 32 samples per sub-band are used, each frame is composed of 32 × 32 = 1024 samples
• A filter bank (i.e., a set of critical-band filters) performs spectral analysis and divides each frame into 32 bands of frequencies (frequency subbands); a simplified sub-band split is sketched below
– The width of each subband is f_s/64, where f_s/2 is the Nyquist frequency and f_s is the sampling rate
– Samples inside each subband are called subband samples

MP3 – Sample analysis
• Meanwhile, a DFT is applied in order to represent the input signal in the frequency domain
– This analysis will be used by subsequent steps to build the psychoacoustic model which allows the encoder to cut out the inaudible sound
– The DFT is computed by means of a 1024-point FFT
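As a simplified illustration (the real MP3 analysis stage uses a polyphase filter bank, not a plain FFT), here is how a 1024-sample frame could be grouped into 32 equal-width bands:

```python
import numpy as np

# Splitting one frame into 32 equal-width sub-bands via an FFT.
# Placeholder signal; only the grouping idea matters here.

fs = 44_100                       # sampling rate (Hz)
frame = np.random.randn(1024)     # one frame of PCM samples

spectrum = np.fft.rfft(frame)     # 513 bins covering 0 .. fs/2
band_width_hz = fs / 64           # each of the 32 bands spans fs/64 ≈ 689 Hz
bins_per_band = len(spectrum) // 32

subbands = [spectrum[i * bins_per_band:(i + 1) * bins_per_band]
            for i in range(32)]
print(f"band width ≈ {band_width_hz:.0f} Hz, {bins_per_band} FFT bins per band")
```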

MP3 – Cutting inaudible data
• The output is a set of signal-to-mask ratios (SMRs), that is, the ratios between the peak sound pressure levels and the masking thresholds
• Each of these ratios determines how many bits are needed to represent the samples within a band: the lower the SMR, the fewer the bits (see the allocation sketch below)
– The idea is to assign more bits to the bands where hearing is most sensitive
– Fewer bits create more quantization noise, but it doesn't matter as long as the quantization noise stays below the masking threshold
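A minimal sketch of the idea, assuming a greedy allocator and the ~6 dB-per-bit rule from earlier (the standard's actual allocation procedure is more elaborate):

```python
# SMR-driven bit allocation: give the next bit to the band whose
# quantization noise currently exceeds its mask by the largest margin.

def allocate_bits(smr_db, total_bits, db_per_bit=6.0):
    """smr_db: list of signal-to-mask ratios (dB) per band."""
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        # Noise-to-mask ratio = SMR - bits*6 dB; shrink the largest one.
        nmr = [s - b * db_per_bit for s, b in zip(smr_db, bits)]
        worst = max(range(len(nmr)), key=nmr.__getitem__)
        if nmr[worst] <= 0:       # all quantization noise already masked
            break
        bits[worst] += 1
    return bits

print(allocate_bits([24.0, 6.0, -3.0, 12.0], total_bits=10))
# Bands with higher SMR get more bits; the masked band (SMR < 0) gets none.
```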

MP3 – Cutting inaudible data
• This psychoacoustic model is applied to the input frequency spectrum to find the frequency components whose amplitude is subject to masking (below the masking threshold defined by the psychoacoustic model itself)

MP3 – Cutting inaudible data
• The spectrum of each frequency band is analyzed by means of the modified discrete cosine transform (MDCT)
• The MDCT is used to improve frequency resolution, particularly at low-frequency bands, thus modeling the human ear's critical bands more closely
– MDCT coefficients are grouped in a way similar to critical bands, in order to use this spectrum with the masking threshold

Cutting inaudible data - example
• Consider for simplicity only 16 of the 32 sub-bands:

[Figure: level (dB) of each of the 16 sub-bands]

• Our (example) psychoacoustic model tells us that the eighth band, if it has an intensity of 60 dB, generates a mask of 12 dB in the seventh band, and of 15 dB in the ninth. The seventh band has a level of 10 dB (< 12 dB), and is therefore masked and cut away from the output. The ninth is at 35 dB (> 15 dB) and thus passes to the output. The sketch below reproduces this decision.
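The same decision rule, written out in Python with the numbers from the example:

```python
# A band is cut when its level falls below the masking threshold
# projected onto it by the loud neighboring band.

band_level_db = {7: 10, 9: 35}        # measured levels in bands 7 and 9
mask_from_band8_db = {7: 12, 9: 15}   # masking thresholds produced by band 8

for band in (7, 9):
    audible = band_level_db[band] > mask_from_band8_db[band]
    print(f"band {band}: level {band_level_db[band]} dB, "
          f"mask {mask_from_band8_db[band]} dB -> "
          f"{'kept' if audible else 'cut'}")
# band 7: 10 dB < 12 dB -> cut;  band 9: 35 dB > 15 dB -> kept
```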

• For every sample in each frequency band, quantization is applied with the assigned bit depth (a sketch of the nonuniform idea follows below)
– Nonuniform quantization is used to decrease the quantization noise for low-amplitude samples
– The quantization intervals are instead larger for high-amplitude samples
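MP3's nonuniform quantizer is based on a 3/4-power law, so quantization intervals grow with amplitude. The step size and rounding below are simplified, illustrative choices, not the standard's exact procedure:

```python
# Power-law (nonuniform) quantizer sketch: amplitudes are raised to the
# 3/4 power before rounding, and to the 4/3 power when reconstructed.

def quantize(x: float, step: float) -> int:
    s = 1 if x >= 0 else -1
    return s * round((abs(x) / step) ** 0.75)

def dequantize(q: int, step: float) -> float:
    s = 1 if q >= 0 else -1
    return s * (abs(q) ** (4.0 / 3.0)) * step

step = 0.01
for x in (0.1, 1.0, 10.0, 100.0):
    q = quantize(x, step)
    print(f"x={x:6.1f} -> q={q:5d} -> back to {dequantize(q, step):8.2f}")
# Relative error stays roughly constant: small amplitudes get fine
# intervals, large amplitudes get coarse ones.
```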

• The output is a set of quantized amplitude samples in the frequency domain

MP3 – Output
• Huffman encoding is applied to the quantized amplitude samples (in the frequency domain)
– This is done to lower the final data rate

• Side-information contains a range of information needed for the correct decoding of the audio data: a pointer to the beginning of the main data, the Huffman tables used and the relative sizes of the regions, the size of the scale factors, the size of main_data, etc.

Format of MP3 file
• The bitstream formatting module encodes the frame as shown here:

• The MP3 file is composed of a header containing information such as song name, artist, album, etc., and then the sequence of the encoded frames.

• The decoder reads each frame, decompresses it using the Huffman codes, dequantizes the samples and transforms the data back into the time domain. These are straightforward operations, which can be performed by cheap hardware.