Multimedia Systems, Giorgio Leonardi, A.A. 2014-2015
Lecture 17: Audio formats

Outline of this lecture
• Music representation formats:
  – PCM
  – Perceptual compression:
    • Principles of psychoacoustics
    • MPEG-1 Layer 3 (MP3)

Digital audio representation
• A sound wave can be represented as a time-varying signal, i.e. a signal whose pressure level changes continuously with time

Digital audio representation
• Digital audio refers to:
  – Synthesized sounds: audio signals that originate entirely in the digital domain (e.g., produced by a digital synthesizer)
  – Computerized (discrete) representations of real (natural) sounds (e.g., captured with a microphone)
• Digitization is the process of transforming a real sound into a digital one

Digital audio representation
• Digital audio acquisition, recording and reproduction

Digital audio representation
• Digitization involves two steps:
  – Sampling
  – Quantization

Quantization
• Example: audio dynamic range
  – Audio quantized at 16 bit/sample: dynamic range ≈ 6·16 = 96 dB
  – Audio quantized at 8 bit/sample: dynamic range ≈ 6·8 = 48 dB
• Be careful not to read this as "more bits means louder amplitudes". Dynamic range simply grows by about 6 dB for each additional bit, because every extra bit halves the quantization step (20·log10(2) ≈ 6.02 dB of extra resolution per bit)

Pulse Code Modulation (PCM)
• PCM is the standard way of coding digital, uncompressed (or losslessly compressed) audio
• It is the standard form for representing digital audio in computers, digital telephone systems and digital storage media such as CD, DVD and Blu-ray

PCM
• We already know all the building blocks needed to obtain this format:
  – Sampling: the signal is sampled at discrete points in time
  – Quantization: each sample is discretized in amplitude, using a uniform, non-uniform or companding quantizer
  – Encoding: each quantized sample is encoded as a binary codeword
• After these steps, the resulting bits are stored/transmitted as "pulses" representing 0s and 1s

PCM
• Example: typical PCM parameters for single-channel audio

PCM types
• Different variants of PCM:
  – Speech & music:
    • Linear PCM
    • DPCM
    • Delta Modulation
    • ADPCM
  – Speech:
    • A-law PCM
    • μ-law PCM

Linear PCM
• Linear PCM (LPCM) is PCM with uniform quantization
  – LPCM represents sample amplitudes on a linear scale
  – Common parameters: sampling rates of 44.1-192 kHz, at 8, 16 or 24 bits per sample
• Typically used in:
  – WAV and AIFF audio file formats
  – CD, DVD
  – HDMI

Differential PCM
• Differential PCM (DPCM) is a PCM variant
  – DPCM exploits the fact that most audio signals show significant correlation between successive sample amplitudes
  – So, instead of encoding the sample values themselves, it encodes only the difference between two successive samples (remember the coding of DC values in JPEG?)
• The sequence 147, 150, 139, 142 becomes: 147, +3, -11, +3
• At the same sampling rate, DPCM generally requires fewer bits (about 25% fewer) than LPCM

Delta Modulation
• Delta Modulation (DM) is the simplest form of DPCM: only 1 bit is used to encode the difference between successive samples
• The single bit tells whether the next sample is "above" or "below" the previous one
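As an illustration of the difference-coding idea, here is a minimal Python sketch (not the bitstream format of any real codec): it reproduces the 147, +3, -11, +3 example from the DPCM slide and adds a 1-bit delta-modulation variant with an arbitrarily chosen fixed step size.

```python
def dpcm_encode(samples):
    """Encode a sequence as its first sample followed by successive differences."""
    diffs = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        diffs.append(cur - prev)
    return diffs

def dpcm_decode(diffs):
    """Rebuild the original samples by accumulating the differences."""
    samples = [diffs[0]]
    for d in diffs[1:]:
        samples.append(samples[-1] + d)
    return samples

def delta_modulation_encode(samples, step=4):
    """1-bit DPCM: each bit says whether the reconstructed signal moves up or down.
    The fixed step size (4) is an arbitrary choice for illustration only."""
    bits, approx = [], samples[0]
    for cur in samples[1:]:
        bit = 1 if cur >= approx else 0
        approx += step if bit else -step   # the decoder tracks the same approximation
        bits.append(bit)
    return bits

print(dpcm_encode([147, 150, 139, 142]))              # [147, 3, -11, 3], as in the slide
print(dpcm_decode([147, 3, -11, 3]))                  # [147, 150, 139, 142]
print(delta_modulation_encode([147, 150, 139, 142]))  # [1, 0, 0] with the assumed step
```

The last output shows the price of delta modulation: with only one bit per sample the decoder can only approximate the waveform, which is why adaptive step sizes (ADPCM, next slide) are used in practice.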
Adaptive DPCM
• Adaptive DPCM (ADPCM) is a DPCM variant that uses non-uniform quantization and adaptively modifies the quantizer to suit the input signal
• Adaptation is obtained by changing the step size according to an adaptive algorithm (e.g., Lloyd-Max) so as to minimize the quantization error

Companding PCM
• Mostly used for voice quantization
  – Voice levels are concentrated near zero
  – Companding uses logarithmic compression/decompression to obtain more quantization intervals near zero
• The ITU-T Recommendation G.711 defines two PCM variants:
  – μ-law PCM companding, used in the digital communication systems of North America and Japan
  – A-law PCM companding, used in the European digital communication systems and for international connections

Audio compression
• Challenges of audio compression:
  – Reduced size of the audio data
  – Good sound quality with respect to the uncompressed audio
  – Low processing time
• In addition, when the audio is to be streamed:
  – Random access
  – Platform independence

Audio compression
• Different audio compression techniques are used for speech and music
  – E.g., in speech the quality reduction caused by effectively lowering the resolution (bit depth) is not very objectionable
    • Usually you are interested in understanding what the speech "says"
    • Thus a noisy conversation can still be tolerated
  – Conversely, in music you are usually interested in hearing good-quality sound

Music compression
• Music compression refers to compression schemes particularly suited to audio signals more complex than human conversation
  – Songs, nature sounds, instrument sounds, ...
• If a music signal is digitized in a straightforward way (e.g. PCM), the digitized version may include data corresponding to sounds that are inaudible
  – The signal records all the physical variations in air pressure that cause sound, but the perception of sound is a sensation produced in the brain
• Hearing is not a purely mechanical phenomenon of wave propagation; it is also a sensory and perceptual process

Perceptual coding
• Perceptual coding is based upon an analysis of how the ear and brain perceive sound, called psychoacoustical modeling, or simply psychoacoustics
• Perceptual coding exploits audio elements that the human ear cannot hear very well: sounds occurring together may cause some of them not to be heard, despite being physically present:
  – A sound may be too quiet to be heard, or
  – A sound may be obscured by some other sound

Perceptual coding
• The absolute threshold of hearing (ATH) characterizes the amount of energy needed in a pure (sinusoidal) tone for it to be detected by a human listener in a noiseless environment
  – The absolute threshold is typically expressed in dB Sound Pressure Level (dB SPL)
  – Practically, it is the minimum level (proportional to the volume) at which a sound can be heard

Perceptual coding
• The reason for this behaviour lies in the way the ear works
• Human hearing in fact gives rise to several auditory phenomena, such as:
  – Auditory masking
  – Temporal masking

Auditory masking
• Auditory masking (also called frequency masking or simultaneous masking) is the phenomenon by which loud tones mask softer tones at nearby frequencies
  – Masking may occur when these tones occur at the same time, or when the loud tones occur slightly later or slightly earlier than the softer tones
• Demo: https://www.youtube.com/watch?v=k6DVywW5NR4

Auditory masking
• Masking can be conveniently described as a modification of the threshold-of-hearing curve in the region of a loud tone
• That is, the ATH curve rises in the presence of a dominant frequency component
• Thus the masking effect is strongest near the dominant frequency component and decreases as we move away from it

Masking threshold
• The portion of the ATH curve that is changed is called the masking threshold curve
  – All frequencies that appear at amplitudes beneath the masking threshold will be inaudible (even if they are above the original ATH and would thus otherwise be audible)
• The width of the masking threshold curve is called the critical bandwidth

Example: auditory masking
• For instance, a loud 1 kHz tone masks a softer tone at 1.1 kHz (within its critical band), but not one at 3.1 kHz

Auditory masking
• The effect of auditory masking varies with the critical band (sub-band):
  – It occurs within the same critical band
  – It also spreads to neighboring (sub-)bands
  – Higher bands have larger masking effects

Temporal masking
• Temporal masking happens when a sudden loud tone causes the hearing receptors in the ear to become saturated, making inaudible other tones that immediately precede or follow the loud tone
• Thus, if we hear a loud sound and then it stops, it takes a little while until we can hear a nearby soft tone

Temporal masking
• Pre-masking: the loud sound masks softer sounds at nearby frequencies that occur shortly before its onset
• Post-masking: after the loud sound stops, it takes a little while before softer sounds at nearby frequencies become audible again
• The temporal masking effect builds up (pre-masking) and decays (post-masking) roughly exponentially

Perceptual coding
• The heart of perceptual coding (and compression) is: removing masked sound yields compression without altering the perceived quality of the sound
• Therefore, perceptual coders analyze the input PCM stream (dividing it into fixed-length frames), detect the masking thresholds for each frame, and re-code (and re-quantize) only the sound whose level is above the calculated threshold

Perceptual compression (MPEG)
• Intuitively, the masking phenomenon can be applied to compression in the following way: a small window of time (frame) is moved across the sound file, and the samples in that frame are compressed as one unit (see the sketch after this list)
  1. The samples in the current frame are loaded for processing
  2. Fourier analysis divides each frame into (usually 32) frequency sub-bands
  3. Using the information from the Fourier analysis, a masking curve is calculated for each band
  4. A DCT is computed on the samples loaded in step 1; it reveals which information can be discarded and which must be retained
  5. The information to be retained is re-quantized by an adaptive quantizer, which determines the lowest possible bit depth such that the resulting quantization noise stays under the masking curve
  6. That is, where a masking sound is present, the signal can be quantized relatively coarsely, using fewer bits than would otherwise be needed, because the resulting quantization noise can be hidden under the masking curve
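A minimal Python sketch of the bit-allocation idea in steps 5-6. It is not the MPEG algorithm itself: the per-band signal levels and masking thresholds below are made-up numbers, and a real coder derives the masking curve from a psychoacoustic model rather than receiving it as input.

```python
import math

def bits_needed(signal_db, mask_db, db_per_bit=6.02, max_bits=16):
    """Smallest bit depth whose quantization noise stays below the masking threshold.
    Uses the ~6 dB-per-bit rule from the dynamic-range slide: each extra bit pushes
    the quantization noise about 6.02 dB further below the signal level."""
    headroom_db = signal_db - mask_db      # the noise must end up this far below the signal
    if headroom_db <= 0:
        return 0                           # the band is entirely masked: no bits needed
    return min(max_bits, math.ceil(headroom_db / db_per_bit))

# Hypothetical per-band signal levels and masking thresholds (dB SPL) for one frame
signal_db = [70, 62, 55, 40, 30]
mask_db   = [45, 50, 52, 45, 35]

allocation = [bits_needed(s, m) for s, m in zip(signal_db, mask_db)]
print(allocation)   # [5, 2, 1, 0, 0]: loud, poorly masked bands get more bits
```

The output illustrates the point of step 6: heavily masked bands receive few or no bits, which is where the compression comes from.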
MPEG-1
• The MPEG-1 audio encoding technique lets you select several options. First of all, the number of channels and their use:
  – Single channel (mono)
  – Two independent channels (e.g., two languages)
  – Stereo
  – Joint stereo: the two stereo channels are combined/down-mixed into a single channel, exploiting the redundancy between them
• The sampling frequency can be set to 32 kHz, 44.1 kHz or 48 kHz, while the bit rate ranges from 16 to 320 kbit/s

MPEG-1
• MPEG-1 audio is divided into layers. Each layer defines a different coder, with increasing features and complexity:
• Layer I: the simplest of the three, designed to give its best performance at bit rates around 128 kbit/s per channel
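To put these bit rates in perspective, a short back-of-the-envelope calculation (a sketch, not taken from the slides) comparing the LPCM rate of CD audio with a few MPEG-1 target bit rates in the 16-320 kbit/s range mentioned above:

```python
# CD-quality LPCM: 44.1 kHz sampling, 16 bits/sample, 2 channels
pcm_bitrate = 44_100 * 16 * 2            # = 1_411_200 bit/s, about 1411 kbit/s

# A few MPEG-1 audio bit rates within the 16-320 kbit/s range
for mpeg_kbps in (320, 192, 128, 64):
    ratio = pcm_bitrate / (mpeg_kbps * 1000)
    print(f"{mpeg_kbps:>3} kbit/s -> compression ratio ~{ratio:.1f}:1")
# 320 kbit/s -> ~4.4:1, 192 kbit/s -> ~7.4:1, 128 kbit/s -> ~11.0:1, 64 kbit/s -> ~22.1:1
```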