State-Of-The-Art of Audio Perceptual Compression Systems
Total Page:16
File Type:pdf, Size:1020Kb
State-of-the-Art of Audio Perceptual Compression Systems Estado del Arte de los sistemas de Compresión de Audio Perceptual Marcelo Herrera Martínez, Ana María Guzmán Palacios Abstract Resumen n this paper, audio perceptual compression n este trabajo se describen los sistemas de systems are described, giving special attention compresión de audio perceptual, con espe- to the former one: The MPEG 1, Layer III, in cial atención al último modelo disponible: El I short, the MP3. Other compression technolo- E MPEG 1 Layer III, en definitiva, el MP3. Así gies are described, especially the technologies mismo, se describen y evalúan otras tecnolo- evaluated in the present work: OGG Vorbis, WMA (Windows gías de compresión, especialmente las siguientes tecnologías: Media Audio) from Microsoft Corporation and AAC (Audio OGG Vorbis, WMA (Windows Media Audio) de Microsoft Coding Technologies). Corporation y AAC (Audio Coding Technologies). Keywords: Audio Signal, Compression, Frecuency mapping, Palabras clave: señal de audio, compresión, mapeo Frecuencia, MP3, MPEG3, Quantization, SBR, WMA. MP3, MPEG3, cuantificación, SBR, WMA Recibido: Abril 30 de 2013 Aprobado: Noviembre 19 de 2013 Tipo de artículo: Artículo de revisión de tema, investigación no terminada. Afiliación Institucional de los autores: Grupo de Acústica Aplicada – Semillero de investigación Compresión de Audio. Facultad de Ingeniería. Universidad de San Buenaventura sede Bogotá. Los autores declaran que no tienen conflicto de interés. State-of-the-Art of Audio Perceptual Compression Systems These high compression ratios are achieved due to the Introduction possibility of discarding components from the signal due to The storage and reproduction of sound has evolved since irrelevancies (masking) and redundancies (entropic coding). the times of Thomas Alva Edison’s phonograph cylinders. The development of the vinyl, as a mechanical storage MPEG-1 device, and the cassette, based on magnetization princi- MPEG-1 was the first audio-compression technology ples were widely distributed and used by the audiophile to appear at the Fraunhofer Institute, in Erlangen community. Nevertheless, it was in the 80’s, with the (Germany) in 1987 within the frame of the EUREKA’s incursion of the CD-Audio by Philips, when reducing project (DAB broadcasting system). The audio standard sizes for storage devices began. In a parallel way, radio- which became available is part of the standard MPEG 1. broadcasting (AM-FM) began to visualize the incoming MPEG-2, and further releases were supposed to improve market of internet solutions in order to save bandwidth this first algorithm. The basic scheme of MPEG-1 Layer III costs. The development of compression technologies, is depicted in the appendix. 2. as MPEG (Fraunhofer Institute), Ogg Vorbis and WMA (Windows Media Audio) among others, initiated. These technologies are based on discarding irrelevancies and Table I. Bit-rates used in MPEG-1 layer III redundancies of the audio signal. The irrelevancies of Bandwidth Bit rate Compression the audio signal can be achieved using psychoacoustic Quality Mode [kHz] [kbit/s] Rate features of the HHS as the masking phenomenon, which was extensively described in the previous chapter. The Telephone 2.5 mono 8 1:96 general configuration of an audio compression system is Better 7.5 mono 32 1:24 shown in the appendix. 1. than AM FM radio 11 stereo 56..64 1:24..26 The PCM signal, a digital representation of the audio signal enters the compression system. A time-to-frequency Near CD 15 stereo 96 1:16 mapping is performed initially. These values are compared CD >15 stereo 112..128 1:12..14 to masking thresholds, obtained by a psychoacoustic model which has implementations based on the previous MPEG-1 became a standard in 1992, and it is also known chapters. In these models, the codec compares at each as ISO/IEC 11172. The audio signal is then represented instance the value of the incoming signal, computes by its spectral components on a frame-by-frame basis and the masking threshold associated with the given signal encoded exploiting perceptual models. MPEG-1 Audio component, and the components which remain below standard specifications were derived from two main the computed masking threshold are discarded. This is proposals: MUSICAM presented by CCETT, IRT Phillips, understood as irrelevance. The non-discarded compo- and ASPEC presented by AT&T, FhG, and Telefunken [9]. nents of the signal are fed into a re-quantization block, where each of the Fourier or DCT coefficients is re-quan- The Stereo Audio Signal is encoded using similarities of tized. The re-quantized coefficients are fed into a Huffman right and left channels. coding block where entropic coding is performed, and MP3 has implemented three modes for this purpose: more information is discarded based on redundancies, as it is explained in the section I-G. 1. Stereo Compression schemes have reached compression ratios 2. Joint Stereo (Middle/Side): In this mode, one signal till 1:12, and further development is near with the inclu- contains the sum of channel components and a sion of other frequency-time transform algorithms, and second signal the difference. quantization schemes. These compression schemes 3. Joint Stereo (Intensity Stereo): In this mode, high introduces to the musical audio signal undesirable arti- frequency components are coded in mono and the facts. Their evaluation is the main part of this research. information of panning (direction) is also coded. Revista de Tecnología ¦ Journal Technology ¦ Volumen 12 ¦ Número 2 ¦ Págs. 102-110 ¦ 103 Marcelo Herrera Martínez, Ana María Guzmán Palacios MP3 has 3 layers (back compatible), which are physical sample window is chosen. This reduces the temporal and mathematical models. The complexity is propor- spreading of quantization noise for sharp attacks. The tional to the layer number. definition of those windows is treated in [9]. The bit-rates used in MPEG-1 Layer III are summarized in the Tab. 1. Psychoacoustic Models in MPEG-1 The scheme for the psychoacoustic model in MPEG-1 is The structure of the MP3 file is shown in the appendix. 3. shown in the Appendix 7. Layer I and II Summarizing, the psychoacoustic model 1 performs the SPL computation in each band (the signal level for each The time to frequency mapping is performed by applying spectral line k), the separation of tonal and non-tonal a 32-band PQMF to the audio data. Quantization is components and the other processes outlined in the performed by a mid-tread quantizer. Psychoacoustic Appendix. 14. More information can be found in [9]. Model 1 is applied, with a 512-point FFT (Layer I) or 1024-point FFT (Layer II). Psychoacoustic Model 2 The MPEG-1 Layer I-II structure is depicted in the Model 2 process and calculations differ from those of Appendix 4. Basically, it corresponds to the general confi- Model 1. The block diagram is depicted in Appendix. 8. guration of an audio codec, shown in Appendix 5. After This model uses another mathematical expression for splitting the signal into 32 sub-band components, the the basic spreading function B(dz) measured in dB. quantizer recalculates the values of the signal taking into The tonality index in each partition is calculated based account the psychoacoustic thresholds stored inside the on the predictability of the signal from the frequency psychoacoustic model. lines in prior frames. When model 2 is applied to Layer III, it is altered to take into account the different nature Layer III of the Layer III (hybrid filter bank, and the switching block). Attacks, for example, are detected based on a For the Layer III, the output of the PQMF is fed to a psychoacoustic entropy calculation. [9]. The specifica- MDCT stage. In Layer III, the filter bank is signal adaptive tion of the audio-syntax, the scale factors, bit allocation, in difference to previous layers. quantization, Huffman coding and bit-reservoir are The block diagram of Layer III filter bank analysis stage extensively explained in [9]. is shown in Appendix 5. After the 32-band PQMF filter, The Software applications which support MP3 format are blocks of 36 sub-band samples are overlapped by 50 Cool Edit Pro (Syntrillium), Sound Forge, Musicmatch percent, multiplied by a sine window and then processed Jukebox and Easy MP3, among others. by the MDCT transform. The filter bank analysis in MPEG-1 Layer III is shown in MP3 Format, Lame Codec Appendix 6. The Lame Codec was created in 1998. It has its own Compression systems encounter problems while trac- psychoacoustical model (GPSYCHO) which enables king signals containing transients, such as the castanet the graphic imaging of MDCT coefficients. Some of the excerpts. Quantization errors spread over a block of 1152 softwares which support this codec are Audiograbber, time- samples. This leads to unmasked temporal noise, WaveLab, WinLame and RazorLame, among others. and the audibility of the artifact known as pre-echo. MP3 Pro format The layer III implemented a shorter block size for avoi- ding pre-echo. When a transient is detected, the 384 This format uses the MPEG-1 Layer III specification. It was sample window is used; in the opposite case, a 1152 released to the market by Thomson. The available softwares 104 ¦ Revista de Tecnología ¦ Journal Technology ¦ Volumen 12 ¦ Número 2 ¦ Págs. 102-110 State-of-the-Art of Audio Perceptual Compression Systems which support the format are Thomson MP3Pro Audio The Noise-like signal components are detected on a scale- Player, MusicMatch Jukebox and Nero 5.5, among others. It factor band basis. The corresponding groups of spectral uses the SBR Technology (Spectral Band Replication). coefficients are excluded from the quantization/coding process. Instead, only a noise substitution flag plus total Spectral Band Replication is a technique by which the power of the substituted band is transmitted in the bit- non-transmitted higher frequency range of a compressed audio file is deduced by some helper bits and the trans- stream.