Lapped Transforms in Perceptual Coding of Wideband Audio

Sien Ruan
Department of Electrical & Computer Engineering
McGill University
Montreal, Canada
December 2004

A thesis submitted to McGill University in partial fulfillment of the requirements for the degree of Master of Engineering.

© 2004 Sien Ruan

To my beloved parents

Abstract

Audio coding paradigms depend on time-frequency transformations to remove statistical redundancy in audio signals and reduce the data bit rate, while maintaining high fidelity of the reconstructed signal. Sophisticated perceptual audio coding further exploits perceptual redundancy in audio signals by incorporating perceptual masking phenomena. This thesis focuses on the investigation of different coding transformations that can be used to compute perceptual distortion measures effectively; among them the lapped transform, which is the most widely used in today's audio coders. Moreover, an innovative lapped transform is developed that can vary the overlap percentage to an arbitrary degree. The new lapped transform is applicable to transient audio, capturing the time-varying characteristics of the signal.

Sommaire

Audio coding paradigms depend on time-frequency transformations to remove the statistical redundancy in audio signals and to reduce the data transmission rate, while maintaining high fidelity of the reconstructed signal. Sophisticated perceptual audio coding further exploits the perceptual redundancy in audio signals by incorporating perceptual masking phenomena. This thesis focuses on the study of the different coding transformations that can be used to compute perceptual distortion measures effectively, among them the lapped transform, which is the most widely used in today's audio coders. Furthermore, an innovative lapped transform is developed that can vary the overlap percentage to arbitrary degrees. The new lapped transform is applicable to transient audio, capturing the time-varying characteristics of the signal.

Acknowledgments

I would like to acknowledge my supervisor, Prof. Peter Kabal, for his support and guidance throughout my graduate studies at McGill University. Prof. Kabal's kind treatment of his students is highly appreciated. I would also like to thank Ricky Der for working with me and advising me throughout the work. My thanks go to my fellow TSP graduate students for their close friendship, especially Alexander M. Wyglinski for his various technical assistance. I am sincerely indebted to my parents for all the encouragement they have given me. They are the reason for who I am today. To my mother, Mrs. Dejun Zhao, and my father, Mr. Liwu Ruan, thank you.

Contents

1 Introduction
  1.1 Audio Coding Techniques
    1.1.1 Parametric Coders
    1.1.2 Waveform Coders
  1.2 Time-to-Frequency Transformations
  1.3 Thesis Contributions
  1.4 Thesis Synopsis
2 Perceptual Audio Coding: Psychoacoustic Audio Compression
  2.1 Human Auditory Masking
    2.1.1 Hearing System
    2.1.2 Perception of Loudness
    2.1.3 Critical Bands
    2.1.4 Masking Phenomena
  2.2 Example Perceptual Model: Johnston's Model
    2.2.1 Loudness Normalization
    2.2.2 Masking Threshold Calculation
    2.2.3 Perceptual Entropy
  2.3 Perceptual Audio Coder Structure
    2.3.1 Time-to-Frequency Transformation
    2.3.2 Psychoacoustic Analysis
    2.3.3 Adaptive Bit Allocation
    2.3.4 Quantization
    2.3.5 Bitstream Formatting
3 Signal Decomposition with Lapped Transforms
  3.1 Block Transforms
  3.2 Lapped Transforms
    3.2.1 LT Orthogonal Constraints
  3.3 Filter Banks: Subband Signal Processing
    3.3.1 Perfect Reconstruction Conditions
    3.3.2 Filter Bank Representation of the LT
  3.4 Modulated Lapped Transforms
    3.4.1 Perfect Reconstruction Conditions
  3.5 Adaptive Filter Banks
    3.5.1 Window Switching with Perfect Reconstruction
4 MP3 and AAC Filter Banks
  4.1 Time-to-Frequency Transformations of MP3 and AAC
    4.1.1 MP3 Transformation: Hybrid Filter Bank
    4.1.2 AAC Transformation: Pure MDCT Filter Bank
  4.2 Performance Evaluation
    4.2.1 Full Coder Description
    4.2.2 Audio Quality Measurements
    4.2.3 Experiment Results
  4.3 Psychoacoustic Transforms of DFT and MDCT
    4.3.1 Inherent Mismatch Problem
    4.3.2 Experiment Results
5 Partially Overlapped Lapped Transforms
  5.1 Motivation of Partially Overlapped LT: NMR Distortion
  5.2 Construction of Partially Overlapped LT
    5.2.1 MLT as DST via Pre- and Post-Filtering
    5.2.2 Smaller Overlap Solution
  5.3 Performance Evaluation
    5.3.1 Pre-echo Mitigation
    5.3.2 Optimal Overlapping Point for Transient Audio
6 Conclusion
  6.1 Thesis Summary
  6.2 Future Research Directions
A Greedy Algorithm and Entropy Computation
  A.1 Greedy Algorithm
  A.2 Entropy Computation

List of Figures

2.1 Absolute threshold of hearing for normal listeners.
2.2 Generic perceptual audio encoder.
2.3 Sine MDCT window (576 points).
3.1 General signal processing system using the lapped transform.
3.2 Signal processing with a lapped transform with L = 2M.
3.3 Typical subband processing system, using the filter bank.
3.4 Magnitude frequency response of a MLT (M = 10).
4.1 MPEG-1 Layer III decomposition structure.
4.2 Layer III prototype filter (b) and the original window (a).
4.3 Magnitude response of the lowpass filter.
4.4 Magnitude response of the polyphase filter bank (M = 32).
4.5 Switching from a long sine window to a short one via a start window.
4.6 Layer III aliasing-butterfly, encoder/decoder.
4.7 Layer III aliasing reduction encoder/decoder diagram.
4.8 Block diagram of the encoder of the full audio coder.
4.9 Frequency response of the MDCT basis function h_k(n), M = 4.
5.1 Flowgraph of the Modified Discrete Cosine Transform.
5.2 Flowgraph of MDCT as block DST via butterfly pre-filtering.
5.3 Global viewpoint of MDCT as pre-filtering at DST block boundaries.
5.4 Pre-DST lapped transforms at arbitrary overlaps (L < 2M).
5.5 Post-DST lapped transforms at arbitrary overlaps (L < 2M).
5.6 Partially overlapped Pre-DST example showing pre-echo mitigation for sound files of castanets.

List of Tables

2.1 Critical bands measured by Scharf.
4.1 MOS is a number mapping to the above subjective quality.
4.2 Subjective listening tests: Hybrid filter bank (Hybrid) vs. Pure MDCT filter bank (Pure).
4.3 PESQ MOS values: Hybrid filter bank (Hybrid) vs. Pure MDCT filter bank (Pure).
4.4 PESQ MOS values: DFT spectrum (DFT) vs. MDCT spectrum (MDCT).
5.1 Subjective listening tests of Pre-DST coded test files of castanets.

List of Terms

AAC       MPEG-2 Advanced Audio Coding
ADPCM     Adaptive Differential Pulse Code Modulation
CELP      Code Excited Linear Prediction
DCT       Discrete Cosine Transform
DFT       Discrete Fourier Transform
DPCM      Differential Pulse Code Modulation
DST       Discrete Sine Transform
EBU-SQAM  European Broadcasting Union - Sound Quality Assessment Material
ERB       Equivalent Rectangular Bandwidth
FIR       Finite Impulse Response
IMDCT     Inverse Modified Discrete Cosine Transform
ITU       International Telecommunication Union
MDCT      Modified Discrete Cosine Transform
MDST      Modified Discrete Sine Transform
MLT       Modulated Lapped Transform
MOS       Mean Opinion Score
MPEG      Moving Picture Experts Group
MP3       MPEG-1 Layer III
PCM       Pulse Code Modulation
NMN       Noise-Masking-Noise
NMR       Noise-to-Masking Ratio
NMT       Noise-Masking-Tone
LOT       Lapped Orthogonal Transform
LT        Lapped Transform
QMF       Quadrature Mirror Filter
PE        Perceptual Entropy
PEAQ      Perceptual Evaluation of Audio Quality
PESQ      Perceptual Evaluation of Speech Quality
PR        Perfect Reconstruction
Pre-DST   Pre-filtered Discrete Sine Transform
SFM       Spectral Flatness Measure
SMR       Signal-to-Masking Ratio
SNR       Signal-to-Noise Ratio
SPL       Sound Pressure Level
TDAC      Time-Domain Aliasing Cancellation
TMN       Tone-Masking-Noise
TNS       Temporal Noise Shaping
VQ        Vector Quantization

Chapter 1
Introduction

1.1 Audio Coding Techniques

Audio coding algorithms are concerned with the digital representation of sound using information bits. A number of paradigms have been proposed for the digital compression of audio signals. Roughly, audio coders can be grouped as either parametric coders or waveform coders. The concept of perceptual audio coding is relevant in the latter case, where auditory perception characteristics are applicable [1].

1.1.1 Parametric Coders

Parametric coders represent the source of the signal with a few parameters. Such coders are suitable for speech signals
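As a rough illustration of representing a signal with a few parameters, the sketch below is not part of the thesis; the frame length of 400 samples, the predictor order of 10, the toy sinusoid-plus-noise "voiced" frame, and the helper name lpc_coefficients are assumptions chosen for illustration. It fits linear-prediction coefficients to one frame by the autocorrelation (Yule-Walker) method and measures how much short-term redundancy the small parameter set removes.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Fit predictor coefficients a_1..a_p by the autocorrelation (Yule-Walker) method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1 : len(frame) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])  # Toeplitz matrix of lags 0..p-1
    return np.linalg.solve(R, r[1 : order + 1])

# A tonal frame plus a little noise stands in for a short segment of voiced speech.
rng = np.random.default_rng(1)
frame = np.sin(0.3 * np.arange(400)) + 0.05 * rng.standard_normal(400)

a = lpc_coefficients(frame, order=10)
predicted = np.zeros_like(frame)
for lag, coeff in enumerate(a, start=1):
    predicted[lag:] += coeff * frame[:-lag]      # x_hat(n) = sum_r a_r x(n - r)
residual = frame - predicted

# Ten coefficients capture most of the frame: the residual carries far less energy.
print(f"prediction gain: {np.var(frame) / np.var(residual):.1f}x")
```

In an LPC-style parametric coder, only such predictor coefficients, a gain, and a coarse description of the excitation are transmitted, which is why these coders suit speech at very low bit rates.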
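The lapped transform highlighted in the abstract and in Chapters 3-5 is, in its most common form, the sine-windowed MDCT with 50% overlap. The following sketch is not taken from the thesis; the block size M = 256, the test signal, and the function names are illustrative assumptions. It implements a direct matrix form of the windowed MDCT/IMDCT pair with overlap-add and checks that time-domain aliasing cancellation (TDAC) gives perfect reconstruction away from the signal boundaries.

```python
import numpy as np

def mdct_basis(M):
    # Cosine basis: M coefficients from a block of L = 2M windowed samples.
    n = np.arange(2 * M)
    k = np.arange(M)
    return np.cos(np.pi / M * (n[None, :] + 0.5 + M / 2) * (k[:, None] + 0.5))

def mdct_round_trip(x, M=256):
    """Sine-windowed MDCT analysis/synthesis with 50% overlap-add (hop = M)."""
    w = np.sin(np.pi / (2 * M) * (np.arange(2 * M) + 0.5))   # satisfies w[n]^2 + w[n+M]^2 = 1
    B = mdct_basis(M)
    y = np.zeros_like(x)
    for t in range(len(x) // M - 1):
        block = x[t * M : t * M + 2 * M]
        X = B @ (w * block)                                   # forward MDCT: M coefficients per block
        y[t * M : t * M + 2 * M] += w * (2.0 / M) * (B.T @ X)  # IMDCT, synthesis window, overlap-add
    return y

# The aliasing introduced by each block cancels between adjacent blocks (TDAC),
# so interior samples are reconstructed to machine precision.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
y = mdct_round_trip(x, M=256)
print(np.max(np.abs(x[256:-256] - y[256:-256])))   # at machine-precision level
```

The partially overlapped transforms of Chapter 5 shrink this overlap region (L < 2M), which is what allows them to localize transients such as castanets and mitigate pre-echo.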