Perceptual Audio Coding Contents
Total Page:16
File Type:pdf, Size:1020Kb
12/6/2007 Perceptual Audio Coding Henrique Malvar Managing Director, Redmond Lab UW Lecture – December 6, 2007 Contents • Motivation • “Source coding”: good for speech • “Sink coding”: Auditory Masking • Block & Lapped Transforms • Audio compression •Examples 2 1 12/6/2007 Contents • Motivation • “Source coding”: good for speech • “Sink coding”: Auditory Masking • Block & Lapped Transforms • Audio compression •Examples 3 Many applications need digital audio • Communication – Digital TV, Telephony (VoIP) & teleconferencing – Voice mail, voice annotations on e-mail, voice recording •Business – Internet call centers – Multimedia presentations • Entertainment – 150 songs on standard CD – thousands of songs on portable music players – Internet / Satellite radio, HD Radio – Games, DVD Movies 4 2 12/6/2007 Contents • Motivation • “Source coding”: good for speech • “Sink coding”: Auditory Masking • Block & Lapped Transforms • Audio compression •Examples 5 Linear Predictive Coding (LPC) LPC periodic excitation N coefficients x()nen= ()+−∑ axnrr ( ) gains r=1 pitch period e(n) Synthesis x(n) Combine Filter noise excitation synthesized speech 6 3 12/6/2007 LPC basics – analysis/synthesis synthesis parameters Analysis Synthesis algorithm Filter residual waveform N en()= xn ()−−∑ axnr ( r ) r=1 original speech synthesized speech 7 LPC variant - CELP selection Encoder index LPC original gain coefficients speech . Synthesis . Filter Decoder excitation codebook synthesized speech 8 4 12/6/2007 LPC variant - multipulse LPC coefficients excitation Synthesis Compute pulse Filter positions and amplitudes original speech synthesized speech 9 G.723.1 architecture Audio Input FRAMER Synthesis Filter Coefficients LPC ANALYSIS HIGHPASS FILTER ZERO-INPUT RESPONSE SIMULATED PERCEPTUAL (MEMORY) DECODER NOISE SHAPING Past Decoded Weighted Residual Residual Excitation + – MP-MLQ/ACELP Indices PITCH + EXCITATION COMPUTATION ESTIMATION Encoded Pitch Values 10 5 12/6/2007 Contents • Motivation • “Source coding”: good for speech • “Sink coding”: Auditory Masking • Block & Lapped Transforms • Audio compression •Examples 11 Physiology of the ear • Automatic gain control – muscles around transmission bones • Directivity – pinna TRANSMISSION EARDRUM BONES • Boost of middle frequencies PINNA AUDITORY NERVE – auditory canal AUDITORY CANAL • Nonlinear processing – auditory nerve COCHLEA • Filter bank separation EUSTACHIAN –cochlea TUBE • Thousands of “microphones” – hair cells in cochlea 12 6 12/6/2007 Filter bank model linear linear nonlinear A1(m1) x(n) bandpass group into amplitude amplitude, filter 1 blocks measurement band 1 incoming sound A2(m2) bandpass group into amplitude amplitude, filter 2 blocks measurement band 2 . AN(mN) bandpass group into amplitude amplitude, filter N blocks measurement band N • Explains frequency-domain masking 13 Frequency-domain masking 60 40 20 de, dB de, u 0 tone 1 -20 amplit tone 2 -40 -60 2 3 4 tone 3 10 10 10 frequency, Hz 60 tone 1+3 40 20 tone 1+2+3 e, dB 0 -20 12 3 amplitud -40 -60 2 3 4 10 10 10 frequency, Hz 14 7 12/6/2007 Absolute threshold of hearing • Fletcher-Munson curves 80 70 60 50 40 30 20 10 0 -10 2 3 4 10 10 10 frequency, Hz • Basis for loudness correction in audio amplifiers 15 Example of masking •Typical spectrum & masking threshold • Original sound: amplitude, dB amplitude, • Sound after removing components below the threshold (1/3 to 1/2 of the data): 16 8 12/6/2007 Contents • Motivation • “Source coding”: good for speech • “Sink coding”: Auditory Masking • Block & Lapped Transforms • Audio compression •Examples 17 Block signal processing x Direct T Extract X = P x Orthogonal Bloc k Input Transform Signal ~ X Processing Output Inverse ~ ~ x = PX Append Signal Orthogonal Bloc k Transform Signal is reconstructed as a linear combination of basis functions 18 9 12/6/2007 Block processing: good and bad • Pro: allows adaptability • Con: blocking artifacts 19 Why transforms? • More efficient signal representation –Freqqyuency domain – Basis functions ~ “typical” signal components • Faster processing – Filtering, compression •Orthoggyonality – Energy preservation – Robustness to quantization 20 10 12/6/2007 Compactness of representation • Maximum energy concentration in as few coefficients as possible • For stationary random signals, the optimal basis is the Karhunen-Loève transform: T λiipRp==xxi, PP I • Basis functions are the columns of P • Minimum geometric mean of transform coefficient variances 21 Sub-optimal transforms •KLT problems: – Signal dependency – P not factorable into sparse components • Sinusoidal transforms: – Asymptotically optimal for large blocks –Frequency component iiinterpretation – Sparse factors - e.g. FFT 22 11 12/6/2007 Lapped transforms • Basis functions have tails beyond block boundaries – Linear combinations of overlapping functions such as generate smooth signals, without blocking artifacts 23 Modulated lapped transforms • Basis functions = cosines modulating the same low-pass (window) prototype h(n): 21M + 1 π pna f =+ hna f cosLF n IF k + I O k M NMH 2 KH 2 K M QP • Can be computed from the DCT or FFT T • Projection X = P x can be computed in Oblog2 Mg operations per input point 24 12 12/6/2007 Fast MLT computation M-sample one-block delay block x()n −hna() z−M uM(/2+ n ) . ha()M−1−n Uk() X ()k uM(/21− − n ) MLT coefficients x()M −−1 n hna() input signal DCT-IV windowing transform output signal −hns () yn() w(M /2+n) . Yk() hMs ()−−1 n . processed wM(/21− − n ) z−M MLT y()M −1−n coefficients one-block hns () M-sample DCT-IV delay block transform windowing 25 Basis functions DCT: MLT: 26 13 12/6/2007 Contents • Motivation • “Source coding”: good for speech • “Sink coding”: Auditory Masking • Block & Lapped Transforms • Audio compression •Examples 27 Basic architecture TIME-TO- FREQUENCY SIDE INFO TRANSFORM ORIGINAL (MLT) AUDIO SIGNAL ENCODED BITSTREAM MASKING MUX WEIGHTING THRESHOLD SI SPECTRUM SI UNIFORM ENTROPY QUANTIZER ENCODER 28 14 12/6/2007 Quantization of transform coefficients • Quantization = rounding to nearest integer. • Small range of integer values = fewer bits needed to represent data • Step size T controls range of integer values y are mapped y = T int(x /T ) to this value x all values in this range … 29 Encoding of quantized coefficients • Typical plot of quantized transform coefficients 2 1 0 Quantized value Quantized -1 -2 0 1000 2000 3000 4000 5000 6000 Freq uency, Hz • Run-length + entropy coding 30 15 12/6/2007 Basic entropy coding Value Codeword • Huffman coding: less -7 '1010101010001' -6 '10101010101' frequent values have -5 '101010100' longer codewords -4 '10101011' -3 '101011' -2 '1011' • More efficient if -1 '01' groups of values are 0 '11' assembled in a vector +1 '00' before coding +2 '100' +3 '10100' +4 '1010100' +5 '1010101011' +6 '101010101001' +7 '1010101010000' 31 Side information & more about EC • Side info: model of frequency spectrum – e.g. averages over subbands • Quantized spectral model determines weighting – masking level used to scale coefficients • Backward adaptation reduces need for SI • Run-length + Vector Huffman works – Context-based AC can be better – Room for better context models via machine learning? 32 16 12/6/2007 Improved architecture TIME-TO- FREQUENCY SIDE INFO TRANSFORM ORIGINAL (MLT) AUDIO SIGNAL ENCODED BITSTREAM MASKING MUX WEIGHTING THRESHOLD SI SPECTRUM CONTEXT MODELING SI SI UNIFORM ENTROPY QUANTIZER ENCODER 33 Examples of context modeling • For strongly voiced segments, spectral energies may be well predicted by a “Linear Prediction” model, similar to those used in VoIP coders. • For strongly periodic components, spectral energies may be predicted by a pitch model. • For noisy segments, a noise-only model may allow for very coarse quantization Æ lower data rate. 34 17 12/6/2007 Other aspects & directions •Stereo coding – (L+R)/2 & L-R coding, expandable to multichannel – Intensity + balance coding – Mode switching – extra work for encoder only • Lossless coding – Easily achievable via integer transforms – exactly reversible via integer arithmetic – example: lifting-based MLT (see Refs) • Using complex subband decompositions (MCLT) – Potential for more sophisticated auditory models – Efficient encoding is an open problem 35 Audio coding standards MPEG-1 Layer III (MP3) · MPEG-1 Layer II · MPEG-1 ISO/IEC Layer I · AAC · HE-AAC · HE-AAC v2 G.711 · G.722 · G.722.1 · G.722.2 · G.723 · ITU-T G.723.1 · G.726 · G.728 · G.729 · G.729.1 · G.729ª AC3 · AMR · Apple Lossless · ATRAC · FLAC · iLBC · Monkey's Audio · μ-law · Musepack · Nellymoser · Others OptimFROG · RealAudio · RTAudio · SHN · Speex · Vorbis · WavPack · WMA · TAK From http://en.wikipedia.org/wiki/Advanced_Audio_Coding 36 18 12/6/2007 Contents • Motivation • “Source coding”: good for speech • “Sink coding”: Auditory Masking • Block & Lapped Transforms • Audio compression •Examples 37 WMA examples: •Original clip (~1,400 kbps) 64 kbps (MP3) 64 kbps (WMA) • Original clip WMA @ 32 kbps (Internet radio) • More examples at http://www.microsoft.com/windows/windowsmedia/demos/audio_quality_demos.aspx 38 19 12/6/2007 References • S. Shlien, “The modulated lapped transform, its time-varying forms, and its applications to audio coding standards,” IEEE Trans. Speech and Audio Processing, vol. 5, pp. 359-366, July 1997. • H. S. Malvar, “Fast Algorithms for Orthogonal and Biorthogonal Modulated Lapped Transforms,” IEEE Symposium Advances Digital Filtering and Signal Processing, Victoria, Canada, pp. 159–163, June 1998. • H. S. Malvar, “Enhancing the performance of subband