Perceptual Audio Coding Contents

12/6/2007
Perceptual Audio Coding
Henrique Malvar
Managing Director, Redmond Lab
UW Lecture – December 6, 2007

Contents
• Motivation
• “Source coding”: good for speech
• “Sink coding”: Auditory Masking
• Block & Lapped Transforms
• Audio compression
• Examples

Many applications need digital audio
• Communication
– Digital TV, Telephony (VoIP) & teleconferencing
– Voice mail, voice annotations on e-mail, voice recording
• Business
– Internet call centers
– Multimedia presentations
• Entertainment
– 150 songs on standard CD
– thousands of songs on portable music players
– Internet / Satellite radio, HD Radio
– Games, DVD Movies

Linear Predictive Coding (LPC)
[Figure: LPC synthesis model. A periodic excitation (set by the pitch period) and a noise excitation are combined, with gains, into the excitation e(n); a synthesis filter controlled by the LPC coefficients produces the synthesized speech x(n).]
$x(n) = e(n) + \sum_{r=1}^{N} a_r\, x(n-r)$

LPC basics – analysis/synthesis
[Figure: the original speech enters an analysis algorithm, which outputs the synthesis parameters and the residual waveform; both feed the synthesis filter, which outputs the synthesized speech.]
$e(n) = x(n) - \sum_{r=1}^{N} a_r\, x(n-r)$

LPC variant - CELP
[Figure: CELP block diagram. Labels: original speech, encoder, selection index, LPC coefficients, gain, excitation codebook, synthesis filter, decoder, synthesized speech.]

LPC variant - multipulse
[Figure: multipulse LPC block diagram. Labels: original speech, LPC coefficients, compute pulse positions and amplitudes, excitation, synthesis filter, synthesized speech.]

G.723.1 architecture
[Figure: G.723.1 encoder block diagram. Blocks: audio input, framer, highpass filter, LPC analysis, perceptual noise shaping, zero-input response, simulated decoder (memory), pitch estimation, excitation computation. Signal labels: synthesis filter coefficients, past decoded residual, weighted residual, excitation, encoded pitch values, MP-MLQ/ACELP indices.]

Physiology of the ear
• Automatic gain control – muscles around transmission bones
• Directivity – pinna
• Boost of middle frequencies – auditory canal
• Nonlinear processing – auditory nerve
• Filter bank separation – cochlea
• Thousands of “microphones” – hair cells in cochlea
[Figure: anatomy of the ear. Labels: pinna, auditory canal, eardrum, transmission bones, cochlea, auditory nerve, Eustachian tube.]

Filter bank model
[Figure: the incoming sound x(n) feeds N parallel branches. Branch k consists of bandpass filter k (linear), grouping into blocks (linear), and amplitude measurement (nonlinear), and outputs the amplitude of band k, $A_k(m_k)$, for k = 1, 2, ..., N.]
• Explains frequency-domain masking

Frequency-domain masking
[Figure: two panels of amplitude, dB (-60 to 60) versus frequency, Hz (log scale, 10^2 to 10^4). Top panel: tone 1, tone 2, tone 3. Bottom panel: tone 1+3, tone 1+2+3.]

Absolute threshold of hearing
• Fletcher-Munson curves
[Figure: threshold curve; vertical axis from -10 to 80, horizontal axis frequency, Hz (log scale, 10^2 to 10^4).]
• Basis for loudness correction in audio amplifiers

Example of masking
• Typical spectrum & masking threshold
[Figure: spectrum and masking threshold; vertical axis amplitude, dB.]
• Original sound: (audio clip)
• Sound after removing components below the threshold (1/3 to 1/2 of the data): (audio clip)

Block signal processing
[Figure: block processing chain. Input signal → extract block x → direct orthogonal transform, $X = P^T x$ → processing, giving $\tilde{X}$ → inverse orthogonal transform, $\tilde{x} = P\tilde{X}$ → append block → output signal.]
Signal is reconstructed as a linear combination of basis functions

Block processing: good and bad
• Pro: allows adaptability
• Con: blocking artifacts

Why transforms?
• More efficient signal representation
– Frequency domain
– Basis functions ~ “typical” signal components
• Faster processing
– Filtering, compression
• Orthogonality
– Energy preservation
– Robustness to quantization

Compactness of representation
• Maximum energy concentration in as few coefficients as possible
• For stationary random signals, the optimal basis is the Karhunen-Loève transform:
$\lambda_i p_i = R_{xx}\, p_i, \quad P^T P = I$
• Basis functions are the columns of P
• Minimum geometric mean of transform coefficient variances

Sub-optimal transforms
• KLT problems:
– Signal dependency
– P not factorable into sparse components
• Sinusoidal transforms:
– Asymptotically optimal for large blocks
– Frequency component interpretation
– Sparse factors - e.g. FFT

Lapped transforms
• Basis functions have tails beyond block boundaries
– Linear combinations of overlapping functions generate smooth signals, without blocking artifacts
[Figure: example of overlapping basis functions.]

Modulated lapped transforms
• Basis functions = cosines modulating the same low-pass (window) prototype h(n):
$p_k(n) = h(n)\,\sqrt{\frac{2}{M}}\,\cos\left[\left(n + \frac{M+1}{2}\right)\left(k + \frac{1}{2}\right)\frac{\pi}{M}\right]$
• Can be computed from the DCT or FFT
• Projection $X = P^T x$ can be computed in $O(\log_2 M)$ operations per input point

Fast MLT computation
[Figure: fast MLT flowgraphs. Direct transform: M-sample blocks of the input signal x(n) are windowed with $h_a(n)$, using a one-block delay $z^{-M}$, to form u(n); a DCT-IV transform then gives the MLT coefficients X(k). Inverse transform: the processed MLT coefficients Y(k) pass through a DCT-IV transform to give w(n), which is windowed with $h_s(n)$, using a one-block delay $z^{-M}$, to form the output signal y(n).]

Basis functions
[Figure: basis functions of the DCT and of the MLT.]

Basic architecture
[Figure: encoder block diagram. Blocks: original audio signal, time-to-frequency transform (MLT), weighting, uniform quantizer, entropy encoder, masking threshold spectrum, side info (SI), MUX, encoded bitstream.]

Quantization of transform coefficients
• Quantization = rounding to nearest integer.
• Small range of integer values = fewer bits needed to represent data
• Step size T controls range of integer values
$y = T \cdot \mathrm{int}(x/T)$
[Figure: quantizer staircase; all values x in a given range are mapped to the same value y.]

Encoding of quantized coefficients
• Typical plot of quantized transform coefficients
[Figure: quantized value (-2 to 2) versus frequency, Hz (0 to 6000).]
• Run-length + entropy coding

Basic entropy coding
• Huffman coding: less frequent values have longer codewords
• More efficient if groups of values are assembled in a vector before coding

Value   Codeword
-7      '1010101010001'
-6      '10101010101'
-5      '101010100'
-4      '10101011'
-3      '101011'
-2      '1011'
-1      '01'
 0      '11'
+1      '00'
+2      '100'
+3      '10100'
+4      '1010100'
+5      '1010101011'
+6      '101010101001'
+7      '1010101010000'

Side information & more about EC
• Side info: model of frequency spectrum
– e.g. averages over subbands
• Quantized spectral model determines weighting
– masking level used to scale coefficients
• Backward adaptation reduces need for SI
• Run-length + Vector Huffman works
– Context-based AC can be better
– Room for better context models via machine learning?

Improved architecture
[Figure: encoder block diagram. Blocks: original audio signal, time-to-frequency transform (MLT), weighting, uniform quantizer, entropy encoder, context modeling, masking threshold spectrum, side info (SI), MUX, encoded bitstream.]

Examples of context modeling
• For strongly voiced segments, spectral energies may be well predicted by a “Linear Prediction” model, similar to those used in VoIP coders.
• For strongly periodic components, spectral energies may be predicted by a pitch model.
• For noisy segments, a noise-only model may allow for very coarse quantization → lower data rate.
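
The quantization and run-length steps described on the slides above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the codec's actual implementation: NumPy is assumed to be available, "int" in y = T int(x/T) is taken to mean rounding to the nearest integer, and the function names and the (zero_run, value) pair format are hypothetical choices made for this example. A real coder would follow this stage with Huffman or arithmetic coding of the pairs.

```python
import numpy as np


def quantize(x, T):
    """Map each coefficient to an integer index q = round(x / T)."""
    return np.rint(np.asarray(x, dtype=float) / T).astype(int)


def dequantize(q, T):
    """Reconstruct y = T * q, the value shared by every x in the same cell."""
    return T * np.asarray(q, dtype=float)


def run_length_encode(q):
    """Encode indices as (zero_run, value) pairs (hypothetical format).

    Each pair gives the number of zeros preceding a nonzero value.
    A trailing run of zeros is stored as (run, 0).
    """
    pairs, run = [], 0
    for v in q:
        v = int(v)
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        pairs.append((run, 0))
    return pairs


def run_length_decode(pairs):
    """Invert run_length_encode, returning a list of integer indices."""
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        if v != 0:
            out.append(v)
    return out


if __name__ == "__main__":
    coeffs = [3.7, -1.2, 0.3, 0.1, -0.2, 2.4, 0.0, 0.05]
    T = 2.0  # a larger step size gives smaller integers and more zeros
    q = quantize(coeffs, T)
    pairs = run_length_encode(q)
    print("indices:", q.tolist())
    print("pairs:  ", pairs)
    print("decoded:", dequantize(run_length_decode(pairs), T).tolist())
```

Increasing T in this sketch shrinks the range of integer indices and lengthens the zero runs, which is the mechanism by which a coarser step size lowers the data rate at the cost of larger reconstruction error.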
Other aspects & directions
• Stereo coding
– (L+R)/2 & L-R coding, expandable to multichannel
– Intensity + balance coding
– Mode switching – extra work for encoder only
• Lossless coding
– Easily achievable via integer transforms – exactly reversible via integer arithmetic
– example: lifting-based MLT (see Refs)
• Using complex subband decompositions (MCLT)
– Potential for more sophisticated auditory models
– Efficient encoding is an open problem

Audio coding standards
ISO/IEC   MPEG-1 Layer III (MP3) · MPEG-1 Layer II · MPEG-1 Layer I · AAC · HE-AAC · HE-AAC v2
ITU-T     G.711 · G.722 · G.722.1 · G.722.2 · G.723 · G.723.1 · G.726 · G.728 · G.729 · G.729.1 · G.729a
Others    AC3 · AMR · Apple Lossless · ATRAC · FLAC · iLBC · Monkey's Audio · μ-law · Musepack · Nellymoser · OptimFROG · RealAudio · RTAudio · SHN · Speex · Vorbis · WavPack · WMA · TAK
From http://en.wikipedia.org/wiki/Advanced_Audio_Coding

WMA examples:
• Original clip (~1,400 kbps), 64 kbps (MP3), 64 kbps (WMA)
• Original clip, WMA @ 32 kbps (Internet radio)
• More examples at http://www.microsoft.com/windows/windowsmedia/demos/audio_quality_demos.aspx

References
• S. Shlien, “The modulated lapped transform, its time-varying forms, and its applications to audio coding standards,” IEEE Trans. Speech and Audio Processing, vol. 5, pp. 359-366, July 1997.
• H. S. Malvar, “Fast Algorithms for Orthogonal and Biorthogonal Modulated Lapped Transforms,” IEEE Symposium Advances Digital Filtering and Signal Processing, Victoria, Canada, pp. 159–163, June 1998.
• H. S. Malvar, “Enhancing the performance of subband
