CELT: a Low-Latency, High-Quality Audio Codec

CELT: A Low-latency, High-quality Audio Codec Dr. Jean-Marc Valin, Gregory Maxwell, and Dr. Timothy B. Terriberry The Xiph.Org Foundation Outline ● Introduction and Motivation ● CELT Design ● libcelt ● Demo ● Conclusion 2 The Xiph.Org Foundation Introduction ● Two common types of lossy audio codecs – Speech/communication (G.72x, GSM, AMR, Speex) ● Low delay (15-30 ms) ● Low sampling rate (8 kHz to 16 kHz): limited fidelity ● No support for music – General purpose (MP3, AAC, Vorbis) ● High delay (> 100 ms) ● High sampling rates (44.1 kHz or higher) ● "CD-quality" music – We want both: high fidelity with very low delay 3 The Xiph.Org Foundation Introduction ● Low delay is critical to live interaction – Prevents collisions during conversation – Reduce need for echo cancellation ● Good for small, embedded devices without much CPU – High delay Low delay Higher sense of presence (~250 ms) (~15 ms) – Allows synchronization for live music ● Need less than 25 ms total delay to synchronize (Carôt 2006) ● Equivalent to sitting 8 m apart (farther requires a conductor) ● Lower delay in the codec increases range – 1 ms = 200 km in fiber 4 The Xiph.Org Foundation Introduction ● No entrenched standard in this space – G.722.1C (ITU-T) [40 ms delay, up to 32 kHz] – AAC-LD (MPEG) [20-50 ms delay, up to 48 kHz] – ULD (Fraunhofer) [< 10 ms delay, up to 48 kHz] ● CELT is already ahead of the competition – Delay: Configurable, 1.3 ms to 24 ms, ~8 ms typical – Quality (at equivalent rates): Much better than G.722.1C, as good as or better than AAC-LD, better than ULD – Flexibility: 24 kbps to 160+ kbps, 32 kHz to 96 kHz, configurable delay, low-complexity mode – Freedom: Open source (BSD), no patents 5 The Xiph.Org Foundation Outline ● Introduction ● CELT Design ● libcelt ● Demo ● Conclusion 6 The Xiph.Org Foundation CELT: "Constrained Energy Lapped Transform" ● Transform codec (MDCT, like MP3, Vorbis) – Short windows (~8 ms) → poor frequency resolution ● Explicitly code energy of each band of the signal – Coarse shape of sound preserved no matter what ● Code remaining details using vector quantization ● Also uses pitch prediction with a time offset – Similar to linear prediction used by speech codecs – Helps compensate for poor frequency resolution 7 The Xiph.Org Foundation Outline ● Introduction ● CELT Design – "Lapped Transform" – "Constrained Energy" – Coding Band Shape – Performance Tests ● libcelt ● Demo ● Conclusion 8 The Xiph.Org Foundation "Lapped Transform" Time-Frequency Duality ● Any signal can be represented as a weighted sum of cosine curves with different frequencies ● The Discrete Cosine Transform (DCT) computes the weights for each frequency 220 Hz (A3) 440 Hz (A4) 1245 Hz (D#5) 9 The Xiph.Org Foundation "Lapped Transform" Discrete Cosine Transform ● The "Discrete" in DCT means we're restricted to a finite number of frequencies – As the transform size gets smaller, energy "leaks" into nearby frequencies (harder to compress) N=48000 10 The Xiph.Org Foundation "Lapped Transform" Discrete Cosine Transform ● The "Discrete" in DCT means we're restricted to a finite number of frequencies – As the transform size gets smaller, energy "leaks" into nearby frequencies (harder to compress) N=4096 (Maximum Vorbis block size) 11 The Xiph.Org Foundation "Lapped Transform" Discrete Cosine Transform ● The "Discrete" in DCT means we're restricted to a finite number of frequencies – As the transform size gets smaller, energy "leaks" into nearby frequencies (harder to compress) N=1024 (Typical Vorbis block size) 12 The Xiph.Org Foundation "Lapped Transform" Discrete Cosine Transform ● The "Discrete" in DCT means we're restricted to a finite number of frequencies – As the transform size gets smaller, energy "leaks" into nearby frequencies (harder to compress) N=256 (CELT can use 64...512) 13 The Xiph.Org Foundation "Lapped Transform" Discrete Cosine Transform ● The "Discrete" in DCT means we're restricted to a finite number of frequencies – As the transform size gets smaller, energy "leaks" into nearby frequencies (unstable over time) N=256 (CELT can use 64...512) Frame 2... 14 The Xiph.Org Foundation "Lapped Transform" Discrete Cosine Transform ● The "Discrete" in DCT means we're restricted to a finite number of frequencies – As the transform size gets smaller, energy "leaks" into nearby frequencies (unstable over time) N=256 (CELT can use 64...512) Frame 3... 15 The Xiph.Org Foundation "Lapped Transform" Modified DCT ● The normal DCT causes coding artifacts (sharp discontinuities) between blocks, easily audible ● The "Modified" DCT (MDCT) uses a decaying window to overlap multiple blocks – Same transform used in MP3, Vorbis, AAC, etc. – But with much smaller blocks, less overlap 16 The Xiph.Org Foundation Outline ● Introduction ● CELT Design – "Lapped Transform" – "Constrained Energy" – Coding Band Shape – Performance Tests ● libcelt ● Demo ● Conclusion 17 The Xiph.Org Foundation "Constrained Energy" Critical Bands ● The human ear can hear about 25 distinct "critical bands" in the frequency domain – Psychoacoustic masking within a band is much stronger than between bands Threshold of detection in the presence of masker at 1kHz with a bandwidth of 1 critical band and various levels. Image blatantly stolen from http://www.tonmeister.ca/main/textbook/node331.html 18 The Xiph.Org Foundation "Constrained Energy" Critical Bands ● Group MDCT coefficients into bands approximating the critical bands (Bark scale) – We limit bands to contain at least 3 coefficients to minimize per-band overhead – Insufficient frequency resolution for all the bands – But we spend most of our bits on LFs anyway Bark Scale vs. CELT @ 48kHz, Frame Size=256 Bark CELT 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000 Frequency (Hz) 19 The Xiph.Org Foundation "Constrained Energy" Coding Band Energy ● Most important psychoacoustic lesson learned from Vorbis: Preserve the energy in each band ● Vorbis does this implicitly with its "floor curve" ● CELT codes the energy explicitly – Coarse energy (6 dB resolution), predicted from previous frame and from previous band ● Prediction saves 28 bits/frame, 5.6 kbps with 5 ms frames – Fine energy, improves resolution where we have available bits, not predicted 20 The Xiph.Org Foundation "Constrained Energy" Coding Band Energy ● CELT (green) vs original (red) – Notice the quantization between 8.5 kHz and 12 kHz – Speech is intelligible using coarse energy alone (~9 kbps for 5.3 ms frame sizes) 21 The Xiph.Org Foundation Outline ● Introduction ● CELT Design – "Lapped Transform" – "Constrained Energy" – Coding Band Shape – Performance Results ● libcelt ● Demo ● Conclusion 22 The Xiph.Org Foundation Coding Band Shape ● After normalizing, each band is represented by an N-dimensional unit vector – Point on an N-dimensional sphere – Describes "shape" of energy within the band ● Code this shape using two pieces: – An adaptive codebook using previously decoded signal content to predict the current frame – A fixed codebook to handle the part of the signal that can't be predicted (the "innovation") ● Latter uses vector quantization 23 The Xiph.Org Foundation Coding Band Shape Vector Quantization ● Approximates a multidimensional distribution with a finite number of codewords (vectors) Scalar Quantization (2 bits/dim) Vector Quantization (2 bits/dim) RMS error = 0.89 RMS error = 0.71 (20% better) 24 The Xiph.Org Foundation Coding Band Shape Vector Quantization ● Easily scales to less than 1 bit per dimension (very important for HF bands: 50-200 dims) Scalar Quantization (0.5 bits/dim) Vector Quantization (0.5 bits/dim) RMS error = 2.93 RMS error = 1.63 (44% better) 25 The Xiph.Org Foundation Coding Band Shape Algebraic Vector Quantization ● CELT requires a lot of codebooks – Every band can have a different # of dimensions – Exact number of bits available for each band varies from packet to packet ● CELT requires large codebooks – Exponential in # of dimensions: 50 dims at 0.6 bits/ dim. requires over a billion codebook entries – We couldn't even store one codebook that large – And even if we could, it'd take forever to search ● But we have uniformly distributed unit vectors 26 The Xiph.Org Foundation Coding Band Shape Algebraic Vector Quantization ● Use a regularly structured, algebraic codebook: Pyramid Vector Quantization (Fischer, 1986) – We want evenly distributed points on a sphere ● Don't know how to do that for arbitrary dimension – Use evenly distributed points on a pyramid instead ● For N dimensional vector, allocate K "pulses" ● Codebook consists vectors with integer coordinates whose magnitudes sum to K 27 The Xiph.Org Foundation Coding Band Shape Pyramid Vector Quantization ● PVQ codebook has a fast enumeration algorithm – Converts between vector and integer codebook index – O(N+K) (lookup table, muls) or simpler O(NK) (adds) – Latter great for embedded processors, often faster ● Fast codebook search algorithm: O(N·min(N,K)) – Divide by L1 norm, round down: at least K-N pulses – Place remaining pulses (at most N) one at a time ● Codebooks larger than 32 bits – Split the vector in half and code each half separately 28 The Xiph.Org Foundation Coding Band Shape Pitch Prediction ● Short block sizes → poor frequency resolution – Speech/music have periodic, stationary content – Can't represent the period accurately via short MDCT ● Pitch prediction compensates for poor resolution – Search the past 1024 decoded samples in the time domain, code the offset of the best match ● Resolution equal to the sampling rate 48000 48000 ● Range (48 kHz,

CELT: a Low-Latency, High-Quality Audio Codec

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support