Itu-T G.719: a New Low-Complexity Full-Band (20 Khz) Audio Coding Standard for High-Quality Conversational Applications
Total Page:16
File Type:pdf, Size:1020Kb
2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics O ctober 18-21, 2009, New Paltz, NY ITU-T G.719: A NEW LOW-COMPLEXITY FULL-BAND (20 KHZ) AUDIO CODING STANDARD FOR HIGH-QUALITY CONVERSATIONAL APPLICATIONS Minjie Xie∗ and Peter Chu Anisse Taleb and Manuel Briand Polycom, Inc. Ericsson Research – Multimedia Technologies 10 Mall Road, #110, Burlington, MA 01803, USA Färögatan 6, Kista, Sweden [email protected] {anisse.taleb, manuel.briand}@ericsson.com ABSTRACT Qualication phase was completed in June 2007 and two candidates submitted respectively by Ericsson AB and Polycom, This paper describes a new low-complexity full-band (20 kHz) Inc. were qualied. In September 2007, the two proponents audio coding algorithm which has been recently standardized announced their collaboration in an Optimization/ by ITU-T as Recommendation G.719. The algorithm is Characterization phase and agreed to jointly develop a candidate designed to provide 20 Hz - 20 kHz audio bandwidth using a 48 codec. The joint candidate codec met all the requirements in the kHz sample rate, operating at 32 - 128 kbps. This codec subjective quality tests for the G.722.1FB Optimization/ features very high audio quality and low computational Characterization phase in April 2008. In June 2008, the joint complexity and is suitable for use in applications such as codec was adopted as ITU-T Recommendation G.719 which is videoconferencing, teleconferencing, and streaming audio over the rst full-band audio coding standard of ITU-T [1]. the Internet. Subjective test results from the Optimization/ In this paper we rst give an overview of the ITU-T G.719 Characterization phase of G.719 are also presented in the paper. codec, followed by a description of two important modules of Index Terms— Audio coding, full-band, low the codec - adaptive time-frequency transform and quantization complexity, adaptive time-frequency transform, fast of transform coecients. We also present the evaluation of the lattice vector quantization computational complexity and memory requirements of the codec. Finally, subjective test results from the Optimization/ 1. INTRODUCTION Characterization phase are summarized. In hands-free videoconferencing and teleconferencing markets, 2. OVERVIEW OF THE G.719 CODEC there is strong and increasing demand for audio coding providing the full human auditory bandwidth of 20 Hz to 20 The G.719 codec is a low-complexity transform-based audio kHz. This is because: codec and can provide an audio bandwidth of 20 Hz to 20 kHz • Conferencing systems are increasingly used for more at 32 - 128 kbps. The codec operates on frames of 20 ms elaborate presentations, often including music and sound corresponding to 960 samples at a sampling rate of 48 kHz and eects (i.e. animal sounds, musical instruments, vehicles or has an algorithmic delay of 40 ms. The codec features very high nature sounds, etc.) which occupy a wider audio band than audio quality and extremely low computational complexity speech. Presentations involve remote education of music, compared to other state-of-the-art audio coding algorithms. playback of audio and video from DVDs and VCRs, audio/video clips from PCs, and elaborate audio-visual 2.1 Encoder presentations from, for example, PowerPoint. • Users perceive the bandwidth of 20 Hz to 20 kHz as Norm Norm quantization Norm estimation adjustment representing the ultimate goal for audio bandwidth. The and coding resulting market pressures aer causing a shift in this Transient direction, now that sucient IP bitrate and audio coding detector Bit allocation technology are available to deliver this. As with any audio codec for hands-free videoconferencing Spectrum FLVQ Transform use, the requirements include: Input signal normalization and coding Fs = 48 kHz • Low latency (support natural conversation) Noise level • Low complexity (free cycles for video processing and other adjustment audio processing; reduce cost) Quantization Noise • High quality on all signal types Transient indices level signaling Human adjustment To meet the market need for such a full-band audio coding code index standard, ITU-T Q10/SG16 launched the standardization of MUX Low-Complexity Full-Band Audio Coding Extension to G.722.1 (G.722.1FB) in November 2006. The ITU-T G.722.1FB Figure 1: Block diagram of the G.719 encoder. ∗ Minjie Xie was with Polycom, Inc., when this work was done. He is now with Huawei Technologies (USA). 978-1-4244-3679-8/09/$25.00 ©2009 IEEE 265 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 18-21, 2009, New Paltz, NY A block diagram of the G.719 encoder is shown in Figure 1. coefficients and regenerated transform coefficients are mixed The input signal sampled at 48 kHz is processed through a and lead to normalized spectrum. The decoded spectral transient detector. Depending on the detection of a transient, envelope is applied to the decoded full-band spectrum. Finally, indicated by a flag IsTransient, a high frequency resolution or a the inverse transform is applied to recover the time-domain low frequency resolution transform is applied on the input signal decoded signal. This is performed by applying either the frame. The adaptive transform is based on a modified discrete inverse MDCT for stationary mode, or the inverse of the higher cosine transform (MDCT) [2] in case of stationary frames. For temporal resolution transform for transient mode. transient frames, the MDCT is modified to obtain a higher temporal resolution without a need for additional delay and with 3. ADAPTIVE TIME-FREQUENCY TRANSFORM very little overhead in complexity. Transient frames have a temporal resolution equivalent to 5 ms frames. The adaptive time-frequency transform is based on the detection The obtained transform coefficients are grouped into bands of a transient. In the case of stationary signals, the transform of unequal lengths as 8, 16, 24, or 32. As the bandwidth is 20 has a high frequency resolution which is able to efficiently kHz, only 800 transform coefficients are used. The 160 represent stationary sounds. In the case of transient sounds, the transform coefficients representing frequencies above 20 kHz time-frequency transform will increase its time resolution and are ignored. The norm or power of each band is estimated and allows a better representation of the rapid changes in the input the resulting spectral envelope consisting of the norms of all signal characteristics. These two modes of operations share a bands is scalar quantized and encoded. The coefficients are common buffering and windowing module and the switching then normalized by the quantized norms. The quantized norms between one mode of operation to the other is instantaneous. are further adjusted based on adaptive spectral weighting and Thus no additional look-ahead in the transient detection is used as input for bit allocation. An adaptive bit-allocation needed. This allows the codec to have selectable time resolution scheme based on the quantized norms of the bands is used to at low complexity and with zero additional delay. assign the available bits in a frame among the bands. The If the input signal is detected as stationary, a type IV number of bits assigned to each transform coefficient can be as discrete cosine transform (DCTIV) [2] is applied on the output ~ large as 9 bits depending on the input signal. The normalized nx )( of the time domain aliasing (TDA) operation. The DCTIV transform coefficients are lattice vector quantized according to of the time aliased signal is defined by the following equation: the allocated bits for each band. Huffman coding is applied to the quantization indices for both the coded norms and transform 959 π coefficients. The norm of the non-coded transform coefficients ~ ⎡⎛ 1 ⎞⎛ 1 ⎞ ⎤ )( = cos)( ⎢⎜ + ⎟⎜knnxky + ⎟ ⎥ , k = 0, 1, …, 959 (1) is estimated, coded, and transmitted to the decoder. ∑ 2 9602 n=0 ⎣⎝ ⎠⎝ ⎠ ⎦ 2.2 Decoder The resulting signal y(k) represents the transform coefficients of the input frame. It should be noted that in stationary mode, the cascade of windowing, TDA, and DCT is equivalent to DEMUX IV applying the modulated lapped transform (MLT) [2]. Transient signaling Transient Transient signaling Transient adjustment index adjustment FLVQ indices If the signal is detected as a transient, a further re-ordering Norm indices Norm Noise level of the time-domain aliased signal is performed. The basis functions of the resulting filter-bank would have an incoherent time and frequency responses without re-ordering. The re- ordering operation consists of shuffling the upper and lower half ~ Audio signal of the TDA output signal nx )( as follows: FLVQ Spectral-fill Envelope Inverse Fs = 48 kHz Decoding Generator + Shaping Transform )( ~ −= nxnv )959( , n = 0, 1, …, 959. (2) This re-ordering is only conceptual and in reality no Figure 2: Block diagram of the G.719 decoder. computations are involved. Higher time resolution is obtained by zero padding the signal v(n) and dividing the resulting signal A block diagram of the decoder is shown in Figure 2. The into four overlapped equal length sub-frames. The amount of transient flag is first decoded which indicates the frame zero-padding is equal to 120 on each side of the signal. The configuration, i.e. stationary or transient. The spectral envelope segments are 50% overlapped and each segment has a length is then decoded and the same, bit-exact, norm adjustment and equal to 480. The two inner segments are post-windowed using bit-allocation algorithms are used at the decoder to re-compute a sine window of length 480. The windows for outer segments the bit-allocation which is essential for decoding quantization are constructed using half a sine window. indices of the normalized transform coefficients. Each resulting post-windowed segment is further processed After decoding the transform coefficients, the non-coded by applying the MDCT, i.e.