2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics O ctober 18-21, 2009, New Paltz, NY

ITU-T G.719: A NEW LOW-COMPLEXITY FULL-BAND (20 KHZ) AUDIO CODING STANDARD FOR HIGH-QUALITY CONVERSATIONAL APPLICATIONS

Minjie Xie∗ and Peter Chu Anisse Taleb and Manuel Briand

Polycom, Inc. Ericsson Research – Multimedia Technologies 10 Mall Road, #110, Burlington, MA 01803, USA Färögatan 6, Kista, Sweden [email protected] {anisse.taleb, manuel.briand}@ericsson.com

ABSTRACT Qualication phase was completed in June 2007 and two candidates submitted respectively by Ericsson AB and Polycom, This paper describes a new low-complexity full-band (20 kHz) Inc. were qualied. In September 2007, the two proponents audio coding algorithm which has been recently standardized announced their collaboration in an Optimization/ by ITU-T as Recommendation G.719. The algorithm is Characterization phase and agreed to jointly develop a candidate designed to provide 20 Hz - 20 kHz audio bandwidth using a 48 . The joint candidate codec met all the requirements in the kHz sample rate, operating at 32 - 128 kbps. This codec subjective quality tests for the G.722.1FB Optimization/ features very high audio quality and low computational Characterization phase in April 2008. In June 2008, the joint complexity and is suitable for use in applications such as codec was adopted as ITU-T Recommendation G.719 which is videoconferencing, teleconferencing, and streaming audio over the rst full-band audio coding standard of ITU-T [1]. the Internet. Subjective test results from the Optimization/ In this paper we rst give an overview of the ITU-T G.719 Characterization phase of G.719 are also presented in the paper. codec, followed by a description of two important modules of Index Terms— Audio coding, full-band, low the codec - adaptive time-frequency transform and quantization complexity, adaptive time-frequency transform, fast of transform coecients. We also present the evaluation of the lattice vector quantization computational complexity and memory requirements of the codec. Finally, subjective test results from the Optimization/ 1. INTRODUCTION Characterization phase are summarized. In hands-free videoconferencing and teleconferencing markets, 2. OVERVIEW OF THE G.719 CODEC there is strong and increasing demand for audio coding providing the full human auditory bandwidth of 20 Hz to 20 The G.719 codec is a low-complexity transform-based audio kHz. This is because: codec and can provide an audio bandwidth of 20 Hz to 20 kHz • Conferencing systems are increasingly used for more at 32 - 128 kbps. The codec operates on frames of 20 ms elaborate presentations, often including music and sound corresponding to 960 samples at a sampling rate of 48 kHz and e ects (i.e. animal sounds, musical instruments, vehicles or has an algorithmic delay of 40 ms. The codec features very high nature sounds, etc.) which occupy a wider audio band than audio quality and extremely low computational complexity speech. Presentations involve remote education of music, compared to other state-of-the-art audio coding algorithms. playback of audio and from DVDs and VCRs, audio/video clips from PCs, and elaborate audio-visual 2.1 Encoder presentations from, for example, PowerPoint. • Users perceive the bandwidth of 20 Hz to 20 kHz as Norm Norm quantization Norm estimation adjustment representing the ultimate goal for audio bandwidth. The and coding

resulting market pressures aer causing a shift in this Transient direction, now that sucient IP bitrate and audio coding detector Bit allocation technology are available to deliver this.

As with any audio codec for hands-free videoconferencing Spectrum FLVQ Transform use, the requirements include: Input signal normalization and coding Fs = 48 kHz • Low latency (support natural conversation) Noise level • Low complexity (free cycles for video processing and other adjustment audio processing; reduce cost) Quantization Noise • High quality on all signal types Transient indices level signaling Hu man adjustment To meet the market need for such a full-band audio coding code index standard, ITU-T Q10/SG16 launched the standardization of MUX Low-Complexity Full-Band Audio Coding Extension to G.722.1 (G.722.1FB) in November 2006. The ITU-T G.722.1FB Figure 1: Block diagram of the G.719 encoder.

∗ Minjie Xie was with Polycom, Inc., when this work was done. He is now with Huawei Technologies (USA).

978-1-4244-3679-8/09/$25.00 ©2009 IEEE 265 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 18-21, 2009, New Paltz, NY

A block diagram of the G.719 encoder is shown in Figure 1. coefficients and regenerated transform coefficients are mixed The input signal sampled at 48 kHz is processed through a and lead to normalized spectrum. The decoded spectral transient detector. Depending on the detection of a transient, envelope is applied to the decoded full-band spectrum. Finally, indicated by a flag IsTransient, a high frequency resolution or a the inverse transform is applied to recover the time-domain low frequency resolution transform is applied on the input signal decoded signal. This is performed by applying either the frame. The adaptive transform is based on a modified discrete inverse MDCT for stationary mode, or the inverse of the higher cosine transform (MDCT) [2] in case of stationary frames. For temporal resolution transform for transient mode. transient frames, the MDCT is modified to obtain a higher temporal resolution without a need for additional delay and with 3. ADAPTIVE TIME-FREQUENCY TRANSFORM very little overhead in complexity. Transient frames have a temporal resolution equivalent to 5 ms frames. The adaptive time-frequency transform is based on the detection The obtained transform coefficients are grouped into bands of a transient. In the case of stationary signals, the transform of unequal lengths as 8, 16, 24, or 32. As the bandwidth is 20 has a high frequency resolution which is able to efficiently kHz, only 800 transform coefficients are used. The 160 represent stationary sounds. In the case of transient sounds, the transform coefficients representing frequencies above 20 kHz time-frequency transform will increase its time resolution and are ignored. The norm or power of each band is estimated and allows a better representation of the rapid changes in the input the resulting spectral envelope consisting of the norms of all signal characteristics. These two modes of operations share a bands is scalar quantized and encoded. The coefficients are common buffering and windowing module and the switching then normalized by the quantized norms. The quantized norms between one mode of operation to the other is instantaneous. are further adjusted based on adaptive spectral weighting and Thus no additional look-ahead in the transient detection is used as input for bit allocation. An adaptive bit-allocation needed. This allows the codec to have selectable time resolution scheme based on the quantized norms of the bands is used to at low complexity and with zero additional delay. assign the available bits in a frame among the bands. The If the input signal is detected as stationary, a type IV number of bits assigned to each transform coefficient can be as discrete cosine transform (DCTIV) [2] is applied on the output ~ large as 9 bits depending on the input signal. The normalized nx )( of the time domain aliasing (TDA) operation. The DCTIV transform coefficients are lattice vector quantized according to of the time aliased signal is defined by the following equation: the allocated bits for each band. is applied to the quantization indices for both the coded norms and transform 959 π coefficients. The norm of the non-coded transform coefficients ~ ⎡⎛ 1 ⎞⎛ 1 ⎞ ⎤ )( = cos)( ⎢⎜ + ⎟⎜knnxky + ⎟ ⎥ , k = 0, 1, …, 959 (1) is estimated, coded, and transmitted to the decoder. ∑ 2 9602 n=0 ⎣⎝ ⎠⎝ ⎠ ⎦

2.2 Decoder The resulting signal y(k) represents the transform coefficients of the input frame. It should be noted that in stationary mode, the cascade of windowing, TDA, and DCT is equivalent to DEMUX IV applying the modulated lapped transform (MLT) [2]. Transient signaling Transient signaling adjustment index

FLVQ indices FLVQ If the signal is detected as a transient, a further re-ordering Norm indices Noise level Noise of the time-domain aliased signal is performed. The basis functions of the resulting filter-bank would have an incoherent time and frequency responses without re-ordering. The re- ordering operation consists of shuffling the upper and lower half ~ Audio signal of the TDA output signal nx )( as follows: FLVQ Spectral-fill Envelope Inverse Fs = 48 kHz Decoding Generator + Shaping Transform )( ~ −= nxnv )959( , n = 0, 1, …, 959. (2)

This re-ordering is only conceptual and in reality no Figure 2: Block diagram of the G.719 decoder. computations are involved. Higher time resolution is obtained by zero padding the signal v(n) and dividing the resulting signal A block diagram of the decoder is shown in Figure 2. The into four overlapped equal length sub-frames. The amount of transient flag is first decoded which indicates the frame zero-padding is equal to 120 on each side of the signal. The configuration, i.e. stationary or transient. The spectral envelope segments are 50% overlapped and each segment has a length is then decoded and the same, bit-exact, norm adjustment and equal to 480. The two inner segments are post-windowed using bit-allocation algorithms are used at the decoder to re-compute a sine window of length 480. The windows for outer segments the bit-allocation which is essential for decoding quantization are constructed using half a sine window. indices of the normalized transform coefficients. Each resulting post-windowed segment is further processed After decoding the transform coefficients, the non-coded by applying the MDCT, i.e. time aliasing followed by DCTIV. transform coefficients (allocated zero bits) in low frequencies The output of the MDCT for each segment represents the signal are regenerated by using a spectral-fill codebook built from the spectrum at different time instants, thus in case of transients, a decoded transform coefficients. The noise level adjustment higher time resolution is used. The length of the output of each index is used to adjust the level of the regenerated coefficients. of the four MDCT is half the length of the input segment, i.e. The non-coded transform coefficients in high frequencies are 240. Therefore, this operation does not introduce additional regenerated using bandwidth extension. The decoded transform redundancy.

266 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 18-21, 2009, New Paltz, NY

4. TRANSFORM COEFFICIENT QUANTIZATION and consists of the points falling on concentric spheres of radius Each band consists of one or more vectors of 8-dimensional 22r centered at the origin, where r = 0, 1, 2, ... . The set of transform coefficients and the coefficients are normalized by the points on a sphere constitutes a spherical code and can be used quantized norm. All 8-dimensional vectors belonging to one as a quantization codebook. The LRQ codebook of 256 band are assigned the same number of bits for quantization. A codewords is stored in a structured look-up table so that a fast fast lattice vector quantization (FLVQ) scheme [3] is used to searching method and a fast indexing method are developed to quantize the normalized coefficients in 8 dimensions. In FLVQ reduce the computational complexity [3]. the quantizer comprises two sub-quantizers: a D8-based higher- rate lattice vector quantizer (HRQ) and an RE8-based lower-rate 5. COMPLEXITY OF THE G.719 CODEC lattice vector quantizer (LRQ). HRQ is a multi-rate quantizer designed to quantize the The computational complexity of the G.719 codec in 16/32-bit transform coefficients at rates of 2 up to 9 bit/coefficient and its fixed-point is estimated by encoding and decoding the source material used for the subjective tests of the G.719 Optimization/ codebook is based on the so-called Voronoi code for the D8 lattice [4]. D is a well-known lattice and defined as: Characterization phase. The observed average and worst-case 8 complexity in units of WMOPS [6] are shown in Table 1 for 8 ∈ different bitrates. Storage requirements are reported in Table 2. D8 = {(y1, y2, y3, y4, y5, y6, y7, y8) Z8 | ∑ yi = even} (3) i=1 Table 1: Computational complexity of the G.719 codec. where Z8 is the lattice which consists of all points with integer Bitrate Encoder Decoder Encoder + Decoder coordinates. It can be seen that D8 consists of the points having integer coordinates with an even sum. The codebook of HRQ is (kbps) Avg. Max Avg. Max Avg. Max constructed from a finite region of the D8 lattice and is not 32 6.663 7.996 6.876 7.413 13.539 15.397 stored in memory. The codewords are generated by a simple algebraic method and a fast quantization algorithm is used. 48 7.427 9.073 7.230 7.806 14.657 16.861

To minimize the distortion for a given rate, the D8 lattice 64 7.912 9.899 7.554 8.161 15.466 18.060 should be truncated and scaled. Actually, the input vectors 80 8.303 10.564 7.865 8.496 16.169 19.026 instead of the lattice codebook are scaled in order to use the fast 96 8.555 10.796 8.122 8.757 16.677 19.536 searching and indexing algorithms proposed by Conway and Sloane [4]. However, the fast searching algorithm introduced in 112 8.787 10.881 8.368 9.009 17.155 19.890 [4] assumes an infinite lattice which can not be used as the 128 8.980 11.775 8.610 9.225 17.590 21.000 codebook in real-time audio coding systems. In other words, for a given rate the algorithm can not be used to quantize the input Table 2: Memory usage of the G.719 codec. vectors lying outside the truncated lattice region. To resolve Memory type Encoder Decoder Encoder + Decoder this problem, a fast method for quantizing “outliers” xo is developed and described as follows: Static RAM 1 1.0 3.9 4.9

1 1) Scale down the vector xo by 2: xo = xo / 2. Scratch RAM 12.2 12.2 24.4 2) Find the nearest D8 point u to xo and then compute the Data ROM 1 8.3 8.9 10.7 index vector j of u using the indexing algorithm in [4]. Program ROM 2 1.2 1.2 1.8 3) Find the codeword y from the index vector j and then compare y with u. If y is different from u, repeat Steps 1) 1 In units of 16-b kWords 2 In units of 1000’s basic operators to 3). Otherwise, compute w = xo / 16. Note that due to the normalization of transform coefficients, a few iterations are required to find the best codeword. 6. SUBJECTIVE PERFORMANCE OF G.719 4) Compute xo = xo + w. Subjective tests for the ITU-T G.719 (ex. G.722.1FB) 5) Find the nearest D8 point u to xo and then compute the index vector j of u using the indexing algorithm in [4]. Optimization/Characterization phase were performed from mid February through early April 2008 by independent listening 6) Find the codeword y from the index vector j and then laboratories in American English, French, and Spanish. compare y with u. If y and u are exactly same, k = j and According to a test plan designed by ITU-T Q7/SG12 experts, repeat Steps 4) to 6). Otherwise, k is the index of the best the joint candidate codec conducted two experiments as follows: codeword to xo and stop. • Experiment 1: Speech (clean, reverberant, and noisy) It has been observed that the 8-dimensional coefficient • Experiment 2: Mixed content and music vectors have a high concentration of probability around the origin. Therefore, Huffman coding is optionally tried for the Mixed content items are representative of advertisement, film quantization indices of HRQ to increase efficiency of trailers, news with jingles, music with announcements, etc., and quantization when the rate is smaller than 6 bit/coefficient. contain speech, music, and noise. Each experiment used the LRQ is a 1-bit quantizer and uses spherical codes based on “triple stimulus/hidden reference/double blind test method” described in ITU-R Recommendation BS.1116-1. A standard the rotated Gosset lattice RE8 [5] as the codebook. The RE8 lattice is defined as MPEG audio codec, LAME MP3 version 3.97 as found on the LAME website as of September 2006 [7] was used as the RE8 = 2 D8 U {2 D8 + (1, 1, 1, 1, 1, 1, 1, 1)} (4) reference codec in the subjective tests. The ITU-T requirement

267 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 18-21, 2009, New Paltz, NY

was that the G.719 candidate codec at 32, 48, and 64kbps be proven “Not Worse Than” the reference codec at 40, 56, and 64 Mixed-content and Music (Spanish) kbps, respectively, with a 95% statistical confidence level. In 5.0 addition, the G.719 candidate codec at 64 kbps was also tested 4.5 4.0 against the G.722.1C at 48 kbps for Experiment 2. 3.5 The subjective test results for the G.719 codec are shown in 3.0 DMOS Figures 3-6. Statistical analysis of the results showed that the 2.5 G.719 codec met all performance requirements specified for the 2.0 subjective Optimization/Characterization test. For experiment 1 1.5 the G.719 codec was better than the reference codec at all bit 1.0 Ref Ref Ref

rates in both languages. For experiment 2 the G.719 codec is G.719 G.719 G.719 G.719 40kbps 32kbps 56kbps 48kbps 64kbps 64kbps 48kbps 64kbps better than the reference codec at the lowest for all the G.722.1C items and at the two other bitrates for most of the items. Figure 6: Subjective test results – Experiment 2 (Spanish). An additional subjective listening test for the G.719 codec was conducted later to evaluate the quality of the codec at rates higher than those described in the ITU-T test plan. Because the Additional Subjective Test Results (DMOS) quality expectation of the codec at these high rates is high, a 5.5 5.0 pre-selection of critical items, for which the quality at the lower 4.5 bitrate range was most degraded, was conducted prior to testing. 4.0 3.5 The test results are shown in Figure 7. It has been proven that 3.0 transparency was reached for critical material at 128 kbps. 2.5 2.0 1.5 Speech (North American English) 1.0 5.0 4.5 G.719 G.719 G.719 G.719 G.719 Source 32kbps 48kbps 64kbps 80kbps 96kbps G.719 G.719 4.0 Source Source Source Source Source Source 3.5 112kbps 128kbps 3.0

DMOS 2.5 2.0 Figure 7: Additional subjective test results. 1.5 1.0 7. CONCLUSION Ref 64kbpsRef Ref 40kbpsRef 56kbpsRef Ref 40kbps Ref 56kbps Ref 64kbps Ref Ref 40kbps Ref 56kbps Ref 64kbps Ref G.719 32kbps G.719 32kbps G.719 48kbps G.719 64kbps G.719 G.719 64kbps G.719 48kbps G.719 64kbps G.719 32kbps G.719 48kbps G.719 This paper has presented the 20 kHz full-band audio coding Clean Reverberant Reverberant and Noisy algorithm and subjective Optimization/Characterization test results of ITU-T Recommendation G.719. The G.719 codec Figure 3: Subjective test results – Experiment 1 (English). features very high audio quality, extremely low computational complexity, and low algorithmic delay compared to other state- Speech (French) of-the-art audio coding codecs. Subjective test results show that 5.0 4.5 the G.719 codec achieves transparent audio quality at 128 kbps. 4.0 The main intended applications for this codec are 3.5 3.0 videoconferencing and teleconferencing, as well as Internet

DMOS 2.5 2.0 streaming. 1.5 1.0 8. REFERENCES [1] ITU-T Recommendation G.719, “Low-complexity full- Ref 40kbpsRef 56kbpsRef Ref 56kbpsRef 64kbpsRef Ref 64kbps Ref 40kbps Ref 64kbps Ref 40kbps Ref Ref 56kbps G.719 32kbps G.719 48kbps G.719 G.719 64kbps G.719 32kbps G.719 48kbps G.719 64kbps G.719 32kbps G.719 G.719 48kbps G.719 64kbps G.719 band audio coding for high-quality conversational Clean Reverberant Reverberant and Noisy applications,” June 2008. Figure 4: Subjective test results – Experiment 1 (French). [2] H. S. Malvar, Signal Processing with Lapped Transforms. Norwood, MA: Artech House, 1992. [3] M. Xie, “Lattice Vector Quantization and its Applications Mixed-content and Music (North American English) 5.0 in Speech and Audio Coding,” in Speech Processing in 4.5 Modern Communication: Challenges and Perspectives. 4.0 Springer, To appear. 3.5 [4] J. H. Conway and N. J. A. Sloane, Sphere Packings, 3.0

DMOS Lattices and Groups. Springer-Verlag, New York, 1988. 2.5 2.0 [5] C. Lamblin and J.-P. Adoul, “Algorithme de quantification 1.5 vectorielle sphérique à partir du réseau de Gosset d’ordre 1.0 8,” Annales des Télécommunications, pp. 172-186, 1988. [6] ITU-T Recommendation G.191, “Software tools for speech Ref Ref Ref G.719 G.719 G.719 G.719 40kbps 32kbps 56kbps 48kbps 64kbps 64kbps 48kbps 64kbps

G.722.1C and audio coding standardization,” Sept. 2005.

Figure 5: Subjective test results – Experiment 2 (English). [7] http://sourceforge.net/project/showfiles.php?group_id=290 &package_id=309.

268