High-quality Speech Coding

Speech Coding Super-Wideband VoIP


Kei Kikuiri, Nobuhiko Naka and Shinya Abe

The high-quality speech coding is a super-wideband speech coding technique for providing high-quality VoIP services in a high-speed wireless communication environment, and was developed in collaboration with DoCoMo Communications Laboratories USA, Inc. This technology encodes and transmits nearly all of the frequency components of the human voice to allow easily-understood communication with a sense of presence.

NTT DoCoMo Technical Journal Vol. 9 No.2

1. Introduction

Research topics in speech coding technologies have shifted from encoding algorithms for narrowband (300 Hz to 3.4 kHz) speech signals at around 10 kbit/s to encoding algorithms for speech signals having wider bandwidth at about 20 to 64 kbit/s. Current narrowband speech codecs, such as Adaptive Multi-Rate (AMR)*1[1], provide speech quality that is almost equivalent to that of the legacy telephony networks, so no further improvement can be expected. In addition, a high compression ratio for narrowband speech signals is less important in VoIP services, where high-speed packet access lines are typically available.

For a perceivable improvement in speech quality, i.e., quality better than that of the legacy telephony services, it is necessary to extend the speech bandwidth. Transmitting the low- and high-frequency components that are not included in narrowband speech produces a decoded speech signal that is close to the original and sounds more natural. In fact, a number of VoIP services adopt the wideband (50 Hz to 7 kHz) or super-wideband (upper frequency above 7 kHz) speech codecs that have previously been used mainly for telephony.

In mobile communication, wireless VoIP services including PASSAGE DUPLE and Smartphones*2 such as the hTc Z terminal have appeared on the market, and high-speed wireless packet access networks using Super 3G*3 and Fourth-Generation mobile communication systems are also underway. These conditions show that speech communication with wideband or super-wideband codecs will be feasible in mobile environments.

We propose a high-quality speech coding technique that was developed in collaboration with DoCoMo Communications Laboratories USA, Inc. This is a super-wideband speech codec that is intended to provide better speech quality than wideband speech codecs for future VoIP services via high-speed wireless packet access networks. This codec handles speech signals with upper frequencies from 10 to 16 kHz, which is about three times or more that of narrowband, and encodes the signal at 48 to 64 kbit/s. It also requires computational complexity comparable with that of a conventional narrowband speech codec.

This technology will enable realistic conversation over mobile terminals with speech quality close to the natural human voice. This advantage will be especially helpful in mobile applications where voice quality is particularly critical, such as teleconferencing and remote education.

In this article, we present an overview of the developed high-quality speech coding algorithm and quality evaluation results. We also introduce mobile VoIP prototype software that implements this coding technology.

2. High-quality Speech Coding

2.1 Overview of Speech Coding for VoIP

The specifications of speech codecs used in VoIP services and applications are shown in Table 1. G.711*4[2] is the speech coding technique used in the fixed network as well as many VoIP services. Because it uses the Pulse-Code Modulation (PCM)*5 algorithm, the algorithmic delay*6 is only 0.125 ms (1 sample). A packet loss concealment*7 algorithm has also been recommended together with G.711, which makes it robust against errors such as packet loss.

The G.729 Annex A*8 (G.729A)[3] codec is a low-complexity version of the G.729*8[4] used in NTT DoCoMo's Hypertalk®*9 service. Like other speech codecs for mobile communication, it adopts the Code Excited Linear Prediction (CELP)*10 algorithm to model the human voice production mechanism, and achieves speech quality close to that of the fixed network at 8 kbit/s. NTT DoCoMo's PASSAGE DUPLE uses G.711 and G.729A.

G.722.1*11[5] is a wideband speech codec that is mainly used in video telephony, but has recently come into use in VoIP applications as well. Its extended version, G.722.1 Annex C*11 (G.722.1C)[5], is designed for 14-kHz bandwidth speech signals. In comparison with other super-wideband speech coding, it allows a lower bit rate and lower computational complexity, although it has a longer algorithmic delay of 40 ms.

Advanced Audio Coding-Low Delay (AAC-LD)*12[6], which is a low-delay version of the AAC*12[6] codec used in music distribution, is applicable to two-way communication. It requires high computational complexity for the precise modeling of the auditory property*13 used in audio codecs. G.722.1 and AAC-LD adopt a transform coding algorithm that converts the time domain signal into a frequency domain representation. This algorithm does not use a voice production model, so it can handle audio signals other than speech.

Table 1 VoIP codec specifications

                          G.711       G.729 Annex A  G.722.1           G.722.1 Annex C   AAC-LD
Sampling frequency (kHz)  8           8              16                32                32 to 48
Bit rate (kbit/s)         64          8              24, 32            24, 32, 48        32 to 64
Frame length (ms)         -           10             20                20                10 to 16
Algorithmic delay (ms)    0.125       15             40                40                20 to 32
Speech bandwidth          Narrowband  Narrowband     Wideband          Super-wideband    Super-wideband
Coding algorithm          PCM         CELP           Transform coding  Transform coding  Transform coding

2.2 High-quality Speech Coding Specifications

The specifications of the high-quality speech coding we developed are shown in Table 2. By setting a sampling frequency of 22.05 kHz or higher, this technique extends the upper frequency limit for speech coding to between 10 and 16 kHz, which covers nearly all of the frequency components of the human voice. In addition, the bit rate can be set freely between 48 and 64 kbit/s according to the types of networks and applications. The values of frame length and algorithmic delay in the table are the respective values for operation at 22.05 kHz and 32 kHz.

This technique is based on a transform coding algorithm like G.722.1 and AAC-LD (Figure 1). In the encoder, the Modified Discrete Cosine Transform (MDCT) is used to convert the time domain speech signal into frequency spectrum coefficients (transform coefficients). The MDCT is a lapped transform that overlaps with the adjacent transform blocks and prevents distortion at the block boundaries without any redundant data. This technique adopts a 256-sample MDCT (8 ms at a sampling frequency of 32 kHz) to achieve an algorithmic delay comparable to that of G.729A and other such conventional speech coding techniques and to suppress pre-echo*14, a distortion that is specific to the transform coding algorithm. Next, it quantizes and encodes the transform coefficients, weighted according to the human auditory property, at low computational complexity. Finally, the encoded data are packed frame by frame (frame packing) and sent to the transmission channel. It is also possible to introduce a super-frame structure, which improves coding efficiency by using the correlation between frames, and packet loss concealment that utilizes information from the preceding and succeeding frames.

The decoder on the receiving side unpacks the received frame information into encoded data (frame unpacking), and then, after decoding and inverse quantization of the transform coefficients, the time domain speech signal is reproduced by the inverse MDCT. This technology does not involve the predictive processing across multiple packets used in the CELP algorithms, so one of its features is that the degradation of speech quality caused by packet loss remains momentary rather than propagating to later packets.

Table 2 High-quality speech coding specifications

Sampling frequency (kHz)  22.05             32
Bit rate (kbit/s)         48 to 64          48 to 64
Frame length (ms)         11.61             8
Algorithmic delay (ms)    23.22             16
Speech bandwidth          Super-wideband    Super-wideband
Coding algorithm          Transform coding  Transform coding

Figure 1 Basic structure of the high quality speech coding
[Encoder] Input speech signal -> MDCT -> Quantization and encoding -> Frame packing -> Output bit stream
[Decoder] Input bit stream -> Frame unpacking -> Decoding and inverse quantization -> Inverse MDCT -> Output speech signal

2.3 Verification of the Subjective Speech Quality of the High-quality Speech Coding

To verify the subjective speech quality of the high-quality speech coding, we conducted subjective evaluation experiments. The test conditions are shown in Table 3. In the MUlti Stimulus test with Hidden Reference and Anchor (MUSHRA)[7] method, the subjects are presented with a reference signal (original), the signals to be evaluated, the hidden version of the reference signal, and the band-limited signals (anchors). They are asked to assign a score from 0 to 100 to the signals to be evaluated, the hidden reference and the anchors, in comparison with 100 points for the reference signal. Here, the results for the 7 kHz band-limited and 3.5 kHz band-limited speech can be regarded as the respective upper limits of subjective quality for wideband speech coding and narrowband speech coding.

The subjective evaluation test results are shown in Figure 2. The error ranges in the figure are the 95% confidence intervals. The statistical tests of significance*15 show that all of the encoded speech, including the high-quality speech coding, is inferior to the original speech for the female speech, male speech and speech with background music (BGM) samples, but the subjective quality of the high-quality speech coding is equivalent to that of AAC-LC (Low Complexity)*12[6] at 64 kbit/s and that of G.722.1C at 48 kbit/s. AAC-LC generally has higher coding efficiency than AAC-LD, and thus can be said to achieve subjective quality equivalent to or better than that of AAC-LD at the same bit rate. Therefore, the high-quality speech coding can be said to offer subjective quality equivalent to that of AAC-LD at 64 kbit/s. Furthermore, at 64 kbit/s and 48 kbit/s, the high-quality speech coding has significantly better subjective quality than the 7 kHz and the 3.5 kHz band-limited speech, respectively.

The experiment results show that the high-quality speech coding achieves speech quality comparable to that of the international standard super-wideband speech codec and audio codec, and that it outperforms conventional narrowband and wideband speech codecs.

Table 3 Experimental conditions

Test method                                   MUSHRA
Number of subjects                            19
Reference speech (sampling frequency)         Original speech (32 kHz)
Encoded speech (bit rate/sampling frequency)  High-quality speech coding (48, 64 kbit/s, 22.05 kHz)
                                              AAC-LC (64 kbit/s, 48 kHz)
                                              G.722.1 Annex C (48 kbit/s, 32 kHz)
Band-limited speech                           7-kHz bandwidth, 3.5-kHz bandwidth
Listening system                              Headphones (both ears)

*1 AMR: A mandatory speech coding for Third-Generation mobile communication defined by the 3GPP. It allows flexible variation of the transmission rate according to the type and condition of networks.
*2 Smartphone: A mobile terminal equipped with the functions of a mobile data terminal.
*3 Super 3G: A high-speed wireless access system that is an extended Third-Generation mobile communication system. Standardization by 3GPP is in progress as Long Term Evolution (LTE).
*4 G.711: 64-kbit/s PCM telephone band speech coding recommended by International Telecommunication Union-Telecommunication Standardization Sector (ITU-T) (See *5).
*5 PCM: A coding algorithm in which the signal amplitude sampled at the sampling frequency is expressed as a binary number.
*6 Algorithmic delay: The time required for processing by the coding algorithm regardless of hardware performance of computation and transmission.
*7 Packet loss concealment: A function for concealing the speech distortion caused by packet loss.
*8 G.729: The 8-kbit/s narrowband CELP speech coding recommended by ITU-T (See *10). A compatible low-complexity version is defined as Annex A.
*9 Hypertalk®: Registered trademark of NTT DoCoMo.
*10 CELP: A speech coding algorithm that compares the input speech to a signal synthesized by standard patterns and transmits the indices of the pattern generating the closest synthetic signal.
*11 G.722.1: An audio coding recommended by ITU-T. A fixed-point implementation of the 7-kHz bandwidth and the 14-kHz bandwidth modes is defined as Annex C.
*12 AAC: An audio coding specified by International Organization for Standardization/International Electrotechnical Commission (ISO/IEC). AAC-LC is a profile that reduces the computational complexity of the main profile. A low-delay extension that allows two-way communication is also specified as AAC-LD.
*13 Auditory property: The property that it is difficult to perceive the noise in a frequency region near the region in which the input signal power spectrum has large power.

3. High-quality VoIP Prototype Software

Targeting high-quality VoIP services via a mobile terminal, we implemented a high-quality speech coding module that is capable of real-time encoding and decoding on Windows Mobile®*16 5.0. We also created high-quality VoIP prototype software with that module running on the hTc Z terminal (Photo 1). For comparison with the speech quality of ordinary telephony, this software is also equipped with G.711. Real-time Transport Protocol (RTP)*17[8] is used for transmission of the speech data.
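As a rough illustration of how one coded frame could be carried over RTP, the following Python sketch computes the payload size of a single frame from the bit rate and the 256-sample frame length at 32 kHz, and prepends the 12-byte fixed RTP header. The article does not describe the payload format of the prototype, so the payload type 96, the example SSRC value, the use of the sampling frequency as the RTP timestamp clock, and the choice of one frame per packet are all assumptions made for illustration only.

```python
import struct

SAMPLE_RATE_HZ = 32_000  # sampling frequency taken from the 32-kHz operating mode
FRAME_SAMPLES = 256      # one 256-sample frame (8 ms at 32 kHz)
PAYLOAD_TYPE = 96        # assumption: a dynamic RTP payload type, not from the article


def frame_bytes(bit_rate_bps: int) -> int:
    """Number of bytes available for one coded frame at the given bit rate."""
    return bit_rate_bps * FRAME_SAMPLES // (SAMPLE_RATE_HZ * 8)


def rtp_packet(seq: int, frame_index: int, ssrc: int, payload: bytes) -> bytes:
    """Build an RTP packet: fixed header (version 2, no padding,
    no extension, no CSRC entries, marker bit clear) followed by the payload."""
    first_byte = 2 << 6                # V=2, P=0, X=0, CC=0
    second_byte = PAYLOAD_TYPE & 0x7F  # M=0, 7-bit payload type
    # Assumption: the timestamp advances by one frame of samples per packet.
    timestamp = (frame_index * FRAME_SAMPLES) & 0xFFFFFFFF
    header = struct.pack("!BBHII", first_byte, second_byte,
                         seq & 0xFFFF, timestamp, ssrc)
    return header + payload


if __name__ == "__main__":
    payload = bytes(frame_bytes(48_000))  # placeholder for the coded frame data
    packet = rtp_packet(seq=1, frame_index=1, ssrc=0x12345678, payload=payload)
    print(len(payload), len(packet))
```

Under these assumptions a frame occupies 48 bytes at 48 kbit/s and 64 bytes at 64 kbit/s, so the 12-byte RTP header is a noticeable share of each packet; packing several frames per packet would reduce that share at the cost of added delay.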

Call control is done with Session Initiation Protocol (SIP)*18[9].

Photo 1 Display example of high-quality VoIP prototype software

Figure 2 Subjective evaluation results
Panels: (a) Female speech, (b) Male speech, (c) Speech + background music. Each panel plots scores on a scale of 0 to 100, with 95% confidence intervals, for: Original speech; High-quality speech coding 64 kbit/s; High-quality speech coding 48 kbit/s; AAC-LC 64 kbit/s; G.722.1 Annex C 48 kbit/s; 7 kHz bandwidth speech; 3.5 kHz bandwidth speech.

4. Conclusion

We presented high-quality speech coding technology that enables super-wideband speech communication services. Subjective evaluation results confirmed that the high-quality speech coding shows better speech quality than that of wideband speech, and equivalent quality to that of existing super-wideband speech codecs. We also introduced VoIP prototype software that adopts this technology.

This technology can realize more natural conversations over mobile terminals by extending the speech bandwidth and is expected to enhance future speech communication services.

For future work, we plan to develop additional functionality for the entire VoIP system while considering the requirements and objectives of actual VoIP services and to investigate technologies for new applications that use the advantages of super-wideband speech.

References

[1] 3GPP TS26.090: "Adaptive Multi-Rate (AMR) speech codec; Transcoding functions," 1999.
[2] ITU-T Recommendation G.711: "Pulse Code Modulation (PCM) of Voice Frequencies," 1988.
[3] ITU-T Recommendation G.729 Annex A: "Coding of Speech at 8 kbit/s using conjugate structure algebraic-code-excited linear prediction (CS-ACELP) Annex A: Reduced complexity 8 kbit/s CS-ACELP speech codec," 1996.
[4] ITU-T Recommendation G.729: "Coding of Speech at 8 kbit/s using conjugate structure algebraic-code-excited linear prediction (CS-ACELP)," 1996.
[5] ITU-T Recommendation G.722.1: "Low-complexity coding at 24 and 32 kbit/s for hands-free operation in systems with low frame loss," 2005.
[6] ISO/IEC 14496-3: 2001: "Information technology - Coding of audio-visual objects - Part 3: Audio," 2001.
[7] ITU-R Recommendation BS.1534-1: "Method for the subjective assessment of intermediate quality level of coding systems," 2003.
[8] IETF RFC1889: "RTP: A Transport Protocol for Real-Time Applications," 1996.
[9] IETF RFC3261: "SIP: Session Initiation Protocol," 2002.
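The lapped-transform property described in Section 2.2, where the MDCT overlaps adjacent blocks yet carries no redundant data, can be illustrated with a small Python sketch. It is not the codec described in this article: the sine window is an assumption (the article does not name a window), there is no quantization, and a small block size is used so that the pure-Python loops run quickly. Each block spans 2N samples, advances by N samples, and yields N coefficients; with N = 256 at 32 kHz this would correspond to the 8-ms frame length and, on the assumption that the delay equals the window span, to the 16-ms algorithmic delay of Table 2.

```python
import math


def sine_window(n_coef):
    """Sine window of length 2N; it satisfies w[n]^2 + w[n+N]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * n_coef)) for n in range(2 * n_coef)]


def mdct(block, window):
    """Forward MDCT: 2N windowed time samples -> N transform coefficients."""
    n_coef = len(block) // 2
    return [
        sum(window[n] * block[n]
            * math.cos(math.pi / n_coef * (n + 0.5 + n_coef / 2) * (k + 0.5))
            for n in range(2 * n_coef))
        for k in range(n_coef)
    ]


def imdct(coefs, window):
    """Inverse MDCT: N coefficients -> 2N windowed, time-aliased samples."""
    n_coef = len(coefs)
    return [
        2.0 / n_coef * window[n]
        * sum(coefs[k] * math.cos(math.pi / n_coef * (n + 0.5 + n_coef / 2) * (k + 0.5))
              for k in range(n_coef))
        for n in range(2 * n_coef)
    ]


def analyze_synthesize(signal, n_coef):
    """Transform overlapping blocks and overlap-add the inverse transforms.
    The time-domain aliasing of neighbouring blocks cancels in the sum."""
    window = sine_window(n_coef)
    tail = (-len(signal)) % n_coef  # pad to a whole number of frames
    padded = [0.0] * n_coef + list(signal) + [0.0] * (tail + n_coef)
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n_coef + 1, n_coef):
        block = padded[start:start + 2 * n_coef]
        for n, value in enumerate(imdct(mdct(block, window), window)):
            out[start + n] += value
    return out[n_coef:n_coef + len(signal)]


if __name__ == "__main__":
    test_signal = [math.sin(0.3 * i) for i in range(64)]
    rebuilt = analyze_synthesize(test_signal, n_coef=16)
    print(max(abs(a - b) for a, b in zip(test_signal, rebuilt)))
```

Without quantization the reconstruction error is at the level of floating-point rounding, even though every block of 2N input samples contributes only N coefficients. In a real codec the quantization step between `mdct` and `imdct` is where the coding loss, including pre-echo, is introduced.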

*14 Pre-echo: A phenomenon in which a frequency domain quantization error just prior to an onset of attack in audio signal is perceived as an echo-like distortion.
*15 Statistical test of significance: A method used to determine whether or not the difference between two values is statistically significant. If the difference between two compared values is within the range of the confidence interval calculated from the dispersion in the corresponding values, there is no statistically significant difference; otherwise, the difference is statistically significant.
*16 Windows Mobile®: Registered trademark of the Microsoft Corporation in the United States and other countries.
*17 RTP: A real-time multimedia transport protocol via IP networks defined by the Internet Engineering Task Force (IETF).
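One possible reading of the significance test in note *15 can be sketched in Python as follows. The article does not give the exact procedure, so the paired-difference formulation, the use of a Student-t interval, and the function names are assumptions; the critical value 2.101 is the two-sided 95% value for 18 degrees of freedom, matching the 19 subjects of the experiment. The scores in the usage example are made-up numbers for illustration and are not results from this article.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 18 degrees of freedom (19 subjects).
T_CRIT_95_DF18 = 2.101


def mean_and_ci(scores, t_crit=T_CRIT_95_DF18):
    """Return the mean score and the half-width of its confidence interval."""
    mean = statistics.fmean(scores)
    half_width = t_crit * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


def significantly_different(scores_a, scores_b, t_crit=T_CRIT_95_DF18):
    """Assumed paired test: the two conditions differ significantly when the
    mean per-listener difference lies outside its own confidence interval
    around zero."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff, half_width = mean_and_ci(diffs, t_crit)
    return abs(mean_diff) > half_width


if __name__ == "__main__":
    # Hypothetical scores from 19 listeners (illustrative only).
    condition_a = [80 + (i % 5) for i in range(19)]
    condition_b = [a + (1 if i % 2 else -1) for i, a in enumerate(condition_a)]
    condition_c = [a - 20 + (i % 3) for i, a in enumerate(condition_a)]
    print(significantly_different(condition_a, condition_b))
    print(significantly_different(condition_a, condition_c))
```

In this sketch the first pair differs only by small alternating offsets, so the interval around the mean difference contains zero, while the second pair is separated by roughly 19 points and is flagged as significantly different.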
