High-Quality Speech Coding
Speech Coding, Super-Wideband, VoIP

Kei Kikuiri, Nobuhiko Naka and Shinya Abe

NTT DoCoMo Technical Journal Vol. 9 No.2

High-quality speech coding is a super-wideband speech coding technique for providing high-quality VoIP services in a high-speed wireless communication environment. It was developed in collaboration with DoCoMo Communications Laboratories USA, Inc. This technology encodes and transmits nearly all of the frequency components of the human voice, allowing easily understood communication with a sense of presence.

1. Introduction

Research topics in speech coding technologies have shifted from encoding algorithms for narrowband (300 Hz to 3.4 kHz) speech signals at around 10 kbit/s to encoding algorithms for speech signals having wider bandwidth at about 20 to 64 kbit/s. Current narrowband speech codecs, such as Adaptive Multi-Rate (AMR)*1[1], provide speech quality that is almost equivalent to that of the legacy telephony networks, so no further improvement can be expected. In addition, a high compression ratio for narrowband speech signals is less important in VoIP services, where high-speed packet access lines are typically available. For a perceivable improvement in speech quality, i.e., quality better than that of the legacy telephony services, it is necessary to extend the speech bandwidth. Transmitting the low- and high-frequency components that are not included in narrowband speech produces a decoded speech signal that is close to the original and sounds more natural. In fact, a number of VoIP services have adopted the wideband (50 Hz to 7 kHz) or super-wideband (upper frequency above 7 kHz) speech codecs that were previously used mainly for video telephony.

In mobile communication, wireless VoIP services including PASSAGE DUPLE and Smartphones*2 such as the hTc Z terminal have appeared on the market, and high-speed wireless packet access networks using Super 3G*3 and Fourth-Generation mobile communication systems are also under development. These conditions show that speech communication with wideband or super-wideband codecs will be feasible in mobile environments.

We propose a high-quality speech coding technique that was developed in collaboration with DoCoMo Communications Laboratories USA, Inc. This is a super-wideband speech codec that is intended to provide better speech quality than wideband speech codecs for future VoIP services via high-speed wireless packet access networks. The codec handles speech signals with upper frequencies from 10 to 16 kHz, which is three times higher than that of narrowband, and encodes the signal at 48 to 64 kbit/s. Its computational complexity is comparable to that of a conventional narrowband speech codec.

This technology will enable realistic conversation over mobile terminals with speech quality close to the natural human voice. This advantage will be especially helpful in mobile applications where voice quality is particularly critical, such as teleconferencing and remote education.

In this article, we present an overview of the developed high-quality speech coding algorithm and quality evaluation results. We also introduce mobile VoIP prototype software that implements this coding technology.

2. High-quality Speech Coding

2.1 Overview of Speech Coding for VoIP

The specifications of speech codecs used in VoIP services and applications are shown in Table 1.

Table 1 VoIP codec specifications

                           | G.711      | G.729 Annex A | G.722.1          | G.722.1 Annex C  | AAC-LD
Sampling frequency (kHz)   | 8          | 8             | 16               | 32               | 32 to 48
Bit rate (kbit/s)          | 64         | 8             | 24, 32           | 24, 32, 48       | 32 to 64
Frame length (ms)          | -          | 10            | 20               | 20               | 10 to 16
Algorithmic delay (ms)     | 0.125      | 15            | 40               | 40               | 20 to 32
Speech bandwidth           | Narrowband | Narrowband    | Wideband         | Super-wideband   | Super-wideband
Coding algorithm           | PCM        | CELP          | Transform coding | Transform coding | Transform coding

G.711*4[2] is the speech coding technique used in the fixed network as well as in many VoIP services. Because it uses the Pulse-Code Modulation (PCM)*5 algorithm, the algorithmic delay*6 is only 0.125 ms (1 sample). A packet loss concealment*7 algorithm has also been recommended for use with G.711, which makes it robust against errors such as packet loss. The G.729 Annex A*8 (G.729A)[3] codec is a low-complexity version of the G.729*8[4] codec used in NTT DoCoMo's Hypertalk®*9 service. Like other speech codecs for mobile communication, it adopts the Code Excited Linear Prediction (CELP)*10 algorithm to model the human voice production mechanism, and achieves speech quality close to that of the fixed network at 8 kbit/s. NTT DoCoMo's PASSAGE DUPLE uses G.711 and G.729A.

G.722.1*11[5] is a wideband speech codec that is mainly used in video telephony, but has recently come into use in VoIP applications as well. Its extended version, G.722.1 Annex C*11 (G.722.1C)[5], is designed for 14-kHz bandwidth speech signals. Compared with other super-wideband speech codecs, it allows a lower bit rate and lower computational complexity, although it has a longer algorithmic delay of 40 ms.

Advanced Audio Coding-Low Delay (AAC-LD)*12[6], which is a low-delay version of the AAC*12[6] codec used in music distribution, is applicable to two-way communication. It requires high computational complexity for the precise modeling of the auditory property*13 used in audio codecs. G.722.1 and AAC-LD adopt a transform coding algorithm that converts the time domain signal into a frequency domain representation. This algorithm does not use a voice production model, so it also handles audio signals other than speech.

2.2 High-quality Speech Coding Specifications

The specifications of the high-quality speech coding we developed are shown in Table 2. By setting a sampling frequency of 22.05 kHz or higher, this technique extends the upper frequency limit for speech coding to between 10 and 16 kHz, which covers nearly all of the frequency components of the human voice. In addition, the bit rate can be set freely between 48 and 64 kbit/s according to the types of networks and applications. The values of frame length and algorithmic delay in the table are the respective values for operation at 22.05 kHz and 32 kHz.

Table 2 High-quality speech coding specifications

Sampling frequency (kHz)   | 22.05            | 32
Bit rate (kbit/s)          | 48 to 64         | 48 to 64
Frame length (ms)          | 11.61            | 8
Algorithmic delay (ms)     | 23.22            | 16
Speech bandwidth           | Super-wideband   | Super-wideband
Coding algorithm           | Transform coding | Transform coding

This technique is based on a transform coding algorithm like G.722.1 and AAC-LD (Figure 1). In the encoder, the Modified Discrete Cosine Transform (MDCT) is used to convert the time domain speech signal into frequency spectrum coefficients (transform coefficients). The MDCT is a lapped transform in which each transform block overlaps the adjacent blocks, which prevents distortion at the block boundaries without adding any redundant data. This technique adopts a 256-sample MDCT (8 ms when sampled at 32 kHz) to achieve an algorithmic delay comparable to that of G.729A and other such conventional speech coding techniques, and to suppress pre-echo*14, a distortion that is specific to the transform coding algorithm. Next, it quantizes and encodes the transform coefficients, weighted according to the human auditory property, at low computational complexity.

Figure 1 Basic structure of the high quality speech coding
[Encoder] Input speech signal -> MDCT -> Quantization and encoding -> Frame packing -> Output bit stream
[Decoder] Input bit stream -> Frame unpacking -> Decoding and inverse quantization -> Inverse MDCT -> Output speech signal

[...] be evaluated, the hidden version of the reference signal, and the band-limited signals (anchors). They are then asked to assign a score from 0 to 100 to the signals to be evaluated, the hidden reference and the anchors, in comparison with 100 points for the reference signal. Here, it [...]

[...] same bit rate. Therefore, the high-quality speech coding can be said to offer subjective quality equivalent to that of AAC-LD at 64 kbit/s. Furthermore, the high-quality speech coding has significantly better subjective quality than the 7 kHz and the 3.5 kHz band-limited speech at 64 kbit/s and [...]

*1 AMR: A mandatory speech coding scheme for Third-Generation mobile communication defined by the 3GPP. It allows flexible variation of the transmission rate according to the type and condition of networks.
*2 Smartphone: A mobile terminal equipped with the functions of a mobile data terminal.
*3 Super 3G: A high-speed wireless access system that is an extended Third-Generation mobile communication system. Standardization by 3GPP is in progress as Long Term Evolution (LTE).
*4 G.711: 64-kbit/s PCM telephone band speech coding recommended by the International Telecommunication Union-Telecommunication Standardization Sector (ITU-T) (See *5).
*5 PCM: A coding algorithm in which the signal amplitude sampled at the sampling frequency is expressed as a binary number.
*6 Algorithmic delay: The time required for processing by the coding algorithm, regardless of the hardware performance of computation and transmission.
*7 Packet loss concealment: A function for concealing the speech distortion caused by packet loss.
*8 G.729: The 8-kbit/s narrowband CELP speech coding recommended by ITU-T (See *10). A compatible low-complexity version is defined as Annex A.
*9 Hypertalk®: Registered trademark of NTT DoCoMo.
*10 CELP: A speech coding algorithm that compares the input speech to a signal synthesized from standard patterns and transmits the indices of the pattern generating the closest synthetic signal.
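The relationship between the 256-sample MDCT of Section 2.2 and the frame lengths and algorithmic delays listed in Table 2 can be illustrated with a small sketch. It is not the codec's implementation: it assumes a 50% overlap (256 new samples per frame, 512-sample window), a sine window, and a direct O(N^2) evaluation of the textbook MDCT formula, none of which the article specifies. All function names (`frame_and_delay_ms`, `sine_window`, `mdct`, `imdct`, `analyze`, `synthesize`) are hypothetical, and quantization and frame packing are omitted.

```python
import math

def frame_and_delay_ms(hop, fs_hz):
    """Frame length and algorithmic delay in ms, assuming a 50%-overlap MDCT."""
    frame_ms = 1000.0 * hop / fs_hz
    return frame_ms, 2.0 * frame_ms  # the window spans two hops

def sine_window(n):
    # Assumed window shape; satisfies w[i]^2 + w[i+n]^2 = 1 (Princen-Bradley).
    return [math.sin(math.pi * (i + 0.5) / (2 * n)) for i in range(2 * n)]

def mdct(block, n):
    """2n windowed time samples -> n transform coefficients."""
    return [sum(block[i] * math.cos(math.pi / n * (i + 0.5 + n / 2.0) * (k + 0.5))
                for i in range(2 * n))
            for k in range(n)]

def imdct(coeffs, n):
    """n coefficients -> 2n time-aliased samples (scaled for windowed overlap-add)."""
    return [(2.0 / n) * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5 + n / 2.0) * (k + 0.5))
                            for k in range(n))
            for i in range(2 * n)]

def analyze(signal, n):
    """Split a signal into overlapping blocks and transform each one."""
    if len(signal) % n != 0:
        raise ValueError("signal length must be a multiple of n")
    w = sine_window(n)
    padded = [0.0] * n + list(signal) + [0.0] * n  # one hop of padding per side
    frames = []
    for start in range(0, len(padded) - 2 * n + 1, n):
        block = [padded[start + i] * w[i] for i in range(2 * n)]
        frames.append(mdct(block, n))
    return frames

def synthesize(frames, n):
    """Inverse-transform each frame, window it again, and overlap-add."""
    w = sine_window(n)
    out = [0.0] * (n * (len(frames) + 1))
    for f, coeffs in enumerate(frames):
        y = imdct(coeffs, n)
        for i in range(2 * n):
            out[f * n + i] += y[i] * w[i]
    return out[n:-n]  # drop the padding added in analyze()

if __name__ == "__main__":
    print(frame_and_delay_ms(256, 32000))  # (8.0, 16.0)
    print(frame_and_delay_ms(256, 22050))  # roughly 11.61 and 23.22
    n = 256
    x = [math.sin(2 * math.pi * 440 * t / 32000) for t in range(2 * n)]
    y = synthesize(analyze(x, n), n)
    print(max(abs(a - b) for a, b in zip(x, y)))  # rounding error only
```

Each frame carries 256 coefficients for 256 new input samples, so the overlap adds no extra data, while the time-domain aliasing in neighbouring blocks cancels in the overlap-add. Under these assumptions the 512-sample window also accounts for the algorithmic delay being twice the frame length in Table 2.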