ITU-T G.711.1: Extending G.711 to Higher-Quality Wideband Speech
Total Page:16
File Type:pdf, Size:1020Kb
HIWASAKI LAYOUT 9/22/09 2:24 PM Page 110 ITU-T STANDARDS ITU-T G.711.1: Extending G.711 to Higher-Quality Wideband Speech Yusuke Hiwasaki and Hitoshi Ohmuro, NTT Corporation ABSTRACT 7 kHz, called wideband speech, which is equiva- lent to audio signals conveyed in AM radio In March 2008 the ITU-T approved a new broadcasts. One of the most popular applications wideband speech codec called ITU-T G.711.1. of voice over IP (VoIP) is remote audio-visual This Recommendation extends G.711, the most conferences, where hands-free terminals are widely deployed speech codec, to 7 kHz audio often used. In that case, intelligibility becomes bandwidth and is optimized for voice over IP more important than when using handsets applications. The most important feature of this because participants usually sit around a terminal codec is that the G.711.1 bitstream can be at a certain distance from a loudspeaker. This is transcoded into a G.711 bitstream by simple where wideband speech coders, which can repro- truncation. G.711.1 operates at 64, 80, and 96 duce speech at high fidelity and intelligibility, are kb/s, and is designed to achieve very short delay particularly favored. and low complexity. ITU-T evaluation results Today, the majority of fixed-line digital show that the codec fulfils all the requirements telecommunications terminals are equipped with defined in the terms of reference. This article ITU-T G.711 (log-compressed pulse code modu- presents the codec requirements and design con- lation [PCM]) capability. In fact, for communica- straints, describes how standardization was con- tion using Real-Time Transport Protocol (RTP) ducted, and reports on the codec performance over IP networks, G.711 support is mandatory. and its initial deployment. Until wideband speech terminals completely replace narrowband ones, these two types of ter- INTRODUCTION minals will continue to coexist, meaning that the wideband ones must be capable of interoperat- In the early years of voice communications, ing with those that carry only G.711. In an ordi- transmission bandwidth was somewhat limited, nary telecommunications scenario, the codec and the main technological focus at the time was used during a session is negotiated between the to transmit voice at the best achievable quality terminals as the call is set up. However, there given the bandwidth constraint. This led to using are some cases where this may not be possible, speech limited to a frequency range of 300 Hz to for example, in call transfers and multipoint con- 3.4 kHz, today called narrowband speech. Howev- ferencing. Therefore, transcoding2 between dif- er, with the exponential growth of transmission ferent types of bitstream must be performed by a bandwidth for both wired and wireless communi- bitstream translator at a gateway or a signal cations, broadband connections are now more mixer at a multi-point control unit (MCU). This widely available. For fixed lines, the trend is to is problematic when those devices must accom- transport all information and services, including modate a large number of lines because voice, video, and other data, in packet-based net- transcoding usually requires a high computation- works. One advantage of a packet network such al complexity and will likely introduce quality as an Internet Protocol (IP)-based one is that it degradations and additional delay. This can be 1 These ITU-T Recom- can adapt to various bit rates, and this means an obstacle in the increased use of wideband mendations can be freely that we are now free from having to use constant voice-communication services. downloaded: bit rates and band-limited audio. This means that To overcome this obstacle, ITU-T has stan- http://itu.int/rec/T-REC- the new generation of terminals can support rich- dardized a new speech-coding algorithm, G.711.1 G/ er services and functionalities. Consequently, [1]. This codec, approved in March 2008, is an speech coding algorithms can be designed with extension of ITU-T G.711. It was initially studied 2 Transcoding is the con- emphasis on factors such as low delay, low com- under the name G.711-WB (wideband extension). version processing of one plexity, and, in particular, wider audio frequency The coding algorithm is designed to provide a encoding format to anoth- bandwidth. We have seen the emergence of low-delay, low-complexity, and high-quality wide- er. coders, such as International Telecommunication band speech addressing transcoding problems Union — Telecommunication Standardization with legacy (narrowband) terminals with a bitrate 3 Scalability enables the Sector (ITU-T) G.722, G.722.1, AMR-WB (also trade-off. The main feature of this extension is to best quality of service to known as ITU-T G.722.2), G.729.1, and G.718.1 give G.711 wideband scalability.3 It aims to be provided as the system Those coders are for encoding conversational achieve high-quality speech services over broad- load varies. speech signals in the frequency range of 50 Hz to band networks, particularly for IP phone and 110 0163-6804/09/$25.00 © 2009 IEEE IEEE Communications Magazine • October 2009 HIWASAKI LAYOUT 9/22/09 2:24 PM Page 111 In the partial mixing Location Location Location Location method, only the A B A B core bitstream (usually the lower band) is decoded MCU and mixed, and the enhancement layers are not decoded, Location Location Location Location hence the name C D C D partial. Instead, one active endpoint is Mesh connection Star connection selected from all the endpoints, and its Figure 1. Configurations for connecting remote conferences. enhancement layers are redistributed to multipoint speech conferencing, while enabling the MCU, in addition to the summation of the other endpoints. seamless interoperability with conventional termi- signals and the transmitting and receiving of 2n nals and systems equipped only with G.711. media streams. The number of mixed endpoints n In the next section one of the key applications might be limited to m (m < n, e.g., 3), selected of G.711.1, partial mixing, is described, and then on the active channels, but this would still require the design constraints of G.711.1 are detailed in considerable computational complexity. Another the section after. The succeeding section issue contributing significantly to complexity is describes how the standardization progressed, transcoding. Terminals with different coding and then a brief overview of the codec algorithm capabilities require transcoding because various is presented. The characterized results of G.711.1, coding algorithms are used in VoIP systems. Usu- the speech quality of the codec, are presented ally, this capability is implemented by decoding a next. Finally, new extensions to G.711.1 and the bitstream to a linear PCM signal and then re- status of the codec deployment are discussed. encoding that with another encoder. For intercon- nection between wideband and narrowband ARTIAL IXING coders, transcoding requires another intermediate P M step, down-/up-sampling, and this would further One of the primary applications of G.711.1 is increase the required computational complexity at audio conferencing, and partial mixing [2] is a MCUs. Another less significant downside of solution to contain its growing complexity. When transcoding is the accumulation of algorithmic considering such conferences, there are two possi- delays caused by re-encoding and maybe down- ble configurations, as shown in Fig. 1: one is a /up-sampling, and this can be problematic when mesh connection where all endpoints are connect- using codecs working on a certain frame length. ed to all others, and the other is a star connection These problems can be overcome by taking centered on a multipoint control unit (MCU). advantage of a subband scalable bitstream structure Mesh connections, such as those used in Skype, because a signal can be reconstructed by decoding do not require a server but are restricted to con- only part of the bitstream. In the partial mixing ferences with a few endpoints. In large-scale n- method, only the core bitstream (usually the lower point conferences, each endpoint needs to band) is decoded and mixed, and the enhancement transmit and receive 2(n – 1) media bitstreams layers are not decoded, hence the name partial. (counting both inbound and outbound streams). Instead, one active endpoint is selected from all the This is undesirable because n – 1 decoding com- endpoints, and its enhancement layers are redis- putation would operate simultaneously, whereas tributed to other endpoints. To implement this only one decoder is required for star connections. hybrid approach, which combines redistribution and In addition, each endpoint needs to have a suffi- mixing, the mixer must judge which endpoint to ciently large transmission bandwidth, whereas in select by detecting voice activities and/or detecting star connections only two (inbound and out- the endpoint with the largest signal power. bound) media bitstreams are required. Thus, Figure 2 illustrates how a partial mixer works. mesh connections are suitable for conferences There are three locations connected to an MCU; 4 Note that there is anoth- involving a few endpoints. However, in star con- assume that location A is talking, and locations B er type of MCU, a for- nections the computation now occurs in the and C are listening in this instance. The partial warding bridge, that also MCU, where mixing must be performed.4 Here, mixer performs conventional core (G.711) layer sends media bitstreams in the MCU has to decode all the bitstreams from mixing, but for enhancement layers, it detects the a star configuration, but endpoints, mix the obtained signals, and then re- speaker location (A) and selects a set of enhance- multiplexes media bit- encode the mixed signal.