NTT Technical Review, Vol. 14, No. 11, Nov. 2016

Feature Articles: Basic Research Envisioning Future Communication Transmission of High-quality Sound via Networks Using Speech/Audio Codecs Yutaka Kamamoto, Takehiro Moriya, and Noboru Harada Abstract This article describes two recent advances in speech and audio codecs. One is EVS (Enhanced Voice Service), the new standard by 3GPP (3rd Generation Partnership Project) for speech codecs, which is capable of transmitting speech signals, music, and even the ambient sound on the speaker’s side. This codec has been adopted in a new VoLTE (voice over Long-Term Evolution) service with enhanced high- definition voice (HD+), which provides us with clearer and more natural conversations than conventional telephony services such as with fixed-line/land-line and 3G mobile phones. The other is MPEG-4 Audio Lossless Coding (ALS) standardized by the Moving Picture Experts Group (MPEG), which makes it possible to transmit studio-quality audio content to the home. ALS is expected to be used by some broadcasters, including IPTV (Internet protocol television) companies, in their broadcasts in the near future. Keywords: audio/speech coding, data compression, international standards 1. Introduction speech codecs use a human phonation model, so they are not suitable for music. When clean speech items Many audio and speech codecs are available, and are coded by audio codecs at low bit rates, we get the we can select the most suitable one for different usage impression that a machine is talking. Speech and scenarios ranging from those requiring reasonable audio compression schemes have these kinds of quality with low bit rates to ones demanding original trade-offs. To achieve the best quality of speech and signal quality with high bit rates. With the increases music content with less delay, experts in speech and in network capacity that have been achieved, content audio coding around the world have been working that requires high bit rates such as 4K television (TV) together to develop new codecs. Furthermore, loss- and high-resolution audio can also be transmitted. less coding, which refers to compression without any However, the first priority is to transmit speech sig- loss, has been standardized to ensure that the quality nals in ordinary telephony without congestion. There- of the original content is maintained. fore, speech codecs for telephony should use as low a This article presents an overview of the 3rd Genera- bit rate as possible. In addition, they must have lower tion Partnership Project (3GPP) Enhanced Voice algorithmic delay because the longer the codec delay Services (EVS) codec, which has been newly stan- is, the more difficult it becomes for people to com- dardized for mobile communications, and MPEG-4 municate with each other. Audio Lossless Coding (ALS) for high-resolution In contrast, one-way transmission such as broad- audio, which was standardized by the Motion Picture casting is less sensitive to delay. Most audio codecs Experts Group (MPEG). utilize the advantages of longer delay and then effi- ciently compress audio signals by means of signal processing with sufficient frame length. Moreover, 1 NTT Technical Review Feature Articles Fixed-line phone VoLTE New VoLTE (HD+) Future service 3G mobile phone 8-kHz 16-kHz 32-kHz 48-kHz sampling sampling sampling sampling 0 Frequency (Hz) 50 300 3400 8000 16,000 20,000 HD: high-definition Fig. 1. Supported audio bandwidth in EVS. 2. 3GPP EVS codec for mobile communications This enables smooth migration from the conventional VoLTE system since EVS has inter-operability with 3GPP is the international standardization consor- AMR-WB (Adaptive Multi-Rate Wideband). tium for mobile communications. It has newly During the standardization process, a huge number defined EVS speech and audio coding standards for of subjective quality evaluations were conducted for voice over Long-Term Evolution (VoLTE) [1, 2]. various coding conditions, input items, and languag- Conventional speech coding schemes for mobile es. The results of the evaluations indicated that EVS phones have been based on code excited linear pre- outperformed conventional speech and audio coding diction (CELP). These schemes have utilized a schemes in terms of quality [5]. NTT used a similar human voice production model and achieved high- procedure to conduct listening tests on Japanese quality speech transmission with very low bit rates. materials [6]. The results of the tests confirmed the EVS consists of newly developed low-delay and low- superiority of EVS over the coding schemes for con- bit-rate audio coding modules in addition to CELP, ventional mobile communications systems. and it achieves high-quality transmission of various Note that all EVS development has been carried out types of input signals, including speech, audio, back- by 12 organizations*1 based mainly in Europe, North ground noise, and background music [3, 4]. America, and East Asia, including Japanese compa- EVS uses new bandwidth extension technologies to nies. EVS has been deployed in commercial services support signals with higher sampling rates up to 48 such as VoLTE (HD+) by NTT DOCOMO since the kHz, in contrast to the narrowband signal (8-kHz summer of 2016 [7] and by some operators in the sampling rate) of conventional fixed-line/land-line USA and Europe as well [8]. We believe that EVS telephones and 3G mobile phones and the wideband will allow billions of people around the world to signal (16-kHz sampling rate) of VoLTE. Note that enjoy high-quality communication in the near future. the wideband signal is used for AM (amplitude modulation) radio, the super-wideband signal (32-kHz 3. MPEG-4 ALS for high-resolution sampling rate) is used for FM (frequency modulation) audio transmission radio, and the full-band signal (48-kHz sampling rate) is used for digital broadcasting, as shown in MPEG has standardized many useful audio and Fig. 1. video codecs that are commonly used in daily life. EVS has been optimized for VoLTE with a frame The MPEG audio subgroup standardized the lossless length of 20 ms and algorithmic delay of 32 ms. It has compression scheme MPEG-4 ALS, which can per- been designed to minimize perceptual distortion fectly reconstruct original signals. NTT is one of the against packet loss, whereas coding schemes for conventional 3G mobile phones were optimized for *1 In alphabetical order, Fraunhofer IIS, Huawei Technologies Co. robustness against bit errors. In addition, EVS covers Ltd, Nokia Corporation, NTT, NTT DOCOMO, INC., ORANGE, Panasonic Corporation, Qualcomm Incorporated, Samsung Elec- a wide range of bit rates from 5.9 kbit/s to 128 kbit/s tronics Co., Ltd., Telefonaktiebolaget LM Ericsson, VoiceAge and enables frame-by-frame selection of bit rates. Corporation, and ZTE Corporation. Vol. 14 No. 11 Nov. 2016 2 Feature Articles such When capacityas MPEG-2/4 is restricted, AAC is a reasonable. lossy codec Lossy codec Lossless codec = original sound When capacity is high enough, a lossless codec such as MPEG-4 ALS can be used. AAC: Advanced Audio Coding Fig. 2. Selection of audio codec according to the allowed bit rate. contributors to this standard [9, 10]. Although the produced in a broadcasting studio—with the quality compression ratio depends on the input signal, a file of the original signal—because the lossless codec is normally compressed to around 30% to 70% of its guarantees bit-exactness over the entire transmission. original size [11, 12]. High-resolution audio*2 has We can enjoy high-resolution audio in our living become more popular recently and enables precisely rooms when a sufficient bit rate is assigned to the digitized music. Lossy codecs often cannot convey audio signal (Fig. 2). IPTV (Internet protocol TV) the fidelity of high-resolution audio, so a lossless services, which use optical-fiber lines, may introduce codec such as MPEG-4 ALS is necessary. MPEG-4 ALS before the radio-wave services do. For example, the audio signal in TV broadcasting is In order to support practical deployment of MPEG- usually produced in a studio or broadcasting station 4 ALS, NTT has prepared related standards such as in a 48-kHz, 24-bit format, which is referred to as MPEG-4 ALS Simple Profile and IEC 61937-10 Edi- high-resolution audio. Since access to radio waves is tion 2 by the International Electrotechnical Commis- limited, we cannot assign a high bit rate to the con- sion (IEC) [15]. MPEG-4 ALS Simple Profile tent, and it is necessary to compress the audio signal restricts the parameters of the input signal such as the at a loss of quality. This lossy codec enables us to sampling frequency, number of channels, bit depth, enjoy broadcasting content because the codec reduc- and frame size, and also restricts some processing es the bit rate remarkably without any noticeable dif- tools and functionalities that require higher computa- ference from the original signal. tional complexity such as compression for floating- The ultrahigh-definition (or super-high-definition) point format signals. ARIB STD-B32 recommends TV service called 4K/8K TV has now started, and it the use of MPEG-4 ALS Simple Profile with LATM/ uses very high bit rates. The audio signal is also LOAS (Low-overhead Audio Transport Multiplex/ expected to use higher bit rates. The Association of Low Overhead Audio Stream)*3 capsuling. To facilitate Radio Industries and Businesses (ARIB), which defines the standards for radio-wave systems in *2 High-resolution audio: The Japan Electronics and Information Japan, standardized ARIB STD-B32 for the 4K/8K Technology Industries Association (JEITA) defines high resolu- TV system. This standard enables the use of MPEG-4 tion audio as an audio signal with a sampling frequency higher than 48 kHz and a bit depth greater than 16 bits [13]. ALS as one of the audio codecs [14].

NTT Technical Review, Vol. 14, No. 11, Nov. 2016

Surround Sound Processed by Opus Codec: a Perceptual Quality Assessment

Tr 126 959 V15.0.0 (2018-07)

Enhanced Voice Services (EVS) Codec Management of the Institute Until Now, Telephone Services Have Generally Failed to Offer a High-Quality Audio Experience Prof

Parametric Coding for Spatial Audio

Ts 126 244 V12.5.0 (2016-10)

Overview of the Evs Codec Architecture

Transcoding a Look at How Transcoding Can Help Ipx Providers Climb up the Value Chain

The Importance of Mobile Audio Sponsored By

Ts 103 624 V1.1.1 (2019-11)

5G Multimedia Standardization

Speech Codec Intelligibility Testing in Support of Mission-Critical Voice Applications for LTE

Convolutional Neural Networks to Enhance Coded Speech