
FEATURE

Trends in Standardization of Audio Coding Technologies

Tomoyasu Komori, Advanced Systems Research Division

An ordinance from the Ministry of Internal Affairs and Communications (MIC) was issued in 2011 on revision of the audio coding formats of 8K Super Hi-Vision (8K) using 22.2 multichannel (22.2 ch) sound. The ordinance makes it possible to use 22.2 ch sound in broadcasting satellite (BS) digital broadcasts and other media. In particular, it specifies that digital broadcast audio formats conform to either MPEG-4 Advanced Audio Coding (AAC) or MPEG-4 Audio Lossless Coding (ALS). The Association of Radio Industries and Businesses (ARIB) revised ARIB STD-B32 accordingly. These revisions set the maximum number of audio input channels for digital broadcasts to “22 channels and two low-frequency effect (LFE) channels”, and added MPEG-4 AAC and ALS to the available formats. This article describes the latest trends in standardization and audio coding formats for 3D sound.

1. Introduction
In Japan, audio encoding formats were revised in 2011 by the issuing of MIC ordinance No. 87, “Standard digital broadcasting formats for television broadcasting1)”, to enable 8K broadcasts with 22.2 ch sound. The ordinance increases the maximum number of input audio channels for BS and communications satellite (CS) digital broadcasts from 5.1 ch (5 channels and 1 LFE channel) to 22.2 ch (22 channels and two LFE channels). Audio encodings for 8K broadcasts were also regulated to conform to the MPEG-4 AAC standard2), which is the most efficient lossy compression coding, or to MPEG-4 ALS3), which is a non-lossy (lossless) coding.

ARIB revised its standard, ARIB STD-B32, “Video coding, audio coding and multiplexing methods for digital broadcasting4)”, in response to the MIC ordinance. In this revision, regulations were added with detailed specifications supporting 22.2 ch audio modes in the MPEG-4 AAC audio coding5). For MPEG-4 ALS audio coding, regulations were added on the number of channels and constraints such as the prediction order.

This article describes these trends in international and domestic standardization and introduces the latest 3D audio coding scheme, called MPEG-H 3D Audio, which was standardized in February, 2015.

2. Overview of 22.2 ch sound
22.2 ch is a 3D sound format with a total of 24 channels arranged in three layers6). There are nine channels in the top layer, above the viewing position, ten channels in the middle layer, at the level of the viewers’ ears, three channels in the bottom layer, below the viewer’s position, and two LFE channels. The arrangement and labels of the channels in the 22.2 ch sound system are shown in Figure 1.

NHK set requirements for a highly realistic sound format suitable for 8K broadcasts, conducted subjective evaluations showing that the 22.2 ch sound system meets these requirements, and has been contributing to standardization of the format in Japan and internationally6).

3. Overview of MPEG-4 AAC standard and ALS standard

3.1 Compression encoding technology for audio
There are two main types of encoding technology used for compression of audio signals.
(a) Coding methods that consider auditory characteristics: with these methods, the distortion produced from the encoding is either completely or almost completely undetectable acoustically, even with compression.
(b) Methods that attempt to eliminate redundancy in the audio data using techniques such as waveform prediction or statistical methods: if the original signal can be perfectly reproduced from the received data, it is called a lossless


Top layer (9 channels): TpFL, TpFC, TpFR, TpSiL, TpC, TpSiR, TpBL, TpBC, TpBR
Middle layer (10 channels): FL, FLc, FC, FRc, FR, SiL, SiR, BL, BC, BR
Bottom layer (3 channels): BtFL, BtFC, BtFR
LFE channels (2): LFE1, LFE2

Figure 1: 22.2 ch audio channel placement and labels
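The placement in Figure 1 can be captured as a small data structure, which is a convenient way to check the channel count; the listing below is just such a sketch (the labels follow Figure 1, the grouping is not normative text from any standard):

```python
# 22.2 ch sound: 24 channels in three layers plus two LFE channels
# (channel labels as in Figure 1 of this article).
CHANNELS_22_2 = {
    "top":    ["TpFL", "TpFC", "TpFR", "TpSiL", "TpC", "TpSiR",
               "TpBL", "TpBC", "TpBR"],              # 9 channels
    "middle": ["FL", "FLc", "FC", "FRc", "FR",
               "SiL", "SiR", "BL", "BC", "BR"],      # 10 channels
    "bottom": ["BtFL", "BtFC", "BtFR"],              # 3 channels
    "lfe":    ["LFE1", "LFE2"],                      # 2 LFE channels
}

total = sum(len(labels) for labels in CHANNELS_22_2.values())
print(total)  # 24, i.e. 22 channels + 2 LFE ("22.2")
```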

encoding. AAC is a type (a) method, while ALS is a type (b) method.

3.2 Overview of MPEG-4 AAC
MPEG-4 AAC is standardized in International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) 14496-3 Subpart 4. MPEG-4 AAC is an extension of MPEG-2 AAC (ISO/IEC 13818-7)7); it can efficiently encode audio signals such as music and can handle multichannel signals such as 22.2 ch in addition to monaural and stereo.

MPEG-4 AAC is a type of frequency-domain compression encoding, which encodes by analyzing frequency components of the audio signal and using techniques such as masking*1 to achieve high compression rates by exploiting the characteristics of human hearing. A block diagram of audio encoding using auditory characteristics is shown in Figure 2. To break down audio into frequency components, MPEG-4 AAC uses a “transform coding” method, which uses the Discrete Cosine Transform (DCT) to convert the signal directly into a frequency-domain signal. When performing transform coding, the long window (block) used to transform the signal from the time into the frequency domain is 2,048 samples, but this can be changed adaptively to 256-sample blocks if a finer time resolution is needed.

MPEG-4 AAC has several audio object types*2, but broadcast services currently only use “Low Complexity” (LC), which has a good balance between the size of the decoder circuit and sound quality. With MPEG-4 AAC, almost no distortion due to encoding can be detected, even when compressing a stereo signal to approximately 1/12 its original size, into something in the
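The adaptive switching between 2,048-sample long blocks and 256-sample short blocks can be sketched as follows; the transient detector here (a simple energy-ratio test with a made-up threshold) stands in for the encoder's real psychoacoustic decision logic, so treat it as an illustration, not the AAC algorithm:

```python
import numpy as np

LONG_BLOCK = 2048   # samples, used for stationary signals
SHORT_BLOCK = 256   # samples, used when finer time resolution is needed

def choose_block_length(frame: np.ndarray, threshold: float = 8.0) -> int:
    """Pick a transform length for one 2,048-sample frame.

    A frame whose second half is far more energetic than its first half
    is treated as a transient (e.g. a drum hit) and encoded with short
    blocks. The threshold is illustrative, not from the standard.
    """
    half = len(frame) // 2
    e1 = np.sum(frame[:half] ** 2) + 1e-12  # avoid division by zero
    e2 = np.sum(frame[half:] ** 2)
    return SHORT_BLOCK if e2 / e1 > threshold else LONG_BLOCK

rng = np.random.default_rng(0)
steady = 0.1 * np.sin(2 * np.pi * 440 * np.arange(LONG_BLOCK) / 48000)
attack = np.concatenate([np.zeros(1024), rng.standard_normal(1024)])

print(choose_block_length(steady))  # 2048 (stationary tone)
print(choose_block_length(attack))  # 256  (sudden onset)
```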

*1 The phenomenon by which a sound is obscured by another sound so that it cannot be heard or seems as though its volume is low.
*2 MPEG-4 audio classifies object types according to the methods and tools that can be used.


Audio signal → time-to-frequency transform → quantization and encoding → bitstream formatting → encoded bitstream, with a psychoacoustic model controlling the quantization

Figure 2: Block diagram of audio coding using a psychoacoustic model

range of 128 to 144 kbps.

3.3 Differences between MPEG-2 AAC and MPEG-4 AAC
MPEG-2 AAC (ISO/IEC 13818-7) and MPEG-4 AAC (ISO/IEC 14496-3 Subpart 4) use almost the same tools for compressing audio signals, but MPEG-4 AAC adds an encoding tool called Perceptual Noise Substitution (PNS)*3. When encoding audio, much of the required bit rate is for transmitting the DCT coefficients obtained from transforming the audio signal into the frequency domain. PNS reduces the bit rate by treating signals within a scale-factor band*4 as noise within the band and sends only the applicable power information. That information is then used to add noise of a suitable level when reconstructing the audio signal during decoding.

3.4 Overview of MPEG-4 ALS
MPEG-4 ALS was standardized as ISO/IEC 14496-3:2005/Amd.2, MPEG-4 Audio Lossless Coding, in March, 2006. It is a type of lossless encoding and can exactly reproduce the original waveform through predictive analysis, by using linear predictive techniques on past samples, even for multichannel signals and signals with high sampling rates. The input audio signal is analyzed in order to calculate the linear prediction parameters and prediction residual. The parameters and residual are variable-length encoded to format the encoded bitstream (Figure 3). The amplitude of the prediction residual is generally small compared with the original signal, and this characteristic can be used to compress the amount of data relative to the uncompressed data by 15% to 70%.
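The Perceptual Noise Substitution idea described in Section 3.3 can be illustrated in a few lines: for a scale-factor band judged noise-like, the encoder transmits only the band's power, and the decoder regenerates noise at that level. The band size and the encode/decode helpers below are invented for this sketch and are not the standard's syntax:

```python
import numpy as np

def pns_encode(band_coeffs: np.ndarray) -> float:
    """Encoder side: discard the transform coefficients of a noise-like
    scale-factor band and keep only its mean power (one number)."""
    return float(np.mean(band_coeffs ** 2))

def pns_decode(power: float, n: int, rng: np.random.Generator) -> np.ndarray:
    """Decoder side: synthesize noise with the transmitted power in
    place of the original coefficients."""
    noise = rng.standard_normal(n)
    return noise * np.sqrt(power / np.mean(noise ** 2))

rng = np.random.default_rng(1)
band = 0.5 * rng.standard_normal(32)      # a noise-like band of 32 coefficients
power = pns_encode(band)                  # one value transmitted instead of 32
rebuilt = pns_decode(power, len(band), rng)

# The waveform differs, but the power (perceived as loudness) matches.
print(bool(np.isclose(np.mean(rebuilt ** 2), power)))  # True
```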

4. ARIB STD-B32 revisions

Several revisions to ARIB STD-B32 were made to support ultra-high-definition television in advanced BS digital broadcasts. In addition to supporting 22.2 ch audio input signals, a functionality was standardized for the

*3 A tool that replaces noise with a small amount of data when encoding signals and adds a noise waveform at the receiving side.
*4 A group summarizing DCT coefficients for neighboring frequencies.

Audio signal → linear predictive coding → linear predictive parameters and prediction error → variable-length coding → bitstream formatting → encoded bitstream

Figure 3: Basic architecture of MPEG-4 ALS encoding and decoding
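The architecture in Figure 3 rests on the fact that a prediction residual is smaller than the signal itself and can be inverted losslessly. A minimal sketch with a fixed first-order predictor (actual ALS uses adaptive predictors of much higher order) shows both properties:

```python
import numpy as np

def predict_residual(x: np.ndarray, a: float = 0.95) -> np.ndarray:
    """First-order linear prediction: estimate each sample from the
    previous one and keep only the prediction error (residual)."""
    pred = np.concatenate([[0.0], a * x[:-1]])
    return x - pred

def reconstruct(residual: np.ndarray, a: float = 0.95) -> np.ndarray:
    """Lossless inverse: rebuild the exact samples from the residual."""
    x = np.zeros_like(residual)
    for n, e in enumerate(residual):
        x[n] = e + (a * x[n - 1] if n > 0 else 0.0)
    return x

t = np.arange(480) / 48000
signal = np.sin(2 * np.pi * 200 * t)   # smooth, highly predictable signal
res = predict_residual(signal)

print(bool(np.max(np.abs(res)) < np.max(np.abs(signal))))  # True: residual is smaller
print(bool(np.allclose(reconstruct(res), signal)))         # True: reconstruction is exact
```

The smaller residual needs fewer bits under variable-length coding, which is where the compression comes from.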


down-mixing*5 parameters when 22.2 ch audio encoded in MPEG-4 AAC is received on devices with 5.1 ch audio or stereo, along with formats for transmitting these parameters. Dialog enhancement*6 and dialog switching functions*7 were also introduced to extend conventional broadcast services. There are also some restrictions on the parameters that can be used with MPEG-4 ALS.

Note that in the MPEG-4 audio encoding standard, there is a wide range of sampling frequencies and numbers of channels that can be used, but ordinances and bulletins from MIC, and the ARIB standards, specify that 8K broadcasts must use a sampling frequency of 48 kHz and quantization of 16 bits or greater. Table 1 gives the technical formats for audio applicable to each digital broadcast standard (from 2011 MIC ordinances No. 87 and No. 94).

Separate numbers were also assigned in the MPEG-4 audio encoding standard for commonly used audio systems such as two-channel stereo and 5.1 ch audio. Table 2 gives the numbering for the channel configurations and number of channels usable with MPEG-4 AAC and ALS. Note that 22.2 ch audio is assigned the number 13.

4.1 Revisions for transmitting AAC down-mix coefficients
When down-mixing from multichannel stereo with more than 5.1 channels (audio modes with channel configuration numbers 7, 11, 12, 13, and 14) to two-channel stereo, the signals are first down-mixed to 5.1 ch sound, and then to two-channel stereo. A data stream element (DSE)*8, as described in ISO/IEC 14496-3:2009/AMD 4, is used when

*5 A way of converting a multi-channel audio signal into a signal consisting of fewer channels.
*6 A function that allows the volume of dialog (voices) within a program to be adjusted at the receiver.
*7 A function that allows the language of the dialog or descriptions in a program to be switched to another language, such as from Japanese to English.
*8 One type of data block for transmitting signals in AAC.
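The second stage of this two-stage path (5.1 ch → stereo) can be sketched as one matrix multiplication. The coefficients below are generic placeholder values (−3 dB ≈ 0.707 for shared channels), not the default coefficients ARIB standardized:

```python
import numpy as np

# 5.1 ch channel order assumed here: FL, FR, FC, LFE, BL, BR
# Illustrative down-mix coefficients (placeholders, NOT the ARIB defaults):
K_C = 10 ** (-3 / 20)   # center mixed into both sides at -3 dB
K_S = 10 ** (-3 / 20)   # surrounds mixed in at -3 dB

# L = FL + K_C*FC + K_S*BL ;  R = FR + K_C*FC + K_S*BR
DOWNMIX_5_1_TO_STEREO = np.array([
    # FL   FR   FC   LFE  BL   BR
    [1.0, 0.0, K_C, 0.0, K_S, 0.0],   # stereo left
    [0.0, 1.0, K_C, 0.0, 0.0, K_S],   # stereo right
])

frames = np.random.default_rng(2).standard_normal((6, 1024))  # 6 channels of audio
stereo = DOWNMIX_5_1_TO_STEREO @ frames
print(stereo.shape)  # (2, 1024)
```

Down-mixing 22.2 ch would apply a 22.2 ch → 5.1 ch matrix first and then this stage; the DSE carries the coefficient values so that the receiver can apply the broadcaster's intended mix.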

Table 1: Audio formats suitable for digital broadcasting

Broadcast standard | Sampling frequency | Max. audio input channels | MPEG-2 AAC | MPEG-2 BC†2 | MPEG-4 AAC | MPEG-4 ALS
Digital terrestrial TV broadcasting | 32/44.1/48 kHz | 5.1 ch | Y | | |
V-High multimedia broadcasting | 32/44.1/48 kHz | 5.1 ch | Y | | |
V-Low multimedia broadcasting | 32 kHz or greater | 5.1 ch | Y | | Y | Y
BS digital broadcasting | 32/44.1/48 kHz | 5.1 ch | Y | | |
Advanced BS digital broadcasting | 48 kHz | 22.2 ch | | | Y | Y
Narrow band CS digital broadcasting | 32/44.1/48 kHz | 5.1 ch | Y | Y | |
Wide band CS digital broadcasting | 32/44.1/48 kHz | 5.1 ch | Y | | |
Advanced narrow band CS digital broadcasting | 32/44.1/48 kHz | 22.2 ch†1 | Y | | Y | Y
Advanced wide band CS digital broadcasting | 48 kHz | 22.2 ch | | | Y | Y

†1 Limited to 5.1 ch in operational regulations.
†2 Encoding that is backward compatible with MPEG-1 Layer 2.


Table 2: Individual channel configurations and number of channels usable with MPEG-4 AAC and ALS

Channel configuration number Number of channels

1 1 ch (1/0)

2 2 ch (2/0)

3 3 ch (3/0)

4 4 ch (3/1)

5 5 ch (3/2)

6 5.1 ch (3/2.1)

7 7.1 ch (5/2.1)

11 6.1 ch (3/0/3.1)

12 7.1 ch (3/2/2.1)

13 22.2 ch (3/3/3-5/2/3-3/0/0+2)

14 7.1 ch (2/0/0-3/0/2-0/0/0+1)

0 3 ch (2/1), 4 ch (2/2), or two audio tracks (dual mono) (1/0+1/0)

・The channels are expressed as “top layer (front/side/back) – middle layer (front/side/back) – bottom layer (front/side/back) + LFE”.
・0 indicates no channels allocated for that direction.
・Audio modes with only the middle layer are expressed as “middle layer (front/side/back).LFE”; audio modes with only the middle layer and no side channels, and stereo, are expressed as “middle layer (front/back).LFE”.

sending coefficients*9 for down-mixing from 5.1 ch to two-channel stereo.

Note that when creating these standards, NHK used materials from many programs to conduct experiments8) examining how to downmix appropriately from 22.2 ch to 5.1 ch. It derived a default down-mixing method and set of coefficients and contributed them as revisions to ARIB STD-B32.

4.2 Revisions to the AAC dialog control function
(1) Dialog enhancement function
The dialog enhancement function distinguishes between the dialog channels (containing script, narration, etc.) and the background audio channels in a program by using flags, and it enables the dialog channel signal levels to be adjusted independently of the background channels.

(2) Dialog signal switching function
The dialog switching function enables additional alternate dialog signals (such as English or French dialog) to be transmitted separately from the 22.2 ch audio signal using a user-domain stream (DSE) within the same audio stream, and to be substituted for the originally allocated signal (the initial dialog signal) on the receiver. The alternate audio can be reproduced from single or multiple channels, as selected by the broadcaster. In that case, the audio levels of each channel can also be specified by the broadcaster (e.g., FC 0 dB, BtFC −3 dB, etc.).

Receivers with the dialog switching function can receive external instructions to switch, for example, the original Japanese dialog in FC and BtFC (see Figure 1) to English or French dialog. Moreover, the dialog’s level can be controlled after the language has been switched.

NHK submitted draft revisions including these dialog control functions after conducting a study of the MPEG-4 AAC syntax (the rules for expressing data within the encoded bit stream). It also prototyped a codec conforming to the standard and demonstrated the feasibility of the functions9).

4.3 ALS parameters
The MPEG-4 ALS standard supports up to 65,536 channels, and the linear prediction supports up to
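The effect of the dialog control functions can be sketched as a per-channel gain applied only to flagged dialog channels; the flag assignments and dB values below are illustrative examples in the spirit of the article's “FC 0 dB, BtFC −3 dB” case, not the ARIB syntax:

```python
import numpy as np

def apply_dialog_gain(channels: dict, dialog_flags: dict, gain_db: float) -> dict:
    """Scale only the channels flagged as dialog, leaving background
    channels untouched (a sketch of dialog enhancement)."""
    g = 10 ** (gain_db / 20)  # dB -> linear amplitude
    return {name: (g * x if dialog_flags.get(name, False) else x)
            for name, x in channels.items()}

audio = {"FC": np.ones(4), "BtFC": np.ones(4), "FL": np.ones(4)}
flags = {"FC": True, "BtFC": True, "FL": False}  # FC/BtFC carry dialog

boosted = apply_dialog_gain(audio, flags, gain_db=6.0)
print(round(float(boosted["FC"][0]), 3))  # 1.995 (about +6 dB)
print(float(boosted["FL"][0]))            # 1.0 (background unchanged)
```

Dialog switching works at the same point in the chain: the receiver substitutes an alternate dialog signal into the flagged channels before any such gain is applied.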

1,023 orders, but the MPEG-4 ALS standard for digital broadcasting is restricted to a maximum of 22.2 channels

*9 Mix levels used to convert a multi-channel signal into a smaller number of channels.


and 15 orders for prediction.

5. Future coding formats
Besides MPEG-4 AAC and ALS, a number of 3D audio formats with more channels than 5.1 ch have recently begun to be used in movie theaters and home reproduction systems. For example, Auro-3D places additional loudspeakers above the horizontal plane of the loudspeakers in the 5.1 ch scheme, and there are 3D audio formats, such as Dolby Atmos, that can mix independent audio channels, called objects, with other channels during playback. This section introduces MPEG-H 3D Audio as one such format that is in the process of international standardization.

5.1 Latest trends in MPEG Audio standards: MPEG-H 3D Audio
MPEG is currently working on standardization of MPEG-H 3D Audio10), as a next-generation audio coding format for video formats exceeding the quality of HDTV, including 4K and 8K Ultra High Definition. MPEG-H 3D Audio will encode multichannel audio such as 22.2 ch more efficiently and render 3D audio in smaller spaces with a more practical number of loudspeakers (e.g., 10.1 or 8.1 channels) by redistributing signals to the individual speaker channels.

The specification mainly targets home reproduction systems with loudspeakers positioned overhead. However, it also accommodates other viewing environments, such as personal computers, smartphones, and tablets used together with headphones.

The features of MPEG-H 3D Audio include advanced encoding technology, based on MPEG Unified Speech and Audio Coding (USAC)11)*10 and MPEG Spatial Audio Object Coding (SAOC)12)*11, and the use of multiple rendering technologies. The base rendering method is called Vector Base Amplitude Panning (VBAP)13)*12, which is combined with technology to play back the rendered signals in headphones or other loudspeaker arrangements. It also uses a format called Higher Order Ambisonics (HOA)14), which expands the sound field into a sum of spherical surface harmonic functions*13 for recording and playback.

5.2 MPEG-H 3D Audio coding technology
A block diagram of audio encoding with MPEG-H 3D Audio is shown in Figure 4. Encoding efficiency of channel-

*10 A low-bit-rate encoding combining codecs for speech and music.
*11 A multichannel encoding that separates dialog and background audio and allows the levels of the channels to be adjusted.
*12 A method using one to three speakers that adjusts the amplitude level in each speaker to reproduce a sound source at any coordinate in the plane of the speakers.
*13 Spherical harmonic functions appear when the wave equation is expressed in polar coordinates, and they are the components of wave motion in the angular direction.
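The VBAP idea can be shown in two dimensions: the gains of the two loudspeakers adjacent to the desired direction are found by inverting the matrix of loudspeaker unit vectors and then power-normalizing. This is Pulkki's textbook formulation13) reduced to a sketch (speaker angles chosen for illustration):

```python
import numpy as np

def vbap_2d(source_deg: float, spk1_deg: float, spk2_deg: float) -> np.ndarray:
    """Gains for a pair of loudspeakers reproducing a phantom source.

    Solves p = g1*l1 + g2*l2 for the gains, then normalizes them so
    that g1^2 + g2^2 = 1 (constant perceived power).
    """
    def unit(deg):
        r = np.radians(deg)
        return np.array([np.cos(r), np.sin(r)])

    L = np.column_stack([unit(spk1_deg), unit(spk2_deg)])
    g = np.linalg.solve(L, unit(source_deg))
    return g / np.linalg.norm(g)

# Speakers at +/-30 degrees; a source at 0 degrees uses both equally.
g = vbap_2d(0.0, -30.0, 30.0)
print(np.round(g, 3))  # [0.707 0.707]
```

Full MPEG-H rendering extends this to loudspeaker triplets on a 3D sphere and combines it with binaural playback for headphones.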

Multichannel, object, HOA, and object metadata (OAM) inputs pass through a pre-renderer/mixer, SAOC encoding, HOA encoding, and OAM data encoding into the MPEG-H 3D Audio core encoder, which outputs the encoded bit stream

Figure 4: Block diagram of MPEG-H 3D Audio encoding


based*14 objects is improved by encoding them after pre-rendering. Conversely, for objects whose playback position may change at the receiver, a monaural signal is provided to the encoder, and rendering and mixing are done by the receiver. Multiple objects can also be handled together using technologies such as MPEG SAOC; this reduces the number of transmission channels and the amount of data and improves coding efficiency. The core encoding block is AAC, and Single Channel Element (SCE)*15, Channel Pair Element (CPE)*16, and Quad Channel Element (QCE)*17 are used to improve efficiency. MPEG-H 3D Audio can also encode object metadata (OAM)*18 efficiently.

*14 The production studio signal is reproduced as-is by a speaker.
*15 A type of data block for signal transmission, standardized in AAC, consisting of compressed data from one channel.
*16 A data block consisting of data from two channels compressed together to improve encoding efficiency.
*17 A data block consisting of data from four channels compressed together in order to improve encoding efficiency.
*18 Attribute data indicating object information such as position.

6. Conclusion
This article described the revisions to MIC ordinances and ARIB standards that have been issued in an effort to standardize audio coding technology for 8K broadcasting. In addition, it introduced the formats conforming to the MPEG-4 AAC and ALS standards that will enable 22.2 ch sound broadcast services on advanced BS digital broadcasts and other services. It also described revisions to ARIB standards on downmixing and dialog control functions, revisions related to new broadcast services, and standardization trends related to MPEG-H 3D Audio, which is the latest audio encoding format for 3D audio. NHK will continue to contribute to domestic and international standardization in the future.

References
1) MIC Ordinance No. 87, “Standard broadcast formats for digital broadcasting as part of standard television broadcasting” (2011)
2) ISO/IEC 14496-3:2009, “Information Technology - Coding of Audio-visual Objects - Part 3: Audio” (2009)
3) ISO/IEC 14496-3:2005/Amd.2:2006, “Information Technology - Coding of Audio-visual Objects - Part 3: Audio, Amendment 2: Audio Lossless Coding (ALS), New Audio Profiles and BSAC Extensions” (2006)
4) ARIB: “Video coding, audio coding and multiplexing methods for digital broadcasting,” ARIB STD-B32 ver. 3.3 (2015)
5) ISO/IEC 14496-3:2009/AMD 4:2013, “New Levels for AAC Profiles” (2013)
6) T. Nishiguchi, K. Ono, K. Watanabe: “Development and Standardization of an Audio Production System for 8K Super Hi-Vision,” NHK STRL R&D, No. 148, pp. 12-21 (2014)
7) ISO/IEC 13818-7:2006(E), “Information Technology - Generic Coding of Moving Pictures and Associated Audio Information - Part 7: Advanced Audio Coding (AAC)” (2006)
8) T. Sugimoto, S. Oode and Y. Nakayama: “Downmixing Method for 22.2 Multichannel Sound Signal in 8K Super Hi-Vision Broadcasting,” J. Audio Eng. Soc. (2015)
9) Sugimoto, Nakayama: “22.2 ch Audio Encoder/Decoder Using MPEG-4 AAC,” Proc. Autumn Meeting of the ASJ, 2-P-9 (2015)
10) ISO/IEC 23008-3, “High Efficiency Coding and Media Delivery in Heterogeneous Environments - Part 3: 3D Audio”
11) ISO/IEC 23003-3:2012, “Information Technology - MPEG Audio Technologies - Part 3: Unified Speech and Audio Coding” (2012)
12) ISO/IEC 23003-2:2010, “Information Technology - MPEG Audio Technologies - Part 2: Spatial Audio Object Coding (SAOC)” (2010)
13) V. Pulkki: “Virtual Sound Source Positioning Using Vector Base Amplitude Panning,” J. Audio Eng. Soc., Vol. 45, pp. 456-466 (1997)
14) J. Daniel, R. Nicol and S. Moreau: “Further Investigations of High Order Ambisonics and Wavefield Synthesis for Holophonic Sound Imaging,” 114th AES Conv., Amsterdam, The Netherlands (2003)
