<<

arXiv:1602.04845v1 [cs.MM] 15 Feb 2016 n o rmteAES. the from not and 1 constra real-time under operate into not built do knowledge that psychoacoustic the audio co existing to transform a attention with particular coder prediction with linear RFC6716. a as combining by Opus applications the standardized recently IETF The ABSTRACT nso A,a h vlain nScin6show. 6 Section in vari- evaluations LC the and as HE AAC, competi- the of and ants is and Opus as storage such delay, for streaming, designed low codecs the high-delay with Despite tive lower de- even ultra-low [3]. net- these ms, lays as require 5 such performance as Applications music work low ms). (15 as AAC-ELD delays than to scales Opus delay. algorithmic low with a a loss with packet integration over concealment, good and bitrates, (RTP), operating Protocol changing Real-Time music, the of and means range This speech designed wide codec for applications. audio standardized support Internet versatile recently interactive highly IETF for a the [2], [1] Opus 6716 RFC In orsodnesol eadesdt enMr ai ( Valin Jean-Marc to addressed be should Correspondence 2 1 enMr Valin Jean-Marc the Codec in Opus Coding Music Low-Delay High-Quality, hspprwsacpe o ulcto tte135 the at publication for accepted was paper This Xiph.Org Mozilla, . INTRODUCTION 1 rgr Maxwell Gregory , h psCodec Opus The rsne tte15hASConvention AES 135th the at Presented 03Otbr1–0NwYr,USA York, New 17–20 October 2013 1 ihOgFoundation Xiph.Org ioh .Terriberry B. Timothy , th E ovnin hsvrino h ae sfo h authors the from is paper the of version This Convention. AES pssupports Opus ito,adXp.r’ ETcdc ae nthe pre- on linear based on codec, based CELT Xiph.Org’s codec, and [4] technolo- diction, SILK core two ’s from codec gies: Opus the created We artifacts. the switching of no all with change above, dynamically can signaling In-band • • • • • [email protected] ieadobnwdh,fo narrowband kHz), from (48 fullband to bandwidths, kHz) (8 audio kb/s, Five 510 to kb/s 6 from Bitrates ooadsee coupling. stereo and Mono and music, ms, and 60 Speech to ms 2.5 from sizes Frame 1 n onVos Koen and , ints. agt ierneo eltm Internet real-time of range wide a targets e.W ecietetasomcoder, transform the describe We der. h omt h eutout-performs result The format. the 2 ) Valin et al. Music Coding in Opus

Modified Discrete Cosine Transform (MDCT). Sec- tion 2 presents a high-level view of their unification into Opus. After many major incompatible changes to the original codecs, the result is open source, with patents licensed under royalty-free terms1. The ref- erence encoder is professional-grade, and supports

Constant bit-rate (CBR) and variable bit-rate Fig. 1: High-level overview of the Opus codec. • (VBR) rate control, Floating point and fixed-point arithmetic, and • CELT’s look-ahead is 2.5 ms, while SILK’s look- Variable encoder complexity. ahead is 5 ms, plus 1.5 ms for the resampling (in- • cluding both encoder and decoder resampling). For Its CBR produces packets with exactly the size the this reason, the CELT path in the encoder adds a encoder requested, without a bit reservoir to imposes 4 ms delay. However, an application can restrict the additional buffering delays, as found in codecs such encoder to CELT and omit that delay. This reduces as MP3 or AAC-LD. It has two VBR modes: Con- the total look-ahead to 2.5 ms. strained VBR (CVBR), which allows bitrate fluctu- 2.1. Configuration and Switching ations up to the average size of one packet, making it equivalent to a bit reservoir, and true VBR, which Opus signals the mode, frame size, audio bandwidth, does not have this constraint. and channel count (mono or stereo) in-band. It en- codes this in a table-of-contents (TOC) at the This paper focuses on the CELT mode of Opus, start of each packet [1]. Additional internal framing which is used primarily for encoding music. Al- allows it to multiple frames into a single packet, though it retains the fundamental principles of the up to a maximum duration of 120 ms. Unlike the original algorithms published in [5, 6] reviewed in rest of the Opus bitstream, the TOC byte and inter- Section 3, the CELT algorithm used in Opus differs nal framing are not entropy coded, so applications significantly from that work. We have psychoacous- can easily access the configuration, split packets into tically tuned the bit allocation and quantization, as individual frames, and recombine them. Section 4 outlines, and designed additional tools to Configuration changes that use CELT on both sides conceal artifacts, which Section 5 describes. (or between wideband SILK and Hybrid mode) use the overlap of the transform window to avoid dis- 2. OVERVIEW OF OPUS continuities. However, switching between CELT and Opus operates in one of three modes: SILK or Hybrid mode is more complicated, because SILK operates in the time domain without a win- dow. For such cases, the bitstream can include an SILK mode (speech signals up to wideband), • additional 5 ms redundant CELT frame that the de- CELT mode (music and high-bitrate speech), or coder can overlap-add to bridge and gap between the • discontinuous data. Two redundant CELT frames— Hybrid mode (SILK and CELT simultaneously one on each side of the transition—allow smooth • for -wideband and fullband speech). transitions between modes that use SILK at differ- ent sampling rates. The encoder handles all of this Fig. 1 shows a high-level overview of Opus. CELT transparently. The application may not even notice always operates at a sampling rate of 48 kHz, while that the mode has changed. SILK can operate at 8 kHz, 12 kHz, or 16 kHz. In hybrid mode, the crossover frequency is 8 kHz, with 3. CONSTRAINED-ENERGY LAPPED SILK operating at 16 kHz and CELT discarding all TRANSFORM (CELT) frequencies below the 8 kHz Nyquist rate. Like most algorithms, CELT is 1http://opus-codec.org/license/ based on the MDCT. However, the fundamental idea

AES 135th Convention, New York, USA, 2013 October 17–20 Page 2 of 10 Valin et al. Music Coding in Opus

frame size lookahead

algorithmic delay

MDCT window

0 60 180 240 300 420 480 Fig. 2: CELT band layout vs. the Bark scale. Time (sample)

Fig. 3: Low-overlap window used for 5 ms frames. behind CELT is that the most important perceptual aspect of an audio is the spectral envelope. CELT preserves this envelope by explicitly coding coefficients. For 20-ms frames, there are 8 MDCTs the energy of a set of bands that approximates the with full-overlap, 5 ms windows. We constrain the auditory system’s critical bands [7]. Fig. 2 illustrates band sizes to be a multiple of the number of short the band layout used by Opus. The format itself MDCTs, so that the interleaved coefficients form incorporates a significant amount of psychoacoustic bands of the same size, covering the same part of knowledge. This not only reduces certain types of the spectrum in each block, as the corresponding artifacts, it also avoids coding some parameters. band of a long MDCT. CELT uses flat-top MDCT windows with a fixed 4. QUANTIZATION AND ENCODING overlap of 2.5 ms, regardless of the frame size, as Fig. 3 shows. The overlapping part of the window is Opus supports any bitrate that corresponds to an the Vorbis [8] power-complementary window integer number of per frame. Rather than signal a rate explicitly in the bitstream, Opus re- 1 π 2 π n + 2 lies on the lower-level transport protocol, such as w(n) = sin sin . (1) RTP, to transmit the payload length. The decoder, " 2 2L !# not the encoder, makes many bit allocation decisions Unlike the AAC-ELD low-overlap window, the Opus automatically based on the number of bits remain- window is still symmetric. Compared to the full- ing. This means that the encoder must determine overlap of MP3 or Vorbis, the low overlap allows the final rate early in the encoding process, so it lower algorithmic delay and simplifies the handling can make matching decisions, unlike codecs such as of transients, as Section 3.1 describes. The main AAC, MP3, and Vorbis. This has two advantages. drawback is increased spectral leakage, which is First, the encoder need not transmit these decisions, problematic for highly tonal signals. We mitigate avoiding the associated overhead. Second, they al- this in two ways. First, the encoder applies a first- low the encoder to achieve a target bitrate exactly, −1 order pre-emphasis filter Ap (z)=1 0.85z to without repeated encoding or bit reservoirs. Even the input, and the decoder applies the− inverse de- though entropy coding produces variable-sized out- emphasis filter. This attenuates the low frequencies put, these dynamic adjustments to the bit alloca- (LF), reducing the amount of leakage they cause at tion ensure that the coded symbols never exceed the higher frequencies (HF). Second, the encoder applies number of bytes allocated for the frame by the en- a perceptual prefilter, with a corresponding postfil- coder earlier in the process. In the vast majority of ter in the decoder, as Section 5.1 describes. cases, the encoder also wastes less than two bits. Fig. 4 shows a complete block diagram of CELT. Opus encodes most symbols using a range coder [9]. Sections 4 and 5 describe the various components. Some symbols, however, have a power-of-two range and approximately uniform probability. Opus packs 3.1. Handling of Transients these as raw bits, starting at the end of the packet, Like other transform codecs, Opus controls pre-echo back towards the end of the range coder output, as primarily by varying the MDCT size. When the en- Fig. 5 illustrates. This allows the decoder to rapidly coder detects a transient, it computes multiple short switch between decoding symbols with the range MDCTs over the frame and interleaves the output coder and reading raw bits, without interleaving the

AES 135th Convention, New York, USA, 2013 October 17–20 Page 3 of 10 Valin et al. Music Coding in Opus

Fig. 4: Overview of the CELT algorithm.

4.2. Bit Allocation

Rather than transmitting scale factors like MP3 and AAC or a floor curve like Vorbis, CELT mostly al- Fig. 5: Layout and coding order of the bitstream. locates bits implicitly. After coarse energy quan- tization, the encoder decides on the total number data in the packet. It also improves robustness to bit of bytes to use for the frame. Then both the en- errors, as corruption in the raw bits does not desyn- coder and decoder run the same bit-exact bit alloca- tion function to partition the bits among the bands. chronize the range coder. A special termination rule for the range coder, described in Section 5.1.5 of CELT interpolates between several static allocation RFC 6716 [1], ensures the stream remains decod- prototypes (see Fig. 6) to achieve the target rate. able regardless of the values of the raw bits, while Some bands may not receive any bits. The decoder using at most 1 bit of padding to separate the two. reconstructs them using only the energy, generating 4.1. Coarse Energy Quantization (Q1) fine details by spectral folding, as Section 4.4.1 de- tails. When a band receives very few bits, the sparse The most important information encoded in the bit- spectrum that could be encoded with them would stream is the energy of the MDCT coefficients in sound worse than spectral folding. Such bands are each band. Band energy is quantized using a two- automatically skipped, redistributing the bits they pass coarse-fine quantizer. The coarse quantizer uses would have used to code their spectrum to the re- a fixed 6 dB resolution for all bands, with inter- maining bands. The encoder can also skip more band prediction and, optionally, inter-frame predic- bands via explicit signaling. This allows it to give tion. The 2D z-transform of the predictor is the skip decisions some hysteresis between frames. −1 −1 1 zb After the initial allocation, bands are encoded one A (zℓ,zb)= 1 αz − − , (2) − ℓ · 1 βz 1 at a time. In practice, a band may use slightly more − b  or slightly fewer bits than allocated. The difference where ℓ is the frame index and b is the band. Inter- propagates to subsequent bands to ensure that the frame prediction can be turned on or off for any final rate still matches the overall target. Automat- frame. When enabled, both α and β are non-zero ically adjusting the allocation based on the actual and depend on the frame size. When disabled, α =0 bits used makes achieving CBR easy. and β = 0.15. Inter-frame prediction is more effi- cient, but less robust to packet loss. The encoder The implicit allocation produces a nearly constant can use packet loss statistics to force inter-frame signal-to-noise ratio in each band, with the LF coded prediction off adaptively. The prediction residual is at a higher resolution than the HF. It approximates entropy-coded assuming a Laplace probability dis- the real masking curve well without any signaling, tribution with per-band variances trained offline. and achieves good quality by itself, as demonstrated

AES 135th Convention, New York, USA, 2013 October 17–20 Page 4 of 10 Valin et al. Music Coding in Opus

7 In transients frames, bands dominated by the • 6 leakage of the shorter MDCTs receive more bits.

5 Bands that have significantly larger energy than • surrounding bands receive more bits. 4

3 CELT does not provide a mechanism to reduce the allocation of a single band because it would not be 2 Depth (bit/coefficient) worth the signaling cost.

1 4.3. Fine Energy Quantization (Q2) 0 0 5 10 15 20 Once the per-band bit allocation is determined, the Band ID encoder refines the coarsely-quantized energy of each band. Let a be the total allocation for a band con- 2 Fig. 6: Static bit allocation curves in bits/sample taining NDoF degrees of freedom . We approximate for each band and for multiple bitrates. (30) from [10] to obtain the fine energy allocation: a 1 af = + log2 NDoF Kfine , (3) by earlier versions of the algorithm [5, 6]. However, NDoF 2 − it does not cover two theoretical phenomena: where Kfine is a tuned fine allocation offset. We 1. Tonality: tones provide weaker masking than round the result to an integer and code the refine- noise, requiring a finer resolution. Since tones ment data as raw bits. Bands where NDoF = 2 get usually have harmonics in many bands, the en- slightly more bits, and we slightly bias the allocation coder increases the total rate for these frames. upwards when adding the first and second fine bit. 2. Inter-band masking: a band may be masked by If any bits are left unused at the very end of the neighboring bands, though this is weaker than frame, each band may add one additional bit per intra-band masking. channel to refine the band energy, starting with bands for which af was rounded down. CELT provides two signaling mechanisms that ad- 4.4. Pyramid (Q3) just the implicit allocation: one that changes the tilt of the allocation, and one that boosts specific bands. Let Xb be the MDCT coefficients for band b. We normalize the band with the unquantized energy, 4.2.1. Allocation Tilt Xb The allocation tilt parameter changes the slope of xb = , (4) Xb the bit allocation as a function of the band index k k by up to 5/64 bit/sample/band, in increments of ± producing a unit vector on an N-sphere, coded with 1/64 bit/sample/band. Although in theory the slope a pyramid vector quantizer (PVQ) [11] codebook: of the masking threshold should follow the slope of the signal’s spectral envelope, we have observed that N−1 y N LF-dominated signals require more bits in the LF, S (N,K)= , y Z : yi = K , y ∈ | | ( ( i=0 )) with a similar observation for HF-dominated signals. k k X 4.2.2. Band Boost where K is the L1-norm of y, i.e. the number of When a specific band requires more bits, the bit- pulses. The codebook size obeys the recurrence stream includes a mechanism for increasing its allo- V (N,K)= V (N,K 1) cation (reducing the allocation of all other bands). − Versions 1.0.x and earlier of the Opus reference im- + V (N 1,K)+ V (N 1,K 1) , (5) − − − plementation rarely use this band boost. However, 2 Usually equal to the number of coefficients. When stereo newer versions use it to improve quality in the fol- coupling is used on a band with more than 2 coefficients, the lowing circumstances: combined band has an additional degree of freedom.

AES 135th Convention, New York, USA, 2013 October 17–20 Page 5 of 10 Valin et al. Music Coding in Opus

with V (N, 0) = 1 and V (0,K)=0, K> 0. Be- Opus encodes the mid and side as normalized signals cause V (N,K) is rarely a power of two, we use the m = M/ M and s = S/ S . To recover M and S range coder with a uniform probability to encode the from m andk sk, we need tok knowk the ratio of S to codeword index, derived from y using the method M , which we encode as the angle k k of [11]. When V (N,K) is larger than 255, the index k k is renormalized to fall in the range [128, 255] and S θs = arctan k k . (8) the least significant bits are coded using raw bits. M k k The uniform probability allows both the encoder and Explicitly coding θs preserves the stereo width and the decoder to choose K such that log2 V (N,K) achieves allocation determined in Section 4.2. reduces the risk of stereo unmasking [12, 7], since it preserves the energy of the difference signal, in 4.4.1. Spectral Folding addition to the energy in each channel. We quantize When a band receives no bits, the decoder replaces θs uniformly, deriving the resolution the same way the spectrum of that band with a normalized copy of as the fine energy allocation. Uniform quantization MDCT coefficients from lower frequencies. This pre- of θs achieves optimal mean-squared error (MSE). serves some temporal and tonal characteristics from Let mˆ and ˆs be the quantized versions of m and s. the original band, and CELT’s energy normalization We can compute the reconstructed signals as preserves the spectral envelope. Spectral folding is far less advanced than spectral band replication xˆl = mˆ cos θˆs + ˆs sin θˆs , (9) (SBR) from HE-AAC and mp3PRO, but is compu- xˆr = mˆ cos θˆs ˆs sin θˆs . (10) tationally inexpensive, requires no extra delay, and − the decision to apply it can change frame-by-frame. As a result of the quantization, mˆ and ˆs may not be 4.5. Stereo orthogonal, so xˆl and xˆr may not have exactly unit norm and must be renormalized. Opus supports three different stereo coupling modes: The MSE-optimal bit allocation for m and s depends 1. Mid-side (MS) stereo on θˆs. Let N be the size of the band and a be the total number of bits available for m and s. Then the 2. Dual stereo optimal allocation for m is 3. Intensity stereo a (N 1) log tan θs a = − − 2 . (11) A coded band index denotes where intensity stereo mid 2 begins: all bands above it use intensity stereo, while m s all bands below it use either MS or dual stereo. A The larger of and is coded first, and any unused single flag at the frame level chooses between them. bits are given to the other channel. As a special case, when N = 2 we use the orthogonality of m and s to 4.5.1. Mid-Side Stereo code one of the channels using a single sign bit. We apply MS stereo coupling separately on each 4.5.2. Dual Stereo band, after normalization. Because we code the en- ergy of each channel separately, MS stereo coupling Dual stereo codes the normalized left and right chan- never introduces cross-talk between channels and is nels independently. We use this only when the cor- relation between the channels is not strong enough safe even when dual stereo is more efficient. Let xl to make up for the cost of coding the θs angles. and xr be the normalized band for the left and right channels, respectively. The orthogonal mid and side 4.5.3. Intensity Stereo signals are computed as Intensity stereo also works in the normalized do- xl + xr main, using a single mid channel with no side. In- M = , (6) 2 stead of θs, we code a single inversion flag for each xl xr band. When set, we invert the right channel, pro- S = − . (7) 2 ducing two channels 180 degrees out of phase.

AES 135th Convention, New York, USA, 2013 October 17–20 Page 6 of 10 Valin et al. Music Coding in Opus

12 4.6. Band splitting tapset 0 tapset 1 10 tapset 2 At high bitrates, we allocate some bands hundreds of bits. To avoid arithmetic on large integers in the 8 PVQ index calculations, we split bands with more 6 than 32 bits, using the same process as MS stereo. 4 M and S are set to the first and second half of the 2 band, with θs indicating the distribution of energy Response (dB) 0 between the two halves. If a band contains data from -2 multiple short MDCTs, we bias the bit allocation to -4 account for pre-echo or forward masking using θˆs. If -6 one sub-vector still requires more than 32 bits, we 0 5 10 15 20 split it recursively. This recursion stops after 4 levels Frequency (kHz) (1/16th the size of the original band), which puts a hard limit on the number of bits a band can use. Fig. 7: Frequency response of the different postfilter This limit lies far beyond the rate needed to achieve tapsets for T = 24, g =0.75. transparency in even the most difficult samples. different tapsets to control the range of frequencies 5. PSYCHOACOUSTIC IMPROVEMENTS to which we apply the enhancement. They are We can achieve good audio quality using just the a0 · = [0.80 0.10 0] , algorithms described above. However, four dif- , a1 · = [0.46 0.27 0] , (13) ferent psychoacoustically-motivated improvements , a2 · = [0.30 0.22 0.13] . make coding artifacts even less audible. , 5.1. Prefilter and Postfilter The pitch period lies in the range [15, 1022], and the gain varies between 0.09 and 0.75. Fig. 7 shows The low-overlap window increases leakage in the the frequency response of each tapset for a period of MDCT, resulting in higher quantization noise on T = 24 (2 kHz) and a gain g =0.75. highly tonal signals. Widely-spaced harmonics in periodic signals provide especially little masking. Subjective testing conducted by Broadcom on an Opus mitigates this problem using a pitch-enhancing earlier version of the algorithm demonstrated the post-filter. Unlike speech codec postfilters, we run a postfilter’s effectiveness [13]. matching prefilter on the encoder side. The pair pro- 5.2. Variable Time-Frequency Resolution vides perfect reconstruction (in the absence of quan- Some frames contain both tones and transients, re- tization), allowing us to enable the postfilter even at quiring both good time resolution and good fre- high bitrates. Although the filters look like a pitch quency resolution. Opus achieves this by selec- predictor, unlike pitch prediction we ap- tively modifying the time-frequency (TF) resolution ply the prefilter to the unquantized signal, allowing in each band. For example, Opus can have good fre- pitch periods shorter than the frame size. The gain quency resolution for LF tonal content while retain- and period are transmitted explicitly. When these ing good time resolution for a transient’s HF. We change between two frames, the filter response is in- change the TF resolution with a Hadamard trans- terpolated using a 2.5 ms cross-fade window equal to form, a cheap approximation of the DCT. When us- the square of the w(n) power-complementary win- ing multiple short MDCTs (good time resolution), dow. We use a 5-tap prefilter with an impulse re- we increase the frequency resolution of a band by sponse of applying the Hadamard transform to the same co- −T −2 −T +2 efficient across multiple MDCTs. This can increase A (z)=1 g ap,2 z + z − · the frequency resolution by a factor of 2 to 8, de- −T −1 −T +1 −T + ap,1 z  + z + ap,0z , (12) creasing the time resolution by the same amount. where T is the pitch period, g is the gain, and ap,i are The Hadamard transform of consecutive coefficients the coefficients of tapset p. We choose one of three increases the time resolution of a long MDCT. This

AES 135th Convention, New York, USA, 2013 October 17–20 Page 7 of 10 Valin et al. Music Coding in Opus

1 1.2 long MDCT short MDCT transformed short MDCT transformed long MDCT 1 Input Spreading Rotation 0.5 0.8 0.6 Spread 0.4 0 PVQ Quantization (n=64; k=4) 0.2 Coded 0 -0.5 -0.2 Inverse Spreading Rotation

-0.4 Output -1 0 200 400 600 800 1000 0 200 400 600 800 1000 Time (sample) Time (sample) Fig. 9: Spreading example. Fig. 8: Basis functions with modified time- frequency resolution for a 20 ms frame. Left: fourth basis function of a long MDCT vs. the equivalent ba- the end, and then back. We determine θr from the sis function from TF modification of 8 short MDCTs. band size, N, and the number of pulses used, K: Right: First (“DC”) basis function of a short MDCT 2 π N vs. the equivalent basis function from TF modifica- θ = , (15) r 4 N + δK tion of the first 8 coefficients of a long MDCT.   where δ is the spreading constant. Once per frame, yields more time-localized basis functions, although the encoder selects δ from one of three values: 5, 10, they have more ringing than the equivalent short or 15, or disables spreading completely. MDCT basis functions. Fig. 8 illustrates basis func- In transient frames, we apply the spreading rotations tions produced by adaptively modifying the time- to each short MDCT separately to avoid pre-echo. frequency resolution for a 20 ms frame. When vectors of more than 8 coefficients need to be 5.3. Spreading Rotations rotated, we apply an additional set of rotations to pairs of coefficients √N positions apart, using the A common type of artifact in transform codecs is ′ angle θ = π θ . This spreads the energy within tonal noise, also known as birdies. When quantizing r 2 r   large bands more− widely. a large number of HF MDCT coefficients to zero, the few remaining non-zero coefficients sound tonal 5.4. Collapse Prevention even when the original signal did not. This is most In transients at low bitrates, Opus may quantize all noticeable in low-bitrate . Opus greatly re- of the coefficients in a band corresponding to a par- duces tonal noise by applying spreading rotations. ticular short MDCT to zero. Even though we pre- The encoder applies these rotations to the normal- serve the energy of the entire band, this quantization ized signal prior to quantization, and the decoder causes audible drop outs, as Fig. 10 shows on the left. applies the inverse rotations, as Fig. 9 shows. The decoder detects holes that occur when a short We construct the spreading rotations from a series MDCT receives no pulses in a given band, or when of 2D Givens rotations. Let G (m,n,θr) denote a folding copies such a hole into a higher band, and Givens rotation matrix by angle θr between coef- fills them with pseudo-random noise at a level equal ficients m and n in some band with N coefficients, to the minimum band energy over the previous two with angles near π/4 implying more spreading. Then frames. The encoder transmits one flag per frame the spreading rotations are that can disable collapse prevention. We do this af- ter two consecutive transients to avoid putting too N−3 much energy in the holes. Fig. 10 shows the result R (θr)= G (k, k +1,θr) of collapse prevention on the right. The short drop k=0 Y outs around each transient are no longer audible. N G (N k,N k +1,θr) . (14) 6. EVALUATION AND RESULTS · − − k=2 Y This section presents a quality evaluation of Opus’s In other words, we rotate adjacent coefficient pairs CELT mode on music signals. More complete eval- one at a time from the beginning of the vector to uation data on Opus is available at [14].

AES 135th Convention, New York, USA, 2013 October 17–20 Page 8 of 10 Valin et al. Music Coding in Opus

4.2

4.0

3.8

3.6 Average score

3.4

3.2 Vorbis Nero HE-AAC Apple HE-AAC Opus

Fig. 11: Results of the 64 kb/s evaluation. The low anchor (omitted) was rated at 1.54 on average.

Apple’s HE-AAC was better than both Nero’s HE- AAC and Vorbis with greater than 99.9% confidence. Nero’s HE-AAC and Vorbis were statistically tied. A Fig. 10: Extreme collapse prevention example for simple ANOVA analysis gives the same results. castanets at 32 kb/s mono. Top: without collapse prevention. Bottom: with collapse prevention. 6.2. Cascading Performance In broadcasting applications, audio streams are com- pressed and recompressed multiple times. According 6.1. Subjective Quality to [18], typical broadcast chains may include up to 5 Volunteers of the HydrogenAudio forum3 evaluated lossy encoding stages. For this reason, we compare the quality of 64 kb/s VBR Opus on fullband the cascading quality of Opus to both Vorbis and stereo music with headphones. 13 listeners evalu- MP3 using PQevalAudio [19], an implementation of ated 30 samples using the ITU-R BS.1116-1 method- the PEAQ basic model [20]. Fig 12 plots quality as ology [15] with a function of bitrate and the number of cascaded en- codings. Opus performs better than MP3 and Vorbis The Opus [2] reference implementation (v0.9.2), in the presence of cascading, with 64 kb/s Opus even • Apple’s HE-AAC4 (QuickTime v7.6.9), out-performing 128 kb/s MP3. Although the Opus • quality with 5 ms frames is lower than for 20 ms Nero’s HE-AAC5 (v1.5.4.0), and • frames, it is still acceptable, and better than MP3. Vorbis (AoTuV6 v6.02 Beta). • 7. CONCLUSION AND FUTURE WORK Apple’s AAC-LC at 48 kb/s served as a low anchor. Fig. 11 shows the results. A pairwise resampling- By building psychoacoustic knowledge into the Opus based free step-down analysis using the max(T) al- format, we minimize the side information it trans- gorithm [16, 17] reveals that Opus is better than mits and the impact of coding artifacts. This allows the other codecs with greater than 99.9% confidence. Opus to achieve higher music quality than existing non-real-time codecs, even under cascading. Since 3 http://hydrogenaudio.org/ Opus was only recently standardized, we are con- 4With constrained VBR, as it cannot run unconstrained 5 tinuing to improve its encoder, experimenting with http://www.nero.com/enu/company/about-nero/nero- aac-codec.php such things as look-ahead and automatic frame size 6http://www.geocities.jp/aoyoume/aotuv/ switching for non-real-time encoding.

AES 135th Convention, New York, USA, 2013 October 17–20 Page 9 of 10 Valin et al. Music Coding in Opus

0 0 [9] G. Nigel and N. Martin. : An -0.5 -0.5 algorithm for removing redundancy from a digi- -1 -1 tised message. In Proc. and Data Record- -1.5 -1.5 ing Conference, 1979. -2 -2 [10] H. Kr¨uger, R. Schreiber, B. Geiser, and P. Vary. PEAQ ODG -2.5 PEAQ ODG -2.5 On logarithmic spherical vector quantization. -3 -3 In Proc. ISITA, 2008. Opus (20ms) Opus (20ms) -3.5 Opus (5ms) -3.5 Opus (5ms) Vorbis Vorbis (aoTuV 6.03) MP3 MP3 (LAME 3.99.5) [11] T. R. Fischer. A pyramid vector quantizer. -4 -4 1 2 3 4 5 6 7 8 9 10 64 80 96 112 128 144 160 192 224 256 cascadings Bitrate (kbit/s) IEEE Trans. on , 32:568– 583, 1986. Fig. 12: Cascading quality. Left: Quality degrada- [12] J. D. Johnston and A. J. Ferreira. Sum- tion vs. number of cascadings at 128 kb/s. Right: difference stereo transform coding. In Proc. Quality degradation vs. bitrate after 5 cascadings. ICASSP, volume 2, pages 569–572, 1992. [13] R. Chen, T. B. Terriberry, J. Skoglund, 8. REFERENCES G. Maxwell, and H. T. M. Nguyet. Opus test- th [1] J.-M. Valin, K. Vos, and T. B. Ter- ing. In Proc. codec WG, 80 IETF meeting, riberry. Definition of the Opus Audio pages 1–4, Prague, 2011. http://www.ietf. Codec. RFC 6716, http://www.ietf.org/ org/proceedings/80/slides/codec-4.pdf. rfc/rfc6716.txt, September 2012. [14] . Hoene, J.-M. Valin, K. Vos, and J. Skoglund. Summary of opus listening test results. IETF [2] Opus website. http://opus-codec.org/. Internet-Draft http://tools.ietf.org/ [3] A. Carˆot. Musical Telepresence – A Compre- html/draft-ietf-codec-results, 2012. hensive Analysis Towards New Cognitive and [15] ITU-R. Recommendation BS.1116-1: Methods Technical Approaches. PhD thesis, University for the subjective assessment of small impair- of L¨ubeck, 2009. ments in audio systems including multichannel sound systems, 1997. [4] K. Vos, S. Jensen, and K. Sørensen. SILK speech codec. IETF Internet-Draft http:// [16] Peter H. Westfall and S. Stanley Young. tools.ietf.org/html/draft-vos-silk-02. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. Wiley Se- [5] J.-M. Valin, T. B. Terriberry, and G. Maxwell. ries in Probability and Statistics. John Wiley & A full-bandwidth with low com- Sons, New York, January 1993. plexity and very low delay. In Proc. EUSIPCO, 2009. [17] Gian-Carlo Pascutto. Bootstrap. http://www. sjeng.org/bootstrap.html, 2011. [6] J.-M. Valin, T. B. Terriberry, C. Montgomery, [18] D. Marston and A. Mason. Cascaded audio cod- and G. Maxwell. A high-quality speech and au- ing. EBU Technical Review, 2005. dio codec with less than 10 ms delay. IEEE Trans. Audio, Speech and Language Processing, [19] P. Kabal. An examination and interpretation of 18(1):58–67, 2010. ITU-R BS.1387: Perceptual evaluation of au- dio quality. Technical report, TSP Lab, ECE [7] B. C.J. Moore. An Introduction to the Psychol- Dept., McGill University, http://www.TSP. ogy of Hearing. fifth edition, 2004. ECE.McGill.CA/MMSP/Documents, May 2002.

[8] C. Montgomery. Vorbis I specification. [20] ITU-R. Recommendation BS.1387: Perceptual http://www.xiph.org/vorbis/doc/Vorbis_ Evaluation of Audio Quality (PEAQ) recom- I_spec.html, 2004. mendation, 1998.

AES 135th Convention, New York, USA, 2013 October 17–20 Page 10 of 10