arXiv:1602.05311v1 [cs.MM] 17 Feb 2016

A FULL-BANDWIDTH AUDIO CODEC WITH LOW COMPLEXITY AND VERY LOW DELAY

Jean-Marc Valin†‡, Timothy B. Terriberry‡, Gregory Maxwell‡

†Octasic Semiconductor, 4101 Molson Street, suite 300, Montreal (Quebec) Canada
‡Xiph.Org Foundation

ABSTRACT

We propose an audio codec that addresses the low-delay requirements of some applications such as network music performance. The codec is based on the modified discrete cosine transform (MDCT) with very short frames and uses gain-shape quantization to preserve the spectral envelope. The short frame sizes required for low delay typically hinder the performance of transform codecs. However, at 96 kbit/s and with only 4 ms algorithmic delay, the proposed codec out-performs the ULD codec operating at the same rate. The total complexity of the codec is small, at only 17 WMOPS for real-time operation at 48 kHz.

1. INTRODUCTION

Recent research has focused on increasing the audio quality of speech codecs to “full bandwidth” rates of 44.1 or 48 kHz to make them suitable to more general purpose applications [1, 2]. However, while such applications require moderate algorithmic delays, some require very low delay. One example is networked music performance, where two or more musicians playing remotely require less than 25 ms of total delay to be able to properly synchronize with each other [3]. Another example is a wireless audio device, such as a digital microphone, where delay causes desynchronization with the visible speaker. Teleconferencing systems where only limited acoustic echo control is possible also benefit from very low delay, as it makes acoustic echo less perceptible.

We propose a codec that provides high audio quality while maintaining very low delay. Its characteristics are as follows:
• sampling rate of 48 kHz;
• frame size of 256 samples (5.3 ms) with 128 samples look-ahead (2.7 ms);
• achieves very good audio quality at 64 kbit/s (mono);
• a total complexity of 17 WMOPS;
• optional support for other sampling rates and frame sizes, such as 128-sample frames with 64 samples look-ahead.

We introduce the basic principles of the codec in Section 2 and go into the details of the quantization in Section 3. We then discuss how the proposed approach compares to other low-delay codecs in Section 4, followed by direct audio quality comparisons in Section 5 and the conclusion in Section 6.

We would like to thank all the listeners who participated in the subjective evaluation.

2. OVERVIEW OF THE CODEC

The proposed codec is based on the modified discrete cosine transform (MDCT). To minimize the algorithmic delay, we use a short frame size, combined with a reduced-overlap window. This results in an algorithmic delay of 384 samples for the 256-sample frame size configuration shown in Fig. 1.

Figure 1: Power-complementary windows with reduced overlap. (Horizontal axis: time in samples, 0 to 512; annotations: frame size, lookahead, MDCT window, algorithmic delay.)

The structure of the encoder is shown in Fig. 2 and its basic principles can be summarized as follows:
• the MDCT output is split in bands approximating the critical bands;
• the energy (gain) in each band is quantized and transmitted separately;
• the details (shape) in each band are quantized algebraically using a spherical codebook;
• the bit allocation is inferred from information shared between the encoder and the decoder.

Figure 2: Basic structure of the encoder.

The most important aspect of these is the explicit coding of a per-band energy constraint combined with an independent shape quantizer which never violates that constraint. This prevents artifacts caused by energy collapse or overshoot and preserves the spectral envelope’s evolution in time. The bands are defined to match the ear’s critical bands as closely as possible, with the restriction that bands must be at least 3 MDCT bins wide. This lower limit results in 19 bands for the codec when 256-sample frames are used.

3. QUANTIZATION

We use a type of arithmetic coder called a range coder [4] for all symbols. We use it not only for entropy coding, but also to approximate the infinite precision arithmetic required to optimally encode integers whose range is not a power of two.

3.1 Energy quantization (Q1, Q2)

The energy of the final decoded signal is algebraically constrained to match the explicitly coded energy exactly. Therefore it is important to quantize the energy with sufficient resolution because later stages cannot compensate for quantization error in the energy. It is perceptually important to preserve the band energy, regardless of the resolution used for encoding the band shape.

We use a coarse-fine strategy for encoding the energy in the log domain (dB). The coarse quantization of the energy (Q1) uses a fixed resolution of 6 dB. This is also the only place we use prediction and entropy coding. The prediction is applied both in time (using the previous frame) and in frequency (using the previous band). The 2-D z-transform of the prediction filter is

$$A(z_\ell, z_b) = \left(1 - \alpha z_\ell^{-1}\right) \cdot \frac{1 - z_b^{-1}}{1 - \beta z_b^{-1}} , \tag{1}$$

where b is the band index and ℓ is the frame index. Unlike methods which require predictor reset [5], the proposed system with α < 1 is guaranteed to re-synchronize after a transmission error. We have obtained good results with α = 0.8 and β = 0.7. To prevent error accumulation, the prediction is applied on the quantized log-energy. The prediction step reduces the entropy of the coarsely-quantized energy from 61 to 30 bits. Of this 31-bit reduction, 12 are due to inter-frame prediction. We approximate the ideal probability distribution of the prediction error using a Laplace distribution, which results in an average of 33 bits per frame to encode the energy of all 19 bands at a 6 dB resolution. Because of the short frames, this represents up to a 16% bitrate savings on the configurations tested in Section 5.

The fine energy quantizer (Q2) is applied to the Q1 quantization error and has a variable resolution that depends on the bit allocation. We do not use entropy coding for Q2 since the quantization error of Q1 is mostly white and uniformly distributed.

3.2 Shape quantization (Q3)

We normalize each band by the unquantized energy, so its shape always has unit norm. We thus need to quantize a vector on the surface of an N-dimensional hyper-sphere.

3.2.1 Pyramid vector quantization of the shape

Because there is no known algebraic formulation for the optimal tessellation of a hyper-sphere of arbitrary dimension N, we use a codebook constructed as the sum of K signed unit pulses normalized to the surface of the hyper-sphere. A codevector $\tilde{x}$ can be expressed as:

$$y = \sum_{k=1}^{K} s^{(k)} \varepsilon_{n^{(k)}} , \tag{2}$$

$$\tilde{x} = \frac{y}{\sqrt{y^T y}} , \tag{3}$$

where $n^{(k)}$ and $s^{(k)}$ are the position and sign of the $k$-th pulse, respectively, and $\varepsilon_{n^{(k)}}$ is the $n^{(k)}$-th elementary basis vector. The signs $s^{(k)}$ are constrained such that $n^{(j)} = n^{(k)}$ implies $s^{(j)} = s^{(k)}$, and hence $y$ satisfies $\|y\|_{L1} = K$. This codebook has the same structure as the pyramid vector quantizer [6] and is similar to that used in many ACELP-based [7] speech codecs.

The search for the best positions and signs is based on minimizing the cost function $J = -x^T \tilde{x} = -x^T y / \sqrt{y^T y}$ using a greedy search, one pulse at a time. For iteration $k$, the cost $J_n^{(k)}$ of placing a pulse at position $n$ can be computed as:

$$s^{(k)} = \mathrm{sign}\left(x_{n^{(k)}}\right) , \tag{4}$$

$$R_{xy}^{(k)} = R_{xy}^{(k-1)} + s^{(k)} x_{n^{(k)}} , \tag{5}$$

$$R_{yy}^{(k)} = R_{yy}^{(k-1)} + 2 s^{(k)} y_{n^{(k)}}^{(k-1)} + \left(s^{(k)}\right)^2 , \tag{6}$$

$$J_n^{(k)} = -\left(R_{xy}^{(k)}\right)^2 / R_{yy}^{(k)} . \tag{7}$$

To compare two costs, the divisions in (7) are transformed into two multiplications. The algorithm can be sped up by starting from a projection of x onto the pyramid, as suggested in [6]. We start from

$$y_n = \left\lfloor K \frac{x_n}{\|x\|_{L1}} \right\rfloor , \tag{8}$$

where ⌊·⌋ denotes rounding towards zero. From there, we add any remaining pulses one at a time using the search procedure described by (4)-(7). The worst-case complexity is thus O(N · min(N,K)).

3.2.2 Encoding of the pulses' signs and positions

For K pulses in N samples, the number of codebook entries is

$$V(N,K) = V(N-1,K) + V(N,K-1) + V(N-1,K-1) , \tag{9}$$

with V(N,0) = 1 and V(0,K) = 0, K > 0. We use an enumeration algorithm to convert between codebook entries and integers between 0 and V(N,K) − 1 [6]. The index is encoded with the range coder using equiprobable symbols. The factorial pulse coding (FPC) method [8] uses the same codebook with a different enumeration. However, it requires multiplications and divisions, whereas ours can be implemented using only addition. To keep computational complexity low, the band is recursively partitioned in half when the size of the codebook exceeds 32 bits.
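As an illustration of Sections 3.2.1 and 3.2.2, the greedy pulse search of (4)-(8) and the codebook-size recurrence (9) can be sketched in a few lines of Python. This is only a floating-point sketch under stated assumptions, not the codec's implementation: the function names `pvq_codebook_size` and `pvq_search` are hypothetical, the input band is assumed to be a non-empty list of floats, and the enumeration of codevectors and the band splitting are not shown.

```python
def pvq_codebook_size(n, k):
    """Number of codevectors V(n, k), computed from recurrence (9)."""
    row = [1] + [0] * k          # V(0, 0) = 1 and V(0, j) = 0 for j > 0
    for _ in range(n):
        new = [1] * (k + 1)      # V(i, 0) = 1
        for j in range(1, k + 1):
            # V(i, j) = V(i-1, j) + V(i, j-1) + V(i-1, j-1)
            new[j] = row[j] + new[j - 1] + row[j - 1]
        row = new
    return row[k]


def pvq_search(x, k):
    """Greedy pulse search returning an integer vector y with sum(|y_n|) == k.

    Assumes len(x) >= 1 (hypothetical helper, floating point only).
    """
    n = len(x)
    l1 = sum(abs(v) for v in x)
    # Eq. (8): pre-place pulses by projecting x onto the pyramid;
    # int() truncates, i.e. rounds towards zero.
    y = [int(k * v / l1) for v in x] if l1 > 0 else [0] * n
    placed = sum(abs(v) for v in y)
    rxy = sum(a * b for a, b in zip(x, y))   # running x^T y
    ryy = sum(b * b for b in y)              # running y^T y
    while placed < k:
        best = None
        for pos in range(n):
            s = 1 if x[pos] >= 0 else -1              # Eq. (4)
            c_xy = rxy + s * x[pos]                   # Eq. (5)
            c_yy = ryy + 2 * s * y[pos] + 1           # Eq. (6), s * s == 1
            # Eq. (7): maximise c_xy^2 / c_yy, compared by
            # cross-multiplication instead of division.
            if best is None or c_xy * c_xy * best[2] > best[1] * best[1] * c_yy:
                best = (pos, c_xy, c_yy, s)
        pos, rxy, ryy, s = best
        y[pos] += s
        placed += 1
    return y
```

For example, `pvq_codebook_size(3, 2)` evaluates to 18, which matches a direct count of the integer vectors of length 3 whose absolute values sum to 2. The normalized codevector of (3) is obtained by dividing the result of `pvq_search` by its Euclidean norm.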
When a band is split in this way, the number of pulses in the first half is explicitly encoded, and then each half of the vector is coded independently.

3.2.3 Avoiding sparseness

When a band is allocated few bits, the codebook described in section 3.2.1 produces a sparse spectrum, containing only a few non-zero values. This tends to produce “birdie” artifacts, common to many transform codecs. To mitigate the problem, we add some small values to the spectrum. We could use a noise generator, but choose to use a scaled copy of the lower frequency MDCT bins. Doing so mostly preserves the temporal aspect of the signal [9]. The gain applied is computed as:

$$g = \frac{N}{N + \delta K} , \tag{10}$$

where δ = 6 was experimentally found to be a good compromise between excessive noise and a sparse spectrum. The gain in (10) increases as fewer pulses are used. For cases where no pulse is allocated, we have g = 1, which preserves the energy in the band without using any additional bits. In all cases, the total energy is normalized to be equal to the energy value encoded. This constraint slightly changes the objective function used to place pulses, but for simplicity we only take this into consideration when placing the last pulse.

3.2.4 Avoiding pre-echo

Pre-echo is a common artifact in transform codecs, introduced because quantization error is spread over an entire window, including samples before a transient event. It is seldom a problem in the proposed codec because of the short frames, but occurs in some extreme cases. To avoid pre-echo, we detect transients and use two smaller MDCTs for those frames. The output of the two MDCTs is interleaved, and the rest of the codec is not affected, operating as if only one MDCT was used. No additional lookahead is needed to determine which window to use because the long windows have the same window overlap shape and length as the short windows.

3.3 Bit allocation

The shorter the frame size used in a codec, the higher the overhead of transmitting metadata. In low-delay codecs, the overhead of explicitly transmitting the bit allocation can become very large. For this reason, we choose not to transmit the bit allocation explicitly, but rather derive it using information available to both the encoder and the decoder.

We assume that both the encoder and the decoder know how many 8-bit bytes are used to encode a frame. This number is either agreed on when establishing the communication or obtained during the communication, e.g. the decoder knows the size of any UDP datagram it receives. After determining the number of bits used by the coarse quantization of the energy (Q1), both the encoder and decoder make an initial bit allocation for the fine energy (Q2) and shape (Q3) using only static (ROM) data. Because we cannot always choose a pulse count K that yields exactly the number of bits desired for a band, we use the closest possible value and propagate the difference to the remaining bands.

For a given band, the bit allocation is nearly constant across frames that use the same number of bits for Q1, yielding a pre-defined signal-to-mask ratio (SMR) for each band. Because the bands have a width of one Bark, this is equivalent to modeling the masking occurring within each critical band, while ignoring inter-band masking and tone-vs-noise characteristics. This is not an optimal bit allocation, but it provides good results without requiring the transmission of any allocation information. The average bit allocation between the three quantizers is given in Table 1.

Table 1: Average bit allocation at 64.5 kbit/s (344 bits per frame). The mode flags are used for pre-echo avoidance and to signal the low-complexity mode described here.

Parameter            Average bits
Coarse energy (Q1)   32.8
Fine energy (Q2)     43.2
Shape (Q3)           264.4
Mode flags           2
Unallocated          1.6

4. RELATED WORK

The proposed codec shares some similarities with G.722.1C [2] in that both transmit the energy of MDCT bands explicitly. There are, however, significant algorithmic differences between the two codecs. First, G.722.1C uses scalar quantization to encode the normalized spectrum in each band, so it must encode N degrees of freedom instead of the N − 1 required by a spherical codebook. For a Gaussian source, pyramid vector quantization provides a 2.39 dB asymptotic improvement over the optimal scalar quantizer, according to [6]. We replaced the VQ codebook in our codec with per-bin entropy coding, and measured a 10 kbit/s degradation from our 64 kbit/s configuration. The use of scalar quantization also means that the energy is not guaranteed to be preserved in the decoded signal.

A second difference is that in G.722.1C, the bit allocation information is explicitly transmitted in the bitstream. Given that G.722.1C has 20 ms frames, this is a reasonable strategy. However, with the very short frames (5.8 ms or less) used in the proposed codec, explicitly encoding the bit allocation in each frame would result in too much overhead. A third difference is that the bands in G.722.1C have a fixed width of 500 Hz. While this helps reduce the complexity of the codec, which is around 11 WMOPS at 48 kbit/s, it has a cost in quality compared to using Bark-spaced bands. Besides the differences in the core algorithm, G.722.1C has a lower complexity and a significantly larger (5 to 10 times) delay than the proposed codec, so the potential sets of applications for the two codecs only partially overlap.

The Fraunhofer Ultra Low Delay (ULD) codec [10] is one of the few full-bandwidth codecs with an algorithmic delay comparable to the proposed codec. Its structure, however, is completely different from that of the proposed codec. ULD is based on time-domain linear prediction instead of the MDCT. It uses a pre-filter/post-filter combination, whose parameters are transmitted in the bitstream, to shape the quantization noise. ULD frames are 128 samples with 128 samples of look-ahead, for a total algorithmic delay of 256 samples at 48 kHz (5.3 ms). One disadvantage of the linear-prediction approach is the difficulty of resynchronizing the decoder after a packet is lost [5]. In contrast, the proposed codec only uses inter-frame prediction for Q1, so the decoder resynchronizes very quickly after packet loss. Changing the proposed codec to have completely independent packets would cost approximately 12 bits per frame.

AAC-LD [1] is another low-delay audio codec whose total algorithmic delay can range from 20 ms to around 50 ms, depending on the sampling rate and bit reservoir size. However, its complexity is higher than that of the proposed codec.

Table 2: Characteristics of the codecs as used in testing.

Codec          Sample rate  Bitrate   Frame size    Look-ahead    Total delay
               (kHz)        (kbit/s)  samples (ms)  samples (ms)  samples (ms)
Proposed (64)  48           64        256 (5.3)     128 (2.7)     384 (8)
Proposed (96)  48           96        128 (2.7)     64 (1.3)      192 (4)
ULD            48           96        128 (2.7)     128 (2.7)     256 (5.3)
G.722.1C       32           48        640 (20)      640 (20)      1280 (40)
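The delay and bit-budget figures quoted in Tables 1 and 2 follow from simple arithmetic on the frame size, look-ahead, sampling rate and bitrate. The short Python sketch below reproduces that arithmetic as a cross-check. The helper names are hypothetical and chosen only for this illustration; the numerical inputs are the ones stated in the tables.

```python
def samples_to_ms(samples, rate_hz):
    """Duration of a number of samples, in milliseconds."""
    return 1000.0 * samples / rate_hz


def total_delay_ms(frame, lookahead, rate_hz):
    """Algorithmic delay: one frame plus the look-ahead."""
    return samples_to_ms(frame + lookahead, rate_hz)


def bits_per_frame(bitrate_bps, frame, rate_hz):
    """Bit budget available for one frame at a constant bitrate."""
    return bitrate_bps * frame / rate_hz


# Rows of Table 2: (frame size, look-ahead, sampling rate in Hz).
configs = {
    "Proposed (64)": (256, 128, 48000),
    "Proposed (96)": (128, 64, 48000),
    "ULD": (128, 128, 48000),
    "G.722.1C": (640, 640, 32000),
}
delays = {name: total_delay_ms(*cfg) for name, cfg in configs.items()}

# Table 1: average allocation per frame at 64.5 kbit/s with 256-sample frames.
table1 = {"coarse": 32.8, "fine": 43.2, "shape": 264.4, "flags": 2, "unallocated": 1.6}
budget = bits_per_frame(64500, 256, 48000)
```

With these inputs, `delays` holds 8 ms and 4 ms for the two proposed configurations, about 5.3 ms for ULD and 40 ms for G.722.1C, and `budget` equals the 344 bits per frame that the entries of Table 1 add up to.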

5. RESULTS AND DISCUSSION

The source code for the proposed codec is available at http://www.celt-codec.org/ and corresponds to the “low-complexity mode” of the CELT codec, version 0.5.1¹. Both floating-point and fixed-point implementations are available.

Of the codecs in Section 4, only ULD’s delay is comparable to the proposed codec’s. We include G.722.1C in our comparison using the highest bitrate available (48 kbit/s) because of the algorithmic similarities listed earlier, despite its 40 ms algorithmic delay at 32 kHz. Also, since ULD uses 128-sample frames, we include a version of the proposed codec with 128-sample frames. This version uses a 64-sample look-ahead, compared to the 128-sample look-ahead of ULD. The conditions are summarized in Table 2.

5.1 Subjective quality

We use the MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) methodology [11] with 11 listeners². We used short excerpts taken from the following material: female speech (SQAM), pop (Dave Matthews Band, #41), male speech (SQAM), harpsichord (Bach), a cappella (Suzanne Vega, Tom’s Diner), castanets (SQAM), rock (Duran Duran, Ordinary World), orchestra (Danse Macabre), and techno.

The results are shown in Fig. 3. Although some of the non-paired confidence intervals in Fig. 3 overlap, a paired t-test reveals higher than 99% confidence in all the differences (P < 0.01). The proposed codec at 64 kbit/s was better than ULD at 96 kbit/s, although it used a slightly higher delay. When using a slightly lower delay than ULD and the same 96 kbit/s bitrate, the proposed codec was clearly better than all other codecs and configurations tested. G.722.1C had the lowest quality, which was expected because its bitrate is limited to 48 kbit/s.

Figure 3: Subjective quality of the codecs obtained using the MUSHRA methodology with 11 listeners. The 95% non-paired confidence intervals are included. (Vertical axis ticks run from 20 to 100; horizontal axis: Female, Pop, Male, Harpsichord, Vega, Castanets, Duran, Orchestra, Techno, All; series: Reference, Anchor (7 kHz low-pass), Anchor (3.5 kHz low-pass), G.722.1C (48 kbps), Proposed (64 kbps), Proposed (96 kbps), ULD (96 kbps).)

Unsurprisingly, all ultra low-delay codecs do very well on castanets, unlike G.722.1C, which has much longer frames. On the other hand, the highly tonal harpsichord sample was difficult to encode for the low delay codecs and only the proposed codec achieved quality close to the reference. G.722.1C did well on the harpsichord due to its longer frames, despite its lower bitrate.

The proposed codec is able to operate with a wide range of frame sizes. We evaluated the effect of the frame size and bitrate on the audio quality. Because of the very large number of possible combinations, we used PQevalAudio³, an implementation of the PEAQ basic model [12]. As expected, Fig. 4 shows that the bitrate required to obtain a certain level of quality increases as the frame size decreases. However, we observe that the bitrate difference between two equal-quality contours is almost constant with respect to the frame size. For example, reducing the frame size from 256 to 64 samples (from 8 ms total delay to 2 ms) results in an increase of 30 kbit/s for the same quality.

Figure 4: Objective Difference Grade (ODG) measured for different frame sizes. Equal-quality contours are shown for ODG values -0.5, -1.0, -2.0, -3.0.

5.2 Complexity

The total complexity of the algorithm when implemented in fixed-point is 11 WMOPS for the encoder and 6 WMOPS for the decoder, for a total of 17 WMOPS⁴. The encoder and the decoder states are very small, requiring around 0.5 kByte for both states combined. The total amount of scratch space required is 7 kBytes.

When running on a 3 GHz x86 CPU (C code without any architecture-specific optimization), the floating-point implementation requires 0.9% of one CPU core for real-time encoding and decoding. The memory requirements for the floating-point version are about twice the fixed-point memory requirements, which still easily fit within the L1 cache of a modern desktop CPU.

¹ The quality results were obtained using version 0.5.0, which is identical except for a slightly lower quality VQ search.
² During post-screening, we discarded results from 3 additional listeners, who rated on average more than 3/7 samples as 100. We verified that this post-screening phase did not affect our conclusions.
³ http://www-mmsp.ece.mcgill.ca/Documents/Software/Packages/AFsp/PQevalAudio.html
⁴ Measured by running the fixed-point implementation with operators similar to the ETSI/ITU basicops and with the same weighting.

6. CONCLUSION

We have proposed a low-delay audio codec based on the MDCT with very short frames, using shape-gain quantization to preserve the energy in critical bands. We have demonstrated that the subjective quality of the proposed codec is higher than ULD when operating at the same bitrate (96 kbit/s) and frame size. In addition, with a slightly higher delay, the proposed codec operating at 64 kbit/s still out-performs the ULD codec.

REFERENCES

[1] M. Lutzky, M. Schnell, M. Schmidt, and R. Geiger, “Structural analysis of low latency audio coding schemes,” in Proc. 119th AES convention, 2005.
[2] M. Xie, D. Lindbergh, and P. Chu, “From ITU-T G.722.1 to ITU-T G.722.1 Annex C: A new low-complexity 14kHz bandwidth audio coding standard,” Journal of Multimedia, vol. 2, no. 2, 2007.
[3] A. Carôt, U. Krämer, and G. Schuller, “Network music performance (NMP) in narrow band networks,” in Proc. 120th AES Convention, 2006.
[4] G. Nigel and N. Martin, “Range encoding: An algorithm for removing redundancy from a digitised message,” in Proc. and Data Recording Conference, 1979.
[5] S. Wabnik, G. Schuller, J. Hirschfeld, and U. Kraemer, “Packet loss concealment in predictive audio coding,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2005, pp. 227–230.
[6] T. R. Fischer, “A pyramid vector quantizer,” IEEE Trans. on , vol. 32, pp. 568–583, 1986.
[7] C. Laflamme, J.-P. Adoul, H. Y. Su, and S. Morissette, “On reducing computational complexity of codebook search in CELP coder through the use of algebraic codes,” in Proc. ICASSP, 1990, vol. 1, pp. 177–180.
[8] J. P. Ashley, E. M. Cruz-Zeno, U. Mittal, and W. Peng, “Wideband coding of speech using a scalable pulse codebook,” in Proc. IEEE Workshop on , 2000, pp. 148–150.
[9] J. Makhoul and M. Berouti, “High-frequency regeneration in speech coding systems,” in Proc. ICASSP, 1979.
[10] G. D. T. Schuller, B. Yu, D. Huang, and B. Edler, “Perceptual audio coding using adaptive pre-and post-filters and ,” IEEE Trans. on Speech and Audio Processing, vol. 10, no. 6, pp. 379–390, 2002.
[11] ITU-R, Recommendation BS.1534-1: Method for the subjective assessment of intermediate quality level of coding systems, 2001.
[12] ITU-R, Recommendation BS.1387: Perceptual Evaluation of Audio Quality (PEAQ) recommendation, 1998.