arXiv:1602.05311v1 [cs.MM] 17 Feb 2016
A FULL-BANDWIDTH AUDIO CODEC WITH LOW COMPLEXITY AND VERY LOW DELAY

Jean-Marc Valin†‡, Timothy B. Terriberry‡, Gregory Maxwell‡

†Octasic Semiconductor, 4101 Molson Street, suite 300, Montreal (Quebec) Canada
[email protected]

‡Xiph.Org Foundation
[email protected] [email protected]

ABSTRACT

We propose an audio codec that addresses the low-delay requirements of some applications such as network music performance. The codec is based on the modified discrete cosine transform (MDCT) with very short frames and uses gain-shape quantization to preserve the spectral envelope. The short frame sizes required for low delay typically hinder the performance of transform codecs. However, at 96 kbit/s and with only 4 ms algorithmic delay, the proposed codec out-performs the ULD codec operating at the same rate. The total complexity of the codec is small, at only 17 WMOPS for real-time operation at 48 kHz.

1. INTRODUCTION

Recent research has focused on increasing the audio quality of speech codecs to “full bandwidth” rates of 44.1 or 48 kHz to make them suitable to more general purpose applications [1, 2]. However, while such codecs have moderate algorithmic delays, some applications require very low delay. One example is networked music performance, where two or more musicians playing remotely require less than 25 ms of total delay to be able to properly synchronize with each other [3]. Another example is a wireless audio device, such as a digital microphone, where delay causes desynchronization with the visible speaker. Teleconferencing systems where only limited acoustic echo control is possible also benefit from very low delay, as it makes acoustic echo less perceptible.

We propose a codec that provides high audio quality while maintaining very low delay. Its characteristics are as follows:

• sampling rate of 48 kHz;
• frame size of 256 samples (5.3 ms) with 128 samples look-ahead (2.7 ms);
• achieves very good audio quality at 64 kbit/s (mono);
• a total complexity of 17 WMOPS;
• optional support for other sampling rates and frame sizes, such as 128-sample frames with 64 samples look-ahead.

We introduce the basic principles of the codec in Section 2 and go into the details of the quantization in Section 3. We then discuss how the proposed approach compares to other low-delay codecs in Section 4, followed by direct audio quality comparisons in Section 5 and the conclusion in Section 6.

We would like to thank all the listeners who participated in the subjective evaluation.

2. OVERVIEW OF THE CODEC

The proposed codec is based on the modified discrete cosine transform (MDCT). To minimize the algorithmic delay, we use a short frame size, combined with a reduced-overlap window. This results in an algorithmic delay of 384 samples for the 256-sample frame size configuration shown in Fig. 1. The structure of the encoder is shown in Fig. 2 and its basic principles can be summarized as follows:

• the MDCT output is split in bands approximating the critical bands;
• the energy (gain) in each band is quantized and transmitted separately;
• the details (shape) in each band are quantized algebraically using a spherical codebook;
• the bit allocation is inferred from information shared between the encoder and the decoder.

The most important aspect of these is the explicit coding of a per-band energy constraint combined with an independent shape quantizer which never violates that constraint. This prevents artifacts caused by energy collapse or overshoot and preserves the spectral envelope’s evolution in time. The bands are defined to match the ear’s critical bands as closely as possible, with the restriction that bands must be at least 3 MDCT bins wide. This lower limit results in 19 bands for the codec when 256-sample frames are used.

Figure 1: Power-complementary windows with reduced overlap. (Horizontal axis: time in samples, from 0 to 512; the plot marks the frame size, the lookahead, the algorithmic delay and the MDCT window.)

Figure 2: Basic structure of the encoder.

3. QUANTIZATION

We use a type of arithmetic coder called a range coder [4] for all symbols. We use it not only for entropy coding, but also to approximate the infinite precision arithmetic required to optimally encode integers whose range is not a power of two.

3.1 Energy quantization (Q1, Q2)

The energy of the final decoded signal is algebraically constrained to match the explicitly coded energy exactly. Therefore it is important to quantize the energy with sufficient resolution because later stages cannot compensate for quantization error in the energy. It is perceptually important to preserve the band energy, regardless of the resolution used for encoding the band shape.

We use a coarse-fine strategy for encoding the energy in the log domain (dB). The coarse quantization of the energy (Q1) uses a fixed resolution of 6 dB. This is also the only place we use prediction and entropy coding. The prediction is applied both in time (using the previous frame) and in frequency (using the previous band). The 2-D z-transform of the prediction filter is

$$A(z_\ell, z_b) = \left(1 - \alpha z_\ell^{-1}\right) \cdot \frac{1 - z_b^{-1}}{1 - \beta z_b^{-1}}, \tag{1}$$

where $b$ is the band index and $\ell$ is the frame index. Unlike methods which require predictor reset [5], the proposed system with $\alpha < 1$ is guaranteed to re-synchronize after a transmission error. We have obtained good results with $\alpha = 0.8$ and $\beta = 0.7$. To prevent error accumulation, the prediction is applied on the quantized log-energy. The prediction step reduces the entropy of the coarsely-quantized energy from 61 to 30 bits. Of this 31-bit reduction, 12 are due to inter-frame prediction. We approximate the ideal probability distribution of the prediction error using a Laplace distribution, which results in an average of 33 bits per frame to encode the energy of all 19 bands at a 6 dB resolution. Because of the short frames, this represents up to a 16% bitrate savings on the configurations tested in Section 5.

The fine energy quantizer (Q2) is applied to the Q1 quantization error and has a variable resolution that depends on the bit allocation. We do not use entropy coding for Q2 since the quantization error of Q1 is mostly white and uniformly distributed.

3.2 Shape quantization (Q3)

We normalize each band by the unquantized energy, so its shape always has unit norm. We thus need to quantize a vector on the surface of an N-dimensional hyper-sphere.

3.2.1 Pyramid vector quantization of the shape

Because there is no known algebraic formulation for the optimal tessellation of a hyper-sphere of arbitrary dimension $N$, we use a codebook constructed as the sum of $K$ signed unit pulses normalized to the surface of the hyper-sphere. A codevector $\tilde{x}$ can be expressed as:

$$y = \sum_{k=1}^{K} s^{(k)} \varepsilon_{n^{(k)}}, \tag{2}$$

$$\tilde{x} = \frac{y}{\sqrt{y^T y}}, \tag{3}$$

where $n^{(k)}$ and $s^{(k)}$ are the position and sign of the $k$-th pulse, respectively, and $\varepsilon_{n^{(k)}}$ is the $n^{(k)}$-th elementary basis vector. The signs $s^{(k)}$ are constrained such that $n^{(j)} = n^{(k)}$ implies $s^{(j)} = s^{(k)}$, and hence $y$ satisfies $\|y\|_{L1} = K$. This codebook has the same structure as the pyramid vector quantizer [6] and is similar to that used in many ACELP-based [7] speech codecs.

The search for the best positions and signs is based on minimizing the cost function $J = -x^T \tilde{x} = -x^T y / \sqrt{y^T y}$ using a greedy search, one pulse at a time. For iteration $k$, the cost $J_n^{(k)}$ of placing a pulse at position $n$ can be computed as:

$$s^{(k)} = \mathrm{sign}\left(x_{n^{(k)}}\right), \tag{4}$$

$$R_{xy}^{(k)} = R_{xy}^{(k-1)} + s^{(k)} x_{n^{(k)}}, \tag{5}$$

$$R_{yy}^{(k)} = R_{yy}^{(k-1)} + 2 s^{(k)} y_{n^{(k)}}^{(k-1)} + \left(s^{(k)}\right)^2, \tag{6}$$

$$J_n^{(k)} = -\left(R_{xy}^{(k)}\right)^2 / R_{yy}^{(k)}. \tag{7}$$

To compare two costs, the divisions in (7) are transformed into two multiplications. The algorithm can be sped up by starting from a projection of $x$ onto the pyramid, as suggested in [6]. We start from

$$y_n = \left\lfloor \frac{K x_n}{\|x\|_{L1}} \right\rfloor, \tag{8}$$

where $\lfloor \cdot \rfloor$ denotes rounding towards zero. From there, we add any remaining pulses one at a time using the search procedure described by (4)-(7). The worst-case complexity is thus $O(N \cdot \min(N, K))$.

3.2.2 Encoding of the pulses' signs and positions

For $K$ pulses in $N$ samples, the number of codebook entries is

$$V(N, K) = V(N-1, K) + V(N, K-1) + V(N-1, K-1), \tag{9}$$

with $V(N, 0) = 1$ and $V(0, K) = 0$, $K > 0$. We use an enumeration algorithm to convert between codebook entries and integers between 0 and $V(N, K) - 1$ [6]. The index is encoded with the range coder using equiprobable symbols. The factorial pulse coding (FPC) method [8] uses the same codebook with a different enumeration. However, it requires multiplications and divisions, whereas ours can be implemented using only addition. To keep computational complexity low, the band is recursively partitioned in half when the size of the codebook exceeds 32 bits. The number of pulses in the first half is explicitly encoded, and then each half of the vector is coded independently.

3.2.3 Avoiding sparseness

When a band is allocated few bits, the codebook described in section 3.2.1 produces a sparse spectrum, containing only a few non-zero values.

Table 1: Average bit allocation at 64.5 kbit/s (344 bits per frame). The mode flags are used for pre-echo avoidance and to signal the low-complexity mode described here.

Parameter             Average bits
Coarse energy (Q1)    32.8
Fine energy (Q2)      43.2
Shape (Q3)            264.4
Mode flags            2
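As an illustration of the coarse energy quantizer (Q1) of Section 3.1, the following minimal Python sketch applies the prediction filter of (1) in closed loop, that is, on the quantized log-energy. The function names, the running inter-band state `band_pred` and the use of Python's `round` are assumptions made for this example; the text above does not specify how the filter is realized, and entropy coding of the indices is left out.

```python
ALPHA = 0.8    # inter-frame prediction coefficient (value quoted in Section 3.1)
BETA = 0.7     # inter-band prediction coefficient (value quoted in Section 3.1)
STEP_DB = 6.0  # fixed resolution of the coarse quantizer Q1


def encode_coarse(energy_db, prev_decoded_db):
    """Quantize one frame of band log-energies (hypothetical helper).

    Returns the integer indices and the decoded energies, which become the
    'previous frame' state for the next call.
    """
    indices, decoded = [], []
    band_pred = 0.0  # running inter-band prediction, reset for every frame
    for energy, old in zip(energy_db, prev_decoded_db):
        predicted = ALPHA * old + band_pred
        q = round((energy - predicted) / STEP_DB)
        indices.append(q)
        decoded.append(predicted + q * STEP_DB)
        # Gives the (1 - z_b^-1) / (1 - BETA z_b^-1) term of Eq. (1).
        band_pred += (1.0 - BETA) * q * STEP_DB
    return indices, decoded


def decode_coarse(indices, prev_decoded_db):
    """Rebuild the decoded energies from the indices (hypothetical helper)."""
    decoded = []
    band_pred = 0.0
    for q, old in zip(indices, prev_decoded_db):
        decoded.append(ALPHA * old + band_pred + q * STEP_DB)
        band_pred += (1.0 - BETA) * q * STEP_DB
    return decoded
```

In this sketch the only state carried across frames is the previous decoded energy, scaled by `ALPHA`. A mismatch between encoder and decoder state therefore shrinks by a factor of 0.8 per frame, which is consistent with the re-synchronization property stated for α < 1.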
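The greedy pulse search of Section 3.2.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pvq_search`, the handling of an all-zero input and the choice of sign +1 for zero-valued samples are hypothetical. It starts from the projection of (8), then adds the remaining pulses with the updates of (4)-(6), comparing the costs of (7) by cross-multiplication instead of division.

```python
def pvq_search(x, k):
    """Return an integer vector y with sum(|y_n|) == k that follows x."""
    n = len(x)
    l1 = sum(abs(v) for v in x)
    if l1 == 0.0:
        # Assumption: an all-zero input gets all pulses at position 0.
        y = [0] * n
        y[0] = k
        return y

    # Eq. (8): projection onto the pyramid; int() rounds towards zero.
    y = [int(k * v / l1) for v in x]
    rxy = sum(a * b for a, b in zip(x, y))
    ryy = sum(b * b for b in y)

    for _ in range(k - sum(abs(b) for b in y)):
        best_pos, best_num, best_den = -1, -1.0, 1.0
        for pos in range(n):
            s = 1 if x[pos] >= 0 else -1        # Eq. (4)
            new_rxy = rxy + s * x[pos]          # Eq. (5)
            new_ryy = ryy + 2 * s * y[pos] + 1  # Eq. (6), with s * s == 1
            num = new_rxy * new_rxy
            # Minimizing Eq. (7) means maximizing num / new_ryy; the two
            # ratios are compared with two multiplications, no division.
            if num * best_den > best_num * new_ryy:
                best_pos, best_num, best_den = pos, num, new_ryy
        s = 1 if x[best_pos] >= 0 else -1
        rxy += s * x[best_pos]
        ryy += 2 * s * y[best_pos] + 1
        y[best_pos] += s
    return y
```

For example, `pvq_search([0.9, -0.3, 0.1, 0.0], 4)` starts from the projection `[2, 0, 0, 0]` and then places one pulse at position 1 and one at position 0. Each added pulse scans all N positions and at most K pulses are added, which matches the O(N · min(N, K)) bound once the projection has placed most of the pulses.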
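The recurrence (9) of Section 3.2.2 can be evaluated with additions only, as the following Python sketch shows. The function names are hypothetical, and reading "exceeds 32 bits" as V(N, K) > 2^32 is an assumption of this example. The enumeration that maps codevectors to indices is not shown.

```python
import math


def codebook_size(n, k):
    """V(N, K) from Eq. (9), computed one dimension at a time."""
    # row[j] holds V(i, j) for the current dimension i, starting at i = 0:
    # V(0, 0) = 1 and V(0, K) = 0 for K > 0.
    row = [1] + [0] * k
    for _ in range(n):
        new = [1] * (k + 1)  # V(i, 0) = 1
        for j in range(1, k + 1):
            # V(i, j) = V(i-1, j) + V(i, j-1) + V(i-1, j-1)
            new[j] = row[j] + new[j - 1] + row[j - 1]
        row = new
    return row[k]


def index_bits(n, k):
    """Bits needed for an equiprobable index in [0, V(N, K) - 1]."""
    return math.log2(codebook_size(n, k))


def needs_split(n, k):
    """True when the band would be partitioned in half (assumed threshold)."""
    return codebook_size(n, k) > 2 ** 32
```

As a sanity check, `codebook_size(2, 2)` counts the 8 integer points with |y_1| + |y_2| = 2, and `codebook_size(3, 2)` gives 18.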