Convention Paper Presented at the 111Th Convention 2001 September 21–24 New York, NY, USA
Total Page:16
File Type:pdf, Size:1020Kb
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Infoscience - École polytechnique fédérale de Lausanne Audio Engineering Society Convention Paper Presented at the 111th Convention 2001 September 21–24 New York, NY, USA This convention paper has been reproduced from the author’s advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society. Audio Coding Using Perceptually Controlled Bitstream Buffering Christof Faller Media Signal Processing Research, Agere Systems 600 Mountain Avenue, Murray Hill, New Jersey 07974, USA [email protected] ABSTRACT Perceptual audio coders use a varying number of bits to encode subsequent frames according to the perceptual entropy of the audio signal. For transmission over a constant bitrate channel the bitstream must be buffered. The buffer must be large enough to absorb variations in the bitrate, otherwise the quality of the audio will be compromised. We present a new scheme for buffer control of perceptual audio coders. In contrast to conventional schemes the proposed scheme systematically reduces the variation in a perceptual distortion measure over time. The new scheme applied to a perceptual audio coder (PAC) improves the quality of the encoded signal for a given buffer size. The same technique can be used to increase the performance of other coders such as MPEG-1 Layer III or MPEG-2 AAC while maintaining backward compatibility. 1 INTRODUCTION noise threshold determined by the perceptual model. For transparent audio coding the perceptual model computes Typical non-stationary signals such as audio signals the masked threshold [4]. For non-transparent audio have a variable inherent perceptual entropy [1]. To ap- coding the parameters of the perceptual model can be proach the compression limit (i.e. perceptual entropy tuned such that it generates a supra-threshold, i.e. a for transparent audio coding) of such signals, variable noise threshold which is above the masked threshold. In bitrate compression techniques are used. Perceptual au- this case, more quantization noise is introduced into the dio coders [2, 3] quantize the spectral components of an encoded audio signal and the bitrate will be lower. audio signal such that the quantization noise follows the FALLER Audio Coding Using Perceptually Controlled Bitstream Buffering In a general sense, a perceptual model is an algorithm SIGNAL AMPLITUDE which computes for each frame k of an audio signal a noise threshold for each spectral component of this sig- nal with the perceived distortion D[k] as a model pa- D[k] rameter. Ideally the computed noise threshold has the DR property that it is the spectrally shaped noise with the largest possible variance for a perceived distortion of M[k]|DR D[k]. For D[k] = 0 the perceptual model computes the R masked threshold (for transparent audio coding) and for 1 2 3 4 5 D[k] > 0 it computes a supra-threshold (for non trans- TIME (FRAME #) parent audio coding). Ideally a perceptual model com- putes the noise thresholds such that for a constant pa- Fig. 1: Perceptual audio coders are by nature variable bi- rameter D[k] the perceived distortion of the entire coded trate source coders. For encoding an audio signal (top) at a audio signal is constant. constant distortion D[k] (middle) each frame has a different In this paper, we assume that an audio coder without bitrate M[k] (bottom). rate control or buffer control is the ideal case of encod- ing an audio signal. In this case the audio signal is en- SIGNAL AMPLITUDE coded with quantization noise equal to the noise thresh- old given by the perceptual model with the distortion D[k] held constant D[k] = DR. We call the frame bitrates D M[k]|DR in this case inherent bitrates with an average R bitrate of R. The average bitrate M[k] N 1 R R = M[k]|D (1) N X R k=1 1 2 3 4 5 TIME (FRAME #) of an audio signal encoded with a constant distortion D[k] = DR is not known prior to encoding the signal. Fig. 2: For encoding an audio signal (top) at a constant The number of frames to be encoded is N. bitrate M[k] = const. (bottom) the distortion D[k] (middle) For a specific desired bitrate R the audio signal is ideally must be varied in time. encoded with frame bitrates M[k] (bits/frame) equal to the inherent bitrates M[k]|D . A criterion for the 2 R bitrates M[k]|DR leading to a large σR and to significant optimality of the encoding process is variations of the distortion D[k] over time. The encoded 2 2 audio signal has now a significantly worse quality than σD = E{(D[k]|M[k] − DR) } . (2) in the ideal case of encoding with M[k] = M[k]|DR for 2 The smaller σD is the less is the distortion varying over the same average bitrate R. 2 time with no variation at all for σD = 0. The more Figure 3 shows conceptionally these two extreme cases of the distortion D[k] differs from DR the more the bitrate operation for an average bitrate of R. In one case (a) only M[k] differs from the inherent bitrate M[k]|DR and a the bitrate is varied (M[k] = M[k]|DR , D[k] = const.) related measure for optimality is and in the other case (b) only the distortion is varied 2 2 (M[k]= R, D[k]). σR = E{(M[k] − M[k]|DR ) } . (3) Many applications for constant bitrate transmission, In the optimal case when an audio signal is encoded such as digital radio broadcasting [5, 6] or Internet with a constant distortion D[k]= DR the bitrate of each streaming [7], can afford an end-to-end delay. For such frame M[k] = M[k]|DR will vary significantly as shown applications the bitstream of the audio coder can be in Fig. 1. However, many applications require a constant buffered. The buffer can absorb a certain degree of varia- bitrate R transmission from the encoder to the decoder. tion in the bitrate M[k], leading to reduced variations in To operate an audio encoder at a constant bitrate one the distortion D[k] over the case of an audio coder with- can iteratively encode each frame k with various distor- out bitstream buffering. A tradeoff can be made between tions D[k] until the bitrate of the frame M[k] is equal to variation in bitrate and variation in the distortion. The the desired bitrate R as shown in Fig. 2. In this case the smaller the variation in the distortion, the larger the vari- bitrates M[k] = R differ significantly from the inherent ation in the bitrate, and visa versa. The buffer size and AES 111TH CONVENTION, NEW YORK, NY, USA, 2001 SEPTEMBER 21–24 2 FALLER Audio Coding Using Perceptually Controlled Bitstream Buffering M[k] M[k] d c b b a R R a D[k] D D[k] DR R Fig. 3: The two extreme cases of operation of an audio coder: Fig. 4: Different scenarios for audio coding over a constant (a) ideally the audio coder is unconstrained and only varying bitrate transmission channel: (a) constant bitrate audio cod- the bitrate M[k], (b) the audio coder forced to be a constant ing without bitstream buffering, (b) buffered bitstream, (c) bitrate coder by varying only the distortion. buffered bitstream with larger buffer or better buffer control, (d) infinitely large buffer. the buffer control scheme determine how much variation AUDIO M[k] in the bitrate can be absorbed by the buffer. Therefore IN ENCODER BUFFER the buffer size and the buffer control scheme also deter- mine the amount of variation in the distortion. Figure d[k] l[k] 4 schematically shows the regions in which the bitrates CONSTANT M[k] and distortions D[k] lie for different scenarios: BUFFER CONTROL BITRATE Rd a) Constant bitrate audio coding M[k] = R (no bit- stream buffering). AUDIO OUT DECODER BUFFER b) Audio coding with bitstream buffering. c) Audio coding with bitstream buffering (larger buffer Fig. 5: An audio encoder and decoder with a constant bitrate or better buffer control scheme than b). transmission channel. d) Unconstrained audio coding M[k] = M[k]|DR (in- finitely large buffer). complexity restrictions of the encoder or decoder. For broadcasting applications the desired end-to-end delay Figure 5 shows an audio encoder and decoder with a is limited by the cost of the decoder and the tune-in buffered bitstream for a constant bitrate transmission time, i.e. the time it takes from a request for playback of the bits. In this scenario, the M[k] bits of the en- until the audio actually plays back. Therefore there is coded frame, at the time of each frame k, are put into a a need of a buffer control scheme which minimizes the FIFO (first-in-first-out) buffer while Rd bits are removed variation in the distortion for a given limited buffer size. from the FIFO buffer by the constant bitrate transmis- In this paper, we present a novel buffer control scheme sion channel. The number of data bits in the encoder for audio coding which significantly reduces the variation buffer can be expressed iteratively as in the distortion D[k] for a given buffer size.