Improved Audio Coding Using a Psychoacoustic Model Based on a Cochlear Filter Bank
Frank Baumgarte
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 10, NO. 7, OCTOBER 2002, p. 495

Abstract—Perceptual audio coders use an estimated masked threshold for the determination of the maximum permissible just-inaudible noise level introduced by quantization. This estimate is derived from a psychoacoustic model mimicking the properties of masking. Most psychoacoustic models for coding applications use a uniform (equal bandwidth) spectral decomposition as a first step to approximate the frequency selectivity of the human auditory system. However, the equal filter properties of the uniform subbands do not match the nonuniform characteristics of cochlear filters and reduce the precision of psychoacoustic modeling. Even so, uniform filter banks are applied because they are computationally efficient. This paper presents a psychoacoustic model based on an efficient nonuniform cochlear filter bank and a simple masked threshold estimation. The novel filter-bank structure employs cascaded low-order IIR filters and appropriate down-sampling to increase efficiency. The filter responses are optimized for the modeling of auditory masking effects. Results of the new psychoacoustic model applied to audio coding show better performance in terms of bit rate and/or quality of the new model in comparison with other state-of-the-art models using a uniform spectral decomposition. The low delay of the new model is particularly suitable for low-delay coders.

Index Terms—Audio coding, filter bank, masked threshold, model of masking, perceptual model.

Manuscript received June 20, 2001; revised July 18, 2002. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Peter Vary.
The author is with the Media Signal Processing Research Department, Agere Systems, Berkeley Heights, NJ 07922 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TSA.2002.804536
1063-6676/02$17.00 © 2002 IEEE

I. INTRODUCTION

In perceptual audio coding [1], the audio signal is treated as a masker for distortions introduced by lossy data compression. For this purpose, the masked threshold for the distortions is approximated by a psychoacoustic model. The masked threshold is the time- and frequency-dependent maximum level that marks the boundary for distortions being inaudible if superimposed on the audio signal. The initial audio signal processing within the psychoacoustic model consists of a spectral decomposition to account for the frequency selectivity of the auditory system. However, the auditory system performs a nonuniform (nonequal bandwidths) spectral decomposition of the acoustic signal in the cochlea. This first stage of cochlear sound processing already determines basic properties of masking, e.g., the frequency spread of masking, which is related to the frequency response of the human cochlear filters. Above 1 kHz, the cochlear filter bandwidths increase almost proportionally to the center frequency. These bandwidths determine both the spectral width of energy integration associated with a band and the range of spectral components that can interact within a band, e.g., two sinusoids creating a beating effect. This interaction plays a crucial role in the perception of whether a sound is noise-like, which in turn corresponds to significantly more efficient masking compared with a tone-like signal [2]. The noise-like or tone-like character is basically determined by the amount of envelope fluctuations at the cochlear filter outputs, which widely depend on the interaction of the spectral components in the pass-band of the filter.

Many existing psychoacoustic models, e.g., [1], [3], and [4], employ an FFT-based transform to derive a spectral decomposition of the audio signal into uniform subbands with equal bandwidths. The nonuniform spectral resolution of the auditory system is taken into account by summing up the energies of the appropriate number of neighboring FFT frequency subbands. Consequently, the phase relation between the spectral components of the different subbands within a cochlear filter band is not taken into account. Since the cochlear filter slopes are less steep than the subband slopes, they must be approximated by spreading the subband energies across several bands. This way of mapping the uniform subbands to cochlear filter bands produces envelopes of the output signal that are different from those measured at the output of the cochlea. The temporal resolution of the spectral decomposition is determined by the transform size, i.e., the FFT length, and thus is constant across all center frequencies. For high center frequencies this results in a significantly lower temporal resolution in comparison with that of the corresponding cochlear filters. All the described mismatches contribute to an inaccurate modeling of masking that causes suboptimal coder compression performance.

To overcome the mismatch between uniform filter banks and the spectral decomposition of the cochlea, a linear nonuniform cochlear filter bank was developed. A linear filter bank was chosen because it is computationally less complex than a nonlinear one [5], [6]. Furthermore, a psychoacoustic model based on a nonlinear filter bank generally approximates the masked threshold in an iteration process. Applied to audio coding, this involves encoding, decoding, and threshold computation for each iteration step, which can considerably increase the encoder complexity. The linear filter bank does not account for sound level-dependent effects. However, since the playback level of the decoded audio signal is usually unknown, this is considered a minor restriction only.

The cochlear filter bank is based on a novel structure that supports the time and frequency resolution necessary to simulate psychophysical data closely related to cochlear spectral decomposition properties. It will be shown that this filter bank is able to closely mimic the spectral and temporal properties of frequency decomposition of the human peripheral auditory system. The benefits of using this filter bank in a new psychoacoustic model are explained and evaluated for two different audio coders. An informal subjective quality assessment was carried out for both state-of-the-art coders. For this comparison the coders were used with their individual reference psychoacoustic model based on a uniform filter bank and with the new psychoacoustic model. Results show improved coder performance for the new psychoacoustic model.

The paper is organized as follows. The filter-bank structure is described in Section II. In Section III the filter-bank implementation using low-order IIR filters is presented. The filter responses are optimized for modeling of masked thresholds. A novel psychoacoustic model based on that filter bank is the subject of Section IV. The experimental setup of the coders used and of the subjective listening tests is outlined in Section V. Results are given in Section VI in terms of the subjective quality and data rate. Conclusions are drawn in Section VII.

Fig. 1. Block diagram of the cochlear filter-bank structure.

Fig. 2. Downsampling scheme of the cochlear filter bank.

II. FILTER-BANK STRUCTURE

The peripheral auditory system performs spectral analysis of the input acoustic signal in the cochlea with spectrally highly overlapping band-pass filters. The nonuniform frequency resolution and bandwidth of these filters are approximated in the proposed structure by cascaded IIR filters. Fig. 1 shows the proposed filter-bank structure with low-pass filters (LPF) and high-pass filters (HPF). The LPFs in the cascade have a decreasing cutoff frequency from left to right (see Fig. 1). Each LPF output is connected to an HPF. The HPF cutoff frequency is equal to the cutoff frequency of the LPF cascade segment between the filter-bank input and the HPF input of the next section. Thus, the output of each HPF has a bandpass characteristic with respect to the filter-bank input signal. The basic block of an LPF connected to an HPF, as shown in Fig.

plexity. A simple and efficient way to implement a “stage-wise” sampling rate reduction is shown in Fig. 2, where a stage comprises a group of those cascaded filter-bank sections having equal sampling rate. The rate reduction by a factor of two is achieved by leaving out every other sample at the stage input. It is applied when the cutoff frequency of the LPF cascade output is below a given ratio with respect to the sampling rate in that stage to reduce aliasing. The number of sections covering the auditory frequency range is usually on the order of 100. It can be adapted to the desired frequency resolution for a specific application. The number of stages is typically chosen between five and nine.

All the high-pass filters have the same order. Also, all the low-pass filters have the same order. However, the LPF and HPF orders can be chosen independently and should be large enough to accurately model the spectral decomposition features found in relevant psychophysical data. After the orders are fixed, the filter coefficients can be determined by an optimization algorithm to minimize the difference between the responses of the desired and the proposed filter banks. The responses of the desired filters are generally derived from psychophysical measurements.

III. FILTER BANK IMPLEMENTATION

In this section, the cochlear filter bank parameters are given and the derivation of the filter coefficients is described for the application in a psychoacoustic model. It turns out that an LPF order of and an HPF order of is sufficient to achieve a reasonable approximation of the desired frequency responses. The slopes of the desired magnitude frequency responses are chosen according to simple masking models that assume a constant slope steepness on a Bark [7] or an equivalent-rectangular-bandwidth (ERB) [8] scale. For center frequen-