An Audio Codec for Multiple Generations Compression Without Loss of Perceptual Quality

Frank Kurth An Audio Codec for Multiple Generations Compression without Loss of Perceptual Quality AN AUDIO CODEC FOR MULTIPLE GENERATIONS COMPRESSION WITHOUT LOSS OF PERCEPTUAL QUALITY FRANK KURTH Department of Computer Science V, University of Bonn, Bonn, Germany [email protected] We describe a generic audio codec allowing for multiple, i.e., cascaded, lossy compression without loss of perceptual quality as compared to the first generation of compressed audio. For this sake we transfer encoding information to all subsequent codecs in a cascade. The supplemental information is embedded in the decoded audio signal without causing degradations. The new method is applicable to a wide range of current audio codecs as documented by our MPEG-1 implementation. INTRODUCTION 1. AGEING EFFECTS IN CASCADED CODING Low bit rate high quality coding is used in a wide range of Input Output nowadays audio applications such as digital audio broad- Block- or Subband- Quantization Requantization Inverse casting or network conferencing. Although the decoded Transform Transform versions of the compressed data maintain very high sound quality, multiple- or tandem-coding may result in accu- Spectral Analysis/ mulated coding errors resulting from lossy data reduction Psychoacoustic Model Encoder Decoder schemes. Such multiple or tandem coding, leading to the notion of a signal’s generations, may be decribed as follows. Assuming a coder operation and a corresponding Figure 1: Simplified scheme of psychoacoustic codec. ¢ decoder operation ¡ we shall call, for a given signal , ¦¥ § A general scheme of a psychoacoustic audio codec is ¢ £ ¤ ¡ ¢ ¡ the first generation and, for an integer , given in Fig. 1. In this figure we have omitted optional ¢ the £ -th generation of . In this paper we propose a method to overcome ageing components such as entropy coding or scalefactor calcu- effects. More precisely, our ultimate goal is to preserve lation, although those may indirectly influence degener- the first generation’s perceptual quality. For this sake we ation effects. Such effects will have to be dealt with in use an embedding technique, conceptually similar to the each particular case. audio mole [1], to transport coding information from one The encoder unit consists of a subband transform ¨ op- codec in a cascade to subsequent codecs. As compared erating in a block by block mode. For a signal block ¢ , a to [1], our embedding technique is performed in the trans- synchronous (w.r.t. the subband transform of that block) spectral analysis © ¢ is used to obtain parameters of a psy- ¥ ¥ form domain. Moreover it is based on psychoacoustic ¥ ¢ ¤ ¤ ¢ ¨¢ ¢ principles which allows the embedding to be performed choacoustic model ¤ . A bit allocation without causing audible distortions. is performed using the psychoacoustic parameters and The paper is organized as follows. In the first section leading to a choice of quantizers for lossy data reduction. we perform a detailed analysis of ageing effects. For Following quantization, codewords and side information, this sake we look at a general model of a psychoacous- e.g., quantizers, scalefactors etc., are transmitted to the tic codec and investigate which of its components may decoder. The decoder reconstructs subband samples us- induce artifacts in casacded coding. Ageing effects in a ing the side information, e.g., by dequantization, code- real-world coding application are documented by the re- book look-up, or inverse scaling, and then carries out a ¨ sults of listening tests as well as measurements of rele- reconstruction subband transform . vant encoding parameters. The second section develops Within this framework, sources for signal degeneration a generic audio codec for cascaded coding without per- are ceptual loss of quality. In the third section, an implemen- 1. the lossy quantization step, tation based on an MPEG-1 Layer II codec is described. ¨ © ¨ 2. round-off and arithmetic errors due to , , , as The fifth section gives a codec evaluation based on exten- well as possibly further calculations, sive listening tests as well as objective signal similarity ¨ measurements. Concluding we point out some applica- 3. aliasing due to ¨ , cancelled by only in the ab- tions and future work in this area. sence of lossy quantization, AES 17 International Conference on High Quality Audio Coding 1 Frank Kurth An Audio Codec for Multiple Generations Compression without Loss of Perceptual Quality 4. NPR (near perfect reconstruction) errors, i.e., if The most important results [2] may be summarized as fol- is only a pseudoinverse of , lows: 5. inaccuracy of the psychoacoustic model , Almost all test pieces already show noticeable per- 6. non-suitable time-frequency resolution in spectral ceptual changes in the second and third genera- analysis, tions. 7. missing translation invariance of and (in view Fourth to sixth generations show a clear loss in of multiple coding). sound quality. The induced noise in many of the pieces is annoying. Almost all pieces above the The quantization error 1. dominates and is likely to be re- eighth generation show a severe perceptual distor- sponsible for the greatest part of the degenerations. This tion. error may differ in several magnitudes from all of the other errors. As an extreme example, consider MPEG Quiet and very harmonic signal parts are (almost) coding where, depending on the codec configuration, sev- not distorted due to MPEGs extensive use of scale eral of the highest subbands of the transformed signal are factors and due to higher SMR ratios induced by simply set to zero. Round-off and NPR-errors (2. and 4.) tonal signal components. are small as compared to 1., although they must be con- trolled from codec to codec in a cascade. Those errors It becomes clear that lossy coding changes the signals will be especially important in conjunction with our em- spectral content in a way that does not allow subsequent bedding technique. As, e.g., Layer III of MPEG-1 audio encoders to perform a suitable psychoacoustic analysis. coding shows, aliasing errors (3.) have to be dealt with This leads to a degeneration of coding parameters and, fi- carefully. The errors 5. and 6. indirectly contribute to the nally, of overall sound quality. We further illustrate this quantization error. 80 equal worse better Original 60 8 Percussion Generation 5 Generation 10 40 7 Pop (Jackson) Generation 15 6 Orchestra 20 Energy [dB] 5 Wind Instruments 0 4 Speech, male −20 3 Speech, female 2 Castanets −40 0 5 10 15 20 25 30 35 Subband 1 Pop (Abba) −3 −2 −1 0 1 2 3 Figure 3: Degeneration of subband energies in MPEG-1 coding. For a fixed frame, the figure shows the energy Figure 2: Comparison of first vs third generations. The in each of the subbands. The four lines correspond to solid bars give the average rating, the small bars show the the energies of the original signal and the fifth, tenth, and variance. fifteenth generation respectively. Listening tests are used to investigate signal degeneration by an example documenting the change of the spectral with increasing generations. Fig. 2 shows the results of a content of a signal for increasing generations. Fig. 3 listening test where 26 listeners compared first and third shows the subband energies for a fixed frame in the case generations coded with an MPEG-1 Layer II codec at 128 of an MPEG-1 encoded guitar piece. The different lines kbps. Negative values indicate that the third generations represent the energies for different generations. One ob- were rated worse than first generations, whereas posi- serves the decrease of energy in subbands 6-9 and 11-17. tive values give them better ratings. It is obvious from The higher subbands are attenuated as an effect of the Fig. 2 that the third generations could almost all be dis- MPEG encoder not allocating any bits to them. In Fig. 4 tinguished from the first ones and were generally judged the energy of the scalefactor bands is given for several to be of worse quality. frames of a piece containing a strummed electric guitar AES 17 International Conference on High Quality Audio Coding 2 Frank Kurth An Audio Codec for Multiple Generations Compression without Loss of Perceptual Quality 80 80 70 70 0 0.5 60 60 50 50 −5 0 40 40 −10 30 30 −0.5 Scalefactor band Scalefactor band Difference in allocation 20 20 Energy difference [dB] 5 5 5 10 10 20 10 10 40 10 15 60 20 15 80 15 25 20 40 60 20 40 60 Generation Subband Generation Subband Frame Frame 80 80 70 70 4 60 60 2 0.2 0 50 50 −2 0 40 40 −4 −0.2 30 30 −6 −0.4 Difference in allocation Scalefactor band Scalefactor band 20 20 Energy difference [dB] 5 5 5 10 10 20 10 10 40 10 15 60 20 15 80 15 25 20 40 60 20 40 60 Generation Subband Generation Subband Frame Frame Figure 5: Development of mean subband energy (left) Figure 4: Degeneration of subband energies in MPEG- and mean bit allocation (right) as a difference signal of 1 coding. For successive frames, the figure shows the (increasing) generations and the first generation. The top energy in each of the scalefactor bands as an intensity plots show those difference signals for the case of syn- plot. The four subplots correspond to the energies of the chronization between each of the codec cascades, the bot- first, fifth, tenth, and fifteenth generation respectively. tom plots for the case of absence of such a synchroniza- tion. (the rhythm of strumming may be observed as the verti- cal structure in each of the plots). Increasing generations It depends on the signal, which of the degenerations are are plotted in a zig-zag scheme. The signal’s degenera- percieved to be more severe.

An Audio Codec for Multiple Generations Compression Without Loss of Perceptual Quality

PXC 550 Wireless Headphones

Audio Coding for Digital Broadcasting

(A/V Codecs) REDCODE RAW (.R3D) ARRIRAW

A Multi-Frame PCA-Based Stereo Audio Coding Method

Lossless Compression of Audio Data

Improving Opus Low Bit Rate Quality with Neural Speech Synthesis

Cognitive Speech Coding Milos Cernak, Senior Member, IEEE, Afsaneh Asaei, Senior Member, IEEE, Alexandre Hyaﬁl

Codec Is a Portmanteau of Either

Lossy Audio Compression Identification

Speech Compression

Video Source File Specifications

A Psychoacoustic-Based Multiple Audio Object Coding Approach Via Intra- Object Sparsity