Efficient Perceptual Audio Coding Using Cosine and Sine Modulated Lapped Transforms

Christian Helmrich Efficient Perceptual Audio Coding Using Cosine and Sine Modulated Lapped Transforms Effiziente wahrnehmungsorientierte Audiocodierung unter Verwendung kosinus- und sinusmodulierter überlappender Transformationen Der Technischen Fakultät der Friedrich-Alexander-Universität Erlangen-Nürnberg zur Erlangung des Doktorgrades Doktor-Ingenieur vorgelegt von Christian R. Helmrich aus Cuxhaven Als Dissertation genehmigt von der Technischen Fakultät der Friedrich-Alexander-Universität Erlangen-Nürnberg Tag der mündlichen Prüfung: 18. Mai 2017 Vorsitzender des Promotionsorgans: Prof. Dr.-Ing. Reinhard Lerch Gutachter: Prof. Dr.-Ing. Bernd Edler Prof. Dr.-Ing. habil. Rudolf Rabenstein Copyright © 2017 Christian R. Helmrich, Germany All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without prior written permission from the author and publisher. Requests for permission should be addressed via text-only email to [email protected]. Printed in Germany Second edition, May 2017 (first edition: November 2016) Dedicated to my mother Angelika In loving memory of my father Lutz iv Abstract The increasing number of simultaneous input and output channels utilized in immer- sive audio configurations primarily in broadcasting applications has renewed industrial requirements for efficient audio coding schemes with low bit-rate and complexity. This thesis presents a comprehensive review and extension of conventional approaches for perceptual coding of arbitrary multichannel audio signals. Particular emphasis is given to use cases ranging from two-channel stereophonic to six-channel 5.1-surround setups with or without the application-specific constraint of low algorithmic coding latency. Conventional perceptual audio codecs share six common algorithmic components, all of which are examined extensively in this thesis. The first is a signal-adaptive filterbank, constructed using instances of the real-valued modified discrete cosine transform (MDCT), to obtain spectral representations of successive portions of the incoming discrete time signal. Within this MDCT spectral domain, various intra- and inter-channel optimizations, most of which are of linear predictive nature, are employed as a second step to minimize spectral, temporal, and/or spatial redundancy. These processing steps are succeeded by a psychoacoustically motivated and controlled quantization process, with optional simple parametric extensions such as noise substitution or related forms of MDCT coefficient exchange, in order to reach the desired coding bit-rate. The fourth component comprises lossless entropy coding of the quantized spectral coefficients and parameters as well as the compilation of all entropy coded data into a transmittable bit- stream. Components five and six, finally, represent low-bit-rate methods for improved high-frequency regeneration for audio bandwidth extension and downmix-based stereo or surround coding, which generally do not operate in the MDCT domain but require an additional pair of complex-valued pseudo-quadrature mirror filter (QMF) banks around the MDCT core infrastructure. The auxiliary filter-banks are shown to notably increase both the algorithmic codec complexity and latency, rendering their usage for low-delay communication applications difficult, especially on battery-powered mobile devices. The complex-domain coding tools can be regarded as pre- and post-processors to the MDCT core-coder, and it is demonstrated that most algorithmic details of these tools can be integrated directly into the MDCT architecture. Moreover, algorithms for respective encoder-side calculation of the modified spectral coefficients and the associated coding parameters, i. e., analysis, are derived which allow the decoder-side reconstruction, i. e. synthesis, to remain real-valued. More specifically, exclusive utilization of the MDCT can be maintained in the decoder, while the modulated complex lapped transform (MCLT), vi whose real part is the MDCT and whose imaginary part is represented by the modified discrete sine transform (MDST), may be employed in the encoder for best audio quality. Phase-related details of the conventional complex-valued coding algorithms, which are difficult to realize using only real-valued transformation, are substituted by an intensity downmix-based but subjectively acceptable encoder-side pre-processing operation. The characteristics of state-of-the-art MDCT filter-bank designs are the second focus of this thesis. Continuing the above investigation of parametric stereo/surround coding methods, an extension of the MDCT coding paradigm, applying sine modulation by way of the MDST instead of the traditional cosine modulation in some channels, is described. Time domain aliasing cancelation (TDAC) compliant transitions between the MDCT and MDST instances, for perfect reconstruction (PR) in the absence of spectral quantization, are discussed. When used in a signal-adaptive fashion, this so-called “kernel switching” method leads to significant coding quality gains on input material with an inter-channel phase difference (IPD) around ±90°. Thereafter, a so-called “ratio switching” approach is presented. Its purpose is the signal-adaptive variation of the inter-transform overlap ratio based on the input’s instantaneous harmonicity and temporal flatness. To this end the definition of the extended lapped transform (ELT), whose overlap ratio exceeds that of the MDCT and MDST, is modified to allow transitions to and from the latter two transforms with PR, i. e., proper TDAC. Using the modified ELT (MELT) with a newly designed window function on tonal quasi-stationary waveform portions, e. g., recordings of single instruments, while resorting to the MDCT or MDST on noise-like and/or non-stationary parts, is shown to yield small but significant improvements in overall coding quality. For low-delay use cases, where the additional look-ahead due to increased transform overlap ratio is undesirable, long-term predictive (LTP) coding as an alternative to ratio switching is examined as a third and final topic. After reviews of conventional time- and frequency-domain approaches, a new MDCT-domain algorithm with low parameter rate (one periodicity value per time unit) and complexity (a fraction of that of the prior art) is proposed. Supporting intra- and inter-channel prediction, this frequency-domain pre- dictor (FDP) offers coding gains which are close, and orthogonal, to those of the MELT. The work concludes with comparative objective and subjective evaluation of the presented contributions, when integrated into the MPEG-D USAC based MPEG-H 3D Audio codec. Objective assessment reveals large savings in delay and decoder complexity, and blind subjective testing indicates that, in terms of audio quality, the modified MPEG-H codec matches or outperforms the respective state of the art in both general-purpose and low-delay applications. Most importantly, for both stereo and 5.1-surround channel configurations, more consistent audio quality across the different types of input signals, with fewer observed negative outliers, is achieved in comparison to the state of the art. Kurzfassung Die steigende Anzahl gleichzeitig genutzter Eingangs- und Ausgangskanäle in Raum- klangkonfigurationen v. a. in Rundfunkanwendungen hat industrielle Forderungen nach effizienten Audiocodiersystemen mit niedriger Bitrate und Komplexität erneuert. Diese Arbeit präsentiert einen umfassenden UJberblick u ber die konventionellen Ansatze zur wahrnehmungsorientierten Codierung beliebiger Multikanal-Audiosignale und stellt im Anschluss Erweiterung bzw ,erbesserungen dieser vor 0esonderes Augenmerk gilt dabei Anwendungsfallen von Kweikanal-Stereo bis Sechskanal-5 1-Surround mit und ohne etwaiger einsatzspezifischer 0eschrankung auf niedrige algorithmische Codierlatenz Ionventionelle wahrnehmungsbezogene Audio-Codecs verwenden sechs vergleich- bare algorithmische Iomponenten, welche alle in dieser Arbeit untersucht werden Die erste ist eine signal-adaptive Filterbank aus Realisierungen der reellwertigen modifi- zierten diskreten Iosinus-Transformation 9MDCT), die eine spektrale Darstellung auf- einanderfolgender Abschnitte des eingehenden diskreten Keitsignals erlaubt Innerhalb dieses MDCT-Spektralbereichs finden diverse Intra- und Interkanal-Optimierungen, von denen die meisten linearpradiktiver Natur sind, als zweiter Schritt Anwendung mit dem Kiel der Minimierung spektraler, zeitlicher und raumlicher Redundanz Darauf folgt ein psychoakustisch motivierter und kontrollierter Quantisierungs-Prozess, mit optionalen parametrischen Erweiterungen wie Rausch-Ersatz oder ahnlichen Formen des MDCT- Ioeffizientenaustauschs, zur Erzielung der gewunschten Codierbitrate Die vierte Iom- ponente umfasst die verlustfreie Entropie-Codierung aller 6uantisierten Spektralwerte und Parameter sowie die Erfassung der codierten Daten im zu ubertragenden 0itstrom Die Iomponenten funf und sechs schlieLlich reprasentieren Methoden fur verbesserte Hochfre6uenzrekonstruktion zur Audiobandbreitenerweiterung und downmixbasierte Stereo- oder Surround-Codierung bei niedrigen 0itraten Diese arbeiten meist nicht in der MDCT-Domane, sondern benotigen zusatzliche komplexwertige Pseudo-Quadratur- Spiegelfilterbanke 9QMF) auLerhalb der MDCT-Infrastruktur Die Kusatz-Filterbanke fuhren dabei zu deutlich erhohter

Efficient Perceptual Audio Coding Using Cosine and Sine Modulated Lapped Transforms

Blackberry QNX Multimedia Suite

SA1OPS English User Manual

Ogg Audio Codec Download

Opus, a Free, High-Quality Speech and Audio Codec

Multi-Core Platforms for Audio and Multimedia Coding Algorithms in Telecommunications

Fast Computational Structures for an Efficient Implementation of The

CT8021 H.32X G.723.1/G.728 Truespeech Co-Processor

TR 101 329-7 V1.1.1 (2000-11) Technical Report

International Organisation for Standardisation Organisation Internationale De Normalisation Iso/Iec Jtc1/Sc29/Wg11 Coding of Moving Pictures and Audio

LOW COMPLEXITY H.264 to VC-1 TRANSCODER by VIDHYA

What Is Ogg Vorbis?

Multiple Description Coding Using Time Domain Division for MP3 Coded Sound Signal