Ambisonic Coding with Spatial Image Correction Pierre Mahé, Stéphane Ragot, Sylvain Marchand, Jérôme Daniel

To cite this version:

Pierre Mahé, Stéphane Ragot, Sylvain Marchand, Jérôme Daniel. Ambisonic Coding with Spatial Image Correction. European Signal Processing Conference (EUSIPCO) 2020, Jan 2021, Amsterdam (virtual ), Netherlands. ￿hal-03042322￿

HAL Id: hal-03042322 https://hal.archives-ouvertes.fr/hal-03042322 Submitted on 6 Dec 2020

Ambisonic Coding with Spatial Image Correction

Pierre MAHÉ (1,2), Stéphane RAGOT (1), Sylvain MARCHAND (2), Jérôme DANIEL (1)
(1) Orange Labs, Lannion, France; (2) L3i, Université de La Rochelle, France

[email protected], [email protected], [email protected], [email protected]

Abstract—We present a new method to enhance multi-mono coding of ambisonic audio signals. In multi-mono coding, each component is represented independently by a mono core codec, which may introduce strong spatial artifacts. The proposed method is based on the correction of spatial images derived from the sound-field power map of the original and degraded ambisonic signals. The correction metadata is transmitted as side information to restore the spatial image by post-processing. The performance of the proposed method is compared against naive multi-mono coding (with no side information) at the same overall bitrate. Experimental results are provided for the case of First-Order Ambisonic (FOA) signals and two mono core codecs: EVS and Opus. The proposed method is shown to provide on average some audio quality improvement for both core codecs. ANOVA results are provided as a complementary analysis.

Index Terms—ambisonics, audio coding, spatial audio.

I. INTRODUCTION

With the emergence of new spatial audio applications (e.g. Virtual or Extended Reality, or smartphone-based immersive capture), there is a need for efficient coding of immersive audio signals, especially for mobile communications. There are different ways to represent a spatial audio scene, and ambisonics [1]–[3] has become an attractive format.

For that purpose, the most straightforward approach to code ambisonic audio signals is to extend existing mono or stereo codecs by "discrete coding", i.e. coding ambisonic components using separate core codec instances. When the core codec is mono, this approach is hereafter referred to as multi-mono coding. The objective of this paper is to enhance the quality of multi-mono coding in a backward-compatible way with a new method based on spatial post-processing with side information.

At low bitrates, multi-mono/-stereo coding causes several quality artifacts [4] because correlation between components is not preserved. Spatial artifacts may be reduced by applying a fixed channel matrixing before/after multi-mono/-stereo coding, as implemented in ambisonic extensions of Opus [5], [6] or e-AAC+ [7, clause 6.1.6]. Subjective test results reported in [8] suggest that matrixing methods may have some quality limitations when the ambisonic order increases and that matrixing does not remove all artifacts caused by the core codec.

An alternative to multi-mono/-stereo coding is to analyze the audio scene to extract sources and spatial information. The DirAC method [9] estimates the dominant source direction and diffuseness parameters from first-order ambisonic signals and re-creates the spatial scene based on a mono downmix and these spatial parameters. This method has been extended to Higher-Order Ambisonics (HOA) in HO-DirAC [9], where the sound field is divided into angular sectors; for each angular sector, one source is extracted. The recent COMPASS method [10] estimates the number of sources and derives a beamforming matrix by Principal Component Analysis (PCA) to extract sources. These approaches are efficient at low bitrates, but they have the disadvantage of reproducing only spatial cues, and not the original spatial image. Moreover, they rely on assumptions on the ambisonic scene (e.g. number of sources, positions). If these assumptions are not valid for some signals, coding performance may be strongly impacted.

In this work, we present a new coding method that may be interpreted as a spatial audio post-processing. The objective of the method is to restore the spatial image of an ambisonic signal altered by multi-mono coding. The spatial image correction is derived by comparing the sound-field power maps of the original and coded ambisonic signals.

In the field of ambisonic processing, numerous works have studied spatial image visualization of an audio scene. A simple way to visualize spatial images is to compute the sound-field power map using steered response power (SRP) beamforming [11], [12]. Basic SRP may be replaced by a different type of beamforming such as MVDR [13]. For spatial audio effects, several solutions have been proposed to manipulate and modify a spatial image. For instance, methods for ambisonic-domain spatial processing (e.g. amplification of one direction) are described in [14].

This paper is organized as follows. Section II gives an overview of ambisonics, beamforming and sound-field power map concepts. Section III explains the key principles of the proposed spatial image correction. Section IV provides a full description of the proposed coding approach with post-processing for any ambisonic order. Section V presents subjective test results for the FOA case and compares the proposed method with naive multi-mono coding.

II. AMBISONICS

Ambisonics is based on a decomposition of the sound field on a basis of spherical harmonics. Initially limited to first order [1], the formalism has been extended to higher orders [2]. We refer to [3] for fundamentals of ambisonics. To perfectly reconstruct the sound field, an infinite order is required. In practice, the sound field representation is truncated to a finite order N.

Fig. 1. Power map of original (left), coded (middle) and post-processed (right) ambisonic signals for pink noise localized at (−75°, 0°).

For a given order N, the number of ambisonic components is n = (N + 1)². As an example, First-Order Ambisonics (FOA) (N = 1) has n = 4 components: W, X, Y, Z. A plane wave with pressure s(t) at azimuth θ and elevation φ (with the mathematical convention) is encoded to the following B-format representation:

b(t) = [w(t), x(t), y(t), z(t)]^T = [1, cos θ cos φ, sin θ cos φ, sin φ]^T s(t)   (1)

More generally, a mono source s(t) in direction (θ, φ) is "projected" (encoded) into the spherical harmonic domain at order N, using a weight vector d(θ, φ) = [d_1(θ, φ), ···, d_n(θ, φ)]^T corresponding to the contribution of each ambisonic component:

b(t) = d(θ, φ) s(t)   (2)

where b(t) is an ambisonic signal and d(θ, φ) is the encoding vector, also called weight vector. Alternatively, the part of signal s(t) originating from a direction (θ, φ) may be extracted from the ambisonic signal b(t) with the inverse vector transformation, known as beamforming or spatial filtering [13].

It is possible to transform the ambisonic representation into another sound field representation by selecting a combination of m beamforming vectors. The vectors are merged into a transformation matrix D of size m × n:

a(t) = D b(t)   (3)

where D = [d(θ_1, φ_1), ···, d(θ_m, φ_m)]^T is the transformation matrix, b(t) = [b_1(t), ···, b_n(t)]^T is the vector of n ambisonic components, and a(t) = [a_1(t), ···, a_m(t)]^T is the output vector of the sound field representation. The two representations are equivalent provided that the matrix D is unitary and m ≥ n:

D^T D = I   (4)

where I is the identity matrix. If the matrix D does not satisfy this condition, spatial modification will occur. In this work, we will modify coefficients of the transformation matrix to adjust the power level in specific directions.

A power map can be used to measure sound field activity in an ambisonic signal. Many computation methods exist, such as SRP, MVDR, or CroPaC [15]. We focus here on SRP, which scans a discrete set of directions over the sphere:

s_i(t) = Σ_{k=1}^{n} d_i(k) b_k(t)   (5)

where s_i(t) is the extracted signal of the i-th beam, d_i the weight vector for the beam direction and b_k(t) the k-th ambisonic component. The power in the i-th direction is:

P_i = (1/L) Σ_{t=1}^{L} s_i(t)²   (6)

where t is used as a discrete index in an L-sample frame. A more efficient way to compute P_i is based on the covariance matrix C = BB^T of the ambisonic samples, the latter being gathered in matrix B = [b(1), ···, b(L)]:

P_i = d_i^T C d_i   (7)

The sound-field power map may be visualized by plotting P_i as a function of azimuth and elevation, e.g. with an equirectangular projection. An example of such a 20 ms power map of the audio scene is shown in Fig. 1. A pink noise was generated and spatialized to (θ = −75°, φ = 0°) at order N = 3. To illustrate this, we used here the publicly available VST plugin SPARTA [16] to generate the power maps of the original, coded and post-processed signals. The ambisonic signal was encoded by multi-mono EVS [17] at n × 16.4 kbit/s. It can be seen that multi-mono coding results in phantom sources and spatial alterations. After the proposed post-processing, sound spatialization is restored.

III. SPATIAL POST-PROCESSING IN AMBISONIC DOMAIN

The principle of the proposed coding method is to correct the spatial degradation due to multi-mono coding by post-processing. The method computes a power map of the original ambisonic signal and a power map of the coded ambisonic signal after core codec processing. A spatial correction matrix is derived based on both power maps, with the objective to restore as well as possible the original signal spatialization. The transformation matrix is quantized and transmitted as side information on top of the multi-mono bitstream, in a backward-compatible way. The proposed approach is formulated with the objective to find a transformation matrix T of size n × n such that the power map of TB̂ is identical to that of B, where B and B̂ are, respectively, the original and coded ambisonic signals. As shown in Section II, the sound-field power map can be obtained from the covariance matrix. Therefore, the problem can be reformulated as follows:

d_i^T C d_i = d_i^T C̃ d_i   (8)

where i is the index of a spherical grid (θ_i, φ_i) on the unit sphere, C is the covariance matrix of the original signal and C̃ = TB̂(TB̂)^T = TB̂B̂^T T^T that of the restored ambisonic signal TB̂.
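To make the B-format encoding and the SRP power map of Eqs. (1) and (5)–(7) concrete, here is a minimal numpy sketch. It is not the paper's implementation: the function names, the white-noise test source (the paper uses pink noise) and the 5° horizontal scanning grid are illustrative choices.

```python
import numpy as np

def foa_encode(s, azi, ele):
    """Encode a mono signal s into FOA B-format (Eq. (1))."""
    d = np.array([1.0,
                  np.cos(azi) * np.cos(ele),
                  np.sin(azi) * np.cos(ele),
                  np.sin(ele)])          # encoding vector d(theta, phi)
    return np.outer(d, s)                # shape (4, L): W, X, Y, Z

def srp_power_map(B, grid):
    """SRP power per direction from the covariance matrix (Eqs. (6)-(7))."""
    C = B @ B.T                          # covariance matrix C = B B^T
    P = []
    for azi, ele in grid:
        d = np.array([1.0,
                      np.cos(azi) * np.cos(ele),
                      np.sin(azi) * np.cos(ele),
                      np.sin(ele)])      # steering vector for direction i
        P.append(d @ C @ d)              # P_i = d_i^T C d_i
    return np.array(P)

# a noise source spatialized at azimuth -75 degrees, elevation 0
rng = np.random.default_rng(0)
s = rng.standard_normal(640)             # one 20 ms frame at 32 kHz
B = foa_encode(s, np.radians(-75.0), 0.0)

# scan the horizontal plane: the power peak should be at -75 degrees
azis = np.radians(np.arange(-180, 180, 5))
P = srp_power_map(B, [(a, 0.0) for a in azis])
print(round(np.degrees(azis[np.argmax(P)])))   # -> -75
```

For a single plane wave the SRP response reduces to (1 + cos Δ)²·‖s‖², where Δ is the angle to the source, so the scan peaks exactly at the source direction.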


Fig. 2. Overview of proposed coding method.

Denoting Ĉ = B̂B̂^T the covariance matrix of the coded signal, from Eq. (8) one can verify that the transformation matrix must satisfy:

T Ĉ T^T = C   (9)

A possible solution to Eq. (9) can be found by Cholesky factorization of C and Ĉ in terms of triangular matrices L and L̂ (with real and positive diagonal coefficients), where C = LL^T and Ĉ = L̂L̂^T. Thus, Eq. (9) becomes:

(TL̂)(TL̂)^T = LL^T   (10)

which gives the following solution:

T = L L̂⁻¹   (11)

Note that the Cholesky factorization requires C and Ĉ to be positive definite matrices, while these covariance matrices are only guaranteed to be positive semi-definite. A conditioning bias ε (e.g. 10⁻⁹) is therefore added to the diagonal of both matrices.

IV. GENERAL CODEC DESCRIPTION

The proposed coding method operates on successive frames of 20 ms, which is the typical frame length used in mobile telephony. The overall codec architecture is shown in Fig. 2. We describe below a general implementation for any ambisonic order, while experiments are later reported for the FOA case (N = 1).

A. Encoder

The encoder input is an ambisonic signal in B-format. A 20 Hz IIR high-pass filter (detailed in [18]) is applied on each ambisonic component to avoid any bias in the subsequent covariance matrix estimation. The input is multi-mono coded and decoded, by coding each ambisonic component separately with an independent core codec instance. The original signal and the coded signal are divided into subbands. To limit the number of subbands, Bark bands are merged to obtain 7 bands. The encoder computes a spatial correction matrix T_b (b = 1, ···, 7) in each band. Note that we do not use a frame index, to simplify notations in the following description. To guarantee smooth inter-frame transitions after spatial image correction, the determination of T_b is based on an analysis window covering 40 ms (current and previous frames) with a 50% overlap. In each subband, the covariance matrices C_b and Ĉ_b are estimated and the correction matrix T_b is computed as described previously in this section. The correction matrix T_b is normalized to preserve the energy level of the W ambisonic component in each subband; this ensures that the proposed correction method does not compensate for the core codec frequency response, especially in high frequency bands.

The transformation matrices T_b are coded using µ-law companded scalar quantization with 8 bits per coefficient. Note that the matrices L_b and L̂_b, defined previously, are triangular, therefore T_b is also triangular. Only part of T_b needs to be transmitted.

B. Decoder

After multi-mono decoding, the decoded signal is divided into subbands. In parallel, the quantized transform matrix T̂_b is received by the decoder. Then, the transform matrix is applied to correct the spatial image.

C. Details on metadata coding for FOA

For the FOA case (n = 4), 10 coefficients of T_b need to be transmitted for each band. Moreover, the first coefficient (upper left corner) of T_b is normalized by the W energy, thus it does not need to be transmitted. Only 9 correction coefficients are coded (with 8-bit scalar quantization) in 7 subbands, which results in a bitrate of 25.2 kbit/s for the correction metadata, on top of the multi-mono bitstream. Note that the diagonal entries of T_b are always positive and it would be possible to save 1 sign bit per subband; however, this optimization is discarded because it does not have a significant effect.

V. EXPERIMENTAL RESULTS FOR FOA

A. Test setup

The proposed post-processing is designed to operate with any core codec. In this test we used two core codecs for multi-mono coding: EVS [17] and Opus [5]. With such low bitrate codecs, one cannot measure performance with objective metrics such as SNR or MSE [19]; a subjective test is required. Several subjective test methodologies have been developed for mono or stereo, e.g. ITU-T P.800 (ACR, DCR, CCR) [20], ITU-R BS.1534 (MUSHRA) [21]. For immersive audio coding, test methodologies are less mature and often adapted from existing ones. To evaluate the proposed coding method, we conducted a Comparison Category Rating (CCR)-like subjective test [20]. For each test item, subjects were asked to compare the audio quality of two conditions (A and B) operated at the same overall bitrate, where one is naive multi-mono coding and the other is multi-mono coding with post-processing. The test interface is shown in Fig. 3. The grading scale ranged from −3 to 3 ("B is much better than A" to "A is much better than B"), and only integer scores were allowed.

TABLE I. Pairs of conditions for the CCR-like test and related bitrates (in kbit/s).

Core codec | Multi-mono bitrate | Proposed coding method
EVS        | 4 × 24.4 (97.6)    | 4 × 16.4 + 25.2 (90.8)
EVS        | 4 × 32.0 (128.0)   | 4 × 24.4 + 25.2 (122.8)
Opus       | 4 × 32.0 (128.0)   | 4 × 25.7 + 25.2 (128.0)
Opus       | 4 × 40.0 (160.0)   | 4 × 33.7 + 25.2 (160.0)
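The correction-matrix computation of Eqs. (9)–(11), including the diagonal conditioning bias ε, can be sketched in a few lines of numpy. This is an illustrative sketch, not the codec's actual code: names and the toy "coding noise" model are assumptions.

```python
import numpy as np

def correction_matrix(C, C_hat, eps=1e-9):
    """Spatial correction T = L Lhat^{-1} (Eqs. (9)-(11)),
    with conditioning bias eps added to both diagonals."""
    n = C.shape[0]
    L = np.linalg.cholesky(C + eps * np.eye(n))          # C = L L^T
    L_hat = np.linalg.cholesky(C_hat + eps * np.eye(n))  # Chat = Lhat Lhat^T
    # T Lhat = L  =>  T = L Lhat^{-1}; solve rather than invert
    return np.linalg.solve(L_hat.T, L.T).T

# toy frame: original FOA signal B and a degraded ("coded") version B_hat
rng = np.random.default_rng(1)
B = rng.standard_normal((4, 640))
B_hat = B + 0.3 * rng.standard_normal((4, 640))          # additive coding noise

C, C_hat = B @ B.T, B_hat @ B_hat.T
T = correction_matrix(C, C_hat)

# after correction, the covariance of T @ B_hat matches C (Eq. (9))
C_rest = T @ C_hat @ T.T
print(np.allclose(C_rest, C, rtol=1e-6, atol=1e-6))      # -> True
```

Since L and L̂ are lower triangular, so is T = L L̂⁻¹, which is why only the lower triangle of each T_b needs to be transmitted.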

Fig. 3. Test interface.
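The metadata budget of Section IV-C (9 coefficients × 7 subbands × 8 bits every 20 ms) can be checked with a short sketch of the coefficient quantizer. The paper specifies µ-law companded 8-bit scalar quantization but gives neither the companding constant nor the input range; µ = 255 and a ±4 clipping range below are assumptions for illustration.

```python
import numpy as np

MU = 255.0  # companding constant (assumed; not given in the paper)

def mulaw_quantize(x, bits=8, x_max=4.0):
    """Mu-law companded scalar quantization index (x_max is an assumed range)."""
    x = np.clip(x / x_max, -1.0, 1.0)
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)  # compand
    levels = 2 ** bits
    return int(np.round((y + 1.0) * (levels - 1) / 2.0))      # uniform index

def mulaw_dequantize(idx, bits=8, x_max=4.0):
    levels = 2 ** bits
    y = 2.0 * idx / (levels - 1) - 1.0
    x = np.sign(y) * (np.power(1.0 + MU, np.abs(y)) - 1.0) / MU  # expand
    return x * x_max

# metadata budget: 9 coefficients x 7 subbands x 8 bits per 20 ms frame,
# i.e. 50 frames per second
bits_per_frame = 9 * 7 * 8
print(bits_per_frame * 50)   # -> 25200 bit/s, i.e. the 25.2 kbit/s of the paper

coeff = 0.73
rec = mulaw_dequantize(mulaw_quantize(coeff))
print(abs(rec - coeff) < 0.02)   # small reconstruction error for mid-range values
```

The companding step allocates finer quantization steps to small coefficient values, which is a common choice when most correction coefficients are expected to be close to zero or one.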

In both conditions, spatial and timbral artifacts may occur. To evaluate which condition is the most faithful to the original signal, we added an explicit reference with the original item (without coding). This ground-truth condition was explicitly labeled as the reference ("Ref") for subjects.

Both core codecs (EVS and Opus) were operated in super-wideband (SWB) mode at two different bitrates (R1 and R2). For EVS, R1 = 97.6 (4 × 24.4) kbit/s and R2 = 128.0 (4 × 32.0) kbit/s. For Opus, R1 = 128.0 (4 × 32.0) kbit/s and R2 = 160.0 (4 × 40.0) kbit/s. These bitrates are overall bitrates; the proposed coding method used the same overall bitrate budget, which was split between multi-mono coding and metadata coding. All test conditions are summarized in Table I. At a given overall bitrate, the proposed coding method with post-processing allocates a lower bitrate to multi-mono coding of the ambisonic components because some budget is reserved for metadata. The EVS-SWB bitrate granularity is limited to a discrete set (9.6, 13.2, 16.4, 24.4, 32, 48, 96, 128 kbit/s), therefore some bitrate was left unused for the proposed method at R1 and R2. We preferred to keep the same metadata coding (at 25.2 kbit/s) for EVS and Opus.

The input FOA signals were sampled at 32 kHz. All (original and coded) FOA test items were binauralized for headphone presentation. We selected the SPARTA binaural renderer with default settings [16]. The test items consisted of 10 challenging ambisonic items, already used and described in the work reported in [18]. To add more spatial variety, we added two items: "Castanets", a signal from the EBU SQAM database [22] spatialized at several static positions around the listener; "Noise", a pink noise source rotating smoothly around the listener in the horizontal plane.

The presentation of items and A/B conditions was randomized for each subject. All subjects conducted the listening test with the same professional audio hardware in a dedicated listening room. In total 24 listeners participated (11 expert and 13 naive).

B. Subjective test results

Overall test results are shown in Fig. 4. In most cases the proposed coding method provides improvement over naive multi-mono for both codecs (EVS, Opus) and bitrates (R1, R2). The post-processing only corrects the spatial image, therefore this quality improvement can be attributed to spatial image restoration. With the proposed coding method, localization artifacts of naive multi-mono, reported in [4], were almost removed.

Fig. 4. Overall results: mean scores with 95% confidence intervals. A positive (resp. negative) score indicates that the proposed coding method improves (resp. degrades) with respect to naive multi-mono.

Three items were, on average, significantly degraded: Applause, Theater and Little Prince. The spatial correction estimation covers 40 ms; for items with multiple transients in different directions over this period of time, as in Applause, the estimated correction is an average of multiple directions and it may amplify a wrong part of space. Moreover, the correction may also result in some kind of spatial pre/post-echo if other sounds mixed with strong percussive events are amplified. The degradation may also be explained by the reduced bit allocation to the core codec when post-processing is used, especially for the Applause item, which is complex to code. The two other items (Theater, Little Prince) exhibit some strong reverberation that may affect post-processing.

C. ANOVA statistical analysis

A further analysis was done to determine if the performance of the proposed coding method is influenced by codecs and bitrates. A first variance analysis was conducted on CMOS scores with the within-group factors "Codec" (2 levels: EVS/Opus), "Bitrate" (2 levels: bitrate1/bitrate2) and "Item Content" (10 levels), and the between-group factor "Expertise" (2 levels: naive and expert). This showed that there is no significant effect of the factor "Expertise" nor any significant interaction between this factor and the others. Another variance analysis was conducted without the factor "Expertise", but still with the within-group factors "Codec", "Bitrate" and "Item". As shown in Table II, this showed a significant effect of the factors "Bitrate" and "Item", as well as significant two-way interactions.

TABLE II. ANOVA results for within-group factors "Codec", "Bitrate" and "Item".

Effect                 | Deg. of freedom | F-ratio | p-value
Codec                  | 1               | 3.71    | 0.067
Bitrate                | 1               | 10.12   | 0.004
Item                   | 9               | 27.64   | 0.000
Codec * Bitrate        | 1               | 28.39   | 0.000
Codec * Item           | 9               | 3.43    | 0.001
Bitrate * Item         | 9               | 2.63    | 0.007
Codec * Bitrate * Item | 9               | 1.22    | 0.283

Those effects and interactions are illustrated in Fig. 5. The results showed that there is no dependency on the codec, and this suggests that the proposed post-processing method may work for any codec used for multi-mono coding.

Fig. 5. Test results as a function of codecs (EVS, Opus), bitrates (R1, R2, labeled "bitrate1" and "bitrate2") and items.

It is worth noting that the bitrate influence is strongly visible for the EVS codec. Indeed, multi-mono coding is allocated a lower bitrate when post-processing is used, and this reduced allocation impacts quality more strongly than the spatial correction improves it. This can be explained by the performance vs. bitrate behavior of EVS. In addition, the spatial correction may amplify coding noise or artifacts when applied on a too degraded ambisonic signal. This indicates that ambisonic components need to be coded at a sufficient quality for post-processing to be efficient.

VI. CONCLUSION

This article presented a coding method with post-processing to enhance multi-mono coding by means of spatial image correction. The sound-field power maps of the original and coded signals are computed to correct the spatialization of the coded signals. A CCR-like subjective test was conducted for two core codecs (EVS and Opus) at two overall bitrates. Results showed an improvement of the overall audio quality over naive multi-mono coding. A further ANOVA analysis demonstrated that the performance improvement does not depend on the core codec. This suggests that the proposed method may be used with other core codecs. It also appeared that the proposed coding method requires a minimum bitrate to be allocated to multi-mono coding to enable significant spatial improvement by post-processing; at a given overall bitrate, the bitrate taken for the spatial correction metadata may penalize the performance improvement expected from post-processing.

ACKNOWLEDGMENTS

The authors thank Lætitia Gros for her help and expertise on subjective methodologies and ANOVA analysis.

REFERENCES

[1] M. A. Gerzon, "Periphony: With-height sound reproduction," AES Journal, 1973.
[2] J. Daniel, "Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia," Ph.D. dissertation, Université Paris 6, 2000.
[3] B. Rafaely, Fundamentals of Spherical Array Processing. Springer, 2019.
[4] P. Mahé, S. Ragot, and S. Marchand, "First-order ambisonic coding with quaternion-based interpolation of PCA rotation matrices," in EAA Spatial Audio Signal Processing Symposium, Sep. 2019.
[5] J.-M. Valin, G. Maxwell, T. B. Terriberry, and K. Vos, "High-Quality, Low-Delay Music Coding in the Opus Codec," CoRR, 2016.
[6] J. Skoglund, "Ambisonics in an Opus Container," IETF RFC 8486, 2018.
[7] 3GPP TS 26.918, "Virtual Reality (VR) media services over 3GPP," 2018.
[8] T. Rudzki, I. Gomez-Lanzaco, P. Hening, J. Skoglund, T. McKenzie, J. Stubbs, D. Murphy, and G. Kearney, "Perceptual Evaluation of Bitrate Compressed Ambisonic Scenes in Loudspeaker Based Reproduction," in AES International Conference on Immersive and Interactive Audio, 2019.
[9] V. Pulkki, A. Politis, M.-V. Laitinen, J. Vilkamo, and J. Ahonen, "First-order directional audio coding (DirAC)," in Parametric Time-Frequency Domain Spatial Audio. John Wiley & Sons, 2018, ch. 5.
[10] A. Politis, S. Tervo, and V. Pulkki, "COMPASS: Coding and Multidirectional Parameterization of Ambisonic Sound Scenes," in Proc. ICASSP, 2018.
[11] D. P. Jarrett, E. A. Habets, and P. A. Naylor, "3D source localization in the spherical harmonic domain using a pseudointensity vector," in Proc. EUSIPCO, 2010.
[12] L. McCormack, A. Politis, and V. Pulkki, "Sharpening of Angular Spectra Based on a Directional Re-assignment Approach for Ambisonic Sound-field Visualisation," in Proc. ICASSP, 2019.
[13] B. Rafaely, "Beamforming with Noise Minimization," in Fundamentals of Spherical Array Processing. Springer, 2019.
[14] M. Kronlachner, "Spatial transformations for the alteration of ambisonic recordings," Master's thesis, Graz University of Technology, 2014.
[15] S. Delikaris-Manias and V. Pulkki, "Cross pattern coherence algorithm for spatial filtering applications utilizing microphone arrays," IEEE Transactions on Audio, Speech, and Language Processing, 2013.
[16] L. McCormack and A. Politis, "SPARTA & COMPASS: Real-time implementations of linear and parametric spatial audio reproduction and processing methods," in AES International Conference on Immersive and Interactive Audio, 2019.
[17] S. Bruhn et al., "Standardization of the new 3GPP EVS codec," in Proc. ICASSP, 2015.
[18] P. Mahé, S. Ragot, and S. Marchand, "First-Order Ambisonic Coding with PCA Matrixing and Quaternion-Based Interpolation," in Proc. DAFx, Sep. 2019.
[19] ITU-R Rec. BS.1387-1, "Method for Objective Measurement of Perceived Audio Quality," Nov. 2001.
[20] ITU-T Rec. P.800, "Methods for subjective determination of transmission quality," Aug. 1996.
[21] ITU-R Rec. BS.1534-3, "Method for the subjective assessment of intermediate quality level of coding systems," Oct. 2015.
[22] EBU Tech 3253, "Sound Quality Assessment Material recordings for subjective tests," 2008.