MUSIC SEPARATION GUIDED BY COVER TRACKS: DESIGNING THE JOINT NMF MODEL

Nathan Souviraà-Labastie*, Emmanuel Vincent†, Frédéric Bimbot‡

* Université de Rennes 1, IRISA - UMR 6074, Campus de Beaulieu, 35042 Rennes cedex, France
† Inria, Centre de Nancy - Grand Est, 54600 Villers-lès-Nancy, France
‡ CNRS, IRISA - UMR 6074, Campus de Beaulieu, 35042 Rennes cedex, France

* Work supported by Maia Studio and a Bretagne Region scholarship.

To cite this version: Nathan Souviraà-Labastie, Emmanuel Vincent, Frédéric Bimbot. Music separation guided by cover tracks: designing the joint NMF model. 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2015, Apr 2015, Brisbane, Australia. HAL Id: hal-01108675, https://hal.archives-ouvertes.fr/hal-01108675, submitted on 23 Jan 2015.

ABSTRACT

In audio source separation, reference guided approaches are a class of methods that use reference signals to guide the separation. In prior work, we proposed a general framework to model the deformation between the sources and the references. In this paper, we investigate a specific scenario within this framework: music separation guided by the multitrack recording of a cover interpretation of the song to be processed. We report a series of experiments highlighting the relevance of joint Non-negative Matrix Factorization (NMF), dictionary transformation, and specific transformation models for different types of sources. A signal-to-distortion ratio improvement (SDRI) of almost 11 decibels (dB) is achieved, improving by 2 dB over a previous study on the same data set. These observations help validate the relevance of the general theoretical framework and can be useful in practice for designing models for other reference guided source separation problems.

Index Terms— Music separation, Joint-NMF, Cover song

1. INTRODUCTION

In signal processing, audio source separation is the task of recovering the different audio sources that compose an observed audio mixture. In the case of music, this task aims to provide signals for each instrument or voice. As original songs are rarely released in multitrack formats, this step is compulsory to open new possibilities in music post-production, e.g., respatialization, upmixing and, more widely, audio editing.

The complexity of the mixing process (not necessarily linear), as well as the fact that there are more sources than input channels, makes the demixing of professionally produced music difficult in the blind case, i.e., without any prior information. Thus, blind separation shows certain limitations for professional music applications that require high audio quality [1]. Many approaches have taken several kinds of additional information into account with the objective of overcoming these limitations [2]. For instance, spatial and spectral information about the sources [3], information about the recording/mixing conditions [4], musical scores [5, 6], or even selection of spectrogram areas [7, 8, 9], potentially in an interactive way [10, 11, 12], have been proposed in the literature. It is also possible to consider reference signals [13] that are similar to the sources to be separated, for instance uttered by a user [14, 15, 16], synthesized from symbolic information [15, 17], or even retrieved from a large set [18, 19].

In this paper, we focus on source separation guided by reference signals, and more precisely music separation guided by cover multitrack songs [20]. A cover song is another performance of an original song. It can differ from the original song in its musical interpretation, and it is potentially performed by a different singer and with different instruments. Multitrack recordings of such covers are more likely to be found on the market than the original multitrack recording and, contrary to expectations, they are (for commercial reasons) musically faithful to the original [20]. Furthermore, most separation algorithms are sensitive to initialization, and using cover multitrack recordings for initialization is an efficient way to sidestep this problem [20]. All these reasons make cover guided music separation a very promising approach for high quality music separation.

In the following, rather than using the cover multitrack recordings for initialization only, we also use them to constrain the power spectrum of each source. In addition, although the considered covers are musically faithful, deformations exist between the original sources and the covers at the signal level. These deformations are significant enough not to be ignored. Here, different configurations of deformations as formalized in [13] are tested. Finally, the optimal deformation model is selected for each type of source (bass, drums, guitar, vocals, ...).

The paper is organized as follows. Section 2 recalls the general model of reference guided source separation proposed in [13]. Section 3 presents a specific model adapted to the task, and its estimation procedure. Section 4 describes the data and the settings of the experiments. Section 5 reports results illustrating the proposed contributions and provides the reader with useful advice on how to design deformation models for reference guided source separation. Section 6 draws conclusions and gives some perspectives.

2. GENERAL FRAMEWORK

In this section, we recall the general framework for multi-channel source separation guided by multiple deformed references [13]. The observations are $M$ multi-channel audio mixtures $x^m(t)$ indexed by $m$ and containing $I^m$ channels. For instance, one mixture is to be separated, and the other mixtures contain the reference signals used to guide the separation process. Each mixture $x^m(t)$ is assumed to be a sum of source spatial images $y_j(t)$ indexed by $j \in J^m$. In the Short-Time Fourier Transform (STFT) domain, this can be written as

$$x^m_{fn} = \sum_{j \in J^m} y_{j,fn} \quad \text{with } x^m_{fn}, y_{j,fn} \in \mathbb{C}^{I^m}, \tag{1}$$

where $f = 1, \dots, F$ and $n = 1, \dots, N$ are respectively the frequency and time indexes of the STFT. We assume that the time-frequency coefficients of the source spatial images $y_{j,fn}$ have a zero-mean Gaussian distribution [3]:

$$y_{j,fn} \sim \mathcal{N}_{\mathbb{C}}(0, v_{j,fn} R_{j,fn}) \tag{2}$$

whose covariance factors into a scalar power spectrum $v_{j,fn} \in \mathbb{R}_+$ and a spatial covariance matrix $R_{j,fn} \in \mathbb{C}^{I^m \times I^m}$. The spatial covariance matrices model the spatial characteristics of the sources, such as phase or intensity differences between channels. The power spectrogram of each source $j$ is denoted $V_j = [v_{j,fn}]_{fn} \in \mathbb{R}_+^{F \times N}$. Each $V_j$ is split into the product of an excitation spectrogram $V_j^e$ and a filter spectrogram $V_j^\phi$. The excitation spectrogram (resp. the filter spectrogram) is decomposed by a Non-negative Matrix Factorization (NMF) into a matrix of spectral patterns $W_j^e \in \mathbb{R}_+^{F \times D^e}$ (resp. $W_j^\phi \in \mathbb{R}_+^{F \times D^\phi}$) and a matrix of temporal activations $H_j^e \in \mathbb{R}_+^{D^e \times N}$ (resp. $H_j^\phi \in \mathbb{R}_+^{D^\phi \times N}$). $D^e$ (resp. $D^\phi$) denotes the number of spectral patterns used in the NMF decomposition of the excitation (resp. filter) part. This results in the following decomposition:

$$V_j = V_j^e \odot V_j^\phi = W_j^e H_j^e \odot W_j^\phi H_j^\phi \tag{3}$$

where $\odot$ denotes point-wise multiplication.

As the different audio mixtures are composed of similar sources, the matrices $W$ and $H$ can be shared (i.e., jointly estimated) between a given source $j \in J^m$ and one or more related sources $j' \in J^{m'}$ with $m' \neq m$. A reference signal is by nature deformed: if there were no deformation, the reference would be equal to the true source. These differences can be taken into account by partly sharing the model parameters (e.g., sharing only $H^e$) and/or by using transformation matrices $T_{jj'}$ (between sources $j$ and $j'$).

$D$ denotes the number of spectral patterns used in the NMF decomposition. Hereafter, we only consider frequency and dictionary transformation matrices, now denoted $T^f_{jj'} \in \mathbb{R}_+^{F \times F}$ and $T^d_{jj'} \in \mathbb{R}_+^{D \times D}$. As the tracks of the two versions are sufficiently aligned in time, we do not consider $T^t$ matrices, which induce unwanted smoothing effects. Thus, the related sources are modeled using equation (6):

$$V_{j'} = T^f_{jj'} W_j T^d_{jj'} H_j \tag{8}$$

It can be noticed that this formulation leaves the possibility to set these transformation matrices either in the reference model ($j \in J^1$ and $j' \in J^{m'}$, $m' \neq 1$) or in the source model ($j' \in J^1$ and $j \in J^{m'}$, $m' \neq 1$). See Tables 3 and 4 in Section 5 for concrete cases. For both $T^f$ and $T^d$ matrices, we consider two possible initializations:

• Diag: identity matrix
• Full: the sum of an identity matrix and a random matrix drawn from a rectified Gaussian distribution

3.2. Parameter estimation

Here, we present a method for parameter estimation in the maximum likelihood (ML) sense. In the single-channel case, maximizing the log-likelihood is equivalent to minimizing the Itakura-Saito divergence [21]:

$$\hat{\theta} = \underset{\theta}{\arg\min} \sum_{m=1}^{M} \sum_{f,n=1}^{F,N} d_{IS}\left(X^m_{fn} \mid V^m_{fn}\right) \tag{9}$$
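To make the model concrete, the following sketch builds the excitation-filter decomposition of Eq. (3) and the transformed spectrogram of a related source as in Eq. (8), using NumPy. The array sizes, the random factors, and the choice of Diag initialization for $T^f$ and Full initialization for $T^d$ are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
F, N = 128, 64        # illustrative STFT sizes (not the paper's settings)
D_e, D_phi = 10, 4    # numbers of excitation / filter spectral patterns

# Excitation-filter decomposition of one source j, Eq. (3):
W_e, H_e = rng.random((F, D_e)), rng.random((D_e, N))
W_phi, H_phi = rng.random((F, D_phi)), rng.random((D_phi, N))
V_j = (W_e @ H_e) * (W_phi @ H_phi)   # point-wise product of the two parts

# Related source j' via frequency and dictionary transformations, Eq. (8).
# T^f uses the "Diag" initialization (identity); T^d uses the "Full" one
# (identity plus a matrix drawn from a rectified Gaussian):
T_f = np.eye(F)
T_d = np.eye(D_e) + np.maximum(rng.standard_normal((D_e, D_e)), 0.0)
V_jp = T_f @ W_e @ T_d @ H_e          # shared W and H, transformed spectrogram
```

Note that all factors are non-negative by construction, so both spectrogram models stay in $\mathbb{R}_+^{F \times N}$, as the framework requires.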
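The objective in Eq. (9) is typically minimized with multiplicative updates. As a minimal single-channel, single-source sketch, the code below runs standard Itakura-Saito NMF on a toy random spectrogram; the update rules are the usual ones for the IS cost, not the paper's full joint estimation procedure, and the sizes and data are assumptions for illustration.

```python
import numpy as np

def d_is(X, V):
    """Itakura-Saito divergence summed over all time-frequency bins (Eq. (9))."""
    R = X / V
    return float(np.sum(R - np.log(R) - 1.0))

def is_nmf(X, D=8, n_iter=200, seed=0):
    """Plain multiplicative-update IS-NMF fitting X with W @ H."""
    rng = np.random.default_rng(seed)
    F, N = X.shape
    W = rng.random((F, D)) + 1.0
    H = rng.random((D, N)) + 1.0
    losses = []
    for _ in range(n_iter):
        V = W @ H
        losses.append(d_is(X, V))
        W *= ((X * V**-2) @ H.T) / (V**-1 @ H.T)    # update spectral patterns
        V = W @ H
        H *= (W.T @ (X * V**-2)) / (W.T @ V**-1)    # update temporal activations
    return W, H, losses

X = np.random.default_rng(1).random((32, 48)) + 0.1  # toy strictly positive spectrogram
W, H, losses = is_nmf(X)
```

In the joint model, the same kind of updates are derived for the shared $W$, $H$ and the transformation matrices $T^f$, $T^d$, with the divergence summed over all mixtures as in Eq. (9).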