Unsupervised Audio Source Separation Using Generative Priors
INTERSPEECH 2020, October 25–29, 2020, Shanghai, China

Vivek Narayanaswamy (1), Jayaraman J. Thiagarajan (2), Rushil Anirudh (2) and Andreas Spanias (1)
(1) SenSIP Center, School of ECEE, Arizona State University, Tempe, AZ
(2) Lawrence Livermore National Labs, 7000 East Avenue, Livermore, CA
[email protected], [email protected], [email protected], [email protected]

Abstract

State-of-the-art under-determined audio source separation systems rely on supervised end-to-end training of carefully tailored neural network architectures operating either in the time or the spectral domain. However, these methods are severely challenged by the need for expensive source-level labeled data, and they are specific to a given set of sources and mixing process, which demands complete re-training when those assumptions change. This strongly emphasizes the need for unsupervised methods that can leverage recent advances in data-driven modeling and compensate for the lack of labeled data through meaningful priors. To this end, we propose a novel approach for audio source separation based on generative priors trained on individual sources. Through the use of projected gradient descent optimization, our approach simultaneously searches the source-specific latent spaces to effectively recover the constituent sources. Though the generative priors can be defined directly in the time domain, e.g., WaveGAN, we find that using spectral-domain loss functions for our optimization leads to good-quality source estimates. Our empirical studies on standard spoken-digit and instrument datasets clearly demonstrate the effectiveness of our approach over classical as well as state-of-the-art unsupervised baselines.

Index Terms: audio source separation, unsupervised learning, generative priors, projected gradient descent

1. Introduction

Audio source separation, the process of recovering constituent source signals from a given audio mixture, is a key component in downstream applications such as audio enhancement and music information retrieval [1, 2]. Typically formulated as an inverse optimization problem, source separation has traditionally been solved using a broad class of matrix factorization methods [3, 4, 5], e.g., Independent Component Analysis (ICA) and Principal Component Analysis (PCA). While these methods are known to be effective in over-determined scenarios, i.e., when the number of mixture observations is greater than the number of sources, they are severely challenged in under-determined settings [6]. Consequently, in recent years, supervised deep learning based solutions have become popular for under-determined source separation [7, 8, 9, 10, 11, 12]. These approaches can be broadly classified into time-domain and spectral-domain methods, and often produce state-of-the-art performance on standard benchmarks.

Despite their effectiveness, there is a fundamental drawback with supervised methods. In addition to requiring access to a large number of observations, a supervised source separation model is highly specific to the given set of sources and the mixing process, consequently requiring complete re-training when those assumptions change. This motivates a strong need for the next generation of unsupervised separation methods that can leverage the recent advances in data-driven modeling and compensate for the lack of labeled data through meaningful priors.

Utilizing appropriate priors for the unknown sources has been an effective approach to regularize the ill-conditioned nature of source separation. Examples include non-Gaussianity, statistical independence, and sparsity [13]. With the emergence of deep learning methods, it has been shown that the choice of network architecture implicitly induces a structural prior for solving inverse problems [14]. Based on this finding, Tian et al. recently introduced a deep audio prior (DAP) [15] that directly utilizes the structure of a randomly initialized neural network to learn time-frequency masks that isolate the individual components in the mixture audio without any pre-training. Interestingly, DAP was shown to outperform several classical priors. Here, we consider an alternative approach for under-determined source separation based on data priors defined via deep generative models, in particular generative adversarial networks (GANs) [16]. We hypothesize that such a data prior will produce higher-quality source estimates by enforcing the estimated solutions to belong to the data manifold. While GAN priors have been successfully utilized in inverse imaging problems [17, 18, 19, 20] such as denoising, deblurring, compressed recovery, etc., their use in source separation has not been studied yet, particularly in the context of audio.

In this paper, we propose a novel unsupervised approach for source separation that utilizes multiple source-specific priors and employs Projected Gradient Descent (PGD)-style optimization with carefully designed spectral-domain loss functions. Since our approach is an inference-time technique, it is extremely flexible and general, such that it can be used even with a single mixture. We utilize the time-domain WaveGAN [21] model to construct the source-specific priors, and interestingly, we find that using spectral losses for the inversion leads to superior-quality results. Using standard benchmark datasets (spoken digit audio (SC09), drums, and piano), we evaluate the proposed approach under the assumption that the mixing process is known. From our rigorous empirical study, we find that the proposed data prior is consistently superior to other commonly adopted priors, including the recent deep audio prior [15]. The code for our work is publicly available at https://github.com/vivsivaraman/sourcesepganprior.

This work was supported in part by the ASU SenSIP Center, Arizona State University. Portions of this work were performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Copyright © 2020 ISCA. http://dx.doi.org/10.21437/Interspeech.2020-3115

2. Designing Priors for Inverse Problems

Despite the advances in learning methods for audio processing, under-determined source separation remains a critical challenge. Formally, in our setup, the number of mixtures or observations m is much smaller than the number of sources n, i.e., m ≪ n. A common approach to make this ill-defined problem tractable is to place appropriate priors that restrict the solution space. Existing approaches can be broadly classified into the following categories:

(i) Statistical Priors. This includes the class of matrix factorization methods conventionally used in source separation. For example, in ICA we enforce the assumptions of non-Gaussianity as well as statistical independence between the sources. On the other hand, PCA enforces statistical independence between the sources by linear projection onto mutually orthogonal subspaces. Kernel PCA [22] induces the same prior in a reproducing kernel Hilbert space. Another popular approach is non-negative matrix factorization (NMF), which places a non-negativity prior on the estimated basis matrices [23]. Finally, a sparsity (ℓ1) prior [13], placed either in the observed domain or on the expansion coefficients in an appropriate basis set or dictionary, has also been widely adopted to regularize this problem.

(ii) Structural Priors. Recent advances in deep neural network design have shown that certain carefully chosen networks have the innate capability to effectively regularize, or behave as a prior for, ill-posed inverse problems. These networks essentially capture the underlying statistics of data, independent of task-specific training. Such structural priors have produced state-of-the-art performance in inverse imaging problems [14], and recently Tian et al. [15] utilized the structure of a U-Net [24] model to learn time-frequency masks that can isolate the individual components in the mixture audio.

(iii) GAN Priors. A third class of methods relies on priors defined via generative models, e.g., GANs [16]. GANs can learn parameterized non-linear distributions p(X; z) from a sufficient amount of unlabeled data X [21, 25], where z denotes the latent variables of the model. In addition to enabling ready sampling from trained GAN models, they can be leveraged as an effective prior. Popularly referred to as GAN priors, they have been found to be highly effective in challenging inverse problems [19, 20]. In their most general form, when one attempts to recover the original data x from its corrupted (observed) version x̃, one can maximize the posterior distribution p(X = x | x̃, z) by searching in the latent space of a pre-trained GAN.

Figure 1: An overview of the proposed unsupervised source separation system.

3. Approach

Audio source separation involves recovering the constituent sources {s_i ∈ R^d | i = 1, ..., K} from a given audio mixture m ∈ R^d, where K is the total number of sources and d is the number of time steps. In this paper, without loss of generality, we assume the sources and mixtures to be mono-channel and the mixing process to be a sum of sources, i.e., m = \sum_{i=1}^{K} s_i. Figure 1 provides an overview of our proposed approach for unsupervised source separation. Here, we sample the source audio from the respective priors and perform additive mixing to reconstruct the mixture, i.e., m̂ = \sum_{i=1}^{K} G_i(z_i). The mixture is then processed to obtain the corresponding spectrogram. In addition, we also compute the source-level spectrograms. We perform source separation by efficiently searching the latent spaces of the source-specific priors G_i using Projected Gradient Descent, optimizing a spectral-domain loss function L. More formally, for a single mixture m, our objective is given by

    \{z_i^*\}_{i=1}^{K} = \arg\min_{z_1, z_2, \dots, z_K} \mathcal{L}(\hat{m}, m) + \mathcal{R}(\{G_i(z_i)\}),    (1)

where the first term measures the discrepancy between the true and estimated mixtures and the second term is an optional regularizer on the estimated sources. In every PGD iteration, we
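The latent-space search in Eq. (1) can be sketched in a few lines. The sketch below is a toy illustration only: the pretrained WaveGAN generators G_i are replaced by fixed random linear maps (hypothetical stand-ins), the paper's spectral-domain loss is replaced by a time-domain squared error for brevity, and the projection step is assumed to be clipping each z_i back to [-1, 1], as when latents are drawn uniformly from a hypercube.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k_latent, K = 256, 32, 2  # signal length, latent dim, number of sources

# Hypothetical linear stand-ins for the pretrained generators G_i.
A = [rng.standard_normal((d, k_latent)) / np.sqrt(k_latent) for _ in range(K)]
G = lambda i, z: A[i] @ z

# Ground-truth latents and the additive mixture m = sum_i G_i(z_i).
z_true = [rng.uniform(-1, 1, k_latent) for _ in range(K)]
m = sum(G(i, z_true[i]) for i in range(K))

# PGD: gradient step on the mixture discrepancy, then project each z_i
# back onto the latent prior's support via clipping.
z = [np.zeros(k_latent) for _ in range(K)]
lr = 0.02
losses = []
for _ in range(500):
    m_hat = sum(G(i, z[i]) for i in range(K))
    residual = m_hat - m
    losses.append(float(residual @ residual))
    for i in range(K):
        grad = 2.0 * A[i].T @ residual          # dL/dz_i for the linear stand-in
        z[i] = np.clip(z[i] - lr * grad, -1.0, 1.0)  # projection step

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.2e}")
```

With real generators, the analytic gradient would be replaced by automatic differentiation through G_i and the loss would be computed on spectrograms; the structure of the iteration (gradient step followed by projection) is unchanged.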