Source Separation with Deep Generative Priors

Vivek Jayaram * John Thickstun *

Abstract

Despite substantial progress in signal source separation, results for richly structured data continue to contain perceptible artifacts. In contrast, recent deep generative models can produce authentic samples in a variety of domains that are indistinguishable from samples of the data distribution. This paper introduces a Bayesian approach to source separation that uses generative models as priors over the components of a mixture of sources, and noise-annealed Langevin dynamics to sample from the posterior distribution of sources given a mixture. This decouples the source separation problem from generative modeling, enabling us to directly use cutting-edge generative models as priors. The method achieves state-of-the-art performance for MNIST digit separation. We introduce new methodology for evaluating separation quality on richer datasets, providing quantitative evaluation of separation results on CIFAR-10. We also provide qualitative results on LSUN.

* Equal contribution. Paul G. Allen School of Computer Science and Engineering, University of Washington. Correspondence to: Vivek Jayaram, John Thickstun.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

1. Introduction

The single-channel source separation problem (Davies & James, 2007) asks us to decompose a mixed signal m ∈ X into a linear combination of k components x_1, ..., x_k ∈ X with scalar mixing coefficients α_i ∈ R:

    m = g(x) ≡ Σ_{i=1}^{k} α_i x_i.    (1)

This is motivated by, for example, the "cocktail party problem" of isolating the utterances of individual speakers x_i from an audio mixture m captured at a busy party, where multiple speakers are talking simultaneously.

With no further constraints or regularization, solving Equation (1) for x is highly underdetermined. Classical "blind" approaches to single-channel source separation resolve this ambiguity by privileging solutions to (1) that satisfy mathematical constraints on the components x, such as statistical independence (Davies & James, 2007), sparsity (Lee et al., 1999), or non-negativity (Lee & Seung, 1999). These constraints can be viewed as weak priors on the structure of sources, but the approaches are blind in the sense that they do not require adaptation to a particular dataset.

Recently, most works have taken a data-driven approach. To separate a mixture of sources, it is natural to suppose that we have access to samples x of individual sources, which can be used as a reference for what the source components of a mixture are supposed to look like. This data can be used to regularize solutions of Equation (1) towards structurally plausible solutions. The prevailing way to do this is to construct a supervised regression model that maps an input mixture m to components x_i (Huang et al., 2014; Halperin et al., 2019). Paired training data (m, x) can be constructed by summing randomly chosen samples from the component distributions x_i and labeling these mixtures with the ground truth components.

Instead of regressing against components x, we use samples to train a generative prior p(x); we separate a mixed signal m by sampling from the posterior distribution p(x|m). For some mixtures this posterior is quite peaked, and sampling from p(x|m) recovers the only plausible separation of m into likely components. But in many cases, mixtures are highly ambiguous: see, for example, the orange-highlighted MNIST images in Figure 1. This motivates our interest in sampling, which explores the space of plausible separations.

In Section 3 we introduce a procedure for sampling from the posterior, an extension of the noise-annealed Langevin dynamics introduced in Song & Ermon (2019), which we call Bayesian Annealed SIgnal Source separation: "BASIS" separation.

Ambiguous mixtures pose a challenge for traditional source separation metrics, which presume that the original mixture components are identifiable and compare the separated components to ground truth. For ambiguous mixtures of rich data, we argue that recovery of the original mixture components is not a well-posed problem. Instead, the problem we aim to solve is finding components of a mixture that are consistent with a particular data distribution. Motivated by this perspective, we discuss evaluation metrics in Section 4.
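To make the mixture model of Equation (1) concrete, the following sketch (ours, not part of the paper's released code) forms a two-component image mixture with α_1 = α_2 = 0.5, as in the equal-mixture experiments of Section 5:

    import numpy as np

    def mix(components, alphas):
        """Form the linear mixture m = sum_i alpha_i * x_i of Equation (1)."""
        components = np.stack(components)            # shape: (k, H, W, C)
        alphas = np.asarray(alphas).reshape(-1, 1, 1, 1)
        return (alphas * components).sum(axis=0)     # shape: (H, W, C)

    # Example: an equal mixture of two 32x32 color images with values in [0, 1].
    x1, x2 = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
    m = mix([x1, x2], alphas=[0.5, 0.5])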

[Figure 1 image panels: Mixture (Input); Separated Images; Original Images.]

Figure 1. Separation results for mixtures of four images from the MNIST dataset (Left) and two images from the CIFAR-10 dataset (Right), using BASIS with the NCSN (Song & Ermon, 2019) generative model as a prior over images. We draw attention to the central panel of the MNIST results (highlighted in orange), which shows how a mixture can be separated in multiple ways.

Formulating the source separation problem in a Bayesian framework decouples the problem of source generation from source separation. This allows us to leverage pre-trained, state-of-the-art, likelihood-based generative models as prior distributions, without requiring architectural modifications to adapt these models for source separation. Examples of source separation using noise-conditioned score networks (NCSN) (Song & Ermon, 2019) as a prior are presented in Figure 1. Further separation results using NCSN and Glow (Kingma & Dhariwal, 2018) are presented in Section 5.

2. Related Work

Blind separation. Work on blind source separation is data-agnostic, relying on generic mathematical properties to privilege particular solutions to (1) (Comon, 1994; Bell & Sejnowski, 1995; Davies & James, 2007; Huang et al., 2012). Because blind methods have no access to sample components, they face the challenging task of modeling the distribution over unobserved components while simultaneously decomposing mixtures into likely components. It is difficult to fit a rich model to latent components, so blind methods often rely on simple models such as dictionaries to capture the structure of these components.

One promising recent work in the blind setting is Double-DIP (Gandelsman et al., 2019). This work leverages the unsupervised Deep Image Prior (Ulyanov et al., 2018) as a prior over signal components, similar to our use of a trained generative model. But the authors of this work document fundamental obstructions to applying their method to single-channel source separation; they propose using multiple image frames from a video, or multiple mixtures of the same components with different mixing coefficients α. This multiple-mixture approach is common to much of the work on blind separation. In contrast, our approach is able to separate components from a single mixture.

Supervised regression. Regression models for source separation learn to predict components for a mixture using a dataset of mixed signals labeled with ground truth components. This approach has been extensively studied for separation of images (Halperin et al., 2019), audio spectrograms (Huang et al., 2014; 2015; Nugraha et al., 2016; Jansson et al., 2017), and raw audio (Lluis et al., 2019; Stoller et al., 2018b; Défossez et al., 2019), as well as more exotic data domains, e.g. medical imaging (Nishida et al., 1999). By learning to predict components (or equivalently, masks on a mixture) this approach implicitly builds a generative model of the signal components. This connection is made more explicit in recent work that uses GANs to force components emitted by a regression model to match the distribution of a given dataset (Zhang et al., 2018; Stoller et al., 2018a).

The supervised approach takes advantage of expressive deep models to capture a strong prior over signal components. But it requires specialized model architectures trained specifically for the source separation task. In contrast, our approach leverages standard, pre-trained generative models for source separation. Furthermore, our approach can directly exploit ongoing advances in likelihood-based generative modeling to improve separation results.

Signal dictionaries. Much work on source separation is based on the concept of a signal dictionary, most notably the line of work based on non-negative matrix factorization (NMF) (Lee & Seung, 2001). These approaches model signals as combinations of elements in a latent dictionary. Decomposing a mixture into dictionary elements can be used for source separation by (1) clustering the elements of the dictionary and (2) reconstituting a source using elements of the decomposition associated with a particular cluster. Dictionaries are typically learned from data of each source type and combined into a joint dictionary, clustered by source type (Schmidt & Olsson, 2006; Virtanen, 2007). The blind setting has also been explored, where the clustering is obtained without labels by e.g. k-means (Spiertz & Gnann, 2009). Recent work explores more expressive decomposition models, replacing the linear decompositions used in NMF with expressive neural networks (Smaragdis & Venkataramani, 2017; Venkataramani et al., 2017).

When the dictionary is learned with supervision from labeled sources, dictionary clusters can be interpreted as implicit priors on the distributions over components. Our approach makes these priors explicit, and works with generic priors that are not tied to the dictionary model. Furthermore, our method can separate mixed sources of the same type, whereas mixtures of sources with similar structure present a conceptual difficulty for dictionary-based methods.

Generative adversarial separation. Recent work by Subakan & Smaragdis (2018) and Kong et al. (2019) explores the intriguing possibility of optimizing x given a mixture m to satisfy (1), where components x_i are constrained to the manifold learned by a GAN. The GAN is pre-trained to model a distribution over components. Like our method, this approach leverages modern deep generative models in a way that decouples generation from source separation. We view this work as a natural analog to our likelihood-based approach in the GAN setting.
Likelihood-based approaches. Our approach is similar in spirit to older ideas based on maximum a posteriori estimation (Geman & Geman, 1984), likelihood maximization (Pearlmutter & Parra, 1997; Roweis, 2001), and Bayesian source separation (Benaroya et al., 2005). We build upon their insights, with the advantage of increased computational resources and modern expressive generative models.

3. BASIS Separation

We consider the following generative model of a mixed signal m, relaxing the mixture constraint g(x) = m to a soft Gaussian approximation:

    x ∼ p,    (2)
    m ∼ N(g(x), γ²I).    (3)

This defines a joint distribution p_γ(x, m) = p(x)p_γ(m|x) over signal components x and mixtures m, and a corresponding posterior distribution

    p_γ(x|m) = p(x)p_γ(m|x) / p_γ(m).    (4)

In the limit as γ² → 0, we recover the hard constraint on the mixture given by Equation (1).

BASIS separation (Algorithm 1) presents an approach to sampling from (4) based on the discussion in Sections 3.1 and 3.2. In Section 3.3 we discuss the behavior of the gradients ∇_x log p(x), which motivates some of the hyper-parameter choices in Section 3.4. We describe a procedure to construct the noisy models p_{σ_i} required for BASIS in Section 3.5.

Algorithm 1 BASIS Separation
    Input: m ∈ X, {σ_i}_{i=1}^{L}, δ, T
    Sample x_1, ..., x_k ∼ Uniform(X)
    for i ← 1 to L do
        η_i ← δ · σ_i² / σ_L²
        for t ← 1 to T do
            Sample ε_t ∼ N(0, I)
            u^(t) ← x^(t) + η_i ∇_x log p_{σ_i}(x^(t)) + √(2η_i) ε_t
            x^(t+1) ← u^(t) + (η_i / σ_i²) Diag(α) (m − g(x^(t)))
        end for
    end for

3.1. Langevin dynamics

Sampling from the posterior distribution p_γ(x|m) looks formidable; just computing Equation (4) requires evaluation of the partition function p_γ(m). But using Langevin dynamics (Neal et al., 2011; Welling & Teh, 2011) we can sample x ∼ p_γ(·|m) while avoiding explicit computation of p_γ(x|m). Let x^(0) ∼ Uniform(X), ε_t ∼ N(0, I), and define a sequence

    x^(t+1) ≡ x^(t) + η ∇_x log p_γ(x^(t)|m) + √(2η) ε_t    (5)
            = x^(t) + η ∇_x [ log p(x^(t)) − (1/(2γ²)) ‖m − g(x^(t))‖² ] + √(2η) ε_t.

Observe that ∇_x log p_γ(m) = 0, so this term is not required to compute (5). By standard analysis of Langevin dynamics, as the step size η → 0, lim_{t→∞} D_KL(x^(t) ‖ x|m) = 0, under regularity conditions on the distribution p_γ(x|m).
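To connect (5) with the mixture-correction step in Algorithm 1, it helps to write out the likelihood gradient explicitly; this short derivation is ours, but it follows directly from Equations (1) and (3):

    \log p_\gamma(m \mid x) = -\frac{1}{2\gamma^2}\Big\|m - \sum_{i=1}^{k} \alpha_i x_i\Big\|^2 + \mathrm{const},
    \qquad
    \nabla_{x_i} \log p_\gamma(m \mid x) = \frac{\alpha_i}{\gamma^2}\big(m - g(x)\big).

Stacking these k per-component gradients yields the (η_i/σ_i²) Diag(α)(m − g(x)) correction in Algorithm 1, which instantiates γ² = σ_i² following Section 3.4.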

If the prior p(x) is parameterized by a neural model, then gradients ∇_x log p(x) can be computed by automatic differentiation with respect to the inputs of the generator network. This family of likelihood-based models includes autoregressive models (Salimans et al., 2017; Parmar et al., 2018), the variational autoencoder (Kingma & Welling, 2014; van den Oord et al., 2017), or flow-based models (Dinh et al., 2017; Kingma & Dhariwal, 2018). Alternatively, if gradients of the distribution are modeled (Song & Ermon, 2019), then ∇_x log p(x) can be used directly.
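As an illustration of the autodiff route, the sketch below (ours, assuming a hypothetical flow-style model object with a log_prob method, as in typical PyTorch implementations of Glow; this is not the interface of any particular pre-trained release) computes ∇_x log p(x) with respect to the input pixels:

    import torch

    def score(model, x):
        """Compute the prior score grad_x log p(x) by automatic differentiation.

        `model` is assumed to expose log_prob(x), returning per-example
        log-densities; this is a sketch, not a released API.
        """
        x = x.detach().requires_grad_(True)
        log_px = model.log_prob(x).sum()
        (grad,) = torch.autograd.grad(log_px, x)
        return grad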


[Figure 2: plot of σ × ‖∇_x log p_σ(x)‖ ("Noise x Gradient", y-axis) against the standard deviation of the noise σ ∈ {0.01, 0.016, 0.027, 0.046, 0.077, 0.129, 0.215, 0.359, 0.599, 1.0} (x-axis).]

Figure 2. The behavior of σ × ‖∇_x log p_σ(x)‖ in expectation for the NCSN (orange) and Glow (blue) models trained on CIFAR-10 at each of 10 noise levels as σ decays geometrically from 1.0 to 0.01. For large σ, ‖∇_x log p_σ(x)‖ ≈ 50/σ. This proportional relationship breaks down for smaller σ. Because the expected gradient of the noiseless density log p(x) is finite, its product with σ must asymptotically approach zero as σ → 0.

3.2. Accelerated mixing

To accelerate mixing of (5) we adopt a simulated annealing schedule over noisy approximations to the model p(x), extending the unconditional sampling algorithm proposed in Song & Ermon (2019) to accelerate sampling from the posterior distribution p_γ(x|m). Let p_σ(x) denote the distribution of x + ε for x ∼ p and ε ∼ N(0, σ²I). We define the noisy joint likelihood p_{σ,γ}(x, m) ≡ p_σ(x)p_γ(m|x), which induces a noisy posterior approximation p_{σ,γ}(x|m). At high noise levels σ, p_σ(x) is approximately Gaussian and irreducible, so the Langevin dynamics (5) will mix quickly. And as σ → 0, D_KL(p_σ ‖ p) → 0. This motivates defining the modified Langevin dynamics

    x^(t+1) ≡ x^(t) + η ∇_x log p_{σ,γ}(x^(t)|m) + √(2η) ε_t.    (6)

The dynamics (6) approximate samples from p(x | g(x) = m) as η → 0, γ² → 0, σ² → 0, and t → ∞. An implementation of these dynamics, annealing η, γ², and σ² as t → ∞ according to the hyper-parameter settings presented in Section 3.4, is presented in Algorithm 1.

We anneal η, γ², and σ² using a heuristic introduced in Song & Ermon (2019): the idea is to maintain a constant signal-to-noise ratio (SNR) between the expected size of the posterior log-likelihood gradient term η ∇_x log p_{σ,γ}(x|m) and the expected size of the Langevin noise √(2η) ε:

    E_{x∼p_σ}[ ‖ η ∇_x log p_{σ,γ}(x|m) / √(2η) ‖² ]
        = (η/4) E_{x∼p_σ}[ ‖ ∇_x log p_γ(m|x) + ∇_x log p_σ(x) ‖² ].    (7)

Assuming that gradients w.r.t. the likelihood and the prior are uncorrelated, the SNR is approximately

    (η/4) E_{x∼p_σ}[ ‖∇_x log p_γ(m|x)‖² ] + (η/4) E_{x∼p_σ}[ ‖∇_x log p_σ(x)‖² ].    (8)

Observe that log p_γ(m|x) is a concave quadratic with smoothness proportional to 1/γ²; it follows analytically that E[‖∇_x log p_γ(m|x)‖²] ∝ 1/γ². Song & Ermon (2019) found empirically that E[‖∇_x log p_σ(x)‖²] ∝ 1/σ² for the NCSN model; we observe similar behavior for the flow-based Glow model (Kingma & Dhariwal, 2018), and in Section 3.3 we propose a possible explanation for this behavior. Therefore, to maintain a constant SNR, it suffices to set both γ² and σ² proportional to η.

3.3. The gradients of the noisy prior

We remark that the empirical finding E[‖∇_x log p_σ(x)‖²] ∝ 1/σ² discussed in Section 3.2, and the consistency of this observation across models and datasets, could be surprising. Gradients of the noisy densities p_σ can be described by convolution of p with a Gaussian kernel:

    ∇_x log p_σ(x) = ∇_x log E_{ε∼N(0,I)}[ p(x − σε) ].    (9)

From this expression, assuming p is continuous, we clearly see that the gradients are asymptotically independent of σ:

    lim_{σ→0} ∇_x log p_σ(x) = ∇_x log p(x).    (10)

Maintaining proportionality E[‖∇_x log p_σ(x)‖²] ∝ 1/σ² requires the gradients to grow unbounded as σ → 0, but the gradients of the noiseless distribution log p(x) are finite. Therefore, proportionality must break down asymptotically and we conclude that, even though we turn the noise σ down to visually imperceptible levels, we have not reached the asymptotic regime.

We conjecture that the proportionality between the gradients and the noise is a consequence of severe non-smoothness in the noiseless model p(x). The probability mass of this distribution is peaked around plausible images x, and decays rapidly away from these points in most directions. Consider the extreme case where the prior has a Dirac delta point mass. The convolution of a Dirac delta with a Gaussian is itself Gaussian so, near the point mass, the noisy distribution p_σ will be proportional to a Gaussian density with variance σ². If p_σ were exactly Gaussian then analytically

    E_{x∼p_σ}[ ‖∇_x log p_σ(x)‖² ] = (1/σ⁴) E_{x∼p_σ}[ ‖x‖² ] = 1/σ².    (11)

Because the distribution p(x) does not contain actual delta spikes, only approximations thereof, we would expect this proportionality to eventually break down as σ → 0. Indeed, Figure 2 shows that for both NCSN and Glow models of CIFAR-10, after maintaining a very consistent proportionality E[‖∇_x log p_σ(x)‖²] ∝ 1/σ² at the higher noise levels, the decay of σ² to zero eventually outpaces the growth of the gradients.
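The Gaussian calculation in (11) is easy to check numerically; the following sketch (ours) estimates the expected squared score norm for a pure Gaussian p_σ = N(0, σ²I) and confirms the 1/σ² growth per dimension:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 3072  # CIFAR-10 dimensionality (32 x 32 x 3)

    for sigma in [1.0, 0.1, 0.01]:
        x = sigma * rng.standard_normal((10000, d))
        # For N(0, sigma^2 I), the score is -x / sigma^2.
        score_sq = (x / sigma**2) ** 2
        print(sigma, score_sq.sum(axis=1).mean() / d)  # approx. 1 / sigma^2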

[Figure 3 image panels: Mixture; Simple Gradient Ascent; Gradient Ascent + Noise Conditioning; Langevin Dynamics + Noise Conditioning; Original Images.]

Figure 3. Non-stochastic gradient ascent produces sub-par results. Annealing over smoothed-out distributions (Noise Conditioning) guides the optimization towards likely regions of pixel space, but gets stuck at sub-optimal solutions. Adding noise to the gradients (Langevin dynamics) shakes the optimization trajectory out of bad local optima.

3.4. Hyper-parameter settings

We adopt the hyper-parameters proposed by Song & Ermon (2019) for annealing σ², the proportionality constant δ, and the iteration count T. The noise σ is geometrically annealed from σ_1 = 1.0 to σ_L = 0.01 with L = 10. We set δ = 2 × 10⁻⁵ and T = 100. We find that the same proportionality constant between σ² and η also works well for γ² and η, allowing us to set γ² = σ². We use these hyper-parameters for both the NCSN and Glow models, applied to each of the three datasets MNIST, CIFAR-10, and LSUN.

3.5. Constructing noise-conditioned models

For noise-conditioned score networks, we can directly compute ∇_x log p_σ(x) by evaluating the score network at the desired noise level. For generative flow models like Glow, these noisy distributions are not directly accessible. We could estimate the distributions p_σ(x) by training Glow from scratch on datasets perturbed by each of the required noise levels σ². But this is not practical; Glow is expensive to train, requiring thousands of epochs to converge and consuming hundreds of GPU-hours to obtain good models even for small low-resolution datasets.

Instead of training models p_σ(x) from scratch, we apply the concept of fine-tuning from transfer learning (Yosinski et al., 2014). Using pre-trained models of p(x) published by the Glow authors, we fine-tune these models on noise-perturbed data x + ε, where ε ∼ N(0, σ²I). Empirically, this procedure quickly converges to an estimate of p_σ(x), within about 10 epochs.
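Putting Sections 3.1-3.5 together, the sketch below is our own condensed rendering of Algorithm 1 with the hyper-parameters of Section 3.4 (L = 10 geometric noise levels from 1.0 to 0.01, T = 100 steps per level, δ = 2 × 10⁻⁵, γ² = σ²). Here score_fn(x, sigma) stands in for either an NCSN evaluation or the gradient of a fine-tuned Glow model; it is an assumed interface, not the released code:

    import torch

    def basis_separate(m, alphas, score_fn, k=2, L=10, T=100, delta=2e-5):
        """Sample mixture components with noise-annealed Langevin dynamics.

        m:        mixture tensor of shape (C, H, W)
        alphas:   length-k mixing coefficients (Equation 1)
        score_fn: callable (x, sigma) -> grad_x log p_sigma(x),
                  an assumed interface for the prior model
        """
        sigmas = torch.logspace(0, -2, L)            # geometric: 1.0 -> 0.01
        x = torch.rand(k, *m.shape)                  # x ~ Uniform(X)
        alphas = torch.as_tensor(alphas).view(k, 1, 1, 1)
        for sigma in sigmas:
            eta = delta * (sigma / sigmas[-1]) ** 2  # eta_i = delta * sigma_i^2 / sigma_L^2
            for _ in range(T):
                eps = torch.randn_like(x)
                u = x + eta * score_fn(x, sigma) + (2 * eta) ** 0.5 * eps
                residual = m - (alphas * x).sum(dim=0)   # m - g(x)
                x = u + (eta / sigma ** 2) * alphas * residual
        return x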

3.6. The importance of stochasticity

We remark that adding Gaussian noise to the gradients in the BASIS algorithm is essential. If we set aside the Bayesian perspective, it is tempting to simply run gradient ascent on the pixels of the components to maximize the likelihood of these components under the prior, with a Lagrangian term to enforce the mixture constraint g(x) = m:

    x ← x + η ∇_x [ log p(x) − λ ‖g(x) − m‖² ].    (12)

But this does not work. As demonstrated in Figure 3, there are many local optima in the loss surface of p(x) and a greedy ascent procedure simply gets stuck. Pragmatically, the noise term in Langevin dynamics can be seen as a way to knock the greedy optimization (12) out of local maxima.

In the recent literature, pixel-space optimizations by following gradients ∇_x of some objective are perhaps associated more with adversarial examples than with desirable results (Goodfellow et al., 2015; Nguyen et al., 2015). We note that there have been some successes of pixel-wise optimization in texture synthesis (Gatys et al., 2015) and style transfer (Gatys et al., 2016). But broadly speaking, pixel-space optimization procedures often seem to go wrong. We speculate that noisy optimizations (6) on smoothed-out objectives like p_σ could be a widely applicable method for making pixel-space optimizations more robust.

4. Evaluation Methodology

Many previous works on source separation evaluate their results using peak signal-to-noise ratio (PSNR) or structural similarity index (SSIM) (Wang et al., 2004). These metrics assume that the original sources are identifiable; in probabilistic terms, the true posterior distribution p(x|m) is presumed to have a unique global maximum achieved by the ground truth sources (up to permutation of the sources). Under the identifiability assumption, it is reasonable to measure the quality of a separation algorithm by comparing separated sources to ground truth mixture components. PSNR, for example, evaluates separations by computing the mean-squared distance between pixel values of the ground truth and separated sources on a logarithmic scale.

For CIFAR-10 source separation, the ground truth source components of a mixture are not identifiable. As evidence for this claim, we call the reader's attention to Figure 4. For each mixture depicted in Figure 4, we present separation results that sum to the mixture and (to our eyes) look plausibly like CIFAR-10 images. However, in each case the separated images exhibit high deviation from the ground truth. This phenomenon is not unusual; Figure 5 shows an un-curated collection of samples from p(x|m) using BASIS, illustrating a variety of plausible separation results for each given mixture. We will later see evidence again of non-identifiability in Figure 7. If we accept that the separations presented in Figures 4, 5, and 7 are reasonable, then source separation on this dataset is fundamentally underdetermined; we cannot measure success using metrics like PSNR that compare separation results to ground truth.

[Figure 4 image panels: Color Ambiguities and Structure Ambiguities, each with columns Mix, Sep, Orig.]

Figure 4. A curated collection of examples demonstrating color and structural ambiguities in CIFAR-10 mixtures. In each case, the original components differ substantially from the components separated by BASIS using NCSN as a prior. But in each case, the separation results also look like plausible CIFAR-10 images.

Instead of comparing separations to ground truth, we propose to quantify the extent to which the results of a source separation algorithm look like samples from the data distribution. If a pair of images sum to the given mixture and look like samples from the data distribution, we deem the separation to be a success. This shift in perspective from identifiability of the latent components to the quality of the separated components is analogous to the classical distinction in the statistical literature between estimation and prediction (Shmueli et al., 2010; Bellec et al., 2018).

To this end, we borrow the Inception Score (IS) (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017) metrics from the generative modeling literature to evaluate CIFAR-10 separation results. These metrics attempt to quantify the similarity between two distributions given samples. We use them to compare the distribution of components produced by a separation algorithm to the distribution of ground truth images.

In contrast to CIFAR-10, the posterior distribution p(x|m) for an MNIST model is demonstrably peaked. Moreover, BASIS is able to consistently identify these peaks. This constitutes a constructive proof that components of MNIST mixtures are identifiable, and therefore comparisons to the ground-truth components make sense. We report PSNR results for MNIST, which allows us to compare the results of BASIS to other recent work on MNIST image separation (Halperin et al., 2019; Kong et al., 2019).
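Concretely, the PSNR comparison used for MNIST can be computed as below (a standard definition, sketched by us for images scaled to [0, 1]):

    import numpy as np

    def psnr(x_sep, x_true, peak=1.0):
        """Peak signal-to-noise ratio: mean-squared error on a log scale."""
        mse = np.mean((x_sep - x_true) ** 2)
        return 10 * np.log10(peak ** 2 / mse)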
5. Experiments

We evaluate results of BASIS on 3 datasets: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009), and LSUN (Yu et al., 2015). For MNIST and CIFAR-10, we consider both NCSN (Song & Ermon, 2019) and Glow (Kingma & Dhariwal, 2018) models as priors, using pre-trained weights published by the authors of these models. For LSUN there is no pre-trained NCSN model, so we consider results only with Glow. For Glow, we fine-tune the weights of the pre-trained models to construct noisy models p_σ using the procedure described in Section 3.5. Code and instructions for reproducing these experiments are available online.¹

Baselines. On MNIST we compare to results reported for the GAN-based "S-D" method (Kong et al., 2019) and the fully supervised version of Neural Egg separation "NES" (Halperin et al., 2019). Results for MNIST are presented in Section 5.1. To the best of our knowledge there are no previously reported quantitative metrics for CIFAR-10 separation, so as a baseline we ran Neural Egg separation on CIFAR-10 using the authors' published code. CIFAR-10 results are presented in Section 5.2. We present additional qualitative results for 64 × 64 LSUN in Section 5.3, which demonstrate that BASIS scales to larger images.

We also consider results for a simple baseline, "Average," that separates a mixture m into two 50% masks x_1 = x_2 = m/2. This is a surprisingly competitive baseline. Observe that if we had no prior information about the distribution of components, and we measure separation quality by PSNR, then by a symmetry argument setting x_1 = x_2 is the optimal separation strategy in expectation. In principle we would expect Average to perform very poorly under IS/FID, because these metrics purport to measure similarity of distributions and mixtures should have little or no support under the data distribution. But we find that IS and FID both assign reasonably good scores to Average, presumably because mixtures exhibit many features that are well supported by the data distribution. This speaks to well-known difficulties in evaluating generative models (Theis et al., 2016) and could explain the strength of "Average" as a baseline.

We remark that we cannot compare our algorithm to the separation-like task reported for CapsuleNets (Sabour et al., 2017). The segmentation task discussed in that work is similar to source separation, but the mixtures used for the segmentation task are constructed using the non-linear threshold function h(x) = max(x_1 + x_2, 1), in contrast to our linear function g. While extending the techniques of this paper to non-linear relationships between x and m is intriguing, we leave this to future work.

¹ https://github.com/jthickstun/basis-separation

Class conditional separation. The Neural Egg separation algorithm is designed with the assumption that the components x_i are drawn from different distributions. For quantitative results on MNIST and CIFAR-10, we therefore consider two slightly different tasks. The first is class-agnostic, where we construct mixtures by summing randomly selected images from the test set. The second is class-conditional, where we partition the test set into two groupings: digits 0-4 and 5-9 for MNIST, animals and machines for CIFAR-10. The former task allows us to compare to S-D results on MNIST, and the latter task allows us to compare to Neural Egg separation on MNIST and CIFAR-10.

There are two different ways to apply a prior for class-conditional separation. First observe that, because x_1 and x_2 are chosen independently,

    p(x) = p(x_1, x_2) = p_1(x_1) p_2(x_2).    (13)

In the class agnostic setting, x_1 and x_2 are drawn from the same distribution (the empirical distribution of the test set) so it makes sense to use a single prior p = p_1 = p_2. In the class conditional setting, we could potentially use separate priors over components x_1 and x_2. For the MNIST and CIFAR-10 experiments in this paper, we use pre-trained models trained on the unconditional distribution of the training data for both the class agnostic and class conditional settings. It is possible that better results could be achieved in the class conditional setting by re-training the models on class conditional training data. For LSUN, the authors of Glow provide separate pre-trained models for the Church and Bedroom categories, so we are able to demonstrate class-conditional LSUN separations using distinct priors in Section 5.3.

[Figure 5 image panels: Orig; Mix; Resampled Separations.]

Figure 5. Repeated sampling using BASIS with NCSN as a prior for several mixtures of CIFAR-10 images. While most separations look reasonable, variation in color and lighting makes comparative metrics like PSNR unreliable. This challenges the notion that the ground truth components are identifiable.

Sample likelihoods. Although we do not directly model the posterior likelihood p(x|m), we can compute the log-likelihood of the output samples x. The log-likelihood is a function of the artificial variance hyper-parameter γ, so it is more informative to look at the unweighted square error ‖m − g(x)‖²; this quantity can be interpreted as a reconstruction error, and measures how well we approximate the hard mixture constraint. Because we geometrically anneal the variance γ, by the end of optimization the mixture constraint is rigorously enforced; per-pixel reconstruction error is smaller than the quantization level of 8-bit color, resulting in pixel-perfect visual reconstructions.

For Glow, we can also compute the log-probability of samples under the prior. How do the probabilities of sources x_BASIS constructed by BASIS separation compare to the probabilities of data x_test taken directly from a dataset's test set? Because we anneal the noise to a fixed level σ_L > 0, we find it most informative to ask this question using the minimal-noise, fine-tuned prior p_{σ_L}(x). As seen in Table 1, the outputs of BASIS separation are generally comparable in log-likelihood to test set images; BASIS separation recovers sources deemed typical by the prior.
Table 1. The mean log-likelihood under the minimal-noise Glow prior p_{σ_L}(x) for the test set x_test, and for samples of 100 BASIS separations x_BASIS. The log-likelihood of each test set under the noiseless prior p(x_test) is reported for reference.

    Dataset       p(x_test)    p_{σ_L}(x_test)    p_{σ_L}(x_BASIS)
    MNIST         0.5          3.6                3.6
    CIFAR-10      3.4          4.5                4.7
    LSUN (bed)    2.4          4.2                4.4
    LSUN (crh)    2.7          4.4                4.4
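The reconstruction-error claim above is easy to verify on the output of a run; this check (ours) confirms that the per-pixel deviation of a separation from the hard constraint falls below the 8-bit quantization level of 1/255:

    import numpy as np

    def mixture_residual(m, components, alphas):
        """Max per-pixel deviation from the hard constraint g(x) = m."""
        g = sum(a * x for a, x in zip(alphas, components))
        return np.abs(m - g).max()

    # A separation is visually pixel-perfect when the residual is below the
    # 8-bit quantization level (with pixel values scaled to [0, 1]):
    # assert mixture_residual(m, [x1, x2], [0.5, 0.5]) < 1 / 255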

Without any modification, we can apply BASIS to separate mixtures of k > 2 images. We contrast this with regression-based methods, which require re-training to target varying numbers of components. Figure 1 shows the results of BASIS using the NCSN prior applied to mixtures of four randomly selected images. For more mixture components, we observe that identifiability of ground truth sources begins to break down. This is illustrated by looking at the central item in each panel of Figure 1 (highlighted in orange).

[Figure 6: histogram of PSNR (x-axis, 5 to 45) against density (y-axis).]

Figure 6. The empirical distribution of PSNR for 5,000 class agnostic MNIST digit separations using BASIS with the NCSN prior (see Table 2 for comparison of the central tendencies of this and other separation methods).

5.1. MNIST separation

Quantitative results for MNIST image separation are reported in Table 2, and a panel of visual separation results is presented in Figure 1. For quantitative results, we report mean PSNR over separations of 12,000 separated components. The distribution of PSNR for class agnostic MNIST separation is visualized in Figure 6. We observe that approximately 2/3 of results exceed the mean PSNR of 29.5, which to our eyes is visually indistinguishable from ground truth.

A natural approach to improve separation performance is to sample multiple x ∼ p(·|m) for a given mixture m. A major advantage of models like Glow, that explicitly parameterize the prior p(x), is that we can approximate the maximum of the posterior distribution with the maximum over multiple samples. By construction, samples from BASIS approximately satisfy g(x) = m, so for the noiseless model we simply declare p(m|x) = 1 and therefore p(x|m) ∝ p(x). We demonstrate the effectiveness of resampling in Table 2 (Glow, 10x) by comparing the expected PSNR of x ∼ p(·|m) to the expected PSNR of argmax_i p(x_i) over 10 samples x_1, ..., x_10 ∼ p(·|m). Even moderate resampling dramatically improves separation performance. Unfortunately this approach cannot be applied to the otherwise superior NCSN model, which does not model explicit likelihoods p(x).
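The resampling heuristic amounts to drawing several posterior samples and keeping the one the prior likes best. A minimal sketch (ours), assuming a log_prob interface on the Glow prior and the basis_separate sketch from Section 3:

    def separate_with_resampling(m, alphas, score_fn, prior_log_prob, n=10):
        """Approximate the MAP separation by the best of n posterior samples."""
        candidates = [basis_separate(m, alphas, score_fn) for _ in range(n)]
        # All candidates satisfy g(x) ~= m, so p(x|m) is proportional to p(x):
        # rank them by their log-probability under the prior alone.
        return max(candidates, key=lambda x: prior_log_prob(x).sum().item())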

Table 2. PSNR results for separating 6,000 pairs of equally mixed MNIST images. For class split results, one image comes from label 0-4 and the other comes from 5-9. We compare to S-D (Kong et al., 2019), NES (Halperin et al., 2019), convolutional NMF (class split) (Halperin et al., 2019) and standard NMF (class agnostic) (Kong et al., 2019).

    Algorithm            Class Split    Class Agnostic
    Average              14.8           14.9
    NMF                  16.0           9.4
    S-D                  -              18.5
    BASIS (Glow)         22.9           22.7
    NES                  24.3           -
    BASIS (Glow, 10x)    27.7           27.1
    BASIS (NCSN)         29.5           29.3

5.2. CIFAR-10

Quantitative results for CIFAR-10 image separation are presented in Table 3, and visual separation results are presented in Figure 1.

Table 3. Inception Score / FID Score of 25,000 separations (50,000 separated images) of two overlapping CIFAR-10 images using NCSN as a prior. In Class Split, one image comes from the category of animals and the other from the category of vehicles. NES results using published code from Halperin et al. (2019).

    Class Split
    Algorithm        Inception Score    FID
    NES              5.29 ± 0.08        51.39
    BASIS (Glow)     5.74 ± 0.05        40.21
    Average          6.14 ± 0.11        39.49
    BASIS (NCSN)     7.83 ± 0.15        29.92

    Class Agnostic
    Algorithm        Inception Score    FID
    BASIS (Glow)     6.10 ± 0.07        37.09
    Average          7.18 ± 0.08        28.02
    BASIS (NCSN)     8.29 ± 0.16        22.12

We can also view image colorization (Levin et al., 2004; Zhang et al., 2016) as a source separation problem by interpreting a grayscale image as a mixture of the three color channels of an image x = (x_r, x_g, x_b) with

    g(x) = (x_r + x_g + x_b) / 3.    (14)
Unlike our previous separation problems, the channels of an image are clearly not independent, and the factorization of p given by Equation (13) is unwarranted. But conveniently, a generative model trained on color CIFAR-10 images itself models the joint distribution p(x) = p(x_r, x_g, x_b). Therefore, the same pre-trained generative model that we use to separate images can also be used to color them.

Qualitative colorization results are visualized in Figure 7. The non-identifiability of ground truth is profound for this task (see Section 4 for discussion of identifiability). We draw attention to the two cars in the middle of the panel: the white car that is colored yellow by the algorithm, and the blue car that is colored red. The colors of these specific cars cannot be inferred from a greyscale image; the best an algorithm can do is to choose a reasonable color, based on prior information about the colors of cars.
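Under Equation (14), colorization needs no new machinery: the mixing operator just averages the channels, and each "component" is one channel. A sketch (ours) of the operator:

    import torch

    def g_color(x):
        """Colorization mixing operator of Equation (14).

        x: tensor of shape (3, H, W) holding the (r, g, b) 'components'.
        Returns the grayscale 'mixture' of shape (H, W).
        """
        return x.mean(dim=0)

    # Running BASIS with this g (and alpha_i = 1/3 in the correction term)
    # imputes colors for a grayscale input, using the same color-image prior.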


Figure 7. Colorizing CIFAR-10 images. Left: original CIFAR-10 images. Middle: greyscale conversions of the images on the left. Right: imputed colors for the greyscale images, found by BASIS using NCSN as a prior.

Table 4. Inception Score / FID Score of 50,000 colorized CIFAR-10 images. As measured by IS/FID, the quality of NCSN colorizations nearly matches CIFAR-10 itself.

    Data Distribution    Inception Score    FID Score
    Input Grayscale      8.01 ± 0.10        68.52
    BASIS (Glow)         8.69 ± 0.15        28.70
    BASIS (NCSN)         10.53 ± 0.17       11.58
    CIFAR-10 Original    11.24 ± 0.12       0.00

Quantitative coloring results for CIFAR-10 are presented in Table 4. We remark that the IS and FID scores for coloring are substantially better than the IS and FID scores of 8.87 and 25.32 respectively reported for unconditional samples from the NCSN model; conditioning on a greyscale image is enormously informative. Indeed, the Inception Score of NCSN-colorized CIFAR-10 is close to the Inception Score of the CIFAR-10 dataset itself.

5.3. LSUN separation

[Figure 8 image panels: Mixture (Input); Separated; Original.]

Figure 8. 64 × 64 LSUN separation results using Glow as a prior. One mixture component is sampled from the LSUN churches category, and the other component is sampled from LSUN bedrooms.

Qualitative results for LSUN separations are visualized in Figure 8. While the separation results in Figure 8 are imperfect, Table 1 shows that the mean log-likelihood of the separated components is comparable to the mean log-likelihood that the model assigns to images in the test set. This suggests that the model is incapable of distinguishing these separations from better results, and that the imperfections are attributable to the quality of the model rather than to the separation algorithm. This is encouraging: it suggests the artifacts are due to the Glow model rather than to BASIS, and that better separation results will be achievable with improved generative models.

6. Conclusion

In this paper, we introduced a new approach to source separation that makes use of a likelihood-based generative model as a prior. We demonstrated the ability to swap in different generative models for this purpose, presenting results of our algorithm using both NCSN and Glow. We proposed new methodology for evaluating source separation on richer datasets, demonstrating strong performance on MNIST and CIFAR-10. Finally, we presented qualitative results on LSUN that point the way towards scaling this method to practical tasks such as speech separation, using generative audio models like WaveNets (Oord et al., 2016).

Acknowledgements

We thank Zaid Harchaoui, Sham M. Kakade, Steven Seitz, and Ira Kemelmacher-Shlizerman for valuable discussion and computing resources. This work was supported by the National Science Foundation Grant DGE-1256082.

References

Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.

Bellec, P. C., Lecué, G., Tsybakov, A. B., et al. Slope meets lasso: improved oracle bounds and optimality. The Annals of Statistics, 46(6B):3603–3642, 2018.

Benaroya, L., Bimbot, F., and Gribonval, R. Audio source separation with a single sensor. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):191–199, 2005.

Comon, P. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.

Davies, M. E. and James, C. J. Source separation using single channel ICA. Signal Processing, 87(8):1819–1832, 2007.

Défossez, A., Usunier, N., Bottou, L., and Bach, F. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254, 2019.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. International Conference on Learning Representations, 2017.

Gandelsman, Y., Shocher, A., and Irani, M. "Double-DIP": Unsupervised image decomposition via coupled deep-image-priors. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.

Gatys, L., Ecker, A. S., and Bethge, M. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 262–270, 2015.

Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolutional neural networks. In Conference on Computer Vision and Pattern Recognition, pp. 2414–2423, 2016.

Geman, S. and Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. International Conference on Learning Representations, 2015.

Halperin, T., Ephrat, A., and Hoshen, Y. Neural separation of observed and unobserved distributions. Advances in Neural Information Processing Systems, 2019.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.

Huang, P.-S., Chen, S. D., Smaragdis, P., and Hasegawa-Johnson, M. Singing-voice separation from monaural recordings using robust principal component analysis. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 57–60. IEEE, 2012.

Huang, P.-S., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. Singing-voice separation from monaural recordings using deep recurrent neural networks. In International Symposium on Music Information Retrieval, pp. 477–482, 2014.

Huang, P.-S., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(12):2136–2147, 2015.

Jansson, A., Humphrey, E., Montecchio, N., Bittner, R., Kumar, A., and Weyde, T. Singing voice separation with deep U-Net convolutional networks. 2017.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.

Kong, Q., Xu, Y., Jackson, P. J. B., Wang, W., and Plumbley, M. D. Single-channel signal separation and deconvolution with generative adversarial networks. In International Joint Conference on Artificial Intelligence, 2019.

Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

Lee, D. D. and Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.

Lee, D. D. and Seung, H. S. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pp. 556–562, 2001.

Lee, T.-W., Lewicki, M. S., Girolami, M., and Sejnowski, T. J. Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters, 6(4):87–90, 1999.

Levin, A., Lischinski, D., and Weiss, Y. Colorization using optimization. In ACM SIGGRAPH 2004 Papers, pp. 689–694, 2004.

Lluis, F., Pons, J., and Serra, X. End-to-end music source separation: is it possible in the waveform domain? Interspeech, 2019.

Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Conference on Computer Vision and Pattern Recognition, pp. 427–436, 2015.

Nishida, S., Nakamura, M., Ikeda, A., and Shibasaki, H. Signal separation of background EEG and spike by using morphological filter. Medical Engineering & Physics, 21(9):601–608, 1999.

Nugraha, A. A., Liutkus, A., and Vincent, E. Multichannel audio source separation with deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(9):1652–1664, 2016.

Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., and Tran, D. Image Transformer. International Conference on Machine Learning, 2018.

Pearlmutter, B. A. and Parra, L. C. Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In Advances in Neural Information Processing Systems, pp. 613–619, 1997.

Roweis, S. T. One microphone source separation. In Advances in Neural Information Processing Systems, pp. 793–799, 2001.

Sabour, S., Frosst, N., and Hinton, G. E. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866, 2017.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.

Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. International Conference on Learning Representations, 2017.

Schmidt, M. N. and Olsson, R. K. Single-channel speech separation using sparse non-negative matrix factorization. In International Conference on Spoken Language Processing, 2006.

Shmueli, G. et al. To explain or to predict? Statistical Science, 25(3):289–310, 2010.

Smaragdis, P. and Venkataramani, S. A neural network alternative to non-negative audio models. In International Conference on Acoustics, Speech and Signal Processing, pp. 86–90. IEEE, 2017.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11895–11907, 2019.

Spiertz, M. and Gnann, V. Source-filter based clustering for monaural blind source separation. In International Conference on Digital Audio Effects, 2009.

Stoller, D., Ewert, S., and Dixon, S. Adversarial semi-supervised audio source separation applied to singing voice extraction. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2391–2395. IEEE, 2018a.

Stoller, D., Ewert, S., and Dixon, S. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. International Symposium on Music Information Retrieval, 2018b.

Subakan, Y. C. and Smaragdis, P. Generative adversarial source separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 26–30. IEEE, 2018.

Theis, L., Oord, A. v. d., and Bethge, M. A note on the evaluation of generative models. International Conference on Learning Representations, 2016.

Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image prior. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454, 2018.

van den Oord, A., Vinyals, O., et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, 2017.

Venkataramani, S., Subakan, C., and Smaragdis, P. Neural network alternatives to convolutive audio models for source separation. In International Workshop on Machine Learning for Signal Processing, pp. 1–6. IEEE, 2017.

Virtanen, T. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3):1066–1074, 2007.

Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pp. 681–688, 2011.

Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pp. 3320–3328, 2014.

Yu, F., Zhang, Y., Song, S., Seff, A., and Xiao, J. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

Zhang, R., Isola, P., and Efros, A. A. Colorful image colorization. In European Conference on Computer Vision, pp. 649–666. Springer, 2016.

Zhang, X., Ng, R., and Chen, Q. Single image reflection separation with perceptual losses. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4786–4794, 2018.

A. Experimental Details

A.1. Fine-tuning

We fine-tuned the Glow models for MNIST, CIFAR-10, and LSUN at 10 noise levels σ² (see Section 3.4) for 50 epochs each on clusters of 4 1080Ti GPUs. This procedure converges rapidly, with no further decrease of the negative log-likelihood after the first 10 epochs.

Although Glow models theoretically have full support, the noiseless pre-trained models assign vanishing probability to highly noisy images. In practice, this can cause invertibility assertion failures when fine-tuning directly from the noiseless model. To avoid this we took an iterative approach: first fine-tune the lowest noise level σ = .01 from the noiseless model, then fine-tune the σ = .016 model from the σ = .01 model, etc.
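A sketch of this iterative warm-starting schedule (ours; `finetune` stands in for a standard Glow training loop on perturbed data and is not the released training script):

    import torch

    def finetune_noise_levels(model, data, sigmas, epochs=50):
        """Fine-tune a chain of noisy priors p_sigma, warm-starting each
        level from the previous (less noisy) one to avoid instability."""
        models = {}
        for sigma in sorted(sigmas):          # lowest noise (0.01) first
            noisy = lambda x: x + sigma * torch.randn_like(x)
            model = finetune(model, data, transform=noisy, epochs=epochs)
            models[sigma] = model
        return models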

A.2. Scaling and Resources

Scaling Algorithm 1 to richer datasets is constrained primarily by the limited availability of strong, likelihood-based generative models for these datasets. For high resolution images, the running time of Algorithm 1 can also become substantial. Assuming the hyper-parameters T and L discussed in Section 3.4 remain valid at higher resolutions, the computational complexity of BASIS scales linearly with the cost of evaluating gradients of the model (albeit with a large multiplicative constant T × L). Therefore, if a generative model is tractable to train, then it should also be tractable to use for BASIS separation.

In concrete detail, we observe that a batch of 50 BASIS separation results for MNIST or CIFAR-10 using NCSN takes < 5 minutes on a single 1080Ti GPU. Running BASIS with Glow is much slower. We observe that substantial time is spent loading and unloading the noisy models p_σ from memory (in contrast to NCSN, which uses a single noise-conditioned model). A batch of 50 BASIS separation results on MNIST or CIFAR-10 using Glow takes about 30 minutes on a 1080Ti. A batch of 9 BASIS separation results on LSUN using Glow takes 2-3 hours on a 1080Ti.

A.3. Visual Comparisons

When using class-agnostic priors, BASIS separation is symmetric in its output components. To facilitate visual comparisons between original images and separated components, we sort the BASIS separated components to maximize PSNR to the original images. This usually results in the separated components being visually paired with the most similar original components. But due to the deficiencies of PSNR as a comparative metric this is not always the case; the alert reader may have noticed that the yellow and silver car mixture in Figure 1 appears to have been displayed in reverse order. This happens because the separated yellow car component takes the light sky from the original silver car component, and the lightness of the sky dominates the PSNR metric.

For the LSUN separation results, where we use a church model for the first component and a bedroom model for the second, the symmetry is broken. For these results, components naturally sort themselves into church and bedroom components, which can be compared directly to the original images.

B. Intermediate Samples During the Annealing Process

[Figure 9 image panels: Mixture; Separated 1; Separated 2; rows at noise levels σ = 1.0, 0.6, 0.36, 0.21, 0.01; Original.]

Figure 9. Intermediate CIFAR-10 separation results taken at noise levels σ during the annealing process of BASIS separation.

C. MNIST Separation Results Under Different Models and Sampling Procedures

[Figure 10 image panels, repeated in two blocks: Mixture; Glow; Glow (Resampling); NCSN; Original.]

Figure 10. Uncurated class-agnostic separation results using: (1) samples from the posterior with Glow as a prior; (2) an approximate MAP estimate using the maximum over 10 samples from the posterior with Glow as a prior; (3) samples from the posterior with NCSN as a prior.

D. Extended CIFAR-10 Separation Results

D.1. NCSN Prior

[Figure 11 image panels: Mixture; Separated; Original.]

Figure 11. Uncurated class-agnostic CIFAR-10 separation results using NCSN as a prior.


D.2. Glow Prior

[Figure 12 image panels: Mixture; Separated; Original.]

Figure 12. Uncurated class-agnostic CIFAR-10 separation results using Glow as a prior.

E. Extended CIFAR-10 Colorization Results

E.1. NCSN Prior

[Figure 13 image panels: Grayscale; Colorized.]

Figure 13. Uncurated CIFAR-10 colorization results using NCSN as a prior.


E.2. Glow Prior

[Figure 14 image panels: Grayscale; Colorized.]

Figure 14. Uncurated CIFAR-10 colorization results using Glow as a prior.


F. Extended LSUN Separation Results

[Figure 15 image panels: Mixture; Separated; Original.]

Figure 15. Uncurated church/bedroom LSUN separation results using Glow as a prior.