Neural Decomposition: Functional ANOVA with Variational Autoencoders

Neural Decomposition: Functional ANOVA with Variational Autoencoders Kaspar Märtens Christopher Yau University of Oxford Alan Turing Institute University of Birmingham University of Manchester Abstract give insight into patterns of interest embedded within the data. Recently there has been particular interest in applications of Variational Autoencoders (VAEs) Variational Autoencoders (VAEs) have be- (Kingma and Welling, 2014) for generative modelling come a popular approach for dimensionality and dimensionality reduction. VAEs are a class of reduction. However, despite their ability to probabilistic neural-network-based latent variable mod- identify latent low-dimensional structures em- els. Their neural network (NN) underpinnings offer bedded within high-dimensional data, these considerable modelling flexibility while modern pro- latent representations are typically hard to gramming frameworks (e.g. TensorFlow and PyTorch) interpret on their own. Due to the black-box and computational techniques (e.g. stochastic varia- nature of VAEs, their utility for healthcare tional inference) provide immense scalability to large and genomics applications has been limited. data sets, including those in genomics (Lopez et al., In this paper, we focus on characterising the 2018; Eraslan et al., 2019; Märtens and Yau, 2020). The sources of variation in Conditional VAEs. Our Conditional VAE (CVAE) (Sohn et al., 2015) extends goal is to provide a feature-level variance de- this model to allow incorporating side information as composition, i.e. to decompose variation in additional fixed inputs. the data by separating out the marginal additive effects of latent variables z and fixed The ability of (C)VAEs to extract latent, low- inputs c from their non-linear interactions. dimensional representations from high-dimensional in- We propose to achieve this through what puts is a strength for predictive tasks. For example we call Neural Decomposition – an adapta- (C)VAEs have been particularly popular as generative tion of the well-known concept of functional models for images. However, when the objective is ANOVA variance decomposition from classi- to understand the contributions of different inputs to cal statistics to deep learning models. We components of variation in the data, the black box de- show how identifiability can be achieved by coder is insufficient to permit this. This is particularly training models subject to constraints on the important for tabular data problems where individual marginal properties of the decoder networks. features might correspond to actual physical quantities We demonstrate the utility of our Neural De- (e.g. expression of a gene or a physiological measure- composition on a series of synthetic examples ment) and we are interested in how the variability of as well as high-dimensional genomics data. the multivariate output response is driven by specific changes in the inputs (where inputs may be either latent variables or observed covariates). 1 Introduction Functional analysis of variance (functional ANOVA, or fANOVA) are a class of statistical models that uniquely Dimensionality reduction is often required for the anal- decompose a functional response according to the main ysis of complex high-dimensional data sets in order effects and interactions of various factors (Sobol, 1993; to identify low-dimensional sub-structures that may Ramsay and Silvermann, 1997). For example, smooth- ing spline ANOVA models (Gu, 2013) express the effects as linear combinations of some underlying basis rd Proceedings of the 23 International Conference on Artificial functions, and the coefficients on these basis functions Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s). are chosen to minimise a loss function balancing good- Neural Decomposition: Functional ANOVA with Variational Autoencoders Figure 1: Neural Decomposition illustrated. Given high-dimensional data (shown for two selected features) and covariate information c 2 R (in colour coding), we would like to learn a low-dimensional representation z 2 R (on x-axis) together with a decomposition into marginal z and c effects and their interaction f(z; c). “Variance decomposition” would quantify the variance attributed to each component and aid interpretability. ness of fit with a measure of smoothness of the fitted 2 Background functions. More recent treatments have proposed functional ANOVA decompositions via tree-based models VAEs: Variational Autoencoders are constructed (Lengerich et al., 2019), and within a Bayesian nonpara- based on particular a latent variable model structure. metric framework using Gaussian Processes (Kaufman Let y 2 Y denote a data point in some potentially high- and Sain, 2010). dimensional space and z 2 Z be a vector of associated In this paper, we propose the notion of neural decom- latent variables typically in a much lower dimensional position – the integration of functional ANOVA and space. We assume that the prior p(z) is from a family deep neural networks for dimensionality reduction and of distributions that is easy to sample from - we will θ variance decomposition. Specifically, we develop and assume z ∼ N (0; I) throughout. Now suppose f (z) is explore a class of (C)VAE models where the decoding a family of deterministic functions, indexed by parame- network is specifically implemented to explain feature- ters θ 2 Θ, such that f : Z ×Θ !Y. Given data points level variability in terms of additive and interaction y1:N = fy1;:::; yN g, the marginal likelihood is given QN R effects between the latent variables and any additional by p(y1:N ) = i=1 pθ(yijzi)p(zi)dzi: We will assume θ 2 covariate information (Figure1). a Gaussian likelihood, i.e. yijzi; θ ∼ N (f (zi); σ ). In the VAE, these deterministic functions f θ are given by Thus, neural decomposition can be seen from two per- deep neural networks (NNs). Posterior inference in a spectives: VAE is carried out via amortised variational inference, i.e. with a parametric inference model q (z jy ) where • From the classical statistics perspective, it is a scal- φ i i φ are variational parameters. Application of standard able neural adaptation of the functional ANOVA variational inference methodology, see e.g. (Blei et al., decomposition, where additional latent variables z 2017), leads to a lower bound on the log marginal have been introduced to capture low-dimensional likelihood, i.e. the ELBO of the form: structure within high-dimensional data and non- N linearities implemented via deep neural nets. θ,φ X L = Eqφ(zijyi)[log pθ(yijzi)] − KL(qφ(zijyi)jjp(zi)) • From the deep learning perspective, it is an exten- i=1 sion of the (C)VAE framework, where the decoder In the VAE, the variational approximation q (z jy ) is has a decomposable additive structure to aid in- φ i i referred to as the encoder since it encodes the data y terpretability on the level of individual features. into the latent variable z whilst the decoder refers to the generative model p (yjz) which decodes the latent Crucially, in order for this class of models to be useful, θ variables into observations. identifiability must be enforced to ensure the expres- sive power of each neural network component does not Training a VAE seeks to optimise both the model pa- absorb variation that should be explained by others. rameters θ and the variational parameters φ jointly Inference for our model relies on the imposition of func- using stochastic gradient ascent. This typically re- tional constraints which we implement as constrained quires a stochastic approximation to the gradients of optimisation in the weight-space of neural networks. the variational objective which is itself intractable. Typ- Kaspar Märtens, Christopher Yau ically the approximating family is assumed to be Gaus- 3.1 Identifiable Neural Decomposition 2 sian, i.e. qφ(zijyi) = N (zijµφ(yi); σφ(yi)), so that a reparametrisation trick can be used (Kingma and For a more general formulation, let us denote the latent Welling, 2014; Rezende et al., 2014). and fixed inputs collectively by x := (z; c) with dimen- 1 sionality D := dim(xi) . We would like to decompose Conditional VAEs: The Conditional VAE (Sohn as follows et al., 2015) augments the VAE by conditioning the X X f θ + f θ(x )+ f θ (x ; x )+:::+f θ (x) (2) generative model on additional inputs c, i.e. the VAE 0 k ik kl k l 1;:::;D θ k k;l generative model yi = f (zi) + i is now replaced by θ θ θ θ where all functions f ; f ; : : : ; f are parame- yi = f (zi; ci) + "i which requires a minor alteration k k;l 1;:::;D to the ELBO terised by neural networks. We note that without any N additional constraints on the neural networks the above θ,φ X decomposition (2) is unidentifiable. This is because L = Eqφ(zijyi;ci)[log pθ(yijzi; ci)] i=1 the functional subspaces fI corresponding to different index sets I can all be seen as functions defined on −KL(qφ(zijyi; ci)jjp(zi)): the same input space D, being constant in the rest of Conditioning on the extra inputs can greatly increase R coordinates, and these subspaces are overlapping: the expressiveness of the VAE when applied to multi- θ θ θ modal data where the indicator driving the multimodal- Proposition 1. Let fi ; fij; : : : ; f1;:::;D be neural net- ity is measurable. For the genomics applications we works with the same architecture, i.e. assuming that the 2 consider, we use this mechanism to incorporate known networks only differ in the number of inputs . Then for covariates as part of the model. any two disjoint sets of indices I and J the functional θ jθj subspace ffI (xI ): θ 2 R g is strictly a subset of the functional subspace ff θ (x ; x ): θ 2 jθjg. 3 Neural Decomposition I;J I J R Proof: weights in the first layer corresponding to inputs Our goal is to equip (C)VAEs with feature-level inter- xJ can be set to zero, eliminating the effect of xJ .

Neural Decomposition: Functional ANOVA with Variational Autoencoders

Equivalence Between Minimal Generative Model Graphs and Directed Information Graphs

A Hierarchical Generative Model of Electrocardiogram Records

Bias Correction of Learned Generative Models Using Likelihood-Free Importance Weighting

Probabilistic PCA and Factor Analysis

Learning a Generative Model of Images by Factoring Appearance and Shape

Note 18 : Gaussian Discriminant Analysis

Logistic Regression, Generative and Discriminative Classifiers

A Probabilistic Generative Model for Typographical Analysis of Early Modern Printing

Discriminative and Generative Models for Clinical Risk Estimation: an Empirical Comparison

Acoustic Feature Learning with Deep Variational Canonical Correlation

Time Series Simulation by Conditional Generative Adversarial Net

Bayesian Modelling