Latent Gaussian Processes for Distribution Estimation of Multivariate Categorical Data

Yarin Gal [email protected]   Yutian Chen [email protected]   Zoubin Ghahramani [email protected]
University of Cambridge

arXiv:1503.02182v1 [stat.ML] 7 Mar 2015

Abstract

Multivariate categorical data occur in many applications of machine learning. One of the main difficulties with these vectors of categorical variables is sparsity. The number of possible observations grows exponentially with vector length, but dataset diversity might be poor in comparison. Recent models have gained significant improvement in supervised tasks with this data. These models embed observations in a continuous space to capture similarities between them. Building on these ideas we propose a Bayesian model for the unsupervised task of distribution estimation of multivariate categorical data.

We model vectors of categorical variables as generated from a non-linear transformation of a continuous latent space. Non-linearity captures multi-modality in the distribution. The continuous representation addresses sparsity. Our model ties together many existing models, linking the linear categorical latent Gaussian model, the Gaussian process latent variable model, and Gaussian process classification. We derive inference for our model based on recent developments in sampling based variational inference. We show empirically that the model outperforms its linear and discrete counterparts in imputation tasks of sparse data.

1 Introduction

Multivariate categorical data is common in fields ranging from language processing to medical diagnosis. Recently proposed supervised models have gained significant improvement in tasks involving big labelled data of this form (see for example Bengio et al. (2006); Collobert & Weston (2008)). These models rely on information that had been largely ignored before: similarity between vectors of categorical variables. But what should we do in the unsupervised setting, when we face small unlabelled data of this form?

Medical diagnosis provides good examples of small unlabelled data. Consider a dataset composed of test results of a relatively small number of patients. Each patient has a medical record composed often of dozens of examinations, taking various categorical test results. We might be faced with the task of deciding which tests are necessary for a patient under examination to take, and which examination results could be deduced from the existing tests. This can be achieved with distribution estimation.

Several tools in the Bayesian framework could be used for this task of distribution estimation of unlabelled small datasets. Tools such as the Dirichlet-Multinomial distribution and its extensions are an example of such. These rely on relative frequencies of categorical variables appearing with others, with the addition of various smoothing techniques. But when faced with long multivariate sequences, these models run into problems of sparsity. This occurs when the data consists of vectors of categorical variables with most configurations of categorical values not in the dataset. In medical diagnosis this happens when there is a large number of possible examinations compared to a small number of patients.
Building on ideas used for big labelled discrete data, we propose a Bayesian model for distribution estimation of small unlabelled data. Existing supervised models for discrete labelled data embed the observations in a continuous space. This is used to find the similarity between vectors of categorical variables. We extend this idea to the small unlabelled domain by modelling the continuous embedding as a latent variable. A generative model is used to find a distribution over the discrete observations by modelling them as dependent on the continuous latent variables.

Following the medical diagnosis example, patient n would be modelled by a continuous latent variable x_n ∈ X. For each examination d, the latent x_n induces a vector of probabilities f = (f_nd1, ..., f_ndK), one probability for each possible test result k. A categorical distribution returns test result y_nd based on these probabilities, resulting in a patient's medical assessment y_n = (y_n1, ..., y_nD). We need to decide how to model the distribution over the latent space X and the vectors of probabilities f.

We would like to capture sparse multi-modal categorical distributions. A possible approach would be to model the continuous representation with a simple latent space and a non-linear transformation of points in the space to obtain probabilities. In this approach we place a standard normal distribution prior on a latent space, and feed the output of a non-linear transformation of the latent space into a Softmax to obtain probabilities. We use sparse Gaussian processes (GPs) to transform the latent space non-linearly. Sparse GPs form a distribution over functions supported on a small number of points with linear time complexity (Quiñonero-Candela & Rasmussen, 2005; Titsias, 2009). We use a covariance function that is able to transform the latent space non-linearly. We name this model the Categorical Latent Gaussian Process (CLGP). Using a Gaussian process with a linear covariance function recovers the linear Gaussian model (LGM, Marlin et al., 2011). This model linearly transforms a continuous latent space, resulting in discrete observations.

The Softmax likelihood is not conjugate to our Gaussian prior, and integrating the latent variables with a Softmax distribution is intractable. A similar problem exists with LGMs. Marlin et al. (2011) solved this by using variational inference and various bounds for the likelihood in the binary case, or alternative likelihoods to the Softmax in the categorical case (Khan et al., 2012).
Many bounds have been studied in the literature for the binary case: Jaakkola and Jordan's bound (Jaakkola & Jordan, 1997), the tilted bound (Knowles & Minka, 2011), piecewise linear and quadratic bounds (Marlin et al., 2011), and others. But for categorical data fewer bounds exist, since the multivariate Softmax is hard to approximate in high dimensions. The Bohning bound (Böhning, 1992) and Blei and Lafferty's bound (Blei & Lafferty, 2006) give poor approximation (Khan et al., 2012).

Instead we use recent developments in sampling-based variational inference (Blei et al., 2012) to avoid integrating the latent variables with the Softmax analytically. Our approach takes advantage of this tool to obtain a simple yet powerful model and inference. We use Monte Carlo integration to approximate the non-conjugate likelihood, obtaining noisy gradients (Kingma & Welling, 2013; Rezende et al., 2014; Titsias & Lázaro-Gredilla, 2014). We then use learning-rate free stochastic optimisation (Tieleman & Hinton, 2012) to optimise the noisy objective. We leverage symbolic differentiation (Theano, Bergstra et al., 2010) to obtain simple and modular code¹.

We experimentally show the advantages of using non-linear transformations for the latent space. We follow the ideas brought in Paccanaro & Hinton (2001) and evaluate the models on relation embedding and relation learning. We then demonstrate the utility of the model in the real-world sparse data domain. We use a medical diagnosis dataset where data is scarce, comparing our model to discrete frequency based models. We use the estimated distribution for a task similar to the above, where we attempt to infer which test results can be deduced from the others. We compare the models on the task of imputation of raw data studying the effects of government terror warnings on political attitudes. We then evaluate the continuous models on a binary Alphadigits dataset composed of binary images of handwritten digits and letters, where each class contains a small number of images. We inspect the latent space embeddings and separation of the classes. Lastly, we evaluate the robustness of our inference, inspecting the Monte Carlo estimate variance over time.

2 Related Work

Our model (CLGP) relates to some key probabilistic models (fig. 1). It can be seen as a non-linear version of the latent Gaussian model (LGM, Khan et al. (2012)) as discussed above. In the LGM we have a standard normal prior placed on a latent space, which is transformed linearly and fed into a Softmax likelihood. The probability vector output is then used to sample a single categorical value for each variable (e.g. medical test results) in a list of categorical variables (e.g. medical assessment). These categorical variables correspond to elements in a multivariate categorical vector. The parameters of the linear transformation are optimised directly within an EM framework. Khan et al. (2012) avoid the hard task of approximating the Softmax likelihood by using an alternative function (product of sigmoids) which is approximated using numerical techniques. Our approach avoids this cumbersome inference.

[Figure 1 is a diagram relating factor analysis, the Gaussian process latent variable model, logistic regression, Gaussian process regression, Gaussian process classification, latent Gaussian models, and the Categorical Latent Gaussian Process along three axes: linear vs. non-linear, continuous vs. discrete, and observed vs. latent input.]

Figure 1. Relations between existing models and the model proposed in this paper (Categorical Latent Gaussian Process); the model can be seen as a non-linear version of the latent Gaussian model (left to right, Khan et al. (2012)), it can be seen as a latent counterpart to the Gaussian process classification model (back to front, Williams & Rasmussen (2006)), or alternatively as a discrete extension of the Gaussian process latent variable model (top to bottom, Lawrence (2005)).

¹ Python code for the model and inference is given in the appendix, and available at http://github.com/yaringal/CLGP

Our proposed model can also be seen as a latent counterpart to the Gaussian process classification model (Williams & Rasmussen, 2006), in which a Softmax likelihood is again used to discretise the continuous values. The continuous valued outputs are obtained from a Gaussian process, which non-linearly transforms the inputs to the classification problem. Compared to GP classification, where the inputs are fully observed, in our case the inputs are latent. Lastly, our model can be seen as a discrete extension of the Gaussian process latent variable model (GPLVM, Lawrence, 2005). This model has been proposed recently as a means of performing non-linear dimensionality reduction (counterpart to the linear principal component analysis (Tipping & Bishop, 1999)) and density estimation in continuous space.

3 A Latent Gaussian Process Model for Multivariate Categorical Data

We consider a generative model for a dataset Y with N observations (patients for example) and D categorical variables (different possible examinations). The d-th categorical variable in the n-th observation, y_nd, is a categorical variable that can take an integer value from 0 to K_d. For ease of notation, we assume all the categorical variables have the same cardinality, i.e. K_d ≡ K for all d = 1, ..., D.

In our generative model, each categorical variable y_nd follows a categorical distribution with probability given by a Softmax with weights f_nd = (0, f_nd1, ..., f_ndK). Each weight f_ndk is the output of a nonlinear function of a Q dimensional latent variable x_n ∈ R^Q: F_dk(x_n). To complete the generative model, we assign an isotropic Gaussian distribution prior with standard deviation σ_x to the latent variables, and a Gaussian process prior to each of the nonlinear functions. We also consider a set of M auxiliary variables which are often called inducing inputs. These inputs Z ∈ R^{M×Q} lie in the latent space, with their corresponding outputs U ∈ R^{M×D×K} lying in the weight space (together with f_ndk). The inducing points are used as "support" for the function. Evaluating the covariance function of the GP on these instead of the entire dataset allows us to perform approximate inference in O(M²N) time complexity instead of O(N³), where M is the number of inducing points and N is the number of data points (Quiñonero-Candela & Rasmussen, 2005).

The model is expressed as:

x_nq ~ iid N(0, σ_x²)
F_dk ~ iid GP(K_d)
f_ndk = F_dk(x_n),   u_mdk = F_dk(z_m)
y_nd ~ Softmax(f_nd)        (1)

for n ∈ [N] (the set of naturals between 1 and N), q ∈ [Q], d ∈ [D], k ∈ [K], m ∈ [M], and where the Softmax distribution is defined as

Softmax(y = k; f) = Categorical( exp(f_k) / exp(lse(f)) ),
lse(f) = log(1 + Σ_{k'=1}^{K} exp(f_{k'})),        (2)

for k = 0, ..., K and with f_0 := 0.

Following our medical diagnosis example from the introduction, each patient is modelled by latent x_n; for each examination d, x_n has a sequence of weights (f_nd1, ..., f_ndK), one weight for each possible test result k, that follows a Gaussian process; Softmax returns test result y_nd based on these weights, resulting in a patient's medical assessment y_n = (y_n1, ..., y_nD).
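To make the generative process in eqs. (1)-(2) concrete, the following is a minimal NumPy sketch of sampling a dataset from the model. This is an illustration only, not the authors' implementation: the plain RBF covariance, the sizes N, D, K, Q and the helper names are assumptions made for the example.

import numpy as np

def rbf(X1, X2, sf2=1.0, lengthscale=1.0):
    # Squared-exponential covariance between two sets of latent points
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / lengthscale ** 2)

def sample_dataset(N=50, D=3, K=4, Q=2, sigma_x=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = sigma_x * rng.standard_normal((N, Q))       # x_nq ~ N(0, sigma_x^2)
    Kxx = rbf(X, X) + 1e-6 * np.eye(N)              # GP prior covariance K_d(X, X)
    Y = np.empty((N, D), dtype=int)
    for d in range(D):
        # one latent function per (variable d, value k); value k = 0 is pinned to f = 0
        F = rng.multivariate_normal(np.zeros(N), Kxx, size=K).T   # shape (N, K)
        logits = np.concatenate([np.zeros((N, 1)), F], axis=1)    # prepend f_0 := 0
        P = np.exp(logits - logits.max(1, keepdims=True))
        P /= P.sum(1, keepdims=True)                              # Softmax probabilities
        Y[:, d] = [rng.choice(K + 1, p=p) for p in P]             # y_nd ~ Categorical
    return X, Y

X, Y = sample_dataset()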
Define f_dk = (f_1dk, ..., f_Ndk) and u_dk = (u_1dk, ..., u_Mdk). The joint distribution of (f_dk, u_dk), with the latent nonlinear function F_dk marginalised under the GP assumption, is a multivariate Gaussian distribution N(0, K_d([X, Z], [X, Z])). It is easy to verify that when we further marginalise the inducing outputs, we end up with a joint distribution of the form f_dk ~ N(0, K_d(X, X)) for all d, k. Therefore, the introduction of inducing outputs does not change the marginal distribution of the data Y. These are used in the variational inference method in the next section, and the inducing inputs Z are considered as variational parameters.

We use the automatic relevance determination (ARD) radial basis function (RBF) covariance function for our model. ARD RBF is able to select the dimensionality of the latent space automatically and transform it non-linearly.
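The paper does not write the ARD RBF covariance out explicitly; the sketch below gives the standard form under the usual definition, with per-dimension length-scales l_q and signal variance sf2 (a large l_q effectively switches latent dimension q off). The function name and arguments are illustrative.

import numpy as np

def ard_rbf(X1, X2, sf2, lengthscales):
    # ARD RBF: k(x, x') = sf2 * exp(-0.5 * sum_q (x_q - x'_q)^2 / l_q^2)
    diff = (X1[:, None, :] - X2[None, :, :]) / np.asarray(lengthscales)
    return sf2 * np.exp(-0.5 * (diff ** 2).sum(-1))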

4 Inference

The marginal log-likelihood, also known as log-evidence, is intractable for our model due to the non-linearity of the covariance function of the GP and the Softmax likelihood function. We first describe a lower bound of the log-evidence (ELBO), obtained by applying Jensen's inequality with a variational distribution of the latent variables following Titsias & Lawrence (2010).

Consider a variational approximation to the posterior distribution of X, F and U factorised as follows:

q(X, F, U) = q(X) q(U) p(F | X, U).        (3)

We can obtain the ELBO by applying Jensen's inequality:

log p(Y) = log ∫ p(X) p(U) p(F | X, U) p(Y | F) dX dF dU
  ≥ ∫ q(X) q(U) p(F | X, U) log [ p(X) p(U) p(F | X, U) p(Y | F) / ( q(X) q(U) p(F | X, U) ) ] dX dF dU
  = − KL(q(X) || p(X)) − KL(q(U) || p(U))
    + Σ_{n=1}^{N} Σ_{d=1}^{D} ∫ q(x_n) q(U_d) p(f_nd | x_n, U_d) log p(y_nd | f_nd) dx_n df_nd dU_d
  := L        (4)

where

p(f_nd | x_n, U_d) = Π_{k=1}^{K} N(f_ndk | a_nd^T u_dk, b_nd)        (5)

with

a_nd = K_{d,MM}^{-1} K_{d,Mn},
b_nd = K_{d,nn} − K_{d,nM} K_{d,MM}^{-1} K_{d,Mn}.        (6)

Notice however that the integration of log p(y_nd | f_nd) in eq. (4) involves a nonlinear function (lse(f) from eq. (2)) and is still intractable. Consequently, we do not have an analytical form for the optimal variational distribution of q(U), unlike in Titsias & Lawrence (2010). Instead of applying a further approximation/lower bound on L, we want to obtain better accuracy by following a sampling-based approach (Blei et al., 2012; Kingma & Welling, 2013; Rezende et al., 2014; Titsias & Lázaro-Gredilla, 2014) to compute the lower bound L and its derivatives with the Monte Carlo method. Specifically, we draw samples of x_n, U_d and f_nd from q(x_n), q(U_d), and p(f_nd | x_n, U_d) respectively and estimate L with the sample average. Another advantage of using the Monte Carlo method is that we are not constrained to the limited choice of covariance functions for the GP that is otherwise required for an analytical solution in standard approaches to GPLVM for continuous data (Titsias & Lawrence, 2010; Hensman et al., 2013).

We consider a mean field approximation for the latent points q(X) as in (Titsias & Lawrence, 2010) and a joint Gaussian distribution with the following factorisation for q(U):

q(U) = Π_{d=1}^{D} Π_{k=1}^{K} N(u_dk | µ_dk, Σ_d),
q(X) = Π_{n=1}^{N} Π_{i=1}^{Q} N(x_ni | m_ni, s_ni²)        (7)

where the covariance matrix Σ_d is shared for the same categorical variable d (remember that K is the number of values this categorical variable can take). The KL divergence terms in L can be computed analytically with the given variational distributions. The parameters we need to optimise over include the hyper-parameters for the GP θ_d, the variational parameters for the inducing points Z, µ_dk, Σ_d, and the means and standard deviations of the latent points m_ni, s_ni.²

² Note that the number of parameters to optimise over can be reduced by transforming the latent space non-linearly to a second low-dimensional latent space, which is then transformed linearly to the weight space containing points f_ndk.
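As an illustration of eqs. (5)-(6), the sketch below computes a_nd and b_nd for a single latent point and draws f_nd given the inducing outputs U_d. It assumes the ard_rbf helper from the earlier sketch is in scope; the function name and shapes are illustrative only.

import numpy as np

def conditional_f(x_n, Z, U_d, sf2, lengthscales, jitter=1e-6):
    # a_nd = K_MM^{-1} k_Mn,   b_nd = k_nn - k_nM K_MM^{-1} k_Mn   (eqs. 5-6)
    # U_d: (M, K) inducing outputs for categorical variable d
    Kmm = ard_rbf(Z, Z, sf2, lengthscales) + jitter * np.eye(len(Z))
    kmn = ard_rbf(Z, x_n[None, :], sf2, lengthscales)[:, 0]      # shape (M,)
    knn = sf2                                                    # k(x_n, x_n) for this kernel
    a_nd = np.linalg.solve(Kmm, kmn)                             # shape (M,)
    b_nd = max(knn - kmn @ a_nd, 0.0)
    # p(f_ndk | x_n, U_d) = N(a_nd^T u_dk, b_nd), independently over k
    mean = U_d.T @ a_nd                                          # shape (K,)
    return mean + np.sqrt(b_nd) * np.random.randn(len(mean))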

4.1 Transforming the Random Variables

In order to obtain a Monte Carlo estimate of the gradients of L with low variance, a useful trick introduced in (Kingma & Welling, 2013) is to transform the random variables to be sampled so that the randomness does not depend on the parameters with which the gradients will be computed. We present the transformation of each variable to be sampled as follows.

Transforming X. For the mean field approximation, the transformation for X is straightforward:

x_ni = m_ni + s_ni ε_ni^(x),   ε_ni^(x) ~ N(0, 1).        (8)

Transforming u_dk. The variational distribution of u_dk is a joint Gaussian distribution. Denote the Cholesky decomposition of Σ_d as L_d L_d^T = Σ_d. We can rewrite u_dk as

u_dk = µ_dk + L_d ε_dk^(u),   ε_dk^(u) ~ N(0, I_M).        (9)

We optimise the lower triangular matrix L_d instead of Σ_d.

Transforming f_nd. Since the conditional distribution p(f_nd | x_n, U_d) in eq. (5) is factorised over k, we can define a new random variable for every f_ndk:

f_ndk = a_nd^T u_dk + √(b_nd) ε_ndk^(f),   ε_ndk^(f) ~ N(0, 1).        (10)

Notice that the transformation of the variables does not change the distribution of the original variables, and therefore does not change the KL divergence terms in eq. (4).

4.2 Lower Bound with Transformed Variables

Given the transformation we just defined, we can represent the lower bound as

L = − Σ_{n=1}^{N} Σ_{i=1}^{Q} KL(q(x_ni) || p(x_ni)) − Σ_{d=1}^{D} Σ_{k=1}^{K} KL(q(u_dk) || p(u_dk))
    + Σ_{n=1}^{N} Σ_{d=1}^{D} E_{ε_n^(x), ε_d^(u), ε_nd^(f)} [ log Softmax( y_nd | f_nd(ε_nd^(f), U_d(ε_d^(u)), x_n(ε_n^(x))) ) ]        (11)

where the expectation in the last line is with respect to the fixed distributions defined in eqs. (8), (9) and (10). Each expectation term that involves the Softmax likelihood, denoted as L_s^nd, can be estimated using Monte Carlo integration as

L_s^nd ≈ (1/T) Σ_{i=1}^{T} log Softmax( y_nd | f_nd(ε_{nd,i}^(f), U_d(ε_{d,i}^(u)), x_n(ε_{n,i}^(x))) )        (12)

with ε_{n,i}^(x), ε_{d,i}^(u), ε_{nd,i}^(f) drawn from their corresponding distributions. Since these distributions do not depend on the parameters to be optimised, the derivatives of the objective function L are now straightforward to compute with the same set of samples using the chain rule.
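The following is a compact sketch of the reparameterisations in eqs. (8)-(10) and of the Monte Carlo estimate of one Softmax term as in eq. (12). It recomputes the kernel quantities for every sample for clarity; the authors' actual implementation is symbolic (Theano) and far more efficient. ard_rbf is the helper from the earlier sketch, and all names and shapes are illustrative assumptions.

import numpy as np

def log_softmax_term(y_nd, m_n, s_n, mu_d, L_d, Z, sf2, lengthscales, T=20, rng=None):
    # Monte Carlo estimate of E[log Softmax(y_nd | f_nd)] under eqs. (8)-(10).
    rng = rng or np.random.default_rng()
    M, K = mu_d.shape
    total = 0.0
    for _ in range(T):
        x_n = m_n + s_n * rng.standard_normal(m_n.shape)         # eq. (8)
        U_d = mu_d + L_d @ rng.standard_normal((M, K))           # eq. (9), L_d shared over k
        Kmm = ard_rbf(Z, Z, sf2, lengthscales) + 1e-6 * np.eye(M)
        kmn = ard_rbf(Z, x_n[None, :], sf2, lengthscales)[:, 0]
        a = np.linalg.solve(Kmm, kmn)
        b = max(sf2 - kmn @ a, 0.0)
        f = U_d.T @ a + np.sqrt(b) * rng.standard_normal(K)      # eq. (10)
        logits = np.concatenate(([0.0], f))                      # prepend f_0 := 0
        log_norm = logits.max() + np.log(np.exp(logits - logits.max()).sum())
        total += logits[y_nd] - log_norm                         # log Softmax(y_nd | f_nd)
    return total / T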
4.3 Stochastic Gradient Descent

We use gradient descent to find an optimal variational distribution. Gradient descent with noisy gradients is guaranteed to converge to a local optimum given a decreasing learning rate under some conditions, but is hard to work with in practice. Initial values set for the learning rate influence the rate at which the algorithm converges, and misspecified values can cause it to diverge. For this reason new techniques have been proposed that handle noisy gradients well. Optimisers such as AdaGrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), and RMSPROP (Tieleman & Hinton, 2012) have been proposed, each handling the gradients slightly differently, all averaging over past gradients. Schaul et al. (2013) have studied the different techniques empirically, comparing them to one another on a variety of unit tests. They found that RMSPROP works better on many test sets compared to the other optimisers evaluated. We thus chose to use RMSPROP for our experiments.

A major advantage of our inference is that it is extremely easy to implement and adapt. The straightforward computation of derivatives through the expectation makes it possible to use symbolic differentiation. We use Theano (Bergstra et al., 2010) for the inference implementation, where the generative model is implemented as in eqs. (8), (9) and (10), and the optimisation objective, evaluated on samples from the generative model, is given by eq. (11).
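For reference, a minimal sketch of an RMSPROP-style update used to ascend the noisy ELBO estimate. This is the standard form from Tieleman & Hinton (2012); the step size and decay below are illustrative defaults, not the paper's settings.

import numpy as np

def rmsprop_step(param, grad, avg_sq, lr=1e-3, decay=0.9, eps=1e-8):
    # Keep a running average of squared gradients and scale each step by its root.
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    param = param + lr * grad / (np.sqrt(avg_sq) + eps)   # ascent on the noisy ELBO estimate
    return param, avg_sq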
5 Experimental Results

We evaluate the categorical latent Gaussian process (CLGP), comparing it to existing models for multivariate categorical distribution estimation. These include models based on a discrete latent representation (such as the Dirichlet-Multinomial), and a continuous latent representation with a linear transformation of the latent space (latent Gaussian model (LGM, Khan et al., 2012)). We demonstrate over-fitting problems with the LGM, and inspect the robustness of our inference.

For the following experiments, both the linear and non-linear models were initialised with a 2D latent space. The mean values of the latent points, m_n, were initialised at random following a standard normal distribution. The mean values of the inducing outputs (µ_dk) were initialised with a normal distribution with standard deviation 10^-2. This is equivalent to using a uniform initial distribution for all values. We initialise the standard deviation of each latent point (s_n) to 0.1, and initialise the length-scales of the ARD RBF covariance function to 0.1. We then optimise the variational distribution for 500 iterations. At every iteration we optimise the various quantities while holding u_dk's variational parameters fixed, and then optimise u_dk's variational parameters holding the other quantities fixed.

Our setting supports semi-supervised learning with partially observed data. The latents of partially observed points are optimised with the training set, and then used to predict the missing values. We assess the performance of the models using test-set perplexity (a measure of how much the model is surprised by the missing test values). This is defined as the exponent of the negative average log predicted probability.
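A minimal sketch of the test-set perplexity computation just described; the input is assumed to be the probabilities the model assigns to the true held-out values.

import numpy as np

def test_set_perplexity(predicted_probs):
    # predicted_probs: probabilities assigned by the model to the true held-out values
    predicted_probs = np.asarray(predicted_probs)
    return float(np.exp(-np.mean(np.log(predicted_probs))))

# A model that always predicts the right value with probability 1 has perplexity 1.
assert test_set_perplexity([1.0, 1.0, 1.0]) == 1.0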

5.1 Linear Models Have Difficulty with Multi-modal Distributions

We show the advantages of using a non-linear categorical distribution estimation compared to a linear one, evaluating the CLGP against the linear LGM. We implement the latent Gaussian model using a linear covariance function in our model; we remove the KL divergence term in u following the model specification in (Khan et al., 2012), and use our inference scheme described above. Empirically, the Monte Carlo inference scheme with the linear model results in the same test error on the dataset of (Inoguchi, 2008) as the piece-wise bound based inference scheme developed in (Khan et al., 2012).

5.1.1 Relation Embedding – Exclusive Or

A simple example of relational learning (following Paccanaro & Hinton (2001)) can be used to demonstrate when linear latent space models fail. In this task we are given a dataset with example relations and the model is to capture the distribution that generated them. A non-linear dataset is constructed using the XOR (exclusive or) relation. We collect 25 positive examples of each assignment of the binary relation (triplets of the form (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), corresponding to 0 XOR 1 = 1 and so on). We then maximise the variational lower bound using RMSPROP for both the linear and non-linear models with 20 samples for the Monte Carlo integration. We add four more triplets to the dataset: (0, 0, ?), (0, 1, ?), (1, 0, ?), (1, 1, ?). We evaluate the probabilities the models assign to each of the missing values (also known as imputation) and report the results.

We assessed the test-set perplexity repeating the experiment 3 times and averaging the results. The linear model obtained a test-set perplexity (with standard deviation) of 75.785 ± 13.221, whereas the non-linear model obtained a test-set perplexity of 1.027 ± 0.016. A perplexity of 1 corresponds to always predicting correctly.

During optimisation the linear model jumps between different local modes, trying to capture all four possible triplets (fig. 2). The linear model assigns probabilities to the missing values by capturing some of the triplets well, but cannot assign high probability to the others. In contrast, the CLGP model is able to capture all possible values of the relation. Sampling probability vectors from the latent variational posterior for both models, we get a histogram of the posterior distribution (fig. 3). As can be seen, the CLGP model is able to fully capture the distribution whereas the linear model is incapable of it.

Figure 2. Density over the latent space for XOR as predicted by the linear model (top, LGM), and non-linear model (bottom, CLGP). Each figure shows the probability p(y_d = 1 | x) as a function of x, for d = 0, 1, 2 (first, second, and third digits in the triplet, left to right). The darker the shade of green, the higher the probability. In blue are the latents corresponding to the training points, in yellow are the latents corresponding to the four partially observed test points, and in red are the inducing points used to support the function.

Figure 3. Histogram of categorical values for XOR (encoded in binary for the 8 possible values) for samples drawn from the posterior of the latent space of the linear model (left, LGM), the non-linear model (middle, CLGP), and the data used for training (right).

5.1.2 Relation Learning – Complex Example

We repeat the previous experiment with a more complex relation. We generated 1000 strings from a simple probabilistic context free grammar with non-terminals {α, β, A, B, C}, start symbol α, terminals {a, b, c, d, e, f, g}, and derivation rules:

α → Aβ [1.0]
β → BA [0.5] | C [0.5]
A → a [0.5] | b [0.3] | c [0.2]
B → d [0.7] | e [0.3]
C → f [0.7] | g [0.3]

where we give the probability of following a derivation rule inside square brackets. Following the start symbol and the derivation rules we generate strings of the form ada, af, ceb, and so on. We add two "start" and "end" symbols s, s to each string, obtaining strings of the form sscebss. We then extract triplets of consecutive letters from these strings, resulting in training examples of the form ssc, sce, ..., bss. This is common practice in text preprocessing in the natural language community. It allows us to generate relations from scratch by conditioning on the prefix ss to generate the first non-terminal (say a), then conditioning on the prefix sa, and continuing in this vein iteratively. Note that a simple histogram based approach will do well on this dataset (and the previous dataset) as it is not sparse.
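A small sketch of the data generation just described, with the grammar probabilities taken from the text (the sampling and triplet-extraction code itself is illustrative, not the authors'):

import random

RULES = {
    'alpha': [(['A', 'beta'], 1.0)],
    'beta':  [(['B', 'A'], 0.5), (['C'], 0.5)],
    'A':     [(['a'], 0.5), (['b'], 0.3), (['c'], 0.2)],
    'B':     [(['d'], 0.7), (['e'], 0.3)],
    'C':     [(['f'], 0.7), (['g'], 0.3)],
}

def derive(symbol='alpha'):
    # Expand non-terminals recursively according to the rule probabilities.
    if symbol not in RULES:
        return symbol
    expansions, probs = zip(*RULES[symbol])
    rhs = random.choices(expansions, weights=probs)[0]
    return ''.join(derive(s) for s in rhs)

def triplets(n_strings=1000):
    # Pad each string with start/end symbols and extract consecutive triplets.
    data = []
    for _ in range(n_strings):
        s = 'ss' + derive() + 'ss'
        data.extend(s[i:i + 3] for i in range(len(s) - 2))
    return data

print(triplets(5)[:8])   # e.g. ['ssa', 'sad', 'ada', 'das', 'ass', ...]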
We compared the test-set perplexity of the non-linear CLGP (with 50 inducing points) to that of the linear LGM on these training inputs, repeating the same experiment set-up as before. We impute one missing value at random in each test string, using 20% of the strings as a test set with 3 different splits. The results are presented in table 1. The linear model cannot capture this data well, and seems to get very confident about misclassified values (resulting in very high test-set perplexity).

Split    LGM                CLGP
1        7.898 ± 3.220      2.769 ± 0.070
2        26.477 ± 23.457    2.958 ± 0.265
3        6.748 ± 3.028      2.797 ± 0.081

Table 1. Test-set perplexity on relational data. Compared are the linear LGM and the non-linear CLGP.

5.2 Sparse Small Data

We assess the model in the real-world domain of small sparse data. We compare the CLGP model to a histogram based approach, demonstrating the difficulty with frequency models for sparse data. We further compare our model to the linear LGM.

5.2.1 Medical Diagnosis

We use the Wisconsin breast cancer dataset, for which obtaining samples is a long and expensive process (Zwitter & Soklic, 1988). The dataset is composed of 683 data points, with 9 categorical variables taking values between 1 and 10, and an additional categorical variable taking 0, 1 values – 2 × 10^9 possible configurations. We use three quarters of the dataset for training and leave the rest for testing, averaging the test-set perplexity over three repetitions of the experiment. We use three different random splits of the dataset as the distribution of the data can be fairly different among different splits. In the test set we randomly remove one of the 10 categorical values, and test the models' ability to recover that value. Note that this is a harder task than the usual use of this dataset for binary classification. We use the same model set-up as in the first experiment.

Split    Baseline    Multinomial    Uni-Dir-Mult    Bi-Dir-Mult    LGM              CLGP
1        8.68        4.41           4.41            3.41           3.57 ± 0.208     2.86 ± 0.119
2        8.68        ∞              4.42            3.49           3.47 ± 0.252     3.36 ± 0.186
3        8.85        4.64           4.64            3.67           12.13 ± 9.705    3.34 ± 0.096

Table 2. Test-set perplexity on the breast cancer dataset, predicting randomly missing categorical test results. The models compared are: Baseline – predicting uniform probability for all values, Multinomial – predicting the probability for a missing value based on its frequency, Uni-Dir-Mult – Unigram Dirichlet Multinomial with concentration parameter α = 0.01, Bi-Dir-Mult – Bigram Dirichlet Multinomial with concentration parameter α = 1, LGM, and the proposed model (CLGP).

We compare our model (CLGP) to a baseline predicting uniform probability for all values (Baseline), a multinomial model predicting the probability for a missing value based on its frequency (Multinomial), a unigram Dirichlet Multinomial model with concentration parameter α = 0.01 (Uni-Dir-Mult), a bigram Dirichlet Multinomial with concentration parameter α = 1 (Bi-Dir-Mult), and the linear continuous space model (LGM). More complicated frequency based approaches are possible, performing variable selection and then looking at frequencies of triplets of variables. These will be very difficult to work with in this sparse small data problem.

We evaluate the models' performance using the test-set perplexity metric as before (table 2). As can be seen, the frequency based approaches obtain worse results than the continuous latent space approaches. The frequency model with no smoothing obtains perplexity ∞ for split 2 because one of the test points has a value not observed before in the training set. Using smoothing solves this. The baseline (predicting uniformly) obtains the highest perplexity on average. The linear model exhibits high variance for the last split, and in general has a higher perplexity standard deviation than the non-linear model.
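For reference, a sketch of the unigram Dirichlet-Multinomial predictive rule used as a baseline above, assuming the standard symmetric add-α smoothing (the bigram variant conditions the counts on the value of a second variable):

import numpy as np

def dirichlet_multinomial_predict(train_column, n_values, alpha=0.01):
    # Predictive probability of each value of one categorical variable,
    # with a symmetric Dirichlet(alpha) prior smoothing the raw counts.
    counts = np.bincount(train_column, minlength=n_values).astype(float)
    return (counts + alpha) / (counts.sum() + alpha * n_values)

# With alpha = 0 this reduces to the un-smoothed Multinomial baseline,
# which assigns probability 0 (and hence perplexity infinity) to unseen values.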
5.2.2 Terror Warning Effects on Political Attitude

We further compare the models on raw data collected in an experimental study of the effects of government terror warnings on political attitudes (from the START Terrorism Data Archive Dataverse³). This dataset consists of 1282 cases and 49 variables. We used 1000 cases for training and 282 for test. From the 49 variables, we used the 17 corresponding to categorical answers to study questions (the other variables are subjects' geographic information and an answer to an open question). The 17 variables take 5 or 6 possible values each, with some of the values missing.

We repeat the same experiment set-up as before, with a 6 dimensional latent space, 100 inducing points, 5 samples to estimate the lower bound, and running the optimiser for 1000 iterations. We compare a Baseline using a uniform distribution for all values, a Multinomial model using the frequencies of the individual variables, the linear LGM, and the CLGP. The results are given in table 3. On the training set, the CLGP obtained a training error of 1.56, and the LGM obtained a training error of 1.34. The linear model seems to over-fit to the data.

Baseline    Multinomial    LGM     CLGP
2.50        2.20           3.07    2.11

Table 3. Test-set perplexity for terror warning effects on political attitude. Compared are the linear LGM and the non-linear CLGP.

³ Obtained from the Harvard Dataverse Network: thedata.harvard.edu/dvn/dv/start/faces/study/StudyPage.xhtml?studyId=71190

5.2.3 Handwritten Binary Alphadigits

We evaluate the performance of our model on a sparse dataset with a large number of dimensions. The dataset, Binary Alphadigits, is composed of 20 × 16 binary images of 10 handwritten digits and 26 handwritten capital letters, each class with 39 images⁴. We resize each image to 10 × 8, and obtain a dataset of 1404 data points, each with 80 binary variables. We repeat the same experiment set-up as before with 2 latent dimensions for ease of visual comparison of the latent embeddings. Fig. 4a shows an example of each alphadigit class. Each class is then randomly split into 30 training and 9 test examples. In the test set, we randomly remove 20% of the pixels and evaluate the prediction error.

Fig. 4b shows the training and test error in log perplexity for both models. LGM converges faster than CLGP but ends up with a much higher prediction error due to its limited modeling capacity with 2 dimensional latent variables. This is validated with the latent embeddings shown in fig. 4c and 4d. Each color-marker combination denotes one alphadigit class. As can be seen, CLGP has a better separation of the different classes than LGM, even though the class labels are not used in the training.

⁴ Obtained from http://www.cs.nyu.edu/~roweis/data.html


Figure 4. Binary alphadigit dataset. (a) Example of each class. (b) Train and test log perplexity for 1000 iterations. (c, d) 2-D latent embedding of the 36 alphadigit classes with LGM (left) and the CLGP model (right). Each color-marker combination denotes one class.

5.3 Latent Gaussian Model Over-fitting

It is interesting to note that the latent Gaussian model (LGM) over-fits on the different datasets. It is possible to attribute this to the lack of regularisation over the linear transformation – the weight matrix used to transform the latent space to the Softmax weights is optimised without a prior. In all medical diagnosis experiment repetitions, for example, the model's training log perplexity decreases while the test log perplexity starts increasing (see fig. 5 for the train and test log perplexities of split 1 of the breast cancer dataset). Note that even though the test log perplexity starts increasing, at its lowest point it is still higher than the final test log perplexity of the CLGP model. This is observed for all splits and all repetitions.

It is worth noting that we compared the CLGP to the LGM on the ASES survey dataset (Inoguchi, 2008) used to assess the LGM in (Khan et al., 2012). We obtained a test perplexity of 1.98, compared to LGM's test perplexity of 1.97. We concluded that the dataset was fairly linearly separable.

5.4 Inference Robustness

Lastly, we inspect the robustness of our inference, evaluating the Monte Carlo (MC) estimate standard deviation as optimisation progresses. Fig. 6 shows the ELBO and MC standard deviation per iteration (on log scale) for the Alphadigit dataset. It seems that the estimator standard deviation decreases as the approximating variational distribution fits to the posterior. This makes the stochastic optimisation easier, speeding up convergence. A further theoretical study of this behaviour in future research will be of interest.

6 Discussion and Conclusions

We have presented the first Bayesian model capable of capturing sparse multi-modal categorical distributions based on a continuous representation. This model ties together many existing models in the field, linking the linear and discrete latent Gaussian models to the non-linear continuous space Gaussian process latent variable model and to the fully observed discrete Gaussian process classification.

In future work we aim to answer short-comings in the current model such as scalability and robustness. We scale the model following research on GP scalability done in (Hensman et al., 2013; Gal et al., 2014). The robustness of the model depends on the sample variance in the Monte Carlo integration. As discussed in Blei et al. (2012), variance reduction techniques can help in the estimation of the integral, and methods such as the one developed in Wang et al. (2013) can effectively increase inference robustness.

Figure 5. Train and test log perplexities for LGM (left) and the CLGP model (right) for one of the splits of the breast cancer dataset. The train log perplexity of LGM decreases while the test log perplexity starts increasing at iteration 50.

Figure 6. ELBO and Monte Carlo estimate standard deviation per iteration (on log scale) for the Alphadigits dataset.

References

Bengio, Yoshua, Schwenk, Holger, Senécal, Jean-Sébastien, Morin, Fréderic, and Gauvain, Jean-Luc. Neural probabilistic language models. In Innovations in Machine Learning, pp. 137–186. Springer, 2006.

Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.

Blei, David and Lafferty, John. Correlated topic models. Advances in Neural Information Processing Systems, 18:147, 2006.

Blei, David M, Jordan, Michael I, and Paisley, John W. Variational Bayesian inference with stochastic search. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1367–1374, 2012.

Böhning, Dankmar. Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics, 44(1):197–200, 1992.

Collobert, Ronan and Weston, Jason. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 160–167. ACM, 2008.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

Gal, Yarin, van der Wilk, Mark, and Rasmussen, Carl E. Distributed variational inference in sparse Gaussian process regression and latent variable models. In Advances in Neural Information Processing Systems 27, 2014.

Hensman, James, Fusi, Nicolo, and Lawrence, Neil D. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.

Inoguchi, Takashi. Asia Europe survey (ASES): A multinational comparative study in 18 countries, 2001. In Inter-university Consortium for Political and Social Research (ICPSR), 2008.

Jaakkola, Tommi S. and Jordan, Michael I. A variational approach to Bayesian logistic regression models and their extensions. In Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, 1997.

Khan, Mohammad E, Mohamed, Shakir, Marlin, Benjamin M, and Murphy, Kevin P. A stick-breaking likelihood for categorical data analysis with latent Gaussian models. In International Conference on Artificial Intelligence and Statistics, pp. 610–618, 2012.

Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Knowles, David A and Minka, Tom. Non-conjugate variational message passing for multinomial and binary regression. In Advances in Neural Information Processing Systems, pp. 1701–1709, 2011.

Lawrence, Neil. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. The Journal of Machine Learning Research, 6:1783–1816, 2005.

Marlin, Benjamin M, Khan, Mohammad Emtiyaz, and Murphy, Kevin P. Piecewise bounds for estimating Bernoulli-logistic latent Gaussian models. In ICML, pp. 633–640, 2011.

Paccanaro, Alberto and Hinton, Geoffrey E. Learning distributed representations of concepts using linear relational embedding. Knowledge and Data Engineering, IEEE Transactions on, 13(2):232–244, 2001.

Quiñonero-Candela, Joaquin and Rasmussen, Carl Edward. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959, 2005.

Rezende, Danilo J, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1278–1286, 2014.

Schaul, Tom, Antonoglou, Ioannis, and Silver, David. Unit tests for stochastic optimization. arXiv:1312.6055, 2013. URL http://arxiv.org/abs/1312.6055.

Tieleman, T. and Hinton, G. Lecture 6.5 – RMSProp, COURSERA: Neural Networks for Machine Learning, 2012.

Tipping, Michael E and Bishop, Christopher M. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.

Titsias, Michalis and Lawrence, Neil. Bayesian Gaussian process latent variable model. In Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

Titsias, Michalis and Lázaro-Gredilla, Miguel. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1971–1979, 2014.

Titsias, Michalis K. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics 12, pp. 567–574, 2009.

Wang, Chong, Chen, Xi, Smola, Alex, and Xing, Eric. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, pp. 181–189, 2013.

Williams, Christopher KI and Rasmussen, Carl Edward. Gaussian Processes for Machine Learning. The MIT Press, 2(3):4, 2006.

Zeiler, Matthew D. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Zwitter, M. and Soklic, M. Breast cancer dataset. University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia, 1988.

A Appendix

A.1 Code

The basic model and inference (without covariance matrix caching) can be implemented in about 20 lines of Python and Theano for each categorical variable:

import theano
import theano.tensor as T
# m, s, mu, L, Z, Y, sf2, l, and the sizes N, Q, M, K are Theano variables / shared
# variables defined elsewhere; randn(...) denotes a standard-normal random stream.
m = T.dmatrix('m')                                  # ... and other variables
X = m + s * randn(N, Q)                             # eq. (8): reparameterised latents
U = mu + L.dot(randn(M, K))                         # eq. (9): reparameterised inducing outputs
Kmm = RBF(sf2, l, Z)                                # covariance on inducing inputs
Kmn = RBF(sf2, l, Z, X)                             # cross-covariance
Knn = RBFnn(sf2, l, X)                              # diagonal of K(X, X)
KmmInv = T.matrix_inverse(Kmm)
A = KmmInv.dot(Kmn)                                 # a_nd, eq. (6)
B = Knn - T.sum(Kmn * KmmInv.dot(Kmn), 0)           # b_nd, eq. (6)
F = A.T.dot(U) + B[:, None] ** 0.5 * randn(N, K)    # eq. (10)
S = T.nnet.softmax(F)
KL_U, KL_X = get_KL_U(), get_KL_X()
LS = T.sum(T.log(T.sum(Y * S, 1))) - KL_U - KL_X    # eq. (11) for one categorical variable
LS_func = theano.function(['''inputs'''], LS)
dLS_dm = theano.function(['''inputs'''], T.grad(LS, m))  # and others
# ... and optimise LS with RMSPROP