Learning Deep Sigmoid Belief Networks with Data Augmentation

Zhe Gan, Ricardo Henao, David Carlson, Lawrence Carin
Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA

Abstract

Deep directed generative models are developed. The multi-layered model is designed by stacking sigmoid belief networks, with sparsity-encouraging priors placed on the model parameters. Learning and inference of layer-wise model parameters are implemented in a Bayesian setting. By exploring the idea of data augmentation and introducing auxiliary Pólya-Gamma variables, simple and efficient Gibbs sampling and mean-field variational Bayes (VB) inference are implemented. To address large-scale datasets, an online version of VB is also developed. Experimental results are presented for three publicly available datasets: MNIST, Caltech 101 Silhouettes and OCR letters.

1 Introduction

The Deep Belief Network (DBN) (Hinton et al., 2006) and Deep Boltzmann Machine (DBM) (Salakhutdinov and Hinton, 2009) are two popular deep probabilistic generative models that provide state-of-the-art results in many problems. These models contain many layers of hidden variables, and utilize an undirected graphical model called the Restricted Boltzmann Machine (RBM) (Hinton, 2002) as the building block. A nice property of the RBM is that gradient estimates on the model parameters are relatively quick to calculate, and stochastic gradient descent provides relatively efficient inference. However, evaluating the probability of a data point under an RBM is nontrivial due to the computationally intractable partition function, which has to be estimated, for example using an annealed importance sampling algorithm (Salakhutdinov and Murray, 2008).

A directed graphical model that is closely related to these models is the Sigmoid Belief Network (SBN) (Neal, 1992). The SBN has a fully generative process and data are readily generated from the model using ancestral sampling. However, it has been noted that training a deep directed generative model is difficult, due to the "explaining away" effect. Hinton et al. (2006) tackle this problem by introducing the idea of "complementary priors" and show that the RBM provides a good initialization to the DBN, which has the same generative model as the SBN for all layers except the two top hidden layers. In the work presented here we directly deal with training and inference in SBNs (without RBM initialization), using recently developed methods in the Bayesian statistics literature.

Previous work on SBNs utilizes the ideas of Gibbs sampling (Neal, 1992) and mean field approximations (Saul et al., 1996). Recent work focuses on extending the wake-sleep algorithm (Hinton et al., 1995) to training fast variational approximations for the SBN (Mnih and Gregor, 2014). However, almost all previous work assumes no prior on the model parameters which connect different layers. An exception is the work of Kingma and Welling (2013), but this is mentioned as an extension of their primary work. Previous Gibbs sampling and variational inference procedures are implemented only on the hidden variables, while gradient ascent is employed to learn good model parameter values. The typical regularization on the model parameters is early stopping and/or L2 regularization. In an SBN, the model parameters are not straightforwardly locally conjugate, and therefore fully Bayesian inference has been difficult.

The work presented here provides a method for placing priors on the model parameters, and presents a simple Gibbs sampling algorithm, by extending recent work on data augmentation for Bayesian logistic regression (Polson et al., 2013). More specifically, a set of Pólya-Gamma variables is used for each observation, to reformulate the logistic likelihood as a scale mixture, where each mixture component is conditionally normal with respect to the model parameters. Efficient mean-field variational learning and inference are also developed, to optimize a data-augmented variational lower bound; this approach can be scaled up to large datasets. Utilizing these methods, sparsity-encouraging priors are placed on the model parameters and the posterior distribution of the model parameters is estimated (not simply a point estimate). Based on extensive experiments, we provide a detailed analysis of the performance of the proposed method.

2 Model formulation

2.1 Sigmoid Belief Networks

Deep directed generative models are considered for binary data, based on the Sigmoid Belief Network (SBN) (Neal, 1992); using methods like those discussed in Salakhutdinov et al. (2013), the model may be readily extended to other data types. Each visible unit $v_{jn}\in\{0,1\}$ is a Bernoulli random variable whose probability of being one is a logistic function of a linear combination of the binary hidden units, $p(v_{jn}=1\mid\boldsymbol{h}_n)=\sigma(\boldsymbol{w}_j^\top\boldsymbol{h}_n+c_j)$, with hidden-unit prior $p(h_{kn}=1)=\sigma(b_k)$, where $\sigma(x)=1/(1+e^{-x})$. Unlike in the RBM, the posterior over the hidden units is not factorial given the set of weights W, but a simple partition function is obtained. Therefore, the full likelihood under an SBN is trivial to calculate. Furthermore, SBNs explicitly exhibit the generative process to obtain data, in which the hidden layer provides a directed "explanation" for patterns generated in the visible layer.
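To make the generative process concrete, below is a minimal sketch of ancestral sampling from a one-hidden-layer SBN, assuming the Bernoulli-logistic conditionals above. The variable names follow the notation of this section, but the code is illustrative and not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_sbn(W, b, c, n_samples, rng=np.random.default_rng(0)):
    """Ancestral sampling from a one-hidden-layer SBN.

    W : (J, K) weights connecting hidden to visible units
    b : (K,)   hidden-unit biases,  p(h_k = 1) = sigmoid(b_k)
    c : (J,)   visible-unit biases, p(v_j = 1 | h) = sigmoid(w_j^T h + c_j)
    """
    K, J = b.shape[0], c.shape[0]
    H = (rng.random((n_samples, K)) < sigmoid(b)).astype(float)            # h ~ Bernoulli(sigmoid(b))
    V = (rng.random((n_samples, J)) < sigmoid(H @ W.T + c)).astype(float)  # v | h
    return V, H

# toy usage with random parameters
rng = np.random.default_rng(0)
W, b, c = rng.normal(scale=0.1, size=(784, 200)), np.zeros(200), np.zeros(784)
V, H = sample_sbn(W, b, c, n_samples=5)
```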

2.2 Autoregressive Structure

Instead of assuming that the visible variables in an SBN are conditionally independent given the hidden units, a more flexible model can be built by using an autoregressive structure. The autoregressive sigmoid belief network (ARSBN) (Gregor et al., 2014) is an SBN with within-layer dependency captured by a fully connected directed acyclic graph, where each unit $x_j$ can be predicted by its parent units $x_{<j}$; that is, the conditional for each visible unit takes the form $p(v_{jn}=1\mid\boldsymbol{h}_n,\boldsymbol{v}_{<j,n})$.
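As a rough illustration (not necessarily the paper's exact parameterization), the sketch below assumes the autoregressive dependency enters additively in the logit through a strictly lower-triangular matrix S; the name S and this particular form are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_arsbn_visible(W, S, c, h, rng=np.random.default_rng(0)):
    """Sample the visible layer of an ARSBN-style model given hidden units h.

    Assumed form: p(v_j = 1 | h, v_{<j}) = sigmoid(w_j^T h + s_{j,<j}^T v_{<j} + c_j),
    with S strictly lower triangular so that v_j depends only on its parents v_{<j}.
    """
    J = c.shape[0]
    v = np.zeros(J)
    for j in range(J):                       # units are generated in a fixed order
        logit = W[j] @ h + S[j, :j] @ v[:j] + c[j]
        v[j] = float(rng.random() < sigmoid(logit))
    return v
```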


2.3 Deep Sigmoid Belief Networks

Let $\{h_{1n}^{(l)}, h_{2n}^{(l)}, \ldots, h_{K_l n}^{(l)}\}$ represent the set of hidden units for observation $n$ in layer $l$. For the top layer, the prior probability can be written as $p(h_{kn}^{(L)}=1)=\sigma(c_k^{(L+1)})$, where $c_k^{(L+1)}\in\mathbb{R}$. Defining $\boldsymbol{v}_n=\boldsymbol{h}_n^{(0)}$, conditioned on the hidden units $\boldsymbol{h}_n^{(l)}$, the hidden units at layer $l-1$ are drawn from

$p(h_{kn}^{(l-1)}=1\mid\boldsymbol{h}_n^{(l)}) = \sigma\big((\boldsymbol{w}_k^{(l)})^\top\boldsymbol{h}_n^{(l)} + c_k^{(l)}\big)$ ,   (8)

where $\mathbf{W}^{(l)} = [\boldsymbol{w}_1^{(l)},\ldots,\boldsymbol{w}_{K_{l-1}}^{(l)}]^\top$ connects layers $l$ and $l-1$, and $\boldsymbol{c}^{(l)} = [c_1^{(l)},\ldots,c_{K_{l-1}}^{(l)}]^\top$ is the bias term.
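The layered construction in (8) can be sampled top-down, extending the one-layer sketch above. A minimal sketch, assuming per-layer weight matrices of shape $(K_{l-1}, K_l)$ and bias vectors as defined here (illustrative only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_deep_sbn(weights, biases, top_bias, rng=np.random.default_rng(0)):
    """Top-down ancestral sampling from a deep SBN, following Eq. (8).

    weights : list [W^(1), ..., W^(L)], W^(l) has shape (K_{l-1}, K_l)
    biases  : list [c^(1), ..., c^(L)], c^(l) has shape (K_{l-1},)
    top_bias: c^(L+1), shape (K_L,), defining the prior on the top layer
    Returns the visible layer v = h^(0) and all sampled layers (top to bottom).
    """
    h = (rng.random(top_bias.shape[0]) < sigmoid(top_bias)).astype(float)   # top layer h^(L)
    layers = [h]
    for W, c in zip(reversed(weights), reversed(biases)):                   # l = L, ..., 1
        prob = sigmoid(W @ h + c)                                           # p(h^(l-1) = 1 | h^(l))
        h = (rng.random(prob.shape[0]) < prob).astype(float)
        layers.append(h)
    return layers[-1], layers    # layers[-1] is v = h^(0)
```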

2.4 Bayesian sparsity shrinkage prior

The learned features are often expected to be sparse. In imagery, for example, features learned at the bottom layer tend to be localized, oriented edge filters which are similar to the Gabor functions known to model V1 cell receptive fields (Lee et al., 2008).

Under the Bayesian framework, sparsity-encouraging priors can be specified in a principled way. Some canonical examples are the spike-and-slab prior, the Student's-t prior, the double exponential prior and the horseshoe prior (see Polson and Scott (2012) for a discussion of these priors). The three parameter beta normal (TPBN) prior (Armagan et al., 2011), a typical global-local shrinkage prior, has demonstrated better (mixing) performance than the aforementioned priors, and is therefore employed in this paper. The TPBN shrinkage prior can be expressed as scale mixtures of normals. If $W_{jk}\sim\mathrm{TPBN}(a,b,\phi)$, where $j=1,\ldots,J$, $k=1,\ldots,K$ (leaving off the dependence on the layer $l$ for notational convenience), then

$W_{jk} \sim \mathcal{N}(0,\zeta_{jk})$ , $\quad \zeta_{jk}\sim\mathrm{Gamma}(a,\xi_{jk})$ , $\quad \xi_{jk}\sim\mathrm{Gamma}(b,\phi_k)$ ,   (9)
$\phi_k\sim\mathrm{Gamma}(1/2,\omega)$ , $\quad \omega\sim\mathrm{Gamma}(1/2,1)$ .

When $a=b=\tfrac{1}{2}$, the TPBN recovers the horseshoe prior. For fixed values of $a$ and $b$, decreasing $\phi$ encourages more support for stronger shrinkage. In high-dimensional settings, $\phi$ can be fixed at a reasonable value to reflect an appropriate expected sparsity, rather than inferring it from data.

Finally, to build up the fully generative model, commonly used isotropic normal priors are imposed on the bias terms $\boldsymbol{b}$ and $\boldsymbol{c}$, i.e. $\boldsymbol{b}\sim\mathcal{N}(0,\nu_b\mathbf{I}_K)$, $\boldsymbol{c}\sim\mathcal{N}(0,\nu_c\mathbf{I}_J)$.

Note that when performing model learning, we truncate the number of hidden units at each layer at $K$, which may be viewed as an upper bound within the model on the number of units at each layer. With the aforementioned shrinkage on $\mathbf{W}$, the model has the capacity to infer the subset of units (possibly fewer than $K$) actually needed to represent the data.

3 Learning and inference

In this section, Gibbs sampling and mean field variational inference are derived for the sigmoid belief networks, based on data augmentation. From the perspective of learning, we desire distributions on the model parameters $\{\mathbf{W}^{(l)}\}$ and $\{\boldsymbol{c}^{(l)}\}$; distributions on the data-dependent $\{\boldsymbol{h}_n^{(l)}\}$ are desired in the context of inference. The extension to ARSBN is straightforward and provided in Supplemental Section C. We again omit the layer index $l$ in the discussion below.

3.1 Gibbs sampling

Define $\mathbf{V}=[\boldsymbol{v}_1,\ldots,\boldsymbol{v}_N]$ and $\mathbf{H}=[\boldsymbol{h}_1,\ldots,\boldsymbol{h}_N]$. From recent work on the Pólya-Gamma data augmentation strategy (Polson et al., 2013), if $\gamma\sim\mathrm{PG}(b,0)$, $b>0$, then

$\dfrac{(e^{\psi})^a}{(1+e^{\psi})^b} = 2^{-b} e^{\kappa\psi} \int_0^{\infty} e^{-\gamma\psi^2/2}\, p(\gamma)\,\mathrm{d}\gamma$ ,   (10)

where $\kappa = a - b/2$. Properties of the Pólya-Gamma variables are summarized in Supplemental Section B. Therefore, the data-augmented joint posterior of the SBN model can be expressed as

$p(\mathbf{W},\mathbf{H},\boldsymbol{b},\boldsymbol{c},\boldsymbol{\gamma}^{(0)},\boldsymbol{\gamma}^{(1)}\mid\mathbf{V}) \propto \exp\Big\{\sum_{j,n}\big[(v_{jn}-\tfrac{1}{2})(\boldsymbol{w}_j^\top\boldsymbol{h}_n+c_j)-\tfrac{1}{2}\gamma_{jn}^{(0)}(\boldsymbol{w}_j^\top\boldsymbol{h}_n+c_j)^2\big]\Big\}$   (11)
$\qquad\cdot\exp\Big\{\sum_{k,n}\big[(h_{kn}-\tfrac{1}{2})b_k-\tfrac{1}{2}\gamma_k^{(1)}b_k^2\big]\Big\}\cdot p_0(\boldsymbol{\gamma}^{(0)},\boldsymbol{\gamma}^{(1)},\mathbf{W},\boldsymbol{b},\boldsymbol{c})$ ,

where $\boldsymbol{\gamma}^{(0)}\in\mathbb{R}^{J\times N}$ and $\boldsymbol{\gamma}^{(1)}\in\mathbb{R}^{K}$ are augmented random variables drawn from the Pólya-Gamma distribution. The term $p_0(\boldsymbol{\gamma}^{(0)},\boldsymbol{\gamma}^{(1)},\mathbf{W},\boldsymbol{b},\boldsymbol{c})$ contains the prior information of the random variables within. Let $p(\cdot\mid-)$ represent the conditional distribution given the other parameters fixed; the conditional distributions used in Gibbs sampling are as follows.

For $\boldsymbol{\gamma}^{(0)},\boldsymbol{\gamma}^{(1)}$: The conditional distribution of $\gamma_{jn}^{(0)}$ is

$p(\gamma_{jn}^{(0)}\mid-) \propto \exp\big(-\tfrac{1}{2}\gamma_{jn}^{(0)}(\boldsymbol{w}_j^\top\boldsymbol{h}_n+c_j)^2\big)\cdot\mathrm{PG}(\gamma_{jn}^{(0)}\mid 1,0) = \mathrm{PG}(1, \boldsymbol{w}_j^\top\boldsymbol{h}_n+c_j)$ ,   (12)

where $\mathrm{PG}(\cdot,\cdot)$ represents the Pólya-Gamma distribution. Similarly, we can obtain $p(\gamma_k^{(1)}\mid-)=\mathrm{PG}(1,b_k)$. To draw samples from the Pólya-Gamma distribution, two strategies are utilized: (i) using rejection sampling to draw samples from the closely related exponentially tilted Jacobi distribution (Polson et al., 2013); (ii) using a truncated sum of random variables from the Gamma distribution, and then matching the first moment to keep the samples unbiased (Zhou et al., 2012). Typically, a truncation level of 20 works well in practice.
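A minimal sketch of strategy (ii) is shown below, assuming the standard infinite-sum representation of the Pólya-Gamma distribution from Polson et al. (2013); the additive first-moment correction follows the description above, but the exact correction used by the authors may differ.

```python
import numpy as np

def sample_pg(b, c, rng, truncation=20):
    """Approximate draw from PG(b, c) via a truncated sum of Gamma variables.

    Uses the representation  gamma = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    g_k ~ Gamma(b, 1), truncated at `truncation` terms, then adds the expected value of the
    discarded tail so that the first moment matches E[PG(b, c)] = (b / (2c)) * tanh(c / 2).
    """
    k = np.arange(1, truncation + 1)
    denom = (k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2)
    g = rng.gamma(shape=b, scale=1.0, size=truncation)
    draw = (g / denom).sum() / (2.0 * np.pi ** 2)

    # exact mean of PG(b, c); the c -> 0 limit is b / 4
    mean_full = b / 4.0 if abs(c) < 1e-9 else b * np.tanh(c / 2.0) / (2.0 * c)
    mean_trunc = (b / denom).sum() / (2.0 * np.pi ** 2)
    return draw + (mean_full - mean_trunc)   # moment-matched correction for the truncated tail

rng = np.random.default_rng(0)
gamma_jn = sample_pg(1.0, 0.8, rng)          # e.g. gamma_jn^(0) ~ PG(1, w_j^T h_n + c_j)
```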

For $\mathbf{H}$: The sequential update of the local conditional distribution of $\mathbf{H}$ is $p(h_{kn}\mid-)=\mathrm{Ber}(\sigma(d_{kn}))$, where

$d_{kn} = b_k + \boldsymbol{w}_k^\top\boldsymbol{v}_n - \dfrac{1}{2}\sum_{j=1}^{J}\big[w_{jk} + \gamma_{jn}^{(0)}\big(2\psi_{jn}^{\backslash k} w_{jk} + w_{jk}^2\big)\big]$ ,

where $\psi_{jn}^{\backslash k} = \boldsymbol{w}_j^\top\boldsymbol{h}_n - w_{jk}h_{kn} + c_j$. Note that $\boldsymbol{w}_k$ and $\boldsymbol{w}_j$ represent the $k$th column and the transpose of the $j$th row of $\mathbf{W}$, respectively. The difference between an SBN and an RBM can be seen more clearly from this sequential update. Specifically, in an RBM the update of $h_{kn}$ only contains the first two terms, which implies that the updates of the $h_{kn}$ are independent of each other. In the SBN, the existence of the third term demonstrates clearly the posterior dependencies between hidden units. Although the rows of $\mathbf{H}$ are correlated, the columns of $\mathbf{H}$ are independent, therefore the sampling of $\mathbf{H}$ is still efficient.

For $\mathbf{W}$: The prior is a TPBN shrinkage prior with $p_0(\boldsymbol{w}_j)=\mathcal{N}(0,\mathrm{diag}(\boldsymbol{\zeta}_j))$; then we have $p(\boldsymbol{w}_j\mid-)=\mathcal{N}(\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)$, where

$\boldsymbol{\Sigma}_j = \Big[\sum_{n=1}^{N}\gamma_{jn}^{(0)}\boldsymbol{h}_n\boldsymbol{h}_n^\top + \mathrm{diag}(\boldsymbol{\zeta}_j^{-1})\Big]^{-1}$ ,   (13)

$\boldsymbol{\mu}_j = \boldsymbol{\Sigma}_j\Big[\sum_{n=1}^{N}\big(v_{jn}-\tfrac{1}{2}-c_j\gamma_{jn}^{(0)}\big)\boldsymbol{h}_n\Big]$ .   (14)

The updates of the bias terms $\boldsymbol{b}$ and $\boldsymbol{c}$ are similar to the above equations.

For TPBN shrinkage: One advantage of this hierarchical shrinkage prior is the full local conjugacy, which allows the Gibbs sampling to be easily implemented. Specifically, the following posterior conditional distributions can be obtained: (1) $\zeta_{jk}\mid-\sim\mathrm{GIG}(0,2\xi_{jk},W_{jk}^2)$; (2) $\xi_{jk}\mid-\sim\mathrm{Gamma}(1,\zeta_{jk}+\phi_k)$; (3) $\phi_k\mid-\sim\mathrm{Gamma}(\tfrac{1}{2}J+\tfrac{1}{2},\,\omega+\sum_{j=1}^{J}\xi_{jk})$; (4) $\omega\mid-\sim\mathrm{Gamma}(\tfrac{1}{2}K+\tfrac{1}{2},\,1+\sum_{k=1}^{K}\phi_k)$, where GIG denotes the generalized inverse Gaussian distribution.
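Putting the conditionals (12)-(14) and the sequential update for H together, a single illustrative Gibbs sweep might look like the sketch below; it reuses the `sigmoid` and `sample_pg` helpers from the earlier sketches, and the bias updates are omitted for brevity. This is a sketch under those assumptions, not the authors' code.

```python
import numpy as np

def gibbs_sweep(V, H, W, b, c, zeta, rng):
    """One illustrative Gibbs sweep for a one-layer SBN (bias updates omitted).

    V: (J, N) binary data, H: (K, N) binary hidden units,
    W: (J, K), b: (K,), c: (J,), zeta: (J, K) TPBN local variances.
    Requires sigmoid() and sample_pg() from the sketches above.
    """
    J, N = V.shape
    K = H.shape[0]

    # (12): gamma_jn^(0) ~ PG(1, psi_jn) with psi_jn = w_j^T h_n + c_j
    psi = W @ H + c[:, None]
    gamma0 = np.vectorize(lambda p: sample_pg(1.0, p, rng))(psi)

    # sequential update of H: p(h_kn = 1 | -) = sigmoid(d_kn)
    for k in range(K):
        psi_not_k = W @ H + c[:, None] - np.outer(W[:, k], H[k])       # psi^{\k}
        d_k = (b[k] + W[:, k] @ V
               - 0.5 * (W[:, k].sum()
                        + (gamma0 * (2 * psi_not_k * W[:, k][:, None]
                                     + (W[:, k] ** 2)[:, None])).sum(axis=0)))
        H[k] = (rng.random(N) < sigmoid(d_k)).astype(float)

    # (13)-(14): row-wise Gaussian update for W
    for j in range(J):
        Sigma_inv = (gamma0[j] * H) @ H.T + np.diag(1.0 / zeta[j])
        Sigma = np.linalg.inv(Sigma_inv)
        mu = Sigma @ (H @ (V[j] - 0.5 - c[j] * gamma0[j]))
        W[j] = rng.multivariate_normal(mu, Sigma)
    return H, W, gamma0
```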
3.2 Mean field variational Bayes

Using VB inference with the traditional mean field assumption, we approximate the posterior distribution with $Q = \prod_{j,k} q_{w_{jk}}(w_{jk}) \prod_{j,n} q_{h_{jn}}(h_{jn})\, q_{\gamma_{jn}^{(0)}}(\gamma_{jn}^{(0)})$; for notational simplicity, the terms concerning $\boldsymbol{b}$, $\boldsymbol{c}$, $\boldsymbol{\gamma}^{(1)}$ and the parameters of the TPBN shrinkage prior are omitted. The variational lower bound can be obtained as

$\mathcal{L} = \langle\log p(\mathbf{V}\mid\mathbf{W},\mathbf{H},\boldsymbol{c})\rangle + \langle\log p(\mathbf{W})\rangle + \langle\log p(\mathbf{H}\mid\boldsymbol{b})\rangle - \langle\log q(\mathbf{W})\rangle - \langle\log q(\mathbf{H})\rangle$ ,   (15)

where $\langle\cdot\rangle$ represents the expectation w.r.t. the variational approximate posterior.

Note that $\langle\log p(\mathbf{V}\mid-)\rangle = \sum_{j,n}\langle\log p(v_{jn}\mid-)\rangle$, and each term inside the summation can be further lower bounded by using the augmented Pólya-Gamma variables. Specifically, defining $\psi_{jn}=\boldsymbol{w}_j^\top\boldsymbol{h}_n+c_j$, we can obtain

$\langle\log p(v_{jn}\mid-)\rangle \geq -\log 2 + (v_{jn}-\tfrac{1}{2})\langle\psi_{jn}\rangle - \tfrac{1}{2}\langle\gamma_{jn}^{(0)}\rangle\langle\psi_{jn}^2\rangle + \langle\log p_0(\gamma_{jn}^{(0)})\rangle - \langle\log q(\gamma_{jn}^{(0)})\rangle$ ,   (16)

by using (10) and Jensen's inequality. Therefore, the new lower bound $\mathcal{L}'$ can be achieved by substituting (16) into (15). Note that this is a looser lower bound compared with the original lower bound $\mathcal{L}$, due to the data augmentation. However, closed-form coordinate ascent update equations can be obtained, shown below.

For $\boldsymbol{\gamma}^{(0)},\boldsymbol{\gamma}^{(1)}$: Optimizing $\mathcal{L}'$ over $q(\gamma_{jn}^{(0)})$, we have

$q(\gamma_{jn}^{(0)}) \propto \exp\big(-\tfrac{1}{2}\gamma_{jn}^{(0)}\langle\psi_{jn}^2\rangle\big)\cdot\mathrm{PG}(\gamma_{jn}^{(0)}\mid 1,0) = \mathrm{PG}\big(1,\sqrt{\langle\psi_{jn}^2\rangle}\big)$ .   (17)

Similarly, we can obtain $p(\gamma_k^{(1)}\mid-)=\mathrm{PG}\big(1,\sqrt{\langle b_k^2\rangle}\big)$. In the update of the other variational parameters, only the expectation of $\gamma_{jn}^{(0)}$ is needed, which can be calculated as $\langle\gamma_{jn}^{(0)}\rangle = \frac{1}{2\sqrt{\langle\psi_{jn}^2\rangle}}\tanh\big(\frac{\sqrt{\langle\psi_{jn}^2\rangle}}{2}\big)$. The variational distributions for the other parameters are in the exponential family, hence the update equations can be derived from the Gibbs sampling; these are straightforward and provided in Supplemental Section D.

In order to calculate the variational lower bound, the augmented Pólya-Gamma variables are integrated out, and the expectation of the logistic likelihood under the variational distribution is estimated by a Monte Carlo integration algorithm. In the experiments, 10 samples are used and were found sufficient in all cases considered.

The computational complexity of the above inference is $\mathcal{O}(NK^2)$, where $N$ is the total number of training data points. Every iteration of VB requires a full pass through the dataset, which can be slow when applied to large datasets. Therefore, an online version of VB inference is developed, building upon the recent online implementation of latent Dirichlet allocation (Hoffman et al., 2013). In online VB, stochastic optimization is applied to the variational objective. The key observation is that the coordinate ascent updates in VB precisely correspond to the natural gradient of the variational objective. To implement online VB, we subsample the data, compute the gradient estimate based on the subsamples, and follow the gradient with a decreasing step size.
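As an illustration of such an online update (in the spirit of Hoffman et al. (2013) and Eqs. (13)-(14), not the authors' exact implementation), the sketch below rescales minibatch sufficient statistics for $q(\boldsymbol{w}_j)$ as if the minibatch were the full dataset and blends them with the current natural parameters using a decreasing step size, e.g. $\rho_t=(t+\tau)^{-\kappa}$. The function name, the factorized Bernoulli $q(\boldsymbol{h}_n)$, and the argument layout are assumptions for illustration.

```python
import numpy as np

def svi_update_wj(Lambda_j, eta_j, Eh, Egamma, v_j, Ec_j, Ezeta_inv_j, N, rho):
    """Stochastic (natural-gradient) VB update for q(w_j) = N(mu_j, Sigma_j),
    parameterized by natural parameters Lambda_j = Sigma_j^{-1}, eta_j = Sigma_j^{-1} mu_j.

    Eh          : (K, S) means of q(h_n) for the S minibatch points
    Egamma      : (S,)   <gamma_jn^(0)> for the minibatch
    v_j         : (S,)   observed pixel j for the minibatch
    Ezeta_inv_j : (K,)   expected inverse TPBN variances <1 / zeta_jk>
    rho         : step size, e.g. (t + tau) ** (-kappa)
    """
    K, S = Eh.shape
    # <h_n h_n^T> for a factorized Bernoulli q(h_n): off-diagonal = product of means, diagonal = means
    Ehh = np.einsum('ks,ls->skl', Eh, Eh)
    Ehh[:, np.arange(K), np.arange(K)] = Eh.T

    # intermediate estimates, scaling the minibatch as if it were the full dataset (cf. Eqs. 13-14)
    Lambda_hat = (N / S) * np.einsum('s,skl->kl', Egamma, Ehh) + np.diag(Ezeta_inv_j)
    eta_hat = (N / S) * Eh @ (v_j - 0.5 - Ec_j * Egamma)

    # natural-gradient step: convex combination of old and intermediate natural parameters
    return (1 - rho) * Lambda_j + rho * Lambda_hat, (1 - rho) * eta_j + rho * eta_hat
```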

3.3 Learning deep networks using SBNs

Once one layer of the deep network is trained, the model parameters in that layer are frozen (at the mean of the inferred posterior) and we can utilize the inferred hidden units as the input "data" for the training of the next, higher layer. This greedy layer-wise pre-training algorithm has been shown to be effective for DBN (Hinton et al., 2006) and DBM (Salakhutdinov and Hinton, 2009) models, and is guaranteed to improve the data likelihood under certain conditions. In the training of a deep SBN, the same strategy is employed. After finishing pre-training (sequentially for all the layers), we then "un-freeze" all the model parameters and implement global training (refinement), in which parameters in the higher layers can now also affect the updates of parameters in the lower layers. Discriminative fine-tuning (Salakhutdinov and Hinton, 2009) is implemented in the training of the DBM by using label information. In the work presented here for the SBN, we utilize label information (when available) in a multi-task learning setting (as in Bengio et al. (2013)), where the top-layer hidden units are generated by multiple sets of bias terms, one for each label, while all the other model parameters and hidden units below the top layer are shared. This multi-task learning is only performed when generating samples from the model.
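A minimal sketch of this greedy layer-wise scheme is given below, assuming a hypothetical single-layer learner `train_sbn_layer` (Gibbs or VB) that returns posterior means of the weights and the inferred hidden-unit probabilities; whether the inferred units are passed up as probabilities or binarized is an implementation choice made here for illustration.

```python
import numpy as np

def pretrain_deep_sbn(V, layer_sizes, train_sbn_layer):
    """Greedy layer-wise pre-training of a deep SBN.

    V               : (J, N) binary training data
    layer_sizes     : e.g. [200, 200] for a two-hidden-layer model
    train_sbn_layer : callable(data, K) -> (W_mean, c_mean, H_prob); hypothetical
                      single-layer learner returning posterior means.
    """
    params, data = [], V
    for K in layer_sizes:
        W_mean, c_mean, H_prob = train_sbn_layer(data, K)   # train the current layer
        params.append((W_mean, c_mean))                     # freeze at the posterior mean
        data = (H_prob > 0.5).astype(float)                 # inferred units become the next layer's "data"
    return params   # afterwards, all layers are un-frozen for global refinement
```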

4 Related work

The SBN was proposed by Neal (1992), and in the original paper a Gibbs sampler was proposed to do inference. A natural extension to a variational approximation algorithm was proposed by Saul et al. (1996), using the mean field assumption. A Gaussian-field approach (Barber and Sollich, 1999) was also used for inference, by making Gaussian approximations to the unit input. However, due to the fact that the model is not locally conjugate, all the methods mentioned above are only used for inference of distributions on the hidden variables H, and typically the model parameters W are learned by gradient descent.

Another route to do inference on SBNs is based on the idea of Helmholtz machines (Dayan et al., 1995), which are multi-layer belief networks with recognition models, or inference networks. These recognition models are used to approximate the true posterior. The wake-sleep algorithm (Hinton et al., 1995) was first proposed to do inference on such recognition models. Recent work focuses on training the recognition models by maximizing a variational lower bound on the marginal log likelihood (Mnih and Gregor, 2014; Gregor et al., 2014; Kingma and Welling, 2013; Rezende et al., 2014).

In the work reported here, we focus on providing a fully Bayesian treatment of the "global" model parameters and the "local" data-dependent hidden variables. An advantage of this approach is the ability to impose shrinkage-based (near) sparsity on the model parameters. This sparsity helps regularize the model, and also aids in interpreting the learned model. The idea of Pólya-Gamma data augmentation was first proposed to do inference on Bayesian logistic regression (Polson et al., 2013), and later extended to the inference of negative binomial regression (Zhou et al., 2012), logistic-normal topic models (Chen et al., 2013), and discriminative relational topic models (Chen et al., 2014). The work reported here serves as another application of this data augmentation strategy, and a first implementation of the analysis of a deep-learning model in a fully Bayesian framework.

5 Experiments

We present experimental results on three publicly available binary datasets: MNIST, Caltech 101 Silhouettes, and OCR letters. To assess the performance of SBNs trained using the proposed method, we show samples generated from the model and report the average log probability that the model assigns to a test datum.

5.1 Experiment setup

For all the experiments below, we consider a one-hidden-layer SBN with K = 200 hidden units, and a two-hidden-layer SBN with each layer containing K = 200 hidden units. The autoregressive version of the model is denoted ARSBN. The fully visible sigmoid belief network without any hidden units is denoted FVSBN.

The SBN model is trained using both Gibbs sampling and mean field VB, as well as the proposed online VB method. The learning and inference discussed above are almost free of parameter tuning; the hyperparameter settings are given in Section 2. Similar reasonable settings of the hyperparameters yield essentially identical results. The hidden units are initialized randomly and the model parameters are initialized using an isotropic normal with standard deviation 0.1. The maximum number of iterations for VB inference is set to 40, which is large enough to observe convergence. Gibbs sampling used 40 burn-in samples and 100 posterior collection samples; while this number of samples is clearly too small to yield sufficient mixing and an accurate representation of the posteriors, it yields effective approximations to the parameter means, which are used when presenting results.


Figure 1: Performance on the MNIST dataset. (Left) Training data. (Middle) Averaged synthesized samples. (Right) Learned features at the bottom layer.

For online VB, the mini-batch size is set to 5000 with a fixed learning rate of 0.1. Local parameters were updated using 4 iterations per mini-batch, and results are shown over 20 epochs.

The properties of the deep model were explored by examining $\mathbb{E}_{p(\boldsymbol{v}\mid\boldsymbol{h}^{(2)})}[\boldsymbol{v}]$. Given the second hidden layer, the mean was estimated by using Monte Carlo integration. Given $\boldsymbol{h}^{(2)}$, we sample $\boldsymbol{h}^{(1)}\sim p(\boldsymbol{h}^{(1)}\mid\boldsymbol{h}^{(2)})$ and $\boldsymbol{v}\sim p(\boldsymbol{v}\mid\boldsymbol{h}^{(1)})$, and repeat this procedure 1000 times to obtain the final averaged synthesized samples.

The test data log probabilities under VB inference are estimated using the variational lower bound. Evaluating the log probability using the Gibbs output is difficult. For simplicity, the harmonic mean estimator is utilized. As this estimator is biased (Wallach et al., 2009), we refer to the estimate as an upper bound.

ARSBN requires that the observation variables are put in some fixed order. In the experiments, the ordering was simply determined by randomly shuffling the observation vectors, and no optimization of the ordering was tried. Repeated trials with different random orderings gave empirically similar results.

5.2 Binarized MNIST dataset

We conducted the first experiment on the MNIST digit dataset, which contains 60,000 training and 10,000 test images of ten handwritten digits (0 to 9), with 28 × 28 pixels. The binarized version of the dataset is used according to Murray and Salakhutdinov (2009). Analysis was performed on 10,000 randomly selected training images for Gibbs and VB inference. We also examine online VB on the whole training set.

Table 1: Log probability of test data on the MNIST dataset. "Dim" represents the number of hidden units in each layer, starting with the bottom one. (∗) taken from Salakhutdinov and Murray (2008), (†) taken from Mnih and Gregor (2014), (‡) taken from Salakhutdinov and Hinton (2009). SBN.multi denotes SBN trained in the multi-task learning setting.

Model               Dim         Test log-prob.
SBN (online VB)     25          −138.34
RBM∗ (CD3)          25          −143.20
SBN (online VB)     200         −118.12
SBN (VB)            200         −116.96
SBN.multi (VB)      200         −113.02
SBN.multi (VB)      200−200     −110.74
FVSBN (VB)          −           −100.76
ARSBN (VB)          200         −102.11
ARSBN (VB)          200−200     −101.19
SBN (Gibbs)         200         −94.30
SBN (NVIL†)         200         −113.1
SBN (NVIL†)         200−200     −99.8
DBN∗                500−2000    −86.22
DBM‡                500−1000    −84.62

The results for MNIST, along with baselines from the literature, are shown in Table 1. We report the log probability estimates from our implementation of Gibbs sampling, VB and online VB, using both the SBN and ARSBN models.

First, we examine the performance of a low-dimensional model, with K = 25; the results are shown in Table 1. All VB methods give similar results, so only the result from the online method is shown for brevity. The VB SBN shows improved performance over an RBM in this size model (Salakhutdinov and Murray, 2008).

Next, we explore an SBN with K = 200 hidden units. Our methods achieve similar performance to the Neural Variational Inference and Learning (NVIL) algorithm (Mnih and Gregor, 2014), which is the current state of the art for training SBNs.
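For the Gibbs-based evaluation, the harmonic mean estimator mentioned above can be computed from per-sample conditional likelihoods. A minimal sketch under the assumption that posterior samples of the hidden units for a test point are available; as noted, this estimator is biased (optimistic), so the result is treated as an upper bound.

```python
import numpy as np
from scipy.special import logsumexp

def harmonic_mean_log_prob(v, H_samples, W, c):
    """Harmonic mean estimate of log p(v) from posterior samples h^(s) ~ p(h | v).

    v         : (J,) binary test vector
    H_samples : (S, K) posterior samples of the hidden units
    log p(v) is approximated by -log mean_s [ 1 / p(v | h^(s)) ].
    """
    logits = H_samples @ W.T + c                        # (S, J) values of w_j^T h^(s) + c_j
    # Bernoulli log-likelihood log p(v | h^(s)), numerically stable via log(1 + e^x)
    loglik = (v * logits - np.logaddexp(0.0, logits)).sum(axis=1)
    S = loglik.shape[0]
    return -(logsumexp(-loglik) - np.log(S))
```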


Using a second hidden layer, also with size 200, gives performance improvements for all algorithms. In VB there is an improvement of 3 nats for the SBN model when a second layer is learned. Furthermore, the VB ARSBN method gives a test log probability of −101.19. The current state of the art on this size of deep sigmoid belief network is the NVIL algorithm with −99.8, which is quantitatively similar to our results.

The online VB implementation gives lower bounds comparable to the batch VB, and will scale better to larger data sizes.

The TPBN prior infers the number of units needed to represent the data. The impact of the number of hidden units on the test set performance is shown in Figure 2. The models learned using 100, 200 and 500 hidden units achieve nearly identical test set performance, showing that our methods are not overfitting the data as the number of units increases. All models with K > 100 typically utilize 81 features. Thus, the TPBN prior gives "tuning-free" selection of the hidden layer size K. The learned features are shown in Figure 1. These features are sparse and consistent with results from sparse feature learning algorithms (Lee et al., 2008).

Figure 2: The impact of the number of hidden units (25, 50, 100, 200 and 500) on the average variational lower bound of test data under the one-hidden-layer SBN, as a function of the number of iterations.

The generated samples for MNIST are presented in Figure 1. The synthesized digits appear visually good and match the true data well.

We further demonstrate the ability of the model to predict missing data. For each test image, the lower half of the digit is removed and considered as missing data. Reconstructions are shown in Figure 3, and the model produces good completions. Because the labels of the images are uncertain when they are partially observed, the model can generate different digits than the true digit (see the transition from 9 to 0, 7 to 9, etc.).

Figure 3: Missing data prediction. For each subfigure, (Top) Original data. (Middle) Hollowed region. (Bottom) Reconstructed data.

5.3 Caltech 101 Silhouettes dataset

The second experiment is based on the Caltech 101 Silhouettes dataset (Marlin et al., 2010), which contains 6364 training images and 2307 test images. Estimated log probabilities are reported in Table 2.

Table 2: Log probability of test data on the Caltech 101 Silhouettes dataset. "Dim" represents the number of hidden units in each layer, starting with the bottom one. (∗) taken from Cho et al. (2013).

Model           Dim         Test log-prob.
SBN (VB)        200         −136.84
SBN (VB)        200−200     −125.60
FVSBN (VB)      −           −96.40
ARSBN (VB)      200         −96.78
ARSBN (VB)      200−200     −97.57
RBM∗            500         −114.75
RBM∗            4000        −107.78

In this dataset, adding the second hidden layer to the VB SBN greatly improves the lower bound. Figure 5 demonstrates the effect of the deep model on learning. The first hidden layer improves the lower bound quickly, but saturates. When the second hidden layer is added, the model once again improves the lower bound on the test set. Global training (refinement) further enhances the performance. The two-layer model does a better job capturing the rich structure in the 101 total categories. For the simpler datasets (MNIST with 10 categories, and OCR letters with 26 categories, discussed below), this large gap is not observed.

Remarkably, our implementation of FVSBN beats the state-of-the-art results on this dataset (Cho et al., 2013) by 10 nats. Figure 4 shows samples drawn from the trained model; the synthesized shapes appear visually good.


Figure 4: Performance on the Caltech 101 Silhouettes dataset. (Left) Training data. (Middle) Averaged synthesized samples. (Right) Learned features at the bottom layer.

Figure 5: Average variational lower bound of test data obtained from the SBN 200−200 model on the Caltech 101 Silhouettes dataset, as a function of the number of iterations (pretraining layer 1, pretraining layer 2, then global training).

5.4 OCR letters dataset

The OCR letters dataset contains 16 × 8 binary pixel images of 26 letters in the English alphabet. The dataset is split into 42,152 training and 10,000 test examples. Results are reported in Table 3. The proposed ARSBN with K = 200 hidden units achieves a lower bound of −37.97. The state of the art here is a DBM with 2000 hidden units in each layer (Salakhutdinov and Larochelle, 2010). Our model gives results that are only marginally worse using a network with 100 times fewer connections.

Table 3: Log probability of test data on the OCR letters dataset. "Dim" represents the number of hidden units in each layer, starting with the bottom one. (∗) taken from Salakhutdinov and Larochelle (2010).

Model             Dim          Test log-prob.
SBN (online VB)   200          −48.71
SBN (VB)          200          −48.20
SBN (VB)          200−200      −47.84
FVSBN (VB)        −            −39.71
ARSBN (VB)        200          −37.97
ARSBN (VB)        200−200      −38.56
SBN (Gibbs)       200          −40.95
DBM∗              2000−2000    −34.24

6 Discussion and future work

A simple and efficient Gibbs sampling algorithm and a mean field variational Bayes approximation are developed for learning and inference of model parameters in sigmoid belief networks. This has been implemented in a novel way by introducing auxiliary Pólya-Gamma variables. Several encouraging experimental results have been presented, enhancing the idea that the problem can be efficiently tackled in a fully Bayesian framework.

While this work has focused on binary observations, one can model real-valued data by building latent binary hierarchies as employed here, and touching the data at the bottom layer by a real-valued mapping, as has been done in related RBM models (Salakhutdinov et al., 2013). Furthermore, the logistic link function is typically utilized in the deep learning literature. The probit function and the rectified linearity are also considered in the nonlinear Gaussian belief network (Frey and Hinton, 1999). Under the Bayesian framework, by using data augmentation (Polson et al., 2011), the max-margin link could be utilized to model the nonlinearities between layers when training a deep model.
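As one possible instantiation of the real-valued extension mentioned above (an assumption for illustration, not something specified in the paper), the bottom layer could be replaced by a Gaussian mapping of the binary hidden units:

```python
import numpy as np

def sample_real_valued_sbn(W, b, c, sigma, rng=np.random.default_rng(0)):
    """Sketch of a real-valued bottom layer on top of a binary hidden layer.

    Hidden layer as in the SBN: h_k ~ Bernoulli(sigmoid(b_k)).
    Visible layer replaced by a Gaussian mapping: v | h ~ N(W h + c, sigma^2 I).
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    h = (rng.random(b.shape[0]) < sigmoid(b)).astype(float)
    v = rng.normal(loc=W @ h + c, scale=sigma)
    return v, h
```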


Acknowledgements

This research was supported in part by ARO, DARPA, DOE, NGA and ONR.

References

A. Armagan, M. Clyde, and D. B. Dunson. Generalized beta mixtures of Gaussians. NIPS, 2011.
D. Barber and P. Sollich. Gaussian fields for approximate inference in layered sigmoid belief networks. NIPS, 1999.
Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. PAMI, 2013.
J. Chen, J. Zhu, Z. Wang, X. Zheng, and B. Zhang. Scalable inference for logistic-normal topic models. NIPS, 2013.
N. Chen, J. Zhu, F. Xia, and B. Zhang. Discriminative relational topic models. PAMI, 2014.
K. Cho, T. Raiko, and A. Ilin. Enhanced gradient for training restricted Boltzmann machines. Neural Computation, 2013.
P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The Helmholtz machine. Neural Computation, 1995.
B. J. Frey. Graphical models for machine learning and digital communication. MIT Press, 1998.
B. J. Frey and G. E. Hinton. Variational learning in nonlinear Gaussian belief networks. Neural Computation, 1999.
K. Gregor, A. Mnih, and D. Wierstra. Deep autoregressive networks. ICML, 2014.
G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002.
G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 1995.
M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. JMLR, 2013.
D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.
H. Larochelle and I. Murray. The neural autoregressive distribution estimator. JMLR, 2011.
H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief net model for visual area V2. NIPS, 2008.
B. M. Marlin, K. Swersky, B. Chen, and N. D. Freitas. Inductive principles for restricted Boltzmann machine learning. AISTATS, 2010.
A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. ICML, 2014.
I. Murray and R. Salakhutdinov. Evaluating probabilities under high-dimensional latent variable models. NIPS, 2009.
R. M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 1992.
N. G. Polson and J. G. Scott. Local shrinkage rules, Lévy processes and regularized regression. JRSS, 2012.
N. G. Polson, S. L. Scott, et al. Data augmentation for support vector machines. Bayesian Analysis, 2011.
N. G. Polson, J. G. Scott, and J. Windle. Bayesian inference for logistic models using Pólya-Gamma latent variables. JASA, 2013.
D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ICML, 2014.
R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. AISTATS, 2009.
R. Salakhutdinov and H. Larochelle. Efficient learning of deep Boltzmann machines. AISTATS, 2010.
R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. ICML, 2008.
R. Salakhutdinov, J. B. Tenenbaum, and A. Torralba. Learning with hierarchical-deep models. PAMI, 2013.
L. K. Saul, T. Jaakkola, and M. I. Jordan. Mean field theory for sigmoid belief networks. JAIR, 1996.
H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. ICML, 2009.
M. Zhou, L. Li, D. Dunson, and L. Carin. Lognormal and gamma mixed negative binomial regression. ICML, 2012.
