Learning Deep Sigmoid Belief Networks with Data Augmentation

Zhe Gan, Ricardo Henao, David Carlson, Lawrence Carin
Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA

Abstract

Deep directed generative models are developed. The multi-layered model is designed by stacking sigmoid belief networks, with sparsity-encouraging priors placed on the model parameters. Learning and inference of layer-wise model parameters are implemented in a Bayesian setting. By exploring the idea of data augmentation and introducing auxiliary Pólya-Gamma variables, simple and efficient Gibbs sampling and mean-field variational Bayes (VB) inference are implemented. To address large-scale datasets, an online version of VB is also developed. Experimental results are presented for three publicly available datasets: MNIST, Caltech 101 Silhouettes and OCR letters.

1 Introduction

The Deep Belief Network (DBN) (Hinton et al., 2006) and Deep Boltzmann Machine (DBM) (Salakhutdinov and Hinton, 2009) are two popular deep probabilistic generative models that provide state-of-the-art results in many problems. These models contain many layers of hidden variables, and utilize an undirected graphical model called the Restricted Boltzmann Machine (RBM) (Hinton, 2002) as the building block. A nice property of the RBM is that gradient estimates on the model parameters are relatively quick to calculate, and stochastic gradient descent provides relatively efficient inference. However, evaluating the probability of a data point under an RBM is nontrivial due to the computationally intractable partition function, which has to be estimated, for example using an annealed importance sampling algorithm (Salakhutdinov and Murray, 2008).

A directed graphical model that is closely related to these models is the Sigmoid Belief Network (SBN) (Neal, 1992). The SBN has a fully generative process and data are readily generated from the model using ancestral sampling. However, it has been noted that training a deep directed generative model is difficult, due to the "explaining away" effect. Hinton et al. (2006) tackle this problem by introducing the idea of "complementary priors" and show that the RBM provides a good initialization to the DBN, which has the same generative model as the SBN for all layers except the two top hidden layers. In the work presented here we directly deal with training and inference in SBNs (without RBM initialization), using recently developed methods in the Bayesian statistics literature.

Previous work on SBNs utilizes the ideas of Gibbs sampling (Neal, 1992) and mean field approximations (Saul et al., 1996). Recent work focuses on extending the wake-sleep algorithm (Hinton et al., 1995) to training fast variational approximations for the SBN (Mnih and Gregor, 2014). However, almost all previous work assumes no prior on the model parameters which connect different layers. An exception is the work of Kingma and Welling (2013), but this is mentioned as an extension of their primary work. Previous Gibbs sampling and variational inference procedures are implemented only on the hidden variables, while gradient ascent is employed to learn good model parameter values. The typical regularization on the model parameters is early stopping and/or L2 regularization. In an SBN, the model parameters are not straightforwardly locally conjugate, and therefore fully Bayesian inference has been difficult.

The work presented here provides a method for placing priors on the model parameters, and presents a simple Gibbs sampling algorithm, by extending recent work on data augmentation for Bayesian logistic regression (Polson et al., 2013). More specifically, a set of Pólya-Gamma variables is used for each observation, to reformulate the logistic likelihood as a scale mixture, where each mixture component is conditionally normal with respect to the model parameters. Efficient mean-field variational learning and inference are also developed, to optimize a data-augmented variational lower bound; this approach can be scaled up to large datasets. Utilizing these methods, sparsity-encouraging priors are placed on the model parameters and the posterior distribution of the model parameters is estimated (not simply a point estimate). Based on extensive experiments, we provide a detailed analysis of the performance of the proposed method.

2 Model formulation

2.1 Sigmoid Belief Networks

Deep directed generative models are considered for binary data, based on the Sigmoid Belief Network (SBN) (Neal, 1992); using methods like those discussed in Salakhutdinov et al. (2013), the model may be readily extended to other data types. Each visible unit $v_{jn}\in\{0,1\}$ is a Bernoulli random variable whose probability of being one is a logistic function of a linear combination of the binary hidden units, $p(v_{jn}=1\mid\boldsymbol{h}_n)=\sigma(\boldsymbol{w}_j^\top\boldsymbol{h}_n+c_j)$, with hidden-unit prior $p(h_{kn}=1)=\sigma(b_k)$, where $\sigma(x)=1/(1+e^{-x})$. Unlike in the RBM, the posterior over the hidden units is not factorial given the set of weights W, but a simple partition function is obtained. Therefore, the full likelihood under an SBN is trivial to calculate. Furthermore, SBNs explicitly exhibit the generative process to obtain data, in which the hidden layer provides a directed "explanation" for patterns generated in the visible layer.
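To make the generative process concrete, below is a minimal sketch of ancestral sampling from a one-hidden-layer SBN, assuming the Bernoulli-logistic conditionals above. The variable names follow the notation of this section, but the code is illustrative and not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_sbn(W, b, c, n_samples, rng=np.random.default_rng(0)):
    """Ancestral sampling from a one-hidden-layer SBN.

    W : (J, K) weights connecting hidden to visible units
    b : (K,)   hidden-unit biases,  p(h_k = 1) = sigmoid(b_k)
    c : (J,)   visible-unit biases, p(v_j = 1 | h) = sigmoid(w_j^T h + c_j)
    """
    K, J = b.shape[0], c.shape[0]
    H = (rng.random((n_samples, K)) < sigmoid(b)).astype(float)            # h ~ Bernoulli(sigmoid(b))
    V = (rng.random((n_samples, J)) < sigmoid(H @ W.T + c)).astype(float)  # v | h
    return V, H

# toy usage with random parameters
rng = np.random.default_rng(0)
W, b, c = rng.normal(scale=0.1, size=(784, 200)), np.zeros(200), np.zeros(784)
V, H = sample_sbn(W, b, c, n_samples=5)
```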

2.2 Autoregressive Structure

Instead of assuming that the visible variables in an SBN are conditionally independent given the hidden units, a more flexible model can be built by using an autoregressive structure. The autoregressive sigmoid belief network (ARSBN) (Gregor et al., 2014) is an SBN with within-layer dependency captured by a fully connected directed acyclic graph, where each unit $x_j$ can be predicted by its parent units $x_{<j}$; that is, the conditional for each visible unit takes the form $p(v_{jn}=1\mid\boldsymbol{h}_n,\boldsymbol{v}_{<j,n})$.
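As a rough illustration (not necessarily the paper's exact parameterization), the sketch below assumes the autoregressive dependency enters additively in the logit through a strictly lower-triangular matrix S; the name S and this particular form are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_arsbn_visible(W, S, c, h, rng=np.random.default_rng(0)):
    """Sample the visible layer of an ARSBN-style model given hidden units h.

    Assumed form: p(v_j = 1 | h, v_{<j}) = sigmoid(w_j^T h + s_{j,<j}^T v_{<j} + c_j),
    with S strictly lower triangular so that v_j depends only on its parents v_{<j}.
    """
    J = c.shape[0]
    v = np.zeros(J)
    for j in range(J):                       # units are generated in a fixed order
        logit = W[j] @ h + S[j, :j] @ v[:j] + c[j]
        v[j] = float(rng.random() < sigmoid(logit))
    return v
```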


2.3 Deep Sigmoid Belief Networks

Let $\{h_{1n}^{(l)}, h_{2n}^{(l)}, \ldots, h_{K_l n}^{(l)}\}$ represent the set of hidden units for observation $n$ in layer $l$. For the top layer, the prior probability can be written as $p(h_{kn}^{(L)}=1)=\sigma(c_k^{(L+1)})$, where $c_k^{(L+1)}\in\mathbb{R}$. Defining $\boldsymbol{v}_n=\boldsymbol{h}_n^{(0)}$, conditioned on the hidden units $\boldsymbol{h}_n^{(l)}$, the hidden units at layer $l-1$ are drawn from

$p(h_{kn}^{(l-1)}=1\mid\boldsymbol{h}_n^{(l)}) = \sigma\big((\boldsymbol{w}_k^{(l)})^\top\boldsymbol{h}_n^{(l)} + c_k^{(l)}\big)$ ,   (8)

where $\mathbf{W}^{(l)} = [\boldsymbol{w}_1^{(l)},\ldots,\boldsymbol{w}_{K_{l-1}}^{(l)}]^\top$ connects layers $l$ and $l-1$, and $\boldsymbol{c}^{(l)} = [c_1^{(l)},\ldots,c_{K_{l-1}}^{(l)}]^\top$ is the bias term.
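The layered construction in (8) can be sampled top-down, extending the one-layer sketch above. A minimal sketch, assuming per-layer weight matrices of shape $(K_{l-1}, K_l)$ and bias vectors as defined here (illustrative only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_deep_sbn(weights, biases, top_bias, rng=np.random.default_rng(0)):
    """Top-down ancestral sampling from a deep SBN, following Eq. (8).

    weights : list [W^(1), ..., W^(L)], W^(l) has shape (K_{l-1}, K_l)
    biases  : list [c^(1), ..., c^(L)], c^(l) has shape (K_{l-1},)
    top_bias: c^(L+1), shape (K_L,), defining the prior on the top layer
    Returns the visible layer v = h^(0) and all sampled layers (top to bottom).
    """
    h = (rng.random(top_bias.shape[0]) < sigmoid(top_bias)).astype(float)   # top layer h^(L)
    layers = [h]
    for W, c in zip(reversed(weights), reversed(biases)):                   # l = L, ..., 1
        prob = sigmoid(W @ h + c)                                           # p(h^(l-1) = 1 | h^(l))
        h = (rng.random(prob.shape[0]) < prob).astype(float)
        layers.append(h)
    return layers[-1], layers    # layers[-1] is v = h^(0)
```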

2.4 Bayesian sparsity shrinkage prior

The learned features are often expected to be sparse. In imagery, for example, features learned at the bottom layer tend to be localized, oriented edge filters which are similar to the Gabor functions known to model V1 cell receptive fields (Lee et al., 2008).

Under the Bayesian framework, sparsity-encouraging priors can be specified in a principled way. Some canonical examples are the spike-and-slab prior, the Student's-t prior, the double exponential prior and the horseshoe prior (see Polson and Scott (2012) for a discussion of these priors). The three parameter beta normal (TPBN) prior (Armagan et al., 2011), a typical global-local shrinkage prior, has demonstrated better (mixing) performance than the aforementioned priors, and is therefore employed in this paper. The TPBN shrinkage prior can be expressed as scale mixtures of normals. If $W_{jk}\sim\mathrm{TPBN}(a,b,\phi)$, where $j=1,\ldots,J$, $k=1,\ldots,K$ (leaving off the dependence on the layer $l$ for notational convenience), then

$W_{jk} \sim \mathcal{N}(0,\zeta_{jk})$ , $\quad \zeta_{jk}\sim\mathrm{Gamma}(a,\xi_{jk})$ , $\quad \xi_{jk}\sim\mathrm{Gamma}(b,\phi_k)$ ,   (9)
$\phi_k\sim\mathrm{Gamma}(1/2,\omega)$ , $\quad \omega\sim\mathrm{Gamma}(1/2,1)$ .

When $a=b=\tfrac{1}{2}$, the TPBN recovers the horseshoe prior. For fixed values of $a$ and $b$, decreasing $\phi$ encourages more support for stronger shrinkage. In high-dimensional settings, $\phi$ can be fixed at a reasonable value to reflect an appropriate expected sparsity, rather than inferring it from data.

Finally, to build up the fully generative model, commonly used isotropic normal priors are imposed on the bias terms $\boldsymbol{b}$ and $\boldsymbol{c}$, i.e. $\boldsymbol{b}\sim\mathcal{N}(0,\nu_b\mathbf{I}_K)$, $\boldsymbol{c}\sim\mathcal{N}(0,\nu_c\mathbf{I}_J)$.

Note that when performing model learning, we truncate the number of hidden units at each layer at $K$, which may be viewed as an upper bound within the model on the number of units at each layer. With the aforementioned shrinkage on $\mathbf{W}$, the model has the capacity to infer the subset of units (possibly fewer than $K$) actually needed to represent the data.

3 Learning and inference

In this section, Gibbs sampling and mean field variational inference are derived for the sigmoid belief networks, based on data augmentation. From the perspective of learning, we desire distributions on the model parameters $\{\mathbf{W}^{(l)}\}$ and $\{\boldsymbol{c}^{(l)}\}$; distributions on the data-dependent $\{\boldsymbol{h}_n^{(l)}\}$ are desired in the context of inference. The extension to ARSBN is straightforward and provided in Supplemental Section C. We again omit the layer index $l$ in the discussion below.

3.1 Gibbs sampling

Define $\mathbf{V}=[\boldsymbol{v}_1,\ldots,\boldsymbol{v}_N]$ and $\mathbf{H}=[\boldsymbol{h}_1,\ldots,\boldsymbol{h}_N]$. From recent work on the Pólya-Gamma data augmentation strategy (Polson et al., 2013), if $\gamma\sim\mathrm{PG}(b,0)$, $b>0$, then

$\dfrac{(e^{\psi})^a}{(1+e^{\psi})^b} = 2^{-b} e^{\kappa\psi} \int_0^{\infty} e^{-\gamma\psi^2/2}\, p(\gamma)\,\mathrm{d}\gamma$ ,   (10)

where $\kappa = a - b/2$. Properties of the Pólya-Gamma variables are summarized in Supplemental Section B. Therefore, the data-augmented joint posterior of the SBN model can be expressed as

$p(\mathbf{W},\mathbf{H},\boldsymbol{b},\boldsymbol{c},\boldsymbol{\gamma}^{(0)},\boldsymbol{\gamma}^{(1)}\mid\mathbf{V}) \propto \exp\Big\{\sum_{j,n}\big[(v_{jn}-\tfrac{1}{2})(\boldsymbol{w}_j^\top\boldsymbol{h}_n+c_j)-\tfrac{1}{2}\gamma_{jn}^{(0)}(\boldsymbol{w}_j^\top\boldsymbol{h}_n+c_j)^2\big]\Big\}$   (11)
$\qquad\cdot\exp\Big\{\sum_{k,n}\big[(h_{kn}-\tfrac{1}{2})b_k-\tfrac{1}{2}\gamma_k^{(1)}b_k^2\big]\Big\}\cdot p_0(\boldsymbol{\gamma}^{(0)},\boldsymbol{\gamma}^{(1)},\mathbf{W},\boldsymbol{b},\boldsymbol{c})$ ,

where $\boldsymbol{\gamma}^{(0)}\in\mathbb{R}^{J\times N}$ and $\boldsymbol{\gamma}^{(1)}\in\mathbb{R}^{K}$ are augmented random variables drawn from the Pólya-Gamma distribution. The term $p_0(\boldsymbol{\gamma}^{(0)},\boldsymbol{\gamma}^{(1)},\mathbf{W},\boldsymbol{b},\boldsymbol{c})$ contains the prior information of the random variables within. Let $p(\cdot\mid-)$ represent the conditional distribution given the other parameters fixed; the conditional distributions used in Gibbs sampling are as follows.

For $\boldsymbol{\gamma}^{(0)},\boldsymbol{\gamma}^{(1)}$: The conditional distribution of $\gamma_{jn}^{(0)}$ is

$p(\gamma_{jn}^{(0)}\mid-) \propto \exp\big(-\tfrac{1}{2}\gamma_{jn}^{(0)}(\boldsymbol{w}_j^\top\boldsymbol{h}_n+c_j)^2\big)\cdot\mathrm{PG}(\gamma_{jn}^{(0)}\mid 1,0) = \mathrm{PG}(1, \boldsymbol{w}_j^\top\boldsymbol{h}_n+c_j)$ ,   (12)

where $\mathrm{PG}(\cdot,\cdot)$ represents the Pólya-Gamma distribution. Similarly, we can obtain $p(\gamma_k^{(1)}\mid-)=\mathrm{PG}(1,b_k)$. To draw samples from the Pólya-Gamma distribution, two strategies are utilized: (i) using rejection sampling to draw samples from the closely related exponentially tilted Jacobi distribution (Polson et al., 2013); (ii) using a truncated sum of random variables from the Gamma distribution, and then matching the first moment to keep the samples unbiased (Zhou et al., 2012). Typically, a truncation level of 20 works well in practice.
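A minimal sketch of strategy (ii) is shown below, assuming the standard infinite-sum representation of the Pólya-Gamma distribution from Polson et al. (2013); the additive first-moment correction follows the description above, but the exact correction used by the authors may differ.

```python
import numpy as np

def sample_pg(b, c, rng, truncation=20):
    """Approximate draw from PG(b, c) via a truncated sum of Gamma variables.

    Uses the representation  gamma = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    g_k ~ Gamma(b, 1), truncated at `truncation` terms, then adds the expected value of the
    discarded tail so that the first moment matches E[PG(b, c)] = (b / (2c)) * tanh(c / 2).
    """
    k = np.arange(1, truncation + 1)
    denom = (k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2)
    g = rng.gamma(shape=b, scale=1.0, size=truncation)
    draw = (g / denom).sum() / (2.0 * np.pi ** 2)

    # exact mean of PG(b, c); the c -> 0 limit is b / 4
    mean_full = b / 4.0 if abs(c) < 1e-9 else b * np.tanh(c / 2.0) / (2.0 * c)
    mean_trunc = (b / denom).sum() / (2.0 * np.pi ** 2)
    return draw + (mean_full - mean_trunc)   # moment-matched correction for the truncated tail

rng = np.random.default_rng(0)
gamma_jn = sample_pg(1.0, 0.8, rng)          # e.g. gamma_jn^(0) ~ PG(1, w_j^T h_n + c_j)
```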

For $\mathbf{H}$: The sequential update of the local conditional distribution of $\mathbf{H}$ is $p(h_{kn}\mid-)=\mathrm{Ber}(\sigma(d_{kn}))$, where

$d_{kn} = b_k + \boldsymbol{w}_k^\top\boldsymbol{v}_n - \dfrac{1}{2}\sum_{j=1}^{J}\big[w_{jk} + \gamma_{jn}^{(0)}\big(2\psi_{jn}^{\backslash k} w_{jk} + w_{jk}^2\big)\big]$ ,

where $\psi_{jn}^{\backslash k} = \boldsymbol{w}_j^\top\boldsymbol{h}_n - w_{jk}h_{kn} + c_j$. Note that $\boldsymbol{w}_k$ and $\boldsymbol{w}_j$ represent the $k$th column and the transpose of the $j$th row of $\mathbf{W}$, respectively. The difference between an SBN and an RBM can be seen more clearly from this sequential update. Specifically, in an RBM the update of $h_{kn}$ only contains the first two terms, which implies that the updates of the $h_{kn}$ are independent of each other. In the SBN, the existence of the third term demonstrates clearly the posterior dependencies between hidden units. Although the rows of $\mathbf{H}$ are correlated, the columns of $\mathbf{H}$ are independent, therefore the sampling of $\mathbf{H}$ is still efficient.

For $\mathbf{W}$: The prior is a TPBN shrinkage prior with $p_0(\boldsymbol{w}_j)=\mathcal{N}(0,\mathrm{diag}(\boldsymbol{\zeta}_j))$; then we have $p(\boldsymbol{w}_j\mid-)=\mathcal{N}(\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)$, where

$\boldsymbol{\Sigma}_j = \Big[\sum_{n=1}^{N}\gamma_{jn}^{(0)}\boldsymbol{h}_n\boldsymbol{h}_n^\top + \mathrm{diag}(\boldsymbol{\zeta}_j^{-1})\Big]^{-1}$ ,   (13)

$\boldsymbol{\mu}_j = \boldsymbol{\Sigma}_j\Big[\sum_{n=1}^{N}\big(v_{jn}-\tfrac{1}{2}-c_j\gamma_{jn}^{(0)}\big)\boldsymbol{h}_n\Big]$ .   (14)

The updates of the bias terms $\boldsymbol{b}$ and $\boldsymbol{c}$ are similar to the above equations.

For TPBN shrinkage: One advantage of this hierarchical shrinkage prior is the full local conjugacy, which allows the Gibbs sampling to be easily implemented. Specifically, the following posterior conditional distributions can be obtained: (1) $\zeta_{jk}\mid-\sim\mathrm{GIG}(0,2\xi_{jk},W_{jk}^2)$; (2) $\xi_{jk}\mid-\sim\mathrm{Gamma}(1,\zeta_{jk}+\phi_k)$; (3) $\phi_k\mid-\sim\mathrm{Gamma}(\tfrac{1}{2}J+\tfrac{1}{2},\,\omega+\sum_{j=1}^{J}\xi_{jk})$; (4) $\omega\mid-\sim\mathrm{Gamma}(\tfrac{1}{2}K+\tfrac{1}{2},\,1+\sum_{k=1}^{K}\phi_k)$, where GIG denotes the generalized inverse Gaussian distribution.
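Putting the conditionals (12)-(14) and the sequential update for H together, a single illustrative Gibbs sweep might look like the sketch below; it reuses the `sigmoid` and `sample_pg` helpers from the earlier sketches, and the bias updates are omitted for brevity. This is a sketch under those assumptions, not the authors' code.

```python
import numpy as np

def gibbs_sweep(V, H, W, b, c, zeta, rng):
    """One illustrative Gibbs sweep for a one-layer SBN (bias updates omitted).

    V: (J, N) binary data, H: (K, N) binary hidden units,
    W: (J, K), b: (K,), c: (J,), zeta: (J, K) TPBN local variances.
    Requires sigmoid() and sample_pg() from the sketches above.
    """
    J, N = V.shape
    K = H.shape[0]

    # (12): gamma_jn^(0) ~ PG(1, psi_jn) with psi_jn = w_j^T h_n + c_j
    psi = W @ H + c[:, None]
    gamma0 = np.vectorize(lambda p: sample_pg(1.0, p, rng))(psi)

    # sequential update of H: p(h_kn = 1 | -) = sigmoid(d_kn)
    for k in range(K):
        psi_not_k = W @ H + c[:, None] - np.outer(W[:, k], H[k])       # psi^{\k}
        d_k = (b[k] + W[:, k] @ V
               - 0.5 * (W[:, k].sum()
                        + (gamma0 * (2 * psi_not_k * W[:, k][:, None]
                                     + (W[:, k] ** 2)[:, None])).sum(axis=0)))
        H[k] = (rng.random(N) < sigmoid(d_k)).astype(float)

    # (13)-(14): row-wise Gaussian update for W
    for j in range(J):
        Sigma_inv = (gamma0[j] * H) @ H.T + np.diag(1.0 / zeta[j])
        Sigma = np.linalg.inv(Sigma_inv)
        mu = Sigma @ (H @ (V[j] - 0.5 - c[j] * gamma0[j]))
        W[j] = rng.multivariate_normal(mu, Sigma)
    return H, W, gamma0
```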
3.2 Mean field variational Bayes

Using VB inference with the traditional mean field assumption, we approximate the posterior distribution with $Q = \prod_{j,k} q_{w_{jk}}(w_{jk}) \prod_{j,n} q_{h_{jn}}(h_{jn})\, q_{\gamma_{jn}^{(0)}}(\gamma_{jn}^{(0)})$; for notational simplicity, the terms concerning $\boldsymbol{b}$, $\boldsymbol{c}$, $\boldsymbol{\gamma}^{(1)}$ and the parameters of the TPBN shrinkage prior are omitted. The variational lower bound can be obtained as

$\mathcal{L} = \langle\log p(\mathbf{V}\mid\mathbf{W},\mathbf{H},\boldsymbol{c})\rangle + \langle\log p(\mathbf{W})\rangle + \langle\log p(\mathbf{H}\mid\boldsymbol{b})\rangle - \langle\log q(\mathbf{W})\rangle - \langle\log q(\mathbf{H})\rangle$ ,   (15)

where $\langle\cdot\rangle$ represents the expectation w.r.t. the variational approximate posterior.

Note that $\langle\log p(\mathbf{V}\mid-)\rangle = \sum_{j,n}\langle\log p(v_{jn}\mid-)\rangle$, and each term inside the summation can be further lower bounded by using the augmented Pólya-Gamma variables. Specifically, defining $\psi_{jn}=\boldsymbol{w}_j^\top\boldsymbol{h}_n+c_j$, we can obtain

$\langle\log p(v_{jn}\mid-)\rangle \geq -\log 2 + (v_{jn}-\tfrac{1}{2})\langle\psi_{jn}\rangle - \tfrac{1}{2}\langle\gamma_{jn}^{(0)}\rangle\langle\psi_{jn}^2\rangle + \langle\log p_0(\gamma_{jn}^{(0)})\rangle - \langle\log q(\gamma_{jn}^{(0)})\rangle$ ,   (16)

by using (10) and Jensen's inequality. Therefore, the new lower bound $\mathcal{L}'$ can be achieved by substituting (16) into (15). Note that this is a looser lower bound compared with the original lower bound $\mathcal{L}$, due to the data augmentation. However, closed-form coordinate ascent update equations can be obtained, shown below.

For $\boldsymbol{\gamma}^{(0)},\boldsymbol{\gamma}^{(1)}$: Optimizing $\mathcal{L}'$ over $q(\gamma_{jn}^{(0)})$, we have

$q(\gamma_{jn}^{(0)}) \propto \exp\big(-\tfrac{1}{2}\gamma_{jn}^{(0)}\langle\psi_{jn}^2\rangle\big)\cdot\mathrm{PG}(\gamma_{jn}^{(0)}\mid 1,0) = \mathrm{PG}\big(1,\sqrt{\langle\psi_{jn}^2\rangle}\big)$ .   (17)

Similarly, we can obtain $p(\gamma_k^{(1)}\mid-)=\mathrm{PG}\big(1,\sqrt{\langle b_k^2\rangle}\big)$. In the update of the other variational parameters, only the expectation of $\gamma_{jn}^{(0)}$ is needed, which can be calculated as $\langle\gamma_{jn}^{(0)}\rangle = \frac{1}{2\sqrt{\langle\psi_{jn}^2\rangle}}\tanh\big(\frac{\sqrt{\langle\psi_{jn}^2\rangle}}{2}\big)$. The variational distributions for the other parameters are in the exponential family, hence the update equations can be derived from the Gibbs sampling; these are straightforward and provided in Supplemental Section D.

In order to calculate the variational lower bound, the augmented Pólya-Gamma variables are integrated out, and the expectation of the logistic likelihood under the variational distribution is estimated by a Monte Carlo integration algorithm. In the experiments, 10 samples are used and were found sufficient in all cases considered.

The computational complexity of the above inference is $\mathcal{O}(NK^2)$, where $N$ is the total number of training data points. Every iteration of VB requires a full pass through the dataset, which can be slow when applied to large datasets. Therefore, an online version of VB inference is developed, building upon the recent online implementation of latent Dirichlet allocation (Hoffman et al., 2013). In online VB, stochastic optimization is applied to the variational objective. The key observation is that the coordinate ascent updates in VB precisely correspond to the natural gradient of the variational objective. To implement online VB, we subsample the data, compute the gradient estimate based on the subsamples, and follow the gradient with a decreasing step size.
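As an illustration of such an online update (in the spirit of Hoffman et al. (2013) and Eqs. (13)-(14), not the authors' exact implementation), the sketch below rescales minibatch sufficient statistics for $q(\boldsymbol{w}_j)$ as if the minibatch were the full dataset and blends them with the current natural parameters using a decreasing step size, e.g. $\rho_t=(t+\tau)^{-\kappa}$. The function name, the factorized Bernoulli $q(\boldsymbol{h}_n)$, and the argument layout are assumptions for illustration.

```python
import numpy as np

def svi_update_wj(Lambda_j, eta_j, Eh, Egamma, v_j, Ec_j, Ezeta_inv_j, N, rho):
    """Stochastic (natural-gradient) VB update for q(w_j) = N(mu_j, Sigma_j),
    parameterized by natural parameters Lambda_j = Sigma_j^{-1}, eta_j = Sigma_j^{-1} mu_j.

    Eh          : (K, S) means of q(h_n) for the S minibatch points
    Egamma      : (S,)   <gamma_jn^(0)> for the minibatch
    v_j         : (S,)   observed pixel j for the minibatch
    Ezeta_inv_j : (K,)   expected inverse TPBN variances <1 / zeta_jk>
    rho         : step size, e.g. (t + tau) ** (-kappa)
    """
    K, S = Eh.shape
    # <h_n h_n^T> for a factorized Bernoulli q(h_n): off-diagonal = product of means, diagonal = means
    Ehh = np.einsum('ks,ls->skl', Eh, Eh)
    Ehh[:, np.arange(K), np.arange(K)] = Eh.T

    # intermediate estimates, scaling the minibatch as if it were the full dataset (cf. Eqs. 13-14)
    Lambda_hat = (N / S) * np.einsum('s,skl->kl', Egamma, Ehh) + np.diag(Ezeta_inv_j)
    eta_hat = (N / S) * Eh @ (v_j - 0.5 - Ec_j * Egamma)

    # natural-gradient step: convex combination of old and intermediate natural parameters
    return (1 - rho) * Lambda_j + rho * Lambda_hat, (1 - rho) * eta_j + rho * eta_hat
```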

3.3 Learning deep networks using SBNs

Once one layer of the deep network is trained, the model parameters in that layer are frozen (at the mean of the inferred posterior) and we can utilize the inferred hidden units as the input "data" for the training of the next, higher layer. This greedy layer-wise pre-training algorithm has been shown to be effective for DBN (Hinton et al., 2006) and DBM (Salakhutdinov and Hinton, 2009) models, and is guaranteed to improve the data likelihood under certain conditions. In the training of a deep SBN, the same strategy is employed. After finishing pre-training (sequentially for all the layers), we then "un-freeze" all the model parameters and implement global training (refinement), in which parameters in the higher layers can now also affect the updates of parameters in the lower layers. Discriminative fine-tuning (Salakhutdinov and Hinton, 2009) is implemented in the training of the DBM by using label information. In the work presented here for the SBN, we utilize label information (when available) in a multi-task learning setting (as in Bengio et al. (2013)), where the top-layer hidden units are generated by multiple sets of bias terms, one for each label, while all the other model parameters and hidden units below the top layer are shared. This multi-task learning is only performed when generating samples from the model.
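A minimal sketch of this greedy layer-wise scheme is given below, assuming a hypothetical single-layer learner `train_sbn_layer` (Gibbs or VB) that returns posterior means of the weights and the inferred hidden-unit probabilities; whether the inferred units are passed up as probabilities or binarized is an implementation choice made here for illustration.

```python
import numpy as np

def pretrain_deep_sbn(V, layer_sizes, train_sbn_layer):
    """Greedy layer-wise pre-training of a deep SBN.

    V               : (J, N) binary training data
    layer_sizes     : e.g. [200, 200] for a two-hidden-layer model
    train_sbn_layer : callable(data, K) -> (W_mean, c_mean, H_prob); hypothetical
                      single-layer learner returning posterior means.
    """
    params, data = [], V
    for K in layer_sizes:
        W_mean, c_mean, H_prob = train_sbn_layer(data, K)   # train the current layer
        params.append((W_mean, c_mean))                     # freeze at the posterior mean
        data = (H_prob > 0.5).astype(float)                 # inferred units become the next layer's "data"
    return params   # afterwards, all layers are un-frozen for global refinement
```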

4 Related work

The SBN was proposed by Neal (1992), and in the original paper a Gibbs sampler was proposed to do inference. A natural extension to a variational approximation algorithm was proposed by Saul et al. (1996), using the mean field assumption. A Gaussian-field approach (Barber and Sollich, 1999) was also used for inference, by making Gaussian approximations to the unit input. However, due to the fact that the model is not locally conjugate, all the methods mentioned above are only used for inference of distributions on the hidden variables H, and typically the model parameters W are learned by gradient descent.

Another route to do inference on SBNs is based on the idea of Helmholtz machines (Dayan et al., 1995), which are multi-layer belief networks with recognition models, or inference networks. These recognition models are used to approximate the true posterior. The wake-sleep algorithm (Hinton et al., 1995) was first proposed to do inference on such recognition models. Recent work focuses on training the recognition models by maximizing a variational lower bound on the marginal log likelihood (Mnih and Gregor, 2014; Gregor et al., 2014; Kingma and Welling, 2013; Rezende et al., 2014).

In the work reported here, we focus on providing a fully Bayesian treatment of the "global" model parameters and the "local" data-dependent hidden variables. An advantage of this approach is the ability to impose shrinkage-based (near) sparsity on the model parameters. This sparsity helps regularize the model, and also aids in interpreting the learned model. The idea of Pólya-Gamma data augmentation was first proposed to do inference on Bayesian logistic regression (Polson et al., 2013), and later extended to the inference of negative binomial regression (Zhou et al., 2012), logistic-normal topic models (Chen et al., 2013), and discriminative relational topic models (Chen et al., 2014). The work reported here serves as another application of this data augmentation strategy, and a first implementation of the analysis of a deep-learning model in a fully Bayesian framework.

5 Experiments

We present experimental results on three publicly available binary datasets: MNIST, Caltech 101 Silhouettes, and OCR letters. To assess the performance of SBNs trained using the proposed method, we show samples generated from the model and report the average log probability that the model assigns to a test datum.

5.1 Experiment setup

For all the experiments below, we consider a one-hidden-layer SBN with K = 200 hidden units, and a two-hidden-layer SBN with each layer containing K = 200 hidden units. The autoregressive version of the model is denoted ARSBN. The fully visible sigmoid belief network without any hidden units is denoted FVSBN.

The SBN model is trained using both Gibbs sampling and mean field VB, as well as the proposed online VB method. The learning and inference discussed above are almost free of parameter tuning; the hyperparameter settings are given in Section 2. Similar reasonable settings of the hyperparameters yield essentially identical results. The hidden units are initialized randomly and the model parameters are initialized using an isotropic normal with standard deviation 0.1. The maximum number of iterations for VB inference is set to 40, which is large enough to observe convergence. Gibbs sampling used 40 burn-in samples and 100 posterior collection samples; while this number of samples is clearly too small to yield sufficient mixing and an accurate representation of the posteriors, it yields effective approximations to the parameter means, which are used when presenting results.


Figure 1: Performance on the MNIST dataset. (Left) Training data. (Middle) Averaged synthesized samples. (Right) Learned features at the bottom layer.

For online VB, the mini-batch size is set to 5000 with a fixed learning rate of 0.1. Local parameters were updated using 4 iterations per mini-batch, and results are shown over 20 epochs.

The properties of the deep model were explored by examining $\mathbb{E}_{p(\boldsymbol{v}\mid\boldsymbol{h}^{(2)})}[\boldsymbol{v}]$. Given the second hidden layer, the mean was estimated by using Monte Carlo integration. Given $\boldsymbol{h}^{(2)}$, we sample $\boldsymbol{h}^{(1)}\sim p(\boldsymbol{h}^{(1)}\mid\boldsymbol{h}^{(2)})$ and $\boldsymbol{v}\sim p(\boldsymbol{v}\mid\boldsymbol{h}^{(1)})$, and repeat this procedure 1000 times to obtain the final averaged synthesized samples.

The test data log probabilities under VB inference are estimated using the variational lower bound. Evaluating the log probability using the Gibbs output is difficult. For simplicity, the harmonic mean estimator is utilized. As this estimator is biased (Wallach et al., 2009), we refer to the estimate as an upper bound.

ARSBN requires that the observation variables are put in some fixed order. In the experiments, the ordering was simply determined by randomly shuffling the observation vectors, and no optimization of the ordering was tried. Repeated trials with different random orderings gave empirically similar results.

5.2 Binarized MNIST dataset

We conducted the first experiment on the MNIST digit dataset, which contains 60,000 training and 10,000 test images of ten handwritten digits (0 to 9), with 28 × 28 pixels. The binarized version of the dataset is used according to Murray and Salakhutdinov (2009). Analysis was performed on 10,000 randomly selected training images for Gibbs and VB inference. We also examine online VB on the whole training set.

Table 1: Log probability of test data on the MNIST dataset. "Dim" represents the number of hidden units in each layer, starting with the bottom one. (∗) taken from Salakhutdinov and Murray (2008), (†) taken from Mnih and Gregor (2014), (‡) taken from Salakhutdinov and Hinton (2009). SBN.multi denotes SBN trained in the multi-task learning setting.

Model               Dim         Test log-prob.
SBN (online VB)     25          −138.34
RBM∗ (CD3)          25          −143.20
SBN (online VB)     200         −118.12
SBN (VB)            200         −116.96
SBN.multi (VB)      200         −113.02
SBN.multi (VB)      200−200     −110.74
FVSBN (VB)          −           −100.76
ARSBN (VB)          200         −102.11
ARSBN (VB)          200−200     −101.19
SBN (Gibbs)         200         −94.30
SBN (NVIL†)         200         −113.1
SBN (NVIL†)         200−200     −99.8
DBN∗                500−2000    −86.22
DBM‡                500−1000    −84.62

The results for MNIST, along with baselines from the literature, are shown in Table 1. We report the log probability estimates from our implementation of Gibbs sampling, VB and online VB, using both the SBN and ARSBN models.

First, we examine the performance of a low-dimensional model, with K = 25; the results are shown in Table 1. All VB methods give similar results, so only the result from the online method is shown for brevity. The VB SBN shows improved performance over an RBM in this size model (Salakhutdinov and Murray, 2008).

Next, we explore an SBN with K = 200 hidden units. Our methods achieve similar performance to the Neural Variational Inference and Learning (NVIL) algorithm (Mnih and Gregor, 2014), which is the current state of the art for training SBNs.
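For the Gibbs-based evaluation, the harmonic mean estimator mentioned above can be computed from per-sample conditional likelihoods. A minimal sketch under the assumption that posterior samples of the hidden units for a test point are available; as noted, this estimator is biased (optimistic), so the result is treated as an upper bound.

```python
import numpy as np
from scipy.special import logsumexp

def harmonic_mean_log_prob(v, H_samples, W, c):
    """Harmonic mean estimate of log p(v) from posterior samples h^(s) ~ p(h | v).

    v         : (J,) binary test vector
    H_samples : (S, K) posterior samples of the hidden units
    log p(v) is approximated by -log mean_s [ 1 / p(v | h^(s)) ].
    """
    logits = H_samples @ W.T + c                        # (S, J) values of w_j^T h^(s) + c_j
    # Bernoulli log-likelihood log p(v | h^(s)), numerically stable via log(1 + e^x)
    loglik = (v * logits - np.logaddexp(0.0, logits)).sum(axis=1)
    S = loglik.shape[0]
    return -(logsumexp(-loglik) - np.log(S))
```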


Using a second hidden layer, also with size 200, gives performance improvements for all algorithms. In VB there is an improvement of 3 nats for the SBN model when a second layer is learned. Furthermore, the VB ARSBN method gives a test log probability of −101.19. The current state of the art on this size of deep sigmoid belief network is the NVIL algorithm with −99.8, which is quantitatively similar to our results.

The online VB implementation gives lower bounds comparable to the batch VB, and will scale better to larger data sizes.

The TPBN prior infers the number of units needed to represent the data. The impact of the number of hidden units on the test set performance is shown in Figure 2. The models learned using 100, 200 and 500 hidden units achieve nearly identical test set performance, showing that our methods are not overfitting the data as the number of units increases. All models with K > 100 typically utilize 81 features. Thus, the TPBN prior gives "tuning-free" selection of the hidden layer size K. The learned features are shown in Figure 1. These features are sparse and consistent with results from sparse feature learning algorithms (Lee et al., 2008).

Figure 2: The impact of the number of hidden units (25, 50, 100, 200 and 500) on the average variational lower bound of test data under the one-hidden-layer SBN, as a function of the number of iterations.

The generated samples for MNIST are presented in Figure 1. The synthesized digits appear visually good and match the true data well.

We further demonstrate the ability of the model to predict missing data. For each test image, the lower half of the digit is removed and considered as missing data. Reconstructions are shown in Figure 3, and the model produces good completions. Because the labels of the images are uncertain when they are partially observed, the model can generate different digits than the true digit (see the transition from 9 to 0, 7 to 9, etc.).

Figure 3: Missing data prediction. For each subfigure, (Top) Original data. (Middle) Hollowed region. (Bottom) Reconstructed data.

5.3 Caltech 101 Silhouettes dataset

The second experiment is based on the Caltech 101 Silhouettes dataset (Marlin et al., 2010), which contains 6364 training images and 2307 test images. Estimated log probabilities are reported in Table 2.

Table 2: Log probability of test data on the Caltech 101 Silhouettes dataset. "Dim" represents the number of hidden units in each layer, starting with the bottom one. (∗) taken from Cho et al. (2013).

Model           Dim         Test log-prob.
SBN (VB)        200         −136.84
SBN (VB)        200−200     −125.60
FVSBN (VB)      −           −96.40
ARSBN (VB)      200         −96.78
ARSBN (VB)      200−200     −97.57
RBM∗            500         −114.75
RBM∗            4000        −107.78

In this dataset, adding the second hidden layer to the VB SBN greatly improves the lower bound. Figure 5 demonstrates the effect of the deep model on learning. The first hidden layer improves the lower bound quickly, but saturates. When the second hidden layer is added, the model once again improves the lower bound on the test set. Global training (refinement) further enhances the performance. The two-layer model does a better job capturing the rich structure in the 101 total categories. For the simpler datasets (MNIST with 10 categories, and OCR letters with 26 categories, discussed below), this large gap is not observed.

Remarkably, our implementation of FVSBN beats the state-of-the-art results on this dataset (Cho et al., 2013) by 10 nats. Figure 4 shows samples drawn from the trained model; the synthesized shapes appear visually good.


Figure 4: Performance on the Caltech 101 Silhouettes dataset. (Left) Training data. (Middle) Averaged synthesized samples. (Right) Learned features at the bottom layer.

Figure 5: Average variational lower bound of test data obtained from the SBN 200−200 model on the Caltech 101 Silhouettes dataset, as a function of the number of iterations (pretraining layer 1, pretraining layer 2, then global training).

5.4 OCR letters dataset

The OCR letters dataset contains 16 × 8 binary pixel images of 26 letters in the English alphabet. The dataset is split into 42,152 training and 10,000 test examples. Results are reported in Table 3. The proposed ARSBN with K = 200 hidden units achieves a lower bound of −37.97. The state of the art here is a DBM with 2000 hidden units in each layer (Salakhutdinov and Larochelle, 2010). Our model gives results that are only marginally worse using a network with 100 times fewer connections.

Table 3: Log probability of test data on the OCR letters dataset. "Dim" represents the number of hidden units in each layer, starting with the bottom one. (∗) taken from Salakhutdinov and Larochelle (2010).

Model             Dim          Test log-prob.
SBN (online VB)   200          −48.71
SBN (VB)          200          −48.20
SBN (VB)          200−200      −47.84
FVSBN (VB)        −            −39.71
ARSBN (VB)        200          −37.97
ARSBN (VB)        200−200      −38.56
SBN (Gibbs)       200          −40.95
DBM∗              2000−2000    −34.24

6 Discussion and future work

A simple and efficient Gibbs sampling algorithm and a mean field variational Bayes approximation are developed for learning and inference of model parameters in sigmoid belief networks. This has been implemented in a novel way by introducing auxiliary Pólya-Gamma variables. Several encouraging experimental results have been presented, enhancing the idea that the problem can be efficiently tackled in a fully Bayesian framework.

While this work has focused on binary observations, one can model real-valued data by building latent binary hierarchies as employed here, and touching the data at the bottom layer by a real-valued mapping, as has been done in related RBM models (Salakhutdinov et al., 2013). Furthermore, the logistic link function is typically utilized in the deep learning literature. The probit function and the rectified linearity are also considered in the nonlinear Gaussian belief network (Frey and Hinton, 1999). Under the Bayesian framework, by using data augmentation (Polson et al., 2011), the max-margin link could be utilized to model the nonlinearities between layers when training a deep model.
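As one possible instantiation of the real-valued extension mentioned above (an assumption for illustration, not something specified in the paper), the bottom layer could be replaced by a Gaussian mapping of the binary hidden units:

```python
import numpy as np

def sample_real_valued_sbn(W, b, c, sigma, rng=np.random.default_rng(0)):
    """Sketch of a real-valued bottom layer on top of a binary hidden layer.

    Hidden layer as in the SBN: h_k ~ Bernoulli(sigmoid(b_k)).
    Visible layer replaced by a Gaussian mapping: v | h ~ N(W h + c, sigma^2 I).
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    h = (rng.random(b.shape[0]) < sigmoid(b)).astype(float)
    v = rng.normal(loc=W @ h + c, scale=sigma)
    return v, h
```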


Acknowledgements

This research was supported in part by ARO, DARPA, DOE, NGA and ONR.

References

A. Armagan, M. Clyde, and D. B. Dunson. Generalized beta mixtures of Gaussians. NIPS, 2011.
D. Barber and P. Sollich. Gaussian fields for approximate inference in layered sigmoid belief networks. NIPS, 1999.
Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. PAMI, 2013.
J. Chen, J. Zhu, Z. Wang, X. Zheng, and B. Zhang. Scalable inference for logistic-normal topic models. NIPS, 2013.
N. Chen, J. Zhu, F. Xia, and B. Zhang. Discriminative relational topic models. PAMI, 2014.
K. Cho, T. Raiko, and A. Ilin. Enhanced gradient for training restricted Boltzmann machines. Neural Computation, 2013.
P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The Helmholtz machine. Neural Computation, 1995.
B. J. Frey. Graphical models for machine learning and digital communication. MIT Press, 1998.
B. J. Frey and G. E. Hinton. Variational learning in nonlinear Gaussian belief networks. Neural Computation, 1999.
K. Gregor, A. Mnih, and D. Wierstra. Deep autoregressive networks. ICML, 2014.
G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002.
G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 1995.
M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. JMLR, 2013.
D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.
H. Larochelle and I. Murray. The neural autoregressive distribution estimator. JMLR, 2011.
H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief net model for visual area V2. NIPS, 2008.
B. M. Marlin, K. Swersky, B. Chen, and N. D. Freitas. Inductive principles for restricted Boltzmann machine learning. AISTATS, 2010.
A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. ICML, 2014.
I. Murray and R. Salakhutdinov. Evaluating probabilities under high-dimensional latent variable models. NIPS, 2009.
R. M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 1992.
N. G. Polson and J. G. Scott. Local shrinkage rules, Lévy processes and regularized regression. JRSS, 2012.
N. G. Polson, S. L. Scott, et al. Data augmentation for support vector machines. Bayesian Analysis, 2011.
N. G. Polson, J. G. Scott, and J. Windle. Bayesian inference for logistic models using Pólya-Gamma latent variables. JASA, 2013.
D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ICML, 2014.
R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. AISTATS, 2009.
R. Salakhutdinov and H. Larochelle. Efficient learning of deep Boltzmann machines. AISTATS, 2010.
R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. ICML, 2008.
R. Salakhutdinov, J. B. Tenenbaum, and A. Torralba. Learning with hierarchical-deep models. PAMI, 2013.
L. K. Saul, T. Jaakkola, and M. I. Jordan. Mean field theory for sigmoid belief networks. JAIR, 1996.
H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. ICML, 2009.
M. Zhou, L. Li, D. Dunson, and L. Carin. Lognormal and gamma mixed negative binomial regression. ICML, 2012.
