Infinite Variational Autoencoder for Semi-Supervised Learning

M. Ehsan Abbasnejad, Anthony Dick, Anton van den Hengel
The University of Adelaide
{ehsan.abbasnejad, anthony.dick, anton.vandenhengel}@adelaide.edu.au

Abstract

This paper presents an infinite variational autoencoder (VAE) whose capacity adapts to suit the input data. This is achieved using a mixture model in which the mixing coefficients are modeled by a Dirichlet process, allowing us to integrate over the coefficients when performing inference. Critically, this then allows us to automatically vary the number of autoencoders in the mixture based on the data. Experiments show the flexibility of our method, particularly for semi-supervised learning, where only a small number of training samples are available.

1. Introduction

The Variational Autoencoder (VAE) [18] is a newly introduced tool for unsupervised learning of a distribution p(x) from which a set of training samples x is drawn. It learns the parameters of a generative model, based on sampling from a latent variable space z, and approximating the distribution p(x|z). By designing the latent space to be easy to sample from (e.g. Gaussian) and choosing a flexible generative model (e.g. a deep neural network), a VAE provides a flexible and efficient means of generative modeling.

One limitation of this model is that the dimension of the latent space and the number of parameters in the generative model are fixed in advance. This means that while the model parameters can be optimized for the training data, the capacity of the model must be chosen a priori, assuming some foreknowledge of the training data characteristics.

In this paper we present an approach that utilizes Bayesian non-parametric models [1, 8, 31, 13] to produce an infinite mixture of autoencoders. This infinite mixture is capable of growing with the complexity of the data to best capture its intrinsic structure.

Our motivation for this work is the task of semi-supervised learning. In this setting, we have a large volume of unlabelled data but only a small number of labelled training examples. In our approach, we train a generative model using the unlabelled data, and then use this model combined with whatever labelled data is available to train a discriminative model for classification.

We demonstrate that our infinite VAE outperforms both the classical VAE and standard classification methods, particularly when the number of available labelled samples is small. This is because the infinite VAE is able to more accurately capture the distribution of the unlabelled data. It therefore provides a representation that allows the discriminative model, which is trained based on its output, to be more effectively learnt from a small number of samples.

The main contribution of this paper is twofold: (1) we provide a Bayesian non-parametric model for combining autoencoders, in particular variational autoencoders, which bridges the gap between non-parametric Bayesian methods and deep neural networks; (2) we provide a semi-supervised learning approach that utilizes the infinite mixture of autoencoders learned by our model for prediction from a small number of labeled examples.

The rest of the paper is organized as follows. In Section 2 we review relevant methods, while in Section 3 we briefly provide background on the variational autoencoder. In Section 4 our non-parametric Bayesian approach to an infinite mixture of VAEs is introduced. We provide the mathematical formulation of the problem and show how the combination of Gibbs sampling and variational inference can be used for efficient learning of the underlying structure of the input. Subsequently, in Section 5 we combine the infinite mixture of VAEs, as an unsupervised generative approach, with discriminative deep models to perform prediction in a semi-supervised setting. In Section 6 we provide an empirical evaluation of our approach on various datasets, including natural images and 3D shapes. We use various discriminative models, including the Residual Network [12], in combination with our model and show that our approach is capable of outperforming our baselines.

2. Related Work

Most successful learning algorithms, especially those based on deep neural networks, require a large volume of labeled instances for training.

Semi-supervised learning seeks to utilize the unlabeled data to achieve strong generalization by exploiting a small set of labeled examples. For instance, unlabeled data from the web is used with label propagation in [6] for classification. Similarly, semi-supervised learning has been applied to object detection in videos [28] and in images [43, 7].

Most of these approaches are developed by either (a) projecting the unlabeled and labeled instances into an embedding space and using nearest neighbors, so that distances can be used to infer the labels, similar to label propagation in shallow [15, 42, 14] or deep networks [45]; or (b) formulating some variation of a joint generative-discriminative model that uses the latent structure of the unlabeled data to better learn the decision function with labeled instances. For example, ensemble methods [3, 26, 24, 47, 5] assign pseudo-class labels based on the constructed ensemble learner and in turn use them to find a new learner to be added to the ensemble.

In recent years, deep generative models have gained attention with the success of Restricted Boltzmann machines (and their infinite variant [4]) and autoencoders (e.g. [17, 21]) with their stacked variants [41]. The representations learned by these unsupervised approaches are then used for supervised learning.

Other approaches related to ours are adversarial networks [9, 29, 25], in which the generative and discriminative models are trained jointly. This model penalizes the generative model as long as the samples drawn from it do not perform well in the discriminative model, in a min-max optimization. Although theoretically well justified, training such models has proved to be difficult.

Our formulation for semi-supervised learning is also related to the Highway [40] and Memory [44] networks, which seek to combine multiple channels of information that capture various aspects of the data for better prediction, even though their approaches mainly focus on depth.

3. Variational autoencoder

While autoencoders typically assume a deterministic latent space, in a variational autoencoder the latent variable is stochastic. The input x is generated from a variable in that latent space z. Since the joint distribution of the input, when all the latent variables are integrated out, is intractable, we resort to variational inference (hence the name). The model is defined as:

    p_θ(z)   = N(z; 0, I),
    p_θ(x|z) = N(x; µ(z), σ(z)I),
    q_φ(z|x) = N(z; µ(x), σ(x)I),

where θ and φ are the parameters of the model to be found. The objective is then to minimize the following loss, composed of a reconstruction error term and a regularization term:

    −E_{z∼q_φ(z|x)}[log p_θ(x|z)] + KL(q_φ(z|x) || p(z)).    (1)

Figure 1. Variational encoder: solid lines are direct connections and dotted lines are sampled. The input layer, represented by x, and the hidden layer h determine the moments of the variational distribution, from which the latent variable z is sampled.

The first term in this loss is the reconstruction error, or expected negative log-likelihood of the datapoint. The expectation is taken with respect to the encoder's distribution over the representations, using a few samples. This term encourages the decoder to learn to reconstruct the data when using samples from the latent distribution; a large error indicates the decoder is unable to reconstruct the data. A schematic of the encoder network is shown in Figure 1. As shown, a deep network learns the mean and variance of a Gaussian from which subsequent samples of z are generated.

The second term is the Kullback-Leibler divergence between the encoder's distribution q_φ(z|x) and p(z). This divergence measures how much information is lost when using q to represent a prior over z and encourages its values to be Gaussian. To perform inference efficiently, the reparameterization trick [18] is employed which, in combination with deep neural networks, allows the model to be trained with backpropagation.
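As a concrete illustration of the loss in Equation 1 and the reparameterization trick, the following sketch implements a single VAE with a binary decoder in PyTorch; the fully connected encoder and the layer sizes are arbitrary illustrative choices, not a prescribed architecture.

```python
# A minimal VAE sketch: encoder produces the moments of q_phi(z|x), the
# reparameterization trick gives a differentiable sample of z, and the loss
# is the negative ELBO of Eq. (1): reconstruction error + KL regularizer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=50):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)        # hidden layer h in Figure 1
        self.mu = nn.Linear(h_dim, z_dim)         # mean of q_phi(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)     # log-variance of q_phi(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(model, x):
    """Negative ELBO of Eq. (1) for a binary decoder (binomial p_theta(x|z))."""
    logits, mu, logvar = model(x)
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```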

4. Infinite Mixture of Variational Autoencoders

An autoencoder in its classical form seeks to find an embedding of the input such that its reproduction has the least discrepancy. A variational autoencoder modifies this notion by introducing a Bayesian view, in which the conditional distribution of the latent variables given the input is modelled along with the distribution of the input given the latent variables, while ensuring that the distribution of the latent variable stays close to a Gaussian with zero mean and unit variance.

A single variational autoencoder has a fixed capacity and thus might not be able to capture the complexity of the input well. However, by using a collection of VAEs we can ensure that we are able to model the data, by adapting the number of VAEs in the collection to fit the data.
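One simple way to organise such an adaptive collection in code is sketched below. The MixtureOfVAEs class and its method names are hypothetical and only illustrate the grow-and-shrink behaviour of the mixture; they do not describe the actual implementation.

```python
# A growable/shrinkable collection of VAE components.
import torch.nn as nn

class MixtureOfVAEs(nn.Module):
    def __init__(self, make_vae, num_initial=1):
        super().__init__()
        self.make_vae = make_vae                      # factory, e.g. lambda: VAE()
        self.components = nn.ModuleList([make_vae() for _ in range(num_initial)])

    def add_component(self):
        """Grow the mixture when an instance is not well represented (Eq. 3)."""
        self.components.append(self.make_vae())
        return len(self.components) - 1

    def remove_empty(self, counts):
        """Shrink: drop components with no assigned instances (see Section 6)."""
        keep = [v for v, n in zip(self.components, counts) if n > 0]
        self.components = nn.ModuleList(keep)
```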

Figure 2. The infinite mixture of VAEs is shown as a block within which the VAE components operate. Each latent variable zi (one dimensional in this illustration) in each VAE is drawn from a Gaussian distribution. Solid lines indicate nonlinear encoding and dashed lines are decoders. In this diagram, φ and θ are the parameters of the encoder and decoder respectively.

In our infinite mixture, we seek to find a mixture of these variational autoencoders such that its capacity can theoretically grow to infinity. Each autoencoder is then able to capture a particular aspect of the data. For instance, one might be better at representing round structures, and another better at straight lines. This mixture intuitively represents the various underlying aspects of the data. Moreover, since each VAE models the uncertainty of its representations through the density of the latent variable, we know how confident each autoencoder is in reconstructing the input.

One advantage of our non-parametric mixture model is that we take a Bayesian approach in which the distribution of the parameters is taken into account. As such, we capture the uncertainty of the model parameters. Autoencoders that are less confident about their reconstruction have less effect on the output. As shown in Figure 2, each encoder finds a distribution for the embedding variable, with some probability, through a nonlinear transform (convolutional or fully connected layers in a neural net). Each autoencoder in the mixture block produces a probability measure for its ability to reconstruct the input. This behavior has parallels to the brain's ability to develop specialized regions responsible for particular visual tasks and for processing particular types of image pattern.

Mixture models are traditionally built using a pre-determined number of weighted components. Each weight coefficient determines how likely it is for a predictor to be successful in producing an accurate output. These coefficients are drawn from a multinomial distribution in which the number of coefficients is fixed. To learn an infinite mixture of variational autoencoders in a non-parametric Bayesian manner, we instead employ a Dirichlet process. In a Dirichlet process, unlike traditional mixture models, we assume the probability of each component is drawn from a multinomial with a Dirichlet prior. The advantage of taking this approach is that we can integrate over all possible mixing coefficients, which allows the number of components to be determined based on the data.

Formally, let c be the assignment matrix of each instance to a VAE component (that is, which VAE is able to best reconstruct instance i) and let π be the mixing coefficient prior for c. For n unlabeled instances we model the infinite mixture of VAEs as

    p(c, π, θ, x_{1,...,n}, α) = p(c|π) p(π|α) ∫ p_θ(x_{1,...,n} | c, z) p(z) dz.

We assume the mixing coefficients are drawn from a Dirichlet distribution with parameter α (see Figure 3 for examples),

    p(π_1, ..., π_C | α) ∼ Dir(α/C).

Figure 3. Dirichlet distribution with various values of α (α = 0.99, α = 2, α = 50). Smaller values of α tend to concentrate the mass in the corners (in this simplex example, and in general as the dimension increases). These smaller values reduce the chance of generating new autoencoder components.

To determine the membership of each instance in one of the components of the mixture model, i.e. the likelihood that each variational autoencoder is able to encode the input and reconstruct it with minimum loss, we compute the conditional probability of membership. This conditional probability of each instance belonging to an autoencoder component is computed by integrating over all mixing components π, that is [35, 36],

    p(c, θ, x_{1,...,n}, α) = ∫∫ ∏_i p_{θ_{c_i}}(x_i | z_{c_i}) p(z_{c_i}) p(c|π) p(π|α) dπ dz_{c_i}.

This integration accounts for all possible mixing coefficients for all assignments of the instances to VAEs. The distribution of c is multinomial, for which the Dirichlet distribution is the conjugate prior, and as such this integration is tractable.
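The following small numpy illustration, a stand-alone sketch rather than part of the model, samples mixing coefficients from the symmetric Dir(α/C) prior for the values of α shown in Figure 3 and shows how small α concentrates the mass on a few components.

```python
# Effect of alpha on the symmetric Dirichlet prior Dir(alpha/C).
import numpy as np

rng = np.random.default_rng(0)
C = 10  # truncation level used only for this illustration
for alpha in (0.99, 2.0, 50.0):
    pi = rng.dirichlet(np.full(C, alpha / C))
    print(f"alpha={alpha:5.2f}  largest weight={pi.max():.2f}  "
          f"components above 1%={np.sum(pi > 0.01)}")
```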

To perform inference for the parameters θ and c we perform block Gibbs sampling, iterating between optimizing θ for each VAE and updating the assignments in c. Optimization uses the variational autoencoder's reparameterization trick to minimize the loss in Equation 1. To update c, we perform the following Gibbs sampling:

• The conditional probability that an instance i belongs to VAE c:

      p(c_i = c | c_{\i}, x_i, α) = η_c(x_i) / (n − 1 + α),    (2)

  where η_c(x_i) is the occupation number of cluster c, excluding instance i, for n instances. We define

      η_c(x_i) = (n − 1) p_{θ_c}(c_i = c | x_i),

  and

      p_{θ_c}(c_i = c | x_i) = exp( E_{z_c∼q_{φ_c}(z|x)}[ log p_{θ_c}(x_i | z_c) ] ) / Σ_j exp( E_{z_j∼q_{φ_j}(z|x)}[ log p_{θ_j}(x_i | z_j) ] ),

  which evaluates how likely an instance x_i is to be assigned to the cth VAE using latent samples z_c.

• The probability that instance i is not well represented by any of the existing autoencoders and a new encoder has to be generated:

      p(c_i = c | c_{\i}, x_i, α) = α / (n − 1 + α).    (3)

Note that in principle η_c(x_i) is a measure calculated by excluding the ith instance from the observations, so that its membership is calculated with respect to its "similarity" to the other members of the cluster. However, here we use the cth VAE as an estimate of this occupation number for performance reasons. This is justified so long as the influence of a single observation on the latent representation of an encoder is negligible. When a sample for the new assignment is drawn from the multinomial distribution in Equation 2, there is a chance for a completely different VAE to fit the new instance. If that VAE is not successful in fitting it, the instance will be assigned to its original VAE with high probability in the subsequent iteration.
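A minimal numpy sketch of one such Gibbs step is given below. The array expected_loglik stands for the per-VAE expected log-likelihoods E[log p_θc(x_i|z_c)] estimated from a few latent samples; the function name and interface are illustrative assumptions.

```python
# One Gibbs update of the assignment c_i using Eqs. (2) and (3).
import numpy as np

def sample_assignment(expected_loglik, n, alpha, rng):
    """expected_loglik: length-C array of expected log-likelihoods of x_i under each VAE."""
    # Softmax over VAEs: p_theta_c(c_i = c | x_i).
    w = np.exp(expected_loglik - expected_loglik.max())
    p_c_given_x = w / w.sum()
    eta = (n - 1) * p_c_given_x                        # occupation-number estimate
    # Existing components follow Eq. (2); the appended entry is the new-VAE case of Eq. (3).
    probs = np.append(eta, alpha) / (n - 1 + alpha)
    return rng.choice(len(probs), p=probs)             # index == C means "create a new VAE"

rng = np.random.default_rng(0)
c_new = sample_assignment(np.array([-95.0, -120.0, -110.0]), n=1000, alpha=2.0, rng=rng)
```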

The entire learning process is summarised in Algorithm 1. To improve performance, at each iteration of our approach we keep track of the cth VAE's assignment changes in a set A_c. This allows us to efficiently update each VAE using a backpropagation operation for the new assignments. We perform two operations after the VAE assignments are done: (1) forget, and (2) learn. In the forgetting stage, we unlearn the instances that have been reassigned away from the given VAE. This is done by performing a gradient update with a negative learning rate, i.e. reverse backpropagation. In the learning stage, on the other hand, we update the parameters of the given VAE with a positive learning rate, as is commonly done using backpropagation. This alternation allows structurally similar instances that can share latent variables to be learned by a single VAE, while forgetting those that are not well suited.

Algorithm 1 Learning an Infinite Mixture of Variational Autoencoders
    Initialize VAE assignments c
    A_c = {} for all c = 1, ..., C
    while not converged do
        for x_i in X do                              ▷ VAE assignments
            Assign c_i^new to a new VAE with probability given by Eq. 3;
            otherwise, sample c_i^new from the existing VAEs according to Eq. 2
            if c_i^new ≠ c_i then
                A_{c_i} = A_{c_i} ∪ {i}              ▷ the given VAE has to forget i
            end if
        end for
        Update C for the new VAEs
        for c = 1, ..., C do                         ▷ update VAEs
            Forget A_c in the cth VAE
            Learn the cth VAE with all i where c_i^new = c
        end for
    end while
    Return the infinite mixture of VAEs
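The forget and learn operations of Algorithm 1 amount to two ordinary gradient updates with opposite signs. The sketch below assumes the vae_loss function from the earlier VAE sketch; ascending the loss is equivalent to backpropagation with a negative learning rate.

```python
# Forget/learn updates for one VAE component (PyTorch).
import torch

def forget(vae, optimizer, x_forget):
    """Unlearn the instances in x_forget (the set A_c of Algorithm 1)."""
    optimizer.zero_grad()
    loss = vae_loss(vae, x_forget)
    (-loss).backward()        # ascend the loss: reverse backpropagation
    optimizer.step()

def learn(vae, optimizer, x_keep):
    """Standard update on the instances currently assigned to this VAE."""
    optimizer.zero_grad()
    vae_loss(vae, x_keep).backward()
    optimizer.step()
```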

To reconstruct an input x with the infinite mixture, the expected reconstruction is defined as

    E[x] = Σ_c p_{θ_c}(c_i = c | x_i) E_{q_φ(z_c|x)}[ x | z_c ].    (4)

That is, we use each VAE to reconstruct the input and weight it with the probability of that VAE (this probability is inversely proportional to the variance of each VAE).

5. Semi-Supervised Learning using Infinite Autoencoders

Many of deep neural networks' greatest successes have been in supervised learning, which depends on the availability of large labeled datasets. However, in many problems such datasets are unavailable and alternative approaches, such as a combination of generative and discriminative models, have to be employed. In semi-supervised learning, where the number of labeled instances is small, we employ our infinite mixture of VAEs to assist supervised learning. Inspired by the mixture of experts [30, Chapter 11], we formulate the problem of predicting the output y* for a test example x* as

    p(y* | x*) = Σ_c p(y* | x*, ω_c) × p_{θ_c}(c_i = c | x_i),

where the first factor is a deep discriminative expert and the second is the deep generative weight. This formulation for prediction combines the discriminative power of a deep learner with parameter set ω_c and a flexible generative model. For a given test instance x*, each discriminative expert produces a tentative output that is then weighted by the generative model. As such, each discriminative expert learns to perform better on instances that are more structurally similar from the generative model's perspective.

During training we minimize the negative log of the discriminative term (log loss), weighted by the generative weight. Each instance's weight, as calculated by the infinite autoencoder, acts as an additional coefficient on the gradient in backpropagation. This leads to similar instances receiving stronger weights in the neural net during training.
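The weighted prediction and the weighted log loss can be sketched in PyTorch as follows. Here experts is a list of per-component softmax heads sharing their lower layers, and mixture_weights(x) stands for the generative weights p_θc(c_i = c|x_i); both names are assumptions made only for the illustration.

```python
# Mixture-of-experts prediction and generatively weighted log loss.
import torch
import torch.nn.functional as F

def predict(experts, mixture_weights, features):
    """p(y*|x*) = sum_c p(y*|x*, w_c) * p_theta_c(c = c | x*)."""
    w = mixture_weights(features)                     # (batch, C), rows assumed to sum to 1
    probs = torch.stack([F.softmax(e(features), dim=1) for e in experts], dim=2)
    return (probs * w.unsqueeze(1)).sum(dim=2)        # (batch, num_classes)

def weighted_log_loss(experts, mixture_weights, features, labels):
    """Negative log of the discriminative term, weighted per instance and per expert."""
    w = mixture_weights(features)                     # generative weights act as constants here
    losses = torch.stack([F.cross_entropy(e(features), labels, reduction='none')
                          for e in experts], dim=1)   # (batch, C)
    return (w.detach() * losses).sum(dim=1).mean()
```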

Moreover, it should be noted that the generative and discriminative models can share deep parameters ω_c and θ_c at some level. In particular, in our implementation we only consider the parameters of the last layer to be distinct for each discriminative and generative component. We summarize our framework in Figure 4.

Figure 4. Our framework for the infinite mixture of VAEs and semi-supervised learning. We share the parameters of the discriminative model at the lower levels for more efficient training and prediction. For each VAE in the mixture we have an expert (e.g. a softmax) before the output. Thicker arrows indicate more probable connections.

While combining an unsupervised generative model with a supervised discriminative model is not itself novel, in our problem the generative model can grow to capture the complexity of the data. In addition, since we share the parameters of the discriminative and generative models, each unsupervised learner does not need to learn all aspects of the input. In fact, in many classification problems with images, individual pixel values hardly matter for the final decision. As such, by sharing parameters the unsupervised model incurs a heavier loss when the distribution of the latent variables does not encourage the correct final decision. This sharing is done by reusing the parameters that are initialized with labels.

6. Experiments

In this section we examine the performance of our approach for semi-supervised classification on various datasets. We investigate how the combination of the generative and discriminative networks is able to perform semi-supervised learning effectively. Since convergence of Gibbs sampling can be very slow, we first pre-train the base VAE with all the unlabeled examples. Each autoencoder is trained with a two-dimensional latent variable z and initialized randomly; hence each new VAE is already capable of reconstructing the input to a certain extent. During the sampling steps, each VAE becomes more specialized in a particular structure of the inputs. To further facilitate sampling, we set the number of clusters equal to the number of classes and use 100 random labeled examples to fine-tune the VAE assignments. At each iteration, if there is no instance assigned to a VAE, it is removed. As such, the mixture grows and shrinks with each iteration as instances are assigned to VAEs. We report results over 3 trials.

To compare the autoencoders' ability to internally capture the structure of the input, we compared the latent representation obtained by a single VAE and the expected latent representation from our approach in Equation 4, and subsequently trained a support vector machine (SVM) with it. For the expectations, we used 20 samples from the latent variable space.

Once the generative model is learned with all the unlabelled instances using the infinite mixture model of Section 4, we randomly select a subset of labeled instances for training the discriminative model. Throughout the experiments, we share the parameters of the discriminative architecture from the input to the last layer, so that each expert is represented by a softmax.

We report classification results on various problems including handwritten binary images, natural images and 3D shapes. Although the performance of our semi-supervised learning approach depends on the choice of the discriminative model, we observe that our approach outperforms the baselines, particularly with smaller numbers of labeled instances. For all training, whether discriminative or generative, we set the maximum number of iterations to 1000 with batch size 500 for stochastic gradient descent with a constant learning rate of 0.001. For the VAEs we use Adam [16] updates with β1 = 0.9 and β2 = 0.999. However, we set a threshold on the changes in the loss to detect convergence and stop the training. Except for the binary images, where we use a binary decoder (p_θ(x|z) is binomial), our decoder is continuous (p_θ(x|z) is Gaussian), and samples from the latent space are used to regenerate the input to compute the loss. In problems where the input is too complex for the autoencoder to perform well, we share the output of the last layer of the discriminative model with the VAEs.
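A sketch of the latent-representation comparison described above is given below: the expected latent code under the mixture is used as a feature vector for a scikit-learn SVM. The sample_latent method and the weighting interface are hypothetical names used only for illustration, and all VAEs are assumed to share the same latent dimensionality.

```python
# Expected latent representation (the latent-space analogue of Eq. 4) fed to an SVM.
import numpy as np
from sklearn.svm import SVC

def expected_latent(vaes, weights, x, num_samples=20):
    """Average latent samples of each VAE, weighted by p_theta_c(c = c | x)."""
    feats = []
    for vae, w in zip(vaes, weights):
        z = np.stack([vae.sample_latent(x) for _ in range(num_samples)]).mean(axis=0)
        feats.append(w * z)
    return np.sum(feats, axis=0)

# features = np.stack([expected_latent(vaes, mixture_weights(x), x) for x in labelled_x])
# clf = SVC().fit(features, labelled_y)
```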
6.1. MNIST Dataset

The MNIST dataset (http://yann.lecun.com/exdb/mnist/) contains 60,000 training and 10,000 test images of handwritten digits of size 28 × 28. Some random images from this dataset are shown in Figure 5(a). We use the original VAE algorithm (a single VAE) with 100 iterations and 50 hidden variables to learn a representation for these digits, with a binary distribution for the input p_θ(x|z). As shown in Figure 5(b), these reconstructions are very unclear and at times wrong (see the 6th column, where a 7 is wrongly reconstructed as a 9). Using this VAE as a base, we train an infinite mixture of our generative model.

Figure 5. An illustration of the autoencoders' input reconstruction: (a) original images; (b) VAE reconstruction (50 hidden variables); (c) VAE reconstruction (1024 hidden variables); (d) Infinite Mixture reconstruction (18 clusters using base VAEs with 50 hidden variables). The first row shows the original images, the reconstructions in (b) and (c) are obtained using a single VAE, and the images in the last row are obtained from the proposed mixture model of 18 VAEs, each with 50 hidden units. As seen, the reconstructed images are clearer in (d).

After 10 iterations with α = 2, the expected reconstruction E[x] is depicted in Figure 5(d). We use 2 samples to compute E[x] for the cth VAE. As observed, this reconstruction is visually better and the mistake in the 6th column is fixed. Further, Figure 5(c) shows the result of using a single VAE with 1024 hidden units. It is interesting to note that even though our proposed model has a smaller total number of hidden units (900 vs 1024), the reconstruction is better using our model.

In Table 1 we summarize the reconstruction error (that is, ||x − E[x]||) for our approach versus the original VAE. As seen, our approach performs similarly to the VAE when the number of hidden units is almost the same (1000 vs 1024), and with a higher number of VAEs we are able to reduce the reconstruction error significantly.

    Method             C    # hidden units   Error
    Infinite Mixture   2    100              9.17
    Infinite Mixture   10   100              5.12
    Infinite Mixture   17   100              4.9
    VAE                1    100              5.92
    VAE                1    1024             5.1

Table 1. Reconstruction error for the MNIST dataset, measured as the norm of the difference between the input image and the expected reconstruction, comparing our approach with the original VAE.

To test our approach in a semi-supervised setting, we use a deep Convolutional Neural Network (CNN). Our deep CNN architecture consists of two convolutional layers with 32 filters of 5 × 5 and Rectified Linear Unit (ReLU) activations, with 2 × 2 max-pooling after each one. We add a fully connected layer with 256 hidden units followed by a dropout layer and then the softmax output layer. As shown in Table 2, our infinite mixture with 17 base VAEs is able to outperform most of the state-of-the-art methods. Only the recently proposed Virtual Adversarial Network [29] performs better than ours with small numbers of training examples.

    Method / Labels             100             1000            All
    Pseudo-label [23]           10.49           3.64            0.81
    EmbedNN [45]                16.9            5.73            3.59
    DGN [17]                    3.33 ± 0.14     2.40 ± 0.02     0.96
    Adversarial [9]             -               -               0.78
    Virtual Adversarial [29]    2.66            1.50            0.64 ± 0.03
    AtlasRBF [32]               8.10 ± 0.95     3.68 ± 0.12     1.31
    PEA [2]                     5.21            2.64            2.30
    Γ-Model [34]                4.34 ± 2.31     1.71 ± 0.07     0.79 ± 0.05
    Baseline CNN                8.62 ± 1.87     4.16 ± 0.35     0.68 ± 0.02
    Infinite Mixture            3.93 ± 0.5      2.29 ± 0.2      0.6 ± 0.02

Table 2. Test error (%) on MNIST with 17 clusters and 100 hidden variables, for 100, 1000 and all labels. Only [29] reports better performance than ours.
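The baseline CNN described above can be written in PyTorch as the following sketch; the padding and dropout rate are assumptions not specified in the text.

```python
# Baseline CNN for MNIST: two 5x5 conv layers (32 filters, ReLU, 2x2 max-pooling),
# a 256-unit fully connected layer, dropout, and a softmax output layer.
import torch.nn as nn

mnist_cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 14x14
    nn.Conv2d(32, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),  # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 256), nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(256, 10),  # logits; softmax is applied in the loss or at prediction time
)
```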

6.2. Dogs Experiment

ImageNet is a dataset containing 1,461,406 natural images manually labeled according to the WordNet hierarchy into 1000 classes. We select a subset of 10 breeds of dogs for our experiment: "Maltese dog, dalmatian, German shepherd, Siberian husky, St Bernard, Samoyed, Border collie, bull mastiff, chow, Afghan hound", with 10,400 training and 2,600 test images. We use this dogs subset to illustrate the latent space and how the mixture of VAEs is able to represent the uncertainty in the hidden variables. We fine-tune a pre-trained AlexNet [20] as the base discriminative model and share its parameters with the generative model. In particular, we use the 4096-dimensional output of the 7th fully connected layer (fc7) as the input for both the softmax experts and the VAE autoencoders. We train the generative model with all the unlabeled dog instances, use 1000 hidden units for each VAE, set α = 2, and stop with 14 autoencoders.

We randomly select 5 images of dogs (from this ImageNet subset) and 5 images of anything else (non-dogs, from Flickr with a Creative Commons License) for the illustration. In Figure 6 we plot the two-dimensional latent representation of these images in the VAEs of the learnt mixture. In each plot, the position of a circle represents the mean of the density for the given image in this space and its radius the variance. As shown, the images of non-dogs are generally clustered together in this latent space, which indicates that they are recognized to be different. In addition, the variance of the non-dogs is generally higher than that of the dogs. As such, even when the means are not discriminative enough (the non-dogs are not clustered sufficiently far apart in that VAE), we are uncertain about the representations that are not dogs. This uncertainty leads to a lower assignment probability (from Equation 3) and subsequently to a smaller weight for the given VAE in the mixture when learning the expert model. Moreover, dogs have smaller variance than non-dogs; hence the VAEs are uncertain about the representation of images that were not seen during training.

Figure 6. Two-dimensional latent space found from training our infinite mixture of VAEs on the dogs dataset. We randomly selected 5 dog images and 5 images of anything else and plotted their latent representation in each VAE (z1 for the first dimension and z2 for the second). The position of each circle represents the mean of the density for the given image in this space and its radius is the variance. As shown, the representations of the non-dogs (blue circles) are generally clustered far away from the dogs (red circles).

In Table 3 the accuracy of AlexNet on this dogs subset is shown and compared with our infinite mixture approach. As seen, our infinite mixture performs better, particularly with smaller numbers of labeled instances. In addition, the latent representation of the infinite mixture (computed as an expectation) used in an SVM significantly outperforms that of a single VAE. This illustrates the ability of our model to better capture the underlying representations.

Table 3. Test accuracy on the dogs dataset with 100 and 1000 labeled examples. The first two rows compare the latent representations obtained from a single VAE and from our proposed approach (Latent VAE+SVM and Latent Mixture+SVM); the last two rows compare AlexNet [20] with our Infinite Mixture.

6.3. CIFAR Dataset

The CIFAR-10 dataset [19] is composed of 10 classes of 32 × 32 RGB natural images, with 50,000 images for training and 10,000 images for testing. Our experiments show that a single VAE does not perform well for encoding this dataset, as is also confirmed in [22]. However, since our objective is to perform semi-supervised learning, we use the Residual Network (ResNet) [12], a successful model for image representation in discriminative learning, and share its parameters with our generative model. This parameter sharing may be useful for complex problems where the unsupervised approach alone is not sufficient. In addition, the autoencoders seek to preserve the distribution of the pixel values required for reconstructing the images, while this information has minimum impact on the final classification prediction. Therefore, parameter sharing, in which the generative model is combined with the classifier, is necessary for better prediction.

As such, we fine-tune a ResNet and use the output of its 127th layer as the input for the VAEs. We use 2000 hidden nodes and α = 2 to train an infinite mixture with 15 VAEs. For training we augment the training images by padding them with 4 pixels on each side and random cropping.

Table 4 reports the test error of running our approach on this dataset. As shown, our infinite mixture of VAEs combined with the powerful discriminative model outperforms the state of the art on this dataset. When all the training instances are used, the performance of our approach is the same as that of the discriminative model. This is because with larger labeled training sizes the instance weights provided by the generative model are averaged out and lose their impact, so all the experts become similar. With smaller numbers of labeled examples, on the other hand, each softmax expert specializes in a particular aspect of the data.

Table 4. Test error on CIFAR-10 with various numbers of labeled training examples (1000, 4000 and all labels), comparing Spike-and-slab [10], Maxout [11], Conv-Large, the Γ-Model [34], GDI [33], the Residual Network [12] and our Infinite Mixture of VAEs. The results reported in [34] did not include image augmentations, although the original approach in [39] seems to offer up to a 2% error reduction with augmentation.

6.4. 3D ModelNet

The ModelNet datasets were introduced in [46] to evaluate 3D shape classifiers. ModelNet has 151,128 3D models classified into 40 object categories, and ModelNet10 is a subset based on the classes in the NYUv2 dataset [38]. The 3D models are voxelized to fit a 30 × 30 × 30 grid and augmented by 12 rotations. For the discriminative model we use a convolutional architecture similar to that of [27], with a 3D convolutional layer with 32 filters of size 5 and stride 2, a second one with 32 filters of size 3 and stride 1, a max-pooling layer of size 2, and a 128-dimensional fully connected layer. Similar to the CIFAR-10 experiment, we share the parameters of the last fully connected layer between the infinite mixture of VAEs and the discriminative softmax.

As shown in Figure 7, when using the whole dataset our infinite mixture and the best result from [27] match at 92% accuracy. However, as we reduce the number of labeled training examples, it is clear that our approach outperforms a single softmax classifier.

Figure 7. Accuracy on ModelNet10 versus the number of labeled instances, compared to 3D ShapeNets [46] and DeepPano [37], averaged over 3 trials (curves: Our Approach, Single Softmax, 3D ShapeNets, DeepPano).

Additionally, Table 5 shows the accuracy comparison of the latent representations obtained from samples from our infinite mixture and from a single VAE, as measured by the performance of an SVM. As seen, the expected latent representation in our approach is significantly more discriminative and outperforms the single VAE. This is because we take into account the variations in the input and adapt to the complexity of the input. While a single VAE has to capture the dataset in its entirety, our approach is free to choose and fit.

    Method / Labels        100      1000     All
    VAE latent+SVM         64.21    79.09    82.71
    Mixture latent+SVM     74.01    83.26    85.68

Table 5. ModelNet10 accuracy of the latent variable representation used for training an SVM: a single VAE versus the expected latent variable in our approach.

Our experiments with both 2D and 3D images show that the initial convolutional layers play a crucial role in enabling the VAEs to encode the input into a latent space where the mixture of experts performs best. This 3D model further illustrates that the decision function mostly depends on the internal structure of the generative model rather than on reconstruction of the pixel values. When we share the parameters of the discriminative model with the generative infinite mixture of VAEs and learn the mixture of experts, we combine various representations of the data for better prediction.

7. Conclusion

In this paper we employed Bayesian non-parametric methods to propose an infinite mixture of variational autoencoders that can grow to represent the complexity of the input. Furthermore, we used these autoencoders to create a mixture of experts model for semi-supervised learning. For both 2D images and 3D shapes, our approach provides state-of-the-art results on various datasets.

We further showed that such mixtures, where each component learns to represent a particular aspect of the data, are able to produce better predictions using fewer total parameters than a single monolithic model. This applies whether the model is generative or discriminative. Moreover, in semi-supervised learning, where the ultimate objective is classification, parameter sharing between the discriminative and generative models was shown to provide better prediction accuracy.

In future work we plan to extend our approach to use variational inference rather than sampling for better efficiency. In addition, a new variational loss that minimizes the joint probability of the input and output in a Bayesian paradigm may further increase the prediction accuracy when the number of labeled examples is small.

References

[1] E. Abbasnejad, S. Sanner, E. V. Bonilla, and P. Poupart. Learning community-based preferences via Dirichlet process mixtures of Gaussian processes. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI '13, pages 1213–1219. AAAI Press, 2013.
[2] P. Bachman, O. Alsharif, and D. Precup. Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems 27, pages 3365–3373. Curran Associates, Inc., 2014.
[3] K. Chen and S. Wang. Regularized boost for semi-supervised learning. In Advances in Neural Information Processing Systems 20, pages 281–288. Curran Associates, Inc., 2008.
[4] M. Côté and H. Larochelle. An infinite restricted Boltzmann machine. CoRR, abs/1502.02476, 2015.
[5] D. Dai and L. V. Gool. Ensemble projection for semi-supervised image classification. In Proceedings of the 2013 IEEE International Conference on Computer Vision, ICCV '13, pages 2072–2079, Washington, DC, USA, 2013. IEEE Computer Society.
[6] S. Ebert, M. Fritz, and B. Schiele. Semi-Supervised Learning on a Budget: Scaling Up to Large Datasets, pages 232–245. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
[7] Y. Fu and L. Sigal. Semi-supervised vocabulary-informed learning. In CVPR, 2016.
[8] A. Gelman, J. B. Carlin, and H. S. Stern. Bayesian Data Analysis, volume 2. 2014.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
[10] I. J. Goodfellow, A. Courville, and Y. Bengio. Large-scale feature learning with spike-and-slab sparse coding. In International Conference on Machine Learning, 2012.
[11] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. ArXiv e-prints, Feb. 2013.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[13] N. L. Hjort, C. Holmes, P. Müller, and S. G. Walker. Bayesian Nonparametrics, volume 28. Cambridge University Press, 2010.
[14] K. In Kim, J. Tompkin, H. Pfister, and C. Theobalt. Semi-supervised learning with explicit relationship regularization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[15] F. Kang, R. Jin, and R. Sukthankar. Correlated label propagation with application to multi-label learning. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, CVPR '06, pages 1719–1726, Washington, DC, USA, 2006. IEEE Computer Society.
[16] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ArXiv e-prints, Dec. 2014.
[17] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27, pages 3581–3589. Curran Associates, Inc., 2014.
[18] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In The International Conference on Learning Representations (ICLR), 2014.
[19] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114. 2012.
[21] H. Larochelle, M. Mandel, R. Pascanu, and Y. Bengio. Learning algorithms for the classification restricted Boltzmann machine. In Journal of Machine Learning Research, 2012.
[22] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In The 33rd International Conference on Machine Learning, 2016.
[23] D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
[24] C. Leistner, A. Saffari, J. Santner, and H. Bischof. Semi-supervised random forests. In ICCV, 2009.
[25] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

[26] P. K. Mallapragada, R. Jin, A. K. Jain, and Y. Liu. SemiBoost: Boosting for semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell., 31(11):2000–2014, Nov. 2009.
[27] D. Maturana and S. Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In IROS, 2015.
[28] I. Misra, A. Shrivastava, and M. Hebert. Watch and learn: Semi-supervised learning of object detectors from videos. CoRR, abs/1505.05769, 2015.
[29] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, and S. Ishii. Distributional smoothing with virtual adversarial training. In International Conference on Learning Representations, 2016.
[30] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[31] P. Orbanz and Y. W. Teh. Bayesian nonparametric models. In Encyclopedia of Machine Learning, pages 81–89. Springer, 2011.
[32] N. Pitelis, C. Russell, and L. Agapito. Semi-supervised Learning Using an Unsupervised Atlas, pages 565–580. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014.
[33] Y. Pu, X. Yuan, A. Stevens, C. Li, and L. Carin. A deep generative deconvolutional image model. ArXiv e-prints, Dec. 2015.
[34] A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko. Semi-supervised learning with ladder networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS '15, pages 3546–3554, Cambridge, MA, USA, 2015. MIT Press.
[35] C. E. Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems 12, pages 554–560. MIT Press, 2000.
[36] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14, pages 881–888. MIT Press, 2002.
[37] B. Shi, S. Bai, Z. Zhou, and X. Bai. DeepPano: Deep panoramic representation for 3-D shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343, Dec. 2015.
[38] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In Proceedings of the 12th European Conference on Computer Vision - Volume Part V, ECCV '12, pages 746–760, Berlin, Heidelberg, 2012. Springer-Verlag.
[39] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
[40] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. ArXiv e-prints, May 2015.
[41] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
[42] C. Wang, S. Yan, L. Zhang, and H.-J. Zhang. Multi-label sparse coding for automatic image annotation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1643–1650. IEEE, 2009.
[43] Y.-X. Wang and M. Hebert. Model recommendation: Generating object detectors from few samples. In CVPR, 2015.
[44] J. Weston, S. Chopra, and A. Bordes. Memory networks. CoRR, abs/1410.3916, 2014.
[45] J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep Learning via Semi-supervised Embedding, pages 639–655. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[46] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[47] Z.-H. Zhou. When semi-supervised learning meets ensemble learning. Frontiers of Electrical and Electronic Engineering in China, 6(1):6–16, 2011.
