Iowa State University Capstones, Theses and Dissertations: Creative Components

Fall 2019

A Generative Model for Semi-Supervised Learning

Shang Da

Follow this and additional works at: https://lib.dr.iastate.edu/creativecomponents

Part of the Artificial Intelligence and Robotics Commons

Recommended Citation
Da, Shang, "A Generative Model for Semi-Supervised Learning" (2019). Creative Components. 382. https://lib.dr.iastate.edu/creativecomponents/382

This Creative Component is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Creative Components by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

A Generative Model for Semi-Supervised Learning

Shang Da

A Creative Component submitted to the graduate faculty in partial fulfillment of the requirements for the degree of

Master of Science in Computer Science

Program of Study Committee:

Dr. Jin Tian, Major Professor
Dr. Jia Liu, Committee Member
Dr. Wensheng Zhang, Committee Member

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University

Ames, Iowa

2019

Copyright © Shang Da, 2019. All rights reserved.


TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ACKNOWLEDGMENTS

ABSTRACT

CHAPTER 1. INTRODUCTION

CHAPTER 2. BACKGROUND
2.1 Variational Inference
2.2 The evidence lower bound (ELBO)
2.3 Variational auto-encoders

CHAPTER 3. RELATED WORK
3.1 Berkhahn et al.
3.2 Kingma's M2 model

CHAPTER 4. PROPOSED MODEL
4.1 Probabilistic model
4.2 Objective function
4.3 Model architecture

CHAPTER 5. EXPERIMENTS
5.1 Experiment Setup
5.2 Semi-supervised learning performance
5.2.1 Benchmarking
5.2.2 The effect of unlabeled data
5.3 Conditional data generation
5.3.1 Style and content separation
5.3.2 Conditional data generation performance

CHAPTER 6. CONCLUSION

REFERENCES

LIST OF FIGURES

Figure 1. Architecture of Berkhahn's model
Figure 2. Architecture of Kingma M2 model
Figure 3. Probabilistic model for data generation
Figure 4. Probabilistic model for data generation in Berkhahn's and Kingma's work
Figure 5. Probabilistic model for inference
Figure 6. Model architecture

LIST OF TABLES

Table 1. SSLGM_fc model structure
Table 2. SSLGM_cnn model structure
Table 3. Semi-supervised MNIST classification results

ACKNOWLEDGMENTS

I would like to thank my committee chair, Dr. Jin Tian, who has provided me with valuable guidance in every stage of the writing of this thesis. His keen and vigorous academic observation enlightens me not only in this thesis but also in my future study. I would also like to express my gratitude towards my committee members, Dr. Jia Liu, and Dr. Wensheng Zhang, for their valuable support.

In addition, I would also like to thank my friends, colleagues, the department faculty and staff for making my time at Iowa State University a wonderful experience. I want to also offer my appreciation to my parents for supporting me in my study.

ABSTRACT

Semi-Supervised learning is of great interest in a wide variety of research areas, including natural language processing, speech synthesis, image classification, genomics, etc. The Semi-Supervised Generative Model is one Semi-Supervised learning approach that learns from labeled data and unlabeled data simultaneously. A drawback of current Semi-Supervised Generative Models is that the latent encoding learnt by the generative model is concatenated directly with the predicted label, which may degrade representation learning. In this paper we present a new Semi-Supervised Generative Model that removes the direct dependency of data generation on the label and hence overcomes this drawback. We present experiments that validate this approach, together with comparisons with existing work.


CHAPTER 1. INTRODUCTION

Nowadays massive amounts of raw data are generated every day thanks to the development of data gathering and storage techniques. However, manually labeling large datasets is very time- and labor-consuming, so in practice the amount of unlabeled data is often far greater than that of labeled data. Hence, Semi-Supervised learning, which considers the problem of utilizing unlabeled data to assist supervised learning tasks, is of great interest in a wide variety of research areas, including natural language processing [6], speech synthesis [7], image classification [8], genomics [9], etc. Existing semi-supervised learning models fall into three main categories: the unsupervised feature learning approach, the graph-based regularization approach, and the multi-manifold learning approach.

The unsupervised feature learning approach achieves semi-supervised learning in two separate stages: a feature representation learning stage and a classification stage. In the first stage, a set of latent representations is learnt from both labeled and unlabeled data with unsupervised generative models. In the second stage, unlabeled data is classified based on the learnt latent representations. For example, Kingma's M1 model [1] first learns latent representations with auto-encoders and then uses an SVM to classify the results, while Johnson and Zhang [10] use a local region convolution block to learn two-view embedding features and then use a convolutional neural network for classification.

The study of Berkhahn et al. [2] shows that classification results can serve as a regularizer to the generative model, while the generative model can provide extra information to the classifier. They achieve better performance with regard to both classification accuracy and feature representation learning by allowing mutual influence between the two models and training them simultaneously. This type of model, which learns labeled and unlabeled data simultaneously, is called a Semi-Supervised Generative Model and was first proposed by Kingma [1]. One drawback of most existing Semi-Supervised Generative Models is that they assume data generation is directly influenced by the label. As a consequence, such models concatenate the classification result directly with the latent encodings, which may result in degradation of representation learning [3].

Therefore, in this paper we present a new flavor of Semi-Supervised Generative Model that overcomes this problem. The proposed model is also able to handle both labeled and unlabeled data. Similar to Kingma's and Berkhahn's work, it utilizes the power of the variational auto-encoder (VAE) for representation learning. But unlike existing works, the direct dependency of data generation on the label is removed thanks to a new probabilistic modeling of the data generation process.

In the following sections, we start with background knowledge, including Variational Inference and Variational auto-encoders, in chapter 2. In chapter 3 we introduce some related approaches prior to this work. In chapter 4 we propose the new model, and in chapter 5 we present experiments validating the new model.


CHAPTER 2. BACKGROUND

This chapter covers the necessary background knowledge for the proposed work, including a brief introduction to Variational Inference, the evidence lower bound, and Variational auto-encoders.

2.1 Variational Inference

The goal of latent variable modeling is to learn a set of latent variables 풛 = 푧1:푛 to represent observed data 풙 = 푥1:푛. This requires learning the posterior distribution 푝(풛|풙). Directly computing 푝(풛|풙) is often intractable since it requires computing the integral ∫ 푝(풛, 풙) 푑풛, where 풛 is often a high-dimensional variable.

Variational Inference [4] is one approach to estimate the intractable posterior distribution 푝(풛|풙). The core idea of variational inference is:

1. Introduce a tractable hypothesis probability distribution 푞(풛; 휆) parameterized by 휆.

2. Find a 휆 that makes 푞(풛; 휆) approximate 푝(풛|풙), namely

$$\lambda^* = \arg\min_\lambda D\big(p(\mathbf{z}|\mathbf{x}),\, q(\mathbf{z};\lambda)\big) \qquad (1)$$

where 퐷(⋅) is a measure of distance between two probability distributions. In practice one of the most widely used measures in Variational Inference is the KL divergence:

$$\lambda^* = \arg\min_\lambda \mathrm{KL}\big(q(\mathbf{z};\lambda)\,\|\,p(\mathbf{z}|\mathbf{x})\big) \qquad (2)$$

2.2 The evidence lower bound (ELBO)

Directly minimizing 퐾퐿(푞(풛)||푝(풛|풙)) in (2) is intractable because of the dependency on the evidence log 푝(풙):

$$\mathrm{KL}\big(q(\mathbf{z})\,\|\,p(\mathbf{z}|\mathbf{x})\big) = \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}\log q(\mathbf{z}) - \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}\log p(\mathbf{z}|\mathbf{x}) = \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}\log q(\mathbf{z}) - \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}\log p(\mathbf{z},\mathbf{x}) + \log p(\mathbf{x}) \qquad (3)$$

Therefore, the evidence lower bound (ELBO) is introduced. It consists of the negative KL term plus the evidence log 푝(풙), which is a constant with respect to 푞(풛). Maximizing the ELBO is equivalent to minimizing the KL term, which is the goal of variational inference:

$$\log p(\mathbf{x}) - \mathrm{KL}\big(q(\mathbf{z})\,\|\,p(\mathbf{z}|\mathbf{x})\big) = \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}\log p(\mathbf{z},\mathbf{x}) - \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}\log q(\mathbf{z}) := \mathrm{ELBO} \qquad (4)$$

2.3 Variational auto-encoders

The Variational auto-encoder (VAE) was proposed by Kingma & Welling [5]. They start from the ELBO in (4):

$$\mathrm{ELBO} = \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}\log p(\mathbf{z},\mathbf{x}) - \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}\log q(\mathbf{z}) = \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}\log p(\mathbf{x}|\mathbf{z}) + \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}\log p(\mathbf{z}) - \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}\log q(\mathbf{z}) = \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}\log p(\mathbf{x}|\mathbf{z}) - \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}\log \frac{q(\mathbf{z})}{p(\mathbf{z})} = \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})}\log p(\mathbf{x}|\mathbf{z}) - \mathrm{KL}\big(q(\mathbf{z})\,\|\,p(\mathbf{z})\big) \qquad (5)$$

To optimize the ELBO in a deep learning framework, the authors use an encoder-decoder structure. A generative model (decoder) 푝휃(풙|풛) is chosen to learn 푝(풙|풛), while an inference model (encoder) 푞휙(풛|풙) plays the role of the arbitrary probability distribution 푞(풛) in the variational inference setting. The objective becomes:

$$\mathrm{ELBO} = \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}\log p_\theta(\mathbf{x}|\mathbf{z}) - \mathrm{KL}\big(q_\phi(\mathbf{z}|\mathbf{x})\,\|\,p_\theta(\mathbf{z})\big) \qquad (6)$$

Then maximizing the ELBO becomes minimizing reconstruction error and a KL term.

푞휙(풛|풙) and 푝휃(풛) are often chosen to be multivariate Gaussian distributions with diagonal covariance matrices so that the KL term can be computed in closed form.
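For illustration, when 푞휙(풛|풙) = 푁(흁, diag(흈²)) and 푝휃(풛) = 푁(ퟎ, 푰), the KL term reduces to the familiar closed form −½ Σᵢ (1 + log σᵢ² − μᵢ² − σᵢ²). The following is a minimal PyTorch sketch of this computation (our own illustration, assuming the encoder outputs the log-variance; it is not code from this work):

import torch

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over the latent dimensions
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)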

In this setting 풛 becomes a probability distribution. Rather than outputting the latent encoding 풛 directly, the encoder estimates the parameters (흁, 흈) of a Gaussian distribution. The latent encoding is first sampled from this distribution and then fed to the decoder. However, directly sampling from a Gaussian is not differentiable, so the model cannot be trained with stochastic gradient descent as is. The authors therefore introduced the reparameterization trick: instead of directly sampling 풛 ~ 푁(흁, 흈), we compute 풛 = 흁 + 흈 ⊙ 흐, where 흐 ~ 푁(ퟎ, 푰). This is equivalent to sampling from 푁(흁, 흈), but it is differentiable and hence trainable.
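The following is a minimal PyTorch sketch of the reparameterization trick (again assuming the log-variance convention; an illustration only):

import torch

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); the sample stays differentiable w.r.t. mu and sigma
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps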

CHAPTER 3. RELATED WORKS

In this chapter we introduce two Semi-Supervised Generative Models given by Berkhahn et al. [2] and Kingma [1]. These two pieces of work serve as inspiration for this paper. A further comparison with the proposed model is given in chapter 5.

3.1 Berkhahn et al.

Berkhahn et al. [2] present a Semi-Supervised Generative Model that makes minimal changes to the vanilla variational auto-encoder structure: the only addition is a classification layer 휋 attached to the topmost encoder layer. The input of the decoder becomes the concatenation of 휋(푥) and 풛:

$$\mathbf{x}_{recon} \sim p_\theta(\pi, \mathbf{z}) = p_\theta(\pi \oplus \mathbf{z})$$

Figure 1. Architecture of Berkhahn’s model

The loss function of this model is the traditional ELBO term plus a classification error:

$$L = L_{ELBO} + L_{cl} \qquad (7)$$

$$L_{cl} = -\frac{\alpha(y)}{\#labeled} \sum_i y_i \cdot \log(\pi_i) \qquad (8)$$
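The following schematic PyTorch sketch illustrates how (7)-(8) can be combined and how 휋 is concatenated with 풛 before decoding; it is our own illustration under assumed tensor conventions, not Berkhahn et al.'s released code:

import torch
import torch.nn.functional as F

def berkhahn_style_loss(x, x_recon, mu, logvar, pi_logits, y=None, alpha=1.0):
    # negative ELBO: reconstruction error plus KL(q(z|x) || N(0, I))
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    loss = recon + kl
    if y is not None:
        # labeled mini-batch: add the weighted classification error of Eq. (8)
        loss = loss + alpha * F.cross_entropy(pi_logits, y, reduction='sum')
    return loss

# decoder input per Figure 1: concatenation of the class probabilities and z
# decoder_input = torch.cat([F.softmax(pi_logits, dim=-1), z], dim=-1)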

3.2 Kingma’s M2 model

The Kingma M2 model is proposed in [1]. It is the first Semi-Supervised Generative Model. It assumes that data is generated by a latent class variable 풚 in addition to a continuous latent variable 풛. The conditional distribution of the data is modeled by a decoder network taking 풚 and 풛 as input:

푝휃(풙|풚, 풛) = 푓(풙; 풚, 풛, 휽) (9)

The approximate posterior 푞휙(풚, 풛|풙) has a factorized form:

$$q_\phi(\mathbf{y},\mathbf{z}|\mathbf{x}) = q_\phi(\mathbf{z}|\mathbf{x})\, q_\phi(\mathbf{y}|\mathbf{x})$$

$$q_\phi(\mathbf{y}|\mathbf{x}) = \mathrm{Cat}\big(\mathbf{y}\,|\,\pi_\phi(\mathbf{x})\big), \qquad q_\phi(\mathbf{z}|\mathbf{x}) = N\big(\mathbf{z}\,|\,\boldsymbol{\mu}_\phi(\mathbf{x}),\, \mathrm{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x}))\big) \qquad (10)$$

where 푞휙(풛|풙) is modeled by an encoder network and 흁휙, 흈휙², 휋휙 are modeled by neural networks.

The model uses two different loss functions to handle both labeled and unlabeled data in the semi-supervised learning setting.

For labeled data:

$$\log p_\theta(\mathbf{x},\mathbf{y}) \ge \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x},\mathbf{y})}\big[\log p_\theta(\mathbf{x}|\mathbf{y},\mathbf{z}) + \log p_\theta(\mathbf{y}) + \log p(\mathbf{z}) - \log q_\phi(\mathbf{z}|\mathbf{x},\mathbf{y})\big] = -L(\mathbf{x},\mathbf{y}) \qquad (11)$$

For unlabeled data:

$$\log p_\theta(\mathbf{x}) \ge \mathbb{E}_{q_\phi(\mathbf{y},\mathbf{z}|\mathbf{x})}\big[\log p_\theta(\mathbf{x}|\mathbf{y},\mathbf{z}) + \log p_\theta(\mathbf{y}) + \log p(\mathbf{z}) - \log q_\phi(\mathbf{y},\mathbf{z}|\mathbf{x})\big] = -U(\mathbf{x}) \qquad (12)$$

Figure 2. Architecture of Kingma M2 model

Figure 2 shows the architecture of the Kingma M2 model. In addition to the vanilla VAE model, a classifier is added to handle labeled data. The latent encoding 풛 and the label 풚 (or the prediction 풚′) are concatenated before being fed to the decoder.
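In practice, the unlabeled bound (12) is evaluated by marginalizing the labeled bound over all classes under 푞휙(풚|풙) and adding its entropy. The following PyTorch sketch illustrates this marginalization (labeled_bound_fn is a placeholder for the per-class bound −L(x, y); an illustration, not Kingma's released code):

import torch

def m2_unlabeled_bound(x, q_y_logits, labeled_bound_fn):
    # -U(x) = sum_y q(y|x) * (-L(x, y)) + H[q(y|x)]  (cf. Eq. (12))
    q_y = torch.softmax(q_y_logits, dim=-1)
    num_classes = q_y.size(-1)
    bound = torch.zeros(x.size(0), device=x.device)
    for c in range(num_classes):
        y_onehot = torch.zeros_like(q_y)
        y_onehot[:, c] = 1.0
        bound = bound + q_y[:, c] * labeled_bound_fn(x, y_onehot)
    entropy = -(q_y * torch.log(q_y + 1e-8)).sum(dim=-1)
    return bound + entropy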


CHAPTER 4. PROPOSED MODEL

This chapter describes the proposed model in detail. We first introduce a new probabilistic model for data generation and inference. Then we show the corresponding loss functions for labeled and unlabeled data. Finally, we propose the model architecture.

4.1 Probabilistic model

Data generation model

We assume the following data generation process: first a label 풚 is chosen, and a set of latent encodings 풛 is generated conditioned on 풚. Then the data 풙 is generated purely from the latent encoding 풛. Figure 3 shows the probabilistic model of this generation process.

Figure 3. Probabilistic model for data generation

Hence, we have:

$$p(\mathbf{x},\mathbf{y},\mathbf{z}) = p(\mathbf{x}|\mathbf{z})\, p(\mathbf{z}|\mathbf{y})\, p(\mathbf{y}), \qquad p(\mathbf{x}|\mathbf{y},\mathbf{z}) = p(\mathbf{x}|\mathbf{z}) \qquad (13)$$

Similar to the VAE, we choose a deep generative network 푝휃(풙|풛; 휽) to learn the true probability distribution 푝(풙|풛).
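For illustration, ancestral sampling from the factorization (13) proceeds 풚 → 풛 → 풙, so that 풙 is produced from 풛 alone. A minimal PyTorch sketch (label_prior_net and decoder are placeholder modules for 푝휃(풛|풚) and 푝휃(풙|풛); the output convention of label_prior_net is assumed, not taken from this work):

import torch

@torch.no_grad()
def sample_x_given_y(label_prior_net, decoder, y_onehot):
    # p(z|y): the network maps a label to the Gaussian parameters of z
    mu_zy, logvar_zy = label_prior_net(y_onehot)
    z = mu_zy + torch.exp(0.5 * logvar_zy) * torch.randn_like(mu_zy)
    # p(x|z): data is generated from z alone, with no direct use of y
    return decoder(z)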

Compared with Berkhahn’s model and Kingma’s M2 model, the proposed generative model removes the direct dependency of data 풙 on label 풚. Therefore, latent encodings 풛 have all the information about reconstructing 풙. This is a more desirable property for representation learning [3].

Figure 4. Probabilistic model for data generation in Berkhahn’s and Kingma’s work

Inference model

In the variational inference setting we use an inference model to approximate the posterior distributions 푝(풛|풙, 풚) and 푝(풛, 풚|풙). Figure 5 describes the inference model. We assume 풚 can be inferred from the latent encoding 풛, while 풛 can be inferred directly from 풙. Namely, we have:

$$q_\phi(\mathbf{x},\mathbf{y},\mathbf{z}) = q_\phi(\mathbf{z}|\mathbf{x})\, q_\phi(\mathbf{y}|\mathbf{z})\, q_\phi(\mathbf{x}), \qquad q_\phi(\mathbf{z}|\mathbf{x},\mathbf{y}) = q_\phi(\mathbf{z}|\mathbf{x}) \qquad (14)$$

Figure 5. Probabilistic model for inference

This is a reasonable assumption because all information about the data 풙, including information about the label, should be encoded in 풛. Therefore, we can infer the label directly from 풛. This setting also allows us to utilize the latent encodings learnt by the generative model for the downstream classification task, and to benefit from the potential disentanglement provided by the VAE.
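For illustration, at test time the label can be read off the latent encoding alone. A minimal PyTorch sketch (encoder and classifier are placeholder modules; using the mean of 푞휙(풛|풙) as the representation is one common choice and is assumed here, not prescribed by this work):

import torch

@torch.no_grad()
def predict_label(encoder, classifier, x):
    # q(z|x): take the mean encoding mu as the representation of x
    mu, logvar = encoder(x).chunk(2, dim=-1)
    # q(y|z): the classification layer reads the label off the latent encoding
    return classifier(mu).argmax(dim=-1)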

4.2 Objective function

Similar to Kingma’s work. We also use two different loss function to handle labeled and unlabeled data.

Labeled Data

For labeled data, 풚 is treated as an observed variable. We minimize the following KL term:

$$\mathrm{KL}\big(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{y})\,\|\,p_\theta(\mathbf{z}|\mathbf{x},\mathbf{y})\big) = \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x},\mathbf{y})}\log\frac{q_\phi(\mathbf{z}|\mathbf{x},\mathbf{y})}{p_\theta(\mathbf{z}|\mathbf{x},\mathbf{y})} = \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}\log\frac{q_\phi(\mathbf{z}|\mathbf{x})\,p_\theta(\mathbf{x},\mathbf{y})}{p_\theta(\mathbf{x},\mathbf{y},\mathbf{z})} = \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}\log\frac{q_\phi(\mathbf{z}|\mathbf{x})\,p_\theta(\mathbf{x},\mathbf{y})}{p_\theta(\mathbf{x}|\mathbf{z})\,p_\theta(\mathbf{z}|\mathbf{y})\,p_\theta(\mathbf{y})}$$

$$= \log p_\theta(\mathbf{x},\mathbf{y}) - \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}\log p_\theta(\mathbf{x}|\mathbf{z}) - \log p_\theta(\mathbf{y}) + \mathrm{KL}\big(q_\phi(\mathbf{z}|\mathbf{x})\,\|\,p_\theta(\mathbf{z}|\mathbf{y})\big) \qquad (15)$$

Hence, we have the evidence lower bound for labeled data:

$$\mathrm{ELBO}_{labeled} = \log p_\theta(\mathbf{x},\mathbf{y}) - \mathrm{KL}\big(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{y})\,\|\,p_\theta(\mathbf{z}|\mathbf{x},\mathbf{y})\big) = \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}\log p_\theta(\mathbf{x}|\mathbf{z}) - \mathrm{KL}\big(q_\phi(\mathbf{z}|\mathbf{x})\,\|\,p_\theta(\mathbf{z}|\mathbf{y})\big) \qquad (16)$$

(the additive term log 푝휃(풚), which is constant for a fixed label prior, is dropped)

A classification error $L_{cl} = \mathrm{CrossEntropy}(q_\phi(\mathbf{y}|\mathbf{z}), \mathbf{y})$, i.e., the cross entropy between the classification layer's output for 풛 and the true label 풚, is also added in order to learn the conditional probability 푝(풚|풛):

$$L(\mathbf{x},\mathbf{y}) = L_{cl} - \mathrm{ELBO}_{labeled} \qquad (17)$$
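The following is a minimal PyTorch sketch of the labeled loss (16)-(17), assuming a Bernoulli decoder and diagonal-Gaussian 푞휙(풛|풙) and 푝휃(풛|풚); it is our own illustration, not the released implementation:

import torch
import torch.nn.functional as F

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ), per example
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p).pow(2)) / logvar_p.exp()
        - 1.0, dim=-1)

def labeled_loss(x, x_recon, mu_q, logvar_q, mu_zy, logvar_zy, y_logits, y):
    # L(x, y) = L_cl - ELBO_labeled  (Eqs. (16)-(17))
    recon = F.binary_cross_entropy(x_recon, x, reduction='none').sum(dim=-1)  # -E log p(x|z)
    kl = kl_diag_gaussians(mu_q, logvar_q, mu_zy, logvar_zy)                  # KL(q(z|x) || p(z|y))
    ce = F.cross_entropy(y_logits, y, reduction='none')                       # L_cl
    return (recon + kl + ce).mean()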

Unlabeled Data

For unlabeled data, 풚 is treated as an unobserved latent variable. We minimize:

$$\mathrm{KL}\big(q_\phi(\mathbf{z},\mathbf{y}|\mathbf{x})\,\|\,p_\theta(\mathbf{z},\mathbf{y}|\mathbf{x})\big) = \mathbb{E}_{\mathbf{z},\mathbf{y}\sim q_\phi(\mathbf{z},\mathbf{y}|\mathbf{x})}\log\frac{q_\phi(\mathbf{z},\mathbf{y}|\mathbf{x})}{p_\theta(\mathbf{z},\mathbf{y}|\mathbf{x})} = \mathbb{E}_{\mathbf{z},\mathbf{y}\sim q_\phi(\mathbf{z},\mathbf{y}|\mathbf{x})}\log\frac{q_\phi(\mathbf{z}|\mathbf{x})\,q_\phi(\mathbf{y}|\mathbf{z})\,p_\theta(\mathbf{x})}{p_\theta(\mathbf{x}|\mathbf{z})\,p_\theta(\mathbf{z}|\mathbf{y})\,p_\theta(\mathbf{y})}$$

$$= \log p_\theta(\mathbf{x}) - \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}\log p_\theta(\mathbf{x}|\mathbf{z}) - \mathbb{E}_{\mathbf{y}\sim q_\phi(\mathbf{y}|\mathbf{x})}\log p_\theta(\mathbf{y}) + \mathbb{E}_{\mathbf{y},\mathbf{z}\sim q_\phi(\mathbf{y},\mathbf{z}|\mathbf{x})}\log q_\phi(\mathbf{y}|\mathbf{z}) + \mathbb{E}_{\mathbf{y}\sim q_\phi(\mathbf{y}|\mathbf{z})}\mathrm{KL}\big(q_\phi(\mathbf{z}|\mathbf{x})\,\|\,p_\theta(\mathbf{z}|\mathbf{y})\big) \qquad (18)$$

Hence, we have the evidence lower bound for unlabeled data:

$$\mathrm{ELBO}_{unlabeled} = \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}\log p_\theta(\mathbf{x}|\mathbf{z}) - \mathbb{E}_{\mathbf{y},\mathbf{z}\sim q_\phi(\mathbf{y},\mathbf{z}|\mathbf{x})}\log q_\phi(\mathbf{y}|\mathbf{z}) - \mathbb{E}_{\mathbf{y}\sim q_\phi(\mathbf{y}|\mathbf{z})}\mathrm{KL}\big(q_\phi(\mathbf{z}|\mathbf{x})\,\|\,p_\theta(\mathbf{z}|\mathbf{y})\big) = -U(\mathbf{x}) \qquad (19)$$

(the term $\mathbb{E}_{\mathbf{y}\sim q_\phi(\mathbf{y}|\mathbf{x})}\log p_\theta(\mathbf{y})$ is constant when the label prior is uniform and is therefore dropped)

Total Loss function

Finally, the total loss function for the entire dataset is now:

$$Loss = \sum_{(\mathbf{x},\mathbf{y}) \in labeled} L(\mathbf{x},\mathbf{y}) + \sum_{\mathbf{x} \in unlabeled} U(\mathbf{x}) \qquad (20)$$
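The following is a minimal PyTorch sketch of the unlabeled loss 푈(풙) in (19), marginalizing the KL term over the classes with the weights 푞휙(풚|풛); prior_net is a placeholder for the network producing the parameters of 푝휃(풛|풚), and the sketch is our own illustration, not the released code:

import torch
import torch.nn.functional as F

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    # same closed form as in the labeled-loss sketch above
    return 0.5 * torch.sum(logvar_p - logvar_q
                           + (logvar_q.exp() + (mu_q - mu_p).pow(2)) / logvar_p.exp()
                           - 1.0, dim=-1)

def unlabeled_loss(x, x_recon, mu_q, logvar_q, y_logits, prior_net):
    # U(x) = -E log p(x|z) + E log q(y|z) + E_{y~q(y|z)} KL(q(z|x) || p(z|y))  (Eq. (19))
    recon = F.binary_cross_entropy(x_recon, x, reduction='none').sum(dim=-1)
    q_y = torch.softmax(y_logits, dim=-1)
    neg_entropy = (q_y * torch.log(q_y + 1e-8)).sum(dim=-1)
    expected_kl = torch.zeros(x.size(0), device=x.device)
    for c in range(q_y.size(-1)):
        y_onehot = F.one_hot(torch.full((x.size(0),), c, dtype=torch.long, device=x.device),
                             num_classes=q_y.size(-1)).float()
        mu_p, logvar_p = prior_net(y_onehot)
        expected_kl = expected_kl + q_y[:, c] * kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
    return (recon + neg_entropy + expected_kl).mean()

# The total objective (20) sums labeled_loss over the labeled set and
# unlabeled_loss over the unlabeled set, e.g. by alternating mini-batches.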

4.3 Model architecture

As shown in Figure 6, our model is a modified version of the vanilla variational auto-encoder. In order to optimize (16) and (19), some additional structures are added, including a classification layer that outputs the conditional probability 푞휙(풚|풛) and predicts the label 풚′, and a discriminator taking 풚 or 풚′ as input and outputting 흁풛|풚 and 흈풛|풚.

When a labeled input 풙풍 is fed to the model, the classification layer tries to learn 푝(풚|풛) by minimizing the cross entropy loss. The true label 풚 is fed to the discriminator, and 퐾퐿(푞휙(풛|풙)||푝휃(풛|풚)) is then computed.

When an unlabeled input 풙풖 is fed to the model, the classification layer outputs the probability 푞휙(풚|풛). The predicted label 풚′ is then fed to the discriminator so that $\mathbb{E}_{\mathbf{y}\sim q_\phi(\mathbf{y}|\mathbf{z})}\mathrm{KL}\big(q_\phi(\mathbf{z}|\mathbf{x})\,\|\,p_\theta(\mathbf{z}|\mathbf{y})\big)$ can be computed.

For both labeled and unlabeled data, the reconstruction error is computed, which corresponds to the term $\mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}\log p_\theta(\mathbf{x}|\mathbf{z})$.
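Putting the pieces together, the following schematic PyTorch forward pass follows the description above; encoder, decoder, classifier and discriminator are placeholder modules whose output conventions we assume, and for unlabeled inputs the soft class probabilities stand in for the predicted label 풚′. It is a sketch, not the released implementation:

import torch

def forward_pass(encoder, decoder, classifier, discriminator, x, y_onehot=None):
    # q(z|x): encoder outputs mean and log-variance, then reparameterize
    mu, logvar = encoder(x).chunk(2, dim=-1)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    # q(y|z): classification layer predicts the label from z
    y_logits = classifier(z)
    if y_onehot is None:
        # unlabeled input: feed the predicted label y' to the discriminator
        y_onehot = torch.softmax(y_logits, dim=-1)
    # discriminator: parameters (mu_{z|y}, sigma_{z|y}) of p(z|y)
    mu_zy, logvar_zy = discriminator(y_onehot).chunk(2, dim=-1)
    # p(x|z): reconstruction, used for the reconstruction error in (16)/(19)
    x_recon = decoder(z)
    return x_recon, mu, logvar, y_logits, mu_zy, logvar_zy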

Figure 6. Model architecture

CHAPTER 5. EXPERIMENTS

In this chapter, we show an empirical evaluation of the proposed model, referred to as SSLGM (Semi-Supervised Learning with Generative Model) in this report. We show that the proposed model is effective in Semi-Supervised learning, as well as competitive compared with other related works. Open source code, together with the most important figures and results, is available at: https://github.com/Fuu3214/SSLGM.git.

5.1 Experiment Setup

Dataset

We use the well-known MNIST handwritten digit dataset for the experiments. The dataset for semi-supervised learning is created by randomly splitting the 60,000 training images into a labeled and an unlabeled set. The size of the labeled set is chosen to be 100, 600 and 1000, respectively.
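A minimal Python sketch of this split (a plain random split as described; the array names are illustrative and this is not the released experiment code):

import numpy as np

def split_semi_supervised(x_train, y_train, n_labeled=100, seed=0):
    # randomly split the 60,000 training examples into labeled and unlabeled sets
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x_train))
    labeled_idx, unlabeled_idx = perm[:n_labeled], perm[n_labeled:]
    return (x_train[labeled_idx], y_train[labeled_idx]), x_train[unlabeled_idx]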

Model Setup

We experiment with two SSLGM models. The first model consists of a shallow MLP encoder and decoder, with a 10-dimensional latent variable 풛. The second model uses more complex convolutional neural networks as encoder and decoder, with a 50-dimensional latent variable. Tables 1 and 2 show the details of the two models.

Table 1. SSLGM_fc model structure

Component               Layer    Activation function    Output shape
Encoder                 dense    relu                   512
                        dense    relu                   256
                        dense    None                   (10, 10)
Decoder                 dense    relu                   256
                        dense    relu                   512
                        dense    sigmoid                784
Classifier (q(y|z))     dense    relu                   256
                        dense    relu                   256
                        dense    None                   10
Discriminator (p(z|y))  dense    relu                   512
                        dense    relu                   512
                        dense    None                   (10, 10)

Table 2. SSLGM_cnn model structure

Component               Layer                Activation function    Output shape / number of filters
Encoder                 3 × 3 convolutional  relu                   32
                        max_pooling2d        -                      -
                        3 × 3 convolutional  relu                   64
                        max_pooling2d        -                      -
                        3 × 3 convolutional  relu                   32
                        max_pooling2d        -                      -
                        dense                None                   (30, 30)
Decoder                 dense                relu                   256
                        dense                relu                   512
                        dense                sigmoid                784
Classifier (q(y|z))     dense                relu                   256
                        dense                relu                   256
                        dense                None                   30
Discriminator (p(z|y))  dense                relu                   512
                        dense                relu                   512
                        dense                None                   (30, 30)

5.2 Semi-supervised learning performance

5.2.1 Benchmarking

For benchmarking, we compare our model with supervised models, namely a fully connected neural network and a convolutional neural network trained on the labeled data only, as well as with related semi-supervised learning models, namely Kingma's M1 and M2 models and Berkhahn et al.'s model. We use the classification error rate on the test set as our evaluation metric.


Result

Table 3 shows the benchmarking results. Given fewer labeled examples, both of our models perform better than the other models, including the purely supervised models as well as the semi-supervised models, demonstrating the effectiveness of the latent representation learnt by the VAE on the downstream classification task. However, when the number of labels becomes larger, the power of the CNN on image classification tasks becomes more significant, and it performs better than all semi-supervised learning models.

Comparing our two models, SSLGM_cnn performs slightly better than SSLGM_fc thanks to the power of CNNs at extracting image features as well as resisting overfitting.

Table 3. Semi-supervised MNIST classification results

N                 50              100             600             1000
NN                36.32           26.42           12.44           10.64
CNN               24.45           11.25           4.41            2.84
Berkhahn et al.   -               18.9 (±0.6)     -               5.5 (±0.1)
Kingma's M1       -               11.82 (±0.25)   5.72 (±0.049)   4.24 (±0.07)
Kingma's M2       23.00 (±3.50)   11.97 (±1.71)   4.94 (±0.13)    3.60 (±0.56)
SSLGM_fc          18.49 (±1.13)   8.35 (±0.85)    6.79 (±0.35)    6.15 (±0.25)
SSLGM_cnn         17.88 (±1.89)   9.52 (±1.05)    6.81 (±0.69)    5.32 (±0.31)


5.2.2 The effect of unlabeled data

We show that unlabeled data improves classification performance in our model. We compare our model with control groups CG_fc and CG_cnn. The control-group models are obtained by removing the unsupervised components and training the models on labeled data only. We choose exactly the same hyper-parameters, including the number of hidden units, the dimension of the latent variables, the number of MLP layers, and the number of CNN filters. We use the classification error rate on the test set as our evaluation metric.

Result

The results in Table 4 show that, in our model setting, unlabeled data is extremely helpful in improving semi-supervised classification performance, especially when less labeled data is given. We also observe that, without unlabeled data, both control-group models tend to overfit after a small number of training epochs, whereas SSLGM_fc and SSLGM_cnn hardly overfit even after a few hundred epochs. This shows that unlabeled data serves as a regularizer for the model.

Table 4. MNIST classification results compared with control groups

N           50              100             600             1000
CG_fc       37.41           26.46           12.10           10.25
CG_cnn      26.26           20.24           7.57            6.22
SSLGM_fc    18.49 (±1.13)   8.35 (±0.85)    6.79 (±0.35)    6.15 (±0.25)
SSLGM_cnn   17.88 (±1.89)   9.52 (±1.05)    6.81 (±0.69)    5.32 (±0.31)


5.3 Conditional data generation

5.3.1 Style and content separation

To demonstrate that the model is capable of learning a useful latent representation with a small number of labeled examples, we use the decoder of the SSLGM_fc model trained on 100 labeled examples. In the first experiment we examine style and content separation. First, we randomly pick a class label and feed it to the discriminator to acquire (흁풛|풚, 흈풛|풚). Then we randomly select two dimensions $[z_a, z_b]$ of the latent encoding 풛 and range over them as follows:

$$z_a = \mathrm{PPF}_{N(\mu^a_{z|y},\, \sigma^a_{z|y})}(\alpha), \qquad z_b = \mathrm{PPF}_{N(\mu^b_{z|y},\, \sigma^b_{z|y})}(\beta) \qquad (21)$$

where PPF is the percent point function (inverse CDF) and 훼 and 훽 are evenly spaced numbers over [0.05, 0.95].

For the other dimensions we simply sample each of them once from the Gaussian distribution $z_i \sim N(\mu^i_{z|y}, \sigma^i_{z|y})$ and then fix them. This guarantees that only two dimensions change. To generate MNIST examples we simply feed 풛 to the decoder.
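A minimal Python sketch of this traversal using SciPy's percent point function (the helper and its arguments are illustrative, not the released code; each row of the returned grid is fed to the decoder):

import numpy as np
from scipy.stats import norm

def latent_grid(mu_zy, sigma_zy, dim_a=0, dim_b=1, steps=10):
    # sample all dimensions once from N(mu_{z|y}, sigma_{z|y}) and keep them fixed
    z_fixed = np.random.normal(mu_zy, sigma_zy)
    alphas = np.linspace(0.05, 0.95, steps)   # evenly spaced quantiles
    grid = []
    for a in alphas:
        for b in alphas:
            z = z_fixed.copy()
            # vary only the two chosen dimensions through the Gaussian PPF
            z[dim_a] = norm.ppf(a, loc=mu_zy[dim_a], scale=sigma_zy[dim_a])
            z[dim_b] = norm.ppf(b, loc=mu_zy[dim_b], scale=sigma_zy[dim_b])
            grid.append(z)
    return np.stack(grid)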

Result

In Figure 7 we choose dimensions (푧0, 푧1) and generate 3 sets of digits from labels 2, 3, 7. We see that as 푧0 becomes larger, the width of the digit tends to become smaller, while as 푧1 becomes larger the digit tends to become more rounded. In Figure 8 we choose dimensions (푧0, 푧5) and generate digits from labels 0, 5. We observe that as 푧0 becomes larger, the width of the digit again becomes smaller, while as 푧5 becomes larger, the angle of the digits changes.

Figure 7. Conditional generation by fixing other dimensions while varying 푧0, 푧1

Figure 8. Conditional generation by fixing other dimensions while varying 푧0, 푧5

5.3.2 Conditional data generation performance

In the second experiment we use the decoder of the SSLGM_fc model trained on 100 labeled examples to evaluate the conditional generation performance. To generate data, we simply sample 풛 from the Gaussian distribution 풛 ~ 푁(흁풛|풚, 흈풛|풚) and then feed 풛 to the decoder.

Result

Figure 9 demonstrates an example of the conditional data generation performance of our SSLGM_fc model trained on 100 labeled examples. Our model creates more accurate and more diverse data conditioned on the label compared with Kingma's M2 model, shown in Figure 10.

Figure 9. Conditional data generation performance of SSLGM_fc. Data is generated conditioned on labels 0, 5, 6 respectively.

Figure 10. Conditional data generation performance of Kingma's M2 model. Data is generated conditioned on labels 0, 5, 6 respectively.

CHAPTER 6. CONCLUSION

We propose a new Semi-Supervised Generative Model to overcome a drawback of existing approaches: we remove the direct dependency of data generation on the label. Compared with related works that directly concatenate the latent encoding with the label, our approach improves the quality of the latent representations. Moreover, unlike existing works, our approach is able to utilize the latent representations obtained by the variational auto-encoder in the downstream classification task. Experiments also demonstrate the effectiveness of our approach with respect to semi-supervised learning performance and conditional data generation given a relatively small number of labeled examples.

One drawback of our model is overfitting. In some scenarios, such as few-shot or one-shot learning, the number of labeled examples is extremely small. In these situations, the performance of our model drops sharply. This is because the supervised component overfits while the unsupervised component is still underfitting: given unlabeled data, the classifier frequently gives wrong predictions that influence conditional generation and hence impact the quality of the latent representation. We observe the same problem in other existing semi-supervised generative models as well. By adopting methods such as adding dropout layers and batch normalization, our model partially overcomes this problem, but further investigation is still needed to improve performance with fewer labeled data. This is one direction for our future work. Another potential direction is applying our model in other domains, for example text-to-image generation, where images are generated conditioned on text information obtained by NLP models.


REFERENCES

[1] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi-supervised learning with deep generative models,” in Advances in neural information processing systems, 2014, pp. 3581–3589.

[2] F. Berkhahn, R. Keys, W. Ouertani, N. Shetty, and D. Geißler, “One model to rule them all,” arXiv preprint arXiv:1908.03015, 2019.

[3] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.

[4] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: A review for statisticians,” Journal of the American Statistical Association, vol. 112, no. 518, pp. 859–877, 2017.

[5] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.

[6] Y. Cheng, “Semi-supervised learning for neural machine translation,” in Joint Training for Neural Machine Translation. Springer, 2019, pp. 25–40.

[7] Y.-A. Chung, Y. Wang, W.-N. Hsu, Y. Zhang, and R. Skerry-Ryan, “Semi-supervised training for improving data efficiency in end-to-end speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6940–6944.

[8] R. Fergus, Y. Weiss, and A. Torralba, “Semi-supervised learning in gigantic image collections,” in Advances in neural information processing systems, 2009, pp. 522–530.

[9] M. Shi and B. Zhang, “Semi-supervised learning improves gene expression-based prediction of cancer recurrence,” Bioinformatics, vol. 27, no. 21, pp. 3017–3023, 2011.

[10] R. Johnson and T. Zhang, “Semi-supervised convolutional neural networks for text categorization via region embedding,” in Advances in neural information processing systems, 2015, pp. 919–927.