DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Semi-Supervised Learning with Sparse Autoencoders in Automatic Speech Recognition

AKASH KUMAR DHAKA

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Semi-Supervised Learning with Sparse Autoencoders in Automatic Speech Recognition

AKASH KUMAR DHAKA

Master in
Date: November 2016
Supervisor: Giampiero Salvi
Examiner: Danica Kragic
Swedish title: Semi-övervakad inlärning med glesa autoencoders i automatisk taligenkänning
School of Computer Science and Communication

Abstract

This work is aimed at exploring semi-supervised learning techniques to improve the performance of Automatic Speech Recognition systems. Semi-supervised learning takes advantage of unlabeled data in order to improve the quality of the representations extracted from the data. The proposed model is a neural network where the weights are updated by minimizing the weighted sum of a supervised and an unsupervised cost function, simultaneously. These costs are evaluated on the labeled and unlabeled portions of the data set, respectively. The combined cost is optimized through mini-batch stochastic gradient descent via standard backpropagation. The model was tested on a phone classification task on the TIMIT American English data set and on a handwritten digit classification task on the MNIST data set. Our results show that the model outperforms a network trained with standard backpropagation on the labelled material alone. The results are also in line with state-of-the-art graph-based semi-supervised training methods.

Sammanfattning

Detta arbete syftar till att utforska halvövervakade inlärningstekniker (eng: semi-supervised learning techniques) för att förbättra prestandan hos automatiska taligenkänningssystem. Halvövervakad maskininlärning använder sig av data ej märkt med klasstillhörighetsinformation för att förbättra kvaliteten hos den från datan extraherade representationen. Modellen som beskrivs i arbetet är ett neuralt nätverk där vikterna uppdateras genom att samtidigt minimera den viktade summan av en övervakad och en oövervakad kostnadsfunktion. Dessa kostnadsfunktioner evalueras på den märkta respektive den omärkta datamängden. De kombinerade kostnadsfunktionerna optimeras genom gradient descent med hjälp av traditionell backpropagation. Modellen har evaluerats genom en fonklassificeringsuppgift på datamängden TIMIT American English, samt en sifferklassificeringsuppgift på datamängden MNIST. Resultaten visar att modellen presterar bättre än ett nätverk tränat med backpropagation på endast märkt data. Resultaten är även konkurrenskraftiga med rådande state of the art, grafbaserade halvövervakade inlärningsmetoder.

Contents

1 Introduction
  1.1 The Speech Recognition Problem
  1.2 Motivation For the Thesis
  1.3 Research Questions
  1.4 Assumptions
  1.5 Report Structure

2 Relevant Theory
  2.1 Automatic Speech Recognition and Phone Classification
  2.2 Feature Extraction
  2.3 Acoustic Modelling
  2.4 MLPs and Deep Neural Networks
  2.5 Autoencoders
    2.5.1 Manifold Learning with Autoencoders
    2.5.2 Sparse Autoencoders
    2.5.3 Performance of Sparse Autoencoders
    2.5.4 Applications of Autoencoders
  2.6 Semi-Supervised Learning
  2.7 Assumptions

3 Related Work
  3.1 Deep Neural Networks in ASR
    3.1.1 Deep Belief Networks
    3.1.2 Recurrent Neural Networks
  3.2 Examples of Semi-Supervised Learning Methods
    3.2.1 Heuristic based SSL/Self-Training
    3.2.2 Transductive SVMs
    3.2.3 Entropy Based Semi Supervised Learning
    3.2.4 Graph based SSL
    3.2.5 Semi-Supervised Learning with generative models
  3.3 Autoencoder Based Semi-Supervised Learning

4 Method
  4.1 The Model
  4.2 Evaluation
  4.3 Monitoring and Debugging
    4.3.1 Design Choices/Tuning Hyperparameters
    4.3.2
    4.3.3 Batch Size
    4.3.4 Weight Initialization
    4.3.5 Number of Hidden Units
    4.3.6 Momentum
    4.3.7
    4.3.8 Training Epochs
    4.3.9 Additive Noise
    4.3.10 Alpha
    4.3.11 Gradient Checking

5 Experiment Setup and Results
  5.1 Data
    5.1.1 MNIST
    5.1.2 TIMIT
  5.2 Experimental Setup
  5.3 Practical Setup
  5.4 Results

6 Discussion, Conclusion and Future Work
  6.1 Hypotheses discussed
    6.1.1 H.1 Do Semi-Supervised Sparse Autoencoders perform better than neural networks on phone classification?
    6.1.2 H.2 Does the above result generalize to other domains?
    6.1.3 H.3: Do Semi-Supervised Sparse Autoencoders perform better than GBL SSL methods on phoneme classification?
  6.2 Evaluation Method
  6.3 Effect of α in the model
  6.4 Future Work
  6.5 Society and Ethics

7 Appendix

Bibliography

Chapter 1

Introduction

With the invention of computers, the question of whether machines could be made to understand human speech emerged. In more recent years, speech technology has started to change the way we live by becoming an important tool for communication and interaction with devices. The recent improvements in Spoken Language systems have greatly improved Human-Machine Communication. Personal Digital Assistant (PDA) systems are an example of an intelligent dialogue management system. They have become very popular with the recent launch of products like Apple Siri, Amazon Alexa and Google Allo. Besides human-machine interaction, speech technology has also been applied in assisted human-human communication. There could be several barriers even when humans communicate with each other. One of the most prominent of those barriers occurs if the two speakers do not speak a common language. In the past, and to a great extent in present days, this was solved by means of a human interpreter. Speech-to-speech translation systems are, however, reaching sufficient quality to be of help, for example, for travellers. These systems accept spoken input in one language and output a spoken translation of the input in a target language. In all the above examples, a key component is Automatic Speech Recognition (ASR). This system has the task of translating spoken utterances into a textual transcription that can be more easily handled in the rest of the system. In dialogue systems, this textual representation is fed to a language understanding module that extracts the semantic information to be handled by a dialogue manager. The dialogue manager, in turn, can decide to formulate a spoken response by means of a language generation system and a speech synthesis system. In speech-to-speech translation, instead, the output of the ASR is fed to an automatic translation system, and the translation is then converted to speech by means of speech synthesis.

1.1 The Speech Recognition Problem

Although humans recognize speech in their mother tongue effortlessly, the problem of automatic speech recognition presents a number of challenges. A source of complexity is the large variation in speech due to region, accent, age, gender, emotions, and the physical and mental well-being of the speaker. Another complication, compared to many classification tasks, is that speech is a continuous stream of units hierarchically combined into speech sounds, syllables, words, phrases and utterances. A speech recognizer must therefore be able to handle sequences of patterns.


The way the speech recognition problem is approached is by means of statistical methods that can incorporate the variation and model sequences of acoustic events in a robust way. The design of these models makes extensive use of domain knowledge coming from the fields of linguistics and phonetics, and incorporates this knowledge into a machine learning framework that can learn the variability of the spoken units from large collections of recordings of spoken utterances. The building blocks of the statistical models are short segments of speech that can be considered to be stationary. These segments (or the corresponding models) are then combined to form phonemic segments. A phoneme is the smallest linguistic unit that can distinguish between two words. Phonemic models are then combined into words and phrases by using lexical and grammatical rules. Although each language uses a specific set of phonemes, there is a large overlap between languages, because the number of sounds that we can produce is constrained by the physics of our speech organs. An example of the phoneme classification for American English is reported in Appendix 7.1.

1.2 Motivation For the Thesis

In order to learn the associations between the constituent speech units and the corresponding sounds, most speech recognition methods require carefully annotated speech recordings. The increasing interest in speech based applications has produced large amounts of such data for many languages with a sufficiently broad consumer basis. However, these linguistic resources are extremely expensive to produce because they require intense expert labour (phonetic transcriptions are usually created by phoneticians). A consequence of this is that most speech corpora are not publicly available, and even researchers must pay royalties in order to use them. Another consequence is that speech technology, and in particular speech recognition, does not easily reach speakers of languages spoken by minorities. This work will specifically target improvements in ASR in a semi-supervised setting, therefore reducing the need for annotated material. The existing methods for semi-supervised learning in ASR are based on graph based learning or self-training using neural networks. Graph based learning is computationally very intensive, while self-training is based on heuristics and prone to error due to wrong predictions. Recently, learning through neural networks has been found to scale to industrial levels and has given state-of-the-art results in automatic speech recognition. Our model is a modification of a single layer network which can be trained simultaneously using unlabeled and labeled data. The method, therefore, incorporates concepts of semi-supervised learning, while retaining all the advantages of a neural network. A more robust and less resource intensive ASR system will make an immense contribution to better connectivity, better aid systems for the disabled, and better systems for low-resource languages.

1.3 Research Questions

The objective of this thesis is to investigate semi-supervised learning using sparse autoencoders and whether they can be used to improve phoneme recognition over a standard neural network when the labeled dataset is very limited. We can break down this statement into three separate hypotheses. The first hypothesis is that the model we propose here can perform better at phoneme classification than a neural network trained purely discriminatively when we vary the amount of labelled data. The second hypothesis is that the proposed semi-supervised method is not domain specific and can produce better results in other machine learning tasks. For this purpose we focus on image classification, where the dataset comprises images of handwritten digits. Finally, the third hypothesis is that the model we propose here can perform better at phoneme classification than different graph based semi-supervised learning algorithms under a varying percentage of labeled data from a dataset.

1.4 Assumptions

There is one common assumption behind all semi-supervised models, which also becomes an assumption for this work: the distribution of the data, which the unlabeled data will help us to unravel, should be relevant for the classification problem. To state it formally, the information about the distribution p(x) which can be obtained from unlabeled data should also carry the information required for the inference of y expressed through p(y|x). If this is not the case, then semi-supervised learning will not work.

1.5 Report Structure

Chapter 2 introduces the theoretical aspects that are relevant to this thesis. These include aspects of automatic speech recognition and artificial neural networks. Section 2.2 describes how to make fixed length feature vectors out of the raw speech waveform, which makes the task of classification and recognition easier for computers. Section 2.4 gives a background starting with the basic principles of a Multi Layer Perceptron (MLP). This is then followed by a brief theoretical background on the more recent "deep" neural networks having multiple layers. We will also talk about several different kinds of deep neural networks, like RBMs and RNNs, which we do not work with but which have been extensively used in Speech Recognition. This is followed by Chapter 3, which describes recent advances in semi-supervised learning (SSL) with particular focus on Graph Based Learning (GBL) methods and algorithms that currently provide state-of-the-art results in SSL. The particular SSL algorithm used in this thesis is explained in Chapter 4. Chapter 5 reports details on the experimental setup and results. The report is concluded by a discussion of the results, their implications and possibilities for future work in Chapter 6. An appendix giving the mapping from the standard 48 phonemes in English to the 39-phoneme set used in these experiments, and by the community in general, is also given.

Chapter 2

Relevant Theory

This chapter gives a brief insight into the basic concepts of ASR and neural networks. This includes a section on transforming the raw speech into features that are suitable for phone classification. We give a brief overview of GMM-HMM models, which have been state-of-the-art in speech recognition for many years. Finally, in Section 2.4 we introduce MLPs and Deep Neural Networks (DNNs) that in recent years have outperformed GMM-HMM models.

2.1 Automatic Speech Recognition and Phone Classification

Figure 2.1 illustrates the processes involved in a typical speech recognizer. The different parts use a combination of signal processing and machine learning methods to transcribe a spoken utterance into a sequence of words. Because the problem is intrinsically affected by uncertainty, a probabilistic framework is used. First, the raw speech waveform is converted into a sequence $X = x_1, x_2, ..., x_T$ of feature vectors spaced at regular time intervals. This process is called feature extraction and is based on knowledge of speech production and perception. The goal of feature extraction is to convert speech into a representation that is suitable for the classification problem. More information about feature extraction methods is given in Section 2.2. Given the sequence of observations X, the objective is to predict the most likely word sequence $\hat{W} = w_1, w_2, ..., w_m$. In probabilistic terms this can be written as follows:

$$\hat{W} = \arg\max_W p(W|X), \qquad (2.1)$$
where $p(W|X)$ is the posterior of the sequence of words W given the observation sequence X. According to Bayes' rule, the above expression can be written as:
$$\hat{W} = \arg\max_W \frac{p(X|W)\,p(W)}{p(X)} \qquad (2.2)$$
$$= \arg\max_W p(X|W)\,p(W) \qquad (2.3)$$
The term $p(W)$ in the equation is our prior knowledge about which word sequences are likely to occur in a language and in a specific task. This is called a language model. The term $p(X|W)$ in Eq. 2.2 is the likelihood of a sequence X of acoustic features given a particular sequence of words denoted by W. This is computed by acoustic and lexical models. The lexical models describe the words' pronunciations in terms of sequences of phonemes, and the acoustic models describe the likelihood of acoustic features given a certain phoneme.


[Figure 2.1: block diagram of an ASR system — Raw Speech → Feature Extraction → Acoustic Features (MFCC, LPC, PLP, Filterbank) → Acoustic Models → Acoustic Likelihoods, combined with the Lexicon and the Language Model in the Decoder.]

Figure 2.1: The illustration shows all the components in an ASR system. The dotted line shows a particular instance; for example, MFCC is a particular instance of an acoustic feature, and likewise the network we will propose will be used for acoustic modelling.

The decoder is a search algorithm that uses the information in the acoustic, lexical and language models to perform the maximization in Eq. 2.2. The acoustic models are the main focus in this thesis. They encode knowledge about acoustics, phonetics, microphone and environment variability, and differences due to gender, accent, dialect and age of the speaker. In order to test the effects of the acoustic models alone on the speech recognition task, a slightly simplified task is considered: phone classification. In this case, instead of the optimization in Eq. 2.1, we classify each feature frame $x_n$ into one of K possible phonemic classes. The assessment of this classification task is only reliable if the speech data is annotated at the phonetic level (as is the case for the TIMIT data set used in this study). Acoustic models that perform better phone classification are more likely to perform better speech recognition as well. Although phone classification is only the first step in estimating a method's performance, this evaluation is an accepted practice when new speech recognition methods are introduced. In the following sections, we will only describe feature extraction and acoustic models, because the phone classification task considered in this thesis does not require lexical and language models. We will focus in particular on Multi Layer Perceptrons and Deep Neural Networks, which have been successfully used in recent years as acoustic models for speech.

2.2 Feature Extraction

A speech waveform can be represented as a sequence of samples at a rate that typically varies between 8 and 20 kHz depending on the quality of the recording. The samples are highly correlated and contain variations that are not easily associated with phonetic classes without any preprocessing. The goal of feature extraction is to provide a representation of the speech signal that is more suitable for classification. Several methods for feature extraction have been proposed in the past. The most commonly used are based on a short time spectral representation of the signal, as, for example, Mel Frequency Cepstral Coefficients (MFCCs) and Perceptual Linear Predictive coefficients (PLPs). In order to capture the time evolution of those feature vectors, first-order and second-order temporal differences are often appended to the original features. MFCCs are calculated according to the procedure shown in Algorithm 1:

Algorithm 1 Procedure to calculate MFCCs
1: Divide the signal into short, possibly overlapping frames.
2: For each frame, calculate the short time Fourier transform.
3: Apply the (logarithmic) mel filterbank to the power spectra, and take the sum of the energy in each filter.
4: Calculate the logarithm of all filterbank energies.
5: Apply the Discrete Cosine Transform (DCT) to the log filterbank energies.
6: Keep DCT coefficients 1-12 and discard the rest; the energy coefficient is optional.
7: Take the ∆ and ∆∆ of the coefficients w.r.t. preceding frames and append them to the original 13 coefficients.
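As a concrete illustration of Algorithm 1, the sketch below computes 39-dimensional MFCC + ∆ + ∆∆ features with the librosa library. The toolkit, the file name and the frame settings are assumptions made for illustration; the thesis does not state which implementation was used for feature extraction.

```python
import librosa
import numpy as np

# Load a waveform; 16 kHz is the TIMIT sampling rate ("utterance.wav" is a placeholder path).
y, sr = librosa.load("utterance.wav", sr=16000)

# Steps 1-6 of Algorithm 1: framing, STFT, mel filterbank, log, DCT,
# keeping the first 13 coefficients (including the energy/0th coefficient).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms frames, 10 ms shift

# Step 7: first- and second-order temporal differences (delta and delta-delta).
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

# Final 39-dimensional feature vector per frame.
features = np.vstack([mfcc, delta, delta2])
print(features.shape)  # (39, number_of_frames)
```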

The first step is motivated by the assumption that, though time-varying, the speech signal is stationary for short time intervals. The length of those intervals is typically between 10 and 20 msec. The frames are usually overlapping in time so that there is a smoother transition in the information captured by two consecutive frames. The following steps are motivated by perceptual phenomena. The cochlea is known to perform a frequency analysis of the signal, and humans are known to have logarithmic resolution both in frequency and in loudness. The Mel filterbank is a set of triangular filters, designed in such a way that the filters are logarithmically spaced in frequency.

The application of the Discrete Cosine Transform (DCT) is motivated by modelling constraints. When the feature vectors are modelled by Gaussian Mixture Models (GMMs), it is desirable to work with uncorrelated features. This allows us to greatly simplify the models by using diagonal covariance matrices. This requirement has become less stringent with the advent of acoustic models based on neural networks (NNs), because these models can easily cope with feature correlations. In fact, in NN acoustic modelling it is common to work with filterbank features directly, that is, to skip steps 5 and 6 in Algorithm 1 [8, 41]. The reason for truncating the MFCC vector to only 12 components is to limit the representation to a coarse description of the speech spectra, because the details are related to the frequency of vibration of the vocal folds and, to a first approximation, are a disturbing factor in phone classification.

This approach of knowledge driven pre-processing of the waveform has proved to be successful in discarding information that is irrelevant for discrimination. However, it is always possible that relevant information is lost in the process. In some very recent studies [1], Convolutional Neural Networks have been applied to the speech samples directly, eliminating the need for feature extraction. Although more difficult to train, these models can potentially make use of all the information contained in the signal. A similar trend can be observed in Computer Vision as well [21].

2.3 Acoustic Modelling

Hidden Markov Models (HMMs) are the most popular statistical models in Speech Recognition. A first-order Markov chain is a state-space model, where the probability distribution of the current state depends only on the previous state. A hidden Markov model is more complex in that the state is not directly observable. We observe an output that, given the current state, is conditionally independent from previous outputs and states. The states follow a first-order Markov chain. In speech recognition, the sequences of observations correspond to the feature vectors described in the previous section. The states roughly correspond to phonetic units. The model defines two types of probability distributions: transition probabilities, which describe what state is more likely to occur in the next step given the current state, and emission probabilities, which describe the likelihood of a specific observation given the current state. One important inference problem is to calculate the posterior probability of the phoneme states, given a sequence of acoustic observations.

The emission probabilities, for continuous feature vectors (e.g. MFCCs), are in ASR usually modelled by Gaussian Mixture Models. The combined model is called GMM-HMM. GMMs have been the state-of-the-art emission probability models in ASR for many years, mainly due to their flexibility. Closed form adaptation techniques such as Maximum Likelihood Linear Regression (MLLR) [29] and feature-based MLLR (fMLLR) [11] made it possible to quickly adapt those models to new speakers or environmental situations. This is a key feature for methods that need to be used in real-life conditions.
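To make the HMM notation concrete, the following toy example (an illustration, not code from the thesis) defines transition and emission probabilities for a small discrete HMM and evaluates the likelihood of an observation sequence with the forward algorithm. In an acoustic model, the discrete emission table would be replaced by GMMs or, in hybrid systems, by scaled neural-network posteriors.

```python
import numpy as np

# Toy HMM with 3 hidden states and 4 discrete observation symbols.
A = np.array([[0.7, 0.2, 0.1],    # A[i, j] = P(state j at t+1 | state i at t)
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
B = np.array([[0.5, 0.3, 0.1, 0.1],   # B[i, k] = P(observation k | state i)
              [0.1, 0.1, 0.6, 0.2],
              [0.2, 0.2, 0.2, 0.4]])
pi = np.array([0.6, 0.3, 0.1])        # initial state distribution

def forward_likelihood(obs):
    """Return p(obs) by summing over all state sequences (forward algorithm)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward_likelihood([0, 2, 2, 3]))
```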

Figure 2.2: Illustration of an acoustic model based on neural networks. The input window comprises 11 frames that are stacked and input to a neural network. The output of each node i in the topmost layer of the neural network is interpreted as the posterior probability of a certain phoneme given the observations: $p(ph_i|X)$.

Alternatives to GMM-HMMs based on neural networks have been studied in the past and have become more and more popular in recent years. These discriminative models can outperform generative models like GMMs, at the expense of flexibility. Most often, the output activations of the neural network are interpreted as probabilities and used as estimators for the emission probabilities in an HMM model similar to the one described above. These combinations are usually referred to as hybrid ANN-HMM systems [38, 41, 26]. Several research groups have reported that acoustic models based on Deep Neural Networks outperform GMM-based systems even on large vocabulary continuous speech recognition (LVCSR) tasks [7, 42]. Figure 2.2 depicts an acoustic model which takes the MFCC features of the frames of speech as input and gives the value of the posterior for phoneme i given the acoustic observation, $P(ph_i|X)$. In phone classification, the maximum a posteriori class is chosen for each time step independently from the previous classifications. If we are performing automatic speech recognition, instead, the posteriors are turned into likelihoods $P(X|ph_i)$ by means of Bayes' rule. Those likelihoods are then used in the complete system described in Figure 2.1. There have also been some recent attempts at removing the HMM model and letting the neural networks model all aspects of speech, including the lexical and language models [14]. The remainder of this chapter will introduce a number of ANN models that are relevant to this study in some detail.

2.4 MLPs and Deep Neural Networks

As our model is an autoencoder, a kind of neural network, we will first describe a simple MLP and then autoencoders more specifically. A Multi Layer Perceptron (MLP), also called a feedforward neural network, is a series of models stacked on top of each other. In each layer, the inputs are linearly combined and the combination is passed through a non-linear function. A basic MLP is made up of three layers: the first layer is the input layer. The number of nodes in the input layer is equal to the dimension of the input. The second layer is called the "hidden layer", because it is not observed directly. The output layer is used to perform classification or regression. The dimension of the output is equal to the number of classes for classification or to the dimensionality of the output signal for regression. There may be more than one hidden layer in an MLP. Models with several hidden layers are often called Deep Neural Networks (DNNs). In a fully connected network, all the nodes in one layer are connected to all the nodes of the next layer. A connection between a node in one layer and a node in the next layer is assigned a weight, $w_{ij}$. The activation of a node from the second to the last layer is a non-linear function applied to the weighted sum of all nodes from the previous layer. This non-linear function is also called an activation function. The expression in both scalar form and matrix form is given in 2.4 and 2.5:
$$y_j = f\Big(\sum_i w_{ij} x_i + b_j\Big) \qquad (2.4)$$
$$y = f(W^T x + b), \qquad (2.5)$$
where x is the activation of layer n, or the input vector for the input layer, y is the output of layer n + 1, b is the bias value of the hidden layer and W is the weight matrix between the layers. The activation functions can vary depending on the position of the node in the network. For hidden layers, commonly used activation functions are, for example:

1. Sigmoid function:
$$y_j = \frac{1}{1 + e^{-z_j}} \qquad (2.6)$$

2. Hyperbolic tangent function:
$$y_j = \tanh(z_j) \qquad (2.7)$$

3. Rectified Linear Unit (ReLU) function [34]:
$$y_j = \max(z_j, 0) \qquad (2.8)$$

where $z_j = \sum_i w_{ij} x_i + b_j$ is the linear combination of the activations of all nodes connected to node j. For the output layer, the activation function depends on the task. Linear activation is common for regression, whereas for classification it is common to use softmax activations, given by

$$y_j = \frac{\exp(z_j)}{\sum_k \exp(z_k)} \qquad (2.9)$$

Because the activations sum to 1, they can be interpreted as posterior probabilities of the different classes given the input to the network. The maximum a posteriori classifier is then implemented by selecting the class that corresponds to the maximum activation:

$$o = \arg\max_j y_j \qquad (2.10)$$
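Equations 2.4-2.10 amount to a few matrix operations. The following numpy sketch (layer sizes and random weights are arbitrary, for illustration only) computes the forward pass of a one-hidden-layer MLP and the maximum a posteriori class:

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=40)                               # input vector (e.g. one feature frame)
W1, b1 = rng.normal(size=(40, 100)), np.zeros(100)    # input -> hidden
W2, b2 = rng.normal(size=(100, 10)), np.zeros(10)     # hidden -> output

h = np.tanh(W1.T @ x + b1)        # Eq. 2.5 with a tanh activation
y = softmax(W2.T @ h + b2)        # Eq. 2.9: class posteriors
o = np.argmax(y)                  # Eq. 2.10: maximum a posteriori class
```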

DNNs are generally trained discriminatively using the backpropagation algorithm to minimise a cost function. Backpropagation is an algorithm to train a multi-layered network so that each layer in the architecture can learn a mapping from input to output that is optimal according to an optimality criterion. The backpropagation learning algorithm requires an input and a target. First, the weights in the network are initialized to random values. In the forward pass, an observation is input to the network and activations are generated for each node in each layer based on the current values of the weights. This allows us to measure the difference between the output of the network y and the desired (target) output t. This measurement can be a simple squared error on one single observation
$$E_E = \frac{1}{2}(t - y)^2 \qquad (2.11)$$
or a cross entropy loss measure computed over several observations:

$$E_C = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{K}\Big[t^i_j \log y^i_j + (1 - t^i_j)\log(1 - y^i_j)\Big], \qquad (2.12)$$
where the summation over j corresponds to the nodes in the output layer, and the sum over i is an average over a number of observations. In the backward pass, we calculate how the output error depends on each weight in the network, and update the weights in order to minimize the error. This is achieved by factorizing the partial derivatives with the help of the chain rule. This way, we can propagate the deltas back from the output layer to the input. If we evaluate the dependency of the error on a specific weight $w_{ij}$ as $\partial E / \partial w_{ij}$ and we start from the value of the weight $w_{ij}(t)$ at iteration t, the new value of the weight at iteration t + 1 is calculated by gradient descent as:

$$w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t+1) \qquad (2.13)$$
$$= w_{ij}(t) - \eta\, \frac{\partial E}{\partial w_{ij}}, \qquad (2.14)$$
where η is the learning rate. For very large datasets, calculating the gradient for the entire dataset at once can be extremely computationally intensive. To mitigate this problem, it is more efficient to compute the derivatives on a small, random mini-batch of training points; the weights of the layer are then modified proportionally to the gradient. To reduce the effect of noisy, spurious training samples, it is common to use an additional momentum term in the training algorithm to make the training more uniform and less spiky. The weight update rule including the momentum term α is given by
$$\Delta w_{ij}(t) = \alpha\, \Delta w_{ij}(t-1) - \eta\, \frac{\partial E}{\partial w_{ij}}. \qquad (2.15)$$

The term α ensures smoother variations in the gradient values. The term η gives the learning rate. DNNs with many hidden layers and many units per layer are very flexible models with a very large number of parameters, which can easily overfit due to some spurious characteristic of the training set. To reduce overfitting, several techniques are used, such as L2 regularisation, which penalises the magnitude of the weights and prevents them from becoming very large, or dropout.
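A minimal sketch of the mini-batch update in Equations 2.13-2.15, with hypothetical gradient values and hyperparameters chosen only for illustration:

```python
import numpy as np

def sgd_momentum_step(W, dW, velocity, eta=0.01, alpha=0.9):
    """One update of Eq. 2.15; 'velocity' holds the previous Delta W."""
    velocity = alpha * velocity - eta * dW   # Delta w(t) = alpha*Delta w(t-1) - eta*dE/dw
    W = W + velocity                         # w(t+1) = w(t) + Delta w(t)
    return W, velocity

# Hypothetical example: dW stands in for a gradient computed by backpropagation on a mini-batch.
W = np.zeros((100, 10))
velocity = np.zeros_like(W)
dW = np.random.default_rng(0).normal(size=W.shape)
W, velocity = sgd_momentum_step(W, dW, velocity)
```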

2.5 Autoencoders

Multi layer perceptrons require the target value to be specified for each training example. If we do not have this information, we can still learn a representation that is dependent on the distribution of the data by means of an auto-encoder. An auto-encoder is a special kind of neural network with two components: an encoder and a decoder. The encoder takes the input x and maps it to a hidden representation y, which is given by

$$y = \sigma(W x + b_h). \qquad (2.16)$$

The latent or hidden representation y is mapped back to the original input using a decoder, which is given as:
$$z = \sigma(W' \hat{y} + b_v), \qquad (2.17)$$
where y is the encoded value, $\hat{y}$ is a possibly corrupted version of the encoded value, $b_h$ and $b_v$ are the bias values of the encoder and decoder respectively, W is the weight matrix of the encoder, and $W'$ represents its transpose. This network tries to minimise the reconstruction error given as:

$$L_C(x, z) = \|x - z\|^2 \qquad (2.18)$$
$$L_B(x, z) = -\sum_{k=1}^{d} \big[x_k \log z_k + (1 - x_k) \log(1 - z_k)\big] \qquad (2.19)$$

The first equation, 2.18, is for continuous input, while the second equation, 2.19, is used for classes and binary vectors. It is basically the cross-entropy error already defined above. Although we have expressed the equations with the σ function, the activation function could be any other prominent activation function. If the hidden layer of the auto-encoder has a lower dimensionality than the input, the model will perform non-linear dimensionality reduction. If it is of equal or greater dimensionality, special care must be taken to avoid that the model learns a trivial mapping (identity function). For this reason, the input or the hidden representations may be corrupted during training.
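The encoder, decoder and reconstruction error of Equations 2.16-2.18, together with the input corruption just mentioned, can be sketched as follows. This is a toy numpy example with tied (transposed) weights and arbitrary sizes, not the implementation used later in the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_in, n_hidden = 64, 32
W = rng.normal(scale=0.1, size=(n_hidden, n_in))    # encoder weights
b_h = np.zeros(n_hidden)                            # encoder bias
b_v = np.zeros(n_in)                                # decoder bias

def autoencode(x, corruption=0.2):
    # Corrupt the input by randomly setting some dimensions to zero.
    mask = rng.random(x.shape) > corruption
    x_tilde = x * mask
    y = sigmoid(W @ x_tilde + b_h)          # Eq. 2.16: encoder
    z = sigmoid(W.T @ y + b_v)              # Eq. 2.17: decoder with transposed weights
    loss = np.sum((x - z) ** 2)             # Eq. 2.18: reconstruction error vs. the clean input
    return z, loss

x = rng.random(n_in)
z, loss = autoencode(x)
```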

2.5.1 Manifold Learning with Autoencoders

A reason why auto-encoders do so well is that they exploit the idea that data is generally concentrated around a manifold, or several subsets of manifolds. The theoretical understanding of how auto-encoders map data manifolds is still a very active area of research, but to give a brief motivation, the general principle behind all autoencoders is a trade-off between two ideas. The first is to learn a representation y of a training example x such that x can be approximately recovered from y through a decoder; x should be drawn from the training data, because it means that the autoencoder need not successfully reconstruct inputs that are not probable under the data generating distribution. The other, complementary idea is to satisfy a generalisation or regularisation penalty; the presence of this term will encourage solutions which are less sensitive to small perturbations in the data.

Figure 2.3: Illustration of a sparse autoencoder with more nodes in the hidden layer than in the input layer. Not all nodes in the hidden layer are activated; the green-colored nodes are the only activated nodes.

2.5.2 Sparse Autoencoders

Historically, the first application of autoencoders was for reducing the dimensions of the input data, hence the name. Even when deep learning came to the forefront, the conventional architecture was to have more layers with fewer nodes than the input. Such architectures are called bottleneck networks. The idea was that more abstract features should not require as many dimensions as the input. Sparse auto-encoders differ from standard approaches of compression and dimensionality reduction where there are fewer hidden units than the number of dimensions of the input, i.e. $N_W < N_D$. Sparse autoencoders have "overcomplete" representations, which means $N_W > N_D$. The idea is to learn a representation, but at the same time impose a "sparsity" constraint on the activation of the hidden units, so that only a small percentage of them are really active. This can be done by adding a sparsity term to the objective function [6]. We make use of this idea in our experiments by having an "overcomplete" network. The details will be presented in Chapter 4.
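One common way to impose such a sparsity constraint, following the formulation popularised in the lecture notes cited later as [35], is a KL-divergence penalty that drives the average activation of each hidden unit towards a small target value ρ. The sketch below illustrates that particular choice; it is an assumption made for illustration and not necessarily the exact term used in [6] or in this thesis:

```python
import numpy as np

def kl_sparsity_penalty(hidden_activations, rho=0.05, beta=3.0):
    """KL-divergence sparsity term, one common choice of 'sparsity term'.

    hidden_activations: matrix of shape (n_samples, n_hidden) with values in (0, 1).
    rho:  target average activation of each hidden unit.
    beta: weight of the penalty in the objective.
    """
    rho_hat = hidden_activations.mean(axis=0)       # average activation per hidden unit
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)      # avoid log(0)
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return beta * kl.sum()

# The penalty is simply added to the reconstruction cost of the autoencoder:
# total_cost = reconstruction_error + kl_sparsity_penalty(hidden_activations)
```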

2.5.3 Performance of Sparse Autoencoders

Sparse autoencoders have generally performed well in many applications, but they have some disadvantages too. As shown in Figure 2.3, sparse autoencoders require more memory, as there are more nodes in the hidden layers (but only a few are actually activated). Due to these extra nodes, the weight matrices have a bigger rank, which slows down computation. In comparison, autoencoders with a bottleneck architecture have fewer nodes in the hidden layer and could possibly have lower time complexity. The overall time complexity of a neural network is linear in the total number of connections present, and though a sparse autoencoder might have many more connections than a bottleneck autoencoder, many of them might not even be used after the first few iterations. It is therefore more difficult to find a lower bound on the time complexity of sparse autoencoders. Sparse autoencoders can offer more flexibility and additional modelling power than a bottleneck autoencoder, although the theoretical understanding of how they work is still an active area of research.

2.5.4 Applications of Autoencoders

Autoencoders have been successfully applied to dimensionality reduction and information retrieval tasks. Dimensionality reduction was, in fact, the first application of representation learning and auto-encoders. One other task that has made successful use of autoencoders is information retrieval, where the dimensionality reduction algorithm produces a low dimensional, binary code, which can then be stored in a hash table mapping binary codes to entries for fast search. The hash table gives us the capability to perform information retrieval by returning all database entries having the same binary code. The search is then a lot faster because it is done on "hashed" binary codes.

2.6 Semi-Supervised Learning

While the above sections discussed one popular supervised learning model and one unsupervised model, in this section we will talk about Semi-Supervised Learning. Semi-Supervised Learning is a field of study in machine learning which falls between supervised learning (a problem of classification/regression with full class labels available in the training data) and unsupervised learning (modelling of input data into classes without any class labels). In most practical applications, it is hard to get fully labeled training data. Unlabeled training data is generally easier and in some cases much cheaper to obtain. An interesting question which then arises is how the unlabeled data can be used in conjunction with labeled data to improve the accuracy of a system trained on just the labeled data and, if it can be, what the minimum relative proportion is at which the unlabeled data loses any relevance to the problem. We are given a set of independent and identically distributed input samples $x_1, ..., x_l \in X$ and their corresponding targets $y_1, ..., y_l \in Y$, such that $D_L = (X_L, Y_L)$. In addition we also have u unlabeled input samples $x_{l+1}, x_{l+2}, ..., x_{l+u} \in X_U$, such that $D_U = (X_U)$; the total number of samples is then given as $n = l + u$. Supervised classification only utilizes the information from $D_L$ to learn a decision boundary, whereas in the semi-supervised learning framework the decision boundary is obtained by utilizing information from both $D_L$ and $D_U$ at the same time. Semi-supervised learning attempts to make use of the combined dataset $D = D_L \cup D_U$ to surpass the classification performance that can be obtained either by doing supervised learning (discarding the unlabeled data) or by doing unsupervised learning such as clustering (discarding the label information). Some researchers refer to SSL as "transductive learning" or "inductive learning". The goal of transductive learning is to infer the correct labels of the unlabeled dataset $D_U$.
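In code, this setting is just a matter of bookkeeping: labels are kept for the first l samples and the remaining u samples are marked as unlabeled. A small sketch with hypothetical arrays, using -1 as the "no label" marker:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, l = 1000, 39, 100          # n = l + u samples, d feature dimensions, l labeled
X = rng.normal(size=(n, d))      # all input samples, labeled and unlabeled
y_true = rng.integers(0, 10, n)  # full labels (normally unknown for the last u samples)

y = np.full(n, -1)               # -1 marks an unlabeled sample
y[:l] = y_true[:l]               # D_L = (X[:l], y[:l]),  D_U = X[l:]

labeled_mask = y != -1
print(labeled_mask.sum(), (~labeled_mask).sum())   # l and u
```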

2.7 Assumptions

We make several assumptions in SSL. The key idea for semi-supervised learning to work is that the information about the distribution p(x) which can be obtained from unlabeled data should also carry the information required for the inference of y expressed through p(y|x).

Chapter 3

Related Work

In this chapter, we will talk about existing literature on the application of deep neural networks for ASR and on Semi-Supervised Learning techniques. One prominent such technique is Graph Based Semi-Supervised Learning (GBL SSL).

3.1 Deep Neural Networks in ASR

This section introduces the kinds of DNNs and the different architectures that have recently been used in ASR.

3.1.1 Deep Belief Networks

The first set of results which beat traditional GMM-HMMs was reported by [38], where the authors trained DBNs (Deep Belief Networks) to recognise the HMM states of phonemes, given an input of 9-11 frames of MFCC feature vectors. A DBN contains several stacked layers of RBMs (Restricted Boltzmann Machines). The RBMs are generatively trained using the contrastive divergence algorithm outlined in [38]. Initially, the results were reported on TIMIT [10]. Further attempts were made to replicate the success on TIMIT in large scale vocabulary applications. The first such attempt was on data collected from the Bing mobile voice search application (BMVS). It had about 24 hours of training data with a high degree of variability. The DBNs trained on this dataset achieved a sentence accuracy of 69.6% on the test set compared to just 63.8% achieved by the traditional GMM-HMM baseline. The DBNs are a kind of unsupervised learning model, and they were popular as a way to pretrain a neural network. But further research revealed that purely supervised learning of a DNN works comparably, provided a large amount of labeled data is available, the initial weights are set carefully, and the mini-batch sets are set properly [43]. Since then, DBNs have fallen a little out of favour in the speech community, as the overhead of pretraining can be replaced by careful tuning of the network.

3.1.2 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are another kind of neural network, with an ability to model temporal and sequential data. While the above mentioned DBNs are a model for unsupervised learning, RNNs are trained in a purely supervised manner. One major advantage of RNNs is that they do not require a fixed input feature length like the feed-forward networks discussed above. The experiments described in [14] demonstrate how RNNs can be made to learn multiple levels of representation, a salient feature of deep nets, and combine this with their ability to make flexible use of long range context in ASR.


The authors reported a test set WER (word error rate) of 17.7%, which beat all the previous benchmarks. RNNs had earlier been used with HMMs, but this was the first instance where RNNs were used end to end, and it proved that stacking multiple recurrent layers on top of each other can give better results, just like their counterparts in deep feed forward networks.

3.2 Examples of Semi-Supervised Learning Methods

This section introduces the different kinds of SSL techniques. The model we use in this thesis falls into the last category of models: semi-supervised learning with autoencoders.

3.2.1 Heuristic based SSL/Self-Training

Among all the existing and accepted techniques, the simplest algorithm for semi-supervised learning is based on the "self-training" scheme, where a model is trained just with the labeled part of the dataset and with newly labeled data obtained from its own highly confident predictions, until the confidence level of the predictions drops below a certain threshold. The training can take several iterations. To state it formally, the self-training approach starts with the labeled set $L = \{(x_i, y_i)\}_{i=1}^{l}$ and the unlabeled set $U = \{x_i\}_{i=l+1}^{n}$. An initial model f is trained using only the labeled data with standard supervised learning. The resulting model is then used to make predictions on U, where the most confident predictions are removed from U and added to L together with their corresponding class predictions. In the next iteration, the model is refined with the new augmented set L. A critical assumption made in this algorithm is that the predictions added to the initial labeled set are reliable enough themselves. One big advantage of this approach is that it can be used as a wrapper for any learning algorithm. This is quite a general technique, which has been used in many different research areas such as object classification [40] and speech recognition. In [46, 15, 17], self-training was used in combination with neural networks, whereas in [24, 25, 47] it was used in combination with GMM-based acoustic models. Although showing promising results, these methods involve heuristics and can reinforce "bad" predictions. The confidence level and unit selection are very important in these models, and any mistuning of these parameters can lead to bad results in later iterations.
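A generic sketch of this self-training loop is given below; the classifier interface (fit/predict_proba, in scikit-learn style), the confidence threshold and the data are placeholders, since the cited works use neural-network or GMM acoustic models rather than this toy wrapper:

```python
import numpy as np

def self_training(model, X_lab, y_lab, X_unlab, threshold=0.95, max_iter=10):
    """Self-training wrapper around any probabilistic classifier exposing fit()/predict_proba()."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_iter):
        model.fit(X_lab, y_lab)                     # train on the current labeled set L
        if len(X_unlab) == 0:
            break
        proba = model.predict_proba(X_unlab)        # predictions on the unlabeled set U
        confidence = proba.max(axis=1)
        confident = confidence >= threshold         # keep only highly confident predictions
        if not confident.any():
            break                                   # stop when confidence drops below the threshold
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
        X_unlab = X_unlab[~confident]               # remove the newly labeled points from U
    return model
```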

3.2.2 Transductive SVMs

Transductive SVMs [18] work on the principle of avoiding decision boundaries in regions where the input is heavily distributed. Putting a decision boundary in high density regions of the input increases the chances of getting "all predictions wrong"; this idea is based on cluster smoothness, which was discussed in the previous section. Transductive Support Vector Machines (TSVMs) are an extension of traditional SVMs to unlabeled data. They have the same objective of maximising the margin between classes, while ensuring that there are few unlabeled examples near the margin. Finding the exact solution is NP-hard. Some efficient approximate algorithms have been proposed, but they lack scalability to problems with very large datasets.

3.2.3 Entropy Based Semi Supervised Learning

Entropy based learning methods jointly model the labeled and unlabeled data. The primary motivation for these methods is entropy minimization, as proposed in [9, 16]. In [16], the authors proposed to jointly model the labeled and unlabeled data in a conditional entropy minimisation framework first demonstrated in [13]. The authors also maximised the conditional entropy regularisation term posed on the unlabeled data, in addition to the usual task of maximising the posteriors on the labeled data. This additional regularizer encourages the model to have as great a confidence as possible on the label prediction of the unlabeled data. The optimisation of the framework was performed using the extended Baum-Welch algorithm. The method was evaluated on different speech recognition tasks such as phonetic classification and phonetic recognition, where some improvement was obtained on the TIMIT dataset compared to a supervised, discriminatively trained GMM model with the Maximum Mutual Information criterion [37]. These methods have mostly been applied with GMM models in the literature; there is further scope for their study in discriminative methods, which could be a potential area of future work.

3.2.4 Graph based SSL

Graph based learning has been quite popular in ASR lately for acoustic modelling. One part of the thesis is to compare the results achieved by our model with the results from graph based learning for improving frame-based classification on the TIMIT dataset. One of the first works in this area was the application of the label propagation algorithm to a vocal classification task [2]. The work was evaluated on the Vocal Joystick dataset [19], which is an 8-vowel classification task that was used to develop voice-controlled assistive devices for patients with motor impairments. Graph based SSL methods define a graph composed of nodes which represent both labeled and unlabeled training examples, as explained in [50]. The nodes are connected to each other by edges, which have weights. The weight of an edge is given according to the similarity of the examples. The most popular algorithm in GBL based semi supervised learning is Label Propagation (LP) [49]. It iteratively propagates the information from the labeled data on a graph G. The end goal of all GBL based algorithms is to infer the labels via an undirected weighted graph $G = (V, E, W)$, where V are the vertices of the graph corresponding to the data points in $D_L$ and $D_U$, and E are the undirected edges of the graph, weighted by $w_{ij} \in W$. The label propagation algorithm minimises the following function:

$$\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\, \|\hat{y}_i - \hat{y}_j\|^2 \qquad (3.1)$$
subject to $\hat{y}_i = y_i$, where $\hat{y}$ is a predicted label and y is a true label. Two more techniques in graph based SSL, known as MAD (Modified Adsorption), proposed in [45], and MP (Measure Propagation), proposed in [44], have recently been introduced. MP minimizes the following objective function:

$$\sum_{i=1}^{l} D_{KL}(r_i \| p_i) + \mu \sum_{i=1}^{n}\sum_{j \in \mathcal{N}(i)} w_{ij}\, D_{KL}(p_i \| p_j) - \nu \sum_{i=1}^{n} H(p_i) \qquad (3.2)$$

where p is the predicted probability distribution over the classes, r is the true (reference) distribution, $\mathcal{N}(i)$ is the graph neighborhood of node i, $D_{KL}$ is the Kullback-Leibler divergence, and H is the entropy. The first term in the expression ensures that the predicted probability distribution matches the true distribution over the labeled vertices as closely as possible; the second term stands for the smoothness of the label assignment enforced by the graph G defined above, which essentially means that the class probability distributions on neighbouring vertices (i.e. those connected by a higher edge weight) should have a smaller KL divergence. The third term encourages higher entropy in the final output. A new variant of MP, called prior-regularized measure propagation (pMP), is given in [30]. The pMP algorithm minimises the following objective function:

$$F(D, G, p) = \sum_{i=1}^{l} D_{KL}(r_i \| p_i) + \mu \sum_{i=1}^{n}\sum_{j \in \mathcal{N}(i)} w_{ij}\, D_{KL}(p_i \| p_j) + \nu \sum_{i=l+1}^{l+u} D_{KL}(p_i \| \tilde{p}_i). \qquad (3.3)$$

The additional term in Equation 3.3 is a measure of how close the predicted probability distribution is to the prior distribution of the classes. Graph based SSL has recently been applied by [30, 31] in the context of ASR. The authors keep a DNN as the final discriminative classifier and achieve state-of-the-art results. We will compare our results with theirs in Section 5.4. There are several problems with graph based SSL methods. Firstly, their complexity is $O(N^3)$, because they involve the inversion of an $N \times N$ matrix. They also do not give any confidence measure for the estimates. Moreover, the addition of any new data points can be quite cumbersome, as it requires modelling the entire graph structure again.
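As a point of reference, a basic label propagation baseline in the spirit of Eq. 3.1 is available in scikit-learn; the following usage sketch is an illustration with random data, not the pMP system of [30, 31]:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 39))            # labeled + unlabeled feature frames
y = rng.integers(0, 10, 500)
y[50:] = -1                               # scikit-learn convention: -1 means "unlabeled"

# An RBF kernel defines the edge weights w_ij between the graph vertices.
lp = LabelPropagation(kernel="rbf", gamma=0.5, max_iter=1000)
lp.fit(X, y)
predicted = lp.transduction_              # inferred labels for all vertices
```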

3.2.5 Semi-Supervised Learning with generative models These approaches take the SSL problem as a case of missing data imputation task for su- pervised classification problem. They are probabilistic in nature and also give confidence measures or uncertainity measures for predictions. Kingma et al in [20] have presented a Latent-feature discriminative models that provides an embedding or feature representa- tion of the data similar to an autoencoder. The deep of data provides a more robust set of latent features than autoencoders. This latent feature representation al- lows for clustering of related observations in latent feature space and gives quite accurate classification. The generative model used, is given below:

p(z) = N(z|0,I)

p_θ(x|z) = f(x; z, θ),
where the likelihood function f(x; z, θ) could be a Gaussian. Another model presented in the same paper describes the data as generated by a latent class variable y in addition to the embedding z learnt above. The generative process is given as:

p(y) = Cat(y|π); p(z) = N(z|0,I)

p_θ(x|y, z) = f(x; y, z, θ),
where f(x; y, z, θ) is a Gaussian likelihood function. z_i is the additional independent latent variable for each x_i. All z_i can be written as the distribution of a single latent variable z.

3.3 Autoencoder Based Semi-Supervised Learning

This section introduces the model we are going to use and motivates it. Deep Neural Networks are generally trained either from just fully labeled data, or from just purely unlabeled data. Both models, when used alone, are not ideal. Ranzato et al. in [39] explored the possibility of having a joint objective made of both a supervised and an unsupervised objective on documents, where bag of words representations were used as input features. The authors performed their experiments on the Reuters and Newsgroups datasets. We propose to use a similar approach for frame-based phoneme recognition in ASR. Although our objective function is the same as the one proposed in [39], our set up is different in a number of ways. Firstly, instead of the compact and lower dimensional encoding used in [39], we employ sparse encoding. Secondly, instead of stacking a number of encoders, decoders and classifiers in a deep architecture as in [39], we use a single layer model. This is motivated by the work in [6], where the authors analyse the effect of several model parameters in unsupervised learning of neural networks on computer vision benchmark data sets such as CIFAR-10 and NORB. They conclude that state-of-the-art results can be achieved with single layer networks regardless of the learning method, if an optimal model setup is chosen. An introduction to sparse autoencoders can be found in the Stanford lecture notes given by [35].

Chapter 4

Method

We present our algorithm here. We will try to motivate it from our understanding of autoencoders and the principles of supervised classification. Neural networks require proper care and attention during training, so we will describe how to initialise the network, the practical details to be kept in mind, and the hyperparameter values we used. We will also try to explain why we made certain choices in the network configuration. We will also describe gradient checking, which is an important technique for debugging backpropagation in neural networks.

4.1 The Model

Figure 4.1 shows a block diagram of the neural network model used in this study. The topmost path is equivalent to an autoencoder, consisting of an encoder $W_E$ and a decoder $W_D$. Although autoencoders usually share weights between the encoder and decoder ($W_D = W_E^T$), in our case we optimize those weights independently. The reason for this will become clear in the following. The bottom path in the figure is a neural network classifier that uses the representation learned by the encoder as input features. The figure also shows the reconstruction error $E_R$ and the classification error $E_C$ that can be computed for the two paths. Evaluating $E_C$ requires labels for each input observation, whereas $E_R$ is computed without the need for labels. This allows us to update the model parameters $W_E$, $W_D$ and $W_C$ simultaneously on labelled and unlabelled material in the same batch of observations. The advantage over a feed-forward network is that $W_E$ can be estimated on much larger unlabelled data-sets. The advantage over unsupervised auto-encoders is that $W_E$ will be continuously optimised during training for the particular classification task we are considering. The training algorithm is given in Algorithm 2.

Algorithm 2 Algorithm to train the network
1: Transform the training samples x into codes z using the encoder part of the layer.
2: Calculate the reconstruction loss $E_R$ using the encoded input z.
3: Compute the classification error $E_C$ using again z and the known labels y.
4: The loss is then combined, and the final objective function is given as: $E = E_R + \alpha E_C$.
5: The layer is trained by minimising the combined loss term using SGD.
6: The encoded input z is used as input to train the next layer.
7: The procedure can be repeated with other layers.


[Figure 4.1: flow chart — the input x passes through the encoder $W_E$ (tanh) to the code z; the decoder $W_D$ (tanh) produces the reconstruction $\hat{x}$, giving the reconstruction error $E_R$; the classifier $W_C$ (softmax) maps z to y, which is compared with the target t to give the classification error $E_C$; the total cost is $E = E_R + \alpha E_C$.]

Figure 4.1: Flow chart for the cost calculation in a single layer of the network. Three components are considered: encoder, decoder, and classifier. The loss is a weighted sum of the cross-entropy $E_C$ and the reconstruction loss $E_R$. If several layers are stacked together, only the encoder/decoder pairs are retained after training. The value t represents the targets from the training data and y represents the output of the classifier network.

The computational cost is linear in the number of training samples, and thus the method is more efficient than graph based semi-supervised learning algorithms, which have $O(N^3)$ cubic complexity. For each layer, the cost is given by a forward and backward pass through the encoder, decoder and classifier. In the supervised setting, the cost function $E_C$ is the cross-entropy log-loss defined as:

$$E_C = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{K}\Big[t^i_j \log y^i_j + (1 - t^i_j)\log(1 - y^i_j)\Big], \qquad (4.1)$$
where $y^i_j$ is the activation of output node j of the classification network for observation i and $t^i_j$ is the corresponding target label; the summation over j corresponds to the nodes in the output layer, and the sum over i is an average over a number of observations. In the unsupervised setting, instead, the objective is to find a representation that retains enough information in order to reconstruct the input data. The cost function $E_R$, in this case, is the second degree norm (also called L2 norm) of the difference between the original input and the reconstructed input:

$$E_R = \|x - \hat{x}\|^2, \qquad (4.2)$$
where x denotes the original input and $\hat{x}$ represents the reconstructed input. Similarly to the method described in [3], we add noise to the original input by a process called 'corruption', in which some dimensions of the input vector are randomly picked and set to zero. This helps the network learn a more robust representation of the input data. The final cost function that we want to optimise is a linear combination of $E_R$ and $E_C$:

$$E = E_R + \alpha E_C. \qquad (4.3)$$

Optimizing the cost E with respect to the model parameters $W_E$, $W_D$ and $W_C$ is a non-convex optimisation problem, which we solve with a gradient descent algorithm. When the input datapoint is not accompanied by a label, the classifier part of the layer is not updated, and the loss function simply reduces to $E_R$.

The data is split into three sets. The optimization is performed on the training set, while a validation set is used to optimize the meta parameters for each run, for example the value of α in the linear combination of Eq. 4.3. Neural networks contain many hyperparameters, and tuning them can be a challenge. The final results are given on an independent test set. As we will describe later in Section 5.1, we run experiments on two datasets: MNIST and TIMIT. In the case of MNIST, the input x is the raw image represented by the pixels concatenated row-wise into a single vector. The output y is the digit number (0-9). In the case of TIMIT, the input x is the MFCC + ∆ + ∆∆ feature vector concatenated together with the features of the 5 previous and 5 next frames, as in Figure 2.2. The procedure to obtain these features has already been explained in Section 2.2. The outputs y are all the possible speech phonemes. Please note that, recently, it is more prevalent to have senones or the HMM states of the phonemes as the target labels, as in [38], but to compare with other techniques in semi-supervised learning for speech [30, 2], we experimented with phonemes. The training of this model can be performed greedily layer by layer if we wish to use deep networks. However, in our experiments, we use only a single layer for feature representation. We use mini-batch SGD, as explained in the theory section, as the optimizer for this cost function. The weight matrices $W_C$ and $W_D$ are simply updated by the normal backpropagation algorithm as shown in Equation 2.14 in the theory section. However, the update of the encoder weight matrix is not as straightforward as the other two, and is given by:

∂E/∂W_E = ∂E_R/∂W_E + α ∂E_C/∂W_E   (4.4)

W_E ← W_E − η ∂E/∂W_E   (4.5)

It is important to note that the update of the encoder weights W_E depends on both W_D and W_C, and the delta propagated in the backpropagation algorithm is a linear combination of the deltas calculated in both parts. Because of its sparse properties, we call this model Semi-Supervised Sparse Autoencoder (SSSAE). The weight update given in Equation 4.5 was implemented using the Theano library [23].
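To make the combined update of Eq. 4.4-4.5 explicit, the fragment below sketches one SGD step for a single layer. It is a NumPy illustration under our own assumptions (tanh encoder, linear decoder, sigmoid classifier, batch-averaged costs); the thesis implementation instead relied on Theano's automatic differentiation. The key point is that the encoder delta is the sum of the delta propagated back from the decoder and α times the delta propagated back from the classifier.

import numpy as np

def sgd_step(x, t, W_E, b_E, W_D, b_D, W_C, b_C, alpha, eta, labeled=True):
    n = x.shape[0]
    # Forward pass through encoder, decoder and classifier.
    h = np.tanh(x @ W_E + b_E)                       # encoder activations
    x_hat = h @ W_D + b_D                            # linear reconstruction
    y = 1.0 / (1.0 + np.exp(-(h @ W_C + b_C)))       # sigmoid classifier outputs

    # Output deltas (derivatives of the batch-averaged E_R and E_C).
    d_dec = 2.0 * (x_hat - x) / n                    # dE_R / dx_hat
    d_h = d_dec @ W_D.T                              # delta from the decoder path
    grad_WD, grad_bD = h.T @ d_dec, d_dec.sum(axis=0)

    if labeled:
        d_clf = (y - t) / n                          # dE_C / dz for sigmoid + cross-entropy
        grad_WC, grad_bC = h.T @ d_clf, d_clf.sum(axis=0)
        d_h = d_h + alpha * (d_clf @ W_C.T)          # Eq. 4.4: linear combination of deltas

    d_pre = d_h * (1.0 - h ** 2)                     # tanh derivative
    grad_WE, grad_bE = x.T @ d_pre, d_pre.sum(axis=0)

    # Parameter updates; the alpha factor on W_C follows from the total cost of Eq. 4.3.
    W_D -= eta * grad_WD; b_D -= eta * grad_bD
    if labeled:
        W_C -= eta * alpha * grad_WC; b_C -= eta * alpha * grad_bC
    W_E -= eta * grad_WE; b_E -= eta * grad_bE       # Eq. 4.5

When a mini-batch contains only unlabeled points, labeled is set to False and the step reduces to a plain denoising-autoencoder update, as described above.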

4.2 Evaluation

The evaluation method was simply classification accuracy, given as the proportion of correctly classified examples. In machine learning experiments, the standard convention is to partition the data into three sets: a training set, a validation set, and a test set. The largest of them is the training set, which is used to fit the parameters of the model. Given a model class and a choice of hyperparameters, the parameters that give the minimum error on the training set are selected. Given a type of model, one tunes the hyperparameters on the validation set. The test set is kept untouched during the entire training process; computing the error on the test set is the last step. We performed experiments on two different datasets. In the MNIST problem, each example

corresponds uniquely to a class (written digit) and the classification accuracy is straightforward to define. In ASR, however, this is not the case. Because the speech signal is a continuous stream of linguistic units, many metrics can be defined at different levels of detail. We can compute errors at the phonetic level, at the word level, or even at the level of full sentences. We can also consider the fine alignment of the recognized linguistic units in time, or just consider the sequence of linguistic units, disregarding errors in alignment. Which metric we use is determined by the application and by the output of our speech recognizer. The most commonly used metric is called Word Error Rate (WER). This metric is defined on sequences of words and disregards the alignment in time. The sequence of recognized words is aligned with the sequence of labels by means of dynamic programming, and the mismatch is computed in terms of the number of insertions, deletions and substitutions. In this thesis, however, we focus on phone recognition, and we therefore use a corresponding metric. The most common way to calculate accuracy at the phonetic level is to consider each speech frame as an independent classification and count the proportion of correctly classified frames. This is possible because the data set that we use (TIMIT) contains carefully annotated phonetic transcriptions. This particular evaluation method is usually considered a first step whenever a new method for ASR is introduced, because it allows possible problems to be found early, before a full large-vocabulary speech recognizer is constructed.

4.3 Monitoring and Debugging

4.3.1 Design Choices/Tuning Hyperparameters
Neural networks contain many hyperparameters. A model's parameters are directly fitted by a training algorithm, whereas the hyperparameters are set by hand or through trial and error. For a neural network, the values of the weight matrices are the parameters of the model; in the model given above, W_E, W_C and W_D serve as the parameters. In addition to these parameters, we generally have another set of parameters that cannot be learned directly from the regular training process; they act one level above the normal training process. For example, the value of K is a hyperparameter in the K-means clustering model, and the learning rate η is a hyperparameter in the gradient descent learning procedure. Setting the hyperparameters properly plays an important role in getting the best performance from the network and in making the network converge faster. A proper configuration of the network can also reduce the amount of computation required by reducing the number of epochs needed to reach convergence. More recently, it has been shown that the initialization procedure based on unsupervised pretraining with Deep Belief Networks can be avoided if we use better activation functions and a sufficiently large training set [43, 12].

In our model, as given by Equation 4.3, the value of α cannot be determined by one iteration of the training procedure described in Chapter 2. The usual way of determining its value is to iterate the training procedure over a set of possible values and pick the one which performs best on a validation set. Other hyperparameters in our model are: the number of nodes in the hidden layer, the learning rate η, the momentum term as described in Section 2.4, and the batch size B. The procedure for optimising the hyperparameters is shown step-wise below in Algorithm 4.3.1. Due to the limited processing power of our system, we cannot perform an exhaustive

search for hyperparameter optimisation. The strategy we use is called coordinate descent. The idea is to change only one hyperparameter at a time, optimise that particular hyperparameter, and then use its value together with the best configuration of hyperparameters found so far. Another approach is to start the search by considering only a few values of each hyperparameter, but over a very large range; a more local search can then be performed in the neighbourhood of the optimal value found, in order to make finer adjustments with more iterations. For example, the learning rate could be optimised over the range (0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.5). This procedure is described below:

for p ∈ HP do
    for p ← p_i from p_1, p_2, ..., p_n do
        Train the model
        Estimate model accuracy M_i on the validation set
    end for
    j ← arg max_i M_i
    p ← p_j
end for

Table 4.1 shows the values of the hyperparameters we used for the experiments. We followed the procedures given in [4, 5] for tuning them to obtain the best validation set accuracy.
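For illustration, a compact version of this coordinate-descent search could be written as follows. This is our own sketch: train_and_validate is a hypothetical placeholder standing in for one full training run that returns validation-set accuracy, and the example grid is only loosely based on the ranges quoted above and in Table 4.1.

def coordinate_descent_search(grid, train_and_validate):
    """grid: dict mapping hyperparameter name -> list of candidate values.
    train_and_validate: callable taking a dict of hyperparameters and
    returning the validation-set accuracy of one full training run."""
    # Start from the first candidate value of every hyperparameter.
    best = {name: values[0] for name, values in grid.items()}
    for name, values in grid.items():            # optimise one hyperparameter at a time
        scores = []
        for v in values:
            config = dict(best, **{name: v})     # all other hyperparameters kept fixed
            scores.append(train_and_validate(config))
        best[name] = values[scores.index(max(scores))]   # keep the best value found
    return best

# Example grid (assumed values, loosely following Table 4.1):
grid = {
    "eta": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.5],
    "batch_size": [100, 200, 300, 400],
    "alpha": [50, 100, 300, 600, 900, 1400],
}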

4.3.2 Learning Rate
The learning rate determines how large the weight updates are in one iteration of the training procedure. It is also probably the most important hyperparameter. If the rate is too low, it might take too many epochs to reach the minimum; if the rate is too high, we might jump over the minimum altogether. An effective approach is to decay the learning rate every few iterations. The most common ways to adapt the learning rate are:

1. Exponential decay:
   η = η_0 exp(−kt)   (4.6)

2. Inverse decay:
   η = η_0 / (1 + kt)   (4.7)

where η_0 is the initial learning rate at epoch 0, t is the current epoch, and k is another hyperparameter to be tuned.

3. Step decay after delay: reduce the learning rate by some factor every few epochs. We used a simple heuristic: after each epoch we observed the validation error, and the learning rate was reduced by a factor whenever the validation error stopped improving or, in some cases, started to increase.

We experimented with all of the above-mentioned schemes, but found the step decay scheme easiest to work with for our problem, as it was easier to tune and monitor and did not involve tuning an additional hyperparameter k as in the other two schemes. We took the initial learning rate η_0 equal to 0.015. The learning rate was kept constant

until the change in validation error became quite small; after that it was halved after each epoch, until the learning rate fell below a negligibly small value such as 10^-5.
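The step-decay heuristic can be written down in a few lines. In this sketch, train_one_epoch and validation_error are placeholders for one epoch of training and the validation-set evaluation; the tolerance and epoch limit are our own assumptions. The same loop also realises the early-stopping criterion discussed later in Section 4.3.8, since training stops once the learning rate falls below the floor.

def train_with_step_decay(train_one_epoch, validation_error,
                          eta0=0.015, min_eta=1e-5, tol=1e-4, max_epochs=200):
    eta, prev_err = eta0, float("inf")
    for epoch in range(max_epochs):
        train_one_epoch(eta)                  # one pass over the training data
        err = validation_error()
        if prev_err - err < tol:              # validation error stopped improving
            eta *= 0.5                        # halve the learning rate
        prev_err = err
        if eta < min_eta:                     # rate is negligible: stop training
            break
    return prev_err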

4.3.3 Batch Size
The batch size is another hyperparameter in our model: it is the number of input samples in one batch when optimising the neural network with the mini-batch Stochastic Gradient Descent (SGD) method. Its value generally ranges from 32 to 500. This parameter is especially important for our model, as we have to keep a uniform proportion of labeled and unlabeled samples in every batch. If the right proportion is not maintained, one cost might carry too much weight compared to the other. For example, when we have 1000 labeled points and 49000 unlabeled datapoints and the batch size B is too small, a batch might not contain any labeled point at all, and the cost reduces to the reconstruction cost alone. On the other hand, if B is too big, it slows computation down because of the larger input matrices. We chose B to be approximately between 100 and 200; when the amount of labeled data was very low, we took B to be between 300 and 400.
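One way to keep the labeled/unlabeled proportion roughly constant across mini-batches is to sample the two pools separately and concatenate them. This is our own sketch, not the thesis code, and it assumes each pool is large enough for the requested batch size.

import numpy as np

def balanced_batches(x_lab, t_lab, x_unlab, batch_size, rng=np.random):
    """Yield mini-batches whose labeled fraction matches the global fraction."""
    n_lab, n_unlab = len(x_lab), len(x_unlab)
    frac = n_lab / float(n_lab + n_unlab)
    k = max(1, int(round(frac * batch_size)))        # labeled points per batch
    n_batches = (n_lab + n_unlab) // batch_size
    for _ in range(n_batches):
        i = rng.choice(n_lab, size=k, replace=False)
        j = rng.choice(n_unlab, size=batch_size - k, replace=False)
        yield x_lab[i], t_lab[i], x_unlab[j]          # labeled and unlabeled parts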

4.3.4 Weight Initialization
Weight initialisation is also an important decision when setting up a neural network. For each layer, while the bias b can generally be initialised to a zero vector, the weight matrix needs to be initialised more carefully in order to break the symmetry between hidden units of the same layer. As described in [12], the initial values for the weights W_D, W_E and W_C should be uniformly sampled from a symmetric interval:

W_C ∼ U[ −√(6/(n_h + n_out)), √(6/(n_h + n_out)) ]   (4.8)
W_D ∼ U[ −√(6/(n_h + n_in)),  √(6/(n_h + n_in)) ]    (4.9)
W_E ∼ U[ −√(6/(n_in + n_h)),  √(6/(n_in + n_h)) ]    (4.10)
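In code, this initialisation amounts to sampling each matrix from the symmetric interval determined by its fan-in and fan-out. A small NumPy sketch (the layer sizes shown are only indicative of the TIMIT setup described later):

import numpy as np

def glorot_uniform(fan_in, fan_out, rng=np.random):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

n_in, n_h, n_out = 429, 8000, 48            # indicative sizes for the TIMIT experiments
W_E = glorot_uniform(n_in, n_h)              # encoder, Eq. 4.10
W_D = glorot_uniform(n_h, n_in)              # decoder, Eq. 4.9
W_C = glorot_uniform(n_h, n_out)             # classifier, Eq. 4.8
b_E, b_D, b_C = np.zeros(n_h), np.zeros(n_in), np.zeros(n_out)   # biases start at zero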

4.3.5 Number of Hidden Units

The number of units in the hidden layer is another hyperparameter. The size of W_E is (n_in, n_h), the size of W_D is (n_h, n_in) and the size of W_C is (n_h, n_out). The number of input nodes n_in and the number of output nodes n_out are already fixed by the problem, so n_h is the only dimension we can tune. In our experiments, we found that increasing n_h improved accuracy up to a point, after which increasing it further had no impact on accuracy while slowing the computation down. The exact size depends on the dataset and problem, but in general it was much larger than the number of input nodes n_in. We found the range 7000-9000 to be the best choice for n_h for both datasets we used.

4.3.6 Momentum
Some researchers use a momentum term to smooth the gradient updates, as described by Equation 2.15. We did not use a momentum term in our experiments, as it did not seem to improve either accuracy or convergence.

4.3.7 Activation Function
The activation function is another hyperparameter, as there are several functions that can be used. The hyperbolic tangent (tanh) function worked well for us, and is better at handling inputs that are not centered around 0. Of the two datasets in our experiments, MNIST is not normalized around 0, while TIMIT is normalized to have zero mean.

4.3.8 Training Epochs
The number of training epochs (T) is one more important hyperparameter. It is not hard to optimize, as this can easily be done through early stopping: as training progresses, one can decide how long to train for, for any given setting of all other hyperparameters. We checked the validation error of the network at each epoch, and stopped training whenever the validation error started increasing again. We implemented this by reducing the learning rate as described above: when the learning rate fell below a small positive threshold such as 10^-5, training was stopped immediately. Our model needed more iterations than a single hidden layer neural network: while a neural network took about 20-25 epochs for MNIST and about 35-40 epochs for TIMIT, our model took about 50-60 epochs for MNIST and about 70-90 epochs for TIMIT.

4.3.9 Additive Noise
The amount of noise to be added to the input is another decision to be made when training autoencoders. As described in connection with Eq. 4.2, it is generally preferable to add some noise to the original input to avoid the network learning the identity function. This could be Gaussian noise added to the input, but a more common way of adding noise is to randomly zero out a fixed percentage of the input dimensions. This procedure is called "masking corruption". In our experiments, we used the masking corruption procedure to add about 10% noise to the input. We found that adding more noise than this did not affect the performance by much.
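Masking corruption is simple to implement: pick a fraction of the input dimensions at random and set them to zero. A NumPy sketch with the 10% corruption level used in our experiments (sampling the mask independently per dimension is an assumption on our side):

import numpy as np

def mask_corrupt(x, corruption_level=0.10, rng=np.random):
    """Randomly zero out a fraction of the input dimensions of each sample."""
    mask = rng.binomial(n=1, p=1.0 - corruption_level, size=x.shape)
    return x * mask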

4.3.10 Alpha
For choosing α, we use the search algorithm given above. Interestingly, we found that the optimum value of α depends on the percentage of labeled samples in the dataset. We describe this in more detail in Chapter 6.

Name of Hyperparameter             Value
Initial Learning Rate (η_0)        0.015
Learning Rate Schedule             Step Decay
Batch Size (B)                     100-400
Momentum (β)                       0
# Nodes in hidden layer (n_h)      7000-9000
Noise Level                        10%
Epochs (T)                         60-90
Activation Function                Hyperbolic tangent (tanh)

Table 4.1: Hyperparameter values for our model.

4.3.11 Gradient Checking
This section describes a practical nuance of the backpropagation algorithm. Backpropagation, for all its simplicity and scalability, can be hard to work with and debug in practice. When using third-party libraries or self-written code, a slightly buggy implementation may not make it at all apparent that something is amiss. In the model discussed, Equation 4.5 gives the update of the encoder weight matrix. To verify that the derivative is correct and the weight update is working as intended, we can numerically check the derivatives computed by any code or library that performs gradient computation. Suppose, for example, that we want to minimize F(θ) as a function of θ. For a simple one-dimensional case, one iteration of the weight update is given by:

θ ← θ − η ∂F(θ)/∂θ   (4.11)
g(θ) = ∂F(θ)/∂θ      (4.12)
θ ← θ − η g(θ)       (4.13)

where θ represents the parameter being optimised (here one-dimensional) and η is a scalar learning rate. To check whether the implementation of g(θ) is correct, we use the following equation:

g(θ) = lim_{ε→0} [F(θ + ε) − F(θ − ε)] / (2ε)   (4.14)

The above expression is known as the numerical derivative. ε is a very small positive number; its value can range from 10^-4 to 10^-6 or even smaller, but it should not be as small as 10^-20, in order to avoid numerical round-off errors. When θ is not one-dimensional, the above equation can be applied to each dimension in turn while keeping the other dimensions constant, which means the increment/decrement of θ is limited to one dimension at a time. The value obtained from the numerical derivative should then be approximately equal to the gradient computed by a standard library or package. In our experiments, we used the function T.grad provided by Theano to compute derivatives. Numerical derivatives, although easy to use, have limitations: firstly, they have to be evaluated separately for each dimension of the weight matrix, and secondly, they are not precise, though still good enough for debugging purposes. Theano, on the other hand, uses symbolic differentiation, which is easier to scale to multiple dimensions and is more precise.
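The numerical check of Eq. 4.14 can be carried out dimension by dimension as described above. A small sketch (loss stands for any scalar-valued function of a parameter array; it is a placeholder, not part of the thesis code):

import numpy as np

def numerical_gradient(loss, theta, eps=1e-5):
    """Central-difference approximation of d loss / d theta, one dimension at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        old = theta.flat[i]
        theta.flat[i] = old + eps; f_plus = loss(theta)
        theta.flat[i] = old - eps; f_minus = loss(theta)
        theta.flat[i] = old                          # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2.0 * eps)
    return grad

# The analytical gradient (e.g. from Theano's T.grad) should agree with this
# approximation up to a small relative error.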

Chapter 5

Experiment Setup and Results

The model described in Chapter 4 is tested on two different datasets and the results are compared with a standard two-layer neural network. The first dataset is MNIST [27], which is used as a benchmark in the machine learning community for testing semi-supervised algorithms. As this thesis concentrates on improving semi-supervised classification in ASR, we also test the algorithm on TIMIT [10], which is a much bigger dataset than MNIST and quite popular in the speech recognition community due to its carefully manually annotated transcriptions.

5.1 Data

5.1.1 MNIST
We test performance on the standard MNIST [27] digit classification benchmark. The dataset contains the 10 digits from 0 to 9, each with 5000 sample points in the training set and 1000 sample points in each of the validation and test sets. The datapoints for each digit class exhibit intra-class variation in the form of different handwriting strokes. All the images in the dataset are of size 28×28, which amounts to 784 pixels. The dataset for semi-supervised learning is created by splitting the 50000 training samples into a labeled and an unlabeled set, and varying the number of labeled points from 100 to 3000. All digit classes have the same number of points in the labeled part of the training set to avoid any bias towards a particular class; this way, we ensure that all classes are balanced and have equal representation in the labeled part of the dataset. In total, MNIST contains 70000 samples: 50000 in the training set, 10000 in the validation set and 10000 in the test set. Some images from the dataset are shown in Fig. 5.1. Each image is vectorised by concatenating all the pixels row-wise. We do not do any pre-processing of the images, and all the pixels of the image, represented by the vector, are fed as input to the network.

5.1.2 TIMIT
The main part of this thesis is the experiment on the standard TIMIT dataset [10] for the task of frame-based phone classification. TIMIT has a total of 6300 American English sentences, corresponding to about 5.4 hours of speech recordings, with 630 unique speakers each having spoken 10 sentences. The spoken sentences were carefully chosen to be phonetically rich. The speech waveform is sampled at a frequency of 16 kHz and stored in 16-bit linear coding.


Figure 5.1: Some sample images from the MNIST dataset. The images are of handwritten digits with varying strokes for each digit from 0-9. All images are of size 28×28 pixels.

There are 438 male speakers and 192 female speakers, which roughly corresponds to a 70-30% ratio in favour of males. Another distinction among the speakers, apart from gender, is the dialect they speak: the speakers can be divided into 8 different groups based on the American English dialect they speak. A speaker's dialect is roughly based on the region where they lived during their childhood. More information about the dialects and regions can be found in the documentation [10]. There is also a core test set placed under a different directory. This set contains speech from 24 speakers, 2 male and 1 female from each dialect region. Each speaker has 8 sentences, making the total number of utterances 192. There is another, larger directory of speech containing a total of 168 speakers and 1344 utterances, called the "complete test set" in the documentation, which can be used for testing as well. The spectrogram of one utterance is shown in Figure 5.2.

We used the standard core test set of 192 sentences described above, and a development/validation set of 184 sentences. For training, we used a total of 3512 sentences. As part of the standard procedure for experiments on TIMIT, glottal stop segments are excluded; a glottal speech segment is a kind of sound produced by obstructing the airflow in the vocal tract. The procedure for converting raw speech waveforms into a sequence of fixed-length vectors has already been documented in Section 2.2. The TIMIT dataset in processed form is much bigger than MNIST: it contains roughly 1.3 million speech frames, 1.15 million in the training set and roughly 120000 in the test and validation sets. The data is created with the help of the standard recipes given in [36, 33]. The input to our network was created by first extracting a 39-dimensional feature vector for each frame. The feature vector is made of 12 MFCC coefficients and 1 energy coefficient, together with their deltas and delta-deltas, computed every 10 ms with an overlapping window of 20 ms. For each frame, the feature vector is concatenated with the feature vectors of the 5 previous and the 5 following frames, as a result of which the final vector has dimension 11 × 39 = 429. The total number of input feature vectors in the training set is 1068816. The validation set has 56005 frames in total, and the test set has 57919 frames.
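The splicing of ±5 context frames can be expressed compactly. In the NumPy sketch below, frames is the (T, 39) matrix of MFCC+Δ+ΔΔ vectors for one utterance; padding the utterance edges by repetition is one common convention and an assumption on our side.

import numpy as np

def splice(frames, context=5):
    """Concatenate each frame with its `context` left and right neighbours."""
    T, d = frames.shape                                   # e.g. d = 39 for MFCC + deltas
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    out = np.hstack([padded[i:i + T] for i in range(2 * context + 1)])
    return out                                            # shape (T, (2*context+1)*d) = (T, 429)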

These counts are almost in line with the experiments of [30, 22, 32]. For training, we used the standard phone set of 48 phonemes, collapsed into 39 phonemes for evaluation as in [28]. This means that the output layer has 48 nodes, but at evaluation time the 48 phonemes are mapped down to 39. For example, if a frame labelled "sh" is classified as "zh", or vice versa, it is counted as a correct classification, since both map to the same evaluation phoneme. We follow this procedure because it is the standard way of experimenting with TIMIT in speech recognition, as also reported by [30, 32]. As with MNIST, the training set is divided into a labeled portion and an unlabeled portion. Speaker-dependent mean and variance normalization was also applied.
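Evaluation with the collapsed phone set then reduces to mapping both the predicted and the reference 48-class labels through the table in the Appendix before counting correct frames. A short sketch (map_48_to_39 is assumed to be the dictionary form of Table 7.1):

def frame_accuracy(pred_48, ref_48, map_48_to_39):
    """Frame-level accuracy after collapsing 48 training phones to 39 evaluation phones."""
    correct = sum(map_48_to_39[p] == map_48_to_39[r] for p, r in zip(pred_48, ref_48))
    return correct / float(len(ref_48))

# e.g. map_48_to_39["zh"] == "sh", so confusing "zh" with "sh" is not counted as an error.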


Figure 5.2: Spectrogram of an utterance from the TIMIT dataset (frequency bands versus frames).

5.2 Experimental Setup

There were two sets of experiments, one for each of the two datasets described above. For each dataset, the training was performed several times, varying the percentage of labelled samples in the training set. For MNIST, the percentage of labelled points is varied from 0.2% to 6%, as shown in Table 5.1, while for TIMIT it ranges from 1% to 30%, as shown in Table 5.2. Other researchers who worked on this problem with different models use a similar variation [20, 32, 48]. Note that the range for MNIST is smaller than that for TIMIT, as MNIST is a smaller dataset with fewer targets (10) than TIMIT (48); a model can give high accuracy even when it has seen just a small fraction of the labels. For every training session, the values of the hyperparameters were tuned by minimising the classification error on the validation set. The standard values have already been given in Table 4.1, but some of these values vary with the dataset and with the percentage of labelled samples in the training set. All the models obtained from these training sessions were tested on the respective test sets.

Results on MNIST
Labelled Obs.        Neural Network               SSSAE                      α
  %        #         valid. acc.   test acc.      valid. acc.   test acc.
 0.2      100        75.8          74.04          77.6          76.2          50
 1        500        87.01         86.39          89.21         88.1          300
 1.2      600        87.56         86.64          89.53         89.35         300
 2       1000        89.47         89.14          91.43         91.07         600
 6       3000        92.61         92.41          94.42         94.03        1400

Table 5.1: Results on the validation set and test set for a neural network and our model on MNIST, with different percentages of labeled data in training. The total number of datapoints is 50000. The value of α optimised on the validation set is given in the last column.

5.3 Practical Setup

We used Kaldi [36] and PDNN [33] for feature extraction, and Theano [23] for symbolic algebra and computation on GPUs. Kaldi is speech recognition software written in C++, which comes with several pre-built functions and recipes for splicing waveforms and for extracting useful features, such as MFCCs and filter-bank features, from the raw waveform. The experiments were run on a Titan X card donated by NVIDIA Corporation, installed in an Ubuntu 14.04 based machine.

5.4 Results

Table 5.1 and Figure 5.3 show the performance of our model and compare it to that of a standard two-layer neural network (NN) on MNIST. We can observe that our model performs better than a standard neural network for all the tested percentages of labeled data. The results for the neural network match those reported in the literature, as in [20, 48].

Note that the number of hidden units is also a hyperparameter in our experiments, and we determined its value through cross-validation, similarly to [6]. The other hyperparameters of the neural network, namely the learning rate, the momentum and the size of the hidden layer, were optimized for the experiments with the NN. In addition to all the hyperparameters of a neural network already mentioned, we have an additional parameter, α, which controls the contribution of the reconstruction error and the misclassification error to the final objective function. The values of the hyperparameters which give the best results on the validation set are reported in the tables as well.

Table 5.2 and Figure 5.4 show the frame-level classification accuracy for a neural network and a single-layer semi-supervised sparse autoencoder for varying percentages of labeled data. As for MNIST, the hyperparameters are tuned using the validation set. The neural network contains 2000 units in the hidden layer, as in [30, 22]. We observed that the results we obtain match the results reported in the literature mentioned. We used an adaptive learning rate scheme with linear decay, in which the learning rate decays linearly after a certain number of epochs.


Figure 5.3: Accuracy on the validation set and test set on MNIST for a varying number of labels. The total number of training points is 50000. For the semi-supervised sparse autoencoder, the points which do not have any labels are also used in training.

The tables show both the validation set accuracy and the test set accuracy. In Table 5.3, we compare the frame-level phonetic classification performance of our system to a simple neural network and to the graph-based semi-supervised learning models published in [30], using just 10% labeled data, and observe that our system performs better than all the techniques mentioned except the pMP algorithm.

Figure 5.5 shows the values of the hyperparameter α optimised on the validation set for the experiments on MNIST (5.5a) and TIMIT (5.5b), plotted against the percentage of labeled data. We observe that α increases with the percentage of labeled data.

Figure 5.6 shows how the reconstruction error and the misclassification error decrease over the training epochs with the mini-batch SGD (Stochastic Gradient Descent) optimizer. This experiment was done using 10% labeled data on TIMIT. We can observe from the plots that saturation starts at around 50-60 epochs, and that the reconstruction error saturates earlier, i.e. it decreases quickly at first and then flattens out sooner than the misclassification error. We also observe that the best results are obtained when both errors decrease at almost the same rate. If the two costs are not properly balanced by the hyperparameter α, one cost may fall at a much faster rate than the other, and it may happen that one cost actually starts increasing while the other is still going down. We observed that such instances may give an overall lower classification accuracy.

Results on TIMIT
Labelled Obs.        Neural Network               SSSAE                      α
  %        #         valid. acc.   test acc.      valid. acc.   test acc.
  1      10688       57.46         57.93          59.65         59.84         100
  3      32065       61.71         61.31          64.12         64.20         150
  5      53441       63.20         63.30          65.44         65.71         150
 10     106881       65.78         65.82          66.96         67.03         400
 20     213763       68.02         67.80          69.31         69.18         600
 30     320644       69.08         68.83          69.80         69.65         900

Table 5.2: Results on frame classification for the validation set and test set for a neural network and our model on TIMIT, with different percentages of labeled data in training. The total number of frames is 1068818. The value of α optimized on the validation set is given in the last column.


Figure 5.4: Frame-level phoneme prediction accuracy on the validation set and test set on TIMIT for a varying amount of labels. The total number of frames in the training set is 1068818. For the semi-supervised sparse autoencoder, the points which do not have any labels are also used in training.


Figure 5.5: The tuned values of α plotted against the amount of labeled data: (a) MNIST, (b) TIMIT.


Figure 5.6: The figure contains two plots: the plot at the top shows the reconstruction cost against the number of epochs, and the plot at the bottom shows how the misclassification cross-entropy loss changes with the number of epochs during training of the network. The total cost is the weighted sum of the two costs shown here.

                   Amount of Training Data
System               10%            30%
NN                   65.94          69.24
LP                   65.47          69.24
MP                   65.48          69.24
MAD                  66.53          70.25
pMP                  67.22          71.06
SSSAE                67.03          69.65

Table 5.3: Accuracy rates (%) for frame-based phoneme classification on TIMIT for the baseline (MLP), the four different GBL SSL algorithms, and our model, the Semi-Supervised Sparse Autoencoder (SSSAE).

Chapter 6

Discussion, Conclusion and Future Work

This report concludes with a discussion of the performance of the algorithm and of the evaluation methods proposed.

6.1 Hypotheses discussed

6.1.1 H.1: Do Semi-Supervised Sparse Autoencoders perform better than neural networks on phone classification?
The results from our experiments validate the ability of autoencoders to use unlabeled data to find the manifold of the data distribution, which can effectively boost the classification accuracy on labeled data in the speech domain, for the task of frame-level phoneme classification.

6.1.2 H.2: Does the above result generalize to other domains?
We have observed a similar improvement in the classification accuracy of written digits on the MNIST dataset. This suggests that the improvements obtained by this model may extend to a variety of tasks. The improvement is present when the percentage of labeled data is varied from 0.2% to 6%.

6.1.3 H.3: Do Semi-Supervised Sparse Autoencoders perform better than GBL SSL methods on phoneme classification?
In this case, the results we obtain are similar to those found in the literature under similar conditions. We can conclude that our method obtains comparable results with smaller computational requirements. In order to quantify the difference, however, we should perform a detailed complexity analysis of both approaches.

We note that autoencoders are easy to model and conceptually easy to explain. We also observed that we did not need any particular initialisation or pretraining of the network, as is sometimes required when training autoencoders. This is an improvement over a standard autoencoder that could be explored further in future work.


6.2 Evaluation Method

The evaluation method used for the purpose of this thesis was classification accuracy. We performed experiments on two different datasets. For MNIST, the criterion was the classification accuracy on the test set. For TIMIT, the criterion was the classification accuracy of phonemes over all the individual speech frames in the test set.

Regarding the choice of the percentages of labeled data at which the experiments were performed, we can argue that in practice there is usually an abundance of unlabeled data, so it is most important to investigate our models when the percentage of labeled data is very low. Moreover, as the percentage of labeled datapoints increases, the accuracy starts to saturate, and any subsequent increase in the percentage of labeled examples does not bring as much improvement in accuracy as was observed at the lower percentages. This seems reasonable: after a certain amount of labeled data, the model knows the input distribution almost completely, and only small improvements are observed from then on. The size of the labeled portion in our experiments is varied from 1% to 30% of the total number of training frames.

Finally, it is also important to note that most standard and industry-grade speech processing systems need to handle units such as phones and words that are inherently of variable length. In other words, an improvement in the accuracy of frame-level phoneme classification is a good indicator, but it cannot be said with certainty that the overall speech recognition accuracy, measured in terms of word recognition, will also increase. Still, most researchers in the community focus on frame-level phoneme accuracy as a first goal, before moving on to the harder challenge of word recognition and the recognition of variable-length "large" speech segments such as sentences.

6.3 Effect of α in the model

It can be observed from Figure 5.5 that the value of α increases as the proportion of labeled data in the training set increases. The factor α plays an important role in balancing the two different kinds of cost involved. The two costs are complementary and serve different purposes: the classification cost is discriminative in nature, but it cannot be the only objective, as that would lead to severe overfitting; the reconstruction cost acts as a regulariser and makes the network capture the structure or distribution of the data. One possible explanation for the monotonic increase is that, with more labeled data, the network has already seen enough unlabeled data to decipher its structure, so it can concentrate on the classification cost to obtain better classification accuracy on the validation set. We used cross-validation to tune the value of α for each experiment. Generally, the plot of the validation error against α is a U-shaped curve: the accuracy increases, reaches its maximum at a certain value of α, and then starts dropping again.

6.4 Future Work

There are many exciting ideas and research directions in which this work could be extended. Firstly, we could use the filter-bank features described in Section 2.2 on feature processing from the raw waveform. Though MFCCs and their deltas have been effective as features for phoneme recognition, they are highly compressed; using filter-bank features instead might give higher accuracy. One objective of using MFCCs was to compare with

the results of [30], who also used MFCCs for graph-based semi-supervised learning. As discussed above, a logical next step would be to apply the algorithm to a segment classification task; that way, it could be said with more certainty that the algorithm is suitable for speech processing and recognition applications. One more research direction could be to use even more advanced and powerful autoencoders, such as variational autoencoders or adversarial autoencoders, on speech data in a semi-supervised setting.

6.5 Society and Ethics

Speech recognition has in the past decade reached the end user in the form of hearing aids, and through the popularisation of dialogue systems such as Apple's Siri, Alexa by Amazon and Cortana by Microsoft. The potential effect of research on these models is to improve quality of life, especially for people with disabilities. Speech technology-based aids are often not available for minority languages: for those languages, the limited commercial interest does not justify the investment required to collect the expensive hand-annotated speech material that is necessary to train speech recognition models. The purpose of this thesis is to improve existing techniques in speech recognition so that the need for expensive annotated recordings is reduced. This would make speech technology available to all citizens, regardless of whether they speak a popular language or a more remote one. During this thesis we used standardised data sets with speech recordings from American English speakers. The relevant ethical values were considered in these recordings, including anonymity, gender balance and dialectal balance.

Chapter 7

Appendix

Phoneme Mapping
Phoneme-48   Phoneme-39        Phoneme-48   Phoneme-39
aa           aa                iy           iy
ae           ae                jh           jh
ah           ah                k            k
ao           aa                l            l
aw           aw                m            m
ax           ah                n            n
ay           ay                ng           ng
b            b                 ow           ow
ch           ch                oy           oy
cl           sil               p            p
d            d                 r            r
dh           dh                s            s
dx           dx                sh           sh
eh           eh                sil          sil
el           l                 t            t
en           n                 th           th
epi          sil               uh           uh
er           er                uw           uw
ey           ey                v            v
f            f                 vcl          sil
g            g                 w            w
hh           hh                y            y
ih           ih                z            z
ix           ih                zh           sh

Table 7.1: Mapping between 48 phonemes used for training and 39 phonemes used for evaluation

Bibliography

[1] Ossama Abdel-Hamid, Abdel-Rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 22(10):1533–1545, October 2014.

[2] Andrei Alexandrescu and Katrin Kirchhoff. Graph-based learning for phonetic classification. In IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU 2007, Kyoto, Japan, December 9-13, 2007, pages 359–364, 2007.

[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press, Cambridge, MA, 2007.

[4] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade - Second Edition, pages 437–478. 2012.

[5] Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade - Second Edition, pages 421–436. 2012.

[6] A. Coates, H. Lee, and A.Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR Workshop and Conference Proceedings, pages 215–223. JMLR W&CP, 2011.

[7] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, Jan 2012.

[8] Steven Davis and Paul Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4):357–366, 1980.

[9] Dong Yu, Balakrishnan Varadarajan, Li Deng, and Alex Acero. Active learning and semi-supervised learning for speech recognition: A unified framework using the global entropy reduction maximization criterion. Elsevier, January 2009.

[10] William M. Fisher, George R. Doddington, and Kathleen M. Goudie-Marshall. The darpa speech recognition research database: Specifications and status. In Proceedings of DARPA Workshop on Speech Recognition, pages 93–99, 1986.

[11] M.J.F. Gales. Maximum likelihood linear transformations for hmm-based speech recognition. Computer Speech and Language, 12:75–98, 1998.


[12] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, pages 249–256, 2010.

[13] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In François Denis, editor, Actes de CAP 05, Conférence francophone sur l'apprentissage automatique - 2005, Nice, France, du 31 mai au 3 juin 2005, pages 281–296. PUG, 2005.

[14] Alex Graves, Abdel rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, pages 6645–6649. IEEE, 2013.

[15] Frantisek Grézl and Martin Karafiát. Semi-supervised bootstrapping approach for neural network feature extractor training. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, December 8-12, 2013, pages 470–475, 2013.

[16] Jui-Ting Huang and Mark Hasegawa-Johnson. Semi-supervised training of gaussian mixture models by conditional entropy minimization. In Takao Kobayashi, Keikichi Hirose, and Satoshi Nakamura, editors, INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1353–1356. ISCA, 2010.

[17] Yan Huang, Dong Yu, Yifan Gong, and Chaojun Liu. Semi-supervised GMM and DNN acoustic model training with multi-system combination and confidence recalibration. In INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, Lyon, France, August 25-29, 2013, pages 2360–2364, 2013.

[18] Thorsten Joachims. Transductive inference for text classification using support vector machines. In Ivan Bratko and Saso Dzeroski, editors, Proceedings of ICML-99, 16th International Conference on Machine Learning, pages 200–209, Bled, SL, 1999. Morgan Kaufmann Publishers, San Francisco, US.

[19] Kelley Kilanski, Jonathan Malkin, Xiao Li, Richard Wright, and Jeff A. Bilmes. The vocal joystick data collection effort and vowel corpus. In INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, September 17-21, 2006, 2006.

[20] Diederik P. Kingma, Danilo Jimenez Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. CoRR, abs/1406.5298, 2014.

[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[22] John Labiak and Karen Livescu. Nearest neighbors with learned distances for phonetic frame classification. In Interspeech, 2011.

[23] Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. 2012.

[24] Lori Lamel, Jean-Luc Gauvain, and Gilles Adda. Investigating lightly supervised acoustic model training. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2001, 7-11 May, 2001, Salt Palace Convention Center, Salt Lake City, Utah, USA, Proceedings, pages 477–480. IEEE, 2001.

[25] Lori Lamel, Jean-Luc Gauvain, and Gilles Adda. Lightly supervised and unsupervised acoustic model training. Computer Speech & Language, 16(1):115–129, 2002.

[26] T. Landauer, C. Kamm, and S. Singhal. Learning a minimally structured back propagation network to recognize speech. In 9th Annu. Conf. Cogn. Sci. Soc., pages 531–536, 1987.

[27] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/, 2010.

[28] K. F. Lee and H. W. Hon. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11):1641–1648, Nov 1989.

[29] C. J. Leggetter and P. C. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, pages 171–185, 1995.

[30] Yuzong Liu and Katrin Kirchhoff. Graph-based semi-supervised learning for phone and segment classification. In Frédéric Bimbot, Christophe Cerisara, Cécile Fougeron, Guillaume Gravier, Lori Lamel, François Pellegrino, and Pascal Perrier, editors, INTERSPEECH, pages 1840–1843. ISCA, 2013.

[31] Yuzong Liu and Katrin Kirchhoff. Graph-based semi-supervised acoustic modeling in DNN-based speech recognition. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pages 177–182, 2014.

[32] Yuzong Liu and Katrin Kirchhoff. Graph-based semisupervised learning for acoustic modeling in automatic speech recognition. IEEE/ACM Trans. Audio, Speech & Language Processing, 24(11):1946–1956, 2016.

[33] Y. Miao. Kaldi+PDNN: Building DNN-based ASR Systems with Kaldi and PDNN. ArXiv e-prints, January 2014.

[34] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Johannes Fürnkranz and Thorsten Joachims, editors, ICML, pages 807–814. Omnipress, 2010.

[35] Andrew Ng. Stanford university lecture notes on sparse autoencoders. 2011.

[36] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, et al. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011. IEEE Catalog No.: CFP11SRW-USB.

[37] Daniel Povey, Dimitri Kanevsky, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, and Karthik Visweswariah. Boosted MMI for model and feature-space discriminative training. In ICASSP, pages 4057–4060. IEEE, 2008.

[38] Abdel-rahman Mohamed, Tara N. Sainath, George E. Dahl, Bhuvana Ramabhadran, Geoffrey E. Hinton, and Michael A. Picheny. Deep belief networks using discriminative features for phone recognition. In ICASSP, pages 5060–5063. IEEE, 2011.

[39] Marc'Aurelio Ranzato and Martin Szummer. Semi-supervised learning of compact document representations with deep networks. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, ICML, volume 307 of ACM International Conference Proceeding Series, pages 792–799. ACM, 2008.

[40] Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-training of object detection models. In WACV/MOTION, pages 29–36. IEEE Computer Society, 2005.

[41] Tara Sainath, Brian Kingsbury, Abdel-Rahman Mohamed, and Bhuvana Ramabhadran. Learning filter banks within a deep neural network framework. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2014.

[42] George Saon, Hong-Kwang Jeff Kuo, Steven J. Rennie, and Michael Picheny. The IBM 2015 english conversational telephone speech recognition system. CoRR, abs/1505.05899, 2015.

[43] Frank Seide, Gang Li, Xie Chen, and Dong Yu. Feature engineering in context-dependent deep neural networks for conversational speech transcription. In 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU 2011, Waikoloa, HI, USA, December 11-15, 2011, pages 24–29, 2011.

[44] Amarnag Subramanya, Jeff Bilmes, and Yoshua Bengio. Semi-supervised learning with measure propagation. Journal of Machine Learning Research, pages 3311–3370, 2011.

[45] Partha Pratim Talukdar and Koby Crammer. New regularized algorithms for transductive learning. In Wray L. Buntine, Marko Grobelnik, Dunja Mladenic, and John Shawe-Taylor, editors, ECML/PKDD (2), volume 5782 of Lecture Notes in Computer Science, pages 442–457. Springer, 2009.

[46] Karel Veselý, Mirko Hannemann, and Lukáš Burget. Semi-supervised training of deep neural networks. In Proceedings of the IEEE Conference on Automatic Speech Recognition and Understanding (ASRU), pages 267–272, 2013.

[47] Frank Wessel and Hermann Ney. Unsupervised training of acoustic models for large vocabulary continuous speech recognition. IEEE Trans. Speech and Audio Processing, 13(1):23–31, 2005.

[48] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Grégoire Montavon, Genevieve B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade (2nd ed.), volume 7700 of Lecture Notes in Computer Science, pages 639–655. Springer, 2012.

[49] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, 2002.

[50] Xiaojin Zhu, Jaz S. Kandola, Zoubin Ghahramani, and John D. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. In NIPS, pages 1641–1648, 2004.