DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Semi-Supervised Learning with Sparse Autoencoders in Automatic Speech Recognition

AKASH KUMAR DHAKA

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Semi-Supervised Learning with Sparse Autoencoders in Automatic Speech Recognition

AKASH KUMAR DHAKA

Master in
Date: November 2016
Supervisor: Giampiero Salvi
Examiner: Danica Kragic
Swedish title: Semi-övervakad inlärning med glesa autoencoders i automatisk taligenkänning
School of Computer Science and Communication

Abstract

This work is aimed at exploring semi-supervised learning techniques to improve the performance of Automatic Speech Recognition systems. Semi-supervised learning takes advantage of unlabeled data in order to improve the quality of the representations extracted from the data. The proposed model is a neural network where the weights are updated by minimizing the weighted sum of a supervised and an unsupervised cost function, simultaneously. These costs are evaluated on the labeled and unlabeled portions of the data set, respectively. The combined cost is optimized through mini-batch stochastic gradient descent via standard backpropagation. The model was tested on a phone classification task on the TIMIT American English data set and on a handwritten digit classification task on the MNIST data set. Our results show that the model outperforms a network trained with standard backpropagation on the labelled material alone. The results are also in line with state-of-the-art graph-based semi-supervised training methods.

Sammanfattning

Detta arbete syftar till att utforska halvövervakade inlärningstekniker (eng: semi-supervised learning techniques) för att förbättra prestandan hos automatiska taligenkänningssystem. Halvövervakad maskininlärning använder sig av data ej märkt med klasstillhörighetsinformation för att förbättra kvaliteten hos den från datan extraherade representationen. Modellen som beskrivs i arbetet är ett neuralt nätverk där vikterna uppdateras genom att samtidigt minimera den viktade summan av en övervakad och en oövervakad kostnadsfunktion. Dessa kostnadsfunktioner evalueras på den märkta respektive den omärkta datamängden. De kombinerade kostnadsfunktionerna optimeras genom gradient descent med hjälp av traditionell backpropagation. Modellen har evaluerats genom en fonklassificeringsuppgift på datamängden TIMIT American English, samt en sifferklassificeringsuppgift på datamängden MNIST. Resultaten visar att modellen presterar bättre än ett nätverk tränat med backpropagation på endast märkt data. Resultaten är även konkurrenskraftiga med rådande state of the art, grafbaserade halvövervakade inlärningsmetoder.

Contents

1 Introduction
  1.1 The Speech Recognition Problem
  1.2 Motivation For the Thesis
  1.3 Research Questions
  1.4 Assumptions
  1.5 Report Structure

2 Relevant Theory
  2.1 Automatic Speech Recognition and Phone Classification
  2.2 Feature Extraction
  2.3 Acoustic Modelling
  2.4 MLPs and Deep Neural Networks
  2.5 Autoencoders
    2.5.1 Manifold Learning with Autoencoders
    2.5.2 Sparse Autoencoders
    2.5.3 Performance of Sparse Autoencoders
    2.5.4 Applications of Autoencoders
  2.6 Semi-Supervised Learning
  2.7 Assumptions

3 Related Work
  3.1 Deep Neural Networks in ASR
    3.1.1 Deep Belief Networks
    3.1.2 Recurrent Neural Networks
  3.2 Examples of Semi-Supervised Learning Methods
    3.2.1 Heuristic based SSL/Self-Training
    3.2.2 Transductive SVMs
    3.2.3 Entropy Based Semi Supervised Learning
    3.2.4 Graph based SSL
    3.2.5 Semi-Supervised Learning with generative models
  3.3 Autoencoder Based Semi-Supervised Learning

4 Method
  4.1 The Model
  4.2 Evaluation
  4.3 Monitoring and Debugging
    4.3.1 Design Choices/Tuning Hyperparameters
    4.3.2
    4.3.3 Batch Size
    4.3.4 Weight Initialization
    4.3.5 Number of Hidden Units
    4.3.6 Momentum
    4.3.7
    4.3.8 Training Epochs
    4.3.9 Additive Noise
    4.3.10 Alpha
    4.3.11 Gradient Checking

5 Experiment Setup and Results
  5.1 Data
    5.1.1 MNIST
    5.1.2 TIMIT
  5.2 Experimental Setup
  5.3 Practical Setup
  5.4 Results

6 Discussion, Conclusion and Future Work
  6.1 Hypotheses discussed
    6.1.1 H.1 Do Semi-Supervised Sparse Autoencoders perform better than neural networks on phone classification?
    6.1.2 H.2 Does the above result generalize to other domains?
    6.1.3 H.3: Do Semi-Supervised Sparse Autoencoders perform better than GBL SSL methods on phoneme classification?
  6.2 Evaluation Method
  6.3 Effect of α in the model
  6.4 Future Work
  6.5 Society and Ethics

7 Appendix

Bibliography

Chapter 1

Introduction

With the invention of computers, the question of whether machines could be made to understand human speech emerged. In more recent years, speech technology has started to change the way we live by becoming an important tool for communication and interaction with devices. The recent improvements in Spoken Language systems have greatly improved Human-Machine Communication. Personal Digital Assistant (PDA) systems are an example of an intelligent dialogue management system. They have become very popular with the recent launch of products like Apple Siri, Amazon Alexa and Google Allo. Besides human-machine interaction, speech technology has also been applied in assisted human-human communication. There could be several barriers even when humans communicate with each other. One of the most prominent of those barriers occurs if the two speakers do not speak a common language. In the past, and to a great extent in present days, this was solved by means of a human interpreter. Speech-to-speech translation systems are, however, reaching sufficient quality to be of help, for example, for travellers. These systems accept spoken input in one language and output a spoken translation of the input in a target language. In all the above examples, a key component is Automatic Speech Recognition (ASR). This system has the task of translating spoken utterances into a textual transcription that can be more easily handled in the rest of the system. In dialogue systems, this textual representation is fed to a language understanding module that extracts the semantic information to be handled by a dialogue manager. The dialogue manager, in turn, can decide to formulate a spoken response by means of a language generation system and a speech synthesis system. In speech-to-speech translation, instead, the output of the ASR is fed to an automatic translation system, and the translation is then converted to speech by means of speech synthesis.

1.1 The Speech Recognition Problem

Although humans recognize speech in their mother tongue effortlessly, the problem of automatic speech recognition presents a number of challenges. A source of complexity is the large variation in speech due to region, accent, age, gender, emotions, and the physical and mental well-being of the speaker. Another complication, compared to many classification tasks, is that speech is a continuous stream of units hierarchically combined into speech sounds, syllables, words, phrases and utterances. A speech recognizer must therefore be able to handle sequences of patterns.


The way the speech recognition problem is approached is by means of statistical methods that can incorporate the variation and model sequences of acoustic events in a robust way. The design of these models makes extensive use of domain knowledge coming from the fields of linguistics and phonetics, and incorporates this knowledge into a machine learning framework that can learn the variability of the spoken units from large collections of recordings of spoken utterances. The building blocks of the statistical models are short segments of speech that can be considered to be stationary. These segments (or the corresponding models) are then combined to form phonemic segments. A phoneme is the smallest linguistic unit that can distinguish between two words. Phonemic models are then combined into words and phrases by using lexical and grammatical rules. Although each language uses a specific set of phonemes, there is a large overlap between languages, because the number of sounds that we can produce is constrained by the physics of our speech organs. An example of the phoneme classification for American English is reported in Appendix 7.1.

1.2 Motivation For the Thesis

In order to learn the associations between the constituent speech units and the corresponding sounds, most speech recognition methods require carefully annotated speech recordings. The increasing interest in speech based applications has produced large amounts of such data for many languages with a sufficiently broad consumer basis. However, these linguistic resources are extremely expensive to produce because they require intense expert labour (phonetic transcriptions are usually created by phoneticians). A consequence of this is that most speech corpora are not publicly available, and even researchers must pay royalties in order to use them. Another consequence is that speech technology, and in particular speech recognition, does not easily reach speakers of languages spoken by minorities. This work will specifically target improvements in ASR in a semi-supervised setting, therefore reducing the need for annotated material. The existing methods for semi-supervised learning in ASR are based on graph based learning or self-training using neural networks. Graph based learning is computationally very intensive, while self-training is based on heuristics and prone to error due to wrong predictions. Recently, learning through neural networks has been found to scale to industrial levels and has given state-of-the-art results in automatic speech recognition. Our model is a modification of a single layer network which can be trained simultaneously using unlabeled and labeled data. The method, therefore, incorporates concepts of semi-supervised learning, while retaining all the advantages of a neural network. A more robust and less resource intensive ASR system will make an immense contribution to better connectivity, better aid systems for the disabled, and better systems for low-resource languages.

1.3 Research Questions

The objective of this thesis is to investigate semi-supervised learning using sparse autoencoders and whether they can be used to improve phoneme recognition over a standard neural network when the labeled dataset is very limited. We can break down this statement into three separate hypotheses. The first hypothesis is that the model we propose here can perform better at phoneme classification than a neural network trained purely discriminatively when we vary the amount of labelled data. The second hypothesis is that the proposed semi-supervised method is not domain specific and can produce better results in other machine learning tasks. For this purpose we focus on image classification, where the dataset comprises images of handwritten digits. Finally, the third hypothesis is that the model we propose here can perform better at phoneme classification than different graph based semi-supervised learning algorithms under a varying percentage of labeled data from a dataset.

1.4 Assumptions

There is one common assumption behind all semi-supervised models, which also becomes an assumption for this work: the distribution of the data, which the unlabeled data will help us to unravel, should be relevant for the classification problem. To state it formally, the information about the distribution p(x) which can be obtained from unlabeled data should also carry the information required for the inference of y expressed through p(y|x). If this is not the case, then semi-supervised learning will not work.

1.5 Report Structure

Chapter 2 introduces the theoretical aspects that are relevant to this thesis. These include aspects of automatic speech recognition and artificial neural networks. Section 2.2 describes how to make fixed length feature vectors out of the raw speech waveform, which makes the task of classification and recognition easier for computers. Section 2.4 gives a background starting with the basic principles of a Multi Layer Perceptron (MLP). This is then followed by a brief theoretical background on the more recent "deep" neural networks having multiple layers. We will also talk about several different kinds of deep neural networks, like RBMs and RNNs, which we do not work with but which have been extensively used in Speech Recognition. This is followed by Chapter 3, which describes recent advances in semi-supervised learning (SSL) with particular focus on Graph Based Learning (GBL) methods and algorithms that currently provide state-of-the-art results in SSL. The particular SSL algorithm used in this thesis is explained in Chapter 4. Chapter 5 reports details on the experimental setup and results. The report is concluded by a discussion of the results, their implications and possibilities for future work in Chapter 6. An appendix giving the mapping from the standard 48 phonemes in English to the 39-phoneme set used in these experiments, and by the community in general, is also given.

Chapter 2

Relevant Theory

This chapter gives a brief insight into the basic concepts of ASR and neural networks. This includes a section on transforming the raw speech into features that are suitable for phone classification. We give a brief overview of GMM-HMM models, which have been state-of-the-art in speech recognition for many years. Finally, in Section 2.4 we introduce MLPs and Deep Neural Networks (DNNs) that in recent years have outperformed GMM-HMM models.

2.1 Automatic Speech Recognition and Phone Classification

Figure 2.1 illustrates the processes involved in a typical speech recognizer. The different parts use a combination of signal processing and machine learning methods to transcribe a spoken utterance into a sequence of words. Because the problem is intrinsically affected by uncertainty, a probabilistic framework is used. First, the raw speech waveform is converted into a sequence $X = x_1, x_2, ..., x_T$ of feature vectors spaced at regular time intervals. This process is called feature extraction and is based on knowledge of speech production and perception. The goal of feature extraction is to convert speech into a representation that is suitable for the classification problem. More information about feature extraction methods is given in Section 2.2. Given the sequence of observations X, the objective is to predict the most likely word sequence $\hat{W} = w_1, w_2, ..., w_m$. In probabilistic terms this can be written as follows:

$$\hat{W} = \arg\max_W p(W|X), \qquad (2.1)$$
where $p(W|X)$ is the posterior of the sequence of words W given the observation sequence X. According to Bayes' rule, the above expression can be written as:
$$\hat{W} = \arg\max_W \frac{p(X|W)\,p(W)}{p(X)} \qquad (2.2)$$
$$= \arg\max_W p(X|W)\,p(W) \qquad (2.3)$$
The term $p(W)$ in the equation is our prior knowledge about which word sequences are likely to occur in a language and in a specific task. This is called a language model. The term $p(X|W)$ in Eq. 2.2 is the likelihood of a sequence X of acoustic features given a particular sequence of words denoted by W. This is computed by acoustic and lexical models. The lexical models describe the words' pronunciations in terms of sequences of phonemes, and the acoustic models describe the likelihood of acoustic features given a certain phoneme.


[Figure 2.1: block diagram of an ASR system — Raw Speech → Feature Extraction → Acoustic Features (MFCC, LPC, PLP, Filterbank) → Acoustic Models → Acoustic Likelihoods, combined with the Lexicon and the Language Model in the Decoder.]

Figure 2.1: The illustration shows all the components in an ASR system. The dotted line shows a particular instance; for example, MFCC is a particular instance of an acoustic feature, and likewise the network we will propose will be used for acoustic modelling.

The decoder is a search algorithm that uses the information in the acoustic, lexical and language models to perform the maximization in Eq. 2.2. The acoustic models are the main focus in this thesis. They encode knowledge about acoustics, phonetics, microphone and environment variability, and differences due to gender, accent, dialect and age of the speaker. In order to test the effects of the acoustic models alone on the speech recognition task, a slightly simplified task is considered: phone classification. In this case, instead of the optimization in Eq. 2.1, we classify each feature frame $x_n$ into one of K possible phonemic classes. The assessment of this classification task is only reliable if the speech data is annotated at the phonetic level (as is the case for the TIMIT data set used in this study). Acoustic models that perform better phone classification are more likely to perform better speech recognition as well. Although phone classification is only the first step in estimating a method's performance, this evaluation is an accepted practice when new speech recognition methods are introduced. In the following sections, we will only describe feature extraction and acoustic models, because the phone classification task considered in this thesis does not require lexical and language models. We will focus in particular on Multi Layer Perceptrons and Deep Neural Networks, which have been successfully used in recent years as acoustic models for speech.

2.2 Feature Extraction

A speech waveform can be represented as a sequence of samples at a rate that typically varies between 8 and 20 kHz depending on the quality of the recording. The samples are highly correlated and contain variations that are not easily associated with phonetic classes without any preprocessing. The goal of feature extraction is to provide a representation of the speech signal that is more suitable for classification. Several methods for feature extraction have been proposed in the past. The most commonly used are based on a short time spectral representation of the signal, as, for example, Mel Frequency Cepstral Coefficients (MFCCs) and Perceptual Linear Predictive coefficients (PLPs). In order to capture the time evolution of those feature vectors, first-order and second-order temporal differences are often appended to the original features. MFCCs are calculated according to the procedure shown in Algorithm 1:

Algorithm 1 Procedure to calculate MFCCs
1: Divide the signal into short, possibly overlapping frames.
2: For each frame, calculate the short time Fourier transform.
3: Apply the (logarithmic) mel filterbank to the power spectra, and take the sum of the energy in each filter.
4: Calculate the logarithm of all filterbank energies.
5: Apply the Discrete Cosine Transform (DCT) to the log filterbank energies.
6: Keep DCT coefficients 1-12 and discard the rest; the energy coefficient is optional.
7: Take the ∆ and ∆∆ of the coefficients w.r.t. preceding frames and append them to the original 13 coefficients.
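As a concrete illustration of Algorithm 1, the sketch below computes 39-dimensional MFCC + ∆ + ∆∆ features with the librosa library. The toolkit, the file name and the frame settings are assumptions made for illustration; the thesis does not state which implementation was used for feature extraction.

```python
import librosa
import numpy as np

# Load a waveform; 16 kHz is the TIMIT sampling rate ("utterance.wav" is a placeholder path).
y, sr = librosa.load("utterance.wav", sr=16000)

# Steps 1-6 of Algorithm 1: framing, STFT, mel filterbank, log, DCT,
# keeping the first 13 coefficients (including the energy/0th coefficient).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms frames, 10 ms shift

# Step 7: first- and second-order temporal differences (delta and delta-delta).
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

# Final 39-dimensional feature vector per frame.
features = np.vstack([mfcc, delta, delta2])
print(features.shape)  # (39, number_of_frames)
```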

The first step is motivated by the assumption that, though time-varying, the speech signal is stationary for short time intervals. The length of those intervals is typically between 10 and 20 msec. The frames are usually overlapping in time so that there is a smoother transition in the information captured by two consecutive frames. The following steps are motivated by perceptual phenomena. The cochlea is known to perform a frequency analysis of the signal, and humans are known to have logarithmic resolution both in frequency and in loudness. The Mel filterbank is a set of triangular filters, designed in such a way that the filters are logarithmically spaced in frequency.

The application of the Discrete Cosine Transform (DCT) is motivated by modelling constraints. When the feature vectors are modelled by Gaussian Mixture Models (GMMs), it is desirable to work with uncorrelated features. This allows us to greatly simplify the models by using diagonal covariance matrices. This requirement has become less stringent with the advent of acoustic models based on neural networks (NNs), because these models can easily cope with feature correlations. In fact, in NN acoustic modelling it is common to work with filterbank features directly, that is, to skip steps 5 and 6 in Algorithm 1 [8, 41]. The reason for truncating the MFCC vector to only 12 components is to limit the representation to a coarse description of the speech spectra, because the details are related to the frequency of vibration of the vocal folds and, to a first approximation, are a disturbing factor in phone classification.

This approach of knowledge driven pre-processing of the waveform has proved to be successful in discarding information that is irrelevant for discrimination. However, it is always possible that relevant information is lost in the process. In some very recent studies [1], Convolutional Neural Networks have been applied to the speech samples directly, eliminating the need for feature extraction. Although more difficult to train, these models can potentially make use of all the information contained in the signal. A similar trend can be observed in Computer Vision as well [21].

2.3 Acoustic Modelling

Hidden Markov Models (HMMs) are the most popular statistical models in Speech Recognition. A first-order Markov chain is a state-space model, where the probability distribution of the current state depends only on the previous state. A hidden Markov model is more complex in that the state is not directly observable. We observe an output that, given the current state, is conditionally independent from previous outputs and states. The states follow a first-order Markov chain. In speech recognition, the sequences of observations correspond to the feature vectors described in the previous section. The states roughly correspond to phonetic units. The model defines two types of probability distributions: transition probabilities, which describe what state is more likely to occur in the next step given the current state, and emission probabilities, which describe the likelihood of a specific observation given the current state. One important inference problem is to calculate the posterior probability of the phoneme states, given a sequence of acoustic observations.

The emission probabilities, for continuous feature vectors (e.g. MFCCs), are in ASR usually modelled by Gaussian Mixture Models. The combined model is called GMM-HMM. GMMs have been the state-of-the-art emission probability models in ASR for many years, mainly due to their flexibility. Closed form adaptation techniques such as Maximum Likelihood Linear Regression (MLLR) [29] and feature-based MLLR (fMLLR) [11] made it possible to quickly adapt those models to new speakers or environmental situations. This is a key feature for methods that need to be used in real-life conditions.
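To make the HMM notation concrete, the following toy example (an illustration, not code from the thesis) defines transition and emission probabilities for a small discrete HMM and evaluates the likelihood of an observation sequence with the forward algorithm. In an acoustic model, the discrete emission table would be replaced by GMMs or, in hybrid systems, by scaled neural-network posteriors.

```python
import numpy as np

# Toy HMM with 3 hidden states and 4 discrete observation symbols.
A = np.array([[0.7, 0.2, 0.1],    # A[i, j] = P(state j at t+1 | state i at t)
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
B = np.array([[0.5, 0.3, 0.1, 0.1],   # B[i, k] = P(observation k | state i)
              [0.1, 0.1, 0.6, 0.2],
              [0.2, 0.2, 0.2, 0.4]])
pi = np.array([0.6, 0.3, 0.1])        # initial state distribution

def forward_likelihood(obs):
    """Return p(obs) by summing over all state sequences (forward algorithm)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward_likelihood([0, 2, 2, 3]))
```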

Figure 2.2: Illustration of an acoustic model based on neural networks. The input window comprises 11 frames that are stacked and input to a neural network. The output of each node i in the topmost layer of the neural network is interpreted as the posterior probability of a certain phoneme given the observations: $p(ph_i|X)$.

Alternatives to GMM-HMMs based on neural networks have been studied in the past and have become more and more popular in recent years. These discriminative models can outperform generative models like GMMs, at the expense of flexibility. Most often, the output activations of the neural network are interpreted as probabilities and used as estimators for the emission probabilities in an HMM model similar to the one described above. These combinations are usually referred to as hybrid ANN-HMM systems [38, 41, 26]. Several research groups have reported that acoustic models based on Deep Neural Networks outperform GMM-based systems even on large vocabulary continuous speech recognition (LVCSR) tasks [7, 42]. Figure 2.2 depicts an acoustic model which takes the MFCC features of the frames of speech as input and gives the value of the posterior for phoneme i given the acoustic observation, $P(ph_i|X)$. In phone classification, the maximum a posteriori class is chosen for each time step independently from the previous classifications. If we are performing automatic speech recognition, instead, the posteriors are turned into likelihoods $P(X|ph_i)$ by means of Bayes' rule. Those likelihoods are then used in the complete system described in Figure 2.1. There have also been some recent attempts at removing the HMM model and letting the neural networks model all aspects of speech, including the lexical and language models [14]. The remainder of this chapter will introduce a number of ANN models that are relevant to this study in some detail.

2.4 MLPs and Deep Neural Networks

As our model is an autoencoder, a kind of neural network, we will first describe a simple MLP and then autoencoders more specifically. A Multi Layer Perceptron (MLP), also called a feedforward neural network, is a series of models stacked on top of each other. In each layer, the inputs are linearly combined and the combination is passed through a non-linear function. A basic MLP is made up of three layers: the first layer is the input layer. The number of nodes in the input layer is equal to the dimension of the input. The second layer is called the "hidden layer", because it is not observed directly. The output layer is used to perform classification or regression. The dimension of the output is equal to the number of classes for classification or to the dimensionality of the output signal for regression. There may be more than one hidden layer in an MLP. Models with several hidden layers are often called Deep Neural Networks (DNNs). In a fully connected network, all the nodes in one layer are connected to all the nodes of the next layer. A connection between a node in one layer and a node in the next layer is assigned a weight, $w_{ij}$. The activation of a node from the second to the last layer is a non-linear function applied to the weighted sum of all nodes from the previous layer. This non-linear function is also called an activation function. The expression in both scalar form and matrix form is given in 2.4 and 2.5:
$$y_j = f\Big(\sum_i w_{ij} x_i + b_j\Big) \qquad (2.4)$$
$$y = f(W^T x + b), \qquad (2.5)$$
where x is the activation of layer n, or the input vector for the input layer, y is the output of layer n + 1, b is the bias value of the hidden layer and W is the weight matrix between the layers. The activation functions can vary depending on the position of the node in the network. For hidden layers, commonly used activation functions are, for example:

1. Sigmoid function:
$$y_j = \frac{1}{1 + e^{-z_j}} \qquad (2.6)$$

2. Hyperbolic tangent function:
$$y_j = \tanh(z_j) \qquad (2.7)$$

3. Rectified Linear Unit (ReLU) function [34]:
$$y_j = \max(z_j, 0) \qquad (2.8)$$

where $z_j = \sum_i w_{ij} x_i + b_j$ is the linear combination of the activations of all nodes connected to node j. For the output layer, the activation function depends on the task. Linear activation is common for regression, whereas for classification it is common to use softmax activations, given by

$$y_j = \frac{\exp(z_j)}{\sum_k \exp(z_k)} \qquad (2.9)$$

Because the activations sum to 1, they can be interpreted as posterior probabilities of the different classes given the input to the network. The maximum a posteriori classifier is then implemented by selecting the class that corresponds to the maximum activation:

$$o = \arg\max_j y_j \qquad (2.10)$$
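Equations 2.4-2.10 amount to a few matrix operations. The following numpy sketch (layer sizes and random weights are arbitrary, for illustration only) computes the forward pass of a one-hidden-layer MLP and the maximum a posteriori class:

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=40)                               # input vector (e.g. one feature frame)
W1, b1 = rng.normal(size=(40, 100)), np.zeros(100)    # input -> hidden
W2, b2 = rng.normal(size=(100, 10)), np.zeros(10)     # hidden -> output

h = np.tanh(W1.T @ x + b1)        # Eq. 2.5 with a tanh activation
y = softmax(W2.T @ h + b2)        # Eq. 2.9: class posteriors
o = np.argmax(y)                  # Eq. 2.10: maximum a posteriori class
```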

DNNs are generally trained discriminatively using the backpropagation algorithm to minimise a cost function. Backpropagation is an algorithm to train a multi-layered network so that each layer in the architecture can learn a mapping from input to output that is optimal according to an optimality criterion. The backpropagation learning algorithm requires an input and a target. First, the weights in the network are initialized to random values. In the forward pass, an observation is input to the network and activations are generated for each node in each layer based on the current values of the weights. This allows us to measure the difference between the output of the network y and the desired (target) output t. This measurement can be a simple squared error on one single observation
$$E_E = \frac{1}{2}(t - y)^2 \qquad (2.11)$$
or a cross entropy loss measure computed over several observations:

$$E_C = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{K}\Big[t^i_j \log y^i_j + (1 - t^i_j)\log(1 - y^i_j)\Big], \qquad (2.12)$$
where the summation over j corresponds to the nodes in the output layer, and the sum over i is an average over a number of observations. In the backward pass, we calculate how the output error depends on each weight in the network, and update the weights in order to minimize the error. This is achieved by factorizing the partial derivatives with the help of the chain rule. This way, we can propagate the deltas back from the output layer to the input. If we evaluate the dependency of the error on a specific weight $w_{ij}$ as $\partial E / \partial w_{ij}$ and we start from the value of the weight $w_{ij}(t)$ at iteration t, the new value of the weight at iteration t + 1 is calculated by gradient descent as:

$$w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t+1) \qquad (2.13)$$
$$= w_{ij}(t) - \eta\, \frac{\partial E}{\partial w_{ij}}, \qquad (2.14)$$
where η is the learning rate. For very large datasets, calculating the gradient for the entire dataset at once can be extremely computationally intensive. To mitigate this problem, it is more efficient to compute the derivatives on a small, random mini-batch of training points; the weights of the layer are then modified proportionally to the gradient. To reduce the effect of noisy, spurious training samples, it is common to use an additional momentum term in the training algorithm to make the training more uniform and less spiky. The weight update rule including the momentum term α is given by
$$\Delta w_{ij}(t) = \alpha\, \Delta w_{ij}(t-1) - \eta\, \frac{\partial E}{\partial w_{ij}}. \qquad (2.15)$$

The term α ensures smoother variations in the gradient values. The term η gives the learning rate. DNNs with many hidden layers and many units per layer are very flexible models with a very large number of parameters, which can easily overfit due to some spurious characteristic of the training set. To reduce overfitting, several techniques are used, such as L2 regularisation, which penalises the magnitude of the weights and prevents them from becoming very large, or dropout.
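A minimal sketch of the mini-batch update in Equations 2.13-2.15, with hypothetical gradient values and hyperparameters chosen only for illustration:

```python
import numpy as np

def sgd_momentum_step(W, dW, velocity, eta=0.01, alpha=0.9):
    """One update of Eq. 2.15; 'velocity' holds the previous Delta W."""
    velocity = alpha * velocity - eta * dW   # Delta w(t) = alpha*Delta w(t-1) - eta*dE/dw
    W = W + velocity                         # w(t+1) = w(t) + Delta w(t)
    return W, velocity

# Hypothetical example: dW stands in for a gradient computed by backpropagation on a mini-batch.
W = np.zeros((100, 10))
velocity = np.zeros_like(W)
dW = np.random.default_rng(0).normal(size=W.shape)
W, velocity = sgd_momentum_step(W, dW, velocity)
```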

2.5 Autoencoders

Multi layer perceptrons require the target value to be specified for each training example. If we do not have this information, we can still learn a representation that is dependent on the distribution of the data by means of an auto-encoder. An auto-encoder is a special kind of neural network with two components: an encoder and a decoder. The encoder takes the input x and maps it to a hidden representation y, which is given by

$$y = \sigma(W x + b_h). \qquad (2.16)$$

The latent or hidden representation y is mapped back to the original input using a decoder, which is given as:
$$z = \sigma(W' \hat{y} + b_v), \qquad (2.17)$$
where y is the encoded value, $\hat{y}$ is a possibly corrupted version of the encoded value, $b_h$ and $b_v$ are the bias values of the encoder and decoder respectively, W is the weight matrix of the encoder, and $W'$ represents its transpose. This network tries to minimise the reconstruction error given as:

$$L_C(x, z) = \|x - z\|^2 \qquad (2.18)$$
$$L_B(x, z) = -\sum_{k=1}^{d} \big[x_k \log z_k + (1 - x_k) \log(1 - z_k)\big] \qquad (2.19)$$

The first equation, 2.18, is for continuous input, while the second equation, 2.19, is used for classes and binary vectors. It is basically the cross-entropy error already defined above. Although we have expressed the equations with the σ function, the activation function could be any other prominent activation function. If the hidden layer of the auto-encoder has a lower dimensionality than the input, the model will perform non-linear dimensionality reduction. If it is of equal or greater dimensionality, special care must be taken to avoid that the model learns a trivial mapping (identity function). For this reason, the input or the hidden representations may be corrupted during training.
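The encoder, decoder and reconstruction error of Equations 2.16-2.18, together with the input corruption just mentioned, can be sketched as follows. This is a toy numpy example with tied (transposed) weights and arbitrary sizes, not the implementation used later in the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_in, n_hidden = 64, 32
W = rng.normal(scale=0.1, size=(n_hidden, n_in))    # encoder weights
b_h = np.zeros(n_hidden)                            # encoder bias
b_v = np.zeros(n_in)                                # decoder bias

def autoencode(x, corruption=0.2):
    # Corrupt the input by randomly setting some dimensions to zero.
    mask = rng.random(x.shape) > corruption
    x_tilde = x * mask
    y = sigmoid(W @ x_tilde + b_h)          # Eq. 2.16: encoder
    z = sigmoid(W.T @ y + b_v)              # Eq. 2.17: decoder with transposed weights
    loss = np.sum((x - z) ** 2)             # Eq. 2.18: reconstruction error vs. the clean input
    return z, loss

x = rng.random(n_in)
z, loss = autoencode(x)
```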

2.5.1 Manifold Learning with Autoencoders

A reason why auto-encoders do so well is that they exploit the idea that data is generally concentrated around a manifold, or several subsets of manifolds. The theoretical understanding of how auto-encoders map data manifolds is still a very active area of research, but to give a brief motivation, the general principle behind all autoencoders is a trade-off between two ideas. The first is to learn a representation y of a training example x such that x can be approximately recovered from y through a decoder; x should be drawn from the training data, because it means that the autoencoder need not successfully reconstruct inputs that are not probable under the data generating distribution. The other, complementary idea is to satisfy a generalisation or regularisation penalty; the presence of this term will encourage solutions which are less sensitive to small perturbations in the data.

Figure 2.3: Illustration of a sparse autoencoder with more nodes in the hidden layer than in the input layer. Not all nodes in the hidden layer are activated; the green-colored nodes are the only activated nodes.

2.5.2 Sparse Autoencoders

Historically, the first application of autoencoders was for reducing the dimensions of the input data, hence the name. Even when deep learning came to the forefront, the conventional architecture was to have more layers with fewer nodes than the input. Such architectures are called bottleneck networks. The idea was that more abstract features should not require as many dimensions as the input. Sparse auto-encoders differ from standard approaches of compression and dimensionality reduction where there are fewer hidden units than the number of dimensions of the input, i.e. $N_W < N_D$. Sparse autoencoders have "overcomplete" representations, which means $N_W > N_D$. The idea is to learn a representation, but at the same time impose a "sparsity" constraint on the activation of the hidden units, so that only a small percentage of them are really active. This can be done by adding a sparsity term to the objective function [6]. We make use of this idea in our experiments by having an "overcomplete" network. The details will be presented in Chapter 4.
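One common way to impose such a sparsity constraint, following the formulation popularised in the lecture notes cited later as [35], is a KL-divergence penalty that drives the average activation of each hidden unit towards a small target value ρ. The sketch below illustrates that particular choice; it is an assumption made for illustration and not necessarily the exact term used in [6] or in this thesis:

```python
import numpy as np

def kl_sparsity_penalty(hidden_activations, rho=0.05, beta=3.0):
    """KL-divergence sparsity term, one common choice of 'sparsity term'.

    hidden_activations: matrix of shape (n_samples, n_hidden) with values in (0, 1).
    rho:  target average activation of each hidden unit.
    beta: weight of the penalty in the objective.
    """
    rho_hat = hidden_activations.mean(axis=0)       # average activation per hidden unit
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)      # avoid log(0)
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return beta * kl.sum()

# The penalty is simply added to the reconstruction cost of the autoencoder:
# total_cost = reconstruction_error + kl_sparsity_penalty(hidden_activations)
```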

2.5.3 Performance of Sparse Autoencoders

Sparse autoencoders have generally performed well in many applications, but they have some disadvantages too. As shown in Figure 2.3, sparse autoencoders require more memory, as there are more nodes in the hidden layers (but only a few are actually activated). Due to these extra nodes, the weight matrices have a bigger rank, which slows down computation. In comparison, autoencoders with a bottleneck architecture have fewer nodes in the hidden layer and could possibly have lower time complexity. The overall time complexity of a neural network is linear in the total number of connections present, and though a sparse autoencoder might have many more connections than a bottleneck autoencoder, many of them might not even be used after the first few iterations. It is therefore more difficult to find a lower bound on the time complexity of sparse autoencoders. Sparse autoencoders can offer more flexibility and additional modelling power than a bottleneck autoencoder, although the theoretical understanding of how they work is still an active area of research.

2.5.4 Applications of Autoencoders

Autoencoders have been successfully applied to dimensionality reduction and information retrieval tasks. Dimensionality reduction was, in fact, the first application of representation learning and auto-encoders. One other task that has made successful use of autoencoders is information retrieval, where the dimensionality reduction algorithm produces a low dimensional, binary code, which can then be stored in a hash table mapping binary codes to entries for fast search. The hash table gives us the capability to perform information retrieval by returning all database entries having the same binary code. The search is then a lot faster because it is done on "hashed" binary codes.

2.6 Semi-Supervised Learning

While the above sections discussed one popular supervised learning model and one unsupervised model, in this section we will talk about Semi-Supervised Learning. Semi-Supervised Learning is a field of study in machine learning which falls between supervised learning (a problem of classification/regression with full class labels available in the training data) and unsupervised learning (modelling of input data into classes without any class labels). In most practical applications, it is hard to get fully labeled training data. Unlabeled training data is generally easier and in some cases much cheaper to obtain. An interesting question which then arises is how the unlabeled data can be used in conjunction with labeled data to improve the accuracy of a system trained on just the labeled data and, if it can be, what the minimum relative proportion is at which the unlabeled data loses any relevance to the problem. We are given a set of independent and identically distributed input samples $x_1, ..., x_l \in X$ and their corresponding targets $y_1, ..., y_l \in Y$, such that $D_L = (X_L, Y_L)$. In addition we also have u unlabeled input samples $x_{l+1}, x_{l+2}, ..., x_{l+u} \in X_U$, such that $D_U = (X_U)$; the total number of samples is then given as $n = l + u$. Supervised classification only utilizes the information from $D_L$ to learn a decision boundary, whereas in the semi-supervised learning framework the decision boundary is obtained by utilizing information from both $D_L$ and $D_U$ at the same time. Semi-supervised learning attempts to make use of the combined dataset $D = D_L \cup D_U$ to surpass the classification performance that can be obtained either by doing supervised learning (discarding the unlabeled data) or by doing unsupervised learning such as clustering (discarding the label information). Some researchers refer to SSL as "transductive learning" or "inductive learning". The goal of transductive learning is to infer the correct labels of the unlabeled dataset $D_U$.
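In code, this setting is just a matter of bookkeeping: labels are kept for the first l samples and the remaining u samples are marked as unlabeled. A small sketch with hypothetical arrays, using -1 as the "no label" marker:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, l = 1000, 39, 100          # n = l + u samples, d feature dimensions, l labeled
X = rng.normal(size=(n, d))      # all input samples, labeled and unlabeled
y_true = rng.integers(0, 10, n)  # full labels (normally unknown for the last u samples)

y = np.full(n, -1)               # -1 marks an unlabeled sample
y[:l] = y_true[:l]               # D_L = (X[:l], y[:l]),  D_U = X[l:]

labeled_mask = y != -1
print(labeled_mask.sum(), (~labeled_mask).sum())   # l and u
```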

2.7 Assumptions

We make several assumptions in SSL. The key idea for semi-supervised learning to work is that the information about the distribution p(x) which can be obtained from unlabeled data should also carry the information required for the inference of y expressed through p(y|x).

Chapter 3

Related Work

In this chapter, we will talk about existing literature on the application of deep neural networks for ASR and on Semi-Supervised Learning techniques. One prominent such technique is Graph Based Semi-Supervised Learning (GBL SSL).

3.1 Deep Neural Networks in ASR

This section introduces the kinds of DNNs and the different architectures that have recently been used in ASR.

3.1.1 Deep Belief Networks

The first set of results which beat traditional GMM-HMMs was reported by [38], where the authors trained DBNs (Deep Belief Networks) to recognise the HMM states of phonemes, given an input of 9-11 frames of MFCC feature vectors. A DBN contains several stacked layers of RBMs (Restricted Boltzmann Machines). The RBMs are generatively trained using the contrastive divergence algorithm outlined in [38]. Initially, the results were reported on TIMIT [10]. Further attempts were made to replicate the success on TIMIT in large scale vocabulary applications. The first such attempt was on data collected from the Bing mobile voice search application (BMVS). It had about 24 hours of training data with a high degree of variability. The DBNs trained on this dataset achieved a sentence accuracy of 69.6% on the test set compared to just 63.8% achieved by the traditional GMM-HMM baseline. The DBNs are a kind of unsupervised learning model, and they were popular as a way to pretrain a neural network. But further research revealed that purely supervised learning of a DNN works comparably, provided a large amount of labeled data is available, the initial weights are set carefully, and the mini-batch sets are set properly [43]. Since then, DBNs have fallen a little out of favour in the speech community, as the overhead of pretraining can be replaced by careful tuning of the network.

3.1.2 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are another kind of neural network, with an ability to model temporal and sequential data. While the above mentioned DBNs are a model for unsupervised learning, RNNs are trained in a purely supervised manner. One major advantage of RNNs is that they do not require a fixed input feature length like the feed-forward networks discussed above. The experiments described in [14] demonstrate how RNNs can be made to learn multiple levels of representation, a salient feature of deep nets, and combine this with their ability to make flexible use of long range context in ASR.


The authors reported a test set WER (word error rate) of 17.7%, which beat all the previous benchmarks. RNNs had earlier been used with HMMs, but this was the first instance where RNNs were used end to end, and it proved that stacking multiple recurrent layers on top of each other can give better results, just like their counterparts in deep feed forward networks.

3.2 Examples of Semi-Supervised Learning Methods

This section introduces the different kinds of SSL techniques. The model we use in this thesis falls into the last category of models: semi-supervised learning with autoencoders.

3.2.1 Heuristic based SSL/Self-Training

Among all the existing and accepted techniques, the simplest algorithm for semi-supervised learning is based on the "self-training" scheme, where a model is trained just with the labeled part of the dataset and with newly labeled data obtained from its own highly confident predictions, until the confidence level of the predictions drops below a certain threshold. The training can take several iterations. To state it formally, the self-training approach starts with the labeled set $L = \{(x_i, y_i)\}_{i=1}^{l}$ and the unlabeled set $U = \{x_i\}_{i=l+1}^{n}$. An initial model f is trained using only the labeled data with standard supervised learning. The resulting model is then used to make predictions on U, where the most confident predictions are removed from U and added to L together with their corresponding class predictions. In the next iteration, the model is refined with the new augmented set L. A critical assumption made in this algorithm is that the predictions added to the initial labeled set are reliable enough themselves. One big advantage of this approach is that it can be used as a wrapper for any learning algorithm. This is quite a general technique, which has been used in many different research areas such as object classification [40] and speech recognition. In [46, 15, 17], self-training was used in combination with neural networks, whereas in [24, 25, 47] it was used in combination with GMM-based acoustic models. Although showing promising results, these methods involve heuristics and can reinforce "bad" predictions. The confidence level and unit selection are very important in these models, and any mistuning of these parameters can lead to bad results in later iterations.
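A generic sketch of this self-training loop is given below; the classifier interface (fit/predict_proba, in scikit-learn style), the confidence threshold and the data are placeholders, since the cited works use neural-network or GMM acoustic models rather than this toy wrapper:

```python
import numpy as np

def self_training(model, X_lab, y_lab, X_unlab, threshold=0.95, max_iter=10):
    """Self-training wrapper around any probabilistic classifier exposing fit()/predict_proba()."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_iter):
        model.fit(X_lab, y_lab)                     # train on the current labeled set L
        if len(X_unlab) == 0:
            break
        proba = model.predict_proba(X_unlab)        # predictions on the unlabeled set U
        confidence = proba.max(axis=1)
        confident = confidence >= threshold         # keep only highly confident predictions
        if not confident.any():
            break                                   # stop when confidence drops below the threshold
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
        X_unlab = X_unlab[~confident]               # remove the newly labeled points from U
    return model
```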

3.2.2 Transductive SVMs

Transductive SVMs [18] work on the principle of avoiding decision boundaries in regions where the input is heavily distributed. Putting a decision boundary in high density regions of the input increases the chances of getting "all predictions wrong"; this idea is based on cluster smoothness, which was discussed in the previous section. Transductive Support Vector Machines (TSVMs) are an extension of traditional SVMs to unlabeled data. They have the same objective of maximising the margin between classes, while ensuring that there are few unlabeled examples near the margin. Finding the exact solution is NP-hard. Some efficient approximate algorithms have been proposed, but they lack scalability to problems with very large datasets.

3.2.3 Entropy Based Semi Supervised Learning

Entropy based learning methods jointly model the labeled and unlabeled data. The primary motivation for these methods is entropy minimization, as proposed in [9, 16]. In [16], the authors proposed to jointly model the labeled and unlabeled data in a conditional entropy minimisation framework first demonstrated in [13]. The authors also maximised the conditional entropy regularisation term posed on the unlabeled data, in addition to the usual task of maximising the posteriors on the labeled data. This additional regularizer encourages the model to have as great a confidence as possible on the label prediction of the unlabeled data. The optimisation of the framework was performed using the extended Baum-Welch algorithm. The method was evaluated on different speech recognition tasks such as phonetic classification and phonetic recognition, where some improvement was obtained on the TIMIT dataset compared to a supervised, discriminatively trained GMM model with the Maximum Mutual Information criterion [37]. These methods have mostly been applied with GMM models in the literature; there is further scope for their study in discriminative methods, which could be a potential area of future work.

3.2.4 Graph based SSL

Graph based learning has been quite popular in ASR lately for acoustic modelling. One part of the thesis is to compare the results achieved by our model with the results from graph based learning for improving frame-based classification on the TIMIT dataset. One of the first works in this area was the application of the label propagation algorithm to a vocal classification task [2]. The work was evaluated on the Vocal Joystick dataset [19], which is an 8-vowel classification task that was used to develop voice-controlled assistive devices for patients with motor impairments. Graph based SSL methods define a graph composed of nodes which represent both labeled and unlabeled training examples, as explained in [50]. The nodes are connected to each other by edges, which have weights. The weight of an edge is given according to the similarity of the examples. The most popular algorithm in GBL based semi supervised learning is Label Propagation (LP) [49]. It iteratively propagates the information from the labeled data on a graph G. The end goal of all GBL based algorithms is to infer the labels via an undirected weighted graph $G = (V, E, W)$, where V are the vertices of the graph corresponding to the data points in $D_L$ and $D_U$, and E are the undirected edges of the graph, weighted by $w_{ij} \in W$. The label propagation algorithm minimises the following function:

$$\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\, \|\hat{y}_i - \hat{y}_j\|^2 \qquad (3.1)$$
subject to $\hat{y}_i = y_i$, where $\hat{y}$ is a predicted label and y is a true label. Two more techniques in graph based SSL, known as MAD (Modified Adsorption), proposed in [45], and MP (Measure Propagation), proposed in [44], have recently been introduced. MP minimizes the following objective function:

$$\sum_{i=1}^{l} D_{KL}(r_i \| p_i) + \mu \sum_{i=1}^{n}\sum_{j \in \mathcal{N}(i)} w_{ij}\, D_{KL}(p_i \| p_j) - \nu \sum_{i=1}^{n} H(p_i) \qquad (3.2)$$

where p is the predicted probability distribution over the classes, r is the true (reference) distribution, $\mathcal{N}(i)$ is the graph neighborhood of node i, $D_{KL}$ is the Kullback-Leibler divergence, and H is the entropy. The first term in the expression ensures that the predicted probability distribution matches the true distribution over the labeled vertices as closely as possible; the second term stands for the smoothness of the label assignment enforced by the graph G defined above, which essentially means that the class probability distributions on neighbouring vertices (i.e. those connected by a higher edge weight) should have a smaller KL divergence. The third term encourages higher entropy in the final output. A new variant of MP, called prior-regularized measure propagation (pMP), is given in [30]. The pMP algorithm minimises the following objective function:

$$F(D, G, p) = \sum_{i=1}^{l} D_{KL}(r_i \| p_i) + \mu \sum_{i=1}^{n}\sum_{j \in \mathcal{N}(i)} w_{ij}\, D_{KL}(p_i \| p_j) + \nu \sum_{i=l+1}^{l+u} D_{KL}(p_i \| \tilde{p}_i). \qquad (3.3)$$

The additional term in Equation 3.3 is a measure of how close the predicted probability distribution is to the prior distribution of the classes. Graph based SSL has recently been applied by [30, 31] in the context of ASR. The authors keep a DNN as the final discriminative classifier and achieve state-of-the-art results. We will compare our results with theirs in Section 5.4. There are several problems with graph based SSL methods. Firstly, their complexity is $O(N^3)$, because they involve the inversion of an $N \times N$ matrix. They also do not give any confidence measure for the estimates. Moreover, the addition of any new data points can be quite cumbersome, as it requires modelling the entire graph structure again.
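As a point of reference, a basic label propagation baseline in the spirit of Eq. 3.1 is available in scikit-learn; the following usage sketch is an illustration with random data, not the pMP system of [30, 31]:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 39))            # labeled + unlabeled feature frames
y = rng.integers(0, 10, 500)
y[50:] = -1                               # scikit-learn convention: -1 means "unlabeled"

# An RBF kernel defines the edge weights w_ij between the graph vertices.
lp = LabelPropagation(kernel="rbf", gamma=0.5, max_iter=1000)
lp.fit(X, y)
predicted = lp.transduction_              # inferred labels for all vertices
```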

3.2.5 Semi-Supervised Learning with generative models These approaches take the SSL problem as a case of missing data imputation task for su- pervised classification problem. They are probabilistic in nature and also give confidence measures or uncertainity measures for predictions. Kingma et al in [20] have presented a Latent-feature discriminative models that provides an embedding or feature representa- tion of the data similar to an autoencoder. The deep of data provides a more robust set of latent features than autoencoders. This latent feature representation al- lows for clustering of related observations in latent feature space and gives quite accurate classification. The generative model used, is given below:

p(z) = N(z|0,I)

p_θ(x|z) = f(x; z, θ),
where the likelihood function f(x; z, θ) could be a Gaussian. Another model presented in the same paper describes the data as generated by a latent class variable y in addition to the embedding z learnt above. The generative process is given as:

p(y) = Cat(y|π); p(z) = N(z|0,I)

p_θ(x|y, z) = f(x; y, z, θ),
where f(x; y, z, θ) is a Gaussian likelihood function. z_i is the additional independent latent variable for each x_i. All z_i can be written as the distribution of a single latent variable z.

3.3 Autoencoder Based Semi-Supervised Learning

This section introduces the model we are going to use and motivates it. Deep Neural Networks are generally trained either from just fully labeled data, or from just purely unlabeled data. Both models, when used alone, are not ideal. Ranzato et al. in [39] explored the possibility of having a joint objective made of both a supervised and an unsupervised objective on documents, where bag of words representations were used as input features. The authors performed their experiments on the Reuters and Newsgroups datasets. We propose to use a similar approach for frame-based phoneme recognition in ASR. Although our objective function is the same as the one proposed in [39], our set up is different in a number of ways. Firstly, instead of the compact and lower dimensional encoding used in [39], we employ sparse encoding. Secondly, instead of stacking a number of encoders, decoders and classifiers in a deep architecture as in [39], we use a single layer model. This is motivated by the work in [6], where the authors analyse the effect of several model parameters in unsupervised learning of neural networks on computer vision benchmark data sets such as CIFAR-10 and NORB. They conclude that state-of-the-art results can be achieved with single layer networks regardless of the learning method, if an optimal model setup is chosen. An introduction to sparse autoencoders can be found in the Stanford lecture notes given by [35].

Chapter 4

Method

We present our algorithm here. We will try to motivate it from our understanding of autoencoders and the principles of supervised classification. Neural networks require proper care and attention during training, so we will describe how to initialise the network, the practical details to be kept in mind, and the hyperparameter values we used. We will also try to explain why we made certain choices in the network configuration. We will also describe gradient checking, which is an important technique for debugging backpropagation in neural networks.

4.1 The Model

Figure 4.1 shows a block diagram of the neural network model used in this study. The topmost path is equivalent to an autoencoder, consisting of an encoder $W_E$ and a decoder $W_D$. Although autoencoders usually share weights between the encoder and decoder ($W_D = W_E^T$), in our case we optimize those weights independently. The reason for this will become clear in the following. The bottom path in the figure is a neural network classifier that uses the representation learned by the encoder as input features. The figure also shows the reconstruction error $E_R$ and the classification error $E_C$ that can be computed for the two paths. Evaluating $E_C$ requires labels for each input observation, whereas $E_R$ is computed without the need for labels. This allows us to update the model parameters $W_E$, $W_D$ and $W_C$ simultaneously on labelled and unlabelled material in the same batch of observations. The advantage over a feed-forward network is that $W_E$ can be estimated on much larger unlabelled data-sets. The advantage over unsupervised auto-encoders is that $W_E$ will be continuously optimised during training for the particular classification task we are considering. The training algorithm is given in Algorithm 2.

Algorithm 2 Algorithm to train the network
1: Transform the training samples x into codes z using the encoder part of the layer.
2: Calculate the reconstruction loss $E_R$ using the encoded input z.
3: Compute the classification error $E_C$ using again z and the known labels y.
4: The loss is then combined, and the final objective function is given as: $E = E_R + \alpha E_C$.
5: The layer is trained by minimising the combined loss term using SGD.
6: The encoded input z is used as input to train the next layer.
7: The procedure can be repeated with other layers.


[Figure 4.1: flow chart — the input x passes through the encoder $W_E$ (tanh) to the code z; the decoder $W_D$ (tanh) produces the reconstruction $\hat{x}$, giving the reconstruction error $E_R$; the classifier $W_C$ (softmax) maps z to y, which is compared with the target t to give the classification error $E_C$; the total cost is $E = E_R + \alpha E_C$.]

Figure 4.1: Flow chart for the cost calculation in a single layer of the network. Three components are considered: encoder, decoder, and classifier. The loss is a weighted sum of the cross-entropy $E_C$ and the reconstruction loss $E_R$. If several layers are stacked together, only the encoder/decoder pairs are retained after training. The value t represents the targets from the training data and y represents the output of the classifier network.

The computational cost is linear in the number of training samples, and thus the method is more efficient than graph based semi-supervised learning algorithms, which have $O(N^3)$ cubic complexity. For each layer, the cost is given by a forward and backward pass through the encoder, decoder and classifier. In the supervised setting, the cost function $E_C$ is the cross-entropy log-loss defined as:

$$E_C = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{K}\Big[t^i_j \log y^i_j + (1 - t^i_j)\log(1 - y^i_j)\Big], \qquad (4.1)$$
where $y^i_j$ is the activation of output node j of the classification network for observation i and $t^i_j$ is the corresponding target label; the summation over j corresponds to the nodes in the output layer, and the sum over i is an average over a number of observations. In the unsupervised setting, instead, the objective is to find a representation that retains enough information in order to reconstruct the input data. The cost function $E_R$, in this case, is the second degree norm (also called L2 norm) of the difference between the original input and the reconstructed input:

$$E_R = \|x - \hat{x}\|^2, \qquad (4.2)$$
where x denotes the original input and $\hat{x}$ represents the reconstructed input. Similarly to the method described in [3], we add noise to the original input by a process called 'corruption', in which some dimensions of the input vector are randomly picked and set to zero. This helps the network learn a more robust representation of the input data. The final cost function that we want to optimise is a linear combination of $E_R$ and $E_C$:

$$E = E_R + \alpha E_C. \qquad (4.3)$$

Optimizing the cost E with respect to the model parameters $W_E$, $W_D$ and $W_C$ is a non-convex optimisation problem, which we solve with a gradient descent algorithm. When the input datapoint is not accompanied by a label, the classifier part of the layer is not updated, and the loss function simply reduces to $E_R$.

The data is split into three sets. The optimization is performed on the training set, while a validation set is used to optimize the meta parameters for each run, for example the value of α in the linear combination of Eq. 4.3. Neural networks contain many hyperparameters, and tuning them can be a challenge. The final results are given on an independent test set. As we will describe later in Section 5.1, we run experiments on two datasets: MNIST and TIMIT. In the case of MNIST, the input x is the raw image represented by the pixels concatenated row-wise into a single vector. The output y is the digit number (0-9). In the case of TIMIT, the input x is the MFCC + ∆ + ∆∆ feature vector concatenated together with the features of the 5 previous and 5 next frames, as in Figure 2.2. The procedure to obtain these features has already been explained in Section 2.2. The outputs y are all the possible speech phonemes. Please note that, recently, it is more prevalent to have senones or the HMM states of the phonemes as the target labels, as in [38], but to compare with other techniques in semi-supervised learning for speech [30, 2], we experimented with phonemes. The training of this model can be performed greedily layer by layer if we wish to use deep networks. However, in our experiments, we use only a single layer for feature representation. We use mini-batch SGD, as explained in the theory section, as the optimizer for this cost function. The weight matrices $W_C$ and $W_D$ are simply updated by the normal backpropagation algorithm as shown in Equation 2.14 in the theory section. However, the update of the encoder weight matrix is not as straightforward as the other two, and is given by:

∂E/∂W_E = ∂E_R/∂W_E + α ∂E_C/∂W_E   (4.4)

W_E ← W_E − η ∂E/∂W_E   (4.5)

It is important to note that the update of the encoder weights W_E depends on both W_D and W_C, and the delta propagated in the backpropagation algorithm is a linear combination of the deltas calculated in both parts. Because of its sparse properties, we call this model Semi-Supervised Sparse Autoencoder (SSSAE). The weight update given in Equation 4.5 was implemented using the Theano library [23].
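To make the combined update of Eq. 4.4-4.5 explicit, the fragment below sketches one SGD step for a single layer. It is a NumPy illustration under our own assumptions (tanh encoder, linear decoder, sigmoid classifier, batch-averaged costs); the thesis implementation instead relied on Theano's automatic differentiation. The key point is that the encoder delta is the sum of the delta propagated back from the decoder and α times the delta propagated back from the classifier.

import numpy as np

def sgd_step(x, t, W_E, b_E, W_D, b_D, W_C, b_C, alpha, eta, labeled=True):
    n = x.shape[0]
    # Forward pass through encoder, decoder and classifier.
    h = np.tanh(x @ W_E + b_E)                       # encoder activations
    x_hat = h @ W_D + b_D                            # linear reconstruction
    y = 1.0 / (1.0 + np.exp(-(h @ W_C + b_C)))       # sigmoid classifier outputs

    # Output deltas (derivatives of the batch-averaged E_R and E_C).
    d_dec = 2.0 * (x_hat - x) / n                    # dE_R / dx_hat
    d_h = d_dec @ W_D.T                              # delta from the decoder path
    grad_WD, grad_bD = h.T @ d_dec, d_dec.sum(axis=0)

    if labeled:
        d_clf = (y - t) / n                          # dE_C / dz for sigmoid + cross-entropy
        grad_WC, grad_bC = h.T @ d_clf, d_clf.sum(axis=0)
        d_h = d_h + alpha * (d_clf @ W_C.T)          # Eq. 4.4: linear combination of deltas

    d_pre = d_h * (1.0 - h ** 2)                     # tanh derivative
    grad_WE, grad_bE = x.T @ d_pre, d_pre.sum(axis=0)

    # Parameter updates; the alpha factor on W_C follows from the total cost of Eq. 4.3.
    W_D -= eta * grad_WD; b_D -= eta * grad_bD
    if labeled:
        W_C -= eta * alpha * grad_WC; b_C -= eta * alpha * grad_bC
    W_E -= eta * grad_WE; b_E -= eta * grad_bE       # Eq. 4.5

When a mini-batch contains only unlabeled points, labeled is set to False and the step reduces to a plain denoising-autoencoder update, as described above.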

4.2 Evaluation

The evaluation method was simply classification accuracy, given as the proportion of correctly classified examples. In machine learning experiments, the standard convention is to partition the data into three sets: a training set, a validation set, and a test set. The largest of them is the training set, which is used to fit the parameters of the model. Given a model class and a choice of hyperparameters, the parameters that give the minimum error on the training set are selected. Given a type of model, one tunes the hyperparameters on the validation set. The test set is kept untouched during the entire training process; computing the error on the test set is the last step. We performed experiments on two different datasets. In the MNIST problem, each example

corresponds uniquely to a class (written digit) and the classification accuracy is straightforward to define. In ASR, however, this is not the case. Because the speech signal is a continuous stream of linguistic units, many metrics can be defined at different levels of detail. We can compute errors at the phonetic level, at the word level, or even at the level of full sentences. We can also consider the fine alignment of the recognized linguistic units in time, or just consider the sequence of linguistic units, disregarding errors in alignment. Which metric we use is determined by the application and by the output of our speech recognizer. The most commonly used metric is called Word Error Rate (WER). This metric is defined on sequences of words and disregards the alignment in time. The sequence of recognized words is aligned with the sequence of labels by means of dynamic programming, and the mismatch is computed in terms of the number of insertions, deletions and substitutions. In this thesis, however, we focus on phone recognition, and we therefore use a corresponding metric. The most common way to calculate accuracy at the phonetic level is to consider each speech frame as an independent classification and count the proportion of correctly classified frames. This is possible because the data set that we use (TIMIT) contains carefully annotated phonetic transcriptions. This particular evaluation method is usually considered a first step whenever a new method for ASR is introduced, because it allows possible problems to be found early, before a full large-vocabulary speech recognizer is constructed.

4.3 Monitoring and Debugging

4.3.1 Design Choices/Tuning Hyperparameters
Neural networks contain many hyperparameters. A model's parameters are directly fitted by a training algorithm, whereas the hyperparameters are set by hand or through trial and error. For a neural network, the values of the weight matrices are the parameters of the model; in the model given above, W_E, W_C and W_D serve as the parameters. In addition to these parameters, we generally have another set of parameters that cannot be learned directly from the regular training process; they act one level above the normal training process. For example, the value of K is a hyperparameter in the K-means clustering model, and the learning rate η is a hyperparameter in the gradient descent learning procedure. Setting the hyperparameters properly plays an important role in getting the best performance from the network and in making the network converge faster. A proper configuration of the network can also reduce the amount of computation required by reducing the number of epochs needed to reach convergence. More recently, it has been shown that the initialization procedure based on unsupervised pretraining with Deep Belief Networks can be avoided if we use better activation functions and a sufficiently large training set [43, 12].

In our model, as given by Equation 4.3, the value of α cannot be determined by one iteration of the training procedure described in Chapter 2. The usual way of determining its value is to iterate the training procedure over a set of possible values and pick the one which performs best on a validation set. Other hyperparameters in our model are: the number of nodes in the hidden layer, the learning rate η, the momentum term as described in Section 2.4, and the batch size B. The procedure for optimising the hyperparameters is shown step-wise below in Algorithm 4.3.1. Due to the limited processing power of our system, we cannot perform an exhaustive

search for hyperparameter optimisation. The strategy we use is called coordinate descent. The idea is to change only one hyperparameter at a time, optimise that particular hyperparameter, and then use its value together with the best configuration of hyperparameters found so far. Another approach is to start the search by considering only a few values of each hyperparameter, but over a very large range; a more local search can then be performed in the neighbourhood of the optimal value found, in order to make finer adjustments with more iterations. For example, the learning rate could be optimised over the range (0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.5). This procedure is described below:

for p ∈ HP do
    for p ← p_i from p_1, p_2, ..., p_n do
        Train the model
        Estimate model accuracy M_i on the validation set
    end for
    j ← arg max_i M_i
    p ← p_j
end for

Table 4.1 shows the values of the hyperparameters we used for the experiments. We followed the procedures given in [4, 5] for tuning them to obtain the best validation set accuracy.
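For illustration, a compact version of this coordinate-descent search could be written as follows. This is our own sketch: train_and_validate is a hypothetical placeholder standing in for one full training run that returns validation-set accuracy, and the example grid is only loosely based on the ranges quoted above and in Table 4.1.

def coordinate_descent_search(grid, train_and_validate):
    """grid: dict mapping hyperparameter name -> list of candidate values.
    train_and_validate: callable taking a dict of hyperparameters and
    returning the validation-set accuracy of one full training run."""
    # Start from the first candidate value of every hyperparameter.
    best = {name: values[0] for name, values in grid.items()}
    for name, values in grid.items():            # optimise one hyperparameter at a time
        scores = []
        for v in values:
            config = dict(best, **{name: v})     # all other hyperparameters kept fixed
            scores.append(train_and_validate(config))
        best[name] = values[scores.index(max(scores))]   # keep the best value found
    return best

# Example grid (assumed values, loosely following Table 4.1):
grid = {
    "eta": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.5],
    "batch_size": [100, 200, 300, 400],
    "alpha": [50, 100, 300, 600, 900, 1400],
}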

4.3.2 Learning Rate
The learning rate determines how large the weight updates are in one iteration of the training procedure. It is also probably the most important hyperparameter. If the rate is too low, it might take too many epochs to reach the minimum; if the rate is too high, we might jump over the minimum altogether. An effective approach is to decay the learning rate every few iterations. The most common ways to adapt the learning rate are:

1. Exponential decay:
   η = η_0 exp(−kt)   (4.6)

2. Inverse decay:
   η = η_0 / (1 + kt)   (4.7)

where η_0 is the initial learning rate at epoch 0, t is the current epoch, and k is another hyperparameter to be tuned.

3. Step decay after delay: reduce the learning rate by some factor every few epochs. We used a simple heuristic: after each epoch we observed the validation error, and the learning rate was reduced by a factor whenever the validation error stopped improving or, in some cases, started to increase.

We experimented with all of the above-mentioned schemes, but found the step decay scheme easiest to work with for our problem, as it was easier to tune and monitor and did not involve tuning an additional hyperparameter k as in the other two schemes. We took the initial learning rate η_0 equal to 0.015. The learning rate was kept constant

until the change in validation error became quite small; after that it was halved after each epoch, until the learning rate fell below a negligibly small value such as 10^-5.
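The step-decay heuristic can be written down in a few lines. In this sketch, train_one_epoch and validation_error are placeholders for one epoch of training and the validation-set evaluation; the tolerance and epoch limit are our own assumptions. The same loop also realises the early-stopping criterion discussed later in Section 4.3.8, since training stops once the learning rate falls below the floor.

def train_with_step_decay(train_one_epoch, validation_error,
                          eta0=0.015, min_eta=1e-5, tol=1e-4, max_epochs=200):
    eta, prev_err = eta0, float("inf")
    for epoch in range(max_epochs):
        train_one_epoch(eta)                  # one pass over the training data
        err = validation_error()
        if prev_err - err < tol:              # validation error stopped improving
            eta *= 0.5                        # halve the learning rate
        prev_err = err
        if eta < min_eta:                     # rate is negligible: stop training
            break
    return prev_err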

4.3.3 Batch Size
The batch size is another hyperparameter in our model: it is the number of input samples in one batch when optimising the neural network with the mini-batch Stochastic Gradient Descent (SGD) method. Its value generally ranges from 32 to 500. This parameter is especially important for our model, as we have to keep a uniform proportion of labeled and unlabeled samples in every batch. If the right proportion is not maintained, one cost might carry too much weight compared to the other. For example, when we have 1000 labeled points and 49000 unlabeled datapoints and the batch size B is too small, a batch might not contain any labeled point at all, and the cost reduces to the reconstruction cost alone. On the other hand, if B is too big, it slows computation down because of the larger input matrices. We chose B to be approximately between 100 and 200; when the amount of labeled data was very low, we took B to be between 300 and 400.
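One way to keep the labeled/unlabeled proportion roughly constant across mini-batches is to sample the two pools separately and concatenate them. This is our own sketch, not the thesis code, and it assumes each pool is large enough for the requested batch size.

import numpy as np

def balanced_batches(x_lab, t_lab, x_unlab, batch_size, rng=np.random):
    """Yield mini-batches whose labeled fraction matches the global fraction."""
    n_lab, n_unlab = len(x_lab), len(x_unlab)
    frac = n_lab / float(n_lab + n_unlab)
    k = max(1, int(round(frac * batch_size)))        # labeled points per batch
    n_batches = (n_lab + n_unlab) // batch_size
    for _ in range(n_batches):
        i = rng.choice(n_lab, size=k, replace=False)
        j = rng.choice(n_unlab, size=batch_size - k, replace=False)
        yield x_lab[i], t_lab[i], x_unlab[j]          # labeled and unlabeled parts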

4.3.4 Weight Initialization
Weight initialisation is also an important decision when setting up a neural network. For each layer, while the bias b can generally be initialised to a zero vector, the weight matrix needs to be initialised more carefully in order to break the symmetry between hidden units of the same layer. As described in [12], the initial values for the weights W_D, W_E and W_C should be uniformly sampled from a symmetric interval:

W_C ∼ U[ −√(6/(n_h + n_out)), √(6/(n_h + n_out)) ]   (4.8)
W_D ∼ U[ −√(6/(n_h + n_in)),  √(6/(n_h + n_in)) ]    (4.9)
W_E ∼ U[ −√(6/(n_in + n_h)),  √(6/(n_in + n_h)) ]    (4.10)
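In code, this initialisation amounts to sampling each matrix from the symmetric interval determined by its fan-in and fan-out. A small NumPy sketch (the layer sizes shown are only indicative of the TIMIT setup described later):

import numpy as np

def glorot_uniform(fan_in, fan_out, rng=np.random):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

n_in, n_h, n_out = 429, 8000, 48            # indicative sizes for the TIMIT experiments
W_E = glorot_uniform(n_in, n_h)              # encoder, Eq. 4.10
W_D = glorot_uniform(n_h, n_in)              # decoder, Eq. 4.9
W_C = glorot_uniform(n_h, n_out)             # classifier, Eq. 4.8
b_E, b_D, b_C = np.zeros(n_h), np.zeros(n_in), np.zeros(n_out)   # biases start at zero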

4.3.5 Number of Hidden Units

The number of units in the hidden layer is another hyperparameter. The size of W_E is (n_in, n_h), the size of W_D is (n_h, n_in) and the size of W_C is (n_h, n_out). The number of input nodes n_in and the number of output nodes n_out are already fixed by the problem, so n_h is the only dimension we can tune. In our experiments, we found that increasing n_h improved accuracy up to a point, after which increasing it further had no impact on accuracy while slowing the computation down. The exact size depends on the dataset and problem, but in general it was much larger than the number of input nodes n_in. We found the range 7000-9000 to be the best choice for n_h for both datasets we used.

4.3.6 Momentum
Some researchers use a momentum term to smooth the gradient updates, as described by Equation 2.15. We did not use a momentum term in our experiments, as it did not seem to improve either accuracy or convergence.

4.3.7 Activation Function
The activation function is another hyperparameter, as there are several functions that can be used. The hyperbolic tangent (tanh) function worked well for us, and is better at handling inputs that are not centered around 0. Of the two datasets in our experiments, MNIST is not normalized around 0, while TIMIT is normalized to have zero mean.

4.3.8 Training Epochs
The number of training epochs (T) is one more important hyperparameter. It is not hard to optimize, as this can easily be done through early stopping: as training progresses, one can decide how long to train for, for any given setting of all other hyperparameters. We checked the validation error of the network at each epoch, and stopped training whenever the validation error started increasing again. We implemented this by reducing the learning rate as described above: when the learning rate fell below a small positive threshold such as 10^-5, training was stopped immediately. Our model needed more iterations than a single hidden layer neural network: while a neural network took about 20-25 epochs for MNIST and about 35-40 epochs for TIMIT, our model took about 50-60 epochs for MNIST and about 70-90 epochs for TIMIT.

4.3.9 Additive Noise
The amount of noise to be added to the input is another decision to be made when training autoencoders. As described in connection with Eq. 4.2, it is generally preferable to add some noise to the original input to avoid the network learning the identity function. This could be Gaussian noise added to the input, but a more common way of adding noise is to randomly zero out a fixed percentage of the input dimensions. This procedure is called "masking corruption". In our experiments, we used the masking corruption procedure to add about 10% noise to the input. We found that adding more noise than this did not affect the performance by much.
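Masking corruption is simple to implement: pick a fraction of the input dimensions at random and set them to zero. A NumPy sketch with the 10% corruption level used in our experiments (sampling the mask independently per dimension is an assumption on our side):

import numpy as np

def mask_corrupt(x, corruption_level=0.10, rng=np.random):
    """Randomly zero out a fraction of the input dimensions of each sample."""
    mask = rng.binomial(n=1, p=1.0 - corruption_level, size=x.shape)
    return x * mask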

4.3.10 Alpha
For choosing α, we use the search algorithm given above. Interestingly, we found that the optimum value of α depends on the percentage of labeled samples in the dataset. We describe this in more detail in Chapter 6.

Name of Hyperparameter             Value
Initial Learning Rate (η_0)        0.015
Learning Rate Schedule             Step Decay
Batch Size (B)                     100-400
Momentum (β)                       0
# Nodes in hidden layer (n_h)      7000-9000
Noise Level                        10%
Epochs (T)                         60-90
Activation Function                Hyperbolic tangent (tanh)

Table 4.1: Hyperparameter values for our model.

4.3.11 Gradient Checking
This section describes a practical nuance of the backpropagation algorithm. Backpropagation, for all its simplicity and scalability, can be hard to work with and debug in practice. When using third-party libraries or self-written code, a slightly buggy implementation may not make it at all apparent that something is amiss. In the model discussed, Equation 4.5 gives the update of the encoder weight matrix. To verify that the derivative is correct and the weight update is working as intended, we can numerically check the derivatives computed by any code or library that performs gradient computation. Suppose, for example, that we want to minimize F(θ) as a function of θ. For a simple one-dimensional case, one iteration of the weight update is given by:

θ ← θ − η ∂F(θ)/∂θ   (4.11)
g(θ) = ∂F(θ)/∂θ      (4.12)
θ ← θ − η g(θ)       (4.13)

where θ represents the parameter being optimised (here one-dimensional) and η is a scalar learning rate. To check whether the implementation of g(θ) is correct, we use the following equation:

g(θ) = lim_{ε→0} [F(θ + ε) − F(θ − ε)] / (2ε)   (4.14)

The above expression is known as the numerical derivative. ε is a very small positive number; its value can range from 10^-4 to 10^-6 or even smaller, but it should not be as small as 10^-20, in order to avoid numerical round-off errors. When θ is not one-dimensional, the above equation can be applied to each dimension in turn while keeping the other dimensions constant, which means the increment/decrement of θ is limited to one dimension at a time. The value obtained from the numerical derivative should then be approximately equal to the gradient computed by a standard library or package. In our experiments, we used the function T.grad provided by Theano to compute derivatives. Numerical derivatives, although easy to use, have limitations: firstly, they have to be evaluated separately for each dimension of the weight matrix, and secondly, they are not precise, though still good enough for debugging purposes. Theano, on the other hand, uses symbolic differentiation, which is easier to scale to multiple dimensions and is more precise.
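The numerical check of Eq. 4.14 can be carried out dimension by dimension as described above. A small sketch (loss stands for any scalar-valued function of a parameter array; it is a placeholder, not part of the thesis code):

import numpy as np

def numerical_gradient(loss, theta, eps=1e-5):
    """Central-difference approximation of d loss / d theta, one dimension at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        old = theta.flat[i]
        theta.flat[i] = old + eps; f_plus = loss(theta)
        theta.flat[i] = old - eps; f_minus = loss(theta)
        theta.flat[i] = old                          # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2.0 * eps)
    return grad

# The analytical gradient (e.g. from Theano's T.grad) should agree with this
# approximation up to a small relative error.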

Chapter 5

Experiment Setup and Results

The model described in Chapter 4 is tested on two different datasets and the results are compared with a standard two-layer neural network. The first dataset is MNIST [27], which is used as a benchmark in the machine learning community for testing semi-supervised algorithms. As this thesis concentrates on improving semi-supervised classification in ASR, we also test the algorithm on TIMIT [10], which is a much bigger dataset than MNIST and quite popular in the speech recognition community due to its carefully manually annotated transcriptions.

5.1 Data

5.1.1 MNIST
We test performance on the standard MNIST [27] digit classification benchmark. The dataset contains the 10 digits from 0 to 9, each with 5000 sample points in the training set and 1000 sample points in each of the validation and test sets. The datapoints for each digit class exhibit intra-class variation in the form of different handwriting strokes. All the images in the dataset are of size 28×28, which amounts to 784 pixels. The dataset for semi-supervised learning is created by splitting the 50000 training samples into a labeled and an unlabeled set, and varying the number of labeled points from 100 to 3000. All digit classes have the same number of points in the labeled part of the training set to avoid any bias towards a particular class; this way, we ensure that all classes are balanced and have equal representation in the labeled part of the dataset. In total, MNIST contains 70000 samples: 50000 in the training set, 10000 in the validation set and 10000 in the test set. Some images from the dataset are shown in Fig. 5.1. Each image is vectorised by concatenating all the pixels row-wise. We do not do any pre-processing of the images, and all the pixels of the image, represented by the vector, are fed as input to the network.

5.1.2 TIMIT
The main part of this thesis is the experiment on the standard TIMIT dataset [10] for the task of frame-based phone classification. TIMIT has a total of 6300 American English sentences, corresponding to about 5.4 hours of speech recordings, with 630 unique speakers each having spoken 10 sentences. The spoken sentences were carefully chosen to be phonetically rich. The speech waveform is sampled at a frequency of 16 kHz and stored in 16-bit linear coding.


Figure 5.1: Some sample images from the MNIST dataset. The images are of handwritten digits with varying strokes for each digit from 0-9. All images are of size 28×28 pixels.

There are 438 male speakers and 192 female speakers, which roughly corresponds to a 70-30% ratio in favour of males. Another distinction among the speakers, apart from gender, is the dialect they speak: the speakers can be divided into 8 different groups based on the American English dialect they speak. A speaker's dialect is roughly based on the region where they lived during their childhood. More information about the dialects and regions can be found in the documentation [10]. There is also a core test set placed under a different directory. This set contains speech from 24 speakers, 2 male and 1 female from each dialect region. Each speaker has 8 sentences, making the total number of utterances 192. There is another, larger directory of speech containing a total of 168 speakers and 1344 utterances, called the "complete test set" in the documentation, which can be used for testing as well. The spectrogram of one utterance is shown in Figure 5.2.

We used the standard core test set of 192 sentences described above, and a development/validation set of 184 sentences. For training, we used a total of 3512 sentences. As part of the standard procedure for experiments on TIMIT, glottal stop segments are excluded; a glottal speech segment is a kind of sound produced by obstructing the airflow in the vocal tract. The procedure for converting raw speech waveforms into a sequence of fixed-length vectors has already been documented in Section 2.2. The TIMIT dataset in processed form is much bigger than MNIST: it contains roughly 1.3 million speech frames, 1.15 million in the training set and roughly 120000 in the test and validation sets. The data is created with the help of the standard recipes given in [36, 33]. The input to our network was created by first extracting a 39-dimensional feature vector for each frame. The feature vector is made of 12 MFCC coefficients and 1 energy coefficient, together with their deltas and delta-deltas, computed every 10 ms with an overlapping window of 20 ms. For each frame, the feature vector is concatenated with the feature vectors of the 5 previous and the 5 following frames, as a result of which the final vector has dimension 11 × 39 = 429. The total number of input feature vectors in the training set is 1068816. The validation set has 56005 frames in total, and the test set has 57919 frames.
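The splicing of ±5 context frames can be expressed compactly. In the NumPy sketch below, frames is the (T, 39) matrix of MFCC+Δ+ΔΔ vectors for one utterance; padding the utterance edges by repetition is one common convention and an assumption on our side.

import numpy as np

def splice(frames, context=5):
    """Concatenate each frame with its `context` left and right neighbours."""
    T, d = frames.shape                                   # e.g. d = 39 for MFCC + deltas
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    out = np.hstack([padded[i:i + T] for i in range(2 * context + 1)])
    return out                                            # shape (T, (2*context+1)*d) = (T, 429)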

These counts are almost in line with the experiments of [30, 22, 32]. For training, we used the standard phone set of 48 phonemes, collapsed into 39 phonemes for evaluation as in [28]. This means that the output layer has 48 nodes, but at evaluation time the 48 phonemes are mapped down to 39. For example, if a frame labelled "sh" is classified as "zh", or vice versa, it is counted as a correct classification, since both map to the same evaluation phoneme. We follow this procedure because it is the standard way of experimenting with TIMIT in speech recognition, as also reported by [30, 32]. As with MNIST, the training set is divided into a labeled portion and an unlabeled portion. Speaker-dependent mean and variance normalization was also applied.
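Evaluation with the collapsed phone set then reduces to mapping both the predicted and the reference 48-class labels through the table in the Appendix before counting correct frames. A short sketch (map_48_to_39 is assumed to be the dictionary form of Table 7.1):

def frame_accuracy(pred_48, ref_48, map_48_to_39):
    """Frame-level accuracy after collapsing 48 training phones to 39 evaluation phones."""
    correct = sum(map_48_to_39[p] == map_48_to_39[r] for p, r in zip(pred_48, ref_48))
    return correct / float(len(ref_48))

# e.g. map_48_to_39["zh"] == "sh", so confusing "zh" with "sh" is not counted as an error.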


Figure 5.2: Spectrogram of an utterance from the TIMIT dataset (frequency bands versus frames).

5.2 Experimental Setup

There were two sets of experiments, one for each of the two datasets described above. For each dataset, the training was performed several times, varying the percentage of labelled samples in the training set. For MNIST, the percentage of labelled points is varied from 0.2% to 6%, as shown in Table 5.1, while for TIMIT it ranges from 1% to 30%, as shown in Table 5.2. Other researchers who worked on this problem with different models use a similar variation [20, 32, 48]. Note that the range for MNIST is smaller than that for TIMIT, as MNIST is a smaller dataset with fewer targets (10) than TIMIT (48); a model can give high accuracy even when it has seen just a small fraction of the labels. For every training session, the values of the hyperparameters were tuned by minimising the classification error on the validation set. The standard values have already been given in Table 4.1, but some of these values vary with the dataset and with the percentage of labelled samples in the training set. All the models obtained from these training sessions were tested on the respective test sets.

Results on MNIST
Labelled Obs.        Neural Network               SSSAE                      α
  %        #         valid. acc.   test acc.      valid. acc.   test acc.
 0.2      100        75.8          74.04          77.6          76.2          50
 1        500        87.01         86.39          89.21         88.1          300
 1.2      600        87.56         86.64          89.53         89.35         300
 2       1000        89.47         89.14          91.43         91.07         600
 6       3000        92.61         92.41          94.42         94.03        1400

Table 5.1: Results on the validation set and test set for a neural network and our model on MNIST, with different percentages of labeled data in training. The total number of datapoints is 50000. The value of α optimised on the validation set is given in the last column.

5.3 Practical Setup

We used Kaldi [36] and PDNN [33] for feature extraction, and Theano [23] for symbolic algebra and computation on GPUs. Kaldi is speech recognition software written in C++, which comes with several pre-built functions and recipes for splicing waveforms and for extracting useful features, such as MFCCs and filter-bank features, from the raw waveform. The experiments were run on a Titan X card donated by NVIDIA Corporation, installed in an Ubuntu 14.04 based machine.

5.4 Results

Table 5.1 and Figure 5.3 show the performance of our model and compare it to that of a standard two-layer neural network (NN) on MNIST. We can observe that our model performs better than a standard neural network for all the tested percentages of labeled data. The results for the neural network match those reported in the literature, as in [20, 48].

Note that the number of hidden units is also a hyperparameter in our experiments, and we determined its value through cross-validation, similarly to [6]. The other hyperparameters of the neural network, namely the learning rate, the momentum and the size of the hidden layer, were optimized for the experiments with the NN. In addition to all the hyperparameters of a neural network already mentioned, we have an additional parameter, α, which controls the contribution of the reconstruction error and the misclassification error to the final objective function. The values of the hyperparameters which give the best results on the validation set are reported in the tables as well.

Table 5.2 and Figure 5.4 show the frame-level classification accuracy for a neural network and a single-layer semi-supervised sparse autoencoder for varying percentages of labeled data. As for MNIST, the hyperparameters are tuned using the validation set. The neural network contains 2000 units in the hidden layer, as in [30, 22]. We observed that the results we obtain match the results reported in the literature mentioned. We used an adaptive learning rate scheme with linear decay, in which the learning rate decays linearly after a certain number of epochs.


Figure 5.3: Accuracy on the validation set and test set on MNIST for a varying number of labels. The total number of training points is 50000. For the semi-supervised sparse autoencoder, the points which do not have any labels are also used in training.

The tables show both the validation set accuracy and the test set accuracy. In Table 5.3, we compare the frame-level phonetic classification performance of our system to a simple neural network and to the graph-based semi-supervised learning models published in [30], using just 10% labeled data, and observe that our system performs better than all the techniques mentioned except the pMP algorithm.

Figure 5.5 shows the values of the hyperparameter α optimised on the validation set for the experiments on MNIST (5.5a) and TIMIT (5.5b), plotted against the percentage of labeled data. We observe that α increases with the percentage of labeled data.

Figure 5.6 shows how the reconstruction error and the misclassification error decrease over the training epochs with the mini-batch SGD (Stochastic Gradient Descent) optimizer. This experiment was done using 10% labeled data on TIMIT. We can observe from the plots that saturation starts at around 50-60 epochs, and that the reconstruction error saturates earlier, i.e. it decreases quickly at first and then flattens out sooner than the misclassification error. We also observe that the best results are obtained when both errors decrease at almost the same rate. If the two costs are not properly balanced by the hyperparameter α, one cost may fall at a much faster rate than the other, and it may happen that one cost actually starts increasing while the other is still going down. We observed that such instances may give an overall lower classification accuracy.

Results on TIMIT
Labelled Obs.        Neural Network               SSSAE                      α
  %        #         valid. acc.   test acc.      valid. acc.   test acc.
  1      10688       57.46         57.93          59.65         59.84         100
  3      32065       61.71         61.31          64.12         64.20         150
  5      53441       63.20         63.30          65.44         65.71         150
 10     106881       65.78         65.82          66.96         67.03         400
 20     213763       68.02         67.80          69.31         69.18         600
 30     320644       69.08         68.83          69.80         69.65         900

Table 5.2: Results on frame classification for the validation set and test set for a neural network and our model on TIMIT, with different percentages of labeled data in training. The total number of frames is 1068818. The value of α optimized on the validation set is given in the last column.


Figure 5.4: Frame-level phoneme prediction accuracy on the validation set and test set on TIMIT for a varying amount of labels. The total number of frames in the training set is 1068818. For the semi-supervised sparse autoencoder, the points which do not have any labels are also used in training.


Figure 5.5: The tuned values of α plotted against the amount of labeled data: (a) MNIST, (b) TIMIT.


Figure 5.6: The figure contains two plots: the plot at the top shows the reconstruction cost against the number of epochs, and the plot at the bottom shows how the misclassification cross-entropy loss changes with the number of epochs during training of the network. The total cost is the weighted sum of the two costs shown here.

                   Amount of Training Data
System               10%            30%
NN                   65.94          69.24
LP                   65.47          69.24
MP                   65.48          69.24
MAD                  66.53          70.25
pMP                  67.22          71.06
SSSAE                67.03          69.65

Table 5.3: Accuracy rates (%) for frame-based phoneme classification on TIMIT for the baseline (MLP), the four different GBL SSL algorithms, and our model, the Semi-Supervised Sparse Autoencoder (SSSAE).

Chapter 6

Discussion, Conclusion and Future Work

This report concludes with a discussion of the performance of the algorithm and of the evaluation methods proposed.

6.1 Hypotheses discussed

6.1.1 H.1: Do Semi-Supervised Sparse Autoencoders perform better than neural networks on phone classification?
The results from our experiments validate the ability of autoencoders to use unlabeled data to find the manifold of the data distribution, which can effectively boost the classification accuracy on labeled data in the speech domain, for the task of frame-level phoneme classification.

6.1.2 H.2: Does the above result generalize to other domains?
We have observed a similar improvement in the classification accuracy of written digits on the MNIST dataset. This suggests that the improvements obtained by this model may extend to a variety of tasks. The improvement is present when the percentage of labeled data is varied from 0.2% to 6%.

6.1.3 H.3: Do Semi-Supervised Sparse Autoencoders perform better than GBL SSL methods on phoneme classification?
In this case, the results we obtain are similar to those found in the literature under similar conditions. We can conclude that our method obtains comparable results with smaller computational requirements. In order to quantify the difference, however, we should perform a detailed complexity analysis of both approaches.

We note that autoencoders are easy to model and conceptually easy to explain. We also observed that we did not need any particular initialisation or pretraining of the network, as is sometimes required when training autoencoders. This is an improvement over a standard autoencoder that could be explored further in future work.


6.2 Evaluation Method

The evaluation method used for the purpose of this thesis was classification accuracy. We performed experiments on two different datasets. For MNIST, the criterion was the classification accuracy on the test set. For TIMIT, the criterion was the classification accuracy of phonemes over all the individual speech frames in the test set.

Regarding the choice of the percentages of labeled data at which the experiments were performed, we can argue that in practice there is usually an abundance of unlabeled data, so it is most important to investigate our models when the percentage of labeled data is very low. Moreover, as the percentage of labeled datapoints increases, the accuracy starts to saturate, and any subsequent increase in the percentage of labeled examples does not bring as much improvement in accuracy as was observed at the lower percentages. This seems reasonable: after a certain amount of labeled data, the model knows the input distribution almost completely, and only small improvements are observed from then on. The size of the labeled portion in our experiments is varied from 1% to 30% of the total number of training frames.

Finally, it is also important to note that most standard and industry-grade speech processing systems need to handle units such as phones and words that are inherently of variable length. In other words, an improvement in the accuracy of frame-level phoneme classification is a good indicator, but it cannot be said with certainty that the overall speech recognition accuracy, measured in terms of word recognition, will also increase. Still, most researchers in the community focus on frame-level phoneme accuracy as a first goal, before moving on to the harder challenge of word recognition and the recognition of variable-length "large" speech segments such as sentences.

6.3 Effect of α in the model

It can be observed from Figure 5.5 that the value of α increases as the proportion of labeled data in the training set increases. The factor α plays an important role in balancing the two different kinds of cost involved. The two costs are complementary and serve different purposes: the classification cost is discriminative in nature, but it cannot be the only objective, as that would lead to severe overfitting; the reconstruction cost acts as a regulariser and makes the network capture the structure or distribution of the data. One possible explanation for the monotonic increase is that, with more labeled data, the network has already seen enough unlabeled data to decipher its structure, so it can concentrate on the classification cost to obtain better classification accuracy on the validation set. We used cross-validation to tune the value of α for each experiment. Generally, the plot of the validation error against α is a U-shaped curve: the accuracy increases, reaches its maximum at a certain value of α, and then starts dropping again.

6.4 Future Work

There are many exciting ideas and research directions in which this work could be extended. Firstly, we could use the filter-bank features described in Section 2.2 on feature processing from the raw waveform. Though MFCCs and their deltas have been effective as features for phoneme recognition, they are highly compressed; using filter-bank features instead might give higher accuracy. One objective of using MFCCs was to compare with

the results of [30], who also used MFCCs for graph-based semi-supervised learning. As discussed above, a logical next step would be to apply the algorithm to a segment classification task; that way, it could be said with more certainty that the algorithm is suitable for speech processing and recognition applications. One more research direction could be to use even more advanced and powerful autoencoders, such as variational autoencoders or adversarial autoencoders, on speech data in a semi-supervised setting.

6.5 Society and Ethics

Speech recognition has in the past decade reached the end user in the form of hearing aids, and through the popularisation of dialogue systems such as Apple's Siri, Alexa by Amazon and Cortana by Microsoft. The potential effect of research on these models is to improve quality of life, especially for people with disabilities. Speech technology-based aids are often not available for minority languages: for those languages, the limited commercial interest does not justify the investment required to collect the expensive hand-annotated speech material that is necessary to train speech recognition models. The purpose of this thesis is to improve existing techniques in speech recognition so that the need for expensive annotated recordings is reduced. This would make speech technology available to all citizens, regardless of whether they speak a popular language or a more remote one. During this thesis we used standardised data sets with speech recordings from American English speakers. The relevant ethical values were considered in these recordings, including anonymity, gender balance and dialectal balance.

Chapter 7

Appendix

Phoneme Mapping
Phoneme-48   Phoneme-39        Phoneme-48   Phoneme-39
aa           aa                iy           iy
ae           ae                jh           jh
ah           ah                k            k
ao           aa                l            l
aw           aw                m            m
ax           ah                n            n
ay           ay                ng           ng
b            b                 ow           ow
ch           ch                oy           oy
cl           sil               p            p
d            d                 r            r
dh           dh                s            s
dx           dx                sh           sh
eh           eh                sil          sil
el           l                 t            t
en           n                 th           th
epi          sil               uh           uh
er           er                uw           uw
ey           ey                v            v
f            f                 vcl          sil
g            g                 w            w
hh           hh                y            y
ih           ih                z            z
ix           ih                zh           sh

Table 7.1: Mapping between 48 phonemes used for training and 39 phonemes used for evaluation

Bibliography

[1] Ossama Abdel-Hamid, Abdel-Rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 22(10):1533–1545, October 2014.

[2] Andrei Alexandrescu and Katrin Kirchhoff. Graph-based learning for phonetic classification. In IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU 2007, Kyoto, Japan, December 9-13, 2007, pages 359–364, 2007.

[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press, Cambridge, MA, 2007.

[4] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade - Second Edition, pages 437–478. 2012.

[5] Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade - Second Edition, pages 421–436. 2012.

[6] A. Coates, H. Lee, and A.Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR Workshop and Conference Proceedings, pages 215–223. JMLR W&CP, 2011.

[7] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, Jan 2012.

[8] Steven Davis and Paul Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4):357–366, 1980.

[9] Dong Yu, Balakrishnan Varadarajan, Li Deng, and Alex Acero. Active learning and semi-supervised learning for speech recognition: A unified framework using the global entropy reduction maximization criterion. Elsevier, January 2009.

[10] William M. Fisher, George R. Doddington, and Kathleen M. Goudie-Marshall. The darpa speech recognition research database: Specifications and status. In Proceedings of DARPA Workshop on Speech Recognition, pages 93–99, 1986.

[11] M.J.F. Gales. Maximum likelihood linear transformations for hmm-based speech recognition. Computer Speech and Language, 12:75–98, 1998.


[12] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, pages 249–256, 2010.

[13] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In François Denis, editor, Actes de CAP 05, Conférence francophone sur l'apprentissage automatique - 2005, Nice, France, du 31 mai au 3 juin 2005, pages 281–296. PUG, 2005.

[14] Alex Graves, Abdel rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, pages 6645–6649. IEEE, 2013.

[15] Frantisek Grézl and Martin Karafiát. Semi-supervised bootstrapping approach for neural network feature extractor training. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, December 8-12, 2013, pages 470–475, 2013.

[16] Jui-Ting Huang and Mark Hasegawa-Johnson. Semi-supervised training of gaussian mixture models by conditional entropy minimization. In Takao Kobayashi, Keikichi Hirose, and Satoshi Nakamura, editors, INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1353–1356. ISCA, 2010.

[17] Yan Huang, Dong Yu, Yifan Gong, and Chaojun Liu. Semi-supervised GMM and DNN acoustic model training with multi-system combination and confidence recalibration. In INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, Lyon, France, August 25-29, 2013, pages 2360–2364, 2013.

[18] Thorsten Joachims. Transductive inference for text classification using support vector machines. In Ivan Bratko and Saso Dzeroski, editors, Proceedings of ICML-99, 16th International Conference on Machine Learning, pages 200–209, Bled, SL, 1999. Morgan Kaufmann Publishers, San Francisco, US.

[19] Kelley Kilanski, Jonathan Malkin, Xiao Li, Richard Wright, and Jeff A. Bilmes. The vocal joystick data collection effort and vowel corpus. In INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, September 17-21, 2006, 2006.

[20] Diederik P. Kingma, Danilo Jimenez Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. CoRR, abs/1406.5298, 2014.

[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[22] John Labiak and Karen Livescu. Nearest neighbors with learned distances for phonetic frame classification. In Interspeech, 2011.

[23] Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. 2012.

[24] Lori Lamel, Jean-Luc Gauvain, and Gilles Adda. Investigating lightly supervised acoustic model training. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2001, 7-11 May, 2001, Salt Palace Convention Center, Salt Lake City, Utah, USA, Proceedings, pages 477–480. IEEE, 2001.

[25] Lori Lamel, Jean-Luc Gauvain, and Gilles Adda. Lightly supervised and unsupervised acoustic model training. Computer Speech & Language, 16(1):115–129, 2002.

[26] T. Landauer, C. Kamm, and S. Singhal. Learning a minimally structured back propagation network to recognize speech. In 9th Annu. Conf. Cogn. Sci. Soc., pages 531–536, 1987.

[27] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/, 2010.

[28] K. F. Lee and H. W. Hon. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11):1641–1648, Nov 1989.

[29] C. J. Leggetter and P. C. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, pages 171–185, 1995.

[30] Yuzong Liu and Katrin Kirchhoff. Graph-based semi-supervised learning for phone and segment classification. In Frédéric Bimbot, Christophe Cerisara, Cécile Fougeron, Guillaume Gravier, Lori Lamel, François Pellegrino, and Pascal Perrier, editors, INTERSPEECH, pages 1840–1843. ISCA, 2013.

[31] Yuzong Liu and Katrin Kirchhoff. Graph-based semi-supervised acoustic modeling in DNN-based speech recognition. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pages 177–182, 2014.

[32] Yuzong Liu and Katrin Kirchhoff. Graph-based semisupervised learning for acoustic modeling in automatic speech recognition. IEEE/ACM Trans. Audio, Speech & Language Processing, 24(11):1946–1956, 2016.

[33] Y. Miao. Kaldi+PDNN: Building DNN-based ASR Systems with Kaldi and PDNN. ArXiv e-prints, January 2014.

[34] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Johannes Fürnkranz and Thorsten Joachims, editors, ICML, pages 807–814. Omnipress, 2010.

[35] Andrew Ng. Stanford university lecture notes on sparse autoencoders. 2011.

[36] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, et al. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011. IEEE Catalog No.: CFP11SRW-USB.

[37] Daniel Povey, Dimitri Kanevsky, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, and Karthik Visweswariah. Boosted MMI for model and feature-space discriminative training. In ICASSP, pages 4057–4060. IEEE, 2008.

[38] Abdel-rahman Mohamed, Tara N. Sainath, George E. Dahl, Bhuvana Ramabhadran, Geoffrey E. Hinton, and Michael A. Picheny. Deep belief networks using discriminative features for phone recognition. In ICASSP, pages 5060–5063. IEEE, 2011.

[39] Marc'Aurelio Ranzato and Martin Szummer. Semi-supervised learning of compact document representations with deep networks. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, ICML, volume 307 of ACM International Conference Proceeding Series, pages 792–799. ACM, 2008.

[40] Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-training of object detection models. In WACV/MOTION, pages 29–36. IEEE Computer Society, 2005.

[41] Tara Sainath, Brian Kingsbury, Abdel-Rahman Mohamed, and Bhuvana Ramabhadran. Learning filter banks within a deep neural network framework. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2014.

[42] George Saon, Hong-Kwang Jeff Kuo, Steven J. Rennie, and Michael Picheny. The IBM 2015 english conversational telephone speech recognition system. CoRR, abs/1505.05899, 2015.

[43] Frank Seide, Gang Li, Xie Chen, and Dong Yu. Feature engineering in context-dependent deep neural networks for conversational speech transcription. In 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU 2011, Waikoloa, HI, USA, December 11-15, 2011, pages 24–29, 2011.

[44] Amarnag Subramanya, Jeff Bilmes, and Yoshua Bengio. Semi-supervised learning with measure propagation. Journal of Machine Learning Research, pages 3311–3370, 2011.

[45] Partha Pratim Talukdar and Koby Crammer. New regularized algorithms for transductive learning. In Wray L. Buntine, Marko Grobelnik, Dunja Mladenic, and John Shawe-Taylor, editors, ECML/PKDD (2), volume 5782 of Lecture Notes in Computer Science, pages 442–457. Springer, 2009.

[46] Karel Veselý, Mirko Hannemann, and Lukáš Burget. Semi-supervised training of deep neural networks. In Proceedings of the IEEE Conference on Automatic Speech Recognition and Understanding (ASRU), pages 267–272, 2013.

[47] Frank Wessel and Hermann Ney. Unsupervised training of acoustic models for large vocabulary continuous speech recognition. IEEE Trans. Speech and Audio Processing, 13(1):23–31, 2005.

[48] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Grégoire Montavon, Genevieve B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade (2nd ed.), volume 7700 of Lecture Notes in Computer Science, pages 639–655. Springer, 2012.

[49] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, 2002.

[50] Xiaojin Zhu, Jaz S. Kandola, Zoubin Ghahramani, and John D. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. In NIPS, pages 1641–1648, 2004.