
NATIONAL UNIVERSITY OF IRELAND, GALWAY

Neural Transfer Learning for Natural Language Processing

by Sebastian Ruder

A thesis submitted in partial fulfillment for the degree of
Doctor of Philosophy

College of Engineering and Informatics
School of Engineering and Informatics

Supervisors: John G. Breslin, Parsa Ghaffari

February, 2019


Contents

Declaration of Authorship  vi
Abstract  viii
Acknowledgements  ix
List of Figures  x
List of Tables  xii
Abbreviations  xiv
Notation  xvi

1  Introduction  1
  1.1  Motivation  1
  1.2  Research objectives  3
  1.3  Contributions  4
  1.4  Thesis outline  6
  1.5  Publications  7

2  Background  10
  2.1  Probability and information theory  10
    2.1.1  Probability basics  10
    2.1.2  Distributions  15
    2.1.3  Information theory  16
  2.2  Machine learning  18
    2.2.1  Maximum likelihood estimation  19
    2.2.2  Linear regression  21
    2.2.3  Gradient descent  24
    2.2.4  Generalization  25
    2.2.5  Regularization  28
  2.3  Neural networks  30
    2.3.1  Layers and models  32
    2.3.2  Back-propagation  35
  2.4  Natural language processing  37
  2.5  Conclusions  41

3  Transfer Learning  42
  3.1  Introduction  42
    3.1.1  Definition  44
    3.1.2  Scenarios  44
    3.1.3  Taxonomy  45
  3.2  Multi-task learning  47
    3.2.1  Introduction  47
    3.2.2  Motivation  47
    3.2.3  Two methods for MTL in neural networks  48
    3.2.4  Why does MTL work?  49
    3.2.5  MTL in non-neural models  50
      3.2.5.1  Block-sparse regularization  51
      3.2.5.2  Learning task relationships  52
    3.2.6  Auxiliary tasks  55
      3.2.6.1  Common types of auxiliary tasks  57
      3.2.6.2  Related tasks used in NLP  60
    3.2.7  Summary  62
  3.3  Sequential transfer learning  63
    3.3.1  Introduction  63
      Sequential transfer learning stages  64
    3.3.2  Pretraining  65
      3.3.2.1  Distantly supervised pretraining  66
      3.3.2.2  Supervised pretraining  67
      3.3.2.3  Unsupervised pretraining  67
      3.3.2.4  Multi-task pretraining  76
      3.3.2.5  Architectures  76
    3.3.3  Adaptation  77
      3.3.3.1  Fine-tuning settings  78
      3.3.3.2  A framework for adaptation  79
    3.3.4  Lifelong learning  80
    3.3.5  Evaluation scenarios  85
    3.3.6  Summary  85
  3.4  Domain adaptation  86
    3.4.1  Introduction  86
    3.4.2  Representation approaches  87
      3.4.2.1  Distribution similarity approaches  87
      3.4.2.2  Latent feature learning  90
    3.4.3  Weighting and selecting data  93
      3.4.3.1  Instance weighting  94
      3.4.3.2  Instance selection  95
    3.4.4  Self-labelling approaches  96
      3.4.4.1  Self-training  97
      3.4.4.2  Multi-view training  97
    3.4.5  Multi-source domain adaptation  99
    3.4.6  Summary  100
  3.5  Cross-lingual learning  100
    3.5.1  Introduction  101
    3.5.2  Notation and terminology  102
    3.5.3  A typology for cross-lingual word embedding models  104
    3.5.4  Word-level alignment models  106
      3.5.4.1  Word alignment methods with parallel data  106
      3.5.4.2  Word alignment methods with comparable data  115
    3.5.5  Sentence-level alignment models  116
      3.5.5.1  Sentence alignment methods with parallel data  117
      3.5.5.2  Sentence alignment methods with comparable data  123
    3.5.6  Document-level alignment models  123
      3.5.6.1  Document alignment methods with comparable data  124
    3.5.7  Evaluation  126
    3.5.8  Summary  128
  3.6  Conclusions  129

4  Selecting Data for Domain Adaptation  130
  4.1  Learning to Select Data for Transfer Learning With Bayesian Optimization  131
    4.1.1  Introduction  131
    4.1.2  Data selection model  132
      4.1.2.1  Bayesian Optimization for data selection  133
      4.1.2.2  Features  133
    4.1.3  Experiments  135
    4.1.4  Results  137
    4.1.5  Related work  143
    4.1.6  Summary  144
  4.2  Strong Baselines for Neural Semi-supervised Learning under Domain Shift  145
    4.2.1  Introduction  145
    4.2.2  Neural bootstrapping methods  146
      4.2.2.1  Self-training  147
      4.2.2.2  Tri-training  148
      4.2.2.3  Multi-task tri-training  149
    4.2.3  Experiments  152
      4.2.3.1  POS tagging  152
      4.2.3.2  Sentiment analysis  153
      4.2.3.3  Baselines  153
      4.2.3.4  Results  153
    4.2.4  Related work  158
    4.2.5  Summary  159
  4.3  Conclusions  159

5  Unsupervised and Weakly Supervised Cross-lingual Learning  160
  5.1  The Limitations of Unsupervised Bilingual Dictionary Induction  161
    5.1.1  Introduction  161
    5.1.2  How similar are embeddings across languages?  162
    5.1.3  Unsupervised cross-lingual learning  164
      5.1.3.1  Learning scenarios  164
      5.1.3.2  Summary of Conneau et al. [2018a]  165
      5.1.3.3  A simple supervised method  166
    5.1.4  Experiments  167
      5.1.4.1  Experimental setup  167
      5.1.4.2  Impact of language similarity  167
      5.1.4.3  Impact of domain differences  169
      5.1.4.4  Impact of hyper-parameters  171
      5.1.4.5  Impact of dimensionality  172
      5.1.4.6  Impact of evaluation procedure  173
      5.1.4.7  Evaluating eigenvector similarity  174
    5.1.5  Related work  174
    5.1.6  Summary  175
  5.2  A Discriminative Latent-Variable Model for Bilingual Lexicon Induction  176
    5.2.1  Introduction  176
      5.2.1.1  Graph-theoretic formulation  177
      5.2.1.2  Word embeddings  178
    5.2.2  A latent-variable model  178
    5.2.3  Parameter estimation  181
      5.2.3.1  Viterbi E-Step  …