Neural Transfer Learning for Natural Language Processing

NATIONAL UNIVERSITY OF IRELAND, GALWAY

Neural Transfer Learning for Natural Language Processing

by Sebastian Ruder

A thesis submitted in partial fulfillment for the degree of Doctor of Philosophy in the College of Engineering and Informatics, School of Engineering and Informatics

Supervisors: John G. Breslin, Parsa Ghaffari

February, 2019

Contents

Declaration of Authorship
Abstract
Acknowledgements
List of Figures
List of Tables
Abbreviations
Notation

1 Introduction
  1.1 Motivation
  1.2 Research objectives
  1.3 Contributions
  1.4 Thesis outline
  1.5 Publications

2 Background
  2.1 Probability and information theory
    2.1.1 Probability basics
    2.1.2 Distributions
    2.1.3 Information theory
  2.2 Machine learning
    2.2.1 Maximum likelihood estimation
    2.2.2 Linear regression
    2.2.3 Gradient descent
    2.2.4 Generalization
    2.2.5 Regularization
  2.3 Neural networks
    2.3.1 Layers and models
    2.3.2 Back-propagation
  2.4 Natural language processing
  2.5 Conclusions

3 Transfer Learning
  3.1 Introduction
    3.1.1 Definition
    3.1.2 Scenarios
    3.1.3 Taxonomy
  3.2 Multi-task learning
    3.2.1 Introduction
    3.2.2 Motivation
    3.2.3 Two methods for MTL in neural networks
    3.2.4 Why does MTL work?
    3.2.5 MTL in non-neural models
      3.2.5.1 Block-sparse regularization
      3.2.5.2 Learning task relationships
    3.2.6 Auxiliary tasks
      3.2.6.1 Common types of auxiliary tasks
      3.2.6.2 Related tasks used in NLP
    3.2.7 Summary
  3.3 Sequential transfer learning
    3.3.1 Introduction
      Sequential transfer learning stages
    3.3.2 Pretraining
      3.3.2.1 Distantly supervised pretraining
      3.3.2.2 Supervised pretraining
      3.3.2.3 Unsupervised pretraining
      3.3.2.4 Multi-task pretraining
      3.3.2.5 Architectures
    3.3.3 Adaptation
      3.3.3.1 Fine-tuning settings
      3.3.3.2 A framework for adaptation
    3.3.4 Lifelong learning
    3.3.5 Evaluation scenarios
    3.3.6 Summary
  3.4 Domain adaptation
    3.4.1 Introduction
    3.4.2 Representation approaches
      3.4.2.1 Distribution similarity approaches
      3.4.2.2 Latent feature learning
    3.4.3 Weighting and selecting data
      3.4.3.1 Instance weighting
      3.4.3.2 Instance selection
    3.4.4 Self-labelling approaches
      3.4.4.1 Self-training
      3.4.4.2 Multi-view training
    3.4.5 Multi-source domain adaptation
    3.4.6 Summary
  3.5 Cross-lingual learning
    3.5.1 Introduction
    3.5.2 Notation and terminology
    3.5.3 A typology for cross-lingual word embedding models
    3.5.4 Word-level alignment models
      3.5.4.1 Word alignment methods with parallel data
      3.5.4.2 Word alignment methods with comparable data
    3.5.5 Sentence-level alignment models
      3.5.5.1 Sentence alignment methods with parallel data
      3.5.5.2 Sentence alignment methods with comparable data
    3.5.6 Document-level alignment models
      3.5.6.1 Document alignment methods with comparable data
    3.5.7 Evaluation
    3.5.8 Summary
  3.6 Conclusions

4 Selecting Data for Domain Adaptation
  4.1 Learning to Select Data for Transfer Learning With Bayesian Optimization
    4.1.1 Introduction
    4.1.2 Data selection model
      4.1.2.1 Bayesian Optimization for data selection
      4.1.2.2 Features
    4.1.3 Experiments
    4.1.4 Results
    4.1.5 Related work
    4.1.6 Summary
  4.2 Strong Baselines for Neural Semi-supervised Learning under Domain Shift
    4.2.1 Introduction
    4.2.2 Neural bootstrapping methods
      4.2.2.1 Self-training
      4.2.2.2 Tri-training
      4.2.2.3 Multi-task tri-training
    4.2.3 Experiments
      4.2.3.1 POS tagging
      4.2.3.2 Sentiment analysis
      4.2.3.3 Baselines
      4.2.3.4 Results
    4.2.4 Related work
    4.2.5 Summary
  4.3 Conclusions

5 Unsupervised and Weakly Supervised Cross-lingual Learning
  5.1 The Limitations of Unsupervised Bilingual Dictionary Induction
    5.1.1 Introduction
    5.1.2 How similar are embeddings across languages?
    5.1.3 Unsupervised cross-lingual learning
      5.1.3.1 Learning scenarios
      5.1.3.2 Summary of Conneau et al. [2018a]
      5.1.3.3 A simple supervised method
    5.1.4 Experiments
      5.1.4.1 Experimental setup
      5.1.4.2 Impact of language similarity
      5.1.4.3 Impact of domain differences
      5.1.4.4 Impact of hyper-parameters
      5.1.4.5 Impact of dimensionality
      5.1.4.6 Impact of evaluation procedure
      5.1.4.7 Evaluating eigenvector similarity
    5.1.5 Related work
    5.1.6 Summary
  5.2 A Discriminative Latent-Variable Model for Bilingual Lexicon Induction
    5.2.1 Introduction
      5.2.1.1 Graph-theoretic formulation
      5.2.1.2 Word embeddings
    5.2.2 A latent-variable model
    5.2.3 Parameter estimation
      5.2.3.1 Viterbi E-Step
...

What the Neurocognitive Study of Inner Language Reveals About Our Inner Space, by Hélène Loevenbruck

CNS 2014 Program

Review on Parse Tree Generation in Natural Language Processing

Building a Treebank for French

Sentiment Analysis for Multilingual Corpora

Talking to Computers in Natural Language

Revealing the Language of Thought, by Brent Silby

Natural Language Processing (NLP) for Requirements Engineering: a Systematic Mapping Study

Detecting Politeness in Natural Language, by Michael Yeomans, Alejandro Kantor, Dustin Tingley

Text-Based Relation Extraction from the Web

Natural Language Processing

“Linguistic Analysis” As a Misnomer, Or, Why Linguistics Is in a State of Permanent Crisis