Neural Transfer Learning for Natural Language Processing

NATIONAL UNIVERSITY OF IRELAND, GALWAY

Neural Transfer Learning for Natural Language Processing

by Sebastian Ruder

A thesis submitted in partial fulfillment for the degree of
Doctor of Philosophy
in the College of Engineering and Informatics
School of Engineering and Informatics

Supervisors: John G. Breslin, Parsa Ghaffari

February, 2019


Contents

Declaration of Authorship
Abstract
Acknowledgements
List of Figures
List of Tables
Abbreviations
Notation

1 Introduction
  1.1 Motivation
  1.2 Research objectives
  1.3 Contributions
  1.4 Thesis outline
  1.5 Publications

2 Background
  2.1 Probability and information theory
    2.1.1 Probability basics
    2.1.2 Distributions
    2.1.3 Information theory
  2.2 Machine learning
    2.2.1 Maximum likelihood estimation
    2.2.2 Linear regression
    2.2.3 Gradient descent
    2.2.4 Generalization
    2.2.5 Regularization
  2.3 Neural networks
    2.3.1 Layers and models
    2.3.2 Back-propagation
  2.4 Natural language processing
  2.5 Conclusions

3 Transfer Learning
  3.1 Introduction
    3.1.1 Definition
    3.1.2 Scenarios
    3.1.3 Taxonomy
  3.2 Multi-task learning
    3.2.1 Introduction
    3.2.2 Motivation
    3.2.3 Two methods for MTL in neural networks
    3.2.4 Why does MTL work?
    3.2.5 MTL in non-neural models
      3.2.5.1 Block-sparse regularization
      3.2.5.2 Learning task relationships
    3.2.6 Auxiliary tasks
      3.2.6.1 Common types of auxiliary tasks
      3.2.6.2 Related tasks used in NLP
    3.2.7 Summary
  3.3 Sequential transfer learning
    3.3.1 Introduction
      Sequential transfer learning stages
    3.3.2 Pretraining
      3.3.2.1 Distantly supervised pretraining
      3.3.2.2 Supervised pretraining
      3.3.2.3 Unsupervised pretraining
      3.3.2.4 Multi-task pretraining
      3.3.2.5 Architectures
    3.3.3 Adaptation
      3.3.3.1 Fine-tuning settings
      3.3.3.2 A framework for adaptation
    3.3.4 Lifelong learning
    3.3.5 Evaluation scenarios
    3.3.6 Summary
  3.4 Domain adaptation
    3.4.1 Introduction
    3.4.2 Representation approaches
      3.4.2.1 Distribution similarity approaches
      3.4.2.2 Latent feature learning
    3.4.3 Weighting and selecting data
      3.4.3.1 Instance weighting
      3.4.3.2 Instance selection
    3.4.4 Self-labelling approaches
      3.4.4.1 Self-training
      3.4.4.2 Multi-view training
    3.4.5 Multi-source domain adaptation
    3.4.6 Summary
  3.5 Cross-lingual learning
    3.5.1 Introduction
    3.5.2 Notation and terminology
    3.5.3 A typology for cross-lingual word embedding models
    3.5.4 Word-level alignment models
      3.5.4.1 Word alignment methods with parallel data
      3.5.4.2 Word alignment methods with comparable data
    3.5.5 Sentence-level alignment models
      3.5.5.1 Sentence alignment methods with parallel data
      3.5.5.2 Sentence alignment methods with comparable data
    3.5.6 Document-level alignment models
      3.5.6.1 Document alignment methods with comparable data
    3.5.7 Evaluation
    3.5.8 Summary
  3.6 Conclusions

4 Selecting Data for Domain Adaptation
  4.1 Learning to Select Data for Transfer Learning with Bayesian Optimization
    4.1.1 Introduction
    4.1.2 Data selection model
      4.1.2.1 Bayesian Optimization for data selection
      4.1.2.2 Features
    4.1.3 Experiments
    4.1.4 Results
    4.1.5 Related work
    4.1.6 Summary
  4.2 Strong Baselines for Neural Semi-supervised Learning under Domain Shift
    4.2.1 Introduction
    4.2.2 Neural bootstrapping methods
      4.2.2.1 Self-training
      4.2.2.2 Tri-training
      4.2.2.3 Multi-task tri-training
    4.2.3 Experiments
      4.2.3.1 POS tagging
      4.2.3.2 Sentiment analysis
      4.2.3.3 Baselines
      4.2.3.4 Results
    4.2.4 Related work
    4.2.5 Summary
  4.3 Conclusions

5 Unsupervised and Weakly Supervised Cross-lingual Learning
  5.1 The Limitations of Unsupervised Bilingual Dictionary Induction
    5.1.1 Introduction
    5.1.2 How similar are embeddings across languages?
    5.1.3 Unsupervised cross-lingual learning
      5.1.3.1 Learning scenarios
      5.1.3.2 Summary of Conneau et al. [2018a]
      5.1.3.3 A simple supervised method
    5.1.4 Experiments
      5.1.4.1 Experimental setup
      5.1.4.2 Impact of language similarity
      5.1.4.3 Impact of domain differences
      5.1.4.4 Impact of hyper-parameters
      5.1.4.5 Impact of dimensionality
      5.1.4.6 Impact of evaluation procedure
      5.1.4.7 Evaluating eigenvector similarity
    5.1.5 Related work
    5.1.6 Summary
  5.2 A Discriminative Latent-Variable Model for Bilingual Lexicon Induction
    5.2.1 Introduction
      5.2.1.1 Graph-theoretic formulation
      5.2.1.2 Word embeddings
    5.2.2 A latent-variable model
    5.2.3 Parameter estimation
      5.2.3.1 Viterbi E-Step ...
