© 2014 Gourab Kundu

DOMAIN ADAPTATION WITH MINIMAL TRAINING

BY

GOURAB KUNDU

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2014

Urbana, Illinois

Doctoral Committee:

Professor Dan Roth, Chair
Professor ChengXiang Zhai
Assistant Professor Julia Hockenmaier
Associate Professor Hal Daumé III, University of Maryland

Abstract

Machine learning models trained on labeled data from one domain degrade severely when tested on a different domain. Traditional approaches deal with this problem by training a new model for every new domain. In natural language processing, top-performing systems often use multiple interconnected models, so retraining all of them for every new domain is computationally expensive. This thesis studies how to adapt a system trained on one domain to a new domain while avoiding the cost of retraining.

The thesis identifies two key ingredients for adaptation without training: broad-coverage resources and constraints. We show how resources such as Wikipedia, VerbNet, and WordNet, which comprehensively cover entities, semantic roles, and words in English, can help a model adapt to a new domain. For the task of semantic role labeling, we show that in the decision phase we can replace a linguistic unit (e.g., a verb or word) with an equivalent linguistic unit residing in the same cluster defined in these resources (e.g., VerbNet, WordNet), so that after replacement the text becomes more like the text on which the model was trained. We show that the model's output is more accurate on the transformed text than on the original text. In another instance, we show how a system that links mentions to Wikipedia concepts can be used to adapt a named entity recognition system. Since Wikipedia has broad domain coverage, the linking system is robust to domain variations; therefore, jointly performing entity recognition and linking improves the accuracy of entity recognition on new domains without requiring a new system to be trained for each new domain.

In all cases, we show how to use intuitive constraints to guide the model toward coherent predictions. We show how incorporating prior knowledge about a new domain as declarative constraints in the decision phase can improve a model's performance on that domain. When such prior knowledge is unavailable, we show how to acquire it automatically from unlabeled text of the new domain and of domains similar to both the new and the old domain.

To my family.
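The transformation idea summarized in the abstract lends itself to a compact illustration. The following is a minimal sketch, not the thesis's actual system: the clusters, vocabulary, and sentence are toy stand-ins for the VerbNet/WordNet resources and the trained model it describes.

    # Minimal sketch of transformation-based adaptation: replace a word
    # the source-domain model never saw with an "equivalent" word from the
    # same cluster that the model does know. The clusters and vocabulary
    # below are toy assumptions, not the thesis's actual resources.

    # Toy clusters: each set groups words treated as interchangeable.
    CLUSTERS = [
        {"sprint", "run", "dash"},
        {"physician", "doctor"},
    ]

    # Words the hypothetical source-domain model saw during training.
    TRAIN_VOCAB = {"run", "doctor", "the", "fast", "saw", "a"}

    def transform(tokens):
        """Map each out-of-vocabulary token to an in-vocabulary cluster mate, if any."""
        out = []
        for tok in tokens:
            if tok in TRAIN_VOCAB:
                out.append(tok)
                continue
            replacement = tok
            for cluster in CLUSTERS:
                if tok in cluster:
                    known = cluster & TRAIN_VOCAB
                    if known:
                        replacement = sorted(known)[0]  # deterministic choice
                    break
            out.append(replacement)
        return out

    print(transform("the physician saw a sprint".split()))
    # -> ['the', 'doctor', 'saw', 'a', 'run']

The design point is that adaptation happens entirely at decision time: the model's parameters never change; only its input is rewritten into vocabulary the source-domain model has seen.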
Table of Contents

List of Tables
List of Figures
List of Abbreviations

1 Introduction
  1.1 Statistical Learning in NLP
  1.2 The Need for Domain Adaptation
  1.3 Contributions of This Thesis
  1.4 Thesis Outline

2 Basic Terminology
  2.1 Introduction
  2.2 Classification
  2.3 Learning Protocols
    2.3.1 Supervised Learning
    2.3.2 Unsupervised Learning
    2.3.3 Semi-supervised Learning
  2.4 Linear Models
  2.5 Binary and Multiclass Classification
  2.6 Structured Prediction
    2.6.1 Constrained Conditional Model
  2.7 Summary

3 Related Work
  3.1 Introduction
  3.2 Adaptation with Labeled Data in Target
    3.2.1 Source as Prior Based Methods
    3.2.2 General Distribution Based Methods
    3.2.3 Instance Weighting Based Methods
    3.2.4 Support Vector Based Methods
  3.3 Adaptation without Labeled Data in Target
    3.3.1 Theoretical Studies
    3.3.2 Bootstrapping Based Methods
    3.3.3 Representation Learning Based Methods
  3.4 Summary

4 Adaptation Using Transformation
  4.1 Introduction
  4.2 Problem Formulation
  4.3 Motivating Examples
  4.4 Transformation Functions
    4.4.1 Transformation From List
    4.4.2 Learned Transformations
  4.5 Joint Inference
  4.6 Experimental Results
  4.7 Summary

5 Prior Knowledge Driven Adaptation
  5.1 Introduction
  5.2 Tasks and Datasets
  5.3 Motivating Examples
  5.4 Inference with Prior Knowledge
  5.5 Self-training with Prior Knowledge
  5.6 Experimental Results
  5.7 Summary

6 Adaptation Using Learned Constraints
  6.1 Introduction
  6.2 Motivating Examples
  6.3 Algorithm for Learning Constraints
    6.3.1 Intermediate Domain Constraints
    6.3.2 Target Domain Constraints
    6.3.3 Prediction
  6.4 Experimental Results
    6.4.1 Gene Recognition in Biomedical Domain
    6.4.2 Entity Recognition in News Domain
  6.5 Summary

7 Adaptation Using Wikipedia
  7.1 Introduction
  7.2 Motivating Examples
  7.3 Formulation of Joint Inference of NER and Wikifier
  7.4 Experimental Results
    7.4.1 In-domain Experiments
    7.4.2 Domain Adaptation Experiments
  7.5 Summary

8 Conclusion
  8.1 Future Research Directions
    8.1.1 Extension to Labeled Adaptation Scenario
    8.1.2 Joint Inference Across NLP Tasks

References

List of Tables

1  Machine Learning Notation
2  Examples of Simplifications (Predicate is "run")
3  Comparing the single parse system on Brown
4  Comparison of the multi parse system on Brown
5  Ablation Study for ADUT-Charniak
6  Performance on Infrequent Verbs for the Transformation of Replacement of Predicate
7  Results (F1) of the Baseline Model for POS Tagging
8  Results (F1) of the Baseline Model for SRL
9  Comparison of Results (F1) for the Baseline Model versus PDA-KW
10 Comparison of Results (F1) of Different Ways of Incorporating Knowledge
11 Comparison of Results (F1) for POS Tagging with (Jiang and Zhai, 2007a)
12 The upper part of the table shows target domain text with gold labels; the lower part shows the prediction of an NER system trained on news text (blue marks person type, red marks location type)
13 Adaptation results on the Gene Recognition task; FD05 = (Finkel et al., 2005a), HY09 = (Huang and Yates, 2009)
14 Experimental results on the ACE 2005 data set; Wu09 = (Wu et al., 2009a), RR09 = (Ratinov and Roth, 2009)
15 Experimental results on CoNLL-to-Enron adaptation; RR09 = (Ratinov and Roth, 2009), HY09 = (Huang and Yates, 2009)
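A second sketch, in the same hedged spirit as the one following the abstract, illustrates the other ingredient named there: enforcing a declarative constraint in the decision phase. The per-token scores, label set, and the at-most-one-entity constraint below are hypothetical toys, and the exhaustive search stands in for the ILP or beam-search inference a real system would use.

    # Minimal sketch of constraint-driven prediction: among candidate label
    # sequences, keep only those satisfying a declarative constraint, then
    # take the highest-scoring survivor. Scores and constraint are toys.
    from itertools import product

    LABELS = ["PER", "LOC", "O"]

    # Toy per-token label scores from a hypothetical source-domain model.
    SCORES = [
        {"PER": 0.6, "LOC": 0.3, "O": 0.1},  # e.g., "Washington"
        {"PER": 0.2, "LOC": 0.1, "O": 0.7},  # e.g., "said"
    ]

    def satisfies(labels):
        """Example declarative constraint: at most one entity mention."""
        return sum(1 for y in labels if y != "O") <= 1

    def constrained_argmax(scores):
        """Exhaustive search; fine for short inputs, ILP in real systems."""
        best, best_score = None, float("-inf")
        for labels in product(LABELS, repeat=len(scores)):
            if not satisfies(labels):
                continue
            s = sum(tok[y] for tok, y in zip(scores, labels))
            if s > best_score:
                best, best_score = list(labels), s
        return best

    print(constrained_argmax(SCORES))  # -> ['PER', 'O']

Here too the source-trained model is untouched; the constraint simply removes incoherent label sequences from consideration before the argmax.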