Methoden voor efficiënte supervisie van automatische taalverwerking

Methods for Efficient Supervision in Natural Language Processing

Lucas Sterckx

Promotoren (advisors): prof. dr. ir. C. Develder, dr. ir. T. Demeester

Dissertation submitted to obtain the degree of Doctor of Computer Science Engineering

Vakgroep Informatietechnologie (Department of Information Technology)
Voorzitter (chair): prof. dr. ir. B. Dhoedt
Faculteit Ingenieurswetenschappen en Architectuur (Faculty of Engineering and Architecture)
Academic year 2017 - 2018

ISBN 978-94-6355-130-4
NUR 984
Legal deposit: D/2018/10.500/48

Ghent University
Faculty of Engineering and Architecture
Department of Information Technology
imec - Internet Technology and Data Science Lab

Examination Board:
prof. C. Develder (advisor)
dr. ir. T. Demeester (advisor)
em. prof. D. De Zutter (chair)
prof. B. Dhoedt (secretary)
prof. I. Augenstein
prof. A. Bronselaer
prof. V. Hoste

Dankwoord (Acknowledgments)

During my PhD I was supported unconditionally, through thick and thin, by many people. I thank everyone who gave me every opportunity for personal and professional growth.

A word of thanks:

to prof. Chris Develder, who placed his trust in me for five years and gave me every opportunity for scientific research, along with the freedom to walk my own path;

to Thomas Demeester, who, despite his own busy schedule, always made time and stood ready with advice and a listening ear, always with the same enthusiasm and the same kindness;

to my parents, Godelieve and Paul; to my brother and sister-in-law, Thomas and Karolien; to my godmother, Maria;

to Johannes, who stood by me with all his patience and knowledge from day one and taught me to be critical of my own and existing work;

to Nasrin, who, despite living abroad, was always cheerful and kind to everyone, all while producing an incredible amount of high-quality research;

to Giannis, or Lucas V2.0 as I like to call him, who picked up the torch and ran so fast with it, for making the future of research in NLP at IDLab look bright;

to all my research collaborators over the years, Cedric, Fréderic, Baptist, Klim, Thong, Steven, Matthias, Jason, Bill, Cornelia and Laurent, from whom I learned so much;

to the members of the examination board, em. prof. Daniël De Zutter, prof. Bart Dhoedt, prof. Isabelle Augenstein, prof. Antoon Bronselaer and prof. Veronique Hoste, for making time and providing me with nothing but constructive feedback on this thesis and my research in general;

to prof. Piet Demeester, the driving force behind the research group;

to the founders and attendees of the machine learning reading group, who gracefully shared their knowledge every week;

to the IDLab admins and support staff for always standing by with technical support and providing me with the infrastructure to do research;

to all my friends and colleagues at IDLab for providing me with such a pleasant working environment;

to all friends I made abroad, who made me feel at home when I was far from it;

to all colleagues and student workers with whom I worked on projects, for giving me the means to do research;

to the Fonds voor Wetenschappelijk Onderzoek - Vlaanderen (FWO), which supported my research through a travel grant;

to all my fellow musicians and the members of Concertband Theobaldus Groot-Pepingen;

and to all my friends from the Pajottenland and Ghent.

Gent, June 2018
Lucas Sterckx

"The really important kind of freedom involves attention, and awareness, and discipline, and effort, and being able truly to care about other people and to sacrifice for them, over and over, in myriad petty little unsexy ways, every day."
— David Foster Wallace, 2005

Table of Contents

Dankwoord
Samenvatting
Summary

1 Introduction
  1.1 The Data Bottleneck
  1.2 Beyond Traditional Supervision
    1.2.1 Semi-Supervised Learning
    1.2.2 Active Learning
    1.2.3 Multi-Task Learning
    1.2.4 Transfer Learning
    1.2.5 Weak Supervision
  1.3 Efficient Supervision for NLP
    1.3.1 Semi-Supervised Learning for NLP
    1.3.2 Distant Supervision
    1.3.3 Information Extraction using Weak Supervision
    1.3.4 Crowdsourcing
    1.3.5 Data Augmentation for NLP
    1.3.6 Transfer Learning for NLP
    1.3.7 Multi-Task Learning for NLP
  1.4 Research Contributions
  1.5 Publications
    1.5.1 Publications in international journals (listed in the Science Citation Index)
    1.5.2 Publications in international conferences (listed in the Science Citation Index)
    1.5.3 Publications in other international conferences
  References

2 Weak Supervision for Automatic Knowledge Base Population
  2.1 Introduction
  2.2 Related Work
    2.2.1 Supervised Relation Extraction
      2.2.1.1 Bootstrapping models for Relation Extraction
      2.2.1.2 Distant Supervision
    2.2.2 Semi-supervised Relation Extraction
    2.2.3 TAC KBP English Slot Filling
    2.2.4 Active Learning and Feature Labeling
    2.2.5 Distributional Semantics
  2.3 Labeling Strategy for Noise Reduction
    2.3.1 Distantly Supervised Training Data
    2.3.2 Labeling High Confidence Shortest Dependency Paths
    2.3.3 Noise Reduction using Semantic Label Propagation
  2.4 Experimental Results
    2.4.1 Testing Methodology
    2.4.2 Knowledge Base Population System
    2.4.3 Methodologies for Supervision
    2.4.4 Pattern-based Restriction vs. Similarity-based Extension
    2.4.5 End-to-End Knowledge Base Population Results
    2.4.6 2015 TAC KBP Cold Start Slot Filling
  2.5 Conclusions
  2.A Using Active Learning and Semantic Clustering for Noise Reduction in Distant Supervision
    2.A.1 Introduction
    2.A.2 Related Work
    2.A.3 Semantic-Cluster-Aware Sampling
    2.A.4 Experiments and Results
    2.A.5 Conclusion
  References

3 Weakly Supervised Evaluation of Topic Models
  3.1 Introduction
  3.2 Experimental Setup
  3.3 Topic Model Assessment
    3.3.1 Measuring Topical Alignment
    3.3.2 Semantic Coherence
    3.3.3 Graphical Alignment of Topics
  3.4 Conclusion
  References

4 Creation and Evaluation of Large Keyphrase Extraction Collections with Multiple Opinions
  4.1 Introduction
  4.2 Test Collections
    4.2.1 Document Collection
    4.2.2 Collecting Keyphrases
    4.2.3 Annotation Tool
    4.2.4 Keyphrase Annotation Guidelines
    4.2.5 Annotator Disagreement
  4.3 Keyphrase Extraction Techniques
    4.3.1 Candidate Selection
    4.3.2 Unsupervised Keyphrase Extraction
    4.3.3 Supervised Keyphrase Extraction
      4.3.3.1 Feature Design and Classification
      4.3.3.2 Supervised Model
  4.4 Systematic Evaluation
    4.4.1 Experimental Set-up
    4.4.2 Evaluation Setup using Multiple Opinions
    4.4.3 Comparison of Different Techniques
    4.4.4 Comparison of Different Test Collections
    4.4.5 Training Set Size
    4.4.6 Training Data from Multiple Opinions
    4.4.7 Effect of Document Length
  4.5 Guidelines for Automatic Keyphrase Extraction
  4.A Supervised Keyphrase Extraction as Positive Unlabeled Learning
    4.A.1 Introduction
    4.A.2 Noisy Training Data for Supervised Keyphrase Extraction
    4.A.3 Reweighting Keyphrase Candidates
      4.A.3.1 Experiments and Results
    4.A.4 Conclusion
  References

5 Sequence-to-Sequence Applications using Weak Supervision
  5.1 Introduction
  5.2 Break it Down for Me: A Study in Automated Lyric Annotation
    5.2.1 Introduction
    5.2.2 The Genius ALA Dataset
    5.2.3 Context Independent Annotation
    5.2.4 Baselines
    5.2.5 Evaluation
      5.2.5.1 Data
    5.2.6 Measures
      5.2.6.1 Hyperparameters and Optimization
    5.2.7 Results
    5.2.8 Related Work
    5.2.9 Conclusion and Future Work
  5.3 Prior Attention for Style-aware Sequence-to-Sequence Models
    5.3.1 Introduction
    5.3.2 Generation of Prior Attention