uOttawa, Canada's university
Faculty of Graduate and Postdoctoral Studies

Author of thesis: David Nadeau
Degree: Ph.D. (Computer Science)
Faculty, school, department: School of Information Technology and Engineering
Title of thesis: Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision
Thesis supervisor: Stan Matwin
Thesis co-supervisor: Peter Turney
Thesis examiners: William Cohen, Diana Inkpen, Jean-Pierre Corriveau, Nathalie Japkowicz
Dean of the Faculty of Graduate and Postdoctoral Studies: Gary W. Slater

Thesis submitted to the Faculty of Graduate and Postdoctoral Studies in partial fulfillment of the requirements for the PhD degree in Computer Science

Ottawa-Carleton Institute for Computer Science
School of Information Technology and Engineering
University of Ottawa

© David Nadeau, Ottawa, Canada, 2007

Library and Archives Canada
Published Heritage Branch
395 Wellington Street, Ottawa ON K1A 0N4, Canada
ISBN: 978-0-494-49385-4

NOTICE: The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats.

The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis. While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.
Table of contents

List of tables
List of figures
Abstract
Acknowledgements
1 Introduction
2 Background and Related Work
    2.1 Related Work
    2.2 Applications
    2.3 Observations: 1991 to 2006
    2.4 Techniques and Algorithms to Resolve the NER Problem
    2.5 Feature Space for NER
    2.6 Evaluation of NER
    2.7 Conclusion
3 Creating a Baseline Semi-Supervised NER System
    3.1 Generating Gazetteers
    3.2 Resolving Ambiguity
    3.3 Evaluation with the MUC-7 Enamex Corpus
    3.4 Evaluation with Car Brands
    3.5 Supervised versus Unsupervised
    3.6 Conclusion
4 Noise-Filtering Techniques for Generating NE Lists
    4.1 Generating NE Lists from the Web
    4.2 Lexical Noise Filter
    4.3 Information Redundancy Filter
    4.4 Noise Filter Combination
    4.5 Statistical Semantics Filter
    4.6 Conclusion
5 Discovering Unambiguous NEs for Disambiguation Rule Generation
    5.1 Related Work
    5.2 Massive Generation of NE Lists
    5.3 NE Ambiguity
    5.4 From Unambiguous NE to Disambiguation Rules
    5.5 Experiments on the NER Task
    5.6 Conclusion
6 Detecting Acronyms for Better Alias Resolution
    6.1 Related Work
    6.2 Supervised Learning Approach
    6.3 Evaluation Corpus
    6.4 Experiment Results
    6.5 Discussion
    6.6 Improving Alias Resolution in NER Systems
    6.7 Conclusion
7 Discussion and Conclusion
    7.1 Limitations
    7.2 Future Work
    7.3 Long-Term Research Ideas
Bibliography
Appendix: Seed words (system input)

List of tables

Table 1: Word-level features
Table 2: List look-up features
Table 3: Features from documents
Table 4: NER error types
Table 5: Results of a supervised system for MUC-7
Table 6: Type and size of gazetteers built using Web page wrapper
Table 7: Supervised list creation vs. unsupervised list creation techniques
Table 8: Generated list performance on text matching
Table 9: Performance of heuristics to resolve NE ambiguity
Table 10: Estimated precision of automatically generated lists
Table 11: System performance for car brand recognition
Table 12: NE lexical features
Table 13: Reference lists for noise filter evaluation
Table 14: BaLIE performance on MUC-7 corpus with and without noise filtering
Table 15: BaLIE and Oak lexicon comparison
Table 16: Additional BaLIE lexicons
Table 17: Source of ambiguity between entity types
Table 18: Percentage of entity-entity ambiguity per type
Table 19: Accuracy of entity-entity classifiers
Table 20: Three-type BaLIE performance on MUC-7 corpus
Table 21: 100-type BaLIE performance on MUC-7 corpus with and without rules
Table 22: BaLIE's performance on the CONLL corpus
Table 23: System comparison on CONLL corpus
Table 24: BaLIE's performance on BBN corpus
Table 25: Summary of constraints on acronyms and definitions
Table 26: Acronym detection performance reported by various teams
Table 27: Performance of various classifiers on the Medstract corpus
Table 28: Acronym detection on Swedish texts
Table 29: BaLIE's performance on the CONLL corpus with acronym detection

List of figures

Figure 1: Overview of the semi-supervised NER system
Figure 2: Details of the baseline named entity recognition system
Figure 3: Simple alias resolution algorithm
Figure 4: Details of noise filtering as a post-process for the Web page wrapper
Figure 5: Algorithm for one iteration of the NE list generation process
Figure 6: Comparing lexical filters
Figure 7: Comparison of lexical filter and information redundancy filter
Figure 8: Comparison of individual filters and their combination
Figure 9: Details of training disambiguation rules in a semi-supervised manner
Figure 10: CONLL corpus metonymic references
Figure 11: Details of acronym identification as a component of the alias network

Abstract

Named Entity Recognition (NER) aims to extract and classify rigid designators in text, such as proper names, biological species, and temporal expressions. There has been growing interest in this field of research since the early 1990s. In this thesis, we document a trend moving away from handcrafted rules and towards machine learning approaches. Still, recent machine learning approaches depend on annotated data, and its limited availability is a serious shortcoming in building and maintaining large-scale NER systems.

In this thesis, we present an NER system built with very little supervision. Human supervision is limited to listing a few examples of each named entity (NE) type. First, we introduce a proof-of-concept semi-supervised system that can recognize four NE types. Then, we expand its capacities by improving key technologies, and we apply the system to an entire hierarchy comprising 100 NE types.

Our work makes the following contributions: the creation of a proof-of-concept semi-supervised NER system; the demonstration of an innovative noise filtering technique for generating NE lists; the validation of a strategy for learning disambiguation rules using automatically identified, unambiguous NEs; and finally, the development of an acronym detection algorithm, thus solving a rare but very difficult problem in alias resolution.

We believe semi-supervised learning techniques are about to break new ground in the machine learning community. In this thesis, we show that limited supervision can build complete NER systems. On standard evaluation corpora, we report performance comparable to that of baseline supervised systems in the task of annotating NEs in texts.

Acknowledgements

The dream of building an autonomous system is shared by most researchers in the field, and in fact probably extends to all engineers of intelligent systems. When I developed a first prototype of a named entity recognition system, from 2001 to 2003, the maintenance problem quickly became apparent. Moreover, the document annotation effort required to extend the system was such that the original dream became an imperative. I certainly had the motivation to create this system, but I had no idea how to get there. These were the ideal circumstances in which to undertake a thesis.

I wish to thank Peter Turney and Stan Matwin, who supervised and contributed to this work. By questioning and weighing every idea and every sentence, they helped me understand much more than abstract notions and computing procedures. I also wish to thank Caroline Barrière, Cyril Goutte, and Pierre Isabelle of the Interactive Language Technologies Group of the National Research Council Canada for their help and their comments on preliminary versions of this thesis.