Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision
David Nadeau

Thesis submitted to the Faculty of Graduate and Postdoctoral Studies in partial fulfillment of the requirements for the PhD degree in Computer Science

Ottawa-Carleton Institute for Computer Science
School of Information Technology and Engineering
University of Ottawa

© David Nadeau, Ottawa, Canada, 2007

Table of contents

List of tables
List of figures
Abstract
Acknowledgements
1 Introduction
2 Background and Related Work
  2.1 Related Work
  2.2 Applications
  2.3 Observations: 1991 to 2006
  2.4 Techniques and Algorithms to Resolve the NER Problem
  2.5 Feature Space for NER
  2.6 Evaluation of NER
  2.7 Conclusion
3 Creating a Baseline Semi-Supervised NER System
  3.1 Generating Gazetteers
  3.2 Resolving Ambiguity
  3.3 Evaluation with the MUC-7 Enamex Corpus
  3.4 Evaluation with Car Brands
  3.5 Supervised versus Unsupervised
  3.6 Conclusion
4 Noise-Filtering Techniques for Generating NE Lists
  4.1 Generating NE Lists from the Web
  4.2 Lexical Noise Filter
  4.3 Information Redundancy Filter
  4.4 Noise Filter Combination
  4.5 Statistical Semantics Filter
  4.6 Conclusion
5 Discovering Unambiguous NEs for Disambiguation Rule Generation
  5.1 Related Work
  5.2 Massive Generation of NE Lists
  5.3 NE Ambiguity
  5.4 From Unambiguous NE to Disambiguation Rules
  5.5 Experiments on the NER Task
  5.6 Conclusion
6 Detecting Acronyms for Better Alias Resolution
  6.1 Related Work
  6.2 Supervised Learning Approach
  6.3 Evaluation Corpus
  6.4 Experiment Results
  6.5 Discussion
  6.6 Improving Alias Resolution in NER Systems
  6.7 Conclusion
7 Discussion and Conclusion
  7.1 Limitations
  7.2 Future Work
  7.3 Long-Term Research Ideas
Bibliography
Appendix: Seed words (system input)

List of tables

Table 1: Word-level features
Table 2: List look-up features
Table 3: Features from documents
Table 4: NER error types
Table 5: Results of a supervised system for MUC-7
Table 6: Type and size of gazetteers built using Web page wrapper
Table 7: Supervised list creation vs. unsupervised list creation techniques
Table 8: Generated list performance on text matching
Table 9: Performance of heuristics to resolve NE ambiguity
Table 10: Estimated precision of automatically generated lists
Table 11: System performance for car brand recognition
Table 12: NE lexical features
Table 13: Reference lists for noise filter evaluation
Table 14: BaLIE performance on MUC-7 corpus with and without noise filtering
Table 15: BaLIE and Oak lexicon comparison
Table 16: Additional BaLIE lexicons
Table 17: Source of ambiguity between entity types
Table 18: Percentage of entity-entity ambiguity per type
Table 19: Accuracy of entity-entity classifiers
Table 20: Three-type BaLIE performance on MUC-7 corpus
Table 21: 100-type BaLIE performance on MUC-7 corpus with and without rules
Table 22: BaLIE's performance on the CONLL corpus
Table 23: System comparison on CONLL corpus
Table 24: BaLIE's performance on BBN corpus
Table 25: Summary of constraints on acronyms and definitions
Table 26: Acronym detection performance reported by various teams
Table 27: Performance of various classifiers on the Medstract corpus
Table 28: Acronym detection on Swedish texts
Table 29: BaLIE's performance on the CONLL corpus with acronym detection