Automatic Extraction and Assessment of Entities from the Web
Dissertation submitted in partial satisfaction of the requirements for the degree of Doktoringenieur (Dr.-Ing.) at Technische Universität Dresden, Faculty of Computer Science

by Dipl.-Medieninf. David Urbansky, born on November 25, 1984 in Dresden

Advisers:
Professor Dr. rer. nat. habil. Dr. h. c. Alexander Schill, TUD, Dresden, Germany
Associate Professor Dr. James Thom, RMIT, Melbourne, Australia
Professor Dr.-Ing. Michael Schroeder, TU Dresden, Dresden, Germany

Day of Defense: October 15, 2012
Dresden, October 17, 2012

Statement of Authorship

This dissertation has been conducted and presented solely by myself. I have not made use of other people's work (published or otherwise) or presented it here without acknowledging the source of all such work.

Dresden, July 9, 2012
David Urbansky

Abstract

The search for information about entities, such as people or movies, plays an increasingly important role on the Web. This information is still scattered across many Web pages, making it time-consuming for a user to find all relevant information about an entity. This thesis describes techniques to extract entities and information about these entities from the Web, such as facts, opinions, questions and answers, interactive multimedia objects, and events. The findings of this thesis are that it is possible to create a large knowledge base automatically using a manually-crafted ontology. The precision of the extracted information was found to be between 75 % and 90 % (for facts and entities, respectively) after applying assessment algorithms. The algorithms from this thesis can be used to create such a knowledge base, which can in turn be used in various research fields, such as question answering, named entity recognition, and information retrieval.

Acknowledgement

This thesis was written between 2009 and 2012 during my time as a research assistant at the Chair of Computer Networks at the Dresden University of Technology. I want to thank my doctoral adviser Professor Alexander Schill for giving me the opportunity and the freedom to write my thesis about my own research interests. He also provided me with research positions in the projects Aletheia, X2Lift, and Intellix, which financed me during the three years of writing my thesis.

I am very grateful for the critical feedback that I received from Associate Professor James Thom. Although we only met twice during these three years, our day-long meetings and feedback over email were of great help to me and shaped my thesis substantially. Also, I want to thank my adviser Dr. Daniel Schuster for constant feedback on my progress, valuable tips from his experience, and insightful suggestions.

During my time at the Faculty of Computer Science, I supervised over thirty students. That alone was an incredible experience from which I learned a lot about providing help, working with different characters, and also about myself. Some research areas in this thesis were studied in detail in the diploma theses of my students. I want to thank Christopher Friedrich, Christian Hensel, Marcel Gerlach, Dmitry Myagkikh, Martin Werner, Robert Willner, and Martin Wunderwald for writing their master theses under my supervision and producing outstanding results.

I want to express my gratitude to the colleagues whom I was lucky enough to work with. My thanks go to Marius Feldmann, who led me into the PhD program in the first place.
Furthermore, I highly appreciate working with Philipp Katz, Klemens Muthmann, and Sandro Reichert on the Java toolkit "Palladian", our joint effort in giving something practical back to the information retrieval and machine learning community. And of course I want to thank my parents for always being there for me even though they have no idea what this thesis is all about. Lastly, I want to express my appreciation to my lovely girlfriend Crystal, who not only read the thesis back to back several times, correcting all my grammar and spelling errors, but who was also patient and understanding with me.

Contents

1 Introduction
  1.1 Motivation
    1.1.1 Question Answering
    1.1.2 Entity Dictionary
    1.1.3 Dataset for the Semantic Web
  1.2 Requirements
  1.3 Focus and Limitations
  1.4 Research Questions and Hypotheses
  1.5 Summary
2 Background
  2.1 Definitions
    2.1.1 Concept
    2.1.2 Named Entity
    2.1.3 Attribute
    2.1.4 Statement
  2.2 Sources of Information on the Web
    2.2.1 Types of Structures
    2.2.2 Distribution of Document File Formats on the Web
    2.2.3 Areas on the Web
  2.3 Evaluation Measures and Approaches
    2.3.1 Performance Measurements
    2.3.2 Evaluating Named Entity Recognition
    2.3.3 Evaluating Named Entity Discovery
  2.4 Related Systems and Knowledge Bases
  2.5 Summary
3 Architecture
  3.1 Components
    3.1.1 Controller
    3.1.2 Extraction and Assessment
    3.1.3 Retrieval
    3.1.4 Ontology
    3.1.5 Question Answering Interface
    3.1.6 Storage
  3.2 Summary
4 Timely Source Retrieval
  4.1 Feed Reading
    4.1.1 Related Work
    4.1.2 Feed Activity Patterns
  4.2 Update Strategies
    4.2.1 Fix
    4.2.2 Fix Learned
    4.2.3 Moving Average
    4.2.4 Post Rate
  4.3 Evaluation
    4.3.1 Dataset
    4.3.2 Network Traffic
    4.3.3 Timeliness
  4.4 Summary
5 Extraction of Entities
  5.1 Related Work
    5.1.1 Entity Types
    5.1.2 Language
    5.1.3 Source Types
    5.1.4 Research Tasks
    5.1.5 Techniques for Extracting Entities
    5.1.6 Summary of Related Work
  5.2 Extraction Techniques
    5.2.1 Extraction Using Phrase Patterns
    5.2.2 Focused Crawling Extraction
    5.2.3 List Extraction Using Seed Entities
    5.2.4 Named Entity Extraction from Plain Text
    5.2.5 Entity Extraction from the Semantic Web
  5.3 Evaluation
    5.3.1 Evaluation and Comparison of Entity Extraction Techniques
    5.3.2 Semantic Web Extractor Experiments
    5.3.3 Natural Language Processing-based Extraction Experiments
  5.4 Summary
6 Assessment of Extractions
  6.1 Related Work
    6.1.1 Syntax-based
    6.1.2 Dictionary-based
    6.1.3 Redundancy-based
    6.1.4 Co-occurrence-based
    6.1.5 Graph-based
    6.1.6 Combination: Redundancy- and Syntax-based
    6.1.7 Summary of Related Work
  6.2 Modifying Assessment Techniques for Assessing Extracted Entities
    6.2.1 Pointwise Mutual Information
    6.2.2 Latent Relational Analysis
    6.2.3 Random Graph Walk
    6.2.4 Text Classification
    6.2.5 Feature-based Assessor
    6.2.6 Combined Assessor
  6.3 Evaluation of Entity Assessment Techniques
    6.3.1 Dependency on the Amount of Training Data
    6.3.2 Assessment Techniques Performance by Concept
    6.3.3 Overall Comparison
    6.3.4 Trust Threshold Analysis
  6.4 Summary
7 Extraction of Facts
  7.1 Ontology Engineering
    7.1.1 Datatypes
    7.1.2 Unit Types
    7.1.3 Ranges
  7.2 Related Work
    7.2.1 Related Systems
    7.2.2 Fact Assessment ...