
DBpedia Type and Entity Detection Using Embeddings and N-gram Models

by

Hanqing Zhou

Thesis submitted in partial fulfillment of the requirements for the Master's degree

School of Electrical Engineering and Computer Science
Faculty of Engineering
University of Ottawa

© Hanqing Zhou, Ottawa, Canada, 2018

Abstract

Nowadays, knowledge bases are used more and more in Semantic Web tasks, such as knowledge acquisition (Hellmann et al., 2013), disambiguation (Garcia et al., 2009), and named entity corpus construction (Hahm et al., 2014), to name a few. DBpedia plays a central role on the Linked Open Data cloud; therefore, the quality of this knowledge base is becoming a central point of focus. However, DBpedia suffers from three major types of quality problems: a) invalid types for entities, b) missing types for entities, and c) invalid entities in the resources' descriptions. In order to enhance the quality of DBpedia, it is important to detect these invalid types and resources, as well as to complete missing types.

The three main goals of this thesis are: a) invalid entity type detection, to solve the problem of invalid DBpedia types for entities; b) automatic detection of the types of entities, to solve the problem of missing DBpedia types for entities; and c) invalid entity detection, to solve the problem of invalid entities in the resource description of a DBpedia entity. We compare several methods for the detection of invalid types, the automatic typing of entities, and the detection of invalid entities in resource descriptions. In particular, we compare different classification and clustering algorithms based on various sets of features: entity embedding features (Skip-gram and CBOW models) and traditional n-gram features. We present evaluation results for 358 DBpedia classes extracted from the DBpedia ontology.

The main contribution of this work consists of the development of automatic invalid type detection, automatic entity typing, and automatic invalid entity detection methods using clustering and classification. Our results show that entity embedding models usually perform better than n-gram models, especially the Skip-gram embedding model.


Acknowledgement

First and foremost, I offer my sincere gratitude and appreciation to my supervisors, Dr. Amal Zouaq and Dr. Diana Inkpen, for their eminent supervision, creative ideas, and endless support. This work would not have been possible without their help.

Thanks to the members of the NLP group meetings and the accompanying Tamale seminars for sharing their broad knowledge and cutting-edge research ideas. Also, special thanks to my lab colleagues and all of the NLP group members for their help, suggestions, and support.

Finally, I would like to thank my parents for their support and encouragement during my time in the Master’s program.


Contents

Abstract ...... ii
Acknowledgement ...... iii
List of Figures ...... vi
List of Tables ...... vii
Chapter 1: Introduction ...... 1
1.1 Motivation and Contributions ...... 4
1.2 Outline ...... 5
Chapter 2: Background and Related Work ...... 6
2.1 The Semantic Web and Linked Open Data ...... 6
2.2 Word and Entity Embeddings ...... 7
2.3 N-gram Models ...... 9
2.4 Machine Learning Techniques ...... 10
2.4.1 Clustering ...... 10
2.4.2 Classification ...... 11
2.5. Related Work ...... 14
2.5.1. DBpedia and Linked Data Quality Assessment and Enhancement ...... 14
2.5.2. Automatic Type Detection ...... 15
2.5.3. Outlier Detection in Linked Data ...... 17
Chapter 3: Research Methodology ...... 19
3.1 General Architecture ...... 19
3.2 Entity Embedding Model Building ...... 20
3.3 Datasets Preparation ...... 23
3.3.1. Dataset for Entity Type Detection ...... 23
3.3.2. Dataset for Invalid Entity Detection ...... 26
3.4 N-gram Model Building ...... 30
3.5 Evaluation Metrics ...... 30
Chapter 4: DBpedia Entity Type Detection ...... 33
4.1 Experiment Introduction and Setup ...... 33
4.2 Clustering Evaluation ...... 34
4.2.1 Clustering with N-gram Models ...... 34


4.2.2 Clustering with Entity Embedding Models ...... 37
4.3 Classification Evaluation ...... 39
4.3.1 Classification with N-gram Models ...... 39
4.3.2 Classification with Entity Embedding Models ...... 43
4.5 Student-t Test ...... 46
4.6 Synthesis and Discussion ...... 47
Chapter 5: DBpedia Invalid Entity Detection in Resources Description ...... 49
5.1 Experiment Setup ...... 50
5.2 Clustering Evaluation ...... 50
5.2.1 Clustering with N-gram Models ...... 51
5.2.2 Clustering with Entity Embedding Models ...... 53
5.3 Classification Evaluation ...... 55
5.3.1 Classification Evaluation with N-gram Models ...... 55
5.3.2 Classification Evaluation on Entity Embedding Models ...... 58
5.4 Synthesis and Discussion ...... 61
Chapter 6: Discussion and Conclusion ...... 62
6.1 Applying the Random Forest classification model on the whole DBpedia Knowledge Base ...... 62
6.2 Conclusion and Future Work ...... 66
References ...... 70
Appendix A ...... 79
Appendix B ...... 82
Appendix C ...... 108


List of Figures

Figure 1. 1. Part of Linked Open Data diagram (Abele et al., 2017) ...... 2
Figure 2. 1. Architecture of Continuous Bag-of-Words model (Mikolov et al., 2013b) ...... 8
Figure 2. 2. Architecture of Continuous Skip-gram model (Mikolov et al., 2013b) ...... 9
Figure 2. 3. Clustering process (Han et al., 2012) ...... 10
Figure 2. 4. Collective outlier example (Han et al., 2012) ...... 11
Figure 2. 5. Example of Naïve Bayes (Zhang, 2004) ...... 13
Figure 2. 6. SVM example with small and large margins (Han et al., 2012) ...... 13
Figure 2. 7. Pipeline of Tipalo (Gangemi et al., 2012) ...... 16
Figure 3. 1. Flowchart of the DBpedia invalid type and entity detection pipeline ...... 19
Figure 3. 2. The process of building the entity embedding model ...... 20
Figure 3. 3. Visualization of DBpedia types for the entity Bill_Clinton ...... 22
Figure 3. 4. Sample DBpedia SPARQL query-Airport ...... 24
Figure 3. 5. DBpedia SPARQL query result-Airport ...... 25
Figure 3. 6. DBpedia page-Barack_Obama ...... 27
Figure 3. 7. Sample DBpedia SPARQL query-Barack_Obama ...... 28
Figure 3. 8. DBpedia SPARQL query result-Barack_Obama ...... 29
Figure 6. 1. Example of new rdf:type triples ...... 64
Figure 6. 2. Example of invalid triples ...... 65


List of Tables

Table 3. 1. Summary of the number of instances in the datasets ...... 29
Table 3. 2. Confusion Matrix ...... 31
Table 4. 1. Summary of the average clustering results with the uni-gram model on the training dataset ...... 34
Table 4. 2. Summary of the average clustering results with bi-gram model on the training dataset ...... 35
Table 4. 3. Summary of the average clustering results with the tri-gram model on the training dataset ...... 35
Table 4. 4. Summary of the average clustering results with the n-gram model on the training dataset ...... 35
Table 4. 5. Summary of the average clustering results with the uni-gram model on the test dataset ...... 36
Table 4. 6. Summary of the average clustering results with the bi-gram model on the test dataset ...... 36
Table 4. 7. Summary of the average clustering results with the tri-gram model on the test dataset ...... 36
Table 4. 8. Summary of the average clustering results with the n-gram model on the test dataset ...... 36
Table 4. 9. Summary of the average clustering results with Skip-gram entity embedding model on the training dataset ...... 37
Table 4. 10. Summary of the average clustering results with CBOW entity embedding model on the training dataset ...... 37
Table 4. 11. Summary of the average clustering results with Skip-gram entity embedding model on the test dataset ...... 38
Table 4. 12. Summary of the average clustering results with CBOW entity embedding model on the test dataset ...... 38
Table 4. 13. Summary of the average classification results with uni-gram model on the training dataset ...... 40
Table 4. 14. Summary of the average classification results with bi-gram model on the training dataset ...... 40
Table 4. 15. Summary of the average classification results with tri-gram model on the training dataset ...... 40
Table 4. 16. Summary of the average classification results with n-gram model on the training dataset ...... 41
Table 4. 17. Summary of the average classification results with uni-gram model on the test dataset ...... 41
Table 4. 18. Summary of the average classification results with bi-gram model on the test dataset ...... 42
Table 4. 19. Summary of the average classification results with tri-gram model on the test dataset ...... 42

Table 4. 20. Summary of the average classification results with n-gram model on the test dataset ...... 42
Table 4. 21. Summary of the average classification results with Skip-gram entity embedding model on the training dataset ...... 43
Table 4. 22. Summary of the average classification results with CBOW entity embedding model on the training dataset ...... 44
Table 4. 23. Summary of the average classification results with Skip-gram entity embedding model on the test dataset ...... 45
Table 4. 24. Summary of the average classification results with CBOW entity embedding model on the test dataset ...... 46
Table 4. 25. Student-t tests on Precision, Recall, F-score, and Accuracy results based on the training and test datasets ...... 47
Table 5. 1. Summary of the average clustering results with the uni-gram model for positive class ...... 51
Table 5. 2. Summary of the average clustering results with the bi-gram model for positive class ...... 51
Table 5. 3. Summary of the average clustering results with the tri-gram model for positive class ...... 51
Table 5. 4. Summary of the average clustering results with the n-gram model for positive class ...... 51
Table 5. 5. Summary of the average clustering results with the uni-gram model for negative class ...... 52
Table 5. 6. Summary of the average clustering results with the bi-gram model for negative class ...... 52
Table 5. 7. Summary of the average clustering results with the tri-gram model for negative class ...... 52
Table 5. 8. Summary of the average clustering results with the n-gram model for negative class ...... 53
Table 5. 9. Summary of the average clustering results with the Skip-gram entity embedding model for positive class ...... 53
Table 5. 10. Summary of the average clustering results with the CBOW embedding model for positive class ...... 53
Table 5. 11. Summary of the average clustering results with the Skip-gram entity embedding model for negative class ...... 54
Table 5. 12. Summary of the average clustering results with the CBOW embedding model for negative class ...... 54
Table 5. 13. Summary of the average classification results on the uni-gram model for positive class ...... 55
Table 5. 14. Summary of the average classification results on the bi-gram model for positive class ...... 55
Table 5. 15. Summary of the average classification results on the tri-gram model for positive class ...... 56
Table 5. 16. Summary of the average classification results on the n-gram model for positive class ...... 56
Table 5. 17. Summary of the average classification results on the uni-gram model for negative class ...... 57


Table 5. 18. Summary of the average classification results on the bi-gram model for negative class ...... 57
Table 5. 19. Summary of the average classification results on the tri-gram model for negative class ...... 57
Table 5. 20. Summary of the average classification results on the n-gram model for negative class ...... 58
Table 5. 21. Summary of the average classification results with the Skip-gram entity embedding model for positive class ...... 59
Table 5. 22. Summary of the average classification results with the CBOW entity embedding model for positive class ...... 59
Table 5. 23. Summary of the average classification results with the Skip-gram entity embedding model for negative class ...... 60
Table 5. 24. Summary of the average classification results with the CBOW entity embedding model for negative class ...... 60
Table 6. 1. DBpedia new entity based on the whole DBpedia knowledge base ...... 63
Table 6. 2. DBpedia same entity based on the whole DBpedia knowledge base ...... 64
Table 6. 3. DBpedia invalid RDF triples ...... 66
Table A. 1. Detailed K-means clustering results on ten typical DBpedia types with skip-gram model on test dataset ...... 79
Table A. 2. Detailed K-means clustering results on ten typical DBpedia types with n-gram model on test dataset ...... 80
Table A. 3. Detailed SVM Classification results on ten typical DBpedia types with skip-gram model on test dataset ...... 80
Table A. 4. Detailed SVM Classification results on ten typical DBpedia types with n-gram model on test dataset ...... 81
Table B. 1. Details of DBpedia Ontology ...... 107


Chapter 1: Introduction

The Semantic Web is defined by Berners-Lee et al. (2001) as an extension of the current Web in which information is given a well-defined meaning, in order to allow computers and people to cooperate better than before. Linked Data is about creating typed links between data from different sources by using the current Web. More precisely, it is a "Web of Data" in the Resource Description Framework (RDF) (Bizer et al., 2009a). Knowledge bases represent the backbone of the Semantic Web, and they also represent Web resources (Kabir et al., 2014). Knowledge bases are being created by integrating resources like Wikipedia and Linked Open Data (Syed and Finin, 2011; König et al., 2013) and have turned into a crystallization point for the emerging Web of Data (Hellmann et al., 2009; Moran, 2012). There are several popular knowledge bases, such as DBpedia (Auer et al., 2007), YAGO (Fabian et al., 2007), and Wikidata (Vrandečić and Krötzsch, 2014). The DBpedia knowledge base is a rich knowledge base that uses the DBpedia extraction framework to extract structured information from Wikipedia. It contains several properties, such as the title and abstract of each Wikipedia page, in around a hundred different languages (Mendes et al., 2012; Lehmann et al., 2015). The YAGO knowledge base is automatically extracted from Wikipedia and WordNet using combined rule-based and heuristic methods; each Wikipedia article becomes an entity and each Wikipedia category becomes a type in the YAGO knowledge base (Fabian et al., 2007; Suchanek et al., 2007; Suchanek et al., 2008). The Wikidata knowledge base is a multilingual "Wikipedia for data": it provides a common resource of data for Wikipedia and aims to manage the factual information of the popular online encyclopedia (Vrandečić and Krötzsch, 2014; Erxleben et al., 2014). Knowledge bases have various applications on the Semantic Web, such as entity linking (Shen et al., 2012), which links entities in text to knowledge bases such as Wikipedia and WordNet (Krötzsch et al., 2007; Fellbaum, 2010). Another important application is leveraging the rich semantic knowledge embedded in Wikipedia: question answering on the Semantic Web (Yih et al., 2016), for instance, uses knowledge from the Freebase knowledge base (Bollacker et al., 2007; Bollacker et al., 2008; Yao and Van Durme, 2014) and reaches high accuracy. Furthermore, semantic similarity measurement is another application of knowledge bases; for example, Li et al. (2012) combine knowledge bases with neural networks to simulate the human process of similarity perception in order to measure semantic similarities.

Among the main knowledge bases on the Semantic Web, the DBpedia knowledge base represents structured information from Wikipedia, describes millions of entities, and is available in more than a hundred languages (Morsey et al., 2012). DBpedia uses a unique identifier to name each entity (i.e., each Wikipedia page) and associates it with an RDF description that can be accessed on the Web via the namespace dbr:<http://dbpedia.org/resource/> (Lehmann et al., 2015; Auer et al., 2007). Similarly, DBpedia is based on an ontology, represented in the namespace dbo:<http://dbpedia.org/ontology/>, that defines various classes that are used to type the available entities (resources). DBpedia refers to and is referenced by several datasets on the LOD cloud, and it is used for tasks as diverse as semantic annotation (Zhang et al., 2013), knowledge extraction (Alani et al., 2003), and others (Keong and Anthony, 2013). As shown in Figure 1.1 (Abele et al., 2017), DBpedia is a central data hub on the Semantic Web, and more specifically on the Linked Open Data (LOD) cloud.

Figure 1. 1. Part of Linked Open Data diagram (Abele et al., 2017)

Given the automatic extraction framework of DBpedia along with its dynamic nature, several inconsistencies can exist in DBpedia (Töpper et al., 2012; Sheng et al., 2012; Fleischhacker et al.,

1 Linked Open Data diagram retrieved from http://lod-cloud.net/

2014; Font et al., 2015). One important quality problem is related to DBpedia types (classes). In fact, there exist invalid DBpedia types for some entities. For example, the DBpedia entity "dbr:Xanthine" belongs to three DBpedia types: "dbo:ChemicalSubstance," "dbo:ChemicalCompound," and "dbo:Airport," though "dbo:Airport" is an invalid type. Furthermore, given the size of DBpedia, type information is sometimes unavailable for some entities. Several entities are still un-typed, or not typed with all the relevant classes from the DBpedia ontology. For instance, one example of the missing type problem is the absence of the types "dbo:Politician" and "dbo:President" for the entity "dbr:Donald_Trump." The only available types are "dbo:Person" and "dbo:Agent," which are not incorrect, but they are not specific enough. As well, another problem is that some of the available ontology classes still do not have any instances, such as "dbo:ArtisticGenre," "dbo:TeamSport," and "dbo:SkiResort." Finally, invalid DBpedia entities may exist in the RDF description of an entity due to erroneous triples (Wienand and Paulheim, 2014). Consequently, some entities are not correctly linked to other entities. As an example of an invalid entity in a resource description, the DBpedia entity "dbr:Earthquake" is described with several triples, one of which links "dbr:Earthquake" to "dbr:Vanilla_Ice." We can see that this is an erroneous fact in the description of the entity "dbr:Earthquake." Thus, "dbr:Vanilla_Ice" is an invalid entity in the resource description of "dbr:Earthquake." Identifying these invalid types and entities manually is unfeasible. Similarly, the automatic enrichment and update of DBpedia with new type statements (through rdf:type) is becoming an important research avenue (Kliegr and Zamazal, 2014; Kliegr, 2015). In this thesis, we rely on vector-based representations such as word embeddings and entity embeddings (Hu et al., 2015; Chen et al., 2016) to reach these objectives. Word embeddings are vector representations of words (Mikolov et al., 2013a; Mikolov et al., 2013b; Mikolov et al., 2013c) that have several applications in Natural Language Processing tasks, such as entity recognition (Seok et al., 2016), information retrieval (Ganguly et al., 2015), and question answering (Zhou et al., 2015). Entity embeddings, as defined in this thesis, are similar to word embeddings (Mikolov et al., 2013a; Mikolov et al., 2013b; Mikolov et al., 2013c) but differ in that they are based on vectors that describe entity URIs instead of words (Hu et al., 2015; Chen et al., 2016).

In this thesis, we aim at addressing the following research questions:
RQ1: Can entity embeddings help detect the relevant types of an entity?
RQ2: How do entity embeddings compare with traditional n-gram models for type identification?
RQ3: Can entity embeddings help detect the relevant entities in the RDF description of a DBpedia resource?
RQ4: How do entity embeddings compare with traditional n-gram models for invalid entity detection?

2 http://dbpedia.org/resource/Xanthine
3 http://dbpedia.org/ontology/ChemicalSubstance

1.1 Motivation and Contributions

Our primary motivation is to contribute to the improvement of the quality of DBpedia by automatically typing entities and identifying erroneous types and entities. Our secondary motivation consists of building and experimenting with different entity embedding models for the automatic type identification, automatic typing, and automatic entity identification tasks. Our third motivation is to compare entity embedding models with various n-gram models for type and entity identification.

In this thesis, we make the following contributions:
▪ We build different n-gram models from DBpedia entities' abstracts.
▪ We use machine learning techniques coupled with entity embedding models to detect invalid types, complete missing types, and detect invalid entities in order to enhance the quality of DBpedia.
▪ We find new entities for each DBpedia type, find new types for each DBpedia entity, and find invalid triples based on the whole DBpedia knowledge base.
▪ We compare the performance of entity embedding models with traditional n-gram models in terms of DBpedia invalid type detection and invalid DBpedia entity detection.


1.2 Outline

In Chapter 1, we discuss the motivation of this work, our goals, and our contributions. Chapter 2 presents background information about the Machine Learning and Data Mining techniques that are used in this thesis. It also discusses related work about the Semantic Web, Linked Open Data, DBpedia, entity embeddings, and n-gram models. In Chapter 3, we discuss our work on collecting datasets and building the entity embedding and n-gram models. Then, in Chapter 4, we propose different clustering and classification methods to detect invalid DBpedia types and complete missing types for 358 types from the DBpedia ontology. Parts of this chapter were published in (Zhou et al., 2017). In Chapter 5, we present our approach to detect invalid DBpedia resources using our own entity embedding and n-gram models. In Chapter 6, we conclude our work and contributions, and we discuss possible directions for future work.


Chapter 2: Background and Related Work

This chapter starts with a description of the context of this work, including a description of the Semantic Web and Linked Open Data knowledge bases. We then introduce some of the techniques used to achieve our goals, including word embeddings, entity embeddings, n-gram models, and different machine learning algorithms. Finally, this chapter describes the state of the art for the two main challenges tackled in this thesis: DBpedia Entity Type Detection (Chapter 4) and Invalid Entity Detection in Resource Descriptions (Chapter 5).

2.1 The Semantic Web and Linked Open Data

The emergence of the Linked Open Data (LOD) cloud has resulted in an openly available "Web of Data" which contains billions of RDF triples (Jain et al., 2010; Jain et al., 2012). In fact, the LOD is the current concretization of the Semantic Web; the linked data cloud is used to represent data models on the Semantic Web (Rodriguez, 2009; Van Hooland et al., 2012). Knowledge bases can be used to store different structured or unstructured information (Fagerberg et al., 2012). In the context of the Semantic Web, a Uniform Resource Identifier (URI) is created for every Wikipedia page to identify the corresponding DBpedia entity (Mendes et al., 2012). Resources are real-world things that are described on the Semantic Web; they are made meaningful to machines through descriptions in common data formats (Debattista et al., 2016). Knowledge bases consist of definitional, descriptive, and relevant information for each resource. They can be considered as an encyclopedia of resources, and these resources are described by sets of interrelated facts in the form of triples, generally in the Resource Description Framework (RDF) format (Zheng et al., 2010). There are many applications of knowledge bases, such as city planning (Simmie and Lever, 2002), recommender systems (Zhang et al., 2016), and teaching (Paulien et al., 2001), to name a few. Knowledge bases on the Semantic Web, such as DBpedia (Bizer et al., 2009b), YAGO (Suchanek et al., 2007), and Wikidata (Tanon et al., 2016) for instance, often rely on an ontology. An ontology is defined as a conceptual structure that provides a key to machine-processable data. It also serves as a metadata schema that provides a controlled vocabulary of concepts (Maedche and Staab, 2001; Staab and Studer, 2009).

The Web Ontology Language (OWL) is used in the Semantic Web to enable an efficient representation of ontologies that is amenable to decision procedures and reasoning (Shadbolt et al., 2006). As previously indicated, one of the central resources on the Linked Open Data cloud is DBpedia. It is a multilingual cross-domain knowledge base that contains RDF descriptions of concepts in about a hundred languages (Mendes et al., 2012). The quality and evolution of this knowledge base are currently a hot topic (Mayfield and Finin, 2012; Paulheim and Bizer, 2014; Wienand and Paulheim, 2014; Font et al., 2017). In order to enhance the quality of DBpedia, the first two goals of this thesis are to build a system based on machine learning techniques that can detect invalid or irrelevant types, and to complete missing types automatically. The third goal of this thesis is to build a system based on machine learning techniques that can identify invalid entities automatically.

2.2 Word and Entity Embeddings

Among the models explored in this thesis are word embedding and entity embedding models. Word embeddings are vector representations of words (Mikolov et al., 2013a) and have been successfully applied to several natural language processing (NLP) tasks. The first successful word embedding model was the Neural Probabilistic Language Model (Bengio et al., 2003; Bengio et al., 2006), which learned distributed representations of each word and probability functions for word sequences. One of the most popular word embedding tools is Word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b; Mikolov et al., 2013c; Goldberg and Levy, 2014). We use one of the variations of Word2vec, called Wiki2vec. It can generate embeddings for DBpedia entities based on Wikipedia hyperlinks. Our goal is to learn vector representations of DBpedia entities (entity embeddings) by using the information available in the corresponding Wikipedia pages or abstracts. The main difference is that the vectors describe entity URIs rather than words, which allows us to compute a semantic distance between entities. There are two major word embedding models in Word2vec (Mikolov et al., 2013b; Mikolov et al., 2013c): the Continuous Bag-of-Words (CBOW) model and the Skip-gram model.

4 https://github.com/idio/wiki2vec

The CBOW model is related to the feedforward Neural Net Language Model (NNLM) (Bengio et al., 2003), whose architecture consists of four layers: an input layer, a projection layer, a hidden layer, and an output layer. The CBOW model is similar to the NNLM, but it removes the non-linear hidden layer for all words. The objective of the CBOW model is to predict a target word based on its context words (a word window) in the same sentence. The difference between the Continuous Bag-of-Words (CBOW) model and the standard Bag-of-Words (BOW) model is that CBOW uses continuous distributed representations of the context (Mikolov et al., 2013b; Mikolov et al., 2013c). The architecture of the Continuous Bag-of-Words model is shown in Figure 2.1.

Figure 2. 1. Architecture of Continuous Bag-of-Words model (Mikolov et al., 2013b)

The architecture of the Skip-gram model is similar to that of CBOW, but it predicts the context words given a target word in the same sentence, instead of predicting the current word based on the surrounding words (Mikolov et al., 2013b; Mikolov et al., 2013c). The architecture of the Skip-gram model is shown in Figure 2.2. The Skip-gram model uses a certain skip distance k to allow a total of k or fewer skips in the construction of n-gram models (Guthrie et al., 2006).


Figure 2. 2. Architecture of Continuous Skip-gram model (Mikolov et al., 2013b)

The word embedding literature describes other ways to induce word representations, for instance representations inferred from n-gram language models rather than continuous language models (Collobert et al., 2011).
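As an illustration of the two architectures described above, the following short Python sketch (not the training script used in this thesis) shows how CBOW and Skip-gram models can be trained with the gensim implementation of Word2vec; the toy sentences and parameter values are placeholders, not the settings used in our experiments.

from gensim.models import Word2Vec

# Each training sentence is a list of tokens.
sentences = [
    ["barack", "obama", "served", "as", "president"],
    ["an", "airport", "is", "a", "location", "where", "aircraft", "take", "off"],
]

# sg=0 selects CBOW (predict the target word from its context window);
# sg=1 selects Skip-gram (predict the context words from the target word).
cbow_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(skipgram_model.wv.most_similar("president", topn=3))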

2.3 N-gram Models

Generally, language models compute probabilities assigned to sequences of words based on their frequency. An n-gram model, where n represents the number of words to consider in a sequence, assigns probabilities to sequences of words (Roark et al., 2007; Jurafsky and Martin, 2007). N-gram models have been shown to be effective and efficient in several natural language processing tasks. One of our objectives is to compare n-gram models to the entity embedding models in the experiments for invalid type identification and invalid entity detection. For our n-gram models, we consider unigrams, bigrams, and trigrams, as well as the set of these n-grams together. Our n-gram models are built from the collection of the DBpedia entities' abstracts. We rely on the TF-IDF measure to weight the n-grams. We also perform Random Forest feature selection (Rogers and Gunn, 2006; Han et al., 2006; Pan and Shen, 2009; Genuer et al., 2010; Genuer et al., 2015) in order to reduce the number of features and limit the dimensionality of our vectors to 1,000.
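To make the notion of an n-gram model concrete, the following toy Python example (not taken from the thesis) estimates the conditional probability of a word given the previous word from bi-gram counts; the sample sentence is a placeholder.

from collections import Counter

tokens = "the airport is near the city and the airport is busy".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

# Maximum-likelihood estimate: P(w2 | w1) = count(w1, w2) / count(w1)
def bigram_probability(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

# "the airport" occurs twice and "the" occurs three times, so this prints 2/3.
print(bigram_probability("the", "airport"))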


2.4 Machine Learning Techniques

In this section, we present some background information for our experiments and the methods used for evaluating them. We exploit several machine learning algorithms, namely three clustering algorithms and six classification algorithms. In the evaluation part, we report the formulas for the Precision, Recall, F-score, Accuracy, and Area Under the Curve (AUC) scores.

2.4.1 Clustering

Clustering is the process of partitioning a set of data objects into clusters (Han et al., 2012). Clustering algorithms are unsupervised algorithms. In this thesis, we experiment with K-means (Mitchell, 1997; Krishna and Murty, 1999), Mean Shift (Derpanis, 2005), and BIRCH (Zhang et al., 1996).

The aim of K-means is to divide M points into K clusters in N dimensions so as to minimize the within-cluster sum of squared errors (Hartigan and Wong, 1979). The method used by K-means is detailed by Han et al. (2012):
(1) Choose k objects arbitrarily as the initial cluster centers;
(2) Repeat steps (3) and (4) until there is no change;
(3) Reassign each object to the most similar cluster based on a similarity measure;
(4) Update the cluster means with the mean value of all the objects in each cluster.

Figure 2. 3. Clustering process (Han et al., 2012)


As shown in Figure 2.3 (a), the algorithm initially selects center points in each cluster as the initial cluster centers. The center points then change with each iteration, as shown in Figure 2.3 (b). Finally, in Figure 2.3 (c), the center points are stable. The idea of Mean Shift is to treat the points in the d-dimensional feature space as samples from a probability density function (PDF). It performs a gradient ascent procedure to locate the maxima of the locally estimated density until convergence (Derpanis, 2005). BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies; it uses clustering feature trees to represent cluster hierarchies. This algorithm is generally suitable for large amounts of data (Han et al., 2012; Zhang et al., 1996). In clustering, detecting outliers is an important issue. There are three major categories of outliers, as mentioned by Han et al. (2012): a global outlier is a data object that deviates significantly from the rest of the data objects; a contextual outlier is a data object that deviates significantly with respect to a specific context of the object; and collective outliers are several data objects that, taken together, have different characteristics than the others, as shown below (Figure 2.4).

Figure 2. 4. Collective outlier example (Han et al., 2012)

The black objects in Figure 2.4 are collective outliers, since they have a much higher density than the rest of the data objects.
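The following minimal Python sketch (not part of the thesis implementation) shows how the three clustering algorithms discussed above can be applied with scikit-learn; the toy 2-D vectors stand in for the 100-dimensional entity embedding or n-gram feature vectors used later in this thesis.

import numpy as np
from sklearn.cluster import KMeans, MeanShift, Birch

# Toy feature vectors forming two obvious groups.
X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])

print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
print(MeanShift().fit_predict(X))
print(Birch(n_clusters=2).fit_predict(X))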

2.4.2 Classification

Classification is a data analysis technique that builds models to describe each data class. Classification algorithms are supervised learning algorithms (Han et al., 2012). In this work, we relied mainly on the following classification algorithms: Decision Tree (Brown and Myles, 2010), Extra Tree (Geurts et al., 2006), Random Forest (Breiman, 2001), K Nearest Neighbor (KNN) (Guo et al., 2003), Naïve Bayes (Zhang, 2004), and Support Vector Machine (SVM) (Bennett and Bredensteiner, 2000). These algorithms are based on three main paradigms: tree-based algorithms, ensemble algorithms, and probabilistic algorithms.

The tree-based algorithms (Decision Tree, Extra Tree, and Random Forest) can be used to track the decision process. Random Forest is also an ensemble algorithm; it can be used to overcome the overfitting problem that Decision Tree faces. The probabilistic algorithm, Naïve Bayes, tends to work well on most textual data. In the following paragraphs, we explain the principle of each of the algorithms used in our experiments.

The first tree-structured classification algorithm is Decision Tree. Every internal node represents a test on an attribute and each external node (leaf) holds a class label (Apte and Weiss, 1997; Brown and Myles, 2010; Han et al., 2012). The Extra Tree classifier is a tree-structured classifier that implements a meta-estimator to fit several randomized decision trees and then uses averaging to overcome over-fitting on the training data (Geurts et al., 2006). Extra Tree should perform better than a regular Decision Tree, since it fits several decision trees and averages their results, overcoming the over-fitting problem that a single Decision Tree can have. Another tree-based meta-classifier is Random Forest which, unlike Decision Tree, is an ensemble learning method that uses tree predictors. As the number of trees in the forest increases, the error rate converges to a limit and becomes stable, which mitigates the overfitting problem of Decision Tree; this makes Random Forest suitable for high-dimensional data (Breiman, 2001; Chen and Ishwaran, 2012).

The Lazy Learner classifier we use is K Nearest Neighbor (KNN). It only performs generalization to classify tuples by measuring their similarity to training tuples that are stored beforehand. As a Lazy Learner, KNN first stores the training tuples; when it sees the test tuples, it looks for the k training tuples that are the closest to the test tuples. These k training tuples are the k-nearest neighbors of the test tuples (Han et al., 2012; Guo et al., 2003). The Naïve Bayes classifier is a statistical classifier based on the Bayes theorem that computes the probabilities of classes and of the features associated with each class. Naïve Bayes is the simplest Bayesian network, and the attributes are considered independent of each other given the class (Zhang, 2004) (see Figure 2.5).


Figure 2. 5. Example of Naïve Bayes (Zhang, 2004)

Support Vector Machine (SVM) is a very robust classifier that can be used to classify both linear and nonlinear data. It works as follows: it first transforms the training dataset into a higher dimension via a nonlinear mapping (a kernel). The second step is to search for the optimal linear separating hyperplane in the new dimension by using support vectors and margins (Han et al., 2012; Bennett and Bredensteiner, 2000). As shown in Figure 2.6, the margins can be small or large; with a larger margin, the accuracy of the results tends to increase as well.

Figure 2. 6. SVM example with small and large margins (Han et al., 2012)
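As with clustering, the classifiers described in this section can be instantiated with scikit-learn. The sketch below is illustrative only: it uses toy data and default parameters rather than the settings used in the thesis, and it assumes the ensemble ExtraTreesClassifier as the counterpart of the Extra Tree classifier.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])  # toy feature vectors
y = np.array([0, 0, 1, 1])                                      # negative / positive labels

classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(kernel="rbf"),
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, clf.predict([[0.15, 0.15]]))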


2.5. Related Work

The previous sections described some general background information. This section focuses on the related work on DBpedia and Linked Data quality assessment and enhancement, automatic type detection, and outlier detection in Linked Data.

2.5.1. DBpedia and Linked Data Quality Assessment and Enhancement

Several approaches in the literature aim to enhance the quality of DBpedia and Linked Data. Paulheim (2017) presented a survey of approaches for knowledge graph refinement, such as methods for detecting invalid DBpedia types, invalid DBpedia relations, and invalid DBpedia knowledge graph interlinks, along with their evaluation results. The author identified two main problems in the DBpedia knowledge base, namely completeness and correctness, and pointed out several approaches for refining the knowledge base. The survey by Färber et al. (2015) analyzed the five most popular linked data knowledge graphs: DBpedia, Freebase, OpenCyc, Wikidata, and YAGO, and found that these knowledge graphs have problems related to accuracy, trustworthiness, and consistency. Furthermore, the survey also proposed a framework to find the most suitable knowledge graph for given settings. Similarly, Zaveri et al. (2013) propose a user-driven quality evaluation of DBpedia which assesses the quality of DBpedia by both manual and semi-automatic processes. The evaluation relies on the creation of a quality problem taxonomy which is used during crowdsourced quality assessment. The quality problem taxonomy has four dimensions: accuracy, relevancy, representational-consistency, and interlinking. Based on this evaluation, Zaveri et al. (2013) identified around 222,298 incorrect triples and concluded that 11.93% of the tested DBpedia triples had quality issues. Another approach is the test-driven quality assessment by Kontokostas et al. (2014), which is inspired by test-driven software development to explore data quality problems. The authors propose a pattern-based approach that exploits the RDF data model for the data quality testing of RDF knowledge bases. It relies on two sources of input; the first input comes from community feedback and can be used to generate data quality patterns in the form of tuples (V, S), where V is a set of typed pattern variables and S is a SPARQL query template.

Based on the evaluation results, the study identified 32,293 test cases and was able to solve data quality issues in an effective and efficient way. Other approaches focused on particular aspects of the DBpedia knowledge base. For example, Font et al. (2015) performed an in-depth evaluation of domain knowledge representation in DBpedia. The paper confirmed the completeness problems of domain knowledge representation in DBpedia and the necessity of automatic methods for knowledge base completion. Finally, some approaches relied on logical reasoning to detect inconsistencies in knowledge graphs. A rule-based approach to check and handle inconsistencies in DBpedia was proposed by Sheng et al. (2012); it uses rule-based reasoning with MapReduce. Several types of inconsistencies can be detected by the approach: undefined classes/properties, incompatible ranges of datatype properties, inconsistencies in taxonomical links, and invalid entity definitions. Similarly, Töpper et al. (2012) proposed a DBpedia ontology enrichment approach for inconsistency detection. The approach uses statistical methods to enrich the DBpedia ontology and then identifies and resolves inconsistencies in the DBpedia dataset based on the improved ontology. The improved DBpedia ontology is free of syntactic, logical, and semantic errors. Based on the evaluation results, the system processed 3.11 million instances, and 50 thousand of them were detected as inconsistent. Likewise, Lehmann and Buhmann (2010) proposed a tool called ORE (Ontology Repairing and Enrichment) for repairing and enriching knowledge bases. It uses machine learning algorithms to detect ontology modeling problems and can guide users in solving them. When applied to DBpedia, the tool can detect, and guide the user in solving, problems such as incorrect property ranges and incompatibilities with external ontologies.

2.5.2. Automatic Type Detection

There are several cutting-edge related works for automatic type detection (Töpper et al., 2012; Gangemi et al., 2012; Paulheim and Bizer, 2013; Paulheim and Bizer, 2014; Roder et al., 2015; Haidar-Ahmad et al., 2016; Van Erp and Vossen, 2017). Some works rely on the detection of syntactic patterns in definitions to extract type information. For example, the winners of the Open Knowledge Extraction competition (Haidar-Ahmad et al., 2016) extract entity types using SPARQL patterns on the dependency parses of natural language definitions and align the extracted textual types (e.g., "oke:Church") to corresponding classes in

the DBpedia ontology (e.g., "oke:Church" rdfs:subClassOf "dbo:Place") using disambiguation techniques. Another approach for automatic typing is CETUS, a baseline approach for type extraction by Roder et al. (2015), which consists of extracting patterns from DBpedia abstracts and creating local type hierarchies based on the extracted types. The final step maps the extracted types to the classes of the DOLCE+DnS ontology. Another automatic DBpedia entity typing approach is proposed by Gangemi et al. (2012), who present an algorithm and a tool called Tipalo. It interprets a DBpedia entity's natural language definition from the Wikipedia page abstract to identify the most appropriate types for the entity. The pipeline of Tipalo is shown in Figure 2.7.

Figure 2. 7. Pipeline of Tipalo (Gangemi et al., 2012)

As shown in the pipeline, there are five major steps in Tipalo. The first step is to extract the content of the Wikipedia page abstract; the second step is to parse the abstract using FRED, a machine reader for the Semantic Web that parses natural language text into Linked Data. The third step is to select the types and their relationships from the WordNet OWL graph, the fourth step is about disambiguation, and the last step is to produce rdf:type statements based on the results of the previous steps. The authors tested Tipalo on 800 randomly selected DBpedia entities and obtained an F-score of 0.75 based on their manually annotated gold standard of Wikipedia entity types.

Other approaches rely on statistical methods over the DBpedia knowledge base rather than syntactic patterns for automatic type extraction. One approach to link types and instances in linked data is SDType (Statistical Distribution Typing) by Paulheim and Bizer (2013) and Paulheim and Bizer (2014). It uses a weighted voting approach that exploits links between resources as indicators for types. More specifically, it uses the statistical distribution of each ingoing and outgoing link of an instance as an indicator for predicting the instance's types. SDType was evaluated on 3.6 million DBpedia instances and 359 DBpedia types, and led to the linking of an average of 5.6 types for each DBpedia instance and an average of 38 thousand instances per DBpedia type. Furthermore, the SDType approach successfully added 3.4 million new type statements (a 21% increase) to the DBpedia 3.9 release. The most recent approach, and the most similar to our work, is (Van Erp and Vossen, 2017), which combines word embeddings with external information for entity typing. The vector models are used to assign vectors to words based on word embeddings obtained with Word2vec. When testing with the YAGO knowledge base, the authors evaluated 11,682 entities, and 70% of these entities were typed correctly.

Most of the related works presented above rely on SPARQL patterns, statistical distributions, or knowledge extraction from Wikipedia abstracts for automatic typing. In contrast, our method uses entity embedding and n-gram models learned from Wikipedia, coupled with clustering and classification algorithms. The most similar work appears to be (Van Erp and Vossen, 2017), but the authors combine word embeddings with external information instead of entity embeddings for entity typing. Furthermore, they tested with the YAGO knowledge base instead of the DBpedia knowledge base.

5 http://stlab.istc.cnr.it/stlab/WikipediaOntology/
6 http://wit.istc.cnr.it/stlab-tools/fred

2.5.3. Outlier Detection in Linked Data

Several approaches for detecting invalid or faulty knowledge in DBpedia were proposed in the state of the art (Fleischhacker et al., 2014; Paulheim and Bizer, 2014; Debattista et al., 2016), especially by detecting data points that differ significantly from their neighbors. For example, cross-checked outlier detection techniques were used to detect errors in numerical linked data in Fleischhacker et al. (2014). The authors proposed a two-step approach to detect errors in numerical

linked data, such as the population of a city, country, or continent. The first step is to apply outlier detection to the property values from the repository, splitting the data into relevant subsets. For example, subsets of "population" are generated as subpopulations, and outlier detection is applied to these subpopulations. The second step is to exploit the "owl:sameAs" links of the entities in order to also collect values from other repositories, and to perform a second outlier detection on these values. This is why the method is called a cross-checked outlier detection technique. SDValidate, proposed by Paulheim and Bizer (2014), is another outlier detection algorithm, which aims at detecting faulty statements by using a technique based on statistical distributions in a manner similar to SDType. The approach consists of three steps: the first step is to compute the relative predicate frequency, which describes the frequency of the combined predicate and object for each statement. The second step is to compute a confidence score for the selected statements, using the statistical distribution to assign a score to each selected statement based on its properties. More specifically, for each property, a vector is assigned to the predicate's subject and object. The cosine similarity of the two vectors is then computed and stored as the confidence score for the statement. Finally, a threshold τ of 0.15 is used to test whether the statement should be categorized as a faulty statement or not. In Paulheim and Bizer (2014), SDValidate detected 13,000 erroneous statements in DBpedia. Another outlier detection approach to enhance the quality of linked data is by Debattista et al. (2016), which uses a distance-based outlier detection technique to identify potentially incorrect RDF statements. Based on experiments with DBpedia, the approach reaches 0.76 for precision, 0.31 for recall, and 0.43 for F-score, using data taken from DBpedia as a gold standard. In contrast to most of the state of the art, in this thesis we are interested in investigating how entity embeddings can contribute to detecting invalid types and resources and to completing missing types for un-typed DBpedia resources. The main difference between our method and these related works is that our work uses clustering and classification algorithms instead of cross-checked outlier detection, distance-based outlier detection, or statistical distributions to detect invalid DBpedia resources and statements.


Chapter 3: Research Methodology

This chapter presents the process used for gathering our dataset from DBpedia, the methodology followed to build the entity embedding and n-gram models from the abstracts of Wikipedia pages, and the generation of our clustering and classification models. There were 760 DBpedia types available in the DBpedia ontology at the date of our test in March 2017. In this version of the DBpedia ontology and knowledge base, only 461 classes have instances (through rdf:type). Among these 461 types, we ignore the classes with fewer than 20 instances (199 classes), which leads to 358 classes for our experiments. Examples of some of the 199 ignored classes include "ArtisticGenre," "TeamSport," and "SkiResort." The details of the DBpedia ontology are shown in Appendix B.

3.1 General Architecture

This section describes the general architecture of the various modules involved in our experiments (see Figure 3.1).

Figure 3. 1. Flowchart of the DBpedia invalid type and entity detection pipeline

7 http://mappings.dbpedia.org/server/ontology/classes/

The first step is the development of the entity embedding models, which are built with Wiki2vec using the Wikipedia dumps as input. The second step is the development of the n-gram models, which are extracted with the Weka n-gram tokenizer from DBpedia entities' abstracts. The third step is to prepare three datasets from DBpedia related to each of our tasks. The fourth step involves the clustering and classification experiments, which rely on the feature vectors produced in steps 1 and 3. We compare three clustering and seven classification algorithms (including a baseline algorithm). The last step is to evaluate the generated results and discuss the performance of each clustering and classification algorithm.

3.2 Entity Embedding Model Building

The process of building our DBpedia entity embedding model is shown in Figure 3.2.

Figure 3. 2. The process of building the entity embedding model

Initially, we relied on one pre-built model, the English Skip-gram model. We used the pre-built Wiki2vec model for the English Wikipedia (Feb 2015), without stemming, with a Skip-gram architecture. We noticed that many entities of the pre-built model did not have corresponding vectors. To solve this problem, we used both the Skip-gram (Mikolov et al., 2013b; Mikolov et al., 2013c) and CBOW (Mikolov et al., 2013b; Mikolov et al., 2013c) architectures to build our own models. In order to increase the coverage, we trained our models with the following parameters: a minimum number of occurrences of 5, a vector dimension of 100 to limit the size of the entity embedding model, and a window size of 5, which describes the maximum distance between the current and predicted word within a sentence.

In this phase, the Word2vec tool (Mikolov et al., 2013a; Mikolov et al., 2013b; Mikolov et al., 2013c) was envisaged to compute vector representations of words. Given that we were interested in entities and not words, we needed a way to obtain vector representations of these entities (resources). For this, we used Wiki2vec, which is built on top of Word2vec and allows us to build a DBpedia model from Wikipedia. In fact, we can exploit the fact that each Wikipedia page is represented by a DBpedia resource. For example, the Wikipedia page "https://en.wikipedia.org/wiki/Airport" is directly mapped to the DBpedia resource "http://dbpedia.org/resource/Airport." Wiki2vec replaces Wikipedia hyperlinks in Wikipedia pages by their corresponding DBpedia URIs. Next, we run Word2vec on the modified corpus to train the DBpedia entity embedding models.

The detailed process is as follows. Our first step is to process the Wikipedia dumps (in English) to extract the plain text and to eliminate tags, figures, tables, etc. Hyperlinks that represent Wikipedia pages are identified by a specific "DBPEDIA_ID/" prefix (e.g., "DBPEDIA_ID/Barack_Obama" replaces the hyperlink "https://en.wikipedia.org/wiki/Barack_Obama"). After that, lemmatization is performed using NLTK; this allows us to build one vector representation for the words that share the same lemma, so that, for example, Child and Children have the same vector. The final step is to build the model with Wiki2vec.

Once the training process is finished, we obtain a continuous vector representation of single words and DBpedia entities. These vectors are used to compute similarities between entities and types. For example, in the Wiki2vec model, the similarity between "DBPEDIA_ID/Bill_Clinton" and "DBPEDIA_ID/President" can be computed. The obtained entity vectors also represent features for the classification and clustering tasks and are associated with the types that are already represented in DBpedia, when applicable. For example, the vector related to "dbr:Bill_Clinton" is associated with the type "dbo:President." To give a better idea of this notion of distance, a visualization plot based on t-SNE is provided in Figure 3.3.

8 https://dumps.wikimedia.org/enwiki/
9 http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/NGramTokenizer.html
10 https://dumps.wikimedia.org/enwiki/
11 http://www.nltk.org/

Figure 3. 3. Visualization of DBpedia types for the entity Bill_Clinton

The 2-D plot is based on a 100-dimension vector representation and uses PCA (Principal Component Analysis) (Abdi and Williams, 2010) for dimensionality reduction. For example, "dbr:Bill_Clinton" has four related DBpedia types: "dbo:Agent," "dbo:President," "dbo:Politician," and "dbo:Person." Two unrelated types, "dbo:School" and "dbo:Activity," are added to the figure. Based on the plot shown above, we can observe that the types "dbo:Agent," "dbo:President," "dbo:Politician," and "dbo:Person" are closer to the entity "dbr:Bill_Clinton" than the two unrelated types. Furthermore, our trained entity embedding models store both entities and words; entities (i.e., DBpedia resources) are distinguished from words using their "DBPEDIA_ID/" prefix. DBpedia entities might also be represented by plain words, such as "Barack_Obama." For single words, the model usually contains both the DBpedia entity and the word; for example, both "DBPEDIA_ID/Poetry" and "Poetry" are represented in our model. We only use the DBpedia entity "DBPEDIA_ID/Poetry" in our tasks, since we are only interested in DBpedia entities.

12 https://lvdmaaten.github.io/tsne/
13 http://dbpedia.org/page/Barack_Obama
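To illustrate how the trained entity embedding model described in this section is used, the following minimal Python sketch (assuming the model was saved in gensim's Word2vec format; the file name is a hypothetical placeholder) computes the similarity between an entity and a candidate type and retrieves the entity vector that later serves as a feature vector.

from gensim.models import Word2Vec

model = Word2Vec.load("wiki2vec_en_skipgram.model")  # hypothetical model file

# Semantic distance between a DBpedia entity and a candidate DBpedia type.
print(model.wv.similarity("DBPEDIA_ID/Bill_Clinton", "DBPEDIA_ID/President"))

# The 100-dimensional entity vector itself serves as the feature vector
# for the clustering and classification experiments.
vector = model.wv["DBPEDIA_ID/Bill_Clinton"]
print(vector.shape)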

3.3 Datasets Preparation

In this section, we present the process of dataset extraction, first for detecting invalid DBpedia types and completing missing types (training and test datasets), and then for detecting invalid DBpedia entities.

3.3.1. Dataset for Entity Type Detection

The first datasets consist of the training and test datasets for the automatic detection of DBpedia types. For each of the 358 DBpedia types, we retrieve entities with the specified type through the public DBpedia SPARQL endpoint. For example, the following query was run to find instances of the class (type) Airport:

14 http://dbpedia.org/sparql

Figure 3. 4. Sample DBpedia SPARQL query-Airport
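Since the query itself is shown only as a figure, the following hedged Python sketch illustrates an equivalent request issued programmatically with the SPARQLWrapper library against the public DBpedia endpoint; it is not the exact script used in the thesis.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?entity WHERE { ?entity rdf:type dbo:Airport } LIMIT 2000
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["entity"]["value"])  # e.g. http://dbpedia.org/resource/...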

A sample of the retrieved entities is shown in Figure 3.5.


Figure 3. 5. DBpedia SPARQL query result-Airport

To build the training dataset, we select at most 2000 entities for each DBpedia type using the process described above, and then test the availability of each of these entities in our trained Word2vec entity embedding model. If the entity is available in our trained entity embedding model, we select the entity for the dataset; otherwise, the entity is ignored. These entities are tagged as positive examples of that DBpedia type. Negative examples are chosen from a random selection of instances from all the remaining types, except the tested DBpedia types. Given the variations in the number of instances for some DBpedia classes (some classes have few instances, while others may have more than a thousand entities), the number of negative entities depends on the corresponding number of positive entities in order to build a balanced dataset. Duplicates of positive examples are removed from the negative examples, in order to eliminate mis-clustering and mis-classification problems. For example, if "dbr:Bill_Clinton" is selected for the (positive) DBpedia type "dbo:President," then the entity will not be selected as a negative example. This means that if an entity is selected as a positive one, we eliminate any mention of this entity in the negative examples. This can happen if the negative examples are taken from super-classes of the considered type, such as Person and President.
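A simplified Python sketch of this balanced positive/negative selection is given below; the dictionary of entities per type and the embedding vocabulary are placeholders for the SPARQL results and the trained model, and the helper name is hypothetical.

import random

def build_training_examples(target_type, entities_by_type, embedding_vocab):
    # Positive examples: entities of the target type that have a vector
    # in the trained entity embedding model (at most 2000 per type).
    positives = [e for e in entities_by_type[target_type] if e in embedding_vocab][:2000]

    # Negative candidates: entities of the other types, excluding any entity
    # already selected as a positive example.
    candidates = {e for t, ents in entities_by_type.items() if t != target_type
                  for e in ents if e in embedding_vocab} - set(positives)

    # Sample as many negatives as positives to keep the dataset balanced.
    negatives = random.sample(sorted(candidates), min(len(positives), len(candidates)))
    return positives, negatives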


The process of building the test dataset is different from that of the training dataset, as we randomly select around 5 to 6 entities for each type from all of the 358 DBpedia types. In total, we obtained 2,111 entities in our test dataset, distributed among these various types. Each entity has an average of 4 to 5 types. We stored the entities and all their related types. For example, we might have "dbr:Bill_Clinton" with the types "dbo:President," "dbo:Agent," "dbo:Politician," and "dbo:Person." We make sure there are no common entities in the training and test datasets. These training and test datasets are available for download by the research community.

3.3.2. Dataset for Invalid Entity Detection

The second dataset, for the task of detecting invalid DBpedia resources, is collected separately from the datasets used for detecting invalid DBpedia types. To prepare the dataset, we randomly select 2,573 entities through the public DBpedia SPARQL endpoint that are available in our entity embedding models. For each of the selected DBpedia entities, we need all of their related DBpedia resources. For example, the DBpedia resource "dbr:Barack_Obama" is described in the RDF format by a set of triples where the Subject is "dbr:Barack_Obama" and the Predicates are the properties related to the current Subject, such as "dbo:birthPlace." The Objects contain the related resources, which may or may not be invalid. As shown in Figure 3.6, the related resource (Object) through the Predicate "dbo:almaMater" is "dbr:Occidental_College." The related resources through the Predicate "dbo:birthPlace" are "dbr:Hawaii," "dbr:Illinois," and "dbr:Kapiolani_Medical_Center_for_Women_and_Children." These related resources are tagged as positive examples of the DBpedia entity "dbr:Barack_Obama." Negative examples are chosen from a random selection of entities from all of the remaining entities, except the entities that share the same class. For example, the entity "dbr:Hawaii" has five classes: "dbo:Place," "dbo:Location," "dbo:City," "dbo:PopulatedPlace," and "dbo:Settlement." We do not select any entities (resources) that are associated with these classes, such as "dbr:New_York," "dbr:California," and "dbr:Chicago." The number of negative entities depends on the corresponding number of positive entities in order to build a balanced dataset.

15 http://www.site.uottawa.ca/~diana/resources/kesw17/
16 Results retrieved from http://dbpedia.org/page/Barack_Obama

Figure 3. 6. DBpedia page-Barack_Obama

The following query was run to find the related DBpedia resources of the resource “dbr:Barack_Obama” (Figure 3.7).


Figure 3. 7. Sample DBpedia SPARQL query-Barack_Obama

The results for the related DBpedia resources for “dbr:Barack_Obama” are shown in figure 3.8.
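Such a query can also be issued programmatically. The sketch below is only an approximation of the query shown in Figure 3.7, assuming the public DBpedia SPARQL endpoint and the SPARQLWrapper library:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
# Retrieve all object resources linked to dbr:Barack_Obama through any predicate.
sparql.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT DISTINCT ?p ?o WHERE {
        dbr:Barack_Obama ?p ?o .
        FILTER(isIRI(?o))
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for binding in results["results"]["bindings"]:
    print(binding["p"]["value"], binding["o"]["value"])
```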


Figure 3. 8. DBpedia SPARQL query result-Barack_Obama

The experiments on the training datasets help us find the most suitable machine learning algorithms for our task, and the experiments on the test dataset validate the performance of the different machine learning algorithms. Here, we focus on the 358 types that contain at least 20 entities. The remaining DBpedia types are ignored.

Table 3.1 summarizes the size of the three datasets.

Dataset                                                Number of entities
Training dataset for DBpedia invalid type detection    360,843
Test dataset for DBpedia invalid type detection        2,111
Dataset for DBpedia invalid entity detection           108,936 (see footnote 17)
Table 3. 1. Summary of the number of instances in the datasets

17 This number is the total number of entities included in both subjects and objects. There are 2,573 subject entities.

3.4 N-gram Model Building

The n-gram models are extracted using the Weka n-gram tokenizer. The first step is to collect the DBpedia entities’ abstracts through the public DBpedia SPARQL endpoint, and save these abstracts into output files. The second step is to learn the n-gram models using Weka on the output files, by varying the length of the word sequences (n) and using TF-IDF to weight the n-grams. We perform Random Forest feature selection18 (Rogers and Gunn, 2006; Pan and Shen, 2009; Genuer et al., 2010; Genuer et al., 2015) in order to reduce the number of features to 1000. This number was determined after a set of empirical experiments. Other traditional feature selection techniques were also applied and the resulting models tested on the task of DBpedia entity detection, including Principal Component Analysis (PCA) (Abdi and Williams, 2010), Linear Discriminant Analysis (LDA) (Webb, 2002) and the Pearson Coefficient (Kornbrot, 2005). Based on these results, Random Forest feature selection achieves the best performance among all of the tested feature selection techniques. The original dimensions are 36,842 for the uni-gram model, 62,637 for the bi-gram model, 81,830 for the tri-gram model, and 181,309 for the n-gram model; with the Random Forest feature selection technique, the dimension of the n-gram models is reduced to 1,000.
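The thesis relies on Weka for this step; as a rough scikit-learn analogue, the hedged sketch below builds TF-IDF-weighted n-gram features from entity abstracts and keeps the 1,000 features ranked highest by Random Forest importance (the function and variable names are illustrative, not the actual pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

def ngram_features(abstracts, labels, n_low=1, n_high=3, k=1000):
    """TF-IDF weighted n-gram features reduced to the k most important ones (sketch)."""
    vectorizer = TfidfVectorizer(ngram_range=(n_low, n_high))
    X = vectorizer.fit_transform(abstracts)  # sparse TF-IDF matrix over word n-grams

    # Rank features by Random Forest impurity-based importance and keep the top k.
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X, labels)
    top_idx = np.argsort(forest.feature_importances_)[::-1][:k]
    return X[:, top_idx], vectorizer, top_idx
```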

3.5 Evaluation Metrics

Here we describe the metrics used to evaluate the results. The evaluation measures include Precision, Recall, F-Score, Accuracy, and Area Under the Curve (AUC), the latter for the classification experiments. Finally, the Student-t test is described. The evaluation metrics are calculated from a confusion matrix (Table 3.2), a table that summarizes the classification or clustering results. The measures are defined as follows (Han et al., 2012):
• True Positives (TP): the positive examples that were correctly clustered/labeled by the clustering tool or classifier.
• True Negatives (TN): the negative examples that were correctly clustered/labeled by the clustering tool or classifier.

18 A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two sets so that similar response values end up in the same set. The feature selection measure is based on how much each locally optimal split condition decreases the impurity.

• False Positives (FP): the negative examples that were incorrectly clustered/labeled as positive.
• False Negatives (FN): the positive examples that were incorrectly clustered/labeled as negative.
• Positive (P): the total number of positive examples.
• Negative (N): the total number of negative examples.

                             Predicted Class/Cluster
                             Positive    Negative    Total
Actual          Positive     TP          FN          P
Class/Cluster   Negative     FP          TN          N
                Total        P           N           P+N
Table 3. 2. Confusion Matrix

Precision is a measure of exactness, and calculates the percentage of the examples labeled as positive that are actually positive.

precision = TP / (TP + FP)    (3.1)

Recall is a measure of completeness, and calculates the percentage of examples labeled as positive among all examples of the class.

recall = TP / (TP + FN)    (3.2)

F-Score is a combination of Precision and Recall; it can be seen as the harmonic mean of Precision and Recall.

F-score = (2 × precision × recall) / (precision + recall)    (3.3)

Accuracy is the percentage of correctly classified examples.

accuracy = (TP + TN) / (P + N)    (3.4)
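For concreteness, the four measures can be computed directly from the confusion-matrix counts; the short sketch below simply mirrors equations 3.1 to 3.4 (the example counts are arbitrary):

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute Precision, Recall, F-Score and Accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_score, accuracy

# Example: 90 true positives, 85 true negatives, 15 false positives, 10 false negatives.
print(evaluation_metrics(90, 85, 15, 10))
```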


AUC compares the performance of the classifiers not only over the entire range of class distributions, but also over different error costs; it represents the expected performance of the classifier, which is the average of the pessimistic and optimistic estimates (Ling et al., 2003; Fawcett, 2006).

Finally, a Student-t test is used to test the significance level of the difference between two models.


Chapter 4: DBpedia Entity Type Detection

4.1 Experiment Introduction and Setup

This part of the thesis is based on our paper (Zhou et al., 2017) published in the Knowledge Engineering and Semantic Web conference. This phase consists of two parts: clustering and classification. Both parts use Scikit-Learn19 (Pedregosa et al., 2011) on the prepared dataset for each DBpedia type. We perform a binary clustering and a binary classification that represent two main categories for each type of interest: the Type category and the NOT Type category (the positive class/cluster and the negative class/cluster). For example, we want to classify examples as instances of Airport and instances of NOT Airport, using the vectors as features.

In terms of clustering, all of the entities are clustered with the following standard algorithms: K-means, Mean Shift, and Birch. Clustering is performed by computing the Euclidean distance between vectors. The number of clusters is set to two: one represents the positive cluster for the type of interest, and the other one is the negative cluster. As a reminder, for each DBpedia type, the related DBpedia entities are selected using the predicate rdf:type through the DBpedia SPARQL endpoint for the training and test datasets. These entities are considered as examples of the positive class; the same number of negative entities is then selected from DBpedia, excluding the entities of the positive class. Because the produced clusters are not labelled, the cluster with the greater number of positive entities is considered the positive cluster, and vice versa.

In the classification part, several classification algorithms are tested. We experimented with Decision Tree, Extra Tree, Random Forest, K Nearest Neighbor (KNN), Naive Bayes and Support Vector Machine (SVM). We wanted to test tree-based algorithms, such as Decision Tree, since the trees can be helpful to track the decision process. For ensemble methods, Random Forest was used. Naive Bayes classifiers are based on Bayes' theorem and tend to work well on text data and, more generally, on homogeneous features. The Support Vector Machine classifier often has robust performance.
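A minimal sketch of this clustering step is given below, assuming `X` holds the entity vectors and `y` the gold positive/negative labels built as in Section 3.3.1; the produced clusters are mapped to the positive and negative categories by a majority of positive entities, as described above. The hyper-parameters are illustrative, not the exact settings used in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans, Birch, MeanShift

def binary_cluster(X, y, algorithm="kmeans"):
    """Cluster entity vectors into two groups and align the clusters with the gold labels."""
    X, y = np.asarray(X), np.asarray(y)
    if algorithm == "kmeans":
        model = KMeans(n_clusters=2, n_init=10, random_state=0)
    elif algorithm == "birch":
        model = Birch(n_clusters=2)
    else:
        model = MeanShift()  # Mean Shift does not fix the number of clusters in advance
    labels = model.fit_predict(X)

    # The cluster containing more positive entities is taken as the positive cluster.
    cluster_ids = np.unique(labels)
    positive_cluster = cluster_ids[np.argmax([np.sum(y[labels == c]) for c in cluster_ids])]
    predictions = (labels == positive_cluster).astype(int)
    return predictions
```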

19 http://scikit-learn.org/stable/

In the evaluation section, we present clustering and classification results with the different entity embedding and n-gram models, and we compare the results of the different clustering and classification algorithms. In addition, results on both the training and test datasets are shown. The experiments on the training dataset help us find the most appropriate clustering and classification algorithms for our task. Then we build models on the whole training dataset and apply them to the held-out test set, on which we report our final results.

4.2 Clustering Evaluation

The clustering evaluation consists of two subsections: the first contains the evaluation results with the n-gram models, and the second contains the evaluation results with the entity embedding models.

4.2.1 Clustering with N-gram Models

4.2.1.1 Results on the training dataset

Based on the results for all of the 358 DBpedia types, Random Forest gives the best results among all of the feature selection techniques we used. The original uni-gram, bi-gram, tri-gram and n-gram models have tens of thousands of dimensions (see Section 3.4). After performing feature selection with Random Forest, the dimension becomes 1,000.

Tables 4.1 to 4.4 show the results for the 358 DBpedia types on the training dataset with four different n-gram models: uni-grams, bi-grams, tri-grams, and n-grams with n=1~3.

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.819       0.657    0.669     66.3%
Mean Shift    0.357       0.431    0.367     27.4%
Birch         0.881       0.745    0.726     70.2%
Table 4. 1. Summary of the average clustering results with the uni-gram model on the training dataset


Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.816       0.655    0.664     65.9%
Mean Shift    0.361       0.429    0.365     27.5%
Birch         0.800       0.759    0.730     69.8%
Table 4. 2. Summary of the average clustering results with bi-gram model on the training dataset

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.818       0.661    0.670     66.4%
Mean Shift    0.346       0.433    0.367     26.7%
Birch         0.810       0.743    0.721     69.8%
Table 4. 3. Summary of the average clustering results with the tri-gram model on the training dataset

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.897       0.671    0.731     74.4%
Mean Shift    0.407       0.391    0.345     28.8%
Birch         0.858       0.773    0.774     76.2%
Table 4. 4. Summary of the average clustering results with the n-gram model on the training dataset

Based on the results of the uni-gram, bi-gram, tri-gram and n-gram models, the uni-gram, bi-gram and tri-gram models have similar performance. When we look at the detailed results of these three models, the uni-gram model performs slightly better than the bi-gram model, and the bi-gram model performs slightly better than the tri-gram model. According to the results in Tables 4.1 to 4.4, the n-gram model is significantly different from the other three models; it performs around 10% better than the other three models in terms of F-score (we report improvements as differences, in percentage points). K-means and Birch have similar performance, and they usually perform better than Mean Shift. According to the clustering results with the n-gram model (Table 4.4), K-means and Birch perform around 40% better than Mean Shift in terms of F-score. Looking ahead to the entity embedding results (Section 4.2.2), the Skip-gram model has the best performance among all of the models, followed by the n-gram model.


4.2.1.2 Results on the test dataset

Tables 4.5 to 4.8 show the average clustering results with the different n-gram models on the test dataset.

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.505       0.940    0.657     50.9%
Mean Shift    0.528       0.737    0.607     52.9%
Birch         0.507       0.933    0.654     50.8%
Table 4. 5. Summary of the average clustering results with the uni-gram model on the test dataset

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.503       0.937    0.654     50.5%
Mean Shift    0.516       0.718    0.589     51.0%
Birch         0.505       0.933    0.653     50.5%
Table 4. 6. Summary of the average clustering results with the bi-gram model on the test dataset

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.513       0.930    0.655     51.1%
Mean Shift    0.530       0.734    0.602     52.5%
Birch         0.514       0.929    0.655     51.2%
Table 4. 7. Summary of the average clustering results with the tri-gram model on the test dataset

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.507       0.935    0.656     50.9%
Mean Shift    0.525       0.732    0.597     51.6%
Birch         0.510       0.945    0.662     51.6%
Table 4. 8. Summary of the average clustering results with the n-gram model on the test dataset

Based on the average clustering results for the four n-gram models on the test dataset, and in contrast to the results on the training dataset, the uni-gram, bi-gram, tri-gram and n-gram models have very close performance; there is no significant difference between them.


4.2.2 Clustering with Entity Embedding Models

In the evaluation section, we present the clustering results both on the training dataset via cross-validation (Section 4.2.2.1) and on the held-out test dataset (Section 4.2.2.2).

4.2.2.1 Results on the training dataset

This subsection shows the average results on the training dataset with the different entity embedding models. We built two entity embedding models for these experiments. Tables 4.9 and 4.10 show the average clustering results on the 358 DBpedia types with the Skip-gram and CBOW entity embedding models, using the DBpedia knowledge base as gold standard. We consider the cluster with more positive instances as the positive cluster, and vice versa. We report the metrics for the positive class only.

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.937       0.950    0.941     94.2%
Mean Shift    0.560       0.991    0.713     71.3%
Birch         0.941       0.945    0.937     94.0%
Table 4. 9. Summary of the average clustering results with Skip-gram entity embedding model on the training dataset

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.826       0.653    0.669     66.7%
Mean Shift    0.360       0.431    0.368     27.6%
Birch         0.815       0.753    0.733     70.8%
Table 4. 10. Summary of the average clustering results with CBOW entity embedding model on the training dataset

Based on the results of the two entity embedding models, the performance of Skip-gram is usually better than that of CBOW: Skip-gram performs around 26% better than CBOW in terms of F-Score. For Accuracy, the difference is even bigger; Skip-gram performs around 32% better than CBOW.


Furthermore, K-means has a performance similar to Birch, and their performance is much better than that of Mean Shift. With the results of the Skip-gram model, K-means and Birch perform around 23% better than Mean Shift in terms of F-score. When switching to the CBOW model, the difference becomes even bigger (0.7 compared to 0.3). Based on the test dataset results (Section 4.2.2.2), the Skip-gram entity embedding model has the best performance among all of the entity embedding and n-gram models, followed by the CBOW entity embedding model.

4.2.2.2 Results on the test dataset

We perform clustering on the test dataset as a step independent from the training set. In fact, there was no need to separate the data into training and test sets for clustering; we did so only to follow the same structure required for classification. Similar to the experiments on the training dataset, the experiments on the test dataset were also performed with the K-means, Mean Shift, and Birch clustering algorithms with the different entity embedding and n-gram models. Table 4.11 shows the average clustering results on the 358 DBpedia types with the test dataset using the Skip-gram entity embedding model.

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.820       0.935    0.856     83.0%
Mean Shift    0.657       0.951    0.756     69.1%
Birch         0.837       0.932    0.861     83.7%
Table 4. 11. Summary of the average clustering results with Skip-gram entity embedding model on the test dataset

Table 4.12 shows the average clustering results on the 358 DBpedia types with the test dataset using the Continuous Bag-Of-Words (CBOW) entity embedding model.

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.543       0.924    0.667     53.8%
Mean Shift    0.533       0.785    0.627     53.9%
Birch         0.535       0.916    0.660     52.8%
Table 4. 12. Summary of the average clustering results with CBOW entity embedding model on the test dataset


Based on the clustering results with the entity embedding models (Tables 4.11 and 4.12), the results of the Skip-gram model are around 17% better than those of the CBOW model in terms of F-Score; when comparing the results in terms of Accuracy, the difference is even bigger, with the Skip-gram model around 55% better than the CBOW model. In terms of the clustering results of the Skip-gram entity embedding model, K-means and Birch are around 11% better than Mean Shift in terms of F-Score, and around 14% better in terms of Accuracy. For the clustering results of the CBOW entity embedding model and the uni-gram, bi-gram, tri-gram and n-gram models, the three clustering algorithms have similar performance, and there is no significant difference between them. The tables in Appendix A show the detailed results of 10 typical DBpedia types with the K-means clustering algorithm on the Skip-gram and n-gram models.

4.3 Classification Evaluation

This section of the thesis contains the evaluation results using classification algorithms. In this work, we used six classification algorithms: Decision Tree, Extra Tree, K Nearest Neighbor (KNN), Random Forest, Naïve Bayes and Support Vector Machine (SVM). In addition, a baseline algorithm (named DummyClassifier in Scikit-learn) was used as a lower-bound comparison.
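A condensed sketch of this comparison is shown below; the synthetic data stands in for the per-type feature matrices, and the default hyper-parameters are illustrative rather than the exact settings used in the thesis:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the entity embedding (or n-gram) features of one DBpedia type.
X, y = make_classification(n_samples=500, n_features=100, random_state=0)

classifiers = {
    "Baseline": DummyClassifier(strategy="stratified"),
    "Decision Tree": DecisionTreeClassifier(),
    "Extra Tree": ExtraTreesClassifier(n_estimators=100),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="f1")  # 10-fold cross-validation
    print(f"{name}: mean F-score = {scores.mean():.3f}")
```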

4.3.1 Classification with N-gram Models

This section shows the classification results with the various traditional n-gram models. It contains two subsections: the first shows the classification results on the training dataset, and the second shows the classification results on the test dataset.

4.3.1.1 Results on the training dataset

Tables 4.13 to 4.16 show the average classification results with the uni-gram, bi-gram, tri-gram and n-gram models on the training dataset.


Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.498       0.496    0.485     49.7%      0.500
Decision Tree   0.907       0.874    0.884     89.4%      0.898
Extra Tree      0.933       0.879    0.900     91.0%      0.955
KNN             0.933       0.640    0.737     80.1%      0.896
Random Forest   0.932       0.858    0.886     90.1%      0.951
Naïve Bayes     0.835       0.927    0.871     86.9%      0.876
SVM             0.961       0.837    0.884     90.6%      0.960
Table 4. 13. Summary of the average classification results with uni-gram model on the training dataset

Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.497       0.497    0.485     49.5%      0.801
Decision Tree   0.904       0.871    0.881     89.0%      0.895
Extra Tree      0.935       0.876    0.898     91.0%      0.957
KNN             0.932       0.639    0.737     80.2%      0.900
Random Forest   0.934       0.859    0.888     90.2%      0.956
Naïve Bayes     0.836       0.928    0.872     86.9%      0.874
SVM             0.956       0.833    0.879     90.2%      0.955
Table 4. 14. Summary of the average classification results with bi-gram model on the training dataset

Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.501       0.501    0.489     49.9%      0.501
Decision Tree   0.906       0.872    0.883     89.3%      0.897
Extra Tree      0.939       0.881    0.903     91.4%      0.959
KNN             0.934       0.640    0.737     80.2%      0.897
Random Forest   0.938       0.860    0.891     90.5%      0.957
Naïve Bayes     0.840       0.930    0.875     87.2%      0.880
SVM             0.953       0.832    0.878     90.1%      0.950
Table 4. 15. Summary of the average classification results with tri-gram model on the training dataset


Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.504       0.501    0.490     50.2%      0.499
Decision Tree   0.933       0.921    0.923     93.0%      0.933
Extra Tree      0.955       0.926    0.936     94.1%      0.976
KNN             0.954       0.738    0.815     85.2%      0.931
Random Forest   0.955       0.911    0.928     93.7%      0.977
Naïve Bayes     0.855       0.939    0.887     88.4%      0.894
SVM             0.964       0.895    0.921     93.6%      0.976
Table 4. 16. Summary of the average classification results with n-gram model on the training dataset

Based on these classification results, the n-gram model has the best performance in terms of both F-score and Accuracy among all of the n-gram models, followed by the uni-gram, bi-gram and tri-gram models. Extra Tree and SVM have the best results among all the classification algorithms.

4.3.1.2 Results on the test dataset

Tables 4.17 to 4.20 show the average classification results with the uni-gram, bi-gram, tri-gram and n-gram models on the test dataset.

Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.499       0.493    0.487     50.0%      0.492
Decision Tree   0.940       0.923    0.926     92.9%      0.936
Extra Tree      0.962       0.916    0.932     93.8%      0.974
KNN             0.838       0.601    0.635     72.0%      0.845
Random Forest   0.959       0.896    0.918     92.8%      0.967
Naïve Bayes     0.835       0.921    0.866     85.7%      0.872
SVM             0.982       0.874    0.912     92.7%      0.970
Table 4. 17. Summary of the average classification results with uni-gram model on the test dataset


Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.487       0.499    0.485     49.3%      0.496
Decision Tree   0.921       0.853    0.876     88.8%      0.896
Extra Tree      0.948       0.851    0.886     90.1%      0.952
KNN             0.774       0.344    0.430     64.2%      0.801
Random Forest   0.949       0.833    0.874     89.3%      0.948
Naïve Bayes     0.783       0.933    0.842     82.2%      0.823
SVM             0.968       0.778    0.842     87.9%      0.953
Table 4. 18. Summary of the average classification results with bi-gram model on the test dataset

Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.500       0.507    0.495     50.1%      0.494
Decision Tree   0.912       0.789    0.831     84.9%      0.860
Extra Tree      0.920       0.815    0.847     86.3%      0.915
KNN             0.824       0.420    0.501     66.3%      0.791
Random Forest   0.919       0.800    0.836     85.0%      0.916
Naïve Bayes     0.759       0.919    0.819     79.1%      0.794
SVM             0.941       0.697    0.766     81.7%      0.913
Table 4. 19. Summary of the average classification results with tri-gram model on the test dataset

Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.501       0.511    0.498     50.2%      0.502
Decision Tree   0.957       0.922    0.934     93.8%      0.943
Extra Tree      0.970       0.906    0.930     93.7%      0.974
KNN             0.864       0.497    0.852     70.4%      0.827
Random Forest   0.964       0.898    0.921     93.0%      0.968
Naïve Bayes     0.828       0.937    0.871     85.7%      0.859
SVM             0.987       0.962    0.907     92.4%      0.965
Table 4. 20. Summary of the average classification results with n-gram model on the test dataset


Similar to the results on the training dataset, the n-gram model obtains the best results among the four n-gram models in terms of F-score and Accuracy, followed by the uni-gram model, and then the bi-gram and tri-gram models. However, the difference between the n-gram model and the uni-gram model is not significant across the classification algorithms in terms of F-score and Accuracy.

4.3.2 Classification with Entity Embedding Models

This section consists of two subsections: the first contains the evaluation results on the training dataset, and the second contains the evaluation results on the test dataset. The training dataset helps us find the best-performing models, and the test dataset is a good use case for the identification of the types of new DBpedia resources.

4.3.2.1 Results on the training dataset

This subsection contains the results on the training dataset with the different entity embedding models. Tables 4.21 and 4.22 show the average classification results on the 358 DBpedia types using the Skip-gram and CBOW entity embedding models on the training dataset. We report results by 10-fold cross-validation on the training data.

Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.499       0.497    0.487     49.8%      0.502
Decision Tree   0.881       0.897    0.883     88.8%      0.888
Extra Tree      0.962       0.950    0.954     95.8%      0.988
KNN             0.869       0.995    0.922     91.8%      0.976
Random Forest   0.958       0.935    0.944     94.9%      0.986
Naïve Bayes     0.967       0.955    0.959     96.4%      0.991
SVM             0.958       0.986    0.970     97.3%      0.995
Table 4. 21. Summary of the average classification results with Skip-gram entity embedding model on the training dataset


Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.496       0.500    0.486     49.7%      0.503
Decision Tree   0.908       0.896    0.881     89.0%      0.896
Extra Tree      0.937       0.897    0.899     91.1%      0.955
KNN             0.941       0.641    0.739     80.2%      0.898
Random Forest   0.936       0.857    0.887     90.1%      0.955
Naïve Bayes     0.837       0.925    0.871     87.1%      0.877
SVM             0.959       0.836    0.883     90.3%      0.955
Table 4. 22. Summary of the average classification results with CBOW entity embedding model on the training dataset

Based on the classification results with the entity embedding models on the training dataset (Tables 4.21 and 4.22), the Skip-gram model still performs better than the CBOW model, but the difference between them is quite small; the Skip-gram model is only around 7% better than the CBOW model in terms of F-score and Accuracy. The baseline algorithm gets an F-score and Accuracy of around 0.5 for both models, which is the same as random choice. For the results with Skip-gram entity embeddings, Extra Tree, Random Forest, Naïve Bayes, and SVM achieve close performance, and they usually perform better than the other classifiers; their F-Score and Accuracy are close to or above 0.95. When switching to the CBOW model, the performance of these classification algorithms drops by around 10% or more, except for the baseline algorithm.

Based on the overall results of the entity embedding and n-gram models, tree-based algorithms and SVM outperform the other ones. Compared to the results of the Skip-gram model, Naïve Bayes does not perform as well on the n-gram models: its F-Score, Accuracy and AUC are around 0.85 for the n-gram models, compared to 0.95 or higher on the Skip-gram model. Extra Tree and SVM have the best results among all the classification algorithms on both the training and test datasets. On the Skip-gram entity embedding model, the Support Vector Machine performs slightly better than Extra Tree in terms of F-score, with a difference within 2%. When switching to the n-gram model, Extra Tree performs slightly better than the Support Vector Machine in terms of F-score, with a difference within 1%. However, the two algorithms have almost the same Accuracy and AUC.


Furthermore, as in the clustering part, the Skip-gram entity embedding model still has the best performance among all of the entity embedding and uni-gram, bi-gram, tri-gram and n-gram models. The Skip-gram model performs slightly better than the n-gram model in our classification experiments, and the difference is much smaller than in the clustering experiments. For example, with the Extra Tree classification algorithm, the difference between the two models is only around 3% in terms of F-Score and Accuracy. Using the Support Vector Machine classification algorithm on the test dataset, the difference between the two models is around 8% and 6% in terms of F-Score and Accuracy, respectively.

4.3.2.2 Results on the test dataset

This subsection shows the classification results on the test dataset. Tables 4.23 and 4.24 show the average classification results on the 358 DBpedia types using the Skip-gram and CBOW entity embedding models, respectively.

Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.506       0.507    0.495     50.5%      0.503
Decision Tree   0.900       0.892    0.889     89.1%      0.891
Extra Tree      0.972       0.931    0.946     95.2%      0.986
KNN             0.886       0.984    0.927     91.9%      0.975
Random Forest   0.961       0.922    0.938     94.5%      0.983
Naïve Bayes     0.974       0.925    0.943     95.1%      0.986
SVM             0.966       0.967    0.962     96.5%      0.991
Table 4. 23. Summary of the average classification results with Skip-gram entity embedding model on the test dataset


Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.507       0.505    0.495     50.3%      0.506
Decision Tree   0.812       0.836    0.814     81.4%      0.814
Extra Tree      0.920       0.850    0.875     88.6%      0.947
KNN             0.818       0.948    0.868     85.0%      0.950
Random Forest   0.917       0.948    0.872     88.3%      0.941
Naïve Bayes     0.671       0.778    0.678     66.5%      0.798
SVM             0.926       0.890    0.897     90.4%      0.962
Table 4. 24. Summary of the average classification results with CBOW entity embedding model on the test dataset

Based on the classification results on the 358 DBpedia types with the Skip-gram and CBOW entity embedding models on the test dataset (Tables 4.23 and 4.24), the Skip-gram model still performs better than the CBOW model; the Skip-gram model is around 8% better than the CBOW model in terms of F-Score and Accuracy, and around 5% better in terms of Area Under the Curve (AUC). Similar to the classification results on the training dataset, Extra Tree, Random Forest, Naïve Bayes, and SVM have high-quality results and can be used to detect most DBpedia types correctly. When switching from the Skip-gram model to the CBOW model, the overall results drop by around 10%. One interesting point is that the results of Naïve Bayes drop significantly with this switch, from 0.95 to 0.68 in terms of F-Score, and from 0.96 to 0.66 in terms of Accuracy, a difference of around 27% for F-Score and 30% for Accuracy.

4.5 Student-t Test

Several Student-t tests were applied on the Precision, Recall, F-score, and Accuracy results of the Skip-gram entity embedding and n-gram models, using the best classification algorithm, the Support Vector Machine, on both the training and test datasets, with a one-tailed hypothesis. The detailed Student-t test results based on the Precision, Recall, F-score, and Accuracy results of the Skip-gram entity embedding and traditional n-gram models are shown in Table 4.25. For the Student-t test on the training dataset, the results are not statistically significant for Precision. However, the Recall, F-score and Accuracy results are statistically significant at p < 0.01.
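Such a comparison can be reproduced with a paired, one-tailed Student-t test over the per-type scores of the two models; the sketch below uses scipy with toy values in place of the 358 paired per-type results:

```python
import numpy as np
from scipy import stats

# Per-type F-scores for the two models (toy values; in the thesis there is one
# score per DBpedia type, i.e. 358 paired observations per measure).
skipgram_f1 = np.array([0.96, 0.94, 0.95, 0.97, 0.93])
ngram_f1    = np.array([0.93, 0.92, 0.90, 0.95, 0.91])

# Paired t-test; halve the two-sided p-value for a one-tailed hypothesis.
t_stat, p_two_sided = stats.ttest_rel(skipgram_f1, ngram_f1)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(f"t = {t_stat:.3f}, one-tailed p = {p_one_sided:.5f}")
```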


For the Student-t test on the test dataset, the Precision results are statistically significant at p < 0.01, as are Recall and F-score. However, the Accuracy results are not statistically significant at p < 0.01, p < 0.05 or p < 0.1.

Dataset            Results     Significance level   t value   p value     Student-t Test Results
Training dataset   Precision   0.01                 -1.022    0.153       not significant
                   Recall      0.01                 13.436    < 0.00001   significant at p < 0.01
                   F-score     0.01                 7.793     < 0.00001   significant at p < 0.01
                   Accuracy    0.01                 7.722     < 0.00001   significant at p < 0.01
Test dataset       Precision   0.01                 -6.006    < 0.00001   significant at p < 0.01
                   Recall      0.01                 9.059     < 0.00001   significant at p < 0.01
                   F-score     0.01                 5.970     < 0.00001   significant at p < 0.01
                   Accuracy    0.01                 0.109     0.457       not significant
Table 4. 25. Student-t tests on Precision, Recall, F-score, and Accuracy results based on the training and test datasets

4.6 Synthesis and Discussion

Overall, the Skip-gram model obtains the best results among all of the entity embedding and traditional models, followed by the n-gram and uni-gram models. Continuous Bag-Of-Words has a performance similar to the bi-gram and tri-gram models. The difference between the results obtained using the Skip-gram model and the n-gram models is statistically significant in terms of the Recall, F-score and Accuracy results on the training dataset. Furthermore, based on the Precision, Recall and F-score results on the test dataset, the difference between the results obtained using the Skip-gram model and the n-gram models is also statistically significant.

Random Forest, Extra Tree and the Support Vector Machine were among the top classifiers, with good performance on all of the entity embedding and n-gram models. The Random Forest feature selection algorithm helped to decrease the dimensionality of the feature vectors for the n-gram models.


Finally, the Support Vector Machine obtains the best results among all of the clustering and classification algorithms. The classification results are usually better than the clustering results, regardless of which entity embedding or n-gram model is being used.


Chapter 5: DBpedia Invalid Entity Detection in Resources Description

This chapter describes the task of DBpedia invalid entity detection in resource descriptions, including the methods and experimental setups for this task, and detailed evaluation results. The task is to detect invalid DBpedia entities in DBpedia resource descriptions. Here are some examples of invalid entities. Consider the entity “dbr:Cake,” whose resource description contains triples related to types of food or ingredients. In the “dbr:Cake” description, we found one invalid entity, “dbr:Ernest_Dichter.” “dbr:Ernest_Dichter” is a psychologist and thus should instead be related to “dbo:Person” and “dbo:Agent,” for instance. Similarly, in “dbr:Farmer,” where entities are about “dbo:Agriculture,” we identified “dbr:Iowa_State_University” as an outlier, as it has the type “dbo:University” and is related to “dbo:School” and “dbo:Education,” for instance.

At the time of our experiments, there were several outliers and invalid facts in some resource descriptions. However, DBpedia is a knowledge base that is updated and cleaned on a regular basis. Some of the invalid entities detected during our experiments were later removed from the DBpedia resource descriptions, such as the invalid entity “dbr:Vanilla_Ice,” which was removed from the “dbr:Earthquake” resource description. Based on our latest experiments, most (90%) of the DBpedia entities do not currently have any invalid entities. Thus, to be able to evaluate the interest of our approach, we built an artificial dataset by adding noisy/invalid triples to randomly selected entities. The objective of this task is to determine whether the clustering and classification algorithms can detect this external noise. For instance, in the “dbr:Cake” resource description, external noise entities such as “dbr:FreeBSD,” “dbr:Hydrazine,” and “dbr:Johns_Hopkins_University” have indeed been detected as invalid entities. The original entities from the “Cake” resource description, such as “dbr:Cupcake,” “dbr:Butter,” and “dbr:Sprinkles,” have been detected as valid entities. Similarly, for the “dbr:Farmer” resource description, our approach has been able to discriminate between invalid entities, such as “dbr:Solar_Wind,” “dbr:Linux,” and “dbr:Fibre_Channel,” and valid entities such as “dbr:Farm,” “dbr:Local_Food,” and “dbr:Agriculture.”

5.1 Experiment Setup

Similarly to our previous tasks, this task consists of two parts: clustering and classification. Both parts apply clustering and classification algorithms to the prepared dataset for each of the DBpedia entities. In both cases, we perform a binary clustering and a binary classification that represent two main categories for each entity of interest: the entity’s category and NOT the entity’s category (the positive class/cluster and the negative class/cluster). For example, for the entity “dbr:Barack_Obama,” we want to classify entities that occur as objects in the RDF description of “dbr:Barack_Obama” as either valid or invalid. The instances for the entity’s category (positive class/cluster) are the entities extracted from the DBpedia resource description; for example, the positive instances for the entity “Barack_Obama” are extracted from the DBpedia resource description of “Barack_Obama,” such as “dbr:Occidental_College,” “dbr:Hawaii,” and “dbr:Illinois.” The invalid entities (negative class/cluster) are selected randomly from all of the remaining entities, except the entities that share a class with the positive instances. For example, “dbr:Illinois” has five classes: “dbo:Place,” “dbo:Location,” “dbo:AdministrativeRegion,” “dbo:PopulatedPlace,” and “dbo:Region.” We do not select any entities (resources) that are associated with these classes, such as “dbr:New_Jersey,” “dbr:Los_Angeles,” and “dbr:Florida.” The number of negative instances depends on the corresponding number of positive instances in order to build a balanced dataset. The dataset is divided into training and test parts and we perform a 10-fold cross-validation. Finally, we report the evaluation results for both the positive and negative classes.
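The sketch below illustrates how such a balanced per-resource dataset could be assembled; `classes_of` (a map from entity to its DBpedia classes) and the entity pool are hypothetical placeholders for the resources described above, and the code is a simplification rather than the exact procedure used in the thesis:

```python
import random

def build_resource_dataset(positives, classes_of, all_entities, w2v_model, seed=0):
    """Build a balanced valid/invalid entity set for one resource description (sketch).

    positives:    entities that occur as objects in the resource's RDF description.
    classes_of:   dict mapping an entity URI to its set of DBpedia classes.
    all_entities: pool of DBpedia entities from which negative examples are drawn.
    """
    random.seed(seed)
    pos_set = set(positives)
    # Classes covered by the positive (valid) entities.
    positive_classes = set().union(*(classes_of.get(e, set()) for e in positives))

    # Negatives: entities that share no class with the positives and have a vector.
    pool = [e for e in all_entities
            if e not in pos_set
            and not (classes_of.get(e, set()) & positive_classes)
            and e in w2v_model.key_to_index]
    negatives = random.sample(pool, min(len(positives), len(pool)))

    X = [w2v_model[e] for e in positives + negatives]
    y = [1] * len(positives) + [0] * len(negatives)   # 1 = valid, 0 = invalid
    return X, y
```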

5.2 Clustering Evaluation

In the clustering evaluation part, there are two subsections, namely, invalid entity detection in resource descriptions with clustering on n-gram models and with clustering on entity embedding models.


5.2.1 Clustering with N-gram Models

Tables 5.1 to 5.4 show the clustering results of the 2573 DBpedia entities with the uni-gram, bi-gram, tri-gram, and n-gram (n=1~3) models for the positive class.

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.726       0.760    0.681     64.0%
Mean Shift    0.520       0.010    0.020     48.1%
Birch         0.697       0.773    0.671     61.5%
Table 5. 1. Summary of the average clustering results with the uni-gram model for positive class

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.528       0.922    0.647     50.1%
Mean Shift    0.524       0.016    0.030     39.4%
Birch         0.513       0.956    0.656     50.1%
Table 5. 2. Summary of the average clustering results with the bi-gram model for positive class

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.533       0.900    0.646     51.1%
Mean Shift    0.272       0.006    0.011     42.1%
Birch         0.535       0.905    0.646     50.9%
Table 5. 3. Summary of the average clustering results with the tri-gram model for positive class

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.571       0.873    0.649     52.4%
Mean Shift    0.580       0.021    0.037     42.9%
Birch         0.502       0.993    0.665     50.0%
Table 5. 4. Summary of the average clustering results with the n-gram model for positive class

Based on the results of the uni-gram, bi-gram, tri-gram and n-gram models for the positive class, the bi-gram, tri-gram and n-gram models have similar performance; when we look at the detailed results of these three models, the n-gram model performs slightly better than the tri-gram model, followed by the bi-gram model. The uni-gram model has the best performance among all the n-gram models, but its performance is not significantly better than that of the other three models; it performs only around 2% better than the other three models in terms of F-score and Accuracy. As in the case of the entity embedding models in the type identification task, K-means and Birch have similar performance, and they perform much better than Mean Shift regardless of which n-gram model is being used. For all n-gram models, the F-score and Accuracy of Mean Shift are extremely low; the algorithm cannot be used for detecting invalid entities in resource descriptions.

Tables 5.5 to 5.8 show the average results for the negative class for the 2573 DBpedia entities with the four different n-gram models: uni-gram, bi-gram, tri-gram, and n-gram with n=1~3.

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.549       0.521    0.457     64.0%
Mean Shift    0.505       0.993    0.670     48.1%
Birch         0.507       0.457    0.415     61.5%
Table 5. 5. Summary of the average clustering results with the uni-gram model for negative class

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.393       0.081    0.073     50.1%
Mean Shift    0.500       0.989    0.664     39.4%
Birch         0.427       0.045    0.046     50.1%
Table 5. 6. Summary of the average clustering results with the bi-gram model for negative class

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.473       0.121    0.127     51.1%
Mean Shift    0.494       0.993    0.659     42.1%
Birch         0.496       0.113    0.121     50.9%
Table 5. 7. Summary of the average clustering results with the tri-gram model for negative class


Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.409       0.175    0.152     52.4%
Mean Shift    0.509       0.991    0.672     42.9%
Birch         0.062       0.007    0.007     50.0%
Table 5. 8. Summary of the average clustering results with the n-gram model for negative class

Overall, we can notice that the clustering algorithms are not very good at detecting noisy/invalid entities. The n-gram model performs slightly better than the tri-gram model, followed by the bi-gram model. The uni-gram model has the best performance among all the n-gram models, and its performance is significantly better than that of the other three models. Unlike the results for the positive class, Mean Shift performs better than K-means and Birch for the negative class.

5.2.2 Clustering with Entity Embedding Models

We tested the two entity embedding models, Skip-gram and CBOW, in these experiments as well. Tables 5.9 and 5.10 show the average clustering results of the 2573 DBpedia entities with the Skip-gram and CBOW entity embedding models for the positive class.

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.894       0.930    0.906     90.2%
Mean Shift    0.064       0.003    0.005     49.5%
Birch         0.887       0.905    0.885     88.0%
Table 5. 9. Summary of the average clustering results with the Skip-gram entity embedding model for positive class

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.877       0.901    0.881     87.8%
Mean Shift    0.002       0.000    0.000     49.9%
Birch         0.861       0.869    0.851     84.8%
Table 5. 10. Summary of the average clustering results with the CBOW embedding model for positive class

Based on the evaluation results of the two entity embedding models for the positive class, the performance of Skip-gram is better than that of CBOW.


Furthermore, K-means has a performance similar to Birch regardless of which entity embedding model is being used, and their performance is much better than that of Mean Shift. With the Skip-gram and CBOW entity embedding models, K-means performs around 2% better than Birch in terms of F-score and Accuracy.

Tables 5.11 and 5.12 show the average clustering results with the Skip-gram and CBOW entity embedding models for the negative class.

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.931       0.874    0.894     90.2%
Mean Shift    0.496       0.992    0.661     49.5%
Birch         0.906       0.854    0.864     88.0%
Table 5. 11. Summary of the average clustering results with the Skip-gram entity embedding model for negative class

Algorithm     Precision   Recall   F-Score   Accuracy
K-means       0.905       0.856    0.871     87.8%
Mean Shift    0.499       0.999    0.665     49.9%
Birch         0.878       0.828    0.836     84.8%
Table 5. 12. Summary of the average clustering results with the CBOW embedding model for negative class

Based on the evaluation results of the two entity embedding models for the negative class, the performance of Skip-gram is better than that of CBOW. Furthermore, K-means and Birch have similar performance, and they perform much better than Mean Shift regardless of which entity embedding model is being used. With the Skip-gram and CBOW entity embedding models, K-means and Birch are around 20% better than Mean Shift.

Based on the overall clustering results for both the positive and negative classes, the Skip-gram model has the best performance among all of the entity embedding and n-gram models, followed by the CBOW model. K-means and Birch have similar performance, and they perform much better than Mean Shift, except for the negative class with the four n-gram models. With the Skip-gram and CBOW entity embedding models, the K-means and Birch clustering algorithms are able to detect most of the external noisy entities.


5.3 Classification Evaluation

In the classification evaluation part, there are two subsections, namely, invalid entity detection in resource descriptions with classification on n-gram models and on entity embedding models. We report the results by 10-fold cross-validation for the classification experiments.

5.3.1 Classification Evaluation with N-gram Models

Tables 5.13 to 5.16 show the classification results with the uni-gram, bi-gram, tri-gram and n-gram models for the positive class.

Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.503       0.499    0.492     49.4%      0.499
Decision Tree   0.786       0.774    0.770     77.5%      0.778
Extra Tree      0.874       0.757    0.796     81.3%      0.821
KNN             0.789       0.589    0.590     69.2%      0.700
Random Forest   0.825       0.684    0.730     75.9%      0.767
Naïve Bayes     0.912       0.920    0.912     91.2%      0.914
SVM             0.914       0.814    0.832     84.6%      0.860
Table 5. 13. Summary of the average classification results on the uni-gram model for positive class

Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.500       0.500    0.491     49.6%      0.501
Decision Tree   0.693       0.635    0.645     66.2%      0.668
Extra Tree      0.810       0.595    0.643     69.5%      0.710
KNN             0.591       0.528    0.443     56.4%      0.572
Random Forest   0.764       0.554    0.600     66.2%      0.677
Naïve Bayes     0.848       0.871    0.854     85.4%      0.856
SVM             0.622       0.541    0.459     55.6%      0.602
Table 5. 14. Summary of the average classification results on the bi-gram model for positive class


Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.494       0.494    0.485     49.1%      0.496
Decision Tree   0.733       0.547    0.577     63.7%      0.645
Extra Tree      0.775       0.495    0.513     61.9%      0.634
KNN             0.456       0.355    0.310     55.8%      0.563
Random Forest   0.743       0.472    0.483     60.3%      0.619
Naïve Bayes     0.783       0.811    0.789     78.8%      0.791
SVM             0.763       0.629    0.585     63.2%      0.668
Table 5. 15. Summary of the average classification results on the tri-gram model for positive class

Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.501       0.499    0.491     49.5%      0.500
Decision Tree   0.784       0.767    0.765     77.0%      0.773
Extra Tree      0.870       0.744    0.784     80.3%      0.812
KNN             0.652       0.786    0.633     61.6%      0.621
Random Forest   0.819       0.680    0.724     75.3%      0.762
Naïve Bayes     0.908       0.888    0.892     89.5%      0.896
SVM             0.530       0.528    0.420     53.3%      0.579
Table 5. 16. Summary of the average classification results on the n-gram model for positive class

Based on the classification results with the traditional n-gram models for the positive class, the uni-gram model has the best performance among all of the n-gram models. The bi-gram, tri-gram and n-gram models have similar performance; when we look at their detailed results, the n-gram model performs slightly better than the tri-gram model, and the tri-gram model performs slightly better than the bi-gram model. The uni-gram model's performance is significantly better than that of the bi-gram and tri-gram models; it performs around 20% better than these two models. However, the difference between the uni-gram and n-gram models is only around 5%. Naïve Bayes has the best results among all the classification algorithms regardless of which n-gram model is being used. One interesting point is that when switching from the uni-gram to the bi-gram, tri-gram, and n-gram models, the results of the Support Vector Machine drop significantly, from 0.85 to 0.6 in terms of F-Score and Accuracy, a difference of around 25%.


Tables 5.17 to 5.20 show the classification results with the uni-gram, bi-gram, tri-gram and n-gram models for the negative class.

Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.495       0.500    0.488     49.4%      0.500
Decision Tree   0.783       0.783    0.773     77.5%      0.778
Extra Tree      0.789       0.886    0.823     81.3%      0.821
KNN             0.725       0.812    0.698     69.2%      0.700
Random Forest   0.737       0.849    0.777     75.9%      0.767
Naïve Bayes     0.920       0.908    0.909     91.2%      0.914
SVM             0.843       0.906    0.849     84.6%      0.860
Table 5. 17. Summary of the average classification results on the uni-gram model for negative class

Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.501       0.501    0.492     49.6%      0.501
Decision Tree   0.661       0.701    0.665     66.2%      0.668
Extra Tree      0.685       0.824    0.718     69.5%      0.710
KNN             0.530       0.616    0.483     56.4%      0.572
Random Forest   0.654       0.799    0.692     66.2%      0.677
Naïve Bayes     0.871       0.841    0.850     85.4%      0.856
SVM             0.502       0.664    0.495     55.6%      0.602
Table 5. 18. Summary of the average classification results on the bi-gram model for negative class

Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.497       0.497    0.488     49.1%      0.496
Decision Tree   0.638       0.743    0.642     63.7%      0.645
Extra Tree      0.646       0.773    0.631     61.9%      0.634
KNN             0.542       0.772    0.577     55.8%      0.563
Random Forest   0.620       0.766    0.616     60.3%      0.619
Naïve Bayes     0.807       0.770    0.780     78.8%      0.791
SVM             0.691       0.707    0.596     63.2%      0.668
Table 5. 19. Summary of the average classification results on the tri-gram model for negative class


Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.500       0.502    0.492     49.5%      0.500
Decision Tree   0.774       0.779    0.766     77.0%      0.773
Extra Tree      0.782       0.879    0.814     80.3%      0.812
KNN             0.635       0.457    0.431     61.6%      0.621
Random Forest   0.732       0.843    0.770     75.3%      0.762
Naïve Bayes     0.893       0.905    0.894     89.5%      0.896
SVM             0.411       0.630    0.455     53.3%      0.579
Table 5. 20. Summary of the average classification results on the n-gram model for negative class

Based on the classification results with the four n-gram models for the negative class, the uni-gram model has the best performance among the uni-gram, bi-gram, tri-gram and n-gram models. When we look at the detailed results of the other three models, the n-gram model performs slightly better than the tri-gram model, and the tri-gram model performs slightly better than the bi-gram model. The difference between the uni-gram and n-gram models is only around 10% in terms of F-score and Accuracy. Naïve Bayes has the best results among all the classification algorithms regardless of which n-gram model is being used. One interesting point is that when switching from the uni-gram to the bi-gram, tri-gram, and n-gram models, the results of the Support Vector Machine (SVM) drop significantly, from 0.85 to 0.55 in terms of F-Score and from 86% to 58% in terms of Accuracy.

5.3.2 Classification Evaluation on Entity Embedding Models

Tables 5.21 and 5.22 show the classification results on the 2573 DBpedia entities with the Skip-gram and CBOW entity embedding models for the positive class.


Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.500       0.502    0.493     49.7%      0.501
Decision Tree   0.838       0.832    0.830     83.2%      0.833
Extra Tree      0.931       0.886    0.904     90.8%      0.910
KNN             0.935       0.949    0.939     93.9%      0.940
Random Forest   0.924       0.879    0.896     90.1%      0.903
Naïve Bayes     0.937       0.944    0.938     93.9%      0.939
SVM             0.944       0.958    0.948     94.8%      0.949
Table 5. 21. Summary of the average classification results with the Skip-gram entity embedding model for positive class

Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.499       0.504    0.492     49.7%      0.501
Decision Tree   0.786       0.778    0.774     77.9%      0.782
Extra Tree      0.901       0.824    0.855     86.4%      0.866
KNN             0.894       0.923    0.900     89.4%      0.896
Random Forest   0.893       0.813    0.845     85.5%      0.858
Naïve Bayes     0.922       0.925    0.921     92.2%      0.923
SVM             0.932       0.939    0.933     93.3%      0.934
Table 5. 22. Summary of the average classification results with the CBOW entity embedding model for positive class

Based on the results on the 2573 DBpedia entities with the Skip-gram and CBOW entity embedding models for the positive class, the Skip-gram model performs slightly better than the CBOW model, by around 5%. The baseline algorithm gets an Accuracy of around 50% for both models, which is the same as random choice. For the results with the Skip-gram entity embedding model, K Nearest Neighbor, Naïve Bayes, and Support Vector Machine achieve very close performance, and they perform better than the other classifiers; their F-Score and Accuracy are close to or above 0.95, so they can detect most of the DBpedia entities correctly. When switching to the CBOW model, the performance of each algorithm drops by around 10%, except for the baseline algorithm, whose Accuracy is still around 50%.

Tables 5.23 and 5.24 show the classification results on the 2573 DBpedia entities with the Skip-gram and CBOW entity embedding models for the negative class.


Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.503       0.501    0.493     49.7%      0.501
Decision Tree   0.835       0.833    0.828     83.2%      0.833
Extra Tree      0.892       0.934    0.909     90.8%      0.910
KNN             0.949       0.931    0.937     93.9%      0.940
Random Forest   0.886       0.927    0.902     90.1%      0.903
Naïve Bayes     0.946       0.934    0.937     93.9%      0.939
SVM             0.957       0.941    0.947     94.8%      0.949
Table 5. 23. Summary of the average classification results with the Skip-gram entity embedding model for negative class

Algorithm       Precision   Recall   F-Score   Accuracy   AUC
Baseline        0.504       0.499    0.493     49.7%      0.501
Decision Tree   0.784       0.785    0.778     77.9%      0.782
Extra Tree      0.839       0.909    0.868     86.4%      0.867
KNN             0.923       0.869    0.878     89.4%      0.896
Random Forest   0.830       0.903    0.868     85.5%      0.858
Naïve Bayes     0.928       0.921    0.921     92.2%      0.923
SVM             0.938       0.929    0.931     93.3%      0.934
Table 5. 24. Summary of the average classification results with the CBOW entity embedding model for negative class

Based on the results on the 2573 DBpedia entities with the Skip-gram and CBOW entity embedding models for the negative class, the Skip-gram model performs slightly better than the CBOW model; the difference is around 3% in terms of F-score and Accuracy. For the results with the Skip-gram entity embedding model, K-Nearest-Neighbor, Naïve Bayes, and Support Vector Machine (SVM) have high performance; their F-Score is around 0.95 (and their Accuracy around 95%), so these classification algorithms can detect most of the external noisy entities correctly.


5.4 Synthesis and Discussion

For the invalid resource identification, overall, the Skip-gram model obtains the best results among all of the entity embedding and n-gram models for both the positive and negative classes, followed by the CBOW model, and then the uni-gram and n-gram models, regardless of which clustering or classification algorithm is being used. The n-gram model obtains a performance similar to that of the uni-gram model, followed by the bi-gram and tri-gram models.

The classification algorithms perform better than the clustering algorithms regardless of which entity embedding or n-gram model is used. The SVM classifier gets the best results among all of the classification algorithms on the Skip-gram entity embedding model for both the positive and negative classes. However, when switching to the n-gram models, the Naïve Bayes classifier gets the best results. Furthermore, the performance of SVM varies between models: with the Skip-gram, CBOW and uni-gram models, SVM has a performance similar to Naïve Bayes, but when switching to the bi-gram, tri-gram and n-gram models, the performance of SVM drops significantly. Naïve Bayes, K-Nearest-Neighbor and Support Vector Machine have good performance with the entity embedding models for both the positive and negative classes; thus, these three algorithms are suitable for the task of DBpedia invalid entity detection in resource descriptions.


Chapter 6: Discussion and Conclusion

In this thesis, we addressed the tasks of building our own entity embedding and n-gram models for DBpedia quality enhancement by detecting invalid DBpedia types, completing missing DBpedia types, and detecting invalid DBpedia entities. We compared the results of different clustering and classification algorithms, and the results of different entity embedding and n-gram models. Section 6.1 presents the results of applying the Random Forest classification model to the whole DBpedia knowledge base. Section 6.2 presents our conclusions and recommendations for future work.

6.1 Applying the Random Forest classification model on the whole DBpedia Knowledge Base

To test the interest of the models learned in this thesis, we decided to apply one of our classification models to the whole DBpedia knowledge base. We performed several types of experiments: the identification of new and similar types for entities in the DBpedia knowledge base, in order to ascertain both the interest and the accuracy of our classification procedure, and the identification of wrong types in already available rdf:type statements.

Based on our experiments with the three clustering and six classification algorithms, we came to the conclusion that SVM had the best overall performance when coupled with the Skip-gram model. Random Forest has good overall performance, and most of its results are close to those of the Support Vector Machine. However, the Support Vector Machine is slow compared to Random Forest. Taking into account both performance and efficiency, we decided to use the Random Forest classifier for the experiments on the whole DBpedia knowledge base.

There are around 4.7 million resources in the current DBpedia. We checked the availability of these entities in our embedding model. We found that 1.4 million of these entities were represented by a vector in our model, a coverage rate of around 30%. Compared to the pre-built model20 that we first used, where only 0.22 million of them were available, with a coverage of around 5%, this is better and justifies the interest of our own word2vec model. Our explanation for the absence of the remaining 3.3 million entities is the use of a threshold of 5 for taking an entity into account (an entity that occurs fewer than 5 times is ignored).

The first experiment on the whole DBpedia knowledge base tested how many entities have new types based on our methods. The experiment proceeded as follows: for each of the 1.4 million DBpedia entities, we tested the entity with the 358 classifiers previously trained, then compared the classification results with the types already available for these entities in the DBpedia knowledge base. For example, “dbr:Donald_Trump” originally has two classes, “dbo:Person” and “dbo:Agent,” in the DBpedia knowledge base. Our classification procedure discovered four classes for the entity “dbr:Donald_Trump:” “dbo:Person,” “dbo:Agent,” “dbo:Politician,” and “dbo:President.” Thus, the new types are “dbo:President” and “dbo:Politician.” Based on the results on the whole DBpedia knowledge base, 80,797 out of 1,406,828 entities were associated with new types, with around two new types per entity on average. The percentage of entities with new types is 5.74%.
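A simplified sketch of this per-entity check is shown below, assuming a dictionary of the 358 trained binary classifiers keyed by DBpedia type and a Gensim-style entity embedding model; it is illustrative only:

```python
def discover_new_types(entity, classifiers, existing_types, w2v_model):
    """Run all per-type classifiers on one entity and report types not yet in DBpedia.

    classifiers:    dict mapping a DBpedia type URI to a trained binary classifier.
    existing_types: set of rdf:type values the entity already has in DBpedia.
    """
    if entity not in w2v_model.key_to_index:
        return set()  # entity not covered by the embedding model
    vector = w2v_model[entity].reshape(1, -1)

    predicted = {t for t, clf in classifiers.items() if clf.predict(vector)[0] == 1}
    return predicted - existing_types
```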

Table 6.1 shows the detailed results for the top five DBpedia classes that have the biggest number of new DBpedia entities.

DBpedia Class   Number of new entities   Percentage of new entities
Animal          2381                     0.169%
Eukaryote       2869                     0.204%
Person          2291                     0.163%
Place           3399                     0.242%
Species         3002                     0.213%
Table 6. 1. DBpedia new entity based on the whole DBpedia knowledge base

20 https://github.com/idio/wiki2vec

Figure 6.1 shows a set of new rdf:type triples discovered using our methods.

Figure 6.1. Example of new rdf:type triples

The second experiment counted how many entities had the same types as those shown in DBpedia according to our methods. The experiment proceeded as follows: for each DBpedia entity, we ran the 358 classifiers and compared the classification results with the DBpedia knowledge base. For example, “dbr:Barack_Obama” originally has four classes: “dbo:Person,” “dbo:Agent,” “dbo:Politician,” and “dbo:President.” Based on our classification results, the same four types were correctly assigned to the entity “dbr:Barack_Obama” by our classifiers. Based on the results on the whole DBpedia knowledge base, 1,144,719 out of 1,406,828 entities have the same types as in the DBpedia knowledge base. The percentage of entities that have the same types is 81.37%. This confirms the reliability of our classification methods and the value of using our classifiers to automatically identify the types of new resources that will emerge in Wikipedia and DBpedia.

Table 6.2 shows the detailed results for the top five DBpedia classes with the largest numbers of entities having the same types as in DBpedia, according to our methods.

DBpedia Class       Number of same entities    Percentage of similar entities
CareerStation       10001                      0.711%
Document            9983                       0.710%
PersonFunction      9958                       0.708%
Sound               9856                       0.706%
SportsTeamMember    9833                       0.699%

Table 6.2. DBpedia same entity based on the whole DBpedia knowledge base


The third experiment counted how many classes gained new entities according to our methods. The experiment proceeded as follows: for each DBpedia class, we applied the corresponding classifier (among the 358 classifiers trained previously) to the entities of the whole DBpedia knowledge base. We then compared the classification results with the DBpedia knowledge base to determine how many classes gained new entities. For example, the class “dbo:Politician” originally has more than a hundred thousand entities, such as “dbr:Barack_Obama,” “dbr:George_Bush,” and “dbr:Bill_Clinton.” Based on the experiment results, “dbr:Donald_Trump” and “dbr:Rex_Tillerson” were added as new entities to the class “dbo:Politician.” Based on the results on the whole DBpedia knowledge base, 358 out of 358 classes (100%) gained new entities.

The fourth experiment counted the number of invalid RDF triples with the predicate rdf:type according to our methods. The experiment proceeded as follows: for each RDF triple with the predicate rdf:type in the DBpedia knowledge base, we ran the 358 classifiers to classify the subject and compared the output to the object. For example, based on our classification results, the subject “dbr:Xanthine” is correctly classified with the objects “dbo:ChemicalSubstance” and “dbo:ChemicalCompound.” Thus the RDF triples <dbr:Xanthine, rdf:type, dbo:ChemicalSubstance> and <dbr:Xanthine, rdf:type, dbo:ChemicalCompound> are classified as valid RDF triples. However, the classification results do not assign the type “dbo:Airport” to this subject; thus, the RDF triple <dbr:Xanthine, rdf:type, dbo:Airport> is classified as an invalid RDF triple. Among the 4,161,452 rdf:type triples that were tested, 968,944 were detected as invalid triples. The overall percentage of invalid RDF triples is 23.28%.
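The validation step can be sketched as follows, reusing the hypothetical classifiers and vectors structures introduced earlier; this is an illustration of the idea, not the exact code used for the experiments.

# A minimal sketch of the rdf:type validation step, reusing the hypothetical
# `vectors` and `classifiers` structures from the earlier sketches.
def is_valid_type_triple(subject, obj_class, vectors, classifiers):
    """Return True/False if the triple can be checked, or None if it cannot."""
    if subject not in vectors or obj_class not in classifiers:
        return None  # entity or class not covered by our models
    vec = vectors[subject].reshape(1, -1)
    return bool(classifiers[obj_class].predict(vec)[0])

# is_valid_type_triple("dbr:Xanthine", "dbo:ChemicalCompound", vectors, classifiers)  # True
# is_valid_type_triple("dbr:Xanthine", "dbo:Airport", vectors, classifiers)           # False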

Figure 6.2 shows several invalid triples detected by our methods.

Figure 6.2. Example of invalid triples


Table 6.3 shows the detailed results for the top five DBpedia classes with the largest numbers of invalid triples.

DBpedia Class      Number of invalid RDF triples    Percentage of invalid RDF triples
ComicsCharacter    6670                             1.603%
Language           10456                            2.513%
Person             5538                             1.331%
Place              7354                             1.767%
Species            16027                            3.852%

Table 6.3. DBpedia invalid RDF triples

6.2 Conclusion and Future Work

In the DBpedia entity type detection part (Chapter 4), the experiments show that the system we built can detect most of the invalid DBpedia types correctly, and can help us to complete missing types for untyped DBpedia resources. In the DBpedia invalid entity detection part (Chapter 5), our experimental results show that the system we built can detect most of the invalid DBpedia entities correctly. In the experiments in both parts (Chapter 4 and Chapter 5), we chose the best models and algorithms, and we presented further detailed results for a sample of DBpedia types and entities. Overall, the Skip-gram entity embedding model has the best results among all of the entity embedding and n-gram models.

In the DBpedia entity type detection experiments (Chapter 4), Support Vector Machine has the best overall performance among all of the clustering and classification algorithms. For clustering, K-means and Birch usually have similar performance. For classification, Extra Tree, Random Forest, and SVM have robust performance across all of the entity embedding and n-gram models, on both the training and test datasets. The Naïve Bayes classifier has performance close to Support Vector Machine on the training dataset, but its performance drops significantly on the test dataset.


In the DBpedia invalid entity detection experiments (Chapter 5), Naïve Bayes has the best overall performance among all of the clustering and classification algorithms. For clustering, K-means and Birch usually perform much better than Mean Shift. For classification, K Nearest Neighbor (KNN), Random Forest, and Naïve Bayes have robust performance across all of the entity embedding and n-gram models. The Support Vector Machine classifier performs similarly to K Nearest Neighbor and Naïve Bayes with the Skip-gram, CBOW, and uni-gram models, but its performance drops significantly with the bi-gram, tri-gram, and n-gram models.

Recall the four research questions we proposed earlier:

RQ1: Can entity embeddings help detect the relevant types of an entity? The answer to the first research question is yes. The entity embeddings can help detect relevant types of an entity. Based on the results with Support Vector Machine, more than 98% of the relevant types of an entity can be detected with the Skip-gram entity embedding model.

RQ2: How do entity embeddings compare with traditional n-gram models for type identification? The answer to the second research question is that not all of the entity embedding models perform better than traditional n-gram models. The Skip-gram entity embedding model has the best performance among all of the entity embedding and n-gram models. However, the n-gram and uni-gram models usually perform better than the Continuous Bag-Of-Words (CBOW) entity embedding model.

RQ3: Can entity embeddings help detect the relevant entities in the RDF description of a DBpedia resource? The answer to the third research question is yes. The entity embeddings can help detect the relevant entities in the RDF description of a DBpedia resource. Based on the results with Support Vector Machine, around 95% of the relevant entities in the RDF description of a DBpedia resource can be detected with the Skip-gram entity embedding model.


RQ4: How do entity embeddings compare with traditional n-gram models for invalid entity detection? The answer to the fourth research question is that all of the entity embedding models perform better than the traditional n-gram models. The Skip-gram entity embedding model has the best performance among all of the entity embedding and n-gram models. Furthermore, the Continuous Bag-Of-Words entity embedding model also performs better than the n-gram, uni-gram, bi-gram, and tri-gram traditional models.

There are still some limitations to our approach. The first limitation is that the vector representation of entities is the only feature used to describe DBpedia entities. Features based on the DBpedia knowledge base, or extracted from the Wikipedia abstracts or complete pages, could be used to enhance the performance of our classifiers. Another limitation is that our entity embedding models do not contain all of the DBpedia entities; many DBpedia entities are missing from our entity embedding models, even if they are more complete than the ones available in the state of the art. In terms of comparison with results obtained in the state of the art, our initial plan was to compare our methods with SDValidate and SDType (Paulheim and Bizer, 2014). However, we were not able to run the code provided despite repeated efforts. Similarly, the authors did not share the specific datasets they extracted from DBpedia, so we could not use the same gold standards.

The approach for detecting the validity of an RDF triple in a resource description could be significantly enhanced. For example, we did not use the properties or predicates to identify the validity of an RDF triple. Most importantly, we relied on the available DBpedia descriptions as our gold standard. Given that these descriptions might contain incorrect statements, it is not clear how this will impact the discovery of invalid triples in unknown resources. Another limitation is that we only built classifiers for 358 DBpedia classes. There are 199 DBpedia classes that do not have entities, and for which we cannot build classifiers with our current method.

The first direction for future work is to use more features rather than a single vector representation of entities. One important aspect would be to identify how properties can be exploited to assess the validity of an RDF triple, rather than relying only on the semantic distance between the subject and object based on the obtained vectors. Furthermore, building classifiers for the 199 ontological classes without any entities in DBpedia is important future work. One solution would be to add some entities to these 199 classes manually and build classifiers for them, or to compute a semantic distance between the vectors of entities and vectors representing these classes in order to suggest entities. Another important direction for future work is to build larger entity embedding models that include more DBpedia entities, and to use the same approaches to detect invalid DBpedia types, complete missing types, and detect invalid entities. In terms of entity embedding model building, we plan to try other entity embedding tools, such as GloVe (Pennington et al., 2014), to see if they can lead to better models.
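As an illustration of the semantic-distance idea, the sketch below compares an entity vector with a class vector approximated as the centroid of the class's known entities; this is only one possible way to realize the proposal and was not implemented in this thesis.

# A sketch of one possible realization of the semantic-distance idea (not
# implemented in this thesis): approximate a class vector as the centroid of the
# vectors of its known entities and compare it to an entity vector by cosine similarity.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def class_centroid(entity_vectors):
    """Approximate a class vector as the mean of its entities' vectors."""
    return np.mean(np.vstack(entity_vectors), axis=0)

# score = cosine_similarity(vectors["dbr:Barack_Obama"], class_centroid(politician_vectors))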


References

Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics.

Abele, A., McCrae, J. P., Buitelaar, P., Jentzsch, A., & Cyganiak, R. (2017). Linking Open Data cloud diagram 2017. Retrieved from http://lod-cloud.net/

Alani, H., Kim, S., Millard, D. E., Weal, M. J., Hall, W., Lewis, P. H., & Shadbolt, N. R. (2003). Automatic Ontology-Based Knowledge Extraction from Web Documents. IEEE Intelligent Systems, 18(1), 14–21.

Apte, C., & Weiss, S. (1997). Data mining with decision trees and decision rules. Data Mining, 13(2–3), 197–210.

Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z. (2007). DBpedia: A nucleus for a Web of Open Data. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4825 LNCS, pp. 722–735).

Bengio, Y., Ducharme, R., Vincent, P., Janvin, C. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3, 1137–1155.

Bengio, Y., Schwenk, H., Senécal, J. S., Morin, F., & Gauvain, J. L. (2006). Neural probabilistic language models. Studies in Fuzziness and Soft Computing, 194, 137–186.

Bennett, K., & Bredensteiner, E. (2000). Duality and geometry in SVM classifiers. Icml, 57–64.

Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American, 284, 34–43.

Bizer, C., Heath, T., & Berners-Lee, T. (2009a). Linked data-the story so far. International Journal on Semantic Web and Information Systems, 5(3), 1–22.

Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., & Hellmann, S. (2009b). DBpedia - A crystallization point for the Web of Data. Journal of Web Semantics.

Bollacker, K., Cook, R., & Tufts, P. (2007). Freebase: A shared database of structured general human knowledge. Proceedings of the National Conference on Artificial Intelligence, 22(2), 1962.


Bollacker, K., Evans, C., Paritosh, P., Sturge, T., & Taylor, J. (2008). Freebase: a collaboratively created graph database for structuring human knowledge. SIGMOD 08 Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 1247–1250.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

Brown, S. D., & Myles, A. J. (2010). Decision Tree Modeling in Classification. In Comprehensive Chemometrics (Vol. 3, pp. 541–569).

Chen, T., Tang, L. A., Sun, Y., Chen, Z., Zhang, K. (2016). Entity embedding-based anomaly detection for heterogeneous categorical events. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) 2016 (Vol. 2016–January, pp. 1396–1403).

Chen, X., & Ishwaran, H. (2012). Random forests for genomic data analysis. Genomics.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011). Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12, 2493–2537.

Debattista J., Lange C., Auer S. (2016). A Preliminary Investigation Towards Improving Linked Data Quality Using Distance-Based Outlier Detection. In: Li YF. et al. (eds) Semantic Technology. JIST 2016. Lecture Notes in Computer Science, vol 10055. Springer, Cham.

Derpanis, K. G. (2005). Mean Shift Clustering. Computer, 1(x), 1–3.

Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., & Vrandečić, D. (2014). Introducing wikidata to the linked data web. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8796, pp. 50–65).

Fabian, M., Gjergji, K., & Gerhard, W. (2007). YAGO: A core of semantic knowledge unifying WordNet and Wikipedia. 16th International Conference on World Wide Web, 697–706.

Fagerberg, J., Fosaas, M., & Sapprasert, K. (2012). Innovation: Exploring the knowledge base. Research Policy, 41(7).

Färber, M., Ell, B., Menne, C., & Rettinger, A. (2015). A Comparative Survey of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web, 1, 1–5.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.

Fellbaum, C. (2010). WordNet. In Theory and Applications of Ontology: Computer Applications (pp. 231–243).


Fleischhacker, D., Paulheim, H., Bryl, V., Völker, J., & Bizer, C. (2014). Detecting errors in numerical linked data using cross-checked outlier detection. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8796, pp. 357–372).

Font, L., Zouaq, A., & Gagnon, M. (2015). Assessing the Quality of Domain Concepts Descriptions in DBpedia. In Proceedings - 11th International Conference on Signal-Image Technology and Internet-Based Systems, SITIS 2015 (pp. 254–261).

Font, L., Zouaq, A., & Gagnon, M. (2017), Assessing and Improving Domain Knowledge Representation in DBpedia. Open Journal of Semantic Web (OJSW), 4(1): p. 1-19.

Gangemi, A., Nuzzolese, A., Presutti, V., Draicchio, F., Musetti, A., & Ciancarini, P. (2012). Automatic Typing of DBpedia Entities. The Semantic Web – ISWC 2012 SE - 5, 7649, 65– 81.

Ganguly, D., Roy, D., Mitra, M., Jones, G. J. F. (2015). Word Embedding Based Generalized Language Model for Information Retrieval. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 795–798.

Garcia, A., Szomszor, M., Alani, H., & Corcho, O. (2009). Preliminary results in disambiguation using DBpedia. Information Retrieval, 39 Suppl 8(1–2), 41–44.

Genuer, R., Poggi, J.-M., & Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14), 2225–2236.

Genuer, R., Poggi, J.-M., & Tuleau-Malot, C. (2015). VSURF: An R Package for Variable Selection Using Random Forests. The R Journal, 7(2), 19–33.

Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42.

Goldberg, Y., & Levy, O. (2014). word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method. arXiv preprint arXiv:1402.3722, 1–5.

Guo, G., Wang, H., Bell, D., Bi, Y., & Greer, K. (2003). kNN Model-Based Approach in Classification. On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE, 2888, 986–996.

Guthrie, D., Allison, B., & Liu, W. (2006). A closer look at skip-gram modelling. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006), 1222–1225.

Hahm, Y., Park, J., Lim, K., Kim, Y., Hwang, D., & Choi, K. (2014). Named Entity Corpus Construction using Wikipedia and DBpedia Ontology. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), 2565–2569.


Haidar-Ahmad, L., Font, L., Zouaq, A., & Gagnon, M. (2016). Entity typing and linking using SPARQL patterns and DBpedia. In Communications in Computer and Information Science (Vol. 641, pp. 61–75).

Han, L., Embrechts, M., Szymanski, B., Sternickel, K., & Ross, A. (2006). Random Forests Feature Selection with Kernel Partial Least Squares: Detecting Ischemia from MagnetoCardiograms. In Proceedings of the European Symposium on Artificial Neural Networks (pp. 221–226). Bruges, Belgium.

Han, J., Kamber, M., & Pei, J. (2012). Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann, San Francisco, CA.

Hartigan, J. A., & Wong, M. A. (1979). A K-Means Clustering Algorithm. Applied Statistics, 28(1), 100–108.

Hellmann, S., Filipowska, A., Barriere, C., Mendes, P. N., & Kontokostas, D. (2013). NLP & DBpedia: An upward knowledge acquisition spiral. In CEUR Workshop Proceedings (Vol. 1064).

Hellmann, S., Stadler, C., Lehmann, J., & Auer, S. (2009). DBpedia live extraction. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5871 LNCS, pp. 1209–1223).

Hu, Z., Huang, P., Deng, Y., Gao, Y., & Xing, E. (2015). Entity Hierarchy Embedding. In Proceedings of the Association for Computational Linguistics 2015 (ACL 2015), pp. 1292–1300.

Jain, P., Hitzler, P., Sheth, A. A. P., Verma, K., & Yeh, P. P. Z. (2010). Ontology alignment for linked open data. Iswc, 402–417.

Jain, P., Hitzler, P., Verma, K., Yeh, P. Z., & Sheth, A. P. (2012). Moving Beyond SameAs with PLATO: Partonomy Detection for Linked Data. In Proceedings of the 23rd ACM Conference on Hypertext and Social Media (pp. 33–42).

Jurafsky, D., & Martin, J. H. (2007). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Computational Linguistics, 26(4), 638–641.

Kabir, S., Ripon, S., Rahman, M., & Rahman, T. (2014). Knowledge-based Data Mining Using Semantic Web. IERI Procedia, 7, 113–119.

Keong, B. V., Anthony, P. (2011). Meta powered by DBpedia. In Proceedings of the 2011 International Conference on Semantic Technology and Information Retrieval, STAIR 2011 (pp. 89–93).


Kliegr, T. (2015). Linked hypernyms: Enriching DBpedia with Targeted Hypernym Discovery. Journal of Web Semantics, 31, 59–69.

Kliegr, T., & Zamazal, O. (2014). Towards Linked Hypernyms Dataset 2.0: complementing DBpedia with hypernym discovery. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), 3517–3523.

König, M., Dirnbek, J., & Stankovski, V. (2013). Architecture of an open knowledge base for sustainable buildings based on Linked Data technologies. Automation in Construction, 35, 542–550.

Kontokostas, D., Westphal, P., Auer, S., Hellmann, S., Lehmann, J., Cornelissen, R., Zaveri, A. (2014). Test-driven evaluation of linked data quality. In Proceedings of the 23rd international conference on World Wide Web - WWW ’14 (pp. 747–758).

Kornbrot, D. (2005). Pearson Product Moment Correlation. In Encyclopedia of Statistics in Behavioral Science.

Krishna, K., & Murty, M. N. (1999). Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 29(3), 433–439.

Krötzsch, M., Vrandečić, D., Völkel, M., Haller, H., Studer, R. (2007). Semantic Wikipedia. Web Semantics, 5(4), 251–261.

Lehmann, J., & Bühmann, L. (2010). ORE - A tool for repairing and enriching knowledge bases. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6497 LNCS, pp. 177–193).

Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., Bizer, C. (2015). DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2), 167–195.

Li, W., Raskin, R., & Goodchild, M. F. (2012). Semantic similarity measurement based on knowledge mining: An artificial neural net approach. International Journal of Geographical Information Science, 26(8), 1415–1435.

Ling, C. X., Huang, J., & Zhang, H. (2003). AUC: A better measure than accuracy in comparing learning algorithms. Advances in Artificial Intelligence Lecture Notes in Computer Science (Vol. 2671, pp. 329–341).

Maedche, A., & Staab, S. (2001). Ontology Learning for the Semantic Web. IEEE Intelligent Systems, 16(2), 72–79.

Mayfield, J., & Finin, T. (2012). Evaluating the Quality of a Knowledge Base Populated from Text. Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-Scale Knowledge Extraction (AKBC-WEKEX 2012), 68–73.


Mendes, P. N., Jakob, M., & Bizer, C. (2012). DBpedia: A Multilingual Cross-Domain Knowledge Base. Language Resources and Evaluation LRES, 1813–1817.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013c). Distributed Representations of Words and Phrases and their Compositionality. Nips, 1–9.

Mikolov, T., Corrado, G., Chen, K., & Dean, J. (2013b). Efficient Estimation of Word Representations in Vector Space. Proceedings of the International Conference on Learning Representations (ICLR 2013), 1–12.

Mikolov, T., Yih, W., & Zweig, G. (2013a). Linguistic regularities in continuous space word representations. Proceedings of NAACL-HLT, (June), 746–751.

Mitchell, T. M. (1997). Machine Learning. McGraw Hill.

Moran, S. (2012). Using Linked Data to Create a Typological Knowledge Base. In Linked Data in Linguistics: Representing and Connecting Language Data and Language Metadata (pp. 129–138).

Morsey, M., Lehmann, J., Auer, S., Stadler, C., Hellmann, S. (2012). DBpedia and the live extraction of structured data from Wikipedia. Program, 46(2), 157–181.

Pan, X. Y., & Shen, H. B. (2009). Robust prediction of B-factor profile from sequence using two- stage SVR based on random forest feature selection. Protein Pept Lett, 16(12), 1447–1454.

Paulheim, H. (2017). Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web, 8(3), 489–508.

Paulheim, H., & Bizer, C. (2013). Type inference on noisy RDF data. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8218 LNCS, pp. 510–525).

Paulheim, H., Bizer, C. (2014). Improving the Quality of Linked Data Using Statistical Distributions. International Journal on Semantic Web and Information Systems (IJSWIS), 10(June), 63–86.

Meijer, P., Verloop, N., & Van Driel, J. (2001). Teacher knowledge and the knowledge base of teaching. International Journal of Educational Research, 35(5), 441–461.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.


Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543).

Roark, B., Saraclar, M., & Collins, M. (2007). Discriminative n-gram language modeling. Computer Speech & Language, 21(2), 373–392.

Roder, M., Usbeck, R., Speck, R., & Ngonga Ngomo, A. C. (2015). CETUS - a baseline approach to type extraction. In Communications in Computer and Information Science (Vol. 548, pp. 16–27).

Rodriguez, M. A. (2009). A Graph Analysis of the Linked Data Cloud. Statistics, (March), 1–7.

Rogers, J., & Gunn, S. (2006). Identifying feature relevance using a random forest. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3940 LNCS, pp. 173–184).

Seok, M., Song, H.-J., Park, C.-Y., Kim, J.-D., Kim, Y.-S. (2016). Named Entity Recognition using Word Embedding as a Feature 1. International Journal of Software Engineering and Its Applications, 10(2), 93–104.

Shadbolt, N., Hall, W., & Berners-Lee, T. (2006). The semantic web revisited. IEEE Intelligent Systems.

Shen, W., Wang, J., Luo, P., & Wang, M. (2012). LINDEN: linking named entities with knowledge base via semantic knowledge. Www ’12, 449.

Sheng, Z., Wang, X., Shi, H., & Feng, Z. (2012). Checking and handling inconsistency of DBpedia. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7529 LNCS, pp. 480–488).

Simmie, J., & Lever, W. F. (2002). Introduction: The Knowledge-based City. Urban Studies, 39(6), 855–857.

Staab, S., & Studer, R. (2009). Handbook on Ontologies - Second edition. International Handbooks on Information Systems, 811.

Suchanek, F. M., Kasneci, G., & Weikum, G. (2007). YAGO: a core of semantic knowledge. Proceedings of the 16th International Conference on World Wide Web, 697–706.

Suchanek, F. M., Kasneci, G., & Weikum, G. (2008). YAGO: A Large Ontology from Wikipedia and WordNet. Web Semantics, 6(3), 203–217.

Syed, Z., & Finin, T. (2011). Creating and Exploiting a Hybrid Knowledge Base for Linked Data. In Communications in Computer and Information Science (Vol. 129, pp. 3–21).


Tanon, T. P., Vrandečić, D., Schaffert, S., & Steiner, T. (2016). From Freebase to Wikidata: The Great Migration. Proceedings of the 25th International Conference on World Wide Web, 1419–1428.

Töpper, G., Knuth, M., & Sack, H. (2012). DBpedia ontology enrichment for inconsistency detection. Proceedings of the 8th International Conference on Semantic Systems - I-SEMANTICS '12, 33.

Van Erp M., Vossen P. (2017). Entity Typing Using and DBpedia. In: van Erp M. et al. (eds) Knowledge Graphs and Language Technology. ISWC 2016. Lecture Notes in Computer Science, vol 10579. Springer, Cham.

Van Hooland, S., Verborgh, R., & Van de Walle, R. (2012). Joining the Linked Data Cloud in a Cost-Effective Manner. Information Standards Quarterly, 22(2), 24–28.

Verloop, N., Van Driel, J., & Meijer, P. (2001). Teacher knowledge and the knowledge base of teaching. International Journal of Educational Research, 35(5), 441–461.

Vrandečić, D., & Krötzsch, M. (2014). Wikidata: A Free Collaborative Knowledgebase. Communications of the ACM, 57(10), 78–85. https://doi.org/10.1145/2629489

Webb, A. (2002). Linear discriminant analysis. Statistical Pattern Recognition, 123–168.

Wienand, D., & Paulheim, H. (2014). Detecting incorrect numerical data in DBpedia. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8465 LNCS, pp. 504–518).

Yao, X., & Van Durme, B. (2014). Information Extraction over Structured Data: Question Answering with Freebase. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 956–966).

Yih, W.-T., Richardson, M., Meek, C., Chang, M.-W., & Suh, J. (2016). The Value of Semantic Parse Labeling for Knowledge Base Question Answering. ACL (Short Paper), 201–206.

Zaveri, A., Kontokostas, D., Sherif, M. A., Bühmann, L., Morsey, M., Auer, S., Lehmann, J. (2013). User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems - I-SEMANTICS ’13 (p. 97).

Zhang, F., Yuan, N. J., Lian, D., Xie, X., & Ma, W.-Y. (2016). Collaborative Knowledge Base Embedding for Recommender Systems. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 353–362.

Zhang, H. (2004). The Optimality of Naive Bayes. Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference FLAIRS 2004, 1(2), 1–6.


Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, 1, 103–114.

Zhang, Z., Chen, S., & Feng, Z. (2013). Semantic Annotation for Web Services Based on DBpedia. In Service Oriented System Engineering (SOSE), 2013 IEEE 7th International Symposium on (pp. 280–285).

Zheng, Z., Li, F., Huang, M., & Zhu, X. (2010). Learning to link entities with knowledge base. Hlt 2010, (June), 483–491.

Zhou, G., He, T., Zhao, J., Hu, P. (2015). Learning Continuous Word Embedding with Metadata for Question Retrieval in Community Question Answering. In Proceedings of ACL.

Zhou, H., Zouaq, A., Inkpen, D. (2017). DBpedia Entity Type Detection using Entity Embeddings and N-Gram Models. In Proceedings of the International Conference on Knowledge Engineering and Semantic Web (KESW 2017), Szczecin, Poland.


Appendix A

Appendix A shows more detailed results on specific types and entities. Tables A.1 and A.2 show the detailed results for 10 typical DBpedia types obtained with the K-means clustering algorithm, using the Skip-gram and n-gram models, on the test dataset.

Type         Precision    Recall    F-Score    Accuracy
Aircraft     1.000        1.000     1.000      100%
Book         0.857        1.000     0.923      91.7%
Cartoon      1.000        0.667     0.800      83.3%
Device       0.962        1.000     0.980      98.0%
Economist    1.000        1.000     1.000      100%
Film         1.000        1.000     1.000      100%
Game         0.800        0.762     0.781      78.6%
Holiday      0.556        1.000     0.714      60.0%
Island       0.875        1.000     0.933      92.9%
Language     1.000        0.600     0.750      80.0%

Table A.1. Detailed K-means clustering results on ten typical DBpedia types with skip-gram model on test dataset


Type         Precision    Recall    F-Score    Accuracy
Aircraft     0.546        1.000     0.706      58.3%
Book         0.455        0.833     0.588      41.7%
Cartoon      0.517        1.000     0.682      53.3%
Device       0.510        1.000     0.676      52.0%
Economist    0.444        0.800     0.571      40.0%
Film         0.478        0.917     0.629      45.8%
Game         0.512        1.000     0.677      52.4%
Holiday      0.444        0.800     0.571      40.0%
Island       0.539        1.000     0.700      57.1%
Language     0.474        0.900     0.621      45.0%

Table A.2. Detailed K-means clustering results on ten typical DBpedia types with n-gram model on test dataset

Tables A.3 and A.4 show the detailed results for 10 typical DBpedia types obtained with the SVM classification algorithm, using the Skip-gram and n-gram models, on the test dataset.

Type         Precision    Recall    F-Score    Accuracy    AUC
Aircraft     0.857        1.000     0.923      91.7%       1.000
Book         1.000        1.000     1.000      100%        1.000
Cartoon      1.000        1.000     1.000      100%        1.000
Device       1.000        1.000     1.000      100%        1.000
Economist    1.000        1.000     1.000      100%        1.000
Film         1.000        1.000     1.000      100%        1.000
Game         0.889        0.762     0.821      83.3%       0.934
Holiday      1.000        1.000     1.000      100%        1.000
Island       1.000        1.000     1.000      100%        1.000
Language     1.000        0.900     0.947      95.0%       1.000

Table A.3. Detailed SVM Classification results on ten typical DBpedia types with skip-gram model on test dataset


Type         Precision    Recall    F-Score    Accuracy    AUC
Aircraft     1.000        0.833     0.909      91.7%       1.000
Book         1.000        1.000     1.000      100%        1.000
Cartoon      1.000        0.933     0.966      96.7%       1.000
Device       1.000        0.920     0.958      96.0%       0.998
Economist    1.000        0.800     0.889      90.0%       1.000
Film         0.900        0.750     0.818      83.3%       0.979
Game         1.000        0.619     0.765      81.0%       0.918
Holiday      1.000        1.000     1.000      100%        1.000
Island       1.000        1.000     1.000      100%        1.000
Language     1.000        1.000     1.000      100%        1.000

Table A.4. Detailed SVM Classification results on ten typical DBpedia types with n-gram model on test dataset


Appendix B

Appendix B shows the details of the 760 DBpedia Types (extracted from the DBpedia Ontology), whether they have any related entities and whether they have been used in the Training and test datasets DBpedia Ontology (Type) Any related entity Used in dataset Abbey No No AcademicConference Yes No AcademicJournal Yes Yes AcademicSubject No No Activity Yes Yes Actor Yes Yes AdministrativeRegion Yes Yes AdultActor Yes Yes Agent Yes Yes Agglomeration No No Aircraft Yes Yes Airline Yes Yes Airport Yes Yes Album Yes Yes Altitude No No AmateurBoxer Yes Yes Ambassador Yes Yes AmericanFoorballPlayer Yes Yes AmericanFoorballTeam Yes Yes AmericanFootballCoach No No AmericanFootballLeague Yes Yes Amphibian Yes Yes AmusementParkAttraction Yes Yes


AnatomicalStructure Yes Yes Animal Yes Yes AnimangaCharacter Yes Yes Anime Yes Yes Annotation No No Arachnid Yes No Archaea Yes No Archeologist No No ArcherPlayer No No Archipelago No No Architect Yes Yes ArchitecturalStructure Yes Yes Archive No No Area No No Arena No No Aristocrat No No Arrondissement No No Artery Yes No Article No No ArtificialSatellite Yes Yes Artist Yes Yes ArtistDiscography Yes Yes ArtisticGenre No No Artwork Yes Yes Asteroid Yes No Astronaut Yes No Athlete Yes Yes Athletics No No AthleticsPlayer No No Atoll No No


Attack No No AustralianFootballLeague Yes No AustralianFootballTeam Yes Yes AustralianRulesFootballPlayer Yes Yes Automobile Yes Yes AutomobileEngine Yes No AutoRacingLeague Yes No Award Yes Yes BackScene No No Bacteria Yes Yes BadmintonPlayer Yes Yes Band Yes Yes Bank Yes Yes Baronet Yes No baseballLeague Yes Yes baseballPlayer Yes Yes BaseballSeason Yes No BaseballTeam Yes Yes BasketballLeague Yes Yes BasketballPlayer Yes Yes BasketballTeam Yes Yes Bay Yes No Beach No No BeachVolleyballPlayer Yes No BeautyQueen Yes Yes Beer No No Beverage Yes Yes Biathlete No No BiologicalDatabase Yes No Biologist No No


Biomolecule Yes Yes Bird Yes Yes Birth No No Blazon No No BloodVessel No No BoardGame No No BobsleighAthlete No No Bodybuilder Yes Yes BodyOfWater Yes Yes Bone Yes Yes Book Yes Yes BowlingLeague No No Boxer Yes Yes Boxing No No BoxingCategory No No BoxingLeague No No BoxingStyle No No Brain Yes Yes Brewery Yes No Bridge Yes Yes BritishRoyalty Yes No Broadcaster Yes Yes BroadcastNetwork Yes Yes BrownDwarf No No Building Yes Yes BullFighter No No BusCompany Yes Yes BusinessPerson Yes No Camera No No CanadianFootballLeague Yes No


CanadianFootballPlayer No No CanadianFootballTeam Yes Yes Canal Yes Yes Canoeist Yes No Canton No No Capital No No CapitalOfRegion No No CardGame No No Cardinal Yes Yes CardinalDirection No No CareerStation Yes No Cartoon Yes Yes Case Yes Yes Casino No No Castle Yes Yes Cat No No Caterer No No Cave Yes Yes Celebrity No No CelestialBody Yes Yes Cemetery No No Chancellor Yes Yes ChartsPlacements No No Cheese Yes No Chef Yes Yes ChemicalCompound Yes Yes ChemicalElement No No ChemicalSubstance Yes Yes ChessPlayer Yes Yes ChristianBishop Yes Yes


ChristianDoctrine No No ChristianPatriarch No No Church No No City Yes Yes CityDistrict No No ClassicalMusicArtist Yes No ClassicalMusicComposition Yes No Cleric Yes Yes ClericalAdministrativeRegion Yes Yes ClericalOrder No No ClubMoss Yes No Coach Yes Yes CoalPit No No CollectionOfValuables No No College Yes Yes CollegeCoach Yes Yes Colour Yes Yes Comedian Yes Yes ComedyGroup Yes Yes Comic Yes Yes ComicsCharacter Yes Yes ComicsCreator Yes Yes ComicStrip Yes No Community No No Company Yes Yes Competition No No ConcentrationCamp No No Congressman Yes Yes Conifer Yes Yes Constellation Yes Yes


Contest No No Continent Yes No ControlledDesignationOfOriginWine No No Convention Yes Yes ConveyorSystem No No Country Yes Yes CountrySeat No No Creek No No Cricketer Yes Yes CricketGround Yes No CricketLeague Yes No CricketTeam Yes Yes Criminal Yes Yes CrossCountrySkier No No Crustacean Yes Yes CultivatedVariety Yes Yes Curler Yes Yes CurlingLeague Yes No Currency Yes Yes Cycad Yes No CyclingCompetition No No CyclingLeague No No CyclingRace Yes Yes CyclingTeam Yes Yes Cyclist Yes Yes Dam Yes Yes Dancer No No DartsPlayer Yes Yes Database Yes No Deanery No No


Death No No Decoration No No Deity No No Demographics No No Department No No Depth No No Deputy No No Desert No No Device Yes Yes DigitalCamera No No Dike No No Diocese Yes Yes Diploma No No Disease Yes Yes DisneyCharacter No No District No No DistrictWaterBoard No No Divorce No No Document Yes No DocumentType No No Dog No No Drama No No Drug Yes Yes DTMRacer No No Earthquake Yes Yes Economist Yes Yes EducationalInstitution Yes Yes Egyptologist No No Election Yes Yes ElectionDiagram No No


ElectricalSubstation No No Embryology Yes No Employer No No EmployersOrganisation No No Engine Yes No Engineer Yes Yes Entomologist Yes Yes Enzyme Yes Yes Escalator No No EthnicGroup Yes Yes Eukaryote Yes Yes EurovisionSongContestEntry Yes No Event Yes Yes Factory No No Family No No Farmer No No Fashion Yes No FashionDesigner Yes Yes Fencer No No Fern Yes Yes FictionalCharacter Yes Yes Fiefdom No No FieldHockeyLeague Yes No FigureSkater Yes Yes File No No FillingStation No No Film Yes Yes FilmFestival Yes Yes Fish Yes Yes Flag No No


FloweringPlant Yes Yes Food Yes Yes FootballLeagueSeason Yes Yes FootballMatch Yes Yes FormerMunicipality No No FormulaOneRacer Yes Yes FormulaOneRacing No No FormulaOneTeam Yes Yes Fungus Yes Yes GaelicGamesPlayer Yes Yes Galaxy Yes No Game Yes Yes Garden Yes No Gate No No GatedCommunity No No Gene Yes No GeneLocation Yes No Genre Yes Yes GeologicalPeriod No No GeopoliticalOrganisation No No Ginkgo Yes No GivenName Yes Yes Glacier Yes Yes Globularswarm No No Gnetophytes Yes No GolfCourse Yes No GolfLeague Yes No GolfPlayer Yes Yes GolfTournament Yes Yes GovernmentAgency Yes Yes


GovernmentalAdministrativeRegion No No GovernmentCabinet No No GovernmentType No No Governor Yes Yes GrandPrix Yes Yes Grape Yes Yes GraveMonument No No GreenAlga Yes No GridironFootballPlayer Yes Yes GrossDomesticProduct No No GrossDomesticProductPerCapita No No Group Yes Yes Guitar No No Guitarist Yes Yes Gymnast Yes Yes HandballLeague Yes No HandballPlayer Yes No HandballTeam Yes Yes HighDiver No No Historian Yes Yes HistoricalAreaOfAuthority No No HistoricalCountry No No HistoricalDistrict No No HistoricalPeriod No No HistoricalProvince No No HistoricalRegion No No HistoricalSettlement No No HistoricBuilding Yes Yes HistoricPlace Yes Yes HockeyClub No No


HockeyTeam Yes Yes Holiday Yes Yes HollywoodCartoon Yes Yes Horse Yes Yes HorseRace Yes Yes HorseRider Yes No HorseRiding No No HorseTrainer Yes Yes Hospital Yes Yes Host No No Hotel Yes Yes HotSpring No No HumanDevelopmentIndex No No HumanGene Yes No HumanGeneLocation Yes No Humorist No No IceHockeyLeague Yes Yes IceHockeyPlayer Yes Yes Ideology No No Image No No InformationAppliance Yes Yes Infrastructure Yes Yes InlineHockeyLeague Yes No Insect Yes Yes Instrument No No Instrumentalist Yes Yes Intercommunality No No InternationalFootballLeagueEvent No No InternationalOrganisation No No Island Yes Yes


Jockey Yes Yes Journalist Yes Yes Judge Yes Yes LacrosseLeague Yes No LacrossePlayer Yes No Lake Yes Yes Language Yes Yes LaunchPad Yes Yes Law No No LawFirm Yes Yes Lawyer No No LegalCase Yes Yes Legislature Yes Yes Letter No No Library Yes Yes Lieutenant No No LifeCycleEvent No No Ligament Yes No Lighthouse Yes Yes LightNovel No No LineOfFashion No No Linguist No No Lipid No No List No No LiteraryGenre No No Locality No No Lock No No Locomotive Yes Yes LunarCrater No No Lymph Yes No


Magazine Yes Yes Mammal Yes Yes Manga Yes Yes Manhua No No Manhwa No No Marriage No No MartialArtist Yes Yes MathematicalConcept No No Mayor Yes Yes MeanOfTransportation Yes Yes Media No No Medician Yes Yes Medicine No No Meeting No No MemberOfParliament Yes Yes MemberResistanceMovement No No Memorial No No MetroStation No No MicroRegion No No MilitaryAircraft No No MilitaryConflict Yes Yes MilitaryPerson Yes Yes MilitaryStructure Yes Yes MilitaryUnit Yes Yes MilitaryVehicle No No Mill No No Mine No No Mineral Yes No MixedMartialArtsEvent Yes Yes MixedMartialArtsLeague No No


MobilePhone No No Model Yes Yes Mollusca Yes No Monarch Yes Yes Monastery No No Monument Yes Yes Mosque No No Moss Yes No MotocycleRacer No No Motorcycle Yes Yes MotorcycleRacingLeague Yes No MotorcycleRider Yes Yes MotorRace No No MotorsportRacer Yes Yes MotorsportSeason Yes No Mountain Yes Yes MountainPass Yes Yes MountainRange Yes Yes MouseGene Yes No MouseGeneLocation Yes No MovieDirector No No MovieGenre No No MovingImage No No MovingWalkway No No MultiVolumePublication No No Municipality No No Murderer Yes Yes Muscle Yes Yes Museum Yes Yes Musical Yes Yes


MusicalArtist Yes Yes MusicalWork Yes Yes MusicComposer No No MusicDirector No No MusicFestival Yes Yes MusicGenre Yes Yes MythologicalFigure Yes No Name Yes Yes NarutoCharacter No No NascarDriver Yes Yes NationalAnthem No No NationalCollegiateAthleticAssociationAthlete Yes No NationalFootballLeagueEvent Yes No NationalFootballLeagueSeason Yes Yes NationalSoccerClub No No NaturalEvent Yes Yes NaturalPlace Yes Yes NaturalRegion No No NCAATeamSeason Yes Yes Nerve Yes Yes NetballPlayer Yes No Newspaper Yes Yes NobelPrize No No Noble Yes Yes NobleFamily No No Non-ProfitOrganisation Yes Yes NordicCombined No No Novel Yes No NuclearPowerStation No No Ocean No No


OfficeHolder Yes Yes OldTerritory No No OlympicEvent Yes Yes OlympicResult Yes Yes Olympics Yes Yes On-SiteTransportation No No Openswarm No No Opera No No Organ No No Organisation Yes Yes OrganisationMember Yes No Orphan No No OverseasDepartment No No PaintballLeague No No Painter Yes Yes Painting No No Parish No No Park Yes Yes Parliament No No PenaltyShootOut No No PeriodicalLiterature Yes Yes PeriodOfArtisticStyle No No Person Yes Yes PersonalEvent No No PersonFunction Yes No Philosopher Yes Yes PhilosophicalConcept No No Photographer Yes Yes Place Yes Yes Planet Yes Yes


Plant Yes Yes Play Yes Yes PlayboyPlaymate Yes Yes PlayWright No No Poem Yes No Poet Yes Yes PokerPlayer Yes Yes PoliticalConcept No No PoliticalFunction No No PoliticalParty Yes Yes Politician Yes Yes PoliticianSpouse No No PoloLeague Yes No Polyhedron No No Polysaccharide No No Pope Yes Yes PopulatedPlace Yes Yes Population No No Port No No PowerStation Yes Yes Prefecture No No PrehistoricalPeriod No No Presenter Yes Yes President Yes Yes Priest No No PrimeMinister Yes Yes Prison Yes Yes Producer No No Profession No No Professor No No


ProgrammingLanguage Yes Yes Project Yes No ProtectedArea Yes Yes Protein Yes Yes ProtohistoricalPeriod No No Province No No Psychologist No No PublicService No No PublicTransitSystem Yes Yes Publisher Yes Yes Pyramid No No Quote No No Race Yes Yes Racecourse Yes Yes RaceHorse Yes Yes RaceTrack Yes Yes RacingDriver Yes Yes RadioControlledRacingLeague No No RadioHost Yes Yes RadioProgram Yes Yes RadioStation Yes Yes RailwayLine Yes Yes RailwayStation Yes No RailwayTunnel Yes Yes RallyDriver No No Rebellion No No RecordLabel Yes Yes RecordOffice No No Referee No No Reference No No


Regency No No Region Yes Yes Religious Yes Yes ReligiousBuilding Yes Yes ReligiousOrganisation No No Reptile Yes Yes ResearchProject Yes No RestArea No No Restaurant Yes Yes Resume No No River Yes Yes Road Yes Yes RoadJunction Yes Yes RoadTunnel Yes Yes Rocket Yes Yes RocketEngine No No RollerCoaster Yes Yes RomanEmperor No No RouteOfTransportation Yes Yes RouteStop No No Rower Yes No Royalty Yes Yes RugbyClub Yes Yes RugbyLeague Yes Yes RugbyPlayer Yes Yes Saint Yes Yes Sales Yes No SambaSchool No No Satellite Yes Yes School Yes Yes


ScientificConcept No No Scientist Yes Yes ScreenWriter Yes Yes Sculptor No No Sculpture No No Sea Yes Yes Senator Yes Yes SerialKiller No No Settlement Yes Yes Ship Yes Yes ShoppingMall Yes Yes Shrine No No Singer No No Single Yes Yes SiteOfSpecialScientificInterest Yes Yes Skater Yes Yes Ski_jumper No No SkiArea Yes Yes Skier Yes Yes SkiResort No No Skyscraper Yes No SnookerChamp Yes Yes SnookerPlayer Yes Yes SnookerWorldRanking Yes No SoapCharacter Yes Yes Soccer No No SoccerClub Yes Yes SoccerClubSeason Yes No SoccerLeague Yes Yes SoccerLeagueSeason No No


SoccerManager Yes Yes SoccerPlayer Yes Yes SoccerTournament Yes Yes SocietalEvent Yes Yes SoftballLeague Yes No Software Yes Yes SolarEclipse Yes No Song Yes Yes SongWriter No No Sound Yes No Spacecraft Yes Yes SpaceMission No No SpaceShuttle Yes No SpaceStation Yes No Species Yes Yes SpeedSkater No No SpeedwayLeague Yes No SpeedwayRider Yes Yes SpeedwayTeam Yes Yes Sport Yes Yes SportCompetitionResult Yes Yes SportFacility Yes Yes SportsClub No No SportsEvent Yes Yes SportsLeague Yes Yes SportsManager Yes Yes SportsSeason Yes Yes SportsTeam Yes Yes SportsTeamMember Yes No SportsTeamSeason Yes Yes


Square No No SquashPlayer Yes Yes Stadium Yes Yes Standard No No Star Yes Yes State No No StatedResolution No No Station Yes Yes Statistic No No StillImage No No StormSurge No No Stream Yes Yes Street No No SubMunicipality No No SumoWrestler Yes No SupremeCourtOfTheUnitedStatesCase Yes Yes Surfer No No Surname Yes No Swarm No No Swimmer Yes Yes Synagogue No No SystemOfLaw No No TableTennisPlayer Yes Yes Tax No No Taxon No No TeamMember No No TeamSport No No TelevisionDirector No No TelevisionEpisode Yes Yes TelevisionHost Yes No


TelevisionPersonality No No TelevisionSeason Yes Yes TelevisionShow Yes Yes TelevisionStation Yes Yes Temple No No TennisLeague Yes No TennisPlayer Yes Yes TennisTournament Yes Yes TermOfOffice No No Territory No No Theatre Yes Yes TheatreDirector No No TheologicalConcept No No TimePeriod Yes No TopicalConcept Yes Yes Tournament Yes Yes Tower Yes Yes Town Yes Yes TrackList No No TradeUnion Yes Yes Train Yes Yes TrainCarriage No No Tram No No TramStation No No Treadmill No No Treaty No No Tunnel Yes No Type No No UndergroundJournal No No UnitOfWork Yes Yes


University Yes Yes Unknown No No Valley Yes Yes Vein Yes No Venue Yes Yes Vicar No No VicePresident No No VicePrimeMinister No No VideoGame Yes Yes VideogamesLeague Yes No Village Yes Yes Vodka No No VoiceActor Yes Yes Volcano Yes Yes VolleyballCoach Yes No VolleyballLeague Yes Yes VolleyballPlayer Yes Yes Watermill No No WaterPoloPlayer No No WaterRide Yes No WaterTower No No WaterwayTunnel Yes No Weapon Yes Yes Website Yes Yes Windmill No No WindMotor No No Wine No No WineRegion Yes Yes Winery Yes Yes WinterSportPlayer Yes Yes


WomensTennisAssociationTournament Yes Yes Work Yes Yes WorldHeritageSite Yes Yes Wrestler Yes Yes WrestlingEvent Yes Yes Writer Yes Yes WrittenWork Yes Yes Year Yes No YearInSpaceflight Yes Yes Zoo No No Table B. 1. Details of DBpedia Ontology


Appendix C

This section shows some sample DBpedia SPARQL queries. The following sample DBpedia SPARQL query uses select to extract the capital of Canada by querying a triple of the form <Subject, Predicate, Object>. In this case, the Subject is http://dbpedia.org/resource/Canada and the Predicate is dbo:capital. The result stored in DBpedia is Ottawa.

select distinct ?Capital where {
    <http://dbpedia.org/resource/Canada> dbo:capital ?Capital
}
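Such queries can also be run programmatically. The sketch below, which is not part of the thesis experiments, sends the query above to the public DBpedia endpoint with the Python SPARQLWrapper library.

# A small usage sketch (not part of the thesis experiments): running the query
# above against the public DBpedia endpoint with the SPARQLWrapper library.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    select distinct ?Capital where {
        <http://dbpedia.org/resource/Canada> dbo:capital ?Capital
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["Capital"]["value"])  # expected: http://dbpedia.org/resource/Ottawa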

We can extract other properties of Canada by changing the Predicate. For example, if we want to extract the largest city of Canada, we can change the Predicate to dbo:largestCity, as shown below.

select distinct ?LargestCity where {
    <http://dbpedia.org/resource/Canada> dbo:largestCity ?LargestCity
}

Other examples, such as the total area and total population of Canada, are shown below.

Total area of Canada:

select distinct ?Area where {
    <http://dbpedia.org/resource/Canada> dbo:areaTotal ?Area
}


Total population of Canada:

select distinct ?Population where {
    <http://dbpedia.org/resource/Canada> dbo:populationTotal ?Population
}

If we know that a country's capital is Ottawa, we can also extract the country's name. In the triple <Subject, Predicate, Object>, if we know the Predicate and the Object, we can extract the Subject, as the following example shows.

select distinct ?Country where {
    ?Country dbo:capital <http://dbpedia.org/resource/Ottawa>
}

The following DBpedia SPARQL query uses ask to check whether Canada's total population is greater than 36,000,000. The result is returned as true or false.

prefix xsdt: <http://www.w3.org/2001/XMLSchema#>
ask where {
    <http://dbpedia.org/resource/Canada> dbo:populationTotal ?TotalPopulation.
    filter(?TotalPopulation > "36000000"^^xsdt:integer)
}

The following DBpedia SPARQL query uses filter and in to extract the area and the population of Canada at the same time.

SELECT ?Area ?Population WHERE {
    <http://dbpedia.org/resource/Canada> ?Area ?Population
    filter(?Area in (dbo:areaTotal, dbo:populationTotal))
}

The following DBpedia query uses filter to apply additional conditions. The query first retrieves the names stored in DBpedia, and then filters on the birth place: if the birth place is Canada, the result is kept; otherwise, it is filtered out. The limit clause restricts the number of results.

SELECT ?name WHERE {
    ?a ?name.
    ?a dbo:birthPlace ?country.
    filter regex(str(?country), "Canada")
} limit 100

The following DBpedia SPARQL query builds on the previous one with a more complex condition, using filter and not exists. It extracts the names of the first 100 people whose birth place is Canada, while the not exists clause adds the condition that the birth place should not be Toronto. Since Toronto is a city in Canada, the two are linked in DBpedia; the second filter therefore removes the people whose birth place is Toronto.

SELECT ?name WHERE {
    ?a ?name.
    ?a dbo:birthPlace ?country
    filter regex(str(?country), "Canada")
    filter not exists {
        ?a ?name.
        ?a dbo:birthPlace ?city
        filter regex(str(?city), "Toronto")
    }
} limit 100
