Semantic Analysis of Wikipedia's Graph for Entity Detection and Topic Identification Applications

by

Milad AlemZadeh

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Electrical and Computer Engineering

Waterloo, Ontario, Canada, 2012

©Milad AlemZadeh 2012

AUTHOR'S DECLARATION

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public.


Abstract

The Semantic Web and its Linked Data community are now the reality of the future of the Web. The standards and technologies defined in this field have opened a strong pathway towards a new era of knowledge sharing and representation for the computing world. The data structures and the semantic formats introduced by the standards offer a platform for all the data and knowledge providers in the world to present their information in a free, publicly available, semantically tagged, inter-linked, and machine-readable structure. As a result, the adoption of the Semantic Web standards by data providers creates numerous opportunities for the development of new applications which were not possible or, at best, hardly achievable using the current state of the Web, which mostly consists of unstructured or semi-structured data with minimal semantics attached, tailored mainly for human readability.

This dissertation introduces a framework for effective analysis of Semantic Web data towards the development of solutions for a series of related applications. In order to achieve such a framework, Wikipedia is chosen as the main knowledge resource, largely due to the fact that it is the main and central dataset in the Linked Data community. In this work, Wikipedia and its Semantic Web version, DBpedia, are used to create a semantic graph which constitutes the knowledgebase and the back-end foundation of the framework. The semantic graph introduced in this research consists of two main components: entities and topics. The entities act as the knowledge items while the topics create the class hierarchy of the knowledge items. Therefore, by assigning entities to various topics, the semantic graph presents all the knowledge items in a categorized hierarchy ready for further processing.

Furthermore, this dissertation introduces various analysis algorithms over the entity and topic graphs which can be used in a variety of applications, especially in the natural language understanding and knowledge management fields. After explaining the details of the analysis algorithms, a number of possible applications are presented and potential solutions to these applications are provided. The main themes of these applications are entity detection, topic identification, and acquisition. To demonstrate the efficiency of the framework algorithms, some of the applications are developed and comprehensively studied by providing detailed experimental results which are compared with appropriate benchmarks. These results show how the framework can be used in different configurations and how different parameters affect the performance of the algorithms.


Acknowledgements

This work has only been made possible with the unwavering support and help of my supervisor Professor Fakhreddine Karray to whom I would like to give my sincere thanks. I am forever grateful to him for the freedom he provided me to explore the areas of science I have always liked to venture into while assisting me whenever I needed it. For this and all the directions I received throughout all the stages of this work, I am thankful to him.

I would also like to express my gratitude to the members of my committee, Professors Robert Mercer, Mahesh Tripunitara, Kshirasagar Naik, and Kumaraswamy Ponnambalam for all of their invaluable comments and suggestions which helped improve the quality of this work.

I would like to extend my appreciation to Professors Frank Tompa and Mohamed Kamel for the valuable lessons I learned in their classes which helped shape the work presented in this dissertation.

Moreover, I would very much like to acknowledge the vital role my family has played in helping me achieve my goals in life. I am particularly thankful to my wife, Sarah, for her relentless support these past few years and to my mother for providing a great childhood in spite of all the hardships. I am very grateful for the unconditional support of my sisters, Parisa and Soraia, and my brothers, Davood and Saeed, whenever I needed it.

Last but certainly not least, I would like to thank my good friend and the co-author of many of my published works Professor Richard Khoury. In collaboration with Richard, I learned how to better express my ideas and, at the same time, improve my work considerably. This work would have undoubtedly not been possible if it was not for his tireless efforts and significant feedback.


Dedication

Sarah, my dear beloved: I cannot express sufficiently how much your love means to me. I think and I wonder, but I am not able to imagine how I would be able to live, let alone progress in life, had you not been in it. This is but a gesture, a simple nod to, and a token of my love for all you have given me.

This is for you.


Table of Contents

AUTHOR'S DECLARATION ...... ii
Abstract ...... iii
Acknowledgements ...... iv
Dedication ...... v
Table of Contents ...... vi
List of Figures ...... ix
List of Tables ...... xi
Chapter 1 Introduction ...... 1
1.1 Background ...... 1
1.2 Motivations ...... 2
1.3 Problem Domain and Contributions ...... 3
1.4 Organization of the Dissertation ...... 4
Chapter 2 Background and Literature Review ...... 6
2.1 Semantic Web ...... 6
2.1.1 General Structure ...... 6
2.1.2 URI/IRI ...... 6
2.1.3 XML ...... 7
2.1.4 RDF ...... 8
2.1.5 Ontology ...... 10
2.1.6 Ontology Editors ...... 12
2.1.7 Reasoning Engines ...... 14
2.2 Linked Data ...... 15
2.3 Wikipedia ...... 18
2.3.1 Wikipedia and Semantic Web ...... 18
2.3.2 Wikipedia and Topic Identification ...... 18
2.3.3 Wikipedia and Text Classification ...... 20
2.4 Summary ...... 20
Chapter 3 Semantic Analysis of Wikipedia ...... 21
3.1 Introduction ...... 21
3.2 Problem Definition ...... 21
3.3 Wikipedia ...... 24

3.3.1 Titles ...... 26
3.3.2 Infoboxes ...... 27
3.3.3 Wikilinks ...... 28
3.3.4 Categories ...... 29
3.3.5 Other Elements ...... 30
3.3.6 Challenges ...... 31
3.4 DBpedia ...... 34
3.4.1 Titles Dataset ...... 37
3.4.2 Redirects Dataset ...... 38
3.4.3 Disambiguation Links Dataset ...... 38
3.4.4 Ontology Infobox Types Dataset ...... 39
3.4.5 Ontology Infobox Properties Dataset ...... 39
3.4.6 Wikipedia Pagelinks Dataset ...... 40
3.4.7 Categories (SKOS) Dataset ...... 40
3.4.8 Articles Categories Dataset ...... 42
3.4.9 Final Knowledgebase ...... 42
3.5 Semantic Graph ...... 44
3.5.1 Entity Graph ...... 47
3.5.2 Topic Graph ...... 51
3.5.3 Entity-Topic Connection Layer ...... 57
3.5.4 Challenges ...... 58
3.6 Entity Detection ...... 61
3.6.1 Processing Division ...... 63
3.6.2 Search Division ...... 65
3.6.3 Scoring Division ...... 65
3.6.4 Equivalency Division ...... 66
3.6.5 Production Division ...... 67
3.6.6 Formalization ...... 67
3.7 Semantic Analysis ...... 73
3.7.1 Semantic Distance ...... 74
3.7.2 Semantic Distance Calculation Algorithms ...... 85
3.7.3 Semantic Distance and Commonality ...... 91

3.7.4 Entity Analysis ...... 93
3.7.5 Topic Analysis ...... 95
3.7.6 Progressive Analysis ...... 95
3.8 Applications ...... 97
3.8.1 Parla Project ...... 97
3.8.2 Topic Identification ...... 102
3.8.3 Speech Recognition ...... 104
3.8.4 Text Prediction ...... 108
3.8.5 Context Awareness and Topic Tracking ...... 112
3.9 Summary ...... 114
Chapter 4 Experimental Results ...... 115
4.1 Introduction ...... 115
4.2 Query Topic Identification ...... 115
4.2.1 Toy Problem: Synthesized Data ...... 116
4.2.2 Iterative Algorithm ...... 122
4.2.3 Semantic Distance Algorithm ...... 128
4.3 Speech Recognition ...... 148
4.3.1 ROVER Procedure ...... 149
4.3.2 Entity-based Voting Algorithm ...... 150
4.3.3 Illustrative Example ...... 153
4.3.4 Datasets and Result ...... 155
4.4 Summary ...... 157
Chapter 5 Conclusion and Future Research ...... 158
5.1 Summary of the Study ...... 158
5.2 Conclusion ...... 161
5.3 Future Work ...... 162
Appendix A Category Mappings ...... 164
Appendix B Publications ...... 166
Bibliography ...... 167


List of Figures

Figure 2.1: The layer cake [10, 11]...... 7
Figure 2.2: A sample RDF graph (semantic network)...... 9
Figure 2.3: A sample RDF code...... 10
Figure 2.4: Ontology languages and their use by researchers (adopted from [30])...... 12
Figure 2.5: A screenshot from Protégé (adopted from [31])...... 13
Figure 2.6: Ontology editors and their use by researchers (adopted from [30])...... 13
Figure 2.7: Reasoners and their use by researchers (adopted from [30])...... 14
Figure 2.8: LOD cloud diagram in May 2007 (adopted from [48])...... 16
Figure 2.9: LOD cloud diagram as of September 2011 (adopted from [48])...... 16
Figure 3.1: A sample Wikipedia page...... 25
Figure 3.2: An example of the main, redirect, and disambiguation title...... 26
Figure 3.3: Four examples of different types of infoboxes...... 28
Figure 3.4: The developed Wikipedia parser tool...... 31
Figure 3.5: Converting wikilinks to RDF triples (adopted from [50])...... 32
Figure 3.6: A sample DBpedia page...... 36
Figure 3.7: Final knowledgebase generated by DBpedia datasets...... 43
Figure 3.8: A segment of the entity graph featuring wikilink connections (adopted from [90])...... 49
Figure 3.9: Partial view of Wikipedia's category system (adopted from [91])...... 51
Figure 3.10: Examples of topic graph anomalies...... 53
Figure 3.11: Topic graph pre-processing algorithm...... 54
Figure 3.12: Semantic graph structure...... 59
Figure 3.13: Entity detection algorithm...... 62
Figure 3.14: Examples of upward topic-to-topic semantic distances...... 77
Figure 3.15: Entity-to-entity semantic distance through the topic graph...... 82
Figure 3.16: Entity-to-entity immediate connections through various types...... 84
Figure 3.17: Topic-to-topic generic semantic distance calculator algorithm...... 86
Figure 3.18: Topic-to-topic upward semantic distance calculator algorithm...... 87
Figure 3.19: Topic-to-topic strictly upward semantic distance calculator algorithm...... 88
Figure 3.20: Entity-to-topic generic semantic distance calculator algorithm...... 89
Figure 3.21: Entity-to-entity mixed route semantic distance calculator algorithm...... 91
Figure 3.22: Generic entity analysis template algorithm...... 94

Figure 3.23: Progressive analysis aging mechanism...... 96
Figure 3.24: The structure of Parla project...... 98
Figure 3.25: The components of the Parla project...... 98
Figure 3.26: A screenshot of the Parla system in action...... 100
Figure 3.27: Topic identification algorithm...... 103
Figure 3.28: Speech recognition combination solver algorithm...... 107
Figure 3.29: Text prediction algorithm...... 110
Figure 3.30: Context awareness and topic tracking algorithm...... 114
Figure 4.1: Toy problem topic identification algorithm...... 117
Figure 4.2: Toy problem setup 1 performance chart...... 119
Figure 4.3: Toy problem setup 2 recall performance chart...... 120
Figure 4.4: Toy problem setup 2 percentile performance chart...... 120
Figure 4.5: Toy problem setup 3 recall performance chart...... 121
Figure 4.6: Toy problem setup 3 percentile performance chart...... 122
Figure 4.7: Iterative query topic identification algorithm...... 123
Figure 4.8: Query topic identification algorithm...... 129
Figure 4.9: Performance metrics for various penalty weight functions...... 133
Figure 4.10: Comparison of the components of the entity score...... 135
Figure 4.11: Overall F1 using 1 to 500 base topics...... 136
Figure 4.12: Comparison of the impact of the various h function equations...... 139
Figure 4.13: Comparison of the impact of the h function equations using 100 base topics...... 140
Figure 4.14: Average goal topic score per rank over all KDD Cup queries...... 141
Figure 4.15: ROVER procedure...... 150
Figure 4.16: Entity-based voting algorithm...... 151
Figure 4.17: A sample WTN for three decoders...... 154
Figure 4.18: Entity-based voting algorithm WER reduction rates...... 156


List of Tables

Table 2.1: A sample RDF description in triples format...... 10
Table 2.2: Common namespace prefixes and corresponding URIs...... 11
Table 2.3: Linked Open Data star rating system (adopted from [3])...... 17
Table 3.1: Generic problem definition...... 22
Table 3.2: DBpedia 3.7 datasets statistics [86, 89]...... 43
Table 3.3: Entity graph edge types and specifications...... 47
Table 3.4: Topic graph edge and relation specifications...... 52
Table 3.5: Processed topic graph statistics...... 56
Table 3.6: Entity-Topic connection layer relations...... 57
Table 3.7: Query topic identifier algorithm description...... 99
Table 4.1: Iterative topic identification algorithm performance in baseline standings...... 126
Table 4.2: Iterative topic identification algorithm exploration results chart...... 127
Table 4.3: Comparison of IDF of sample keywords...... 134
Table 4.4: Example of the impacts of h function...... 138
Table 4.5: Semantic distance topic identifier performance in baseline standings...... 143
Table 4.6: Base topic samples for "internet explorer" query case study...... 144
Table 4.7: Goal topic samples for "internet explorer" query case study...... 145
Table 4.8: Topic identification performance results for new test sets...... 146
Table 4.9: Comparison of topic number and recall for different labelers...... 147
Table 4.10: Initial candidate entities and their scores...... 154
Table 4.11: Remaining candidate entities and updated scores in the second iteration...... 154
Table 4.12: Decoders combinations...... 156
Table 4.13: ROVER baseline WERs...... 156

Chapter 1 Introduction

1.1 Background

To believe Sir Tim Berners-Lee, the inventor of the Web as we know it today, the future of the web – often called Web 3.0 – is and should be the Semantic Web [1]. In his TED talk [2] in 2009, he appealed to the people of the world to make the Semantic Web a reality by sharing their data on the web rather than just sharing their documents. The world has, since then, answered his call by providing a huge and ever-growing flow of open data, creating the largest free, publicly available, semantically tagged, inter-linked, and machine-readable collection of structured data on the Internet, called Linked Data [3, 4].

Linked Data is the heart of the Semantic Web. Although the two terms are often used interchangeably, the relationship between Linked Data and the Semantic Web is a matter of different opinions. While most people view the Semantic Web as the whole and Linked Data as the parts making up the whole, some believe Linked Data is the appropriate implementation of the Semantic Web concepts. To quote Sir Berners-Lee again, "Linked Data is the Semantic Web done right" [5]. To be more specific, Linked Data consists of a collection of different data structures provided by different organizations. These data, however, have been carefully tagged in a manner which has made them machine-readable and linked to other parts of the Linked Data collection. Such a design has provided unique opportunities for both research and practical applications. This dissertation attempts to exploit the inner-most core of Linked Data, Wikipedia, in order to demonstrate the possibilities created by Linked Data.

One of the earliest adaptations and transformations of the conventional Web into the Semantic Web is none other than Wikipedia [6]. Wikipedia is a free encyclopedia of general knowledge which has been created by the global collaboration of writers; it includes articles about a vast variety of knowledge items in multiple languages and is supported by the non-profit Wikimedia Foundation. While Wikipedia is written and structured for human use, it was an early subject of transformation to its Semantic Web counterpart called DBpedia [7]. In January 2007, DBpedia was initially released by people from the Free University of Berlin and the University of Leipzig in collaboration with OpenLink Software Incorporated. Since then, DBpedia has become the center of gravity for all other datasets that join the Semantic Web community. DBpedia has converted the

unstructured (or semi-structured) Wikipedia articles into the Semantic Web's structured ontology standard called the Resource Description Framework (RDF) [8]. RDF can simply be represented as a list of subject, predicate, and object triples. These triples, as will be shown later on, form a graph of knowledge items which are semantically connected to one another. It is these RDF graphs and certain sub-graphs which are the subject of attention and analysis in the work presented here.

The goal of this dissertation is to highlight the resources available from Linked Data, specifically DBpedia. The work presented here attempts to demonstrate how the semantic graph of Wikipedia/DBpedia can be utilized, and to depict certain applications resulting from these efforts. The main focus of the work is to introduce the graph-based analysis of semantic data and the solutions made possible by such analysis. Among the various solutions introduced, the entity detection and topic identification applications are the experimental venues which this work explores more deeply, providing a comprehensive analysis of the parameters affecting them.

1.2 Motivations

The motivation behind this work is mainly driven by the expanding interest and a recent increase in research based on the resources provided by Linked Data. Chapter 2 presents a list of literature on the topics of the Semantic Web, Linked Data, and their applications in text analytics. Specifically, it focuses on the use of Wikipedia in topic identification, the main theme of this dissertation. The interesting notion, however, is that while the official introduction of the concepts of the Semantic Web [1] and the launch of Wikipedia both took place in 2001, it was only recently that researchers paid particular attention to the uses of Wikipedia in semantic challenges. Shortly after the first formulation of how Linked Data should be designed was presented in 2006 [3] and DBpedia was launched in 2007, research using Wikipedia in text analytics started to show up more and more in the literature, and the number of publications has kept growing to this date. This is a positive indicator of the value of Linked Data, of how it can produce opportunities that were previously impossible (or nearly so), and of the effect it has brought to the text analytics community.

Another factor in motivating the author to pursue the presented topic is the project that jumpstarted the research. In early 2007, the Speech and Understanding research team at the University of Waterloo’s Centre for Pattern Analysis and Machine Intelligence (UW CPAMI) undertook a project called Parla Search Engine. The goal of the project was to design a smart search engine for multimedia content. The Parla project consisted of multiple components, from automatic speech

recognition to . The component which started the research of this dissertation was called Web Query Categorization (WQC) or Query Topic Identification (QTI). The WQC target was to recognize the user's topic of interest based on their query, filter the search engine results, and tailor them to match that topic. The exact details of the project and the WQC component are explained later on in Chapter 3.

Finally, scientific curiosity, along with the chance to venture into open problems from a novel point of view and a different direction, was certainly a force motivating the research. There is a wide range of applications and problems to which the methods presented in this work can be applied. Some of these problems have been addressed from other directions, such as statistical approaches, and some still need more research work. The new ground opened up by Linked Data resources is a tempting opportunity for innovation, and the candidate has tried to take advantage of a portion of these resources, hoping to provide a framework for future work and the development of further innovative approaches.

1.3 Problem Domain and Contributions

In very generic terms, the scope of the dissertation is providing graph-based analysis of various text analytics topics using Linked Data resources. Since such a domain can be quite extensive, a certain core methodology is showcased which can easily be adapted to different applications and problems. In Chapter 3, some of these applications are described and, subsequently, a possible adaptation of the core methodology is presented as an example of how to tackle each of these specific problems. While being flexible enough to manage those applications, the presented methodology contains a number of key ingredients which are relied upon in each and every case. The main contribution of the thesis is to introduce a blueprint of how to use these key ingredients to create proper solutions for various problems in this domain. In other words, the main contribution of the thesis is that it introduces a number of tools and demonstrates how these tools can be coupled together, albeit to different degrees in various cases, to tackle different problems.

The three key ingredients of the presented methodology are:

1. Semantic Graph Knowledgebase

2. Entity Detection

3. Semantic Analysis


These three ingredients are the general pillars of the proposed methodology. Each of them, however, has a number of sub-domains which are the tunable components making the methodology flexible enough to fit diverse problems. In Chapter 3, each of these components is explained in detail, followed by a proper discussion of how and when to use each component to best effect. These discussions constitute another contribution of this dissertation.

Following the description of the methodology components, a list of the possible uses of the presented framework is offered. These applications are a small portion of the possibilities brought about by the ingredients mentioned above. Nevertheless, these applications render a sufficiently complete picture of what the problem domain is and how it can be approached. For each application, a possible solution is offered. The solution, however, is neither unique nor ideal as there are often too many parameters for each problem to consider. In order to demonstrate how these parameters affect the problem domain and how the solution should be evolved to match the problem, one specific application, Query Topic Identification, is studied in greater detail. The application has been analyzed thoroughly and the effects of each important factor have been examined. The experimental results of this study are then presented in Chapter 4. As a secondary problem, a small but intrinsically different application, Speech Recognition, is also introduced, studied, and empirically presented. These two experimental studies aim to showcase how the methodology can contribute to real-world problems. It is noteworthy that both of these problems have been previously tackled by other methods; therefore, the methodology of this dissertation grants new perspectives and brings about balance by showing how the Semantic Web and Linked Data can complement the existing methods in domains which were historically handled by other means.

1.4 Organization of the Dissertation

The rest of this dissertation is organized as follows. Chapter 2 presents a review of the literature related to the scope of this thesis. It starts by presenting a background on the Semantic Web and its various technologies. Then, a description of Linked Data is provided along with guidelines on how to migrate to Linked Data as a realization of the Semantic Web. Finally, a brief look at the uses of Wikipedia in various problems pertaining to the topic of this thesis is presented.

Chapter 3 portrays the core methodology of this dissertation. It starts by explaining the structure of Wikipedia and DBpedia. Subsequently, it will describe the three key ingredients of the framework

mentioned earlier, namely: Semantic Graph, Entity Detection, and Semantic Analysis. For each of these components, a number of sub-components are introduced and their uses are justified. Finally, a number of applications using the aforementioned framework are presented and a potential solution for each application is discussed. The first of these applications is the Parla project introduced earlier which motivated and initiated this thesis.

Chapter 4 demonstrates the experimental results of two studies in greater detail: Query Topic Identification and Speech Recognition. Lastly, Chapter 5 will summarize and conclude the dissertation and provide some insight into possible future research on the topic.


Chapter 2 Background and Literature Review

2.1 Semantic Web

The methodology and framework proposed in this work rely heavily upon the Semantic Web framework and Linked Data resources. Thus, it is proper to briefly introduce the technology and provide some essential details used in this work. The Semantic Web, initiated and overseen by Tim Berners-Lee, is a collaborative movement led by the international standards body, the World Wide Web Consortium (W3C) [1]. As stated by the W3C, the Semantic Web mainly revolves around two issues: "common formats for integration and combination of data drawn from diverse sources" and "the language for recording how the data relates to real world objects" [9]. The goal of this movement is to convert the current state of the web, which mostly consists of unstructured or semi-structured documents, to a web of structured, semantically-tagged data. In order to achieve this task, a certain set of standards is proposed by the Semantic Web. The following provides a brief look inside the structure of the Semantic Web and some of its important components.

2.1.1 General Structure

In general, the Semantic Web has a multi-layered architecture which consists of the data accompanied by ontologies and reasoning tools. In order to make the Semantic Web work properly, web developers should be able to: have a common language to communicate and share their data, provide meaning and context for their data, and provide their reasoning schemes for further processing of their data. These three requirements are the reasons behind many of the layers designed for the Semantic Web. There are a few other layers, as well, which are mostly related to the front-end areas, such as user interface and applications. Figure 2.1 (a) shows the most famous structure for the Semantic Web, originally presented in Tim Berners-Lee's talk in 2002 and called the layer cake; it has since evolved to cover new languages and components. Figure 2.1 (b) shows the updated version of the structure, which has remained more or less the same since 2007.

2.1.2 URI/IRI

The Uniform Resource Identifier (URI) is the main and fundamental platform for the Semantic Web. URIs, as the name suggests, are strings of characters that identify resources over a network, mainly the World Wide Web. URIs have the capacity to be dereferenced and allow for the unambiguous identification of resources. All web page Uniform Resource Locators (URLs) are examples of URIs. However, URIs can be used for defining other concepts beside the location of web pages. For instance, the URI http://www.w3.org/2002/07/owl#Thing defines the concept "Thing", which is the most general concept used in semantic hierarchies. Some developers provide corresponding URLs for the URIs they define, and the web page pointed to by the corresponding URL provides human-readable information about the URI. For instance, http://dbpedia.org/page/Life is a URL for a page which provides further information about the resource known by the URI http://dbpedia.org/resource/Life.

Figure 2.1: The layer cake [10, 11].

The upper-level layers of the Semantic Web, specifically RDF, use URIs in their formats. The upper layers also incorporated Unicode to accommodate different character sets. However, since Internationalized Resource Identifiers (IRIs), the latest extension of URIs, permit the use of full Unicode [12], Unicode has been removed from the latest version of the structure.
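As an illustration of URI dereferencing, the following minimal Python sketch requests the RDF description of a DBpedia resource through HTTP content negotiation. It assumes the third-party requests package and live access to dbpedia.org; the exact redirect target and the formats offered are server-dependent.

```python
# Minimal sketch of dereferencing a URI with HTTP content negotiation,
# assuming the "requests" package and live access to dbpedia.org.
import requests

resource_uri = "http://dbpedia.org/resource/University_of_Waterloo"

# Asking for Turtle returns machine-readable RDF; a browser asking for
# text/html is instead redirected to the human-readable page.
response = requests.get(
    resource_uri,
    headers={"Accept": "text/turtle"},
    allow_redirects=True,  # the server answers with a redirect to the data document
    timeout=30,
)

print(response.url)                          # final (redirected) document URL
print(response.headers.get("Content-Type"))  # e.g. text/turtle
print(response.text[:300])                   # first few serialized triples
```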

2.1.3 XML

The basic need of web developers in creating the Semantic Web is data exchange and communication. Fortunately, because of XML [13] and its global approval by the web community, all developers agreed to use it as the common language to exchange data. The interoperability of XML provided a

framework for developers to free themselves from previous rigid formats and to be able to express semantics and relationships much more easily [14]. As a result, all the other languages used in the Semantic Web are built on top of XML [15]; as such, these languages use XML syntax, making XML a meta-language [16]. Moreover, XML Schema and Namespaces provide the capability of defining categories and structures within XML itself [16].

2.1.4 RDF

While XML provides a very flexible syntax for data exchange and XML Schema defines the data structure, neither can provide the framework capabilities which RDF offers: expressing semantics for the data. Designed similarly to classical conceptual modeling techniques such as entity-relationship or class diagrams, RDF is the W3C-recommended language for expressing semantics [8]. According to the W3C timeline, the first draft of RDF was released back in October 1997 [17]. W3C released the final version of RDF in 2004 as a recommendation called the RDF Primer [18]. There is also an interest group called the Semantic Web Interest Group [19], formerly known as the RDF Interest Group [20], which serves as a public forum to discuss the use and development of Semantic Web concepts, including RDF. The final specifications for RDF can be found in a set of six documents (primer, syntax, semantics, vocabulary, concepts, and test cases) available at [21].

The basic structure of RDF is rather simple: it merely consists of a set of triples in the form of (Subject, Predicate, Object). Subject and Predicate are resources defined by URIs while Object can be either a URI or a literal [14]. This simple structure creates a set of entries which are, in fact, presentable as a directed graph called a semantic network. Figure 2.2 shows such a network for a simple example adopted from the RDF Primer [18].

In this example, the RDF describes a resource which is about a certain person identified by the URI http://www.w3.org/People/EM/contact#me, whose name is "Eric Miller", email address is [email protected], and title is "Dr.". In this scenario, the resource URI indicated above is the subject. "Eric Miller", [email protected], and "Dr." are the objects. The predicates for these objects are respectively: "whose name is", "whose email address is", and "whose title is". However, these predicates should be presented in URI form. The following shows the URIs for each predicate:

• "whose name is" as http://www.w3.org/2000/10/swap/pim/contact#fullName
• "whose email address is" as http://www.w3.org/2000/10/swap/pim/contact#mailbox
• "whose title is" as http://www.w3.org/2000/10/swap/pim/contact#personalTitle


Figure 2.2: A sample RDF graph (semantic network).

There is also another relationship which indicates that the subject being described in this example has a type defined by the URI http://www.w3.org/1999/02/22-rdf-syntax-ns#type, and the type is a Person defined by the URI http://www.w3.org/2000/10/swap/pim/contact#Person. Note that the definition of the type as the predicate is a standard RDF URI, which means it can be unambiguously parsed by RDF parsers; however, the definition of a Person, for instance, uses a local semantic RDF definition. Such local RDF descriptions can be further refined by the provider of this RDF sample using predicates that equate the local definition of the person with a globally known URI.

Figure 2.3 shows the RDF code corresponding to this example. Note that in this code, XML namespaces are used to shorten the common URI prefixes. For instance, the namespace xmlns:contact is defined as http://www.w3.org/2000/10/swap/pim/contact#, which means that contact:Person is equal to http://www.w3.org/2000/10/swap/pim/contact#Person. Also note that using contact:Person as the starting tag implies the type relation in RDF parser rules and, therefore, the type association can be omitted.

Figure 2.3: A sample RDF code.

Finally, the above RDF example can be described in triple format. The triple format is a simple and commonly used method of expressing RDF data. Table 2.1 shows four triples, each accounting for an edge in the RDF graph of Figure 2.2. When recorded in a file, each triple row is terminated by a blank space character followed by a period.

Table 2.1: A sample RDF description in triples format.
Subject | Predicate | Object
http://www.w3.org/People/EM/contact#me | rdf:type | contact:Person
http://www.w3.org/People/EM/contact#me | contact:fullName | "Eric Miller"
http://www.w3.org/People/EM/contact#me | contact:mailbox | mailto:[email protected]
http://www.w3.org/People/EM/contact#me | contact:personalTitle | "Dr."

In the above table, the subject URI is presented in its full form. However, the other columns use namespaces to indicate the URIs (note that mailto: is not a namespace, but rather an identifier such as http:). Table 2.2 lists a number of namespaces frequently used in the Semantic Web community for resource definition, along with a few namespaces for widely used data sources such as DBpedia.
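To make the triple structure concrete, the following minimal Python sketch (assuming the rdflib package) loads the Eric Miller example above from Turtle syntax and enumerates its triples; the URIs follow the RDF Primer example discussed in the text.

```python
# Load the Eric Miller example as Turtle and print its triples using rdflib.
from rdflib import Graph

turtle_data = """
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix contact: <http://www.w3.org/2000/10/swap/pim/contact#> .

<http://www.w3.org/People/EM/contact#me>
    rdf:type              contact:Person ;
    contact:fullName      "Eric Miller" ;
    contact:mailbox       <mailto:[email protected]> ;
    contact:personalTitle "Dr." .
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")

# Each iteration yields one edge of the RDF graph, mirroring Table 2.1.
for subject, predicate, obj in g:
    print(subject, predicate, obj)
```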

2.1.5 Ontology

Providing semantics for data is not enough to guarantee data is understandable by various parties. Each party may use different identifiers or resources to describe a single term [1]. In order to complete the semantic circle, each party has to provide a vocabulary of terms and a set of inference rules so that other parties can comprehend the semantics used and, if necessary, deduce new semantics using the set of rules. This collection of vocabulary and rules is called an ontology.

Table 2.2: Common namespace prefixes and corresponding URIs.
Namespace Prefix | Namespace URI
rdf: | http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs: | http://www.w3.org/2000/01/rdf-schema#
owl: | http://www.w3.org/2002/07/owl#
dc: | http://purl.org/dc/elements/1.1/
dcterms: | http://purl.org/dc/terms/
skos: | http://www.w3.org/2004/02/skos/core#
foaf: | http://xmlns.com/foaf/0.1/
dbpedia: | http://dbpedia.org/resource/
dbpedia-owl: | http://dbpedia.org/ontology/

There exist quite a few ontology languages which were developed by different organizations. Among them, SHOE [22] was a language based on HTML presented in 2000. XOL [23] was one of the early languages built using XML. After the introduction of RDF, W3C further introduced RDF Schema, also known as RDFS, as an extension to RDF. Similar to XML Schema, RDFS provided the structure definition for RDF which, in fact, acted as a kind of ontology. RDFS, which is very similar to object-oriented languages, enabled developers to express classes, subclasses, their relations, and properties of each class such as domain and range [24]. DAML-ONT [25] is another ontology language developed by DARPA. DAML-ONT covered the basic concepts similar to RDFS as well as inference logic. OIL [26] was a European counterpart ontology language; its foundation was very similar to RDFS, but it was based on description logics. DAML+OIL [27] was a descendant of the previous two languages and tried to include the expressiveness of both languages. Finally, OWL, or the Web Ontology Language [28], was introduced as the W3C ontology language of choice in 2004. OWL is a successor of DAML+OIL and has three different levels: Lite, DL, and Full. These three differ in their degree of expressiveness and completeness. OWL is now in its second version, OWL 2, released in 2009 [29]. This new version is an extension and revision of OWL and, therefore, is fundamentally very similar, but adds new functionality with respect to OWL. Figure 2.4 shows the results of a survey assessing the extent to which each ontology language is used among different types of researchers [30]. OWL is the most common ontology language among developers, as shown in this figure. OWL uses RDF notations but provides more features, including cardinality, transitive properties, disjoint unions of classes, and richer data types.


Figure 2.4: Ontology languages and their use by researchers (adopted from [30]).

2.1.6 Ontology Editors

Ontology editors are tools created to facilitate the implementation, organization, and modification of custom ontology description data for web developers. These editors essentially provide basic levels of ontology development, such as defining classes, editing class properties, specifying relationships, and imposing constraints, and may provide syntax checking as well [16]. Some powerful tools can also combine different ontology descriptions. The best available editor is called Protégé [31]. It is a Java-based, open-source editor developed by Stanford University that supports RDF(S) and OWL. Figure 2.5 provides a screenshot of this editor.

There are two other famous ontology editors: SWOOP [32, 33] and OntoEdit [34]. Developed at the University of Maryland, SWOOP is also an open-source, Java-based editor which acts as an OWL ontology web browser/editor. It combines the power of Protégé with a web-based environment called Ontolingua to provide an easy-to-use browser and editor [35]. OntoEdit, now commercially available as OntoStudio, used to be a web service developed at the University of Karlsruhe which supported F-Logic, RDFS, and OIL. Figure 2.6 shows the results of the survey (similar to Figure 2.4) assessing the percentage usage of different ontology editors.


Figure 2.5: A screenshot from Protégé (adopted from [31]).

Figure 2.6: Ontology editors and their use by researchers (adopted from [30]).


2.1.7 Reasoning Engines

Reasoning engines, or semantic reasoners, are tools which use the provided facts and inference rules of a certain ontology dataset to deduce new facts about that ontology. The newly deduced facts help to answer queries which could not originally be answered using the initial facts. Since reasoning has been studied by the Artificial Intelligence (AI) community for quite some time, a number of reasoning engines have been developed to support the ontology standards discussed previously.

One of the most popular reasoning engines is called Jena [36, 37], which was originally developed by HP Labs at Bristol in 2000. Jena was an open-source project, and once HP stopped supporting it in 2009, the Apache Software Foundation adopted the project following the project team's application in November 2010. Jena is more than an inference engine; it is, in fact, a framework for Semantic Web application development. It includes a number of components to handle RDF, ontologies, and the SPARQL query language [38], a Semantic Web query language similar to relational database management system query languages like SQL. Furthermore, Jena's reasoning engine supports RDFS and OWL-DL entailment.
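Jena itself is a Java framework; purely as an illustration of what entailment adds, the following Python sketch uses the rdflib and owlrl packages (an assumption, not the tooling of this thesis) to materialize the RDFS closure of a tiny ontology and then query a fact that was never stated explicitly.

```python
# Illustrative sketch of reasoning: materialize RDFS entailment with rdflib + owlrl.
from rdflib import Graph
import owlrl

data = """
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/onto#> .

ex:Researcher rdfs:subClassOf ex:Person .
ex:alice      rdf:type        ex:Researcher .
"""

g = Graph()
g.parse(data=data, format="turtle")

# Apply the RDFS rules: the reasoner deduces "ex:alice rdf:type ex:Person",
# a triple that does not appear in the input data.
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

query = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex:  <http://example.org/onto#>
SELECT ?cls WHERE { ex:alice rdf:type ?cls . }
"""
for row in g.query(query):
    print(row.cls)
```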

The two other popular reasoning engines are Racer [39, 40], which is based on description logic and supports RDFS, DAML and OWL, and Pellet [41, 42], which is made by the same people responsible for SWOOP and works with SWOOP as an ontology consistency analyzer. Figure 2.7 lists different existing reasoners and shows how popular they are.

Figure 2.7: Reasoners and their use by researchers (adopted from [30]).


2.2 Linked Data

Linked Data is the realization of the Semantic Web. The Semantic Web is a "Web of Data" of any type imaginable: people, events, places, organizations, and so on. The Semantic Web includes a collection of technologies (RDF, OWL, SKOS, SPARQL, etc.) which offer an environment where a specific application is able to issue certain data inquiries or, using the provided ontology, infer new facts. However, to have an actual web of data, it is imperative to convert the huge amount of data on the Web to "a standard format, reachable, and manageable by Semantic Web tools" [43]. Moreover, it is not enough to only provide access to data. The data providers should also offer the semantic relationships among data. Otherwise, the Web of Data will be reduced to a sheer collection of datasets. Such a collection of interrelated datasets on the Web is referred to as Linked Data [43].

The implementation of Linked Data requires technologies to be offered under the umbrella of a common format (RDF) so that the existing databases in various formats, such as relational, XML, or HTML, can be either converted or accessed on-the-fly. Another important requirement is to provide query endpoints to facilitate data access. While W3C offers a variety of data access technologies (RDF, GRDDL, POWDER, RDFa, R2RML, RIF, SPARQL), there was a motivational impasse in the realization of Linked Data.

In order to encourage data providers to convert and provide access to their datasets using Semantic Web technologies, a few interesting applications or frameworks were needed as incentives. On the other hand, to create such incentives, large enough datasets in Semantic Web format had to be available. To resolve the deadlock, some organizations decided to bootstrap Linked Data by unilaterally generating large Semantic Web datasets and creating a community called Linking Open Data (LOD) [44].

One such large Linked Dataset generated by LOD is DBpedia. This dataset, which is at the heart of Linked Data, offers the content of Wikipedia in RDF format. DBpedia is particularly significant not only because it provides access to Wikipedia data, but also because it integrates links to other datasets on the Web, e.g. GeoNames [45]. DBpedia also provides "same-as" RDF triples which equate the URI of certain items in its knowledgebase with the URI of equivalent items in other similar datasets, such as those in [46, 47]. Providing these extra links enables applications to exploit and integrate more knowledge and facts from several datasets, which may improve the user experience. Figure 2.8 shows one of the early stages of LOD in May 2007 with only 12 datasets and Figure 2.9

presents its latest state as of September 2011 with 295 datasets. The comparison demonstrates the growing interest of various organizations all over the world in joining the Linked Data movement.
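As a small illustration of how such cross-dataset links can be consumed, the following Python sketch queries the public DBpedia SPARQL endpoint for the "same-as" links of one resource. It assumes the SPARQLWrapper package and the availability of the endpoint; the resource is chosen arbitrarily for illustration.

```python
# Sketch of retrieving DBpedia "same-as" links into other Linked Data datasets,
# assuming the SPARQLWrapper package and the public endpoint being reachable.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT ?same WHERE {
        <http://dbpedia.org/resource/University_of_Waterloo> owl:sameAs ?same .
    }
    LIMIT 20
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["same"]["value"])  # URIs of equivalent items in other datasets
```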

Figure 2.8: LOD cloud diagram in May 2007 (adopted from [48]).

Figure 2.9: LOD cloud diagram as of September 2011 (adopted from [48]).


In order to generate a dataset in the proper Linked Data format, Tim Berners-Lee laid out a few rules in his notes on Linked Data design issues [3]. The following reiterates these rules, which are commonly known as the ‘Linked Data principles’.

1. Use URIs as names for things

2. Use HTTP URIs so that people can look up those names

3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)

4. Include links to other URIs, so that they can discover more things

The first rule is trivial since there would be no Semantic Web in the absence of URIs, as mentioned earlier in the layer cake architecture. The second rule is also globally understood given the available HTTP infrastructure of web pages, which offers convenient landing places for description documents. An example was given earlier about the DBpedia URI and corresponding URL for the 'Life' resource. The third rule suggests that the URIs should come with information which can be used to explore and exploit relationships and rules. Such information should be in a standard Semantic Web format to facilitate access. Finally, the fourth rule again emphasizes including connections to other datasets, a rule which is derived from the basic nature of Linked Data.

Finally, Berners-Lee suggests a star rating scheme to encourage government data owners and other organizations to provide data as Linked Open Data rather than Linked Data. The four rules above can be used as guidelines for creating Linked Data. However, if one wants to have Linked Open Data, the data should be released under an open license such as Creative Commons CC-BY [49]. Table 2.3 presents this scheme, which makes data progressively more powerful and useful as it gets more stars.

Table 2.3: Linked Open Data star rating system (adopted from [3]).
★ | Available on the web (whatever format) but with an open license, to be Open Data
★★ | Available as machine-readable structured data (e.g. Excel instead of an image scan of a table)
★★★ | Same as (2) plus non-proprietary format (e.g. CSV instead of Excel)
★★★★ | All the above plus: use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff
★★★★★ | All the above, plus: link your data to other people's data to provide context


2.3 Wikipedia

The foundation of the methodology presented in this dissertation is undoubtedly Wikipedia. While a complete description of the building blocks of Wikipedia is presented in Chapter 3, it is fitting to provide a brief background on the uses of Wikipedia in a few fields related to this work. The following sections provide a few examples of Wikipedia applications in the literature. First, a quick look at the connections between Wikipedia and the Semantic Web is presented. Afterwards, a background on topic identification, which is an important portion of this dissertation, is introduced. As presented in Chapter 4, query topic identification is the dominant application for which the experimental results have been discussed; therefore, the review is more focused on query topic identification. However, the last section provides an insight into the application of Wikipedia in a more general field: text classification.

2.3.1 Wikipedia and Semantic Web

The idea of integrating Semantic Web concepts and Wikipedia/wikis has been the subject of discussion for a while now. Besides DBpedia, there are a number of other works on the subject. Völkel et al., for example, propose that semantics be added to Wikipedia articles in the form of categories, typed links, and attributes [50]. The paper states that categories are already used in Wikipedia and that, by adding simple prefixes to internal links, articles would be linked semantically. Attributes provide literals (strings, numbers, etc. which are not specific articles) to the articles. In a way, typed links resemble predicates and attributes resemble objects in RDF. Furthermore, the authors are affiliated with the Karlsruhe Institute of Technology, which was behind the development of the Semantic MediaWiki (SMW) [51, 52]. SMW is an extension of MediaWiki, the software which powers Wikipedia. This free and open-source extension provides the capability of storing and querying data within wiki pages. This creates a framework for turning wikis into powerful and flexible collaborative databases. Moreover, the data entered into these wiki pages can effortlessly be converted into Semantic Web format.

2.3.2 Wikipedia and Topic Identification

Query topic identification, or query classification, is the natural language processing (NLP) task that focuses on inferring the domain information surrounding user-written queries and on assigning to each query the best category label from a predefined set. Given the ubiquity of search engines and question-handling systems today, this challenge has been receiving a growing amount of attention. For example, it was the topic

of the ACM's annual KDD CUP competition in 2005 [53], where 37 systems competed to classify a set of 800 real web queries into a set of 67 categories designed to cover most topics found on the internet. The winning system was designed to classify a query by comparing its word vector to that of each website in a set pre-classified in the Google directory. The query was assigned the category of the most similar website, and the directory's set of categories was mapped to the KDD CUP's set [54]. This system was later improved by introducing a bridging classifier and an intermediate-level category taxonomy [55, 56].
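The following toy Python sketch illustrates the underlying idea of that approach, namely assigning the category of the nearest pre-classified document by cosine similarity over word vectors; the category labels and snippets are hypothetical, and the sketch is not a reconstruction of the actual KDD CUP system.

```python
# Toy illustration: assign a query the category of the most similar
# pre-classified document, using cosine similarity over word-count vectors.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical pre-classified snippets standing in for directory websites.
labeled_docs = {
    "Computers/Internet": "web browser internet software download online",
    "Sports/Hockey": "hockey team goal season player league",
}

def classify(query: str) -> str:
    q_vec = Counter(query.lower().split())
    scores = {label: cosine(q_vec, Counter(text.split()))
              for label, text in labeled_docs.items()}
    return max(scores, key=scores.get)  # category of the most similar document

print(classify("internet explorer download"))  # -> Computers/Internet
```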

Most query classifiers in the literature, like the system described above, are based on the idea of mapping the queries into an external knowledge source (an objective third-party knowledgebase) or an internal knowledge source (user-specific information) to classify them. This simple idea leads to a great variety of classification systems. Using an internal knowledge source, Cao et al. [57] developed a query classifier that disambiguates queries based on the context of the user's recent online history. On the other hand, many very different knowledge sources have been used in practice, including ontologies [58], websites [59], web query logs [60], and Wikipedia [61, 62].

Exploiting Wikipedia as a knowledge source has become commonplace in scientific research. Several hundred journal and conference papers have been published using this tool since its creation in 2001. However, while both query classification and NLP using Wikipedia are common challenges, there have not been many query classification systems based on Wikipedia. One such system was proposed by Hu et al. [61]. Their system begins with a set of seed concepts to recognize, and it retrieves the Wikipedia articles and categories relevant to these concepts. It then builds a domain graph by following the links in these articles using a Markov random walk algorithm. Each step from one concept to the next on the graph is assigned a transition probability, and these probabilities are then used to compute the likelihood of each domain. Once the knowledgebase has been built in this way, a new user query can be classified simply by using its keywords to retrieve a list of relevant Wikipedia domains, and sorting them by likelihood. Unfortunately, their system remained small-scale and limited to only three basic domains, namely "travel", "personal name" and "job". It is not a general-domain classifier such as the one introduced in this dissertation.

Another query classification system was designed by Khoury [62, 63]. It follows Wikipedia’s encyclopedia structure to classify queries step-by-step: first, using the query’s words to select titles, second, selecting articles based on these titles, and third, selecting categories from the articles. At each step, the weights of the selected elements are computed based on the relevant elements in the

previous step: a title's weight depends on the words that selected it, an article's weight on the titles', and a category's weight on the articles'. Unlike [61], this system was a general classifier that could handle queries from any domain.
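The cascading idea can be sketched as follows in Python; the mappings, the uniform weights, and the plain summation are placeholders for illustration only and do not reproduce the actual weighting formulas of [62, 63].

```python
# Toy sketch of cascading weights: words -> titles -> articles -> categories.
from collections import defaultdict

def propagate(weights: dict, links: dict) -> dict:
    """Weight each target by the sum of the weights of the sources selecting it."""
    out = defaultdict(float)
    for source, weight in weights.items():
        for target in links.get(source, []):
            out[target] += weight
    return dict(out)

# Hypothetical mappings standing in for Wikipedia's title/article/category structure.
query_words = {"internet": 1.0, "explorer": 1.0}
word_to_title = {"internet": ["Internet"], "explorer": ["Internet Explorer", "Explorer"]}
title_to_article = {"Internet": ["Internet"], "Internet Explorer": ["Internet Explorer"], "Explorer": ["Exploration"]}
article_to_category = {"Internet": ["Computing"], "Internet Explorer": ["Web browsers", "Computing"], "Exploration": ["Geography"]}

titles = propagate(query_words, word_to_title)         # titles weighted by the words that selected them
articles = propagate(titles, title_to_article)         # articles weighted by their titles
categories = propagate(articles, article_to_category)  # categories weighted by their articles
print(sorted(categories.items(), key=lambda kv: -kv[1]))
```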

2.3.3 Wikipedia and Text Classification

While using Wikipedia for query classification has not been a common task, there have been several projects using that resource which are worth mentioning. Schönhofen [64] successfully developed a complete document classifier using Wikipedia by mapping the document's vocabulary to titles, articles, and finally categories, and weighting the mapping at each step. Alternatively, other authors use Wikipedia to enrich existing text classifiers by improving upon the simple bag-of-words approach. The authors of [65] use it to build a kernel to map the document's words to the Wikipedia article space and classify from there, while the authors of [66] and [67] use it for text enrichment: to expand the vocabulary of the text by adding relevant words taken from Wikipedia titles. Interestingly, improvements are reported in the classification results of [64], [66] and [67], while only [65] reports worse results than the bag-of-words method. The conclusion seems to be that working in the word space is the better option, a conclusion that [65] also shares.

2.4 Summary

This chapter provided a background on the topics fundamental to the understanding of the methodology presented in the upcoming chapters. First, the Semantic Web was introduced and its various components were presented in the layer cake structure. Afterwards, each of these components was studied. Specifically, RDF was discussed in detail as it is one of the most important technologies in the Semantic Web standards. After introducing the Semantic Web, the concepts of Linked Data and Linked Open Data were introduced as the physical implementations of the Semantic Web principles. Lastly, a review of some of the uses of Wikipedia in the literature which are closely related to the topic of this dissertation was presented. Among them, query topic identification was the key application as it is the target of the experiments of this dissertation. Finally, it is important to note that some of the works in the related literature are mentioned in later chapters where applicable to the context.


Chapter 3 Semantic Analysis of Wikipedia

3.1 Introduction

In this chapter, the general methodology and its specific components are discussed. As the titles of both the dissertation and this chapter suggest, the main building block of the methodology is Wikipedia. Therefore, Section 3.3 starts by explaining the structure of Wikipedia and digs into how it will be used to solve problems. Afterwards, the Semantic Web version of Wikipedia, called DBpedia, will be introduced. These two will constitute the foundation of the methodology: the knowledgebase that the framework relies upon.

In Section 3.5, the semantic graph of the knowledgebase will be the topic of discussion. The semantic graph, one of the key ingredients mentioned in Chapter 1, is the most important part of the knowledgebase with respect to the problems presented in this work. Therefore, the focus is directed towards how the semantic graph connects entities and topics to one another: entity-to-entity, topic-to-topic, and entity-to-topic.

Once the knowledgebase is fully discussed, Section 3.6 and Section 3.7 will present the two main algorithms of the methodology upon which all the offered solutions and experiments are based. These two algorithms, called entity detection and semantic analysis, are the other two key ingredients of the methodology. They present two general ideas for how to tackle different problems. The specific algorithm used for each problem is usually a variation of one of these algorithms tailored for that specific problem. However, the order of application of these two algorithms is always the same: entity detection should always be performed before semantic analysis, since semantic analysis depends upon, and operates on, the output of entity detection.

The final part of this chapter will present a few interesting problems and suggest potential solutions to them. These solutions are, as expected, based on appropriate variations of the two discussed algorithms using the proper portion of the knowledgebase.

3.2 Problem Definition

Before delving into the discussion of the building blocks of the methodology, it is helpful to describe the generic problem definition which the methodology is trying to address. While a crisp

formulation cannot be achieved to cover all the use cases, Table 3.1 illustrates a generic problem definition which can be adapted to various specific problems.

Table 3.1: Generic problem definition.

Input: A textual object in any format, e.g.:
• Letters
• Bag of words
• Sequential words
• Bag of sentences
• Sequential sentences
• Parallel sentences
• Text document

Processing: Perform any number of the following tasks:
• Detect potential entities (possibly ranked)
• Traverse the semantic graph
• Gather and rank semantically similar entities/topics
• Determine high-scored target entities/topics

Output: Suggest a ranked list or an iterative recommendation of one or a combination of the following items:
• Entities
• Topics
• Part of the input
• Any textual object

The input portion of the problem is some kind of textual object. As mentioned in Table 3.1, it can be any collection of words, sentences, or letters, or even, in the case of the speech recognition application, a set of parallel candidate sentences, as we will observe. These collections can be ordered or unordered text. The size is also variable: the input may be a short text like a web query or a large document such as an e-mail. Larger text objects are usually broken into smaller ones. The details of such pre-processing of the input and the handling of different types of input data are discussed later in Section 3.6.

The processing portion is usually the part that varies greatly from one application to another. The steps mentioned above are the ones taken in the provided solutions in Section 3.8. The processing

portion needs a knowledgebase in order to perform the semantic analysis required for solving related problems. More specifically, the processing unit employs two main concepts of the knowledgebase called entities and topics. While Section 3.3 and Section 3.4 explain the main structure of the knowledgebase, Section 3.5 will give an in-depth look into the concepts of entity and topic and how they are related in the semantic graph. Based on different applications, the solutions might be tipped towards one of these two concepts. For instance, the query topic identification problem uses more of the topic graph, while the speech recognition and context awareness applications deal mainly with entities. In any case, the detection of entities is always part of the algorithms presented here. Afterwards, it is possible to traverse the graph, find semantically similar entities or topics, and rank them based on a relevancy equation. In some cases, the targets are known, e.g. categories in web query categorization. Therefore, the processing unit will try to map the input to those target outputs. In other cases, such as text prediction and completion, there is no specific target available; this means the processing unit can only provide a list of candidate targets based on the input. Sections 3.6 and 3.7 will explore the different aspects of the processing portion.

Once the processing is done, the output is usually presented as a ranked list of items or an iterative recommendation of items. The items can be, as mentioned, entities, topics, or any type of textual object. The output may also be a combination of different items. For example, a logical AND/OR combination of topics can be the result of a topic identification problem.

To summarize the problem definition: given a text input, we would like to perform a semantic analysis of the input based on a knowledgebase of entities and topics and return the most relevant and closest desired target object or objects. With this general picture of the problem definition in mind, we can now proceed to the description of the fundamental components of the methodology. Once all the components are introduced, Section 3.8 will show how the above generic problem definition evolves into different specific applications and how the portions are adapted to each application.
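As a preview, the following is a minimal, runnable Python sketch of the generic shape described above, using a toy knowledgebase; the function names and data structures are hypothetical placeholders for the algorithms developed in Sections 3.6 and 3.7, not the actual implementation.

```python
# Toy illustration of the generic problem shape: detect entities in the input,
# then run a (here extremely simplified) semantic analysis over their topics.

TOY_KB = {
    "albert einstein": {"Physicists", "Nobel laureates"},
    "ulm": {"Cities in Germany"},
}

def detect_entities(text, kb):
    # Naive detection: keep every known entity label occurring in the input.
    lowered = text.lower()
    return [label for label in kb if label in lowered]

def semantic_analysis(entities, kb):
    # Naive scoring: count how often each topic covers a detected entity.
    scores = {}
    for entity in entities:
        for topic in kb[entity]:
            scores[topic] = scores.get(topic, 0) + 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

detected = detect_entities("Albert Einstein was born in Ulm", TOY_KB)
print(semantic_analysis(detected, TOY_KB))
```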

The very first component to study is the knowledge platform upon which all the processing algorithms are based. As mentioned earlier, the knowledgebase mainly consists of entities, topics, and their relationships. Understanding these concepts is paramount to understanding the methodology. Therefore, in the next three sections, we will thoroughly explain the structure of the knowledgebase and two parts of the semantic graph related to entities and topics.


3.3 Wikipedia

The fundamental platform of the methodology is the knowledgebase over which all the presented algorithms are constructed. The choice of knowledgebase for this work is Wikipedia. As mentioned before, Wikipedia has been a popular resource for numerous research works, some of which were mentioned in Section 2.3. It has also become the major source of open data for the improvement of the Semantic Web and Linked Data community [4, 68]. The main reason Wikipedia gets such attention is its sheer extent over a wide range of knowledge topics. According to the official Wikipedia statistics webpage [69], there were about 4 million articles in the English branch of Wikipedia and more than 21 million articles across all languages of Wikipedia as of April 2012. This volume opens up unique possibilities for research and makes Wikipedia the main player in the Linked Data community. In the following, the structure of Wikipedia is studied, its advantages and problems are discussed, and the process of converting Wikipedia to the semantic graph is explained.

Wikipedia can simply be defined as a collection of articles about various concepts and knowledge items. These articles are written in different languages through the collaboration of different authors. One of the unique characteristics of Wikipedia is the internal links between articles. These links are many-to-many connections which relate each article to many others. These links and the frequency of their occurrence are mainly decided by the intuition of the human authors and the collaborative correction mechanism of Wikipedia. If certain parts of an article are missing, inadequate, and/or incorrect, subsequent authors will try to fix or compensate for them.

Figure 3.1 displays a snapshot of an article about the University of Waterloo. The contents of the middle of the page are removed and only the top and bottom of the page are depicted in this figure, since these parts contain the key information. Four items have been labeled on this figure. These items are the main elements of Wikipedia used to create the semantic graph. The following subsections will explain each of these elements:

1. Title of the article

2. Infobox

3. Wikilinks

4. Categories


Figure 3.1: A sample Wikipedia page.


3.3.1 Titles

Each article in Wikipedia has one or more titles. Titles, as the word suggests, are the labels specifying what an article is about. There are three types of titles:

1. Main titles

2. Redirect titles

3. Disambiguation titles

To understand what each of these titles means, each concept is described using the example of Figure 3.2. Each article has one and only one main title and each main title refers to only one specific article. In other words, the main title is the unique key identifying an article. For instance, in Figure 3.2, the titles "United States" and "Americas" (shown by solid line boxes on the left) are the main titles pointing to their corresponding articles (shown by the large boxes on the right).

Figure 3.2: An example of main, redirect, and disambiguation titles.


Each article might have a number of redirect titles (or simply redirects). Each redirect points to the main title of an article and, therefore, points to the same article the main title points to. Redirects provide additional access points for alternative names: synonyms, foreign translations, abbreviations, and common typos. In the example above, the main title "United States" has "United States of America", "USA", and a common typo "United Staets" as redirects. The "Americas" article, on the other hand, has a redirect called "American continent".

The last type of title is the disambiguation title, which is the reverse of a redirect, meaning that one disambiguation title points to many different articles. Disambiguation titles are provided to differentiate the different senses of a title if the title has multiple meanings. For instance, the title "America" in Figure 3.2 might be interpreted as the United States or it might refer to the entire continent of America (North and South America). While some of the titles pointed to by a disambiguation title have different names than the disambiguation title itself, as shown in the above example, in many cases the different articles pointed to by the disambiguation title have exactly the same name. For instance, "Phoenix" can refer to all of the following articles: Phoenix, a mythical bird; Phoenix, Arizona; a 1998 movie called Phoenix; and a minor constellation in the southern sky. To differentiate such identically named articles, the disambiguation title is annotated with extra tags; the annotated titles then become the main titles of the articles to which the disambiguation title was originally pointing. In the case of the "Phoenix" example, the titles are respectively converted to the following annotated main titles: "Phoenix (mythology)", "Phoenix, Arizona", "Phoenix (film)", and "Phoenix (constellation)".
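A small Python sketch (with made-up entries) of how the three title types can be used to resolve a surface string to candidate main titles is shown below; the dictionaries stand in for the much larger title tables described above.

```python
# Hypothetical, tiny title tables illustrating main titles, redirects, and
# disambiguation titles as described in this subsection.
MAIN_TITLES = {"United States", "Americas", "Phoenix (mythology)", "Phoenix, Arizona"}
REDIRECTS = {"USA": "United States", "United Staets": "United States",
             "American continent": "Americas"}
DISAMBIGUATIONS = {"America": ["United States", "Americas"],
                   "Phoenix": ["Phoenix (mythology)", "Phoenix, Arizona"]}

def resolve_title(surface):
    """Return the list of main titles a surface string may refer to."""
    if surface in MAIN_TITLES:
        return [surface]                      # already a main title
    if surface in REDIRECTS:
        return [REDIRECTS[surface]]           # follow the single redirect
    return DISAMBIGUATIONS.get(surface, [])   # one-to-many disambiguation

print(resolve_title("USA"))      # ['United States']
print(resolve_title("Phoenix"))  # ['Phoenix (mythology)', 'Phoenix, Arizona']
```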

3.3.2 Infoboxes

Infoboxes are fixed-format tables which appear on some Wikipedia articles and often act as summaries of the articles. The entries provided in these boxes are often the unifying items that similar articles share, e.g. the capital city shared by all countries. Additionally, some of the entries are included to facilitate access and navigation to related articles. The inclusion of an infobox is neither required nor prohibited for any given article. Whether to add an infobox and, if so, which kind of infobox to add and which of its fields to fill is decided by consensus among all the collaborative authors contributing to the page.

The content of an infobox comes from certain templates which contain important facts and statistics typical of the associated articles of such templates. For instance, the articles related to people will often feature date of birth as a common field for all human beings. Some articles might also feature spouses and children if applicable. Then, for different professions, different fields are added on. Politicians have records of the offices they hold and the dates they hold them, while athletes' infoboxes present information about their possible awards and physical attributes. In other words, these templates work as easy-to-read synopses of the concept presented in the article in a small ontology format. This ontological nature of infoboxes is the main reason behind their conversion into the real RDF ontology of DBpedia, as we will see in Section 3.4. Figure 3.3 demonstrates four sample infoboxes using four different template types: musical bands, animals, politicians, and countries.

Figure 3.3: Four examples of different types of infoboxes.

3.3.3 Wikilinks

Wikilinks are the hyperlink equivalents in Wiki articles. A wikilink, also called an internal link, is a link connecting one page in Wikipedia to another. It is simply formatted with double square brackets, such as [[article]]. A wikilink might have some extra parameters attached. These parameters can be a wide range of editing and linking tricks for flexible mark-up purposes. Two important examples of such parameters are labels and types. Labels are often aliases describing or clarifying the link in a human-readable format. For instance, "online collection of structured data" is a label for the link to the article "Online database" mentioned in the Wikipedia article [70]. Types are tags describing what kind of page the link is pointing to. Types are specified using the type name followed by a colon and then the article or resource page name. For instance, a wikilink specified by [[file:Image_Name.png]] is a link to a page containing a digital picture file. Among these typed wikilinks, those linking to categories are of most importance to this work. Category links are described in the next subsection.
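The following is a rough Python sketch of extracting wikilinks, their labels, and their types from article markup; real Wikipedia markup has many more corner cases (nested links, templates, escaped brackets) than this simple regular expression handles.

```python
import re

# Matches [[target]] and [[target|label]]; typed links such as [[file:X.png]]
# or [[Category:Science]] are detected by the colon in the target.
WIKILINK = re.compile(r"\[\[([^\]\|]+)(?:\|([^\]]+))?\]\]")

def parse_wikilinks(wikitext):
    links = []
    for target, label in WIKILINK.findall(wikitext):
        kind = None
        if ":" in target:                     # typed link, e.g. file: or Category:
            kind, target = target.split(":", 1)
        links.append({"target": target.strip(),
                      "label": (label or target).strip(),
                      "type": kind})
    return links

sample = ("DBpedia is an [[online database|online collection of structured data]] "
          "[[Category:Semantic Web]]")
print(parse_wikilinks(sample))
```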

3.3.4 Categories

Categories are the most important elements of an article with regard to the topic of this research. Each article is tagged by (or points to) one or more categories. The categories of Wikipedia are structurally regarded as semi-articles. Their template is different from that of a regular article and their links are specified using typed wikilinks, as described above, with a "Category:" tag before their names. For instance, the category of science is written as [[Category:Science]]. The number of categories is in the hundreds of thousands and keeps growing. In September 2008 the count was around 400,000, and by July 2011 it had grown to more than 740,000 categories.

The categories of Wikipedia are not just labels; they also have features of their own. Wikipedia considers each category as a typed article. Each category article starts with a short summary of what topic the category represents. The body provides backward links to all the regular articles that point to this category. Finally, each category can have its own category labels at the bottom of its article page, similar to the categories of a regular article demonstrated by label 4 in Figure 3.1. This structure causes each category to belong to other parent categories and, therefore, forms a hyponymy graph.

Wikipedia's categories describe every domain of knowledge, ranging from the very precise, such as "fictional secret agents and spies", to the very general, such as "information". As explained, the categories are connected by hyponym relationships, with a child category having an "is-a" relationship to its parents. Therefore, more specific categories have more general parents called super-categories. However, the graph is not strictly hierarchic: there exist shortcuts in the connections (i.e. starting from one child category and going up two different paths of different lengths to reach the same parent category) as well as loops (i.e. starting from one child category and going up a path to reach the same child category again). Some of these characteristics, such as shortcuts, are tolerable but some, like loops, present serious issues in the semantic analysis and, therefore, need some pre-processing before they can be used. Section 3.5 will explore the details of the category graph and provide information on how the graph is pruned and processed to be used in analysis.

3.3.5 Other Elements

A Wikipedia article has a number of other elements which are not significantly used in this research. Before moving on, it is worth briefly describing some of these elements and pointing out some useful applications they enable. These elements include, but are not limited to:

• Summary

• Table of Contents

• Body

• “See Also” section

• References & External links

• Templates

The body of an article is the most important element of the article. While the wikilinks are embedded in different parts of the article body, the statistical analysis of the body words can be a useful resource. For instance, calculating the frequency of each of the words in an article and relating those frequencies to the categories of the article can act as a potential approach for document classification problems.

The Table of Contents (ToC) can be used to group wikilinks and annotate them with semantic tags. Specifically, the headers of each section are useful clues for tagging wikilinks. The "See Also" section, which appears in the ToC, usually groups wikilinks to articles closely related to the current article. The next subsection will explore further how the ToC can be used to extract RDF triples.

The summary, especially its first sentence, is a good indicator of the main concept that the article subject is related to. For instance, Albert Einstein is affiliated with many topics based on the categories of his Wikipedia article, such as Nobel laureate, inventor, and humanitarian. But the first sentence of the summary of his article specifically introduces him as "a German theoretical physicist", which is the description people most readily associate with him.


Lastly, the templates are common structures, possibly following a timeline, that many related articles are associated with. For instance, the family template indicates the spouse(s) and child(ren) of a person and can be used to generate important RDF triples automatically and with high accuracy.

3.3.6 Challenges

Although Wikipedia looks like a very regularly structured data source compared to common hypertext web pages, there are still a few challenges to be dealt with in order to efficiently use Wikipedia for automated semantic analysis purposes. The main problem regarding Wikipedia is that, despite all the structures, templates, tags, and mark-ups, Wikipedia is still a resource designed primarily for human readability, not machine readability. The presence of all the elements mentioned above is, for the most part, intended to enhance usability and navigation. These elements are user interface features that help human users find relevant information faster and more easily. They are not, on the other hand, engineered to be a computing resource by nature, and using them in a computing task requires further processing.

Figure 3.4: The developed Wikipedia parser tool.


The first issue concerns the data itself. Although the Wikipedia dump files [71] provide SQL statements so that the end user can regenerate a database, the provided links are too untidy to use. For example, there are a large number of articles and categories that are created and kept for editing and maintenance purposes. These data do not contribute to the main encyclopedic purpose of Wikipedia, and one needs to write some parsing code to clean the data of these extra materials.

Alternatively, one might use the textual format of the Wikipedia database dump, which is free of these extra entries. This alternate format is one giant XML file that carries each article as an XML-tagged element. Therefore, aside from a few main pieces of information, such as the title of an article, the remainder of each article body should be parsed so that the information from all the wikilinks, categories, and infoboxes gets extracted. In any case, some effort should be put into converting Wikipedia resources to machine-friendly versions. Figure 3.4 shows a parser tool developed by the author in the early stages of this research for this purpose.
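As an illustration only, a minimal Python sketch of streaming such an XML dump and pulling out each article's title and raw wikitext might look as follows; the element names reflect the common pages-articles dump layout, and the file path is hypothetical.

```python
import xml.etree.ElementTree as ET

def iter_articles(dump_path):
    """Yield (title, wikitext) pairs from a Wikipedia pages-articles XML dump."""
    title, text = None, None
    for _event, elem in ET.iterparse(dump_path, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]     # strip the XML namespace
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            if title is not None:
                yield title, text or ""
            title, text = None, None
            elem.clear()                      # keep memory bounded on huge dumps

# Hypothetical usage:
# for title, wikitext in iter_articles("enwiki-pages-articles.xml"):
#     print(title, len(wikitext))
```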

Figure 3.5: Converting wikilinks to RDF triples (adapted from [50]).


Assuming that the basic elements of Wikipedia, such as article titles, wikilinks, and categories, are successfully extracted, the data is still not fully suitable for semantic use. The lack of proper predicate labels for wikilinks is one of the major semantic issues. Wikilinks provide links between two articles. If we consider articles as entities, a wikilink simply asserts that the source article is related to the target article. The problem is that the type of these relations is not specified. As an example, the Wikipedia article about Toyota points to Japan, but we cannot derive what the relationship between Toyota and Japan is solely based on this link. To compensate for this deficiency, many researchers have tried to encourage the use of Semantic Wiki frameworks so that authors can easily specify the type of the links as well. The works of [50, 72] along with the WikSAR project [73], Platypus Wiki [74, 75], and Rhizome [76] all more or less suggest converting wikilinks to RDF triples by mapping the article title to the subject, the linked article to the object, and the type of the link to the predicate part of the RDF triple. Obviously, the type should be provided by the author beforehand. Figure 3.5 shows an example of converting wikilinks to RDF triples.

However, there are intuitive approaches which can try to detect or estimate the type of the wikilinks automatically. As mentioned earlier, the Table of Contents can be one resource. The wikilinks are distributed in the article body and, if extracted normally, the order and position of the wikilinks would be lost. But if, while parsing the body, the header of the section in which the link appears is recorded, it is at least possible to have a general idea of what that link is about. For instance, the link to "Ulm" in the Einstein article appears in the "Early life and education" section, which is not far from the exact type, namely his birthplace. [77] provides other uses of the ToC for partitioning the text in order to find the most relevant patterns. Although grouping and tagging wikilinks with their ToC section header gives a general idea of the topic of the wikilink, a more sophisticated approach, such as natural language processing, should be used along with the ToC to detect the type of a wikilink. One simple method is to find the closest verb or noun to the wikilink and use that as an estimate. For instance, in the sentences "Albert Einstein was born in Ulm" and "His father was Hermann Einstein", the verb 'born' and the noun 'father' ('was' is ignored as being a form of 'to be') are the closest words defining the two wikilinks, Ulm and Hermann Einstein.
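A toy Python sketch of this "closest verb or noun" heuristic is shown below. To stay self-contained, the part-of-speech tags are supplied by hand; in practice they would come from a POS tagger, which is an assumption not made in the text.

```python
AUXILIARY = {"be", "is", "are", "was", "were", "been", "being"}

def estimate_predicate(tagged_tokens, link_index):
    """tagged_tokens: list of (word, POS tag); link_index: position of the wikilink."""
    best = None
    for i, (word, tag) in enumerate(tagged_tokens):
        if i == link_index or word.lower() in AUXILIARY:
            continue                              # skip the link itself and 'to be' forms
        if tag.startswith(("NN", "VB")):          # consider nouns and verbs only
            if best is None or abs(i - link_index) < abs(best[0] - link_index):
                best = (i, word)
    return best[1] if best else None

sentence = [("Albert", "NNP"), ("Einstein", "NNP"), ("was", "VBD"),
            ("born", "VBN"), ("in", "IN"), ("Ulm", "NNP")]
print(estimate_predicate(sentence, link_index=5))  # -> 'born'
```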

Finally, a very important step for improving the semantic value of the knowledgebase is the processing of the infobox information. If processed accurately, infobox information can provide immense semantic merit to the knowledgebase. Processing the infoboxes is a particularly challenging task. Below are a few of the issues to be noted when parsing infoboxes to efficiently extract the data:


• There are many different infobox templates.

• In each template, some fields might be missing or not applicable. For instance, some people might be alive and some might not and, therefore, a date of death might not be included in some articles.

• Some of the entries in infoboxes are pointers to other articles while some others are literal values.

Despite the above difficulties, the tabular nature of infoboxes has attracted interest in a number of research works. Auer and Lehmann [78] show that forming semantic RDF triples by converting infoboxes can be done easily and the result can then be used in impressive question-answering tasks. Wu and Weld, on the other hand, introduce Kylin [79], a self-supervised machine learning system that extracts semantic relationships from Wikipedia's natural language text, constructs and completes infoboxes, and identifies missing links for proper nouns on each page. Later, they introduced the KOG autonomous system [80] to further refine infoboxes in order to achieve a clean, structured ontology. Moreover, in the Intelligence in Wikipedia project [81], they try to extract a knowledgebase of semantic triples by combining two paradigms called information extraction (IE) and communal content creation (CCC), which are again based on infobox information. Last, but certainly not least, the DBpedia project [68, 82] is a well-known example of successfully converting infobox information into RDF triples. The next section will give a meticulous description of the DBpedia structure and its components.
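As a rough illustration of the idea behind such conversions, the Python sketch below turns a simplified infobox (hypothetical field names, simplified markup) into subject-property-object triples, distinguishing literal values from links to other articles.

```python
import re

INFOBOX = """{{Infobox person
| name        = Hermann Einstein
| birth_date  = 1847-08-30
| children    = [[Albert Einstein]]
}}"""

def infobox_to_triples(subject, infobox_text):
    triples = []
    for line in infobox_text.splitlines():
        if not line.startswith("|"):
            continue                                  # not a field line
        field, _, value = line[1:].partition("=")
        field, value = field.strip(), value.strip()
        if not value:
            continue                                  # missing / empty field
        link = re.fullmatch(r"\[\[([^\]]+)\]\]", value)
        obj = ("entity", link.group(1)) if link else ("literal", value)
        triples.append((subject, field, obj))
    return triples

for triple in infobox_to_triples("Hermann Einstein", INFOBOX):
    print(triple)
```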

3.4 DBpedia

Simply put, DBpedia is the semantic version of Wikipedia, or at least one possible adaptation of it. Although it is not the answer to all of Wikipedia's problems mentioned above, and although it does not have every desirable semantic entry, such as the strongly typed RDF triples of wikilinks mentioned earlier, it provides the best machine-readable Semantic Web experience built from Wikipedia so far. DBpedia is engineered to be used in computing tasks, specifically in semantic analysis methods; no wonder it has become the nucleus [68] and crystallization point [82] for the Web of Data (Linked Data) initiative. In this section, some of the more important DBpedia components are discussed. Many of the concepts are, however, based on Wikipedia counterparts that were already explained in the previous section. Therefore, this section delves a bit more into the formatting of the data rather than the definitions of the concepts.


The DBpedia community's goal is to extract structured information from Wikipedia and make it available on the web for sophisticated queries [7]. Therefore, DBpedia is trying to address the challenges mentioned in the previous section. In fact, it does more than that. Not only does it clean up the garbage information, such as maintenance articles, and provide intelligently designed RDF triples for different elements of Wikipedia, it also annotates the extracted information by linking it to other knowledgebases such as YAGO [46, 47], hence creating linked data.

DBpedia provides multiple features and resources, namely: the knowledge ontology, external links to many other data sources, a SPARQL [38] query portal, and natural language processing datasets. However, the main core of DBpedia is the collection of semantic datasets. These datasets represent various elements of Wikipedia, some of which were mentioned in the previous section, and the relations between them. Time-wise, they are extracted from a snapshot dump of the Wikipedia database [71] taken at a certain date, but they are periodically updated in a batch operation once new versions of the Wikipedia dumps become available. Recently, DBpedia launched DBpedia Live [83] to make its resources even more up-to-date. DBpedia Live keeps updating the datasets as soon as new Wikipedia articles become available instead of waiting for an entire new dump to be released.
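For illustration, the Python sketch below queries the public DBpedia SPARQL endpoint for the categories of a resource. The endpoint URL is the standard public one, its availability is not guaranteed, and the use of the dcterms:subject predicate for article categories is an assumption based on current DBpedia releases.

```python
import json
import urllib.parse
import urllib.request

QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?category WHERE {
  <http://dbpedia.org/resource/Albert_Einstein> dcterms:subject ?category .
} LIMIT 10
"""

# Encode the query for an HTTP GET request against the public endpoint.
url = "http://dbpedia.org/sparql?" + urllib.parse.urlencode(
    {"query": QUERY, "format": "application/sparql-results+json"})

with urllib.request.urlopen(url) as response:
    results = json.load(response)

for binding in results["results"]["bindings"]:
    print(binding["category"]["value"])
```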

The structure of these datasets is, in fact, what makes them interesting for computing. Each line is a triple adhering to the popular RDF structure of subject-predicate-object. There are also quadruple versions of the datasets available, with an additional fourth column specifying the source of the triple. The complete list of these datasets and their descriptions is available at [84, 85]. Here, we take a look at the important ones which are also used in this methodology:

• Titles

• Redirects

• Disambiguation Links

• Ontology Infobox Types

• Ontology Infobox Properties

• Wikipedia Pagelinks

• Categories (SKOS)

• Article Categories
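Before describing each dataset, the Python sketch below shows one minimal way of splitting a single N-Triples line of such a dataset into its subject, predicate, and object. Real N-Triples parsing (escape sequences, blank nodes, quads) is more involved, and a dedicated RDF library would normally be used instead.

```python
import re

# <subject URI> <predicate URI> object .   -- object may be a URI or a literal
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.\s*$')

def parse_triple(line):
    match = TRIPLE.match(line)
    if not match:
        return None
    subject, predicate, obj = match.groups()
    if obj.startswith("<") and obj.endswith(">"):
        obj = obj[1:-1]                 # object is a URI
    return subject, predicate, obj      # literals keep their quotes/language tags

line = ('<http://dbpedia.org/resource/Albert_Einstein> '
        '<http://www.w3.org/2000/01/rdf-schema#label> "Albert Einstein"@en .')
print(parse_triple(line))
```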


Before describing each dataset individually, it is worthwhile to look at the big picture of how different entries of different datasets make up the DBpedia page equivalent of a Wikipedia article. Figure 3.6 portrays a sample DBpedia page which corresponds to the Wikipedia article of Figure 3.1 about the University of Waterloo. The Wikipedia article is located at http://en.wikipedia.org/wiki/University_of_Waterloo while the DBpedia page is located at http://dbpedia.org/page/University_of_Waterloo. We can see that the format of the DBpedia URL is based on the Wikipedia one, and a similar rule is applied in other languages.

Figure 3.6: A sample DBpedia page.


The DBpedia URL is, however, not the main unique key identifying its resources. It just provides a human-readable version of its resources for each article. Each URL has a standard resource URI pointing to it, which is the main identifier of any given DBpedia resource. We saw that the RDF format is fundamentally based on URIs. In DBpedia, each resource is identified by a URI formatted as http://dbpedia.org/resource/Article_Title. The rationale is that URIs act as the identifiers of the resources and then, if any URI is entered into a browser, it will redirect to a URL so that the user lands on a webpage with human-readable material describing the resource. This method, in fact, is the standard RDF design in the Semantic Web community [18].

Following the above discussion, the page shown in Figure 3.6 refers to a unique resource. The resource URI of this page is then the identifier occurring in all the datasets. Since all the information on this page consists of characteristics of the presented resource, it stands to reason that the resource would be the subject of any RDF triple presented on this page. This assumption is true because the page is actually created based on all the triples which have the given resource URI as their subject. The page generation, in effect, gathers all the triples with the resource URI as subject and creates a two-column table, with the first column being the predicate and the second the object. The numerical labels on the figure point to some of the fields of the datasets which are explained below. The title, labeled as 1, corresponds to the same label in Figure 3.1. The entries in Figure 3.6 exist in different datasets. In the following subsections, the aforementioned main datasets are described, noting where the labeled entries in the figure are located.

3.4.1 Titles Dataset

The titles dataset, as the name suggests, is the dataset that presents all the titles and their URIs. None of the typed or maintenance titles are included here, and all the presented titles are the ones related to knowledge items. This dataset is, therefore, used as the basis of our entity graph, as explained in Section 3.5, and it constitutes one of the pillars of the semantic analysis methodology. The format of the titles dataset is simply designed to connect the URI of each title to its English-language label. The triple is as follows:

"Title Label"@en .

<Title_URI> is the title identifier throughout all other datasets. The middle URI is the W3C standard predicate representing the concept of "label". Lastly, the "Title Label" is a literal string value indicating the English label of the title. The @en notation indicates the English language. Each row is terminated with a period. By the RDF format, the triple indicates that the entity known as <Title_URI> has a label called "Title Label". An example for "Albert Einstein" is below:

"Albert Einstein"@en .

Each line in the titles dataset is associated with a page such as that in Figure 3.6. The URI is mapped to the URL of the page and the label is displayed at the location marked 1 on the figure.

3.4.2 Redirects Dataset

The redirects dataset represents a subset of the titles dataset and indicates which of the titles are redirects. It also provides the main title each redirect title points to. The format is:

<Redirect_Title_URI> <http://dbpedia.org/ontology/wikiPageRedirects> <Main_Title_URI> .

The predicate URI has been created by DBpedia (along with many others in its ontology). The redirect title URI and main title URI each have one corresponding row in the titles dataset. Each redirect title points to only one main title and as a result has only one row in this dataset. An example for a redirect title can be:

<http://dbpedia.org/resource/UWaterloo> <http://dbpedia.org/ontology/wikiPageRedirects> <http://dbpedia.org/resource/University_of_Waterloo> .

Figure 3.6 shows the instances of redirect entries, marked as 4, with the first row corresponding to the above example. The URI in the left column is the predicate URI shown above and the ones in the right column are different redirect titles. This means that the order is reversed in the figure; this is done so that the main title is the subject of all triples featured on the page, as discussed earlier. Therefore, to show this swapped direction, an "of" is added after the predicate URI.

3.4.3 Disambiguation Links Dataset

The disambiguation links dataset is very similar to the redirects dataset and indicates the disambiguation titles and the various main titles that each disambiguation title clarifies. The format is:

<Disambiguation_Title_URI> <http://dbpedia.org/ontology/wikiPageDisambiguates> <Main_Title_URI> .


Figure 3.6 shows some inverted examples, marked as 3. The only difference from the redirects dataset is the cardinality of the relations. Each disambiguation title naturally disambiguates many main titles, while each redirect title redirects to only one main title. For instance, the title "UW" can point to both "University of Waterloo" and "University of Washington".

3.4.4 Ontology Infobox Types Dataset

This dataset maps titles to a list of ontological types extracted from infoboxes. The list features 320 different classes of things at the time of writing [86]. These types can be anything from professions such as Artists, Athletes, and Governors to region types such as City, Town, and Country. The dataset not only maps the titles to one of those 320 types/classes but may also provide equivalent triples mapping the titles to the types of other sources such as YAGO [47, 82]. The format is:

<Main_Title_URI> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <Class_URI> .

Types are similar to categories but they are very general and tend to feature the main characteristics of the entity they are presenting. “University of Waterloo” is of type University while “Aristotle” is both a Person and a Philosopher.

3.4.5 Ontology Infobox Properties Dataset

As discussed in the last section, infoboxes are valuable sources of structured information in some articles. This dataset is the result of careful conversion of article infoboxes into RDF triples. As mentioned before, infoboxes provide values for certain common characteristics of an entity, e.g. the birth date of people. Considering the article as an entity, these characteristics are called properties of the entity. Their values can either be literal values or other entities. Since the RDF object can be either a literal or a URI, such a structure fits nicely into an RDF triple format, as below:

<Main_Title_URI> <Property_URI> "Value"[^^Datatype_URI][@en] .

<Main_Title_URI> <Property_URI> <Value_Entity_URI> .

The <Main_Title_URI> refers to the main title (entity) and the <Property_URI> specifies what property (characteristic) of the title is presented. If the value of the property is a literal, it is expressed in quotations; otherwise, if the value is an entity, its URI is given. In the case of a literal value, the data type of the value may follow and, if the value is a string, the language tag (here English) is added. There is a special case of homepages where the value URI is replaced by the URL of the homepage, as the following depicts:

<Main_Title_URI> <http://xmlns.com/foaf/0.1/homepage> <Homepage_URL> .

An interesting aspect of the properties dataset is that the predicate URI varies so that it can express different properties. DBpedia, at the moment, provides more than 1600 different property URIs in its ontology and also borrows some property URIs, such as the homepage above, from other RDF sources. The following are a few examples of property triples expressing Canada's leader title, founding date, and currency:

<http://dbpedia.org/resource/Canada> <http://dbpedia.org/ontology/leaderTitle> "Prime Minister"@en .

<http://dbpedia.org/resource/Canada> <http://dbpedia.org/ontology/foundingDate> "1867-07-01"^^xsd:date .

<http://dbpedia.org/resource/Canada> <http://dbpedia.org/ontology/currency> <http://dbpedia.org/resource/Canadian_dollar> .

3.4.6 Wikipedia Pagelinks Dataset

The pagelinks dataset contains the wikilinks. The format is:

<Source_Title_URI> <http://dbpedia.org/ontology/wikiPageWikiLink> <Target_Title_URI> .

The number of wikilinks, as of the time of writing, is about 146 million triples. The links are directed, meaning, for example, that "University of Waterloo" points to "Research In Motion" but not the other way around. However, it is possible that two titles point to each other, e.g. "University of Waterloo" points to "Waterloo, Ontario" and vice versa. This dataset is a rich source for data mining and ripe for algorithms similar to PageRank.
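A minimal Python sketch of loading such wikilink triples into a directed adjacency list, the basic structure for link-analysis algorithms of the PageRank family, is shown below; the triples are hard-coded here but would normally be streamed from the pagelinks dataset.

```python
from collections import defaultdict

wikilink_triples = [
    ("University_of_Waterloo", "wikiPageWikiLink", "Research_In_Motion"),
    ("University_of_Waterloo", "wikiPageWikiLink", "Waterloo,_Ontario"),
    ("Waterloo,_Ontario", "wikiPageWikiLink", "University_of_Waterloo"),
]

out_links = defaultdict(set)   # source -> set of targets
in_links = defaultdict(set)    # target -> set of sources
for source, _predicate, target in wikilink_triples:
    out_links[source].add(target)
    in_links[target].add(source)

# The edges are directed: UW -> RIM exists, but RIM -> UW does not.
print("Research_In_Motion" in out_links["University_of_Waterloo"])   # True
print("University_of_Waterloo" in out_links["Research_In_Motion"])   # False
```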

3.4.7 Categories (SKOS) Dataset

The categories dataset is, in fact, the most important dataset used in this work. It accounts for the topic graph portion of the semantic graph presented in Section 3.5. This dataset provides the list of Wikipedia's categories (about 740,000) along with their hyponym relations. The dataset uses the Simple Knowledge Organization System (SKOS) vocabulary [87] to describe categories and their relations. A complete list of the SKOS vocabulary and its ontology is given in its W3C recommendation page [88]. However, this dataset only uses three elements of the SKOS ontology: skos:prefLabel, skos:Concept, and skos:broader (the skos: prefix denotes the namespace http://www.w3.org/2004/02/skos/core#). These three are used to describe the three parts of the definition that each category in this dataset has. For each category, one row provides the human-readable label (skos:prefLabel) of the category (similar to the titles dataset labels but using SKOS vocabulary). The second part is another row, common to all categories, which indicates that each category is a concept (skos:Concept). This row maintains RDF well-formedness and semantic correctness. The third part is a set of rows giving all the super-categories of the given category with broader semantics (skos:broader), which in turn produces the hyponymy graph. These three parts have the following formats:

1. <Category_URI> <http://www.w3.org/2004/02/skos/core#prefLabel> "Category_Label"@en .

2. <Category_URI> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .

3. <Category_URI> <http://www.w3.org/2004/02/skos/core#broader> <Super_Category_URI> .

To find the sub-categories of a certain category, one should match the category URI to the object part of the skos:broader entries. The following examples show the definition of the category "Software engineering", one of its super-categories, "Computer science", and one of its sub-categories, "Computer programming", in this dataset:

<http://dbpedia.org/resource/Category:Software_engineering> <http://www.w3.org/2004/02/skos/core#prefLabel> "Software engineering"@en .

<http://dbpedia.org/resource/Category:Software_engineering> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .

<http://dbpedia.org/resource/Category:Software_engineering> <http://www.w3.org/2004/02/skos/core#broader> <http://dbpedia.org/resource/Category:Computer_science> .

<http://dbpedia.org/resource/Category:Computer_programming> <http://www.w3.org/2004/02/skos/core#broader> <http://dbpedia.org/resource/Category:Software_engineering> .

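For illustration, the Python sketch below builds super-/sub-category maps from a couple of hard-coded skos:broader triples, mirroring the IsSubCategoryOf and IsSuperCategoryOf relations formalized later in Section 3.5.

```python
from collections import defaultdict

# (child, predicate, parent) triples, as in the skos:broader rows above.
broader_triples = [
    ("Category:Software_engineering", "skos:broader", "Category:Computer_science"),
    ("Category:Computer_programming", "skos:broader", "Category:Software_engineering"),
]

super_categories = defaultdict(set)   # child  -> its parents
sub_categories = defaultdict(set)     # parent -> its children
for child, _predicate, parent in broader_triples:
    super_categories[child].add(parent)
    sub_categories[parent].add(child)

print(super_categories["Category:Software_engineering"])  # {'Category:Computer_science'}
print(sub_categories["Category:Software_engineering"])    # {'Category:Computer_programming'}
```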

3.4.8 Articles Categories Dataset

The articles-categories dataset is the part that completes the semantic graph by connecting the articles' main titles to their categories. Section 3.5 will present an in-depth description of the relations created by this dataset and the pieces making up the semantic graph. The format of the dataset is:

<Main_Title_URI> <http://purl.org/dc/terms/subject> <Category_URI> .

Figure 3.6 shows some examples of these triples, marked as 2. Below is another example, in triple format, showing that "Albert Einstein" belongs to a category called "Cosmologists":

<http://dbpedia.org/resource/Albert_Einstein> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Cosmologists> .

3.4.9 Final Knowledgebase

Having gathered all the DBpedia datasets mentioned above, we have enough data in our arsenal to create the appropriate knowledgebase for the semantic analysis. Note that there are many other datasets offered by DBpedia, such as Short Abstracts and Geographic Coordinates, which have been left out because they are not used in the presented semantic analysis. However, these extra datasets can provide valuable aid in certain applications which are outside the scope of this work. Moreover, note that the quantity of the data, namely the number of triples in each dataset, keeps growing as DBpedia incorporates newer Wikipedia dumps. While this is a desirable feature, the additional data might not necessarily improve the performance of the semantic analysis in every case. We will observe in Chapter 4 that the knowledgebase used in some experimental results is built from earlier versions of Wikipedia. The only time constraint is that the test sets used in those experiments predate the Wikipedia version, to avoid featuring entities which had not yet been incorporated into Wikipedia.

Figure 3.7 shows a diagram of how all the above components are linked together to create the knowledgebase. The blue ovals, green boxes, and red ovals show titles/entities, literal values, and categories/topics, respectively. Types/classes are shown by an orange oval and the skos:Concept, a constant for all categories, by a light blue oval. The namespaces used to shorten the URIs are given in the box at the bottom left corner. Notice how the dashed line separates the knowledgebase into two areas: the top left corner area with titles at its core (excluding the orange oval of classes) represents the entity graph while the bottom right corner area with categories at its core represents the topic graph. These two together constitute the semantic graph explained in Section 3.5.

Figure 3.7: Final knowledgebase generated by DBpedia datasets.

Figure 3.7 shows the structure of the knowledgebase. However, to have a quantitative understanding of the knowledgebase volume, it is valuable to look at some statistics given in Table 3.2. These statistics belong to the DBpedia version 3.7 datasets.

Table 3.2: DBpedia 3.7 dataset statistics [86, 89].

Dataset                       Triples         Distinct
Titles                        8.8 million     3.6 million Main Titles
Redirects                     5.1 million
Disambiguation Links          1.0 million     ~0.1 million Disambiguation Titles
Ontology Infobox Types        9.3 million     320 Types for 1.8 million of Main Titles
Ontology Infobox Properties   17.5 million    1643 Properties + ~10 External Properties
Wikipedia Pagelinks           145.9 million
Categories (SKOS)             3.0 million     740,000 Categories
Article Categories            13.6 million

To understand how the above statistics play into the knowledgebase, there are a few points to discuss. Firstly, note that there are 8.8 million titles but only 3.6 million of them are unique entities. Of the remaining 5.2 million titles, 5.1 million are redirects and about 0.1 million are disambiguation titles. This means the knowledgebase has information about 3.6 million unique things, which we will call entities. However, there are 5.2 million other access routes to these entities. These alternate routes are especially helpful in the entity detection task described in Section 3.6. Also note that the total number of triples in the disambiguation links dataset is 1.0 million; this is because each of those ~0.1 million unique disambiguation titles points to many different main titles, hence the larger number of triples.

Secondly, there are 9.3 and 17.5 million triples connecting main titles to 320 different types and 1643 properties, respectively. However, the infoboxes are not available for all the main titles and, as the table indicates, these triples are associated with only 1.8 million of the main titles. This means each main title with an infobox has 5.2 types and 9.7 properties on average.

Lastly, the wikilink connections create a graph with about 146 million edges. 5.1 million triples in the pagelinks dataset are repeats of redirects dataset rows. Therefore, excluding redirect wikilinks, the pagelinks graph has about 141 million edges over 3.6 million main titles as nodes. On the other hand, the category graph has 3.0 million triples. As explained, each category has two rows specifying its label and its skos:Concept connection, and the rest are hyponym relations. Thus, the number of edges in the category graph is 3.0 million minus twice 740,000, which leaves about 1.5 million edges over 740K categories as nodes. Finally, the connection between the title and category graphs is created by 13.6 million links from the titles dataset to the categories dataset. Since there are 3.6 million main titles and 740,000 categories, on average each title has 3.8 categories and each category is related to 18.4 titles.
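The averages and edge counts quoted above can be verified with a few lines of Python over the Table 3.2 figures (values in millions):

```python
# Quick arithmetic check of the statistics discussed above (millions).
types_triples, properties_triples, titles_with_infobox = 9.3, 17.5, 1.8
pagelinks, redirects = 145.9, 5.1
category_triples, categories = 3.0, 0.74
article_categories, main_titles = 13.6, 3.6

print(round(types_triples / titles_with_infobox, 1))       # ~5.2 types per infobox title
print(round(properties_triples / titles_with_infobox, 1))  # ~9.7 properties per infobox title
print(round(pagelinks - redirects, 1))                     # ~140.8M non-redirect wikilink edges
print(round(category_triples - 2 * categories, 2))         # ~1.52M hyponym edges
print(round(article_categories / main_titles, 1))          # ~3.8 categories per title
print(round(article_categories / categories, 1))           # ~18.4 titles per category
```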

The above statistics conclude the description of the knowledgebase. The next section discusses the semantic graph as a subset of the knowledgebase. It then formally introduces the definitions of entity and topic and describes the graphs associated with each one. Once the graphs are explained, some of the challenges regarding the use of the semantic graph in the methodology are presented and discussed.

3.5 Semantic Graph

The knowledgebase created from DBpedia provides a collection of entries of a diverse nature connected to one another to form a graph. The edges of the resulting graph represent semantic connections between various entries, hence producing a semantic graph. Such a semantic graph, in turn, can be used in semantic processes. However, because of the diversity of the types of the graph nodes, any analysis of the graph should be performed separately on sub-parts of the semantic graph. This section introduces and specifies two main parts of the semantic graph: entity graph and topic graph.

The basic building blocks of an encyclopedia such as Wikipedia are the articles presented in it. These articles provide information about certain concepts which we call entities. Entities are atomic elements indicating objects, people, organizations, locations, events, acts, situations, expressions, abstract concepts, laws, and so on. Any specific element can be thought of as an entity. In other words, each entity is a knowledge item and the collection of entities is assumed to constitute human knowledge. The range and specificity of items covered by entities will differ from one knowledgebase to another. For instance, a medical knowledgebase might cover different diseases, medicines, and medical procedures extensively while not presenting any insight into other subjects. Wikipedia is one of the largest generic knowledgebases available and, although it might not provide as many entries about different diseases as a medical knowledgebase would, it provides enough entities on any given subject to be useful in general-purpose applications.

While the entities define the extent of the knowledgebase, they do not contribute to building the structure of the knowledgebase. A well-designed knowledgebase has a meta-layer of information defining the structure and classification of the entities using a hierarchy model. The structure layer of the Wikipedia/DBpedia knowledgebase is defined through a set of categories which we generally call topics. Topics are abstract concepts or classes which group a set of entities together. As an analogy for the relationship between topics and entities, consider the relationship between classes of objects and specific instances of those classes, e.g. Physicists is a class and Albert Einstein is an instance of that class. Therefore, Physicists can be a topic and Albert Einstein an entity which belongs to that topic. Moreover, it is possible that topics share entities and, therefore, have overlapping domains. For instance, the topics of Lawyers and Politicians have many entities in common since there are many lawyers who enter politics. Lastly, the range and specificity of topics may also vary greatly based on the domain of a knowledgebase. Wikipedia, being a large generic knowledgebase, offers an expansive set of topics, describing every domain of knowledge and ranging from the very precise, such as "Fictional secret agents and spies", to the very general, such as "Information".

An interesting notion about entities and topics is that there are certain entities and topics which are identified by the same label. For instance, "Physics" and "University of Waterloo" are both labels of an entity as well as of a topic. Later, we will see that such entities and topics have a certain relationship in the datasets, but it is valuable to see why such a phenomenon exists. The answer is that, while such entities and topics are identified similarly, their definitions differ fundamentally. An entity represents the certain specific concept for which it is labeled, while a topic defines a certain domain for a group of entities belonging to that domain. For example, the entity "University of Waterloo" identifies the academic institution while the topic "University of Waterloo" specifies the group of all the entities that belong to this topic. Clearly, the entity "University of Waterloo" belongs to this topic, but the following entities also belong to the "University of Waterloo" topic: "University of Waterloo Faculty of Engineering", "St. Paul's University College", and "Waterloo Warriors". Sometimes the distinction between such entities and their topics is difficult to discern, e.g. "Physics" as an entity represents the science itself while "Physics" as a topic represents the category of this science within the realm of all sciences. One good way to mentally distinguish such cases is to think of entities as standalone concepts while topics are containers of entities which also belong to a hierarchy of other topics. For instance, "Physics" as a topic is a sub-category of "Physical Sciences" and a super-category of "Thermodynamics".

In order to have a clear designation of entities, topics, and their relation, we need to formulate the knowledgebase using these concepts. The knowledgebase core (KB) is defined by the following:

$$KB \triangleq \{E, T, P, R\} \qquad (3.1)$$

In this equation, E represents the entities, T represents the topics, P represents the predicates, and R represents the set of all possible relations between E and T. The following shows how elements of the set R are defined:

$$R = \{R_p \mid p \in P\} \qquad (3.2)$$

$$R_p = \{(s, p, o) \mid s, o \in \{E, T\}\} \qquad (3.3)$$

In the above equations, each $R_p$ defines a relation as a set of triples of subjects (s), predicates (p), and objects (o). Subjects and objects can be either entities or topics, while predicates originate from the available relations and properties in the datasets. For instance, we will see that the wikilink relations from the pagelinks dataset provide a LinksTo relation between entities. Such a relation may be written as follows (short format):

$$R_{LinksTo} = \{(e_1, LinksTo, e_2) \mid (e_1, e_2 \in E) \wedge (e_1 \neq e_2)\} \qquad (3.4)$$


Moreover, if and only if a relation $R_p$ exists between a given subject s and object o can we assert that the logical proposition p(s, o) is true. Equation (3.5) demonstrates such an assertion in general terms. As a case in point, let us assume entity $e_1$ has a wikilink to entity $e_2$, i.e. $(e_1, LinksTo, e_2) \in R_{LinksTo}$. Therefore, we can assert that $LinksTo(e_1, e_2)$ is true. It is also very important to note that the upcoming relation definitions only provide necessary conditions; not all triples conforming to these conditions would have the defined relation. The sources of the relation triples are the various datasets.

$$(s, p, o) \in R_p \Leftrightarrow p(s, o) \equiv \text{True} \qquad (3.5)$$
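A tiny Python sketch of this formal model, treating a relation as a set of (subject, predicate, object) triples and the proposition p(s, o) as set membership per Equation (3.5), is shown below; the triples are illustrative only.

```python
# R_LinksTo represented as a plain set of triples.
R_LinksTo = {
    ("Toyota", "LinksTo", "Japan"),
    ("University_of_Waterloo", "LinksTo", "Waterloo,_Ontario"),
}

def holds(relation, predicate, subject, obj):
    """p(s, o) is True iff (s, p, o) is in R_p -- Equation (3.5)."""
    return (subject, predicate, obj) in relation

print(holds(R_LinksTo, "LinksTo", "Toyota", "Japan"))   # True
print(holds(R_LinksTo, "LinksTo", "Japan", "Toyota"))   # False: links are directed
```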

Having defined the knowledgebase and the concepts of entities and topics, we can delve into the sources of these concepts and how they are related to one another. The next three subsections will describe entity-to-entity, topic-to-topic, and entity-to-topic relationships. They show the datasets used to specify such relationships and how these relationships create the entity and topic graphs.

3.5.1 Entity Graph

The dataset used in the knowledgebase to represent entities is clearly the titles dataset. This is a trivial choice, as the titles dataset is the collection of the identifiers of the encyclopedia articles which, by the definition above, identify the entities. As shown before, the titles dataset has many different types of connections to both itself and other datasets. The portion of the connections of the titles dataset which connects entities to other entities forms the base of the Entity Graph.

The entity graph is the graph created by the collection of titles as its nodes and the connections between titles as its edges. The entity graph is a directed graph, which means the edges are not symmetrical. Moreover, Figure 3.7 shows four different types of connections from titles to other titles (connections to literals are not of interest here). Table 3.3 shows these connections, their source datasets, predicate URIs, and equivalent relations in the knowledgebase.

Table 3.3: Entity graph edge types and specifications.

Connection      Dataset                       Predicate URI                       Relation
Redirection     Redirects                     dbpedia-owl:wikiPageRedirects       RedirectsTo
Disambiguation  Disambiguation Links          dbpedia-owl:wikiPageDisambiguates   DisambiguatesTo
Wikilink        Wikipedia Pagelinks           dbpedia-owl:wikiPageWikiLink        LinksTo
Property        Ontology Infobox Properties   dbpedia-owl:Property_Label          HasProperty


The first two types of connections, redirection and disambiguation, serve to specify the equivalent titles in the entity graph. These two types are used in the entity detection algorithm to match main titles to input keywords. However, these connections do not add extra value to semantic analysis and their use is mainly limited to input processing. The formal definitions of the relations these two connections generate are as follows:

$$R_{RedirectsTo} = \{(e_1, RedirectsTo, e_2) \mid (e_1, e_2 \in E) \wedge (e_1 \neq e_2) \wedge (\neg\exists e_3 \in E : RedirectsTo(e_1, e_3) \wedge (e_2 \neq e_3)) \wedge MainTitle(e_2)\} \qquad (3.6)$$

$$R_{DisambiguatesTo} = \{(e_1, DisambiguatesTo, e_2) \mid (e_1, e_2 \in E) \wedge (e_1 \neq e_2) \wedge MainTitle(e_2)\} \qquad (3.7)$$

$$MainTitle(e) \equiv \text{True} \Leftrightarrow (\neg\exists e' \in E : RedirectsTo(e, e') \equiv \text{True}) \wedge (\neg\exists e'' \in E : DisambiguatesTo(e, e'') \equiv \text{True}) \qquad (3.8)$$

Equation (3.6) defines the RedirectsTo relation by stating that one entity ($e_1$) can redirect to another entity ($e_2$), that an entity cannot redirect to itself ($e_1 \neq e_2$), and that an entity cannot redirect to more than one entity. Equation (3.7) defines the DisambiguatesTo relation by stating that one entity ($e_1$) can disambiguate to another entity ($e_2$) and that an entity cannot disambiguate to itself ($e_1 \neq e_2$). Both of these equations also state that the object entity ($e_2$) of the triple should be a main title, i.e. it should not be a redirect or a disambiguation title, as defined by Equation (3.8). Note that the MainTitle relation has no dataset predicate of its own and is determined only by the titles dataset. Therefore, it is accepted as ground truth.

The third type of connection is the largest part of the entity graph: wikilinks. Table 3.2 indicates that there are about 146 million edges of wikilink type. Therefore, wikilinks are statistically the best source of semantics between entities. Equation (3.9) defines the wikilink relation called LinksTo. This equation is the extended version of Equation (3.4), and it states that an entity can link to a different entity (not itself) and that both entities should be main titles.

$$R_{LinksTo} = \{(e_1, LinksTo, e_2) \mid (e_1, e_2 \in E) \wedge (e_1 \neq e_2) \wedge MainTitle(e_1) \wedge MainTitle(e_2)\} \qquad (3.9)$$

Figure 3.8: A segment of the entity graph featuring wikilink connections (adapted from [90]).

Figure 3.8 illustrates a small segment of the entity graph featuring wikilink-type connections. The central node is the "Human-computer interaction" entity and the rest are links up to two levels deep.

The final type of entity graph edge is the infobox property. As discussed earlier, some of the infobox property triples have other entities as their objects. Therefore, property triples provide another type of entity-to-entity connection. The major difference between these connections and wikilinks is that the predicate part of the triple is also a variable, indicating what property of the subject entity the object entity identifies. That is why the URI in Table 3.3 will vary for each different property. Having access to different properties for each entity semantically enriches any analysis and is very useful for natural language processing tasks. However, for the scope of this research, the entity graph traversal methods only differentiate between a property connection and a wikilink connection in general and do not distinguish between different types of properties. Therefore, instead of defining one relation per property, a generic relation called HasProperty is defined to represent infobox property triples.

$$R_{HasProperty} = \{(e_1, HasProperty, e_2) \mid (e_1, e_2 \in E) \wedge (e_1 \neq e_2) \wedge MainTitle(e_1) \wedge MainTitle(e_2)\} \qquad (3.10)$$

The wikilinks and properties provide similar connections in the entity graph. Using a generic predicate such as HasProperty to define property triples causes its formal definition to closely resemble the wikilinks definition. Therefore, it seems as if properties are equal to wikilinks format-wise and only provide extra connections. This, however, is not semantically the case. There are two major differences between these two types of edges in the entity graph which will be reflected in the semantic analysis later on. The first major distinction is that almost all entities have at least one wikilink while, according to Table 3.2, only about half of the main titles have infobox information and consequently possible properties. The number of wikilinks of each entity is also usually much higher than the number of its properties (if that entity has any properties at all). The second contrast between the two types of connections is that the properties are much more finely chosen than wikilinks. The properties are based on templates which feature cherry-picked attributes of entities most closely related to the topic of the entity article, while wikilinks can be chosen much more freely and, therefore, lack the desired semantic accuracy. In other words, one can always call a property closely relevant to its entity, which is not the case for all wikilinks. To summarize, wikilinks provide a higher volume of information and are better for collective semantic analysis, whereas properties are of lower quantity but are much more precise in terms of semantic relevance.

The elements discussed above constitute the entity graph of the semantic graph. We will see in Section 3.6 how the input is broken into possible entities and, afterwards in Section 3.7, how the entity graph provides a collection of candidate entities for aggregated operations to determine the entities most similar to the input entities. Such analysis, however, is usually only one part of the semantic analysis. Another major process on the semantic graph uses the topic graph by traversing its hierarchy and determining semantic distances between topics and between entities. The next section will elaborate on the definition and the structure of the topic graph.

3.5.2 Topic Graph

The dataset used to represent topics in the knowledgebase is the categories (SKOS) dataset. As mentioned before, this dataset provides a massive set of categories which are connected by the hyponym relations between them, where a child category has an "is-a" relationship to its parents. Such relationships create a hierarchical structure which we will call the Topic Graph. Each category in the topic graph represents a node and the relationships between the category and its parents represent the edges of the topic graph. It is notable that in the topic graph each node (category) can have multiple parents as well as multiple children. Having multiple children is trivial since any topic can have multiple sub-topics about more specific concepts, e.g. Engineering can have sub-topics such as Computer Engineering, Mechanical Engineering, and Chemical Engineering. The possibility of multiple parents, however, derives from topics which either belong to more than one concept or act as nexus points for different concepts. For instance, "Golf" as a topic can be thought of as a professional sport or a recreational hobby and, therefore, can belong to both parent topics: "Professional sports" and "Outdoor recreation". On the other hand, the topic "American physicists" connects American scientists in any field with physicists of any nationality and, therefore, it can belong to both parent topics: "American scientists" and "Physicists by nationality". Figure 3.9 depicts a partial view of Wikipedia's category hierarchy with arrows pointing downward to sub-categories.

Figure 3.9: Partial view of Wikipedia's category system (adapted from [91]).


As shown in Figure 3.7 earlier, the category dataset provides a skos:broader predicate to create hyponym relations between categories. Table 3.4 demonstrates the specifications of the edges and the relationships of the topic graph while Equation (3.11) and Equation (3.12) formally define the IsSubCategoryOf and IsSuperCategoryOf relations between topics. The former relationship shows that the subject of the triple is a child of the object of the triple while the latter represents the opposite, i.e. the subject is a parent of the object.

Table 3.4: Topic graph edge and relation specifications.

Connection      Dataset            Predicate URI                             Relation
Broader Topic   Category (SKOS)    skos:broader                              IsSubCategoryOf
Narrower Topic  Category (SKOS)    skos:narrower (inverse of skos:broader)   IsSuperCategoryOf

$$R_{IsSubCategoryOf} = \{(t_1, IsSubCategoryOf, t_2) \mid (t_1, t_2 \in T) \wedge (t_1 \neq t_2) \wedge (IsSubCategoryOf(t_2, t_1) \equiv \text{False})\} \qquad (3.11)$$

$$R_{IsSuperCategoryOf} = \{(t_1, IsSuperCategoryOf, t_2) \mid IsSubCategoryOf(t_2, t_1) \equiv \text{True}\} \qquad (3.12)$$

Note that although two connections are mentioned in Table 3.4, there is only one type of connection for the topic graph edges. The two relationships mentioned above use the same dataset as their source and, practically, express the same relation in different ways. The only difference is that one is the inverse of the other and is defined for ease of use. The URI used in the dataset is only skos:broader; to find the inverse relation, the subject and the object in the triple are swapped. This is semantically equal to keeping the subject and object positions intact and changing the URI to skos:narrower instead. Equation (3.11) defines the IsSubCategoryOf relation by specifying that if topic $t_1$ is a sub-category of topic $t_2$, then the parent topic $t_2$ cannot be a sub-category of the child topic $t_1$. A topic cannot be a sub-category of itself either. Equation (3.12) defines that a topic $t_1$ is the super-category of $t_2$ if and only if $t_2$ is a sub-category of $t_1$.

Even though the above formulation provides a definite hierarchy of hyponym relationships, the graph is not strictly hierarchical, as noted earlier: there exist shortcuts in the connections (i.e. starting from one child category and going up two paths of different lengths to reach the same parent category) as well as loops (i.e. starting from one child category and going up a path that leads back to the same child category). These anomalies are a result of the incremental generation of Wikipedia's structure by human authors and the lack of systematic enforcement of strict hierarchy rules. Therefore, a pre-processing method may need to be applied to the topic graph so that a strict version can be generated and inconsistencies can be pruned. Figure 3.10 shows some examples of such anomalies. The red dashed lines show the extra routes and loops. Notice how the "Society" category indirectly becomes a sub-category of itself through a 3-level loop, which violates the last condition of Equation (3.11). Also, in this figure, the arrows point upward, showing skos:broader connections, in contrast to Figure 3.9 in which downward arrows represent skos:narrower connections.

Figure 3.10: Examples of topic graph anomalies.

Pruning and pre-processing of the topic graph requires a number of steps. A group of categories exists purely for maintenance, internal use, or organizational purposes. There are also a few categories which have no parents or are disconnected from the main topic graph. The topic graph pre-processing algorithm should handle all these cases as well as inconsistent links and edges.

To perform such tasks, a leveling procedure is applied to the topic graph, assigning each category a level number which indicates the shortest path length from the root to that category. Figure 3.11 shows the complete topic graph pre-processing algorithm. The algorithm is quite simple. It receives one or more nodes as the root of the graph, assigns them level 0, and then assigns level 1 to the immediate children of the root nodes. This means all immediate children of root nodes are one step away from the root. Then all the immediate children of level 1 nodes which have no level number assigned yet are found and given level 2. The next step searches for the immediate children of the level 2 nodes and assigns them level 3, and so on. This process is repeated until no unassigned children are found. Afterwards, the remaining unassigned nodes will be the ones with no parents, in disconnected sub-graphs, or outside the sub-graph identified by the given root nodes. These unassigned nodes are labeled as Pruned. Moreover, any edge that makes a node with a lower level number a sub-category of a node with an equal or higher level number is deemed an anomaly and, therefore, is marked as Pruned.

Input: Desired root nodes of the processed topic graph
1. Assign level 0 to the root nodes.
2. L ← 1
3. Current_Level_Nodes ← all sub-categories of level (L−1) nodes with no level number assigned.
4. Assign level L to Current_Level_Nodes.
5. L ← L + 1
6. If Current_Level_Nodes is not empty, go to step 3.
7. Mark all unassigned nodes as Pruned.
8. Mark any broadening edge connecting a node of level X to a node of level Y, where X ≤ Y, as Pruned.

Figure 3.11: Topic graph pre-processing algorithm.
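As a rough illustration of Figure 3.11, the following Python sketch implements the leveling as a breadth-first traversal over a child-adjacency map; the function names, the `children` map, and the returned structures are assumptions made for this example rather than the implementation used in this work.

```python
from collections import deque

def level_topic_graph(children, roots):
    """Assign shortest-path levels from the given roots (Figure 3.11, steps 1-7).

    children: dict mapping each category to the set of its sub-categories
              (i.e. the inverse direction of skos:broader).
    roots:    iterable of root category labels, e.g. "Main topic classifications"
              and "Fundamental categories".
    Returns (levels, pruned_nodes).
    """
    levels = {r: 0 for r in roots}              # step 1
    frontier = deque(roots)
    while frontier:                              # steps 3-6: level-by-level expansion
        node = frontier.popleft()
        for child in children.get(node, ()):
            if child not in levels:              # only nodes with no level yet
                levels[child] = levels[node] + 1
                frontier.append(child)
    all_nodes = set(children) | {c for subs in children.values() for c in subs}
    pruned_nodes = all_nodes - set(levels)       # step 7: unreachable categories
    return levels, pruned_nodes

def prune_edges(children, levels):
    """Step 8: drop broadening edges child -> parent where level(child) <= level(parent)."""
    kept = {}
    for parent, subs in children.items():
        if parent not in levels:
            continue
        kept[parent] = {c for c in subs if c in levels and levels[c] > levels[parent]}
    return kept
```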

As a result of the above algorithm, each row in Figure 3.10 corresponds to one level. For instance, "Science" will have a level of 1, and both "Physical Sciences" and "Natural Sciences" will get a level of 3. In this case, according to step 8 of the algorithm, the edge connecting "Physical Sciences" as a sub-category of "Natural Sciences" is marked Pruned because both have equal level numbers. Similarly, the edge from "Scientific Disciplines" to "Academic Disciplines" makes a level 2 node the sub-category of a level 3 node which, according to step 8, earns the link a Pruned label. Moreover, note that when the children of level 3 are found, "Physical Sciences" will not be considered for level 4 as a child of "Natural Sciences" because it has already been assigned level 3.

Another important issue is the choice of root nodes for the pre-processing algorithm. The real root of Wikipedia's category system is called "Contents". This category, however, has many maintenance and internal sub-categories which are not of interest for the semantic graph. Therefore, the root nodes given to the pre-processing algorithm as input are "Main topic classifications" and "Fundamental categories". These two are grandchildren of the "Contents" category and parents of the semantic portion of Wikipedia's category graph which we are interested in. Using these two as root nodes automatically leaves the undesirable children of the "Contents" category unassigned and, therefore, marked as Pruned.

The leveling procedure explained in the Figure 3.11 algorithm can be formally expressed by Equation (3.13), which states that each topic t has a natural number called its level (lvl). Such a topic is a sub-category of another topic with a level one lower (lvl − 1), and there exists no other topic which is a super-category of t and has a level number lower than lvl − 1, i.e. there is no shorter path to the root available. (Throughout, N denotes the natural numbers and R the real numbers.)

$$R_{HasLevel} = \left\{ (t, HasLevel, lvl) \;\middle|\; \begin{array}{l} (t \in T) \wedge (lvl \in \mathbb{N}) \;\wedge \\ \exists\, t' \in T : \big(IsSubCategoryOf(t, t') \wedge HasLevel(t', lvl - 1)\big) \;\wedge \\ \neg\exists\, t'' \in T : \big(IsSubCategoryOf(t, t'') \wedge HasLevel(t'', L) \wedge (L < lvl - 1)\big) \end{array} \right\} \qquad (3.13)$$

Table 3.5 presents some statistics after performing the topic graph pre-processing algorithm. The first two rows show that 125,780 categories were pruned during processing. The following rows show the number of categories at each level. It is interesting to note the diamond shape created by the growth/shrinkage trend: there are a few categories at the root, the count grows rapidly to 150,955 categories at level 6, and then it decreases to only 36 categories at the last level. The reason for the growth is the rapid branching of child categories at the upper levels, which broadens the width; as we go deeper, there are fewer and fewer categories until only leaf nodes remain, which is why the count shrinks back. This diamond shape is also the shape of most sub-graphs of the topic graph, since the trend repeats itself at smaller scales.

An important point about the above leveling algorithm is that it imposes certain limitations on the structure of the topic graph. These limitations may cause a loss of generality and, therefore, a reduction in performance for certain applications by decreasing the possible paths between topics and/or removing certain topics. It is possible to bypass this algorithm and work on the original topic graph. However, to avoid anomalies such as loops or topics with no parents, a traversal algorithm operating on the unpruned topic graph should keep a record of visited nodes and steer away when it encounters an already visited node. This approach is more relaxed than the leveling algorithm and allows better semantic exploration while still recovering from wrong paths; however, it requires more space and more time for traversal compared to the leveling algorithm. The extra space is needed for keeping a list of already visited nodes in each traversal, and the extra time is needed for exploring all the adjacent nodes in each step, compared to the leveling approach which only explores the nodes with lower levels (or only higher levels, depending on the application).

With the topic graph defined, we have a solid semantic ontology structure for analysis. The major advantage of the topic graph is its hierarchy compared to the network structure of the entity graph. The topic graph hierarchy enables meaningful traversal methods for semantic analysis, while entity graph analysis mostly relies on aggregate clustering methods as described in Section 3.7. However, there is one last crucial element in the semantic graph to be discussed before any semantic analysis can be performed: the Entity-Topic connection layer.

Table 3.5: Processed topic graph statistics.

Topic Graph Source | Wikipedia Dump October 2010 – DBpedia 3.6
Total Categories   | 632,607
Pruned Categories  | 506,827
Level 0            | 2
Level 1            | 25
Level 2            | 933
Level 3            | 12,152
Level 4            | 51,850
Level 5            | 122,269
Level 6            | 150,955
Level 7            | 96,065
Level 8            | 48,982
Level 9            | 12,311
Level 10           | 6,448
Level 11           | 3,118
Level 12           | 1,067
Level 13           | 540
Level 14           | 74
Level 15           | 36


3.5.3 Entity-Topic Connection Layer

To perform a semantic analysis, there should be a link between the entity graph and the topic graph. As defined in Section 3.2, the processing portion of the methodology starts by detecting the entities in the input and then traversing the semantic graph. For some cases the traversal is limited to the entity graph. For many other cases, the analysis moves from entities to topics and performs topic graph traversal instead of, or in parallel with, entity graph traversal. To achieve this transition from entities to topics, a connection layer is needed between the entity graph and the topic graph, which we call the Entity-Topic (ET) connection layer.

As mentioned before, each entity belongs to one or more categories and each category may have many entities attached. Therefore, there exists a many-to-many relation between entities and topics. The dataset providing this relation is the article-categories dataset. While this dataset only provides the dcterms:subject URI linking entities to topics, by swapping the subject and object of its triples we can define relations for both directions, similar to the topic graph relations. Table 3.6 presents the HasCategory and IsCategoryOf relations, and Equation (3.14) and Equation (3.15) provide their formal definitions respectively.

Table 3.6: Entity-Topic connection layer relations.

Connection      | Dataset            | Predicate URI               | Relation
Entity to Topic | Article Categories | dcterms:subject             | HasCategory
Topic to Entity | Article Categories | inverse of dcterms:subject  | IsCategoryOf

$$R_{HasCategory} = \left\{ (e, HasCategory, t) \;\middle|\; (e \in E) \wedge (t \in T) \wedge MainTitle(e) \right\} \qquad (3.14)$$
$$R_{IsCategoryOf} = \left\{ (t, IsCategoryOf, e) \;\middle|\; (t \in T) \wedge (e \in E) \wedge HasCategory(e, t) \equiv True \right\} \qquad (3.15)$$

Note that, by definition, only entities which are main titles can have categories. To fetch the categories of all other titles, they should first be resolved to their proper main titles. Furthermore, since IsCategoryOf is just the formal inverse of the HasCategory relation, it has no physically corresponding triples in the article-categories dataset and simply uses the inverted triples. The inverse URI in Table 3.6 is mentioned solely for the semantic correctness of the definition.
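As a rough sketch of how this connection layer can be materialized in memory, the following Python snippet builds both directions of the relation from (entity, category) pairs extracted from dcterms:subject triples; the input format and the names are illustrative assumptions, not the implementation used in this work.

```python
from collections import defaultdict

def build_et_layer(subject_pairs):
    """Build both directions of the Entity-Topic connection layer.

    subject_pairs: iterable of (entity, category) pairs taken from
                   dcterms:subject statements of main-title entities.
    Returns (has_category, is_category_of); the inverse direction is
    obtained simply by swapping subject and object, as described above.
    """
    has_category = defaultdict(set)    # HasCategory: entity -> topics
    is_category_of = defaultdict(set)  # IsCategoryOf: topic -> entities
    for entity, category in subject_pairs:
        has_category[entity].add(category)
        is_category_of[category].add(entity)
    return has_category, is_category_of
```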


It is notable that some (not all) categories have a main article page assigned to them. This translates to some topics having a main entity among their other entity connections. The main entity of a topic usually has the same label as the topic. Such cases were already discussed in Section 3.5. However, having the same label is not always the case, and there is no other distinct indication of such relations between entities and topics either. Therefore, we will treat all connections between entities and topics in the connection layer equally.

Having defined the entity graph, the topic graph, and the connection layer, we can consider the semantic graph a complete ontology structure for the purpose of our methodology. The entity graph acts as a container for knowledge items; however, it does not provide much structure. The topic graph, on the other hand, provides a hierarchical structure suitable for semantic analysis. Finally, the connection layer connects the elements of the entity graph to the elements of the topic graph, providing indirect structure for the entity graph by placing entities on the fringes of the topic graph. In other words, the topic graph acts as meta-data for the entity graph.

Figure 3.12 depicts a simplified illustration of the structure of the entire semantic graph. As mentioned before, the topic graph resembles a diamond shape after being processed and assigned levels. The topic graph is at the top, providing structure. The entity graph, on the other hand, is just a network of entity nodes interconnected through various uni- or bi-directional edges, and it is shown as a circle of knowledge items at the bottom. Each entity is then connected to one or more topics, shown by dashed lines. These connections constitute the Entity-Topic connection layer.

Lastly, it should be noted that the infobox types dataset, which provides the main type/class of some entities, is not used in the semantic graph. Types are generally good indicators of what an entity is in essence, and they might be useful in some question answering problems. However, since there is no structure or hierarchy in the types dataset, and also due to the limited number of types, meaningful semantic analysis using types cannot be applied to a variety of applications.

3.5.4 Challenges

While the semantic graph defined above is well-structured, there exist several issues with both the entity graph and the topic graph. Here, some of these issues are discussed and some clues on how to tackle them are provided. While it is not necessary to address these issues in every application, proper algorithms are provided later on to address them in the appropriate problem context.


Figure 3.12: Semantic graph structure.

The entity graph, as mentioned, provides two main types of semantic connections between entities: properties and wikilinks. The advantages of these two types have already been discussed: wikilinks have a higher quantity and hence better statistical characteristics, while properties have been cherry-picked as important features of an entity and are semantically more relevant. If an approach needs to consider a wikilink and a property and rank them, properties usually get a higher weight in scoring, as we will observe in Section 3.7. However, the entity graph does not provide any distinction between wikilinks or between properties within each type, i.e. the edges are not weighted.

In other words, given an entity, we would not know, based on that entity alone, which of its wikilinks are more important than the others. Similarly, we cannot easily determine which of the properties of a single entity are more important. This means we might need to devise some sort of weighting method to rank the wikilinks and properties of an entity. The ranking of these connections, nonetheless, depends on the application. An importance measure should be defined for each application, and the links should then be ranked based on that measure if needed. Furthermore, the weights of the entity graph edges are also dependent on the context. For instance, when the context is the personal life of a politician, the links to his/her spouse and children are more important. But if the context is the political life of the politician, the links to the offices held by him/her have more weight.

The same argument applies to the topic graph. There are no weights on the topic graph edges and, therefore, we do not know which of the parents or children of a given category are more important than the others. For instance, the topic "Toyota" belongs to many parent topics such as "Car manufacturers of Japan", "Companies of Japan", and "1937 establishments in Japan". The edges in the graph have the same weight for all three parents, but one might suggest that the order above is a better indicator of the importance of each parent. Again, such an ordering cannot be found intuitively for each topic and its connections. Moreover, similar to the entity graph, the weight of the edges often depends on the application and context and should be extracted dynamically.

Finally, the ET connection layer edges are also weightless and, therefore, there are no specific distinguishing factors among the topics of a given entity. In this case, however, the types dataset might be useful to indicate the main topic of an entity, but that would only indicate the topmost category for each entity, and the rest of the topics would still remain on equal standing.

To address these issues in practice, we would need a set of entities and topics to determine the context and then perform the edge ranking using aggregation methods. These aggregations can be over entities, topics, or both. For instance, if a list of entities is given, we can find the most common topics among these entities to distinguish between all the topics of all the given entities.

In another possible approach, we can use techniques similar to TF-IDF. For instance, suppose we need to find the more relevant parent topics of a certain given topic. If a parent topic is the parent of many other topics as well, this indicates that the parent is not closely related to the given topic, whereas if a parent has only the given topic as its child, it is probably very relevant to that child.
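A minimal Python sketch of these two aggregation ideas is given below; the map names (`has_category`, `parents_of`, `children_of`) and the inverse-fan-out scoring formula are illustrative assumptions rather than the weighting actually adopted later in this work.

```python
from collections import Counter
import math

def common_topics(entities, has_category):
    """Rank topics by how many of the given entities they cover."""
    counts = Counter(t for e in entities for t in has_category.get(e, ()))
    return counts.most_common()

def parent_relevance(topic, parents_of, children_of):
    """Score each parent of `topic` by an IDF-like inverse fan-out:
    parents with few children are assumed to be more specific to `topic`."""
    scores = {}
    for parent in parents_of.get(topic, ()):
        fan_out = max(len(children_of.get(parent, ())), 1)
        scores[parent] = 1.0 / math.log(fan_out + 1.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```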


Some of the above issues, and the approaches to tackle them, will be discussed later on in more detail. However, before one can perform any semantic analysis over the semantic graph, the entities of the input to the system should be known as they are the entry points to the semantic graph. The next section will discuss how to detect the entities from the input.

3.6 Entity Detection

Entities are the elements of the knowledgebase defining our world model. This means every textual statement regarded as an input to this methodology is assumed to consist of a number of entities connected to each other by natural language connectors. Therefore, for any further processing, the input should be broken into its building blocks, i.e. entities. This process is called entity detection. Once we have a list of entities, we are able to find the connections between these entities and other entities or topics, perform semantic analysis, and finally map them to the output for each specific problem. In this section, the base algorithm template for entity detection is described and its parameters are discussed. Later on, the exact algorithms for specific applications are laid out.

As indicated in Section 3.2, the input is some kind of textual object, and to start the processing, the entities should be extracted from the text. The process of mapping text to entities is a common requirement in most related works. For instance, the works in [92-94] explain methods for different services provided by the Wikipedia Miner project [95]. These services include the automatic annotation of a text document with Wikipedia links, which is called wikifying the document. Another service finds the extent to which two text items are related to each other; yet another finds the main topics of a document using Wikipedia article titles. The main challenge of all of these methods is to map parts of the input text to Wikipedia article titles, which we call entities. Another series of works are [96-98], which are the basis of a project called Wikitology. These works try to address a range of different use cases comprehensively explained in [99]. Some of these use cases are document concept prediction, linking to knowledgebase entities, interpreting tables, and document description. Again, all of these works rely on matching Wikipedia articles to the input documents.

The methodology presented here treats the input text as a collection of entities glued together by lingual connectors such as verbs, prepositions, articles, and so on. The approach is to break the text and extract these entities. However, finding the exact match is not always straightforward. Some entities can be expressed in different forms, e.g. "American history", "History of United States", "U.S. History". While the redirects might cover some of these variations, not all entities can be matched perfectly, and there is always a possibility of extra or missing words in the input. Trying to compensate for extra or missing words by being flexible in matching, on the other hand, causes false positive matches. Therefore, the strategy of entity detection here is to match the text to multiple entities in parallel and score them based on their closeness. Moving forward, all the processes will be performed on a set or subset of the candidate entities using aggregation operators. For example, if the word phoenix appears in the input, it can be matched to many entities, as shown in Section 3.3. The entity detection algorithm considers all the variations. As the process goes ahead and extracts more entities from the input, the analysis tries to recognize the context of the input; it might then prefer, for example, the mythical bird sense of the word phoenix to the city of Phoenix sense.

Input: A textual object

(I) Processing Division
• Apply text breaking
• Remove stop words

(II) Search Division
• Search entities for matching keywords
• Store search result information: location and statistics

(III) Scoring Division
• Apply word featuring weights
• Apply penalty weights
• Apply ordering weights

(IV) Equivalency Division
• Insert and score redirects
• Insert and score disambiguated entities

(V) Production Division
• Determine the number of final candidate entities

Figure 3.13: Entity detection algorithm.

Figure 3.13 shows the generic template for the entity detection algorithm. This template has a number of divisions. Some of these divisions might be customized for different problems. This means, firstly, that a division might be completely ignored, completely applied, or partially applied. Secondly, the method used for each division might differ based on the problem being solved. This section explains each division and iterates through the possibilities for customizing the divisions and the reasons why each customization can be applied.

3.6.1 Processing Division

The processing division has the responsibility of preparing the input text for search. The first preparation task is to manage the size of the text, which might vary in different applications from a few words to a large document. In order to efficiently extract entities, a large textual object must be broken into smaller components, especially if the words of an entity do not have the same order in the input text. Moreover, finding entities in a large document processed as a whole will not benefit from the context of the text, but if the input is streamed in instead, then after finding a few entities the context can be recognized and used in spotting the correct entities later on. Also, if the words from the input constituting an entity are not in the correct location or there is a disjoint, the entity detection must consider a range of words to find the correct entity. If this range is too large, the possibility of detecting false positives increases. As a case in point, consider the following text: "Phoenix is the capital and the largest city of the U.S. state of Arizona, a state which has a long border with Mexico". To find the correct entity for "Phoenix, Arizona", the entity detection algorithm should account for out-of-order and disjoint possibilities. But if the input is not broken into smaller components, some false positives such as "Mexico City" and "Capital of Mexico" can also be detected.

Text breaking methods used in this methodology depend on the application. Here, three possible text breaking methods are introduced. Later, in Section 3.8, we will discuss appropriate breaking methods for different applications. The three main text breaking methods are:

1. Using text internal separators

2. Fixed word window with overlap: Window(n, o)

3. Variable n-gram window

The first method can be applied to text inputs that provide certain separators. For instance, a document written in ordinary language uses sentences separated by punctuation marks such as the period (.) and the semicolon (;). This method is very simple and works for ordinary text inputs with comparatively short sentences.


The second method is also simple but provides more flexibility, which makes it a better choice for more applications. In this method a fixed-length moving window of n words is used to break the text. The window is applied to the beginning of the text input, which produces the first component. Then, the window is moved ahead just enough so that it covers the last o words of the first component plus n − o new words. To clarify, suppose that n = 10 and o = 4. The first window will cover words 1 to 10. The second window starts from word 7 and covers up to word 16. This means words 7, 8, 9, and 10 are repeated as the overlap of the two components, and words 11 to 16 are the 10 − 4 = 6 new words of the second component. The reason for having an overlap between windows is to keep the coherence and flow of the text. The choice of n and o depends on the nature of the text and can be set empirically. The number of overlap words can be set to zero if needed. Furthermore, it is possible to use variable-sized windows and overlaps, but a justified strategy must be designed for this type of selection. This method is appropriate for most text breaking applications, especially if unordered matching is required.
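The sliding window is straightforward to express in code; the following Python sketch (function name and return format are assumptions for illustration) reproduces the Window(10, 4) example above.

```python
def window_components(words, n=10, o=4):
    """Break a word list into overlapping components of n words with o words of overlap."""
    if not 0 <= o < n:
        raise ValueError("overlap must satisfy 0 <= o < n")
    step = n - o                      # each new window adds n - o fresh words
    components = []
    for start in range(0, len(words), step):
        components.append(words[start:start + n])
        if start + n >= len(words):   # last window reached the end of the text
            break
    return components

# Window(10, 4) over a 16-word input yields words 1-10 and words 7-16.
words = [f"w{i}" for i in range(1, 17)]
print([(c[0], c[-1]) for c in window_components(words, n=10, o=4)])
# [('w1', 'w10'), ('w7', 'w16')]
```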

The third method is specifically for applications that require strict ordered matching. This method starts from the first word and searches for matching candidate entities. Then, it searches for entities matching the bi-gram combination of the first and second words. If any bi-gram matches are available, it searches for a tri-gram made up of the first, the second, and the third words, and repeats this until it finds all the possible n-grams with matching entities. Once done, it moves one word ahead and, starting from the second word, repeats the incremental n-gram search. Finally, the method finds all the matching candidate entities that contain an n-gram from the input. Since this is a brute-force algorithm and might take a long time, the length of the n-grams can be limited to a maximum value. As this method searches for combinations of words rather than individual words, the order of the input will be maintained in the resulting entities. A more detailed example of this algorithm is presented in Section 4.3 for speech recognition applications.
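A simplified Python sketch of this incremental n-gram matching follows; modeling the entity index as a plain list of labels and the contiguous word-sequence containment test are simplifying assumptions made to keep the example short.

```python
def contains_sequence(label_words, ngram):
    """True if ngram occurs as a contiguous word sequence inside the entity label."""
    n = len(ngram)
    return any(label_words[i:i + n] == ngram for i in range(len(label_words) - n + 1))

def ngram_candidates(words, entity_labels, max_n=4):
    """Variable n-gram window: grow the n-gram from each start position while
    at least one entity label still contains it, collecting all matches."""
    labels = [label.lower().split() for label in entity_labels]
    matches = set()
    for start in range(len(words)):
        for n in range(1, max_n + 1):
            ngram = [w.lower() for w in words[start:start + n]]
            if len(ngram) < n:
                break                  # reached the end of the input
            hits = [" ".join(lw) for lw in labels if contains_sequence(lw, ngram)]
            if not hits:
                break                  # no entity features this n-gram; stop growing
            matches.update(hits)
    return matches
```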

The second task of the processing division is to clean the input. This is often done by removing stop words, i.e. articles and common pronouns and verbs such as a, the, he, I, go, come, etc. Performing this task again depends on the type of application. For instance, for WQC applications this step is performed, since web queries are usually not complete sentences and, therefore, stop words do not count as keywords. For the speech recognition application, on the other hand, we preserve the stop words as we are looking for exact matches.


3.6.2 Search Division

The search division is common to all variations of the entity detection algorithm, i.e. no entity detection algorithm bypasses the search division. It simply tries to find the words from the input in the list of entities. In fact, the search division returns a list of all entities that feature at least one of the words of the current input component. This means the search might be performed on broken components separately and, therefore, on multiple occasions. The type of search might also differ based on ordering. If order does not matter, all entities that feature one or more words are found. Otherwise, only entities that feature a sequence of input words are returned. As mentioned above, the latter search approach is the case in the n-gram text breaking method: each n-gram is a broken text component, and it is searched so that an exact match is found in the entities.

Besides searching and fetching entities, this division might also record certain information about the search. This information can concern the statistics of the matching, such as the number of input words found in an entity, the total number of words in the entity, and the length of the longest ordered sequence of input words found in an entity. It might also concern the locations of the found words in an entity. Such information can be used in subsequent divisions. Similar to the earlier customization choices, the inclusion or exclusion of stored information depends on the type of application and the scoring methods applied.

3.6.3 Scoring Division

Scoring is the crucial division of the entity detection algorithm in terms of accuracy and performance. As stated earlier, for each input component many entities are detected as candidate entities. Each of these candidate entities will have a score indicating how strongly the algorithm believes the entity has been correctly detected. The scores will then be used to rank the entities accordingly. Consequently, the scores are the parameters determining how accurately the entity detection was performed.

The score equation used in this methodology comprises three separate weights: the word featuring weight, the penalty (or proportional importance) weight, and the ordering weight. The word featuring weight indicates how many of the input component words are in the entity; the more words are in an entity, the better match it is. The penalty weight accounts for the extra portion of the entity not covered by the input. This penalty can take different forms; we introduce three possibilities in the formalization sub-section. The rationale for the penalty weight is, clearly, that a lower penalty is desirable. All of the forms of the penalty weight are defined as the proportion of the entity that is covered by the input. Any part of the entity not covered by any words of the input results in a mismatch and, subsequently, a decrease in the final score. Finally, the ordering weight is an optional added incentive for entities which feature words in the same order as the input. This weight is only applied when the search division imposes no order on entity matching; the algorithm prefers to reward entities with matching order over the ones without matching order. The details of the scoring mechanism are discussed shortly below.

3.6.4 Equivalency Division

After finding and scoring candidate entities, we need to make sure that they are the main entry points to the entity graph. This means that if the candidate entities are among redirects and disambiguation titles, we need to find the main titles and insert them into the list of candidate entities. The task is very easy for redirects: if any of the entities has a redirect relation, the entity it redirects to is inserted into the list. The newly inserted entity inherits the score of the original entity as it represents the same equivalent entity.

For disambiguation entities, there are two possible methods. The first method, similar to redirects, inserts all the disambiguated entities into the list and assigns them the same score as their original entity. The second method inserts all the disambiguated entities into the list but recalculates their scores using the same scoring strategy as the scoring division. The rationale of the second method lies in the way disambiguation is handled. As mentioned in Section 3.3, disambiguated entities usually have the same label as the original entity with their sense annotated, e.g. "Phoenix" becomes "Phoenix (mythology)", "Phoenix, Arizona", and so on. Therefore, a recalculation of the score can be a good choice considering the possibility of the extra annotation being present in the input component. For instance, if the input text is "Phoenix is the largest city of Arizona" and the search considers out-of-order entity matching, "Phoenix, Arizona" is a perfect match compared to "Phoenix (mythology)", which has an extra word not matching any of the input words. Moreover, this second method of handling disambiguation equivalents can be combined into the search division. Since the disambiguated entities are main titles and already in the entity graph, if the original disambiguation entities are ignored in the search, we can be confident that the annotated entities will be automatically searched and scored without the need to find the equivalency.


3.6.5 Production Division

Once the candidate entity list is ranked and finalized, the production division decides what portion of the candidate list can be used further in the semantic analysis. This is done by applying a threshold over the scores and cutting off the entities with scores below the specified threshold. The threshold can be based on the rank as well as the score, and it might be fixed or variable. The following shows all four possibilities:

• Fixed score threshold: All the entities with a score equal to or above the specified threshold will be selected and the rest will be discarded.

• Fixed rank threshold: All the entities at or above the specified rank will be selected and the rest will be discarded, e.g. the top 10 candidate entities will be selected.

• Variable score threshold: All the entities with a score equal to or above a certain percentage of the highest entity score will be selected and the rest will be discarded. For instance, let us assume that the topmost candidate entity has a score of 4.0. A variable score threshold of 60% would select all the entities with a score equal to or above 4.0 × 60% = 2.4.

• Variable rank threshold: All the entities with a rank within a certain top percentage of the total number of candidate entities will be selected and the rest will be discarded. For instance, let us assume there are 50 candidate entities in total. A variable rank threshold of 20% will select the top 50 × 20% = 10 ranks.

Note that ranking can be done in two ways: regular ranking with ties (e.g. if there are 4 ties at rank 3, the next entity starts at rank 7, skipping ranks 4, 5, and 6) and dense ranking with no tie gaps (e.g. 4 ties at rank 3 all share the same rank and the next entity gets the next immediate rank of 4).
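The cut-off strategies are easy to express in code; this Python sketch (function names and the dense-ranking choice are illustrative assumptions) shows a variable score threshold and a fixed rank threshold applied to a scored candidate list.

```python
def variable_score_cutoff(candidates, fraction=0.6):
    """Keep candidates scoring at least `fraction` of the top score (e.g. 4.0 * 0.6 = 2.4)."""
    if not candidates:
        return []
    top = max(score for _, score in candidates)
    return [(e, s) for e, s in candidates if s >= fraction * top]

def fixed_rank_cutoff(candidates, top_k=10):
    """Keep the top_k candidates using dense ranking (ties share a rank, no gaps)."""
    ranked = sorted(candidates, key=lambda es: es[1], reverse=True)
    kept, rank, last_score = [], 0, None
    for entity, score in ranked:
        if score != last_score:
            rank, last_score = rank + 1, score
        if rank > top_k:
            break
        kept.append((entity, score))
    return kept
```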

3.6.6 Formalization

Formalizing the entity detection algorithm is not a straightforward task, as the different nature of the applications dictates various setups, structures, and formats. For instance, it will be shown in the speech recognition problem that the input text is two or three parallel sentences, unlike other problems with sequential word order. Yet, an attempt has been made to provide a generic view of the algorithm in formal language to better understand each division. Once the basic method is established, the transition to specific formats is easy to understand. The following demonstrates the different divisions and their steps in formal format and then explains what each equation means.

The first step is to define the input text. Figure 3.13 defines the input as any textual object. To be a bit more specific, the input is defined as a set of words. These words might be in a specific order or might just be a bag of words. In some cases, such as text prediction, some of the words might not be complete, i.e. only the first few letters of a word may be present. In general, we call the smallest unit of the input text a word. Equation (3.16) defines the input.

$$Input \triangleq \{W, L, S\} \qquad (3.16)$$

In this equation, W is the set of words, L is the set of locations for each word in W, and S indicates the separators, if available. The exact definition of L and S depends on the structure of the input. The following equations define a number of different possibilities:

Bag of words (no order):
$$L = \{\,\}, \qquad S = \{\,\} \qquad (3.17)$$

A sequence of words:
$$L = \{(w_i, i) \mid (w_i \in W) \wedge (i \in \mathbb{N})\}, \qquad S = \{\,\} \qquad (3.18)$$

A sequence of sentences:
$$L = \{(w_i, i) \mid (w_i \in W) \wedge (i \in \mathbb{N})\}, \qquad S = \{(w_i, s_i) \mid (w_i \in W) \wedge (s_i \in \{\text{"."}, \text{";"}, \text{"!"}, \text{"?"}\})\} \qquad (3.19)$$

Parallel sentences (same size):
$$L = \{(w_{i,j}, (i,j)) \mid (w_{i,j} \in W) \wedge (i, j \in \mathbb{N}) \wedge (1 \le i \le I) \wedge (1 \le j \le J)\},$$
$$S = \{(w_{I,j}, s_j) \mid (w_{I,j} \in W) \wedge (j \in \mathbb{N}) \wedge (1 \le j \le J) \wedge (s_j \in \{\text{"."}, \text{";"}, \text{"!"}, \text{"?"}\})\} \qquad (3.20)$$

Equation (3.17) defines no locations and no separators, which means there is no word order in the input. A web query fits this definition, since web queries are a set of words which are not necessarily in the correct order. Equation (3.18) indicates a sequence of words, e.g. one sentence with words indexed as w1, w2, w3, .... Equation (3.19) is a sequence of sentences, e.g. a text document: the words have a sequential order, but there are certain separators breaking sentences at certain positions. Equation (3.20) shows the format used in the speech recognition problem, where there are two or three sets of sentences with an equal number of words in each sentence. In this equation, I indicates the total number of words in each sentence and J indicates the number of sentences. Each sentence ends with a separator in this format, which means no other separators appear in the middle of the sentences (note the capital I in wI,j).

The first part of the entity detection algorithm, which uses the above input definition, is the processing division. As stated, the processing division breaks the input into smaller components and removes stop words based on the problem definition. Since both tasks are optional, they may or may not be performed. Equation (3.21) shows the result of the processing division.

$$Comp_n = \{K_n, L_{K_n}, S_{K_n}\}, \quad \begin{array}{l} (K_n \subset W) \wedge (L_{K_n} \subset L) \wedge (S_{K_n} \subset S) \\ \wedge\; (\forall (k,i) \in L_{K_n} : k \in K_n) \\ \wedge\; (\forall (k,s) \in S_{K_n} : k \in K_n) \end{array} \qquad (3.21)$$

In the above equation, Compn represents one sample component, which is a subset of the input created by the text breaking algorithm. In this format, Kn is a subset of the input words W. The reason for changing the variable from W to K is the stop word cleaning process: regardless of whether the cleaning task is performed, we consider the resulting words as keywords, hence the letter K. LKn and SKn in turn represent the subsets of locations and separators for those keywords included in Compn (if any were found). The following are two examples of the processing division tasks. The first example shows an input with 5 words, no order, and no separators. No text breaking was performed, but the cleaning task removes word 3:

$$\begin{array}{l} W = \{w_1, w_2, w_3, w_4, w_5\}, \quad L = \{\,\}, \quad S = \{\,\} \\ \rightarrow Comp_1 = \{K_1, L_{K_1}, S_{K_1}\} \\ \rightarrow K_1 = \{k_1, k_2, k_4, k_5\} = \{k_i \mid (k_i = w_i) \wedge (w_i \in W) \wedge (i \in \{1,2,4,5\})\} \\ \rightarrow L_{K_1} = \{\,\}, \quad S_{K_1} = \{\,\} \end{array} \qquad (3.22)$$

The second example shows an input of 10 words with sequential order and no separators. A 7-word window, 3-word overlap text breaking algorithm (Window(7,3)) was performed but no cleaning was done:


$$\begin{array}{l} W = \{w_i \mid 1 \le i \le 10\}, \quad L = \{(w_i, i) \mid (w_i \in W) \wedge (1 \le i \le 10)\}, \quad S = \{\,\} \\ \rightarrow Comp_1 = \{K_1, L_{K_1}, S_{K_1}\}, \quad Comp_2 = \{K_2, L_{K_2}, S_{K_2}\} \\ \rightarrow K_1 = \{k_i \mid (k_i = w_i) \wedge (w_i \in W) \wedge (1 \le i \le 7)\} \\ \rightarrow K_2 = \{k_i \mid (k_i = w_i) \wedge (w_i \in W) \wedge (4 \le i \le 10)\} \\ \rightarrow L_{K_1} = \{(k_i, i) \mid (k_i = w_i) \wedge (w_i \in W) \wedge (1 \le i \le 7)\}, \quad S_{K_1} = \{\,\} \\ \rightarrow L_{K_2} = \{(k_i, i) \mid (k_i = w_i) \wedge (w_i \in W) \wedge (4 \le i \le 10)\}, \quad S_{K_2} = \{\,\} \end{array} \qquad (3.23)$$

Having specified the input components, the search division tries to find all the entities in the semantic graph that feature at least one of the keywords of each component. The set of candidate entities for Compn is called En and it consists of candidate entities (ec) and some information about them (INFOc) used for scoring. Each candidate entity (ec) in this set contains at least one keyword (ki) from the keywords of the corresponding component (Kn). Equation (3.24) states the formal format of the candidate entities of each component.

$$E_n = \left\{ (e_c, INFO_c) \;\middle|\; (e_c \in E) \wedge (\exists\, k_i \in K_n : Contains(e_c, k_i)) \right\} \qquad (3.24)$$

The next step is scoring each candidate entity. As stated, the scoring division includes three weights for each entity score. The following describes each of these three weights in formal format. However, it should be noted that the scoring division applies the scores solely based on the nature of each entity and the keywords it contains. In some use cases, after performing semantic analysis on one component, an additional scoring mechanism evaluates new scores based on the semantic analysis and applies them on top of the scores presented here. The score of each candidate entity in En, called S^n(ec), is defined using three parameters: the word featuring weight (Nk), the penalty weight (Pk), and the optional ordering weight (Ok). The following equations describe these parameters in formal format:

$$S^n(e_c) = N_k \times P_k \; [\times\, O_k] \qquad (3.25)$$

$$N_k = \mathop{Count}_{Contains(e_c,\, k_i)}(k_i) \qquad (3.26)$$


$$P_k = \frac{N_k}{N_e} \;\;\vee\;\; P_k = \frac{C_k}{C_e} \;\;\vee\;\; P_k = \frac{R_k}{R_e} = \frac{\sum F_k}{\sum F_e} \qquad (3.27)$$
$$Contains(e_c, \text{"}k_i\, k_{i+1} \ldots k_j\text{"}) \;\Rightarrow\; O_k = (j - i + 1) \qquad (3.28)$$

All the parameters mentioned in the above equations, plus the locations of the input component keywords in the entities, are calculated by the search division and packaged inside the INFOc parameter of En.

$$INFO_c \triangleq \{N_k, N_e, C_k, C_e, F_k, F_e, O_k, Locations\} \qquad (3.29)$$

The first parameter (Nk) indicates the number of keywords featured in the candidate entity (ec). The higher the number of featured words, the higher the score. The second parameter (Pk) accounts for the extra words of the entity not featured by any of the input keywords. Equation (3.27) provides three possible definitions. The first is simply the proportion of keywords in the entity (Nk / Ne, where Ne is the total number of words in candidate entity ec). The second is the proportion of characters in the entity that belong to keywords (Ck / Ce, where Ck is the total number of characters of the keywords featured in the entity and Ce is the total number of characters in candidate entity ec). This metric assumes that longer keywords are more important; in the context of web queries, which are only a few words long [54, 100], it may well be that the user placed more emphasis on the longest, most evident word in the query. The final measure of proportional importance is based on the words' inverted frequencies. It is computed as the sum of the inverted frequencies of the keywords in the entity over the sum of the inverted frequencies of all entity words (ΣFk / ΣFe), where the inverted frequency of a word w is computed as:

$$F_w = \ln\!\left(\frac{Num_E}{Num_w}\right) \qquad (3.30)$$

In Equation (3.30), NumE is the total number of entities in the semantic graph and Numw is the number of entities featuring word w. It is, in essence, the IDF part of the classic TF-IDF equation, (Nw / Ne) ln(NumE / Numw), where Nw is the number of instances of word w in a specific entity (or, more generally, a document) and Ne is the total number of words in that entity. The TF part (Nw / Ne) is ignored because it does not give a reliable result when dealing with short entities that only feature each word once or twice. This metric has been used successfully in the past in a classifier working on the WQC problem using Wikipedia [62].
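A tiny Python sketch of Equation (3.30) follows, under the assumption that each entity is represented by the set of (lower-cased) words in its label; the handling of unseen words is my own convention for the example.

```python
import math

def inverted_frequency(word, entity_word_sets):
    """F_w = ln(Num_E / Num_w): Num_E entities in total, Num_w entities featuring `word`."""
    num_e = len(entity_word_sets)
    num_w = sum(1 for words in entity_word_sets if word.lower() in words)
    if num_w == 0:
        return float("inf")   # unseen word: treated here as maximally discriminative
    return math.log(num_e / num_w)
```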

Lastly, the third parameter (Ok) is used when the entity detection wants to accept out-of-order keyword matches in entities but also prefers the entities that match the input order. In this case, the ordering weight is the length of the longest sequence of keywords appearing in the entity label in the correct order. Note that if the search mechanism already considers order in the search, Ok will be equal to Nk and it will not have any effect on the scoring. Therefore, Ok is only used when out-of-order searching is performed.

To clarify the scoring mechanism, consider the input "Geography of United States". After applying stop word removal (similar to Equation (3.22)), the keywords of the only input component are K1 = "Geography United States". Now consider the following entities:

e1 = "United States" (3.31) e2 = "United States Geography"

e3 = "United States History"

Entity e1 features Nk = 2 words of K1 and has Ne = 2 words in total. Entity e2 features Nk = 3 words of K1 and has Ne = 3 words in total. Finally, entity e3 features Nk = 2 words of K1 and has Ne = 3 words in total. Considering the first definition of the penalty weight and using no ordering weight, the scores of the candidate entities would be:

$$\begin{array}{l} S^1(e_1) = N_k \times \dfrac{N_k}{N_e} = 2 \times \dfrac{2}{2} = 2.0 \\[2mm] S^1(e_2) = N_k \times \dfrac{N_k}{N_e} = 3 \times \dfrac{3}{3} = 3.0 \\[2mm] S^1(e_3) = N_k \times \dfrac{N_k}{N_e} = 2 \times \dfrac{2}{3} = 1.3 \end{array} \qquad (3.32)$$

It is evident that e2 would be the top candidate, with e1 and e3 following. It is interesting to notice how the penalty weight penalizes e3 for having an extra word, dropping it to the last position. If we had used the ordering weight above, all of the scores would have been multiplied by 2, as "United States" is the longest sequence of words in K1 that matches the three entities. However, let us compare the score of e2 for input component K1 and input component K2 = "United States Geography" using the ordering weight.


$$\begin{array}{l} S^1(e_2) = N_k \times \dfrac{N_k}{N_e} \times O_k = 3 \times \dfrac{3}{3} \times 2 = 6.0 \\[2mm] S^2(e_2) = N_k \times \dfrac{N_k}{N_e} \times O_k = 3 \times \dfrac{3}{3} \times 3 = 9.0 \end{array} \qquad (3.33)$$

Since all three words in K2 match the order of candidate entity e2, Ok is equal to 3; hence the score S^2(e2) is equal to 9, compared to S^1(e2), which is 6.
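The scoring of Equations (3.25)-(3.28) can be sketched in a few lines of Python; the tokenization, the function signature, and the ordering-weight search below are simplifying assumptions, and only the first (word-proportion) form of the penalty weight is implemented. The printed values reproduce the worked example above.

```python
def longest_ordered_overlap(keywords, entity_words):
    """Length of the longest run of keywords appearing contiguously, in order, in the entity label."""
    best = 0
    for i in range(len(keywords)):
        for j in range(i + 1, len(keywords) + 1):
            seq = keywords[i:j]
            if any(entity_words[p:p + len(seq)] == seq
                   for p in range(len(entity_words) - len(seq) + 1)):
                best = max(best, len(seq))
    return best

def score(keywords, entity_label, use_ordering=True):
    """S(e) = Nk * Pk [* Ok] with Pk = Nk / Ne (first definition in Equation (3.27))."""
    kw = [w.lower() for w in keywords]
    ew = entity_label.lower().split()
    n_k = sum(1 for w in set(kw) if w in ew)   # word featuring weight Nk
    p_k = n_k / len(ew)                        # penalty weight Pk = Nk / Ne
    o_k = longest_ordered_overlap(kw, ew) if use_ordering else 1
    return n_k * p_k * o_k

k1 = ["Geography", "United", "States"]
print(score(k1, "United States Geography", use_ordering=False))             # 3.0
print(score(k1, "United States History", use_ordering=False))               # ~1.33
print(score(k1, "United States Geography"))                                 # 6.0 (Ok = 2)
print(score(["United", "States", "Geography"], "United States Geography"))  # 9.0 (Ok = 3)
```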

After performing the scoring, the equivalency division simply adds redirects and disambiguation links. As explained, redirects copy the score, while disambiguation links might have their scores recalculated based on their annotated sense disambiguation tags. f(INFOc) represents the recalculation of the score in the following equations, which show the formal format of the equivalency division:

$$E_n \leftarrow E_n \cup \{(e_c', INFO_c) \mid ((e_c, INFO_c) \in E_n) \wedge RedirectsTo(e_c, e_c')\} \qquad (3.34)$$

$$E_n \leftarrow E_n \cup \{(e_c', f(INFO_c)) \mid ((e_c, INFO_c) \in E_n) \wedge DisambiguatesTo(e_c, e_c')\} \qquad (3.35)$$

Finally, the production division cuts off, for each input component, the candidate entities with a score less than a certain threshold. As explained, the cut can be performed on the score or the rank of the entities, and the threshold can be constant or a percentage of the top-ranked candidate entities. The following equations show the two possible threshold methods:

$$E_n \leftarrow E_n - \{(e_c, INFO_c) \mid S^n(e_c) < Threshold\} \qquad (3.36)$$

$$E_n \leftarrow E_n - \{(e_c, INFO_c) \mid Rank(S^n(e_c)) > Threshold\} \qquad (3.37)$$

The above formalization concludes the description of the entity detection algorithm. The next section describes the next step in the processing, which is the semantic analysis of the candidate entities produced by the entity detection algorithm explained in this section.

3.7 Semantic Analysis

The key element and final part of the proposed methodology is the semantic analysis. Once the input has been translated into a list of candidate entities by the entity detection algorithm, the entry points to the semantic graph are specified. This means an analysis algorithm can pick up the entry points and traverse the semantic graph to find relevant entities and topics based on the use case exposed to the methodology. For instance, in the case of query topic identification, once the input query is mapped to candidate entities, the semantic analysis algorithm uses the ET connection layer to find the corresponding topics of the candidate entities. It then traverses the topic graph so that it can find the closest, most relevant target topic and, by doing so, identify the topic of the query. As another example, the semantic analysis algorithm can use the candidate entities to recognize the context of the input and use the context for word sense disambiguation; that is, based on the candidate entities, it can decide which entity best matches the incoming input. Section 3.8 will introduce a list of possible use cases where semantic analysis can provide a solution, along with possible solutions using semantic analysis.

To perform any of the tasks mentioned in the above examples, the semantic analysis algorithm must traverse the semantic graph, often starting from multiple entry points, passing through multiple paths, and converging on a possible target or targets. Such traversal can vary based on many different parameters. One parameter is whether the traversal should be performed on the entity graph, the topic graph, or both. Another parameter is whether the direction of the traversal is important or not. The importance of the distance between entry points and targets in traversal paths over the number of paths can be another factor. The targets themselves are known in some applications and unknown in others. In some cases, the semantic analysis and entity detection are performed repeatedly, in an alternating manner, on different input components, producing a progressive analysis instead of a snapshot analysis. The following sections describe different concepts and algorithms for semantic analysis considering the above parameters and how to tackle each appropriately. Once these concepts and algorithms are explained, Section 3.8 will provide some real-world applications and use the different semantic analysis algorithms explained below to provide a possible solution for each application.

3.7.1 Semantic Distance

The first and most fundamental concept to understand in semantic analysis is the semantic distance between elements of the semantic graph. Most of the applications handled by this methodology require that the entities or topics which are semantically relevant or close to the input be identified. Therefore, there should be a metric to determine how close two ontological concepts are to each other. This metric is called the semantic distance.

The semantic distance, also known as semantic similarity or semantic relatedness, might have various definitions in different situations [101]. Cosine similarity is one similarity metric often used in applications; it uses the term frequencies of documents as the vector describing each document. The cosine of two vectors representing two documents is calculated using their Euclidean dot product. Cosine values near 1 mean very similar documents and values near 0 mean not so similar documents. Latent Semantic Analysis (LSA) [102, 103] is famous for using this method and metric.

Since the knowledgebase in this methodology is defined by the semantic graph, the semantic distance should be defined accordingly. The items defining the semantic distance in this structure are entities and topics and, therefore, the definition is based on the literal distance between the nodes of the semantic graph, i.e. entities and topics. The distance between nodes in the semantic graph is specified by the length of a path connecting one node to another. This path, however, is not unique. For most pairs of nodes in the semantic graph, there are countless connecting paths. Each of these paths might have a different length, producing numerous semantic distances for any given two nodes (and that is without considering the loops in the entity graph). The trivial way to handle this issue is to find the shortest path between the two nodes in question. Alas, the definition of the shortest path between two nodes is not unique either, as it depends on many factors such as the types of the source and target nodes, the types of the connecting edges, the direction of the edges, and the semantic weight of the edges. To explain how each of these factors affects the definition of the semantic distance, the different definitions are classified using the types of the two nodes for which the distance is defined. The following provides the semantic distance definitions for topic-to-topic, entity-to-topic, and entity-to-entity nodes along with the other deciding factors in each case.

3.7.1.1 Topic-to-Topic Distance

Determining the semantic distance between two topics is the most straightforward of all the definitions. The topic graph is the upper layer of the knowledgebase with a hyponymy structure. There is only one edge type between topics, which is a hyponym relation showing that the child is a sub-category (sub-topic) of its parent. Therefore, the semantic distance between two topics is simply the length of the shortest path between the two topics in the topic graph. Equation (3.38) shows the topic-to-topic semantic distance (DistTT) in recursive form. The description of this equation is that if any two topics t1 and t2 are directly connected, one being the sub-category or super-category of the other, the semantic distance is equal to 1. Otherwise, if there is another topic t such that t is directly connected to t2, the distance is the distance between t1 and t plus 1, provided there is no other topic directly connected to t2 with a smaller distance to t1 than t, i.e. the distance between t1 and t2 is the shortest path length.


$$Dist_{TT}(t_1,t_2) = Dist_{TT}(t_2,t_1) =
\begin{cases}
0 & \Leftrightarrow\; (t_1,t_2 \in T) \wedge (t_1 = t_2)\\[1mm]
1 & \Leftrightarrow\; (t_1,t_2 \in T) \wedge \big(IsSubCategoryOf(t_1,t_2) \vee IsSuperCategoryOf(t_1,t_2)\big)\\[1mm]
Dist_{TT}(t_1,t) + 1 & \Leftrightarrow\; (t_1,t_2,t \in T) \wedge \big(IsSubCategoryOf(t,t_2) \vee IsSuperCategoryOf(t,t_2)\big)\\
& \quad\wedge\; \neg\exists\, t' \in T : \big(Dist_{TT}(t_1,t') < Dist_{TT}(t_1,t)\big) \wedge \big(IsSubCategoryOf(t',t_2) \vee IsSuperCategoryOf(t',t_2)\big)
\end{cases}
\qquad (3.38)$$

The above is a general definition of the topic-to-topic semantic distance applicable to most use cases. However, some applications may need the semantic distance under a certain direction of traversal. In other words, the definition above is symmetrical, so DistTT(t1, t2) is equal to DistTT(t2, t1). But if the application asserts that, for example, the distance direction from t1 to t2 should be strictly from child to parent (t1 should be a sub-category of t2), then the distance is only defined between children and their ancestors in a one-way direction. Any two topics not related through ancestry, i.e. related as siblings or cousins, or related through the reverse direction of ancestry, get a distance of infinity. For instance, if the assertion is that the direction should only be upwards from children to parents, and t1 is a sub-category of t2, then the distance from t1 to t2 is equal to 1, Dist^U_TT(t1, t2) = 1, while the reverse is infinity, Dist^U_TT(t2, t1) = ∞ (Dist^U_TT denotes the upward distance and Dist^D_TT the downward distance). Now if t3 is another child of t2, meaning t1 and t3 are siblings, the distances between t1 and t3 in both directions are equal to infinity: Dist^U_TT(t1, t3) = Dist^U_TT(t3, t1) = ∞. Figure 3.14 shows some examples of upward distances. The same argument, albeit reversed, holds if the assertion is downward from parents to children. Equation (3.39) and Equation (3.40) define the upward and downward topic-to-topic distances respectively. The downward distances are the exact inverse of the upward distances: having defined the upward distance, the downward distance is obtained by swapping the variables, as indicated by Equation (3.40), meaning the downward distance between t1 and t2 is equal to the upward distance between t2 and t1.

Figure 3.14: Examples of upward topic-to-topic semantic distances.

$$Dist^{U}_{TT}(t_1,t_2) =
\begin{cases}
0 & \Leftrightarrow\; (t_1,t_2 \in T) \wedge (t_1 = t_2)\\[1mm]
1 & \Leftrightarrow\; (t_1,t_2 \in T) \wedge IsSubCategoryOf(t_1,t_2)\\[1mm]
\infty & \Leftrightarrow\; (t_1,t_2 \in T) \wedge IsSuperCategoryOf(t_1,t_2)\\[1mm]
Dist^{U}_{TT}(t_1,t) + Dist^{U}_{TT}(t,t_2) & \Leftrightarrow\; (t_1,t_2,t \in T) \wedge IsSubCategoryOf(t_1,t)\\
& \quad\wedge\; \neg\exists\, t' \in T : \big(Dist^{U}_{TT}(t_1,t') + Dist^{U}_{TT}(t',t_2) < Dist^{U}_{TT}(t_1,t) + Dist^{U}_{TT}(t,t_2)\big) \wedge IsSubCategoryOf(t_1,t')
\end{cases}
\qquad (3.39)$$

$$Dist^{U}_{TT}(t_1,t_2) \neq \infty \;\Rightarrow\; Dist^{U}_{TT}(t_2,t_1) = \infty$$
$$Dist^{D}_{TT}(t_1,t_2) = Dist^{U}_{TT}(t_2,t_1) \qquad (3.40)$$
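For intuition, the symmetric distance of Equation (3.38) is an unweighted shortest path over the topic graph, and the upward distance of Equation (3.39) restricts the traversal to the skos:broader direction; the Python sketch below (adjacency-map names are illustrative assumptions) computes both with a breadth-first search.

```python
from collections import deque

def dist_tt(t1, t2, parents_of, children_of=None, upward_only=False):
    """Shortest-path topic-to-topic distance.

    parents_of:  topic -> set of super-categories (skos:broader direction).
    children_of: topic -> set of sub-categories; only needed for the symmetric distance.
    upward_only: if True, compute Dist^U_TT (child-to-ancestor only), otherwise Dist_TT.
    Returns the path length, or float('inf') if no path exists.
    """
    if t1 == t2:
        return 0
    visited, frontier = {t1}, deque([(t1, 0)])
    while frontier:
        topic, d = frontier.popleft()
        neighbours = set(parents_of.get(topic, ()))
        if not upward_only and children_of is not None:
            neighbours |= children_of.get(topic, ())
        for nxt in neighbours:
            if nxt == t2:
                return d + 1
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, d + 1))
    return float("inf")
```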

One might also be interested in defining the semantic distance between two topics using the distances between the entities of each topic. Such formulations vary greatly based on the definition and might be useful for certain applications, but in the scope of this research the semantic distance between two topics is defined solely based on the topic graph.

3.7.1.2 Entity-to-Topic Distance

The meaning of the semantic distance between an entity and a topic is tricky. Entities and topics are naturally different concepts and, therefore, one might not be able to define the semantic distance between them directly. Since an entity might belong to a topic, we can interpret that as the entity being semantically very close to the topic it belongs to, i.e. at the minimum possible distance (c). To define the distance between an entity e and a topic t to which it does not belong, we can reduce the problem to finding the distance between topic t and another topic te to which entity e does belong. Since we have already defined the topic-to-topic distance, we only need to add the minimum possible distance c to the topic-to-topic distance to calculate the distance between an entity and a topic. However, any given entity might belong to many different topics, which means that, by the above definition, many distances can be calculated. The solution is simply to select the topic te which minimizes the distance compared to all other topics to which entity e belongs. Equation (3.41) shows the formalization of the above definition.

$$
Dist_{ET}(e,t) =
\begin{cases}
c & \Leftrightarrow (c \in \mathbb{R}) \wedge (e \in E) \wedge (t \in T) \wedge HasCategory(e,t)\\[2pt]
Dist_{TT}(t_e,t) + c & \Leftrightarrow (e \in E) \wedge (t,t_e \in T) \wedge \neg HasCategory(e,t) \wedge HasCategory(e,t_e) \wedge {}\\
 & \quad \neg\exists\, t' \in T : \big(Dist_{TT}(t',t) < Dist_{TT}(t_e,t)\big) \wedge HasCategory(e,t')
\end{cases}
\tag{3.41}
$$

In the above equation, c is a constant determining the proportional weight of an edge between an entity and its topic compared to an edge between two topics. Since topic-to-topic edges are valued at 1, selecting a value less than 1 for c will increase the dominance of the topic graph, while a value larger than 1 for c will decrease the effect of the topic graph on an entity-to-topic semantic distance. Furthermore, if the entity-to-topic distance is only used in comparison with other entity-to-topic distances, the choice of c is irrelevant as all distances receive the same offset. In such cases, one might use a value of zero for c and reduce the problem to a topic-to-topic distance comparison.

In this research, the distance from a topic to an entity is not applicable as entities are the entry points of the semantic graph. In the next subsection, however, we will see how entity-to-entity distance can be defined based on entity-to-topic distances and in such cases, the distance from a topic to an entity would be assumed the same as the distance from an entity to a topic.

3.7.1.3 Entity-to-Entity Distance

Defining a semantic distance between two entities is a more complicated issue. Entities are connected to each other via different types of edges and different types of nodes. Entities are connected through the entity graph by means of redirects, disambiguation links, wikilinks, and properties. They are also connected through the topic graph by the use of hyponym relations between their topics. Therefore, to accurately define the entity-to-entity semantic distance, all of the above parameters should be taken into account. Equation (3.42) provides a generic definition considering all the different routes possible for connections between two entities.

$$
Dist_{EE}(e_1,e_2) = Dist_{EE}(e_2,e_1) = \min\big\{\, Dist^{R}_{EE}(e_1,e_2),\; Dist^{D}_{EE}(e_1,e_2),\; \alpha \cdot Dist^{P}_{EE}(e_1,e_2),\; \beta \cdot Dist^{L}_{EE}(e_1,e_2),\; \theta \cdot Dist^{T}_{EE}(e_1,e_2) \,\big\}
\tag{3.42}
$$

The distance, as defined in the above equation, indicates the shortest path between two entities by taking the minimum over 5 possible route types. Each of these 5 routes provides a symmetrical distance; thus, the order of the arguments (e1 and e2) does not matter. The routes are, however, compared using different importance weights reflecting the nature of each route type. In the following, each route is described and a discussion is provided on how each of the weights is set.

The first two routes indicate redirects (Dist^R_EE) and disambiguation links (Dist^D_EE). These two types might only be available between certain entities. If an entity is a redirect or disambiguation of another entity, these entities are in fact equivalent and, therefore, there is no semantic distance between them. But if no such relation is available between two entities, redirects and disambiguation links do not contribute to the semantic distance definition; therefore, in their absence, their distance is considered infinity. This reduces the first two routes to binary functions resulting in either zero or infinity, which means that if they exist, they supersede all other routes and cause the minimum to become zero; if they do not exist, they have no effect in the Dist_EE definition, as infinity acts as the neutral element of a minimum aggregation. Because of this zero-versus-infinity binary effect, adding an importance weight coefficient is unnecessary for these two routes. Equation (3.43) and Equation (3.44) express the above routes in a formal format.

$$
Dist^{R}_{EE}(e_1,e_2) =
\begin{cases}
0 & \Leftrightarrow (e_1,e_2 \in E) \wedge \big((e_1 = e_2) \vee RedirectsTo(e_1,e_2) \vee RedirectsTo(e_2,e_1)\big)\\[2pt]
\infty & \Leftrightarrow (e_1,e_2 \in E) \wedge (e_1 \neq e_2) \wedge \neg RedirectsTo(e_1,e_2) \wedge \neg RedirectsTo(e_2,e_1)
\end{cases}
\tag{3.43}
$$

$$
Dist^{D}_{EE}(e_1,e_2) =
\begin{cases}
0 & \Leftrightarrow (e_1,e_2 \in E) \wedge \big((e_1 = e_2) \vee DisambiguatesTo(e_1,e_2) \vee DisambiguatesTo(e_2,e_1)\big)\\[2pt]
\infty & \Leftrightarrow (e_1,e_2 \in E) \wedge (e_1 \neq e_2) \wedge \neg DisambiguatesTo(e_1,e_2) \wedge \neg DisambiguatesTo(e_2,e_1)
\end{cases}
\tag{3.44}
$$

The third route (Dist^P_EE) represents the property connections between entities. Since only half of the entities have infoboxes [86] and, therefore, might have properties, it is possible that two entities have no route between them consisting only of property edges. If such a route exists, each step counts as one unit towards the length of the route and, consequently, the distance. In practice, however, property routes are used in a step-by-step exploration method; this means that only entities immediately connected to a given entity through properties are selected. This implies the distance is usually one, but in such an exploration method, the deciding factor is the importance weight labeled as alpha (α) in Equation (3.42). The discussion on weights follows shortly. Also note that the direction of the property link does not matter in this definition. Equation (3.45) provides a multistep definition of the property entity-to-entity distance.


$$
Dist^{P}_{EE}(e_1,e_2) =
\begin{cases}
0 & \Leftrightarrow (e_1,e_2 \in E) \wedge (e_1 = e_2)\\[2pt]
1 & \Leftrightarrow (e_1,e_2 \in E) \wedge \big(HasProperty(e_1,e_2) \vee HasProperty(e_2,e_1)\big)\\[2pt]
Dist^{P}_{EE}(e_1,e) + 1 & \Leftrightarrow (e_1,e_2,e \in E) \wedge \big(HasProperty(e,e_2) \vee HasProperty(e_2,e)\big) \wedge {}\\
 & \quad \neg\exists\, e' \in E : \big(Dist^{P}_{EE}(e_1,e') < Dist^{P}_{EE}(e_1,e)\big) \wedge \big(HasProperty(e',e_2) \vee HasProperty(e_2,e')\big)
\end{cases}
\tag{3.45}
$$

$$
Dist^{L}_{EE}(e_1,e_2) =
\begin{cases}
0 & \Leftrightarrow (e_1,e_2 \in E) \wedge (e_1 = e_2)\\[2pt]
1 & \Leftrightarrow (e_1,e_2 \in E) \wedge \big(LinksTo(e_1,e_2) \vee LinksTo(e_2,e_1)\big)\\[2pt]
Dist^{L}_{EE}(e_1,e) + 1 & \Leftrightarrow (e_1,e_2,e \in E) \wedge \big(LinksTo(e,e_2) \vee LinksTo(e_2,e)\big) \wedge {}\\
 & \quad \neg\exists\, e' \in E : \big(Dist^{L}_{EE}(e_1,e') < Dist^{L}_{EE}(e_1,e)\big) \wedge \big(LinksTo(e',e_2) \vee LinksTo(e_2,e')\big)
\end{cases}
\tag{3.46}
$$

The fourth route (Dist^L_EE) represents the wikilink connections between entities. Every entity in the entity graph has at least one wikilink and usually has more. Due to the high number of wikilinks, it is likely that any two given entities are connected through wikilinks, although it is still possible to find two entities without a multistep route consisting of wikilinks. Nevertheless, similar to property links, wikilinks are often used in step-by-step exploration methods which are only interested in the immediately connected entities. Therefore, the deciding factor again is the importance weight, labeled as beta (β) in Equation (3.42). Similar to properties, the direction of wikilinks does not matter in determining the semantic distance. Equation (3.46) provides a multistep definition of the wikilink entity-to-entity distance.

The fifth and final route (Dist^T_EE) represents the connections between entities through the topic graph. Since each entity in the entity graph has at least one topic and since all topics are connected to each other through the topic graph, any two entities are connected to each other via at least one path (and possibly many more) through the topic graph. Therefore, it is always possible to find a semantic distance between two entities even if there is no path in the entity graph to connect them. The trick is to find all the topics of each of the two entities and then select one topic from the first entity and one topic from the second entity which produce the shortest path among all possible topic-to-topic connections. Figure 3.15 illustrates an example. Suppose that e1 has three topics (tx, ty, tz) and e2 has two topics (ta, tb). There are 6 possible connections between the topics of e1 and the topics of e2. In this example, the distance between ty and ta is the smallest, equal to 3. Therefore, the semantic distance between e1 and e2 would be equal to 3 + 2c. As mentioned for the entity-to-topic distance, there is an edge with a weight of constant c between any entity and its topics. Therefore, the final distance always carries a 2c add-on.

Figure 3.15: Entity-to-entity semantic distance through the topic graph.

While finding the semantic distance through the topic graph is always possible, similar to the previous distances, we are mostly interested in entities which are immediately connected to the topics of a certain entity. For instance, in the above example, only entities which are connected to tx, ty, and tz are of interest when dealing with e1 as the starting point. This preference reduces the intermediate distances to zero, which makes the deciding factor and the final distance equal to 2c times the importance weight labeled as theta (θ) in Equation (3.42). Equation (3.47) expresses the formal format of the entity-to-entity distance through the topic graph.

$$
Dist^{T}_{EE}(e_1,e_2) =
\begin{cases}
0 & \Leftrightarrow (e_1,e_2 \in E) \wedge (e_1 = e_2)\\[2pt]
2 \cdot c & \Leftrightarrow (c \in \mathbb{R}) \wedge (e_1,e_2 \in E) \wedge \exists\, t \in T : HasCategory(e_1,t) \wedge HasCategory(e_2,t)\\[2pt]
Dist_{TT}(t_1,t_2) + 2 \cdot c & \Leftrightarrow (e_1,e_2 \in E) \wedge (t_1,t_2 \in T) \wedge HasCategory(e_1,t_1) \wedge HasCategory(e_2,t_2) \wedge {}\\
 & \quad \neg\exists\, t_3,t_4 \in T : \big(Dist_{TT}(t_3,t_4) < Dist_{TT}(t_1,t_2)\big) \wedge HasCategory(e_1,t_3) \wedge HasCategory(e_2,t_4)
\end{cases}
\tag{3.47}
$$

As mentioned earlier, the final entity-to-entity semantic distance is defined as the minimum of the 5 different routes explained above. However, there is also the possibility of defining routes which mix the last three options. If the entities fall within the redirects and disambiguation links, they are basically equivalent and the distance is automatically zero. Considering the last three options, however, it is possible to define a route which uses different types of connections in its different steps. For instance, consider a route from the starting entity to a second entity through a wikilink, then to a third entity through a property link, and finally to a fourth entity through a topic graph link. Considering such mixed formats produces numerous possible routes, and finding the shortest path then becomes a very computationally demanding task. To include such routes and at the same time simplify the algorithm, a step-by-step exploration algorithm is introduced. This algorithm tries to create the best mixed route through a greedy approach. Although the algorithm will be explained later, it is important to note how this approach affects the importance weighting scheme introduced in Equation (3.42). Setting the proper weights for α, β, and θ is based on how they are used in each step of the exploration algorithm. Figure 3.16 illustrates entities which are immediately connected to a certain entity (e1) through different connection types and thus appear in one step of the exploration algorithm. It also shows the importance weight for each type.


Figure 3.16: Entity-to-entity immediate connections through various types.

Entities e2 and e3 are equivalent to e1 and, therefore, have zero distance to it. Entities e4 and e5 are connected through the entity graph and have distances of α and β respectively. Lastly, entity e6 is connected through the topic graph and has a distance of 2 × c × θ. The rationale behind setting proper values for α, β, and θ is based on the nature of each route. Since properties are specified by Wikipedia authors as important links between two entities, property-connected entities are considered the closest (non-equivalent) entities to a given entity. Wikilink-connected entities come next: since wikilinks do not have definite predicates, they might not be as accurate as properties, but being directly connected, they are still closer concepts to an entity than the entities connected through the topic graph. Therefore, a one-step distance through the topic graph should be weighted higher than a wikilink. Using the above rationale, proper values for α, β, and θ should satisfy the following condition:

α ≤ β ≤ θ (3.48)

Note that the value of the constant c cannot be zero, since that would make the one-step topic-route distance zero regardless of θ and thereby break the ordering intended by the above condition. A good value for c is in fact 0.5, because it simplifies the one-step Dist^T_EE distance to θ (= 2 × 0.5 × θ). Moreover, based on the application, any of the three routes might be ignored when finding the semantic distance. For instance, due to the nature of a problem, one might ignore properties altogether. It is also possible to select the same value for all the weights and make the one-step semantic distance equal for all types (considering c = 0.5).
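As an illustration only, the snippet below checks condition (3.48) and lists the resulting one-step distances of Figure 3.16 for a hypothetical weight configuration; the concrete values of ALPHA, BETA, and THETA are assumptions, not values prescribed by this work.

    # Hypothetical weight configuration satisfying condition (3.48), c = 0.5
    ALPHA, BETA, THETA, C = 1.0, 2.0, 4.0, 0.5

    assert ALPHA <= BETA <= THETA          # condition (3.48)

    one_step_distance = {
        "redirect/disambiguation": 0.0,    # equivalent entities
        "property": ALPHA * 1,             # one property edge
        "wikilink": BETA * 1,              # one wikilink edge
        "topic graph": 2 * C * THETA,      # shared topic: 2 * c * theta
    }
    print(one_step_distance)
    # {'redirect/disambiguation': 0.0, 'property': 1.0, 'wikilink': 2.0, 'topic graph': 4.0}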

3.7.2 Semantic Distance Calculation Algorithms

Although a complete definition of the semantic distance has been presented, we have not yet discussed how to practically calculate these distances. Calculating the distance between two entities or two topics is not very complicated. However, the different setups dictated by the problem definition can affect the mechanism of the algorithm. Here, a number of different algorithms are briefly introduced and discussed. Each of these algorithms tackles a variation of the distance calculation based on a different setup.

3.7.2.1 Topic-to-Topic Generic Semantic Distance

In this case, the goal is to find the distance between any two given topics. To be generic, the goal is defined as finding the distances between two sets of topics instead of two individual topics. Figure 3.17 provides the full algorithm specification and presents the steps.

The algorithm starts by creating a set called Calculated Distances (CD). This set will have the distances between starting topics and destination topics and all the intermediate topics in between. It is initialized by the zero distances from starting topics to themselves. There is another variable defined as l which stores the current path length starting at zero.

Step 2 checks whether any destination topic (td) still has a missing distance to any of the starting topics; if not, the algorithm is done and jumps to step 6. In other words, it checks to see if the distances between all starting topics and all destination topics have been calculated. Next, in step 3, the algorithm finds the distances to all the new topics which are one step further than the topics whose distance l has already been calculated. This is done by gathering all the topics (tn) with a distance of l from any of the starting topics (ts) and then finding all of their sub- or super-category topics (tm) whose distances to the starting topics are not already in CD. Such sub- or super-category topics get a distance of l + 1 to the starting topics (ts, tm, l + 1). Then l is increased by one and the algorithm continues until all the distances are calculated.

Note that in this algorithm, the first element of the CD tuples is always a starting topic (ts∈TS). When the algorithm stops, not only the distances to all destination topics are calculated for all starting topics, but all the intermediate topic distances have also been calculated which can be used in later processing.


Goal: Given a set of starting topics (TS) and a set of destination topics (TD), find Dist_TT(t_s, t_d) for any t_s ∈ TS and t_d ∈ TD.

Input: TS ⊂ T: Set of starting topics; TD ⊂ T: Set of destination topics
Output: Dist_TT(t_s, t_d) for all t_s ∈ TS and t_d ∈ TD

1. CD ← {(t, t, 0) | t ∈ TS}, l ← 0
2. If {t_d | (t_d ∈ TD) ∧ (∃ t_s ∈ TS: ¬∃ n ∈ N: (t_s, t_d, n) ∈ CD)} is empty, go to step 6
3. CD ← CD ∪ {(t_s, t_m, l + 1) | ((t_s, t_n, l) ∈ CD) ∧ (¬∃ n ∈ N: (t_s, t_m, n) ∈ CD) ∧ (IsSubCategoryOf(t_n, t_m) ∨ IsSuperCategoryOf(t_n, t_m))}
4. l ← l + 1
5. Go to step 2
6. ∀ ((t_s, t_d, l) ∈ CD) ∧ (t_s ∈ TS) ∧ (t_d ∈ TD): return Dist_TT(t_s, t_d) = l

Figure 3.17: Topic-to-topic generic semantic distance calculator algorithm.
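A minimal Python sketch of the calculator in Figure 3.17 is given below. It assumes a hypothetical callable neighbors(t) that returns both the sub- and super-categories of a topic t (an undirected view of the topic graph); the early-exit guard that reports infinity for unreachable pairs is an addition for robustness and is not part of the figure.

    from math import inf

    def generic_tt_distances(neighbors, TS, TD):
        # CD maps (starting topic, topic) -> shortest distance found so far.
        CD = {(ts, ts): 0 for ts in TS}
        frontier = {ts: {ts} for ts in TS}      # topics at distance l from each start
        l = 0
        while any((ts, td) not in CD for ts in TS for td in TD):
            new_frontier = {ts: set() for ts in TS}
            for ts in TS:
                for tn in frontier[ts]:
                    for tm in neighbors(tn):    # sub- or super-categories of tn
                        if (ts, tm) not in CD:
                            CD[(ts, tm)] = l + 1
                            new_frontier[ts].add(tm)
            if all(not s for s in new_frontier.values()):
                break                           # unreachable pairs stay at infinity
            frontier, l = new_frontier, l + 1
        return {(ts, td): CD.get((ts, td), inf) for ts in TS for td in TD}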

3.7.2.2 Topic-to-Topic Upward and Strictly Upward Semantic Distance

Here, two variations of the previous algorithm are presented. As mentioned earlier, in some applications the traversal direction used in finding the semantic distance needs to be limited to a certain course. Here, the algorithms for the upward and strictly upward calculations are discussed; they are very similar to the generic one. The only difference is that the upward algorithm only chooses topics which are parents of the topics at the current distance l. The strictly upward algorithm is the same as the upward algorithm with the added constraint that it employs the processed topic graph using the levels concept defined in Equation (3.13). In the strictly upward algorithm, not only must the next-step topics be parents of the current distance-l topics, they must also have a lower level than their child topics. Note that in the upward and strictly upward algorithms, it is possible that no upward path between a source topic and a destination topic exists. That is why there is an extra step which, in case of no change in CD, jumps out of the loop and considers the remaining distances to be infinity. The upward algorithm is specified by Dist^U_TT(t_s, t_d) and the strictly upward algorithm is specified by Dist^SU_TT(t_s, t_d). We will see that the strictly upward algorithm is used in the Parla application explained in Section 3.8. Figure 3.18 and Figure 3.19 present the specification and steps of the upward and strictly upward algorithms respectively.

Goal: Given a set of starting topics (TS) and a set of destination topics (TD), find Dist^U_TT(t_s, t_d) for any t_s ∈ TS and t_d ∈ TD.

Input: TS ⊂ T: Set of starting topics; TD ⊂ T: Set of destination topics
Output: Dist^U_TT(t_s, t_d) for all t_s ∈ TS and t_d ∈ TD

1. CD ← {(t, t, 0) | t ∈ TS}, l ← 0
2. If {t_d | (t_d ∈ TD) ∧ (∃ t_s ∈ TS: ¬∃ n ∈ N: (t_s, t_d, n) ∈ CD)} is empty, go to step 7
3. CD ← CD ∪ {(t_s, t_m, l + 1) | ((t_s, t_n, l) ∈ CD) ∧ (¬∃ n ∈ N: (t_s, t_m, n) ∈ CD) ∧ IsSubCategoryOf(t_n, t_m)}
4. If CD remains unchanged, go to step 7
5. l ← l + 1
6. Go to step 2
7. ∀ ((t_s, t_d, l) ∈ CD) ∧ (t_s ∈ TS) ∧ (t_d ∈ TD): return Dist^U_TT(t_s, t_d) = l

Figure 3.18: Topic-to-topic upward semantic distance calculator algorithm.

Goal: Given a set of starting topics (TS) and a set of destination topics (TD), find Dist^SU_TT(t_s, t_d) for any t_s ∈ TS and t_d ∈ TD.

Input: TS ⊂ T: Set of starting topics; TD ⊂ T: Set of destination topics
Output: Dist^SU_TT(t_s, t_d) for all t_s ∈ TS and t_d ∈ TD

1. CD ← {(t, t, 0) | t ∈ TS}, l ← 0
2. If {t_d | (t_d ∈ TD) ∧ (∃ t_s ∈ TS: ¬∃ n ∈ N: (t_s, t_d, n) ∈ CD)} is empty, go to step 7
3. CD ← CD ∪ {(t_s, t_m, l + 1) | ((t_s, t_n, l) ∈ CD) ∧ (¬∃ n ∈ N: (t_s, t_m, n) ∈ CD) ∧ IsSubCategoryOf(t_n, t_m) ∧ HasLevel(t_n, L_n) ∧ HasLevel(t_m, L_m) ∧ (L_n > L_m)}
4. If CD remains unchanged, go to step 7
5. l ← l + 1
6. Go to step 2
7. ∀ ((t_s, t_d, l) ∈ CD) ∧ (t_s ∈ TS) ∧ (t_d ∈ TD): return Dist^SU_TT(t_s, t_d) = l

Figure 3.19: Topic-to-topic strictly upward semantic distance calculator algorithm.
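The upward and strictly upward variants differ from the generic calculator only in how the next-step topics are expanded. A small sketch, assuming hypothetical helpers parents_of(t) (the topics t_m with IsSubCategoryOf(t, t_m)) and level_of(t) (the HasLevel value), shows how the two restrictions could be plugged into the generic routine sketched after Figure 3.17; its empty-frontier guard already yields infinity for pairs with no upward path.

    def upward_neighbors(parents_of):
        # expand only to parents of the current topic
        return lambda t: list(parents_of(t))

    def strictly_upward_neighbors(parents_of, level_of):
        # additionally require the parent to sit at a strictly lower level
        return lambda t: [tm for tm in parents_of(t) if level_of(t) > level_of(tm)]

    # Dist_U  = generic_tt_distances(upward_neighbors(parents_of), TS, TD)
    # Dist_SU = generic_tt_distances(strictly_upward_neighbors(parents_of, level_of), TS, TD)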

3.7.2.3 Topic-to-Topic Downward and Strictly Downward Semantic Distance

Although we will not be using downward distances here, it is very easy to define them based on their upward counterparts. The downward and strictly downward distances are the exact inverse of the upward and strictly upward ones. The downward algorithm is specified by Dist^D_TT(t_s, t_d) and can be defined by changing IsSubCategoryOf(t_n, t_m) in step 3 of Figure 3.18 to IsSuperCategoryOf(t_n, t_m). Similarly, the strictly downward algorithm is specified by Dist^SD_TT(t_s, t_d) and can be defined by changing L_n > L_m and IsSubCategoryOf(t_n, t_m) in step 3 of Figure 3.19 to L_n < L_m and IsSuperCategoryOf(t_n, t_m) respectively.

3.7.2.4 Entity-to-Topic Generic Semantic Distance

Calculating the entity-to-topic semantic distance is trivial once the topic-to-topic semantic distance is defined. All we need to do is find the topics of an entity, find the minimum distance from those topics to the target topic, and add the entity-to-topic edge constant weight (c) to that distance. To be generic, the algorithm is presented for a set of initial entities (IE) and a set of target topics (TT). In this algorithm, the topics of entity e_i ∈ IE are referred to by t_{i,j} ∈ TE. Figure 3.20 shows the specification of the algorithm and presents the steps.

Goal: Given a set of initial entities (IE) and a set of target topics (TT), find Dist_ET(e_i, t_t) for any e_i ∈ IE and t_t ∈ TT.

Input: IE ⊂ E: Set of initial entities; TT ⊂ T: Set of target topics
Output: Dist_ET(e_i, t_t) for all e_i ∈ IE and t_t ∈ TT

1. TE ← {t_{i,j} | (e_i ∈ IE) ∧ (t_{i,j} ∈ T) ∧ HasCategory(e_i, t_{i,j})}
2. For each e_i ∈ IE, t_t ∈ TT, t_{i,j} ∈ TE, and c ∈ R, the distance Dist_ET(e_i, t_t) = c + min_j(Dist_TT(t_{i,j}, t_t))

Figure 3.20: Entity-to-topic generic semantic distance calculator algorithm.
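For one entity and one topic, step 2 of Figure 3.20 reduces to a one-liner. A minimal sketch, assuming hypothetical callables topics_of(e) (the topics with HasCategory(e, t)) and dist_tt(t1, t2) (any of the topic-to-topic distance variants):

    def dist_et(dist_tt, topics_of, e, t, c=0.5):
        base = topics_of(e)
        if t in base:                   # HasCategory(e, t): minimum possible distance
            return c
        return c + min(dist_tt(te, t) for te in base)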

If the entity-to-topic upward, strictly upward, downward, or strictly downward distances need to be calculated, it suffices to replace Dist_TT(t_{i,j}, t_t) in step 2 with the proper distance function from the corresponding topic-to-topic distances.

3.7.2.5 Entity-to-Entity Mixed Route Semantic Distance

As mentioned before, the entity-to-entity distance can be calculated either by finding the minimum of the distances via the different connection types or by using a step-by-step exploration method. The former method is very similar to the topic-to-topic or entity-to-topic generic semantic distance algorithms but with the appropriate connection types. Here, we will discuss the latter method as it is the more complicated approach. In this algorithm, the distances between two sets of entities and all the intermediates are calculated. Figure 3.21 shows the specification of the algorithm and provides the steps.

In this algorithm, the distances between a starting set of entities (ES) and a destination set of entities (ED), along with all the intermediate entities, are calculated. The algorithm starts by giving a distance of zero from ES entities to themselves. It also defines EN as the set of entities to be processed next, initialized to ES. Then, the algorithm starts exploring. Step 2 checks if any distances to ED are still outstanding, otherwise it jumps to the end. Steps 3 to 6 respectively find the equivalent entities, the properties, the wikilinks, and the entities with a common topic with EN. All of the new distances will be added unless a shorter distance to the new entities has already been calculated. Then the newly added entities replace the content of EN to be processed in the next round. Step 8 searches within the calculated distances and, if there are two or more distances between the same two entities, keeps the shortest one and discards the longer ones. The algorithm continues until all distances from ES to ED are calculated.

Goal: Given a set of starting entities (ES) and a set of destination entities (ED), find Dist_EE(e_s, e_d) for any e_s ∈ ES and e_d ∈ ED.

Input: ES ⊂ E: Set of starting entities; ED ⊂ E: Set of destination entities; α, β, θ, c ∈ R: Weighting parameters
Output: Dist_EE(e_s, e_d) for all e_s ∈ ES and e_d ∈ ED

1. CD ← {(e, e, 0) | e ∈ ES}, EN ← ES
2. If {e_d | (e_d ∈ ED) ∧ (∃ e_s ∈ ES: ¬∃ x ∈ R: (e_s, e_d, x) ∈ CD)} = ∅, go to step 10
3. CD ← CD ∪ {(e_s, e_m, l) | ((e_s, e_n, l) ∈ CD) ∧ (e_n ∈ EN) ∧ (RedirectsTo(e_n, e_m) ∨ DisambiguatesTo(e_n, e_m)) ∧ ¬∃ x ∈ R: ((e_s, e_m, x) ∈ CD) ∧ (x ≤ l)}
4. CD ← CD ∪ {(e_s, e_m, l + α) | ((e_s, e_n, l) ∈ CD) ∧ (e_n ∈ EN) ∧ (HasProperty(e_n, e_m) ∨ HasProperty(e_m, e_n)) ∧ ¬∃ x ∈ R: ((e_s, e_m, x) ∈ CD) ∧ (x ≤ l + α)}
5. CD ← CD ∪ {(e_s, e_m, l + β) | ((e_s, e_n, l) ∈ CD) ∧ (e_n ∈ EN) ∧ (LinksTo(e_n, e_m) ∨ LinksTo(e_m, e_n)) ∧ ¬∃ x ∈ R: ((e_s, e_m, x) ∈ CD) ∧ (x ≤ l + β)}
6. CD ← CD ∪ {(e_s, e_m, l + 2·c·θ) | ((e_s, e_n, l) ∈ CD) ∧ (e_n ∈ EN) ∧ (∃ t ∈ T: HasCategory(e_n, t) ∧ HasCategory(e_m, t)) ∧ ¬∃ x ∈ R: ((e_s, e_m, x) ∈ CD) ∧ (x ≤ l + 2·c·θ)}
7. Replace EN with the newly added entities
8. CD ← CD − {(e_s, e_n, l) | ((e_s, e_n, l) ∈ CD) ∧ (∃ (e_s, e_n, l') ∈ CD: l' < l)}
9. Go to step 2
10. ∀ ((e_s, e_d, l) ∈ CD) ∧ (e_s ∈ ES) ∧ (e_d ∈ ED): return Dist_EE(e_s, e_d) = l

Figure 3.21: Entity-to-entity mixed route semantic distance calculator algorithm.
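A minimal sketch of one exploration round (steps 3 to 8 of Figure 3.21) is shown below. The four neighbour callables (equivalents, prop_nbrs, link_nbrs, topic_nbrs) are assumptions standing for the redirect/disambiguation, property, wikilink, and shared-topic connections respectively, and the weight values are purely illustrative.

    def mixed_route_step(CD, EN, equivalents, prop_nbrs, link_nbrs, topic_nbrs,
                         alpha=1.0, beta=2.0, theta=4.0, c=0.5):
        # CD maps (e_s, e) -> shortest distance found so far; EN is the set to expand.
        expansions = [(equivalents, 0.0), (prop_nbrs, alpha),
                      (link_nbrs, beta), (topic_nbrs, 2 * c * theta)]
        added = set()
        for (es, en), l in list(CD.items()):
            if en not in EN:
                continue
            for nbrs, step_cost in expansions:
                for em in nbrs(en):
                    new = l + step_cost
                    if new < CD.get((es, em), float("inf")):   # keep the shortest only
                        CD[(es, em)] = new
                        added.add(em)
        return added        # becomes EN for the next round (step 7)

Repeating this step until all destination entities have a recorded distance reproduces the greedy, mixed-route behaviour described above.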

This algorithm concludes the discussion of how the semantic distance is defined and how it is calculated in different configurations. While the semantic distance is the main foundation of semantic analysis, it is not the only factor. The next step is to discuss how the semantic distance interacts with a concept called commonality, which captures how to handle multiple routes from the entry points of the semantic graph to the target entities or topics.

3.7.3 Semantic Distance and Commonality

As described in the entity detection algorithm, every input component produces multiple candidate entities as the entry points to the semantic graph. Given the entry points, the problem is usually finding entities or topics that are similar to the input. Regardless of whether the target entities/topics are specified or not, the semantic distance is not the only deciding parameter in finding the most semantically similar target. That is because multiple entry points cause multiple routes to the targets and, therefore, there might be a trade-off between targets with the closest semantic distance and targets pointed to by multiple entry points. In other words, a target might not be the closest individual to any of the entry points, but it might be the closest common target among most of the entry points. This factor is called the commonality of the target.

To clarify, consider the following example. Suppose that the input component has 10 candidate entities. Also suppose each entity has 3 topics on average. Now suppose there are 4 target topics to which we want to classify the input. To classify the input into one of these 4 targets, we need to consider all possible routes from the input entity topics to the target topics. The number of topics resulting from the entry-point entities is about 3 × 10 = 30, assuming 3 topics per candidate entity. This means the total number of possible routes to the target topics is 30 × 4 = 120. Now, let us assume that candidate entity #4 has a topic with a semantic distance of 3 to target topic #1, and that this is the shortest distance among all 120 possible routes. The point of commonality is that just because target topic #1 has the shortest semantic distance to one of the input candidate entities, it is not necessarily the best topic to classify the input into. The classifier needs to find the topic which is commonly closest to all of the input candidate entities. In this example, the classifier can use an aggregate function, such as sum or average (depending on the problem), over all the possible routes to each target (30 for each target here) and calculate the overall semantic distance for each target topic.

Another aspect of the commonality factor to consider is the score of each candidate entity. When detecting entities, each candidate received a separate score. This means there are different confidence levels for the various entry points. Thus, the targets should be ranked based on the entry-point scores as well as the number of routes and the semantic distance of each route. How the candidate entity scores, the aggregation function applied to the various routes, and the semantic distances shape the final target scores depends on the application of the analysis. However, we can state some general rules of thumb:

1. Higher candidate entity scores produce higher target scores.

2. More routes to a common target probably⁵ contribute to higher target scores.

3. Shorter semantic distances produce higher target scores.

⁵ This might not always be true, as it depends on the aggregation function; e.g. an average can produce higher scores for fewer routes.

$$
\big(M(e_c) \in X\big) \wedge \big((X = E) \vee (X = T)\big) \wedge (y_g \in Y) \wedge \big((Y = E) \vee (Y = T)\big) \wedge \big(AGG \in \{SUM, AVG, MAX, MIN, \dots\}\big)
\;\Rightarrow\;
S^{*}_{g} \;\propto\; \underset{c}{AGG}\!\left( \frac{f\big(S^{n}(e_c)\big)}{h\big(Dist_{XY}(M(e_c), y_g)\big)} \right)
\tag{3.49}
$$

Equation (3.49) formalizes the above rules in a generic way. In this equation, S*_g on the left-hand side (of the consequent) represents the target (goal) score for an entity or topic indexed g. On the right-hand side, there is an aggregate function (AGG) over all the possible routes from all the candidate entities (e_c) to the target (y_g). Some examples of the aggregate function are given in the antecedent. Inside the aggregate function there is a fraction whose numerator implements rule 1 above by making the target score directly proportional to the candidate entity scores. Note that the function (f) applied to the candidate entity scores should not make this proportionality contradict rule 1 (unless otherwise intended). The denominator, on the other hand, follows rule 3 by making the target score inversely proportional to the distance between the candidate entities and the target. Again, note that the function (h) applied to the semantic distance should preserve this inverse proportionality (unless otherwise intended). The f and h functions are scalar functions provided for flexibility in how strongly the numerator and denominator affect the score and can be customized in different applications. The function M, on the other hand, is an optional mapping function between candidate entities and other entities/topics. For instance, if the distance is calculated between entities (i.e. X and Y are both equal to E), the function M can be an identity function returning e_c as is. However, if the application is topic identification, M would be the HasCategory function, which finds the topics of the candidate entity, and the goal (y_g) would be one of the target topics into which the query should be classified.
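A minimal sketch of Equation (3.49) is given below; the argument names, the default h (distance plus a small constant to avoid division by zero), and the default identity mapping are illustrative assumptions rather than choices made in this work.

    def target_score(candidates, score, dist, target,
                     f=lambda s: s,                # rule 1: keep proportional to scores
                     h=lambda d: d + 1e-5,         # rule 3: inverse to distance
                     agg=sum,                      # AGG: sum, avg, max, ...
                     mapping=lambda e: [e]):       # M: identity, or HasCategory for topics
        routes = [f(score(e)) / h(dist(x, target))
                  for e in candidates for x in mapping(e)]
        return agg(routes)

With agg = sum, more routes (rule 2) and higher candidate scores (rule 1) raise the target score, while longer distances lower it (rule 3).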

3.7.4 Entity Analysis

Entity analysis is applied to the types of problems which only involve entities. These kinds of problems start with a set of initial entities. These initial entities are either given or detected by the entity detection algorithm. Afterwards, there is a number of alternating entity detections and semantic analyses until the input is fully processed. In these kinds of problems, there is often a set of conditions that the input entities should conform to. The entity detection algorithm finds the candidate entities conforming to the given conditions and scores them. Then, the semantic analysis component re-scores the candidate entities based on their similarity to the initial set of entities. The similarity is defined based on the semantic distance and the commonality of the entities. Once the new candidate entities are re-scored and selected, the initial set of entities is updated based on the old and new entities and the process repeats. Some actual applications of entity analysis, such as speech recognition and text prediction, are given in Section 3.8. A generic template for the entity analysis algorithm is described in Figure 3.22.

Goal: Given a textual input source, find entities conforming to a set of conditions (COND).

Input: COMP_n: Input components; COND: A set of conditions entities should conform to
Output: Result entities RE_n for each component COMP_n which conform to COND

1. OE ← { }, i ← 1
2. Find candidate entities E_i using the entity detection algorithm over COMP_i and keep their scores in S^i(e_c)
3. Filter out entities from E_i which do not conform to COND
4. Find the semantic distance between OE and E_i
5. Calculate the commonality score S*_c for all candidate entities e_c ∈ E_i using the observed entities in OE, and re-score all E_i using S^i and S*_c
6. Select the high-scored entities from E_i, put them in the result entities RE_i, and return RE_i as a part of the output
7. Update OE based on RE_i: OE ← u(OE, RE_i)
8. i ← i + 1
9. If COMP_i exists, go to step 2, otherwise exit

Figure 3.22: Generic entity analysis template algorithm.

Note that in the above algorithm, OE is empty in the first round; therefore, steps 4 and 5 are skipped in the first iteration and step 6 only uses the entity detection score (S^1(e_c)) to select the result entities (RE_1). Afterwards, OE is updated using the result entities. The updating function (u) can be a simple replacement of OE by RE_i or it can mix all the entities and select the top-ranked ones. The progressive analysis mentioned later demonstrates another updating strategy based on the age of an entity as well as its score. Finally, the re-scoring can be done by a customized function over S^i and S*_c.

3.7.5 Topic Analysis

Topic analysis is applied to types of problems which require the input to be mapped to a certain set of topics. Therefore, the first step is to use the entity detection algorithm to provide the candidate entities for each input component (if text breaking is needed). Afterwards, the topics of the candidate entities are acquired. Finally, the distances between the topics of the candidate entities and the target topics are calculated and scored using the commonality equation. The target topic(s) with the highest scores are deemed the topic of the input component. The topic identification application in Section 3.8 and most of Chapter 4 are dedicated to the details of topic analysis. One thing to note here is the possibility of having static or dynamic target topics in topic analysis. If the target topics are known and limited beforehand (static), it is possible to pre-calculate and cache the distances between the target topics and all other topics in the topic graph. This increases the speed of the algorithm considerably. However, if the targets are defined dynamically, the distances must be calculated in the iterative manner explained earlier. To increase the efficiency of the algorithm in this case, a limit can be imposed on the steps of the iterative distance calculation; if the target topics are not within a threshold radius of the starting topics, they are considered to have the maximum possible distance or infinity.

3.7.6 Progressive Analysis

Semantic analysis in some problems requires the system to be constantly and dynamically involved with the input. In other words, as the input components are read by the system, the semantic analysis is performed on each component and the results are kept and evolved for the next incoming input components. For instance, in a question-answering dialog system, the user keeps asking for new information and there is a good chance that the new questions are related to the past questions. Such a system should keep track of the context of the conversation and reason based on the history of the conversation. Although natural language dialog systems are not in the scope of this research, the methods described earlier can assist such a system to detect the correct entities and topics.

Here, progressive analysis is applied in two directions: one over entity analysis, which is called the context awareness application, and the other over topic analysis, which is called the topic tracking application, both explained in Section 3.8. Progressive entity analysis tries to keep a record of the entities already seen. In order to achieve this, the entities which win over the thresholds, i.e. score high enough to be selected as result entities, are stored in a set of observed entities (OE) similar to OE in the generic entity analysis algorithm. These observed entities are assigned an age initialized to zero. The age of the observed entities is incremented in every iteration of the semantic analysis. If an observed entity is observed again during a later iteration of the semantic analysis, its age is reset to zero. In each round of semantic analysis, entities older than a threshold (T_age) are discarded from the OE list. This mechanism keeps an updated memory of entities, and these entities can, in turn, help to detect more correct entities from the input. Figure 3.23 summarizes the aging mechanism in progressive entity analysis. In this mechanism, OE_i are the observed entities in the current iteration, similar to RE_i in the generic entity analysis algorithm.

Input: OE: General observed entities set; OE_i: The observed entities of the current iteration (i)

1. For all {e | ((e, age) ∈ OE) ∧ (e ∈ OE_i)}, do OE ← OE − {(e, age)} ∪ {(e, 0)}
2. For all {e | ((e, age) ∉ OE) ∧ (e ∈ OE_i)}, do OE ← OE ∪ {(e, 0)}
3. If ((e, age) ∈ OE) ∧ (age > T_age), then OE ← OE − {(e, age)}

Figure 3.23: Progressive analysis aging mechanism.
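A minimal sketch of the aging mechanism is shown below; the per-iteration increment comes from the surrounding prose rather than from the figure itself, and the default T_age value is an assumption.

    def age_update(OE, OE_i, T_age=3):
        # OE maps entity -> age; OE_i is the set of entities observed this iteration.
        for e in list(OE):            # age everything carried over from earlier rounds
            OE[e] += 1
        for e in OE_i:                # reset re-observed entities / insert new ones at age 0
            OE[e] = 0
        for e, age in list(OE.items()):
            if age > T_age:           # discard entities not refreshed recently
                del OE[e]
        return OE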

Applying progressive analysis on topics is similar to entities, with the difference that the observed entities (OE) set is replaced with observed topics (OT). In each round of analysis, the target topics are selected by the analysis algorithm and added to the OT set. If the new topics already exist in OT, their age is reset to zero. In each round, all the topics older than a threshold are discarded. Both lists, OE and OT, are used to score the new candidate entities or topics in the new rounds of analysis. The rationale behind such a progressive analysis mechanism is that the entities in, or topics of, successive components of the input are coherently connected. A change in context can happen, but it is gradual. That is why the sets discard the entities or topics which are not refreshed in the context after a while.

3.8 Applications

This section is dedicated to the introduction of a number of applications which can benefit from the entity detection and semantic analysis algorithms discussed earlier. The goal is to introduce a variety of problems with different input and output types and, for each problem, discuss how every element of semantic analysis is customized to address the issues of that specific problem. In each case, a potential solution is proposed. However, the proposed solution is not the only possible approach to solving that problem. It is possible to tackle these problems with tools other than the ones mentioned in this research. It is also possible to tackle them using the semantic analysis mentioned earlier but with various configurations. The potential solutions here provide one (possibly the most generic) configuration. For one specific application (topic identification), various configurations are discussed in Chapter 4. The starting application is called the Parla project and it is the initial motivation behind this research.

3.8.1 Parla Project

The Parla project was defined as a project with multiple research aspects. In general, it was a search engine for media content. The goal was to design a system which can provide a portal for accessing audio and video material using search abilities which can search through the content of an audio or video file, similar to how regular search engines search within textual web pages. This system would need a number of components to work properly. The main components of such a system are:

• Automatic speech recognition engine to convert audio/video to text

• Search and indexing engine for information retrieval purposes

• User interface as a portal for users to access and work with the system


Figure 3.24: The structure of the Parla project.

Figure 3.25: The components of the Parla project.


However, a generic speech recognition engine cannot perform reliably for all possible inputs from all possible topics. To increase the performance of the speech recognition engine, a semantic analysis engine was also included in the system. The semantic engine performed two actions. First, after a transcription of an audio file was acquired using the generic speech recognition engine, the semantic engine would classify the text into a topic. Once the topic of the audio file was known, the file was processed again using a speech recognition engine with a higher-accuracy model dedicated to that topic. The second action performed by the semantic engine was to identify the topic of the user's query to the system and provide results related to that topic. This last part of the semantic engine, which we call "Query Topic Identification", was the motivation behind this research. Figure 3.24 and Figure 3.25 show the final structure and components of the Parla project.

The query topic identifier component of the Parla project, shown in Figure 3.25, is the component of interest here. The other components of the project are not discussed as they are out of the scope of this dissertation. The Parla project, however, is a nice showcase of how the semantic analysis methods explained earlier can be used in a real-world problem. The Parla project was limited to four topics for practical and developmental reasons, but the concept works for as many topics as desired. The query topic identification task was therefore to receive a user text query and identify to which of the four specified topics it belongs. The four topics used in the project were: Economy, Politics, Science & Technology, and Sports. Figure 3.26 shows a web page of the final system in action displaying the search results for the term 'bank'. To present the solution used to develop the query topic identifier component, the algorithm is formally described in Table 3.7.

Table 3.7: Query topic identifier algorithm description.

Algorithm: Parla
Component: Query Topic Identifier
Goal: Given the user's query, determine to which of the four topics it belongs.
Input: W: Input query as a bag of words (no necessary order)
Output: t_i: One of the four topics (t_1, t_2, t_3, t_4)

Figure 3.26: A screenshot of the Parla system in action.

The next step is to define each part of the solution according to the methodology described in Section 3.6 and Section 3.7. The first action is to define the input for the entity detection algorithm. Since a textual query submitted to a search engine is not necessarily an ordered text, the best model to describe the query is a bag of words as defined in Equation (3.17). Moreover, it is assumed that the query is, in nature, a short text containing a few words [100]. This means that text breaking is not necessary and the input is only one component. Furthermore, stop words are not removed in the Parla project. Therefore, the input is Comp_1 = {k_1, k_2, ..., k_N} in which N is the total number of input words and k_i are the input words, i.e. keywords.

Once the search division of the entity detection algorithm finds the candidate entities containing one or more input keywords, a score for the candidate entities should be defined. The score for candidate entities is based on Equation (3.25) with no ordering weight and with character length as the penalty weight.


Therefore, the score for candidate entities is equal to S^1(e_c) = N_k × C_k / C_e. Afterwards, the equivalency division adds the redirects with the score mentioned, but the disambiguations are not added as they are supposed to be scored as separate entities based on their annotated main titles, e.g. "Phoenix (mythology)" for "Phoenix". The production division also applies a top-100 limit on the entity detection output. However, the limit is applied to the topics of the candidate entities rather than the candidate entities themselves. The reason is that, in order to map the input to one of the four target topics, the extracted candidate entities should be mapped to their topics, and then the semantic analysis can be performed between the topics of the candidate entities and the target topics. The topics of the candidate entities are called base topics and are denoted t_b, which means that for a certain candidate entity HasCategory(e_c, t_b) is true, i.e. t_b is a topic of e_c. Once all the base topics are gathered, a score parameter can be defined for each base topic. First, a parameter called the topic-to-keyword weight W^{t_b}_{k_i}(e_c) is defined as:

$$
Contains(e_c,k_i) \wedge HasCategory(e_c,t_b) \;\Rightarrow\; W^{t_b}_{k_i}(e_c) = S^{1}(e_c)
\tag{3.50}
$$

The topic-to-keyword weight specifies that if the candidate entity e_c contains the keyword k_i and is related to the topic t_b, then the weight W^{t_b}_{k_i}(e_c) is equal to the score of that candidate entity. Once all the possible topic-to-keyword weights are calculated, a parameter called density is defined which essentially represents the score of a base topic. The density of each base topic t_b is defined as:

$$
D(t_b) = \sum_{k_i} \max_{e_c}\big(W^{t_b}_{k_i}(e_c)\big)
\tag{3.51}
$$

Once the densities of all base topics are calculated, the production division ranks the base topics by their densities, keeps the top 100, and discards the rest.
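A minimal sketch of Equations (3.50) and (3.51) is given below; the helper names keywords_of(e_c) (the query keywords featured in e_c) and topics_of(e_c) (the base topics with HasCategory(e_c, t_b)) are assumptions used for illustration.

    def base_topic_densities(candidates, score, keywords_of, topics_of):
        W = {}                                    # (t_b, k_i) -> best topic-to-keyword weight
        for ec in candidates:
            for tb in topics_of(ec):
                for ki in keywords_of(ec):
                    key = (tb, ki)
                    W[key] = max(W.get(key, 0.0), score(ec))   # Equation (3.50) + max over e_c
        D = {}
        for (tb, _ki), w in W.items():            # Equation (3.51): sum over keywords
            D[tb] = D.get(tb, 0.0) + w
        return D                                  # rank by density and keep the top 100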

The final part of the solution is to specify which type of semantic distance is used and which parameters define the commonality score of the target topics. The semantic distance used in this project is the strictly upward topic-to-topic distance between the base topics and the target topics, because the four target topics are very general topics and, in fact, most of them stand at level 1 in the topic graph. The strictly upward condition was chosen rather than upward to avoid multiple routes in the distance calculation. Equation (3.52) expresses the commonality equation (3.49) customized for the query topic identifier task of the Parla project. As can be seen, the density function here acts as the f function of the original commonality equation since it converts the score of candidate entities into a new value. The mapping function (M) is HasCategory, which asserts that t_b is a topic of the candidate entity. Finally, the h function here is the cubic power of the distance plus a small value. The small value is added to avoid division by zero in case of zero distances. The h function is empirically tuned for this specific application and has no theoretical rationale.

$$
g \in \{1,2,3,4\} \;\Rightarrow\; S^{*}_{g} = \sum_{b} \frac{D(t_b)}{\big(Dist^{SU}_{TT}(t_b,t_g)\big)^{3} + 0.00001}
\tag{3.52}
$$

In Equation (3.52), t_g indicates one of the four target topics and S*_g indicates the final score of each target topic. Once all four scores are calculated, the target topic with the highest score is selected as the topic of the input query and the results of the search engine shown in Figure 3.26 are filtered accordingly. It is notable that in the development of the query topic identifier component of the Parla project, the Wikipedia categories used to represent the four target topics are in fact 5 categories: "Science" and "Technology" are each an individual category. Therefore, the semantic analysis is in fact performed using 5 target topics, but those two topics are considered as one when the highest-scored topic is returned. This application showcased the use of semantic analysis in a real-world problem. The next application reiterates the steps of the topic identification application explained here, with the difference of being a generic problem rather than the specific case discussed here.
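For concreteness, a minimal sketch of the Equation (3.52) scoring step is given below; dist_su stands for the strictly upward distance Dist^SU_TT and is an assumed callable, not part of the original specification.

    def parla_target_scores(densities, dist_su, targets):
        # densities: base topic -> D(t_b); targets: the four (in practice five) target topics
        scores = {tg: sum(d / (dist_su(tb, tg) ** 3 + 0.00001)
                          for tb, d in densities.items())
                  for tg in targets}
        winner = max(scores, key=scores.get)      # topic assigned to the query
        return winner, scores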

3.8.2 Topic Identification

Topic identification is the main application used in this research. While the setup of the application is similar to the Parla project explained above, topic identification is defined as the generalization of the Parla project. The general approach is described in this section, while Chapter 4 provides a comprehensive study of the parameters involved and the effects of each of those parameters. The topic identification problem can be defined as a generalization of the query topic identifier component: given a textual query, find the top topics, from a set of specified topics, which identify the query best semantically. Thus, the difference with the Parla project is two-fold: one, the target topics can be any number; two, the result can be one or more target topics. This means topic identification should be able to match a query to any number of target topics. Figure 3.27 provides the formal description of the topic identification algorithm and iterates the steps of a potential generic solution.

Goal: Given the user's query, find the top M topics semantically most relevant to the query.

Input: W: Input query as a bag of words (no necessary order); T_G: The set of target topics
Output: (t_g, r_g): A subset of T_G indicating the semantically most relevant topics to the query and their ranks (r_g), for (t_g ∈ T_G) ∧ (1 ≤ g ≤ M) ∧ (M ≤ |T_G|)

1. Pre-process the input by removing stop words and creating one component: Comp_1 = {K_1, ∅, ∅}, K_1 = {k_i | (1 ≤ i ≤ N) ∧ (∃ w ∈ W: k_i = w)}
2. Find all the candidate entities (e_c, INFO_c) ∈ E_1 using Comp_1
3. Calculate the score of all candidate entities: S^1(e_c) = N_k × N_k / N_e
4. Find all base topics: the topics of all candidate entities (t_b)
5. Calculate all topic-to-keyword weights: Contains(e_c, k_i) ∧ HasCategory(e_c, t_b) ⇒ W^{t_b}_{k_i}(e_c) = S^1(e_c)
6. Calculate the density of all base topics: D(t_b) = Σ_{k_i} max_{e_c}(W^{t_b}_{k_i}(e_c))
7. Calculate the score of all target topics: S*_g = Σ_b [ D(t_b) / e^{Dist_TT(t_b, t_g)} ]
8. Sort all t_g based on S*_g descending, rank each t_g based on its position as r_g, and return the top M as (t_g, r_g)

Figure 3.27: Topic identification algorithm.


The above algorithm takes a similar approach to the Parla project query topic identifier algorithm mentioned earlier. However, there are a few differences, applied to the generic approach so that it fits most cases rather than the specific case of the Parla project. Below is a list of the differences and their justifications:

• Stop words are generally removed, as is common practice. Based on the application and whether the query includes exact-search quotation marks, stop word removal can be applied to or skipped for different parts of the input query.

• The candidate score penalty weight here uses the ratio of the number of featured words to the number of total entity words. This is a generic choice. If an application regards longer words as more important, the character count proportion might be a better choice.

• The base topics are not limited in this generic approach. All the possible base topics will affect the final score of target topics.

• The chosen semantic distance is generic topic-to-topic distance as we do not know in advance if the target topics are general ones or specific ones. Therefore, no upward or downward specification should be applied here.

• The h function applied to the distance is an exponential function. The reason for choosing this function is that it gives higher weight to closer topics than to further ones. Due to the nature of the exponential function, higher distances produce very large denominators which, in turn, severely shrink the effect of that specific topic in the summation. Therefore, the closer base topics are the dominant ones in shaping the final score of a target topic. There are, however, many other possibilities for the h function, as discussed in Chapter 4, but the exponential function has been shown to be a good generic choice, especially when the base topics are not limited (compare 25 to 100 base topics in Section 4.2). A minimal sketch of this scoring step is shown after this list.
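The sketch below illustrates steps 7 and 8 of Figure 3.27 with the exponential h function; dist_tt is an assumed callable for the generic topic-to-topic distance and top_m is simply the M of the figure.

    import math

    def topic_identification_scores(densities, dist_tt, targets, top_m=3):
        # densities: base topic -> D(t_b); targets: the set of target topics T_G
        scores = {tg: sum(d / math.exp(dist_tt(tb, tg))      # closer base topics dominate
                          for tb, d in densities.items())
                  for tg in targets}
        ranked = sorted(scores, key=scores.get, reverse=True)
        return [(tg, rank + 1) for rank, tg in enumerate(ranked[:top_m])]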

3.8.3 Speech Recognition

The application of semantic analysis in speech recognition is a very specific one for a specific setup. Semantic analysis can be applied as a secondary algorithm on top of the speech recognition algorithm in order to improve its performance. Here, the setup is briefly explained and a potential solution is described. Later, in Section 4.3, the details of the algorithm are discussed.

In certain speech recognition methods, after the audio data of a sentence is processed, it is possible that two or more candidate sentences are produced as the result of the recognition algorithm. Each of these sentences might include correctly or incorrectly transcribed words. For instance, it is possible that the first two words of the first sentence are the correct transcriptions while words 3, 4, and 5 of the second sentence are the correct ones. Therefore, there should be an algorithm which considers all possible combinations of the words and decides which word from which sentence should be selected for the final transcription result of the speech recognition algorithm. Moreover, this process is repeated for the next recognized sentences of the audio file. While a simple voting algorithm might be used for individual sentence processing, a smart semantic analysis can help improve the performance if there is more than one sentence in the audio file and the sentences are semantically connected, as in any regular conversation.

The input for this application consists of sets of equal-length sentences. Each set has an equal number of sentences compared to the other sets, but the number of words in the sentences of one set can be different from that of other sets. For example, the first set has 3 sentences, each with 10 words; the second set has 3 sentences with 6 words each; and so on. It is possible that there are blank words in sentences to make up for missing words. The text breaking algorithm does not do much in this case as the input is already in a structured format. Assuming that each set of sentences has I sentences and there are M total sets of sentences, w^m_{i,j} indicates the j-th word of the i-th sentence of the m-th set, in which 1 ≤ i ≤ I and 1 ≤ m ≤ M. The number of words in each set is different; thus, the number of words in set m is specified by N_m.

The desired output of this application is a sequence of words for each set of sentences which creates one final sentence out of the words of the I sentences in that set. The order of words is important here. This means the j-th word of the final sentence for set m should be the j-th word from one of the I sentences in set m. Therefore, the results can be shown as an ordered sequence of words, each taken from one sentence i of the set: Output_m = w^m_{idx^m_1, 1}, w^m_{idx^m_2, 2}, ..., w^m_{idx^m_j, j}, ..., w^m_{idx^m_{N_m}, N_m} in which idx^m_j ∈ {1, 2, ..., I}. For instance, for I = 3, m = 2, and N_2 = 4, the output can be: Output_2 = w^2_{2,1}, w^2_{1,2}, w^2_{3,3}, w^2_{1,4}, which means the selected words are: the first word from the second sentence, the second word from the first sentence, the third word from the third sentence, and the fourth word from the first sentence of set 2.

Solving this problem can be done in two stages. In the first stage, the entities of each set can be found individually, based only on how well they fit the possible combinations. Based on the candidate entities, the correct combination of words can be suggested. In the second stage, the entities of earlier sets can be used as clues to determine the entities of the next sets more accurately. Therefore, the second stage is a progressive entity analysis. Here, a generic two-stage solution is provided, while in Section 4.3 a more detailed stage-one solution is described along with an example. Figure 3.28 illustrates the specification and the algorithm steps for the potential solution.

Goal: Given a number of sets of multiple sentences, find the best combination of words creating one sentence for each set.

Input: M: Number of sets; I: Number of sentences in each set; N_m: Number of words in set m; w^m_{i,j}: j-th word of the i-th sentence of the m-th set
Output: Output_m = w^m_{idx^m_1, 1}, w^m_{idx^m_2, 2}, ..., w^m_{idx^m_j, j}, ..., w^m_{idx^m_{N_m}, N_m} for 1 ≤ m ≤ M

1. m ← 1, OE ← { }
2. Find all candidate entities matching any exact sequence of w^m_{i,j} and score them as S^m(e_c) = N_k × N_k / N_e
3. If OE ≠ { }, find the semantic distance between OE and the candidate entities, find the commonality score S*_c for all candidate entities, and re-score the candidate entities based on S^m and S*_c
4. Sort the candidate entities in descending order of their scores
5. Select the top candidate as the winning entity, select the words w^m_{i,j} featured in the winning candidate as the final results in Output_m, and add the winning entity to OE_m
6. If all word slots in the current set are decided, go to step 9
7. Based on the words selected in step 5, re-score the candidate entities considering possible conflicts, and remove the winning entity and all entities which do not provide words for the remaining slots from the candidate entity list
8. Go to step 4
9. Perform the progressive analysis aging mechanism using OE and OE_m
10. m ← m + 1
11. If m > M, return Output_m for all sets, otherwise go to step 2

Figure 3.28: Speech recognition word combination solver algorithm.

In the above algorithm, only steps 3 and 9 are related to the second stage, which is the progressive entity analysis. One can remove these steps and the algorithm still works on an individual level, as we will observe in Section 4.3. The points of importance in the above algorithm can be summarized as follows:

• The entity detection scoring equation uses the word number penalty weight, as the lengths of the words are not important in this application. Moreover, since the search is an exact match to a sequence of the input words for that set, the words are found in order, which means that N_k is equal to O_k in this case. We will see a better alternative scoring for this application, which considers how many sentences suggest the same word for a certain position and uses that count as the weight of that word. Instead of N_k, the sum of the word weights can then be used as the word featuring score along with the penalty weight.

• After selecting one winning entity and using it to fix words for certain positions, the already found candidate entities may support combinations that have conflicting words in positions the algorithm has already decided. These conflicts should be considered, and that is why a recalculation of both N_k and the penalty weight should be performed and the entities re-scored (step 7).


• The general observed entities (OE) act as the context tracker (as we will see in the last application of this section). Candidate entities which are closer to the observed entities (which get updated every iteration) are probably better candidates. Therefore, after finding the entity-to-entity semantic score S_c^*, the original score of the candidate entities S^m can be updated accordingly. A simple approach for this update is to normalize S_c^* to [0, 1] and then use S^*(e_c) = (1 + S_c^*) × S^m(e_c) to boost the scores. The entity with the highest S_c^* will have its S^m score doubled, and the one with the lowest S_c^* will have its S^m untouched. A minimal sketch of this boosting and of the aging update is given after this list.

• The winning entities of each set constitute the observed entities of that iteration (OE_m), using which the aging mechanism updates the general OE. This means the OE is regularly updated and follows the context of the audio data.
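To make the re-scoring and aging steps concrete, the following is a minimal Python sketch under stated assumptions, not the thesis implementation: the semantic_commonality helper and the exact bookkeeping of entity ages are assumed for illustration only.

def boost_scores(candidates, observed_entities, semantic_commonality):
    # candidates: dict mapping entity -> base score S^m(e_c)
    # observed_entities: current OE (e.g. a dict of entity -> age)
    # semantic_commonality: assumed helper returning a raw commonality value
    if not observed_entities:
        return dict(candidates)
    raw = {e: semantic_commonality(e, observed_entities) for e in candidates}
    lo, hi = min(raw.values()), max(raw.values())
    span = (hi - lo) or 1.0                     # guard against identical values
    boosted = {}
    for e, score in candidates.items():
        sc_star = (raw[e] - lo) / span          # normalize S_c^* to [0, 1]
        boosted[e] = (1.0 + sc_star) * score    # S^*(e_c) = (1 + S_c^*) x S^m(e_c)
    return boosted

def age_update(oe_ages, newly_observed, t_age):
    # One plausible reading of the aging mechanism: entities seen this
    # iteration are refreshed, the rest grow older and are dropped past T_age.
    aged = {e: a + 1 for e, a in oe_ages.items()}
    for e in newly_observed:
        aged[e] = 0
    return {e: a for e, a in aged.items() if a <= t_age}

For example, after each set m the winning entities OE_m would be passed as newly_observed, so the context gradually forgets entities that stop appearing in the input.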

3.8.4 Text Prediction

Text prediction is another potential problem to which the semantic analysis can be applied. There are many commercial methods in text prediction, especially in mobile application development. The best example is the SwiftKey keyboard application [104] for Android devices. This product uses a sophisticated probabilistic approach to predict the next word the user wants to type and suggests it. Suggesting words facilitates typing on touch devices, as the keyboards are not physical and, in the case of mobile phones, are very small, which makes typing harder than on regular keyboards. The product is now moving towards providing context-dependent keyboards as well. For instance, it provides a SwiftKey healthcare version, customized for people using their devices in a healthcare environment. This version considers healthcare topics to be the dominant ones when predicting words and also contains vocabulary and knowledgebases dedicated to healthcare, such as medicine and illness names, which are not that common in everyday use. The solution provided here follows a similar rationale of trying to guess the context of the text and then suggesting words (or phrases) based on the context acquired from the previous words read from the input. The approach differs, however, in that the suggestions in this solution are determined dynamically based on the conversation context, while products such as SwiftKey rely on pre-calculated n-gram probability tables.


The problem definition here is as follows: given a continuous stream of words as input, provide a list of possible words once a few starting letters of the current word are given. This means the prediction happens after a number of starting letters is known and, therefore, the list of possible words is limited to the ones which start with those known letters. However, those letters can still point to many possible words, which means the algorithm needs a method for ranking possible words and returning the top N as the answer. Note that the number of predicted words (N) is usually limited depending on the screen size of the phone or other user interface constraints. Therefore, it is important that the selection of the top N is performed accurately. Moreover, the number of letters given before a prediction is made (designated by M here) can be variable too (M = 3 is a common choice). Figure 3.29 provides the algorithm specification and the steps.

Goal: Given a continuous stream of words as input, provide a list of possible words when a few starting letters of the current word are given.
Input:
M : number of letters of the current word received before performing prediction
N : number of predicted words to return
P : number of previous words to include in n-gram entity detection
I : current word position
L_I : the starting letters of the current word (w_I)
w_i : previous input words for 1 ≤ i ≤ I − 1, and the current word for i = I
Output: Output_I = pw_1, pw_2, ..., pw_N, the N predicted words for the current position I

1. i ← 1, OE ← { }

2. Read the next M letters from the input; if there is no more input, exit

3. Find all candidate entities which start with "L_I" or "w_{I−1} L_I" or "w_{I−2} w_{I−1} L_I" or ... up to "w_{I−P} ... w_{I−2} w_{I−1} L_I"; score the candidate entities using S^I(e_c) = N_k × N_k/N_e

4. If OE ≠ { }, find the semantic distance between OE and the candidate entities, find the commonality score S_c^* for all candidate entities, and re-score the candidate entities based on S^I and S_c^*

5. Sort candidate entities descending over their scores

6. Return N distinct words Output_I = pw_1, pw_2, ..., pw_N that start with the letters L_I, taken from the corresponding top N candidate entities

7. Observe the complete current word w_I; if it matches any of the top N predicted words, add the candidate entity containing that word into OE and update OE using the aging mechanism

8. i ← i + 1

9. Go to step 2

Figure 3.29: Text prediction algorithm.

The potential solution provided above is a progressive entity analysis algorithm which keeps a list of observed entities and ranks candidate entities based on the context created by the observed entities. Furthermore, the algorithm not only tries to find candidate entities starting with the known letters but also entities which match previous words plus the known letters. For instance, suppose that the observed input so far is: “The constitution of the united” and the next 3 letters of the next word are: “sta”. The algorithm tries to find all the candidate entities which start with the letters “sta”, such as “Stable”, “Standard”, and “State”, as well as entities which match previous words plus the known letters of the next word, such as “United States”, which matches the previous input word “united” and then the 3 letters “sta” in the word “States”. The number of previous words considered in this fashion is another parameter of the algorithm, denoted by P here. Finally, the observed entity list is updated using the aging mechanism. However, the age parameter used here should be larger than the one used, for example, in the speech recognition application. The reason is that in an application like speech recognition, the update happens after many entities have been recognized in an input set and added to the observed entity list, which amounts to a batch update. In this application, however, the update happens after every single word prediction. Therefore, the entities can become outdated very quickly, which is not desirable, and thus the age threshold (T_age) should be a larger number.
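As an illustration of the prefix-matching step described above, the following Python sketch collects and scores candidates; the entity index and its starting_with lookup are assumptions made for this example, not part of the thesis implementation.

def candidate_entities_for_prefix(entity_index, previous_words, prefix, p):
    # Step 3 of Figure 3.29 as a sketch: try "L_I", "w_{I-1} L_I", ...,
    # "w_{I-P} ... w_{I-1} L_I" and keep the best score per entity.
    # entity_index.starting_with(text) is an assumed prefix lookup
    # (e.g. over a trie of entity titles).
    candidates = {}
    for n_prev in range(0, p + 1):
        context = previous_words[-n_prev:] if n_prev else []
        query = " ".join(context + [prefix]).lower()
        for entity in entity_index.starting_with(query):
            n_k = n_prev + 1              # query words featured (prefix counted as one)
            n_e = len(entity.split())     # total words in the entity
            score = n_k * n_k / n_e       # S^I(e_c) = N_k x N_k / N_e
            candidates[entity] = max(candidates.get(entity, 0.0), score)
    return candidates

Under this scoring, for the input “the constitution of the united” with prefix “sta”, the entity “United States” (N_k = 2, N_e = 2) outranks both “Stable” (N_k = 1, N_e = 1) and “United States of America” (N_k = 2, N_e = 4), matching the preference for entities with fewer extra words.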

The text prediction algorithm is yet another example of progressive entity analysis used to select a better candidate entity at every pass of the algorithm over the input. A few points about the algorithm can be discussed further as follows:

• The candidate entities are matched to the current word letters (L_I) and the P previous words using exact match and, therefore, the words are all in order. However, each candidate entity might include extra words after the word starting with the read letters. For instance, continuing the example mentioned earlier, the entity “United States of America” matches the input “The constitution of the united sta…” but has two extra words, “of America”. The preference is to find an entity which has fewer extra words and matches the input more completely, hence the scoring method S^I(e_c) = N_k × N_k/N_e.

• When returning predicted words, the algorithm goes through the candidate entities and returns only the word that matches L_I instead of the entire entity. For instance, if “United States of America” is the candidate entity, the word “States” is returned rather than the rest of the entity “States of America”. In some variations, the developer might decide to suggest multiple consecutive words in each entry; in that case, the other words of the entity can and will be used.

• Since the user will eventually enter the full word, the algorithm can use this feedback and tune itself for further predictions. In this algorithm, such a tuning is performed by adding candidate entities of correctly predicted words into the observed entity list and creating a context for the input seen up to that point.

• Note that in the updating of OE, there is no full per-iteration observed entity list (OE_i). In other words, OE_i consists only of the correctly predicted candidate entities. Therefore, the aging mechanism only updates OE with the correctly predicted candidate entities in each round.


3.8.5 Context Awareness and Topic Tracking

The last applications discussed here are, in fact, meta-applications, meaning that they are tools used within other applications to enhance their performance. We have already observed the context awareness application as the progressive entity analysis used in both the speech recognition and text prediction applications. Keeping track of the context of the input can be useful in most solutions, and keeping track of the topic of the input can provide a second layer of semantic awareness for the different algorithms. In this application, the way the context and topic of the input are used is not discussed; this application only provides the method to recognize the context and the topic of the input. The context and topic can then be used to perform a variety of actions, including finding the semantic distance of new candidate entities to the recognized context entities or topics and re-scoring them. As an example, both the context and topic can be used to perform word sense disambiguation, a common task in most natural language processing and semantic processing problems.

The problem definition here is simply to, firstly, keep track of the input context, which we define as a set of observed entities, and, secondly, keep track of the topic(s) of the input. The topic(s) will be selected from a pre-specified set of topics, and one or more topics can be returned for each component of the input. Both the context and the topics are recognized based on the candidate entities in each round of the algorithm. The candidate entities are extracted from the input as before, but it is possible that extra conditions of the system affect the candidate entity list. These extra conditions are denoted COND in the algorithm.

To tackle this problem, two lists are kept: observed entities (OE) and observed topics (OT). The observed entities are selected from the top-scored candidate entities of each round (similar to the two previous applications). The observed topics are recognized by running the generic topic identification algorithm over the candidate entities of each round. Both lists are updated using the aging mechanism; the age threshold can, however, be different for each list. Figure 3.30 provides the description of the algorithm as pseudocode.

As shown in Figure 3.30, this algorithm does not produce specific results on its own. It only demonstrates how to create and maintain the OE and OT lists. Step 12 provides the opportunity to add extra steps for applications that use context and topic, making this algorithm a customizable template.


Goal: Given a stream of input words, track context and topic of the input.

Input:
w_i : input word at position i
T_G : a set of goal topics for tracking the topic
N : number of topics to return each round
S_T : candidate entity score threshold
COND : specific problem conditions which candidate entities should follow
Output: OE and OT

1. OE ← { }, OT ← { }, i ← 1

2. Perform the text breaking method appropriate for the application and create COMP_i; if no more input is available, exit

3. Find candidate entities E_i using the entity detection algorithm over COMP_i and keep the scores in S^i(e_c)

4. Filter out entities from E_i which do not conform to COND

5. If OE ≠ { }, find the semantic distance between OE and E_i

6. Calculate the commonality score S_c^* for all candidate entities e_c ∈ E_i using the observed entities in OE, and re-score all E_i using S^i and S_c^*

7. Select high-scored entities from E_i with score over S_T and put them in OE_i

8. Use E_i to find the top N goal topics (t_n ∈ T_G) for 1 ≤ n ≤ N using the generic topic identification algorithm

9. Put the top N goal topics in OT_i

10. Update OE based on OE_i using the aging mechanism

11. Update OT based on OT_i using the aging mechanism

12. If needed, use OE and OT to enhance the system performance

13. i ← i + 1

14. Go to step 2

Figure 3.30: Context awareness and topic tracking algorithm.
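As a rough illustration of how the steps of Figure 3.30 could be wired together, here is a Python sketch of the tracking loop. All helpers (text_break, detect_entities, commonality, identify_topics, age_update) are assumed placeholders rather than components defined in this thesis, COND is modeled as a predicate over entities, and normalization of the commonality score is omitted.

def track_context_and_topics(word_stream, goal_topics, n_topics, score_threshold,
                             cond, helpers, t_age_entities, t_age_topics):
    oe, ot = {}, {}                                              # step 1
    for component in helpers.text_break(word_stream):            # step 2
        scored = helpers.detect_entities(component)              # step 3: {entity: S^i}
        scored = {e: s for e, s in scored.items() if cond(e)}    # step 4: apply COND
        if oe:                                                   # steps 5-6: boost by commonality
            scored = {e: (1 + helpers.commonality(e, oe)) * s
                      for e, s in scored.items()}
        oe_i = {e for e, s in scored.items() if s > score_threshold}   # step 7
        ot_i = helpers.identify_topics(scored, goal_topics, n_topics)  # steps 8-9
        oe = helpers.age_update(oe, oe_i, t_age_entities)        # step 10
        ot = helpers.age_update(ot, ot_i, t_age_topics)          # step 11
        # step 12: application-specific use of oe and ot would go here
    return oe, ot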

3.9 Summary

This chapter introduced and explained the proposed methodology of this dissertation in detail. First, it was explained how to create a knowledgebase using Wikipedia or DBpedia. These two sources were comprehensively studied, and they resulted in the final knowledgebase called the semantic graph. Next, the components of the semantic graph were described: a topic graph over the entity graph, linked by an entity-topic connection layer. Once the concepts of entity and topic were sufficiently explained, the main algorithms of the methodology were introduced. First, the entity detection algorithm was discussed and formalized. Then, the semantic analysis algorithms were presented. Semantic distance, which is the main component of semantic analysis, was defined between entities and topics, and the algorithms for its calculation were given. Afterwards, the second most important component of semantic analysis, the commonality score, was presented. Based on these two components, various semantic analyses, namely entity analysis, topic analysis, and progressive analysis, were introduced and explained. Finally, a list of real-world applications was provided and a potential solution for each of these applications was offered. The solutions were studied and the important points were explained. The next chapter studies two of the applications provided above in further depth.


Chapter 4 Experimental Results

4.1 Introduction

This chapter presents a comprehensive study, with experimental results, of the effects of the various parameters in the two applications introduced in Section 3.8. Query topic identification is the main application to which the semantic analysis algorithms are applied in this study. There are many parameters in the algorithms which can be tuned for the best performance; this chapter's main focus is to study the effect of these parameters and show the trends associated with the variation of each one. The chapter also provides a first look at the speech recognition application: it presents an entity detection algorithm along with its results and compares them with the results of a baseline method.

4.2 Query Topic Identification

As mentioned in the previous chapter, query topic identification was a component of the Parla project, which started this research into the applications of semantic analysis using an encyclopedic knowledgebase. In order to achieve the goals of both the Parla project and this research, numerous experiments were designed and performed to study various parameters and their effects. Moreover, these experiments helped shape the methodology explained in the previous chapter. In other words, the described methodology is the result of incremental empirical testing as well as theoretical studies. Therefore, the first two experiments do not match the described methodology completely, but they show how the building blocks of the methodology took shape and how other approaches compare to the described methodology. The upcoming sections summarize these efforts into three experiments.

The first experiment shows the very early attempts at learning whether the entity detection algorithm can withstand noise words and, if so, how resistant it is to various amounts of noise words added to the queries. At this stage, there is no semantic analysis in the form of semantic distance or commonality scoring; the presented algorithm only uses the density function introduced in Section 3.8. The data used in this toy problem are all synthesized using Wikipedia articles.


The second experiment provides results for an iterative algorithm which uses an upward exploration strategy, finding parent topics until it reaches the target topics. The method presented in this experiment has unique characteristics. Firstly, the result of the algorithm can be a variable number of target topics, since it follows a step-by-step exploration and a different number of target topics may be met in each step. Secondly, there is no explicit semantic distance in this method, as the target topics are selected using a 'first visited, first selected' approach. This means that topics with smaller semantic distance are implicitly selected, but no commonality is applied.

Finally, the third experiment is the main experiment, which follows the semantic analysis methodology exactly. Its various setups show the effects of different parameters, such as the number of base topics and the h function in the commonality equation. These last two experiments both use the KDD CUP 2005 datasets.

4.2.1 Toy Problem: Synthesized Data

In this experiment, a query is synthesized using a random entity and then polluted with unrelated words called noise words. The aim of the experiment is to study how resistant the entity detection algorithm is to noise, i.e. how many noise words can be added without losing considerable accuracy. To describe the setup, three aspects should be explained.

4.2.1.1 Data Synthesis

To create the input data for this experiment, a number of queries are generated. Each query is associated with a topic and, therefore, contains a number of words or phrases taken from entities related to that topic. Each query also has a number of noise words, which are random words unrelated to the query topic. The following simple steps are performed to generate a single query associated with its topic (a sketch of these steps is given after the list):

1. Select a random topic and save it as the target topic of the query

2. Select a random entity which belongs to the selected topic and add it to the query. Repeat this step until the desired number of keywords (or key-phrases) has been added to the query.

3. Select and add random unrelated words to the query until the desired ratio between the keywords of step 2 and the noise words is reached.
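The following Python sketch illustrates these three steps under simple assumptions: topic_to_entities and vocabulary are hypothetical data structures, and the noise ratio is interpreted as the fraction of noise words in the final query.

import random

def synthesize_query(topic_to_entities, vocabulary, n_keyphrases, noise_ratio):
    # Step 1: pick a random target topic
    topic = random.choice(list(topic_to_entities))

    # Step 2: add random entities of that topic as key phrases
    entities = topic_to_entities[topic]
    keyphrases = random.sample(entities, k=min(n_keyphrases, len(entities)))

    # Step 3: add unrelated noise words until the desired ratio is reached
    # (assumes noise_ratio < 1; a real implementation would also exclude
    # words related to the chosen topic)
    n_noise = round(len(keyphrases) * noise_ratio / (1.0 - noise_ratio))
    noise = [random.choice(vocabulary) for _ in range(n_noise)]

    return topic, keyphrases + noise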


4.2.1.2 Identification Algorithm

The algorithm used to identify the topic of each query is similar to the entity detection algorithm explained for the Parla project and the generic topic identification application in Section 3.8. In a nutshell, given the query, the algorithm finds all the candidate entities containing at least one word of the query. It only uses the number of featured words as the candidate entity score (no penalty or ordering weights). Then, the keyword-to-topic weights are calculated and, based on these weights, the densities of the topics are obtained. The topics are then sorted in descending order of density and the top 100 are kept. Figure 4.1 depicts the steps of the algorithm.

Input: The query as a bag of words

1. Pre-process the input by removing stop words and creating one component: Comp_1 = {K_1, ∅, ∅}, K_1 = {k_i | (1 ≤ i ≤ N) ∧ (∃ w ∈ W : k_i = w)}

2. Find all the candidate entities (e_c, INFO_c) ∈ E_1 using Comp_1

3. Score candidate entities using S^1(e_c) = N_k

4. Find all base topics: the topics of all candidate entities (t_b)

5. Calculate all topic-to-keyword weights: Contains(e_c, k_i) ∧ HasCategory(e_c, t_b) ⇒ W^{t_b}_{k_i}(e_c) = S^1(e_c)

6. Calculate the density of all base topics: D(t_b) = Σ_{k_i} max_{e_c}(W^{t_b}_{k_i}(e_c))

7. Sort all t_b over D(t_b) descending

8. Keep the top 100 base topics and discard the rest

Figure 4.1: Toy problem topic identification algorithm.
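To make the density computation of steps 3-8 concrete, here is a minimal Python sketch; the find_candidate_entities helper, which would come from the entity detection step, is an assumed interface for this example.

from collections import defaultdict

def base_topic_densities(query_keywords, find_candidate_entities, top_n=100):
    # find_candidate_entities is assumed to return, for the keyword set,
    # (entity, featured_keywords, topics_of_entity) triples.
    weights = defaultdict(dict)           # topic -> keyword -> best weight so far
    for entity, featured, topics in find_candidate_entities(query_keywords):
        score = len(featured)             # S^1(e_c) = N_k
        for topic in topics:
            for kw in featured:           # W^{t_b}_{k_i}(e_c) = S^1(e_c)
                if score > weights[topic].get(kw, 0):
                    weights[topic][kw] = score        # keep max over entities

    # D(t_b) = sum over keywords of the maximum weight contributed by any entity
    densities = {t: sum(kw_w.values()) for t, kw_w in weights.items()}

    # Steps 7-8: sort descending and keep the top base topics
    ranked = sorted(densities.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_n])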

4.2.1.3 Evaluation Metrics

The common metrics used in text classification evaluation are the popular recall, precision, and F1 measures. However, due to the randomly synthesized nature of the data in this experiment, we only know to which topic a query belongs, not to which other topics the query does or does not belong. Therefore, when classifying a query, we can only discern whether it is classified in the correct topic (true positive) or not (false negative), which means we can only calculate recall, the ratio of true positives to the sum of true positives and false negatives. To specify whether a query is classified correctly or not, the topic of the query is searched for in the final top 100 topics returned by the algorithm above. If the target topic has the first rank in the list, the query is classified correctly.

A second metric is also used in this experiment. While the recall metric tells us what percentage of the queries were classified correctly, it does not provide any extra information about the queries that were not. If the topic of a query is not identified as the first choice, it is desirable to know where in the list the topic is located. In other words, we want the relative location of the target topic in the returned list, so that even if a query is not classified correctly, we can see whether it was classified close to the top of the list rather than down below. This second metric is called percentile. It differs slightly from the regular percentile definition: a percentile here indicates 100% minus the percentage of the topics with a higher score. For example, if 5 items share the highest score in a list of 100 different items, the percentile for all of those top 5 items is 100% (rather than 95%), and the 6th item, which has a slightly lower score, has a percentile of 95%.
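A short Python sketch of this percentile definition, assuming the returned list is available as a topic-to-score mapping:

def percentile_of_target(scored_topics, target_topic):
    # 100% minus the percentage of topics that scored strictly higher
    # than the target topic; returns None if the target is not in the list.
    if target_topic not in scored_topics:
        return None
    target_score = scored_topics[target_topic]
    higher = sum(1 for s in scored_topics.values() if s > target_score)
    return 100.0 * (1.0 - higher / len(scored_topics))

With this definition, five topics tied at the top of a 100-topic list all receive 100%, and the next topic receives 95%, matching the example above.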

To obtain a better evaluation in the following setups, each setup is repeated 100 times and the recall and percentile metrics are averaged over all repetitions to provide more precise results. The setups test three factors: the first tests the effect of the absolute number of noise words given a constant number of keywords; the second studies the query size versus the ratio of noise words; and the third examines the effect of an ordered search (exact search) versus no order of words in the query. The semantic graph used in the three setups is built from the Wikipedia database dump of September 2008, which contains about 5.5 million entities (titles) and about 400,000 topics (categories) before clean-up, and about 260,000 after clean-up and leveling.

4.2.1.4 Setup 1: Noise Words Effect

In the first setup, we make a preliminary observation of the effect of noise words on our method. The input consists of one key phrase from one entity and a variable number of noise words, varying from 0 to 20. Figure 4.2 shows the results for both measures. The red solid line with diamond markers shows the recall metric and the blue dashed line with circle markers indicates the percentile metric. Note that the recall metric is originally in the range [0, 1]; here it is mapped to [0, 100] in order to be depicted on the same scale as the percentile metric.

Figure 4.2: Toy problem setup 1 performance chart.

Since there is only one key phrase in this setup, it is expected that the addition of noise words affects the performance considerably. The probability of the noise words forming a false entity increases with the number of noise words, specifically when it surpasses the number of keywords in the key phrase; the decline of recall is therefore justified. Percentile, on the other hand, is consistently high even in the presence of excessive noise words. This means that the target topic is always among the top returned topics even when a high amount of noise words is present.

4.2.1.5 Setup 2: Query Size versus Noise Ratio

In the second setup, we examine how much the ratio of key phrases to noise words in a query affects the performance. The input consists of a number of key phrases and noise phrases. In this setup, we change the ratio of noise while keeping the total number of phrases constant in order to cover a wide range of combinations. The total number of phrases is 10, 20, 30, or 40, and we apply noise with a ratio of 0% to 90%. For example, when the total number of phrases is 20 and the noise ratio is 40%, we would have 12 key phrases and 8 noise phrases. Figure 4.3 and Figure 4.4 show the results for recall and percentile respectively.

Figure 4.3: Toy problem setup 2 recall performance chart.

Figure 4.4: Toy problem setup 2 percentile performance chart.


It is evident from the results that the method can guarantee more than 99% percentile and 90% recall performance when the noise ratio is less than 70%. In other words, the method can effectively resist noise up to 70% without a major drop in performance. Such a resistance threshold is high enough for our purposes, considering that the noise ratio will not usually be that high. Such high performance is particularly significant if we consider the size of the goal set, which consists of 260,155 different topics after clean-up.

4.2.1.6 Setup 3: Effect of Ordered Query

The third setup is identical to the second setup except that the query is defined as exact phrases rather than unordered phrases, i.e. the query keeps the order of words, as when placing quotation marks around the key-phrases taken from entities. The results in Figure 4.5 and Figure 4.6 clearly show that both metrics improve strongly when the query is ordered. We can observe about a 10% increase in noise resistance and more similarity between the recall and percentile evaluations compared to having no order. The charts suggest that queries with exact-search quotation marks might contain up to 80% noise words and still maintain high performance in base topic identification. This setup concludes the toy problem experiment; the next experiment incorporates real query data and comparisons with real baseline methods.

Figure 4.5: Toy problem setup 3 recall performance chart.


Figure 4.6: Toy problem setup 3 percentile performance chart.

4.2.2 Iterative Algorithm

The query classifier proposed in this experiment is designed to explore the topic graph from any starting point until it reaches the nearest specified target topics. The main characteristic of this algorithm is that it can return multiple topics per query, and it is the user's choice to specify the maximum number of topics a query can belong to (specified by ClassificationSize in the algorithm). The algorithm then returns topics up to that number as the result. The pseudocode of this algorithm is shown in Figure 4.7. For this experiment, the semantic graph is again created from the Wikipedia version of September 2008.

Define: TargetTopics (a subset of the topic graph), Classification (classification results), ClassificationSize (number of classification results allowed per query)
Input: Query

1. Classification ← { }

2. EntityList ← the candidate entities after stop word removal


3. TopicList ← the base topics of the candidate entities

4. Do for 20 iterations:

a. NewClassification ← subset of TopicList that are in TargetTopics

b. If COUNT(Classification + NewClassification) <= ClassificationSize

i. Classification ← Classification + NewClassification

c. If COUNT(Classification + NewClassification) > ClassificationSize AND COUNT(Classification) > 0

i. Break from loop

d. If COUNT(Classification + NewClassification) > ClassificationSize AND COUNT(Classification) = 0

i. Classification ← Select ClassificationSize elements from NewClassification

ii. Break from loop

e. TopicList ← unvisited parent topics directly connected to TopicList

5. Return Classification

Figure 4.7: Iterative query topic identification algorithm.

4.2.2.1 Exploration Algorithm

The first step of this algorithm is to map the user's query to an initial set of base topics from which the exploration of the graph will begin. This is accomplished by the entity detection algorithm as before. The query is stripped of stop words to keep only keywords; the system then generates the exhaustive list of candidate entities that feature at least one of these keywords, and expands the exhaustive list of base topics pointed to by these candidate entities. Next, the algorithm considers each keyword/entity/topic triplet where the keyword is in the entity and the entity points to the topic, and assigns each one a keyword-to-topic weight equal to the word featuring weight multiplied by the character penalty weight: W^{t_b}_{k_i}(e_c) = S^1(e_c) = N_k × C_k/C_e. As a reminder, N_k is the total number of query keywords featured in the candidate entity, C_k is the character count of the keywords featured in the candidate entity, and C_e is the total number of characters in the candidate entity. The rationale for using character counts in this experiment is to shift some density weight to entities that match longer keywords in the query. The assumption is that, given that the user typically provides fewer than four keywords in the query, having one much longer keyword in the set could mean that this keyword is more important. Consequently, we give a higher weight to the keywords in an entity featuring the longer query keywords and missing the shorter ones, as opposed to an entity featuring the shorter query keywords and missing the longer ones. Finally, the density value of each base topic is computed again using D(t_b) = Σ_{k_i} max_{e_c}(W^{t_b}_{k_i}(e_c)).

This process generates a long list of topics, featuring some topics pointed to by high-weight keywords and summing to a high density score, and many topics pointed to by only lower-weight keywords and having a lower score. The list is trimmed by discarding all topics having a score less than half that of the highest-density topic (as explained by the variable score threshold in the production division of the entity detection algorithm). This trimmed set of topics is the initial set the exploration algorithm will proceed from; it corresponds to "TopicList" at step 3 of the pseudocode in Figure 4.7. Through practical experiments, we found that this set typically contains approximately 28 topics.

Once the initial list of topics is available, the search algorithm explores the topic graph step by step. At each step, the algorithm compares the set of newly-visited topics to the list of target topics defined as acceptable classification labels and adds any targets discovered to the list of classification results. It then generates the next generation of unvisited parent topics directly connected to the current set and repeats the process. The exploration can thus be seen as radiating through the graph from each initial topic. This process corresponds to steps (a) and (e) of the algorithm.

There are two basic termination conditions for the exploration algorithm. The first is when a predefined maximum number of classification results has been discovered. This maximum could, for example, be 1 if the user wants a unique classification for each query, while it was set at 5 in the KDD CUP 2005 competition rules. However, since the exploration algorithm can discover several target topics in a single iteration, it is possible to overshoot this maximum. The algorithm has two possible behaviors defined for that case. First, if some results have already been discovered, then the new topics are all rejected. For example, if the algorithm has already discovered four target topics for a given query out of a maximum of five and two more topics are discovered in the next iteration, both new topics are rejected and only four results are returned. The second behavior is for the special case where no target topics have been discovered yet and more than the allowed maximum are discovered at once. In that case, the algorithm simply selects the maximum allowed number of results from the set at random. For example, if the algorithm discovers six target topics at once in an iteration, five of them are kept at random and returned as the classification result.

The second termination condition for the algorithm is reaching a maximum of 20 iterations. The rationale is that, at each iteration, both the set of topics visited and the set of newly-generated topics expand. The limit of 20 iterations thus reflects a practical consideration, preventing the size of the search from growing without constraint. Moreover, after 20 steps, we find that the algorithm has explored too far from the initial topics for the targets encountered to still be relevant. For comparison, in our experiments, the exploration algorithm discovered the maximum number of target topics in only 3 iterations on average, and never reached the 20-iteration limit. This limit thus also allows the algorithm to cut off the exploration of a region of the graph that is very far removed from target topics and will not generate relevant results.
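The exploration loop and its termination cases can be sketched in Python as follows; parents_of is an assumed helper returning the parent topics of a topic in the graph, and this is only an illustrative reading of Figure 4.7, not the project code.

import random

def iterative_classify(initial_topics, target_topics, parents_of,
                       classification_size=5, max_iterations=20):
    classification = []
    visited = set(initial_topics)
    frontier = set(initial_topics)

    for _ in range(max_iterations):
        newly_found = [t for t in frontier if t in target_topics]
        if len(classification) + len(newly_found) <= classification_size:
            classification.extend(newly_found)                 # step 4(b)
        elif classification:                                   # step 4(c): reject the overshoot
            break
        else:                                                  # step 4(d): overshoot on first hit
            classification = random.sample(newly_found, classification_size)
            break
        # step 4(e): expand to unvisited parent topics
        frontier = {p for t in frontier for p in parents_of(t)} - visited
        visited |= frontier
        if not frontier:          # practical guard, not part of the pseudocode
            break
    return classification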

4.2.2.2 Experimental Results

In order to test this algorithm, it was submitted to the same challenge as the KDD CUP 2005 competition [53]. The 37 solutions entered in that competition were evaluated by classifying a set of 800 queries into up to 5 categories (topics) from a predefined set of 67 target categories c_i, and comparing the results to the classification done by three human labelers. The 800 text queries were meaningful English queries selected randomly from MSN search logs, unedited and including the users' typos and mistakes. The solutions were ranked based on overall precision and overall F1 value, as computed by equations (4.1) to (4.6). The competition's Performance Award was given to the system with the top overall F1 value, and the Precision Award was given to the system with the top overall precision value within the top 10 systems evaluated on overall F1 value. Note that participants had the option to enter their system for precision ranking but not F1 ranking, or vice-versa, rather than both, and several participants chose to use that option. Consequently, the top 10 systems ranked for precision are not the same as the top 10 systems ranked for F1 value, and there are some N/A values in the results in Table 4.1.

Precision = Σ_i (number of queries correctly labeled as c_i) / Σ_i (number of queries labeled as c_i)   (4.1)

Recall = Σ_i (number of queries correctly labeled as c_i) / Σ_i (number of queries belonging to c_i)   (4.2)

F1 = (2 × Precision × Recall) / (Precision + Recall)   (4.3)

Overall Precision = (1/3) Σ_{j=1}^{3} (Precision against labeler j)   (4.4)

Overall Recall = (1/3) Σ_{j=1}^{3} (Recall against labeler j)   (4.5)

Overall F1 = (1/3) Σ_{j=1}^{3} (F1 against labeler j)   (4.6)

In order for this algorithm to compare to the KDD CUP competition results, we need to use the same set of category labels. Fortunately, the size and level of detail of Wikipedia's topic graph makes it possible to identify topics to map any set of labels to. In this case, we identified 84 target topics in Wikipedia corresponding to the 67 KDD CUP category set.
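Before turning to the results, the overall metrics above can be sketched in Python; the query-to-category mappings for the system output and the three human labelers are assumed data structures for the example.

def overall_metrics(system_labels, labeler_sets):
    # system_labels: dict query -> set of categories assigned by the system
    # labeler_sets: list of three dicts, one per human labeler
    def against(labels):
        correct = sum(len(system_labels[q] & labels.get(q, set()))
                      for q in system_labels)                    # eqs. (4.1)/(4.2) numerators
        assigned = sum(len(c) for c in system_labels.values())   # eq. (4.1) denominator
        relevant = sum(len(c) for c in labels.values())          # eq. (4.2) denominator
        precision = correct / assigned if assigned else 0.0
        recall = correct / relevant if relevant else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)                    # eq. (4.3)
        return precision, recall, f1

    per_labeler = [against(labels) for labels in labeler_sets]
    n = len(per_labeler)
    # eqs. (4.4)-(4.6): average each metric over the labelers
    return tuple(sum(metrics[i] for metrics in per_labeler) / n for i in range(3))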

With the mapping done, the 800 test queries were classified using the above algorithm and the results were evaluated on overall precision and F1 following the KDD CUP guidelines. Our results are presented in Table 4.1 along with the KDD CUP mean and median, the best system on precision, the best system on F1, and the worst system overall, as reported in [53]. As can be seen from that table, our system performs well above the competition average and in fact ranks in the top 10 of the competition.

Table 4.1: Iterative topic identification algorithm performance in baseline standings.

System            F1 Rank   Precision Rank   Overall Precision   Overall F1
Best F1           1         N/A              0.4141              0.4444
Best Precision    N/A       1                0.4237              0.4261
Our algorithm     10        7                0.3081              0.3005
Mean              18        13               0.2545              0.2353
Median            19        15               0.2446              0.2327
Worst             37        37               0.0509              0.0603


It is interesting to consider not only the final classification result, but also the performance of our exploration algorithm. To do this, we studied how frequently each of the termination conditions was reached. In summary, there are five distinct ways the algorithm can terminate. The first is "no initial list", meaning that the initial keyword-to-topic mapping failed to generate any topics for our initial set and the exploration cannot begin. If an initial set of topics is generated and the exploration begins, there are still four ways it can terminate. The first is "failure", if it reaches the cutoff value of 20 iterations without encountering a single target topic. The second termination condition is "exploration limit", if the algorithm reaches the cutoff value of 20 iterations but did discover some target topics along the way; these topics are returned as the classification results. The third termination is "overshoot", if the algorithm discovers more than the maximum number of results in a single iteration and must select results randomly. And the final termination condition is "topic limit", which is when the algorithm has already found some topics and discovers more topics that bring it to or above the set maximum; in the case it goes above the maximum, the newly-discovered topics are discarded. In each case, we obtained the number of query searches that ended in that condition, the average number of iterations it took the algorithm to reach that condition, the average number of target topics found (which can be greater than the maximum allowed when more topics are found in the last iteration), and the average number of target topics returned. These results are presented in Table 4.2.

Table 4.2: Iterative topic identification algorithm exploration results.

Termination         Number of Queries   Average Number of Iterations   Average Number of Target Topics Found   Average Number of Target Topics Returned
No initial list     52                  0                              0                                       0
Failure             0                   20                             0                                       N/A
Exploration limit   0                   20                             0                                       N/A
Overshoot           28                  2.4                            7.0                                     5
Topic limit         720                 3.3                            7.8                                     3.3

As can be seen from the above table, two of the five termination conditions we identified never occur at all. They are the two undesirable conditions where the exploration strays 20 iterations away from the initial topics. This result indicates that our exploration algorithm never diverges in the wrong direction or misses the target topics, nor does it end up exploring regions without target topics.

However, there is still one undesirable condition that does occur, namely the algorithm selecting no initial topics to begin the search from. This occurs when no entities featuring query words can be found, typically because the query consists only of unusual terms and abbreviations. For example, one query consisting only of the abbreviation "AATFCU" failed for this reason. Fortunately, this does not happen frequently: only 6.5% of queries in our test set terminated for this reason. The most common termination conditions, accounting for 93.5% of query searches, are when the exploration successfully discovers the maximum number of target topics, either over several iterations or all in one, with the former case being much more common than the latter. In both cases, we can see that the system discovers these topics quickly, in fewer than 4 iterations on average. This shows the success and efficiency of our exploration algorithm. Moreover, the classification results compare favorably to those of the KDD CUP 2005 competition: this algorithm would have ranked 7th on precision in that competition, with an increase of 6.4% compared to the competition median, and 10th on F1 with a 6.9% increase compared to the median. The next experiment, however, tries to improve these results even further by using the complete semantic analysis methodology presented in this dissertation and tuning its parameters to create the optimal query identification system.

4.2.3 Semantic Distance Algorithm

This experiment is a natural continuation of the previous experiment. The semantic graph used is the same as in the last experiment, with a slightly different cleaning method for the topic graph; there are 5,453,808 entities and 282,271 topics in this semantic graph. However, there are also a few fundamental differences. In this experiment, we intend to fully use the components of the semantic analysis methodology. This means there is no step-by-step exploration; instead we use the semantic distance method explained in the methodology, which also means that the number of returned topics is the same for all queries. Moreover, in contrast with the previous experiment where we only allowed paths going from child to parent topic, we allow any path between two topics, regardless of whether it goes up to parent topics, down to child topics, or zigzags through the graph. Therefore, the semantic distance is the generic Dist_TT(t_b, t_g). The reason for adopting this more permissive approach is to make our classifier more general: the parent-only approach may work well in the case of the previous experiment, where all the selected goal topics were higher in the hierarchy than the average base topic, but it would fail miserably in the opposite scenario where the base topics are parents of the goal topics. Finally, we can note that, while exploring the graph to find the shortest distance between every goal topic and all other topics may seem like a daunting task, for a set of about 100 goal topics, such as we used in this experiment, it can be done in only a few minutes on a regular computer.

The identification algorithm used in this experiment is very similar to the generic topic identification algorithm described in Figure 3.27. It has, however, a few distinctions in order to make it possible to study the important aspects of the algorithm. Figure 4.8 shows the steps of this new algorithm. Note that the input to this algorithm is still the words of the query (W) and the set of target topics (T_G).

Input: W, T_G

1. Pre-process the input by removing stop words and creating one component: Comp_1 = {K_1, ∅, ∅}, K_1 = {k_i | (1 ≤ i ≤ N) ∧ (∃ w ∈ W : k_i = w)}

2. Find all the candidate entities (e_c, INFO_c) ∈ E_1 using Comp_1

3. Calculate the score of all candidate entities: S^1(e_c) = N_k × P_k

4. Find all base topics: the topics of all candidate entities (t_b)

5. Calculate all topic-to-keyword weights: Contains(e_c, k_i) ∧ HasCategory(e_c, t_b) ⇒ W^{t_b}_{k_i}(e_c) = S^1(e_c)

6. Calculate the density of all base topics: D(t_b) = Σ_{k_i} max_{e_c}(W^{t_b}_{k_i}(e_c))

7. Keep the high-density base topics and discard the rest

8. Calculate the score of all target topics: S^*_g = Σ_b [ D(t_b) / h(Dist_TT(t_b, t_g)) ]

9. Sort all t_g ∈ T_G based on S^*_g descending and return the top M

Figure 4.8: Query topic identification algorithm.

The first distinction of the algorithm is the score of the candidate entities, which is defined with a variable penalty weight, because the penalty weight is one of the parameters to be studied. In this experiment, all three penalty weights introduced by Equation (3.27) are studied. Regardless of the penalty weight, there is an interesting point about the keyword weights. We can see from the definition of W^{t_b}_{k_i}(e_c) that every keyword appearing in an entity receives the same weight S^1(e_c). Moreover, when an entity is composed exclusively of query keywords (i.e. there is no penalty), their weight will be the number of keywords contained in the entity (N_k). The maximum weight a keyword can have is thus equal to the number of keywords in the query; it occurs when an entity is composed of all the query keywords and nothing else. In such a case, we can see that the maximum density a topic can have is the square of the number of query keywords; it happens when each keyword has its maximum value in that topic, meaning that one of the entities pointing to the topic is composed of exactly the query keywords.

Another distinction in the above algorithm is the addition of step 7. At this stage of the algorithm, we have a weighted list of base topics, featuring some topics pointed to by high-weight keywords and summing to a high density score, and many topics pointed to by only lower-weight keywords and having a lower score. In our experiments, we found that the set contains over 3,000 base topics on average. We limit the size of this list by keeping only the set of highest-density topics, as topics with a density too low are deemed too unrelated to the original query to be of use. This can be done either on a density basis (i.e. keeping topics whose density is more than a certain proportion of the highest density obtained for this query, regardless of the number of topics this represents, as we did in the previous experiment) or on a set-size basis (i.e. keeping a fixed number of topics regardless of their density, the approach we prioritize in this experiment). When using the set-size approach, a question arises about how to deal with ties when the number of tied topics exceeds the size of the set to return. In this experiment, we break ties by keeping a count of the number of entities that feature keywords and point to each topic, and giving priority to the topics pointed to by more entities.
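A minimal Python sketch of the set-size trimming with this tie-breaking rule, assuming the densities and supporting-entity counts have already been computed:

def trim_base_topics(densities, supporting_entities, set_size=25):
    # densities: dict topic -> density D(t_b)
    # supporting_entities: dict topic -> number of keyword-featuring entities
    #                      pointing to that topic (used only to break ties)
    ranked = sorted(densities,
                    key=lambda t: (densities[t], supporting_entities.get(t, 0)),
                    reverse=True)
    return ranked[:set_size]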

The last distinction from the generic topic identification algorithm is the h function, which determines how strong the effect of the semantic distance is. In the algorithm, this function is left as a variable so that different options can be studied. There are, of course, options in the literature different from the semantic distance described in this methodology. For example, Coursey and Mihalcea [105] proposed an alternative metric based on graph centrality, while Syed et al. [96] developed a spreading activation scheme to discover related concepts in a set of documents. Some of these ideas could be adapted into our method in future research.

However, even after settling on the shortest-path distance metric, there are many ways we could take the base topics' densities into account in the goal topics' ranking. The simplest option is to use the density as a threshold value: cut off base topics that have a density lower than a certain value, and then rank the goal topics according to which are closest to any remaining base topic, regardless of density. That is the approach we used in the previous experiment using the exploration method. On the other hand, taking the density into account creates different conditions for the system. Since some base topics are now more important than others, it becomes acceptable, for example, to rank a goal that is further away from several high-density base topics higher than a goal that is closer to a low-density base topic. We thus define a ranking score for the goal topics as the sum, over all base topics, of a ratio of their density to the distance separating the goal and the base. There are several ways to compute this ratio; the five options considered in this study are:

S^*_g = Σ_b [ D(t_b) / (Dist_TT(t_b, t_g) + 0.0001) ]   (4.7)

S^*_g = Σ_b [ D(t_b) / (Dist_TT(t_b, t_g) + 0.0001)^2 ]   (4.8)

S^*_g = Σ_b [ D(t_b) / e^{Dist_TT(t_b, t_g)} ]   (4.9)

S^*_g = Σ_b [ D(t_b) / e^{2 × Dist_TT(t_b, t_g)} ]   (4.10)

S^*_g = Σ_b [ D(t_b) / e^{Dist_TT(t_b, t_g)^2} ]   (4.11)

In each of these equations, the score S^*_g of goal topic t_g is computed as the sum, over all base topics t_b, of the density D(t_b) of that topic divided by a function of the distance between topics t_b and t_g. This h function is a simple division in equations (4.7) and (4.8), while the exponentials in equations (4.9), (4.10), and (4.11) put progressively more importance on the distance compared to the density. The addition of 0.0001 in equations (4.7) and (4.8) is simply to avoid a division by zero in the case where a selected base topic is also a goal topic.
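The five combinations can be expressed compactly in Python; the distance function is assumed to be a precomputed shortest-path lookup, and this is only a sketch of the scoring step, not the full classifier.

import math

# h functions corresponding to equations (4.7) to (4.11)
H_FUNCTIONS = {
    "4.7":  lambda d: d + 0.0001,
    "4.8":  lambda d: (d + 0.0001) ** 2,
    "4.9":  lambda d: math.exp(d),
    "4.10": lambda d: math.exp(2 * d),
    "4.11": lambda d: math.exp(d ** 2),
}

def goal_topic_scores(base_densities, distance, goal_topics, variant="4.8"):
    # S*_g = sum over base topics of D(t_b) / h(Dist_TT(t_b, t_g))
    h = H_FUNCTIONS[variant]
    return {g: sum(d / h(distance(b, g)) for b, d in base_densities.items())
            for g in goal_topics}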

Finally, the goal topics with the highest score are returned as classification results. In this experiment, we return the top three topics, to allow for queries to belong to several different topics. We believe that this corresponds to a human level of categorization; for example, in the KDD CUP 2005 competition [53], human labelers used on average 3.3 topics per query. However, this parameter is flexible, and we ran experiments keeping anywhere from one to five goal topics.

The various alternatives and options for our classifier described above were all implemented and tested in order to study the behavior of the system and determine the optimal combination. That optimal combination was then subjected to a final set of tests with new data. In order to compare and study the variations of our system, we again submitted them all to the same challenge as the KDD CUP 2005 competition [53]. All the measures and metrics are the same as explained in the previous experiment and equations (4.1) to (4.6). The only difference is that the mapping between the KDD CUP goal category set and the topic graph was performed somewhat more comprehensively, and we identified 99 goal topics in our topic graph corresponding to the 67 KDD CUP categories. These correspondences are presented in Appendix A.

4.2.3.1 Penalty Weight

The first aspect of the system we studied is the different formulae for the penalty weight of query keywords in an entity. The choice of formula has a direct impact on the system, as it determines which entities are more relevant given the user's query. This in turn determines the relevance of the base topics that lead to the goal topics. A bad choice at this stage can affect the rest of the system.

The weight of an entity, and of the query keywords it contains, is a function of the two parameters presented in Equation (3.25), namely the number of keywords present in the entity and the penalty for extra keywords in that entity. We have introduced three possible mathematical definitions of the penalty weight for an entity: a straightforward proportion of keywords in the entity, the proportion of characters in the entity that belong to keywords, and the proportion of the IDF of the keywords to the total IDF of the entity, as computed with Equation (3.30). We implemented all three equations and tested the system independently with each. In all implementations, we limited the list of base topics to 25, weighted the goal topics using Equation (4.8), and varied the number of returned goal topics from 1 to 5. The results of these experiments are presented in Figure 4.9. The three experiments are shown with different grey shades and markers: dark squares for the formula using the proportion of characters, medium triangles for the formula using the proportion of keywords, and light circles for the formula using the proportion of IDF. Three results are also shown for each experiment: the overall precision computed using Equation (4.4) in a dashed line, the overall recall of Equation (4.5) in a dotted line, and the overall F1 of Equation (4.6) in a solid line.

Figure 4.9: Performance metrics for various penalty weight functions: overall precision (dashed lines), recall (dotted lines) and F1 (solid lines) using N_k × (C_k/C_e) (dark squares), N_k × (N_k/N_e) (medium triangles), and N_k × (ΣF_k/ΣF_e) (light circles).

A few observations can be made from Figure 4.9. The first is that the overall result curves of all three variations have the same shape. This means that the system behaves in a very consistent way regardless of the exact formula used. There is no point where the results given by one equation shoot off into a wildly different range of values from the other two equations. Moreover, while the exact difference in the results between the three equations varies, there is no point where they switch and one equation goes from giving worse results than another to giving better results. We can also see that precision decreases and recall increases as we increase the number of acceptable goal topics. This result was to be expected: increasing the number of topics returned in the results means that each query is classified into more topics, leading to more correct classifications (which increase recall) and more incorrect classifications (which decrease precision). Finally, we can note that the best equation for the penalty weight of extra keywords in entities is consistently the proportion of keywords (N_k / N_e), followed closely by the proportion of characters (C_k / C_e), while the proportion of IDF (ΣF_k / ΣF_e) trails in third position.

It is surprising that the IDF measure gives the worst results of the three, when it worked well in other projects [62]. However, the IDF measure is based on a simple assumption: a word with low semantic importance is one that is used commonly in most documents of the corpus. In our current system, however, the "documents" are article titles, which are by design short, limited to important keywords, and stripped of semantically irrelevant words. This is clearly in contradiction with the assumptions that underlie the IDF measure. We can see this clearly when we compare the statistics of the keywords given in the example in [62] with the same keywords in our system, as we do in Table 4.3. The system in [62] computed its statistics from the entire Wikipedia corpus, including article text, and thus computed reliable statistics, which are presented in columns 2 and 3; in that example, the rarely-used company name WWE is found much more significant than the common corporate nouns chief, executive, chairman and headquartered. On the other hand, in our system WWE is used in almost as many titles as executive and has a comparable F_w score, which is dwarfed by the F_w scores of chairman and headquartered, two common words that are very rarely used in article titles.

Table 4.3: Comparison of IDF of sample keywords (Num_w and F_w computed over the full Wikipedia corpus in [62]; Num*_w and F*_w computed over article titles in our system).

Keyword         Num_w    F_w    Num*_w   F*_w
WWE             2,705    7.8    657      9.0
Chief           83,977   5.6    1,695    8.1
Executive       82,976   5.8    867      8.7
Chairman        40,241   7.2    233      10.1
Headquartered   38,749   7.1    10       13.2

Finally, we may wonder whether the two parts of Equation (3.25) are really necessary (the ordering weight is not used in this experiment at all), especially since the best equation we found for the penalty weight repeats the N_k term. To explore that question, we ran the same test again using each part of the equation separately. Figure 4.10 plots the overall F1 using N_k alone in a light dotted line with circle markers and N_k / N_e in a black dashed line with square markers; it reproduces the overall F1 of N_k × (N_k / N_e) from Figure 4.9 in a medium solid line with triangle markers for comparison. This figure shows clearly that using the complete equation gives better results than using either one of its components.

Figure 4.10: Comparison of the components of the entity score.

4.2.3.2 Size of the Base Topic Set

The second aspect of the system we studied comes at step 7 of the algorithm, when the list of base topics is trimmed down to keep only the most relevant ones. This list initially contains all topics connected to any entity that contains at least one of the keywords the user specified. As mentioned before, the average number of base topics generated by a query is 3,400 and the maximum is 45,000. These base topics are then used to compute the score of the goal topics, using one of the summations of equations (4.7) to (4.11). This test aims to see whether the quality of the results can be improved by limiting the size of the set of base topics used in this summation, and if so, what the approximate ideal size is.

For this test, we used N_k × (N_k / N_e) for the candidate entity score, the best formula found in the previous test. We again weighted the goal topics using Equation (4.8) and varied the number of returned goal topics from 1 to 5. Figure 4.11 shows the F1 value of the system under these conditions when trimming the list of base topics to 500 (black solid line with diamonds), 100 (light solid line with circles), 50 (medium solid line with triangles), 25 (light dotted line with squares), 10 (black dashed line with squares) and 1 (black dotted line with circles).

Figure 4.11: Overall F1 using 1 to 500 base topics.


Figure 4.11 shows clearly that the quality of the results drops if the set of base topics is too large (500) or too small (1). The difference in the results between the other four cases is less marked, and in fact the results with 10 and 100 base topics overlap. More notably, the results with 10 base topics start weaker than the case with 100, spike around 3 goal topics to outperform it, then drop again and tie it at 5 goal topics. This instability seems to indicate that 10 base topics are not enough. The tests with 25 and 50 base topics are the two that yield the best results; it thus seems that the optimal size of the base topic set is in that range. The 25 base topic case outperforms the 50 case, and is the one we will prefer.

It is interesting to consider that in the previous experiment, we used the other alternative we proposed, namely to trim the set based on the density values. The cutoff used was half the density value of the base topic in the set with the highest density; any topic with less than that density value was eliminated. This gave us a set of 28 base topics on average, a result which is consistent with the optimum we discovered in this experiment.
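For illustration, the following is a minimal sketch of the two trimming strategies discussed above (a fixed top-N cut versus a half-of-maximum density cutoff). It assumes a hypothetical dictionary mapping base topics to their densities and is not the thesis implementation; the example densities are taken from Table 4.6 later in this chapter.

```python
# Minimal sketch of the two base-topic trimming strategies.
# base_topics is a hypothetical dict: {topic name: density value}.

def trim_top_n(base_topics, n=25):
    """Keep only the n highest-density base topics."""
    ranked = sorted(base_topics.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])

def trim_by_density_cutoff(base_topics, fraction=0.5):
    """Keep base topics whose density is at least `fraction` of the maximum density."""
    cutoff = fraction * max(base_topics.values())
    return {topic: d for topic, d in base_topics.items() if d >= cutoff}

bases = {"Internet Explorer": 4, "Internet history": 4, "HTTP": 2.67, "Internet": 1}
print(trim_top_n(bases, n=2))            # the two densest base topics
print(trim_by_density_cutoff(bases))     # every topic with density >= 2
```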

4.2.3.3 Goal Topic Score and Ranking

Another aspect of the system we wanted to study is the choice of equations we can use to account for the base topics’ density and distance when ranking the goal topics. The option we used in the previous experiment, to find the nearest goals to any of the retained base topics regardless of their densities, is entirely valid. The alternative we consider here is to rank the goal topics as a function of their distance to each base topic and of the density of that base. We proposed five possible equations to mathematically combine density and distance to rank the goal topics. Equation (4.7) considers both distance and density evenly, and the others put progressively more importance on the distance, up to Equation (4.11).

To illustrate the different impact of equations (4.7) to (4.11), consider three fictional base topics: one which has a density of 4 and is at a distance of 4 from a goal topic, a second with a density of 4 and a distance of 3 from the same goal topic, and a third with a density of 3 and a distance of 3 from the goal. The contribution of each of these bases to the goal topic in each of the summations is given in Table 4.4. As we can see in this table, the contribution of each base decreases as we move down from Equation (4.7) to Equation (4.11), but it decreases a lot more and a lot faster for the base at a distance of 4. The contribution to the summation of the topic at a distance of 4 is almost equal to that of the topics at a distance of 3 when using Equation (4.7), but is three orders of magnitude smaller when using Equation (4.11). That is the result of putting more and more emphasis on distance rather than density: the impact of a farther-away higher-density topic becomes negligible compared to a closer lower-density topic. Meanwhile, comparing the contribution of the two topics of different densities at the same distance shows that, while they are in the same order of magnitude, the higher-density one is always more important than the lower-density one, as we would want.

Table 4.4: Example of the impact of the h function.

Equation   Density 4, Distance 4   Density 4, Distance 3   Density 3, Distance 3
(4.7)      1.00                    1.33                    1.00
(4.8)      0.25                    0.44                    0.33
(4.9)      0.07                    0.20                    0.15
(4.10)     0.001                   0.01                    0.007
(4.11)     4.5×10⁻⁷                0.0005                  0.0004
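To make the ranking step concrete, here is a minimal sketch (with hypothetical dictionary-based data structures, not the actual implementation) of scoring a goal topic by summing an h(density, distance) contribution from each retained base topic. The two h variants shown are assumptions chosen only because they reproduce the (4.7) and (4.8) rows of Table 4.4; the full family of equations (4.7) to (4.11) is defined in Chapter 3.

```python
# Sketch of the goal-topic ranking: sum h(density, distance) over base topics.

def h_47(density, distance):
    return density / distance        # row (4.7): 4/4 = 1.00, 4/3 = 1.33, 3/3 = 1.00

def h_48(density, distance):
    return density / distance ** 2   # row (4.8): 4/16 = 0.25, 4/9 = 0.44, 3/9 = 0.33

def rank_goal_topics(base_densities, distances, h):
    """base_densities: {base: density}; distances: {(base, goal): steps >= 1}."""
    scores = {}
    for (base, goal), dist in distances.items():
        scores[goal] = scores.get(goal, 0.0) + h(base_densities[base], dist)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bases = {"b1": 4, "b2": 4, "b3": 3}
dists = {("b1", "goal"): 4, ("b2", "goal"): 3, ("b3", "goal"): 3}
print(rank_goal_topics(bases, dists, h_48))  # [('goal', ~1.03)] = 0.25 + 0.44 + 0.33
```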

We ran tests of our system using each of these five equations. In these tests, we again set the candidate score as Nk * (Nk / Ne), kept the 25 highest-density base topics, and varied from returning 1 to 5 goal topics. The overall F1 of the variations of the system is presented in Figure 4.12. In this figure, the classification results obtained using Equation (4.7) are shown with a black dashed line with circle markers, Equation (4.8) uses a grey solid line with square markers, Equation (4.9) uses a dashed black line with triangle markers, Equation (4.10) uses a light grey line with circle markers and Equation (4.11) uses a grey line with triangle markers.

We can see from Figure 4.12 that putting too much importance on distance rather than density can have a detrimental impact on the quality of the results: the results using equations (4.10) and (4.11) are the worst of the five equations. Even the results from Equation (4.9) are of debatable quality: although it is in the same range as the results of equations (4.7) and (4.8), it shows a clear downward trend as we increase the number of goal topics considered, going from the best result for 2 goals to second-best with 3 goals to a narrow third place with 4 goals and finally to a more distant third place with 5 goals.

Out of curiosity, we ran the same test a second time, but this time keeping the 100 highest-density base topics. These results are presented in Figure 4.13, using the same line conventions as Figure 4.12. It is interesting to see that this time it is Equation (4.9) that yields the best results with a solid lead, not Equation (4.8). This indicates a more fundamental relationship in our system: the best summation for the goal topics is not an absolute but depends on the number of base topics retained.

With a smaller set of 25 base topics, the system works best when it considers a larger picture including the impact of more distant topics. But with a larger set of 100 base topics, the abundance of more distant topics seems to generate noise, and the system works best by limiting their impact and by focusing on closer base topics.

Figure 4.12: Comparison of the impact of the various h function equations.

4.2.3.4 Number of Goal Topics Returned

The final parameter in our system is the number of goal topics to return. We have already explained that our preference to return three goal topics is based on a study of human classification – namely, in the KDD CUP 2005 competition, human labelers used on average 3.3 topics per query. Moreover, looking at the F1 figures we presented in the previous subsections, we can see that the curve follows a pattern of diminishing returns, with each extra topic returned giving a smaller increase in F1 value. Returning a fifth goal topic gives the least improvement compared to returning only four goal topics, and in fact in some cases it causes a drop in F1. Returning three topics seems to be at the limit between the initial faster rate of increase of the curve and the later plateau.

Figure 4.13: Comparison of the impact of the h function equations using 100 base topics.

Another way to look at the question is to consider the average score of goal topics at each rank, after summing the densities of the base topics for each and ranking them. If, on average, the top-ranked topics have a large score difference from the rest of the graph, it will show that there exists a robust division between the likely-correct goal topics to return and the other goal topics. The opposite observation, on the other hand, would reveal that the rankings could be unstable and sensitive to noise, and that there is no solid score distinction between the goals our system returns and the others.

For this part of the study, we used the summation of Equation (4.8). We can recall from our discussion earlier that the maximum keyword weight is Nk and the maximum topic density is Nk².

Queries in the KDD CUP dataset are at most 10 words long, giving a maximum base topic density of 100. This in turn gives a maximum goal topic score of 1,002,400 using Equation (4.8) and 25 base topics in the case where the distance between the goal topic and each of the base topics is one except for a single base topic at a distance of zero; in other words, the goal is one of the base topics found and all other base topics are immediately connected to it. More realistically, we find in our experiments that the average base topic density is 1.48, and the average distance between a base and goal topic is 5.6 steps, so an average goal topic score using Equation (4.8) would be 1.18.
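As a back-of-the-envelope check (assuming that, at these distances, a base topic's contribution in Equation (4.8) behaves like its density divided by the squared distance, which is consistent with the (4.8) row of Table 4.4), summing over the 25 retained base topics gives 25 × (1.48 / 5.6²) ≈ 1.18, matching the reported average.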

Figure 4.14 shows the average score of the goal topic at each rank over all KDD CUP queries used in our experiment, obtained using the method described above. This graph shows that the top topic has on average a score of 3,272, several orders of magnitude above the average but still below our theoretical maximum. In fact, even the maximum score we observed in our experiments is only 67,506, very far below the theoretical maximum. This is due to the fact that most base topics are more than a single step removed from the goal topic.

Figure 4.14: Average goal topic score per rank over all KDD CUP queries.


The graph also shows the massive difference between the first three ranks of goal topics and the other 96. The average score goes from 3,272 at rank 1, to 657 at rank 2 and 127 at rank 3, down to 20 and 16 at ranks 4 and 5 respectively, and then covers the interval from 2 to 0.7 between ranks 6 and 99. This demonstrates a number of facts. First of all, both the values of the first five goal ranks and the differences between their scores when compared to the other 94 show that these first ranks are resilient to noise and variations. It also justifies our decision to study the performance of our system using the top 1 to 5 goal topics, and it gives further experimental support to our decision to limit the number of goal topics returned by the classifier to three.

It is interesting to note that the average score of the topics over the entire distribution is 42.53, very far off from our theoretical average of 1.18. However, if we ignore the first three ranks, whose values are very high outliers in this distribution, the average score becomes 1.62. Moreover, the average score over ranks 6 to 99 is 1.28. Both of these values are in line with the average we expected to find.

4.2.3.5 The Optimal System

Having performed the above tests, we are ready to put forward the optimal classifier, the one that combines the best features from the options we have studied. This classifier uses Nk * (Nk / Ne) for the candidate entity score, selects the top 25 base topics, ranks the goal topics using the commonality summation formula of Equation (4.8), and returns the top three topics ranked. The results we obtain with that system are presented in Table 4.5, with other KDD CUP competition finalists reported in [53] for comparison. As can be seen from Table 4.5, our system performs well above the competition average, and in fact ranks in the top 10 (7th place) of the competition in F1 and in the top 5 (2nd place) in precision. This is also a considerable improvement over the results obtained in the previous experiment using the iterative exploration method. For comparison, system #22, which won first place in the competition, achieved the best results by querying multiple web search engines and aggregating their results [54]. Our method may not perform as well right now, but it offers the potential for algorithmic and knowledge-base improvements that go well beyond those of a simple aggregate function, and is not dependent on third-party commercial technology.

We also found in our results that 47 of the 800 test queries were not classified at all, because the algorithm failed to select any base topics. This situation occurs when no entities featuring the query words can be found. These queries are all single words, and that word is either an uncommon abbreviation (the query “AATFCU” for example), misspelled in an unusual way (“egyptains”), an erroneous compounding of two words (“contactlens”), a rare website URL, or even a combination of the above (such as the misspelled URL “studioeonline.com” instead of “studioweonline.com”). These are all situations that occur with real user search queries, and are therefore present in the KDD CUP dataset. It is worth noting that the Wikipedia titles which are the source of the entities in the semantic graph include common cases of all these errors, so that only the 5.9% most unusual cases lead to failure in our system.

Table 4.5: Semantic distance topic identifier performance in baseline standings.

System          F1 Rank   Overall F1   Precision Rank   Overall Precision
KDDCUP #22      1         0.4444       N/A              0.4141
KDDCUP #37      N/A       0.4261       1                0.4237
KDDCUP #21      6         0.3401       2*               0.3409
Our system      7         0.3366       2                0.3643
KDDCUP #14      7*        0.3129       N/A              0.3173
KDDCUP Mean     -         0.2353       -                0.2545
KDDCUP Median   -         0.2327       -                0.2446

* indicates competition systems that would have been outranked by ours.

4.2.3.6 Case Study

It could be interesting to study a specific example, to see the system’s behavior step by step. We chose for this purpose to study a query for “internet explorer” in the KDDCUP set. This query was manually classified by the competition’s three labelers, into the KDDCUP categories “Computers\Software; Computers\Internet & Intranet; Computers\Security; Computers\Multimedia; Information\Companies & Industries” by the first labeler, into “Computers\Internet & Intranet; Computers\Software” by the second labeler, and into “Computers\Software; Computers\Internet & Intranet; Information\Companies & Industries” by the third labeler.

The algorithm begins by identifying a set of relevant base topics using the calculated candidate entities and then weighting them using the density function. For this query, our algorithm identifies 1,810 base topics, and keeps the 25 highest-density ones, breaking the tie for number 25 by considering the number of entities pointing to the topics. For any two-word query, the maximum candidate entity score value is 2, and the maximum base topic density value is 4. And in fact, we find that 8 topics receive this maximum density, including some examples we listed in Table 4.6. We can see from these examples that the top-ranked base topics are indeed very relevant to the query.


Examining the entire set of base topics reveals that the density values drop to half the maximum by rank 33, and to a quarter of it by rank 37. The density value continues to drop as we go down the list: the average density of a base topic in this example is 0.40, which corresponds to rank 660; by the middle of the list at rank 905 the density is 0.33, and the final topic in the list has a density of only 0.05. It can also be seen from the samples in Table 4.6 that the relevance of the topics to the query does seem to decrease along with the density value. Looking at the complete list of 1,810 base topics, we find that the first non-software-related topic is “Exploration” at rank 41 with a density of 1. But software-related topics continue to dominate the list, mixed with a growing number of non-software topics, until rank 354 (density of 0.5 and 1 entity pointing to the topic), where non-computer topics begin to dominate. Incidentally, the last software-related topic in the list is “United States internet case law”, at rank 1791 with a density of 0.11.

Table 4.6: Base topic samples for "internet explorer" query case study.

Topic                                    Rank   Density   Entities #
Internet Explorer                        1      4         36
Internet history                         2      4         32
Windows web browsers                     3      4         20
Microsoft criticisms and controversies   8      4         4
HTTP                                     25     2.67      5
Mobile phone web browsers                26     2.67      4
Cascading Style Sheets                   33     2         5
Internet                                 37     1         17
PlayStation Games                        660    0.40      2
Islands of Finland                       905    0.33      1
History of animation                     1811   0.05      1

The next step of our algorithm is to rank the 99 goal topics using the sum of density values in Equation (4.8). Sample rankings are given in Table 4.7. This table uses the Wikipedia goal topic labels; the matching KDDCUP categories can be found in Appendix A. We can see from these results that the scores drop by half from the first result to the fourth one. This is much less drastic than the drop we observed on average in Figure 4.14, but is nonetheless consistent, as it shows a quick drop from a peak over the first three ranks and a long and more stable tail over ranks 4 to 99.

It is also encouraging to see that the best two goal topics selected by our system correspond to “Computers\Internet & Intranet” and “Computers\Software”, the only two categories to be picked by all three KDDCUP labelers. The fourth goal corresponds to “Online Community\Other” and is the first goal that is not in the KDDCUP “Computers\” category, although it is still strongly relevant to the query. Further down, the first goal that corresponds neither to a “Computers\” nor an “Online Community\” category is Technology (“Information\Science & Technology”) at rank 16, which is still somewhat related to the query, and the first truly irrelevant result is Magazines (“Living\Book & Magazine”) at rank 18, with a little over a quarter of the top topic’s score. Of the categories picked by labelers, the one that ranked worst in our system was “Information\Companies & Industries” at rank 30. All the other categories they identified are found in the top-10 results of our system.

Table 4.7: Goal topic samples for "internet explorer" query case study.

Goal Topic         Rank   Score
Internet           1      11.51
Software           2      10.09
Computing          3      8.03
Internet culture   4      5.63
Websites           5      4.97
Technology         16     3.54
Magazines          18     3.39
Industries         30     2.86
Law                49     2.54
Renting            99     1.30

4.2.3.7 New Data and Final Tests

In order to show that our results in Table 4.5 are general and not due to picking the best system for a specific dataset, we ran two more tests of our system with two new datasets. The first dataset is a set of 111 KDD CUP 2005 queries classified by a competition judge. This set was not part of the 800 test queries we used previously; it was a set of queries made available by the competition organizers to participants prior to the competition, to develop and test their systems. Naturally, the queries in this set will be similar to the other KDD CUP queries, and so we expect similar results.

The second dataset is a set of queries taken from the TREC 2007 Question-Answering (QA) track [106]. That dataset is composed of 445 questions on 70 different topics; we randomly selected three questions per topic to use for our test. It is also worth noting that the questions in TREC 2007 were designed to be asked sequentially, meaning that a system could rely on information from the previous questions, while our system is designed to classify each query by itself with no query history. Consequently, questions that were too vague to be understood without previous information were disambiguated by adding the topic label. For example, the question ‘Who is the CEO?’ in the series of questions on the company 3M was rephrased as ‘Who is the CEO of 3M?’ Finally, two volunteers independently labeled the questions to KDD CUP categories in order to have a standard to compare our system’s results to using equations (4.1) to (4.6). The TREC dataset was selected in order to subject our system to very different testing conditions: instead of the short keyword-only KDD CUP web queries, TREC has long and grammatically correct English questions.

The results from both tests are presented in Table 4.8, along with our system’s development results already presented in Table 4.5 for comparison. These results show that our classifier works better with the test data than with the training data it was developed and optimized on. This counter-intuitive result requires explanation.

Table 4.8: Topic identification performance results for new test sets.

Query Set    Overall F1   Overall Precision   Overall Recall
KDDCUP 111   0.3636       0.4254              0.3175
TREC         0.4639       0.4223              0.5267
KDDCUP 800   0.3366       0.3643              0.3195

The greatest difference in our results is on recall, which increases by over 20% from the training KDDCUP test to the TREC test. Recall, as presented in Equation (4.2), is the ratio of correct topic labels identified by our system for a query to the total number of topic labels the query really has. Since our classifier returns a fixed number of three topics per query, it stands to reason that it cannot achieve perfect recall for a query set that assigns more than three topics, and that it can get better recall on a query set that assigns fewer topics per query. To examine this hypothesis, we compared the results of five of the labelers individually: the three labelers of the KDDCUP competition and the two labelers of the TREC competition (the 111 KDDCUP demo queries, having been labeled by only one person, were not useful for this test). Specifically, we looked at the average number of topics per query each labeler used and the recall value our system achieved using that query set. The results, presented in Table 4.9, show that our intuition is correct: query sets with fewer topics per query lead to higher recall, with the most drastic example being the increase of 1.5 topics per query between KDDCUP labelers 2 and 3 that yielded a 10% decrease in recall. However, it also appears from that table that the relationship does not hold across different query sets: KDDCUP labeler 2 assigns fewer labels per query than TREC labeler 2 but still has a much lower recall.
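A rough back-of-the-envelope bound (our own observation, not a competition formula) makes the first effect concrete: since the classifier always returns exactly three topics, a query with T correct labels has, under Equation (4.2), a per-query recall of at most min(1, 3/T). A labeler averaging 3.85 topics per query therefore caps the attainable recall at roughly 3/3.85 ≈ 0.78, while a labeler averaging fewer than three topics per query leaves that cap close to 1 on average.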

Table 4.9: Comparison of topic number and recall for different labelers.

Query Set          Average Number of Topics   Recall
TREC Labeler 1     1.93 ± 0.81                0.5443
TREC Labeler 2     2.91 ± 0.92                0.5090
KDDCUP Labeler 2   2.39 ± 0.93                0.3763
KDDCUP Labeler 1   3.67 ± 1.13                0.3076
KDDCUP Labeler 3   3.85 ± 1.09                0.2747

Next, we can contrast the two KDDCUP tests: they both had nearly identical recall, but the new data gave a 6% increase in precision. This is interesting because the queries come from the same source, they are web keyword searches of the same average length, and the correct categorization statistics are nearly identical to those of Labeler 3, so we would actually expect the recall to be lower than it ended up being. An increase in both precision and recall can have the same origin in equations (4.1) and (4.2): a greater proportion of correct topics identified by our classifier. But everything else being equal, this would only happen if the queries themselves were easier for our system to understand. To verify this hypothesis, we checked both query sets for words that are unknown in our system. As we explained previously, a lot of these words may be rare but simple typos (“egyptains”) or missing spaces between two words (“contactlens”), and while they are unknown and ignored in our system, their meaning is immediately obvious to the human labelers. The labelers thus have more information to classify the queries, which makes it inherently more difficult for our system to generate the same classification. Upon evaluation of our data, we find that the KDDCUP set of 800 queries features about twice the frequency of unknown words compared to the set of 111 queries. Indeed, 10.4% of queries in the 800-query set have unknown words and 4.4% of words overall are unknown, while only 5.4% of queries in the 111-query set have unknown words and only 2.5% of words in that set are unknown. This is an important difference between the two query sets, and we believe it explains why the 111 queries are more often classified correctly. It incidentally also indicates that an automated corrector should be incorporated into the system in the future.

The better performance of our system on the TREC query set can be explained in the same way. Thanks to the fact that the set is composed of correct English questions, it features even fewer unknown words: a mere 0.4% of words in 1.9% of queries. Moreover, for the same reason, the queries are much longer: on average 5.3 words in length after stop word removal, compared to 2.4 words for the KDDCUP queries. This means that even if there is an unknown word in a query, there are still enough other words in the TREC queries for our system to make a reasonably good classification.

Differences in the queries aside, there do not appear to be major distinctions, much less setbacks, when using our classifier on new and unseen datasets. It seems robust enough to handle new queries in a different spread of domains, and to handle both web-style keyword searches and English questions without loss of precision or recall.

Finally, it could be interesting to determine how our classifier’s performance compares to that of a human doing the same labeling task. Query classification is a subjective task: since queries are short and often ambiguous, their exact meaning and classification is often dependent on human interpretation [107]. It is clear from Table 4.9 that this is the case for our query sets: human labelers do not agree with each other on the classification of these queries. We can evaluate the human labelers by computing the F1 of each one’s classification compared to the others in the same dataset. In the case of the KDDCUP data, the average F1 of human labelers is known to be between 0.4771 and 0.5377 [53], while for our labeled TREC data we can compute the F1 between the two human labelers to be 0.5605. This means our system has between 63% and 71% of a human’s performance when labeling the KDDCUP queries, and 83% of a human’s performance when labeling the TREC queries. It thus appears that by this benchmark, our classifier again performs better on the TREC dataset than on the KDDCUP one. This gives further weight to our conclusion that our system is robust enough to handle very diverse queries.

4.3 Speech Recognition

The final part of this chapter is a brief study of the speech recognition application first introduced in Section 3.8. Here we start by describing the problem specification in more detail. Then, a variation of the algorithm presented in Section 3.8 is explained, along with an example to better clarify the algorithm's mechanism. Finally, the experimental results of the algorithm using real test data are presented and compared with baseline methods.

Speech decoders are becoming prominent in an increasing number of real-world applications and are required to become more robust and more reliable. Nowadays, speech transcription is used ubiquitously, and the requirement for highly accurate decoders has increased substantially. Most researchers agree that one of the most promising approaches to the problem of reducing word error rate (WER) in speech transcription is to combine two or more speech decoders and compose a new, more accurate output [108]. The Recognizer Output Voting Error Reduction (ROVER) is the most widely used technique in this regard. However, its performance has stagnated over the past few years, mainly due to the limitations of the voting algorithms used. These shortcomings include the use of unreliable word-level confidence scores of decoders, as well as randomly-broken ties when the frequency of occurrence is used during the voting. This experiment proposes a novel voting scheme for the ROVER combination procedure which relies on entities from the entity graph as patterns to select the winner token from the different hypotheses of the decoders.

4.3.1 ROVER Procedure

ROVER [109] is a system developed at NIST in 1997 to produce a composite of decoder outputs when the outputs of multiple automatic speech recognizers (ASR) are available. The goal is to produce a lower WER in the composite output. This is done through a voting mechanism to select the winner word from among the different decoder outputs. ROVER is a two-step process, as shown in Figure 4.15. It starts by combining the multiple decoder outputs into a single minimal-cost word transition network (WTN) through dynamic programming. The confidence scores of the recognizers are shown in parentheses in the WTN of Figure 4.15. In the second step, the resulting WTN is browsed by a voting process which selects the best output sequence out of the potential word sequences presented by the different ASRs. Three voting mechanisms have been presented in [109]. At each location in the composite WTN, a score is computed for each word using Equation (4.12), where i is the current location in the WTN, Ns is the total number of combined ASR systems, N(w, i) is the frequency of word w at position i, and C(w, i) is the confidence value for word w at position i.

 N(w,i) = α  + − α ⋅ (4.12) S w   (1 ) C(w,i)  N s  The parameter α is set to be the trade-off between using word frequency and confidences. In the case there is an insertion or deletion, the NULL transition, noted as @, in the word transition network will have the confidence conf(@). A training stage is therefore needed to optimize both the parameter and the NULL transition confidence value. This is commonly done through grid-based searching. The three voting schemes are frequency of occurrence, frequency of occurrence and average word confidence, and frequency of occurrence and maximum confidence. The first scheme only uses occurrences to select the winning word at each WTN slot. However, ties tend to occur frequently and 149 they are randomly broken. The remaining two schemes rely on word confidences to overcome the problem of arbitrarily broken ties. The fact that ROVER's voting relies on confidences makes it a vulnerable technique. It is not safe to assume that the word confidences are reliable. Much research [110] is still pursued in attempts to come up with a robust and effective technique to provide a decent confidence measure for Large Vocabulary Continuous Speech Recognition (LVCSR). Even in [109], neither algorithm that relies on a confidence score achieved a significant error reduction compared to the frequency-based voting algorithm. Furthermore, ROVER is unable to outvote the erroneous ASR systems when only one single ASR is providing the correct output. The scoring schemes make it difficult to boost a single ASR output since both occurrence and confidence are used to score each word at a specific location.

Figure 4.15: ROVER procedure.

4.3.2 Entity-based Voting Algorithm

The proposed approach in this experiment relies on entity-to-WTN pattern matching in order to intelligently select the winner token at each location of the WTN. The assumption that underlies our work is that there exist multi-word statements that a speaker is likely to use, and that we can improve the voting stage within the ROVER procedure by recognizing these multi-word statements. In this context, multi-word statements refer to a variety of entities such as long noun phrases (e.g. the honorable Member of Parliament) or proper names (e.g. Harvard University). Figure 4.16 presents the steps of our proposed approach. The goal of the algorithm is to browse the WTN and select the best word at each slot (called a token) from the different parallel sentences provided by each ASR system. The words which are selected as winners are tagged as WINNER. The algorithm continues until it can find a winner for all the slots and, therefore, generate one final sentence as the output result.

Input: WTN

1. Tag all common tokens as WINNER
2. Find all the candidate entities matching any n-gram word combination in the WTN with at least two consecutive words
3. While slots without a WINNER token exist do:
4.   While any candidate entity remains do:
5.     Update the candidate entity list by discarding all entities that do not match any WTN slot without a WINNER tag
6.     Score and rank all candidate entities
7.     Select the highest-scored candidate entity as the winner
8.     Tag tokens matching the winner entity's n-gram words as WINNER
9.     Remove the winner entity from the candidate entity list
10.  End while
11.  Use frequency of occurrence to select a WINNER in the remaining slots, if any
12. End while

Figure 4.16: Entity-based voting algorithm.


In step 1 of the algorithm, we browse the WTN to locate all slots in which the same token has been recognized by all speech decoders, and we tag that token as the WINNER. In other words, if all speech decoders agreed on the same token at a given slot, we suppose it has been correctly recognized and hence we tag it as WINNER. Step 2 finds all the candidate entities which match any n-gram of consecutive tokens within the WTN. There is a requirement that the candidate entity should match at least two consecutive tokens in the WTN. Moreover, at least one of the tokens featured in the candidate entity should not already be marked as WINNER. These conditions ensure that the found candidate entities can be used to discover new WINNER tokens. Once all eligible candidate entities are found, a loop runs until there is no slot without a WINNER token. The nested loop iterates over the list of found candidate entities and processes them until no candidate entities remain in the list. At the start of this loop, in step 5, the list of candidate entities gets updated and any candidate entity which does not contribute to finding new WINNER tokens is discarded. In other words, only candidate entities which feature at least one word matching a WTN slot without a WINNER tag will be kept.

Step 6 is a crucial part of the algorithm. In this step, all candidate entities are scored and then sorted in descending order of their scores. The equation used here to score the candidate entities is a little different from the score equations presented so far. Equation (4.13) shows the scoring formula used for scoring candidate entities.

S(ec) = (Σk Wk) × (Nk − Nx) / Ne    (4.13)

The above equation is a transformation of the basic candidate entity score introduced by Equation (3.25). The first term (the summation) represents an extended version of the word featuring weight, while the second term (the fraction) represents the new penalty weight. In this equation, Wk is the number of occurrences of word k in its slot over the different ASR systems. Nk is the number of consecutive words found in the entity, i.e. it is equal to the n of the n-gram found in the entity. Nx is the number of false matches, i.e. the number of words within the n-gram which match tokens at slots where a WINNER tag is attributed to a different token. Finally, Ne is, as before, equal to the total number of entity words. An important note here is the impact of the NULL transition in Equation (4.13). The NULL token has a weight, which is its number of occurrences at the slot, but it does not count as a word. It can thus contribute to the value of Wk, but not to Nk or Ne.
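A minimal sketch of Equation (4.13) follows (the helper name and argument layout are our own, not the actual implementation); it reproduces two of the scores that appear later in the worked example of Table 4.10.

```python
# Sketch of the candidate entity score of Equation (4.13).
# slot_weights: occurrence counts W_k of the matched tokens (NULL transitions
#   contribute their weight here but do not count as words).
# n_matched: N_k, the number of consecutive entity words matched in the WTN.
# n_false: N_x, matched words colliding with slots already won by another token.
# n_entity_words: N_e, the total number of words in the entity.

def entity_score(slot_weights, n_matched, n_false, n_entity_words):
    return sum(slot_weights) * (n_matched - n_false) / n_entity_words

# Reproducing two rows of Table 4.10:
print(entity_score([1, 2], 2, 0, 4))      # "united states of ink" -> 1.5
print(entity_score([1, 1, 2], 2, 0, 2))   # "boarding school" (incl. NULL) -> 4.0
```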


Once all candidate entities are scored, the top-scored candidate entity is selected as the winner entity and the token words featured in the winner entity are selected as WINNER tokens. Afterwards, the winner entity is removed from the candidate entity list so that other entities can be selected for the remaining slots. The loop continues until all candidate entities are exhausted. In the case that there are still slots without a WINNER tag, step 11 indicates that the token with the highest frequency of occurrence is selected as the WINNER token. In other words, the token with the highest Wk is the winner word. If ties occur, they are broken randomly.

The next subsection presents an example which illustrates all the above steps. It is notable that the algorithm explained above might be slower than the classic ROVER voting algorithm because of the additional processing it requires. For offline applications, however, this should not be an issue. For real-time applications, the algorithm can be optimized in a number of ways: for instance, by pruning the infrequent entities from the semantic graph, by caching frequently-used entities, and by making use of the text indexing tools available in many commercial database systems.

4.3.3 Illustrative Example

In order to better illustrate the steps in Figure 4.16, we present a working example. Let us assume that a given speech waveform is transcribed by three different speech decoders. The resulting WTN is given in Figure 4.17. Per step 1 of the algorithm, the token “The” at the first slot is a common token and, therefore, a WINNER.

Now let us assume for the sake of this example that only four candidate entities are found for this WTN. These entities, along with their initial scores, are given in Table 4.10. Notice that the first entity matches three words: “united”, “states”, and “ink”. But only the first two words are consecutive. Consequently, only “united states” is matched with the WTN and “ink” is ignored. Notice also that the “boarding school” entity gets an extra word weight because of the NULL transition between “boarding” and “school”. Finally, note that none of the candidate entities cover the initial slot 0, the only position for which a WINNER token is identified, and consequently Nx = 0 for all candidate entities.

At the first iteration, the highest-scored candidate entity is “boarding school”. Therefore, the tokens “boarding”, “NULL”, and “school” at slots 3, 4, and 5 are tagged as WINNER. This entity is removed from the list before the second iteration begins. The next iteration starts by updating the scores of the remaining three candidate entities. The updated list of candidate entities is shown in Table 4.11. Only one entity has its score changed: “states board of education”, which has a false-matched token (Nx) at slot 3, “board”, where “boarding” was tagged as WINNER during the previous iteration. Therefore, the highest-scored candidate entity is “united states of ink” in this iteration. The words “united” and “states” are thus tagged as WINNER at slots 1 and 2 respectively. The remaining candidate entities only match slots with winner tokens. Therefore, no more candidate entities are usable and they all get discarded per step 5 of the algorithm. This means that the algorithm deals with slot 6 using frequency of occurrence per step 11, thus marking “band” with a frequency of 2 a WINNER token over “bland” with a frequency of 1. The composite final sentence is thus “the united states boarding school band”.

Figure 4.17: A sample WTN for three decoders.

Table 4.10: Initial candidate entities and their scores.

Candidate Entity              Matched Slots   Score S(ec)
“united states of ink”        1 – 2           (1 + 2) × (2 − 0)/4 = 1.5
“knighted skates inc.”        1 – 2           (1 + 1) × (2 − 0)/3 = 1.33
“states board of education”   2 – 3           (2 + 2) × (2 − 0)/4 = 2
“boarding school”             3 – 5           (1 + 1 + 2) × (2 − 0)/2 = 4

Table 4.11: Remaining candidate entities and updated scores in the second iteration.

Candidate Entity              Matched Slots   Score S(ec)
“united states of ink”        1 – 2           (1 + 2) × (2 − 0)/4 = 1.5
“knighted skates inc.”        1 – 2           (1 + 1) × (2 − 0)/3 = 1.33
“states board of education”   2 – 3           (2 + 2) × (2 − 1)/4 = 1


4.3.4 Datasets and Results

The tests were conducted using the transcription output of the Carnegie Mellon University Sphinx 4 decoder. In terms of data, we have considered the English Broadcast News Speech (HUB4) testing framework [111]. This corpus is composed of both speech data (LDC98S71) and transcripts (LDC98T28). It is a total of 97 hours of 16,000 Hz recordings from radio and television news broadcasts. Transcriptions of this HUB4 corpus have been used to train the language model (named LM-98T28 hereafter). Two other freely available language models were used: an open-source model for broadcast news transcriptions from CMU [112], referred to as LM-BN99 hereafter, and another language model created from the English Gigaword corpus [113], referred to as LM-GIGA hereafter. In order to simulate outputs of speech decoders from several sites, we have loaded the Sphinx 4 decoder with different language models. A total of three configurations have been set up: Sphinx 4 loaded with LM-98T28 (s4-LM-98T28), Sphinx 4 loaded with LM-BN99 (s4-LM-BN99), and Sphinx 4 loaded with LM-GIGA (s4-LM-GIGA). All possible binary combinations were carried out. Table 4.12 reports the setup. Notice here that the order of combination matters. This is an inherent problem of ROVER's WTN building stage. Because of the use of different language models, a normalization step is required to standardize the output of the different speech decoders. In other words, the same word can be written in different ways and, therefore, all these variations have to be unified under a single form. Examples of these issues include acronyms like CNN, which can be written as c. n. n. (three distinct letters), c.n.n. (one single word), cnn (one single word), or c n n (three letters), and capitalization, such as Vote, voTe, vote, etc. Moreover, the positions of NULL tokens are determined according to the order of transitions, which means that each combination order causes different NULL tokens to be inserted into the transitions.
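As a rough illustration of that normalization step, here is a minimal sketch under the assumption that only acronym spellings and capitalization need to be unified; the helper names and the example sentence are our own, not the actual pre-processing code.

```python
# Sketch of decoder-output normalization: lower-case tokens, strip periods,
# and merge runs of single-letter tokens (e.g. "c", "n", "n" -> "cnn").

def normalize_hypothesis(tokens):
    cleaned = [t.lower().replace(".", "") for t in tokens]   # "C." -> "c", "Vote" -> "vote"
    out, run = [], ""
    for tok in cleaned:
        if len(tok) == 1 and tok.isalpha():
            run += tok                     # accumulate spelled-out acronym letters
        else:
            if run:
                out.append(run)
                run = ""
            out.append(tok)
    if run:
        out.append(run)
    return out

print(normalize_hypothesis("The C. N. N. anchor will Vote".split()))
# ['the', 'cnn', 'anchor', 'will', 'vote']
```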

It is worth mentioning that the Sphinx 4 confidence scores were not reliable. Consequently, using these scores in the voting algorithm causes no change in the outcome. For this reason, we only used the frequency-based voting in the remainder of our tests. In terms of candidate entities, we have used the list of all titles in Wikipedia that are two or more words in length from the October 2010 version of the encyclopedia. This condition created a set of 6,479,950 entities ranging from bi-grams up to 45-grams, averaging 3-grams in length.

The baseline performance is obtained through the use of the frequency of occurrence voting scheme. The ROVER baseline WER is given in Table 4.13. The first row in this table represents the different binary combinations previously defined in Table 4.12.


Table 4.12: Decoder combinations.

ID   Combination Configuration
C1   s4-LM-GIGA - s4-LM-98T28
C2   s4-LM-GIGA - s4-LM-BN99
C3   s4-LM-98T28 - s4-LM-GIGA
C4   s4-LM-98T28 - s4-LM-BN99
C5   s4-LM-BN99 - s4-LM-98T28
C6   s4-LM-BN99 - s4-LM-GIGA

Table 4.13: ROVER baseline WERs.

Test   C1      C2      C3      C4      C5      C6
WER    37.12   36.45   37.32   35.51   32.45   33.10

Figure 4.18: Entity-based voting algorithm WER reduction rates.

In order to illustrate the performance of the entity-based voting scheme, the WER reduction is plotted. Figure 4.18 reports the WER reduction (absolute and relative) achieved with our voting algorithm. The WER reduction reported in Figure 4.18 shows that our voting scheme outperforms the original frequency of occurrence scheme in almost all the experiments. The relative WER reduction ranged from 3% up to 12%. This reduction is significant since ROVER performance has stagnated over the past few years and it has become more and more difficult to achieve even a small WER reduction. However, with the C5 binary combination, the entity-based voting algorithm performs slightly worse than the original ROVER algorithm. This can be explained by the fact that the C5 baseline WER is already the lowest, which means performance is already at its best. In other words, when the WER is low and the decoders are highly optimized, the entity-based voting performs roughly the same as the original scheme.

To wrap up, in this experiment a voting algorithm based on the entity detection algorithm has been proposed to improve the performance of the ROVER procedure. The voting scheme relies on entities in order to select the winner token at each slot of the WTN. Tests showed that this new approach outperforms the original voting scheme and achieves substantial WER reductions in most of the setups. Future research can focus on enhancing our proposed algorithm in a number of ways. One possible direction is to expand it to handle more than pairs of decoders. Indeed, combining the output of three or more decoders in our system should improve the results compared to using only two decoders. We could also look at improving the technique used to break ties. The random decision our system makes at present is not optimal, and some kind of intelligent decision process should give better results.

4.4 Summary

In this chapter, two of the applications introduced in Section 3.8 were studied through experimental tests. The first application was query topic identification, which was the main focus of the chapter. The problem was divided into three stages and a comprehensive study followed, examining many parameters involved in the algorithm structure. In the first stage, the flexibility of entity detection and base topic identification to noise was measured using synthesized data. The second stage provided an iterative method which tried to solve a real-world problem using the KDD CUP 2005 datasets for query categorization. Finally, stage three presented a full semantic analysis algorithm to solve the KDD CUP 2005 problem using the full methodology described in this research. Many factors of the algorithm were tested in order to create the optimal solution, which produced favorable results.

The second application was the speech recognition problem, which used a variation of the entity detection algorithm to create a voting mechanism for combining the results of multiple decoders to generate a better transcription. The tests showed clear improvements over baseline results which had not been improved upon for a long time.


Chapter 5 Conclusion and Future Research

5.1 Summary of the Study

This dissertation presented a methodology for semantic analysis of textual input based on an encyclopedic graph-based knowledgebase source. The presented methodology provided various algorithms along with a number of sample problems which can be tackled using them. It tried to show how the generic algorithm templates can be tailored for different types of applications by presenting potential solutions for the sample problems. In each solution, some variation of the semantic analysis algorithm was used, and justifications as to why the algorithm was set up in a certain way were provided in each case. Finally, a couple of the applications were studied further using empirical experiments and tests. The numerical results of the performance of the algorithm in each setup were presented and compared to baseline methods. The experiments also tested various parameters of the algorithms and showed how changes in these parameters affect the final results.

The first part of the study was the introduction of Wikipedia and DBpedia as the knowledgebase used in the research. Wikipedia was introduced as a valuable source of semantic information. The main components of Wikipedia used to create the knowledgebase were individually explained. These components were: titles, infoboxes, wikilinks, and categories. All these components were crucial blocks in creating the semantic graph. However, due to the human-readable structure of Wikipedia, there were challenges in converting this encyclopedia into a machine-readable format. Consequently, DBpedia was introduced as a Semantic Web version of Wikipedia which provides all the crucial components in a computer-friendly linked data format. DBpedia is a collection of datasets which are compatible with the RDF format, since they use triples of “subject, predicate, object” and also use URIs to refer to items in their datasets. Such a structure can easily be transferred to a database for machine use. Moreover, the triple format and linked data nature of these datasets can easily be converted to a semantic graph.

The next part of the study introduced the semantic graph generated from Wikipedia or DBpedia. The semantic graph is comprised of two sub-graphs: the entity graph and the topic graph. This study heavily relies on the two concepts of entity and topic. The entity graph and the topic graph are thus the knowledge sources for each of these concepts. The entities were introduced as all possible knowledge items, such as people, places, events, organizations, and so on. The topics, on the other hand, are considered as a meta-layer of semantic knowledge which gives the entities structure and hierarchy. In other words, topics provide the classification of the knowledgebase and entities are the instances that belong to one or more of these classes. This means that the entity graph is connected to the topic graph, and this connection was called the entity-topic connection layer.

Once the semantic graph was described, the study presented the first element of the semantic analysis algorithms: entity detection. It was noted that any semantic analysis would require the entities to be identified first. The entities, therefore, act as entry points to the semantic graph, unless an application is specifically working on topics and provides topics as the input. Most applications, however, would need to perform entity detection first. The proposed algorithm included five divisions: processing, search, scoring, equivalency, and production. Each of these divisions was introduced, explained, and formally defined. In summary, the processing division divided the input into blocks ready for further processing; each block was called an input component. The search division was tasked with finding entities matching each input component, called candidate entities. The scoring division assigned a score to each candidate entity based on how closely it matched the input component. The equivalency division tried to find entities that are equal to the candidate entities but were not found by the search division, and added them to the candidate entity list. Finally, the production division filtered out those candidate entities which scored lower than an acceptable threshold and returned the remaining high-scored entities for further semantic analysis.

The second element of the semantic analysis was the introduction of the semantic distance and the commonality score. Since the main part of the analysis is done using the semantic distance and commonality score, throughout this dissertation they are referred to as the semantic analysis itself. This means that while entity detection is a very important step, the main analysis is performed when the proper semantic distances and the final commonality scores are calculated. The semantic distance was introduced as having various types. In one aspect, a semantic distance can be defined based on its source and destination, e.g. whether it is between entities and entities, entities and topics, or topics and topics. Another aspect in defining the semantic distance was the direction, which we observed could be upward, downward, strictly upward, or strictly downward. Algorithms were presented for calculating each of these types of semantic distances. Finally, the commonality score stated that not only is the semantic distance important in calculating semantic relevancy, but the number of different paths and how many entities or topics they have in common are also aspects of semantic closeness. The commonality score was defined using a generic template equation which, in short, provided an aggregate function over the scores of candidate entities divided by their semantic distances to the goal targets. Using entity detection, semantic distance, and the commonality score, three basic semantic analysis methods were introduced next: entity analysis, topic analysis, and progressive analysis. Entity and topic analysis each examined how the analysis would be performed for each type separately, and progressive analysis showed how a tracking and aging mechanism can be applied to the semantic analysis in general.

Once the description of the semantic analysis methodology was completed, the study provided a number of real-world problems and applications which can benefit from the presented methodology. In each case the problem definition was explained and its important aspects were discussed. Afterwards, a version of the semantic analysis algorithm was presented as a potential solution to that problem. It should be noted that for these applications, there were many other possible approaches available, and even the presented potential solutions could be tweaked in many different ways. The goal was, however, to show at least one possible solution so that the practicality of the methodology is demonstrated.

In order to further prove that the proposed semantic analysis can be implemented in real-world problems, the solutions for two of the applications were, in fact, implemented and tested using real-world datasets. The first application was query topic identification, which was developed in three stages. The first stage used only entity detection and the calculation of immediate topics, called base topics, in order to examine the flexibility of the algorithm to noise words using synthesized datasets. The second stage used an iterative algorithm for finding target topics instead of semantic distance and commonality. It used real datasets from the KDD CUP 2005 competition and scored favorably. Finally, the last stage used the full semantic analysis algorithm over the same data as the second stage. Many parameters of the algorithm were tweaked to study their effects on the final results. An optimal system was created as the result of studying these parameters, and it scored highly in identifying new datasets from both KDD CUP 2005 and TREC 2007 sources. The second application was the speech recognition problem, which used the entity detection algorithm to decide between the outputs of multiple speech decoder systems to achieve a final transcription of audio data. The experimental results showed impressive improvements over baseline methods.


5.2 Conclusion

The methodology presented in this dissertation has been shown to have a number of strong attributes that make it a valuable framework for various semantic analysis problems. The choice of a knowledge source such as Wikipedia, which is an ever-growing, up-to-date, live knowledgebase, makes it possible to provide more accurate analysis as more information is added to the encyclopedia. If the implementation of the proposed algorithm uses the DBpedia online sources, there would be no need to update the knowledgebase, as it will be automatically up-to-date since it receives the data directly from the DBpedia network.

The number of different algorithms for semantic analysis provided in this study demonstrates the flexibility of the proposed methodology and its framework. The provided methods can be applied to a variety of applications, and the number of applications shown to use the methodology is further evidence of this flexibility.

Another attractive aspect of the presented framework is the speed of the algorithms in its various components. The implementation of this framework, specifically in the Parla system and in topic identification, was able to perform in real time. The reason is that most of the algorithms in this study have linear complexity in each step. For instance, the search division of the entity detection algorithm has O(N) time complexity, where N is the number of input words. This is achieved by full-text indexing of the database containing the entities, which, by using hash tables and various indices, only requires O(1) per input word to return a list of candidate entities. Graph traversal algorithms also have at most O(V) time complexity, where V is the number of nodes in either the entity graph, the topic graph, or both. The reason is the condition that no node should be visited twice; therefore, only a subset of the nodes will ever be visited, and each only once. This means that, in the worst-case scenario, the complexity is at most linear.

Finally, the experimental results presented here show the practicality of the methodology. In the query topic identification task, a ranking and classification algorithm was presented to find the best set of goal topics given user-specified keywords. To demonstrate its efficiency, a query classification system using this methodology was implemented. A thorough study of the algorithm was performed, focusing on each design decision individually and considering the practical impact of different alternatives. It was shown that our system’s classification results compare favorably to those of the KDD CUP 2005 competition: it would have ranked 2nd on precision with a performance about 10 percentage points better than the competition mean, and 7th in the competition on F1, also about 10 percentage points better than the competition mean. Lastly, two blind tests on different datasets that were not used to develop the system were presented to validate the results. The consistency of high performance over a range of datasets demonstrates that the algorithms used are based on sound methodology and that, if applied correctly to real-world applications, they can provide valuable solutions.

In conclusion, the presented framework has been shown to be a versatile and valuable contribution to semantic analysis which can be recreated and used by any researcher, as all the resources are publicly available. The framework will always remain current, as its data source is a live knowledgebase. It was shown to be very flexible and to match numerous applications. Finally, it was shown to be practical and to provide high-performance results. Therefore, this methodology has proved to be a valuable tool for further semantic analysis by researchers in the field.

5.3 Future Work

The variety of applications that can use the presented methodology provides a good platform for future research in this area. There are, however, two important tasks which can immediately follow the work presented here. First, the speech recognition application, which was implemented and produced improved results over baseline methods, only used the entity detection algorithm. The next step in this application is to add progressive entity analysis. Since the algorithm is run over a number of transcription sets that all relate to a single large audio file, progressive entity analysis makes it possible to track the context: the same words tend to recur in a conversation, and tracking the context allows a history of found entities to be kept and reused in subsequent transcription sets.
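As a rough illustration of what this progressive step could look like, the sketch below keeps a running history of entities detected in earlier transcription sets and boosts recurring candidates in later sets. The function name, the `history_weight` parameter, and the scoring rule are hypothetical choices made only for illustration; they are not the algorithm specified in this dissertation.

```python
from collections import Counter

def progressive_entity_analysis(transcription_sets, detect_entities, history_weight=0.5):
    """Hypothetical sketch: carry a history of entities found in earlier
    transcription sets and use it to favour recurring entities later on.
    `detect_entities` is assumed to map one transcription (a string) to a
    list of candidate entity titles."""
    history = Counter()
    results = []
    for text in transcription_sets:
        candidates = detect_entities(text)
        # Score each candidate: 1 for appearing now, plus a bonus that grows
        # with how often it was seen in previous sets (the tracked context).
        scored = sorted(
            candidates,
            key=lambda e: 1.0 + history_weight * history[e],
            reverse=True,
        )
        results.append(scored)
        history.update(candidates)
    return results
```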

The second future task is the implementation of the text prediction algorithm. This application again uses a progressive entity analysis algorithm. Any text document can act as input for this application, so extensive testing of the algorithm is easily possible. A sensible plan is to test all possible design parameters to find an optimal system for this application and then compare it with the other available methods.

Besides the above tasks, there is a significant extension to this study which can be taken on in future research. As mentioned in the challenges sections of previous chapters, there are certain limitations to this framework. The main issue is that, since there are no weights on the edges of the semantic graph, the framework may not provide the most accurate results when an application needs a single candidate entity, as it operates mainly on sets of candidate entities. To remedy this limitation, one can continue this work by calculating proper weights for the edges of the graph through learning methods or statistical approaches. Once that is done, the framework algorithms should be modified to incorporate these weights into their traversal tasks.
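One possible way to incorporate such learned weights, shown below only as a sketch under the assumption that each edge carries a non-negative cost, is a Dijkstra-style search that prefers the candidate entity with the cheapest weighted path to the current context nodes. The data layout and the scoring criterion are illustrative assumptions, not part of the present framework.

```python
import heapq

def best_entity(weighted_graph, candidates, targets):
    """Hypothetical sketch: pick the candidate entity whose cheapest weighted
    path to any target (e.g. context) node is shortest.
    `weighted_graph` maps node -> list of (neighbour, edge_weight) pairs."""
    def cheapest_cost(start):
        dist = {start: 0.0}
        heap = [(0.0, start)]
        while heap:
            cost, node = heapq.heappop(heap)
            if node in targets:
                return cost
            if cost > dist.get(node, float("inf")):
                continue  # stale heap entry
            for nbr, w in weighted_graph.get(node, []):
                new_cost = cost + w
                if new_cost < dist.get(nbr, float("inf")):
                    dist[nbr] = new_cost
                    heapq.heappush(heap, (new_cost, nbr))
        return float("inf")  # no target reachable from this candidate

    return min(candidates, key=cheapest_cost, default=None)
```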

Last but not least, the work on topic identification might be of interest to anyone developing query classification systems, text classification systems, or most other kinds of classification software. By using Wikipedia, a classification system gains the ability to classify queries into a set of almost 740,000 (and growing) categories covering most of human knowledge, which can easily be mapped to a simpler application-specific set of categories when needed, as was done in this study. And while multiple alternatives at every design stage of the query topic identification system were considered and tested, it is possible to conceive of further alternatives that could be implemented on the same framework and compared to the results presented here. Future work can focus on exploring these alternatives and further improving the quality of the classification. In that respect, one of the first directions to pursue will be to integrate an automated corrector into the system, to address the problem of unknown words, which was the source of the 5.9% of the KDD CUP 800 queries that remained unclassified by our system.

Appendix A Category Mappings

This appendix lists how we mapped the 67 KDD CUP categories to 99 corresponding Wikipedia categories in the September 2008 version of the encyclopedia.

KDD CUP Category -> Wikipedia Category (multiple Wikipedia categories are separated by semicolons)

Computers\Hardware -> Computer hardware
Computers\Internet & Intranet -> Internet; Computer networks
Computers\Mobile Computing -> Mobile computers
Computers\Multimedia -> Multimedia
Computers\Networks & Telecommunication -> Networks; Telecommunications
Computers\Security -> Computer security
Computers\Software -> Software
Computers\Other -> Computing
Entertainment\Celebrities -> Celebrities
Entertainment\Games & Toys -> Games; Toys
Entertainment\Humor & Fun -> Humor
Entertainment\Movies -> Film
Entertainment\Music -> Music
Entertainment\Pictures & Photos -> Photographs
Entertainment\Radio -> Radio
Entertainment\TV -> Television
Entertainment\Other -> Entertainment
Information\Arts & Humanities -> Arts; Humanities
Information\Companies & Industries -> Companies; Industries
Information\Science & Technology -> Science; Technology
Information\Education -> Education
Information\Law & Politics -> Law; Politics
Information\Local & Regional -> Regions; Municipalities; Local government
Information\References & Libraries -> Reference; Libraries
Information\Other -> Information
Living\Book & Magazine -> Books; Magazines
Living\Car & Garage -> Automobiles; Garages
Living\Career & Jobs -> Employment
Living\Dating & Relationships -> Dating; Intimate relationships
Living\Family & Kids -> Family; Children
Living\Fashion & Apparel -> Fashion; Clothing
Living\Finance & Investment -> Finance; Investment
Living\Food & Cooking -> Food and drink; Cooking
Living\Furnishing & Houseware -> Decorative arts; Furnishings; Home appliances
Living\Gifts & Collectables -> Giving; Collecting
Living\Health & Fitness -> Health; Exercise
Living\Landscaping & Gardening -> Landscape; Gardening
Living\Pets & Animals -> Pets; Animals
Living\Real Estate -> Real estate
Living\Religion & Belief -> Religion; Belief
Living\Tools & Hardware -> Tools; Hardware (mechanical)
Living\Travel & Vacation -> Travel; Holidays
Living\Other -> Personal life
Online Community\Chat & Instant Messaging -> On-line chat; Instant messaging
Online Community\Forums & Groups -> Internet forums
Online Community\Homepages -> Websites
Online Community\People Search -> Internet personalities
Online Community\Personal Services -> Online social networking; Virtual communities
Online Community\Other -> Internet culture
Shopping\Auctions & Bids -> Auctions and trading
Shopping\Stores & Products -> Retail; Product management
Shopping\Buying Guides & Researching -> Consumer behaviour; Consumer protection
Shopping\Lease & Rent -> Renting
Shopping\Bargains & Discounts -> Sales promotion; Bargaining theory
Shopping\Other -> Distribution, retailing, and wholesaling
Sports\American Football -> American football
Sports\Auto Racing -> Auto racing
Sports\Baseball -> Baseball
Sports\Basketball -> Basketball
Sports\Hockey -> Hockey
Sports\News & Scores -> Sports media
Sports\Schedules & Tickets -> Sport events; Seasons
Sports\Soccer -> Football (soccer)
Sports\Tennis -> Tennis
Sports\Olympic Games -> Olympics
Sports\Outdoor Recreations -> Outdoor recreation
Sports\Other -> Sports
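The sketch below shows, purely as an illustration, how a mapping like the one above could be applied in code: a dictionary from Wikipedia categories back to their KDD CUP labels translates the topics identified for a query into application-specific categories. Only a handful of entries are reproduced, and the helper function is hypothetical rather than part of the implemented system.

```python
# A few illustrative entries of the 99-category mapping listed above,
# inverted so each Wikipedia category points to its KDD CUP label.
WIKI_TO_KDD = {
    "Computer hardware": "Computers\\Hardware",
    "Film": "Entertainment\\Movies",
    "Finance": "Living\\Finance & Investment",
    "Investment": "Living\\Finance & Investment",
    "Baseball": "Sports\\Baseball",
}

def to_kdd_categories(wikipedia_topics):
    """Translate a set of identified Wikipedia topics into the (possibly
    smaller) set of application-specific KDD CUP categories."""
    return {WIKI_TO_KDD[t] for t in wikipedia_topics if t in WIKI_TO_KDD}

# Topics outside the mapping (here "Cooking") are simply dropped.
print(to_kdd_categories({"Film", "Finance", "Cooking"}))
```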


Appendix B Publications

Journal Papers

• Milad AlemZadeh, Richard Khoury, and Fakhri Karray, “Query Classification using Wikipedia’s Category Graph”, Journal of Emerging Technologies in Web Intelligence - Special Issue on Web Data Mining, Vol. 4, No. 3, pp. 207-220, August 2012.

• Fakhri Karray, Milad Alemzadeh, Jamil Abou Saleh, and Mo Nours Arab, “Human-Computer Interaction: Overview on State of the Art”, International Journal on Smart Sensing and Intelligent Systems, Vol. 1, No. 1, pp. 137-159, March 2008.

Conference Papers

• Milad AlemZadeh, Kacem Abida, Richard Khoury, and Fakhri Karray, “Enhancement of the ROVER's Voting Scheme using Pattern Matching”, The Third International Conference on Autonomous and Intelligent Systems, AIS'12, Aveiro, Portugal, June 2012.

• Milad Alemzadeh, Richard Khoury, and Fakhri Karray, “Exploring Wikipedia’s Category Graph for Query Classification”, The Second International Conference on Autonomous and Intelligent Systems, AIS'11, pp. 222-230, Burnaby, BC, Canada, June 2011.

• Milad Alemzadeh and Fakhri Karray, “An Efficient Method for Tagging a Query with Category Labels using Wikipedia towards Enhancing Search Engine Results”, 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE Computer Society, pp. 192-195, Toronto, Canada, September 2010.
