Mining Semantic Relationships Between Concepts Across
Total Page:16
File Type:pdf, Size:1020Kb
MINING SEMANTIC RELATIONSHIPS BETWEEN CONCEPTS ACROSS DOCUMENTS USING WIKIPEDIA KNOWLEDGE A Dissertation Submitted to the Graduate Faculty of the North Dakota State University of Agriculture and Applied Science By Peng Yan In Partial Fulfillment for the Degree of DOCTOR OF PHILOSOPHY Major Department: Computer Science September 2013 Fargo, North Dakota North Dakota State University Graduate School Title Mining Semantic Relationships between Concepts across Documents using Wikipedia Knowledge By Peng Yan The Supervisory Committee certifies that this disquisition complies with North Dakota State University’s regulations and meets the accepted standards for the degree of DOCTOR OF PHILOSOPHY SUPERVISORY COMMITTEE: Wei Jin Chair Brian M. Slator Juan Li Ying Huang Approved: 11/01/2013 Brian M. Slator Date Department Chair ABSTRACT The ongoing astounding growth of text data has created an enormous need for fast and efficient Text Mining algorithms. However, the sparsity and high dimensionality of text data present great challenges for representing the semantics of natural language text. Traditional approaches for document representation are mostly based on the Vector Space (VSM) Model which takes a document as an unordered collection of words and only document-level statistical information is recorded (e.g., document frequency, inverse document frequency). Due to the lack of capturing semantics in texts, for certain tasks, especially fine-grained information discovery applications, such as mining relationships between concepts, VSM demonstrates its inherent limitations because of its rationale for computing relatedness between words only based on the statistical information collected from documents themselves. In this dissertation, we present a new framework that attempts to address the above problems by utilizing background knowledge to provide a better semantic representation of any text. This is accomplished through leveraging Wikipedia, the world’s currently largest human built encyclopedia. Meanwhile, this integration also sufficiently complements the existing information contained in text corpus and facilitates the construction of a more comprehensive representation and retrieval framework. Specifically, we present 1) Semantic Path Chaining (SPC), a new text mining model that automatically discovers semantic relationships between concepts across multiple documents (which the traditional search paradigm such as search engines cannot help much) and effectively integrates various evidence sources from Wikipedia; 2) the kernel methods that provide a more appropriate estimation of semantic relatedness between concepts and better utilize Wikipedia background knowledge in our defined query contexts; 3) Concept Association Graph (CAG), a iii graph-based mining prototype system interfaced directly to Wikipedia, enables fast and customizable concept relationship search using Wikipedia resources. The effectiveness of the proposed techniques has been evaluated on different data sets. The experimental results demonstrate the search performance has been significantly enhanced in terms of accuracy and coverage compared with several baseline models. In particular, some existing state-of-the-art related work such as Srinivasan’s closed text mining algorithm, Explicit Semantic Analysis (ESA) [19] and the RelFinder system [26, 27, 41] has been used as the comparison models. iv ACKNOWLEDGMENTS First and foremost, I would like to thank my advisor, Dr. Wei Jin, who offered me an opportunity to come to the U.S. to pursue my PH.D study, introduced me to the research area of text mining, and gave excellent guidance and enormous support to my research. It was Dr. Jin who helped me out of numerous research difficulties, enlightened me with constructive advice, and guided me to move forward along the right track during my research. I have been always feeling that it is my great luck to have such a nice advisor who is strict towards my research in school but treats me as a friend in normal times. Without her guidance, I would never be able to complete my PH.D work. I am also very grateful to Dr. Brian Slator, who brought me to the research field of immersive virtual environments, and has been supporting my study over the past 17 months. Because of Dr. Slator, I was able to continue my study, and had a chance to apply one of the models proposed in this dissertation into the real world software industry. I was honored to be a member in his development team and work with so many talented engineers I wish to express my love and gratitude to my beloved brother, Fei Yan, who works hard in China and takes almost all of the responsibilities for my parents that I should have taken, so that I can focus on my study. If it had not been for supporting my life in the U.S., he would have had a car that he had craved for two years. No matter how far we are away from each other, I want to let him know my heart has always been with him. Also, I want to thank my parents, who pushed the longing of wanting me to go back and stay with them down into the dim recesses of their minds, understood and supported me in all moments of my life. v I would also like to thank my wife, Shuhui Ren, who sacrificed her personal career and life in China, and came here to take care of me. We lived a hard life together due to the financial embarrassment, but she had never complained to me and always unconditionally supported me for the past two years. I would never forget those numerous late nights she accompanied me either in school or at home for catching up with approaching conference deadlines. My current achievement is at too much cost of her interests. Deepest gratitude is also due to members of my committee, Dr. Juan Li, Dr. Ying Huang without whose assistance, this study would not have been completed. vi DEDICATION I dedicate this dissertation to my brother, my parents and my wife. vii TABLE OF CONTENTS ABSTRACT ................................................................................................................................... iii ACKNOWLEDGMENTS .............................................................................................................. v DEDICATION .............................................................................................................................. vii LIST OF TABLES ....................................................................................................................... xiii LIST OF FIGURES ..................................................................................................................... xiv CHAPTER 1. INTRODUCTION ................................................................................................... 1 1.1. Problem Description and Motivation ............................................................................. 2 1.2. Research Contributions .................................................................................................. 4 1.3. System Overview ........................................................................................................... 6 1.4. Organization of the Dissertation .................................................................................... 9 CHAPTER 2. LITERATURE REVIEW ...................................................................................... 10 2.1. Vector Space Model ..................................................................................................... 10 2.1.1. TFIDF Weighting Scheme ................................................................................. 10 2.1.2. Limitations of Vector Space Model ................................................................... 12 2.2. VSM-based Approaches ............................................................................................... 12 2.2.1. Cross-Document Coreference ............................................................................ 12 2.2.2. Open and Closed Discovery Algorithm ............................................................. 14 2.2.3. Concept Chain Queries ...................................................................................... 16 2.3. Web Oriented Approaches for Measuring Semantic Relatedness ............................... 17 2.3.1. Web Page Counts Oriented Approach ............................................................... 17 2.3.2. Related Terms Oriented Approach .................................................................... 18 2.4. WordNet-based Approaches ........................................................................................ 20 viii 2.4.1. Introduction to WordNet .................................................................................... 20 2.4.2. WordNet for Knowledge Representation ........................................................... 21 2.5. DBpedia-based Approaches ......................................................................................... 23 2.5.1. Introduction to DBpedia ..................................................................................... 23 2.5.2. Faceted Wikipedia Search .................................................................................. 24 2.5.3. RelFinder: DBpedia Relationship Finder ........................................................... 26 2.6. Wikipedia-based Approaches ....................................................................................... 28 2.6.1. Introduction to Wikipedia .................................................................................