Developing a Framework for Geographic Question Answering Systems
Developing a Framework for Geographic Question Answering Systems Using GIS, Natural Language Processing, Machine Learning, and Ontologies

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Wei Chen

Graduate Program in Geography

The Ohio State University
2014

Dissertation Committee:
Dr. Eric Fosler-Lussier (committee member)
Dr. Rajiv Ramnath (committee member)
Dr. Daniel Sui (committee member)
Dr. Ningchuan Xiao (committee chair)

Copyrighted by Wei Chen 2014

Abstract

Geographic question answering (QA) systems can be used to help make geographic knowledge accessible by directly giving answers to natural language questions. In this dissertation, a geographic question answering (GeoQA) framework is proposed by incorporating techniques from natural language processing, machine learning, ontological reasoning, and geographic information systems (GIS). We demonstrate that GIS functions provide valuable rule-based knowledge, which may not be available elsewhere, for answering geographic questions. Ontologies of space are developed to interpret the meaning of linguistic spatial terms, which are later mapped to components of a query in a GIS; these ontologies are shown to be indispensable during each step of question analysis. A customized classifier based on dynamic programming and a voting algorithm is also developed to classify questions into answerable categories.

To prepare a set of geographic questions, we conducted a human survey and generalized the four categories with the most questions for use in experiments. These categories were later used to train a classifier to classify new questions. Classified natural language questions are converted to spatial SQL queries to retrieve data from relational databases. Consequently, our demo system is able to give exact answers to four categories of geographic questions within an average time of two seconds.
The system has been evaluated using classical machine learning-based measures and achieved an overall accuracy of 90% on test data. Results show that spatial ontologies and GIS are critical for extending the capabilities of a GeoQA system. The spatial reasoning of GIS makes it a powerful analytical engine for answering geographic questions through spatial data modeling and analysis.

Dedication

To Ying Dong (Susie), my wife
To Jianxi Chen, Donglin Wei, my parents
To Tiejing Dong, Liqun Wang, my parents-in-law
To Baoyuan Chen, my deceased grandpa

Acknowledgments

I would like to acknowledge the efforts of all committee members, Dr. Daniel Sui, Dr. Eric Fosler-Lussier, and Dr. Rajiv Ramnath, for their significant contributions to the completion of this dissertation. My Ph.D. advisor, Dr. Ningchuan Xiao, was very helpful throughout the entire process of my Ph.D. study. He provided academic, financial, and spiritual support, especially during the last two semesters of my master's in geography and the final year of my Ph.D.

Many thanks also go to my friends from both the computer science and geography departments at The Ohio State University: Dr. Shanshan Cai, Dr. Meng Guo, Dr. Shiguo Jiang, and Dr. Michael Webb, as well as Ph.D. candidates Bo Zhao, Xiaolin Zhu, Xiang Chen, Lili Wang, Xining Yang, Zhe Xu, Nan Deng, Grey Evenson, Sam Kay, Nicholas Crane, Scott Stuckman, James Baginski, and Hyeseon Jeong, for their friendship and support during my six years of academic life at Ohio State. My gratitude also goes to Pastor Chris Kauffman, Mr. Steve Will and Ms. Kelly Will, Mr. Emerson Wu and Ms. Ivy Wu, and Mr. Dennis Shimer and Ms. Dori Shimer for their help with my life in Columbus, OH.

Table of Contents

Abstract
Dedication
Acknowledgments
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
  1.1 Question answering (QA)
    1.1.1 Advantages of QA systems
    1.1.2 Open-domain QA systems
    1.1.3 Closed-domain QA systems
    1.1.3 Challenges of QA systems
  1.2 Geographic question answering (GeoQA)
    1.2.1 Challenges of GeoQA
    1.2.3 Major problems in GeoQA
  1.3 Previous approaches to geographic QA
    1.3.1 Annotation-based approach
    1.3.2 IBM Watson: the deep QA approach
  1.4 Purpose of This Study
  1.5 An Overview of This Thesis
Chapter 2 Underpinning Technologies
  2.1 Natural language processing
    2.1.1 Structure matching
    2.1.2 NLP for information retrieval
  2.2 Machine learning
  2.3 Ontology
    2.3.1 Ontology and knowledge base
    2.3.2 Types of human knowledge
    2.3.4 Ontological representation of knowledge
    2.3.5 Spatial ontologies
  2.4 Geographic information systems
Chapter 3 Framework and System Components
  3.1 GIS as The Foundation of Knowledge-based QA Systems
  3.2 An Example of a Geographic Question Used in This Research
  3.3 Potential applications
    3.3.1 Internet search Google, Bing and Yahoo:
    3.2.2 Local business search via Amazon, eBay, Craigslist, PayPal
    3.2.3 Local search on social network sites such as Facebook and Twitter
  3.4 The GeoQA Framework and System Components
    3.4.1 Natural Language Processing Component
    3.4.2 Machine Learning Components
    3.4.3 Ontological Reasoning Component
    3.4.5 GIS Components
Chapter 4 Data
  4.1 Daily Geography Practice Categories
  4.2 Survey Questions
    4.2.1 Survey Process
    4.2.2 Survey Statistics
    4.2.3 Validation of Survey Answers
    4.2.4 Observations about the Survey
Chapter 5 Corpus Analysis Results