The Pennsylvania State University
The Graduate School
College of Earth and Mineral Sciences

THE GEOGRAPHICAL ANALOG ENGINE: HYBRID NUMERIC AND SEMANTIC SIMILARITY MEASURES FOR U.S. CITIES

A Thesis in Geography
by
Tawan Banchuen

© 2008 Tawan Banchuen

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

May 2008

The thesis of Tawan Banchuen was reviewed and approved* by the following:

C. Gregory Knight
Professor of Geography
Thesis Co-Advisor
Co-Chair of Committee

Mark N. Gahegan
Professor of Geography
Thesis Co-Advisor
Co-Chair of Committee

Robert G. Crane
Professor of Geography

Dennis W. Thomson
Professor of Meteorology

Karl Zimmerer
Professor of Geography
Head of the Department of Geography

*Signatures are on file in the Graduate School

ABSTRACT

This dissertation began with the goal of developing a methodology for locating climate change analogs, and quickly turned into a quest for computational means of locating geographical analogs in general. Previous work on geographical analogs either computed only on numeric information or considered qualitative information manually. Current and emerging technologies, such as electronic document collections, the Internet, and the Semantic Web, make it possible for people and organizations to store millions of books and articles, share them with the world, or even author some themselves. The amount of electronic and online content is expanding exponentially, such that analysts are increasingly overwhelmed by the sheer volume of accessible information. The dissertation explores techniques from knowledge engineering, artificial intelligence, information sciences, linguistics and cognitive science, and proposes a novel, automatic methodology that computes similarity within online and offline textual information, and graphically and statistically combines the results with those of numeric methods. U.S. cities with populations larger than 25,000 people are selected as a test case.
Places are evaluated based on their numeric characteristics in the County and City Data Book and their qualitative characteristics from Wikipedia entries. The dissertation recommends a way to convert Wikipedia entries into Web Ontology Language (OWL) ontologies, which computer algorithms can read, understand and compute on. The dissertation initially experiments with Mitra and Wiederhold’s semantic measure to quantify similarity between places in the qualitative space. Many shortfalls are identified, and a series of experimental enhancements are explored. The experiments demonstrate that good semantic measures should employ a comprehensive stop-words list and a complete but succinct vocabulary. A semantic measure that can recognize synonyms must understand the intended senses of words in a place description. Furthermore, analysts need to be careful with two styles of description: descriptions of places that are (1) created by following a template, or (2) laden with statistical statements can result in falsely high similarity between the places.

It is illustrated that scatter plots of numeric similarity scores versus semantic similarity scores can effectively help analysts consider similarity between places in two-space. Analysts can visually observe whether the numeric ranks of places agree with the semantic ranks. The dissertation also shows that Spearman’s rank correlation test and the Kruskal-Wallis test of means can provide statistical confirmation for visual observations. The proposed hybrid methodology enables analysts to automatically discover geographical analogs in ways that strictly numeric methods or manual semantic analysis cannot offer.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS

Chapter 1 Why the Geographical Analog Engine?
  1.1 Inadequate Numeric Methods and Overwhelming Qualitative Information
  1.2 The Geographical Analog Engine

Chapter 2 Analogs
  2.1 Definitions
  2.2 Roles of Analogy Making in Human Cognition
    2.2.1 Metaphors
    2.2.2 Classification
    2.2.3 Abduction
  2.3 Contemporary Usage of Analogs
  2.4 Conclusions

Chapter 3 Similarity: Numeric and Semantic Measures
  3.1 Ontology
    3.5.1 Web Ontology Language (OWL)
    3.5.2 Semantic Networks
  3.2 Similarity Measures
    3.8.1 Semantic Measure
      3.8.1.1 Lexicon Based Measure
        3.8.1.1.1 Mitra and Wiederhold’s Algorithm
        3.8.1.1.2 Enhanced Version of Mitra and Wiederhold’s Algorithm
      3.8.1.2 Feature-Set Based Measure
        3.8.1.2.1 Basic Synonym Count (BSC)
        3.8.1.2.2 Synonym Count with a Vocabulary (SCV)
      3.8.1.3 Corpus Based Measure
        3.8.1.3.1 Corpus Based Synonym Count (CBSC)
      3.8.1.4 Graph Based Measure
      3.8.1.5 Measure Based on Information Content
      3.8.1.6 Expert Judgment
    3.8.2 Numeric Measure: Euclidean Similarity
      3.8.2.1 Principal Component Analysis
      3.8.2.2 K-Means Cluster Analysis
  3.3 Statistical Tests of Similarity Measures
    3.9.1 Kruskal-Wallis Test
    3.9.2 Spearman’s Rank Correlation Coefficient Test
  3.4 Uncertainty of Similarity Measures
  3.11 Conclusions

Chapter 4 Preliminary Experiment: Six University Cities
  4.1 Comparison Using Mitra and Wiederhold’s Algorithm
  4.2 Comparison Using Enhanced Mitra and Wiederhold’s Algorithm
  4.3 Conclusions

Chapter 5 Numeric Analysis
  5.1 Principal Component Analysis
  5.2 K-Means Cluster Analysis
  5.3 Selection of Cities for Subsequent Analyses
    5.3.1 Accounting for the Current State of Wikipedia Entries
  5.4 Conclusions

Chapter 6 Semantic Analysis
  6.1 Synonym Count with a Vocabulary (SCV)
    6.1.1 Results and Discussion
    6.1.2 Discussion of SCV