A Survey on Geocoding: Algorithms and Datasets for Toponym Resolution

Anonymous ACL submission

Abstract

Geocoding, the task of converting unstructured text to structured spatial data, has recently seen progress thanks to a variety of new datasets, evaluation metrics, and machine-learning algorithms. We provide a survey to review, organize, and analyze recent work on geocoding (also known as toponym resolution), where the text is matched to geospatial coordinates and/or ontologies. We summarize the findings of this research and suggest some promising directions for future work.

1 Introduction

Geocoding, also called toponym resolution or toponym disambiguation, is the subtask of geoparsing that disambiguates place names in text. The goal of geocoding is, given a textual mention of a location, to choose the corresponding geospatial coordinates, geospatial polygon, or entry in a geospatial database. Geocoders must handle place names (known as toponyms) that refer to more than one geographical location (e.g., Paris can refer to a town in the state of Texas in the United States, or the capital city of France), and geographical locations that may be referred to by more than one name (e.g., Leeuwarden and Ljouwert are two names for the same city in the Netherlands), as shown in Figure 1.

Figure 1: An illustrative example of geocoding challenges. One toponym (Paris) can refer to more than one geographical location (a town in the state of Texas in the United States or the capital city of France in Europe), and a geographical location may be referred to by more than one toponym (Leeuwarden and Ljouwert are two names for the same city in the Netherlands).

Geocoding plays a critical role in tasks such as tracking the evolution and emergence of infectious diseases (Hay et al., 2013), analyzing and searching documents by geography (Bhargava et al., 2017), geospatial analysis of historical events (Tateosian et al., 2017), and disaster response mechanisms (Ashktorab et al., 2014; de Bruijn et al., 2018).

The field of geocoding, previously dominated by geographical information systems communities, has seen a recent surge in interest from the natural language processing community due to the interesting linguistic challenges this task presents. The four most recent geocoding datasets (see Table 1) were all published at venues in the ACL Anthology. In addition, the recent ACL-SIGLEX sponsored SemEval 2019 Task 12: Toponym Resolution in Scientific Papers (Weissenbacher et al., 2019) resulted in several new natural language processing approaches to geocoding.

The field has thus changed substantially since the most recent survey of geocoding (Gritta et al., 2017), including a doubling of the number of geocoding datasets and the advent of modern neural network approaches to geocoding. The field would thus benefit from a survey and critical evaluation of the currently available datasets, evaluation metrics, and geocoding algorithms. Our contributions are:

• the first survey on geocoding to include recent deep learning approaches
• coverage of new geocoding datasets (which increased by 100% since 2017) and geocoding systems (which increased by 50% since 2017)
• discussion of new directions, such as polygon-based prediction

In the remainder of this article, we first highlight some previous geocoding surveys (Section 2) and explain the scope of the current survey (Section 3). We then categorize the features of recent geocoding datasets (Section 4), compare different choices for geocoding evaluation metrics (Section 5), and break down the different types of features and architectures used by geocoding systems (Section 6). We conclude with a discussion of where the field should head next (Section 7).

2 Background

To the best of our knowledge, the first formal survey of geocoding is Leidner (2007). This Ph.D. thesis distinguished the task of finding place names (known as geotagging or toponym recognition) from the task of linking place names to databases (known as geocoding or toponym resolution). The thesis found that most geocoding methods were based on combining natural language processing techniques, such as lexical string matching or word sense matching, with geographic heuristics, such as spatial-distance minimum and population maximum. Most geocoders studied in this thesis were rule-based.

Monteiro et al. (2016) surveyed work on predicting document-level geographic scope, which often includes mention-level geocoding as one of its steps. Most of this survey focused on the document-level task, but the geocoding section found techniques similar to those found by Leidner (2007).

Gritta et al. (2017) reviewed both geotagging and geocoding, and proposed a new dataset, WikToR. The survey portion of this article compared datasets for geoparsing, explored heuristics of rule-based and feature-based machine learning geocoders, summarized evaluation metrics, and classified common errors from several geocoders (misspellings, case sensitivity, processing of fictional and historical text, etc.). Gritta et al. (2017) concluded that future geoparsers would need to utilize semantics and context, not just syntax and word forms as the geocoders of the time did.

Geocoding research since these previous surveys has changed in several important ways, as will be described in the remainder of this article. Most notably, new datasets and evaluation metrics are enabling new polygon-based views of the problem, and deep learning methods are offering new algorithms and new approaches for geocoding.

3 Scope

We focus on the geocoding problem, where mentions of place names are resolved to database entries or polygons. We thus searched the Google Scholar and Semantic Scholar search engines for papers matching any of the keyword queries: geocoding, geoparsing, geolocation, toponym resolution, toponym disambiguation, or spatial information extraction. From the results, we excluded articles that described tasks other than mention-level geocoding, for example:

• matching a full document or full microblog post to a single location (Luo et al., 2020; Hoang and Mothe, 2018; Kumar and Singh, 2019; Lee et al., 2015)
• geographic document retrieval and classification (Gey et al., 2005; Adams and McKenzie, 2018)
• matching toponyms to each other within a geographical database (Santos et al., 2018)

We also excluded papers published before 2010 (e.g., Smith and Crane, 2001), as they have been covered thoroughly by prior surveys.

In total, we reviewed more than 60 papers and included more than 30 of them in this survey.

4 Geocoding Datasets

Many geocoding corpora have been proposed, drawn from different domains, linking to different geographic databases, with different forms of geocoding labels, and with varying sizes in terms of both articles/messages and toponyms. Table 1 summarizes these datasets, and the following sections walk through some of the dimensions over which the datasets vary.

4.1 Domains

The news domain is the most common target for geocoding corpora, covering sources like broadcast conversation, broadcast news, and news magazines. Examples include the ACE 2005 English SpatialML Annotations (ACS, Mani et al., 2010)[1], the Local Global Lexicon (LGL, Lieberman et al., 2010), CLUST (Lieberman and Samet, 2011), TR-NEWS (Kamalloo and Rafiei, 2018), GeoVirus (Gritta et al., 2018), and GeoWebNews (Gritta et al., 2019). Though all these datasets include news text, they vary in what toponyms are included. For example, LGL is based on local and small U.S. news sources with most toponyms smaller than a U.S. state, while GeoVirus focuses on news about global disease outbreaks and epidemics with larger, often country-level, toponyms.

[1] https://catalog.ldc.upenn.edu/LDC2008T03 and https://catalog.ldc.upenn.edu/LDC2011T02

| Corpus | Domain | Geographic Database | Label Type | Articles / Messages | Toponyms |
|---|---|---|---|---|---|
| ACS, Mani et al. (2010) | News | GeoNames | Point | 428 | 4783 |
| LGL, Lieberman et al. (2010) | News | GeoNames | Point & GeoNamesID | 588 | 4783 |
| CLUST, Lieberman and Samet (2011) | News | GeoNames | Point & GeoNamesID | 1082 | 11564 |
| Zhang and Gelernter (2014) | Twitter | GeoNames | Point & GeoNamesID | 956 | 1393 |
| WOTR, DeLozier et al. (2016) | Historical | OpenStreetMap | Point & Polygon | 9653 | 10380 |
| WikToR, Gritta et al. (2017) | Wikipedia | GeoNames | Point | 5000 | 25000 |
| TR-NEWS, Kamalloo and Rafiei (2018) | News | GeoNames | Point & GeoNamesID | 118 | 1274 |
| GeoCorpora, Wallgrün et al. (2018) | Twitter | GeoNames | Point & GeoNamesID | 211 | 2966 |
| GeoVirus, Gritta et al. (2018) | News | GeoNames | Point | 229 | 2167 |
| GeoWebNews, Gritta et al. (2019) | News | GeoNames | Point & GeoNamesID | 200 | 5121 |
| SemEval2019, Weissenbacher et al. (2019) | Scientific | GeoNames | Point & GeoNamesID | 150 | 8360 |
| GeoCoDe, Laparra and Bethard (2020) | Wikipedia | OpenStreetMap | Polygon | 360187 | 360187 |

Table 1: Summary of geocoding datasets covered by this survey, sorted by year of creation.
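The counts behind several statements about Table 1 (news as the most common domain, and the doubling of datasets since 2017) can be checked with a short tally. The following Python sketch is illustrative only: the rows are copied from Table 1, the year of each corpus is taken from its citation, and treating corpora dated 2017 or earlier as the baseline for the growth figure is our assumption about how the 100% increase is counted.

```python
from collections import Counter
from typing import NamedTuple


class Corpus(NamedTuple):
    name: str
    year: int          # citation year, used as a proxy for year of creation
    domain: str
    database: str
    label_type: str
    documents: int     # articles or messages
    toponyms: int


# Rows copied from Table 1.
TABLE_1 = [
    Corpus("ACS", 2010, "News", "GeoNames", "Point", 428, 4783),
    Corpus("LGL", 2010, "News", "GeoNames", "Point & GeoNamesID", 588, 4783),
    Corpus("CLUST", 2011, "News", "GeoNames", "Point & GeoNamesID", 1082, 11564),
    Corpus("Zhang and Gelernter", 2014, "Twitter", "GeoNames", "Point & GeoNamesID", 956, 1393),
    Corpus("WOTR", 2016, "Historical", "OpenStreetMap", "Point & Polygon", 9653, 10380),
    Corpus("WikToR", 2017, "Wikipedia", "GeoNames", "Point", 5000, 25000),
    Corpus("TR-NEWS", 2018, "News", "GeoNames", "Point & GeoNamesID", 118, 1274),
    Corpus("GeoCorpora", 2018, "Twitter", "GeoNames", "Point & GeoNamesID", 211, 2966),
    Corpus("GeoVirus", 2018, "News", "GeoNames", "Point", 229, 2167),
    Corpus("GeoWebNews", 2019, "News", "GeoNames", "Point & GeoNamesID", 200, 5121),
    Corpus("SemEval2019", 2019, "Scientific", "GeoNames", "Point & GeoNamesID", 150, 8360),
    Corpus("GeoCoDe", 2020, "Wikipedia", "OpenStreetMap", "Polygon", 360187, 360187),
]

# How many corpora target each domain and each geographic database.
domain_counts = Counter(c.domain for c in TABLE_1)
database_counts = Counter(c.database for c in TABLE_1)

# Assumption: corpora dated 2017 or earlier form the baseline for growth.
baseline = sum(1 for c in TABLE_1 if c.year <= 2017)
newer = len(TABLE_1) - baseline
growth_percent = 100 * newer / baseline

print(domain_counts.most_common())
print(database_counts.most_common())
print(f"{baseline} corpora through 2017, {newer} since: +{growth_percent:.0f}%")
```

Under these assumptions, half of the twelve corpora are news corpora, ten of the twelve link to GeoNames, and the six corpora dated after 2017 match the six dated 2017 or earlier.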
Web text is also a common target for geocoding corpora. Wikipedia Toponym Retrieval (WikToR; Gritta et al., 2017) and GeoCoDe (Laparra and Bethard, 2020) are both based on Wikipedia pages. ACS, mentioned above, also includes newsgroup and weblog data.

[...] CLUST, the Zhang and Gelernter (2014) corpus, WikToR, TR-NEWS, GeoCorpora, GeoVirus, GeoWebNews, and the SemEval-2019 Task 12 corpus. GeoNames is a crowdsourced database of geospatial locations, with almost 7 million entries and a [...]
