A Survey on Geocoding: Algorithms and Datasets for Toponym Resolution

Total Page:16

File Type:pdf, Size:1020Kb

A Survey on Geocoding: Algorithms and Datasets for Toponym Resolution A Survey on Geocoding: Algorithms and Datasets for Toponym Resolution Anonymous ACL submission Abstract 001 Geocoding, the task of converting unstructured 002 text to structured spatial data, has recently seen 003 progress thanks to a variety of new datasets, 004 evaluation metrics, and machine-learning al- 005 gorithms. We provide a survey to review, or- 006 ganize and analyze recent work on geocod- 007 ing (also known as toponym resolution) where 008 the text is matched to geospatial coordinates Figure 1: An illustrative example of geocoding chal- 009 and/or ontologies. We summarize the findings lenges. One toponym (Paris) can refer to more than 010 of this research and suggest some promising one geographical location (a town in the state of Texas 011 directions for future work. in the United States or the capital city of France in Eu- rope), and a geographical location may be referred to 012 1 Introduction by more than one toponym (Leeuwarden and Ljouwert 013 Geocoding, also called toponym resolution or to- are two names for the same city in the Netherlands). 014 ponym disambiguation, is the subtask of geopars- 015 ing that disambiguates place names in text. The And the recent ACL-SIGLEX sponsored SemEval 041 016 goal of geocoding is, given a textual mention of a 2019 Task 12: Toponym Resolution in Scientific 042 017 location, to choose the corresponding geospatial co- Papers (Weissenbacher et al., 2019) resulted in sev- 043 018 ordinates, geospatial polygon, or entry in a geospa- eral new natural language processing approaches 044 019 tial database. Geocoders must handle place names to geocoding. The field has thus changed substan- 045 020 (known as toponyms) that refer to more than one ge- tially since the most recent survey of geocoding 046 021 ographical location (e.g., Paris can refer to a town (Gritta et al., 2017), including a doubling of the 047 022 in the state of Texas in the United States, or the cap- number of geocoding datasets, and the advent of 048 023 ital city of France), and geographical locations that modern neural network approaches to geocoding. 049 024 may be referred to by more than one name (e.g., The field would thus benefit from a survey 050 025 Leeuwarden and Ljouwert are two names for the and critical evaluation of the currently available 051 026 same city in the Netherlands), as shown in fig.1. datasets, evaluation metrics, and geocoding algo- 052 027 Geocoding plays a critical role in tasks such as rithms. Our contributions are: 053 028 tracking the evolution and emergence of infectious 029 diseases (Hay et al., 2013), analyzing and searching • the first survey on geocoding to include recent 054 030 documents by geography (Bhargava et al., 2017), deep learning approaches 055 031 geospatial analysis of historical events (Tateosian • coverage of new geocoding datasets (which 056 032 et al., 2017), and disaster response mechanisms increased by 100% since 2017) and geocoding 057 033 (Ashktorab et al., 2014; de Bruijn et al., 2018). systems (which increased by 50% since 2017) 058 034 The field of geocoding, previously dominated • discussion of new directions, such as polygon- 059 035 by geographical information systems communities, based prediction 060 036 has seen a recent surge in interest from the natural 037 language processing community due to the inter- In the remainder of this article, we first highlight 061 038 esting linguistic challenges this task presents. The some previous geocoding surveys (section2) and 062 039 four most recent geocoding datasets (see table1) explain the scope of the current survey (section3). 063 040 were all published at venues in the ACL Anthology. We then categorize the features of recent geocod- 064 1 065 ing datasets (section4), compare different choices for papers matching any of the keyword queries: 114 066 for geocoding evaluation metrics (section5), and geocoding, geoparsing, geolocation, toponym res- 115 067 break down the different types of features and ar- olution, toponym disambiguation, or spatial infor- 116 068 chitectures used by geocoding systems (section6). mation extraxtion. From the results, we excluded 117 069 We conclude with a discussion of where the field articles that described tasks other than mention- 118 070 should head next (section7). level geocoding, for example: 119 • matching a full document or full microblog 120 071 2 Background post to a single location (Luo et al., 2020; 121 072 To the best of our knowledge, the first formal sur- Hoang and Mothe, 2018; Kumar and Singh, 122 073 vey of geocoding is Leidner(2007). This Ph.D. the- 2019; Lee et al., 2015) 123 074 sis distinguished the tasks of finding place names • geographic document retrieval and classifica- 124 075 (known as geotagging or toponym recognition) tion (Gey et al., 2005; Adams and McKenzie, 125 076 from linking place names to databases (known as 2018) 126 077 geocoding or toponym resolution). They found that • matching typonyms to each other within a 127 078 most geocoding methods were based on combining geographical database (Santos et al., 2018) 128 079 natural language processing techniques, such as lex- We also excluded papers published before 2010 129 080 ical string matching or word sense matching, with (e.g., Smith and Crane, 2001), as they have been 130 081 geographic heuristics, such as spatial-distance min- covered thoroughly by prior surveys. 131 082 imum and population maximum. Most geocoders In total, we reviewed more than 60 papers and 132 083 studied in this thesis were rule-based. included more than 30 of them in this survey. 133 084 Monteiro et al.(2016) surveyed work on predict- 085 ing document-level geographic scope, which of- 4 Geocoding Datasets 134 086 ten includes mention-level geocoding as one of its 087 steps. Most of this survey focused on the document- Many geocoding corpora have been proposed, 135 088 level task, but the geocoding section found tech- drawn from different domains, linking to differ- 136 089 niques similar to those found by Leidner(2007). ent geographic databases, with different forms of 137 090 Gritta et al.(2017) reviewed both geotagging geocoding labels, and with varying sizes in terms 138 091 and geocoding, and proposed a new dataset, Wik- of both articles/messages and toponyms. Table1 139 092 ToR. The survey portion of this article compared summarizes these datasets, and the following sec- 140 093 datasets for geoparsing, explored heuristics of rule- tions walk through some of the dimensions over 141 094 based and feature-based machine learning-based which the datasets vary. 142 095 geocoders, summarized evaluation metrics, and 4.1 Domains 143 096 classified common errors from several geocoders The news domain is the most common target for 144 097 (misspellings, case sensitivity, processing fictional geocoding corpora, covering sources like broad- 145 098 and historical text presents, etc.). Gritta et al. cast conversation, broadcast news, and news mag- 146 099 (2017) concluded that future geoparsers would azines. Examples include the ACE 2005 English 147 100 need to utilize semantics and context, not just syn- SpatialML Annotations (ACS, Mani et al., 2010)1, 148 101 tax and word forms as the geocoders of the time. the Local Global Lexicon (LGL, Lieberman et al., 149 102 Geocoding research since these previous surveys 2010), CLUST (Lieberman and Samet, 2011), TR- 150 103 has changed in several important ways, as will be NEWS (Kamalloo and Rafiei, 2018), GeoVirus 151 104 described in the remainder of this article. Most (Gritta et al., 2018), and GeoWebNews (Gritta et al., 152 105 notably, new datasets and evaluation metrics are 2019). Though all these datasets include news text, 153 106 enabling new polygon-based views of the problem, they vary in what toponyms are included. For ex- 154 107 and deep learning methods are offering new algo- ample, LGL is based on local and small U.S. news 155 108 rithms and new approaches for geocoding. sources with most toponyms smaller than a U.S. 156 109 3 Scope state, while GeoVirus focuses on news about global 157 disease outbreaks and epidemics with larger, often 158 110 We focus on the geocoding problem, where men- country-level, toponyms. 159 111 tions of place names are resolved to database en- 1https://catalog.ldc.upenn.edu/ 112 tries or polygons. We thus searched the Google LDC2008T03 113 Scholar and Semantic Scholar search engines https://catalog.ldc.upenn.edu/LDC2011T02 2 Geographic Articles / Corpus Domain Label Type Toponyms Database Messages ACS, Mani et al.(2010) News GeoNames Point 428 4783 LGL, Lieberman et al.(2010) News GeoNames Point & GeoNamesID 588 4783 CLUST, Lieberman and Samet(2011) News GeoNames Point & GeoNamesID 1082 11564 Zhang and Gelernter(2014) Twitter GeoNames Point & GeoNamesID 956 1393 WOTR, DeLozier et al.(2016) Historical OpenStreetMap Point & Polygon 9653 10380 WikTOR, Gritta et al.(2017) Wikipedia GeoNames Point 5000 25000 TR-NEWS, Kamalloo and Rafiei(2018) News GeoNames Point & GeoNamesID 118 1274 GeoCorpora, Wallgrun¨ et al.(2018) Twitter GeoNames Point & GeoNamesID 211 2966 GeoVirus, Gritta et al.(2018) News GeoNames Point 229 2167 GeoWebNews, Gritta et al.(2019) News GeoNames Point & GeoNamesID 200 5121 SemEval2019, Weissenbacher et al.(2019) Scientific GeoNames Point & GeoNamesID 150 8360 GeoCoDe, Laparra and Bethard(2020) Wikipedia OpenStreetMap Polygon 360187 360187 Table 1: Summary of geocoding datasets covered by this survey, sorted by year of creation. 160 Web text is also a common target for geocoding CLUST, the Zhang and Gelernter(2014) corpus, 193 161 corpora. Wikipedia Toponym Retrieval (WikToR; WikToR, TR-NEWS, GeoCorpora, GeoVirus, Ge- 194 162 Gritta et al., 2017) and GeoCoDe (Laparra and oWebNews, and the SemEval-2019 Task 12 corpus. 195 163 Bethard, 2020) are both based on Wikipedia pages. GeoNames is a crowdsourced database of geospa- 196 164 ACS, mentioned above, also includes newsgroup tial locations, with almost 7 million entries and a 197 165 and weblog data.
Recommended publications
  • Adaptive Geoparsing Method for Toponym Recognition and Resolution in Unstructured Text
    remote sensing Article Adaptive Geoparsing Method for Toponym Recognition and Resolution in Unstructured Text Edwin Aldana-Bobadilla 1,∗ , Alejandro Molina-Villegas 2 , Ivan Lopez-Arevalo 3 , Shanel Reyes-Palacios 3, Victor Muñiz-Sanchez 4 and Jean Arreola-Trapala 4 1 Conacyt-Centro de Investigación y de Estudios Avanzados del I.P.N. (Cinvestav), Victoria 87130, Mexico; [email protected] 2 Conacyt-Centro de Investigación en Ciencias de Información Geoespacial (Centrogeo), Mérida 97302, Mexico; [email protected] 3 Centro de Investigación y de Estudios Avanzados del I.P.N. Unidad Tamaulipas (Cinvestav Tamaulipas), Victoria 87130, Mexico; [email protected] (I.L.); [email protected] (S.R.) 4 Centro de Investigación en Matemáticas (Cimat), Monterrey 66628, Mexico; [email protected] (V.M.); [email protected] (J.A.) * Correspondence: [email protected] Received: 10 July 2020; Accepted: 13 September 2020; Published: 17 September 2020 Abstract: The automatic extraction of geospatial information is an important aspect of data mining. Computer systems capable of discovering geographic information from natural language involve a complex process called geoparsing, which includes two important tasks: geographic entity recognition and toponym resolution. The first task could be approached through a machine learning approach, in which case a model is trained to recognize a sequence of characters (words) corresponding to geographic entities. The second task consists of assigning such entities to their most likely coordinates. Frequently, the latter process involves solving referential ambiguities. In this paper, we propose an extensible geoparsing approach including geographic entity recognition based on a neural network model and disambiguation based on what we have called dynamic context disambiguation.
    [Show full text]
  • Knowledge-Driven Geospatial Location Resolution for Phylogeographic
    Bioinformatics, 31, 2015, i348–i356 doi: 10.1093/bioinformatics/btv259 ISMB/ECCB 2015 Knowledge-driven geospatial location resolution for phylogeographic models of virus migration Davy Weissenbacher1,*, Tasnia Tahsin1, Rachel Beard1,2, Mari Figaro1,2, Robert Rivera1, Matthew Scotch1,2 and Graciela Gonzalez1 1Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and 2Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA *To whom correspondence should be addressed. Abstract Summary: Diseases caused by zoonotic viruses (viruses transmittable between humans and ani- mals) are a major threat to public health throughout the world. By studying virus migration and mutation patterns, the field of phylogeography provides a valuable tool for improving their surveil- lance. A key component in phylogeographic analysis of zoonotic viruses involves identifying the specific locations of relevant viral sequences. This is usually accomplished by querying public data- bases such as GenBank and examining the geospatial metadata in the record. When sufficient detail is not available, a logical next step is for the researcher to conduct a manual survey of the corresponding published articles. Motivation: In this article, we present a system for detection and disambiguation of locations (topo- nym resolution) in full-text articles to automate the retrieval of sufficient metadata. Our system has been tested on a manually annotated corpus of journal articles related to phylogeography using inte- grated heuristics for location disambiguation including a distance heuristic, a population heuristic and a novel heuristic utilizing knowledge obtained from GenBank metadata (i.e. a ‘metadata heuristic’). Results: For detecting and disambiguating locations, our system performed best using the meta- data heuristic (0.54 Precision, 0.89 Recall and 0.68 F-score).
    [Show full text]
  • TAGGS: Grouping Tweets to Improve Global Geotagging for Disaster Response Jens De Bruijn1, Hans De Moel1, Brenden Jongman1,2, Jurjen Wagemaker3, Jeroen C.J.H
    Nat. Hazards Earth Syst. Sci. Discuss., https://doi.org/10.5194/nhess-2017-203 Manuscript under review for journal Nat. Hazards Earth Syst. Sci. Discussion started: 13 June 2017 c Author(s) 2017. CC BY 3.0 License. TAGGS: Grouping Tweets to Improve Global Geotagging for Disaster Response Jens de Bruijn1, Hans de Moel1, Brenden Jongman1,2, Jurjen Wagemaker3, Jeroen C.J.H. Aerts1 1Institute for Environmental Studies, VU University, Amsterdam, 1081HV, The Netherlands 5 2Global Facility for Disaster Reduction and Recovery, World Bank Group, Washington D.C., 20433, USA 3FloodTags, The Hague, 2511 BE, The Netherlands Correspondence to: Jens de Bruijn ([email protected]) Abstract. The availability of timely and accurate information about ongoing events is important for relief organizations seeking to effectively respond to disasters. Recently, social media platforms, and in particular Twitter, have gained traction as 10 a novel source of information on disaster events. Unfortunately, geographical information is rarely attached to tweets, which hinders the use of Twitter for geographical applications. As a solution, analyses of a tweet’s text, combined with an evaluation of its metadata, can help to increase the number of geo-located tweets. This paper describes a new algorithm (TAGGS), that georeferences tweets by using the spatial information of groups of tweets mentioning the same location. This technique results in a roughly twofold increase in the number of geo-located tweets as compared to existing methods. We applied this approach 15 to 35.1 million flood-related tweets in 12 languages, collected over 2.5 years. In the dataset, we found 11.6 million tweets mentioning one or more flood locations, which can be towns (6.9 million), provinces (3.3 million), or countries (2.2 million).
    [Show full text]
  • Structured Toponym Resolution Using Combined Hierarchical Place Categories∗
    In GIR’13: Proceedings of the 7th ACM SIGSPATIAL Workshop on Geographic Information Retrieval, Orlando, FL, November 2013 (GIR’13 Best Paper Award). Structured Toponym Resolution Using Combined Hierarchical Place Categories∗ Marco D. Adelfio Hanan Samet Center for Automation Research, Institute for Advanced Computer Studies Department of Computer Science, University of Maryland College Park, MD 20742 USA {marco, hjs}@cs.umd.edu ABSTRACT list or table is geotagged with the locations it references. To- ponym resolution and geotagging are common topics in cur- Determining geographic interpretations for place names, or rent research but the results can be inaccurate when place toponyms, involves resolving multiple types of ambiguity. names are not well-specified (that is, when place names are Place names commonly occur within lists and data tables, not followed by a geographic container, such as a country, whose authors frequently omit qualifications (such as city state, or province name). Our aim is to utilize the context of or state containers) for place names because they expect other places named within the table or list to disambiguate the meaning of individual place names to be obvious from place names that have multiple geographic interpretations. context. We present a novel technique for place name dis- ambiguation (also known as toponym resolution) that uses Geographic references are a very common component of Bayesian inference to assign categories to lists or tables data tables that can occur in both (1) the case where the containing place names, and then interprets individual to- primary entities of a table are geographic or (2) the case ponyms based on the most likely category assignments.
    [Show full text]
  • Toponym Resolution in Scientific Papers
    SemEval-2019 Task 12: Toponym Resolution in Scientific Papers Davy Weissenbachery, Arjun Maggez, Karen O’Connory, Matthew Scotchz, Graciela Gonzalez-Hernandezy yDBEI, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA zBiodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ 85281, USA yfdweissen, karoc, [email protected] zfamaggera, [email protected] Abstract Disambiguation of toponyms is a more recent task (Leidner, 2007). We present the SemEval-2019 Task 12 which With the growth of the internet, the public adop- focuses on toponym resolution in scientific ar- ticles. Given an article from PubMed, the tion of smartphones equipped with Geographic In- task consists of detecting mentions of names formation Systems and the collaborative devel- of places, or toponyms, and mapping the opment of comprehensive maps and geographi- mentions to their corresponding entries in cal databases, toponym resolution has seen an im- GeoNames.org, a database of geospatial loca- portant gain of interest in the last two decades. tions. We proposed three subtasks. In Sub- Not only academic but also commercial and open task 1, we asked participants to detect all to- source toponym resolvers are now available. How- ponyms in an article. In Subtask 2, given to- ponym mentions as input, we asked partici- ever, their performance varies greatly when ap- pants to disambiguate them by linking them plied on corpora of different genres and domains to entries in GeoNames. In Subtask 3, we (Gritta et al., 2018). Toponym disambiguation asked participants to perform both the de- tackles ambiguities between different toponyms, tection and the disambiguation steps for all like Manchester, NH, USA vs.
    [Show full text]
  • A Survey on Geocoding: Algorithms and Datasets for Toponym Resolution
    A Survey on Geocoding: Algorithms and Datasets for Toponym Resolution Anonymous ACL submission Abstract survey and critical evaluation of the currently avail- 040 able datasets, evaluation metrics, and geocoding 041 001 Geocoding, the task of converting unstructured algorithms. Our contributions are: 042 002 text to structured spatial data, has recently seen 003 progress thanks to a variety of new datasets, • the first survey to review deep learning ap- 043 004 evaluation metrics, and machine-learning algo- proaches to geocoding 044 005 rithms. We provide a comprehensive survey • comprehensive coverage of geocoding sys- 045 006 to review, organize and analyze recent work tems, which increased by 50% in the last 4 046 007 on geocoding (also known as toponym resolu- 008 tion) where the text is matched to geospatial years 047 009 coordinates and/or ontologies. We summarize • comprehensive coverage of annotated geocod- 048 010 the findings of this research and suggest some ing datasets, which increased by 100% in the 049 011 promising directions for future work. last 4 years 050 012 1 Introduction 2 Background 051 013 Geocoding, also called toponym resolution or to- An early work on geocoding, Amitay et al.(2004), 052 014 ponym disambiguation, is the subtask of geopars- identifies two important types of ambiguity: A 053 015 ing that disambiguates place names in text. The place name may also have a non-geographic mean- 054 016 goal of geocoding is, given a textual mention of a ing, such as Turkey the country vs. turkey the ani- 055 017 location, to choose the corresponding geospatial co- mal, and two places may have the same name, such 056 018 ordinates, geospatial polygon, or entry in a geospa- as the San Jose in California and the San Jose in 057 019 tial database.
    [Show full text]
  • An Overview of Geotagging Geospatial Location
    An Overview of Geotagging Geospatial location Archival collections may include a huge amount of original documents which contain geospa- tial location (toponym) data. Scientists, researchers and archivists are aware of the value of this information, and various tools have been developed to extract this key data from archival records (Kemp, 2008). Geospatial identification metadata generally involves latitude and longitude coordi- nates, coded placenames and data sources, etc. (Hunter, 2012). Tagging Tagging, a natural result of social media and semantic web technologies, involves labelling content with arbitrarily selected tags. Geotags, which describe the geospatial information of the content, are one of the most common online tag types (Hu & Ge, 2008), (Intagorn , et al., 2010). Geotagging Geotagging refers to the process of adding geospatial identification metadata to various types of media such as e-books, narrative documents, web content, images and video and social media applications (Facebook, Twitter, Foursquare etc) (Hunter, 2012). With various social web tools and applications, users can add geospatial information to web content, photographs, audio and video. Using hardware integrated with Global Positioning System (GPS) receivers, this can be done automatically (Hu & Ge, 2008). 2 Social media tools with geo-tagging capabilities Del.icio.us1 A social bookmarking site that allows users to save and tag web pages and resources. CiteULike2 An online service to organise academic publications that allows users to tag academic papers and books. Twitter3 A real-time micro-blogging platform/information network that allows users all over the world to share and discover what is happening. Users can publish their geotag while tweeting. Flickr4 A photo-sharing service that allows users to store, tag and geotag their personal photos, maintain a network of contacts and tag others photos.
    [Show full text]
  • A Metadata Geoparsing System for Place Name Recognition and Resolution in Metadata Records
    A Metadata Geoparsing System for Place Name Recognition and Resolution in Metadata Records Nuno Freire, José Borbinha, Pável Calado, Bruno Martins IST / INESC-ID Apartado 13069, 1000-029 Lisboa, Portugal {nuno.freire, jlb, pavel.calado, bruno.g.martins}@ist.utl.pt ABSTRACT follow any data model at all, when it consists of natural language This paper describes an approach for performing recognition and texts (it was even estimated that 80% to 90% of business resolution of place names mentioned over the descriptive information may exist in those unstructured forms [1][2]). metadata records of typical digital libraries. Our approach exploits As society becomes more data oriented, much interest has arisen evidence provided by the existing structured attributes within the in these unstructured sources of information. This interest gave metadata records to support the place name recognition and origin to the research field of information extraction, which looks resolution, in order to achieve better results than by just using for automatic ways to create structured data from unstructured lexical evidence from the textual values of these attributes. In data sources [3]. An information extraction process selectively metadata records, lexical evidence is very often insufficient for structures and combines data that is found, explicitly stated or this task, since short sentences and simple expressions are implied, in texts or data sets with semi-structured schemas. The predominant. Our implementation uses a dictionary based final output of the extraction process will vary according to the technique for recognition of place names (with names provided by purpose, but typically it will consist in semantically richer data, Geonames), and machine learning for reasoning on the evidences which follows a more structured data model, and on which more and choosing a possible resolution candidate.
    [Show full text]
  • Geocoding Location Expressions in Twitter Messages: a Preference Learning Method
    JOURNAL OF SPATIAL INFORMATION SCIENCE Number 9 (2014), pp. 37–70 doi:10.5311/JOSIS.2014.9.170 RESEARCH ARTICLE Geocoding location expressions in Twitter messages: A preference learning method Wei Zhang and Judith Gelernter School of Computer Science, Carnegie Mellon University, Pittsburgh, USA Received: March 6, 2014; returned: May 12, 2014; revised: June 3, 2014; accepted: June 27, 2014. Abstract: Resolving location expressions in text to the correct physical location, also known as geocoding or grounding, is complicated by the fact that so many places around the world share the same name. Correct resolution is made even more difficult when there is little context to determine which place is intended, as in a 140-character Twitter message, or when location cues from different sources conflict, as may be the case among different metadata fields of a Twitter message. We used supervised machine learning to weigh the different fields of the Twitter message and the features of a world gazetteer to create a model that will prefer the correct gazetteer candidate to resolve the extracted expression. We evaluated our model using the F1 measure and compared it to similar algorithms. Our method achieved results higher than state-of-the-art competitors. Keywords: geocoding, toponym resolution, named entity disambiguation, geographic ref- erencing, geolocation, grounding, geographic information retrieval, Twitter 1 Introduction News tweet: “Van plows into Salvation Army store in Sydney.”1 The obvious guess is that this tweet refers to an incident in Australia. In fact, the tweet refers to an incident that happened on the Cape Breton Island in Nova Scotia, Canada.
    [Show full text]
  • Adaptive Geoparsing Method for Toponym Recognition and Resolution in Unstructured Text
    remote sensing Article Adaptive Geoparsing Method for Toponym Recognition and Resolution in Unstructured Text Edwin Aldana-Bobadilla 1,* , Alejandro Molina-Villegas 2 , Ivan Lopez-Arevalo 3 , Shanel Reyes-Palacios 3, Victor Muñiz-Sanchez 4 and Jean Arreola-Trapala 4 1 Conacyt-Centro de Investigación y de Estudios Avanzados del I.P.N. (Cinvestav), Victoria 87130, Mexico 2 Conacyt-Centro de Investigación en Ciencias de Información Geoespacial (Centrogeo), Mérida 97302, Mexico; [email protected] 3 Centro de Investigación y de Estudios Avanzados del I.P.N. Unidad Tamaulipas (Cinvestav Tamaulipas), Victoria 87130, Mexico; [email protected] (I.L.-A.); [email protected] (S.R.-P.) 4 Centro de Investigación en Matemáticas (Cimat), Monterrey 66628, Mexico; [email protected] (V.M.-S.); [email protected] (J.A.-T.) * Correspondence: [email protected] Received: 10 July 2020; Accepted: 13 September 2020; Published: 17 September 2020 Abstract: The automatic extraction of geospatial information is an important aspect of data mining. Computer systems capable of discovering geographic information from natural language involve a complex process called geoparsing, which includes two important tasks: geographic entity recognition and toponym resolution. The first task could be approached through a machine learning approach, in which case a model is trained to recognize a sequence of characters (words) corresponding to geographic entities. The second task consists of assigning such entities to their most likely coordinates. Frequently, the latter process involves solving referential ambiguities. In this paper, we propose an extensible geoparsing approach including geographic entity recognition based on a neural network model and disambiguation based on what we have called dynamic context disambiguation.
    [Show full text]
  • Toponym Resolution in Social Media
    Toponym Resolution in Social Media Neil Ireson and Fabio Ciravegna University of Sheffield, UK Abstract. Increasingly user-generated content is being utilised as a source of information, however each individual piece of content tends to contain low levels of information. In addition, such information tends to be informal and imperfect in nature; containing imprecise, subjec- tive, ambiguous expressions. However the content does not have to be interpreted in isolation as it is linked, either explicitly or implicitly, to a network of interrelated content; it may be grouped or tagged with similar content, comments may be added by other users or it may be re- lated to other content posted at the same time or by the same author or members of the author’s social network. This paper generally examines how ambiguous concepts within user-generated content can be assigned a specific/formal meaning by considering the expanding context of the information, i.e. other information contained within directly or indirectly related content, and specifically considers the issue of toponym resolution of locations. Keywords: Concept Disambiguation, Social networks, Information Extraction. 1 Introduction The growth in the use of social media for sharing content (text, images or video) to other individuals who can be close personal associates or random strangers, is staggering. If the latest statistics from Facebook1 are to be believed, around 7% of the world’s population are active users and they spend on average 40 minutes per day on this one site. Whilst the value of this User-Generated Content (UGC) is being realised, utilising the information it contains poses a number of challenges.
    [Show full text]
  • Toponym Resolution with Deep Neural Networks
    Toponym Resolution with Deep Neural Networks Ricardo Jorge Barreira Custódio Thesis to obtain the Master of Science Degree in Telecommunications and Computer Engineering Supervisors: Prof. Doutor Bruno Emanuel da Graça Martins Prof. Miguel Daiyen Carvalho Won Examination Committee Chairperson: Prof. Doutor Luís Eduardo Teixeira Rodrigues Supervisor: Prof. Doutor Bruno Emanuel da Graça Martins Members of the Committee: Prof. Doutor David Martins de Matos October 2017 Acknowledgements First, I would like to thank Professor Bruno Emanuel da Gra¸caMartins for his guidance during this last and challenging year. His knowledge and constant motivation were a major contribution to the work that we developed together. Second, I would also like to thank my family, in particular my parents and my sister, for their constant support and for giving me the opportunity to learn in such a distinguished institute as Instituto Superior Te´cnico. Finally, I have to thank all my friends and colleagues for the constant support during the hard, although also amazing, time spent at Instituto Superior Te´cnico. A special word of gratitude for my colleagues under the tutelage of Professor Bruno Martins, who went through this last year experience with me. The countless hours we spent together, helping and making fun of each other is something I will never forget. I share this achievement with you. Ricardo Jorge Barreira Cust´odio For my parents and sister, Resumo Resolu¸c~aode top´onimos,i.e., inferir as coordenadas geogr´aficasde uma string que representa o nome de um local, ´eum problema fundamental no contexto de v´ariasaplica¸c~oesrelacionadas com extra¸c~aode informa¸c~aogeogr´aficae das ci^enciasde informa¸c~aogeogr´afica.
    [Show full text]