Geocoding for Texts with Fine-Grain Toponyms

Geocoding for texts with fine-grain toponyms: an experiment on a geoparsed hiking descriptions corpus Ludovic Moncla, Walter Renteria-Agualimpia, Javier Nogueras-Iso, Mauro Gaio To cite this version: Ludovic Moncla, Walter Renteria-Agualimpia, Javier Nogueras-Iso, Mauro Gaio. Geocoding for texts with fine-grain toponyms: an experiment on a geoparsed hiking descriptions corpus. ACM SIGSPA- TIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL 2014), Nov 2014, Dallas, Texas, United States. 10.1145/2666310.2666386. hal-01069625v2 HAL Id: hal-01069625 https://hal.archives-ouvertes.fr/hal-01069625v2 Submitted on 12 Nov 2014 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Geocoding for texts with fine-grain toponyms : an experiment on a geoparsed hiking descriptions corpus Ludovic Moncla Walter Javier Nogueras-Iso Université de Pau et des Pays Renteria-Agualimpia Universidad de Zaragoza de l’Adour LIUPPA Universidad de Zaragoza C/ María de Luna, 1 Pau, France C/ María de Luna, 1 Zaragoza, Spain [email protected] Zaragoza, Spain [email protected] [email protected] Mauro Gaio Université de Pau et des Pays de l’Adour LIUPPA Pau, France [email protected] ABSTRACT 1. INTRODUCTION Geoparsing and geocoding are two essential middleware ser- Geoparsing and geocoding are complementary services to vices to facilitate final user applications such as location- facilitate respectively the recognition of spatial language in aware searching or different types of location-based services. text documents and the mapping of such language to explicit The objective of this work is to propose a method for es- georeferences (e.g., lat/long values) [19]. In the last decade tablishing a processing chain to support the geoparsing and there has already been big research efforts addressing the geocoding of text documents describing events strongly lin- problems behind these two services because they are essen- ked with space and with a frequent use of fine-grain topo- tial for providing back-end data which is exploited later by nyms. The geoparsing part is a Natural Language Proces- multiple applications such as location-aware searching [26, sing approach which combines the use of part of speech and 33] or different types of location-based services (e.g. mobile syntactico-semantic combined patterns (cascade of transdu- applications for finding nearest points of interest or route cers). However, the real novelty of this work lies in the geoco- planning for emergency response services). ding method. The geocoding algorithm is unsupervised and takes profit of clustering techniques to provide a solution The problem of geoparsing, i.e. the recognition of toponyms for disambiguating the toponyms found in gazetteers, and (place names) in text, can be seen as a particular category at the same time estimating the spatial footprint of those of named entity recognition and classification (NERC). Two other fine-grain toponyms not found in gazetteers. The fea- categories of approaches have been proposed, those that use sibility of the proposal has been tested with a corpus of learning techniques and those based on natural language hiking descriptions in French, Spanish and Italian. processing (NLP), in particular syntactico-semantic rules. Both categories use external lexical resources. These ap- Categories and Subject Descriptors proaches can be used in a complementary manner in hybrid H.3.1 [Information Storage and Retrieval]: Content Ana- systems [18]. Amongst the rule-based approaches, several lysis and Indexing—Linguistic processing; H.3.3 [Information use transducers with a finite number of states [25], which Storage and Retrieval]: Information Search and Retrie- can also be used in cascade [22]. val—Information filtering With respect to the problem of geocoding, also known as General Terms toponym resolution, the objective is to associate a toponym with its spatial footprint. The first issue to solve for Algorithms, Experimentation this resolution is the ambiguity contained in place names expressed in text (the problem of solving these ambigui- Keywords ties is known as toponym disambiguation [7]). According Geocoding, Toponym disambiguation, Spatio-textual sear- to Smith and Mann [30] there are three main types of am- ching, Geoparsing, Location based services biguities : the same name is used for several places (referent Permission to make digital or hard copies of all or part of this work for ambiguity) ; the same place can have several names (refe- personal or classroom use is granted without fee provided that copies are not rence ambiguity) ; and the place name can be used in a non- made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components geographical context, like organizations or names of people of this work owned by others than ACM must be honored. Abstracting with (referent class ambiguity). Another type of ambiguity is cal- credit is permitted. To copy otherwise, or republish, to post on servers or to led structural ambiguity and it arises when the structure of redistribute to lists, requires prior specific permission and/or a fee. Request the words constituting the place name in the text are am- permissions from [email protected]. biguous (e.g. is the word Lake part of the toponym Lake SIGSPATIAL’14, November 04 - 07 2014, Dallas/Fort Worth, TX, USA Grattaleu or not ?) [31]. We consider that this type of am- Copyright 2014 ACM 978-1-4503-3131-9/14/11 ...$15.00 http://dx.doi.org/10.1145/2666310.2666386 biguity can be seen as a subset of the reference ambiguity. The second issue is to have adequate resources to associate the adequate selection of a resource for assigning a foot- a footprint to a disambiguated place name. The use of re- print to a disambiguated toponym. The task of toponym sources like gazetteers is thus inescapable. disambiguation and the assignment of a footprint are two activities that are usually combined thanks to the use of a In an open data context, the availability of gazetteers is ex- gazetteer with an appropriate coverage of the georeferences panding and we may mention global resources such as Geo- which might be associated to ambiguous toponyms. Subsec- names, OpenGeoData, OpenStreetMap and Wikimapia or tion 2.1 provides an overview of approaches for toponym national resources 1 such as BD NYME (France) or Nomen- disambiguation. clátor Geográfico Básico (Spain). However, whatever the resource is selected for geocoding, the problem that usually However, sometimes there is no availability of gazetteers arises is its completeness. Currently, the present Web geo- with appropriate coverage for the processing of text docu- coding public market is dominated by geocoding services ments. In those cases, we need to explore alternative me- for average users, i.e. users that can accept low Quality of thods to infer the spatial location of toponyms. Although Service in terms of resolution [13], i.e. geocoding of admi- this last issue has received little attention in the research nistrative units, street names in urban areas, or names of literature, Subsection 2.2 analyzes some approaches about well-known touristic sites are the main needs. But there are how to define new toponyms or improve the spatial infor- contexts, even for public citizens or casual users, where the mation associated with existing toponyms. completeness of resources is crucial, in particular as regards the geocoding of fine-grain toponyms. For instance, in a cor- 2.1 Related work about toponym disambigua- pus of narrative descriptions of places in a small area, it is tion common to find toponyms referring to geographical entities Buscaldi [5] provides an overview about different ways of of varying size. Additionally, there are frequent occurrences disambiguating toponyms. According to this work, the ap- of micro-toponyms, which are not usually found in gazet- proaches can be classified in three categories : map-based, teers thought for broader audiences. knowledge-based, and supervised or data-driven approaches. Map-based approaches use as the context for disambigua- The objective of this work is to propose a method establi- tion other unambiguous and georeferenced toponyms found shing a processing chain to support the geoparsing and geo- on the same document. These approaches provide a score to coding of text documents describing events strongly linked any of the possible locations according to the distance to the with space and with a frequent use of fine-grain toponyms. unambiguous toponym. Knowledge-based approaches make The geoparsing part of the workflow is a NLP approach ba- profit of knowledge sources (gazetteers, ontologies, and so sed on a previous work of the authors [23], which combines on) to see if other related toponyms in this knowledge source the use of a Part Of Speech tagger and a cascade of trans- are also referred in the document [26, 6, 20], or additional ducers. The real novelty of this work lies in the geocoding information from the toponyms (e.g., population) or docu- part of the method. The geocoding algorithm is an unsuper- ment creators (e.g. documents of social media [17]) can be vised algorithm that takes profit of clustering techniques to exploited. Finally, data-driven approaches [2, 30] are based provide a solution for disambiguating those toponyms found on machine learning algorithms. The main drawback with in gazetteers, and at the same time estimating the spatial this last category is the lack of classified collections.

Load more