Automatic Reconstruction of Itineraries from Descriptive Texts

École doctorale des sciences exactes et leurs applications
Escuela de Doctorado de la Universidad de Zaragoza

Thesis submitted for the degree of Doctor of the Université de Pau et des Pays de l’Adour (France), specialty Computer Science (Informatique), and Doctor of the Universidad de Zaragoza (Spain), Programa de Doctorado de Ingeniería de Sistemas e Informática

by Ludovic Moncla

Jury

Reviewers:
Christophe Claramunt, Institut de Recherche de l’École Navale (IRENav)
Denis Maurel, LI, Université François Rabelais, Tours
Ross Purves, University of Zurich

Examiners:
Philippe Muller, IRIT, Université Paul Sabatier, Toulouse
Adeline Nazarenko, LIPN, Université Paris 13

Invited member:
David Buscaldi, LIPN, Université Paris 13

Supervisors:
Mauro Gaio, LIUPPA, Université de Pau et des Pays de l’Adour
Javier Nogueras Iso, DIIS, Universidad de Zaragoza

Co-supervisor:
Sébastien Mustière, COGIT IGN, Université Paris-Est

Laboratoire d’Informatique de l’Université de Pau et des Pays de l’Adour - EA 3000
Departamento de Informática e Ingeniería de Sistemas, Universidad de Zaragoza
Laboratoire COGIT, IGN, Université Paris-Est

Acknowledgments

First of all, I wish to express my greatest thanks to my two supervisors, Mauro Gaio and Javier Nogueras-Iso. I thank Mauro Gaio for giving me the opportunity to do this PhD. His unconditional availability, encouragement and trust helped me a great deal in accomplishing this work. I thank him for all the interesting discussions we had, sharing ideas and talking about everything. I thank Javier Nogueras-Iso for his availability, his patience, his hospitality and his valuable advice. I also thank them both for their great support during all stages of this work and for their help with administrative issues. My thanks also go to Sébastien Mustière for his support and his valuable remarks. I thank all three of them for the time they spent rereading papers and documents, including this dissertation.
I would like to thank the French National Mapping Agency (IGN) and the Communauté d’Agglomération Pau Pyrénées (CDAPP) for funding my PhD.

I am very grateful to Christophe Claramunt, Denis Maurel, Ross Purves, Adeline Nazarenko, Philippe Muller and David Buscaldi for agreeing to be members of the jury. I would especially like to thank Christophe Claramunt, Denis Maurel and Ross Purves for agreeing to review this dissertation, and Philippe Muller for attending the mid-term evaluations of my work each year, for his interest and for his useful remarks.

I would like to thank all my colleagues from the University of Pau: researchers, teachers, staff members and, more specifically, current and former PhD students Samson, Ehsan, Manzoor, Mamour and Tien. I would also like to thank the members of the Advanced Information Systems Research Group (IAAA) of the Computer Science and Systems Engineering Department at the University of Zaragoza for their support and help during my stays in Zaragoza, and especially Walter Renteria-Agualimpia for our fruitful collaboration. I would like to thank the members of the COGIT laboratory of IGN for their support during my stays in Paris, and especially Cécile Duchène, Sidonie Christophe and Guillaume Touya for their encouragement during my talk at the GIScience conference in Vienna.

I also want to thank my family for their encouragement, and my friends, who reminded me that there is more to life than academic research, for all the good times spent together.

Last but not least, special thanks to my fiancée, Camille, for her unconditional support, encouragement and understanding during these three years. Thanks also for lending a hand in making nice diagrams for this dissertation and for my oral presentations. Most importantly, I thank her for all the great moments we share together.
To my fiancée and family

Abstract

This PhD thesis is part of the research project ‘PERDIDO’, which aims at extracting and retrieving displacements from textual documents. This work was conducted in collaboration with the LIUPPA laboratory of the University of Pau (France), the Advanced Information Systems (IAAA) group of Universidad de Zaragoza (Spain) and the COGIT laboratory of IGN (France). The objective of this PhD is to propose a method for establishing a processing chain to support the geoparsing and geocoding of text documents describing events strongly linked with space.

We propose an approach for the automatic geocoding of itineraries described in natural language. Our proposal is divided into two main tasks. The first task aims at identifying and extracting the information that describes the itinerary in texts, such as spatial named entities and expressions of displacement or perception. The second task deals with the reconstruction of the itinerary. Our proposal combines local information extracted using natural language processing with physical features extracted from external geographical sources such as gazetteers or datasets providing digital elevation models.

The geoparsing part is a Natural Language Processing approach that combines part-of-speech tagging with syntactico-semantic patterns (a cascade of transducers) for the annotation of spatial named entities and expressions of displacement or perception. The main contribution in the first task of our approach is toponym disambiguation, which is an important issue in Geographical Information Retrieval (GIR). We propose an unsupervised geocoding algorithm that takes advantage of clustering techniques to disambiguate the toponyms found in gazetteers and, at the same time, to estimate the spatial footprint of fine-grained toponyms not found in gazetteers.
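To make the clustering idea concrete, here is a minimal Python sketch. It is not the algorithm of the thesis: it assumes a simple density heuristic in which the gazetteer candidate supported by the most neighbouring toponyms seeds a cluster, ambiguous toponyms take their candidate nearest to that seed, and toponyms left unresolved receive the cluster centroid as a rough location estimate. The function names, the 20 km radius and the input layout are all hypothetical.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def disambiguate(candidates, radius_km=20.0):
    """Resolve ambiguous toponyms with a density heuristic (illustrative only).

    candidates maps each toponym to its gazetteer candidates as (lat, lon)
    tuples; an empty list means the toponym is not in the gazetteer.
    Returns (resolved, estimated) dictionaries.
    """
    def support(point, owner):
        # Number of *other* toponyms with at least one candidate near point.
        return sum(
            any(haversine_km(point, c) <= radius_km for c in cands)
            for name, cands in candidates.items() if name != owner
        )

    # The best-supported candidate becomes the seed of the cluster.
    seed, _ = max(
        ((p, name) for name, cands in candidates.items() for p in cands),
        key=lambda pn: support(pn[0], pn[1]),
    )

    # Each toponym keeps its candidate closest to the seed, if any is near.
    resolved = {}
    for name, cands in candidates.items():
        near = [c for c in cands if haversine_km(seed, c) <= radius_km]
        if near:
            resolved[name] = min(near, key=lambda c: haversine_km(seed, c))

    # Unresolved toponyms get the centroid of the cluster as an estimate.
    lat = sum(p[0] for p in resolved.values()) / len(resolved)
    lon = sum(p[1] for p in resolved.values()) / len(resolved)
    estimated = {name: (lat, lon) for name in candidates if name not in resolved}
    return resolved, estimated
```

Calling `disambiguate` on a dictionary in which one toponym has an empty candidate list returns that toponym in the second dictionary, placed at the centroid of the resolved ones. A real system would return a footprint (for example a bounding region) rather than a single point.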
We propose a generic graph-based model for the automatic reconstruction of itineraries from texts, where each vertex represents a location and each edge represents a path between locations. Our model is original in that, in addition to the classic elements (paths and waypoints), it can represent the other elements describing an itinerary, such as features seen or mentioned as landmarks. To build this graph-based representation of the itinerary automatically, our approach computes an informed spanning tree on a weighted graph. Each edge of the initial graph is weighted using a multi-criteria analysis approach combining qualitative and quantitative criteria. The criteria are based on information extracted from the text and information extracted from geographical sources. For instance, we compare information given in the text, such as spatial relations describing orientation (e.g., going south), with the geographical coordinates of locations found in gazetteers.

Finally, according to the definition of an itinerary and the information used in natural language to describe itineraries, we propose a multi-scale markup language. This language relies on a core generic layer based on the Text Encoding Initiative (TEI) guidelines, which define a standard for the representation of texts in digital form. We also define a second layer that adds spatial semantics for encoding spatial and motion information.

Additionally, the rationale of the proposed approach has been verified with a set of experiments on a corpus of multilingual hiking descriptions (French, Spanish and Italian).

Keywords: Information Extraction, Automatic itinerary reconstruction, Natural Language Processing

Résumé

This thesis is part of the PERDIDO project, whose objectives are the extraction and reconstruction of itineraries from textual documents.
This work was carried out in collaboration between the LIUPPA laboratory of the Université de Pau et des Pays de l’Adour (France), the Advanced Information Systems (IAAA) group of Universidad de Zaragoza (Spain) and the COGIT laboratory of IGN (France). The objectives of this thesis are to design an automatic system that extracts displacements from travel narratives or itinerary descriptions and then represents them on a map.

We propose an automatic approach for representing an itinerary described in natural language. Our approach consists of two main tasks. The first task identifies and extracts the information that describes the itinerary in the text, such as place named entities and expressions of displacement or perception. The second task reconstructs the itinerary. Our proposal combines information extracted through natural language processing with data extracted from external geographical resources (such as gazetteers).

The annotation of spatial information is carried out by an approach that combines part-of-speech tagging and lexico-syntactic patterns (a cascade of transducers) in order to annotate spatial named entities and expressions of displacement or perception. A first contribution within the first task is toponym disambiguation, which remains a poorly solved problem in named entity recognition (NER) and is essential in geographic information retrieval. We propose an unsupervised georeferencing algorithm based on a clustering technique that disambiguates the toponyms found in external geographical resources and, at the same time, estimates the location of unreferenced toponyms. We propose a generic graph model for the reconstruction
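The graph-based reconstruction summarized in the abstract can be illustrated with a small Python sketch. It is only an assumption-laden stand-in: the multi-criteria weighting is reduced to distance multiplied by a penalty for disagreeing with a compass direction stated in the text, and an ordinary minimum spanning tree (Prim's algorithm) replaces the informed spanning tree. All names, the hint format and the penalty formula are hypothetical.

```python
import heapq
import math

# Compass words mapped to bearings in degrees clockwise from north.
COMPASS = {"north": 0.0, "east": 90.0, "south": 180.0, "west": 270.0}


def bearing_deg(a, b):
    """Initial bearing from a to b, both given as (lat, lon) in degrees."""
    lat1, lat2 = math.radians(a[0]), math.radians(b[0])
    dlon = math.radians(b[1] - a[1])
    y = math.sin(dlon) * math.cos(lat2)
    x = (math.cos(lat1) * math.sin(lat2)
         - math.sin(lat1) * math.cos(lat2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360.0


def distance_km(a, b):
    """Equirectangular approximation, adequate for hiking-scale distances."""
    x = math.radians(b[1] - a[1]) * math.cos(math.radians((a[0] + b[0]) / 2))
    y = math.radians(b[0] - a[0])
    return 6371.0 * math.hypot(x, y)


def weight(places, hints, u, v):
    """Distance, inflated by up to 2x when a stated direction disagrees."""
    if (v, u) in hints:
        u, v = v, u  # evaluate the hint in the direction it was stated
    w = distance_km(places[u], places[v])
    stated = hints.get((u, v))
    if stated is not None:
        diff = abs(bearing_deg(places[u], places[v]) - COMPASS[stated]) % 360.0
        diff = min(diff, 360.0 - diff)  # angular gap in [0, 180]
        w *= 1.0 + diff / 180.0
    return w


def spanning_tree(places, hints):
    """Prim's algorithm over the complete graph of places.

    places maps a name to (lat, lon); hints maps (from, to) to a compass
    word taken from the text. Returns the tree as a list of (u, v) edges.
    """
    names = list(places)
    start = names[0]
    visited = {start}
    heap = [(weight(places, hints, start, n), start, n) for n in names[1:]]
    heapq.heapify(heap)
    tree = []
    while heap and len(visited) < len(names):
        _, u, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        tree.append((u, v))
        for n in names:
            if n not in visited:
                heapq.heappush(heap, (weight(places, hints, v, n), v, n))
    return tree
```

With three places where B lies due south of A, a hint `("A", "B"): "south"` leaves the A-B edge at its plain distance, whereas a contradictory hint such as `"north"` doubles its weight and can push the tree to reach B through another place. This is the kind of effect a textual orientation criterion is meant to have on edge selection.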
