Toponym Resolution in Text

Ana Bárbara Inácio Cardoso

Thesis to obtain the Master of Science Degree in Information Systems and Computer Engineering

Supervisors: Prof. Bruno Emanuel da Graça Martins Prof. Jacinto Paulo Simões Estima

Examination Committee

Chairperson: Prof. Alexandre Paulo Lourenço Francisco Supervisor: Prof. Bruno Emanuel da Graça Martins Members of the Committee: Prof. Fernando Manuel Marques Batista

October 2019

Acknowledgements

First of all, I would like to thank my advisors Professor Bruno Emanuel da Graça Martins and Professor Jacinto Paulo Simões Estima, for their guidance during the development of this work and their immense contributions to the growth and success of this thesis.

I would like to thank my amazing family for their support, motivation, and for allowing me to learn and to grow professionally and personally during the time spent in Instituto Superior Técnico. Thank you for being so supportive.

I am also grateful for the support and constant motivation of my friends and colleagues during this entire journey over the past five years. Thank you for the fantastic time, joys, victories, and all the tears that were shared.

Finally, I would like to thank the Fundação para a Ciência e a Tecnologia (FCT), for supporting the work developed during this dissertation, through the project grants with references PTDC/EEI-SCR/1743/2014 (Saturn), T-AP HJ-253525 (DigCH), and PTDC/CCI-CIF/32607/2017 (MIMU), as well as through the INESC-ID multi-annual funding from the PIDDAC programme (UID/CEC/50021/2019). I also gratefully acknowledge the support of NVIDIA Corporation, with the donation of two Titan Xp GPUs used in the reported experiments.

Ana Bárbara Inácio Cardoso

For my family,

Resumo

A resolução de topónimos em texto, em que um topónimo se refere a um nome de local ou a uma referência de local, consiste na desambiguação destas referências, associando a cada uma delas uma localização única, portanto inequívoca, sobre a superfície da Terra (por exemplo, através da atribuição de coordenadas geográficas de latitude e longitude). Dado que os nomes dos locais são altamente ambíguos, a resolução de topónimos é uma tarefa desafiante; por exemplo, existem múltiplos locais na Terra que partilham o mesmo nome, e ainda múltiplas designações que referem o mesmo local. A resolução de topónimos é uma tarefa muito interessante, uma vez que várias aplicações possíveis podem beneficiar dos resultados, incluindo o apoio ao processamento e análise de informação geográfica presente em coleções extensas de documentos, assim como o suporte à geolocalização de documentos. O trabalho que desenvolvi durante a tese de mestrado, descrito neste documento de dissertação, teve como objetivo a análise de estudos desenvolvidos na área até ao momento, bem como o desenvolvimento de um modelo para a resolução de topónimos considerando técnicas do estado-da-arte aplicadas ao processamento de língua natural. A arquitetura de rede neural proposta utiliza unidades recorrentes com múltiplas entradas (por exemplo, o topónimo a ser desambiguado juntamente com as palavras adjacentes), aproveitando especificamente incorporações de palavras contextuais pré-treinadas (incorporações ELMo ou BERT) e unidades bidirecionais de Long Short-Term Memory (LSTM), ambas muito utilizadas para a modelação de dados textuais. Adicionalmente, o modelo proposto foi avaliado em diferentes contextos, (i) usando informações externas extraídas de dados rasterizados com informações geofísicas, incluindo cobertura terrestre, elevação do terreno, entre outras, e (ii) usando dados adicionais de artigos da Wikipédia em inglês para treinar o modelo, com o objetivo de guiar e ajudar durante o treino. Os resultados obtidos mostram uma qualidade significativamente superior do método proposto, em comparação com as abordagens anteriores, particularmente no cenário que envolve incorporações BERT juntamente com a adição de dados.

Abstract

Toponym resolution in text, where a toponym refers to a place name or place reference, consists in the disambiguation of these references, by associating each one of them to a unique, thus unambiguous, location over the surface of the Earth (e.g., through the assignment of latitude and longitude geographical coordinates). Given that place names are highly ambiguous, toponym resolution is a challenging task: for instance, multiple places on Earth share the same name, and multiple designations can refer to the same location. Toponym resolution is an exciting task, since several possible applications can benefit from its results, including support for the processing and analysis of geographic information present in large collections of documents, as well as support for document geolocation. The research conducted during the MSc thesis and presented in this dissertation aimed to analyze the studies developed in the area so far, as well as to develop a model for toponym resolution considering state-of-the-art techniques applied to natural language processing. The proposed neural network architecture uses recurrent units with multiple inputs (e.g., the toponym to disambiguate along with the surrounding words), leveraging pre-trained contextual word embeddings (i.e., ELMo or BERT embeddings) and bi-directional Long Short-Term Memory (LSTM) units, both commonly used for textual data modeling. Additionally, the proposed model was evaluated in different contexts: (i) using external information extracted from raster data with geophysical information, including land cover, terrain elevation, among others, and (ii) using additional data from English Wikipedia articles to guide and help the model training. The obtained results show a significantly higher quality of the proposed method, in comparison to previous approaches, particularly in the setting that combines BERT embeddings with the additional training data.


Palavras-chave

Análise geográfica de texto

Resolução de topónimos em texto

Aprendizagem profunda aplicada ao Processamento de Língua Natural

Redes neuronais recorrentes

Representações contextuais de incorporação de palavras

Propriedades geofísicas

Keywords

Geographical text analysis

Toponym resolution in text

Deep learning for Natural Language Processing

Recurrent neural networks

Contextual word embedding representations

Geophysical properties

Contents

1 Introduction
1.1 Motivation
1.2 Thesis Proposal
1.3 Contributions
1.4 Structure of the Document

2 Fundamental Concepts
2.1 Introduction to Neural Networks and Deep Learning
2.1.1 Feed-forward Neural Networks
2.1.2 Optimization Algorithms to Neural Networks
2.2 Recurrent Neural Networks
2.2.1 Simple Recurrent Neural Network Architecture
2.2.2 Long Short-Term Memory Architecture
2.2.3 Gated Recurrent Unit Architecture
2.3 Text Representation Methods
2.3.1 Traditional Approaches
2.3.2 Word Embeddings
2.3.3 Contextual Word Embeddings
2.3.3.1 Embeddings from Language Models
2.3.3.2 Bidirectional Encoder Representations from Transformers
2.4 Overview

3 Related Work
3.1 Named Entity Linking
3.2 Fine-Grained Entity Classification
3.3 Toponym Resolution
3.3.1 Heuristic Methods
3.3.2 Combining Heuristics through Supervised Learning
3.3.3 Methods Combining Geodesic Grids and Language Models
3.3.4 Deep Learning Techniques
3.4 Overview

4 Toponym Resolution in Text
4.1 Toponym Resolution as Classification
4.2 Proposed Model Architecture
4.3 Additional Experiments with the Proposed Model
4.3.1 Wikipedia Instances
4.3.2 Geophysical Properties
4.4 Overview

5 Experimental Evaluation
5.1 Corpora Used in the Experiments
5.2 Experimental Methodology
5.3 The Obtained Results
5.4 Overview

6 Conclusions and Future Work
6.1 Overview on the Contributions
6.2 Future Work

Bibliography

List of Figures

2.1 Perceptron architecture.
2.2 Feed-forward neural network architecture.
2.3 Recurrent neural network architecture.
2.4 Multi-layer RNN architecture.
2.5 Bi-directional recurrent neural network.
2.6 Traditional text representation.
2.7 ELMo model.
2.8 BERT model.
4.1 HEALPix partitioning.
4.2 Proposed neural network architecture.
4.3 Using Wikipedia to create new data instances.

List of Tables

3.1 Methods used in previous toponym resolution systems.
4.1 Experiments with Wikipedia instances.
4.2 Land coverage classes.
5.1 Statistical characterization of the corpora.
5.2 Experiments with the proposed model.
5.3 Results of the additional experiments.
5.4 Locations and respective prediction distance error.
5.5 Illustrative examples.

1 Introduction

Toponym resolution concerns the disambiguation of place names and other references to places in textual documents. The disambiguation is achieved by associating each of these place references to a unique position on the Earth's surface, e.g., through the assignment of geographic coordinates.

Through the results emerging from toponym resolution, it is possible to consider several applications, such as the improvement of search engine results (e.g., by geographic indexing or classification) and document classification according to spatial criteria, which allows grouping documents into meaningful clusters and enables the mapping of textually encoded information (Monteiro et al., 2016). Another possible application is in areas such as computational social sciences or digital humanities (Wing, 2015), for instance, through supporting the automatic processing and analysis of geographic data encoded over extensive collections of textual documents. Moreover, place reference resolution can be an auxiliary component for the complete geolocation of documents (Melo and Martins, 2017), since the toponyms mentioned in the text can provide indications about the overall location of the document.

1.1 Motivation

The toponym resolution task is particularly challenging, given that place references are highly ambiguous.

Three distinct types of ambiguity should be addressed when resolving toponyms in textual documents (Monteiro et al., 2016): (1) geo/geo ambiguity arises when distinct locations share the same place name (e.g., the name Dallas can be associated with either Dallas, Texas, United States, or Dallas County, Alabama, United States); (2) geo/non-geo ambiguity corresponds to places named using common language words, i.e., when a location and a non-location share the same name. For example, the word Charlotte can refer to a person name or to the location of Charlotte County, Virginia, United States or, for instance, the word Manhattan that can refer to the location of Manhattan, New York, United States, or to the cocktail beverage; (3) reference ambiguity, which occurs when multiple names are referring to the same place (e.g., Big Apple is a common nickname used for referring to New York City, New York, United States).

Problem (2) should be covered when identifying place references in textual documents, whereas Problems (1) and (3) should be addressed when attempting to associate the recognized references to physical locations unambiguously (e.g., geospatial coordinates of latitude and longitude).

1.2 Thesis Proposal

Most of the previously developed systems for toponym resolution are based on the use of heuristics (e.g., population density), relying on external knowledge to decide which location is more likely to correspond to the reference. The place references in the text are first compared against similar entries in a gazetteer (Berman et al., 2016; Manguinhas et al., 2009), and highly populated places are often preferred, given that they are more likely to be used in textual documents (Ardanuy and Sporleder, 2017; Leidner, 2007).

Other studies employed supervised approaches that consider these types of heuristics as features in standard machine learning techniques (Freire et al., 2011; Karimzadeh et al., 2019; Lieberman and Samet, 2012; Santos et al., 2015), while later studies explore the application of language modeling approaches (DeLozier et al., 2015; Speriosu and Baldridge, 2013) and, more recently, deep learning techniques yielding state-of-the-art results (Adams and McKenzie, 2018; Gritta et al., 2018a).

Given recent developments in natural language processing, it would be interesting to build a system that integrates state-of-the-art techniques, combining contextual word embedding approaches for representing the text with a recurrent neural network architecture for modeling the text sequences, considering multiple textual inputs, including the context (i.e., without the need to resort to external knowledge bases).

1.3 Contributions

Briefly, the main contributions of this work are the following:

• The proposal of a novel model architecture for toponym resolution that incorporates deep learning techniques: the model combines pre-trained contextual word embeddings with bidirectional Long Short-Term Memory units to model the textual elements. The proposed model takes multiple textual inputs, namely the place name reference, the corresponding sentence, and the corresponding paragraph, and produces multiple outputs (i.e., a primary output of geographic coordinates and a secondary output that classifies the place reference into regions over the surface of the Earth, resorting to a geodesic grid). The result of the classification is used to improve the prediction of geographic coordinates for each place reference, through a separate layer that directly applies the Great Circle distance as a loss function (a minimal illustrative sketch of such a loss is given after this list). The obtained results exceed previously reported results over the same corpora, thus demonstrating state-of-the-art performance.

• The integration and evaluation of the proposed model with distinct pre-trained contextual word embedding approaches, namely the Embeddings from Language Models (ELMo) (Peters et al., 2018) and the Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), along with an analysis of the impact that the textual representations have on the obtained results. This analysis shows that the text representation method has a significant impact on the results and that, by using BERT contextual word embeddings, the model achieves higher performance in the toponym resolution task.

• Additional experiments with the proposed model considering different scenarios with further information: (i) using external information from geophysical properties (i.e., land coverage, terrain elevation, percentage of vegetation, and minimum distance to a water zone) extracted from external raster datasets and incorporated in the proposed model to guide the prediction of the geographic coordinates; (ii) using a larger sample, to determine the impact of the training data size on the results. The instances added to the original corpora were collected from a random sample of English Wikipedia articles, leveraging the Wikipedia link structure to infer which spans of text correspond to place references, in the sense that they link to Wikipedia pages associated with geospatial coordinates. Both experiments revealed slight improvements on the obtained results, demonstrating that the neural network model indeed benefits from the addition of information, both from the addition of geophysical information and from the addition of training instances.
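To make the first contribution more concrete, the following is a minimal sketch of a great-circle (haversine) distance loss written with TensorFlow operations; the function name, the Keras-style (y_true, y_pred) signature, and the assumption that coordinates are given as latitude/longitude pairs in degrees are illustrative, and this is not necessarily the exact implementation used in this work.

import math
import tensorflow as tf

def great_circle_loss(y_true, y_pred, radius_km=6371.0):
    # Both tensors hold (latitude, longitude) pairs in degrees, one row per place reference.
    deg2rad = math.pi / 180.0
    lat1, lon1 = y_true[:, 0] * deg2rad, y_true[:, 1] * deg2rad
    lat2, lon2 = y_pred[:, 0] * deg2rad, y_pred[:, 1] * deg2rad
    # Haversine formulation of the great-circle distance over the Earth's surface.
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = tf.sin(dlat / 2.0) ** 2 + tf.cos(lat1) * tf.cos(lat2) * tf.sin(dlon / 2.0) ** 2
    distance_km = 2.0 * radius_km * tf.asin(tf.sqrt(a))
    return tf.reduce_mean(distance_km)  # average distance error used as the loss value

Such a function can be passed as the loss for the coordinate-prediction output of a Keras model, so that training directly minimizes the average distance error in kilometers.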

Part of the work reported in this dissertation was described in the article “Using Recurrent Neural Networks for Toponym Resolution in Text” (Cardoso et al., 2019), which was submitted and accepted at the EPIA 2019 Conference on Artificial Intelligence. This article describes previous toponym resolution studies and the proposed model architecture, together with the preliminary results that allowed the comparison of the proposed model with existing toponym resolution systems. Additionally, I prepared an extended article about the research conducted during the MSc thesis, titled “Using Contextual Embeddings and External Geophysical Information for Toponym Resolution in Text”, envisioning a future conference submission.

1.4 Structure of the Document

The remainder of the document is organized as follows: Chapter 2 introduces the fundamental concepts necessary for a clearer comprehension of the rest of this work, followed by Chapter 3, which presents relevant related studies previously developed in the context of toponym resolution. Chapter 4 describes the proposed model as well as the conducted experiments, while Chapter 5 details the experimental evaluation methodology, as well as the obtained results. Finally, Chapter 6 summarizes the conclusions and presents directions for future research.

2 Fundamental Concepts

This chapter discusses fundamental concepts for a clearer understanding of the work developed in this dissertation. Section 2.1 explains fundamental concepts associated with the use of neural networks and deep learning, including the Perceptron model and some of the possible strategies that we can apply in neural network training. Then, Section 2.2 describes recurrent neural network architectures, a particular subclass of neural networks that supports the representation of sequential structures, which is useful in Natural Language Processing to model sequences of text elements. Section 2.3 discusses approaches to represent textual elements, a critical issue given that this choice must be adequate to build an effective system. Finally, Section 2.4 provides an overview of the topics that were discussed and presented in this chapter.

2.1 Introduction to Neural Networks and Deep Learning

Neural networks refer to a class of learning techniques whose computation was inspired by the human brain. Machine learning approaches are defined as learning from past observations in order to perform predictions, while deep learning techniques learn to perform predictions and, beyond that, learn the most suitable representation of the data for such predictions. Deep learning approaches operate by feeding the data into a network, producing successive transformations of the input data until a final transformation predicts the output. The transformations produced by the network are learned from the given input-output mappings, such that each transformation helps to relate the input data with the desired output (Goldberg, 2017). Goldberg developed a survey dedicated to neural networks and the distinct categories that exist, in which a more detailed explanation about neural networks applied to Natural Language Processing can be found (Goldberg, 2017). In the next sections, I will briefly explain the feed-forward architecture, the most basic architecture of neural networks (Section 2.1.1), and some optimization algorithms that can be applied to neural networks (Section 2.1.2).

Figure 2.1: Perceptron architecture.

2.1.1 Feed-forward Neural Networks

Basically, a neural network takes one or multiple inputs, processes them, and produces one or multiple outputs. A neural network is composed of a set of connected neurons, the fundamental computational units, each of which receives one or more inputs and produces one or more outputs. Each input xn has an associated weight wn; the neuron multiplies each input by its weight, sums the results, and applies a nonlinear function f (i.e., the activation function), producing the output y, as represented in Figure 2.1. The Perceptron is the most straightforward neural network architecture: it receives multiple input values that are supplied directly to the output layer, where a transformation that weights the inputs and sums them together with a bias term is applied. The Perceptron is defined mathematically by Equation 2.1, where W is the weight matrix and b is the bias term (Figure 2.1).

NNPerceptron(x) = xW + b (2.1)

A network is composed of multiple neurons, with connections between them, grouped into several layers. The output of a neuron may be the input of other neurons in the next layer. There are three types of layers in the network, namely the input layer, the hidden layers, and the output layer. As the names suggest, the inputs are in the input layer and the outputs are in the output layer, while the middle layers are called hidden layers, where the neurons perform a series of mathematical computations. Figure 2.2 represents a simple architecture of a feed-forward neural network with two hidden layers. If there are multiple hidden layers, the network is said to be a deep neural network. As mentioned before, each input has an associated weight and, since the output of a layer can be the input of the next layer, these connections also have an associated weight. The role of the neuron is to take the values of its inputs, multiply them by the weights of their connections, sum them, and apply a non-linear function to the obtained result, passing it to its output. As the name suggests, the information only moves in one direction (forward); therefore, there are no cycles in this network.

Figure 2.2: Feed-forward neural network architecture with two hidden layers.

The Multi-Layer Perceptron considers the existence of hidden layers that use an activation function to introduce non-linearity into the network model. Introducing a nonlinear hidden layer results in the Multi-Layer Perceptron with one hidden layer, defined in Equation 2.2, where f is a nonlinear function (i.e., the activation function) applied to the elements of the first linear transformation, W^1 and b^1 are the weight matrix and the bias term for the first linear transformation, and W^2 and b^2 are the weight matrix and the bias term for the second linear transformation.

NN_MLP1(x) = f(xW^1 + b^1)W^2 + b^2 (2.2)

Similarly, we can define a Multi-Layer Perceptron with two hidden layers (represented in Figure 2.2) as follows, and so on:

NN_MLP2(x) = y (2.3a)
h^1 = f^1(xW^1 + b^1) (2.3b)
h^2 = f^2(h^1W^2 + b^2) (2.3c)
y = h^2W^3 (2.3d)

There are several activation functions for introducing non-linearity into the network model, mapping the resulting value into a defined domain range, usually [0, 1] or [−1, 1], depending on the function used. Among the most common are the following activation functions:

• Sigmoid - The sigmoid activation function is an S-shaped function that transforms each value x into the range [0, 1], Equation 2.4.

σ(x) = 1 / (1 + e^{−x}) (2.4)

• Hyperbolic tangent (tanh) - The hyperbolic tangent activation function is an S-shaped function that transforms each value x into the range [−1, 1], Equation 2.5.

tanh(x) = (e^{2x} − 1) / (e^{2x} + 1) (2.5)

• Rectifier (ReLU) - The rectifier activation function, known as the rectified linear unit, is a simple activation function that has been shown to produce very good results. ReLU simply removes negative elements, substituting them by the value 0, Equation 2.6.

ReLU(x) = max(0, x) = 0 if x < 0, and x otherwise (2.6)
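As an illustration of the definitions above, the following is a minimal NumPy sketch of a forward pass through a Multi-Layer Perceptron with two hidden layers (Equation 2.3), using the activation functions just listed; the layer sizes and the random weights are arbitrary and chosen only for the example.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def mlp2_forward(x, W1, b1, W2, b2, W3):
    # NN_MLP2(x): two nonlinear hidden layers followed by a linear output layer.
    h1 = np.tanh(x @ W1 + b1)   # first hidden layer, tanh activation
    h2 = relu(h1 @ W2 + b2)     # second hidden layer, ReLU activation
    return h2 @ W3              # linear output layer

# Example with illustrative dimensions: 4 inputs, hidden layers of 8 and 6 units, 2 outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))
y = mlp2_forward(x, rng.normal(size=(4, 8)), np.zeros(8),
                 rng.normal(size=(8, 6)), np.zeros(6), rng.normal(size=(6, 2)))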

8 2.1.2 Optimization Algorithms to Neural Networks

The loss function, or cost function, maps the cost associated with a particular event when training neural networks. As the goal is to improve the performance at each time step, lower loss values are preferred. The loss can be an arbitrary function mapping two vectors to a scalar. Below, I present the most common loss functions.

• Log loss: L_log(ŷ, y) = log(1 + exp(−(ŷ[t] − ŷ[k])))

• Binary cross-entropy loss: L_logistic(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)

• Categorical cross-entropy loss: L_cross-entropy(ŷ, y) = −Σ_i y[i] log(ŷ[i])

As the objective is to minimize the loss function as much as possible, we perform improvements upon the weights to obtain better results than the previous ones. Therefore, gradient descent algorithms are usually employed. In the following, I present some of their variants (a more detailed explanation can be found in the studies of Goldberg (2017) and Ruder (2016)).

The batch gradient descent (Equation 2.7) computes the gradient of the cost function for the entire training dataset (i.e., it sums over all examples on each iteration when performing the updates to the parameters). So, for each update, the gradient is summed over all examples. This variant guarantees convergence, but the need to calculate the gradient for the whole dataset to perform one update makes it computationally slow.

θ = θ − η · ∇θJ(θ) (2.7)

The stochastic gradient descent, also known as SGD (Equation 2.8), performs a parameter update for each training example. It is faster than batch gradient descent, and the SGD's fluctuation enables jumps to new, potentially better local minima, although convergence to the exact minimum becomes challenging.

θ = θ − η · ∇_θ J(θ; x^{(i)}; y^{(i)}) (2.8)

The mini-batch gradient descent (Equation 2.9), in turn, performs an update for every mini-batch of n training examples. In this way, it reduces the variance of the parameter updates, which leads to more stable convergence.

θ = θ − η · ∇_θ J(θ; x^{(i:i+n)}; y^{(i:i+n)}) (2.9)
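The update rules in Equations 2.7-2.9 differ only in how many examples contribute to each gradient estimate. A minimal NumPy sketch of the mini-batch variant, applied to a linear model with squared error (an illustrative choice of model and loss, not tied to this work), could look as follows.

import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=32, epochs=10):
    # Linear model y_hat = X @ theta, trained with mean squared error.
    theta = np.zeros(X.shape[1])
    n = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(n)            # shuffle the training examples
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)  # gradient on the mini-batch
            theta -= lr * grad                      # theta = theta - eta * gradient (Equation 2.9)
    return theta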

Although there are further challenges and limitations imposed by the variants described above, more variants of the gradient descent algorithm for optimization can be found in Ruder (2016). Some of the solutions include Momentum, Nesterov Accelerated Gradient, Adagrad, RMSprop, Adam, and Nadam, with the conclusion that adaptive learning-rate methods (Adagrad, Adadelta, RMSprop, and Adam) are the most suitable and provide better convergence (Ruder, 2016).

However, there are additional strategies for optimizing the stochastic gradient descent, among them:

• Adding noise that follows a Gaussian distribution N(0, σ_t²) to each gradient update, Equation 2.10.

g_{t,i} = g_{t,i} + N(0, σ_t²), where σ_t² = η / (1 + t)^γ (2.10)

Adding this noise makes the network more robust to poor initialization and helps training particularly deep and complex networks. The added noise gives the model more chances to escape local minima and find new ones (local minima are more frequent for deeper models).

• Using batch normalization, which facilitates learning; this strategy consists of normalizing the initial values of the parameters (i.e., initializing them with zero mean and unit variance) and, as the training proceeds, updating the parameters to different extents.

• Using early stopping, through the monitoring of the error on a validation set during training, stopping if the validation error does not improve enough (see the sketch after this list).

• Choosing an adequate learning rate, which determines the size of the steps taken until a local minimum is reached. Too large learning rates will prevent the network from converging on an effective solution, while too small learning rates will take a very long time to converge.
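A minimal sketch of the early stopping strategy mentioned above is shown next; the patience parameter and the callable arguments are illustrative conventions, not taken from a specific library or from this work.

def train_with_early_stopping(train_step, validate, max_epochs=100, patience=5):
    # train_step() performs one epoch of training; validate() returns the validation error.
    best_error, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()
        error = validate()
        if error < best_error:
            best_error, epochs_without_improvement = error, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop: the validation error has not improved for `patience` epochs
    return best_error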

Figure 2.3: Recurrent neural network architecture, together with the corresponding unrolled representation.

2.2 Recurrent Neural Networks

A recurrent neural network (RNN) architecture allows modeling over sequential structures, which is extremely useful in natural language processing (NLP), since there are sequences in any text, whether of characters, words, or phrases. An RNN enables the representation of input structures with arbitrary lengths, transforming them into a fixed-size vector, while maintaining the structural properties of the input sequence. In this architecture, the connections between neurons form a graph directed over a sequence; in other words, an RNN is recursively defined by means of a function R that receives as input a state vector sj−1 (i.e., corresponding to the previous state), along with the input vector of the current state xj, and returns a new state vector sj. The state vector sj is mapped by a function O into a vector that corresponds to the output vector of the current state, yj. Therefore, this structure considers the history of all previous states (x1, x2, ..., xj) (Goldberg, 2017). Figure 2.3 shows a graphical representation of a recurrent neural network, in its recursive and unrolled representations, respectively.

RNN*(x1:n; s0) = y1:n (2.11a)

yi = O(si) (2.11b)

si = R(si−1, xi) (2.11c)

Summing up, the loop of the network allows information to pass from one state of the network to the next. The network accepts an input vector x and produces an output vector y. Not only the input just fed, but also the entire history of past inputs, influences the contents of the output vector (Goldberg, 2017), as can be seen in Figure 2.3.

Figure 2.4: Multi-layer RNN architecture, with 3 layers.

To train a recurrent neural network, it is necessary to add a loss node to the unrolled graph and use the backpropagation algorithm to calculate the gradient with respect to the loss, which is needed to update the weights of the network. As seen in Figure 2.3, the same parameters θ are shared between time steps (Goldberg, 2017).

s4 = R(s3, x4) (2.12a)
   = R(R(s2, x3), x4) (2.12b)
   = R(R(R(s1, x2), x3), x4) (2.12c)
   = R(R(R(R(s0, x1), x2), x3), x4) (2.12d)

Recurrent neural networks can be stacked in layers, just like feed-forward neural networks, forming the so-called deep recurrent neural networks. Deep RNNs increase the model capacity and, for some tasks, achieve better performance, such as in machine translation. In Figure 2.4, we can see a multi-layer recurrent neural network with 3 stacked layers.

Similarly, a bi-directional RNN follows the same structure mentioned above, connecting two hidden layers of opposite directions to the same output (i.e., one that reads the sequence from left to right and another that reads it from right to left). Thus, the output layer can get information from past (backward) and future (forward) states simultaneously, as exemplified in Figure 2.5 over the sentence “the brown fox jumped”, where the sequence of words is read from left to right and from right to left, considering the backward and forward states.

Figure 2.5: A bidirectional RNN over the phrase “the brown fox jumped”.

Usually, during the training of an RNN, the gradient begins to vanish, preventing the network weights from changing and adjusting as necessary during training, which emerged as a major obstacle to the network's performance. This problem is known as the vanishing gradients problem (Hochreiter, 1998), where the gradient decreases exponentially and consequently generates a decay of information over time. To address this problem, gating-based approaches were developed, a technique that helps the neural network decide when to forget the current input and when to remember it for future time steps.

2.2.1 Simple Recurrent Neural Network Architecture

The simplest formulation of a recurrent neural network is the Simple RNN, also known as the Elman network, mathematically defined in Equation 2.13.

si = R_SRNN(xi, si−1) = g(si−1W^s + xiW^x + b) (2.13a)
yi = O_SRNN(si) = si (2.13b)

The state si−1 and the input xi are each linearly transformed; the results are added (together with a bias term) and then passed through a nonlinear activation function g. The output at position i is the same as the hidden state in that position (Goldberg, 2017).
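A minimal NumPy sketch of the Simple RNN update in Equation 2.13, using tanh as the nonlinearity g (an illustrative choice), is shown below.

import numpy as np

def simple_rnn(x_sequence, W_s, W_x, b, s0):
    # s_i = g(s_{i-1} W^s + x_i W^x + b); the output at each position equals the state.
    states, s = [], s0
    for x in x_sequence:
        s = np.tanh(s @ W_s + x @ W_x + b)
        states.append(s)
    return states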

13 2.2.2 Long Short-Term Memory Architecture

The Long Short-Term Memory (LSTM) is a concrete RNN architecture, designed to address the vanishing gradients problem by using a gating mechanism. At each input state, a gate is used which is responsible for deciding how much of the new input should be written to the memory cell and how much of the current memory cell content should be forgotten (Goldberg, 2017). Equation 2.14 defines the LSTM architecture mathematically, where ⊙ denotes element-wise multiplication.

sj = R_LSTM(sj−1, xj) = [cj; hj] (2.14a)
where, cj = f ⊙ cj−1 + i ⊙ z (2.14b)
hj = o ⊙ tanh(cj) (2.14c)
i = σ(xjW^{xi} + hj−1W^{hi}) (2.14d)
f = σ(xjW^{xf} + hj−1W^{hf}) (2.14e)
o = σ(xjW^{xo} + hj−1W^{ho}) (2.14f)
z = tanh(xjW^{xz} + hj−1W^{hz}) (2.14g)
yj = O_LSTM(sj) = hj (2.14h)

The state at time j, sj, is composed of two vectors, cj and hj, which correspond, respectively, to the memory component and the hidden state component (2.14a). There are three gating components, responsible for controlling the input, forget, and output gates, i.e., i, f, and o, respectively (2.14d; 2.14e; 2.14f). The gate values are obtained from linear combinations of the current input xj and the previous state hj−1, to which a sigmoid activation function is applied. An update candidate, z, is obtained by a linear combination of xj and hj−1 passed through a tangent activation function (2.14g). Afterwards, the memory component cj is updated considering the forget gate, which controls how much of the previous memory should be kept, and the input gate, which controls the amount of the proposed update that will be preserved (2.14b). Finally, the value of hj, which corresponds to the output yj, is computed from the memory content cj, passed through a nonlinear tangent function and controlled by the output gate (2.14c; 2.14h) (Goldberg, 2017).
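A minimal NumPy sketch of one LSTM step following Equation 2.14 is shown below; passing the weight matrices in a dictionary, and omitting bias terms as in the equations above, are choices made purely for readability.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_j, h_prev, c_prev, W):
    # Gates and update candidate, as in Equations 2.14d-2.14g.
    i = sigmoid(x_j @ W["xi"] + h_prev @ W["hi"])   # input gate
    f = sigmoid(x_j @ W["xf"] + h_prev @ W["hf"])   # forget gate
    o = sigmoid(x_j @ W["xo"] + h_prev @ W["ho"])   # output gate
    z = np.tanh(x_j @ W["xz"] + h_prev @ W["hz"])   # update candidate
    c_j = f * c_prev + i * z                        # memory component (Equation 2.14b)
    h_j = o * np.tanh(c_j)                          # hidden state / output (Equation 2.14c)
    return h_j, c_j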

14 2.2.3 Gated Recurrent Unit Architecture

Although the LSTM architecture is effective, it is quite complex and therefore computationally heavy. For this reason, another architecture emerged: the Gated Recurrent Unit (GRU). The GRU architecture, like the LSTM, is also based on a gating mechanism, the difference being that the GRU has fewer gates and no separate memory component. The GRU architecture is mathematically defined as follows:

sj = R_GRU(sj−1, xj) = (1 − z) ⊙ sj−1 + z ⊙ s̃j (2.15a)
where, z = σ(xjW^{xz} + sj−1W^{sz}) (2.15b)
r = σ(xjW^{xr} + sj−1W^{sr}) (2.15c)
s̃j = tanh(xjW^{xs} + (r ⊙ sj−1)W^{sg}) (2.15d)
yj = O_GRU(sj) = sj (2.15e)

One gate (r) is used to control access to the previous state sj−1 and compute a proposed update s̃j. The updated state sj (which also serves as the output yj) is then determined based on an interpolation of the previous state sj−1 and the proposal s̃j, where the proportions of the interpolation are controlled using the gate z (Goldberg, 2017).
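Analogously, a minimal NumPy sketch of one GRU step following Equation 2.15 could be the following (same conventions as the LSTM sketch above: weights in a dictionary, biases omitted).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_j, s_prev, W):
    z = sigmoid(x_j @ W["xz"] + s_prev @ W["sz"])              # update gate (2.15b)
    r = sigmoid(x_j @ W["xr"] + s_prev @ W["sr"])              # reset gate (2.15c)
    s_tilde = np.tanh(x_j @ W["xs"] + (r * s_prev) @ W["sg"])  # proposed update (2.15d)
    return (1.0 - z) * s_prev + z * s_tilde                    # new state / output (2.15a)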

2.3 Text Representation Methods

Text representation is an essential aspect when dealing with textual analysis tasks, and it is important to choose an effective data representation. In the following sections, I present multiple approaches, including traditional approaches, as well as recent approaches that have been shown to improve the performance of the systems.

15 Figure 2.6: Traditional text representation.

2.3.1 Traditional Approaches

We can represent the text of documents in several ways, in order to support computational analysis. The simplest way corresponds perhaps to the bag-of-words model, which is based on individual term frequency, i.e., the number of occurrences of each distinct term in a document. When this model is used, the order of appearance of each term within a document is lost. There are some limitations associated with this model, such as the high dimensionality of the representation and the loss of correlation between synonyms. Another approach is the one-hot representation, which consists in modeling term appearance through a binary representation (i.e., 1 if the term appears in the document, 0 otherwise), as represented in Figure 2.6. These two models generate vector representations with a fixed size (the size of the vocabulary, |V|), and thus the size of the input vector scales with the size of the vocabulary. This means that the representation is sparse, it has high dimensionality, and the notion of similarity between words does not exist. Besides being considered sparse representations (i.e., due to their high dimensionality), the approaches mentioned before also raise the "Out-of-Vocabulary" (OOV) problem, which happens when the system does not know how to handle unseen words in the test set.
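A minimal sketch of these two traditional representations over a toy vocabulary (illustrative only) is shown below; both produce vectors whose size equals the vocabulary size |V|.

import numpy as np

def bag_of_words(tokens, vocabulary):
    # Term-frequency vector: counts of each vocabulary term in the document.
    vector = np.zeros(len(vocabulary))
    for token in tokens:
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector

def one_hot(tokens, vocabulary):
    # Binary vector: 1 if the term appears in the document, 0 otherwise.
    return (bag_of_words(tokens, vocabulary) > 0).astype(float)

vocabulary = {"lisbon": 0, "porto": 1, "is": 2, "a": 3, "city": 4}
print(bag_of_words("lisbon is a city a city".split(), vocabulary))
print(one_hot("lisbon is a city a city".split(), vocabulary))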

2.3.2 Word Embeddings

Recently, novel approaches emerged to represent textual elements while capturing linguistic information, namely word embeddings. This method for representing text consists in learning to map a set of words into real-number vectors (i.e., from a corpus, a vector space is produced, where each unique word is assigned a corresponding vector in that space). The word embedding representations capture similarity between words; hence, word vectors are positioned in the vector space in a meaningful way, so that the distance between words is related to their semantic similarity (Goldberg, 2017).

We can obtain the word embedding vectors by applying unsupervised learning algorithms, such as Word2Vec or GloVe¹ (Global Vectors for Word Representation). These algorithms were inspired by neural network models and are based on stochastic gradient training.

For now, we will use the following representation: w for representing a word, c corresponds to the context words, D corresponds to the set of correct word-context pairs, and D¯ the set of incorrect words-context pairs.

The goal of the algorithm is to train the network to differentiate good word-context pairs from bad word-context pairs. For that, the algorithm estimates the probability P(D = 1|w, c) (i.e., the probability that (w, c) is a correct word-context pair), Equation 2.16.

P(D = 1|w, c) = 1 / (1 + e^{−s(w,c)}) (2.16)

The objective of the algorithm is to maximize the log-likelihood of the data, Equation 2.17.

L(Θ; D; D̄) = Σ_{(w,c)∈D} log P(D = 1|w, c) + Σ_{(w,c)∈D̄} log P(D = 0|w, c) (2.17)

There are two different approaches to the Word2Vec algorithm, the Continuous Bag of Words (CBOW), and the Skip-gram approach:

• CBOW: In this architecture, the model predicts the current word from a context window (i.e., the surrounding context words; this window can have a size of one word or more), and the order of the context words does not influence the outcome. To consider the context window, the algorithm adapts its computation to the range of context words from c1 to ck, as shown in Equation 2.18 (Goldberg, 2017).

P(D = 1|w, c1:k) = 1 / (1 + e^{−(w·c1 + w·c2 + ... + w·ck)}) (2.18)

¹ https://nlp.stanford.edu/projects/glove

• Skip-gram: The skip-gram architecture weights the context words within a window size, following the principle that words weigh more if they are in a nearby context than words that are more distant. This variant assumes that the elements ci in the context are independent from each other, treating them as k different contexts, Equation 2.19, resulting in a single embedding vector for each context (Goldberg, 2017).

P(D = 1|w, ci) = 1 / (1 + e^{−w·ci})
P(D = 1|w, c1:k) = Π_{i=1}^{k} P(D = 1|w, ci) = Π_{i=1}^{k} 1 / (1 + e^{−w·ci})
log P(D = 1|w, c1:k) = Σ_{i=1}^{k} log ( 1 / (1 + e^{−w·ci}) ) (2.19)

The GloVe algorithm bases the word embeddings on the structure of the corpus: the model trains on global co-occurrence counts of words, constructing a word-context matrix and training the word and context vectors, w and c respectively, so as to satisfy Equation 2.20, where b[w] and b[c] are the trained biases (for the word and the context). The model suggests representing each word as the sum of the corresponding word and context embedding vectors (Goldberg, 2017).

w · c + b[w] + b[c] = log#(w, c) ∀(w, c) ∈ D (2.20)

Summing up, the two main benefits of the use of word embeddings are the reduced dimensionality, which has an impact at the computational level, and the capacity of capturing similarity between words. Word embeddings provide better vector features for most NLP problems and achieve superior performance in deep learning methods, including neural network models (Goldberg, 2017).
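To make the word-context scoring above concrete, the following NumPy sketch computes P(D = 1|w, c) as the sigmoid of the dot product between a word vector and a context vector (Equation 2.19); the embedding dimensionality and the random initialization are illustrative, and real implementations learn the matrices W and C by gradient-based optimization of Equation 2.17.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_probability(word_id, context_id, W, C):
    # P(D = 1 | w, c) = sigmoid(w . c), with w and c the word and context vectors.
    return sigmoid(W[word_id] @ C[context_id])

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 50
W = rng.normal(scale=0.1, size=(vocab_size, dim))   # word embedding matrix
C = rng.normal(scale=0.1, size=(vocab_size, dim))   # context embedding matrix
print(pair_probability(3, 17, W, C))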

2.3.3 Contextual Word Embeddings

The contextual word embedding representations, beyond capturing similarities between words, are also able to perceive the semantic meaning of words, since they consider the surrounding context words, which enables the capacity to handle the polysemic properties of words. For example, if the word wood occurs in a text, the representation generated is distinct depending on the context, i.e., whether wood is referring to the material made from trees or to a geographical area with many trees. Although there are several models for generating contextual word embeddings, the next sections describe only two of them, namely the Embeddings from Language Models (ELMo) (Peters et al., 2018) and the Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019).

Figure 2.7: ELMo model.

2.3.3.1 Embeddings from Language Models

Peters et al. (2018) presented the ELMo model for generating pre-trained contextual word embeddings. As mentioned before, this model considers the context words when creating the embedding representations, handling the semantic meaning, syntactic use, and polysemy of words (Peters et al., 2018). To contextualize the word representations, the ELMo model examines the entire sentence before assigning the word embedding representation. ELMo is based on a neural language model (i.e., a model of the probability distribution over word sequences); these generative models are used to predict the most likely next word for a given sequence of words. The language model that ELMo uses relies on a multi-layer bi-directional LSTM (previously described in Section 2.2.2). Therefore, when generating the word embedding representation, ELMo considers both the following and the previous words. To generate a contextualized embedding representation for each word, ELMo extracts the hidden state of each layer, concatenates them, and applies a weighted sum operation, as represented in Figure 2.7.
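A minimal sketch of an ELMo-style combination step, as a softmax-normalized weighted sum of per-layer hidden states scaled by a task-specific factor γ, is shown below; the layer representations are assumed to be already computed by the bi-directional language model, and in practice the weights s and γ are learned together with the downstream task.

import numpy as np

def elmo_combine(layer_states, s, gamma):
    # layer_states: array of shape (num_layers, sequence_length, hidden_size).
    weights = np.exp(s) / np.exp(s).sum()                    # softmax-normalized layer weights
    combined = np.tensordot(weights, layer_states, axes=1)   # weighted sum over layers
    return gamma * combined

layers = np.random.randn(3, 7, 1024)   # e.g., 3 layers, 7 tokens, 1024-dimensional states
embeddings = elmo_combine(layers, s=np.zeros(3), gamma=1.0)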

19 Figure 2.8: BERT model.

2.3.3.2 Bidirectional Encoder Representations from Transformers

Another model for generating contextual word embeddings is the BERT model, proposed by Devlin et al. (2019) and inspired by previous studies, including ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018), the OpenAI transformer (Radford et al., 2018), and the Transformer (Vaswani et al., 2017). The architecture of the BERT model is based on a pre-trained transformer encoder stack. A transformer encoder component is composed of two sub-layers, a self-attention layer and a feed-forward neural network. Thus, the model receives as input a set of words, and each encoder layer applies a self-attention mechanism (i.e., enabling the encoder to consider other words present in the input sequence while encoding a particular word), passes the result through a feed-forward neural network, and passes the output to the next encoder layer, and so on. The BERT model is based on transformer encoders, whose language model considers both forward and backward words. The authors decided to adopt a "masked language model" to train the model, i.e., randomly applying masks to 15% of the input tokens and using the output at the position of the masked word to predict which word was masked. Besides, occasionally, words are randomly replaced by another word, and the model is asked to predict the correct word in that position. Similar to ELMo, the pre-trained BERT model can be used to generate contextual word embeddings (i.e., the word embedding representations can consist of one of the vectors or a combination of multiple vectors generated by the encoder representations).
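In practice, pre-trained BERT models can be loaded from public libraries to obtain such contextual embeddings. The sketch below uses the Hugging Face transformers package with the bert-base-uncased checkpoint, which is an illustrative choice of library, model, and API version, and not necessarily the setup used in this work.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentence = "She traveled from Paris to Dallas last week."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub)word token of the input sentence.
token_embeddings = outputs.last_hidden_state[0]
print(token_embeddings.shape)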

20 2.4 Overview

This chapter presented the fundamental concepts related to the work developed in this dissertation. Section 2.1 explains fundamental concepts associated with the use of neural networks and deep learning, including the Perceptron model, feed-forward neural networks, and some of the possible strategies that we can apply in neural network training. Then, Section 2.2 describes recurrent neural network architectures, a category of neural networks that supports the representation of sequential structures, which is useful in natural language processing to model sequences of text elements. Finally, Section 2.3 discusses approaches to represent textual elements, including contextual word embedding methods.

3 Related Work

This chapter presents relevant related work covering the most critical areas. Section 3.1 presents related work on named entity linking, a field dedicated to the disambiguation of entities mentioned in text. Then, Section 3.2 presents related work on fine-grained entity classification, concerning the classification of entities mentioned in text into very similar categories (i.e., different categories, but sharing part of their structure), and Section 3.3 provides details about the different methods applied to toponym resolution. Finally, Section 3.4 provides an overview of the related work.

3.1 Named Entity Linking

The named entity linking task is dedicated to the disambiguation of recognized entities present in textual elements, identifying which real-world entity each mention in the text refers to. This task is strongly related to the theme addressed in this dissertation, given that both deal with the disambiguation of entities in text; the theme of this dissertation focuses on a more specific category of named entities, the locations (i.e., toponym mentions).

Gupta et al. (2017) present a novel neural network system for entity linking that learns a dense representation for entities considering multiple sources of information, including unstructured textual descriptions retrieved from external knowledge bases (i.e., encyclopedic entity descriptions), the context around the mention (i.e., the local and global context for the mention), and, finally, the structured knowledge of the mention's fine-grained types. This enables the model to capture distinct aspects concerning the semantic sense and the surrounding information at various levels of granularity. The model applies knowledge base information to create an embedding for entity mentions, such as using existing mentions in Wikipedia to encode their context, using textual descriptions to encode background information, and, finally, retrieving the fine-grained types from Freebase. The procedure to link a mention to the correct entity is divided into a two-step process. The first step consists in gathering a set of candidate entities, obtained by computing their prior probabilities (i.e., using a pre-computed dictionary). In the second step, the mention context encoder is used to estimate the semantic similarity between each mention and the vector representations of each candidate, and the combination of the results of both steps is used for making the linking decisions (Gupta et al., 2017). The proposed model uses compositional training to ensure that the representation will capture all the information provided about the entity. Unlike models that use domain-specific data for training, one advantage of this model is that it only uses indirect supervision that resorts to Wikipedia or Freebase, which makes the model domain-independent. In this way, not only does the model correctly leverage the available information, but it is also robust when dealing with missing information (Gupta et al., 2017). Another advantage is that the linking system is flexible to changes in the knowledge bases (e.g., entities added to the knowledge base): the changes are easily incorporated, and the model performs well even for new entities without the need to re-train the representations (i.e., by only making use of the available information, more specifically descriptions and types).

The aim of the work developed by Cao et al. (2018) is to take advantage of the power of neural networks applied to collective entity linking. Until then, neural network methods for entity linking had only been applied to the construction of word or entity embeddings from feature extraction (Gupta et al., 2017), thus not fully exploiting the capacity of neural network methods. This work proposes a new neural network model for collective entity linking, called NCEL (Neural Collective Entity Linking), which combines deep neural networks with a graph convolutional network to integrate local context features together with global coherence for entity linking (Cao et al., 2018). It also introduces an attention scheme, with the objective of improving the robustness of the NCEL system when modeling local context information, by selecting words of greater importance and excluding data noise, and of training the model from Wikipedia hyperlinks while preventing overfitting. The NCEL framework is subdivided into three components. There is a candidate generation phase, in which a pre-computed dictionary is used to generate the candidate entity set to be disambiguated for each mention. It is followed by a feature extraction phase, based on the document and on the entity graph of the document, including local and global features. Finally, the neural model receives as input the feature vectors and the sub-graphs of the candidates, represents the nodes in the graph with the received features, and applies multiple graph convolutions; after that step, the correct candidates have a strongly connected topology, contrary to incorrect candidates, which present a weakly connected topology due to their sparse relations. In the end, the results presented are the probabilities reflecting the likelihood of a certain candidate corresponding to the entity mention (Cao et al., 2018).

Yang et al. (2018) present a learning model for the joint disambiguation of named entities in a document using a gradient tree boosting algorithm. The system considers global features on past disambiguation decisions and jointly models them with local features, achieving global optimization in the entity linking task. As exact inference is computationally expensive, the authors proposed Bidirectional Beam Search with Gold path (BiBSG), a variant of the standard beam search algorithm, with the difference that BiBSG is an approximate inference algorithm. To improve the performance of the local search, BiBSG considers global information, i.e., past and future information (Yang et al., 2018). Most entity disambiguation systems consider two types of components, local components and global components. The joint resolution of entities is carried out as a way of maximizing the overall coherence between entities. Through statistical features of entity-linked corpora and similarity features, for example the cosine, it is possible to capture the similarities between a mention and a candidate entity, as well as the relation between entities that co-occur within the same document. To capture non-linear relationships, it is preferable to resort to machine learning models (e.g., neural networks or gradient tree boosting). The authors developed a model based on structured gradient tree boosting (SGTB), with the difference that it is a globally normalized model that uses conditional random fields; this modified version allows the use of global features defined between the current entity candidate and the entire decision history of the previous entity assignments, allowing global optimization. Globally normalized models tend to be more expressive than locally normalized ones; however, it is difficult to calculate the normalization term for training and inference. To address this issue, the authors adopted beam search, in which they track multiple hypotheses and sum over the paths in the beam. BiBSG was designed to train the SGTB model efficiently and effectively, since it reduces the model's variance and can consider both past and future information when predicting an output. Regarding the obtained results, BiBSG performs competitively with standard beam search SGTB models. Since the test data was gathered from multiple domain sources, the model learns more abstract representations (Yang et al., 2018).

25 3.2 Fine-Grained Entity Classification

Fine-grained entity classification aims to label entity mentions considering their context, not only associating them with the respective semantic types, but also inferring a more specific class to which the entity belongs (e.g., when a mention refers to the class "Person", the goal of fine-grained entity classification is to attribute a more specialized class, such as "Actor" or "Writer", for instance).

The work of Shimaoka et al. (2017) presents a study of several variants of neural network architectures applied to fine-grained entity type classification. The developed model directly combines "hand-crafted" features with learned features. It also includes an attention mechanism that learns to attend to the syntactic heads and to the tokens prior to and after a mention, both of which play an important role in the successful classification of a mention. The authors introduce parameter sharing between labels through a hierarchical encoding method. Considering the probability of a mention belonging to a certain class type, obtained using logistic regression, they construct the model's variants, which differ in the way the input for the logistic regression model is computed. At the inference stage, it is assumed that each mention is assigned at least one type, where the first assignment is based on the most likely type. Then, further types are added if their probability exceeds a threshold of 0.5, defined from the use of the development data (Shimaoka et al., 2017). The results demonstrate that the training data used has an impact on the performance of the model and prove that the attention mechanism detects the expressions considered more relevant for the classification of fine-grained types (Shimaoka et al., 2017).

To address the task of fine-grained classification, Yaghoobzadeh et al. (2018) propose FIGMENT, an embedding-based method that considers a global model of the information of the entity and a context model that, in a first stage, considers the individual occurrences of an entity. The global model starts by learning the distributed representation of an entity using a multi-layer perceptron. It is important that the learned representations are of high quality and, for that reason, the authors presented different representations using entity-level, word-level, and character-level information. Each level brings complementary information, which positively affects the performance of the system. The scarcity of context-labeled entities forces the use of distant supervision. As distantly supervised labels tend to be noisy, the authors developed and applied new algorithms for noise mitigation using multi-instance learning (Yaghoobzadeh et al., 2018).

Xin et al. (2018) present the Knowledge-Attention Neural Fine-Grained Entity Typing framework, or KNET, an attention mechanism that leverages information from knowledge bases and jointly takes the text and the knowledge base into consideration. The main goal of the developed model is to predict the probability of each type for a certain entity mention. The KNET framework is divided into two parts, a sentence encoder and a type predictor. The sentence encoder is used to transform word vectors into representations for entity mentions and context. The feature vector is composed of the concatenation of the entity mention representation and its context. The type prediction vector is computed from the sentence vector through a multi-layer perceptron; each entry of the predicted type vector indicates the probability of each type for a given mention. The attention mechanism used in KNET considers semantic attention (i.e., the context representation), mention attention (i.e., the entity mentions, which is expected to capture semantic correlations between entities and context), and, lastly, knowledge attention, learned from external knowledge bases and expected to capture semantic correlations of entity-context and entity-knowledge base (Xin et al., 2018).

Previously existing methods that rely on distant supervision are more susceptible to noisy labels (i.e., labels that are out-of-context or not appropriate because they are overly specific). The focus of the work developed by Xu and Barbosa (2018) is to deal with this issue using neural network models, with a variant of the cross-entropy loss function to deal with out-of-context labels and with hierarchical loss normalization to deal with labels that are too specific. The proposed approach makes use of word embeddings to train a single-label model that jointly learns representations for entity mentions and their context. The authors learned two different entity representations and used bidirectional Long Short-Term Memory (LSTM) units with the purpose of learning the context representations. This work introduces hierarchical loss normalization to adjust the penalties for correlated types, thus allowing the model to understand the type hierarchy and to alleviate the negative weight of overly specific labels. To simplify the problem, the authors turned it into a single-label classification problem, assuming that each mention can only have one type-path depending on the context. The proposed approach is robust against noise and does not require post-processing or ad-hoc features. As a more detailed note on hierarchical loss functions: due to the type hierarchy that naturally exists in the fine-grained entity classification task, the loss function adapts and adjusts the penalties for errors depending on the distance between the errors and the correct hierarchy (Xu and Barbosa, 2018).

3.3 Toponym Resolution

The following sections present relevant studies previously developed using different techniques, namely approaches based on heuristics (Section 3.3.1), approaches that combine heuristics with supervised learning (Section 3.3.2), methods that use both geodesic grids and language models (Section 3.3.3), and finally methods based on deep learning techniques (Section 3.3.4).

3.3.1 Heuristic Methods

Most of the previously developed toponym resolution systems rely on the use of heuristics and typically resort to external knowledge sources such as gazetteers, enabling the access to a variety of data about places on Earth (e.g., alternative names, type of places, population density, area, among others). The systems based on heuristics usually leverage this information to decide which of the possible locations refers to the place name identified in the text (Ardanuy and Sporleder, 2017; Leidner, 2007).

Additionally, it is possible to consider linguistic aspects to generate heuristics. Leidner (2007) considers both linguistic heuristics (i.e., rules and patterns inferred from the textual content) and extra-linguistic heuristics (i.e., based on an external knowledge source). For example, one of the linguistic heuristics used by Leidner is based on a ”contained-in” qualifier, which recognizes patterns such as ”toponym1 in toponym2 ” or ”toponym1 (toponym2 )” and evaluates the spatial containment of the possible candidate locations (i.e., locations with the same name as the one under resolution) for both toponym mentions, assigning the corresponding geographic coordinates according to the spatial containment (e.g., if the pattern recognizes London (UK), it assigns to London the coordinates of the capital of England, whereas if the pattern recognizes London, Ontario, Canada, it assigns to the mention London the coordinates of the city of London in Ontario). One of the extra-linguistic heuristics that Leidner uses is the attribution of the candidate location with the highest population density to the toponym mention under disambiguation. Another example is the heuristic that considers whether a given toponym occurs only once in the text: if exactly one candidate location is a capital, then it is assumed that the toponym mention refers to the capital (e.g., if Madrid occurs in the text, it always assigns the coordinates of Madrid, the capital of Spain, without considering other locations named Madrid). Besides the examples mentioned, Leidner also combines both types of heuristics, such as considering the textual-spatial correlation, where it is assumed that textual proximity is strongly correlated with spatial proximity, assigning the locations accordingly. For example, if the mentions Paris and Versailles occur within a small text span (i.e., textual proximity), then the mention Paris is associated with Paris, France (Leidner, 2007).

One of the significant disadvantages of relying on gazetteers is that often, these are outdated and incomplete, thus impacting the systems that use them and making them unable to handle new and vernacular place names (Berman et al., 2016; Manguinhas et al., 2009).

3.3.2 Combining Heuristics through Supervised Learning

Other studies use supervised approaches that consider heuristics as features in standard machine learning techniques (Freire et al., 2011; Karimzadeh et al., 2019; Lieberman and Samet, 2012; Santos et al., 2015). The work of Santos et al. (2015) explores the combination of multiple features that may capture similarities between possible candidate locations, as well as other toponyms present in the text and the context of the place reference (i.e., the text surrounding the mention). Afterward, a rank is assigned to each candidate location, and the location with the highest rank is associated with the mention under resolution; this method revealed state-of-the-art results. Another work that employs supervised learning techniques along with heuristics is the GeoTxt geocoder, developed by Karimzadeh et al. (2019), a flexible application programming interface for extracting and disambiguating toponyms in small textual documents. This geocoder utilizes existing resources to recognize toponyms in text, focusing exclusively on disambiguating the recognized place references. When resolving toponyms, for each place reference, the system retrieves a list of candidate locations and associates a score to each of them. The score assigned to each location is a combination of multiple scores referring to features, which include independent political entities, administrative divisions, populated places, continent, region, or type of establishment (e.g., buildings, schools, airports), among others (Karimzadeh et al., 2019). Furthermore, GeoTxt enables the incorporation of additional disambiguation mechanisms that consider the co-occurrence of toponyms in the text. Two of these mechanisms are based on hierarchical relationships between toponyms, i.e., whether two toponyms share the same geographic space containment, either applied to immediately consecutive place names (e.g., pairs consisting of city, state) or applied to toponyms that appear separately in the text. The third mechanism is based on spatial proximity, which aims to minimize the average distance between the predicted toponym location and the locations of toponyms that co-occur in the text.

3.3.3 Methods Combining Geodesic Grids and Language Models

Besides the techniques previously mentioned, it is possible to use geodesic grids, sometimes combined with language models, to predict geographic coordinates when resolving toponyms. A geodesic grid over the surface of the Earth allows its subdivision into multiple regions of equal dimensions. The works of Adams and McKenzie (2018) and Gritta et al. (2018a) use geodesic grids, respectively, to geocode textual content and to disambiguate toponyms, assigning to each toponym the corresponding region over the surface of the Earth (details about these works are given in Section 3.3.4). Wing and Baldridge (2011) also explore the use of geodesic grids for document geolocation. The authors developed a model that attempts to predict the correct region of a document by applying simple supervised methods and only considering textual elements as input. The developed model records improvements over previous document geolocation studies (Wing and Baldridge, 2011).

With the TopoCluster system, DeLozier et al. (2015) address the limitation brought by the recurrent need to rely on external gazetteers when resolving toponyms. TopoCluster considers the geographical distribution of each word, including the surrounding common language words, since certain words have the property of being geographically indicative. The authors use spatial statistics over multiple geo-referenced language models to create geographic clusters for each word, deriving a smoothed geographic likelihood for each word in the vocabulary and computing the strongest geographic point where the clusters of the toponym and of the context words overlap. The authors show that it is possible to obtain superior results without resorting to gazetteers, noticing that the model performs well on corpora based on international news and historical texts (DeLozier et al., 2015).

3.3.4 Deep Learning Techniques

Adams and McKenzie (2018) proposed a character-level convolutional neural network model for geocoding multilingual text using any character set represented by UTF-8 encoding. The model receives as input a sequence of characters encoded as one-hot vectors, to which a series of temporal convolution and temporal max pooling operations is applied. Then, multiple linear transformations are applied to the result. Finally, the output layer predicts the region classification using a geodesic grid. By using character-level convolutional neural networks, the approach is language independent. The authors verified that the model did not achieve the best results when diacritical characters were present, concluding that individual words are sometimes good geographical indicators (Adams and McKenzie, 2018).

Another example of a toponym resolution system that uses deep learning techniques is the CamCoder model (Gritta et al., 2018a), which attempts to disambiguate place references by discovering lexical clues through the context words surrounding the mention. The model introduces a sparse vector representation, named MapVec, which encodes the prior geographic probability distribution of locations (i.e., based on location coordinates and population counts). The spatial data is projected onto a 2D world map, which is then reshaped into a 1D feature vector (i.e., MapVec), enabling the codification of additional information about spatial knowledge usually ignored in similar studies (Gritta et al., 2018a). The CamCoder geocoder combines lexical and geographic information, thus receiving the following inputs: the context words (without the location mentions), the mentions to locations (excluding the context words), the target entity to disambiguate, and finally the feature vector MapVec. The textual inputs are fed into separate convolutional layers with global maximum pooling to detect words indicative of locations among context words, while the feature vector is supplied into a fully-connected dense layer. Then, the four resulting components are fed into another dense layer, followed by a concatenation of their results, which are provided to the output layer, where the model predicts a location based on classification into regions defined by a geodesic grid. CamCoder is a robust model that enables the consideration of geographic factors beyond lexical clues to improve the performance of toponym resolution, presenting state-of-the-art results (Gritta et al., 2018a).

3.4 Overview

This chapter presented relevant related work covering the most important areas: Section 3.1 covered named entity linking, a field dedicated to the disambiguation of entities mentioned in text; Section 3.2 covered fine-grained entity classification, concerning the classification of entities mentioned in text into very similar categories; and finally, Section 3.3 detailed the different methods applied to address the toponym resolution task. Below, Table 3.1 summarizes the different approaches for toponym resolution used in the previous studies.

Table 3.1: Methods used in previous toponym resolution systems.

                             Heuristics                      Learning approaches
Study                        Linguistic   Extra-linguistic   Supervised learning   Language models   Deep learning   Geodesic grids

Leidner (2007)               X X
Santos et al. (2015)         X X X
Karimzadeh et al. (2019)     X X X
Wing and Baldridge (2011)    X
DeLozier et al. (2015)       X X
Adams and McKenzie (2018)    X X
Gritta et al. (2018a)        X X X

4 Toponym Resolution in Text

This chapter presents the neural model proposed to address toponym resolution. Section 4.1 explains the possibility of addressing toponym resolution as classification into regions using geodesic grids. Section 4.2 presents the proposed neural model architecture. Section 4.3 presents the additional experimental settings and model variants that were evaluated using the proposed model architecture. Finally, Section 4.4 provides an overview of the topics discussed in this chapter.

4.1 Toponym Resolution as Classification

As mentioned before, this work focuses exclusively on the toponym resolution task, intending to assign an unambiguous position over the surface of the Earth to each place name reference in textual contents. With this in mind, we chose to approach the problem as a classification task, where each place name reference is associated with a delimited region on the surface of the Earth through a geodesic grid. Therefore, we use the Hierarchical Equal Area isoLatitude Pixelization (HEALPix) scheme proposed by Górski et al. (2005), an algorithm that partitions a sphere into cells of equal area, corresponding to different regions on the Earth’s surface.

These partitions are obtained hierarchically through recursive divisions of a spherical surface, where the user defines the number of recursive divisions to execute over that surface (i.e., the desired resolution). These partitions are exemplified in Figure 4.1, which shows the grid divided according to different resolution parameters, differing in the number of cells generated. The number of pixels generated (i.e., distinct regions) is obtained according to Equation 4.1, where Nside corresponds to the desired resolution.

Npix = 12 × Nside²    (4.1)

Figure 4.1: Orthographic view of the HEALPix partitioning.

Throughout the experiments conducted in this work, we fixed the resolution parameter at Nside = 256, which is equivalent to considering a maximum of 786432 regions (Npix). In practice, the number of classes will be much smaller, given that most regions will not be associated with any data instance.
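As a minimal illustration, the sketch below uses the Healpy library (also referenced later in Section 5.2) to compute the number of regions for Nside = 256 and to map a pair of geographic coordinates to its HEALPix region class; the example coordinates are merely illustrative.

```python
import healpy as hp

NSIDE = 256
print(hp.nside2npix(NSIDE))  # 786432 candidate regions (Npix), as in Equation 4.1

# Map a (latitude, longitude) pair, in degrees, to its HEALPix region class,
# and recover the centroid coordinates of that region.
lat, lon = 38.7223, -9.1393  # illustrative coordinates
region_class = hp.ang2pix(NSIDE, lon, lat, lonlat=True)
centroid_lon, centroid_lat = hp.pix2ang(NSIDE, region_class, lonlat=True)
```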

4.2 Proposed Model Architecture

The overall idea behind our model is the following: given a textual document with previously annotated references to locations (i.e., identified as toponyms and associated with geographical coordinates of latitude and longitude), the model receives textual elements as input, including the context, uses contextual word embeddings together with bi-directional LSTM units to model the text sequence, predicts the region classification upon a geodesic grid, and uses the classification probability distribution to obtain the geographical coordinates (i.e., latitude and longitude) of each recognized place reference.

Moreover, our neural network model only receives textual inputs, more specifically three elements for each place reference recognized in the text: (1) the place mention itself; (2) the sentence, i.e., the words around the mention within a fixed window to the left and right sides of the focus span of the text with the toponym, totaling 50 words; and (3) the paragraph, also defined by a fixed window of larger dimensions (i.e., a total of 500 words), so that it captures the text around the sentence where the toponym occurs. Both the sentence and paragraph inputs consider the context around the mention in the forward and backward directions. When feeding the neural network with the paragraph of the mention, we are considering the general context of the document, and by considering a smaller textual window where the mention is present (i.e., the sentence), we are considering the closest context to the entity, since other toponyms, or even common language words appearing in the surrounding text, can be characteristic of specific regions and might provide clues about the location of the mention.

Figure 4.2: The proposed neural network architecture.

The structure of the developed model is represented in Figure 4.2. The model starts by pre-processing the text documents, extracting, for each annotated toponym, its geographical coordinates together with the three textual components corresponding to the inputs of the neural network model, namely (1) the mention itself, (2) the sentence and, finally, (3) the paragraph, as previously explained (i.e., for each corpus it was necessary to pre-process the text in order to retrieve the sentence and the paragraph corresponding to each annotated toponym; within this step it was necessary to perform tokenization and remove the punctuation). To generate the contextual representation of the text elements, we use pre-trained contextual word embeddings (Section 2.3.3). We apply the contextual word embedding approach to each one of the inputs, resulting in one embedding sequence for each of them, which is fed into a separate bi-directional LSTM layer to model the word sequence. In Figure 4.2, we represent the use of contextual word embeddings with the ELMo embedding model. However, the developed architecture is versatile, since it allows the usage of other word embedding models; in our case, we also evaluated the proposed model with the BERT embeddings.
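A minimal sketch of the three-branch encoder described above is shown below, assuming that the contextual embeddings (ELMo or BERT) are pre-computed and fed to the network as fixed-length sequences of vectors; the embedding dimension, the number of LSTM units, the sequence lengths, and the number of region classes are illustrative assumptions, and both the activation discussed next and the coordinate output discussed later in this section are omitted here.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

EMB_DIM = 1024       # dimension of the pre-computed contextual embeddings (assumption)
LSTM_UNITS = 256     # illustrative value
NUM_CLASSES = 999    # e.g., the number of distinct HEALPix classes in a corpus

def bilstm_branch(seq_len, name):
    # One branch: a sequence of contextual embeddings -> bi-directional LSTM -> max pooling.
    inputs = layers.Input(shape=(seq_len, EMB_DIM), name=name)
    hidden = layers.Bidirectional(layers.LSTM(LSTM_UNITS, return_sequences=True))(inputs)
    return inputs, layers.GlobalMaxPooling1D()(hidden)

mention_in, mention_vec = bilstm_branch(10, "mention")
sentence_in, sentence_vec = bilstm_branch(50, "sentence")
paragraph_in, paragraph_vec = bilstm_branch(500, "paragraph")

# Concatenate the three representations and predict the HEALPix region class.
merged = layers.Concatenate()([mention_vec, sentence_vec, paragraph_vec])
healpix_probs = layers.Dense(NUM_CLASSES, activation="softmax", name="healpix")(merged)

model = Model([mention_in, sentence_in, paragraph_in], healpix_probs)
```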

To each bi-directional LSTM layer, we apply the penalized hyperbolic tangent, an improvement over the hyperbolic tangent activation function suggested by Eger et al. (2018). As demonstrated in Equation 4.2, the function penalizes the activation in the negative region (i.e., whenever the input is negative). This activation function has proved to achieve superior results across a variety of natural language processing tasks.

f(x) = tanh(x), if x > 0;  0.25 · tanh(x), otherwise        (4.2)
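A minimal sketch of the activation in Equation 4.2, written so that it can be passed as a custom activation to Keras layers, is shown below; the layer in the usage example is purely illustrative.

```python
import tensorflow as tf

def penalized_tanh(x):
    # tanh(x) for positive inputs, 0.25 * tanh(x) otherwise (Equation 4.2).
    return tf.where(x > 0, tf.tanh(x), 0.25 * tf.tanh(x))

# Illustrative usage as a custom activation:
layer = tf.keras.layers.Dense(128, activation=penalized_tanh)
```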

Afterward, we concatenate the resulting representations from the maximum pooling operation over each bi-directional LSTM and use them to predict the HEALPix region class (the first output) through a dense layer with a softmax activation function, obtaining a probability vector of size equal to the number of distinct HEALPix region classes. This HEALPix class probability vector is used to estimate the corresponding geographic coordinates (the second output) through a cubic interpolation between the class probability distribution and the centroid coordinates matrix, i.e., a previously constructed matrix that contains the centroid coordinates of each HEALPix class, where each row corresponds to a distinct class. By applying a cubic interpolation (i.e., raising the HEALPix class probability vector to the power of three and normalizing), we accentuate the classes with higher probabilities, leading to a more peaked distribution. Each of the outputs is associated with a separate layer, and during model training the goal is to minimize the combined loss of the outputs, thereby mutually guiding the learning process and improving the results. In the geographic coordinates output layer, we apply a regression loss based on the Great Circle distance (i.e., the distance between two points, namely the predicted point and the actual one, over the surface of the Earth), while for the region classification output layer we apply the standard categorical cross-entropy loss.
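A sketch of a Great Circle regression loss over (latitude, longitude) pairs in degrees is given below; the text only states that a Great Circle distance is used, so the haversine formulation and the tensor shapes assumed here are illustrative.

```python
import math
import tensorflow as tf

EARTH_RADIUS_KM = 6371.0

def great_circle_loss(y_true, y_pred):
    # y_true and y_pred are assumed to have shape (batch, 2), holding (lat, lon) in degrees.
    rad = math.pi / 180.0
    lat1, lon1 = y_true[:, 0] * rad, y_true[:, 1] * rad
    lat2, lon2 = y_pred[:, 0] * rad, y_pred[:, 1] * rad
    a = tf.sin((lat2 - lat1) / 2.0) ** 2 + \
        tf.cos(lat1) * tf.cos(lat2) * tf.sin((lon2 - lon1) / 2.0) ** 2
    # Per-example great-circle distance in kilometers; Keras averages it over the batch.
    return 2.0 * EARTH_RADIUS_KM * tf.asin(tf.sqrt(a))
```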

When using the interpolation technique, we obtain an estimation of the coordinate point that considers the HEALPix class probabilities, where the most probable classes contribute the most to the estimation of the geographic coordinates. By considering distinct output layers, we take both output losses into account during training, which the model aims to minimize at the same time, re-adjusting the network weights when making a new prediction (i.e., the HEALPix region probability distribution, which in turn influences the prediction of the geographic coordinates).
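A simplified NumPy sketch of this interpolation step is shown below; in the actual model it is applied inside the network, and the sketch ignores complications such as longitude wrap-around near the antimeridian.

```python
import numpy as np

def interpolate_coordinates(class_probs, centroid_coords):
    # class_probs: (num_classes,) softmax output over the HEALPix classes.
    # centroid_coords: (num_classes, 2) matrix with the (lat, lon) centroid of each class.
    peaked = class_probs ** 3          # cubic interpolation: accentuate probable classes
    peaked = peaked / peaked.sum()     # re-normalize into a distribution
    return peaked @ centroid_coords    # weighted combination -> estimated (lat, lon)
```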

For training, we use the Adam optimization algorithm with a Cyclical Learning Rate (CLR) (Smith, 2017) policy, adjusting the learning rate throughout the training based on a cycle between a lower bound of 0.00001 and an upper bound of 0.0001. Besides, we also use an early stopping strategy (i.e., a form of regularization used to avoid overfitting that interrupts the training process once the model performance stops improving). The training was stopped when the combined loss over the training data did not improve for five consecutive epochs.
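The following sketch illustrates these training settings under a Keras setup: a triangular cyclical learning rate between the reported bounds (the step size is an assumption) and an early stopping callback with a patience of five epochs monitoring the training loss.

```python
import math
import tensorflow as tf

def triangular_clr(iteration, base_lr=1e-5, max_lr=1e-4, step_size=2000):
    # Triangular cyclical learning rate (Smith, 2017), intended to be evaluated per
    # training iteration; step_size is an illustrative value.
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="loss", patience=5)
```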

Table 4.1: Number of training instances considered in the experiments.

Training instances                          WOTR      LGL   SpatialML

Original corpora instances                   8458     4016        4145
Added Wikipedia instances                   15000    15000       15000
Total instances in Wikipedia experiments    23458    19016       19145

4.3 Additional Experiments with the Proposed Model

The neural network architecture described in Section 4.2 is the architecture adopted for our base model, referred to as the ELMo model. However, additional experiments with other model variants were also performed, namely the Wikipedia model, the BERT model, and the model that integrates geophysical properties, described below:

• ELMo model - The base model described in Section 4.2 and represented in Figure 4.2. This model uses the ELMo embeddings model (Section 2.3.3.1) to generate contextual word embedding representations for the text.

• Wikipedia model - To determine the impact of the size of the training data, we created a new corpus with articles from the English Wikipedia dumps. From random Wikipedia articles, we verified the existing hyperlinks with associated geographic coordinates, collecting the article text, the hyperlink text, and the geographic coordinates (Section 4.3.1).

• BERT model - To observe the impact of using a different contextual word embeddings model, we chose to use the BERT contextual embeddings (Section 2.3.3.2) instead of the ELMo contextual embeddings. Therefore, the only difference between this experiment and the ELMo model is the embedding model chosen to represent the text.

• Integration of geophysical properties - In this experiment, we consider additional information about geophysical properties, such as land cover, elevation, percentage of vegetation, and minimum distance to a water zone. We extract this extra information from datasets in raster format, incorporating it into the model using the same interpolation technique used to estimate the geographic coordinates, as described previously; more details about the geophysical properties are given in Section 4.3.2.

Figure 4.3: Using Wikipedia to create new data instances.

We tested both the Wikipedia model and the model that integrates geophysical properties with both contextual word embedding models covered in this work, ELMo and BERT, resulting in the following variants: (1) the Wikipedia+ELMo and Wikipedia+BERT models, and (2) the Geophysical+ELMo and Geophysical+BERT models, respectively.

4.3.1 Wikipedia instances

To determine the impact of training with a larger sample of data instances, we created a new corpus with articles from the English Wikipedia dumps. From a random sample of Wikipedia articles, we verified the existing hyperlinks with associated geographic coordinates, and then collected the text from the original article, the hyperlink text, and the geographic coordinates, considering that a hyperlink text with geographic coordinates refers to a place name reference, as demonstrated in Figure 4.3.

Moreover, the instances added to the training data were filtered to coincide with the HEALPix regions present in the original corpora, thus adding more instances to the training data without modifying the region classification space of each corpus. To each corpus we added a sample of 15000 Wikipedia instances, as reported in Table 4.1.
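A small sketch of this filtering step is shown below; the placeholder coordinates and the Wikipedia instance tuples are assumptions used only to illustrate the idea.

```python
import healpy as hp

NSIDE = 256

# Assumed placeholder data: coordinates annotated in one of the original corpora, and
# candidate Wikipedia instances given as (article_text, anchor_text, lat, lon) tuples.
corpus_coords = [(32.30, -90.18), (38.90, -77.04)]
wiki_instances = [("...", "Jackson", 32.29, -90.18), ("...", "Sydney", -33.87, 151.21)]

# Keep only instances whose coordinates fall in regions already present in the corpus.
corpus_regions = {hp.ang2pix(NSIDE, lon, lat, lonlat=True) for lat, lon in corpus_coords}
filtered = [inst for inst in wiki_instances
            if hp.ang2pix(NSIDE, inst[3], inst[2], lonlat=True) in corpus_regions]
```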

Table 4.2: Land coverage classes.

Code Class name

1    Broadleaf Evergreen Forest
2    Broadleaf Deciduous Forest
3    Needleleaf Evergreen Forest
4    Needleleaf Deciduous Forest
5    Mixed Forest
6    Tree Open
7    Shrub
8    Herbaceous
9    Herbaceous with Sparse Tree/Shrub
10   Sparse vegetation
11   Cropland
12   Paddy field
13   Cropland / Other Vegetation Mosaic
14   Mangrove
15   Wetland
16   Bare area, consolidated (gravel, rock)
17   Bare area, unconsolidated (sand)
18   Urban
19   Snow / Ice
20   Water bodies

4.3.2 Geophysical properties

Another extension that we tested with the proposed model was the consideration of additional information about geophysical properties, such as land cover, elevation, percentage of vegetation, and minimum distance to a water zone. We extract this extra information from datasets in raster format (i.e., a grid mapping properties to geographic coordinates).

This information was incorporated into the model with the same interpolation technique used to estimate the geographic coordinates, described previously. This allows the model to obtain more information about the locations when making predictions, enabling it to re-adjust the network weights when classifying into regions, so as to minimize all the model output losses. For each of the properties, we create a matrix with the values corresponding to the centroid of each distinct HEALPix class and interpolate these matrices with the probability distribution of the HEALPix classes, with the purpose of using the geophysical information to guide the prediction of the geographic coordinates.

The raster datasets with the geophysical properties were collected from the “Global Map data archives” project, developed under the cooperation of the National Geospatial Information Authorities (NGIAs) of the respective countries and regions. In this work we considered the following geophysical properties: (1) the land coverage classification1, with 20 classes of land coverage, detailed in Table 4.2; (2) the terrain elevation2, expressed in meters relative to sea level; and (3) the percentage of vegetation3, which, as the name indicates, encodes the percentage of tree coverage. The fourth property considered was derived from the land coverage classification raster dataset, where we calculated the minimum distance from each pixel to a zone classified as water (i.e., ocean or lakes).
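As an illustration of how the per-class property matrices described above could be built, the sketch below samples a raster at the centroid of each HEALPix class using the rasterio library; the file name, the assumption that the raster is georeferenced in latitude/longitude, and the class identifiers are all illustrative.

```python
import healpy as hp
import rasterio

NSIDE = 256
class_ids = [1234, 5678]  # illustrative HEALPix classes present in a corpus

# Centroid coordinates (degrees) of each class; lonlat=True returns (lon, lat).
lons, lats = hp.pix2ang(NSIDE, class_ids, lonlat=True)

# Sample the raster (e.g., terrain elevation) at each centroid, one value per class.
with rasterio.open("elevation.tif") as raster:
    values = [sample[0] for sample in raster.sample(zip(lons, lats))]
```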

4.4 Overview

This chapter presented the neural model proposed to address toponym resolution. Section 4.1 explained the possibility of addressing toponym resolution as classification into regions using geodesic grids. Section 4.2 presented the proposed neural model architecture. Finally, Section 4.3 presented the additional experimental settings and model variants that were evaluated using the proposed model architecture.

1https://globalmaps.github.io/glcnmo.html 2https://globalmaps.github.io/el.html 3https://globalmaps.github.io/ptc.html

5 Experimental Evaluation

This chapter details the experimental evaluation. Section 5.1 presents a description of the corpora used in the experiments, followed by the methodology applied to conduct the experiments (Section 5.2). Section 5.3 presents the results obtained, together with the conclusions. Lastly, Section 5.4 presents an overview of the topics discussed in the previous sections.

5.1 Corpora Used in the Experiments

In this section, we describe the corpora that were used during the development of this work, namely the War of the Rebellion (DeLozier et al., 2016), the Local-Global Lexicon (Lieberman et al., 2010), and SpatialML (Mani et al., 2010), which have been widely used in several studies in the area (Ardanuy and Sporleder, 2017; DeLozier et al., 2015, 2016; Gritta et al., 2018b,a; Santos et al., 2015). The War of the Rebellion (WOTR) corpus is composed of historical texts collected from military archives of the American Civil War, among which predominate military orders, reports, and government correspondence. DeLozier et al. (2016) present the process of annotating these historical documents, as well as an evaluation of the performance of existing toponym resolution systems over the developed corpus, additionally testing other corpora to examine the results. The authors concluded that the WOTR corpus was the most challenging corpus surveyed, with lower performance results than the Local-Global Lexicon corpus (i.e., considered the most challenging corpus until then). In turn, Lieberman et al. (2010) constructed the Local-Global Lexicon (LGL) corpus from articles retrieved from small and geographically distributed newspapers. This corpus was deliberately created to present several challenges to toponym resolution systems, given that it contains articles from small newspapers, based on nearby locations with highly ambiguous names. For example, Paris is a highly ambiguous toponym, and in this collection, there are articles from The Paris News (Paris, Texas), The Paris Post-Intelligencer (Paris, Tennessee), and The Paris Beacon-News (Paris, Illinois) (Lieberman et al., 2010). As mentioned before, until the appearance of the WOTR, it was considered one of the most challenging corpora for toponym resolution. The SpatialML corpus is provided by the Linguistic Data Consortium and comprises documents from the ACE English corpus, among which are broadcast conversations, broadcast news, magazine news, newsgroups, and web blogs. SpatialML is an annotation scheme, where the references to the locations identified in the text are associated with a PLACE tag and a LATLONG attribute corresponding to the geographical coordinates.

Table 5.1: Statistical characterization of the corpora used in our experiments.

Statistic                       WOTR      LGL   SpatialML

Number of documents              1644      588        428
Number of toponyms              10377     4462       4606
Avg. toponyms per document        6.3      7.6       10.8
Avg. tokens per document          246      325        497
Avg. sentences per document      12.7     16.1       30.7
Vocabulary size                 13386    16518      14489
Distinct HEALPix classes          999      761        461

5.2 Experimental Methodology

To conduct the described experiments, we use three well-known corpora, namely the War of the Rebellion (DeLozier et al., 2016), the Local-Global Lexicon (Lieberman et al., 2010), and SpatialML (Mani et al., 2010) (Section 5.1). As these corpora have different sources (i.e., historical documents, news from small places, and international news, respectively), they naturally also have different textual structures. For example, SpatialML is based on international news documents, which tend to be more extensive and have more toponym references than the other corpora. As said before, we consider a resolution parameter equal to 256 when dividing the spherical surface into multiple regions, originating the following distributions: the WOTR corpus contains 999 HEALPix classes, the LGL corpus covers 761 classes, and SpatialML 461 classes. This can be verified in Table 5.1, which presents a statistical characterization of the corpora. We did our best to simulate the conditions of the experiments conducted by previous systems, enabling the comparison of results and model performance. In the WOTR corpus, we used precisely the same data split (i.e., division of train and test data) provided by the authors.

Table 5.2: Experimental results obtained with the proposed neural method (ELMo model).

Toponym resolution system                  Mean (km)   Median (km)   Acc@161 (%)

WOTR
TopoCluster (DeLozier et al., 2016)              604             −          57.0
TopoClusterGaz (DeLozier et al., 2016)           468             −          72.0
GeoSem (Ardanuy and Sporleder, 2017)             445             −          68.0
Our approach                                     164         11.48          81.5

LGL
GeoTxt (Gritta et al., 2018a)                   1400             −          68.0
CamCoder (Gritta et al., 2018a)                  700             −          76.0
TopoCluster (DeLozier et al., 2015)             1735        274.00          45.5
Santos et al. (2015)                             742          2.79             −
Our approach                                     237         12.24          86.1

SpatialML
Santos et al. (2015)                             140         28.71             −
Our approach                                     395          9.08          87.4

Regarding the results presented for the LGL and SpatialML corpora, we split the data in the following proportion: 90% of the instances for training and the remaining 10% for testing.

To calculate the regions over the surface of the Earth, we used the Healpy Python library1, based on the HEALPix scheme (Section 4.1), which enables calculating the region code from the latitude and longitude coordinates, given the resolution, and vice versa. To evaluate the prediction of the geographic coordinates of each toponym, we calculate the distance between the predicted coordinates and the real coordinates on the Earth’s surface. This distance is computed using Vincenty’s geodesic formulae (Vincenty, 1975) (i.e., an iterative method that calculates the shortest geographic distance between two points on the Earth’s surface, with an accuracy within 0.5 millimeters). From the error distances between the two points, we can calculate the mean and median of these values, as well as the accuracy@161, i.e., a widely used measure in previous studies that reflects the percentage of distance errors less than or equal to 161 kilometers.
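A sketch of these evaluation measures is given below, computed from lists of predicted and true coordinates; a haversine great-circle distance is used here as a stand-in for Vincenty's geodesic formulae, and the simple median computation is an approximation for even-sized lists.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometers (stand-in for Vincenty's formulae).
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = math.sin((lat2 - lat1) / 2) ** 2 + \
        math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def evaluate(predicted, true):
    # predicted and true are lists of (lat, lon) pairs, one per toponym.
    errors = sorted(haversine_km(*p, *t) for p, t in zip(predicted, true))
    mean_error = sum(errors) / len(errors)
    median_error = errors[len(errors) // 2]
    accuracy_161 = 100.0 * sum(e <= 161 for e in errors) / len(errors)
    return mean_error, median_error, accuracy_161
```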

1http://pypi.org/project/healpy/

Table 5.3: Comparison of the results between the different variations of the experiments with the proposed model.

Experiment Mean (km) Median (km) Acc@161 (%)

WOTR
ELMo                  164   11.48   81.5
Wikipedia+ELMo        158   11.28   82.4
Geophysical+ELMo      166   11.35   81.9
BERT                  117   10.99   87.3
Wikipedia+BERT        122   11.04   86.4
Geophysical+BERT      114   10.99   87.3

LGL
ELMo                  237   12.24   86.1
Wikipedia+ELMo        304   12.16   87.4
Geophysical+ELMo      282   12.24   87.7
BERT                  193   11.81   90.1
Wikipedia+BERT        226   11.51   90.6
Geophysical+BERT      216   12.24   87.9

SpatialML
ELMo                  395    9.08   87.4
Wikipedia+ELMo        364    9.08   88.5
Geophysical+ELMo      387    9.08   87.4
BERT                  363    9.08   89.2
Wikipedia+BERT        205    9.08   92.4
Geophysical+BERT      339    9.08   89.4

5.3 The Obtained Results

The developed model achieves strong results, sometimes outperforming previous state-of-the-art results. In Table 5.2, we summarize the results obtained by our base model (i.e., using ELMo embeddings), comparing them with previous systems. Overall, our model performs well across all corpora, recording the lowest mean error in both the WOTR corpus and the LGL corpus, with reductions of 281 kilometers and 463 kilometers, respectively, when compared to the second-best value obtained by other systems. As for the SpatialML corpus, the system of Santos et al. records the best mean. However, our model reaches a median value of 9.08 kilometers, which is 19.63 kilometers lower. Regarding the accuracy at 161 kilometers measure, our model obtains a value of 81.5% for the WOTR and a value of 86.1% for the LGL corpus, representing increases of 9.5% and 10.1%, respectively, when compared to the previous second-best results reported. In SpatialML, we record a value of 87.4% for the accuracy@161 metric.

In Table 5.3, we present the results obtained in the several additional experiments conducted with the proposed architecture. The results show that the selection of the textual representations has a significant impact on the results achieved. By using the BERT contextual representations instead of ELMo, we achieved considerably better results, with, on average, a reduction of 41 kilometers in the mean value, a reduction of 0.3 kilometers in the median value, and an increase of 3.9% in the accuracy@161. We also observed that increasing the size of the training data, with more training instances, resulted in a slight improvement over the results obtained previously. Both when comparing the ELMo model with Wikipedia+ELMo and the BERT model with Wikipedia+BERT, we verified the same pattern in the LGL corpus and the SpatialML corpus. For example, in the LGL corpus, both with the ELMo and the BERT embeddings, we recorded an increase in the mean values, a slight decrease in the median values, and finally, an increase in the accuracy@161 values. Regarding the WOTR corpus, the results are inconclusive, possibly due to the difference in the textual register, i.e., this corpus is composed of historical documents, mainly governmental correspondence, while the Wikipedia articles have an informative register and a modern tone. As for the experiments with geophysical information (land coverage, among others), both with ELMo and BERT, we recorded a slight improvement over the ELMo and BERT models, respectively. Thus, the model benefits from the addition of geophysical information, which also helps to guide the prediction of the coordinates. However, the addition of geophysical properties does not provide relevant information when applied to the LGL corpus, leading us to the conclusion that the geophysical data does not have enough spatial resolution in the case of this corpus.

In Table 5.4, we present the locations with the lowest and highest prediction distance errors for all corpora. It is worth noting that, in all corpora, among the locations with the lowest prediction distance error, there are cases of demonyms (e.g., English in the SpatialML corpus, resolved to the location England, UK, with an error of only 2.44 kilometers), and even small places designated using vernacular names (e.g., Owen’s Big Lake in the WOTR corpus).

Table 5.4: Locations with the lowest and highest prediction distance errors.

Corpus Lowest error (km) Highest error (km)

WOTR
(0.63) Mexico             (3104.59) Fort Welles
(1.00) Resaca             (3141.29) Washington
(1.09) Owen’s Big Lake    (3682.01) Astoria

LGL
(1.21) W.Va.              (8854.04) Ohioans
(1.36) Butler County      (9225.86) North America
(1.51) Manchester         (9596.54) Nigeria

SpatialML
(0.45) Tokyo              (9687.43) Capital
(2.38) Lusaka             (10818.50) Omaha
(2.44) English            (13140.64) Atlantic City

Illustrative examples, together with the corresponding document text retrieved from the WOTR corpus, are presented in Table 5.5. Each example document text has the annotated toponyms highlighted in red, and the corresponding image shows the real location (green point) and the predicted location (red point), with the distance between the two points represented by a black line. In the examples shown, we included clear cases where the error between the predicted point and the actual point is small, and other cases where this distance is significantly larger. It is noteworthy the consecutive co-occurrence of toponyms, e.g., Memphis, Tenn., which can give clues about both toponym locations. In the third example, we can see that all the toponyms were assigned locations with low errors, with an average error of approximately 16.6 kilometers, among which there is a reference to the location Paris, a very ambiguous place name that is well resolved by the model with the help of the surrounding context.

5.4 Overview

This chapter detailed the experimental evaluation. Section 5.1 presented a description of the corpora used in the experiments. Section 5.2 presented the methodology applied to conduct the experiments. And finally, Section 5.3 presented the results obtained, together with the conclusions.

Table 5.5: Illustrative examples.

Figure Text

[Indorsement.] HDQRS. DETACHMENT SIXTEENTH ARMY CORPS, Memphis, Tenn., June 12, 1864. Respectfully referred to Colonel David Moore, commanding THIRD DIVISION, SIXTEENTH Army Corps, who will send the THIRD Brigade of his command, substituting some regiment for the Forty-ninth Illinois that is not entitled to veteran furlough, making the number as near as possible to 2,000 men. They will be equipped as within directed, and will move to the railroad depot as soon as ready. You will notify these headquarters as soon as the troops are at the depot. By order of Brigadier General A. J. Smith: J. HOUGH, Assistant Adjutant-General.

HYDESVILLE, October 21, 1862 SIR: I started from this place this morning, 7. 30 o’clock, en route for Fort Baker. The express having started an hour before, I had no es- cort. About two miles from Simmons’ ranch I was attacked by a party of Indians. As soon as they fired they tried to surround me. I re- turned their fire and retreated down the hill. A portion of them cut me off and fired again. I re- turned their fire and killed one of them. They did not follow any farther. I will start this evening for my post as I think it will be safer to pass this portion of the country in the night. Those Indians were lurking about for the purpose of robbing Cooper’s Mills. They could have no other object, and I think it would be well to have eight or ten men stationed at that place, as it will serve as an outpost for the set- tlement, as well as a guard for the mills. The expressmen disobeyed my orders by starting without me this morning. I have the honor to be, very respectfully, your obedient servant, H. FLYNN, Captain, Second Infantry California Volunteers. First Lieutenant JOHN HANNA, Jr., Acting Assistant Adjutant-General, Hum- boldt Military District.

LEXINGTON, KY., June 11, 1864–11 p. m. Colonel J. W. WEATHERFORD, Lebanon, Ky. Have just received dispatch from General Burbridge at Paris. He says direct Colonel Weatherford to closely watch in the direction of Bardstown and Danville, and if any part of the enemy’s force appears in that region to attack and destroy it. J. BATES DICKSON, Captain and Assistant Adjutant-General.

HEADQUARTERS DEPARTMENT OF THE NORTHWEST, Milwaukee, Wis., April 13, 1864. Brigadier General H. H. SIBLEY, Com- manding District of Minnesota, Saint Poul, Minn.: GENERAL: Your letter of 9th instant to the major-general commanding is received, and I am directed by him to advise you that the Sixth Minnesota Regiment will remain un- der your orders until its place can be supplied by the Eighth Regiment on its return from ex- pedition. A telegraphic dispatch to that effect is sent you to-day, and I inclose a copy of it. I am, general, most respectfully, your obedi- ent servant, J. F. MELINE, Acting Assistant Adjutant-General.

6 Conclusions and Future Work

In this dissertation, I addressed the toponym resolution task, presenting a novel recurrent neural network architecture with multiple textual inputs, leveraging pre-trained contextual word embeddings (ELMo or BERT) and bi-directional Long Short-Term Memory (LSTM) units, and producing multiple outputs for classification and regression tasks. The proposed model incorporates classification into HEALPix regions, i.e., divisions of the surface of the Earth, used to improve the results for the regression task when predicting the geographic coordinates. I conducted several additional experiments, including training data augmentation through English Wikipedia articles, the application of different contextual word embeddings, and the addition of geophysical properties retrieved from raster datasets to support the prediction of geographic coordinates. The proposed model was tested on the following corpora: the War of the Rebellion, the Local-Global Lexicon, and SpatialML. The results obtained confirm the superiority of the proposed method over previous studies that demonstrated state-of-the-art results. Using contextual word embeddings has shown to be useful for improving many NLP tasks, particularly when involving relatively small amounts of annotated training data, as in the case of the experiments reported in this work. We compared the impact of different embedding models, concluding that BERT outperforms ELMo representations when applied to the toponym resolution task. In the data augmentation scenario based on a selection of English Wikipedia articles, we recorded an improvement in the results obtained when compared to a scenario where exclusively the original corpora were used. Information on geophysical properties, when incorporated into the proposed model, had a beneficial impact, since it contributed to the achievement of better results.

6.1 Overview on the Contributions

The most important contributions of my M.Sc. thesis are the following:

• The proposal of a novel model architecture for toponym resolution that incorporates deep learning techniques: the model combines pre-trained contextual word embeddings with bidirectional Long Short-Term Memory units to model the textual elements. The proposed model incorporates textual inputs, namely the place name reference, the corresponding sentence, and the corresponding paragraph, with multiple outputs (i.e., a primary output of geographic coordinates and a secondary output of classification into regions over the surface of the Earth corresponding to the place reference, resorting to a geodesic grid). The result of the classification is used to improve the prediction of geographic coordinates for each place reference, through a separate layer that directly applies the Great Circle distance as a loss function. The obtained results exceed previously reported results over the same corpora, thus demonstrating state-of-the-art performance.

• The integration and evaluation of the proposed model with distinct pre-trained contextual word embedding approaches, namely the Embeddings from Language Models (ELMo) (Peters et al., 2018) and the Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), along with the analysis of the impact that the textual representations have on the obtained results. We concluded that the text representation method has a significant impact on the results, verifying that, by using the BERT contextual word embeddings, our model achieves higher performance in the toponym resolution task.

• Additional experiments with the proposed model considering different scenarios with further information: (i) using external information from geophysical properties (i.e., land coverage, terrain elevation, percentage of vegetation, and minimum distance to a water zone) extracted from external raster datasets and incorporated in the proposed model to guide the prediction of the geographic coordinates; and (ii) using a larger training sample, to determine the impact of the training data size on the results. The instances added to the original corpora were collected from a random sample of English Wikipedia articles, leveraging the Wikipedia link structure to infer which spans of text correspond to place references, in the sense that they link to Wikipedia pages associated with geospatial coordinates. Both experiments revealed slight improvements in the obtained results, demonstrating that, indeed, the neural network model benefits from the addition of information, both from the addition of geophysical information and from the addition of training instances.

6.2 Future Work

Regarding future work, it may be interesting to explore cross-language embeddings to support the idea of training models that leverage existing data in a given language and are capable of operating on texts from a different language with fewer resources. It is worth noticing that approaches such as ELMo take character information into account to compose word representations, this way addressing the problem of out-of-vocabulary words to some degree (i.e., it is possible to generate representations for words that are not present in the vocabulary used for learning the word embeddings, leveraging the characters that compose these words). However, individual characters are an insufficient and unnatural linguistic unit for word representation; similarly to other approaches such as FastText embeddings, which leverage character n-grams, it would be interesting to extend the ELMo contextual approach to also learn representations for sub-words.

In this work I chose to use the ELMo and BERT embedding models; however, there are numerous other contextual word embedding models (Wolf et al., 2019), for example RoBERTa (Liu et al., 2019), an optimized version of BERT (i.e., eliminating the pre-training objective concerning the prediction of the next sentence, and training the model with larger mini-batches, higher learning rates, more training data, and for a more extended amount of time than BERT), which has proven to be more efficient, producing state-of-the-art results.

Additionally, it would be interesting to test the model more intensively (e.g., using other corpora based on different sources, such as scientific documents, and even comparing the performance of the proposed model against previous systems). One possibility would be to use EUPEG (Wang and Hu, 2019b), a benchmark platform developed by Wang and Hu, which integrates a wide range of document collections and permits the comparison between several existing systems. The EUPEG platform is a very complete and up-to-date system that includes the collection of scientific corpora used in the SemEval-2019 competition on toponym resolution (Weissenbacher et al., 2019) and features the systems that obtained the best classifications (Wang and Hu, 2019a).

Bibliography

Adams, B. and McKenzie, G. (2018). Crowdsourcing the character of a place: Character-level convolutional networks for multilingual geographic text classification. Transactions in GIS, 22(2):394–408.

Ardanuy, M. and Sporleder, C. (2017). Toponym disambiguation in historical documents using semantic and geographic features. In Proceedings of the International Conference on Digital Access to Textual Cultural Heritage, pages 175–180. ACM.

Berman, M., Mostern, R., and Southall, H. (2016). Placing names: Enriching and integrating gazetteers. Indiana University Press.

Cao, Y., Hou, L., Li, J.-Z., and Liu, Z. (2018). Neural collective entity linking. In Proceedings of the International Conference on Computational Linguistics, pages 675–686. Association for Computational Linguistics.

Cardoso, A. B., Martins, B., and Estima, J. (2019). Using recurrent neural networks for toponym resolution in text. pages 769–780. Springer International Publishing.

DeLozier, G., Baldridge, J., and London, L. (2015). Gazetteer-independent toponym resolution using geographic word profiles. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2382–2388. AAAI Press.

DeLozier, G., Wing, B., Baldridge, J., and Nesbit, S. (2016). Creating a novel geolocation corpus from historical texts. In Proceedings of the Linguistic Annotation Workshop held in conjunction with ACL, pages 188–198. Association for Computational Linguistics.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 4171–4186. Association for Computational Linguistics.

Eger, S., Youssef, P., and Gurevych, I. (2018). Is it time to swish? comparing deep learning activation functions across NLP tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 4415–4424. Association for Computational Linguistics.

Freire, N., Borbinha, J., Calado, P., and Martins, B. (2011). A metadata geoparsing system for place name recognition and resolution in metadata records. In Proceedings of the Annual International ACM/IEEE Joint Conference on Digital Libraries, pages 339–348. ACM.

Goldberg, Y. (2017). Neural network methods in natural language processing. Morgan & Claypool Publishers.

Górski, K., Hivon, E., Banday, A., Wandelt, B., Hansen, F., Reinecke, M., and Bartelman, M. (2005). HEALPix: A framework for high-resolution discretization and fast analysis of data distributed on the sphere. The Astrophysical Journal, 622(2):759–771.

Gritta, M., Pilehvar, M., and Collier, N. (2018a). Which Melbourne? augmenting geocoding with maps. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1285–1296. Association for Computational Linguistics.

Gritta, M., Pilehvar, M., Limsopatham, N., and Collier, N. (2018b). What’s missing in geographical parsing? Language Resources and Evaluation, 52(2):603–623.

Gupta, N., Singh, S., and Roth, D. (2017). Entity linking via joint encoding of types, descriptions, and context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 2681–2690. Association for Computational Linguistics.

Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116.

Howard, J. and Ruder, S. (2018). Universal language model fine-tuning for text classification. Computing Research Repository, abs/1801.06146.

Karimzadeh, M., Pezanowski, S., MacEachren, A., and Wallgrün, J. (2019). GeoTxt: A scalable geoparsing system for unstructured text geolocation. Transactions in GIS, 23(1):118–136.

Leidner, J. (2007). Toponym resolution in text. PhD thesis, University of Edinburgh.

Lieberman, M. and Samet, H. (2012). Adaptive context features for toponym resolution in streaming news. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 731–740. ACM.

Lieberman, M., Samet, H., and Sankaranarayanan, J. (2010). Geotagging with local lexicons to build indexes for textually-specified spatial data. In Proceedings of the IEEE International Conference on Data Engineering, pages 201–212. IEEE.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. Computing Research Repository, abs/1907.11692.

Manguinhas, H., Martins, B., Borbinha, J., and Siabato, W. (2009). The DIGMAP geo-temporal web gazetteer service. E-Perimetron, 4(1):9–24.

Mani, I., Doran, C., Harris, D., Hitzeman, J., Quimby, R., Richer, J., Wellner, B., Mardis, S., and Clancy, S. (2010). SpatialML: annotation scheme, resources, and evaluation. Language Resources and Evaluation, 44(3):263–280.

Melo, F. and Martins, B. (2017). Automated geocoding of textual documents: A survey of current approaches. Transactions in GIS, 21(1):3–38.

Monteiro, B., Davis, C., and Fonseca, F. (2016). A survey on the geographic scope of textual documents. Computers & Geosciences, 96:23–34.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 2227–2237. Association for Computational Linguistics.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training.

Ruder, S. (2016). An overview of gradient descent optimization algorithms. Computing Research Repository, abs/1609.04747.

Santos, J., Anastácio, I., and Martins, B. (2015). Using machine learning methods for disambiguating place references in textual documents. GeoJournal, 80(3):375–392.

Shimaoka, S., Stenetorp, P., Inui, K., and Riedel, S. (2017). Neural architectures for fine-grained entity type classification. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, volume 1, pages 1271–1280. Association for Computational Linguistics.

Smith, L. (2017). Cyclical learning rates for training neural networks. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pages 464–472. IEEE.

Speriosu, M. and Baldridge, J. (2013). Text-driven toponym resolution using indirect supervision. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1466–1476. Association for Computational Linguistics.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Proceedings of the International Conference on Neural Information Processing Systems, pages 5998–6008. Curran Associates Inc.

Vincenty, T. (1975). Direct and inverse solutions of geodesics on the ellipsoid with application of nested equations. Survey Review, 23(176):88–93.

Wang, J. and Hu, Y. (2019a). Are we there yet?: Evaluating state-of-the-art neural network- based geoparsers using EUPEG as a benchmarking platform. In Proceedings of the ACM SIGSPATIAL International Workshop on Geospatial Humanities. ACM.

Wang, J. and Hu, Y. (2019b). Enhancing spatial and textual analysis with EUPEG: an extensible and unified platform for evaluating geoparsers. Transactions in GIS, 23(6):1393–1419.

Weissenbacher, D., Magge, A., O’Connor, K., Scotch, M., and Gonzalez-Hernandez, G. (2019). SemEval-2019 task 12: Toponym resolution in scientific papers. In Proceedings of the International Workshop on Semantic Evaluation, pages 907–916. Association for Computational Linguistics.

Wing, B. (2015). Text-based document geolocation and its application to the digital humanities. PhD thesis, University of Texas at Austin.

Wing, B. and Baldridge, J. (2011). Simple supervised document geolocation with geodesic grids. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 955–964. Association for Computational Linguistics.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. (2019). Transformers: State-of-the-art natural language processing. Computing Research Repository, abs/1910.03771.

Xin, J., Lin, Y., Liu, Z., and Sun, M. (2018). Improving neural fine-grained entity typing with knowledge attention. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5997–6004.

Xu, P. and Barbosa, D. (2018). Neural fine-grained entity type classification with hierarchy-aware loss. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 16–25. Association for Computational Linguistics.

Yaghoobzadeh, Y., Adel, H., and Schütze, H. (2018). Corpus-level fine-grained entity typing. Journal of Artificial Intelligence Research, 61(1):835–862.

Yang, Y., Irsoy, O., and Rahman, K. S. (2018). Collective entity disambiguation with structured gradient tree boosting. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 777–786. Association for Computational Linguistics.
