Toponym Resolution in Text

Ana Bárbara Inácio Cardoso

Thesis to obtain the Master of Science Degree in Information Systems and Computer Engineering

Supervisors: Prof. Bruno Emanuel da Graça Martins, Prof. Jacinto Paulo Simões Estima

Examination Committee
Chairperson: Prof. Alexandre Paulo Lourenço Francisco
Supervisor: Prof. Bruno Emanuel da Graça Martins
Members of the Committee: Prof. Fernando Manuel Marques Batista

October 2019

Acknowledgements

First of all, I would like to thank my advisors, Professor Bruno Emanuel da Graça Martins and Professor Jacinto Paulo Simões Estima, for their guidance during the development of this work and their immense contributions to the growth and success of this thesis.

I would like to thank my amazing family for their support, motivation, and for allowing me to learn and to grow professionally and personally during the time spent in Instituto Superior Técnico. Thank you for being so supportive.

I am also grateful for the support and constant motivation of my friends and colleagues during this entire journey over the past five years. Thank you for the fantastic time, joys, victories, and all the tears that were shared.

Finally, I would like to thank the Fundação para a Ciência e Tecnologia (FCT) for supporting the work developed during this dissertation, through the project grants with references PTDC/EEI-SCR/1743/2014 (Saturn), T-AP HJ-253525 (DigCH), and PTDC/CCI-CIF/32607/2017 (MIMU), as well as through the INESC-ID multi-annual funding from the PIDDAC programme (UID/CEC/50021/2019). I also gratefully acknowledge the support of NVIDIA Corporation, with the donation of two Titan Xp GPUs used in the reported experiments.
Ana Bárbara Inácio Cardoso

For my family,

Resumo

Toponym resolution in text, where a toponym refers to a place name or a place reference, consists of disambiguating these references by associating each of them with a unique, and therefore unambiguous, location on the surface of the Earth (for example, through the assignment of latitude and longitude geographic coordinates). Given that place names are highly ambiguous, toponym resolution is a challenging task; for example, multiple places on Earth share the same name, and multiple designations refer to the same place. Toponym resolution is a very interesting task, since several possible applications can benefit from its results, including support for the processing and analysis of geographic information present in large collections of documents, as well as support for document geolocation. The work I carried out during the MSc thesis, described in this dissertation, aimed to analyze the studies developed in the area so far, and to develop a model for toponym resolution that considers state-of-the-art techniques applied to natural language processing. The proposed neural network architecture uses recurrent units with multiple inputs (for example, the toponym to be disambiguated together with the adjacent words), specifically leveraging pre-trained contextual word embeddings (ELMo or BERT embeddings) and bidirectional Long Short-Term Memory (LSTM) units, both widely used for modeling textual data. Additionally, the proposed model was evaluated in different contexts, (i) using external information extracted from raster data with geophysical information, including land cover and terrain elevation, among others, and (ii) using additional data from English Wikipedia articles to train the model, with the aim of guiding and helping during training. The results obtained show a significantly higher quality for the proposed method in comparison with previous approaches, particularly in the scenario that involves BERT embeddings together with the additional data.

Abstract

Toponym resolution in text, where a toponym refers to a place name or place reference, consists of disambiguating these references by associating each of them with a unique, and thus unambiguous, location on the surface of the Earth (e.g., through the assignment of latitude and longitude geographical coordinates). Given that place names are highly ambiguous, toponym resolution is a challenging task; for instance, multiple places on Earth share the same name, and multiple designations can refer to the same location. Toponym resolution is an exciting task, since several possible applications can benefit from its results, including support for the processing and analysis of geographic information present in large collections of documents, as well as support for document geolocation. The research conducted during the MSc thesis and presented in this dissertation aims to analyze the studies developed in the area so far, and to develop a model for toponym resolution that considers state-of-the-art techniques applied to natural language processing.
The proposed neural network architecture uses recurrent units with multiple inputs (e.g., the toponym to disambiguate along with the surrounding words), leveraging pre-trained contextual word embeddings (i.e., ELMo or BERT embeddings) and bi-directional Long Short-Term Memory (LSTM) units, both commonly used for textual data modeling. Additionally, the proposed model was evaluated in different contexts, (i) using external information extracted from raster data with geophysical information, including land cover and terrain elevation, among others, and (ii) using additional data from English Wikipedia articles to guide and help the model during training. The obtained results show a significantly higher quality for the proposed method in comparison to previous approaches, particularly in the setting that involves BERT embeddings and additional data.

Keywords

Geographical text analysis
Toponym resolution in text
Deep learning for Natural Language Processing
Recurrent neural networks
Contextual word embedding representations
Geophysical properties

Contents

1 Introduction
  1.1 Motivation
  1.2 Thesis Proposal
  1.3 Contributions
  1.4 Structure of the Document
2 Fundamental Concepts
  2.1 Introduction to Neural Networks and Deep Learning
    2.1.1 Feed-forward Neural Networks
    2.1.2 Optimization Algorithms to Neural Networks
  2.2 Recurrent Neural Networks
    2.2.1 Simple Recurrent Neural Network Architecture
    2.2.2 Long Short-Term Memory Architecture
    2.2.3 Gated Recurrent Unit Architecture
  2.3 Text Representation Methods
    2.3.1 Traditional Approaches
    2.3.2 Word Embeddings
    2.3.3 Contextual Word Embeddings
      2.3.3.1 Embeddings from Language Models
      2.3.3.2 Bidirectional Encoder Representations from Transformers
  2.4 Overview
3 Related Work
  3.1 Named Entity Linking
  3.2 Fine-Grained Entity Classification
  3.3 Toponym Resolution
    3.3.1 Heuristic Methods
    3.3.2 Combining Heuristics through Supervised Learning
    3.3.3 Methods Combining Geodesic Grids and Language Models
    3.3.4 Deep Learning Techniques
  3.4 Overview
4 Toponym Resolution in Text
  4.1 Toponym Resolution as Classification
  4.2 Proposed Model Architecture
  4.3 Additional Experiments with the Proposed Model
    4.3.1 Wikipedia instances
    4.3.2 Geophysical properties
  4.4 Overview
5 Experimental Evaluation
  5.1 Corpora Used in the Experiments
  5.2 Experimental Methodology
  5.3 The Obtained Results
  5.4 Overview
6 Conclusions and Future Work
  6.1 Overview on the Contributions
  6.2 Future Work
Bibliography

List of Figures

2.1 Perceptron architecture.
2.2 Feed-forward neural network architecture.
2.3 Recurrent neural network architecture.
2.4 Multi-layer RNN architecture.
2.5 Bi-directional recurrent neural network.
2.6 Traditional text representation.
2.7 ELMo model.
2.8 BERT model.
4.1 HEALPix partitioning.
4.2 Proposed neural network architecture.
4.3 Using Wikipedia to create new data instances.

List of Tables

3.1 Methods used in previous toponym resolution systems.
4.1 Experiments with Wikipedia instances.
4.2 Land coverage classes.
5.1 Statistical characterization of the corpora.
5.2 Experiments with the proposed model.
5.3 Additional experiments results.
5.4 Locations and respective prediction distance error.
5.5 Illustrative examples.

1 Introduction

Toponym resolution concerns the disambiguation of place names and other references to places in textual documents.
The disambiguation is achieved by associating each of these place references with a unique position on the Earth's surface, e.g., through the assignment of geographic coordinates. The results of toponym resolution support several applications, such as improving search engine results (e.g., through geographic indexing or classification) and classifying documents according to spatial criteria, which allows documents to be grouped into meaningful clusters and enables the mapping of textually encoded information (Monteiro et al., 2016). Other applications arise in areas such as computational social sciences and digital humanities (Wing, 2015), for instance by supporting the automatic processing and analysis of geographic data encoded in extensive collections of textual documents. Moreover, place reference resolution can serve as an auxiliary component for geolocating complete documents (Melo and Martins, 2017), since the toponyms mentioned in a text provide indications about the overall location of the document.
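
To make the task definition concrete, the following minimal Python sketch illustrates why a place name alone is ambiguous and how the output of a resolver, a pair of coordinates, can be compared against a reference location through the great-circle distance. It is only an illustration of the problem statement and not the model proposed in this dissertation: the toy gazetteer, its approximate coordinates, and the function names `candidates` and `haversine_km` are assumptions introduced for this example.

```python
import math

# Hypothetical toy gazetteer; coordinates are approximate and for illustration only.
GAZETTEER = {
    "Lisbon": [
        ("Lisbon, Portugal", 38.72, -9.14),
        ("Lisbon, Ohio, USA", 40.77, -80.77),
    ],
}

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def candidates(toponym):
    """Return every (label, latitude, longitude) entry sharing the given name."""
    return GAZETTEER.get(toponym, [])


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point rounding before the arcsine.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


if __name__ == "__main__":
    # The same surface form maps to more than one position on the Earth's surface.
    options = candidates("Lisbon")
    for label, lat, lon in options:
        print(f"{label}: ({lat}, {lon})")
    # Choosing the wrong candidate produces a large distance error.
    (_, lat_a, lon_a), (_, lat_b, lon_b) = options
    print(f"Distance between candidates: {haversine_km(lat_a, lon_a, lat_b, lon_b):.0f} km")
```

In this sketch, a resolver that chose the wrong candidate would be off by several thousand kilometres, which is why the distance between the predicted and the true coordinates is a natural way to quantify resolution quality.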