Toponym Resolution in Text

Ana Bárbara Inácio Cardoso

Thesis to obtain the Master of Science Degree in Information Systems and Computer Engineering

Supervisors: Prof. Bruno Emanuel da Graça Martins Prof. Jacinto Paulo Simões Estima

Examination Committee

Chairperson: Prof. Alexandre Paulo Lourenço Francisco Supervisor: Prof. Bruno Emanuel da Graça Martins Members of the Committee: Prof. Fernando Manuel Marques Batista

October 2019

Acknowledgements

First of all, I would like to thank my advisors Professor Bruno Emanuel da Graça Martins and Professor Jacinto Paulo Simões Estima, for their guidance during the development of this work and their immense contributions to the growth and success of this thesis.

I would like to thank my amazing family for their support, motivation, and for allowing me to learn and to grow professionally and personally during the time spent in Instituto Superior Técnico. Thank you for being so supportive.

I am also grateful for the support and constant motivation of my friends and colleagues during this entire journey over the past five years. Thank you for the fantastic time, joys, victories, and all the tears that were shared.

Finally, I would like to thank the Fundação para a Ciência e a Tecnologia (FCT), for supporting the work developed during this dissertation, through the project grants with references PTDC/EEI-SCR/1743/2014 (Saturn), T-AP HJ-253525 (DigCH), and PTDC/CCI-CIF/32607/2017 (MIMU), as well as through the INESC-ID multi-annual funding from the PIDDAC programme (UID/CEC/50021/2019). I also gratefully acknowledge the support of NVIDIA Corporation, with the donation of two Titan Xp GPUs used in the reported experiments.

Ana Bárbara Inácio Cardoso

For my family,

Resumo

A resolução de topónimos em texto, em que um topónimo se refere a um nome de local ou a uma referência de local, consiste na desambiguação destas referências, associando a cada uma delas uma localização única, portanto inequívoca, sobre a superfície da Terra (por exemplo, através da atribuição de coordenadas geográficas de latitude e longitude). Dado que os nomes dos locais são altamente ambíguos, a resolução de topónimos é uma tarefa desafiante; por exemplo, existem múltiplos locais na Terra que partilham o mesmo nome, e ainda múltiplas designações que referem o mesmo local. A resolução de topónimos é uma tarefa muito interessante, uma vez que várias aplicações possíveis podem beneficiar dos resultados, incluindo o apoio ao processamento e análise de informação geográfica presente em coleções extensas de documentos, assim como o suporte à geolocalização de documentos. O trabalho que desenvolvi durante a tese de mestrado, descrito neste documento de dissertação, teve como objetivo a análise de estudos desenvolvidos na área até ao momento, bem como o desenvolvimento de um modelo para a resolução de topónimos considerando técnicas do estado-da-arte aplicadas ao processamento de língua natural. A arquitetura de rede neural proposta utiliza unidades recorrentes com múltiplas entradas (por exemplo, o topónimo a ser desambiguado juntamente com as palavras adjacentes), aproveitando especificamente incorporações de palavras contextuais pré-treinadas (incorporações ELMo ou BERT) e unidades bidirecionais de Long Short-Term Memory (LSTM), ambas muito utilizadas para a modelação de dados textuais. Adicionalmente, o modelo proposto foi avaliado em diferentes contextos, (i) usando informações externas extraídas de dados rasterizados com informações geofísicas, incluindo cobertura terrestre, elevação do terreno, entre outras, e (ii) usando dados adicionais de artigos da Wikipédia em inglês para treinar o modelo, com o objetivo de guiar e ajudar durante o treino. Os resultados obtidos mostram uma qualidade significativamente superior do método proposto, em comparação com as abordagens anteriores, particularmente no cenário que envolve incorporações BERT juntamente com a adição de dados.

Abstract

Toponym resolution in text, where a toponym refers to a place name or place reference, consists in the disambiguation of these references, by associating each one of them to a unique, thus unambiguous, location over the surface of the Earth (e.g., through the assignment of latitude and longitude geographical coordinates). Given that place names are highly ambiguous, toponym resolution is a challenging task: for instance, multiple places on Earth share the same name, and multiple designations can refer to the same location. Toponym resolution is an exciting task, since several possible applications can benefit from its results, including support for the processing and analysis of geographic information present in large collections of documents, as well as support for document geolocation. The research conducted during the MSc thesis and presented in this dissertation aimed to analyze the studies developed in the area so far, as well as to develop a model for toponym resolution considering state-of-the-art techniques applied to natural language processing. The proposed neural network architecture uses recurrent units with multiple inputs (e.g., the toponym to disambiguate along with the surrounding words), leveraging pre-trained contextual word embeddings (i.e., ELMo or BERT embeddings) and bi-directional Long Short-Term Memory (LSTM) units, both commonly used for textual data modeling. Additionally, the proposed model was evaluated in different contexts: (i) using external information extracted from raster data with geophysical information, including land cover, terrain elevation, among others, and (ii) using additional data from English Wikipedia articles to guide and help the model training. The obtained results show a significantly higher quality of the proposed method, in comparison to previous approaches, particularly in the setting that combines BERT embeddings with the additional training data.


Palavras-chave

Análise geográfica de texto

Resolução de topónimos em texto

Aprendizagem profunda aplicada ao Processamento de Língua Natural

Redes neuronais recorrentes

Representações contextuais de incorporação de palavras

Propriedades geofísicas

Keywords

Geographical text analysis

Toponym resolution in text

Deep learning for Natural Language Processing

Recurrent neural networks

Contextual word embedding representations

Geophysical properties

Contents

1 Introduction
1.1 Motivation
1.2 Thesis Proposal
1.3 Contributions
1.4 Structure of the Document

2 Fundamental Concepts
2.1 Introduction to Neural Networks and Deep Learning
2.1.1 Feed-forward Neural Networks
2.1.2 Optimization Algorithms to Neural Networks
2.2 Recurrent Neural Networks
2.2.1 Simple Recurrent Neural Network Architecture
2.2.2 Long Short-Term Memory Architecture
2.2.3 Gated Recurrent Unit Architecture
2.3 Text Representation Methods
2.3.1 Traditional Approaches
2.3.2 Word Embeddings
2.3.3 Contextual Word Embeddings
2.3.3.1 Embeddings from Language Models
2.3.3.2 Bidirectional Encoder Representations from Transformers
2.4 Overview

3 Related Work
3.1 Named Entity Linking
3.2 Fine-Grained Entity Classification
3.3 Toponym Resolution
3.3.1 Heuristic Methods
3.3.2 Combining Heuristics through Supervised Learning
3.3.3 Methods Combining Geodesic Grids and Language Models
3.3.4 Deep Learning Techniques
3.4 Overview

4 Toponym Resolution in Text
4.1 Toponym Resolution as Classification
4.2 Proposed Model Architecture
4.3 Additional Experiments with the Proposed Model
4.3.1 Wikipedia Instances
4.3.2 Geophysical Properties
4.4 Overview

5 Experimental Evaluation
5.1 Corpora Used in the Experiments
5.2 Experimental Methodology
5.3 The Obtained Results
5.4 Overview

6 Conclusions and Future Work
6.1 Overview on the Contributions
6.2 Future Work

Bibliography

List of Figures

2.1 Perceptron architecture.
2.2 Feed-forward neural network architecture.
2.3 Recurrent neural network architecture.
2.4 Multi-layer RNN architecture.
2.5 Bi-directional recurrent neural network.
2.6 Traditional text representation.
2.7 ELMo model.
2.8 BERT model.
4.1 HEALPix partitioning.
4.2 Proposed neural network architecture.
4.3 Using Wikipedia to create new data instances.

List of Tables

3.1 Methods used in previous toponym resolution systems.
4.1 Experiments with Wikipedia instances.
4.2 Land coverage classes.
5.1 Statistical characterization of the corpora.
5.2 Experiments with the proposed model.
5.3 Results of the additional experiments.
5.4 Locations and respective prediction distance error.
5.5 Illustrative examples.

1 Introduction

Toponym resolution concerns the disambiguation of place names and other references to places in textual documents. The disambiguation is achieved by associating each of these place references to a unique position on the Earth's surface, e.g., through the assignment of geographic coordinates.

Through the results emerging from toponym resolution, it is possible to consider several applications, such as the improvement of search engine results (e.g., by geographic indexing or classification) and document classification according to spatial criteria, which allows grouping documents into meaningful clusters and enables the mapping of textually encoded information (Monteiro et al., 2016). Another possible application is in areas such as computational social sciences or digital humanities (Wing, 2015), for instance, through supporting the automatic processing and analysis of geographic data encoded over extensive collections of textual documents. Moreover, place reference resolution can be an auxiliary component for the complete geolocation of documents (Melo and Martins, 2017), since the toponyms mentioned in the text can provide indications about the overall location of the document.

1.1 Motivation

The toponym resolution task is particularly challenging, given that place references are highly ambiguous.

Three distinct types of ambiguity should be addressed when resolving toponyms in textual documents (Monteiro et al., 2016): (1) geo/geo ambiguity arises when distinct locations share the same place name (e.g., the name Dallas can be associated with either Dallas, Texas, United States, or Dallas County, Alabama, United States); (2) geo/non-geo ambiguity corresponds to places named using common language words, i.e., when a location and a non-location share the same name. For example, the word Charlotte can refer to a person name or to the location of Charlotte County, Virginia, United States or, for instance, the word Manhattan that can refer to the location of Manhattan, New York, United States, or to the cocktail beverage; (3) reference ambiguity, which occurs when multiple names are referring to the same place (e.g., Big Apple is a common nickname used for referring to New York City, New York, United States).

Problem (2) should be covered when identifying place references in textual documents, whereas Problems (1) and (3) should be addressed when attempting to associate the recognized references to physical locations unambiguously (e.g., geospatial coordinates of latitude and longitude).

1.2 Thesis Proposal

Most of the previously developed systems for toponym resolution are based on the use of heuristics (e.g., population density), relying on external knowledge to decide which location is more likely to correspond to the reference. The place references in the text are first compared against similar entries in a gazetteer (Berman et al., 2016; Manguinhas et al., 2009), and highly populated places are often preferred, given that they are more likely to be used in textual documents (Ardanuy and Sporleder, 2017; Leidner, 2007).

Other studies employed supervised approaches that consider these types of heuristics as features in standard machine learning techniques (Freire et al., 2011; Karimzadeh et al., 2019; Lieberman and Samet, 2012; Santos et al., 2015), while later studies explore the application of language modeling approaches (DeLozier et al., 2015; Speriosu and Baldridge, 2013) and, more recently, deep learning techniques yielding state-of-the-art results (Adams and McKenzie, 2018; Gritta et al., 2018a).

Given recent developments in natural language processing, it would be interesting to build a system that integrates state-of-the-art techniques, combining contextual word embedding approaches for representing the text with a recurrent neural network architecture for modeling the text sequences, considering multiple textual inputs, including the context (i.e., without the need to resort to external knowledge bases).

1.3 Contributions

Briefly, the main contributions of this work are the following:

• The proposal of a novel model architecture for toponym resolution that incorporates deep learning techniques: the model combines pre-trained contextual word embeddings with bidirectional Long Short-Term Memory units to model the textual elements. The proposed model takes multiple textual inputs, namely the place name reference, the corresponding sentence, and the corresponding paragraph, and produces multiple outputs (i.e., a primary output of geographic coordinates and a secondary output that classifies the place reference into regions over the surface of the Earth, resorting to a geodesic grid). The result of the classification is used to improve the prediction of geographic coordinates for each place reference, through a separate layer that directly applies the Great Circle distance as a loss function (a minimal illustrative sketch of such a loss is given after this list). The obtained results exceed previously reported results over the same corpora, thus demonstrating state-of-the-art performance.

• The integration and evaluation of the proposed model with distinct pre-trained contextual word embedding approaches, namely the Embeddings from Language Models (ELMo) (Peters et al., 2018) and the Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), along with an analysis of the impact that the textual representations have on the obtained results. This analysis shows that the text representation method has a significant impact on the results and that, by using BERT contextual word embeddings, the model achieves higher performance in the toponym resolution task.

• Additional experiments with the proposed model considering different scenarios with further information: (i) using external information from geophysical properties (i.e., land coverage, terrain elevation, percentage of vegetation, and minimum distance to a water zone) extracted from external raster datasets and incorporated in the proposed model to guide the prediction of the geographic coordinates; (ii) using a larger sample, to determine the impact of the training data size on the results. The instances added to the original corpora were collected from a random sample of English Wikipedia articles, leveraging the Wikipedia link structure to infer which spans of text correspond to place references, in the sense that they link to Wikipedia pages associated with geospatial coordinates. Both experiments revealed slight improvements on the obtained results, demonstrating that the neural network model indeed benefits from the addition of information, both from the addition of geophysical information and from the addition of training instances.
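To make the first contribution more concrete, the following is a minimal sketch of a great-circle (haversine) distance loss written with TensorFlow operations; the function name, the Keras-style (y_true, y_pred) signature, and the assumption that coordinates are given as latitude/longitude pairs in degrees are illustrative, and this is not necessarily the exact implementation used in this work.

import math
import tensorflow as tf

def great_circle_loss(y_true, y_pred, radius_km=6371.0):
    # Both tensors hold (latitude, longitude) pairs in degrees, one row per place reference.
    deg2rad = math.pi / 180.0
    lat1, lon1 = y_true[:, 0] * deg2rad, y_true[:, 1] * deg2rad
    lat2, lon2 = y_pred[:, 0] * deg2rad, y_pred[:, 1] * deg2rad
    # Haversine formulation of the great-circle distance over the Earth's surface.
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = tf.sin(dlat / 2.0) ** 2 + tf.cos(lat1) * tf.cos(lat2) * tf.sin(dlon / 2.0) ** 2
    distance_km = 2.0 * radius_km * tf.asin(tf.sqrt(a))
    return tf.reduce_mean(distance_km)  # average distance error used as the loss value

Such a function can be passed as the loss for the coordinate-prediction output of a Keras model, so that training directly minimizes the average distance error in kilometers.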

Part of the work reported in this dissertation was described in the article “Using Recurrent Neural Networks for Toponym Resolution in Text” (Cardoso et al., 2019), which was submitted and accepted at the EPIA 2019 Conference on Artificial Intelligence. This article describes previous toponym resolution studies and the proposed model architecture, together with the preliminary results that allowed the comparison of the proposed model with existing toponym resolution systems. Additionally, I prepared an extended article about the research conducted during the MSc thesis, titled “Using Contextual Embeddings and External Geophysical Information for Toponym Resolution in Text”, envisioning a future conference submission.

1.4 Structure of the Document

The remainder of the document is organized as follows: Chapter 2 introduces the fundamental concepts necessary for a clearer comprehension of the rest of this work, followed by Chapter 3, which presents relevant related studies previously developed in the context of toponym resolution. Chapter 4 describes the proposed model as well as the conducted experiments, while Chapter 5 details the experimental evaluation methodology, as well as the obtained results. Finally, Chapter 6 summarizes the conclusions and presents directions for future research.

2 Fundamental Concepts

This chapter discusses fundamental concepts for a clearer understanding of the work developed in this dissertation. Section 2.1 explains fundamental concepts associated with the use of neural networks and deep learning, including the Perceptron model and some of the possible strategies that we can apply in neural network training. Then, Section 2.2 describes recurrent neural network architectures, a particular subclass of neural networks that supports the representation of sequential structures, which is useful in Natural Language Processing to model sequences of text elements. Section 2.3 discusses approaches to represent textual elements, a critical issue given that this choice must be adequate to build an effective system. Finally, Section 2.4 provides an overview of the topics that were discussed and presented in this chapter.

2.1 Introduction to Neural Networks and Deep Learning

Neural networks refer to a class of learning techniques whose computation was inspired by the human brain. Machine learning approaches are defined as learning from past observations in order to perform predictions, while deep learning techniques learn to perform predictions and, beyond that, learn the most suitable representation of the data for such predictions. Deep learning approaches operate by feeding the data into a network, producing successive transformations of the input data until a final transformation predicts the output. The transformations produced by the network are learned from the given input-output mappings, such that each transformation helps to relate the input data with the desired output (Goldberg, 2017). Goldberg developed a survey dedicated to neural networks and the distinct categories that exist, in which a more detailed explanation about neural networks applied to Natural Language Processing can be found (Goldberg, 2017). In the next sections, I will briefly explain the feed-forward architecture, the most basic architecture of neural networks (Section 2.1.1), and some optimization algorithms that can be applied to neural networks (Section 2.1.2).

Figure 2.1: Perceptron architecture.

2.1.1 Feed-forward Neural Networks

Basically, a neural network takes one or multiple inputs, processes them, and produces one or multiple outputs. A neural network is composed of a set of connected neurons, the fundamental computational units, each of which receives one or more inputs and produces one or more outputs. Each input xn has an associated weight wn; the neuron multiplies each input by its weight, sums the results, and applies a nonlinear function f (i.e., the activation function), producing the output y, as represented in Figure 2.1. The Perceptron is the most straightforward neural network architecture: it receives multiple input values that are supplied directly to the output layer, where a transformation that weights the inputs and sums them together with a bias term is applied. The Perceptron is defined mathematically by Equation 2.1, where W is the weight matrix and b is the bias term (Figure 2.1).

NNPerceptron(x) = xW + b (2.1)

A network is composed of multiple neurons, with connections between them, grouped into several layers. The output of a neuron may be the input of other neurons in the next layer. There are three types of layers in the network, namely the input layer, the hidden layers, and the output layer. As the names suggest, the inputs are in the input layer and the outputs are in the output layer, while the middle layers are called hidden layers, where the neurons perform a series of mathematical computations. Figure 2.2 represents a simple architecture of a feed-forward neural network with two hidden layers. If there are multiple hidden layers, the network is said to be a deep neural network. As mentioned before, each input has an associated weight and, since the output of a layer can be the input of the next layer, these connections also have an associated weight. The role of the neuron is to take the values of its inputs, multiply them by the weights of their connections, sum them, and apply a non-linear function to the obtained result, passing it to its output. As the name suggests, the information only moves in one direction (forward); therefore, there are no cycles in this network.

Figure 2.2: Feed-forward neural network architecture with two hidden layers.

The Multi-Layer Perceptron considers the existence of hidden layers that use an activation function to introduce non-linearity into the network model. Introducing a nonlinear hidden layer results in the Multi-Layer Perceptron with one hidden layer, defined in Equation 2.2, where f is a nonlinear function (i.e., the activation function) applied to the elements of the first linear transformation, W^1 and b^1 are the weight matrix and the bias term for the first linear transformation, and W^2 and b^2 are the weight matrix and the bias term for the second linear transformation.

NN_MLP1(x) = f(xW^1 + b^1)W^2 + b^2 (2.2)

Similarly, we can define a Multi-Layer Perceptron with two hidden layers (represented in Figure 2.2) as follows, and so on:

NN_MLP2(x) = y (2.3a)
h^1 = f^1(xW^1 + b^1) (2.3b)
h^2 = f^2(h^1W^2 + b^2) (2.3c)
y = h^2W^3 (2.3d)

There are several activation functions for introducing non-linearity into the network model, mapping the resulting value into a defined domain range, usually [0, 1] or [−1, 1], depending on the function used. Among the most common are the following activation functions:

• Sigmoid - The sigmoid activation function is an S-shaped function that transforms each value x into the range [0, 1], Equation 2.4.

σ(x) = 1 / (1 + e^{−x}) (2.4)

• Hyperbolic tangent (tanh) - The hyperbolic tangent activation function is an S-shaped function that transforms each value x into the range [−1, 1], Equation 2.5.

tanh(x) = (e^{2x} − 1) / (e^{2x} + 1) (2.5)

• Rectifier (ReLU) - The rectifier activation function, known as the rectified linear unit, is a simple activation function that has been shown to produce very good results. ReLU simply removes negative elements, substituting them by the value 0, Equation 2.6.

ReLU(x) = max(0, x) = 0 if x < 0, and x otherwise (2.6)
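As an illustration of the definitions above, the following is a minimal NumPy sketch of a forward pass through a Multi-Layer Perceptron with two hidden layers (Equation 2.3), using the activation functions just listed; the layer sizes and the random weights are arbitrary and chosen only for the example.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def mlp2_forward(x, W1, b1, W2, b2, W3):
    # NN_MLP2(x): two nonlinear hidden layers followed by a linear output layer.
    h1 = np.tanh(x @ W1 + b1)   # first hidden layer, tanh activation
    h2 = relu(h1 @ W2 + b2)     # second hidden layer, ReLU activation
    return h2 @ W3              # linear output layer

# Example with illustrative dimensions: 4 inputs, hidden layers of 8 and 6 units, 2 outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))
y = mlp2_forward(x, rng.normal(size=(4, 8)), np.zeros(8),
                 rng.normal(size=(8, 6)), np.zeros(6), rng.normal(size=(6, 2)))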

8 2.1.2 Optimization Algorithms to Neural Networks

The loss function, or cost function, maps the cost associated with a particular event when training neural networks. As the goal is to improve the performance at each time step, lower loss values are preferred. The loss can be an arbitrary function mapping two vectors to a scalar. Below, I present the most common loss functions.

• Log loss: L_log(ŷ, y) = log(1 + exp(−(ŷ[t] − ŷ[k])))

• Binary cross-entropy loss: L_logistic(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)

• Categorical cross-entropy loss: L_cross-entropy(ŷ, y) = −Σ_i y[i] log(ŷ[i])

As the objective is to minimize the loss function as much as possible, we perform improvements upon the weights to obtain better results than the previous ones. Therefore, gradient descent algorithms are usually employed. In the following, I present some of their variants (a more detailed explanation can be found in the studies of Goldberg (2017) and Ruder (2016)).

The batch gradient descent (Equation 2.7) computes the gradient of the cost function for the entire training dataset (i.e., it sums over all examples on each iteration when performing the updates to the parameters). So, for each update, the gradient is summed over all examples. This variant guarantees convergence, but the need to calculate the gradient for the whole dataset to perform one update makes it computationally slow.

θ = θ − η · ∇θJ(θ) (2.7)

The stochastic gradient descent, also known as SGD (Equation 2.8), performs a parameter update for each training example. It is faster than batch gradient descent, and the SGD's fluctuation enables jumps to new, potentially better local minima, although convergence to the exact minimum becomes challenging.

θ = θ − η · ∇_θ J(θ; x^{(i)}; y^{(i)}) (2.8)

The mini-batch gradient descent (Equation 2.9), in turn, performs an update for every mini-batch of n training examples. In this way, it reduces the variance of the parameter updates, which leads to more stable convergence.

θ = θ − η · ∇_θ J(θ; x^{(i:i+n)}; y^{(i:i+n)}) (2.9)
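The update rules in Equations 2.7-2.9 differ only in how many examples contribute to each gradient estimate. A minimal NumPy sketch of the mini-batch variant, applied to a linear model with squared error (an illustrative choice of model and loss, not tied to this work), could look as follows.

import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=32, epochs=10):
    # Linear model y_hat = X @ theta, trained with mean squared error.
    theta = np.zeros(X.shape[1])
    n = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(n)            # shuffle the training examples
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)  # gradient on the mini-batch
            theta -= lr * grad                      # theta = theta - eta * gradient (Equation 2.9)
    return theta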

Although there are further challenges and limitations imposed by the variants described above, more variants of the gradient descent algorithm for optimization can be found in Ruder (2016). Some of the solutions include Momentum, Nesterov Accelerated Gradient, Adagrad, RMSprop, Adam, and Nadam, with the conclusion that adaptive learning-rate methods (Adagrad, Adadelta, RMSprop, and Adam) are the most suitable and provide better convergence (Ruder, 2016).

However, there are additional strategies for optimizing the stochastic gradient descent, among them:

• Adding noise that follows a Gaussian distribution N(0, σ_t²) to each gradient update, Equation 2.10.

g_{t,i} = g_{t,i} + N(0, σ_t²), where σ_t² = η / (1 + t)^γ (2.10)

Adding this noise makes the network more robust to poor initialization and helps training particularly deep and complex networks. The added noise gives the model more chances to escape local minima and find new ones (local minima are more frequent for deeper models).

• Using batch normalization, which facilitates learning; this strategy consists of normalizing the initial values of the parameters (i.e., initializing them with zero mean and unit variance) and, as the training proceeds, updating the parameters to different extents.

• Using early stopping, through the monitoring of the error on a validation set during training, stopping if the validation error does not improve enough (see the sketch after this list).

• Choosing an adequate learning rate, which determines the size of the steps taken until a local minimum is reached. Too large learning rates will prevent the network from converging on an effective solution, while too small learning rates will take a very long time to converge.
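A minimal sketch of the early stopping strategy mentioned above is shown next; the patience parameter and the callable arguments are illustrative conventions, not taken from a specific library or from this work.

def train_with_early_stopping(train_step, validate, max_epochs=100, patience=5):
    # train_step() performs one epoch of training; validate() returns the validation error.
    best_error, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()
        error = validate()
        if error < best_error:
            best_error, epochs_without_improvement = error, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop: the validation error has not improved for `patience` epochs
    return best_error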

Figure 2.3: Recurrent neural network architecture, together with the corresponding unrolled representation.

2.2 Recurrent Neural Networks

A recurrent neural network (RNN) architecture allows modeling over sequential structures, which is extremely useful in natural language processing (NLP), since there are sequences in any text, whether of characters, words, or phrases. An RNN enables the representation of input structures with arbitrary lengths, transforming them into a fixed-size vector, while maintaining the structural properties of the input sequence. In this architecture, the connections between neurons form a graph directed over a sequence; in other words, an RNN is recursively defined by means of a function R that receives as input a state vector sj−1 (i.e., corresponding to the previous state), along with the input vector of the current state xj, and returns a new state vector sj. The state vector sj is mapped by a function O into a vector that corresponds to the output vector of the current state, yj. Therefore, this structure considers the history of all previous states (x1, x2, ..., xj) (Goldberg, 2017). Figure 2.3 shows a graphical representation of a recurrent neural network, in its recursive and unrolled representations, respectively.

RNN*(x1:n; s0) = y1:n (2.11a)

yi = O(si) (2.11b)

si = R(si−1, xi) (2.11c)

Summing up, the loop of the network allows information to pass from one state of the network to the next. The network accepts an input vector x and produces an output vector y. Not only the input just fed, but also the entire history of past inputs, influences the contents of the output vector (Goldberg, 2017), as can be seen in Figure 2.3.

Figure 2.4: Multi-layer RNN architecture, with 3 layers.

To train a recurrent neural network, it is necessary to add a loss node to the unrolled graph and use the backpropagation algorithm to calculate the gradient with respect to the loss, which is needed to update the weights of the network. As seen in Figure 2.3, the same parameters θ are shared between time steps (Goldberg, 2017).

s4 = R(s3, x4) (2.12a)
   = R(R(s2, x3), x4) (2.12b)
   = R(R(R(s1, x2), x3), x4) (2.12c)
   = R(R(R(R(s0, x1), x2), x3), x4) (2.12d)

Recurrent neural networks can be stacked in layers, just like feed-forward neural networks, forming the so-called deep recurrent neural networks. Deep RNNs increase the model capacity and, for some tasks, achieve better performance, such as in machine translation. In Figure 2.4, we can see a multi-layer recurrent neural network with 3 stacked layers.

Similarly, a bi-directional RNN follows the same structure mentioned above, connecting two hidden layers of opposite directions to the same output (i.e., one that reads the sequence from left to right and another that reads it from right to left). Thus, the output layer can get information from past (backward) and future (forward) states simultaneously, as exemplified in Figure 2.5 over the sentence “the brown fox jumped”, where the sequence of words is read from left to right and from right to left, considering the backward and forward states.

Figure 2.5: A bidirectional RNN over the phrase “the brown fox jumped”.

Usually, during the training of an RNN, the gradient begins to vanish, preventing the network weights from changing and adjusting as necessary during training, which emerged as a major obstacle to the network's performance. This problem is known as the vanishing gradients problem (Hochreiter, 1998), where the gradient decreases exponentially and consequently generates a decay of information over time. To address this problem, gating-based approaches were developed, a technique that helps the neural network decide when to forget the current input and when to remember it for future time steps.

2.2.1 Simple Recurrent Neural Network Architecture

The simplest formulation of a recurrent neural network is the Simple RNN, also known as the Elman network, mathematically defined in Equation 2.13.

si = R_SRNN(xi, si−1) = g(si−1W^s + xiW^x + b) (2.13a)
yi = O_SRNN(si) = si (2.13b)

The state si−1 and the input xi are each linearly transformed; the results are added (together with a bias term) and then passed through a nonlinear activation function g. The output at position i is the same as the hidden state in that position (Goldberg, 2017).
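A minimal NumPy sketch of the Simple RNN update in Equation 2.13, using tanh as the nonlinearity g (an illustrative choice), is shown below.

import numpy as np

def simple_rnn(x_sequence, W_s, W_x, b, s0):
    # s_i = g(s_{i-1} W^s + x_i W^x + b); the output at each position equals the state.
    states, s = [], s0
    for x in x_sequence:
        s = np.tanh(s @ W_s + x @ W_x + b)
        states.append(s)
    return states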

13 2.2.2 Long Short-Term Memory Architecture

The Long Short-Term Memory (LSTM) is a concrete RNN architecture, designed to address the vanishing gradients problem by using a gating mechanism. At each input state, a gate is used which is responsible for deciding how much of the new input should be written to the memory cell and how much of the current memory cell content should be forgotten (Goldberg, 2017). Equation 2.14 defines the LSTM architecture mathematically, where ⊙ denotes element-wise multiplication.

sj = R_LSTM(sj−1, xj) = [cj; hj] (2.14a)
where, cj = f ⊙ cj−1 + i ⊙ z (2.14b)
hj = o ⊙ tanh(cj) (2.14c)
i = σ(xjW^{xi} + hj−1W^{hi}) (2.14d)
f = σ(xjW^{xf} + hj−1W^{hf}) (2.14e)
o = σ(xjW^{xo} + hj−1W^{ho}) (2.14f)
z = tanh(xjW^{xz} + hj−1W^{hz}) (2.14g)
yj = O_LSTM(sj) = hj (2.14h)

The state at time j, sj, is composed of two vectors, cj and hj, which correspond, respectively, to the memory component and the hidden state component (2.14a). There are three gating components, responsible for controlling the input, forget, and output gates, i.e., i, f, and o, respectively (2.14d; 2.14e; 2.14f). The gate values are obtained from linear combinations of the current input xj and the previous state hj−1, to which a sigmoid activation function is applied. An update candidate, z, is obtained by a linear combination of xj and hj−1 passed through a tangent activation function (2.14g). Afterwards, the memory component cj is updated considering the forget gate, which controls how much of the previous memory should be kept, and the input gate, which controls the amount of the proposed update that will be preserved (2.14b). Finally, the value of hj, which corresponds to the output yj, is computed from the memory content cj, passed through a nonlinear tangent function and controlled by the output gate (2.14c; 2.14h) (Goldberg, 2017).
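A minimal NumPy sketch of one LSTM step following Equation 2.14 is shown below; passing the weight matrices in a dictionary, and omitting bias terms as in the equations above, are choices made purely for readability.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_j, h_prev, c_prev, W):
    # Gates and update candidate, as in Equations 2.14d-2.14g.
    i = sigmoid(x_j @ W["xi"] + h_prev @ W["hi"])   # input gate
    f = sigmoid(x_j @ W["xf"] + h_prev @ W["hf"])   # forget gate
    o = sigmoid(x_j @ W["xo"] + h_prev @ W["ho"])   # output gate
    z = np.tanh(x_j @ W["xz"] + h_prev @ W["hz"])   # update candidate
    c_j = f * c_prev + i * z                        # memory component (Equation 2.14b)
    h_j = o * np.tanh(c_j)                          # hidden state / output (Equation 2.14c)
    return h_j, c_j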

14 2.2.3 Gated Recurrent Unit Architecture

Although the LSTM architecture is effective, it is quite complex and therefore computationally heavy. For this reason, another architecture emerged: the Gated Recurrent Unit (GRU). The GRU architecture, like the LSTM, is also based on a gating mechanism, the difference being that the GRU has fewer gates and no separate memory component. The GRU architecture is mathematically defined as follows:

sj = R_GRU(sj−1, xj) = (1 − z) ⊙ sj−1 + z ⊙ s̃j (2.15a)
where, z = σ(xjW^{xz} + sj−1W^{sz}) (2.15b)
r = σ(xjW^{xr} + sj−1W^{sr}) (2.15c)
s̃j = tanh(xjW^{xs} + (r ⊙ sj−1)W^{sg}) (2.15d)
yj = O_GRU(sj) = sj (2.15e)

One gate (r) is used to control access to the previous state sj−1 and compute a proposed update s̃j. The updated state sj (which also serves as the output yj) is then determined based on an interpolation of the previous state sj−1 and the proposal s̃j, where the proportions of the interpolation are controlled using the gate z (Goldberg, 2017).
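Analogously, a minimal NumPy sketch of one GRU step following Equation 2.15 could be the following (same conventions as the LSTM sketch above: weights in a dictionary, biases omitted).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_j, s_prev, W):
    z = sigmoid(x_j @ W["xz"] + s_prev @ W["sz"])              # update gate (2.15b)
    r = sigmoid(x_j @ W["xr"] + s_prev @ W["sr"])              # reset gate (2.15c)
    s_tilde = np.tanh(x_j @ W["xs"] + (r * s_prev) @ W["sg"])  # proposed update (2.15d)
    return (1.0 - z) * s_prev + z * s_tilde                    # new state / output (2.15a)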

2.3 Text Representation Methods

Text representation is an essential aspect when dealing with textual analysis tasks, and it is important to choose an effective data representation. In the following sections, I present multiple approaches, including traditional approaches, as well as recent approaches that have been shown to improve the performance of the systems.

15 Figure 2.6: Traditional text representation.

2.3.1 Traditional Approaches

We can represent the text of documents in several ways, in order to support computational analysis. The simplest way corresponds perhaps to the bag-of-words model, which is based on individual term frequency, i.e., the number of occurrences of each distinct term in a document. When this model is used, the order of appearance of each term within a document is lost. There are some limitations associated with this model, such as the high dimensionality of the representation and the loss of correlation between synonyms. Another approach is the one-hot representation, which consists in modeling term appearance through a binary representation (i.e., 1 if the term appears in the document, 0 otherwise), as represented in Figure 2.6. These two models generate vector representations with a fixed size (the size of the vocabulary, |V|), and thus the size of the input vector scales with the size of the vocabulary. This means that the representation is sparse, it has high dimensionality, and the notion of similarity between words does not exist. Besides being considered sparse representations (i.e., due to their high dimensionality), the approaches mentioned before also raise the "Out-of-Vocabulary" (OOV) problem, which happens when the system does not know how to handle unseen words in the test set.
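A minimal sketch of these two traditional representations over a toy vocabulary (illustrative only) is shown below; both produce vectors whose size equals the vocabulary size |V|.

import numpy as np

def bag_of_words(tokens, vocabulary):
    # Term-frequency vector: counts of each vocabulary term in the document.
    vector = np.zeros(len(vocabulary))
    for token in tokens:
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector

def one_hot(tokens, vocabulary):
    # Binary vector: 1 if the term appears in the document, 0 otherwise.
    return (bag_of_words(tokens, vocabulary) > 0).astype(float)

vocabulary = {"lisbon": 0, "porto": 1, "is": 2, "a": 3, "city": 4}
print(bag_of_words("lisbon is a city a city".split(), vocabulary))
print(one_hot("lisbon is a city a city".split(), vocabulary))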

2.3.2 Word Embeddings

Recently, novel approaches emerged to represent textual elements while capturing linguistic information, namely word embeddings. This method for representing text consists in learning to map a set of words into real-number vectors (i.e., from a corpus, a vector space is produced, where each unique word is assigned a corresponding vector in that space). The word embedding representations capture similarity between words; hence, word vectors are positioned in the vector space in a meaningful way, so that the distance between words is related to their semantic similarity (Goldberg, 2017).

We can obtain the word embedding vectors by applying unsupervised learning algorithms, such as Word2Vec or GloVe¹ (Global Vectors for Word Representation). These algorithms were inspired by neural network models and are based on stochastic gradient training.

For now, we will use the following representation: w for representing a word, c corresponds to the context words, D corresponds to the set of correct word-context pairs, and D¯ the set of incorrect words-context pairs.

The goal of the algorithm is to train the network to differentiate good word-context pairs from bad word-context pairs. For that, the algorithm estimates the probability P(D = 1|w, c) (i.e., the probability that (w, c) is a correct word-context pair), Equation 2.16.

P(D = 1|w, c) = 1 / (1 + e^{−s(w,c)}) (2.16)

The objective of the algorithm is to maximize the log-likelihood of the data, Equation 2.17.

L(Θ; D; D̄) = Σ_{(w,c)∈D} log P(D = 1|w, c) + Σ_{(w,c)∈D̄} log P(D = 0|w, c) (2.17)

There are two different approaches to the Word2Vec algorithm, the Continuous Bag of Words (CBOW), and the Skip-gram approach:

• CBOW: In this architecture, the model predicts the current word from a context window (i.e., the surrounding context words; this window can have a size of one word or more), and the order of the context words does not influence the outcome. To consider the context window, the algorithm adapts its computation to the range of context words from c1 to ck, as shown in Equation 2.18 (Goldberg, 2017).

P(D = 1|w, c1:k) = 1 / (1 + e^{−(w·c1 + w·c2 + ... + w·ck)}) (2.18)

¹ https://nlp.stanford.edu/projects/glove

• Skip-gram: The skip-gram architecture weights the context words within a window size, following the principle that words weigh more if they are in a nearby context than words that are more distant. This variant assumes that the elements ci in the context are independent from each other, treating them as k different contexts, Equation 2.19, resulting in a single embedding vector for each context (Goldberg, 2017).

P(D = 1|w, ci) = 1 / (1 + e^{−w·ci})
P(D = 1|w, c1:k) = Π_{i=1}^{k} P(D = 1|w, ci) = Π_{i=1}^{k} 1 / (1 + e^{−w·ci})
log P(D = 1|w, c1:k) = Σ_{i=1}^{k} log ( 1 / (1 + e^{−w·ci}) ) (2.19)

The GloVe algorithm bases the word embeddings on the structure of the corpus: the model trains on global co-occurrence counts of words, constructing a word-context matrix and training the word and context vectors, w and c respectively, so as to satisfy Equation 2.20, where b[w] and b[c] are the trained biases (for the word and the context). The model suggests representing each word as the sum of the corresponding word and context embedding vectors (Goldberg, 2017).

w · c + b[w] + b[c] = log#(w, c) ∀(w, c) ∈ D (2.20)

Summing up, the two main benefits of the use of word embeddings are the reduced dimensionality, which has an impact at the computational level, and the capacity of capturing similarity between words. Word embeddings provide better vector features for most NLP problems and achieve superior performance in deep learning methods, including neural network models (Goldberg, 2017).
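To make the word-context scoring above concrete, the following NumPy sketch computes P(D = 1|w, c) as the sigmoid of the dot product between a word vector and a context vector (Equation 2.19); the embedding dimensionality and the random initialization are illustrative, and real implementations learn the matrices W and C by gradient-based optimization of Equation 2.17.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_probability(word_id, context_id, W, C):
    # P(D = 1 | w, c) = sigmoid(w . c), with w and c the word and context vectors.
    return sigmoid(W[word_id] @ C[context_id])

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 50
W = rng.normal(scale=0.1, size=(vocab_size, dim))   # word embedding matrix
C = rng.normal(scale=0.1, size=(vocab_size, dim))   # context embedding matrix
print(pair_probability(3, 17, W, C))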

2.3.3 Contextual Word Embeddings

The contextual word embedding representations, beyond capturing similarities between words, are also able to perceive the semantic meaning of words, since they consider the surrounding context words, which enables the capacity to handle the polysemic properties of words. For example, if the word wood occurs in a text, the representation generated is distinct depending on the context, i.e., whether wood is referring to the material made from trees or to a geographical area with many trees. Although there are several models for generating contextual word embeddings, the next sections describe only two of them, namely the Embeddings from Language Models (ELMo) (Peters et al., 2018) and the Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019).

Figure 2.7: ELMo model.

2.3.3.1 Embeddings from Language Models

Peters et al. (2018) presented the ELMo model for generating pre-trained contextual word embeddings. As mentioned before, this model considers the context words when creating the embedding representations, handling the semantic meaning, syntactic use, and polysemy of words (Peters et al., 2018). To contextualize the word representations, the ELMo model examines the entire sentence before assigning the word embedding representation. ELMo is based on a neural language model (i.e., a model of the probability distribution over word sequences); these generative models are used to predict the most likely next word for a given sequence of words. The language model that ELMo uses relies on a multi-layer bi-directional LSTM (previously described in Section 2.2.2). Therefore, when generating the word embedding representation, ELMo considers both the following and the previous words. To generate a contextualized embedding representation for each word, ELMo extracts the hidden state of each layer, concatenates them, and applies a weighted sum operation, as represented in Figure 2.7.
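A minimal sketch of an ELMo-style combination step, as a softmax-normalized weighted sum of per-layer hidden states scaled by a task-specific factor γ, is shown below; the layer representations are assumed to be already computed by the bi-directional language model, and in practice the weights s and γ are learned together with the downstream task.

import numpy as np

def elmo_combine(layer_states, s, gamma):
    # layer_states: array of shape (num_layers, sequence_length, hidden_size).
    weights = np.exp(s) / np.exp(s).sum()                    # softmax-normalized layer weights
    combined = np.tensordot(weights, layer_states, axes=1)   # weighted sum over layers
    return gamma * combined

layers = np.random.randn(3, 7, 1024)   # e.g., 3 layers, 7 tokens, 1024-dimensional states
embeddings = elmo_combine(layers, s=np.zeros(3), gamma=1.0)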

19 Figure 2.8: BERT model.

2.3.3.2 Bidirectional Encoder Representations from Transformers

Another model for generating contextual word embeddings is the BERT model, proposed by Devlin et al. (2019) and inspired by previous studies, including ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018), the OpenAI transformer (Radford et al., 2018), and the Transformer (Vaswani et al., 2017). The architecture of the BERT model is based on a pre-trained transformer encoder stack. A transformer encoder component is composed of two sub-layers, a self-attention layer and a feed-forward neural network. Thus, the model receives as input a set of words, and each encoder layer applies a self-attention mechanism (i.e., enabling the encoder to consider other words present in the input sequence while encoding a particular word), passes the result through a feed-forward neural network, and passes the output to the next encoder layer, and so on. The BERT model is based on transformer encoders, whose language model considers both forward and backward words. The authors decided to adopt a "masked language model" to train the model, i.e., randomly applying masks to 15% of the input tokens and using the output at the position of the masked word to predict which word was masked. Besides, occasionally, words are randomly replaced by another word, and the model is asked to predict the correct word in that position. Similar to ELMo, the pre-trained BERT model can be used to generate contextual word embeddings (i.e., the word embedding representations can consist of one of the vectors or a combination of multiple vectors generated by the encoder representations).
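In practice, pre-trained BERT models can be loaded from public libraries to obtain such contextual embeddings. The sketch below uses the Hugging Face transformers package with the bert-base-uncased checkpoint, which is an illustrative choice of library, model, and API version, and not necessarily the setup used in this work.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentence = "She traveled from Paris to Dallas last week."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub)word token of the input sentence.
token_embeddings = outputs.last_hidden_state[0]
print(token_embeddings.shape)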

20 2.4 Overview

This chapter presented the fundamental concepts related to the work developed in this dissertation. Section 2.1 explains fundamental concepts associated with the use of neural networks and deep learning, including the Perceptron model, feed-forward neural networks, and some of the possible strategies that we can apply in neural network training. Then, Section 2.2 describes recurrent neural network architectures, a category of neural networks that supports the representation of sequential structures, which is useful in natural language processing to model sequences of text elements. Finally, Section 2.3 discusses approaches to represent textual elements, including contextual word embedding methods.

3 Related Work

This chapter presents relevant related work covering the most critical areas. Section 3.1 presents related work on named entity linking, a field dedicated to the disambiguation of entities mentioned in text. Then, Section 3.2 presents related work on fine-grained entity classification, concerning the classification of entities mentioned in text into very similar categories (i.e., different categories, but sharing part of their structure), and Section 3.3 provides details about the different methods applied to toponym resolution. Finally, Section 3.4 provides an overview of the related work.

3.1 Named Entity Linking

The named entity linking task is dedicated to the disambiguation of recognized entities present in textual elements, identifying which real-world entity each mention in the text refers to. This task is strongly related to the theme addressed in this dissertation, given that both deal with the disambiguation of entities in text; the theme of this dissertation focuses on a more specific category of named entities, the locations (i.e., toponym mentions).

Gupta et al. (2017) present a novel neural network system for entity linking that learns a dense representation for entities considering multiple sources of information, including unstructured textual descriptions retrieved from external knowledge bases (i.e., encyclopedic entity descriptions), the context around the mention (i.e., the local and global context for the mention), and, finally, the structured knowledge of the mention's fine-grained types. This enables the model to capture distinct aspects concerning the semantic sense and the surrounding information at various levels of granularity. The model applies knowledge base information to create an embedding for entity mentions, such as using existing mentions in Wikipedia to encode their context, using textual descriptions to encode background information, and, finally, retrieving the fine-grained types from Freebase. The procedure to link a mention to the correct entity is divided into a two-step process. The first step consists in gathering a set of candidate entities, obtained by computing their prior probabilities (i.e., using a pre-computed dictionary). In the second step, the mention context encoder is used to estimate the semantic similarity between each mention and the vector representations of each candidate, and the combination of the results of both steps is used for making the linking decisions (Gupta et al., 2017). The proposed model uses compositional training to ensure that the representation will capture all the information provided about the entity. Unlike models that use domain-specific data for training, one advantage of this model is that it only uses indirect supervision that resorts to Wikipedia or Freebase, which makes the model domain-independent. In this way, not only does the model correctly leverage the available information, but it is also robust when dealing with missing information (Gupta et al., 2017). Another advantage is that the linking system is flexible to changes in the knowledge bases (e.g., entities added to the knowledge base): the changes are easily incorporated, and the model performs well even for new entities without the need to re-train the representations (i.e., by only making use of the available information, more specifically descriptions and types).

The aim of the work developed by Cao et al. (2018) is to take advantage of the power of neural networks applied to collective entity linking. Until then, neural network methods for entity linking had only been applied to the construction of word or entity embeddings from feature extraction (Gupta et al., 2017), thus not fully exploiting the capacity of neural network methods. This work proposes a new neural network model for collective entity linking, called NCEL (Neural Collective Entity Linking), which combines deep neural networks with a graph convolutional network to integrate local context features together with global coherence for entity linking (Cao et al., 2018). It also introduces an attention scheme, with the objective of improving the robustness of the NCEL system when modeling local context information, by selecting words of greater importance and excluding data noise, and of training the model from Wikipedia hyperlinks while preventing overfitting. The NCEL framework is subdivided into three components. There is a candidate generation phase, in which a pre-computed dictionary is used to generate the candidate entity set to be disambiguated for each mention. It is followed by a feature extraction phase, based on the document and on the entity graph of the document, including local and global features. Finally, the neural model receives as input the feature vectors and the sub-graphs of the candidates, represents the nodes in the graph with the received features, and applies multiple graph convolutions; after that step, the correct candidates have a strongly connected topology, contrary to incorrect candidates, which present a weakly connected topology due to their sparse relations. In the end, the results presented are the probabilities reflecting the likelihood of a certain candidate corresponding to the entity mention (Cao et al., 2018).

Yang et al. (2018) present a learning model for the joint disambiguation of named entities in a document using a gradient tree boosting algorithm. The system considers global features on past disambiguation decisions and jointly models them with local features, achieving global optimization in the entity linking task. As exact inference is computationally expensive, the authors proposed Bidirectional Beam Search with Gold path (BiBSG), a variant of the standard beam search algorithm, with the difference that BiBSG is an approximate inference algorithm. To improve the performance of the local search, BiBSG considers global information, i.e., past and future information (Yang et al., 2018). Most entity disambiguation systems consider two types of components, local components and global components. The joint resolution of entities is carried out as a way of maximizing the overall coherence between entities. Through statistical features of entity-linked corpora and similarity features, for example the cosine, it is possible to capture the similarities between a mention and a candidate entity, as well as the relation between entities that co-occur within the same document. To capture non-linear relationships, it is preferable to resort to machine learning models (e.g., neural networks or gradient tree boosting). The authors developed a model based on structured gradient tree boosting (SGTB), with the difference that it is a globally normalized model that uses conditional random fields; this modified version allows the use of global features defined between the current entity candidate and the entire decision history of the previous entity assignments, allowing global optimization. Globally normalized models tend to be more expressive than locally normalized ones; however, it is difficult to calculate the normalization term for training and inference. To address this issue, the authors adopted beam search, in which they track multiple hypotheses and sum over the paths in the beam. BiBSG was designed to train the SGTB model efficiently and effectively, since it reduces the model's variance and can consider both past and future information when predicting an output. Regarding the obtained results, BiBSG performs competitively with standard beam search SGTB models. Since the test data was gathered from multiple domain sources, the model learns more abstract representations (Yang et al., 2018).

25 3.2 Fine-Grained Entity Classification

Fine-grained entity classification aims to label entity mentions considering their context, not only associating them with the respective semantic types, but also inferring a more specific class to which the entity belongs (e.g., when a mention refers to the class "Person", the goal of fine-grained entity classification is to attribute a more specialized class, such as "Actor" or "Writer", for instance).

The work of Shimaoka et al. (2017) presents a study of several variants of neural network architectures applied to fine-grained entity type classification. The developed model directly combines "hand-crafted" features with learned features. It also includes an attention mechanism that learns to attend to the syntactic heads and to the tokens prior to and after a mention, both of which play an important role in the successful classification of a mention. The authors introduce parameter sharing between labels through a hierarchical encoding method. Considering the probability of a mention belonging to a certain class type, obtained using logistic regression, they construct the model's variants, which differ in the way the input for the logistic regression model is computed. At the inference stage, it is assumed that each mention is assigned at least one type, where the first assignment is based on the most likely type. Then, further types are added if their probability exceeds a threshold of 0.5, defined from the use of the development data (Shimaoka et al., 2017). The results demonstrate that the training data used has an impact on the performance of the model and prove that the attention mechanism detects the expressions considered more relevant for the classification of fine-grained types (Shimaoka et al., 2017).

To address the task of fine-grained classification, Yaghoobzadeh et al. (2018) propose FIGMENT, an embedding-based method that considers a global model of the information of the entity and a context model that, in a first stage, considers the individual occurrences of an entity. The global model starts by learning the distributed representation of an entity using a multi-layer perceptron. It is important that the learned representations are of high quality and, for that reason, the authors presented different representations using entity-level, word-level, and character-level information. Each level brings complementary information, which positively affects the performance of the system. The scarcity of context-labeled entities forces the use of distant supervision. As distantly supervised labels tend to be noisy, the authors developed and applied new algorithms for noise mitigation using multi-instance learning (Yaghoobzadeh et al., 2018).

Xin et al. (2018) present the Knowledge-Attention Neural Fine-Grained Entity Typing framework, or KNET, an attention mechanism that leverages information from knowledge bases and jointly takes the text and the knowledge base into consideration. The main goal of the developed model is to predict the probability of each type for a certain entity mention. The KNET framework is divided into two parts, a sentence encoder and a type predictor. The sentence encoder is used to transform word vectors into representations for entity mentions and context. The feature vector is composed of the concatenation of the entity mention representation and its context. The type prediction vector is computed from the sentence vector through a multi-layer perceptron; each entry of the predicted type vector indicates the probability of each type for a given mention. The attention mechanism used in KNET considers semantic attention (i.e., the context representation), mention attention (i.e., the entity mentions, which is expected to capture semantic correlations between entities and context), and, lastly, knowledge attention, learned from external knowledge bases and expected to capture semantic correlations of entity-context and entity-knowledge base (Xin et al., 2018).

Previously existing methods that rely on distant supervision are more susceptible to noisy labels (i.e., labels that are out-of-context or not appropriate because they are overly specific). The focus of the work developed by Xu and Barbosa (2018) is to deal with this issue using neural network models, with a variant of the cross-entropy loss function to deal with out-of-context labels and with hierarchical loss normalization to deal with labels that are too specific. The proposed approach makes use of word embeddings to train a single-label model that jointly learns representations for entity mentions and their context. The authors learned two different entity representations and used bidirectional Long Short-Term Memory (LSTM) units with the purpose of learning the context representations. This work introduces hierarchical loss normalization to adjust the penalties for correlated types, thus allowing the model to understand the type hierarchy and to alleviate the negative weight of overly specific labels. To simplify the problem, the authors turned it into a single-label classification problem, assuming that each mention can only have one type-path depending on the context. The proposed approach is robust against noise and does not require post-processing or ad-hoc features. As a more detailed note on hierarchical loss functions: due to the type hierarchy that naturally exists in the fine-grained entity classification task, the loss function adapts and adjusts the penalties for errors depending on the distance between the errors and the correct hierarchy (Xu and Barbosa, 2018).

3.3 Toponym Resolution

The following sections present relevant studies previously developed using different techniques, namely approaches based on heuristics (Section 3.3.1), approaches that combine heuristics with supervised learning (Section 3.3.2), methods that use both geodesic grids and language models (Section 3.3.3), and finally methods based on deep learning techniques (Section 3.3.4).

3.3.1 Heuristic Methods

Most of the previously developed toponym resolution systems rely on the use of heuristics and typically resort to external knowledge sources such as gazetteers, enabling the access to a variety of data about places on Earth (e.g., alternative names, type of places, population density, area, among others). The systems based on heuristics usually leverage this information to decide which of the possible locations refers to the place name identified in the text (Ardanuy and Sporleder, 2017; Leidner, 2007).

Additionally, it is possible to consider linguistic aspects to generate heuristics. Leidner (2007) considers both linguistic heuristics (i.e., rules and patterns inferred from the textual content) and extra-linguistic heuristics (i.e., based on an external knowledge source). For example, one of the linguistic heuristics used by Leidner is based on a ”contained-in” qualifier, which recognizes patterns such as ”toponym1 in toponym2 ” or ”toponym1 (toponym2 )” and evaluates the spatial containment of the possible candidate locations (i.e., locations with the same name as the one under resolution) for both toponym mentions, assigning the corresponding geographic coordinates according to the spatial containment (e.g., if the pattern recognizes London (UK), it assigns to London the coordinates of the capital of England, whereas if the pattern recognizes London, Ontario, Canada, it assigns to the mention London the coordinates of the city of London in Ontario). One of the extra-linguistic heuristics that Leidner uses is the attribution of the candidate location with the highest population density to the toponym mention under disambiguation. Another example is the heuristic that considers whether a given toponym occurs only once in the text: if exactly one candidate location is a capital, then it is assumed that the toponym mention refers to the capital (e.g., if Madrid occurs in the text, it always assigns the coordinates of Madrid, the capital of Spain, without considering other locations named Madrid). Besides the examples mentioned, Leidner also combines both types of heuristics, such as considering the textual-spatial correlation, where it is assumed that textual proximity is strongly correlated with spatial proximity, assigning the locations accordingly. For example, if the mentions Paris and Versailles occur within a small text span (i.e., textual proximity), then the mention Paris is associated with Paris, France (Leidner, 2007).

One of the significant disadvantages of relying on gazetteers is that often, these are outdated and incomplete, thus impacting the systems that use them and making them unable to handle new and vernacular place names (Berman et al., 2016; Manguinhas et al., 2009).

3.3.2 Combining Heuristics through Supervised Learning

Other studies use supervised approaches that consider heuristics as features in standard machine learning techniques (Freire et al., 2011; Karimzadeh et al., 2019; Lieberman and Samet, 2012; Santos et al., 2015). The work of Santos et al. (2015) explores the combination of multiple features that may capture similarities between possible candidate locations, as well as other toponyms present in the text and the context of the place reference (i.e., the text surrounding the mention). Afterward, a rank is assigned to each candidate location, and the location with the highest rank is associated with the mention under resolution; this method revealed state-of-the-art results. Another work that employs supervised learning techniques along with heuristics is the GeoTxt geocoder, developed by Karimzadeh et al. (2019), a flexible application programming interface for extracting and disambiguating toponyms in small textual documents. This geocoder utilizes existing resources to recognize toponyms in text, focusing exclusively on disambiguating the recognized place references. When resolving toponyms, for each place reference, the system retrieves a list of candidate locations and associates a score to each of them. The score assigned to each location is a combination of multiple scores referring to features, which include independent political entities, administrative divisions, populated places, continent, region, or type of establishment (e.g., buildings, schools, airports), among others (Karimzadeh et al., 2019). Furthermore, GeoTxt enables the incorporation of additional disambiguation mechanisms that consider the co-occurrence of toponyms in the text. Two of these mechanisms are based on hierarchical relationships between toponyms, i.e., whether two toponyms share the same geographic space containment, either applied to immediately consecutive place names (e.g., pairs consisting of city, state) or applied to toponyms that appear separately in the text. The third mechanism is based on spatial proximity, which aims to minimize the average distance between the predicted toponym location and the locations of toponyms that co-occur in the text.

3.3.3 Methods Combining Geodesic Grids and Language Models

Besides the techniques previously mentioned, it is possible to use geodesic grids, sometimes combined with language models, to predict geographic coordinates when resolving toponyms. A geodesic grid over the surface of the Earth allows its subdivision into multiple regions of equal dimensions. The works of Adams and McKenzie (2018) and Gritta et al. (2018a) use geodesic grids, respectively, to geocode textual content and to disambiguate toponyms, assigning to each toponym the corresponding region over the surface of the Earth (details about these works are given in Section 3.3.4). Wing and Baldridge (2011) also explore the use of geodesic grids for document geolocation. The authors developed a model that attempts to predict the correct region of a document by applying simple supervised methods and only considering textual elements as input. The developed model records improvements over previous document geolocation studies (Wing and Baldridge, 2011).

With the TopoCluster system, DeLozier et al. (2015) address the limitation brought by the recurrent need to rely on external gazetteers when resolving toponyms. TopoCluster considers the geographical distribution of each word, including the surrounding common language words, since certain words have the property of being geographically indicative. The authors use spatial statistics over multiple geo-referenced language models to create geographic clusters for each word, deriving a smoothed geographic likelihood for each word in the vocabulary and computing the strongest geographic point where the clusters of the toponym and of the context words overlap. The authors show that it is possible to obtain superior results without resorting to gazetteers, noticing that the model performs well on corpora based on international news and historical texts (DeLozier et al., 2015).

3.3.4 Deep Learning Techniques

Adams and McKenzie (2018) proposed a character-level convolutional neural network model for geocoding multilingual text using any character set represented by UTF-8 encoding. The model receives as input a sequence of characters encoded as one-hot vectors, to which a series of temporal convolution and temporal max pooling operations is applied. Then, multiple linear transformations are applied to the result. Finally, the output layer predicts the region classification using a geodesic grid. By using character-level convolutional neural networks, the approach is language independent. The authors verified that the model did not achieve the best results when diacritical characters were present, concluding that individual words are sometimes good geographical indicators (Adams and McKenzie, 2018).

Another example of a toponym resolution system that uses deep learning techniques is the CamCoder model (Gritta et al., 2018a), which attempts to disambiguate place references by discovering lexical clues through the context words surrounding the mention. The model introduces a sparse vector representation, named MapVec, which encodes the prior geographic probability distribution of locations (i.e., based on location coordinates and population counts). The spatial data is projected onto a 2D world map, which is then reshaped into a 1D feature vector (i.e., MapVec), enabling the codification of additional information about spatial knowledge usually ignored in similar studies (Gritta et al., 2018a). The CamCoder geocoder combines lexical and geographic information, thus receiving the following inputs: the context words (without the location mentions), the mentions to locations (excluding the context words), the target entity to disambiguate, and finally the feature vector MapVec. The textual inputs are fed into separate convolutional layers with global maximum pooling to detect words indicative of locations among context words, while the feature vector is supplied into a fully-connected dense layer. Then, the four resulting components are fed into another dense layer, followed by a concatenation of their results, which are provided to the output layer, where the model predicts a location based on classification into regions defined by a geodesic grid. CamCoder is a robust model that enables the consideration of geographic factors beyond lexical clues to improve the performance of toponym resolution, presenting state-of-the-art results (Gritta et al., 2018a).

3.4 Overview

This chapter presented relevant related work covering the most important areas: Section 3.1 covered named entity linking, a field dedicated to the disambiguation of entities mentioned in text; Section 3.2 covered fine-grained entity classification, concerning the classification of entities mentioned in text into very similar categories; and finally, Section 3.3 detailed the different methods applied to address the toponym resolution task. Below, Table 3.1 summarizes the different approaches for toponym resolution used in the previous studies.

Table 3.1: Methods used in previous toponym resolution systems.

                             Heuristics                      Learning approaches
Study                        Linguistic   Extra-linguistic   Supervised learning   Language models   Deep learning   Geodesic grids

Leidner (2007)               X X
Santos et al. (2015)         X X X
Karimzadeh et al. (2019)     X X X
Wing and Baldridge (2011)    X
DeLozier et al. (2015)       X X
Adams and McKenzie (2018)    X X
Gritta et al. (2018a)        X X X

4 Toponym Resolution in Text

This chapter presents the neural model proposed to address toponym resolution. Section 4.1 explains the possibility of addressing toponym resolution as classification into regions using geodesic grids. Section 4.2 presents the proposed neural model architecture. Section 4.3 presents the additional experimental settings and model variants that were evaluated using the proposed model architecture. Finally, Section 4.4 provides an overview of the topics discussed in this chapter.

4.1 Toponym Resolution as Classification

As mentioned before, this work focuses exclusively on the toponym resolution task, intending to assign an unambiguous position over the surface of the Earth to each place name reference in textual contents. With this in mind, we chose to approach the problem as a classification task, where each place name reference is associated with a delimited region on the surface of the Earth through a geodesic grid. Therefore, we use the Hierarchical Equal Area isoLatitude Pixelization (HEALPix) scheme proposed by Górski et al. (2005), an algorithm that partitions a sphere into cells of equal area, corresponding to different regions on the Earth’s surface.

These partitions are obtained hierarchically through recursive divisions of a spherical surface, where the user defines the number of recursive divisions to execute over that surface (i.e., the desired resolution). These partitions are exemplified in Figure 4.1, which shows the grid divided according to different resolution parameters, differing in the number of cells generated. The number of pixels generated (i.e., distinct regions) is obtained according to Equation 4.1, where Nside corresponds to the desired resolution.

Npix = 12 × Nside²    (4.1)

Figure 4.1: Orthographic view of the HEALPix partitioning.

Throughout the experiments conducted in this work, we fixed the resolution parameter at Nside = 256, which is equivalent to considering a maximum of 786432 regions (Npix). In practice, the number of classes will be much smaller, given that most regions will not be associated with any data instance.
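As a minimal illustration, the sketch below uses the Healpy library (also referenced later in Section 5.2) to compute the number of regions for Nside = 256 and to map a pair of geographic coordinates to its HEALPix region class; the example coordinates are merely illustrative.

```python
import healpy as hp

NSIDE = 256
print(hp.nside2npix(NSIDE))  # 786432 candidate regions (Npix), as in Equation 4.1

# Map a (latitude, longitude) pair, in degrees, to its HEALPix region class,
# and recover the centroid coordinates of that region.
lat, lon = 38.7223, -9.1393  # illustrative coordinates
region_class = hp.ang2pix(NSIDE, lon, lat, lonlat=True)
centroid_lon, centroid_lat = hp.pix2ang(NSIDE, region_class, lonlat=True)
```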

4.2 Proposed Model Architecture

The overall idea behind our model is the following: given a textual document with previously annotated references to locations (i.e., identified as toponyms and associated with geographical coordinates of latitude and longitude), the model receives textual elements as input, including the context, uses contextual word embeddings together with bi-directional LSTM units to model the text sequence, predicts the region classification upon a geodesic grid, and uses the classification probability distribution to obtain the geographical coordinates (i.e., latitude and longitude) of each recognized place reference.

Moreover, our neural network model only receives textual inputs, more specifically three elements for each place reference recognized in the text: (1) the place mention itself; (2) the sentence, i.e., the words around the mention within a fixed window to the left and right sides of the focus span of the text with the toponym, totaling 50 words; and (3) the paragraph, also defined by a fixed window of larger dimensions (i.e., a total of 500 words), so that it captures the text around the sentence where the toponym occurs. Both the sentence and paragraph inputs consider the context around the mention in the forward and backward directions. When feeding the neural network with the paragraph of the mention, we are considering the general context of the document, and by considering a smaller textual window where the mention is present (i.e., the sentence), we are considering the closest context to the entity, since other toponyms, or even common language words appearing in the surrounding text, can be characteristic of specific regions and might provide clues about the location of the mention.

Figure 4.2: The proposed neural network architecture.

The structure of the developed model is represented in Figure 4.2. The model starts by pre-processing the text documents, extracting, for each annotated toponym, its geographical coordinates together with the three textual components corresponding to the inputs of the neural network model, namely (1) the mention itself, (2) the sentence and, finally, (3) the paragraph, as previously explained (i.e., for each corpus it was necessary to pre-process the text in order to retrieve the sentence and the paragraph corresponding to each annotated toponym; within this step it was necessary to perform tokenization and remove the punctuation). To generate the contextual representation of the text elements, we use pre-trained contextual word embeddings (Section 2.3.3). We apply the contextual word embedding approach to each one of the inputs, resulting in one embedding sequence for each of them, which is fed into a separate bi-directional LSTM layer to model the word sequence. In Figure 4.2, we represent the use of contextual word embeddings with the ELMo embedding model. However, the developed architecture is versatile, since it allows the usage of other word embedding models; in our case, we also evaluated the proposed model with the BERT embeddings.
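A minimal sketch of the three-branch encoder described above is shown below, assuming that the contextual embeddings (ELMo or BERT) are pre-computed and fed to the network as fixed-length sequences of vectors; the embedding dimension, the number of LSTM units, the sequence lengths, and the number of region classes are illustrative assumptions, and both the activation discussed next and the coordinate output discussed later in this section are omitted here.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

EMB_DIM = 1024       # dimension of the pre-computed contextual embeddings (assumption)
LSTM_UNITS = 256     # illustrative value
NUM_CLASSES = 999    # e.g., the number of distinct HEALPix classes in a corpus

def bilstm_branch(seq_len, name):
    # One branch: a sequence of contextual embeddings -> bi-directional LSTM -> max pooling.
    inputs = layers.Input(shape=(seq_len, EMB_DIM), name=name)
    hidden = layers.Bidirectional(layers.LSTM(LSTM_UNITS, return_sequences=True))(inputs)
    return inputs, layers.GlobalMaxPooling1D()(hidden)

mention_in, mention_vec = bilstm_branch(10, "mention")
sentence_in, sentence_vec = bilstm_branch(50, "sentence")
paragraph_in, paragraph_vec = bilstm_branch(500, "paragraph")

# Concatenate the three representations and predict the HEALPix region class.
merged = layers.Concatenate()([mention_vec, sentence_vec, paragraph_vec])
healpix_probs = layers.Dense(NUM_CLASSES, activation="softmax", name="healpix")(merged)

model = Model([mention_in, sentence_in, paragraph_in], healpix_probs)
```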

To each bi-directional LSTM layer, we apply the penalized hyperbolic tangent, an improvement over the hyperbolic tangent activation function suggested by Eger et al. (2018). As demonstrated in Equation 4.2, the function penalizes the activation in the negative region (i.e., whenever the input is negative). This activation function has proved to achieve superior results across a variety of natural language processing tasks.

f(x) = tanh(x), if x > 0;  0.25 · tanh(x), otherwise        (4.2)
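A minimal sketch of the activation in Equation 4.2, written so that it can be passed as a custom activation to Keras layers, is shown below; the layer in the usage example is purely illustrative.

```python
import tensorflow as tf

def penalized_tanh(x):
    # tanh(x) for positive inputs, 0.25 * tanh(x) otherwise (Equation 4.2).
    return tf.where(x > 0, tf.tanh(x), 0.25 * tf.tanh(x))

# Illustrative usage as a custom activation:
layer = tf.keras.layers.Dense(128, activation=penalized_tanh)
```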

Afterward, we concatenate the resulting representations from the maximum pooling operation over each bi-directional LSTM and use them to predict the HEALPix region class (the first output) through a dense layer with a softmax activation function, obtaining a probability vector of size equal to the number of distinct HEALPix region classes. This HEALPix class probability vector is used to estimate the corresponding geographic coordinates (the second output) through a cubic interpolation between the class probability distribution and the centroid coordinates matrix, i.e., a previously constructed matrix that contains the centroid coordinates of each HEALPix class, where each row corresponds to a distinct class. By applying a cubic interpolation (i.e., raising the HEALPix class probability vector to the power of three and normalizing), we accentuate the classes with higher probabilities, leading to a more peaked distribution. Each of the outputs is associated with a separate layer, and during model training the goal is to minimize the combined loss of the outputs, thereby mutually guiding the learning process and improving the results. In the geographic coordinates output layer, we apply a regression loss based on the Great Circle distance (i.e., the distance between two points, namely the predicted point and the actual one, over the surface of the Earth), while for the region classification output layer we apply the standard categorical cross-entropy loss.
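A sketch of a Great Circle regression loss over (latitude, longitude) pairs in degrees is given below; the text only states that a Great Circle distance is used, so the haversine formulation and the tensor shapes assumed here are illustrative.

```python
import math
import tensorflow as tf

EARTH_RADIUS_KM = 6371.0

def great_circle_loss(y_true, y_pred):
    # y_true and y_pred are assumed to have shape (batch, 2), holding (lat, lon) in degrees.
    rad = math.pi / 180.0
    lat1, lon1 = y_true[:, 0] * rad, y_true[:, 1] * rad
    lat2, lon2 = y_pred[:, 0] * rad, y_pred[:, 1] * rad
    a = tf.sin((lat2 - lat1) / 2.0) ** 2 + \
        tf.cos(lat1) * tf.cos(lat2) * tf.sin((lon2 - lon1) / 2.0) ** 2
    # Per-example great-circle distance in kilometers; Keras averages it over the batch.
    return 2.0 * EARTH_RADIUS_KM * tf.asin(tf.sqrt(a))
```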

When using the interpolation technique, we obtain an estimation of the coordinate point that considers the HEALPix class probabilities, where the most probable classes contribute the most to the estimation of the geographic coordinates. By considering distinct output layers, we take both output losses into account during training, which the model aims to minimize at the same time, re-adjusting the network weights when making a new prediction (i.e., the HEALPix region probability distribution, which in turn influences the prediction of the geographic coordinates).
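A simplified NumPy sketch of this interpolation step is shown below; in the actual model it is applied inside the network, and the sketch ignores complications such as longitude wrap-around near the antimeridian.

```python
import numpy as np

def interpolate_coordinates(class_probs, centroid_coords):
    # class_probs: (num_classes,) softmax output over the HEALPix classes.
    # centroid_coords: (num_classes, 2) matrix with the (lat, lon) centroid of each class.
    peaked = class_probs ** 3          # cubic interpolation: accentuate probable classes
    peaked = peaked / peaked.sum()     # re-normalize into a distribution
    return peaked @ centroid_coords    # weighted combination -> estimated (lat, lon)
```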

For training, we use the Adam optimization algorithm with a Cyclical Learning Rate (CLR) (Smith, 2017) policy, adjusting the learning rate throughout the training based on a cycle between a lower bound of 0.00001 and an upper bound of 0.0001. Besides, we also use an early stopping strategy (i.e., a form of regularization used to avoid overfitting that interrupts the training process once the model performance stops improving). The training was stopped when the combined loss over the training data did not improve for five consecutive epochs.
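The following sketch illustrates these training settings under a Keras setup: a triangular cyclical learning rate between the reported bounds (the step size is an assumption) and an early stopping callback with a patience of five epochs monitoring the training loss.

```python
import math
import tensorflow as tf

def triangular_clr(iteration, base_lr=1e-5, max_lr=1e-4, step_size=2000):
    # Triangular cyclical learning rate (Smith, 2017), intended to be evaluated per
    # training iteration; step_size is an illustrative value.
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="loss", patience=5)
```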

Table 4.1: Number of training instances considered in the experiments.

Training instances                          WOTR      LGL   SpatialML

Original corpora instances                   8458     4016        4145
Added Wikipedia instances                   15000    15000       15000
Total instances in Wikipedia experiments    23458    19016       19145

4.3 Additional Experiments with the Proposed Model

The neural network architecture described in Section 4.2 is the architecture adopted for our base model, referred to as the ELMo model. However, additional experiments with other model variants were also performed, namely the Wikipedia model, the BERT model, and the model that integrates geophysical properties, described below:

• ELMo model - The base model described in Section 4.2 and represented in Figure 4.2. This model uses the ELMo embeddings model (Section 2.3.3.1) to generate contextual word embedding representations for the text.

• Wikipedia model - To determine the impact of the size of the training data, we created a new corpus with articles from the English Wikipedia dumps. From random Wikipedia articles, we verified the existing hyperlinks with associated geographic coordinates, collecting the article text, the hyperlink text, and the geographic coordinates (Section 4.3.1).

• BERT model - To observe the impact of using a different contextual word embeddings model, we chose to use the BERT contextual embeddings (Section 2.3.3.2) instead of the ELMo contextual embeddings. Therefore, the only difference between this experiment and the ELMo model is the embedding model chosen to represent the text.

• Integration of geophysical properties - In this experiment, we consider additional information about geophysical properties, such as land cover, elevation, percentage of vegetation, and minimum distance to a water zone. We extract this extra information from datasets in raster format, incorporating it into the model using the same interpolation technique used to estimate the geographic coordinates, as described previously; more details about the geophysical properties are given in Section 4.3.2.

Figure 4.3: Using Wikipedia to create new data instances.

We tested both the Wikipedia model and the model that integrates geophysical properties with both contextual word embedding models covered in this work, ELMo and BERT, resulting in the following variants: (1) the Wikipedia+ELMo and Wikipedia+BERT models, and (2) the Geophysical+ELMo and Geophysical+BERT models, respectively.

4.3.1 Wikipedia instances

To determine the impact of training with a larger sample of data instances, we created a new corpus with articles from the English Wikipedia dumps. From a random sample of Wikipedia articles, we verified the existing hyperlinks with associated geographic coordinates, and then collected the text from the original article, the hyperlink text, and the geographic coordinates, considering that a hyperlink text with geographic coordinates refers to a place name reference, as demonstrated in Figure 4.3.

Moreover, the instances added to the training data were filtered to coincide with the HEALPix regions present in the original corpora, thus adding more instances to the training data without modifying the region classification space of each corpus. To each corpus we added a sample of 15000 Wikipedia instances, as reported in Table 4.1.
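A small sketch of this filtering step is shown below; the placeholder coordinates and the Wikipedia instance tuples are assumptions used only to illustrate the idea.

```python
import healpy as hp

NSIDE = 256

# Assumed placeholder data: coordinates annotated in one of the original corpora, and
# candidate Wikipedia instances given as (article_text, anchor_text, lat, lon) tuples.
corpus_coords = [(32.30, -90.18), (38.90, -77.04)]
wiki_instances = [("...", "Jackson", 32.29, -90.18), ("...", "Sydney", -33.87, 151.21)]

# Keep only instances whose coordinates fall in regions already present in the corpus.
corpus_regions = {hp.ang2pix(NSIDE, lon, lat, lonlat=True) for lat, lon in corpus_coords}
filtered = [inst for inst in wiki_instances
            if hp.ang2pix(NSIDE, inst[3], inst[2], lonlat=True) in corpus_regions]
```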

Table 4.2: Land coverage classes.

Code Class name

1    Broadleaf Evergreen Forest
2    Broadleaf Deciduous Forest
3    Needleleaf Evergreen Forest
4    Needleleaf Deciduous Forest
5    Mixed Forest
6    Tree Open
7    Shrub
8    Herbaceous
9    Herbaceous with Sparse Tree/Shrub
10   Sparse vegetation
11   Cropland
12   Paddy field
13   Cropland / Other Vegetation Mosaic
14   Mangrove
15   Wetland
16   Bare area, consolidated (gravel, rock)
17   Bare area, unconsolidated (sand)
18   Urban
19   Snow / Ice
20   Water bodies

4.3.2 Geophysical properties

Another extension that we tested with the proposed model was the consideration of additional information about geophysical properties, such as land cover, elevation, percentage of vegetation, and minimum distance to a water zone. We extract this extra information from datasets in raster format (i.e., a grid mapping properties to geographic coordinates).

This information was incorporated into the model with the same interpolation technique used to estimate the geographic coordinates, described previously. This allows the model to obtain more information about the locations when making predictions, enabling it to re-adjust the network weights when classifying into regions, so as to minimize all the model output losses. For each of the properties, we create a matrix with the values corresponding to the centroid of each distinct HEALPix class and interpolate these matrices with the probability distribution of the HEALPix classes, with the purpose of using the geophysical information to guide the prediction of the geographic coordinates.

The raster datasets with the geophysical properties were collected from the “Global Map data archives” project, developed under the cooperation of the National Geospatial Information Authorities (NGIAs) of the respective countries and regions. In this work we considered the following geophysical properties: (1) the land coverage classification1, with 20 classes of land coverage, detailed in Table 4.2; (2) the terrain elevation2, expressed in meters relative to sea level; and (3) the percentage of vegetation3, which, as the name indicates, encodes the percentage of tree coverage. The fourth property considered was derived from the land coverage classification raster dataset, where we calculated the minimum distance from each pixel to a zone classified as water (i.e., ocean or lakes).
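As an illustration of how the per-class property matrices described above could be built, the sketch below samples a raster at the centroid of each HEALPix class using the rasterio library; the file name, the assumption that the raster is georeferenced in latitude/longitude, and the class identifiers are all illustrative.

```python
import healpy as hp
import rasterio

NSIDE = 256
class_ids = [1234, 5678]  # illustrative HEALPix classes present in a corpus

# Centroid coordinates (degrees) of each class; lonlat=True returns (lon, lat).
lons, lats = hp.pix2ang(NSIDE, class_ids, lonlat=True)

# Sample the raster (e.g., terrain elevation) at each centroid, one value per class.
with rasterio.open("elevation.tif") as raster:
    values = [sample[0] for sample in raster.sample(zip(lons, lats))]
```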

4.4 Overview

This chapter presented the neural model proposed to address toponym resolution. Section 4.1 explained the possibility of addressing toponym resolution as classification into regions using geodesic grids. Section 4.2 presented the proposed neural model architecture. Finally, Section 4.3 presented the additional experimental settings and model variants that were evaluated using the proposed model architecture.

1https://globalmaps.github.io/glcnmo.html 2https://globalmaps.github.io/el.html 3https://globalmaps.github.io/ptc.html

5 Experimental Evaluation

This chapter details the experimental evaluation. Section 5.1 presents a description of the corpora used in the experiments, followed by the methodology applied to conduct the experiments (Section 5.2). Section 5.3 presents the results obtained, together with the conclusions. Lastly, Section 5.4 presents an overview of the topics discussed in the previous sections.

5.1 Corpora Used in the Experiments

In this section, we describe the corpora that were used during the development of this work, namely the War of the Rebellion (DeLozier et al., 2016), the Local-Global Lexicon (Lieberman et al., 2010), and SpatialML (Mani et al., 2010), which have been widely used in several studies in the area (Ardanuy and Sporleder, 2017; DeLozier et al., 2015, 2016; Gritta et al., 2018b,a; Santos et al., 2015). The War of the Rebellion (WOTR) corpus is composed of historical texts collected from military archives of the American Civil War, among which predominate military orders, reports, and government correspondence. DeLozier et al. (2016) present the process of annotating these historical documents, as well as an evaluation of the performance of existing toponym resolution systems over the developed corpus, additionally testing other corpora to examine the results. The authors concluded that the WOTR corpus was the most challenging corpus surveyed, with lower performance results than the Local-Global Lexicon corpus (i.e., considered the most challenging corpus until then). In turn, Lieberman et al. (2010) constructed the Local-Global Lexicon (LGL) corpus from articles retrieved from small and geographically distributed newspapers. This corpus was deliberately created to present several challenges to toponym resolution systems, given that it contains articles from small newspapers, based on nearby locations with highly ambiguous names. For example, Paris is a highly ambiguous toponym, and in this collection, there are articles from The Paris News (Paris, Texas), The Paris Post-Intelligencer (Paris, Tennessee), and The Paris Beacon-News (Paris, Illinois) (Lieberman et al., 2010). As mentioned before, until the appearance of the WOTR, it was considered one of the most challenging corpora for toponym resolution. The SpatialML corpus is provided by the Linguistic Data Consortium and comprises documents from the ACE English corpus, among which are broadcast conversations, broadcast news, magazine news, newsgroups, and web blogs. SpatialML is an annotation scheme, where the references to the locations identified in the text are associated with a PLACE tag and a LATLONG attribute corresponding to the geographical coordinates.

Table 5.1: Statistical characterization of the corpora used in our experiments.

Statistic                       WOTR      LGL   SpatialML

Number of documents              1644      588        428
Number of toponyms              10377     4462       4606
Avg. toponyms per document        6.3      7.6       10.8
Avg. tokens per document          246      325        497
Avg. sentences per document      12.7     16.1       30.7
Vocabulary size                 13386    16518      14489
Distinct HEALPix classes          999      761        461

5.2 Experimental Methodology

To conduct the described experiments, we use three well-known corpora, namely the War of the Rebellion (DeLozier et al., 2016), the Local-Global Lexicon (Lieberman et al., 2010), and SpatialML (Mani et al., 2010) (Section 5.1). As these corpora have different sources (i.e., historical documents, news from small places, and international news, respectively), they naturally also have different textual structures. For example, SpatialML is based on international news documents, which tend to be more extensive and have more toponym references than the other corpora. As said before, we consider a resolution parameter equal to 256 when dividing the spherical surface into multiple regions, originating the following distributions: the WOTR corpus contains 999 HEALPix classes, the LGL corpus covers 761 classes, and SpatialML 461 classes. This can be verified in Table 5.1, which presents a statistical characterization of the corpora. We did our best to simulate the conditions of the experiments conducted by previous systems, enabling the comparison of results and model performance. In the WOTR corpus, we used precisely the same data split (i.e., division of train and test data) provided by the authors.

Table 5.2: Experimental results obtained with the proposed neural method (ELMo model).

Toponym resolution system                  Mean (km)   Median (km)   Acc@161 (%)

WOTR
TopoCluster (DeLozier et al., 2016)              604             −          57.0
TopoClusterGaz (DeLozier et al., 2016)           468             −          72.0
GeoSem (Ardanuy and Sporleder, 2017)             445             −          68.0
Our approach                                     164         11.48          81.5

LGL
GeoTxt (Gritta et al., 2018a)                   1400             −          68.0
CamCoder (Gritta et al., 2018a)                  700             −          76.0
TopoCluster (DeLozier et al., 2015)             1735        274.00          45.5
Santos et al. (2015)                             742          2.79             −
Our approach                                     237         12.24          86.1

SpatialML
Santos et al. (2015)                             140         28.71             −
Our approach                                     395          9.08          87.4

Regarding the results presented for the LGL and SpatialML corpora, we split the data in the following proportion: 90% of the instances for training and the remaining 10% for testing.

To calculate the regions over the surface of the Earth, we used the Healpy Python library1, based on the HEALPix scheme (Section 4.1), which enables calculating the region code from the latitude and longitude coordinates, given the resolution, and vice versa. To evaluate the prediction of the geographic coordinates of each toponym, we calculate the distance between the predicted coordinates and the real coordinates on the Earth’s surface. This distance is computed using Vincenty’s geodesic formulae (Vincenty, 1975) (i.e., an iterative method that calculates the shortest geographic distance between two points on the Earth’s surface, with an accuracy within 0.5 millimeters). From the error distances between the two points, we can calculate the mean and median of these values, as well as the accuracy@161, i.e., a widely used measure in previous studies that reflects the percentage of distance errors less than or equal to 161 kilometers.
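A sketch of these evaluation measures is given below, computed from lists of predicted and true coordinates; a haversine great-circle distance is used here as a stand-in for Vincenty's geodesic formulae, and the simple median computation is an approximation for even-sized lists.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometers (stand-in for Vincenty's formulae).
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = math.sin((lat2 - lat1) / 2) ** 2 + \
        math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def evaluate(predicted, true):
    # predicted and true are lists of (lat, lon) pairs, one per toponym.
    errors = sorted(haversine_km(*p, *t) for p, t in zip(predicted, true))
    mean_error = sum(errors) / len(errors)
    median_error = errors[len(errors) // 2]
    accuracy_161 = 100.0 * sum(e <= 161 for e in errors) / len(errors)
    return mean_error, median_error, accuracy_161
```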

1http://pypi.org/project/healpy/

Table 5.3: Comparison of the results between the different variations of the experiments with the proposed model.

Experiment Mean (km) Median (km) Acc@161 (%)

WOTR
ELMo                  164   11.48   81.5
Wikipedia+ELMo        158   11.28   82.4
Geophysical+ELMo      166   11.35   81.9
BERT                  117   10.99   87.3
Wikipedia+BERT        122   11.04   86.4
Geophysical+BERT      114   10.99   87.3

LGL
ELMo                  237   12.24   86.1
Wikipedia+ELMo        304   12.16   87.4
Geophysical+ELMo      282   12.24   87.7
BERT                  193   11.81   90.1
Wikipedia+BERT        226   11.51   90.6
Geophysical+BERT      216   12.24   87.9

SpatialML
ELMo                  395    9.08   87.4
Wikipedia+ELMo        364    9.08   88.5
Geophysical+ELMo      387    9.08   87.4
BERT                  363    9.08   89.2
Wikipedia+BERT        205    9.08   92.4
Geophysical+BERT      339    9.08   89.4

5.3 The Obtained Results

The developed model achieves strong results, sometimes outperforming previous state-of-the-art results. In Table 5.2, we summarize the results obtained by our base model (i.e., using ELMo embeddings), comparing them with previous systems. Overall, our model performs well across all corpora, recording the lowest mean error in both the WOTR corpus and the LGL corpus, with reductions of 281 kilometers and 463 kilometers, respectively, when compared to the second-best value obtained by other systems. As for the SpatialML corpus, the system of Santos et al. records the best mean. However, our model reaches a median value of 9.08 kilometers, which is 19.63 kilometers lower. Regarding the accuracy at 161 kilometers measure, our model obtains a value of 81.5% for the WOTR and a value of 86.1% for the LGL corpus, representing increases of 9.5% and 10.1%, respectively, when compared to the previous second-best results reported. In SpatialML, we record a value of 87.4% for the accuracy@161 metric.

In Table 5.3, we present the results obtained in the several additional experiments conducted with the proposed architecture. The results show that the selection of the textual representations has a significant impact on the results achieved. By using the BERT contextual representations instead of ELMo, we achieved considerably better results, with, on average, a reduction of 41 kilometers in the mean value, a reduction of 0.3 kilometers in the median value, and an increase of 3.9% in the accuracy@161. We also observed that increasing the size of the training data, with more training instances, resulted in a slight improvement over the results obtained previously. Both when comparing the ELMo model with Wikipedia+ELMo and the BERT model with Wikipedia+BERT, we verified the same pattern in the LGL corpus and the SpatialML corpus. For example, in the LGL corpus, both with the ELMo and the BERT embeddings, we recorded an increase in the mean values, a slight decrease in the median values, and finally, an increase in the accuracy@161 values. Regarding the WOTR corpus, the results are inconclusive, possibly due to the difference in the textual register, i.e., this corpus is composed of historical documents, mainly governmental correspondence, while the Wikipedia articles have an informative register and a modern tone. As for the experiments with geophysical information (land coverage, among others), both with ELMo and BERT, we recorded a slight improvement over the ELMo and BERT models, respectively. Thus, the model benefits from the addition of geophysical information, which also helps to guide the prediction of the coordinates. However, the addition of geophysical properties does not provide relevant information when applied to the LGL corpus, leading us to the conclusion that the geophysical data does not have enough spatial resolution in the case of this corpus.

In Table 5.4, we present the locations with the lowest and highest prediction distance errors for all corpora. It is worth noting that, in all corpora, among the locations with the lowest prediction distance error, there are cases of demonyms (e.g., English in the SpatialML corpus, resolved to the location England, UK, with an error of only 2.44 kilometers), and even small places designated using vernacular names (e.g., Owen’s Big Lake in the WOTR corpus).

Table 5.4: Locations with the lowest and highest prediction distance errors.

Corpus Lowest error (km) Highest error (km)

WOTR
(0.63) Mexico             (3104.59) Fort Welles
(1.00) Resaca             (3141.29) Washington
(1.09) Owen’s Big Lake    (3682.01) Astoria

LGL
(1.21) W.Va.              (8854.04) Ohioans
(1.36) Butler County      (9225.86) North America
(1.51) Manchester         (9596.54) Nigeria

SpatialML
(0.45) Tokyo              (9687.43) Capital
(2.38) Lusaka             (10818.50) Omaha
(2.44) English            (13140.64) Atlantic City

Illustrative examples, together with the corresponding document text retrieved from the WOTR corpus, are presented in Table 5.5. Each example document text has the annotated toponyms highlighted in red, and the corresponding image shows the real location (green point) and the predicted location (red point), with the distance between the two points represented by a black line. In the examples shown, we included clear cases where the error between the predicted point and the actual point is small, and other cases where this distance is significantly larger. It is noteworthy the consecutive co-occurrence of toponyms, e.g., Memphis, Tenn., which can give clues about both toponym locations. In the third example, we can see that all the toponyms were assigned locations with low errors, with an average error of approximately 16.6 kilometers, among which there is a reference to the location Paris, a very ambiguous place name that is well resolved by the model with the help of the surrounding context.

5.4 Overview

This chapter detailed the experimental evaluation. Section 5.1 presented a description of the corpora used in the experiments. Section 5.2 presented the methodology applied to conduct the experiments. And finally, Section 5.3 presented the results obtained, together with the conclusions.

Table 5.5: Illustrative examples.

Figure Text

[Indorsement.] HDQRS. DETACHMENT SIXTEENTH ARMY CORPS, Memphis, Tenn., June 12, 1864. Respectfully referred to Colonel David Moore, commanding THIRD DIVISION, SIXTEENTH Army Corps, who will send the THIRD Brigade of his command, substituting some regiment for the Forty-ninth Illinois that is not entitled to veteran furlough, making the number as near as possible to 2,000 men. They will be equipped as within directed, and will move to the railroad depot as soon as ready. You will notify these headquarters as soon as the troops are at the depot. By order of Brigadier General A. J. Smith: J. HOUGH, Assistant Adjutant-General.

HYDESVILLE, October 21, 1862 SIR: I started from this place this morning, 7. 30 o’clock, en route for Fort Baker. The express having started an hour before, I had no es- cort. About two miles from Simmons’ ranch I was attacked by a party of Indians. As soon as they fired they tried to surround me. I re- turned their fire and retreated down the hill. A portion of them cut me off and fired again. I re- turned their fire and killed one of them. They did not follow any farther. I will start this evening for my post as I think it will be safer to pass this portion of the country in the night. Those Indians were lurking about for the purpose of robbing Cooper’s Mills. They could have no other object, and I think it would be well to have eight or ten men stationed at that place, as it will serve as an outpost for the set- tlement, as well as a guard for the mills. The expressmen disobeyed my orders by starting without me this morning. I have the honor to be, very respectfully, your obedient servant, H. FLYNN, Captain, Second Infantry California Volunteers. First Lieutenant JOHN HANNA, Jr., Acting Assistant Adjutant-General, Hum- boldt Military District.

LEXINGTON, KY., June 11, 1864–11 p. m. Colonel J. W. WEATHERFORD, Lebanon, Ky. Have just received dispatch from General Burbridge at Paris. He says direct Colonel Weatherford to closely watch in the direction of Bardstown and Danville, and if any part of the enemy’s force appears in that region to attack and destroy it. J. BATES DICKSON, Captain and Assistant Adjutant-General.

HEADQUARTERS DEPARTMENT OF THE NORTHWEST, Milwaukee, Wis., April 13, 1864. Brigadier General H. H. SIBLEY, Com- manding District of Minnesota, Saint Poul, Minn.: GENERAL: Your letter of 9th instant to the major-general commanding is received, and I am directed by him to advise you that the Sixth Minnesota Regiment will remain un- der your orders until its place can be supplied by the Eighth Regiment on its return from ex- pedition. A telegraphic dispatch to that effect is sent you to-day, and I inclose a copy of it. I am, general, most respectfully, your obedi- ent servant, J. F. MELINE, Acting Assistant Adjutant-General.

6 Conclusions and Future Work

In this dissertation, I addressed the toponym resolution task, presenting a novel recurrent neural network architecture with multiple textual inputs, leveraging pre-trained contextual word embeddings (ELMo or BERT) and bi-directional Long Short-Term Memory (LSTM) units, and producing multiple outputs for classification and regression tasks. The proposed model incorporates classification into HEALPix regions, i.e., divisions of the surface of the Earth, used to improve the results for the regression task when predicting the geographic coordinates. I conducted several additional experiments, including training data augmentation through English Wikipedia articles, the application of different contextual word embeddings, and the addition of geophysical properties retrieved from raster datasets to support the prediction of geographic coordinates. The proposed model was tested on the following corpora: the War of the Rebellion, the Local-Global Lexicon, and SpatialML. The results obtained confirm the superiority of the proposed method over previous studies that demonstrated state-of-the-art results. Using contextual word embeddings has shown to be useful for improving many NLP tasks, particularly when involving relatively small amounts of annotated training data, as in the case of the experiments reported in this work. We compared the impact of different embedding models, concluding that BERT outperforms ELMo representations when applied to the toponym resolution task. In the data augmentation scenario based on a selection of English Wikipedia articles, we recorded an improvement in the results obtained when compared to a scenario where exclusively the original corpora were used. Information on geophysical properties, when incorporated into the proposed model, had a beneficial impact, since it contributed to the achievement of better results.

6.1 Overview on the Contributions

The most important contributions of my M.Sc. thesis are the following:

• The proposal of a novel model architecture for toponym resolution that incorporates deep learning techniques: the model combines pre-trained contextual word embeddings with bidirectional Long Short-Term Memory units to model the textual elements. The proposed model incorporates textual inputs, namely the place name reference, the corresponding sentence, and the corresponding paragraph, with multiple outputs (i.e., a primary output of geographic coordinates and a secondary output of classification into regions over the surface of the Earth corresponding to the place reference, resorting to a geodesic grid). The result of the classification is used to improve the prediction of geographic coordinates for each place reference, through a separate layer that directly applies the Great Circle distance as a loss function. The obtained results exceed previously reported results over the same corpora, thus demonstrating state-of-the-art performance.

• The integration and evaluation of the proposed model with distinct pre-trained contextual word embedding approaches, namely the Embeddings from Language Models (ELMo) (Peters et al., 2018) and the Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), along with the analysis of the impact that the textual representations have on the obtained results. We concluded that the text representation method has a significant impact on the results, verifying that, by using the BERT contextual word embeddings, our model achieves higher performance in the toponym resolution task.

• Additional experiments with the proposed model considering different scenarios with further information: (i) using external information from geophysical properties (i.e., land coverage, terrain elevation, percentage of vegetation, and minimum distance to a water zone) extracted from external raster datasets and incorporated in the proposed model to guide the prediction of the geographic coordinates; and (ii) using a larger training sample, to determine the impact of the training data size on the results. The instances added to the original corpora were collected from a random sample of English Wikipedia articles, leveraging the Wikipedia link structure to infer which spans of text correspond to place references, in the sense that they link to Wikipedia pages associated with geospatial coordinates. Both experiments revealed slight improvements in the obtained results, demonstrating that, indeed, the neural network model benefits from the addition of information, both from the addition of geophysical information and from the addition of training instances.

6.2 Future Work

Regarding future work, it may be interesting to explore cross-language embeddings to support the idea of training models that leverage existing data in a given language and are capable of operating on texts from a different language with fewer resources. It is worth noticing that approaches such as ELMo take character information into account to compose word representations, this way addressing the problem of out-of-vocabulary words to some degree (i.e., it is possible to generate representations for words that are not present in the vocabulary used for learning the word embeddings, leveraging the characters that compose these words). However, individual characters are an insufficient and unnatural linguistic unit for word representation; similarly to other approaches such as FastText embeddings, which leverage character n-grams, it would be interesting to extend the ELMo contextual approach to also learn representations for sub-words.

In this work I chose to use the ELMo and BERT embedding models; however, there are numerous other contextual word embedding models (Wolf et al., 2019), for example RoBERTa (Liu et al., 2019), an optimized version of BERT (i.e., eliminating the pre-training objective concerning the prediction of the next sentence, and training the model with larger mini-batches, higher learning rates, more training data, and for a more extended amount of time than BERT), which has proven to be more efficient, producing state-of-the-art results.

Additionally, it would be interesting to test the model more intensively (e.g., using other corpora based on different sources, such as scientific documents, and even comparing the performance of the proposed model against previous systems). One possibility would be to use EUPEG (Wang and Hu, 2019b), a benchmark platform developed by Wang and Hu, which integrates a wide range of document collections and permits the comparison between several existing systems. The EUPEG platform is a very complete and up-to-date system that includes the collection of scientific corpora used in the SemEval-2019 competition on toponym resolution (Weissenbacher et al., 2019) and features the systems that obtained the best classifications (Wang and Hu, 2019a).

Bibliography

Adams, B. and McKenzie, G. (2018). Crowdsourcing the character of a place: Character-level convolutional networks for multilingual geographic text classification. Transactions in GIS, 22(2):394–408.

Ardanuy, M. and Sporleder, C. (2017). Toponym disambiguation in historical documents using semantic and geographic features. In Proceedings of the International Conference on Digital Access to Textual Cultural Heritage, pages 175–180. ACM.

Berman, M., Mostern, R., and Southall, H. (2016). Placing names: Enriching and integrating gazetteers. Indiana University Press.

Cao, Y., Hou, L., Li, J.-Z., and Liu, Z. (2018). Neural collective entity linking. In Proceedings of the International Conference on Computational Linguistics, pages 675–686. Association for Computational Linguistics.

Cardoso, A. B., Martins, B., and Estima, J. (2019). Using recurrent neural networks for toponym resolution in text. pages 769–780. Springer International Publishing.

DeLozier, G., Baldridge, J., and London, L. (2015). Gazetteer-independent toponym resolution using geographic word profiles. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2382–2388. AAAI Press.

DeLozier, G., Wing, B., Baldridge, J., and Nesbit, S. (2016). Creating a novel geolocation corpus from historical texts. In Proceedings of the Linguistic Annotation Workshop held in conjunction with ACL, pages 188–198. Association for Computational Linguistics.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 4171–4186. Association for Computational Linguistics.

Eger, S., Youssef, P., and Gurevych, I. (2018). Is it time to swish? comparing deep learning activation functions across NLP tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 4415–4424. Association for Computational Linguistics.

Freire, N., Borbinha, J., Calado, P., and Martins, B. (2011). A metadata geoparsing system for place name recognition and resolution in metadata records. In Proceedings of the Annual International ACM/IEEE Joint Conference on Digital Libraries, pages 339–348. ACM.

Goldberg, Y. (2017). Neural network methods in natural language processing. Morgan & Claypool Publishers.

Górski, K., Hivon, E., Banday, A., Wandelt, B., Hansen, F., Reinecke, M., and Bartelman, M. (2005). HEALPix: A framework for high-resolution discretization and fast analysis of data distributed on the sphere. The Astrophysical Journal, 622(2):759–771.

Gritta, M., Pilehvar, M., and Collier, N. (2018a). Which Melbourne? augmenting geocoding with maps. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1285–1296. Association for Computational Linguistics.

Gritta, M., Pilehvar, M., Limsopatham, N., and Collier, N. (2018b). What’s missing in geographical parsing? Language Resources and Evaluation, 52(2):603–623.

Gupta, N., Singh, S., and Roth, D. (2017). Entity linking via joint encoding of types, descriptions, and context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 2681–2690. Association for Computational Linguistics.

Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116.

Howard, J. and Ruder, S. (2018). Universal language model fine-tuning for text classification. Computing Research Repository, abs/1801.06146.

Karimzadeh, M., Pezanowski, S., MacEachren, A., and Wallgrün, J. (2019). GeoTxt: A scalable geoparsing system for unstructured text geolocation. Transactions in GIS, 23(1):118–136.

Leidner, J. (2007). Toponym resolution in text. PhD thesis, University of Edinburgh.

Lieberman, M. and Samet, H. (2012). Adaptive context features for toponym resolution in streaming news. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 731–740. ACM.

Lieberman, M., Samet, H., and Sankaranarayanan, J. (2010). Geotagging with local lexicons to build indexes for textually-specified spatial data. In Proceedings of the IEEE International Conference on Data Engineering, pages 201–212. IEEE.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. Computing Research Repository, abs/1907.11692.

Manguinhas, H., Martins, B., Borbinha, J., and Siabato, W. (2009). The DIGMAP geo-temporal web gazetteer service. E-Perimetron, 4(1):9–24.

Mani, I., Doran, C., Harris, D., Hitzeman, J., Quimby, R., Richer, J., Wellner, B., Mardis, S., and Clancy, S. (2010). SpatialML: annotation scheme, resources, and evaluation. Language Resources and Evaluation, 44(3):263–280.

Melo, F. and Martins, B. (2017). Automated geocoding of textual documents: A survey of current approaches. Transactions in GIS, 21(1):3–38.

Monteiro, B., Davis, C., and Fonseca, F. (2016). A survey on the geographic scope of textual documents. Computers & Geosciences, 96:23–34.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 2227–2237. Association for Computational Linguistics.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training.

Ruder, S. (2016). An overview of gradient descent optimization algorithms. Computing Research Repository, abs/1609.04747.

Santos, J., Anastácio, I., and Martins, B. (2015). Using machine learning methods for disambiguating place references in textual documents. GeoJournal, 80(3):375–392.

Shimaoka, S., Stenetorp, P., Inui, K., and Riedel, S. (2017). Neural architectures for fine-grained entity type classification. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, volume 1, pages 1271–1280. Association for Computational Linguistics.

Smith, L. (2017). Cyclical learning rates for training neural networks. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pages 464–472. IEEE.

Speriosu, M. and Baldridge, J. (2013). Text-driven toponym resolution using indirect supervision. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1466–1476. Association for Computational Linguistics.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Proceedings of the International Conference on Neural Information Processing Systems, pages 5998–6008. Curran Associates Inc.

Vincenty, T. (1975). Direct and inverse solutions of geodesics on the ellipsoid with application of nested equations. Survey Review, 23(176):88–93.

Wang, J. and Hu, Y. (2019a). Are we there yet?: Evaluating state-of-the-art neural network- based geoparsers using EUPEG as a benchmarking platform. In Proceedings of the ACM SIGSPATIAL International Workshop on Geospatial Humanities. ACM.

Wang, J. and Hu, Y. (2019b). Enhancing spatial and textual analysis with EUPEG: an extensible and unified platform for evaluating geoparsers. Transactions in GIS, 23(6):1393–1419.

Weissenbacher, D., Magge, A., O’Connor, K., Scotch, M., and Gonzalez-Hernandez, G. (2019). SemEval-2019 task 12: Toponym resolution in scientific papers. In Proceedings of the International Workshop on Semantic Evaluation, pages 907–916. Association for Computational Linguistics.

Wing, B. (2015). Text-based document geolocation and its application to the digital humanities. PhD thesis, University of Texas at Austin.

Wing, B. and Baldridge, J. (2011). Simple supervised document geolocation with geodesic grids. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 955–964. Association for Computational Linguistics.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. (2019). Transformers: State-of-the-art natural language processing. Computing Research Repository, abs/1910.03771.

Xin, J., Lin, Y., Liu, Z., and Sun, M. (2018). Improving neural fine-grained entity typing with knowledge attention. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5997–6004.

Xu, P. and Barbosa, D. (2018). Neural fine-grained entity type classification with hierarchy-aware loss. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 16–25. Association for Computational Linguistics.

Yaghoobzadeh, Y., Adel, H., and Schütze, H. (2018). Corpus-level fine-grained entity typing. Journal of Artificial Intelligence Research, 61(1):835–862.

Yang, Y., Irsoy, O., and Rahman, K. S. (2018). Collective entity disambiguation with structured gradient tree boosting. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 777–786. Association for Computational Linguistics.
