Spatial Language Representation with Multi-Level Geocoding

Sayali Kulkarni∗ Shailee Jain∗† Mohammad Javad Hosseini† Google Research University of Texas, Austin University of Edinburgh [email protected] [email protected] [email protected]

Jason Baldridge Eugene Ie Li Zhang Google Research Google Research Google Research [email protected] [email protected] [email protected]

Abstract Joint loss

We present a multi-level geocoding model L5 L6 L7

(MLG) that learns to associate texts to ge- Projections to multiple levels ographic locations. The Earth’s surface is represented using space-filling curves that de- Feature encoding Target Toponyms in Context compose the sphere into a hierarchy of simi- Toponym context words larly sized, non-overlapping cells. MLG bal- ances generalization and accuracy by combin- The five boroughs - Brooklyn, Queens, Manhattan, the Bronx and Staten Island - ing losses across multiple levels and predict- were consolidated into a single city in 1898. ing cells at each level simultaneously. With- out using any dataset-specific tuning, we show Figure 1: Overview of Multi-Level Geocoder, using that MLG obtains state-of-the-art results for multiple context features and jointly predicting cells at toponym resolution on three English datasets. multiple levels of the S2 hierarchy. Furthermore, it obtains large gains without any knowledge base metadata, demonstrating that it can effectively learn the connection between We present Multi-Level Geocoder (MLG, Fig. text spans and coordinates—and thus can be 1), a model that learns spatial language represen- extended to toponymns not present in knowl- tations that map toponyms from English texts to edge bases. coordinates on Earth’s surface. This geocoder is not restricted to resolving toponyms to specific lo- 1 Introduction cation entities, but rather to geo-coordinates di- rectly. MLG can thus be extended to non-standard Geocoding is the task of resolving a location ref- location references in future. For comparative eval- erence in text to a corresponding point or region uation, we use three toponym resolution datasets on Earth. It is often studied in the context of so- from distinct textual domains. MLG shows strong cial networks, where metadata and the network performance, especially when gazetteer metadata itself provide additional signals to geolocate nodes and population signals are unavailable. (usually people) (Backstrom et al., 2010; Rahimi MLG is a text-to-location neural geocoder simi- et al., 2015). These evaluate performance on so- lar to CamCoder (Gritta et al., 2018a). Locations arXiv:2008.09236v1 [cs.CL] 21 Aug 2020 cial media data, which has a strong bias for highly- are represented using S2 geometry1—a hierarchical populated locations. If the locations can be mapped discretization of the Earth’s surface based on space- to an entity in a knowledge graph, toponym reso- filling curves. S2 naturally supports spatial rep- lution – which is a special case of entity resolu- resentation at multiple levels, including very fine tion – can be used to resolve location references to grained cells (as small as 1cm2 at level 30); here, geo-coordinates, This can be done using heuristics we use different combinations of levels 4 (∼300K based on both location popularity (Leidner, 2007) km2) to 8 (∼1K km2). MLG predicts the classes at and distance between candidate locations (Speriosu multiple S2 levels by jointly optimizing for the loss and Baldridge, 2013), as well as learning associa- at each level to balance between generalization and tions from text to locations. accuracy. The example shown in Fig.1 covers an ∗Equal contribution † Work done during internship at Google 1https://s2geometry.io/ area around New York City by cell id 0x89c25 at source; as such, a k-d tree grid may not generalize level 8 and 0x89c4 at level 5. This is more fine- well to examples with different distributions from grained than previous work, which tends to use training resources. Spatial hierarchies based on arbitrary square-degree cells, e.g. 2◦-by-2◦ cells containment relations among entities relies heavily ( 48K km2)(Gritta et al., 2018a). The hierarchi- on metadata like GeoNames (Kamalloo and Rafiei, cal geolocation model over kd-trees of (Wing and 2018). Polygons for geopolitical entities such as Baldridge, 2014) can have some more fine-grained city, state, and country (Martins et al., 2015) are per- cells, but we predict over a much larger set of finer haps ideal, but these too require detailed metadata cells. Furthermore, we train a single model that for all toponyms, managing non-uniformity of the jointly incorporates multi-level predictions rather polygons, and general facility with GIS tools. The than learning many independent per-cell models Point-to-City (P2C) method applies an iterative k-d and do not rely on gazetteer-based features. tree-based method for clustering coordinates and We consider toponymn resolution for evaluation, associating them with cities(Fornaciari and Hovy, but focus on distance-based metrics rather than 2019b). S2 can represent such hierarchies in vari- standard resolution task metrics like top-1 accuracy. ous levels without relying on external metadata. When analyzing three common evaluation sets, we found inconsistencies in the true coordinates that Some of the early models used with grid-based we fix and unify to support better evaluation.2 representations were probabilistic language mod- els that produce document likelihoods in different 2 Spatial representations and models geospatial cells (Serdyukov et al., 2009; Wing and Baldridge, 2011; Dias et al., 2012; Roller et al., Geocoders maps a text span to geo-coordinates— 2012). Extensions include domain adapting lan- a prediction over a continuous space representing guage models from various sources (Laere et al., the surface of a (almost) sphere. We adopt the 2014), hierarchical discriminative models (Wing standard approach of quantizing the Earth’s surface and Baldridge, 2014; Melo and Martins, 2015), as a grid and performing multi-class prediction and smoothing sparse grids with Gaussian priors over the grid’s cells. There have been studies to (Hulden et al., 2015). Alternatively, Fornaciari model locations as standard bivariate Gaussians on and Hovy(2019a) use a multi-task learning setup multiple flattened regions (Eisenstein et al., 2010; that assigns probabilities across grids and also pre- Priedhorsky et al., 2014)), but this involves difficult dicts the true location through regression. Melo trade-offs between flattened region sizes and the and Martins(2017) covers a broad survey of doc- level of distortion they introduce. ument geocoding. Much of this work has been We construct a hierarchical grid using the S2 conducted on social media data like Twitter, where library. S2 projects the six faces of a cube onto the additional information beyond the text—such as Earth’s surface and each face is recursively divided the network connections and user and document into 4 quadrants, as shown in Figure1. Cells at each metadata—have been used (Backstrom et al., 2010; level are indexed using a Hilbert curve. Each S2 Cheng et al., 2010; Han et al., 2014; Rahimi et al., cell is represented as a 32-bit unsigned integer and 2 2015, 2016, 2017). MLG is not trained on social can represent spaces as granular as ≈ 1cm . S2 media data and hence, does not need additional net- cells preserves cell size across the globe compared ◦ ◦ work information. Further, the data does not have to commonly used degree-square grids (e.g. 1 x1 ) a character limit like tweets, so models can learn (Serdyukov et al., 2009; Wing and Baldridge, 2011). from long text sequences. Hierarchical triangular meshes (Szalay et al., 2007) and Hierarchical Equal Area isoLatitude Pixelation Toponym resolution identifies place mentions (Melo and Martins, 2015) are alternatives, though in text and predicting the precise geo-entity in lack the spatial properties of S2 cells. Furthermore, a knowledge base (Leidner, 2007; Gritta et al., S2 libraries provide excellent tooling. 2018b). The knowledge base is then used to ob- Adaptive, variable shaped cells based on k-d tain the geo-coordinates of the predicted entity for trees (Roller et al., 2012) perform well but depend the geocoding task. Rule-based toponym resolvers on the locations of labeled examples in a training re- (Smith and Crane, 2001; Grover et al., 2010; To- 2We will release these. Please contact the first author in bin et al., 2010; Karimzadeh et al., 2013) rely on the meantime if interested. hand-built heuristics like population from metadata S2 Level number of cells Avg area erarchical S2 grid that enables us to do multi-level L4 1.5k 332 prediction jointly, by optimizing for total loss com- L5 6.0k 83 puted from each level. L6 24.0k 21 L7 98.0k 5 3.1 Building blocks L8 393.0k 1 For a toponym in context, MLG predicts a distri- bution over all cells via a convolutional neural net- Table 1: S2 levels used in MLG. Average area is in 1k work. Optionally, the predictions may be snapped km2. to the closest valid cells that overlap the gazetteer locations for the toponym, weighted by their pop- resources like Wikipedia and GeoNames3 gazetteer. ulation similar to CamCoder. CamCoder incor- This works well for many common places, but it is porates side metadata in the form of its MapVec brittle and cannot handle unknown or uncommon feature vector, which encodes knowledge of po- place names. As such, machine learned approaches tential locations and their populations matching that use toponym context features have demon- all toponym in the text. For each toponym, the strated better performance (Speriosu and Baldridge, cells of all candidate locations are activated and 2013; Zhang and Gelernter, 2014; DeLozier et al., given a prior probability proportional to the high- 2015; Santos et al., 2015). A straightforward–but est population. These probabilities are summed data hungry–approach learns a collection of multi- over all toponyms and renormalized as MapVec class classifiers, one per toponym with a gazetteer’s input. It thus uses population signals in both the locations for the toponym as the classes (e.g., the MapVec feature in training and in output predic- WISTR model of Speriosu and Baldridge(2013)). tions. This biases predictions toward locations with A hybrid approach that combines learning and larger populations—which MLG is not prone to do. heuristics by predicting a distribution over the 3.2 Multi-level classification grid cells and then filtering the scores through a gazetteer works for systems like TRIPDL (Spe- MLG’s core is a standard multi-class classifier us- riosu and Baldridge, 2013) and TopoCluster (De- ing a CNN. We use multi-level S2 cells as the out- Lozier et al., 2015). A combination of classifica- put space from a multi-headed model. The penul- tion and regression loss to predict over recursively timate layer maps representations of the input to partitioned regions shows promising results with in- probabilities over S2 cells. Gradient updates are domain training (Cardoso et al., 2019). CamCoder computed using the cross entropy loss between the (Gritta et al., 2018a) uses this strategy with a much predicted probabilities p and the one-hot true class stronger neural model and achieves state-of-the-art vector c. results, including gazetteer signals at training time. MLG exploits the natural hierarchy of the geo- Our experiments go as far as S2 level eight (of graphic locations by jointly predicting at different thirty), but our approach is extendable to any level levels of granularity. CamCoder uses 7,823 output of granularity and could support very fine-grained classes representing 2x2 degree tiles, after filtering locations like buildings and landmarks. The built-in cells that have no support in training, such as over hierarchical nature of S2 cells makes it well suited bodies of water. This requires maintaining a cum- as a scaffold for models that learn and combine bersome mapping between actual grid cells and the evidence from multiple levels. This combines the classes. MLG’s multi-level hierarchical represen- best of both worlds: specificity at finer levels and tation overcomes this problem by including coarser aggregation/smoothing at coarser levels. levels. Here, we focus on three levels of granular- ity: L5, L6 and L7 (shown in Table1), each giving 3 Multi-Level Geocoder (MLG) 6K, 24K, and 98K output classes, respectively. We define losses at each level (L5, L6, L7) and Multi-Level Geocoder (MLG, Figure2) is a text-to- minimize them jointly, i.e., L = (L(p , c ) + location CNN-based neural geocoder. Context fea- total L5 L5 L(p , c ) + L(p , c ))/3. At inference time, tures are similar to CamCoder (Gritta et al., 2018a) L6 L6 L7 L7 a single forward pass computes probabilities at but we do not rely on its metadata-based MapVec all three levels. The final score for each L7 cell feature. The locations are represented using a hi- is dependent on its predicted probability as well 3www.geonames.org as the probabilities in its corresponding parent L6 Total loss Inference

xent loss xent loss xent loss f: predicted_cell → latitude + longitude L5 L6 L7

Dense projection to N-classes softmax I N P U T concat

Dense Dense Dense GlobalMaxPool1D GlobalMaxPool1D GlobalMaxPool1D

Conv1D Conv1D Conv1D 1-gram 2-gram 3-gram 1-gram 2-gram 1-gram 2-gram

Embedding

Target Toponym Toponyms in context Context words

Finland is a Nordic country in Northern Europe bordering the Baltic Sea, Gulf of Bothnia, andFigure Gulf 2: of Multi-Level Finland, between Geocoder Sweden model to the architecture west, Russia and to inference the east, setup. Estonia to the south, and north-eastern Norway to the north. and L5 cells. Then the final score for sL7(f) = a gazetteer to filter the output predictions to only pL7(f)∗pL6(e)∗pL5(d) and the final prediction is those that are valid for the toponym. Furthermore, yˆ = argmaxy sL7(y). This approach is easily ex- gazetteers come with population information that tensible to capture additional levels of resolution— can be used to nudge predictions toward locations we also present results with finer resolution at L8, with high populations—which tend to be discussed with ∼1K km2 area and coarser resolution at L4 more than less populous alternatives. Like De- with ∼300K km2 area for comparison. Lozier et al.(2015), we consider both gazetteer- MLG consumes three features extracted from free and gazetteer-constrained predictions. the context window: (a) token sequence (wa,1:l), Gazetteer-constrained prediction makes to- (b) toponym mentions (wb,1:l), and (c) sur- ponym resolution a sub-problem of entity resolu- face form of the target toponym (wc,1:l). All tion. As with broader entity resolution, a strong text inputs are transformed uniformly, using baseline is an alias table (the gazetteer) with a pop- shared model parameters. Let input text con- ularity prior. For geographic data, the population tent be denoted as a word sequence wx,1:l = of each location is an effective quantity for charac- [wx,1, . . . , wx,l], initialized using GloVe embed- terizing popularity: choosing Paris, France rather dings φ(wx,1:l) = [φ(wx,1), . . . , φ(wx,l)] (Pen- than Paris, Texas for the toponym Paris is a better nington et al., 2014). We use 1D convolu- bet. This is especially true for zero-shot evaluation tional filters to capture n-gram sequences through where one has no in-domain training data. Conv1Dn(·). This is followed by max pool- We follow the strategy of Gritta et al.(2018a) ing which is projected onto a dense layer to for gazetteer constrained predictions. We con- get Dense(MaxPool(Conv1Dn(φ(wx,1:l)))) ∈ struct an alias table which maps each mention 2048 R , where n = {1, 2} for the sequence of to- m to a set of candidate locations, denoted by kens and toponym mentions, and n = {1, 2, 3} for C(m) using link information from Wikipedia and the target toponym. These projections are concate- the population pop(`) for each location ` is read nated as input representation. from WikiData.4 For each of the gazetteer’s can- didate locations we compute a population dis- 3.3 Gazetteer-constrained prediction counted distance from the geocoder’s predicted The only way MLG uses geographic information location p and choose the one with smaller value as is from training labels for toponym targets. At test argmin`∈C(m) dist(p, `)·(1−c·pop(`)/ pop(m)). time, MLG predicts a distribution over all cells at Here, pop(m) is the maximum population among each S2 level given the input features and picks all candidates for mention m, dist(p, `) is the great the highest probability cell at the most granular circle distance between prediction p and location `, level. We use the center of the cell as predicted and c is a constant in [0, 1] that indicates the degree coordinates. However, when the goal is to resolve a specific toponym, an effective heuristic is to use 4http://www.wikidata.org of population bias applied. For c=0, the location oritized (e.g. “Lima, Peru” vs. “Lima, Ohio” vs. nearest the the prediction is chosen (ignoring pop- “Lima, Oklahoma” vs “Lima, New York”). As such, ulation); for c=1, the most populous location is population-based heuristics are counter-productive chosen, (ignoring p). This is set to 0.90 as found in WikToR. optimal on development set. LGL consists of 588 news articles from 78 dif- ferent news sources. This dataset contains 5,088 3.4 Training Data and Representation toponyms and 41% of these refer to locations with MLG is trained on geographically annotated small populations. About 16% of the toponyms are Wikipedia pages, excluding all pages in the Wik- for street names, which do not have coordinates; ToR dataset (see sec. 4.1). For each page with we dropped these from our evaluation set. About latitude and longitude coordinates, we consider 2% have an entity that does not exist in Wikipedia, context windows of up to 400 tokens (respecting which were also dropped. In total, this dataset pro- sentence boundaries) as potential training example vides 4,172 examples for evaluation. candidates. Only context windows that contain the GeoVirus dataset (Gritta et al., 2018a) is based on target Wikipedia toponym are used. We use Google 229 articles from WikiNews.7 The articles detail Cloud Natural Language API libraries to tokenize5 global disease outbreaks and epidemics and were the page text and for identifying6 toponyms in the obtained using keywords such as “Bird Flu” and contexts. We use the July 2019 English Wikipedia “Ebola”. Place mentions are manually tagged and dump, which has 1.11M location annotated pages assigned Wikipedia page URLs along with their giving 1.76M training examples. This is split 90/10 global coordinates. In total, this dataset provides for training/development. 2,167 toponyms for evaluation. As an example, consider a short context for WikToR serves as in-domain Wikipedia-based United Kingdom,“The UK consists of four con- evaluation data, while both LGL and GeoVirus stituent countries: England, Scotland, Wales and provide out-of-domain news corpora evaluation. Northern Ireland.”. Tokens in the context are lower cased and used as features, e.g., [“the”, “uk”, “con- 4.2 Unifying evaluation sets sists’, ..., “.”]. Sub-strings referring to locations We use the publicly available version for the three are recognized, extracted and used as features, e.g., datasets used in CamCoder. 8 However, after ana- [“uk”, “england”, “scotland”, ..., “ireland”]. Fi- lyzing examples across the evaluation datasets, we nally, the surface form of the target mention “uk” identified some inconsistencies in location target is used as the third feature. coordinates. First, the WikToR evaluation set delivers anno- 4 Evaluation tations given its reliance on GeoNames DB and We train MLG as a general purpose geocoder and Wikipedia APIs. However, we discovered that Wik- evaluate it on toponym resolution. A strong base- ToR was mapped from an older version of GeoN- line is to choose the most populous candidate loca- ames DB which has a known issue of sign fl