DOI 10.1111/tgis.12238

RESEARCH ARTICLE

Reference data enhancement for geographic information retrieval using linked data

Tiago H. V. M. Moura1 | Clodoveu A. Davis Jr.1 | Frederico T. Fonseca2

1 Universidade Federal de Minas Gerais (UFMG), Avenida Presidente Antônio Carlos, 6627, Belo Horizonte, Brazil
2 Pennsylvania State University, 330 D IST Building, University Park, PA, U.S.A.

Correspondence
Tiago H. V. M. Moura, Universidade Federal de Minas Gerais (UFMG), Avenida Presidente Antônio Carlos, 6627, Belo Horizonte, Brazil.
Email: [email protected]

Abstract
Gazetteers are instrumental in recognizing place names in documents such as Web pages, news, and social media messages. However, creating and maintaining gazetteers is still a complex task. Even though some online gazetteers provide rich sets of geographic names in planetary scale (e.g. GeoNames), other sources must be used to recognize references to urban locations, such as street names, neighborhood names or landmarks. We propose integrating Linked Data sources to create a gazetteer that combines a broad coverage of places with urban detail, including content on geographic and semantic relationships involving places, their multiple names and related non-geographic entities. Our final goal is to expand the possibilities for recognizing, disambiguating and filtering references to places in texts for geographic information retrieval (GIR) and related applications. The resulting ontological gazetteer, named LoG (Linked OntoGazetteer), is accessible through Web services by applications and research initiatives on GIR, text processing, named entity recognition and others. The gazetteer currently contains over 13 million places, 140 million attributes and relationships, and 4.5 million non-geographic entities. Data sources include GeoNames, Freebase, DBPedia and LinkedGeoData, which is based on OpenStreetMap data. An analysis on how these datasets overlap and complement one another is also presented.

KEYWORDS gazetteer, geocoding, linked data, knowledge bases

1 | INTRODUCTION

The relevance of geographic information in Web and mobile applications is undeniable. While searching the Web, users often employ terms or expressions indicating place names or positioning (Backstrom, Kleinberg, Kumar, & Novak, 2008; Sanderson & Kohler, 2004; Wang, Zi, Wang, Lu, & Ma, 2005). Recognizing references to places in documents, social media messages, keyword-based queries and other situations leads to a better understanding of the user's intentions and better query results.

Transactions in GIS 2016; 00: 00-00. wileyonlinelibrary.com/journal/tgis. © 2016 John Wiley & Sons Ltd

Geographic information retrieval (GIR) applications usually rely on reference data (often organized as gazetteers) for recognizing place names. The determination of the geographic scope of documents is an example of such applications, in which gazetteers are employed first for identifying place names in the text (geoparsing) and later for finding their geographic locations (geolocating). This problem is complex because place names are often ambiguous, being identical or very similar to other place names and to terms used to designate non-geographic entities.

An ideal gazetteer would have global coverage with detail down to urban elements, such as streets, addresses, and landmarks. It would also go beyond the traditional representation (Hill, 2000) in the form of a (place name, place type, footprint) triple to include information to support name disambiguation and the determination of the geographic scope of documents. No gazetteer currently offers this set of features, but the information is out there, available as Linked Data sources. Broad coverage can be obtained from global gazetteers, such as GeoNames (http://www.geonames.org). Urban detail can be found, albeit with a strong concentration in more developed countries, in knowledge bases, such as DBPedia (http://dbpedia.org/) and Freebase (http://www.freebase.com/). Volunteered geographic information (VGI) and collaborative sources, such as WikiMapia (http://wikimapia.com) and OpenStreetMap (http://www.openstreetmap.org), provide unique information on urban elements, based on the personal knowledge of citizens.

An alternative to the use of gazetteers would be to employ general purpose knowledge bases.
Many commercial and academic projects are available, including Google (https://developers.google.com/freebase/data), DBPedia (Auer et al., 2007) and YAGO (Suchanek, Kasneci, & Weikum, 2007). Extracting information from these bases requires sophisticated algorithms able to deal with non- or semi-structured sources, such as natural-language text, HTML trees and tables (Dong et al., 2014). On the other hand, Linked Data (Bizer, Heath, & Berners-Lee, 2009a) uses a simple structured format and open protocols over the Web, so that its data remains widely available for processing by simpler algorithms. Our proposal is that all the effort spent in extracting information from non- or semi-structured sources should be redirected to the creation of new applications.

In this work we present a gazetteer crafted using four Linked Data sources. The resulting gazetteer, called LoG (from Linked OntoGazetteer, http://aqui.io/log),1 has worldwide coverage and detailed intra-urban information. As discussed in the next sections, LoG also contains information on non-geographic entities related to places and place names. We also describe an open application programming interface (API), conceived to support GIR applications with LoG data.

The remainder of the article is organized as follows. Section 2 presents related work. Sections 3 and 4 describe the gazetteer enhancements. Section 5 presents an analysis of how the Linked Data sources overlap and complement one another. Section 6 describes the programming interface to access LoG data. Finally, conclusions and future work are presented in Section 7.

2 | RELATED WORK

As the volume of information on the Web grows, the available methods for querying and using information become insufficient. Consider, for instance, the recognition of the geographic intentions of a user that presents a set of keywords to a search engine. Among these keywords, one can find place names (Yi, Raghavan & Leggetter, 2009), indirect references to places (McCurley, 2001; Quercini & Samet, 2014), and expressions related to spatial positioning (Delboni, Borges, Laender, & Davis, 2007) or to spatial relationships (Borges, Davis, Laender, & Medeiros, 2011). If the input is taken as a simple set of keywords, the search engine cannot be expected to cover the full extent of the users' intentions when posting the query. Egenhofer (2002) argues that obtaining adequate geospatial content from Web sources requires going beyond keyword-based methods and taking into account the spatial semantics of positioning expressions (such as inside, crosses, or near).

The creation of semantic resources, such as ontologies, is supposed to enable a new framework of information retrieval based on the meaning of terms and expressions. Abdelmoty, Smart, El-Geresy and Jones (2009) show that place ontologies can be used to encode the meaning of spatial properties (geometrical shape, location, proximity, topological relationships) as well as our usual descriptions of places, in the form of place names and coordinate systems. They showed how these concepts can be adequately encoded using RDF triples, using Wikipedia articles as an example.

Gazetteers can help with the complex problem of spatial discontinuity and administrative divisions with special status (Laurini, 2015). Textual descriptions in gazetteers can lay out these relationships more clearly and directly than is possible topologically in an ontology. Overall, this is a two-way relationship, with gazetteers also being enriched by ontologies (Laurini, 2015).
For instance, Hoffart, Suchanek, Berberich, and Weikum (2013) demonstrate the importance of complementing ontologies with knowledge bases. They use Wikipedia to enhance an ontology (YAGO2) along the dimensions of space and time. Giunchiglia, Dutta, Maltese, and Farazi (2012) combined a gazetteer (GeoNames) with an ontology (WordNet) to create an enhanced ontology called GeoWordNet. Brisaboa, Luaces, Places, and Seco (2010) put together a spatial ontology and a gazetteer to build an index. The index is then used in GIR to solve isolated or combined spatial and textual queries.

However, the usual gazetteer implementation uses a simple data structure, composed of a place name, a type of place and a footprint (usually a simple pair of coordinates) (Hill, 2000). This representation is often insufficient to provide evidence for common GIR tasks, since it does not explore spatial relations between places. A simple point footprint also does not allow the use of spatial relations such as contains, for instance, and geographic relationships are encoded using pairs of keys, as in conventional databases. Therefore, we argue that the expansion of gazetteers towards richer semantics, topological relationships, and descriptions, such as those provided by knowledge bases, can significantly improve the usefulness of gazetteers for GIR tasks.

Previous work (Han and Zhao, 2009; Hoffart et al., 2011; Cucerzan, 2007; Pouliquen et al., 2006) uses knowledge bases and gazetteers to handle the ambiguity problem in GIR applications. As presented by Alencar, Davis, and Gonçalves (2010), Wikipedia is a good external source of evidence, both for name recognition and for disambiguation. Quercini and Samet (2014) point out that Wikipedia articles can be spatially related to each other. They demonstrate how article contents can be used to obtain concepts that are spatially connected, and thereby used to describe or identify a certain place.
Some other initiatives (Bizer et al., 2009b; Dong et al., 2014) also try to extract relevant information from Wikipedia and other non- or semi-structured data sources. However, extracting and querying such reference data can be difficult and time-consuming.

Smart et al. (2010) present a mediation framework to build a meta-gazetteer,2 with data from multiple sources, including Wikipedia, OpenStreetMap and GeoNames. Integration is performed at querying time, retrieving and integrating data from various remote sources. Tanasescu, Smart, and Jones (2014) present another approach to creating a meta-gazetteer, which also uses data from Google Places (https://developers.google.com/places/) and Foursquare (https://foursquare.com/) to recommend toponyms for photo captions. Popescu, Grefenstette and Moëllic (2008) propose Gazetiki, a gazetteer assembled from various heterogeneous online sources, including Wikipedia, that are used to complement GeoNames. Gazetiki expands the traditional gazetteer structure (place name, footprint, place class) by assigning a relevance score to each entry.

In a broad effort to determine the spatial scope of news articles for the NewStand project, Samet et al. (2014) argue that gazetteers are important sources of information for GIR tasks. Gazetteer data are instrumental for toponym recognition (i.e. identifying place names in text) and toponym resolution (i.e. deciding which of the places that are associated to a given name is the correct one). However, solving these problems requires much more than simply recognizing valid place names embedded in text, for instance by comparing candidate words with the contents of a gazetteer, which must include alternative names, nicknames, acronyms and multilingual variations. Other data must be used in order to resolve geo/non-geo and geo/geo ambiguity, which are part, respectively, of toponym recognition and resolution.
For instance, tools can use semantic connections between places or identify terms that are strongly associated to places and use such evidence for disambiguation (Quercini and Samet, 2014; Alencar et al., 2010; Lieberman & Samet, 2012). Current gazetteers, such as GeoNames (used by NewStand), do not include that kind of additional information.

Lieberman, Samet, and Sankaranarayanan (2010) and Quercini, Samet, Sankaranarayanan, and Lieberman (2010) demonstrate the usefulness of local lexicons (gazetteer subsets comprising closely related places) for disambiguation in the context of NewStand. Using local lexicons, a news source is associated to a set of places that fall within its geographic scope of interest, thus helping in the disambiguation task. It is possible to conclude that gazetteers with a broad coverage are necessary in order to include globally known places, while those with local and detailed coverage are required for establishing a narrower context, as in local lexicons. Samet et al. (2014) created techniques to help in solving the ambiguity problem through a map-based user interface. The current zoom level in the map enables disambiguating in a more precise way, which corresponds to heuristically narrowing the geographic scope of interest of the user. This approach also allows using spatial synonyms, i.e. alternative names of spatially-related places, that can be used to expand searches.

Since no single data source currently fulfills these requirements, we decided to investigate the application of Linked Data (Bizer et al., 2009a) over several different data sources, as in the Linking Open Data (LOD) project (http://linkeddata.org/), to enrich existing gazetteers.
The LOD project encourages people to publish data according to the four Linked Data principles (Berners-Lee, 2011): (1) use Uniform Resource Identifiers (URIs) as names for things; (2) use HTTP URIs so that people can look up those names; (3) when someone looks up a URI, provide useful information, using the Resource Description Framework (RDF) standard; and (4) include links to other URIs. Between 2011 and 2014, the LOD project recorded an increase of 271% in the number of data sets published following the Linked Data principles, for a total of 1,091 different data sets (Schmachtenberg, Bizer, & Paulheim, 2014), which points to a growing adoption of that concept.

Integrating LOD data, however, is still not a simple task. The heterogeneity of schemas within the Web of Data makes determining redundant or overlapping entities non-trivial (Freitas, Curry, Oliveira & O'Riain, 2012; Shvaiko & Euzenat, 2005). To fit in the Web of Data environment, data integration solutions must take into account the schema and ontologies defined by the data sources. Manguinhas, Martins & Borbinha (2008) propose a solution to the data integration problem based on simple criteria, such as place type and distance between two possibly duplicate entities. Properly solving this problem is essential to provide better support for GIR applications, keeping fewer representations, expanding the knowledge about real-world objects, and helping to solve ambiguity problems.

In the next sections, we describe the creation of LoG, an enhanced gazetteer created by integrating several Linked Data sources.
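To make the "place type and distance" criteria mentioned above concrete, the following is a minimal sketch of a duplicate test between two gazetteer entries. It is not the method of Manguinhas et al. nor the one used in LoG; the record layout, the `likely_duplicates` helper, the 5 km threshold and the sample coordinates are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def likely_duplicates(a, b, max_km=5.0):
    """Hypothetical rule: same place type and footprints closer than max_km."""
    if a["type"] != b["type"]:
        return False
    return haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_km


# Illustrative records, as if the same city came from two different sources.
city_a = {"name": "Belo Horizonte", "type": "city", "lat": -19.92, "lon": -43.94}
city_b = {"name": "Belo Horizonte", "type": "city", "lat": -19.9167, "lon": -43.9333}
airport = {"name": "Belo Horizonte", "type": "airport", "lat": -19.85, "lon": -43.95}

same_city = likely_duplicates(city_a, city_b)
city_vs_airport = likely_duplicates(city_a, airport)
```

In this sketch the two city records are merged candidates, while the airport is kept apart because its type differs, even though it shares the name.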

3 | GAZETTEER ENHANCEMENTS: NON-PLACES AND FLEXIBLE RELATIONSHIPS

Specially crafted gazetteers can move beyond the traditional structure to organize place-related elements in a way that is meaningful for GIR. However, very few of them are available. One of them is OntoGazetteer (Machado, Alencar, Campos & Davis, 2011), a gazetteer that includes topological and geographic relationships, such as contained by and neighbor to, and is enriched with alternative names, ambiguously named places, and relationships between places. It also implements semantic relationships, through which it is possible to establish connections between places that belong to the same ontological category, for instance state capitals, historical sites or cities along the same highway. Relationships can be used for disambiguation, since a reference to a related geo- or non-geo entity in the same text helps infer which is the correct place.

Figure 1 shows a basic conceptual schema for OntoGazetteer (Machado et al., 2011). The main class of this schema is Place, whose instances represent real world places. The place names are represented in three different ways: (1) as a Place attribute; (2) as an Alternative place instance; or (3) as an Ambiguous name instance. The last two classes keep alternative names for the same place, and a list with identifiers of homonymous places, respectively. Although this schema has been successfully implemented as a relational database, searching for a toponym implies looking into three different structures (Place, Alternative place and Ambiguous name), making it less efficient to detect and solve ambiguities.

Our work can be understood as an extension and a revision of OntoGazetteer, with a focus on the use of Linked Data to implement a gazetteer with worldwide coverage and able to keep information about every type of place, ranging from administrative divisions, populated places and intra-urban information, to geographic features, such as rivers and mountains.
This new gazetteer, built from the integration of many data sources on places, can be populated with related terms, alternative names, relationships of various kinds, and other related entities. In our implementation, we created an alternative schema for an enhanced gazetteer, which is more efficient for search and retrieval tasks. Our goal is to provide broader support to GIR, going beyond existing reference data sources by providing evidence that can be used in tasks such as disambiguation and in the recognition of the geographic scope of documents. The primary Linked Data sources for our gazetteer include a traditional global gazetteer, knowledge bases, and geographic information derived from OpenStreetMap.

FIGURE 1 OntoGazetteer conceptual schema from Machado et al. (2011)

Our revised conceptual schema is shown in Figure 2. The new schema is simpler than OntoGazetteer's, replacing the classes that keep alternative names and relationships with separate classes for places and for place names. Therefore, the new schema includes two new classes, Non-place and Name. The instances of Name can be related with Non-place and Place instances through many-to-many relationships. Two or more places can be related to the same name, indicating a geo/geo ambiguity, or even share a name with non-places, indicating a geo/non-geo ambiguity. Also, places and non-places can have many alternative names (multi-lingual, nicknames, popular acronyms). Alternative names are identified using the isAlternative attribute. Entities that do not represent real-world places are kept as Non-place instances. For example, the (non-place) event "FIFA World Cup 2014" is related to the place "Belo Horizonte", one of the host cities in the competition. The relationship between places and non-places is defined as relatedTo, which means that all kinds of relationship can be recorded. The relationship attribute relType keeps the nature of each relationship, as encoded in Linked Data predicates, thus preserving their original flexibility.
Topological and semantic relationships between places are represented using a many-to-many self-relationship in the Place class. Likewise, the nature of the relationship between Place and Non-place is recorded in the relationship attribute relType. Both places and non-places are also categorized through their relationship to one or more instances of the Type class.

Overall, the schema in Figure 2 contains only many-to-many relationships. Considering the need to efficiently query such relationships in a large database, along with the need to navigate between places, non-places, names and types for solving most GIR-related tasks, we decided to map the schema to a directed and labeled graph. The graph is defined as G(V, E), where V is the set of all Place, Non-place, Name and Type instances, and E is the set composed of the relationships defined on the schema. Every edge e ∈ E has a label, which keeps the nature of the relationship. Only the edges between Name and Place or Non-place are unlabeled. Also, vertices may have attributes. A graph structure is also flexible enough to allow future evolution and the addition of new elements.

FIGURE 2 Linked gazetteer schema

After considering some possibilities, we implemented the enhanced gazetteer using Titan (http://thinkaurelius.github.io/titan/), a NoSQL graph database manager based on Apache Cassandra (http://cassandra.apache.org/). Titan provides many features that allowed us to implement the graph exactly as in the conceptual schema. In Titan, a vertex is seen as a bag of properties, and the vertex set is not required to share properties or a predefined structure. Also, properties do not need to be defined before use; they can be created on the fly. It is possible to associate constraints, such as unique (the value of this property is unique in each vertex), single (each vertex can have only one property of this kind), and list (multivalued properties), with any property. Furthermore, Titan automatically creates indexes for unique and single properties. Edges are similarly implemented, with a required label to identify the edge type and the possibility to be directed or not. Edges can also have properties, and these properties can be constrained and indexed.

Titan is specialized in running fast local graph computations, which are frequently required when exploring the graph neighborhood of elements such as places and names, searching, for instance, for homonyms or alternative names. When a vertex is created, the DBMS generates an ID property, through which accessing the vertex is fastest. Indexes can also be used to run fast searches using vertex or edge properties. The easiest way to query a Titan database is to use a graph traversal language, called Gremlin (https://github.com/tinkerpop/gremlin/wiki), built into the DBMS.
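The graph G(V, E) described above can be illustrated with a small in-memory sketch. This is not Titan or Gremlin code, only a plain Python stand-in; the `LabeledGraph` class, the vertex identifiers, the choice to point unlabeled edges from entities to names, and the sample label `example:hostCity` are assumptions made for this example.

```python
from collections import defaultdict


class LabeledGraph:
    """Directed graph whose vertices carry attributes and whose edges carry an optional label."""

    def __init__(self):
        self.vertices = {}                   # vertex id -> attribute dict (includes "kind")
        self.out_edges = defaultdict(list)   # source id -> [(label, target id)]
        self.in_edges = defaultdict(list)    # target id -> [(label, source id)]

    def add_vertex(self, vid, kind, **attrs):
        self.vertices[vid] = {"kind": kind, **attrs}

    def add_edge(self, src, dst, label=None):
        # label=None models the unlabeled edges between entities and names.
        self.out_edges[src].append((label, dst))
        self.in_edges[dst].append((label, src))

    def entities_named(self, name_id):
        """Places and non-places attached to a Name vertex."""
        return sorted(src for _label, src in self.in_edges[name_id])

    def ambiguity_kinds(self, name_id):
        """Report geo/geo and geo/non-geo ambiguity for one name."""
        kinds = [self.vertices[v]["kind"] for v in self.entities_named(name_id)]
        found = set()
        if kinds.count("Place") > 1:
            found.add("geo/geo")
        if "Place" in kinds and "Non-place" in kinds:
            found.add("geo/non-geo")
        return found


g = LabeledGraph()
g.add_vertex("name:springfield", "Name", value="Springfield", isAlternative=False)
g.add_vertex("place:springfield_il", "Place")
g.add_vertex("place:springfield_ma", "Place")
g.add_edge("place:springfield_il", "name:springfield")
g.add_edge("place:springfield_ma", "name:springfield")

# A non-place related to a place; the label plays the role of relType.
g.add_vertex("place:belo_horizonte", "Place")
g.add_vertex("nonplace:fifa_world_cup_2014", "Non-place")
g.add_edge("nonplace:fifa_world_cup_2014", "place:belo_horizonte", label="example:hostCity")

homonyms = g.entities_named("name:springfield")
kinds = g.ambiguity_kinds("name:springfield")
```

Finding homonyms is then a one-hop neighborhood lookup from the Name vertex, which is the kind of local traversal the text says a graph database handles efficiently.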
Since all data sources are linked data, it would be possible, in principle, to implement queries to data sources directly. This alternative would mean querying individual sources separately, then dynamically integrating their results before using them in GIR applications. Smart, Jones, and Twaroch (2010) use this approach and report several problems and difficulties, such as prioritizing sources, matching entities and augmenting found entities. Furthermore, no mention is made of response times and latency. For this work, we opted to take a different approach. We collect data from individual sources, integrate them with a high level of certainty, and filter out unwanted or unnecessary details, so that a concise and efficient set of services can be rendered through an API to GIR applications.

FIGURE 3 Decision path to map a triple into LoG

4 | GAZETTEER ENHANCEMENTS: LINKED DATA

Linked Data standards define that all published data must use the RDF data structure. This means that all information is represented as triples with the following structure: {subject, predicate, [object, literal]}. Many linked data sources use general-purpose triple stores to manage their data. Adding new data is simple with triple stores, since there is no explicit schema beyond the triple format itself. However, the lack of documentation on the actual predicates used in the triples and the semantics behind them makes it harder to use the data. Users need to know about the actual predicates, either when adding data or when formulating queries, but usually data sources do not provide such information.

We established a two-step mapping from RDF triples into a graph to obtain LoG contents from Linked Data. First, we match the triples to our basic schema (Figure 2), and then we map them to the graph implementation. We followed the decision path shown in Figure 3 to map each RDF triple in each data source into our schema. The answer to each question depends on the data representation used by each data source, and will be detailed in the next subsections. The five grey boxes in Figure 3 are possible endpoints in the process, with useful information being stored and triples not related to places being discarded.

Each data source has a different set of predicates that characterize relationships between places. Sources also vary as to the availability of data on non-places related to places, and the predicates that are used for that purpose. We analyzed each data source to find the most frequent predicates that relate useful entities to places. The next subsections present our findings and decisions for each source, after a brief description of the source and its main characteristics.
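The idea of routing each triple to an endpoint can be sketched as a small classification function. The exact questions and endpoints of Figure 3 are not reproduced in the text, so the outcome names below (`name`, `attribute`, `place_relationship`, `non_place_relationship`, `discard`), the `classify_triple` helper and the sample resource URIs are assumptions, not the authors' decision path.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def is_uri(term):
    """Treat anything that looks like an HTTP URI as an entity; everything else is a literal."""
    return term.startswith("http://") or term.startswith("https://")


def classify_triple(triple, places, name_preds, attr_preds, place_rel_preds):
    """Return a hypothetical endpoint for one (subject, predicate, object) triple."""
    subject, predicate, obj = triple
    if subject in places:
        if predicate in name_preds:
            return "name"
        if predicate in attr_preds:
            return "attribute"
        if obj in places:
            return "place_relationship" if predicate in place_rel_preds else "discard"
        if is_uri(obj) and predicate != RDF_TYPE:
            return "non_place_relationship"
        return "discard"
    if obj in places and is_uri(subject):
        # A non-place entity pointing at a known place.
        return "non_place_relationship"
    return "discard"


DBO = "http://dbpedia.org/ontology/"
DBR = "http://dbpedia.org/resource/"          # illustrative resource URIs
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

places = {DBR + "Belo_Horizonte", DBR + "Brazil"}
name_preds = {FOAF_NAME, "http://xmlns.com/foaf/0.1/nick"}
attr_preds = {"http://www.georss.org/georss/point"}
place_rel_preds = {DBO + p for p in ("country", "isPartOf", "location", "state", "region")}

examples = [
    (DBR + "Belo_Horizonte", FOAF_NAME, "Belo Horizonte"),
    (DBR + "Belo_Horizonte", DBO + "country", DBR + "Brazil"),
    (DBR + "2014_FIFA_World_Cup", DBO + "location", DBR + "Belo_Horizonte"),
    (DBR + "Some_Person", FOAF_NAME, "Some Person"),
]
outcomes = [
    classify_triple(t, places, name_preds, attr_preds, place_rel_preds) for t in examples
]
```

The per-source subsections then amount to choosing the contents of the predicate sets passed to such a function.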

4.1 | DBPedia

DBPedia is a general purpose knowledge base. Information in DBPedia originates from semi-structured information (infoboxes) in Wikipedia. DBPedia has been part of the LOD project since its beginning, and it is one of its largest datasets. Every entity in DBPedia has one or more types. A type is defined by the predicate http://www.w3.org/1999/02/22-rdf-syntax-ns#type and usually points to an object that defines an abstract type concept, such as thing, person, or place. Following this definition, a subject is a place if, and only if, it is described in a triple whose object is a "place" concept, identified by the URI http://dbpedia.org/ontology/Place.

TABLE 1 DBPedia predicates that represent relationships between places – db: http://dbpedia.org/ontology

Predicate      Linked OntoGazetteer
db:country     Place –relatedTo (db:country)–> Place
db:isPartOf    Place –relatedTo (db:isPartOf)–> Place
db:location    Place –relatedTo (db:location)–> Place
db:state       Place –relatedTo (db:state)–> Place
db:region      Place –relatedTo (db:region)–> Place

TABLE 2 GeoNames predicates that represent vertex attributes – gn: http://www.geonames.org/ontology#, wgs84_pos: http://www.w3.org/2003/01/geo/wgs84_pos#

Predicate             Linked OntoGazetteer
wgs84_pos:lat         Place.gnPoint.y
wgs84_pos:long        Place.gnPoint.x
gn:featureClass       Place.gnFeatureClass
gn:featureCode        Place.gnFeatureCode
gn:wikipediaArticle   Place.dbpediaId

DBPedia has many predicates that define place names. We chose to work with the two most frequent ones: Name (http://xmlns.com/foaf/0.1/name) and Nick (http://xmlns.com/foaf/0.1/nick). Only one vertex attribute is extracted from DBPedia: the geographic representation of a place, which in this case is limited to a single pair of coordinates. The predicate that defines a geographic representation is http://www.georss.org/georss/point. Table 1 shows all selected predicates from DBPedia and their corresponding representation for LoG.

Every predicate between a selected place and a non-place was imported. Every non-place that is related to a place and has at least one name was selected to be included in LoG. At the end of the process, we selected 639,462 places from DBPedia, encoded in seven million triples. Also, 594,026 non-places were selected. All relationships were stored as attributes in the PlaceRel class.
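The selection rule for DBPedia (type test, two name predicates, one point predicate) can be sketched over plain tuples as follows. The predicate URIs come from the text; the `extract_places` helper, the output layout, the choice to flag foaf:nick values as alternative names, and the sample triples are assumptions made for illustration. The point literal is parsed as "latitude longitude", which is the order used by GeoRSS simple points.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DBPEDIA_PLACE = "http://dbpedia.org/ontology/Place"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_NICK = "http://xmlns.com/foaf/0.1/nick"
GEORSS_POINT = "http://www.georss.org/georss/point"


def extract_places(triples):
    """Collect names and a point for every subject typed as a DBPedia Place."""
    triples = list(triples)
    place_ids = {s for s, p, o in triples if p == RDF_TYPE and o == DBPEDIA_PLACE}
    places = {pid: {"names": [], "point": None} for pid in place_ids}
    for s, p, o in triples:
        if s not in places:
            continue  # subject is not a place, so this pass ignores the triple
        if p in (FOAF_NAME, FOAF_NICK):
            # Assumption: nicknames are stored as alternative names.
            places[s]["names"].append({"value": o, "isAlternative": p == FOAF_NICK})
        elif p == GEORSS_POINT:
            lat, lon = (float(v) for v in o.split())
            places[s]["point"] = {"x": lon, "y": lat}
    return places


BH = "http://dbpedia.org/resource/Belo_Horizonte"   # illustrative URIs and values
PERSON = "http://dbpedia.org/resource/Some_Person"
sample = [
    (BH, RDF_TYPE, DBPEDIA_PLACE),
    (BH, FOAF_NAME, "Belo Horizonte"),
    (BH, FOAF_NICK, "BH"),
    (BH, GEORSS_POINT, "-19.9167 -43.9333"),
    (PERSON, RDF_TYPE, "http://dbpedia.org/ontology/Person"),
    (PERSON, FOAF_NAME, "Some Person"),
]
dbpedia_places = extract_places(sample)
```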

4.2 | GeoNames

GeoNames is one of the largest gazetteers available on the Web. It keeps information about more than eight million places, and its main data sources are official cartographic agencies (http://www.geonames.org/data-sources.html). Since GeoNames only stores data from real world places, the answer to the first question in the decision path for this data source is always true. We chose two predicates from GeoNames that define names: http://www.geonames.org/ontology#name and http://www.geonames.org/ontology#alternateName.

Table 2 shows all predicates from GeoNames mapped to LoG as vertex attributes. Unlike DBPedia, GeoNames keeps the geographic representation split into two different attributes, one for latitude and one for longitude. GeoNames uses two attributes, Feature Class and Feature Code, to characterize place types. Both were selected and mapped as vertex properties in LoG. The last attribute selected from GeoNames is the URL for the Wikipedia article corresponding to each place, when available.

Table 3 summarizes GeoNames predicates indicating relationships between places and shows how each of them is mapped into LoG. All places in GeoNames were added to LoG, comprising more than 8.4 million places, obtained from almost 80 million RDF triples.
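The attribute mapping in Table 2 lends itself to a table-driven sketch, in which each predicate is looked up in a dictionary and written to the corresponding vertex property. The predicate-to-property pairs follow Table 2; the `build_vertex` helper, the nested-dictionary layout for gnPoint, storing the Wikipedia URL unchanged under dbpediaId, and the sample subject and values are assumptions.

```python
WGS84 = "http://www.w3.org/2003/01/geo/wgs84_pos#"
GN = "http://www.geonames.org/ontology#"

# predicate -> (vertex property, point component or None), following Table 2
ATTRIBUTE_MAP = {
    WGS84 + "lat": ("gnPoint", "y"),
    WGS84 + "long": ("gnPoint", "x"),
    GN + "featureClass": ("gnFeatureClass", None),
    GN + "featureCode": ("gnFeatureCode", None),
    GN + "wikipediaArticle": ("dbpediaId", None),
}


def build_vertex(triples):
    """Fold the triples of one GeoNames place into a dictionary of vertex properties."""
    vertex = {}
    for _subject, predicate, value in triples:
        target = ATTRIBUTE_MAP.get(predicate)
        if target is None:
            continue  # predicate is not one of the selected attributes
        prop, component = target
        if component is None:
            vertex[prop] = value
        else:
            # Latitude and longitude arrive in separate triples and are merged into one point.
            vertex.setdefault(prop, {})[component] = float(value)
    return vertex


PLACE = "http://example.org/geonames/place/1"   # hypothetical subject URI
sample_triples = [
    (PLACE, WGS84 + "lat", "-19.92083"),
    (PLACE, WGS84 + "long", "-43.93778"),
    (PLACE, GN + "featureClass", GN + "P"),
    (PLACE, GN + "featureCode", GN + "P.PPLA"),
    (PLACE, GN + "wikipediaArticle", "http://en.wikipedia.org/wiki/Belo_Horizonte"),
    (PLACE, GN + "population", "2373224"),       # not selected, so it is skipped
]
gn_vertex = build_vertex(sample_triples)
```

Keeping the mapping in a dictionary means the Freebase and LinkedGeoData attribute tables could reuse the same function with a different map.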

TABLE 3 GeoNames predicates that represent relationships between places – gn: http://www.geonames.org/ontology#

Predicate             Linked OntoGazetteer
gn:parentCountry      Place –relatedTo (gn:parentCountry)–> Place
gn:parentFeature      Place –relatedTo (gn:parentFeature)–> Place
gn:parentADM[12345]   Place –relatedTo (gn:parentADM[12345])–> Place

TABLE 4 Freebase predicates representing vertex attributes – fbsds: http://rdf.freebase.com/key/user.metaweb.datasource, fbs: http://rdf.freebase.com/key/, fbsloc: http://rdf.freebase.com/ns/location

Predicate                  Linked OntoGazetteer
fbsds:geonames             Place.geonamesId
fbs:wiki.en_title          Place.dbpediaId
fbsloc:geocode.latitude    Place.fbPoint.y
fbsloc:geocode.longitude   Place.fbPoint.x

4.3 | Freebase

Freebase is a knowledge base supported by Google and open to user contributions. Like DBPedia, it was composed using non- or semi-structured information from Wikipedia. Freebase also works as a general purpose data source, keeping categorized records for entities such as songs, books, locations, and people. Freebase defines types in a way that is similar to DBPedia, using a predicate that defines an entity's type (http://rdf.freebase.com/ns/type.object.type). However, the object that defines the concept of place is named Location and is defined by the URI http://rdf.freebase.com/ns/location.location. In other words, an entity is a place if it appears in a triple in which the predicate Type points to Location.

Two predicates that are used to associate names to entities are considered in this work. They are defined by the URIs http://rdf.freebase.com/ns/type.object.name and http://rdf.freebase.com/ns/common.topic.alias. Freebase keeps references to DBPedia and GeoNames using specific predicates, as shown in Table 4. As in the case of GeoNames, the geographic coordinate is recorded using two different attributes, indicating latitude and longitude. Every triple that contains one of the predicates listed in Table 4 was selected and stored as a vertex attribute.

We also selected predicates from Freebase that define relationships between places. Table 5 shows the four most frequent predicates in Freebase that are used to relate two places. As in the case of DBPedia data, a non-place must have a name and be related with at least one place to be included in LoG. However, Freebase has a much more complex ontology with almost eight times more predicates describing the relationships between places and non-places. Only 158 of these predicates are used in more than 1,000 triples each, so we decided to eliminate predicates that are only rarely used. We selected the 31 most used predicates, representing 90% of all triples in Freebase.
Some of the most common predicates regard birth places (http://rdf.freebase.com/ns/people.person.place_of_birth), nationalities (http://rdf.freebase.com/ns/people.place_lived.location), and event locations (http://rdf.freebase.com/ns/time.event.locations). The result was a set of 1,642,081 places and 3,902,886 non-places, from more than 26 million triples.
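Selecting the most used predicates up to a coverage target can be sketched with a frequency count. The function below is an illustration of that idea only; the `select_frequent_predicates` name, the integer percentage parameter and the toy triples are assumptions, and the real selection over Freebase may have used different tooling or tie-breaking.

```python
from collections import Counter


def select_frequent_predicates(triples, coverage_percent=90):
    """Pick the most frequent predicates until they cover the requested share of triples."""
    counts = Counter(predicate for _s, predicate, _o in triples)
    total = sum(counts.values())
    selected, covered = [], 0
    for predicate, n in counts.most_common():
        # Integer arithmetic avoids floating-point surprises at the threshold.
        if covered * 100 >= coverage_percent * total:
            break
        selected.append(predicate)
        covered += n
    return selected


# Toy data: predicate "a" appears 6 times, "b" 3 times, "c" once.
toy_triples = (
    [("s%d" % i, "a", "o") for i in range(6)]
    + [("s%d" % i, "b", "o") for i in range(3)]
    + [("s0", "c", "o")]
)
kept = select_frequent_predicates(toy_triples, coverage_percent=90)
```

With the toy data, "a" and "b" together account for nine of ten triples, so the rarely used "c" is dropped.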

TABLE 5 Freebase predicates representing relationships between places – fbsloc: http://rdf.freebase.com/ns/location.location, fbsbib: http://rdf.freebase.com/ns/base.biblioness

Predicate                      Linked OntoGazetteer
fbsbib:bibs_location.country   Place –relatedTo (fbsbib:bibs_location.country)–> Place
fbsloc:containedby             Place –relatedTo (fbsloc:containedby)–> Place
fbsloc:contains                Place –relatedTo (fbsloc:contains)–> Place
fbsloc:nearby_airports         Place –relatedTo (fbsloc:nearby_airports)–> Place

TABLE 6 LinkedGeoData predicates representing vertex attrib- utes – lgdo: http://linkedgeodata.org/ontology/, geo: http:// www.w3.org/2003/01/geo/wgs84_pos

Predicate | Linked OntoGazetteer
lgdo:geonames_id | Place.geonamesId
lgdo:wikipedia | Place.dbpediaId
geo#lat | Place.lgdpoint.y
geo#long | Place.lgdpoint.x

4.4 | LinkedGeoData

LinkedGeoData is a data source built from OpenStreetMap (OSM), a project that allows for collaborative mapping using data supplied by volunteers. LinkedGeoData obtained OSM data and republished it following the Linked Data principles. Since OSM receives numerous contributions in urban areas, the contents of LinkedGeoData were expected to increase the amount of intra-urban information in LoG. However, we found that the data can be skewed due to variations in OSM popularity around the world.

Considering the four data sources used in our work, LinkedGeoData has the most confusing ontology. There is a huge variety of predicates and no documentation. However, the entity type definition is quite similar to the other sources. A type predicate is defined by http://www.w3.org/1999/02/22-rdf-syntax-ns#type, and there is an object that defines the concept of place (http://linkedgeodata.org/ontology/Place). We selected the six most frequent predicates used in LinkedGeoData to specify entity names, since the wide variety of predicates caused most of them to be used in only a small number of instances. The selected predicates are:

http://www.w3.org/2000/01/rdf-schema#label
http://www.w3.org/2004/02/skos/core#altLabel
http://linkedgeodata.org/ontology/internationalName
http://linkedgeodata.org/ontology/alt_name_be
http://linkedgeodata.org/ontology/name_genitive
http://linkedgeodata.org/ontology/officialName

As in Freebase, LinkedGeoData keeps links to other data sources, such as GeoNames and Wikipedia, and also stores the geographic coordinates of places divided into two different predicates. Table 6 shows the four predicates selected from LinkedGeoData to be mapped as vertex attributes.

There is another particular issue with LinkedGeoData that sets it apart from the other data sources. Most predicates that relate two places violate the Linked Data principle on using URIs to identify entities. For instance, the predicate http://linkedgeodata.org/ontology/is_in, which materializes the topological relationship contained by, is used to relate a place URI with a literal string containing a sequence of place names that form a territorial hierarchy. Figure 4 shows an example, in which the URI for the city of Belo Horizonte is related to two places simultaneously: the state of Minas Gerais and the country of Brazil. There should be two triples for this situation, and the literals for Minas Gerais and Brazil should be replaced by their corresponding URIs. Using names instead of unique identifiers can introduce ambiguity in the data, among other problems. In this particular case, since the literal string follows a definite pattern, we were able to separate the references to places in the hierarchy and create correct relationships for LoG. The algorithm that performs this task checks for existing places already in LoG (from other sources) to make sure the relationships are correct. Figure 5 shows the corrected triples.

FIGURE 4 Sample of the isIn predicate use in LinkedGeoData

FIGURE 5 Representation of the triple in Figure 4 after correction

LinkedGeoData includes about 6,000 predicates used to relate places, often with a semantic overlap. For that reason, we selected 16 predicates based on the number of occurrences (Table 7). At the end of the process, LinkedGeoData contributed 3,600,880 places, described using 28,047,183 triples, to LoG.
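The correction of is_in literals described above can be sketched as follows. The sketch assumes that the literal is a comma-delimited sequence of toponyms and that a name index over places already loaded from other sources is available; the function name, the index structure and the example URIs are hypothetical and only illustrate the idea of replacing names by URIs when the match is unambiguous.

```python
IS_IN = "http://linkedgeodata.org/ontology/is_in"


def split_is_in(subject_uri, literal, name_index):
    """Turn one is_in literal into URI-to-URI triples where possible.

    name_index (assumed structure) maps a case-folded place name to the
    list of URIs of places already known to the gazetteer.
    """
    resolved, unresolved = [], []
    for raw in literal.split(","):
        name = raw.strip()
        if not name:
            continue
        candidates = name_index.get(name.casefold(), [])
        if len(candidates) == 1:
            # Exactly one known place carries this name: safe to link.
            resolved.append((subject_uri, IS_IN, candidates[0]))
        else:
            # Unknown or ambiguous toponym: leave it out of the graph.
            unresolved.append(name)
    return resolved, unresolved
```

For example, with an index in which "minas gerais" and "brazil" each map to a single URI, the literal "Minas Gerais, Brazil" attached to the Belo Horizonte URI yields two separate triples, one per containing place. A name that matches several known places is returned as unresolved rather than guessed.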

5 | GOOD AND BAD NEWS: AN ANALYSIS OF LINKED DATA SOURCES

Data integration for LoG took place during data import. After defining the mapping of data from the original triples to LoG’s schema, and then to the graph structure, the database was created. The integration used the criteria and the algorithm proposed by Moura and Davis Jr (2014) and is illustrated by Figure 6. In broad terms, the data integration algorithm uses geographic relationships and properties to confirm that two separate entities refer, in fact, to the same place, with a high degree of certainty. The process starts with a list of all ambiguous places, and for each pair it verifies whether: (1) the places have a sameAs link or share a Wikipedia article; (2) both places are in the same country; (3) both places are contained by at least one other place; and (4) the distance between them is no greater than five kilometers. Only the first criterion is meant to certify whether the places are the same; all the other criteria are used together to achieve a higher degree of certainty in the integration.

Simultaneously, data for an analysis of the overlap among sources was collected. Figure 7 shows how many places were imported from each data source. As expected, the two geographic data sources made the largest contributions. The knowledge bases were also important, contributing more than four million non-place entities.
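The four pairwise checks listed above can be sketched in a few lines. This is only one possible reading of the criteria, not the algorithm of Figure 6: the Place record, its field names, the interpretation of criterion (3) as sharing at least one containing place, and the rule that combines the criteria (an explicit link, or all three supporting criteria together) are assumptions made for illustration. Distances are computed with the haversine formula on a spherical Earth.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Place:
    """Hypothetical record holding only what the four checks need."""
    uri: str
    lat: float
    lon: float
    country: str
    same_as: set = field(default_factory=set)      # URIs linked by sameAs
    wikipedia: Optional[str] = None                # associated article, if any
    containers: set = field(default_factory=set)   # URIs of containing places


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers (mean Earth radius 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def evaluate_criteria(a, b, max_km=5.0):
    """Evaluate criteria (1) to (4) for a pair of candidate places."""
    return {
        "linked": (b.uri in a.same_as or a.uri in b.same_as
                   or (a.wikipedia is not None and a.wikipedia == b.wikipedia)),
        "same_country": a.country == b.country,
        "shared_container": bool(a.containers & b.containers),
        "nearby": haversine_km(a.lat, a.lon, b.lat, b.lon) <= max_km,
    }


def is_same_place(a, b, max_km=5.0):
    """Assumed decision rule: explicit link, or all supporting criteria."""
    c = evaluate_criteria(a, b, max_km)
    return c["linked"] or (c["same_country"] and c["shared_container"] and c["nearby"])
```

Returning the individual criteria separately from the decision keeps the combination rule easy to change, which matters here because the exact way the criteria interact is given only in Figure 6.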

TABLE 7 LinkedGeoData predicates that represent relationships between places – lgdo: http://linkedgeodata.org/ontology/

Predicate | Linked OntoGazetteer
lgdo:addr_country | Place –relatedTo (lgdo:addr_country)–> Place
lgdo:addr_district | Place –relatedTo (lgdo:addr_district)–> Place
lgdo:addr_region | Place –relatedTo (lgdo:addr_region)–> Place
lgdo:addr_subdistrict | Place –relatedTo (lgdo:addr_subdistrict)–> Place
lgdo:addr:postcode | Place –relatedTo (lgdo:addr:postcode)–> Place
lgdo:cladr_code | Place –relatedTo (lgdo:cladr_code)–> Place
lgdo:is_in_country_code | Place –relatedTo (lgdo:is_in_country_code)–> Place
lgdo:is_in_country | Place –relatedTo (lgdo:is_in_country)–> Place
lgdo:is_in_municipality | Place –relatedTo (lgdo:is_in_municipality)–> Place
lgdo:is_in_province | Place –relatedTo (lgdo:is_in_province)–> Place
lgdo:is_in_region | Place –relatedTo (lgdo:is_in_region)–> Place
lgdo:is_in_state_code | Place –relatedTo (lgdo:is_in_state_code)–> Place
lgdo:is_in_state | Place –relatedTo (lgdo:is_in_state)–> Place
lgdo:is_in_village | Place –relatedTo (lgdo:is_in_village)–> Place
lgdo:isIn | Place –relatedTo (lgdo:isIn)–> Place
lgdo:postal_code | Place –relatedTo (lgdo:postal_code)–> Place

FIGURE 6 Data integration algorithm

Our previous work shows the overlap between GeoNames and DBPedia (Moura & Davis Jr, 2014). Of all DBPedia places, 426,317 (66.67%) were also found in GeoNames. These places correspond to less than 5% of the set obtained from GeoNames. Most places found in both data sources are classified as cities, administrative divisions, and countries. For instance, 86.3% of the places classified as cities in DBPedia were also found in GeoNames. Places that were not merged are often intra-urban (mostly found in DBPedia, but not in GeoNames) or geographic features, such as rivers and mountains (found in GeoNames, but rarely in DBPedia).

We now discuss the integration of data from Freebase into the combination of GeoNames and DBPedia. At the end of the integration of Freebase, 924,853 of 1,642,081 (56.32%) places could not be merged with places from DBPedia and GeoNames, and thus were added to LoG as new entries. The places that were successfully integrated were categorized into four groups. The first group has places integrated only with DBPedia and the second only with GeoNames. The third group includes places merged with both DBPedia and GeoNames. The fourth group contains the places that are associated with a Wikipedia article, but do not exist in DBPedia. Table 8 summarizes the integration results for each group.

While adding Freebase as the third data source, we found an interesting side effect: Freebase’s entities helped integrate DBPedia and GeoNames entities that had not been merged previously. A total of 268 places from Freebase were associated with distinct places originating from GeoNames and DBPedia, and therefore Freebase acted as a “bridge” that allowed the merging of places from the two initial sources. However, manual inspection showed that some errors occurred in these transitive relationships. Figure 8 shows an example of this problem, in which GeoNames entity 1819727 (Hong Kong island) should be connected to the entity identified in Freebase as m.018nxh, but it is incorrectly linked to Freebase m.03h64 (Hong Kong city) and thus would provoke an undue merging.

FIGURE 7 Sources of places for LoG

TABLE 8 Integration of Freebase to GeoNames and DBPedia

Group | # places | % of all places
Freebase + DBPedia | 337,498 | 20.5%
Freebase + GeoNames | 61 | < 0.01%
Three sources | 2,882 | 0.175%
Linked to Wikipedia | 376,787 | 22.9%
Total | 717,228 | 43.68%

Freebase has a large and specific type hierarchy, which makes it harder to determine the kind of place that is most commonly integrated with GeoNames and DBPedia. Furthermore, the number of Linked Data predicates that connect Freebase and GeoNames is too small to permit drawing any conclusions. Nevertheless, it is interesting to emphasize that 92% of the integrated places were categorized as cities (http://rdf.freebase.com/ns/location.citytown). This observation is reinforced by an analysis of the connections we established between Freebase and DBPedia. Table 9 shows some Freebase place types, selected to show how DBPedia overlaps with Freebase.

Places categorized as statistical regions in Freebase (http://rdf.freebase.com/ns/location.statistical_region) frequently contain official information, such as that provided by a census bureau. As statistical units, most of these places are actually cities, which explains the small difference (4.99%) between the two top categories in Table 9. All places that were found in all three sources so far are also included in the statistical region category, and 96% of them are cities. Notice also that Freebase contributed some intra-urban places, such as bus stops, tourist attractions, airports and roads, adding new elements to the results of the integration between GeoNames and DBPedia.

Next, we integrated triples from LinkedGeoData into LoG. The main issue with this data source, as described earlier, is that the isIn predicates encode place hierarchies using place names instead of URIs.
This problem probably derives from the mapping between OpenStreetMap data and LinkedGeoData, and indicates that there may be opportunities for gazetteer enhancement by obtaining data directly from OSM and other GIS sources. In LinkedGeoData, we found 718,981 places that appear as the subject of the isIn predicate, linked to 44,156 different literal strings containing place names. These strings are composed of 155,244 comma-delimited toponyms, 70,695 of which were unambiguously identified. The remaining triples were not integrated, mainly because the necessary disambiguation of the literal string could not be safely performed. Notice that, since the data comes from OpenStreetMap, the use of literal strings could probably be avoided, and unambiguous triples could be produced using GIS functions. This problem considerably reduced the success rate of the procedure with LinkedGeoData, resulting in only 17,755 of 3,600,880 places being successfully merged.

Table 10 summarizes the results of the integration process, showing the number of places in LoG that were obtained from each data source and combinations. We emphasize the relatively small overlap among the four data sources, which indicates their complementarity.

FIGURE 8 An example of mismatch in transitive relationships
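The transitive effect illustrated in Figure 8 can be reproduced with a small union-find sketch, in which every cross-source link is treated as an equivalence and places are merged by connected component. The sketch is not the integration code used for LoG; the class, the function, and the DBPedia identifier in the usage note are hypothetical.

```python
from collections import defaultdict


class DisjointSet:
    """Minimal union-find structure over hashable identifiers."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def merge_clusters(links):
    """Group identifiers connected by any chain of links."""
    ds = DisjointSet()
    for a, b in links:
        ds.union(a, b)
    clusters = defaultdict(set)
    for node in list(ds.parent):
        clusters[ds.find(node)].add(node)
    return list(clusters.values())
```

Given the links ("geonames:1819727", "freebase:m.03h64") and ("freebase:m.03h64", "dbpedia:HongKongCity"), the three identifiers end up in a single cluster, so the GeoNames record for Hong Kong island is merged with the city. A single wrong link on the bridge is enough to propagate the error, which is why such transitive merges deserve a separate check.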

TABLE 9 Freebase’s place types and integration with DBPedia only

Place type URI | % places | # places
http://rdf.freebase.com/ns/location.statistical_region | 40.22% | 135,730
http://rdf.freebase.com/ns/location.citytown | 35.23% | 118,895
http://rdf.freebase.com/ns/architecture.structure | 10.02% | 33,809
http://rdf.freebase.com/ns/geography.geographical_feature | 9.34% | 31,530
http://rdf.freebase.com/ns/architecture.building | 6.13% | 20,692
http://rdf.freebase.com/ns/location.administrative_division | 5.63% | 19,005
http://rdf.freebase.com/ns/geography.body_of_water | 4.95% | 16,721
http://rdf.freebase.com/ns/base.aareas.schema.administrative_area | 4.14% | 13,981
http://rdf.freebase.com/ns/metropolitan_transit.transit_stop | 3.42% | 11,546
http://rdf.freebase.com/ns/geography.river | 3.13% | 10,564
http://rdf.freebase.com/ns/transportation.road | 2.27% | 7,666
http://rdf.freebase.com/ns/aviation.airport | 2.17% | 7,332
http://rdf.freebase.com/ns/geography.mountain | 1.89% | 6,392
http://rdf.freebase.com/ns/travel.tourist_attraction | 1.88% | 6,357
http://rdf.freebase.com/ns/geography.lake | 1.39% | 4,679
http://rdf.freebase.com/ns/architecture.venue | 1.26% | 4,247
http://rdf.freebase.com/ns/geography.island | 1.21% | 4,100

The integration results show that GeoNames and DBPedia are complementary. About 95% of the places that are included in GeoNames have no DBPedia counterpart, and about a third of the places in DBPedia correspond to no GeoNames place. Table 10 indicates that this is also the case in other source combinations. The resulting dataset covers, then, a much broader set of places, in a broader range of scales. In the case of places that appear in both sources, the resulting dataset is richer in details and in relationships (Moura & Davis Jr, 2014).

TABLE 10 Summary of the integration of places. Missing combinations account for less than 1,000 places each

Source | # of places
GeoNames only | 8,087,661
DBPedia only | 213,145
Freebase only | 924,853
LinkedGeoData only | 3,583,125
GeoNames + DBPedia | 426,317
DBPedia + Freebase | 337,498
DBPedia + LinkedGeoData | 16,965
GeoNames + DBPedia + Freebase | 2,882
GeoNames + DBPedia + LinkedGeoData | 17,755

6 | THE PROGRAMMING INTERFACE

We intend LoG to be used as a reference data source for GIR and other applications. As such, we opted to offer a set of methods from which LoG data can be retrieved. This frees developers of GIR algorithms from having to deal with the internal structure of LoG’s database and other technological details, and gives LoG maintainers more freedom to expand and modify the set of methods and the database’s contents in the future. We developed an application programming interface (API), externalized as a representational state transfer (REST) Web service, to provide access to LoG, along with a Web interface for manual searching (http://aqui.io/log).

The main Web service provides common methods used to support the execution of GIR tasks. The methods are accessible through HTTP GET requests. Requests and responses are always structured as JavaScript Object Notation (JSON) objects, and are thus independent of programming language. Next, we list all externalized methods. In order to make a request, the endpoint must be prefixed by the address of the server on which the Web service is deployed (see Note 3). The full documentation, along with usage examples and the source code, is available on GitHub (https://github.com/thvmm/LinkedOntoGazetteer/wiki).

retrievePlacesByName(String name): Search every place associated with the name passed as a parameter. Endpoint: /api/place/name/{name}

retrieveAllEntitiesByName(String name): Search for all entities related to a given name, independently of their class. Endpoint: /api/entity/name/{name}

isPlace(longint id): Verify if the entity identified by id is a place. Endpoint: /api/isPlace/{id}

retrieveNamesByPlaceId(longint placeId): Search all names for a given place. Endpoint: /api/name/place/{placeId}

retrieveRelatedEntities(longint placeId): Search all non-place entities that have a relationship with the place identified by placeId. Endpoint: /api/entity/relatedPlace/{placeId}

retrieveRelatedPlacesByEntityName(String entityName): Search for all places related to entities associated with a name equal to entityName. Endpoint: /api/place/entity/name/{name}

retrievePlacesInRectangle(Point a, Point b, String reference): Search all places contained by a rectangle defined by the points a and b. reference specifies the source to be considered for the point location: GeoNames, DBPedia, Freebase or LinkedGeoData. Endpoint: /api/place/inRectangle/{reference}

retrievePath(longint fromPlaceId, longint toPlaceId, int maxSize): Search for a path in the graph, between fromPlaceId and toPlaceId, using containedBy edges. The set of places retrieved by this method is a hierarchical list of places that contain the place identified by fromPlaceId, up to the target place toPlaceId, e.g. [Empire State Building, New York City, New York, USA]. If maxSize is undefined, the default value is 5. Endpoint: /api/place/path/{fromPlaceId}/{toPlaceId}

isContainedBy(String pNameA, String pNameB): Verify if a place whose name equals pNameA is contained by another place whose name equals pNameB. Endpoint: /api/place/name/containedBy/{pNameA}/{pNameB}

isContainedBy(longint placeIdA, longint placeIdB): Verify if the place identified by placeIdA is contained by place placeIdB. Endpoint: /api/place/id/containedBy/{placeIdA}/{placeIdB}

retrievePlaceAdjacentListByName(String name): Search all places named after the given parameter, and their graph-adjacent list. Endpoint: /api/place/name/adjacentList/{name}

retrievePlaceAdjacentList(longint placeId): Retrieve the graph-adjacent list of a place identified by placeId. Endpoint: /api/place/id/adjacentList/{placeId}
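As an illustration of how the endpoints listed above could be called, the following minimal client sketch builds request URLs from the endpoint templates and fetches the JSON response. It assumes the base address given in Note 3 and that the service is reachable there; the helper names are hypothetical, and no structure is assumed for the responses, which are returned as parsed JSON. Only endpoints whose parameters all appear in the path are covered, since the listing does not state how maxSize or the rectangle corners are passed.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base address taken from Note 3; its current availability is an assumption.
BASE_URL = "http://aqui.io/linkedOntoGazetteerWeb"


def endpoint(template, **params):
    """Fill an endpoint template, percent-encoding every path parameter."""
    encoded = {key: quote(str(value), safe="") for key, value in params.items()}
    return BASE_URL + template.format(**encoded)


def get_json(url, timeout=10):
    """Issue an HTTP GET request and parse the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def places_by_name_url(name):
    return endpoint("/api/place/name/{name}", name=name)


def path_url(from_place_id, to_place_id):
    return endpoint("/api/place/path/{fromPlaceId}/{toPlaceId}",
                    fromPlaceId=from_place_id, toPlaceId=to_place_id)


def contained_by_name_url(name_a, name_b):
    return endpoint("/api/place/name/containedBy/{pNameA}/{pNameB}",
                    pNameA=name_a, pNameB=name_b)
```

A call such as get_json(places_by_name_url("Belo Horizonte")) would then request /api/place/name/Belo%20Horizonte on the server. Percent-encoding the parameters matters because many toponyms contain spaces, slashes or non-ASCII characters.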

7 | CONCLUSIONS AND FUTURE WORK

In its current version, LoG contains data on 13,074,366 places. More than 140 million place attributes and relationships were created. Furthermore, 4,477,739 non-place entities are available, having more than six million relationships with places. We also created an API, exposed as a Web service, with 12 endpoints, that can support GIR applications using gazetteers.

All reference data were obtained from online data sources that follow the Linked Data principles. Some overlap between the sources used in this work was expected, but we also expected sources to be somewhat complementary. GeoNames is the source that provides the largest amount of information about places, but approximately 95% of its entities do not have a corresponding entry in the other Linked Data sources we used (DBPedia, Freebase and LinkedGeoData). The most common overlapping places in the data sources are cities and local administrative divisions, which serve as bridges between sources with global coverage, such as GeoNames, and sources rich with urban detail. Our conclusion is that the Web of Data is lacking in actual integration of sources. We were able to find many missing links between the Linked Data sources we used. There is certainly room for improvement in the elimination of duplicates and in the integration techniques we used. This future work will decrease ambiguity in the Linked OntoGazetteer.

During exploratory data analysis and preparation, we were able to notice the potential impact of volunteered information for the Linked OntoGazetteer. GeoNames has a well-defined schema, its data comes from official sources, and it does not allow for volunteered contributions or corrections, but we were able to find many errors and distortions, such as duplicate places and relationship mismatches. DBPedia, Freebase and LinkedGeoData were created from volunteered contributions, and present different issues. DBPedia lacks documentation on the wide variety of predicates. Freebase has the same problem, but on a larger scale, since it includes very specific predicates. LinkedGeoData, on the other hand, presents violations of Linked Data rules. Extracting gazetteer data directly from OpenStreetMap using GIS functions may give better results. For volunteered information, there must be a much clearer set of guidelines to help standardize the use of predicates and literal strings for Linked Data creation.

The geographic representation used by sources in the Web of Data is quite simple. Even in GeoNames and LinkedGeoData, the specifically geographic data sources, the places are represented by a single point. Titan currently has a limitation for geographic data representation: it supports only points, but not lines, polygons or other geometries. Since LoG’s schema has been designed to include point, line and polygon footprints, Titan must be extended or a connection with a full spatial database management system, such as PostgreSQL/PostGIS, must be implemented. In such a hybrid implementation, the graph database would enable searching and navigating among places, non-places, types and relationships, while the geographic database would enable spatial and topological analyses using the more detailed footprints.

One of the biggest problems with the Web of Data, also perceived as an advantage, is the simple structure of triples, together with the absence of a fixed format and of de facto ontology standardization. All schema matching was performed manually, and we ran many statistical analyses to detect and remove less relevant data. These problems are common in information retrieval applications and may always be encountered in Web data. The lack of standardization, mainly concerning ontologies, puts in question the adoption of Linked Data paradigms and their use as originally intended. Primarily, Linked Data supports data reuse, but the current situation shows an increase in data duplication and missing connections among the various sources, creating integration issues. However, the problems listed here do not diminish the importance of the Linked Data initiative for semantic Web expansion and popularization. Clustering and classification techniques can certainly reduce the impact of the predicate variability and help in the identification and integration of unstructured data. An interesting direction for future work is to implement agents that navigate the Web of Data, validating the data sources and their relationships, and finding opportunities for improvement.

All this work has been developed using data dumps and offline processing. An important step for our work is to study the possibility of redesigning the data gathering processes to work online, directly on the Web of Data. If that is achieved, new data could be dynamically added to the Linked OntoGazetteer.

With the development of applications that use LoG to solve GIR problems, it will be possible to evaluate the impacts of storing such a large volume of information and the problems that it can bring regarding the accuracy of the implemented methods. Also, the API can be improved based on user feedback to provide a better service catalog.

ACKNOWLEDGMENTS

The authors acknowledge the support of CNPq (303532/2015-7, 459818/2014-7, 401822/2013-3) and FAPEMIG (CEX-PPM-00679/15), Brazilian agencies in charge of fostering research initiatives.

NOTES

1 A Web interface for interactive querying is available at http://aqui.io/log
2 A meta-gazetteer, in this sense, is a service that queries multiple reference data sources and integrates the results as they are received, in order to obtain the best available toponym data for a given purpose
3 API available at http://aqui.io/linkedOntoGazetteerWeb

REFERENCES

Abdelmoty, A. I., Smart, P. D., El-Geresy, B. A., & Jones, C. B. (2009). Supporting frameworks for the geospatial semantic web. In N. Mamoulis, Th. Seidl, K. Torp, & I. Assent (Eds.), Advances in spatial and temporal databases (pp. 355–372). Berlin, Germany: Springer Lecture Notes in Computer Science Vol. 5644.
Alencar, R. O., Davis Jr., C. A., & Gonçalves, M. A. (2010). Geographical classification of documents using evidence from Wikipedia. In Proceedings of the 6th Workshop on Geographic Information Retrieval, Zurich, Switzerland, 12:1–12:8.
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z. (2007). Dbpedia: A nucleus for a web of open data. Berlin, Germany: Springer.
Backstrom, L., Kleinberg, J., Kumar, R., & Novak, J. (2008). Spatial variation in search engine queries. In Proceedings of the 17th International Conference on World Wide Web, Beijing, China, 357–366.
Berners-Lee, T. (2011). Design issues: Linked data (2006). Retrieved from http://www.w3.org/DesignIssues/LinkedData.
Bizer, C., Heath, T., & Berners-Lee, T. (2009a). Linked data-the story so far. International Journal on Semantic Web & Information Systems, 5(3), 1–22.
Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., & Hellmann, S. (2009b). Dbpedia: A crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3), 154–165.
Borges, K. A., Davis Jr., C. A., Laender, A. H., & Medeiros, C. B. (2011). Ontology-driven discovery of geospatial evidence in web pages. Geoinformatica, 15, 609–631.
Brisaboa, N. R., Luaces, M. R., Places, A. S., & Seco, D. (2010). Exploiting geographic references of documents in a geographical information retrieval system using an ontology-based index. Geoinformatica, 14, 307–331.
Cucerzan, S. (2007). Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the Joint Conferences on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, 708–716.
Delboni, T. M., Borges, K. A. V., Laender, A. H. F., & Davis Jr., C. A. (2007). Semantic expansion of geographic web queries based on natural language positioning expressions. Transactions in GIS, 11, 377–397.
Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., ..., & Zhang, W. (2014). Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, New York, NY, 601–610.
Egenhofer, M. J. (2002). Toward the semantic geospatial web. In Proceedings of the 10th ACM International Symposium on Advances in Geographic Information Systems, McLean, Virginia, 1–4.
Freitas, A., Curry, E., Oliveira, J. G., & O’Riain, S. (2012). Querying heterogeneous datasets on the linked data web: Challenges, approaches, and trends. Internet Computing, 16(1), 24–33.
Giunchiglia, F., Dutta, B., Maltese, V., & Farazi, F. (2012). A facet-based methodology for the construction of a large-scale geospatial ontology. Journal on Data Semantics, 1, 57–73.
Han, X., & Zhao, J. (2009). Named entity disambiguation by leveraging Wikipedia semantic knowledge. In Proceedings of the 18th ACM Conference on Information & Knowledge Management, Hong Kong, China, 215–224.
Hill, L. L. (2000). Core elements of digital gazetteers: Placenames, categories, and footprints. In J. Borbinha & T. Baker (Eds.), Research and advanced technology for digital libraries (pp. 280–290). Berlin, Germany: Springer Lecture Notes in Computer Science Vol. 1923.
Hoffart, J., Suchanek, F. M., Berberich, K., & Weikum, G. (2013). Yago2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194, 28–61.
Hoffart, J., Yosef, M. A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., ..., & Weikum, G. (2011). Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, United Kingdom, 782–792.
Laurini, R. (2015). Geographic ontologies, gazetteers and multilingualism. Future Internet, 7(1), 1–23.
Lieberman, M. D., & Samet, H. (2012). Adaptive context features for toponym resolution in streaming news. In Proceedings of the 35th International ACM SIGIR Conference on Research & Development in Information Retrieval, Portland, Oregon, 731–740.
Lieberman, M. D., Samet, H., & Sankaranarayanan, J. (2010). Geotagging with local lexicons to build indexes for textually specified spatial data. In Proceedings of the 26th IEEE International Conference on Data Engineering, Long Beach, California, 201–212.
Machado, I. M. R., Alencar, R. O., Campos, R. O., & Davis Jr., C. A. (2011). An ontological gazetteer and its application for place name disambiguation in text. Journal of the Brazilian Computer Society, 17, 267–279.
Manguinhas, H., Martins, B., & Borbinha, J. (2008). A geo-temporal web gazetteer integrating data from multiple sources. In Proceedings of the 3rd International Conference on Digital Information Management, Bangalore, India, 146–153.
McCurley, K. S. (2001). Geospatial mapping and navigation of the web. In Proceedings of the 10th International Conference on World Wide Web, Hong Kong, China, 221–229.
Moura, T. H. V. M., & Davis Jr., C. A. (2014). Integration of linked data sources for gazetteer expansion. In Proceedings of the 8th Workshop on Geographic Information Retrieval, Dallas, Texas.
Popescu, A., Grefenstette, G., & Moëllic, P. A. (2008). Gazetiki: Automatic creation of a geographical gazetteer. In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, Pittsburgh, Pennsylvania, 85–93.
Pouliquen, B., Kimler, M., Steinberger, R., Ignat, C., Oellinger, T., Blackler, K., ..., & Best, C. (2006). Geocoding multilingual texts: Recognition, disambiguation and visualisation. arXiv preprint cs/0609065.
Quercini, G., & Samet, H. (2014). Uncovering the spatial relatedness in Wikipedia. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Dallas, Texas, 153–162.
Quercini, G., Samet, H., Sankaranarayanan, J., & Lieberman, M. D. (2010). Determining the spatial reader scopes of news sources using local lexicons. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, California, 43–52.
Samet, H., Sankaranarayanan, J., Lieberman, M. D., Adelfio, M. D., Fruin, B. C., Lotkowski, J. M., ..., & Teitler, B. E. (2014). Reading news with maps by exploiting spatial synonyms. Communications of the ACM, 57(10), 64–77.
Sanderson, M., & Kohler, J. (2004). Analyzing geographic queries. In Proceedings of the SIGIR Workshop on Geographic Information Retrieval, Sheffield, United Kingdom.
Schmachtenberg, M., Bizer, C., & Paulheim, H. (2014). Adoption of the linked data best practices in different topical domains. In C. Goble, C. A. Knoblock, K. Janowicz, & P. Mika (Eds.), The Semantic Web – ISWC 2014 (pp. 245–260). Berlin, Germany: Springer Lecture Notes in Computer Science Vol. 8797.
Shvaiko, P., & Euzenat, J. (2005). A survey of schema-based matching approaches. In S. Spaccapietra (Ed.), Journal on Data Semantics IV (pp. 146–171). Berlin, Germany: Springer Lecture Notes in Computer Science Vol. 3730.
Smart, P. D., Jones, C. B., & Twaroch, F. A. (2010). Multi-source toponym data integration and mediation for a meta-gazetteer service. In S. I. Fabrikant, T. Reichenbacher, M. van Kreveld, & C. Schlieder (Eds.), Geographic information science (pp. 234–248). Berlin, Germany: Springer Lecture Notes in Computer Science Vol. 6292.
Suchanek, F. M., Kasneci, G., & Weikum, G. (2007). Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, Banff, Alberta, 697–706.
Tanasescu, V., Smart, P. D., & Jones, C. B. (2014). Reverse geocoding for photo captioning with a meta-gazetteer. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Dallas, Texas, 509–512.
Wang, C., Xie, X., Wang, L., Lu, Y., & Ma, W.-Y. (2005). Detecting geographic locations from Web resources. In Proceedings of the 2nd International Workshop on Geographic Information Retrieval, Bremen, Germany, 17–24.
Yi, X., Raghavan, H., & Leggetter, C. (2009). Discovering users’ specific geo intention in web search. In Proceedings of the 18th International Conference on World Wide Web, Madrid, Spain, 481–490.