DOI 10.1111/tgis.12238
RESEARCH ARTICLE
Reference data enhancement for geographic information retrieval using linked data
Tiago H. V. M. Moura1 | Clodoveu A. Davis Jr.1 | Frederico T. Fonseca2
1 Universidade Federal de Minas Gerais (UFMG), Avenida Presidente Antônio Carlos, 6627, Belo Horizonte, Brazil
2 Pennsylvania State University, 330 D IST Building, University Park, PA, U.S.A.
Correspondence
Tiago H. V. M. Moura, Universidade Federal de Minas Gerais (UFMG), Avenida Presidente Antônio Carlos, 6627, Belo Horizonte, Brazil.
Email: [email protected]
Abstract
Gazetteers are instrumental in recognizing place names in documents such as Web pages, news, and social media messages. However, creating and maintaining gazetteers is still a complex task. Even though some online gazetteers provide rich sets of geographic names in planetary scale (e.g. GeoNames), other sources must be used to recognize references to urban locations, such as street names, neighborhood names or landmarks. We propose integrating Linked Data sources to create a gazetteer that combines a broad coverage of places with urban detail, including content on geographic and semantic relationships involving places, their multiple names and related non-geographic entities. Our final goal is to expand the possibilities for recognizing, disambiguating and filtering references to places in texts for geographic information retrieval (GIR) and related applications. The resulting ontological gazetteer, named LoG (Linked OntoGazetteer), is accessible through Web services by applications and research initiatives on GIR, text processing, named entity recognition and others. The gazetteer currently contains over 13 million places, 140 million attributes and relationships, and 4.5 million non-geographic entities. Data sources include GeoNames, Freebase, DBPedia and LinkedGeoData, which is based on OpenStreetMap data. An analysis on how these datasets overlap and complement one another is also presented.
KEYWORDS gazetteer, geocoding, linked data, knowledge bases
1 | INTRODUCTION
Transactions in GIS 2016; 00: 00-00 wileyonlinelibrary.com/journal/tgis © 2016 John Wiley & Sons Ltd
The relevance of geographic information in Web and mobile applications is undeniable. While searching the Web, users often employ terms or expressions indicating place names or positioning (Backstrom, Kleinberg, Kumar, & Novak, 2008; Sanderson & Kohler, 2004; Wang, Zi, Wang, Lu, & Ma, 2005). Recognizing references to places in documents, social media messages, keyword-based queries and other situations leads to a better understanding of the user’s intentions and better query results.
Geographic information retrieval (GIR) applications usually rely on reference data (often organized as gazetteers) for recognizing place names. The determination of the geographic scope of documents is an example of such applications, in which gazetteers are employed first for identifying place names in the text (geoparsing) and later for finding their geographic locations (geolocating). This problem is complex because place names are often ambiguous, being identical or very similar to other place names and terms used to designate non-geographic entities.
An ideal gazetteer would have global coverage with detail down to urban elements, such as streets, addresses, and landmarks. It would also go beyond the traditional representation (Hill, 2000) in the form of a triple
2 | RELATED WORK
As the volume of information on the Web grows, the available methods for querying and using information become insufficient. Consider, for instance, the recognition of the geographic intentions of a user who presents a set of keywords to a search engine. Among these keywords, one can find place names (Yi, Raghavan & Leggetter, 2009), indirect references to places (McCurley, 2001; Quercini & Samet, 2014), and expressions related to spatial positioning (Delboni, Borges, Laender, & Davis, 2007) or to spatial relationships (Borges, Davis, Laender, & Medeiros, 2011). If the input is taken as a simple set of keywords, the search engine cannot be expected to cover the full extent of the user’s intentions when posting the query. Egenhofer (2002) argues that obtaining adequate geospatial content from Web sources requires going beyond keyword-based methods and taking into account the spatial semantics of positioning expressions (such as inside, crosses, or near).
The creation of semantic resources, such as ontologies, is expected to enable a new framework of information retrieval based on the meaning of terms and expressions. Abdelmoty, Smart, El-Geresy and Jones (2009) show that place ontologies can be used to encode the meaning of spatial properties (geometrical shape, location, proximity, topological relationships) as well as our usual descriptions of places, in the form of place names and coordinate systems. They showed how these concepts can be adequately encoded using RDF triples, using Wikipedia articles as an example. Gazetteers can help with the complex problem of spatial discontinuity and administrative divisions with special status (Laurini, 2015). Textual descriptions in gazetteers can lay out these relationships more clearly and directly than is possible topologically in an ontology. Overall, this is a two-way relationship, with gazetteers also being enriched by ontologies (Laurini, 2015).
For instance, Hoffart, Suchanek, Berberich, and Weikum (2013) demonstrate the importance of complementing ontologies with knowledge bases. They use Wikipedia to enhance an ontology (YAGO2) along the dimensions of space and time. Giunchiglia, Dutta, Maltese, and Farazi (2012) combined a gazetteer (GeoNames) with an ontology (WordNet) to create an enhanced ontology called GeoWordNet. Brisaboa, Luaces, Places, and Seco (2010) put together a spatial ontology and a gazetteer to build an index. The index is then used in GIR to solve isolated or combined spatial and textual queries.
However, the usual gazetteer implementation uses a simple data structure, composed of a place name, a type of place and a footprint (usually a simple pair of coordinates) (Hill, 2000). This representation is often insufficient to provide evidence for common GIR tasks, since it does not explore spatial relations between places. A simple point footprint also does not allow the use of spatial relations such as contains, for instance, and geographic relationships are encoded using pairs of keys, as in conventional databases. Therefore, we argue that the expansion of gazetteers towards richer semantics, topological relationships, and descriptions, such as those provided by knowledge bases, can significantly improve the usefulness of gazetteers for GIR tasks.
Previous work (Han and Zhao, 2009; Hoffart et al., 2011; Cucerzan, 2007; Pouliquen et al., 2006) uses knowledge bases and gazetteers to handle the ambiguity problem in GIR applications. As presented by Alencar, Davis, and Gonçalves (2010), Wikipedia is a good external source of evidence, both for name recognition and for disambiguation. Quercini and Samet (2014) point out that Wikipedia articles can be spatially related to each other. They demonstrate how article contents can be used to obtain concepts that are spatially connected, and thereby used to describe or identify a certain place.
Some other initiatives (Bizer et al., 2009b; Dong et al., 2014) also try to extract relevant information from Wikipedia and other non- or semi-structured data sources. However, extracting and querying such reference data can be difficult and time-consuming. Smart et al. (2010) present a mediation framework to build a meta-gazetteer, with data from multiple sources, including Wikipedia, OpenStreetMap and GeoNames. Integration is performed at querying time, retrieving and integrating data from various remote sources. Tanasescu, Smart, and Jones (2014) present another approach to creating a meta-gazetteer, which also uses data from Google Places (https://developers.google.com/places/) and Foursquare (https://foursquare.com/) to recommend toponyms for photo captions. Popescu, Grefenstette and Moëllic (2008) propose Gazetiki, a gazetteer assembled from various heterogeneous online sources, including Wikipedia, that are used to complement the GeoNames database. Gazetiki expands the traditional gazetteer structure (place name, footprint, place class) by assigning a relevance score to each entry.
In a broad effort to determine the spatial scope of news articles for the NewStand project, Samet et al. (2014) argue that gazetteers are important sources of information for GIR tasks. Gazetteer data are instrumental for toponym recognition (i.e. identifying place names in text) and toponym resolution (i.e. deciding which of the places that are associated to a given name is the correct one). However, solving these problems requires much more than simply recognizing valid place names embedded in text, for instance by comparing candidate words with the contents of a gazetteer, which must include alternative names, nicknames, acronyms and multilingual variations. Other data must be used in order to resolve geo/non-geo and geo/geo ambiguity, which are part, respectively, of toponym recognition and resolution.
For instance, tools can use semantic connections between places or identify terms that are strongly associated to places and use such evidence for disambiguation (Quercini and Samet, 2014; Alencar et al., 2010; Lieberman & Samet, 2012). Current gazetteers, such as GeoNames (used by NewStand), do not include that kind of additional information.
Lieberman, Samet, and Sankaranarayanan (2010) and Quercini, Samet, Sankaranarayanan, and Lieberman (2010) demonstrate the usefulness of local lexicons (gazetteer subsets comprising closely related places) for disambiguation in the context of NewStand. Using local lexicons, a news source is associated to a set of places that fall within its geographic scope of interest, thus helping in the disambiguation task. It is possible to conclude that gazetteers with a broad coverage are necessary in order to include globally known places, while those with local and detailed coverage are required for establishing a narrower context, as in local lexicons. Samet et al. (2014) created techniques to help in solving the ambiguity problem through a map-based user interface. The current zoom level in the map enables more precise disambiguation, which corresponds to heuristically narrowing the geographic scope of interest of the user. This approach also allows using spatial synonyms, i.e. alternative names of spatially-related places, that can be used to expand searches.
Since no single data source currently fulfills these requirements, we decided to investigate the application of Linked Data (Bizer et al., 2009a) over several different data sources, as in the Linking Open Data (LOD) project (http://linkeddata.org/), to enrich existing gazetteers.
The LOD project encourages people to publish data according to the four Linked Data principles (Berners-Lee, 2011): (1) use Uniform Resource Identifiers (URIs) as names for things; (2) use HTTP URIs so that people can look up those names; (3) when someone looks up a URI, provide useful information, using the Resource Description Framework (RDF) standard; and (4) include links to other URIs. Between 2011 and 2014, the LOD project recorded an increase of 271% in the number of data sets published following the Linked Data principles, reaching 1,091 different data sets (Schmachtenberg, Bizer, & Paulheim, 2014), which points to a growing adoption of that concept.
Integrating LOD data, however, is still not a simple task. The heterogeneity of schemas within the Web of Data makes determining redundant or overlapping entities non-trivial (Freitas, Curry, Oliveira & O’Riain, 2012; Shvaiko & Euzenat, 2005). To fit in the Web of Data environment, data integration solutions must take into account the schema and ontologies defined by the data sources. Manguinhas, Martins & Borbinha (2008) propose a solution to the data integration problem based on simple criteria, such as place type and distance between two possibly duplicate entities. Properly solving this problem is essential to provide better support for GIR applications, keeping fewer representations, expanding the knowledge about real-world objects, and helping to solve ambiguity problems. In the next sections, we describe the creation of LoG, an enhanced gazetteer created by integrating several Linked Data sources.
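The type-and-distance criterion mentioned above for spotting possibly duplicate entities can be illustrated with a minimal Python sketch. It is not the method of Manguinhas et al. nor the integration procedure used for LoG; the record layout (`type`, `lat`, `lon`), the function names and the 1 km threshold are all assumptions made for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def likely_duplicates(a, b, max_km=1.0):
    """Flag two records as candidate duplicates when they share a place
    type and lie within max_km of each other (hypothetical threshold)."""
    if a["type"] != b["type"]:
        return False
    return haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_km


# Illustrative records; coordinates are approximate.
city_a = {"type": "city", "lat": -19.9167, "lon": -43.9345}
city_b = {"type": "city", "lat": -19.9170, "lon": -43.9340}
airport = {"type": "airport", "lat": -19.9170, "lon": -43.9340}
far_city = {"type": "city", "lat": -23.55, "lon": -46.63}

print(likely_duplicates(city_a, city_b))    # same type, a few dozen metres apart
print(likely_duplicates(city_a, airport))   # different type
print(likely_duplicates(city_a, far_city))  # same type, hundreds of km apart
```

A real integration step would also need to reconcile the different type vocabularies of each source before comparing types, which this sketch leaves out.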
3 | GAZETTEER ENHANCEMENTS: NON-PLACES AND FLEXIBLE RELATIONSHIPS
Specially crafted gazetteers can move beyond the traditional structure to organize place-related elements in a way that is meaningful for GIR. However, very few of them are available. One of them is OntoGazetteer (Machado, Alencar, Campos & Davis, 2011), a gazetteer that includes topological and geographic relationships, such as contained by and neighbor to, and is enriched with alternative names, ambiguously named places, and relationships between places. It also implements semantic relationships, through which it is possible to establish connections between places that belong to the same ontological category, for instance state capitals, historical sites or cities along the same highway. Relationships can be used for disambiguation, since a reference to a related geo- or non-geo entity in the same text helps to infer which place is the correct one.
Figure 1 shows a basic conceptual schema for OntoGazetteer (Machado et al., 2011). The main class of this schema is Place, whose instances represent real world places. The place names are represented in three different ways: (1) as a Place attribute; (2) as an Alternative place instance; or (3) as an Ambiguous name instance. The latter two classes keep alternative names for the same place, and a list with identifiers of homonymous places, respectively. Although this schema has been successfully implemented as a relational database, searching for a toponym implies looking into three different structures (Place, Alternative place and Ambiguous name), making it less efficient to detect and solve ambiguities.
Our work can be understood as an extension and a revision of OntoGazetteer, with a focus on the use of Linked Data to implement a gazetteer with worldwide coverage and able to keep information about every type of place, ranging from administrative divisions, populated places and intra-urban information, to geographic features, such as rivers and mountains.
This new gazetteer, built from the integration of many data sources on places, can be populated with related terms, alternative names, relationships of various kinds, and other related entities. In our implementation, we created an alternative schema for an enhanced gazetteer, which is more efficient for search and retrieval tasks. Our goal is to provide broader support to GIR, going beyond existing reference data sources by providing evidence that can be used in tasks such as disambiguation and in the recognition of the geographic scope of documents. The primary Linked Data sources for our gazetteer include a traditional global gazetteer, knowledge bases, and geographic information derived from OpenStreetMap.
FIGURE 1 OntoGazetteer conceptual schema from Machado et al. (2011)
Our revised conceptual schema is shown in Figure 2. The new schema is simpler than OntoGazetteer’s, replacing the classes that keep alternative names and relationships with separate classes for places and for place names. Therefore, the new schema includes two new classes, Non-place and Name. The instances of Name can be related with Non-place and Place instances through many-to-many relationships. Two or more places can be related to the same name, indicating a geo/geo ambiguity, or even share a name with non-places, indicating a geo/non-geo ambiguity. Also, places and non-places can have many alternative names (multi-lingual, nicknames, popular acronyms). Alternative names are identified using the isAlternative attribute. Entities that do not represent real-world places are kept as Non-place instances. For example, the (non-place) event “FIFA World Cup 2014” is related to the place “Belo Horizonte”, one of the host cities in the competition. The relationship between places and non-places is defined as relatedTo, which means that all kinds of relationship can be recorded. The relationship attribute relType keeps the nature of each relationship, as encoded in Linked Data predicates, thus preserving their original flexibility.
Topological and semantic relationships between places are represented using a many-to-many self-relationship in the Place class. Likewise, the nature of the relationship between Place and Non-place is recorded in the relationship attribute relType. Both places and non-places are also categorized through their relationship to one or more instances of the Type class.
Overall, the schema in Figure 2 contains only many-to-many relationships. Considering the need to efficiently query such relationships in a large database, along with the need to navigate between places, non-places, names and types for solving most GIR-related tasks, we decided to map the schema to a directed and labeled graph. The graph is defined as G(V, E), where V is the set of all Place, Non-place, Name and Type instances, and E is the set of relationships defined in the schema. Every edge e ∈ E has a label that keeps the nature of the relationship, except for the edges between Name and Place or Non-place, which are unlabeled. Also, vertices may have attributes. A graph structure is also flexible enough to allow future evolution and the addition of new elements.
FIGURE 2 Linked gazetteer schema
After considering some possibilities, we implemented the enhanced gazetteer using Titan (http://thinkaurelius.github.io/titan/), a NoSQL graph database manager based on Apache Cassandra (http://cassandra.apache.org/). Titan provides many features that allowed us to implement the graph exactly as in the conceptual schema. In Titan, a vertex is seen as a bag of properties, and vertices are not required to share properties or a predefined structure. Also, properties do not need to be defined before use; they can be created on the fly. It is possible to associate constraints, such as unique (the value of this property is unique in each vertex), single (each vertex can have only one property of this kind), and list (multivalued properties), with any property. Furthermore, Titan automatically creates indexes for unique and single properties. Edges are similarly implemented, with a required label to identify the edge type and the possibility to be directed or not. Edges can also have properties, and these properties can be constrained and indexed.
Titan is specialized in running fast local graph computations, which are frequently required when exploring the graph neighborhood of elements such as places and names, searching, for instance, for homonyms or alternative names. When a vertex is created, the DBMS generates an ID property, which provides the fastest way to access the vertex. Indexes can also be used to run fast searches using vertex or edge properties. The easiest way to query a Titan database is through Gremlin (https://github.com/tinkerpop/gremlin/wiki), a graph traversal language built into the DBMS.
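To make the graph mapping more concrete, the following Python sketch models a directed, labeled graph in memory and uses it to detect geo/geo and geo/non-geo ambiguity by exploring the neighborhood of a Name vertex. It does not use Titan or Gremlin; the class, the vertex identifiers, the `ex:hostCity` label and the choice to direct unlabeled edges from Name to the named entity are all assumptions made for illustration.

```python
from collections import defaultdict


class LabeledGraph:
    """Tiny in-memory stand-in for a directed, labeled property graph."""

    def __init__(self):
        self.vertices = {}                  # vertex id -> {"kind": ..., attributes}
        self.out_edges = defaultdict(list)  # source id -> [(label, target id)]

    def add_vertex(self, vid, kind, **attrs):
        self.vertices[vid] = {"kind": kind, **attrs}

    def add_edge(self, src, dst, label=None):
        # label=None models the unlabeled Name -> Place/Non-place edges
        self.out_edges[src].append((label, dst))

    def entities_named(self, name_id):
        """Places and non-places reachable from a Name vertex."""
        return sorted(dst for _, dst in self.out_edges[name_id])

    def ambiguity(self, name_id):
        """Kinds of ambiguity carried by a name, as a set of labels."""
        kinds = [self.vertices[v]["kind"] for v in self.entities_named(name_id)]
        places = kinds.count("Place")
        non_places = kinds.count("Non-place")
        found = set()
        if places >= 2:
            found.add("geo/geo")
        if places >= 1 and non_places >= 1:
            found.add("geo/non-geo")
        return found


g = LabeledGraph()
# Hypothetical vertices
g.add_vertex("name:paris", "Name", value="Paris", isAlternative=False)
g.add_vertex("name:bh", "Name", value="Belo Horizonte", isAlternative=False)
g.add_vertex("place:paris-fr", "Place")
g.add_vertex("place:paris-tx", "Place")
g.add_vertex("nonplace:paris-person", "Non-place")
g.add_vertex("place:bh", "Place")
g.add_vertex("nonplace:worldcup2014", "Non-place")

# Unlabeled Name edges
g.add_edge("name:paris", "place:paris-fr")
g.add_edge("name:paris", "place:paris-tx")
g.add_edge("name:paris", "nonplace:paris-person")
g.add_edge("name:bh", "place:bh")
# Labeled relatedTo edge; the label plays the role of relType
g.add_edge("nonplace:worldcup2014", "place:bh", label="ex:hostCity")

print(g.ambiguity("name:paris"))
print(g.ambiguity("name:bh"))
```

Because every lookup starts at a single Name vertex and follows its edges, a toponym search touches one structure only, which is the efficiency argument made for the revised schema.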
Since all data sources are linked data, it would be possible, in principle, to implement queries to data sources directly. This alternative would mean querying individual sources separately, then dynamically integrating their results before using them in GIR applications. Smart, Jones, and Twaroch (2010) use this approach and report several problems and difficulties, such as prioritizing sources, matching entities and augmenting found entities. Furthermore, no mention is made of response times and latency. For this work, we opted to take a different approach. We collect data from individual sources, integrate them with a high level of certainty, and filter out unwanted or unnecessary details, so that a concise and efficient set of services can be rendered through an API to GIR applications.
FIGURE 3 Decision path to map a triple into LoG
4 | GAZETTEER ENHANCEMENTS: LINKED DATA
Linked Data standards require that all published data use the RDF data structure. This means that all information is represented as triples with the following structure: {subject, predicate, [object, literal]}. Many linked data sources use general-purpose triple stores to manage their data. Adding new data is simple with triple stores, since there is no explicit schema apart from the triple format. However, the lack of documentation on the actual predicates used in the triples and the semantics behind them makes it harder to use the data. Users need to know about the actual predicates, either when adding data or when formulating queries, but usually data sources do not provide such information.
We established a two-step mapping from RDF triples into a graph to obtain LoG contents from Linked Data. First, we match the triples to our basic schema (Figure 2), and then we map them to the graph implementation. We followed the decision path shown in Figure 3 to map each RDF triple in each data source into our schema. The answer to each question depends on the data representation used by each data source, and will be detailed in the next subsections. The five grey boxes in Figure 3 are possible endpoints in the process, with useful information being stored and triples not related to places being discarded.
Each data source has a different set of predicates that characterize relationships between places. Sources also vary as to the availability of data on non-places related to places, and the predicates that are used for that purpose. We analyzed each data source to find the most frequent predicates that relate useful entities to places. The next subsections present our findings and decisions for each source, after a brief description of the source and its main characteristics.
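As a rough illustration of routing a triple to one of several endpoints, the Python sketch below classifies triples against per-source predicate sets. The actual questions in Figure 3 and their order are not reproduced in the text, so the order of the tests, the endpoint names, the function signature and the `ex:` identifiers are assumptions; only the DBPedia, FOAF and GeoRSS predicate URIs are taken from the source descriptions.

```python
# Predicate sets for one source (DBPedia is used as the example)
NAME_PREDICATES = {
    "http://xmlns.com/foaf/0.1/name",
    "http://xmlns.com/foaf/0.1/nick",
}
ATTRIBUTE_PREDICATES = {"http://www.georss.org/georss/point"}
PLACE_RELATION_PREDICATES = {
    "http://dbpedia.org/ontology/country",
    "http://dbpedia.org/ontology/isPartOf",
}


def classify_triple(subject, predicate, obj, places, non_places):
    """Return the (assumed) endpoint for one RDF triple.

    places and non_places are sets of entity identifiers already selected
    for the gazetteer; anything else is treated as irrelevant.
    """
    subject_is_place = subject in places
    object_is_place = obj in places
    if not subject_is_place and not object_is_place:
        return "discard"  # triple is not related to any place
    if subject_is_place and predicate in NAME_PREDICATES:
        return "name"
    if subject_is_place and predicate in ATTRIBUTE_PREDICATES:
        return "attribute"
    if subject_is_place and object_is_place:
        if predicate in PLACE_RELATION_PREDICATES:
            return "place-place relationship"
        return "discard"
    # Exactly one side is a place: keep the triple only if the other
    # side is a selected non-place entity.
    other = obj if subject_is_place else subject
    if other in non_places:
        return "place/non-place relationship"
    return "discard"


places = {"ex:BeloHorizonte", "ex:Brazil"}        # hypothetical identifiers
non_places = {"ex:WorldCup2014"}

print(classify_triple("ex:BeloHorizonte", "http://xmlns.com/foaf/0.1/name",
                      "Belo Horizonte", places, non_places))
print(classify_triple("ex:WorldCup2014", "ex:hostCity",
                      "ex:BeloHorizonte", places, non_places))
```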
4.1 | DBPedia
DBPedia is a general purpose knowledge base. Information in DBPedia originates from semi-structured information (infoboxes) in Wikipedia. DBPedia has been part of the LOD project since its beginning, and it is one of its largest datasets. Every entity in DBPedia has one or more types. A type is defined by the predicate http://www.w3.org/1999/02/22-rdf-syntax-ns#type and usually points to an object that defines an abstract type concept, such as thing, person, or place. Following this definition, a subject is a place if, and only if, it is described in a triple whose object is a “place” concept, identified by the URI http://dbpedia.org/ontology/Place.
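The place test described above amounts to a filter over type triples. A minimal Python sketch follows, assuming triples are available as (subject, predicate, object) tuples of strings; the function name and the `ex:` subjects are hypothetical, and the predicate is the standard rdf:type URI, written with a single `#`.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DBPEDIA_PLACE = "http://dbpedia.org/ontology/Place"


def find_places(triples, place_class=DBPEDIA_PLACE):
    """Return the subjects typed as the given place class."""
    return {s for s, p, o in triples if p == RDF_TYPE and o == place_class}


triples = [
    ("ex:BeloHorizonte", RDF_TYPE, DBPEDIA_PLACE),
    ("ex:BeloHorizonte", "http://xmlns.com/foaf/0.1/name", "Belo Horizonte"),
    ("ex:SomePerson", RDF_TYPE, "http://dbpedia.org/ontology/Person"),
]
print(find_places(triples))
```

The `place_class` parameter is there because other sources use the same type predicate with a different place concept.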
TABLE 1 DBPedia predicates that represent relationships between places – db: http://dbpedia.org/ontology
Predicate       Linked OntoGazetteer
db:country      Place –relatedTo (db:country)–> Place
db:isPartOf     Place –relatedTo (db:isPartOf)–> Place
db:location     Place –relatedTo (db:location)–> Place
db:state        Place –relatedTo (db:state)–> Place
db:region       Place –relatedTo (db:region)–> Place
TABLE 2 GeoNames predicates that represent vertex attributes – gn: http://www.geonames.org/ontology#, wgs84_pos: http://www.w3.org/2003/01/geo/wgs84_pos#
Predicate              Linked OntoGazetteer
wgs84_pos:lat          Place.gnPoint.y
wgs84_pos:long         Place.gnPoint.x
gn:featureClass        Place.gnFeatureClass
gn:featureCode         Place.gnFeatureCode
gn:wikipediaArticle    Place.dbpediaId
DBPedia has many predicates that define place names. We chose to work with the two most frequent ones: Name (http://xmlns.com/foaf/0.1/name) and Nick (http://xmlns.com/foaf/0.1/nick). Only one vertex attribute is extracted from DBPedia: the geographic representation of a place, which in this case is limited to a single pair of coordinates. The predicate that defines a geographic representation is http://www.georss.org/georss/point. Table 1 shows all selected predicates from DBPedia and their corresponding representation for LoG.
Every predicate between a selected place and a non-place was imported. Every non-place that is related to a place and has at least one name was selected to be included in LoG. At the end of the process, we selected 639,462 places from DBPedia, encoded in seven million triples. Also, 594,026 non-places were selected. All relationships were stored as attributes in the PlaceRel class.
4.2 | GeoNames
GeoNames is one of the largest gazetteers available on the Web. It keeps information about more than eight million places, and its main data sources are official cartographic agencies (http://www.geonames.org/data-sources.html). Since GeoNames only stores data from real world places, the answer to the first question in the decision flow for this data source is always true. We chose two predicates from GeoNames that define names: http://www.geonames.org/ontology#name and http://www.geonames.org/ontology#alternateName.
Table 2 shows all predicates from GeoNames mapped to LoG as vertex attributes. Unlike DBPedia, GeoNames keeps the geographic representation split into two different attributes, one for latitude and one for longitude. GeoNames uses two attributes, Feature Class and Feature Code, to characterize place types. Both were selected and mapped as vertex properties in LoG. The last attribute selected from GeoNames is the URL for the Wikipedia article corresponding to each place, when available.
Table 3 summarizes GeoNames predicates indicating relationships between places and shows how each of them is mapped into LoG. All places in GeoNames were added to LoG, comprising more than 8.4 million places, obtained from almost 80 million RDF triples.
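The attribute mapping in Table 2 can be read as a lookup from predicate URI to vertex property. The Python sketch below applies such a lookup to a list of triples; it is an illustration only, with a hypothetical function name, a hypothetical subject identifier, illustrative values, and a flat `gnPoint.x` / `gnPoint.y` key layout chosen for simplicity. The Wikipedia URL is stored as found, without any conversion to a DBPedia identifier.

```python
GN = "http://www.geonames.org/ontology#"
WGS84 = "http://www.w3.org/2003/01/geo/wgs84_pos#"

# Predicate URI -> vertex property, following Table 2
ATTRIBUTE_MAP = {
    WGS84 + "lat": "gnPoint.y",
    WGS84 + "long": "gnPoint.x",
    GN + "featureClass": "gnFeatureClass",
    GN + "featureCode": "gnFeatureCode",
    GN + "wikipediaArticle": "dbpediaId",
}
NUMERIC_ATTRIBUTES = {"gnPoint.x", "gnPoint.y"}


def build_place_attributes(triples):
    """Collect mapped attributes per subject; unmapped predicates are skipped."""
    vertices = {}
    for subject, predicate, value in triples:
        attribute = ATTRIBUTE_MAP.get(predicate)
        if attribute is None:
            continue
        if attribute in NUMERIC_ATTRIBUTES:
            value = float(value)  # coordinates arrive as literals
        vertices.setdefault(subject, {})[attribute] = value
    return vertices


triples = [
    ("ex:place1", WGS84 + "lat", "-19.92"),
    ("ex:place1", WGS84 + "long", "-43.94"),
    ("ex:place1", GN + "featureCode", "PPLA"),
    ("ex:place1", GN + "name", "Belo Horizonte"),  # a name, not an attribute
]
print(build_place_attributes(triples))
```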
TABLE 3 GeoNames predicates that represent relationships between places – gn: http://www.geonames.org/ontology#
Predicate              Linked OntoGazetteer
gn:parentCountry       Place –relatedTo (gn:parentCountry)–> Place
gn:parentFeature       Place –relatedTo (gn:parentFeature)–> Place
gn:parentADM[12345]    Place –relatedTo (gn:parentADM[12345])–> Place
TABLE 4 Freebase predicates representing vertex attributes – fbsds: http://rdf.freebase.com/key/user.metaweb.datasource, fbs: http://rdf.freebase.com/key/, fbsloc: http://rdf.freebase.com/ns/location
Predicate                   Linked OntoGazetteer
fbsds:geonames              Place.geonamesId
fbs:wiki.en_title           Place.dbpediaId
fbsloc:geocode.latitude     Place.fbPoint.y
fbsloc:geocode.longitude    Place.fbPoint.x
4.3 | Freebase
Freebase is a knowledge base supported by Google and open to user contributions. Like DBPedia, it was composed using non- or semi-structured information from Wikipedia. Freebase also works as a general purpose data source, keeping categorized records for entities such as songs, books, locations, and people. Freebase defines types in a way that is similar to DBPedia, using a predicate that defines an entity’s type (http://rdf.freebase.com/ns/type.object.type). However, the object that defines the concept of place is named Location and is defined by the URI http://rdf.freebase.com/ns/location.location. In other words, an entity is a place if it appears in a triple in which the predicate Type points to Location. Two predicates that are used to associate names to entities are considered in this work. They are defined by the URIs http://rdf.freebase.com/ns/type.object.name and http://rdf.freebase.com/ns/common.topic.alias.
Freebase keeps references to DBPedia and GeoNames using specific predicates, as shown in Table 4. As in the case of GeoNames, the geographic coordinate is recorded using two different attributes, indicating latitude and longitude. Every triple that contains one of the predicates listed in Table 4 was selected and stored as a vertex attribute. We also selected predicates from Freebase that define relationships between places. Table 5 shows the four most frequent predicates in Freebase that are used to relate two places.
As in the case of DBPedia data, a non-place must have a name and be related with at least one place to be included in LoG. However, Freebase has a much more complex ontology with almost eight times more predicates describing the relationships between places and non-places. Only 158 of these predicates are used in more than 1,000 triples each, so we decided to eliminate predicates that are only rarely used. We selected the 31 most used predicates, representing 90% of all triples in Freebase.
Some of the most common predicates regard birth places (http://rdf.freebase.com/ns/people.person.place_of_birth), nationalities (http://rdf.freebase.com/ns/people.place_lived.location), and event locations (http://rdf.freebase.com/ns/time.event.locations). The result was a set of 1,642,081 places and 3,902,886 non-places, from more than 26 million triples.
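Selecting the most used predicates until a target share of triples is covered can be sketched as follows in Python. The exact procedure used to arrive at the 31 Freebase predicates is not described in the text, so the cumulative-coverage rule, the function name and the toy predicate names are assumptions.

```python
from collections import Counter


def select_frequent_predicates(triples, coverage=0.9):
    """Pick predicates in decreasing order of use until they account for
    at least `coverage` of all triples."""
    counts = Counter(predicate for _, predicate, _ in triples)
    total = sum(counts.values())
    selected, covered = [], 0
    for predicate, n in counts.most_common():
        if covered >= coverage * total:
            break
        selected.append(predicate)
        covered += n
    return selected


# Toy data: predicate "a" appears 6 times, "b" 3 times, "c" once.
toy = ([("s", "a", "o")] * 6) + ([("s", "b", "o")] * 3) + [("s", "c", "o")]
print(select_frequent_predicates(toy, coverage=0.9))
```

With the toy data, "a" and "b" together cover nine of ten triples, so the rarely used "c" is left out.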
TABLE 5 Freebase predicates representing relationships between places – fbsloc: http://rdf.freebase.com/ns/location.location, fbsbib: http://rdf.freebase.com/ns/base.biblioness
Predicate                       Linked OntoGazetteer
fbsbib:bibs_location.country    Place –relatedTo (fbsbib:bibs_location.country)–> Place
fbsloc:containedby              Place –relatedTo (fbsloc:containedby)–> Place
fbsloc:contains                 Place –relatedTo (fbsloc:contains)–> Place
fbsloc:nearby_airports          Place –relatedTo (fbsloc:nearby_airports)–> Place
TABLE 6 LinkedGeoData predicates representing vertex attributes – lgdo: http://linkedgeodata.org/ontology/, geo: http://www.w3.org/2003/01/geo/wgs84_pos
Predicate           Linked OntoGazetteer
lgdo:geonames_id    Place.geonamesId
lgdo:wikipedia      Place.dbpediaId
geo#lat             Place.lgdpoint.y
geo#long            Place.lgdpoint.x
4.4 | LinkedGeoData
LinkedGeoData is a data source built from OpenStreetMap (OSM), a project that allows for collaborative mapping using data supplied by volunteers. LinkedGeoData obtained OSM data and republished it following the Linked Data principles. Since OSM receives numerous contributions in urban areas, the contents of LinkedGeoData were expected to increase the amount of intra-urban information in LoG. However, we found that the data can be skewed due to variations in OSM popularity around the world.
Considering the four data sources used in our work, LinkedGeoData has the most confusing ontology. There is a huge variety of predicates and no documentation. However, the entity type definition is quite similar to the other sources. A type predicate is defined by http://www.w3.org/1999/02/22-rdf-syntax-ns#type, and there is an object that defines the concept of place (http://linkedgeodata.org/ontology/Place).
We selected the six most frequent predicates used in LinkedGeoData to specify entity names, since the wide variety of predicates caused most of them to be used in only a small number of instances. The selected predicates are: