Reference Data Enhancement for Geographic Information Retrieval Using Linked Data

DOI 10.1111/tgis.12238

RESEARCH ARTICLE

Tiago H. V. M. Moura¹ | Clodoveu A. Davis Jr.¹ | Frederico T. Fonseca²

¹ Universidade Federal de Minas Gerais (UFMG), Avenida Presidente Antônio Carlos, 6627, Belo Horizonte, Brazil
² Pennsylvania State University, 330 D IST Building, University Park, PA, U.S.A.

Correspondence
Tiago H. V. M. Moura, Universidade Federal de Minas Gerais (UFMG), Avenida Presidente Antônio Carlos, 6627, Belo Horizonte, Brazil.
Email: [email protected]

Abstract

Gazetteers are instrumental in recognizing place names in documents such as Web pages, news, and social media messages. However, creating and maintaining gazetteers is still a complex task. Even though some online gazetteers provide rich sets of geographic names in planetary scale (e.g. GeoNames), other sources must be used to recognize references to urban locations, such as street names, neighborhood names or landmarks. We propose integrating Linked Data sources to create a gazetteer that combines a broad coverage of places with urban detail, including content on geographic and semantic relationships involving places, their multiple names and related non-geographic entities. Our final goal is to expand the possibilities for recognizing, disambiguating and filtering references to places in texts for geographic information retrieval (GIR) and related applications. The resulting ontological gazetteer, named LoG (Linked OntoGazetteer), is accessible through Web services by applications and research initiatives on GIR, text processing, named entity recognition and others. The gazetteer currently contains over 13 million places, 140 million attributes and relationships, and 4.5 million non-geographic entities. Data sources include GeoNames, Freebase, DBPedia and LinkedGeoData, which is based on OpenStreetMap data. An analysis on how these datasets overlap and complement one another is also presented.
KEYWORDS
gazetteer, geocoding, linked data, knowledge bases

Transactions in GIS 2016; 00: 00-00. wileyonlinelibrary.com/journal/tgis. © 2016 John Wiley & Sons Ltd

1 | INTRODUCTION

The relevance of geographic information in Web and mobile applications is undeniable. While searching the Web, users often employ terms or expressions indicating place names or positioning (Backstrom, Kleinberg, Kumar, & Novak, 2008; Sanderson & Kohler, 2004; Wang, Zi, Wang, Lu, & Ma, 2005). Recognizing references to places in documents, social media messages, keyword-based queries and other situations leads to a better understanding of the user's intentions and better query results.

Geographic information retrieval (GIR) applications usually rely on reference data (often organized as gazetteers) for recognizing place names. The determination of the geographic scope of documents is an example of such applications, in which gazetteers are employed first for identifying place names in the text (geoparsing) and later for finding their geographic locations (geolocating). This problem is complex because place names are often ambiguous, being identical or very similar to other place names and terms used to designate non-geographic entities.

An ideal gazetteer would have global coverage with detail down to urban elements, such as streets, addresses, and landmarks. It would also go beyond the traditional representation (Hill, 2000) in the form of a triple <place name, place type, point footprint> to include information to support name disambiguation and the determination of the geographic scope of documents. No gazetteer currently offers this set of features, but the information is out there, available as Linked Data sources. Broad coverage can be obtained from global gazetteers, such as GeoNames (http://www.geonames.org).
Urban detail can be found, albeit with a strong concentration in more developed countries, in knowledge bases, such as DBPedia (http://dbpedia.org/) and Freebase (http://www.freebase.com/). Volunteered geographic information (VGI) and collaborative sources, such as WikiMapia (http://wikimapia.com) and OpenStreetMap (http://www.openstreetmap.org), provide unique information on urban elements, based on the personal knowledge of citizens.

An alternative to the use of gazetteers would be to employ general purpose knowledge bases. Many commercial and academic projects are available, including Google (https://developers.google.com/freebase/data), DBPedia (Auer et al., 2007) and YAGO (Suchanek, Kasneci, & Weikum, 2007). Data extraction from these bases requires sophisticated algorithms able to deal with non- or semi-structured sources, such as natural-language text, HTML trees and tables (Dong et al., 2014). On the other hand, Linked Data (Bizer, Heath, & Berners-Lee, 2009a) uses a simple structured format and open protocols over the Web, so that the data remains widely available and can be processed by simpler algorithms. Our proposal is that all the effort spent in extracting information from non- or semi-structured sources should be redirected to the creation of new applications.

In this work we present a gazetteer crafted using four Linked Data sources. The resulting gazetteer, called LoG (from Linked OntoGazetteer, http://aqui.io/log), has worldwide coverage and detailed intra-urban information. As discussed in the next sections, LoG also contains information on non-geographic entities related to places and place names. We also describe an open application programming interface (API), conceived to support GIR applications with LoG data.

The remainder of the article is organized as follows. Section 2 presents related work. Sections 3 and 4 show gazetteer enhancements. We present a Linked Data analysis in Section 5.
Section 6 describes the programming interface to access LoG data. Finally, conclusions and future work are presented in Section 7.

2 | RELATED WORK

As the volume of information on the Web grows, the available methods for querying and using information become insufficient. Consider, for instance, the recognition of the geographic intentions of a user who presents a set of keywords to a search engine. Among these keywords, one can find place names (Yi, Raghavan & Leggetter, 2009), indirect references to places (McCurley, 2001; Quercini & Samet, 2014), and expressions related to spatial positioning (Delboni, Borges, Laender, & Davis, 2007) or to spatial relationships (Borges, Davis, Laender, & Medeiros, 2011). If the input is taken as a simple set of keywords, the search engine cannot be expected to cover the full extent of the user's intentions when posting the query. Egenhofer (2002) argues that obtaining adequate geospatial content from Web sources requires going beyond keyword-based methods and taking into account the spatial semantics of positioning expressions (such as inside, crosses, or near).

The creation of semantic resources, such as ontologies, is supposed to enable a new framework of information retrieval based on the meaning of terms and expressions. Abdelmoty, Smar, El-Geresy and Jones (2009) show that place ontologies can be used to encode the meaning of spatial properties (geometrical shape, location, proximity, topological relationships) as well as our usual descriptions of places, in the form of place names and coordinate systems. They showed how these concepts can be adequately encoded using RDF triples, using Wikipedia articles as an example.

Gazetteers can help with the complex problem of spatial discontinuity and administrative divisions with special status (Laurini, 2015). Textual descriptions in gazetteers can lay out these relationships more clearly and directly than is possible topologically in an ontology.
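
The idea of encoding place names, place types and spatial relationships as triples, mentioned above, can be illustrated with a small sketch. It is not taken from any of the cited works: the `ex:` identifiers, the predicate names (`ex:name`, `ex:type`, `ex:inside`) and the helper functions are assumptions made up for illustration, and plain Python tuples stand in for a real RDF store.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical identifiers and predicates, invented for this illustration.
TRIPLES: Set[Triple] = {
    ("ex:BeloHorizonte", "ex:name", "Belo Horizonte"),
    ("ex:BeloHorizonte", "ex:type", "ex:City"),
    ("ex:BeloHorizonte", "ex:inside", "ex:MinasGerais"),
    ("ex:MinasGerais", "ex:name", "Minas Gerais"),
    ("ex:MinasGerais", "ex:type", "ex:State"),
    ("ex:MinasGerais", "ex:inside", "ex:Brazil"),
    ("ex:Brazil", "ex:name", "Brazil"),
    ("ex:Brazil", "ex:type", "ex:Country"),
}


def objects(triples: Set[Triple], subject: str, predicate: str) -> Set[str]:
    """Return every object linked to `subject` through `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def containers(triples: Set[Triple], place: str) -> Set[str]:
    """Follow ex:inside links transitively and return all enclosing places."""
    found: Set[str] = set()
    frontier: List[str] = [place]
    while frontier:
        current = frontier.pop()
        for parent in objects(triples, current, "ex:inside"):
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


if __name__ == "__main__":
    # Places that enclose the city, directly or indirectly.
    print(sorted(containers(TRIPLES, "ex:BeloHorizonte")))
```

Because the containment relationship is stored explicitly, a query such as "which places enclose this city" needs no geometry at all, which is the kind of evidence a bare name, type and point record cannot provide.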
Overall, this is a two-way relationship, with gazetteers also being enriched by ontologies (Laurini, 2015). For instance, Hoffart, Suchanek, Berberich, and Weikum (2013) demonstrate the importance of complementing ontologies with knowledge bases. They use Wikipedia to enhance an ontology (YAGO2) along the dimensions of space and time. Giunchiglia, Dutta, Maltese, and Farazi (2012) combined a gazetteer (GeoNames) with an ontology (WordNet) to create an enhanced ontology called GeoWordNet. Brisaboa, Luaces, Places, and Seco (2010) put together a spatial ontology and a gazetteer to build an index. The index is then used in GIR to solve isolated or combined spatial and textual queries.

However, the usual gazetteer implementation uses a simple data structure, composed of a place name, a type of place and a footprint (usually a simple pair of coordinates) (Hill, 2000). This representation is often insufficient to provide evidence for common GIR tasks, since it does not explore spatial relations between places. A simple point footprint also does not allow the use of spatial relations such as contains, for instance, and geographic relationships are encoded using pairs of keys, as in conventional databases. Therefore, we argue that the expansion of gazetteers towards richer semantics, topological relationships, and descriptions, such as those provided by knowledge bases, can significantly improve the usefulness of gazetteers for GIR tasks.

Previous work (Han and Zhao, 2009; Hoffart et al., 2011; Cucerzan, 2007; Pouliquen et al., 2006) uses knowledge bases and gazetteers to handle the ambiguity problem in GIR applications. As presented by Alencar, Davis, and Gonçalves (2010), Wikipedia is a good external source of evidence, both for name recognition and for disambiguation. Quercini and Samet
