Publishing Geodata on the Web
Total Page:16
File Type:pdf, Size:1020Kb
2015-ENST 0022 EDITE - ED 130 Doctorat ParisTech TH ESE` pour obtenir le grade de docteur delivr´ e´ par TELECOM ParisTech Specialit´ e´ Sciences de l’information present´ ee´ et soutenue publiquement par Ghislain Auguste ATEMEZING le 10 avril 2015 Publishing and Consuming Geo-Spatial and Government Data on the Semantic Web Directeur de these` : Raphael¨ TRONCY Jury M. Sren AUER, Professeur, IAIS, Universite´ de Bonn, Allemagne Rapporteur Mme. Chantal REYNAUD, Professeur, INRIA Saclay, Universite´ de Paris XI, France Rapporteur M. Roberto GARCIA, Maˆıtre de Confrences associe,´ Universitat de Lleida, Espagne Examinateur M. Andreas HARTH, Maˆıtre de Conferences,´ Institut AIFB, Karlsruhe, Allemagne Examinateur Mme. Elisabeth METAIS, Professeur, CNAM - Equipe ISID, France Examinateur M. Sebastien´ MUSTIERE, HDR, COGIT-IGN, France Invite´ TELECOM ParisTech ecole´ de l’Institut Tel´ ecom´ - membre de ParisTech i To all those who helped me to make this dream coming true. “Things don’t have to change the world to be important”. -Steve Jobs - Acknowledgements This thesis is the result of part of my work carried out in the context of two Projects: the DATALIFT Project (ANR-10-CORD-009), and the APPS4EUROPE Project. There are many people I want to thank since they kindly supported me in so many different ways for the successful completion of this thesis. I am indebted to my thesis advisor Dr. Raphael¨ Troncy for giving me the opportunity for a PhD. at EURECOM / Telecom ParisTech. Throughout my PhD he provided helpful ideas and encouraging support in this exciting domain of Semantic Web . I would like to thank my committee members, the reviewers Prof. Soren¨ AUER and Pr. Chantal REYNAUD, and furthermore the examiners Pr. Elisabeth METAIS, Dr. An- dreas HARTH and Dr. Roberto GARCIA for their precious time, shared positive insight and guidance. A special thanks to Dr. Sebastien´ MUSTIERE for accepting our invitation. I would like to express my deepest appreciation for all the colleagues and partners of Datalift project for all the exchanges during the duration of the project. I would like to thank specially Bernard, Pierre-Yves, Nathalie, Laurent. It was a pleasure to work and exchange with them. My sincere thanks are due to all the members of the W3C Government Linked Data Working Group, specially to Bernadette Hyland, Boris Villazon-Terrazas,´ Richard Cyga- niak, Dave Reynolds and Phil Archer. We had many intensive moments of exchange and collaborations. I would like also to show my sincerest gratitude to all my mentors at OEG/UPM (Madrid) for providing me first skills on ontology engineering: Pr. Asuncion Gomez´ and Dr. Oscar Corcho. Their enthusiasm and dedication to research on this field were truly inspiring. My warmest thanks to my colleagues who supported me during my Ph.D. Precisely, I would like to thank Houda, Giuseppe, Jose´ Luis, Vuk and Ahmad. Also, I thank all those working at EURECOM, they made my stay at EURECOM very pleasant. Thanks also to Jodi Schneider for her insightful comments to this document. I would like to express my deepest thanks to my wife, Veronica´ Alvarez´ Aguilar, who gave me all her patience and comprehension since the beginning of this adventure. I owe my family my profound gratitude because they always supported and believed in me: spe- cially my parents Genevieve Djifack and Prosper Tabondjou; and all my brothers and sisters: Stephanie, Judith, Arlette, Mathias, Alvine, Edwige, Yannick. Lastly, special thanks to my friends for their unwavering friendship, moral and infinite support. Abstract Over the past few years, the domain of Open Data has received an increasing attention from public administrations. Potential benefits for citizens include more transparency in the decision making, a better governance and a virtuous development of a digital eco-system that would create added-value apps when processing and analyzing open data. However, opening up and publishing data is not enough to create this data value chain due to a number of challenges such as the heterogeneity of formats (XML, CSV, Excel, PDF, Shape Files), the variety of access methods (API, database, dump) and the lack of nomenclature that would enable to better re-use and interconnect datasets. In this thesis, we explore how semantic web technologies can be used to tackle the research problems related to the integration and consumption of geo-spatial data. This thesis applies the Linked Data principles in the domain of geographic information (a key domain for open government). In particular, we address three key challenging problems in the publishing workflow of geospatial open data publication and consumption, with real world use cases from the the French National mapping agency (IGN): (1) How to efficiently represent and store geospatial data on the Web to ensure interoperable applications? (2) What are the best options for a user to interact with semantic content using visualizations? (3) What are the mechanisms that support preserving structured data of a high quality on the Web? Our contributions are thus break down into three parts with applications in the geograph- ical domain. We propose and model four vocabularies for representing coordinate reference systems (CRS), topographic entities and their geometries. These ontologies extend existing vocabularies and add two additional advantages: an explicit use of CRS identified by URIs for geometry, and the ability to describe structured geometries in RDF. We have contributed to the development of the Datalift platform that aims to support lay users in the process of ”lifting” raw data into RDF. We have published the French authoritative database GEOFLA using this tool and we provide a systematic evaluation of the performances of the most used endpoints when dealing with spatial queries. Regarding the consumption of linked data, after reviewing different categories of visual- ization (generic and Linked Data specific), we propose a vocabulary for Describing VIsual- ization Applications (DVIA). We formalize and implement a novel workflow for visualizing datasets with the LDVizWiz tool: a Linked Data Visualization Wizard. The last part of the thesis describes contributions to the Linked Open Vocabularies (LOV) catalogue: it shows how LOV can be used with an ontology modeling methodology (e.g. the NeOn methodology) to improve reuse of vocabularies. We propose an heuristic to align vo- cabularies and a ranking of vocabularies based on information content (IC) metrics. Finally, the thesis provides answers on how to check the license compatibility between vocabularies and datasets in the publication workflow. Through this thesis, we demonstrate the benefits of using semantic technologies and W3C standards to improve the discovery, interlinking and visualization of geospatial government data for their publication on the Web. Contents Abstract . iii Contents . vi List of Figures . xi List of Tables . xiii List of Listings . xv List of Publications . xvii Acronyms . xx 1 Introduction 1 1.1 Context . .1 1.1.1 Geographic Information (GI) . .3 1.1.2 Geographic Data in the Linked Data Cloud . .4 1.2 Research Questions . .6 1.3 Contributions . .9 1.3.1 Modeling, publishing and querying geodata . .9 1.3.2 Visualization Tools in Linked Government Data . 10 1.3.3 Contributions to Standards . 11 1.4 Thesis Outline . 12 I Modeling, Interconnecting and Generating Geodata on the Web 15 2 Geospatial Data on the Web 17 2.1 Geographic Information . 18 2.1.1 Specificity . 18 2.1.2 Data Formats and Serialization . 20 2.2 A REST Service for Converting Geo Data . 21 2.2.1 Background . 21 2.2.2 Why a REST Service is Needed? . 22 2.2.3 API Access and Implementation . 23 2.3 Current Modeling Approach . 26 2.3.1 Status of Vocabularies Usage for Geospatial Data . 27 2.3.2 Geospatial Vocabularies . 31 2.3.3 Georeferencing Data on the Web . 32 2.4 Vocabularies for Geometries and Feature Types . 36 2.4.1 Motivation . 36 2.4.2 A vocabulary for Geometries . 37 2.4.3 A Vocabulary for Topographic Entities . 39 2.4.4 Discussions . 41 2.5 Summary . 42 viii Contents 3 Publishing, Interlinking and Querying Geodata 43 3.1 Current Representation of Geodata on the Web . 44 3.1.1 DBpedia Modeling . 44 3.1.2 Geonames Modeling . 45 3.1.3 LinkedGeoData Modeling . 45 3.1.4 Discussion . 45 3.2 Existing Tools for Converting Geospatial Data . 46 3.2.1 Geometry2RDF . 46 3.2.2 TripleGeo . 46 3.2.3 shp2GeoSPARQL . 47 3.2.4 GeomRDF: Datalift tool for Converting Geodata . 47 3.2.5 Limitations of Existing Tools . 48 3.3 Interlinking Geospatial Vocabulary and Data . 49 3.3.1 Criteria for Interlinking Geospatial Data . 50 3.3.2 Functions for Comparing Geometries . 50 3.3.3 GeOnto Alignment Process Scenario . 52 3.4 Survey on Triple Stores and Workbench . 53 3.4.1 Generic Triple Stores . 53 3.4.2 Geospatial Triple Stores . 54 3.4.3 Assessing Triple Stores . 55 3.4.4 Workbench for Geospatial Data . 56 3.5 Datalift: A tool for Managing Linked (Geo)Data Publishing Workflow . 58 3.5.1 Functionalities of the Datalift Platform . 58 3.6 Publishing French Authoritative Datasets . 62 3.6.1 Publishing French Administrative Units (GEOFLA) . 62 3.6.2 Publishing the French Gazetteer . 64 3.6.3 Publishing Addresses of OSM-France in RDF . 64 3.6.4 Status of the French LOD cloud (FrLOD) . 68 3.7 Evaluation of Spatial Queries . 70 3.7.1 Querying LinkedGeodata . 70 3.7.2 Querying FactForge (OWLIM) . 71 3.7.3 Querying Structured Geometries . 71 3.8 Summary . 72 II Generating Visualizations for Linked Data 73 4 Analyzing and Describing Visualization Tools and Applications 75 4.1 Survey on Visualization Tools . 75 4.1.1 Tools for visualizing Structured Data . 76 4.1.2 Tools for visualizing RDF Data . 78 4.1.3 Discussion .