On the efficient distributed evaluation of SPARQL queries
Damien Graux

To cite this version:

Damien Graux. On the efficient distributed evaluation of SPARQL queries. Other [cs.OH]. Université Grenoble Alpes, 2016. English. NNT: 2016GREAM058. tel-01618366

HAL Id: tel-01618366
https://tel.archives-ouvertes.fr/tel-01618366
Submitted on 17 Oct 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

THESIS

Submitted for the degree of DOCTOR OF THE UNIVERSITÉ GRENOBLE ALPES

Specialty: Computer Science

Ministerial decree of 25 May 2016

Prepared at INRIA and the MSTII doctoral school

On the Efficient Distributed Evaluation of SPARQL Queries

Presented by

Damien GRAUX

Thesis supervised by Nabil LAYAÏDA and co-supervised by Pierre GENEVÈS

Thesis publicly defended on 15 December 2016, before a jury composed of:

Mr. Patrick VALDURIEZ, DR Inria, Reviewer
Mr. Mohand-Saïd HACID, Prof. Lyon, Reviewer
Mr. Farouk TOUMANI, Prof. Clermont-Ferrand, Examiner
Mr. Pierre GENEVÈS, CR CNRS, Thesis co-supervisor
Mr. Nabil LAYAÏDA, DR Inria, Thesis supervisor
Mr. Jérôme EUZENAT, DR Inria, President

Université Grenoble Alpes

ON THE EFFICIENT DISTRIBUTED EVALUATION OF SPARQL QUERIES

Damien Graux

Supervisor: Nabil Layaïda
Co-Supervisor: Pierre Genevès

– Jury Members –
Patrick Valduriez, reviewer
Mohand-Saïd Hacid, reviewer
Farouk Toumani, examiner
Pierre Genevès, co-supervisor
Nabil Layaïda, supervisor
Jérôme Euzenat, president, examiner

December 15th, 2016

Foretaste

Beneath snow-capped peaks,
On Distributed SPARQL.
PhD. Thesis...

The Semantic Web, standardized by the World Wide Web Consortium, aims at providing a common framework that allows data to be shared and analyzed across applications. The Resource Description Framework (RDF) and the query language SPARQL constitute two major components of this vision.

Because of the increasing amounts of RDF data available, dataset distribution across clusters is poised to become a standard storage method. As a consequence, efficient and distributed SPARQL evaluators are needed.

To tackle these needs, we first benchmark several state-of-the-art distributed SPARQL evaluators while monitoring a set of metrics appropriate to a distributed context (e.g. network traffic). Then, an analysis driven by typical use cases leads us to define new development perspectives in the field of distributed SPARQL evaluation. On the basis of these perspectives, we design several efficient distributed SPARQL evaluators whose performance is validated and compared to that of state-of-the-art evaluators. For instance, our distributed SPARQL evaluator named SPARQLGX¹ offers efficient time performance while being resilient to the loss of nodes.

¹ http://github.com/tyrex-team/sparqlgx


Abstract

Context. The Semantic Web aims at providing a common framework that allows data to be shared and reused across applications. The increasing amounts of RDF data available raise a major need and research interest in building efficient and scalable distributed SPARQL query evaluators.

Contributions. In this work, to establish a common basis for comparative analysis, we first evaluate various state-of-the-art SPARQL evaluation systems on the same cluster of machines. These experiments lead us to make several observations: (i) the solutions have