Ontology-based methods for analyzing life science data Olivier Dameron

To cite this version:

Olivier Dameron. Ontology-based methods for analyzing life science data. Bioinformatics [q-bio.QM]. Univ. Rennes 1, 2016. tel-01403371

HAL Id: tel-01403371 https://hal.inria.fr/tel-01403371 Submitted on 25 Nov 2016

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Habilitation à Diriger des Recherches
presented by
Olivier Dameron

Ontology-based methods for analyzing life science data

Defended publicly on 11 January 2016 before a jury composed of

Anita Burgun, Professeur, Université René Descartes, Paris (Examinatrice)
Marie-Dominique Devignes, Chargée de recherches CNRS, LORIA, Nancy (Examinatrice)
Michel Dumontier, Associate professor, Stanford University, USA (Rapporteur)
Christine Froidevaux, Professeur, Université Paris Sud (Rapporteure)
Fabien Gandon, Directeur de recherches, Inria, Sophia-Antipolis (Rapporteur)
Anne Siegel, Directrice de recherches CNRS, IRISA, Rennes (Examinatrice)
Alexandre Termier, Professeur, Université de Rennes 1 (Examinateur)

Contents

1 Introduction
  1.1 Context
  1.2 Challenges
  1.3 Summary of the contributions
  1.4 Organization of the manuscript

2 Reasoning based on hierarchies
  2.1 Principle
    2.1.1 RDF for describing data
    2.1.2 RDFS for describing types
    2.1.3 RDFS entailments
    2.1.4 Typical uses of RDFS entailments in life science
    2.1.5 Synthesis
  2.2 Case study: integrating diseases and pathways
    2.2.1 Context
    2.2.2 Objective
    2.2.3 Linking pathways and diseases using GO, KO and SNOMED-CT
    2.2.4 Querying associated diseases and pathways
  2.3 Methodology: Web services composition
    2.3.1 Context
    2.3.2 Objective
    2.3.3 Semantic compatibility of services parameters
    2.3.4 Algorithm for pairing services parameters
  2.4 Application: ontology-based query expansion with GO2PUB
    2.4.1 Context
    2.4.2 Objective
    2.4.3 Semantic expansion
    2.4.4 Query generation
  2.5 Synthesis

3 Reasoning based on classification
  3.1 Principle
    3.1.1 OWL Classes
    3.1.2 Union and intersection of classes
    3.1.3 Disjoint classes
    3.1.4 Negation: complement of a class
    3.1.5 Existential and universal restrictions
    3.1.6 Cardinality restrictions
    3.1.7 Property chains

    3.1.8 Synthesis
  3.2 Methodology: Description-logics representation of anatomy
    3.2.1 Context
    3.2.2 Objective
    3.2.3 Converting the FMA into OWL-DL
    3.2.4 Addressing expressiveness and application-independence: OWL-Full
    3.2.5 Pattern-based generation of consistency constraints
  3.3 Methodology: diagnosis of heart-related injuries
    3.3.1 Context
    3.3.2 Objective
    3.3.3 Reasoning about coronary artery ischemia
    3.3.4 Reasoning about pericardial effusion
  3.4 Optimization: modeling strategies for estimating pacemaker alerts severity
    3.4.1 Context
    3.4.2 Objective
    3.4.3 CHA2DS2VASc score
    3.4.4 Modeling strategies
    3.4.5 Comparison of the strategies' performances
  3.5 Synthesis

4 Reasoning with incomplete information
  4.1 Principle
  4.2 Methodology: grading tumors
    4.2.1 Context
    4.2.2 Objective
    4.2.3 Why the NCIT is not up to the task
    4.2.4 An ontology of glioblastoma based on the NCIT
    4.2.5 Narrowing the possible grades in case of incomplete information
  4.3 Methodology: clinical trials recruitment
    4.3.1 Context
    4.3.2 Objective
    4.3.3 The problem of missing information
    4.3.4 Eligibility criteria design pattern
    4.3.5 Reasoning
  4.4 Synthesis

5 Reasoning with similarity and particularity
  5.1 Principle
    5.1.1 Comparing elements with independent annotations
    5.1.2 Taking the annotations underlying structure into account
    5.1.3 Synthesis
  5.2 Methodology: semantic particularity measure
    5.2.1 Context
    5.2.2 Objective
    5.2.3 Definition of semantic particularity
    5.2.4 Formal properties of semantic particularity
    5.2.5 Measure of semantic particularity
    5.2.6 Use case: Homo sapiens aquaporin-mediated transport
  5.3 Methodology: threshold determination for similarity and particularity
    5.3.1 Context

    5.3.2 Objective
    5.3.3 Similarity threshold
    5.3.4 Particularity threshold
    5.3.5 Evaluation of the impact of the new threshold on HolomoGene
  5.4 Synthesis

6 Conclusion and research perspectives
  6.1 Producing and querying linked data
    6.1.1 Representing our data as linked data
    6.1.2 Querying linked data
  6.2 Analyzing data
    6.2.1 Selecting relevant candidates when reconstructing metabolic pathways
    6.2.2 Analyzing TGF-β signaling pathways
    6.2.3 Data analysis method combining ontologies and formal concept analysis

Bibliography

Curriculum Vitæ

Acknowledgments

I am indebted to Mark Musen, Anita Burgun and Anne Siegel for allowing me to join their teams, for their help and their encouragement, as well as for setting up such exciting environments. I am most grateful to Michel Dumontier, Christine Froidevaux and Fabien Gandon for kindly accepting to review this document, and also to Anita Burgun, Marie-Dominique Devignes, Anne Siegel and Alexandre Termier for agreeing to serve on the committee.

I am thankful to all my colleagues for all the work we have accomplished together, as summarized by Figure 1 on the facing page. The PhD students I co-supervised have brought significant contributions to most of the works presented in this manuscript: Nicolas Lebreton, Élodie Roques, Charles Bettembourg, Philippe Finet, Jean Coquet and Yann Rivault. Close collaborations with colleagues from whom I learnt a lot were very stimulating: Daniel Rubin, Natasha Noy, Anita Burgun, Julie Chabalier, Andrea Splendiani, Paolo Besana, Lynda Temal, Pierre Zweigenbaum, Cyril Grouin, Annabel Bourdé, Oussama Zekri, Pascal van Hille, Jean-François Éthier, Bernard Gibaud, Régine Le Bouquin-Jeannès, Anne Siegel, Jacques Nicolas, Guillaume Collet, Sylvain Prigent, Geoffroy Andrieux, Nathalie Théret, Fabrice Legeai, Anthony Bretaudeau, Olivier Filangi and again Charles Bettembourg. Collaborating with colleagues from INRA was a great opportunity to tackle "real problems" and gave a great perspective on what this is all about: Christian Diot, Frédéric Hérault, Denis Tagu, Mélanie Jubault, Aurélie Évrard, Cyril Falentin, Pierre-Yves Lebail and all the ATOL curators: Jérôme Bugeon, Alice Fatet, Isabelle Hue, Catherine Hurtaud, Matthieu Reichstadt, Marie-Christine Salaün, Jean Vernet, Léa Joret. Similarly, collaborations with colleagues from EHESP provided the "human" counterpart: Nolwenn Le Meur, Yann Rivault. More broadly, I benefited from the interactions with many other colleagues: Gwenaëlle Marquet, Fleur Mougin, Ammar Mechouche, Mikaël Roussel, as well as the whole Symbiose group at IRISA, and particularly Pierre Peterlongo. Working on workflows with Olivier Collin, Yvan Le Bras, Alban Gaignard and Audrey Bihouée has been fun and I am eager to see what will come out of it.
In addition to research, all these years were also the occasion of great teaching-related encounters: Christian Delamarche, Emmanuelle Becker, Emmanuel Giudice, Annabelle Monnier, Yann le Cunff, Cédric Wolf.

Thank you all, I enjoyed all of it.

Figure 1: Graph of my co-authors. Two authors are linked if they share at least one publication. Node size is proportional to the number of articles.

Chapter 1

Introduction

This document summarizes my research activities since the defense of my PhD in December 2003. This work was carried out initially as a postdoctoral fellow at Stanford University with Mark Musen's Stanford Medical Informatics group (now BMIR1), and then as an associate professor at University of Rennes 1, first with the UPRES-EA 3888 team (which became UMR 936 INSERM – University of Rennes 1 in 2009) from 2005 to 2012, and then with the Dyliss team at IRISA since 2013.

First, I will present the context in which my research takes place. We will see that the traditional approaches for analyzing life science data do not scale up and cannot handle their increasing quantity, complexity and connectivity. It has become necessary to develop automatic tools, not only for performing the analyses, but also for helping the experts perform them. Yet processing the raw data is so difficult to automate that these tools usually hinge on annotations and metadata as machine-processable proxies that describe the data and the relations between them.

Second, I will identify the main challenges. While generating these metadata is a challenge of its own that I will not tackle here, it is only the first step. Even if metadata tend to be more compact than the original data, each piece of data is typically associated with many metadata, so the problem of data quantity remains. These metadata have to be reused and combined, even if they have been generated by different people, in different places, in different contexts, so we also have a problem of integration. Eventually, the analyses require some reasoning over these metadata. Most of these analyses were not possible before the data deluge, so we are inventing and improving them now. This also means that we have to design new reasoning methods for answering life science questions using the opportunities created by the data deluge while not drowning in it. Arguably, biology has become an information science.
Third, I will summarize the contributions presented in the document. Some of the reasoning methods that we develop rely on life science background knowledge. Ontologies are the formal representations of the symbolic part of this knowledge. The Semantic Web is a more general effort that provides a unified framework of technologies and associated tools for representing, sharing and combining metadata, and for pairing them with ontologies. I developed knowledge-based reasoning methods for life science data. Finally, I will describe the organization of the manuscript.

1 http://bmir.stanford.edu/

1.1 Context: integrative analysis of life science data

Life sciences are intrinsically complicated and complex [1, 2]. Until a few years ago, both the scarcity of available information and the limited processing power imposed the double constraint that work had to be performed on fragmented areas (either precise but narrow, or broad but shallow) as well as under simplifying hypotheses [3].

The recent joint evolution of data acquisition capabilities in the biomedical field and of the methods and infrastructures supporting data analysis (grids, the Internet...) resulted in an explosion of data production in complementary domains (*omics, phenotypes and traits, pathologies, micro and macro environment...) [3, 4, 5]. For example, the BioMart community portal provides a unified interface to more than 800 biological datasets distributed worldwide and spanning genomics, proteomics and cancer data [6], and the 2015 Nucleic Acids Research database issue refers to more than 1500 biological databases [7]. Making data reusable has been widely advocated [8]. This "data deluge" is the life-science version of the more general "big data" phenomenon, with the specificities that the proportion of generated data is much higher, and that these data are highly connected [9].

In addition to the breakthroughs in each of these domains, major efforts have been undertaken to develop the links between them: systems biology2 [10, 11, 12, 13, 14] at the fundamental level, translational medicine3 [15, 16] for the link between the fundamental and clinical levels, and more recently translational bioinformatics4 [17, 18, 19, 20] for the link between what happens at the molecular and cellular levels and what happens at the organ and individual levels. These links between domains are obviously useful for performing better analyses of data, but conversely these new connections can sometimes reshape the domains themselves. For example, translational bioinformatics modifies the definition of the fundamental notion of what constitutes a disease by considering the sequencing of genes or the quantitation of panels of RNA in addition to the traditional nosology [21]. We are witnessing the transition from a world of isolated islands of expertise to a network of inter-related domains [4, 22].
This is supported by another transition: from a world where we had a small quantity of information on a lot of people to a world where we have a lot of information in related domains (genetics, pathology, physiology, environment) for a small but increasing number of people. Storage capabilities kept pace with the increasing data generation. However, the bottleneck that was once data scarcity now lies in the lack of adequate data processing and data analysis methods. This increasing data quantity and connectivity was the origin of new challenges.

The stake of data integration consists in establishing, and then systematically using, the links between elements from different domains (e.g. from *omics to pathologies, from pathologies to *omics, or between *omics or pathologies of different species) having potentially different levels of precision [5]. For example, meta-analysis of heterogeneous annotations and pre-existing knowledge often leads to novel insights undetectable by individual analyses [23, 24]. Systems

2 Systems biology aims at modeling the interactions between the elements of a biological system and their emergent properties. These elements can themselves be composed of sub-elements that can interact among themselves or with other elements.
3 Translational medicine aims at providing the best treatment for each patient by using the most recent discoveries in biology, drug discovery and epidemiology (bench to bedside), and conversely at reusing medical data when performing research (bedside to bench).
4 Translational bioinformatics derives from translational medicine and focuses on integrating information on clinical and molecular entities. It aims at improving analysis and, ultimately, clinical care.

biology, translational medicine and translational bioinformatics all focus on the systematic organization of these links. The systematic exploitation of data permitted by integration requires some kind of automation. Because of the intrinsic complexity of life sciences, vast quantities of elements, as well as the numerous links between them that represent their inter-dependencies, have to be taken into account [25, 26]. This systematic exploitation of data is not only massive, it is also complex [4, 27]. The systematic analysis of the integrated data requires some interpretation, which hinges on background knowledge [3]. Expertise or domain knowledge can be seen as the set of rules representing under what conditions data can be used or combined for inferring new data or new links between data (Levesque also provided an excellent, more general article on knowledge representation for artificial intelligence [28]).

The remainder of this document focuses on the third challenge: using knowledge for automatically integrating and analyzing biomedical data in a context covering translational medicine and translational bioinformatics.

1.2 Challenges: using domain knowledge to integrate and analyze life science data

Several bottlenecks hamper the automated systematic exploitation of biomedical data:

• it has to take expertise or knowledge into account [29]. This requires both representing such knowledge in a formalism supporting its use in an automatic setting, and formally representing the conditions that determine the knowledge's validity.

• it relies on data and knowledge that are obviously incomplete [3, 30]. We are therefore in an intermediate state where we must develop automatic methods for processing vast amounts of heterogeneous and inter-dependent data while being limited by the incomplete and fragmentary aspect of these data.

• it produces results that are so big and so complex that their biological interpretation is at best difficult. Dentler et al. showed that "Today, clinical data is routinely recorded in vast amounts, but its reuse can be challenging" [31]. Moreover, it is not only the quantity of data that is increasing, but also that of the associated metadata that describe and connect these data. Rho et al. point out that "One important issue in the field is the growing complexity of annotation data themselves" and that "Major difficulties towards meaningful biological interpretation are integrating diverse types of annotations and at the same time, handling the complexities for efficient exploration of annotation relationships" [24].

As Stevens et al. noted, "Much of biology works by applying prior knowledge [...] to an unknown entity, rather than the application of a set of axioms that will elicit knowledge. In addition, the complex biological data stored in bioinformatics databases often require the addition of knowledge to specify and constrain the values held in that database" [29]. The same holds for the biomedical domain, e.g. for identifying patient subgroups in clinical data repositories [32]. The knowledge we are focusing on is mostly symbolic, as opposed to other kinds of biomedical knowledge (probabilistic, related to chemical kinetics, 3D models of anatomical entities or 4D models of processes...). It should typically support generalization, association and deduction.

There is a long tradition of work aiming to come up with an explicit and formal representation of this knowledge that would support automatic processing. Cimino identified the following key requirements: "vocabulary content, concept orientation, concept permanence, non-semantic concept identifiers, polyhierarchy, formal definitions, rejection of 'not elsewhere classified' terms, multiple granularities, multiple consistent views, context representation, graceful evolution, and recognized redundancy" [33, 34, 35].

This line of work resulted in the now widespread acceptance of ontologies [29, 36] to represent biomedical entities, their properties and the relations between these entities. Bard et al. defined ontologies as "formal representations of knowledge in which the essential terms are combined with structuring rules that describe the relationships between the terms" [37]. This covers the main points and encompasses alternative definitions [38, 39]. Ontologies range from fairly simple hierarchies to semantically rich organizations supporting complex reasoning [36]. There is also a distinction depending on their scope. Top-level ontologies (or upper ontologies) such as DOLCE or BFO are domain-independent and represent general notions such as things and processes. Domain ontologies cover a specific domain (e.g. normal human anatomy for the FMA, or the description of gene products for GO) [40].

Ontologies are now a well-established field [36, 2] that evolved from concept representation [41]. In May 2015, there were 442 ontologies referenced by BioPortal, and 10,768 PubMed articles mentioning "ontology" (Figure 1.1 on the next page). They cover the creation of new ontologies, data annotation [2], data integration [3, 42], data analysis [5], and ontology as a research field in its own right [43]. There are many applications for bio-ontologies themselves, for example the analysis of cancer *omics data [44], the integration and analysis of quantitative biology data for hypothesis generation [45], biobanks [42], the interpretation of complex biological networks [46] or even the analysis of research funding by disease [47]. Hoehndorf et al. recently performed a review of the importance of bio-ontologies and their main application domains [48].
Among the main ontologies are the Gene Ontology (GO; for an analysis of how it became the most cited ontology, see [45]), which provides a species-independent vocabulary for describing gene products; the NCI Thesaurus for describing cancer-related entities; the International Classification of Diseases (ICD); OMIM for human genetic disorders; SNOMED Clinical Terms; the US National Drug File (NDF-RT); ChEBI for describing small chemical molecules; UniProt for describing proteins; and the Medical Subject Headings (MeSH) for annotating PubMed articles [36].

As noted previously, numerous ontologies are now used in various contexts. These ontologies can overlap, which hampers data integration, as some resources refer to an entity in one ontology whereas other resources refer to the corresponding entity in another ontology. The Unified Medical Language System (UMLS) provides a unifying architecture between the major biomedical ontologies and terminologies. The problems of ontology dispersion and overlap found a solution with BioPortal5 [49]. It is an open repository of biomedical ontologies that offers the possibility to browse, search and visualize ontologies, as well as to create, store and use mappings between these ontologies (i.e. relations between entities from different ontologies). BioPortal also supports the annotation of data from Gene Expression Omnibus, clinical trials and ArrayExpress. It should be noted that BioPortal also provides an API and Web services, so it can also be used by programs.

5 http://bioportal.bioontology.org/

Figure 1.1: Evolution of the number of PubMed articles referencing "ontology".

The emergence of ontologies in biomedical informatics and bioinformatics happened in parallel with the development of the Semantic Web in the computer science community [50, 41]. The Semantic Web is an extension of the current Web that recognizes the need to represent data on the Web in machine-processable formats and to combine them with ontologies. It aims to support fine-grained data representation for automatic retrieval, integration and interpretation. To do so,

• it shifts the granularity from documents to each atomic piece of data they contain, identifying them with specific URIs (now IRIs);

• it represents explicitly the relations linking some of these data, also designating them with their own URIs (whereas HTML pages are connected only by untyped href links);

• it also encompasses the representation of generalities and of the relations between them (e.g. Alzheimer's disease is a kind of neurodegenerative disease), as well as the connection between atomic data (which are particular) and generalities (which are universal), so that a patient's disease, with all its specificities, can be described as an element of the set of Alzheimer's diseases.

The W3C defines several recommendations (de facto standards) related to the Semantic Web initiative for data representation, integration and analysis. RDF6 (Resource Description Framework) represents data and their relations using triples of URIs: the first designates the subject being described, the second represents the relation (or predicate), and the third, called the object, represents the value of the relation for the subject. RDF provides a special property (rdf:type) to represent the fact that some piece of data identified by its URI is an instance of a general class. RDFS7 (RDF Schema) and OWL8 (Web Ontology Language) provide sets of RDF entities with special, formal semantics to represent generalities (and therefore ontologies). Therefore, RDFS and OWL statements are also RDF triples. RDFS makes it possible to represent taxonomies, and OWL provides several profiles with a more formal semantics that

6 http://www.w3.org/RDF/
7 http://www.w3.org/TR/rdf-schema/
8 http://www.w3.org/2001/sw/wiki/OWL

support richer reasoning. RDFS is well adapted to simple reasoning over large data sets, whereas OWL is adapted to more complex reasoning, at the cost of potentially longer computation times. These reasoning tasks are supported by several other recommendations. SPARQL9 (SPARQL Protocol and RDF Query Language) is an SQL-like query language for RDF. Note that SPARQL 1.1 can take most of the RDFS semantics into account. OWL does not have a query language, but it does not really need one either, because OWL inferences consist mostly in determining whether a piece of data is an instance of a class, or whether a class is a subclass of another one. Additionally, SWRL10 (Semantic Web Rule Language) makes it possible to represent inference rules with variables.

It should be noted that even if most bio-ontologies are represented in OWL, very few take advantage of the language's expressivity. Most are RDFS ontologies disguised in OWL (which is possible because OWL is built on top of RDFS, e.g. all OWL classes are RDFS classes), even though it has been demonstrated that they would benefit from using OWL's additional semantics [51, 52, 53].

Life sciences are a great application domain for the Semantic Web [54, 55, 56], and several major teams are involved in both, particularly in the W3C Semantic Web Health Care and Life Sciences interest group (HCLSIG)11. Since 2008, the Semantic Web Applications and Tools for Life Sciences (SWAT4LS) workshop12 (co-organized by Andrea Splendiani, who was a postdoc at U936) has been an active event, along with conferences such as DILS, ISWC and ESWC. Semantic Web technologies have become an integral part of translational medicine and translational bioinformatics [5, 14]. Several works have shown how these technologies can be used to integrate genotype and phenotype information and perform queries [57, 58]. More recently, Holford et al. proposed a Semantic Web framework to integrate cancer omics data and biological knowledge [44].
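To make the rdf:type / rdfs:subClassOf machinery concrete, here is a minimal sketch of two RDFS entailment rules (rdfs9: instances belong to the superclasses of their class; rdfs11: subClassOf is transitive), using plain Python tuples rather than an RDF library. The URIs and the patient example are hypothetical.

```python
# Minimal sketch of RDFS-style entailment over triples, using plain
# Python tuples instead of an RDF library (illustrative names only).

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def entail(triples):
    """Apply rdfs9 and rdfs11 until a fixpoint is reached."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s, p, o) in inferred:
            for (s2, p2, o2) in inferred:
                if p == SUBCLASS and p2 == SUBCLASS and o == s2:
                    new.add((s, SUBCLASS, o2))   # rdfs11: transitivity
                if p == RDF_TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, RDF_TYPE, o2))   # rdfs9: type propagation
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred

# Toy data in the spirit of the example above (hypothetical URIs).
triples = {
    (":alzheimer", SUBCLASS, ":neurodegenerative_disease"),
    (":neurodegenerative_disease", SUBCLASS, ":disease"),
    (":patient42_disease", RDF_TYPE, ":alzheimer"),
}

closure = entail(triples)
assert (":patient42_disease", RDF_TYPE, ":disease") in closure
```

In SPARQL 1.1 the same entailment can be approximated without a reasoner by a property path such as `?x rdf:type/rdfs:subClassOf* ?class`.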
The Linked Data initiative [59], and particularly the Linked Open Data project, promotes the integration of data sources in machine-processable formats compatible with the Semantic Web. Figure 1.2 on the facing page shows the importance of life sciences in it. In the past few years, this proved instrumental for addressing the problem of data integration [53, 60]. In this context, the Bio2RDF project13 promotes simple conventions to integrate biological data from various origins [61, 62, 63]. Moreover, Semantic Web technologies support federated queries that gather and combine information from several data sources [62]. The reconciliation of identifiers is further facilitated by initiatives such as identifiers.org [64].

1.3 Summary of the contributions

My contributions focused on methods for the automatic analysis of biomedical data, based on ontologies and Semantic Web technologies. This section is organized chronologically to present how the various themes evolved and were applied to different projects. The remainder of the document is organized thematically.

My PhD dissertation consisted in the creation of an ontology of brain cortex anatomy [65, 66]. At this point, the added value of ontologies for data integration and for reasoning had been demonstrated by several major projects for many years. However, it was clear that developing ontologies was a difficult endeavor with a part of craftsmanship. In particular, one had to keep track of multiple dependencies between classes [67, 68, 69]. There was also the perception that automatic reasoning based on ontologies all too often had to be complemented by ad-hoc programming extensions, either as pre-processing for making the data amenable to reasoning, or as post-processing. There was no widespread agreement on the format to use for representing ontologies at the time: frames were the dominant paradigm, but multiple implementations existed in addition to the Protégé editor [70]; interesting solutions like XML were advocated by the World Wide Web Consortium (W3C); Description Logics (DL) were gaining acceptance in the biomedical community thanks to their reasoning capability and the DAML+OIL format being associated with an open-source editor and a reasoner [71].

9 http://www.w3.org/TR/sparql11-overview/
10 http://www.w3.org/Submission/SWRL/
11 http://www.w3.org/wiki/HCLSIG
12 http://www.swat4ls.org/
13 https://github.com/bio2rdf

Figure 1.2: Linked Open Data cloud in August 2014. Nodes are resources. Edges are cross-references between resources. Life science resources constitute the purple portion in the lower right corner. (http://lod-cloud.net/)

By the time I started my post-doc in Mark Musen's Stanford Medical Informatics lab at Stanford University, DAML+OIL had evolved into the OWL effort, which became a formal W3C recommendation in February 2004. Holger Knublauch, also a post-doc at SMI, had just started developing an OWL plugin for Protégé [72]. During my stay, I participated in the Virtual Soldier project. My main contribution was to develop the symbolic reasoning mechanism for inferring the consequences of penetrating bullet injuries based on both anatomical knowledge and patient-specific data [73, 74]. This made extensive use of Description Logics expressiveness to leverage reasoning capabilities based on classes (generic reasoning for inferring that a class is a subclass of another one) as well as on instances (data-specific reasoning for inferring that an individual is an instance of a class). The reasoning relied on rich anatomical knowledge. The Foundational Model of Anatomy (FMA) was the reference ontology, but it was originally developed and maintained in frames, fortunately using Protégé.

My second contribution was on ontology modeling and representation. I studied the theoretical aspects of the conversion of the frame-based FMA into an OWL version, preserving as much as possible of its original semantic richness and automatically adding features that were beyond frames expressivity, such as necessary and sufficient definitions or disjointness constraints [75, 76]. Recognizing that different applications using the FMA may have different expressivity requirements, and that some features may be useful in one context but add an unnecessary computational burden in others, I proposed a modular approach so that users could import only the features they needed.

My third contribution addressed the need to automate certain operations during ontology development and ontology usage. For assisting both the development of the reasoning capability and the conversion of the FMA into OWL, I developed the ProtegeScript plugin [77] (still included in the distribution of Protégé 3), which added scripting capabilities (mainly Python, Ruby and BeanShell) to Protégé and was compatible with both the original frames setting and the OWL plugin. Eventually, I helped organize and teach the first versions of the Protégé Short Course and Protégé-OWL Short Course in 2005, and have been invited back to Stanford to do so until 2011.

Since joining Anita Burgun's team as an associate professor at University of Rennes 1, I have continued working on ontology-based reasoning. Together with Gwenaëlle Marquet, we developed a semantically-rich reasoning application performing automatic classification of glioma tumors [78, 79]. This was in direct continuation of the line of work initiated in the Virtual Soldier project. In both cases, we demonstrated that if the relevant domain ontologies are rich enough, developing an application-specific reasoning module only requires the creation of a few classes. In both cases, though, this assumption was optimistic. Many works by other teams focused on improving existing ontologies [2] such as GO [52] or the NCI Thesaurus [80, 81]. As Jim Hendler pointed out, even a little semantics goes a long way [82], and I extended my work to simpler forms of reasoning. With Julie Chabalier, we created a knowledge source relating diseases and pathways by integrating the Gene Ontology, KEGG Orthology and SNOMED CT [83, 84, 85, 86]. We proposed an approach combining mapping and alignment techniques. We used OWL-DL as the common representation formalism and demonstrated that RDFS queries were expressive enough, with acceptable computational performances.

From 2008 to 2010, I supervised Nicolas Lebreton's PhD thesis with Anita Burgun on Web services parameters compatibility for semi-automatic Web service composition [87, 88, 89]. The context was that biologists typically conduct the analysis of their results by building workflows of atomic programs that run on bioinformatics platforms and grids. They devote a great deal of effort to building and maintaining (ad-hoc) scripts that execute these workflows and ensure the necessary data format conversions. We showed that the WSDL descriptions of Web services only provide a view on the structure of the services' input and output parameters, whereas a view on their nature was necessary for Web services composition.
We proposed an algorithm using the taxonomic hierarchies of classes from the Web services' OWL-S semantic descriptions for checking the semantic compatibility of services parameters and for suggesting compatible parameter pairings between two Web services semi-automatically. We generated Taverna Xscufl files whenever possible. Later, Charles Bettembourg developed GO2PUB as a part of his PhD thesis [90]. GO2PUB is a tool that uses the knowledge from the Gene Ontology (GO) and its annotations for enriching PubMed queries with gene names, symbols and synonyms. We used the GO classes hierarchy for retrieving the genes annotated by a GO term of interest, or one of its more precise descendants. We demonstrated that the use of genes annotated by either GO terms of interest or a descendant of these GO terms yields some relevant articles ignored by other tools. The comparison of GO2PUB, based on semantic expansion, with GoPubMed, based on text-mining techniques, showed that both tools are complementary.

With my participation in the Akenaton project, work on semantically-rich reasoning resumed and shifted to the optimization of symbolic knowledge modeling [91, 92]. The context is the automatic triage of atrial fibrillation alerts generated by implantable cardioverter defibrillators according to their severity. There can be up to twenty alerts per patient per day, with around 500,000 current patients in Europe and an estimated 10,000 new patients every year. Alert severity depends on the CHA2DS2VASc score, whose evaluation requires domain knowledge for reasoning about the patient's clinical context. Several modeling strategies are possible for reasoning on this knowledge. A first work compared ten strategies covering all the possible combinations of Java, OWL-DL and SWRL to compute the CHA2DS2VASc score. A second work compared the best of these ten strategies with a Drools rules implementation [93]. The results showed that their limitations are the number and complexity of Drools rules and the performances of ontology-based reasoning, which suggested using the ontology for automatically generating a part of the Drools rules. Together with the previous work on glioma tumor classification, the Akenaton project opened a new perspective on symbolic reasoning with incomplete information. When we were designing the reasoning module for grading glioma tumors, we observed that some information was missing for several patients, and that the module (rightfully) prevented the system from reaching a conclusion. We showed however that it was possible to narrow the number of possibilities by excluding the situations that we inferred to be impossible. In the Akenaton project, similarly, the CHA2DS2VASc score is computed by summing points for each criterion met by the patient. Missing information can result in under-estimating the real value of the CHA2DS2VASc score.
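Bounding a sum-of-points score under missing information can be sketched as follows. This is a toy illustration, not the actual CHA2DS2VASc implementation: the criteria names and weights below are placeholders, and criteria with unknown status are represented by `None`.

```python
def score_range(weights, status):
    """Bound a sum-of-points score when some criteria are unknown (None).

    Lower bound: only criteria known to be met contribute.
    Upper bound: every criterion not known to be unmet contributes,
    i.e. we start from the maximum and subtract the criteria known to be unmet.
    """
    lower = sum(w for c, w in weights.items() if status.get(c) is True)
    upper = sum(w for c, w in weights.items() if status.get(c) is not False)
    return lower, upper

# Hypothetical criteria: age >= 75 is met, hypertension is absent,
# diabetes is undocumented.
weights = {"age75": 2, "hypertension": 1, "diabetes": 1}
status = {"age75": True, "hypertension": False, "diabetes": None}
print(score_range(weights, status))  # (2, 3)
```

The returned interval makes the uncertainty explicit: the true score lies between the two bounds, instead of being silently under-estimated by the lower bound alone.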
We proposed as a complementary approach to start from the maximum possible CHA2DS2VASc score value and to subtract points for each criterion not met by the patient. This in turn resulted in an over-estimation of the score. For a patient, combining the two values allowed us to determine the range of the possible scores, instead of the false sense of security provided by a single value that may be under-estimated. Building on this experience, my contribution to the Astec project in 2012 was an OWL design pattern for modeling eligibility criteria that leveraged the open world assumption to address the missing information problem of prostate cancer clinical trial patient recruitment [94, 95]. Over the years, my interest in bioinformatics grew. In parallel with the previous works, I started in 2010 a collaboration with Christian Diot at UMR1348 PEGASE (INRA and Agrocampus Ouest) on knowledge-based cross-species metabolic pathway comparison in order to study how lipid metabolism was different in chicken, human and mouse [96, 97]. Together, we supervised Charles Bettembourg's master's degree in 2010 and his PhD thesis, ongoing since 2011. Our collaboration originated from the observation that when overfed, chickens do not develop liver steatosis, whereas other animals such as geese, mice and humans do. Liver steatosis can further evolve into fatty liver disease and cancers, so analyzing the specificities of chicken's lipid metabolism is of both agricultural and medical interest. Our approach is based on metabolic pathways structural comparison in order to identify common and species-specific reactions, and more importantly on functional comparison in order to quantify how much a metabolic process is common or species-specific. We improved a semantic similarity measure based on the Gene Ontology and created another metric measuring semantic specificity.
This work opened the opportunity of another collaboration with Frédéric Hérault on functional analysis and comparison of gene sets, where we demonstrated the benefits of using semantic similarity for post-processing and clustering DAVID results [98].

In 2013, I joined the Dyliss team at IRISA. I contributed to the analysis of the candidate metabolic networks for Ectocarpus siliculosus generated by Sylvain Prigent in the Idealg project during his PhD [99]. I am also working with Nathalie Théret, Geoffroy Andrieux and Jean Coquet (whom I co-supervise) on the analysis of TGF-β signaling pathways and their role in human cancer [100]. Additionally, I collaborate with Fabrice Legeai, Anthony Bretaudeau, Charles Bettembourg and Denis Tagu, as well as with Mélanie Jubault and Aurélie Évrard, on representing, storing and querying aphid [101] and Brassicaceae data in RDF. I co-supervise with Régine Le Bouquin-Jeannès and Bernard Gibaud from LTSI the PhD thesis of Philippe Finet on the integration and analysis of telemedicine data for monitoring patients with multiple chronic diseases [102, 103, 104]. I will co-supervise with Nolwenn Le Meur from EHESP the PhD thesis of Yann Rivault on the analysis of patients' care trajectories [105]. These works are still in progress and are further developed in my research perspectives in Chapter 6.

Over the years, the biomedical data and ontologies I have been using evolved from a medical/clinical context to a more general biological one. However, the reasoning primitives remained the same, so the distinction is not really relevant. From this point, I will refer to life science data in general.

1.4 Organization of the manuscript

My various contributions belonged to different zones in the reasoning continuum ranging from the simple exploitation of a taxonomy to sophisticated reasoning involving intricate necessary and sufficient definitions and the open world assumption.

Chapter 2 presents reasoning based on hierarchies, which, in spite of its simplicity, is valuable as a way to circumvent computational limitations when the task at hand does not require more elaborate features. Section 2.1 is a summary of RDF and RDFS principles and of the associated entailments. Section 2.2 emphasizes constraints due to ontologies' size and presents an early case study for inferring candidate associations between biological pathways from KEGG and diseases from SNOMED-CT using the Gene Ontology as a pivot. Section 2.3 focuses on a method for performing semi-automatic pairing of Web services parameters. Section 2.4 shows how computation performances support performing on-the-fly semantic expansion of PubMed queries. Chapter 3 presents reasoning based on classification for inferring whether an individual is an instance of a class or whether a class is a subclass of another one. Section 3.1 is a summary of OWL main principles and the associated inferences. Section 3.2 shows that OWL both achieves a higher level of expressivity for representing an ontology of human anatomy, and simplifies the process of building and maintaining complex ontologies by supporting consistency constraints. Section 3.3 shows that the expressivity of this anatomy ontology supports the complex reasoning required to infer the consequences of bullet injuries in the region of the heart. Section 3.4 focuses on the comparison of OWL and SWRL respective advantages for optimizing the classification of pacemaker alerts. In all these situations, we showed that if the domain ontologies are available and rich enough, combining them and designing the reasoning portion of the application required a very small amount of work. Unfortunately, we also found repeatedly that such domain ontologies rarely existed. Chapter 4 presents how classification can be performed when the available information is incomplete. Section 4.1 is a summary of the open world assumption.
Section 4.2 presents a preliminary method for inferring the grade of a tumor according to its description. If the description is incomplete, a classical classification approach may fail because none of the grades' requirements are fulfilled. Our method then narrows the range of possible grades by ruling out those incompatible with the available information. Section 4.3 improves the previous method

and proposes a design pattern for modeling clinical trials' eligibility criteria in order to increase patient recruitment. Chapter 5 presents how ontologies can also be used to perform semantic similarity-based reasoning. Section 5.1 summarizes the principles of semantic similarity for comparing elements or sets of elements. Section 5.2 proposes a method for computing a generic semantic particularity measure that can be combined with any similarity measure for a finer interpretation. Section 5.3 presents a method for determining optimal thresholds for semantic similarity and particularity measures. Chapter 6 presents my research perspectives for producing, querying and analyzing life science data.

Chapter 2

Reasoning based on hierarchies

Outline

Taxonomy-based reasoning is arguably the simplest form of reasoning on an ontology. Nevertheless, this simplicity can also be valuable. It is appropriate whenever the ontology is semantically poor (i.e. a taxonomy or an RDFS hierarchy or polyhierarchy) or when performances are important (i.e. when short answer times, possibly over large hierarchies, are required). This chapter presents the general principles of taxonomy-based reasoning, and three situations where it was relevant. It demonstrates that even simple reasoning brings added value in situations where using more elaborate tools would be overkill. Section 2.2 emphasizes constraints due to ontology size: it is a use case for generating candidate associations between diseases and pathways using simple ontologies at a time when OWL reasoners could not load them. Section 2.3 is more method-oriented and presents semi-automatic pairing of Web services parameters. Section 2.4 focuses on the computation performances of an application providing on-the-fly semantic expansion of PubMed queries.

2.1 Principle

This section presents why using symbolic data descriptions is a good strategy for analyzing large interdependent datasets, and provides an overview of the associated requirements. We show that the situation we are facing in life sciences is part of a more general problem. Finally, we show how RDF and RDFS support the representation and the analysis of these descriptions.

2.1.1 RDF for describing data

2.1.1.1 Describing data: a generic problem

Annotations as proxies to data Analyzing data can be difficult or time-consuming (usually both), and the problem is even worse if we have to deal with large quantities of data. Moreover, for interdependent data, analyzing some of the data can require the prior analysis of other data. Therefore, saving the result of the interpretation or of the analysis as annotations or metadata is a good strategy so that the next time we need to retrieve some information we do not need to perform the analysis all over again. These annotations can then be used as proxies for faster or more accurate access to results. Naturally, when dealing with large data sets or in order to promote sharing, saving these annotations in a machine-processable format rather than in plain text is desirable.

21 Data annotation requirements Ideally, these machine-processable data annotations should support the following requirements:

• describe their nature (i.e. a binary relation between the data and a set of things sharing some common features): “TGFB1” is a gene, “TGF-β1” is a protein, “apoptosis” is a biological process, “diabetes” is a disease, “The use and misuse of Gene Ontology annotations” is an article,

• describe their properties (i.e. a binary relation between a data element and some datatype value such as a string, a number, a date, etc.): “TGF-β1” is 390 amino acids long, “The use and misuse of Gene Ontology annotations” was published in 2008,

• describe the relations between them (i.e. a binary relation between two data elements): “TGFB1” is associated to “Homo sapiens”, it is located in “chromosome 19” and encodes “TGF-β1”, which interacts with the “LTBP1” protein and is involved in “apoptosis”; “Seung Yon Rhee” is an author of “The use and misuse of Gene Ontology annotations”, which has the “Gene Ontology” as subject.

• combine the descriptions from different sources either because these sources partially cover the same topic (e.g. metabolic pathways from Reactome and from HumanCYC) or because these sources cover complementary topics (e.g. the genes associated to a disease of interest and the pathways these genes are involved in).

In the previous examples, the resources are typically identified by a string issued by some de facto or de jure “authoritative source”: the human gene “TGFB1” is preferentially referred to by “ENSG00000105329” in Ensembl, “Homo sapiens” by “9606” in the NCBI taxonomy of species, the human proteins “TGF-β1” and “LTBP1” respectively by “P01137” and “Q14766” in Uniprot, the article “The use and misuse of Gene Ontology annotations” by “PMID:18475267” in PubMed, and the Gene Ontology by “http://purl.bioontology.org/ontology/GO” in BioPortal. There is an obvious heterogeneity of the identifier patterns among these authorities. Moreover, what constitutes an “authoritative source” is not always well defined. For example, apoptosis is described among others as a biological process in the Gene Ontology (GO:0006915), as a cellular process in KEGG (ko04210), or as a pathway in Reactome (REACT_578). Similarly, “glioma” is identified by “C71” in the 10th version of the International Classification of Diseases ICD10 (but it was “191” in the 9th version), by “DI-02566” in Uniprot, by “ko05214” in KEGG... Finally, some resources may not have been assigned an identifier: as of today, Seung Yon Rhee (the author of PMID:18475267) does not appear to have an ORCID1 identifier. Although all the examples we gave are related to life sciences, the problem is more generic.

2.1.1.2 RDF

The Resource Description Framework2 (RDF) is a W3C recommendation providing a standard model for data interchange on the Web.

Identify resources using IRIs In RDF, a resource is anything that can be identified. Identification is performed using Internationalized Resource Identifiers (IRIs), which generalize Uniform Resource Identifiers (URIs) to non-ASCII character sets such as kanji, devanagari, cyrillic... In the remainder of this document we will only use URIs.

1http://orcid.org/ 2http://www.w3.org/RDF/

URIs syntax follows the pattern <scheme name>:<hierarchical part>[?<query>][#<fragment>] where <scheme name> is typically “http” or “urn”. Note that although URIs having an http scheme name look like URLs, they actually form a superset of URLs as they may not be dereferenced (i.e. they are identifiers, not addresses, and there is not necessarily an Internet resource at this address). For example, Uniprot generates URIs for proteins by appending their Uniprot identifier to http://purl.uniprot.org/uniprot/ and these are URLs (and by transitivity, also URIs and IRIs), so that http://purl.uniprot.org/uniprot/P01137 is dereferenced either to a Web page or to some RDF description of TGF-β1 depending on the header of the request. Likewise, the Gene Ontology generates URIs for its terms by replacing the colon in their identifiers by an underscore, and by appending the result to http://purl.obolibrary.org/obo/GO_, but these are not dereferenceable: http://purl.obolibrary.org/obo/GO_0006915 is a URI that is not a URL. Also note that URIs specify how to represent a resource identifier but do not guarantee uniqueness, so that anyone is free to forge as many URIs as wanted to identify something (e.g. bio2rdf uses http://bio2rdf.org/go:0006915 for referring to GO:0006915 whereas the Gene Ontology uses http://purl.obolibrary.org/obo/GO_0006915). Of course, interoperability encourages reusing existing identifiers whenever possible. As URIs are cumbersome for humans to deal with, we often use the more convenient prefixed version (e.g. uniprot:P01137), but this still requires specifying somewhere that the “uniprot:” prefix is actually associated to “http://purl.uniprot.org/uniprot/”. Prefixed URIs are always unambiguously expanded into full URIs before being processed.
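Prefix expansion is a purely mechanical operation. The sketch below is plain Python with a hard-coded prefix table built from the examples above (a real application would load the prefix declarations from the data itself):

```python
# Prefix table (an assumption, taken from the examples in the text).
PREFIXES = {
    "uniprot": "http://purl.uniprot.org/uniprot/",
    "obo": "http://purl.obolibrary.org/obo/",
}

def expand(prefixed_uri: str) -> str:
    """Unambiguously expand a prefixed URI into a full URI."""
    prefix, local_name = prefixed_uri.split(":", 1)
    return PREFIXES[prefix] + local_name

print(expand("uniprot:P01137"))  # http://purl.uniprot.org/uniprot/P01137
```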

Describe resources with triples In RDF, resources are described using statements that are triples composed of a subject, a predicate and an object (noted <subject> <predicate> <object> .). The subject is the URI of the described resource. The predicate is the URI of the relation (called a property). The object is a value of the relation for the described resource. This value can be either some URI identifying a resource, or a literal (i.e. a string with an optional indication of a datatype and an optional indication of language). Figure 2.1 on the next page presents two RDF triples sharing the same subject; one of the triples' object is a resource and the other's is a literal. These two triples illustrate how RDF meets respectively the third and the second requirements mentioned in section 2.1.1.1 on page 21. The subject or the object may also be blank nodes, which we will not cover here as it has no impact on RDF expressivity. Note that if the relation can have several values for a resource, this requires as many statements as values. The fact that some statements can share the same subject (Figure 2.1 on the next page) or the same object, or that the object of a statement can be the subject of another statement (Figure 2.2 on page 25), results in a directed graph structure connecting resources. As mentioned in the fourth requirement, statements coming from different sources can be combined in a single expanded graph, provided these sources use the same URIs to identify the same things. For example, Figure 2.2 on page 25 combines statements from Uniprot and from Reactome. RDF specifies a special predicate rdf:type3 for describing the nature of a resource (i.e. a class the resource is a member of). For example the statement uniprot:P01137 rdf:type uniprotCore:Protein indicates that TGF-β1 (P01137) is an instance of the class Protein. This rdf:type property allows RDF to address the first requirement.
Figure 2.2 on page 25 also shows how the use of a unique URI across different data sources promotes interoperability and makes it possible to combine complementary descriptions. Here,

3http://www.w3.org/1999/02/22-rdf-syntax-ns#type

Figure 2.1: Two RDF triples describing the same resource (uniprot:P01137). Resources are represented with ellipses, and literals by strings. Properties linking a subject to an object are represented with arrows. URI prefixes are the usual ones (http://prefix.cc is your friend).

Uniprot has a triple uniprot:P01137 rdfs:seeAlso reactome:REACT_120727.4 where the object is the URI corresponding to the “Downregulation of TGF-β receptor signaling” pathway of Reactome, which in turn allows us to retrieve some additional information about this pathway. However, in the same example, Uniprot uses the uniprotCore:organism property linking to taxo:9606 whereas Reactome uses the biopax3:organism property linking to http://identifiers.org/taxonomy/9606. This unfortunate use of different properties for representing the species associated to a protein or a pathway, and of different URIs to identify Homo sapiens, twice prevents us from combining Uniprot and Reactome (e.g. for checking that Uniprot proteins and the associated Reactome pathways are consistently annotated by the same species, or for assisting during this annotation process). Also remember the part about authoritative sources issuing URIs: in this case, both Uniprot and Reactome could have used the URI issued by the NCBI Taxonomy Database (either the Web page or the corresponding BioPortal resource http://purl.bioontology.org/ontology/NCBITAXON/9606). In this particular case, though, Reactome relies on the identifiers.org service by the NCBI to provide an additional level of indirection, which actually allows reconciling the Uniprot and the NCBI taxonomy identifiers [106]: the identifiers.org service lists several URIs associated to http://identifiers.org/taxonomy/9606, including http://purl.bioontology.org/ontology/NCBITAXON/9606 as the primary one, as well as the Uniprot one.
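The benefit of shared URIs can be sketched with triples represented as plain Python tuples. The triples below paraphrase the figure (prefixed names are kept as strings for readability); this is an illustration, not how RDF stores actually work:

```python
# Toy RDF graphs as sets of (subject, predicate, object) tuples.
uniprot = {
    ("uniprot:P01137", "rdf:type", "uniprotCore:Protein"),
    ("uniprot:P01137", "rdfs:seeAlso", "reactome:REACT_120727.4"),
}
reactome = {
    ("reactome:REACT_120727.4", "rdf:type", "reactome:Pathway"),
}

# Merging two sources is just a set union...
graph = uniprot | reactome

# ...and the merged graph answers a question neither source could alone:
# what is the type of the resources the protein is related to?
related = [o for (s, p, o) in graph
           if s == "uniprot:P01137" and p == "rdfs:seeAlso"]
types = [o for (s, p, o) in graph
         if s in related and p == "rdf:type"]
print(types)  # ['reactome:Pathway']
```

The join only succeeds because both sources use the same URI for the pathway; with taxo:9606 on one side and http://identifiers.org/taxonomy/9606 on the other, the same query would silently return nothing.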

2.1.2 RDFS for describing types

While RDF is adapted for describing resources and relations between resources, RDF Schema4 (RDFS) provides a vocabulary for describing resources that are classes or predicates. This vocabulary is represented in RDF so that any RDFS statement is also a valid RDF statement.

2.1.2.1 RDFS classes

In RDFS, a class is a group of resources, which are its instances (the set of the instances of a class is called the extension of the class). Instances and their classes are associated with the rdf:type predicate we have seen in the previous section. Classes are themselves resources, so they can be identified by some URI, and described by some properties. Note that two different classes can share the same set of instances (but classes having different sets of instances are necessarily different).

4http://www.w3.org/TR/rdf-schema/

In RDFS, rdfs:Class is the class of all the RDFS classes (and is therefore a metaclass). It is an instance of itself. RDFS defines the rdfs:subClassOf property between two classes to represent the fact that the extension of the subject (i.e. the subclass) is a subset of the extension of the object (i.e. the superclass). A hierarchy of subclasses–superclasses is called a taxonomy. Figure 2.2 shows some examples of associations between instances and their classes using rdf:type (e.g. between uniprot:P01137 and uniprotCore:Protein for Uniprot or between reactome:REACT_120727.4 or reactome:REACT_318.7 and reactome:Pathway). It also shows an example of taxonomy using rdfs:subClassOf between taxo:9606, taxo:9605 and taxo:207598. Note that uniprotCore:Taxon is a metaclass as its instances are classes.

Figure 2.2: Graph of RDF triples describing the same resource (uniprot:P01137). Green nodes come from Uniprot and blue nodes from Reactome. Instances and classes are represented by ellipses and boxes respectively. This graph shows typical use of RDF relations between instances or between an instance and a class (rdf:type), as well as the RDFS relation rdfs:subClassOf between classes.

2.1.2.2 RDF properties

In RDF, a property is a binary relation from one resource (the subject) to another resource (the object). The set of the possible subjects (hence a class) for a property is its domain. The set of the possible objects (hence a class too) for a property is its range. The extension of a property is a subset of the Cartesian product of its domain and its range. In RDFS, rdf:Property is the class of all the RDF properties (i.e. the relations between resources). It is an instance of rdfs:Class. RDFS defines two properties rdfs:domain and rdfs:range for defining the domain and the range of RDFS properties (the domain of rdfs:domain and rdfs:range is rdf:Property and their range is rdfs:Class). RDFS defines the rdfs:subPropertyOf property between two properties to represent the fact that the extension of the subject (i.e. the subproperty) is a subset of the extension of the object (i.e. the superproperty). Naturally, declaring that a property is a subproperty of another property implies some additional constraints on their respective domains and ranges.

2.1.3 RDFS entailments

RDF and RDFS support some well-defined entailments, which are implemented by reasoners and the SPARQL query language. This section provides a simplified overview; please refer to the W3C RDF 1.1 semantics5 for the normative document, and particularly to its chapter 9.26.

RDFS entailment 1 The object of an rdf:type property is an rdfs:Class: If x rdf:type y then y rdf:type rdfs:Class .

RDFS entailment 2 The instances of a class are also instances of its superclass: If x rdf:type y and y rdfs:subClassOf z then x rdf:type z .

RDFS entailment 3 rdfs:subClassOf is reflexive: If x rdf:type rdfs:Class then x rdfs:subClassOf x .

RDFS entailment 4 rdfs:subClassOf is transitive: If x rdfs:subClassOf y and y rdfs:subClassOf z then x rdfs:subClassOf z .

RDFS entailment 5 The relations of a property also hold for its superproperties: If x r1 y and r1 rdfs:subPropertyOf r2 then x r2 y .

RDFS entailment 6 rdfs:subPropertyOf is reflexive: If r rdf:type rdf:Property then r rdfs:subPropertyOf r .

RDFS entailment 7 rdfs:subPropertyOf is transitive: If r1 rdfs:subPropertyOf r2 and r2 rdfs:subPropertyOf r3 then r1 rdfs:subPropertyOf r3 .

Note that by combining RDFS entailments 2 and 4, the instances of a class are also instances of all its ancestors. Similarly, by combining RDFS entailments 5 and 7, a property between a subject and an object can be generalized to all the ancestors of the property.
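These entailments lend themselves to a simple fixed-point computation. The sketch below is plain Python, for illustration only: it covers entailments 2, 4, 5 and 7 (the reflexivity rules 3 and 6 are omitted), uses a quadratic loop chosen for clarity rather than performance, and saturates a small triple set built from the taxonomy of Figure 2.2 with a hypothetical instance ex:somePatient:

```python
def rdfs_closure(triples):
    """Saturate a set of (s, p, o) triples with RDFS entailments 2, 4, 5 and 7."""
    triples = set(triples)
    while True:
        inferred = set()
        for (s, p, o) in triples:
            for (s2, p2, o2) in triples:
                if p == "rdfs:subClassOf":
                    if p2 == "rdfs:subClassOf" and s2 == o:
                        inferred.add((s, p, o2))           # entailment 4
                    if p2 == "rdf:type" and o2 == s:
                        inferred.add((s2, "rdf:type", o))  # entailment 2
                if p == "rdfs:subPropertyOf":
                    if p2 == "rdfs:subPropertyOf" and s2 == o:
                        inferred.add((s, p, o2))           # entailment 7
                    if p2 == s:
                        inferred.add((s2, o, o2))          # entailment 5
        if inferred <= triples:  # fixed point reached
            return triples
        triples |= inferred

# Taxonomy from Figure 2.2; ex:somePatient is a hypothetical instance.
facts = {
    ("taxo:9606", "rdfs:subClassOf", "taxo:9605"),
    ("taxo:9605", "rdfs:subClassOf", "taxo:207598"),
    ("ex:somePatient", "rdf:type", "taxo:9606"),
}
closure = rdfs_closure(facts)
assert ("taxo:9606", "rdfs:subClassOf", "taxo:207598") in closure
assert ("ex:somePatient", "rdf:type", "taxo:207598") in closure
```

Production RDFS reasoners use far more efficient saturation or query-rewriting strategies, but the inferences they produce are exactly these.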

2.1.4 Typical uses of RDFS entailments in life science

2.1.4.1 Classes hierarchies

Classes hierarchies are the most common structure of ontologies, not only in life sciences. A notable example is Linnaeus' taxonomy of species and the related NCBI taxonomy7 of all the organisms in the public sequence databases (Figure 2.3 on the facing page). Similarly, life science ontologies from the major repositories OBO Foundry8 and Bioportal9 are typically organized as classes hierarchies.

5http://www.w3.org/TR/rdf11-mt/ 6http://www.w3.org/TR/rdf11-mt/#rdfs-entailment 7http://www.ncbi.nlm.nih.gov/taxonomy 8http://www.obofoundry.org/ 9http://bioportal.bioontology.org/

Figure 2.3: The NCBI Taxonomy of species is a (deep) tree-like hierarchy.

Most ontologies are polyhierarchies (i.e. a class can have zero or several direct superclasses such as in Figure 2.4), and few have a tree structure (i.e. all the classes but the root have exactly one superclass such as in Figure 2.3).

Figure 2.4: The superclasses of acetoin in ChEBI show a polyhierarchy.

Most ontologies have an intricate and deep taxonomy. Some exceptions are “flat” hierarchies, such as the Online Mendelian Inheritance in Man10 (OMIM) with a maximal depth of 2, the Enzyme Commission number11 (EC number), which classifies enzymes according to the reactions they catalyze and is organized in 4 levels, or the KEGG Orthology, whose first two levels describe pathway categories and whose third level describes pathways (cf. section 2.2.3.1). Taxonomy-based reasoning with these ontologies typically involves RDFS entailment rules 2 and 4. Both are used for reconciling the granularity differences between precise annotations and more general queries.

2.1.4.2 Properties hierarchies

Property hierarchies are less often used in life science ontologies than classes hierarchies. A typical example is the Gene Ontology (GO), which specifies a regulates property with two subproperties negatively regulates and positively regulates . These three properties are used in a pattern with rdfs:subClassOf . The regulates property associates a GO class “Regulation of X” with the corresponding GO class “X” (using an OWL existential restriction covered in section 3.1.5). The class “Regulation of X” has two subclasses “Positive regulation of X” and “Negative regulation of X”, respectively associated to “X” by positively regulates and negatively regulates (Figure 2.5).

Figure 2.5: Usage of the negatively regulates and positively regulates subproperties of regulates in the Gene Ontology.

Reasoning based on properties hierarchies typically involves RDFS entailment rules 5 and 7. Like classes hierarchies, both are used for handling different levels of precision in the data descriptions. Figure 2.6 on the facing page shows the possible generalizations of “Positive regulation of leukocyte migration” and of “Leukocyte migration”, as well as the corresponding regulation relations.

2.1.4.3 Application to annotations

Reasoning based on classes and properties hierarchies is often used for reconciling annotations with different granularities [45]. Because of the definitions of rdfs:subClassOf and of rdfs:subPropertyOf , if a data element is annotated by a class, then we can infer that this data element is also annotated by the superclasses. Because of the transitive nature of rdfs:subClassOf and of rdfs:subPropertyOf , we can also infer that the data element is annotated by all the ancestors. In the Gene Ontology, this principle is known as the “True path rule”. For example, the gene product uniprot:P55008 (AIF1, Allograft inflammatory factor 1) is annotated (among others) by GO:0002687 (Positive regulation of leukocyte migration) in Homo sapiens. The GO hierarchy (cf. Figure 2.6 on the previous page) allows us to infer that AIF1 is also involved in “cell migration” and in “immune system process”. Several articles provide more information on the GO annotations and the related inferences [107, 108]. Livingston et al. also provided an interesting work on the representation of annotations [109].

10http://bioportal.bioontology.org/ontologies/OMIM 11http://www.chem.qmul.ac.uk/iubmb/enzyme/

Figure 2.6: Complex mix of rdfs:subClassOf hierarchies and of rdfs:subPropertyOf hierarchies based on regulates , positively regulates and negatively regulates , associating the GO class Positive regulation of leukocyte migration and its ancestors to the GO class Leukocyte migration and its ancestors (Image by QuickGO http://www.ebi.ac.uk/QuickGO/).

The “True path rule”-like reasoning is useful in two situations: for analyzing the annotations of data elements, and for querying the data elements annotated by some ontology class. In the first case, we proceed from the data element to its annotations, and in the second case from the annotations to the data elements. To comply with the semantics, reasoning consists in moving up along the hierarchy in the first case, and moving down in the second case.

A typical first case scenario consists in comparing two gene products by analyzing the common GO terms or the ones specific to one of the gene products: comparing their lists of direct annotations is likely to miss some common terms due to the granularity differences, and one should compare the lists of indirect annotations (i.e. the direct annotations and their ancestors).
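This first case can be sketched as an upward closure followed by a set intersection. The hierarchy below is a toy example (two sibling GO-like terms sharing a parent), not real GO data:

```python
def indirect_annotations(direct, edges):
    """Direct annotations plus all their ancestors.

    direct: set of terms; edges: (child, parent) pairs of the hierarchy.
    """
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)
    result, stack = set(direct), list(direct)
    while stack:
        term = stack.pop()
        for p in parents.get(term, ()):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

# Toy hierarchy: GO:A and GO:B are both subclasses of GO:C.
edges = [("GO:A", "GO:C"), ("GO:B", "GO:C")]
gene1, gene2 = {"GO:A"}, {"GO:B"}
print(gene1 & gene2)                         # set(): no common direct term
print(indirect_annotations(gene1, edges)
      & indirect_annotations(gene2, edges))  # {'GO:C'}
```

The direct annotations share nothing, while the indirect annotations reveal the common ancestor, exactly the granularity mismatch described above.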

A typical second case scenario consists in performing some query expansion for retrieving the data elements annotated directly or indirectly by some annotation of interest (e.g. the gene products involved in “immune system process” with GO, or the articles about “infectious diseases” with the MeSH). Retrieving the data elements directly annotated is a trivial database query. However, we should also look for the data elements annotated by some descendant of these annotations, as the true path rule indicates that the annotation of interest is also valid for them.
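Symmetrically, the second case moves down the hierarchy: the query term is expanded into all its descendants before matching against direct annotations. The sketch below uses a simplified subclass chain (the real GO path from GO:0002687 to GO:0002376 also goes through regulates relations, as in Figure 2.6):

```python
def genes_for_term(term, annotations, edges):
    """Genes annotated by `term` or by any of its descendants.

    annotations: gene -> set of direct terms; edges: (child, parent) pairs.
    """
    children = {}
    for child, parent in edges:
        children.setdefault(parent, set()).add(child)
    wanted, stack = {term}, [term]
    while stack:
        t = stack.pop()
        for c in children.get(t, ()):
            if c not in wanted:
                wanted.add(c)
                stack.append(c)
    return {gene for gene, terms in annotations.items() if terms & wanted}

# Simplified chain: GO:0002687 under GO:0050900 under GO:0002376.
edges = [("GO:0002687", "GO:0050900"), ("GO:0050900", "GO:0002376")]
annotations = {"uniprot:P55008": {"GO:0002687"}}
print(genes_for_term("GO:0002376", annotations, edges))  # {'uniprot:P55008'}
```

A query for “immune system process” thus retrieves AIF1 even though the gene is only directly annotated by a much more precise descendant.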

2.1.5 Synthesis

As we have seen, RDFS-compliant reasoning consists mainly in computing the transitive closure of rdfs:subClassOf and rdfs:subPropertyOf . Of course, typical reasoning patterns usually involve combining both. Dedicated RDFS reasoners and query engines have been perfected over the years. The simplicity of the task has allowed them to reach far better performances than ad-hoc solutions based on classic programming languages, or relational engines that are notoriously bad at handling transitive closures.

In the remainder of this chapter, section 2.2 shows how an RDF(S) query engine made it possible to combine and query multiple ontologies, whereas each of these ontologies was too large to be loaded by an OWL reasoner (even though these ontologies were merely polyhierarchies). This demonstrated that not all tasks on ontologies require an OWL reasoner... and using one can even be counter-productive. Section 2.3 focuses on a reasoning method that fully exploits subclasses and subproperties for pairing Web services parameters. Section 2.4 shows that even for a large ontology such as the Gene Ontology, RDFS reasoning is compatible with on-the-fly PubMed query enrichment, as the time spent enriching the query is negligible compared to the time PubMed spends answering it.

Overall, this chapter shows that RDFS reasoning is valuable, even though the medical ontology community was mostly focusing on OWL reasoners (applied to semantically-simple ontologies).

2.2 Case study: integrating diseases and pathways

This case study focuses on the integration of overlapping ontologies covering different aspects of life science. We created a biomedical ontology associating diseases and pathways using mapping and alignment techniques over KEGG Orthology, Gene Ontology and SNOMED-CT. We represented this ontology in OWL and demonstrated that RDFS queries were expressive enough, with acceptable computational performances. In retrospect, this work is interesting because it identified the need for a posteriori resource integration (the linked open data initiative originated around 2007, and ontology alignment and mapping became a very active field in this period), and highlighted the need for using reasoning tools adapted to the task at hand (in those days, DL reasoners could hardly load an ontology, so loading several ontologies was out of the question; it would have been overkill anyway because these ontologies are mostly simple taxonomies that hardly use DL features).

This work was a collaboration with Julie Chabalier, who was a postdoctoral fellow. It was supported by a grant from Région Bretagne (PRIR) and was originally published in Julie Chabalier, Olivier Dameron, and Anita Burgun. Integrating and querying disease and pathway ontologies: building an OWL model and using RDFS queries. In Bio-Ontologies Special Interest Group, Intelligent Systems for Molecular Biology conference (ISMB'07), 2007 [85].

2.2.1 Context

Use of ontologies within the biomedical domain is currently mainstream (e.g. the Gene Ontology GO [110]). Within a few years, the success of bio-ontologies has resulted in a considerable increase in their number (e.g. Open Biological Ontologies12). While some of these bio-ontologies contain overlapping information, most of them cover different aspects of life science. However, an application may require a domain ontology which spans several ontologies. Rather than creating a new ontology, an alternative approach consists of reusing, combining and augmenting these bio-ontologies in order to cover the specific domain [111].

Associations between classes of genes and diseases, as well as associations between pathways and diseases, are key components in the characterization of diseases. Different phenotypes may share common pathways, and different biological processes may explain the different grades of a given disease. However, this information remains absent from most existing disease ontologies, such as SNOMED CT. Pathway-related information is present in other knowledge sources. The KEGG PATHWAY database is a collection of pathway maps representing our knowledge on the molecular interaction and reaction networks for metabolism and cellular processes [112]. As GO does not provide direct associations with pathways, Mao et al. proposed to use the KEGG Orthology (KO) as a controlled vocabulary for automated gene annotation and pathway identification [113]. At that time, information about the pathways involved in human diseases had been added to KO.

A major step for addressing this issue is "ontology integration", which sets up relations between concepts belonging to different ontologies. It encompasses several notions: merging consists in building a single, coherent ontology from two or more different ontologies covering similar or overlapping domains; aligning is achieved by defining the relationships between some of the terms of these ontologies [114]; and mapping corresponds to identifying similar concepts or relations in different sources [115].

12http://obofoundry.org/

The automatic exploitation of the knowledge represented in integrated ontologies requires an explicit and formal representation. Description Logics, and OWL (Web Ontology Language) in particular, offer a compromise between expressivity and computational constraints [116]. However, to leverage this expressivity, ontologies should contain features such as necessary and sufficient definitions for classes whenever possible, as well as disjointness constraints. While recent works put forward a set of modeling requirements to improve the representation of biomedical knowledge [117, 51], current biomedical ontologies are mostly taxonomic hierarchies with sparse relationships. Even so, dedicated reasoners can hardly cope with them.

2.2.2 Objective

The objective of this study was to infer new knowledge about diseases by first integrating biological and medical ontologies and then querying the resulting biomedical ontology. We hypothesized that most typical queries do not need the full expressivity of OWL and that RDFS is enough for them. In this study, we used the term 'pathway' for metabolic pathways, regulatory pathways and biological processes. The approach presented here consisted in developing a disease ontology using knowledge about pathways as an organizing principle for diseases. We represented this disease ontology in OWL. Following an ontology integration methodology, pathway and disease ontologies were integrated from three sources: SNOMED CT, KO, and GO. To investigate how information about pathways can serve disease classification purposes, we compared, as a use case, glioma to other neurological diseases, including Alzheimer's disease, and other cancers, including chronic myeloid leukemia.

2.2.3 Linking pathways and diseases using GO, KO and SNOMED-CT

2.2.3.1 KEGG Orthology

The KEGG PATHWAY database was used as the reference database for biochemical pathways. It contains most of the known metabolic pathways and some regulatory pathways. KO is a further extension of the ortholog identifiers, and is structured as a directed acyclic graph (DAG) hierarchy of four flat levels. The top level consists in the following five categories: metabolism, genetic information processing, environmental information processing, cellular processes and human diseases. The second level divides these five functional categories into finer sub-categories. The third level corresponds to the pathway maps, and the fourth level consists in the genes involved in the pathways. The first three levels of this hierarchy were integrated in the disease ontology. The KO hierarchy is provided in HTML format. We extracted the three upper levels of this hierarchy. Each KO class was represented by an OWL class respecting the subsumption hierarchy.

2.2.3.2 Gene Ontology

Gene Ontology is composed of three independent hierarchies representing biological processes (BP), molecular functions (MF) and cellular components (CC). A biological process is an ordered set of events accomplished by one or more ordered assemblies of molecular functions (e.g. cellular physiological process or pyrimidine metabolism). Since we consider all pathways as biological processes, the biological process hierarchy was used to enrich the pathway definitions. The GO BP hierarchy is more detailed than that of KO. It is composed of 27,127 terms spanning 16 levels; for more details about the GO structure, see [118]. We retrieved the OWL version of GO from the Gene Ontology website13.

13http://geneontology.org/page/download-ontology

2.2.3.3 SNOMED CT

SNOMED CT was used as the reference source for disease definitions because it is the most comprehensive biomedical terminology developed to date. We used SNOMED CT to enrich the definitions of human diseases provided by KO. SNOMED CT is not freely available. However, it is part of the UMLS Knowledge Sources [119]. Therefore, we extracted the relevant concepts and their parents, as well as their relations, from the SNOMED CT part of the UMLS. The concepts and relations were respectively represented as OWL classes and properties.

2.2.3.4 Ontology integration

The ontology integration process was based on ontology alignment, which defines relationships between terms, and on ontology mapping, a restriction of ontology alignment that only considers equivalence relationships between terms. Figure 2.7 presents an overview of the integration principle. See the original article for details about the method (including the automatic decomposition of KO terms such as "Fructose and mannose metabolism" so that they could be mapped to the GO terms "Fructose metabolic process" and "Mannose metabolic process") and the quantitative results. The resulting ontology connected diseases from SNOMED-CT to biological processes from GO using the hasPathway relationship.
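The decomposition of coordinated KO labels can be illustrated as follows. This is a simplified sketch, not the published implementation: it only handles a single "and" coordination and a naive "metabolism" to "metabolic process" rewriting for lexical matching against GO.

```python
# Simplified sketch of the decomposition of coordinated KO labels; the
# published method is more general. Only one "and" is handled, and the
# GO-style rewriting is a naive string substitution.
def decompose(ko_label):
    """Split 'X and y Z' into ['X Z', 'Y Z']; other labels are unchanged."""
    words = ko_label.split()
    if "and" not in words:
        return [ko_label]
    i = words.index("and")
    head, second, tail = words[:i], words[i + 1], words[i + 2:]
    return [" ".join(head + tail),
            " ".join([second.capitalize()] + tail)]

def to_go_style(label):
    """Rewrite a KO-style label into a GO-style one for lexical mapping."""
    return label.replace("metabolism", "metabolic process")

for part in decompose("Fructose and mannose metabolism"):
    print(to_go_style(part))
# Fructose metabolic process
# Mannose metabolic process
```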

Figure 2.7: Integration of the Gene Ontology (GO), KEGG Orthology (KO) and SNOMED-CT, and the resulting hasPathway relations between diseases and pathways. KO is used as a pivot between the GO biological processes for the pathways and diseases from SNOMED-CT.

2.2.4 Querying associated diseases and pathways

Queries can be used either for checking the consistency of the resulting integrated bio-ontology or for exploiting it. Typical consistency queries consist in detecting whether a specific pathway and a more general one are associated with the same disease. Such an imprecision of granularity can either come from

one faulty ontology or from the integration of the knowledge from two ontologies with different granularities. Typical queries for exploiting the ontology involve 1) retrieving the pathways common to several diseases, 2) retrieving the pathways associated with one disease but not with another one, or 3) retrieving the diseases associated with the pathways associated with one class of diseases. Computing the solutions for both kinds of queries only requires following explicit relations. It does not require OWL-based classification, and can be performed using only the RDFS semantics. At the time of this study in 2007, we loaded the ontology in a Sesame RDF repository and used SeRQL, the query language designed by Aduna for the Sesame triplestore [120]. SeRQL's advantages over SPARQL (which became a W3C recommendation in 2008) were that it was the query language of the popular Sesame, and that it supported RDFS. Nowadays, SPARQL would be the query language of choice. RDFS support in SPARQL is achieved using property paths (e.g. rdfs:subClassOf+ indicates "follow one or more rdfs:subClassOf"), which appeared in 2013 when SPARQL 1.1 became a W3C recommendation. We present the SPARQL equivalents of the original SeRQL queries.

2.2.4.1 Redundant disease–pathway associations

When a disease is associated with a pathway, we can infer automatically that it is also associated with all the superclasses subsuming (directly or indirectly) this pathway (Figure 2.8).

Figure 2.8: The association between SomeDisease and Pathway2 can be inferred using the pathway hierarchy and therefore does not need to be stated explicitly.

The following query retrieves the (disease, pathway) couples linked by a redundant hasPathway relation.

1  SELECT DISTINCT ?disease ?redundantPathway
2  WHERE {
3      ?disease dp:hasPathway ?precisePathway .
4      ?disease dp:hasPathway ?redundantPathway .
5      ?precisePathway rdfs:subClassOf+ ?redundantPathway .
6  }

2.2.4.2 Pathways common to two diseases

The following query retrieves the pathways directly associated with two diseases:

1  SELECT DISTINCT ?commonDirectPathway
2  WHERE {
3      ?disease1 dp:hasPathway ?commonDirectPathway .
4      ?disease2 dp:hasPathway ?commonDirectPathway .
5
6      VALUES ?disease1 { snomed:393564001 }  # Glioma
7      VALUES ?disease2 { snomed:44054006 }   # Type 2 diabetes mellitus
8  }

Figure 2.9: None of the three diseases share any direct pathway. However, considering the GO hierarchy allows us to recognize that Disease1 is associated with a pathway more specific than that of Disease3, and therefore Monosaccharide metabolic process and its ancestors are all common to both diseases (idem for Disease2 and Disease3). Similarly, Disease1 and Disease2 are associated with Hexose metabolic process and its ancestors.

The previous query is similar to what we could do using a relational database. However, it fails to take the pathway hierarchy into account. If the two diseases are associated with different pathways (e.g. "Fructose metabolic process" for the first and "Mannose metabolic process" for the second), they do not have any direct pathway in common, whereas we could infer from GO that both diseases are associated with the common ancestors of these two terms (in this example "Hexose metabolic process" and its ancestors), as seen in Figure 2.9.

The following query retrieves the pathways directly or indirectly associated with two diseases (lines 3 and 4 could have been simplified to ?disease1 dp:hasPathway/rdfs:subClassOf* ?commonPathway but I kept the verbose version for the sake of clarity; idem for lines 6 and 7):

1  SELECT DISTINCT ?commonPathway
2  WHERE {
3      ?disease1 dp:hasPathway ?pathway1 .
4      ?pathway1 rdfs:subClassOf* ?commonPathway .
5
6      ?disease2 dp:hasPathway ?pathway2 .
7      ?pathway2 rdfs:subClassOf* ?commonPathway .
8
9      VALUES ?disease1 { snomed:393564001 }  # Glioma
10     VALUES ?disease2 { snomed:44054006 }   # Type 2 diabetes mellitus
11 }

The same principle allows us to take the hierarchy of diseases into account in order to retrieve the pathways directly or indirectly associated with a class of diseases. Note that in this case we consider the pathways associated with the disease class or at least one of its subclasses (and not the pathways common to all the diseases of the class). This is mostly because the associations between diseases and pathways are far from exhaustive: if one of the diseases were not associated with any pathway, then the entire class would not be either (see chapter 4 for further details about reasoning with incomplete information).

The following query retrieves the pathways directly or indirectly associated with two classes of diseases or one of their subclasses:

1  SELECT DISTINCT ?commonPathway
2  WHERE {
3      ?disease1 rdfs:subClassOf* ?diseaseClass1 .
4      ?disease1 dp:hasPathway ?pathway1 .
5      ?pathway1 rdfs:subClassOf* ?commonPathway .
6
7      ?disease2 rdfs:subClassOf* ?diseaseClass2 .
8      ?disease2 dp:hasPathway ?pathway2 .
9      ?pathway2 rdfs:subClassOf* ?commonPathway .
10
11     VALUES ?diseaseClass1 { snomed:55342001 }   # Neoplastic disease
12     VALUES ?diseaseClass2 { snomed:126877002 }  # Disorder of glucose metabolism
13 }

2.2.4.3 Pathways specific to a disease

The following SPARQL query retrieves the pathways directly or indirectly associated with a disease (glioma) but not with another disease (type 2 diabetes mellitus):

1  SELECT DISTINCT ?disease1SpecificPathway
2  WHERE {
3      ?disease1 dp:hasPathway ?pathway1 .
4      ?pathway1 rdfs:subClassOf* ?disease1SpecificPathway .
5
6      FILTER NOT EXISTS {
7          ?disease2 dp:hasPathway ?pathway2 .
8          ?pathway2 rdfs:subClassOf* ?disease1SpecificPathway .
9      }
10
11     VALUES ?disease1 { snomed:393564001 }  # Glioma
12     VALUES ?disease2 { snomed:44054006 }   # Type 2 diabetes mellitus
13 }

Note that this query can be modified, as we did for the common pathways, in order to retrieve the pathways directly or indirectly associated with a class of diseases but not with another class of diseases.

2.2.4.4 Pathways connecting a disease and a class of diseases

The following query retrieves the diseases associated directly or indirectly with the pathways associated with one class of diseases. Note that in this case it is important to consider only the pathways directly associated with the class of diseases or at least one of its subclasses, but that the pathway hierarchy should still be exploited to analyze the pathways associated with relatedDisease.

1  SELECT DISTINCT ?relatedDisease
2  WHERE {
3      ?disease rdfs:subClassOf* ?diseaseClass .
4      ?disease dp:hasPathway ?pathway .
5
6      ?relatedDisease dp:hasPathway ?pathway2 .
7      ?pathway2 rdfs:subClassOf* ?pathway .
8
9      VALUES ?diseaseClass { snomed:55342001 }  # Neoplastic disease
10 }

2.2.4.5 Leukemia, glioma and Alzheimer's disease use-case

To be able to manually check that our queries returned correct results, we considered three diseases: chronic myeloid leukemia, glioma, and Alzheimer's disease. First, we performed some RDFS queries for checking the consistency of the integrated ontology. Among the pathways associated with one disease, 87 are more general than some other pathway associated with this disease (47 for leukemia, 29 for glioma and 10 for Alzheimer's disease). We removed the least specific pathways. We then performed some RDFS queries for comparing diseases by their associated pathways. First, we compared two neurological disorders, namely glioma and Alzheimer's disease (Figure 2.10 on the next page). Eight direct pathway classes involved in glioma were also associated with Alzheimer's disease (86 indirect classes). Then we compared glioma and leukemia; 44 direct pathway classes were shared by these two cancers (165 indirect classes). Finally, 37 pathways are specific to these two cancers (97 indirect classes). Furthermore, the three diseases are associated with pathways themselves associated with glioma.

Figure 2.10: Pathways (red ellipses) associated with diseases (blue rectangles): Alzheimer's disease (bottom left) and glioma (center top).

2.3 Methodology: Web services composition

This study focuses on using semantic annotations to help a user pair Web service parameters when creating a workflow. From our experience, biologists usually have a rather precise idea of the goal they want to achieve and of the services to use, but they can use some help with the technical aspects of service orchestration. We assumed that the services composing the workflow are known, as well as their relative order. We demonstrated that parameter pairing should not only rely on the type of the parameters (e.g. a string or a date), but also on their nature (e.g. a family name, a city, or a creation date as opposed to a validation date). We determined whether pairs of parameters are semantically compatible by examining whether they have the same nature or whether the output parameter of a service is subsumed by the input parameter of the next service. In retrospect, this work is interesting because while at that time most of the other works in this domain focused on determining the succession of Web services, we focused on the "next step", i.e. pairing the parameters once this succession is known. Although we made a clear distinction between determining the succession of Web services and pairing the parameters in order to differentiate our work from the others, parameters' semantic compatibility could provide some relevant information for guiding the proposition of a succession of Web services.

This work was carried out by Nicolas Lebreton, whom I supervised with Anita Burgun. It was originally published in Nicolas Lebreton, Christophe Blanchet, Daniela Barreiro Claro, Julie Chabalier, Anita Burgun, and Olivier Dameron. Verification of parameters semantic compatibility for semi-automatic Web service composition: a generic case study. In Proceedings of the 12th International Conference on Information Integration and Web-based Applications and Services (iiWAS2010), pages 845–848, 2010 [89].

2.3.1 Context

Creating a workflow of Web services is a difficult task [121]. Currently, this is a manual and time-consuming process requiring different kinds of expertise. The first step is the selection of the Web services and their arrangement in a sensible order. It requires domain expertise and is typically done by end-users who have an idea of the succession of tasks to perform. Service selection relies on the notion of goal, which is typically represented in task ontologies. The second step consists in pairing the output of a service with one of the inputs of the next one in the workflow. It requires technical expertise for connecting each service's input parameters to data or to the output of some other service. Parameter pairing relies on the nature of the parameters, which is typically represented in domain ontologies. Annotations help automate this tedious process. When present, the WSDL (Web Services Description Language) description of the service is useful, but it only addresses the syntactic level of interoperability (e.g. the input parameter is a string). Parameter pairing based on type (not nature) is prone to two kinds of errors:

• it misses correct pairings when the parameters are in different formats and would only require a conversion (e.g. the first service's output is of type xsd:date whereas the second service's input is an xsd:string);

• it generates incorrect pairings when the parameters are of different natures but represented in the same format (e.g. the first service's output is a family name of type xsd:string and the second service's input is a city of type xsd:string).

Distinguishing the two previous kinds of errors from the correct situations requires some combined reasoning on the type of the parameters (e.g. using XML schemas) and on the nature of the parameters independently from the way they are represented (e.g. a date, a name, a city). OWL-S [122] is the de facto standard for representing semantic descriptions of the various Web services. Semantic Web services composition is the act of taking several semantically-annotated Web services and binding them together to meet the needs of the user. However, semantic descriptions of Web services are scarce and are not really exploited by applications. A well-known tool for multi-domain data analysis is currently Taverna [123] and its workflow language Xscufl (XML Simple Conceptual Unified Flow Language). Taverna allows the user to configure and combine the relevant Web services in a workflow. The Xscufl syntax is used by the Taverna project to store and retrieve workflow definitions. Setting all of Taverna's parameters requires a significant knowledge of the Web services, and this information is generally not accessible in a software-compatible format that would support automatic assistance. The services' combination and the pairing of parameters have to be performed manually; only the execution of the workflow is automated.

2.3.2 Objective

We focused on the use of semantic annotations to check the compatibility of Web services parameters in a workflow where the order of execution of the Web services is already known. We demonstrated that OWL-S is expressive enough to support these pairings, and that once the pairings have been decided, an Xscufl file representing the workflow can be automatically generated so that Taverna can execute the workflow.

2.3.3 Semantic compatibility of services parameters

During workflow composition, we assume that the user has already defined the Web services ordering. During each transition from one Web service to the next, we examine all the possible combinations of an output of the first Web service with an input of the next one. Four situations can arise (Figure 2.11 on the facing page):

• Identical match: the input and the output have exactly the same kind;

• Generalization match: the input is more general than the output of the previous service;

• Specialization match: the input is more specific than the output of the previous service. It is up to the user to make sure that in their conditions of use, the first service will always return results that are semantically compatible with the next service's input;

• Incompatibility: the input cannot be reconciled with the output of the previous service. This is either because the pairing has to be ruled out, or because the ontology is incomplete.

Compatibility is inferred in case of identical match or generalization match. Compatibility is possible but not guaranteed in case of specialization match. In this case, the type of the output is more general than the type of the next service’s input. As the output may not be compatible with the required input type, it is up to the user to determine whether the particular conditions of application guarantee a safe execution.
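These four situations amount to a simple subsumption test over the domain ontology describing the parameters' natures. The following Python sketch is an illustration with hypothetical class names, not the OWL-S machinery actually used:

```python
# Toy domain ontology of parameter natures (hypothetical class names).
PARAM_PARENTS = {
    "FamilyName": {"Name"},
    "Name": {"String"},
    "City": {"String"},
}

def is_subsumed_by(a, b):
    """True if class a is b or a (transitive) descendant of b."""
    return a == b or any(is_subsumed_by(p, b) for p in PARAM_PARENTS.get(a, ()))

def compatibility(output_cls, input_cls):
    """Classify an (output, input) pair into the four cases of Figure 2.11."""
    if output_cls == input_cls:
        return "identical"
    if is_subsumed_by(output_cls, input_cls):
        return "generalization"   # compatibility inferred
    if is_subsumed_by(input_cls, output_cls):
        return "specialization"   # possible, the user must confirm
    return "incompatible"

print(compatibility("FamilyName", "Name"))   # generalization
print(compatibility("Name", "FamilyName"))   # specialization
print(compatibility("FamilyName", "City"))   # incompatible
```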

2.3.4 Algorithm for pairing services parameters

We assumed that the order of the Web services in the workflow is known. We proposed an algorithm for semi-automatically determining which output parameter of a Web service can be

Figure 2.11: Semantic compatibility of a Web service output with the next Web service input depending on the parameters' nature. Compatibility is inferred in case of identical match or generalization match. Compatibility is possible but not guaranteed in case of specialization match. Otherwise, compatibility is ruled out.

paired with which input parameter of the next service. For each input parameter, the algorithm returns a set (possibly empty) of semantically-compatible output parameters, and a set (possibly empty) of potentially-compatible output parameters. Optionally, both sets can be converted to lists if the difference of granularity between the output and the input parameter is considered. For each pair of connected services, we determine all the pairwise semantic compatibilities of an input parameter of the second service with an output parameter of the first service. Three configurations can arise (Table 2.1 summarizes the possible actions):

• the Web service input is semantically compatible with exactly one output of a previous Web service (either through identical match or through generalization match), and no specialization match was detected. The correct pairing can be validated by the user.

• the Web service input is either semantically compatible with more than one output of a previous Web service (either through identical match or through generalization match), or at least one specialization match was detected. Both lists of candidate pairings are presented to the user, who can reject them all or select one;

• the Web service input is not semantically compatible with any of the outputs of a previous Web service (either through identical match or through generalization match), and no specialization match was detected. The situation cannot be resolved automatically and the user should decide if the problem lies in the workflow itself or in the ontology used to describe the parameters.

                           Number of semantically-compatible parameters
                           0                     1                     >1
Specialization   0         No compatibility      Validation required   Reject or select one
matches          >= 1      Reject or select one  Reject or select one  Reject or select one

Table 2.1: Pairing services parameters.

Note that it is important to focus on the input parameters. Parameter pairing consists in choosing one of the outputs of a previous service as the value for the input parameter (this value is obviously unique, whereas the output of a service can be paired with several inputs of following services). Figure 2.12 on the facing page shows a situation where the input of service S3 is semantically compatible with the output of service S1 and in a specialization match with the output of service S2. In this case, the user will have to make a choice between the semantically-compatible pair <O1, I3> and a potentially-compatible pair <O2, I3> (the former being the most likely candidate). If we had focused on output parameters, we would have determined that O1 was most likely paired with I3, and independently that O2 was most likely paired with I3, without realizing that both pairs are concurrent; moreover, O1 could also be paired with the input of another service in addition to S3.
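The input-centred pairing step can be sketched as follows. This is an illustration of Table 2.1, assuming a compatibility function that classifies a pair as "identical", "generalization", "specialization" or "incompatible"; the parameter names reproduce the toy situation of Figure 2.12.

```python
# Input-centred pairing (Table 2.1): for each input of the next service,
# collect the semantically-compatible and potentially-compatible outputs
# of previous services, then decide what the user has to do.
def pair_inputs(inputs, outputs, compatibility):
    decisions = {}
    for inp in inputs:
        compatible = [o for o in outputs
                      if compatibility(o, inp) in ("identical", "generalization")]
        potential = [o for o in outputs
                     if compatibility(o, inp) == "specialization"]
        if len(compatible) == 1 and not potential:
            decisions[inp] = ("validate", compatible[0])   # single candidate
        elif compatible or potential:
            decisions[inp] = ("choose", compatible + potential)
        else:
            decisions[inp] = ("unresolved", [])            # workflow or ontology issue
    return decisions

# Toy compatibility relation reproducing the situation of Figure 2.12:
# O1 is an identical match for I3, O2 only a specialization match.
REL = {("O1", "I3"): "identical", ("O2", "I3"): "specialization"}
compat = lambda o, i: REL.get((o, i), "incompatible")
print(pair_inputs(["I3"], ["O1", "O2"], compat))
# {'I3': ('choose', ['O1', 'O2'])}
```

Because O2 is only a specialization match, the algorithm presents both candidates to the user instead of validating <O1, I3> automatically, as prescribed by the second row of Table 2.1.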

Figure 2.12: Pairing services parameters is performed by considering all the combinations leading to an input parameter and applying Table 2.1 on the facing page.

2.4 Application: ontology-based query expansion with GO2PUB

This application automatically enriches PubMed queries with the names, symbols and synonyms of the genes annotated by a Gene Ontology (GO) term of interest or one of its descendants. GO2PUB is based on a semantic expansion of PubMed queries using the semantic inheritance between GO terms. We demonstrated that this approach yields some relevant articles ignored by the other tools and can be generalized to any biological process. In retrospect, this work is interesting because even if the MeSH and the Gene Ontology are arguably the two ontologies that had the greatest impact on the life science community, their integration was original and provided some added value. It should also be noted that the current implementation relies on the GOA databases in a relational format, as they were provided by the EBI. Now that the bio2rdf project has released RDF versions of GOA, the semantic expansion part of GO2PUB could be elegantly rewritten as much simpler SPARQL queries.

This work was initiated and carried out by Charles Bettembourg during his Master’s degree internship and then at the beginning of his PhD thesis because he needed to perform PubMed queries on lipid metabolism but found serious limitations with PubMed. It was originally published in Charles Bettembourg, Christian Diot, Anita Burgun and Olivier Dameron. GO2PUB: Querying PubMed with semantic expansion of gene ontology terms. In Journal of Biomedical Semantics 2012, 3:7 [90]. GO2PUB is available at http://go2pub.genouest.org/.

2.4.1 Context

The development of high-throughput methods of gene analysis requires dealing with lists of thousands of genes, while researchers were used to searching the literature for only a few genes at a time. The information retrieval process becomes an increasingly difficult task and needs to be redesigned to provide literature concerning the biological problems raised by gene analyses. PubMed is the most comprehensive public database of biomedical literature. It comprises more than 21 million entries for biomedical literature from MEDLINE, life science journals, and online books14. The typical PubMed user has to read several dozens to hundreds of abstracts

14http://www.ncbi.nlm.nih.gov/pubmed

to select the relevant ones. More than 4 million articles were added in the last 5 years15. A well-defined query is important to retrieve as many relevant articles as possible with as few irrelevant ones as possible. Such a query is often more complex than the few loosely-coupled keywords used by most users. There is a need for automatic tools helping users build such complex queries that minimize silence and noise [124, 125]. Although PubMed supports MeSH-based query expansion [126], other literature search tools have been developed [127, 128, 129, 130] and evaluated [131]. These can be classified into three major approaches. The first approach, exemplified by tools like SLIM [132], is based on an intuitive interface for setting filters on PubMed queries in order to obtain a better precision than with the basic PubMed querying system. A good proficiency with PubMed advanced search brings similar results. The second approach, developed in SEGOPubMed, uses a Latent Semantic Analysis (LSA) framework. It is based on a semantic similarity measure between the user query and PubMed abstracts [133]. The authors of SEGOPubMed state that the LSA approach outperforms the other approaches when using well-referenced keywords. Unfortunately, no implementation of SEGOPubMed is currently available. Moreover, this method requires that a corpus of well-referenced keywords be constituted and maintained before the search. Such a corpus is not available in the biomedical domain either. The third approach is based on query enrichment using controlled vocabularies and ontologies. An ontology is a knowledge representation in which concepts are described both by their meaning and their relations to each other [37]. Ontologies are useful for finding information relevant to a given topic, particularly through a query expansion process [134]. The automatic handling of the query complexity facilitates query formulation.
Expanded queries applied to Web information retrieval show a systematic improvement over unexpanded ones [135]. QuExT performs a concept-oriented query expansion to retrieve articles associated with a given list of gene symbols from PubMed and to prioritize them [136].

A frequent goal of gene-related analyses (e.g. transcriptomics) is to identify the genes that are differentially expressed across the analyzed samples. Thereafter, scientists link their list of genes to more synthetic keywords and functions using Gene Ontology (GO) terms [137] associated with genes thanks to the Gene Ontology Annotation database [138]. At this stage of the gene-related analyses, the keywords to search the literature are not gene names anymore but GO terms. Therefore, tools querying the literature with GO terms seem appropriate. GoPubMed [139] uses a text extraction algorithm to mine PubMed abstracts with GO terms. It relies on a local string alignment to compare the GO terms and the abstracts. GoPubMed selects the abstracts containing at least a significant part of the semantics of the GO terms. However, GoPubMed does not follow GO's strict rules conveying the semantics of terms. If the annotation of a gene product gp by a Gene Ontology term t is true, then the annotation of gp by any parent of t is equally true [138]. All transitive relations (is a, part of) have to be followed to retrieve these parents. As GoPubMed does not follow this rule, its recall decreases whenever inferences about gene annotations yield new relevant results [140]. None of the existing tools supports a combination of semantics-based and synonym-based PubMed query enrichment.
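The rule above amounts to computing a transitive closure over the parent links. A minimal sketch in Python (not GoPubMed's or GO2PUB's actual code); the three GO identifiers form a small real is-a chain in the fatty acid regulation branch, but the dictionary encoding is purely illustrative:

```python
# Sketch: propagating an annotation upward by taking the transitive closure
# of the parent (is a / part of) links. Illustrative encoding.

def ancestors(term, parents):
    """All direct and indirect parents of `term`."""
    result = set()
    for p in parents.get(term, ()):
        result.add(p)
        result |= ancestors(p, parents)
    return result

parents = {
    "GO:0045717": ["GO:0042304"],  # negative regulation of FA biosynthesis
    "GO:0042304": ["GO:0019217"],  # regulation of FA biosynthetic process
}

# A gene product annotated with GO:0045717 is implicitly annotated with
# both of its ancestors as well.
print(sorted(ancestors("GO:0045717", parents)))  # ['GO:0019217', 'GO:0042304']
```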

2.4.2 Objective In this study, we hypothesized that the names of the genes annotated by a GO term of interest or one of its descendants can be used as keywords in gene-oriented PubMed queries. The descendants of a GO term are defined according to the Gene Ontology specifications of reasoning about

15http://www.nlm.nih.gov/bsd/licensee/baselinestats.html

relations. The genes annotated with GO terms are provided by the Gene Ontology Annotation database. In our system GO2PUB, we propose a new approach that considers not only the genes annotated with a GO term of interest, but also those annotated by a descendant of this GO term, complying with the semantic inheritance properties of GO. The GO2PUB user inputs a list of GO terms of interest, one or more species, and a list of keywords. GO2PUB generates a PubMed query with the names, symbols and synonyms or aliases of these genes, the species and the keywords, and processes the PubMed results. We also performed a qualitative relevance study on our domain of expertise using three queries related to lipid metabolism. In this manuscript, I focus on GO2PUB query expansion. Please refer to [90] for details on the relevance study.

2.4.3 Semantic expansion Semantic expansion consists of following the semantic inheritance through the GO graph in order to also consider all the descendants of the GO terms specified by the users. Then, the process retrieves the names of the genes annotated with these terms. GO2PUB uses these gene names and their synonyms as additional keywords for PubMed queries. Figure 2.13 shows that the expansion identifies five genes associated with the regulation of fatty acid metabolic process, instead of two if the semantic inheritance is ignored.
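Using the terms and genes of Figure 2.13, the expansion step can be sketched as a downward traversal that collects the genes annotated by a term or any of its descendants. This is a sketch, not GO2PUB's actual code, and the assignment of BRCA1, ChREBP and APOA1 to the individual descendant terms is illustrative:

```python
# Sketch of GO2PUB's semantic expansion over the hierarchy of Figure 2.13.
# Gene-to-descendant assignments below are illustrative.

children = {
    "GO:0019217": ["GO:0042304"],
    "GO:0042304": ["GO:0045717", "GO:0045723"],
}
annotations = {
    "GO:0019217": {"PPAR", "CAV1"},
    "GO:0042304": {"ChREBP"},
    "GO:0045717": {"BRCA1"},
    "GO:0045723": {"APOA1"},
}

def expanded_genes(term):
    """Genes annotated by `term` or by any of its descendants."""
    genes, stack = set(), [term]
    while stack:
        t = stack.pop()
        genes |= annotations.get(t, set())
        stack.extend(children.get(t, ()))
    return genes

print(len(annotations["GO:0019217"]))     # 2 genes without expansion
print(len(expanded_genes("GO:0019217")))  # 5 genes with expansion
```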

[Figure 2.13 diagram: “Regulation of fatty acid metabolic process” (GO:0019217, genes PPAR and CAV1) has the descendant “Regulation of fatty acid biosynthetic process” (GO:0042304), which in turn has the descendants “Negative regulation of fatty acid biosynthetic process” (GO:0045717) and “Positive regulation of fatty acid biosynthetic process” (GO:0045723); the genes BRCA1, ChREBP and APOA1 are annotated by these descendants.]

Figure 2.13: Keyword semantic enrichment. For a literature search about the regulation of fatty acid metabolic process, we want to enrich the query with the associated genes. The two genes PPAR and CAV1 are directly annotated by the GO term “Regulation of fatty acid metabolic process” (GO:0019217). However, Gene Ontology inheritance properties say that every term inherits the meaning of all its ancestors. Consequently, genes annotated by at least one descendant of the original term (BRCA1, ChREBP and APOA1) should also be considered.

2.4.4 Query generation GO2PUB creates an expanded PubMed query with the name, symbol and synonyms of the genes annotated by one or several GO terms provided by the users, for one or several species. Figure 2.14 on the following page presents the process. The users provide one or several GO terms and species. To further restrict their query, they can also provide as many MeSH-term keywords as wanted. Furthermore, a “free text” field supports the use of all the other PubMed tags, like [Author], [Journal], etc., and keywords from MeSH terms or free text.

[Figure 2.14 diagram: the α initial GO terms and their descendants, the β species and the γ keywords are combined into the query (G1 ∪ G2 ∪ … ∪ Gz) AND (S1 ∪ … ∪ Sβ) AND (K1 ∪ … ∪ Kγ).]

Figure 2.14: GO2PUB Query composition process using the parameters provided by the user. (1) the initial α GO terms (purple boxes) are enriched by their descendants. (2) the genes (here noted G1 to Gz) annotated by the GO terms are retrieved. (3) the query is composed using the names, symbols and synonyms of the genes, the β species (S) and the γ MeSH or free keywords (K).

The first part of each query involves one or more GO terms. The users can enter either the name or the identifier of the GO terms. These terms are suggested when the users start to fill the field. The exact GO term is suggested if the users provide one of its GO synonyms. For example, GO2PUB will search for “lipid biosynthetic process” if the users provide “lipogenesis”. When two or more GO terms are entered, GO2PUB makes the union of them (“OR” connector). Then, the users select one or several species using a name (common or scientific names and their synonyms are allowed) or an NCBI taxon code16. In this case, the users can choose to join them (using “OR”) or intersect them (using “AND”). The logical connectors “AND” and “OR” are set by default to make the union of species and the intersection of keywords, but this can be modified. Next, the users can enter additional MeSH terms to specify their query. The MeSH terms associated with the articles by PubMed are not all of the same importance, some of them being classified as “Major topic” (MAJR). Users can qualify each keyword as a simple MeSH term or a Major topic. Again, the users can specify the connector between keywords. GO2PUB retrieves all the gene names annotated by each GO term, directly or indirectly through the semantic inheritance properties. It then builds a query on the model “(n gene names, symbols or synonyms separated by OR) AND (m species) AND (p MeSH terms)”. The name, symbol and synonyms of each gene compose the first part of the query; they are searched in titles and abstracts. The species and keywords chosen by the users make up the second part of the query. Finally, GO2PUB submits to PubMed a query composed of the gene names annotated directly or indirectly by the GO terms chosen by the users (name OR symbol OR synonym), at least one species, and some MeSH terms and free keywords. This big query is split into several smaller ones if it exceeds the PubMed server's URL length limitation.
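The query template and the splitting step can be sketched as follows. The PubMed field tags [Title/Abstract] and [MeSH Terms] are real, but the helper functions, the connector choices and the length threshold are illustrative assumptions, not GO2PUB's actual code:

```python
# Sketch of the "(gene names ... OR ...) AND (species) AND (MeSH terms)"
# model, including splitting an oversized query into several smaller ones.

def build_query(gene_terms, species, mesh_terms):
    parts = []
    if gene_terms:
        parts.append("(" + " OR ".join(f'"{g}"[Title/Abstract]' for g in gene_terms) + ")")
    if species:
        parts.append("(" + " OR ".join(f'"{s}"[MeSH Terms]' for s in species) + ")")
    if mesh_terms:
        parts.append("(" + " AND ".join(f'"{m}"[MeSH Terms]' for m in mesh_terms) + ")")
    return " AND ".join(parts)

def split_queries(gene_terms, species, mesh_terms, max_len=2000):
    """Split the gene list into batches so each query stays under max_len."""
    batch, batches = [], []
    for g in gene_terms:
        if batch and len(build_query(batch + [g], species, mesh_terms)) > max_len:
            batches.append(batch)
            batch = []
        batch.append(g)
    if batch:
        batches.append(batch)
    return [build_query(b, species, mesh_terms) for b in batches]

print(build_query(["FASN", "ACACA"], ["Chickens"], ["Lipogenesis"]))
```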
GO2PUB compiles the results and displays all citations, numbered and sorted by date. Figure 2.15 on the next page shows the respective results of PubMed, GoPubMed and GO2PUB for a query about “Lipogenesis in chicken liver”. Overall, the comparison of GO2PUB (based on semantic expansion) with GoPubMed (based on text mining techniques) showed that both tools are complementary.

16http://www.ncbi.nlm.nih.gov/Taxonomy

[Figure 2.15 diagrams: Venn diagrams of the results returned by PubMed, GoPubMed (with manual expansion) and GO2PUB for the query “Lipogenesis in chicken liver”; panel (a) shows all results, panel (b) the results considered as relevant.]
Figure 2.15: Comparison of the PubMed, GoPubMed and GO2PUB results for the query “Lipogenesis in chicken liver”. (a) displays the distribution and intersections of all results. (b) displays the distribution and intersections of the results considered as relevant.

2.5 Synthesis

What the three studies have in common All three studies require simple taxonomy-based reasoning on OWL ontologies. However, in these cases, this simplicity was well adapted to the situation. It should be noted that in the first half of the 2000's, OWL reasoners were perceived as the preferred solution for querying OWL ontologies even if (1) most ontologies were essentially RDFS taxonomies represented in OWL, (2) the reasoners could hardly manage to load medium-size ontologies and (3) classification was extremely slow when it successfully terminated (even considering the ontologies' simplicity). This can be explained by the fact that unfortunately, few connections existed within the Semantic Web community between the people focusing on biomedical ontologies, who were mostly DL-oriented, and the people developing RDF and SPARQL. Open source ontology editors such as OILed and Protégé have been coupled with DL reasoners since the early 2000's, but have not supported simple and efficient connections with query languages from the triplestore community until recently (contrary to TopBraid Composer). Nowadays, ontologies have mostly evolved in size but not in semantic complexity [118]. Therefore, even if the topics of the first two studies seem outdated now, the principle is still all the more relevant as major life science data sources such as Uniprot17 or the EBI18 are providing their data as RDF and offer dedicated SPARQL endpoints, and initiatives such as Bio2RDF [61] and the NCBO BioPortal [49, 141, 142] provide an integrated access in RDF to life science datasets and ontologies. Other initiatives such as purl19 and identifiers.org20 provide solutions for perennial and location-independent identifiers [106, 64].

17http://beta.sparql.uniprot.org/ 18https://www.ebi.ac.uk/rdf/ 19https://purl.org 20http://identifiers.org/

Integration of diseases and pathways Using SNOMED-CT and the Gene Ontology for integrating diseases and pathways combined the hierarchies of these two ontologies. It was then possible to identify pathways or families of pathways shared by two diseases even if these diseases are directly associated with different pathways, provided the pathways have some common ancestor. Conversely, two pathways may be associated with different but similar diseases. Current analysis techniques go beyond the simple detection of shared common elements and focus on whether the number of shared elements is greater than what we could expect by random selection [143]. If all the resources had existed in RDFS or OWL, the current version of this work would have consisted of producing an OWL version of the mappings between SNOMED-CT diseases and GO biological processes, and of performing SPARQL federated queries over SNOMED-CT, GO and the mappings.

Composition of Web services We proposed a simple generic algorithm compatible with any annotation framework such as WSMO, OWL-S or SAWSDL. The major limitations for its deployment were the lack of an appropriate domain ontology, as well as (understandably, given the lack of ontologies) the lack of semantic annotations for Web services and particularly for their parameters [144]. Our semantic compatibility algorithm was only tested on linear workflows. Its generalization to more complex control structures remains to be studied.
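The compatibility test at the core of this algorithm can be sketched as follows, under the common convention (assumed here for illustration, not necessarily the exact criterion of section 2.3) that an output parameter can feed an input parameter when the output's class is subsumed by the input's class. The toy taxonomy and parameter names are invented:

```python
# Hedged sketch of semantic parameter pairing: an output of class C_out is
# compatible with an input expecting C_in when C_out is a (possibly
# indirect) subclass of C_in. Toy ontology and parameters, illustration only.

superclasses = {
    "ProteinSequence": ["Sequence"],
    "NucleotideSequence": ["Sequence"],
    "Sequence": ["Data"],
}

def is_subsumed_by(c, d):
    """True if class c is d or a transitive subclass of d."""
    if c == d:
        return True
    return any(is_subsumed_by(p, d) for p in superclasses.get(c, ()))

def pair_parameters(outputs, inputs):
    """For each input, pick the first compatible available output (or None)."""
    return {name: next((o for o, cls in outputs.items()
                        if is_subsumed_by(cls, wanted)), None)
            for name, wanted in inputs.items()}

outputs = {"blast.out": "ProteinSequence"}
inputs = {"align.in": "Sequence", "align.params": "Data"}
print(pair_parameters(outputs, inputs))
```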

Semantic query expansion with GO2PUB GO2PUB performs a semantic expansion of the GO terms of interest complying with the semantic inheritance through the GO graph before retrieving the corresponding genes to enrich the query. Using the semantic inheritance properties of the GO graph was useful, as the more descendants a GO term has, the more relevant results GO2PUB yields. GoPubMed does not follow the semantic inheritance properties of GO. We manually expanded GoPubMed queries and compared the results to GO2PUB's. This showed that query expansion would be a valuable extension for GoPubMed. Both GO2PUB and GoPubMed retrieved relevant articles ignored by PubMed. As most of the results obtained by GO2PUB and GoPubMed are relevant in the qualitative study and in the generalization study, the intersection of GoPubMed and GO2PUB results decreases noise. As each tool yields relevant articles ignored by the other, the union of their results also decreases silence. GO2PUB seems less suited for queries involving either general GO terms or GO terms with few or no descendants. Indeed, with general GO terms, GO2PUB considers a lot of descendants, and therefore a lot of genes. We expect this to increase the noise as some of the genes will be irrelevant. Conversely, GO terms having few or no descendants are associated with few genes. We do not expect semantic expansion to benefit these highly specific queries yielding only a few PubMed results. Overall, GO2PUB performed better than GoPubMed and PubMed. GO2PUB brought relevant results ignored by GoPubMed even when adding a manual query expansion for GoPubMed. Conversely, GoPubMed's text mining approach found relevant articles ignored by GO2PUB. This demonstrates GO2PUB's relevance and its complementarity with GoPubMed.

What we learned • Ontologies are getting bigger much more than they are becoming semantically richer (see [118] for the Gene Ontology). Even so, DL-based reasoners can hardly keep up with the size, so reasoning on rich ontologies is still prohibitive nowadays.

• OWL-based reasoning is not necessary in all situations.

• SPARQL is increasingly well adapted in terms of expressivity and of capability of dealing with large quantities of information (the linked data initiative is soaring).

• The fact that SPARQL is well adapted is good news in the short term as it obviously fills a need, but it may also be bad news in the longer term as it may turn people away from the endeavor of producing semantically-richer ontologies (which may or may not be a bad thing).

Chapter 3

Reasoning based on classification

Outline

RDFS requires us to state explicitly that a resource is an instance of a class or that a class is a subclass of another class. However, we should be able to recognize both situations from the class characteristics. This brings the need to formalize class characteristics. This formal representation of class characteristics can even be used during the ontology maintenance process for ensuring completeness and internal consistency (like some non-regression tests) [145]. I proposed a rule-based solution for maintaining the internal consistency of the brain cortex anatomy ontology I developed during my PhD [69]; as this was before the availability of OWL, it was not included in this manuscript. Description Logics (DL) provide semantically-rich constructs such as disjointness, existential and universal restrictions, as well as necessary and sufficient definitions, which are needed for performing inferences beyond simple taxonomy-based reasoning [38, 71, 116, 146, 147, 51, 43, 54]. All these reasoning capabilities require that ontologies actually implement these various features, which was not the case in the early days of OWL. Section 3.2 shows the difficulty of generating an OWL version of the ontology of human anatomy from a frame-based representation (which, fortunately for us, was the result of an elaborate and rigorous modeling effort by Cornelius Rosse and his team). Section 3.3 shows that a portion of this OWL version of the human anatomy ontology was instrumental in the Virtual Soldier project for inferring the consequences of bullet injuries in the region of the heart. Section 3.4 focuses on the comparison of OWL and SWRL-based reasoning for optimizing the classification of pacemaker alerts.

3.1 Principle

Description Logics (DL) are formal languages for representing characteristics of sets of objects [146, 147]. They were created to remedy semantic networks' and frame languages' lack of formally-defined semantics [148]. There are several DL languages with different expressivity. This section focuses on the OWL language, which was developed in the Semantic Web context. OWL1.0-DL was based on SHOIN(D) [116] (OWL1.0 also defined OWL-Full, which did not belong to the DL family in order to support metaclasses; except when specified otherwise, we focus on DL). Its limitations led to the development of OWL2.0, which extends OWL1.0-DL and is based on SROIQ(D) [149]. This section presents an overview of OWL's main characteristics

that were used for biomedical data analysis. It is a summary of “OWL2 direct semantics”1 and “OWL2 RDF-based semantics”2.

3.1.1 OWL Classes DL represent “concepts” or “classes” as sets of individuals defined in intension. The function that associates a class and a set of individuals is the interpretation function, noted I. ∆ is the set of all individuals for a particular domain. For a set of classes, there can be many ∆, and therefore many interpretation functions. Neither ∆ nor I are usually completely known. As the notion of genericity is a key aspect of ontologies, ontologies specify characteristics of classes and relations between them that are valid for all sets of individuals and all interpretation functions. Note that two classes can be associated with the same set of individuals by an interpretation function I1 in a particular domain, but with different sets by an interpretation function I2 in another domain. Classes that are associated with the same set of individuals for all the interpretation functions are equivalent (but each retains its identity). Similarly, a class can be associated with an empty set of individuals according to I1 but not according to I2. A class that is associated with an empty set of individuals for all interpretation functions is unsatisfiable or inconsistent. Having a set-based definition for DL classes makes it possible to define a semantics based on set operations on their interpretation. This formally-defined semantics is useful because it supports reasoning. In counterpart, this means that we focus on the classes that comply with this semantics. Because rdfs:Class was loosely defined as the range of the rdf:type property, OWL introduces the notion of owl:Class (note how prefixes are useful for distinguishing the two) for designating classes that are sets of individuals (so metaclasses are not allowed, contrary to RDFS): owl:Class rdfs:subClassOf rdfs:Class. OWL also defines two special classes: owl:Thing (also noted top or ⊤) and owl:Nothing (also noted bottom or ⊥).

owl:Thing rdf:type owl:Class        ⊤^I = ∆

owl:Nothing rdf:type owl:Class        ⊥^I = ∅

Of course, the RDFS definition of rdfs:subClassOf (noted ⊑ between classes) remains valid for OWL classes. In this case, for every interpretation function, the set of instances of the subclass is a subset of the set of instances of the superclass. This is used in ontologies for representing the characteristics shared by all the instances of a class, by making the class a subclass of some logical expression combining the instances of other classes.

C1 rdfs:subClassOf C2  ⇔  C1^I ⊆ C2^I

1http://www.w3.org/TR/owl2-direct-semantics/ 2http://www.w3.org/TR/owl2-rdf-based-semantics/

OWL also provides an owl:equivalentClass property (noted ≡ between classes) to specify that for every interpretation function, the two classes have the same set of instances. This is used extensively in (semantically-rich) ontologies to provide at least one necessary and sufficient definition for a class, by making it equivalent to some logical expression combining the instances of other classes. A class with a necessary and sufficient definition is called a “defined class”, as opposed to “primitive classes”.

C1 owl:equivalentClass C2  ⇔  C1^I = C2^I
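The set semantics above can be illustrated by checking one particular interpretation explicitly. This is a toy ∆ and I invented for illustration; an actual reasoner must establish that the relation holds for every interpretation, which cannot be done by enumeration:

```python
# Toy illustration of the interpretation function: each class is mapped to
# a set of individuals, and subsumption/equivalence become set relations.

delta = {"p1", "p2", "p3", "p4"}       # ∆: all individuals of the domain
interp = {                             # one particular interpretation I
    "owl:Thing": delta,                # ⊤^I = ∆
    "owl:Nothing": set(),              # ⊥^I = ∅
    "Patient": {"p1", "p2", "p3"},
    "DiabeticPatient": {"p1", "p2"},
}

def subclass_holds(c1, c2, i):
    return i[c1] <= i[c2]              # C1^I ⊆ C2^I

def equivalent_holds(c1, c2, i):
    return i[c1] == i[c2]              # C1^I = C2^I

print(subclass_holds("DiabeticPatient", "Patient", interp))      # True
print(subclass_holds("owl:Nothing", "DiabeticPatient", interp))  # True: ∅ ⊆ anything
print(equivalent_holds("DiabeticPatient", "Patient", interp))    # False under this I
```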

3.1.2 Union and intersection of classes The union (resp. intersection) of two classes, noted ⊔ (resp. ⊓), is a class whose set of instances is the union (resp. intersection) of the sets of instances of the two classes.

owl:unionOf(C1, C2) rdfs:subClassOf owl:Class  ⇔  (C1 ⊔ C2)^I = C1^I ∪ C2^I

owl:intersectionOf(C1, C2) rdfs:subClassOf owl:Class  ⇔  (C1 ⊓ C2)^I = C1^I ∩ C2^I

For example, this can be used to define the class Finger by making it an equivalent class to the union of Thumb, Index, etc. Similarly, we can define the class LeftThumb as the intersection of Thumb and LeftSideOrgan.

3.1.3 Disjoint classes OWL relies on the open world assumption (cf. Section 4.1), so by default, two classes can share some instance under one interpretation function, and not have any instance in common under another interpretation function. The disjointWith property specifies that two classes can never have any instance in common. When designing ontologies, sibling classes are usually disjoint (e.g. the five subclasses of Finger: Thumb, Index, MiddleFinger, RingFinger and LittleFinger are pairwise disjoint) but not always (e.g. if we also consider the two other subclasses LeftFinger and RightFinger, which are also pairwise disjoint but should not be assumed to be disjoint from the five others). Stating explicitly that two classes are disjoint is useful both for avoiding logical inferences that would not be consistent with the domain knowledge and for ruling out options in case of partially-known information. Several examples are presented in Chapter 4.

C1 owl:disjointWith C2  ⇔  C1^I ∩ C2^I = ∅

3.1.4 Negation: complement of a class The negation of a class (noted ¬) is a class whose set of instances is the complement of the set of instances of the original class (i.e. the set of individuals that are not instances of the class). By definition, a class and its negation are disjoint.

owl:complementOf(C) rdfs:subClassOf owl:Class  ⇔  (¬C)^I = ∆ \ C^I

For example, the intersection of the class Patient and of the negation of its subclass DiabeticPatient represents the non-diabetic patients (we will see in section 3.1.5 how to define DiabeticPatient as the patients suffering from diabetes).

3.1.5 Existential and universal restrictions An existential restriction (noted ∃ property . Class) is the class whose instances are the individuals related to at least one instance of Class (direct or indirect) through property (or a subproperty). For example, ∃ suffersFrom . Diabetes is the set of the individuals being the subject of a triple whose predicate is suffersFrom and whose object is an instance of Diabetes.

(∃ property . Class)^I = {i ∈ ∆ | ∃j ∈ Class^I, property^I(i, j)}

A universal restriction (noted ∀ property . Class) is the class whose instances are the individuals for which property (or a subproperty) only leads to instances of Class (direct or indirect), i.e. the individuals not related through property to anything that is not an instance of Class. Individuals not related to anything through property also qualify as instances of the universal restriction. For example, ∀ hasMedication . Anticoagulant is the set of individuals whose only medications are anticoagulants, including those who do not have any medication.

(∀ property . Class)^I = {i ∈ ∆ | ∀j ∈ ∆, property^I(i, j) ⇒ j ∈ Class^I}

Existential and universal restrictions are typically used:

• as superclasses for representing characteristics shared by all the instances of a class. In this case, the reasoning is “if you are an instance of the class, then you match the condition”. For example, all the ventricular cavities are filled with blood (and other anatomical structures can also be filled with blood): VentricularCavity ⊑ (∃ filledWith . Blood). If you are an instance of VentricularCavity, then you are related to at least one instance of Blood through filledWith. Even if the relation is not explicitly stated, we know that there must be at least one.

• as equivalent classes for representing characteristics that define a class. In this case, the reasoning can be either “if you are an instance of the class, then you match the condition” or “if you match the condition, then you are an instance of the class”. For example, a diabetic patient is an individual who suffers from diabetes: DiabeticPatient ≡ (∃ suffersFrom . Diabetes).
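Under a fixed, fully known interpretation, the extensions of these two restrictions can be computed directly. This is a closed-world sketch for illustration only (because of OWL's open world assumption, a reasoner could not derive the ∀ memberships from asserted facts alone); the individuals and assertions are invented:

```python
# Computing ∃suffersFrom.Diabetes and ∀hasMedication.Anticoagulant over a
# toy, fully known interpretation. Facts invented for illustration.

individuals = {"alice", "bob", "carol"}
diabetes = {"d1"}
anticoagulant = {"m1"}
suffers_from = {("alice", "d1")}
has_medication = {("alice", "m1"), ("bob", "m2")}  # m2: not an anticoagulant

def some(prop, cls):
    """∃ prop . cls: related to at least one instance of cls."""
    return {i for i in individuals if any((i, j) in prop for j in cls)}

def only(prop, cls):
    """∀ prop . cls: every prop filler is in cls (vacuously true if none)."""
    return {i for i in individuals
            if all(o in cls for (s, o) in prop if s == i)}

print(some(suffers_from, diabetes))                  # {'alice'}
print(sorted(only(has_medication, anticoagulant)))   # ['alice', 'carol']
```

Note that carol, who has no medication assertion at all, qualifies vacuously for the universal restriction, as stated in the text above.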

3.1.6 Cardinality restrictions A minimum cardinality restriction (noted min x property . Class) is a class whose instances are the individuals related to at least x instances of Class through property. A maximum cardinality restriction (noted max x property . Class) is a class whose instances are the individuals related to at most x instances of Class through property. An exact cardinality restriction (noted exactly x property . Class) is a class whose instances are the individuals related to exactly x instances of Class through property.

(min x property . Class)^I = {i ∈ ∆ | card({j ∈ Class^I | property^I(i, j)}) ≥ x}

(max x property . Class)^I = {i ∈ ∆ | card({j ∈ Class^I | property^I(i, j)}) ≤ x}

(exactly x property . Class)^I = {i ∈ ∆ | card({j ∈ Class^I | property^I(i, j)}) = x}

Because of the open world assumption, cardinality restrictions are a good complement to existential and universal constraints. For the class Hand, one could use six existential constraints on hasPart to specify that a hand has a palm, a thumb, an index, etc. This would still allow an instance of Hand to have another part that would be an instance of Lung. For preventing this, one could add a closure axiom on hasPart, saying that Hand is a subclass of the things having only parts in the union of Palm, Thumb, Index, etc. However, it would still be acceptable to have an instance of Hand with seven palms, three thumbs and two of each other finger. The right solution here is to replace the six existential constraints by six exact cardinality constraints and to retain the closure.
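The counting behind a cardinality check can be sketched as follows (toy individuals; a DL reasoner works differently, since under the open world assumption unstated parts may exist, whereas this sketch counts only asserted fillers):

```python
# Sketch: checking an exact cardinality restriction (here: exactly 1
# hasPart.Palm) by counting asserted fillers. Toy data, closed-world check.

def satisfies_exactly(individual, prop, cls, n):
    """exactly n prop . cls, counting the asserted fillers in cls."""
    return sum(1 for (s, o) in prop if s == individual and o in cls) == n

palms = {"palm1", "palm2"}
has_part = {("hand1", "palm1"),                      # a well-formed hand
            ("hand2", "palm1"), ("hand2", "palm2")}  # two palms: violation

print(satisfies_exactly("hand1", has_part, palms, 1))  # True
print(satisfies_exactly("hand2", has_part, palms, 1))  # False
```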

3.1.7 Property chains A property chain (noted R1 ∘ R2) is a property formed by following R1 then R2.

(R1 ∘ R2)^I = {(i, j) ∈ ∆² | ∃k ∈ ∆, (i, k) ∈ R1^I ∧ (k, j) ∈ R2^I}

For example, property chains are used in the Gene Ontology (not otherwise known for its semantic richness) for modeling inference3. Figure 3.1 on the next page shows the inference pattern “if A positively regulates B and B is part of C, then A positively regulates C”. It uses a property chain for modeling the condition and rdfs:subPropertyOf for modeling the inference.
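The composition at the heart of this pattern can be sketched directly from the definition, with toy relation instances mirroring Figure 3.1:

```python
# Sketch of the GO inference pattern of Figure 3.1: composing
# positively_regulates with part_of yields new positively_regulates pairs.

def compose(r1, r2):
    """(R1 ∘ R2)^I = {(i, j) | ∃k, (i, k) ∈ R1^I and (k, j) ∈ R2^I}."""
    return {(i, j) for (i, k1) in r1 for (k2, j) in r2 if k1 == k2}

positively_regulates = {("x", "y")}
part_of = {("y", "z")}

# The chain is a subproperty of positively_regulates, so its pairs are added.
inferred = positively_regulates | compose(positively_regulates, part_of)
print(sorted(inferred))   # [('x', 'y'), ('x', 'z')]
```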

3.1.8 Synthesis As we have seen, Description Logics provide the formal foundation for explicitly representing constraints and definitions about classes, as well as relations between classes. In the remainder of this chapter, section 3.2 shows how rich these formal descriptions can be. It follows that building a semantically-rich ontology can be difficult, but conversely that adopting design patterns can (and should) also simplify both design and maintenance. Section 3.3 shows that once semantically-rich ontologies are available, complex reasoning can be incorporated into an application with minimal development. Finally, section 3.4 compares several modeling strategies in terms of complexity and performance.

3http://geneontology.org/page/ontology-relations

Figure 3.1: Property chain in the Gene Ontology: if x positively regulates y and y part of z, then we can infer that x positively regulates z.

3.2 Methodology: Description Logics representation of anatomy

This study focuses on converting the frame-based representation of the ontology of normal human anatomy (the FMA) into OWL, which is more adapted to the reasoning task we envisioned (cf. section 3.3). The challenges were (1) preserving and possibly extending the semantic richness of the original ontology (contrasting with the other ontologies we have seen previously, which were mostly taxonomies) and (2) dealing with the large size of the ontology, which placed it out of reach for the reasoners. We showed that a perfect conversion was impossible: we had to forgo some of the frame-specific aspects, but on the other hand we could add some OWL-specific ones such as disjointness or coverage. We recognized that the computational constraints required some simplification for the resulting ontology to be usable by applications, but that these simplifications were application-specific. We proposed a solution that delivers an OWL-Full representation as expressive as possible, and a “Virtual FMA-OWL” mechanism (detailed in the original article) providing access to the FMA on a concept-by-concept basis for further simplifications and adaptations. In retrospect, this work is interesting because it showed that even if Description Logics offer some highly desirable knowledge modeling primitives, using these primitives consistently and systematically during ontology design is difficult and in turn requires some automation. At that time I mostly used dedicated ad hoc scripts on the FMA [76] or on the ontology of brain cortex anatomy I developed during my PhD [69], but this problem was later addressed by others in more principled approaches such as ontology design patterns [43] or more recently the SPINa and Shape Expressionb (ShEx) languages for expressing constraints on RDF(S) graphs.

ahttp://spinrdf.org/ bhttp://www.w3.org/2001/sw/wiki/ShEx

This work was originally published in: Olivier Dameron, Daniel L. Rubin, and Mark A. Musen. Challenges in converting frame-based ontology into OWL: the Foundational Model of Anatomy case-study. In American Medical Informatics Association Conference AMIA05, pages 181–185, 2005 [75] and later extended in: Olivier Dameron and Julie Chabalier. Automatic generation of consistency constraints for an OWL representation of the FMA. In 10th International Protégé Conference, 2007 [76].

3.2.1 Context In the medical domain, anatomy is a fundamental discipline that underlies most medical fields [150]. The Foundational Model of Anatomy (FMA) is the most complete ontology of canonical (i.e. healthy) human anatomy [151]. It strictly follows a principled modeling approach and included more than 70,000 concepts and 1.5 million relationships at the time of the study in 2004. The FMA is represented in a frame language [152]. However, for some applications it is desirable to use an OWL representation of the FMA, either for reasoning purposes [73] or for integrating it with other OWL ontologies, such as the NCI thesaurus [153]. The problem is that frames' semantics is not as precisely defined as that of Description Logics. Moreover, although superficially similar, these two approaches rely on fundamentally different modeling assumptions, and there is no direct mapping between them. Protégé4, the ontology editing platform that was used to build the FMA, supports both formalisms. The frame-based mode has an “export to OWL” option. However, this option only performs a straightforward translation

4http://protege.stanford.edu/

that ignores all the features that do not have a direct equivalent. Moreover, it does not take advantage of all the OWL-specific features that are the basis of the language's strength. For these two reasons, the resulting translation would not be usable for reasoning.

3.2.2 Objective We analyze some theoretical and computational issues of representing the FMA in OWL-DL. To address the expressiveness limitation, we propose to use a more expressive formalism ensuring application independence while meeting the expressiveness requirements. To address the computational limitations, we propose a “Virtual FMA-OWL” architecture based on a Web Service that returns the OWL-Full representation of a concept given its identifier. Finally, we advocate the use of this architecture for continuing to maintain the FMA in the current frame-based form while making it accessible to the Semantic Web. Note that the intention of this article is not to discuss the modeling of the FMA [151], but rather to examine different representation formalisms considering the computational requirements of the applications that use them.

3.2.3 Converting the FMA into OWL-DL The FMA is currently composed of more than 70,000 anatomical items called concepts, having more than 1.5 million relationships (such as composition, neighborhood or blood supply) between them. The concepts are identified by a unique number called the FMAID, and are associated with one or more designations (e.g. the string “Heart” for the concept 7088 corresponding to the heart), which makes it possible to handle synonyms or multiple languages. The concepts are strictly organized in a principled specialization hierarchy [152].

3.2.3.1 Basic concept representation: identifiers and designations We represented the FMA concepts as OWL classes, and relationships as OWL properties. Classes were identified by their FMAID, relative to the FMA namespace5. This allows us to avoid any potential ambiguity with another ontology having a concept with the same identifier, as different ontologies have different namespaces. We used RDF labels to represent the concept designations, explicitly mentioning the language. This is illustrated in Figure 3.2, in which the identifier is interpreted against the default namespace, which is declared at the ontology level to be the FMA one.

<owl:Class rdf:ID="7088">
  <rdfs:label xml:lang="en">Heart</rdfs:label>
  <rdfs:label xml:lang="fr">Coeur</rdfs:label>
  ...
</owl:Class>

Figure 3.2: Representation of the FMA identifiers and designations in OWL-DL.

3.2.3.2 Taxonomy and metaclasses The FMA features a complex structure of superclasses and subclasses [152]. For example, “Physical anatomical entity” is an instance of “Anatomical entity template”, and a subclass of both “Anatomical entity template” and “Anatomical entity”.

5http://sig.biostr.washington.edu/fma#

The representation of the original FMA taxonomy in OWL was straightforward. The subclass–superclass relation between frames was represented by the rdfs:subClassOf relation. The resulting hierarchy is homologous to the original FMA one (see Figure 3.3). OWL-DL does not support metaclasses, so we needed to remove them.

Figure 3.3: Taxonomy of the OWL-DL representation of the FMA.

3.2.3.3 Disjointness Description Logics' modeling principles are slightly different from those of frames. These differences have to be taken into account during the conversion. In particular, the FMA is organized in a hierarchy of mutually-disjoint concepts. However, in Description Logics (hence in OWL-DL), classes are not disjoint by default (i.e. there can exist an individual that is an instance of both classes). Therefore, in order to respect the FMA modeling principles, we assume that, unless specified otherwise by multiple inheritance, all the direct subclasses of a class are mutually disjoint. For example, Esophagus and Stomach are two direct subclasses of “Organ with organ cavity” and they are disjoint (an instance of “Esophagus” cannot also be an instance of “Stomach”). However, “Left breast”, “Right breast”, “Male breast” and “Female breast” should not be specified as disjoint (although they automatically are, because “Left female breast” is only described as a subclass of “Female breast”, and not also of “Left breast”). This point will be further discussed in section 3.2.5. Note that this knowledge was implicit in the frames version of the FMA and is made explicit in its OWL version.
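The default-disjointness rule can be sketched as follows. This is an illustrative Python fragment, not part of the original conversion tool; the class names and the `parents` map are hypothetical stand-ins for the converted taxonomy. Two siblings are declared disjoint only if no class inherits, directly or indirectly, from both of them.

```python
# Hedged sketch: decide which sibling classes may safely be declared
# disjoint. `parents` maps each class to its (possibly multiple) direct
# superclasses; all names are illustrative, not actual FMA identifiers.
from itertools import combinations

parents = {
    "Esophagus": {"OrganWithOrganCavity"},
    "Stomach": {"OrganWithOrganCavity"},
    "LeftBreast": {"Breast"},
    "RightBreast": {"Breast"},
    "MaleBreast": {"Breast"},
    "FemaleBreast": {"Breast"},
    # Hypothetical: a class that is both left and female, i.e. multiple
    # inheritance signalling that its two parents must NOT be disjoint.
    "LeftFemaleBreast": {"LeftBreast", "FemaleBreast"},
}

def ancestors(cls):
    """All direct and indirect superclasses of cls."""
    result, stack = set(), list(parents.get(cls, ()))
    while stack:
        p = stack.pop()
        if p not in result:
            result.add(p)
            stack.extend(parents.get(p, ()))
    return result

def descendants(cls):
    """All classes that inherit (directly or not) from cls."""
    return {c for c in parents if cls in ancestors(c)}

def disjoint_pairs(superclass):
    """Sibling pairs sharing no common descendant: safe to declare disjoint."""
    siblings = [c for c, ps in parents.items() if superclass in ps]
    return [(a, b) for a, b in combinations(siblings, 2)
            if not (descendants(a) | {a}) & (descendants(b) | {b})]
```

With this data, Esophagus and Stomach come out disjoint, while LeftBreast and FemaleBreast do not, because LeftFemaleBreast inherits from both.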

3.2.3.4 Closure Another difference between frames and Description Logics is that the latter relies on the “open world assumption” [146] whereas the former assumes a closed world. In a closed world, everything that is not explicitly stated is assumed to be false. Consequently, when the FMA describes the parts of an anatomical structure such as the hand, the fact that structures other than the palm or the fingers are not said to be parts of the hand is interpreted as “they are not parts of the hand”. However, in Description Logics, providing a list of the possible parts of the hand does not prevent other structures from also being parts of the hand. Therefore, we have to add an extra constraint stating that the structures in the list are the only possible parts of the hand. This is called introducing a closure axiom [154], and it has to be done for all the relationships (for an example see [73]). However, generating closures is much more complicated than it may seem at first sight. For example, the possible parts of the hand are the palm and the five fingers. Now, we have to

take overloading by subclasses into account so that the possible parts of the “Left hand” are “Left palm”, “Left thumb”, . . . , “Left little finger” (note that the closure does not mention non-lateralized concepts such as “Thumb” anymore). However, the same approach cannot be applied to the lungs: a lung has an upper lobe and a lower lobe as parts. Its subclass “Right lung” not only has the parts “Upper lobe of right lung” and “Lower lobe of right lung” (same approach as for the hand), but also a middle lobe that does not exist for the left lung. As a consequence, it would be incorrect to generate a closure for the lung based on its parts, whereas it should be done for the hand. The first situation occurs when the child overloads its parent. The second one occurs when subclasses introduce new properties. Unfortunately, real world situations can mix these two situations. In order to automate the systematic generation of closures, we have to check that all the classes that define the range of a relation for a concept are subclasses of the range of this relation for the superclasses of the concept. This point will be further discussed in section 3.2.5.
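The safety check described in the last sentence can be sketched as follows; an illustrative Python fragment with hypothetical class names and deliberately simplified part lists, not the actual conversion code. A closure can be generated for a subclass only if each of its parts refines a part of the superclass.

```python
# Hedged sketch of the closure-safety test: a subclass only overloads its
# parent (closure safe) if every one of its parts is a subclass of one of
# the parent's parts. All names are illustrative.

hierarchy = {  # child -> direct parent (single inheritance for the sketch)
    "LeftPalm": "Palm", "LeftThumb": "Thumb",
    "UpperLobeOfRightLung": "UpperLobeOfLung",
    "LowerLobeOfRightLung": "LowerLobeOfLung",
}

parts = {  # class -> listed possible parts (simplified)
    "Hand": ["Palm", "Thumb"],
    "LeftHand": ["LeftPalm", "LeftThumb"],
    "Lung": ["UpperLobeOfLung", "LowerLobeOfLung"],
    "RightLung": ["UpperLobeOfRightLung", "LowerLobeOfRightLung",
                  "MiddleLobeOfRightLung"],  # new part, absent from Lung
}

def is_subclass_of(child, parent):
    while child is not None:
        if child == parent:
            return True
        child = hierarchy.get(child)
    return False

def closure_is_safe(cls, subclass):
    """True if `subclass` only overloads the parts of `cls` and
    introduces no new part, so a closure may be generated."""
    return all(any(is_subclass_of(p, q) for q in parts[cls])
               for p in parts[subclass])
```

Here the test succeeds for the hand (pure overloading) and fails for the lung, whose right subclass introduces a middle lobe.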

3.2.3.5 N-ary and attributed relationships N-ary relationships associate more than two entities. In particular, they are used extensively in the FMA to qualify a relation between two entities. These attributed relationships qualify part or continuity relationships, for example (the lung is continuous medially with the pulmonary venous tree). The modeling of such relationships in Description Logics has been studied by the W3C Semantic Web Best Practice working group, and we followed their recommendation [155].

3.2.4 Addressing expressiveness and application-independence: OWL-Full From the previous section, we have seen that some of the FMA features are simply out of the scope of OWL-DL. We propose a two-layered approach. The first layer consists of a generic conversion tool that generates a representation of the FMA in OWL-Full. The second layer consists of several application-specific optimization tools that simplify the OWL-Full representation of concepts into OWL-DL ones by removing all the features that are unnecessary in the application context. OWL-Full does not suffer from the expressiveness limitations of OWL-DL. For example, it supports metaclasses. Figure 3.4 shows that the “Heart” (FMAID: 7088) can be represented in OWL-Full both as a subclass and as an instance of “Organ with cavitated organ parts” (FMAID: 55673), which complies with the original FMA structure.

<owl:Class rdf:ID="7088">
  <rdfs:label xml:lang="en">Heart</rdfs:label>
  <rdfs:label xml:lang="fr">Coeur</rdfs:label>
  <rdf:type rdf:resource="#55673"/>
  <rdfs:subClassOf rdf:resource="#55673"/>
  ...
</owl:Class>

Figure 3.4: Representing the original FMA metaclasses and subclasses structure in OWL-Full (concept 7088 is the heart; concept 55673 is “Organ with cavitated organ part”).

OWL-Full allows us to generate a layer that has all the expressiveness we may need and that

is application-independent. Moreover, this approach promotes interoperability: any application that requests the concept 7088 gets the same description in OWL-Full. The application is then free to modify this description internally according to its specific needs (namely, simplify it to meet its computational requirements), but at least the communication between applications refers to a shared representation.

3.2.5 Pattern-based generation of consistency constraints This section summarizes a collaboration with Julie Chabalier between 2006 and 2007, and therefore postdates the initial work from 2004. Previous work showed that some features are implicit or cannot be represented in frames but are crucial for leveraging the specificities of OWL [75, 156, 157, 158]. It would be possible to address this point by manual processing, but the task would be cumbersome and error-prone, and would increase the workload of maintaining the original FMA. Moreover, the organization of the FMA follows a strict and principled approach that could be exploited. We identified in the original FMA a set of patterns reflecting situations whose underlying modeling principle could only partially (or not at all) be represented in frames, and is therefore missing. We wrote a set of Python scripts for detecting these patterns among the classes of the original FMA and generating the corresponding OWL constraints.
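As an illustration of this approach (the original scripts are not reproduced here), the following hedged sketch detects one of the patterns used throughout this section, “Left X”/“Right X”, among a list of class designations:

```python
# Hedged sketch of a pattern detector in the spirit of the scripts
# described above. It spots "Left X"/"Right X" sibling pairs among
# class designations; the designation list is illustrative.
import re

designations = ["Left lung", "Right lung", "Left breast", "Right breast",
                "Heart", "Male breast", "Female breast"]

def left_right_pairs(names):
    """Return the (Left X, Right X) pairs found in `names`."""
    names = set(names)
    pairs = []
    for n in names:
        m = re.match(r"Left (.+)", n)
        if m and f"Right {m.group(1)}" in names:
            pairs.append((n, f"Right {m.group(1)}"))
    return sorted(pairs)
```

Each detected pair then drives the generation of the corresponding OWL constraints (additional subclass axioms, disjointness, coverage), as detailed in the following subsections.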

3.2.5.1 Representing multiple inheritance The FMA taxonomy follows a very principled approach. We duplicated this taxonomy in OWL (cf. section 3.2.3.2). However, because the original FMA only uses single inheritance and because the distinction between single and multiple inheritance is not relevant in OWL, we generated some additional taxonomic relationships. Due to the single inheritance constraint, the following situations incompletely account for the subclass–superclass structure:

• left/right and male/female: 3 classes (Breast, Areola, Nipple)

• left/right and enumeration: 65 classes (e.g. Left first cervical nerve)

• upper/lower and enumeration: 5 classes (e.g. Upper first molar socket)

For example, Breast has four direct subclasses: Left breast, Right breast, Male breast and Female breast. Each of Male breast and Female breast has one left and one right direct subclass. Consequently, Left male breast is a subclass of Male breast, but not of Left breast. Moreover, classes such as Intercostal lymph node combine the last two patterns.

3.2.5.2 Disjointness In the original FMA, by default, all the sibling classes are disjoint. For example, the direct subclasses of Cell are Nucleated cell and Non-nucleated cell, and it is clear that a cell cannot be at the same time nucleated and non-nucleated. This feature can be made explicit in OWL with disjointness constraints. However, systematically making all the sibling classes mutually disjoint requires ruling out situations where the direct subclasses of a class are not all mutually exclusive. In the previous example, Left breast and Right breast, as well as Male breast and Female breast, are disjoint, but Left breast and Male breast are not. Similarly, the class Region of chest wall has seven direct subclasses, including Anterior chest wall, Superficial chest wall, Anterior superficial chest wall, Lateral chest wall and Lateral superficial chest

wall. The problem was then to distinguish, among the siblings, the pairs of disjoint classes from those that are not. Rather than risking some inconsistency or some trivial satisfiability of the ontology, we chose the conservative approach of only generating the disjointness constraints we are certain of. This led us to the identification of some other patterns among sibling classes:

• Left X/Right X: 3736 classes (e.g. Left lung)

• X left Y/X right Y: 13989 classes (e.g. Skin of right breast)

• Male X/Female X: 25 classes (e.g. Male breast)

• X male Y/X female Y: 75 classes (e.g. Right side of male chest)

• enumeration: XX classes (e.g. First cervical nerve)

• upper/(middle)/lower: YY classes (e.g. Upper lobe of lung)

Figure 3.5: Disjointness axiom for LeftBreast stating that it is disjoint from RightBreast. It was generated with the Left X/Right X pattern.

3.2.5.3 Coverage as necessary and sufficient definitions The FMA aims at completeness. It is assumed that for each class, its subclasses provide a complete decomposition (i.e. there are no X-Other or X-Unspecified subclasses of the X class). For example, the two direct subclasses of Organ are Solid organ and Cavitated organ. The intended meaning is that each organ is either solid or cavitated, and that there is no third possibility. In OWL, this can be made explicit by a coverage definition. In the previous example, Organ would be defined as the union of Solid organ and Cavitated organ.

A naive approach would consist in generating a coverage definition for each class using the union of its direct subclasses. This would successfully generate the definition that a lobe of lung is either an upper lobe of lung, a middle lobe of lung or a lower lobe of lung. The result would always be correct from a logical point of view. However, in some situations it can still be refined. If we return to the class Breast, we could further specify that a breast is either a left breast or a right one, and also that it is either a male breast or a female one.

In order to generate coverage definitions as specific as possible, we reused the patterns identified for generating multiple inheritance (Section 3.2.5.1) and disjointness (Section 3.2.5.2). For each matching pattern, we generated the corresponding coverage (this is what allows us to generate two definitions for Breast, cf. Figure 3.6). We also generated a coverage definition for the classes matching none of the patterns (e.g. for Lobe of lung).

Figure 3.6: Two coverage axioms for Breast represented as necessary and sufficient definitions for Breast. They were generated with the Left X/Right X and the Male X/Female X patterns.
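The generation step can be sketched as follows. This illustrative Python fragment (hypothetical class names, only two hard-coded patterns) emits one covering union per matching pattern and falls back to the union of all direct subclasses otherwise, yielding two coverage definitions for Breast and a single one for Lobe of lung:

```python
# Hedged sketch of coverage generation. For each class, emit one covering
# union per matching pattern, or a single union of all direct subclasses
# when no pattern matches. Class names and patterns are illustrative.

subclasses = {
    "Breast": ["Left breast", "Right breast", "Male breast", "Female breast"],
    "Lobe of lung": ["Upper lobe of lung", "Middle lobe of lung",
                     "Lower lobe of lung"],
}

PATTERNS = [("Left ", "Right "), ("Male ", "Female ")]

def coverages(cls):
    """Return a list of covering unions (each a sorted list of classes)."""
    children = subclasses[cls]
    found = []
    for prefixes in PATTERNS:
        group = [c for c in children
                 if any(c.startswith(p) for p in prefixes)]
        # crude check: the pattern applies only if it yields a complete pair
        if len(group) == len(prefixes):
            found.append(sorted(group))
    return found if found else [sorted(children)]
```

Each returned union would then be serialized as an owl:unionOf axiom defining the parent class.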

3.3 Methodology: diagnosis of heart-related injuries

This study aims at improving diagnosis and prognosis of battlefield injuries in the region of the heart. It focuses on a reasoning method based on the FMA for (1) determining which parts of the heart muscle will become partially or totally ischemic in case of an injury involving a coronary artery, and (2) determining whether a perforation of the wall of the heart will lead to massive bleeding or will be (temporarily) contained by the surrounding cavity. Both scenarios perform some semantically-rich reasoning. The first one involves class-based reasoning, and the second one involves instance-based reasoning. Together, they cover most of the OWL 1.0 constructs. In retrospect, this work is interesting because it demonstrated that semantically-rich ontologies represented in Description Logics actually support advanced reasoning. It also showed that once the ontologies are available (which is a critical limitation; we have covered some of the related difficulties in section 3.2), developing the application-specific part does not require a large amount of work (it was a matter of hours). Finally, the encapsulation of the symbolic reasoning into a Web service was relevant in terms of software architecture and showed that the end user does not have to operate an ontology editor.

This work was a contribution to DARPA's Virtual Soldier project6 during my postdoc with Mark Musen. It was originally published in: Daniel L. Rubin, Olivier Dameron, and Mark A. Musen. Use of Description Logic classification to reason about consequences of penetrating injuries. In American Medical Informatics Association Conference AMIA05, pages 649–653, 2005 [73]. It was also a contribution to: Daniel L. Rubin, Olivier Dameron, Yasser Bashir, David Grossman, Parvati Dev, and Mark A. Musen. Using ontologies linked with geometric models to reason about penetrating injuries. Artificial Intelligence in Medicine, 37(3):167–176, 2006 [74].

3.3.1 Context The Virtual Soldier project developed complex mathematical models to create physiological representations of individual soldiers that can be used to improve medical diagnosis on and off the battlefield. Soldiers would be equipped with “P tags”, USB-like storage devices containing their 3D anatomical information, as well as physiological and biological parameters, their genetic information and their medical record. They would also typically be equipped with “intelligent” battledress and sensors that monitor their vital signs and physiological parameters and would also be able to record the location of a bullet's entry and exit points as well as some associated parameters such as bullet velocity. All this information can be collected by medical teams in the field and provided to a remote diagnosis and prognosis decision support system. Primary penetrating injuries concern the anatomical structures directly impaired by the internal trajectory of a bullet. They can be determined with spatial geometric models of injured subjects by computing the intersection of the bullet's cone of damage with the organs. Secondary injuries are consequences of primary injuries. Typically, if the cone of damage impairs an artery, the organs perfused by the artery will experience ischemia. The determination of secondary injuries relies on background knowledge about anatomy (e.g. the Foundational Model of Anatomy [151]) and physiology (e.g. the Foundational Model of Physiology [159]). A challenge in creating new decision support systems is to incorporate medical knowledge and to apply that knowledge in flexible ways [160]. In most reasoning systems that use ontologies, the

6http://www.virtualsoldier.us/

knowledge used to guide reasoning (control knowledge) is embedded in the application code or in rules used in conjunction with the domain ontology [161]. We believe that it is advantageous to use Description Logics in biomedical applications to represent both the domain knowledge and the control knowledge needed for reasoning. Thus, we construe the reasoning problems in the domain as classification tasks.

3.3.2 Objective

Given a set of anatomic structures that are directly injured by a projectile, we want to create a reasoning application that deduces secondary injuries of two types: (1) regions of myocardium that will be ischemic if a coronary artery is injured, and (2) propagation of injury as bleeding occurs into damaged anatomic compartments that surround the heart. We modeled these tasks as classification problems. We describe our approach to creating reasoning services that fulfill the above desiderata using OWL. In this work we exploit the automated reasoning capability provided by OWL.

3.3.3 Reasoning about coronary artery ischemia

We created a reasoning service to infer the myocardial ischemic consequences of coronary artery injury (“Cardiac Ischemia Reasoner”). This service relied on class-based reasoning, i.e. we modeled the query as a set of defined classes, and we checked which anatomical entities were inferred to be subclasses of these defined classes.

3.3.3.1 Modeling blood vessels

We added necessary and sufficient conditions to classes in our base OWL ontology of anatomy to encode the dependency of downstream arterial branches on the upstream arteries, and to represent the regions of the heart myocardium supplied by the coronary artery branches (Figure 3.7 on the following page). For example, we represented the composition of the coronary arteries using the hasSegment relation, and the tree-like structure of the blood vessels with the isContinuousWithOutputOf relation. We created a new primitive class SeveredBloodVessel as a subclass of BloodVessel. This class serves as an input for the reasoning service: when the geometric analysis detects that a primary injury involves a blood vessel, we declare that the corresponding class is a subclass of SeveredBloodVessel (therefore it remains an indirect subclass of BloodVessel).

SeveredBloodVessel ⊑ BloodVessel

We created a new defined class FunctionallyImpairedBloodVessel to infer that all the blood vessels downstream of a severed blood vessel are also affected. Figure 3.8 on the next page shows that after declaring that the second segment of the right coronary artery was injured, all its downstream branches are inferred to be functionally impaired.

FunctionallyImpairedBloodVessel ≡ SeveredBloodVessel ⊔ (∃isContinuousWithOutputOf.FunctionallyImpairedBloodVessel)

Figure 3.7: Schema of the coronary arteries and their branches, and cast of the coronary arteries (yellow: right coronary artery; red: left coronary artery) showing the regions of the myocardium they provide blood to. Left schema is from http://en.wikipedia.org/wiki/File:Coronary_arteries.svg under the CC-BY-SA license. Right image is from http://en.wikipedia.org/wiki/File:Coronary_Arteries.tif and is in the public domain.

Figure 3.8: Inference that all the downstream branches of the second segment of the right coronary artery are functionally impaired after it has been injured.

3.3.3.2 Describing blood supply to organs To represent the coronary arteries that supply the lateral part of the wall of the left ventricle, we added restrictions to the class LateralPartOfWallOfLeftVentricle that specify values for the isSuppliedBy property, such as LeftCircumflexArtery (Figure 3.9). Note the closure axiom indicating that the left circumflex artery, the ramus intermedius and the diagonal branch of the left coronary artery are the only blood vessels supplying the lateral part of the wall of the left ventricle.

Figure 3.9: OWL Ontology of coronary anatomy and regional myocardial perfusion. Classes of anatomic structures are shown in the left panel, and logical definitions of the concepts are on the right. The class LateralPartOfWallOfLeftVentricle contains six restrictions representing the necessary conditions for this class. Some of these assertions specify the coronary arterial branches that supply this structure.

3.3.3.3 Modeling ischemia

We created a defined class IschemicAnatomicalEntity as an AnatomicalEntity that is supplied by at least one functionally impaired blood vessel.

IschemicAnatomicalEntity ≡ AnatomicalEntity ⊓ (∃isSuppliedBy.FunctionallyImpairedBloodVessel)

An organ may be supplied by more than one artery, in which case damage to one of the feeding arteries will cause partial (not complete) impairment of blood flow to the organ. To represent these types of ischemia, we refined IschemicAnatomicalEntity into two defined subclasses, IschemicAnatomicalEntityPartially and IschemicAnatomicalEntityTotally. Figure 3.10 on the next page shows the inferred ischemic anatomical entities after the second segment of the right coronary artery has been severed. The posterior wall of the left ventricle is partially ischemic because it is also supplied by the left coronary artery. Note that two anatomical entities (right atrium and posterior wall of the right ventricle) are correctly inferred to be ischemic, but the system could not determine whether they were partially or totally ischemic. For these two anatomical entities, we had omitted to specify a closure axiom (cf. Figure 3.9) on purpose, to demonstrate why closure is important when using open-world reasoning (more on this in chapter 4).

IschemicAnatomicalEntityPartially ≡ IschemicAnatomicalEntity ⊓ (∃isSuppliedBy.FunctionallyNonImpairedBloodVessel)

IschemicAnatomicalEntityTotally ≡ IschemicAnatomicalEntity ⊓ (∀isSuppliedBy.FunctionallyImpairedBloodVessel)

Figure 3.10: Cardiac Ischemia OWL ontology updated with the knowledge that the second segment of the right coronary artery has been injured. After automatic classification, particular anatomic classes (circled) are reclassified, suggesting the ischemic regions of myocardium that occur as a consequence of the right coronary artery injury.
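The classification performed by the reasoner can be emulated procedurally, which may help clarify the intended inferences. The following sketch is illustrative only (simplified vessel and region names, hypothetical supply lists): impairment propagates downstream along isContinuousWithOutputOf, and a region is totally ischemic when all the vessels in its (closed) supply list are impaired, partially ischemic otherwise.

```python
# Hedged sketch emulating the two classification steps: propagate
# impairment downstream, then classify each region as partially or
# totally ischemic. Names are simplified stand-ins for the FMA classes.

downstream = {  # vessel -> vessels whose input it provides
    "RCA_segment2": ["RCA_segment3"],
    "RCA_segment3": ["PosteriorDescendingArtery"],
}

supplied_by = {  # region -> complete (closed) list of supplying vessels
    "PosteriorWallOfLeftVentricle": ["PosteriorDescendingArtery",
                                     "LeftCircumflexArtery"],
    "PosteriorWallOfRightVentricle": ["PosteriorDescendingArtery"],
}

def impaired(severed):
    """All vessels functionally impaired after `severed` is injured."""
    result, stack = set(), [severed]
    while stack:
        v = stack.pop()
        if v not in result:
            result.add(v)
            stack.extend(downstream.get(v, ()))
    return result

def classify_ischemia(severed):
    """Map each affected region to 'partial' or 'total' ischemia."""
    bad = impaired(severed)
    out = {}
    for region, vessels in supplied_by.items():
        hit = [v for v in vessels if v in bad]
        if hit:
            out[region] = "total" if len(hit) == len(vessels) else "partial"
    return out
```

Note that the complete supply lists play the role of the closure axioms: without them, the procedure (like the DL reasoner) could not conclude total ischemia.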

3.3.4 Reasoning about pericardial effusion We created a second reasoning service to infer the cavities affected by bleeding after an injury (“Injury Propagation Reasoner”). The heart is surrounded by two membranes, the pericardium and the pleura, which determine two cavities enclosed inside each other (the heart is enclosed in the pericardial cavity, which is enclosed in the pleural cavity) and are normally filled with serous fluid. The (abnormal) presence of blood in these cavities is known as hemopericardium and hemothorax, respectively (Figure 3.11 on the next page). In certain cases, the increased pressure can limit hemorrhage. This service relied on instance-based reasoning, i.e. we modeled the patient's condition by creating instances of all the relevant anatomical structures and by linking these instances with the appropriate relations. We modeled the query as a set of defined classes, and we checked which anatomical entities were inferred to be instances of these defined classes. We first indicated in the ontology that blood vessels and cardiac cavities (the left and right atria and ventricles) are filled with blood (Figure 3.12 on the facing page). In order to represent a perforation in the wall of the heart, we created an instance of the class AddedConduit, and added values to the continuousWith property to describe that this conduit

Figure 3.11: Blood loss through a punctured cardiac membrane fills the pericardial cavity. This can lead to a massive hemorrhage or, on the contrary, provide some temporary containment. Schema from http://en.wikipedia.org/wiki/File:Blausen_0164_CardiacTamponade_02.png under CC-BY license.

Figure 3.12: Blood vessels and cardiac cavities are filled with blood (and this is normal).

connects the cavity of the left ventricle and the pericardial space (Figure 3.13). The continuousWith property represents spatial continuity between adjacent hollow anatomic structures that have been injured. It is symmetric and transitive. These two property characteristics are needed to infer that, given a perforation in the wall of the left ventricle (HoleInWallOfHeart) and pericardium (HoleInPericardium) creating conduits that connect the surrounding cavities, the conduits, pericardial cavity, and pleural cavity will be in continuity with the cavity of the left ventricle (Figure 3.14).

Figure 3.13: Knowledge representation in OWL of a hole in the heart wall. An instance of the AddedConduit class is created, having values of the continuousWith property specifying the anatomic compartments connected by this conduit.

Figure 3.14: Inferred knowledge after asserting a cardiac injury comprising a hole in the left ventricle and classifying the Injury Propagation OWL ontology. The pericardial cavity and pleural cavity are inferred to be in continuity with the left ventricle.

We created a defined class BloodFlow (Figure 3.15 on the next page). The necessary and sufficient condition of the BloodFlow class defines that any anatomical cavity continuous with something filled with blood is an instance of BloodFlow. The necessary condition indicates that if an individual is an instance of BloodFlow, then it is itself filled with blood. The combination of these two conditions models the rule “if a cavity is continuous with something filled with blood, then it is itself filled with blood”. In order to check that the reasoning was correct, we

created a defined probe class to retrieve the anatomical entities filled with blood (Figure 3.16). For the cardiac cavities, being filled with blood is normal, whereas for the pericardial and the pleural cavities, this is abnormal. In order to detect cavities abnormally filled with blood, we defined the class Hemopericardium as a pericardial cavity that happens to be filled with blood, and we made it a subclass of the things abnormally filled with blood (Figure 3.17 on the next page). We repeated the process with the pleural cavity and created a Hemothorax class. Note that we used the same rule pattern as with BloodFlow. Finally, we created the class AnatomicalConceptWithEctopicBlood to retrieve the places abnormally filled with blood (Figure 3.18 on the following page).

Figure 3.15: The BloodFlow class uses a necessary and sufficient definition and a necessary condition to model the rule “if a cavity is continuous with something filled with blood, then it is itself filled with blood”.

Figure 3.16: A probe class indicates the anatomical entities filled with blood. Note that the system inferred correctly that after the injury the pericardial and the pleural cavities are filled with blood.

Figure 3.17: Hemopericardium is defined as a pericardial cavity filled with blood. The necessary condition ensures that any instance of this class is then inferred to be abnormally filled with blood.

Figure 3.18: The AnatomicalConceptWithEctopicBlood class retrieves the anatomical entities abnormally filled with blood.
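The instance-based inferences of this section can likewise be emulated procedurally. The following illustrative sketch (hypothetical instance names) computes the symmetric and transitive closure of continuousWith starting from the spaces normally filled with blood, then intersects the result with the normally empty cavities to retrieve the ectopic blood locations:

```python
# Hedged sketch: blood propagates along continuousWith (symmetric and
# transitive) from any space filled with blood. Instance names are
# illustrative stand-ins for the individuals created in the ontology.

continuous_with = [  # asserted values, before symmetric/transitive closure
    ("CavityOfLeftVentricle", "HoleInWallOfHeart"),
    ("HoleInWallOfHeart", "PericardialCavity"),
    ("HoleInPericardium", "PericardialCavity"),
    ("HoleInPericardium", "PleuralCavity"),
]
normally_filled_with_blood = {"CavityOfLeftVentricle"}
normally_empty = {"PericardialCavity", "PleuralCavity"}

def filled_with_blood():
    """Everything connected (symmetrically, transitively) to a blood source."""
    adjacency = {}
    for a, b in continuous_with:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)  # symmetry
    filled = set(normally_filled_with_blood)
    stack = list(filled)
    while stack:  # transitivity via graph traversal
        x = stack.pop()
        for y in adjacency.get(x, ()):
            if y not in filled:
                filled.add(y)
                stack.append(y)
    return filled

# hemopericardium and hemothorax correspond to this intersection
ectopic_blood = filled_with_blood() & normally_empty
```

This mirrors the BloodFlow rule pattern: membership propagates along the relation, and the probe classes simply intersect the inferred set with the normal/abnormal partition.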

3.4 Optimization: modeling strategies for estimating pacemaker alerts severity

This study focuses on the determination of the best modeling strategy (in terms of correctness and performance) for predicting the severity level of a pacemaker alert. Contrary to the previous studies, this application potentially involved processing a large number of alerts, so performance became important. The previous section on the Virtual Soldier project already demonstrated that some problems can be modeled with class-based or instance-based OWL reasoning, but provided no hint on whether one of them is better than the other. It is not even clear that statements such as “class-based reasoning is faster than instance-based reasoning” actually make any sense, as the outcome may depend on the problem at hand. This empirical study went one step further by considering SWRL rules in addition to OWL, and by systematically exploring all the combinations of modeling strategies. The results showed that both OWL and SWRL-based ontology modeling techniques can reliably perform the reasoning necessary to propose a severity level associated with pacemaker alerts. The best performance was obtained not by using OWL or SWRL exclusively, but by combining their respective advantages, using OWL to reduce the number of SWRL rules and to make them simpler. In retrospect, this work is interesting because it was the first hands-on systematic evaluation of modeling strategies. While applications usually use SPARQL, OWL or SWRL exclusively, optimizing performance suggests combining their respective strengths. By extrapolation, this suggests a more formal approach of designing complex reasoning tasks as workflows instead of monolithic entities, which makes it possible to choose the best technology for each module and promotes reuse.

This study was a contribution to the Akenaton7 project (ANR-07-TECS-0001). It was originally published in: Olivier Dameron, Pascal van Hille, Lynda Temal, Arnaud Rosier, Louise Deléger, Cyril Grouin, Pierre Zweigenbaum, and Anita Burgun. Comparison of OWL and SWRL-based ontology modeling strategies for the determination of pacemaker alerts severity. In Proceedings of the American Medical Informatics Association Conference AMIA, page 284, 2011 [92], where it was shortlisted for the best article award. It was later extended in: Pascal van Hille, Julie Jacques, Julien Taillard, Arnaud Rosier, David Delerue, Anita Burgun, and Olivier Dameron. Comparing Drools and ontology-based reasoning approaches for telecardiology decision support. Studies in health technology and informatics, 180:300–304, 2012 [93], where we also considered Drools rules. Drools rules are not capable of handling the granularity gap between precise patient data and more general criteria, so I do not present this later work here. We had to generate Drools rules that mimicked ontology-based reasoning. The conclusion was that the limitation of ontology-based reasoning was the reasoner's performance, whereas the limitations of Drools were the number and complexity of rules. This suggested using the ontology to automatically generate some of the Drools rules.

3.4.1 Context Patients suffering from heart failure are increasingly treated with implantable cardioverter defibrillators (ICD) and benefit from home monitoring [162]. In this context of telecardiology, ICDs send remote alerts about arrhythmic episodes to physicians, who have to determine their

7http://www.agence-nationale-recherche.fr/?Project=ANR-07-TECS-0001

emergency level and potentially take the required actions [163, 164]. Some automatic triage of the alerts according to their emergency level is instrumental to keep up efficiently with this overwhelming flow of alerts (from zero most of the time up to as many as twenty alerts per patient per day, with an estimated 500,000 new patients every year). However, this is an intrinsically difficult task because the risk associated with an alert depends on multiple interdependent factors such as the patient's medical history, his current pathologies and his current treatment [165]. For example, in case of atrial fibrillation (AF), the risk of thrombo-embolism is estimated by the CHA2DS2VASc score as well as by additional parameters that can either increase the risk (e.g., if the patient is a smoker or is obese) or lower it (e.g., if the patient is currently treated with an anticoagulant) [166, 167]. The goal of the AKENATON project is to improve ICD alert management by automatically associating each alert with a severity level [91]. This requires (i) to extract the relevant data from the alerts transmitted by ICDs and from the patient's clinical context, (ii) to integrate them, (iii) to reconcile them with the severity criteria, and (iv) to compute the alert severity. Extracting data relies on queries on the hospital patient database as well as on Natural Language Processing techniques for mining free text and structured documents. Integrating data and reconciling the granularity gap with more general severity criteria requires symbolic domain knowledge represented as ontologies (e.g., in order to automatically recognize that a patient suffering from a right iliac artery stenosis will match the vascular disease CHA2DS2VASc criterion). Deducing the alert severity from the various criteria is a combination of ontology-based and rule-based inferences. The ontology model plays a central role in several steps and ensures the general coherence.
It can be represented using different combinations of OWL [168] and SWRL [169]. Each combination implies specific modeling decisions which may have consequences on the performance of the system. Currently, no guideline exists to decide which strategy best fits our particular problem.
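The reconciliation of step (iii) above amounts to a subsumption test over domain knowledge. As a rough illustration (the hierarchy and the function below are invented for this sketch, not taken from the AKENATON ontology), walking a subclass relation is enough to match a precise diagnosis against a more general criterion:

```python
# Illustrative sketch only: the hierarchy and names below are invented,
# not taken from the AKENATON ontology.
SUBCLASS_OF = {
    "right iliac artery stenosis": "peripheral artery disease",
    "peripheral artery disease": "vascular disease",
    "diastolic heart failure": "congestive heart failure",
}

def matches_criterion(diagnosis, criterion):
    """True if `diagnosis` equals `criterion` or is one of its descendants."""
    current = diagnosis
    while current is not None:
        if current == criterion:
            return True
        current = SUBCLASS_OF.get(current)  # climb one subclass link
    return False

# A precise diagnosis satisfies the general "vascular disease" criterion:
print(matches_criterion("right iliac artery stenosis", "vascular disease"))  # → True
```

An actual OWL reasoner performs the same test over the full transitive subclass hierarchy, which is what makes the ontology indispensable here.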

3.4.2 Objective

This study focuses on determining the best ontology modeling strategy for integrating data and filling the granularity gap between the data and the CHA2DS2VASc score criteria for patients with an atrial fibrillation alert. First, we identified the CHA2DS2VASc criteria that potentially require reference to domain knowledge to be reconciled with patient data. Second, we identified ten modeling strategies covering all the possible combinations of Java, OWL-DL and SWRL. For each strategy, we assessed the modeling effort by counting the number of OWL classes, properties and SWRL rules. Third, we validated each strategy by verifying that it computed the correct score for all of the 192 possible combinations of criteria. Fourth, we compared the performances of the ten strategies by measuring the computation time for each of the 192 cases of the validation set. Fifth, we evaluated all the strategies over a corpus of 62 actual patients by repeating steps three and four.

3.4.3 CHA2DS2VASc score

CHA2DS2VASc is a new recommendation of the European Society of Cardiology to determine stroke risk for patients with non-valvular fibrillation [166, 167, 170]. The higher the CHA2DS2VASc score, the higher the risk of thrombo-embolism [166, 167]. It is a major determinant for deciding whether or not an anticoagulation therapy is required in order to prevent a potential stroke caused by stasis of blood in the heart, which may lead to the formation of a thrombus that can dislodge into the blood flow. A CHA2DS2VASc score of zero is associated with a low risk, a score of one is associated with an intermediate risk, and a score of two or

more is associated with a high risk [167]. Table 3.1 presents the criteria required to compute the CHA2DS2VASc score. Some of them, such as age or sex category, can be computed directly from the patient's administrative data. Medical criteria such as congestive heart failure or vascular disease are more general. They encompass several diseases and are therefore unlikely to be present as such in the patient's data. Reconciling the patient's data with the CHA2DS2VASc criteria consists in interpreting the data according to some domain-specific knowledge, typically represented in ontologies.

Criterion                                                 Points
Congestive heart failure / left ventricular dysfunction     1
Hypertension                                                1
Age ≥ 75 y.o.                                               2
Diabetes mellitus                                           1
Stroke / transient ischemic attack / thromboembolism        2
Vascular disease                                            1
65 ≤ Age < 75                                               1
Sex category (i.e., female gender)                          1
Total                                               0 ≤ score ≤ 9

Table 3.1: CHA2DS2VASc score criteria (from [167]).
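The additive logic of Table 3.1 can be sketched as follows (a plain transcription of the table for illustration; it is not the AKENATON implementation, which derives the criteria values through ontology-based reasoning):

```python
# Plain transcription of Table 3.1 (illustrative sketch, not the AKENATON code).
def cha2ds2vasc(age, female, chf=False, hypertension=False, diabetes=False,
                stroke=False, vascular_disease=False):
    score = 0
    score += 1 if chf else 0            # congestive heart failure / LV dysfunction
    score += 1 if hypertension else 0
    score += 2 if age >= 75 else (1 if 65 <= age < 75 else 0)
    score += 1 if diabetes else 0
    score += 2 if stroke else 0         # stroke / TIA / thromboembolism
    score += 1 if vascular_disease else 0
    score += 1 if female else 0         # sex category
    return score

print(cha2ds2vasc(age=78, female=True, hypertension=True))  # → 4
```

The total is bounded by 9, as in the table: at most 2 points for age, plus the seven other criteria.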

Computing the value of each criterion and adding these values into the global CHA2DS2VASc score can both be achieved with different combinations of Java, OWL-DL 2.0 and SWRL. First, accessing the value of each criterion associated with a patient requires following several properties (typically dolce:has-quality from the patient to the CHA2DS2VASc score, then dolce:has-quale from the score to each criterion, then has-integer-value from each criterion to its value). Following the properties can be done either explicitly, or by using OWL-DL property chains. Second, some criteria values such as gender or the age thresholds can either be computed by a rather simple Java function, or using the ontology, which in turn can be achieved using OWL-DL features (e.g., a necessary and sufficient condition on a datatype property for the age) or SWRL (using built-ins). Third, adding the criteria values to compute the global CHA2DS2VASc score can also be performed by a Java function or by the swrlb:add() SWRL built-in. A choice for one of the three previous steps may exclude some choices in another step. For example, using Java to compute the age and gender criteria values only makes sense if the addition of the eight criteria values is itself done in Java. We systematically combined the Java, OWL and SWRL features for the three previous steps and derived ten possible strategies (cf. section 3.4.4 on the following page). For each CHA2DS2VASc criterion, we manually determined whether reference to domain knowledge was necessary to reconcile the granularity gap with patient data (Table 3.2 on the next page). We generated a validation set of 192 dummy patients representing all the combinations of values for the CHA2DS2VASc criteria. We validated each strategy by having it compute the CHA2DS2VASc score of each patient and comparing it to the solution.
The solution is straightforward to compute separately by a program because when creating each patient from the validation set, the value of each criterion is known. For each strategy, we also measured the number of OWL classes, properties and SWRL rules used, as well as the CPU usage for each dummy patient. For each strategy, we measured the computing time on the evaluation set 50 times. Measurements were performed on a Dell Precision T3400 workstation with an Intel Core 2 Quad Q6600 64-bit 2.4 GHz processor, 4 GB RAM and a Hitachi Ultrastar disk

Criterion                  Ontology required   Data example
Congestive heart failure   yes                 "Diastolic heart failure", subclass of "Congestive heart failure" (patient 52)
Hypertension               yes
Age                        no
Diabetes mellitus          yes                 "Type-2 diabetes", subclass of "Diabetes mellitus" (patient 72)
Stroke                     yes                 "Ischemic stroke", subclass of "Stroke" (patient 57)
Vascular disease           yes                 "Lower extremity occlusive peripheral heart disease", subclass of "Peripheral artery disease", which is in turn subclass of "Vascular disease" (patient 15)
65 ≤ Age < 75              no
Sex category               no

Table 3.2: Ontology requirement for computing the value of CHA2DS2VASc criteria. Patient numbers in the "Example" column refer to the evaluation set patients.

15K300 (300 GB, 15,000 RPM). The test program was developed in Java on a Linux Ubuntu 11.04 (kernel 2.6.35-24) machine with OpenJDK 1.6.0_20, OWL API 3.2.0⁸ and the Pellet reasoner⁹ API 2.2.1 [171]. We generated an evaluation set of patients implanted with an ICD and having an atrial fibrillation alert from the Paradym cohort¹⁰. Out of 74 patients, we selected the 62 patients having at least one document. We automatically calculated their CHA2DS2VASc score and recorded the performances of the ten strategies. We repeated the measurement 50 times. A physician (AB) manually calculated the reference CHA2DS2VASc score for each patient, and the result was double-checked by a cardiologist electrophysiologist (AR) for complex cases.

3.4.4 Modeling strategies

The systematic combination of Java, OWL and SWRL features to (i) access the value of each criterion, (ii) determine the value of some of the criteria and (iii) add the criteria values resulted in ten possible modeling strategies. Figure 3.19 on the facing page presents the decision tree explaining how each strategy was derived. Some of the strategies can be constructed by adding defined classes or SWRL rules to other strategies (cf. original article [92] for details). Figure 3.20 on page 78 illustrates the dependencies between the various strategies.
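Under our reading of the decision tree, the ten strategies can be recovered by a systematic enumeration (illustrative sketch; the (total score, criteria evaluation, property chains) labels follow Figure 3.19, e.g. (j,o,p) means Java total score, OWL criteria evaluation, with property chains):

```python
# Sketch reconstructing the combination logic behind Figure 3.19
# (our reading of it, for illustration only).
strategies = []
for total in ("j", "s"):                 # total score: Java or SWRL
    # Java criteria evaluation only makes sense if the total is in Java too
    criteria_options = ("j", "o", "s") if total == "j" else ("o", "s")
    for criteria in criteria_options:
        for chains in ("p", "-"):        # with / without OWL property chains
            strategies.append((total, criteria, chains))

print(len(strategies))  # → 10
```

The constraint that excludes Java criteria evaluation under a SWRL total score is what reduces the 2 × 3 × 2 = 12 naive combinations to ten.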

Table 3.3 on the facing page illustrates the complexity of the 10 strategies by presenting their numbers of classes, properties and SWRL rules.

8 http://owlapi.sourceforge.net/
9 http://clarkparsia.com/pellet
10 http://clinicaltrials.gov/ct2/show/NCT01169246

(Figure 3.19 content: a decision tree that first branches on whether the CHA2DS2VASc total score is computed in Java or in SWRL, then on whether the age and gender criteria are evaluated in Java, OWL or SWRL, then on whether property chains are used, yielding: Str 6 (j,j,p), Str 5 (j,j,-), Str 4 (j,o,p), Str 10 (j,o,-), Str 9 (j,s,p), Str 3 (j,s,-), Str 2 (s,o,p), Str 8 (s,o,-), Str 7 (s,s,p), Str 1 (s,s,-).)

Figure 3.19: Decision tree combining the 10 possible strategies for modeling the CHA2DS2VASc criteria.

Strategy   Classes   Obj. prop.   Datatype prop.   SWRL rules
1          0         0            3                12
2          2         1            0                9
3          0         0            0                8
4          2         1            0                8
5          0         0            0                5
6          0         1            0                5
7          0         1            0                9
8          2         0            3                12
9          0         1            0                8
10         2         0            0                8

Table 3.3: Complexity of the ten strategies in terms of number of additional classes, properties and SWRL rules over the base patient.owl model.

3.4.5 Comparison of the strategies' performances

On the validation set, all ten strategies computed the correct CHA2DS2VASc score for each of the 192 possible combinations of criteria values. For each strategy, we computed the total time to process the 192 patients from the validation set. We then divided this total by 192 to determine the average computation time per patient so that it can be compared with the measures on the evaluation set, which is smaller. We repeated the operation 50 times. Figure 3.21 on page 79 shows the average computation time of a patient's CHA2DS2VASc score for each strategy. On the evaluation set, all ten strategies computed the correct CHA2DS2VASc score for each of the 62 patients. Figure 3.22 on page 79 shows the average computation time of a patient's CHA2DS2VASc score for each strategy. Figure 3.23 on page 80 compares the average performance of each modeling strategy over the validation and evaluation sets. All strategies computed the correct CHA2DS2VASc score for all the patients from the validation and evaluation sets. This demonstrates that both OWL- and SWRL-based ontology modeling techniques can reliably perform the reasoning necessary to propose a severity level associated with ICD alerts.

(Figure 3.20 content: a dependency graph in which each strategy extends either the common patient.owl model or another strategy by adding OWL defined classes or SWRL rules; legend: B extends A.)

Figure 3.20: Dependencies between the common model (patient.owl) and the various strategies.

Table 3.3 on the preceding page and Figure 3.23 on page 80 show that the number of classes, properties and rules is not a good predictor of a strategy's performance, which made this systematic study relevant.

The comparison of the ten strategies showed that the best performances were obtained neither by using exclusively OWL nor exclusively SWRL, but by combining their respective advantages, using OWL to reduce the number of SWRL rules and to make them simpler. Figure 3.23 on page 80 shows that the ranking of the strategies according to their performance is identical on the validation and evaluation sets. For each strategy, the performance on the validation set was always better than on the evaluation set. This can be explained by the fact that there was no granularity-related reasoning involved in the validation set, whereas some patients from the evaluation set were associated with data more precise than the CHA2DS2VASc criteria. For example, patient 72 from the evaluation set had type 2 diabetes; a similar patient in the validation set would have been described as having diabetes. Another factor explaining the difference could be that the distributions of patients for each CHA2DS2VASc score were different in the validation and evaluation sets (cf. original article).

The modeling approach presented in this article potentially under-estimates a patient's CHA2DS2VASc score. If no information concerning a criterion is available, the criterion is assigned the value 0. We could have used a dual approach that over-estimates the score by assigning an initial value of 1 or 2 to each criterion and then setting it to 0 when there is some evidence that the criterion is not met. However, this would have had several drawbacks. First, clinical records typically mention what the patient has, and seldom mention what he has not, so except for the age and gender criteria, this lack of explicit information would result in assuming that almost all the criteria are met for all the patients. Second, the risk of thrombo-embolism is low for a CHA2DS2VASc score of 0, moderate for a score of 1 and high for a score between 2 and 9. The previous point would then result in a large number of false positives, with almost all the patients being associated with a high risk. Ultimately, combining the two approaches would provide an interval of validity for the CHA2DS2VASc score.
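The combination of the two estimates can be made concrete with three-valued criteria (a hedged sketch of the interval idea, with an invented encoding where True/False are known values and None stands for "unknown"; the age categories are assumed to be given consistently):

```python
# Illustrative sketch of a CHA2DS2VASc validity interval with three-valued
# criteria: True / False are known, None is unknown. Encoding is invented.
POINTS = {"chf": 1, "hypertension": 1, "age>=75": 2, "diabetes": 1,
          "stroke": 2, "vascular": 1, "65<=age<75": 1, "female": 1}

def score_interval(criteria):
    """Lower bound counts known-true criteria; upper bound adds unknown ones."""
    lower = sum(POINTS[c] for c, v in criteria.items() if v is True)
    upper = lower + sum(POINTS[c] for c, v in criteria.items() if v is None)
    return lower, upper

# Hypertension known, diabetes unknown: the true score lies in [1, 2].
print(score_interval({"hypertension": True, "diabetes": None, "chf": False}))  # → (1, 2)
```

The under-estimating approach of the article corresponds to reporting only the lower bound; the dual approach corresponds to the upper bound.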


Figure 3.21: Boxplots representing the average computation time of the CHA2DS2VASc score of patients from the validation set. The boxplots were generated by repeating the measure 50 times.


Figure 3.22: Boxplots representing the average computation time of the CHA2DS2VASc score of patients from the evaluation set. The boxplots were generated by repeating the measure 50 times.


Figure 3.23: Average computation time of the CHA2DS2VASc score of the patients from the validation and evaluation sets.

3.5 Synthesis

What the three studies have in common

In all three studies, we found that the lack of semantically-rich ontologies was a major limitation for building applications that involve reasoning more complex than retrieving the ancestors or the descendants of a class. Even if most biomedical ontologies from repositories such as BioPortal are available in OWL format, they are basically mere RDFS taxonomies. Few of them contain disjointness, existential or universal constraints, or even (correctly) defined classes, as we will see in section 4.2.3. This is all the more regrettable given that all three studies also repeatedly demonstrated that once semantically-rich ontologies are available, developing the application-specific part of the reasoning only requires the addition or the modification of a limited number of classes and can be done in a matter of hours.

Representation of anatomy

The effort for converting the FMA into OWL was part of a subsequent larger research effort involving several teams [156, 158, 172]. Overall, this experiment prompted several observations. First, it confirmed that the OWL language was well designed, as all its constructs found an application for modeling anatomy. Second, and somewhat paradoxically, it showed that the additional effort required for using more expressive constraints contributed to making the job easier, as it supported the use of design patterns and integrity constraints. Third, new initiatives such as SPIN and ShEx offer new research perspectives for representing these constraints and using them during ontology development and maintenance.

Diagnosis of injuries

Using the anatomy ontology to perform some automatic inference of indirect injuries showed that, provided semantically-rich ontologies are available, developing some application-specific reasoning can be quite simple. It was also a first attempt at comparing class-based and instance-based reasoning for a more principled modeling approach. It showed that both worked equally well. Instance-based reasoning was more appealing from a modeling point of view but was also more difficult to implement whenever closure axioms were required. This work is also interesting because it marked the transition to reasoning with incomplete information, which will be covered in chapter 4.

Modeling strategies of pacemaker alerts severity

This study compared OWL- and SWRL-based reasoning for classifying pacemaker alerts. It showed that the optimal modeling strategy combined features of OWL and of SWRL. Nowadays, this study should also cover SPARQL and SPIN¹¹. As we mentioned, our approach potentially under-estimates a patient's CHA2DS2VASc score. Conversely, assuming the worst-case scenario and decrementing the CHA2DS2VASc score when we find evidence that a criterion is not met would over-estimate the score. Combining both approaches would provide a confidence interval. This typically raises the problem of reasoning with incomplete information, which will be further discussed in chapter 4.

What we learned

• Formalizing knowledge is difficult.

• It is useful for maintaining large ontologies; we have seen it for the FMA, and the Gene Ontology Next Generation is another example.

• It is useful for performing rich queries [173].

This lack of available semantically-rich ontologies can be partially explained by the fact that, to a certain extent, ontology editing tools such as Protégé or TopBraid and the associated tutorials have succeeded too well. Domain experts are able to create and maintain their ontologies (which is good) with a minimal understanding of knowledge representation principles and of Semantic Web technologies. The latter point would not be so bad if people from the knowledge representation community (myself included) had not been less and less involved in ontology development over the last decade. I observed this trend in all the research teams I have been involved in since my PhD (included), as well as more recently when I contributed to the development of the ATOL livestock ontology [174, 175, 176]. Contributing to ontology development is highly time consuming, and extremely difficult to convert into high impact publications. Moreover, the large part of craftsmanship and informatics skills involved in the design patterns, the consistency constraints and grasping DL are understandably perceived as too difficult by the domain experts. This is strikingly the case for the Gene Ontology consortium, which for years has kept on using its idiosyncratic OBO formalism, suboptimal in terms of interoperability, expressivity, maintenance and reasoning [177], and which chose to ignore efforts such as the OWL-based Gene Ontology Next Generation [52]. I have the feeling that when ontologies gained a mainstream status in life science in the second half of the 2000s (cf. Figure 1.1 on page 13), the knowledge representation community failed to create a pool of "knowledge representation engineers" who would have been recognized as key partners and could have taken over when the "knowledge representation researchers" gradually shifted their research efforts. Now these engineers are sorely missed and, even if their input is valuable, few perceive it as such.
The data deluge may be our next opportunity to patch things up, if we (as researchers) both succeed in developing in time sound data management plans for E-Science (cf. my research perspectives in section 6.1.1.2), and in avoiding the previous mistake by having these plans adopted by the life science community. Ontologies will probably be key components for handling the large quantities of data and metadata. I am eager to see whether these will be mere taxonomies or semantically-rich ontologies.

11http://spinrdf.org/

Chapter 4

Reasoning with incomplete information

Outline

This chapter elaborates on situations we have encountered earlier where an incomplete description led to imprecise or biased results. In the Virtual Soldier project, we had intentionally failed to provide a closure axiom for some anatomical entities in order to demonstrate this point (cf. section 3.3.3.3 on page 67). These entities were correctly inferred to be ischemic when necessary, but the system could not decide whether they were partially or totally ischemic. The Virtual Soldier project demonstrated that the reasoning system can gracefully handle missing information by using more general superclasses as a kind of "degraded mode". Subsequent works went one step further and sought to restrict the set of solutions by focusing on the things that can be inferred to be false when no conclusion can be reached concerning the things that can be inferred to be true. When the description of the world is exhaustive, the reasoner will always be able to recognize the situations where a condition is satisfied. When the description of the world is incomplete, however, we may not be able to distinguish between the situations where we do not know whether the condition is true, and the situations where we know that the condition is not true. Therefore, the general idea was to automatically generate negated versions of the conditions of interest, with necessary and sufficient definitions referring to the original condition, and then to check whether data are classified as instances or as subclasses of the conditions or of their negated versions. Section 4.2 shows how this principle was first applied to the problem of grading tumors. A tumor grade is either 1, 2, 3 or 4, so each grade was a distinct condition. In case of incomplete information, we may not be able to reach a conclusion as to which grade applies, so none of them is proposed.
However, even if it is incomplete, the available information may be sufficient to rule out some of the grades, and we can therefore narrow down the possible solutions. With tumor grades, the only constraints were that the conditions are mutually exclusive and that each tumor has a grade (even if we do not know which one). Section 4.3 focuses on determining the clinical trials a patient may be eligible for. In this case, the conditions were the trials' eligibility criteria, and a patient's eligibility is defined by a conjunction of criteria or their negation, so the overall outcome is more complex to infer. We showed that not taking incomplete information into account leads to over-estimating patient rejection, and we proposed a design pattern for modeling clinical trials that addresses this issue.
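The rule-out principle can be rendered as a toy example (the data structures below are invented for illustration; the actual work expresses the negated conditions as OWL class definitions and lets a reasoner do the classification):

```python
# Toy rendition of the rule-out principle; representation invented for
# illustration (the actual work uses OWL class negation and a reasoner).
def compatible(condition, observations):
    """False only if a known observation contradicts the condition."""
    return all(observations.get(feature) in (None, expected)
               for feature, expected in condition.items())

CONDITIONS = {
    "grade_III": {"necrosis": False, "mitotic_activity": "marked"},
    "grade_IV": {"necrosis": True, "mitotic_activity": "high"},
}

# Necrosis is known to be absent, everything else is unknown: grade IV is
# ruled out, grade III remains possible (but is not proven).
obs = {"necrosis": False}
print([g for g, c in CONDITIONS.items() if compatible(c, obs)])  # → ['grade_III']
```

Even though no condition can be positively concluded, the incomplete description still shrinks the set of candidates, which is precisely what the negated class definitions achieve.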

4.1 Principle

As we have seen in section 3.1, OWL constraints must hold for any interpretation function. Therefore, during knowledge modeling, one not only has to specify the constraints that must be met (e.g. a hand has to have a thumb, an index, etc.), but also the relations that cannot exist (e.g. a heart can never be a part of a hand). As we have seen in section 3.2.3.4, this is typically done with closure constraints stating that "the only possible values for this property must be instances of these classes". Although the open world assumption imposes an additional burden, it has two major benefits over the closed world assumption [28]. First, it supports a finer description, with the possibility to distinguish mandatory values from optional values. For example, an individual hand may not have a thumb in case of amputation or of abnormality, or it may even have additional fingers in case of polydactyly, but it has to have exactly one palm. Second, as we will see throughout this chapter, it supports correct reasoning even if the domain is not described exhaustively.
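The difference between the two assumptions can be boiled down to a toy query (this is only a caricature of the semantics, with invented triples, not full OWL reasoning): under the closed world assumption an unasserted fact is false, whereas under the open world assumption it is merely unknown.

```python
# Caricature of CWA vs OWA on a single asserted fact (triples invented).
FACTS = {("hand1", "hasPart", "palm1")}

def holds_cwa(triple):
    return triple in FACTS                    # absent => assumed false

def holds_owa(triple):
    return True if triple in FACTS else None  # absent => unknown

query = ("hand1", "hasPart", "thumb1")
print(holds_cwa(query), holds_owa(query))  # → False None
```

The None answer is what allows a reasoner to keep an amputated hand consistent with the model, instead of wrongly concluding that no hand has a thumb.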

4.2 Methodology: grading tumors

This study focuses on determining the grade of brain tumors. It was motivated by the need to reuse the patient data from one study in order to compare them with data from another study, the two studies relying on slightly different grading systems. We applied the classification techniques seen in the previous chapter to automate the process. However, as often, the patient data were sometimes incomplete, which would introduce a bias with closed world reasoning, and left OWL-based reasoning unable to propose any grade for these patients. We proposed a method based on the logical negation of each tumor grade in order to determine the grades that were incompatible with what we knew of the patients, thus narrowing the set of possible grades. In retrospect, this work is interesting because, in addition to being another example of how semantically-rich ontologies can be reused by applications, it shows that semantically-rich ontologies can gracefully handle incomplete data, which are ubiquitous in life sciences.

This study was originally published in: Gwenaëlle Marquet, Olivier Dameron, Stephan Saikali, Jean Mosser, and Anita Burgun. Grading glioma tumors using OWL-DL and NCI thesaurus. In Proceedings of the American Medical Informatics Association Conference AMIA'07, pages 508–512, 2007 [79]. It elaborates on a previous work on grading lung tumors: Olivier Dameron, Élodie Roques, Daniel L. Rubin, Gwenaëlle Marquet, and Anita Burgun. Grading lung tumors using OWL-DL based reasoning. In 9th International Protégé Conference, 2006 [78].

4.2.1 Context

Brain tumors represent 2.4 percent of all cancer deaths. Among tumor variables, tumor grade and histology appear to have the greatest effect on survival. Glioblastoma, with a median survival shorter than twelve months, is a highly malignant (grade IV) glioma, which has the propensity to infiltrate throughout the brain, in contrast to pilocytic astrocytoma of the posterior fossa, which does not spread and can be cured by surgery [178]. Traditionally, the grading (classification) of a tumor is determined by the evaluation of tumor characteristics by a pathologist. The process of determining the grade of a tumor consists in checking whether it meets a set of requirements. There are numerous systems for grading glioma tumors. The reference

grading system is the World Health Organization (WHO) grading system [179]. This system assigns a grade from 1 to 4 to gliomas, grade 1 being the least aggressive and grade 4 being the most aggressive. This classification is based on five histopathology criteria that are related to the degree of anaplasia: cellular density, nuclear atypia, mitosis, endothelial proliferation and necrosis. The WHO malignant grades are described as follows:

• WHO Grade IV: cellular density high, nuclear atypia marked, high mitotic activity, necrosis present, endothelial proliferation present.

• WHO Grade III: cellular density increased, distinct nuclear atypia, mitotic activity marked, necrosis absent, endothelial proliferation absent.

• WHO Grade II: cellular density moderately increased, occasional nuclear atypia, mitotic activity absent or 1 mitosis, necrosis absent, endothelial proliferation absent.

Grading tumors is typically a classification task. The grading system requires domain knowledge in order to fill the granularity gap between the tumor descriptions and the grade descriptions. However, applications such as decision support for pathologists or integration of data graded using different systems require some formal representation of the grade definitions and of the background knowledge. Such representations are typically achieved using ontologies. Over the past years, many biomedical ontologies have been developed, including the NCI Thesaurus (NCIT) [153], a major resource in the cancer research domain. The NCIT provides descriptions of brain tumors. It also has classes for the grades. However, those classes have neither proper descriptions nor definitions. Therefore, they cannot be used for the automatic grading of tumors, which requires an explicit and formal representation.
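To make the criteria concrete, the three malignant grades above can be transcribed directly into a classification function (a sketch with invented value encodings and complete descriptions only; the study itself performs this classification through OWL class definitions, and incomplete descriptions are dealt with in section 4.2.5):

```python
# Direct, non-ontological transcription of the WHO malignant grades above.
# Value encodings are invented for this sketch.
def who_grade(density, atypia, mitosis, necrosis, endothelial_proliferation):
    if (density == "high" and atypia == "marked" and mitosis == "high"
            and necrosis and endothelial_proliferation):
        return 4
    if (density == "increased" and atypia == "distinct" and mitosis == "marked"
            and not necrosis and not endothelial_proliferation):
        return 3
    if (density == "moderately increased" and atypia == "occasional"
            and mitosis in ("absent", "low") and not necrosis
            and not endothelial_proliferation):
        return 2
    return None  # no grade can be concluded from this description

print(who_grade("high", "marked", "high", True, True))  # → 4
```

Note the None branch: exactly like an OWL reasoner under the open world assumption, this function proposes no grade when the description does not unequivocally satisfy one of the definitions.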

4.2.2 Objective

The goal of this study is to show how the OWL (Web Ontology Language) version of the NCIT can be extended to automatically perform classification of gliomas using histological descriptions. We have focused our study on the malignant grades. To this end, we have developed an ontology of glioma tumors based on the World Health Organization grading system [179]. In this study, we focus on the reasoning tasks. We provide an overview of the WHO grading system. We then analyze the NCIT and conclude that it has to be extended in order to perform automatic tumor grading. We present the method that we used and the results obtained during the classification of a generated set of tests and the classification of eleven reports provided by a pathology department.

4.2.3 Why the NCIT is not up to the task

The NCI Thesaurus is a public domain Description Logic-based terminology designed to meet the needs of the cancer research community [153]. Its goal is to provide unambiguous codes and definitions for the concepts used in cancer research. The NCIT has been converted into OWL-Lite [81]. The current version (07.01d) is composed of 55,458 named classes and 113 OWL properties. Among these classes, 18% are defined classes, i.e. they have at least one necessary and sufficient constraint, and 82% are primitive classes, i.e. they can have constraints, but do not have any necessary and sufficient definition. The classes representing the grades according to the WHO system have no restrictions and are not semantically defined (Figure 4.1 on the next page). Therefore, they are just placeholders, as nothing can be inferred to be a subclass or an instance of these classes. Because of the open world assumption underlying the OWL semantics, if the grade of a tumor cannot be unequivocally inferred, the tumor will not be classified under any grade. For example, tumors that

could be grade I or grade II tumors are not classified anywhere. There is no explicit difference between the grades the tumor may belong to (here I and II) and those it cannot belong to (here III and IV).

Figure 4.1: The WHO grades in the NCIT are primitive subclasses of ncit:Disease_Grade_Modifier. The intermediate classes Grade_1 to Grade_4 are placeholders for taking the multiple grading systems into account.

In the NCIT, glioma tumors are described as Central Nervous System Neoplasms. Each kind of tumor is defined by necessary and sufficient conditions. For example, glioblastoma is defined by the intersection of 17 restrictions. Figure 4.2 on the facing page presents some of the conditions used to define the glioblastoma class. Such definitions cannot be logically exploited to achieve any reasoning, for several reasons. First, being a grade 4 tumor is one of the conditions of the definition. Since the NCIT does not provide any definition for the grades, the grade cannot be inferred from the description of the tumor, which leaves to the user the task of stating the grade when describing the tumor... and if he knows the grade at this point, he probably does not need a reasoner to figure out whether the tumor is a glioblastoma. Second, the constraint concerning the grade is a universal constraint (∀), and again leaves it to the user to make sure that the tumor is neither a grade 1, nor 2, nor 3. Moreover, such a restriction is difficult to represent with instances when describing a tumor. Third, the extensive use of "Disease_May_Have_..." properties in existential constraints of the definition is deeply disturbing.

4.2.4 An ontology of glioblastoma based on the NCIT

The ontology we developed is based on the NCIT. A specific relevant part of the NCIT was extracted using eleven terms corresponding to the names of the glioma tumors and nine terms corresponding to subclasses of atypia and mitotic activities. We first retrieved the NCIT classes corresponding to these terms and all their parents. For each of these classes, we followed all their relations and recursively retrieved the fillers and their parents. Several operations were necessary to address the issues mentioned in the previous sections and enhance the extracted portion of the NCIT. First, we provided definitions for all WHO grades. Second, we added new classes (and new properties) to fill the granularity gap between the histologic features described in the WHO system and the classes present in the NCIT. For handling the open world assumption, we also introduced the negation of each grade (namely nograde, cf. section 4.2.5 on page 89).

Figure 4.2: The ncit:Glioblastoma class has a necessary and sufficient definition. However, this definition cannot be logically exploited.

WHO_CNS_GRADE_II ≡ nci:Disease_Grade_Modifier
    ⊓ (∃hasCellularDensity Moderate_Increased_Cellularity_Present)
    ⊓ (∃hasAtypia (Occas._Nucl._Atypia_Present ⊔ Dist._Nucl._Atypia_Present))
    ⊓ (∀hasMitoticActivity Low_Mitotic_Activity)
    ⊓ (hasNecrosisActivity = 0)
    ⊓ (hasVascularProliferation = 0)

 nci:Disease_Grade_Modifier   u   (∃hasCellularDensity Increased_Cellularity_Present)   u    (∃hasAtypia (Occas._Nucl._Atypia_Present t Dist._Nucl._Atypia_Present)) WHO_CNS_GRADE_III ≡ u  (∃hasMitoticActivity Marked_Mitotic_Activity)   u   (hasNecrosisActivity = 0)   u   (hasVascularProliferation = 0)

 nci:Disease_Grade_Modifier   u   (∃hasCellularDensity High_Cellularity_Present)   u    (∃hasAtypia Marked_Nuclear_Atypia_Present) WHO_CNS_GRADE_IV ≡ u  (∃hasMitoticActivity High_Mitotic_Activity)   u   (∃hasNecrosisActivity Necrotic_Change)   u   (∃hasVascularProliferation Vascular_Proliferation)
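Under a closed-world reading (every criterion known), the three definitions above reduce to simple predicate checks. The sketch below restates them in plain Python with a simplified value vocabulary (the names are illustrative, not the actual NCIT classes, and the ∀-restriction on mitotic activity for grade II is approximated by a value test); in the ontology these are OWL restrictions evaluated by a reasoner.

```python
# Plain-Python restatement of the WHO grade definitions, assuming all
# five criteria are known. Values are a simplified vocabulary used for
# illustration only.

GRADES = {
    "II":  {"cellularity": {"moderate_increased"},
            "atypia": {"occasional_nuclear", "distinct_nuclear"},
            "mitotic": {"low"},   # approximates the universal restriction
            "necrosis": {False}, "vascular_proliferation": {False}},
    "III": {"cellularity": {"increased"},
            "atypia": {"occasional_nuclear", "distinct_nuclear"},
            "mitotic": {"marked"},
            "necrosis": {False}, "vascular_proliferation": {False}},
    "IV":  {"cellularity": {"high"},
            "atypia": {"marked_nuclear"},
            "mitotic": {"high"},
            "necrosis": {True}, "vascular_proliferation": {True}},
}

def grade(tumor):
    """Return the grades whose definition the tumor description satisfies."""
    return [g for g, defn in GRADES.items()
            if all(tumor[c] in allowed for c, allowed in defn.items())]

# A Tumor10-like description: all five criteria point to grade IV.
tumor10 = {"cellularity": "high", "atypia": "marked_nuclear",
           "mitotic": "high", "necrosis": True,
           "vascular_proliferation": True}
print(grade(tumor10))   # -> ['IV']
```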

The generated ontology is composed of 243 classes, among which 33 are defined. Among the 243 classes, 234 correspond to NCIT classes, 5 were added for the description of the histologic criteria and 4 were added for the description of the nogrades. We reused 24 class definitions from the NCIT and created the remaining 9. Two sets of classification tests were created. The validation set (15 tests) was generated to represent plausible combinations of the histologic criteria; each test corresponds to a prototypical tumor. The evaluation set corresponds to eleven pathologic reports provided by the pathology department of the Rennes university hospital. Each report was read and the corresponding Tumor class was built manually as a subclass of Disease_Grade_Modifier. For each test, the description of its histologic criteria was done by existential restrictions for indicating the presence of a criterion, and

by cardinality restrictions to zero for indicating the absence of a criterion. Figures 4.3 and 4.4 show the case of Tumor10, which is correctly inferred to be a grade IV tumor. All tumors from the validation set were correctly graded. Ten of the eleven tumors from the evaluation set were correctly graded. The remaining tumor's description only mentioned four of the five WHO criteria, so it was not classified as a subclass of any of the four grades.

Figure 4.3: The Tumor10 from the evaluation set is defined according to its characteristics.

Figure 4.4: The Tumor10 from the evaluation set is correctly inferred to be a grade IV tumor.

4.2.5 Narrowing the possible grades in case of incomplete information

We completed the ontology by making WHO_CNS_GRADE_I, WHO_CNS_GRADE_II, WHO_CNS_GRADE_III and WHO_CNS_GRADE_IV mutually disjoint and by adding a coverage axiom to Disease_Grade_Modifier:

Disease_Grade_Modifier ≡ WHO_CNS_GRADE_I
                       ⊔ WHO_CNS_GRADE_II
                       ⊔ WHO_CNS_GRADE_III
                       ⊔ WHO_CNS_GRADE_IV

Eventually, we added the four NoGrade classes according to the template:

NO_WHO_CNS_GRADE_I ≡ Disease_Grade_Modifier ⊓ ¬WHO_CNS_GRADE_I

Figure 4.5 shows that Tumor4, whose grade could not be determined because of its incomplete description, was (correctly) classified as a subclass of both NO_WHO_CNS_GRADE_III and NO_WHO_CNS_GRADE_IV. This shows that even incomplete information can be valuable, because it can be exploited to reduce the space of solutions. Here we could not infer the grade of the tumor, but the nograde classes allowed us to rule out grades 3 and 4 (the worst).
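The narrowing effect of the nograde classes can be sketched with three-valued criteria, where `None` stands for unknown: a grade is ruled out as soon as one of its defining criteria is known to be violated, even when the full grade cannot be inferred. The grade definitions below are reduced to two criteria each (grade I is omitted) and use an illustrative vocabulary, not the actual ontology classes.

```python
# Sketch of the "nograde" narrowing with partially-known descriptions.
# None means the criterion status is unknown (open-world flavor).

GRADES = {
    "II":  {"mitotic": {"low"},    "necrosis": {False}},
    "III": {"mitotic": {"marked"}, "necrosis": {False}},
    "IV":  {"mitotic": {"high"},   "necrosis": {True}},
}

def possible_grades(tumor):
    """A grade remains possible unless a known criterion contradicts it."""
    out = []
    for g, defn in GRADES.items():
        violated = any(tumor.get(c) is not None and tumor[c] not in allowed
                       for c, allowed in defn.items())
        if not violated:
            out.append(g)
    return out

# Partial description: mitotic activity known to be low, necrosis unknown.
partial = {"mitotic": "low", "necrosis": None}
print(possible_grades(partial))   # grades III and IV are ruled out
```

Even though no grade can be asserted, the known low mitotic activity alone excludes the worst grades, mirroring the Tumor4 case.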

Figure 4.5: The Tumor4 from the evaluation set only has a partial description that covers 4 out of the 5 WHO grade criteria (A). This prevents the reasoner from inferring its grade (B). However, the NO_WHO_CNS_Grade classes are useful for ruling out grades 3 and 4 (C).

4.3 Methodology: clinical trials recruitment

This study focuses on patient recruitment in clinical trials. This task requires matching a large volume of information about the patient against numerous eligibility criteria combined in a logically complex way. Moreover, some of the patient information necessary to determine the status of the eligibility criteria may not be available at the time of pre-screening. We showed that the classic approach based on negation as failure over-estimates rejection when confronted with partially-known information about the eligibility criteria, because it ignores the distinction between trials for which patient eligibility should be rejected and trials for which patient eligibility cannot be asserted. We also showed that 58.64% of the values were unknown in the 286 prostate cancer cases examined during the weekly urology multidisciplinary meetings at Rennes' university hospital between October 2008 and March 2009. We proposed an OWL design pattern for modeling eligibility criteria based on the open-world assumption to address the missing information problem. In retrospect, this work is interesting because it shows that the approach we developed for grading brain tumors is actually more general and can be adapted to other situations.

This study was a contribution to the Astec1 project (ANR-08-TECS-0002) and was related to the EHR4CR2 project (IMI 115189). It was originally published in: Olivier Dameron, Paolo Besana, Oussama Zekri, Annabel Bourdé, Anita Burgun, and Marc Cuggia. OWL model of clinical trial eligibility criteria compatible with partially-known information. Journal of Biomedical Semantics, 4(1), 2013 [95].

4.3.1 Context

Patient recruitment is a major focus in all clinical trials. Adequate enrollment provides a base for projected participant retention, resulting in evaluative patient data. Identification of eligible patients for clinical trials (from the principal investigator's perspective) or identification of clinical trials in which the patient can be enrolled (from the patient's perspective) is an essential phase of clinical research and an active area of medical informatics research. The National Cancer Institute has identified several barriers to clinical trial participation reported by health care professionals3. Among those barriers, lack of awareness of appropriate clinical trials is frequently mentioned. Automated tools that help perform a systematic screening either of the potential clinical trials for a patient, or of the potential patients for a clinical trial, could overcome this barrier [180]. The ASTEC (Automatic Selection of clinical Trials based on Eligibility Criteria) project aims at automating the search for prostate cancer clinical trials in which patients could be enrolled [181]. It features syntactic and semantic interoperability between the oncologic electronic medical records and the recruitment decision system using a set of international standards (HL7 and NCIT), and its inference method is based on ERGO [182]. The EHR4CR project aims at facilitating clinical trial design and patient recruitment by developing tools and services that reuse data from heterogeneous electronic health records. The TRANSFoRm project has similar objectives for primary care [183, 184]. All these studies on data and criteria representation, integration and reasoning are motivated by the requirement to have the necessary information available at the time of processing

1 http://www.agence-nationale-recherche.fr/?Project=ANR-08-TECS-0002
2 http://www.ehr4cr.eu/
3 http://www.cancer.gov/clinicaltrials/learningabout/in-depth-program/page7

the patient's data, and assume that somehow this will be the case. Missing information that is required for deciding whether a criterion is met leads to recruitment being underestimated. Solutions for circumventing this difficulty consist either in making assumptions about the undecided criteria, or in having a pre-screening phase considering a subset of the criteria for which the patient's data are assumed to be available. Bayesian belief networks have been used to address the former [185], but they require a sensible choice of probability values and may lead to the wrong assumption in particular cases. The latter leaves most of the decision task to human expertise, which provides little added value (if an expert has to handle the difficult criteria, automatically processing the simple pre-screening ones is only a little weight off his shoulders) and is still susceptible to the problem of missing information for the pre-screening criteria.

4.3.2 Objective

We propose an OWL design pattern for modeling clinical trial eligibility criteria. This design pattern is based on the open world assumption for handling missing information. It infers whether a patient is eligible or not for a clinical trial, or whether no definitive conclusion can be reached.

4.3.3 The problem of missing information

4.3.3.1 Modeling clinical trial eligibility

A clinical trial can be modeled as a pair ⟨(I_i)_{i=0..n}, (E_j)_{j=0..m}⟩, where (I_i)_{i=0..n} is the set of the inclusion criteria and (E_j)_{j=0..m} is the set of the exclusion criteria. All the eligibility criteria from (I_i)_{i=0..n} ∪ (E_j)_{j=0..m} are supposed to be independent from one another (at least in the weak sense: the value of criterion C_k cannot be inferred from the combined values of the other criteria). Each criterion can be modeled as a unary predicate C(p), where the variable p represents all the information available for the patient. C(p) is true if and only if the criterion is met. A patient is deemed eligible for a clinical trial if all the inclusion criteria and none of the exclusion criteria are met.

patient eligible ⇔ (∧_{i=0..n} I_i(p)) ∧ ¬(∨_{j=0..m} E_j(p))    (4.1)

Before making the final decision on the list of clinical trials for which a patient is eligible, there are intermediate pre-screening phases where only the main eligibility criteria of each clinical trial are considered. Such pre-screening sessions rely on subsets of (I_i)_{i=0..n} and (E_j)_{j=0..m}, but the decision process remains the same.

For the sake of clarity, in addition to the general case, we will consider a simple clinical trial with two inclusion criteria I0 and I1, and two exclusion criteria E0 and E1.

patient eligible ⇔ I0(p) ∧ I1(p) ∧ ¬(E0(p) ∨ E1(p))    (4.2)

For example, these criteria could be:

• I0: evidence of a prostate adenocarcinoma;

• I1: absence of metastasis;

• E0: patient older than 70 years old;

• E1: evidence of diabetes.

According to equation 4.2, a patient would be eligible for the clinical trial if and only if he has a prostate adenocarcinoma, has no metastasis, and is neither older than 70 years old nor suffering from diabetes. Because of De Morgan's laws, equation 4.1 is equivalent to:

patient eligible ⇔ (∧_{i=0..n} I_i(p)) ∧ (∧_{j=0..m} ¬E_j(p))    (4.3)

Even though equation 4.1 and equation 4.3 are logically equivalent, the latter is often preferred because it is a uniform conjunction of criteria. Note that the negations in front of the exclusion criteria are purely formal, as both inclusion and exclusion criteria can represent an asserted presence (e.g. prostate adenocarcinoma for I0 or diabetes for E1) or an asserted absence (e.g. metastasis for I1). For our example:

patient eligible ⇔ I0(p) ∧ I1(p) ∧ (¬E0(p)) ∧ (¬E1(p))    (4.4)

According to equation 4.4, a patient would be eligible for the clinical trial if and only if he has a prostate adenocarcinoma and has no metastasis and is not older than 70 years old and does not suffer from diabetes.

4.3.3.2 Patients who we know are not eligible and those for whom we do not know whether they are eligible

When a part of the information necessary for determining whether at least one criterion is met is unknown, the conjunction of equation 4.3 can never be true. This necessarily makes the patient not eligible for the clinical trial, whereas the correct interpretation of the situation is that the patient cannot be proven to be eligible. This is different from proving that the patient is not eligible, and indeed, in reality the patient can sometimes be included by assuming the missing values (cf. next section). For our fictitious clinical trial, we consider a population of nine patients covering all the combinations of “True”, “False” or “Unknown” for the inclusion criterion I1 and the exclusion criterion E1. Table 4.1 on the next page presents the value of equation 4.4 and the correct inclusion decision for the nine combinations. Among the five patients (p2, p5, p6, p7 and p8) for which at least part of the information is unknown, three (p2, p7 and p8) illustrate a conflict between the value of equation 4.4 and the expected inclusion decision. A strict interpretation of equation 4.4 leads to the exclusion of eight patients:

• for three of them (p0, p3 and p4), all the information is available;

• for two of them (p5 and p6), some information is unknown, but the available information is sufficient to conclude that the patients are not eligible;

• for the three others (p2, p7 and p8), however, the cause of rejection is either that one of the inclusion criteria cannot be proven (I1 for p7 and p8) or that one of the exclusion criteria cannot be proven to be false (E1 for p2 and p8).

In the case of unknown information, equation 4.3 alone is not enough to make the distinction between the patients we know are not eligible (the first two categories, so this also includes patients for whom a part of the information is unknown) and those we do not know whether they are eligible (the third category). This is a problem because patients from the first two categories should be excluded from the clinical trial, whereas those from the third category should be considered for inclusion.

Patient | I0 | I1 | E0 | E1 | I0 ∧ I1 ∧ ¬E0 ∧ ¬E1         | Decision
p0      | T  | T  | F  | T  | F                            | Exclude (E1)
p1      | T  | T  | F  | F  | T                            | Include
p2      | T  | T  | F  | ?  | F (cannot assert ¬E1)        | Propose (assume ¬E1)
p3      | T  | F  | F  | T  | F                            | Exclude (both ¬I1 and E1)
p4      | T  | F  | F  | F  | F                            | Exclude (¬I1)
p5      | T  | F  | F  | ?  | F                            | Exclude (¬I1)
p6      | T  | ?  | F  | T  | F                            | Exclude (E1)
p7      | T  | ?  | F  | F  | F (cannot assert I1)         | Propose (assume I1)
p8      | T  | ?  | F  | ?  | F (cannot assert I1 and ¬E1) | Propose (assume both I1 and ¬E1)

Table 4.1: Evaluation of equation 4.4 and correct inclusion decision for all the possible values of I1 and E1, with possibly unknown information.
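The decision column of Table 4.1 can be captured by a small three-valued procedure, with Python's `None` standing for an unknown criterion status. This is a minimal sketch of the intended decision logic, not the OWL mechanism developed below: exclusion requires a provably failed inclusion criterion or a provably met exclusion criterion; inclusion requires every status to be known and favorable; everything else is "propose".

```python
# Three-valued evaluation of patient eligibility (True / False / None),
# reproducing the "Decision" column of Table 4.1. I0 and E0 are fixed
# to T and F as in the table; only I1 and E1 vary.

def decide(I0, I1, E0, E1):
    inclusions, exclusions = [I0, I1], [E0, E1]
    if any(i is False for i in inclusions) or any(e is True for e in exclusions):
        return "Exclude"   # provably not eligible
    if all(i is True for i in inclusions) and all(e is False for e in exclusions):
        return "Include"   # provably eligible
    return "Propose"       # undecided: consider, pending the missing data

T, F, U = True, False, None
expected = {  # (I1, E1) -> decision, from Table 4.1 (p0..p8)
    (T, T): "Exclude", (T, F): "Include", (T, U): "Propose",
    (F, T): "Exclude", (F, F): "Exclude", (F, U): "Exclude",
    (U, T): "Exclude", (U, F): "Propose", (U, U): "Propose",
}
for (i1, e1), dec in expected.items():
    assert decide(T, i1, F, e1) == dec
print("all nine patients match Table 4.1")
```

Note how p5 (I1 = F, E1 = ?) is confidently excluded despite the unknown value, while p2 (I1 = T, E1 = ?) is proposed: the unknown status only matters when no known criterion already settles the decision.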

One solution could be to assume the values of the unknown criteria in order to go back to a situation where inclusion or exclusion could be computed using equation 4.3. In this case:

• inclusion criteria for which the available information is not sufficient to compute the status are considered to be met;

• exclusion criteria for which the available information is not sufficient to compute the status are considered not to be met.

Therefore, in the case where the available information is not sufficient to compute the status of a criterion, a different status is assumed depending on whether the criterion determines inclusion or exclusion. Referring to our fictitious clinical trial, the lack of information about the absence of metastasis would lead to the assumption that I1 is true, whereas the lack of information about diabetes would lead to the assumption that E1 is false. This situation raises several issues:

• a different status is assumed depending on whether the criterion determines inclusion or exclusion;

• the assumed status depends on the nature of the criterion (i.e. inclusion or exclusion) and not on its probability;

• one has to remember that the value of at least one criterion has been assumed in order to qualify the inferred eligibility (certain for p0 or p1 vs. “under the assumption that...” for p2, p7 and p8);

• this qualification can be difficult to compute (the status of E1 is unknown for both p2 and p5, but p5 can be confidently excluded whereas p2 can be included assuming E1).

4.3.4 Eligibility criteria design pattern

• for each criterion, create a class C_i (at this point, we do not care whether it is an inclusion or an exclusion criterion, or both) and possibly add a necessary and sufficient definition representing the criterion itself (or use SWRL);

• for each criterion, create a class Not_C_i defined as Not_C_i ≡ Criterion ⊓ ¬C_i. This process can be automated;

• for each clinical trial, create a class Ct_k (placeholder);

• for each clinical trial, create a class Ct_k_include as a subclass of Ct_k with a necessary and sufficient definition representing the conjunction of the inclusion criteria and of the negated exclusion criteria (cf. equation 4.3): Ct_k_include ≡ (⊓_{i=0..n} I_i) ⊓ (⊓_{j=0..m} Not_E_j);

• for each clinical trial, create a class Ct_k_exclude (placeholder) as a subclass of Ct_k;

• for each clinical trial, create a class Ct_k_exclude_at_least_one_exclusion_criterion as a subclass of Ct_k_exclude with a necessary and sufficient definition representing the disjunction of the exclusion criteria: Ct_k_exclude_at_least_one_exclusion_criterion ≡ ⊔_{j=0..m} E_j;

• for each clinical trial, create a class Ct_k_exclude_at_least_one_failed_inclusion_criterion as a subclass of Ct_k_exclude with a necessary and sufficient definition representing the disjunction of the negated inclusion criteria: Ct_k_exclude_at_least_one_failed_inclusion_criterion ≡ ⊔_{i=0..n} Not_I_i;

• represent the patient's data with instances (Figures 4.6 and 4.7). For the sake of simplicity, we will make the patient an instance of as many C_i classes as we know he matches criteria, and of as many Not_C_j classes as we know he does not match criteria, even if this is ontologically questionable (a patient is not an instance of a criterion). How the patient's data are reconciled with the criteria by making the patient an instance of the criteria is not specified here: it can be done manually, or automatically with OWL necessary and sufficient definitions or SWRL rules for the C_i and Not_C_j classes.
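Since the auxiliary classes of the pattern are purely formal manipulations of the criteria lists, their generation can be automated. The sketch below emits the axioms in a Manchester-like pseudo-syntax from the inclusion and exclusion criteria names; the serialization is illustrative, not the one used in the study.

```python
# Mechanical generation of the design-pattern axioms from the lists of
# inclusion and exclusion criteria (Manchester-like strings, for
# illustration only).

def eligibility_axioms(trial, inclusions, exclusions):
    axioms = []
    # negated counterpart of every criterion
    for c in inclusions + exclusions:
        axioms.append(f"Not_{c} EquivalentTo: Criterion and not {c}")
    # inclusion class: conjunction of I_i and Not_E_j (equation 4.3)
    conj = " and ".join(inclusions + [f"Not_{e}" for e in exclusions])
    axioms.append(f"{trial}_include EquivalentTo: {conj}")
    # exclusion classes: disjunctions of E_j and of Not_I_i
    axioms.append(f"{trial}_exclude_at_least_one_exclusion_criterion"
                  " EquivalentTo: " + " or ".join(exclusions))
    axioms.append(f"{trial}_exclude_at_least_one_failed_inclusion_criterion"
                  " EquivalentTo: " + " or ".join(f"Not_{i}" for i in inclusions))
    return axioms

axioms = eligibility_axioms("Ct_1", ["I_0", "I_1"], ["E_0", "E_1"])
for a in axioms:
    print(a)
```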

Figure 4.6: A patient for whom all the information is available.

4.3.5 Reasoning

If all the required information is available, then after classification, for each criterion the patient will be an instance of either C_i or Not_C_i, and will therefore also be an instance of one of Ct_k_include (like p1 in Figure 4.8 on the facing page), Ct_k_exclude_at_least_one_exclusion_criterion or Ct_k_exclude_at_least_one_failed_inclusion_criterion (so at least we are doing as well as the other systems). If not all the information is available, because of the open world assumption there will be some criteria for which the patient will be classified neither as an instance of C_i nor of Not_C_i (e.g. in Figure 4.7 on the next page, p2 is neither an instance of E_1 nor of Not_E_1), so he will not be classified as an instance of Ct_k_include either. However, the patient may be classified

Figure 4.7: A patient for whom some information is unknown (here about E1).

Figure 4.8: The class modeling clinical trial inclusion after classification (here patient p1 can be included).

as an instance of Ct_k_exclude_at_least_one_exclusion_criterion or of Ct_k_exclude_at_least_one_failed_inclusion_criterion. As both are subclasses of Ct_k_exclude, we will conclude that the patient is not eligible for the clinical trial. We will even know whether it is because he matched an exclusion criterion (like p0, p3 and p6 in Figure 4.9), because he failed to match an inclusion criterion (like p3, p4 and p5 in Figure 4.10), or both (like p3).

Figure 4.9: The class modeling clinical trial exclusion because at least one of the exclusion criteria has been met after classification (here patients p0, p3 and p6 match the definition).

Figure 4.10: The class modeling clinical trial exclusion because at least one of the inclusion criteria failed to be met after classification (here patients p3, p4 and p5 match the definition).

If the patient is classified neither as an instance of Ct_k_include nor of Ct_k_exclude (or its subclasses), then we will conclude that the patient can be considered for the clinical trial, assuming the missing information will not prevent it (like p2, p7 and p8, who do not appear in Figures 4.8, 4.9 and 4.10, consistently with Table 4.1 on page 94). By retrieving the criteria for which the patient is neither an instance of C_i nor of Not_C_i, we will know which information is missing.

4.4 Synthesis

What the two examples have in common The approach developed in this chapter and exemplified by the two reasoning applications could certainly be applied to many other contexts. Missing or incomplete information is pervasive in life sciences, and this is an inherent characteristic. The study on clinical trials demonstrated the extent of the phenomenon, with about 68% of patient data not specified, and none of the 286 patients having all the required fields for any of the four clinical trials we considered. I do not expect this trend to decrease, as we will keep on needing to combine pieces of information of different natures and different origins. However, part of what is causing the problem is also the solution: this integration endeavor is supported by new technologies that inherently take missing information into account. As

we are making the transition from the information silo paradigm to the linked data paradigm, we are switching from query languages such as SQL, which rely on the closed-world assumption over well-groomed and exhaustive data, to reasoners capable of making the distinction between assertions that we know are false and assertions that are undecided (e.g. with “MINUS” and “NOT EXISTS” in SPARQL) or supporting the open-world assumption (OWL). I have the impression that these features are not exploited to their full potential yet, but insisting on using the former query languages on data that no longer fulfill the exhaustivity requirement does not seem sensible. A conclusion of the chapter on reasoning based on classification was that modeling the domain-specific part of the reasoning task required very little work, provided that semantically-rich ontologies are available. Again, the availability of these ontologies turned out to be a limiting factor and we had to fix the imperfections of the NCIT manually. Eventually, in both examples of grading tumors and of determining patients' eligibility, the extra amount of work dedicated to handling missing information was again minimal. Moreover, both examples suggest that this additional work can be at least partially automated: the generation of the NO_WHO_GRADE classes and their definitions followed a simple pattern that was independent from the definitions of the grades. Similarly, the generation of the Ct_exclude and of the negated criteria classes, as well as their definitions, were only formal manipulations relying on the definition of the corresponding Ct_include.

Grading tumors This application elaborates on a situation encountered in the Virtual Soldier project. It took me several years before I realized that what I considered then to be a concrete example of why closure axioms are useful in OWL (explaining the open world assumption in OWL was a major point during the Protégé Short Course and on the mailing list) was actually part of a more general topic that proved to be valuable for modeling and reasoning over biomedical data.

Assessing patients' eligibility for clinical trials This application is in turn a transposition of the tumor grading method. It could have implications in the more general efforts to formalize and standardize the representation of clinical trial eligibility criteria. Even if the solution I developed was adequate, I have since replicated the reasoning mechanism using SPARQL instead of OWL. However, this mechanism only focuses on combining the status of the various eligibility criteria. Determining the status of each criterion typically remains a classification task for which OWL is best suited. Comparing the original solution with an approach relying on OWL for determining each criterion's status and on SPARQL for combining the criteria remains to be done. It would be along the lines of the optimization study from section 3.4.

What we learned

• Failing to explicitly address incomplete information may lead to biased results.

• The modeling overhead for taking incomplete information into account was marginal in both examples, and so was the additional computing cost.

• Just because some piece of information is incomplete does not mean that it is useless, as it can be exploited to reduce the space of solutions.

• The problem of incomplete information is pervasive in life science; however, so far data sources and applications seldom take it into account, which makes it a relevant field of

research. Primmer et al. made an in-depth analysis of the Gene Ontology and its relevance for analyzing non-annotated genomes using what is known on model species [30]. It should be noted that Chen et al. developed a similarity measure among genes with shallow annotations [186]. Moreover, the Gene Ontology supports a NOT modifier for stating that a gene product was proved not to be associated with a GO term (e.g. for Homo sapiens, APOA1 (uniprot:P02647) is not associated with “transforming growth factor beta receptor signaling pathway” (go:0007179)). This modifier makes it possible to distinguish the situations where we do not know whether a gene product is annotated with a GO term (absence of annotation) from the situations where we know that a gene product is not associated with the GO term (annotation with the NOT modifier). Even if such modifiers should be taken into account [140], I do not know of any widespread application that uses them (which of course does not mean that there are no such applications).

• The question of confidence is also related to missing information. The Gene Ontology evidence codes4 and the associated decision tree5 were exploited by the IntelliGO semantic similarity measure [187]. GO evidence codes were later extended to the Evidence Ontology (ECO) [188] and inspired the Confidence Information Ontology [189].

4 http://geneontology.org/page/guide-go-evidence-codes
5 http://geneontology.org/page/evidence-code-decision-tree

Chapter 5

Reasoning with similarity and particularity

Outline

In addition to classification and deductive reasoning, life sciences data analysis also encompasses comparison. By making the relations between classes explicit, ontologies make it possible to go beyond simple annotation counting when determining what two elements have in common, or to what extent these two elements are different. A collaboration with Christian Diot focused on the comparison of the lipid metabolism pathways of ducks and chickens. Ducks and geese produce foie gras when fattened, whereas most other bird species produce abdominal fat instead, which lowers the meat quality and its market value. Interestingly, foie gras is related to liver steatosis, a condition that can progress into fatty liver disease, cirrhosis or liver cancer in mammals, and particularly in humans. In this context, we supervised Charles Bettembourg's PhD thesis on a generic semantics-based method for comparing metabolic networks across species. A major challenge was that most existing methods focus on what is similar, whereas we were especially interested in the differences. We proposed a method that first identifies the similar pathway steps, and second identifies the similar steps associated with some specific processes in one of the species. This led us to define a semantic particularity measure as a complement to existing similarity measures (section 5.2), and to determine an objective discretization method for deciding whether two elements are similar, and whether they are particular (section 5.3). The problem was further complicated by the fact that chickens and ducks are not as thoroughly annotated as human or mouse. This bias rendered most of the classical similarity measures inadequate.

5.1 Principle

This section surveys the main categories of the numerous similarity measures and gives the definition of the measures used in sections 5.2 and 5.3. The general principle consists in quantifying the similarity between two elements according to the annotations associated with each element. In certain domains, the process has also been extended for comparing two sets of elements. Similarity values usually range from 0 (low similarity) to 1 (perfect similarity). Similarity is often seen as the dual notion of distance with the formula distance = 1 − similarity, with distance values ranging from 0 (high similarity) to 1 (low similarity). However, such distances are usually not proper distance metrics as they do not have the triangle inequality

property. Note also that the perspectives are different, as similarity focuses on what is common between two elements, whereas distance focuses on what makes them different, so the connection between similarity and distance may not be straightforward and the previous formula should be seen as an approximation.
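The metric caveat above can be checked directly on a small counterexample: 1 − Dice violates the triangle inequality, so it is not a proper distance metric. (The Jaccard-based distance 1 − J, by contrast, is known to satisfy it.) The sets below are arbitrary illustrative annotations.

```python
# Counterexample showing that distance = 1 - Dice is not a metric:
# with A = {x}, B = {y}, C = {x, y}, going "through" C is shorter than
# the direct A-to-B distance, violating the triangle inequality.

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def d(s, t):            # tentative "distance" derived from similarity
    return 1 - dice(s, t)

A, B, C = {"x"}, {"y"}, {"x", "y"}
print(d(A, B), d(A, C) + d(C, B))   # 1.0 vs ~0.667
assert d(A, B) > d(A, C) + d(C, B)  # triangle inequality violated
```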

5.1.1 Comparing elements with independent annotations

5.1.1.1 Independent annotations with the same weight

Classic similarity measures are based on set operations over the annotations of two elements. If A and B are the sets of annotations of the first and the second element respectively, the Jaccard index is defined as:

J(A, B) = |A ∩ B| / |A ∪ B|

A similar notion is the Dice–Sørensen coefficient:

D(A, B) = 2 · |A ∩ B| / (|A| + |B|)

There is a correspondence between the Jaccard index and the Dice–Sørensen coefficient:

J = D / (2 − D)    and    D = 2J / (1 + J)

The Jaccard index and Dice–Sørensen coefficient both rely on two main assumptions: all the annotations have the same weight, and all the annotations have the same frequency. These two notions are different but not independent. Weight focuses on the contribution of the annotation for determining the similarity between two elements. This is related to informativeness or granularity (a precise annotation conveys more information than a general or vague one). It is an intrinsic property of the annotation. Frequency is corpus-dependent, and is therefore an extrinsic property of annotations. Even if all the annotations had the same granularity, an annotation that annotates most of the elements of a corpus would be considered less informative than a rarer annotation. Of course, with annotations of different granularities, the more general annotations also tend to be the most frequent.
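Both measures and the correspondence formulas can be verified on a pair of annotation sets; the GO-style identifiers below are used purely for illustration.

```python
# Jaccard and Dice-Sørensen on two annotation sets, plus a direct check
# of the correspondence J = D / (2 - D) and D = 2J / (1 + J).

def jaccard(a, b):
    return len(a & b) / len(a | b)

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

A = {"go:0006629", "go:0008610", "go:0016042"}   # illustrative GO terms
B = {"go:0006629", "go:0008610", "go:0055088"}
J, D = jaccard(A, B), dice(A, B)
print(J, D)                          # 0.5 and ~0.667
assert abs(J - D / (2 - D)) < 1e-12
assert abs(D - 2 * J / (1 + J)) < 1e-12
```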

5.1.1.2 Independent annotations with different weights

The cosine similarity is a simple measure where the elements A and B to be compared are represented as vectors of n annotations. Each annotation has a fixed position in the vectors, so that the ith element of the vector of A refers to the same annotation as the ith element of the vector of B.

similarity_cosine(A, B) = A · B / (||A|| × ||B||) = (Σ_{i=1..n} A_i × B_i) / (√(Σ_{i=1..n} A_i²) × √(Σ_{i=1..n} B_i²))

Although the ith element of the annotation vector can be any real number (so cosine similarity is in the [−1; 1] range), it is usually a positive number (so cosine similarity is in the [0; 1] range). There are several classical strategies for determining the values of the annotation vector.

A binary vector representing the absence or the presence of annotations makes the cosine similarity applicable when all the annotations have the same weight (cf. section 5.1.1.1).
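The binary-vector case can be sketched as follows; the last line additionally checks the known identity that on presence/absence vectors, cosine similarity reduces to |A ∩ B| / √(|A| · |B|) (the Ochiai coefficient). The vocabulary and sets are arbitrary examples.

```python
import math

# Cosine similarity over binary presence/absence annotation vectors.

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = lambda w: math.sqrt(sum(x * x for x in w))
    return dot / (norm(u) * norm(v))

vocabulary = ["t1", "t2", "t3", "t4"]        # fixed annotation positions
A, B = {"t1", "t2", "t3"}, {"t2", "t3", "t4"}
u = [1 if t in A else 0 for t in vocabulary]
v = [1 if t in B else 0 for t in vocabulary]
s = cosine(u, v)
print(round(s, 4))   # 2 shared terms out of 3 each -> 2/3
# On binary vectors, cosine equals the set-based Ochiai coefficient:
assert abs(s - len(A & B) / math.sqrt(len(A) * len(B))) < 1e-12
```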

A more elaborate weighting scheme such as the “term frequency – inverse document frequency” (tf-idf) makes it possible to take into account both the importance of the annotation (possibly different values for A and B), and the relative weights of each annotation (same value for A and B). Term frequency indicates how important the annotation is to the element being compared. There are several weighting variants such as binary, raw frequency or log-normalization. For a text, this is typically the number of occurrences of a word divided by the number of words (so that texts of different lengths can be compared). For a gene, this is typically 1 or 0, depending on whether the gene is annotated by the term or not. For a set of genes, this is typically the proportion of genes in the set annotated by the term (so that sets of different sizes can be compared). Inverse document frequency indicates how important the annotation is in general, according to a reference corpus. As mentioned previously, an annotation present in few documents is more informative than a common annotation. There are also several weighting variants, such as the logarithm of the inverse frequency, i.e. the logarithm of the inverse of the proportion of documents in the corpus annotated by the term. tf-idf is simply the product of the two previous aspects, which emphasizes the over-representation of rare annotations.

$$\mathit{tf}(annotation, document) = \frac{\text{Nb of occurrences of } annotation \text{ in } document}{\text{Nb of annotations of } document}$$

$$\mathit{idf}(annotation, Corpus) = -\log\left(\frac{|\{d \in Corpus : annotation \in d\}|}{|Corpus|}\right)$$

$$\mathit{tfidf}(annotation, document, Corpus) = \mathit{tf}(annotation, document) \times \mathit{idf}(annotation, Corpus)$$
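The three formulas above can be sketched as follows for gene-set annotations; the corpus, the gene set and the GO term identifiers are hypothetical.

```python
# tf-idf weighting sketch for gene-set annotations (hypothetical data).
# tf: proportion of genes in the set annotated by the term;
# idf: -log of the proportion of corpus documents carrying the term.
import math

def tf(term, gene_set):
    return sum(1 for annotations in gene_set if term in annotations) / len(gene_set)

def idf(term, corpus):
    n_with_term = sum(1 for doc in corpus if term in doc)
    return -math.log(n_with_term / len(corpus))

def tfidf(term, gene_set, corpus):
    return tf(term, gene_set) * idf(term, corpus)

corpus = [{"GO:A", "GO:B"}, {"GO:A"}, {"GO:A", "GO:C"}, {"GO:C"}]
gene_set = [{"GO:C"}, {"GO:A", "GO:C"}]
# GO:C annotates every gene of the set (tf = 1) but only half the corpus.
print(round(tfidf("GO:C", gene_set, corpus), 3))
```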

5.1.2 Taking the annotations' underlying structure into account

All the previous similarity measures assume that the annotations are independent. However, the analysis can be further refined by using ontologies to also consider the relations between some of the annotations. Figure 5.1 on the next page presents the Gene Ontology hierarchy between three GO terms. This section shows how this hierarchy can be exploited by semantic similarity measures to infer that the first two are biologically close (their similarity is 0.57), whereas they are biologically different from the third (their similarities are respectively 0.08 and 0.11). Lee et al. compared three families of similarity measures, based respectively on IC, ontology structure and expert opinion, on the SNOMED-CT ontology, and found poor agreement between IC-based metrics, whereas the metric based on the ontology structure correlated best with expert opinion [190]. This suggests that taking the ontology structure into account improves the analysis, although whether this can be generalized to other ontologies and application contexts remains an open question.

Within a given gene set, the genes sharing identical or similar GO annotations can be grouped into clusters using two approaches [191]. The GSEA approach computes these clusters by considering the over-representation of GO terms. The semantic similarity approach takes GO properties into account to cluster genes according to the quantity and the importance of their shared annotations [192, 193, 194, 195]. The two approaches are not exclusive, as semantic measures can be involved in GSEA in order to improve the analysis [196]. If the GO terms were independent, the gene set characterization could be performed by a straightforward set-based

Figure 5.1: Gene Ontology hierarchy between "protein ADP-ribosylation", "translational initiation" and "ion transport". This hierarchy can be exploited by semantic similarity measures to infer that the first two are biologically close, whereas they are biologically further from the third. Dark blue edges represent "is a" relations and light blue edges represent "part of" relations (graph generated by Amigo).

approach such as the Jaccard index or Dice's coefficient. However, GO terms are hierarchically linked. Consequently, the characterization needs to take into account the underlying ontological structure of the GO annotations [140]. Semantic similarity measures rely on ontologies to systematically quantify the weight of the shared elements. They exploit the formal representation of the meaning of the terms by considering the relations between the terms (e.g. for inferring new annotations that were implicit, as each term inherits all the properties of its ancestors) and by attributing different weights to each term depending on how much information it conveys. When working with annotation databases, it should be routine practice to use the ontology hierarchy to infer implicit annotations [140]. Gentleman developed a graph-based measure for the R package GOstats called simUI [197]. simUI defines the semantic similarity between two sets of terms corresponding to two sub-graphs of the ontology as the ratio of the number of terms in the intersection of these graphs to the number of GO terms in their union, which is a simple adaptation of the Jaccard index. However, with simUI, all the terms have the same weight, which introduces a bias emphasizing the intersection, as the more general terms tend to annotate more genes. Other measures adopt different strategies to weight the terms. Pesquita et al. performed an extensive review of the main semantic similarity measures [198] and identified two main categories, i.e. node-based methods and edge-based methods, as well as a handful of hybrid methods. Blanchard et al. also performed an in-depth comparison of semantic similarities on subsumption hierarchies without multiple inheritance [199].
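A minimal sketch of a simUI-style computation follows, using a hypothetical "is a" fragment loosely inspired by the terms of Figure 5.1: each set of terms is expanded into the ontology sub-graph containing the terms and all their ancestors, and the similarity is the Jaccard index of the two sub-graphs.

```python
# simUI-style similarity: Jaccard index of the ontology sub-graphs induced
# by two annotation sets. The toy "is a" hierarchy below is hypothetical.
PARENTS = {
    "ion transport": {"transport"},
    "transport": {"biological_process"},
    "translational initiation": {"translation"},
    "translation": {"biological_process"},
    "biological_process": set(),
}

def subgraph(terms):
    """Terms plus all of their ancestors (transitive closure)."""
    seen = set()
    stack = list(terms)
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen

def sim_ui(terms1, terms2):
    g1, g2 = subgraph(terms1), subgraph(terms2)
    return len(g1 & g2) / len(g1 | g2)

# Only the root is shared: 1 common term out of 5 in the union.
print(sim_ui({"ion transport"}, {"translational initiation"}))
```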

5.1.2.1 Node-based semantic similarity

Node-based semantic similarity measures rely on how informative the terms are. Typically, they consider that two terms sharing an informative lowest common ancestor are more similar than two terms with a less informative lowest common ancestor, as seen in Fig. 5.1 on the facing page. Historically, the Information Content (IC) value was used to quantify how informative a term is, with the least frequent terms having the highest IC value. Term frequencies were determined using a reference corpus. The IC of a term t is the negative logarithm of its probability P(t). When the annotations are organized in an ontology such as GO, it is necessary to take the subsumption hierarchy into account when computing this frequency, in order to also consider the implicit annotations to the terms' descendants [140].

$$IC(t) = -\log(P(t))$$

This concept, borrowed from Shannon's Information Theory, was used to measure similarities using ontologies [200, 201, 202] such as WordNet [203]. To compare two terms, these methods rely on their most informative common ancestor (MICA). For Resnik, the similarity of two terms is simply the information content of their MICA. Lin also takes into account how far these two terms are from their MICA. Pesquita et al. proposed to combine the graph-based simUI metric with the IC of the terms involved in the computation [204]: in simGIC, each term is weighted by its IC.

$$\mathit{similarity}_{Resnik}(A, B) = \max_{t \in (ancestors(A) \cap ancestors(B))} IC(t)$$

$$\mathit{similarity}_{Lin}(A, B) = \frac{2 \times \max_{t \in (ancestors(A) \cap ancestors(B))} IC(t)}{IC(A) + IC(B)}$$

Ontologies are used twice when computing node-based semantic similarities: for determining the correct information content of the annotations, and for determining the most informative

common ancestor. These methods developed in linguistics have been applied to GO [205, 206], using the frequency with which a term annotates a gene as a marker of its rarity. Consequently, the IC of a GO term is inversely proportional to the frequency with which it annotates a gene in the Gene Ontology Annotations (GOA) database [138]. GOA also specifies how each annotation has been attributed, through Evidence Codes (EC). In their method called "IntelliGO", Benabderrahmane et al. use a weighting corresponding to each GO term's EC in addition to its IC [187]. Retrieving only the most informative common ancestor to compute a semantic similarity ignores the possibility that two GO terms can share several common ancestors. These situations result in a loss of information. A possible solution consists in using the average of the IC values of all disjoint common ancestors (DCA) instead of the maximum IC of this common set [207]. For the node-based methods relying on IC, the term frequencies used to compute the IC values depend on the corpus of reference. In the context of gene comparison, IC-based methods have three main limits related to their dependence on a GOA-based corpus. First, it can prove difficult or even impossible to obtain a relevant corpus. GOA provides single- and multi-species annotation tables. Although using a species-specific table is well-suited to intra-species comparisons, it becomes problematic for cross-species comparisons. Second, using a multi-species table (like the UniprotKB table) in these cases is biased towards the most extensively annotated species such as human or mouse. Third, the well-studied areas of biology have high annotation frequencies and are therefore less informative, so their importance is downgraded, whereas the less-studied areas are artificially upgraded [208, 209, 210].
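The Resnik and Lin formulas can be sketched on a toy "is a" hierarchy with a hypothetical annotation corpus; note how the usage counts are propagated upward so that each term's frequency also includes the implicit annotations of its descendants, as required above.

```python
# IC-based Resnik and Lin similarities on a hypothetical toy hierarchy.
import math

PARENTS = {"root": set(), "binding": {"root"}, "transport": {"root"},
           "RNA binding": {"binding"}, "protein binding": {"binding"}}

# Hypothetical corpus: each gene is annotated by one most-specific term.
CORPUS = ["RNA binding", "RNA binding", "protein binding",
          "protein binding", "protein binding", "transport"]

def ancestors(term):
    result, stack = {term}, list(PARENTS[term])
    while stack:
        t = stack.pop()
        if t not in result:
            result.add(t)
            stack.extend(PARENTS[t])
    return result

# Term usage counts, propagated upward: ancestors include implicit uses.
counts = {t: 0 for t in PARENTS}
for annotation in CORPUS:
    for t in ancestors(annotation):
        counts[t] += 1

def ic(term):
    return -math.log(counts[term] / len(CORPUS))

def mica_ic(a, b):
    """IC of the most informative common ancestor of a and b."""
    return max(ic(t) for t in ancestors(a) & ancestors(b))

def resnik(a, b):
    return mica_ic(a, b)

def lin(a, b):
    return 2 * mica_ic(a, b) / (ic(a) + ic(b))

print(round(resnik("RNA binding", "protein binding"), 3),
      round(lin("RNA binding", "protein binding"), 3))
```

The root annotates every gene, so its IC is 0: sharing only the root yields a Resnik similarity of 0, as expected.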

5.1.2.2 Edge-based semantic similarity

Edge-based semantic similarity measures use the directed graph topology to compute distances between the terms to compare. Among the simplest is the Rada distance, based on the length of the shortest path between the two terms [211], with extensions that rely on the average path length among multiple paths [198]. Other approaches take into account the length of the path between the root of the ontology and the least common ancestor (LCA) of the terms, with the result that terms with a deep common ancestor are more similar than terms with a common ancestor close to the root [212, 213, 214, 215, 216]. Edge-based methods using depth as a proxy for precision do not depend on a particular corpus. This can be a strength when it is difficult or impossible to determine a representative corpus, or a weakness when corpus-dependent frequencies are relevant. Moreover, another constraint to consider is that in most ontologies granularity is not uniform, so terms at the same depth can have different precisions; this is typically the case for GO [217].
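A minimal sketch of the Rada distance on a toy hierarchy (the term names are hypothetical): the hierarchy is treated as an undirected graph and the distance is the shortest path length, found here by breadth-first search.

```python
# Rada-style edge-based distance: shortest path length between two terms
# in the undirected "is a" graph (toy hierarchy, hypothetical terms).
from collections import deque

PARENTS = {"root": [], "A": ["root"], "B": ["root"],
           "A1": ["A"], "A2": ["A"], "B1": ["B"]}

# Build an undirected adjacency list from the child -> parent edges.
NEIGHBORS = {t: set(PARENTS[t]) for t in PARENTS}
for child, parents in PARENTS.items():
    for p in parents:
        NEIGHBORS[p].add(child)

def rada_distance(a, b):
    """Shortest path length between two terms (breadth-first search)."""
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        term, dist = queue.popleft()
        if term == b:
            return dist
        for n in NEIGHBORS[term]:
            if n not in seen:
                seen.add(n)
                queue.append((n, dist + 1))
    return None  # disconnected terms

print(rada_distance("A1", "B1"))  # path A1 - A - root - B - B1
```

Siblings under the same parent ("A1", "A2") are at distance 2, whereas terms whose only common ancestor is the root ("A1", "B1") are at distance 4, illustrating why deeper common ancestors yield higher similarity.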

5.1.2.3 Hybrid semantic similarity

Pesquita et al. also identified "hybrid" methods that combine aspects of node-based and edge-based methods. In Wang's method [193], each term has a "semantic value" that represents how informative the term is, in line with the node-based approach. However, the semantic value of a term is obtained by following the paths from this term to the root and summing the semantic contributions of all the ancestors of this term. As the semantic value depends on the ontology topology, it also conforms to the edge-based approach. Note that this alternative approach is corpus-independent, so it is applicable when a relevant corpus cannot be computed (e.g. for comparing elements from several species) or does not exist (e.g. for poorly studied species). The relevance of the results obtained by this approach has previously been demonstrated [193, 198].

For computing a term's semantic value (SV), Wang first computes the semantic contributions of the ancestors of the term. In the following formulas, S_A(t) is the semantic contribution of the term t to the term A, and w_e is the semantic contribution factor of the edge e linking a term t with its child term t'. Following Wang, we use a semantic contribution factor of 0.8 for the "is a" relations and 0.6 for the "part of" relations, and we added a 0.7 factor for the "[positively] [negatively] regulates" relations. An additional study not presented here showed that the value of the regulation factor had minimal impact (+/- 0.01) on the overall value.

$$S_A(A) = 1$$
$$S_A(t) = \max\{w_e \times S_A(t') \mid t' \in \text{children of } t\} \quad \text{if } t \neq A$$

Then, for each target term to compare, the semantic value is the sum of the semantic contributions of all its ancestors. Figure 5.2 shows the semantic contributions of the ancestors of GO:0043231, from which its semantic value is determined: SV(GO:0043231) = 5.5952. The same operation for GO:0005622 gives SV(GO:0005622) = 2.92 (Fig. 5.3 on the following page). The more general a term (i.e. the less informative it is), the smaller its semantic value.
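The recursive definition of S_A can be implemented by propagating contributions from the term of interest up to the root, keeping the maximum over multiple paths. The toy DAG below is hypothetical (term names "T", "P1", "P2", "R" and the edge layout are illustrative, not taken from GO), with Wang's factors of 0.8 for "is a" and 0.6 for "part of".

```python
# Wang's semantic contributions and semantic value on a hypothetical DAG.
PARENT_EDGES = {                      # child -> [(parent, factor)]
    "T":  [("P1", 0.8), ("P2", 0.6)], # T is_a P1, T part_of P2
    "P1": [("R", 0.8)],
    "P2": [("R", 0.8)],
    "R":  [],
}

def semantic_contributions(term):
    """S_term(t) for every ancestor t, via max over all downward paths."""
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent, w in PARENT_EDGES[child]:
            contribution = w * s[child]
            if contribution > s.get(parent, 0.0):
                s[parent] = contribution
                frontier.append(parent)  # re-propagate the larger value
    return s

def semantic_value(term):
    return sum(semantic_contributions(term).values())

# R is reachable via P1 (0.8 * 0.8 = 0.64) and P2 (0.6 * 0.8 = 0.48);
# the max, 0.64, is kept, so SV(T) = 1 + 0.8 + 0.6 + 0.64 = 3.04.
print(round(semantic_value("T"), 4))
```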

Figure 5.2: Semantic contributions of the ancestors of GO:0043231. The terms closer to GO:0043231 contribute more: the farther the ancestor, the smaller its contribution to the term of interest. The semantic value of GO:0043231 is the sum of its ancestors' semantic contributions, here SV(GO:0043231) = 5.5952.

Figure 5.3: Semantic contributions of the ancestors of GO:0005622; here SV(GO:0005622) = 2.92.

$$SV(A) = \sum_{t \in T_A} S_A(t)$$

The terms' semantic values and their ancestors' semantic contributions are used to compute the semantic similarity of two GO terms A and B:

$$\mathit{similarity}_{Wang}(A, B) = \frac{\sum_{t \in (T_A \cap T_B)} (S_A(t) + S_B(t))}{SV(A) + SV(B)}$$

The semantic similarity of a GO term A and a set of GO terms G is the highest similarity between A and each element of G:

$$\mathit{similarity}_{Wang}(A, G) = \max_{t \in G}(\mathit{similarity}_{Wang}(A, t))$$

The semantic similarity of two sets of GO terms G1 and G2 is:

$$\mathit{similarity}_{Wang}(G_1, G_2) = \frac{\sum_{t_1 \in G_1} \mathit{similarity}_{Wang}(t_1, G_2) + \sum_{t_2 \in G_2} \mathit{similarity}_{Wang}(t_2, G_1)}{|G_1| + |G_2|}$$

Pesquita et al. do not single out any particular semantic similarity measure as the best one, as the optimal measure depends on the data to compare and on the level of detail expected in the results. The main advantage of Wang's method compared to purely node-based methods is that the semantic value is not GOA-dependent, unlike information content. It is thus well-suited to cross-species comparisons. As cross-species comparison is one of the key stakes in biology, further developments in the domain of semantic comparison should support such comparisons.

5.1.3 Synthesis

As we have seen, assessing the similarity of two elements can be greatly improved by using ontologies in order to take into account the underlying structure of their annotations. However, similarity alone is not enough for comparing biological pathways between non-model species. We

also need to identify the pathway steps that are similar between the two species but for which at least one of the two has some additional function. This sets up two challenges: being able to quantify similarity and particularity, and being able to determine both whether two elements are similar and whether one of them has some particular function.

In the remainder of this chapter, section 5.2 presents a semantic particularity measure designed to be combined with any semantic similarity measure. The joint use of similarity and particularity makes it possible to refine the comparison of sets based on the annotations of their elements. We show how the similarity of two sets and their respective particularities determine comparison patterns (e.g. the two sets are similar and the second set presents a high particularity). Section 5.3 presents a generic method for determining the optimal similarity and particularity thresholds, minimizing the proportions of false positives and false negatives as well as the abnormal comparison patterns.

5.2 Methodology: semantic particularity measure

This study focuses on the definition of a semantic particularity measure for comparing sets of elements annotated by an ontology. We propose to combine our particularity measure with a similarity measure to first identify the similar sets, and second identify sets with additional functions from among the similar ones. This particularity measure is initially applied to gene set comparison according to the genes' GO annotations. We then show that the principle generalizes to other ontologies. In retrospect, this work is interesting because, whereas semantic similarity has been an active research domain over the last decade, with countless measures and not a single one outperforming the others, our approach refines the analysis by also considering the specificities that are inherently ignored by similarity. Our semantic particularity measure is based on the general notion of informativeness, which can be derived from any semantic similarity measure, so similarity and the associated particularity can be combined for any similarity measure.

This study was originally published in: Charles Bettembourg, Christian Diot, and Olivier Dameron. Semantic particularity measure for functional characterization of gene sets using Gene Ontology. PLoS ONE, 9(1):e86525, 2014 [96].

5.2.1 Context

With the continued advance of high-throughput technologies, genetic and genomics data analyses are producing large sets of genes. The amount of data involved requires automated comparison methods [4]. The characterization of these sets typically consists in a combination of the following three operations [218, 219]: first, synthesize the over- and under-represented functions of these genes [220, 221]; second, identify how these genes interact with each other [222]; third, identify and quantify the common shared features and the differentiating features [223, 224]. A widely used method for gene set study called "Gene Set Enrichment Analysis" (GSEA) determines which gene features are over-represented in a gene set [225]. Numerous tools have been developed for this purpose: BiNGO [226], GOEAST [227], ClueGO [228], DAVID [229], GeneWeaver [230], GOTM [231]; see the recent work by Hung et al. for a review [232]. GSEA is useful for clustering a set of genes into subsets sharing over-represented features. Among these features, the biological processes (BP), molecular functions (MF) and cellular components (CC) annotating each gene are represented using the Gene Ontology (GO) [233]. GO is species-independent,

and thus supports cross-species comparison [30]. The GO graph itself is also widely used for gene semantic similarity analysis [234]. All the semantic similarity measures appear appropriate for identifying and quantifying common features. However, as these measures focus on common features, they may lead to an incomplete analysis when comparing genes having particular features alongside similar ones [235]. For example, parts A and B of Figure 5.4 respectively present the molecular functions annotating the Exportin-5 orthologs of human (hsa) and rat (rno), and the Exportin-5 orthologs of human and drosophila (dme). Wang's method can compute cross-species semantic similarity. The results on MF annotations are: Sim(hsa, rno) = 0.797 and Sim(hsa, dme) = 0.726. This is consistent with the fact that, globally, the Exportin-5 orthologs share the same functions between hsa, rno and dme. However, there are also five times as many human-specific MF terms compared to drosophila as compared to rat. It has been demonstrated that Exportin-5 orthologs are functionally divergent among species [236]. The tiny difference of semantic similarity (0.071) correctly reflects the fact that the orthologs share the same main function, but is not sufficient to identify that some species also have additional functions.

Figure 5.4: Representation of Exportin-5 ortholog annotations. Common terms between species are displayed in blue. The terms annotating only the human ortholog are displayed in red. Part A of this figure displays the MF annotations of the human and rat orthologs of Exportin-5. Part B displays the MF annotations of the human and drosophila orthologs of Exportin-5. In this example, there are no rat- or drosophila-specific terms. The semantic similarity values obtained in these cases do not reflect the difference of human particularity between the two parts.

5.2.2 Objective

We assume that considering only similarity measures is not enough to compare sets of annotations. This analysis is valid for any set of annotations that refer to an ontology. We hypothesize that gene set analysis can be improved by considering gene particularities in addition to gene similarities. We propose a general definition and some associated formal properties. We also propose a new approach based on the notion of GO term informativeness to compute gene set particularities. The original study was composed of three use cases; section 5.2.6 on page 113 summarizes the second one.

• The first use case replicated Wang's study on Saccharomyces cerevisiae tryptophan degradation, which he used when defining his semantic similarity measure on GO. Our results showed that Wang's results are still valid. We also identified a benefit of using a particularity measure in addition to a similarity measure for identifying particular functions between similar genes.

• The second use case covered a larger dataset composed of 51 well-annotated human genes related to aquaporin-mediated transport, in order to determine whether similar genes with particular functions were a frequent situation. Our results showed that among similar genes, some also have particular functions, and that this situation can be observed throughout the full range of similarity values.

• The third use case compared homolog genes across different species. Our results showed that ortholog genes were, as expected, mostly similar. Again, we also identified some of them with high particularity values that denote specific functions. Finally, we identified some orthologs that have diverged and present a low similarity and high particularities.

5.2.3 Definition of semantic particularity

The semantic particularity of a set compared to another is the value that reflects the importance of the features that belong to the first set but not to the second. To compare two genes, we rely on the similarity and the respective particularities of their sets of annotations. The particularity of a gene g1 annotated by the set Sg1 compared to a gene g2 annotated by the set Sg2 depends on the annotations of Sg1 that are not related to any annotation of Sg2.

5.2.4 Formal properties of semantic particularity

As for semantic similarity, we compute a value bounded by 0 (least particular) and 1 (most particular). Four important properties arise from the definition of semantic particularity:

• The semantic particularity is non-symmetric:

Par(Sg1, Sg2) = x ⇏ Par(Sg2, Sg1) = x (Prop 1)

• Compared to itself, a set of annotations has no semantic particularity:

Par(Sg1, Sg1) = 0 (Prop 2)

If Sg1 = ∅, this comparison is meaningless.

• The semantic particularity of a set of annotations Sg1 (6= ∅) is maximal when it is com- pared to an empty set of annotations:

Par(Sg1, ∅) = 1 (Prop 3.1)

And conversely:

Par(∅, Sg1) = 0 (Prop 3.2)

• The particularity of a set Sg1 of annotations compared to a set Sg2 does not depend on the elements of Sg2 that do not belong to Sg1:

Sg3 ∩ Sg1 = ∅ ⇒ Par(Sg1, Sg2) = Par(Sg1, Sg2 ∪ Sg3) (Prop 4)
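The four properties can be sanity-checked with a deliberately naive, ontology-blind stand-in for Par (the proportion of Sg1's terms absent from Sg2). This is not the measure defined in the next section, only an illustration that the properties are jointly satisfiable.

```python
# Sanity check of Props 1-4 with a naive set-based stand-in for Par:
# the proportion of Sg1's terms that are absent from Sg2. This ignores
# the ontology structure but already satisfies the formal properties.
def par(sg1, sg2):
    if not sg1:
        return 0.0
    return len(sg1 - sg2) / len(sg1)

sg1, sg2, sg3 = {"a", "b", "c"}, {"b"}, {"x", "y"}

assert par(sg1, sg2) != par(sg2, sg1)        # Prop 1: non-symmetric
assert par(sg1, sg1) == 0                    # Prop 2: no self-particularity
assert par(sg1, set()) == 1                  # Prop 3.1: maximal vs empty set
assert par(set(), sg1) == 0                  # Prop 3.2: empty set, no particularity
assert sg3 & sg1 == set()                    # precondition of Prop 4
assert par(sg1, sg2) == par(sg1, sg2 | sg3)  # Prop 4: irrelevant terms ignored
print("all four properties hold")
```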

5.2.5 Measure of semantic particularity

In order to compute the particularity of Sg1 compared to Sg2, we focus on the terms of Sg1 that are not members of Sg2. This requires addressing two problems: the terms are not independent, and they do not convey the same amount of information. Some of the terms of Sg1 that are not members of Sg2 may be linked in the graph. Taking several linked terms into account would result in considering them several times. For example, in Figure 5.4B, considering both "RNA binding" and "tRNA binding" would result in counting the contribution of "RNA binding" twice. Therefore, we should only focus on the terms of Sg1 that do not have any descendant in Sg1 and that are not members of Sg2. Some of these terms might be ancestors of terms of Sg2 and should be considered as common to Sg1 and Sg2. We call Sg* the union of Sg and the sets of ancestors of each element of Sg. We call MPT(Sg1, Sg2) the set of most particular terms of Sg1 compared to Sg2. MPT(Sg1, Sg2) is the set of terms of Sg1 that do not have any descendant in Sg1 and that are not members of Sg2*. In Figure 5.4B, MPT(hsa, dme) = {"tRNA binding"}. Note that MPT(hsa, dme) is composed of one term and not five. Using set theory, we could define Par(Sg1, Sg2) as the proportion of elements of Sg1 that belong to MPT(Sg1, Sg2). When computing card(MPT(Sg1, Sg2)), all the elements have the same weight. However, considering the semantics underlying these elements, some of them may be more informative than others and should ideally be emphasized. Different strategies, similar to those already proposed for the computation of semantic similarity, can be applied. We then define PI(Sg1, Sg2), the particular informativeness of a set of GO terms Sg1 compared to another set of GO terms Sg2, as the sum of the differences between the informativeness (I) of each term tp of MPT(Sg1, Sg2) and the informativeness of the most informative common ancestor (MICA) of tp and Sg2.
The PI of a set of terms is the information that is not shared with the other set.

$$PI(Sg1, Sg2) = \sum_{t_p \in MPT(Sg1, Sg2)} \left( I(t_p) - I(MICA(t_p, Sg2)) \right) \quad (5.1)$$

In Figure 5.4B, PI(hsa, dme) = I(tRNA binding) − I(binding). There is no sum in this example since MPT(hsa, dme) only contains one term. We finally normalize PI to compute Par(Sg1, Sg2), the semantic particularity of the set of GO terms Sg1 compared to the set of GO terms Sg2. We define MCT(Sg1, Sg2), the set of the most informative common terms of Sg1 and Sg2, as the set of the terms belonging to the intersection of Sg1* and Sg2* that do not have any descendant in either Sg1* or Sg2*. In Figure 5.4B, MCT(hsa, dme) = {"protein transporter activity", "protein binding"}. Par(Sg1, Sg2) is the ratio of PI(Sg1, Sg2) to the sum of the informativeness of Sg1's most informative terms (i.e. those specific to Sg1 and those common with Sg2; the MICA in the PI formula for the Sg1-specific terms guarantees that the informativeness of common terms is not counted twice).

$$Par(Sg1, Sg2) = \frac{PI(Sg1, Sg2)}{PI(Sg1, Sg2) + \sum_{t_c \in MCT(Sg1, Sg2)} I(t_c)} \quad (5.2)$$

For the example of Figure 5.4B, this formula becomes:

$$Par(hsa, dme) = \frac{I(\text{tRNA binding}) - I(\text{binding})}{(I(\text{tRNA binding}) - I(\text{binding})) + (I(\text{protein transporter activity}) + I(\text{protein binding}))} \quad (5.3)$$

Several measures of informativeness have been proposed. The widely used Information Content (IC) family is based on annotation frequencies determined with an appropriate corpus such as the GOA database. The most frequent terms are considered to be the least informative. When considering Gene Ontology annotations, it is necessary to take the GO subsumption hierarchy into account in order to also consider the implicit annotations to the terms' ancestors [140]. The alternative approach is corpus-independent: a term's informativeness is a function of its distance to the root. It is typically used when a relevant corpus cannot be computed (for comparing elements from several species) or does not exist (for poorly studied species). Wang's Semantic Value (SV) computes this type of informativeness. The relevance of the results obtained by this approach has previously been demonstrated [193, 198]. As shown in equation 5.3, four terms are involved in the calculation of the MF particularity of the human Exportin-5 ortholog compared to the drosophila Exportin-5 ortholog. This comparison is cross-species, so a semantic value-based informativeness measure is relevant. The semantic values of the terms involved in equation 5.3 are: SV(tRNA binding) = 4.201, SV(binding) = 1.8, SV(protein transporter activity) = 2.952 and SV(protein binding) = 2.44. Consequently, we can compute: Par(hsa, dme) = 0.308. Likewise, for Figure 5.4A, Par(hsa, rno) = 0.082.
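Plugging the quoted semantic values into equations 5.1 and 5.2 reproduces the reported figure; this is a direct transcription of the Exportin-5 example, with MPT(hsa, dme) and MCT(hsa, dme) taken as identified in the text rather than recomputed from the ontology.

```python
# Reproducing equation 5.3 from the semantic values quoted in the text
# (SV-based informativeness, Figure 5.4B example).
sv = {
    "tRNA binding": 4.201,                   # only member of MPT(hsa, dme)
    "binding": 1.8,                          # MICA of "tRNA binding" and Sg(dme)
    "protein transporter activity": 2.952,   # MCT(hsa, dme) term
    "protein binding": 2.44,                 # MCT(hsa, dme) term
}

pi = sv["tRNA binding"] - sv["binding"]      # PI(hsa, dme), eq. 5.1
common = sv["protein transporter activity"] + sv["protein binding"]
par_hsa_dme = pi / (pi + common)             # Par(hsa, dme), eq. 5.2

print(round(par_hsa_dme, 3))  # matches the 0.308 reported in the text
```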

5.2.6 Use case: Homo sapiens aquaporin-mediated transport

We aimed to study a large dataset in order to determine the frequency and the importance of pairs of similar genes where (at least) one of them also has a high particularity value. We used a dataset composed of 51 well-annotated human genes involved in the aquaporin-mediated transport pathway for Homo sapiens, based on the list of all involved genes provided by the Reactome database [237]. We computed the Wang similarity and the SV-based particularities for each pair of genes of this list. As the human annotation database is one of the most comprehensive, we also duplicated the study using Lin's measure as an IC-based similarity, and IC as the measure of GO term informativeness for our particularity. Tables 5.1, 5.2 and 5.3 present the average, standard deviation, minimum and maximum values of particularity measured in this study for each branch of GO. We classified these statistics into 20 similarity categories covering all the comparison results from sim = 0.5 to sim = 0.999, with steps of 0.025. As similarity increases, particularity tends to decrease, as expected. In each of the 20 categories in the human aquaporin-mediated transport pathway, some of the genes have an important particularity compared to the others. This demonstrates that our method combining semantic similarity and particularity identifies genes that cannot be identified using only a similarity measure. Figure 5.5 on page 115 illustrates this case, giving the MF annotation graphs of two pairs of genes: AQP8 and AQP5 in part A, and AQP6 and AQP3 in part B. The corresponding similarity and particularity values are presented in table 5.4 on page 116. Both pairs of genes share the same set of common annotations (in blue), and their respective similarity values are close (0.704 for AQP8 and AQP5; 0.696 for AQP6 and AQP3). As AQP8 has no specific annotation, Par(AQP8, AQP5) = 0.
Conversely, AQP5 only has two general specific annotations, so Par(AQP5, AQP8) = 0.19. However, AQP6 and AQP3 each have several precise specific annotations: Par(AQP6, AQP3) = 0.247 and Par(AQP3, AQP6) = 0.415. The two pairs have close similarity values regardless of the method used, but they show very different particularity profiles, with much higher particularities between AQP6 and AQP3 than between AQP8 and AQP5. The two distinct informativeness measures used to compute the particularity led to the same conclusion. These results confirm that among similar genes, some also have some particular functions,

BP               S-value-based particularity        IC-based particularity
Similarity       Average  Std dev.  Min    Max      Average  Std dev.  Min    Max
[0.5-0.524]      0.401    0.2       0.013  0.844    0.562    0.223     0      0.904
[0.525-0.549]    0.386    0.174     0      0.794    0.532    0.284     0      0.89
[0.55-0.574]     0.347    0.199     0      0.707    0.497    0.244     0      0.886
[0.575-0.599]    0.352    0.198     0      0.798    0.502    0.241     0      0.895
[0.6-0.624]      0.315    0.203     0      0.671    0.495    0.208     0      0.794
[0.625-0.649]    0.292    0.145     0      0.629    0.437    0.25      0      0.882
[0.65-0.674]     0.299    0.162     0      0.615    0.439    0.258     0      0.876
[0.675-0.699]    0.229    0.15      0      0.529    0.451    0.216     0.039  0.839
[0.7-0.724]      0.228    0.166     0      0.631    0.403    0.239     0      0.859
[0.725-0.749]    0.22     0.145     0      0.501    0.35     0.233     0      0.727
[0.75-0.774]     0.202    0.108     0      0.482    0.403    0.207     0      0.775
[0.775-0.799]    0.178    0.118     0      0.563    0.319    0.222     0      0.671
[0.8-0.824]      0.177    0.106     0      0.418    0.31     0.209     0.043  0.646
[0.825-0.849]    0.125    0.071     0      0.327    0.258    0.184     0      0.589
[0.85-0.874]     0.105    0.131     0      0.418    0.201    0.136     0      0.625
[0.875-0.899]    0.061    0.066     0      0.248    0.179    0.123     0      0.651
[0.9-0.924]      0.039    0.061     0      0.211    0.207    0.156     0      0.614
[0.925-0.949]    0.041    0.067     0      0.248    0.193    0.181     0      0.572
[0.95-0.974]     0.032    0.041     0      0.111    0.099    0.076     0      0.196
[0.975-0.999]    0.005    0.006     0      0.015    0.077    0.152     0      0.519

Table 5.1: Particularity value statistics in 20 similarity value ranges from case 2 – BP measures.

and show that this situation can be observed throughout the full range of similarity values. Therefore, a particularity measure is a relevant complement to a similarity measure in order to identify similar elements that also present some particular trait.

CC               S-value-based particularity        IC-based particularity
Similarity       Average  Std dev.  Min    Max      Average  Std dev.  Min    Max
[0.5-0.524]      0.353    0.233     0      0.846    0.621    0.244     0      0.911
[0.525-0.549]    0.36     0.214     0      0.819    0.707    0.15      0.185  0.977
[0.55-0.574]     0.33     0.187     0      0.799    0.64     0.202     0      0.897
[0.575-0.599]    0.341    0.185     0      0.752    0.613    0.194     0      0.896
[0.6-0.624]      0.317    0.183     0      0.754    0.621    0.165     0      0.888
[0.625-0.649]    0.268    0.18      0      0.706    0.592    0.207     0      0.852
[0.65-0.674]     0.28     0.177     0      0.656    0.553    0.227     0      0.888
[0.675-0.699]    0.24     0.177     0      0.583    0.495    0.241     0      0.845
[0.7-0.724]      0.13     0.159     0      0.543    0.466    0.24      0      0.825
[0.725-0.749]    0.196    0.151     0      0.579    0.428    0.268     0      0.82
[0.75-0.774]     0.134    0.122     0      0.484    0.383    0.246     0      0.819
[0.775-0.799]    0.15     0.127     0      0.489    0.391    0.267     0      0.768
[0.8-0.824]      0.144    0.093     0      0.269    0.19     0.187     0      0.625
[0.825-0.849]    0.133    0.123     0      0.421    0.352    0.231     0      0.73
[0.85-0.874]     0.146    0.152     0      0.373    0.255    0.216     0      0.624
[0.875-0.899]    0.051    0.051     0      0.11     0.145    0.152     0      0.381
[0.9-0.924]      0.067    0.085     0      0.269    0.095    0.095     0      0.189
[0.925-0.949]    -        -         -      -        -        -         -      -
[0.95-0.974]     -        -         -      -        0.131    0.131     0      0.262
[0.975-0.999]    0.012    0.012     0      0.024    0.049    0.049     0      0.098

Table 5.2: Particularity value statistics in 20 similarity value ranges from case 2 – CC measures.

Figure 5.5: MF annotations of two couples of human aquaporins. Part A: AQP8 and AQP5 share most of their annotations. Part B: AQP6 and AQP3 share numerous molecular functions, but each gene also has particular functions. Note that the sets of common annotations are the same in both situations, leading to close similarity values. The respective semantic particularity values reflect the specific functions of AQP3 and AQP6, enabling the identification of different patterns.

MF             S-value-based particularity       IC-based particularity
Similarity     Average  Std dev.  Min    Max     Average  Std dev.  Min    Max
[0.5-0.524]    0.341    0.26      0      0.798   0.494    0.162     0.296  0.701
[0.525-0.549]  0.35     0.219     0      0.818   0.429    0.212     0      0.703
[0.55-0.574]   0.364    0.32      0      0.731   0.422    0.265     0      0.849
[0.575-0.599]  0.382    0.265     0      0.694   0.378    0.148     0.125  0.591
[0.6-0.624]    0.242    0.079     0.132  0.47    0.397    0.205     0      0.81
[0.625-0.649]  0.207    0.113     0      0.531   0.302    0.145     0.158  0.475
[0.65-0.674]   0.281    0.106     0.117  0.482   0.609    0.137     0.13   0.806
[0.675-0.699]  0.223    0.181     0      0.562   0.453    0.249     0      0.763
[0.7-0.724]    0.26     0.267     0      0.564   0.389    0.248     0      0.806
[0.725-0.749]  0.179    0.176     0      0.482   0.419    0.211     0      0.763
[0.75-0.774]   0.171    0.177     0      0.371   0.315    0.216     0      0.643
[0.775-0.799]  0.125    0.167     0      0.482   0.33     0.241     0      0.777
[0.8-0.824]    0.063    0.056     0      0.137   0.239    0.218     0      0.574
[0.825-0.849]  0.119    0.13      0      0.415   0.316    0.222     0      0.574
[0.85-0.874]   0.041    0.036     0      0.116   0.266    0.175     0      0.531
[0.875-0.899]  0.045    0.05      0      0.126   0.179    0.093     0.086  0.272
[0.9-0.924]    0.024    0.025     0      0.055   0.163    0.153     0      0.388
[0.925-0.949]  0.02     0.026     0      0.086   0.09     0.107     0      0.272
[0.95-0.974]   0.005    0.007     0      0.023   -        -         -      -
[0.975-0.999]  -        -         -      -       -        -         -      -

Table 5.3: Particularity value statistics in 20 similarity value ranges from case 2 – MF measures.

SV-based       AQP6   AQP3      IC-based       AQP6   AQP3
Sim   AQP6     1      0.696     Sim   AQP6     1      0.81
      AQP3            1               AQP3            1
Par   AQP6     0      0.247     Par   AQP6     0      0.531
      AQP3     0.415  0               AQP3     0.388  0

SV-based       AQP8   AQP5      IC-based       AQP8   AQP5
Sim   AQP8     1      0.704     Sim   AQP8     1      0.8
      AQP5            1               AQP5            1
Par   AQP8     0      0         Par   AQP8     0      0
      AQP5     0.19   0               AQP5     0.13   0

Table 5.4: Similarity and particularity values of two couples of genes from case 2. The similarity between AQP6 and AQP3 is very close to the similarity between AQP8 and AQP5, regardless of the method used (SV- or IC-based). However, the particularity profiles obtained for the two couples are very different. Again, the SV-based and IC-based methods led to the same conclusion.

5.3 Methodology: threshold determination for similarity and particularity

As we have seen in Figure 5.5 on the previous page, AQP5 and AQP8 are similar, and so are AQP3 and AQP6. However, AQP3 and AQP6 each exhibit some specific function, contrary to AQP5 and AQP8. This interpretation is supported by the numeric values of their respective

semantic similarities and particularities, as shown in Table 5.4 on the facing page. In order to be able to automate the interpretation of these values, we have to determine the value above which two entities can be considered similar or particular.

This study focuses on a method for determining semantic similarity and particularity thresholds for the interpretation of semantic comparisons. As we have seen in the previous section, this interpretation consists in associating the similarity and particularity values to some similarity and particularity pattern (e.g. two genes are similar and the second gene has a particular function). This section presents the general principle for determining similarity thresholds on the Gene Ontology, and studies the threshold robustness. We then show how this principle is also applicable for determining particularity thresholds on the Gene Ontology. Eventually, we performed an extensive systematic comparison of the thresholds we computed with the traditional 0.5 over the HomoloGene database. This showed that in 5.4% of the comparisons, the thresholds resulted in different patterns. Overall, the new thresholds increased the detection of the “similar with some particularity” pattern, and decreased the number of the inconsistent “similar and both particular” and “neither similar nor particular” patterns. We then focused on the PPAR multigene family and showed that the similarity and particularity patterns obtained with our thresholds discriminated orthologs and paralogs better than those obtained using default thresholds.

In retrospect, this work is interesting because the interpretation of similarity measures usually hinges on implicit thresholds (e.g. “a similarity of 0.83 is high enough to consider that two genes are similar”) or arbitrary ones (e.g. 0.5 for measures in [0;1]). However, no systematic study had been carried out to determine what these thresholds should be. This study proposes a generic method for determining the optimal threshold for semantic similarity measures and their associated particularity. It is applicable to any ontology and any semantic similarity and particularity measure.
In a previous study, we had shown that the ongoing evolution of ontologies such as GO modifies their structure, which in turn can affect the threshold value [118]. Therefore, the thresholds obtained by our method should be regularly updated.

This study was originally published in: Charles Bettembourg, Christian Diot, and Olivier Dameron. Optimal threshold determination for interpreting semantic similarity and particularity: Application to the comparison of gene sets and metabolic pathways using GO and ChEBI. PLoS ONE, 10(7):e0133579, 2015. The original article performed the study on the three axes of the Gene Ontology: Biological Process (BP), Cellular Component (CC) and Molecular Function (MF). Only BP is detailed here. The original article also shows that our threshold determination method is applicable to other ontologies such as the Chemical Entities of Biological Interest ontology (ChEBI) [238]. This is not detailed here.

5.3.1 Context

In the previous section, we proposed to combine semantic similarity measures and a new semantic particularity measure to improve the results of gene set analysis [96]. Data analysis often hinges on a qualitative interpretation of the similarity values in order to contrast similar and dissimilar pairs of genes. This discretization of the similarity and particularity values makes the

interpretation easier. It determines whether a functional difference between two genes is or is not marginal. The main focus of studies to date has been on defining the measures, but there is no extensive study on the interpretation of the values obtained with these measures. Nor has there been any systematic analysis of the optimal threshold value separating similar from dissimilar. As a result, interpretation is frequently based on either an implicit threshold (for example: “a similarity of 0.83 is high enough to consider that two genes are similar” without stating from which value this holds) or an arbitrary one (typically 0.5 for measures in [0;1], even though no mathematical property of the measures supports this choice). There are cases where a threshold of 0.5 may be ill-adapted. For example, the similarity value between protein tyrosine kinase 2 (PTK2) and Ubiquitin B (UBB) is 0.502 using Wang’s similarity measure on their Biological Process (BP) annotations. This value is just above the intuitive mid-interval threshold. These two genes are well annotated, with 73 and 79 distinct BP annotations, respectively. According to Entrez Gene, PTK2 is involved in cell growth and in intracellular signal transduction pathways triggered in response to certain neural peptides or cell interactions with the extracellular matrix, while UBB is required for ATP-dependent, nonlysosomal intracellular protein degradation of abnormal proteins and normal proteins with rapid turnover. These processes cannot be considered “similar”. Consequently, the 0.502 similarity value should not lead us to consider PTK2 and UBB as similar genes according to the BP they participate in. The main factors influencing the similarity values are: granularity differences in GO; topology differences between the BP, MF and CC branches; the quantity and “quality” of gene annotations; and the temporal evolution of GO [118].
There is a need for a systematic study of semantic measure values in order to determine optimal similarity and particularity thresholds for the qualitative part of functional gene set analysis. Note that the method for determining these thresholds should also be applicable to all categories of semantic similarity measures, as well as to ontologies other than GO.

5.3.2 Objective

We propose a generic method to define suitable thresholds based on analysis of the distributions of similarity values. We then extend this method to the semantic particularity measure. We show that our method is applicable to a node-based and a hybrid semantic similarity measure on the Gene Ontology as well as to the corresponding semantic particularity measures. We study the robustness of our method by applying it to multiple sets of genes. We evaluate our method by determining whether the new thresholds lead to different interpretations, and whether these new interpretations are biologically relevant.

5.3.3 Similarity threshold

5.3.3.1 Method for determining similarity thresholds

We first present the general process. We then provide more details about steps two and three.

General process

Figure 5.6 on page 120 illustrates the process for determining a similarity threshold. This process is composed of three steps:

1. Define at least two different groups of genes for species of interest. Within a group, the genes should share some common characteristics. Genes from different groups should share as few characteristics as possible.

2. (a) In each group, compute the similarities between each pair of genes (i.e. the intra-group similarities). Gather all the similarity results to obtain an S distribution of similar genes.

   (b) Compute the similarities between each combination of a gene from the first group and a gene from a second group (i.e. the inter-group similarities). Gather all the similarity results to obtain an N distribution of non-similar genes.

3. If the ranges (min, max) of the S and N distributions do not overlap, define the threshold τsim using any value between τS (the lowest value of S) and τN (the highest value of N). Else, there are some false negatives (FN) and some false positives (FP):

   (a) Compute the proportion of FN in the S distribution for all samples of the similarity threshold between τN and τS. In this step, consider every value under the similarity threshold as a FN.

   (b) Compute the proportion of FP in the N distribution for all samples of the similarity threshold between τN and τS. In this step, consider every value above the similarity threshold as a FP.

   (c) For each possible threshold value, sum the FN and FP proportions obtained in steps 3a and 3b. The similarity threshold τsim is the threshold that minimizes this sum.
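The minimization of step 3 can be sketched as follows. `find_similarity_threshold` is a hypothetical helper, not the code used in the study; for determinism, the candidate thresholds are restricted to the observed similarity values inside the overlap interval, a simplification of sampling the whole interval:

```python
def find_similarity_threshold(S, N):
    """Overlap case of step 3: min(S) < max(N).

    Return the threshold minimizing the sum of the FN proportion in S
    (similar pairs falling under the threshold) and the FP proportion
    in N (non-similar pairs falling above it)."""
    lo, hi = min(S), max(N)                       # bounds of the overlap interval
    candidates = sorted(v for v in S + N if lo <= v <= hi)

    def cost(t):
        fn = sum(1 for s in S if s < t) / len(S)  # proportion of false negatives
        fp = sum(1 for n in N if n > t) / len(N)  # proportion of false positives
        return fn + fp

    return min(candidates, key=cost)              # first minimizer in ascending order
```

Proportions rather than raw counts are summed because, as stressed in step 3c, the S and N distributions have different sizes.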

Constitution of the S and N distributions

We ran a statistical test to determine whether the S and N distributions obtained at step 2 are significantly different. As we cannot consider that the S and N variances are similar, we used an unequal variance t-test (Welch’s t-test), which is the recommended test when considering different-sized distributions like S and N. Welch’s t-test performs better than Student’s t-test when the variances are unequal, yet still performs on a par with Student’s t-test when the variances are equal [239]. If the test concludes that the S and N distributions are not significantly different, the process has to be restarted at its first step.
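In practice one would call `scipy.stats.ttest_ind(S, N, equal_var=False)`; the statistic it computes can be sketched in plain Python as follows (`welch_t` is an illustrative name, not part of the original study):

```python
import math
from statistics import mean, variance

def welch_t(S, N):
    """Welch's unequal-variance t statistic and its Welch-Satterthwaite
    degrees of freedom, for two samples of possibly different sizes."""
    m_s, m_n = mean(S), mean(N)
    v_s, v_n = variance(S), variance(N)   # sample variances (n - 1 denominator)
    n_s, n_n = len(S), len(N)
    se2 = v_s / n_s + v_n / n_n           # squared standard error of the difference
    t = (m_s - m_n) / math.sqrt(se2)
    df = se2 ** 2 / ((v_s / n_s) ** 2 / (n_s - 1) + (v_n / n_n) ** 2 / (n_n - 1))
    return t, df
```

The p-value is then obtained from Student's t distribution with `df` degrees of freedom, which is what `scipy` does internally.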

Overlap of the S and N distributions

The minimization at step 3c has to be done on FN and FP proportions as the N and S distributions have different sizes. When comparing the distributions of similar genes (S) to non-similar genes (N), if the minimum value of S is smaller than the maximum value of N, then the S and N distributions overlap and any threshold would lead to FPs or FNs. Figure 5.7 on page 121 illustrates the case without overlap, where min(S) = a, max(N) = b and a > b. A similarity value greater than a means that the genes compared are similar. A similarity value lower than b means that the genes compared are non-similar. A similarity value between a and b means that the genes compared are nearly similar and thus require expert opinion to interpret the result. Figure 5.8 on page 121 illustrates the case where the S and N distributions overlap, meaning that there are some FPs (i.e. pairs of genes from N that are non-similar but that have a similarity value greater than a) and FNs (i.e. pairs of genes from S that are similar but have a similarity value lower than b). In this case, a similarity value lower than a means that the genes compared are non-similar. A similarity value greater than b means that the genes compared are similar. Again, expert opinion would be required to interpret the result in this interval. However, in this case, it is possible to determine the threshold value that minimizes both FP and FN. We established a general framework that defines three threshold values:

Figure 5.6: Flowchart for threshold determination. 1) Define at least two distinct groups of genes expected to be similar. 2) Compute the intra- and inter-group similarities and compile the results into S and N distributions. If these two distributions are significantly different, the groups of genes are relevant. 3) If S and N do not overlap, define threshold τsim using any value between τS (the lowest value of S) and τN (the highest value of N). Else, considering every value under the threshold as FN and every value above the threshold as FP, compute the FN proportion in the S distribution (3a) and the FP proportion in the N distribution (3b) for all samples of the similarity threshold between τN and τS. 3c) For each possible threshold value, sum the FN and FP proportions obtained in steps 3a and 3b. The similarity threshold τsim is the one that minimizes this sum.

• τS = max(a, b) is the threshold value above which the two compared genes are similar. There cannot be any FP above τS, but there may be some FN below τS if a < b.

• τN = min(a, b) is the threshold value under which the two compared genes are non-similar. There cannot be any FN below τN , but there may be some FP above τN if a < b.

• τsim is the threshold value located between τS and τN that minimizes the proportion of FP and FN. As τsim gets closer to τS, there will be more FN and fewer FP. Conversely, as τsim gets closer to τN, there will be more FP and fewer FN. τsim has to be computed using the proportions of FP and FN, as the S and N distributions have different sizes.

Figure 5.7: Ideal case of threshold determination. The threshold should be located between the lowest whisker of the similar distribution (a) and the upmost whisker of the non-similar distribution (b).

Figure 5.8: Overlap case of threshold determination. The similar and non-similar boxes overlap. In this case, there are false-positive and false-negative results between the lowest whisker of the similar distribution (a) and the upmost whisker of the non-similar distribution (b).

We applied this method to compute Lin’s and Wang’s semantic similarity thresholds on GO, as well as the corresponding IC-based and SV-based semantic particularity thresholds on GO. For all the pairs of genes compared, we used the GO annotations from the August 2013 version

of GOA. We computed Lin’s similarity with the GOSemSim R package [240] (version 1.18.0) using its GO and IC tables and the best-match average approach to compare genes. Pesquita et al. showed that the best-match average approach performs best [198]. We computed Wang’s similarity, IC-based particularity and SV-based particularity using an in-house implementation of each measure and the August 2013 version of GO.
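The best-match average strategy used to lift a term-level measure to the gene level can be sketched as follows (the function names are illustrative; `term_sim` stands for any term-level measure such as Lin's or Wang's):

```python
def best_match_average(term_sim, terms_a, terms_b):
    """Best-match average (BMA): each annotation term of gene A is matched
    to its most similar term of gene B and vice versa; the gene-level
    similarity is the mean of the two directional averages."""
    best_a = [max(term_sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(term_sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) / len(best_a) + sum(best_b) / len(best_b)) / 2
```

Averaging both directions keeps the result symmetric even when the two genes have different numbers of annotations.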

5.3.3.2 BP similarity threshold using two groups of similar genes

We studied the similarity values obtained when comparing genes known to be functionally close and genes without functional proximity. This study was performed using a hybrid semantic similarity measure (Wang) and a node-based measure (Lin). Figure 5.9 on the facing page presents the distribution of the BP similarity values obtained for two intra-family comparisons and the corresponding inter-family comparisons. The two PANTHER families were “neurotransmitter gated ion channel” (pthr18945) and “tyrosine-protein kinase receptor” (pthr24416). As expected, similarity values obtained using either Wang’s (Figure 5.9A) or Lin’s measure (Figure 5.9B) were significantly higher in the intra-family comparisons than in the inter-family comparisons (Welch’s t-tests). We observed an overlap between the S and N distributions, which corresponds to the situation shown in Figure 5.8 on the previous page. τN was located at the lowest whisker of the intra-family S blue box, i.e. 0.096 with Wang’s measure and 0.364 with Lin’s measure. τS was located at the upmost whisker of the inter-family N yellow box, i.e. 0.519 with Wang’s measure and 0.588 with Lin’s measure. We also determined the optimal similarity threshold value τsim that minimizes the sum of FP and FN proportions. Figure 5.10 on page 124 reports the results for Wang’s and Lin’s measures. The minimum ordinate value of the curves gives the threshold for BP using Wang’s (0.42) and Lin’s (0.49) measures, respectively. We used a similar approach for CC and MF; see the original article.

5.3.3.3 Robustness of threshold determination

The more groups we build to constitute the S and N distributions, the more reliable the thresholds obtained become. We generalized the above-described process using six groups of similar genes for BP in order to determine τS, τN and τsim for Wang’s and Lin’s measures. We computed the S distribution gathering the similarity values of each pair of genes inside six different PANTHER families. These families were “histone h1/h5” (pthr11467), “g-protein coupled receptor” (pthr12011), “neurotransmitter gated ion channel” (pthr18945), “tyrosine-protein kinase receptor” (pthr24416), “phosphatidylinositol kinase” (pthr10048) and “sulfate transporter” (pthr11814). We computed the fifteen distributions corresponding to all the combinations of gene similarity values from two of the previous six families. Each of these distributions is composed of the similarity values between each gene from the first family and each gene from the second family. We combined all these inter-family similarity values into a global N distribution. In each previous case, the S and N distributions overlapped, so defining a threshold in this interval yields some FPs and some FNs. We determined the optimal similarity threshold value that minimizes the sum of FP and FN proportions. Figure 5.11 on page 124 reports the results for Wang’s SV-based measure and for Lin’s IC-based measure. The minimum ordinate value of each curve gives the threshold for BP, MF and CC using Wang’s and Lin’s measures, respectively. These similarity thresholds differed according to the similarity measure used. They also differed between BP, MF and CC. This can be explained by the different levels of complexity of these three branches [118]. It is possible to use one of the three proposed thresholds

Figure 5.9: Intra- and inter-family semantic similarity distributions using two families of similar genes. Part A presents the results obtained using Wang’s measure and part B presents the results obtained using Lin’s measure. In both parts, the left side separately presents the two intra-family distributions in blue and the inter-family distribution in yellow. The right side presents the S distribution that gathers all the intra-family similarity values in blue and the N distribution that gathers all the inter-family similarity values in yellow.

Figure 5.10: Determination of Wang’s similarity threshold (left) and Lin’s similarity threshold (right) using two families of similar genes. The minimum of false-positive and false-negative proportions gives the similarity threshold (τsim).

(τN , τS and τsim) depending on the accuracy needed to interpret the semantic similarity results. None of these thresholds is equal to the intuitive “default” threshold of 0.5.

Figure 5.11: Determination of Wang’s similarity threshold (left) and Lin’s similarity thresh- old (right). The minimum of false-positive and false-negative proportions gives the similarity threshold (τsim). The overlapping parts of the boxplots (between τN and τS) are shown in the lower part of the figure. The thresholds are located between the similar and non-similar boxes.

We validated our study using a leave-one-out approach that consisted in successively recomputing the thresholds using all the sets but one. This approach provides an evaluation of threshold stability. The thresholds varied slightly over the different datasets. The BP similarity threshold varied between 0.4 and 0.435. The MF similarity threshold remained stable at 0.41, except when not taking into account the family of genes related to neurotransmitter gated ion channels (0.49). The CC similarity threshold was between 0.475 and 0.515.
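The leave-one-out protocol itself is simple to sketch (hypothetical names; `compute_threshold` stands for the whole threshold-determination procedure applied to a collection of gene groups):

```python
def leave_one_out_thresholds(groups, compute_threshold):
    """Stability check: recompute the threshold once per group, each time
    leaving that group out, and report the spread of the results."""
    thresholds = [compute_threshold(groups[:i] + groups[i + 1:])
                  for i in range(len(groups))]
    return min(thresholds), max(thresholds)
```

A narrow (min, max) spread indicates that no single gene family dominates the threshold.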

5.3.4 Particularity threshold

5.3.4.1 Method for determining particularity thresholds

In addition to the similarity threshold determination, we used the same approach to compute semantic particularity thresholds on BP, CC and MF in order to determine the comparison profile of two genes G1 and G2. The procedure consisted in comparing each value of the triple (Similarity(G1,G2); Particularity(G1,G2); Particularity(G2,G1)) with its respective threshold (noted “+” if the value is greater than the threshold, and “-” otherwise). The results of comparing two genes on their similarity and particularity values can be classified into eight distinct patterns described in Table 5.5. A comparison should not result in a “+ + +” nor a “- - -” pattern. Indeed, a “+ + +” pattern would mean that the two genes compared share enough features to be considered similar yet, at the same time, that each has enough particular features for both to be considered particular. Conversely, a “- - -” pattern would mean that the two genes compared are neither similar nor particular.

Notation   sim(A, B)   par(A, B)   par(B, A)
+ + +      > τsim      > τpar      > τpar
+ + -      > τsim      > τpar      < τpar
+ - +      > τsim      < τpar      > τpar
+ - -      > τsim      < τpar      < τpar
- + +      < τsim      > τpar      > τpar
- + -      < τsim      > τpar      < τpar
- - +      < τsim      < τpar      > τpar
- - -      < τsim      < τpar      < τpar

Table 5.5: Patterns of similarity and particularity. The results of a semantic comparison of gene annotations can be classed into eight macro-patterns according to similarity and particularity values. The first sign is a “+” if the similarity is greater than or equal to the similarity threshold τsim, or a “-” otherwise. The two other signs depend on the two particularity values, a “+” for a particularity greater than the particularity threshold τpar or a “-” otherwise.
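Encoding a comparison into one of these eight patterns is straightforward; following the convention of Table 5.5, a "+" means that the value is greater than or equal to its threshold (`comparison_pattern` is an illustrative helper, not code from the study):

```python
def comparison_pattern(sim, par_ab, par_ba, tau_sim, tau_par):
    """Encode the comparison of genes A and B as one of the eight patterns
    of Table 5.5, e.g. '+ - +'."""
    def sign(value, tau):
        return "+" if value >= tau else "-"
    return " ".join((sign(sim, tau_sim), sign(par_ab, tau_par), sign(par_ba, tau_par)))
```

For instance, with the BP thresholds reported in this chapter (τsim = 0.42, SV-based τpar = 0.515), a couple with similarity 0.7 and particularities 0.1 and 0.6 would be classed as "+ - +": similar, with the second gene also particular.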

We applied the threshold determination process described in Figure 5.6 on page 120 to obtain a particularity threshold. For the first step, we composed the same gene groups as those used to compute the similarity threshold. For the second step, we computed all the intra-group and inter-group particularity values between all possible pairs of genes. At the third step, we did not consider any FPs or FNs, as genes belonging to the same group can have some degree of particularity even if they are similar. However, knowing the similarity threshold, we computed the proportion of “+ + +” and “- - -” patterns found in the results while the particularity threshold varied. We computed the particularity threshold τpar using the similarity threshold τsim. For step 3c, we summed the “+ + +” and “- - -” proportions for each possible particularity threshold value. The particularity threshold τpar was the one that minimized this sum.
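This particularity-threshold selection can be sketched as follows (`find_particularity_threshold` is a hypothetical helper; for determinism the candidate thresholds are restricted to the observed particularity values):

```python
def find_particularity_threshold(triples, tau_sim):
    """Given (sim, par_ab, par_ba) triples and a fixed similarity threshold,
    return the particularity threshold minimizing the proportion of the
    uninformative '+ + +' and '- - -' patterns."""
    candidates = sorted({p for _, p1, p2 in triples for p in (p1, p2)})

    def uninformative(t):
        bad = 0
        for sim, p1, p2 in triples:
            if sim >= tau_sim and p1 >= t and p2 >= t:
                bad += 1      # '+ + +': similar, yet both particular
            elif sim < tau_sim and p1 < t and p2 < t:
                bad += 1      # '- - -': neither similar nor particular
        return bad / len(triples)

    return min(candidates, key=uninformative)
```

Unlike the similarity case, no pair is counted as a FP or FN here: the criterion is only the proportion of the two contradictory patterns.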

5.3.4.2 Computation of particularity thresholds

The variation of the “+ + +” and “- - -” profiles in our datasets was studied using the similarity threshold τsim obtained in the previous section and sampling the value of τpar, the particularity threshold. Table 5.6 on the following page gives the particularity thresholds (τpar) minimizing the sum of “+ + +” and “- - -” patterns for SV-based and IC-based approaches. These thresholds differed between BP, MF and CC and between approaches (Figure 5.12).

     SV-based particularity threshold   IC-based particularity threshold
BP   0.515                              0.68
MF   0.485                              0.66
CC   0.335                              0.6

Table 5.6: Semantic SV-based and IC-based particularity thresholds. These thresholds minimize the proportions of non-informative “+ + +” or “- - -” patterns according to Table 5.5.

We performed the leave-one-out study in order to assess the stability of the particularity threshold by removing one gene set from our datasets and re-computing the particularity threshold. This analysis was performed on BP, MF and CC. The thresholds varied slightly among the different datasets:

• BP particularity threshold was between 0.49 and 0.515;

• MF particularity threshold was between 0.35 and 0.485;

• CC particularity threshold was between 0.28 and 0.335.

Figure 5.12: Determination of the SV-based particularity threshold (left) and the IC-based particularity threshold (right). The minimum of “+ + +” and “- - -” pattern proportions gives the particularity threshold.

5.3.5 Evaluation of the impact of the new thresholds on HomoloGene

The evaluation study involved first quantifying the extent of the changes resulting from using the thresholds computed by our method instead of the default 0.5, and then determining whether these changes are biologically relevant.

5.3.5.1 Large-scale evaluation of the impact of threshold changes

We evaluated the impact of our new GO similarity and particularity thresholds over all the intra-group gene comparisons of the HomoloGene database. HomoloGene is a system that automatically detects homologs, including paralogs and orthologs, among the genes of 21 fully-sequenced eukaryotic genomes [241].

Table 5.7 summarizes the results for BP. It provides the number of pairs of genes changing from one pattern of Table 5.5 to another when using τsim and τpar instead of the default value of 0.5. We have not distinguished the “+ + -” and “+ - +” categories nor the “- + -” and “- - +” categories, as the order of the particularity values in the results of this study is meaningless. All the categories of patterns described in Table 5.5 were impacted by the change of thresholds. The greatest size increase concerned the “+ + - or + - +” category (+26.2% for BP). The total number of “+ + +” and “- - -” cases, which are the least-informative cases, decreased (-11.2% for BP).

                     Pattern obtained with the new thresholds
With 0.5 thresholds  + - -    + + - or  + + +   - + +   - + - or  - - -   Total
                              + - +                     - - +
+ - -                268,471  0         0       0       0         0       268,471
+ + - or + - +       1,780    54,168    0       0       0         0       55,948
+ + +                7        270       2,623   0       0         0       2,900
- + +                2        154       2,254   10,374  304       1       13,089
- + - or - - +       177      16,027    0       0       32,578    102     48,884
- - -                2,883    0         0       0       0         1,401   4,284
Total                273,320  70,619    4,877   10,374  32,882    1,504   393,576

Table 5.7: Evolution in patterns in results on HomoloGene intra-group BP comparisons. Numbers of pairs of genes changing from one pattern to another when considering our optimal similarity and particularity thresholds instead of the default value of 0.5. The most important transition consists in 16,027 results moving from the “- + - or - - +” category (size decreased by 32.7%) to the “+ + - or + - +” category (size increased by 26.2%). The new thresholds give more “+ + +” results but fewer “- - -” results. Globally, the sum of the numbers of the “+ + +” and “- - -” patterns has decreased (-11.2%).

Overall, on BP, CC and MF, the change of thresholds:

• deeply impacted the distribution of the HomoloGene intra-group comparison results between the different patterns;

• resulted in an important transition from the “- + - or - - +” to the “+ + - or + - +” patterns;

• resulted in fewer “+ + +” and “- - -” cases.

Analysis of relevance on the PPAR multigene family We measured similarity and particularity values of PPARα, PPARβ and PPARγ among six species. Each gene was only annotated by one or two CC terms, so we kept CC results out of this study. All our similarity values were greater than τsim (which is not surprising as we are considering genes from the same family). Consequently, in order to bring out similarity differences between orthologs and paralogs, we had to use the more stringent τS. This threshold guarantees that the results above it indicate two similar genes. However, the only conclusion that can be inferred for the gene comparisons resulting in values between τsim and τS is that there is doubt over whether these genes are similar. The results of inter-ortholog comparisons systematically matched a “+ - -” pattern, as expected. In contrast, the results of inter-paralog comparisons included some values lower than τS and greater than τpar, resulting in “+ + -”, “- + -” and “- - +” patterns. Consequently, the thresholds we computed for the similarity and particularity measures resulted in patterns consistent with the ortholog conjecture for the PPAR gene family.
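The threshold-based pattern assignment used throughout this section can be sketched as follows (a minimal illustration: the threshold values and function name below are placeholders, not the ones computed in the study):

```python
# Classify a pair of genes into a "+/-" pattern from its similarity value
# and its two particularity values, given thresholds tau_sim and tau_par.
# The numeric thresholds below are hypothetical placeholders.
TAU_SIM = 0.41
TAU_PAR = 0.46

def pattern(sim, part1, part2, tau_sim=TAU_SIM, tau_par=TAU_PAR):
    """Return a pattern such as '+ + -': similarity first, then both particularities."""
    signs = ["+" if sim >= tau_sim else "-",
             "+" if part1 >= tau_par else "-",
             "+" if part2 >= tau_par else "-"]
    return " ".join(signs)

# Two similar genes, one of which has a noteworthy particularity:
print(pattern(0.72, 0.60, 0.10))  # prints + + -
```

Counting, for each pair of genes, the pattern obtained with the default 0.5 thresholds and the pattern obtained with the optimized thresholds is what produces a transition table such as Table 5.7.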

5.4 Synthesis

What we learned

• Similarity and particularity metrics provide an objective measure for the comparison of two elements.

• This process can be improved by using ontologies in order to take existing knowledge into account.

• Using semantic particularity as a complement to semantic similarity further refines the analysis when peculiarities are also of biological interest.

• Having a numeric value for similarity and particularity is valuable because it supports ranking the elements to compare. In our study, we were interested in sorting the pathway steps from the most similar to the least similar, and among the similar ones, from the most particular to the least particular.

• Surprisingly, the next step in the analysis involved a coarse discretization in order to distinguish the similar elements from the dissimilar ones, and the particular from the non-particular. Although this is done on a regular basis in life science articles, no sound method existed to determine how similar (resp. particular) two elements should be in order to be considered similar (resp. particular). We proposed an empirical method applicable to any similarity and particularity metrics, over any ontology. This method generated thresholds that are different from the usual implicit thresholds, biologically relevant, rather robust (i.e. choosing a slightly different threshold only has a small impact on performance), and that can be recomputed when ontologies evolve.

Chapter 6

Conclusion and research perspectives

Since 2003, my research interests have gradually evolved from the representation of symbolic knowledge in semantically-rich formalisms and the associated reasoning, to the development of similarity- and particularity-based reasoning on semantically simpler ontologies. This transition resulted both from my growing interest in bioinformatics, and from the fortunate conjunction at that time of biological data becoming increasingly available as part of the (open) linked data initiative (which is more difficult in the medical domain), and of the release of SPARQL 1.1. Indeed, in bioinformatics, answering biologically-relevant questions involves data integration and comparison rather than classification, and, as we have seen, SPARQL 1.1 supports most of the needs for querying and integrating data annotated with simple ontologies.

In this context, the reasoning methods I developed gave encouraging preliminary results on several projects Dyliss is currently involved in. However, both our production and our usage of linked data are still fragmentary, ad hoc and incomplete. It is becoming clear that in each project, we are facing the same limitations.

My research perspectives build on my previous works to tackle the challenges of producing and querying linked data, as well as developing semantic-based methods for analyzing complex life science data.

The first strategic requirement consists in setting up an environment for representing our research data as linked data, ideally with the support of the GenOuest platform. This task encompasses the conversion of the data as well as the development of a virtual research environment. Beyond the engineering aspect, the open research challenge lies in the development of a consistent data and metadata management methodology.

The second strategic requirement consists in querying these data. Again, we expect that some data analysis patterns are common to several projects, which requires storing and sharing them independently from the datasets. Moreover, we should be able to formulate new relevant biological questions that used to be out of reach when data were more scarce and when combining and processing data was more difficult than it currently is (and will be as we make progress on the first requirement). Eventually, the combination of structurally and semantically-rich data becoming available and of complex queries calls for tools capable of abstracting this complexity for the user.

The third strategic requirement focuses on methods for analyzing the data uncovered by the complex queries of the previous requirement. The results of such queries are typically so large and complex that they are unusable as such, until we develop dedicated analysis methods. Again, this is a general problem, so I expect these methods to hinge on a core of generic reasoning primitives that will probably involve domain knowledge for interpretation and will be applicable to multiple projects, at least for metabolic network analysis.

6.1 Producing and querying linked data

Over the last few years, most of the major life science data and knowledge consortia have provided access to some RDF version of their data: pathway databases such as Reactome [237], Wikipathway [242] and *CYC [243] are available in the BioPAX format [244, 245, 246] (as they remain incomplete, their integration is desirable but remains a challenge of its own [247, 248, 249]). Others even provide dedicated SPARQL endpoints that will support federated queries: Uniprot1 [250], resources from the EBI2 [251] (currently BioModels, BioSamples, ChEMBL, Expression Atlas and Reactome, with others in preparation) or PubChemRDF [252]. Moreover, initiatives such as identifiers.org3 simplify the integration of life science data identifiers from different sources [64]. Finally, repositories such as bio2rdf4 [61, 63] and BioPortal5 [49] offer uniform access to 35 datasets and 442 ontologies, respectively. However, these are typically resources that we use when analyzing our data, but none of our own data are currently in RDF. This makes the analysis cumbersome, as we have to develop ad hoc conversion scripts that hamper exploratory work. I intend to address the following two challenges that we encounter repeatedly:

• incorporate the data we work with into the linked data framework for our direct benefit (analyzing our data better and giving them better visibility) as well as for the

1 http://sparql.uniprot.org/
2 https://www.ebi.ac.uk/rdf/platform
3 http://identifiers.org/
4 https://github.com/bio2rdf
5 http://bioportal.bioontology.org/

benefit of the community [8]

• once (linked) data are available, we still have to invent the querying that takes full advantage of the linked data framework. This encompasses two problems: (1) a bioinformatics one, about formulating new biological questions that are relevant but used to be out of reach for lack of available data and querying capabilities [253], and (2) a computer science one, about providing an infrastructure for representing these data and for supporting their new querying (which will most probably involve the Semantic Web). I expect to focus on the first one but will rely extensively on the second one, with some possible marginal contributions.

6.1.1 Representing our data as linked data

Incorporating our data into the linked data framework consists in storing and sharing them as well as linking them to other resources such as genes, pathways, RNA fragments, taxons, molecules or proteins. While data storage will rely on available technical solutions such as Virtuoso6 or Fuseki7, making the relations to other resources explicit and integrating everything into an E-Science context that remains to be developed goes beyond simple engineering. As we see below, this challenge is common to many projects.

6.1.1.1 Converting data into RDF

MiRNAdapt on aphids The MiRNAdapt project8, led by Denis Tagu from INRA, aims at studying how aphids’ gene expression adapts to changes of the local environment such as seasons, and particularly genic regulation during pea aphid embryogenesis. The project produced large quantities of data about messenger RNA, microRNA, piRNA and long non-coding RNA expression levels, as well as epigenetic marks such as histone and DNA methylation. These data are stored in 16 tabulated flat files totaling 6,160,765 lines. Analyzing these data requires being able to query them uniformly even if they were obtained separately, as well as connecting them with external resources [254]. Currently, the biologists import the files as spreadsheets in order to be able to process them. The processing must be manually adapted and repeated for each new query, which usually takes two to three hours each time before computation can take place. Since 2014, with Fabrice Legeai, Anthony Bretaudeau and Charles Bettembourg, we imported the information in RDF (45,278,179 triples) and stored it in a triplestore [101]. This process only needs to be done once. We were then able to write SPARQL queries for each of the 6 use cases, which demonstrated that SPARQL has the necessary expressivity. Writing each query took only a few minutes, and the queries can be reused and adapted, which simplifies analysis, particularly for exploratory hypotheses. For this proof of concept, the flat file conversion to RDF was performed with ad hoc scripts. We are investigating how to streamline this process, e.g. using tarql9 or directly in the triplestore10.
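As an illustration of this kind of conversion, the following sketch turns rows of a tab-separated expression file into N-Triples. The column names, base URI and predicate names are hypothetical, not those of the actual MiRNAdapt conversion scripts:

```python
import csv
import io

# Hypothetical namespace; the real scripts use project-specific URIs.
BASE = "http://example.org/mirnadapt/"

def row_to_triples(row):
    """Yield N-Triples lines for one row of a tab-separated expression file."""
    subject = f"<{BASE}gene/{row['gene_id']}>"
    yield f'{subject} <{BASE}expressionLevel> "{row["level"]}" .'
    yield f'{subject} <{BASE}measuredIn> <{BASE}condition/{row["condition"]}> .'

tsv = "gene_id\tlevel\tcondition\ng1\t12.5\tspring\n"
for row in csv.DictReader(io.StringIO(tsv), delimiter="\t"):
    for triple in row_to_triples(row):
        print(triple)
```

Once loaded in a triplestore, the resulting triples can be queried uniformly with SPARQL, regardless of which flat file they originally came from.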

EPICLUB on Brassicaceae The EPICLUB project led by Mélanie Jubault from INRA aims to determine the respective contributions of epigenetics and genetics to the response of Brassicaceae (cabbages, broccoli, cauliflower, Brussels sprouts, radishes, ...) to clubroot, a common disease

6 http://virtuoso.openlinksw.com/
7 https://jena.apache.org/documentation/fuseki2/
8 http://www6.rennes.inra.fr/igepp_eng/RESEARCH-TEAMS/Ecology-and-Genetics-of-Insects/Projects/MiRNAdapt2
9 http://tarql.github.io/
10 http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtCsvFileBulkLoader

caused by a protist called Plasmodiophora brassicae. The project requires an infrastructure for managing and integrating a large quantity of data including Brassicaceae genome sequences with their orthology and synteny relations, major resistance genes and QTL, and transcriptomics, metabolomics and epigenomics data. Currently, the data are available as text and csv files as well as spreadsheets. With Aurélie Évrard and Mélanie Jubault, we will follow an approach similar to the MiRNAdapt project.

Patient care trajectories The PEPS platform (plateforme pharmaco-épidémiologie des produits de santé) led by Emmanuel Oger (CHU Rennes) aims at providing an infrastructure for performing large-scale pharmacoepidemiology studies based on French national medico-administrative databases such as SNIIRAM (Système National d’Information Inter-Régime de l’Assurance Maladie, the French equivalent of the National Health Insurance Cross-Schemes Information System NHI-CIS) for healthcare reimbursement (e.g. drug prescriptions, medical transports) and PMSI (Programme de Médicalisation des Systèmes d’Information) from hospital discharge information systems (e.g. diagnoses and procedures). In collaboration with Nolwenn Le Meur and Yann Rivault (EHESP), we focused on detecting complications for patients having a day surgery (i.e. a surgery that does not require an overnight hospital stay) between January and December 2012, with their follow-up data over 2013. The dataset concerned 1,389,271 patients and 1,636,445 instances of procedures. We wanted to determine which patients had an outpatient surgery, and among them which ones had a pattern suggesting a possible complication (for example an antibiotics prescription or another hospitalization in the following days). The data were too big to be handled by R (which was not a surprise), and our need to use ontologies about procedures, drugs or diseases made the use of a relational database impractical. We converted the data into RDF and linked them with ontologies such as ATC for drugs, CCAM for procedures and ICD10 for diagnoses. We wrote SPARQL queries to retrieve the patients of interest and their related information [105]. We used R to perform the statistical analysis in order to identify determinants of complications. As in the two previous projects, this turned out to be a relevant solution that supported the need for data integration and data analysis on large datasets.
Further work will continue on data representation and on data analysis with the beginning of Yann Rivault’s PhD thesis, which will focus on the analysis of patients’ care trajectories. This will expand previous work with Gautier Defossez and Alexandre Rollet on the temporal representation of the care trajectories of breast cancer patients using data from a regional information system [255], and will benefit from a collaboration with Thomas Guyet, David Gross-Amblard and Yann Dauxais (IRISA).

Synthesis These projects require some graph querying and traversal capabilities, as well as some integration with other resources, for which RDF is well adapted [22]. Some of the analysis methods in use or in development also require graph topology functions such as finding the maximal cliques, or involve Answer Set Programming, for which RDF may not be the optimal data representation format. Determining whether the data should be stored natively in RDF and exported to other formalisms, or the other way around, remains to be investigated for the definition of the Dyliss data management plan. With the help of the GenOuest11 engineers, these data should be deployed on the GenOuest platform.

11http://www.genouest.org/

6.1.1.2 Incorporating linked (meta)data into a Virtual Research Environment

All the efforts presented in the previous section are not specific to our team. As we have seen in sections 1.1 and 1.2, life sciences is one of the many domains concerned with the data deluge. Researchers are generating increasing quantities of data in data silos, and we are all trying as hard as we can to make things even worse by interconnecting these silos. In translational research, accessing and combining data is a challenge as important as the biological questions we try to answer. We have seen throughout this manuscript that the Semantic Web is valuable for automatically processing the data in order to answer biological questions [50]. All the efforts for managing the data of the projects presented in the previous sections also suggest that manual data management specific to each project will fail globally, because of both the quantity and the complexity of the data. Tending the scientific information ecosystem should be done systematically.

E-Science is “both the pursuit of global, collaborative in silico science and the computational infrastructure to support it” [256]. In this context, systematic data management relies on metadata associated with the raw data as well as with the data produced along the processing steps of the analysis. This is typically supported by Virtual Research Environments (VRE). In bioinformatics, the data processing steps are usually handled by a workflow engine associated with the VRE, such as the Taverna [257] engine with myExperiment [258], or more recently Galaxy12 [259] and its data manager [260] with HubZero13. With the VRE providing an integrated framework for storing the data and the associated metadata, as well as workflow descriptions that can be executed on the data, the next challenge lies in metadata generation.
Currently, these metadata are optional: they are not required by any step of the analysis, and except for the most simple ones such as the date or the creator, it is up to the user to provide them to the VRE for meeting traceability requirements or for easier data retrieval. However, this will never scale up to handle large quantities of data. With Yvan Le Bras (plateforme GenOuest), Alban Gaignard (institut du thorax Nantes), Audrey Bihouée (plateforme bioinformatique BiRD Nantes), François Moreews (INRA Rennes) and Olivier Collin (plateforme GenOuest), we are working on integrating a semantically-rich metadata generation capability (typically based on PROV14, ISA [261] and EDAM [262]) into the VRE data management. We hypothesize that the metadata associated with the result of the execution of a workflow can be automatically determined from the annotations of the input data and of the workflow, and we propose to embed this capability into the workflow engine itself [263]. Among the strategies we are considering, I propose to:

• create a generic “semantic metadata wrapper” service taking as parameters (1) the identifier of a “regular service”, (2) a semantic description of this regular service (the service identifier can be a part of the semantic description) and of its parameters, and (3) a semantic description of the workflow invoking the regular service. The semantic wrapper service is responsible for invoking the regular service, and for generating the metadata associated with the result.

• create a service converting a “regular” workflow into a “semantic metadata-enabled” workflow by embedding each service of the original workflow into its semantic metadata wrapper counterpart.

This solution is compatible with any workflow engine, and does not require hacking into the internals of the engine. It requires a minimal amount of manual annotation: once for each

12 https://galaxyproject.org/
13 https://hubzero.org/
14 http://www.w3.org/TR/prov-o/

service and once for each workflow. Moreover, a user can use the regular version of the services when creating a new workflow, and then generate the semantically-enabled version for production purposes once (s)he is satisfied. Finally, the generic wrapper can be refined using an approach similar to inheritance, by creating as many specific wrappers as needed that generate special annotations and call the generic wrapper for handling the general annotations.
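The wrapping principle can be sketched as follows (a minimal illustration: the function names and metadata fields are hypothetical, and a real implementation would emit PROV-O RDF rather than a plain dictionary):

```python
from datetime import datetime, timezone

def wrap_with_metadata(service, service_desc, workflow_desc):
    """Return a callable that invokes `service` and attaches provenance metadata."""
    def wrapped(*inputs):
        result = service(*inputs)
        # Hypothetical provenance record; PROV-O would use prov:wasGeneratedBy, etc.
        metadata = {
            "generatedBy": service_desc["id"],
            "partOfWorkflow": workflow_desc["id"],
            "usedInputs": list(inputs),
            "generatedAt": datetime.now(timezone.utc).isoformat(),
        }
        return result, metadata
    return wrapped

# Usage: wrap a "regular" service without touching the workflow engine.
def uppercase_service(seq):
    return seq.upper()

semantic_service = wrap_with_metadata(
    uppercase_service,
    service_desc={"id": "ex:uppercase"},
    workflow_desc={"id": "ex:demo-workflow"},
)
result, meta = semantic_service("acgt")
print(result)               # prints ACGT
print(meta["generatedBy"])  # prints ex:uppercase
```

The design choice here is that the wrapper is purely external to the wrapped service, which is what makes the approach engine-agnostic.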

6.1.2 Querying linked data

In all the projects mentioned in Section 6.1.1.1, writing SPARQL queries greatly simplifies the analysis. However, in spite of the benefits, this in turn suffers from two main limitations: (1) writing these queries requires a mental representation of the underlying structure of the data, i.e. what kinds of entities are present and what are the typical relations between them, and (2) not all end-users are willing to take up learning SPARQL, and find it all the more difficult to do so because they also lack (1). Representations of the structure of the data available at SPARQL endpoints are typically provided by additional diagrams (e.g. for Uniprot15, Reactome16 or ChEMBL17). However, these diagrams are hand-crafted and not always available, which makes the manual exploration of an endpoint cumbersome. When a diagram is available, the user still has to write SPARQL queries, which (s)he may not be familiar with. This typically consists in writing the SPARQL code in the text area of a website. The most user-friendly solutions feature syntax highlighting but are still regarded as “too technical”, even if some typical example queries and templates are provided. Initiatives such as Sparklis18 aim at making exploration easier [264]. Sparklis allows the user to build a query step by step by iteratively selecting the relation and the neighbor of a node of interest. It was still perceived as insufficiently ergonomic. Moreover, because of performance constraints, infrequent properties and neighbors may not be presented for a node of interest, which may give the misleading impression that some information is not present in the data. I propose a unified solution to both problems, based on the representation of the data present in a triplestore as a graph, and on a query-building principle using paths on the abstraction graph. Of course this solution should be generic, and will be applicable (among others) to all the projects mentioned in Section 6.1.1.1.

6.1.2.1 RDFmap: building an abstraction graph of data

RDFmap aims at automatically building a graph-based abstraction of a dataset that would be similar to the hand-crafted diagrams used currently. The general principle consists in identifying the main classes, and in creating a link between two classes if an instance of the first class is associated with an instance of the second class. The whole process can be performed as a SPARQL query. The first results are encouraging. Some work is still required for improving the identification of the main classes and for determining which relations between them should be represented.
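The general principle can be illustrated as follows (a stdlib-only sketch over an in-memory list of triples; the actual RDFmap prototype works directly on a triplestore through SPARQL, and its class-selection heuristics are more elaborate):

```python
# Build a class-level abstraction of an RDF dataset: link class A to class B
# whenever some instance of A points to some instance of B.
RDF_TYPE = "rdf:type"

def abstraction_graph(triples):
    """Return the set of (class, property, class) links present in the data."""
    types = {}  # instance -> set of its classes
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)
    links = set()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        for cs in types.get(s, ()):
            for co in types.get(o, ()):
                links.add((cs, p, co))
    return links

data = [
    ("gene1", RDF_TYPE, "Gene"),
    ("pw1", RDF_TYPE, "Pathway"),
    ("gene1", "participatesIn", "pw1"),
]
print(abstraction_graph(data))  # one class-level link: Gene --participatesIn--> Pathway
```

The abstraction is typically several orders of magnitude smaller than the instance-level data, which is what makes it usable as a navigation map.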

6.1.2.2 AskOmics: building SPARQL queries as paths on the abstraction graph AskOmics is being developed by Charles Bettembourg and Fabrice Legeai as a contribution to the MiRNAdapt project [101], but should be applicable to any RDF dataset.

15 http://sparql.uniprot.org/images/diagrams/uniprot.jpg
16 https://www.ebi.ac.uk/rdf/sites/ebi.ac.uk.rdf/files/documents/reactome_simplified.png
17 https://www.ebi.ac.uk/chembl/extra/RDF/chembl_18_rdf_summary.png
18 http://www.irisa.fr/LIS/ferre/sparklis/

It is based on an abstract representation of the MiRNAdapt data, which was created manually but should be replaced by the result of RDFmap in the future. AskOmics uses D3.js19 to provide a visual representation of the abstraction as a graph. By starting from a node of interest and iteratively selecting its neighbors, the user creates a path on the abstraction graph. This path can then be transformed into a SPARQL query that can be executed on the original dataset. This approach presents several benefits. First, the abstraction graph only needs to be generated once and does not need to be computed on-the-fly for each node, contrary to Sparklis. Second, the whole query building only takes place on the abstraction graph, which is much smaller than a typical RDF dataset. Third, intermediate count() queries can be executed on the dataset to provide on-the-fly hints to the user (or for debugging assistance). Fourth, the same principle can be applied to an optional (manually-generated) “interface layer” on top of the abstraction graph, in order to hide parts of the abstraction that may not be relevant to the user or to provide “shortcuts” that avoid property paths (i.e. compositions of relations); there can be different interface layers depending on the types of users.
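The path-to-query translation at the core of this approach can be sketched as follows (the variable naming scheme and URIs are hypothetical; the actual AskOmics generator handles attributes, filters and the interface layer as well):

```python
# Turn a path on the abstraction graph, e.g. Gene -participatesIn-> Pathway,
# into a SPARQL query over the instance-level data.
def path_to_sparql(path):
    """`path` alternates class and property names: [C0, p0, C1, p1, C2, ...]."""
    classes = path[0::2]
    props = path[1::2]
    # One variable per class on the path, typed accordingly...
    lines = [f"?v{i} a <{cls}> ." for i, cls in enumerate(classes)]
    # ...and one triple pattern per property linking consecutive variables.
    lines += [f"?v{i} <{prop}> ?v{i + 1} ." for i, prop in enumerate(props)]
    variables = " ".join(f"?v{i}" for i in range(len(classes)))
    body = "\n  ".join(lines)
    return f"SELECT {variables} WHERE {{\n  {body}\n}}"

print(path_to_sparql(["ex:Gene", "ex:participatesIn", "ex:Pathway"]))
```

Because the query is derived mechanically from the path, the user never has to write SPARQL directly.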

6.2 Analyzing data

I joined the Dyliss team at IRISA in 2013. The team focuses on bioinformatics and systems biology. The main goal in biology is to characterize groups of genetic actors that control the phenotypic response of non-model species challenged by their environment. Unlike for model species, only limited prior knowledge is available for these organisms [30], together with a small range of experimental studies (culture conditions, genetic transformations). To accommodate these limitations, the team explores methods in the field of formal systems, more precisely in knowledge representation, constraint programming, multi-scale analysis of dynamical systems, and machine learning. Our goal is to take into account both the information on physiological responses of the studied species under various constraints and the genetic information from their long-distant cousins. The challenge to face is thus incompleteness: the limited range of known physiological or genetic perturbations is combined with an incomplete knowledge of the living mechanisms involved. We favor the construction and study of a “space of feasible models or hypotheses” including known constraints and facts on a living system, rather than searching for a single optimized model. We develop methods allowing a precise investigation of this space of hypotheses. Therefore, the biologist will be in a position to develop experimental strategies to progressively shrink the space of hypotheses and gain understanding of the system. This refinement approach is particularly suited to non-model organisms, which have specific and little-known survival mechanisms. It is also required in the framework of an increasing automation of experimentation in biology. By exploring the complete space of models, our approach typically produces numerous candidate models compatible with the observations.
My contribution consists in investigating to what extent domain knowledge can further refine the analysis of the set of models, by identifying classes of similar models or by selecting the models that best fit biological knowledge. We anticipate that this will be particularly relevant when studying non-model species, for which little is known but valuable information from other species can be transposed or adapted. Sections 6.2.1 and 6.2.2 present ongoing work on the selection of relevant candidates when reconstructing metabolic pathways and on the analysis of TGF-β signaling pathways. Although the application domains are different, the reasoning method is strikingly similar in both cases.

19http://d3js.org/

Section 6.2.3 presents my main medium- to long-term research goal, which consists in defining a more generic analysis framework combining topological and semantic information.

6.2.1 Selecting relevant candidates when reconstructing metabolic pathways

This work is a contribution to the Idealg project20 (investissement d’avenir). It is a collaboration with Sylvain Prigent, Anne Siegel and Pierre Vignet. The Idealg project aims at a better overall understanding of the three groups of macroalgae (green, red and brown) in order to develop the algae sector in Brittany. This especially includes the study of species specific to each of the three major groups of algae, Ectocarpus siliculosus in the case of brown algae. A part of this project consists in proposing a complete metabolic network for Ectocarpus siliculosus. A metabolic network is the complete set of physical and physiological reactions that explain the overall functioning of a cell. This metabolic network must be of good quality and compatible with biological observations. In particular, it must be able to explain the presence of 56 compounds of interest for biologists. Reconstructing metabolic networks is a labor-intensive task requiring numerous biological experiments. Most current efforts relied massively on manual expert intervention, either in plants [265, 266, 267] or in animals [268]. Ectocarpus siliculosus not being a “model species”, numerous portions of its metabolic pathways are unknown [30]. The traditional approach is not applicable in this context because the data are too scarce, would take too long to produce, and lack a large-enough community to validate them. A classic strategy consists in completing the pathways using reactions observed in other species. However, there are many reactions from many species, spread over several complementary databases [247]. Determining the best candidates from a biological point of view requires incorporating prior knowledge [269] but remains an open challenge, especially for large-scale networks [270, 271, 272, 273].
Existing symbolic knowledge represented in ontologies can contribute to addressing the problem of missing information [274, 275] and the problem of processing large quantities of interdependent data [45]. A previous study used MetaCyc [276] as a source of candidate reactions to complete the metabolic network [99]. The smallest set of MetaCyc reactions to be added to the reconstructed network in order to produce the 56 target compounds of interest is composed of 42 reactions. However, a systematic exploration produced 2400 possible minimal sets that are all structurally equivalent. Together, these 2400 candidate sets cover 70 reactions, so they have a large overlap. We have developed a knowledge-based method that reduces the number of candidate sets from 2400 to 48. It consists in creating a graph of mutually-exclusive reactions (i.e. pairs of reactions that never belong to the same candidate metabolic network) in order to retrieve the maximal cliques. Composing a candidate network requires selecting one reaction from each clique. We then developed a reasoning method based on ontologies in order to determine, for each clique, the subset of its reactions that fit best with biological knowledge. Finally, we only select the candidate networks composed of these reactions. We are currently performing a formal evaluation of this strategy on Escherichia coli by artificially degrading metabolic networks before reconstructing them (i.e. to assess whether we selected the relevant candidates and discarded the not-so-relevant ones), and investigating further enhancements.
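The clique-retrieval step can be illustrated with a minimal maximal-clique search (a textbook Bron–Kerbosch sketch over a toy mutual-exclusion graph; this is not the actual Idealg code, and the reaction names are invented):

```python
# Find maximal cliques in a graph of mutually-exclusive reactions.
# Bron-Kerbosch without pivoting; fine for small graphs.
def maximal_cliques(adj):
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)  # r cannot be extended: it is a maximal clique
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(adj), set())
    return cliques

# Toy mutual-exclusion graph: r1, r2, r3 pairwise exclusive; r4 exclusive with r1 only.
adj = {
    "r1": {"r2", "r3", "r4"},
    "r2": {"r1", "r3"},
    "r3": {"r1", "r2"},
    "r4": {"r1"},
}
for clique in maximal_cliques(adj):
    print(sorted(clique))
```

On this toy graph the two maximal cliques are {r1, r2, r3} and {r1, r4}; a candidate network would then keep exactly one reaction from each clique, the one that fits best with biological knowledge.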

6.2.2 Analyzing TGF-β signaling pathways

This project is a collaboration with Nathalie Théret, with whom I supervise Jean Coquet’s PhD thesis, Geoffroy Andrieux, Anne Siegel and Jacques Nicolas. The transforming growth factor beta 1 (TGF-β1) protein plays a major role in immune response and in tumor development, as an

20http://www.idealg.ueb.eu/

antagonist in the early stages and as a promoter in the advanced stages [277]. TGF-β pleiotropic effects are linked to the complex mechanisms regulating its activity. These mechanisms are therefore potential therapeutic targets. Geoffroy Andrieux developed the most exhaustive discrete model of TGF-β signaling pathways. It contains 9,248 reactions composed of 9,177 components. This model allowed him to identify 15,934 sets of influence, composed of chemical reactions, that activate some of the 145 genes influenced by TGF-β [278]. The size and the internal complexity of this network prevent its exploitation by biologists. We are conducting a systematic analysis in order to identify:

• sets of genes activated by similar sets of influence. The relevance of these gene sets will depend on the biological processes and the diseases associated with the genes.

• families of similar sets of influence (i.e. activating the same genes or genes involved in similar processes).

• genes or sets of influence common to several sets and playing the role of interface. Such elements are of potential interest for understanding the transition of the role of TGF-β from tumor antagonist to tumor promoter.

This analysis consists in a systematic search for associations, and also relies on external domain knowledge such as biological processes or diseases. This knowledge is used both in the search for associations and in the interpretation of the results. The analysis consists in determining cliques of genes activated by the same sets of influence, and cliques of sets of influence activating the same genes. We then determine the homogeneity of the cliques according to biological processes or diseases, and select the most interesting ones for further analysis by biologists.

6.2.3 Data analysis method combining ontologies and formal concept analysis

The method for selecting candidates after metabolic pathway reconstruction (section 6.2.1) and the method for analyzing signaling pathways (section 6.2.2) both consist of a topological analysis of a domain-dependent graph, followed by a semantic-based method for grouping solutions or for reducing their number. I intend to develop a refined and unified analysis method. The Confocal project (PEPS CNRS FaSciDo 2015) with Anne Siegel, Jacques Nicolas and Nathalie Théret, Jean Coquet, Amedeo Napoli (LORIA Nancy) and Élisabeth Rémy (Institut de Mathématiques de Luminy) is a first step in this direction. In the biomedical domain, the classical approaches for analyzing annotated elements rely on domain knowledge and semantic similarity values in order to perform hierarchical clustering [279, 280, 22]. Because data are noisy and incomplete, as mentioned previously, special approaches have been developed [281, 282]. Limitation 1: classical biclustering methods do not permit partial overlap of clusters, which is not compatible with the pleiotropic nature of some genes. Formal concept analysis (FCA) performs an exhaustive search of the maximal sets of elements sharing the same attributes [283]. It addresses the previous limitation and is a relevant alternative because the lattice represents several levels of precision, from numerous small sets of genes having many influence sets in common to fewer, larger gene sets sharing fewer influence sets. Contrary to biclustering, the lattice also supports the identification of partially overlapping clusters. Compared to the maximal cliques, it allows us to perform a finer-grain analysis. FCA has already been successfully applied to the analysis of gene expression data [284] and to signaling

137 networks modeling [285]. Eren et al. have shown that clustering algorithms capable of finding more than one model are more likely to find biologically-relevant clusters [279]. However, FCA also suffers from the following limitations, even if recent breakthrough in the Orpailleur team concerned the combination of biclustering and FCA [286, 287] and a concept stability measure for identifying relevant concepts [288, 289]. Limitation 2: FCA assumes that the elements are independent, whereas we would like to take relations extracted from ontologies into account. Limitation 3: FCA’s exhaustive search generates numerous formal concepts, not all of them being informative (specially the large and the small ones) or biologically-relevant [290]. Limitation 4: FCA is sensitive to noisy and incomplete data. Pensa and Boulicaut devel- oped a fault-tolerant technique that they applied to gene expression analysis [291]. I will focus on using ontologies to guide formal concept analysis for identifying relevant associations among data. This will involve using ontologies (1) before FCA for enriching anno- tations, and (2) after FCA for identifying semantically-homogeneous clusters that either match existing knowledge (for validation purpose), or do not (for discovery purpose).
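As a small illustration of the exhaustive search FCA performs, the following sketch enumerates all formal concepts of a binary context by closing the set of object intents under intersection. This is a naive implementation intended only for toy-sized contexts; the `formal_concepts` function and the dict-based input format are assumptions made for this example:

```python
def formal_concepts(context):
    """Enumerate the formal concepts of a binary context.

    context: dict mapping each object (e.g. a gene) to its set of
    attributes (e.g. the influences activating it).
    Returns a list of (extent, intent) pairs as frozensets, ordered
    from the largest extent down to the smallest.
    """
    objects = set(context)
    all_attrs = frozenset().union(*context.values()) if context else frozenset()

    def extent(attrs):
        # Objects possessing every attribute in attrs
        return frozenset(o for o in objects if attrs <= context[o])

    # The set of intents is the closure under intersection of the
    # object intents, seeded with the full attribute set (the intent
    # of the empty extent).
    intents = {all_attrs}
    for o in objects:
        intents |= {frozenset(context[o]) & i for i in intents}

    # Each intent, paired with its extent, is one formal concept.
    return sorted(((extent(i), i) for i in intents),
                  key=lambda c: -len(c[0]))
```

The lattice orders these concepts from large extents with small intents down to small extents with large intents, which corresponds to the levels of precision mentioned above; two concepts may share objects, which is the partial overlap that plain biclustering cannot produce.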

Bibliography

[1] Carol J. Bult. From information to understanding: the role of model organism databases in comparative and functional genomics. Animal Genetics, 37(suppl. 1):28–40, 2006.

[2] Olivier Bodenreider and Robert Stevens. Bio-ontologies: current trends and future directions. Briefings in Bioinformatics, 7(3):256–274, 2006.

[3] Judith A. Blake and Carol J. Bult. Beyond the data deluge: Data integration and bio-ontologies. Journal of Biomedical Informatics, 39(3):314–320, 2006.

[4] Nicola Cannata, Emanuela Merelli, and Russ B. Altman. Time to organize the bioinformatics resourceome. PLoS Computational Biology, 1(7):0531–0533, 2005.

[5] R Bellazzi, M Diomidous, I N Sarkar, K Takabayashi, A Ziegler, and A T McCray. Data analysis and data mining: current issues in biomedical informatics. Methods of information in medicine, 50(6):536–544, 2011.

[6] Damian Smedley, Syed Haider, Steffen Durinck, Luca Pandini, Paolo Provero, James Allen, Olivier Arnaiz, Mohammad Hamza Awedh, Richard Baldock, Giulia Barbiera, Philippe Bardou, Tim Beck, Andrew Blake, Merideth Bonierbale, Anthony J Brookes, Gabriele Bucci, Iwan Buetti, Sarah Burge, Cedric Cabau, Joseph W Carlson, Claude Chelala, Charalambos Chrysostomou, Davide Cittaro, Olivier Collin, Raul Cordova, Rosalind J Cutts, Erik Dassi, Alex Di Genova, Anis Djari, Anthony Esposito, Heather Estrella, Eduardo Eyras, Julio Fernandez-Banet, Simon Forbes, Robert C Free, Takatomo Fujisawa, Emanuela Gadaleta, Jose M Garcia-Manteiga, David Goodstein, Kristian Gray, Jose Afonso Guerra-Assuncao, Bernard Haggarty, Dong-Jin Han, Byung Woo Han, Todd Harris, Jayson Harshbarger, Robert K Hastings, Richard D Hayes, Claire Hoede, Shen Hu, Zhi-Liang Hu, Lucie Hutchins, Zhengyan Kan, Hideya Kawaji, Aminah Keliet, Arnaud Kerhornou, Sunghoon Kim, Rhoda Kinsella, Christophe Klopp, Lei Kong, Daniel Lawson, Dejan Lazarevic, Ji-Hyun Lee, Thomas Letellier, Chuan-Yun Li, Pietro Lio, Chu-Jun Liu, Jie Luo, Alejandro Maass, Jerome Mariette, Thomas Maurel, Stefania Merella, Azza Mostafa Mohamed, Francois Moreews, Ibounyamine Nabihoudine, Nelson Ndegwa, Celine Noirot, Cristian Perez-Llamas, Michael Primig, Alessandro Quattrone, Hadi Quesneville, Davide Rambaldi, James Reecy, Michela Riba, Steven Rosanoff, Amna Ali Saddiq, Elisa Salas, Olivier Sallou, Rebecca Shepherd, Reinhard Simon, Linda Sperling, William Spooner, Daniel M Staines, Delphine Steinbach, Kevin Stone, Elia Stupka, Jon W Teague, Abu Z Dayem Ullah, Jun Wang, Doreen Ware, Marie Wong-Erasmus, Ken Youens-Clark, Amonida Zadissa, Shi-Jian Zhang, and Arek Kasprzyk. The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic acids research, 43(W1):W589–W598, 2015.

[7] Michael Y Galperin, Daniel J Rigden, and Xosé M Fernández-Suárez. The 2015 nucleic acids research database issue and molecular biology database collection. Nucleic acids research, 43(Database issue):D1–D5, 2015.

[8] Alyssa Goodman, Alberto Pepe, Alexander W Blocker, Christine L Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. Ten simple rules for the care and feeding of scientific data. PLoS computational biology, 10(4):e1003542, 2014.

[9] Zachary D Stephens, Skylar Y Lee, Faraz Faghri, Roy H Campbell, Chengxiang Zhai, Miles J Efron, Ravishankar Iyer, Michael C Schatz, Saurabh Sinha, and Gene E Robinson. Big data: Astronomical or genomical? PLoS biology, 13(7):e1002195, 2015.

[10] Hiroaki Kitano. Computational systems biology. Nature, 420(6912):206–210, 2002.

[11] Marvalee H Wake. What is "integrative biology"? Integrative and comparative biology, 43(2):239–241, 2003.

[12] Jennifer R Tisoncik and Michael G Katze. What is systems biology? Future microbiology, 5(2):139–141, 2010.

[13] Bas Teusink, Hans V Westerhoff, and Frank J Bruggeman. Comparative systems biology: from bacteria to man. Wiley interdisciplinary reviews. Systems biology and medicine, 2(5):518–532, 2010.

[14] Huajun Chen, Tong Yu, and Jake Y Chen. Semantic web meets integrative biology: a survey. Briefings in bioinformatics, 14(1):109–125, 2012.

[15] Francesco M Marincola. Translational medicine: A two-way road. Journal of translational medicine, 1(1):1, 2003.

[16] Indra Neil Sarkar. Biomedical informatics and translational medicine. Journal of translational medicine, 8:22, 2010.

[17] Atul J Butte. Translational bioinformatics: coming of age. Journal of the American Medical Informatics Association : JAMIA, 15(6):709–714, 2008.

[18] Nigam H Shah, Clement Jonquet, Annie P Chiang, Atul J Butte, Rong Chen, and Mark A Musen. Ontology-driven indexing of public datasets for translational bioinformatics. BMC bioinformatics, 10 Suppl 2:S1, 2009.

[19] Qing Yan. Translational bioinformatics and systems biology approaches for personalized medicine. Methods in molecular biology (Clifton, N.J.), 662:167–178, 2010.

[20] R B Altman. Translational bioinformatics: Linking the molecular world to the clinical world. Clinical pharmacology and therapeutics, 2012. In press.

[21] Atul J Butte and Lucila Ohno-Machado. Making it personal: translational bioinformatics. Journal of the American Medical Informatics Association : JAMIA, 20(4):595–596, 2013.

[22] Atsushi Fukushima, Shigehiko Kanaya, and Kozo Nishida. Integrated network analysis and effective tools in plant systems biology. Frontiers in plant science, 5:598, 2014.

[23] Yonqing Zhang, Supriyo De, John R Garner, Kirstin Smith, S Alex Wang, and Kevin G Becker. Systematic analysis, comparison, and integration of disease based human genetic association data and mouse genetic phenotypic information. BMC medical genomics, 3:1, 2010.

[24] Kyoohyoung Rho, Bumjin Kim, Youngjun Jang, Sanghyun Lee, Taejeong Bae, Jihae Seo, Chaehwa Seo, Jihyun Lee, Hyunjung Kang, Ungsik Yu, Sunghoon Kim, Sanghyuk Lee, and Wan Kyu Kim. GARNET - gene set analysis with exploration of annotation relations. BMC bioinformatics, 12 Suppl 1:S25, 2011.

[25] William A Baumgartner, K Bretonnel Cohen, Lynne M Fox, George Acquaah-Mensah, and Lawrence Hunter. Manual curation is not sufficient for annotation of genomic databases. Bioinformatics (Oxford, England), 23(13):i41–i48, 2007.

[26] Carole Goble and Robert Stevens. State of the nation in data integration for bioinformatics. Journal of biomedical informatics, 41(5):687–693, 2008.

[27] Sean Bechhofer, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Philip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, Matthew Gamble, Danius Michaelides, Stuart Owen, David Newman, Shoaib Sufi, and Carole Goble. Why linked data is not enough for scientists. Future Generation Computer Systems, 29(2):599–611, 2013.

[28] Hector J. Levesque. On our best behaviour. In Proceedings of IJCAI2013 conference, 2013.

[29] Robert Stevens, Carole A. Goble, and Sean Bechhofer. Ontology-based knowledge repre- sentation for bioinformatics. Briefings in bioinformatics, 1(4):398–416, 2000.

[30] C R Primmer, S Papakostas, E H Leder, M J Davis, and M A Ragan. Annotated genes and nonannotated genomes: cross-species use of gene ontology in ecology and evolution research. Molecular ecology, 22(12):3216–3241, 2013.

[31] Kathrin Dentler, Annette ten Teije, Nicolette de Keizer, and Ronald Cornet. Barriers to the reuse of routinely recorded clinical data: a field report. Studies in health technology and informatics, 192:313–317, 2013.

[32] G F Cooper, B G Buchanan, M Kayaalp, M Saul, and J K Vries. Using computer modeling to help identify patient subgroups in clinical data repositories. In Proceedings of the AMIA Annual Symposium, pages 180–184, 1998.

[33] J J Cimino. Desiderata for controlled medical vocabularies in the twenty-first century. Methods of information in medicine, 37(4–5):394–403, 1998.

[34] James J Cimino. In defense of the desiderata. Journal of biomedical informatics, 39(3):299–306, 2005.

[35] Barry Smith. New desiderata for biomedical terminologies. In Ontologies and Biomedical Informatics, Conference of the International Medical Informatics Association, 2005.

[36] J. J. Cimino and X. Zhu. The practical impact of ontologies on biomedical informatics. Methods of information in medicine, 2006.

[37] Jonathan B L Bard and Seung Y Rhee. Ontologies in biology: design, applications and future challenges. Nature reviews. Genetics, 5(3):213–222, 2004.

[38] Thomas R. Gruber. Formal Ontology in Conceptual Analysis and Knowledge Representation, chapter Toward Principles for the Design of Ontologies used for Knowledge Sharing. Kluwer Academic Publishers, 1993.

[39] B Chandrasekaran, JR Josephson, and VR Benjamins. What are ontologies and why do we need them? IEEE Intelligent Systems, 14(1):20–26, 1999.

[40] Anita Burgun. Desiderata for domain reference ontologies in biomedicine. Journal of Biomedical Informatics, 39:307–313, 2006.

[41] Stefan Schulz, Laszlo Balkanyi, Ronald Cornet, and Olivier Bodenreider. From concept representations to ontologies: A paradigm shift in health informatics? Healthcare informatics research, 19(4):235–242, 2013.

[42] André Q Andrade, Markus Kreuzthaler, Janna Hastings, Maria Krestyaninova, and Stefan Schulz. Requirements for semantic biobanks. Studies in health technology and informatics, 180:569–573, 2012.

[43] Mikel Egaña Aranguren, Erick Antezana, Martin Kuiper, and Robert Stevens. Ontology design patterns for bio-ontologies: a case study on the cell cycle ontology. BMC Bioinformatics, 9(Suppl 5):S1, 2008.

[44] Matthew E Holford, James P McCusker, Kei-Hoi Cheung, and Michael Krauthammer. A semantic web framework to integrate cancer omics data with biological knowledge. BMC bioinformatics, 13 Suppl 1:S10, 2012.

[45] Lars J Jensen and Peer Bork. Ontologies in quantitative biology: a basis for comparison, integration, and discovery. PLoS biology, 8(5):e1000374, 2010.

[46] Frank Emmert-Streib, Matthias Dehmer, and Benjamin Haibe-Kains. Untangling statistical and biological models to understand network inference: the need for a genomics network ontology. Frontiers in genetics, 5:299, 2014.

[47] Yi Liu, Adrien Coulet, Paea LePendu, and Nigam H Shah. Using ontology-based annotation to profile disease research. Journal of the American Medical Informatics Association : JAMIA, 19(e1):e177–e186, 2012.

[48] Robert Hoehndorf, Michel Dumontier, and Georgios V Gkoutos. Evaluation of research in biomedical ontologies. Briefings in bioinformatics, 2012. In press.

[49] Natalya F Noy, Nigam H Shah, Patricia L Whetzel, Benjamin Dai, Michael Dorf, Nicholas Griffith, Clement Jonquet, Daniel L Rubin, Margaret-Anne Storey, Christopher G Chute, and Mark A Musen. BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic acids research, 37(Web Server issue):W170–W173, 2009.

[50] Alan Ruttenberg, Tim Clark, William Bug, Matthias Samwald, Olivier Bodenreider, Helen Chen, Donald Doherty, Kerstin Forsberg, Yong Gao, Vipul Kashyap, June Kinoshita, Joanne Luciano, M. Scott Marshall, Chimezie Ogbuji, Jonathan Rees, Susie Stephens, Gwendolyn T. Wong, Elizabeth Wu, Davide Zaccagnini, Tonya Hongsermeier, Eric Neumann, Ivan Herman, and Kei-Hoi Cheung. Advancing translational research with the semantic web. BMC Bioinformatics, 8(3), 2007.

[51] Robert Stevens, Mikel Egaña Aranguren, Katy Wolstencroft, Ulrike Sattler, Nick Drummond, Matthew Horridge, and Alan Rector. Using OWL to model biological knowledge. International Journal of Human Computer Studies, 65(7):583–594, 2007.

[52] Mikel Egaña Aranguren, Sean Bechhofer, Phillip Lord, Ulrike Sattler, and Robert Stevens. Understanding and using the meaning of statements in a bio-ontology: recasting the gene ontology in OWL. BMC bioinformatics, 8:57, 2007.

[53] David P Hill, Nico Adams, Mike Bada, Colin Batchelor, Tanya Z Berardini, Heiko Dietze, Harold J Drabkin, Marcus Ennis, Rebecca E Foulger, Midori A Harris, Janna Hastings, Namrata S Kale, Paula de Matos, Christopher J Mungall, Gareth Owen, Paola Roncaglia, Christoph Steinbeck, Steve Turner, and Jane Lomax. Dovetailing biology and chemistry: integrating the Gene Ontology with the ChEBI chemical ontology. BMC genomics, 14:513, 2013.

[54] Nicola Cannata, Michael Schröder, Roberto Marangoni, and Paolo Romano. A semantic web for bioinformatics: goals, tools, systems, applications. BMC bioinformatics, 9 Suppl 4:S1, 2008.

[55] Lennart J G Post, Marco Roos, M Scott Marshall, Roel van Driel, and Timo M Breit. A semantic web approach applied to integrative bioinformatics experimentation: a biological use case with genomics data. Bioinformatics (Oxford, England), 23(22):3080–3087, 2007.

[56] R Bellazzi. Big data and biomedical informatics: A challenging opportunity. Yearbook of medical informatics, 9(1), 2014. In press.

[57] Satya S. Sahoo, Olivier Bodenreider, Kelly Zeng, and Amit Sheth. An experiment in integrating large biomedical knowledge resources with RDF: Application to associating genotype and phenotype information. In Proceedings of the WWW2007 Workshop on Health Care and Life Sciences Data Integration for the Semantic Web, 2007.

[58] María Taboada, Diego Martínez, Belén Pilo, Adriano Jiménez-Escrig, Peter N Robinson, and María J Sobrido. Querying phenotype-genotype relationships on patient datasets using semantic web technology: the example of cerebrotendinous xanthomatosis. BMC medical informatics and decision making, 12:78, 2012.

[59] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data–the story so far. International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009.

[60] Kevin M Livingston, Michael Bada, William A Baumgartner, and Lawrence E Hunter. KaBOB: ontology-based semantic integration of biomedical databases. BMC bioinformatics, 16:126, 2015.

[61] François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Jean Morissette. Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics, 41(5):706–716, 2008.

[62] Kei-Hoi Cheung, H Robert Frost, M Scott Marshall, Eric Prud’hommeaux, Matthias Samwald, Jun Zhao, and Adrian Paschke. A journey to semantic web query federation in the life sciences. BMC bioinformatics, 10 Suppl 10:S10, 2009.

[63] Alison Callahan, José Cruz-Toledo, and Michel Dumontier. Ontology-based querying with Bio2RDF's linked open data. Journal of biomedical semantics, 4 Suppl 1:S1, 2013.

[64] Sarala M Wimalaratne, Jerven Bolleman, Nick Juty, Toshiaki Katayama, Michel Dumontier, Nicole Redaschi, Nicolas Le Novère, Henning Hermjakob, and Camille Laibe. SPARQL-enabled identifier conversion with identifiers.org. Bioinformatics (Oxford, England), 31(11):1875–1877, 2015.

[65] Olivier Dameron, Bernard Gibaud, Anita Burgun, and Xavier Morandi. Towards a sharable numeric and symbolic knowledge base on cerebral cortex anatomy: lessons from a prototype. In American Medical Informatics Association AMIA, pages 185–189, 2002.

[66] Olivier Dameron, Bernard Gibaud, and Xavier Morandi. Numeric and symbolic representation of the cerebral cortex anatomy: Methods and preliminary results. Surgical and Radiologic Anatomy, 26(3):191–197, 2004.

[67] Olivier Dameron, Anita Burgun, Xavier Morandi, and Bernard Gibaud. Modelling dependencies between relations to insure consistency of a cerebral cortex anatomy knowledge base. In Studies in Health technology and informatics, pages 403–408, 2003.

[68] Olivier Dameron, Bernard Gibaud, and Mark Musen. Using semantic dependencies for consistency management of an ontology of brain-cortex anatomy. In First International Workshop on Formal Biomedical Knowledge Representation KRMED04, pages 30–38, 2004.

[69] Olivier Dameron, Mark A. Musen, and Bernard Gibaud. Using semantic dependencies for consistency management of an ontology of brain-cortex anatomy. Artificial Intelligence in Medicine, 39(3):217–225, 2007.

[70] Patrick Lambrix, Manal Habbouche, and Marta Pérez. Evaluation of ontology development tools for bioinformatics. Bioinformatics (Oxford, England), 19(12):1564–1571, 2003.

[71] Robert Stevens, Chris Wroe, Sean Bechhofer, Phillip Lord, Alan Rector, and Carole Goble. Building ontologies in DAML+OIL. Comparative and functional genomics, 4(1):133–141, 2003.

[72] Holger Knublauch, Ray W. Fergerson, Natalya F. Noy, and Mark A. Musen. The Protégé OWL plugin: An open development environment for semantic web applications. In Proceedings of the Third International Semantic Web Conference (ISWC2004), volume 3298 of Lecture Notes in Computer Science, pages 229–243. Springer Berlin Heidelberg, 2004.

[73] Daniel L. Rubin, Olivier Dameron, and Mark A. Musen. Use of description logic classification to reason about consequences of penetrating injuries. In American Medical Informatics Association Conference AMIA05, pages 649–653, 2005.

[74] Daniel L. Rubin, Olivier Dameron, Yasser Bashir, David Grossman, Parvati Dev, and Mark A. Musen. Using ontologies linked with geometric models to reason about penetrating injuries. Artificial Intelligence in Medicine, 37(3):167–176, 2006.

[75] Olivier Dameron, Daniel L. Rubin, and Mark A. Musen. Challenges in converting frame-based ontology into OWL: the Foundational Model of Anatomy case-study. In American Medical Informatics Association Conference AMIA05, pages 181–185, 2005.

[76] Olivier Dameron and Julie Chabalier. Automatic generation of consistency constraints for an OWL representation of the FMA. In 10th International Protégé Conference, 2007.

[77] Olivier Dameron. JOT: a scripting environment for creating and managing ontologies. In 7th International Protégé Conference, 2004.

[78] Olivier Dameron, Élodie Roques, Daniel L. Rubin, Gwenaëlle Marquet, and Anita Burgun. Grading lung tumors using OWL-DL based reasoning. In 9th International Protégé Conference, 2006.

[79] Gwenaëlle Marquet, Olivier Dameron, Stephan Saikali, Jean Mosser, and Anita Burgun. Grading glioma tumors using OWL-DL and NCI thesaurus. In Proceedings of the American Medical Informatics Association Conference AMIA'07, pages 508–512, 2007.

[80] Anand Kumar, Yum Lina Yip, Barry Smith, and Pierre Grenon. Bridging the gap between medical and bioinformatics: an ontological case study in colon carcinoma. Computers in biology and medicine, 36(7-8):694–711, 2005.

[81] Franck W. Hartel, Sherri de Corronado, Robert Dionne, Gilberto Fragoso, and Jennifer Golbeck. Modeling a description logic vocabulary for cancer research. Journal of Biomedical Informatics, 38:114–129, 2005.

[82] Julian Seidenberg and Alan Rector. Web ontology segmentation: analysis, classification and use. In Proceedings of the World Wide Web Conference (WWW’06), pages 13–22, 2006.

[83] Julie Chabalier, Gwenaëlle Marquet, Olivier Dameron, and Anita Burgun. Enrichissement de la hiérarchie KEGG par l'exploitation de Gene Ontology. In Workshop OGSB, JOBIM'06, 2006.

[84] Julie Chabalier, Olivier Dameron, and Anita Burgun. Integrating disease and pathway ontologies. In Proceedings of the ISMB conference, Poster Session, 2007.

[85] Julie Chabalier, Olivier Dameron, and Anita Burgun. Integrating and querying disease and pathway ontologies: building an OWL model and using RDFS queries. In Bio-Ontologies Special Interest Group, Intelligent Systems for Molecular Biology conference (ISMB’07), 2007.

[86] Julie Chabalier, Olivier Dameron, and Anita Burgun. Using knowledge about pathways as an organizing principle for disease ontologies. In Journées Ouvertes Biologie, Informatique et Mathématiques (JOBIM'07), 2007.

[87] Nicolas Lebreton, Olivier Dameron, Christophe Blanchet, and Julie Chabalier. Utilisation d'ontologies de tâches et de domaine pour la composition semi-automatique de services web bioinformatiques. In Proceedings of the Journées Ouvertes de Biologie, Informatique et Mathématiques (Jobim 2008), 2008.

[88] Nicolas Lebreton, Christophe Blanchet, Julie Chabalier, and Olivier Dameron. Utilisation d'ontologies de tâches et de domaine pour la composition semi-automatique de services web bioinformatiques. In Journées Ouvertes Biologie, Informatique et Mathématiques (JOBIM 2009), 2009.

[89] Nicolas Lebreton, Christophe Blanchet, Daniela Barreiro Claro, Julie Chabalier, Anita Burgun, and Olivier Dameron. Verification of parameters semantic compatibility for semi-automatic web service composition: a generic case study. In 12th International Conference on Information Integration and Web-based Applications and Services (iiWAS2010), pages 845–848, 2010.

[90] Charles Bettembourg, Christian Diot, Anita Burgun, and Olivier Dameron. GO2PUB: Querying PubMed with semantic expansion of gene ontology terms. Journal of biomedical semantics, 3(1):7, 2012.

[91] Anita Burgun, Lynda Temal, Arnaud Rosier, Olivier Dameron, Philippe Mabo, Pierre Zweigenbaum, Regis Beuscart, David Delerue, and Henry Christine. Integrating clinical data with information transmitted by implantable cardiac defibrillators to support medical decision in telecardiology: the application ontology of the AKENATON project. In Proceedings of the American Medical Informatics Association Conference AMIA, page 992, 2010.

[92] Olivier Dameron, Pascal van Hille, Lynda Temal, Arnaud Rosier, Louise Deléger, Cyril Grouin, Pierre Zweigenbaum, and Anita Burgun. Comparison of OWL and SWRL-based ontology modeling strategies for the determination of pacemaker alerts severity. In Proceedings of the American Medical Informatics Association Conference AMIA, page 284, 2011.

[93] Pascal van Hille, Julie Jacques, Julien Taillard, Arnaud Rosier, David Delerue, Anita Burgun, and Olivier Dameron. Comparing Drools and ontology-based reasoning approaches for telecardiology decision support. Studies in health technology and informatics, 180:300–304, 2012.

[94] Olivier Dameron, Paolo Besana, Oussama Zekri, Annabel Bourdé, Anita Burgun, and Marc Cuggia. OWL model of clinical trial eligibility criteria compatible with partially-known information. In Proceedings of the Semantic Web for Life Sciences workshop SWAT4LS2012, 2012.

[95] Olivier Dameron, Paolo Besana, Oussama Zekri, Annabel Bourdé, Anita Burgun, and Marc Cuggia. OWL model of clinical trial eligibility criteria compatible with partially-known information. Journal of Biomedical Semantics, 4(1), 2013.

[96] Charles Bettembourg, Christian Diot, and Olivier Dameron. Semantic particularity measure for functional characterization of gene sets using Gene Ontology. PLoS ONE, 9(1):e86525, 2014.

[97] Charles Bettembourg, Christian Diot, and Olivier Dameron. Optimal threshold determination for interpreting semantic similarity and particularity: Application to the comparison of gene sets and metabolic pathways using GO and ChEBI. PloS one, 10(7):e0133579, 2015.

[98] Frederic Herault, Annie Vincent, Olivier Dameron, Pascale Le Roy, Pierre Cherel, and Marie Damon. The longissimus and semimembranosus muscles display marked differences in their gene expression profiles in pig. PloS one, 9(5):e96491, 2014.

[99] Sylvain Prigent, Guillaume Collet, Simon M Dittami, Ludovic Delage, Floriane Ethis de Corny, Olivier Dameron, Damien Eveillard, Sven Thiele, Jeanne Cambefort, Catherine Boyen, Anne Siegel, and Thierry Tonon. The genome-scale metabolic network of Ectocarpus siliculosus (EctoGEM): a resource to study brown algal physiology and beyond. The Plant journal : for cell and molecular biology, 80(2):367–381, 2014.

[100] Jean Coquet, Geoffroy Andrieux, Jacques Nicolas, Olivier Dameron, and Nathalie Theret. Analysis of tgf-beta signalization pathway thanks to topological and semantic web methods. In Journées Ouvertes Biologie, Informatique et Mathématiques (JOBIM 2015), poster session, 2015.

[101] Charles Bettembourg, Olivier Dameron, Anthony Bretaudeau, and Fabrice Legeai. Intégration et interrogation de réseaux de régulation génomique et post-génomique. In Proceedings of the IN-OVIVE workshop (INtégration de sources/masses de données hétérogènes et Ontologies, dans le domaine des sciences du VIVant et de l'Environnement), conférence IC (Ingénierie des Connaissances) PFIA, 2015.

[102] Philippe Finet, Régine Le Bouquin-Jeannès, and Olivier Dameron. La télémédecine dans la prise en charge des maladies chroniques [in French]. Techniques Hospitalières, (740), 2013.

[103] Philippe Finet, Régine Le Bouquin-Jeannès, Olivier Dameron, and Bernard Gibaud. Review of current telemedicine applications for chronic diseases: Toward a more integrated system? IRBM, 2015. In press.

[104] Philippe Finet, Bernard Gibaud, Olivier Dameron, and Régine Le Bouquin-Jeannès. Interopérabilité d'un système de capteurs en télémédecine. In Proceedings of the Journées d'étude sur la Télésanté, UTC Compiègne, 2015.

[105] Yann Rivault, Olivier Dameron, and Nolwenn Le Meur. Une infrastructure générique basée sur les apports du web sémantique pour l'analyse des bases médico-administratives. In Proceedings of the IN-OVIVE workshop (INtégration de sources/masses de données hétérogènes et Ontologies, dans le domaine des sciences du VIVant et de l'Environnement), conférence IC (Ingénierie des Connaissances) PFIA, 2015.

[106] Nick Juty, Nicolas Le Novère, and Camille Laibe. Identifiers.org and MIRIAM Registry: community resources to provide persistent identification. Nucleic acids research, 40(Database issue):D580–D586, 2012.

[107] David P Hill, Barry Smith, Monica S McAndrews-Hill, and Judith A Blake. Gene ontology annotations: what they mean and where they come from. BMC bioinformatics, 9 Suppl 5:S2, 2008.

[108] Judith A Blake. Ten quick tips for using the gene ontology. PLoS computational biology, 9(11):e1003343, 2013.

[109] Kevin M Livingston, Michael Bada, Lawrence E Hunter, and Karin Verspoor. Representing annotation compositionality and provenance for the semantic web. Journal of biomedical semantics, 4:38, 2013.

[110] Gene Ontology Consortium. The gene ontology (GO) project in 2006. Nucleic acids research, 34(Database issue):D322–D326, 2006.

[111] Gwena¨elle Marquet, Jean Mosser, and Anita Burgun. Aligning biomedical ontologies using lexical methods and the UMLS: the case of disease ontologies. Studies in health technology and informatics, 124:781–786, 2006.

[112] Minoru Kanehisa, Susumu Goto, Shuichi Kawashima, Yasushi Okuno, and Masahiro Hattori. The KEGG resource for deciphering the genome. Nucleic acids research, 32(Database issue):D277–D280, 2004.

[113] Xizeng Mao, Tao Cai, John G. Olyarchuk, and Liping Wei. Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary. Bioinformatics, 21(19):3787–3793, 2005.

[114] Michel Klein. Combining and relating ontologies: an analysis of problems and solutions. In International Joint Conference on Artificial Intelligence IJCAI01, 2001.

[115] Patrick Lambrix and He Tan. SAMBO – a system for aligning and merging biomedical ontologies. Journal of Web Semantics, 4(3), 2006.

[116] I. Horrocks, P.F. Patel-Schneider, and F. van Harmelen. From SHIQ and RDF to OWL: The making of a web ontology language. Journal of Web Semantics, 1(1):7–26, 2003.

[117] Cornelius Rosse, Anand Kumar, Jose L V Mejino, Daniel L Cook, Landon T Detwiler, and Barry Smith. A strategy for improving and integrating biomedical ontologies. In Proceedings, American Medical Informatics Association Fall Symposium AMIA2005, pages 639–643, 2005.

[118] Olivier Dameron, Charles Bettembourg, and Nolwenn Le Meur. Measuring the evolution of ontology complexity: the Gene Ontology case study. PLoS ONE, 8(10):e75993, 2013.

[119] Olivier Bodenreider. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research, 32(Database issue):D267–D270, 2004.

[120] Peter Haase, Jeen Broekstra, Andreas Eberhart, and Raphael Volz. A comparison of RDF query languages. In Proceedings of the Third International Semantic Web Conference (ISWC2004), pages 502–517, 2004.

[121] Chris Wroe, Carole Goble, Antoon Goderis, Phillip Lord, et al. Recycling workflows and services through discovery and reuse: Research articles. In Concurrency Computation: Practice and Experience, volume 19, pages 181–194, 2007.

[122] Roman Vaculin and Katia Sycara. Monitoring execution of OWL-S web services. Proceedings of OWL-S: Experiences and Directions Workshop, European Semantic Web Conference, 2007.

[123] Duncan Hull, Katy Wolstencroft, Robert Stevens, Carole Goble, et al. Taverna: a tool for building and running workflows of services. In Nucleic Acids Research, volume 34, pages W729–32, 2006.

[124] Sooyoung Yoo and Jinwook Choi. On the query reformulation technique for effective MEDLINE document retrieval. Journal of biomedical informatics, 43(5):686–693, 2010.

[125] Nicolas Griffon, Wiem Chebil, Laetitia Rollin, Gaetan Kerdelhue, Benoit Thirion, Jean-François Gehanno, and Stéfan Jacques Darmoni. Performance evaluation of unified medical language system's synonyms expansion to query PubMed. BMC medical informatics and decision making, 12:12, 2012.

[126] Zhiyong Lu, Won Kim, and W John Wilbur. Evaluation of query expansion using MeSH in PubMed. Information retrieval, 12(1):69–80, 2009.

[127] Zhiyong Lu. PubMed and beyond: a survey of web tools for searching biomedical literature. Database : the journal of biological databases and curation, 2011:baq036, 2011.

[128] Dietrich Rebholz-Schuhmann, Harald Kirsch, Miguel Arregui, Sylvain Gaudan, Mark Riethoven, and Peter Stoehr. EBIMed–text crunching to gather facts for proteins from medline. Bioinformatics (Oxford, England), 23(2):e237–e244, 2007.

[129] Yasunori Yamamoto and Toshihisa Takagi. Biomedical knowledge navigation by literature clustering. Journal of biomedical informatics, 40(2):114–130, 2006.

[130] Yoshimasa Tsuruoka, Makoto Miwa, Kaisei Hamamoto, Jun'ichi Tsujii, and Sophia Ananiadou. Discovering and visualizing indirect associations between biomedical concepts. Bioinformatics (Oxford, England), 27(13):i111–i119, 2011.

[131] A Bajpai, S Davuluri, H Haridas, G Kasliwal, H Deepti, KS Sreelakshmi, DS Chandrashekar, P Bora, M Farouk, N Chitturi, V Samudyata, KP ArunNehru, and K Acharya. In search of the right literature search engine(s). Nature precedings, http://dx.doi.org/10.1038/npre.2011.2101.3, 2011.

[132] Michael Muin, Paul Fontelo, Fang Liu, and Michael Ackerman. SLIM: an alternative web interface for MEDLINE/PubMed searches - a preliminary study. BMC medical informatics and decision making, 5:37, 2005.

[133] Bhanu C Vanteru, Jahangheer S Shaik, and Mohammed Yeasin. Semantically linking and browsing PubMed abstracts with gene ontology. BMC genomics, 9 Suppl 1:S10, 2008.

[134] J Bhogal, A Macfarlane, and P Smith. A review of ontology-based query expansion. Information Processing and Management, 43(4):866–886, 2007.

[135] Roberto Navigli and Paola Velardi. An analysis of ontology-based query expansion strategies. In Proceedings of the 14th European Conference on Machine Learning, Workshop on Adaptive Text Extraction and Mining, Cavtat-Dubrovnik, Croatia, 2003.

[136] Sérgio Matos, Joel P Arrais, Joao Maia-Rodrigues, and José Luis Oliveira. Concept-based query expansion for retrieving gene related publications from MEDLINE. BMC bioinformatics, 11:212, 2010.

[137] M A Harris, J Clark, A Ireland, J Lomax, M Ashburner, R Foulger, K Eilbeck, S Lewis, B Marshall, C Mungall, J Richter, G M Rubin, J A Blake, C Bult, M Dolan, H Drabkin, J T Eppig, D P Hill, L Ni, M Ringwald, R Balakrishnan, J M Cherry, K R Christie, M C Costanzo, S S Dwight, S Engel, D G Fisk, J E Hirschman, E L Hong, R S Nash, A Sethuraman, C L Theesfeld, D Botstein, K Dolinski, B Feierbach, T Berardini, S Mundodi, S Y Rhee, R Apweiler, D Barrell, E Camon, E Dimmer, V Lee, R Chisholm, P Gaudet, W Kibbe, R Kishore, E M Schwarz, P Sternberg, M Gwinn, L Hannick, J Wortman, M Berriman, V Wood, N de la Cruz, P Tonellato, P Jaiswal, T Seigfried, R White, and Gene Ontology Consortium. The gene ontology (GO) database and informatics resource. Nucleic acids research, 32(Database issue):D258–D261, 2004.

[138] Evelyn Camon, Michele Magrane, Daniel Barrell, Vivian Lee, Emily Dimmer, John Maslen, David Binns, Nicola Harte, Rodrigo Lopez, and Rolf Apweiler. The gene ontology annotation (GOA) database: sharing knowledge in UniProt with gene ontology. Nucleic acids research, 32(Database issue):D262–D266, 2004.

[139] Andreas Doms and Michael Schroeder. GoPubMed: exploring PubMed with the gene ontology. Nucleic acids research, 33(Web Server issue):W783–W786, 2005.

[140] Seung Yon Rhee, Valerie Wood, Kara Dolinski, and Sorin Draghici. Use and misuse of the gene ontology annotations. Nature Reviews Genetics, 9(7):509–515, 2008.

[141] Patricia L Whetzel, Natalya F Noy, Nigam H Shah, Paul R Alexander, Csongor Nyulas, Tania Tudorache, and Mark A Musen. BioPortal: enhanced functionality via new web services from the national center for biomedical ontology to access and use ontologies in software applications. Nucleic acids research, 39(Web Server issue):W541–W545, 2011.

[142] Manuel Salvadores, Matthew Horridge, Paul R Alexander, Ray W Fergerson, Mark A Musen, and Natalya F Noy. Using SPARQL to query BioPortal ontologies and metadata. In Proceedings of the International Semantic Web Conference ISWC 2012, volume 7650 of Lecture Notes in Computer Science, pages 180–195, 2012.

[143] Nigam H Shah, Tyler Cole, and Mark A Musen. Chapter 9: Analyses using disease ontologies. PLoS computational biology, 8(12):e1002827, 2012.

[144] Andreas Heß and Nicholas Kushmerick. Learning to attach metadata to web services. In D. Fensel et al., editors, Proceedings of the Second International Semantic Web Conference (ISWC2003), pages 258–273, 2003.

[145] Sherri de Coronado, Margaret W Haber, Nicholas Sioutos, Mark S Tuttle, and Lawrence W Wright. NCI thesaurus: using science-based terminology to integrate cancer research results. Studies in health technology and informatics, 107(Pt 1):33–37, 2004.

[146] D. Nardi, R. J. Brachman, F. Baader, W. Nutt, F. M. Donini, U. Sattler, D. Calvanese, R. Molitor, G. De Giacomo, R. Küsters, F. Wolter, D. L. McGuinness, P. F. Patel-Schneider, R. Möller, V. Haarslev, I. Horrocks, A. Borgida, C. Welty, A. Rector, E. Franconi, M. Lenzerini, and R. Rosati. The Description Logics Handbook: Theory, Implementation and Applications. Cambridge University Press, 2003.

[147] Franz Baader, Ian Horrocks, and Ulrike Sattler. Description logics as ontology languages for the semantic web. In Dieter Hutter and Werner Stephan, editors, Festschrift in honor of Jörg Siekmann, Lecture Notes in Artificial Intelligence, 2003.

[148] Dieter Fensel, Ian Horrocks, Frank van Harmelen, Stefan Decker, Michael Erdmann, and Michel Klein. OIL in a nutshell. In Knowledge Acquisition, Modeling and Management, pages 1–16, 2000.

[149] Bernardo Cuenca Grau, Ian Horrocks, Boris Motik, Bijan Parsia, Peter Patel-Schneider, and Ulrike Sattler. OWL 2: the next step for OWL. Journal of Web Semantics, 6(4):309–322, 2008.

[150] C. Rosse, J.L. Mejino, B.R. Modayur, R. Jakobovits, K.P. Hinshaw, and J.F. Brinkley. Motivation and organizational principles for anatomical knowledge representation: The digital anatomist symbolic knowledge base. Journal of the American Medical Informatics Association, 5(1):17–40, Jan/Feb 1998.

[151] C. Rosse and J.L.V. Mejino. A reference ontology for bioinformatics: the foundational model of anatomy. Journal of Biomedical Informatics, 36:478–500, 2003.

[152] N.F. Noy, M.A. Musen, J.L.V. Mejino, and C. Rosse. Pushing the envelope: Challenges in a frame-based representation of human anatomy. Data and Knowledge Engineering Journal, 48(3):335–359, 2004.

[153] Jennifer Golbeck, Gilberto Fragoso, Frank Hartel, Jim Hendler, Jim Oberthaler, and Bijan Parsia. The national cancer institute's thésaurus and ontology. Journal of Web Semantics, 1(1):75–80, 2003.

[154] A. Rector, N. Drummond, M. Horridge, J. Rogers, H. Knublauch, R. Stevens, H. Wang, and C. Wroe. OWL pizzas: practical experience in teaching OWL-DL: common errors and common patterns. In Proceedings of the European Conference on Knowledge Acquisition (EKAW-2004), pages 63–81, 2004.

[155] N. Noy and A. Rector. Defining n-ary relations on the semantic web: use with individuals. W3C technical report, 2004. http://www.w3.org/TR/swbp-n-aryRelations/.

[156] Christine Golbreich, Songmao Zhang, and Olivier Bodenreider. The foundational model of anatomy in OWL: Experience and perspectives. Web Semantics, 4(3):181–195, 2006.

[157] Songmao Zhang, Olivier Bodenreider, and Christine Golbreich. Experience in reasoning with the foundational model of anatomy in OWL DL. In Proceedings of the Pacific Symposium on Biocomputing, number 11, pages 200–211, 2006.

[158] Natalya F. Noy and Daniel L. Rubin. Translating the foundational model of anatomy into OWL. Technical report, Stanford Medical Informatics, 2007.

[159] Daniel L. Cook, Jose L.V. Mejino, and Cornelius Rosse. Evolution of a foundational model of physiology: Symbolic representation for functional bioinformatics. In Proceedings of the MEDINFO’04 Conference, pages 336–340, 2004.

[160] IJ Haimowitz, RS Patil, and P Szolovits. Representing medical knowledge in a terminological language is difficult. In Proceedings of the Annual Symposium on Computer Applications in Medical Care, pages 101–105, 1988.

[161] Christine Golbreich. Combining rule and ontology reasoners for the semantic web. In Proceedings of Rules and Rule Markup Languages for the Semantic Web, number 3323, pages 6–22, 2004.

[162] W Backman, D Bendel, and R Rakhit. The telecardiology revolution: improving the management of cardiac disease in primary care. J R Soc Med., 103(11):442–6, 2010.

[163] K Nikus, J Lähteenmäki, P Lehto, and M Eskola. The role of continuous monitoring in a 24/7 telecardiology consultation service–a feasibility study. J Electrocardiol., 42(6):473–80, 2009.

[164] CT Lin, KC Chang, CL Lin, CC Chiang, SW Lu, SS Chang, BS Lin, HY Liang, RJ Chen, YT Lee, and LW Ko. An intelligent telecardiology system using a wearable and wireless ECG to detect atrial fibrillation. IEEE Trans Inf Technol Biomed., 14(3):726–33, 2010.

[165] P Rubel, J Fayn, G Nollo, D Assanelli, B Li, L Restier, S Adami, S Arod, H Atoui, M Ohlsson, L Simon-Chautemps, D Télisson, C Malossi, GL Ziliani, A Galassi, L Edenbrandt, and P Chevalier. Toward personal eHealth in cardiology. Results from the EPI-MEDICS telemedicine project. J Electrocardiol., 38(4 suppl):100–6, 2005.

[166] GY Lip, R Nieuwlaat, R Pisters, DA Lane, and HJ Crijns. Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the euro heart survey on atrial fibrillation. Chest, 137(2):263–72, 2010.

[167] European Heart Rhythm Association and European Association for Cardio-Thoracic Surgery, AJ Camm, P Kirchhof, GY Lip, U Schotten, I Savelieva, S Ernst, IC Van Gelder, N Al-Attar, G Hindricks, B Prendergast, H Heidbuchel, O Alfieri, A Angelini, D Atar, P Colonna, R De Caterina, J De Sutter, A Goette, B Gorenek, M Heldal, SH Hohloser, P Kolh, JY Le Heuzey, P Ponikowski, and FH Rutten. Guidelines for the management of atrial fibrillation: the task force for the management of atrial fibrillation of the european society of cardiology (ESC). Eur Heart J., 31(19):2369–429, 2010.

[168] BC Grau, I Horrocks, B Motik, B Parsia, P Patel-Schneider, and U Sattler. OWL 2: the next step for OWL. Journal of Web Semantics, 6:309–22, 2008.

[169] I Horrocks, PF Patel-Schneider, H Boley, S Tabet, B Grosof, and M Dean. SWRL: A semantic web rule language combining OWL and RuleML. Technical report, W3C submission, 2004.

[170] GY Lip and JL Halperin. Improving stroke risk stratification in atrial fibrillation. Am J Med., 3(6):484–8, 2010.

[171] E Sirin, B Parsia, BC Grau, A Kalyanpur, and Y Katz. Pellet: A practical OWL-DL reasoner. Journal of Web Semantics, 5:51–3, 2007.

[172] Huanying Helen Gu, Duo Wei, Jose L V Mejino, and Gai Elhanan. Relationship auditing of the FMA ontology. Journal of biomedical informatics, 42(3):550–557, 2009.

[173] Simon Jupp, Robert Stevens, and Robert Hoehndorf. Logical gene ontology annotations (GOAL): exploring gene ontology annotations with OWL. Journal of biomedical semantics, 3 Suppl 1:S3, 2012.

[174] Wiktoria Golik, Olivier Dameron, Jérôme Bugeon, Alice Fatet, Isabelle Hue, Catherine Hurtaud, Matthieu Reichstadt, Marie-Christine Salaün, Jean Vernet, Léa Joret, Frédéric Papazian, Claire Nédellec, and Pierre-Yves Le Bail. ATOL: the multi-species livestock trait ontology. In Proceedings of the 6th Metadata and Semantics Research Conference MTSR, 2012.

[175] Isabelle Hue, Jérôme Bugeon, Olivier Dameron, Alice Fatet, Catherine Hurtaud, Léa Joret, Marie-Christine Meunier-Salaün, Claire Nédellec, Matthieu Reichstadt, Jean Vernet, and Pierre-Yves Le Bail. ATOL and EOL ontologies, steps towards embryonic phenotypes shared worldwide? In Proceedings of the 4th Mammalian Embryo Genomics Meeting, October 2013, Quebec City, volume 149 of Animal Reproduction Science, page 99, 2014.

[176] Pierre-Yves Le Bail, Jérôme Bugeon, Olivier Dameron, Alice Fatet, Wiktoria Golik, Jean-François Hocquette, Catherine Hurtaud, Isabelle Hue, Catherine Jondreville, Léa Joret, Marie-Christine Meunier-Salaün, Jean Vernet, Claire Nédellec, Matthieu Reichstadt, and Philippe Chemineau. Un langage de référence pour le phénotypage des animaux d'élevage : l'ontologie ATOL. Production Animale, 27(3):195–208, 2014.

[177] Christine Golbreich, Matthew Horridge, Ian Horrocks, Boris Motik, and Rob Shearer. OBO and OWL: Leveraging semantic web technologies for the life sciences. In Proceedings of the 6th International Semantic Web Conference (ISWC 2007), volume 4825 of Lecture Notes in Computer Science, pages 169–182, 2007.

[178] FG Davis and BJ McCarthy. Epidemiology of brain tumors. Curr Opin Neurol., 13:635–640, 2000.

[179] P Kleihues, DN Louis, BW Scheithauer, LB Rorke, G Reifenberger, PC Burger, and WK Cavenee. The WHO classification of tumors of the nervous system. J Neuropathol Exp Neurol., 61(3):215–225, 2002.

[180] Marc Cuggia, Paolo Besana, and David Glasspool. Comparing semi-automatic systems for recruitment of patients to clinical trials. International journal of medical informatics, 80(6):371–388, 2011.

[181] Marc Cuggia, Jean-Charles Dufour, Paolo Besana, Olivier Dameron, Regis Duvauferrier, Dominique Fieschi, Catherine Bohec, Annabel Bourdé, Laurent Charlois, Cyril Garde, Isabelle Gibaud, Jean-François Laurent, Oussama Zekri, and Marius Fieschi. ASTEC: a system for automatic selection of clinical trials. In Proceedings of the American Medical Informatics Association Conference AMIA, page 1729, 2011.

[182] Paolo Besana, Marc Cuggia, Oussama Zekri, Annabel Bourdé, and Anita Burgun. Using semantic web technologies for clinical trial recruitment. In 9th International Semantic Web Conference (ISWC2010), 2010.

[183] Derek Corrigan, Adel Taweel, Tom Fahey, Theodoros Arvanitis, and Brendan Delaney. An ontological treatment of clinical prediction rules implementing the Alvarado score. Studies in health technology and informatics, 186:103–107, 2013.

[184] Jean-François Ethier, Olivier Dameron, Vasa Curcin, Mark M. McGilchrist, Robert A. Verheij, Theodoros N. Arvanitis, Adel Taweel, Brendan C. Delaney, and Anita Burgun. A unified structural/terminological framework based on LexEVS: application to TRANSFoRm. Journal of the American Medical Informatics Association, 20(5):986–994, 2013.

[185] L Ohno-Machado, E Parra, S B Henry, S W Tu, and M A Musen. AIDS2: a decision- support tool for decreasing physicians’ uncertainty regarding patient eligibility for HIV treatment protocols. Proceedings Symposium on Computer Applications in Medical Care, pages 429–433, 1993.

[186] Xiujie Chen, Ruizhi Yang, Jiankai Xu, Hongzhe Ma, Sheng Chen, Xiusen Bian, and Lei Liu. A sensitive method for computing GO-based functional similarities among genes with 'shallow annotation'. Gene, 509(1):131–135, 2012.

[187] Sidahmed Benabderrahmane, Malika Smail-Tabbone, Olivier Poch, Amedeo Napoli, and Marie-Dominique Devignes. IntelliGO: a new vector-based semantic similarity measure including annotation origin. BMC bioinformatics, 11(1):588, 2010.

[188] Marcus C Chibucos, Christopher J Mungall, Rama Balakrishnan, Karen R Christie, Rachael P Huntley, Owen White, Judith A Blake, Suzanna E Lewis, and Michelle Giglio. Standardized description of scientific evidence using the evidence ontology (ECO). Database: the journal of biological databases and curation, 2014, 2014. In press.

[189] Frederic B Bastian, Marcus C Chibucos, Pascale Gaudet, Michelle Giglio, Gemma L Holliday, Hong Huang, Suzanna E Lewis, Anne Niknejad, Sandra Orchard, Sylvain Poux, Nives Skunca, and Marc Robinson-Rechavi. The confidence information ontology: a step towards a standard for asserting confidence in annotations. Database : the journal of biological databases and curation, 2015, 2015. In press.

[190] Wei-Nchih Lee, Nigam Shah, Karanjot Sundlass, and Mark Musen. Comparison of ontology-based semantic-similarity measures. AMIA Annual Symposium Proceedings, pages 384–388, 2008.

[191] Michael F Ochs, Aidan J Peterson, Andrew Kossenkov, and Ghislain Bidaut. Incorporation of gene ontology annotations to enhance microarray data analysis. Methods in molecular biology (Clifton, N.J.), 377:243–254, 2007.

[192] Kristian Ovaska, Marko Laakso, and Sampsa Hautaniemi. Fast gene ontology based clustering for microarray experiments. BioData mining, 1(1):11, 2008.

[193] James Z Wang, Zhidian Du, Rapeeporn Payattakool, Philip S Yu, and Chin-Fu Chen. A new method to measure the semantic similarity of GO terms. Bioinformatics (Oxford, England), 23(10):1274–1281, 2007.

[194] Rafal Kustra and Adam Zagdanski. Incorporating gene ontology in clustering gene expression data. In CBMS, pages 555–563. IEEE Computer Society, 2006.

[195] Nadia Bolshakova, Francisco Azuaje, and P´adraig Cunningham. A knowledge-driven approach to cluster validity assessment. Bioinformatics (Oxford, England), 21(10):2546– 2547, 2005.

[196] Billy Chang, Rafal Kustra, and Weidong Tian. Functional-network-based gene set analysis using gene-ontology. PloS one, 8(2):e55635, 2013.

[197] Catia Pesquita, Daniel Faria, Hugo Bastos, António EN Ferreira, André O Falcão, and Francisco M Couto. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics, 9(Suppl 5):S4, 2008.

[198] Catia Pesquita, Daniel Faria, André O Falcão, Phillip Lord, and Francisco M Couto. Semantic similarity in biomedical ontologies. PLoS computational biology, 5(7):e1000443, 2009.

[199] Emmanuel Blanchard, Mounira Harzallah, and Pascale Kuntz. A generic framework for comparing semantic similarities on a subsumption hierarchy. In ECAI2008 - 18th European Conference on Artificial Intelligence, pages 20–24, 2008.

[200] Philip Resnik. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11:95–130, 1999.

[201] Dekang Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, pages 296–304, 1998.

[202] Jay J Jiang and David W Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In International Conference Research on Computational Linguistics (ROCLING X), page 9008, 1997.

[203] G. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(1):39–41, 1995.

[204] Catia Pesquita, Daniel Faria, Hugo Bastos, André O Falcão, and Francisco M Couto. Evaluating GO-based semantic similarity measures. In Proceedings of the 10th Annual Bio-Ontologies Meeting, pages 37–40, 2007.

[205] P.W. Lord, R.D. Stevens, A. Brass, and C.A. Goble. Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics, 19(10):1275–1283, 2003.

[206] Brendan Sheehan, Aaron Quigley, Benoit Gaudin, and Simon Dobson. A relation based measure of semantic similarity for gene ontology annotations. BMC Bioinformatics, 9(1):468, 2008.

[207] Francisco M Couto, M´ario J Silva, and Pedro M Coutinho. Measuring semantic similarity between gene ontology terms. Data and Knowledge Engineering, 61(1):137–152, 2007.

[208] Bo Jin and Xinghua Lu. Identifying informative subsets of the gene ontology with information bottleneck methods. Bioinformatics (Oxford, England), 26(19):2445–2451, 2010.

[209] Jesse Gillis and Paul Pavlidis. Assessing identity, redundancy and confounds in gene ontology annotations over time. Bioinformatics (Oxford, England), 2013. In press.

[210] Gang Chen, Jianhuang Li, and Jianxin Wang. Evaluation of gene ontology semantic similarities on protein interaction datasets. International journal of bioinformatics research and applications, 9(2):173–183, 2013.

[211] R Rada, H Mili, E Bicknell, and M Blettner. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1):17–30, 1989.

[212] Viktor Pekar and Steffen Staab. Taxonomy learning - factoring the structure of a taxonomy into a semantic classification decision. In Proceedings of COLING, 2002.

[213] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, pages 133–138, 1994.

[214] Jill Cheng, Melissa Cline, John Martin, David Finkelstein, Tarif Awad, David Kulp, and Michael A Siani-Rose. A knowledge-based clustering algorithm driven by Gene Ontology. Journal of biopharmaceutical statistics, 14(3):687–700, 2004.

[215] Marco A Alvarez and Changhui Yan. A graph-based semantic similarity measure for the gene ontology. Journal of bioinformatics and computational biology, 9(6):681–695, 2011.

[216] Norberto Díaz-Díaz and Jesús S Aguilar-Ruiz. GO-based functional dissimilarity of gene sets. BMC bioinformatics, 12:360, 2011.

[217] Gaston K Mazandu and Nicola J Mulder. DaGO-Fun: tool for gene ontology-based functional analysis using term information content measures. BMC bioinformatics, 14(1):284, 2013.

[218] Steffen Grossmann, Sebastian Bauer, Peter N Robinson, and Martin Vingron. Improved detection of overrepresentation of gene-ontology annotations with parent-child analysis. Bioinformatics (Oxford, England), 23(22):3024–3031, 2007.

[219] Sebastian Klie, Marek Mutwil, Staffan Persson, and Zoran Nikoloski. Inferring gene functions through dissection of relevance networks: interleaving the intra- and inter-species views. Molecular bioSystems, 8(9):2233–2241, 2012.

[220] Da Wei Huang, Brad T Sherman, and Richard A Lempicki. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic acids research, 37(1):1–13, 2008.

[221] Roland Barriot, David J Sherman, and Isabelle Dutour. How to decide which are the most pertinent overly-represented features during gene set enrichment analysis. BMC bioinformatics, 8:332, 2007.

[222] Miranda D Stobbe, Gerbert A Jansen, Perry D Moerland, and Antoine H C van Kampen. Knowledge representation in metabolic pathway databases. Briefings in bioinformatics, 2012. In press.

[223] Troy Hawkins, Meghana Chitale, and Daisuke Kihara. Functional enrichment analyses and construction of functional similarity networks with high confidence function prediction by PFP. BMC bioinformatics, 11:265, 2010.

[224] Zhixia Teng, Maozu Guo, Xiaoyan Liu, Qiguo Dai, Chunyu Wang, and Ping Xuan. Measuring gene functional similarity based on group-wise comparison of GO terms. Bioinformatics (Oxford, England), 2013. In press.

[225] Aravind Subramanian, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, Scott L Pomeroy, Todd R Golub, Eric S Lander, and Jill P Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545–15550, 2005.

[226] Steven Maere, Karel Heymans, and Martin Kuiper. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics (Oxford, England), 21(16):3448–3449, 2005.

[227] Qi Zheng and Xiu-Jie Wang. GOEAST: a web-based software toolkit for gene ontology enrichment analysis. Nucleic acids research, 36(Web Server issue):W358–W363, 2008.

[228] Gabriela Bindea, Bernhard Mlecnik, Hubert Hackl, Pornpimol Charoentong, Marie Tosolini, Amos Kirilovsky, Wolf-Herman Fridman, Franck Pagès, Zlatko Trajanoski, and Jérôme Galon. ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics (Oxford, England), 25(8):1091–1093, 2009.

[229] Da Wei Huang, Brad T Sherman, Qina Tan, Jack R Collins, W Gregory Alvord, Jean Roayaei, Robert Stephens, Michael W Baseler, H Clifford Lane, and Richard A Lempicki. The DAVID gene functional classification tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome biology, 8(9):R183, 2007.

[230] Erich J Baker, Jeremy J Jay, Jason A Bubier, Michael A Langston, and Elissa J Chesler. GeneWeaver: a web-based system for integrative functional genomics. Nucleic acids research, 40(Database issue):D1067–D1076, 2011.

[231] Bing Zhang, Denise Schmoyer, Stefan Kirov, and Jay Snoddy. GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using gene ontology hierarchies. BMC bioinformatics, 5:16, 2004.

[232] Jui-Hung Hung, Tun-Hsiang Yang, Zhenjun Hu, Zhiping Weng, and Charles DeLisi. Gene set enrichment analysis: performance evaluation and usage guidelines. Briefings in bioinformatics, 13(3):281–291, 2011.

[233] M Ashburner, C A Ball, J A Blake, D Botstein, H Butler, J M Cherry, A P Davis, K Dolinski, S S Dwight, J T Eppig, M A Harris, D P Hill, L Issel-Tarver, A Kasarskis, S Lewis, J C Matese, J E Richardson, M Ringwald, G M Rubin, and G Sherlock. Gene ontology: tool for the unification of biology. the gene ontology consortium. Nature genetics, 25(1):25–29, 2000.

[234] Jing Wang, Xianxiao Zhou, Jing Zhu, Yunyan Gu, Wenyuan Zhao, Jinfeng Zou, and Zheng Guo. GO-function: deriving biologically relevant functions from statistically significant functions. Briefings in bioinformatics, 13(2):216–227, 2011.

[235] Wyatt T Clark and Predrag Radivojac. Information-theoretic evaluation of predicted ontological annotations. Bioinformatics (Oxford, England), 29(13):i53–i61, 2013.

[236] Satoshi Shibata, Mitsuho Sasaki, Takashi Miki, Akira Shimamoto, Yasuhiro Furuichi, Jun Katahira, and Yoshihiro Yoneda. Exportin-5 orthologues are functionally divergent among species. Nucleic acids research, 34(17):4711–4721, 2006.

[237] David Croft, Gavin O’Kelly, Guanming Wu, Robin Haw, Marc Gillespie, Lisa Matthews, Michael Caudy, Phani Garapati, Gopal Gopinath, Bijay Jassal, Steven Jupe, Irina Kalatskaya, Shahana Mahajan, Bruce May, Nelson Ndegwa, Esther Schmidt, Veronica Shamovsky, Christina Yung, Ewan Birney, Henning Hermjakob, Peter D’Eustachio, and Lincoln Stein. Reactome: a database of reactions, pathways and biological processes. Nucleic acids research, 39(Database issue):D691–D697, 2010.

[238] Kirill Degtyarenko, Paula de Matos, Marcus Ennis, Janna Hastings, Martin Zbinden, Alan McNaught, Rafael Alcántara, Michael Darsow, Mickaël Guedj, and Michael Ashburner. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic acids research, 36(Database issue):D344–D350, 2007.

[239] Graeme D Ruxton. The unequal variance t-test is an underused alternative to Student's t-test and the Mann-Whitney U test. Behavioral Ecology, 17(4):688–690, 2006.

[240] Guangchuang Yu, Fei Li, Yide Qin, Xiaochen Bo, Yibo Wu, and Shengqi Wang. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics, 26(7):976–978, 2010.

[241] NCBI Resource Coordinators. Database resources of the national center for biotechnology information. Nucleic acids research, 41(Database issue):D8–D20, 2012.

[242] Thomas Kelder, Martijn P van Iersel, Kristina Hanspers, Martina Kutmon, Bruce R Conklin, Chris T Evelo, and Alexander R Pico. WikiPathways: building research communities on biological pathways. Nucleic acids research, 2011. In press.

[243] Ron Caspi, Hartmut Foerster, Carol A. Fulcher, Rebecca Hopkinson, John Ingraham, Pallavi Kaipa, Markus Krummenacker, Suzanne Paley, John Pick, Seung Y. Rhee, Christophe Tissier, Peifen Zhang, and Peter D. Karp. MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Research, 35:D511–D516, 2006.

[244] Lena Strömbäck and Patrick Lambrix. Representations of molecular pathways: an evaluation of SBML, PSI MI and BioPAX. Bioinformatics, 21(24):4401–4407, 2005.

[245] Alan Ruttenberg, Jonathan Rees, and Jeremy Zucker. What BioPAX communicates and how to extend OWL to help it. In Bernardo Cuenca Grau, Pascal Hitzler, Conor Shankey, and Evan Wallace, editors, Proceedings of the OWLED-06 Workshop on OWL: Experiences and Directions, volume 216. CEUR-WS.org, 2006.

[246] Emek Demir, Michael P Cary, Suzanne Paley, Ken Fukuda, Christian Lemer, Imre Vastrik, Guanming Wu, Peter D'Eustachio, et al. The BioPAX community standard for pathway data sharing. Nature biotechnology, 28(9):935–942, 2010.

[247] Donny Soh, Difeng Dong, Yike Guo, and Limsoon Wong. Consistency, comprehensiveness, and compatibility of pathway databases. BMC bioinformatics, 11:449, 2010.

[248] Tomer Altman, Michael Travers, Anamika Kothari, Ron Caspi, and Peter D Karp. A systematic comparison of the MetaCyc and KEGG pathway databases. BMC bioinformatics, 14:112, 2013.

[249] Liam G Fearnley, Melissa J Davis, Mark A Ragan, and Lars K Nielsen. Extracting reaction networks from databases - opening Pandora's box. Briefings in bioinformatics, 15(6):973–983, 2014.

[250] Nicole Redaschi and UniProt Consortium. UniProt in RDF: tackling data integration and distributed annotation with the semantic web. In 3rd International Biocuration Conference, 2009. Available from Nature Precedings http://dx.doi.org/10.1038/npre.2009.3193.1.

[251] Simon Jupp, James Malone, Jerven Bolleman, Marco Brandizi, Mark Davies, Leyla Garcia, Anna Gaulton, Sebastien Gehant, Camille Laibe, Nicole Redaschi, Sarala M Wimalaratne, Maria Martin, Nicolas Le Novère, Helen Parkinson, Ewan Birney, and Andrew M Jenkinson. The EBI RDF platform: linked open data for the life sciences. Bioinformatics (Oxford, England), 30(9):1338–1339, 2014.

[252] Gang Fu, Colin Batchelor, Michel Dumontier, Janna Hastings, Egon Willighagen, and Evan Bolton. PubChemRDF: towards the semantic annotation of PubChem compound and substance databases. Journal of cheminformatics, 7:34, 2015.

[253] Thomas Kelder, Bruce R Conklin, Chris T Evelo, and Alexander R Pico. Finding the right questions: exploratory pathway analysis to enhance biological discovery in large datasets. PLoS biology, 8(8):e1000472, 2010.

[254] David Gomez-Cabrero, Imad Abugessaisa, Dieter Maier, Andrew Teschendorff, Matthias Merkenschlager, Andreas Gisel, Esteban Ballestar, Erik Bongcam-Rudloff, Ana Conesa, and Jesper Tegn´er. Data integration in the era of omics: current and future challenges. BMC systems biology, 8 Suppl 2:I1, 2014.

[255] Gautier Defossez, Alexandre Rollet, Olivier Dameron, and Pierre Ingrand. Temporal representation of care trajectories of cancer patients using data from a regional information system: an application in breast cancer. BMC Medical Informatics and Decision Making, 14(1):24, 2014.

[256] Joanne S. Luciano and Robert D. Stevens. e-Science and biological pathway semantics. BMC Bioinformatics, 8(Suppl 3):S3, 2007.

[257] Duncan Hull, Katy Wolstencroft, Robert Stevens, Carole Goble, Mathew R Pocock, Peter Li, and Tom Oinn. Taverna: a tool for building and running workflows of services. Nucleic acids research, 34(Web Server issue):W729–W732, 2006.

[258] Carole A Goble, Jiten Bhagat, Sergejs Aleksejevs, Don Cruickshank, Danius Michaelides, David Newman, Mark Borkum, Sean Bechhofer, Marco Roos, Peter Li, and David De Roure. myExperiment: a repository and social network for the sharing of bioinformatics workflows. Nucleic acids research, 38(Web Server issue):W677–W682, 2010.

[259] Jeremy Goecks, Anton Nekrutenko, James Taylor, and Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome biology, 11(8):R86, 2010.

[260] Daniel Blankenberg, James E Johnson, Galaxy Team, James Taylor, and Anton Nekrutenko. Wrangling Galaxy's reference data. Bioinformatics (Oxford, England), 30(13):1917–1919, 2014.

[261] Alejandra González-Beltrán, Eamonn Maguire, Susanna-Assunta Sansone, and Philippe Rocca-Serra. linkedISA: semantic representation of ISA-Tab experimental metadata. BMC bioinformatics, 15 Suppl 14:S4, 2014.

[262] Jon Ison, Matúš Kalaš, Inge Jonassen, Dan Bolser, Mahmut Uludag, Hamish McWilliam, James Malone, Rodrigo Lopez, Steve Pettifer, and Peter Rice. EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics, and formats. Bioinformatics (Oxford, England), 2013. In press.

[263] François Moreews, Yvan Le Bras, Olivier Dameron, Cyril Monjeaud, and Olivier Collin. Integrating Galaxy workflows in a metadata management environment. In Galaxy Community Conference GCC2014, Proceedings, 2014.

[264] Sébastien Ferré. Expressive and scalable query-based faceted search over SPARQL endpoints. In International Semantic Web Conference (ISWC), pages 438–453, 2014.

[265] Samuel M D Seaver, Christopher S Henry, and Andrew D Hanson. Frontiers in metabolic reconstruction and modeling of plant genomes. Journal of experimental botany, 63(6):2247–2258, 2012.

[266] Cristiana Gomes de Oliveira Dal’Molin and Lars Keld Nielsen. Plant genome-scale metabolic reconstruction and modelling. Current opinion in biotechnology, 24(2):271–277, 2012.

[267] Shira Mintz-Oron, Sagit Meir, Sergey Malitsky, Eytan Ruppin, Asaph Aharoni, and Tomer Shlomi. Reconstruction of Arabidopsis metabolic network models accounting for subcellular compartmentalization and tissue-specificity. Proceedings of the National Academy of Sciences of the United States of America, 109(1):339–344, 2011.

[268] Seongwon Seo and Harris A Lewin. Reconstruction of metabolic pathways for the cattle genome. BMC systems biology, 3:33, 2009.

[269] Baikang Pei and Dong-Guk Shin. Reconstruction of biological networks by incorporating prior knowledge into Bayesian network models. Journal of computational biology, 19(12):1324–1334, 2012.

[270] Seyedsasan Hashemikhabir, Eyup Serdar Ayaz, Yusuf Kavurucu, Tolga Can, and Tamer Kahveci. Large-scale signaling network reconstruction. IEEE/ACM transactions on computational biology and bioinformatics, 9(6):1696–1708, 2012.

[271] Chen Li, Maria Liakata, and Dietrich Rebholz-Schuhmann. Biological network extraction from scientific literature: state of the art and challenges. Briefings in bioinformatics, 15(5):856–877, 2013.

[272] Purvesh Khatri, Marina Sirota, and Atul J Butte. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS computational biology, 8(2):e1002375, 2012.

[273] Ron Caspi, Kate Dreher, and Peter D Karp. The challenge of constructing, classifying and representing metabolic pathways. FEMS microbiology letters, 345(2):85–93, 2013.

[274] John P Rooney, Ashish Patil, Fraulin Joseph, Lauren Endres, Ulrike Begley, Maria R Zappala, Richard P Cunningham, and Thomas J Begley. Cross-species functionome analysis identifies proteins associated with DNA repair, translation and aerobic respiration as conserved modulators of UV-toxicity. Genomics, 97(3):133–147, 2010.

[275] Gamze Abaka, Türker Bıyıkoğlu, and Cesim Erten. CAMPways: constrained alignment framework for the comparative analysis of a pair of metabolic pathways. Bioinformatics (Oxford, England), 29(13):i145–i153, 2013.

[276] Ron Caspi, Tomer Altman, Kate Dreher, Carol A Fulcher, Pallavi Subhraveti, Ingrid M Keseler, Anamika Kothari, Markus Krummenacker, Mario Latendresse, Lukas A Mueller, Quang Ong, Suzanne Paley, Anuradha Pujar, Alexander G Shearer, Michael Travers, Deepika Weerasinghe, Peifen Zhang, and Peter D Karp. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic acids research, 40(Database issue):D742–D753, 2011.

[277] Joan Massagué. TGFβ signalling in context. Nature reviews. Molecular cell biology, 13(10):616–630, 2012.

[278] Geoffroy Andrieux, Michel Le Borgne, and Nathalie Théret. An integrative modeling framework reveals plasticity of TGF-beta signaling. BMC systems biology, 8(1):30, 2014.

[279] Kemal Eren, Mehmet Deveci, Onur Küçüktunç, and Ümit V Çatalyürek. A comparative analysis of biclustering algorithms for gene expression data. Briefings in bioinformatics, 14(3):279–292, 2013.

[280] Alessia Visconti, Francesca Cordero, and Ruggero G Pensa. Leveraging additional knowledge to support coherent bicluster discovery in gene expression data. Intelligent Data Analysis, 18(5):837–855, 2014.

[281] Rohit Gupta, Navneet Rao, and Vipin Kumar. Discovery of error-tolerant biclusters from noisy gene expression data. BMC bioinformatics, 12 Suppl 12:S1, 2011.

[282] Rui Henriques and Sara C Madeira. BicPAM: pattern-based biclustering for biomedical data analysis. Algorithms for Molecular Biology, 9(1):27, 2014.

[283] Rudolf Wille. Restructuring lattice theory: An approach based on hierarchies of concepts. In I Rival, editor, Ordered sets, volume 83 of NATO Advanced Study Institutes Series, pages 445–470. Springer Netherlands, 1982.

[284] Sylvain Blachon, Ruggero G Pensa, Jérémy Besson, Céline Robardet, Jean-François Boulicaut, and Olivier Gandrillon. Clustering formal concepts to discover biologically relevant knowledge from gene expression data. In silico biology, 7(4-5):467–483, 2007.

[285] Santiago Videla, Carito Guziolowski, Federica Eduati, Sven Thiele, Martin Gebser, Jacques Nicolas, Julio Saez-Rodriguez, Torsten Schaub, and Anne Siegel. Learning Boolean logic models of signaling networks with ASP. Theoretical Computer Science, 2014. In press.

[286] Mehdi Kaytoue, Sergei O. Kuznetsov, Juraj Macko, and Amedeo Napoli. Biclustering meets triadic concept analysis. Annals of Mathematics and Artificial Intelligence, Special Issue Post-proceedings of CLA 2011, 70(1–2):55–79, 2014.

[287] Mehdi Kaytoue, Victor Codocedo, Jaume Baixeries, and Amedeo Napoli. Three interrelated FCA methods for mining biclusters of similar values on columns. In Karell Bertet and Sebastian Rudolph, editors, The Eleventh International Conference on Concept Lattices and their Applications (CLA 2014), Košice, Slovakia, pages 243–254. CEUR Workshop Proceedings 1252, 2014.

[288] Aleksey Buzmakov, Sergei O. Kuznetsov, and Amedeo Napoli. Scalable estimates of stability. In Cynthia Vera Glodeanu, Mehdi Kaytoue, and Christian Sacarea, editors, 12th International Conference on Formal Concept Analysis (ICFCA 2014), Lecture Notes in Artificial Intelligence LNAI 8478, pages 157–172. Springer, 2014.

[289] Aleksey Buzmakov, Sergei O. Kuznetsov, and Amedeo Napoli. On evaluating interestingness measures for closed itemsets. In 7th European Starting AI Researcher Symposium (STAIRS-2014), Prague, pages 71–80. IOS Press, 2014.

[290] Simon Andrews and Constantinos Orphanides. Analysis of large data sets using formal concept lattices. In Marzena Kryszkiewicz and Sergei Obiedkov, editors, Proceedings of the 7th International Conference on Concept Lattices and Their Applications, volume 672 of CEUR Workshop Proceedings, 2010.

[291] Ruggero G Pensa and Jean-François Boulicaut. Towards fault-tolerant formal concept analysis. In Proceedings of AI*IA 2005: Advances in Artificial Intelligence, volume 3673 of Lecture Notes in Computer Science, pages 212–223, 2005.


Curriculum Vitæ

Identification
Family name: Dameron
Name in use: Dameron
First name: Olivier
Date of birth: 23 October 1974
Rank: Maître de conférences (associate professor), normal class
Institution: Université de Rennes 1
CNU section: 65
ORCID: 0000-0001-8959-7189

Research area
I develop ontology-based methods for analyzing biomedical data, which combines skills in knowledge representation and bioinformatics. My approach consists in exploiting symbolic knowledge about the domain under study in order to improve the analysis of data that are large, complex, highly interdependent and incomplete. To this end, I use Semantic Web technologies to integrate these often distributed data and to combine several kinds of reasoning: deduction, classification, comparison, etc. The main application is the functional characterization and comparison of metabolic and signaling pathways.

Career
since September 2005: Maître de conférences (associate professor), Université de Rennes 1.
January 2004 – June 2005: Postdoctoral fellow, Stanford Medical Informatics group, Stanford University (California, USA). Supervisor: Mark Musen.
October 2000 – December 2003: PhD, "Modeling, representing and sharing anatomical knowledge about the cerebral cortex", Université de Rennes 1; advisor: Bernard Gibaud.
1999–2000: DEA (research master) in medical informatics, Université de Rennes 1; ranked 1st out of 14.
1998–1999: Military service.
1995–1998: Engineering student, computer science department, INSA Rennes.

Administrative and research responsibilities (entire career)
Research
2015–present: Appointed member of section 65 of the CNU (national council of universities).
2015: Principal investigator of the CNRS PEPS "Foundations and applications of data science" project « confocal » (formal concepts, ontological knowledge and linkage analysis).
2014–present: Coordinator of the "Biology and health" theme of IRISA.
2007: PhD committee of Sandrine Pawlicki (Bioinformatics approach to the aggregation and polymerization mechanisms of amyloid proteins).
2005–2013: Creation and co-leadership of the theme "Gene Expression Networks: in vivo, in vitro and in silico" within IFR 140.
2005: Organizing committee of the 8th international Protégé conference.
2004: Organizing committee of the 7th international Protégé conference.

Teaching
Program responsibilities
since September 2012: Co-director of the master 2 program "Bioinformatics and genomics".
2008–2012: Co-director of the "Bioinformatics" track of the master 2 program "Modeling of biological systems".
2007–2012: Director of the research master 1 program "Methods and Processing of Biomedical and Hospital Information".
2010–2012: Co-director of the master 2 program "Methods and Processing of Biomedical and Hospital Information".

Course responsibilities
since September 2005: In charge of the "Foundations of mathematics and probability" and "Methods in computer science" courses of the master 1 program "Public health".
since September 2006: In charge of the "Advanced Web methods in biomedicine" course of the research master 2 program "Public health".
since September 2006: In charge of the "Principles of programming and algorithmics" course of the master 1 program "Bioinformatics and genomics".
since September 2008: In charge of the "Knowledge standardization and bio-ontologies" course of the master 2 program "Public health".
since September 2011: In charge of the "eHealth" module in the third year of the ESIR engineering curriculum.
since September 2012: In charge of the "Software project management" course of the master 1 program "Bioinformatics and genomics".
since September 2015: Co-responsible for the "Bioinformatics" course at ENS Rennes.

Fig. 1 – Yearly teaching load. The standard load is 192 hours (tutorial-hour equivalents). I was on a half-time secondment to INRIA in 2014–2015 and 2015–2016, and therefore only owed 96 tutorial-hour equivalents during that period.

Supervision
– PhD theses: 5 (3 ongoing)
– Engineering internships: 1
– Master 2 internships: 7
– Master 1 internships: 9
March – June 2006: master 1 internship in bioinformatics: Élodie Roques.
January – June 2006: research master 2 internship in medical informatics: Ihssène Belhadj.
February – June 2006: professional master 2 internship in medical and hospital information processing: Nicolas Cottais.
January – June 2007: research master 2 internship in bioinformatics: Élodie Roques.
2007–2010: PhD of Nicolas Lebreton (MENRT grant): "Building task and domain ontologies in bioinformatics and using semantics for the semi-automatic pairing of Web services". Co-supervised with Anita Burgun.
April – June 2008: master 1 internship in bioinformatics: Léa Joret.
January – June 2009: research master 2 internship in bioinformatics: Léa Joret.
April – June 2009: master 1 internship in bioinformatics: Charles Bettembourg.
January – June 2010: research master 2 internship in bioinformatics: Charles Bettembourg.
July 2010 – May 2011: CNAM engineering internship: Pascal van Hille.
2010–2013: PhD of Charles Bettembourg (MENRT grant): "Inter-species comparison of metabolic pathways: application to the study of lipid metabolism in chicken, mouse and human". Co-supervised with Christian Diot (INRA).
April – June 2011: master 1 internship in bioinformatics: Walid Bedhiafi.
April – September 2011: master 2 CCI internship: Nicolas Schnell.
April – June 2012: master 1 internship in bioinformatics and genomics: Jérémy Rio.
April – June 2013: master 1 internship in bioinformatics and genomics: Ayité Kougbeadjo.
October 2013 – : PhD of Philippe Finet (staff engineer at the IT department of the Alençon-Mamers hospital): "Production and transmission of patient monitoring data in a telemedicine context, and their integration into an information system for decision support". Co-supervised with Régine Le Bouquin-Jeannès (LTSI).
April – June 2014: master 1 internship in bioinformatics and genomics: Dominique Mias-Lucquin.
April – June 2014: master 1 internship in bioinformatics and genomics: Loïc Bourgeois.
October 2014 – : PhD of Jean Coquet (MENRT grant): "Semantic-based reasoning for biological pathways analysis". Co-supervised with Jacques Nicolas.
March – August 2015: master 2 internship in statistics for industry: Yann Rivault (co-supervised with Nolwenn Le Meur, EHESP).
April – June 2015: master 1 internship in bioinformatics and genomics: Pierre Vignet.
October 2015 – : PhD of Yann Rivault (ANSM contract): «». Co-supervised with Nolwenn Le Meur (EHESP).

Distinctions, scientific visibility and relations with industry
Distinctions
January – December 2004: INRIA grant for a postdoctoral stay abroad.
2011: The article "Comparison of OWL and SWRL-based ontology modeling strategies for the determination of pacemaker alerts severity" was selected for the Best Paper Award of the AMIA (American Medical Informatics Association) conference. An extended version was submitted to a journal.
2014–2015: Half-time secondment to INRIA.
2015–2016: Half-time secondment to INRIA.
Scientific visibility
2005–2011: Participation in the "Protégé Short Course" and "Protégé-OWL Short Course", fee-based training sessions for industrial and academic audiences organized by the Stanford Medical Informatics group. I took part in their creation during my postdoctoral stay and was then regularly invited to teach part of these courses until 2011.
April 2009: Invited seminar, LRI, Orsay.
May 2010: BreizhJUG presentation: Introduction to the Semantic Web.
October 2011: GenOuest bioinformatics platform day – presentation "Contributions of ontologies to life sciences".
November 2011: BioGenOuest integrative biology thematic school – presentation "Functional characterization and comparison of gene lists: the contribution of semantics".
August 2014: Article "Bioinformatics with Biopython" [in French] in the Python special issue of GNU/Linux magazine.
December 2014: Invited seminar, LINA, Nantes.
February 2015: Invited seminar, Imagine Institute of genetic diseases, Paris.
March 2015: Invited seminar, ENS Rennes.
2014–2015: Blog bioinfo-fr.net [in French]: articles "Gephi pour la visualisation et l'analyse de graphes" 1, "Gérer les versions de vos fichiers : premiers pas avec git" 2 and "Git : cloner un projet, travailler à plusieurs et créer des branches" 3.

Fig. 2 – Number and type of publications per year

1. http://bioinfo-fr.net/gephi-pour-la-visualisation-et-lanalyse-de-graphes
2. http://bioinfo-fr.net/git-premiers-pas
3. http://bioinfo-fr.net/git-usage-collaboratif
Fig. 3 – Evolution of the publications

Published or accepted publications: exhaustive list
Summary:
– H-index: 15
– 18 articles in journals indexed in PubMed or Web of Science
– 4 as first author
– 5 as last author
– 21 full papers in peer-reviewed international conferences indexed in PubMed or Web of Science
– 6 as first author
– 1 as last author

Articles indexed in PubMed or Web of Science
[1] Olivier Dameron, Bernard Gibaud, and Xavier Morandi. “Numeric and Symbolic Representation of the Cerebral Cortex Anatomy: Methods and preliminary results”. In: Surgical and Radiologic Anatomy 26.3 (2004), pp. 191–197.
[2] Daniel L. Rubin, Olivier Dameron, Yasser Bashir, David Grossman, Parvati Dev, and Mark A. Musen. “Using ontologies linked with geometric models to reason about penetrating injuries”. In: Artificial Intelligence in Medicine 37.3 (2006), pp. 167–176.
[3] Olivier Dameron, Mark A. Musen, and Bernard Gibaud. “Using semantic dependencies for consistency management of an ontology of brain-cortex anatomy”. In: Artificial Intelligence in Medicine 39.3 (2007), pp. 217–225.
[4] Amrapali Zaveri, Luciana Cofiel, Jatin Shah, Shreyasee Pradhan, Edwin Chan, Olivier Dameron, Ricardo Pietrobon, and Beng Ti Ang. “Achieving High Research Reporting Quality Through the Use of Computational Ontologies”. In: Neuroinformatics 8.4 (2010), pp. 261–271.
[5] A Burgun, A Rosier, L Temal, J Jacques, R Messai, L Duchemin, L Deleger, C Grouin, P Van Hille, P Zweigenbaum, R Beuscart, D Delerue, O Dameron, P Mabo, and C Henry. “Decision support in telecardiology: An ontology-based patient-centered approach”. In: IRBM 32.3 (2011), pp. 191–194.
[6] Charles Bettembourg, Christian Diot, Anita Burgun, and Olivier Dameron. “GO2PUB: Querying PubMed with Semantic Expansion of Gene Ontology Terms”. In: Journal of Biomedical Semantics 3.1 (2012), p. 7.
[7] Marc Cuggia, Jean-Charles Dufour, Oussama Zekri, Isabelle Gibaud, Cyril Garde, Catherine Bohec, Régis Duvauferrier, Dominique Fieschi, Paolo Besana, Laurent Charlois, Annabel Bourdé, Nicolas Garcelon, Jean-François Laurent, Marius Fieschi, and Olivier Dameron. “ASTEC: Automatic Selection of clinical Trials based on Eligibility Criteria”. In: IRBM (2012).
[8] Olivier Dameron, Charles Bettembourg, and Nolwenn Le Meur. “Measuring the Evolution of Ontology Complexity: the Gene Ontology Case Study”. In: PLoS ONE 8.10 (2013), e75993.
[9] Olivier Dameron, Paolo Besana, Oussama Zekri, Annabel Bourdé, Anita Burgun, and Marc Cuggia. “OWL Model of Clinical Trial Eligibility Criteria Compatible with Partially-known Information”. In: Journal of Biomedical Semantics 4.1 (2013).
[10] Jean-François Ethier, Olivier Dameron, Vasa Curcin, Mark M. McGilchrist, Robert A. Verheij, Theodoros N. Arvanitis, Adel Taweel, Brendan C. Delaney, and Anita Burgun. “A Unified Structural/Terminological Framework based on LexEVS: application to TRANSFoRm”. In: Journal of the American Medical Informatics Association 20.5 (2013), pp. 986–994.
[11] Charles Bettembourg, Christian Diot, and Olivier Dameron. “Semantic particularity measure for functional characterization of gene sets using Gene Ontology”. In: PLoS ONE 9.1 (2014), e86525.
[12] Gautier Defossez, Alexandre Rollet, Olivier Dameron, and Pierre Ingrand. “Temporal representation of care trajectories of cancer patients using data from a regional information system: an application in breast cancer”. In: BMC Medical Informatics and Decision Making 14.1 (2014), p. 24.
[13] Frederic Herault, Annie Vincent, Olivier Dameron, Pascale Le Roy, Pierre Cherel, and Marie Damon. “The longissimus and semimembranosus muscles display marked differences in their gene expression profiles in pig”. In: PLoS ONE 9.5 (2014), e96491.
[14] Sylvain Prigent, Guillaume Collet, Simon M Dittami, Ludovic Delage, Floriane Ethis de Corny, Olivier Dameron, Damien Eveillard, Sven Thiele, Jeanne Cambefort, Catherine Boyen, Anne Siegel, and Thierry Tonon. “The genome-scale metabolic network of Ectocarpus siliculosus (EctoGEM): a resource to study brown algal physiology and beyond”. In: The Plant Journal 80.2 (2014), pp. 367–381.
[15] Charles Bettembourg, Christian Diot, and Olivier Dameron. “Optimal Threshold Determination for Interpreting Semantic Similarity and Particularity: Application to the Comparison of Gene Sets and Metabolic Pathways Using GO and ChEBI”. In: PLoS ONE 10.7 (2015), e0133579.
[16] Philippe Finet, Régine Le Bouquin-Jeannès, Olivier Dameron, and Bernard Gibaud. “Review of current telemedicine applications for chronic diseases: Toward a more integrated system?” In: IRBM (2015). In press.
[17] Andrej Machno, Pierre Jannin, Olivier Dameron, Werner Korb, Gerik Scheuermann, and Jürgen Meixensberger. “Ontology for assessment studies of human-computer-interaction in surgery”. In: Artificial Intelligence in Medicine 63.2 (2015), pp. 73–84.
[18] Arnaud Rosier, Philippe Mabo, Lynda Temal, Pascal Van Hille, Olivier Dameron, Louise Déléger, Cyril Grouin, Pierre Zweigenbaum, Julie Jacques, Emmanuel Chazard, Laure Laporte, Christine Henry, and Anita Burgun. “Personalized and automated remote monitoring of atrial fibrillation”. In: Europace (2015). In press.

International conferences indexed in PubMed or Web of Science
[1] Olivier Dameron, Bernard Gibaud, and Xavier Morandi. “Numeric and Symbolic Knowledge Representation of Cortex Anatomy Using Web Technologies”. In: Artificial Intelligence in Medicine, 8th Conference on AI in Medicine in Europe, AIME 2001, Cascais, Portugal, July 1-4, 2001, Proceedings. Ed. by Silvana Quaglini, Pedro Barahona, and Steen Andreassen. Vol. 2101. Lecture Notes in Computer Science. Springer, 2001, pp. 359–68. ISBN: 3-540-42294-3.
[2] Bernard Gibaud, Olivier Dameron, and Xavier Morandi. “Representation and sharing of numeric and symbolic knowledge about brain cortex anatomy using web technology”. In: Computer Assisted Radiology and Surgery 2001. Ed. by HU Lemke, MW Vannier, K Inamura, AG Farman, and K Doi. 2001, pp. 356–361.
[3] Olivier Dameron, Bernard Gibaud, Anita Burgun, and Xavier Morandi. “Towards a sharable numeric and symbolic knowledge base on cerebral cortex anatomy: lessons from a prototype”. In: American Medical Informatics Association AMIA. 2002, pp. 185–189.
[4] Olivier Dameron, Anita Burgun, Xavier Morandi, and Bernard Gibaud. “Modelling dependencies between relations to insure consistency of a cerebral cortex anatomy knowledge base”. In: Studies in Health Technology and Informatics. 2003, pp. 403–408.
[5] Bernard Gibaud, Olivier Dameron, and Xavier Morandi. “Re-use of a multi-purpose knowledge corpus on cortex anatomy for educational purposes”. In: Studies in Health Technology and Informatics. 2003, pp. 439–444.
[6] Christine Golbreich, Olivier Dameron, Bernard Gibaud, and Anita Burgun. “How to represent ontologies in view of a Medical Semantic Web?” In: AIME 03 Conference Proceedings. 2003, pp. 51–60.
[7] Christine Golbreich, Olivier Dameron, Bernard Gibaud, and Anita Burgun. “Web ontology language requirements w.r.t. expressiveness of taxonomy and axioms in medicine”. In: International Semantic Web Conference ISWC03 Proceedings. Vol. 2870. Lecture Notes in Computer Science. Springer, 2003.
[8] Olivier Dameron, Natalya F. Noy, Holger Knublauch, and Mark A. Musen. “Accessing and Manipulating Ontologies Using Web Services”. In: Proceedings of the Third International Semantic Web Conference (ISWC2004), Semantic Web Services workshop. 2004.
[9] Olivier Dameron, Daniel L. Rubin, and Mark A. Musen. “Challenges in Converting Frame-Based Ontology into OWL: the Foundational Model of Anatomy Case-Study”. In: American Medical Informatics Association Conference AMIA05. 2005, pp. 181–185.
[10] Bernard Gibaud, Olivier Dameron, Éric Poiseau, and Pierre Jannin. “Implementation of atlas-matching capabilities using Web Services technology: lessons learned from the development of a demonstrator”. In: Computer Assisted Radiology and Surgery 2005. 2005.
[11] Daniel L. Rubin, Olivier Dameron, and Mark A. Musen. “Use of Description Logic Classification to Reason about Consequences of Penetrating Injuries”. In: American Medical Informatics Association Conference AMIA05. 2005, pp. 649–653.
[12] Gwenaëlle Marquet, Olivier Dameron, Stephan Saikali, Jean Mosser, and Anita Burgun. “Grading glioma tumors using OWL-DL and NCI Thesaurus”. In: Proceedings of the American Medical Informatics Association Conference AMIA’07. 2007, pp. 508–512.
[13] Elena Beisswanger, Vivian Lee, Jung-Jae Kim, Dietrich Rebholz-Schuhmann, Andrea Splendiani, Olivier Dameron, Stefan Schulz, and Udo Hahn. “Gene Regulation Ontology (GRO): design principles and use cases”. In: Studies in Health Technology and Informatics – Proceedings of the Medical Informatics in Europe conference (MIE’08). Vol. 136. 2008, pp. 9–14.
[14] Cyril Grouin, Arnaud Rosier, Olivier Dameron, and Pierre Zweigenbaum. “Testing tactics to localize de-identification”. In: Studies in Health Technology and Informatics 150 (2009), pp. 735–739.
[15] Anita Burgun, Lynda Temal, Arnaud Rosier, Olivier Dameron, Philippe Mabo, Pierre Zweigenbaum, Regis Beuscart, David Delerue, and Christine Henry. “Integrating clinical data with information transmitted by implantable cardiac defibrillators to support medical decision in telecardiology: the application ontology of the AKENATON project”. In: Proceedings of the American Medical Informatics Association Conference AMIA. 2010, p. 992.
[16] Lynda Temal, Arnaud Rosier, Olivier Dameron, and Anita Burgun. “Mapping BFO and DOLCE”. In: Studies in Health Technology and Informatics 160 (2010), pp. 1065–1069.
[17] Marc Cuggia, Jean-Charles Dufour, Paolo Besana, Olivier Dameron, Regis Duvauferrier, Dominique Fieschi, Catherine Bohec, Annabel Bourdé, Laurent Charlois, Cyril Garde, Isabelle Gibaud, Jean-François Laurent, Oussama Zekri, and Marius Fieschi. “ASTEC: A System for Automatic Selection of Clinical Trials”. In: Proceedings of the American Medical Informatics Association Conference AMIA. 2011, p. 1729.
[18] Olivier Dameron, Pascal van Hille, Lynda Temal, Arnaud Rosier, Louise Déléger, Cyril Grouin, Pierre Zweigenbaum, and Anita Burgun. “Comparison of OWL and SWRL-Based Ontology Modeling Strategies for the Determination of Pacemaker Alerts Severity”. In: Proceedings of the American Medical Informatics Association Conference AMIA. 2011, p. 284.
[19] Cyril Grouin, Louise Déléger, Arnaud Rosier, Lynda Temal, Olivier Dameron, Pascal van Hille, Anita Burgun, and Pierre Zweigenbaum. “Automatic computation of CHA2DS2-VASc score: Information extraction from clinical texts for thromboembolism risk assessment”. In: Proceedings of the American Medical Informatics Association Conference AMIA. 2011, pp. 501–510.
[20] Pascal van Hille, Julie Jacques, Julien Taillard, Arnaud Rosier, David Delerue, Anita Burgun, and Olivier Dameron. “Comparing Drools and Ontology-based reasoning approaches for telecardiology decision support”. In: Studies in Health Technology and Informatics 180 (2012), pp. 300–304.
[21] A Machno, P Jannin, O Dameron, W Korb, G Scheuermann, and J Meixensberger. “Analysis of MCI evaluation studies in surgery”. In: Proceedings of the Computer Assisted Radiology and Surgery conference CARS2012. In press. 2012.

Non-indexed peer-reviewed international conferences
[1] Christian Barillot, Romain Valabregue, Jean-Pierre Matsumoto, Florent Aubry, Habib Benali, Yann Cointepas, Olivier Dameron, Michel Dojat, E. Duchesnay, Bernard Gibaud, Serge Kinkingnéhun, Dimitri Papadopoulos, Mélanie Pellegrini-Issac, and Éric Simon. “Neurobase: Management of Distributed and Heterogeneous Information Sources in Neuroimaging”. In: MICCAI 2004 Conference. Ed. by M. Dojat and B. Gibaud. 2004, pp. 85–94.
[2] Olivier Dameron. “JOT: a Scripting Environment for creating and managing ontologies”. In: 7th International Protégé Conference. 2004.
[3] Olivier Dameron. “Using the JOT plugin for reasoning with Protégé”. In: Workshop on Protégé and Reasoning – 7th International Protégé Conference. 2004.
[4] Olivier Dameron, Bernard Gibaud, and Mark Musen. “Using semantic dependencies for consistency management of an ontology of brain-cortex anatomy”. In: First International Workshop on Formal Biomedical Knowledge Representation KRMED04. 2004, pp. 30–38.
[5] Olivier Dameron and Mark A. Musen. “Accessing and manipulating Life-Sciences Ontologies Using Web Services”. In: W3C Workshop on Semantic Web for Life Sciences. 2004.
[6] Bernard Gibaud, Michel Dojat, Habib Benali, Olivier Dameron, Jean-Pierre Matsumoto, Mélanie Pellegrini-Issac, Romain Valabregue, and Christian Barillot. “Toward an Ontology for Sharing Neuroimaging Data and Processing Methods: Experience Learned from the Development of a Demonstrator”. In: MICCAI 2004 Conference. Ed. by M. Dojat and B. Gibaud. 2004, pp. 15–23.
[7] Holger Knublauch, Olivier Dameron, and Mark Musen. “Weaving the Biomedical Semantic Web with the Protégé OWL Plugin”. In: First International Workshop on Formal Biomedical Knowledge Representation KRMED04. 2004, pp. 39–47.
[8] D.L. Rubin, O. Dameron, Y. Bashir, D. Grossman, P. Dev, and M.A. Musen. “Using ontologies linked with geometric models to reason about penetrating injuries”. In: Intelligent Data Analysis in Medicine and Pharmacology IDAMAP04. 2004.
[9] Olivier Dameron. “Keeping modular and platform-independent software up-to-date: benefits from the Semantic Web”. In: 8th International Protégé Conference. 2005.
[10] Christine Golbreich, Olivier Bierlaire, Olivier Dameron, and Bernard Gibaud. “Use Case: Ontology with Rules for identifying brain anatomical structures”. In: W3C Workshop on Rule Languages for Interoperability. 2005.
[11] Christine Golbreich, Olivier Bierlaire, Olivier Dameron, and Bernard Gibaud. “What reasoning support for ontology and rules? the brain anatomy case study”. In: 8th International Protégé Conference. 2005.
[12] Daniel L. Rubin, Olivier Dameron, and Mark A. Musen. “Using OWL and Description Logics Based Classification for Reasoning in Biomedical Applications”. In: 8th International Protégé Conference. 2005.
[13] Olivier Dameron, Élodie Roques, Daniel L. Rubin, Gwenaëlle Marquet, and Anita Burgun. “Grading lung tumors using OWL-DL based reasoning”. In: 9th International Protégé Conference. 2006.
[14] Julie Chabalier, Olivier Dameron, and Anita Burgun. “Integrating and querying disease and pathway ontologies: building an OWL model and using RDFS queries”. In: Bio-Ontologies Special Interest Group, Intelligent Systems for Molecular Biology conference (ISMB’07). 2007.
[15] Julie Chabalier, Olivier Dameron, and Anita Burgun. “Integrating disease and pathway ontologies”. In: Proceedings of the ISMB conference, Poster Session. 2007.
[16] Olivier Dameron and Julie Chabalier. “Automatic generation of consistency constraints for an OWL representation of the FMA”. In: 10th International Protégé Conference. 2007.
[17] Andrea Splendiani, Elena Beisswanger, Jung-Jae Kim, Vivian Lee, Olivier Dameron, and Dietrich Rebholz-Schuhmann. “Bio-Ontologies in the context of the BOOTStrep project”. In: Proceedings of the Bio-Ontologies SIG Workshop ISMB (poster). 2007.
[18] Olivier Dameron and Julie Chabalier. “Bio-ontologies Tutorial”. In: Proceedings of the Data Integration In Life Science conference DILS. Ed. by A. Bairoch, S. Cohen-Boulakia, and C. Froidevaux. Vol. 5109. LNBI. 2008, p. 208.
[19] Olivier Dameron, Charles Bettembourg, and Léa Joret. “Quantitative cross-species comparison of GO annotations: advantages and limitations of semantic similarity measure”. In: 11th International Protégé Conference. 2009.
[20] Lynda Temal, Arnaud Rosier, Olivier Dameron, and Anita Burgun. “Modeling cardiac rhythm and heart rate using BFO and DOLCE”. In: International Conference on Biomedical Ontology. 2009.
[21] Anita Burgun, Arnaud Rosier, Lynda Temal, Olivier Dameron, Philippe Mabo, Pierre Zweigenbaum, Régis Beuscart, David Delerue, and Christine Henry. “Supporting medical decision in telecardiology: a patient-centered ontology-based approach”. In: Medinfo 2010. In press. 2010.
[22] Nicolas Lebreton, Christophe Blanchet, Daniela Barreiro Claro, Julie Chabalier, Anita Burgun, and Olivier Dameron. “Verification of parameters semantic compatibility for semi-automatic Web service composition: a generic case study”. In: 12th International Conference on Information Integration and Web-based Applications and Services (iiWAS2010). 2010, pp. 845–848.
[23] Olivier Dameron, Paolo Besana, Oussama Zekri, Annabel Bourdé, Anita Burgun, and Marc Cuggia. “OWL Model of Clinical Trial Eligibility Criteria Compatible With Partially-known Information”. In: Proceedings of the Semantic Web for Life Sciences workshop SWAT4LS2012. 2012.
[24] Wiktoria Golik, Olivier Dameron, Jérôme Bugeon, Alice Fatet, Isabelle Hue, Catherine Hurtaud, Matthieu Reichstadt, Marie-Christine Salaün, Jean Vernet, Léa Joret, Frédéric Papazian, Claire Nédellec, and Pierre-Yves Le Bail. “ATOL: the multi-species livestock trait ontology”. In: Proceedings of the 6th Metadata and Semantics Research Conference MTSR. 2012.
[25] Anthony Bretaudeau, Olivier Dameron, Fabrice Legeai, and Yvan Rahbé. “AphidAtlas : avancées récentes”. In: Proceedings of BAPOA 2013 MOP. INRA, CIRAD Lavalette Campus, Montpellier, France. Ed. by M. Uzest. [In French]. 2013.
[26] Isabelle Hue, Jérôme Bugeon, Olivier Dameron, Alice Fatet, Catherine Hurtaud, Léa Joret, Marie-Christine Meunier-Salaün, Claire Nédellec, Matthieu Reichstadt, Jean Vernet, and Pierre-Yves Le Bail. “ATOL and EOL ontologies, steps towards embryonic phenotypes shared worldwide?” In: Proceedings of the 4th Mammalian Embryo Genomics Meeting, October 2013, Quebec City. Vol. 149. Animal Reproduction Science 1–2. 2014, p. 99.
[27] François Moreews, Yvan Le Bras, Olivier Dameron, Cyril Monjeaud, and Olivier Collin. “Integrating GALAXY workflows in a metadata management environment”. In: Galaxy Community Conference GCC2014, Proceedings. 2014. URL: https://wiki.galaxyproject.org/Events/GCC2014/Abstracts/Posters#P28:_Integrating_GALAXY_workflows_in_a_metadata_management_environment.

Articles in national peer-reviewed journals
[1] JJ Levrel, B Carsin-Nicol, C Ouail-Tabourel, E Chabert, P Darnault, B Gibaud, O Dameron, and X Morandi. “Electronic imaging with photo-realistic rendering for neuroanatomy teaching: methods and preliminary results”. In: Journal of Neuroradiology (2002).
[2] V Bertaud, I Belhadj, O Dameron, N Garcelon, L Hendaoui, F Marin, and R Duvauferrier. “L’informatisation du signe radiologique”. In: Journal de Radiologie 88.1 (2007), pp. 27–37.
[3] Philippe Finet, Régine Le Bouquin-Jeannès, and Olivier Dameron. “La télémédecine dans la prise en charge des maladies chroniques [in French]”. In: Techniques Hospitalières 740 (2013).
[4] Pierre-Yves Le Bail, Jérôme Bugeon, Olivier Dameron, Alice Fatet, Wiktoria Golik, Jean-François Hocquette, Catherine Hurtaud, Isabelle Hue, Catherine Jondreville, Léa Joret, Marie-Christine Meunier-Salaün, Jean Vernet, Claire Nedellec, Matthieu Reichstadt, and Philippe Chemineau. “Un langage de référence pour le phénotypage des animaux d’élevage : l’ontologie ATOL”. In: Production Animale 27.3 (2014), pp. 195–208.

Non-indexed national peer-reviewed conferences
[1] Olivier Dameron, Bernard Gibaud, and Xavier Morandi. “Représentation de connaissances numériques et symboliques sur l’anatomie du cortex cérébral par des technologies du web”. In: Forum du Jeune Chercheur, Compiègne. 2001.
[2] Olivier Dameron, Anita Burgun, Xavier Morandi, and Bernard Gibaud. “Ontologie stratifiée de l’anatomie du cortex cérébral : application au maintien de la cohérence”. In: Journée Web Sémantique, Rennes. 2003.
[3] Christine Golbreich, Olivier Dameron, Bernard Gibaud, and Anita Burgun. “Comment représenter les ontologies pour tendre vers un Web Sémantique Médical ?” In: Journées Françaises de la Toile. 2003.
[4] Julie Chabalier, Gwenaëlle Marquet, Olivier Dameron, and Anita Burgun. “Enrichissement de la hiérarchie KEGG par l’exploitation de Gene Ontology”. In: Workshop OGSB, JOBIM’06. 2006.
[5] Julie Chabalier, Olivier Dameron, and Anita Burgun. “Using knowledge about pathways as an organizing principle for disease ontologies”. In: Journées Ouvertes Biologie, Informatique et Mathématiques (JOBIM’07). 2007.
[6] Nicolas Lebreton, Olivier Dameron, Christophe Blanchet, and Julie Chabalier. “Utilisation d’ontologies de tâches et de domaine pour la composition semi-automatique de services Web bioinformatiques”. In: Proceedings of the Journées Ouvertes de Biologie, Informatique et Mathématiques (JOBIM 2008). 2008.
[7] Élodie Roques, Julie Chabalier, and Olivier Dameron. “Enrichissement sémantique de patrons syntaxiques pour l’amélioration du mapping entre voies métaboliques et processus biologiques”. In: Proceedings of the Journées Ouvertes de Biologie, Informatique et Mathématiques (JOBIM 2008). 2008.
[8] Cyril Grouin, Arnaud Rosier, Olivier Dameron, and Pierre Zweigenbaum. “Une procédure d’anonymisation à deux niveaux pour créer un corpus de comptes rendus hospitaliers”. In: Journées Francophones d’Informatique Médicale. 2009.
[9] Nicolas Lebreton, Christophe Blanchet, Julie Chabalier, and Olivier Dameron. “Utilisation d’ontologies de tâches et de domaine pour la composition semi-automatique de services Web bioinformatiques”. In: Journées Ouvertes Biologie, Informatique et Mathématiques (JOBIM 2009). 2009.
[10] Charles Bettembourg, Christian Diot, and Olivier Dameron. “Cross-Species Metabolic Pathways Comparison: Focus on Mouse, Human and Chicken Lipid Metabolism”. In: Journées Ouvertes Biologie, Informatique et Mathématiques (JOBIM 2011). In press. 2011.
[11] Charles Bettembourg, Christian Diot, Anita Burgun, and Olivier Dameron. “GO2PUB: Querying PubMed with Semantic Expansion of Gene Ontology Terms”. In: Journées Ouvertes Biologie, Informatique et Mathématiques (JOBIM 2012). 2012.
[12] Alexandre Rollet, Gautier Defossez, Olivier Dameron, Poitou-Charentes CoRIM, Poitou-Charentes CRISAP, and Pierre Ingrand. “Développement et évaluation d’un algorithme de représentation des parcours de soins de patientes atteintes d’un cancer du sein à partir des données d’un système d’information régional”. In: Proceedings of the conference Évaluation Management Organisation Information Santé (EMOIS2013, Nancy, France). [In French, short abstract]. 2013.
[13] Charles Bettembourg, Olivier Dameron, Anthony Bretaudeau, and Fabrice Legeai. “Intégration et interrogation de réseaux de régulation génomique et post-génomique”. In: Proceedings of the IN-OVIVE workshop (INtégration de sources/masses de données hétérogènes et Ontologies, dans le domaine des sciences du VIVant et de l’Environnement), conférence IC (Ingénierie des Connaissances) PFIA. 2015.
[14] Jean Coquet, Geoffroy Andrieux, Jacques Nicolas, Olivier Dameron, and Nathalie Theret. “Analysis of TGF-beta signalization pathway thanks to topological and Semantic Web methods”. In: Journées Ouvertes Biologie, Informatique et Mathématiques (JOBIM 2015), poster session. 2015.
[15] Philippe Finet, Bernard Gibaud, Olivier Dameron, and Régine Le Bouquin-Jeannès. “Interopérabilité d’un système de capteurs en télémédecine”. In: Proceedings of the Journées d’étude sur la Télésanté, UTC Compiègne. 2015.
[16] Yann Rivault, Olivier Dameron, and Nolwenn Le Meur. “Une infrastructure générique basée sur les apports du Web Sémantique pour l’analyse des bases médico-administratives”. In: Proceedings of the IN-OVIVE workshop (INtégration de sources/masses de données hétérogènes et Ontologies, dans le domaine des sciences du VIVant et de l’Environnement), conférence IC (Ingénierie des Connaissances) PFIA. 2015.
[17] Yann Rivault, Olivier Dameron, and Nolwenn Le Meur. “La gestion de données médico-administratives grâce aux outils du Web Sémantique”. In: Proceedings of the conference Évaluation Management Organisation Information Santé (EMOIS2016, Dijon, France). [In French, short abstract]. 2016.

Miscellaneous
Sailing: certified federal sailing instructor (catamaran). Participation in the CNRS challenge in 2010, 2011, 2012, 2013, 2014 and 2015.
Associations: president of a parent-run daycare (February 2010 – June 2011); treasurer of the « association voile recherche-enseignement Rennes » (since 2012).

18 December 2015