Ontology-Based Methods for Analyzing Life Science Data Olivier Dameron

Total Page:16

File Type:pdf, Size:1020Kb

Ontology-Based Methods for Analyzing Life Science Data Olivier Dameron Ontology-based methods for analyzing life science data Olivier Dameron To cite this version: Olivier Dameron. Ontology-based methods for analyzing life science data. Bioinformatics [q-bio.QM]. Univ. Rennes 1, 2016. tel-01403371 HAL Id: tel-01403371 https://hal.inria.fr/tel-01403371 Submitted on 25 Nov 2016 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Habilitation a` Diriger des Recherches pr´esent´ee par Olivier Dameron Ontology-based methods for analyzing life science data Soutenue publiquement le 11 janvier 2016 devant le jury compos´ede Anita Burgun Professeur, Universit´eRen´eDescartes Paris Examinatrice Marie-Dominique Devignes Charg´eede recherches CNRS, LORIA Nancy Examinatrice Michel Dumontier Associate professor, Stanford University USA Rapporteur Christine Froidevaux Professeur, Universit´eParis Sud Rapporteure Fabien Gandon Directeur de recherches, Inria Sophia-Antipolis Rapporteur Anne Siegel Directrice de recherches CNRS, IRISA Rennes Examinatrice Alexandre Termier Professeur, Universit´ede Rennes 1 Examinateur 2 Contents 1 Introduction 9 1.1 Context ......................................... 10 1.2 Challenges . 11 1.3 Summary of the contributions . 14 1.4 Organization of the manuscript . 18 2 Reasoning based on hierarchies 21 2.1 Principle......................................... 21 2.1.1 RDF for describing data . 21 2.1.2 RDFS for describing types . 24 2.1.3 RDFS entailments . 26 2.1.4 Typical uses of RDFS entailments in life science . 26 2.1.5 Synthesis . 30 2.2 Case study: integrating diseases and pathways . 31 2.2.1 Context . 31 2.2.2 Objective . 32 2.2.3 Linking pathways and diseases using GO, KO and SNOMED-CT . 32 2.2.4 Querying associated diseases and pathways . 33 2.3 Methodology: Web services composition . 39 2.3.1 Context . 39 2.3.2 Objective . 40 2.3.3 Semantic compatibility of services parameters . 40 2.3.4 Algorithm for pairing services parameters . 40 2.4 Application: ontology-based query expansion with GO2PUB . 43 2.4.1 Context . 43 2.4.2 Objective . 44 2.4.3 Semantic expansion . 45 2.4.4 Query generation . 45 2.5 Synthesis ........................................ 47 3 Reasoning based on classification 51 3.1 Principle......................................... 51 3.1.1 OWL Classes . 52 3.1.2 Union and intersection of classes . 53 3.1.3 Disjoint classes . 53 3.1.4 Negation: complement of a class . 54 3.1.5 Existential and universal restrictions . 54 3.1.6 Cardinality restrictions . 55 3.1.7 Property chains . 55 3 3.1.8 Synthesis . 55 3.2 Methodology: Description-logics representation of anatomy . 57 3.2.1 Context . 57 3.2.2 Objective . 58 3.2.3 Converting the FMA into OWL-DL . 58 3.2.4 Addressing expressiveness and application-independence: OWL-Full . 60 3.2.5 Pattern-based generation of consistency constraints . 61 3.3 Methodology: diagnosis of heart-related injuries . 64 3.3.1 Context . 64 3.3.2 Objective . 65 3.3.3 Reasoning about coronary artery ischemia . 65 3.3.4 Reasoning about pericardial effusion . 68 3.4 Optimization: modeling strategies for estimating pacemaker alerts severity . 73 3.4.1 Context . 73 3.4.2 Objective . 74 3.4.3 CHA2DS2VASc score . 74 3.4.4 Modeling strategies . 76 3.4.5 Comparison of the strategies' performances . 77 3.5 Synthesis ........................................ 80 4 Reasoning with incomplete information 83 4.1 Principle......................................... 84 4.2 Methodology: grading tumors . 84 4.2.1 Context . 84 4.2.2 Objective . 85 4.2.3 Why the NCIT is not up to the task . 85 4.2.4 An ontology of glioblastoma based on the NCIT . 86 4.2.5 Narrowing the possible grades in case of incomplete information . 89 4.3 Methodology: clinical trials recruitment . 91 4.3.1 Context . 91 4.3.2 Objective . 92 4.3.3 The problem of missing information . 92 4.3.4 Eligibility criteria design pattern . 95 4.3.5 Reasoning . 96 4.4 Synthesis ........................................ 98 5 Reasoning with similarity and particularity 101 5.1 Principle......................................... 101 5.1.1 Comparing elements with independent annotations . 102 5.1.2 Taking the annotations underlying structure into account . 103 5.1.3 Synthesis . 108 5.2 Methodology: semantic particularity measure . 109 5.2.1 Context . 109 5.2.2 Objective . 110 5.2.3 Definition of semantic particularity . 111 5.2.4 Formal properties of semantic particularity . 111 5.2.5 Measure of semantic particularity . 112 5.2.6 Use case: Homo sapiens aquaporin-mediated transport . 113 5.3 Methodology: threshold determination for similarity and particularity . 116 5.3.1 Context . 117 4 5.3.2 Objective . 118 5.3.3 Similarity threshold . 118 5.3.4 Particularity threshold . 125 5.3.5 Evaluation of the impact of the new threshold on HolomoGene . 126 5.4 Synthesis ........................................ 128 6 Conclusion and research perspectives 129 6.1 Producing and querying linked data . 130 6.1.1 Representing our data as linked data . 131 6.1.2 Querying linked data . 134 6.2 Analyzing data . 135 6.2.1 Selecting relevant candidates when reconstructing metabolic pathways . 136 6.2.2 Analyzing TGF-β signaling pathways . 136 6.2.3 Data analysis method combining ontologies and formal concept analysis . 137 Bibliography 139 Curriculum Vitæ 163 5 Acknowledgments I and indebted to Mark Musen, Anita Burgun and Anne Siegel for allowing me to join their team, for their help and their encouragement, as well as for setting up such exciting environments. I am most grateful to Michel Dumontier, Christine Froidevaux and Fabien Gandon for kindly accepting to review this document, and also to Anita Burgun, Marie-Dominique Devignes, Anne Siegel and Alexandre Termier for accepting to be parts of the committee. I am thankful to all my colleagues for all the work we have accomplished together, as summarized by Figure 1 on the facing page. The PhD students I co-supervised have brought significant contributions to most of the works presented in this manuscript: Nicolas Lebreton, Elodie´ Roques, Charles Bettembourg, Philippe Finet, Jean Coquet and Yann Rivault. Close collaborations with colleagues from whom I learnt a lot were very stimulating: Daniel Rubin, Natasha Noy, Anita Burgun, Julie Chabalier, Andrea Splendiani, Paolo Besana, Lynda Temal, Pierre Zweigenbaum, Cyril Grouin, Annabel Bourd´e,Oussama Zekri, Pascal van Hille, Jean-Fran¸cois Ethier,´ Bernard Gibaud, R´egine Le Bouquin-Jeann`es, Anne Siegel, Jacques Nico- las, Guillaume Collet, Sylvain Prigent, Geoffroy Andrieux, Nathalie Th´eret, Fabrice Legeai, Anthony Bretaudeau, Olivier Filangi and again Charles Bettembourg. Collaborating with colleagues from INRA was a great opportunity to tackle \real problems" and gave a great perspective on what this is all about: Christian Diot, Fr´ed´eric H´erault, Denis Tagu, M´elanie Jubault, Aur´elie Evrard,´ Cyril Falentin Pierre-Yves Lebail and all the ATOL curators: J´er^omeBugeon, Alice Fatet, Isabelle Hue, Catherine Hurtaud, Matthieu Reichstadt, Marie-Christine Salaun,¨ Jean Vernet, L´eaJoret. Similarly, collaborations with colleagues from EHESP provided the \human" counterpart: Nolwenn Le Meur, Yann Rivault. More broadly, I benefited from the interactions with many other colleagues: Gwena¨elle Marquet, Fleur Mougin, Ammar Mechouche, Mika¨el Roussel, as well as the whole Symbiose group at IRISA, and particularly Pierre Peterlongo. Working on workflows with Olivier Collin, Yvan Le Bras, Alban Gaignard and Audrey Bihou´ee has been fun and I am eager to see what will come out of it. In addition to research all these years were also the occasion of great teaching-related en- counters: Christian Delamarche, Emmanuelle Becker, Emmanuel Giudice, Annabelle Monnier, Yann le Cunff, C´edric Wolf. Thank you all, I enjoyed all of it. 6 Figure 1: Graph of my co-authors. Two authors are linked if they share at least one publication. Node size is proportional to the number of articles. 7 8 Chapter 1 Introduction This document summarizes my research activities since the defense of my PhD in Decem- ber 2003. This work has been carried initially as a postdoctoral fellow at Stanford University with Mark Musen's Stanford Medical Informatics group (now BMIR1), and then as an associate professor at University of Rennes 1, first with the UPRES-EA 3888 (which became UMR 936 INSERM { University of Rennes 1 in 2009) from 2005 to 2012, and then with the Dyliss team at IRISA since 2013. First, I will present the context in which my research takes place. We will see that the traditional approaches for analyzing life science data do not scale up and cannot handle their increasing quantity, complexity and connectivity. It has become necessary to develop automatic tools not only for performing the analyses, but also for helping the experts do it. Yet, processing the raw data is so difficult to automate that these tools usually hinge on annotations and metadata as machine-processable proxies that describe the data and the relations between them. Second, I will identify the main challenges. While generating these metadata is a challenge of its own that I will not tackle here, it is only the first step. Even if metadata tend to be more compact than the original data, each piece of data is typically associated with many metadata, so the problem of data quantity remains. These metadata have to be reused and combined, even if they have been generated by different people, in different places, in different contexts, so we also have a problem of integration.
Recommended publications
  • Analysis of the Impact Factor of Science Journals
    Master in Artificial Intelligence (UPC-URV-UB) Master of Science Thesis Analysis of the Impact Factor of scientific journals Jaume Singla Valls Advisor: Dr. Sergio Gómez February, 2014 Index 1. Introduction......................................................................................................................................3 1.1. The Impact Factor..................................................................................................................... 4 1.2. Summary................................................................................................................................... 5 2. Citations and Impact Factor Data..................................................................................................... 6 2.1. Data download..........................................................................................................................6 2.1.1. Version based on Mechanize Python Library................................................................... 6 2.1.2. Old SOAP Version.............................................................................................................. 7 2.1.3. New official SOAP Version.................................................................................................9 2.2. Data selection......................................................................................................................... 10 2.3. Data validation........................................................................................................................11
    [Show full text]
  • A Survey of the First 20 Years of Research on Semantic Web and Linked Data Fabien Gandon
    A Survey of the First 20 Years of Research on Semantic Web and Linked Data Fabien Gandon To cite this version: Fabien Gandon. A Survey of the First 20 Years of Research on Semantic Web and Linked Data. Revue des Sciences et Technologies de l’Information - Série ISI : Ingénierie des Systèmes d’Information, Lavoisier, 2018, 10.3166/ISI.23.3-4.11-56. hal-01935898 HAL Id: hal-01935898 https://hal.inria.fr/hal-01935898 Submitted on 27 Nov 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. A Survey of the First 20 Years of Research on Semantic Web and Linked Data Fabien Gandon Inria, Université Côte d’Azur, CNRS, I3S, Wimmics 2004 rt des Lucioles, 06902 Sophia Antipolis, France [email protected] ABSTRACT. This paper is a survey of the research topics in the field of Semantic Web, Linked Data and Web of Data. This study looks at the contributions of this research community over its first twenty years of existence. Compiling several bibliographical sources and bibliometric indica- tors, we identify the main research trends and we reference some of their major publications to provide an overview of that initial period.
    [Show full text]
  • JOURNAL of WEB SEMANTICS Science, Services and Agents on the World Wide Web
    JOURNAL OF WEB SEMANTICS Science, Services and Agents on the World Wide Web AUTHOR INFORMATION PACK TABLE OF CONTENTS XXX . • Description p.1 • Impact Factor p.2 • Abstracting and Indexing p.2 • Editorial Board p.2 • Guide for Authors p.5 ISSN: 1570-8268 DESCRIPTION . The Journal of Web Semantics is an interdisciplinary journal based on research and applications of various subject areas that contribute to the development of a knowledge-intensive and intelligent service Web. These areas include: knowledge technologies, ontology, agents, databases and the semantic grid, obviously disciplines like information retrieval, language technology, human- computer interaction and knowledge discovery are of major relevance as well. All aspects of the Semantic Web development are covered. The publication of large-scale experiments and their analysis is also encouraged to clearly illustrate scenarios and methods that introduce semantics into existing Web interfaces, contents and services. The journal emphasizes the publication of papers that combine theories, methods and experiments from different subject areas in order to deliver innovative semantic methods and applications. The Journal of Web Semantics addresses various prominent application areas including: e-business, e-community, knowledge management, e-learning, digital libraries and e-sciences. The Journal of Web Semantics features a multi-purpose web site, which can be found at: http://www.semanticwebjournal.org/. Readers are also encouraged to visit the Journal of Web Semantics blog, at http://journalofwebsemantics.blogspot.com/
    [Show full text]
  • The Document Components Ontology (Doco)
    The Document Components Ontology (DoCO) Editor(s): Oscar Corcho, Universidad Politécnica de Madrid, Spain Solicited review(s): Francesco Ronzano, Universitat Pompeu Fabra, Barcelona, Spain; Almudena Ruiz Iniesta, Universidad Politécnica de Madrid, Spain; one anonymous reviewer Alexandru Constantina, Silvio Peronib,c,*, Steve Pettiferd, David Shottone, and Fabio Vitalib a École Polytechnique Fédérale de Lausanne EPFL IC IIF LSIR, BC 159 (Bâtiment BC), Station 14, CH-1015 Lausanne, Switzerland [email protected] b Department of Computer Science And Engineering, University of Bologna Mura Anteo Zamboni 7, 40127 Bologna (BO), Italy [email protected], [email protected] c Semantic Technology Laboratory, Institute of Cognitive Sciences and Technologies, National Research Council, Via Nomentana 56, 00161 Rome (RM), Italy d School of Computer Science, University of Manchester Kilburn Building, M13 9PL Manchester, United Kingdom [email protected] e Oxford e-Research Centre, University of Oxford 7 Keble Road, OX1 3QG Oxford, United Kingdom [email protected] Abstract. The availability in machine-readable form of descriptions of the structure of documents, as well as of the document discourse (e.g. the scientific discourse within scholarly articles), is crucial for facilitating semantic publishing and the overall comprehension of documents by both users and machines. In this paper we introduce DoCO, the Document Components Ontology, an OWL 2 DL ontology that provides a general-purpose structured vocabulary of document elements to describe both structural and rhetorical document components in RDF. In addition to describing the formal description of the ontology, this paper showcases its utility in practice in a variety of our own applications and other activities of the Semantic Publishing community that rely on DoCO to annotate and retrieve document components of scholarly articles.
    [Show full text]
  • A Survey of the First 20 Years of Research on Semantic Web and Linked Data Fabien Gandon
    A Survey of the First 20 Years of Research on Semantic Web and Linked Data Fabien Gandon To cite this version: Fabien Gandon. A Survey of the First 20 Years of Research on Semantic Web and Linked Data. Revue des Sciences et Technologies de l’Information - Série ISI : Ingénierie des Systèmes d’Information, Lavoisier, In press, <10.3166/ISI.23.3-4.11-56>. <hal-01935898> HAL Id: hal-01935898 https://hal.inria.fr/hal-01935898 Submitted on 27 Nov 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. A Survey of the First 20 Years of Research on Semantic Web and Linked Data Fabien Gandon Inria, Université Côte d’Azur, CNRS, I3S, Wimmics 2004 rt des Lucioles, 06902 Sophia Antipolis, France [email protected] ABSTRACT. This paper is a survey of the research topics in the field of Semantic Web, Linked Data and Web of Data. This study looks at the contributions of this research community over its first twenty years of existence. Compiling several bibliographical sources and bibliometric indica- tors, we identify the main research trends and we reference some of their major publications to provide an overview of that initial period.
    [Show full text]
  • Semantic Web: Who Is Who in the Field – a Bibliometric Analysis
    Accepted for Publication By the Journal of Information Science: http://jis.sagepub.co.uk Semantic Web: Who is who in the field – A bibliometric analysis Ying Ding1 School of Library and Information Science, Indiana University Bloomington, IN, USA Abstract The Semantic Web is one of the main efforts aiming to enhance human and machine interaction by representing data in an understandable way for machines to mediate data and services. It is a fast-moving and multidisciplinary field. This study conducts a thorough bibliometric analysis of the field by collecting data from Web of Science (WOS) and Scopus for the period of 1960-2009. It utilizes a total of 44,157 papers with 651,673 citations from Scopus, and 22,951 papers with 571,911 citations from WOS. Based on these papers and citations, it evaluates the research performance of the Semantic Web (SW) by identifying the most productive players, major scholarly communication media, highly cited authors, influential papers and emerging stars. Keywords: citation analysis, semantic web, research evaluation, impact analysis 1. Introduction The Web is experiencing tremendous changes in its function to connect information, people and knowledge, but also facing severe challenges to integrate data and facilitate knowledge discovery. The Semantic Web is one of the main efforts aiming to enhance human and machine interaction by representing data in an understandable way for machine to mediate data and services [1]. Recently, PriceWaterhouseCoopers [2] has predicted that Semantic Web technologies may revolutionize the entire enterprise of decision-making and information sharing. The profile of the Semantic Web has been further heightened by the Obama administration’s new groundbreaking plan to initiate Semantic Web technologies to bring transparency to government activities [3].
    [Show full text]
  • 2019 Journal Citation Reports Full Journal List
    2019 Journal Citation Reports Full journal list Every journal has a story to tell About the Journal Citation Reports Each year, millions of scholarly works are published containing tens of millions of citations. Each citation is a meaningful connection created by the research community in the process of describing their research. The journals they use are the journals they value. Journal Citation Reports aggregates citations to our selected core of journals, allowing this vast network of scholarship to tell its story. Journal Citation Reports provides journal intelligence that highlights the value and contribution of a journal through a rich array of transparent data, metrics and analysis. jcr.clarivate.com 2 Journals in the JCR with a Journal Impact Factor Full Title Abbreviated Title Country/Region SCIE SSCI 2D MATERIALS 2D MATER ENGLAND ! 3 BIOTECH 3 BIOTECH GERMANY ! 3D PRINTING AND ADDITIVE 3D PRINT ADDIT MANUF UNITED STATES ! MANUFACTURING 4OR-A QUARTERLY JOURNAL OF 4OR-Q J OPER RES GERMANY ! OPERATIONS RESEARCH AAPG BULLETIN AAPG BULL UNITED STATES ! AAPS JOURNAL AAPS J UNITED STATES ! AAPS PHARMSCITECH AAPS PHARMSCITECH UNITED STATES ! AATCC JOURNAL OF AATCC J RES UNITED STATES ! RESEARCH AATCC REVIEW AATCC REV UNITED STATES ! ABACUS-A JOURNAL OF ACCOUNTING FINANCE AND ABACUS AUSTRALIA ! BUSINESS STUDIES ABDOMINAL RADIOLOGY ABDOM RADIOL UNITED STATES ! ABHANDLUNGEN AUS DEM ABH MATH SEM MATHEMATISCHEN SEMINAR GERMANY ! HAMBURG DER UNIVERSITAT HAMBURG ACADEMIA-REVISTA LATINOAMERICANA DE ACAD-REV LATINOAM AD COLOMBIA ! ADMINISTRACION
    [Show full text]
  • The Computer Science Ontology: a Comprehensive Automatically-Generated Taxonomy of Research Areas
    DATA PAPER The Computer Science Ontology: A Comprehensive Automatically-Generated Taxonomy of Research Areas Angelo A. Salatino1†, Thiviyan Thanapalasingam1, Andrea Mannocci1, Aliaksandr Birukou2, Francesco Osborne1 & Enrico Motta1 1 Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK Downloaded from http://direct.mit.edu/dint/article-pdf/2/3/379/1857480/dint_a_00055.pdf by guest on 24 September 2021 2Springer-Verlag GmbH, Tiergartenstrasse 17, 69121 Heidelberg, Germany Keywords: Scholarly data; Ontology learning; Bibliographic data; Scholarly ontologies; Semantic Web Citation: A.A. Salatino, T.Thanapalasingam, A. Mannocci, A. Birukou, F. Osborne & E. Motta. The computer science ontology: A comprehensive automatically-generated taxonomy of research areas. Data Intelligence 2(2020), 379-416. doi: 10.1162/ dint_a_00055 Received: April 15, 2019; Revised: August 10, 2019; Accepted: October 10, 2019 ABSTRACT Ontologies of research areas are important tools for characterizing, exploring, and analyzing the research landscape. Some fields of research are comprehensively described by large-scale taxonomies, e.g., MeSH in Biology and PhySH in Physics. Conversely, current Computer Science taxonomies are coarse-grained and tend to evolve slowly. For instance, the ACM classification scheme contains only about 2K research topics and the last version dates back to 2012. In this paper, we introduce the Computer Science Ontology (CSO), a large-scale, automatically generated ontology of research areas, which includes about 14K topics and 162K semantic relationships. It was created by applying the Klink-2 algorithm on a very large data set of 16M scientific articles. CSO presents two main advantages over the alternatives: i) it includes a very large number of topics that do not appear in other classifications, and ii) it can be updated automatically by running Klink-2 on recent corpora of publications.
    [Show full text]
  • Evaluating the Impact Factor: a Citation Study For
    https://doi.org/10.48009/1_iis_2010_309-318 BEYOND THE IMPACT FACTOR: UPDATES ON A CITATION STUDY FOR INFORMATION TECHNOLOGY JOURNALS Kara Gust Rawlins, Michigan State University, [email protected] Dale D. Gust, Central Michigan University, [email protected] ================================================================================== exclusive Journal Citation Reports (JCR), providing ABSTRACT a way for scholars, librarians, and publishers to evaluate the impact of a journal in its own discipline Citation data for journals continues to be a or across disciplines based upon a number of criteria” prominent tool for evaluating scholarly journals in [4]. ISI continues to provide online subscription various fields of study, although new factors play access to JCR and its other citation indexes through important roles in assessing its validity and its Web of Knowledge database. importance. This paper will build upon a previous citation study for information technology/computer For years, JCR has been the recognized authority on information systems journals, with special attention analyzing citation data to establish a journal’s to the Journal of Computer Information Systems. It influence and dominance in the field of scholarly will analyze how ISI’s Journal Citation Reports literature, although certain other databases are now traditionally measures a journal’s impact in its beginning to challenge that conception. This paper disciplines, with special attention to how the Journal will explore JCR’s primary statistical calculation— of Computer Information Systems has increased its the impact factor, along with a few other influencing impact factor over the past several years. It will also factors. It will build upon a previous citation study discuss new implications such as how Google for information technology journals, with special Scholar and other information vendors are now attention to the Journal of Computer Information providing citation data that should also be Systems: Gust, K.
    [Show full text]