Ontology Based Data Integration in Life Sciences

Ontology based data integration in Life Sciences Dmitry Repchevskiy A questa tesi doctoral està subjecta a la llicència Reconeixement 3.0. Espanya de Creative Commons . Esta tesis doctoral está sujeta a la licencia Reconocimi ento 3.0. España de Creative Commons . Th is doctoral thesis is licensed under the Creative Commons Att ribution 3.0. Spain License . UNIVERSITY OF BARCELONA FACULTY OF BIOLOGY Doctorate Program: Biomedicine Research Line: Bioinformatics 2010-2015 Ontology based data integration in life sciences Submitted by Dmitry Repchevskiy in fulfillment of the requirements for the doctoral degree by the University of Barcelona Supervisor: Dr. Josep Lluís Gelpí Buchaca Department of Biochemistry and Molecular Biology University of Barcelona Dmitry Repchevskiy Barcelona Supercomputing Center National Institute of Bioinformatics ACKNOWLEDGEMENTS “It is not the consciousness of men that determines their being, but, on the contrary, their social being that determines their consciousness.” Karl Marx Without any doubts this thesis wouldn’t be possible without many people that I had a pleasure to work with. If only I tried to enumerate all their invaluable help, all the chats and communications we had, this thesis would require a second volume. First and foremost, I would like to thank my supervisor Josep Lluís Gelpí, who gave me a lot of freedom in defining projects I worked on. Not all of them have been included into this thesis and some of them never reached the end, but the real experience gathered in these years allowed me to go get into this final point. I would also like to thank my colleague José María Fernández from Spanish National Cancer Research Centre (CNIO), who always found a time to discuss technological aspects of my projects and to our entire group just to be with me all these long years. Finally, the greatest thank to my colleagues Romina and Laia whose invaluable support has been so important for me over these years. SUMMARY As many other science disciplines, Life Sciences operate with an enormous amount of information. The genomic revolution was a big bang in biological and especially genetic data creation. Exponentially growing data had thrown up a bunch of problems with its storage, processing and interoperability. None of these problems could be resolved without a substantial progress in computer technology. On the merge of biology and information technology, a new field, coined as bioinformatics, had arisen. Many heterogeneous data sources continuously generate a huge amount of different types of data. This data comes from clinical studies, micro-array experiments, DNA sequencing, or publications (data mining). In most of the cases, this information has little value without further processing and analysis, including normalization and filtering. To handle this data deluge, many institutions spend a considerable amount of resources to maintain core databases (UniProt, Protein Data Bank, Ensembl, etc.), and are in a continuous search for new approaches in data storage, annotation, and integration. Independently on the origin, biological data requires a structure, the definition of an appropriate storage format, and metadata provision. Sometimes metadata is included into the format, but very often is provided externally in form of annotations. In many cases such annotations, describing a specific knowledge domain, are organized to form an ontology (for instance the Gene Ontology Database) and may be a valuable source of information. Although ontologies are often used to annotate biological data, modern ontology languages provide enough expressibility to structurally describe biological objects that makes them a great choice for biological data interoperability. Another use of ontologies for interoperability purpose is the semantic description of biological services. The power of semantic integration in Life Sciences brought a lot of interest from major bioinformatics institutions that embrace ontologies and more generally all Linked Data technologies as a common platform for biological data integration. TABLE OF CONTENTS INTRODUCTION ........................................................................................................... 1 1.1. HETEROGENEOUS DATA INTEGRATION .......................................................... 1 Data formats ............................................................................................................ 4 1.2. XML TECHNOLOGY ....................................................................................... 8 1.3. ONTOLOGIES IN LIFE SCIENCES ................................................................... 10 OBO format ............................................................................................................ 11 1.4. SEMANTIC WEB ............................................................................................ 12 1.4.1. Resource Description Framework (RDF) ................................................ 13 1.4.2. RDF in Attributes (RDFa) ....................................................................... 15 1.4.3. RDF Schema (RDFS) ............................................................................... 15 1.4.4. SPARQL 1.1 ............................................................................................ 16 1.4.5. OWL 2 Web Ontology Language ........................................................... 17 1.4.6. The Semantic Web Rule Language (SWRL) ........................................... 20 1.4.7. Rule Interchange Format (RIF) .............................................................. 20 1.4.8. Linked Data ........................................................................................... 21 1.5. WEB SERVICES .............................................................................................. 22 1.5.1. HTTP ...................................................................................................... 23 1.5.2. SOAP ...................................................................................................... 25 1.5.3. WSDL ..................................................................................................... 25 1.5.4. REST....................................................................................................... 26 1.5.5. WADL .................................................................................................... 28 1.6. SEMANTIC WEB SERVICES ........................................................................... 29 1.6.1. OWL-S.................................................................................................... 31 1.6.2. MOBY-S ................................................................................................. 32 1.6.3. WSMO ................................................................................................... 35 1.6.4. WSMO-Lite ............................................................................................ 35 1.6.5. SADI ....................................................................................................... 35 1.6.6. WSDL 2.0 RDF Mapping ........................................................................ 36 OBJECTIVES ............................................................................................................... 37 MATERIALS AND METHODS ....................................................................................... 39 TECHNOLOGICAL CHOICES .......................................................................................... 41 Semantic Web Services ontology ................................................................... 41 BioMoby integration libraries ........................................................................ 41 RESULTS .................................................................................................................... 43 PART I – BIOMOBY ONTOLOGY MODEL INTEGRATION ............................................................. 44 4.1.1. New lightweight Java API for BioMoby Registry access and Web Services execution. 46 4.1.2. BioNemus. Creating SAWSDL bioinformatics services based on BioMoby ontology model ....................................................................................................... 49 4.1.3. SAWSDL-based BioMoby ontology integration. .................................... 57 PART II – XML SCHEMA GENERATION FROM OWL 2 ONTOLOGIES. .......................................... 62 4.2.1. Implementation ..................................................................................... 62 4.2.2. OWL 2 Model to XML Schema transformation ..................................... 62 4.2.3. Practical applications for bioinformatics Semantic Web Service development ........................................................................................................... 69 PART III – ONTOLOGY-BASED SERVICE DESCRIPTION FOR BIOINFORMATICS INTEGRATION .............. 71 4.3.1. BioSWR: Semantic Web services Registry for Bioinformatics................ 72 PART IV – WEB SERVICES INTEGRATION INTO WORKFLOWS EXECUTION TOOLS. ........................... 79 4.4.1. BioSWR Registry integration into the Taverna Workbench. ................. 79 4.4.2. Galaxy Gears. Web Services integration into Galaxy workbench.......... 81 DISCUSSION ............................................................................................................... 85 CONCLUSION .............................................................................................................. 93

Ontology Based Data Integration in Life Sciences

Document Databases, JSON, Mongodb 17

Smart Grid Serialization Comparison

V a Lida T in G R D F Da

Semantic Integration and Knowledge Discovery for Environmental Research

Rdfa in XHTML: Syntax and Processing Rdfa in XHTML: Syntax and Processing

Using Rule-Based Reasoning for RDF Validation

Semantic Integration Across Heterogeneous Databases Finding Data Correspondences Using Agglomerative Hierarchical Clustering and Artificial Neural Networks

RPC and SOAP Services

Semantic Description of Web Services

An Ontological Approach for Querying Distributed Heterogeneous Information Systems Under Critical Operational Environments

Rdfs:Frbr– Towards an Implementation Model for Library Catalogs Using Semantic Web Technology

Semantic Resource Adaptation Based on Generic Ontology Models