An Ontology Based Query Engine for Querying Biological Sequences
Faculty of Bioscience Engineering
Academic year 2015-2016

An ontology based query engine for querying biological sequences

Jim Clauwaert

Promotor: Prof. Dr. ir. Wim Van Criekinge
Tutor: Martijn Devisscher

Master's thesis submitted to obtain the degree of Master in Bioscience Engineering: Cell and Gene Biotechnology

Foreword

This thesis came about during the academic year 2015-2016 as the final project for my master's degree in bio-engineering. In many ways, this year has been very heavy. Even though working on my thesis often meant not working on something else I should have been working on, I have enjoyed researching the subject and am satisfied when looking back at the work invested. This feeling of comfort is due to many external influences that have guided and supported me. I wish to extend my gratitude to the people who have stood by my side throughout the last year.

First, I would like to thank Martijn Devisscher, my tutor and the spiritual father of boinq. The rich experience I gained through my work on boinq and The Semantic Web is mainly attributable to the positive working environment he created. I was given both the responsibility and the trust to handle important parts of the boinq program. This gave me not only the opportunity, but also the ability to think for myself and introduce solutions when these presented themselves. Through weekly appointments, I was able to follow up on and discuss my work, and to get directions when no path was obvious. Through these elements I feel that I was able to contribute to the creation of boinq, and that my input was of value. This has been both the strongest motivation and the most fulfilling aspect of my thesis.

I also extend my gratitude to the BioBix group: specifically, to my promoter, Prof. Wim Van Criekinge, for helping to make this thesis possible, and to Prof. Tim de Meyer and Dr. Gerben Menschaert, for helping me define a use case and assisting me throughout.
I want to thank my family for supporting me all these years. I want to thank my friends for being awesome in general. A special thanks to Meaghan Blanchard, for being the first helping hand when correcting and revising my work, and for being there for whatever reason.

Gent, 2016
Jim Clauwaert

Table of Contents

Foreword
1 Abstract
2 Introduction
3 The Semantic Web
    3.1 Introduction
    3.2 What is The Semantic Web?
    3.3 RDF
        3.3.1 Structure of RDF
        3.3.2 Vocabularies of RDF
    3.4 Linked Data
        3.4.1 Inference
        3.4.2 Linked databases
    3.5 RDF data management
        3.5.1 RDF formats
        3.5.2 Triplestores
    3.6 SPARQL
        3.6.1 SPARQL syntax
4 Boinq
    4.1 Introduction
    4.2 Design
        4.2.1 Data unification
        4.2.2 Data organization
    4.3 Comparison to other frameworks
        4.3.1 Biological query building
        4.3.2 Semantic access to sequence information
    4.4 Material and methods
5 Genomic Data Implementation
    5.1 Introduction
    5.2 Genomic Data
        5.2.1 Browser Extensible Data format
        5.2.2 Generic Feature Format
        5.2.3 Variant Call Format
        5.2.4 Sequence Alignment/Map format
    5.3 Data integration into The Semantic Web
        5.3.1 Overview
        5.3.2 Basic data model
        5.3.3 Vocabularies
        5.3.4 Data models
        5.3.5 Metadata
        5.3.6 Practical implementation
    5.4 Evaluation
        5.4.1 sparql-bed and sparql-vcf
        5.4.2 Big data files
        5.4.3 JBrowse
6 Biological research in RDF
    6.1 Introduction
    6.2 A biomarker for colon cancer
        6.2.1 Introduction
        6.2.2 Material and methods
        6.2.3 Results
    6.3 Discussion
        6.3.1 Methods
        6.3.2 Results
7 Conclusion and Future Prospects
A Code Examples
B Tables
C Figures

List of Acronyms

Boinq       Bio ontology integrated query platform
BED         Browser Extensible Data
CDS         Coding DNA Sequence
CNV         Copy Number Variations
CIMP        CpG Island Methylator Phenotype
CRC         Colorectal Cancer
CTD         Comparative Toxicogenomics Database
DBMS        Database Management Systems
DDBJ        DNA Databank of Japan
DKO         Double Knock-Out
EMBL-EBI    The European Bioinformatics Institute
GDA         Gene Disease Association
GFF/GFF3    General Feature Format
GFVO        Genomic Feature and Variation Ontology
GMOD        Generic Model Organism Database
GRC         Genome Reference Consortium
GTF         Gene Transfer Format
IRI         Internationalized Resource Identifier
JSON-LD     JavaScript Object Notation for Linked Data
MeSH        Medical Subject Headings
NCBI        National Center for Biotechnology Information
NCI         National Cancer Institute
NHGRI       National Human Genome Research Institute
OWL         Web Ontology Language
RDF         Resource Description Framework
RDFS        Resource Description Framework Schema
SKOS        Simple Knowledge Organization System
SNP         Single Nucleotide Polymorphism
SO          Sequence Ontology
SPARQL      SPARQL Protocol and RDF Query Language
STS         Spring Tool Suite
TCGA        The Cancer Genome Atlas
UniProt     The Universal Protein Resource
URI         Uniform Resource Identifier
URL         Uniform Resource Locator
VCF         Variant Call Format
W3C         World Wide Web Consortium
WT          Wild Type
WWW         World Wide Web
XML         Extensible Markup Language
XSD         XML Schema Definition

1 Abstract

English version

The Semantic Web is an enhancement of the World Wide Web with a focus on providing a standardized framework for exchanging data. This allows for a web of data not limited by applications and data formats. Technologies created for The Semantic Web have increasingly been adopted by public databases. Boinq is a web platform that aims to connect the researcher to biological databases based upon semantic web technologies.
One design goal is the ability to manage and integrate custom data into the data framework of The Semantic Web. Data integration of four different data formats has been realized through the creation of custom data structures and converters. The integrated data covers varying levels of high-throughput sequencing data, represented in the BED, GFF, VCF, and SAM formats. It has been shown that the use of The Semantic Web offers a fast way to select and combine data from public databases. Obstacles preventing widespread use of the technology still exist, including the level of knowledge required about The Semantic Web and the databases used, a lack of tools to manage and analyze data in a semantic environment, and the incomplete state of several public databases.

Dutch version (translated)

The Semantic Web is an advanced version of the World Wide Web with a focus on creating a standardized environment for distributing data. This realizes a web of data that is not limited by the diversity of applications and data formats. Technologies created for The Semantic Web are being adopted with increasing interest by public databases. Boinq is a web application that strives to make biological databases built on semantic technologies more accessible to the researcher. One of the goals of the project is to create functionality that can import and manage custom data in a semantic environment. The data integration of four different data formats has been made possible through the creation of custom data structures and converters. The integrated data spans various levels of high-throughput sequencing data, as found in the BED, GFF, VCF, and SAM formats. It has been shown that the use of The Semantic Web offers a fast option for selecting and combining data from public databases. Obstacles to a general use of The Semantic Web nevertheless remain, including the high level of knowledge required about The Semantic Web and the datasets used, a lack of applications for the management and analysis of data, and the incomplete state of public datasets.

2 Introduction

Since Tim Berners-Lee invented the World Wide Web in 1989, he has continuously worked on defining and improving its construction [79]. In 1994, he founded the World Wide Web Consortium (W3C), an organization focused on producing specifications, guidelines, software, and tools to improve the web. In 2004, W3C defined the specifications of the Resource Description Framework (RDF) in its first iteration. RDF was created as a guideline and framework to optimize data interchange throughout an ever-growing web, a first step towards The Semantic Web. The specifications for RDF 1.1, the second iteration, followed in 2014 [76].

In 1965, the first nucleic acid sequence, that of a transfer RNA, was determined by Robert W. Holley and his colleagues. It was the catalyst for a boom in genetic sequence data that has continued to grow exponentially since 1995. In 2007, cost reductions in genome sequencing allowed for another significant boost in the generation of new data. The vast influx of genomic data has given rise to many different databases and formats, which hinders cataloging, processing, and researching data across different sources.

As the technologies created for The Semantic Web have developed and matured, investment in adopting them for genomic databases has increased. Although semantic web integration has so far been adopted by only some databases, major bioinformatics institutes, including EMBL-EBI, are making a conscious effort to expand the use of this technology. Boinq [25] is a web platform that aims to serve as a connection between the researcher and The Semantic Web.