Faculty of Bioscience Engineering - Academic year 2015-2016

An ontology based query engine for querying biological sequences

Jim Clauwaert

Promotor: Prof. Dr. ir. Wim van Criekinge
Tutor: Martijn Devisscher

Master's dissertation submitted in partial fulfilment of the requirements for the degree of Master of Science in Bioscience Engineering: Cell and Gene Biotechnology

Foreword

This thesis came about during the academic year of 2015-2016, and it has been worked on as the final project for my master's degree in bio-engineering. In many ways, this year has been very demanding. Even though working on my thesis often meant not working on something else I should have been working on, I have enjoyed researching the subject and am satisfied when looking back at the work invested. This feeling of comfort is due to the many people that have guided and supported me. I wish to extend my gratitude to the people that have stood by my side throughout the last year.

First, I would like to thank Martijn Devisscher, my tutor and the spiritual father of boinq. The rich experience obtained through my work on boinq and The Semantic Web is mainly thanks to the positive working environment he created. I was given both the responsibility and the trust to handle important parts of the boinq program. This gave me not only the opportunity, but also the ability to think for myself and introduce solutions when these presented themselves. Through weekly appointments, I was able to follow up on and discuss my work, and to get directions when no path was obvious. Through these elements, I feel that I was able to contribute to the creation of boinq, and that my input was of value. This has been both my strongest motivation and the most fulfilling aspect of my thesis. I also extend my gratitude to the BioBix group; specifically, to my promotor, Prof. Wim Van Criekinge, for helping to make this thesis a possibility, and to Prof. Tim de Meyer and dr. Gerben Menschaert, for helping me define a use case and assisting me throughout. I want to thank my family for supporting me all these years. I want to thank my friends for being awesome in general. A special thanks to Meaghan Blanchard, for being the first helping hand when correcting and revising my work, and for being there for whatever reason.

Gent, 2016

Jim Clauwaert


Table of Contents

Foreword

1 Abstract

2 Introduction

3 The Semantic Web
   3.1 Introduction
   3.2 What is The Semantic Web?
   3.3 RDF
      3.3.1 Structure of RDF
      3.3.2 Vocabularies of RDF
   3.4 Linked Data
      3.4.1 Inference
      3.4.2 Linked databases
   3.5 RDF data management
      3.5.1 RDF formats
      3.5.2 Triplestores
   3.6 SPARQL
      3.6.1 SPARQL syntax

4 Boinq
   4.1 Introduction
   4.2 Design
      4.2.1 Data unification
      4.2.2 Data organization
   4.3 Comparison to other frameworks
      4.3.1 Biological query building
      4.3.2 Semantic access to sequence information
   4.4 Material and methods

5 Genomic Data Implementation
   5.1 Introduction
   5.2 Genomic Data
      5.2.1 Browser Extensible Data format
      5.2.2 Generic Feature Format
      5.2.3 Variant Call Format
      5.2.4 Sequence Alignment/Map format
   5.3 Data integration into The Semantic Web
      5.3.1 Overview
      5.3.2 Basic data model
      5.3.3 Vocabularies
      5.3.4 Data models
      5.3.5 Metadata
      5.3.6 Practical implementation
   5.4 Evaluation
      5.4.1 sparql-bed and sparql-vcf
      5.4.2 Big data files
      5.4.3 JBrowse

6 Biological research in RDF
   6.1 Introduction
   6.2 A biomarker for colon cancer
      6.2.1 Introduction
      6.2.2 Material and methods
      6.2.3 Results
   6.3 Discussion
      6.3.1 Methods
      6.3.2 Results

7 Conclusion and Future Prospects

A Code Examples

B Tables

C Figures

List of Acronyms

B

Boinq   Bio ontology integrated query platform
BED     Browser Extensible Data

C

CDS     Coding DNA Sequence
CNV     Copy Number Variation
CIMP    CpG Island Methylator Phenotype
CRC     Colorectal Cancer
CTD     Comparative Toxicogenomics Database

D

DBMS    Database Management System
DDBJ    DNA Databank of Japan
DKO     Double Knock-Out

E

EMBL-EBI    The European Bioinformatics Institute

G

GDA         Gene-Disease Association
GFF/GFF3    Generic Feature Format
GFVO        Genomic Feature and Variation Ontology
GMOD        Generic Model Organism Database
GRC         Genome Reference Consortium
GTF         Gene Transfer Format

I


IRI     Internationalized Resource Identifier

J

JSON-LD JavaScript Object Notation for Linked Data

M

MeSH Medical Subject Headings

N

NCBI    National Center for Biotechnology Information
NCI     National Cancer Institute
NHGRI   National Human Genome Research Institute

O

OWL Web Ontology Language

R

RDF     Resource Description Framework
RDFS    Resource Description Framework Schema

S

SKOS    Simple Knowledge Organization System
SNP     Single Nucleotide Polymorphism
SO      Sequence Ontology
SPARQL  SPARQL Protocol and RDF Query Language
STS     Spring Tool Suite

T

TCGA The Cancer Genome Atlas

U

UniProt     The Universal Protein Resource

URI     Uniform Resource Identifier
URL     Uniform Resource Locator

V

VCF Variant Call Format

W

W3C     World Wide Web Consortium
WT      Wild Type
WWW     World Wide Web

X

XML     Extensible Markup Language
XSD     XML Schema Definition

1 Abstract

English version

The Semantic Web is an enhancement of the World Wide Web with a focus on providing a standardized framework for exchanging data. This allows for a web of data not limited by applications and data formats. Technologies created for The Semantic Web have increasingly been adopted by public databases. Boinq is a web platform that aims to connect the researcher to biological databases based upon semantic web technologies. One design goal is the ability to manage and integrate custom data into the data framework of The Semantic Web.

Data integration of four different data formats has been realized with the creation of custom data structures and converters. The integrated data covers varying levels of high-throughput sequencing data, represented in the BED, GFF, VCF and SAM formats. It has been shown that the use of The Semantic Web offers a fast way to select and combine data from public databases. Obstacles preventing a widespread use of the technology still exist, including the level of knowledge needed about The Semantic Web and the databases used, a lack of tools to manage and analyze data from a semantic environment, and the incomplete state of several public databases.


Nederlandse versie

Het Semantische Web is een gevorderde versie van het World Wide Web met een focus op het creëren van een gestandaardiseerde omgeving voor het distribueren van data. Hierbij wordt een web van data verwezenlijkt dat niet gelimiteerd is door de diverse applicaties en dataformaten. Technologieën gecreëerd voor Het Semantische Web worden met toenemende interesse geadopteerd door publieke databanken. Boinq is een webapplicatie die ernaar streeft biologische databanken gebouwd op semantische technologieën toegankelijker te maken voor de onderzoeker. Eén van de doeleinden van het project is het aanmaken van een functionaliteit die eigen data kan inbrengen en beheren in een semantische omgeving.

De data-integratie van vier verschillende dataformaten is mogelijk gemaakt met de creatie van aangepaste datastructuren en converters. Geïntegreerde data is terug te vinden in diverse niveaus van high-throughput-sequencingdata, zoals te vinden in het BED-, GFF-, VCF- en SAM-formaat. Er is aangetoond dat het gebruik van Het Semantische Web een snelle optie biedt voor het selecteren en combineren van data komende van publieke databanken. Hindernissen voor een algemeen gebruik van Het Semantische Web bestaan echter nog; daartoe behoren de hoge eisen aan kennis over Het Semantische Web en de gebruikte datasets, een gebrek aan toepassingen voor het beheer en de analyse van data, en de incomplete status van publieke datasets.

2 Introduction

Since Tim Berners-Lee invented the World Wide Web in 1989, he has continuously worked on defining and improving its construction [79]. In 1994, he founded the World Wide Web Consortium (W3C), an organization focused on generating specifications, guidelines, software and tools to improve the internet. In 2004, W3C defined the specifications of the Resource Description Framework (RDF) in its first iteration. RDF was created as a guideline and framework to optimize data interchange throughout an ever-growing web, a first step towards The Semantic Web. The specifications for RDF 1.1, the second iteration, followed in 2014 [76].

In 1955, the first complete amino acid sequence of a protein, insulin, was determined by Frederick Sanger and his colleagues. It was the catalyst for a boom in sequence data that has continued to grow exponentially since 1995. In 2007, cost reductions in genome sequencing allowed for another significant boost in new data generation. The vast influx of genomic data has brought the birth of many different databases and formats, causing a hindrance in cataloging, processing and researching data between different sources. Due to the further development and maturing of the technologies created for The Semantic Web, an increasing investment into the adaptation of these technologies for genomic databases has been realized. Although semantic web integration has only been adopted by some databases, a conscious effort to expand this technology is being made by major bioinformatics institutes, including EMBL-EBI.

Boinq [25] is a web platform that aims to serve as a connection between the researcher and The Semantic Web. It is designed to manage an RDF environment used for the integration and manipulation of data. The integration of custom genomic data into an RDF dataset is investigated during this study. Specifically, a data conversion tool for common formats such as BED, GFF, VCF and SAM has been created, and these file converters have been integrated into the functionality of boinq. A newly designed RDF structure has been outlined and elaborated on for each of the supported formats. Further design goals of boinq include the implementation of a graphical interface at the front end of the program, through which files can be uploaded and relevant information retrieved; a server that converts supported data formats into triples; and a data and metadata structure in RDF, constructed following W3C standards.

To review the possibilities of the current state of public biological databases and the implementation of the converter, a case study has been defined. For this, the expression and methylation data retrieved from wild type and double knockout cancer cells (HCT116) have been analysed. With the use of custom user data and data retrieved from The Semantic Web, a list of candidate biomarkers was selected for further analysis. The complete process, done in an RDF environment, has been reviewed in the last section.


3 The Semantic Web

3.1 Introduction

The Semantic Web is the concept of an idealized data network designed by W3C. One of the main features of this web of data is the use of standardized data formats and exchange protocols that allow its users quick and easy access. The many data formats of today's web, controlled by the different applications used over the web, inhibit a straightforward way to request, link, process and display information. The use of a standardized format places all of this data in a shared web, under one set of rules.

3.2 What is The Semantic Web?

A typical example used to illustrate the problem with today's web is the fact that connecting linked information found on the web often takes an unnecessary amount of work. Suppose a student is searching for a place to stay close to his university. A common tool to find the distance between two places is Google Maps. However, it does not feature every place for rent on the map. Instead, the student will have to search different websites featuring places for rent, find their addresses and copy/paste this information into Google Maps. The only way to find out the distance from each place to his university is by transferring information manually. Why does today's web force every student encountering this problem to do the same task over and over again? Why is the address of these places not automatically linked with tools such as Google Maps? This would enable the service to highlight all places of interest with a simple selection tool on the map.

The problem preventing map tools from acquiring information, as in this example, is the many formats in which the desired addresses are found. Common formats are standard HTML code, Word or Excel documents. Furthermore, there is no unified way to tag an address as a place that is for rent. Finding these places would require tools that search whole files for keywords, and even then, there will be uncertainty as to whether or not the information is correct.

Data found on the web is often only linked together by human language. When one finds an address listed underneath the pictures of a house, together with an e-mail address, one knows whom to contact when interested in seeing the house. However, although a human can interpret and link information stored on the web, a computer cannot interpret it in the same way. There would be no direct way to request information on the web by giving an address as input.

The use of a Semantic Web offers a more accessible workspace for many applications. The integration of various data formats into one enables an easier and quicker way to find possible relations in data resources; academic research and the cataloging of the contents found at a particular web site, page, or digital library are just two possible examples [72]. Herein, data can be accessed by using a general web architecture. The relations of information to one another are defined in a standard way, allowing data to be shared and reused across applications and to be processed both automatically by tools and manually.

The Semantic Web aims to be a unification of all data, a place that offers a more accessible workspace for its users and applications. With the unification of all data also comes the need for a single database of definitions. Data that is stored in different formats by various applications or tools is defined by the rules and environment of the application itself. To illustrate this, take the example of a website database that lists the Uniform Resource Locator (URL) of stored websites under the definition 'address'. A map tool could represent street addresses under the same definition. A user accessing these databases separately will know in which environment the listed data is defined, and will thus have no problem processing the use of the same word 'address' and understanding its different meanings. A problem arises when this user accesses a database listing both of the previous datasets, where no thought was put into the varying meanings of the word 'address'. Referring to the first example, a student querying for an e-mail address using the home address of a location might encounter problems in finding the required information, depending on how the search tool handles the situation. Thus, the creation of a single environment where all data is defined and related to one another has to be supported by an organized structure that defines the boundaries between the different ontologies of a word.
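To make the ambiguity concrete, the two meanings of 'address' can be kept apart by using two distinct, well-defined properties. The following sketch uses hypothetical vocabularies invented for this example:

<http://example.org/website/42> <http://example.org/webvocab#address> "http://www.example.org" .
<http://example.org/house/42> <http://example.org/mapvocab#address> "Coupure Links 653" .

A tool that understands both vocabularies can now tell a web address apart from a street address, because each meaning is anchored to its own definition.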

Ontology is a Greek word first used in philosophical studies. It defines the nature and being of an entity. It is a more complete and distinct notion of the concept given to an entity than the word it is defined by. The study and formulation of ontologies is important in the setup of The Semantic Web, as it is a necessary element in structuring the data it stores.

A semantic web, just like the World Wide Web, is built from different databases. It contains information about movies, religions, chemicals, family trees and many other things. This information is hosted by multiple servers around the world, each with its own subject. The creation of semantic web technologies, such as the Resource Description Framework (RDF), enables interrelating data amongst different datasets. The collection of public linked datasets is known as Linked Open Data [75]. Figure 3.1 gives an overview of the datasets published in Linked Open Data format, known as 'The Linking Open Data Cloud' (LOD Cloud). Each node in this cloud diagram represents a distinct data set published as Linked Data. The arcs indicate that resources are shared between two data sets [23].

3.3 RDF

The set of rules and specifications defining the framework of The Semantic Web is known as the Resource Description Framework (RDF). The first iteration of this data model was published in 2004 as RDF 1.0. The second and current iteration, RDF 1.1, was released in 2014 [78].

As the name suggests, RDF is the framework that aims to create the environment or universe for data in which the theoretical aspects of a semantic web are to be upheld. RDF was introduced by W3C as a means to extend the existing World Wide Web into The Semantic Web. A fast evolution of the Web was expected by its creators once the specifications of the RDF model were released, but even today, the vast majority of websites still have not adopted this model [11].

Information can be derived from or about anything. Examples are the data output of a scientific experiment, the formulation of a philosophical concept, the chemical properties of substances or even an abstract concept. As outlined in the previous part about the Semantic Web, the construction of RDF is mainly focused on a seamless sharing of information across different datasets without the loss or change of meaning. Other uses of RDF, stated by W3C, include:

• Adding machine-readable information to Web pages using, for example, the popular schema.org vocabulary, enabling them to be displayed in an enhanced format on search engines or to be automatically processed by third-party applications.

Figure 3.1: Linked datasets of the Semantic Web. The pink nodes at the bottom right annotate databases concerning life sciences.

• Enriching a dataset by linking it to third-party datasets. For example, a dataset about paintings could be enriched by linking them to the corresponding artists in Wikidata, therefore giving access to a wide range of information about them and related resources.

• Interlinking API feeds, making sure that clients can easily discover how to access more information.

• Using the datasets currently published as Linked Data. For example building aggregations of data around specific topics.

• Building distributed social networks by interlinking RDF descriptions of people across multiple Web sites.

• Providing a standards-compliant way for exchanging data between databases.

• Interlinking various datasets within an organisation, enabling cross-dataset queries to be performed using SPARQL, a query language used for RDF databases.

3.3.1 Structure of RDF

The general focus of the design is to create both generality and precision. With the generation of an ontologically neutral environment, the RDF framework is able to uphold the expression of data about any topic. An RDF dataset is built out of triples that can be organized into different graphs. A graph is an optional feature of a dataset which can contain parts of the data. Graphs are very common amongst large datasets and can offer several advantages. An example of using graphs is to separate the data and metadata of a dataset. For genome data it is common to use different graphs as a means of separating data by species. The use of graphs can also bring the advantage of a performance improvement when querying the database.

3.3.1.1 Triple

The core structure of data in RDF is called a triple. A triple is a single statement about a resource. As the name suggests, it is built up out of three parts: the subject, predicate and object. The subject denotes the resource; the predicate denotes the property, trait or aspect of the resource, expressing a relationship between the subject and the object. Subjects and objects are represented by different nodes in an RDF graph. Predicates, shown as arcs connecting nodes, express the relationship between these nodes. Code Example 3.1 shows the layout of a basic triple in the commonly used Turtle format. The end of a statement is given by the use of a full stop or period. Triples are in some cases written over multiple lines, so it is important to know that a period signifies the end of a triple, and not a newline [77].

Code Example 3.1: Structure of a triple

<subject> <predicate> <object> .

Three different types of nodes exist: IRIs, blank nodes and literals.

IRIs Internationalized Resource Identifiers are a generalization of the Uniform Resource Identifier (URI), as they allow a broader range of Unicode characters. Both URIs and IRIs are used when talking about resource identifiers. This thesis will generally refer to IRIs to avoid confusion, although both options are valid. URIs are commonly known through their subtype, the URL, which is used extensively to navigate the World Wide Web. IRIs, on the other hand, are used as identifiers and, although recommended, are not necessarily accessible through the web. IRIs can sometimes be used as a web address at which a definition can be found. IRIs are identifiers, and can identify both resources (nodes) and properties (arcs). The referent is the resource denoted by the IRI, which can be the subject, object and/or predicate of the triple. Code Example 3.2 contains three IRIs which complete a triple. The triple in this example expresses a specific street address belonging to a specific house. The IRIs used are constructed specifically for the example.

Code Example 3.2: A triple containing three IRIs

<http://example.org/house/42> <http://example.org/vocabulary#hasAddress> <http://example.org/address/coupure-links-653> .

Literals These are absolute values, including strings, numbers and booleans. Optionally, they are represented by a value in lexical form followed by an IRI identifying the data type of the value. The two elements are separated by two carets. Although the addition of a data type identifier is optional, it is considered good practice. Through the identification of the data type, literals can easily be extracted and processed. They are always found in between quotes. The literal value is the resource denoted by the literal; it can only appear as the object of a triple. Code Example 3.3 is a variation on the previous triple. The street address is this time expressed as a string. Although the use of literals offers an easy interpretation to the user or program, it has one main limitation: a literal is not a unique identifier and can thus not be used as a subject or predicate.

Code Example 3.3: A triple containing two IRIs and a literal

<http://example.org/house/42> <http://example.org/vocabulary#hasAddress> "Coupure Links 653"^^<http://www.w3.org/2001/XMLSchema#string> .

Blank nodes These are empty nodes that are used when an undefined node is known to exist. The existence of a blank node is known through its relation to other nodes or through the logical existence of the entity, e.g. the unknown melting temperature of a chemical substance. They are expressed as '_:', followed by a unique identifier. Blank nodes can appear as both the subject and object of a triple. Code Example 3.4 gives an example of a triple including all three different kinds of nodes.

Code Example 3.4: A triple containing a blank node, an IRI and a literal

_:ahouse <http://example.org/vocabulary#hasAddress> "Coupure Links 653"^^<http://www.w3.org/2001/XMLSchema#string> .

From this point forward, in line with the general theme of this thesis, the examples will be drawn from bioinformatics. The aim is to build up the understanding required for the main part of this research through the use of more closely related examples.

3.3.2 Vocabularies of RDF

Typically, different sets of vocabularies are used to structure and provide semantic meaning to an RDF dataset. A vocabulary consists of a list of definitions, which can formalize a class or property (object or subject) or a relation (predicate). A list of definitions can also be referred to as an ontology. The term ontology is typically used for a more general or abstract collection of definitions; the term vocabulary, on the other hand, is typically used for a list of definitions about more specific subjects. Both terms are common, with no clear line separating the correct usage of either.

The creation of vocabularies is open to everyone, and vocabularies are thus found in many different locations on the web. As an RDF database needs to be able to store data about every subject, it is important that definitions are unique and well defined. Terms in vocabularies are defined through IRIs. Since vocabularies can be created by anybody, a huge number of lists have been published. IRIs offer a high degree of variability, and can thus easily be chosen to be unique. Public vocabularies can feature domain names (URLs) for each IRI or make the IRI directly resolvable. Since the meaning of an IRI can be hard to derive from its string, it is often useful to be able to quickly retrieve its definition. BioPortal is a web service that collects information about the available ontologies with a biological background. As the service lets you browse a variety of vocabularies by keyword, it facilitates the search for specific ontologies [51].

Code Example 3.5 is an example of a triple stored in the public Ensembl triplestore. Not being familiar with a vocabulary beforehand can make it difficult to understand the meaning of an IRI. The subject denotes a resource from the Ensembl database. The predicate defines the instantiation of a class type. The object defines a class that features the exact position of a base pair. Thus, the object specifies what type of resource the subject is. In this example, the triple states that the resource identified by the subject is an instance of the class faldo:ExactPosition.

Code Example 3.5: A triple from the Ensembl database (1)

<http://rdf.ebi.ac.uk/resource/ensembl/homo_sapiens/GRCh38/6:6000001:1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://biohackathon.org/resource/faldo#ExactPosition> .

Both the predicate and the object have a hashtag in their name. A hashtag is used to separate the vocabulary identifier (http://biohackathon.org/resource/faldo) from the reference to a specific element of the vocabulary (ExactPosition). The first substring is named the namespace IRI, as it is the same for every definition listed by the vocabulary. The second part is called the pointer, as it points to a specific section in the vocabulary list. The retrieval of a resource on the World Wide Web through its IRI is called dereferencing.

It is important to note that no exact position is given by the triple. The triple only identifies the type of the subject. To find out the exact position, another predicate is used, namely http://biohackathon.org/resource/faldo#position. Code Example 3.6 contains the triple assigning a value to the subject. Thus, through the use of the correct predicate, both the membership of a class (Code Example 3.5) and information about that object (Code Example 3.6) can be stated. The subject used is an example of an IRI built up out of identity-specific parameters, such as the location of the base pair on the chromosome. More information that can be directly derived from the string includes the species, reference genome, chromosome and strand specification.

Code Example 3.6: A triple from the Ensembl database (2)

<http://rdf.ebi.ac.uk/resource/ensembl/homo_sapiens/GRCh38/6:6000001:1> <http://biohackathon.org/resource/faldo#position> "6000001"^^<http://www.w3.org/2001/XMLSchema#integer> .

IRIs referring to definitions from vocabularies can be quite long. As they are repeated often, naive RDF data formats take up a lot of storage. It is undesirable to have the full version of every IRI repeated each time it is used. To solve this problem, namespace prefixes are introduced.

3.3.2.1 Namespace prefix

A namespace prefix is, by convention, associated with and substituted for a longer namespace IRI used in vocabularies. Table 3.2 gives an overview of the most common namespace prefixes that will recur throughout this thesis.

Table 3.2: A list of namespace prefixes used to substitute their longer namespace IRI variants. Listed vocabularies will be used often throughout this thesis.

Namespace prefixes
Namespace prefix    Namespace IRI
rdf         http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs        http://www.w3.org/2000/01/rdf-schema#
obo         http://purl.obolibrary.org/obo/
dcterms     http://purl.org/dc/terms/
sio         http://semanticscience.org/resource/
faldo       http://biohackathon.org/resource/faldo#
so          http://purl.obolibrary.org/obo/
gfvo        http://biointerchange.org/gfvo/
ensembl     http://rdf.ebi.ac.uk/resource/ensembl/
xsd         http://www.w3.org/2001/XMLSchema#
void        http://rdfs.org/ns/void#
tcga        http://tcga.deri.ie/schema/

Namespace prefixes are found in exported triplestore data formats and when using the SPARQL query language. Code Example 3.7 shows the use of namespace prefixes applied to Code Example 3.5. Due to the shortness of the example, no clear improvement can be seen. Namespaces are used to decrease the number of characters in exported databases, which can contain up to millions of triples, and to simplify the notation when using SPARQL. A further elaboration on SPARQL is given in Section 3.6.

Code Example 3.7: The use of namespace prefixes in Turtle format

#HEADER
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix faldo: <http://biohackathon.org/resource/faldo#> .

#BODY
<http://rdf.ebi.ac.uk/resource/ensembl/homo_sapiens/GRCh38/6:6000001:1> rdf:type faldo:ExactPosition .

RDF and RDF Schema (RDFS) These are the most basic vocabularies used to form the structure of an RDF dataset. They are used to classify the semantic meanings of objects, and can almost always be found in the construction of a dataset or vocabulary. The first version of RDFS was published in 1998 by W3C. Table 3.3 lists the seven most important constructs of RDFS, and arguably of the whole Semantic Web. These definitions cover an important part of the creation of a cohesive dataset.

Table 3.3: Most important constructs of the RDF Schema language. P and C are objects which are properties and classes, respectively. The table is copied from [78].

RDF Schema vocabulary
Construct       Syntactic form                  Description
Class           C rdf:type rdfs:Class           C is an RDF class
Property        P rdf:type rdf:Property         P is an RDF property
type            I rdf:type C                    I is an instance of C
subClassOf      C1 rdfs:subClassOf C2           C1 is a subclass of C2
subPropertyOf   P1 rdfs:subPropertyOf P2        P1 is a sub-property of P2
domain          P rdfs:domain C                 the domain of P is C
range           P rdfs:range C                  the range of P is C
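As a brief illustration of these constructs, the following Turtle sketch declares a small class hierarchy and a typed property; the ex: vocabulary and its terms are invented for this example:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex: <http://example.org/vocabulary#> .

ex:Gene rdf:type rdfs:Class .
ex:Oncogene rdfs:subClassOf ex:Gene .
ex:locatedOn rdf:type rdf:Property ;
    rdfs:domain ex:Gene ;
    rdfs:range ex:Chromosome .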

Web Ontology Language (OWL) OWL is a language for defining ontologies, designed by W3C. It is used to define properties and relationships to and between ontologies. OWL can be considered an extension of RDFS, and has been designed to further extend the accessibility of web resources to automated processes [70].

Simple Knowledge Organization System (SKOS) SKOS is a vocabulary used for the knowledge organization of ontologies or concepts. SKOS has been designed by W3C and is created for a broader functionality of indexing and classification of data structures [71].

Ontology for Biomedical Investigation (OBI) A vocabulary created by The OBO Foundry, a collaborative, international effort, that serves as a means to annotate biomedical protocols, instrumentation and data generated in research. The database has just over 3000 definitions [8].

DCMI Metadata Terms (dcterms) A vocabulary created by the Dublin Core Metadata Initiative (DCMI). It features specifications for a wide array of metadata terms. Although the vocabulary lists only 102 definitions, it is extensively used by other public vocabularies [27].

Semanticscience Integrated Ontology (SIO) The Semanticscience foundation is the creator of the SIO vocabulary. It provides definitions giving a rich description of objects, processes and their attributes. Ensembl and DisGeNET are two examples that have integrated the SIO vocabulary.

Feature Annotation Location Description Ontology (FALDO) The FALDO vocabulary is made to describe the position of sequence features in the genome. It can be used to annotate regions that are described in different file formats, including the Generic Feature Format (GFF3), Variant Call Format (VCF) and BED format. It does not contain vocabulary to describe the features of the regions themselves [15].

Sequence Ontology (SO) The SO vocabulary was originally created by the Sequence Ontology project. It aims to be a collection of ontologies used to describe and annotate features of a biological sequence. SO has a wide variety of contributors, such as the GMOD community and the Sanger Institute [29].

Genomic Feature and Variation Ontology (GFVO) The GFVO vocabulary was created to aid the conversion of non-RDF genome analysis data into RDF resources. The vocabulary is created by BioInterchange, which builds tools for the conversion of genome data into RDF data. GFVO has been used in the creation of a VCF data structure [9].

Ensembl/UniProt The European Bioinformatics Institute (EMBL-EBI) specifically made vocabularies for the conversion of their databases into a linked RDF data structure. EMBL-EBI provides many well-known databases for genome data which are featured as Linked Data, including Ensembl, UniProt, ChEMBL and Expression Atlas [22].

XML Schema Definition (XSD) The XSD schema consists of terms which are used to describe the Extensible Markup Language (XML). It is typically used in conjunction with literals to specify their data type.

The Vocabulary of Interlinked Datasets (VoID) Another creation of W3C is the VoID vocabulary. It focuses on definitions concerned with the metadata of RDF datasets. VoID descriptions range from data discovery and cataloging to the archiving of datasets [73].

The Cancer Genome Atlas (TCGA) The Cancer Genome Atlas project was created to get a deeper understanding of the molecular basis of cancers through genome analysis. The data has also been published as RDF data, with the creation of the TCGA vocabulary as a means. TCGA contains a collection of annotated genomes and information about cancer patients and their treatment process. All data is anonymous [81].

3.4 Linked Data

The Semantic Web is a web of data, a unified structure supporting all data through the use of Semantic technologies such as RDF. To achieve this goal, a standardized data format was introduced and backed up by vocabularies to support the structure necessary for interpretation of data resources. But the aim of The Semantic Web is not to simply link data in a dataset. The framework described is likewise able to support relationships between data from different datasets. The collection of interrelated datasets is known as Linked Data. It enables integration and reasoning across multiple datasets.

For a dataset to be considered Linked Data, it has to be built out of logical constructs that can be interpreted by semantic web tools. These constructs include the RDFS vocabulary, which was later extended with the creation of OWL and SKOS. These can be seen as vocabularies that are used to define other vocabularies, as they define relationships between definitions. These connections between definitions can subsequently create new logical links between data. This process is referred to as inference [75].

3.4.1 Inference

Inference is the creation and discovery of new relations as a result of automatic procedures on The Semantic Web. The creation of these new relationships between resources is a result of the logic created by core semantic web technology formats such as RDFS, OWL and SKOS. Vocabularies are in fact hierarchical constructs that create a classification of the resources. Through the existence of logical relations amongst the definitions of a vocabulary, inference is possible. To give a better view on how these connections are formed, an example of RDFS inference is given, using the constructs from Table 3.3.

Code Example 3.8: A database introducing inference.

#HEADER
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix obo: <http://purl.obolibrary.org/obo/> .

#BODY
obo:SO_0000704 rdf:type rdfs:Class .
obo:SO_0000704 rdfs:subClassOf obo:SO_0001411 .

Code Example 3.8 defines the ontology used to annotate a gene, using the SO vocabulary. The ontology is defined to be a class type (and can thus be used as a subject or object), and is a subclass of obo:SO_0001411. Through inference, every member of the class obo:SO_0000704 is also processed as a member of the class obo:SO_0001411. This is an extremely simple example; more complex relationships can exist between different ontologies. The RDFS vocabulary, of which the main constructs are given in Table 3.3, is, next to OWL and RDF, the main building block for creating relations within or across different vocabularies. It is important to point out that the structure defining the relations between different ontologies is defined in the vocabulary files themselves. Vocabularies are exported under .rdf or .owl extensions using an XML format.

Table 3.5: The descriptions of the definitions used in Code Example 3.8. The descriptions were found by using the IRI as a web address.

Vocabulary definitions
obo:SO_0000704   A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory regions, transcribed regions and/or other functional sequence regions.
obo:SO_0001411   A region defined by its disposition to be involved in a biological process.
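Given the class hierarchy of Code Example 3.8, a single asserted triple is enough for a reasoner to derive a second one. A minimal sketch, reusing the prefixes of Code Example 3.8 and a gene IRI invented for this example:

# Asserted:
<http://example.org/gene/BRCA1> rdf:type obo:SO_0000704 .

# Inferred by an RDFS reasoner, via obo:SO_0000704 rdfs:subClassOf obo:SO_0001411:
<http://example.org/gene/BRCA1> rdf:type obo:SO_0001411 .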

Inferencing can be more than just the creation of new links between resources. Constructs exist that define that two resources with specific overlapping properties can be considered identical. This can be used to merge the properties of resources existing as two nodes, to fill in blank nodes, or to conclude that two resources are one and the same. As the same vocabularies are used for many public databases, the creation of new relationships is possible across different datasets [69]. In a similar way, inconsistencies between datasets can be detected if theoretically overlapping nodes show differences.
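One such construct is owl:sameAs, which states that two IRIs denote one and the same resource, so that their properties may be merged. A minimal sketch, with both gene IRIs invented for this example:

@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://example.org/gene/BRCA1> owl:sameAs <http://other.example.org/genes/brca1> .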

Figure 3.2: Zoomed view on the linked datasets of the Semantic Web. The pink nodes annotate databases about life sciences.

3.4.2 Linked databases

Linked Data is also a term used to describe a recommended practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using IRIs and RDF.

A database must uphold certain key aspects to be considered part of Linked Data, as discussed by Tim Berners-Lee [12]:

• Use URIs/IRIs as names for things

• Use HTTP URIs/IRIs so that people can look up those names.

• Provide useful information, using the standards (RDF/RDFS/..., SPARQL) when someone looks up a URI/IRI.

• Include links to other URIs, so that more things can be discovered.

Linked Data that is open to the public is called Linked Open Data. Figure 3.2 gives a close-up view of the life science databases of the Linked Open Data Cloud [23]. The links between the datasets show that data from one node has direct connections with data from another node. The specific requirement for a link to be drawn between two databases is that at least 50 triples contain an IRI originating from the other database. An arrow pointing to one database means that the database from which the arrow emanates has triples from the database at which the arrow points. Each dataset has a SPARQL endpoint, which is a specific web address to which SPARQL queries can be sent. The following list contains a set of important contributors to Linked Data in the life sciences. Every dataset has its own set of vocabularies, which are needed to successfully navigate through its data. SPARQL endpoints are listed in Table 3.7.

Bio2RDF The largest network of Linked Data for the life sciences is constructed by the Bio2RDF project. It was created by an independent party, with a first release in 2010. The network is a collection of multiple datasets converted to homogeneous RDF resources. The 3rd release of Bio2RDF, dating from July 2014, consisted of a total of 11 billion triples from across 35 datasets. Featured datasets include dbSNP, NCBI, OMIM, GenAge, PubMed and LSR. The Linked Data is not up-to-date with the featured datasets (last update: 2014). Bio2RDF makes use of multiple SPARQL endpoints to browse different datasets [17]. Although Bio2RDF converts a variety of public databases, it has not been used in this thesis as the datasets are not maintained or checked for errors.

Ensembl Ensembl is a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute. It features the assembled annotations of genomes from multiple species. The project also includes comparative genomics, variations and regulatory data [30]. The datasets are maintained and updated by the Ensembl team and are thus a viable source of information.

The Universal Protein Resource (UniProt) EMBL-EBI, in collaboration with the Swiss Institute of Bioinformatics (SIB), created UniProt. The database is focused on assembling protein knowledge and annotation data. It consists of different parts: the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef) and the UniProt Archive (UniParc). Protein data can be listed under two distinct datasets: TrEMBL and Swiss-Prot. TrEMBL is a collection of computationally analyzed and unreviewed data. After this data is reviewed and annotated, it is listed in Swiss-Prot [6]. The datasets are maintained and updated by the UniProt team and are thus a viable source of information.

Other EBI Resources EMBL-EBI features a variety of other datasets in RDF, all with a public SPARQL endpoint and web platform [32]. These include BioModels, BioSamples, ChEMBL, Expression Atlas and Reactome. Most of these datasets are still in the pipeline and are thus subject to downtime and incomplete data. Because these databases are still in the middle of development, changes to the data structure are to be expected.

The Cancer Genome Atlas (TCGA) TCGA is a coordinated effort to collect and further knowledge of the molecular basis of cancer. It is a collection of large-scale genome sequences. The project is a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). The database is split up in a public dataset and a licensed private dataset. The private dataset connects the anonymous public data to personal data of the individuals analyzed.

DisGeNET Human gene-disease associations (GDAs) obtained from different sources are collected and published by DisGeNET, which also features an RDF store. The project aims to collect all information about genetic diseases found on different levels. The stored data is split up in three important compartments: curated data, predicted data and data from literature. Curated data is the integration of GDAs found on expert sites such as UniProt, the Comparative Toxicogenomics Database (CTD) and ClinVar. Predicted data comes from GDAs found in rats and mice that are predicted to exist in humans. The literature data is a collection of GDAs found by data mining through publications. DisGeNET furthermore offers a dataset combining the previous three collections through a scoring system, which is described on their site [34].

Table 3.7: A list of some important SPARQL endpoints in the field of life sciences. Distinctions between data collections can be handled by using different SPARQL endpoints (e.g. TCGA and Bio2RDF) or by using different graphs (e.g. DisGeNET).

SPARQL endpoints
Database                SPARQL endpoint
Bio2RDF (PubMed)        http://pubmed.bio2rdf.org/sparql
Ensembl                 http://wwwdev.ebi.ac.uk/rdf/services/ensembl/sparql
UniProt                 http://sparql.uniprot.org/
TCGA (Bladder Cancer)   http://vmlion14.deri.ie/node42/8082/sparql
DisGeNET                http://rdf.disgenet.org/lodestar/sparql

3.4.2.1 Schema of data structure

Linked Open Data is used for many databases. Furthermore, it is a community that keeps growing as more and more projects become involved in the conversion of data into the RDF model. A problem with the current status of The Semantic Web is that only a small group of people is familiar with its concepts. The use of the SPARQL query language offers many advantages, further discussed in Section 3.6. But to use SPARQL effectively, an understanding of the underlying structure of the data is needed. Since the conversion of datasets into an RDF framework is facilitated by the use, and sometimes the creation, of specific vocabularies, there is no unified way in which databases are structured. To help the community gain an understanding of how data is structured in an RDF environment, schemas representing the links between data are published.

Figure 3.3 is a representation of the data structure of the Linked TCGA database. The data is built around a central node, which refers to a specific case (anonymous patient) for which data is available. The arrows indicate relations with different classes or IRIs (circles) and values or literals (squares). The predicate connecting two nodes is written along the arc. In Section 3.6, SPARQL queries are constructed to illustrate how data can be obtained. These queries were created using this schema.

3.5 RDF data management

RDF data is commonly stored and accessed through triplestores, which are tools for RDF data management. Triplestores can offer many functions, such as the creation of a SPARQL endpoint for the data that is stored. Navigation tools can also be integrated, offering a way to navigate through the data by presenting all nodes a specific resource is connected with. A variety of commercial and non-commercial triplestores are available, each competing to attain the best performance, storage room and functionality. The demand for these tools is increasing, as good data management is key to companies dealing in Big Data. Facebook and Google are just two examples of companies that have based their own technologies and query languages upon the principles of The Semantic Web to manage their data.

Figure 3.3: TCGA schema representing the structure of the Linked Data. Circles and squares are a representation for IRIs and literals, respectively. Predicates linking two nodes together are displayed on the arches. The schema is taken from their site [49].

3.5.1 RDF formats

The simplest storage method is the storage of RDF data as flat files accessible through the web. The following examples feature the most common formats used to display and export RDF data. All these formats have been constructed to be readable and editable by humans; they are all text-based formats [78].

N-Triples The most basic format to distribute triples is N-Triples. The format does not support the integration of namespace prefixes. The triples displayed in the first part of this chapter (Code Examples 3.1-3.6) are examples of N-Triples. The N-Triples format is not common anymore, as it has been superseded by the Turtle format. The file extension for RDF data stored as N-Triples is .nt.
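For comparison with the Turtle example given below (Code Example 3.9), the same statement written as a single N-Triples line spells out every IRI in full; the gene IRI follows the Ensembl pattern used earlier:

<http://rdf.ebi.ac.uk/resource/ensembl/ENSG00000012048> <http://www.w3.org/2000/01/rdf-schema#label> "BRCA1"^^<http://www.w3.org/2001/XMLSchema#string> .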

Turtle The Turtle format is an extension of N-Triples which introduces the support of namespace prefixes, lists and shorthands. The Turtle format uses the file extension .ttl. Databases can contain billions of triples, which have a high amount of repetitiveness through their IRIs. By using the Turtle format over the N-Triples format, significantly smaller files can be obtained when exporting datasets. Some changes introduced in Turtle are represented in Code Example 3.9. @base on line 1 has the same functionality as a namespace prefix, listing an IRI that will be prepended to each incomplete (relative) IRI listed in the dataset. Lines 6-8 provide a shorthand for a set of triples with the same subject: when a triple ends with ";", the subject is implicitly repeated. Line 7 furthermore uses a shorthand for rdf:type, which can be substituted by the shorter version a.

Code Example 3.9: The Turtle format

1 @base <http://rdf.ebi.ac.uk/resource/ensembl/> .
2 @prefix obo: <http://purl.obolibrary.org/obo/> .
3 @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
4 @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
5
6 <ENSG00000012048>
7     a obo:SO_0000704 ;
8     rdfs:label "BRCA1"^^xsd:string .

TriG The TriG format is considered an extension of Turtle, introducing the ability to specify multiple graphs in the RDF dataset. Multiple graphs can be used to create divisions between large chunks of data in the dataset. It is common practice to specify the metadata of the featured data in a separate graph. The usage of the graph syntax is introduced in Code Example 3.10. The TriG format uses the file extension .trig.

Code Example 3.10: The TriG format

1 @base <http://rdf.ebi.ac.uk/resource/ensembl/> .
2 @prefix dcterms: <http://purl.org/dc/terms/> .
3 @prefix obo: <http://purl.obolibrary.org/obo/> .
4 @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
5 @prefix void: <http://rdfs.org/ns/void#> .
6 @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
7
8 GRAPH <metadata> {
9     <dataset>
10        a void:Dataset ;
11        dcterms:title "The TriG Format"^^xsd:string ;
12        dcterms:modified "2015-12-13"^^xsd:date .
13 }
14 GRAPH <genes> {
15     <ENSG00000012048>
16        a obo:SO_0000704 ;
17        rdfs:label "BRCA1"^^xsd:string .
18 }

N-Quads An extension of the N-Triples format came with the creation of N-Quads, which introduces a fourth element to every line. The extra element bears the name of the graph to which the triple statement belongs. The file extension for the N-Quads format is .nq.
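An N-Quads line is thus an N-Triples line with a graph IRI appended. A minimal sketch, with a hypothetical graph IRI:

<http://rdf.ebi.ac.uk/resource/ensembl/ENSG00000012048> <http://www.w3.org/2000/01/rdf-schema#label> "BRCA1" <http://example.org/graphs/genes> .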

JavaScript Object Notation for Linked Data (JSON-LD) The JSON-LD format is used to represent Linked Data in JSON format. JSON is a format used in web-based programming and services. With the change to JSON-LD, web-based resources are added to the RDF framework with minimal change. The file extension for the JSON-LD format is .jsonld.
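A minimal JSON-LD sketch expressing the same gene statement could look as follows; the @context maps the short keys onto the full IRIs used in the earlier examples:

{
  "@context": {
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
    "Gene": "http://purl.obolibrary.org/obo/SO_0000704"
  },
  "@id": "http://rdf.ebi.ac.uk/resource/ensembl/ENSG00000012048",
  "@type": "Gene",
  "label": "BRCA1"
}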

RDF/XML The representation of RDF data is also possible through XML syntax. RDF/XML was the first format in which RDF was expressed, as part of the first RDF specification in 1999. N-Triples followed with its introduction in 2001. The file extension for the RDF/XML format is .rdf.

3.5.2 Triplestores

Triplestores are Database Management Systems (DBMS) for RDF data, used to store, access and manage data. They have the ability to store trillions of triples, supported by multiple datasets. A key feature of any triplestore is the creation of a SPARQL endpoint, which can be public or private.

A SPARQL endpoint is an access point through which the stored data can be queried. These SPARQL endpoints are public for all of the previously discussed databases. With the steady, exponential increase of collected information, making data analysis tools as performant as possible has been the main focus of many DBMS. The UniProt Linked Data, to give an example, counted a staggering 18,957,878,319 triples in the release of December 2015. Applications for RDF data have the following key functions:

RDF Parser/Serializer The RDF parser is able to transform data from one format into the RDF data model and vice versa. RDF parsers need to be able to interpret all the RDF formats to have a functional triplestore. RDF serializers are necessary to convert stored data into different RDF files.

RDF Store The RDF store is able to store and retrieve data according to the principles of The Semantic Web. An important property of the RDF store is the performance of processing SPARQL queries. At the core of a performant triplestore is the indexing of data.

RDF Query Engine The RDF query engine interprets and executes SPARQL queries. After translating the prompted query, the triplestore is able to retrieve the fitting information. A minimal sketch of these three functions in use is given below.
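The sketch below illustrates these three functions with Apache Jena, the Java framework behind the Fuseki server discussed further on. It is a minimal example under stated assumptions, not boinq code; the file genes.ttl and its contents are assumed to exist:

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class TriplestoreSketch {
    public static void main(String[] args) {
        // RDF parser: read a (hypothetical) Turtle file into an in-memory store
        Model model = ModelFactory.createDefaultModel();
        model.read("genes.ttl", "TTL");

        // RDF query engine: list all resources typed as a gene (obo:SO_0000704)
        String query = "SELECT ?gene WHERE { ?gene a <http://purl.obolibrary.org/obo/SO_0000704> }";
        try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().get("gene"));
            }
        }

        // RDF serializer: write the stored data back out, here as N-Triples
        model.write(System.out, "N-TRIPLES");
    }
}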

Linked Open Data has the advantage that datasets are available for download. Acquiring data to load into one's own triplestore can offer advantages such as performance gains and offline availability. Most triplestores are web application tools that can easily be run and accessed on external servers. The applications featured here all have a graphical user interface through which SPARQL queries can be built and datasets can be managed. The following server applications were used throughout this thesis:

Fuseki Fuseki is the SPARQL server of Apache Jena, a free and open-source Java framework for building Semantic Web and Linked Data applications. Fuseki can be run as a stand-alone server or as a web application via Apache Tomcat [5].

Blazegraph Blazegraph is a triplestore released by Systap, featuring both commercial and open-source licensing. Customers can pay for support and development subscriptions that offer up-to-date releases and hotfixes before they are released open-source [65]. Blazegraph was awarded the Big Data Startup Award in 2015 for its innovative work on GPU-accelerated graph analytics [66]. It offers the same basic functionality as Fuseki, but has been found to run faster with larger datasets.

3.6 SPARQL

SPARQL is a recursive acronym for SPARQL Protocol And RDF Query Language. It was created by W3C as a standardized query language for RDF data, in response to the several other query languages for RDF that were available in the early days of The Semantic Web. SPARQL integrates a collection of different functionalities, which will partially be discussed in the following section.

SPARQL bears many resemblances to the triple layout used in the discussed Turtle, N-Triples and TriG formats. It is a powerful tool to retrieve information, discover new relationships (inferencing) and compare data over multiple datasets. One of the main principles of The Semantic Web, having a homogenized web of data, is reflected in the ability of queries to retrieve connections between nodes that can be distant from one another. Through the usage of SPARQL, it becomes easy to create a list of houses for rent in a certain area. Extra information, such as telephone numbers or rent prices, can be added to the query request with a slight change to the query [74].
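As an illustration of this housing example, a query of the following form could retrieve all rental properties in a given city, together with their address and rent; the ex: vocabulary and its terms are hypothetical:

PREFIX ex: <http://example.org/housing#>

SELECT ?house ?address ?rent
WHERE {
    ?house a ex:RentalProperty ;
        ex:streetAddress ?address ;
        ex:monthlyRent ?rent ;
        ex:city "Gent" .
}
ORDER BY ?rent

The individual SPARQL clauses used here are explained in the next section.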

3.6.1 SPARQL syntax

SPARQL elements will be explained using the public TCGA database. It offers an opportunity for a simple query buildup, as the data structure is mainly built upon the TCGA vocabulary. Queries are performed on the database for bladder cancer. The SPARQL endpoint used can be found in Table 3.7.

Figure 3.4: Zoomed view on the TCGA schema representing the RDF data structure. Circles and squares are a representation for IRIs and literals, respectively. Predicates linking two nodes together are displayed on the arches. The schema is taken from their site [49].

SPARQL endpoints sometimes feature a user interface with a SPARQL query editor. The following examples can be copied into the query editor featured when accessing the endpoint through a browser. Figure 3.4 gives a close-up of the part of the data structure that is going to be used. The predicates used are displayed on the arcs connecting two nodes. The central green circle is the unique IRI every patient is given.

3.6.1.1 Example 1

Code Example 3.11 is a SPARQL query for the TCGA database which lists the ID, gender and vital status upon analysis of a patient. The results of the query can be found in Table 3.9. The query can be broken down into multiple elements.

Code Example 3.11: SPARQL Ex. 1, Endpoint: http://vmlion14.deri.ie/node42/8082/sparql

1 PREFIX tcga: <http://tcga.deri.ie/schema/>
2
3 SELECT DISTINCT ?patientIRI ?barcode ?gender ?status
4 WHERE {
5     ?patientIRI
6         tcga:gender ?gender ;
7         tcga:bcr_patient_barcode ?barcode ;
8         tcga:vital_status ?status .
9 }
10 LIMIT 4

PREFIX - The PREFIX clause is used for the initialization of a namespace prefix that will be used in the body of the query.

SELECT - The response of the query is specified through the SELECT clause. Elements of interest can only be variables. Variables are defined as strings starting with a question mark.

DISTINCT - An additional parameter given to the SELECT clause that removes duplicate elements or duplicate combinations of elements.

WHERE - The body of the query is contained by the WHERE clause. This syntax is only used when the query only evaluates the data of the local database, i.e. the data linked to the SPARQL endpoint. WHERE is always used with curly brackets.

LIMIT - To limit the number of results to a given integer, LIMIT is used. This statement comes after the body of the query. LIMIT is mainly used to prevent the triplestore from processing heavy queries when not all possible results are needed.

Lines 5-8 of Code Example 3.11 feature the body of the query. SPARQL queries are built using triple patterns, using defined and undefined elements. The logic is that every matching possibility is a result stored into the variables. This means that the subjects and objects of all triples having tcga:bcr_patient_barcode as a predicate are stored into ?patientIRI and ?barcode, respectively. The patterns are evaluated sequentially, meaning that the results stored in variables are carried over to the following statements. A solution is only kept if every triple pattern of the body can be matched. This also means that once a variable contains a set of results, it can only be reduced by sequential statements. A shorthand can be used for triples sharing the same subject, as seen in the example. LIMIT is used to finish the query, as the result would otherwise contain the information of all patients in the dataset.

Table 3.9: The result given by Code Example 3.11.

SPARQL result for Code Example 3.11

patientIRI                          barcode        gender  status
http://tcga.deri.ie/TCGA-HD-8314    TCGA-HD-8314   MALE    Alive
http://tcga.deri.ie/TCGA-CQ-5333    TCGA-CQ-5333   MALE    Dead
http://tcga.deri.ie/TCGA-BB-8601    TCGA-BB-8601   MALE    Alive
http://tcga.deri.ie/TCGA-CV-5435    TCGA-CV-5435   MALE    Dead

3.6.1.2 Example 2

Code Example 3.12 is a SPARQL query that lists the ID, latest vital status and, optionally, the days until death (after first diagnosis) of all male patients. The results of the query can be found in Table 3.11. Three new modifiers are presented in the query.

Code Example 3.12: SPARQL Ex. 2

1  PREFIX tcga: <http://tcga.deri.ie/schema/>
2
3  SELECT DISTINCT ?patientIRI ?barcode ?vital ?death {
4    SERVICE <http://vmlion14.deri.ie/node42/8082/sparql> {
5      ?patientIRI
6        tcga:gender "MALE";
7        tcga:bcr_patient_barcode ?barcode;
8        tcga:follow_up [tcga:vital_status ?vital].
9
10     OPTIONAL {?patientIRI tcga:follow_up [tcga:days_to_death ?death]}
11
12   }}
13 ORDER BY ?death
14 LIMIT 4

SERVICE - The SERVICE clause is used to access external datasets. The use of SERVICE enables the user to submit queries from a SPARQL tool of choice to any SPARQL endpoint that is open for public querying. The main advantage of SERVICE is the ability to call upon

multiple SPARQL endpoints in one query, enabling data analysis over multiple datasets. The clause comes with the declaration of a SPARQL endpoint; the query executed on the endpoint is defined within curly brackets.

OPTIONAL - Variables and statements defined within the OPTIONAL field will not restrict the result set if no data is available; a blank field is returned when no value can be found. OPTIONAL triples are introduced between curly brackets.

ORDER BY - Results can be ordered by a variable through the ORDER BY modifier. By default, the results are displayed in ascending order. To return the values in descending order, DESC() is added, e.g. ORDER BY DESC(?death). The use of ORDER BY requires the engine to retrieve all possible results in a dataset, even when LIMIT is used in the query.

Code Example 3.12 introduces a new shorthand construct on line 8 and line 10. Square brackets are useful for navigating through nodes when one wants to link distant variables with one another. The square brackets contain the predicate and object that are linked to the subject they replace. Restrictions on the queried data can be imposed by defining triples with fixed values, as shown on line 6. The SPARQL language features a multitude of other clauses, e.g. regular-expression matching through FILTER and graph selection, which enhance the functionality and power of the language; a short illustration of FILTER follows Table 3.11. Full documentation is available online [74].

Table 3.11: The result given by Code Example 3.12.

SPARQL result for Code Example 3.12

patientIRI                          barcode        vital   days to death
http://tcga.deri.ie/TCGA-CV-5432    TCGA-CV-5432   Alive
http://tcga.deri.ie/TCGA-CQ-5334    TCGA-CQ-5334   Dead    128
http://tcga.deri.ie/TCGA-CQ-5333    TCGA-CQ-5333   Dead    341
http://tcga.deri.ie/TCGA-CV-5435    TCGA-CV-5435   Dead    2318
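As an illustration of FILTER, the following Python sketch submits a query combining a numeric restriction and a regular-expression match, using the SPARQLWrapper library. The query itself is an assumption built on the same TCGA schema as the examples above, and the endpoint may no longer be publicly reachable.

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical query against the TCGA endpoint used in the examples above.
endpoint = SPARQLWrapper("http://vmlion14.deri.ie/node42/8082/sparql")
endpoint.setQuery("""
PREFIX tcga: <http://tcga.deri.ie/schema/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

SELECT ?barcode ?death
WHERE {
  ?patient tcga:bcr_patient_barcode ?barcode ;
           tcga:follow_up [tcga:days_to_death ?death] .
  FILTER (xsd:integer(?death) > 1000)      # numeric restriction
  FILTER regex(?barcode, "^TCGA-CV")       # regular-expression match
}
LIMIT 10
""")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["barcode"]["value"], row["death"]["value"])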

4 Boinq

4.1 Introduction

The continuous development of technologies created for The Semantic Web has made it a more complete and attractive asset each year. Because of this, increased interest has been shown in adapting biological datasets to an RDF structure. A considerable amount of data is already offered by a variety of databases, although the amount of data that can be requested over The Semantic Web is still lacking.

Boinq is an open source platform that leverages The Semantic Web to share, organize and combine sequence based information. It furthermore aims to be a tool through which user data can be injected into The Semantic Web. A boinq installation can be used locally, and is intended to interoperate with other public endpoints.

4.2 Design

The boinq platform intends to ease the organization of genome annotations irrespective of the original data format. Organizing these data includes importing them from widely used file formats, recombining them, and uploading them into a general purpose triplestore that can be exposed as a SPARQL endpoint. The platform should be able to recombine local data with data from public databases, and should leverage existing sources of sequence based information and existing ontologies. The design requirements of the tool were as follows:

• The system should be accessible as a web platform. This allows easy communication between different endpoints for data sharing. Web platforms also allow multiple users to access one instance. • The platform should be able to use a triplestore of choice and not impose a given framework or technology. Due to differences in the functionality offered by the various triplestores, a given triplestore might not be optimal for the user. Triplestores are furthermore the key software in ensuring fast and optimized data storage and retrieval; software that is fully functional today may lack features or become slow and outdated, and thus obsolete in the future. • The platform should import and convert genomic data into a triplestore. The Semantic Web is built upon the RDF framework, which means that the data should be represented as triples. The implemented functionalities for data analysis are largely adapted to this data structure. • Recombining data from different sources should be possible without manual query writing. Boinq wants to bring The Semantic Web to the user without requiring extensive knowledge of the matter. To use The Semantic Web unassisted, know-how on a variety of matters is necessary, including SPARQL, RDF and the RDF schemas of remote endpoints.


• The platform should support visualization of stored information. To improve user-friendliness and functionality, boinq aims to provide a graphical visualization of stored or queried data. It is conventional for genomic data to be represented graphically, as such a representation offers an easy way to browse and interpret the data. The creation of this functionality requires a lot of work, and options to implement third party tools are considered in Section 5.4.3.

4.2.1 Data unification

The integration of data into an RDF environment has been realized in boinq as discussed in Chapter 5. In general, two approaches are available: either a custom vocabulary is created that defines the structure of the data, or existing vocabularies are used and combined. Although a completely custom data structure gives the advantage of complete adaptability to the functionality of boinq, it would not share structural elements with genomic data stored in public RDF stores. To ensure the alignment of the data structure with public efforts, the use of existing vocabularies is preferred.

Furthermore, the conversion of data involves the effort of creating a standard schema that is adaptable to every data format. Following a unified approach is advantageous for usability and for the simplification of automated processing. Figure 4.1 gives a representation of the schema introduced to support a unified data structure. In general, every entry of a file represents a feature of a certain type with an identity/label and attributes. Attributes can also pertain to the entry instead of the feature. Features can hold relations with other features introduced by other entries. Every feature is bound to a location. More information about the different formats, data implementation and conversion is discussed in Chapter 5. To unify the representation of features as triples, the following principles were adopted: • entry nodes are introduced to add information to a feature that is inherent to the file format rather than a biological attribute • attributes pertaining to the file entry are separated from attributes pertaining to the feature it represents • rdf/rdfs is used to describe the identity of a node • location on the reference is represented by FALDO terms • genomic feature types, attributes and relations are represented by SO terms.

4.2.2 Data organization

Boinq offers a way to manage data stored in a triplestore. Data obtained from both the user and The Semantic Web can be introduced. To properly differentiate between different sets of data, a hierarchical structure has been introduced in the organization of the data as available to the user. Different SPARQL endpoints, used as sources of annotation information, are attributed to a data source, which can be thought of as a work space. A data source consists of different tracks, a track being a collection of features organized along a common theme. Data sources are used to allocate all the data from, e.g., one project. Within these data sources, tracks can be selected, which divides the data from a project into different graphs. An example is to store the genomic data from different organisms in different tracks. Other examples are the allocation of the varying data types found in the different data formats, or the allocation of data obtained during different steps of a research project. Metadata of the different tracks is stored as triples in a single graph per data source. The metadata consists of information about the original and converted data, and about the capabilities of the endpoint, including the operations that can be conducted for data analysis. A custom vocabulary was created to handle the high level, domain specific abstractions used by boinq's query builder.

Data analysis through recombination, with the use of SPARQL and Linked Open Data, is the main functionality of boinq. A graphical approach to data recombination from different tracks is being worked on. To represent the network architecture of The Semantic Web, tracks are represented by nodes that can be dragged onto a diagram. Nodes are therefore sources of features. These can be filtered by applying criteria that depend on the endpoint, and are customizable by modifying the metadata for the track. The following criteria are supported (a sketch of how such a criterion could translate into SPARQL follows the list):

Figure 4.1: A representation of the general data structure as implemented in the RDF framework. Every connection of a node to another node represents a triple. Namespace prefixes are defined in the bottom right corner. IRI objects are represented as circles; literals as squares. Green nodes define the type of the entity it is connected to, orange nodes represent objects linked over multiple entries.

Location Limit features to a certain location, such as a reference sequence, strand or genomic region. Location filters can be either explicit or derived from the feature in the database.

FeatureType Limit features to a certain type; metadata includes the available types in the track.

MatchTerm Limit features to those linked to a given term from a target ontology. Metadata includes a path expression linking the feature entity to the term and information about the target terms.

MatchInteger Limit features to those linked to an integer value that matches a given value or lies within a given range. Metadata includes a path expression from the feature entity to the integer value.

MatchDecimal Similar to the integer match, but for decimal values rather than integer values.

MatchString Limit features to those linked to a certain string value. Options are to have either exact matches or regular expressions matching a substring. Both are case-insensitive. Metadata includes a path expression from the feature entity to the string value.
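To sketch how such a criterion could be rendered as SPARQL, the Python fragment below builds a graph pattern for a MatchInteger filter. It is an illustration only: the property path and variable names are hypothetical placeholders, not the patterns boinq actually generates.

def match_integer_pattern(feature_var: str, path: str, low: int, high: int) -> str:
    """Render a MatchInteger criterion as a SPARQL graph pattern.

    `path` is a SPARQL 1.1 property path that would come from the track
    metadata; the value used below is a hypothetical example."""
    return (f"{feature_var} {path} ?value .\n"
            f"FILTER (?value >= {low} && ?value <= {high})")

# A hypothetical score criterion: features whose value lies in [100, 500].
print(match_integer_pattern("?feature", "format:hasAttribute/rdf:value", 100, 500))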

4.3 Comparison to other frameworks

Boinq has functionalities similar to those of existing projects. Analyzing independent projects can advance the development of boinq; they are briefly discussed below. Some examples are used both for the implementation of data and for the evaluation of the constructed framework.

4.3.1 Biological query building

Frameworks exist to help build SPARQL queries for biological data. Biogateway [3] presents example SPARQL queries as easily adaptable templates. SPARQLGraph [61] is a service that allows the construction of integrative queries over biological databases using a GUI. These tools are very useful for advancing semantic web use among bioinformaticians, yet they do not focus on the specific use case of managing sequence based information. Indeed, boinq's primary aim is not the generation of SPARQL queries; rather, it uses SPARQL as a tool to reorganize sequence annotations.

The mentioned tools could be used to exploit sequence based information, such as that offered by a boinq endpoint, in refined integrative queries.

4.3.2 Semantic access to sequence information

sparql-bed [13] and sparql-vcf [14] are tools for direct SPARQL querying of the BED and VCF format, respectively. These tools only make use of a location schema in the RDF framework, using FALDO. Each feature is represented by its entry, and no information beyond its location can be extracted. Thus, no queries involving attributes or labeling of features can be performed. Both tools are useful for performing a quick command line query; however, they do not manage a triplestore that can be exposed as a SPARQL endpoint, nor do they allow for any data management.

Integration of sequence based information is also a goal of BioInterchange [10], a paid tool developed by Codamono. The service supports the integration of VCF and GFF files into an RDF framework. Data integration has been constructed such that conversion is possible both from GFF/VCF to RDF and back. No features enhancing the connectivity of the integrated data with The Semantic Web have been published.

4.4 Material and methods

Research and work done in the creation of custom data structures for next generation sequencing has been directly applied in the development of boinq. Thus, a fully functional tool for the conversion of the four data formats has been created. The development was carried out using the following tools, in a Windows 7 environment:

Development environment Coding on boinq has mainly been performed using Spring Tool Suite (STS), an Eclipse based environment for the development of programs and tools. STS offers a multitude of functionalities for building, debugging and coding, and is documented for development using a JHipster stack. Version 3.7.1 has been used during development.

Architecture and Libraries Boinq has been built using the JHipster stack [36], enabling the use of state of the art components and best practices for web applications. It eases deployment of industry standard frameworks and provides complex functionality like security, caching, or logging, out of the box. In the server application, several technologies are combined. HTSJDK [59] and Jannovar have been used for handling flat file access. Apache Jena [4] is used for programmatic query building, as a SPARQL and SPARQL/Update client, and for generating Java classes as shorthand for ontology terms. Quartz is used for asynchronous job handling. Elda is currently being implemented to offer the triplestore data as resolvable URIs. Boinq uses a local database to store information necessary for the webapp, such as known data sources and tracks, users and credentials, and stores metadata and actual data in a triplestore using SPARQL 1.1.

Vocabularies To allow for seamless communication between the platform and the triplestore, two independent vocabularies have been constructed and implemented. Ontology building was performed using the Protégé software [50]. Protégé is a free, open source ontology building tool with a variety of functionalities that enable the construction of an intelligent framework. The created ontologies serve two main purposes: structuring the data and the metadata implementation. These vocabularies are named the 'format' and 'track' vocabulary, respectively. The format vocabulary is used in the construction of a data structure from the varying data formats. The track vocabulary is used to construct the metadata according to the functionality of boinq. Practical uses of these vocabularies are laid out in Chapter 5. The vocabularies can be viewed and downloaded from the GitHub repository [24]. Protégé version 5.0.0 beta 17 has been used to construct the ontologies.

Triplestores As elaborated in Section 3.5.2, Blazegraph and Fuseki were both used for development and research purposes. Blazegraph versions 1.3.4 to 2.1.0 and Fuseki version 2 have been used. Blazegraph was run on both a local machine and a server.

Server A remote version of boinq and Blazegraph (2.1.0) was run on a server for the execution of the use case and the analysis of correct functioning with big data files. The server was provided by Genohm and is a virtualized CentOS 6.7 machine with eight cores (Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz) and 32 GB of RAM assigned.

Programming languages Boinq has mainly been written in Java (server-side) and JavaScript (client-side), with a local installation of Java 8 Update 60. Python version 3.5.0 and SQLite version 3.11.1 were used for data file manipulation, analysis and conversion. Distribution curves of the methylation and expression data (as explained in Chapter 6) have been created in RStudio version 0.99.896 with a local installation of R for Windows version 3.2.4.

Version management GitHub has been used for the management of different versions and the creation of a remote backup. A local installation of Git version 2.5.1 was used. The repository is located at https://github.com/Kleurenprinter/boinq2.

5 Genomic Data Implementation

5.1 Introduction

The introduction of second generation sequencing technologies, such as Roche 454 and Illumina, has induced exponential growth in the acquisition of genomic data. These technologies have caused a rapid cost reduction per megabase of sequencing information [46]. In 2007, the National Center for Biotechnology Information (NCBI) began collecting raw sequencing data from these platforms. NCBI was followed by the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ). In 2009, the Sequence Read Archive (SRA) was established as a central database to help the research community gain access to sequencing data for scientific purposes [38]. The total amount of raw sequence data contained in the SRA currently exceeds 3.6 petabases.

The historical growth rate of genomic data, since the development of second generation sequencing, has been doubling the amount of data every seven months [64]. Figure 5.1 shows the growth of genomic data, the total amount of human genomes sequenced and the data capacity. Several projects such as the 1000 Genomes project, TCGA and the Exome Aggregation Consortium (ExAC) contributed large amounts of data to the scientific community. Future growth of data is displayed according to several predictions: the historical growth rate representing a doubling of all data every seven months, an estimation given by Illumina [56] and the doubling of data capacity every two years, as stated by Moore’s Law [64].

5.2 Genomic Data

Genomic data is a heterogeneous collection of data. A variety of elements, such as Coding DNA Sequences (CDS), translated proteins and nucleotide variations, can be retrieved. Acquired genomic data goes through different stages of processing, requiring sorting through multiple levels of Next Generation Sequencing (NGS) data containing probabilities, calculations and parameters.

The high variety of genomic data is reflected in the existence of specialized public databases and data formats. Although an effort has been made to list cross-references from one database to another, mining these datasets requires, in most cases, user-made scripts.

Data acquisition and distribution by end-users are highly heterogeneous and are performed with a variety of data formats, coding languages and tools. Considering that analytical studies of genomic data can exceed the capabilities of modern computers, the improvement of current techniques and the creation of new tools for faster and specialized data analysis have received much attention. Several data formats exist, each created and designed according to different standards and design goals.


Figure 5.1: A plot portraying the growth of DNA sequencing, showing both the annual sequencing capacity (right axis) and the total number of human genomes. Important contributors, such as the 1000 Genomes, TCGA and ExAC projects, are shown at their respective launch dates. Three future growth estimates are given: red following the historical growth rate, orange following the Illumina estimate and blue following Moore's Law [64].

With the advent of public databases supporting RDF data, an opportunity arises to remove the boundaries that exist when calling upon data from multiple public databases. In order to make Linked Open Data accessible and useful to end-users, the conversion of the most commonly used file formats has been integrated into boinq. Since most data is not available in RDF, a set of converters has been introduced. This feature helps users convert their data without having to find third-party tools. Conversion by boinq also offers the advantage that the RDF data is created according to a data structure that fits the logic of the data analysis queries constructed by boinq. The current version of boinq integrates the conversion of five different file formats: BED, GFF3, GTF, SAM and BAM. These formats are discussed here.

5.2.1 Browser Extensible Data format

The UCSC Genome Browser is a maintained web tool displaying annotations and features mapped across the length of a specified chromosome. It is developed and maintained by the Genome Bioinformatics Group of the University of California Santa Cruz (UCSC) [62]. The database contains assembled genomes of a broad range of sequenced species. The Browser Extensible Data (BED) format has been developed to represent genomic features and annotations, displayed through the web tool, in a concise and flexible way. It is a tab delimited text format that supports up to twelve columns, of which only the first three are obligatory [68]. Code Example 5.1 gives a representation of the data structure of the BED format. A more complete description of the BED format can be found on the UCSC Genome Browser website (https://genome.ucsc.edu/FAQ/FAQformat.html#format1).

Code Example 5.1: Example of a BED file

browser hide all
chr2 178707289 178707561 Hs.666133 0 + 178707289 178707561 0 3 85,177,8, 0,87,264,
chr1 178709699 178711955 Hs.377257 0 + 178709699 178711955 0 1 2256, 0,
chr1 178711404 178712057 Hs.688767 0 - 178711404 178712057 0 1 653, 0,
chr2 178777793 178778272 Hs.541631 0 - 178777793 178778272 0 1 479, 0,
chr2 178908612 178916376 Hs.318775 0 + 178908612 178916376 0 4 2644,3067,464,1588, 0,2645,5712,6176,

The first three required fields are sequentially:

1. chrom - Lists the chromosome of the given region. The chromosome or contig is typically denoted with or without the prefix ’chr’ or ’ctg’, respectively.

2. chromStart - The start position of the feature. Base pair counting starts at 0. The 0-based coordinate system numbers between nucleotides.

3. chromEnd - The end position of the feature.

An additional nine fields can be added. Empty fields are not allowed, meaning that for any field to be used, all preceding fields must be occupied. (A minimal parsing sketch in Python follows the field list.)

4. name - The name, label or ID under which a feature is commonly specified.

5. score - The features displayed in the Genome Browser are given a gray-value. This is stored in the BED file as a score ranging from 0 to 1000. The field is often used to store experimentally derived information about a feature.

6. strand - A value being either ’+’ or ’-’, representing that the annotation is found on the forward or backward strand, respectively.

7. thickStart - The coordinate at which the Genome Browser displays the feature as a solid rectangle.

8. thickEnd - The coordinate at which the Genome Browser stops displaying the feature as a solid rectangle.

9. itemRgb - The RGB color value that is used as an alternative to the gray-value score.

10. blockCount - The number of sub-elements in a feature, e.g. the number of exons in a gene.

11. blockSizes - The size of the specified sub-elements.

12. blockStarts - The start of the specified sub-elements. These start positions are listed according to the sizes of each sub-element as specified in the blockSizes field.
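As a minimal illustration (not boinq's converter, which uses HTSJDK), the following Python sketch splits one BED entry into named fields and shifts the 0-based start to the 1-based coordinate used later in the FALDO model (Section 5.3.4.1):

# Field names of the twelve BED columns; only the first three are required.
FIELDS = ["chrom", "chromStart", "chromEnd", "name", "score", "strand",
          "thickStart", "thickEnd", "itemRgb", "blockCount",
          "blockSizes", "blockStarts"]

def parse_bed_line(line: str) -> dict:
    values = line.rstrip("\n").split("\t")
    entry = dict(zip(FIELDS, values))          # zip stops at the last field present
    entry["chromStart"] = int(entry["chromStart"]) + 1   # 0-based -> 1-based
    entry["chromEnd"] = int(entry["chromEnd"])
    return entry

print(parse_bed_line("chr1\t178709699\t178711955\tHs.377257\t0\t+"))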

5.2.2 Generic Feature Format

The Generic Feature Format (GFF) is a tab delimited text file format used for storing DNA, RNA and protein features. GFF files are broadly used for exporting genomic data from public databases, e.g. UniProt and Ensembl. The file format gives a brief representation of genome data from specific regions. Different versions of GFF are in use, the latest and most complete one being GFF3. GFF3 is an extension of the GFF2 format, developed to solve its predecessor's shortcomings [31]. The Gene Transfer Format (GTF) is a format borrowed from GFF. It was developed in between the creation of GFF2 and GFF3, and is therefore sometimes referred to as GFF2.5. The format is highly similar to GFF3, with only small variations, one example being the different naming of features in the attribute list.

GFF3 files feature a fixed number of nine columns in which the information is represented. Columns and rows are separated by tabs and newlines, respectively. Empty fields are denoted with a period. Code Example 5.2 gives a representation of a generic GFF file. A more complete description of GFF3, with examples, can be found on several websites, e.g. http://www.sequenceontology.org/gff3.shtml.

Code Example 5.2: Example of a GFF file

##gff-version 3.2.1
##sequence-region ctgA 1 1497228
ctgA example gene 1050 9000 . + . ID=EDEN;Name=EDEN;Note=Protein Kinase
ctgA example mRNA 1050 9000 . + . ID=EDEN.1;Parent=EDEN;Name=EDEN.1
ctgA example five_prime_UTR 1050 1200 . + . Parent=EDEN.1
ctgA example CDS 1201 1500 . + 0 Parent=EDEN.1
ctgA example CDS 3000 3902 . + 0 Parent=EDEN.1
ctgA example CDS 5000 5500 . + 0 Parent=EDEN.1
ctgA example CDS 7000 7608 . + 0 Parent=EDEN.1
ctgA example three_prime_UTR 7609 9000 . + . Parent=EDEN.1
ctgA example mRNA 1050 9000 . + . ID=EDEN.2;Parent=EDEN;Name=EDEN.2
ctgA example five_prime_UTR 1050 1200 . + . Parent=EDEN.2
ctgA example CDS 1201 1500 . + 0 Parent=EDEN.2
ctgA example CDS 5000 5500 . + 0 Parent=EDEN.2
ctgA example CDS 7000 7608 . + 0 Parent=EDEN.2
ctgA example three_prime_UTR 7609 9000 . + . Parent=EDEN.2
...

The header of GFF files can contain information about the version of the format and other metadata. It is annotated with a double hashtag (##) and is not required. The nine required fields of the body of the file are sequentially:

1. seqname - The name given to the chromosome or scaffold. The chromosome or contig is typically denoted with or without the prefix ’chr’ or ’ctg’, respectively.

2. source - The name of the database, project or program that annotated the given feature.

3. feature - The type of the feature, e.g. Gene, Exon.

4. start - The start position of the feature. Base pair counting starts at 1. The 1-based coordinate system numbers nucleotides directly.

5. end - The end position of the feature.

6. score - The score of a given feature. A variety of scores can be chosen from, such as an E-value for sequence similarity or a P-value for gene prediction.

7. strand - A value being either ’+’ or ’-’, representing that the annotation is found on the forward or backward strand, respectively.

8. phase - A value of '0', '1' or '2', which can also be interpreted as the frame of the feature. A value of '0' indicates that a codon starts at the first base of the feature, '1' that it starts at the second base, and '2' that it starts at the third base. The phase is used for Coding DNA Sequence (CDS) features.

9. attribute - A list of values separated by semicolons, defining additional information about the feature. The length of the list is undefined. Commonly used attributes are described below; a parsing sketch follows the attribute descriptions.

ID - Indicates the ID of a feature. An ID, unlike a name or label, is a unique value within the GFF file. IDs are used as the reference name when connecting features through Parent or Derives from.

Name - Carries the label of the feature. It does not have to be a unique value.

Alias - A secondary or alternative name of a feature; features commonly carry multiple labels.

Parent - Indicates the relationship between two features: the feature is part of the feature it points to. The value given with Parent is always the ID of another feature.

Target - Indicates the target of a nucleotide-to-nucleotide or protein-to-nucleotide alignment. The Target value consists of four fields: target ID, start, end, and optionally strand.

Gap - Linked to the Target field, the Gap field contains a CIGAR string indicating the gaps in the alignment.

Derives from - A temporal relationship of one feature to another. This field is used to distinguish the structural relation given through the Parent field from a temporal relation, as needed in polycistronic genes.

Note - A comment given on the feature.

Dbxref - Contains a database cross reference.

Is circular - A flag indicating whether the feature is circular, e.g. for features of a bacterial genome.
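A minimal Python sketch of parsing the attribute column into a dictionary is given below; multi-valued attributes such as Parent are comma separated. This is an illustration, not boinq's converter.

def parse_gff3_attributes(column: str) -> dict:
    """Split the ninth GFF3 column into {key: [values]} pairs."""
    attributes = {}
    for pair in column.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[key] = value.split(",")   # Parent, Alias, ... may hold a list
    return attributes

print(parse_gff3_attributes("ID=EDEN;Name=EDEN;Note=Protein Kinase"))
# {'ID': ['EDEN'], 'Name': ['EDEN'], 'Note': ['Protein Kinase']}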

5.2.3 Variant Call Format

The Variant Call Format (VCF) was introduced with the appearance of large scale genotyping and sequencing projects. It was initially specified by the 1000 Genomes Project, an international study launched in 2008 that aimed to create a database collecting human genetic variation on a global scale, in order to offer a more complete understanding of the effect of genomic differences, such as Single Nucleotide Polymorphisms (SNPs), Copy Number Variations (CNVs) and structural variations, on the phenotype. Since VCF was created to store the results of whole genome sequencing, VCF files are often very large, containing several million entries [19]. VCF portrays nucleotide variations found in HTS data compared to a reference genome.

The Variant Call Format is, just like the BED and GFF formats, a tab delimited text format listing all genetic variations of a sequenced genome. Unlike the BED and GFF formats, no complete sequences have to be listed. VCF permits the creation of custom fields, as there are no strict rules on the number of columns. Custom fields are specified in the header in a predefined format. The new elements brought by every version and the addition of customized fields ensure an appropriate amount of information for every read. Although broad customization of the format is possible, the first eight fields are fixed. Empty fields are annotated with a period [1]. The extension and maintenance of the VCF file format has been integrated with SAMtools, a suite of programs for interacting with high-throughput sequencing data [60]. The official and more complete description of the latest VCF format (version 4.3, visited December 2015) can be found in the GitHub repository of the SAMtools project (http://samtools.github.io/hts-specs/VCFv4.3.pdf).

Code Example 5.3: Example of a VCF file

##fileformat=VCFv4.0
##fileDate=20100610
##source=glfTools v3
##reference=1000GenomesPilot-NCBI36
##phasing=NA
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
Y 27284 rs2058276 T C 32 . AC=2;AN=2;DB;DP=182 GT:GQ:DP 0|0:48:1
Y 27342 . G A,C 31 . AC=1;AN=2;DP=196;NS=63 GT:GQ:DP 0|2:48:3
Y 27432 . C T 25 . AC=1;AN=2;DP=275;NS=66 GT:GQ:DP 0|0:3:6
Y 27467 . A G 34 . AC=2;AN=2;DP=179;NS=64 GT:GQ:DP 1|1:2:7
Y 27779 . T A 67 . AC=1;AN=2;DP=225;NS=67 GT:GQ:DP 1|0:48:4
Y 27825 rs2075640 A G 38 . AC=1;AN=2;DB;DP=254;H2;NS=66 GT:GQ:DP 0|0:17:2
Y 27837 . G A 51 . AC=1;AN=2;DP=217;NS=67 GT:GQ:DP 0|1:48:3
...

The header can hold a variety of information, with no restrictions on its length or format. The only rule is that entries start with a double hashtag ('##'). The header is meant to be interpreted by the user, providing metadata about the file and specifications about the information given with every read. All fields are optional. Common fields include, but are not limited to:

fileformat - A specification of the file format

fileDate - A specification of the date at which the file was created

source - The source from which the data was retrieved

contig - A specification of the species and the assembly used

INFO - The INFO fields specify and describe keys used for giving results/values of tests/attributes to reads

FILTER - The FILTER fields specify and describe filters used for the quality control of reads

FORMAT - The FORMAT fields specify and describe keys used for giving results/values of tests/attributes to samples

The body of the format consists of a minimum of eight fixed fields, given in sequential order:

1. CHROM - The name given to the chromosome on which the read was registered. The chromosome or contig is typically denoted with or without the prefix 'chr' or 'ctg', respectively.

2. POS - The position on the reference genome. VCF files are ordered by increasing position number and use a 1-based coordinate system.

3. ID - The identifier used for the given base pair or set of base pairs. SNPs featured in the dbSNP database are commonly annotated with an rs number.

4. REF - The reference base; this can be one of A, C, G, T or N. Multiple bases are permitted.

5. ALT - The alternate base; this can be a comma separated list when multiple sample reads were called. Strings made up of A, C, G, T, N and '*' are permitted. '*' is used for missing alleles caused by upstream deletions.

6. QUAL - A quality score for the assertions made in ALT. The score is called the Phred scaled quality score.

7. FILTER - Reads can be evaluated by filters that qualify them based on their Phred score. 'PASS' is given to reads that passed all filters. The codes of all filters that the read failed are listed in a semicolon separated list, e.g. q10 means that the quality (Phred score) of the site is below 10. Filters can be described in the header.

8. INFO - A field listing additional information. The information, separated by semicolons, is attributed to keys that are defined in the header. Although custom keys can be added by the user, a large list of keys has been defined by the format. Attribute keys can be described in the header. These include, but are not limited to:

AA - ancestral allele
AC - allele count in genotypes, for each ALT allele, in the same order as listed
AF - allele frequency for each ALT allele, in the same order as listed (used when estimated from primary data, not from called genotypes)
AN - total number of alleles in called genotypes
DB - dbSNP membership
DP - combined depth across multiple samples
H2 - membership in HapMap 2
H3 - membership in HapMap 3
MQ - RMS mapping quality
NS - number of samples
1000G - membership of 1000 Genomes

Additionally, genotype information may be present. This data is preceded by a FORMAT column, which defines the interpretation of every item present in the columns that follow. Keys used in the FORMAT column are generally specified in the header of the file. Genotype information can be spread over multiple columns representing the different samples the data has been obtained from. Commonly used items are listed below; a short parsing sketch follows the list:

GT - Genotype specification of the sample. '0' is used for the reference sequence, '1', '2', ... for the alternative sequences; multiple alternative sequences can be given. The values for different alleles are separated by '/' or '|', meaning that the genotype is unphased or phased, respectively.

DP - The read depth at the position for the specified sample.

PL - The Phred-scaled genotype likelihoods, rounded to the closest integer.

GP - The Phred-scaled genotype posterior probabilities.

GQ - The conditional genotype quality.

HQ - The haplotype qualities, two Phred scores separated by a comma.

EC - A comma separated list of expected alternate allele counts.
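As an illustration of how these columns relate, the Python sketch below splits the first data line of Code Example 5.3 into its fixed fields and maps the FORMAT keys onto the sample column (a simplified sketch, not boinq's converter):

line = "Y\t27284\trs2058276\tT\tC\t32\t.\tAC=2;AN=2;DB;DP=182\tGT:GQ:DP\t0|0:48:1"
fields = line.rstrip("\n").split("\t")
chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]

# INFO: semicolon separated key=value pairs; flag keys such as DB carry no value.
info_dict = {}
for item in info.split(";"):
    key, _, value = item.partition("=")
    info_dict[key] = value if value else True

# The FORMAT column defines how to read each sample column that follows.
format_keys = fields[8].split(":")
sample = dict(zip(format_keys, fields[9].split(":")))

print(info_dict)   # {'AC': '2', 'AN': '2', 'DB': True, 'DP': '182'}
print(sample)      # {'GT': '0|0', 'GQ': '48', 'DP': '1'}
print("phased" if "|" in sample["GT"] else "unphased")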

5.2.4 Sequence Alignment/Map format

The Sequence Alignment/Map (SAM) format is a data format introduced and used by SAMtools. It stores short DNA sequence read alignments retrieved from HTS against a reference genome and is used for the post-processing of these data. This low level data is commonly used for the calculation of variants. SAMtools can be used to perform a variety of operations on the format, including, but not limited to, alignment viewing, sorting, indexing, data conversion and extraction. Since SAM files can grow to several tens of gigabytes, the Binary Alignment/Map (BAM) format was introduced as a more compact alternative to the SAM format [40].

Code Example 5.4: Example of a SAM file

@SQ SN:ref LN:45
@SQ SN:ref2 LN:40
r001 163 ref 7 30 8M4I4M1D3M = 37 39 TTAGATAAAGAGGATACTG * XX:B:S,12561
r002 0 ref 9 30 1S2I6M1P1I1P1I4M2I * 0 0 AAAAGATAAGGGATAAA * H0:i:12
r003 0 ref 9 30 5H6M * 0 0 AGCTAA *
r004 0 ref 16 30 6M14N1I5M * 0 0 ATAGCTCTCAGC *
r003 16 ref 29 30 6H5M * 0 0 TAGGC *
r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *
x1 0 ref2 1 30 20M * 0 0 aggttttataaaacaaataa *
x2 0 ref2 2 30 21M * 0 0 ggttttataaaacaaataatt *
x3 0 ref2 6 30 9M4I13M * 0 0 ttataaaacAAATaattaagtctaca *
x4 0 ref2 10 30 25M * 0 0 CaaaTaattaagtctacagagcaac *
x5 0 ref2 12 30 24M * 0 0 aaTaattaagtctacagagcaact *
x6 0 ref2 14 30 23M * 0 0 Taattaagtctacagagcaacta *

The SAM format is a tab-delimited text format consisting of an optional header section and an alignment section. The header contains lines starting with '@', holding metadata about the alignment data. Each entry consists of 11 mandatory fields and optional fields. Although all mandatory fields have to be present, placeholder values denoting that no information is available can be used; these depend on the field. The official and more complete description of the SAM/BAM format (version 1.0, visited December 2015) can be found in the GitHub repository of the SAMtools project (https://samtools.github.io/hts-specs/SAMv1.pdf). The eleven required fields are:

1. QNAME - Query template name. Reads having an identical QNAME are regarded as coming from the same template.

2. FLAG - The FLAG is an integer value composed of 12 bits, each of them conferring information about the read depending on its 0/1 value (a sketch decoding these bits follows the field list).

3. RNAME - Reference sequence name. This field gives the name of the reference alignment used for a given read.

4. POS - The value of the leftmost mapping position of the first matching base to the reference sequence. BAM uses a 0-based coordinate system, while SAM uses a 1-based coordinate system.

5. MAPQ - Mapping quality. MAPQ gives the quality of the alignment between the given sequence and the reference sequence. The score is Phred-scaled, based on the probability that the alignment is correct.

6. CIGAR - The CIGAR string is a short way of representing how the different parts of the given sequence align with the reference sequence.

7. RNEXT - Reference sequence name of the NEXT alignment of the same read. This field is used for reads having multiple alignments. When no string is given, '*' is used to denote that the information is unknown, or '=' to denote that the reference sequence name is the same for the next read.

8. PNEXT - Position of the next alignment of an identical read in the template. This field is used for reads having multiple alignments. When no integer is given, '0' is used to denote that the information is unavailable.

9. TLEN - Signed observed template length. A value of '0' is used when the information is unavailable.

10. SEQ - Segment sequence. SEQ contains the read from the HTS which is aligned to the reference genome. A '*' denotes that the sequence is not stored, and a '=' denotes that the sequence is equal to the reference.

11. QUAL - Base quality. Gives a string of Phred-based scores calculated from the probability that a given base in SEQ is wrong. The string has the same length as the SEQ string. When no information is given, a '*' value is used.
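The sketch below decodes the FLAG bit field mentioned in field 2; the bit meanings follow the SAM v1 specification. Read r001 of Code Example 5.4 carries flag 163.

# Bit meanings as defined in the SAM v1 specification.
FLAG_BITS = {
    0x1:   "template has multiple segments",
    0x2:   "each segment properly aligned",
    0x4:   "segment unmapped",
    0x8:   "next segment unmapped",
    0x10:  "SEQ reverse complemented",
    0x20:  "SEQ of next segment reverse complemented",
    0x40:  "first segment in template",
    0x80:  "last segment in template",
    0x100: "secondary alignment",
    0x200: "not passing filters",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}

def decode_flag(flag: int) -> list:
    return [desc for bit, desc in FLAG_BITS.items() if flag & bit]

print(decode_flag(163))   # 163 = 0x1 + 0x2 + 0x20 + 0x80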

The optional fields follow the TAG:TYPE:VALUE structure. The TAG field is a unique two-character string which is the key referring to the information given. Lower case keys are reserved for end users, who can define the description of a given tag in the header section. The TYPE field is a single case-sensitive letter, defining the format of the information given in the VALUE field. The VALUE field contains the data, which can be either a single value or a vector of values, as defined by the TYPE field.
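A one-line illustration of splitting such an optional field (the value itself may contain colons, hence the bounded split):

def parse_optional(field: str) -> tuple:
    """Split a SAM optional field of the form TAG:TYPE:VALUE."""
    tag, typ, value = field.split(":", 2)   # VALUE may itself contain ':'
    return tag, typ, value

print(parse_optional("XX:B:S,12561"))   # ('XX', 'B', 'S,12561')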

5.2.4.1 CIGAR

CIGAR notation is a shorthand description of how two sequences are aligned. Commonly used values are 'M' for match, 'I' for insertion (to the reference) and 'D' for deletion (from the reference). Thus, the CIGAR string of the first entry of Code Example 5.4, '8M4I4M1D3M', is exemplified in Code Example 5.5. The asterisk is used for bases that are unknown; gaps introduced by the insertion and deletion are shown as dashes. A full list of all CIGAR characters is given in the official SAM documentation.

Code Example 5.5: Interpretation of CIGAR string

Entry: r001 163 ref 7 30 8M4I4M1D3M = 37 39 TTAGATAAAGAGGATACTG * XX:B:S,12561

POS: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 .. .. .. .. 15 16 17 18 19 20 21 22
REF: * * * * * * T T A G  A  T  A  A  -  -  -  -  G  A  T  A  *  C  T  G
SEQ:             T T A G  A  T  A  A  A  G  A  G  G  A  T  A  -  C  T  G
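The following Python sketch expands a CIGAR string into per-base operations, reproducing the interpretation shown in Code Example 5.5 (an illustration, not boinq's code):

import re

def expand_cigar(cigar: str) -> str:
    """Expand a CIGAR string such as '8M4I4M1D3M' into one letter per base."""
    ops = []
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        ops.extend(op * int(length))
    return "".join(ops)

print(expand_cigar("8M4I4M1D3M"))   # MMMMMMMMIIIIMMMMDMMM
# M consumes both the read and the reference, I only the read, and D only
# the reference, which yields the staggered alignment shown above.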

5.3 Data integration into The Semantic Web

Making RDF data around the web accessible to the end-user holds several obstacles, one of which is that RDF data is hardly used outside of The Semantic Web. For users to process their own data using external RDF data sources, the integration of their data into the structure of the RDF schema is the first obstacle. The correct integration of data brings many challenges, requiring a good understanding of the RDF data and knowledge of existing ontologies and structures. There are currently no public tools available that offer the integration of genomic data from common file formats. To make boinq an accessible and attractive tool, file converters for data integration have been implemented. Boinq integrates the conversion of GFF, GTF, BAM, SAM and BED files, a collection of formats that cover multiple levels of experimental data and are commonly used.

5.3.1 Overview

The implementation of the different file converters in boinq, with the usage and exploration of correct ontologies that build up to a fitting data structure, has been a constantly evolving subject with many iterations and changes. The implementation of genomic data coming from existing data formats has only been documented by the Genomic Feature and Variation Ontology (GFVO). GFVO is an open-source vocabulary created by Codamono as part of the BioInterchange project, a private and paid software tool made for the integration of GFF and VCF files into the RDF data structure. The GFVO vocabulary has been created as a stand-alone ontology library that supports the complete integration of data from the GFF and VCF file formats into an RDF schema, and contains the many terms needed to define the attributes generally found in these formats.

The creation of a custom vocabulary has advantages, such as the liberty to customize terms and relations at will; no work has to be invested in the study of existing ontologies. Conversely, the integration of multiple vocabularies into one data structure needs careful assessment to verify whether given terms are meant to be used in the environment one has created. To combine and retrieve terms from existing vocabularies, substantial effort has to be made in ensuring that they are used correctly, analyzing their definitions and the classes they belong to.

The main limitation of using GFVO for our purpose lies in the fact that it is a vocabulary designed to represent the data as structured by the specific file format. This differs from the design goal of the data integration by boinq: in contrast to BioInterchange, boinq does not implement conversion from the RDF framework back to the original data formats. One of the foundations of The Semantic Web is the elimination of the barriers existing between datasets due to the different formats data is implemented in. The different genomic data formats were created to divide and group specific information together; they have no function but to support a specific structure in which data is easily representable. Since data in this web is structured using the same triple construct, representing the format through custom ontology terms serves no purpose.

Data integration in boinq aims to follow a model that can work for any file format, using different vocabularies created by the community to cover the different categories of data retrieved from popular formats for the storage of genomic information. The search for suitable ontology terms has not always resulted in an answer, as the active community is still very limited. For lack of an alternative, the GFVO vocabulary was partially integrated for VCF.

Extending existing vocabularies through customization is not an option, as such custom terms would not be supported by the original creators. The only way to make sure created terms will be supported is to get them accepted into the next version of a publicly published and maintained vocabulary. As that process would take many months and revisions, no work has been invested in this.

Next to using RDF for the implementation and analysis of data, boinq also uses the RDF framework for the integration and retrieval of data used for its own functionality. The exchange of information between the server and the triplestore works through automated queries hard-coded into the boinq software. When data is converted into the boinq triplestore, metadata is generated and stored in a predefined graph. This metadata can therefore be accessed and queried by both the user and the software. Data stored in this graph includes:

• A list linking local regions with external ones. This is explained in Section 5.3.6.2.

• specifics about the different tracks (graphs) in which data can be found

• specifics about the source files data was converted from

• specifics about various feature types stored in each track/file.

• parsed/unparsed headers from converted data files.

A custom vocabulary was created to link the metadata stored in this graph. As this information is collected in an environment created to serve boinq, the use of a custom vocabulary is appropriate.

5.3.2 Basic data model

The basic information embedded in every data format consists of the description of a feature and the location at which it is positioned, as shown in Figure 5.2. Feature entities, like any object defined in a triple, can be given a custom IRI. The IRI is a unique pointer to the object and does not define its characteristics. Although IRIs do not have to contain any information, they have been chosen to be built from a logical set of strings. This helps with the identification of the different aspects of an object, and the IRI can even carry a certain amount of the information that is also defined explicitly through triples.

Figure 5.2: Basic structure of the model found in every data format. A feature of the genome is described and bound to a specific region

The model represented in Figure 5.2 has proven to fit the standards and design goals explained in the previous section. By converting different data formats into one model that represents features and locations as independent objects, linked by their relationships, it successfully removes the differences between these objects that stem from their source file characteristics. Data describing the features are defined with triples starting from the feature node. Data describing the elements of the location node, identified as a chromosomal region, are triples connected to that node.

Figure 5.2 shows an ideal model describing genome features and their locations on the chromosome. The implementation of data gave rise to several issues, creating the need to expand the given model. Problems arise when dealing with data linked to a feature that cannot be considered an attribute of that feature. An example is the allocation of an RGB color scheme to a feature in a BED file: the color is not an attribute of the feature, but rather a customization specifically used for the representation of the feature in the UCSC browser. Another issue arises with the appropriation of attributes linked to a feature for which no ontology term is found, common in more complex and newer formats such as the BAM/SAM format. The converter has not yet been expanded to translate all fields into processed information; an example is the FLAG bit values given in the SAM format.

To offer a solution to these problems, and to be able to implement the above mentioned information, the model was expanded with an extra layer, as shown in Figure 5.3. The entry node offers the solution by tying together all information that does not fit the previous model. Unlike the feature and location nodes, the entry node does not adopt community-made ontologies and is introduced as a repository to store unprocessed data, or data correlated to the entry of a file rather than to a genomic feature. The expansion also offers the capability to assign metadata about the entry. The introduction of this extra layer has made it possible to make a clear separation between the standard genomic mapping and the custom mapping. These parts are clearly separated through the use of different vocabularies.

Figure 5.3: The expanded model, in which an entry node holds information pertaining to the file entry rather than to the genomic feature itself.

5.3.3 Vocabularies

A list of all used vocabularies is displayed in Table 5.2. The Resource Description Framework and Resource Description Framework Schema vocabularies are used to describe the different objects. RDF/RDFS are the groundwork for every other vocabulary and are therefore always used. Terms used are:

rdf:type - A predicate to link an object to a class. By default, the identity of an object is defined this way.

rdf:value - Used to link a literal to an object. A literal can hold different kinds of data, such as strings, integers and booleans.

rdfs:label - Labels an object. The label is also a literal and can best be seen as the name an object carries.

rdfs:comment - Used to link comments to an object. The comment value is typically a string and can be about anything.

rdfs:description - Used to give a description of an object. Unlike the comment field, the description value is specifically used to describe the object.

Dublin Core terms (DCterms) and the Simple Knowledge Organization System (SKOS) are both introduced to extend the labeling of objects. Features can carry different labels or identifiers. An identifier is a unique pointer within the database the object is listed in. The differences between a primary label, alternative labels and an identifier are represented with the use of rdfs:label, skos:altLabel and dcterms:identifier, respectively.

The XML Schema Definitions (XSD) vocabulary is used to define the data type of literals. It is one of the main building blocks for creating semantic data. The characterization of literals through the use of the XSD vocabulary has been explained in Chapter 3.

An early misconception in the first iterations of the data structure was the misuse of certain ontology terms. A term, next to carrying a definition, is also bound to strict rules defining the circumstances of its usage. Specifically, terms can be used to define classes, datatype properties and object properties. Datatype properties are properties of an object expressed through a literal. Object properties are predicates that define links between two different objects. It is important to identify the type of predicate used. To exemplify, consider the term Score, which is defined as an object class. Figure 5.4 (A) shows the incorrect use of the Score term, linking the score value as a literal to the object. Since Score is a class, it should instead define the identity of an object that carries the value through rdf:value; this object is then linked to the feature using an object property, such as hasAttribute. The existence of a datatype property for score is in theory possible, yet the relations carried by predicates are kept as general as possible. Because of this, the number of predicates created in a vocabulary is usually only a small part of the total number of terms.

The direct attribution of data to an object, as displayed in model A, gives the advantage of keeping the model simple. It also offers a minimal use of triples to define a relationship: model B uses three triples where model A uses one, so a database adopting model B would use considerably more storage than a database adopting model A. Even though the extensive model displayed by model B seems unnecessary, it is common practice for data to be allocated this way. It offers the advantage of being able to give specifications about the attribute objects or to expand them in any other way. With the association of extra information to these objects, better filtering criteria can furthermore be constructed.
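A minimal Python sketch of model B, using the rdflib library, is given below. The Score class, the hasAttribute predicate and the namespace are illustrative placeholders, not terms from the actual vocabularies of Table 5.2.

from rdflib import BNode, Graph, Literal, Namespace, RDF

EX = Namespace("http://www.example.org/vocabulary#")   # placeholder namespace

g = Graph()
feature = EX.feature1
score = BNode()                            # the intermediate attribute object
g.add((feature, EX.hasAttribute, score))   # object property: node to node
g.add((score, RDF.type, EX.Score))         # the class defines the node's identity
g.add((score, RDF.value, Literal(960)))    # datatype property: the literal value

print(g.serialize(format="turtle"))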

Other vocabularies used are the Feature Annotation Location Description Ontology (FALDO), a comprehensive collection of terms used to define the regions bound to genome features, for which it is subsequently used; the GFVO vocabulary, as previously discussed, which offers the possibility to integrate complex elements from the VCF format into the model; and the Sequence Ontology (SO), a collection of terms defining the sequence features, attributes and relationships found in genome annotation. Lastly, there is the Format Ontology, constructed specifically for boinq, made to support the addition of the unprocessed data and custom labeling found in each entry.

Figure 5.4: Differences between the wrong (A) and correct (B) use of a given ontology term to specify data. Model A uses the Score term as a predicate to link a literal to an object; model B correctly uses the Score term to specify the identity of an object through the rdf:type predicate, giving that object the value of the score through rdf:value. IRI objects are represented as circles; literals as squares.

Table 5.2: Vocabularies used to create the data structure in which discussed file formats are converted to.

Used vocabularies

Vocabulary                                          Namespace prefix
Resource Description Framework                      rdf
Resource Description Framework Schema               rdfs
Dublin Core terms                                   dcterms
Simple Knowledge Organization System                skos
XML Schema Definitions                              xsd
Feature Annotation Location Description Ontology    faldo
Genomic Feature and Variation Ontology              gfvo
Sequence Ontology                                   obo
Format Ontology                                     format

5.3.4 Data models

5.3.4.1 Location

The first model created is the data structure representing information about the location of a given feature. The structure of data determining the regional attributes of a feature is identical for all data formats. The FALDO vocabulary was used to implement the information into an RDF frame- work, as it is a format that has been adapted by external databases, such as Ensembl and Uniprot. Figure 5.5 gives a representation of the final model. faldo:reference points to a literal defining the chromosome or contig on which the region is located. faldo:begin and faldo:end are both object properties connecting the objects, specifying the begin position and the end position, respectively. faldo:position links the integer value of the position to an object, as found in the file. The 1-based coordinate system is used for all data. Since BAM and BED both use the 0-based coordinate system, adjustments are made to their start positions as the data is implemented. The objects pointed to by faldo:begin and faldo:end are both objects of the type faldo:ExactPosition, meaning that, as the name suggests, the position given through faldo:position is the exact position. Given features can be allocated at either the forward- or reverse strand, annotated by faldo:ForwardStrandPosition and faldo:ReverseStrandPosition, respectively. If no information about the strand is present, neither 5.3. DATA INTEGRATION INTO THE SEMANTIC WEB 41

Figure 5.5: The schema used to describe the implementation of the data annotating the region of a feature into the RDF framework. Namespace prefixes are defined in the bottom right corner. IRI objects are represented as circles; literals as squares. Green circles represent the types of the objects they are connected to.

The IRIs of the location, begin and end entities all carry the information formally expressed through triples. An important matter when creating thousands or even millions of IRIs is to make sure that no entities that are meant to be separate are merged in a triplestore because they acquired an identical IRI. Since the data is represented through triples, failing to create a unique IRI merges the incoming and outgoing links of both data entities under the same subject IRI. On the other hand, it is sometimes useful or necessary to merge identical mappings, such as a specific region, under one data object. For example, it is possible for two features to be mapped on the same region of an identical chromosome or contig. A specific example is the mapping of millions of experimental reads to a reference genome; another case is when two or more genes carry different labels but in fact represent an identical gene. By building up the IRI for a specific region from the information that specifies exactly that region, the features of the previous examples are automatically linked to the same object.

IRIs used for the location, begin and end node (as represented in Figure 5.5) are built from the species, assembly, contig/chromosome, begin and end position, and strand. The strand value is not always specified in the data format, in which case the tail of the IRI is omitted. The species of the genomic data can be communicated through the client, which offers the input of several variables before conversion. If no value is given, the field is set to "Unknown". The specific configuration of the nodes is as follows:

Location : http://www.boinq.org/resource/species/assembly/contig:begin-end:strand Begin : http://www.boinq.org/resource/species/assembly/contig:begin:strand End : http://www.boinq.org/resource/species/assembly/contig:end:strand

The implementation of both the species and assembly in the IRI is necessary to distinguish between different species or multiple samples of a specific species. This information is in many cases found in the header of the flat format file. Because the necessary information is not always present, the user is prompted to input both the species and assembly of the data before the data can be processed. Code Example 5.6 gives the generated triples implementing the first line of Code Example 5.2 into the model given in Figure 5.5.

Code Example 5.6: The implementation of the first entry of Code Example 5.2 into the FALDO schema.

# The IRIs below are reconstructions: the species, assembly and strand segments
# are placeholders following the patterns given above, and the type of the
# location node is reconstructed as faldo:Region.
@prefix faldo: <http://biohackathon.org/resource/faldo#> .

<http://www.boinq.org/resource/species/assembly/ctg123:1050-9000:+>
        a faldo:Region ;
        faldo:reference <http://www.boinq.org/resource/species/assembly/ctg123> ;
        faldo:begin <http://www.boinq.org/resource/species/assembly/ctg123:1050:+> ;
        faldo:end <http://www.boinq.org/resource/species/assembly/ctg123:9000:+> .

<http://www.boinq.org/resource/species/assembly/ctg123:1050:+>
        a faldo:ExactPosition , faldo:ForwardStrandPosition ;
        faldo:position 1050 .

<http://www.boinq.org/resource/species/assembly/ctg123:9000:+>
        a faldo:ExactPosition , faldo:ForwardStrandPosition ;
        faldo:position 9000 .

5.3.4.2 BED

Figure 5.6: The schema used to describe the implementation of BED data into the RDF framework. Namespace prefixes are defined in the bottom right corner. IRI objects are represented as circles; literals as squares. Green nodes define the types of the entities they are connected to.

The BED file format is relatively simple to implement, and no complex data parsing is required to extract the relevant information. Data represented in BED is mainly limited to the mapping of a feature and its sub-features to a genomic region. Additional information can be given through the optional fields [score] and [itemRgb]. The RGB value attribute is specific to the file format and has no direct connection to the entity of a feature, while [score] is often used for experimentally obtained score values. For this reason, the RGB and score values are mapped to the entry and feature objects, respectively. No ontologies exist to define RGB colors, so these have been created. BED files do not contain information about the identity of the features or sub-features contained in the data file. To obtain this information, the user is prompted before the conversion to assign types to the feature and sub-features of the file. A representation of the data model is given in Figure 5.6.

• Three values ranging from 0 to 255 are stored in [itemRgb]. The data, connected to an object of the type format:RGBvalue, is represented as a string featuring the array of integers.

• Values implemented in the FALDO schema discussed in Section 5.3.4.1 are retrieved from [chrom], [chromStart], [chromEnd] and [strand].

• The variable extracted from [name] is linked to the feature through rdfs:label. The value is not a unique identifier, and dcterms:identifier is therefore not used.

• The score of a feature, retrieved from [score], is considered an attribute of the feature. Objects specifying attributes of a different object are linked using obo:so-xp.obo#has_quality. obo:SO_0001685 is used as the object type, as it refers to an experimentally obtained score. The value of an attribute, and thus the data in the [score] field, is linked to the object using rdf:value.

• Information stored in [thickStart] and [thickEnd] is not implemented, for it is believed these fields store no relevant information.

• The values from [blockCount], [blockSizes] and [blockStarts] are used to determine the sub-features of an entry. Features and sub-features have a two-sided link expressed through obo:so-xp.obo#part_of and obo:so-xp.obo#has_integral_part (see the sketch below). As these entities have different genomic regions, location mappings are created for each object.
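As a rough sketch of the resulting triples for one converted BED entry (all IRIs, the label and the score value are illustrative; the obo prefix expansion is assumed):

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix obo:  <http://purl.obolibrary.org/obo/> .   # assumed expansion

# Feature with its label, score attribute and one sub-feature.
<http://www.boinq.org/resource/feature#1>
        rdfs:label "RP11-34P13.3" ;
        <http://purl.obolibrary.org/obo/so-xp.obo#has_quality>
        <http://www.boinq.org/resource/feature#1/attribute_1> ;
        <http://purl.obolibrary.org/obo/so-xp.obo#has_integral_part>
        <http://www.boinq.org/resource/feature#2> .

<http://www.boinq.org/resource/feature#1/attribute_1>
        a obo:SO_0001685 ;          # experimentally determined score
        rdf:value 0 .

# Sub-feature derived from the block fields, with the two-sided link.
<http://www.boinq.org/resource/feature#2>
        <http://purl.obolibrary.org/obo/so-xp.obo#part_of>
        <http://www.boinq.org/resource/feature#1> .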

5.3.4.3 GFF

Figure 5.7: The schema used to describe the implementation of GFF data into the RDF framework. Namespace prefixes are defined in the bottom right corner. IRI objects are represented as circles; literals as squares. Green nodes define the types of the entities they are connected to; orange nodes represent objects linked over multiple entries.

The GFF schema has more elements than the BED schema, and more complex parsing code has been implemented to convert all relations. Unlike the BED format, relations between features are linked over different entries. These are defined in the attributes field. The value from [Parent] points to an ID given in another entry. The converter processes the file line by line, and IDs linked

to feature nodes are mapped throughout the process. [Parent] links to an ID value from a previous entry. Nodes displayed in orange are connected over objects from different entries. The [Target] attribute is used in the specific case where the entry specifies a 'match' entity. It contains the ID of a genomic entity, followed by a start, end and strand value. [Gap] can be used when the alignment is not perfect, and contains a CIGAR string to specify this alignment. Both the [Target] and [Gap] fields have been considered too complex to parse, and their explicit string values are therefore linked to the entry node. A representation of the data model is given in Figure 5.7.

• The values stored in [Target], [Gap] and [Source] are stored as attributes of the entry node.

• Values implemented in the FALDO schema discussed in Section 5.3.4.1 are retrieved from [seqname], [start], [end] and [strand].

• The feature type is given by [feature]. A complete set of supported features is given in Table B.2.

• Connections between features are annotated by [Parent] or [Derives_from], and expressed through obo:so-xp.obo#part_of and obo:so-xp.obo#has_integral_part (see the sketch below).

• Item attributes are [score], [phase] and [Is_circular], denoted by obo:SO_0001685, obo:SO_0000717 and obo:SO_0000988/obo:SO_0000987, respectively.

• A variety of fields are directly linked to the feature node: [Note] using rdfs:comment, [Name] using rdfs:label, [Alias] using skos:altLabel, [Dbxref] using rdfs:seeAlso, and [ID] using dcterms:identifier.
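A hedged sketch of this cross-entry linking, using the gene/mRNA pair from the canonical GFF3 example (all IRIs are illustrative; the obo prefix expansion is assumed):

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# Entry n: a gene with ID=gene00001 and Name=EDEN.
<http://www.boinq.org/resource/feature#1>
        dcterms:identifier "gene00001" ;
        rdfs:label "EDEN" ;
        <http://purl.obolibrary.org/obo/so-xp.obo#has_integral_part>
        <http://www.boinq.org/resource/feature#2> .

# Entry n+1: an mRNA with Parent=gene00001; the converter resolves the ID
# against the map of IDs it built while progressing through the file.
<http://www.boinq.org/resource/feature#2>
        dcterms:identifier "mRNA00001" ;
        <http://purl.obolibrary.org/obo/so-xp.obo#part_of>
        <http://www.boinq.org/resource/feature#1> .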

5.3.4.4 VCF

The VCF schema is the most extensive; it is built upon the GFVO vocabulary. As no other ontologies are available for the many data entities a VCF file consists of, the choice was simple. As discussed in Section 5.3.1, the structure and usage of the GFVO vocabulary have some disadvantages and design choices that differ from our own. GFVO is an almost completely independent set of ontologies, meaning that, except for some broadly used terms, such as rdf:type, and a partial integration of FALDO, custom terms were created and preferred over existing ones. Some examples are:

gfvo:value instead of rdf:value
gfvo:Label instead of rdfs:label
gfvo:Comment instead of rdfs:comment
gfvo:Identifier instead of dcterms:identifier
gfvo:hasAttribute instead of obo:so-xp.obo#has_quality

gfvo:value is furthermore the term of choice when attaching a literal value to an object. This means that terms such as gfvo:Label and gfvo:Identifier are not used as predicates. They are instead used to annotate the types of objects, which are pointed to with gfvo:hasAttribute and gfvo:hasIdentifier, respectively.

This design choice is not a random one and deserves consideration. The designation of data related to an object can be implemented in different ways in the RDF framework. Data affiliated to an object has been implemented both directly, e.g. using rdfs:label, and indirectly, e.g. through the creation of attribute objects carrying a value. gfvo:value is a neutral term that carries no meaning except for the fact that it points towards a literal. The relationship of that value to the central feature is always defined by the type of the object it is bound to. The GFVO vocabulary is thus designed to be consistent on these levels of data appropriation. Terms such as label, identifier or comment define the classes of the objects to which a literal is bound using gfvo:value.

To implement the VCF schema, some conflicts had to be solved first. Replacing some of the GFVO terms with the ones used in previous schemas would bring overall consistency. Yet, it is not desirable to implement only part of the GFVO structure, as it was not designed to be used with other vocabularies. Thus, the decision was made to integrate the data both ways. Specifically, data elements supported by the general schema adopted by each format are present twice, following both the GFVO structure and the general structure. An exception to this rule is gfvo:Label: its usage creates three triples where rdfs:label needs one, which would increase the overall amount of data (i.e., the number of created triples) by more than two-fold. With the use of rdfs:label, rdf:value is also used instead of gfvo:value. It was also found that combining these two structural differences would result in an overcomplicated data structure.

Figure 5.8: The schema used to describe the implementation of VCF data into the RDF framework. Namespace prefixes are defined in the bottom right corner. IRI objects are represented as circles; literals as squares. Green nodes define the types of the entities they are connected to; orange nodes represent objects linked over multiple entries.

Two unique sets of entities are created before and during the conversion. The first set comprises the filters: quality tests listed in the [FILTER] field when the evidence for a variant call falls below a set threshold. Filter properties are usually defined in the header, which is parsed and loaded before the body is converted. The converter is then able to create new Filter entities as they are encountered in the entry field. Samples, for example 'NA0001' in Code Example 5.3, are created in the same way. A representation of the data model is given in Figure 5.8. Filter and Sample objects are displayed as orange circles.

• Values implemented in the FALDO schema discussed in Section 5.3.4.1 are retrieved from [CHROM] and [POS].

• dcterms:identifier is used to point towards the dbSNP identifier value stored in [ID]. The object is assigned gfvo:Identifier as its type.

• The feature types are not given by any field, but can be deduced from the data listed in the entry. A complete set of supported features is given in Table B.1.

• Filters and samples are identified using the types gfvo:VariantCalling and gfvo:BiologicalEntity, respectively. Connections to the feature are made using gfvo:isRefutedBy and gfvo:hasSource. [FILTER] and [SAMPLE] are integrated in the labels of these objects.

• Data stored in [REF], [ALT] and [QUAL] are qualified as attributes and thus linked as objects to the feature using gfvo:hasAttribute. The types are gfvo:ReferenceSequence, gfvo:SequenceVariant and gfvo:PhredScore, respectively. The Phred score value is bound as an attribute to the reference sequence attribute object.

• The varying keys used by [INFO] are also implemented using gfvo:hasAttribute. A full list of the object types found in [INFO] is listed in Table B.3. The field values are stored on these objects using rdf:value, while the keys used by the INFO field are stored using rdfs:label. To increase flexibility, data retrieved from this field that is assigned to an unknown key is still stored on an object without a type. This still allows the user to query the data using the label value of the key linked to that object. One exception to this rule is the implementation of gfvo:ExternalReference. These objects have no stored value, as the database for which an external reference exists is given through the label of the key itself.

Figure 5.9: The schema used to describe the implementation of the FORMAT fields of the VCF schema. Namespace prefixes are defined in the bottom right corner. IRI objects are represented as circles; literals as squares. Green nodes define the types of the entities they are connected to.

• The design of the schema is largely based upon the construction of the vocabulary itself. This is notable when implementing the [FORMAT] fields into the schema. A correct use of the vocabulary was shown on the official site of BioInterchange; however, it had been taken down when visited in April 2016. A central node, linked to the feature node with gfvo:hasEvidence, is connected to both the sample and the genotype attributes. The genotype node furthermore has a complex design, of which the schema can be found in Figure 5.9. The genotype node has two elements, the first and last part, which are linked with gfvo:hasFirstPart and gfvo:hasLastPart, respectively. The value and type of these objects are obtained from the values linked to the GT key, retrieved from [FORMAT]. The same data is used for the determination of homozygosity or heterozygosity of the genotype, which is annotated using gfvo:hasQuality. The Phred scores from the determination of the haplotypes, linked to the HQ key, are granted to the haplotype object using gfvo:PhredScore. Only two other attributes of [FORMAT] have been implemented, due to the limitations of a third-party parser. These attributes are the conditional genotype quality, linked to the GQ key, and the coverage of the genotype, using the key DP.

The object types are gfvo:ConditionalGenotypeQuality and gfvo:Coverage, respectively. A sketch of the resulting evidence and genotype structure is given below.
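In this sketch, the gfvo prefix expansion is an assumption, the IRIs follow the sub-component scheme of Table 5.7, and a GT value such as "0|1" is taken as the example being split.

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix gfvo: <http://www.biointerchange.org/gfvo#> .   # assumed expansion

<http://www.boinq.org/resource/feature#1>
        gfvo:hasEvidence <http://www.boinq.org/resource/feature#1/evidence_1> .

# Central evidence node, connected to the sample and the genotype attribute.
<http://www.boinq.org/resource/feature#1/evidence_1>
        gfvo:hasSource <http://www.boinq.org/resource/sample#1> ;
        gfvo:hasAttribute <http://www.boinq.org/resource/feature#1/evidence_1/genotype> .

# The genotype split into its first and last part (the two GT alleles).
<http://www.boinq.org/resource/feature#1/evidence_1/genotype>
        gfvo:hasFirstPart <http://www.boinq.org/resource/feature#1/evidence_1/genotype/first_part> ;
        gfvo:hasLastPart  <http://www.boinq.org/resource/feature#1/evidence_1/genotype/last_part> .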

5.3.4.5 SAM

The SAM schema was the last to be created. The SAM format has a large number of data entities for which no ontologies are available, and the integration of the data into a valid structure has only been partially realized. Following the same reasoning as in previous examples, fields containing unparsable data have been linked to the entry node. Values have been implemented as they are extracted from the source, and the objects are identified according to the labels of the fields they are extracted from. The SAM format contains data implemented in complex formats. Both the limitations of available packages for data parsing and the complexity of data integration and processing are the reason that the current model is incomplete. Although the implementation offers integration of the most general data features, further adjustments might help to transfer all data coupled to the entry node to data coupled to the feature node.

Figure 5.10: The schema used to describe the implementation of SAM data into the RDF framework. Namespace prefixes are defined in the bottom right corner. IRI objects are represented as circles; literals as squares. Green nodes define the types of the entities they are connected to; orange nodes represent objects linked over multiple entries.

Every entry features an alignment. This alignment is created from statistical probabilities when evaluating the position of a read or query on a reference sequence. Since one read does not necessarily correspond to one alignment, an independent Read object is implemented into the schema. Unlike the Filter and Sample objects from the VCF schema, a mapping of the Read objects is continuously updated during the conversion. Because a SAM file can contain millions of alignments, a mapping is only kept of those reads for which [RNEXT] or [PNEXT] contain a value. An alignment furthermore consists of one or several blocks of sequences that are an exact match to the reference sequence. The locations of the matched sequences are extracted from the CIGAR string that is given in every alignment. Both the locations of the Feature and Match objects are based upon the coordinate system of the reference sequence. A representation of the data model is given in Figure 5.10.

• Unprocessed data, bound as attributes to the Entry node, comprise [FLAG], [RNEXT], [PNEXT] and [QUAL]. The ontology terms bound to these objects are equal to their respective field names.

• Values implemented in the FALDO schema discussed in Section 5.3.4.1 are retrieved from [RNAME], [POS], [TLEN] and [CIGAR]. These give the locations of the Feature and Match objects.

• Labels given to the reads, stored in [QNAME], are used to label the Read objects.

• The Phred quality score of the alignment and its nucleotide sequence, retrieved from [MAPQ] and [SEQ], are both stored as attributes with the types obo:SO_0001686 and obo:SO_0001683, respectively. The shorter nucleotide sequences that are matched are also linked as attributes of the Match object.

• The quality values of each base in an alignment sequence, retrieved from [QUAL], could in principle be integrated. This would be done by creating a set of objects representing individual nucleotides for every nucleotide sequence, to which a score attribute can subsequently be added. Since this would increase the number of triples by six for every base, a literal (string) containing these values was added to the respective objects instead, using rdfs:comment.

• The extreme customization and variability of the [TAG] field, combined with limitations of the parser, have prevented a complete integration of the [TAG] values into the model. Thus, [TAG] is bound to the entry node as a string (see the sketch below).
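The split between unparsed entry attributes and typed feature attributes might look as follows; the linking predicate, the format: class names and all IRIs and values are assumptions based on the description above.

@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix obo:    <http://purl.obolibrary.org/obo/> .   # assumed expansion
@prefix format: <http://www.boinq.org/format#> .      # assumed expansion

# Unparsed field bound to the entry node, labeled after its field name.
<http://www.boinq.org/resource/entry#1>
        <http://purl.obolibrary.org/obo/so-xp.obo#has_quality>
        <http://www.boinq.org/resource/entry#1/attribute_1> .
<http://www.boinq.org/resource/entry#1/attribute_1>
        a format:FLAG ;
        rdf:value "163" .

# Mapping quality from [MAPQ], bound to the feature node as a typed attribute.
<http://www.boinq.org/resource/feature#1>
        <http://purl.obolibrary.org/obo/so-xp.obo#has_quality>
        <http://www.boinq.org/resource/feature#1/attribute_1> .
<http://www.boinq.org/resource/feature#1/attribute_1>
        a obo:SO_0001686 ;
        rdf:value 60 .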

5.3.5 Metadata

Metadata created during conversion is implemented into the RDF framework. To keep the triples extracted from genomic data apart from the created metadata, separate graphs are used. The collection of metadata stored in boinq is furthermore separated according to the dataset in which the data is generated. Metadata created in different tracks within the same dataset is registered in the same graph. Data implemented into the RDF framework can be partitioned into three parts:

• data created with the conversion of the header of a file

• data created with the conversion of the body of a file

• data created after conversion of a file

Both the header information and the data created after conversion are treated as metadata, and thus stored as triples in the same graph. After integration of the data, a variety of properties and details are saved about the original data, the conversion operation and the newly created data. Triples containing metadata are linked to an object representing the uploaded file, which acts as the central node. The graph in which the metadata is stored holds the collection of all metadata in a dataset. In practice, information about the dataset, the tracks it contains, and other items are all collected in the same environment. More information about which data can be retrieved has been given in Chapter 3.

The structure of the metadata is determined through the creation of the custom 'track' vocabulary. This vocabulary has been built because no public vocabulary was considered adequate. The 'track' vocabulary creates a data structure that fits the functionality of boinq and enables easy communication of data. Future development of boinq will enable different boinq instances to communicate with each other. The querying of metadata from remote servers will be a core part of a properly working network that is able to share and query data over all connected boinq instances.

The information stored in the header is kept as metadata of the file. The conversion of the BED, GFF, VCF and SAM headers is currently implemented in a simple way, where every header entry is stored as a string. A complete parsing of each object has not been achieved, largely due to the complexity of some headers and the limitations of the parser. To ensure that no information is lost after conversion, the complete and unprocessed header is copied. This should enable retrieval of the data if necessary. Relevant information found in the header can be the identification of the species and reference assembly the data is linked to. This information has to be given by the user before the conversion starts. This is the safest approach, as the presence of this information in the header is not obligatory, although it is needed for the creation of reference IRIs. A collection of other data is stored, as shown in Table 5.4; a sketch of such metadata triples follows the table. These ontologies are class identifiers for objects bound to the file node using track:hasAttribute. The File object is defined using track:File and is linked from the Dataset object using track:holds. The specific use of the storage of count values such as track:EntryCount is further explained in Section 5.3.6.

Table 5.4: The different ontologies used for the creation of metadata linked with the conversion of a data file.

Custom metadata ontologies with their definitions

Ontology IRI           Definition
track:HeaderBED        A header from a BED file.
track:HeaderGFF        A header from a GFF file.
track:HeaderVCF        A header from a VCF file.
track:HeaderSAM        A header from a SAM file.
track:ConversionDate   A string containing the date, time and timezone indicating
                       the start of the conversion, e.g. "Sun Mar 13 17:21:18 CET 2016".
track:FileName         The file name of a file. This includes the file extension.
track:FileExtension    The file extension of a file.
track:User             The login identifier of the account under which the file
                       was uploaded.
track:EntryCount       The total amount of Entry objects generated during conversion.
track:FeatureCount     The total amount of Feature objects generated during conversion.
track:TripleCount      The total amount of triples generated during conversion.
track:FilterCount      The total amount of Filter objects generated during conversion (VCF).
track:SampleCount      The total amount of Sample objects generated during conversion (VCF).
track:ReadCount        The total amount of Read objects generated during conversion (SAM).
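As an illustration of how these terms combine, a minimal sketch is given below; the dataset and file IRIs, the count value and the track prefix expansion are assumptions. This is also the structure addressed by the query in Code Example 5.7.

@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix track: <http://www.boinq.org/track#> .   # assumed expansion

<http://www.boinq.org/resource/dataset#1>
        track:holds <http://www.boinq.org/resource/file#1> .

<http://www.boinq.org/resource/file#1>
        a track:File ;
        track:hasAttribute <http://www.boinq.org/resource/file#1/attribute_1> ,
                           <http://www.boinq.org/resource/file#1/attribute_2> .

<http://www.boinq.org/resource/file#1/attribute_1>
        a track:FeatureCount ;
        rdf:value 2041 .                           # illustrative count

<http://www.boinq.org/resource/file#1/attribute_2>
        a track:ConversionDate ;
        rdf:value "Sun Mar 13 17:21:18 CET 2016" .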

5.3.6 Practical implementation

5.3.6.1 IRIs

The creation of many triples and objects (blue and orange nodes) raises the question of IRI naming. The specifications of the naming of objects were already partially discussed when describing the FALDO schema, but have not yet been handled for the other created objects. An important matter presents itself when deciding how certain objects should be named. An infinite number of options is possible, with no prescribed conventions such as the use of a specific length or set of characters. In general, it is possible to construct IRIs in a way that offers certain advantages. An example of this is the way location IRIs have been constructed. Encoding information about the object in an IRI can offer the following advantages, of which some have already been discussed in Section 5.3.4.1:

• human interpretation about the identity/type of the object

• human interpretation about the information an object holds, including the information held by objects that are defined upstream or downstream

• human interpretation about the level an object is identified on

• a better view on the data structure when browsing through IRIs

• the prevention of creating duplicate objects for entities that are in essence the same

In practice, unlike the construction of controlled databases such as Ensembl, where the handles of the features are built from the identifier of the object (e.g. http://rdf.ebi.ac.uk/resource/ensembl/ENSG00000139618), it was not feasible to name objects according to their labels or identifiers. Features contained in the processed data formats carry in many cases a label rather than an identifier. Labels can be customized to the user's wishes and are not guaranteed to be unique. To prevent the merging of unrelated objects, the adopted method has to produce unique IRIs for every new entry. Two options are available: one is to create an IRI through the use of a randomized string generator, the other is to keep a count of every type of object to prevent duplicates.

The usage of a randomized string generator holds a few disadvantages. Unique randomized strings hold no information whatsoever, and identification of properties of the object is thus not possible by observing the IRI. These strings are furthermore very long. Using a simple integer to keep objects apart has therefore been the chosen method for the naming of IRIs. Although more complicated to implement, it offers all the above-mentioned advantages, with the exception of the ability to identify properties of the object itself. It is not feasible to keep track of every specific type of object as identified in the schemas. Instead, the broader identity of objects is tracked, such as the numbering of Entry and Feature objects, following the naming of the blue nodes in the different schemas. Table 5.5 features the base IRIs used for these objects. Other objects, such as Attribute and Evidence nodes, are always tied to a central object. For this reason, their IRI is defined as an extension of the object they are tied to. An overview of all these extra components is given in Table 5.7. No extra hashtag is used, as the use of multiple hashtags in an IRI causes conflicts when using namespace prefixes; specifically, the combination of a namespace prefix and a hashtag gives an error, e.g. prefix:1/atr#1. Counts of components need only be registered if multiple instances of that component can exist on that specific node. This can only happen for attributes and samples (evidence). Other components, referring to the genotype, haplotype and the alleles, do not need a count integer.

Table 5.5: Base IRIs used with the creation of object IRIs during conversion, counts are placed behind the hashtag.

Base components in the creation of IRIs

Object type   Base IRI
Entry         http://www.boinq.org/resource/entry#count
Feature       http://www.boinq.org/resource/feature#count
Sample        http://www.boinq.org/resource/sample#count
Filter        http://www.boinq.org/resource/filter#count
Read          http://www.boinq.org/resource/read#count

Table 5.7: Components to construct IRIs during conversion, the namespace ’feature’ is used, replacing ’http://www.boinq.org/resource/feature#’. The examples are given for the conversion of a VCF file.

Sub-components used for the creation of IRIs

Object type   IRI component      Example
Attribute     /attribute_count   feature:1/attribute_1
Evidence      /evidence_count    feature:1/evidence_1
Genotype      /genotype          feature:1/evidence_1/genotype
Haplotype     /haplotype         feature:1/evidence_1/haplotype
First Part    /first_part        feature:1/evidence_1/genotype/first_part
Last Part     /last_part         feature:1/evidence_1/genotype/last_part

Different data files can be uploaded to the same track, and a correct count of the varying objects has to be kept. This information is stored in the metadata section, as explained in Section 5.3.5. The total count of every object is furthermore saved in the local database of the server. After the metadata of a conversion has been added to the triplestore, it is queried and stored by the server. Code Example 5.7 displays the generic query used by the server to retrieve an array of counts. The sum of all elements of the array is taken, resulting in the total count.

Code Example 5.7: Query used by the server to retrieve the total amount of created features in a track.

# The prefix IRIs and the dataset IRI were lost in extraction; the values
# below are illustrative reconstructions.
PREFIX track: <http://www.boinq.org/track#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?featureCount
WHERE {
    <http://www.boinq.org/resource/dataset#1> track:holds ?fileNode .
    ?fileNode track:hasAttribute ?attribute .
    ?attribute a track:FeatureCount ;
        rdf:value ?featureCount .
}

5.3.6.2 Mapping

The reference mapping of the genomic data is bound to the location node using the faldo:reference predicate, as discussed in Section 5.3.4.1. This IRI is a unique identifier which expresses the species, assembly and chromosome. The Genome Reference Consortium (GRC) is the organization responsible for constructing assemblies for every organism commonly used for data annotation. Version naming is straightforward, with version numbers reflecting both the major and minor (patch) version of the release. The latest human reference assembly is named 'GRCh38.p7', being the 38th version, on which 7 minor updates have been performed.

Although the same reference assemblies are used by all the major databases, no unique IRIs have been adopted for an RDF environment. Indeed, every database has its own way to denote the chromosome, reference assembly and species the data is mapped on. Specifying the correct reference IRI is an essential part of using federated queries based upon locations; failing to do so yields incorrect results.

To connect the data with external databases, links have been defined in the metadata section of the triplestore. In practice, a central and custom IRI is used that is specific to the boinq environment. Since the creation of reference IRIs by the boinq converter is deterministic, it is possible to define these relationships before any data is uploaded. Before the conversion of a custom file, the user is given the option to select the species and assembly from a predefined list. Only organisms and assemblies that have been mapped in the metadata can be used as input. From this data, the IRI is formed. For example, if the user maps his data on the GRCh38 assembly of Homo sapiens, the reference IRI created when reading the X chromosome from an entry is built from exactly those values, following the species/assembly/contig pattern described in Section 5.3.4.1. As the strings of these IRIs are known, links with parallel references in external databases can be created.
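A sketch of such a link is given below. Both IRIs and the choice of owl:sameAs as linking predicate are purely illustrative; the external IRI stands in for whichever identifier the remote database actually uses for the same chromosome.

@prefix owl: <http://www.w3.org/2002/07/owl#> .

# boinq reference IRI (species/assembly/contig) linked to a hypothetical
# parallel reference in an external database.
<http://www.boinq.org/resource/Homo_sapiens/GRCh38/X>
        owl:sameAs <http://rdf.ebi.ac.uk/resource/ensembl/homo_sapiens/GRCh38/X> .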

5.3.6.3 Parsers

The transformation of data from flat files to triples happens in different stages. The files are first uploaded to the server, after which they are fed to a parser line by line. The parsers are appointed according to the file extension and are different for each format. The parsers then return each entry as an object containing all information. Extraction of relevant data from this object is possible with the use of predefined functions that are part of the package the parser is distributed with.

It was not necessary to create parsers ourselves; they have been retrieved from existing libraries and subsequently updated and customized to fit our needs. Most of these tools are still under active development and are not feature-complete.

HTSJDK The HTSJDK Java library is constructed and maintained as part of Samtools [59]. Samtools is an umbrella organization encompassing several projects on a variety of tools designed for the creation and manipulation of next generation sequencing data. The HTSJDK repository is very active, with over 60 contributors to the code base. From September 2015 to April 2016, eight releases have been distributed, ranging from version 1.139 to 2.2.1. The HTSJDK package contains a variety of parsers, including the ones used for the conversion of BED, VCF and SAM/BAM files. Due to the simplicity of the BED format, work on that parser is finished.

Most work on the library has focused on the extension and completion of functionality surrounding the SAM and VCF formats. The incomplete parts of the VCF parser limit the functionality of the converter, as the parser is only able to extract a limited number of attributes linked to the [FORMAT] field. The SAM parser is the least complete of the three and has undergone the most apparent changes over the last year. Limitations are mostly centered on the parsing of the attributes or [TAG] field.

Jannovar - The parser used for GFF files has been retrieved from the Jannovar library. Jannovar is a project used for the annotation of VCF files in the analysis of disease-gene discoveries [35]. It identifies all transcripts affected by base variants stored in VCF files. Jannovar is created to be used both as an application and as a library. The only limitation of the parser is the inflexible way feature types are recognized and stored. Instead of extracting the feature type string from an entry into a central object, a limited and hard-coded list of feature types is predefined. Supported features are listed in Table B.2. To offer a solution for boinq, the list has been extended and the extension pushed to be included in the next release.

5.3.6.4 Client

Before data can be converted, it is uploaded by the user through the client. The client has been scripted using AngularJS. Boinq has been designed to integrate functionalities of The Semantic Web.

Figure C.1 gives a representation of the different variables requested before conversion. The options shown are selected for the conversion of Code Example 5.1. As explained in Section 5.3.6.2, the input of the species and assembly is restricted to a predefined list. This data is stored in the metadata section of the triplestore and is obtained through a query on that section. Assemblies can only be chosen once the species has been defined. The contig prefix field is a variable implemented to handle prefixes used in the contig field. It is important that no prefixes, e.g. chr or Chrom, are present in the creation of the reference IRI, as boinq would otherwise not be able to retrieve the reference IRIs used by other databases. The species and assembly fields are obligatory; the contig prefix is optional.

The selection of the feature types is another example of the close interaction between boinq and the triplestore. Figure C.2 shows the interface created for the selection of a feature type. The Sequence Ontology vocabulary contains all of the possible options. Through SPARQL, an interactive list has been realized by querying the vocabulary which is uploaded to the triplestore when boinq is started. A search bar is implemented as the list contains several hundred possible options.

The properties of every track can be reviewed through the client. The information is retrieved by querying the triplestore, as outlined in Section 5.3.5. Figure C.3 shows the properties of the track after uploading the BED file displayed in Code Example 5.1.

5.3.6.5 Code

The code written for the functionalities implemented in boinq can be reviewed on GitHub. The contributed code is mostly Java and JavaScript. The repository is situated at https://github.com/Kleurenprinter/boinq2.

5.4 Evaluation

5.4.1 sparql-bed and sparql-vcf

sparql-bed and sparql-vcf are tools created by Jerven Bolleman, creator of the FALDO vocabulary, to directly query BED and VCF files for their locations. It is important to note that the tools do not integrate a full mapping of the files in RDF; they only implement location data into a FALDO schema. The tools proved unable to handle large files and complex queries, and the given examples are thus limited to small files. In the interest of checking the correct parsing and implementation of the FALDO schema by boinq, we have evaluated the response to identical criteria on queries executed by both sparql-bed/sparql-vcf and the boinq triplestore. Due to limitations of

sparql-vcf, only a very basic query could be executed. The queries and results are featured in Code Example A.1 and Code Example A.2.

The code used for the translation of location data into the FALDO schema is shared by the conversion of all data formats. For this reason, its correct implementation carries over to all supported data formats. The BED and VCF files can be found at https://github.com/samtools/htsjdk/blob/master/src/test/resources/htsjdk/tribble/bed/unsorted.bed and https://github.com/samtools/htsjdk/blob/master/src/test/resources/htsjdk/variant/dbsnp_135.b37.1000.vcf, respectively. Corresponding queries all returned the same values. This is a clear indication that the conversion of the location data is implemented correctly.

5.4.2 Big data files

To affirm that big data files do not pose a problem for boinq, a VCF file of approximately 1 GB containing 4,686,454 unique entries has been successfully converted using boinq. The data file was given confidentially and can thus not be shared. The data contains SNPs from the first human chromosome. Code Examples A.4 and A.5 show queries to find the genes and exons the SNPs are situated in. A local version of the Ensembl endpoint was used, as the official database experienced downtime. The Ensembl database has served as a guide in the creation of the data structures, and the structures thus bear many resemblances.

The FALDO vocabulary is integrated by the Ensembl database. The constructed query evaluates the position of each SNP against the positions of the requested features; SNPs contained within a specific gene or exon are listed. In both Code Examples, nested queries are used. These are always executed first and return a set of variables used in the outer query. For the queries featured, the locations of genes and exons are retrieved first. For genes this returns 1766 matches, for exons 28808; the queries lasted approximately 8 and 50 hours, respectively. The data can be used to search for genes with a significantly higher frequency of SNPs. Further steps to analyse this data have not been executed, due to the size of the database.
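The general shape of such an overlap query is sketched below. The graph names are illustrative, and it is assumed that both datasets expose their coordinates through FALDO as described in Section 5.3.4.1, with faldo:location as the feature-to-location link.

PREFIX faldo: <http://biohackathon.org/resource/faldo#>

SELECT ?gene ?snp
WHERE {
  # Nested query, executed first: retrieve the gene regions.
  {
    SELECT ?gene ?gBegin ?gEnd
    WHERE {
      GRAPH <http://www.boinq.org/graph/ensembl> {
        ?gene faldo:location ?gLoc .
        ?gLoc faldo:begin/faldo:position ?gBegin ;
              faldo:end/faldo:position   ?gEnd .
      }
    }
  }
  # SNP positions from the converted VCF track.
  GRAPH <http://www.boinq.org/graph/track1> {
    ?snp faldo:location ?sLoc .
    ?sLoc faldo:begin/faldo:position ?sPos .
  }
  # A faldo:reference constraint on the same contig is omitted for brevity.
  FILTER (?sPos >= ?gBegin && ?sPos <= ?gEnd)
}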

5.4.3 JBrowse

JBrowse is a web client software created for the visualization of genome annotations. It is built with JavaScript and HTML5, with Perl-based data formatting tools implemented [16]. JBrowse is an open source software tool maintained by the Generic Model Organism Database (GMOD) community. The GMOD project features a collection of software tools for managing, visualising and storing genetic data.

Boinq aims to implement a visualisation tool for RDF data through the integration of existing software. Version 1.10.0 of JBrowse is the first version featuring data visualisation through con- nectivity with a SPARQL endpoint. This feature has been mainly designed during a Biohackathon and is therefore limited to basic functionalities. By downloading a region from the UCSC Genome Browser in BED format, converting it with Boinq and displaying it using JBrowse, both the correct functionality of boinq and JBrowse can be tested. Figure C.4 gives the region as shown in the UCSC browser. Figure C.5 gives the result as displayed by JBrowse. The configuration file of the JBrowse is shown in Code Example A.3.

Code Example 5.8: The BED file exported from the UCSC Browser and uploaded to the boinq triplestore

1  29553  31097  RP11-34P13.3  0  +  29553  31097  0  3  486,104,122  0,1010,1422
1  30365  30503  MIR1302-9     0  +  30365  30503  0
1  34553  36081  FAM128A       0  -  34553  36081  0  3  621,205,361  0,723,1167

6 Biological research in RDF

6.1 Introduction

Research is not simply done through single rounds of data comparison. Many research objectives require the analysis of data over multiple iterations, done in an environment that makes it possible to manage data created and obtained at different steps of the research. To make The Semantic Web a useful asset for research purposes, the management of RDF data is necessary. Having introduced a way to integrate data from next generation sequencing into a local RDF environment, researchers can now compare custom data with public databases. This chapter is an exploratory study of how technologies from The Semantic Web can be used to handle research objectives. For this, a use case was constructed. The use case serves as an example and gives only the initial steps of a more in-depth study. No validation or statistical proof is generated.

6.2 A biomarker for colon cancer

6.2.1 Introduction

Colorectal cancer (CRC) is the third most frequently diagnosed cancer in the United States and the United Kingdom, and the second leading cause of cancer death in the western world [45]. CRC has a high survival rate when diagnosed early: 93% and 83% in the first five years for stage I and II, respectively. Yet survival drops drastically for late-stage cancer, to 60% for stage III and only 8% for stage IV [52]. Invasive techniques such as colonoscopy offer significant improvements in the detection of CRC, although the cost, risk and inconvenience keep compliance rates low [83]. The use of biomarkers for the detection of CRC is a promising method, as there is a high need for a non-invasive early-stage screening technique.

Epigenetic alterations of the genome and remodeling of chromatin have been shown to play an important role in the development of cancer [28]. Epigenetic changes are known to happen from the early development of cancer onwards, such as the creation of aberrant methylation patterns, often present in the promoter regions. These have been shown to result in the silencing of tumor suppressor genes or the activation of oncogenes [57]. A change in methylation patterns furthermore gives rise to a wide range of different expression patterns in tumor cells compared to healthy cells [7]. Detection methods exist for both methylation and expression patterns of DNA. These are of high interest in today's search for a selective and sensitive set of biomarkers for the early detection of cancer.

The CpG island methylator phenotype (CIMP), caused by methylation-driven transcriptional regulation, is a fingerprint of methylation patterns used to define the state and type of CRC [33]. At this point, there is no standardized panel of methylation markers or methylation detection technique.


Fingerprints for differential RNA expression are currently being created, proposed and validated. Nevertheless, a robust multigene signature has not yet been defined [37]. The detection of DNA, RNA and protein markers is possible through stool samples. Markers are present in stool because of leakage, exfoliation or secretion by the CRC [54]. Since these processes can also occur in nonneoplastic cells, and stool contains genetic material from a variety of sources, the tests can have limited sensitivity and specificity. Despite numerous discoveries and methodological advances, CRC research has not yet yielded a novel molecular biomarker suitable for population-wide screening purposes, a problem that persists due to the high variability of current screening tests [33].

The silencing of DNMT1 and DNMT3b, two important methyltransferases, in cancer cells has been shown to reduce methylation by more than 95%. This also resulted in a reduction of other detrimental effects typical of cancer cells, including a suppression of growth and a demethylation of repeated sequences [57]. By comparing expression data and methylation data from both a wild type (WT) and a double knockout (DKO) variant of a CRC cell, it is possible to select genes with high differential expression and methylation. If these genes are possible effectors of factors contributing to CRC, differential patterns between methylation and expression data are to be expected between healthy and cancer cells.

6.2.2 Material and methods

The HCT116 cell line is an immortal cancer cell line of the colon [44]. Differential expression and methylation data are available from a DKO and WT variant; in the DKO, DNMT1 and DNMT3b have been knocked out. Variable expression and methylation data of the whole genome were obtained. The methylation data are generated from Reduced Representation Bisulfite Sequencing (RRBS); values are given for promoter methylation, expressed as the difference in average percentage between the DKO and WT. Exon expression data are retrieved from mRNA-Seq. The data representing the differential expression patterns are expressed as the log2 fold change of the DKO compared to the WT.

The data, available in BED format, are imported into the triplestore using the developed functionalities. The two BED files containing a list of exons for which promoter methylation and expression data are given (169137 unique entries) were converted into two separate tracks (graphs) using a boinq and Blazegraph (v2.1.0) installation on a dedicated server (CentOS version 6.7, Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz with 8 cores, 32 GB RAM). All data is labeled with the Ensembl exon identifier. The first step to find genes of interest is the selection of exons with a significant difference in both their expression and methylation data. For this, an analysis of the distribution of the data is done in R. The distribution of the expression data is given using the absolute fold change. A first selection of genes is made using these data, as we are only interested in the genes with both differential methylation and expression values.

An essential element in the use of an RDF environment for data analysis is the INSERT function. INSERT is a variation on SELECT in which the variables selected through the main body of the query are used to create new triples in the triplestore. By storing the results of a query in a new graph, it is possible to iterate in smaller steps and thus decrease the need for elaborate queries that are heavy to process. Another advantage is that external databases need only be queried once if the data is used multiple times, giving faster results. Every triplestore has both a query and an update endpoint. The update endpoint, through which data can be sent to manipulate data inside the triplestore, is also used by queries containing the INSERT function.
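A minimal sketch of such an INSERT query, sent to the update endpoint, is given below; the graph names, the attribute pattern and the threshold are illustrative, not the exact queries used in this chapter.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX obo: <http://purl.obolibrary.org/obo/>

INSERT {
  GRAPH <http://www.boinq.org/graph/selection> {
    ?feature rdf:value ?score .
  }
}
WHERE {
  GRAPH <http://www.boinq.org/graph/track1> {
    # Select features whose typed score attribute passes the cut-off.
    ?feature <http://purl.obolibrary.org/obo/so-xp.obo#has_quality> ?attribute .
    ?attribute a obo:SO_0001685 ;
               rdf:value ?score .
    FILTER (abs(?score) > 2)
  }
}

Subsequent queries can then start from the much smaller selection graph instead of re-evaluating the filter over the full track.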

To select a candidate biomarker from the data, a comparison is made between the in-house expression and methylation data from the WT and DKO and data found in public databases. For this, relevant LOD from The Semantic Web is used. By comparing data from other studies, more evidence can be obtained that a given gene is of interest. The following queries are executed directly through the triplestore, as no interface for query building has been implemented in boinq yet.

Ensembl The Ensembl database, just like all other databases maintained by EMBL-EBI, features an LOD version of its database. Although the RDF database is well structured and maintained, it has not yet reached a full implementation of the complete Ensembl dataset; an example is the absence of CDS data. However, it does contain all the data concerning genes, exons and

their locations. Furthermore, due to the use of different kinds of unique identifiers by different databases, an extensive list of identifiers is given for every entity. This is, as will become apparent in the following examples, necessary when comparing data over different datasets. Code Example A.7 features the query used for the curation of data of the selected exons. An installation of a local Ensembl database is used, as the public endpoint experienced long periods of downtime. Having retrieved the genes for every exon, the data now consist of a selection of genes that are potential biomarkers for colon cancer. To find out if the selected genes overlap with other research done on colon cancer, data are retrieved from DisGeNET, Expression Atlas and TCGA.

DisGeNET DisGeNET is a database connecting research with diseases and genes. A simplified schema of its data structure is shown in Figure 6.1. The complete schema, which is too large to fit on a page, is given at http://www.disgenet.org/ds/DisGeNET/html/images/disgenet-rdf-schema-125.png. Through the use of SPARQL, any number of nodes can be defined to select data that adheres to them. As we are interested in the selection of genes that are recognized to be affiliated with CRC, an ontology used for this disease was selected. DisGeNET uses the Medical Subject Headings (MeSH) vocabulary for the identification of diseases. A quick search resulted in the retrieval of http://id.nlm.nih.gov/mesh/D003110 for the identification of colonic neoplasms. Through the query given in Code Example A.8, 4424 genes with associations with CRC were retrieved. DisGeNET retrieves Gene Disease Association (GDA) data from different sources, through curation, prediction and literature. To make a distinction between the levels of evidence for every GDA, a score is granted ranging from 0 to 1. Data retrieved from DisGeNET, Expression Atlas and TCGA are stored in separate graphs.

Figure 6.1: A simplified structure of the DisGeNET data structure [26].

Expression Atlas The Expression Atlas database is a collection of expression data gathered from a variety of experiments. In line with the retrieval of data from DisGeNET, all data have been retrieved from cells associated with CRC. Expression Atlas uses a different vocabulary for identifying diseases, called the Experimental Factor Ontology. The ontology http://www.ebi.ac.uk/efo/EFO_0000365, identified as colorectal adenocarcinoma, is best suited for CRC. P-values are given for each observation of differential expression. The p-values have been adjusted using the false discovery rate correction for multiple testing. Code Example A.9 shows the query used for the import of data.

TCGA An obvious choice is the screening of colon cancer data from TCGA. Public endpoints to access this data are poorly maintained and several are not working (May 2016). RDF dumps of their data are available, yet instructions for the import of the data into the triplestore could not be retrieved. As the data is spread over hundreds of separate files, no data from TCGA could be used for this research.

Table 6.2: Data imported into a local RDF environment using the in-house data and LOD. The total gene count of the in-house data is taken after selection of genes with significant expression and methylation differences. A two-by-two comparison of genes shared across datasets is given.

Genes present and shared in every dataset

Dataset            #Genes   ∩ DisGeNET   ∩ Expression Atlas
In-house data        1833        404                    823
DisGeNET             4424                              2352
Expression Atlas    11077       2352

Table 6.2 gives an overview of the amount of genes in every dataset and the gene entries that are shared. Genes are specified by their NCBI identifier. Once all data is retrieved, the data can be analyzed.

6.2.3 Results

Figure 6.2 shows a representation of the distributions of the differential methylation and expression data. A local minimum in the methylation data can be observed. Since other methyltransferase proteins exist, it is to be expected that not every exon promoter region is demethylated. This is clearly shown in Figure 6.2, where the local minimum separates a distribution of unchanged methylation values from a distribution with changed methylation values. The local minimum is situated at -22.73. To select only the most significant changes in expression data, we want to select all values situated in the tail of the distribution graph. A minimum absolute fold change of 1 was not chosen, as this constitutes more than one third (33.9281%) of all exons. An absolute fold change of 2, which selects approximately 10% (10.984%) of exons, is deemed a suitable cut-off rate through which only the most significant values are selected. Differential methylation values lower than -22.73 and an absolute expression fold change larger than 2 are the chosen criteria. The query is shown in Code Example A.6.

Figure 6.2: The distribution of differential methylation (left) and expression (right). The absolute value of the fold change is taken for the expression data.

Genes with the highest differences in expression are of particular interest. Code Example A.10 gives the query used to retrieve a list ordered by highest expression. The regulation of those genes in Expression Atlas and their scores in DisGeNET are also given, if available. The top ten results are shown in Table 6.4.

Candidate biomarkers are genes that show significantly different expression and/or methylation patterns in developing CRC compared to healthy cells. As oncogenes are typically upregulated in cancer cells compared to healthy cells, the interest lies in upregulated expression patterns. Knocking out methyltransferase activity has been shown to reduce factors contributing to tumor growth [57]. Thus, it is to be expected that oncogenes are downregulated while tumor suppressor genes are upregulated in the DKO compared to the WT.

Table 6.4: The selection of ten genes for which methylation impacts expression the most.

Results obtained through Code Example A.10

Label       FC        Methylation   P-value     Regulation   Score
TIMP3       12.3750   -37.7740      2.7132E-4   UP           2.7144E-4
IL32        11.5849   -59.4912      2.4937E-5   UP           1.2027E-1
PAGE5       10.7696   -62.9020
NPTX2       10.4534   -85.5591      5.1579E-3   UP
HDGFRP3     10.3798   -87.2696      4.6599E-4   UP
COL4A1      10.3092   -87.2447      3.1173E-3   UP           0.3600
LINC00667   10.2738   -82.6889
LINC00221   10.2313   -84.0186
SOHLH2      10.1901   -75.1615      8.9690E-8   DOWN
ZNF140      10.1346   -45.3473      3.0980E-3   UP

As only genes are selected whose promoter methylation percentage has gone down, genes showing less expression are bound to be suppressed by other mechanisms, such as tumor suppressor genes. For this reason, the following criteria focus on genes that are downregulated in the DKO. Furthermore, as data is obtained from Expression Atlas, candidate oncogenes can be selected by an upregulation of the genes in cancer cells compared to healthy cells. Code Example A.11 gives the query to list the genes ordered by their expression values. Results are filtered to only show genes with data from Expression Atlas for which upregulation of the gene has been observed. P-values and an optional score from DisGeNET are also shown. As we are looking for new candidate biomarkers, it is not necessary for a gene to be registered in DisGeNET. Only the ten genes with the highest downward differential expression that meet the proposed criteria are discussed. These are shown in Table 6.6.

Table 6.6: The selection of ten genes with the highest differential expression, which are downregulated, have a significant methylation difference and are known to be upregulated in CRC.

Results obtained through Code Example A.11

Label     FC        Methylation   P-value      Score
GRIN2B    -7.1794   -31.3310      1.4233E-4    0.2400
EHF       -6.9060   -46.1462      1.4683E-16
TIAM1     -6.3210   -50.6652      8.4553E-4
DSC3      -6.0103   -42.6699      3.433E-2
EPAS1     -5.3557   -35.1357      1.3316E-3    0.3600
ZNF462    -5.3033   -73.7690      3.4391E-2
SLCO1B3   -5.0295   -35.0112      2.617E-3     0.1200
ZBTB20    -4.4870   -87.0630      3.0116E-2
PDE10A    -4.1334   -83.6438      2.4624E-2    2.7144E-4
THOC2     -4.1334   -31.1591      6.9420E-3
HOXB8     -4.0678   -56.4980      2.2630E-5

6.3 Discussion

6.3.1 Methods

Each file, consisting of approximately 200,000 entries, took about five minutes to convert, with both graphs containing about three million triples at the end of the conversion. Information from The Semantic Web was easily retrieved using public endpoints. No federated query ran for longer than five minutes, while local queries took only a couple of seconds. It is important to point out that the use of simple queries is essential for a fast-paced process.

Indeed, when trying to combine triples from several endpoints in one query, the processing time increases exponentially. Retrieving data from external endpoints should therefore always happen in separate steps. For this, data can be downloaded to specific graphs. This method ensures that data retrieval is the most efficient. Downloaded data can afterwards be retrieved from the local database using the same query construct. Combining local data instead of external data also brings a performance advantage.

The combination of query elements, which defines the order in which data is retrieved, is important when working with (federated) queries. For example, Code Example A.7 retrieves the genes from the Ensembl database based on the exon labels located in the user database. As the number of exons had already been restricted to include only the most significant ones, the processing requirements were still acceptable. Yet, if one wants to retrieve the genes for every exon listed in the original dataset, another approach is necessary. The exon label is a literal, and is therefore expensive for the triplestore to process. Unlike an IRI, no specific object handler is given. To process a literal, the query engine is required to compare the string against every literal that matches the criteria of the query. In this example, Ensembl would have to align several hundred thousand exon labels with the several hundred thousand exon labels stored in the Ensembl RDF dataset. Public endpoints restrict the amount of processing power they grant to incoming queries, and will thus reject such a query. A possible approach is querying and storing all the necessary data from the Ensembl database, after which the comparison of strings can be performed by the local triplestore. An easy way to split queries into several parts is to iterate over different subsets of the data, as sketched below.
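A hedged sketch of such a split, assuming batching by exon label: each query carries one small VALUES block, and successive queries cover the full label set batch by batch. The endpoint IRI and the exon labels are invented placeholders.

PREFIX dcterms: <http://purl.org/dc/terms/>

# One batch of a split federated query; the next query swaps in the
# next set of labels. Endpoint IRI and labels are placeholders.
SELECT ?exon ?label
WHERE {
  VALUES ?label { "EXON_LABEL_1" "EXON_LABEL_2" "EXON_LABEL_3" }
  SERVICE <http://example.org/remote/sparql> {
    ?exon dcterms:identifier ?label .
  }
}

Whether the bindings from the VALUES block are pushed into the SERVICE call depends on the query engine, but in either case each batch stays well below the processing limits of a public endpoint.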

The construction of queries can be extensive, built from a large number of elements. Nevertheless, queries are always assembled following the specific data structure of the dataset. In this use case, the retrieval of data from these datasets has been realized using the same query patterns, as shown in Code Examples A.7 through A.11. The combination and placement of elements, and of extra elements such as filters and optional values, follow a specific logic. Because of this, it is possible to create a more user-friendly method for the creation of queries. An example of such a tool is SPARQLGraph [61]. The shared skeleton behind these queries is sketched below.
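The following is a minimal sketch of the recurring pattern, not a query used in this study: local seed data first, optional external annotations second, and restrictions last. All graph IRIs and the ex: properties are placeholders.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX ex: <http://example.org/terms/>

SELECT ?feature ?label ?score
WHERE {
  GRAPH <http://example.org/graph/local> {      # in-house data seeds the bindings
    ?feature rdfs:label ?label ;
             ex:value ?value .
  }
  OPTIONAL {                                    # external annotations may be absent
    GRAPH <http://example.org/graph/external> {
      ?feature ex:score ?score .
    }
  }
  FILTER(?value > "4.0"^^xsd:decimal)           # restrictions come last
}
ORDER BY DESC(?value)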

6.3.2 Results

A quick observation of both result sets shows no relation between the methylation and expression values. Furthermore, only part of the genes retrieved have entries in DisGeNET. As we are looking for new candidate biomarkers, this is not in itself a problem. Results from Table 6.4 show only upregulated genes, with fold changes all larger than 10. This can be expected from a comparison between a cell with significantly less methylation and its counterpart. In that same table, data from Expression Atlas almost exclusively show that the selected genes are upregulated in experiments comparing cancer cells with healthy cells. As these genes show the highest gains in expression in DKO, one might expect the selected genes to be tumor suppressor genes. Yet, this is contradicted by the data retrieved from Expression Atlas. A literature review of the genes in question is in order, and might give conclusive evidence of their specific roles.

In short, TIMP3 has been shown to be an important tumor suppressor gene [43], which is inhibited by miRNA-191 in the development of CRC [55]. IL-32 has been reported to reduce tumor growth when its overexpression is induced in CRC [53]. Yet, other studies have suggested IL-32 to be overexpressed in CRC [84]. NPTX2 has been shown to inhibit pancreatic cancer growth [86] and SOHLH2 ovarian cancer growth [85]. HDGFRP3 has been reported as a potential angiogenic factor that supports tumor growth. No data relating the expression or methylation of long intergenic non-coding RNAs 667 and 221 to any cancer type have been found. COL4A1 has been identified as a gene showing consistent methylation in CRC [47]. Literature on ZNF140 is likewise lacking.

The set of results featured in Table 6.4 gives no clear or one-sided view of the expression of the selected genes and their characteristics in the development of CRC. Although most genes showed some relevance to the development of cancers, no clear relation can be shown between the role of the given genes and their differential expression and methylation values.

Table 6.6 features a set of candidate genes encompassing a wide range of expression values, with FCs ranging from -7.1794 to -4.0678. Because only a small number of genes actually show lowered expression in DKO, a larger fraction of the values with an FC smaller than minus two is selected. A significant lowering of the methylation percentage is not expected to accompany the downregulation of these genes. There are many possible contributing reasons, but it is most likely due to the higher expression of tumor suppressor genes. Due to the construction of the query, no contradictions exist between values obtained from Expression Atlas and the in-house data. A review of literature sources reveals uniform evidence.

In short, GRIN2B and ZNF462 are suggested to be prone to mutations in CRC [2][82]. Low expression of EHF has been shown to induce tumorigenic potential in prostate cancer cells. Yet, another study shows that knockdown of EHF inhibited the proliferation, invasion and tumorigenesis of ovarian cancer cells, where a correlation was found between patient survival time and EHF expression [18]. Downregulation of TIAM1 has recently been identified to help suppress gastric cancer invasion and growth [42]. DSC3 has been reported to be downregulated in CRC [21], and has been shown to have tumor suppressor activity [20]. High levels of EPAS1 mRNA have been shown to correlate with relapse and mortality in CRC [48]. A variant of SLCO1B3 has been shown to be expressed in both colon and pancreatic cancer [67]. ZBTB20 has been linked to the promotion of non-small cell lung cancer through repression of FoxO1 [87]. ZBTB20 has also been documented to have increased expression in hepatocellular carcinoma, which is linked with poor prognosis [80]. Reports on PDE10A include high expression and an important role in the development of CRC [39], and it is recognized as a novel target for inhibition in the prevention of CRC [41]. High levels of HOXB8 expression are documented, with direct correlations to patient survival [63]. Overexpression of the gene is furthermore directly linked to the immortalization of tumor cells [58]. Although not all genes have been specifically linked to colon cancer, their specific roles in other cancers point to a possible involvement in CRC. The data retrieved from the HCT116 cell line, literature studies, and data found on Expression Atlas affirm the involvement of specific genes represented in Table 6.6 in the development of CRC.

Although a promising set of results has been obtained, the selection of a specific biomarker cannot yet be made. An adequate biomarker does not directly depend on the degree to which expression and methylation values differ; instead, it displays consistency over a large variety of cancer and healthy cells retrieved from different patients. This is important, as one wants a biomarker with high selectivity and specificity. Furthermore, tests developed from these results will evaluate a set of biomarkers to further improve their specificity and selectivity, so evaluation of multiple genes is required. Based on the composed criteria, genes have been selected that show significant expression and methylation patterns in the in-house data and Expression Atlas. Many of the retrieved genes have been reported to play distinct roles in the development of cancer. Unlike the results displayed in Table 6.4, Table 6.6 lists genes with higher expression in CRC, which can thus be used in tests measuring RNA expression. To evaluate whether the selected genes can also be detected through protein expression, possible post-transcriptional silencing should be evaluated. The next step in the selection of a potential set of biomarkers is thus the evaluation of their expression and methylation values over hundreds of samples.

7 Conclusion and Future Prospects

The Semantic Web has, since its creation in 2003, increasingly grown into a technology that fulfills the design goals it was created for. It is supported by a growing community of public databases, which have been gradually adopting the construction of RDF databases. This study has realized a data structure for the integration of biological data formats into a local RDF environment. Through this, researchers can use their own data for analysis on The Semantic Web. The BED and GFF data structures are the most complete, with largely all elements implemented using community-accepted ontologies. Both the VCF and SAM implementations lack completeness. One of the major challenges in completely integrating all featured elements into RDF is the creation or retrieval of correct ontologies. Only ontologies that describe the meaning of a value outside the context of the data format are adequate for use. Only when this design goal is followed is it possible to create an environment where data from different formats can be seamlessly combined. The schemas created in this study have been translated into a data converter implemented in boinq. As boinq aims to be an open source tool serving research, it is necessary to share the proposed implementations with the active community, and to create an environment where the further creation and development of NGS data structures is community-based.

The implementation of this functionality is one step closer to making The Semantic Web an easily accessible asset for the analysis and management of data. It has furthermore been shown that an RDF environment can be used to iterate through the different steps of a workflow. This still requires a high level of knowledge, and cannot be performed by people unfamiliar with SPARQL or the data structures of LOD databases. This obstacle can be overcome with the creation of a tool that makes query building for external databases intuitive. An example of a project aiming towards this is SPARQLGraph [61]. When integrated in boinq, additional functionality is required, with a focus on query building that enables an iterative process of data collection and analysis in a local RDF environment. A feature-complete boinq could thus become one of the first applications enabling practical use of The Semantic Web for bioinformatics purposes.


A Code Examples

PREFIXES USED
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo1:
PREFIX obo2:
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/expressionatlas/>

Code Example A.1: A set of queries executed on the triplestore and sparql-bed.

#Feature count (boinq), Result: ?count = 15
SELECT (COUNT(?p) AS ?count)
WHERE{?p ?o.}

#Feature count (sparql-bed), Result: ?count = 15
SELECT (COUNT(?p) AS ?count)
WHERE{?p rdf:type .}

#Location interval (boinq), Result: ?count = 5
SELECT (COUNT(?feature) AS ?count)
WHERE {?feature faldo:location ?p.
  ?p faldo:begin [faldo:position ?x];
     faldo:end [faldo:position ?y].
  FILTER(?y<178999999 && ?x>178908612)}

#Location interval (sparql-bed), Result: ?count = 5
SELECT (COUNT(?p) AS ?count)
WHERE { ?p faldo:begin [faldo:position ?x];
           faldo:end [faldo:position ?y].
  FILTER(?y<178999999 && ?x>178908612)}

Code Example A.2: A query executed on the triplestore and sparql-vcf.

#Feature count (boinq), Result: ?count = 99
SELECT (COUNT(?p) AS ?count)
WHERE{ ?p ?o.}

#Feature count (sparql-vcf), Result: ?count = 99
SELECT (COUNT(?p) AS ?count)
WHERE { ?p ?o }


Code Example A.3: The configuration of the trackList.json file loaded by JBrowse. The SPARQL query must be written on a single line.

...
{
  "label": "SPARQLGene",
  "key": "SPARQL Genes",
  "style" : {
    "className" : "gene",
    "histScale" : 2,
    "featureCss" : "background-color: #66F; height: 8px",
    "histCss" : "background-color: #88F",
    "height" : "500"
  },
  "storeClass": "JBrowse/Store/SeqFeature/SPARQL",
  "type": "JBrowse/View/Track/HTMLFeatures",
  "urlTemplate": "http://localhost:9999/blazegraph/namespace/boinq/sparql",
  "queryTemplate": "prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> prefix xsd: <http://www.w3.org/2001/XMLSchema#> prefix obo: <http://purl.obolibrary.org/obo/> prefix faldo: <http://biohackathon.org/resource/faldo#> SELECT ?start ?end ?strand (?feature as ?uniqueID) ?name WHERE{ ?feature faldo:location ?location . OPTIONAL{ ?feature rdfs:label ?name}. ?location faldo:begin [faldo:position ?start]; faldo:end [faldo:position ?end; rdf:type ?strandpos] . FILTER(?strandpos = faldo:ForwardStrandPosition || ?strandpos= faldo:ReverseStrandPosition). BIND(xsd:integer(IF(?strandpos = faldo:ForwardStrandPosition,1,IF(?strandpos = faldo:ReverseStrandPosition,-1,1))) as ?strand)}"
},
...

Code Example A.4: The SPARQL query used for the retrieval of the genes that SNPs are located on.

SELECT ?object ?idx ?id ?label
WHERE {
  {
    SELECT ?begin ?end ?gene ?id ?label{
      SERVICE{
        ?gene dcterms:description ?desc;
              dcterms:identifier ?id;
              rdfs:label ?label;
              a ?type;
              faldo:location ?location.
        ?location faldo:reference [rdfs:subClassOf ].
        ?location faldo:begin [faldo:position ?begin].
        ?location faldo:end [faldo:position ?end].
      }}
  }
  ?object ?idx;
          faldo:location ?location.
  ?location faldo:begin [faldo:position ?beginx];
            faldo:end [faldo:position ?endx].
  FILTER(?endx < ?end && ?beginx > ?begin)
}

Code Example A.5: The SPARQL query used for the retrieval of the exons that SNPs are located on.

SELECT ?object ?idx ?exon ?label ?genelabel
WHERE {
  {
    SELECT ?exon ?label ?genelabel ?begin ?end{
      SERVICE{
        ?exon a obo:SO_0000147 ;
              rdfs:label ?label;
              faldo:location ?location.
        ?location faldo:reference [rdfs:subClassOf ].
        ?gene obo:SO_has_part ?exon;
              dcterms:description ?desc;
              rdfs:label ?genelabel.
        ?location faldo:begin [faldo:position ?begin].
        ?location faldo:end [faldo:position ?end].
      }}
  }
  ?object ?idx;
          faldo:location ?location.
  ?location faldo:begin [faldo:position ?beginx];
            faldo:end [faldo:position ?endx].
  FILTER(?endx < ?end && ?beginx > ?begin)
}

Code Example A.6: The query used for the selection of exons with significant difference in methylation and expression.

INSERT{
  GRAPH{
    ?o rdfs:label ?s
  }
}
WHERE{
  GRAPH {
    #EXPRESSION DATA
    ?o rdfs:label ?s;
       obo1:has_quality [rdf:value ?expresion].
  }
  GRAPH {
    #METHYLATION DATA
    ?w rdfs:label ?s;
       obo1:has_quality [rdf:value ?methylation].
  }
  FILTER(?methylation<"-22.73007"^^xsd:decimal && (?expresion>"4.0"^^xsd:decimal ||
         ?expresion<"0.25"^^xsd:decimal))
}

Code Example A.7: The query used for the curation of gene data for every exon.

INSERT{
  GRAPH{
    ?o rdfs:label ?s.
    ?o obo1:part_of ?gene.
    ?gene rdfs:label ?label;
          rdf:type ?genetype;
          rdfs:seeAlso ?link.
    ?link a .
  }
}
WHERE{
  GRAPH{
    ?o rdfs:label ?s
  }
  SERVICE{
    ?exon dcterms:identifier ?s.
    ?transcript obo2:SO_has_part ?exon;
                obo2:SO_transcribed_from ?gene.
    ?gene rdfs:label ?label;
          rdf:type ?genetype.
    OPTIONAL{
      {?gene ?link} UNION {?gene ?link}.
      ?link a ?type.}
  }}

Code Example A.8: The query used to retrieve gene disease associations with their relevant genes and scores from DisGeNET.

INSERT{
  GRAPH{
    ?gene rdfs:label ?title.
    ?gda sio:SIO_000628 ?gene.
    ?gda sio:SIO_000216 ?scoreIRI.
    ?scoreIRI sio:SIO_000300 ?score.
  }
}
WHERE{
  {
    SELECT ?gene ?title WHERE{
      SERVICE{
        ?disease skos:exactMatch .
        ?gda sio:SIO_000628 ?gene,?disease;
             sio:SIO_000216 ?scoreIRI.
        ?gene rdf:type ncit:C16612 ;
              sio:SIO_000205 [dcterms:title ?title].
        ?scoreIRI sio:SIO_000300 ?score.
      }}
  }}

Code Example A.9: The query used to retrieve expression data with relevant p-values from data featuring CRC cells on Expression Atlas.

INSERT {
  GRAPH{
    ?gda sio:SIO_000628 ?geneID.
    ?gda sio:SIO_000216 ?scoreIRI.
    ?scoreIRI sio:SIO_000300 ?score.
  }
}
WHERE{
  GRAPH{
    ?value atlasterms:pValue ?pvalue.
    ?value atlasterms:isMeasurementOf ?probe;
           rdfs:label ?expressionValue.
    ?probe atlasterms:dbXref ?uniprot.
    ?geneID sio:SIO_010078 ?uniprot.
  }
}

Code Example A.10: The query used to retrieve the selected genes with relevant p-values and DisGeNET scores, filtered for genes reported as upregulated in Expression Atlas; used for Table 6.4. Genes are ordered by expression value.

SELECT DISTINCT ?label ?expression ?methylation ?pvalue ?score
WHERE {
  GRAPH{
    ?gene rdf:type obo2:SO_0000704 ;
          rdfs:seeAlso ?geneID.
    ?geneID a .
    ?exon obo1:part_of ?gene;
          rdfs:label ?exonlabel.
    ?gene rdfs:label ?label.
  }
  OPTIONAL{
    GRAPH{
      ?gda sio:SIO_000628 ?geneID.
      ?gda sio:SIO_000216 ?scoreIRI.
      ?scoreIRI sio:SIO_000300 ?score.
    }
  }
  OPTIONAL{
    GRAPH{
      ?value atlasterms:pValue ?pvalue.
      ?value atlasterms:isMeasurementOf ?probe;
             rdfs:label ?expressionValue.
      ?probe atlasterms:dbXref ?uniprot.
      ?geneID sio:SIO_010078 ?uniprot.
    }
  }
  GRAPH {
    ?exon1 rdfs:label ?exonlabel;
           obo1:has_quality [rdf:value ?expression].
  }
  GRAPH {
    ?exon2 rdfs:label ?exonlabel;
           obo1:has_quality [rdf:value ?methylation].
  }
  FILTER regex(str(?expressionValue), "UP")
}
ORDER BY ?expression

Code Example A.11: The query used to retrieve the genes that are downregulated, with relevant p-values and DisGeNET scores; used for Table 6.6. Genes are ordered by absolute fold change in decreasing order.

SELECT DISTINCT ?label ?expression ?methylation ?pvalue ?score
WHERE {
  GRAPH{
    ?gene rdf:type obo2:SO_0000704 ;
          rdfs:seeAlso ?geneID.
    ?geneID a .
    ?exon obo1:part_of ?gene;
          rdfs:label ?exonlabel.
    ?gene rdfs:label ?label.
  }
  OPTIONAL{
    GRAPH{
      ?gda sio:SIO_000628 ?geneID.
      ?gda sio:SIO_000216 ?scoreIRI.
      ?scoreIRI sio:SIO_000300 ?score.
    }
  }
  OPTIONAL{
    GRAPH{
      ?value atlasterms:pValue ?pvalue.
      ?value atlasterms:isMeasurementOf ?probe;
             rdfs:label ?expressionValue.
      ?probe atlasterms:dbXref ?uniprot.
      ?geneID sio:SIO_010078 ?uniprot.
    }
  }
  GRAPH {
    ?exon1 rdfs:label ?exonlabel;
           obo1:has_quality [rdf:value ?expression].
  }
  GRAPH {
    ?exon2 rdfs:label ?exonlabel;
           obo1:has_quality [rdf:value ?methylation].
  }
}
ORDER BY DESC(abs(?expression))

B Tables

Table B.1: The different feature types found in VCF files.

VCF schema: feature type ontologies

INDEL (http://purl.obolibrary.org/obo/SO_1000032):
    A sequence alteration which included an insertion and a deletion, affecting 2 or more bases.
MIXED (http://purl.obolibrary.org/obo/SO_0000667 and http://purl.obolibrary.org/obo/SO_0000159):
    SO_0000667: The sequence of one or more nucleotides added between two adjacent nucleotides in the sequence.
    SO_0000159: The point at which one or more contiguous nucleotides were excised.
MNP (http://purl.obolibrary.org/obo/SO_0001013):
    A multiple nucleotide polymorphism with alleles of common length bigger than 1, for example AAA/TTT.
NO VARIATION (http://purl.obolibrary.org/obo/SO_0000347):
    A match against a nucleotide sequence.
SNP (http://purl.obolibrary.org/obo/SO_0000694):
    SNPs are single base pair positions in genomic DNA at which different sequence alternatives exist in normal individuals in some population(s), wherein the least frequent variant has an abundance of 1% or greater.


Table B.2: Recognized fields for feature types in the GFF format

GFF schema: feature type ontologies

CDS (http://purl.obolibrary.org/obo/SO_0000316):
    A contiguous sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon.
GENE (http://purl.obolibrary.org/obo/SO_0000704):
    A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory regions, transcribed regions and/or other functional sequence regions.
MRNA (http://purl.obolibrary.org/obo/SO_0000234):
    Messenger RNA is the intermediate molecule between DNA and protein. It includes UTR and coding sequences. It does not contain introns.
CDNA (http://purl.obolibrary.org/obo/SO_0000756):
    DNA synthesized by reverse transcriptase using RNA as a template.
OPERON (http://purl.obolibrary.org/obo/SO_0000178):
    A group of contiguous genes transcribed as a single (polycistronic) mRNA from a single regulatory region.
PROMOTOR (http://purl.obolibrary.org/obo/SO_0000167):
    A regulatory region composed of the TSS(s) and binding sites for TF complexes of the basal transcription machinery.
TF BINDING SITE (http://purl.obolibrary.org/obo/SO_0000235):
    A region of a nucleotide molecule that binds a Transcription Factor or Transcription Factor complex.
THREE PRIME UTR (http://purl.obolibrary.org/obo/SO_0000205):
    A region at the 3' end of a mature transcript (following the stop codon) that is not translated into a protein.
FIVE PRIME UTR (http://purl.obolibrary.org/obo/SO_0000204):
    A region at the 5' end of a mature transcript (preceding the initiation codon) that is not translated into a protein.
INTRON (http://purl.obolibrary.org/obo/SO_0000188):
    A region of a primary transcript that is transcribed, but removed from within the transcript by splicing together the sequences (exons) on either side of it.
EXON (http://purl.obolibrary.org/obo/SO_0000147):
    A region of the transcript sequence within a gene which is not removed from the primary RNA transcript by RNA splicing.
TRANSCRIPT (http://purl.obolibrary.org/obo/SO_0000673):
    An RNA synthesized on a DNA or RNA template by an RNA polymerase.
REGION (http://purl.obolibrary.org/obo/SO_0000001):
    A sequence feature with an extent greater than zero. A nucleotide region is composed of bases and a polypeptide region is composed of amino acids.
START CODON (http://purl.obolibrary.org/obo/SO_0000318):
    First codon to be translated by a ribosome.
STOP CODON (http://purl.obolibrary.org/obo/SO_0000319):
    In mRNA, a set of three nucleotides that indicates the end of information for protein synthesis.
NCRNA (http://purl.obolibrary.org/obo/SO_0000655):
    An RNA transcript that does not encode for a protein rather the RNA molecule is the gene product.
TRNA (http://purl.obolibrary.org/obo/SO_0000253):
    Transfer RNA (tRNA) molecules are approximately 80 nucleotides in length. Their secondary structure includes four short double-helical elements and three loops (D, anti-codon, and T loops). Further hydrogen bonds mediate the characteristic L-shaped molecular structure. Transfer RNAs have two regions of fundamental functional importance: the anti-codon, which is responsible for specific mRNA codon recognition, and the 3' end, to which the tRNA's corresponding amino acid is attached (by aminoacyl-tRNA synthetases).
RRNA (http://purl.obolibrary.org/obo/SO_0000252):
    RNA that comprises part of a ribosome, and that can provide both structural scaffolding and catalytic activity.

Table B.3: Object types created for every key found in the INFO field, used in the VCF schema. The namespace prefix gfvo replaces 'http://www.biointerchange.org/gfvo#'.

VCF schema: attribute type ontologies

AA (gfvo:AncestralSequence):
    Denotes an ancestral allele of a feature. May be used to denote the "ancestral allele" ("AA" additional information) of VCF formatted files.
AC (gfvo:AlleleCount):
    Count of a specific allele in genotypes. Encodes for "AC" additional information in VCF files.
AF (gfvo:AlleleFrequency):
    Proportion of a particular gene allele in a gene pool or genotype. Encodes for "AF" additional information in VCF files.
AN (gfvo:TotalNumberOfAlleles):
    Total number of alleles in called genotypes. Encodes for "AN" additional information in VCF files.
BQ (gfvo:BaseQuality):
    Root mean square base quality. Accounts for "BQ" additional information in VCF files.
DB, H2, H3, 1000G (gfvo:ExternalReference):
    A cross-reference to associate an entity to a representation in another database. Encodes for the "Dbxref" attribute in GFF3 and GVF. Can be used to describe the contents of the "source" column in GTF files. Captures the "genome-build" pragma, "source-method", "attribute-method", "phenotype-description", and "phased-genotypes" structured pragmas in GVF. Accounts for the "assembly" and "pedigreeDB" information fields, and "DB", "H2", "H3", "1000G" additional information in VCF.
DP (gfvo:Coverage):
    Number of nucleic acid sequence reads for a particular genomic locus (a region or single base pair). Accounts for "DP" additional information in VCF files.
MQ (gfvo:MappingQuality):
    Root mean square mapping quality. Encodes values of the "MQ" additional information in VCF files.
MQ0 (gfvo:NumberOfReads):
    Number of reads supporting a particular feature or variant. Can encode for "MQ0" additional information in VCF files, if additional annotations are provided to denote a mapping quality of zero for the given count. In GVF files, the class accounts for the "Variant reads" attribute.
NS (gfvo:SampleCount):
    Number of samples in the dataset. Encodes for "NS" additional information in VCF files.
SB (gfvo:Note):
    A note is a short textual description about an entity. It provides a formal or semi-formal description of an entity, as opposed to a "Comment". Encodes for the "sample-description" pragma and "Comment" key/value pairs in structured attributes in GVF. Captures "Description" key/value pairs in information fields and the "SB" information field in VCF.
SOMATIC (gfvo:SomaticCell):
    The somatic feature class captures information about genomic sequence features arising from somatic cells. Encodes for the "genomic-source" pragma in GVF and "SOMATIC" additional information in VCF.
VALIDATED (gfvo:ExperimentalMethod):
    An experimental method is a procedure that yields an experimental outcome (result). Experimental methods can be in vivo, in vitro or in silico procedures that are well described and can be referenced. Encodes for "source" column contents of GFF3, GTF, and GVF file formats as well as the "CHROM" column in VCF. Can be used to describe the "capture-method" pragma in GVF; it can describe "VALIDATED" additional information in VCF.

C Figures

Figure C.1: The options given before conversion of a supported file format by boinq. Mapping options, with which the type of the main and sub feature can be selected, are only visible when converting BED files. The options shown are those selected for the conversion of Code Example 5.1.


Figure C.2: Boinq offers a browser for the selection of a specific feature type. This browser supports all types given by the Sequence Ontology vocabulary. A search bar is present.

Figure C.3: The track properties as displayed by boinq after the conversion of Code Example 5.1.

Figure C.4: A genomic region on the first chromosome of the Homo sapiens assembly (GRCh38) as displayed in the UCSC browser. This exact region has been downloaded in BED format and imported into an RDF triplestore using boinq.

Figure C.5: JBrowse displaying the genomic region downloaded from the UCSC browser. The specific BED file is featured in Code Example 5.8. JBrowse can access the SPARQL endpoint of the triplestore using the configuration shown in Code Example A.3.

References

[1] 1000Genomes. VCF (Variant Call Format) version 4.3. http://samtools.github.io/hts-specs/VCFv4.3.pdf. Online; accessed Dec 25 2015.
[2] H. Alakus, M. L. Babicky, P. Ghosh, S. Yost, K. Jepsen, Y. Dai, A. Arias, M. L. Samuels, E. S. Mose, R. B. Schwab, M. R. Peterson, A. M. Lowy, K. A. Frazer, and O. Harismendy. Correction: Genome-wide mutational landscape of mucinous carcinomatosis peritonei of appendiceal origin. Genome Med, 6(7):53, 2014.
[3] E. Antezana, W. Blonde, M. Egana, A. Rutherford, R. Stevens, B. De Baets, V. Mironov, and M. Kuiper. BioGateway: a semantic systems biology tool for the life sciences. BMC Bioinformatics, 10 Suppl 10:S11, 2009.
[4] Apache Jena. Apache Jena. https://jena.apache.org/. Online; accessed December 12 2015.
[5] Apache Jena. Fuseki: serving RDF data over HTTP. https://jena.apache.org/documentation/serving_data/. Online; accessed December 12 2015.
[6] R. Apweiler, A. Bairoch, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, D. A. Natale, C. O'Donovan, N. Redaschi, and L. S. Yeh. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res., 32(Database issue):D115–119, Jan 2004.
[7] C. Balch, J. S. Montgomery, H. I. Paik, S. Kim, S. Kim, T. H. Huang, and K. P. Nephew. New anti-cancer strategies: epigenetic therapies and biomarkers. Front. Biosci., 10:1897–1931, 2005.
[8] A. Bandrowski et al. The Ontology for Biomedical Investigations. PLoS ONE, 11(4):e0154556, 2016.
[9] J. Baran, B. S. Durgahee, K. Eilbeck, E. Antezana, R. Hoehndorf, and M. Dumontier. GFVO: the Genomic Feature and Variation Ontology. PeerJ, 3:e933, 2015.
[10] J. Baran, B. S. Durgahee, K. Eilbeck, E. Antezana, R. Hoehndorf, and M. Dumontier. GFVO: the Genomic Feature and Variation Ontology. PeerJ, 3:e933, 2015.
[11] T. Berners-Lee. The semantic web. Scientific American Magazine, 17, 5 2001.
[12] T. Berners-Lee. Linked data design issues. http://www.w3.org/DesignIssues/LinkedData, 2009. Online; accessed December 10 2015.
[13] J. Bolleman. sparql-bed, Github. https://github.com/JervenBolleman/sparql-bed. Online; accessed April 2016.
[14] J. Bolleman. sparql-vcf, Github. https://github.com/JervenBolleman/sparql-vcf. Online; accessed April 2016.
[15] J. Bolleman, C. J. Mungall, F. Strozzi, J. Barran, M. Dumontier, R. J. P. Bonnal, R. Buels, R. Hoendorf, T. Fujisawa, T. Katayama, and P. J. A. Cock. FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation. bioRxiv, 2014.
[16] R. Buels, E. Yao, C. M. Diesh, R. D. Hayes, M. Munoz-Torres, G. Helt, D. M. Goodstein, C. G. Elsik, S. E. Lewis, L. Stein, and I. H. Holmes. JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol., 17(1):66, 2016.
[17] A. Callahan, J. Cruz-Toledo, and M. Dumontier. Ontology-Based Querying with Bio2RDF's Linked Open Data. J Biomed Semantics, 4 Suppl 1:S1, Apr 2013.
[18] Z. Cheng, J. Guo, L. Chen, N. Luo, W. Yang, and X. Qu. Knockdown of EHF inhibited the proliferation, invasion and tumorigenesis of ovarian cancer cells. Mol. Carcinog., 55(6):1048–1059, Jun 2016.
[19] L. Clarke et al. The 1000 Genomes Project: data management and community access. Nat. Methods, 9(5):459–462, May 2012.
[20] T. Cui, Y. Chen, L. Yang, T. Knosel, O. Huber, M. Pacyna-Gengelbach, and I. Petersen. The p53 target gene desmocollin 3 acts as a novel tumor suppressor through inhibiting EGFR/ERK pathway in human lung cancer. Carcinogenesis, 33(12):2326–2333, Dec 2012.
[21] T. Cui, Y. Chen, L. Yang, T. Knosel, K. Zoller, O. Huber, and I. Petersen. DSC3 expression is regulated by p53, and methylation of DSC3 DNA is a prognostic marker in human colorectal cancer. Br. J. Cancer, 104(6):1013–1019, Mar 2011.
[22] F. Cunningham et al. Ensembl 2015. Nucleic Acids Res., 43(Database issue):D662–669, Jan 2015.
[23] R. Cyganiak and A. Jentzsch. The Linking Open Data cloud diagram. http://lod-cloud.net/, 2014. Online; accessed October 11 2015.
[24] M. Devisscher. Boinq, Github. https://github.com/mr-tijn/boinq2/tree/master/ontologies.
[25] M. Devisscher, T. D. Meyer, W. V. Criekinge, and P. Dawyndt. An ontology based query engine for querying biological sequences. EMBnet.journal, 19(B), 2013.
[26] DisGeNET. disgenet2r: an R package to explore the molecular underpinnings of human diseases. http://www.disgenet.org/ds/DisGeNET/html/dissemination/disgenet2r-JBI-valencia-2016.pdf.
[27] Dublin Core Metadata Initiative. DCMI Metadata Terms. http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms#. Online; accessed December 10 2015.
[28] A. M. Dworkin, T. H. Huang, and A. E. Toland. Epigenetic alterations in the breast: Implications for breast cancer detection, prognosis and treatment. Semin. Cancer Biol., 19(3):165–171, Jun 2009.
[29] K. Eilbeck, S. E. Lewis, C. J. Mungall, M. Yandell, L. Stein, R. Durbin, and M. Ashburner. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol., 6(5):R44, 2005.
[30] EMBL-EBI. About the Ensembl Project. http://www.ensembl.org/info/about/index.html, 2015. Online; accessed December 12 2015.
[31] EMBL-EBI. GFF/GTF File Format - Definition and supported options. http://www.ensembl.org/info/website/upload/gff.html. Online; accessed December 13 2015.
[32] EMBL-EBI. EMBL-EBI RDF Platform. https://www.ebi.ac.uk/rdf/, 2015. Online; accessed December 12 2015.
[33] M. Gonzalez-Pons and M. Cruz-Correa. Colorectal Cancer Biomarkers: Where Are We Now? Biomed Res Int, 2015:149014, 2015.
[34] B. I. Group. DisGeNET Database Information. http://www.disgenet.org/web/DisGeNET/menu/dbinfo, 2015. Online; accessed December 12 2015.
[35] Jannovar. Jannovar Home Page. http://charite.github.io/jannovar/. Online; accessed April 15 2016.
[36] JHipster. JHipster Home Page, Github. http://jhipster.github.io/. Online; accessed April 16 2016.
[37] S. W. Jiang, J. Li, K. Podratz, and S. Dowdy. Application of DNA methylation biomarkers for endometrial cancer management. Expert Rev. Mol. Diagn., 8(5):607–616, Sep 2008.
[38] Y. Kodama, M. Shumway, and R. Leinonen. The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res., 40:D54–56, Jan 2012.
[39] K. Lee, A. S. Lindsey, N. Li, B. Gary, J. Andrews, A. B. Keeton, and G. A. Piazza. β-catenin nuclear translocation in colorectal cancer cells is suppressed by PDE10A inhibition, cGMP elevation, and activation of PKG. Oncotarget, 7(5):5353–5365, Feb 2016.
[40] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078–2079, Aug 2009.
[41] N. Li, K. Lee, Y. Xi, et al. Phosphodiesterase 10A: a novel target for selective inhibition of colon tumor cell growth and β-catenin-dependent TCF transcriptional activity. Oncogene, 34(12):1499–1509, Mar 2015.
[42] Z. Li, X. Yu, Y. Wang, J. Shen, W. K. Wu, J. Liang, and F. Feng. By downregulating TIAM1 expression, microRNA-329 suppresses gastric cancer invasion and growth. Oncotarget, 6(19):17559–17569, Jul 2015.
[43] H. Lin, Y. Zhang, H. Wang, D. Xu, X. Meng, Y. Shao, C. Lin, Y. Ye, H. Qian, and S. Wang. Tissue inhibitor of metalloproteinases-3 transfer suppresses malignant behaviors of colorectal cancer cells. Cancer Gene Ther., 19(12):845–851, Dec 2012.
[44] T. Lindgren, T. Stigbrand, A. Raberg, K. Riklund, L. Johansson, and D. Eriksson. Genome wide expression analysis of radiation-induced DNA damage responses in isogenic HCT116 p53+/+ and HCT116 p53-/- colorectal carcinoma cell lines. Int. J. Radiat. Biol., 91(1):99–111, Jan 2015.
[45] K. W. Marshall, S. Mohr, F. E. Khettabi, N. Nossova, S. Chao, W. Bao, J. Ma, X. J. Li, and C. C. Liew. A blood-based biomarker panel for stratifying current risk for colorectal cancer. Int. J. Cancer, 126(5):1177–1186, Mar 2010.
[46] E. Merrill, S. Corlosquet, P. Ciccarese, T. Clark, and S. Das. Semantic Web repositories for genomics data using the eXframe platform. J Biomed Semantics, 5(Suppl 1 Proceedings of the Bio-Ontologies Spec Interest G):S3, 2014.
[47] S. M. Mitchell, J. P. Ross, H. R. Drew, T. Ho, G. S. Brown, N. F. Saunders, K. R. Duesing, M. J. Buckley, R. Dunne, I. Beetson, K. N. Rand, A. McEvoy, M. L. Thomas, R. T. Baker, D. A. Wattchow, G. P. Young, T. J. Lockett, S. K. Pedersen, L. C. Lapointe, and P. L. Molloy. A panel of genes methylated with high frequency in colorectal cancer. BMC Cancer, 14:54, 2014.
[48] N. Mohammed, M. Rodriguez, V. Garcia, J. M. Garcia, G. Dominguez, C. Pena, M. Herrera, I. Gomez, R. Diaz, B. Soldevilla, A. Herrera, J. Silva, and F. Bonilla. EPAS1 mRNA in plasma from colorectal cancer patients is associated with poor outcome in advanced stages. Oncol Lett, 2(4):719–724, Jul 2011.
[49] NIH. The Cancer Genome Atlas; program overview. http://cancergenome.nih.gov/abouttcga/overview. Online; accessed December 12 2015.
[50] N. F. Noy, M. Crubezy, R. W. Fergerson, H. Knublauch, S. W. Tu, J. Vendetti, and M. A. Musen. Protégé-2000: an open-source ontology-development and knowledge-acquisition environment. AMIA Annu Symp Proc, page 953, 2003.
[51] N. F. Noy, N. H. Shah, P. L. Whetzel, B. Dai, M. Dorf, N. Griffith, C. Jonquet, D. L. Rubin, M. A. Storey, C. G. Chute, and M. A. Musen. BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Res., 37(Web Server issue):W170–173, Jul 2009.
[52] J. B. O'Connell, M. A. Maggard, and C. Y. Ko. Colon cancer survival rates with the new American Joint Committee on Cancer sixth edition staging. J. Natl. Cancer Inst., 96(19):1420–1425, Oct 2004.
[53] J. H. Oh et al. IL-32 inhibits cancer cell growth through inactivation of NF-κB and STAT3 signals. Oncogene, 30(30):3345–3359, Jul 2011.
[54] N. K. Osborn and D. A. Ahlquist. Stool screening for colorectal cancer: molecular approaches. Gastroenterology, 128(1):192–206, Jan 2005.
[55] S. Qin, Y. Zhu, F. Ai, Y. Li, B. Bai, W. Yao, and L. Dong. MicroRNA-191 correlates with poor prognosis of colorectal carcinoma and plays multiple roles by targeting tissue inhibitor of metalloprotease 3. Neoplasma, 61(1):27–34, 2014.
[56] A. Regalado. EmTech: Illumina says 228,000 human genomes will be sequenced this year. MIT Technology Review.
[57] I. Rhee, K. E. Bachman, B. H. Park, K. W. Jair, R. W. Yen, K. E. Schuebel, H. Cui, A. P. Feinberg, C. Lengauer, K. W. Kinzler, S. B. Baylin, and B. Vogelstein. DNMT1 and DNMT3b cooperate to silence genes in human cancer cells. Nature, 416(6880):552–556, Apr 2002.
[58] M. Salmanidis, G. Brumatti, N. Narayan, B. D. Green, J. A. van den Bergen, J. J. Sandow, A. G. Bert, N. Silke, R. Sladic, H. Puthalakath, L. Rohrbeck, T. Okamoto, P. Bouillet, M. J. Herold, G. J. Goodall, A. M. Jabbour, and P. G. Ekert. Hoxb8 regulates expression of microRNAs to control cell death and differentiation. Cell Death Differ., 20(10):1370–1380, Oct 2013.
[59] Samtools. HTSJDK repository, Github. https://github.com/samtools/htsjdk. Online; accessed April 15 2016.
[60] Samtools. Samtools. http://www.htslib.org/. Online; accessed Dec 24 2015.
[61] D. Schweiger, Z. Trajanoski, and S. Pabinger. SPARQLGraph: a web-based platform for graphically querying biological Semantic Web databases. BMC Bioinformatics, 15:279, 2014.
[62] M. L. Speir, A. S. Zweig, et al. The UCSC Genome Browser database: 2016 update. Nucleic Acids Res., 44(D1):D717–725, Jan 2016.
[63] H. T. Stavnes et al. HOXB8 expression in ovarian serous carcinoma effusions is associated with shorter survival. Gynecol. Oncol., 129(2):358–363, May 2013.
[64] Z. D. Stephens, S. Y. Lee, F. Faghri, R. H. Campbell, C. Zhai, M. J. Efron, R. Iyer, M. C. Schatz, S. Sinha, and G. E. Robinson. Big Data: Astronomical or Genomical? PLoS Biol., 13(7):e1002195, Jul 2015.
[65] SYSTAP. Blazegraph Licensing. https://www.blazegraph.com/services/blazegraph-licensing/, 2015. Online; accessed December 13 2015.
[66] SYSTAP. Mapgraph Technology is now in Blazegraph GPU. https://www.blazegraph.com/mapgraph-technology/, 2015. Online; accessed December 12 2015.
[67] N. Thakkar, K. Kim, E. R. Jang, S. Han, K. Kim, D. Kim, N. Merchant, A. C. Lockhart, and W. Lee. A cancer-specific variant of the SLCO1B3 gene encodes a novel human organic anion transporting polypeptide 1B3 (OATP1B3) localized mainly in the cytoplasm of colon and pancreatic cancer cells. Mol. Pharm., 10(1):406–416, Jan 2013.
[68] UCSC. Frequently Asked Questions: Data File Formats. https://genome.ucsc.edu/FAQ/FAQformat.html. Online; accessed December 24 2015.
[69] W3C. Inference. http://www.w3.org/standards/semanticweb/inference. Online; accessed December 10 2015.
[70] W3C. OWL 2 Web Ontology Language Primer (Second Edition). http://www.w3.org/TR/2012/REC-owl2-primer-20121211/. Online; accessed October 11 2015.
[71] W3C. SKOS Primer. https://www.w3.org/TR/2009/NOTE-skos-primer-20090818/. Online; accessed April 15 2016.
[72] W3C. W3C Semantic Web Frequently Asked Questions. http://www.w3.org/2001/sw/SW-FAQ#swonbrowser, 2009. Online; accessed Sept 15 2015.
[73] W3C. Describing Linked Datasets with the VoID Vocabulary. http://www.w3.org/TR/void/, 2011. Online; accessed December 11 2015.
[74] W3C. SPARQL Query Language for RDF. http://www.w3.org/TR/sparql11-overview/, 2013. Online; accessed December 13 2015.
[75] W3C. Linked Data: What is Linked Data. http://www.w3.org/standards/semanticweb/data, 2014. Online; accessed October 11 2015.
[76] W3C. RDF. http://www.w3.org/RDF, 2014. Online; accessed Sept 14 2015.
[77] W3C. RDF 1.1 N-Triples: A line-based syntax for an RDF graph. http://www.w3.org/TR/n-triples/, 2014. Online; accessed October 11 2015.
[78] W3C. RDF 1.1 Primer: W3C Working Group Note 24 June 2014. http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140624/, 2014. Online; accessed Sept 15 2015.
[79] W3C. Tim Berners-Lee Biography. http://www.w3.org/People/Berners-Lee/#Bio, 2015. Online; accessed 14 Sept 2015.
[80] Q. Wang, Y. X. Tan, Y. B. Ren, L. W. Dong, Z. F. Xie, L. Tang, D. Cao, W. P. Zhang, H. P. Hu, and H. Y. Wang. Zinc finger protein ZBTB20 expression is increased in hepatocellular carcinoma and associated with poor prognosis. BMC Cancer, 11:271, 2011.
[81] J. N. Weinstein et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet., 45(10):1113–1120, Oct 2013.
[82] J. L. Wilding, S. McGowan, Y. Liu, and W. F. Bodmer. Replication error deficient and proficient colorectal cancer gene expression differences caused by 3'UTR polyT sequence deletions. Proc. Natl. Acad. Sci. U.S.A., 107(49):21058–21063, Dec 2010.
[83] S. Winawer, R. Fletcher, D. Rex, J. Bond, R. Burt, J. Ferrucci, T. Ganiats, T. Levin, S. Woolf, D. Johnson, L. Kirk, S. Litin, and C. Simmang. Colorectal cancer screening and surveillance: clinical guidelines and rationale-Update based on new evidence. Gastroenterology, 124(2):544–560, Feb 2003.
[84] Y. Yang, Z. Wang, Y. Zhou, X. Wang, J. Xiang, and Z. Chen. Dysregulation of over-expressed IL-32 in colorectal cancer induces metastasis. World J Surg Oncol, 13:146, 2015.
[85] H. Zhang, C. Hao, Y. Wang, S. Ji, X. Zhang, W. Zhang, Q. Zhao, J. Sun, and J. Hao. Sohlh2 inhibits human ovarian cancer cell invasion and metastasis by transcriptional inactivation of MMP9. Mol. Carcinog., Jul 2015.
[86] L. Zhang, J. Gao, L. Li, Z. Li, Y. Du, and Y. Gong. The neuronal pentraxin II gene (NPTX2) inhibit proliferation and invasion of pancreatic cancer cells in vitro. Mol. Biol. Rep., 38(8):4903–4911, Nov 2011.
[87] J. G. Zhao, K. M. Ren, and J. Tang. Zinc finger protein ZBTB20 promotes cell proliferation in non-small cell lung cancer through repression of FoxO1. FEBS Lett., 588(24):4536–4542, Dec 2014.