Semantic Prioritization of Novel Causative Genomic Variants in Mendelian and Oligogenic Diseases

Dissertation by

Imane Boudellioua

In Partial Fulfillment of the Requirements

For the Degree of

Doctor of Philosophy

King Abdullah University of Science and Technology

Thuwal, Kingdom of Saudi Arabia

February, 2019 2

EXAMINATION COMMITTEE PAGE

The dissertation of Imane Boudellioua is approved by the examination committee

Committee Chairperson: Prof. Robert Hoehndorf Committee Members: Prof. Stefan T. Arold, Prof. Xin Gao, and Prof. Dietrich Rebholz-Schuhmann 3

©February, 2019 Imane Boudellioua

All Rights Reserved 4

ABSTRACT

Semantic Prioritization of Novel Causative Genomic Variants in Mendelian and Oligogenic Diseases Imane Boudellioua

Recent advances in Next Generation Sequencing (NGS) technologies have facili- tated the generation of massive amounts of genomic data which in turn is bringing the promise that personalized medicine will soon become widely available. As a result, there is an increasing pressure to develop computational tools to analyze and interpret genomic data. In this dissertation, we present a systematic approach for interrogat- ing patients’ genomes to identify candidate causal genomic variants of Mendelian and oligogenic diseases. To achieve that, we leverage the use of biomedical data available from extensive biological experiments along with machine learning techniques to build predictive models that rival the currently adopted approaches in the field. We inte- grate a collection of features representing molecular information about the genomic variants and information derived from biological networks. Furthermore, we incorpo- rate genotype-phenotype relations by exploiting semantic technologies and automated reasoning inferred throughout a cross-species phenotypic ontology network obtained from human, mouse, and zebra fish studies. In our first developed method, named PhenomeNet Variant Predictor (PVP), we perform an extensive evaluation of a large set of synthetic exomes and genomes of diverse Mendelian diseases and phenotypes. Moreover, we evaluate PVP on a set of real patients’ exomes suffering from congenital hypothyroidism. We show that PVP successfully outperforms state-of-the-art meth- ods, and provides a promising tool for accurate variant prioritization for Mendelian diseases. Next, we update the PVP method using a deep neural network architecture 5 as a backbone for learning and illustrate the enhanced performance of the new method, DeepPVP on synthetic exomes and genomes. Furthermore, we propose OligoPVP, an extension of DeepPVP that prioritizes candidate oligogenic combinations in personal exomes and genomes by integrating knowledge from -protein interaction net- works and we evaluate the performance of OligoPVP on synthetic genomes created by known disease-causing digenic combinations. Finally, we discuss some limitations and future steps for extending the applicability of our proposed methods to identify the genetic underpinning for Mendelian and oligogenic diseases. 6

ACKNOWLEDGEMENTS

Completion of this dissertation was not going to see the light without the full support of many individuals. First of all, I would like to express my sincere gratitude to my advisor: Prof. Robert Hoehndorf for the valuable guidance and incredible support he has provided throughout my time as his PhD student, The constant motivation and encouragement from his side will always be remembered and appreciated. I would also like to thank my committee members for their availability, and constructive feedback which were determinant for the accomplishment of the work presented in this dissertation. I would also like to thank CBRC team for their companionship, prompt help, and for making the work environment so pleasurable and friendly. I dedicate this dissertation to my life-coach and role model, my father: Prof. Mohamed Salah Boudellioua. To my biggest source of love, my mother: Samiha Mezdour. To the always enthusiastic and supportive siblings of mine, Asma, Ahlem, Abderrahmane, and Abdulaziz. To my forever supportive and encouraging grandfa- ther in heaven: Rabeh Mezdour, May Allah bless his soul. I would also like to dedicate this dissertation to my dear friends for their uncondi- tional support and love. A special thanks go to my second family in KAUST; Dounia Oualialami and Dr. Hassan Bouchekif for making my time in KAUST so memorable and cheerful. Last but not least, I would like to thank every single person who helped me in my PhD journey with a nice word, a prayer, or even a reassurance smile. Above all, I owe it all to Almighty Allah for granting me the wisdom, health, and strength to undertake this work and enabling me to its completion. 7

TABLE OF CONTENTS

Examination Committee Page 2

Copyright 3

Abstract 4

Acknowledgements 6

List of Figures 10

List of Tables 11

1 Introduction 14

2 Background 17 2.1 The ...... 17 2.2 Human Genetic Variations ...... 18 2.2.1 Genetic Variation in the Coding regions of DNA ...... 20 2.2.2 Genetic Variation in the Non-Coding regions of DNA . . . . . 21 2.2.3 Consequences of Genetic Variation ...... 22 2.3 Human Genetic Diseases ...... 23 2.3.1 Monogenic diseases ...... 23 2.3.2 Oligogenic diseases ...... 24 2.4 Model organisms and genetic diseases ...... 24

3 Literature Review 26 3.1 Discovery of Genetic Variations ...... 26 3.2 Computational methods for the interpretation of genetic variants . . . 27 3.3 Phenotype-aware discovery of candidate disease-causing variants . . . 31 3.3.1 Representation of genotype-phenotype knowledge ...... 31 3.3.2 Resources of genotype-phenotype knowledge ...... 33 3.4 Phenotype-driven methods for predicting candidate disease and disease variants ...... 39 8 3.4.1 Methods for prioritization of candidate disease-causing genes . 40 3.4.2 Methods for prioritization of candidate disease-causing variants 44

4 Problem Statement 48 4.1 Hypothesis ...... 49 4.2 Personal Genome Data ...... 49 4.3 Approach ...... 51

5 Semantic Prioritization of Novel Causative Genomic Variants 53 5.1 Abstract ...... 53 5.2 Introduction ...... 54 5.3 Results and Discussion ...... 57 5.3.1 Integration of genotype and phenotype information predicts causal variants in whole exome and whole genome sequencing 57 5.3.2 PVP predicts causative variants in diagnosed cases ...... 64 5.3.3 Effects of datasets and evaluation method ...... 67 5.3.4 Impact of the use of model organism phenotypes on variant prioritization and disease discovery ...... 69 5.3.5 Conclusions ...... 70 5.4 Materials and Methods ...... 71 5.4.1 Updates to the PhenomeNET system ...... 71 5.4.2 Model organism phenotypes and similarity search ...... 73 5.4.3 Generation of model training data ...... 74 5.4.4 Generation of synthetic exomes/genomes ...... 75 5.4.5 Model training ...... 76 5.4.6 Model evaluation ...... 78 5.4.7 PVP ...... 79 5.4.8 Ethical approval ...... 79 5.4.9 Availability of data and material ...... 79

6 DeepPVP: phenotype-based prioritization of causative variants us- ing deep learning 80 6.1 Abstract ...... 80 6.1.1 Background: ...... 80 6.1.2 Results: ...... 81 6.2 Background ...... 81 6.3 Implementation ...... 83 9 6.3.1 Training and testing data ...... 83 6.3.2 Generation of synthetic patients ...... 86 6.3.3 Annotating variants ...... 86 6.3.4 Model and availability ...... 86 6.4 Results ...... 88 6.4.1 DeepPVP: phenotype-based prediction using a deep artificial neural networks ...... 88 6.4.2 Evaluating DeepPVP’s ability to find causative variants . . . . 90 6.5 Conclusions ...... 93 6.6 Availability and requirements ...... 94

7 OligoPVP: Phenotype-driven analysis of individual genomic infor- mation to prioritize oligogenic disease variants 95 7.1 Abstract ...... 95 7.2 Introduction ...... 96 7.3 Materials and Methods ...... 100 7.3.1 Digenic disease ...... 100 7.3.2 Interaction data ...... 101 7.3.3 PhenomeNET Variant Predictor ...... 101 7.3.4 Evaluation and comparison ...... 102 7.4 Results ...... 102 7.4.1 Prediction of biallelic and triallelic disease variants ...... 102 7.4.2 OligoPVP: Use of background knowledge to find causative com- binations of variants ...... 104 7.5 Discussion ...... 106

8 Conclusion 113 8.1 Summary ...... 113 8.2 Future Research Work ...... 114

References 116

Appendices 168 10

LIST OF FIGURES

2.1 Categories of genetic variations based on their molecular nature. . . . 20 2.2 Categories of genetic variations in the coding region of DNA . . . . . 21

5.1 Performance of PVP in retrieving causative variants in whole exome sequences. Results are compared against CADD, DANN, and GWAVA, and the phenotype-based tools Exomiser, Phevor and eXtasy...... 62 5.2 Performance of PVP in identifying causative variants in whole genome sequences using human phenotypes (PVP-Human), model organisms phenotypes (PVP-Model), and combined phenotypes (PVP), and com- parison of PVP to CADD, DANN, GWAVA, and Genomiser...... 62

6.1 Overview over the DeepPVP neural network model...... 89 11

LIST OF TABLES

3.1 Landscape of phenotype ontologies. Statistics as of May 2018. . . . . 32 3.2 Some of the widely used databases for characterizing human genes, phenotypes, and their genetic variants (as of May 2018)...... 37 3.3 Some databases characterizing model organisms’ genes, phenotypes, and their genetic variants with some statistics (as of May 2018). . . . 39 3.4 Summary of some of variant prioritization methods ...... 47

5.1 Overview of how many causative variants out of 8,746 exonic were recovered on rank 1 and within the top 10 ranks by PVP and PVP- Human, and comparison to CADD, DANN, GWAVA, Exomiser eXtasy, and Phevor. Analysis was performed on WES data. If a tool did not provide a score for a causative variant, we excluded the variant from this table; consequently, the total number of samples analyzed differs between the methods and the percentages reported are based on the number of samples for which the causative variant was ranked. . . . . 60 5.2 Overview of the performance of PVP, CADD, DANN, GWAVA and Exomiser in prioritizing causative variants in WGS data. We prioritize all variants in a VCF file resulting from WGS using the same models. Analysis is separated reflecting the performance of the various tools identifying exonic and non-exonic variants. For CADD, DANN, and GWAVA, we report only analysis results for which a prediction score is returned; consequently, total numbers are less than the total of 11,251 causative variants...... 61 5.3 Performance of PVP in variant prioritization in WGS data, separated by mode of inheritance of the disease...... 63

6.1 Comparison of top ranks of ClinVar variants as recovered from WES data filtered by MAF > 1% using different methods...... 92 12 7.1 Comparison of different variant prioritization systems for recovering biallelic variants. We split the evaluation in two parts, one in which we consider all variants and another in which we only consider variants for which we have background knowledge about their interactions. . . 110 7.2 Comparison of different variant prioritization systems for recovering triallelic variants. We split the evaluation in two parts, one in which we consider all variants and another in which we only consider variants for which we have background knowledge about their interactions. . . 110 7.3 Analysis of top ranks of variants by PVP summarized by disease . . . 110 7.4 Performance of PVP on all combinations present in DIDA database (All DIDA), combinations added before the PVP build date date of Feb 7th, 2017 (Old DIDA), and combinations added after the cutoff date of Feb 7th, 2017 (New DIDA) ...... 111 7.5 Cases of DIDA combinations improved by OligoPVP in comparison to PVP. OligoPVP incorporates protein-protein interactions in the prioritization of variant tuples. We compare the results of applying OligoPVP to the ranks obtained using PVP on individual variants. . 112

S1 PVP model features...... 169 S2 Stratified cross-validation results when training our models on 80% of ClinVar pathogenic variants...... 170 S3 Experiments when adding noise to phenotypes associated with 8,522 synthetic whole exome sequences. Each WES contains a single causative

variant v1 and is associated with phenotypes P1. First, we randomly add the complete set of phenotypes from a randomly chosen disease to

P1 and record how well we recover v1. Second, we randomly remove 1 each phenotype in P1 with a probability of 3 and record how well v1 is recovered. In the third experiment, we add another causative variant

v2 to the WES (which is causative for the disease that is phenotypically

most similar to P1, and is itself associated with phenotypes P2), and

use P1 to recover v1. We record how often we identify v1 and v2 at the first or top 10 ranks. Finally, we perform the same experiment using

as phenotypes P1 ∪ P2...... 170 13 S4 The table shows the 36 cases from UK10K dataset diagnosed with Congenital hypothyroidism which show genes already implicated in the disease within the top 20 hits. Patients are broadly stratified into three groups: CCH, central congenital hypothyroidism; GIS, patients with gland-in-situ and no anatomical signs of dysplasia which include likely dyshormonogenesis; DG, thyroid dysgenesis. Heterozygous state; 0/1, homozygous state 1/1. SIFT categories T and D correspond to Dele- terious (sift score ≤ 0.05), and Tolerated (sift score > 0.05). Polyphen categories: D; Probably damaging (polyphen score ≥ 0.909), P; possi- bly damaging (0.447 to 0.909) B; benign (polyphen score ≤ 0.446). . . 171 S5 Results of variant and gene prioritization methods for a subset of UK10K rare thyroid dataset (EGAS00001000131) with confirmed disease- associated mutations. In cases THY5329044, THY5068932, THY5068933, and THY5068934, the variants were considered to contribute to but not entirely explain the whole phenotype observed (Schoenmakers et al. 2016)...... 172 S1 DeepPVP model features...... 177 S1 Cases of DIDA combinations missed by OligoPVP prioritization in- corporating protein-protein interaction network data, in comparison to ranks of individual variants of each digenic combination ranked by DeepPVP...... 178 14

Chapter 1

Introduction

The ever-dropping cost of DNA sequencing, aided with the advances in Next Gen- eration Sequencing (NGS) technologies, has facilitated the generation of a massive amounts of genomic data by the day. Nowadays, Whole Exome Sequencing (WES) which involves sequencing the protein-coding part of DNA, as well as Whole Genome Sequencing (WGS) which involves sequencing the complete DNA, are no longer used only in a research setting but are increasingly being adopted in clinics. Personalised medicine has the promise of providing tailored diagnosis and treatment for patients[1]. Personalised medicine focuses on the individual variability of genetic makeup of indi- viduals for determining diagnosis and treatment strategies. Given the recent advances of sequencing technologies and the availability of an increasing amount of personal genomic data, there is a pressing need to develop computational methods to increase the understanding of the human genome and its effect on the human health. This is especially the case for the study of pathobiology of human diseases where the genetic underpinning may play a vital role in the disease development and its severity. There are already many clinical studies that established relations between human DNA sequence variants and rare and common genetic dis- eases by studying families or cohorts of affected and unaffected individuals []. There are also detailed catalogues of DNA sequence variants from the general population that can be used to pinpoint common and rare DNA sequence variants []. Although there have been many studies linking DNA sequence variants to genetic diseases, there is a big challenge in establishing a causal link between DNA sequence variants and 15 genetic diseases [2]. Guidelines for investigating causality of DNA sequence variants have been discussed to avoid an increase in false findings; these guidelines aim to establish a link between a variant and a biological function (which may be a loss of function in a protein, a dysregulation of gene expression, or similar) and the observed phenotype [2]. As there are many possible mechanisms for an observed phenotype, computa- tional methods may contribute to identifying causative variants by incorporating prior knowledge about DNA sequence variants, gene regulation, gene function, and how they may contribute to an organism’s phenotype. Such computational approaches, if shown to be accurate, can provide a significant aid in facilitating diagnosis for patients of genetic diseases, or identify candidate variants for functional validation. Here, we aim to tackle the problem of identifying candidate disease-causing vari- ants computationally using background knowledge from human and model organisms studies. We develop a set of machine learning algorithms that explicitly encode biolog- ical background knowledge and use this knowledge to improve variant prioritization. Specifically, we develop methods that can be used to investigate personal genomes for causative variants. Variant prioritization involves ranking genomic variants on the basis of specific measure, where higher ranks indicate stronger likelihood of in- ferred causality for the observed disease phenotypes. The main contribution of this dissertation can be summarized as follows:

1. A review on the state-of-the-art methods for computational disease genes dis- covery and disease causing variants discovery, as well as genotype-phenotype resources and standards of annotations.

2. A novel random-forest based method for the prioritization of candidate causative genomic variants in Mendelian diseases, named, PhenomeNet Variant Predic- tor (PVP), that intelligently integrates knowledge about phenotypes and their associations with the genotype as obtained from human and model organisms 16 studies. We demonstrate that PVP outperforms state-of-the-art methods for variant prioritization on both simulated data and real patient data.

3. A novel deep learning method, named DeepPVP, that is based on PVP but utilizes a deep artificial neural network and corrects for several biases in pheno- type data. We demonstrate that DeepPVP provides an enhanced performance for variant prioritization in Mendelian diseases, as compared to PVP, and other state-of-the-art methods.

4. A novel deep learning method, named OligoPVP, to prioritize candidate oli- gogenic combinations of variants causal for oligogenic diseases. To the best of our knowledge, OligoPVP is the first method for the prioritization of vari- ant combinations and it encodes further background knowledge from biological networks to discover candidate disease modules.

The remainder of this document is organized as follows. Chapter 2 provides a background on genetic variations and common biological terminology used to describe their various effects and associations with human genetic diseases. Chapter 1 presents a comprehensive review on computational methods for disease genes discovery and the prediction of functional effects of DNA sequence variants and their disease causality. Chapter 4 is the problem statement and describes the hypothesis, the data, and the approach for addressing the problem. Chapter 5 introduces PVP as a method for the prioritization of candidate disease-causing variants for Mendelian diseases. Chapter 6 extends the work on PVP with a deep learning architecture and differences in training along with experiments supporting the improved performance of DeepPVP over PVP method. Chapter 7 describes OligoPVP, a method for the prioritization of oligogenic variant combinations that describe oligogenic disease modules. Finally, Chapter 8 summarizes the research work and provides and outlook to future work. 17

Chapter 2

Background

2.1 The Human Genome

Genetics involves “the study of genes, genetic variation, and heredity in living organ- isms” [3]. Genetics as science was first established when Gregor Mendel systematically analyzed traits of plant peas from one generation to the next and expressed his find- ings as inheritable units responsible for the manifestation of observable characteristics in the plants [4]. It was not until the beginning of the 1900s when the term Gene was coined by Johannsen [5] to refer to inheritable units of an organism. On a molecular level, genes are comprised of stretches of deoxyribonucleic acid (DNA), a molecule first isolated by Friedrich Miescher in 1869 [6]. The DNA consists of two chains of nucleotides that form a double helix where each nucleotide is composed of a phos- phate group, a deoxyribose sugar, and one of four nucleobases (adenine (A), cytosine (C), guanine (G), or thymine (T)) [7]. DNA codes for the genetic information of an organism. The flow of the genetic information is referred to as the central dogma of molecular biology [8]. It involves the process of converting the sequence of DNA nucleotides into ribonucleic acid (RNA), called transcription, and the process of converting RNA into , called trans- lation. Proteins are macromolecules consisting of chains of residues. Proteins differ by their sequence of amino acid residues, which determines their 3- dimensional structure and functions. Proteins carry a wide range of functions within organisms such as catalyzing chemical reactions, transporting and storing molecules, 18 and cell signaling. The human genome consists of around 3 billion DNA nucleotides, and DNA is arranged into 22 pairs of and one pair of sex , so that genes are usually present twice (once on each chromosome). The variant forms at one location () in a chromosome are called alleles. Zygosity specifies the degree of similarity between two alleles at the same locus. When the nucleotide or nucleotide sequence at a locus (a single position or a region) is identical, the zygosity is referred to as homozygote; if the nucleotide sequence is different, as a heterozygote. A gene can be defined as a series of nucleotides in a DNA region that encodes a polypeptide chain [9]. A gene is composed of intron and exon regions. An intron is a stretch of DNA within the gene that is removed by RNA splicing, and hence, introns do not directly code for proteins [10]. An exon is the portion of the gene sequence that is retained after DNA transcription. The entire set of exons in an organism is called an exome. It is estimated that the human genome contains between 20,000 and 25,000 genes on all human chromosomes [11]. The allele present at a specific locus is called a genotype, a term introduced by Johannsen in 1903 [5].

2.2 Human Genetic Variations

The human DNA is approximately 99.9% identical in all humans [12]. The 0.1% DNA difference is what makes each human unique. Genetic variation is the term that is widely used to describe differences between human genomes. A key determi- nant in the study of genetic diversity is allele frequency in a given population. The frequency of alleles can be affected by several factors or mechanisms of evolution. In natural selection, the effect of alleles carried by individuals on their traits determines their survival and reproduction, hence affecting the frequency of those alleles in a population. For instance, in positive selection, individuals carrying alleles that equip them with more adaptive traits tend to survive and reproduce better, passing on their 19 alleles to their offspring and increasing the allele frequency. On the other hand, in negative selection, individuals carrying alleles with disadvantageous traits tend not to survive or reproduce as well. Hence the alleles are less likely to be passed down to next generations and eventually become rare or absent in the population. Genetic variations can be classified according to their size and form into small- scale variations and large-scale variations [13]. Small-scale genetic variations con- stitute changes in nucleotides sequence that affect a relatively small region in the nucleotides sequence, usually less than 1000 nucleotides. These small-scale variations can be further classified according to their nature into single nucleotide base substitu- tion, insertion, and deletion variations. Single nucleotide base substitution, commonly referred to as Single Nucleotide Polymorphism (SNP), involves the phenomenon of substituting a single nucleotide base with another. A substitution variation is either a transition (interchange of the nucleic acid base belonging to the same chemical category, such as A↔G or C↔T), or a transversion (interchange of the nucleic acid base belonging to the opposite chemical category, such as A↔T or C↔G). Inser- tion or deletions, commonly referred to as indels, involve the addition or removal of one or more nucleotides base letters. Large-scale genetic variations affect a genomic region larger than 1000 nucleotides and are commonly referred to as structural varia- tions. Structural variations include two categories; Copy Number Variations (CNVs) or chromosomal rearrangement. CNVs happen when a large genomic region of DNA is copied multiple times and the number of copies differs between individuals. Chro- mosomal rearrangements may incur insertion, deletion, duplication, or inversion of a large sequence of DNA nucleotide bases. Figure 2.1 summaries these classes of genetic variations. Based on the position and nature of the genetic variations, their functional consequences vary accordingly. As any two genomes will likely be different, variants are identified based on agreed reference genomes [14]; if a genome has a different allele at a locus than the reference genome, it is a variant with respect to the reference. 20

Figure 2.1: Categories of genetic variations based on their molecular nature.

2.2.1 Genetic Variation in the Coding regions of DNA

Genetic variations in the coding regions of DNA can be exonic or intronic. Intronic variants appear in the intron region of a gene, while exonic variants appear in the exon region of the gene. Depending on the nature of variants in the coding sequence, substitution variants can be classified as nonsynonymous or synonymous variants. Nonsynonymous variants are those that alter the protein amino acid sequences during DNA translation, while synonymous variants are those that do not alter the encoded protein amino acid sequences (all with respect to a reference genome). Figure 2.2 illustrates the different classes of the described variants. Even though synonymous variants do not alter the amino acid sequence, it has been reported that synonymous variants could substantially contribute to disease risk through various mechanisms [15]: these variants may affect mRNA splicing, mRNA stability, protein conformation, and thus, influence protein expression and function [15]. 21

Figure 2.2: Categories of genetic variations in the coding region of DNA

2.2.2 Genetic Variation in the Non-Coding regions of DNA

Historically, genetic variations in the coding region of DNA received much attention in research regarding the study of their functional consequences for diseases. In contrast, genetic variants in the non-coding region were believed to have a relatively small role compared to variants in the coding DNA. Recently, increasing attention has been placed on the study of non-coding variants and their effects on tissue-specific gene expression [16]. Large-scale projects such as the Genotype-Tissue Expression (GTEx) project [17] and the Encyclopedia of DNA Elements (ENCODE) [18] aim to document the role of non-coding variants in the regulation of gene expression. In particular, the ENCODE project assigned biochemical functions for over 80% of the genome, where the majority of them lie in the non-coding regions of the DNA [18]. For instance, transcription factor binding sites, a type of DNA binding sites, which are part of DNA non-coding regions are crucial for gene expression. Some of these sites can sometimes be found hundreds of thousands nucleotides away from its associated gene, which adds to the complexity of identifying them. 22 2.2.3 Consequences of Genetic Variation

The efficient discovery and characterization of genetic variations became feasible re- cently owed to the advances of genome sequencing technologies and availability of a reference human genome [13]. The human reference genome is a database of DNA nu- cleotides sequences collected and assembled from multiple individuals. For example, the GRCh37 reference genome is assembled from 13 individuals [14]. Certain alleles may have deleterious effects if they reduce the reproductive fit- ness of their carriers. Hence, they are less likely to be passed down to generations, which eventually result in the removal of such alleles from the population. There- fore, these alleles tend to be rare in a given population. Interestingly, the human genome contains millions of genetic variants wherein thousands of them are rare [19], and about 100 deleterious loss-of-function variants [20]. A deleterious variant may become pathogenic for a particular disease depending on the function of the gene it affects, the genetic background, and environmental conditions [2]. Some of the terms used to describe genetic variations and their relation to diseases are defined as follows [2]:

• Pathogenic: contributes mechanistically to disease, but is not necessarily fully penetrant (i.e., may not be sufficient in isolation to cause disease).

• Associated: significantly enriched in disease cases compared to matched con- trols.

• Deleterious: reduces the reproductive fitness of carriers, and would thus be targeted by purifying natural selection.

• Protective: has a beneficial effect on the carrier, reducing or eliminating risk of a disease. 23 2.3 Human Genetic Diseases

A phenotype is defined as “the observable physical properties of an organism; these include the organism’s appearance, development, and behavior” [21]. It is believed that “Virtually all human diseases, except perhaps trauma, have a genetic compo- nent” [22]. In some diseases, the genetic component is very large and inheriting the genetic component will very likely cause the disease. Other diseases may have a “modest” genetic component, and inheriting the genetic component would influ- ence the risk of developing the disease, but not necessary its manifestation. For such diseases, external environmental factors (such as smoking or exposure to certain chemicals) play a role in addition to the genetic component. Genetic diseases, although share a genetic origin, differ in the mechanisms of disease etiology and development by their associated genetic variants. Some genetic diseases may arise from the presence of single genetic variants while other genetic conditions require the presence of a set of multiple genetic variants that together give rise to the genetic disease or syndrome. Moreover, these variants differ in their population frequency as well as their effect size or penetrance (i.e., the proportion of individuals who harbor the variants and develop the associated disease). Typically, variants with a large effect size tend to be deleterious and rare in frequency in a population. On the other hand, variants with low or modest penetrance may not be deleterious and hence have a higher frequency in the population.

2.3.1 Monogenic diseases

Monogenic diseases, also referred to as Mendelian diseases, are diseases that are caused by a mutation in a single gene. An example of such diseases is Cystic fi- brosis, a recessive disease that can be caused by a single homozygote variant in the cystic fibrosis conductance transmembrane regulator (CFTR) gene [23]. Monogenic diseases can exhibit different forms of inheritance modes. Dominant inheritance arises 24 when a single heterozygote allele is sufficient to cause the disease. Recessive inheri- tance requires a single homozygote allele in order to develop the disease. The genetic variants that cause Mendelian diseases are observed to be rare in the population with a deleterious large effect size [24]. Mendelian diseases have been studied extensively, and their molecular basis is reported in multiple publicly available repositories [25]. The World Health Organization estimates that there are over 10,000 monogenic dis- eases, where the global prevalence of such diseases at birth is approximately 10/1000 [26]. Given the rarity of these diseases, diagnosis is a challenge with respect to cost, time, and required expertise.

2.3.2 Oligogenic diseases

Oligogenic diseases are diseases caused by variants affecting multiple genes. These variants could vary in their frequency in the population and effect size. An example of an oligogenic disease is Hirschsprung disease which is influenced by variants in two genes, EDNRB and RET [27]. It is more challenging to identify the genetic underpinning of such diseases given the complexity in determining the set of genetic variants that contribute to the development of the diseases.

2.4 Model organisms and genetic diseases

The classical approach for studying genetic diseases is to determine the “genotype” given the “phenotype”. This approach is referred to as forward genetics. Inversely, reverse genetics involves determining the “phenotype” given the “genotype”. Re- verse genetics is typically conducted through experimentation on model organisms as disease models. Some examples of experiments include gene knockout, knockin, or knockdown. Gene knockout experiments target a specific gene locus to completely “si- lence” the gene, resulting in no gene expression. Inversely, gene knockin experiments modify a specific gene locus by substitution or insertion of a nucleotide sequence to 25 observe its effects on the organisms’ phenotypes. Knockdown experiments involve modifying a specific gene to reduce its expression levels. The study of model or- ganisms as disease models is providing valuable insight into inferring disease-causing genes and variants since the human genome shares a remarkable similarity with other model organisms. Many of the genes found in the human genome are also found in other organisms such as mice, yeast, and zebra fish, and thus, research conducted on model organisms is playing a significant role in identifying disease genes and de- scribing the relationship between genotype and phenotype [28]. Not only have model organisms aided in the discovery of disease genes, but they have also helped in char- acterizing various phenotypes and traits associated with many genetic variants. Even with new technologies emerging for the determination of disease-gene associations, model organisms are still needed to help uncover the biological mechanisms and pro- cesses involved in genetic diseases [29]. 26

Chapter 3

Literature Review

3.1 Discovery of Genetic Variations

Detection of genetic variations present in a human requires experiments and subse- quent computational processing. Sanger sequencing, a first generation approach was the primary method for sequencing for two decades [30] and is still applied to se- quence short regions of DNA. New and improved sequencing technologies have now been developed [31] and these technologies vary in their cost, run time, error rate, and length of DNA sequence fragments. The sequencing machine produces data files con- taining DNA nucleotide sequences, called reads. These DNA reads are then processed computationally to reconstruct the original DNA sequence in order. Genomic alignment is the process of aligning the DNA short reads against the reference genome, matching them to specific genomic positions. Several methods have been proposed to tackle this task [32, 33, 34, 35]. The resulting aligned reads are typically stored as a Sequence Alignment/Map (SAM) file [36] and as a compressed Binary Alignment/Map (BAM) file. Once the alignment stage is finished, variant calling can be initiated to determine the difference between the aligned reads and the reference genome. During this step, variants and their genotypes are identified, where the confidence of each called variant depends on its coverage; the number of reads available for its genomic position. SAMtools [37] and the Genome Analysis Toolkit (GATK) [38] are widely used bioinformatics systems for variant calling. The detected variants can be stored in several file formats including Variant Calling Format (VCF) 27 file [39], and the Binary variant Call Format (BCF) [37]. A typical VCF file for a sequenced exome may contain tens of thousands or hundreds of thousands of variants. An individual genome contains millions of variants in its VCF file, and computational methods are usually required to provide an interpretation of this data.

3.2 Computational methods for the interpretation of genetic variants

A widely adopted approach for providing insight into genotype-phenotype relations in complex diseases is Genome-Wide Association Studies (GWAS) [40]. GWAS has had a significant impact on the field of human genetics by successfully linking thousands of genetic variants to the predisposition of human diseases, phenotypes and traits [40]. GWAS has aided in associating many common genetic variations to genetic diseases. A GWAS involves examining a large set of genetic variants across multiple individuals to discover, with some statistical power, the likelihood of an association between some variants(s) and certain traits or diseases, in a hypothesis-free agnostic manner [40]. An example of a traditional GWAS method uses statistical methods, for example, chi-squared test, to investigate for each variant whether its allele frequency is significantly altered between a set of healthy and affected individuals [41]. Many of GWAS findings are publicly and freely accessible through dedicated databases [42, 43, 44, 45, 46]. Although GWAS are widely acknowledged for discovering disease-variant associations, their major limitation falls under their inability to identify and explain disease-causality of genetic variants [47]. While GWAS scan potentially hundreds of thousands of variants in the genomes of a cohort to provide a relatively unbiased identification of variants with moderate or low effect size, they can only establish correlation but not causation [47]. Many of the loci reported by GWAS lie in the noncoding regions of the DNA and are located far away from discovered genes, thereby limiting the potential to explain disease mechanisms [47]. 28 In the case of monogenic diseases, family studies involve identifying a narrow set of rare variants with functional effects of the candidate genes in multiple affected pa- tients [48]. Family studies incorporate filters such as allele frequency, functional effects of variants, and segregating variants using family relations as background knowledge. Sequencing of parent-child trios has been successful in identifying causative de novo mutations that are present only in the child [49]. The success of family-based studies depends on the availability of multiple affected families. The use of WES for clin- ical diagnosis of monogenic disorders was demonstrated in many large studies [50]. Now, there are several examples in the literature for similar studies which uses WGS to identify candidate causative variants with success rate typically ranging between 20%-50% [50]. The relatively low success rate can be explained by the human ex- ome containing around 30,000 variants with an estimated 100 loss-of-function variants present [51]. Therefore, filtering those variants based on criteria such as allele fre- quency would still leave hundreds of variants that need to be investigated in order to pinpoint the disease-causing variant. This forms the main challenge in designing mechanisms for the prioritization of disease variants for rare diseases [19]. Several variant analysis tools that predict their functional effects have been devel- oped which are facilitating the discovery of functional and deleterious variants [52]. These methods typically leverage evolutionary, biochemical, or sometimes structural information to predict deleterious variants. They differ in the approaches used to encode for evolutionary information and conservation of genomic sequences as well as their target genetic variants (i.e., coding, noncoding, or all variants). Methods such as Sorting Intolerant from Tolerant (SIFT) [53], likelihood ratio test (LRT) [54], Genomic Evolutionary Rate Profiling: (GERP++) [55], and the recently published LINSIGHT [56], use exclusively evolutionary information about sequences to predict their functional effects. SIFT is based on the assumption that variants appearing in highly conserved DNA regions are likely to be deleterious. By applying multiple se- 29 quence alignment techniques of homologous proteins, and calculating the probability of all possible amino acid changes tolerated, SIFT establishes a probability cutoff for determining whether a variant will be predicted as tolerated or deleterious. Such a concept can be applicable genome-wide, although methods such as SIFT, and LIN- SIGHT are designed for assessment of variants in the coding and noncoding DNA respectively. Others methods such as Polymorphism Phenotyping (PolyPhen2) [57] and SNAP2 [58], incorporate the effect of variants on secondary structures of pro- teins in addition to conservation features. For instance, PolyPhen2, utilizes a naive Bayes classifier coupled with evolutionary and structural information about the vari- ants to make its predictions. Such methods aim to evaluate whether the variants will negatively alter the protein’s structure properties such as its hydrophobic core and interactions with ligands. Although such features are useful in prediction of delete- riousness, they are limited in its application to only nonsynonymous variants (i.e., variants that alter the protein amino acid sequence). Therefore, they can not be ap- plied genome-wide. Another limitation of these methods is that they depend greatly on the accuracy and phylogenetic scope of the used multiple sequence alignment tech- niques [2]. It has been shown that such methods yield a false positive rate of around 10%-20% [59]. Other approaches for prediction of functional effects of coding and noncoding DNA variants include those trained in a supervised manner on known disease-causing vari- ants obtained from publicly available databases including ClinVar [60] and HGMD [61]. These approaches encode multiple features including precomputed scores ob- tained from methods such as SIFT and PolyPhen2. Examples include FATHMM- MKL [62], MutationTaster [63], GWAVA [64], the Combined AnnotationDependent Depletion (CADD) [65], and DANN [66]. By directly or indirectly (through the use of precomputed scores from other methods) encoding different features representing conservation of sequences and various properties of DNA sequences, these methods 30 aim to provide a generalized approach towards the interpretation of functional conse- quences that takes into account different evidence scores from different methods. In doing so, they aim to increase the scope of applicability of variants (not limited to nonsynonymous variants), and the information sources for predicting deleteriousness of variants (for example, by incorporating regulatory effects from ENCODE [67] and JASPAR [68]). For instance, CADD trains a support vector machine (SVM) classifier using over a thousand features to predict the deleteriousness of both SNPs and indels [65]. DANN was shown to outperform CADD by training a deep neural network to capture the non-linear relationships among the features associated with the training data using the same features and training data used for CADD [66]. Although such methods achieve good benchmark performance, they suffer from some limitations. Since a typical genome contains around 100 LOF variants which completely inac- tivate around 20 genes, it is quite possible that such variants will be identified as deleterious. Furthermore, methods that score pathogenicity will often not be useful in examining the effects of gain-of-function variants [19]. Inspired by a large number of tools developed to assess the deleteriousness of ge- netic variants, ANNOVAR [69] and the Variant Effect Predictor (VEP) [70] have been developed to provide functional annotations of genetic variants for various methods. These tools can be used to filter out variants based on their observed frequencies from multiple resources such as the 1000 Genome Project as well as filtering based on the variants functional effect scores. The key contribution of these tools is that they combine multiple types of information, from several different methods; however, they do not use these tools directly but rather precompute the prediction scores and annotate variants based on a lookup table. 31 3.3 Phenotype-aware discovery of candidate disease-causing variants

3.3.1 Representation of genotype-phenotype knowledge

Given the massive amount of genotype-phenotype data derived from phenotyping projects of human and model organisms, it is becoming evident that expressing pheno- types using natural language has its limitations. Such limitations include introducing ambiguity, and its dependability on the background knowledge of the expert docu- menting these phenotypes [71]. To overcome these limitations, there have been efforts in creating a standardized common language that facilitates applicability across mul- tiple domains as well as the interpretation of findings obtained from different experi- mental pipelines. One of the most prominent ways to describe phenotypes is through the adoption of structured and semantically formalized ontologies that deal with the issues of both standardization and computability. There are multiple ontologies def- initions in the literature, but in the context of biomedical sciences, ontologies are expected to exhibit a set of features to enable their applications [71]. These features are classes and relations, a domain vocabulary, textual definitions and descriptions, and lastly, formal definitions and axioms to enable automated reasoning. One of the early developed bio-ontologies is (GO) which describes the attributes of genes and gene products in a species-independent manner [72]. GO encompasses three ontologies; Biological Process, Molecular Function, and Cellular Component, providing annotations for genes, gene products or gene-product groups. Biological Process component describes the chemical or physical processes of the involved genes or gene products. The success of GO has encouraged the development of many ontolo- gies including those describing phenotype-genotype relations for multiple eukaryotic organisms. The Human Phenotype Ontology (HPO) is a widely used formal ontology that provides a standardized vocabulary of phenotypic clinical abnormalities encoun- 32 tered in human diseases [73]. HPO was preceded by the creation of the Mammalian Phenotype ontology (MP) [74] that describes phenotypic information related to the mouse and other mammalian species. Table 3.1 lists some phenotype ontologies for humans, several model organisms, plants, and micro-organisms.

Ontology # classes Human phenotype ontology (HPO) [73] 15,381 Dictyostelium phenotype ontology (DDPHENO) [75] 1,058 Drosophila phenotype ontology (DPO) [76] 506 Mouse pathology ontology (MPATH) [77] 889 Mammalian Phenotype Ontology (MP) [74] 30,316 Worm Phenotype Ontology (WBPhenotype) [78] 2,435 Ascomycete Phenotype ontology (APO) 619 Flora phenotype ontology (FLOPO) [79] 28,430 Fission Yeast Phenotype Ontology (FYPO) [80] 9,870 Cell microscopical phenotype ontology (CMPO) [81] 813 Ontology for microbial phenotypes (OMP) [82] 1,120

Table 3.1: Landscape of phenotype ontologies. Statistics as of May 2018.

In an effort to unify phenotype descriptions, enable interoperability, and facilitate automated reasoning, the Phenotype And Trait Ontology (PATO) [83] framework was built. Not only does PATO represents an ontology scheme, but it also provides an Entity-Quality (EQ) formalism to express phenotype statements [84]. In an EQ setup, a phenotype is defined as Entity, which is characterized by a Quality that cap- tures the properties or attributes of the Entity. The Entity comes from the reference ontology such as GO or anatomy ontology, whereas the Quality component comes from PATO ontology. For instance, to describe the trait heart morphology using the EQ method, the class heart (from an anatomy ontology) is used as Entity and the class morphology (from PATO) is used as Quality. Another variation is to represent a phenotype by linking two entities with a single quality. The EQ formalism for de- scribing phenotypes works extremely well in most scenarios. The PATO framework has facilitated ontology-based data integration, especially in a phenotypic context. Applications include semantic search across different ontologies [85, 86], and enrich- 33 ment over ontologies [87]. Intuitively, the need to enable interoperability between phenotype ontologies to facilitate data integration can be fulfilled through the use of standardized identifiers for naming classes and relations in an ontology. Subsequently, these standardized class names can be used to annotate database entities, and enable efficient data integration and mining across these databases. There has been a substantial effort for collecting and classifying human diseases. For instance, the Disease Ontology (DO) [88] is a standardized framework for provid- ing descriptions of human diseases and phenotypes. There is an overlap frequently observed between phenotype and disease ontologies. Such overlap can be seen in HPO and DO where both ontologies contain the class Alzheimer disease (HP:0002511) as well as the class hypotrichosis or generalized hypotrichosis. These classes may be re- garded as a “phenotype” if occurring with other signs or symptoms, or as a disease in itself if isolated. This renders the use of ontology mapping techniques for the identification of overlapping or equivalent classes in different ontologies feasible [89]. The development of multiple disease and phenotype ontologies has a wide variety of applications including clinical diagnostics, mapping of phenotypes between human and model organisms, and standardizing vocabulary in clinical databases.

3.3.2 Resources of genotype-phenotype knowledge

Human genotype-phenotype resources

The abundance of genetic and phenotypic data aided by the advances in NGS tech- nologies has led to the establishment of multipledisease databases that characterize genotype-phenotype relations and provide information about genetic variation and their effect on human health. These databases vary in size, the area of focus, or over- all structural representation of information [90]. They document the functional and pathogenic significance of human genetic variations within the scientific community. Examples of these databases for genetic diseases, along with some of their statistical 34 summaries are listed in Table 3.2. In the case of monogenic diseases, the following databases focus mainly on genetic variations and their pathogenicity for mostly rare diseases.

• Online Mendelian Inheritance in Man

The Online Mendelian Inheritance in Man (OMIM) is a freely accessible, regu- larly updated comprehensive database of human genes and genetic phenotypes related to Mendelian diseases [91]. The online database was established in 1985 and made public by 1987. In 1995, the National Center for Biotechnology Infor- mation (NCBI) developed the OMIM database. OMIM provides a sophisticated search tool where users can search for diseases and phenotypes by DO terms or HPO terms.

• Orphanet

Orphanet is another publicly available multilingual knowledgebase on rare dis- eases and orphan drugs [92]. It provides a catalog of rare diseases classified and cross-referenced with other databases. It also includes an inventory of or- phan drugs at all stages of development along with other resources that aim to assist clinicians in providing a diagnosis. Recently, in a collaboration be- tween Orphanet and EBI, the Orphanet Rare Disease Ontology (ORDO) [89] was developed to capture relationships between genes and diseases linking other terminologies, databases, or classifications. ORDO is maintained by Orphanet and is updated monthly.

• ClinVar

One of the most comprehensive databases of genetic variations for the inter- pretation of human genetic variants and their phenotypes is ClinVar. ClinVar [60] is a publicly accessible database that archives the relationships between 35 reported human genetic variants, their phenotypes, and their effect on human health along with the evidence supporting these effects. ClinVar utilizes the Sequence Ontology [93] to represent molecular consequence, type of variation, and location of genes. ClinVar also provides the mapping of HPO terms to the phenotypes associated with the genetic variants reported in the database. Each reported variant is accompanied by a value representing its clinical significance. That is, the clinical significance metric annotates whether the variants are be- nign, likely benign, likely pathogenic, pathogenic, affects drug response, risk factor, or protective. Entries in ClinVar are mapped to known disease pheno- types such as OMIM and Orphanet which further facilitates the analysis and interpretation of these variants by computational tools.

• The Human Gene Mutation Database

The Human Gene Mutation Database [61] is another gold-standard resource of known mutations that cause or are associated with human genetic diseases. It was established in 1996 and evolved to become a central unified disease-oriented mutation repository. It has two versions, one that is publicly freely available, and a subscription-based alternative. HGMD contains the mapping of variants to several ontologies such as GO, DO, and HPO.

In the case of oligogenic diseases, the following resources curate information about genetic variations and their associated phenotypes.

• Digenic diseases database

The Digenic diseases database (DIDA) [94] is a recent curation effort that aims to catalog detailed information on genes and their associated genetic variants causative of digenic diseases, which are the simplest form of oligogenic diseases. It provides annotations for HPO, variants consequences, and zygosity. 36 • GWAS Central

GWAS Central [43] is a database that provides a comprehensive collection of summary level findings from genetic association studies in human, mainly GWAS. Established in 1998 as The Human Genome Bi-Allelic Sequence (HG- BASE) database, it was the first publicly available SNP database. It incorpo- rates data of known SNPs from external databases such as the Single Nucleotide Polymorphism Database (dbSNP) [95] as well as other GWAS databases. HPO is used to annotate phenotypes observed in the studies reported in GWAS Cen- tral.

• GWAS Catalog

The GWAS Catalog [42] is a repository of a manually curated collection of all reported GWAS. It was founded in 2008 by the National Human Genome Research Institute (NHGRI) and is currently maintained jointly by NHGRI and the European Bioinformatics Institute (EMBL-EBI). Unlike the previously mentioned databases, the GWAS Catalog maps traits using the Experimental Factor Ontology (EFO) [96]. This is because traits in this database are diverse and include disease, disease markers, and non-clinical phenotypes.

• DisGeNET

Given the enormous number of disease databases available, it is becoming evi- dent that integrating all these resources into a single platform is necessary for the scientific community. DisGeNET [97] was created as a publicly available queryable discovery platform that integrates information on gene-disease associ- ations and variant-diseases associations from the medical literature and multiple public data repositories. DisGeNET includes information on human Mendelian and complex diseases and also animal disease models. There are several ontolo- gies that are adopted by DisGeNET. For instance, PANTHER Protein Class 37 Ontology [98] was used for classification of genes, while DO, HPO, and MP terms were used to provide annotations for disease associations and phenotypes of human and mouse models respectively. Moreover, an ontology representing DisGeNET associations were created to systematically integrate gene-disease association data, called DisGeNET association type ontology. This ontology formally structures associations found in external databases.

Database Summary Statistics Online Mendelian Inheritance 15,885 genes and 8,593 diseases in Man (OMIM) [91] ClinVar [60] 30,180 genes for 406,930 variants Orphanet [92] 3,573 genes and 5,856 diseases DisGeNET [97] 561,119 associations between 17,074 genes and 20,370 diseases and phenotypes The Digenic diseases database 448 variants in 169 genes causative of 54 diseases (DIDA) [94] DECIPHER [99] 4,223 sequence variants and 76,783 diseases and phe- notypes The Human Gene Mutation 9,073 genes and 224,642 variants Database (HGMD) [61] GWAS Central [43] 2,974,967variants and 829 diseases GWAS Catalog [42] 61,173 SNP-trait associations

Table 3.2: Some of the widely used databases for characterizing human genes, phe- notypes, and their genetic variants (as of May 2018).

Model organisms genotype-phenotype resources

In the study of model organisms, there are several species-specific databases that col- lect information about associations between genes and phenotypes obtained from pub- lished articles. All these databases fall under the notion of Model Organism Databases (MODs) which aim to provide researchers with in-depth information on gene, func- tion, and phenotypes associated with a given model organism. All MODs use Gene Ontology (GO) to describe gene functions, the processes they are involved in, and their cellular location. For example, WormBase [100] is a repository of Caenorhabditis 38 elegans and other nematodes and provides annotations with Worm Phenotype Ontol- ogy (WFO) [78], and DO among others. FlyBase [101] is a database for Drosophila melanogaster and it annotates its content using several ontologies including Molecular Interactions Ontology [102], DO, Sequence Ontology [103], and Fly Anatomy ontol- ogy [104]. Another MOD is the Saccharomyces Genome Database (SGD) [105] which provides genetic and biological data for the budding yeast Saccharomyces cerevisiae. DictyBase [106] is a MOD for Dictyostelium discoideum known as slime mold. Table 3.3 presents some model organisms databases. The primary databases utilized within the scope of this work are:

• The International Mouse Phenotyping Consortium

To support research on model organisms, specifically mouse, the International Mouse Phenotyping Consortium (IMPC) was founded in 2011 with an aim to systematically phenotype 20,000 knockout mouse strains and provide their respective data publicly and freely [107]. The outcome of these experiments will be deposited at the Knockout Mouse Project Repository [108] and the European Mutant Mouse Archive [109]. The goal of IMPC is to provide “the first truly comprehensive functional catalog of a mammalian genome”.

• The Mouse Genome Informatics

The Mouse Genome Informatics (MGI) is one of the most widely used resources for data on laboratory mice. MGI is comprised of the Mouse Genome Database (MGD) [110] which collects phenotype and functional annotations of mouse genes and alleles, and the Gene Expression Database (GXD) [111] for mouse developmental expression data.

• The Zebrafish Information Network

The Zebrafish Information Network (ZFIN) [112] is a publicly accessible database that collects zebrafish genetic, genomic and developmental data. 39 Database Summary Statistics The International Mouse Phe- 4,918 phenotyped genes notyping Consortium (IMPC) [107] The Mouse Genome Informat- 24,401 gene records with protein sequence data and ics (MGI) [110] 61,901 genotypes with phenotype annotations The Zebrafish Information 37,141 gene records and 114,097 phenotypes Network (ZFIN) [112] The Rat Genome Database 45,742 gene records and 446,973 variants (RGD) [113] Flybase [101] 248,194 gene records and 247 diseases The Saccharomyces Genome 6,448 gene records and 146,128 phenotype associations Database (SGD) [105] WormBase [100] 235,375 gene records, and 434,920 phenotype associa- tions

Table 3.3: Some databases characterizing model organisms’ genes, phenotypes, and their genetic variants with some statistics (as of May 2018).

3.4 Phenotype-driven methods for predicting candidate dis- ease genes and disease variants

The standardization of phenotype ontologies, and the development of cross-species anatomy and physiology ontologies such as GO and Uberon [114], along with auto- mated reasoning over these ontologies have enabled effective computational integra- tion of phenotype data across species. Model organisms phenotypes can now directly contribute to the advancement of translational research. Some notable applications include predicting the causative genes of rare human diseases, and, more recently, prioritizing causative variants from undiagnosed clinical exomes [115]. Next, we dis- cuss some computational approaches for the discovery of candidate disease genes and disease variants. 40 3.4.1 Methods for prioritization of candidate disease-causing genes

The heterogeneous nature of human genetic diseases and their phenotypes has been a great challenge in designing computational tools for the analysis of phenotypic data and their association with the genotype. Since the molecular basis of many of the genetic diseases is still unknown, there has been a great effort in developing compu- tational methods for candidate genes prioritization based on functional annotation, sequence-derived features, or gene-expression data [116, 117, 118, 119, 120]. Other approaches include sequencing multiple affected patients and looking for the affected genes [121], and linkage analysis [122]. However, these classical approaches result in hundreds of candidate genes, making it difficult to assess their outcomes. Overall, it remains a major challenge in genetics to identify candidate disease genes [52, 123]. Methods for prioritizing candidate disease genes can be classified into two broad categories; guilt by association based methods, and de novo gene discovery meth- ods. Guilt by associations methods assume that “genes with related functions tend to share properties such as genetic or physical interactions” [124]. In other words, guilt by association is based on the hypothesis that interacting genes are likely to be associated with the same phenotypes. Guilt by association methods exploit func- tional or topological similarity for inferring candidate disease genes [125, 126]. These methods use data available in various biological databases, coupled with statistics to identify candidate genes in silico. An example of such approach involves generating a likelihood for every gene in the biological network and ranking them accordingly as candidate genes. Several methods have been developed based on guilt by asso- ciations such as those based on finding direct neighbors of known disease genes or computing the shortest path to known disease genes [127, 128]. These methods vary by the utilized data resources or the strategies adopted for computing the associa- tions. One interesting method that adopts the core concept of guilt by association 41 is walking the interactome [129, 130]. It is based on the assumption that genes that share the same or similar disease phenotype tend to lie in proximity to each other in the protein-protein interaction network. Walking the interactome utilizes global network-similarity measures through the application of random walk on graphs [131], in a protein-protein interaction network to capture the relationships between genes and prioritize candidate disease genes. Random walk with restart is formally defined as:

pt+1 = (1 − r)W pt + rp0 (3.1)

where W is the column-normalized adjacency matrix of the network and pt is the probability vector for each node at time step t. It was demonstrated that this ap- proach outperformed previously proposed methods by achieving an area under the ROC curve of up to 98% on simulated data for predicting candidate disease-genes of Mendelian diseases. In the case of oligogenic diseases, Disease Module Detection (DIAMOnD) algorithm [132] is an example of a guilt by association approach. Given a set of known disease genes, the algorithm aims to infer unknown disease genes and identify diseases modules using a derived p-value for estimating connectivity sig- nificance for genes in the human interactome, and ranking candidate disease genes accordingly. DIAMOnD was demonstrated to achieve better performance compared to random walk approach on a set of diseases genes representing 70 complex dis- eases. Spectral-based graph clustering techniques have also been demonstrated in the latest Disease Module Identification DREAM Challenge to precede other proposed approaches in performance [133]. One of the main limitations of guilt by association methods arises from its poten- tial bias introduced through building the assumption on functional genomics. That is, since the knowledge of various biological networks is incomplete and imbalanced (i.e, some genes are well studies where others are still uncharacterized), having such inconsistencies on the reference genes set will affect the outcomes of the method, 42 subsequently neglecting the potential role of uncharacterized genes and selecting the well-annotated genes [124]. In other words, such an approach will only discover dis- ease genes consistent with what we already know about the disease and its molecular basis. Another limitation comes from the use of a single biological network for dis- covery which in reality will not be able to capture all the relevant relations. For instance, using coexpression data alone will fail to detect many effects of posttran- scriptional modifications, while relying on protein-protein interaction data alone will fail to capture transcriptional regulation. Therefore, incorporating multiple biologi- cal networks such as coexpression networks and protein-protein interaction networks into a single heterogeneous network will aid in overcoming such limitation to infer stronger relationships between disease genes [52]. On the other hand, de novo gene discovery methods aim to discover new associ- ations between genes and diseases without the prior assumption of “guilt” between genes. One of the de novo gene discovery methods is text mining. Text mining techniques can be universally applied to identify candidate genes from literature and databases in rare and complex genetic diseases [134, 135]. For example, the T-HOD database is a text-mined candidate disease gene database for three cardiovascular diseases: , , and diabetes (type 1 and type 2), that is regularly updated by the text mining system and also verified by domain experts [136]. An- other interesting method [137] integrates text mining of literature and data-mining of human gene expression data to prioritize candidate disease genes based on their expression in disease-affected tissues. It is also worth mentioning that text mining can be viewed as a guilt by association method if the intended use is to search for co-occurrences of genes given the occurrences of some genes and phenotypes, thus inferring the co-occurred genes as also disease-genes. Text mining-based methods usually do not yield significant accuracy in performance and are generally regarded as a complementary approach that is utilized in conjunction with other methods like 43 those based on the ontological representation of biological data. With the availability of many ontologies in biology and biomedical domains, it is evident that the transformation of biological data into formal representation us- ing the Web Ontology Language OWL [138] provides an unprecedented opportunity for the accurate prediction of disease candidate genes [139]. Multiple ontology-based approaches have been proposed as forms of de novo gene discovery methods for the prioritization of candidate disease genes. Phenomizer [140] is a tool that leverages the use of semantic similarities to measure phenotype similarities between genes and Mendelian diseases by incorporating the Human Phenotype Ontology (HPO) anno- tations. This tool provides an accurate performance in comparison to simple term matching approaches for the identification of candidate disease genes for a set of disease phenotypes. Inspired by the availability of phenotyping data obtained from many experimental studies from model organisms and the need for characterizing their direct effect on human health and physiology, PhenomeNET [86] was estab- lished. PhenomeNET was one of the first attempts to link model organisms pheno- types with human disease phenotypes. It is represented in the form of a cross-species phenotype network that is constructed by transforming phenotype ontologies into formal representation and applying semantic similarity to the resulting phenotypic network. PhenomeNet currently is comprised of 124,730 complex phenotype nodes originating from several model organisms databases (i.e. yeast, fish, worm, fly, rat, slime mold and mouse) as well as human disease phenotypes obtained from OMIM and OrphaNet databases. It was demonstrated that PhenomeNET is an effective and accurate method for the prioritization of disease candidate genes as well as the identification of orthologous genes and genes involved in the same pathway. An- other method that incorporates model organisms phenotypic data is Phenodigm [85]. Phenodigm integrates phenotypic data from human, mouse, and zebra fish and ap- plies semantic similarity between each pair of concepts from the resulting aligned 44 cross-species ontology. In this work, we adopt PhenomeNet ontology (for human, mouse, and zebrafish), and we compute semantic similarity using Resnik’s measure [141] defined as:

Sim(c1, c2) = max [− log p(c)] (3.2) c∈S(c1,c2) where c1 and c2 are the two concepts being evaluated and S(c1, c2) is the set of concepts that subsume both c1 and c2. This similarity measure is coupled with the Best Matching Average (BMA) strategy for combining pairwise similarities, where BMA averages the best similarity values for each term. We use Resnik’s information content [141] computed over the corpus of gene-phenotype associations as specificity measure for each class in the phenotype ontology by considering only the content of the least common subsumer. A least common subsumer is the deepest shared parent between two classes in the ontology. The use of gene-level approaches to computationally determine candidate disease genes is useful in restricting the search space of causal genetic variations in per- sonal exomes or genomes. However, an exome/genome may contain multiple variants affecting the candidate disease genes loci, and hence determining the variants of sig- nificance remain a challenging task. The problem is even more prominent in the case of oligogenic diseases, where combinations of genetic variants in multiple genes need to be assessed. Therefore, investigating variant-level evidence to further pinpoint the candidate causal disease variants is needed.

3.4.2 Methods for prioritization of candidate disease-causing variants

Following up from the success of phenotype-driven disease-genes prioritization meth- ods, several methods were proposed that aim to discriminate phenotypically relevant genetic variants rather than just predicting the functional effects of variants. It is therefore required to have standardized phenotype annotations from HPO and MP 45 in order to utilize such methods. One of the early acknowledged methods for disease- causing variants prioritization that takes on the phenotypic relevance of genes is eXtasy [142]. This approach integrates, in a machine learning model, multiple delete- rious prediction scores along with haploinsufficiency prediction score; the phenomena where having only a single functional copy of a gene is insufficient to maintain normal gene function. This integration leads to a phenotype-specific prioritization scheme based on HPO as derived from Phenomizer. However, eXtasy is limited to prioritiz- ing nonsynonymous single-nucleotide variants (nSNVs) only. Exomiser [115] was later proposed as another method for prioritizing variants obtained through WES data. Ex- omiser leverages the use of an integrated cross-species phenotypic network of human and model organisms phenotypes for determining genes that are phenotypically rele- vant for the patient phenotypes. Exomsier was recently extended to prioritize WGS data including variants in the non-coding region of DNA, called Genomiser [143]. Phenotype Driven Variant Ontological Re-ranking Tool (Phevor) [144] is another ap- proach that aims to reprioritize WES data variants as outputted from any variant prioritization tool. Phevor combines gene function, disease, and phenotype knowledge as expressed in multiple biomedical ontologies to rerank the variants. These methods suffer from varied false-positive incident rates that are inherited from the features they exploit for training, or from the design patterns of training that potentially lead to overfitting, and hence, inconsistent performance [145]. Table 3.4 summaries some information about some of the methods used for the interrogation of human genomes and exomes for pathogenicity and disease causality. It is therefore much needed to develop a unified approach that is applicable for all SNPs and indels present in an exome or genome while exploiting genotype-phenotype knowledge to better capture candidate causative variants for the patient’s observed phenotypes. Such intelligent variant prioritization method would be designed to dis- tinguish between disease-causing variants from the rest of the irrelevant deleterious 46 variants or benign variants based on the patient’s phenotypes. 47

Method Main Features Approach Training Data Sorting Intolerant from Toler- Evolutionary Probabilistic NA ant (SIFT) [53] Likelihood ratio test (LRT) Evolutionary Likelihood ratio NA [54] test Genomic Evolutionary Rate Evolutionary Maximum likeli- NA Profiling: GERP++ [55] hood estimation LINSIGHT [56] Evolutionary Probabilistic with NA a generalized lin- ear model Polymorphism Phenotyping Evolutionary, structural Nave Bayes classi- UniProt (PolyPhen2) [57] fier SNAP2 [57] Evolutionary, structural Neural network UniProt, HumVar, OMIM FATHMM-MKL [62] Evolutionary, ENCODE [67] Multiple kernel HGMD learning MutationTaster [63] Evolutionary, ENCODE [67], Bayes classifier HGMD JASPAR [68] GWAVA [64] GERP, ENCODE [67], JAS- Random Forest HGMD PAR [68] annotations The Combined Annota- SIFT, PolyPhen2, GERP, Support Vector Observed tionDependent Depletion ENCODE [67], and more Machine (SVM) from En- (CADD) [65] sembl EPO DANN [66] SIFT, PolyPhen2, GERP, Deep neural net- Observed ENCODE [67], and more work (DNN) from En- sembl EPO eXtasy [142] SIFT, PolyPhen-2, Mutation- Random Forest HGMD Taster, LRT, haploinsuffi- ciency, phenotypic similarity Exomiser [115] SIFT, PolyPhen-2, Mutation- Logistic Regres- HGMD and Taster, phenotypic similarity sion ClinVar Genomizer [143] SIFT, PolyPhen-2, Mutation- Logistic Regres- HGMD and Taster, phenotypic similarity sion ClinVar Phenotype Driven Variant GO, MP, DO, and HPO on- Probabilistic NA Ontological Re-ranking Tool tology mapping (Phevor) [144]

Table 3.4: Summary of some of variant prioritization methods 48

Chapter 4

Problem Statement

Given the abundance of genetic variants that are present in a human exome or genome, it is very crucial to accurately distinguish the disease-causing variants from the rest of benign and irrelevant functional variants. It is estimated that a human genome contains around 100 loss-of-function alleles, with about 20 genes completely inacti- vated [51]. It is clear that identifying the causative variants among this large set of variants is much needed in order to facilitate effective diagnosis of genetic diseases. In light of the discussion presented in the previous chapters concerning trends and challenges in prioritizing candidate disease-causing variants for genetic diseases, the problem addressed in this dissertation is as follows:

Given a set of genetic variants obtained from WES or WGS data, can we effectively combine:

• Molecular properties of the variants

• Gene-disease associations of respective variants

• Disease modules identified from molecular networks

to prioritize candidate causative variants for Mendelian disease, and explain their mechanisms in oligogenic disease? 49 4.1 Hypothesis

The problem described in this thesis tackles variant prioritization for two categories of human genetic diseases, monogenic (Mendelian) diseases, and oligogenic diseases. Given the nature of these diseases, the problem will be addressed for both scenarios based on the following hypothesis:

• Variant prioritization for Mendelian diseases In the case of Mendelian diseases, we would like to test whether combining molecular information about the human genetic variants that describe their functional and deleteriousness effects, along with gene-disease association in- formation encoded via PhenomeNet phenotypic semantic similarity of human and model organisms genes, could effectively and accurately identify candidate disease-causing variants in personal genomes for patients of Mendelian diseases.

• Variant prioritization for oligogenic diseases This hypothesis involves testing whether combining molecular information about the genetic variants that describe their functional and deleteriousness effects, along with gene-disease association information as encoded via PhenomeNet phenotypic semantic similarity of human and model organisms genes, along with information encoded in biological networks, could effectively and accu- rately identify candidate disease-causing variants in personal genomes for pa- tients of oligogenic diseases.

4.2 Personal Genome Data

In this thesis, we will exploit the use of publicly accessible sequencing data from multiple resources. This will be vital for conducting a systematic evaluation of our proposed predictive models for the identification of candidate disease-causing variants of patients of Mendelian and oligogenic diseases. 50 • The 1000G Project The 1000G Project [146] is an international endeavor that ran between 2008 and 2015 with an aim to create the largest public catalog of diverse human population variation and genotype data. The project publicly and freely pro- vides WES and WGS data for over 2000 anonymous samples across different ethnicities with some related individuals. The project server provides raw BAM files along with Variant Calling Format (VCF) files for the scientific commu- nity to use for the analysis and interpretation of the human genome. The 1000G Project data will be used to create synthetic patients using the known pathogenic variants associated with Mendelian and oligogenic diseases for com- parative analysis purposes. Synthetic exomes/genomes for Mendelian diseases can be created by inserting annotated pathogenic variants from ClinVar, while for the case of digenic diseases (i.e, the simplest form of oligogenic diseases), we utilize the DIDA database for such a task. In both cases, the synthetic genome/exome will form synthetic patients with phenotypes defined as those annotated with the inserted pathogenic variant(s).

• The UK10K Project The UK10K project [147] is a population-based project to sequence nearly 10,0000 individuals from the UK population. The project aims to characterize rare and low-frequency variants and their association to human traits and ge- netic diseases. The project is divided into two main branches. The first branch focuses on providing WGS data for studying the effects of genetic variation in various traits including those of clinical relevance in around 4000 healthy indi- viduals. The second branch focuses on analyzing WES data from around 6000 individuals harboring rare genetic diseases, severe obesity or neurodevelopmen- tal disorders. We have obtained access to exomes of patients of rare diseases, and we will use the samples of patients of congenital hypothyroidism, with a 51 confirmed molecular diagnosis, to evaluate our proposed models and compare them with state-of-the-art approaches for variant prioritization.

4.3 Approach

The integrative approach that we will adapt to test the hypothesis will leverage the use of multiple machine learning methods for building accurate predictive models for prioritization of disease-causing variants. The features used to encode the biological information discussed in our hypothesis can be summarized as follows:

• Functional effects of variants The molecular information about the variants will be characterized using their respective prediction scores as obtained from CADD, GWAVA, and DANN methods which will assess the functional effects of variants. These three scores will collectively be used as features for representing functional effects of genomic variants.

• PhenomeNet phenotypic semantic similarity The cross-species phenotypic network derived by PhenomeNet generates an ac- curate representation of gene-disease association captured from reported studies of human, mouse, and zebra fish. We collect the mutant model organism phe- notypes for mouse from the MGI database, human phenotypes from the HPO database, and zebrafish phenotypes from the ZFIN database. After that, we compute the semantic similarity between a patient’s phenotypes and the collec- tion of model organisms and human phenotypes using Resnik’s measure with the Best Matching Average (BMA) strategy for combining pairwise similari- ties. We use Resnik’s information content measure computed over the corpus of gene-phenotype associations (from human, mouse, and zebrafish) as a specificity measure for each class in the phenotype ontology. 52 • Disease (phenotype) module identification The information encoded in various biological networks such as protein-protein interactions (PPIs) and genetic co-expression provide an important clue about the disruption of various protein functions and gene expression especially in the context of oligogenic diseases that manifest from mutations occurring in multiple genes [148]. One of the challenges in genetic diagnosis for oligogenic diseases is the identification of disease (or if unknown, phenotypes) modules that are unique to the patient under study. In this work, we utilize STRINGdb, partic- ularly PPIs to develop a method to prioritize combinations of genetic variants that are likely causal for oligogenic diseases. We determine these combinations of variants based on their respective genes connectivity in a PPI network. 53

Chapter 5

Semantic Prioritization of Novel Causative Genomic Variants

I published the work presented in this chapter at PLOS Computational Biology (Pa- per citation: Boudellioua I, Mahamad Razali RB, Kulmanov M, Hashish Y, Bajic VB, Goncalves-Serra E, Schoenmakers N, Gkoutos GV, Schofield PN, and Hoehndorf R. (2017) Semantic prioritization of novel causative genomic variants. PLOS Com- putational Biology 13(4): e1005500. https://doi.org/10.1371/journal.pcbi.1005500). My contribution involves the curation of genomic and phenotypic data, the design of the methodology and formal analysis, the investigation and validation of findings, the visualization of results, the implementation of the software, and the writing of the manuscript. The full author contribution is as follows: Conceptualization: RH PNS GVG, Data curation: YH IB MK RBMR NS EGS, Formal analysis: IB RH GVG PNS, Funding acquisition: GVG RH NS, Investigation: IB RBMR MK YH, Methodology: IB GVG PNS RH VBB, Project administration: RH PNS GVG, Re- sources: NS EGS, Software: IB MK RBMR RH, Supervision: RH VBB PNS GVG NS, Validation: IB NS EGS PNS GVG RH, Visualization: IB, Writing original draft: IB GVG PNS RH, and Writing review and editing: IB RBMR NS VBB PNS GVG RH.

5.1 Abstract

Discriminating the causative disease variant(s) for individuals with inherited or de novo mutations presents one of the main challenges faced by the clinical genetics com- 54 munity today. Computational approaches for variant prioritization include machine learning methods utilizing a large number of features, including molecular informa- tion, interaction networks, or phenotypes. Here, we demonstrate the PhenomeNET Variant Predictor (PVP) system that exploits semantic technologies and automated reasoning over genotype-phenotype relations to filter and prioritize variants in whole exome and whole genome sequencing datasets. We demonstrate the performance of PVP in identifying causative variants on a large number of synthetic whole exome and whole genome sequences, covering a wide range of diseases and syndromes. In a retrospective study, we further illustrate the application of PVP for the interpretation of whole exome sequencing data in patients suffering from congenital hypothyroidism. We find that PVP accurately identifies causative variants in whole exome and whole genome sequencing datasets and provides a powerful resource for the discovery of causal variants.

5.2 Introduction

Since the first successful identification of disease-causing variation from whole exome sequencing in 2010 [149], impressive advances have been made in the field of next gen- eration sequencing and its related analysis, with the aim of fulfilling the promise of whole exome (WES) and whole genome (WGS) sequencing for personalized medicine. Such approaches have revolutionized our ability to identify the genetic underpinnings of disease as well as improve our capacity to stratify patient populations and diagnose them in a more accurate and timely manner [150]. A recent critical study provided some objective estimates of the efficiency of diagnoses by traditional medical ge- netics diagnostic approaches, with 54% of referred patients undiagnosed [151]. The introduction of next generation sequencing (NGS) technologies in clinical settings is anticipated to improve diagnosis efficiency, and between 13% [152] to 50% of those remaining undiagnosed are likely to receive a molecular diagnosis following WES or 55 WGS [153]. Nevertheless, the success rate of the state-of-the-art tools for identifying causative variants using WES data range between 22% to 25% [154, 155], and WGS data in a similar range [156] depending on the disease type and the availability of sequence data from family members. The identification of the causative disease mutations in an individual patient re- mains a challenge due to the complexity and scale of the task. An individual exome might contain 20,000-30,000 variants with respect to the reference genome; a third of which might comprise non-synonymous variation [157]. Many thousands of variants in an average genome might be unique, and on average 20 genes may have complete loss of function (LOF) mutations [158] whose physiological consequences for the bearer are unpredictable [159]. Adding to the complexity of analysis are contingencies such as oligogenicity and haploid insufficiency. Oligogenicity is the phenomenon where ad- ditional genes modify the phenotypic effect of a variant in a primary gene, so that the overall disease phenotype is the consequence of multiple variants in the same genome. Haploid insufficiency describes a situation where loss of function of one allele of a gene in a normal diploid cell or individual results in an abnormal phenotype. For many genes, loss of function of one allele is not significant, but for some genes, dosage is critical and phenotypic effects are seen with the loss of one allele. Consequently, in haploid insufficiency, a heterozygote with a loss of function allele may develop an ab- normal phenotype [160]. Given these phenomena, it is clear why finding the “needle in a stack of needles” [161] remains one of the key challenges in fully utilizing WES and WGS data for personalized medicine. The main approaches taken to prioritize the pathogenic consequences of genomic mutations involve variant calling to identify variants from raw sequencing data, fil- tering by variant quality, filtering by minor allele frequency, and then successive assessment of variant properties based on its potential to affect protein integrity and function, for example, by the insertion of nonsense codons or indels, compromis- 56 ing the function of active sites, protein-protein interactions, dominant or recessive inheritance, physico-chemical properties, sequence conservation [162], or analysis of changes in the DNA regulatory domains [163]. Although the majority of the methods currently used to assess pathogenicity of a variant are focused on exonic variation, there are also methods that examine non-coding sequences, notably GWAVA, CADD, DANN, FATHMM-MKL, and others [164, 165, 166, 167, 168]. However, many of these methods alone are not able to identify the causative variants underlying a patient’s phenotype and require additional investigation, such as analysis of additional family members, to look for de novo variants, identification of shared rare variants in unrelated individuals with similar diseases [169], and identity- by-descent inference [150]. Prioritizing disease candidates by using phenotypic similarity to known diseases and characterized non-human disease models can potentially add an additional layer of discrimination to gene prioritization, but until recently the ability to computation- ally establish formal phenotypic relatedness at scale was not possible. Two crucial developments have enabled the computational integration and comparison of pheno- types: the systematic application of the PATO framework [170, 171] and the devel- opment of the cross-species anatomy ontology Uberon [114]. While PATO provides a uniform way of describing phenotypes, Uberon can be used to systematically de- scribe and relate anatomical structures between species. In 2011, PhenomeNET [172] was developed to exploit phenotype-genotype associations observed in humans and model organisms and prioritize candidate causal genes based on patient phenotypes. PhenomeNET makes use of axioms and formal definitions in the major phenotype ontologies using the PATO ontology [170] to formally integrate species-specific phe- notypes [173, 174, 175, 176, 177]. It gathers phenotype data from model organism and human genotype-phenotype databases, applies measures of phenotypic similar- ity and then systematically compares them across species. PhenomeNET has been 57 demonstrated to provide a high degree of predictive accuracy for the discovery of animal models of human disease [178], novel pathways [179], gene function [180], and druggable therapeutic targets [181]. Since the introduction of PhenomeNET, sev- eral further methods have been been developed that take advantage of this approach and utilize phenotypic similarity between patients and gene-phenotype associations in public databases to improve variant prioritization for WES datasets [182, 183, 142]. We developed PhenomeNET Variant Predictor (PVP) to prioritize causal vari- ants based on comparing patient phenotypes with gene-phenotype associations made in humans and model organisms. PVP combines two main sources of information: molecular and phenotypic. We use molecular information from multiple pathogenic- ity prediction tools to identify the pathogenicity of a variant and the phenotypic information to determine whether a variant is causative. PVP facilitates a highly accurate identification of causative variants from both WES and WGS datasets, and we demonstrate the performance of PVP on a set of synthetic and real whole exome and whole genome sequences. Our results demonstrate that PVP significantly out- performs other state of the art tools revealing that phenotypic similarity can provide a powerful approach for prioritizing causal variants.

5.3 Results and Discussion

5.3.1 Integration of genotype and phenotype information pre- dicts causal variants in whole exome and whole genome sequencing

PVP has been developed to facilitate the identification of causative variants in ge- nomic data (whole exome or whole genome). We consider a variant to be causative if it is both pathogenic (evaluated based on molecular information) and involved in developing the patient’s phenotype (evaluated based on the gene–disease similarities 58 provided by PhenomeNET). Variants may be pathogenic but not causative if they are not involved in the pathogenesis of the patient’s phenotype [159], whilst non- functional, benign variants are generally not causative. In PVP, we combine methods to determine whether a variant is pathogenic (i.e., functional) with information about the phenotypes in which a gene is known to be in- volved to identify candidate causative variants in WES and WGS data. For predicting pathogenicity, we utilize tools that can provide a pathogenicity score for every variant within a genome, i.e. CADD [165], DANN [166], and GWAVA [164]; for the latter, we use an improved version of the PhenomeNET framework to match a patient’s phe- notypes with a database of gene-phenotype associations derived from human, mouse and fish resources. The full list of features used for prediction in PVP is provided as S1 Table. PhenomeNET consists of a repository of gene-phenotype associations from human and model organisms, an ontology that integrates phenotypes across species, and a semantic similarity measure that determines the similarity between two sets of phenotypes. It provides a score that measures the similarity between a set of patient phenotypes and sets of phenotypes in the PhenomeNET repository. Depending on the intended application, the choice of gene-phenotype associations can strongly affect the performance of PhenomeNET [178]. Here, we utilize two over- lapping sets of gene-phenotype associations; we include gene-phenotype associations observed in zebrafish and mouse (marked “Model” for Model Organism Databases), and additionally include human phenotypes propagated from known gene-disease and disease-phenotype associations (marked “Human” in our experiments). We also use both genotype-phenotype associations together. We represent variants by their pathogenicity scores, the scores provided by the PhenomeNET system to measure similarity between the patient’s phenotype and known phenotypes associated with the gene affected by the variant, a small set of high-level phenotypes observed in a patient, as well as mode of inheritance of the 59 disease (if known) and zygosity of the variant. We use these as features to train a random forest classifier that separates variants into causative variants and non- causative variants. Initially, we use 80% of the pathogenic variants available from the ClinVar database [184] to train our model, keeping 20% of the ClinVar variants for further testing. In 10-fold cross validation on these 80%, our model achieves an area under the receiver operating characteristic curve (ROC AUC) of up to 0.994 and F-measure of up to 0.963 (S2 Table). To test the performance of this model in identifying causal variants in sequencing data, we generated a synthetic dataset of 11,251 whole genomes sequences (one for each of the 20% variants in ClinVar that were not used to train the model). The synthetic dataset was created by randomly choosing one of the WGS samples from the 1,000 Genomes Project (1KGP) [185] and inserting a single causative variant in each of these. 8,746 causative variants were inserted in exonic regions and 2,505 in non-exonic regions. Next, we mark the synthetic individual as having the disease and use the phenotypes associated with the disease in the HPO database [186] as the patient phenotypic profile before trying to recover the inserted pathogenic variant using our PVP-based models. Before applying our PVP models, we apply a filter to remove variants with ≤ 1% global minor allele frequency from 1KGP on each variant. We perform two experiments to test the performance of PVP, PVP-Human and PVP-Model. First, we remove all non-exonic variants from the synthetic genomes to simulate a WES dataset and employ the resulting WES dataset to assess our recovery rate of causative variants located in an exonic region. We identify 45.82% of the candidate causative variants as the top ranked and 72.64% of the causative variants in the top 10 ranked variants for WES data using only model organism phenotypes to determine phenotypic similarity, 79.21% of variants top-ranked and 87.94% variants in the top 10 ranks when using only human phenotypes, and 78.80% top-ranked and 89.50% within the top 10 when using both human and model organism phenotypes 60 together. As second experiment, we apply our approach to all variants in the whole genome sequences, and recover 12.62% of the variants at first rank and 23.75% within the first 10 ranks using only model organism phenotypes, 75.10% variants top-ranked and 89.36% in the top 10 ranks using only human phenotypes, and 76.47% top- ranked and 88.61% within the top 10 when using both model organism and human phenotypes. Tables 5.1 and 5.2 summarize these results.

Top hit (exonic) Top 10 (exonic) Total (ex- Median onic) (exonic) CADD 1,095 (15.15%) 2,317 (32.05%) 7,229 49 DANN 406 (6.06%) 1,789 (26.69%) 6,704 108 GWAVA 102 (1.41%) 458 (6.32%) 7244 339 eXtasy 553 (14.85%) 1,601 (42.99%) 3,724 19 Exomiser 2,156 (24.65%) 5,122 (58.56%) 8,746 5 Phevor 1,679 (28.25%) 3,845 (64.70%) 5,943 4 PVP-Model 4,007 (45.82%) 6,353 (72.64%) 8,746 2 PVP-Human 6,928 (79.21%) 7,691 (87.94%) 8,746 1 PVP 6,892 (78.80%) 7,828 (89.50%) 8,746 1

Table 5.1: Overview of how many causative variants out of 8,746 exonic were recov- ered on rank 1 and within the top 10 ranks by PVP and PVP-Human, and comparison to CADD, DANN, GWAVA, Exomiser eXtasy, and Phevor. Analysis was performed on WES data. If a tool did not provide a score for a causative variant, we excluded the variant from this table; consequently, the total number of samples analyzed dif- fers between the methods and the percentages reported are based on the number of samples for which the causative variant was ranked.

We compare our method against several state of the art variant prioritization tools, namely CADD [165], DANN [166] and GWAVA [164], as well as the phenotype-based tools Exomiser/Genomiser [187, 188], Phevor [182] and eXtasy [142]. Our results and the comparison with state of the art tools is summarized in Tables 5.1 and 5.2 as well as Figures 5.1 and 5.2, demonstrating that PVP outperforms the other methods in our experiments. We further assess how well our method performs on identifying causative variants for diseases with different mode of inheritance (MOI) in WES data. The percentage 61 PVP # top 1 hits % top 1 hits # top 10 hits % top 10 hits Total Exonic 6,500 74.32% 7,595 86.84% 8,746 Non-exonic 2,104 83.99% 2,374 94.77% 2,505 Total 8,604 76.47% 9,969 88.61% 11,251 PVP-Model # top 1 hits % top 1 hits # top 10 hits % top 10 hits Total Exonic 1,012 11.57% 1,992 22.78% 8,746 Non-exonic 435 17.37% 703 28.06% 2,505 Total 1,447 12.86% 2,695 23.95% 11,251 PVP-Human # top 1 hits % top 1 hits # top 10 hits % top 10 hits Total Exonic 6,611 75.59% 7,620 87.13% 8,746 Non-exonic 2,156 86.07% 2,368 94.53% 2,505 Total 8,767 77.92% 9,988 88.77% 11,251 CADD # top 1 hits % top 1 hits # top 10 hits % top 10 hits Total Exonic 441 6.1% 1759 24.33% 7229 Non-exonic 118 4.77% 599 24.2% 2475 Total 559 5.76% 2358 24.3% 9704 DANN # top 1 hits % top 1 hits # top 10 hits % top 10 hits Total Exonic 325 4.85% 1287 19.2% 6704 Non-exonic 101 5.32% 347 18.27% 1899 Total 426 4.95% 1634 18.99% 8603 GWAVA # top 1 hits % top 1 hits # top 10 hits % top 10 hits Total Exonic 34 0.47% 44 0.61% 7244 Non-exonic 9 0.42% 22 1.04% 2121 Total 43 0.46% 66 0.7% 9365 Exomiser/Genomiser # top 1 hits % top 1 hits # top 10 hits % top 10 hits Total Exonic 2,747 31.41% 6,879 78.65% 8,746 Non-exonic 780 31.14% 1,895 75.65% 2,505 Total 3,527 31.35% 8,774 77.98% 11,251

Table 5.2: Overview of the performance of PVP, CADD, DANN, GWAVA and Ex- omiser in prioritizing causative variants in WGS data. We prioritize all variants in a VCF file resulting from WGS using the same models. Analysis is separated reflect- ing the performance of the various tools identifying exonic and non-exonic variants. For CADD, DANN, and GWAVA, we report only analysis results for which a predic- tion score is returned; consequently, total numbers are less than the total of 11,251 causative variants. 62

Figure 5.1: Performance of PVP in retrieving causative variants in whole exome sequences. Results are compared against CADD, DANN, and GWAVA, and the phenotype-based tools Exomiser, Phevor and eXtasy.

Figure 5.2: Performance of PVP in identifying causative variants in whole genome se- quences using human phenotypes (PVP-Human), model organisms phenotypes (PVP- Model), and combined phenotypes (PVP), and comparison of PVP to CADD, DANN, GWAVA, and Genomiser. of cases in which the causal variant is ranked first is shown in Table 5.3. We find that, unsurprisingly, our models perform better on recessive diseases as the variants have to be homozygous, which can be used as an additional filter, while a dominant mode of inheritance may be caused by either heterozygous or homozygous variants, and complicated by haploid insufficiency, and hence cannot be used to discriminate between causative and non-causative variants. To evaluate the importance of the “depth” of phenotyping [189] for predicting 63 candidate variants, we compared the predictive accuracy of PVP with the information content in the disease (or patient) description. Information content of a phenotype class is measured by its depth in the PhenomeNET ontology and the number of diseases in our sample that contain this phenotype. For diseases associated with multiple phenotypes, we sum the information content of the individual phenotype classes. We evaluate the correlation between the rank of the causative variant in our set of 8,746 synthetic exome sequences and the information content associated with the disease, and find a negative correlation (Spearman’s rank correlation ρ = −0.54), i.e., if the information content of the phenotypes used to characterize the disease (or patient) is higher, PVP can provide better predictions.

Coding Noncoding Dominant Recessive Others/Unknown Dominant Recessive Others/Unknown PVP 4006 (77.61%) 2005 (93.26 %) 881 (61.44%) 1178 (83.66%) 684 (97.3%) 310 (78.68%) PVP-Model 2100 (40.68%) 1535 (71.40%) 372 (25.94%) 754 (53.55%) 587 (83.50%) 179 (45.43%) PVP-Human 4027 (78.01%) 1993 (92.7%) 908 (63.32%) 1197 (85.01%) 686 (97.58%) 321 (81.47%) Table 5.3: Performance of PVP in variant prioritization in WGS data, separated by mode of inheritance of the disease.

The set of phenotypes observed in patients is not always complete, or patients may suffer from multiple co-morbidities that can affect our phenotype-based analysis. To determine the effect of noise on our analysis, we focus on a subset of 8,522 out of 8,746 synthetic whole exome sequences for which the disease is characterized phenotypically (the remaining cases were imputed by our algorithm, see Materials and Methods), and we perform two experiments (see S3 Table): first, we randomly add the phenotypes of a second disease to the phenotypes of the patient to simulate co-morbidity; and second, we randomly remove each phenotype used to characterize the patient’s disease with a probability of 1/3 (i.e., on average, 1/3 of the phenotype annotations for each disease are removed). Using the PVP-Human model, we find that in the first experiment, only 3,547 (41.62%) variants are ranked first and 4,315 (50.63%) in the top 10, compared to over 75% ranked first with phenotypes matching the disease exactly. In our second experiment, removing phenotypes with probability 1/3 results 64 in 3,963 (46.50%) of causative variants ranked first and 4,921 (57.74%) in the top 10. We further investigated how well PVP can distinguish between variants that are causative for closely related diseases. For this purpose, we insert a second causative

variant v2 to the whole exome sequence of the synthetic patients (each containing a single causative variant v1). The second variant v2 is chosen to be causative for the most phenotypically similar disease (within our test dataset). We then use the

phenotypes associated with v1 and test at which rank v1 and v2 are predicted by PVP.

Using PVP-Human, we find v1 ranked first in 62.38% of the cases, while v2 is ranked first in 15.36% of the cases, demonstrating that PVP can also discriminate between

closely related diseases. Combining the phenotypes associated with v1 and v2, we predict both v1 and v2 with equal probability of 37% on the first rank (see S3 Table). To make PVP available as a tool for diagnostic support, we re-train all our models using the whole ClinVar dataset and combine the phenotype similarity computation using PhenomeNET with annotation of pathogenicity into the PVP tool. PVP can analyze WES or WGS datasets using the VCF file and a set of observed patient phenotypes as input and then outputting a list of variants ranked by the likelihood they are causative for the observed phenotypes.

5.3.2 PVP predicts causative variants in diagnosed cases

We evaluate the performance of PVP on a series of real exomes from individuals di- agnosed as having Congenital Hypothyroidism (CH), included in the UK10K dataset [190] (see Methods), to assess how well we could recover potentially pathological vari- ants in genes already associated with the disease. Congenital hypothyroidism is one of the most frequent endocrine disorders of the neonate with a frequency of up to 1/1,500 births [191], although some forms and molecular etiologies can be much more rare, such as Central Congenital Hypothyroidism (CCH) [192] estimated at around 1/16,000. Historically, most cases were thought to be due to thyroid gland dysgenesis 65 comprising ectopias, hypoplasia and complete agenesis [193]. However, recently, an increase in diagnosis of CH in the presence of apparently anatomically normal glands (gland-in-situ) has been reported [191]. The pathophysiology of such cases may in- clude organisational and functional defects (dyshormonogenesis) within the glands leading to compromised or absent function. A range of genes has been implicated in these processes which include thyroid transcription factors, genes involved in thyroid hormone biosynthesis, and the Thyroid Stimulating Hormone receptor (TSHR) [194]. Mutations in known genes are implicated in less than 5% of thyroid dysgenesis cases, whereas dyshormonogenesis is usually associated with mutations in components of the thyroid hormone biosynthetic machinery [193]. We analyze 43 individuals from the UK10K rare disease cohort of patients and relatives with congenital hypothyroidism, using PVP. The dataset includes 11 con- firmed cases of thyroid dysgenesis (DG), 30 CH with gland-in-situ (GIS, likely involv- ing dyshormonogenesis), and two with CCH, in addition to 80 individuals that do not show any phenotypes but have a family relation to the affected individuals. We use a common set of phenotypes from the HPO for the whole cohort, comprising hy-

pothyroidism (HP:0000821), congenital hypothyroidism (HP:0000851), TSH excess (HP:0002925), thyroid hypoplasia (HP:0005990), and TSHR defect (HP:0011791); these are the most relevant phenotypes in HPO. We analyze the individual cases in- dependently and do not account for the relationships between individuals. Thirty six of these show variants in genes already associated with CH within the top 20 hits, fil- tered for a minor allele frequency (MAF) of 1% (S4 Table) while the remainder do not show known CH-associated disease genes above this rank. We do not, in the current study, further analyze the likelihood that high ranking genes in these 7 individuals might represent novel genes in this disease or differential diagnoses. Of the 11 cases of thyroid dysgenesis, 9 show homozygous or heterozygous alleles of genes already implicated in dysgenesis-associated CH within the first five ranked 66 hits. All were assessed as deleterious or possibly deleterious by SIFT [195], PolyPhen [196], or both. These genes include GLIS3 [197], NKX2-1 [198], and PAX8 [199]. One case shows a predicted deleterious allele of LHX3 normally associated with CCH through an effect on pituitary development [192]. Of the cases with GIS all but 9 show deleterious alleles in DUOX2 [200], TG [201], or TPO [202], and in some cases predicted pathogenic variants of two or three of these genes are found together in the highest ranks in our analysis. The remain- der show variants in NKX2-1, LHX3, and, in one case, PAX8. Homozygous alleles in DUOX2 and TPO are present in 15 individuals. One homozygous variant has been previously reported in ClinVar to be pathogenic and affects iodotyrosyl coupling (NM 003235.4(TG):c.638+5G>A) [203]. In five cases of GIS, homozygous mutations of TG are found in the same individual as deleterious heterozygous DUOX2 alleles. In one case, a homozygous DUOX2 allele is found with a compound heterozygote in TG. While our analysis of the complete dataset provides hypotheses about the most likely disease-causing variants, confirmation requires detailed analysis and re-sequencing. Of the 43 cases we analyze, 15 individuals with CH were previously subjected to Sanger sequencing of candidate variants, confirming the association with the disease [204]. In 9 of these 15 cases, PVP correctly implicates the likely causative alleles as the first hit. In six of the cases, potentially deleterious mutations are found in two genes, and in five of these six cases, PVP correctly identifies the second gene within the first 10 ranks. Additionally, multiple mutations in TG are found in three cases, and in two of these, PVP identifies the second variant as the second rank (S5 Table). The unexpected involvement of oligogenic and triallelic loss of function/hypomorphic mutations in the genesis of congenital thyroid disease is discussed in [204]. We also test PVP with diseases displaying different sets of phenotypes. We utilize data available from the Personal Genomes Project (PGP) [205] and examine if we 67 can predict disease-associated variants consistent with the information that patients that participate in the PGP have declared. We analyze two patients from the PGP,

one patient (PGP:hu92FD55) with a disease in mental functioning (Asperger’s Syn- drome) the other (PGP:hu432EB5) with hemostasis abnormalities (Von Willebrand disease). For the individual associated with (OMIM:300494), the top variant predicted by our approach is in PLCB1, phospholipase C beta 1, located at 20p12.3. PLCB1, which is involved in extracellular signal transduction in the phosphoinositol pathway, has been implicated in GWAS analysis for spec- trum associated phenotypes in the ALSPAC study [206] and a homozygous deletion in a single case of malignant migrating partial seizures in infancy (MMPEI) [207]. Rare mutations associated with autistic spectrum disorders, largely small deletions and duplications, have been reported within and around the gene [208]. The variant seen here is predicted to be pathogenic, heterozygous, and has not been previously reported, suggesting that this is not a simple LOF mutation as seen in MMPEI, and may warrant further research. For the case of the patient associated with von Wille- brand disease (OMIM:193400) [209], VWF is the top hit in our analysis, identifying the variant (chr12:6143978G>A), already described as pathogenic. This individual is heterozygous, consistent with the known pathogenesis of type 1 von Willebrand disease.

5.3.3 Effects of datasets and evaluation method

PVP provides a system for prioritization of causative genomic variants. While other systems have previously used phenotypes for variant prioritization [187, 188, 182, 142], key novelties of PVP are a novel cross-species phenotype ontology and the way in which gene-phenotype information is used for variant prioritization. The choice of gene-phenotype associations strongly determines the performance of the system and possible application scenarios. In particular, in contrast to systems such as Phevor 68 or Exomiser, we explicitly provide PVP with the option to ignore human phenotype information and rely only on independent data from model organisms. Human phe- notypes, provided by the HPO project [186], are derived from disease phenotypes by identifying causative genes for a disease and propagating the phenotypes associated with the disease to the known disease genes. While we observe a strong increase in performance when using these propagated human phenotypes, methods that are trained using them will likely over-emphasize known disease genes and therefore only provide limited performance in identifying variants in novel disease genes. Another observation from our experiments is that the type of evaluation has a strong impact on the reported performance. We evaluate PVP and related variant prioritization systems using ClinVar variants, and, since PVP was trained using this dataset, we specifically evaluate PVP and the other systems using a 20% holdout set that we have not used for training our models so that we can determine its perfor- mance on unseen variants. While we find that PVP performs comparably to, or better than, other systems in our experiments using WES data, we also observe a striking difference in performance to previously reported results for some variant prioritization systems. For example, Exomiser has been reported to identify up to 97% of causative variants on the first rank in prior experiments using WES data [187], and over 70% of causative variants on the first rank in WGS data [188]. The main difference between our experiments and those performed to evaluate Exomiser/Genomiser is the use of a different evaluation dataset which only partially overlaps with the dataset used to evaluate Exomiser/Genomiser. Additionally, the results reported in the evaluations of Exomiser and Genomiser [187, 188] that found up to 97% of variants to be predicted correctly were performed on the model’s training data, i.e., using an overfitted model [187]. Such a strategy will be able to accurately find known variants (i.e., variants on which the model has been trained), but, as demonstrated by our results, will perform with lower accuracy on previously unseen or novel data. 69 In PVP, we chose to focus on two different application scenarios that should be among the most useful in the task of interpretation of variants in a clinical setting: identification of causative variants in known disease genes (using PVP-Human), and identification of causative variants in potentially novel genes (using PVP-Model or PVP).

5.3.4 Impact of the use of model organism phenotypes on variant prioritization and disease gene discovery

Use of phenotypic similarity of experimental mouse models to human diseases has been shown to guide the discovery of the associated human gene. For example the mouse “hairless” mutation was first described in 1859 and the gene identified in 1994 [210]. On the basis of phenotypic similarity to alopecia universalis, the human gene was identified as the human homologue of mouse “hairless” in 1998 [210]. In PVP, phenotype data from mouse and fish models is particularly useful when no human phenotypes are available for a gene, i.e., when a variant is in a gene not previously implicated in a disease. Currently (23 Jan 2017), mouse phenotypes are available for 9,045 mouse genes with human orthologs, but only 3,698 genes are associated with phenotypes in OMIM, and we evaluated the effect of using mouse phenotype data for variants in genes without available human phenotypes (see S6 Table).

In our analysis, we can identify a variant (rs766783183) in the keratin 25 (KRT25) gene at rank 8 for Hypotrichosis 8 (OMIM:278150) in our analysis based on a strong concordance between mouse phenotypes (all of which are associated with hair and nail morphology and hair growth) and the phenotypes associated with the human disease. Using PVP without model organism phenotypes results in rank 172 for the

same variant. Similarly, we can improve the rank of a variant (rs764239923) in the Gliomedin (GLDN) gene as causative for lethal congenital contracture arthrogryposis-

11 (OMIM:617194) from rank 342 without model organism phenotype to rank 7 using 70 model organism phenotypes based on matching nervous system abnormality pheno- types in the mouse. However, in some cases, the model organism phenotypes add noise to the results, especially where there are discordant phenotypes, either for reasons intrinsic to the disease, due to differences in human and mouse physiology, or because the scope of phenotyping in the model organism is distinct from that carried out on humans.

For example, a variant (rs121908425) in the collapsin response mediator protein 1 (CRMP1) gene would be prioritized at rank 1 for the disease Ellis-van Creveld syn- drome (OMIM:225500) without relying on any phenotypes and based on pathogenic- ity of the variant alone. All phenotypes associated with the mouse ortholog Crmp1 are associated with abnormal nervous system physiology and morphology, while the phenotypes associated with the disease relate to a wide range of morphological abnor- malities. Consequently, when relying on PVP-Mod that uses phenotypic similarity to model organism phenotypes, prediction of the causative variant drops to rank 65. In our quantitative evaluation, predictive performance including mouse phenotypes is slightly less than performance relying on human phenotypes alone, demonstrating (unsurprisingly) that model organism phenotypes are overall less similar to a human disease than phenotypes observed in humans. However, in particular in cases where no human phenotypes are available or causative variants occur in genes not previously implicated in a disease, model organism phenotypes may aid in identifying causative variants. In the future, methods should be developed that can determine automati- cally whether the phenotypes observed in a model organism are of sufficient quality and depth to contribute to prioritization of causative variants.

5.3.5 Conclusions

Mobilizing the volume and richness of genotype-phenotype associations from human and model organism databases provides a powerful resource with which potential 71 disease candidates can be discriminated. Data from large scale mutagenesis efforts and hypothesis-driven science have created sufficient genotype-phenotype association data. PhenomeNET [172] was developed as a framework that exploits these pheno- types in a computational approach, using phenotypes as surrogates for their underly- ing genes. By identifying relations between phenotypes, PhenomeNET identifies the similarity between the underlying molecular processes and their components. We have developed PVP as a computational method to prioritize variants, and we demonstrate here using synthetic and real patients’ genomic data that PVP is a system for highly accurate genome-scale identification of causative variants involved in human disease. PVP on its own relies only on model organism phenotypes and is particularly useful when variants in potentially novel genes should be found; PVP-Human emphasizes variants in known disease genes and should be used when variants are suspected in genes already known to be involved in the pathogenesis of a disease.

5.4 Materials and Methods

5.4.1 Updates to the PhenomeNET system

Changes in the HPO, MP and other ontologies, as well as improved OWL reasoning technologies [211], allowed us to improve upon the method originally used to build the PhenomeNET [172] to generate a more comprehensive phenotype ontology spanning fish, mouse and human. PhenomeNET includes all classes contained in the HPO, MP, but is formalized primarily based on the structure of anatomy and physiology ontologies [212]. All our experiments are based on ontology versions downloaded from the AberOWL ontology repository [213] on 10 June 2016, and all ontologies included in the PhenomeNET ontology are from this date. The PhenomeNET ontology includes UBERON [114], GO [214], BSPO [215], ZFA [216], PATO [170], CL [217], NBO [218], but removes all disjointness axioms from these ontologies prior to inclusion due to possible inconsistencies arising from these. 72 Furthermore, the PhenomeNET ontology includes the CHEBI [219] and MPATH [220] ontologies as imports. Within the PhenomeNET ontology, axioms are rewritten to follow the phene pattern [212] so that phenotypes are primarily organized by anatomical structure or physiological process. In particular, within HPO and MP, we identify axioms for a phenotype class P by identifying a class E and Q, and reformulate the formal definition of P as

P EquivalentTo: has-part some (E and has-quality some Q). We initialize E and Q with owl:Thing and then generate axioms from the definition of P provided by HPO or MP using the following rules:

• modifier some X: we keep the object property and target class as modifier of the quality Q, setting Q := Q and modifier some X

• inheres-in some X: set E := X

• inheres-in-part-of some X: set E := part-of some X

• towards some X: set E := E and towards some X

• has-quality some X: set E := E and has-quality some X

• exists-during some X: set E := E and exists-during some X

• has-part some X1 and ... and has-part some Xn: treated as intersection, P := X1 and ... and Xn

• part-of some X: set E := E and part-of some X

• has-central-participant some X: set E := E and has-central-participant some X

• results-from some X: set E := E and results-from some X

• occurs-in some X: set E := E and occurs-in some X 73 These axioms are intended to reformulate axioms in the HPO and MP so that

each phenotype class characterizes a whole organism that has an entity E as part which is further characterized by its qualities and relations to other entities. Fur- thermore, the axioms aim to enforce a taxonomic structure that closely resembles anatomy (from Uberon) and physiology (from GO). Specifically, if X is a subclass of part-of some Y in either Uberon or GO, the axioms aim to force X phenotype to become a subclass of Y phenotype. To completely resemble parthood relations, we further generate an additional phenotype class S for each unique E that we iden- tify, using the axiom S EquivalentTo: has-part some (part-of some (E and has-quality some owl:Thing)). This class serves as additional class that is not usually present in either HPO or MP, and enforces the taxonomic structure of the PhenomeNET ontology to follow both the taxonomic structure and parthood struc- ture of the GO and Uberon. Fish phenotypes are not represented using a dedicated phenotype ontology but rather annotated using E and Q classes directly. Within the PhenomeNET ontology, we generate one class for each unique combination of E and Q found in annotations to zebrafish models. If two entities are used to annotate a zebrafish model (i.e., E1 and E2, we generate the axiom P := has-part some (E1 and has-quality some (Q and towards some E2)). The ontology structure is not manually created but must be inferred using deduc- tive reasoning. We rely on the ELK reasoner [211] to infer the ontology structure. The PhenomeNET ontology is updated regularly, is freely available and can be queried using the ELK reasoning in the AberOWL ontology repository [213].

5.4.2 Model organism phenotypes and similarity search

We collected the mutant model organism phenotypes for mouse from the MGI database [221] on 14 December 2015, human phenotypes from the HPO database [186] on 14 74 December 2015, and zebrafish phenotypes from the ZFIN database [216] on 13 De- cember 2015. We compute semantic similarity between a patient phenotype and the collec- tion of model organism and human phenotypes using Resnik’s measure [141] with the Best Matching Average (BMA) strategy for combining pairwise similarities. We use Resnik’s information content measure [141] computed over the corpus of gene- phenotype associations (from human, mouse and zebrafish) as specificity measure for each class in the phenotype ontology. Semantic similarity is computed using the Semantic Measures Library [222]. We normalize semantic similarity values to the range of [0, 1] for the annotation of variants by dividing each similarity value by the maximum similarity observed for each patient phenotype profile.

5.4.3 Generation of model training data

To train our models, we used the set of variants from ClinVar [184]. ClinVar is a public archive of human variations with their corresponding clinical significance. Clinical significance information in ClinVar is provided based on the American College of Medical Genetics and Genomics (ACMG) guidance in describing variants identified in genes that cause Mendelian disorders. We used ClinVar (dated 05 January 2016) using the reference genome of GRCh37.p13 as our main set. Within the 120,509 records in this dataset, we identified two sets of variants that we use for training, a set of pathogenic variants (ClinVar significance code 5) and a set of benign variants (ClinVar significance code 2). Additionally, for each pathogenic variant, we obtain the disease that the variant causes, identified through its OMIM identifier [223]. By default, ClinVar grouped a variant with multiple alleles into a single record.

By using the VCF2TSV parser script from VCFLIB (https://github.com/vcflib) we converted the VCF format file of ClinVar to a tab-delimited format file and split 75 the variants with multiple alleles into multiple records. We further split variants that are associated with multiple diseases into multiple records. Next, we downloaded the mode of inheritance (MOI) for diseases in OMIM from the HPO phenotype database. We obtained a total of 5,864 MOI records which were classified as “Dominant”, “Recessive”, “Multifactorial”, “Others”, “Sporadic”, “X-linked” and “Y-linked”. We combined this information with the variants from ClinVar to generate candidate disease-causing genotypes; if the MOI of the disease associated with a ClinVar variant is “Recessive”, we generate a single homozygote genotype using the variant; in all other cases, we generate a heterozygote as well as a homozygote genotype based on the variant. The results are 43,236 genotypes classified as pathogenic and 52,084 genotypes classified as benign. This set includes 12,783 pathogenic non-coding variants (i.e., variants that do not lie in an exonic region, including intronic and intergenic variants).

5.4.4 Generation of synthetic exomes/genomes

So that we can quantitatively evaluate our method, we generated 11,251 synthetic whole genome sequences corresponding to our hold-out test sets. To generate this test set, we inserted a single pathogenic variant into a randomly selected whole genome sequence from the 1000 Genomes Project, hg19. In 8,746 of these sequences we inserted an exonic causative variant and in 2,505 we inserted a non-exonic causative variants. 46 exonic and 7 non-exonic variants from our holdout set were excluded as they have a MAF higher than our cutoff of 1%. We generated synthetic exome sequences by removing non-exonic variants from the 8,746 WGS files that include an exonic variant. We use these synthetic whole exome and whole genome sequences to test the performance of our method. 76 5.4.5 Model training

We split the set of 43,236 pathogenic variants randomly into 80% for training and 20% for testing. We annotated all variants in these sets with methods that can pre- dict pathogenicity of both coding and non-coding variants. We used the Combined Annotation Dependent Depletion (CADD) [165], Genome Wide Annotation of VAri- ants (GWAVA) [164] and a deep neural network approach (DANN) [166] to obtain three pathogenicity prediction scores for each of the variants. Additionally, we used the genotype (homozygote or heterozygote) of a variant as feature. For each variant, we also added features related to the disease the variant is involved in according to ClinVar. In particular, we added as features the mode of inheritance of the disease, using only “Dominant”, “Recessive”, “X-linked”, and “Other” as features, and a binary vector of 54 high-level phenotypes of the disease based on our PhenomeNET ontology combining HPO and MP. Finally, we added the normalized semantic similarity between the disease phenotypes and the gene in which the variant is located as a feature. If a variant is non-exonic, we used the gene that is closest to the variant in genomic coordinates as the gene for which similarity was computed. In total, each variant is represented as 60 features (see S1 Table). Based on these 60 features, we trained a random forest classifier to classify variants into causative and non-causative (given a set of phenotypes observed in a patient). We understand a causative variant as a variant that is both pathogenic and involved in the pathogenesis of the disease phenotypes observed in the patient. For training, we represented the patient’s disease phenotypes by the phenotypes associated with the disease in the HPO database. A variant may be pathogenic but not causative for a set of patient phenotypes [159]. We simulated this case by randomly selecting another disease from the OMIM database and assigning these phenotypes as patient phenotypes in the feature representation of the variant. We called these variants pathogenic non-causative variants. We treated all variants identified as benign in 77 ClinVar as non-causative and selected the phenotypes of a random OMIM disease to represent them. For training, missing values were imputed using the C4.5 method [224]. We use pathogenic causative variants as positives, but have two different types of negatives: pathogenic non-causative variants and benign non-causative variants. We train three models that emphasize the negative variants differently: a first model uses only pathogenic non-causative variants as negatives, a second model uses only benign variants as negatives, and a third model uses 50% pathogenic non-causative and 50% benign non-causative variants as negatives. Since the first model cannot distinguish variants by their pathogenicity prediction scores (since both positive and negative variants are pathogenic and only differ in the disease for which they are causative), it is trained to under-emphasize pathogenicity of a variant and rely primarily on the phenotype similarity. The second model can clearly distinguish pathogenic variants from non-pathogenic based on pathogenicity prediction scores and will not have to rely heavily on the phenotype similarity scores; therefore, it is trained to under-emphasize phenotype similarity and predict primarily based on pathogenicity of a variant. The third model aims to achieve a balance between both. For each model, we train a random forest binary classifier (using the pre-selected 80% of the variants in ClinVar [184] while keeping 20% of the variants as holdout set for final validation) and evaluate the results using stratified 10-fold cross-validation. We trained the models using the Random Forest implementation in Weka [225] using 100 trees, unlimited depth of trees, and constructing each tree considering 6 random features. Random forests are trained to output probability estimates of class assign- ment, which we use as prediction score to rank variants. We report cross-validation evaluation results in S2 Table. 78 5.4.6 Model evaluation

The trained models are then applied to our synthetic exomes and genomes. Each synthetic whole exome or whole genome sequence is taken randomly from one of the 1,000 Genomes project sequences, with one causal variant from our holdout set ar- tificially inserted. We use the phenotypes associated with the disease for which this variant is causal as patient phenotypes and use our models to compute a predic- tion score for each variant in the synthetic sequences. We then evaluate the ranks on which we recover the causal variant and compare the results against Exomiser version 7.2.1, Phevor version 2, eXtasy version 0.1beta (for whole exome sequences only), and CADD version 1.3, DANN version 1, GWAVA version 1, and Genomiser version 7.2.1 (for whole genome sequences). For evaluation, none of our models were trained on the variants we inserted in these sequences. We report the area under the receiver operating characteristic curve (ROC AUC) and the top ranks and top 10 ranks obtained by applying each method. We analyze the synthetic whole exome sequences with the Exomiser [187] using the same sets of phenotypes and mode of inheritance as input and using its variant prioritization mode. For comparison with Phevor, we first rank variants based on their CADD score and submit the ranked list to the Phevor web interface using the same phenotypes used in our analysis. Phevor provides a ranked list of genes, not variants, and we assign variants the Phevor rank of the gene in which it is located. We performed the analysis with eXtasy using its default parameter settings with im- putation of missing values, and combining multiple phenotypes. eXtasy was not able to utilize all HPO phenotype classes in our analysis and we omitted the phenotypes that were not available to eXtasy. In all tools besides PVP, we remove variants for which no rank is assigned from the analysis. For DANN and GWAVA, this includes all insertions and deletions as they are not scored by these tools. 79 5.4.7 PVP

In PVP, we remove all variants that are not clearly identified as homozygote or heterozygote (e.g., genotypes that were not confidently called). Moreover, if the mode of inheritance of the disease is known to be recessive, we filter out variants associated with 0/1 genotype call as the disease will require a variant with a 1/1 genotype call in order to be present. MAF is also used as a filtering option for some of the experiments we conducted. MAF data were obtained from the 1000 Genomes Project corresponding to all populations (release August 2015) using the Annovar

tool [226]. The source code of PVP is freely available at https://github.com/ bio-ontology-research-group/phenomenet-vp.

5.4.8 Ethical approval

Use of UK10K data for this project was approved by the UK10K Data Access Commit- tee at the European Genome-phenome Archive for GVG, RH, MK, IB, and RBMR. Access to UK10K data and analysis was limited to GVG, RH, MK, IB, RBMR.

5.4.9 Availability of data and material

Source code developed for this project is available at: https://github.com/bio-ontology-research-group/phenomenet-vp, and analy- sis results at http://www.cbrc.kaust.edu.sa/onto/pvp/. Data to UK10K samples is available from the European Genome-Phenome Archive through the UK10K Data Access Committee ([email protected], https:// www.uk10k.org/data_access.html) for researchers who meet the criteria for access to confidential data. 80

Chapter 6

DeepPVP: phenotype-based prioritization of causative variants using deep learning

I submitted the work presented in this chapter at BMC Bioinformatics (Paper preprint: Boudellioua I, Kulmanov M, Schofield PN, Gkoutos GV, and Hoehndorf R. (2018) DeepPVP: phenotype-based prioritization of causative variants using deep learning. bioRxiv https://doi.org/10.1101/311621). My contribution involves conceiving of the use of a deep neural network architecture, the design of the methodology and formal analysis, the validation and interpretation of findings, the implementation of the soft- ware, and the writing of the manuscript. The full author contribution is as follows: IB and RH conceived of the use of a deep neural network, IB and MK implemented DeepPVP, IB trained the models, performed experiments, and evaluated DeepPVP’s performance, IB, PNS, GVG, RH interpreted results and critically evaluated the per- formance, all authors contributed to writing the manuscript.

6.1 Abstract

6.1.1 Background:

Prioritization of variants in personal genomic data is a major challenge. Recently, computational methods that rely on comparing phenotype similarity have shown to be useful to identify causative variants. In these methods, pathogenicity prediction is combined with a semantic similarity measure to prioritize not only variants that are likely to be dysfunctional but those that are likely involved in the pathogenesis of a 81 patient’s phenotype.

6.1.2 Results:

We have developed DeepPVP, a variant prioritization method that combined auto- mated inference with deep neural networks to identify the likely causative variants in whole exome or whole genome sequence data. We demonstrate that DeepPVP per- forms significantly better than existing methods, including phenotype-based meth-

ods that use similar features. DeepPVP is freely available at https://github.com/ bio-ontology-research-group/phenomenet-vp.

Conclusions:

DeepPVP further improves on existing variant prioritization methods both in terms of speed as well as accuracy.

6.2 Background

There is now a large number of methods available for prioritizing variants in whole exome or whole genome datasets [227]. These methods commonly identify the vari- ants which are pathogenic, i.e., the variants that may alter normal functions of a protein, either directly through a change in a protein’s amino acid sequence or indi- rectly through a change of expression [228, 165, 229]. In coding and noncoding DNA sequences, there are usually multiple variants that could possibly be pathogenic, but most of them are sub-clinical or will not result in any detectable phenotypic manifes- tations [158]. Recently, several methods have become available that utilize information about phenotypes observed in a patient to identify potentially causative variants [183, 230, 142, 182]. Phenotypes are useful for identifying gene–disease associations because they implicitly reflect interactions occurring within an organism across multiple levels 82 of organisation [231, 170, 232]. Phenotype-based methods work by comparing the phenotypes of a patient with a knowledgebase of gene-to-phenotype associations. A measure of phenotypic similarity is computed between a patient’s phenotypes and abnormal phenotypes associated with gene variants or mutations. The phenotypic similarity is then used either as a filter to remove pathogenic variants in genes that are not associated with similar phenotypes to the ones observed in the patient [182] or as a feature in machine learning approaches [183, 230]. The gene-to-phenotype associations used in phenotype-based prioritization strate- gies come from clinical observations such as those reported in the Online Mendelian Inheritance in Man (OMIM) database [223] or in the ClinVar database [184]. In some cases, they may also come from model organisms. Comparing model organism phe- notypes to human phenotypes (i.e., the phenotypes observed in a patient) requires a framework that allows phenotypes of different species to be compared, such as the Uberpheno [233] or PhenomeNET ontology [234]. We have previously developed the PhenomeNET Variant Predictor (PVP) [230] to prioritize causative variants in personal genomic data. We have shown that PVP outperforms other phenotype-based approaches such as the Exomiser or Genomiser tools [187, 188], or Phevor [182]. PVP is based on a random forest classifier, similarly to Exomiser and Genomiser, which also use a random forest. Features used to classify a variant as causative or non-causative combine a phenotype similarity score (to prioritize a gene as being associated with the phenotypes observed in the patient) and a pathogenicity score, as well as other features such as the mode of inheritance and genotype of the variant. As most variants are neutral, there is a very large imbalance between positive and negatives, and the challenges for building a machine learning model for finding causative variants is to account for this imbalance during training and testing. Recently, deep neural networks have shown to be successful in many domains 83 [235] and often result in better classification performance [229]. We have developed DeepPVP, an extension of the PVP system which uses deep learning and achieves significantly better performance in predicting disease-associated variants than the previous PVP system, as well as competing algorithms that combine pathogenicity and phenotype similarity. DeepPVP not only uses a deep artificial neural network to classify variants into causative and non-causative but also corrects for a common bias in variant prioritization methods [236, 237] in which gene-based features are

repeated and potentially lead to overfitting. DeepPVP is freely available at https: //github.com/bio-ontology-research-group/phenomenet-vp.

6.3 Implementation

6.3.1 Training and testing data

We downloaded the ClinVar database 7th Feb, 2017, and extracted GRCh37 genomic variants annotated with at least one disease from OMIM, characterized as Pathogenic in their clinical significance, and not annotated with conflicting interpretation in their review status. We obtained 31,156 pathogenic variants associated with 3,938 diseases in total and the set of these variants constitutes candidate positive instances in our training dataset. We also extracted GRCh37 genomic variants that are characterized as Benign in clinical significance, and not annotated with conflicting interpretation in their review status. We obtained 23,808 such benign genomic variants from ClinVar which form the candidate negative instances in our training dataset. We excluded any variant records mapped to more than one gene and variant records with missing information on the reference or alternate alleles. For pathogenic variant records, we defined variant–disease pairs for each pathogenic variant and its associated OMIM disease. In our dataset, some pathogenic variants are annotated with multiple OMIM diseases. For each of these variants and the n OMIM diseases they may cause, we created n variant-disease pairs. For example, variant rs201108965 in TMEM216 is 84 annotated with two diseases; 2 (OMIM:608091) and Meckel syn- drome type 2 (OMIM:603194). We define two variant-disease pairs (rs201108965, OMIM:608091) and (rs201108965, OMIM:603194). After this step, we have 30,770 pathogenic variant-disease pairs and 20,174 benign variants. In DeepPVP, we use the zygosity of a variant as one of the training features. The zygosity information is not provided in ClinVar. In a given Variant Call Format (VCF) [238] file, zygosity is represented in the genotype field. For instance, a het- erozygous variant will have a genotype value 0/1, while a homozygous variant will have a genotype value 1/1 in the VCF file. We assigned the genotype information to our pathogenic variant-disease pairs based on the mode of inheritance associated with the disease caused by the variant. We extracted the mode of inheritance of the associated OMIM disease using the information provided by the Human Phenotype Ontology (HPO) annotations of OMIM diseases [186]. If the disease’s mode of inheri- tance is recessive, we assigned the zygosity of the variant as homozygote (denoted with genotype 1/1). In this case, we created a variant-disease-zygosity triple representing this information. If the OMIM disease’s mode of inheritance is not recessive (i.e., any other mode of inheritance, including dominant, unknown, X-linked, etc.), we gener- ated two variant-disease-zygosity triples and characterized one of them as homozygote

(denoted with genotype 1/1) and another as heterozygote (denoted with genotype 0/1). For example, pathogenic variant rs397704705 in AP5Z1 is associated with Spastic paraplegia 48 (OMIM:613647). This OMIM disease is recessive and, hence, we characterize variant rs397704705 with genotype 1/1, generating a variant-disease- zygosity triple consisting of variant rs397704705, disease OMIM:613647, and genotype 1/1. Another example is the pathogenic variant rs387907031 in ARHGAP31 associ- ated with Adams-Oliver syndrome 1 (OMIM:100300). This disease is dominant and, hence, we generated two variant-disease-zygosity triples: variant rs387907031, disease

OMIM:100300, and the genotype 0/1, and variant rs387907031, disease OMIM:100300, 85 and genotype 1/1. Since benign variants are not associated with a disease or mode of inheritance, we treat each of them as both a homozygote and heterozygote, gener- ating two variant-zygosity pairs for each benign variant. After this step, we obtained 61,540 triples consisting of pathogenic variant, disease, and zygosity, and 40,348 pairs of benign variant and zygosity. The triples consisting of variant, disease, and zygosity constitute positive samples. For each positive instance (V,D,Z) consisting of a variant, disease, and zygosity, we randomly select, with equal probability, one of two possible negative instances: a randomly selected benign variant in the same gene as V , or a triple (V,D0,Z) where D0 6= D. To map intergenic variants to genes, we link variants to their nearest gene. For example, a positive instance in our training data is a pathogenic variant

rs267606829 in FOXRED1, associated with Mitochondrial complex I deficiency (OMIM ID 252010), as a homozygote. A negative instance according to the first strategy could be a benign variant, such as rs1786702, in FOXRED1, as a heterozygote. A negative instance according to the second strategy is the same pathogenic variant, rs267606829 in FOXRED1, as a homozygote, but associated with another OMIM

disease such as Tooth Agenesis (OMIM:604625). The resulting training data is bal- anced. As an independent and unseen evaluation dataset, we downloaded all variants from ClinVar released between Feb 8th 2017 and Jan 27th 2018. In this dataset, we processed all GRCh37 variants in the same manner as for our training dataset to construct triples of pathogenic variants, disease, and zygosity. However, if the OMIM disease’s mode of inheritance of the variant is not recessive (i.e., any other mode of inheritance, including dominant, unknown, X-linked, etc.), we assigned the

zygosity randomly either as homozygote (denoted with genotype 1/1), or heterozygote (denoted with genotype 0/1). We obtained a total of 5,686 such triples associated with 1,370 diseases for validation. 86 6.3.2 Generation of synthetic patients

In our evaluation, we generated a set of synthetic patients as a realistic evaluation case, similarly to previous work [188, 230]. We randomly selected a whole exome from the 1,000 Genomes project [185] and inserted a pathogenic variant V , assign the disease associated with V in ClinVar to the exome, and present V as a homo- or heterozygote based on the mode of inheritance associated with the disease. Each of these exomes together with the disease’s phenotypes and mode of inheritance form a synthetic patient in which we aim to recover the inserted variant.

6.3.3 Annotating variants

Annotating variants with pathogenicy scores from CADD [165], DANN [166], and GWAVA [164] is a time-consuming process in PVP [230] and other phenotype-based variant prioritization tools [188], especially when analyzing WGS data comprised of millions of variants. PVP 1.0 uses tabix [239] for indexing and retrieval of the pathogenicty scores per chromosome and genomic position. To optimize the annota- tion phase of DeepPVP, we extracted 31,491,995 variants from samples of the 1000 Genomes Project [185] and annotated them with pathogenicity scores from CADD, DANN, and GWAVA. DeepPVP keeps this set of pre-annotated variants in memory to provide fast retrieval of annotations for common variants. DeepPVP utilizes tabix only when the variant annotated is not available in the pre-annotated library, and therefore minimizes disk access.

6.3.4 Model and availability

We implemented our DeepPVP deep neural network in Python 2.7. We used Keras [240] with a TensorFlow backend [241]. We used one hot encoding to represent our categorical feature of the inheritance mode of the disease. We handled missing values for CADD, GWAVA, DANN, and semantic similarity scores by mean imputation. 87 We also added additional flags for missing values as features. We retrieved gene- phenotype association data from human and model organisms (mouse and zebrafish) on Feb 7th, 2017 and used them to generate the ontology and high level phenotypes and semantic similarity score features. We used the Hyperas [242] Python library for tuning the hyperparameters of the neural network using the tree-structured Parzen estimator (TPE) algorithm [243]. We selected the following hyperparameters for tuning: number of hidden layers (two, three, or four), number of hidden units in each layer (32, 64, 67, 128, 134, 201, 256, and 512), and the batch size (2,500, 5,000, 10,000, 15,000, and 20,000). The hyperparamters combination resulting in best performing model out of 50 trials using Hyperas was selected for the final model setup. Therefore, we designed a sequential model with an input layer, three hidden layers of 67, 32, 256 neurons respectively with Rectified Linear Units (ReLU) [244] activation function, and an output layer with a sigmoid activation function. We trained the model using the Adam optimization algorithm [245] which has been widely adopted for deep learning as a computationally efficient, fast convergent, extension to stochastic gradient descent. We used dropout [246] between the hidden layers and the output layer to prevent overfitting. We trained our DeepPVP model for 100 epochs, 2,500 batch size, and a learning rate of 0.001. In training, we specified a 20% random stratified-by-disease validation set, and binary cross entropy loss function. We kept the rest of the parameters in their default values. The model was trained on the CPU. The DeepPVP system (version 2.1), training and evaluation experiments, the syn- thetic genome sequences and our analysis results can be found at https://github. com/bio-ontology-research-group/phenomenet-vp. 88 6.4 Results

6.4.1 DeepPVP: phenotype-based prediction using a deep artificial neural networks

We developed the Deep PhenomeNET Variant Predictor (DeepPVP) as a system to identify causative variants for patients based on personal genomic data as well as phenotypes observed in the patient. We consider a variant to be causative for a disease D if the variant is both pathogenic and affects a structure or function that leads to D. This distinction is motivated by the observation that healthy individuals can have multiple highly pathogenic variants resulting in a complete loss of function; it is therefore not usually sufficient to identify pathogenic variants alone as there may be many. DeepPVP is a command-line tool which takes a Variant Call Format (VCF) [238] file as an input together with a set of phenotypes coded either through the Human Phenotype Ontology (HPO) [247] or the Mammalian Phenotype Ontology (MP) [248]. It outputs a prediction score for each variant in the VCF file; the prediction score measures the likelihood that a variant is causative for the phenotypes specified as input to the method. To predict whether a variant is causative or not, DeepPVP uses similar features as the PVP system [230] and combines multiple pathogenicity prediction scores, a phenotype similarity computed by the PhenomeNET system, and a high-level phe- notypic characterization of a patient. The full list of features used by DeepPVP is listed in Supplementary Table 1. All features can be generated from a patient’s VCF file and a set of phenotypes coded either with HPO or MP. In DeepPVP, we use a deep neural network to classify variants as causative or non- causative. Specifically, DeepPVP uses a feed forward neural network with five layers (see Figure 6.1). The input layer in our architecture consists of 67 neurons (for the 67 89 features) and an output layer consisting of a single output neuron which outputs the prediction score of DeepPVP. DeepPVP uses three hidden layers with 67, 32, and 256 neurons, respectively. Each hidden layer uses a Rectified Linear Unit (ReLU) [244] activation function, and the output layer uses a sigmoid activation function.

Figure 6.1: Overview over the DeepPVP neural network model.

DeepPVP is trained similarly as PVP to improve performance of identifying causative variants in real genomic sequences (in contrast to performance on a testing set). When training DeepPVP, we use as positive instances all causative variants from our training set together with the phenotypes of the disease for which they are causative. We discriminate these from two kinds of negatives: benign variants (i.e., variants that do not alter protein function) and pathogenic but non-causative vari- ants. We consider pathogenic non-causative variants as pathogenic variants (in our training set) which are not associated with phenotypes of the disease they cause, but rather with a different disease. The aim of this selection strategy is to discriminate causative variants from all other variants. We train the DeepPVP model using back-propagation, using binary cross entropy as loss function, and evaluate the model’s results on predicting causative variants, and compare against several competing methods. While the different evaluation scenarios omit some parts of the information about variants and the diseases they are associated with in order to not bias the evaluation results, we finally retrain a model using all 90 available information and make it available as the final DeepPVP prediction model.

6.4.2 Evaluating DeepPVP’s ability to find causative vari- ants

We use a nested cross-validation experiment as our main evaluation and as means to optimize hyperparameters of our DeepPVP model. We first split our training instances into five folds (80% for training and 20% for testing) where each fold is stratified by disease (i.e., the diseases are disjoint between all five folds). We use the training part in each of these folds for optimizing parameters and hyperparameters and use the 20% to evaluate and report the final performance. In each of the resulting five folds, we split the 80% training set into further five folds that are similarly stratified by disease. In total, we end up with 25 different training sets (in the second level of this nested cross-validation) and evaluation sets. For each of the 25 different splits of data, we tuned the hyperparameters of the network (number of hidden layers, number of hidden units, and batch size) using the Hyperas library [242]. We generated 50 trial models with 50 epochs for each model using the tree-structured Parzen estimator algorithm [243] to find optimal hyperparameters. We ended up with 25 optimized hyperparameters, and we selected the hyperparameters outperesulting in the best performance in most of the folds. After hyperparameter optimization, we train five different models using the opti- mal set of hyperparameters obtained in the second level of the cross-validation and evaluate the predictive performance of the model on the 20% set that has not been used for optimizing hyperparameters. The resulting accuracy of our model in 5-fold cross-validation is 0.911, ROCAUC is 0.959, and the AUPR is 0.954. For compari- son, we also performed the same training and evaluation steps using a random forest classifier, and we obtained an accuracy of 0.898, a ROCAUC of 0.958, and an AUPR of 0.952. 91 To accurately evaluate the performance of DeepPVP on real sequencing data, we apply DeepPVP to causative variants added to the ClinVar database on or after Feb 7th, 2017 while our training data is restricted to the variants that have been added to ClinVar before this date. Between Feb 7th, 2017 and Feb 6th, 2018, there were 5,686 causative variants added to ClinVar, covering 1,370 diseases. 297 of these diseases were not present in our training data. Evaluation on completely unseen variants allows us to estimate under more realistic conditions how well DeepPVP is able to prioritize novel variants. We generated synthetic patient exomes by inserting a single causative variant from the set of 5,686 variants in a randomly selected exome from the 1000 Genomes Project [185] (removing all variants with Minor Allele Frequency (MAF) greater than 1% (using the frequencies provided by the 1000 Genomes across all populations). We then assign the phenotypes associated with the causative variant in ClinVar, as well as the mode of inheritance of the disease, to the synthetic exome and consider this combination a synthetic patient. We use DeepPVP to prioritize variants given the synthetic patient’s filtered VCF file, phenotypes, and mode of inheritance, and determine the rank at which the causative (inserted) variant is found. For comparison, we use PVP v1.1 [230] as well as the random forest classifier we trained using the same training setup as DeepPVP (named DeepPVP-RF). We further compare the performance to the Exomiser version 7.2.1 released on Feb 6th, 2017 with and without using CADD scores as feature. Furthermore, we compare the performance against CADD [165], DANN [166], and GWAVA [164]. Table 6.1 shows the evaluation results. We find that DeepPVP has an improved performance compared to the original PVP, the use of a neural network classifier gives better results than the random forest classifier, and DeepPVP outperforms Exomiser, CADD, DANN, and GWAVA in this evaluation. Of the 5,686 “new” variants in our ClinVar evaluation set, 5,489 are in 934 genes 92 which are associated with phenotypes. These 5,489 variants are associated with 1,289 diseases. Only 197 variants are in 74 novel genes and are associated with 89 diseases. We test the performance of DeepPVP separately on these 197 variants. DeepPVP identifies 46 of the 197 variants (23%) at rank one, and 87 variants (44%) in the first ten ranks. In comparison, Exomiser and CADD identified 26 and 13 variants at the first, and 61 and 51 variants in the top ten ranks, respectively. Exomiser-CADD identified 27 at rank one, and 64 in the top ten ranks. This evaluation demonstrates that DeepPVP can not only identify variants in known disease-associated genes but also in novel genes, although with lower performance than if the gene is already known. While the predictive performance of DeepPVP in this evaluation is lower than in the other types of evaluation, DeepPVP still improves over established methods such as CADD and Exomiser.

Table 6.1: Comparison of top ranks of ClinVar variants as recovered from WES data filtered by MAF > 1% using different methods.

Top hit Top 10 hits Total ROC AUC AUPR DeepPVP 4,060 (71.40%) 4,750 (83.54%) 5,686 0.94 0.66 DeepPVP-RF 3,520 (61.91%) 4,321 (75.99%) 5,686 0.95 0.55 PVP 1.1 3,619 (63.65%) 4,076 (71.68%) 5,686 0.95 0.55 Exomiser 2,910 (51.18%) 3,608 (63.45%) 5,686 0.89 0.43 Exomiser-CADD 2,926 (51.46%) 3,621 (63.68%) 5,686 0.89 0.43 CADD 1,060 (18.64%) 2,429 (42.72%) 5,686 0.94 0.14 DANN 170 (2.99%)) 1,322 (23.25%)) 5,686 0.90 0.03 GWAVA 63 (1.11%) 264 (4.64%) 5,686 0.66 0.01

Our performance results demonstrate that DeepPVP can identify causative vari- ants with significantly higher recall at rank one and rank ten than several other methods to which we compare, including the original PVP system [230] from which DeepPVP is derived. In some applications of variant prioritization, it is also impor- tant to identify causative variants quickly and with low computational costs. We therefore benchmarked the time it takes DeepPVP to process large VCF files. We used a machine equipped with 128 GB Memory and an Intel Xeon ES-2680 v3 CPU 93 with 2.50GHz and 16 cores, using a 64-bit Ubuntu 1604˙ system. We selected a genome from the Personal Genome Project (PGP) [249] which contains 4,120,185 variants to benchmark DeepPVP. We prioritized variants in this genome using DeepPVP ten times and recorded the time elapsed. On average, analysis of all the variants in the whole genome VCF file took DeepPVP 85 minutes, i.e., approximately 13˙ milliseconds per variant. Analyzing the same VCF file using the phenotype-based Exomiser software (with and without CADD annotations) took 189 minutes without CADD annotations (ap- proximately 27˙ milliseconds per variant) and 800 minutes with CADD annotations (approximately 116˙ milliseconds per variant).

6.5 Conclusions

DeepPVP is an easy to use and fast phenotype-based tool for prioritizing variants in personal whole exome or whole genome sequence data. DeepPVP takes a VCF file of an individual as input, together with an ontology-based description of the phenotypes observed in an individual. It then aims to identify the variants of the individual that are causative of the phenotypes observed. Through the use of a deep neural network and an updated training and evalua- tion strategy, DeepPVP improves over its predecessor PVP, and further outperforms several established methods for variant prioritization, including the phenotype-based tool Exomiser [187, 188] and pathogenicity scoring algorithms such as CADD [165]. Importantly, DeepPVP shows a better performance than other methods in finding variants in novel genes, i.e., genes not previously associated with a disease pheno- type, and may therefore be particularly suited for investigating variants in orphan diseases as well as variants of unknown significance in genes not yet associated with phenotypes. We update DeepPVP in regular intervals when new training data (i.e., vari- 94 ants associated with diseases and phenotypes, as well as gene–phenotype associ-

ations) becomes available. DeepPVP is freely available at https://github.com/ bio-ontology-research-group/phenomenet-vp.

6.6 Availability and requirements

Availability and requirements

• Project name: DeepPVP

• Project home page: https://github.com/bio-ontology-research-group/ phenomenet-vp

• Operating system: Java virtual machine

• Programming language: Java, Groovy, Python

• Other requirements: none

• License: 4-clause BSD-style license 95

Chapter 7

OligoPVP: Phenotype-driven analysis of individual genomic information to prioritize oligogenic disease variants

I published the work presented in this chapter at Scientific Reports (Paper citation: Boudellioua I, Kulmanov M, Schofield PN, Gkoutos GV, and Hoehndorf R. (2018) OligoPVP: Phenotype-driven analysis of individual genomic information to prioritize oligogenic disease variants. Scientific Reports 8(1): e1005500. https://doi.org/10.1038/s41598-018-32876-3). My contribution involves conceiving the experiments, the design of the methodology and formal analysis, the validation and interpretation of findings, the implementation of the software, and the writing of the manuscript. The full author contribution is as follows: IB and RH conceived the experiments, IB conducted the experiments, IB and MK implemented the software and algorithms, IB, PNS, GVG, RH analyzed the results.

7.1 Abstract

An increasing number of disorders have been identified for which two or more distinct alleles in two or more genes are required to either cause the disease or to significantly modify its onset, severity or phenotype. It is difficult to discover such interactions using existing approaches. The purpose of our work is to develop and evaluate a system that can identify combinations of alleles underlying digenic and oligogenic diseases in individual whole exome or whole genome sequences. Information that links 96 patient phenotypes to databases of gene–phenotype associations observed in clinical or non-human model organism research can provide useful information and improve variant prioritization for genetic diseases. Additional background knowledge about interactions between genes can be utilized to identify sets of variants in different genes in the same individual which may then contribute to the overall disease phenotype. We have developed OligoPVP, an algorithm that can be used to prioritize causative combinations of variants in digenic and oligogenic diseases, using whole exome or whole genome sequences together with patient phenotypes as input. We demonstrate that OligoPVP has significantly improved performance when compared to state of the art pathogenicity detection methods in the case of digenic diseases. Our results show that OligoPVP can efficiently prioritize sets of variants in di- genic diseases using a phenotype-driven approach and identify etiologically important variants in whole genomes. OligoPVP naturally extends to oligogenic disease involv- ing interactions between variants in two or more genes. It can be applied to the identification of multiple interacting candidate variants contributing to phenotype, where the action of modifier genes is suspected from pedigree analysis or failure of traditional causative variant identification.

7.2 Introduction

Discrimination of causative genetic variants responsible for disease is a major chal- lenge. An increasingly large family of algorithms and strategies has been developed to aid in identification of such variants [227]. These methods use properties of vari- ants such as evolutionary conservation, predicted structural changes, allele frequency and function to predict pathogenicity. For variants in non-coding sequence regions, additional information used by computational models includes predicted regulatory function and recognized DNA–protein or DNA–RNA interactions [227, 250, 251]. Furthermore, phenotype annotations to human and model organism genes can be 97 added to provide another layer of discrimination between involved pathogenic and non-pathogenic variants [252, 183, 253]. Phenotype-based methods can identify the likelihood that a particular gene or gene product may give rise to phenotypes observed in an individual [254, 255]. The increasing availability of patient sequence information coupled with resources that provide a detailed phenotypic characterization of diseases, as well as the wealth of gene-to-phenotype associations from non-human disease models [232], are now enabling new approaches to the prioritization of causative variants and facilitat- ing our ability to dissect the genetic underpinnings of disease [183]. PhenomeNET [172], developed in 2011, is a computational framework that utilizes pan-phenomic data from human and non-human model organisms to prioritize candidate genes in genetically-based diseases [172]. We have combined PhenomeNET with genome-wide pathogenicity predictions to develop the PhenomeNET Variant Predictor (PVP) [252] as a system that combines information about pathogenicity of variants with known gene–phenotype associations to predict causative variants. We recently developed the PVP system to classify variants into those likely to be causative or non-causative [252]. While PVP has a significantly better performance in the prioritization of single variants in monogenic diseases than competing algorithms [252], the phenotypes of many diseases with a recognized genetic origin show a range of severity, phenotypic spectrum, age of onset and prognosis [256]. While characteristic phenotypic vari- ability can be associated with different alleles in single disease-causing genes or their mode of inheritance, it has been known for some time that in many diseases there is variable expressivity, and in some cases variable penetrance, associated with the same primary mutation in different individuals or pedigrees [257]. This implies that in those cases the phenotypic variability observed must be due to the presence of other modifier variants or environmental influences. Increasingly, several diseases are being 98 understood within the context of complex inheritance and multifactorial disease phe- notypes where multiple independent variants modify each other’s effect on phenotype [258], or, in some cases, render the disease di- or oligogenic where variants in two or more different genes are needed for its clinical manifestation [259]. Evidence for digenic inheritance is available for around 50 diseases [260] and details are gathered in the Digenic disease database (DIDA) [94]. Epistatic interactions have been postulated to explain the missing heritability in many types of common and rare disease [257], and with the increasing clinical use of next generation sequencing, further evidence is accumulating for a spectrum of types of interactions between genes. These interactions are manifest in different ways. For example, there is evidence from population genetics for phenotypic modifier genes such as the modifiers of the age of onset in Huntington’s disease [261, 262], and from a candidate gene approach in Parkinson’s disease [263]. In a similar candidate approach for amyotrophic lateral sclerosis (ALS), affected individuals with proven or potentially pathogenic mutations in two or three known ALS genes are associated again with lower age of disease onset. Congenital hypothyroidism has both rare monogenic recessive loss-of-function, and common, apparently sporadic, forms. The recent description of patients carrying bial- lelic and triallelic digenic combinations of variants in known thyroid development and function genes [264] has lead to the suggestion that a frequent oligogenic origin might explain sporadic hypothyroid cases. Evidence has also been provided for rare trigenic involvement of variants, such as in TSHR, SLC26A4 and GLIS3 [265]. In addition to these examples of genetic interaction, the observations suggesting digenic/triallelic inheritance in Bardet-Biedel syndrome (BBS) [266] continue to provoke interest and further research, and illustrate the challenges in establishing formal digenicity [267]. One of the best characterized cases of digenic inheritance is a form of Usher syndrome where digenic heterozygous mutations in CDH23 and PCDH15 have been shown to 99 interact in both humans and mice [268]. Other examples are critically discussed elsewhere [260]. Digenic disease can be divided into two classes: strict digenic disease where vari- ants at both loci are required for the disease, and composite disease which is either the result of the epistatic relationship between alleles of two independent genes modifying the phenotypes of the individual mutations alone, or the phenotypic overlay of two monogenic Mendelian diseases present in the same patient [269, 270]. Identification of the genes involved in all of these types of digenic disease usually requires pedigree information or the use of existing knowledge about candidate genes. For example, the selection of candidate genes may rely on the availability of additional information about molecular or functional connections between the entities (genes or gene prod- ucts) bearing the variants [264]. The difficulties in establishing strong evidence for digenic inheritance are discussed elsewhere [260, 271]. Computational identification of likely causative alleles that are involved in di- genic or genetically more complex diseases, in particular for genes not previously associated with the disease, is particularly challenging; such methods would have to be able to incorporate and utilize a large amount of background information about molecular and (patho-)physiological interactions within an organism to determine how combinations of variants jointly result in an observed phenotype. The observa- tion that disease-implicated proteins often interact with each other has stimulated the development of network-based approaches to identification of disease modules [272, 273, 274, 275]. However, relevant interactions may occur across much larger distances within pathways and networks, or at the whole organism physiological level where knowledge about biological systems and multi-scale interactions is critical for understanding pathobiology [231, 276]. Phenotypes provide a readout for all of these disease-relevant interactions and offer insights into the underlying pathobiological mechanisms [277]. Phenotype data can be a powerful source of information for vari- 100 ant prioritization and is complementary to pathogenicity prediction methods based on molecular information [252, 278, 182, 187, 142]. Here, we first evaluate the ability of the PVP system to identify combinations of variants in digenic diseases obtained from a database of digenic diseases. We then present OligoPVP, a novel algorithm for prioritizing digenic or higher order combinations of variants in personal genomes. While the OligoPVP algorithm will not identify whether a phenotype or disease has a digenic or oligogenic inheritance, it identifies and ranks potential causative variants in the same way as PVP but then prioritizes pairs or sets of higher cardinality of potentially interacting variants present in the same genome, as specified by the user, on the basis of prior knowledge about genetic, regulatory and biochemical interactions between them. It is therefore mainly useful to identify candidate causative sets of variants in cases in which digenic or oligogenic inheritance is already suspected, for example due to variable penetrance or expressivity in family studies, or as means of exclusion because established methods for variant prioritization failed. We apply OligoPVP to the identification of variants in pairs of genes where mutations in two separate genes present in a single individual and lead to a par- ticular phenotypic profile that is not apparent in individuals carrying only one of these variants. We demonstrate that OligoPVP is able to identify gene variant sets in digenic diseases using a set of synthetic whole genome sequences into which we insert multiple known causative gene variants. OligoPVP is freely available at https://github.com/bio-ontology-research-group/phenomenet-vp.

7.3 Materials and Methods

7.3.1 Digenic disease

The Digenic Disease Database (DIDA) v2 [94] consists of 258 curated digenic combi- nations representing 54 diseases, with 448 variants in 169 genes. Of the 258 digenic 101 combinations, 189 have Human Phenotype Ontology (HPO) annotations, represent- ing 52 diseases, 153 distinct genes, and 337 unique variants. We use the 189 digenic combinations with HPO annotations in our experiments. 25 of these combinations are triallelic and exhibit compound heterozygosity in one gene while the remaining 164 combinations are biallelic. We use the combinations of variants from DIDA to generate 189 synthetic whole genome sequences by randomly inserting the causative variants in a randomly selected whole genome sequence from the 1000 Genomes Project [185].

7.3.2 Interaction data

We downloaded all interactions occurring in humans from the STRING database version 105˙ [279]. Then, we mapped all interactions to their respective genes us- ing the mapping file provided by STRING to generate 989,998 interactions between genes, representing 13,770 unique genes. We use these interactions between genes to prioritize combinations of variants in OligoPVP.

7.3.3 PhenomeNET Variant Predictor

In our work, we use the PhenomeNET Variant Predictor (PVP) version 2.0 (“Deep- PVP”) [280]. This version of PVP is based on the Human Phenotype Ontology (HPO) [278] and the Mammalian Phenotype Ontology (MP) [281] ontologies obtained from the AberOWL repository [213] on Feb 7th, 2017. PVP uses gene-to-phenotype asso- ciations in humans as well as from the mouse and zebrafish model organisms down- loaded on Feb 7th, 2017 from the HPO website [278], the Mouse Genome Informatics website [281], and the Zebrafish Information Network website [282], respectively. The PVP system was trained on data generated from the ClinVar database [184] accessed on Feb 7th, 2017. PVP combines features that score the pathogenicity of variants with a pheno- 102 type similarity measure that aims to identify whether a variant is likely to cause the phenotypes observed in a patient. Using the cross-species phenotype ontology Phe- nomeNET [283], phenotype similarity also can be computed between patient pheno- types and phenotypes in model organisms. The PVP system used in our analysis, the synthetic genome sequences we gen- erated for the evaluation of our system, and our analysis results can be found at https://github.com/bio-ontology-research-group/phenomenet-vp.

7.3.4 Evaluation and comparison

We compare PVP and OligoPVP to several state of the art variant prioritization method. Specifically, we compare the PVP and OligoPVP scores to variant pathogenic- ity prediction scores obtained from CADD v1.3 [165], DANN v1.0 [166], and GWAVA v1.0 [164]. Furthermore, we compare our results to the phenotype-based tool Genomiser version 7.2.1 [188] using its default parameters.

7.4 Results

7.4.1 Prediction of biallelic and triallelic disease variants

We analyze each WGS using the phenotypes provided for the combination of variants in DIDA. We do not filter any variants by minor allele frequency to avoid missing potentially important interacting variants that might have medium to common fre- quencies in the background population. On average, each WGS in our experiments contains 2,192,967 variants. We use the phenotypes associated with the combination of variants in DIDA as phenotypes associated with the synthetic WGS, and we use PVP [252] to prioritize variants, using an “unknown” mode of inheritance model. Out of 164 whole genome sequences where two variants were inserted, we find both causative variants (i.e., 103 the two variants we inserted) as the highest ranked variants in 88 cases (53.7%) and within the top ten ranks in 107 cases (65.2%) (see Table 7.1). For the 25 cases of triallelic diseases, we find all three causative variants within the first three ranks in 10 cases (40.0%) and we find all three causative variants within the top ten variants in 14 cases (56.0%) (see Table 7.2). Tables 7.1 and 7.2 also compare PVP to established variant prioritization methods, including CADD [165], DANN [166], the phenotype- based Exomiser/Genomiser system [188], and GWAVA [164]. Out of these systems, CADD performs the best in prioritizing combinations of variants; however, PVP can rank variant involved in bi- or triallelic diseases significantly higher than CADD (Mann-Whitney U test, p = 6.8 × 10−5). Individually, the performance of our approach differs between diseases, depending on the availability of gene–phenotype associations and high quality and informative disease–phenotype associations in DIDA. Table 7.3 provides an overview of the per- formance of PVP for individual diseases, and we provide the full analysis results on our website. In particular, for the case of hypodontia, PVP identifies all the causative variant pairs in the top 3 ranks in all synthetic patients, and in Familial long QT syndrome, the causitive variant pairs can be found in the top 3 ranks in over 90% of the synthetic patients. Similarly, for the case of Bardet-Biedl syndrome (BBS), PVP ranks 84.21% of causative variant pairs in the top 10, and identifies digenic causative variants in 9 of the 16-20 genes now implicated in BBS [267, 284]. To ensure that newer versions of ontologies and our training data do not already contain, implicitly, information about associations between variants and disease, we perform a semi-prospective experiment; the PVP system we used is based on ontology versions (HPO and MP) and training data obtained on Feb 7th, 2017. We separately test the performance of our system on the digenic cases added to the DIDA database 104 after Feb 7th, 2017. In total, 45 digenic combinations with HPO annotations are newly added to DIDA after the date our PVP system was built. Among these newly added combinations, 42 are biallelic and 3 are triallelic. Table 7.4 shows the perfor- mance of PVP on these cases. We find that the performance on predicting the new cases drops somewhat in comparison to cases before the PVP build date.

7.4.2 OligoPVP: Use of background knowledge to find causative combinations of variants

Our results demonstrate that PVP can identify combinations of variants implicated in a disease, outperforming current state-of-the-art gene prioritization approaches. The variants found by PVP are commonly in genes that form a disease module, i.e., a set of interacting genes that are jointly associated with a disease or phenotype [285]. For example, out of the 164 biallelic combinations used in our study, we can find evidence of interactions for 71 biallelic combinations and 16 triallelic combina- tions using the interaction database STRING [279]. The STRING resource contains background knowledge about the interaction between genes based on protein-protein interactions, co-expression, pathway involvement, or co-mention in literature, and therefore provides a wide range of distinct interaction types which may underlie a phenotype. We have now exploited this background knowledge to further improve prioritization of variants in oligogenic diseases which involve interactions between alleles of two or more genes. OligoPVP is an algorithm that uses background knowledge from interaction net- works to prioritize variants in oligogenic diseases. It identifies likely causative variants in interacting genes and ranks tuples of n variants in genes that are connected through an interaction network. OligoPVP will first rank all variants in a set of variants (such as those found in a VCF file) independently using PVP and assign each variant v a prediction score σ(v). When ranking combinations of n variants, OligoPVP will 105 then evaluate all n-tuples of variants v1, ..., vn and assign a score σ to the n-tuple

(v1, ..., vn), given an interaction network Υ:

  σ(v1) + ... + σ(vn) if v1, ..., vn are variants in a connected subgraph of Υ σ(v1, ..., vn) =  0 otherwise

Algorithm 1 illustrates the procedure to find oligogenic disease modules in more de- tail. OligoPVP can identify combinations of variants both in exonic and non-exonic regions. For non-exonic variants, we assign the gene that is located closest to the variant as the variant’s gene.

Algorithm 1 OligoPVP prioritization of oligogenic combinations 1: function OligoPVP(k, S, Υ) . k ∈ N+, S a set of variants, Υ = (V,E) an interaction network 2: scores ← [:] 3: candidates ← ∅ 4: for each g ∈ V do 5: candidates ← candidates ∪ {v|v is a variant of g, v is ranked in top k most pathogenic variants in g} 6: end for k 7: for each (v1, ..., vk) ∈ candidates do 8: genes ← ∅ 9: for each vi ∈ {v1, ..., vk} do 10: genes ← genes ∪ gene(vi) . gene(x) maps variant x to a gene 11: end for 12: if genes form a connected subgraph in Υ then Pk 13: scores[v1, ..., vk] = i=1 score(vi) 14: end if 15: end for 16: return scores 17: end function

The OligoPVP algorithm strictly relies on an interaction network as background knowledge and will not prioritize any combinations of variants if they are not con- nected in a known network. For digenic or oligogenic disease we assume that the interactions between alleles are mediated directly or indirectly through physical or 106 regulatory mechanisms at any level, e.g., through DNA (for example transcriptional regulation), RNA (for example processing of RNA or half life modulation), or pro- tein (e.g., direct functional interaction or co-participation in the same pathway or physiological processes). Data relevant to all these levels of interaction is available in STRING. OligoPVP is of course limited by the completeness of the interaction data available and until more complete high level physiological modeling is achieved, interactions at the level of the whole organism physiome will be difficult to capture. OligoPVP utilizes beam search [286] to optimize memory usage. We can simply extend OligoPVP to also consider compound heterozygote combinations of variants by adding self-loops to each node in Υ. The main advantage of OligoPVP is its ability to identify and rank connected sets of variants higher than individual variants. Table 7.5 lists several cases in which OligoPVP prioritizes pairs of variants higher than PVP would prioritize them on their own. On the other hand, OligoPVP will not prioritize combinations of variants if they are in genes that are not connected in the background network Υ. Supplementary Table 1 lists some of the cases which can be prioritized with PVP but not OligoPVP.

7.5 Discussion

With the increasing appreciation of the relationship between complex and Mendelian diseases [287], the ability to discover multiple variants contributing to disease phe- notype in the same genome provides a powerful tool to help understand the genetic architecture of diseases and the underlying physiological pathways. With the advent of whole exome and whole genome sequencing, advances have been made using exist- ing approaches to prioritize causative variants. However, use of standard criteria for the identification of rare disease variants, e.g., a low minor allele frequency (MAF) of, for example, less than 1%, are designed to detect de novo, homozygous, or com- pound heterozygous variants, and may not give sufficient priority to variants of low 107 apparent pathogenicity, haploinsufficiency, or low to medium MAF, although these variants may still be important in the pathogenesis of a disease. Because the approach we take with OligoPVP and PVP makes no assumptions about allele frequency or mode of inheritance, and balances estimates of pathogenicity with phenotypic relat- edness, a wider net is cast and candidate genes affecting the penetrance, expressivity or spectrum of the phenotype are more readily identified. Genes whose variants contribute to a disease phenotype are considered likely to be situated within the same pathway or network [288, 289, 290, 291]. In addition to well established studies of genes involved in, for example the [292, 267], newer studies are now identifying network relations between genes implicated in the oligogenic origins of diseases [293, 294]. Consequently, we can exploit background knowledge on the interactions of gene products in OligoPVP and improve the ranking of candidate pairs of variants over that assigned through pathogenicity and phenotypic relatedness scores alone. Currently, identification of multiple variants contributing to the characteristics of a disease in a cohort or individual patient rely either on a candidate gene approach or the assumption that contributing alleles are likely to be rare in the population. The contribution of rare alleles of low effect, i.e., which by themselves generate sub- clinical phenotypes, for example hypomorphs, may be missed in this way, and rare to medium frequency alleles which modify the penetrance or expressivity of a second remain difficult to identify (the former because of low potential pathogenicity and the latter because of high frequency and lack of association with a phenotype when occurring alone). An alternative strategy for identification of candidate genes for highly heterogeneous human diseases is to use mouse genetics to identify phenotypic modifier genes. For example, neural tube defects are believed to involve more than 300 genes in the mouse, mutations in many of which need to be digenic or trigenic for expression of the phenotype [295]. Similarly, mouse double heterozygous mutants 108 in Nkx2-1 and Pax8 show strain-specific thyroid dysgenesis phenotypes not seen in the individual mutations [296]. The scale of genetic interactions becoming apparent from mouse studies strongly supports the suggestion that in the human, we are only seeing the tip of a very important iceberg [297]. The OligoPVP algorithm offers a generic framework for using background knowl- edge about any form of interaction between genes and gene products to guide the identification of combinations of variants. In its generic form, the worst case com- plexity of the algorithm is O(nk) where n is the number of variants and k the size of the module (the size of the module is a parameter of OligoPVP). It is clear that our algorithm, in its basic form, will not yet scale to large disease modules (i.e., large k); however, in the future, several methods can be used to further improve the average case complexity to find larger disease modules. Furthermore, background knowledge about interactions between genes and gene products is far from complete. In particular, information about coarse scale physi- ological interactions, i.e., those that occur based on whole organism physiology, are significantly underrepresented in interaction databases [231]. Additionally, interac- tion networks may have biases such as over-representation of commonly studied genes [298, 299], and these biases will likely effect the performance of our algorithm. As more genomic data related to complex diseases becomes available, more work will be required to identify and remove these biases in the identification of phenotype modules from personal genomic data. The function of OligoPVP is not to determine whether a disease is formally di- or oligogenic but to assess whether sets of variants may be jointly responsible for phenotype when an individual contains multiple, potentially pathogenic, variants in two or more genes that might, from background knowledge, be expected to interact to generate the phenotypic profile of the patient. The user is able to specify the cardinality of the set of variants that should be prioritized, and the rank scores provide 109 a relative measure of the likelihood that two, three or more specific allelic variants might be involved. While this limitation restricts OligoPVP’s applicability, we nevertheless believe our algorithm to have important applications. The most likely scenario in which OligoPVP can be used successfully is to generate hypotheses for combinations of specific variants after other variant prioritization approaches have failed to yield sig- nificant results. Alternatively, knowledge of family history or ethnic background might suggest the contribution of two or more genes to the disease phenotype, and OligoPVP can then be used to identify plausible combinations of genes and allelic variants. OligoPVP is, to the best of our knowledge, the first phenotype-based method to identify disease modules in personal genomic data. With the large (i.e., exponen- tial) number of combinations of variants that have to be evaluated in finding disease modules, it is clear that any computational method has to make use of background knowledge to restrict the search space of potentially causative combinations of vari- ants. OligoPVP is such a method which uses knowledge about interactions and pheno- type associations to limit the search space. In the future, more background knowledge can be incorporated to improve OligoPVP’s coverage as well as accuracy. OligoPVP is freely available at https://github.com/bio-ontology-research-group/phenomenet-vp. 110

Table 7.1: Comparison of different variant prioritization systems for recovering bial- lelic variants. We split the evaluation in two parts, one in which we consider all vari- ants and another in which we only consider variants for which we have background knowledge about their interactions.

All Interacting only Top pair Top 10 pairs Comb Top pair Top 10 pairs IntComb PVP 88 (53.7%) 107 (65.2%) 164 42 (59.2%) 51 (71.8%) 71 CADD 34 (20.7%) 87 (53.1%) 164 10 (14.1%) 37 (52.1%) 71 DANN 5 (3.1%) 59 (36.0%)) 164 0 17 (23.9%) 71 Genomiser 0 0 164 0 0 71 GWAVA 0 0 164 0 0 71 OligoPVP 47 (28.7%) 59 (36.0%) 164 47 (66.2%) 59 (83.1%) 71

Table 7.2: Comparison of different variant prioritization systems for recovering tri- allelic variants. We split the evaluation in two parts, one in which we consider all variants and another in which we only consider variants for which we have background knowledge about their interactions.

All Interacting only Top triple Top 10 triple Comb Top triple Top 10 triple IntComb PVP 10 (40.0%) 14 (56.0%) 25 7 (43.8%) 10 (40.0%) 16 CADD 4 (16.0%) 9 (36.0%) 25 7 (43.8%) 12 (75.0%) 16 DANN 0 6 (24.0%)) 25 0 4 (25.0%) 16 Genomiser 0 0 25 0 0 16 GWAVA 0 0 25 0 0 16 OligoPVP 10 (40.0%) 10 (40.0%) 25 10 (62.5%) 10 (62.5%) 16

Table 7.3: Analysis of top ranks of variants by PVP summarized by disease

Top hit Top 3 hits Top 10 hits Variants (Combina- tions) Familial long QT syndrome 21 (50.0%) 38 (90.5%) 41 (97.6%) 42 (21) Kallmann syndrome 18 (47.4%) 27 (71.1%) 27 (71.1%) 38 (19) Bardet-Biedl syndrome 14 (36.8%) 28 (73.7%) 32 (84.2%) 38 (14) Alport syndrome 14 (45.2%) 28 (90.3%) 29 (93.6%) 31 (15) Non-syndromic genetic deafness 12 (50.0%) 18 (75.0%) 18 (75.0%) 24 (12) Oculocutaneous albinism 6 (40.0%) 12 (80.0%) 12 (80.0%) 15 (7) Primary ovarian insufficiency 2 (13.3%) 2 (13.3%) 2 (13.3%) 15 (7) Usher syndrome 5 (33.3%) 11 (73.3%) 12 (80.0%) 15 (7) Hypodontia 6 (50.0%) 12 (100.0%) 12 (100.0%) 12 (6) Others 66 (38.2%) 118 (68.2%) 128 (74.0%) 173 (81) 111

Table 7.4: Performance of PVP on all combinations present in DIDA database (All DIDA), combinations added before the PVP build date date of Feb 7th, 2017 (Old DIDA), and combinations added after the cutoff date of Feb 7th, 2017 (New DIDA)

Biallelic Triallelic Top pair Top 10 pairs Comb Top triple Top 10 triples IntComb All DIDA 88 (53.7%) 107 (65.2%) 164 10 (40.0%) 14 (56.0%) 25 Old DIDA 69 (56.6%) 84 (86.9%) 122 10 (45.5%) 14 (63.6%) 22 New DIDA 19 (45.2%) 23 (54.8%) 42 0 0 3 112 OligoPVP Rank PVP Rank B 11 7 14130 2 4 1 4 PVP Rank A 22 1 3 steroid-resistant nephrotic syndrome drome with febrile seizures-plus PHANET) Seckel syndrome 22 1 5 hemochromatosis ) ) Familial long QT syn- ) Generalized epilepsy T ) Usher syndrome 7 1 1 ) CANDLE syndrome) CANDLE syndrome 8 292 1 1 1 2 ) CANDLE syndrome 1980 1 2 A ) Familial idiopathic ) Familial atrial fibrillation 1 21 4 T T C > A > T > > > ) Rare hereditary > > A > c.1022C c.224C c.224C c.313A c.1571G c.193delC c.622G c.892C 3015delAAinsT c.845G c.3014 ) NPHS2 ( CEP152 ( KCNQ1 ( PSMB8 ( PSMB8 ( LMNA ( SCN2A ( CDH23 ( PSMB8 ( HFE ( ) ) ) ) G ) ) A C ) T ) ) > A A A > > > 5603delAAC > > > 698delAAG 112delAGA c.1488G c.379C c.4187T c.696 c.5601 c.404+2T c.666C c.5054C c.212G c.110 CD2AP ( KCNE1 ( CDK5RAP2 ( PSMA3 ( PCDH15 ( PSMA3 ( PSMB4 ( SCN1A ( HAMP ( EMD ( dd114 dd053 dd229 DIDA ID Gene Add225 Gene B Disease name (OR- dd007 dd043 dd052 dd226 dd228 dd159 Table 7.5: Cases ofinteractions DIDA in combinations improved the by prioritization OligoPVPPVP of in on variant comparison individual tuples. to variants. PVP. We OligoPVP compare incorporates the protein-protein results of applying OligoPVP to the ranks obtained using 113

Chapter 8

Conclusion

8.1 Summary

The accelerating pace of the generation of genomic data is providing an unprecedented opportunity to enable advancements in personalized diagnosis and treatment. How- ever, for this to become a reality, there is a pressuring need to develop computational tools for the analysis and interpretation of genomic data. Such advanced bioinfor- matics methods will bridge the gap between basic science conducted in biology and translational research applications in the clinic. This task is challenging due to the limited knowledge regarding diseases phenotypes, their genes, and causal genomic variants. Also, incomplete penetrance and variable expressivity further complicate the interpretation of genomic variants with respect to diseases, and thus impede the diagnostic and discovery process. In this thesis, we presented methods for the prioritization of genomic variants that are likely to cause Mendelian and oligogenic diseases. Our novel approaches are phenotype-driven in the sense that they integrate gene-to-phenotype knowledge from human and model organisms, along with several predictive measures for variants’ pathogenicity. The premise is to predict variants that are not only likely to have functional effects, but are also in genes associated with the observed phenotypes on the patient. There are some limitations to our methods that arise from the inherent nature of the training data. That is, there were cases missed by our methods which might be due to the lack of gene-to-phenotype relations data for the respective genes 114 and diseases phenotypes, or the low prediction values of the pathogenicity prediction methods used as features by our methods. In such cases, these variants were ranked low, and hence, predicted as false negatives. Given how challenging it is to infer causality of genomic variants [2, 300], we demonstrated the performance of our methods in comparison to state-of-the-art ap- proaches in retrospective studies. In doing so, we created synthetic genomic data by inserting variants that are known to be causative of certain disease phenotypes, and observe how well the existing methods, along with ours, are capable of recovering the inserted variants. We utilized the UK10K project data, in particular, patients with a confirmed molecular diagnosis for congenital hypothyroidism.. More and more predictive methods for the interpretation of genomic variants are being developed. With such an increasing number of computational tools, it is challenging to measure their performance in a fair and systematic way. For instance, in machine learning based methods, reporting cross-validation results as a way to quantify the performance is not sufficient. Moreover, overfitting issue of some methods have been shown as their performance on unseen data was remarkably lower on novel genomic data (see Chapter 5. In this case, emphasis should be put on showcasing the performance on real patient data in a retrospective manner, which ensure the reliability of the proposed method as well as its applicability in the genetics clinic.

8.2 Future Research Work

. The work presented in this thesis can be extended in the following directions. In order to overcome the limitation of incomplete gene-to-phenotype associations knowl- edge, we propose the following. First, PhenomeNet ontology that is incorporated in our methods, can be extended to include phenotypic knowledge from other model organisms. That is, studies on fruit fly, yeast, and worm genes and phenotypes can 115 be useful in elucidating genes involved in autoimmune diseases, developmental dis- eases, and metabolic diseases respectively. Moreover, we can enrich the ontology by including not only curated knowledge about gene-to-phenotype associations, but also those associations generated computationally via text mining or machine learning approaches. Another promising approach that was shown to match, and sometimes outperform semantic similarity-based methods for comparing genes to disease (or patients) phenotypes, is through the exploitations of large biological heterogeneous knowledge graphs [301]. This method uses feature learning to generate vector-based representations of phenotypes associated with genes that are directly or indirectly associated with phenotypes through biological networks. Using these vector-based representations as training features for DeepPVP, and OligoPVP can be evaluated especially in cases where diseases have little or unknown associated genes. On a different note, extending DeepPVP and OligoPVP to incorporate informa- tion about common genomic variants, accounting for their small, yet aggregating effect is a key area of future work. With a wealth of knowledge about genomic vari- ants and their associations with common diseases, as obtained from GWAS, it is essential to find ways to gain insight into variant prioritization for common genetic diseases. This may aid in identifying disease subtypes among a set of patients which is key for prognosis. In addition, genetic disorders with compound heterozygosity need to be considered for such extension in order to assess these unique cases. An- other area of improvement includes optimizing the usage of biological networks for OligoPVP. Currently, OligoPVP incorporates PPI networks for prioritization of oli- gogenic combinations. Other interactions such as gene co-expressions, or co-mention in text mined literature can be evaluated for coverage and accuracy of prioritization. 116

REFERENCES

[1] E. A. Ashley, “Towards precision medicine,” Nature Reviews Genetics, vol. 17, no. 9, pp. 507–522, Aug. 2016. [Online]. Available: http: //dx.doi.org/10.1038/nrg.2016.86

[2] D. G. MacArthur, T. A. Manolio, D. P. Dimmock, H. L. Rehm, J. Shendure, G. R. Abecasis, D. R. Adams, R. B. Altman, S. E. Antonarakis, E. A. Ashley, J. C. Barrett, L. G. Biesecker, D. F. Conrad, G. M. Cooper, N. J. Cox, M. J. Daly, M. B. Gerstein, D. B. Goldstein, J. N. Hirschhorn, S. M. Leal, L. A. Pennacchio, J. A. Stamatoyannopoulos, S. R. Sunyaev, D. Valle, B. F. Voight, W. Winckler, and C. Gunter, “Guidelines for investigating causality of sequence variants in human disease,” Nature, vol. 508, no. 7497, pp. 469–476, Apr. 2014. [Online]. Available: http://dx.doi.org/10.1038/nature13127

[3] A. Griffiths, S. Wessler, S. Carroll, and J. Doebley, An Introduction to Genetic Analysis. Macmillan Learning, 2015. [Online]. Available: https://books.google.com.sa/books?id=kfthMgEACAAJ

[4] D. Noble, “Genes and causation,” Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 366, no. 1878, pp. 3001–3015, 2008. [Online]. Available: http://rsta.royalsocietypublishing.org/content/366/1878/3001

[5] W. W. Johannsen, Elemente der exakten erblichkeitslehre. Deutsche wesentlich erweiterte ausgabe in funfundzwanig vorlesungen,. Jena,G. Fischer,, https://www.biodiversitylibrary.org/bibliography/94247. [Online]. Available: https://www.biodiversitylibrary.org/item/171889

[6] R. Dahm, “Discovering dna: Friedrich miescher and the early years of nucleic acid research,” Human Genetics, vol. 122, no. 6, pp. 565–581, Jan 2008. [Online]. Available: https://doi.org/10.1007/s00439-007-0433-0 117 [7] J. D. Watson and F. H. C. Crick, “Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid,” Nature, vol. 171, no. 4356, pp. 737–738, Apr. 1953. [Online]. Available: http://dx.doi.org/10.1038/171737a0

[8] F. CRICK, “Central dogma of molecular biology,” Nature, vol. 227, pp. 561 EP –, Aug 1970. [Online]. Available: http://dx.doi.org/10.1038/227561a0

[9] P. Portin and A. Wilkins, “The evolving definition of the term “gene”,” Genetics, vol. 205, no. 4, pp. 1353–1364, 2017. [Online]. Available: http://www.genetics.org/content/205/4/1353

[10] A. Uzman, “Molecular Cell Biology (4th edition) Harvey Lodish, Arnold Berk, S. Lawrence Zipursky, Paul Matsudaira, David Baltimore and James Darnell; Freeman & Co., New York, NY, 2000, 1084 pp., list price $102.25, ISBN 0-7167-3136-3,” Biochemistry and Molecular Biology Education, vol. 29, no. 3, pp. 126–128, 2001. [Online]. Available: http://dx.doi.org/10.1016/s1470-8175(01)00023-6

[11] J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith, M. Yandell, C. A. Evans, R. A. Holt, J. D. Gocayne, P. Amanatides, R. M. Ballew, D. H. Huson, J. R. Wortman, Q. Zhang, C. D. Kodira, X. H. Zheng, L. Chen, M. Skupski, G. Subramanian, P. D. Thomas, J. Zhang, G. L. Gabor Miklos, C. Nelson, S. Broder, A. G. Clark, J. Nadeau, V. A. McKusick, N. Zinder, A. J. Levine, R. J. Roberts, M. Simon, C. Slayman, M. Hunkapiller, R. Bolanos, A. Delcher, I. Dew, D. Fasulo, M. Flanigan, L. Florea, A. Halpern, S. Hannenhalli, S. Kravitz, S. Levy, C. Mobarry, K. Reinert, K. Remington, J. Abu-Threideh, E. Beasley, K. Biddick, V. Bonazzi, R. Brandon, M. Cargill, I. Chandramouliswaran, R. Charlab, K. Chaturvedi, Z. Deng, V. D. Francesco, P. Dunn, K. Eilbeck, C. Evangelista, A. E. Gabrielian, W. Gan, W. Ge, F. Gong, Z. Gu, P. Guan, T. J. Heiman, M. E. Higgins, R.-R. Ji, Z. Ke, K. A. Ketchum, Z. Lai, Y. Lei, Z. Li, J. Li, Y. Liang, X. Lin, F. Lu, G. V. Merkulov, N. Milshina, H. M. Moore, A. K. Naik, V. A. Narayan, B. Neelam, D. Nusskern, D. B. Rusch, S. Salzberg, W. Shao, B. Shue, J. Sun, Z. Y. Wang, A. Wang, X. Wang, J. Wang, M.-H. Wei, R. Wides, C. Xiao, C. Yan, A. Yao, J. Ye, M. Zhan, W. Zhang, H. Zhang, Q. Zhao, L. Zheng, F. Zhong, W. Zhong, S. C. 118 Zhu, S. Zhao, D. Gilbert, S. Baumhueter, G. Spier, C. Carter, A. Cravchik, T. Woodage, F. Ali, H. An, A. Awe, D. Baldwin, H. Baden, M. Barnstead, I. Barrow, K. Beeson, D. Busam, A. Carver, A. Center, M. L. Cheng, L. Curry, S. Danaher, L. Davenport, R. Desilets, S. Dietz, K. Dodson, L. Doup, S. Ferriera, N. Garg, A. Gluecksmann, B. Hart, J. Haynes, C. Haynes, C. Heiner, S. Hladun, D. Hostin, J. Houck, T. Howland, C. Ibegwam, J. Johnson, F. Kalush, L. Kline, S. Koduru, A. Love, F. Mann, D. May, S. McCawley, T. McIntosh, I. McMullen, M. Moy, L. Moy, B. Murphy, K. Nelson, C. Pfannkoch, E. Pratts, V. Puri, H. Qureshi, M. Reardon, R. Rodriguez, Y.-H. Rogers, D. Romblad, B. Ruhfel, R. Scott, C. Sitter, M. Smallwood, E. Stewart, R. Strong, E. Suh, R. Thomas, N. N. Tint, S. Tse, C. Vech, G. Wang, J. Wetter, S. Williams, M. Williams, S. Windsor, E. Winn-Deen, K. Wolfe, J. Zaveri, K. Zaveri, J. F. Abril, R. Guig´o,M. J. Campbell, K. V. Sjolander, B. Karlak, A. Kejariwal, H. Mi, B. Lazareva, T. Hatton, A. Narechania, K. Diemer, A. Muruganujan, N. Guo, S. Sato, V. Bafna, S. Istrail, R. Lippert, R. Schwartz, B. Walenz, S. Yooseph, D. Allen, A. Basu, J. Baxendale, L. Blick, M. Caminha, J. Carnes-Stine, P. Caulk, Y.-H. Chiang, M. Coyne, C. Dahlke, A. D. Mays, M. Dombroski, M. Donnelly, D. Ely, S. Esparham, C. Fosler, H. Gire, S. Glanowski, K. Glasser, A. Glodek, M. Gorokhov, K. Graham, B. Gropman, M. Harris, J. Heil, S. Henderson, J. Hoover, D. Jennings, C. Jordan, J. Jordan, J. Kasha, L. Kagan, C. Kraft, A. Levitsky, M. Lewis, X. Liu, J. Lopez, D. Ma, W. Majoros, J. McDaniel, S. Murphy, M. Newman, T. Nguyen, N. Nguyen, M. Nodell, S. Pan, J. Peck, M. Peterson, W. Rowe, R. Sanders, J. Scott, M. Simpson, T. Smith, A. Sprague, T. Stockwell, R. Turner, E. Venter, M. Wang, M. Wen, D. Wu, M. Wu, A. Xia, A. Zandieh, and X. Zhu, “The sequence of the human genome,” Science, vol. 291, no. 5507, pp. 1304–1351, 2001. [Online]. Available: http://science.sciencemag.org/content/291/5507/1304

[12] D. E. Reich, S. F. Schaffner, M. J. Daly, G. McVean, J. C. Mullikin, J. M. Higgins, D. J. Richter, E. S. Lander, and D. Altshuler, “Human genome sequence variation and the influence of gene history, mutation and recombination,” Nature Genetics, vol. 32, no. 1, pp. 135–142, aug 2002. [Online]. Available: https://doi.org/10.1038%2Fng947

[13] L. Feuk, A. R. Carson, and S. W. Scherer, “Structural variation in the 119 human genome.” Nature reviews. Genetics, vol. 7, no. 2, pp. 85–97, Feb. 2006. [Online]. Available: http://dx.doi.org/10.1038/nrg1767

[14] “E pluribus unum,” Nature Methods, vol. 7, pp. 331 EP –, May 2010, editorial. [Online]. Available: http://dx.doi.org/10.1038/nmeth0510-331

[15] Z. E. Sauna and C. Kimchi-Sarfaty, “Understanding the contribution of synonymous mutations to human disease,” Nat Rev Genet, vol. 12, no. 10, pp. 683–691, Oct. 2011. [Online]. Available: http://dx.doi.org/10.1038/nrg3051

[16] A. Gusev, A. Ko, H. Shi, G. Bhatia, W. Chung, B. W. J. H. Penninx, R. Jansen, E. J. C. de Geus, D. I. Boomsma, F. A. Wright, P. F. Sullivan, E. Nikkola, M. Alvarez, M. Civelek, A. J. Lusis, T. Lehtim¨aki, E. Raitoharju, M. K¨ah¨onen, I. Sepp¨al¨a, O. T. Raitakari, J. Kuusisto, M. Laakso, A. L. Price, P. Pajukanta, and B. Pasaniuc, “Integrative approaches for large-scale transcriptome-wide association studies,” Nature Genetics, vol. 48, pp. 245 EP –, Feb 2016, article. [Online]. Available: http://dx.doi.org/10.1038/ng.3506

[17] , “The genotype-tissue expression (gtex) pilot analysis: Multitissue gene regulation in humans,” Science, vol. 348, no. 6235, pp. 648–660, 2015. [Online]. Available: http://science.sciencemag.org/content/348/6235/648

[18] T. E. P. Consortium, I. Dunham, A. Kundaje, S. F. Aldred, P. J. Collins, C. A. Davis, F. Doyle, C. B. Epstein, S. Frietze, J. Harrow, R. Kaul, J. Khatun, B. R. Lajoie, S. G. Landt, B.-K. Lee, F. Pauli, K. R. Rosenbloom, P. Sabo, A. Safi, A. Sanyal, N. Shoresh, J. M. Simon, L. Song, N. D. Trinklein, R. C. Altshuler, E. Birney, J. B. Brown, C. Cheng, S. Djebali, X. Dong, J. Ernst, T. S. Furey, M. Gerstein, B. Giardine, M. Greven, R. C. Hardison, R. S. Harris, J. Herrero, M. M. Hoffman, S. Iyer, M. Kellis, P. Kheradpour, T. Lassmann, Q. Li, X. Lin, G. K. Marinov, A. Merkel, A. Mortazavi, S. C. J. Parker, T. E. Reddy, J. Rozowsky, F. Schlesinger, R. E. Thurman, J. Wang, L. D. Ward, T. W. Whitfield, S. P. Wilder, W. Wu, H. S. Xi, K. Y. Yip, J. Zhuang, B. E. Bernstein, E. D. Green, C. Gunter, M. Snyder, M. J. Pazin, R. F. Lowdon, L. A. L. Dillon, L. B. Adams, C. J. Kelly, J. Zhang, J. R. Wexler, P. J. Good, E. A. Feingold, G. E. Crawford, J. Dekker, L. Elnitski, P. J. Farnham, M. C. Giddings, T. R. Gingeras, R. Guig´o,T. J. Hubbard, W. J. Kent, J. D. Lieb, E. H. Margulies, R. M. Myers, J. A. Stamatoyannopoulos, S. A. Tenenbaum, 120 Z. Weng, K. P. White, B. Wold, Y. Yu, J. Wrobel, B. A. Risk, H. P. Gunawardena, H. C. Kuiper, C. W. Maier, L. Xie, X. Chen, T. S. Mikkelsen, S. Gillespie, A. Goren, O. Ram, X. Zhang, L. Wang, R. Issner, M. J. Coyne, T. Durham, M. Ku, T. Truong, M. L. Eaton, A. Dobin, A. Tanzer, J. Lagarde, W. Lin, C. Xue, B. A. Williams, C. Zaleski, M. R¨oder,F. Kokocinski, R. F. Abdelhamid, T. Alioto, I. Antoshechkin, M. T. Baer, P. Batut, I. Bell, K. Bell, S. Chakrabortty, J. Chrast, J. Curado, T. Derrien, J. Drenkow, E. Dumais, J. Dumais, R. Duttagupta, M. Fastuca, K. Fejes-Toth, P. Ferreira, S. Foissac, M. J. Fullwood, H. Gao, D. Gonzalez, A. Gordon, C. Howald, S. Jha, R. Johnson, P. Kapranov, B. King, C. Kingswood, G. Li, O. J. Luo, E. Park, J. B. Preall, K. Presaud, P. Ribeca, D. Robyr, X. Ruan, M. Sammeth, K. S. Sandhu, L. Schaeffer, L.-H. See, A. Shahab, J. Skancke, A. M. Suzuki, H. Takahashi, H. Tilgner, D. Trout, N. Walters, H. Wang, Y. Hayashizaki, A. Reymond, S. E. Antonarakis, G. J. Hannon, Y. Ruan, P. Carninci, C. A. Sloan, K. Learned, V. S. Malladi, M. C. Wong, G. P. Barber, M. S. Cline, T. R. Dreszer, S. G. Heitner, D. Karolchik, V. M. Kirkup, L. R. Meyer, J. C. Long, M. Maddren, B. J. Raney, L. L. Grasfeder, P. G. Giresi, A. Battenhouse, N. C. Sheffield, K. A. Showers, D. London, A. A. Bhinge, C. Shestak, M. R. Schaner, S. Ki Kim, Z. Z. Zhang, P. A. Mieczkowski, J. O. Mieczkowska, Z. Liu, R. M. McDaniell, Y. Ni, N. U. Rashid, M. J. Kim, S. Adar, Z. Zhang, T. Wang, D. Winter, D. Keefe, V. R. Iyer, M. Zheng, P. Wang, J. Gertz, J. Vielmetter, E. Partridge, K. E. Varley, C. Gasper, A. Bansal, S. Pepke, P. Jain, H. Amrhein, K. M. Bowling, M. Anaya, M. K. Cross, M. A. Muratet, K. M. Newberry, K. McCue, A. S. Nesmith, K. I. Fisher-Aylor, B. Pusey, G. DeSalvo, S. L. Parker, S. Balasubramanian, N. S. Davis, S. K. Meadows, T. Eggleston, J. S. Newberry, S. E. Levy, D. M. Absher, W. H. Wong, M. J. Blow, A. Visel, L. A. Pennachio, H. M. Petrykowska, A. Abyzov, B. Aken, D. Barrell, G. Barson, A. Berry, A. Bignell, V. Boychenko, G. Bussotti, C. Davidson, G. Despacio-Reyes, M. Diekhans, I. Ezkurdia, A. Frankish, J. Gilbert, J. M. Gonzalez, E. Griffiths, R. Harte, D. A. Hendrix, T. Hunt, I. Jungreis, M. Kay, E. Khurana, J. Leng, M. F. Lin, J. Loveland, Z. Lu, D. Manthravadi, M. Mariotti, J. Mudge, G. Mukherjee, C. Notredame, B. Pei, J. M. Rodriguez, G. Saunders, A. Sboner, S. Searle, C. Sisu, C. Snow, C. Steward, E. Tapanari, M. L. Tress, M. J. van Baren, S. Washietl, L. Wilming, A. Zadissa, Z. Zhang, M. Brent, D. Haussler, A. Valencia, N. Addleman, R. P. Alexander, R. K. Auerbach, S. Balasubramanian, K. Bettinger, N. Bhardwaj, A. P. Boyle, 121 A. R. Cao, P. Cayting, A. Charos, Y. Cheng, C. Eastman, G. Euskirchen, J. D. Fleming, F. Grubert, L. Habegger, M. Hariharan, A. Harmanci, S. Iyengar, V. X. Jin, K. J. Karczewski, M. Kasowski, P. Lacroute, H. Lam, N. Lamarre-Vincent, J. Lian, M. Lindahl-Allen, R. Min, B. Miotto, H. Monahan, Z. Moqtaderi, X. J. Mu, H. O’Geen, Z. Ouyang, D. Patacsil, D. Raha, L. Ramirez, B. Reed, M. Shi, T. Slifer, H. Witt, L. Wu, X. Xu, K.-K. Yan, X. Yang, K. Struhl, S. M. Weissman, L. O. Penalva, S. Karmakar, R. R. Bhanvadia, A. Choudhury, M. Domanus, L. Ma, J. Moran, A. Victorsen, T. Auer, L. Centanin, M. Eichenlaub, F. Gruhl, S. Heermann, B. Hoeckendorf, D. Inoue, T. Kellner, S. Kirchmaier, C. Mueller, R. Reinhardt, L. Schertel, S. Schneider, R. Sinn, B. Wittbrodt, J. Wittbrodt, E. C. Partridge, G. Jain, G. Balasundaram, D. L. Bates, R. Byron, T. K. Canfield, M. J. Diegel, D. Dunn, A. K. Ebersol, T. Frum, K. Garg, E. Gist, R. S. Hansen, L. Boatman, E. Haugen, R. Humbert, A. K. Johnson, E. M. Johnson, T. V. Kutyavin, K. Lee, D. Lotakis, M. T. Maurano, S. J. Neph, F. V. Neri, E. D. Nguyen, H. Qu, A. P. Reynolds, V. Roach, E. Rynes, M. E. Sanchez, R. S. Sandstrom, A. O. Shafer, A. B. Stergachis, S. Thomas, B. Vernot, J. Vierstra, S. Vong, H. Wang, M. A. Weaver, Y. Yan, M. Zhang, J. M. Akey, M. Bender, M. O. Dorschner, M. Groudine, M. J. MacCoss, P. Navas, G. Stamatoyannopoulos, K. Beal, A. Brazma, P. Flicek, N. Johnson, M. Lukk, N. M. Luscombe, D. Sobral, J. M. Vaquerizas, S. Batzoglou, A. Sidow, N. Hussami, S. Kyriazopoulou-Panagiotopoulou, M. W. Libbrecht, M. A. Schaub, W. Miller, P. J. Bickel, B. Banfai, N. P. Boley, H. Huang, J. J. Li, W. S. Noble, J. A. Bilmes, O. J. Buske, A. D. Sahu, P. V. Kharchenko, P. J. Park, D. Baker, J. Taylor, and L. Lochovsky, “An integrated encyclopedia of dna elements in the human genome,” Nature, vol. 489, pp. 57 EP –, Sep 2012, article. [Online]. Available: http://dx.doi.org/10.1038/nature11247

[19] G. M. Cooper and J. Shendure, “Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data.” Nature reviews. Genetics, vol. 12, no. 9, pp. 628–640, Sep. 2011. [Online]. Available: http://dx.doi.org/10.1038/nrg3046

[20] D. G. MacArthur, S. Balasubramanian, A. Frankish, N. Huang, J. Morris, K. Walter, L. Jostins, L. Habegger, J. K. Pickrell, S. B. Montgomery, C. A. Albers, Z. D. Zhang, D. F. Conrad, G. Lunter, H. Zheng, Q. Ayub, M. A. 122 DePristo, E. Banks, M. Hu, R. E. Handsaker, J. A. Rosenfeld, M. Fromer, M. Jin, X. J. Mu, E. Khurana, K. Ye, M. Kay, G. I. Saunders, M.-M. Suner, T. Hunt, I. H. A. Barnes, C. Amid, D. R. Carvalho-Silva, A. H. Bignell, C. Snow, B. Yngvadottir, S. Bumpstead, D. N. Cooper, Y. Xue, I. G. Romero, , J. Wang, Y. Li, R. A. Gibbs, S. A. McCarroll, E. T. Dermitzakis, J. K. Pritchard, J. C. Barrett, J. Harrow, M. E. Hurles, M. B. Gerstein, and C. Tyler-Smith, “A systematic survey of loss-of-function variants in human protein-coding genes,” Science, vol. 335, no. 6070, pp. 823–828, 2012. [Online]. Available: http://science.sciencemag.org/content/335/6070/823

[21] Phenotype — Learn Science at Scitable. [Online]. Available: http: //www.nature.com/scitable/definition/phenotype-phenotypes-35

[22] N. I. of Health (US), “Biological sciences curriculum study.” NIH Curriculum Supplement Series [Internet], 2007. [Online]. Available: https: //www.ncbi.nlm.nih.gov/books/NBK20364/

[23] B. Kerem, J. Rommens, J. Buchanan, D. Markiewicz, T. Cox, A. Chakravarti, M. Buchwald, and L. Tsui, “Identification of the cystic fibrosis gene: genetic analysis,” Science, vol. 245, no. 4922, pp. 1073–1080, 1989. [Online]. Available: http://science.sciencemag.org/content/245/4922/1073

[24] S. Antonarakis, A. Chakravarti, J. Cohen, and J. Hardy, “Mendelian disorders and multifactorial traits: the big divide or one for all?” Nature reviews. Genetics, vol. 11, no. 5, pp. 380–384, 2010. [Online]. Available: http://dx.doi.org/10.1038/nrg2793

[25] J. Chong, K. Buckingham, S. Jhangiani, C. Boehm, N. Sobreira, J. Smith, T. Harrell, M. McMillin, W. Wiszniewski, T. Gambin, Z. CobanAkdemir, K. Doheny, A. Scott, D. Avramopoulos, A. Chakravarti, J. Hoover-Fong, D. Mathews, P. Witmer, H. Ling, K. Hetrick, L. Watkins, K. Patterson, F. Reinier, E. Blue, D. Muzny, M. Kircher, K. Bilguvar, F. L?pez- Gir?ldez, V. Sutton, H. Tabor, S. Leal, M. Gunel, S. Mane, R. Gibbs, E. Boerwinkle, A. Hamosh, J. Shendure, J. Lupski, R. Lifton, D. Valle, D. Nickerson, and M. Bamshad, “The genetic basis of mendelian phenotypes: Discoveries, challenges, and opportunities,” The American Journal of Human 123 Genetics, vol. 97, no. 2, pp. 199 – 215, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0002929715002451

[26] “Genes and human disease,” Dec 2010. [Online]. Available: http: //www.who.int/genomics/public/geneticdiseases/en/index2.html

[27] A. Auricchio, P. Griseri, M. L. Carpentieri, N. Betsos, A. Staiano, A. Tozzi, M. Priolo, H. Thompson, R. Bocciardi, G. Romeo, A. Ballabio, and I. Ceccherini, “Double heterozygosity for a ¡em¿ret¡/em¿ substitution interfering with splicing and an ¡em¿ednrb¡/em¿ missense mutation in hirschsprung disease,” The American Journal of Human Genetics, vol. 64, no. 4, pp. 1216–1221, Apr 1999. [Online]. Available: https://doi.org/10.1086/302329

[28] T. Aitman, C. Boone, G. Churchill, M. Hengartner, T. MacKay, and D. Stemple, “The future of model organisms in human disease research,” Nature Reviews Genetics, vol. 12, no. 8, pp. 575–582, 2011, cited By 27. [Online]. Available: https://www.scopus.com/inward/record.uri?eid=2-s2. 0-79960527515&partnerID=40&md5=3a44f458c41651e897b87fd3534e0bf3

[29] B. Lehner, “Genotype to phenotype: lessons from model organisms for human genetics,” Nat Rev Genet, vol. 14, no. 3, pp. 168–178, Mar. 2013. [Online]. Available: http://dx.doi.org/10.1038/nrg3404

[30] M. L. Metzker, “Sequencing technologies – the next generation,” Nature Reviews Genetics, vol. 11, pp. 31 EP –, Dec 2009, review Article. [Online]. Available: http://dx.doi.org/10.1038/nrg2626

[31] S. Goodwin, J. D. McPherson, and W. R. McCombie, “Coming of age: ten years of next-generation sequencing technologies,” Nature Reviews Genetics, vol. 17, no. 6, pp. 333–351, Jun. 2016. [Online]. Available: http://dx.doi.org/10.1038/nrg.2016.49

[32] H. Li and R. Durbin, “Fast and accurate short read alignment with burrowswheeler transform,” Bioinformatics, vol. 25, no. 14, pp. 1754–1760, 2009. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btp324 124 [33] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short dna sequences to the human genome,” Genome Biology, vol. 10, no. 3, p. R25, Mar 2009. [Online]. Available: https://doi.org/10.1186/gb-2009-10-3-r25

[34] H. Jiang and W. H. Wong, “Seqmap: mapping massive amount of oligonucleotides to the genome,” Bioinformatics, vol. 24, no. 20, pp. 2395–2396, 2008. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btn429

[35] H. L. Eaves and Y. Gao, “Mom: maximum oligonucleotide mapping,” Bioinformatics, vol. 25, no. 7, pp. 969–970, 2009. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btp092

[36] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and . G. P. D. P. Subgroup, “The sequence alignment/map format and samtools,” Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009. [Online]. Available: http: //dx.doi.org/10.1093/bioinformatics/btp352

[37] H. Li, “A statistical framework for snp calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data,” Bioinformatics, vol. 27, no. 21, pp. 2987–2993, 2011. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btr509

[38] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, and M. A. DePristo, “The genome analysis toolkit: A mapreduce framework for analyzing next-generation dna sequencing data,” Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010. [Online]. Available: http://genome.cshlp.org/content/20/9/1297.abstract

[39] P. Danecek, A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, R. Durbin, and . G. P. A. Group, “The variant call format and vcftools,” Bioinformatics, vol. 27, no. 15, pp. 2156–2158, 2011. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btr330 125 [40] M. I. Mccarthy, G. R. Abecasis, L. R. Cardon, D. B. Goldstein, J. Little, J. P. A. Ioannidis, and J. N. Hirschhorn, “Genome-wide association studies for complex traits: consensus, uncertainty and challenges,” Nature Reviews Genetics, vol. 9, no. 5, pp. 356–369, 1111. [Online]. Available: http://dx.doi.org/10.1038/nrg2344

[41] G. M. Clarke, C. A. Anderson, F. H. Pettersson, L. R. Cardon, A. P. Morris, and K. T. Zondervan, “Basic statistical analysis in genetic case-control studies.” Nature protocols, vol. 6, no. 2, pp. 121–133, Feb. 2011. [Online]. Available: http://dx.doi.org/10.1038/nprot.2010.182

[42] D. Welter, J. MacArthur, J. Morales, T. Burdett, P. Hall, H. Junkins, A. Klemm, P. Flicek, T. Manolio, L. Hindorff, and H. Parkinson, “The nhgri gwas catalog, a curated resource of snp-trait associations,” Nucleic Acids Research, vol. 42, no. D1, pp. D1001–D1006, 2014. [Online]. Available: http://nar.oxfordjournals.org/content/42/D1/D1001.abstract

[43] T. Beck, R. K. Hastings, S. Gollapudi, R. C. Free, and A. J. Brookes, “GWAS Central: a comprehensive resource for the comparison and interrogation of genome-wide association studies,” European Journal of Human Genetics, vol. 22, no. 7, pp. 949–952, Dec. 2013. [Online]. Available: http://dx.doi.org/10.1038/ejhg.2013.274

[44] M. J. Li, Z. Liu, P. Wang, M. P. Wong, M. R. Nelson, J.-P. A. Kocher, M. Yeager, P. C. Sham, S. J. Chanock, Z. Xia, and J. Wang, “Gwasdb v2: an update database for human genetic variants identified by genome-wide association studies,” Nucleic Acids Research, 2015. [Online]. Available: http://nar.oxfordjournals.org/content/early/2015/11/27/nar.gkv1317.abstract

[45] D. Ryu, S. Cho, H. Kim, S. Lee, and W. Kim, “Gepdb: a database for investigating the ternary association of genotype, gene expression and phenotype,” Bioinformatics, vol. 30, no. 17, pp. 2540–2542, 2014. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/30/17/ 2540.abstract

[46] J. D. Eicher, C. Landowski, B. Stackhouse, A. Sloan, W. Chen, N. Jensen, J.-P. Lien, R. Leslie, and A. D. Johnson, “Grasp v2.0: an update on the 126 genome-wide repository of associations between snps and phenotypes,” Nucleic Acids Research, vol. 43, no. D1, pp. D799–D804, 2015. [Online]. Available: http://nar.oxfordjournals.org/content/43/D1/D799.abstract

[47] L. D. Ward and M. Kellis, “Interpreting non-coding variation in complex disease genetics,” Oct 2012.

[48] M. J. Bamshad, S. B. Ng, A. W. Bigham, H. K. Tabor, M. J. Emond, D. A. Nickerson, and J. Shendure, “Exome sequencing as a tool for mendelian disease gene discovery,” Nature Reviews Genetics, vol. 12, pp. 745 EP –, Sep 2011, review Article. [Online]. Available: http://dx.doi.org/10.1038/nrg3031

[49] J. A. Veltman and H. G. Brunner, “De novo mutations in human genetic disease,” Nature Reviews Genetics, vol. 13, pp. 565 EP –, Jul 2012, review Article. [Online]. Available: http://dx.doi.org/10.1038/nrg3241

[50] S. Sawyer, T. Hartley, D. Dyment, C. Beaulieu, J. Schwartzentruber, A. Smith, H. Bedford, G. Bernard, F. Bernier, B. Brais, D. Bulman, J. Warman Chardon, D. Chitayat, J. Deladoy, B. Fernandez, P. Frosk, M. Geraghty, B. Gerull, W. Gibson, R. Gow, G. Graham, J. Green, E. Heon, G. Horvath, A. Innes, N. Jabado, R. Kim, R. Koenekoop, A. Khan, O. Lehmann, R. Mendoza-Londono, J. Michaud, S. Nikkel, L. Penney, C. Polychronakos, J. Richer, G. Rouleau, M. Samuels, V. Siu, O. Suchowersky, M. Tarnopolsky, G. Yoon, F. Zahir, , , J. Majewski, and K. Boycott, “Utility of whole-exome sequencing for those near the end of the diagnostic odyssey: time to address gaps in care,” Clinical Genetics, vol. 89, no. 3, pp. 275–284. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/cge.12654

[51] D. G. e. a. MacArthur, “A systematic survey of loss-of-function variants in human protein-coding genes,” Science, vol. 335, no. 6070, pp. 823–828, 2012. [Online]. Available: http://science.sciencemag.org/content/335/6070/823

[52] Y. Moreau and L.-C. C. Tranchevent, “Computational tools for prioritizing candidate genes: boosting disease gene discovery.” Nature reviews. Genetics, vol. 13, no. 8, pp. 523–536, Aug. 2012. [Online]. Available: http://dx.doi.org/10.1038/nrg3253 127 [53] N.-L. Sim, P. Kumar, J. Hu, S. Henikoff, G. Schneider, and P. C. Ng, “Sift web server: predicting effects of amino acid substitutions on proteins,” Nucleic Acids Research, vol. 40, no. W1, pp. W452–W457, 2012. [Online]. Available: http://nar.oxfordjournals.org/content/40/W1/W452.abstract

[54] S. Chun and J. C. Fay, “Identification of deleterious mutations within three human genomes,” Genome Research, vol. 19, no. 9, pp. 1553–1561, 2009. [Online]. Available: http://genome.cshlp.org/content/19/9/1553.abstract

[55] E. V. Davydov, D. L. Goode, M. Sirota, G. M. Cooper, A. Sidow, and S. Batzoglou, “Identifying a high fraction of the human genome to be under selective constraint using gerp++,” PLoS Comput Biol, vol. 6, no. 12, pp. 1–13, 12 2010. [Online]. Available: http://dx.doi.org/10.1371%2Fjournal.pcbi. 1001025

[56] Y.-F. Huang, B. Gulko, and A. Siepel, “Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data,” Nature Genetics, vol. 49, no. 4, pp. 618–624, Mar. 2017. [Online]. Available: http://dx.doi.org/10.1038/ng.3810

[57] I. A. Adzhubei, S. Schmidt, L. Peshkin, V. E. Ramensky, A. Gerasimova, P. Bork, A. S. Kondrashov, and S. R. Sunyaev, “A method and server for predicting damaging missense mutations,” Nat Meth, vol. 7, no. 4, pp. 248–249, Apr. 2010. [Online]. Available: http://dx.doi.org/10.1038/nmeth0410-248

[58] Y. Bromberg and B. Rost, “Snap: predict effect of non-synonymous polymorphisms on function,” Nucleic Acids Research, vol. 35, no. 11, pp. 3823–3835, 2007. [Online]. Available: http://nar.oxfordjournals.org/content/ 35/11/3823.abstract

[59] “Predicting the effects of amino acid substitutions on protein function,” Annual Review of Genomics and Human Genetics, vol. 7, no. 1, pp. 61–80, 2006, pMID: 16824020. [Online]. Available: http://dx.doi.org/10.1146/ annurev.genom.7.080505.115630

[60] M. J. Landrum, J. M. Lee, G. R. Riley, W. Jang, W. S. Rubinstein, D. M. Church, and D. R. Maglott, “Clinvar: public archive of 128 relationships among sequence variation and human phenotype,” Nucleic Acids Research, vol. 42, no. D1, pp. D980–D985, 2014. [Online]. Available: http://nar.oxfordjournals.org/content/42/D1/D980.abstract

[61] P. D. Stenson, M. Mort, E. V. Ball, K. Shaw, A. D. Phillips, and D. N. Cooper, “The human gene mutation database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine,” Human Genetics, vol. 133, no. 1, pp. 1–9, 2014. [Online]. Available: http://dx.doi.org/10.1007/s00439-013-1358-4

[62] H. A. Shihab, M. F. Rogers, J. Gough, M. Mort, D. N. Cooper, I. N. M. Day, T. R. Gaunt, and C. Campbell, “An integrative approach to predicting the functional effects of non-coding and coding sequence variation,” Bioinformatics, vol. 31, no. 10, pp. 1536–1543, 2015. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/31/10/1536.abstract

[63] J. M. M. Schwarz, D. N. Cooper, M. Schuelke, and D. Seelow, “MutationTaster2: mutation prediction for the deep-sequencing age,” Nature methods, vol. 11, no. 4, pp. 361–362, Apr. 2014. [Online]. Available: http://dx.doi.org/10.1038/nmeth.2890

[64] G. R. Ritchie, I. Dunham, E. Zeggini, and P. Flicek, “Functional annotation of noncoding sequence variants.” Nature methods, vol. 11, no. 3, pp. 294–296, Mar. 2014. [Online]. Available: http://view.ncbi.nlm.nih.gov/pubmed/24487584

[65] M. Kircher, D. M. Witten, P. Jain, B. J. O’Roak, G. M. Cooper, and J. Shendure, “A general framework for estimating the relative pathogenicity of human genetic variants.” Nature genetics, vol. 46, no. 3, pp. 310–315, Mar. 2014. [Online]. Available: http://dx.doi.org/10.1038/ng.2892

[66] D. Quang, Y. Chen, and X. Xie, “Dann: a deep learning approach for annotating the pathogenicity of genetic variants,” Bioinformatics, vol. 31, no. 5, pp. 761–763, 2015. [Online]. Available: http://bioinformatics.oxfordjournals. org/content/31/5/761.abstract

[67] G. A. McVean, D. M. Altshuler Co-Chair, R. M. Durbin Co-Chair, G. R. Abecasis, D. R. Bentley, A. Chakravarti, A. G. Clark, P. Donnelly, E. E. 129 Eichler, P. Flicek, S. B. Gabriel, R. A. Gibbs, E. D. Green, M. E. Hurles, B. M. Knoppers, J. O. Korbel, E. S. Lander, C. Lee, H. Lehrach, E. R. Mardis, G. T. Marth, G. A. McVean, D. A. Nickerson, J. P. Schmidt, S. T. Sherry, J. Wang, R. K. Wilson, R. A. Gibbs Principal Investigator, H. Dinh, C. Kovar, S. Lee, L. Lewis, D. Muzny, J. Reid, M. Wang, J. Wang Principal Investigator, X. Fang, X. Guo, M. Jian, H. Jiang, X. Jin, G. Li, J. Li, Y. Li, Z. Li, X. Liu, Y. Lu, X. Ma, Z. Su, S. Tai, M. Tang, B. Wang, G. Wang, H. Wu, R. Wu, Y. Yin, W. Zhang, J. Zhao, M. Zhao, X. Zheng, Y. Zhou, E. S. Lander Principal Investigator, D. M. Altshuler, S. B. Gabriel Co-Chair, N. Gupta, P. Flicek Principal Investigator, L. Clarke, R. Leinonen, R. E. Smith, X. Zheng-Bradley, D. R. Bentley Principal Investigator, R. Grocock, S. Humphray, T. James, Z. Kingsbury, H. Lehrach Principal Investigator, R. Sudbrak Project Leader, M. W. Albrecht, V. S. Amstislavskiy, T. A. Borodina, M. Lienhard, F. Mertes, M. Sultan, B. Timmermann, M.-L. Yaspo, S. T. Sherry Principal Investigator, G. A. McVean Principal Investigator, E. R. Mardis Co-Principal Investigator Co-Chair, R. K. Wilson Co-Principal Investigator, L. Fulton, R. Fulton, G. M. Weinstock, R. M. Durbin Principal Investigator, S. Balasubramaniam, J. Burton, P. Danecek, T. M. Keane, A. Kolb-Kokocinski, S. McCarthy, J. Stalker, M. Quail, J. P. Schmidt Principal Investigator, C. J. Davies, J. Gollub, T. Webster, B. Wong, Y. Zhan, A. Auton Principal Investigator, R. A. Gibbs Principal Investigator, F. Yu Project Leader, M. Bainbridge, D. Challis, U. S. Evani, J. Lu, D. Muzny, U. Nagaswamy, J. Reid, A. Sabo, Y. Wang, J. Yu, J. Wang Principal Investigator, L. J. M. Coin, L. Fang, X. Guo, X. Jin, G. Li, Q. Li, Y. Li, Z. Li, H. Lin, B. Liu, R. Luo, N. Qin, H. Shao, B. Wang, Y. Xie, C. Ye, C. Yu, F. Zhang, H. Zheng, H. Zhu, G. T. Marth Principal Investigator, E. P. Garrison, D. Kural, W.-P. Lee, W. Fung Leong, A. N. Ward, J. Wu, M. Zhang, C. Lee Principal Investigator, L. Griffin, C.-H. Hsieh, R. E. Mills, X. Shi, M. von Grotthuss, C. Zhang, M. J. Daly Principal Investigator, M. A. DePristo Project Leader, D. M. Altshuler, E. Banks, G. Bhatia, M. O. Carneiro, G. del Angel, S. B. Gabriel, G. Genovese, N. Gupta, R. E. Handsaker, C. Hartl, E. S. Lander, S. A. McCarroll, J. C. Nemesh, R. E. Poplin, S. F. Schaffner, K. Shakir, S. C. Yoon Principal Investigator, J. Lihm, V. Makarov, H. Jin Principal Investigator, W. Kim, K. Cheol Kim, J. O. Korbel Principal Investigator, T. Rausch, P. Flicek Principal Investigator, K. Beal, 130 L. Clarke, F. Cunningham, J. Herrero, W. M. McLaren, G. R. S. Ritchie, R. E. Smith, X. Zheng-Bradley, A. G. Clark Principal Investigator, S. Gottipati, A. Keinan, J. L. Rodriguez-Flores, P. C. Sabeti Principal Investigator, S. R. Grossman, S. Tabrizi, R. Tariyal, D. N. Cooper Principal Investigator, E. V. Ball, P. D. Stenson, D. R. Bentley Principal Investigator, B. Barnes, M. Bauer, R. Keira Cheetham, T. Cox, M. Eberle, S. Humphray, S. Kahn, L. Murray, J. Peden, R. Shaw, K. Ye Principal Investigator, M. A. Batzer Principal Investigator, M. K. Konkel, J. A. Walker, D. G. MacArthur Principal Investigator, M. Lek, S. P. Leader, V. S. Amstislavskiy, R. Herwig, M. D. Shriver Principal Investigator, C. D. Bustamante Principal Investigator, J. K. Byrnes, F. M. De La Vega, S. Gravel, E. E. Kenny, J. M. Kidd, P. Lacroute, B. K. Maples, A. Moreno-Estrada, F. Zakharia, E. Halperin Principal Investigator, Y. Baran, D. W. Craig Principal Investigator, A. Christoforides, N. Homer, T. Izatt, A. A. Kurdoglu, S. A. Sinari, K. Squire, S. T. Sherry Principal Investigator, C. Xiao, J. Sebat Principal Investigator, V. Bafna, K. Ye, E. G. Burchard Principal Investigator, R. D. Hernandez Principal Investigator, C. R. Gignoux, D. Haussler Principal Investigator, S. J. Katzman, W. James Kent, B. Howie, A. Ruiz-Linares Principal Investigator, E. T. Dermitzakis Principal Investigator, T. Lappalainen, S. E. Devine Principal Investigator, X. Liu, A. Maroo, L. J. Tallon, J. A. Rosenfeld Principal Investigator, L. P. Michelson, G. R. Abecasis Principal Investigator Co-Chair, H. Min Kang Project Leader, P. Anderson, A. Angius, A. Bigham, T. Blackwell, F. Busonero, F. Cucca, C. Fuchsberger, C. Jones, G. Jun, Y. Li, R. Lyons, A. Maschio, E. Porcu, F. Reinier, S. Sanna, D. Schlessinger, C. Sidore, A. Tan, M. Kate Trost, P. Awadalla Principal Investigator, A. Hodgkinson, G. Lunter Principal Investigator, G. A. McVean Principal Investigator Co-Chair, J. L. Marchini Principal Investigator, S. Myers Principal Investigator, C. Churchhouse, O. Delaneau, A. Gupta-Hinch, Z. Iqbal, I. Mathieson, A. Rimmer, D. K. Xifara, T. K. Oleksyk Principal Investigator, Y. Fu Principal Investigator, X. Liu, M. Xiong, L. Jorde Principal Investigator, D. Witherspoon, J. Xing, E. E. Eichler Principal Investigator, B. L. Browning Principal Investigator, C. Alkan, I. Hajirasouliha, F. Hormozdiari, A. Ko, P. H. Sudmant, E. R. Mardis Co-Principal Investigator, K. Chen, A. Chinwalla, L. Ding, D. Dooling, D. C. Koboldt, M. D. McLellan, J. W. Wallis, M. C. Wendl, Q. Zhang, R. M. Durbin Principal Investigator, M. E. Hurles Principal Investigator, C. A. 131 Albers, Q. Ayub, S. Balasubramaniam, Y. Chen, A. J. Coffey, V. Colonna, P. Danecek, N. Huang, L. Jostins, T. M. Keane, H. Li, S. McCarthy, A. Scally, J. Stalker, K. Walter, Y. Xue, Y. Zhang, M. B. Gerstein Principal Investigator, A. Abyzov, S. Balasubramanian, J. Chen, D. Clarke, Y. Fu, L. Habegger, A. O. Harmanci, M. Jin, E. Khurana, X. Jasmine Mu, C. Sisu, Y. Li, R. Luo, H. Zhu, C. Lee Principal Investigator Co-Chair, L. Griffin, C.-H. Hsieh, R. E. Mills, X. Shi, M. von Grotthuss, C. Zhang, G. T. Marth Principal Investigator, E. P. Garrison, D. Kural, W.-P. Lee, A. N. Ward, J. Wu, M. Zhang, S. A. McCarroll Project Leader, D. M. Altshuler, E. Banks, G. del Angel, G. Genovese, R. E. Handsaker, C. Hartl, J. C. Nemesh, K. Shakir, S. C. Yoon Principal Investigator, J. Lihm, V. Makarov, J. Degenhardt, P. Flicek Principal Investigator, L. Clarke, R. E. Smith, X. Zheng-Bradley, J. O. Korbel Principal Investigator Co-Chair, T. Rausch, A. M. St¨utz,D. R. Bentley Principal Investigator, B. Barnes, R. Keira Cheetham, M. Eberle, S. Humphray, S. Kahn, L. Murray, R. Shaw, K. Ye Principal Investigator, M. A. Batzer Principal Investigator, M. K. Konkel, J. A. Walker, P. Lacroute, D. W. Craig Principal Investigator, N. Homer, D. Church, C. Xiao, J. Sebat Principal Investigator, V. Bafna, J. J. Michaelson, K. Ye, S. E. Devine Principal Investigator, X. Liu, A. Maroo, L. J. Tallon, G. Lunter Principal Investigator, G. A. McVean Principal Investigator, Z. Iqbal, D. Witherspoon, J. Xing, E. E. Eichler Principal Investigator Co-Chair, C. Alkan, I. Hajirasouliha, F. Hormozdiari, A. Ko, P. H. Sudmant, K. Chen, A. Chinwalla, L. Ding, M. D. McLellan, J. W. Wallis, M. E. Hurles Principal Investigator Co-Chair, B. Blackburne, H. Li, S. J. Lindsay, Z. Ning, A. Scally, K. Walter, Y. Zhang, M. B. Gerstein Principal Investigator, A. Abyzov, J. Chen, D. Clarke, E. Khurana, X. Jasmine Mu, C. Sisu, R. A. Gibbs Principal Investigator Co-Chair, F. Yu Project Leader, M. Bainbridge, D. Challis, U. S. Evani, C. Kovar, L. Lewis, J. Lu, D. Muzny, U. Nagaswamy, J. Reid, A. Sabo, J. Yu, X. Guo, Y. Li, R. Wu, G. T. Marth Principal Investigator Co-Chair, E. P. Garrison, W. Fung Leong, A. N. Ward, G. del Angel, M. A. DePristo, S. B. Gabriel, N. Gupta, C. Hartl, R. E. Poplin, A. G. Clark Principal Investigator, J. L. Rodriguez-Flores, P. Flicek Principal Investigator, L. Clarke, R. E. Smith, X. Zheng-Bradley, D. G. MacArthur Principal Investigator, C. D. Bustamante Principal Investigator, S. Gravel, D. W. Craig Principal Investigator, A. Christoforides, N. Homer, T. Izatt, S. T. Sherry Principal Investigator, C. Xiao, E. T. Dermitzakis 132 Principal Investigator, G. R. Abecasis Principal Investigator, H. Min Kang, G. A. McVean Principal Investigator, E. R. Mardis Principal Investigator, D. Dooling, L. Fulton, R. Fulton, D. C. Koboldt, R. M. Durbin Principal Investigator, S. Balasubramaniam, T. M. Keane, S. McCarthy, J. Stalker, M. B. Gerstein Principal Investigator, S. Balasubramanian, L. Habegger, E. P. Garrison, R. A. Gibbs Principal Investigator, M. Bainbridge, D. Muzny, F. Yu, J. Yu, G. del Angel, R. E. Handsaker, V. Makarov, J. L. Rodriguez-Flores, H. Jin Principal Investigator, W. Kim, K. Cheol Kim, P. Flicek Principal Investigator, K. Beal, L. Clarke, F. Cunningham, J. Herrero, W. M. McLaren, G. R. S. Ritchie, X. Zheng-Bradley, S. Tabrizi, D. G. MacArthur Principal Investigator, M. Lek, C. D. Bustamante Principal Investigator, F. M. De La Vega, D. W. Craig Principal Investigator, A. A. Kurdoglu, T. Lappalainen, J. A. Rosenfeld Principal Investigator, L. P. Michelson, P. Awadalla Principal Investigator, A. Hodgkinson, G. A. McVean Principal Investigator, K. Chen, Y. Chen, V. Colonna, A. Frankish, J. Harrow, Y. Xue, M. B. Gerstein Principal Investigator Co-Chair, A. Abyzov, S. Balasubramanian, J. Chen, D. Clarke, Y. Fu, A. O. Harmanci, M. Jin, E. Khurana, X. Jasmine Mu, C. Sisu, R. A. Gibbs Principal Investigator, G. Fowler, W. Hale, D. Kalra, C. Kovar, D. Muzny, J. Reid, J. Wang Principal Investigator, X. Guo, G. Li, Y. Li, X. Zheng, D. M. Altshuler, P. Flicek Principal Investigator Co-Chair, L. Clarke Project Leader, J. Barker, G. Kelman, E. Kulesha, R. Leinonen, W. M. McLaren, R. Radhakrishnan, A. Roa, D. Smirnov, R. E. Smith, I. Streeter, I. Toneva, B. Vaughan, X. Zheng-Bradley, D. R. Bentley Principal Investigator, T. Cox, S. Humphray, S. Kahn, R. Sudbrak Project Leader, M. W. Albrecht, M. Lienhard, D. W. Craig Principal Investigator, T. Izatt, A. A. Kurdoglu, S. T. Sherry Principal Investigator Co-Chair, V. Ananiev, Z. Belaia, D. Beloslyudtsev, N. Bouk, C. Chen, D. Church, R. Cohen, C. Cook, J. Garner, T. Hefferon, M. Kimelman, C. Liu, J. Lopez, P. Meric, C. O’Sullivan, Y. Ostapchuk, L. Phan, S. Ponomarov, V. Schneider, E. Shekhtman, K. Sirotkin, D. Slotta, C. Xiao, H. Zhang, D. Haussler Principal Investigator, G. R. Abecasis Principal Investigator, G. A. McVean Principal Investigator, C. Alkan, A. Ko, D. Dooling, R. M. Durbin Principal Investigator, S. Balasubramaniam, T. M. Keane, S. McCarthy, J. Stalker, A. Chakravarti Co-Chair, B. M. Knoppers Co-Chair, G. R. Abecasis, K. C. Barnes, C. Beiswanger, E. G. Burchard, C. D. Bustamante, H. Cai, H. Cao, R. M. Durbin, N. Gharani, R. A. Gibbs, C. R. 133 Gignoux, S. Gravel, B. Henn, D. Jones, L. Jorde, J. S. Kaye, A. Keinan, A. Kent, A. Kerasidou, Y. Li, R. Mathias, G. A. McVean, A. Moreno-Estrada, P. N. Ossorio, M. Parker, D. Reich, C. N. Rotimi, C. D. Royal, K. Sandoval, Y. Su, R. Sudbrak, Z. Tian, B. Timmermann, S. Tishkoff, L. H. Toji, C. Tyler Smith, M. Via, Y. Wang, H. Yang, L. Yang, J. Zhu, W. Bodmer, G. Bedoya, A. Ruiz-Linares, C. Zhi Ming, G. Yang, C. Jia You, L. Peltonen, A. Garcia-Montero, A. Orfao, J. Dutil, J. C. Martinez-Cruzado, T. K. Oleksyk, L. D. Brooks, A. L. Felsenfeld, J. E. McEwen, N. C. Clemm, A. Duncanson, M. Dunn, E. D. Green, M. S. Guyer, J. L. Peterson, G. R. Abecasis, A. Auton, L. D. Brooks, M. A. DePristo, R. M. Durbin, R. E. Handsaker, H. Min Kang, G. T. Marth, and G. A. McVean, “An integrated map of genetic variation from 1,092 human genomes,” Nature, vol. 491, no. 7422, pp. 56–65, Oct. 2012. [Online]. Available: http://dx.doi.org/10.1038/nature11632

[68] A. Mathelier, O. Fornes, D. J. Arenillas, C.-y. Chen, G. Denay, J. Lee, W. Shi, C. Shyr, G. Tan, R. Worsley-Hunt, A. W. Zhang, F. Parcy, B. Lenhard, A. Sandelin, and W. W. Wasserman, “Jaspar 2016: a major expansion and update of the open-access database of transcription factor binding profiles,” Nucleic Acids Research, vol. 44, no. D1, p. D110, 2016. [Online]. Available: +http://dx.doi.org/10.1093/nar/gkv1176

[69] K. Wang, M. Li, and H. Hakonarson, “Annovar: functional annotation of genetic variants from high-throughput sequencing data,” Nucleic Acids Research, vol. 38, no. 16, p. e164, 2010. [Online]. Available: http: //nar.oxfordjournals.org/content/38/16/e164.abstract

[70] W. McLaren, L. Gil, S. E. Hunt, H. S. Riat, G. R. S. Ritchie, A. Thormann, P. Flicek, and F. Cunningham, “The ensembl variant effect predictor,” Genome Biology, vol. 17, no. 1, p. 122, 2016. [Online]. Available: http://dx.doi.org/10.1186/s13059-016-0974-4

[71] R. Hoehndorf, P. N. Schofield, and G. V. Gkoutos, “The role of ontologies in biological and biomedical research: a functional perspective,” Briefings in Bioinformatics, vol. 16, no. 6, pp. 1069–1080, 2015. [Online]. Available: http://bib.oxfordjournals.org/content/16/6/1069.abstract 134 [72] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, M. J. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. I. Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, May 2000. [Online]. Available: http://dx.doi.org/10.1038/75556

[73] S. K?hler, S. C. Doelken, C. J. Mungall, S. Bauer, H. V. Firth, I. Bailleul- Forestier, G. C. M. Black, D. L. Brown, M. Brudno, J. Campbell, D. R. FitzPatrick, J. T. Eppig, A. P. Jackson, K. Freson, M. Girdea, I. Helbig, J. A. Hurst, J. J?hn, L. G. Jackson, A. M. Kelly, D. H. Ledbetter, S. Mansour, C. L. Martin, C. Moss, A. Mumford, W. H. Ouwehand, S.-M. Park, E. R. Riggs, R. H. Scott, S. Sisodiya, S. V. Vooren, R. J. Wapner, A. O. M. Wilkie, C. F. Wright, A. T. Vulto-van Silfhout, N. d. Leeuw, B. B. A. de Vries, N. L. Washingthon, C. L. Smith, M. Westerfield, P. Schofield, B. J. Ruef, G. V. Gkoutos, M. Haendel, D. Smedley, S. E. Lewis, and P. N. Robinson, “The human phenotype ontology project: linking molecular biology and disease through phenotype data,” Nucleic Acids Research, vol. 42, no. D1, pp. D966–D974, 2014. [Online]. Available: http://nar.oxfordjournals.org/content/42/D1/D966.abstract

[74] C. L. Smith, C.-A. W. Goldsmith, and J. T. Eppig, “The mammalian phenotype ontology as a tool for annotating, analyzing and comparing phenotypic information,” Genome Biology, vol. 6, no. 1, pp. 1–9, 2004. [Online]. Available: http://dx.doi.org/10.1186/gb-2004-6-1-r7

[75] P. Fey, P. Gaudet, T. Curk, B. Zupan, E. M. Just, S. Basu, S. N. Merchant, Y. A. Bushmanova, G. Shaulsky, W. A. Kibbe, and R. L. Chisholm, “dictybase– a dictyostelium bioinformatics resource update,” Nucleic Acids Res, vol. 37, no. Database issue, pp. D515–9, 2009.

[76] D. Osumi-Sutherland, S. J. Marygold, G. H. Millburn, P. A. McQuilton, L. Ponting, R. Stefancsik, K. Falls, N. H. Brown, and G. V. Gkoutos, “The Drosophila phenotype ontology,” J Biomed Semantics, vol. 4, no. 1, p. 30, 2013.

[77] P. N. Schofield, J. P. Sundberg, B. A. Sundberg, C. McKerlie, and G. V. Gk- outos, “The mouse pathology ontology, mpath; structure and applications,” J 135 Biomed Semantics, vol. 4, no. 1, p. 18, 2013.

[78] G. Schindelman, J. S. Fernandes, C. A. Bastiani, K. Yook, and P. W. Sternberg, “Worm Phenotype Ontology: integrating phenotype data within and beyond the C. elegans community.” BMC bioinformatics, vol. 12, no. 1, pp. 32+, 2011. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-12-32

[79] R. Hoehndorf, M. Alshahrani, G. V. Gkoutos, G. Gosline, Q. Groom, T. Hamann, J. Kattge, S. M. de Oliveira, M. Schmidt, S. Sierra, E. Smets, R. A. Vos, and C. Weiland, “The flora phenotype ontology (FLOPO): tool for integrating morphological traits and phenotypes of vascular plants,” J Biomed Semantics, vol. 7, no. 1, p. 65, 2016.

[80] M. A. Harris, A. Lock, J. Bahler, S. G. Oliver, and V. Wood, “FYPO: the fission yeast phenotype ontology,” Bioinformatics, vol. 29, no. 13, pp. 1671–8, 2013.

[81] S. Jupp, J. Malone, T. Burdett, J. K. Heriche, E. Williams, J. Ellenberg, H. Parkinson, and G. Rustici, “The cellular microscopy phenotype ontology,” J Biomed Semantics, vol. 7, p. 28, 2016.

[82] M. C. Chibucos, A. E. Zweifel, J. C. Herrera, W. Meza, S. Eslamfam, P. Uetz, D. A. Siegele, J. C. Hu, and M. G. Giglio, “An ontology for microbial pheno- types,” BMC Microbiol, vol. 14, p. 294, 2014.

[83] G. V. Gkoutos, E. C. Green, A.-M. M. Mallon, J. M. Hancock, and D. Davidson, “Using ontologies to describe mouse phenotypes.” Genome biology, vol. 6, no. 1, p. R5, 2005. [Online]. Available: http://dx.doi.org/10.1186/gb-2004-6-1-r8

[84] C. Mungall, G. Gkoutos, C. Smith, M. Haendel, S. Lewis, and M. Ash- burner, “Integrating phenotype ontologies across multiple species,” Genome Biol, vol. 11, no. 1, pp. R2+, 2010.

[85] D. Smedley, A. Oellrich, S. Kohler, B. Ruef, , M. Westerfield, P. Robinson, S. Lewis, and C. Mungall, “Phenodigm: analyzing curated annotations to associate animal models with human diseases,” Database, vol. 2013, p. bat025, 2013. [Online]. Available: +http://dx.doi.org/10.1093/database/bat025 136 [86] R. Hoehndorf, P. N. Schofield, and G. V. Gkoutos, “Phenomenet: a whole-phenome approach to disease gene discovery,” Nucleic Acids Research, vol. 39, no. 18, p. e119, 2011. [Online]. Available: http: //nar.oxfordjournals.org/content/39/18/e119.abstract

[87] M.-P. Weng and B.-Y. Liao, “MamPhEA: a web tool for mammalian phenotype enrichment analysis,” Genome Biology, vol. 11, no. Suppl 1, p. P27, 2010. [Online]. Available: http://genomebiology.com/2010/11/S1/P27

[88] L. M. Schriml, C. Arze, S. Nadendla, Y.-W. W. Chang, M. Mazaitis, V. Felix, G. Feng, and W. A. Kibbe, “Disease ontology: a backbone for disease semantic integration,” Nucleic Acids Research, vol. 40, no. D1, pp. D940–D946, 2012. [Online]. Available: http://nar.oxfordjournals.org/content/40/D1/D940. abstract

[89] S. Sarntivijai, D. Vasant, S. Jupp, G. Saunders, A. P. Bento, D. Gonzalez, J. Betts, S. Hasan, G. Koscielny, I. Dunham, H. Parkinson, and J. Malone, “Linking rare and common disease: mapping clinical disease- phenotypes to ontologies in therapeutic target validation,” Journal of Biomedical Semantics, vol. 7, no. 1, pp. 1–11, 2016. [Online]. Available: http://dx.doi.org/10.1186/s13326-016-0051-7

[90] A. J. Brookes and P. N. Robinson, “Human genotype?phenotype databases: aims, challenges and opportunities,” Nature Reviews Genetics, vol. 16, no. 12, pp. 702–715, Nov. 2015. [Online]. Available: http://dx.doi.org/10.1038/nrg3932

[91] A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, and V. A. McKusick, “Online mendelian inheritance in man (omim), a knowledgebase of human genes and genetic disorders,” Nucleic Acids Research, vol. 33, no. suppl 1, pp. D514–D517, 2005. [Online]. Available: http://nar.oxfordjournals.org/content/33/suppl 1/D514.abstract

[92] S. Aym´eand J. Schmidtke, “Networking for rare diseases: a necessity for europe,” Bundesgesundheitsblatt - Gesundheitsforschung - Gesundheitsschutz, vol. 50, no. 12, pp. 1477–1483, 2007. [Online]. Available: http: //dx.doi.org/10.1007/s00103-007-0381-9 137 [93] C. J. Mungall, C. Batchelor, and K. Eilbeck, “Evolution of the sequence ontology terms and relationships,” Journal of Biomedical Informatics, vol. 44, no. 1, pp. 87 – 93, 2011, ontologies for Clinical and Translational Research. [Online]. Available: http://www.sciencedirect.com/science/article/ pii/S1532046410000353

[94] A. M. Gazzo, D. Daneels, E. Cilia, M. Bonduelle, M. Abramowicz, S. Van Dooren, G. Smits, and T. Lenaerts, “Dida: A curated and annotated digenic diseases database,” Nucleic Acids Research, vol. 44, no. D1, pp. D900–D907, 2016. [Online]. Available: http://nar.oxfordjournals.org/content/ 44/D1/D900.abstract

[95] S. T. Sherry, M.-H. Ward, M. Kholodov, J. Baker, L. Phan, E. M. Smigielski, and K. Sirotkin, “dbsnp: the ncbi database of genetic variation,” Nucleic Acids Research, vol. 29, no. 1, pp. 308–311, 2001. [Online]. Available: http://nar.oxfordjournals.org/content/29/1/308.abstract

[96] J. Malone, E. Holloway, T. Adamusiak, M. Kapushesky, J. Zheng, N. Kolesnikov, A. Zhukova, A. Brazma, and H. Parkinson, “Modeling sample variables with an experimental factor ontology,” Bioinformatics, vol. 26, no. 8, pp. 1112–1118, 2010. [Online]. Available: http://bioinformatics.oxfordjournals. org/content/26/8/1112.abstract

[97] J. Pi?ero, N. Queralt-Rosinach, l. Bravo, J. Deu-Pons, A. Bauer-Mehren, M. Baron, F. Sanz, and L. I. Furlong, “Disgenet: a discovery platform for the dynamical exploration of human diseases and their genes,” Database, vol. 2015, 2015. [Online]. Available: http://database.oxfordjournals.org/content/ 2015/bav028.abstract

[98] H. Mi, Q. Dong, A. Muruganujan, P. Gaudet, S. Lewis, and P. D. Thomas, “Panther version 7: improved phylogenetic trees, orthologs and collaboration with the gene ontology consortium,” Nucleic Acids Research, vol. 38, no. suppl 1, pp. D204–D210, 2010. [Online]. Available: http://nar.oxfordjournals.org/content/38/suppl 1/D204.abstract

[99] E. Bragin, E. A. Chatzimichali, C. F. Wright, M. E. Hurles, H. V. Firth, A. P. Bevan, and G. J. Swaminathan, “Decipher: database for the interpretation of 138 phenotype-linked plausibly pathogenic sequence and copy-number variation,” Nucleic Acids Research, vol. 42, no. D1, pp. D993–D1000, 2014. [Online]. Available: http://nar.oxfordjournals.org/content/42/D1/D993.abstract

[100] K. L. Howe, B. J. Bolt, S. Cain, J. Chan, W. J. Chen, P. Davis, J. Done, T. Down, S. Gao, C. Grove, T. W. Harris, R. Kishore, R. Lee, J. Lomax, Y. Li, H.-M. Muller, C. Nakamura, P. Nuin, M. Paulini, D. Raciti, G. Schindelman, E. Stanley, M. A. Tuli, K. Van Auken, D. Wang, X. Wang, G. Williams, A. Wright, K. Yook, M. Berriman, P. Kersey, T. Schedl, L. Stein, and P. W. Sternberg, “Wormbase 2016: expanding to enable helminth genomic research,” Nucleic Acids Research, vol. 44, no. D1, pp. D774–D780, 2016. [Online]. Available: http://nar.oxfordjournals.org/content/44/D1/D774.abstract

[101] H. Attrill, K. Falls, J. L. Goodman, G. H. Millburn, G. Antonazzo, A. J. Rey, S. J. Marygold, and the FlyBase consortium, “Flybase: establishing a gene group resource for drosophila melanogaster,” Nucleic Acids Research, vol. 44, no. D1, pp. D786–D792, 2016. [Online]. Available: http://nar.oxfordjournals.org/content/44/D1/D786.abstract

[102] P. D. Karp, “An ontology for biological function based on molecular interactions,” Bioinformatics, vol. 16, no. 3, pp. 269–285, 2000. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/16/3/269.abstract

[103] K. Eilbeck, S. E. Lewis, C. J. Mungall, M. Yandell, L. Stein, R. Durbin, and M. Ashburner, “The sequence ontology: a tool for the unification of genome annotations,” Genome Biology, vol. 6, no. 5, pp. 1–12, 2005. [Online]. Available: http://dx.doi.org/10.1186/gb-2005-6-5-r44

[104] M. Costa, S. Reeve, G. Grumbling, and D. Osumi-Sutherland, “The drosophila anatomy ontology,” Journal of Biomedical Semantics, vol. 4, no. 1, pp. 1–11, 2013. [Online]. Available: http://dx.doi.org/10.1186/2041-1480-4-32

[105] S. S. Dwight, M. A. Harris, K. Dolinski, C. A. Ball, G. Binkley, K. R. Christie, D. G. Fisk, L. Issel-Tarver, M. Schroeder, G. Sherlock, A. Sethuraman, S. Weng, D. Botstein, and J. M. Cherry, “Saccharomyces genome database (sgd) provides secondary gene annotation using the gene ontology (go),” 139 Nucleic Acids Research, vol. 30, no. 1, pp. 69–72, 2002. [Online]. Available: http://nar.oxfordjournals.org/content/30/1/69.abstract

[106] L. Kreppel, P. Fey, P. Gaudet, E. Just, W. A. Kibbe, R. L. Chisholm, and A. R. Kimmel, “dictybase: a new dictyostelium discoideum genome database,” Nucleic Acids Research, vol. 32, no. suppl 1, pp. D332–D333, 2004. [Online]. Available: http://nar.oxfordjournals.org/content/32/suppl 1/D332.abstract

[107] S. D. Brown and M. W. Moore, “Towards an encyclopaedia of mammalian gene function: the International Mouse Phenotyping Consortium.” Disease models & mechanisms, vol. 5, no. 3, pp. 289–292, May 2012. [Online]. Available: http://dx.doi.org/10.1242/dmm.009878

[108] e. a. Austin, CP., “The Knockout Mouse Project,” Nature Genetics, vol. 36, no. 9, pp. 921–924, Sep. 2004. [Online]. Available: http: //dx.doi.org/10.1038/ng0904-921

[109] M. Hagn, S. Marschall, and M. Hrab? de Angelis, “Emma?the european mouse mutant archive,” Briefings in Functional Genomics and Proteomics, vol. 6, no. 3, pp. 186–192, 2007. [Online]. Available: http://bfg.oxfordjournals.org/ content/6/3/186.abstract

[110] J. T. Eppig, J. A. Blake, C. J. Bult, J. A. Kadin, J. E. Richardson, and T. M. G. D. Group, “The mouse genome database (mgd): facilitating mouse as a model for human biology and disease,” Nucleic Acids Research, vol. 43, no. D1, pp. D726–D736, 2015. [Online]. Available: http://nar.oxfordjournals.org/content/43/D1/D726.abstract

[111] J. H. Finger, C. M. Smith, T. F. Hayamizu, I. J. McCright, J. T. Eppig, J. A. Kadin, J. E. Richardson, and M. Ringwald, “The mouse gene expression database (gxd): 2011 update,” Nucleic Acids Research, vol. 39, no. suppl 1, pp. D835–D841, 2011. [Online]. Available: http://nar.oxfordjournals.org/content/39/suppl 1/D835.abstract

[112] J. Sprague, D. Clements, T. Conlin, P. Edwards, K. Frazer, K. Schaper, E. Segerdell, P. Song, B. Sprunger, and M. Westerfield, “The zebrafish information network (zfin): the zebrafish model organism database,” Nucleic 140 Acids Research, vol. 31, no. 1, pp. 241–243, 2003. [Online]. Available: http://nar.oxfordjournals.org/content/31/1/241.abstract

[113] R. Nigam, D. H. Munzenmaier, E. A. Worthey, M. R. Dwinell, M. Shimoyama, and H. J. Jacob, “Rat strain ontology: structured controlled vocabulary designed to facilitate access to strain data at rgd,” Journal of Biomedical Semantics, vol. 4, no. 1, pp. 1–8, 2013. [Online]. Available: http://dx.doi.org/10.1186/2041-1480-4-36

[114] C. Mungall, C. Torniai, G. Gkoutos, S. Lewis, and M. Haendel, “Uberon, an integrative multi-species anatomy ontology,” Genome Biology, vol. 13, no. 1, p. R5, 2012. [Online]. Available: http://genomebiology.com/2012/13/1/R5

[115] P. N. Robinson, S. K?hler, A. Oellrich, S. M. G. Project, K. Wang, C. J. Mungall, S. E. Lewis, N. Washington, S. Bauer, D. Seelow, P. Krawitz, C. Gilis- sen, M. Haendel, and D. Smedley, “Improved exome prioritization of disease genes through cross-species phenotype comparison,” Genome Res, vol. 24, no. 2, pp. 340–348, 2014.

[116] N. L?pez-Bigas and C. Ouzounis, “Genome-wide identification of genes likely to be involved in human genetic disease,” Nucleic Acids Research, vol. 32, no. 10, pp. 3108–3114, 2004, cited By 175. [Online]. Avail- able: https://www.scopus.com/inward/record.uri?eid=2-s2.0-3543106033& partnerID=40&md5=5135f5442ceeeec13a5aa85df30a41d2

[117] J. Freudenberg and P. Propping, “A similarity-based method for genome-wide prediction of disease-relevant human genes,” Bioinformatics, vol. 18, no. suppl 2, pp. S110–S115, 2002. [Online]. Available: http://bioinformatics. oxfordjournals.org/content/18/suppl 2/S110.abstract

[118] F. S. Turner, D. R. Clutterbuck, and C. A. Semple, “Pocus: mining genomic sequence annotation to predict disease genes,” Genome Biology, vol. 4, no. 11, pp. 1–9, 2003. [Online]. Available: http://dx.doi.org/10.1186/gb-2003-4-11-r75

[119] C. Perez-Iratxeta, P. Bork, and M. Andrade, “Association of genes to genetically inherited diseases using data mining,” Nature Genetics, 141 vol. 31, no. 3, pp. 316–319, 2002, cited By 238. [Online]. Avail- able: https://www.scopus.com/inward/record.uri?eid=2-s2.0-0036649742& partnerID=40&md5=e5042bade146d549f487430694a6235f

[120] S. Aerts, D. Lambrechts, S. Maity, P. Van Loo, B. Coessens, F. De Smet, L.-C. Tranchevent, B. De Moor, P. Marynen, B. Hassan, P. Carmeliet, and Y. Moreau, “Gene prioritization through genomic data fusion,” Nature Biotechnology, vol. 24, no. 5, pp. 537–544, 2006, cited By 506. [Online]. Available: https://www.scopus.com/inward/record.uri?eid=2-s2. 0-33646568805&partnerID=40&md5=ddfb94045267dbe20f14809ddda2cd68

[121] P. N. Robinson, P. Krawitz, and S. Mundlos, “Strategies for exome and genome sequence data analysis in disease-gene discovery projects.” Clinical genetics, vol. 80, no. 2, pp. 127–132, Aug. 2011. [Online]. Available: http://dx.doi.org/10.1111/j.1399-0004.2011.01713.x

[122] K. R. Smith, C. J. Bromhead, M. S. Hildebrand, A. E. Shearer, P. J. Lockhart, H. Najmabadi, R. J. Leventer, G. McGillivray, D. J. Amor, R. J. Smith, and M. Bahlo, “Reducing the exome search space for mendelian diseases using genetic linkage analysis of exome genotypes,” Genome Biology, vol. 12, no. 9, pp. 1–9, 2011. [Online]. Available: http://dx.doi.org/10.1186/gb-2011-12-9-r85

[123] L.-C. Tranchevent, F. B. Capdevila, D. Nitsch, B. De Moor, P. De Causmaecker, and Y. Moreau, “A guide to web tools to prioritize candidate genes,” Briefings in Bioinformatics, vol. 12, no. 1, pp. 22–32, 2011. [Online]. Available: http://bib.oxfordjournals.org/content/12/1/22.abstract

[124] J. Gillis and P. Pavlidis, “?guilt by association? is the exception rather than the rule in gene networks,” PLOS Computational Biology, vol. 8, no. 3, pp. 1–13, 03 2012. [Online]. Available: http://dx.doi.org/10.1371%2Fjournal.pcbi.1002444

[125] E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, and D. Eisenberg, “A combined algorithm for genome-wide prediction of protein function.” Nature, vol. 402, no. 6757, pp. 83–86, Nov. 1999.

[126] X. Ma, T. Chen, and F. Sun, “Integrative approaches for predicting protein function and prioritizing genes for complex phenotypes using protein interaction 142 networks,” Briefings in Bioinformatics, vol. 15, no. 5, pp. 685–698, 2014. [Online]. Available: http://bib.oxfordjournals.org/content/15/5/685.abstract

[127] M. Oti, B. Snel, M. A. Huynen, and H. G. Brunner, “Predicting disease genes using protein–protein interactions,” Journal of Medical Genetics, vol. 43, no. 8, pp. 691–698, 2006. [Online]. Available: http://jmg.bmj.com/content/43/8/691

[128] L. Franke, H. van Bakel, L. Fokkens, E. D. de Jong, M. Egmont-Petersen, and C. Wijmenga, “Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes.” American journal of human genetics, vol. 78, no. 6, pp. 1011–1025, Jun. 2006. [Online]. Available: http://dx.doi.org/10.1086/504300

[129] S. K?hler, S. Bauer, D. Horn, and P. N. Robinson, “Walking the interactome for prioritization of candidate disease genes,” The American Journal of Human Genetics, vol. 82, no. 4, pp. 949 – 958, 2008. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0002929708001729

[130] D. Smedley, S. K?hler, J. C. Czeschik, J. Amberger, C. Bocchini, A. Hamosh, J. Veldboer, T. Zemojtel, and P. N. Robinson, “Walking the interactome for candidate prioritization in exome sequencing studies of mendelian diseases,” Bioinformatics, vol. 30, no. 22, pp. 3215–3222, 2014. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/30/22/3215.abstract

[131] T. Can, O. C¸amoˇglu,and A. K. Singh, “Analysis of protein-protein interaction networks using random walks,” in Proceedings of the 5th International Workshop on Bioinformatics, ser. BIOKDD ’05. New York, NY, USA: ACM, 2005, pp. 61–68. [Online]. Available: http://doi.acm.org/10.1145/1134030.1134042

[132] S. D. Ghiassian, J. Menche, and A.-L. Barab?si, “A disease module detection (diamond) algorithm derived from a systematic analysis of connectivity patterns of disease proteins in the human interactome,” PLOS Computational Biology, vol. 11, no. 4, pp. 1–21, 04 2015. [Online]. Available: http://dx.doi.org/10.1371%2Fjournal.pcbi.1004120

[133] R. J. Prill, D. Marbach, J. Saez-Rodriguez, P. K. Sorger, L. G. Alexopoulos, X. Xue, N. D. Clarke, G. Altan-Bonnet, and G. Stolovitzky, “Towards a 143 rigorous assessment of systems biology models: The dream3 challenges,” PLOS ONE, vol. 5, no. 2, pp. 1–18, 02 2010. [Online]. Available: http://dx.doi.org/10.1371%2Fjournal.pone.0009202

[134] P. C. Iratxeta, P. Bork, and M. A. Andrade, “Association of genes to genetically inherited diseases using data mining,” Nat. Genet., vol. 31, no. 3, pp. 316–319, 2002.

[135] C. Perez-Iratxeta, M. Wjst, P. Bork, and M. A. Andrade, “G2d: a tool for mining genes associated with disease,” BMC Genetics, vol. 6, no. 1, pp. 1–9, 2005. [Online]. Available: http://dx.doi.org/10.1186/1471-2156-6-45

[136] H.-J. Dai, J. C.-Y. Wu, R. T.-H. Tsai, W.-H. Pan, and W.-L. Hsu, “T-hod: a literature-based candidate gene database for hypertension, obesity and diabetes,” Database, vol. 2013, 2013. [Online]. Available: http://database.oxfordjournals.org/content/2013/bas061.abstract

[137] N. Tiffin, J. F. Kelso, A. R. Powell, H. Pan, V. B. Bajic, and W. A. Hide, “Integration of text- and data-mining using ontologies successfully selects disease gene candidates,” Nucleic Acids Research, vol. 33, no. 5, pp. 1544–1552, 2005. [Online]. Available: http://nar.oxfordjournals.org/content/33/5/1544. abstract

[138] B. C. Grau, I. Horrocks, B. Motik, B. Parsia, P. Patel-Schneider, and U. Sattler, “{OWL} 2: The next step for {OWL},” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 6, no. 4, pp. 309 – 322, 2008, semantic Web Challenge 2006/2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1570826808000413

[139] N. L. Washington, M. A. Haendel, C. J. Mungall, M. Ashburner, M. Westerfield, and S. E. Lewis, “Linking human diseases to animal models using ontology- based phenotype annotation,” PLoS Biol, vol. 7, no. 11, pp. 1–20, 11 2009. [Online]. Available: http://dx.doi.org/10.1371%2Fjournal.pbio.1000247

[140] S. K?hler, M. H. Schulz, P. Krawitz, S. Bauer, S. D?lken, C. E. Ott, C. Mundlos, D. Horn, S. Mundlos, and P. N. Robinson, “Clinical diagnostics in human genetics with semantic similarity searches in ontologies,” The American Journal 144 of Human Genetics, vol. 85, no. 4, pp. 457 – 464, 2009. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0002929709003991

[141] P. Resnik, “Semantic similarity in a taxonomy: An Information-Based measure and its application to problems of ambiguity in natural language,” Journal of Artificial Intelligence Research, vol. 11, pp. 95–130, 1999. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.3785

[142] A. Sifrim, D. Popovic, L.-C. Tranchevent, A. Ardeshirdavani, R. Sakai, P. K. andJoris R Vermeesch, J. Aerts, B. D. Moor, and Y. Moreau, “eXtasy: variant prioritization by genomic data fusion,” Nature Methods, vol. 10, pp. 1083–1084, 2013.

[143] D. Smedley, M. Schubach, J. Jacobsen, S. K?hler, T. Zemojtel, M. Spielmann, M. J?ger, H. Hochheiser, N. Washington, J. McMurry, M. Haendel, C. Mungall, S. Lewis, T. Groza, G. Valentini, and P. Robinson, “A whole-genome analysis framework for effective identification of pathogenic regulatory variants in mendelian disease,” The American Journal of Human Genetics, vol. 99, no. 3, pp. 595 – 606, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0002929716302786

[144] M. Singleton, S. Guthery, K. Voelkerding, K. Chen, B. Kennedy, R. Margraf, J. Durtschi, K. Eilbeck, M. Reese, L. Jorde, C. Huff, and M. Yandell, “Phevor combines multiple biomedical ontologies for accurate identification of disease-causing alleles in single individuals and small nuclear families,” The American Journal of Human Genetics, vol. 94, no. 4, pp. 599 – 610, 2014. [Online]. Available: http: //www.sciencedirect.com/science/article/pii/S0002929714001128

[145] D. G. Grimm, C.-A. Azencott, F. Aicheler, U. Gieraths, D. G. MacArthur, K. E. Samocha, D. N. Cooper, P. D. Stenson, M. J. Daly, J. W. Smoller, L. E. Duncan, and K. M. Borgwardt, “The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity,” Human Mutation, vol. 36, no. 5, pp. 513–523. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/humu.22768 145 [146] T. . G. P. Consortium, “A global reference for human genetic variation,” Nature, vol. 526, no. 7571, pp. 68–74, Sep. 2015. [Online]. Available: http://dx.doi.org/10.1038/nature15393

[147] T. U. Consortium, “The UK10K project identifies rare variants in health and disease,” Nature, vol. 526, no. 7571, pp. 82–90, Sep. 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14962

[148] D. J. Burgess, “Network effects of disease mutations,” Nature Reviews Genetics, vol. 16, pp. 317 EP –, May 2015. [Online]. Available: http://dx.doi.org/10.1038/nrg3957

[149] S. B. Ng, K. J. Buckingham, C. Lee, A. W. Bigham, H. K. Tabor, K. M. Dent, C. D. Huff, P. T. Shannon, E. W. Jabs, D. A. Nickerson, J. Shendure, and M. J. Bamshad, “Exome sequencing identifies the cause of a mendelian disorder,” Nat Genet, vol. 42, no. 1, pp. 30–5, 2010.

[150] P. N. Robinson, P. Krawitz, and S. Mundlos, “Strategies for exome and genome sequence data analysis in disease-gene discovery projects,” Clin Genet, vol. 80, no. 2, pp. 127–32, 2011.

[151] A. M. Rosell, L. D. Pena, K. Schoch, R. Spillmann, J. Sullivan, S. R. Hooper, Y. H. Jiang, N. Mathey-Andrews, D. B. Goldstein, and V. Shashi, “Not the end of the odyssey: Parental perceptions of whole exome sequencing (wes) in pediatric undiagnosed disorders,” J Genet Couns, vol. 25, no. 5, pp. 1019–1031, 2016.

[152] J. de Ligt, M. H. Willemsen, B. W. van Bon, T. Kleefstra, H. G. Yntema, T. Kroes, A. T. Vulto-van Silfhout, D. A. Koolen, P. de Vries, C. Gilissen, M. del Rosario, A. Hoischen, H. Scheffer, B. B. de Vries, H. G. Brunner, J. A. Veltman, and L. E. Vissers, “Diagnostic exome sequencing in persons with severe intellectual disability,” N Engl J Med, vol. 367, no. 20, pp. 1921–9, 2012.

[153] K. A. Johansen Taber, B. D. Dickinson, and M. Wilson, “The promise and chal- lenges of next-generation genome sequencing for clinical care,” JAMA Internal Medicine, vol. 174, no. 2, pp. 275–280, 2014. 146 [154] Y. Yang, D. M. Muzny, J. G. Reid, M. N. Bainbridge, A. Willis, P. A. Ward, A. Braxton, J. Beuten, F. Xia, Z. Niu, M. Hardison, R. Person, M. R. Bekheir- nia, M. S. Leduc, A. Kirby, P. Pham, J. Scull, M. Wang, Y. Ding, S. E. Plon, J. R. Lupski, A. L. Beaudet, R. A. Gibbs, and C. M. Eng, “Clinical whole-exome sequencing for the diagnosis of mendelian disorders,” New England Journal of Medicine, vol. 369, no. 16, pp. 1502–1511, 2013.

[155] P. S. Atwal, M.-L. Brennan, R. Cox, M. Niaki, J. Platt, M. Homeyer, A. Kwan, S. Parkin, S. Schelley, L. Slattery, Y. Wilnai, J. A. Bernstein, G. M. Enns, and L. Hudgins, “Clinical whole-exome sequencing: are we there yet?” Genetics in Medicine, vol. 16, pp. 717–719, 2014.

[156] J. Taylor, H. Martin, S. Lise, J. Broxholme, J.-B. Cazier, A. Rimmer, A. Kanapin, G. Lunter, S. Fiddy, C. Allan, A. Aricescu, M. Attar, C. Babbs, J. Becq, D. Beeson, C. Bento, P. Bignell, E. Blair, V. Buckle, K. Bull, O. Cais, H. Cario, H. Chapel, R. Copley, R. Cornall, J. Craft, K. Dahan, E. Daven- port, C. Dendrou, O. Devuyst, A. Fenwick, J. Flint, L. Fugger, R. Gilbert, A. Goriely, A. Green, I. Greger, R. Grocock, A. Gruszczyk, R. Hastings, E. Hatton, D. Higgs, A. Hill, C. Holmes, M. Howard, L. Hughes, P. Humburg, D. Johnson, F. Karpe, Z. Kingsbury, U. Kini, J. Knight, J. Krohn, S. Lam- ble, C. Langman, L. Lonie, J. Luck, D. McCarthy, S. McGowan, M. McMullin, K. Miller, L. Murray, A. Nmeth, M. Nesbit, D. Nutt, E. Ormondroyd, A. Oturai, A. Pagnamenta, S. Patel, M. Percy, N. Petousi, P. Piazza, S. Piret, G. Polanco- Echeverry, N. Popitsch, F. Powrie, C. Pugh, L. Quek, P. Robbins, K. Robson, A. Russo, N. Sahgal, P. van Schouwenburg, A. Schuh, E. Silverman, A. Sim- mons, P. Srensen, E. Sweeney, J. Taylor, R. Thakker, I. Tomlinson, A. Trebes, S. Twigg, H. Uhlig, P. Vyas, T. Vyse, S. Wall, H. Watkins, M. Whyte, L. Witty, B. Wright, C. Yau, D. Buck, S. Humphray, P. Ratcliffe, J. Bell, A. Wilkie, D. Bentley, P. Donnelly, and G. McVean, “Factors influencing success of clini- cal genome sequencing across a broad spectrum of disorders,” Nature Genetics, vol. 47, no. 7, pp. 717–26, 7 2015.

[157] A. Niroula, S. Urolagin, and M. Vihinen, “Pon-p2: prediction method for fast and reliable identification of harmful variants,” PLoS One, vol. 10, no. 2, p. e0117380, 2015. 147 [158] D. G. MacArthur and C. Tyler-Smith, “Loss-of-function variants in the genomes of healthy humans,” Hum Mol Genet, vol. 19, no. R2, pp. R125–30, 2010.

[159] R. Chen, L. Shi, J. Hakenberg, B. Naughton, P. Sklar, J. Zhang, H. Zhou, L. Tian, O. Prakash, M. Lemire, P. Sleiman, W.-Y. Y. Cheng, W. Chen, H. Shah, Y. Shen, M. Fromer, L. Omberg, M. A. Deardorff, E. Zackai, J. R. Bobe, E. Levin, T. J. Hudson, L. Groop, J. Wang, H. Hakonarson, A. Wojcicki, G. A. Diaz, L. Edelmann, E. E. Schadt, and S. H. Friend, “Analysis of 589,306 genomes identifies individuals resilient to severe mendelian childhood diseases.” Nature biotechnology, vol. 34, no. 5, pp. 531–538, apr 2016. [Online]. Available: http://view.ncbi.nlm.nih.gov/pubmed/27065010

[160] N. Huang, I. Lee, E. M. Marcotte, and M. E. Hurles, “Characterising and predicting haploinsufficiency in the human genome,” PLOS Genetics, vol. 6, no. 10, p. e1001154, 2010.

[161] G. M. Cooper and J. Shendure, “Needles in stacks of needles: finding disease- causal variants in a wealth of genomic data,” Nat Rev Genet, vol. 12, no. 9, pp. 628–40, 2011.

[162] Y. Moreau and L.-C. Tranchevent, “Computational tools for prioritizing candi- date genes: boosting disease gene discovery,” Nature Reviews: Genetics, vol. 13, pp. 523–536, 2012.

[163] J. M. Heckmann, H. Uwimpuhwe, R. Ballo, M. Kaur, V. B. Bajic, and S. Prince, “A functional snp in the regulatory region of the decay-accelerating factor gene associates with extraocular muscle pareses in ,” Genes and Immunity, vol. 11, pp. 1–10, 2010.

[164] G. R. S. Ritchie, I. Dunham, E. Zeggini, and P. Flicek, “Functional annotation of noncoding sequence variants,” Nature Methods, vol. 11, pp. 294–296, 2014.

[165] M. Kircher, D. Witten, P. Jain, B. O’Roak, G. Cooper, and J. Shendure, “A general framework for estimating the relative pathogenicity of human genetic variants.” Nature Genetics, vol. 46, no. 5, pp. 310–5, 2014-03-01 00:00:00.001. 148 [166] D. Quang, Y. Chen, and X. Xie, “Dann: a deep learning approach for annotating the pathogenicity of genetic variants,” Bioinformatics, vol. 31, no. 5, pp. 761–763, 2015. [Online]. Available: http://bioinformatics.oxfordjournals. org/content/31/5/761.abstract

[167] H. A. Shihab, M. F. Rogers, J. Gough, M. Mort, D. N. Cooper, I. N. M. Day, T. R. Gaunt, and C. Campbell, “An integrative approach to predicting the functional effects of non-coding and coding sequence variation,” Bioinformatics, vol. 31, no. 10, pp. 1536–1543, 2015. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/31/10/1536.abstract

[168] G. Macintyre, A. Jimeno Yepes, C. S. Ong, and K. Verspoor, “Associating disease-related genetic variants in intergenic regions to the genes they impact,” PeerJ, vol. 2, p. e639, Oct. 2014. [Online]. Available: https: //doi.org/10.7717/peerj.639

[169] O. J. Buske, M. Girdea, S. Dumitriu, B. Gallinger, T. Hartley, H. Trang, A. Misyura, T. Friedman, C. Beaulieu, W. P. Bone, A. E. Links, N. L. Wash- ington, M. A. Haendel, P. N. Robinson, C. F. Boerkoel, D. Adams, W. A. Gahl, K. M. Boycott, and M. Brudno, “Phenomecentral: a portal for phenotypic and genotypic matchmaking of patients with rare genetic diseases,” Hum Mutat, vol. 36, no. 10, pp. 931–40, 2015.

[170] G. V. Gkoutos, E. C. Green, A.-M. M. Mallon, J. M. Hancock, and D. Davidson, “Using ontologies to describe mouse phenotypes.” Genome biology, vol. 6, no. 1, p. R5, 2005. [Online]. Available: http://dx.doi.org/10.1186/gb-2004-6-1-r8

[171] G. V. Gkoutos, P. N. Schofield, and R. Hoehndorf, “The anatomy of phenotype ontologies: principles, properties and applications,” Briefings in Bioinformatics, 2017.

[172] R. Hoehndorf et al., “Phenomenet: a whole-phenome approach to disease gene discovery,” Nucleic Acids Res, vol. 39, no. 18, p. e119, 2011.

[173] G. V. Gkoutos, C. Mungall, S. Dolken, M. Ashburner, S. Lewis, J. Hancock, P. Schofield, S. Kohler, and P. N. Robinson, “Entity/quality- based logical definitions for the human skeletal phenome using PATO,” 149 Annual International Conference of the IEEE Engineering in Medicine and Biology Society., vol. 1, pp. 7069–7072, 2009. [Online]. Available: http://dx.doi.org/10.1109/IEMBS.2009.5333362

[174] C. Mungall, G. Gkoutos, C. Smith, M. Haendel, S. Lewis, and M. Ash- burner, “Integrating phenotype ontologies across multiple species,” Genome Biol, vol. 11, no. 1, pp. R2+, 2010.

[175] G. V. Gkoutos and R. Hoehndorf, “Ontology-based cross-species integration and analysis of saccharomyces cerevisiae phenotypes,” Journal of Biomedical Semantics, vol. 3, no. Suppl 2, p. S6, September 2012. [Online]. Available: http://www.jbiomedsem.com/content/3/S2/S6

[176] G. V. Gkoutos, P. N. Schofield, and R. Hoehndorf, “The neurobehavior ontol- ogy: An ontology for annotation and integration of behavior and behavioral phenotypes,” in Bioinformatics of Behavior: Part 1, ser. International Review of Neurobiology, E. J. Chesler and M. A. Haendel, Eds. Academic Press, 2012, vol. 103, pp. 69 – 87.

[177] N. Adams, R. Hoehndorf, G. V. Gkoutos, G. Hansen, and C. Hennig, “Pido: The primary immunodeficiency disease ontology,” Bioinformatics, vol. 27, no. 22, pp. 3193–3199, 2011. [Online]. Available: http://bioinformatics. oxfordjournals.org/content/early/2011/09/22/bioinformatics.btr531.abstract

[178] R. Hoehndorf, P. N. Schofield, and G. V. Gkoutos, “An integrative, transla- tional approach to understanding rare and orphan genetically based diseases,” Interface Focus, vol. 3, no. 2, p. 20120055, 2013.

[179] R. Hoehndorf, M. Dumontier, and G. V. Gkoutos, “Identifying aberrant path- ways through integrated analysis of knowledge in pharmacogenomics,” Bioin- formatics, vol. 28, no. 16, pp. 2169–2175, 2012.

[180] R. Hoehndorf, N. W. Hardy, D. Osumi-Sutherland, S. Tweedie, P. N. Schofield, and G. V. Gkoutos, “Systematic analysis of experimental phenotype data re- veals gene functions,” PLoS ONE, vol. 8, no. 4, p. e60847, 2013. 150 [181] R. Hoehndorf, T. Hiebert, N. W. Hardy, P. N. Schofield, G. V. Gkoutos, and M. Dumontier, “Mouse model phenotypes provide information about human drug targets,” Bioinformatics, vol. 30, no. 5, pp. 719–725, 2014.

[182] M. V. Singleton, S. L. Guthery, K. V. Voelkerding, K. Chen, B. Kennedy, R. L. Margraf, J. Durtschi, K. Eilbeck, M. G. Reese, L. B. Jorde, C. D. Huff, and M. Yandell, “Phevor combines multiple biomedical ontologies for accurate identification of disease-causing alleles in single individuals and small nuclear families,” The American Journal of Human Genetics, vol. 94, no. 4, pp. 599 – 610, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0002929714001128

[183] P. N. Robinson et al., “Improved exome prioritization of disease genes through cross-species phenotype comparison,” Genome Res, vol. 24, no. 2, pp. 340–348, 2014.

[184] M. J. Landrum, J. M. Lee, G. R. Riley, W. Jang, W. S. Rubinstein, D. M. Church, and D. R. Maglott, “Clinvar: public archive of relationships among sequence variation and human phenotype,” Nucleic Acids Research, vol. 42, no. D1, pp. D980–D985, 2013. [Online]. Available: http://nar.oxfordjournals.org/content/early/2013/11/14/nar.gkt1113.abstract

[185] The 1000 Genomes Project Consortium, “A global reference for human genetic variation,” Nature, vol. 526, pp. 68–74, 2015.

[186] S. K¨ohler et al., “The human phenotype ontology project: linking molecular bi- ology and disease through phenotype data,” Nucleic Acids Res, vol. 42, no. D1, pp. D966–D974, 2014.

[187] D. Smedley and P. N. Robinson, “Phenotype-driven strategies for exome prioritization of human mendelian disease genes,” Genome Medicine, vol. 7, no. 1, pp. 1–11, 2015. [Online]. Available: http://dx.doi.org/10.1186/ s13073-015-0199-2

[188] D. Smedley, M. Schubach, J. O. B. Jacobsen, S. K¨ohler, T. Zemojtel, M. Spielmann, M. J¨ager,H. Hochheiser, N. L. Washington, J. A. McMurry, M. A. Haendel, C. J. Mungall, S. E. Lewis, T. Groza, G. Valentini, and P. N. 151 Robinson, “A Whole-Genome analysis framework for effective identification of pathogenic regulatory variants in mendelian disease,” The American Journal of Human Genetics, vol. 99, no. 3, pp. 595–606, Sep. 2016. [Online]. Available: http://dx.doi.org/10.1016/j.ajhg.2016.07.005

[189] P. N. Robinson, “Deep phenotyping for precision medicine,” Human Mutation, vol. 33, no. 5, pp. 777–780, 2012. [Online]. Available: http://dx.doi.org/10.1002/humu.22080

[190] U. K. Consortium, K. Walter, J. L. Min, J. Huang, L. Crooks, Y. Memari, S. Mc- Carthy, J. R. Perry, C. Xu, M. Futema, D. Lawson, V. Iotchkova, S. Schiffels, A. E. Hendricks, P. Danecek, R. Li, J. Floyd, L. V. Wain, I. Barroso, S. E. Humphries, M. E. Hurles, E. Zeggini, J. C. Barrett, V. Plagnol, J. B. Richards, C. M. Greenwood, N. J. Timpson, R. Durbin, and N. Soranzo, “The uk10k project identifies rare variants in health and disease,” Nature, vol. 526, no. 7571, pp. 82–90, 2015.

[191] L. Persani, “Congenital hypothyroidism with gland in situ is more frequent than previously thought,” Front Endocrinol (Lausanne), vol. 3, p. 18, 2012, persani, Luca eng Switzerland 2012/06/02 06:00 Front Endocrinol (Lausanne). 2012 Feb 15;3:18. doi: 10.3389/fendo.2012.00018. eCollection 2012.

[192] N. Schoenmakers, K. S. Alatzoglou, V. K. Chatterjee, and M. T. Dattani, “Re- cent advances in central congenital hypothyroidism,” J Endocrinol, vol. 227, no. 3, pp. R51–71, 2015.

[193] I. C. Nettore, V. Cacace, C. De Fusco, A. Colao, and P. E. Macchia, “The molecular causes of thyroid dysgenesis: a systematic review,” J Endocrinol Invest, vol. 36, no. 8, pp. 654–64, 2013.

[194] G. Szinnai, “Clinical genetics of congenital hypothyroidism,” Endocr Dev, vol. 26, pp. 60–78, 2014.

[195] N.-L. Sim, P. Kumar, J. Hu, S. Henikoff, G. Schneider, and P. C. Ng, “Sift web server: predicting effects of amino acid substitutions on proteins,” Nucleic Acids Research, vol. 40, no. W1, pp. W452–W457, 2012. 152 [196] I. A. Adzhubei, S. Schmidt, L. Peshkin, V. E. Ramensky, A. Gerasimova, P. Bork, A. S. Kondrashov, and S. R. Sunyaev, “A method and server for predicting damaging missense mutations,” Nat Meth, vol. 7, no. 4, pp. 248–249, apr 2010.

[197] K. Lichti-Kaiser, G. ZeRuth, and A. M. Jetten, “Transcription factor gli-similar 3 (glis3): Implications for the development of congenital hypothyroidism,” J Endocrinol Diabetes Obes, vol. 2, no. 2, p. 1024, 2014.

[198] K. Devriendt, C. Vanhole, G. Matthijs, and F. de Zegher, “Deletion of thyroid transcription factor-1 gene in an infant with neonatal thyroid dysfunction and respiratory failure,” N Engl J Med, vol. 338, no. 18, pp. 1317–8, 1998.

[199] P. E. Macchia, P. Lapi, H. Krude, M. T. Pirro, C. Missero, L. Chiovato, A. Souabni, M. Baserga, V. Tassi, A. Pinchera, G. Fenzi, A. Grueters, M. Bus- slinger, and R. D. Lauro, “Pax8 mutations associated with congenital hypothy- roidism caused by thyroid dysgenesis,” Nature Genetics, vol. 19, pp. 83–86, 1998.

[200] J. C. Moreno, H. Bikker, M. J. Kempers, A. S. van Trotsenburg, F. Baas, J. J. de Vijlder, T. Vulsma, and C. Ris-Stalpers, “Inactivating mutations in the gene for thyroid oxidase 2 (thox2) and congenital hypothyroidism,” N Engl J Med, vol. 347, no. 2, pp. 95–102, 2002.

[201] M. Caputo, C. M. Rivolta, S. A. Esperante, L. Grueiro-Papendieck, A. Chiesa, C. G. Pellizas, R. Gonzlez-Sarmiento, and H. M. Targovnik, “Congenital hy- pothyroidism with goitre caused by new mutations in the thyroglobulin gene,” Clinical Endocrinology, vol. 67, no. 3, pp. 351–357, 2007.

[202] C. Ris-Stalpers and H. Bikker, “Genetics and phenomics of hypothyroidism and goiter due to {TPO} mutations,” Molecular and Cellular Endocrinology, vol. 322, no. 12, pp. 38 – 43, 2010, special Issue: Genetics of Thyroid Diseases. [Online]. Available: http://www.sciencedirect.com/science/article/ pii/S0303720710000663

[203] Y. Li, H. Yagi, E. O. Onuoha, R. R. Damerla, R. Francis, Y. Furutani, M. Tariq, S. M. King, G. Hendricks, C. Cui, M. Saydmohammed, D. M. Lee, M. Zahid, 153 I. Sami, L. Leatherbury, G. J. Pazour, S. M. Ware, T. Nakanishi, E. Goldmuntz, M. Tsang, and C. W. Lo, “Dnah6 and its interactions with pcd genes in hetero- taxy and primary ciliary dyskinesia,” PLoS Genet, vol. 12, no. 2, p. e1005821, 2016.

[204] A. Nicholas, E. Serra, H. Cangul, S. Alyaarubi, I. Ullah, E. Schoenmakers, A. Deeb, A. Habeb, M. AlMaghamsi, C. Peters, N. Nathwani, Z. Aycan, H. Saglam, E. Bober, M. Dattani, S. Shenoy, P. Murray, A. Babiker, R. Willem- sen, A. Thankamony, G. Lyons, R. Irwin, R. Padidela, K. Tharian, J. Davies, V. Puthi, S.-M. Park, A. Massoud, J. Gregory, A. Albanese, E. Pease-Gevers, H. Martin, K. Brugger, E. Maher, K. Chatterjee, C. Anderson, and N. Schoen- makers, “Comprehensive screening of eight known causative genes in congenital hypothyroidism with gland-in-situ,” The Journal of Clinical Endocrinology & Metabolism, vol. 0, no. 0, pp. jc.2016–1879, 0.

[205] G. M. Church, “The personal genome project,” Molecular Systems Biology, vol. 1, no. 1, p. 2005.0030, 2005. [Online]. Available: http: //msb.embopress.org/content/1/1/2005.0030

[206] B. St Pourcain, D. H. Skuse, W. P. Mandy, K. Wang, H. Hakonarson, N. J. Timpson, D. M. Evans, J. P. Kemp, S. M. Ring, W. L. McArdle, J. Golding, and G. D. Smith, “Variability in the common genetic architecture of social- communication spectrum phenotypes during childhood and adolescence,” Mol Autism, vol. 5, no. 1, p. 18, 2014.

[207] A. Poduri, S. S. Chopra, E. G. Neilan, P. Christina Elhosary, M. A. Kurian, E. Meyer, B. J. Barry, O. S. Khwaja, M. A. M. Salih, T. Stdberg, I. E. Scheffer, E. R. Maher, M. Sahin, B.-L. Wu, G. T. Berry, C. A. Walsh, J. Picker, and S. V. Kothare, “Homozygous plcb1 deletion associated with malignant migrating partial seizures in infancy,” Epilepsia, vol. 53, no. 8, pp. e146–e150, 2012. [Online]. Available: http://dx.doi.org/10.1111/j.1528-1167.2012.03538.x

[208] S. Girirajan, M. Y. Dennis, C. Baker, M. Malig, B. P. Coe, C. D. Campbell, K. Mark, T. H. Vu, C. Alkan, Z. Cheng, L. G. Biesecker, R. Bernier, and E. E. Eichler, “Refinement and discovery of new hotspots of copy-number variation associated with autism spectrum disorder,” Am J Hum Genet, vol. 92, no. 2, pp. 221–37, 2013. 154 [209] W. L. Nichols, M. B. Hultin, A. H. James, M. J. Manco-Johnson, R. R. Montgomery, T. L. Ortel, M. E. Rick, J. E. Sadler, M. Weinstein, and B. P. Yawn, “von willebrand disease (vwd): evidence-based diagnosis and management guidelines, the national heart, lung, and blood institute (nhlbi) expert panel report (usa)1,” Haemophilia, vol. 14, no. 2, pp. 171–232, 2008. [Online]. Available: http://dx.doi.org/10.1111/j.1365-2516.2007.01643.x

[210] W. Ahmad, M. Faiyaz ul Haque, V. Brancolini, H. C. Tsou, S. ul Haque, H. Lam, V. M. Aita, J. Owen, M. deBlaquiere, J. Frank, P. B. Cserhalmi- Friedman, A. Leask, J. A. McGrath, M. Peacocke, M. Ahmad, J. Ott, and A. M. Christiano, “Alopecia universalis associated with a mutation in the hu- man hairless gene,” Science, vol. 279, no. 5351, pp. 720–4, 1998.

[211] Y. Kazakov, M. Kr¨otzsch, and F. Simancik, “The incredible elk,” Journal of Automated Reasoning, vol. 53, no. 1, pp. 1–61, 2014. [Online]. Available: http://dx.doi.org/10.1007/s10817-013-9296-3

[212] R. Hoehndorf, A. Oellrich, and D. Rebholz-Schuhmann, “Interoperability be- tween phenotype and anatomy ontologies,” Bioinformatics, vol. 26, no. 24, pp. 3112 – 3118, 10 2010.

[213] R. Hoehndorf, L. Slater, P. N. Schofield, and G. V. Gkoutos, “Aber-OWL: a framework for ontology-based data access in biology,” BMC Bioinformatics, vol. 16, p. 26, 2015. [Online]. Available: http://www.biomedcentral.com/ 1471-2105/16/26/abstract

[214] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, M. J. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. I. Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, May 2000. [Online]. Available: http://dx.doi.org/10.1038/75556

[215] J. P. Balhoff, I. Mik, M. J. Yoder, P. L. Mullins, and A. R. Deans, “A semantic model for species description applied to the ensign wasps (hymenoptera: Evaniidae) of new caledonia,” Systematic Biology, vol. 62, no. 5, pp. 639–659, 155 2013. [Online]. Available: http://sysbio.oxfordjournals.org/content/62/5/639. abstract

[216] D. G. Howe, Y. M. Bradford, T. Conlin, A. E. Eagle, D. Fashena, K. Frazer, J. Knight, P. Mani, R. Martin, S. A. T. Moxon, H. Paddock, C. Pich, S. Ramachandran, B. J. Ruef, L. Ruzicka, K. Schaper, X. Shao, A. Singer, B. Sprunger, C. E. Van Slyke, and M. Westerfield, “Zfin, the zebrafish model organism database: increased support for mutants and transgenics,” Nucleic Acids Research, vol. 41, no. D1, pp. D854–D860, 2013. [Online]. Available: http://nar.oxfordjournals.org/content/41/D1/D854.abstract

[217] J. Bard, S. Y. Rhee, and M. Ashburner, “An ontology for cell types,” Genome Biology, vol. 6, no. 2, p. R21, 2005. [Online]. Available: http://dx.doi.org/10.1186/gb-2005-6-2-r21

[218] R. Hoehndorf, J. M. Hancock, N. W. Hardy, A. M. Mallon, P. N. Schofield, and G. V. Gkoutos, “Analyzing gene expression data in mice with the Neuro Behavior Ontology,” Mamm Genome, vol. 25, no. 1-2, pp. 32–40, 2014.

[219] K. Degtyarenko, P. Matos, M. Ennis, J. Hastings, M. Zbinden, A. McNaught, R. Alcantara, M. Darsow, M. Guedj, and M. Ashburner, “ChEBI: a database and ontology for chemical entities of biological interest,” Nucleic Acids Re- search, vol. 36, no. Database issue, pp. D344–D350, 2008.

[220] P. N. Schofield, J. B. L. Bard, J. Boniver, V. Covelli, P. Delvenne, M. Ellen- der, W. Engstrom, W. Goessner, M. Gruenberger, H. Hoefler, J. W. Hopewell, M. Mancuso, C. Mothersill, Q. L. Martinez, B. Rozell, H. Sariola, J. P. Sund- berg, and A. Ward, “Pathbase: a new reference resource and database for lab- oratory mouse pathology,” Radiat Prot Dosimetry, vol. 112, no. 4, pp. 525–528, 2004.

[221] C. J. Bult, J. T. Eppig, J. A. Blake, J. A. Kadin, J. E. Richardson, and the Mouse Genome Database Group, “Mouse genome database 2016,” Nucleic Acids Research, vol. 44, no. D1, pp. D840–D847, 2016. [Online]. Available: http://nar.oxfordjournals.org/content/44/D1/D840.abstract 156 [222] S. Harispe, S. Ranwez, S. Janaqi, and J. Montmain, “The semantic measures library and toolkit: fast computation of semantic similarity and relatedness using biomedical ontologies,” Bioinformatics, vol. 30, no. 5, pp. 740–742, 2014. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/ 30/5/740.abstract

[223] J. Amberger, C. Bocchini, and A. Hamosh, “A new face and new challenges for Online Mendelian Inheritance in Man (OMIM),” Hum Mutat, vol. 32, pp. 564–567, 2011.

[224] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993.

[225] I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed. Amsterdam: Morgan Kaufmann, 2011. [Online]. Available: http://www.sciencedirect.com/science/ book/9780123748560

[226] K. Wang, M. Li, and H. Hakonarson, “Annovar: functional annotation of genetic variants from high-throughput sequencing data,” Nucleic Acids Research, vol. 38, no. 16, p. e164, 2010. [Online]. Available: http: //nar.oxfordjournals.org/content/38/16/e164.abstract

[227] K. Eilbeck, A. Quinlan, and M. Yandell, “Settling the score: variant prioritiza- tion and mendelian disease,” Nature Reviews Genetics, vol. 18, p. 599, 2017.

[228] Y.-F. F. Huang, B. Gulko, and A. Siepel, “Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data.” Nature genetics, vol. 49, no. 4, pp. 618–624, Apr. 2017. [Online]. Available: http://view.ncbi.nlm.nih.gov/pubmed/28288115

[229] D. Quang, Y. Chen, and X. Xie, “Dann: a deep learning approach for annotating the pathogenicity of genetic variants,” Bioinformatics, vol. 31, no. 5, pp. 761– 763, 2015.

[230] I. Boudellioua, R. B. Mahamad Razali, M. Kulmanov, Y. Hashish, V. B. Bajic, E. Goncalves-Serra, N. Schoenmakers, G. V. Gkoutos, P. N. Schofield, and 157 R. Hoehndorf, “Semantic prioritization of novel causative genomic variants,” PLOS Computational Biology, vol. 13, no. 4, pp. 1–21, 04 2017. [Online]. Available: https://doi.org/10.1371/journal.pcbi.1005500

[231] B. de Bono, R. Hoehndorf, S. Wimalaratne, G. V. Gkoutos, and P. Grenon, “The ricordo approach to semantic interoperability for biomedical data and models: strategy, standards and solutions.” BMC Research Notes, vol. 4, no. 1, p. 313, 2011.

[232] M. H. de Angelis, G. Nicholson, M. Selloum, J. K. White, H. Morgan, R. Ramirez-Solis, T. Sorg, S. Wells, H. Fuchs, M. Fray, D. J. Adams, N. C. Adams, T. Adler, A. Aguilar-Pimentel, D. Ali-Hadji, G. Amann, P. Andr´e,S. Atkins, A. Auburtin, A. Ayadi, J. Becker, L. Becker, E. Bedu, R. Bekeredjian, M.-C. Birling, A. Blake, J. Bottomley, M. R. Bowl, V. Brault, D. H. Busch, J. N. Bussell, J. Calzada-Wack, H. Cater, M.-F. Champy, P. Charles, C. Chevalier, F. Chiani, G. F. Codner, R. Combe, R. Cox, E. Dalloneau, A. Dierich, A. D. Fenza, B. Doe, A. Duchon, O. Eickelberg, C. T. Esapa, L. E. Fertak, T. Feigel, I. Emelyanova, J. Estabel, J. Favor, A. Flenniken, A. Gambadoro, L. Garrett, H. Gates, A.-K. Gerdin, G. Gkoutos, S. Greenaway, L. Glasl, P. Goetz, I. G. D. Cruz, A. G¨otz, J. Graw, A. Guimond, W. Hans, G. Hicks, S. M. H¨olter,H. H¨ofler,J. M. Hancock, R. Hoehndorf, T. Hough, R. Houghton, A. Hurt, B. Ivandic, H. Jacobs, S. Jacquot, N. Jones, N. A. Karp, H. A. Katus, S. Kitchen, T. Klein-Rodewald, M. Klingenspor, T. Klopstock, V. Lalanne, S. Leblanc, C. Lengger, E. le Marchand, T. Ludwig, A. Lux, C. McKerlie, H. Maier, J.-L. Mandel, S. Marschall, M. Mark, D. G. Melvin, H. Meziane, K. Micklich, C. Mittelhauser, L. Monassier, D. Moulaert, S. Muller, B. Naton, F. Neff, P. M. Nolan, L. M. J. Nutter, M. Ollert, G. Pavlovic, N. S. Pellegata, E. Peter, B. Petit-Demouli`ere, A. Pickard, C. Podrini, P. Potter, L. Pouilly, O. Puk, D. Richardson, S. Rousseau, L. Quintanilla-Fend, M. M. Quwailid, I. Racz, B. Rathkolb, F. Riet, J. Rossant, M. Roux, J. Rozman, E. Ryder, J. Salisbury, L. Santos, K.-H. Sch¨able, E. Schiller, A. Schrewe, H. Schulz, R. Steinkamp, M. Simon, M. Stewart, C. St¨oger,T. St¨oger,M. Sun, D. Sunter, L. Teboul, I. Tilly, G. P. Tocchini-Valentini, M. Tost, I. Treise, L. Vasseur, E. Velot, D. Vogt-Weisenhorn, C. Wagner, A. Walling, M. Wattenhofer-Donze, B. Weber, O. Wendling, H. Westerberg, M. Willersh¨auser,E. Wolf, A. Wolter, 158 J. Wood, W. Wurst, A. Onder¨ Yildirim, R. Zeh, A. Zimmer, A. Zimprich, C. Holmes, K. P. Steel, Y. Herault, V. Gailus-Durner, A.-M. Mallon, and S. D. M. Brown, “Analysis of mammalian gene function through broad-based phenotypic screens across a consortium of mouse clinics,” Nature Genetics, July 2015. [Online]. Available: http://dx.doi.org/10.1038/ng.3360

[233] S. K¨ohler, S. C. Doelken, B. J. Ruef, S. Bauer, N. Washington, M. Westerfield, G. Gkoutos, P. Schofield, D. Smedley, S. E. Lewis, P. N. Robinson, and C. J. Mungall, “Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research.” F1000Research, vol. 2, Feb. 2013. [Online]. Available: http://dx.doi.org/10.12688/f1000research.2-30.v1

[234]M. A.´ Rodr´ıguez-Garc´ıa,G. V. Gkoutos, P. N. Schofield, and R. Hoehndorf, “Integrating phenotype ontologies with phenomenet,” Journal of Biomedical Semantics, vol. 8, no. 1, p. 58, Dec 2017. [Online]. Available: https: //doi.org/10.1186/s13326-017-0167-4

[235] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 5 2015.

[236] D. G. Grimm, C. Azencott, F. Aicheler, U. Gieraths, D. G. MacArthur, K. E. Samocha, D. N. Cooper, P. D. Stenson, M. J. Daly, J. W. Smoller, L. E. Duncan, and K. M. Borgwardt, “The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity,” Human Mutation, vol. 36, no. 5, pp. 513–523, 2015. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/humu.22768

[237] A. J. Cornish, A. David, and M. J. E. Sternberg, “Phenorank: reducing study bias in gene prioritization through simulation,” Bioinformatics, p. bty028, 2018.

[238] P. Danecek, A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, and R. Durbin, “The variant call format and vcftools,” Bioinformatics, vol. 27, no. 15, pp. 2156–2158, Aug. 2011. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btr330 159 [239] H. Li, “Tabix: fast retrieval of sequence features from generic tab-delimited files,” Bioinformatics, vol. 27 5, pp. 718–9, 2011.

[240] F. Chollet et al., “Keras,” https://github.com/keras-team/keras, 2015.

[241] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Man´e,R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Vi´egas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/

[242] M. Pumperla, “Hyperas,” https://github.com/maxpumperla/hyperas.

[243] J. Bergstra, R. Bardenet, Y. Bengio, and B. K´egl, “Algorithms for hyper-parameter optimization,” in Proceedings of the 24th International Conference on Neural Information Processing Systems, ser. NIPS’11. USA: Curran Associates Inc., 2011, pp. 2546–2554. [Online]. Available: http://dl.acm.org/citation.cfm?id=2986459.2986743

[244] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on International Conference on Machine Learning, ser. ICML’10. USA: Omnipress, 2010, pp. 807–814. [Online]. Available: http://dl.acm.org/citation.cfm?id=3104322. 3104425

[245] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization.” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://dblp.uni-trier.de/ db/journals/corr/corr1412.html#KingmaB14

[246] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, Jan. 2014. [Online]. Available: http://dl.acm.org/citation.cfm?id=2627435.2670313 160 [247] P. N. Robinson, S. K¨ohler,S. Bauer, D. Seelow, D. Horn, and S. Mundlos, “The human phenotype ontology: a tool for annotating and analyzing human hereditary disease.” Am J Hum Genet, vol. 83, no. 5, pp. 610–615, 2008.

[248] C. L. Smith, C.-A. W. Goldsmith, and J. T. Eppig, “The mammalian pheno- type ontology as a tool for annotating, analyzing and comparing phenotypic information,” Genome Biol, vol. 6, no. 1, p. R7, 2004, dOI:10.1186/gb-2004-6- 1-r7.

[249] M. P. Ball, J. R. Bobe, M. F. Chou, T. Clegg, P. W. Estep, J. E. Lunshof, W. Vandewege, A. Zaranek, and G. M. Church, “Harvard personal genome project: lessons from participatory public research,” Genome Med, vol. 6, no. 2, p. 10, 2014.

[250] Y.-F. Huang, B. Gulko, and A. Siepel, “Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data,” bioRxiv, 2016. [Online]. Available: https://www.biorxiv.org/content/early/2016/08/ 15/069682

[251] S. Flygare, E. J. Hernandez, L. Phan, B. Moore, M. Li, A. Fejes, H. Hu, K. Eilbeck, C. Huff, L. Jorde, M. G. Reese, and M. Yandell, “The vaast variant prioritizer (vvp): ultrafast, easy to use whole genome variant prioritization tool,” BMC Bioinformatics, vol. 19, no. 1, p. 57, Feb 2018. [Online]. Available: https://doi.org/10.1186/s12859-018-2056-y

[252] I. Boudellioua, R. B. Mahamad Razali, M. Kulmanov, Y. Hashish, V. B. Bajic, E. Goncalves-Serra, N. Schoenmakers, G. V. Gkoutos, P. N. Schofield, and R. Hoehndorf, “Semantic prioritization of novel causative genomic variants,” PLOS Computational Biology, vol. 13, no. 4, pp. 1–21, 04 2017. [Online]. Available: https://doi.org/10.1371/journal.pcbi.1005500

[253] S. Aerts, D. Lambrechts, S. Maity, P. Van Loo, B. Coessens, F. De Smet, L.-C. Tranchevent, B. De Moor, P. Marynen, B. Hassan, P. Carmeliet, and Y. Moreau, “Gene prioritization through genomic data fusion,” Nature Biotechnology, vol. 24, no. 5, pp. 537–544, May 2006. [Online]. Available: http://dx.doi.org/10.1038/nbt1203 161 [254] G. V. Gkoutos, P. N. Schofield, and R. Hoehndorf, “The anatomy of phenotype ontologies: principles, properties and applications,” Briefings in Bioinformatics, 2017. [Online]. Available: https://academic.oup.com/bib/ article-lookup/doi/10.1093/bib/bbx035

[255] D. Smedley, A. Oellrich, S. K¨ohler,B. Ruef, S. M. G. Project, M. Westerfield, P. Robinson, S. Lewis, and C. Mungall, “Phenodigm: analyzing curated annotations to associate animal models with human diseases,” Database, vol. 2013, 2013. [Online]. Available: http://database.oxfordjournals.org/content/ 2013/bat025.abstract

[256] J. B. S. Haldane, “The relative importance of principal and modifying genes in determining some human diseases,” Journal of Genetics, vol. 41, no. 2, pp. 149–157, 1941.

[257] D. N. Cooper, M. Krawczak, C. Polychronakos, C. Tyler-Smith, and H. Kehrer- Sawatzki, “Where genotype is not predictive of phenotype: towards an un- derstanding of the molecular basis of reduced penetrance in human inherited disease,” Hum Genet, vol. 132, no. 10, pp. 1077–130, 2013.

[258] N. Katsanis, “The continuum of causality in human genetic disorders,” Genome Biol, vol. 17, no. 1, p. 233, 2016.

[259] M. Kousi and N. Katsanis, “Genetic modifiers and oligogenic inheritance,” Cold Spring Harb Perspect Med, vol. 5, no. 6, 2015.

[260] A. A. Schaffer, “Digenic inheritance in medical genetics,” J Med Genet, vol. 50, no. 10, pp. 641–52, 2013.

[261] J.-M. Lee, V. Wheeler, M. Chao, J. Vonsattel, R. Pinto, D. Lucente, K. Abu- Elneel, E. Ramos, J. Mysore, T. Gillis, M. MacDonald, J. Gusella, D. Harold, T. Stone, V. Escott-Price, J. Han, A. Vedernikov, P. Holmans, L. Jones, S. Kwak, M. Mahmoudi, M. Orth, G. B. Landwehrmeyer, J. Paulsen, E. R. Dorsey, I. Shoulson, and R. Myers, “Identification of genetic factors that mod- ify clinical onset of huntington’s disease,” Cell, vol. 162, no. 3, pp. 516–526, 2015. 162 [262] M. J. Chao, K.-H. Kim, J. W. Shin, D. Lucente, V. C. Wheeler, H. Li, J. C. Roach, L. Hood, N. S. Wexler, L. B. Jardim, P. Holmans, L. Jones, M. Orth, S. Kwak, M. E. MacDonald, J. F. Gusella, and J.-M. Lee, “Population-specific genetic modification of huntington’s disease in venezuela,” PLOS Genetics, vol. 14, no. 5, p. e1007274, 2018.

[263] S. J. Lubbe, V. Escott-Price, J. R. Gibbs, M. A. Nalls, J. Bras, T. R. Price, A. Nicolas, I. E. Jansen, K. Y. Mok, A. M. Pittman, J. E. Tomkins, P. A. Lewis, A. J. Noyce, S. Lesage, M. Sharma, E. R. Schiff, A. P. Levine, A. Brice, T. Gasser, J. Hardy, P. Heutink, N. W. Wood, A. B. Singleton, N. M. Williams, H. R. Morris, and C. for International Parkinson’s Disease Genomics, “Addi- tional rare variant analysis in parkinson’s disease cases with and without known pathogenic mutations: evidence for oligogenic inheritance,” Hum Mol Genet, vol. 25, no. 24, pp. 5483–5489, 2016.

[264] A. K. Nicholas, E. G. Serra, H. Cangul, S. Alyaarubi, I. Ullah, E. Schoenmak- ers, A. Deeb, A. M. Habeb, M. Almaghamsi, C. Peters, N. Nathwani, Z. Ay- can, H. Saglam, E. Bober, M. Dattani, S. Shenoy, P. G. Murray, A. Babiker, R. Willemsen, A. Thankamony, G. Lyons, R. Irwin, R. Padidela, K. Tharian, J. H. Davies, V. Puthi, S. M. Park, A. F. Massoud, J. W. Gregory, A. Albanese, E. Pease-Gevers, H. Martin, K. Brugger, E. R. Maher, V. K. Chatterjee, C. A. Anderson, and N. Schoenmakers, “Comprehensive screening of eight known causative genes in congenital hypothyroidism with gland-in-situ,” J Clin En- docrinol Metab, vol. 101, no. 12, pp. 4521–4531, 2016.

[265] T. de Filippis, G. Gelmini, E. Paraboschi, M. C. Vigone, M. Di Frenna, F. Marelli, M. Bonomi, A. Cassio, D. Larizza, M. Moro, G. Radetti, M. Salerno, D. Ardissino, G. Weber, D. Gentilini, F. Guizzardi, S. Duga, and L. Persani, “A frequent oligogenic involvement in congenital hypothyroidism,” Hum Mol Genet, vol. 26, no. 13, pp. 2507–2514, 2017.

[266] E. Eichers, R. A. Lewis, N. Katsanis, and J. Lupski, “Triallelic inheritance: a bridge between mendelian and multifactorial traits,” Annals of Medicine, vol. 36, no. 4, pp. 262–272, 2004.

[267] R. Shaheen et al., “Characterizing the morbid genome of ciliopathies,” Genome Biol, vol. 17, no. 1, p. 242, 2016. 163 [268] Q. Y. Zheng, D. Yan, X. M. Ouyang, L. L. Du, H. Yu, B. Chang, K. R. Johnson, and X. Z. Liu, “Digenic inheritance of deafness caused by mutations in genes encoding cadherin 23 and protocadherin 15 in mice and humans,” Hum Mol Genet, vol. 14, no. 1, pp. 103–11, 2005.

[269] A. Gazzo, D. Raimondi, D. Daneels, Y. Moreau, G. Smits, S. Van Dooren, and T. Lenaerts, “Understanding mutational effects in digenic diseases,” Nucleic Acids Res, vol. 45, no. 15, p. e140, 2017.

[270] J. E. Posey, T. Harel, P. Liu, J. A. Rosenfeld, R. A. James, Z. H. Coban Akdemir, M. Walkiewicz, W. Bi, R. Xiao, Y. Ding, F. Xia, A. L. Beaudet, D. M. Muzny, R. A. Gibbs, E. Boerwinkle, C. M. Eng, V. R. Sutton, C. A. Shaw, S. E. Plon, Y. Yang, and J. R. Lupski, “Resolution of disease phenotypes re- sulting from multilocus genomic variation,” New England Journal of Medicine, vol. 376, no. 1, pp. 21–31, 2016.

[271] J. F. Robinson and N. Katsanis, Oligogenic Disease. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 243–262.

[272] I. Feldman, A. Rzhetsky, and D. Vitkup, “Network properties of genes harboring inherited disease mutations,” Proc Natl Acad Sci U S A, vol. 105, no. 11, pp. 4323–8, 2008.

[273] T. K. Gandhi, J. Zhong, S. Mathivanan, L. Karthick, K. N. Chandrika, S. S. Mohan, S. Sharma, S. Pinkert, S. Nagaraju, B. Periaswamy, G. Mishra, K. Nan- dakumar, B. Shen, N. Deshpande, R. Nayak, M. Sarker, J. D. Boeke, G. Parmi- giani, J. Schultz, J. S. Bader, and A. Pandey, “Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets,” Nat Genet, vol. 38, no. 3, pp. 285–93, 2006.

[274] A. Bauer-Mehren, M. Bundschus, M. Rautschka, M. A. Mayer, F. Sanz, and L. I. Furlong, “Gene-disease network analysis reveals functional modules in mendelian, complex and environmental diseases,” PLoS One, vol. 6, no. 6, p. e20284, 2011.

[275] J. Menche, A. Sharma, M. Kitsak, S. D. Ghiassian, M. Vidal, J. Loscalzo, and A. L. Barabasi, “Disease networks. uncovering disease-disease relationships 164 through the incomplete interactome,” Science, vol. 347, no. 6224, p. 1257601, 2015.

[276] R. Hoehndorf, M. Dumontier, J. H. Gennari, S. Wimalaratne, B. de Bono, D. L. Cook, and G. V. Gkoutos, “Integrating systems biology models and biomedical ontologies,” BMC Systems Biology, vol. 5, no. 1, pp. 124+, August 2011. [Online]. Available: http://www.biomedcentral.com/1752-0509/5/124

[277] P. N. Schofield, R. Hoehndorf, and G. V. Gkoutos, “Mouse genetic and phe- notypic resources for human genetics,” Hum Mutat, vol. 33, no. 5, pp. 826–36, 2012.

[278] S. Khler, N. A. Vasilevsky, M. Engelstad, E. Foster, J. McMurry, S. Aym, G. Baynam, S. M. Bello, C. F. Boerkoel, K. M. Boycott, M. Brudno, O. J. Buske, P. F. Chinnery, V. Cipriani, L. E. Connell, H. J. Dawkins, L. E. DeMare, A. D. Devereau, B. B. de Vries, H. V. Firth, K. Freson, D. Greene, A. Hamosh, I. Helbig, C. Hum, J. A. Jhn, R. James, R. Krause, S. J. F. Laulederkind, H. Lochmller, G. J. Lyon, S. Ogishima, A. Olry, W. H. Ouwehand, N. Pontikos, A. Rath, F. Schaefer, R. H. Scott, M. Segal, P. I. Sergouniotis, R. Sever, C. L. Smith, V. Straub, R. Thompson, C. Turner, E. Turro, M. W. Veltman, T. Vulliamy, J. Yu, J. von Ziegenweidt, A. Zankl, S. Zchner, T. Zemojtel, J. O. Jacobsen, T. Groza, D. Smedley, C. J. Mungall, M. Haendel, and P. N. Robinson, “The human phenotype ontology in 2017,” Nucleic Acids Research, vol. 45, no. D1, pp. D865–D876, 2017. [Online]. Available: http://dx.doi.org/10.1093/nar/gkw1039

[279] D. Szklarczyk, J. H. Morris, H. Cook, M. Kuhn, S. Wyder, M. Simonovic, A. Santos, N. T. Doncheva, A. Roth, P. Bork, L. J. Jensen, and C. vonMering, “The string database in 2017: quality-controlled proteinprotein association networks, made broadly accessible,” Nucleic Acids Research, vol. 45, no. D1, pp. D362–D368, 2017. [Online]. Available: +http: //dx.doi.org/10.1093/nar/gkw937

[280] I. Boudellioua, M. Kulmanov, P. N. Schofield, G. V. Gkoutos, and R. Hoehndorf, “Deeppvp: phenotype-based prioritization of causative variants using deep learning,” bioRxiv, 2018. [Online]. Available: https: //www.biorxiv.org/content/early/2018/05/02/311621 165 [281] J. A. Blake, J. T. Eppig, J. A. Kadin, J. E. Richardson, C. L. Smith, C. J. Bult, and the Mouse Genome Database Group, “Mouse genome database (mgd)-2017: community knowledge resource for the ,” Nucleic Acids Research, vol. 45, no. D1, pp. D723–D729, 2017. [Online]. Available: http://dx.doi.org/10.1093/nar/gkw1040

[282] D. G. Howe, Y. M. Bradford, A. Eagle, D. Fashena, K. Frazer, P. Kalita, P. Mani, R. Martin, S. T. Moxon, H. Paddock, C. Pich, S. Ramachandran, L. Ruzicka, K. Schaper, X. Shao, A. Singer, S. Toro, C. Van Slyke, and M. Westerfield, “The zebrafish model organism database: new support for human disease models, mutation details, gene expression phenotypes and searching,” Nucleic Acids Research, vol. 45, no. D1, pp. D758–D768, 2017. [Online]. Available: http://dx.doi.org/10.1093/nar/gkw1116

[283] M. A. Rodriguez-Garcia, G. V. Gkoutos, P. N. Schofield, and R. Hoehndorf, “Integrating phenotype ontologies with phenomenet,” Journal of Biomedical Semantics, vol. 8, no. 1, p. 58, Dec 2017. [Online]. Available: https: //doi.org/10.1186/s13326-017-0167-4

[284] E. Forsythe and P. L. Beales, “Bardet-biedl syndrome,” Eur J Hum Genet, vol. 21, no. 1, pp. 8–13, 2013.

[285] B. R. Jasny, “A network approach to finding disease modules,” Science, vol. 347, no. 6224, pp. 836–836, 2015. [Online]. Available: http: //science.sciencemag.org/content/347/6224/836.11

[286] D. Furcy and S. Koenig, “Limited discrepancy beam search,” in Proceedings of the 19th International Joint Conference on Artificial Intelligence, ser. IJCAI’05. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2005, pp. 125–131. [Online]. Available: http://dl.acm.org/citation.cfm?id=1642293.1642313

[287] D. R. Blair et al., “A nondegenerate code of deleterious variants in mendelian loci contributes to complex disease risk,” Cell, vol. 155, no. 1, pp. 70–80, 2013.

[288] M. Oti and H. G. Brunner, “The modular nature of genetic diseases,” Clin Genet, vol. 71, pp. 1–11, 2007. 166 [289] K.-I. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and A.-L. Barab´asi,“The human disease network,” Proceedings of the National Academy of Sciences, vol. 104, no. 21, pp. 8685–8690, 2007. [Online]. Available: http://www.pnas.org/content/104/21/8685

[290] V. Khurana, J. Peng, C. Y. Y. Chung, P. K. Auluck, S. Fanning, D. F. Tardiff, T. Bartels, M. Koeva, S. W. Eichhorn, H. Benyamini, Y. Lou, A. Nutter-Upham, V. Baru, Y. Freyzon, N. Tuncbag, M. Costanzo, B.-J. J. San Luis, D. C. Sch¨ondorf, M. I. Barrasa, S. Ehsani, N. Sanjana, Q. Zhong, T. Gasser, D. P. Bartel, M. Vidal, M. Deleidi, C. Boone, E. Fraenkel, B. Berger, and S. Lindquist, “Genome-Scale networks link neurodegenerative disease genes to α-Synuclein through specific molecular pathways.” Cell systems, Jan. 2017. [Online]. Available: http://view.ncbi.nlm.nih.gov/pubmed/28131822

[291] D. Marbach, D. Lamparter, G. Quon, M. Kellis, Z. Kutalik, and S. Bergmann, “Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases.” Nature methods, Mar. 2016. [Online]. Available: http://view.ncbi.nlm.nih.gov/pubmed/26950747

[292] F. Hildebrandt, T. Benzing, and N. Katsanis, “Ciliopathies,” N Engl J Med, vol. 364, no. 16, pp. 1533–1543, 2011.

[293] J. R. Priest, K. Osoegawa, N. Mohammed, V. Nanda, R. Kundu, K. Schultz, E. J. Lammer, S. Girirajan, T. Scheetz, D. Waggott, F. Haddad, S. Reddy, D. Bernstein, T. Burns, J. D. Steimle, X. H. Yang, I. P. Moskowitz, M. Hurles, R. P. Lifton, D. Nickerson, M. Bamshad, E. E. Eichler, S. Mital, V. Sheffield, T. Quertermous, B. D. Gelb, M. Portman, and E. A. Ashley, “De novo and rare variants at multiple loci support the oligogenic origins of atrioventricular septal heart defects,” PLoS Genet, vol. 12, no. 4, p. e1005963, 2016.

[294] Y. Li, A. Salfelder, K. O. Schwab, S. C. Grunert, T. Velten, D. Lutjo- hann, P. Villavicencio-Lorini, U. Matysiak-Scholze, B. Zabel, A. Kottgen, and E. Lausch, “Against all odds: blended phenotypes of three single-gene defects,” Eur J Hum Genet, 2016.

[295] R. Y. Leduc, P. Singh, and H. E. McDermid, “Genetic backgrounds and modifier genes of ntd mouse models: An opportunity for greater understanding of the 167 multifactorial etiology of neural tube defects,” Birth Defects Res, vol. 109, no. 2, pp. 140–152, 2017.

[296] E. Amendola, P. De Luca, P. E. Macchia, D. Terracciano, A. Rosica, G. Chiap- petta, S. Kimura, A. Mansouri, A. Affuso, C. Arra, V. Macchia, R. Di Lauro, and M. De Felice, “A mouse model demonstrates a multigenic origin of congen- ital hypothyroidism,” Endocrinology, vol. 146, no. 12, pp. 5038–47, 2005.

[297] J. H. Nadeau, “Modifier genes in mice and humans,” Nat Rev Genet, vol. 2, no. 3, pp. 165–74, 2001.

[298] J. Gillis and P. Pavlidis, ““Guilt by Association” is the exception rather than the rule in gene networks,” PLoS Comput Biol, vol. 8, no. 3, p. e1002444, 03 2012. [Online]. Available: http://dx.doi.org/10.1371%2Fjournal.pcbi.1002444

[299] M. H. Schaefer, L. Serrano, and M. A. Andrade-Navarro, “Correcting for the study bias associated with proteinprotein interaction measurements reveals differences between protein degree distributions from different cancer types,” Frontiers in Genetics, vol. 6, p. 260, 2015. [Online]. Available: https://www.frontiersin.org/article/10.3389/fgene.2015.00260

[300] K. Eilbeck, A. Quinlan, and M. Yandell, “Settling the score: variant prioritization and Mendelian disease,” Nature Reviews Genetics, vol. 18, no. 10, pp. 599–612, Aug. 2017. [Online]. Available: http://dx.doi.org/10.1038/nrg. 2017.52

[301] M. Alshahrani and R. Hoehndorf, “Semantic disease gene embeddings (smudge): phenotype-based disease gene prioritization without phenotypes,” Bioinformatics, vol. 34, no. 17, pp. i901–i907, 2018. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/bty559 168

APPENDICES 169

A Semantic Prioritization of Novel Causative Genomic Variants: Supplementary materials

Feature Type CADD score (numeric) Pathogencity prediction GWAVA score (numeric) DANN score (numeric) PhenomeNet similarity score (numeric) Disease inheritance mode: Dominant, Recessive, X-linked, Others/Unknown (one-hot encoding) High-level phenotype HP 0000078(binary) High-level phenotype HP 0000291(binary) High-level phenotype HP 0000708(binary) High-level phenotype HP 0001001(binary) High-level phenotype HP 0001939(binary) High-level phenotype HP 0002086(binary) High-level phenotype HP 0006476(binary) High-level phenotype HP 0009126(binary) High-level phenotype HP 0010515(binary) High-level phenotype HP 0010948(binary) High-level phenotype HP 0010987(binary) High-level phenotype HP 0011017(binary) High-level phenotype HP 0011025(binary) High-level phenotype HP 0011277(binary) High-level phenotype HP 0011482(binary) High-level phenotype HP 0011915(binary) High-level phenotype HP 0040063(binary) High-level phenotype MP 0000003(binary) High-level phenotype MP 0000358(binary) High-level phenotype MP 0000428(binary) High-level phenotype MP 0000462(binary) High-level phenotype MP 0000516(binary) High-level phenotype MP 0000685(binary) High-level phenotype MP 0000716(binary) High-level phenotype MP 0001188(binary) High-level phenotype MP 0001213(binary) Phenotype-related High-level phenotype MP 0001270(binary) High-level phenotype MP 0001533(binary) High-level phenotype MP 0001663(binary) High-level phenotype MP 0001672(binary) High-level phenotype MP 0001764(binary) High-level phenotype MP 0001790(binary) High-level phenotype MP 0001983(binary) High-level phenotype MP 0002060(binary) High-level phenotype MP 0002089(binary) High-level phenotype MP 0002095(binary) High-level phenotype MP 0002106(binary) High-level phenotype MP 0002109(binary) High-level phenotype MP 0002138(binary) High-level phenotype MP 0002139(binary) High-level phenotype MP 0002163(binary) High-level phenotype MP 0002164(binary) High-level phenotype MP 0002396(binary) High-level phenotype MP 0003385(binary) High-level phenotype MP 0004133(binary) High-level phenotype MP 0004134(binary) High-level phenotype MP 0005408(binary) High-level phenotype MP 0005451(binary) High-level phenotype MP 0005621(binary) High-level phenotype MP 0009389(binary) High-level phenotype MP 0010678(binary) High-level phenotype MP 0010769(binary) High-level phenotype MP 0012719(binary) High-level phenotype MP 0013328(binary) Genotype Homozygote or heterozygote (binary) Table S1: PVP model features. 170 Precision Recall F-measure ROC AUC PVP 0.963 0.963 0.963 0.994 PVP-Model 0.894 0.893 0.893 0.963 PVP-Human 0.96 0.96 0.96 0.96

Table S2: Stratified cross-validation results when training our models on 80% of ClinVar pathogenic variants.

Randomly added phenotypes (8522 total) Top rank 3547 41.62% Top 10 4315 50.63%

1 Randomly removed phenotypes (8522 total) with p= 3 Top rank 3963 46.50% Top 10 4921 57.74%

Double pathogenic variants on D1 (8522 total) Top rank 5316 62.38% Top 10 7602 89.20%

Double pathogenic variants on D2 (8522 total) Top rank 1309 15.36% Top 10 3612 42.38%

Double pathogenic variants on both (8522 total) D1 Variant Top rank 3116 36.56% Top 10 5428 63.69%

D2 Variant Top rank 3171 37.21% Top 10 5468 64.16%

Table S3: Experiments when adding noise to phenotypes associated with 8,522 syn- thetic whole exome sequences. Each WES contains a single causative variant v1 and is associated with phenotypes P1. First, we randomly add the complete set of phe- notypes from a randomly chosen disease to P1 and record how well we recover v1. 1 Second, we randomly remove each phenotype in P1 with a probability of 3 and record how well v1 is recovered. In the third experiment, we add another causative variant v2 to the WES (which is causative for the disease that is phenotypically most similar to P1, and is itself associated with phenotypes P2), and use P1 to recover v1. We record how often we identify v1 and v2 at the first or top 10 ranks. Finally, we perform the same experiment using as phenotypes P1 ∪ P2. 171 Sample Phenotype Rank Zygosity Gene SIFT prediction Polyphen prediction UK10K THY5003269.REL-2013-04-20 DG 1 0/1 PROP1 D B 2 0/1 LHX3 D D 5 0/1 LHX4 D P UK10K THY5003270.REL-2013-04-20 DG 1 0/1 NKX2-1 T P 7 0/1 GLIS3 T P 8 0/1 TPO D D UK10K THY5003272.REL-2013-04-20 DG 1 0/1 LHX3 D D 3 0/1 NKX2-1 T D UK10K THY5003273.REL-2013-04-20 DG 1 0/1 NKX2-1 T B 8 0/1 PAX8 D D UK10K THY5068934.REL-2013-04-20 GIS 5 0/1 TG T . UK10K THY5159470.REL-2013-04-20 DG 4 0/1 PAX8 D D UK10K THY5159478.REL-2013-04-20 DG 1 0/1 NKX2-1 T P 3 0/1 TPO T D UK10K THY5159472.REL-2013-04-20 DG 1 0/1 NKX2-1 T D 2 0/1 TG D D 3 0/1 LHX3 D P 6 0/1 PAX8 D D UK10K THY5159467.REL-2013-04-20 GIS 2 0/1 NKX2-1 T D 6 0/1 PAX8 D D UK10K THY5236178.REL-2013-04-20 GIS 1 0/1 TG D D 3 0/1 TG D B 9 1/1 DUOX2 D D UK10K THY5236182.REL-2013-04-20 GIS 1 0/1 LHX3 T D UK10K THY5298762.REL-2013-04-20 CCH 1 0/1 NKX2-1 T D 3 0/1 TG T P UK10K THY5298765.REL-2013-04-20 GIS 4 1/1 TPO D D UK10K THY5298766.REL-2013-04-20 GIS 3 0/1 TPO D D UK10K THY5298767.REL-2013-04-20 GIS 4 1/1 TPO D D 5 0/1 PAX8 T D UK10K THY5298777.REL-2013-04-20 DG 1 0/1 GLIS3 T D 6 0/1 PAX8 D D UK10K THY5298778.REL-2013-04-20 DG 1 0/1 GLIS3 T D 2 0/1 NKX2-1 T D 4 0/1 TG T P UK10K THY5298780.REL-2013-04-20 GIS 1 0/1 NKX2-1 T P 2 0/1 NKX2-1 T D 8 0/1 PAX8 D D UK10K THY5329043.REL-2013-04-20 CCH 1 0/1 NKX2-1 T D UK10K THY5329044.REL-2013-04-20 GIS 1 0/1 TG T . 4 0/1 LHX3 T P UK10K THY5329047.REL-2013-04-20 GIS 1 0/1 NKX2-1 T D 5 1/1 TG UK10K THY5329049.REL-2013-04-20 GIS 1 0/1 NKX2-1 T D UK10K THY5329050.REL-2013-04-20 GIS 1 0/1 NKX2-1 T B UK10K THY5329051.REL-2013-04-20 GIS 5 0/1 PAX8 D D UK10K THY5329052.REL-2013-04-20 GIS 1 0/1 NKX2-1 T P 2 0/1 NKX2-1 T D UK10K THY5329053.REL-2013-04-20 GIS 1 1/1 TG D D 2 0/1 TPO D P 3 0/1 DUOX2 D D 5 0/1 PAX8 T . 6 0/1 DUOX2 T . UK10K THY5329054.REL-2013-04-20 GIS 1 1/1 TG D D 2 0/1 DUOX2 D D 3 0/1 DUOX2 T . UK10K THY5329055.REL-2013-04-20 GIS 1 1/1 TG T . 7 0/1 PAX8 D D UK10K THY5329056.REL-2013-04-20 GIS 1 0/1 NKX2-1 T D 3 1/1 TG T . 5 0/1 LHX3 D P UK10K THY5329058.REL-2013-04-20 GIS 1 0/1 NKX2-1 T P 2 0/1 NKX2-1 T D UK10K THY5329059.REL-2013-04-20 GIS 1 1/1 TG D D 2 0/1 NKX2-1 T D 3 1/1 TG . D UK10K THY5329060.REL-2013-04-20 GIS 1 1/1 TG D D 2 1/1 TG . D 3 0/1 DUOX2 D D 4 0/1 DUOX2 T . UK10K THY5329061.REL-2013-04-20 GIS 1 1/1 TG D D 2 0/1 DUOX2 T . 5 0/1 DUOX2 D P UK10K THY5329062.REL-2013-04-20 GIS 1 1/1 TG D D 2 0/1 PAX8 T D 3 0/1 DUOX2 D P 10 0/1 LHX3 T B UK10K THY5329068.REL-2013-04-20 GIS 3 0/1 GLIS3 T P UK10K THY5370898.REL-2013-04-20 GIS 1 0/1 GLIS3 T B 3 1/1 TG 10 0/1 PAX8 D D Table S4: The table shows the 36 cases from UK10K dataset diagnosed with Con- genital hypothyroidism which show genes already implicated in the disease within the top 20 hits. Patients are broadly stratified into three groups: CCH, central congenital hypothyroidism; GIS, patients with gland-in-situ and no anatomical signs of dysplasia which include likely dyshormonogenesis; DG, thyroid dysgenesis. Heterozygous state; 0/1, homozygous state 1/1. SIFT categories T and D correspond to Deleterious (sift score ≤ 0.05), and Tolerated (sift score > 0.05). Polyphen categories: D; Probably damaging (polyphen score ≥ 0.909), P; possibly damaging (0.447 to 0.909) B; benign (polyphen score ≤ 0.446). 172

heightID Phenotype Variant(s) PVP PVP-Human PVP-Model eXtasy Phevor Exomiser CADD GWAVA DANN UK10K THY5329055 GIS TG:c.1583C>A (hom) 1 1 1 2 209 5 500 13,522 1,866 UK10K THY5329056 GIS TG:c.1583C>A (hom) 2 1 1 4 103 9 495 13,763 1,904 UK10K THY5329059 GIS TG:c.2177G>A (hom); 3; 2 1; 3 1; 2 2; 9 223 23; 26 680; 554 6,929; 3,108 493; 1,377 TG:c.3149G>T (hom) UK10K THY5329060 GIS TG:c.2177G>A (hom); 4; 3; 1 1; 2; 3 1; 2; 8 1; 7; 43 209; 209; 268 3; 7; 4,327 661; 533; 73 6,874; 3,078; 4,632 488; 1,296; 315 TG:c.3149G>T (hom) ; DUOX2:C.2056C>T (het) UK10K THY5370898 GIS TG:c.638+5G>A (hom) 3 524 8,031 – 92 13,851 4,700 3,130 3,707 UK10K THY5329047 GIS TG:c.638+5G>A (hom) 6 457 5,616 – 103 11,624 4,041 2,366 3,238 UK10K THY5329053 GIS TG:c.4478G>A (hom); 5; 2 2; 3 1; 5 9; 38 125; 304 12; 3,946 605; 74 13,624; 4,596 921; 297 DUOX2:c.2056C>T (het) UK10K THY5329054 GIS TG:c.4478G>A (hom); 3; 1 1; 2 1; 5 11; 46 127; 285 29; 3,940 617; 69 13,408; 4,569 887; 312 DUOX2:c.2056C>T (het) UK10K THY5329061 GIS TG:c.8054G>T (hom); 2; 1 1; 2 1; 5 8; 17 74; 106 40; 4,275 375; 192 14,222; 4,748 1,305; 114 DUOX2:c.1060C>T (het) UK10K THY5329062 GIS TG:c.8054G>T (hom); 4; 3 2; 3 1; 5 8; 14 220; 288 46; 6,184 348; 177 13,969; 4,667 1,322; 107 DUOX2:c.1060C>T (het) UK10K THY5236178 GIS TG:c.5071T>C (het); 5; 1; 11 21; 15; 163 94; 13; 172 13; 3; 1 116; 116; 64 1,427; 1,447; 53,947 833; 675; 871 1,311; 1,717; 3,455 996; 889; 867 TG:c.7640T>A (het); DUOX2:c.1709A>T (hom) UK10K THY5329044 GIS TG:c.2311C>T (het) 1 1 1 1 209 62 90 7,072 381 UK10K THY5068932 GIS TG:c.3433+3 3433+6delGAGT 8 3 43 – 246 21,654 1,731 19,169 1,885 (het) UK10K THY5068934 GIS TG:c.3433+3 3433+6delGAGT 10 5 41 – 202 21,086 1,688 19,091 1,857 (het) UK10K THY5068933 GIS TG:c.3433+3 3433+6delGAGT – – – – – – – – – (het) Table S5: Results of variant and gene prioritization methods for a subset of UK10K rare thyroid dataset (EGAS00001000131) with confirmed disease-associated muta- tions. In cases THY5329044, THY5068932, THY5068933, and THY5068934, the variants were considered to contribute to but not entirely explain the whole pheno- type observed (Schoenmakers et al. 2016).

Chr Start RSID Gene Source OMIM PVP-MOD PVP-Human

10 60150616 rs757075712 TFAM mouse 617156 1465 1413

5 484635 rs869312806 SLC9A3 mouse-fish 616868 340 93

5 483385 rs766076524 SLC9A3 mouse-fish 616868 273 49

5 484762 rs869312807 SLC9A3 mouse-fish 616868 249 53

20 36031750 rs879255268 SRC mouse 616937 495 95

5 121409850 rs876657852 LOX mouse-fish 617168 233 139

6 33756856 rs878852983 LEMD2 mouse 212500 34 5

2 27278096 rs879253768 AGBL5 fish 617023 6 3

2 27278039 rs879253769 AGBL5 fish 617023 417 97

22 19486647 rs754080445 CDC45 mouse 617063 327 379

22 19470326 rs745800041 CDC45 mouse 617063 919 994

22 19470234 rs879255632 CDC45 mouse 617063 39 306

22 19471511 rs540217942 CDC45 mouse 617063 148 218

22 19506390 rs778665661 CDC45 mouse 617063 238 39

22 19468567 rs879255633 CDC45 mouse 617063 18 274

22 19470341 rs748749078 CDC45 mouse 617063 1831 1840

22 19494977 rs146559223 CDC45 mouse 617063 632 667 173

3 154834337 rs879255651 MME mouse 617018 20 79

3 132394747 rs114925667 UBA5 mouse 617132 550 83

3 132384669 rs774318611 UBA5 mouse 617132 1 3

3 132394134 rs745968949 UBA5 mouse 617132 3 6

3 132384686 rs886039756 UBA5 mouse 617132 521 213

3 132389876 rs374052333 UBA5 mouse 617132 91 41

3 132394183 rs886039757 UBA5 mouse 617132 60 19

3 132390987 rs886039759 UBA5 mouse 617132 8 8

3 132395320 rs886039760 UBA5 mouse 617132 783 566

3 132389817 rs886039761 UBA5 mouse 617132 264 84

3 132384674 rs532178791 UBA5 mouse 617132 1139 1035

16 68381533 rs886039897 PRMT7 mouse 617157 701 446

16 68386217 rs751670999 PRMT7 mouse 617157 68 54

16 68349977 rs149170494 PRMT7 mouse 617157 73 51

16 68380151 rs762515973 PRMT7 mouse 617157 439 336

16 68380047 rs201824659 PRMT7 mouse 617157 480 570

3 192053223 rs886039903 FGF12 mouse 617166 14 71

12 6690297 rs201992075 CHD4 mouse 617159 296 605

12 6700879 rs886039915 CHD4 mouse 617159 633 182

12 6702357 rs886039916 CHD4 mouse 617159 3 26

12 6697549 rs886039917 CHD4 mouse 617159 83 43

12 6697063 rs886039918 CHD4 mouse 617159 200 78

12 6697486 rs886039919 CHD4 mouse 617159 596 368

X 24082345 rs886040855 EIF2S3 fish 300987 365 169

X 24084119 rs886040856 EIF2S3 fish 300987 496 642

11 118452217 rs886040859 ARCN1 mouse 617164 62 14

19 48922979 rs886040861 GRIN2D mouse 617162 14 18 174

5 121411138 rs886040965 LOX mouse-fish 617168 53 110

5 121413556 rs886040966 LOX mouse-fish 617168 533 152

5 121411177 rs886040967 LOX mouse-fish 617168 440 283

18 56033433 rs879255599 NEDD4L mouse 617201 93 36

18 56034996 rs879255598 NEDD4L mouse 617201 26 21

18 56057899 rs879255597 NEDD4L mouse 617201 1 14

18 56057912 rs879255596 NEDD4L mouse 617201 40 81

3 132390945 rs540839115 UBA5 mouse 617133 535 141

3 132394207 rs886039762 UBA5 mouse 617133 825 706

15 52446137 rs886041054 GNB5 mouse 617173 566 684

15 52416726 rs773902879 GNB5 mouse 617173 45 21

15 52446136 rs886041055 GNB5 mouse 617173 626 525

15 52446134 rs766004901 GNB5 mouse 617173 210 348

15 52416814 rs749597091 GNB5 mouse 617173 40 6

15 51696718 rs764239923 GLDN mouse 617194 7 342

15 51633976 rs779432560 GLDN mouse 617194 883 681

15 51696535 rs539703340 GLDN mouse 617194 644 136

15 51676090 rs886041057 GLDN mouse 617194 922 831

15 51696730 rs368085516 GLDN mouse 617194 441 215

19 54684489 rs886041060 MBOAT7 mouse 617188 224 281

2 25973137 rs886041070 ASXL2 mouse 617190 319 22

5 14290918 . TRIO mouse 617061 868 155

10 99344567 rs201803986 PI4K2A mouse 613616 90 52

10 99361646 rs755562733 PI4K2A mouse 613616 52 41

10 99361646 rs755562733 PI4K2A mouse 613616 37 7

10 99361676 rs796052086 PI4K2A mouse 613616 9 5

10 99361747 rs770050262 PI4K2A mouse 613616 85 53 175

11 112098994 rs794726657 BCO2 mouse 261640 164 102

11 14316390 rs113954997 RRAS2 mouse 167000 280 51

11 14316390 rs113954997 RRAS2 mouse 167000 127 29

1 151789714 rs774357869 RORC mouse 616622 356 82

1 151789714 rs774357869 RORC mouse 616622 504 101

1 160293229 rs794727993 COPA fish 616414 1 3

1 160293229 rs794727993 COPA fish 616414 1 2

11 6637744 rs755445790 TAF10 mouse 204500 119 113

12 56749487 rs281874770 STAT2 mouse 616636 542 542

12 56749487 rs281874770 STAT2 mouse 616636 789 611

13 111329354 rs557671802 CARS2 mouse 616672 951 444

13 111329354 rs557671802 CARS2 mouse 616672 713 430

13 111335398 rs727505361 CARS2 mouse 616672 486 559

13 111335398 rs727505361 CARS2 mouse 616672 550 553

15 50544717 rs267606861 HDC mouse-fish 137580 1 3

15 50544717 rs267606861 HDC mouse-fish 137580 1 4

17 38907213 rs766783183 KRT25 mouse 278150 8 172

17 38907213 rs766783183 KRT25 mouse 278150 49 271

19 39074134 rs193922886 MAP4K1 mouse 255320 165 135

2 227661632 rs104893642 IRS1 mouse 125853 230 650

2 227661632 rs104893642 IRS1 mouse 125853 184 416

4 103226211 rs779241085 SLC39A8 mouse 616721 1065 298

4 103226211 rs779241085 SLC39A8 mouse 616721 1159 316

4 103265723 rs373562040 SLC39A8 mouse 616721 438 409

4 103265723 rs373562040 SLC39A8 mouse 616721 223 435

4 5749953 rs121908425 CRMP1 mouse 225500 65 1

5 74842932 rs148960463 POLK mouse 176807 187 139 176

5 74842932 rs148960463 POLK mouse 176807 601 202

5 74882880 rs111584802 POLK mouse 176807 794 382

5 74882880 rs111584802 POLK mouse 176807 1002 373

5 74886193 rs770984846 POLK mouse 176807 1081 1086

5 74886193 rs770984846 POLK mouse 176807 1484 1115

5 74892259 rs863225457 POLK mouse 176807 143 252

5 74892259 rs863225457 POLK mouse 176807 124 287

5 74892710 rs863225456 POLK mouse 176807 303 290

5 74892710 rs863225456 POLK mouse 176807 476 473

5 82400865 rs587779351 XRCC4 mouse 262400 44 2

5 82554426 rs797045016 XRCC4 mouse 616541 549 190

5 82554426 rs797045016 XRCC4 mouse 616541 41 116

Table S6: Prediction on variants for which no phenotype similarity could be computed based on human phenotypes. 177

B DeepPVP: phenotype-based prioritization of causative variants using deep learning: Supplementary materials

Feature Type CADD score (numeric) Pathogencity prediction GWAVA score (numeric) DANN score (numeric) PhenomeNet similarity score (numeric) Disease inheritance mode: Dominant, Recessive, X-linked, Others/Unknown (one-hot encoding) High-level phenotype HP 0000078(binary) High-level phenotype HP 0000291(binary) High-level phenotype HP 0000708(binary) High-level phenotype HP 0001001(binary) High-level phenotype HP 0001939(binary) High-level phenotype HP 0002086(binary) High-level phenotype HP 0006476(binary) High-level phenotype HP 0009126(binary) High-level phenotype HP 0010515(binary) High-level phenotype HP 0010948(binary) High-level phenotype HP 0010987(binary) High-level phenotype HP 0011017(binary) High-level phenotype HP 0011025(binary) High-level phenotype HP 0011277(binary) High-level phenotype HP 0011482(binary) High-level phenotype HP 0011915(binary) High-level phenotype HP 0040063(binary) High-level phenotype MP 0000003(binary) High-level phenotype MP 0000358(binary) High-level phenotype MP 0000428(binary) High-level phenotype MP 0000462(binary) High-level phenotype MP 0000516(binary) High-level phenotype MP 0000685(binary) High-level phenotype MP 0000716(binary) High-level phenotype MP 0001188(binary) High-level phenotype MP 0001213(binary) Phenotype-related High-level phenotype MP 0001270(binary) High-level phenotype MP 0001533(binary) High-level phenotype MP 0001663(binary) High-level phenotype MP 0001672(binary) High-level phenotype MP 0001764(binary) High-level phenotype MP 0001790(binary) High-level phenotype MP 0001983(binary) High-level phenotype MP 0002060(binary) High-level phenotype MP 0002089(binary) High-level phenotype MP 0002095(binary) High-level phenotype MP 0002106(binary) High-level phenotype MP 0002109(binary) High-level phenotype MP 0002138(binary) High-level phenotype MP 0002139(binary) High-level phenotype MP 0002163(binary) High-level phenotype MP 0002164(binary) High-level phenotype MP 0002396(binary) High-level phenotype MP 0003385(binary) High-level phenotype MP 0004133(binary) High-level phenotype MP 0004134(binary) High-level phenotype MP 0005408(binary) High-level phenotype MP 0005451(binary) High-level phenotype MP 0005621(binary) High-level phenotype MP 0009389(binary) High-level phenotype MP 0010678(binary) High-level phenotype MP 0010769(binary) High-level phenotype MP 0012719(binary) High-level phenotype MP 0013328(binary) Genotype Homozygote or heterozygote (binary) CADD flag (binary) GWAVA flag (binary) Imputation Flags DANN flag (binary) PhenomeNet similarity flag (binary) Table S1: DeepPVP model features. 178

C OligoPVP: Phenotype-driven analysis of individual genomic information to prioritize oligogenic disease variants: Supplementary materials

DIDA ID Gene A Gene B Disease name (ORPHANET) DeepPVP DeepPVP OligoPVP Rank A Rank B Rank dd010 NEK1 (c.1640insA) DYNC2H1 Short rib-polydactyly syndrome 1 2 Not (c.11747G>A) Found dd211 MYH7 (c.2645A>G) RBM20 Familial isolated dilated cardiomyopa- 2 1 Not (c.2062C>T) thy Found dd040 GDAP1 (c.358C>T) MFN2 Charcot-Marie-Tooth disease 1 2 Not (c.479 480delTG) Found dd125 EDA (c.769G>C ) WNT10A Hypodontia 1 2 Not (c.511C>T) Found dd218 BMPR2 KCNA5 (c.1448del) Heritable pulmonary arterial hyperten- 1 2 Not (c.1471C>T) sion Found dd244 NLRP3 MEFV (c.442G>C) Familial Mediterranean fever 2 1 Not (c.526C>T]) Found dd028 MITF (c.824delA) TYR (c.1205G>A) Ocular albinism with congenital sen- 1 2 Not sorineural deafness Found dd013 NSMF FGFR1 Kallmann syndrome 2 1 Not (c.1132-22 1132- (c.1025T>C) Found -15del) dd018 FGFR1 PROKR2 Kallmann syndrome 1 2 Not (c.165 171del) (c.518T>G) Found dd164 FGFR1 IL17RD Kallmann syndrome 1 2 Not (c.1042G>A) (c.1136A>G) Found dd165 KISS1R (c.581C>A) IL17RD Kallmann syndrome 1 2 Not (c.2204C>T) Found dd166 FGFR1 DUSP6 (c.545C>T) Kallmann syndrome 1 2 Not (c.2075A>G) Found dd168 SPRY4 (c.722C>A) DUSP6 Kallmann syndrome 1 2 Not (c.1037C>T) Found dd169 SPRY4 (c.722C>A) FGFR1 Kallmann syndrome 2 1 Not (c.1447C>A) Found dd117 TYR (c.230G>A) SLC45A2 Oculocutaneous albinism 1 2 Not (c.1045G>A) Found dd121 TYR (c.346C>T) OCA2 (c.1441G>A) Oculocutaneous albinism 1 2 Not Found dd123 LMBRD1 MTR (c.3518C>T) Homocystinuria without methylmalonic 2 1 Not (c.1056delG) aciduria Found dd116 WT1 NPHS1 Familial idiopathic steroid-resistant 1 2 Not (c.1228+5G>A) (c.1126C>G) nephrotic syndrome Found dd050 HMBS UROD Porphyria 2 1 Not (c.422+1G>T) (c.650 651dupTT) Found dd001 KCNQ1 KCNH2 Familial long QT syndrome 1 2 Not (c.1022C>A) (c.2592+1G>A) Found dd048 KCNE2 (c.178T>C SCN5A Familial long QT syndrome 2 1 Not ) (c.4868G>A) Found dd065 KCNE1 (c.95G>A) SCN5A Familial long QT syndrome 2 1 Not (c.4931G>A) Found dd069 SCN5A KCNH2 (c.298C>G) Familial long QT syndrome 1 2 Not (c.5455G>A) Found dd134 ATP2B2 MYO6 (c.737A>G) Non-syndromic genetic deafness 1 2 Not (c.1756G>A) Found dd145 KCNJ10 SLC26A4 Non-syndromic genetic deafness 2 1 Not (c.1042C>T) (c.919-2A>G) Found dd206 GJB2 (c.35delG) TMPRSS3 Non-syndromic genetic deafness 2 1 Not (c.208delC) Found dd124 MYO7A PCDH15 Usher syndrome 2 1 Not (c.2311G>T) (c.158-1G>A) Found dd262 TEK (c.309A>C) CYP1B1 (c.343G>C) Congenital glaucoma 2 1 Not Found dd026 MYOC (c.1196G>T) CYP1B1 Juvenile glaucoma 2 1 Not (c.1103G>A) Found dd171 ITGA7 (c.2656G>A) MYH7B Left ventricular non-compaction 2 1 Not (c.2668C>T) Found dd224 TRIM54 (c.316G>A) TRIM63 (c.739C>T) Congenital myopathy with protein accu- 1 2 Not mulation Found dd019 BAAT (c.226A>G) TJP2 (c.143T>C) Familial hypercholanemia 1 2 Not Found dd006 ATP2B2 CDH23 Non-syndromic genetic deafness 1 2 Not (c.1756G>A) (c.5663T>C) Found Table S1: Cases of DIDA combinations missed by OligoPVP prioritization incorpo- rating protein-protein interaction network data, in comparison to ranks of individual variants of each digenic combination ranked by DeepPVP. 179

D List of Publications

Journal Publications

• I. Boudellioua, M. Kulmanov, PN. Schofield, GV. Gkoutos, and R. Hoehndorf, “DeepPVP: Phenotype-based prioritization of causative variants using deep learning”, Submitted to BMC Bioinformatics, 2018. • I. Boudellioua, M. Kulmanov, PN. Schofield, GV. Gkoutos, and R. Hoehndorf, “OligoPVP: Phenotype-driven analysis of individual genomic information to prioritize oligogenic disease variants”, Scientific Reports, Oct 2018. • I. Boudellioua, RM. Razali, M. Kulmanov, Y. Hashish, VB. Bajic, E. Goncalves- Serra, N. Schoenmakers, GV. Gkoutos, PN. Schofield, and R. Hoehndorf, “Semantic prioritization of novel causative genomic variants”, PLOS Computational Biology, Apr 2017. • I. Boudellioua, R. Saidi, R. Hoehndorf, MJ. Martin, and V. Solovyev, “Prediction of metabolic pathway involvement in prokaryotic UniProtKB data by association rule mining”, PLOS One, Jul 2016.

Conference Proceedings

• I. Boudellioua, M. Kulmanov, PN. Schofield, GV. Gkoutos, and R. Hoehndorf, “Phenotype-driven discovery of digenic variants in personal genome sequences”, Pro- ceedings of Variant Interpretation Community Of Special Interest (VarI-COSI), part of the 25th Conference on Intelligent Systems for Molecular Biology and 16th Eu- ropean Conference on Computational Biology (ISMB/ECCB), Prague, The Czech Republic, Jul 2017. 180 • I. Boudellioua, R. Saidi, MJ. Martin, and V. Solovyev, “Association rule mining for metabolic pathway prediction”, Proceedings of the 16th Open days in Biology, Informatics, and Mathematics (JOBIM), Clermont-Ferrand, France, Jul 2015.

Poster Presentations

• I. Boudellioua, M. Kulmanov, PN. Schofield, GV. Gkoutos, and R. Hoehndorf, “Ex- ploring genotype-phenotype associations to prioritize genetic variants implicated in digenic diseases”, RECOMB 2018, Apr 2018. • I. Boudellioua, M. Kulmanov, PN. Schofield, GV. Gkoutos, and R. Hoehndorf, “driven discovery of digenic variants in personal genome sequences”, KAUST Re- search Conference on Big Data Analyses in Evolutionary Biology, Dec 2017. • I. Boudellioua, M. Kulmanov, PN. Schofield, GV. Gkoutos, and R. Hoehndorf, “driven discovery of digenic variants in personal genome sequences”, ISMB/ECCB 2017, Jul 2017. • I. Boudellioua, RM. Razali, M. Kulmanov, Y. Hashish, VB. Bajic, E. Goncalves- Serra, N. Schoenmakers, GV. Gkoutos, PN. Schofield, and R. Hoehndorf, “Semantic prioritization of novel causative genomic variants”, KAUST Research Conference on Computational Systems Biology in Biomedicine, Dec 2016. • I. Boudellioua, RM. Razali, M. Kulmanov, Y. Hashish, VB. Bajic, E. Goncalves- Serra, N. Schoenmakers, GV. Gkoutos, PN. Schofield, and R. Hoehndorf, “Semantic prioritization of novel causative genomic variants”, ECCB 2016, Sep 2016. • I. Boudellioua, R. Saidi, R. Hoehndorf, MJ. Martin, and V. Solovyev, “Prediction of metabolic pathway involvement in prokaryotic UniProtKB data by association rule mining”, Biocuration 2016, Apr 2016. • I. Boudellioua, R. Saidi, R. Hoehndorf, MJ. Martin, and V. Solovyev, “Prediction of metabolic pathway involvement in prokaryotic UniProtKB data by association rule mining”, KAUST Research Conference on Computational and experimental interfaces 181 of Big Data and Biotechnology, Jan 2016. • I. Boudellioua, R. Saidi, R. Hoehndorf, MJ. Martin, and V. Solovyev, “Prediction of metabolic pathway involvement in prokaryotic UniProtKB data by association rule mining”, ISMB/ECCB 2015, Jul 2015. • I. Boudellioua, R. Saidi, R. Hoehndorf, MJ. Martin, and V. Solovyev, “Prediction of metabolic pathway involvement in prokaryotic UniProtKB data by association rule mining”, JOBIM 2015, Jul 2015. • I. Boudellioua, R. Saidi, R. Hoehndorf, MJ. Martin, and V. Solovyev, “Prediction of metabolic pathway involvement in prokaryotic UniProtKB data by association rule mining”, RECOMB 2015, Apr 2015. • I. Boudellioua and V. Solovyev, “Investigation of Prokaryotic Gene Regulation in Distant Phylogenetic Groups”, ECCB 2014, Sep 2015.