PDF hosted at the Radboud Repository of the Radboud University Nijmegen

The following full text is a publisher's version.

For additional information about this publication click this link. http://hdl.handle.net/2066/76559

Please be advised that this information was generated on 2021-10-04 and may be subject to change.

Phenotype-guided disease investigation using Bioinformatics

Martin Oti

Front cover image: Leonardo da Vinci’s Vitruvian Man, as an illustration of the systematic measurement of the human phenotype. Systematic representations are essential to fully exploit phenotype information in Bioinformatic disease analyses. Underlying the phenotype is the genome, represented by the fragment of DNA on which he is standing.

Source : Wikimedia Commons ( http://commons.wikimedia.org/wiki/File:Vitruvian-Icon.png and http://commons.wikimedia.org/wiki/File:DNA_simple.svg ). Authors: Producer (Vitruvian Man) and Forluvoft (DNA); DNA image further modified by Leyo. License : Both images are licensed under the Creative Commons Attribution ShareAlike license (http://creativecommons.org/licenses/by-sa/3.0/) and the GNU Free Documentation License (http://commons.wikimedia.org/wiki/Commons:GNU_Free_Documentation_License).

Printed by PrintPartners Ipskamp, Nijmegen.

ISBN: 978-90-9024785-4

Phenotype-guided disease investigation using Bioinformatics

Een wetenschappelijke proeve op het gebied van de Medische Wetenschappen

Proefschrift

ter verkrijging van de graad van doctor aan de Radboud Universiteit Nijmegen op gezag van de rector magnificus prof. mr. S.C.J.J. Kortmann, volgens besluit van het college van decanen in het openbaar te verdedigen op vrijdag 23 april 2010 om 10.30 uur precies

door

Martin Okechukwu Oti

geboren op 7 october 1974 te Kaduna (Nigeria)

Promotors: Prof. dr. Han G. Brunner Prof. dr. Martijn A. Huynen

Manuscriptcommissie: Prof. dr. Stan Gielen Prof. dr. Jan Smeitink Dr. Lude Franke (Rijksuniversiteit Groningen)

Phenotype-guided disease investigation using Bioinformatics

An academy essay in the Medical Sciences

Doctoral thesis

to obtain the degree of doctor from Radboud University Nijmegen on the authority of the Rector Magnificus prof. dr. S.C.J.J. Kortmann, according to the decision of the council of deans to be defended in public on Friday April, 23 2010 at precisely 10.30 hours

by

Martin Okechukwu Oti

Born on 7 October 1974 in Kaduna (Nigeria)

Supervisors: Prof. dr. Han G. Brunner Prof. dr. Martijn A. Huynen

Doctoral Thesis Committee: Prof. dr. Stan Gielen Prof. dr. Jan Smeitink Dr. Lude Franke (Rijksuniversiteit Groningen)

Table of Contents

Chapter 1: General Introduction ...... 11 1.1 Discovering disease genes...... 12 1.2 Disease phenotypes...... 12 1.3 Genetic mapping...... 13 1.4 Functional candidate disease gene approaches using bioinformatics ...... 14 1.5 The importance of the phenotype ...... 16 1.6 Outline of the thesis ...... 17 References...... 19 Chapter 2: The modular nature of genetic diseases ...... 21 Abstract...... 22 Introduction ...... 22 Relationship between genes and phenotypes...... 23 Formalizing phenotype descriptions ...... 32 Future prospects...... 33 Acknowledgements...... 36 References...... 36 Chapter 3: Predicting disease genes using protein-protein interactions ...... 41 Abstract...... 42 Introduction ...... 42 Methods ...... 43 Results ...... 46 Discussion ...... 55 Acknowledgements...... 57 References...... 57 Chapter 4: Conserved co-expression for candidate disease gene prioritization...... 61 Abstract...... 62 Background ...... 62 Results ...... 63 Discussion ...... 69 Conclusions...... 70 Methods ...... 71 Acknowledgements...... 77 References...... 77 Chapter 5: The biological coherence of human phenome databases...... 81 Abstract...... 82 Introduction ...... 82 Methods ...... 83 Results ...... 87 Discussion ...... 92 Acknowledgements...... 93 References...... 93 Chapter 6: General Discussion ...... 95 6.1 Summary of findings ...... 96 6.2 Moving from locus-based to genome-wide approaches...... 99 6.3 The importance of detailed phenotype characterization for genome-wide approaches...... 104 6.4 The Future: Comprehensive elucidation of disease processes...... 108 References...... 116 Summary ...... 123 Samenvatting...... 127 Acknowledgements...... 131 Appendix I: Color Figures...... 133 Curriculum Vitae...... 141 List of Publications...... 142

9

10

Chapter 1: General Introduction

11

Chapter 1: General Introduction

1.1 Discovering disease genes There are about 6000 human genetic diseases, almost 4000 of which do not have a known molecular basis in 2009 (http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM)[1]. These diseases have a wide spectrum of phenotypic manifestations which are triggered by underlying genetic perturbations. While disease phenotypes are readily observable, the underlying mutations are far more problematic to characterize. Different techniques are suited to the elucidation of disease genetics, depending on the underlying genetic architecture of the disease. They fall into two general categories, genetic mapping[2] and functional candidate gene approaches. Genetic mapping uses correlation of disease phenotypes with genetic markers to localize the disease mutation to a particular genomic region in a hypothesis-free manner. It does not make use of disease phenotype information, other than for th-e delineation of diseases. On the other hand, functional candidate gene approaches use molecular biological knowledge to hypothesize the involvement of particular genes in the disease under investigation, which are subsequently tested for mutations in patients. These approaches utilize more information about disease phenotype and known disease biology.

An important advantage of functional approaches is that they facilitate not only disease gene identification but also disease biology elucidation. Nevertheless, data and knowledge limitations have negated these advantages, and genetic mapping has historically been the most productive approach to disease gene identification[2]. However, with the advent of the post-genomic era and its associated flood of molecular data, functional approaches have been reinvigorated and this trend is set continue into the future. Somewhat surprisingly however, disease phenotype data—one of the most readily available sources of information about diseases—have received much less attention than molecular genetic data in this recent surge in interest in functional approaches.

Below we first take a brief look at genetic mapping, after which we look at current Bioinformatics-driven functional approaches to large-scale candidate gene prioritization—which thus far have generally been applied to the results of genetic mapping. Finally we introduce the main topic of this thesis, namely the importance of disease phenotype information and its potential utility in functional candidate disease gene approaches. Further introduction into this topic is provided in chapter 2. However, before proceeding we first need to take a look at what disease phenotypes actually are.

1.2 Disease phenotypes Disease phenotypes are the ultimate consequences of the mutations in disease genes. They are also the most immediate source of information about what the underlying disease mutations might be, provided they are comprehensively observed and recorded. Herein lies an important and difficult to overcome problem, however, as the phenotypic effects of disease-related mutations can vary extensively. Such effects could be prominent and obvious physical malformations or functional impairments, or they could be very subtle and easily overlooked ones[3]. They need not even be externally observable physical features; instead they could be changes in molecular phenomena such as mRNA transcript levels, protein modification statuses

12

Chapter 1: General Introduction or metabolite concentrations[4,5], as well as behavioral changes[6]. Additionally, anatomical and physiological changes can occur over the course of a patient’s lifetime, so the phenotypic state of a patient at a given point in time does not necessarily reflect the full disease phenotype. In fact, disease phenotypes can even span generations—such as in diseases with genetic anticipation which leads to progressively more severe phenotypes in successive generations[7,8]. This phenotypic complexity makes it more difficult to identify and record disease phenotypes fully and accurately.

Some diseases have very specific and limited phenotypic effects, while others have extensive effects on patient phenotypes. For instance, cleft lip can occur as an isolated symptom while other diseases such as those associated with the TP63 gene[9,10] and Hutchinson-Gilford progeria (MIM: 176670) affect multiple systems and have wide-ranging phenotypic effects. While some diseases (such as skeletal malformations) manifest themselves during the developmental process, others only appear postnatally or at later life stages (e.g. neurodegenerative diseases). Some have constant effects on phenotype, while others (such as the previously mentioned neurodegenerative diseases) have progressive effects. Disease phenotypes can differ from each other quantitatively as well as qualitatively. Finally, many diseases have phenotypes that can vary substantially between patients.Despite these complexities, disease phenotypes are the most readily observable effects of genetic mutations and form a valuable source of information for disease research.

The use of such phenotypic information in genetic disease investigation will be discussed later, after a brief look at current genetic and bioinformatic methods for finding disease genes.

1.3 Genetic mapping There are two primary approaches to the genetic mapping of human disease genes, namely linkage analysis[11] and association studies[12]. These two approaches are complementary. Linkage analysis is best suited to the identification of rare mutations with large effects on the phenotype, such as those underlying Mendelian diseases. Association studies on the other hand are better suited to identifying common mutations with modest effects on the phenotype, such as those involved in complex diseases. Both approaches result in the delineation of a genomic region suspected to contain the disease mutation, which can subsequently be sequenced in both patients and controls in order to identify the causative mutation – an approach termed “positional cloning”[2].

Linkage analysis is responsible for the vast majority of disease genes identified to date[2]. In linkage analysis, genomic markers are sought that co-segregate with the disease phenotype in affected families. This indicates that the disease mutation is on the same chromosome as the marker, and that they are in close enough proximity to result in a lack of chromosomal crossovers between them. The use of restriction fragment length polymorphisms in the 1980s[13] and short tandem repeats in the 1990s[14] led to the generation of a sufficient density of markers across the whole genome to make linkage analysis feasible as a tool for the systematic localization of human disease genes on the genome[15]. Nevertheless, the resulting linkage regions are still quite large, generally ranging from about 2 to 10 million DNA basepairs and containing tens to hundreds of genes ( http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM ).

13

Chapter 1: General Introduction

Therefore, it is still expensive and time-consuming to identify the disease gene within the linkage regions by positional cloning.

Association studies are a complementary approach to linkage analysis. They are conceptually simple, involving the identification of genetic variants that occur significantly more frequently in affected individuals than in unaffected individuals. As they rely on statistical overrepresentation, they are better suited to identifying common genetic variants than rare variants. Such genetic variants need not have very strong associations with the disease phenotype as long as they are statistically significant, and this detection power can be improved by increasing the size of the study population. As a result, association studies are particularly suited to investigating complex polygenic diseases in which several different genetic variants contribute only modestly to the final disease phenotype. In contrast, linkage analysis is better suited to identifying rare genetic variants that have a strong association with the disease phenotype, which makes it more suited to the analysis of Mendelian diseases where single mutations of large effect cause the disease phenotype.

In the past, most association studies investigated the associations between specific candidate genes and the disease in question[15]. This was necessitated by the inability to conduct such studies on a genome-wide scale, due to technical limitations. However, these targeted studies generally struggled with limited and poorly reproducible results[2]. Only with the recent development of techniques for high throughput genotyping has it finally become feasible to conduct association studies at a genome-wide scale using large sample sizes[2]. These genome-wide association studies (GWASs), first proposed in the mid 1990s[16-18] have driven a recent explosion in the number of disease variants found[2]. Nevertheless, these studies generally result in the identification of genomic regions which could contain several genes – or sometimes no genes at all – complicating the identification of the relevant disease genes and necessitating further analysis[19].

In addition to these systematic genome-wide searches for disease genes using genetic mapping techniques, prior biological knowledge about genes can also be exploited to identify candidate disease genes based on their apparent functional links to disease processes. Such functional candidate gene approaches have met with limited success in the past[2], but they have been given a new lease of life by the various genome-scale functional analyses[17] that are now possible in the post-genomic era. The rapid expansion in large-scale functional data along with the concurrent expansion of the field of bioinformatics has finally made such large-scale functional analyses tractable.

1.4 Functional candidate disease gene approaches using bioinformatics Despite the deluge of biological data that have been generated over the past decade, they are still too limited to enable pragmatic genome-scale functional candidate disease-gene prioritization. Therefore, bioinformatic approaches to functional candidate gene prioritization have so far generally been combined with genetic mapping (Chapter 2). This is a powerful combination as the two approaches complement each other nicely. Genetic mapping restricts the number of genes that need to be prioritized, but is frequently not sufficient by itself to identify

14

Chapter 1: General Introduction the causative disease gene. Functional approaches on the other hand benefit greatly from the restriction of the search space to tens or hundreds of genes instead of thousands.

As previously mentioned, linkage mapping, which is the most commonly used genetic mapping technique, frequently results in the delineation of a disease-associated genomic region (locus) containing tens to hundreds of genes. Up until the present day it is still laborious and expensive to sequence many genes, so it is necessary to prioritize the candidate genes in the disease locus in order to minimize the amount of sequencing required to identify the causative disease gene. This situation is changing however, as the advent of the genomics era, our increased knowledge of molecular biology, and the identification of more and more disease genes which can be used as starting points have resulted in this candidate gene approach gaining more traction[20]. To assist in this process, several bioinformatic tools and approaches have been developed in the past years (Chapter 2 and ref.[21]).

Such candidate gene approaches are made possible by the close relationship between genes and phenotypes. The fundamental premise behind all such approaches is namely that functionally related genes result in similar disease phenotypes when mutated. In other words, common molecular processes are assumed to underlie similar diseases, leading to the search for genes associated with particular disease processes or biological processes associated with particular disease genes (Figure 1). This results in candidate genes being sought based on functional relationships with other genes known to cause the same or similar phenotypes (such as ENDEAVOUR[22]). A second approach based on the same premise is to look for sets of functionally related genes in combinations of loci associated with the same or similar diseases (such as POCUS[23] and Prioritizer[24]). Finally, disease candidate genes can be identified through their direct association with biological processes known to be involved in the disease in question (such as Genes2Diseases[25] and GeneSeeker[26]). These approaches are discussed further in chapter 2 and will not be elaborated upon here.

Functional candidate gene approaches also readily permit, and can even benefit from, the concurrent investigation of several different diseases. The existence of genetically heterogeneous diseases, as well as genes which can lead to different disease phenotypes when mutated, indicate that the relationship between disease genes and phenotypes is not insular. Rather, it is clear that there are common biological processes linking different diseases[27-29]. Therefore, functional approaches to disease gene prioritization should enable knowledge of the pathobiology of certain diseases to be transferred to other distinct but related diseases. This is dependent on the identification of phenotypic relationships between diseases, which in turn requires appropriate disease phenotype descriptions. However, disease phenotype analysis has historically not received as much attention as disease genetics analysis, and the most suitable ways to describe and encode disease phenotypes are yet to be identified.

15

Chapter 1: General Introduction

Figure 1: Bioinformatic approaches to positional candidate disease gene prioritization. (A) Functional links can be sought between candidate disease genes and known disease genes. This approach is used by ENDEAVOUR. (B) If there are several different disease-associated genomic loci, common functional modules can be sought that link genes in the different loci together. Given that these loci are all associated with the same disease, such overrepresented functional modules are likely to be involved in the disease etiology. This approach is used by POCUS and Prioritizer. (C) Functional links between candidate genes and disrupted disease processes or phenotypic features can be sought directly. This approach is used by Genes2Diseases and GeneSeeker.

1.5 The importance of the phenotype In contrast to disease genes, disease phenotypes are easier to observe. This makes them one of the most useful resources currently available to the disease researcher, provided that this information is properly collected and recorded. As mentioned earlier, this can be challenging to do properly, but it is essential in order to investigate the full extent of the relationships between disease genes and phenotypes.

Disease phenotypes can be viewed as collections of individual phenotypic features, which occur in different combinations in different diseases. Diseases that consist of a combination of co- occurring phenotypic features are called syndromes. These syndromes do not consist of independent sets of features; rather there is significant feature overlap between different

16

Chapter 1: General Introduction syndromes. The nature of this overlap is not random; rather syndromes can be grouped into collections of similar phenotypes known as syndrome families[30]. For instance, mutations in the TP63 gene can give rise to several phenotypically overlapping syndromes[9,10]. There are several classes of similar neurodegenerative diseases such as the spinocerebellar ataxias[31] and Charcot-Marie-Tooth syndrome subtypes[32]. There are also several families of related syndromes with grouped according to their etiology into various “opathies”, such as muscle[33] and neuronal[34] channelopathies, neurocristopathies[35] and ciliopathies[36]. These are just a few of the many examples of phenotypically related syndromes that can be grouped into families (http://www.ncbi.nlm.nih.gov/omim/). Such shared features between syndromes suggest that genes involved in the different syndromes interact with each other at some point in the path from gene to phenotype.

The fact that diseases have such phenotypic structure can be exploited to guide functional analyses. The most obvious case is that phenotypic features can indicate the location and timing of the disease process (Figure 1C). These phenotypic features, if seen in diseases of unknown etiology, can give a clue as to the nature of the underlying molecular dysfunction. Neurodegenerative diseases are probably caused by molecular processes that affect neuronal homeostasis, while congenital skeletal dysmorphologies must be caused by defects in pre-natal skeletal development. This knowledge can be used to narrow the list of candidate genes. Though these examples may seem trivial, this is a general principle that also extends to less obvious cases. For instance, though ciliopathies are phenotypically diverse, there are specific features that tend to occur frequently in them, such as retinal degeneration, cystic kidney and liver disease and situs inversus [36]. The presence of these phenotypic features in diseases of unknown etiology can be used to hypothesize ciliary dysfunction as a potential disease mechanism[36]. Indeed, such an approach has already been successfully used to predict novel ciliopathies[37,38].

As well as providing direct insight into the nature of underlying disease processes, disease phenotypes can be used to identify which groups of diseases are more likely to share underlying disease processes and which are less likely to have any relationship with each other, through the determination of phenotypic similarity between diseases. This can inform functional candidate prioritization approaches that look for functional relationships between known disease genes and novel candidates, such as ENEAVOUR (Figure 1A), by suggesting which diseases can be used to generate the pool of known disease genes. It is equally applicable to approaches that look for functional relationships between candidates from disparate genomic loci (Figure 1B), to inform the selection of loci to use.

Due to the importance of disease phenotypes in such functional approaches to candidate disease gene prioritization, and given the increasing importance of large-scale functional analyses in candidate gene prioritization, we decided to systematically investigate the degree to which disease phenotype similarity reflects common underlying disease processes.

1.6 Outline of the thesis In this thesis we investigate the degree to which similar disease phenotypes are caused by functionally related genes. We begin by determining whether genes causing identical disease

17

Chapter 1: General Introduction phenotypes do indeed tend to be functionally related, using two objective measures of gene functional relatedness. We then investigate whether such functional relatedness extends beyond those disease genes that lead to identical disease phenotypes, to those causing phenotypically similar but nevertheless distinct diseases.

However, we start in chapter 2 with an overview of the relationships between disease genes and disease phenotypes. This chapter expands on the discussion in sections 1.3 and 1.4 above. We emphasize the modular nature of genetic diseases, showing based on empirical knowledge that they are modular at both the genetic and the phenotypic level. Such modularity enhances the degree to which phenotypic similarity can be used to infer genetic similarity by highlighting functional genetic modules. Additionally, we give an overview of bioinformatic approaches to candidate disease gene prioritization, which exploit the relationships between functional genetics and phenotypes.

We proceed in chapter 3 to investigate the degree to which protein-protein interactions, the strongest indicator of functional relatedness between two proteins, can be used to prioritize candidate disease genes for genetically heterogeneous diseases. Given that they result in very similar or identical disease phenotypes, these genetically heterogeneous diseases represent the strongest degree of phenotypic similarity between disease genes. We show that genes causing the same disease have a much stronger tendency to interact with each other at the protein level than other genes. In this analysis we also use protein-protein interaction data from other species, and show that such functional relationships between disease genes are conserved through evolution, as protein-protein interactions from other species are equally good at predicting disease genes as human protein-protein interactions. Importantly for candidate disease gene prioritization, we find that literature-derived human protein-protein interactions are strongly biased toward known disease genes and may therefore not be representative for novel candidate disease gene prediction.

In chapter 4 we investigate whether another type of objective functional relationship between genes, namely coordinated expression or co-expression, is stronger for genes underlying the same disease. In contrast to the previously investigated protein-protein interaction data, co- expression data are more abundant and less biased in their ascertainment. However, they are also less powerful at identifying functional relationships between genes than protein-protein interactions. We compensate for this by utilizing evolutionary conservation of co-expression to strengthen the correspondence between co-expression and functional relatedness. The rationale is that genes whose expression patterns are coordinately regulated in multiple evolutionarily distant species are more likely to be functionally related, as there has been a selective pressure to maintain this pattern through evolution. We show that the expression of genes causing the same disease is generally more closely coordinated than that of other genes, and that this effect can be strengthened by exploiting evolutionary conservation of co- expression. We further show that such evolutionarily conserved gene co-expression can be used to improve candidate disease gene prioritization.

In chapter 5 we turn our attention from disease genes to disease phenotypes, investigating the genetic coherence of the human phenome – the sum total of all human (disease) phenotypes

18

Chapter 1: General Introduction and their relationships to each other. Having shown that genes causing virtually identical disease phenotypes tend to be functionally related, we seek to confirm that this phenomenon extends beyond identical phenotypes to overlapping but non-identical phenotypes. In order to do this, disease phenotypes need to be described in a systematic manner that can enable quantitative phenotypic overlap estimation. This has historically been done using text mining of free text phenotype descriptions, but this approach does not result in systematic phenotype descriptions due to the limitations of text mining. There are several databases that systematically annotate disease phenotypes with phenotypic features, but is not clear how useful they will be for such disease phenome analyses. In this chapter we confirm that the relationship between similar phenotypes and similar molecular genetics does indeed extend to separate but overlapping disease phenotypes. We additionally identify several database properties that influence the genetic coherence of systematically determined phenotype clusters.

We conclude in chapter 6 with a look into the future of the field, and the importance of phenotype-guided bioinformatic analyses therein. Due to the advent of new sequencing technologies, this field is going to change rapidly over the coming years. Positional candidate gene prioritization is becoming less relevant due to the ability to sequence entire candidate loci or even whole genomes relatively cheaply and quickly. However, there will be new challenges, such as making sense of the large number of mutations and genetic variants such high throughput approaches uncover, and identifying what their contributions are to the disease process.

References 1. Amberger, J., Bocchini, C.A., Scott, A.F. & Hamosh, A. McKusick's Online Mendelian Inheritance in Man (OMIM). Nucleic Acids Res 37 , D793-6 (2009). 2. Altshuler, D., Daly, M.J. & Lander, E.S. Genetic mapping in human disease. Science 322 , 881-8 (2008). 3. Kulaga, H.M. et al. Loss of BBS proteins causes anosmia in humans and defects in olfactory cilia structure and function in the mouse. Nat Genet 36 , 994-8 (2004). 4. Jansen, R.C. & Nap, J.P. Genetical genomics: the added value from segregation. Trends Genet 17 , 388-91 (2001). 5. Snyder, M., Weissman, S. & Gerstein, M. Personal phenotypes to go with personal genomes. Mol Syst Biol 5, 273 (2009). 6. Leboyer, M. et al. Psychiatric genetics: search for phenotypes. Trends Neurosci 21 , 102-5 (1998). 7. Orr, H.T. Unstable trinucleotide repeats and the diagnosis of neurodegenerative disease. Hum Pathol 25 , 598-601 (1994). 8. Warren, S.T. & Nelson, D.L. Trinucleotide repeat expansions in neurological disease. Curr Opin Neurobiol 3, 752-9 (1993). 9. Brunner, H.G., Hamel, B.C. & Van Bokhoven, H. The p63 gene in EEC and other syndromes. J Med Genet 39 , 377-81 (2002). 10. Rinne, T., Brunner, H.G. & van Bokhoven, H. p63-associated disorders. Cell Cycle 6, 262-8 (2007). 11. Taylor, E.W., Xu, J., Jabs, E.W. & Meyers, D.A. Linkage analysis of genetic disorders. Methods Mol Biol 68 , 11-25 (1997). 12. Lunetta, K.L. Genetic association studies. Circulation 118 , 96-101 (2008). 13. Botstein, D., White, R.L., Skolnick, M. & Davis, R.W. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 32 , 314-31 (1980).

19

Chapter 1: General Introduction

14. Weber, J.L. & May, P.E. Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am J Hum Genet 44 , 388-96 (1989). 15. Stein, C.M. & Elston, R.C. Finding genes underlying human disease. Clin Genet 75 , 101-6 (2009). 16. Collins, F.S., Guyer, M.S. & Charkravarti, A. Variations on a theme: cataloging human DNA sequence variation. Science 278 , 1580-1 (1997). 17. Lander, E.S. The new genomics: global views of biology. Science 274 , 536-9 (1996). 18. Risch, N. & Merikangas, K. The future of genetic studies of complex human diseases. Science 273 , 1516-7 (1996). 19. Katsanis, N. From association to causality: the new frontier for complex traits. Genome Med 1, 23 (2009). 20. McCarthy, M.I., Smedley, D. & Hide, W. New methods for finding disease-susceptibility genes: impact and potential. Genome Biol 4, 119 (2003). 21. Zhu, M. & Zhao, S. Candidate gene identification approach: progress and challenges. Int J Biol Sci 3, 420-7 (2007). 22. Aerts, S. et al. Gene prioritization through genomic data fusion. Nat Biotechnol 24 , 537-544 (2006). 23. Turner, F.S., Clutterbuck, D.R. & Semple, C.A. POCUS: mining genomic sequence annotation to predict disease genes. Genome Biol 4, R75 (2003). 24. Franke, L. et al. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet 78 , 1011-25 (2006). 25. Perez-Iratxeta, C., Wjst, M., Bork, P. & Andrade, M.A. G2D: a tool for mining genes associated with disease. BMC Genet 6, 45 (2005). 26. van Driel, M.A., Cuelenaere, K., Kemmeren, P.P., Leunissen, J.A. & Brunner, H.G. A new web-based data mining tool for the identification of candidate genes for human genetic disorders. Eur J Hum Genet 11 , 57-63 (2003). 27. Brunner, H.G. & van Driel, M.A. From syndrome families to functional genomics. Nat Rev Genet 5, 545-51 (2004). 28. Goh, K.I. et al. The human disease network. Proc Natl Acad Sci U S A 104 , 8685-90 (2007). 29. Loscalzo, J., Kohane, I. & Barabasi, A.L. Human disease classification in the postgenomic era: a complex systems approach to human pathobiology. Mol Syst Biol 3, 124 (2007). 30. Pinsky, L. The polythetic (phenotypic community) system of classifying human malformation syndromes. Birth Defects Orig Artic Ser 13 , 13-30 (1977). 31. Soong, B.W. & Paulson, H.L. Spinocerebellar ataxias: an update. Curr Opin Neurol 20 , 438-46 (2007). 32. Barisic, N. et al. Charcot-Marie-Tooth disease: a clinico-genetic confrontation. Ann Hum Genet 72 , 416-41 (2008). 33. Saperstein, D.S. Muscle channelopathies. Semin Neurol 28 , 260-9 (2008). 34. Catterall, W.A., Dib-Hajj, S., Meisler, M.H. & Pietrobon, D. Inherited neuronal ion channelopathies: new windows on complex neurological diseases. J Neurosci 28 , 11768-77 (2008). 35. Bolande, R.P. Neurocristopathy: its growth and development in 20 years. Pediatr Pathol Lab Med 17 , 1-25 (1997). 36. Badano, J.L., Mitsuma, N., Beales, P.L. & Katsanis, N. The Ciliopathies: An Emerging Class of Human Genetic Disorders. Annu Rev Genomics Hum Genet 7, 125-148 (2006). 37. Arts, H.H. et al. Mutations in the gene encoding the basal body protein RPGRIP1L, a nephrocystin-4 interactor, cause Joubert syndrome. Nat Genet 39 , 882-8 (2007). 38. Beales, P.L. et al. IFT80, which encodes a conserved intraflagellar transport protein, is mutated in Jeune asphyxiating thoracic dystrophy. Nat Genet 39 , 727-9 (2007).

20

Chapter 2: The modular nature of genetic diseases

Martin Oti, Han G. Brunner

Clin Genet. 2007 Jan;71(1):1-11.

21

Chapter 2: The modular nature of genetic diseases

Abstract Evidence from many sources suggests that similar phenotypes are begotten by functionally related genes. This is most obvious in the case of genetically heterogeneous diseases such as Fanconi anemia, Bardet-Biedl or Usher syndrome, where the various genes work together in a single biological module. Such modules can be a multiprotein complex, a pathway, or a single cellular or subcellular organelle. This observation suggests a number of hypotheses about the human phenome that are now beginning to be explored.

First, there is now good evidence from bioinformatics analyses that human genetic diseases can be clustered on the basis of their phenotypic similarities and that such a clustering represents true biological relationships of the genes involved.

Second, one may use such phenotypic similarity to predict and then test for the contribution of apparently unrelated genes to the same functional module. This concept is now being systematically tested for several diseases. Most recently, a systematic yeast two-hybrid screen of all known genes for inherited ataxias indicated that they all form part of a single extended protein-protein interaction network.

Third, one can use bioinformatics to make predictions about new genes for diseases that form part of the same phenotype cluster. This is done by starting from the known disease genes and then searching for genes that share one or more functional attributes such as gene expression pattern, co-evolution, or gene ontology. Ultimately, one may expect that a modular view of disease genes should help the rapid identification of additional disease genes for multifactorial diseases once the first few contributing genes (or environmental factors) have been reliably identified.

Introduction With the arrival of the post-genomic era, the use of the candidate gene approach – in which gene candidacy is estimated based on knowledge about gene function – to augment or even supplant positional cloning has become more prevalent, and this trend will probably continue into the future[1,2]. Disease phenotypes provide a valuable window into the function of genes which can be exploited in the candidate gene approach. Here we discuss the relationship between genes and phenotypes and the application of this knowledge to the elucidation of genetic diseases, with an emphasis on malformation syndromes.

22

Chapter 2: The modular nature of genetic diseases

Relationship between genes and phenotypes

The functional genetics of syndrome families

Human malformation syndromes do not fall into neatly separated categories. Instead there is much overlap between syndromes, leading to the clustering of similar syndromes into syndrome families[3]. In addition, even within a particular syndrome there is often substantial phenotypic variation. This overlap in phenotype between different syndromes leads to ambiguity in their classification, with a tendency to lump similar syndromes together on the one hand and a tendency to split them into different entities on the other[4]. This situation is further complicated by the fact that there is not a one-to-one relationship between syndromes and genes. Typically, multiple syndromes can be caused by mutations in the same gene and a single disorder can be caused by mutations in different genes[5]. The existence of phenotypic variability is not surprising given that genes work in concert to form and maintain the human body. Different alleles of different genes in different individuals integrate differently with each other to create different final phenotypes. This is the basis of human phenotypic diversity and an important factor contributing to the fact that no two individuals are identical.

However, not all genes are equally functionally related. Functionally distant genes may have little or no impact on each other, while strongly related genes – for instance genes in the same protein complex or biochemical pathway – together perform a single biological function. Such strongly related genes may thus lead to the same or similar phenotypes when mutated, potentially resulting in the phenomenon of syndrome families[6] (see Figure 1). If the relationship is strong enough the resulting phenotypes may be clinically inseparable, leading to a genetically heterogeneous syndrome rather than a syndrome family.

There are numerous examples of diseases or groups of diseases with similar phenotypes that are caused by functionally related genes (see table 1 for a few examples).

While there is a tendency for similar disease phenotypes to be caused by functionally related genes, the converse does not hold true in all cases. Mutations that affect different functions of a pleiotropic gene can result in different phenotypic manifestations. A well-known example is the XPD gene, which is involved in both basal transcription and DNA repair[7]. Mutations in this gene can lead to different disease phenotypes depending on which of these two functions is compromised[8]. An even better known example is the lamin A/C gene (LMNA), which can lead to a host of different disease phenotypes collectively called laminopathies (see e.g.[9,10] for reviews) depending on where the mutation is located[11]. Also, nonsense and frameshift mutations of different regions of the GLI3 gene cause Greig cephalopolysyndactyly, and Pallister-Hall syndrome, respectively, because this gene again contains at least two separate biological functions[12]. Moonlighting enzymes form an interesting subset of pleiotropic genes that have regulatory or structural functions that are unrelated to their enzymatic function[13]. Mutations in these genes may lead to disease phenotypes that do not correlate with the known enzymatic function of the causative gene[14].

23

Chapter 2: The modular nature of genetic diseases

Figure 1 . Possible relationships between genes and phenotypes. The similar phenotypes of Stickler, Marshall and OSMED syndromes are caused by mutations in the functionally closely related genes COL2A1, COL11A1 and COL11A2. The phenotypically distinct Pallister-Hall syndrome is caused by mutations in the functionally unrelated or only weakly related GLI3 gene. Several genes can underlie one phenotype, as in the case of Stickler syndrome which can be caused by mutations in each of the three collagen genes. Conversely, one gene can lead to different phenotypes as in the case of COL11A1 (Stickler and Marshall phenotypes) and COL11A2 (Stickler and OSMED phenotypes). Thickness of black lines linking genes indicates (hypothetical) degree of functional relatedness between them.

Also, epistasis can mask the phenotypic effect of a gene[15], obscuring the link between gene and phenotype. The phenotypic effect of a mutation in a disease gene could be buffered or aggravated by alleles of other genes, leading to greater variablility in disease phenotype. In a

24

Chapter 2: The modular nature of genetic diseases large scale epistasis study of metabolic enzymes in baker’s yeast ( Saccharomyces cerevisiae )[16], it was found that mutations in genes in certain metabolic modules could influence the effect of mutations in genes in other modules – for instance mutations in respiratory genes aggravated the effects of mutations in genes involved in glycolysis.

The question may therefore be asked whether pleiotropy, epistasis, and diversity of gene function together obscure the modular view of genetic disease, or whether the relationship between functional genetic modules and phenotype is actually the rule in human disease.

The modular nature of the phenome

A first test for the hypothesis that similar phenotypes are caused by mutations in functionally related genes is provided by large scale phenotyping studies in Saccharomyces cerevisiae , or baker’s yeast[38,39]. Dudley and colleagues[38] analyzed the growth phenotypes of 4710 yeast mutants under 21 environmental conditions and found that phenotype correlated with the cellular function of the mutated gene. They also found significant pleiotropy in gene function. Similarly, Ohya and colleagues[39] used high-throughput image processing to automatically quantify yeast morphological phenotypes. Using 4718 deletion strains, they found that similar phenotypes were caused by deletions of functionally related genes.

In humans, similar analyses have also been done. In a first attempt, Jimenez-Sanchez and colleagues[40] investigated the relationship between the functional classes of disease genes and certain disease properties such as age at onset, mode of inheritance and reduction of life expectancy. They found associations between various functional categories and disease features; for instance transcription factors are overrepresented in diseases that manifest themselves in utero , while enzymatic insufficiencies tend to manifest themselves in the neonatal period. Freudenberg & Propping[41] clustered 878 diseases of known genetic origin from the OMIM database[42] according to phenotypic similarity, and found that genes leading to phenotypically similar diseases tend to have more similar functional annotation than other genes.

We have recently used an automated text mining approach to perform a similar analysis on over 5000 OMIM phenotypes, defined by higher resolution phenotypic descriptions[43]. Using standardized phenotypic feature terms from the Medical Subject headings (MeSH) controlled vocabulary, we mined OMIM records for disease-associated phenotypic features. These feature terms were further weighted according to specificity, rarity in the OMIM database and relative frequency of occurrence in the given syndrome record. This resulted in standardized and quantified descriptions of syndromes using feature vectors instead of plain text. This allowed for phenotypic similarity scores to be systematically generated for all 25 million syndrome pairs. We then sought to associate phenotype similarity with various measures of shared gene function. We found that phenotypic similarity between syndromes did indeed correlate with the sequence similarity of their disease genes. Significant correlations were also present between phenotype similarity and shared protein motifs, shared functional annotation of their underlying genes, and protein-protein interactions.

25

Chapter 2: The modular nature of genetic diseases

Table 1 . Examples of functionally related genes that lead to similar disease phenotypes when mutated. The different types of genetic relationships listed here are not mutually exclusive and genes could be related in several of these ways. These are not the only ways genes can be related.

Genetic relationship Examples Comments

Similar gene structure/function Stickler, Marshall and OSMED syndromes all Collagens also interact with each (homologs) caused by mutations in collagens[17-20]. other.

Fanconi Anemia caused by mutations in several genes involved in the functioning of a DNA repair complex[21,22].

Usher syndrome type I caused by mutations in Protein-protein interactions are a genes forming a protein complex involved in broad category that overlaps both inner ear hair cell differentiation and with the others mentioned here; retinal photoreceptor synapses[23,24]. proteins in the same pathway or developmental process may Protein-protein interactions, Hermansky-Pudlak syndrome caused by interact with each other, and especially protein complexes mutations in subunits of three protein homologous proteins often complexes involved in lysosomal vesicle interact through di- or trafficking[25,26]. multimerization as is the case with the collagens mentioned above. Inherited ataxias caused by mutations in proteins that form an extended functional network[27].

Pallister-Hall and Smith-Lemli-Opitz (SLOS) syndromes caused by mutations in GLI3 and DHCR7 genes respectively[28-30]; both are There are several different kinds involved in the hedgehog signaling of pathways, such as metabolic pathway[31,32]. pathways, signal transduction Same pathway pathways, or even protein modification pathways like the α- Muscle-Eye-Brain Disease, Walker-Warburg dystroglycan glycosylation Syndrome and Fukuyama Congenital Muscular pathway mentioned here. Dystrophy caused by genes involved in the glycosylation of α-dystroglycan [33].

Neurocristopathies, caused by defective neural crest development[34,35]: Hirschsprung Overall phenotypic similarity is Disease, Waardenburg, Waardenburg-Shah, not always high for these DiGeorge, and Congenital Central diseases but they can often be Hypoventilation (CCHS, or Ondine’s curse) linked though shared traits. For Same developmental or syndromes. instance, aganglionic megacolon organogenesis process occurs in various Cilia-related disorders, caused by defective neurocristopathies and situs cilia biogenesis[36,37]: Bardet-Biedl, inversus occurs in several cilia- Kartagener and Meckel syndromes, Primary related diseases. Ciliary Dyskinesia, Polycystic Kidney Disease.

26

Chapter 2: The modular nature of genetic diseases

Predicting disease genes Different bioinformatics strategies have been developed to predict and prioritize potential disease genes, and several web tools and bioinformatic approaches are now available (Table 2). They can be classified into three broad, not mutually exclusive categories: those based on intrinsic disease gene properties, those that use expression patterns and phenotypic information directly, and those that look at functional relatedness between candidate genes. These three categories differ in the extent to which they utilize disease phenotype information varying from simply distinguishing between disease and non-disease phenotypes on the one hand to the systematic comparison of disease phenotypes at the individual trait level on the other.

Are disease genes different?

A number of recent analyses have shown that there are systematic differences between disease genes and non-disease genes. While the biological basis for such systematic differences might not be immediately apparent, they could theoretically be used to evaluate the likelihood that a given candidate gene is disease-causing. However, their discriminatory power is still poor at the level of individual diseases, because up to half of all genes are classified as potentially disease- causing.

For instance, disease genes tend to be highly expressed and have tissue-specific expression patterns, as shown by Bortoluzzi and colleagues[44] and Smith & Eyre-Walker[45]. The latter study also found that disease genes have higher mutation rates over evolutionary time, a result confirmed by Huang and colleagues[46]. However, these higher disease gene mutation rates are challenged by results from Lopez-Bigas & Ouzounis[47] and Adie and colleagues[48], who found – both using decision tree computer learning algorithms – that disease genes tend to be more highly conserved with a broader phylogenetic extent. These apparently conflicting results were reconciled by Tu and colleagues[62], who separated the non-disease genes into two categories: housekeeping genes and non-housekeeping genes. They showed that the higher disease gene mutation rates found by the first two studies were due primarily to the low mutation rates of the housekeeping genes, while the higher conservation level of disease genes found by the latter two studies were due to the faster evolution of the non-housekeeping genes in the non-disease gene set.

Other findings from the two decision tree-based approaches were that disease genes tend to have fewer close paralogues (homologues resulting from gene duplication) – suggesting that their loss of function in disease cannot be adequately compensated for by a closely related counterpart – and that they tend to differ in some potentially regulation-related features such as having longer mRNA 3’ UTRs (Untranslated Terminal Regions), a higher proportion of genes with signal peptides, a higher proportion with CpG islands at their promoters (usually associated with housekeeping genes) and longer intergenic distances. These algorithms are available on the web for use by researchers (see Table 2).

27

Chapter 2: The modular nature of genetic diseases

Table 2 . Existing approaches and web tools for the prediction or prioritization of disease candidate genes.

Approach Online availability Data types used

Approaches based on disease gene properties

Bortoluzzi et al. Article supplementary material at Expression [44] http://physiolgenomics.physiology.org/cgi/content/full/00095.2003/DC1/

Smith & Eyre- - Sequence, expression Walker [45]

Additional data files available from journal website at Huang et al. [46] Sequence, expression http://genomebiology.com/2004/5/7/R47

DGP [47] http://cgg.ebi.ac.uk/services/dgp/ Sequence

PROSPECTR [48] http://www.genetics.med.ed.ac.uk/prospectr/ Sequence Approaches using links between genes and phenotypes

Sequence, functional Genes2Diseases http://www.ogic.ca/projects/g2d_2/ annotation, literature [49,50] mining

BITOLA [51] http://www.mf.uni-lj.si/bitola/ Literature mining

Expression, literature Tiffin et al. [52] Article supplementary data available at http://www.sanbi.ac.za/tiffin_et_al/ mining

GeneSeeker Expression, phenotype, http://www.cmbi.ru.nl/GeneSeeker/ [53,54] literature mining

GFINDer [55,56] http://www.bioinformatics.polimi.it/GFINDer/ Expression, phenotype

Expression, functional TOM [57] http://www-micrel.deis.unibo.it/~tom/ annotation Approaches using functional relatedness between candidate genes

Freudenberg & Phenotype, functional - Propping [41] annotation

Phenotype, sequence, OMIM phenome http://www.cmbi.ru.nl/MimMiner/ functional annotation, map [43] protein interactions

Protein-protein Article supplementary material on journal website at Protein interactions interactions [58] http://www.jmedgenet.com/supplemental/

Additional data files available from journal website at POCUS [59] Functional annotation http://genomebiology.com/2003/4/11/R75

Sequence, expression, SUSPECTS [60] http://www.genetics.med.ed.ac.uk/suspects/ functional annotation

Expression, functional Prioritizer [61] http://www.prioritizer.nl/ annotation, protein interactions

Sequence, expression, functional annotation, Endeavour [62] http://www.esat.kuleuven.be/endeavour/ pathways, literature mining

28

Chapter 2: The modular nature of genetic diseases

Recently, Lopez-Bigas and colleagues conducted a more comprehensive analysis of the general properties of disease genes[63]. They found extensive correlations between various gene properties and disease characteristics. For instance, they confirmed an earlier identified[40] relationship between gene functional class and disease mode of inheritance; enzymes and transporters are mostly involved in recessive diseases, while transcription regulators and structural molecules are usually associated with a dominant mode of inheritance. Importantly for phenotype- and functional genetics-based approaches to disease gene discovery, they also confirm that disease gene functional classification and expression patterns correlate with the type of disease they cause when mutated.

While these studies show that there are quantitative differences between disease genes and non-disease genes, these approaches are much too broad to be practically useful in the search for causative genes for specific diseases. In fact, the two methods that can be used for this – the decision trees of Lopez-Bigas and Adie and colleagues – classify respectively ~31% and ~44% of the entire genome as probable disease genes. Even so, they do not always detect the correct disease genes in benchmark tests. For practical purposes, the use of specific disease phenotypes, as in the approaches described below, is more productive.

Candidate genes based on phenotype

A number of bioinformatic approaches directly prioritize genes according to disease phenotypes or disease-associated phenotypic traits. These tools use gene expression patterns, gene ontology functional annotation, text mining of MEDLINE abstracts or data from mouse mutants to arrive at a list of most likely candidates. Typically, such approaches enrich candidate genes by 5- to 10-fold over random. While this is encouraging, benchmark tests only test for known disease genes, which means that the performance of these systems may be overestimated. Here we describe several disease candidate gene identification web tools and approaches that are based on directly linking candidate genes to disease phenotypes.

Two web tools, Genes2Diseases[49,50] and BITOLA[51] (see Table 2 for web locations), primarily use text mining of MEDLINE abstracts to associate genes from a candidate region to diseases. They essentially automate literature searches for associations between candidate genes and diseases, relieving the researcher of much manual searching. Being automated, they naturally do not interpret the literature as intelligently as a researcher would; however they can assist the researcher in prioritizing or narrowing down the list of candidate genes. The Genes2Diseases web tool, which is also based on gene functional annotation in databases in addition to text mining, performed reasonably well in benchmark tests. Using 100 artificial 30Mb disease loci, the target disease gene was recovered in 87 cases, and was among the top 8 genes in 47/100 cases.

A third literature mining-based approach was taken by Tiffin and colleagues[52], though they did not create a web interface for it. In this approach, candidate genes are ranked according to their expression overlap with disease-related anatomical regions. The associations between diseases and anatomical regions are retrieved from MEDLINE abstracts by text mining. The strength of this approach is the use of a controlled vocabulary of standardized anatomical terms (eVOC) to describe the gene expression patterns, and it was able to successfully select the

29

Chapter 2: The modular nature of genetic diseases correct disease gene in 15 out of 17 test cases whilst reducing the candidate gene set to ~63% of its original size on average.

Other web tools that place the emphasis on disease gene expression patterns are GeneSeeker[53,54], GFINDer[55,56] and TOM[57]. These tools can also be used to help identify interesting candidate disease genes. Similar to the approach used by Tiffin and colleagues, our GeneSeeker interactive web tool also matches postional candidate genes to diseases using anatomical expression patterns. However, it uses terms entered by human experts rather than terms associated with the disease through automated literature mining, and it does not use a controlled vocabulary. This interactive nature increases its utility to researchers, as they can use their own choice of terms rather than depending on the tool to automatically generate appropriate terms for them. In addition, GeneSeeker also incorporates mouse expression data, expanding the amount of data it can exploit for disease association. Its performance is naturally dependent on the chosen search terms, but in a test using 10 syndromes with known genesis, the causative gene was retrieved while the list of candidates was reduced from 165 to 22 on average. To our knowledge, GeneSeeker is also the only web tool that has thus far been successfully applied in clinical research, as it was used to identify the gene for an inherited skeletal dysplasia[65].

The GFINDer web tool contains a module that uses tissue mRNA expression patterns to analyze candidate disease genes. This tool works in the reverse direction relative to the others in that it identifies disease phenotypes from the OMIM database that are most likely to be associated with a given set of genes, rather than starting out with a given disease phenotype and identifying candidate genes. In other words, it identifies candidate diseases for a given set of genes rather than candidate genes for a given disease. Despite this, it can still be used to find potentially interesting disease genes in a candidate gene set as it may detect genes known to cause similar diseases. TOM uses microarray gene expression data and optionally also functional annotation to find positional candidate genes that are related to known disease genes, or to find related genes in multiple candidate disease loci.

While these direct candidate gene approaches typically enrich 5- to10-fold over random, there are a number of weaknesses, notably that they rely on the relatively poor annotation of many human genes, and the small number of genes that have a known knock-out phenotype in mice.

Using functional genetic modules to identify candidate genes

If we assume that genes which lead to the same phenotype are functionally related, as has been shown for several diseases (e.g. [66-71]), then we may define candidate genes as those that are functionally related to the known causative genes for the same or similar diseases. As mentioned earlier, the OMIM database has been systematically analyzed to identify relationships between gene functional relatedness and phenotypic similarity[41,43]. The results indicate that there is indeed a strong correlation between phenotype similarity and a number of relations at the gene and protein levels. This approach may therefore also be used for candidate disease gene prediction, as it can highlight functional genetic modules containing known disease genes that cause similar disease phenotypes.

30

Chapter 2: The modular nature of genetic diseases

More directly, Lim and colleagues[27] have systematically tested this approach for inherited ataxias. They found by using yeast two-hybrid approaches that most if not all genes for human inherited ataxias form part of a single expanded protein network of 54 genes. This result is interesting as previously none of the known ataxia-causing proteins were known to interact with each other. It is also a concrete illustration of the relationship between genetic modules and syndrome families, and suggests that other syndrome families might also be caused by proteins that form functional genetic modules.

Recently, we have also used protein-protein interactions to predict novel candidate disease genes for genetically heterogeneous diseases[58]. We collected a total of 72940 protein-protein interactions between 10984 human proteins, mostly from human datasets, but some of which were mapped from other species. We then used this protein-protein interaction network to look for positional candidate disease genes for 383 genetically heterogeneous diseases, based on their interactions with known disease proteins. We made almost 300 predictions for 48 diseases, of which about 10% are expected to be correct based on benchmark tests with known disease genes. This represents a 10-fold enrichment over using positional data only. Though not available through a web tool, the predictions are available online as supplementary data accompanying the publication (Table 2).

In addition to protein-protein interactions, functional annotation can also be used to link different genes into larger functional modules. Turner and colleagues[59] developed an approach (POCUS) which compares the functional annotation of genes in different candidate loci, looking for gene combinations in the different loci that share the most functional annotation terms. The idea is that if there are functionally related genes in different disease loci, they may be part of a genetic module underlying the disease phenotype. In tests this approach resulted in a 12-fold to 42-fold enrichment in candidate genes depending on locus size. Unfortunately POCUS also does not have a web interface, though supplementary data accompany the publication (Table 2).

A recent trend in bioinformatic disease gene prediction tools is the combination and integration of several different types of functional genomic data in order to better predict or prioritize candidate disease genes. Adie and colleagues further expanded their decision tree-based system, described in the previous section, to accommodate the investigation of candidate disease regions for specific diseases[60] (SUSPECTS). To this end they augmented it with different types of functional genomic data – coexpression, shared protein domains and functional annotation – which are used to rank candidate genes based on their similarity to known disease genes associated with the query disease. This approach performs substantially better than their previous approach (PROSPECTR) which does not take disease-specific information into account, and is available through the web (Table 2).

The most comprehensive approaches developed to date that attempt to prioritize positional candidate genes by integrating various types of genomic data are Prioritizer[61] and Endeavour[62]. Both tools are available as downloadable software (see Table 2), but cannot be used directly over the web. Prioritizer attempts to identify functional similarities between genes in different candidate regions by integrating coexpression, protein-protein interaction, and

31

Chapter 2: The modular nature of genetic diseases functional annotation data using a Bayesian network. It performs relatively well, as the correct disease gene was amongst the top five candidates 54% of the time in a benchmark test using artificial loci with 100 genes each. Endeavour, on the other hand, uses a user-supplied training set of known disease genes to prioritize genes in candidate loci. It integrates more data types than Prioritizer, utilizing sequence similarity, expression data, protein-protein interactions, functional annotation, pathway data and regulatory information based on transcription factor binding sites, and it performs remarkably well with a good training set of known disease genes. Naturally, its success is dependent on the availability of a set of known disease genes for the disease in question or similar diseases.

An example of the success of combining different types of genomic data is a recent study by Calvo and colleagues[72] which attempted to identify novel genes involved in mitochondrial dysfunction. They combined several types of genomic data including mitochondrial targeting sequences, cis-regulatory motifs, protein domains, (co)expression, protein homology and mass spectrometry data to identify nuclear-encoded genes associated with mitochondrial function. The resulting genes were checked against chromosomal loci to identify candidate genes for mitochondrial disorders, one of which was confirmed. Although this was an analysis of a specific disease category, it shows that combining functional genomics data can be successful in predicting disease genes.

One limitation of many of these approaches is that they rely quite heavily on functional annotation of candidate genes, and are therefore inherently biased toward better characterized genes. This raises the question of how representative these benchmark tests with known disease genes are for novel disease gene prediction. Nonetheless, functional genetic modules containing known disease genes are clearly an important new route towards effective candidate disease gene prediction.

Formalizing phenotype descriptions To be maximally effective, phenotypic trait-based methods require formalized, standardized and preferably also quantified disease phenotype descriptions. Phenotypic information is the most readily attainable information relating to genetic diseases. It does not require genetic studies as are required for positional cloning. Doctors are adept at identifying even the slightest phenotypic alterations in disease patients and given the worldwide nature of the medical community, vast amounts of detailed phenotypic information already exist in the medical literature awaiting full exploitation[73]. Furthermore, a few excellent disease databases such as the Online Mendelian Inheritance in Man (OMIM)[42] database collect and store large amounts of this information in one place.

Unfortunately, most phenotypic information is not currently stored in formats that are amenable to automated processing. Most phenotypic data are stored as prose, as in the OMIM database, and the use of phenotypic terms in disease description is not standardized in the literature. Furthermore, some phenotypes are described in more detail than others. These factors make automated processing of phenotypic information difficult.

32

Chapter 2: The modular nature of genetic diseases

In order to create an automated disease phenotype analysis tool, a standardized system of phenotypic description would be invaluable[74]. In addition to standardized description, more quantitative information such as frequency of occurrence is important for automated processing. To begin with this would permit related diseases to be systematically associated with each other even if they are classified separately in OMIM; it would enable a more rigorous classification of disease phenotypes. More importantly, it would allow more systematic association of disease phenotypes with underlying disease genes, allowing their impact on phenotype to be measured more precisely.

The Orphanet database of rare human malformation disorders[75] already uses a systematic means of classifying disease phenotypes including trait lists annotated with frequency of occurrence. This approach illustrates the direction that needs to be taken for phenotype databases to be even more useful to automated processing tools than they currently are. Controlled vocabularies such as eVOC[76] or the National Library of Medicine’s Medical Subject Headings (MeSH) can be used to standardize terms used in phenotype description. For craniofacial malformation syndromes it remains difficult to find unequivocal terms that capture the complete phenotypic information. In this regard, a picture still tells more than a thousand words. Quantitative topological information acquired using computerized analysis techniques such as the ‘Dense Surface Model’[77-79] (Figure 2) and Gabor wavelet[80] technologies could potentially lead to more rigorous phenotype comparisons than text-based descriptions. A human phenome project, as proposed by Freimer & Sabatti[81], is clearly needed in some shape or form.

Future prospects Given a more structured phenotype description system, topological analyses of the human phenome landscape can be conducted with more precision. For instance, disease phenotypes could be clustered to create a phenotypic similarity network of inherited diseases, as already attempted by us and also by Cantor & Lussier with courser phenotype descriptions[43,82]. Such a phenotype network would be valuable for syndrome classification and could facilitate research into the underlying genetics[6].

For instance, a similarity-based network of syndrome phenotypes could be overlaid over a functional relatedness-based network of genes such as that created by Franke and colleagues[61]. The hypothesis that similar phenotypes are caused by functionally related disease genes could then be tested on a genomic – and phenomic – scale. Additionally, known causative genes from syndromes that are phenotypically similar to a genetically uncharacterized syndrome can be used to query the gene network for functionally related candidate genes (Figure 3).

33

Chapter 2: The modular nature of genetic diseases

Figure 2 . The use of Dense Surface Models (DSMs) for facial morphology quantification. (A) DSM-based images of children with Williams-Beuren syndrome (WBS) and control children. Patient HR has an atypical form of WBS. These 3D facial images were synthesized in Dense Surface Models (DSMs) that integrate large numbers of facial images. (B) Scatterplot display of the unseen classification of normal and WBS DSMs. The position of patient HR’s DSM is located between typical WBS and control DSMs. This illustrates the means by which DSMs can be used to quantify and distinguish a syndrome’s characteristic facial morphology from normal facial morphology, or that of a different syndrome. See reference [71] for further details. Reprinted with permission from Tassabehji et al., Science 310:1184-7 (18 Nov 2005). Copyright 2005 AAAS. Note: Modified from published article; photographs of patients’ faces removed.

34

Chapter 2: The modular nature of genetic diseases

Figure 3 . Mapping phenotype networks to gene networks for candidate disease gene prediction. In this hypothetical example, diseases 1, 2 and 3 have known causative genes (genes A, C and E respectively), and are all phenotypically related to disease 4 which lacks an identified causative gene. If the known causative genes are functionally closely related, as in this case, then candidate genes (genes B and D) can be hypothesized for disease 4 due to their close functional relationships to the known genes of the phenotypically related diseases. This kind of analysis could potentially also identify higher level links between functional genetic modules and syndrome families. Green lines connect (green) known disease genes with their (green) diseases, while red lines indicate potential causative links between (red) candidate genes and the (red) disease of unknown etiology. Black lines of varying thickness indicate the degree of phenotypic and functional similarity between diseases and genes respectively.

35

Chapter 2: The modular nature of genetic diseases

The modular nature of disease genes is rapidly changing the way in which candidate genes are mostly defined. Ultimately, one may expect that this modular view of disease genes should be especially helpful for the rapid identification of additional disease genes for multifactiorial diseases, once the first few contributing genes (or environmental factors) have been reliably identified.

Acknowledgements We are grateful to May Tassabehji and colleagues for permitting us to use one of their dense surface model images in this review. We would also like to thank Martijn Huynen for reviewing the manuscript and providing critical comments. This work was supported in part by the BioRange program of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI).

References

1. Botstein D, Risch N. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet 2003: 33 Suppl: 228-37. 2. McCarthy MI, Smedley D, Hide W. New methods for finding disease-susceptibility genes: impact and potential. Genome Biol 2003: 4: 119. 3. Pinsky L. The polythetic (phenotypic community) system of classifying human malformation syndromes. Birth Defects Orig Artic Ser 1977: 13 : 13-30. 4. McKusick VA. On lumpers and splitters, or the nosology of genetic disease. Perspect Biol Med 1969: 12 : 298-312. 5. Biesecker LG. Lumping and splitting: molecular biology in the genetics clinic. Clin Genet 1998: 53 : 3- 7. 6. Brunner HG, van Driel MA. From syndrome families to functional genomics. Nat Rev Genet 2004: 5: 545-51. 7. Bootsma D, Hoeijmakers JH. DNA repair. Engagement with transcription. Nature 1993: 363 : 114-5. 8. Lehmann AR. The xeroderma pigmentosum group D (XPD) gene: one gene, two functions, three diseases. Genes Dev 2001: 15 : 15-23. 9. Benedetti S, Merlini L. Laminopathies: from the heart of the cell to the clinics. Curr Opin Neurol 2004: 17 : 553-60. 10. Mattout A, Dechat T, Adam SA, et al. Nuclear lamins, diseases and aging. Curr Opin Cell Biol 2006: 18 : 335-41. 11. Hegele R. LMNA mutation position predicts organ system involvement in laminopathies. Clin Genet 2005: 68 : 31-4. 12. Biesecker LG. What you can learn from one gene: GLI3. J Med Genet 2006: 43 : 465-469. 13. Jeffery CJ. Moonlighting proteins. Trends Biochem Sci 1999: 24 : 8-11. 14. Sriram G, Martinez JA, McCabe ER, et al. Single-gene disorders: what role could moonlighting enzymes play? Am J Hum Genet 2005: 76 : 911-24. 15. Moore JH. A global view of epistasis. Nat Genet 2005: 37 : 13-4. 16. Segre D, Deluna A, Church GM, et al. Modular epistasis in yeast metabolism. Nat Genet 2005: 37 : 77-83.

36

Chapter 2: The modular nature of genetic diseases

17. Annunen S, Korkko J, Czarny M, et al. Splicing mutations of 54-bp exons in the COL11A1 gene cause Marshall syndrome, but other mutations cause overlapping Marshall/Stickler phenotypes. Am J Hum Genet 1999: 65 : 974-83. 18. Griffith AJ, Sprunger LK, Sirko-Osadsa DA, et al. Marshall syndrome associated with a splicing defect at the COL11A1 locus. Am J Hum Genet 1998: 62 : 816-23. 19. Snead MP, Yates JR. Clinical and Molecular genetics of Stickler syndrome. J Med Genet 1999: 36 : 353-9. 20. van Steensel MA, Buma P, de Waal Malefijt MC, et al. Oto- spondylo-megaepiphyseal dysplasia (OSMED): clinical description of three patients homozygous for a missense mutation in the COL11A2 gene. Am J Med Genet 1997: 70 : 315-23. 21. Garcia-Higuera I, Taniguchi T, Ganesan S, et al. Interaction of the Fanconi anemia proteins and BRCA1 in a common pathway. Mol Cell 2001: 7: 249-62. 22. Mace G, Bogliolo M, Guervilly JH, et al. 3R coordination by Fanconi anemia proteins. Biochimie 2005: 87 : 647-58. 23. El-Amraoui A, Petit C. Usher I syndrome: unravelling the mechanisms that underlie the cohesion of the growing hair bundle in inner ear sensory cells. J Cell Sci 2005: 118 : 4593-603. 24. Reiners J, Reidel B, El-Amraoui A, et al. Differential distribution of harmonin isoforms and their possible role in Usher-1 protein complexes in mammalian photoreceptor cells. Invest Ophthalmol Vis Sci 2003: 44 : 5006-15. 25. Dell'Angelica EC. The building BLOC(k)s of lysosomes and related organelles. Curr Opin Cell Biol 2004: 16 : 458-64. 26. Li W, Rusiniak ME, Chintala S, et al. Murine Hermansky-Pudlak syndrome genes: regulators of lysosome-related organelles. Bioessays 2004: 26 : 616-28. 27. Lim J, Hao T, Shaw C, et al. A protein-protein interaction network for human inherited ataxias and disorders of Purkinje cell degeneration. Cell 2006: 125 : 801-14. 28. Kang S, Graham JM, Jr., Olney AH, et al. GLI3 frameshift mutations cause autosomal dominant Pallister-Hall syndrome. Nat Genet 1997: 15 : 266-8. 29. Tint GS, Irons M, Elias ER, et al. Defective cholesterol biosynthesis associated with the Smith-Lemli- Opitz syndrome. N Engl J Med 1994: 330 : 107-13. 30. Yu H, Patel SB. Recent insights into the Smith-Lemli-Opitz syndrome. Clin Genet 2005: 68 : 383-91. 31. Koide T, Hayata T, Cho KW. Negative regulation of Hedgehog signaling by the cholesterogenic enzyme 7-dehydrocholesterol reductase. Development 2006: 133 : 2395-405. 32. Nieuwenhuis E, Hui CC. Hedgehog signaling and congenital malformations. Clin Genet 2005: 67 : 193-208. 33. van Reeuwijk J, Brunner HG, van Bokhoven H. Glyc-O-genetics of Walker-Warburg syndrome. Clin Genet 2005: 67 : 281-9. 34. Bolande RP. The neurocristopathies. A unifying concept of disease arising in neural crest maldevelopment. Hum Pathol 1974: 5: 409-20. 35. Bolande RP. Neurocristopathy: its growth and development in 20 years. Pediatr Pathol Lab Med 1997: 17 : 1-25. 36. Afzelius BA. Cilia-related diseases. J Pathol 2004: 204 : 470-7. 37. Eley L, Yates LM, Goodship JA. Cilia and disease. Curr Opin Genet Dev 2005: 15 : 308-14. 38. Dudley AM, Janse DM, Tanay A, et al. A global view of pleiotropy and phenotypically derived gene function in yeast. Mol Syst Biol 2005: 1: 2005 0001. 39. Ohya Y, Sese J, Yukawa M, et al. High-dimensional and large-scale phenotyping of yeast mutants. Proc Natl Acad Sci U S A 2005: 102 : 19015-20. 40. Jimenez-Sanchez G, Childs B, Valle D. Human disease genes. Nature 2001: 409 : 853-5. 41. Freudenberg J, Propping P. A similarity-based method for genome-wide prediction of disease- relevant human genes. Bioinformatics 2002: 18 Suppl 2: S110-5.

37

Chapter 2: The modular nature of genetic diseases

42. Hamosh A, Scott AF, Amberger JS, et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005: 33 : D514-7. 43. van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006: 14 : 535-42. 44. Bortoluzzi S, Romualdi C, Bisognin A, et al. Disease genes and intracellular protein networks. Physiol Genomics 2003: 15 : 223-7. 45. Smith NG, Eyre-Walker A. Human disease genes: patterns and predictions. Gene 2003: 318 : 169- 75. 46. Huang H, Winter EE, Wang H, et al. Evolutionary conservation and selection of human disease gene orthologs in the rat and mouse genomes. Genome Biol 2004: 5: R47. 47. Lopez-Bigas N, Ouzounis CA. Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic Acids Res 2004: 32 : 3108-14. 48. Adie EA, Adams RR, Evans KL, et al. Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformatics 2005: 6: 55. 49. Perez-Iratxeta C, Bork P, Andrade MA. Association of genes to genetically inherited diseases using data mining. Nat Genet 2002: 31 : 316-9. 50. Perez-Iratxeta C, Wjst M, Bork P, et al. G2D: a tool for mining genes associated with disease. BMC Genet 2005: 6: 45. 51. Hristovski D, Peterlin B, Mitchell JA, et al. Using literature-based discovery to identify disease candidate genes. Int J Med Inform 2005: 74 : 289-98. 52. Tiffin N, Kelso JF, Powell AR, et al. Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Res 2005: 33 : 1544-52. 53. van Driel MA, Cuelenaere K, Kemmeren PP, et al. A new web-based data mining tool for the identification of candidate genes for human genetic disorders. Eur J Hum Genet 2003: 11: 57-63. 54. van Driel MA, Cuelenaere K, Kemmeren PP, et al. GeneSeeker: extraction and integration of human disease-related information from web-based genetic databases. Nucleic Acids Res 2005: 33 : W758- 61. 55. Masseroli M, Galati O, Pinciroli F. GFINDer: genetic disease and phenotype location statistical analysis and mining of dynamically annotated gene lists. Nucleic Acids Res 2005: 33 : W717-23. 56. Masseroli M, Martucci D, Pinciroli F. GFINDer: Genome Function INtegrated Discoverer through dynamic annotation, statistical analysis, and mining. Nucleic Acids Res 2004: 32 : W293-300. 57. Rossi S, Masotti D, Nardini C, et al. TOM: a web-based integrated approach for identification of candidate disease genes. Nucleic Acids Res 2006: 34 : W285-92. 58. Oti M, Snel B, Huynen MA, et al. Predicting disease genes using protein-protein interactions. J Med Genet 2006: 43 (8):691-8. 59. Turner FS, Clutterbuck DR, Semple CA. POCUS: mining genomic sequence annotation to predict disease genes. Genome Biol 2003: 4: R75. 60. Adie EA, Adams RR, Evans KL, et al. SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics 2006: 22 : 773-4. 61. Franke L, Bakel H, Fokkens L, et al. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet 2006: 78: 1011-25. 62. Aerts S, Lambrechts D, Maity S, et al. Gene prioritization through genomic data fusion. Nat Biotechnol 2006: 24 : 537-544. 63. Tu Z, Wang L, Xu M , et al. Further understanding human disease genes by comparing with housekeeping genes and other genes. BMC Genomics 2006: 7: 31. 64. Lopez-Bigas N, Blencowe BJ, Ouzounis CA. Highly consistent patterns for inherited human diseases at the molecular level. Bioinformatics 2006: 22 : 269-77.

38

Chapter 2: The modular nature of genetic diseases

65. Thiel CT, Horn D, Zabel B, et al. Severely incapacitating mutations in patients with extreme short stature identify RNA-processing endoribonuclease RMRP as an essential cell growth regulator. Am J Hum Genet 2005: 77 : 795-806. 66. McCabe ER. Hirschsprung's disease: dissecting complexity in a pathogenetic network. Lancet 2002: 359 : 1169-70. 67. Di Pietro SM, Dell'Angelica EC. The cell biology of Hermansky-Pudlak syndrome: recent advances. Traffic 2005: 6: 525-33. 68. Mace G, Bogliolo M, Guervilly JH, et al. 3R coordination by Fanconi anemia proteins. Biochimie 2005: 87 : 647-58. 69. Wanders RJ, Waterham HR. Peroxisomal disorders I: biochemistry and genetics of peroxisome biogenesis disorders . Clin Genet 2005: 67 : 107-33. 70. van Reeuwijk J, Brunner HG, van Bokhoven H. Glyc-O-genetics of Walker-Warburg syndrome. Clin Genet 2005: 67 : 281-9. 71. Gandhi TK, Zhong J, Mathivanan S, et al. Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat Genet 2006: 38 : 285-93. 72. Calvo S, Jain M, Xie X, et al. Systematic identification of human mitochondrial disease genes through integrative genomics. Nat Genet 2006: 38 : 576-582. 73. Brenner S. Nature's gift to science (Nobel lecture). Chembiochem 2003: 4: 683-7. 74. Biesecker LG. Mapping phenotypes to language: a proposal to organize and standardize the clinical descriptions of malformations. Clin Genet 2005: 68 : 320-6. 75. Ayme S. [Orphanet, an information site on rare diseases]. Soins 2003: 46-7. 76. Kelso J, Visagie J, Theiler G, et al. eVOC: a controlled vocabulary for unifying gene expression data. Genome Res 2003: 13 : 1222-30. 77. Hammond P, Hutton TJ, Allanson JE, et al. Discriminating power of localized three-dimensional facial morphology. Am J Hum Genet 2005: 77 : 999-1010. 78. Hammond P, Hutton TJ, Allanson JE, et al. 3D analysis of facial morphology. Am J Med Genet A 2004: 126 : 339-48. 79. Tassabehji M, Hammond P, Karmiloff-Smith A, et al. GTF2IRD1 in craniofacial development of humans and mice. Science 2005: 310 : 1184-7. 80. Loos HS, Wieczorek D, Wurtz RP, et al. Computer-based recognition of dysmorphic faces. Eur J Hum Genet 2003: 11 : 555-60. 81. Freimer N, Sabatti C. The human phenome project. Nat Genet 2003: 34 : 15-21. 82. Cantor MN, Lussier YA. Mining OMIM for insight into complex diseases. Medinfo 2004: 11 : 753-7.

39

40

Chapter 3: Predicting disease genes using protein-protein interactions

Martin Oti, Berend Snel, Martijn Huynen, Han G. Brunner

J Med Genet. 2006 Aug;43(8):691-8.

41

Chapter 3: Predicting disease genes using protein-protein interactions

Abstract Background: For a large number of genetically mapped disease loci, the responsible genes have not yet been identified. Physically interacting proteins tend to be involved in the same cellular process and mutations in their genes may lead to similar disease phenotypes. It should thus be possible to use protein-protein interactions to predict genes for genetically heterogeneous diseases.

Objective: To investigate how successful protein-protein interactions are in predicting genetic disease genes.

Methods: A total of 72,940 protein-protein interactions between 10894 human proteins were used to search 432 loci for candidate disease genes representing 383 genetically heterogeneous hereditary diseases. For each disease, the protein interaction partners of its known causative genes were compared with the disease-associated loci lacking identified causative genes. Interaction partners located within such loci were considered candidate disease gene predictions. Prediction accuracy was tested using a benchmark set of known disease genes.

Results: Almost 300 candidate disease gene predictions were made. Some of these have since been confirmed. On average, 10% or more are expected to be genuine disease genes representing a 10-fold enrichment compared to positional information only. Examples of interesting candidates are AKAP6 for arrythmogenic right ventricular dysplasia 3 and SYN3 for familial partial epilepsy with variable foci.

Conclusions: Exploiting protein-protein interactions can greatly increase the likelihood of finding positional candidate disease genes. When applied on a large scale they can lead to novel candidate gene predictions.

Introduction Many human genetic diseases can be caused by multiple genes. Since they lead to the same or similar disease phenotypes, the underlying genes are likely to be functionally related. Such functional relatedness can be exploited to aid in the finding of novel disease genes[1]. Direct protein-protein interactions are one of the strongest manifestations of a functional relation between genes, so interacting proteins may lead to the same disease phenotype when mutated. Indeed, several genetically heterogeneous hereditary diseases are known to be caused by mutations in different interacting proteins, such as Hermansky-Pudlak syndrome and Fanconi anemia[2,3]. Also, a recent study showed that interacting proteins tend to lead to similar disease phenotypes when mutated[4]. Therefore protein-protein interactions might in principle be used to identify potentially interesting disease gene candidates.

A large number of human protein-protein interactions have been described in the literature[5]. These literature-based interactions are reliable, but are naturally biased toward better studied

42

Chapter 3: Predicting disease genes using protein-protein interactions proteins and have already been exploited by the community for disease gene prediction. Protein-protein interactions from high throughput experiments do not have this bias, though they are also generally less reliable than literature-based interactions[6]. These high throughput sets are especially interesting for novel disease gene prediction since they can contain previously undescribed protein-protein interactions. There are two human high throughput protein-protein interaction sets available[7,8], but more are available from other species. These first have to be mapped to interactions between human proteins before they can be applied to disease gene prediction.

We investigated how successful protein-protein interactions are in predicting candidate disease genes for genetically heterogeneous hereditary diseases using a systematic large scale bioinformatics approach. To be as comprehensive and unbiased as possible we used both literature-based and high throughput human protein-protein interactions, and human-mapped high throughput interactions from three other species – Drosophila melanogaster (fruit fly), Caenorhabditis elegans (nematode) and Saccharomyces cerevisiae (baker’s yeast). To identify potential new candidate disease genes, we examined whether disease proteins had interaction partners which were located within other loci associated with that same disease; such interaction partners were considered to be candidate disease genes. Several of these predictions have since been confirmed.

Methods

Genetic disease data

Disease data were obtained from the Online Mendelian Inheritance in Man (OMIM, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM) ”Morbid Map” . This list contains disease loci and known disease genes from the OMIM database[9]. We selected genetically heterogeneous hereditary diseases with at least one known disease-causing gene and at least one disease locus lacking an identified causative gene. Disease subtypes were pooled into single diseases. A disease can have several loci, some of which may overlap each other. For instance, there are several X-linked mental retardation subtypes, many of which have been mapped to overlapping loci. Therefore, a single protein-protein interaction could result in multiple candidate gene predictions for different disease subtypes.

The loci vary in length and gene count, with a median of 88 genes per locus, and a mean of 123.6, the loci lengths are not normally distributed. Whole chromosome loci were excluded from the analyses. In total, there were 383 diseases in the data set. Together these diseases have 1195 disease loci with identified disease genes and 432 disease loci lacking identified causative genes.

The data set used in the benchmark tests contained all the genetically heterogeneous hereditary diseases from Morbid Map with at least two known disease genes (289 diseases, 1114 disease loci with known disease genes, 1003 distinct genes). Only loci with known disease genes were used in the benchmark tests.

43

Chapter 3: Predicting disease genes using protein-protein interactions

Protein-protein interaction sets

Five protein-protein interaction data sets were used in this study, from four different species – human, Drosophila melanogaster (fruit fly), Caenorhabditis elegans (nematode worm) and Saccharomyces cerevisiae (baker’s yeast) (Table 1). One of the human interaction sets contained manually curated protein-protein interactions culled from the literature, while all the other protein interaction data were from high throughput protein-protein interaction experiments.

Table 1: Protein-protein interaction sets used in the study.

Number of Number of Interaction Source of proteins in human- human- References Comments set interaction data * mapped mapped interactions † interactions ‡

Downloaded october HPRD set Literature 6005 19728 [5] 21 2005.

High throughput Human Y2H Interactions from both experiments 2686 5211 [7,8] set experiments pooled. (Y2H)

All interactions were High throughput Fly set 4706 16313 [10] used, regardless of experiment (Y2H) confidence level.

High throughput Downloaded from DIP Worm set 1933 5771 [11] experiment (Y2H) database. 11

High throughput Interactions from all Y2H:[13,14] Yeast set experiments 2455 27098 four experiments PCP:[15,16] (Y2H, PCP) pooled.

Combined high All high 8162 54048 See above Excluding HPRD throughput throughput sets set

Total combined All interaction sets 10894 72940 See above Including HPRD set

* Y2H: Yeast two-hybrid protein-protein interaction assay; PCP: Protein complex purification experiment (based on mass spectrometry) † These are the proteins that could be automatically mapped to human Ensembl gene IDs ‡ These are the interactions from the original interaction sets that could be automatically mapped to human Ensembl gene IDs

The non-human protein-protein interactions were mapped to human proteins using orthology relationships. Orthology between the other species’ and human proteins was determined using the Inparanoid program[17], with default settings, on the whole (protein) genomes of the

44

Chapter 3: Predicting disease genes using protein-protein interactions species. Genomes were acquired from Ensembl[18] for the metazoan species and from the Saccharomyces Genome Database[19] for yeast. Where one non-human protein was orthologous to several human proteins, the interaction was assumed to be valid for all these proteins.

Candidate gene prediction

For all the diseases in the heterogenic diseases data set, the following actions were performed:

1. The protein interaction partners were determined for each known disease protein. 2. The chromosomal locations of the genes coding for these interacting proteins were determined using gene location data from the Ensembl database. The cytogentic loci were mapped to chromosomal base pair ranges using the Ensembl database, which specifies cytogenetic band boundaries as exact base pair positions, rounded off to the nearest 100kb. 3. These chromosomal locations were checked to see if they fell within one or more disease loci (of the same disease) that lacked a known disease gene. 4. Each interacting protein gene that was located within one of these loci was considered a candidate gene prediction. 5. If a candidate gene lay within multiple (overlapping) loci of the same disease, each of them was counted as a separate prediction. This procedure was carried out using a custom written C++ program (available on request), as were the benchmark and randomization tests described below.

Benchmark tests

The benchmark test is introduced to examine how well a protein-protein interaction set performs in recovering known disease genes from different loci known to be involved in the same disease. These tests were therefore carried out analogously to the candidate gene prediction tests, with the exception that the protein interactor positions were examined against the disease loci with known disease genes (as opposed to the loci with unidentified disease genes). As with the prediction tests, these genes and their associated disease loci were taken from OMIM Morbid Map.

If an interactor lay within a disease locus it was considered a candidate gene prediction (a positive). If this interactor was indeed the known disease gene in that locus, it was considered a correct prediction (true positive). If it was not the known disease gene for that locus, it was considered a wrong prediction (false positive).

Randomization tests

Due to the complex nature of the data – potentially overlapping loci with different gene counts and networked protein-protein interactions – protein interactor randomization tests were used to estimate the significance of the candidate gene prediction and benchmark results. In each case the genome was randomly shuffled and each protein in the interaction set was replaced with its

45

Chapter 3: Predicting disease genes using protein-protein interactions counterpart from the shuffled genome. This approach retains the original structure of the interaction network, it only randomizes the protein identities.

1000 randomization tests were carried out for each of the ten analyses: five protein-protein interaction sets, each of which was used for both novel disease gene prediction and for benchmarking. In addition, separate randomization tests were carried out for the two combined data sets, the combined high throughput set and the total combined set.

Two other types of randomization tests were carried out for each data set, namely the randomization of the gene positions on the genome and the shuffling of the protein interactors in the interactor sets. These lead to similar results as the interactor identity randomization (data not shown) and were left out of the results for brevity.

Results

Benchmark tests perform well above random expectation

In order to examine how well protein-protein interaction sets predict disease genes in other loci of the same disease, we first carried out benchmark tests that attempted to predict known disease genes. These tests were carried out using only diseases with multiple known causative genes and their disease-associated loci from OMIM Morbid Map, allowing the candidate gene predictions to be evaluated for accuracy and overrepresentation. With regard to the number of disease gene interaction partners that are located in another locus of the same disease, the HPRD interaction stands out as having over twice as many as would be expected by chance (Figure 1). The high throughput sets all score higher than the vast majority of their randomized controls, though the magnitude of this differs from the fly set scoring higher than all but two controls to the worm set scoring higher than 823 of the 1000 randomized controls. We thus show here an overrepresentation of disease gene interaction partners in other disease- associated loci, suggesting that disease genes encode proteins that tend to interact with each other.

The tendency for proteins associated with the same disease to interact with each other can be tested more directly by examining what percentage of these correctly localized interaction partners are indeed the known disease causing genes in those loci (Figure 2). Once again, the HPRD protein-protein interactions performed very well. Almost 60% of these interacting proteins corresponded to the known disease-causing genes in these loci. The high throughput protein interaction sets had a lower performance (9%-17%, with an average of 12%), but they all did much better than their randomized controls. Except for the two smallest sets, all interaction sets substantially outperformed every single run of their corresponding 1000 randomized controls. This implies that disease protein interaction partners, when located within other loci of the same disease, are at least 10-fold more likely to be involved in that disease than the other genes in these loci, given that randomly chosen locus genes have on average a 1% (1/88) chance of being correct.

46

Chapter 3: Predicting disease genes using protein-protein interactions

Figure 1 . Overrepresentation of physically interaction proteins from loci associated with the same disease. The different protein-protein interaction sets used are on the X-axis, while the Y-axis contains the number of disease protein interactors falling within another locus associated with the same disease (hits) in the benchmark locus set. Black stars and associated shaded boxes indicate the values based on the real interaction data sets, while the box plots indicate the values resulting from the 1000 randomized interactor controls per set (numbers in clear boxes are medians). The value for the combined interaction data set (indicated by the black arrow) is not included in the plot to keep the Y-axis scale manageable. The HPRD and total combined sets score much higher than all their randomized controls, the difference is smaller for the high throughput sets. HT = High Throughput.

An interesting result is that the yeast interaction set is more accurate in predicting candidate genes than the other high throughput sets, including those using “native” human proteins. Most of the yeast interactions are based on protein complex purification rather than yeast two-hybrid assays. A likely explanation for the relatively good performances of the yeast interactions is the previously observed higher quality of protein-protein interaction data from protein complex purification experiments relative to yeast two-hybrid assays[6].

47

Chapter 3: Predicting disease genes using protein-protein interactions

Figure 2 . Relatively high likelihood of.finding a disease gene from protein interaction data in a given locus. The different protein-protein interaction sets used are on the X-axis, while the Y-axis contains the percentage of disease protein interactors falling within other disease loci that correspond to the known disease genes in those loci. Black stars (and associated shaded boxes) indicate the values based on the real interaction data sets, while the box plots indicate the values resulting from the 1000 randomized interactor controls per set. Randomized controls have median accuracies of less than 1%. All interaction sets substantially outperform all their controls, with the HPRD scoring exceptionally high. HT = High Throughput.

Candidate disease gene prediction results follow similar patterns

Having established the validity of the approach for known disease genes we compared the likelihood of mapping an interactor to a disease locus to chance expectation for disease loci for which we do not know the disease gene. As shown in Figure 3, the three largest data sets and

48

Chapter 3: Predicting disease genes using protein-protein interactions the combined data all showed an increase relative to the medians of the randomization experiments. This result is consistent with that of the benchmark experiments for the high throughput data. It indicates an appreciable enrichment for true disease genes in the prediction results. Nevertheless the majority of the candidates are still probably false positives.

Figure 3 . Candidate disease gene prediction hit counts. The different protein-protein interaction sets used are on the X-axis, while the Y-axis contains the number of disease protein interactors falling within another locus associated with the same disease (hits) in the candidate gene prediction locus set. Black stars (and associated shaded boxes) indicate the values based on the real interaction data sets, while the box plots indicate the values resulting from the 1000 randomized interactor controls per set (numbers in clear boxes are medians). The fly and yeast sets score higher than the majority of their randomized controls, but the two smallest sets do not perform above random expectation. The HPRD scores relatively lower than the fly and worm sets, but still above the majority of its controls. HT = High Throughput.

In contrast to the high throughput results from fly and yeast, the HPRD results do show a large discrepancy between the benchmark results and the novel gene prediction results. In the latter

49

Chapter 3: Predicting disease genes using protein-protein interactions the enrichment of interactors in disease loci is much smaller than in the former. This suggests that the majority of disease genes which could be found using HPRD protein-protein interactions have already been detected by the community, which is unsurprising given that all these interactions have previously been described in the literature. Some of the interactions in the HPRD are even based on research on disease genes in the first place. (e.g. [20-23]) Therefore, the HPRD benchmark results are not representative for its novel gene prediction results.

The two smallest high throughput interaction sets, Human and Worm, show no signal for the prediction of novel disease genes. It should be noted that the candidate gene prediction disease data set contains fewer target loci (432) than the benchmark data set (1114). One explanation for this might be that the combination of a small number of interactions and a small number of target loci prevents the two smallest sets from showing any signal here.

The full list of predicted candidate disease genes and their corresponding interactions are given in Supplementary Table 1.

Confirmed and refuted predictions

A few candidate disease gene predictions could be confirmed or refuted due to the fact that the disease-causing genes are known but absent from the list of known disease genes used in the study (Table 2). These disease loci were treated as having unidentified causative genes during the study, but manual inspection of the results led to their identification as disease loci with known causative genes.

Promising leads

In addition to these confirmed prediction the protein interaction sets also led to a number of plausible but unconfirmed candidate gene predictions. For instance, the AKAP6 (A-kinase (PRKA) anchor protein 6) gene is predicted as a candidate gene for arrythmogenic right ventricular dysplasia 3 (OMIM 602086) based on an HPRD interaction with RYR2 (ryanodine receptor 2). A-kinase anchor proteins are involved in cardiac myocyte contractility and may be involved in heart failure.[27,28] SYN3 (synapsin III) was predicted as a candidate gene for familial partial epilepsy with variable foci (OMIM 604364) based on an HPRD interaction with SYN1 (synapsin I), which is causative for epilepsy, X-linked, with variable learning disabilities and behavior disorders (OMIM 300491). There are several other interesting examples, which can be viewed in Supplementary Table 1.

50

Chapter 3: Predicting disease genes using protein-protein interactions

Table 2: Confirmed and refuted candidate disease gene predictions. A prediction is considered confirmed if it is known in the literature to be causative for the relevant disease, and considered refuted if a different gene in the same locus is known to be causative for that disease. It is important to note that a “refuted” candidate gene may not have been screened and excluded in the literature, and may thus still be a valid candidate.

Based on Original disease Disease subtype Candidate genes Status Comments interaction with * subtype

Confirmed in literature[24], but not Branchiootic SIX1 (sine oculis EYA1 (eyes Branchiootic syndrome in Morbid Map version syndrome 3 homeobox homolog 1) absent homolog 1) Confirmed 1 [602588] used. Interaction [608389] [601205] † [601653] occurs in fly, worm and HPRD sets.

LCK (lymphocyte- specific protein- SCID due to LCK tyrosine kinase) deficiency [153390] Relevant Ensembl [153390] gene ID erroneously SCID, autosomal mapped to INSL3 PTPRC (protein- recessive, T- symbol instead of tyrosine negative/B- JAK3 (Janus kinase 3) SCID due to LCK Confirmed JAK3, thus JAK3 not phosphatase, positive type deficiency [151460] mapped to an receptor type, C) [600802] Ensembl gene ID. All [151460] interactions from IL2RG (interleukin HPRD. SCID, X-linked 2 receptor, [300400] gamma) [308380]

Nephronophthisis Nephronophthisis, NPHP4 gene symbol 4 [606966] juvenile [256100] not mapped to NPHP1 corresponding NPHP4 (nephrocystin (nephrocystin 1) Confirmed Ensembl gene ID in 4) [607215] Senior-Loken [607100] Ensembl database Senior-Loken syndrome 4 version used. HPRD syndrome 1 [266900] [606996] interaction.

HSPB8 identified as HSPB1 (heat Charcot-Marie- HSPB8 (heat shock Charcot-Marie-Tooth disease gene in shock 22kDa Tooth disease, 22kDa protein 8) disease, axonal, type Confirmed OMIM database, but protein 1) type 2L [608673] [608014] 2F [606595] not in Morbid Map. [602195] HPRD interaction.

Charcot-Marie-Tooth DNCL1 (dynein light DNM2 (dynamin 2) disease, dominant chain, LC8-type 1) [602378] intermediate B [601562] HSPB8 is causative [606482] (see above). HSPB8 RNF10 (ring finger GARS (glycyl- Charcot-Marie-Tooth identified as disease Charcot-Marie- protein 10) [not in tRNA synthetase) disease, axonal, type gene in OMIM Tooth disease, Refuted OMIM] [600287] 2D [601472] database, but not in type 2L [608673] Morbid Map. Two MAPKAPK5 (mitogen HSPB1 (heat HPRD interactions, activated protein Charcot-Marie-Tooth shock 22kDa one yeast. kinase-activated disease, axonal, type protein 1) protein kinase 5) 2F [606595] [602195] [606723]

51

Chapter 3: Predicting disease genes using protein-protein interactions

Table 2: Continued…

Based on Original disease Disease subtype Candidate genes Status Comments interaction with * subtype

Disease caused by chromosomal deletion Polycystic kidney which affects two disease, infantile Polycystic kidney PKD1 (polycystin 1) PKD2 (polycystin genes, PKD1 and severe, with disease, adult, type II Confirmed [601313] 2) [173910] TSC2[25]. Mentioned tuberous sclerosis [173910] in OMIM, but not in [600273] Morbid Map. HPRD interaction.

Ensembl ID maps to KRT6E, KRT6D, KRT6C and KRT6A. OMIM uses KRT6A Pachyonychia name, whereas congenita, KRT17 Pachyonychia KRT6A (keratin 6A) Ensembl uses KRT6E Jadassohn- congenita, Jackson- Confirmed [148041] (keratin 17) as primary name, thus Lewandowsky Lawler type [167210] [148069] mapping to type [167200] corresponding Ensembl gene ID failed. From human high throughput set.

Causative gene is TGFBR2 (TGF-beta receptor II) [190182]. Marfan-like FBLN2 was FBLN2 (fibulin 2) FBN1 (fibrillin 1) Marfan syndrome connective tissue Refuted suspected but refuted [135821] [134797] [154700] disorder [154705] in literature[26]. Mentioned in OMIM, but not in Morbid Map. HPRD interaction.

PRPF3 (Pre- MRNA processing Retinitis pigmentosa factor 3 homolog) 18 [601414] [607301]

PRPF8 (Pre- ENSG00000163510 MRNA processing Retinitis pigmentosa Causative gene is [not in OMIM] factor 8 homolog) 13 [600059] CRKL (ceramide [607300] kinase-like) [608381]. Retinitis Gene name not PRPF31 (Pre- pigmentosa 26 Refuted mapped to Ensembl MRNA processing Retinitis pigmentosa [608380] gene ID in Ensembl. factor 31 homolog) 11 [600138] Three interactions [606419] from yeast set, one from fly set. HECW2 (HECT, C2 PRPF3 (Pre- and WW domain MRNA processing Retinitis pigmentosa containing E3 ubiquitin factor 3 homolog) 18 [601414] protein ligase 2) [not in [607301] OMIM]

52

Chapter 3: Predicting disease genes using protein-protein interactions

Table 2: Continued…

Based on Original disease Disease subtype Candidate genes Status Comments interaction with * subtype

PRPF3 (Pre- MRNA processing Retinitis pigmentosa factor 3 homolog) 18 [601414] Causative gene is [607301] IMPDH1 (IMP dehydrogenase 1) LUC7L2 (LUC7-like 2) [146690]. Morbid Map [not in OMIM] Retinitis version used contains pigmentosa 10 PRPF31 (Pre- Refuted two entries for this [180105] MRNA processing Retinitis pigmentosa subtype, one with and factor 31 homolog) 11 [600138] one without [606419] associated gene. Two from yeast set, one from fly set. METTL2 CRX (cone-rod Retinitis pigmentosa, (methyltransferase like homeobox) late onset dominant 2A) [607846] [602225] [268000]

HSPD1 (heat Known gene is SF3B2 (splicing factor shock 60kDa Spastic paraplegia 13 BSCL2 (seipin) 3b, subunit 2) protein 1) [605280] [606158]. Mentioned Spastic [605591] [118190] in OMIM, but not paraplegia 17 Refuted listed in Morbid Map. [270685] KIF5A (kinesin KLC2 (kinesin light Spastic paraplegia 10 Both from HPRD set, family member 5A) chain 2) [601334] [604187] KLC2 also from fly [602821] and worm sets.

EIF4G1 (eukaryotic translation initiation factor 4-gamma 1) [600495] Causative gene is EIF4A2 (eukaryotic TERC (telomerase translation initiation RNA component) Dyskeratosis factor 4A, isoform 2) [602322]. Gene congenita, symbol not mapped to [601102] DKC1 (dyskerin 1) Dyskeratosis autosomal Refuted Ensembl gene ID in [300126] congenital 1 [305000] dominant Ensembl database [127550] KPNA1 (karyopherin version used. Three alpha 1) [600686] from fly set (EIF4G1, EIF4A2, KPNA1) one from yeast (CPA3). CPA3 (carboxypeptidase A3, mast cell) [114851]

* These are the known disease genes, associated with other disease subtypes, which have physical protein-protein interactions with the candidate disease genes. † Square brackets contain OMIM numbers for both diseases and genes.

53

Chapter 3: Predicting disease genes using protein-protein interactions

The HPRD interaction set is biased toward disease proteins

As it is based on literature-described interactions, the HPRD interaction set is expected to be biased toward known disease genes. These proteins would be better studied, and known interactions with candidate disease proteins would have been investigated by the community already. This bias can already be seen in the HPRD benchmark results, but it can be quantified more directly by the proportion of the interacting proteins that are known disease proteins. Here once again the bias is clear – the proportion of interacting proteins that occur in the heterogeneous disease protein set is twice as high for the HPRD set as for the high throughput sets, and is well over twice the proportion of Ensembl proteins in the disease protein set (Table 3).

It is encouraging to see that such a large proportion of these predictions were confirmed, though these are generally from the HPRD interaction set (of which 10 were confirmed and 5 refuted). From the high throughput sets there were 2 confirmed and 13 refuted predictions, which is consistent with their benchmark results (Figure 2). This list may not be exhaustive due to the complexities of thoroughly investigating all these predictions by hand, but we do not expect these proportions to change substantially. In any case the benchmark results remain valid, since the known-gene disease loci are not susceptible to this misannotation problem.

Interestingly, even the high throughput sets are enriched for disease genes. This suggests that genetically heterogeneous disease proteins are more likely to have protein-protein interactions, or that they have more easily detectable interactions.

Protein-protein interactions add as much information as localization

Hereditary diseases do not always have genetic loci associated with them. It is therefore interesting to see how much protein-protein interactions can at all predict the candidate disease gene pool in the absence of any genetic localization data. When disregarding localization information entirely in the benchmark disease gene set, the combined high throughput protein- protein interaction set still has a prediction accuracy of 0.7%. This is two orders of magnitude higher than the chance of randomly picking the disease protein from the entire genome and is of the same order of enrichment as genetic localization, which generally reduces the candidate gene pool from ~20000 to ~100. Combining these two information sources enriches the candidate gene pool a further order of magnitude to a 1-in-10 chance (12%) of picking the right disease gene (Figure 2), which corresponds to a 1000-fold enrichment relative to the entire genome.

Needless to say, the HPRD interaction set performs much better than the combined high throughput set, resulting in a prediction accuracy of 6.6% when localization data are disregarded. This corresponds to a 1000-fold enrichment even before localization is taken into account. Once again, combining this with localization information leads to a further 10-fold enrichment resulting in the 58% accuracy found in this study.

54

Chapter 3: Predicting disease genes using protein-protein interactions

Table 3: Overrepresentation of heterogeneous disease genes in HPRD protein interaction set (Chi- square test). The disease gene enrichment in HPRD is highly significantly higher than in the high throughput sets (P < 1e-13 after Bonferroni correction for every case).

Number of Subset also in Subset also in proteins in Chi- disease protein disease protein P-value interaction square set set (percentage) set HPRD set 6005 678 11.29% 550.2098 < 2.2e-16 (literature based) Human Y2H set 2686 146 5.44% 4.845 0.03 (high throughput) Fly set (high 4706 276 5.86% 18.109 2.1e-5 throughput) Worm set (high 1933 101 5.23% 2.086 0.15 throughput) Yeast set (high 2455 141 5.74% 7.838 0.005 throughput) Reference set – all human protein-coding genes in Ensembl

Total In disease set Percentage Ensembl known 22242 1003 4.51% genes

Discussion The hypothesis being investigated here is that interacting proteins would often lead to similar disease phenotypes when mutated, enabling the usage of protein-protein interactions to suggest candidate disease genes. Our results do suggest that this is indeed the case. Given the average locus size of close to a hundred genes and high throughput interaction benchmark accuracies of 9-17%, positional candidate genes that interact with known disease genes have a more than 10-fold higher likelihood of being disease-causing genes than random locus genes.

These results are a minimal estimate

There are several practical limitations to the degree to which protein-protein interactions can predict disease gene candidates. To begin with, high throughput protein-protein interaction sets – especially yeast two-hybrid sets – are inherently noisy and contain a lot of interactions with no biological relevance[10-14]. Therefore we might be predicting a disease gene based on an interaction that does not occur in vivo, but which did erroneously appear in a yeast two-hybrid assay. Indeed, only 5.8% of the human, fly and worm Y2H interactions were confirmed by the HPRD, even amongst proteins common to both sets. However, given the Y2H set prediction

55

Chapter 3: Predicting disease genes using protein-protein interactions accuracies of over 10% and the fact that the HPRD is not exhaustive, the proportion of Y2H interactions that are genuine is probably substantially higher than this figure suggests. Nevertheless, these high noise levels could reduce the accuracy of the Y2H based predictions relative to other techniques, as evidenced by the higher performance of the mainly protein complex purification-based yeast interaction set.

Another practical limitation is the mapping of the high throughput interactions from other species to human proteins. In this study, when a protein in the other species had multiple human orthologues, the interaction was transferred to all of them. However, this need not be the case in reality. Encouragingly, we have previously shown that interactions between proteins are quite conserved across species and that conserved interactions tend to involve functionally related proteins[29]. Also, the yeast set outperforms the other sets despite its evolutionary distance to humans – though this may be due to the fact that most yeast interactions were from more reliable protein complex purification experiments rather than yeast two-hybrid assays.

Apart from the protein interactions, the designation of the candidate disease loci can also be a source of noise. Some of the candidate disease loci were designated based on incorrect reasoning, or faulty linkage assignment. For instance, we have recently shown that a family with EEC syndrome linked to chromosome 19 (EEC2, OMIM 602077)[30] actually has a mutation in the P63 gene denoted EEC3 (OMIM 604292) which is localized on human chromosome 3q27[31]. However, the EEC2 locus remains in OMIM as a separate EEC locus with unidentified causative gene.

Furthermore, the use of cytogenetic bands to designate disease loci in OMIM Morbid Map can lead to problems in locating the genes in the Ensembl database.

Though they do not have sharp boundaries in reality, the Ensembl database uses specific base pair positions (rounded off to the nearest 100 kb) as band boundaries. Thus, genes lying in the vicinity of a band boundary could easily be assigned to separate bands in the literature and the Ensembl database. Indeed over 20% of the known disease genes in OMIM Morbid Map are associated with loci that differ from their Ensembl annotation. Most of these genes lie between 1 Mb and 10 Mb of their Morbid Map-annotated loci on the same chromosome. The use of markers instead of cytogenetic bands could improve this, however OMIM Morbid Map does not include marker information.

Finally, phenotypically similar diseases can be functionally related, even though they are classified as different diseases. Since this study uses preexisting disease classifications rather than systematic phenotypic similarity analysis, potential links between disease genes causing similar but differently classified disease phenotypes would be overlooked. This would reduce the number of predictions made, without affecting the accuracy of those predictions that have been made.

All these practical limitations reduce the accuracy of the predictions, meaning that the true degree to which proteins involved in the same genetic disease interact is likely to be much higher. With higher quality protein interaction sets and more precise locus demarcation and

56

Chapter 3: Predicting disease genes using protein-protein interactions more systematic disease phenotype descriptions the value of this approach to disease gene prediction should increase even further.

Apart from the practical limitations, there are fundamental limits to the prediction capacity of protein-protein interactions. Two interacting proteins need not lead to similar disease phenotypes when mutated – for instance they may have different but overlapping functions or one may be more dispensable than the other. Also, disease proteins may lie at different points in a molecular pathway and need not interact with each other directly. Disease mutations need not even involve proteins, as is the case with TERC (telomerase RNA component) in autosomal dominant dyskeratosis congenita (see Table 2). Protein-protein interactions will thus not be capable of detecting every novel disease protein. Despite these fundamental limitations, the high proportion of disease proteins among correctly localized HPRD interaction partners is promising, although this interaction set is biased. And despite their practical limitations, even the high throughput data sets have prediction accuracies of up to 17%. Therefore, in the absence of practical limitations, these fundamental limitations should result in a prediction accuracy that lies between these two values.

Outlook

This study provides evidence that the systematic use of protein-protein interaction data may lead to an approximately 10-fold improvement of positional candidate gene prediction. At the same time, the quality and quantity of the data available can be much improved. Though almost 73,000 interactions between almost 11,000 proteins were used in this study, the actual number of interactions between these proteins should be much higher since all interaction assaying techniques miss large numbers of interactions[6,7]. In addition, a more systematic phenotypic classification of diseases, such as our recently developed text-mining approach[32], may lead to more interactions between related disease genes being identified. With increasing quantity and quality of interaction and phenotypic data and more dense interaction networks, the performance and utility of this approach to disease gene prediction should improve even further.

Acknowledgements We thank Bas Dutilh for doing the orthology determination, and Gert Vriend, Marc van Driel, René van der Heijden, Vera van Noort and Toni Gabaldon for discussions and suggestions. This work is part of the BioRange program of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI).

References

1. Brunner HG, van Driel MA. From syndrome families to functional genomics. Nat Rev Genet 2004; 5(7):545-51.

57

Chapter 3: Predicting disease genes using protein-protein interactions

2. Di Pietro SM, Dell'Angelica EC. The cell biology of Hermansky-Pudlak syndrome: recent advances. Traffic 2005; 6(7):525-33. 3. Mace G, Bogliolo M, Guervilly JH, Dugas du Villard JA, Rosselli F. 3R coordination by Fanconi anemia proteins. Biochimie 2005; 87 (7):647-58. 4. Gandhi TK, Zhong J, Mathivanan S, Karthick L, Chandrika KN, Mohan SS, Sharma S, Pinkert S, Nagaraju S, Periaswamy B, Mishra G, Nandakumar K, Shen B, Deshpande N, Nayak R, Sarker M, Boeke JD, Parmigiani G, Schultz J, Bader JS, Pandey A. Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat Genet 2006; 38 (3):285-93. 5. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M, Ibarrola N, Deshpande N, Shanker K, Shivashankar HN, Rashmi BP, Ramya MA, Zhao Z, Chandrika KN, Padma N, Harsha HC, Yatish AJ, Kavitha MP, Menezes M, Choudhury DR, Suresh S, Ghosh N, Saravana R, Chandran S, Krishna S, Joy M, Anand SK, Madavan V, Joseph A, Wong GW, Schiemann WP, Constantinescu SN, Huang L, Khosravi-Far R, Steen H, Tewari M, Ghaffari S, Blobe GC, Dang CV, Garcia JG, Pevsner J, Jensen ON, Roepstorff P, Deshpande KS, Chinnaiyan AM, Hamosh A, Chakravarti A, Pandey A. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 2003; 13 (10):2363-71. 6. Von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002; 417 (6887):399- 403. 7. Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, Klitgord N, Simon C, Boxem M, Milstein S, Rosenberg J, Goldberg DS, Zhang LV, Wong SL, Franklin G, Li S, Albala JS, Lim J, Fraughton C, Llamosas E, Cevik S, Bex C, Lamesch P, Sikorski RS, Vandenhaute J, Zoghbi HY, Smolyar A, Bosak S, Sequerra R, Doucette-Stamm L, Cusick ME, Hill DE, Roth FP, Vidal M. Towards a proteome- scale map of the human protein-protein interaction network. Nature 2005;437(7062):1173-8. 8. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, Timm J, Mintzlaff S, Abraham C, Bock N, Kietzmann S, Goedde A, Toksoz E, Droege A, Krobitsch S, Korn B, Birchmeier W, Lehrach H, Wanker EE. A human protein-protein interaction network: a resource for annotating the proteome. Cell 2005; 122 (6):957- 68. 9. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005; 33 (Database issue):D514-7. 10. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, DaSilva A, Zhong J, Stanyon CA, Finley RLJr, White KP, Braverman M, Jarvie T, Gold S, Leach M, Knight J, Shimkets RA, McKenna MP, Chant J, Rothberg JM. A protein interaction map of Drosophila melanogaster. Science 2003; 302 (5651):1727-36. 11. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, Hirozane-Kishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, Van Den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M. A map of the interactome network of the metazoan C. elegans. Science 2004; 303 (5657):540-3.

58

Chapter 3: Predicting disease genes using protein-protein interactions

12. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004; 32 (Database issue):D449-51. 13. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 2001; 98 (8):4569-74. 14. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000; 403 (6770):623-7. 15. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002; 415 (6868):141-7. 16. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sorensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, Tyers M. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002; 415 (6868):180-3. 17. Remm M, Storm CE, Sonnhammer EL. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 2001; 314 (5):1041-52. 18. Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, Clamp M. The Ensembl genome database project. Nucleic Acids Res 2002; 30 (1):38-41. 19. Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, Weng S, Botstein D. SGD: Saccharomyces Genome Database. Nucleic Acids Res 1998; 26 (1):73-9. 20. Huber PA, Medhurst AL, Youssoufian H, Mathew CG. Investigation of Fanconi anemia protein interactions by yeast two-hybrid analysis. Biochem Biophys Res Commun 2000; 268 (1):73-7. 21. Schenck A, Bardoni B, Moro A, Bagni C, Mandel JL. A highly conserved protein family interacting with the fragile X mental retardation protein (FMRP) and displaying selective interactions with FMRP-related proteins FXR1P and FXR2P. Proc Natl Acad Sci U S A 2001; 98 (15):8844-9. 22. Singaraja RR, Hadano S, Metzler M, Givan S, Wellington CL, Warby S, Yanai A, Gutekunst CA, Leavitt BR, Yi H, Fichter K, Gan L, McCutcheon K, Chopra V, Michel J, Hersch SM, Ikeda JE, Hayden MR. HIP14, a novel ankyrin domain-containing protein, links huntingtin to intracellular trafficking and endocytosis. Hum Mol Genet 2002; 11 (23):2815-28. 23. Tsuchiya H, Iseda T, Hino O. Identification of a novel protein (VBP-1) binding to the von Hippel- Lindau (VHL) tumor suppressor gene product. Cancer Res 1996; 56 (13):2881-5. 24. Ruf RG, Xu PX, Silvius D, Otto EA, Beekmann F, Muerb UT, Kumar S, Neuhaus TJ, Kemper MJ, Raymond RMJr, Brophy PD, Berkman J, Gattas M, Hyland V, Ruf EM, Schwartz C, Chang EH, Smith RJ, Stratakis CA, Weil D, Petit C, Hildebrandt F. SIX1 mutations cause branchio-oto-renal syndrome by disruption of EYA1-SIX1-DNA complexes. Proc Natl Acad Sci U S A 2004; 101 (21):8090-5.

59

Chapter 3: Predicting disease genes using protein-protein interactions

25. Brook-Carter PT, Peral B, Ward CJ, Thompson P, Hughes J, Maheshwar MM, Nellist M, Gamble V, Harris PC, Sampson JR. Deletion of the TSC2 and PKD1 genes associated with severe infantile polycystic kidney disease--a contiguous gene syndrome. Nat Genet 1994; 8(4):328-32. 26. Collod G, Chu ML, Sasaki T, Coulon M, Timpl R, Renkart L, Weissenbach J, Jondeau G, Bourdarias JP, Junien C, Boileau C. Fibulin-2: genetic mapping and exclusion as a candidate gene in Marfan syndrome type 2. Eur J Hum Genet 1996; 4(5):292-5. 27. Fink MA, Zakhary DR, Mackey JA, Desnoyer RW, Apperson-Hansen C, Damron DS, Bond M. AKAP-mediated targeting of protein kinase a regulates contractility in cardiac myocytes. Circ Res 2001; 88 (3):291-7. 28. Zakhary DR, Moravec CS, Bond M. Regulation of PKA binding to AKAPs in the heart: alterations in human heart failure. Circulation 2000; 101 (12):1459-64. 29. Huynen MA, Snel B, van Noort V. Comparative genomics for reliable protein-function prediction from genomic data. Trends Genet 2004; 20 (8):340-4. 30. O'Quinn JR, Hennekam RC, Jorde LB, Bamshad M. Syndromic ectrodactyly with severe limb, ectodermal, urogenital, and palatal defects maps to chromosome 19. Am J Hum Genet 1998; 62 (1):130-5. 31. Celli J, Duijf P, Hamel BC, Bamshad M, Kramer B, Smits AP, Newbury-Ecob R, Hennekam RC, Van Buggenhout G, van Haeringen A, Woods CG, van Essen AJ, de Waal R, Vriend G, Haber DA, Yang A, McKeon F, Brunner HG, van Bokhoven H. Heterozygous germline mutations in the p53 homolog p63 are the cause of EEC syndrome. Cell 1999; 99 (2):143-53. 32. Van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA. A text-mining analysis of the human phenome. Eur J Hum Genet 2006; 14 (5):535-42.

60

Chapter 4: Conserved co-expression for candidate disease gene prioritization

Martin Oti, Jeroen van Reeuwijk, Martijn A. Huynen, Han G. Brunner

BMC Bioinformatics. 2008 Apr 23;9:208.

61

Chapter 4: Conserved co-expression for candidate disease gene prioritization

Abstract Background: Genes that are co-expressed tend to be involved in the same biological process. However, co-expression is not a very reliable predictor of functional links between genes. The evolutionary conservation of co-expression between species can be used to predict protein function more reliably than co-expression in a single species. Here we examine whether co- expression across multiple species is also a better prioritizer of disease genes than is co- expression between human genes alone.

Results: We use co-expression data from yeast (S. cerevisiae), nematode worm (C. elegans), fruit fly (D. melanogaster), mouse and human and find that the use of evolutionary conservation can indeed improve the predictive value of co-expression. The effect that genes causing the same disease have higher co-expression than do other genes from their associated disease loci, is significantly enhanced when co-expression data are combined across evolutionarily distant species. We also find that performance can vary significantly depending on the co- expression datasets used, and just using more data does not necessarily lead to better prioritization. Instead, we find that dataset quality is more important than quantity, and using a consistent microarray platform per species leads to better performance than using more inclusive datasets pooled from various platforms.

Conclusions: We find that evolutionarily conserved gene co-expression prioritizes disease candidate genes better than human gene co-expression alone, and provide the integrated data as a new resource for disease gene prioritization.

Background In the past few years several bioinformatic tools and approaches have been developed to assist medical genetic researchers in positional candidate disease gene identification (reviewed in [1]; see also [2-5]). Several tools use functional genomics to prioritize candidate genes located within disease-associated genomic loci by evaluating functional relationships between known disease genes and positional candidate genes[6-8]. These tools are based on the premise that genes which are involved in the same disease phenotype are likely to be functionally related[1,9,10]. This has indeed been shown to be the case as evidenced by the fact that these tools all perform better than random expectation in the prediction or prioritization of candidate disease genes. Nevertheless, not all types of functional genomic data perform equally well in terms of sensitivity and specificity[2,7,8]. Microarray expression data have wider coverage than other high-throughput genomic data such as protein-protein interactions, as genome-scale expression analyses are readily and routinely performed with them. Additionally, they are less biased toward better studied genes than gene function annotation or literature mining, although the latter approaches fare better at prioritizing disease candidate genes[2,7,8]. Therefore, given the large coverage of co-expression data and their complementarity to functional annotation and literature mining, it is of importance to maximize the disease gene predictive value of this type of data.

62

Chapter 4: Conserved co-expression for candidate disease gene prioritization

Several bioinformatic candidate disease gene prioritization tools already incorporate microarray- based co-expression data[2,6-8,11,12]. This approach is based on the assumption that if two genes are functionally related then their expression should vary concordantly across tissues and under different circumstances, and proposes that their expression profiles should therefore be correlated. For candidate disease gene prioritization, the use of co-expression analysis is preferable to the use of tissue-specific gene expression patterns, as it is a better predictor of functional relatedness between genes[13].

However, co-expression data can be applied more comprehensively than is currently implemented by these tools. One important and currently underexploited approach is to incorporate co-expression data from other species. One might expect that while human co- expression data are the most relevant for disease gene prioritization, evolutionary conservation of co-expression can be used to enhance the reliability of identified co-expression relationships. The premise is that co-expression relationships that are maintained across phylogenetically distant organisms must be under selective pressure, and should therefore be functional – a premise that has indeed been confirmed in several previous studies[14-17]. Though one tool already includes multi-species co-expression data[11], the improvement in disease gene ranking performance due to the exploitation of evolutionary conservation has not yet been investigated.

We therefore investigated the predictive value of conserved co-expression for candidate disease gene prioritization. To this end we analyzed how well co-expression between known and candidate disease genes could prioritize positional candidate disease genes. We restricted our analysis to known disease genes from genetic diseases containing at least two known causative genes. We constructed artificial loci of 100 candidate genes around the known disease-causing genes, and investigated the tendency of these causative genes to have higher co-expression with other known causative genes compared to the non-causative candidate genes from the same disease loci. Using co-expression data from five eukaryotic species – baker’s yeast (Saccharomyces cerevisiae), nematode worm (Caenorhabditis elegans), fruit fly (Drosophila melanogaster), mouse (Mus musculus) and human – we investigated the effect of evolutionary conservation on the ranking of the disease gene pairs, finding that evolutionary conservation of co-expression does indeed improve disease gene ranking. Therefore, exploiting evolutionary conservation could potentially improve the performance of co-expression data in existing disease candidate gene prioritization tools[2,6-8], which might in turn improve the prioritization of less well-studied genes.

Results

Evolutionary conservation of co-expression improves disease gene ranking performance

We investigated how well disease genes tend to rank relative to non-causative candidate disease genes when ranked according to co-expression with other genes known to cause the same disease. We combined co-expression scores across species using orthology relationships

63

Chapter 4: Conserved co-expression for candidate disease gene prioritization from the euKaryotic clusters of Orthologous Groups (KOG) database[18]. The co-expression scores are thus based on these KOGs rather than on individual genes (see methods section for further details). We used expression data from human, mouse, fruit fly ( D. melanogaster ), worm (C. elegans ) and baker’s yeast ( S. cerevisiae ) assembled from the Gene Expression Omnibus database[19] and the Genomics Institute of the Novartis Foundation[20] Gene Atlas expression data. For this study we used artificial disease loci containing 100 genes per locus. Testing with 50, 100 and 200 genes per locus does not make much difference though smaller loci tend to perform slightly better than larger loci (data not shown). Only disease gene pairs with co- expression scores unlikely to occur randomly in the corresponding dataset (i.e. more than 2 standard deviations from the dataset randomization mean) were included in the final rankings. This process implicitly weighs the scores according to the number of species involved, as the random score distributions are narrower for datasets combining more species. The standard deviations of these randomized distributions range from 0.051 for the five-species combined dataset to between 0.057 (yeast) and 0.094 (mouse) for the individual single-species datasets. Therefore, a given correlation score is more likely to be considered significant in a multi-species dataset than in a single-species dataset.

Genes whose co-expression is conserved through evolution have been shown to have stronger functional ties than genes whose co-expression is not[17]. Therefore, conserved co-expression between species should improve disease gene ranking over co-expression in one species alone. Such a trend is indeed apparent (Figure 1). Both for human and for multi-species KOG- based co-expression sets, disease gene pairs generally score in the upper half of the co- expression rankings for the KOG-mappable genes in the disease loci and are significantly better than random expectation (medians 0.64 and 0.69 for human-only and conserved co-expression sets respectively; p << 10 -16 for both sets, Wilcoxon signed rank test). Therefore, the causative disease gene in a candidate locus will generally rank higher than the other locus candidate genes. Furthermore, the multi-species combined co-expression set performs significantly better than the human-only co-expression set (medians 0.64 and 0.69 respectively, p < 10 -6, Wilcoxon rank sum test), indicating that the use of evolutionary conservation can significantly improve co- expression-based candidate disease gene ranking.

Given that co-expression-based disease gene ranking is only sensible if there is a non-random co-expression relationship between the disease genes, we further narrowed down the analysis to only those disease gene pairs that are likely to be genuinely co-expressed, having scores that are unlikely to occur randomly in their datasets. This substantially improves the rankings (Figure 2), at the expense of reduced coverage (695 versus 3286 disease gene rankings). Both the human-only and the conserved co-expression datasets now rank the disease genes very high (medians of 0.87 and 0.93 respectively), with the conserved co-expression still significantly outperforming the human-only co-expresssion (p < 10-11 , Wilcoxon rank sum test). A much higher proportion of disease genes is now ranked in the top 10% of the candidate gene lists, and the use of evolutionary conservation increases this proportion by almost half, from 31% to 44% of the ranked disease genes.

64

Chapter 4: Conserved co-expression for candidate disease gene prioritization

Figure 1. Evolutionary conservation of co-expression improves disease candidate gene prioritization performance over human-only co-expression. The disease gene rank histogram shows the relative proportions of disease genes scoring in different rank bins based on co-expression with known disease genes causing the same disease. Disease gene rank indicates its degree of co-expression with known disease genes, relative to those of the other locus candidate genes. Evolutionary conservation results in a much higher proportion of disease genes ranking in the top 10% of the locus candidate genes and fewer in the mid-regions of the lists. A) Disease gene rank histogram. B) Cumulative proportion of disease genes detected as the co-expression rank threshold is decreased from 1 to 0.

Disease gene ranking improved by co-expression conservation at different evolutionary distances

Evolutionary conservation across multiple species clearly improves the disease gene ranking performance of gene co-expression. However, what influence might different evolutionary distances have on this improvement? Is the evolutionary distance between human and mouse sufficient to improve co-expression performance, and is yeast biology so divergent that it would reduce rather than improve performance? To examine the role of evolutionary distance on the disease gene ranking performance of gene co-expression, we conducted pairwise comparisons in which we compared the ranking performance human co-expression with co-expression conserved between human and each of the other species. For all pairwise comparisons except the human-yeast comparison, evolutionary conservation significantly improves co-expression- based disease gene ranking compared to using only data for human genes (Figure 3, Table 1). Surprisingly, even human-mouse conservation improves disease gene ranking despite the relatively short evolutionary distance between these two species.

65

Chapter 4: Conserved co-expression for candidate disease gene prioritization

Figure 2. Filtering out insignificant co-expression improves performance. When disease gene pairs with co-expression scores that do not differ significantly from those of randomized datasets are filtered out, the disease gene ranking performance improves substantially. However, the coverage drops from 3286 to 695 pairs. A) Disease gene rank histogram. B) Cumulative proportion of disease genes detected as the co-expression rank threshold is decreased from 1 to 0.

In contrast to the other species, co-expression conservation with yeast does not significantly improve disease gene ranking (Table 1). For this species pair yeast-only co-expression performs best, outperforming even the combined human-yeast set at ranking human disease genes (albeit not significantly; p = 0.26). This is primarily due to specific disease types involving housekeeping processes such as metabolism (congenital disorder of glycosylation, glycogen storage disease) and DNA repair (xeroderma pigmentosum) which consistently score well particularly in the yeast set. As yeast co-expression already performs very well, the combination with human co-expression may not yield much extra information. However, this performance comes at the expense of much reduced coverage of disease genes relative to the other sets, which all have a similar coverage (550 disease gene rankings for the human-yeast set, versus ~3000 for human-mouse, human-fly and human-worm sets). It is thus evident that despite the large evolutionary distance between these two species, yeast co-expression is still effective at ranking human disease genes for those genes that have orthologs in both species.

66

Chapter 4: Conserved co-expression for candidate disease gene prioritization

Figure 3. Co-expression-based disease gene rankings are improved by conservation at various evolutionary distances. Pairwise cross-species co-expression improves disease gene ranking over single- species datasets for comparisons between human and mouse (A), fruit fly (B), and nematode worm (C) co-expression. However, co-expression in yeast alone performs as well as co-expression conserved between human and yeast (D), and significantly better than human-only co-expression (p = 0.01).

67

Chapter 4: Conserved co-expression for candidate disease gene prioritization

Table 1. Pairwise species comparisons for co-expression-based disease gene ranking.

P-value Individual # Disease Median Pairwise # Disease Median (Combined species gene pairs rank combined sets gene pairs rank better than single species)

Human 3286 0.64 6.7x10 -4 * Human-Mouse 3114 0.67 Mouse 3188 0.63 1.3x10 -5 *

Human 3286 0.64 0.007 * Human-Fly 3176 0.66 Fly 3534 0.62 1.8x10 -5 *

Human 3286 0.64 2.1x10 -4 * Human-Worm 2954 0.67 Worm 3264 0.63 9.1x10 -6 *

Human 3286 0.64 0.12 Human-Yeast 550 0.67 Yeast 674 0.69 0.74

Footnote: * statistically significant at p = 0.05 level

Disease gene ranking performance is dependent on co-expression data used

We initially used the multi-species co-expression dataset from Stuart and colleagues[21], but this resulted in limited disease gene ranking performance. We suspected that the extensive pooling of expression data from different platforms might have a negative impact on performance. Therefore, we created our own custom multi-species co-expression dataset, in which we restricted ourselves to a single microarray platform per species. Consistent with the findings of others[22,23], this single platform approach resulted in significantly better performance, even when using only the four species included in the Stuart et al. dataset (Figure 4). Though both datasets tend to rank disease genes highly, the new (GEO/GNF) dataset performs significantly better than the larger and more inclusive dataset from Stuart and colleagues (p < 10 -10 , Wilcoxon rank sum test) without loss of coverage (3286 and 3212 disease gene rankings for the GEO/GNF and the Stuart et al. datasets respectively).

In addition to restricting expression data to a single platform per species, normalizing the microarray expression data according to total expression level also improves the ranking of disease genes relative to non-disease genes from the candidate loci. As we were mainly interested in relative expression levels of genes across conditions and not in total gene expression levels, we normalized all expression values according to the total expression level of the microarray sample (see methods section for further details). This reduces systematic biases between samples due to differences in total expression levels and highlights the expression relationships between genes per sample, resulting in up to 5% improvement in candidate disease gene ranking (data not shown).

68

Chapter 4: Conserved co-expression for candidate disease gene prioritization

Figure 4. Single platform co-expression dataset outperforms multiple platform co-expression dataset. The more cohesive GEO/GNF set containing a single microarray platform per species outperforms the more inclusive multi-platform Stuart et al. set (p < 10-10). Both sets have similar coverage, with 3212 and 3286 disease gene pairs for the Stuart et al. and GEO/GNF sets respectively. These results are based on co- expression data from human, fly, worm and yeast, with co-expression scores are averaged across species wherever possible. A) Disease gene rank histogram. B) Cumulative proportion of disease genes detected as the co-expression rank threshold is decreased from 1 to 0.

All disease gene ranking and conserved co-expression correlation data presented here are freely available online from the following URL: http://www.cmbi.ru.nl/~moti/coexpression/

Discussion In this study, we show that we can increase the predictive value of co-expression for disease gene prioritization by exploiting evolutionary conservation, despite the variations in the biology of the species compared. Given a genuine co-expression relationship between the disease genes, using conserved co-expression to prioritize candidate disease genes can reduce the number of genes to be tested over sevenfold compared to using a random ranking of the candidate disease genes, as the correct gene will be found on average after testing 7% of the candidates (the median disease gene rank is 0.93) instead of 50% (Figure 2). Encouragingly, even human-mouse conservation can lead to a substantial improvement in disease gene ranking performance, despite the relatively short evolutionary distance between these two species (Figure 3). This means that improvements in specificity can be gained without large losses in sensitivity, as most human genes have mouse orthologs.

69

Chapter 4: Conserved co-expression for candidate disease gene prioritization

An interesting finding is that large pooled datasets which combine as many expression data as possible from various experiments and platforms can actually result in reduced co-expression performance relative to smaller but more coherent expression datasets (Figure 4). Microarray data are notoriously variable between independently generated datasets while they are somewhat more consistent between experiments using the same platform[22,23]. In order to minimize dilution of the co-expression signal when combining data from many different sources a weighting scheme is required, such as using the co-expression overlap between different sets to weigh the relevance of the co-expression value[24]. Our results are consistent with these previously reported findings, as our single-platform-per-species dataset ranks disease genes significantly better than the more inclusive pooling approach adopted earlier by Stuart and colleagues[21]. An alternative explanation would be that their expression sets are of lower quality or are less representative of the relationships between disease genes, but there is no reason to assume that either of these is the case. This underscores the fact that combining as many data as possible does not necessarily lead to an improved performance of co-expression data for disease gene prioritization, so it is therefore not a trivial finding that combining data from different species does.

Another reason why the larger sets do not perform as well as the smaller sets could lie in the use of correlation coefficients to determine genetic relatedness. Correlation coefficients estimate expression coherence across all conditions surveyed, but even functionally related genes may not have coherent expression patterns across all tissues and conditions. The larger the datasets, the greater the potential for irrelevant conditions to mask the co-expression relationship that a group of genes has under a limited set of conditions. Therefore, a biclustering-based approach[25,26] may yield more refined co-expression relationships between genes, and is a potential avenue for future improvement of co-expression-based disease candidate gene prioritization.

Conclusions We analyze here the predictive power of gene co-expression for disease gene prioritization and identify factors that affect it, such as evolutionary conservation. We show that co-expression data from other species have predictive power for human disease gene prioritization, and that evolutionarily conserved gene co-expression can improve disease gene prioritization over human-only gene co-expression. In addition, we show that platform consistency is important and that smaller but more cohesive datasets can outperform larger pooled datasets. Though we only examined disease gene ranking, these findings have broader relevance for the use of microarray co-expression data in functional genomics. We provide these conserved co- expression data as a new resource that can be used in disease gene prioritization programs, particularly those that integrate several different data types.

70

Chapter 4: Conserved co-expression for candidate disease gene prioritization

Methods

Disease data

We used the Online Mendelian Inheritance in Man (OMIM)[27] Morbid Map as a source of genetic diseases and known disease genes. We restricted our analysis to those diseases with two or more known disease genes. There were 890 known disease genes (727 distinct genes) for 177 diseases in our dataset. Artificial disease loci were constructed around these known disease genes based on localization information from the Ensembl database [28], by taking the required number of neighboring genes centered on the disease gene. These genes were then translated to HGNC gene IDs[29] or KOG IDs[18] depending on the analysis. This means that the locus genes used in the analyses are a subset of the Ensembl genes in the locus, depending on how many could be mapped to their relevant IDs. We used artificial loci of 100 genes, which is representative for the average candidate disease locus, as the OMIM heterogeneous disease loci have a median of 88 genes per locus. In addition, we investigated the use of 50- and 200-gene artificial loci, as well as actual associated loci from OMIM Morbid Map, but do not consider them further as their results did not differ substantially from those of the 100-gene artificial loci.

Expression data processing

Initially we used the multi-species expression data used by Stuart and colleagues in their functional analysis of conserved co-expression[21]. This dataset contains expression data for human, fruit fly (Drosophila melanogaster), nematode worm (Caenorhabditis elegans) and baker’s yeast (Saccharomyces cerevisiae) genes. These expression data had already been normalized and were therefore not further processed prior to their use in the co-expression calculations.

However, due to limited performance of this dataset in ranking disease genes we created a new multi-species co-expression dataset involving expression data from these four species and mouse (Mus musculus). For human and mouse expression data we used the gcRMA- normalized Gene Atlas expression sets generated by the Genomics Institute of the Novartis Foundation[20], as this is an often-used and well-constructed expression dataset. The expression values were log 2-transformed to increase robustness and emphasize the variation between the lower expressed genes relative to the more highly expressed ones. Lacking similar standard expression datasets for the other species, the expression data for fruit fly (Drosophila melanogaster), nematode worm (Caenorhabditis elegans) and baker’s yeast (Saccharomyces cerevisiae) were collected from the Gene Expression Omnibus (GEO) database[19] of the National Center for Biotechnology Information (NCBI). In order to maximize consistency of the expression data, only one expression platform was used per species. As these expression data are normalized in different ways in different experiments, we used the raw signal intensity data (Affymetrix CEL files) rather than the normalized expression data. This further restricted the data available to us, as these raw data are not required for submission of expression data to the GEO database and are not included with all datasets. Only experiments with at least 10 microarray samples were considered. We ended up with 357 samples in 12 experiments for

71

Chapter 4: Conserved co-expression for candidate disease gene prioritization yeast, 242 samples in 8 experiments for fly and 123 samples in a single experiment for worm (Table 2). For within-array normalization we used the Robust Multi-array Averaging (RMA) algorithm[30] as implemented in the R statistical software bioconductor library[31]. However, our between-array normalization was done by normalizing for total sample expression (see below), as we found this to yield better results than RMA-normalizing per experiment (data not shown). Total expression normalization prevents spurious correlation due to differences in total expression level between samples. For the yeast and fly datasets, data were pooled across all experiments before total expression normalization and calculation of the co-expression correlation coefficients.

Table 2. GEO expression datasets used.

Species GEO series ID Number of samples Reference

GSE6515 78 [35] GSE1690 10 [36] GSE2780 10 [37] GSE2828 12 [38] Fly GSE3069 18 [39] GSE3842 72 [40] GSE4714 30 [41] GSE4235 12 [42] Worm GSE2180 123 [43] GSE4807 30 [44] GSE6073 12 [45] GSE1311, GSE1312, 66 [46] GSE1313 GSE1639 18 [47] GSE1693 26 [48]

Yeast GSE1934 24 [49] GSE1938 15 [50] GSE1975 28 [51] GSE2343 12 [52] GSE3076 96 [53] GSE3821 16 [54] GSE4135 14 [55]

72

Chapter 4: Conserved co-expression for candidate disease gene prioritization

The Affymetrix probe set expression levels were translated to gene expression levels, and for genes that were represented by multiple probe sets the median was taken. These genes were further filtered according to gene sets used (Table 3). Both GEO and GNF datasets were further normalized according to total sample expression level by dividing each gene expression value by the mean expression value of all considered genes in the microarray sample. This further minimizes spurious correlations due to differences in total expression level between experimental conditions or across tissues.

Table 3. Genes included in the co-expression calculations.

Genes included in co- Species Number of genes expression data

Those with gene names also 13955 (11410 genes in all Human present in Affymetrix HG-U133A disease loci combined) microarray platform Those with mouse gene names (unknown transcripts, RIKEN Mouse 14000 transcripts and predicted genes excluded) Fly Those with FlyBase IDs 13282 Worm WormBase-annotated genes 17948 Those with systematic names Yeast 6563 (Y… IDs)

No artificial cut-off was used to filter out noisy low expression values, or to define presence or absence of gene expression in a sample. This is not necessary, as we are using correlation between expression profiles rather than absolute expression levels. The inclusion of non- biologically significant noise should not result in spurious correlations between genes, and if there is a correlation between low expression values then they are probably not merely noise and should not be filtered out.

It should be noted that while we used log 2-transformed signal intensity values in a multi-species study involving different microarray platforms, and not relative abundances as Liao & Zhang did[16], our analysis does not suffer from the same problems that led them to use relative abundances. We do not directly compare expression values between different microarray platforms. Instead, these expression values are converted to co-expression values for each platform separately. This process involves only within-platform signal intensity comparisons. The between-species – and therefore between-platform – comparisons are done at the co- expression level and involve comparisons of Spearman rank correlation coefficients.

73

Chapter 4: Conserved co-expression for candidate disease gene prioritization

Co-expression score calculations

We used Spearman rank correlation coefficients as the microarray signal intensity values were not normally distributed. For the GEO datasets comprising several experiments (the fly and yeast sets), these data were pooled before gene pair co-expression correlation coefficients were calculated.

In order to be able to compare co-expression relationships between species, we used the gene orthology relationships as defined by the euKaryotic clusters of Orthologous Groups (KOG) database[18]. We chose to use KOGs instead of a metagenes-based approach such as was used by Stuart and colleagues[21] in order to maximize coverage, as KOGs not only contain bidirectional best hits but also closely related paralogs. The gene to KOG mapping was done using the STRING database version 6.1[32]. Mapping of the protein IDs used in STRING to the gene IDs used on the microarrays was done using Ensembl BioMart[33]. Of the 13955 human genes with expression data used in this study 8186 could be mapped to KOGs.

A single pair of KOGs can have multiple co-expression values if one or both of them contain multiple genes per species. In such a case these co-expression scores need to be combined into a single co-expression score representative of all pairwise combinations of genes in the two KOGs (Figure 5). We accomplished this by taking the mean of all such gene pair co-expression scores, resulting in a single KOG-based co-expression score (within-species averaging).

In order to incorporate evolutionary conservation into the final co-expression scores, we took the mean of the species-specific KOG-based co-expression scores over all species considered (between-species averaging). For the comparison between human and multi-species conserved co-expression the union of all the sets was taken for maximal coverage – i.e. all KOG-based co- expression scores were used regardless of which species were represented in the KOGs.

Disease gene ranking analyses

We investigated the co-expression ranking performance between candidate disease genes and known disease genes for each pair of disease genes causing the same disease (Figure 6). To this end we ordered all co-expression values between a known disease gene and the genes in a candidate disease locus, and scored the relative rank (0-1) of the actual causative gene in the resultant list. If the causative gene was at the top of the list it was assigned a relative rank of 1.0, and if at the bottom it received a relative rank of 0.0. A score of 0.5 indicates an equal number of more highly co-expressed and less highly co-expressed non-disease genes in the candidate locus, and is equivalent to random expectation. For each disease each causative gene was sequentially treated as the known disease gene and tested against all the other loci.

74

Chapter 4: Conserved co-expression for candidate disease gene prioritization

Figure 5. Procedure for calculating conserved co-expression scores. The procedure is illustrated using an example involving KOGs KOG0011 and KOG3438 between human and fly. KOG0011 contains two genes in each species (RAD23A, RAD23B and FBgn0026777, FBgn0039147 in human and fly respectively) while KOG3438 contains one in human (CKS1B) and two in fly (FBgn0010314 and FBgn0037613). For each species a KOG0011-KOG3438 co-expression (thick purple and green arrows) correlation is calculated by taking the mean of all gene-gene combinations (thin black arrows) for the two KOGs. The mean of these species-specific KOG-pair correlations (thick vertical orange arrow) is taken to represent the final multispecies KOG0011-KOG3438 co-expression correlation. This co-expression value is used for all relevant gene-pairs as their KOG-based co-expression score. If co-expression is conserved in both species then this value will be high, if it is high in only one species it will be intermediate, and if it is low in both it will be low.

75

Chapter 4: Conserved co-expression for candidate disease gene prioritization

Figure 6. Disease gene ranking procedure. Co-expression Spearman correlation coefficients (SCC) are determined for all candidate disease genes from a disease locus with known causative gene (E) and another gene known to cause the same disease (A). The SCCs are ranked, and the ranks are subsequently normalized to the 0.0-1.0 range. The relative position of the causative gene in the locus (E) is determined. This procedure would subsequently be repeated with E as known disease gene and A as candidate disease gene. For each disease, each gene is treated as a candidate disease gene in turn and its co-expression is successively compared to all other genes known to cause the same disease, leading to a list of scores per disease.

To avoid ranking candidate disease genes which do not have any co-expression relationship with each other at all, we randomly permuted the co-expression datasets used to determine the random distribution of co-expression scores for each dataset. We then excluded all disease gene pairs for which the co-expression score fell within 2 standard deviations of the randomization means. These distributions all had co-expression scores with a mean of approximately zero and standard deviations ranging between 0.05 and 0.09 depending on the dataset. Multiple randomizations always resulted in almost identical score distributions per dataset due to the large numbers involved.

76

Chapter 4: Conserved co-expression for candidate disease gene prioritization

To investigate the influence of evolutionary conservation on disease gene pair ranking performance, human-derived co-expression data were compared with co-expression data averaged across all five species included in the study. Additionally, pairwise species comparisons were performed comparing human-only co-expression data with pairwise conserved co-expression between human and mouse, fly, worm or yeast.

In order to test for the effect of averaging gene-gene co-expression within KOGs, the performance of the GNF human expression set when using KOG-based co-expression was compared to its performance when using gene-based co-expression.

Tools

The R statistical software package[34] was used for the microarray data processing and the Spearman rank correlation calculations, as well as for statistical tests and data plotting. For performance reasons, small custom-written C++ programs were used to average the gene-gene correlation coefficients into KOG-KOG correlation coefficients, and the per-species KOG-KOG correlations into cross-species KOG-KOG correlation values. Python scripts were written for the disease gene correlation coefficient ranking analyses. All scripts and source code are available on request.

Acknowledgements We would like to thank Walter Hoolwerf and Mark Jans for implementing the initial version of the conserved co-expression database. This work was supported in part by the BioRange program of the Netherlands Bioinformatics Centre, and by the Horizon program, both of which are supported by a BSIK grant through the Netherlands Genomics Initiative.

References

1. Oti M, Brunner HG: The modular nature of genetic diseases. Clin Genet 2007, 71 (1):1-11. 2. Chen J, Xu H, Aronow BJ, Jegga AG: Improved human disease candidate gene prioritization using mouse phenotype. BMC Bioinformatics 2007, 8:392. 3. George RA, Liu JY, Feng LL, Bryson-Richardson RJ, Fatkin D, Wouters MA: Analysis of protein sequence and interaction data for candidate disease gene prediction. Nucleic Acids Res 2006, 34 (19):e130. 4. Lage K, Karlberg EO, Storling ZM, Olason PI, Pedersen AG, Rigina O, Hinsby AM, Tumer Z, Pociot F, Tommerup N et al: A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol 2007, 25 (3):309-316. 5. Xu J, Li Y: Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics 2006, 22 (22):2800-2805. 6. Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS: SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics 2006, 22 (6):773-774.

77

Chapter 4: Conserved co-expression for candidate disease gene prioritization

7. Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent LC, De Moor B, Marynen P, Hassan B et al: Gene prioritization through genomic data fusion. Nat Biotechnol 2006, 24 (5):537-544. 8. Franke L, Bakel H, Fokkens L, de Jong ED, Egmont-Petersen M, Wijmenga C: Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet 2006, 78 (6):1011-1025. 9. Brunner HG, van Driel MA: From syndrome families to functional genomics. Nat Rev Genet 2004, 5(7):545-551. 10. van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA: A text-mining analysis of the human phenome. Eur J Hum Genet 2006, 14 (5):535-542. 11. Perez-Iratxeta C, Wjst M, Bork P, Andrade MA: G2D: a tool for mining genes associated with disease. BMC Genet 2005, 6:45. 12. Rossi S, Masotti D, Nardini C, Bonora E, Romeo G, Macii E, Benini L, Volinia S: TOM: a web- based integrated approach for identification of candidate disease genes. Nucleic Acids Res 2006, 34 (Web Server issue):W285-292. 13. Zhang W, Morris QD, Chang R, Shai O, Bakowski MA, Mitsakakis N, Mohammad N, Robinson MD, Zirngibl R, Somogyi E et al: The functional landscape of mouse gene expression. J Biol 2004, 3(5):21. 14. Bergmann S, Ihmels J, Barkai N: Similarities and differences in genome-wide expression data of six organisms. PLoS Biol 2004, 2(1):E9. 15. Liao BY, Zhang J: Low rates of expression profile divergence in highly expressed genes and tissue-specific genes during mammalian evolution. Mol Biol Evol 2006, 23 (6):1119-1128. 16. Liao BY, Zhang J: Evolutionary conservation of expression profiles between human and mouse orthologous genes. Mol Biol Evol 2006, 23 (3):530-540. 17. van Noort V, Snel B, Huynen MA: Predicting gene function by conserved co-expression. Trends Genet 2003, 19 (5):238-242. 18. Koonin EV, Fedorova ND, Jackson JD, Jacobs AR, Krylov DM, Makarova KS, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS et al: A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol 2004, 5(2):R7. 19. Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002, 30 (1):207-210. 20. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G et al: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 2004, 101 (16):6062-6067. 21. Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science 2003, 302 (5643):249-255. 22. Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia JG, Geoghegan J, Germino G et al: Multiple-laboratory comparison of microarray platforms. Nat Methods 2005, 2(5):345-350. 23. Kuo WP, Liu F, Trimarchi J, Punzo C, Lombardi M, Sarang J, Whipple ME, Maysuria M, Serikawa K, Lee SY et al: A sequence-oriented comparison of gene expression measurements across different hybridization-based technologies. Nat Biotechnol 2006, 24 (7):832-840. 24. Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P: Coexpression analysis of human genes across many microarray data sets. Genome Res 2004, 14 (6):1085-1094. 25. Cheng Y, Church GM: Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol 2000, 8:93-103. 26. Madeira SC, Oliveira AL: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 2004, 1(1):24-45.

78

Chapter 4: Conserved co-expression for candidate disease gene prioritization

27. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005, 33 (Database issue):D514-517. 28. Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T et al: The Ensembl genome database project. Nucleic Acids Res 2002, 30 (1):38-41. 29. Wain HM, Lush MJ, Ducluzeau F, Khodiyar VK, Povey S: Genew: the Human Gene Nomenclature Database, 2004 updates. Nucleic Acids Res 2004, 32 (Database issue):D255-257. 30. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4(2):249-264. 31. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J et al: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80. 32. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P: STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res 2005, 33 (Database issue):D433-437. 33. Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res 2004, 14 (1):160-169. 34. R: A Language and Environment for Statistical Computing [http://www.R-project.org/] 35. Transcriptional control in embryonic Drosophila midline guidance assessed through a whole genome approach: Magalhães TR, Palmer J, Tomancak P, Pollard KS: BMC Neurosci 2007, 31 ;8:59. 36. Wang J, Kean L, Yang J, Allan AK, Davies SA, Herzyk P, Dow JA: Function-informed transcriptome analysis of Drosophila renal tubule. Genome Biol 2004, 5(9):R69. 37. Akdemir F, Christich A, Sogame N, Chapo J, Abrams JM: p53 directs focused genomic responses in Drosophila. Oncogene 2007, 26 (36):5184-93. 38. Dostert C, Jouanguy E, Irving P, Troxler L, Galiana-Arnoux D, Hetru C, Hoffmann JA, Imler JL: The Jak-STAT signaling pathway is required but not sufficient for the antiviral response of drosophila. Nat Immunol 2005, 6(9):946-953. 39. Beckstead RB, Lam G, Thummel CS: The genomic response to 20-hydroxyecdysone at the onset of Drosophila metamorphosis. Genome Biol 2005, 6(12):R99. 40. Wijnen H, Naef F, Boothroyd C, Claridge-Chang A, Young MW: Control of daily transcript oscillations in Drosophila by light and the circadian clock. PLoS Genet 2006, 2(3):e39. 41. Zimmerman JE, Rizzo W, Shockley KR, Raizen DM, Naidoo N, Mackiewicz M, Churchill GA, Pack AI: Multiple mechanisms limit the duration of wakefulness in Drosophila brain. Physiol Genomics 2006, 27 (3):337-350. 42. Wang X, Bo J, Bridges T, Dugan KD, Pan TC, Chodosh LA, Montell DJ: Analysis of cell migration using whole-genome expression profiling of migratory cells in the Drosophila ovary. Dev Cell 2006, 10 (4):483-495. 43. Baugh LR, Hill AA, Claggett JM, Hill-Harfe K, Wen JC, Slonim DK, Brown EL, Hunter CP: The homeodomain protein PAL-1 specifies a lineage-specific regulatory network in the C. elegans embryo. Development 2005, 132 (8):1843-1854. 44. Tai SL, Boer VM, Daran-Lapujade P, Walsh MC, de Winde JH, Daran JM, Pronk JT: Two- dimensional transcriptome analysis in chemostat cultures. Combinatorial effects of oxygen availability and macronutrient limitation in Saccharomyces cerevisiae. J Biol Chem 2005, 280 (1):437-447.

79

Chapter 4: Conserved co-expression for candidate disease gene prioritization

45. Yarragudi A, Parfrey LW, Morse RH: Genome-wide analysis of transcriptional dependence and probable target sites for Abf1 and Rap1 in Saccharomyces cerevisiae. Nucleic Acids Res 2007, 35 (1):193-202. 46. Singh J, Kumar D, Ramakrishnan N, Singhal V, Jervis J, Garst JF, Slaughter SM, DeSantis AM, Potts M, Helm RF: Transcriptional response of Saccharomyces cerevisiae to desiccation and rehydration. Appl Environ Microbiol 2005, 71 (12):8752-8763. 47. Sabet N, Volo S, Yu C, Madigan JP, Morse RH: Genome-wide analysis of the relationship between transcriptional regulation by Rpd3p and the histone H3 and H4 amino termini in budding yeast. Mol Cell Biol 2004, 24 (20):8823-8833. 48. Hochwagen A, Wrobel G, Cartron M, Demougin P, Niederhauser-Wiederkehr C, Boselli MG, Primig M, Amon A: Novel response to microtubule perturbation in meiosis. Mol Cell Biol 2005, 25 (11):4767-4781. 49. Schawalder SB, Kabani M, Howald I, Choudhury U, Werner M, Shore D: Growth-regulated recruitment of the essential yeast ribosomal protein gene activator Ifh1. Nature 2004, 432 (7020):1058-1061. 50. Pitkanen JP, Torma A, Alff S, Huopaniemi L, Mattila P, Renkonen R: Excess mannose limits the growth of phosphomannose isomerase PMI40 deletion strain of Saccharomyces cerevisiae. J Biol Chem 2004, 279 (53):55737-55743. 51. Ronald J, Akey JM, Whittle J, Smith EN, Yvert G, Kruglyak L: Simultaneous genotyping, gene- expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Res 2005, 15 (2):284-291. 52. Takagi Y, Masuda CA, Chang WH, Komori H, Wang D, Hunter T, Joazeiro CA, Kornberg RD: Ubiquitin ligase activity of TFIIH and the transcriptional response to DNA damage. Mol Cell 2005, 18 (2):237-243. 53. Guan Q, Zheng W, Tang S, Liu X, Zinkel RA, Tsui KW, Yandell BS, Culbertson MR: Impact of nonsense-mediated mRNA decay on the global expression profile of budding yeast. PLoS Genet 2006, 2(11):e203. 54. Kresnowati MT, van Winden WA, Almering MJ, ten Pierick A, Ras C, Knijnenburg TA, Daran- Lapujade P, Pronk JT, Heijnen JJ, Daran JM: When transcriptome meets metabolome: fast cellular responses of yeast to sudden relief of glucose limitation. Mol Syst Biol 2006, 2:49. 55. Yu C, Palumbo MJ, Lawrence CE, Morse RH: Contribution of the histone H3 and H4 amino termini to Gcn4p- and Gcn5p-mediated transcription in yeast. J Biol Chem 2006, 281 (14):9755- 9764.

80

Chapter 5: The biological coherence of human phenome databases

Martin Oti, Martijn Huynen, Han G. Brunner

Submitted

81

Chapter 5: The biological coherence of human phenome databases

Abstract Disease networks are increasingly explored as a complement to networks centred around interactions between genes and proteins. The quality of disease networks is heavily dependent on the amount and quality of phenotype information in phenotype databases of human genetic diseases. We explored which aspects of phenotype database architecture and content best reflect the underlying biology of disease. We used the OMIM, Orphanet and POSSUM phenotype databases for this purpose, and devised a biological coherence score based on the sharing of gene ontology annotation to investigate the degree to which phenotype similarity in these databases reflects related pathobiology. Our analyses support the notion that a fine- grained phenotype ontology enhances the accuracy of phenome representation. In addition, we find that the OMIM database that is most used by the human genetics community is heavily under-annotated. We show that this problem can easily be overcome by simply adding data available in the POSSUM database to improve phenotype representations in OMIM. Also, we find that the use of feature frequency estimates – currently only implemented in the Orphanet database – significantly improves the quality of the phenome representation. Our data suggest that there is much to be gained by improving human phenome databases, and that some of the measures needed to achieve this are relatively easy to implement. More generally, we propose that curation and more systematic annotation of human phenome databases can greatly improve the power of the phenotype for genetic disease analysis.

Introduction The human genome is defined by the complete DNA sequence and by the functional relationships between all human genes. Similarly, the human phenome can be viewed as the sum total of all human phenotypes, and the relationships that exist between the various diseases and traits. By correlating networks of genes and phenotypes[1,2], we can investigate disease pathobiology at the whole phenome scale[1-13]. Such analyses build on the premise that phenotypic overlap is a good predictor of genetic relationships, and their success relies on the quality and amount of the phenotype data[5,8,11-13].

The importance of using adequate phenotype information is obvious both for clinical diagnosis and for proper disease classification for research studies. The concept of disease families that can be organized into phenotype networks has spurred new interest into more precise and more comprehensive phenotype annotation[14-17]. For example, mutations in proteins involved in ciliary functioning result in overlapping phenotypes, collectively referred to as ciliopathies[18]. The realization that features such as retinopathy and kidney cysts are indicative of disturbed cilium function has enabled the identification of ciliary diseases based only on their phenotype[16,19,20], as well as the identification of novel ciliopathy genes[21]. This and other examples suggest that much can be learned from disease comparisons on a phenome-wide scale. Such phenotype comparisons will need to become more sophisticated as correlations are sought between genetic variants and phenotypic features in ever greater detail, up to the level of individual genotype-phenotype mappings across the genome and across populations[22].

82

Chapter 5: The biological coherence of human phenome databases

Here we analyzed three human phenotype data sets to investigate which characteristics of the disease phenotype descriptions in the available databases would maximize their utility. We examined OMIM (Online Mendelian Inheritance in Man; http://www.ncbi.nlm.nih.gov/Omim/)[23] using phenotype descriptions based on a recently developed phenotype ontology called HPO[24] (Human Phenotype Ontology; http://www.human-phenotype-ontology.org /). In addition we performed analyses on the diagnosis-oriented Orphanet[25] (http://www.orpha.net/) and POSSUM (Pictures Of Standard Syndromes and Undiagnosed Malformations)[26] (http://www.possum.net.au/) databases. Using the sharing of gene ontology annotation as a measure of biological coherence between diseases, we investigated the degree to which phenotype similarity in these databases reflects shared pathobiology for different treatments of the phenotype data. It is important to note that differences in information content and structure of the OMIM, POSSUM, and Orphanet databases preclude comparing them directly with each other, so comparisons were always between different treatments of data from a single database. To remove biases that remain even when comparing treatments within one database, all results were expressed relative to random permutations of the phenotypes in that database.

We find that a fine-grained phenotype ontology improves phenome representation, as does inclusion of feature frequency estimates. In addition, we find that the OMIM database that is most used by the community is heavily under-annotated, at least for the purpose of systematic phenotype comparisons. We show that this problem can easily be overcome by simply adding data available in the POSSUM database to improve phenotype representations in OMIM. Our data suggest that there is much to be gained by improving human phenome databases, and that some of the measures needed to achieve this are relatively easy to implement. While manual curation and systematic annotation of disease phenotypes may require substantial investment, we feel that this is justified and necessary to realize the full potential of systematic genotype- phenotype correlations and phenomics.

Methods

Data sets

In order to compare databases for the relationships between their phenotype content and the underlying genetic architecture, we first needed to formalize phenotype descriptions. Several current disease phenotype databases already define their own standardized feature terms. The diagnosis-oriented LDDB (London Dysmorphology Database)[27], Orphanet[25] (http://www.orpha.net/) and POSSUM (Pictures Of Standard Syndromes and Undiagnosed Malformations)[26] (http://www.possum.net.au/) databases all use a controlled vocabulary to systematically annotate disease phenotypes with features, an approach that facilitates differential diagnosis. We used Orphanet and POSSUM in our analyses as examples of such structured human phenome databases. However, the largest phenotype database available, the OMIM (Online Mendelian Inheritance in Man; http://www.ncbi.nlm.nih.gov/Omim/)[23], is intended to serve more as an information repository than as a diagnostic tool. As such it does not use a controlled vocabulary. To enable phenotype comparisons to be conducted for this

83

Chapter 5: The biological coherence of human phenome databases database, previous efforts have employed text mining to convert the free text records into feature lists, using terms defined in external vocabularies[8,11]. In this study, we used another recent conversion of OMIM phenotype data, which uses a manually curated systematic hierarchical vocabulary (or ontology) to describe OMIM phenotypes[24]. This ontology, known as the Human Phenotype Ontology or HPO (http://www.human-phenotype-ontology.org/), was used to annotate OMIM phenotypes with feature lists based on annotation taken from the OMIM database itself. The HPO is comprehensive, with over 8000 terms organized into a deep hierarchical structure (Table 1).

Table 1. Overview of the considered syndrome databases.

OMIM Orphaneta POSSUM HPO b MimMiner c Num. syndromes 2070 3167 4779 5948

Num. features in d 864 1115 8275 1368 ontology Feature ontology depth: 4 / 2 2 / 2 13 / 6 15 / 5 max. / median Median num. features per 13 / 25 22 / 34 7 / 20 8 / 22 syndrome: original / expanded e Num. syndromes mapped to disease 668 924 2053 2055 genes Num. disease genes 1038 f 986 2019 1937 Reference [25] [26] [24] [11] a Only the feature-annotated syndromes were included in this analysis; there were 7435 Orphanet syndromes in the full Orphanet syndrome database. b The Human Phenotype Ontology contains phenotype annotation for a subset of OMIM diseases. c The MimMiner text-mining conversion of the OMIM database is listed for comparison, but was not used in this study. d The “Anatomy” (A) and “Pathological Conditions, Signs and Symptoms” (C23) parts of the MeSH ontology were used in the MimMiner approach. e The number in parentheses refers to the median number of features per syndrome after the syndrome feature vectors are expanded to include the feature’s ontological ancestors in the feature vector. f The Orphanet database used originally contained 569 disease genes, but this number was expanded to 1935 using syndrome-to-gene mappings from the OMIM database. 1038 of these were associated with feature-annotated syndromes.

The Pictures Of Standard Syndromes and Undiagnosed Malformations (POSSUM) and Orphanet phenotype data were received upon request from the Murdoch Children’s Research

84

Chapter 5: The biological coherence of human phenome databases

Institute in Melbourne, Australia and the INSERM in Paris, France, respectively. The POSSUM data were received in August 2007 while the Orphanet data were received in July 2008. The Human Phenotype Ontology (HPO) and Online Mendelian Inheritance in Man (OMIM) data were downloaded on December 18, 2008.

Cluster biological coherence score calculation procedure

We used a cluster-based approach as we are interested in the degree to which phenotypically similar diseases share pathogenetic mechanisms, as proposed by the syndrome family concept[3]. We calculated phenotypic distances between syndromes based on their feature vectors. This approach is described in detail in Van Driel et al. 2006[11]. Briefly, we first used the hierarchical relationships between features in the relevant feature ontology to supplement the feature vectors with their more general ancestor features, and subsequently calculated the phenotypic distances between all syndrome pairs using the cosine similarity metric, which uses the angle between the two feature vectors as distance measure.

After calculating the phenotypic distances between syndromes, we hierarchically clustered the phenotypes using average linkage. We then partitioned the resulting dendrogram into clusters using the “Dynamic Tree Cut” algorithm[28], which creates comparable cluster sizes across different dendrograms (data not shown). This algorithm requires a minimum cluster size as parameter, which we set to five syndromes.

Upon partitioning the syndromes into clusters, we then calculated the average biological coherence of the clusters using the “Gene Ontology” (GO) gene functional annotation[29] as genetic relatedness measure (downloaded from the Ensembl database[30] on August 28, 2007). We used GO annotation as it reflects many different kinds of functional relatedness and has a large coverage of genes (>16,000 genes).

Cluster biological coherence was calculated as follows: First, we retrieved the GO terms associated with the disease genes underlying cluster syndromes. These GO terms were then pooled across all genes causing the same syndrome, resulting in a set of GO terms annotated to the syndrome. For each syndrome pair, we determined the GO term overlap between the two syndromes (Equation 1):

Sp(i,j) = n(Gi∩Gj)/ n(GiUGj) (1) where Sp(i,j) is the pairwise GO term overlap score for diseases i and j, n is the number of GO terms meeting the specified criteria and Gi and Gj are the sets of GO terms associated with diseases i and j respectively. We did not incorporate ontological relationships between GO terms into the comparison. For each cluster, the mean pairwise overlap was used as the biological coherence score for that cluster (Equation 2):

n Sc = ∑ i,j Sp(i,j)/ n (2) where Sc is the genetic cohesiveness score for cluster c, Sp(i,j) is the GO overlap score for diseases i and j, and n is the number of disease pairs in the cluster. The mean biological

85

Chapter 5: The biological coherence of human phenome databases coherence score across all clusters was used as the overall cluster biological coherence score for the database (Equation 3):

n S = ∑ cSc/n (3) where S is the overall genetic cohesiveness score for the phenotype data set, Sc is the genetic cohesiveness score for cluster c, and n is the number of clusters in the phenotype data set.

Randomizations

We did not compare the cluster biological coherence scores between data sets directly because the many differences between the databases would make it hard to determine which aspects of the database did cause the variation in the biological coherence score. Instead we compared the scores of the actual data sets to those of randomly permuted data sets. Randomization was done by reshuffling the features over the diseases, while maintaining the phenotypic structure of the data sets. Thus, feature frequencies in the data sets and feature distributions across diseases were maintained, and only the feature assignment to diseases was randomized. In this randomization approach disease to gene mappings are maintained, correcting for biases due to the sharing of genes between diseases, variation in number of genes per disease , and function annotation bias of genes, as these remain identical across both actual and randomized data sets. As final biological coherence measure, we used the ratios of the cluster biological coherence scores of the actual data sets to those of 30 randomized variants. These ratios were used as performance metric for evaluating effect of weighting schemes or other database properties. All ratio comparisons were done using the non-parametric two-sided Wilcoxon rank sum test with continuity correction as implemented in the R statistical software package (http://www.r-project.org/ ).

OMIM supplementation analyses

The OMIM diseases were supplemented with feature annotation from the POSSUM database. We excluded the more detailed skeletal features–those with feature IDs above 746–which were added later to the POSSUM database in order to better describe the skeletal abnormalities that this database is oriented towards. This supplementation resulted in the increase in median number of features per OMIM disease from 14 to 38 for those diseases that could be supplemented. We performed the comparisons between the original and the supplemented OMIM data sets using only the 1950 OMIM syndromes that could be supplemented with at least one POSSUM feature. The supplemented feature vectors were further processed analogously to the original feature vectors.

HPO feature ontology truncation procedure

The HPO feature ontology contains over 8,000 features organized into a deep hierarchical structure with a median feature depth of six and a maximum depth of thirteen. We mapped all features located at a depth level of five or higher (i.e. four or more steps from the root of the HPO ontology) to their more general ancestor features at the fourth level, resulting in a set of 1,833 more broadly defined features. Where a deep feature had multiple ancestors at this level,

86

Chapter 5: The biological coherence of human phenome databases it was mapped to all of them. Syndrome feature vectors were modified using this feature mapping, with deeper features being replaced with their appropriate fourth-level ancestor features. All features were only registered once per feature vector, regardless of how many deeper features they replaced. These modified feature vectors were further processed analogously to the full ontology-based feature vectors.

Orphanet feature occurrence frequency weighting scheme

Orphanet features are annotated with occurrence frequency estimates. These frequency estimates are divided into three classes, “Very Frequent”, “Frequent’ and “Occasional”. To investigate the effect of incorporating feature frequency estimates on syndrome clustering, we weighted these three frequency classes with the weights 1.0, 0.1 and 0.01 respectively. We contrasted this scheme with one in which the assignment order was reversed, assigning a weight of 1.0 to the “Occasional” frequency class and 0.01 to the “Very Frequent” class. These weighted feature vectors were further processed analogously to the unweighted feature vectors.

Inverse Document Frequency weighting scheme

Features in feature vectors were weighted using the Inverse Document Frequency algorithm (Equation 4), which assigns higher weights to features occurring in fewer syndromes:

Fidf = log 2(n/nf)F (4) where Fidf is the IDF-weighted feature score, n is the total number of phenotypes, nf is the number of phenotypes with the feature, and F is the original feature score. In this scheme, weights increase logarithmically with rarity. All weights were subsequently rescaled to the 0–1 range. These weighted feature vectors were further processed analogously to the unweighted feature vectors.

Statistical analysis

Biological coherence scores were compared between data sets using the Wilcoxon signed rank test with continuity correction as implemented in the R statistical software package (http://www.r-project.org/ ). This test does not assume normally distributed data.

Results

Incomplete phenotype descriptions impair phenome coherence

The number of phenotype features per disease in OMIM (7) is much less than it is for the POSSUM and Orphanet databases (22 and 13 respectively) (Table 1). We reasoned that increasing phenotype annotation in OMIM might aid the discovery of biological relationships between diseases in that database. We investigated this by supplementing OMIM disease annotation with features from the POSSUM database, thus increasing the OMIM disease annotation almost threefold from 14 features per disease to 38 features per disease. We then

87

Chapter 5: The biological coherence of human phenome databases hierarchically clustered all OMIM phenotypes based on their HPO feature similarities, creating a disease network that we could link to the biological function of disease genes. The biological coherence of resulting OMIM phenotypic clusters was measured by the degree to which disease genes shared Gene Ontology (GO) function annotation. Briefly, GO terms were pooled across genes per disease, and the mean degree of GO annotation overlap between all disease pairs in a cluster was used as the cluster biological coherence score. The mean score over all clusters was used as the final biological coherence score for the dataset (see methods section for more detailed description of procedure).

Enriching OMIM with POSSUM annotation does indeed lead to improved phenotypic clustering of known syndrome families (Figure 1). The 12 annotated subtypes of Ehlers Danlos syndrome, and of Charcot-Marie-Tooth disease segregate better (Figure 1A) and the detection of similarity between the phenotypically more diverse ciliopathies is also increased (Figure 1B). Using enriched phenotype descriptions, the OMIM disease similarity matrix reflects underlying genetic relationships to a much greater degree (Figure 2A; p<10 -16 , two-sided Wilcoxon signed rank test). We conclude that the OMIM database is currently greatly under-annotated and that this affects its performance on detecting biological relationships between disease phenotypes.

We further demonstrated this point by artificially under-annotating the well-annotated POSSUM and Orphanet syndromes, through the random elimination of half the annotated features per syndrome. As expected, the performance of POSSUM and Orphanet is much reduced when half of the annotated features are randomly removed (Figure 2B; p<10-16 in both cases, two- sided Wilcoxon signed rank test). These results underline the importance of complete phenotype descriptions for phenotype-based disease analysis and highlight the limitations of using the OMIM database for such analyses.

Detailed feature ontologies improve phenome coherence

We then asked whether one would require a highly detailed feature ontology as recently developed in HPO in order to accurately reflect the biology that underlies inherited diseases. Such feature ontologies organize disease features into a hierarchical structure, with deeper features becoming progressively more specific. The HPO has a comprehensive feature ontology containing over 8000 features, in contrast to the POSSUM and Orphanet ontologies which both contain about a thousand features each (Table 1). It also has the deepest ontology, with a median feature depth of 6 (as opposed to 2 for the POSSUM and Orphanet ontologies) and a maximum depth of 13 (as opposed to 2 and 4 for POSSUM and Orphanet respectively).

To investigate the benefits of such a highly detailed ontology, we first hierarchically clustered all OMIM phenotypes based on their HPO feature similarities. This created a disease network that we could link to the biological function of disease genes. We then repeated the analysis using a simplified version of HPO truncating the feature tree at three steps from the root of the ontology. This procedure reduced the HPO feature set from a total of 8,275 to just 1,833 features.

88

Chapter 5: The biological coherence of human phenome databases

Figure 1. Comprehensively annotated syndromes cluster better than sparsely annotated syndromes. ( A) The Ehlers-Danlos and Charcot-Marie-Tooth syndrome families overlap in the phenome landscape of the OMIM data set, but separate when the phenotype descriptions are supplemented with POSSUM annotation. The phenome landscapes were created using multi-dimensional scaling of the HPO feature- based OMIM distance matrices (left), supplemented with POSSUM annotation (right). The more similar the annotations of two syndromes are, the closer they are on the landscape. The background colors indicate the density of syndromes in that region of the landscape. Lighter colors represent higher densities. ( B) Mean phenotypic similarity is consistently greater for the POSSUM supplemented OMIM data set (red dashed lines) than for the original OMIM data set (blue dashed lines). Besides Ehlers- Danlos (n=12) and Charcot-Marie-Tooth (n=12), the more phenotypically diverse family of ciliopathies is also shown (n=59; Supplementary Table 1). Continuous lines show the distributions of mean distances for randomly composed syndrome families of equivalent size (n=107) for the original and supplemented OMIM data sets.

89

Chapter 5: The biological coherence of human phenome databases

Surprisingly, an initial comparison indicated that the highly detailed feature ontology of HPO did not improve the degree to which phenotype clustering reflects biological relationships between disease genes (Figure 2A). In fact, the use of more detailed feature definitions had a slightly detrimental effect (p=0.003, two-sided Wilcoxon signed rank test). However, this is likely an artifact of the previously noted under-annotation of diseases in OMIM. To confirm this, we repeated the experiment using the OMIM disease annotation that had been supplemented with features from the POSSUM database (Figure 2A). As could be expected, the detailed phenotype ontology did indeed improve phenome representation. The HPO feature ontology performed better on biological coherence of phenotype clusters relative to the simplified ontology (p<10 -10 , two-sided Wilcoxon signed rank test). This result highlights two effects; firstly, detailed and comprehensive feature ontologies such as HPO enable improved phenotype description, and secondly, under-annotation of disease phenotypes severely limits the benefits of such detailed feature ontologies.

Using feature occurrence frequency can improve phenome coherence

We then asked whether all phenotypic features are of equal importance to the overall disease phenotype. We first restricted the phenotype descriptions to those features that occur very frequently in the respective diseases. Even though this leads to a considerable reduction of features per syndrome (median 57% of original features), the biological coherence scores remained high (Figure 2B). This result shows that the core phenotypic features that occur most commonly in a disease best reflect the underlying biological relationships.

Consistent with this, we found that if we emphasized commonly occurring features and assigned lower weights to infrequent features, a more biologically relevant phenotype clustering was obtained (Figure 2C; p<10 -5, two-sided Wilcoxon signed rank test). By contrast, emphasizing infrequent features by assigning them higher weights almost completely abolished any recognizable biological coherence of phenotype clusters (Figure 2C; p<10 -16 , two-sided Wilcoxon signed rank test). Thus, while the weighting of phenotypic features based on their frequency of occurrence improves disease classification, emphasizing features that are not part of the core phenotype may have a severe detrimental effect. This result clearly argues for the systematic curation of phenotype data. More specifically, the inclusion of feature frequencies appears to be a requirement for optimal phenotype representation, a feature that is currently only available in Orphanet.

Emphasizing rare features is detrimental to phenome coherence

It has previously been suggested that those features which occur in many diseases will be too general to discriminate between diseases, and too common to aid in specifying the pattern of features that defines a disease family. Rarer features might be more informative for the underlying biology[8,11]. We investigated this assumption for the systematically annotated POSSUM database. In contrast to previous studies[8,11], we find that a weighting scheme that uses the standard “inverse document frequency” (or IDF) score is detrimental to the biological coherence of similar phenotypes (Figure 2D; p<10 -7, two-sided Wilcoxon signed rank test).

90

Chapter 5: The biological coherence of human phenome databases

Figure 2. Biological coherence of phenotypic clusters for different data sets and conditions. The box plots show relative enrichment of shared GO terms for genes associated with diseases within clusters compared to randomly permutated phenotype data sets (n=30). Box limits show the 25th and 75th percentiles, whiskers extend out up to 1.5x the box range, points outside this range are plotted individually. ( A) The full HPO ontology results in biologically more coherent phenotype clusters than a simplified HPO ontology containing only more general features, but only when the OMIM phenotypes are supplemented with POSSUM annotation (purple boxes). ( B) Artificial under-annotation of the POSSUM and Orphanet databases by randomly halving the syndrome feature lists (“sparse”) leads to strong reductions in cluster biological coherence. However, limiting the Orphanet syndrome descriptions to just the very frequent features has limited impact on cluster coherence, despite the strong reduction in the average number of features per syndrome to just 57% of the original. ( C) Weighting Orphanet features according to their prevalence within affected patients improves the biological coherence of clustered phenotypes. Counter-weighting them by assigning higher weights to less frequently occurring features abolishes the biological coherence of the resulting phenotype clusters almost completely. ( D) Weighting annotated features according to their specificity (number of syndromes they occur in) using the inverse document frequency (I.D.F.) weighting scheme diminishes cluster biological coherence for well annotated POSSUM syndromes, but improves it for under-annotated syndromes.

91

Chapter 5: The biological coherence of human phenome databases

Interestingly, this only holds for fully annotated syndromes since under-annotated syndromes do benefit from emphasizing rarer features (Figure 2D; p<10 -11 , two-sided Wilcoxon signed rank test). This explains previous results, which were based on text-mining of less well-annotated phenotypes. We conclude that frequent phenotypic features are generally more indicative of underlying genetic relationships than the sharing of specific features that are observed in a few syndromes or diseases only.

Discussion Here, we have used a biological coherence score based on the sharing of gene ontology annotation between diseases. We use this scoring system to identify database characteristics that enable a better clustering of related disease phenotypes. Our analysis of the OMIM, POSSUM, and Orphanet databases highlights three areas where improvements of phenotype annotation are required. First, the number of features per disease in OMIM can easily be increased and this will much improve its applicability for phenome scale analyses. Second, a simple score of the frequency of feature occurrence per disease as is implemented in Orphanet, refines the phenotype description and improves database performance. Third, emphasizing phenotype features that are rare in the databases does not allow one to cluster diseases more efficiently. These findings have implications for future and perhaps current database design.

Clearly, current human phenotype databases were intended as repositories, as tools for accurate clinical diagnosis of syndromes, and to provide references to a selection of the pertinent literature for specific genetic diseases and syndromes. One could therefore argue that our plea for improvement is demanding something that lies outside the original scope for which these databases were designed. In contrast, we would argue that our ability to compare and group genes based on their sequence and function has proven to be of immense use for genome scientists. We therefore believe that the human phenome deserves a representation that allows scientists to be similarly inquisitive and creative in distilling biologically relevant patterns.

Further down the line we need to improve the collection of phenotype data as well as their storage. While curators can standardize phenotype recording in databases, such efforts could be greatly assisted by the standardization of phenotype recording in the clinic[15,16]. Such standardized reporting would require controlled feature vocabularies, of which there are several in existence[24-27] or under development[31]. Given the difficulties of designing feature vocabularies and their importance in phenotype analysis, it might be beneficial to unify terminology across vocabularies. Ultimately, a complete human phenome description incorporating all human phenotypic variation[14] – including molecular phenotypes[32] – would be most desirable for correlating phenotype variation to genetic variation. Such analyses could even be performed at the level of individual genomes and phenotypes once sufficient data from initiatives such as the Personal Genome Project become available[22]. We believe that the time is ripe for the allocation of substantial resources to improve human phenome annotation on the

92

Chapter 5: The biological coherence of human phenome databases one hand, and to foster the more systematic storage of such data in human phenotype databases on the other.

Acknowledgements We would like to thank to A. Bankier, C. Rose, M. Black and the rest of the POSSUM team at the Murdoch Children’s Research Institute in Melbourne, Australia, for granting us access to the POSSUM data. We would also like to thank S. Aymé, A. Rath and the rest of the Orphanet team at the INSERM in Paris, France, for providing us with the Orphanet data. We would further like to express our gratitude to A. Hamosh and colleagues at Johns Hopkins University, Baltimore, for making the OMIM data publicly available and P. Robinson, S. Mundlos and the HPO team at Charité-Universitätsmedizin in Berlin, Germany, for publicly releasing the HPO. Additionally, we are grateful to P. Beales for providing us with a list of putative ciliopathies. Finally, we would like to thank A. Schenk, H. van Bokhoven and G. Vriend for critical reading of the manuscript and useful comments. This work was supported in part by the BioRange program of the Netherlands Bioinformatics Centre which is supported by a BSIK grant through the Netherlands Genomics Initiative, and by the European Union’s 6th Framework Program contract number LSHB-CT- 2005-019067 (EPISTEM).

References

1. Oti M, Brunner HG (2007) The modular nature of genetic diseases. Clin Genet 71 :1-11 2. Oti M, Huynen MA, Brunner HG (2008) Phenome connections. Trends Genet 24 :103-106 3. Brunner HG, van Driel MA (2004) From syndrome families to functional genomics. Nat Rev Genet 5:545-551 4. Butte AJ, Kohane IS (2006) Creation and implications of a phenome-genome network. Nat Biotechnol 24 :55-62 5. Freudenberg J, Propping P (2002) A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics 18 Suppl 2:S110-115 6. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL (2007) The human disease network. Proc Natl Acad Sci U S A 104 :8685-8690 7. Jiang X, Liu B, Jiang J, Zhao H, Fan M, Zhang J, Fan Z, Jiang T (2008) Modularity in the genetic disease-phenotype network. FEBS Lett 582 :2549-2554 8. Lage K, Karlberg EO, Storling ZM, Olason PI, Pedersen AG, Rigina O, Hinsby AM, Tumer Z, Pociot F, Tommerup N, Moreau Y, Brunak S (2007) A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol 25 :309-316 9. Lee DS, Park J, Kay KA, Christakis NA, Oltvai ZN, Barabasi AL (2008) The implications of human metabolic network topology for disease comorbidity. Proc Natl Acad Sci U S A 105 :9880-9885 10. Loscalzo J, Kohane I, Barabasi AL (2007) Human disease classification in the postgenomic era: a complex systems approach to human pathobiology. Mol Syst Biol 3:124 11. Van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA (2006) A text-mining analysis of the human phenome. Eur J Hum Genet 14 :535-542 12. Wu X, Jiang R, Zhang MQ, Li S (2008) Network-based global inference of human disease genes. Mol Syst Biol 4:189 13. Wu X, Liu Q, Jiang R (2009) Align human interactome with phenome to identify causative genes and networks underlying disease families. Bioinformatics 25 :98-104 14. Freimer N, Sabatti C (2003) The human phenome project. Nat Genet 34:15-21

93

Chapter 5: The biological coherence of human phenome databases

15. Hall JG (2003) A clinician's plea. Nat Genet 33 :440-442 16. Biesecker LG (2005) Mapping phenotypes to language: a proposal to organize and standardize the clinical descriptions of malformations. Clin Genet 68 :320-326 17. Schulze TG, McMahon FJ (2004) Defining the phenotype in human genetic studies: forward genetics and reverse phenotyping. Hum Hered 58 :131-138 18. Badano JL, Mitsuma N, Beales PL, Katsanis N (2006) The Ciliopathies: An Emerging Class of Human Genetic Disorders. Annu Rev Genomics Hum Genet 7:125-148 19. Arts HH, Doherty D, van Beersum SE, Parisi MA, Letteboer SJ, Gorden NT, Peters TA, Marker T, Voesenek K, Kartono A, Ozyurek H, Farin FM, Kroes HY, Wolfrum U, Brunner HG, Cremers FP, Glass IA, Knoers NV, Roepman R (2007) Mutations in the gene encoding the basal body protein RPGRIP1L, a nephrocystin-4 interactor, cause Joubert syndrome. Nat Genet 39 :882-888 20. Beales PL, Bland E, Tobin JL, Bacchelli C, Tuysuz B, Hill J, Rix S, Pearson CG, Kai M, Hartley J, Johnson C, Irving M, Elcioglu N, Winey M, Tada M, Scambler PJ (2007) IFT80, which encodes a conserved intraflagellar transport protein, is mutated in Jeune asphyxiating thoracic dystrophy. Nat Genet 39 :727-729 21. Li JB, Gerdes JM, Haycraft CJ, Fan Y, Teslovich TM, May-Simera H, Li H, Blacque OE, Li L, Leitch CC, Lewis RA, Green JS, Parfrey PS, Leroux MR, Davidson WS, Beales PL, Guay- Woodford LM, Yoder BK, Stormo GD, Katsanis N, Dutcher SK (2004) Comparative genomics identifies a flagellar and basal body proteome that includes the BBS5 human disease gene. Cell 117 :541-552 22. Church GM (2005) The personal genome project. Mol Syst Biol 1:2005 0030 23. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 33 :D514-517 24. Robinson PN, Kohler S, Bauer S, Seelow D, Horn D, Mundlos S (2008) The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am J Hum Genet 83 :610-615 25. Ayme S (2003) [Orphanet, an information site on rare diseases]. Soins :46-47 26. Bankier A, Keith CG (1989) POSSUM: the microcomputer laser-videodisk syndrome information system. Ophthalmic Paediatr Genet 10 :51-52 27. Winter RM, Baraitser M (1987) The London Dysmorphology Database. J Med Genet 24 :509-510 28. Langfelder P, Zhang B, Horvath S (2008) Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24 :719-720 29. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25 :25-29 30. Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, et al. (2007) Ensembl 2007. Nucleic Acids Res 35 :D610-617 31. Allanson JE, Biesecker LG, Carey JC, Hennekam RC (2009) Elements of morphology: introduction. Am J Med Genet A 149A :2-5 32. Snyder M, Weissman S, Gerstein M (2009) Personal phenotypes to go with personal genomes. Mol Syst Biol 5:273

94

Chapter 6: General Discussion

95

Chapter 6: General Discussion

6.1 Summary of findings In this thesis we set out to demonstrate that phenotypically similar diseases do tend to be caused by mutations in functionally related genes, and that this phenomenon can be used to identify potential novel disease genes (Chapters 1 & 2 and refs. [1,2]). We did this by investigating the degree to which genes causing the same disease are functionally related at the molecular level, using two measures of functional relationships – physical protein-protein interactions and correlated expression profiles (co-expression). As many gene functional data are acquired from model organisms rather than human material, we also investigated the utility of such data from other species for elucidating human disease pathobiology. Finally, we also investigated the relationship between disease phenotypes and disease genetics from the phenotypic angle to find out if the relationship extends beyond individual diseases to phenotypically overlapping but distinct diseases. In the process we identified phenotype database properties that maximize their utility for disease investigation.

We began our investigations using genetically heterogeneous diseases, which are diseases that can be caused by mutations in one of several alternative genes. We chose to start with them as they represent disease mutations with maximal phenotypic overlap, i.e. clinically indistinguishable phenotypes. If there are functional relationships between genes causing similar phenotypes, they should be strongest for those genes that lead to virtually identical disease phenotypes when mutated.

In addition to choosing disease mutations with the strongest possible phenotype similarity, we also chose to begin with one of the strongest possible measures of gene functional relationships, namely physical protein-protein interactions (PPIs). If two proteins interact physically and specifically with each other, then they can be expected to be involved in the same process. This is especially the case for proteins that interact with each other in protein complexes. We investigated whether genes underlying genetically heterogeneous phenotypes are more likely to interact with each other physically at the protein level using protein-protein interaction data from several sources (Chapter 3). We used primarily PPIs based on high- throughput experimental assays, but also included literature-derived PPIs in the analysis. We also investigated whether PPIs can be used to identify novel disease genes in positional candidate disease genes in genomic loci known to be associated with particular diseases, but for which the causative gene had not yet been identified. As the amount of PPI data from human sources is limited, we also included PPI data from other species. We used both binary interaction data from yeast two-hybrid protein interaction assays as well as protein interactions within complexes identified through mass spectrometry.

We showed that proteins of genes causing the same disease have a tendency to interact with each other physically. This tendency is tenfold greater than their tendency to interact with non- disease-related genes. We show that this information can be use to improve positional candidate disease gene prioritization, as PPI partners of known heterogeneous disease genes that lie in another genomic locus associated with the same disease are more likely candidates than neighboring genes in the locus. Based on this premise we make several predictions of

96

Chapter 6: General Discussion

novel disease genes, some of which have been confirmed. Importantly, we found that even protein-protein interactions detected in other species can be predictive for human diseases. We found that high-throughput PPIs taken from organisms as evolutionarily distant as baker’s yeast (Saccharomyces cerevisiae ) are still capable of predicting heterogeneous disease genes as well as those based on human material. This is partly due to the increased predictive power of proteins involved in the same complexes relative to those shown to interact using the binary yeast two-hybrid approach, as was later demonstrated by others[3]. However, we also found that literature-based PPIs are strongly biased toward known disease genes, and are probably not representative for the identification of novel disease genes. Our findings suggest that those disease genes that could be detected using literature-based PPIs have unsurprisingly already been identified by the research community. Our PPI findings support the notion that similar disease phenotypes are caused by mutations in genes that are functionally related.

In Chapter 4 we investigated whether a different kind of functional genomic data, namely correlated mRNA expression patterns, can also lead to similar results. In contrast to the previously used protein-protein interaction data, mRNA expression patterns can be determined efficiently and cost-effectively on a genome-wide scale meaning that these data are far more abundant and comprehensive. If genes causing similar disease phenotypes are likely to be functionally related, then genes involved in genetically heterogeneous diseases should have a relatively higher tendency to be coordinately expressed (co-expressed) across tissues and conditions. However, mRNA expression levels are but one level of gene regulation; gene expression can be coordinated at several different levels including transcription, translation and protein stability[4]. mRNA levels can therefore be expected to be a noisy proxy for gene expression, and this is indeed the case[5,6]. This led us to further investigate how these mRNA expression data can be best exploited to maximally reflect functional relationships between genes and their ability to rank genuine disease genes amongst unrelated positional candidate disease genes. To this end we investigated different expression data sets and normalization procedures, but we primarily investigated the use of evolutionary conservation of co-expression to strengthen the reliability of these co-expression ties between genes. This evolutionary angle and the deeper investigation of expression data pre-processing together set this study apart from others that use co-expression data for candidate disease gene prioritization, such as Prioritizer[7], ENDEAVOUR[8] and G2D[9].

We showed that genes causing the same disease do indeed have a higher tendency to be coordinately expressed across tissues and conditions. There is a clear tendency for heterogeneous disease genes to have a higher co-expression with fellow disease genes in other genomic loci than with other neighboring genes in those loci, despite the tendency for genes located close to each other on the genome to have similar expression levels[10]. We further showed that conserved co-expression is a stronger indicator of shared pathology than co-expression in human tissues alone. The requirement for genes to be coordinately expressed in different species filters out less consistent and therefore less reliable relationships between genes. We also demonstrated that mRNA expression data choice and pre-processing can have a substantial impact on their performance in co-expression-based ranking of positional candidate disease genes. We found that pooling microarray expression datasets from different

97

Chapter 6: General Discussion

sources is detrimental to their performance, introducing extra noise. We also show that relative expression levels perform better than absolute expression levels. In summary, we found that when it comes to identifying common involvement of genes in diseases using mRNA co- expression data, quality is better than quantity. While the predictive value of (conserved) co- expression by itself is more limited than those of other kinds of gene functional data such as protein-protein interactions, it should nevertheless be a useful data source for integrative candidate disease gene prioritization tools such as Prioritizer[7] and ENDEAVOUR[8] that blend several different kinds of gene functional data together.

The utility of these protein-protein interactions and (conserved) mRNA co-expression for candidate disease gene prioritization is consistent with those of other studies (many of which are described in Chapter 2), and confirms the notion that similar disease phenotypes are caused by functionally related genes. However, while there are large and growing amounts of gene functional data to use in such prioritization tools, disease phenotypes have so far not been given the same level of consideration. Nevertheless, the phenotype is critical to disease definition upon which further molecular analyses are based, and is also a rich source of readily available information about the disease. It is therefore of great importance to investigate how best to describe disease phenotypes for research purposes. It is clear that phenotypically very similar diseases tend to be caused by functionally related genes, but can this relationship be extended further to diseases that have overlapping but non-identical phenotypes? To what extent could such a relationship be exploited in disease analysis using existing disease phenotype databases? And how can these disease phenotype descriptions be improved? What kinds of phenotype information are important from a pathobiological point of view? We investigated this using systematically annotated disease phenotypes from several existing disease databases (Chapter 5). Given our focus on the phenotype, we took a broad view of gene functional relationships in this study, using gene ontology (GO) functional annotation to identify functional links between genes instead of restricting ourselves to a single kind of functional data as we had done in the previous studies.

We showed that the tendency of functionally related genes to lead to phenotypically similar diseases extends beyond genetically heterogeneous diseases. Even when more broadly defined disease phenotype clusters are used, causative genes tend to be functionally related. We further identified several properties of phenotype databases that improve their utility for phenotype-based disease analyses. We found that both deep trait hierarchies and the use of trait frequency estimates can improve the biological coherence of phenotypic disease clusters. Somewhat contrary to previous expectations we found that overall disease phenotypic similarity is more important than the sharing of specific rare traits. This is reflected in the fact that the frequently used Inverse Document Frequency weighting scheme is actually harmful when applied to well-annotated phenotypes, only benefiting under-annotated phenotype comparisons. Above all, we highlighted the importance of complete and systematic disease phenotype annotation, and emphasize the need to pay more attention to this aspect of disease research in the future.

98

Chapter 6: General Discussion

Together, these results demonstrate the genetic relatedness of similar disease phenotypes and underscore the validity of using gene functional information to prioritize disease candidate genes for both genetically heterogeneous diseases and syndrome families. Such phenotype- based bioinformatic analysis of candidate disease genes will become even more important in the near future with the advent of widespread individual whole-genome re-sequencing, due to the vast quantities of data that will need to be interpreted. In the rest of this discussion, we take a look at these expected future developments and the important role phenotype-based functional genetic analyses can be expected to play in them.

6.2 Moving from locus-based to genome-wide approaches

6.2.1 Locus-based candidate disease gene prioritization is becoming less relevant

In the past, the focus of genetic disease research has been on identifying genes in which mutation leads to the disease under investigation. Purely candidate gene-based approaches were previously hampered by limited functional biological knowledge, so linkage mapping has historically been the major source of disease gene localization information—particularly since the use of large numbers of naturally occurring DNA polymorphisms as molecular markers became practical[11]. This has resulted in the identification of large numbers of disease- associated genomic loci[12], from which disease genes can and have been cloned. However, as this process is laborious and costly, it was important to prioritize these positional candidate genes for cloning in order to minimize the number of genes that need to be tested before the disease mutation is found. This prioritization of positional candidate disease genes was a primary focus of this thesis, in which different kinds of functional genomic data were shown to improve positional candidate disease gene ranking. At the present day, even though disease genes have been cloned from an ever-increasing number of these loci, there still remain many disease loci that are yet to divulge their mutant genes (http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM).

The situation is set to change in the near future, as the current rise in the application of deep sequencing to genetics research will eventually eliminate the need to prioritize positional candidate genes for sequencing. This will be brought about by the ability to sequence whole disease-associated loci in parallel and without the need for primer design, and at relatively low cost. However, such deep genomic sequencing will bring with it challenges of its own due to the plethora of rare genomic variants (alleles) that are bound to be uncovered by deep-sequencing disease loci. Such rare variants can resemble disease mutations but are not necessarily related to the disease in question. Given the fact that each individual’s genome contains more than 3 million SNPs and hundreds of thousands of structural variations[13-16], many of which are rare, every sequenced locus can be expected to contain several of these disease-irrelevant rare genomic variants that resemble mutations. Indeed, a recent survey of all the exons on the X- chromosome in over 200 families found that up to 1% of X-chromosome genes can contain rare truncating mutations that are nevertheless non-pathogenic [17,18]. When an entire disease

99

Chapter 6: General Discussion

locus is sequenced including introns and intergenic regions, even more rare genomic variants will be found that may or may not be involved in gene regulation. These rare non-coding variants will be even more difficult to interpret than coding variants.

Despite these problems, identifying which gene in the locus is causative for the disease should remain tractable as long as the region is sequenced in multiple patients from unrelated families. The disease genes should always be affected, while the disease-unrelated rare genomic variants are, by virtue of their rarity in the population, not likely to recur in different unrelated families. Even if different mutations are involved in different families, it should be possible to get an idea about which genes in the locus are likely to be involved in the disease due to the fact that they are consistently mutated in patients. As a result, bioinformatic candidate disease gene prioritization will probably play a more limited role in future locus-based genetic analyses than it currently does, although it may still play an important role in cases where few unrelated patients are available. Therefore, existing positional candidate disease gene ranking approaches – such as those described in this thesis – will need to be retooled and repurposed in order to remain relevant and useful to disease research.

6.2.2 Genome-wide candidate disease gene prioritization is becoming more relevant

These locus-based deep sequencing approaches are mainly suited to the identification of a subset of disease genes, namely those that those that cause diseases which are amenable to linkage mapping. This technique is most suited to identifying rare disease genes with a strong impact on disease phenotype[11], which facilitates the interpretation of the genomic variants detected by deep sequencing as described in the previous section. If mutations involved in complex diseases or modifier genes that have a more limited impact on Mendelian disease phenotype are to be identified, it will be necessary to expand the search space to the whole genome as it will be difficult to find specific loci harboring major causative genes to which the search can be restricted. Such analyses were beyond the scope of this thesis, but they will become more and more important in the near future as complex disease genetics is tackled and Mendelian diseases are elucidated in ever more detail.

While expanding the search space to the whole genome is necessary to get a more complete picture of disease genetics, it will also bring with it its own challenges that will need to be resolved before the correct disease genes can be identified. It will uncover large numbers of potential disease mutations which will need to be prioritized in order to minimize the time and cost required for follow-up experimental investigation. Prioritization will thus remain essential, it will just be applied to genome-wide candidate disease mutations for follow-up analysis rather than to positional candidate disease genes for sequencing. When taking a genome-wide view of disease genetics like this, it will be essential to have precise disease phenotyping, as subtle differences between phenotypes of the same disease can reflect differences in underlying biology, while subtle similarities between the phenotypes of apparently unrelated diseases could point to commonalities in their underlying biology. This issue of proper disease phenotyping will be discussed in more detail later, after we first take a look at the role of bioinformatic analysis in genome-wide candidate disease gene prioritization.

100

Chapter 6: General Discussion

6.2.2.1 Bioinformatic analysis is important for genome-wide association studies

To a certain degree, this expansion of the search space to the whole genome is already being used in genome-wide association studies (GWASs), which are used to investigate the contribution of genomic loci to complex diseases on a genome-wide scale. This genome-wide scale, combined with larger study sizes that have more power to detect modest genetic effects, has had a substantial impact on the field of complex disease genetics which until recently had met with very limited success[11]. Earlier association studies were centered on specific candidate genes for practical reasons, but detected associations that were usually difficult to reproduce. However, recent advances in genotyping technology combined with results from the HAPMAP project[19] have led to a sharp rise in the number of GWAS loci reproducibly implicated in complex diseases, from tens of loci to hundreds of loci[11].

Despite this increase in detected loci, such GWASs are still limited in the amount of genetic risk that they can explain. In contrast to linkage mapping which is best suited to identifying rare variants with strong phenotypic effects, GWASs are most suited to identifying common genetic variants, which generally contribute only modestly to the disease phenotype[11]. They are much less suited to detecting rare variants, even those that have larger effects on the phenotype. Additionally, their ability to detect a variant’s contribution to the phenotype depends on both the size of its effect and the study size. The large numbers of SNPs tested in such studies also require the use of stringent cut-offs for determining the statistical significance of associations in order to correct for multiple testing. This can lead to bona-fide associations being overlooked because they do not meet such stringent statistical requirements. Apart from the new difficulties introduced by testing larger numbers of genetic markers, the larger population sizes needed to improve statistical power increase the risk of including structured sub-populations in analyses which can lead to false associations – although the large numbers of SNPs tested in GWASs at least enable such sub-structuring to be detected[11]. These statistical issues are compounded yet further when gene-by-gene or gene-by-environment effects are taken into account[20].

Given the limitations of such purely statistical approaches, extra information will be required to identify which genes are involved in these complex diseases. Existing knowledge about gene function will need to be exploited, reinvigorating the candidate gene approach which has met with more limited success in the past. The current fast pace of discovery in the biological sciences and the vast amounts of new data that are being generated are making such an approach more and more feasible. Indeed gene functional information is already being exploited in current GWASs to make sense of the associations[21-23].

One way in which it is applied is to find common denominators between different candidate regions, analogously to some existing approaches to Mendelian candidate disease gene prioritization (Chapters 1 & 2). GWASs identify SNPs that may not themselves contribute to the phenotype, but simply in linkage disequilibrium with the actual causative variation, within a region that generally spans tens of kilobases[11]. Additionally, the majority of associations detected by GWASs lie in introns or intergenic regions[24]. These complications mean that the number of potential disease genes within a disease-associated locus can range from several

101

Chapter 6: General Discussion

down to none at all. However, as multiple loci are generally detected for the same disease common biological denominators can be sought between genes in different disease-associated loci. This approach has successfully led to the identification of novel pathways involved in different complex traits, such as hypothalamic function in obesity[25,26] and a few pathways that are shared between several immune-related diseases[27]. Once such functional commonalities are identified, they can guide the further search for candidate disease genes, even in loci that may not meet the statistical criteria used in such association studies[21-23].

This bioinformatic approach also overcomes a second limitation of purely statistical approaches, in that it automatically identifies interactions between disease genes. Though interactions between disease loci can be detected statistically with GWASs, the vast number of potential gene-by-gene as well as gene-by-environment interactions impose insurmountable numerical limits on their use in practice[20]. In contrast, the experimental detection of biological relationships between genes is ever increasing and does not suffer from any such limitations. These known interactions between disease genes highlight the nature of the biological relationships between them directly, in contrast to the GWAS statistical associations that still need to be interpreted in a biological framework. Functional approaches should therefore provide a more productive avenue for finding such interactions amongst disease genes or between them and environmental conditions than purely statistical approaches. Current bioinformatic approaches to disease gene prioritization such as those described in this thesis should be applicable to these problems once they are updated for genome-wide application[21].

6.2.2.2 Bioinformatic analysis is important for the interpretation of rare genomic variants

Ultimately, genome re-sequencing must eventually replace SNP genotyping if the full spectrum of mutations contributing to complex disease is to be elucidated. Re-sequencing enables the detection of all genomic variants within individuals rather than just a pre-determined set of SNPs which occur relatively frequently in the population. Such an approach can enable one of the main limitations of GWASs to be overcome—the inability of association studies to detect rare genomic variants. It is now clear that rare variants play a major role in complex traits, as even the largest GWASs fail to account for more than a few percent of the genetic component of complex traits[11]. This is perhaps unsurprising as, from an evolutionary perspective, common genetic variants are not likely to have a strong detrimental effect on the organism’s fitness. Therefore, disease-causing variants can generally be expected to be rare in the population[28]. The extreme case of this inverse relationship between mutation severity and frequency is embodied by monogenic diseases, which constitute very rare mutations with very strong detrimental effects on the phenotype.

There are also arguments for the persistence of detrimental genetic variants in the population, such as late onset of disease, antagonistic pleiotropy, and changing environments[29]. Nevertheless it appears that rare missense SNPs are indeed under negative selection which provides further support for this model of complex disease[29,30]. There is also empirical evidence for this phenomenon, as obese individuals have been found to have an overrepresentation of rare SNPs in obesity-related genes[31]. A consequence of this genetic

102

Chapter 6: General Discussion

architecture is that different patients may have different combinations of rare alleles contributing to the complex disease phenotype, complicating genetic analysis. This is somewhat analogous to genetically heterogeneous diseases, in which different genes are mutated in different patients but all result in the same phenotype. Therefore, as with genetically heterogeneous diseases, bioinformatic analysis can be used to prioritize rare genetic variants that might contribute to complex disease. For instance, pathways can be sought whose genes are overrepresented amongst rare genomic variants in patients relative to controls, even if the mutated genes differ between subsets of patients.

6.2.2.3 Bioinformatic analysis can help identify modifier genes for monogenic diseases

Not only complex diseases benefit from a genome-wide approach to disease mutation analysis, even monogenic or oligogenic diseases can benefit from such a broad search space. These diseases are more amenable to linkage analysis, for which functional candidate gene prioritization will become less important with the widespread adoption of deep sequencing. However, they usually have phenotypes that are quite variable between patients, and which can be influenced by genetic variation in modifier genes located elsewhere on the genome. These modifier genes can be difficult to detect, and though suspected to be common, relatively few have been identified to date[32]. Bioinformatic gene function analyses can be used to prioritize candidate modifier genomic variants that may affect the associated disease phenotype by evaluating their functional relationships with primary disease genes.

Given the large amount of phenotypic variation present in most syndromes, it is likely that modifier genes are common both in terms of number of genes in the genome and, for at least some genetic variants of modifier genes, also in terms of allele frequency in the population. For example, a relatively common allele of the RPGRIP1L gene has been found to function as a retinal phenotype modifier in several disparate ciliopathies for which retinal involvement is variable[33]. Other rare mutations in the same gene are also likely to function as determinants of retinal involvement. Interestingly, not only does this gene function as a retinal modifier in ciliopathies that are caused by other genes, it can itself also be a primary disease-causing gene for one of the ciliopathies – Joubert syndrome[34,35]. This further emphasizes the blurry distinction between primary disease genes and disease modifier genes[36], and strengthens the case for using biological relationships between genes to identify candidate disease genes regardless of the nature of their involvement with the disease. Once again, the genome-wide scope of such analyses will necessitate the use of automated bioinformatic approaches to their investigation.

103

Chapter 6: General Discussion

6.3 The importance of detailed phenotype characterization for genome- wide approaches

6.3.1 Interpreting genome-scale variation data requires detailed phenotype descriptions

We are currently moving into an era of genetic analyses at the level of entire individual genomes. At this scale, we have an overview of all the genetic variation within the individual that can influence his or her phenotype. However, not all aspects of a patient’s phenotype will be involved in the disease pathology, and conversely, subtle aspects of the patient’s phenotype that are involved in the pathology may be overlooked. As a result, it will be necessary to generate as complete a view of the phenotype as possible in order to be able to interpret the significance of genetic variants across the entire genome. Moreover, as the phenotype is influenced by environment as well as genetics, it will also be highly desirable to chart the living environment, lifestyle and life history of patients. Only with a complete view of the phenotype can all the genomic variants potentially be elucidated. The main problem with such comprehensive data gathering is that such data will inevitably contain large amounts of irrelevant information in addition to the information that is pertinent to understanding the patient’s condition. To complicate matters further, interactions between genes and environment and between allelic variants of genes at different loci will further obscure the relationship between genotype and phenotype. For these reasons, it is currently unfeasible to map the full spectrum of genomic variants to their phenotypic effects in any one patient, let alone the entire disease phenome. Nevertheless, with sufficiently detailed phenotype descriptions, substantial progress can be made in this direction.

Detailed phenotype descriptions can increase the resolution at which genetic analyses can be performed, which is essential as these analyses expand to encompass entire genomes. As previously mentioned, there are statistical limits to the ability of GWASs to identify contributory complex disease loci, especially when interactions are taken into account. Similar limitations will also apply to large-scale individual genome re-sequencing. It will therefore be of great importance to use phenotype descriptions that are as detailed as possible for such genome- wide analyses, in order to minimize the conflation of different but related disease processes in these studies. Additionally, given the large number of genes involved in complex diseases, it is likely that different genes could be involved in different patients or populations. For such cases, subtle but consistent phenotypic differences between disease populations could reflect differences in their underlying disease gene complements.

There are several ways in which phenotype descriptions that are too coarse-grained could obscure the underlying disease genetics of a particular disease. Firstly, broadly defined phenotypes may fail to distinguish between different but related phenotypes and secondly, they may fail to reflect phenotypic substructure. Finally, in addition to obscuring subtle differences between phenotypes, coarse-grained phenotyping could also obscure subtle similarities between phenotypes that are globally quite different from each other.

104

Chapter 6: General Discussion

6.3.2 Detailed phenotype descriptions help discriminate between similar phenotypes

With phenotype descriptions that are too broadly defined, different but related phenotypes may be classified as a single phenotype. In other words, different phenotypes may be lumped together, potentially obscuring details that could reflect important differences in their underlying pathobiology. This phenomenon can be illustrated by taking a closer look at some of the genetic underpinnings of type 2 diabetes and obesity. For instance, genetic variants associated with type 2 diabetes (T2D) risk can differ between obese and non-obese patients, with gene variants involved in insulin action being more important in obese patients while those involved in insulin secretion are more important in lean patients[37]. It is therefore important to distinguish between the two groups when phenotyping T2D patients for studies.

Even obesity itself is a broad phenotype. Association studies on obesity have generally used body mass index (BMI) as a measure for obesity due to its ease of measurement. However, this measure (weight/height 2) does not directly measure body fat content (adiposity), and it completely ignores fat distribution. Fat is not the only tissue type that contributes to BMI, and excess adipose tissue could be primarily subcutaneous or concentrated in the visceral region[38]. These obesity-related phenotype variants could vary in their underlying genetics. An example of this is provided by the familial partial lipodystrophies, genetic disorders that are associated with abnormal distribution of adipose tissue across the body[39,40]. Mutations in the peroxisome proliferator-activated receptor-γ (PPARG), a gene which has been associated with obesity[41], can cause familial partial lipodystrophy type 2[42,43]. PPARG is involved in adipocyte differentiation[44]. This is a biological function that differs from others which have also been associated with obesity, such as for instance hypothalamic function[45]. Therefore, refining the obesity phenotype description with information about fat distribution as well as fat content and BMI can enable more detailed bioinformatic analysis of the different aspects of obesity pathobiology.

6.3.3 Detailed phenotype descriptions facilitate identification of phenome substructure

Not only does coarse-grained phenotype definition obscure similar but distinct phenotypes, it can also gloss over phenotypic substructure in the disease phenome. Disease phenotypes frequently consist of a constellation of features or sub-phenotypes which could exist independently of the overall phenotype and might reflect different biological processes that are involved in the associated disease. For instance, Senior-Loken syndrome, which affects retina and kidneys, can be considered a sub-phenotype of Joubert syndrome which reproduces those symptoms and additionally affects the cerebellum[2]. Orofacial clefting can occur as an isolated feature or can co-occur with other features in more than 350 Mendelian syndromes[46]. Such phenome substructure can assist in disease genetic analysis (Chapter 2 and ref. [2], but requires modular phenotype descriptions at sufficient resolution to detect the substructure.

There are also several syndromes which co-occur more often than expected by chance, suggesting related genetic processes[47]. Such co-occurring diseases can also be viewed as combinations of sub-phenotypes. The best known example of modular sub-phenotypes is

105

Chapter 6: General Discussion

probably the endophenotype concept from neuropsychiatry[48-51]. These are subclinical traits of a neuropsychiatric disorder that can occur independently of it. Autism, for instance, can be viewed as a combination of language and communication deficits, repetitive behavior, and social unresponsiveness [52]. These endophenotypes are only modestly correlated (r = 0.1- 0.4)[53,54], and can segregate independently in unaffected family members[55,56]. They may therefore reflect underlying molecular processes more directly than the full autism phenotype[53,57]. Describing phenotypes such as these at lower levels of granularity enables these phenotypic overlaps between syndromes to be analyzed more systematically, and provides better starting points for the genetic analysis of these diseases.

6.3.4 Detailed phenotype descriptions improve comparisons between diseases

Coarse-grained phenotype descriptions not only obscure differences between similar phenotypes, they can also obscure similarities between different phenotypes. They can lead to subtle phenotypic links between apparently unrelated diseases being overlooked, which might obscure subtle biological relationships between the diseases. Indeed, even if the diseases are already known to share other features with each other, such hidden links can further expand the set of affected biological functions that can be ascribed to common pathological processes, giving further clues as to what these might be. An example of such a subtle phenotypic link between related diseases is anosmia, which is a common feature in ciliopathies[58]. Cilia are important in the functioning of sensory neurons[59], amongst which photoreceptors, inner ear hair cells and olfactory neurons. Therefore, impairment of these senses can potentially indicate cilia dysfunction. However, while vision and hearing impairment are readily identifiable symptoms due to the importance of these senses in normal human functioning, anosmia is a more subtle feature that is more easily overlooked[60]. Nevertheless, given the large number of genes that can cause vision or hearing impairment (http:// http://www.ncbi.nlm.nih.gov/omim/), the presence of anosmia as an additional symptom can be important in indicating what manner of molecular dysfunction underlies a disease.

Another example is the “molar tooth sign”[61], a cerebellar and brain stem malformation characteristic of several ciliopathies related to Joubert syndrome[62]. Some of these ciliopathies – such as Nephronophthisis – have not generally been associated with such neurological dysfunction[63], but the presence of this feature in patients with these syndromes highlights their etiological links to Joubert syndrome[64]. There are undoubtedly many other subtle phenotypic links to be observed between diseases, but detecting commonalities such as these will require particularly fine-grained and comprehensive phenotype descriptions.

6.3.5 Detailed phenotype descriptions should also include environmental information

In addition to genetics and phenotypes, a third axis in disease analysis is environmental influence. It will play a more important role in diseases where genetic contribution is more subtle, such as in complex diseases, but could in principle be relevant for any disease. Therefore, though it is not really a part of the patient phenotype, it would nevertheless be desirable to include as much environmental information as possible during the collection of

106

Chapter 6: General Discussion

patient phenotype descriptions. However, it will be very difficult to obtain truly comprehensive environmental data that are on a par with the genomic or even phenotypic data, as patient life histories cannot be directly observed in the clinic. Nevertheless, there are still many forms of environmental and life history information that can be gathered using simple questionnaires or taken from medical records, as is being done for instance in the Personal Genome Project[65]. Furthermore, as with genetic and phenotypic data, some environmental information will be more important than others and the most prominent factors – such as lifestyle and diet – should also be amongst the most readily available.

Analyzing such environmental data could be just as challenging as acquiring them. Due to the diversity of possible gene-to-phenotype interactions, it will probably be necessary to conduct such studies on specific subpopulations as different populations with different genetic backgrounds might respond differently to the same environmental conditions. An example of this are the Pima Indians from Arizona, who have a higher incidence of type 2 diabetes than people of European descent living under the same conditions[66]. This is not just a genetic effect as exemplified by the fact that Pima Indians living under different dietary and exercise conditions in Mexico do not have an equivalently high incidence of type 2 diabetes[66,67]. As different genetic backgrounds complicate the gene-environment interactions, the best approach to the elucidation of environmental effects is probably the investigation of phenotypically discordant monozygotic twins, though even they are not necessarily (epi)genetically identical[68,69]. Regardless of the approach taken, it is clear that genetic information alone is not sufficient to fully understand most disease processes, and it will be important to record as much environmental information as possible when phenotyping patients for disease analysis purposes.

6.3.6 Formalized phenotype descriptions are necessary to exploit increasing detail

It is clear that more detailed phenotype descriptions are going to be needed if disease genetics are to be investigated at the whole genome level. However, in order to fully exploit such phenotypic information, it will be necessary to formalize and standardize its collection and encoding[70-73]. This issue is discussed in more detail in Chapters 2 & 5 and will not be treated comprehensively here. Importantly, we show in Chapter 5 that for the purpose of bioinformatic phenotype analysis, systematically described disease phenotypes based on feature ontologies are far superior to those taken from a database that was not designed to use ontology-based phenotype descriptions. For bioinformatic analyses to be maximally effective, disease phenotypes should be comprehensively recorded in the clinic using standardized approaches[70,71], and stored in structured databases that use feature ontologies to describe disease phenotypes[74-77]. While current ontologies are generally oriented towards morphological features, it will be important for them to be expanded to include as much information as possible about the phenotypic effects of disease mutations, including the less obvious molecular phenotypes[78].

Properly designed feature ontologies, coupled with comprehensive phenotype annotation, can enable disease phenotypes to be systematically described at the level of detail required to

107

Chapter 6: General Discussion

capture the nuances both within and between diseases. For monogenic diseases in which individual rare mutations exert a large effect on the phenotype, it is easier to identify the set of phenotypic features that are affected by the mutation. Nevertheless, even such monogenic diseases are rarely uniform in their phenotype, frequently due to the effects of modifier allelic variants at other loci[32]. For complex diseases this situation is compounded even further, as the disease process is probably primarily due to this genetic background of modifier genes. All this genetic variation can be expected to influence the phenotype in different and potentially subtle ways, necessitating such detailed descriptions. The formalization of disease phenotype data collection and recording will therefore be required to describe them in sufficient detail to fully interpret genome-wide genetic data.

6.4 The Future: Comprehensive elucidation of disease processes The role of bioinformatics in disease research is set to grow even larger in the future, as more and more data become available that need to be correlated with each other. Although the nature of the analyses will change to adapt to the new circumstances, the general approaches to disease bioinformatics outlined in Chapter 2 should remain broadly valid. The search for functional relationships between candidate and known disease genes for phenotypically related diseases – based on the premise that diseases are genetically coherent – should remain a fruitful avenue of research. However, the scale and scope of these analyses will need to be increased if disease pathobiology is to be understood in its entirety.

6.4.1 Diseases are genetically coherent

Genetic diseases are caused by disruptions in biological processes, either developmental or homeostatic, in which multiple genes can partake. Such processes could potentially be disrupted by mutations in a single critical gene, but could also be disrupted by milder mutations in multiple less critical genes, whose functioning could in turn potentially be affected by environmental influences. This biological process-oriented view of genetic diseases is further strengthened by the phenomenon of genetically heterogeneous diseases, that can be caused by mutations in any one of several different genes. Such sets of genes frequently constitute proteins in the same complex, such as Fanconi Anemia[79], Hermansky-Pudlak syndrome[80] and Usher syndrome[81]. Alternatively they could be genes involved in the same metabolic pathway, such as Congenital Disorders of Glycosylation[82]. They could also be genes involved in the functioning of the same subcellular organelle, such as the cilia in Bardet-Biedel syndrome[83].

The fact that even complex diseases can be caused by mutations in single genes[84-86], and the identification of common pathways uniting complex trait candidate genes[25,27,38], indicate that even complex diseases are genetically coherent. This means that bioinformatic analysis of gene functional relationships should also be a productive approach to their investigation. It also argues for the continued investigation of rare Mendelian diseases despite the current interest in complex diseases, as they are more suitable for investigating the underlying biology due to their

108

Chapter 6: General Discussion

genetic simplicity[84-87]. Given the ever-increasing quantities of genome-wide genetic data and the rising interest in more sophisticated phenotypic descriptions, it should eventually become feasible to fully elucidate disease pathobiology, identifying all the biological processes involved.

6.4.2 Our current knowledge of human biology is limited

However, our current knowledge of human functional biology is limited and much more needs to be learned before we will be capable of elucidating all disease processes. There are still a large number of uncharacterized protein-coding genes (only 11234 proteins with manually curated Gene Ontology annotation as of GOA release 73.0, April 2009; http://www.ebi.ac.uk/GOA/human_release.html), some of which are undoubtedly involved in as yet unidentified molecular processes. The last few years have witnessed the discovery of previously unrecognized biological processes such as RNA interference[88] and the diversity of histone modifications[89]. They have also led to the realization that gene regulation plays an unexpectedly large role in human biology. The majority of evolutionarily conserved human DNA base pairs do not code for proteins[90], and recent genome-wide association studies have identified disease-associated variants primarily in non-coding intronic and intergenic regions[24]. Chromatin immuno-precipitation (ChIP) studies of transcription factors have similarly identified vast numbers of intronic and intergenic binding sites[91], which may correspond to long-range transcription regulatory elements such as enhancers and insulators[92], and could be involved in disease[93]. It is now realized that the majority of the genome is transcribed[94], and several new classes of non-coding RNA species have been identified[95]. These novel non-coding RNAs are generally involved in the regulation of transcription through epigenetic means[96] or the regulation of translation[97], and can be involved in disease[98,99]. Repetitive DNA elements such as transposons, previously thought to be purely parasitic DNA, are now recognized to also have roles in gene regulation[100-102]. Additionally, disease phenotypes can be caused by non-coding tandem repeat expansions[103] or repeat contractions[104], mechanisms which probably also affect local chromatin organization. It is a safe bet that yet more unexpected biological phenomena remain to be discovered which may be involved in disease processes.

6.4.3 New data types and approaches are becoming available to help remedy this

Given these biological complexities, the vast number of non-coding genomic variants that will be uncovered[13-16] will be difficult to interpret. Happily there are also novel types of functional genomic data being produced that can aid in this process (Table 1), in addition to ever increasing quantities of the more traditional kinds (Chapter 2).

109

Chapter 6: General Discussion

Table 1: Emerging data types for bioinformatic candidate disease gene analyses.

Data type Data type Description Comments class

Population genetic variation data such as Single Nucleotide Polymorphisms Population (SNPs) and Copy Number genetic Population genetics can help distinguish common Polymorphisms (CNPs) have been sequence from rare variants in patient DNA. available for some time, but increasing data quantities will help build more complete pictures of population variation.

Interpreting evolutionarily conserved Can be used to identify functionally important but Evolutionary Evolutionary but unannotated genomic regions unannotated regions in the genome. The phylogenetic history sequence remains challenging. Additionally, not extent of conservation can give further clues about the conservation all functionally important elements are possible functions of the regions. necessarily evolutionarily conserved.

Some organisms have evolved features which The mechanisms by which these correspond to disease states in humans, such as phenotypes are generated in these Evolutionary reduced eyes (microphthalmia) in cave-dwelling fish. other species might differ from how mutant Comparing their genomes with those of closely related they occur in patients. However, they models species without those features can give insights into could still give insights into the the processes involved, which can in turn aid in the processes involved. investigation of human diseases[105,106].

Similar to Quantitative Trait Loci (QTLs), using Particularly popular for investigating Genetical molecular phenotypes instead of anatomical transcriptional regulatory relationships genomics phenotypes[107]. Used to identify genomic variants by using gene expression levels as that correlate with the magnitude of the molecular trait. the quantitative traits.

Chromatin Immuno-Precipitation can be followed by Chromatin either hybridization to tiling arrays (ChIP-chip) or high The newer ChIP-seq technology is Immuno- throughput sequencing (ChIP-seq) to identify DNA more sensitive and specific than ChIP- Precipitation sequences bound to target proteins. Can be used for chip[108]. Consequently it is currently Functional (ChIP) data genome-wide identification of transcription factor replacing the older technology. Genomics binding sites and histone modifications.

Used to determine cellular RNA expression levels. In High contrast to microarrays, it does not depend on pre- A disadvantage of current high throughput selected probes, so it can also detect novel throughput sequencing technologies is RNA unsuspected RNAs. It also does not suffer from a that the individual sequence reads are sequencing background signal and has a larger dynamic range short which can cause problems when (RNA-seq) than microarrays, enabling it to detect lower mapping them to the genome. abundance transcripts than microarrays[109,110].

Non-coding Non-coding RNAs play a larger role than previously There are several different classes of Gene Types RNAs expected in gene regulation[95]. Databases exist that ncRNAs, with several different (ncRNAs) catalog ncRNAs[111-115]. functions[95].

Detailed phenotype descriptions are important to guide the interpretation of There are several ongoing projects to systematically all previously mentioned functional catalog human disease phenotypes using structured data types. Coarse-grained phenotype Phenotype Phenotype feature vocabularies[76,116]. These will enable classification could lead to the systematic and comprehensive analyses of conflation of different biological phenotypic relationships between diseases. processes which actually have distinct phenotypic effects[38].

110

Chapter 6: General Discussion

More and more genomic data are becoming available for more and more individuals, increasing the viability of population genetics-based approaches to disease analysis. SNP frequencies can be analyzed for evidence of either purifying or positive selection[29,30,117-119]. Such analyses are facilitated by the recent HAPMAP project, which mapped the haplotype structure of different populations at a genome-wide scale and identifying common genetic variants[19] . They will be further facilitated by the international 1000 genomes project[120], which will enable rarer alleles to be characterized down to a frequency of about 1% (http://www.1000genomes.org/). Given their population-based nature, these analyses will be more suited to the identification of genetic variants involved in complex diseases or functioning as penetrance or expressivity modifiers of Mendelian diseases, rather than the identification of primary Mendelian disease genes. Nevertheless, such insights will be crucial if disease processes are to be understood in their entirety.

Expanding beyond the human species, the ever-increasing number of species whose genomes have been sequenced enables far more extensive use of evolutionary information than employed in this thesis. Evolutionary sequence conservation of genomic regions enables the identification of functional non-protein coding elements[121]. The larger numbers of sequenced vertebrate genomes enables better detection of the extent to which such elements are evolutionarily conserved[122], which might in turn give insights into the functioning of the elements. For instance, enhancers involved in the regulation of mammary gland development can be expected to be restricted to the mammalian lineage. Another potential application of evolutionary analysis to disease elucidation is the use of evolutionary mutant models. These are species that have evolved traits that mimic human genetic disorders[105,106]. For instance, there are species of Antarctic ice fish which lack hemoglobin and red blood cells. Comparing their genomes with those of related species that do have red blood cells could give further insights into red blood cell biology and anemic diseases[105]. There are many other examples of evolutionary mutant models involving different species[105,106], and the approach is also applicable to breeds of domestic animals such as dogs[123,124]. Unfortunately, the lack of genome sequences for most such species and breeds is a current limitation to this approach[106].

Apart from new evolutionary approaches, new functional approaches are becoming increasingly viable. One approach that has seen a recent rise in interest is genetical genomics, in which genetic mapping and functional genomics are combined to identify genomic variants that are associated with variation in functional genomics data[107]. The approach is analogous to quantitative trait mapping, except that the traits analyzed are molecular rather than physical. For instance, the expression levels of genes or the concentrations of metabolites could be used as phenotypic traits and correlated with specific genomic variants. The use of gene expression levels as quantitative traits is particularly popular, and this sub-field of genetical genomics (expression genetics) is used to investigate gene expression regulatory relationships between and within genomic loci at a genome-wide scale[125].

In addition to novel analytical approaches, disease analysis can also look forward to ever- increasing amounts of novel kinds of data. Given the recent insights into the role of epigenetics

111

Chapter 6: General Discussion

and non-coding RNA in human biology, chromatin immune-precipitation and non-coding RNA data are logical data types to include in future analyses. Chromatin immuno-precipitation followed by sequencing (ChIP-seq)[126] can be used to identify the locations of various chromatin modifications[127], as well as transcription factor binding sites, across the whole genome in a given cell type. This technology succeeds a similar approach in which microarrays were used instead of sequencing to identify the relevant DNA sequences[128]. Datasets such as these are collected in expression databases such as the Gene Expression Omnibus(http://www.ncbi.nlm.nih.gov/geo/) and ArrayExpress (http://www.ebi.ac.uk/microarray- as/ae/). Non-coding RNA data are also available from several databases[111-115].

There are currently several existing candidate disease gene prioritization tools, as discussed in Chapter 2. Future disease gene prioritization will equally require help from computational tools, which will need to be take advantage of these novel data sources. The general approaches outlined in Chapters 1 & 2 – identifying functional links between candidate disease genes and known ones, identifying common functional themes between various disease-associated genomic loci , and identifying candidate genes with functional links to specific aspects of the disease phenotype – will still remain relevant in the future. However, they will need to be modified to accommodate novel data and their focus will need to shift from disease loci to the whole genome. Some recent candidate disease gene prioritization approaches already prioritize candidate disease genes for complex diseases at a genome-wide scale[22,23,129,130]. Existing tools such as the Prioritizer[7] and ENDEAVOUR[8] computer programs can also be applied to genome-scale candidate disease gene prioritization[131-133]. Prioritizer is a computer program that uses functional links between genes at different disease-associated loci to prioritize candidate disease genes, and this tool has also already been adapted for use in complex disease loci from genome-wide association studies (GWASs) [21]. It uses a gene network which connects genes to each other based on various kinds of functional relationships. The links in this network could be expanded using novel kinds of functional genomics data. Similarly, ENDEAVOUR already uses several different kinds of functional genomics data to identify relationships between known disease genes and candidate disease genes. It is currently used primarily for prioritizing positional candidate disease genes, but integrating yet more kinds of data into this approach should further increase both its sensitivity and its specificity, increasing its performance when applied to the whole genome. The same applies for other existing candidate disease gene prioritization tools, such as GenTrepid[134] for instance. However, while these existing tools exploit gene functional data extensively, further developments in the application of phenotype data will still be required if their value is to be maximized.

6.4.4 From disease genes to disease pathobiology and the elucidation of the disease phenome

The past two decades have seen an explosion in the number of identified disease genes from about 100 to over 2000[11], particularly with the advent of the post-genomic era. This trend is set to continue in the near future as deep sequencing becomes more widespread. These developments will change the nature of disease genetics as more and more disease mutations

112

Chapter 6: General Discussion

are identified. In particular, the focus will eventually shift from the identification of disease genes to the elucidation of disease pathobiology. This is not going to be as simple as creating a catalog of genes linked to cognate diseases, as disease pathobiology is far more complicated than this (Figure 1). The concept of disease genes is relatively simplistic as diseases are affected by genetic mutation at scales ranging from the sub-gene level to the whole chromosome level. Similarly, the concept of disease is relatively simplistic as there is much variation within, and overlap between, disease phenotypes.

It is already clear that one gene can underlie disparate diseases, and that one disease can be caused by mutations in several different genes (Chapter 2 and [1,135]). However, even this bi- partite network-based view of disease genetics is a rather over-simplified representation of reality. Diseases are not uniform entities, and neither are genes. Diseases share features with each other, to various degrees. Even diseases caused by the same mutation can be phenotypically disparate[136]. At the same, different mutations in the same gene can affect different aspects of its functioning[ 1]. Mutations affecting a gene’s functioning need not even lie within the gene’s transcripts, or even within its genomic region[93].

Mutations in protein-coding disease genes can disrupt their function, but they could also disrupt only part of the genes’ function (e.g. mutations in the DNA-binding domain or trans-activation domain of p63 causing EEC and AEC syndromes respectively[137]), or disrupt their function in only a subset of cell types or developmental history (such as mutations in a SHH enhancer disrupting its expression exclusively in the developing limb bud and causing isolated preaxial polydactyly[138]). They could disrupt function by either reducing it or increasing it inappropriately, with potentially different phenotypic consequences. This is illustrated by mutations in the RET gene, which cause Hirschsprung disease when they are inactivating and Multiple Endocrine Neoplasia 2 (MEN2) when they are activating[139]. Function modulation could be the result of transcriptional or translational dosage modification, or of the modulation of molecular functions such as enzymatic activity or protein responsiveness to regulation – e.g. the constitutive activation of signal transducing receptors[140]. Protein dosage modulation could be realized by mutations in regulatory regions[93], splice sites[141] (which can also result in modified proteins) or by copy number variation[142]. In addition to modulating native function, mutations could even lead to unusual protein behavior that is disruptive of other cellular processes (toxic gain of function mutations), as in several neuronal and muscular diseases [143-145].

113

Chapter 6: General Discussion

Figure 1: The complexity of the disease process, viewed as a set of interacting networks (this figure only illustrates the overall concept and is far from exhaustive). Individual genomes contain large sets of DNA variants that overlap between individuals and can influence biological processes. These biological processes can be seen as networks of linked components in which both the components and the links can be of many different types, such as transcription or translation regulation, protein-protein interactions, metabolic fluxes, etc. These networks can also vary according to tissue or cell type, and can occur within or between them (not illustrated). These interacting biological processes give rise to the final phenotype, which can be viewed as a combination of phenotypic features. Such features can be used to link different phenotypes into a phenotype network. Environmental influences, which can be correlated to each other, can modulate biological processes or affect the underlying DNA itself through for instance mutagenesis or epigenetic reprogramming. Conversely the phenotype, which includes behavior, can influence environmental factors—such as dieting in order to lose weight, increasing exposure to UV radiation for cosmetic reasons, or increasing exposure to toxins such as alcohol or nicotine. An individual’s phenotype can also influence the environment of their offspring, such as for instance the intra-uterine environment.

Different mutations in the same gene could have completely different effects on organismal biology depending on which aspects of which (sub)functions are disrupted, and on the nature of the disruption, such as in laminopathies[146]. They could lead to a dominant or recessive pattern of inheritance depending on the mutation; for instance, GJB2 (OMIM: 121011) has both

114

Chapter 6: General Discussion

recessively and dominantly inheriting mutations that cause hearing loss[147]. Disease mutations need not affect the functioning of just one gene, either. Chromosomal deletions and duplications could affect multiple genes[148-151], as could disorders of imprinting[136,152]. Mutations affecting gene function need not even be genetic; they can also be epigenetic, such as the heritable MLH1 epimutation in Hereditary Non-Polyposis Colorectal Cancer[153]. Finally, diseases can probably even be caused, or at least influenced, by mismatches between epigenetic programming and environmental conditions as is postulated by the thrifty phenotype hypothesis for Type 2 Diabetes[154,155].

Given the myriad ways in which mutations can affect gene functions, the vast numbers of genes whose functioning can be altered indirectly by the malfunctioning of other genes, and the vast amount of genetic variation in the population, disease elucidation is going to require more sophisticated views of the relationships between genotypes and phenotypes. Such increased sophistication will need to be applied not just to disease genetics, but also to disease phenotypes. As already discussed previously, phenotypic similarities and differences between diseases can reflect similarities and differences in their underlying molecular genetics, and these phenotypic features can be subtle and non-obvious. Every patient’s phenotype would ideally be described as an individual collection of features rather than being assigned a disease label. Given the ongoing transition from a single canonical genome sequence to the sequencing of individual genomes, it is important that disease phenotype descriptions keep pace and undergo a similar transition, from single canonical descriptions of the disease phenotype to the separate descriptions of individual patient phenotypes.

If patient genomes are also sequenced, individual mappings of genotypes to phenotypes can be created. If this is done for many patients with disparate diseases, a large collection of genotype- to-phenotype mappings can be assembled. Each phenotype can be seen as a node in a network linked through shared features. Each gene can be linked to other genes through a vast selection of different functional interaction types. From such a data set, phenotypic links can be correlated with various kinds and combinations of genetic links in order to identify biological processes that lead to specific feature sets when disrupted (Figure 1). Mapping such shared features to genomic variants could also lead to the identification of novel links between different biological processes. A glimpse of this is given by the recent identification of the importance of hypothalamic function in obesity etiology[25,38], but can be extended much further by combining phenome-wide feature-based phenotype analyses with genome-wide variation analyses. Environmental parameters can be included in such analyses, either by using them to categorize gene-phenotype relationships according to different environments or by correlating them directly with genetic and phenotypic variation.

As the genetic variation in the human variome[156-158] is charted in ever more detail, the recording of associated phenotypic variation in the human phenome[159] will need to keep pace. Otherwise, in contrast to historical disease research where incomplete genetic understanding has been the limiting factor, in the future it will be incomplete phenotypic understanding that risks becoming the limiting factor in disease investigation.

115

Chapter 6: General Discussion

References

1. Brunner, H.G. & van Driel, M.A. From syndrome families to functional genomics. Nat Rev Genet 5, 545-51 (2004). 2. Oti, M., Huynen, M.A. & Brunner, H.G. Phenome connections. Trends Genet 24 , 103-6 (2008). 3. Fraser, H.B. & Plotkin, J.B. Using protein complexes to predict phenotypic effects of gene mutation. Genome Biol 8, R252 (2007). 4. Brockmann, R., Beyer, A., Heinisch, J.J. & Wilhelm, T. Posttranscriptional expression regulation: what determines translation rates? PLoS Comput Biol 3, e57 (2007). 5. Irizarry, R.A. et al. Multiple-laboratory comparison of microarray platforms. Nat Methods 2, 345-50 (2005). 6. Kuo, W.P. et al. A sequence-oriented comparison of gene expression measurements across different hybridization-based technologies. Nat Biotechnol 24 , 832-40 (2006). 7. Franke, L. et al. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet 78 , 1011-25 (2006). 8. Aerts, S. et al. Gene prioritization through genomic data fusion. Nat Biotechnol 24 , 537-544 (2006). 9. Perez-Iratxeta, C., Bork, P. & Andrade-Navarro, M.A. Update of the G2D tool for prioritization of gene candidates to inherited diseases. Nucleic Acids Res 35 , W212-6 (2007). 10. Gierman, H.J. et al. Domain-wide regulation of gene expression in the human genome. Genome Res 17 , 1286-95 (2007). 11. Altshuler, D., Daly, M.J. & Lander, E.S. Genetic mapping in human disease. Science 322 , 881-8 (2008). 12. Hamosh, A., Scott, A.F., Amberger, J.S., Bocchini, C.A. & McKusick, V.A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 33 , D514-7 (2005). 13. Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456 , 53-9 (2008). 14. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol 5, e254 (2007). 15. Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456 , 60-5 (2008). 16. Wheeler, D.A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452 , 872-6 (2008). 17. Raymond, F.L., Whibley, A., Stratton, M.R. & Gecz, J. Lessons learnt from large-scale exon re- sequencing of the X chromosome. Hum Mol Genet 18 , R60-4 (2009). 18. Tarpey, P.S. et al. A systematic, large-scale resequencing screen of X-chromosome coding exons in mental retardation. Nat Genet (2009). 19. Frazer, K.A. et al. A second generation human haplotype map of over 3.1 million SNPs. Nature 449 , 851-61 (2007). 20. Flint, J. & Munafo, M.R. Forum: Interactions between gene and environment. Curr Opin Psychiatry 21 , 315-7 (2008). 21. Elbers, C.C. et al. Using genome-wide pathway analysis to unravel the etiology of complex diseases. Genet Epidemiol 33 , 419-31 (2009). 22. Gaulton, K.J., Mohlke, K.L. & Vision, T.J. A computational system to select candidate genes for complex human traits. Bioinformatics 23 , 1132-40 (2007). 23. Shriner, D. et al. Commonality of functional annotation: a method for prioritization of candidate genes from genome-wide linkage studies. Nucleic Acids Res 36 , e26 (2008). 24. Katsanis, N. From association to causality: the new frontier for complex traits. Genome Med 1, 23 (2009). 25. Calton, M.A. & Vaisse, C. Narrowing down the role of common variants in the genetic predisposition to obesity. Genome Med 1, 31 (2009). 26. Willer, C.J. et al. Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet 41 , 25-34 (2009).

116

Chapter 6: General Discussion

27. Zhernakova, A., van Diemen, C.C. & Wijmenga, C. Detecting shared pathogenesis from the shared genetics of immune-related diseases. Nat Rev Genet 10 , 43-55 (2009). 28. Pritchard, J.K. Are rare variants responsible for susceptibility to complex diseases? Am J Hum Genet 69 , 124-37 (2001). 29. Kryukov, G.V., Pennacchio, L.A. & Sunyaev, S.R. Most rare missense alleles are deleterious in humans: implications for complex disease and association studies. Am J Hum Genet 80 , 727-39 (2007). 30. Gorlov, I.P., Gorlova, O.Y., Sunyaev, S.R., Spitz, M.R. & Amos, C.I. Shifting paradigm of association studies: value of rare single-nucleotide polymorphisms. Am J Hum Genet 82 , 100-12 (2008). 31. Ahituv, N. et al. Medical sequencing at the extremes of human body mass. Am J Hum Genet 80 , 779-91 (2007). 32. Genin, E., Feingold, J. & Clerget-Darpoux, F. Identifying modifier genes of monogenic disease: strategies and difficulties. Hum Genet 124 , 357-68 (2008). 33. Khanna, H. et al. A common allele in RPGRIP1L is a modifier of retinal degeneration in ciliopathies. Nat Genet (2009). 34. Arts, H.H. et al. Mutations in the gene encoding the basal body protein RPGRIP1L, a nephrocystin-4 interactor, cause Joubert syndrome. Nat Genet 39 , 882-8 (2007). 35. Delous, M. et al. The ciliary gene RPGRIP1L is mutated in cerebello-oculo-renal syndrome (Joubert syndrome type B) and Meckel syndrome. Nat Genet 39 , 875-81 (2007). 36. Badano, J.L. & Katsanis, N. Beyond Mendel: an evolving view of human genetic disease transmission. Nat Rev Genet 3, 779-89 (2002). 37. Cauchi, S. et al. The genetic susceptibility to type 2 diabetes may be modulated by obesity status: implications for association studies. BMC Med Genet 9, 45 (2008). 38. Hofker, M. & Wijmenga, C. A supersized list of obesity genes. Nat Genet 41 , 139-40 (2009). 39. Agarwal, A.K. & Garg, A. Genetic basis of lipodystrophies and management of metabolic complications. Annu Rev Med 57 , 297-311 (2006). 40. Capeau, J. et al. Diseases of adipose tissue: genetic and acquired lipodystrophies. Biochem Soc Trans 33 , 1073-7 (2005). 41. Masud, S. & Ye, S. Effect of the peroxisome proliferator activated receptor-gamma gene Pro12Ala variant on body mass index: a meta-analysis. J Med Genet 40 , 773-80 (2003). 42. Hegele, R.A., Cao, H., Frankowski, C., Mathews, S.T. & Leff, T. PPARG F388L, a transactivation- deficient mutant, in familial partial lipodystrophy. Diabetes 51 , 3586-90 (2002). 43. Savage, D.B. et al. Human metabolic syndrome resulting from dominant-negative mutations in the nuclear receptor peroxisome proliferator-activated receptor-gamma. Diabetes 52 , 910-7 (2003). 44. Kersten, S., Desvergne, B. & Wahli, W. Roles of PPARs in health and disease. Nature 405 , 421-4 (2000). 45. Woods, S.C. & D'Alessio, D.A. Central control of body weight and appetite. J Clin Endocrinol Metab 93 , S37-50 (2008). 46. Muenke, M. The pit, the cleft and the web. Nat Genet 32 , 219-20 (2002). 47. Rzhetsky, A., Wajngurt, D., Park, N. & Zheng, T. Probing genetic overlap among complex human phenotypes. Proc Natl Acad Sci U S A 104 , 11694-9 (2007). 48. Cannon, T.D. & Keller, M.C. Endophenotypes in the genetic analyses of mental disorders. Annu Rev Clin Psychol 2, 267-90 (2006). 49. Gottesman, II & Gould, T.D. The endophenotype concept in psychiatry: etymology and strategic intentions. Am J Psychiatry 160 , 636-45 (2003). 50. Leboyer, M. et al. Psychiatric genetics: search for phenotypes. Trends Neurosci 21 , 102-5 (1998). 51. Puls, I. & Gallinat, J. The concept of endophenotypes in psychiatric diseases meeting the expectations? Pharmacopsychiatry 41 Suppl 1 , S37-43 (2008). 52. Jones, A., Cork, C. & Chowdhury, U. Autistic spectrum disorders. 1: Presentation and assessment. Community Pract 79 , 97-8 (2006). 53. Happe, F., Ronald, A. & Plomin, R. Time to give up on a single explanation for autism. Nat Neurosci 9, 1218-20 (2006).

117

Chapter 6: General Discussion

54. Ronald, A., Happe, F., Price, T.S., Baron-Cohen, S. & Plomin, R. Phenotypic and genetic overlap between autistic traits at the extremes of the general population. J Am Acad Child Adolesc Psychiatry 45 , 1206-14 (2006). 55. Pickles, A. et al. Variable expression of the autism broader phenotype: findings from extended pedigrees. J Child Psychol Psychiatry 41 , 491-502 (2000). 56. Piven, J., Palmer, P., Jacobi, D., Childress, D. & Arndt, S. Broader autism phenotype: evidence from a family history study of multiple-incidence autism families. Am J Psychiatry 154 , 185-90 (1997). 57. Losh, M., Sullivan, P.F., Trembath, D. & Piven, J. Current developments in the genetics of autism: from phenome to genome. J Neuropathol Exp Neurol 67 , 829-37 (2008). 58. Sharma, N., Berbari, N.F. & Yoder, B.K. Ciliary dysfunction in developmental abnormalities and diseases. Curr Top Dev Biol 85 , 371-427 (2008). 59. Bae, Y.K. & Barr, M.M. Sensory roles of neuronal cilia: cilia development, morphogenesis, and function in C. elegans. Front Biosci 13 , 5959-74 (2008). 60. Kulaga, H.M. et al. Loss of BBS proteins causes anosmia in humans and defects in olfactory cilia structure and function in the mouse. Nat Genet 36 , 994-8 (2004). 61. Maria, B.L. et al. "Joubert syndrome" revisited: key ocular motor signs with magnetic resonance imaging correlation. J Child Neurol 12 , 423-30 (1997). 62. Valente, E.M., Brancati, F. & Dallapiccola, B. Genotypes and phenotypes of Joubert syndrome and related disorders. Eur J Med Genet 51 , 1-23 (2008). 63. Parisi, M.A. et al. The NPHP1 gene deletion associated with juvenile nephronophthisis is present in a subset of individuals with Joubert syndrome. Am J Hum Genet 75 , 82-91 (2004). 64. Tory, K. et al. High NPHP1 and NPHP6 mutation rate in patients with Joubert syndrome and nephronophthisis: potential epistatic effect of NPHP6 and AHI1 mutations in patients with NPHP1 mutations. J Am Soc Nephrol 18 , 1566-75 (2007). 65. Church, G.M. The personal genome project. Mol Syst Biol 1, 2005 0030 (2005). 66. Pratley, R.E. Gene-environment interactions in the pathogenesis of type 2 diabetes mellitus: lessons learned from the Pima Indians. Proc Nutr Soc 57 , 175-81 (1998). 67. Schulz, L.O. et al. Effects of traditional and western environments on prevalence of type 2 diabetes in Pima Indians in Mexico and the U.S. Diabetes Care 29 , 1866-71 (2006). 68. Bruder, C.E. et al. Phenotypically concordant and discordant monozygotic twins display different DNA copy-number-variation profiles. Am J Hum Genet 82 , 763-71 (2008). 69. Singh, S.M., Murphy, B. & O'Reilly, R. Epigenetic contributors to the discordance of monozygotic twins. Clin Genet 62 , 97-103 (2002). 70. Biesecker, L.G. Mapping phenotypes to language: a proposal to organize and standardize the clinical descriptions of malformations. Clin Genet 68 , 320-6 (2005). 71. Hall, J.G. A clinician's plea. Nat Genet 33 , 440-2 (2003). 72. Oti, M. & Brunner, H.G. The modular nature of genetic diseases. Clin Genet 71 , 1-11 (2007). 73. Schulze, T.G. & McMahon, F.J. Defining the phenotype in human genetic studies: forward genetics and reverse phenotyping. Hum Hered 58 , 131-8 (2004). 74. Ayme, S. [Orphanet, an information site on rare diseases]. Soins , 46-7 (2003). 75. Bankier, A. & Keith, C.G. POSSUM: the microcomputer laser-videodisk syndrome information system. Ophthalmic Paediatr Genet 10 , 51-2 (1989). 76. Robinson, P.N. et al. The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am J Hum Genet 83 , 610-5 (2008). 77. Winter, R.M. & Baraitser, M. The London Dysmorphology Database. J Med Genet 24 , 509-10 (1987). 78. Snyder, M., Weissman, S. & Gerstein, M. Personal phenotypes to go with personal genomes. Mol Syst Biol 5, 273 (2009). 79. Mace, G., Bogliolo, M., Guervilly, J.H., Dugas du Villard, J.A. & Rosselli, F. 3R coordination by Fanconi anemia proteins. Biochimie 87 , 647-58 (2005). 80. Dell'Angelica, E.C. The building BLOC(k)s of lysosomes and related organelles. Curr Opin Cell Biol 16 , 458-64 (2004). 81. El-Amraoui, A. & Petit, C. Usher I syndrome: unravelling the mechanisms that underlie the cohesion of the growing hair bundle in inner ear sensory cells. J Cell Sci 118 , 4593-603 (2005).

118

Chapter 6: General Discussion

82. Vodopiutz, J. & Bodamer, O.A. Congenital disorders of glycosylation-a challenging group of IEMs. J Inherit Metab Dis (2008). 83. Zaghloul, N.A. & Katsanis, N. Mechanistic insights into Bardet-Biedl syndrome, a model ciliopathy. J Clin Invest 119 , 428-37 (2009). 84. Antonarakis, S.E. & Beckmann, J.S. Mendelian disorders deserve more attention. Nat Rev Genet 7, 277-82 (2006). 85. Peltonen, L., Perola, M., Naukkarinen, J. & Palotie, A. Lessons from studying monogenic disease for common disease. Hum Mol Genet 15 Spec No 1 , R67-74 (2006). 86. Ropers, H.H. New perspectives for the elucidation of genetic disorders. Am J Hum Genet 81 , 199-207 (2007). 87. Dean, M. Approaches to identify genes for complex human diseases: lessons from Mendelian disorders. Hum Mutat 22 , 261-74 (2003). 88. Ghildiyal, M. & Zamore, P.D. Small silencing RNAs: an expanding universe. Nat Rev Genet 10 , 94-108 (2009). 89. Villar-Garea, A. & Imhof, A. The analysis of histone modifications. Biochim Biophys Acta 1764 , 1932-9 (2006). 90. Waterston, R.H. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420 , 520-62 (2002). 91. Berman, B.P., Frenkel, B. & Coetzee, G.A. Location, location, (ChIP-)location! Mapping chromatin landscapes one immunoprecipitation at a time. J Cell Biochem 107 , 1-5 (2009). 92. Bartkuhn, M. & Renkawitz, R. Long range chromatin interactions involved in gene regulation. Biochim Biophys Acta 1783 , 2161-6 (2008). 93. Kleinjan, D.A. & van Heyningen, V. Long-range control of gene expression: emerging mechanisms and disruption in disease. Am J Hum Genet 76 , 8-32 (2005). 94. Carninci, P. et al. The transcriptional landscape of the mammalian genome. Science 309 , 1559- 63 (2005). 95. Amaral, P.P., Dinger, M.E., Mercer, T.R. & Mattick, J.S. The eukaryotic genome as an RNA machine. Science 319 , 1787-9 (2008). 96. Mattick, J.S., Amaral, P.P., Dinger, M.E., Mercer, T.R. & Mehler, M.F. RNA regulation of epigenetic processes. Bioessays 31 , 51-9 (2009). 97. Wu, L. & Belasco, J.G. Let me count the ways: mechanisms of gene regulation by miRNAs and siRNAs. Mol Cell 29 , 1-7 (2008). 98. Blenkiron, C. & Miska, E.A. miRNAs in cancer: approaches, aetiology, diagnostics and therapy. Hum Mol Genet 16 Spec No 1 , R106-13 (2007). 99. Mencia, A. et al. Mutations in the seed region of human miR-96 are responsible for nonsyndromic progressive hearing loss. Nat Genet 41 , 609-13 (2009). 100. Bejerano, G. et al. A distal enhancer and an ultraconserved exon are derived from a novel retroposon. Nature 441 , 87-90 (2006). 101. Bourque, G. et al. Evolution of the mammalian transcription factor binding repertoire via transposable elements. Genome Res 18 , 1752-62 (2008). 102. Lyon, M.F. X-chromosome inactivation: a repeat hypothesis. Cytogenet Cell Genet 80 , 133-7 (1998). 103. Mirkin, S.M. Expandable DNA repeats and human disease. Nature 447 , 932-40 (2007). 104. de Greef, J.C., Frants, R.R. & van der Maarel, S.M. Epigenetic mechanisms of facioscapulohumeral muscular dystrophy. Mutat Res 647 , 94-102 (2008). 105. Albertson, R.C., Cresko, W., Detrich, H.W., 3rd & Postlethwait, J.H. Evolutionary mutant models for human disease. Trends Genet 25 , 74-81 (2009). 106. Maher, B. Evolution: Biology's next top model? Nature 458 , 695-8 (2009). 107. Jansen, R.C. & Nap, J.P. Genetical genomics: the added value from segregation. Trends Genet 17 , 388-91 (2001). 108. Marguerat, S., Wilhelm, B.T. & Bahler, J. Next-generation sequencing: applications beyond genomes. Biochem Soc Trans 36 , 1091-6 (2008). 109. Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M. & Gilad, Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 18 , 1509-17 (2008).

119

Chapter 6: General Discussion

110. Wilhelm, B.T. et al. Dynamic repertoire of a eukaryotic transcriptome surveyed at single- nucleotide resolution. Nature 453 , 1239-43 (2008). 111. Dinger, M.E. et al. NRED: a database of long noncoding RNA expression. Nucleic Acids Res 37 , D122-6 (2009). 112. Gardner, P.P. et al. Rfam: updates to the RNA families database. Nucleic Acids Res 37 , D136-40 (2009). 113. He, S. et al. NONCODE v2.0: decoding the non-coding. Nucleic Acids Res 36 , D170-2 (2008). 114. Mituyama, T. et al. The Functional RNA Database 3.0: databases to support mining and annotation of functional RNAs. Nucleic Acids Res 37 , D89-92 (2009). 115. Pang, K.C. et al. RNAdb 2.0--an expanded database of mammalian non-coding RNAs. Nucleic Acids Res 35 , D178-82 (2007). 116. Allanson, J.E., Biesecker, L.G., Carey, J.C. & Hennekam, R.C. Elements of morphology: introduction. Am J Med Genet A 149A , 2-5 (2009). 117. Sabeti, P.C. et al. Genome-wide detection and characterization of positive selection in human populations. Nature 449 , 913-8 (2007). 118. Walsh, E.C. et al. Searching for signals of evolutionary selection in 168 genes related to immune function. Hum Genet 119 , 92-102 (2006). 119. Yu, F. et al. Positive selection of a pre-expansion CAG repeat of the human SCA2 gene. PLoS Genet 1, e41 (2005). 120. Siva, N. 1000 Genomes project. Nat Biotechnol 26 , 256 (2008). 121. Dermitzakis, E.T. et al. Evolutionary discrimination of mammalian conserved non-genic sequences (CNGs). Science 302 , 1033-5 (2003). 122. Miller, W. et al. 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res 17 , 1797-808 (2007). 123. Fondon, J.W., 3rd & Garner, H.R. Molecular origins of rapid and continuous morphological evolution. Proc Natl Acad Sci U S A 101 , 18058-63 (2004). 124. Sutter, N.B. et al. A single IGF1 allele is a major determinant of small size in dogs. Science 316 , 112-5 (2007). 125. Cookson, W., Liang, L., Abecasis, G., Moffatt, M. & Lathrop, M. Mapping complex disease traits with global gene expression. Nat Rev Genet 10 , 184-94 (2009). 126. Barski, A. & Zhao, K. Genomic location analysis by ChIP-Seq. J Cell Biochem 107 , 11-8 (2009). 127. Park, P.J. Epigenetics meets next-generation sequencing. Epigenetics 3, 318-21 (2008). 128. Wu, J., Smith, L.T., Plass, C. & Huang, T.H. ChIP-chip comes of age for genome-wide functional analysis. Cancer Res 66 , 6899-902 (2006). 129. Saccone, S.F. et al. Systematic biological prioritization after a genome-wide association study: an application to nicotine dependence. Bioinformatics 24 , 1805-11 (2008). 130. Sun, J. et al. A multi-dimensional evidence-based candidate gene prioritization approach for complex diseases - schizophrenia as a case. Bioinformatics (2009). 131. Knauff, E.A. et al. Genome-wide association study in premature ovarian failure patients suggests ADAMTS19 as a possible candidate gene. Hum Reprod (2009). 132. Sookoian, S., Gianotti, T.F., Schuman, M. & Pirola, C.J. Gene prioritization based on biological plausibility over genome wide association studies renders new loci associated with type 2 diabetes. Genet Med 11 , 338-43 (2009). 133. van der Zwaag, B. et al. Gene-network analysis identifies susceptibility genes related to glycobiology in autism. PLoS One 4, e5324 (2009). 134. George, R.A. et al. Analysis of protein sequence and interaction data for candidate disease gene prediction. Nucleic Acids Res 34 , e130 (2006). 135. Goh, K.I. et al. The human disease network. Proc Natl Acad Sci U S A 104 , 8685-90 (2007). 136. Cassidy, S.B. & Schwartz, S. Prader-Willi and Angelman syndromes. Disorders of genomic imprinting. Medicine (Baltimore) 77 , 140-51 (1998). 137. Rinne, T., Hamel, B., van Bokhoven, H. & Brunner, H.G. Pattern of p63 mutations and their phenotypes--update. Am J Med Genet A 140 , 1396-406 (2006). 138. Lettice, L.A. et al. A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preaxial polydactyly. Hum Mol Genet 12 , 1725-35 (2003).

120

Chapter 6: General Discussion

139. Moore, S.W. & Zaahl, M.G. Multiple endocrine neoplasia syndromes, children, Hirschsprung's disease and RET. Pediatr Surg Int 24 , 521-30 (2008). 140. Tao, Y.X. Constitutive activation of G protein-coupled receptors and diseases: insights into mechanisms of activation and therapeutics. Pharmacol Ther 120 , 129-48 (2008). 141. Baralle, D. & Baralle, M. Splicing in action: assessing disease causing sequence changes. J Med Genet 42 , 737-48 (2005). 142. Henrichsen, C.N., Chaignat, E. & Reymond, A. Copy number variants, diseases and gene expression. Hum Mol Genet 18 , R1-8 (2009). 143. Gil, J.M. & Rego, A.C. Mechanisms of neurodegeneration in Huntington's disease. Eur J Neurosci 27 , 2803-20 (2008). 144. Kuyumcu-Martinez, N.M. & Cooper, T.A. Misregulation of alternative splicing causes pathogenesis in myotonic dystrophy. Prog Mol Subcell Biol 44 , 133-59 (2006). 145. Zoghbi, H.Y. & Orr, H.T. Pathogenic mechanisms of a polyglutamine-mediated neurodegenerative disease, spinocerebellar ataxia type 1. J Biol Chem 284 , 7425-9 (2009). 146. Hegele, R.A. & Oshima, J. Phenomics and lamins: from disease to therapy. Exp Cell Res 313 , 2134-43 (2007). 147. Martinez, A.D., Acuna, R., Figueroa, V., Maripillan, J. & Nicholson, B. Gap-junction channels dysfunction in deafness and hearing loss. Antioxid Redox Signal 11 , 309-22 (2009). 148. Kobrynski, L.J. & Sullivan, K.E. Velocardiofacial syndrome, DiGeorge syndrome: the chromosome 22q11.2 deletion syndromes. Lancet 370 , 1443-52 (2007). 149. Koolen, D.A. et al. A new chromosome 17q21.31 microdeletion syndrome associated with a common inversion polymorphism. Nat Genet 38 , 999-1001 (2006). 150. Schubert, C. The genomic basis of the Williams-Beuren syndrome. Cell Mol Life Sci 66 , 1178-97 (2009). 151. Shelley, B.P. & Robertson, M.M. The neuropsychiatry and multisystem features of the Smith- Magenis syndrome: a review. J Neuropsychiatry Clin Neurosci 17 , 91-7 (2005). 152. Delaval, K., Wagschal, A. & Feil, R. Epigenetic deregulation of imprinting in congenital diseases of aberrant growth. Bioessays 28 , 453-9 (2006). 153. Suter, C.M., Martin, D.I. & Ward, R.L. Germline epimutation of MLH1 in individuals with multiple cancers. Nat Genet 36 , 497-501 (2004). 154. Hales, C.N. & Barker, D.J. The thrifty phenotype hypothesis. Br Med Bull 60 , 5-20 (2001). 155. Stoger, R. The thrifty epigenotype: an acquired and heritable predisposition for obesity and diabetes? Bioessays 30 , 156-66 (2008). 156. What is the human variome project? Nat Genet 39 , 423 (2007). 157. Cotton, R.G. et al. GENETICS. The Human Variome Project. Science 322 , 861-2 (2008). 158. Ring, H.Z., Kwok, P.Y. & Cotton, R.G. Human Variome Project: an international collaboration to catalogue human genetic variation. Pharmacogenomics 7, 969-72 (2006). 159. Freimer, N. & Sabatti, C. The human phenome project. Nat Genet 34 , 15-21 (2003).

121

Summary

122

Summary

Summary

Genetic diseases are caused by mutations that disrupt the functioning of genes. These genes do not function independently of each other, they work together to carry out various tasks in the body and its cells. Therefore, it is not surprising that mutations in different genes can cause the same disease, or different diseases that resemble each other in their signs and symptoms. This knowledge is used in genetic research to guide the search for genes which might be involved in various genetic diseases. In the past, it has been applied individually to specific diseases by medical genetic researchers working on those diseases. Recently however, technological advances in both laboratory technology and computers have enabled large scale analyses to be conducted on the whole set of known diseases – the human disease phenome – at once.

These new laboratory techniques enable large quantities of data to be generated for large numbers of genes, and much of these data are stored in electronic form in large databases. Microarrays can be used to measure the levels of gene expression for all genes at once in the same biological sample. Mass spectrometry enables the levels of proteins and cellular chemicals to be measured on a large scale, and can also be used to identify which proteins interact with each other. These data are not only being generated for human samples, but also for many other species that are widely used in laboratories for medical and biological research such as mice, rats, fruit flies, worms and yeast. Whole genome sequences are also available for these species, and more and more are becoming available each year. More and more data about the genetic variation between people in the human population are being generated, such as for example between people who have a certain disease and people who don’t. These large data sets are a treasure trove of information, but their sheer size makes them difficult or impossible to analyze by hand. Instead, computer programs need to be used to automatically sift through these data sets to extract as much useful information as possible. This is where Bioinformatics comes in. Bioinformatics is the application of computational processing techniques to biological data. It enables such large data sets to be effectively exploited, and can be used to assist in disease genetics research. It has become particularly relevant for this purpose in the last decade or so, since the advent of microarray and other high-throughput techniques, as well as the completion of several whole genome sequences, enabled the widespread generation of large biological data sets.

There are several different techniques that can be used in the search for disease genes, which fall into two general categories – genetic mapping and functional candidate gene approaches. In genetic mapping, genetic markers are used to correlate the occurrence of particular diseases with genetic variants in particular regions of the genome. This can be done by tracing the disease through family trees and looking for specific genetic variants that occur only those members of the family with the disease, a technique called linkage mapping. This approach is effective for finding genomic regions associated with rare disease mutations that have a strong impact on the disease phenotype, such as in Mendelian diseases (these are diseases that are inherited according to Mendelian principles, and include most rare genetic diseases).

123

Summary

Alternatively, large numbers of patients and non-patients can be taken from the population at large, and genomic regions sought which have variants that occur more frequently in patients than in non-patients. This approach, called association mapping, is more suited to finding common genomic variants that are involved in common diseases such as type 2 diabetes or coronary heart disease. Both approaches result in disease-associated genomic regions that could contain multiple genes. After localizing a disease gene to a particular genomic region, the genes within that locus can be sequenced to identify the disease mutations, an approach termed positional cloning. This is primarily effective in combination with linkage mapping, as association study population sizes are large and complex disease mutations usually only occur in subsets of patients. Historically, most disease genes have been identified in Mendelian diseases using linkage mapping followed by positional cloning. A major limitation of this approach has been the expensive and time consuming sequencing of the positional candidate disease genes within the linkage-mapped genomic loci, particularly since this approach can result in loci with hundreds of genes.

In contrast to genetic mapping which only looks at DNA variants, candidate gene approaches try to identify potential disease genes based on their biology. Given a certain amount of knowledge about the biological nature of the disease process, genes are sought that are involved in those disease-associated biological processes. They are then sequenced for mutations. Candidate gene approaches are based on the premise that genes causing the same or similar disease are functionally related to each other. This approach has historically been much less successful than genetic mapping, primarily due to the limitations in our knowledge of human biology, and until recently even which genes are present in the human genome. However, the completion of the human genome along with the recent deluge of gene functional data has reinvigorated this approach and made it vastly more productive. It has also introduced the need for Bioinformatics to process these large amounts of data. In response, several Bioinformatic tools have been developed to aid in disease gene investigation. They are generally used to prioritize positional candidate disease genes lying in genomic loci that have been identified using linkage mapping. They do this by attempting to find gene functional data that links them to other genes associated with the same disease, or more directly to the disease itself. In the future, as DNA sequencing becomes much cheaper and quicker, it will be necessary for such prioritization tools to prioritize candidate disease genes at the genome-wide level.

In this thesis we investigate the basic premise underlying all these present and future candidate disease gene prioritization tools, namely that similar or identical diseases are caused by mutations in functionally related genes. Instead of building a specific tool, we investigate the degree to which different kinds of gene functional relationships reflect shared disease pathology. We also investigate the degree to which similar disease phenotypes reflect the disruption of shared disease processes. These findings can help guide the development of both candidate disease gene prioritization tools and disease phenotype databases, to increase their utility to future genetic disease research.

We start by giving an overview of the relationships between disease genes and disease phenotypes in chapters 1 & 2. Chapter 1 gives a general overview of the process of disease

124

Summary

gene identification and the role of Bioinformatics in it. Chapter 2 expands on this, emphasizing the modular nature of genetic diseases. Using empirical knowledge, we explain that genetic diseases are modular at both the genetic and the phenotypic level. Such modularity enhances the degree to which phenotypic similarity can be used to infer genetic similarity by highlighting gene functional modules. In this chapter we additionally give an overview of existing Bioinformatic approaches to candidate disease gene prioritization.

In chapter 3 we begin the investigation of the degree to which diseases with the same or similar phenotypes are caused by mutations in functionally related genes. In this chapter we use protein-protein interactions as our measure of gene functional relationships, as it is one of the strongest indicators of functional relatedness between two proteins. We also investigate positional candidate disease genes for genetically heterogeneous diseases, as these represent disease-causing mutations with very similar, clinically indistinguishable phenotypes. We show that protein-protein interactions are indeed enriched between disease genes, and positional candidate disease genes interacting at the protein level with known disease genes are ten times more likely to be involved in the same disease than their neighboring positional candidates. Interestingly, this holds true for protein-protein interaction data from other species as well, which greatly increases the amount of data that can be used for this purpose. However, we also find that literature-derived human protein-protein interactions are strongly biased toward known disease genes and should therefore be used with caution when prioritizing novel candidate disease genes.

In chapter 4 we investigate positional candidate disease gene prioritization using another kind of functional relationship between genes, namely coordinated expression (co-expression). In contrast to the protein-protein interaction data investigated in chapter 3, co-expression data are far more abundant and comprehensive. However, they are also less powerful at identifying functional relationships between genes than protein-protein interactions. Therefore, we investigate whether utilizing evolutionary conservation can increase the value of co-expression data for candidate disease gene prioritization. Once again we use genetically heterogeneous diseases to get sets of disease genes that lead to virtually identical phenotypes when mutated. We show that co-expression is higher between genes causing the same disease than for other genes, and that this effect is strengthened by exploiting evolutionary conservation of co- expression. We further show that such evolutionarily conserved gene co-expression can be used to improve candidate disease gene prioritization.

In chapter 5 we turn our attention from disease genes to disease phenotypes. Having shown that genes causing virtually identical disease phenotypes tend to be functionally related, we investigate whether this phenomenon extends beyond identical phenotypes to overlapping but non-identical phenotypes. In order to do this, disease phenotypes need to be described in a systematic manner that can enable quantitative phenotypic overlap estimation. This has historically been done using text mining of free text phenotype descriptions, but this approach does not result in systematic phenotype descriptions due to the limitations of text mining. There are several databases that systematically annotate disease phenotypes with phenotypic features, but is not clear how useful they will be for such disease phenome analyses. In this

125

Summary

chapter we confirm that the relationship between similar phenotypes and similar molecular genetics does indeed extend to separate but overlapping disease phenotypes. We additionally identify several database properties that influence the genetic coherence of systematically determined phenotype clusters. We find that both deep trait hierarchies and the use of trait frequency estimates can improve the biological coherence of phenotypic disease clusters. Contrary to previous studies we find that overall disease phenotypic similarity is more important for identifying genetic relationships between diseases than is the sharing of specific rare traits. The contrary findings of previous studies were due to the fact that they were based on incompletely described disease phenotypes generated using text mining. Above all, we highlight the importance of complete and systematic disease phenotype annotation, and emphasize the need to pay more attention to this aspect of disease research in the future.

We conclude in chapter 6 with a look into the future of the field, and the importance of phenotype-guided bioinformatic analyses therein. Due to the advent of new sequencing technologies, this field is going to change rapidly over the coming years. Positional candidate disease gene prioritization is becoming less relevant due to the ability to sequence entire candidate loci and later even whole genomes relatively cheaply and quickly. However, there will also be new challenges, such as making sense of the large number of mutations and genetic variants such high throughput approaches uncover, and identifying what their contributions are to the disease process. In short, the role of Bioinformatics in genetic disease research is going to remain large in the future, probably even larger than it is today. As we have shown, its utility will depend on good phenotype databases as well as more gene functional data. Most importantly however, the fundamental premise behind these Bioinformatic candidate gene approaches – namely that phenotypically similar diseases are related at the genetic level as well – is a sound one, as we have explicitly demonstrated in this thesis.

126

Samenvatting

Samenvatting

Genetische ziektes worden veroorzaakt door mutaties die de werking van genen verstoren. Genen werken niet onafhankelijk van elkaar, maar werken eerder samen met elkaar om de verscheidene biologische functies van het lichaam te bewerkstelligen. Het moet dus als geen verassing komen dat mutaties in verschillende genen toch tot dezelfde ziektes kunnen leiden, of tot ziektes die uiterlijk sterk op elkaar lijken. Deze kennis wordt gebruikt in onderzoek naar genetische ziektes om de zoektocht naar de onderliggende genen te informeren. In het verleden werd dit afzonderlijk toegepast op individule ziektes. Sinds kort hebben technologische ontwikkelingen daar echter verandering in gebracht, want tegenwoordig is het mogelijk om zulke analyses op de totale set van ziektes – het menselijk ziektephenoom – in één keer toe te passen.

Deze nieuwe laboratorium technieken leveren grote hoeveelheden data op over grote aantallen genen, waarvan veel in grote electronische databanken opgeslagen ligt. Microarrays kuunen gebruikt worden om genexpressie te meten voor alle genen in een biologisch monster in één keer te meten. Massa spectrometrie stelt ons in staat de niveau’s van eiwitten of chemische stoffen op grote schaal te meten, en kan bovendien ook gebruikt worden om physieke interacties tussen eiwitten te detecteren. Deze data worden niet alleen vergaard voor menselijk materiaal, mmar ook voor andere veelgebruikte laboratorium dieren zoals muizen, ratten, fruitvliegjes, wormen en gist. De genoomsequenties van al deze soorten zijn nu beschikbaar, en elk jaar komen er steeds meer genoomsequenties uit. Ook worden er meer data gegenereerd over de genetische variatie in menselijke populaties, waaronder bijvoorbeeld de verschillen tussen mensen met of zonder een bepaalde ziekte. Zulke grote datasets zijn moeilijk met de hand te analyseren, en vereisen de hulp van computerprogramma’s. Hiertoe dient de bioinformatica. Bioinformatica omvat het toepassen van computationele algoritmes om biologische data te analyseren. Het maakt het mogelijk om zulke grote datasets te verwerken, en is vooral de laatste tijd – met de opkomst van de eerdergenoemde grootschalige technieken – erg waardevol gebleken voor medisch onderzoek.

Er zijn meerdere manieren om naar ziektegenen te speuren, maar ze vallen in het algemeen onder twee noemers – genetische mapping en kandidaatgen aanpakken. Bij genetische mapping worden specifieke genetische varianten op specifieke genomische gebieden gecorreleerd met het voorkomen van bepaalde ziektes. Dit kan gedaan worden door ziektes in families door de generaties te traceren, om genetische varianten te zoeken die alleen voorkomen in patiënten en niet in gezonde familieleden. Deze techniek heet “linkage mapping” en is vooral geschikt om zeldzame ziektemutaties op te sporen die een grote effect hebben op het ziektephenotype, zoals bij Mendeliaanse ziektes (dit zijn ziektes die overerven volgens de Mendeliaanse overervingswetten). Een alternatieve genetische mapping aanpak zijn de zogeheten associatiestudies. In deze aanpak worden genetische varianten gezocht die vaker in patiënten voorkomen dan in mensen zonder de relevante ziekte. Voor deze aanpak zijn grote aantallen mensen nodig om statistisch significante resultaten te krijgen. Vanwege deze vereiste

127

Samenvatting

zijn associatiestudies vooral geschikt om veel voorkomende genetische varianten te zoeken die betrokken zijn bij veel voorkomende ziektes, zoals bijvoorbeeld type 2 diabetes of gevoeligheid voor hartinfarcten. Allebei deze aanpakken resulteren in het identificeren van ziekte- gerelateerde genomische gebieden (of loci) die mmerdere genen kunnen bevatten. Deze gebieden moeten vervolgens ge-sequenced worden om de ziektemutatie te vinden, een aanpak die “positional cloning” heet. In het verleden zijn de meeste ziektegenen gevonden voor Mendelaanse ziektes door middel van linkage mapping gevolgd door positional cloning. Echter, een nadeel van deze aanpak is dat het sequencen van genen in die loci duur en tijdrovend kan zijn, vooraal als er honderden genen in liggen.

In tegenstelling tot genetische mapping waarbij alleen naar DNA varianten gekeken wordt, proberen kandidaatgen aanpakken mogelijke ziektegenen te identificeren aand de hand van hun biologie. Als er wat bekend is over de biologie van het ziekteproces kan deze kennis gebruikt worden om genen te zoeken die ermee te maken hebben. Deze worden dan gecontroleerd op mutaties in patiënten. Kandidaatgen aanpakken zijn wel gebaseerd op de aanname dat genen die bij eenzelfde ziekte betrokken zijn ook iets met elkaar te maken hebben. In het verleden kende deze aanpak weinig succes, grotendeels door gebrek aan biologische kennis over menselijke genen. Echter, door de recente technologische ontwikkelingen en bijkomende grote hoeveelheden data wordt deze aanpak steeds interessanter en productiever, al heeft het ook een noodzaak voor bioinformatica met zich meegebracht. Als antwoord hierop zijn er de laatste tijd een aantal bioinformatische tools ontwikkeld om kandidaat ziektegen prioritering op grote schaal mogelijk te maken, veelal in genomische gebieden die door linkage mapping bepaald zijn. Ze werken door functionele verbanden tussen kandidaat ziektegenen onderling te identificeren, of door ze direct aan de ziekte te koppelen. Zulke tools zullen in de toekomst ook op genoom-schaal moeten kunnen werken, aangezien vooruitgangen in de techniek het mogelijk zullen maken de volledige genomen van patiënten te sequencen.

In dit proefschrift onderzochten wij de grondvesten waarop al deze huidige en toekomstige kandidaat ziektegen prioritieringstools gebaseerd zijn, namelijk dat ziektes die op elkaar lijken veroorzaakt worden door mutaties in genen die met elkaar te maken hebben. In plaats van zelf een tool te bouwen hebben wij onderzocht in hoeverre verschillende typen functionele verbanden tussen ziektegenen hun betrokkenheid bij dezelfde ziekteprocessen weerspiegelen. We zochten ook uit in welke mate op elkaar lijkende phenotypen een gezamenlijke onderliggende ziekteproces weerspiegelen. Deze bevindingen kunnen gebruikt worden om zowel kandidaatgen prioritieringstools als ziektephenotype databases te verbeteren.

In hoofdstukken 1 & 2 beginnen we met een overzicht van de relaties tussen genen en ziektephenotypen. Hoofdstuk 1 geeft een algemeen overzicht van het proces van ziektegen identificatie en de rol van bioinformatica daarin. Hoofdstuk 2 gaat hier verder op in, en legt de nadruk op de modulariteit van genetische ziektes. Aan de hand van empirische kennis laten we zien hoe genetische ziektes zowel op genetisch als op phenotypisch niveau modulair zijn. Zulke modulariteit vergroot de mate waarin phenotypische similariteit gebruikt kan worden om genetische verbanden te identificeren door gezamenlijke functioneel genetische modules aan te

128

Samenvatting

duiden. In dit hoofdstuk geven we bovendien een overzicht van bestaande bioinformatica aanpakken voor kandidaatgen prioritering.

In hoofdstuk 3 beginnen we met het onderzoek naar de mate waarin genetische ziektes met gelijke of op elkaar lijkende phenotypen veroorzaakt worden door mutaties in functioneel gerelateerde genen. In dit hoofdstuk gebruiken we fysieke eiwit-eiwit interacties als maat voor functioneel verwantschap aangezien het één van de sterkste indicaties is dat ze iets met elkaar te maken hebben. We gebruiken ook ziektegenen van genetisch heterogene ziektes aangezien ze phenotypen veroorzaken die klinisch identiek zijn – en die dus heel sterk op elkaar lijken. We tonen aan dat eiwit-eiwit interacties inderdaad verrijkt zijn tussen ziektegenen, en dat positionele kandidaargenen wiens eiwitten met ziektegenen interacteren een tien keer zo grote kans hebben om bij dezelfde ziekte betrokken te zijn dan kandidaatgenen zonder interacties met ziektegenen. Interessant genoeg gaat dit ook op voor eiwit-eiwit interacties die uit andere soorten afkomstig zijn, wat de hoeveelheid bruikbare eiwit-eiwit interactie data sterk vergroot. We vinden echer ook dat interacties die uit literatuurbronnen gehaald zijn een sterke bias vertonen voor bekende ziektegenen en dus met zorg gebruikt moeten worden voor het prioriteren van kandidaatgenen die nog nooit met een ziekte in verband zijn gebracht.

In hoofdstuk 4 kijken we of een andere vorm van gen functioneel verwantschap, namelijk gecorreleerde gen expressie (co-expressie). In tegenstelling tot de in hoofdstuk 3 gebruikte eiwit-eiwit interactie data zijn co-expressie data veel omvangrijker. Ze zijn echter ook een minder sterke indicatie van functionele relaties tussen genen dan eiwit-eiwit interacties. Om deze reden onderzoeken we of het gebruik van evolutionaire conservering deze data nuttiger kunnen maken voor kandidaat ziektegen prioritiering. Wederom gebruiken we heterogene ziektegenen vanwege hun sterk op elkaar lijkende ziektephenotypen. We laten zien dat co- expressie inderdaad hoger is tussen genen die dezelfde ziekte veroorzaken dan voor andere genen, en dat dit effect versterkt wordt door evolutionaire conservering van co-expressie te gebruiken. We laten ook zien dat dit gebruikt kan worden om de prioritiering van positionele kandidaat ziektegenen te verbeteren.

In hoofdstuk 5 verzetten we onze aandacht van ziektegenen naar ziektephenotypen. Na aangetoond te hebben dat genen wiens mutatie tot eenzelfde phenotype leiden de neiging hebben functioneel gerelateerd te zijn, kijken we nu of dit patroon ook doorgetrokken kan worden naar op elkaar lijkende, maar toch verschillende ziektephenotypen. Om dit te doen is het nodig de ziektephenotypen op een systematische manier te beschrijven, om kwantitatieve overlappen te kunnen bepalen. In het verleden werd dit gedaan door tekst-mining van tekstuele ziektebeschrijvingen, maar deze aanpak hheft zijn beperkingen en resulteert niet in erg systematische beschrijvingen van ziektephenotypen. Er zijn verschillende databanken die ziektephenotypen wel systematisch beschrijven, maar het is nog niet duidelijk hoe bruikbaar deze zijn voor dit soort ziektephenoom analyses. In dit hoofdstuk bevestigen wij dat de relatie tussen verwante genen en verwante phenotypen inderdaad doorgetrokken kan worden buiten de grenzen van individuele ziektes. Daarnaast identificeren wij een aantal databank eigenschappen die de genetische coherentie van systematisch bepaalde phenotype clusters vergroot. We vinden dat zowel diepe ziektekenmerk hierarchieën als het gebruik van frequentie

129

Samenvatting

inschattingen ziektekenmerken de genetische coherentie van phenotype clusters vergroot. In tegenstelling tot eerdere studies vinden we dat de algehele phenotypische similariteit belangrijker is voor het identificeren van genetische verwantschap tussen ziektes dan het delen van specifieke zeldzame kenmerken. De tegenstrijdige bevindingen van eerdere studies zijn te wijten aan het feit dat deze gebruik maakten van onvolledige ziektebeschrijvingen die verkregen werden door tekst-mining. Onze belangrijkste bevinding is echter dat complete en systematische phenotypebeschrijvingen heel belangrijk zijn voor ziektephenotype onderzoek, en dat het meer aandacht nodig heeft voor toekomstig onderzoek.

We eindigen in hoofdstuk 6 met een blik naar de toekomst van dit veld, en het belang van het door ziektephenotypen geïnformeerde bioinformatica daarin. Door de opkomst van nieuwe sequentie-technologieën zal dit veld sterk veranderen over de komende jaren. Positionele kandidaatgen prioritering wordt minder relevant aangezien het hele genomisch gebied – en eventueel zelfs hele patiëntengenomen – in één keer gesequenced zal kunnen worden. Toch zullen er ook nieuwe uitdagingen ontstaan, zoals het interpreteren van de grote aantallen mutaties en genetische varianten die deze technologieën zullen opleveren, en het in kaart brengen van hoe ze aan het ziekteproces bijdragen. Hierdoor zal de rol van de bioinformatica in het onderzoek naar genetische ziektes groot blijven ook in de toekomst, waarschijnlijk zelfs nog groter dan het tegenwoordig al is. Zoals wij hebben laten zien zal de toepasbaarheid van de bioinformatica afhangen van zowel goede ziektephenotype databanken als van meer functioneel genetische data. Het allerbelangrijkste is echter dat het grondbeginsel waarop deze bioinformatische kandidaatgen aanpakken gebaseerd zijn – namelijk dat phenotypisch op elkaar lijkende ziektes ook genetisch met elkaar te maken hebben – ook daadwerkelijk klopt, zoals wij in dit proefschrift duidelijk hebben laten zien.

130

Acknowledgements

First and foremost I would like to thank my PhD supervisor Han Brunner for investing so much time and effort in me, in spite of his busy schedule. In particular, he was responsible for getting my project back on the rails with regular bi-weekly meetings at a time when I was getting a little discouraged about it due to lack of progress. His bubbling enthusiasm and perennially cheerful, positive attitude greatly helped to keep my spirits up even when things weren’t going so well. He served not only as my supervisor, but also as my mentor. Many thanks for everything, Han.

I would also like to thank my PhD co-supervisor Martijn Huynen for investing so much time and effort in my project, despite the fact that the subject matter was not part of his group’s core activities. I would like to thank him for letting me be a member of the COMICS (COMputational genomICS) group, which exposed me to a lot of bright minds and stimulating research discussions. I know I am not the easiest person to deal with, Martijn, but I think you did quite well at it.

The third critically important person for my PhD was Gert Vriend, to whom I also owe a big thank you. He is directly responsible for me starting on the PhD program in the first place, as he convinced me to do it here instead of heading straight to warmer climates after my MSc and looking for something there, as I originally planned to do. He also found a way to start my PhD funding before the grant actually started. I know it was a lot of administrative hassle to organize my funding, and I am grateful to both he and Han for putting in the effort. Finally, Gert was also a mentor, keeping an eye on me throughout my stay at the CMBI.

In addition to my primary supervisors I would like to thank Berend Snel, who was my immediate supervisor in my first PhD year, for showing me how to do Bioinformatics at a time when I was still thinking like a programmer from the information technology sector. He left to start his own group in Utrecht, and I wish him success with it. I would also like to thank Marc van Driel, the supervisor of my MSc project and my direct predecessor, for allowing me to take over his project when he got his own PhD. He maintained an interest in my work, continuing to point out relevant articles to me even after my work was finished.

I would like to thank the members of the COMICS group for providing me with an interesting scientific environment within which to work. I would like to thank my room mates, who provided me with a pleasant daily working environment: Bas for his zany sense of humor, vast expertise and help during my first PhD analysis; Iziah for memories from Africa and for providing me with some of his unpublished data for my own analyses (happy wedding and married life, as far as that’s possible), and Rasa for her nice company and discussions (achiu). I would particularly like to thank Isabel for functioning as my behavioral therapist – muito obrigado Isabel, I wish you happiness and success in both your personal and your professional life. Radek, it’s a pity we don’t have any co-publication together, especially since we have almost the same interests. Maybe that will still come in the future. Philip, we do have a co-publication together, thanks to the fact that you generously considered my contribution to your second PhD paper worthy of a

131

co-authorship. It must be difficult juggling a family and PhD study at the same time – good luck with both. Joanna, you were one of the social motors of the group, and indeed of the CMBI, coming up with the role-playing and pub quiz evenings. I wish you all the best in your new post- doc position in France. Fiona, thank you for making sure I didn’t miss the interesting daily COMICS lunch banter, even though my eating schedule was different from everyone else’s. Good luck with your own PhD work. Urszula, you are probably the only normal person in the group, and I wish I had your personality. Edwin, you were always willing to swap group presentation schedules when I wanted to postpone mine. In addition to present COMICS colleagues, I would also like to thank past colleagues who enriched my earlier years: Toni (congratulations with your move to Barcelona), Guénola, Vera, René and Thijs. Finally the many undergraduate students that came and went over the years also contributed to making the group as nice as it was.

Apart from the COMICS group I would like to thank the rest of the CMBI for creating a pleasant working environment. I would particularly like to thank Barbara and the other office staff (Esther, Eugenie and Margot) for efficiently organizing everything that needed to be organized. Also, I would like to offer special thanks to our system administrator Wilmar for always making sure I had what I needed to work – there was no problem he couldn’t fix. Additionally I want to mention the names of Richard (for interesting discussions about work and travel, and help with thesis formatting), Christof (for social advice in the days before Isabel), Maarten (for webserver and other technical help) and Celia (for giving me a bit more insight into how to teach students). Finally, my apologies are due to Hanka for using up her PhD funding and forcing her to go to the Department of Human Genetics to get it back. I caused you more trouble than I should have.

In addition to the CMBI, I would also like to extend my thanks to the members of the Human Genetics Department, from whence my PhD project came. I would especially like to thank Jeroen for the collaboration that resulted in chapter 4, and for being such a great colleague. Best wishes to you and Marsha, and I hope you have a happy family life in the near future. I should not forget to thank Ineke, Miranda and Lilly for organizing my meetings with Han and making every attempt to find space in Han’s busy schedule, and also Wil, Marly, Hans and Martijn for always finding ways to finance and extend my PhD contract after it was officially over. Finally, I would like to thank Jo, Hans and Annette for providing me with interesting things to do while writing up my thesis, Jayne for helping a student I was supervising with his microarray analyses (I know we took up more of your time than you anticipated, but you remained gracious throughout – as always) and David for being a great colleague and an inspiration during his short stay at the CMBI.

I would also like to express my appreciation of the NCMLS which provided stimulating seminars and forum evenings, and a good collaborative scientific environment within which to work.

All of this was made possible by the NBIC’s BioRange program, which provided funding for my PhD.

Last but not least, I would like to thank my mother and father, and my brothers Elliott, Kenneth and Edward for providing a life outside of work.

132

Appendix I: Color Figures

133

Chapter 2, Figure 1: Possible relationships between genes and phenotypes. The similar phenotypes of Stickler, Marshall and OSMED syndromes are caused by mutations in the functionally closely related genes COL2A1, COL11A1 and COL11A2. The phenotypically distinct Pallister-Hall syndrome is caused by mutations in the functionally unrelated or only weakly related GLI3 gene. Several genes can underlie one phenotype, as in the case of Stickler syndrome which can be caused by mutations in each of the three collagen genes. Conversely, one gene can lead to different phenotypes as in the case of COL11A1 (Stickler and Marshall phenotypes) and COL11A2 (Stickler and OSMED phenotypes). Thickness of black lines linking genes indicates (hypothetical) degree of functional relatedness between them.

134

Chapter 2, Figure 3: Mapping phenotype networks to gene networks for candidate disease gene prediction. In this hypothetical example, diseases 1, 2 and 3 have known causative genes (genes A, C and E respectively), and are all phenotypically related to disease 4 which lacks an identified causative gene. If the known causative genes are functionally closely related, as in this case, then candidate genes (genes B and D) can be hypothesized for disease 4 due to their close functional relationships to the known genes of the phenotypically related diseases. This kind of analysis could potentially also identify higher level links between functional genetic modules and syndrome families. Green lines connect (green) known disease genes with their (green) diseases, while red lines indicate potential causative links between (red) candidate genes and the (red) disease of unknown etiology. Black lines of varying thickness indicate the degree of phenotypic and functional similarity between diseases and genes respectively.

135

Chapter 4, Figure 5: Procedure for calculating conserved co-expression scores. The procedure is illustrated using an example involving KOGs KOG0011 and KOG3438 between human and fly. KOG0011 contains two genes in each species (RAD23A, RAD23B and FBgn0026777, FBgn0039147 in human and fly respectively) while KOG3438 contains one in human (CKS1B) and two in fly (FBgn0010314 and FBgn0037613). For each species a KOG0011-KOG3438 co-expression (thick purple and green arrows) correlation is calculated by taking the mean of all gene-gene combinations (thin black arrows) for the two KOGs. The mean of these species-specific KOG-pair correlations (thick vertical orange arrow) is taken to represent the final multispecies KOG0011-KOG3438 co-expression correlation. This co-expression value is used for all relevant gene-pairs as their KOG-based co-expression score. If co-expression is conserved in both species then this value will be high, if it is high in only one species it will be intermediate, and if it is low in both it will be low.

136

Chapter 4, Figure 6: Disease gene ranking procedure. Co-expression Spearman correlation coefficients (SCC) are determined for all candidate disease genes from a disease locus with known causative gene (E) and another gene known to cause the same disease (A). The SCCs are ranked, and the ranks are subsequently normalized to the 0.0-1.0 range. The relative position of the causative gene in the locus (E) is determined. This procedure would subsequently be repeated with E as known disease gene and A as candidate disease gene. For each disease, each gene is treated as a candidate disease gene in turn and its co-expression is successively compared to all other genes known to cause the same disease, leading to a list of scores per disease.

137

Chapter 5, Figure 1. Comprehensively annotated syndromes cluster better than sparsely annotated syndromes. ( A) The Ehlers-Danlos and Charcot-Marie-Tooth syndrome families overlap in the phenome landscape of the OMIM data set, but separate when the phenotype descriptions are supplemented with POSSUM annotation. The phenome landscapes were created using multi-dimensional scaling of the HPO feature-based OMIM distance matrices (left), supplemented with POSSUM annotation (right). The more similar the annotations of two syndromes are, the closer they are on the landscape. The background colors indicate the density of syndromes in that region of the landscape. Lighter colors represent higher densities. ( B) Mean phenotypic similarity is consistently greater for the POSSUM supplemented OMIM data set (red dashed lines) than for the original OMIM data set (blue dashed lines). Besides Ehlers- Danlos (n=12) and Charcot-Marie-Tooth (n=12), the more phenotypically diverse family of ciliopathies is also shown (n=59; Supplementary Table 1). Continuous lines show the distributions of mean distances for randomly composed syndrome families of equivalent size (n=107) for the original and supplemented OMIM data sets.

138

Chapter 5, Figure 2. Biological coherence of phenotypic clusters for different data sets and conditions. The box plots show relative enrichment of shared GO terms for genes associated with diseases within clusters compared to randomly permutated phenotype data sets (n=30). Box limits show the 25th and 75th percentiles, whiskers extend out up to 1.5x the box range, points outside this range are plotted individually. ( A) The full HPO ontology results in biologically more coherent phenotype clusters than a simplified HPO ontology containing only more general features, but only when the OMIM phenotypes are supplemented with POSSUM annotation (purple boxes). ( B) Artificial under- annotation of the POSSUM and Orphanet databases by randomly halving the syndrome feature lists (“sparse”) leads to strong reductions in cluster biological coherence. However, limiting the Orphanet syndrome descriptions to just the very frequent features has limited impact on cluster coherence, despite the strong reduction in the average number of features per syndrome to just 57% of the original. ( C) Weighting Orphanet features according to their prevalence within affected patients improves the biological coherence of clustered phenotypes. Counter-weighting them by assigning higher weights to less frequently occurring features abolishes the biological coherence of the resulting phenotype clusters almost completely. ( D) Weighting annotated features according to their specificity (number of syndromes they occur in) using the inverse document frequency (I.D.F.) weighting scheme diminishes cluster biological coherence for well annotated POSSUM syndromes, but improves it for under-annotated syndromes.

139

140

Curriculum Vitae

Martin Oti was born on 7 October 1974 in Kaduna, Nigeria. He obtained his “Doctorandus” degree in Biology from the University of Utrecht, the Netherlands, in 1998. His interests encompassed both Biology and computer programming, and he created an individual-based population genetics simulation model as part of his thesis work. He subsequently worked for several years in the information technology industry as a software developer, due to initial difficulty in finding a Biology PhD project that had a substantial computational component. In 2003 he returned to scientific research, following an accelerated one-year Master’s degree program in Bioinformatics at the CMBI at the Radboud University in Nijmegen, the Netherlands. During this period he worked on gene-phenotype relationships in the fruit fly, with an eye to identifying genetic relationships that might inform research into human inherited diseases. After completing this in 2004 he started on a four-year PhD program at the same university, in which he continued investigating the functional relationships between genes and how these relate to phenotypes, this time in human genetic diseases. He investigated the degree to which genetic diseases with similar signs and symptoms are caused by mutations in genes with related biological functions, and how best to use such disease phenotype information to guide genetic research. The results of these investigations are presented in this thesis.

141

List of Publications

1. Oti M , Huynen MA, Brunner HG. The biological coherence of human phenome databases. (submitted)

2. Rinne T, Kouwenhoven EN, Van Heeringen SJ, Oti M , Tena JJ, Alonso ME, De la Calle- Mustiene E, Parsaulian L, Smeenk L, Bolat E, Dutilh BE, Huynen MA, Gilissen C, Hoischen A, Veltman JA, Tjabringa S, Schalkwijk J, Hamel B, Brunner HG, Roscioli T, Oates E, Wilson M, Manzanares M, Gómez-Skarmeta JL, Stunnenberg HG, Lohrum M, Van Bokhoven H, Zhou H. Genome-wide profiling of p63 binding sites identifies genes and regulatory elements for p63-related disorders: Elucidation of the genetic basis of split hand/foot malformation type 1 (SHFM1). (submitted)

3. Oti M , van Reeuwijk J, Huynen MA, Brunner HG. Conserved co-expression for candidate disease gene prioritization. BMC Bioinformatics. 2008 Apr 23;9:208. PMID: 18433471

4. Kensche PR, Oti M , Dutilh BE, Huynen MA. Conservation of divergent transcription in fungi. Trends Genet. 2008 May;24(5):207-11. Epub 2008 Mar 28. PMID: 18375009

5. Ala U, Piro RM, Grassi E, Damasco C, Silengo L, Oti M , Provero P, Di Cunto F. Prediction of human disease genes by human-mouse conserved coexpression analysis. PLoS Comput Biol. 2008 Mar 28;4(3):e1000043. PMID: 18369433

6. Oti M , Huynen MA, Brunner HG. Phenome connections. Trends Genet. 2008 Mar;24(3):103-6. Epub 2008 Feb 19. PMID: 18243400

7. Oti M, Brunner HG. The modular nature of genetic diseases. Clin Genet. 2007 Jan;71(1):1-11. PMID: 17204041

8. Tiffin N, Adie E, Turner F, Brunner HG, van Driel MA, Oti M , Lopez-Bigas N, Ouzounis C, Perez-Iratxeta C, Andrade-Navarro MA, Adeyemo A, Patti ME, Semple CA, Hide W. Computational disease gene identification: a concert of methods prioritizes type 2 diabetes and obesity candidate genes. Nucleic Acids Res. 2006 Jun 6;34(10):3067-81. PMID: 16757574

9. Oti M , Snel B, Huynen MA, Brunner HG. Predicting disease genes using protein-protein interactions. J Med Genet. 2006 Aug;43(8):691-8. Epub 2006 Apr 12. PMID: 16611749

142

143

144