UNIVERSITY OF CINCINNATI


Computational Selection and Prioritization of Disease Candidate Genes

A dissertation submitted to the

Graduate School

of the University of Cincinnati

in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY in the Department of Biomedical Engineering of the College of Engineering

2008

by

Jing Chen

B.E., National University of Singapore, 2002

Committee Chairs: Bruce J Aronow, Ph.D. and

Anil G Jegga, D.V.M., M.S.

Committee Member: Marepalli Rao, Ph.D.

Abstract

Identifying causal genes underlying susceptibility to disease is a problem of primary importance in the post-genomic era and in current biomedical research. Recently, there has been a paradigm shift of such gene-discovery efforts from rare, monogenic conditions to common “oligogenic” or “multifactorial” conditions such as asthma, diabetes, cancers and neurological disorders. These conditions are referred to as multifactorial because susceptibility to them is attributed to the combinatorial effects of genetic variation at a number of different genes and their interaction with relevant environmental exposures. The expectation is that identification and characterization of the causal genes implicated in the inherited component of disease susceptibility will lead to substantial advances in our understanding of disease. These advances in turn can lead to improvements in diagnostic accuracy, prognostic precision, and the range and targeting of available therapeutic options, and ultimately realize the promise of personalized or “tailor-made” medicine. The objective of my thesis therefore is to design, develop, and validate computational approaches for identification and prioritization of these causal genes.

The first approach tests the hypothesis that the majority of genes that impact or cause disease share membership in any of several functional relationships. We use a p-value-based meta-analysis method to prioritize the candidate genes based on functional annotation. For the first time, we use, and demonstrate the utility of, mouse phenotype annotations in human disease gene prioritization. Since this approach is limited to genes with functional annotation, and because many human genes are yet to be functionally classified, we have developed another approach that is independent of gene functional annotations. We implemented a set of new algorithms to prioritize genes based on protein-protein interaction networks. Large-scale cross-validations were performed to compare and evaluate the methods and to determine the associated parameters. Our results demonstrate that the functional annotation-based method performs better than other approaches. Although the network-based method did not perform as well as the functional annotation-based method, it is much simpler to implement, apply, and execute. The best performance, however, was achieved by combining the results from the two methods, as demonstrated through an asthma test case.


Acknowledgements

First, I owe a debt of gratitude to my advisors and mentors, Dr. Bruce Aronow and Dr. Anil Jegga, for providing me with supervision, motivation and encouragement throughout my graduate studies. Their enthusiasm, high expectations, and trust pushed me toward a new level of professionalism. They have forever set standards for dedication and excellence to which I will always aspire. I owe many of my accomplishments to them. Without their care, supervision and friendship, I would not have been able to complete this work.

Thanks and gratitude also to Dr. Marepalli Rao for being a part of the graduation committee, for sharing his expertise in statistics, and for making complicated statistical concepts and problems easy to understand. I am grateful for his advice and help and, importantly, for making me realize the importance of statistics in bioinformatics research.

Thanks to Dr. Jarek Meller, Dr. Mario Medvedovic, Dr. Michael Wagner and Dr. Yan Xu in the Division of Biomedical Informatics at Cincinnati Children’s Hospital for giving advice and raising questions during my presentations and discussions. The journal club experience has been valuable and unforgettable for me.

As part of the graduate student population in Bioinformatics at the University of Cincinnati, I have been very lucky to have had many supportive fellow students. Sincere thanks to Sivakumar Gowrisankar, Ranga Chandra Gudivada, Johannes Freudenberg, and Mukta Phatak for their support in various aspects of the project and for always being available for help. I am also very grateful to Dr. Xiaohua Sheng for his suggestions on statistical analysis and Eric Bardes for his advice on programming.

Finally, I wish to thank my parents for their love, support, and continued encouragement throughout the years. I wish to thank my son Lang who troubles me with so much joy. I would also like to thank my wife Huan Xu, for giving me her unconditional love and support throughout everything. This dissertation could not have been completed without her. I would like to dedicate this work to her.

Publications arising from this thesis

Papers

Chen J, Xu H, Aronow BJ, Jegga AG 2007. Improved human disease candidate gene prioritization using mouse phenotype. BMC Bioinformatics 8(1): 392.

Chen J, Aronow BJ, Jegga AG 2008. Disease candidate gene identification and prioritization using protein-protein interaction networks. (Submitted)

Chen J, Gowrisankar S, Xu H, Aronow BJ, Jegga AG 2008. In silico Prioritization of Novel Asthma Candidate Genes. (Submitted)

Gudivada RC, Qu X, Chen J, Jegga AG, Neumann EK, Aronow BJ 2007. Identifying disease-causal genes using semantic web-based integration of genomic-phenomic data (accepted by Journal of Biomedical Informatics)

Book chapters

Chen J, Jegga AG 2007. Systems Biology Based Integrative Approaches to Identify and Prioritize Novel Disease Candidate Genes. Bios Publications, In Press.

Table of contents

Abstract
Acknowledgements
Publications arising from this thesis
Table of contents
List of figures
List of tables

Chapter 1. Overview
1.1 Motivation
1.2 Contributions of this Thesis

Chapter 2. Systems Biology Based Integrative Approaches to Identify and Prioritize Novel Disease Candidate Genes
2.1 Background
2.1.1 Connecting phenotype with genotype: Disease gene discovery
2.1.2 Traditional Candidate Gene Approach
2.1.3 Candidate disease gene prediction using protein-protein interactions
2.1.4 Candidate gene prioritization based on functional annotations
2.2 Current work: improved functional-based and novel network-based prioritization methods
2.3 Limitations of candidate gene prioritization approaches
2.4 Summary

Chapter 3. Discovery and Prioritization of Candidate Genes that Cause or Impact Disease Using an Integrative Genome-Transcriptome-Phenome-Bibliome Approach
3.1 Background
3.2 Materials and methods
3.2.1 Data sources
3.2.2 Pre-processing of annotation terms
3.2.3 Processing of Training Set Genes
3.2.4 Similarity measure
3.2.5 Processing of Test Set Genes
3.3 Results
3.3.1 Mouse Phenotype as a Feature for Candidate Gene Prioritization
3.3.2 Document Identifier as a Feature for Candidate Gene Prioritization
3.3.3 Comparison of ToppGene with Other Gene Prioritization Approaches
3.3.4 Comparison of ToppGene with ENDEAVOUR: Random-gene cross-validation
3.3.5 Evaluation of features used for gene prioritization in ToppGene
3.3.6 Comparison of ToppGene with SUSPECTS and PROSPECTR: Locus-region cross-validation
3.3.7 Comparison of ToppGene with ENDEAVOUR and SUSPECTS
3.3.8 ToppGene Implementation and Access
3.4 Discussion
3.5 Conclusions

Chapter 4. Disease candidate gene identification and prioritization using Protein-Protein Interaction Network
4.1 Introduction
4.1.1 Protein-protein interactions networks
4.1.2 Ranking algorithms in networks
4.2 Methods
4.2.1 Human protein interaction datasets
4.2.2 Prioritization methods
4.2.3 PPIN analysis and derivation of topological parameters
4.2.4 Evaluation methods of PPIN topological features
4.3 Results
4.3.1 Human protein interaction network
4.3.2 Evaluation
4.4 Discussion and Conclusion

Chapter 5. Prioritization of Novel Asthma Candidate Genes: A Complete Case Study
5.1 Introduction
5.2 Methods
5.2.1 Gene sets preparation
5.2.2 Prioritization of candidate genes
5.2.3 Combination of multiple prioritization results
5.2.4 Evaluation of the prioritization results
5.3 Results
5.3.1 Over-represented functional annotation terms of known asthma genes
5.3.2 Performance of each prioritization method
5.3.3 Performance of combined ranked lists
5.3.4 Genome-wide prioritization of asthma-related candidate genes
5.4 Conclusion

Bibliography

Appendix A: 252 known asthma related human genes compiled from literature

Appendix B: Web application

List of figures

Figure 3-1: Schematic representation of gene prioritization. (A) Genes in the training set are selected based on their attributes or current gene annotations (genes associated with a disease, phenotype, pathway or a GO term). (B) Test gene source can be candidate genes from linkage analysis studies or genes differentially expressed in a particular disease or phenotype. (C) Enriched terms of the eight gene annotations, namely, GO: Molecular Function, GO: Biological Process, Mouse Phenotype, Pathways, Protein Interactions, Protein Domains and , compiled from various data sources, are obtained for the training set of genes. (D) A similarity score is generated for each annotation of each test gene by comparing to the enriched terms in the training set of genes. The final prioritized gene list is then computed based on the aggregated values of the eight similarity scores.

Figure 3-2: Schema diagram of comparison of ToppGene with other applications. To evaluate the performance of our approach and also compare it with other similar gene prioritization approaches, we performed two types of comparisons: large-scale cross-validations and small-scale test cases. For large-scale cross-validations, we used the same or similar training sets as mentioned in the previous methods. Specifically we compared ToppGene’s performance with ENDEAVOUR using random-gene cross-validation; and with PROSPECTR and SUSPECTS, we used locus-region cross-validation. Further, as test cases, we selected two diseases, congenital heart defects (CHD) and diabetic retinopathy (DR), and compared the prioritization performance of ToppGene with SUSPECTS and ENDEAVOUR.

Figure 3-3: ROC curves of random-gene cross-validation based on score ranks. Green curve was generated from the 19 disease gene training sets. Black curve, negative control, was generated from 20 random training sets. See text for the definitions of sensitivity and specificity.

Figure 3-4: AUC of different feature sets. Red bars indicate the AUC scores based on each feature set, and blue bars are the corresponding random controls. Yellow bars indicate the coverage of each feature set in the whole genome. For example, mouse phenotype (MP) has AUC score 0.78 and covers 19% of genes in the whole genome. For each feature set, the ROC curve was generated using genes with annotations only.

Figure 3-5: ROC curves of random-gene cross-validation based on scores. The red curve was generated using all features sets (AUC score 0.913). The blue curve was generated without Mouse Phenotype annotations (AUC score 0.893). The orange curve was generated without Mouse Phenotype and Pubmed annotations (AUC score 0.888). See text for the definitions of sensitivity and specificity.

Figure 3-6: The performance of locus-region cross-validation using different feature sets. The average rank ratio (y-axis on the left) indicates the average rank ratio of the “target” genes in the resulting list, thus lower value corresponding to a better performance. At the same time, the higher the number of top 5% ranked “target” genes among total of 150 prioritizations (y-axis on the right), the better the performance. As a result, it’s very clear that removing MP, PubMed or both resulted in significant drop of performance.

Figure 4-1: Venn diagrams of unique genes and interactions from all data sources.

Figure 4-2: Distribution of node degrees of the protein interaction network. X-axis is the degree and Y-axis is the number of nodes of that degree. Both X and Y-axis are log-scaled. A linear trend (power law property) can be observed from the plot.

Figure 4-3: Plot of Avg. clustering coefficient vs. degree of the protein interaction network. X-axis is the degree and Y-axis is the corresponding average clustering coefficient. Both X and Y-axis are log-scaled. A linear trend, although not strong, can be observed from the plot.

Figure 4-4: ROC curves from cross validations. This figure shows the representative ROC curves using PageRank with Priors with back probability 0.01, 0.05, 0.1, 0.3 and 0.5, and HITS with Priors with back probability 0.3 and 0.5. The random curve was derived from prioritization of the random training set using PageRank with Prior method with back probability 0.3.

Figure 4-5: ROC curves from cross validations. This figure shows the representative ROC curves using K-Step Markov method with K = 1, 2, 4, and 6. The random curve was derived from prioritization of the random training set using PageRank with Prior method with back probability 0.3.

Figure 4-6: Plots of AUC with different parameter values. The left panel shows the AUC values of PageRank with Priors with back probability varied from 0.01 to 0.5. The right panel shows the AUC values of K-Step Markov method with random walk length varied from 1 to 6. The vertical bars indicate the standard deviations.

Figure 4-7: Plot of differences in mean levels of AUC values from Tukey HSD test. Left panel are the results from PRankP, where AUC values with back probability of 0.01 and 0.5 were significantly lower than others. Right panel shows the results from KSMarkov and AUC values with random walk length 1 were significantly lower than others.

Figure 5-1: Flowchart describing the step-by-step procedure taken to prioritize for novel asthma candidate genes.

Figure 5-2: Distribution of known asthma genes in the . Known asthma genes are highlighted in blue.

Figure 5-3: Profile of the Running Enrichment Score & Positions of Known Asthma Genes on the Rank Ordered List based on

Figure 5-4: Subnetwork of asthma related genes. Red nodes are the known asthma genes. The rest nodes represent the genes connecting to at least two of the known asthma genes. Dark and light blue nodes represent the significant (p-value < 0.05) candidate genes in QTL and non-QTL regions respectively. Other genes are colored in grey. The size of each node is determined by the number of interactions with the known asthma-genes. For example, JAK2 and EP300 marked in the figure, are interacting with 10 and 9 asthma-genes respectively.

List of tables

Table 2-1: Current bioinformatics approaches and tools for prioritization of human disease candidate genes. The first column has the source or the name of the tool (including reference when available). The second column has URL of the corresponding web application, if available. If there is no web application, information regarding either the project home page or links to the corresponding supplementary material are provided. The third column is the genomic annotation types/features used by each of the methods. The last column has details of the training or the input data, if used. The last row, ToppGene, is the application developed in the current thesis.

Table 3-1: Comparison of features used in the three gene prioritization applications.

Table 3-2: Comparison of methods used in the three gene prioritization applications.

Table 3-3: Summary of comparison of results from ToppGene with other gene prioritization applications.

Table 3-4: Performance summary of locus-region cross-validation using different feature sets. When either MP or PubMed, or both (MP + PubMed) were left out, the performance dropped significantly.

Table 4-1: The columns of interaction table from NCBI Gene. Only column 2, 7 and 18 were captured for human interactions.

Table 4-2: The number of unique genes and interactions from each interaction data source.

Table 4-3: AUC values from each cross validation run. Column “Test Type” indicates the method and parameter settings of the test. p01 through p5 stand for PageRank with Priors with back probability 0.01 to 0.5 respectively; k1, k2, k4 and k6 represent K-Step Markov with K = 1, 2, 4 and 6 accordingly; h3 and h5 are HITS with Priors with back probability 0.3 and 0.5 respectively. There were 11 test conditions each repeated 5 times.

Table 4-4: Means and standard deviations of AUC values under 11 different cross validation conditions. Highlighted rows correspond to the best parameter value of each method.

Table 4-5: Summary of ANOVA on PRankP and KSMarkov with different parameter values. Significant p values suggest parameter values have strong effects on the performance.

Table 5-1: Normalized Enrichment Scores (NES) for different prioritization methods for ten iterations.

Table 5-2: P values for one-sided paired t test comparing different prioritization methods based on NES. White cell indicates that the P value is based on the null hypothesis that the column method performs better than the row method. Gray cell indicates that the P value is based on the null hypothesis that the row method performs better than the column method. It’s shown that the order of the performance for different methods from best to worst is combined > functional > network > microarray, and there’s no significant difference between the two combined methods.

Table 5-3: Top 37 genes in the combined ranked list of asthma candidate genes.

Table 5-4: Top 40 significant novel candidate genes sorted by up-regulation, down-regulation and p-values. Highlighted are the two genes, JAK2 and EP300, having the most number of interactions with known asthma related genes.

Chapter 1. Overview

1.1 Motivation

Identifying causal genes underlying susceptibility to human disease is a problem of primary importance in the post-genomic era and in current biomedical research. Recently, there has been a paradigm shift of such gene-discovery efforts from rare, monogenic conditions to common “oligogenic” or “multifactorial” conditions such as asthma, diabetes, cancers and neurological disorders. These conditions are referred to as multifactorial because susceptibility to them is attributed to the combinatorial effects of genetic variation at a number of different genes and their interaction with relevant environmental exposures. The expectation is that identification and characterization of the causal genes implicated in the inherited component of disease susceptibility will lead to substantial advances in our understanding of disease. These advances in turn can lead to improvements in diagnostic accuracy, prognostic precision, and the range and targeting of available therapeutic options, and ultimately realize the promise of personalized or “tailor-made” medicine.

Biomedical researchers usually approach the problem of disease candidate gene identification by first identifying a set of candidate genes using traditional positional cloning or high-throughput genomics techniques, and then subjecting these genes to further experimental investigation and biological validation. However, these methods are expensive and time-consuming. To reduce this cost, researchers must prioritize the candidate genes for wet lab experiments. Typically, researchers have relied on biomedical literature, queries to multiple knowledge and annotation databases, and hunches about expected properties of the disease gene to determine such prioritization. However, several computational approaches have recently been developed which perform this task automatically by relying on different genome-wide data sources, annotation and knowledge repositories, such as [1], literature, gene expression, sequence features, and others. In Chapter 2, we will present and discuss several of these approaches.

1.2 Contributions of this Thesis

The central problem addressed in this work is the prioritization of disease candidate genes on a genome-wide scale using bioinformatics methods. Chapter 2 describes the origin of the problem and the traditional experimental methods applied to it. With the success of the Human Genome Project, the availability of large-scale experimental data and the development of bioinformatics algorithms, new in silico analysis tools have emerged to identify and prioritize candidate genes; some of these are also listed in Chapter 2. The current work comprises two different methods that improve on and extend the existing ones.

The first method, named “ToppGene” and detailed in Chapter 3, is an improvement on previous functional annotation-based prioritization methods. We demonstrate that incorporating phenotype information for mouse orthologs of human genes, together with literature co-citations of genes, as prioritization features greatly improves human disease candidate gene analysis. The comparison results show that ToppGene outperforms existing candidate gene prioritization methods.
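The abstract describes the functional annotation-based scoring as a p-value-based meta-analysis. As a rough illustration only, the sketch below combines per-feature p-values with Fisher's method, which is one common meta-analysis technique; whether ToppGene uses exactly this combination is an assumption here, and Chapter 3 gives the actual procedure. The gene names and p-values are hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null hypothesis, X = -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom, where k = len(p_values).
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    # half_x = X / 2; floor tiny p-values to avoid log(0).
    half_x = -sum(math.log(max(p, 1e-300)) for p in p_values)
    # Closed-form chi-square survival function for even degrees of freedom:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return min(1.0, math.exp(-half_x) * total)

# Hypothetical per-feature p-values for two test genes (illustrative only).
candidates = {
    "geneA": [0.01, 0.20, 0.03],
    "geneB": [0.40, 0.60, 0.50],
}
# A smaller combined p-value means a higher priority.
ranked = sorted(candidates, key=lambda g: fisher_combined_p(candidates[g]))
```

In this toy input, the gene whose features are consistently similar to the training set receives the smaller combined p-value and is ranked first.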


Chapter 4 presents another group of prioritization methods, which rank genes based on protein-protein interaction networks. Specifically, three methods were tested: PageRank with Priors, HITS with Priors, and the K-Step Markov method. Cross-validation was used to determine the optimal parameter setting for each method and to compare the relative performance of the different methods. The results indicate that although network-based methods are generally not as effective as the integrated functional annotation-based methods for disease candidate gene prioritization, they are relatively easier to apply and more efficient. Furthermore, network-based prioritization performs better than all other individual functional annotations, indicating that protein-protein interaction networks are a potentially good feature for disease candidate gene prioritization.
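To make the idea of network-based ranking concrete, the following is a minimal sketch of a PageRank-with-priors style iteration on a toy undirected network. It assumes the usual formulation in which, at every step, the random walker returns to the seed (training) genes with a fixed "back probability"; the function name, node labels and edge list are hypothetical and do not come from the thesis data.

```python
def pagerank_with_priors(edges, seeds, back_prob=0.3, iterations=100):
    """Score nodes of an undirected network by relevance to a seed set."""
    # Build an adjacency map from the undirected edge list.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seeds_in_net = [s for s in seeds if s in neighbors]
    if not seeds_in_net:
        raise ValueError("no seed gene is present in the network")

    # Prior: uniform over the seed genes, zero elsewhere.
    prior = {n: 0.0 for n in neighbors}
    for s in seeds_in_net:
        prior[s] = 1.0 / len(seeds_in_net)

    score = dict(prior)
    for _ in range(iterations):
        # With probability back_prob the walker jumps back to a seed ...
        nxt = {n: back_prob * prior[n] for n in neighbors}
        # ... otherwise it moves to a uniformly chosen neighbor.
        for u, nbrs in neighbors.items():
            share = (1.0 - back_prob) * score[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        score = nxt
    return score

# Toy network: S1 and S2 are seed genes, A-D are candidates (hypothetical).
edges = [("S1", "A"), ("S2", "A"), ("S1", "B"), ("B", "C"), ("C", "D")]
scores = pagerank_with_priors(edges, seeds=["S1", "S2"], back_prob=0.3)
candidates = sorted((n for n in scores if n not in ("S1", "S2")),
                    key=lambda n: -scores[n])
```

Candidates that are closer to the seeds, or connected to more of them, accumulate more score. A larger back probability keeps the walk nearer to the seeds, which is why this parameter has to be tuned by cross-validation.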

Chapter 5 is a complete case study in which the two bioinformatics tools developed in this thesis were used together with a gene expression-based prioritization method to rank asthma-related candidate genes. Cross-validation shows that the functional annotation-based method performed better than the network-based method, and significantly better than the expression-based method. The best performance was achieved by combining the ranks from the first two methods; this combination was therefore used to prioritize the available genes in the entire genome for new asthma candidate genes.
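As an illustration of what combining ranks from two methods can look like, the sketch below averages each gene's rank ratio (rank divided by list length) across the input lists. This averaging rule is an assumption made for the example, not necessarily the combination procedure used in Chapter 5, and the gene identifiers are hypothetical.

```python
def combine_ranked_lists(*ranked_lists):
    """Merge ranked gene lists by mean rank ratio (rank / list length)."""
    genes = set().union(*ranked_lists)
    combined = {}
    for gene in genes:
        ratios = []
        for lst in ranked_lists:
            if gene in lst:
                ratios.append((lst.index(gene) + 1) / len(lst))
            else:
                # Assumption: a gene missing from a list gets the worst ratio.
                ratios.append(1.0)
        combined[gene] = sum(ratios) / len(ratios)
    # Sort by combined score, then by name so ties are deterministic.
    return sorted(combined, key=lambda g: (combined[g], g))

# Hypothetical outputs of the functional and network-based methods.
functional = ["g1", "g2", "g3", "g4"]
network = ["g2", "g3", "g1", "g4"]
combined_ranking = combine_ranked_lists(functional, network)
```

A gene ranked well by both methods moves to the top of the combined list, while a gene favored by only one method is pulled toward the middle.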

A web application was developed to implement the methods described in this thesis. It is accessible at http://toppgene.cchmc.org. See Appendix B for details.

Chapter 2. Systems Biology Based Integrative Approaches to Identify and Prioritize Novel Disease Candidate Genes

2.1 Background

2.1.1 Connecting phenotype with genotype: Disease gene discovery

Connecting phenotype with genotype is the fundamental aim of genetics [2]. The conventional route to disease gene discovery, especially for monogenic and/or Mendelian diseases, has been positional cloning. Typically, the gene responsible for a trait is first localized by linkage analysis to a small interval (ideally less than 1 centiMorgan, cM) by successive rounds of linkage mapping within families. Next, each of the candidate genes mapping to the interval is assessed for its potential functional relevance to the disease and screened for etiological mutations. More than 1,200 genes causing human diseases or traits have been identified, largely by a process that is generally referred to as ‘positional cloning’.

Classic examples of successful positional cloning include hemochromatosis, nail patella syndrome and lactose intolerance [2]. In contrast, relatively few genes underlying genetically complex traits have been identified in the last 20 years [3]. Genes that contribute to complex traits (also known as quantitative trait loci or QTLs) pose special challenges that make gene discovery more difficult, including locus heterogeneity, epistasis, low penetrance, variable expressivity and pleiotropy, and limited statistical power [4, 5]. Notable examples of these difficulties involve important diseases such as schizophrenia in , where claims of linkage discovery have been notoriously difficult to verify [3]. The slow progress stems from the weak relationship between genotype, at any given locus, and phenotype that characterizes multifactorial traits. Additionally, the regions of interest defined through complex-trait linkage studies frequently exceed 30 cM in size and contain several hundred genes.

Genome-wide linkage analysis has also been carried out for many common diseases and quantitative traits, for which the Mendelian disease characteristics might not apply [6]. In some cases, genomic regions that show significant linkage to the disease have been identified, leading to the discovery of variants that contribute to susceptibility to diseases such as inflammatory bowel disease (IBD), schizophrenia and type 1 diabetes. However, for most common diseases, linkage analysis has achieved only limited success, and the genes discovered usually explain only a small fraction of the overall heritability of the disease. For example, variants known to affect the risk of IBD together explain an excess risk to siblings of just over two-fold, compared with a total excess risk of ~30-fold, indicating that many other causal genes are yet to be discovered. The lack of success so far can be attributed to (i) the low heritability of most complex traits; (ii) the inability of the standard set of microsatellite markers — which are spaced 10 cM apart — to extract complete information about inheritance; (iii) the imprecise definition of phenotypes; and (iv) inadequately powered study designs [6]. Although linkage analysis using dense marker sets, larger sample sizes and larger pedigrees could be more productive, extensive candidate gene studies are still required to progress from a broad region of linkage to the causal gene or genes within this region [6].

Difficulties with the positional cloning approach and genome-wide linkage analysis, and the need to cope with the large number of putative candidate genes, led many investigators to favor a strategy based primarily on identifying susceptibility variants through direct examination of biological candidates (the “candidate gene approach”). A major setback for this strategy, however, is our current ignorance about the biology of several complex diseases, which frustrates all efforts to define biological candidacy with confidence. Thus, the key to accelerating the discovery of multifactorial disease susceptibility genes lies in developing improved strategies for refining both disease-gene location and assessments of biological candidacy. While candidate gene studies examine the “good candidates” first, based on some prior knowledge of the disease and human genomics, genome-wide linkage analysis studies survey genomic variants across the genome and compare the frequencies of alleles or genotypes between disease cases and controls. Genome-wide scanning usually proceeds without any presuppositions regarding the importance of specific functional features of the investigated traits, but its principal disadvantage is that it is expensive and resource intensive. In general, genome-wide scanning, with the aid of DNA markers under family-based or population-based experimental designs, only locates the approximate chromosomal regions of quantitative trait loci (QTLs) at the cM level, and these regions usually contain a large number of candidate genes. In comparison, the alternative candidate gene approach has proven to be extremely powerful for studying the genetic architecture of complex traits and is a far more effective and economical method for direct gene discovery. Nevertheless, the practicability of the traditional candidate gene approach is largely limited by its reliance on existing knowledge about the known or presumed biology of the phenotype under investigation, and unfortunately the detailed molecular anatomy of most biological traits remains unknown. New strategies are therefore needed to break this information bottleneck, even though a considerable number of candidate genes have already been identified.

In this chapter, I will review and summarize the recent research advances in computational prioritization of candidate genes, including the outline of candidate gene approach and the extended strategies for breaking the information bottleneck of traditional candidate gene approach.

2.1.2 Traditional Candidate Gene Approach

The candidate gene approach has been ubiquitously applied in gene-disease research, genetic association studies, and biomarker and drug target selection in many organisms from animals to humans [7]. Candidate genes are typically genes with known biological function that directly or indirectly regulate the developmental processes of a specific phenotypic trait, which can be confirmed by evaluating the effects of the causal gene variants in an association analysis.

The underlying rationale for the candidate gene approach is that quantitative genetic variation of a phenotype is caused by functional mutations in the putative gene [8]. Candidate gene analysis is usually an indispensable step in the subsequent positional cloning of quantitative trait loci (QTLs) controlling the major genetic variation of the traits of interest after initial genome scans. The candidate gene approach has been criticized owing to low replication of results and its limited ability to include all possible causative genes [7]. Moreover, this approach is by necessity highly subjective in the process of choosing specific candidates from a large number of possibilities. The main disadvantage is that it requires existing physiological, biochemical or functional knowledge, such as hormonal regulation or biochemical metabolic pathways, which is generally limited or sometimes not available at all [8].

2.1.3 Candidate disease gene prediction using protein-protein interactions

As discussed earlier, since most of the common human diseases are multifactorial or caused by multiple genes, it is reasonable to expect these genes to be functionally related. Such functional relatedness has been exploited to aid in the finding of novel disease genes [9]. Among the forms of functional relatedness, direct protein–protein interactions are probably one of the strongest manifestations. In other words, we can hypothesize that mutations in interacting proteins may result in the same phenotype. This hypothesis is supported by examples of several genetically heterogeneous hereditary diseases caused by mutations in different interacting proteins (e.g. Hermansky-Pudlak syndrome and Fanconi anaemia [10, 11]). A recent study also reported that mutations in interacting proteins indeed cause similar disease phenotypes [12].

Therefore, protein–protein interaction data can be used to identify putative novel disease candidate genes.

Although studies have shown that the use of protein–protein interactions enables identification of novel candidate disease genes, there are several practical limitations with this approach.

First, high-throughput protein–protein interaction sets, especially yeast two-hybrid sets, are inherently noisy and contain many interactions with no biological relevance [13-16]. Surprisingly, only 5.8% of the human, fly, and worm yeast two-hybrid interactions were confirmed by the HPRD [17]. Second, there is a possibility of inherent bias towards well-studied proteins in the interactome. Third, some of the human protein interactome data is derived by extrapolating the high-throughput interactions from other species. Although previous studies have shown that protein-protein interactions are quite conserved across species [18], there is a possibility of species-specific protein interactions. Fourth, two interacting proteins need not lead to similar disease phenotypes when mutated; for instance, they may have different but overlapping functions, or one may be more dispensable than the other [17]. Additionally, disease proteins may lie at different points in a molecular pathway and need not interact with each other directly. Fifth, disease mutations need not always involve proteins (e.g. the telomerase RNA component in congenital autosomal dominant dyskeratosis) [17]. Finally, a practical limitation of using protein-protein interactions to identify novel candidate disease genes is the large number of putative candidates generated by mining the interactome. Extending the network from known seed lists of proteins invariably results in a large number of potential candidates, and the question remains as to which of these are true positives. For example, mining the human protein interactome using the 15 known breast cancer genes (OMIM database) results in 342 directly interacting genes (level 1) and 2469 indirectly interacting genes (level 2). One way of overcoming this problem is to prioritize the extended interactome either through functional similarity methods (using the seed list of 15 as the training set and the 342 or 2469 interactants as the test sets in the breast cancer example) or through network prioritization based on topology. Despite all the difficulties and issues mentioned above, there have been a few successful studies. See Chapter 4 for a detailed discussion of one such effort included in the current thesis.
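To make the notion of level 1 and level 2 interactants concrete, the following is a minimal sketch of how a seed list could be extended over an interaction network by breadth-first search. The function name, the edge-list input format and the toy gene names are assumptions made for illustration; this is not the procedure used to obtain the breast cancer counts quoted above.

```python
from collections import deque

def neighbours_by_level(edges, seeds, max_level=2):
    """Return {level: set of genes} reached from the seed genes.

    Level 1 holds direct interactants of the seeds; level 2 holds genes that
    are first reached through a level-1 gene. Seeds themselves are excluded.
    """
    # Build an undirected adjacency map from the interaction pairs.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    levels = {level: set() for level in range(1, max_level + 1)}
    while queue:
        gene = queue.popleft()
        if distance[gene] == max_level:
            continue  # do not expand beyond the requested level
        for other in adjacency.get(gene, ()):
            if other not in distance:
                distance[other] = distance[gene] + 1
                levels[distance[other]].add(other)
                queue.append(other)
    return levels

# Toy network with made-up gene names (not real interaction data).
toy_edges = [("S1", "A"), ("S1", "B"), ("S2", "B"),
             ("A", "C"), ("B", "D"), ("D", "E")]
toy_levels = neighbours_by_level(toy_edges, seeds=["S1", "S2"])
print(toy_levels)
```

Even in this toy case the candidate set grows with each level, which is the reason a subsequent prioritization step is needed.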

2.1.4 Candidate gene prioritization based on functional annotations

As discussed in earlier sections, evidence from many sources suggests that similar phenotypes are caused by functionally related genes, and this has been used to develop several bioinformatics-based approaches to predict novel candidate disease genes. Almost all of these approaches start with a list of known disease genes and then search for genes that share one or more functional attributes such as sequence similarity, gene expression pattern, pathway, phenotype, or gene ontology. Broadly, these approaches can be classified into three categories according to the gene functional attributes they consider: (a) those based on intrinsic disease gene properties; (b) those that use expression patterns and phenotypic information directly; and (c) those that look at functional relatedness between candidate genes. They can also be categorized into two groups depending on whether or not they require training sets to represent the prior knowledge of the particular disease or phenotype [19] (see Table 2-1 for a list and summary of current candidate gene prioritization approaches and tools). In the following sections, I list and discuss some of the current approaches used for candidate gene prioritization.

| Approach | Online availability | Data types used | Training set (Input) |
|---|---|---|---|
| Approaches based on disease gene properties | | | |
| Bortoluzzi et al. [20] | Article supplementary material at http://physiolgenomics.physiology.org/cgi/content/full/00095.2003/DC1/ | Expression | N/A |
| Smith and Eyre-Walker [21] | | Sequence, expression | N/A |
| Huang et al. [22] | Additional data files available from journal website at http://genomebiology.com/2004/5/7/R47 | Sequence, expression | N/A |
| DGP [23] | http://cgg.ebi.ac.uk/services/dgp/ | Sequence | N/A |
| PROSPECTR [24] | http://www.genetics.med.ed.ac.uk/prospectr/ | Sequence | N/A |
| Approaches using links between genes and phenotypes | | | |
| Genes2Diseases [25, 26] | http://www.ogic.ca/projects/g2d_2/ | Sequence, GO, literature mining | Phenotype GO terms, Known genes |
| BITOLA [27] | http://www.mf.uni-lj.si/bitola/ | Literature mining | Concept |
| Tiffin et al. [28] | Article supplementary data available at http://www.sanbi.ac.za/tiffin_et_al/ | Expression, literature mining | Disease |
| GeneSeeker [29, 30] | http://www.cmbi.ru.nl/GeneSeeker/ | Expression, phenotype, literature mining | N/A |
| GFINDer [31, 32] | http://www.bioinformatics.polimi.it/GFINDer/ | Expression, phenotype | N/A |
| TOM [33] | http://www-micrel.deis.unibo.it/~tom/ | Expression, GO | Known genes and/or disease loci |
| Approaches using functional relatedness between candidate genes | | | |
| Freudenberg and Propping [34] | | Phenotype, GO | N/A |
| OMIM phenome map [35] | http://www.cmbi.ru.nl/MimMiner/ | Phenotype, sequence, GO, protein interactions | N/A |
| Protein-protein interactions [17] | Article supplementary material on journal Web site at http://www.jmedgenet.com/supplemental/ | Protein interactions | Known genes and disease loci |
| POCUS [36] | Additional data files available from journal Web site at http://genomebiology.com/2003/4/11/R75 | GO | Disease loci |
| SUSPECTS [37] | http://www.genetics.med.ed.ac.uk/suspects/ | Sequence, expression, GO | Known genes |
| Prioritizer [38] | http://www.prioritizer.nl/ | Expression, GO, protein interactions | Disease loci |
| Endeavour [39] | http://www.esat.kuleuven.be/endeavour/ | Sequence, expression, GO, pathways, literature mining | Known genes |
| ToppGene [40] | http://toppgene.cchmc.org | Mouse phenotype, expression, GO, pathways, literature mining | Known genes |

Table 2-1: Current bioinformatics approaches and tools for prioritization of human disease candidate genes. The first column has the source or the name of the tool (including reference when available). The second column has URL of the corresponding web application, if available. If there is no web application, information regarding either the project home page or links to the corresponding supplementary material are provided. The third column is the genomic annotation types/features used by each of the methods. The last column has details of the training or the input data, if used. The last row, ToppGene, is the application developed in the current thesis.

PROSPECTR

PROSPECTR [24] uses sequence features to predict genes associated with Mendelian disorders. The sequence feature set of PROSPECTR comprises gene length, length of UTR regions, cross-species sequence identities, CpG content, etc., representing the structure, content and phylogenetic extent (the extent to which a gene is conserved back through evolution based on homologs in other species) of each gene. Ten such sequence features out of a total of 24 tested were reported to be significantly different in the disease gene set (1084 genes from OMIM known to be associated with human disease) compared to the control set (~18,000 genes from Ensembl that are not known to be involved in human disease) by the Mann-Whitney U test. A classifier, implemented as an alternating decision tree, was learnt from a training set composed of the 1084 genes from the disease set and 1084 random genes from the control set. On average, PROSPECTR enriches lists for disease genes two-fold 77% of the time, five-fold 37% of the time and twenty-fold 11% of the time.

POCUS

Developed by Turner et al., POCUS (Prioritization of Candidate Genes Using Statistics) [36] predicts disease genes based on Gene Ontology (GO) and InterPro domain annotations. The approach requires a set of susceptibility loci of the disease as input and assumes that the disease-related GO and domain functional annotation IDs are shared and enriched in these loci. POCUS calculates the probability of observing a shared event or annotation of each ID in the loci and generates the score of a candidate gene as the sum of the scores for each ID of that gene. To test the performance of POCUS, a list of 29 OMIM diseases with three or more known contributing genes was used. POCUS was reportedly successful (correctly identifying two or more disease genes) for 15–65% of positive control sets, depending on the size of the disease loci.

SUSPECTS

SUSPECTS [37] is an enhanced version of PROSPECTR with different feature sets. The input is a list of known disease genes (training set). The test or candidate genes are then ranked and prioritized based on the weighted sum of scores from four categories: sequence features, coexpression, shared protein domains and GO annotations. The performance of SUSPECTS was reported to be significantly better than that of PROSPECTR. The evaluation was carried out by a locus-region leave-one-out cross-validation, where each disease-related gene was selected in turn as the “target”, and the rest of the genes were used as the training set. Using the same 29 OMIM disease set with 155 known related genes as derived from POCUS (mentioned above), SUSPECTS was able to rank 87 genes out of 155 (56%) in the top 5%. On average, using SUSPECTS, target genes were ranked in the top 12.93% of the ranked candidate list, whereas PROSPECTR ranked 20 genes out of 155 (13%) in the top 5% and ranked target genes in the top 31.23% on average.

ENDEAVOUR

ENDEAVOUR [39], one of the recent methods, uses a more comprehensive set of features including sequence similarity, gene expression data, protein-protein interactions, GO annotation, pathway, protein domains, transcriptional information and literature. Similar to SUSPECTS, ENDEAVOUR prioritizes candidate genes based on a user-supplied training set of known disease genes. The performance was evaluated by a cross-validation scheme similar to that of SUSPECTS. Instead of taking genes in the 15 Mb locus region of the target, a set of 99 random genes plus the target is selected as the candidate set for each run. The disease-associated gene sets for this cross-validation were compiled from 19 OMIM diseases. With sensitivity and specificity defined by the rankings of the target genes, the authors were able to create the ROC curve and obtain an AUC (area under the curve) score of 86.6%.
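One plausible reading of a rank-based ROC curve is sketched below. It assumes that, at each rank cutoff, sensitivity is the fraction of validation runs in which the held-out target falls within the cutoff, and that 1 - specificity is approximated by the fraction of the candidate list within the cutoff. These definitions, the function name and the toy ranks are assumptions for illustration and are not taken from the ENDEAVOUR publication.

```python
def rank_based_auc(target_ranks, n_candidates):
    """Area under a rank-based ROC curve.

    target_ranks: rank (1 = best) of the held-out target gene in each run.
    n_candidates: number of genes ranked in each run (e.g. 99 random + 1 target).
    """
    n_runs = len(target_ranks)
    points = []
    for cutoff in range(n_candidates + 1):
        # Sensitivity: fraction of runs whose target is ranked within the cutoff.
        sensitivity = sum(rank <= cutoff for rank in target_ranks) / n_runs
        # 1 - specificity (assumed): fraction of the candidate list within the cutoff.
        false_positive_rate = cutoff / n_candidates
        points.append((false_positive_rate, sensitivity))
    # Trapezoidal integration over consecutive ROC points.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0
    return auc

# Hypothetical target ranks from four validation runs with 100 candidates each.
toy_auc = rank_based_auc([1, 3, 10, 50], n_candidates=100)
print(toy_auc)
```

Under these assumptions, a method that always ranks the target first approaches an AUC of 1, while random ranking gives about 0.5.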

2.2 Current work: improved functional-based and novel network-based prioritization methods

Extending the above-mentioned approaches and an earlier hypothesis that the majority of genes that impact or cause disease share membership in any of several functional relationships [36], we [41] use, for the first time, mouse phenotype data in human disease gene prioritization. We further demonstrate that employing the mouse phenotype data can significantly assist in the process of focusing the search for the most likely disease gene candidates (see Chapter 3). Existing disease candidate gene prioritization methodologies mine biological and functional information about candidate genes, and we believe that our system, ToppGene (http://toppgene.cchmc.org), can complement the current approaches by using a novel method that mines mouse phenotype data. Most importantly, through various examples, we demonstrate that ToppGene performs better than other current candidate gene prioritization methods (see Chapter 3 for more details). However, I would like to emphasize that my aim is not to prove that ToppGene-prioritized genes are true disease genes but to aid in the selection of a subset of the most likely disease gene candidates from larger sets of disease-implicated genes identified by high-throughput genome-wide techniques like linkage analysis and microarray analysis.

While functional annotation-based candidate gene prioritization methods have proven to be effective, the coverage of the gene functional annotations is a limiting factor. For instance, although more than 1,500 human disease genes have been documented, the majority remain functionally uncharacterized. In fact, currently, only a fraction of the genome is annotated with pathways and phenotypes. While two thirds of all the genes are annotated by at least one annotation, the remaining one third are yet to be annotated. To tackle this problem, my second candidate gene prioritization method is based entirely on protein interaction networks and does not take into account the functional annotations. Based on the observation that biological networks share many properties with Web and social networks, I have used successful graph analysis algorithms from computer science research to tackle this problem. Specifically, extended versions of the PageRank and HITS algorithms and the K-Step Markov method are applied to prioritize disease candidate genes in a training-test scheme similar to that used in ToppGene. Literature-based and manually curated protein interactions were used to form the base network. Our results indicate that although integrated functional annotation-based methods continue to be superior to the network-based methods, the latter are relatively easier to apply and more efficient. Significantly, in a one-to-one individual comparison, network-based disease candidate gene prioritization performs better than all other gene features, indicating that protein-protein interaction networks (PPINs) are a potentially good feature for disease candidate gene prioritization (see Chapter 4).
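As an illustration of how a Web-style ranking algorithm can be biased towards a training set, the sketch below runs a PageRank-like iteration with restarts to the seed (training) genes on an undirected toy network. The restart probability, iteration count, function name and gene names are all assumptions; the exact extended algorithms used in this work are described in Chapter 4 and may differ.

```python
def pagerank_with_priors(edges, seeds, restart=0.3, n_iter=200):
    """Score nodes by a random walk that restarts at the seed genes.

    edges: iterable of (gene_a, gene_b) undirected interactions.
    seeds: training genes; the walker jumps back to them with probability `restart`.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seed_set = {s for s in seeds if s in adjacency}
    # Prior probability mass is spread evenly over the seed genes.
    prior = {v: (1.0 / len(seed_set) if v in seed_set else 0.0) for v in adjacency}
    score = dict(prior)
    for _ in range(n_iter):
        updated = {v: restart * prior[v] for v in adjacency}
        for u, neighbours in adjacency.items():
            # Each node passes its remaining score equally to its neighbours.
            share = (1.0 - restart) * score[u] / len(neighbours)
            for v in neighbours:
                updated[v] += share
        score = updated
    return score

# Toy chain S1, S2 - A - B - C with made-up gene names.
toy_edges = [("S1", "A"), ("S2", "A"), ("A", "B"), ("B", "C")]
toy_seeds = ["S1", "S2"]
toy_scores = pagerank_with_priors(toy_edges, toy_seeds)
# Rank the non-seed (test) genes by decreasing score.
toy_ranking = sorted((g for g in toy_scores if g not in toy_seeds),
                     key=lambda g: -toy_scores[g])
print(toy_ranking)
```

In this toy network, candidates closer to the seeds receive higher scores, which is the behaviour a network-based prioritization relies on.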

In the last part of my thesis (Chapter 5), using asthma as a test case, I evaluate and compare the two candidate gene prioritization approaches, namely, the integrated functional annotation (IFA) and PPIN-based methods.

2.3 Limitations of candidate gene prioritization approaches

In general, almost all of the current disease gene identification and prioritization approaches are gene-centric. In other words, they are based on coding sequences. However, it has been speculated that complex traits in fact result more often from noncoding regulatory variants than from coding sequence variants [42-44]. Analyzing noncoding regulatory variants is fraught with several problems. For instance, the functional consequences of coding region variants are typically readily assessed as missense, nonsense, splicing, and other polymorphisms. On the other hand, interpreting the consequences of noncoding sequence variants is more complicated, with several factors involved (e.g. the relationship between promoter, intergenic, or noncoding sequence variation, gene expression level, and trait phenotype is less well understood than the relationship between coding DNA sequence and protein function).

Another principal limitation is that since these methods are primarily based on gene annotation, they tend to be biased towards selecting better-annotated genes. For instance, a “true” candidate gene can be missed if it lacks sufficient annotations. Furthermore, some clinical understanding of the molecular basis or disease etiology is needed to aid the clinically-informed binary evaluation, and this process could be partly subjective and researcher-specific. In other words, the effectiveness of this approach critically depends on how well the disease under investigation is defined both molecularly and physiologically, in order to avoid erroneous associations or candidate gene prioritizations.

A further limitation of employing this approach (using known disease genes as the training set) in both the IFA and PPIN-based methods is the assumption that the disease genes we have yet to discover will be consistent with what is already known about a disease and/or its genetic basis, which may not always be the case.

Additionally, it is important to note that the annotations and analyses provided and the prioritization by any of the approaches discussed above can only be as accurate as the underlying online sources from which the annotations are retrieved. Only one-fifth of the known human genes have pathway or phenotype annotations, and the functions of more than 40% of genes are still not defined. Likewise, network-based prioritization methods also have limitations and, just as with IFA-based methods, their performance depends on the quality of the interaction data. Additionally, although we used extended versions of algorithms that were originally developed to identify “important” nodes in social or other general networks, further modifications may be required to make them fit better with biological networks (e.g. considering weights on nodes or proteins and on edges or interactions) (see Chapter 4).

2.4 Summary

As the number of publicly available databases containing information about human genes and proteins continues to grow, there is a critical need to develop robust computational algorithms and approaches to integrate and utilize all this heterogeneous information for prioritizing any set of genes given a set of reference genes. Such a prioritization is not only useful for gene hunting in human diseases, but also for identifying gene or protein candidates of biological processes and pathways. At the same time, it needs to be emphasized that genes identified through these computational approaches may not always be true disease genes. Nevertheless, these approaches aid in selection of a subset of most likely disease gene candidates from larger sets of disease-implicated genes identified by high throughput genome-wide techniques like linkage analysis and microarray analysis.

Chapter 3. Discovery and Prioritization of Candidate Genes that Cause or Impact Disease Using an Integrative Genome-Transcriptome-Phenome-Bibliome Approach

3.1 Background

As discussed in the previous chapter, although the availability of complete genome sequences and the wealth of large-scale biological data sets have opened up unprecedented opportunities to elucidate the genetic basis of rare and common human diseases [45], comprehending the underlying pathophysiological mechanisms continues to be challenging. High-throughput genome-wide studies like linkage analysis and gene expression profiling, although useful for classification and characterization, do not provide sufficient information to identify specific disease causal genes. Both of these approaches typically result in hundreds of potential candidate genes, failing to help researchers reduce the target genes to a manageable number for further validation. Functional enrichment approaches [46-48] focusing on gene sets that share common biological function, chromosomal location, or regulation, although successful in identifying enriched biological themes, fail to prioritize the candidate genes. To overcome this, several gene prioritization methods have been developed [24, 28, 34, 36, 39] (see Chapter 2, Tiffin et al. [49] and Oti and Brunner [50] for a complete list of existing approaches and web tools for the prediction or prioritization of disease candidate genes).

POCUS [36], for instance, finds candidate genes by identifying an enrichment of keywords associated with gene ontology (GO), shared protein domains and expression profiles among a given set of susceptibility loci relative to the genome at large. Similarly, PROSPECTR [24] and SUSPECTS [37], focusing on Mendelian and oligogenic disorders, compare GO, protein domains and expression libraries of putative disease genes with those known to be involved with the same disease. Integrating genomic and proteomic data, Mootha et al. [51] identified the causal gene of LSFC (Leigh syndrome, French-Canadian type). The recent method, ENDEAVOUR [39], uses several data sources to prioritize candidate genes. None of these approaches, however, utilizes mouse phenotype data, although the mouse is key to the analysis of mammalian developmental, physiological and disease processes [52].

Extending the above-mentioned approaches and an earlier hypothesis that the majority of disease causal genes are functionally closely related [36], we reasoned that an integrative genomics-transcriptomics-phenomics-bibliomics approach utilizing the available human gene annotations, mouse phenotype data and literature co-citations of genes will expedite human complex disease candidate gene identification and prioritization. We call our prioritization method ToppGene (acronym for Transcriptome Ontology Pathway PubMed based prioritization of Genes). For the first time, we incorporated mouse phenotype data and biomedical literature as two of the feature parameters, apart from GO, pathways, protein domains, protein interactions and gene expression, to prioritize human disease candidate genes and demonstrate their utility.

3.2 Materials and methods

3.2.1 Data sources

We used seven data sources (6 human-related and 1 mouse-related) to prioritize the gene candidates (see Figure 3-1).

1. GO: Gene Ontology [1] was downloaded from the Gene Ontology web site [53]. Corresponding human GO-gene annotations were downloaded from the NCBI Entrez Gene ftp site [54]. This data set contained 15,068 human genes annotated with 7,124 unique GO terms. GO Molecular Function (GO:MF) and GO Biological Process (GO:BP) were considered as separate features because, although they belong to the same annotation family (GO), they have separate roots and term spaces.

2. MP: Mammalian Phenotype ontology [55], mouse gene phenotype annotations and the corresponding orthologous human genes were downloaded from the Mouse Genome Informatics (MGI) website [56]. This data set contained 4329 human genes compiled by extrapolating the mouse genes annotated with 4280 mouse phenotype terms.

3. Pathway: Gene-pathway annotations were compiled by combining data from KEGG [57], BioCarta [58], BioCyc [59], Reactome [60], GenMAPP [61], and MSigDb [62] [47]. 4,860 human genes had at least one pathway association (a total of 780 pathways).

4. Domain: Domain information of all gene products was collected by parsing the UniProt human records. This compiled gene-domain annotation data set contains 12,454 distinct genes annotated by 10,223 distinct domains from 6 protein domain databases: InterPro [63], Pfam [64], SMART [65], PROSITE [66], Gene3D [67] and ProDom [68].

5. PubMed: Gene-PubMed ID relations were downloaded from the NCBI Entrez Gene ftp site [54]. This data set contained 25,294 distinct genes associated with at least one PMID (a total of more than 142,000 PubMed abstracts). About 32% (44,806) of these papers were associated with at least two genes.

6. Interaction: Gene-interaction complex relations were downloaded from the NCBI Entrez Gene ftp site. This data set contained 8,040 distinct genes from 19,714 distinct interaction complexes from 3 interaction databases: HPRD [69], BIND [70], and BioGRID [71].

7. Gene Expression: Human microarray expression data (Series GSE1133) from the Genomics Institute of the Novartis Research Foundation was obtained from the NCBI Gene Expression Omnibus (GEO) [72]. This dataset [73] contained expression values of 11,883 genes from 79 tissues from the normal adult human body. Microarray expression CEL files were pre-processed using the RMA algorithm. The annotations were created with a custom chip description file Hs133A_Hs_REFSEQ_8.cdf [74] to account for recent advances in human genomics, followed by per-gene median normalization. Each gene was represented by a vector of size 79, corresponding to the expression values of the 79 normal adult human tissues.


Figure 3-1: Schematic representation of gene prioritization. (A) Genes in the training set are selected based on their attributes or current gene annotations (genes associated with a disease, phenotype, pathway or a GO term). (B) The test gene source can be candidate genes from linkage analysis studies or genes differentially expressed in a particular disease or phenotype. (C) Enriched terms of the eight gene annotations, namely, GO: Molecular Function, GO: Biological Process, Mouse Phenotype, Pathways, Protein Interactions, Protein Domains, PubMed and Gene Expression, compiled from various data sources, are obtained for the training set of genes. (D) A similarity score is generated for each annotation of each test gene by comparing it to the enriched terms in the training set of genes. The final prioritized gene list is then computed based on the aggregated values of the eight similarity scores.

3.2.2 Pre-processing of annotation terms

A pre-processing step was performed before the prioritization started, in which the information content values of all categorical annotation terms, i.e. GO:MF, GO:BP, MP, Pathway, Domain, PubMed, and Interaction annotations, were calculated. The information content ($g^i$) of annotation term $T_i$ of a gene was defined in the following way:

$$g^i = \frac{-\ln(p(T_i))}{\max_{\text{all } T_j \text{ in the taxonomy}} \{-\ln(p(T_j))\}}, \qquad (1)$$

where

$$p(T_i) = \frac{\text{count(occurrence of } T_i \text{ and children of } T_i \text{ in case of ontological annotation)}}{\text{count(occurrence of all terms in the same annotation set)}}.$$
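The normalized information content can be sketched in a few lines. The example below assumes that "children" means all descendants of a term in the ontology and that unused terms are simply skipped; the function names, input format and toy counts are hypothetical and serve only to illustrate equation (1).

```python
import math

def descendants(term, children):
    """All terms below `term` in the ontology (children, grandchildren, ...)."""
    found, stack = set(), [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def information_content(term_counts, children=None):
    """Normalized information content of each term, following equation (1).

    term_counts: {term: number of gene annotations using the term directly}.
    children: {term: list of direct child terms}; omit for flat annotations.
    """
    children = children or {}
    total = sum(term_counts.values())
    neg_log_p = {}
    for term in term_counts:
        covered = {term} | descendants(term, children)
        occurrences = sum(term_counts.get(t, 0) for t in covered)
        if occurrences > 0:  # terms that are never used have no defined p(T)
            neg_log_p[term] = -math.log(occurrences / total)
    largest = max(neg_log_p.values(), default=0.0)
    if largest == 0:
        return {term: 0.0 for term in neg_log_p}
    return {term: value / largest for term, value in neg_log_p.items()}

# Toy three-level ontology: root -> mid -> leaf, with made-up counts.
toy_counts = {"root": 2, "mid": 3, "leaf": 5}
toy_children = {"root": ["mid"], "mid": ["leaf"]}
toy_ic = information_content(toy_counts, toy_children)
print(toy_ic)
```

Specific terms (here the leaf) receive values close to 1, while the root, which covers every annotation, receives 0.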

3.2.3 Processing of Training Set Genes

The training process was to create a representative profile of the training genes based on all 8 annotations (features). For categorical gene annotations, this process was to identify the over-represented terms among the training genes. The hypergeometric distribution with Bonferroni correction was used as the standard method. For the numeric gene annotation, i.e. microarray expression levels, the training process generated the average (a vector of size 79) of all the training genes.
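The over-representation test can be illustrated as follows. This is a minimal sketch assuming an upper-tail hypergeometric test, a Bonferroni factor equal to the number of terms tested, and a cutoff of 0.05; the cutoff, function names and toy data are assumptions and are not taken from the ToppGene implementation.

```python
from math import comb

def hypergeom_upper_tail(k, n_training, n_annotated, n_genome):
    """P(X >= k) when n_training genes are drawn from n_genome genes,
    of which n_annotated carry the term."""
    upper = min(n_training, n_annotated)
    favourable = sum(
        comb(n_annotated, i) * comb(n_genome - n_annotated, n_training - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_genome, n_training)

def enriched_terms(training_genes, term_to_genes, n_genome, alpha=0.05):
    """Terms over-represented in the training set after Bonferroni correction."""
    n_tests = len(term_to_genes)  # one test per annotation term
    enriched = {}
    for term, genes in term_to_genes.items():
        k = len(training_genes & genes)
        if k == 0:
            continue
        p = hypergeom_upper_tail(k, len(training_genes), len(genes), n_genome)
        adjusted = min(1.0, p * n_tests)  # Bonferroni-adjusted p-value
        if adjusted <= alpha:
            enriched[term] = adjusted
    return enriched

# Toy genome of 20 genes with two hypothetical annotation terms.
toy_training = {"g1", "g2", "g3", "g4"}
toy_terms = {
    "termA": {"g1", "g2", "g3", "g4", "g5"},
    "termB": {"g1"} | {f"g{i}" for i in range(6, 15)},
}
toy_enriched = enriched_terms(toy_training, toy_terms, n_genome=20)
print(toy_enriched)
```

Only the term shared by all four training genes survives the correction in this toy case.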

3.2.4 Similarity measure

Again different methods were used for similarity measures of categorical and numeric annotations. Fuzzy measure-based similarity measure was applied for categorical terms. The following part explains the method in detail.

If $G = \{T_1, \ldots, T_n\}$ denotes the set of annotation terms of a gene, a Sugeno fuzzy measure, $g$, is a real-valued function $g: 2^G \to [0, 1]$, satisfying

1) $g(\Phi) = 0$ and $g(G) = 1$,

2) $g(A) \le g(B)$ if $A \subseteq B$, and

3) for all $A, B \subseteq G$ with $A \cap B = \Phi$,

$$g(A \cup B) = g(A) + g(B) + \lambda\, g(A)\, g(B) \quad \text{for some } \lambda > -1. \qquad (2)$$

This is a recursive definition which stops when $A$ and $B$ are single-element sets, and $g(A)$ and $g(B)$ are replaced with $g^i$ and $g^j$, the fuzzy densities of $T_i$ and $T_j$ (the only elements in $A$ and $B$) respectively. For a given gene annotation set $G$, the parameter $\lambda$ of its Sugeno fuzzy measure can be determined uniquely by solving the following equation:

$$\lambda + 1 = \prod_{i=1}^{n} (1 + \lambda g^i) \quad \text{for } \lambda > -1, \qquad (3)$$

where $g^i$ is the fuzzy density of term $T_i$, or in this context, the information content obtained in the pre-processing step, and $n$ is the number of terms in $G$.
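As a reading aid, the recursion can be unrolled into a closed form. This is a standard property of Sugeno $\lambda$-measures rather than a formula stated in the text above: for $\lambda \neq 0$, applying rule 3) repeatedly to the single-element subsets of any $A \subseteq G$ gives

$$g(A) = \frac{1}{\lambda}\left[\prod_{T_i \in A} (1 + \lambda g^i) - 1\right].$$

For two terms this reduces to $g^i + g^j + \lambda g^i g^j$, in agreement with equation (2), and setting $A = G$ with $g(G) = 1$ recovers equation (3). When $\lambda = 0$ the measure is simply additive, $g(A) = \sum_{T_i \in A} g^i$.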

Fuzzy measure-based similarity (FMS) of two sets $G_1$ and $G_2$ of annotation terms is defined as

$$S_{FMS}(G_1, G_2) = \frac{g_1(G_1 \cap G_2) + g_2(G_1 \cap G_2)}{2}, \qquad (4)$$

which can be derived based on the values of $\lambda_1$ and $\lambda_2$ determined using equation (3). For ontological terms, the augmented FMS (AFMS) was used to account for the hierarchical structure of ontology annotations:

$$S_{AFMS}(G_1, G_2) = \frac{g_1^{+}([G_1 \cap G_2]^{+}) + g_2^{+}([G_1 \cap G_2]^{+})}{2}, \qquad (5)$$

where $[G_1 \cap G_2]^{+} = [G_1^{+} \cap G_2^{+}] = [G_1 \cap G_2] \cup \{T_{1i}, T_{2j}\}$, $G_1^{+} = G_1 \cup \{T_{1i}, T_{2j}\}$, $G_2^{+} = G_2 \cup \{T_{1i}, T_{2j}\}$, and $\{T_{1i}, T_{2j}\}$ denotes the set of most specific common ancestors of every pair of terms $(T_{1i}, T_{2j})$ from $G_1$ and $G_2$. This ensures that for two genes annotated by ontological terms, the similarity measure is > 0 even when they do not share common terms (see Popescu et al. [75] for additional details). For numeric annotation, i.e. the microarray expression values, the similarity score was calculated as the Pearson correlation of the two expression vectors of the two genes.
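A minimal sketch of the plain FMS of equation (4) is given below. It assumes that $\lambda$ is found by bisection and that the measure of a subset is evaluated with the standard closed form of a Sugeno $\lambda$-measure; the function names, the bisection settings and the toy densities are illustrative assumptions, and the augmentation step of AFMS is not covered.

```python
import math

def solve_lambda(densities, iterations=200):
    """Non-trivial root of  lambda + 1 = prod(1 + lambda * g_i),  lambda > -1."""
    total = sum(densities)
    if len(densities) < 2 or abs(total - 1.0) < 1e-12:
        return 0.0  # additive case (a single term has no other root)

    def f(lam):
        return math.prod(1.0 + lam * g for g in densities) - (1.0 + lam)

    if total > 1.0:
        lo, hi = -1.0 + 1e-12, -1e-12  # root lies in (-1, 0)
        if f(lo) <= 0.0:
            return lo  # boundary case, e.g. a density equal to 1
    else:
        lo, hi = 1e-12, 1.0  # root lies in (0, inf): widen until sign change
        while f(hi) < 0.0 and hi < 1e12:
            hi *= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if (f(mid) > 0.0) == (f(lo) > 0.0):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def sugeno_measure(subset_densities, lam):
    """g(A) for the subset whose term densities are given."""
    if abs(lam) < 1e-9:
        return min(1.0, sum(subset_densities))
    return (math.prod(1.0 + lam * g for g in subset_densities) - 1.0) / lam

def fms(gene1, gene2):
    """Equation (4); gene1 and gene2 map each annotation term to its density."""
    shared = set(gene1) & set(gene2)
    lam1 = solve_lambda(list(gene1.values()))
    lam2 = solve_lambda(list(gene2.values()))
    m1 = sugeno_measure([gene1[t] for t in shared], lam1)
    m2 = sugeno_measure([gene2[t] for t in shared], lam2)
    return (m1 + m2) / 2.0

# Hypothetical densities (information content values) for three toy genes.
gene_a = {"t1": 0.3, "t2": 0.4, "t3": 0.2}
gene_b = {"t2": 0.4, "t3": 0.2, "t4": 0.9}
gene_c = {"t5": 0.6, "t6": 0.7}
print(fms(gene_a, gene_b), fms(gene_a, gene_c))
```

Identical annotation sets give a similarity of 1 and disjoint sets give 0, which is why the augmented version is needed for ontological terms that share ancestors but no exact terms.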

3.2.5 Processing of Test Set Genes

In this step, each of the genes from the test set was compared to the representative profile of the training set. As described earlier, the training profile contained the over-represented terms from the training genes for all categorical annotations and the average vector for the expression values. For a test gene, a similarity score to the training profile for each of the 8 features was derived using the methods mentioned in the previous section. The test gene was then summarized by the 8 similarity scores. In case of missing value (for instance, one annotation of a test gene was unknown), the score was set to -1. Otherwise, it is a real value in [0, 1].

In order to combine the 8 similarity scores into an overall score, we applied a statistical meta-analysis. A p-value for each annotation of a test gene G was derived by random sampling from the whole genome. The p-value of similarity score $S_i$ was defined as:

$$p(S_i) = \frac{\text{count of genes having score higher than } G \text{ in the random sample}}{\text{count of genes in the random sample containing annotation}}.$$

Fisher’s inverse chi-square method, which states that $-2\sum_{i=1}^{n} \log p_i \to \chi^2(2n)$ assuming the $p_i$’s come from independent tests, was then applied to combine the p-values from multiple annotations into an overall p-value. It was noted, however, that the p-values of GO:MF and GO:BP were highly correlated; therefore a single p-value was generated by taking the p-value of the average of the GO:MF and GO:BP scores in the random sample. A pairwise Pearson correlation test result of the p-values is shown in Supplementary Figure 2 in [41]. The final similarity score of the test gene was then obtained as 1 – (the combined p-value). We used random sampling to estimate the p-values because the density functions of the similarity scores were not easy to estimate, and although this process increased the computation, for a reasonably large random sample the p-values were fairly stable.
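The two steps, the empirical p-value and Fisher's combination, can be sketched as follows. The sketch assumes that a score of -1 marks a missing annotation and is skipped, and that p-values of exactly 0 are clipped to a small positive value before taking logarithms; both choices, like the function names and toy numbers, are assumptions and not details given in the text. The merging of GO:MF and GO:BP is not shown.

```python
import math

def empirical_p_value(score, random_scores):
    """Fraction of annotated random genes scoring higher than the test gene."""
    annotated = [s for s in random_scores if s != -1]  # -1 marks a missing annotation
    higher = sum(s > score for s in annotated)
    return higher / len(annotated)

def chi2_sf_even(x, dof):
    """Survival function of a chi-square variable with an even number of
    degrees of freedom: exp(-x/2) * sum_{k < dof/2} (x/2)^k / k!"""
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, dof // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def fisher_combine(p_values, min_p=1e-12):
    """Combined p-value from -2 * sum(log p_i) ~ chi-square(2n)."""
    clipped = [max(p, min_p) for p in p_values]  # assumed guard against log(0)
    statistic = -2.0 * sum(math.log(p) for p in clipped)
    return chi2_sf_even(statistic, 2 * len(clipped))

# Toy example: one feature's p-value, then a combination of three features.
toy_p = empirical_p_value(0.8, [0.1, 0.9, 0.5, -1, 0.95])
toy_combined = fisher_combine([0.05, 0.05, toy_p])
toy_final_score = 1.0 - toy_combined
print(toy_p, toy_combined, toy_final_score)
```

Because the degrees of freedom are always even (2n), the chi-square tail has a closed form and no statistics library is needed for this sketch.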

3.3 Results

3.3.1 Mouse Phenotype as a Feature for Candidate Gene Prioritization

The Mammalian Phenotype (MP) Ontology enables robust annotation of mammalian phenotypes in the context of mutations, quantitative trait loci and strains that are used as models of human biology and disease. The MP Ontology (MPO) supports different levels and richness of phenotypic knowledge and flexible annotations to individual genotypes [55]. Each node in MPO represents a category of phenotypes, and each MP ontology term has a unique identifier, a definition and synonyms, and is associated with gene variants causing these phenotypes in genetically engineered or mutagenesis experiments. In the current study, we retrieved the mouse genes associated with each MP term and extracted the corresponding human orthologous genes. In the current version of MPO, there are 4280 terms associated with 4329 unique Entrez mouse genes (extrapolated to 4329 orthologous human genes). We do not check whether the human orthologous gene of a mouse gene causes a similar phenotype. Rather, we assume that orthologous genes cause “orthologous” phenotypes and test the potential of the extrapolated mouse phenotype terms as a similarity measure between the training and test groups of genes in candidate gene analysis.

3.3.2 Document Identifier as a Feature for Candidate Gene Prioritization

We use biomedical literature abstract identifiers (PubMed identifiers, PMIDs) as a feature for classification, where the dimensionality of the feature space is equal to the number of documents in the document set. We hypothesized that if a PMID is cross-referenced in two genes, the two genes are likely to have a direct or indirect association. A large number of co-citations for a pair of genes (i.e. the same PMIDs associated with two different genes) probably represents a relationship (direct or indirect association) between the two genes. For each gene, ToppGene considers all associated articles (represented as PMIDs) as the literature annotation of this gene. The gene to PMID association file was downloaded from ftp://ftp.ncbi.nih.gov/gene/DATA/gene2pubmed.gz. In total, 44806 PMIDs were associated with more than one gene, 25294 genes had at least one PMID association, and 24273 genes shared at least one PMID with another gene. For the current study, we do not look into the details of the relationship type between the genes but consider only co-citation. In other words, the PMIDs are used only as a feature for the similarity measure in the candidate gene analysis.
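As an illustration of the co-citation idea, the following minimal sketch counts the PMIDs shared by each pair of genes. It assumes the gene to PMID associations have already been parsed into (gene, PMID) pairs; the function name and the toy identifiers are hypothetical, and the actual gene2pubmed file contains additional columns that are ignored here.

```python
from collections import defaultdict

def shared_pmids(pairs):
    """Return each gene's PMID set and the number of PMIDs shared per gene pair."""
    gene_to_pmids = defaultdict(set)
    for gene, pmid in pairs:
        gene_to_pmids[gene].add(pmid)
    genes = sorted(gene_to_pmids)
    cocitations = {}
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            shared = gene_to_pmids[g1] & gene_to_pmids[g2]
            if shared:  # only gene pairs with at least one common PMID are kept
                cocitations[(g1, g2)] = len(shared)
    return dict(gene_to_pmids), cocitations

# Hypothetical toy associations.
pairs = [("geneA", 101), ("geneA", 102), ("geneB", 102), ("geneB", 103), ("geneC", 104)]
annotations, cocitations = shared_pmids(pairs)
```

The all-pairs loop is quadratic in the number of genes and is meant only to make the feature concrete; an inverted PMID-to-gene index would scale better.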

3.3.3 Comparison of ToppGene with Other Gene Prioritization Approaches

Figure 3-2: Schema diagram of comparison of ToppGene with other applications. To evaluate the performance of our approach and also compare it with other similar gene prioritization approaches, we performed two types of comparisons: large-scale cross-validations and small-scale test cases. For large-scale cross-validations, we used the same or similar training sets as mentioned in the previous methods. Specifically, we compared ToppGene’s performance with ENDEAVOUR using random-gene cross-validation; and with PROSPECTR and SUSPECTS, we used locus-region cross-validation. Further, as test cases, we selected two diseases, congenital heart defects (CHD) and diabetic retinopathy (DR), and compared the prioritization performance of ToppGene with SUSPECTS and ENDEAVOUR.

To evaluate the performance of our approach and also compare it with other similar gene prioritization approaches [24, 37, 39], we performed two types of comparisons: large-scale cross-validations and small-scale test cases (see Figure 3-2 for the workflow, and Table 3-1 and Table 3-2 for a comparison of the features and methods used in the three applications, namely SUSPECTS, ENDEAVOUR and ToppGene). For large-scale cross-validations, we used the same or similar training sets as mentioned in the previous methods. Specifically, we compared ToppGene’s performance with ENDEAVOUR [39] using random-gene cross-validation; and for comparison with PROSPECTR [24] and SUSPECTS [37], we used locus-region cross-validation. Additionally, as test cases, we selected two diseases, congenital heart defects (CHD) and diabetic retinopathy (DR), and compared the prioritization performance of ToppGene with SUSPECTS [37] and ENDEAVOUR [39].

Feature type | SUSPECTS | ENDEAVOUR | ToppGene
Sequence Features & Annotations | Gene length; Homology; Base composition | Blast; cis-element; Transcriptional motifs | -
Gene Annotations | Gene Ontology | Gene Ontology | Gene Ontology; Mouse Phenotype
Transcript Features | Gene expression | Gene expression; EST expression | Gene expression
Protein Features | Protein domains | Protein domains; Protein interactions; Pathways | Protein domains; Protein interactions; Pathways
Literature | - | Keywords in abstracts | Co-citation (PMIDs)

Table 3-1: Comparison of features used in the three gene prioritization applications.


Data type | SUSPECTS | ENDEAVOUR | ToppGene
Attribute-based data | Semantic similarity | p-value from meta-analysis | Fuzzy measure based similarity
Vector-based data | Pearson correlation | Pearson correlation | Pearson correlation
Combination of scores | Weighted mean | p-value from order statistics | p-value from meta-analysis

Table 3-2: Comparison of methods used in the three gene prioritization applications.

3.3.4 Comparison of ToppGene with ENDEAVOUR: Random-gene cross-validation

In the current study we used our own disease training sets because the complete data sets used by ENDEAVOUR are not publicly available. We therefore randomly selected 19 diseases along with their associated genes from Online Mendelian Inheritance in Man (OMIM) and the Genetic Association Database (GAD). Each disease gene set contained 30 to 44 genes. The total number of genes across the 19 selected diseases was 693 (see Supplementary Table 1 of [41] for the complete list of the datasets). For negative controls, 20 sets, each containing 35 random genes, were created as training data. We followed the same methodology as ENDEAVOUR to evaluate the performance of our prioritization method and to compare the results with ENDEAVOUR. In each validation run, the gene group of a particular disease (with one gene removed as the “target”) was used as the training set. The “target” gene was then mixed with 99 random genes to make a test set of 100 genes. The rank of the “target” gene in the resulting list, following prioritization, was recorded. This process was repeated for each gene in the list. Sensitivity was defined as the frequency of “target” genes that are ranked above a particular threshold position, and specificity as the percentage of genes ranked below the threshold. For instance, a sensitivity/specificity value of 70/90 indicates that the correct disease gene (the “target”) is ranked among the best-scoring 10% of genes in 70% of the prioritizations. Receiver operating characteristic (ROC) curves were plotted based on the sensitivity/specificity values, and the area under the curve (AUC) was computed as the standard measure of the performance of the method. ENDEAVOUR reported a sensitivity/specificity value of 90/74 and an AUC score of 0.866 [39].
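The rank-based definitions above can be made concrete with a short sketch. It assumes the rank of the “target” gene (1 = best) has been recorded for each validation run and that all test sets have the same size; the function name is hypothetical and the trapezoidal rule is one reasonable way to integrate the curve, not necessarily the one used in this work.

```python
def rank_roc(target_ranks, list_size):
    """Return (sensitivity, specificity) for every threshold position, plus the AUC."""
    points = []
    for k in range(list_size + 1):
        # Sensitivity: fraction of runs in which the target is ranked within the top k.
        sensitivity = sum(r <= k for r in target_ranks) / len(target_ranks)
        # Specificity: fraction of the list ranked below threshold position k.
        specificity = (list_size - k) / list_size
        points.append((sensitivity, specificity))
    # Trapezoidal integration of sensitivity over (1 - specificity).
    auc = 0.0
    for (sens0, spec0), (sens1, spec1) in zip(points, points[1:]):
        auc += (spec0 - spec1) * (sens0 + sens1) / 2
    return points, auc

# Hypothetical ranks of the target gene in four runs with test sets of 10 genes.
points, auc = rank_roc([1, 1, 2, 10], list_size=10)
```

In this toy case the threshold at position 1 gives a sensitivity/specificity of 50/90: the target is ranked in the top 10% in half of the runs.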

Using ToppGene, we first created the overall ROC curves. To compare directly with ENDEAVOUR, we followed the same definitions of sensitivity and specificity as described by Aerts et al. [39]. Figure 3-3 shows the overall ROC curves using ToppGene. The AUC score of the 19 disease training sets was 0.916, and the sensitivity/specificity was 90/77, i.e. the “target” gene was ranked among the top 23% in 90% of the cases. For the negative control, the AUC score of the 20 random training sets was 0.503 (see section A of Table 3-3).

Figure 3-3: ROC curves of random-gene cross-validation based on score ranks. The green curve was generated from the 19 disease gene training sets. The black curve, the negative control, was generated from 20 random training sets. See text for the definitions of sensitivity and specificity.

Second, we studied the ROC curves based on p-value based scores. ENDEAVOUR ranks the “target” gene based on p-values from order statistics, which are local p-values. In contrast, ToppGene provides p-values based on random sampling of the whole genome. ToppGene p-value based scores are therefore global measures of the similarity of the test genes to the training genes. As a result, sensitivity and specificity can also be defined on the p-value based scores: sensitivity is the true positive rate (the proportion of detected “target” genes among all “target” genes) at a cutoff score, and specificity is the true negative rate (the proportion of “rejected” genes among all “non-target” genes) at the same cutoff. For example, a sensitivity/specificity of 70/90 indicates that 70% of the “target” genes and 10% of the “non-target” genes have scores higher than a particular cutoff value.
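A minimal sketch of the score-based definition follows, assuming that higher scores indicate greater similarity to the training set and that the scores of “target” and “non-target” genes have been pooled across runs. The function name and the toy scores are hypothetical.

```python
def sens_spec_at_cutoff(target_scores, nontarget_scores, cutoff):
    """True positive rate and true negative rate at a given score cutoff."""
    detected = sum(score > cutoff for score in target_scores)       # targets kept
    rejected = sum(score <= cutoff for score in nontarget_scores)   # non-targets dropped
    return detected / len(target_scores), rejected / len(nontarget_scores)

# Hypothetical pooled scores.
target_scores = [0.95, 0.99, 0.50, 0.97]
nontarget_scores = [0.10, 0.20, 0.94, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
sensitivity, specificity = sens_spec_at_cutoff(target_scores, nontarget_scores, cutoff=0.93)
```

Because the scores are global, one cutoff can be applied to every test set, which is what distinguishes this definition from the rank-based one.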

3.3.5 Evaluation of features used for gene prioritization in ToppGene

To study the efficiency of the different features (Gene Ontology (GO), Mouse Phenotype (MP), Pathways, PubMed, Protein Domains, Gene Expression and Protein Interactions), an ROC curve was generated for each feature set. Figure 3-4 shows the corresponding AUC scores, depicting the relative performance of each feature set in the prioritization method. Mouse phenotype and PubMed showed the best performance, while the protein interaction and gene expression features performed poorly. In terms of coverage (the percentage of genes in the whole genome annotated with each of these features), PubMed was the best while MP had the least coverage (only about 19% of genes have at least one MP term association).

To better understand the relative performance and power of each feature in gene prioritization, we tested ToppGene by performing cross-validations with one of the features left out. The performance decreased significantly only when MP was removed (see ROC curve in Figure 3-5). As expected, the best performance was recorded when all the features were considered for prioritization, with an AUC of 0.913 (see ROC curve in Figure 3-5) and a coverage of ~89%. For a cutoff score of 0.93, the sensitivity/specificity was 74/90. In other words, 74% of the “target” genes were included in the candidate list (about a 9-fold reduction from the original test set).

[Figure 3-4 (bar chart): AUC of different feature sets. Left y-axis: AUC (0 to 1); right y-axis: Coverage (0% to 100%); x-axis: Feature set (All, GO:MF, GO:BP, MP, Pathway, Domain, Pubmed, Interaction, Expression). Series: AUC (random control), AUC (p-value score), Coverage.]

Figure 3-4: AUC of different feature sets. Red bars indicate the AUC scores based on each feature set, and blue bars are the corresponding random controls. Yellow bars indicate the coverage of each feature set in the whole genome. For example, mouse phenotype (MP) has an AUC score of 0.78 and covers 19% of the genes in the whole genome. For each feature set, the ROC curve was generated using only genes with annotations.


Figure 3-5: ROC curves of random-gene cross-validation based on scores. The red curve was generated using all feature sets (AUC score 0.913). The blue curve was generated without Mouse Phenotype annotations (AUC score 0.893). The orange curve was generated without Mouse Phenotype and PubMed annotations (AUC score 0.888). See text for the definitions of sensitivity and specificity.

3.3.6 Comparison of ToppGene with SUSPECTS and PROSPECTR: Locus-region cross-validation

In this cross-validation we compared the performance of ToppGene with two other gene prioritization methods, namely SUSPECTS [37] and PROSPECTR [24]. We used the same data set [36] that was used in the SUSPECTS and PROSPECTR studies (see Supplementary Table 2 in [41] for a list of 29 OMIM diseases, each with at least 3 known gene associations). For each cross-validation run, the training set was composed of all the genes related to a disease except the “target” gene. The test set was created by including all the genes in the 15 Mb locus region, i.e. the genes occurring in the 7.5 Mb flanking regions (5’ and 3’) of the “target” gene’s chromosomal location, along with the “target” gene itself.
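The test-set construction can be sketched as follows, under the simplifying assumption that each gene is reduced to a chromosome and a single base-pair coordinate (how gene positions were actually represented is not stated here). The function name, gene names and coordinates are hypothetical.

```python
def locus_region_test_set(target, gene_positions, flank_bp=7_500_000):
    """Genes on the target's chromosome within flank_bp on either side, target included."""
    chrom, position = gene_positions[target]
    return sorted(
        gene
        for gene, (gene_chrom, gene_pos) in gene_positions.items()
        if gene_chrom == chrom and abs(gene_pos - position) <= flank_bp
    )

# Hypothetical positions: gene -> (chromosome, coordinate in bp).
gene_positions = {
    "target": ("1", 50_000_000),
    "near5": ("1", 43_000_000),
    "near3": ("1", 57_400_000),
    "far": ("1", 60_000_000),
    "otherChrom": ("2", 50_000_000),
}
test_set = locus_region_test_set("target", gene_positions)
```

Only the two genes within 7.5 Mb on the same chromosome join the target in the test set; the gene 10 Mb away and the gene on another chromosome are left out.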

PROSPECTR, which uses sequence features alone for gene prioritization, ranked the “target” gene on average within the top 31.23% of the prioritized test lists, and among the top 5% about 20 times out of 155 (i.e. about 13%). SUSPECTS, which uses GO, protein domains, gene expression, and sequence features for gene prioritization, ranked the “target” genes in the top 5% of the prioritized lists 87 times out of 155 (~56%), and on average the “target” genes were ranked within the top 12.93% of the prioritization results. In comparison, ToppGene ranked the “target” gene among the top 5% of the prioritized lists 118 times out of 150 (79%). Five genes in the original list were not present in the current NCBI Entrez Gene database and were therefore excluded; thus 150 genes, instead of 155, were used for this cross-validation test. On average, the “target” genes were ranked within the top 7.39% of the prioritized lists using our approach (see section B of Table 3-3).
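The two summary statistics reported here can be computed as in the sketch below. It assumes that the rank ratio of a run is the rank of the “target” gene divided by the size of its test set, which is how the percentages above read; the function name and the toy results are hypothetical.

```python
def summarize_ranks(results, top_fraction=0.05):
    """results: list of (target_rank, test_set_size) pairs, where rank 1 is the best."""
    ratios = [rank / size for rank, size in results]
    average_ratio = sum(ratios) / len(ratios)
    top_hits = sum(ratio <= top_fraction for ratio in ratios)  # runs with target in the top 5%
    return average_ratio, top_hits

# Hypothetical outcomes of four prioritization runs.
results = [(1, 100), (5, 100), (20, 200), (50, 100)]
average_ratio, top_hits = summarize_ranks(results)
```

Because locus regions contain different numbers of genes, normalizing each rank by its own test-set size keeps runs comparable before averaging.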

A. Random cross-validation | ENDEAVOUR | ToppGene
AUC (area under curve) | 0.866 | 0.916
Sensitivity/specificity | 90/74 | 90/77

B. Locus region cross-validation | PROSPECTR | SUSPECTS | ToppGene
Percentage of top 5% ranked target genes | 13% (20/155) | 56% (87/155) | 79% (118/150)
Average rank ratio of target gene | 31.23% | 12.93% | 7.39%

C. Congenital Heart Disease (CHD) test case | SUSPECTS | ENDEAVOUR | ToppGene
Percentage of top 10% ranked target genes | 32% (9/28) | 50% (14/28) | 64% (18/28)
Percentage of top 5% ranked target genes | 18% (5/28) | 14% (4/28) | 25% (7/28)
Average rank ratio | 25.03% | 17.29% | 17.35%

D. Diabetic Retinopathy (DR) test case | SUSPECTS | ENDEAVOUR | ToppGene
Percentage of top 10% ranked target genes | 63% (17/27) | 56% (15/27) | 70% (19/27)
Percentage of top 5% ranked target genes | 44% (12/27) | 44% (12/27) | 63% (17/27)
Average rank ratio | 17.04% | 13.31% | 8.60%

Table 3-3: Summary of comparison of results from ToppGene with other gene prioritization applications.

To evaluate the performance of individual features, we repeated the same locus-region cross-validation with one feature removed at a time (as described earlier under the comparison of ToppGene with ENDEAVOUR). The performance did not change significantly when the GO, pathway, protein domain, protein interaction or gene expression features were individually excluded from gene prioritization. The performance, however, declined significantly when MP or PubMed was not included as a feature (see Table 3-4 and Figure 3-6).

Features | Average rank ratio of “target” genes | Number of times “target” genes were ranked top 5% | Number of times “target” genes were ranked top 10%
All | 7.39% | 118 | 125
GO + MP + PubMed | 7.50% | 118 | 126
MP + PubMed | 7.08% | 121 | 126
Without GO | 6.84% | 117 | 123
Without Pathway | 7.66% | 118 | 124
Without Domain | 6.71% | 118 | 124
Without Interaction | 7.17% | 120 | 124
Without Expression | 7.28% | 118 | 128
Without MP | 9.77% | 110 | 117
Without PubMed | 9.91% | 100 | 111
Without MP & PubMed | 22.61% | 71 | 80

Table 3-4: Performance summary of locus-region cross-validation using different feature sets. When either MP or PubMed, or both (MP + PubMed) were left out, the performance dropped significantly.

[Figure 3-6 (chart): left y-axis: Rank ratio (0% to 25%); right y-axis: Number of top 5% ranked "target" genes (0 to 140); x-axis: Feature set (All, GO+MP+Pubmed, MP+Pubmed, Without GO, Without Pathway, Without Domain, Without Interaction, Without Expression, Without MP, Without Pubmed, Without MP & Pubmed). Series: Average rank ratio, Number of top 5% ranked "target" genes.]

Figure 3-6: The performance of locus-region cross-validation using different feature sets. The average rank ratio (y-axis on the left) is the average rank ratio of the “target” genes in the resulting lists; a lower value corresponds to better performance. Conversely, the higher the number of top 5% ranked “target” genes among the total of 150 prioritizations (y-axis on the right), the better the performance. Removing MP, PubMed or both resulted in a significant drop in performance.

3.3.7 Comparison of ToppGene with ENDEAVOUR and SUSPECTS

Test Case 1: Congenital heart disease (CHD)

We used 28 genes implicated in congenital heart disease (CHD) (see Supplementary Table 3 in [41] for the complete list and a comparison of the relative rankings of the “target” genes using different gene prioritization approaches) as the test case and prioritized the genes using the random-gene cross-validation method described in the earlier sections. In each run, the same training and test sets were submitted manually to SUSPECTS, ENDEAVOUR and ToppGene. Twenty-eight prioritizations were performed by each of the three methods, and the average size of the test sets was 20 genes.

Following the prioritization, the “target” genes were ranked among the top 5% in the resulting lists 5, 4, and 7 times out of 28 (i.e., about 18%, 14%, and 25%), and in the top 10% 9, 14 and 18 times (about 32%, 50% and 64%) with SUSPECTS, ENDEAVOUR and ToppGene, respectively. The average rank ratios of the “target” genes were 25.03%, 17.29% and 17.35% for SUSPECTS, ENDEAVOUR and our approach, respectively (see section C of Table 3-3).

Test Case 2: Diabetic retinopathy (DR)

A similar comparative analysis was repeated with diabetic retinopathy (DR) as a test case, using locus-region cross-validation as described in the previous section. The training set comprised 27 known genes implicated in DR (see Supplementary Table 4 in [41] for the complete list and a comparison of the relative rankings of the “target” genes using SUSPECTS, ENDEAVOUR and ToppGene), while the test sets comprised the genes in the locus regions of the “target” genes.

The “target” genes were ranked among the top 5% in the resulting lists 12 times out of 27 (~44%) with both SUSPECTS and ENDEAVOUR. As in the earlier comparisons, ToppGene again outperformed both SUSPECTS and ENDEAVOUR by ranking the “target” genes among the top 5% 17 times out of 27 (~63%). When we considered the top 10%, SUSPECTS surprisingly fared better than ENDEAVOUR and was close to ToppGene’s performance: the “target” genes were ranked among the top 10% of the prioritized gene lists 17, 15 and 19 times (63%, 56% and 70%) with SUSPECTS, ENDEAVOUR and ToppGene, respectively. The average rank ratios of the “target” genes were 17.04%, 13.31% and 8.49% for SUSPECTS, ENDEAVOUR and our approach, respectively (see section D of Table 3-3).

3.3.8 ToppGene Implementation and Access

The programs of our prioritization method are implemented purely in Java. The open source Java package FtpBean by Calvin Tai [76] is used to automatically download data and annotation files from FTP servers. BioJava packages [77] are used to process UniProt records [78] and extract related protein domain information. The GOLEM [79] source code was adopted and modified for handling ontology annotations. The Colt [80] and Jakarta Commons-Math [81] libraries are used for statistical analysis. The fuzzy similarity measure and related functions are implemented locally.

Our prioritization method is available as a standalone web application at http://toppgene.cchmc.org [82]. The user interface is written in JavaScript, JSP and servlets, and is integrated with the Tomcat web server. Users can enter the training and test sets of interest as queries from the interface, and the application will display the training and test results accordingly. All gene information and annotation data are updated automatically, except for pathways.

3.4 Discussion

Traditionally there are two categories of approaches for computing the similarity between two genes based on semantic annotations: pair-based and set-based [75]. In pair-based methods, the average or maximum of the pairwise term information content is taken as the similarity between the two genes. This, however, causes inconsistencies. Specifically, the average of pairwise term information content tends to underestimate the similarity (e.g. two identical genes have a similarity less than 1), and the maximum tends to overestimate it (e.g. two genes sharing one annotation term have a similarity equal to 1). Set-based similarity measures, such as Jaccard and Dice similarity [75], return 0 if the two genes do not share a common annotation term. This behavior is especially undesirable for annotation terms from ontologies, where distinct terms can still be closely related. The fuzzy-based similarity measure adopted in our approach can overcome the problems mentioned above and generate a better similarity measure than the traditional methods.
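The inconsistencies described above can be reproduced with a small sketch. It uses a hypothetical table of term-to-term similarities as a stand-in for information-content-based term similarity (identical terms are given similarity 1), and it does not implement the fuzzy measure itself; all names and values are illustrative assumptions.

```python
from itertools import product

def jaccard(terms_a, terms_b):
    """Set-based: shared terms over all terms."""
    return len(terms_a & terms_b) / len(terms_a | terms_b)

def dice(terms_a, terms_b):
    """Set-based: twice the shared terms over the summed set sizes."""
    return 2 * len(terms_a & terms_b) / (len(terms_a) + len(terms_b))

def pairwise(terms_a, terms_b, term_sim, combine):
    """Pair-based: combine (average or maximum) the similarities of all term pairs."""
    scores = [
        1.0 if s == t else term_sim.get(frozenset((s, t)), 0.0)
        for s, t in product(terms_a, terms_b)
    ]
    return combine(scores)

def average(values):
    return sum(values) / len(values)

# Hypothetical term similarities: t1 and t2 are weakly related, t3 is unrelated.
term_sim = {frozenset(("t1", "t2")): 0.2}
gene_x = {"t1", "t2"}
gene_y = {"t1", "t3"}

self_similarity_avg = pairwise(gene_x, gene_x, term_sim, average)  # identical genes
one_shared_term_max = pairwise(gene_x, gene_y, term_sim, max)      # one shared term
no_shared_term_jaccard = jaccard({"t1"}, {"t2"})                   # related, not shared
```

With these toy values the averaged pairwise score of a gene with itself is 0.6 rather than 1, the maximum pairwise score of two genes sharing a single term is 1, and the Jaccard score of two genes annotated with related but different terms is 0.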

Most of the current tools for gene list enrichment or candidate gene prioritization are based on GO, gene expression, pathways or human phenotype [31, 46, 48, 83]. Additionally, previous studies have shown that integrating multiple lines of evidence benefits candidate gene analysis. However, to the best of our knowledge, none of the current approaches integrates mouse phenotype features, although the mouse is a key model organism for the analysis of mammalian developmental, physiological, and disease processes [52]. Moreover, there have been reports wherein a direct comparison of human and mouse phenotypes allowed the rapid recognition of disease causal genes (for example, ROR2 as the Robinow syndrome gene [84]; the phenotype of the Abcc6-/- mouse shares calcification of elastic fibers with the pathology of human pseudoxanthoma elasticum (PXE), which is caused by mutations in the human ABCC6 gene [85]). For the first time, we use phenotype annotations for mouse orthologs of human genes as one line of evidence for candidate gene analysis. We are aware that comparing phenotypes between two different organisms raises several issues. For instance, the mouse genotype may involve mutations in orthologs of one or more of the genes associated with a phenotype, but the mouse phenotype may not resemble the disease in human. Nevertheless, finding, for instance, that targeted disruption of the mouse ortholog of the human CFC1 gene (associated with visceral heterotaxy, which is characterized by congenital anomalies that include complex cardiac malformations and situs inversus or situs ambiguus [86]) results in L-R laterality defects including cardiac malformations [87] can lead to novel and interesting hypotheses. Although our results demonstrate the utility of mouse phenotype data in human candidate gene analysis, there are some inherent limitations in using mouse phenotype annotations. For instance, MP is not a disease-centric ontology, and the phenotype of the same gene mutation can vary depending on the mouse strain or genetic background. Most importantly, orthologous genes need not necessarily result in orthologous phenotypes.

The improved performance of ToppGene can be attributed partially to the use of more comprehensive data resources. For instance, unlike ENDEAVOUR, the pathway data set in ToppGene is not limited to the KEGG resource. We compiled more than 700 additional pathways (associated with about 4800 human genes) from various sources (see Methods section for details) and used them for gene prioritization.

Our approach, however, has some limitations. First, by using a training set we assume that the disease genes we have yet to discover will be consistent with what is already known about a disease and/or its genetic basis, which may not always be the case. Second, the annotations and analyses provided, and the resulting prioritization, can only be as accurate as the underlying online sources from which the annotations are retrieved. Only one-fifth of the known human genes have pathway or phenotype annotations, and the functions of more than 40% of genes are still not defined (see Methods). To address this problem, we have developed a network-based gene prioritization approach independent of functional annotations (see Chapter 4). Third, the choice of training set matters: although the difference was not significant, we noted during cross-validation that larger training sets (>100 genes) decreased the sensitivity and specificity of the prioritization compared with smaller training sets (7 to 21 genes).

3.5 Conclusions

Existing disease candidate gene prioritization methodologies mine biological and functional information about candidate genes, and we believe that our system, ToppGene, can complement these existing approaches by using a novel method that mines mouse phenotype data. The aim of ToppGene is to generate likely candidates by extensive analysis of the known characteristics of genes, and it is inevitably restricted by existing information, be it GO annotation, pathways, phenotype or gene expression data. Through various examples, we demonstrate that ToppGene performs better than SUSPECTS, PROSPECTR and ENDEAVOUR in candidate gene prioritization. However, it needs to be emphasized that our aim is not to prove that ToppGene-prioritized genes are true disease genes but to aid in the selection of a subset of the most likely disease gene candidates from larger sets of disease-implicated genes identified by high-throughput genome-wide techniques such as linkage analysis and microarray analysis. In conclusion, we have used mouse phenotype data for the first time in human candidate gene analysis. We further demonstrate that employing the mouse phenotype data can significantly assist in focusing the search for the most likely disease gene candidates. Lastly, as the functional annotations of human and mouse genes improve, we envisage a proportional increase in the performance of ToppGene and strongly believe that it will be a valuable adjunct to wet lab experiments in human genetics and disease research.

Chapter 4. Disease candidate gene identification and prioritization using Protein-Protein Interaction Network

4.1 Introduction

Most of the current disease candidate gene prioritization methods [24, 36, 37, 39, 41, 49], including our ToppGene method described in the previous chapter, rely on functional annotations. However, the coverage of gene functional annotations is a limiting factor. Although more than 1,500 human disease genes have been documented, most of them remain functionally uncharacterized. Currently, only a fraction of the genome is annotated with pathways and phenotypes. While two thirds of all genes have at least one annotation, the remaining one third has yet to be annotated.

4.1.1 Protein-protein interaction networks

Analysis of protein-protein interaction networks (PPINs) is becoming important for inferring the function of uncharacterized proteins. Protein-protein interactions refer to the association among the protein molecules and the study of these associations from the perspective of biochemistry, signal transduction and networks. Further, interactions between proteins provide a mechanistic basis for most biological processes and diseases in an organism.

Recent biotechnological advances such as the high-throughput yeast two-hybrid screen have facilitated the building of proteome-wide PPINs or "interactome" maps in human [88, 89]. The shift in focus to systems biology has led to further interest in PPINs and biological pathways. Network-based analyses have been developed with a number of goals in mind [90], including protein function prediction [91], identification of functional modules [92], interaction prediction [93, 94], identification of disease candidate genes [95, 96] and drug targets [97, 98], and the study of network structure and evolution [99-103]. While there is a wealth of protein-disease relationships in the published literature and a number of PPIN resources, relatively few studies have actually used PPIN analyses for prioritizing disease genes. Thus, making use of these networks in the context of disease is a relatively new challenge in the field [95]. One of the earliest efforts [104] uses a classifier based on several topological features, including degree (the number of links to the protein), 1N index (the proportion of links to disease-related proteins), 2N index (the average 1N index of the neighbors), average distance to disease genes, and positive topology coefficient (the average neighborhood overlap with disease genes). Xu et al. built a KNN-based classifier with all disease genes from OMIM and concluded that hereditary disease genes from OMIM in the literature-curated protein-protein interaction network are characterized by a larger degree, a tendency to interact with other disease genes, more common neighbors and quick communication with each other [104]. A more recent application, Genes2Networks [105], identifies important genes based on a list of “seed” genes. It generates a Z-score for each “intermediate” gene from a binomial proportions test to represent its specificity or significance to the “seed” genes. The former method, which does not require a list of known disease-related genes, is best used for disease candidate gene identification where little prior knowledge is available about the disease. The latter application, on the other hand, uses a “seed” list as training to score the neighboring genes. It avoids bias towards highly connected “hub” genes, but the candidate gene is searched for in a local network region and the user has to provide the size of the neighborhood region in the network.


4.1.2 Ranking algorithms in networks

In the analysis of social networks, Web graphs and telecommunication networks, one common question is “which entities are most important in the network?” Although visualization-centered approaches such as graph drawing are useful for gaining qualitative intuition about the structure, especially of small graphs, it is difficult to use these approaches for large and more complex networks. A number of other approaches have therefore been developed. For instance, a variety of measures (degree centrality [106], closeness centrality [107] and betweenness centrality [108]) have been proposed by sociologists to determine the “centrality” of a node in a social network. Likewise, in the area of Web graphs, computer scientists have proposed a number of algorithms such as HITS [109] and PageRank [110] for automatically determining the “importance” of Web pages. Biological networks have been found to be comparable to communication and social networks [111]. For instance, PPINs and communication networks share several characteristics, such as scale-freeness and small-world properties, suggesting that the algorithms used for social and Web networks are equally applicable to biological networks. However, apart from Xu et al. [104], the recent Genes2Networks application [105], and Kohler et al. [112], there have been no reports of applying these established and successful ranking methods from other areas to the disease gene prioritization problem in biology.

In this chapter, we describe a new candidate gene prioritization method based solely on PPIN analyses. Specifically, we apply ranking algorithms to prioritize disease candidate genes and compare their performance with other methods based on genomic functional annotations. Although our earlier functional-annotation based gene prioritization method (described in Chapter 3) takes PPINs into account, the network structure itself was not fully utilized. To be comparable with the methods described in the previous chapter, this network-based prioritization requires a user-provided set of known disease-related genes as the root. The importance of the other genes relative to the root is calculated and ranked using a framework and algorithms proposed by White and Smyth [116].

4.2 Methods

4.2.1 Human protein interaction datasets

A human protein interaction dataset (file “interactions.gz”), a compilation of PPIs from BIND [70], BioGRID [113], and HPRD [114], was downloaded from the NCBI Entrez Gene FTP site [115]. All of these interactions are derived from large-scale experiments and curated manually. For example, all interactions in BIND are experimentally validated and published in at least one peer-reviewed journal; interactions in BioGRID, like those in HPRD, are derived entirely from manual literature curation.

4.2.2 Prioritization methods

In the current study, the protein interaction network is represented as an unweighted, undirected simple graph G, where proteins (genes) are nodes and interactions are edges. The set of all the proteins in the network is denoted as V, and the set of all the interactions as E. The set of known disease genes (also called the seeds) is denoted as R. The prioritization approaches are based on White and Smyth’s methods [116], whose general framework, consisting of four successive problem formulations, each building on the next, defines the approach to ranking nodes in an unweighted digraph G(V, E):

1. Relative importance of a node t with respect to a root node r: Given G, r and t, where r and t are both nodes in G and r is the root, compute the “importance” of t with respect to r. This importance is denoted as I(t|r), a non-negative quantity.

2. Rank of importance of a set of nodes T with respect to a root node r: Given G and a root node r in G, rank all vertices in T, a subset of the vertices in G. For each node t in T, the value of I(t|r) can be computed. The nodes can then be ranked so that the largest values correspond to the highest importance.

3. Rank of importance of a set of nodes T with respect to a set of root nodes R: Given G and a set of root nodes R in G, rank all vertices in T, a subset of the vertices in G. The importance of node t to R is defined as the average of the importance of t to each node in R:

$$I(t \mid R) = \frac{1}{|R|} \sum_{r \in R} I(t \mid r) \qquad (1)$$

4. Given G, rank all nodes: This is a special case where R = T = V.

Based on White and Smyth’s framework, the solution to problem 3 is what is needed in this study. To recap in the context of disease gene prioritization, the problem is to prioritize a set of genes in the network based on their importance to a set of root genes (e.g. genes known to be associated with a disease). The importance of a gene to the set of root genes is simply the average of its importance to each individual root gene. Although this framework was proposed for directed networks, it can also be applied to undirected networks because the latter are just a special case of the former. In this study, the undirected protein interaction network was converted to an equivalent directed network when necessary.

With the problem formulation defined, the key to the solution is to find I(t|r), the importance of node t with respect to a root node r. White and Smyth [116] provided several algorithms. Three of them, namely a) PageRank with Priors, b) HITS with Priors, and c) K-Step Markov, are considered in this study.

PageRank with Priors is an extension of the original PageRank algorithm. The iterative stationary probability equation is:

$\pi^{(i+1)}(v) = (1 - \beta) \left( \sum_{u=1}^{d_{in}(v)} p(v \mid u)\, \pi^{(i)}(u) \right) + \beta\, p_v$ (2)

In this equation, p_v represents the “prior bias”: p_v = 1/|R| for v in the root node set R, and p_v = 0 otherwise. β, chosen empirically in [0, 1], represents a “back probability”. The remaining quantities are determined by the graph: d_in(v) is the indegree of node v, and p(v|u) is the probability of arriving at v from u.
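The iteration described above can be illustrated with a minimal Python sketch. It is not the JUNG implementation used in this study. It assumes that each undirected interaction is treated as two directed edges and that p(v|u) is uniform over the neighbours of u; the function name, the toy network and the gene labels are hypothetical.

```python
def pagerank_with_priors(edges, roots, beta=0.3, iterations=100):
    """Iterate pi(v) <- (1 - beta) * sum_u p(v|u) * pi(u) + beta * p_v."""
    # Assumption: every undirected interaction becomes two directed edges.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    nodes = sorted(neighbors)
    # Prior bias: 1/|R| for root nodes, 0 otherwise.
    prior = {v: (1.0 / len(roots) if v in roots else 0.0) for v in nodes}
    score = dict(prior)  # start from the prior distribution
    for _ in range(iterations):
        new_score = {}
        for v in nodes:
            # Assumption: p(v|u) = 1 / degree(u), a uniform step from u.
            incoming = sum(score[u] / len(neighbors[u]) for u in neighbors[v])
            new_score[v] = (1.0 - beta) * incoming + beta * prior[v]
        score = new_score
    return score


# Hypothetical toy network with a single seed gene "A".
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]
scores = pagerank_with_priors(toy_edges, roots={"A"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy example the seed itself receives the highest score, and the scores decay with distance from the seed, which is the bias towards R that β controls.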

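HITS with Priors is an extension of the original HITS algorithm. The iterative equations for the authority score a(v) and the hub score h(v) are defined as:

$a^{(i+1)}(v) = (1 - \beta) \left( \sum_{u=1}^{d_{in}(v)} \frac{h^{(i)}(u)}{H^{(i)}} \right) + \beta\, p_v$

$h^{(i+1)}(v) = (1 - \beta) \left( \sum_{u=1}^{d_{out}(v)} \frac{a^{(i)}(u)}{A^{(i)}} \right) + \beta\, p_v$ (3)

where d_in(v) and d_out(v) are the indegree and outdegree of v, respectively, and H(i) and A(i) are defined as:

$H^{(i)} = \sum_{v=1}^{|V|} \sum_{u=1}^{d_{in}(v)} h^{(i)}(u), \quad A^{(i)} = \sum_{v=1}^{|V|} \sum_{u=1}^{d_{out}(v)} a^{(i)}(u)$ (4)

The definitions of the prior bias p_v and the “back probability” β are similar to those in PageRank with Priors. The authority score is used as the importance of the node.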
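A minimal Python sketch of the HITS with Priors iteration follows. It is an illustration under stated assumptions rather than the JUNG code used in this study: the undirected network is converted to a directed one by adding both directions of every interaction (so in-neighbours and out-neighbours coincide), the normalizers are taken as the sums of the unnormalized hub and authority contributions, and the function name and toy network are hypothetical.

```python
def hits_with_priors(edges, roots, beta=0.3, iterations=100):
    """Return authority scores biased towards the root set."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    nodes = sorted(neighbors)
    prior = {v: (1.0 / len(roots) if v in roots else 0.0) for v in nodes}
    auth = dict(prior)
    hub = dict(prior)
    for _ in range(iterations):
        # Unnormalized updates: authorities collect hub scores of
        # in-neighbours, hubs collect authority scores of out-neighbours.
        raw_auth = {v: sum(hub[u] for u in neighbors[v]) for v in nodes}
        raw_hub = {v: sum(auth[u] for u in neighbors[v]) for v in nodes}
        h_norm = sum(raw_auth.values())  # plays the role of H(i)
        a_norm = sum(raw_hub.values())   # plays the role of A(i)
        auth = {v: (1.0 - beta) * raw_auth[v] / h_norm + beta * prior[v]
                for v in nodes}
        hub = {v: (1.0 - beta) * raw_hub[v] / a_norm + beta * prior[v]
               for v in nodes}
    return auth


# Hypothetical toy network with a single seed gene "A".
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]
authority = hits_with_priors(toy_edges, roots={"A"})
```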

The K-Step Markov approach computes the relative probability that the system will spend time at any particular node given that it starts in a set of roots R and ends after K steps. According to White and Smyth [116], the value of K controls the relative trade-off between a distribution “biased” towards R and the steady-state distribution; as K gets larger, the result converges to the PageRank result. The equation to compute the K-Step Markov importance is very simple:

$I(t \mid R) = \left[ A\,p_R + A^2 p_R + \cdots + A^K p_R \right]_t$ (5)

where A is the transition probability matrix of size n × n, p_R is an n × 1 vector of initial probabilities for the root set R, and I(t|R) is the t-th entry in this sum vector.
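The K-Step Markov sum can be sketched in a few lines of Python. This is a hedged illustration, not the JUNG implementation: it assumes a uniform transition probability from each node to its neighbours and applies no final normalization, and the function name and toy network are hypothetical.

```python
def k_step_markov(edges, roots, k=6):
    """Sum the distributions reached after 1, 2, ..., K random-walk steps."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    nodes = sorted(neighbors)
    # p_R: uniform initial probability over the root set.
    current = {v: (1.0 / len(roots) if v in roots else 0.0) for v in nodes}
    total = {v: 0.0 for v in nodes}
    for _ in range(k):
        step = {v: 0.0 for v in nodes}
        for u in nodes:
            share = current[u] / len(neighbors[u])  # uniform transition
            for v in neighbors[u]:
                step[v] += share
        current = step  # distribution after one more step
        for v in nodes:
            total[v] += current[v]
    return total


# Hypothetical toy network with a single seed gene "A".
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]
one_step = k_step_markov(toy_edges, roots={"A"}, k=1)
six_steps = k_step_markov(toy_edges, roots={"A"}, k=6)
```

With K = 1 only the direct neighbours of the seed (here "B" and "E") receive a non-zero score, while a longer walk also reaches "C" and "D".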

For additional details of the methods, the readers are referred to the original paper by White and Smyth [116].


4.2.3 PPIN analysis and derivation of topological parameters

The basic network statistics and topological parameters were derived using NetworkAnalyzer [117]. NetworkAnalyzer is a Java plugin for Cytoscape [118], a software platform for the analysis and visualization of molecular interaction networks. The version of Cytoscape was 2.5.2 and that of NetworkAnalyzer was 2.5.1. For details of Cytoscape and NetworkAnalyzer, please refer to their websites.

Implementations of the prioritization methods (PageRank with Priors, HITS with Priors, and the K-Step Markov approach) are all available in the JUNG (Java Universal Network/Graph) framework [119]. It is a Java package that provides a common and extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network. Version 2.0 was used and integrated with other in-house programs through APIs to perform all the required functions. For details of JUNG, please refer to its website [119].

4.2.4 Evaluation methods of PPIN topological features

Cross-validations to test the performance of the prioritization methods were done as described in Chapter 3 and earlier [41]. Briefly, 19 diseases from OMIM [120] and GAD [121] were used as training sets. For each of the diseases, the associated genes were used as “seeds” and leave-one-out random cross-validation was performed. Random sets were used as the control training sets. The rank-based sensitivity and specificity followed the previous definitions (Chapter 3). ROC curves were plotted to visualize the performance, with AUC values as quantitative measures. For further details, refer to the previous chapter or our previous publication [41].

All three node ranking methods require pre-determined parameters. For PageRank with Priors and HITS with Priors, the “back probability” is needed. It represents the bias towards the seeds, and the recommended value is 0.3 according to White and Smyth [116]. For the K-Step Markov approach, the only parameter is the length of the random walk, K, which controls the relative trade-off between a distribution “biased” towards the “seeds” and the steady-state distribution, which is independent of the “seeds”. As K gets bigger, the final state moves towards the steady state. The recommended K value was 6. In order to evaluate the effect of the parameters on the performance, different parameter values were used in the cross-validations, and the test with each parameter setting was repeated 5 times to estimate the mean and standard deviation. Performance was compared through analysis of variance followed by the Tukey HSD multiple comparison method.

4.3 Results

4.3.1 Human protein interaction network

The interactions were extracted from the file downloaded from the NCBI Entrez Gene ftp site on Dec 7, 2007. The file from NCBI was a tab-delimited text file containing 18 columns in the format shown below:

Column   Content
1        First Tax ID
2        First Gene ID
3        First Protein Accession
4        First Protein Name
5        Interaction short phrase
6        Second Tax ID
7        Second ID
8        Second Interactant ID Type
9        Second Accession
10       Second Protein Name
11       Interaction Complex ID
12       Interaction Complex Type
13       Interaction Complex Name
14       Pubmed ID List
15       Last Update Timestamp
16       GeneRIF Text
17       Interaction ID
18       Interaction ID Type

Table 4-1: The columns of the interaction table from NCBI Entrez Gene. Only columns 2, 7 and 18 were captured for human interactions.

Three steps were performed using TextPad to prepare the data for Cytoscape network analysis. First, the regular expression “^9606\t\([[:digit:]]+\).*\t9606\t\([[:digit:]]+\).*\t\([[:alpha:]]+\)$” was replaced with “\1\t\2\t\3” to extract human protein interactions. Second, “\(^[[:digit:]]+\t\).*\t\1.*\n” was replaced with “” (empty string) to remove proteins interacting with themselves (loops). Finally, duplicated rows were removed. This resulted in 50061 unique rows, each representing an interaction by GeneID1, GeneID2 and the corresponding interaction data source.
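For readers without TextPad, the same three steps can be approximated in Python. The sketch below is an assumption-laden equivalent, not the procedure actually used: it splits each line into the 18 tab-delimited columns instead of applying the regular expressions, keeps columns 2, 7 and 18, and uses made-up sample rows. The function and helper names are hypothetical.

```python
def extract_human_interactions(lines):
    """Keep human-human rows, drop self-interactions and duplicate rows."""
    seen = set()
    rows = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 18:
            continue  # not a well-formed 18-column row
        tax1, gene1, tax2, gene2, source = (
            cols[0], cols[1], cols[5], cols[6], cols[17])
        if tax1 != "9606" or tax2 != "9606":
            continue  # step 1: human (taxonomy ID 9606) on both sides
        if not (gene1.isdigit() and gene2.isdigit()):
            continue  # both interactants must be numeric Gene IDs
        if gene1 == gene2:
            continue  # step 2: remove loops
        row = (gene1, gene2, source)
        if row not in seen:  # step 3: remove duplicated rows
            seen.add(row)
            rows.append(row)
    return rows


def _sample_row(tax1, gene1, tax2, gene2, source):
    """Build a made-up 18-column line with only the relevant fields filled."""
    cols = [""] * 18
    cols[0], cols[1], cols[5], cols[6], cols[17] = tax1, gene1, tax2, gene2, source
    return "\t".join(cols) + "\n"


sample = [
    _sample_row("9606", "1", "9606", "2", "BIND"),
    _sample_row("9606", "1", "9606", "2", "BIND"),     # duplicate
    _sample_row("9606", "3", "9606", "3", "HPRD"),     # self-interaction
    _sample_row("10090", "4", "9606", "5", "BioGRID"),  # non-human interactant
]
interactions = extract_human_interactions(sample)
```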

The resulting human protein interaction network contained 8340 vertices, corresponding to 8340 unique genes (proteins), and 27250 edges, corresponding to 27250 unique interactions among the genes. The following table shows the basic counts of this interaction dataset separated by data source:

Source     Genes   Interactions
BIND       2389    4054
BioGRID    7683    23205
HPRD       6594    22802
Total      8340    27250

Table 4-2: The number of unique genes and interactions from each interaction data source.

The following figures are the Venn diagrams of the genes and interactions respectively in the resultant protein interaction network.

Figure 4-1: Venn diagrams of unique genes and interactions from all data sources.

Analyzed with NetworkAnalyzer [117] in Cytoscape [118], the complete human protein interaction network contained 120 connected components, with the largest component having 8075 genes. The average shortest path length was 4.591 and the average degree was 6.535. The distributions of the degree and clustering coefficient of the nodes are shown in the following figures. The curves show that both distributions have a power law property. In the fitted model of y = ax^b for the degree distribution, a = 9188.7, b = -1.93, correlation = 0.899 and R-squared = 0.933. For the clustering coefficient distribution, a = 0.479, b = -0.599, correlation = 0.862 and R-squared = 0.516.
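A power law y = ax^b appears as a straight line on log-log axes, so a and b can be estimated by linear regression on the log-transformed values. The following Python sketch shows that idea on synthetic data; it assumes an ordinary least-squares fit in log space and is not meant to reproduce NetworkAnalyzer's own fitting routine. The function name and data are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    s_xx = sum((x - mean_x) ** 2 for x in log_x)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = s_xy / s_xx                      # slope on the log-log plot
    a = math.exp(mean_y - b * mean_x)    # intercept mapped back
    return a, b


# Synthetic degree/count pairs that follow y = 100 * x**-2 exactly.
degrees = [1, 2, 4, 8]
counts = [100.0 * d ** -2 for d in degrees]
a_hat, b_hat = fit_power_law(degrees, counts)
```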


Figure 4-2: Distribution of node degrees of the protein interaction network. X-axis is the degree and Y-axis is the number of nodes of that degree. Both X and Y-axis are log-scaled. A linear trend (power law property) can be observed from the plot.

Figure 4-3: Plot of average clustering coefficient vs. degree of the protein interaction network. X-axis is the degree and Y-axis is the corresponding average clustering coefficient. Both X and Y-axis are log-scaled. A linear trend, although not strong, can be observed from the plot.

4.3.2 Evaluation

I used the same training data as in Chapter 3 and the previous study [41], comprising 19 diseases with 693 associated genes. Of these, 589 genes were used in the cross validation because the rest (104 genes) had no reported interactions. The random training dataset, used as control, was built with 19 random gene lists, each containing 31 to 38 genes. Three methods, K-Step Markov (KSMarkov), PageRank with Priors (PRankP), and HITS with Priors (HITSP), were used to prioritize the disease genes with different parameter values. The random genes were prioritized using PRankP with back probability set to 0.3. ROC curves of representative cross validation results are shown in Figure 4-4 and Figure 4-5.

[Figure 4-4 plot: ROC curves; x-axis: 1 - Specificity (0 to 1); y-axis: Sensitivity (0 to 1); series: HITSP 0.5, HITSP 0.3, PRankP 0.5, PRankP 0.3, PRankP 0.1, PRankP 0.05, PRankP 0.01, Random]

Figure 4-4: ROC curves from cross validations. This figure shows the representative ROC curves using PageRank with Priors with back probability 0.01, 0.05, 0.1, 0.3 and 0.5, and HITS with Priors with back probability 0.3 and 0.5. The random curve was derived from prioritization of the random training set using PageRank with Prior method with back probability 0.3.

[Figure 4-5 plot: ROC curves; x-axis: 1 - Specificity (0 to 1); y-axis: Sensitivity (0 to 1); series: KSMarkov 6, KSMarkov 4, KSMarkov 2, KSMarkov 1, Random]

Figure 4-5: ROC curves from cross validations. This figure shows the representative ROC curves using K-Step Markov method with K = 1, 2, 4, and 6. The random curve was derived from prioritization of the random training set using PageRank with Prior method with back probability 0.3.

HITS with Priors performed similarly to PageRank with Priors under the back probability values tested. Therefore, only PageRank with Priors was tested for extreme back probability values such as 0.01 and 0.05. The results from back probability 0.7 and 0.9 were similar to those of 0.3 and 0.5 and are therefore not shown here. The 11 test conditions were: PageRank with Priors with back probability 0.01, 0.05, 0.1, 0.3 and 0.5; K-Step Markov with K = 1, 2, 4 and 6; and HITS with Priors with back probability 0.3 and 0.5. The following table shows the AUC values from each validation run. Each method with the same parameter setting was repeated 5 times.

Test ID   AUC           Test Type
1         0.658157895   k1
2         0.658667233   k1
3         0.872911715   k1
4         0.657996604   k1
5         0.657801358   k1
6         0.778132428   k2
7         0.777775891   k2
8         0.778947368   k2
9         0.778480475   k2
10        0.777903226   k2
11        0.803140917   k4
12        0.802105263   k4
13        0.797606112   k4
14        0.801561969   k4
15        0.803599321   k4
16        0.801417657   k6
17        0.802589134   k6
18        0.802478778   k6
19        0.803921902   k6
20        0.803149406   k6
21        0.800747029   h3
22        0.800645161   h3
23        0.802003396   h3
24        0.79942275    h3
25        0.800730051   h3
26        0.800526316   h5
27        0.800976231   h5
28        0.800517827   h5
29        0.800967742   h5
30        0.800135823   h5
31        0.778845501   p05
32        0.77655348    p05
33        0.775509338   p05
34        0.776001698   p05
35        0.775135823   p05
36        0.728268251   p01
37        0.726298812   p01
38        0.726714771   p01
39        0.72704584    p01
40        0.726375212   p01
41        0.792826825   p1
42        0.789601019   p1
43        0.792555178   p1
44        0.791001698   p1
45        0.791247878   p1
46        0.803539898   p3
47        0.800806452   p3
48        0.801010187   p3
49        0.801706282   p3
50        0.801621392   p3
51        0.799252971   p5
52        0.800967742   p5
53        0.801536503   p5
54        0.799303905   p5
55        0.802597623   p5

Table 4-3: AUC values from each cross validation run. Column “Test Type” indicates the method and parameter settings of the test. P01 through p5 stand for PageRank with Priors with back probability 0.01 to 0.5 respectively; k1, k2, k4 and k6 represent K-Step Markov with K = 1, 2, 4 and 6 accordingly; h3 and h5 are HITS with Priors with back probability 0.3 and 0.5 respectively. There were 11 test conditions each repeated 5 times.

The following table summarizes the performance from each method with particular parameter value.

Method                 Parameter                 Mean of AUC   Std. Dev. of AUC
PageRank with Priors   Back probability = 0.01   0.7269406     0.0007994196
                       Back probability = 0.05   0.7764092     0.0014623336
                       Back probability = 0.1    0.7914465     0.0013016877
                       Back probability = 0.3    0.8017368     0.001079227
                       Back probability = 0.5    0.8007317     0.001450028
K-Step Markov          K = 1                     0.7011070     0.096042314
                       K = 2                     0.7782479     0.0004738863
                       K = 4                     0.8016027     0.0023758973
                       K = 6                     0.8027114     0.0009219541
HITS with Priors       Back probability = 0.3    0.8007097     0.0009132172
                       Back probability = 0.5    0.8006248     0.0003540317

Table 4-4: Means and standard deviations of AUC values under 11 different cross validation conditions. Highlighted rows correspond to the best parameter value of each method.
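As a worked check of how the summary statistics relate to the individual runs, the short Python sketch below recomputes the K = 1 row from the five k1 AUC values listed in Table 4-3. It assumes the reported standard deviation is the sample standard deviation (n - 1 denominator); the variable names are illustrative.

```python
from statistics import mean, stdev

# The five k1 (K-Step Markov, K = 1) AUC values from Table 4-3.
k1_auc = [0.658157895, 0.658667233, 0.872911715, 0.657996604, 0.657801358]

k1_mean = mean(k1_auc)   # compare with the reported 0.7011070
k1_sd = stdev(k1_auc)    # sample standard deviation; reported 0.096042314
```

The third run (0.8729) lies far from the other four values (about 0.658), which accounts for the much larger standard deviation of this condition.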


Figure 4-6: Plots of AUC with different parameter values. The left panel shows the AUC values of PageRank with Priors with back probability varied from 0.01 to 0.5. The right panel shows the AUC values of K-Step Markov method with random walk length varied from 1 to 6. The vertical bars indicate the standard deviations.

The best performance of each method was selected, namely PRankP and HITSP with back probability 0.3 and KSMarkov with K = 4, for analysis of variance (ANOVA). The p-value of 0.5585 suggests there is no significant difference among the best performances of the three methods.

ANOVA was further performed on PRankP and KSMarkov to test whether different parameter values resulted in a significant drop in performance. The result is summarized in the following table. Each ANOVA was followed by a Tukey HSD test to compare the pairwise differences. The differences in mean levels from the Tukey HSD tests are plotted in the following figure.

Method     Parameter values                                p-value
PRankP     Back probability = 0.01, 0.05, 0.1, 0.3, 0.5    2.20E-16
KSMarkov   K = 1, 2, 4, 6                                  0.01264

Table 4-5: Summary of ANOVA on PRankP and KSMarkov with different parameter values. Significant p values suggest parameter values have strong effects on the performance.

[Figure 4-7 plot: two panels, each titled “95% family-wise confidence level” with x-axis “Differences in mean levels of type”; pairwise comparisons shown: k6-k4, k6-k2, k4-k2, k6-k1, k4-k1, k2-k1 (KSMarkov) and p3-p5, p3-p1, p5-p1, p3-p05, p5-p05, p1-p05, p3-p01, p5-p01, p1-p01, p05-p01 (PRankP)]

Figure 4-7: Plot of differences in mean levels of AUC values from the Tukey HSD test. The left panel shows the results from PRankP, where AUC values with back probability of 0.01 and 0.05 were significantly lower than the others. The right panel shows the results from KSMarkov, where AUC values with random walk length 1 were significantly lower than the others.

The Tukey HSD test, consistent with the bar plots, suggested that for PRankP and HITSP the performance dropped significantly if the back probability was very low. For example, back probabilities of 0.01 and 0.05 resulted in dramatic drops in AUC values. Similarly for KSMarkov, if the length of the random walk starting from the “seeds” was too short, the performance decreased significantly. For example, when K = 1, the AUC was significantly lower and had a much higher variance, indicating that the performance was both worse and less robust. When K = 2, the AUC was noticeably lower, although the difference was less significant. K = 4 or above generally gave consistent and equivalent results.

4.4 Discussion and Conclusion

Our current study, based on the observation that biological networks share many properties with Web and social networks, is an attempt to extend successful graph analysis-based algorithms from computer science research to the disease gene prioritization problem for genes that have poorly defined functional annotations. Using literature-based and manually curated protein interactions to form the base network, extended versions of the PageRank algorithm and the HITS algorithm, as well as the K-Step Markov method, were applied to prioritize disease candidate genes in a training-test schema. For each prioritization, a list of known disease-related genes was used as training (“seeds”), and the genes in the test list (candidates) were ranked. To evaluate and compare the performance of the methods, a large-scale cross validation was performed. Eleven conditions with 3 algorithms and different parameter settings were tested, each repeated 5 times. Rank-based ROC curves were plotted and AUC values were used to quantitatively measure the performance.

Based on our results, I draw the following conclusions. First, under appropriate settings, for example a back probability of 0.3 for PageRank with Priors and HITS with Priors, and walk length 4 for the K-Step Markov method, the three methods achieved statistically indistinguishable AUC values (ANOVA p = 0.5585) and hence similar performance. This suggests that, based on the current knowledge of protein interaction networks, other similar methods under the same framework (e.g., other methods for ranking nodes in an unweighted graph) may produce similar results.

Second, the value of back probability in PageRank with Priors and HITS with Priors could be chosen from a fairly large range (for example 0.1 to 0.9) with relatively little effect on the performance. When the back probability was set very low, however, for example 0.01, the performance dropped significantly. This is expected because in both methods (see equations 2 and 3 under Methods), as the back probability approaches 0, the bias towards the “seeds” is eliminated and PageRank/HITS with Priors becomes the same as the original PageRank/HITS algorithm; the prioritization with respect to the selected “seeds” therefore fails. The performance of the K-Step Markov method, on the other hand, decreased significantly when the length of the random walk K was small (e.g. K = 1). Under this condition, the method calculates the probability of spending time on each protein starting from the seeds with a random walk of length 1. The proteins that do not directly interact with the “seeds” will never be reached and are scored 0. This means that if the true disease candidate does not directly interact with the “seeds”, it will be ignored when K is 1. The method converged to the best performance when K was 4. A further increase in random walk length did not necessarily improve the performance. This could be due to the fact that the average shortest path length in the interaction network was only about 4.5.

Third, the overall performance of prioritizations based on protein networks can be compared to functional annotation based methods [41] since they were all tested using the same cross validation. The AUC value of the functional annotation based method, ToppGene [41], was 0.916, and the best AUC value of the network-based methods was 0.801. This shows that network-based methods are generally not as effective as the integrated functional annotation based methods in candidate gene prioritization. However, network-based methods are easier to apply and more efficient in practice. For a fair comparison, they should be compared with the individual functional annotation features used in our previous study [41]; in that comparison, I found that network-based methods performed better than each of the individual annotation features (see [41] for details). Thus, I conclude that PPINs can be a potentially good feature for disease candidate gene prioritization, especially when the genes lack other functional annotations.

Network-based prioritization methods, however, have certain limitations. Just like functional annotation based methods, their performance depends on the quality of the interaction data. The algorithms used in our current study were originally developed to identify “important” nodes in networks. Although I used extended versions of these algorithms to prioritize nodes relative to selected “seeds”, they could still be biased towards hubs. Finally, these approaches were designed for Web and general networks; therefore, additional modifications may be required to make them fit better with biological networks, for example, considering weights on nodes (proteins) or edges (interactions). As a future extension, apart from considering weighted nodes and edges, I plan to integrate our method with other methods. For example, I hypothesize that combining it with results from functional annotation based methods and expression profiles may result in better performance of candidate gene prioritization.

Chapter 5. Prioritization of Novel Asthma Candidate Genes: A Complete Case Study

5.1 Introduction

In the previous two chapters (Chapters 3 and 4), I discussed the design, development, validation and applications of functional annotation based and network based disease candidate gene prioritization methods. In this chapter, using asthma as a test case, I have applied and compared the two candidate gene prioritization approaches. I have also integrated the rankings obtained by each of these methods to test the hypothesis that an integrative ranking system based on functional annotations and network analysis will result in better performance and identification of novel asthma-associated genes.

Asthma is a complex genetic disorder marked by inflammation of the airway accompanied by coughing, wheezing, and breathlessness [122]. As asthma is an allergen-induced complex disease, its molecular basis and etiology have been difficult to study for several reasons, including its multigenic origin and gene-environment interactions [123, 124]. The problem also stems from the phenotypic heterogeneity that is typical of any complex disorder. As a result, only a handful of common phenotypes or quantitative traits that are associated with asthma have been used to perform genome-wide scans and positional cloning to uncover the genetic underpinnings of asthma [125].

The first method, ToppGene [41], as described earlier in Chapter 3, mimics the candidate-gene approach in that genes are prioritized based on their functional similarity to the known asthma susceptibility genes. One of the caveats of this approach is that completely novel genes that have no known functional annotations related to asthma will be ignored. To overcome this drawback, a protein interaction network based prioritization method developed from the PageRank algorithm is used. Additionally, as a third alternative, I explored the use of fold changes in microarray expression levels to prioritize candidate genes. The prioritization performance of the three methods was evaluated using a cross-validation method, and our result showed that the combined ranked list from the first two methods achieved the best performance and that adding microarray expression data did not improve the performance. Finally, a genome-wide combined ranked list of novel asthma candidate genes is generated by integrating functional annotation- and protein interaction network-based rankings.

5.2 Methods

Figure 5-1 shows an overview of steps followed in this study for prioritization of novel asthma candidate genes.


Figure 5-1: Flowchart describing the step-by-step procedure taken to prioritize for novel asthma candidate genes.

5.2.1 Gene sets preparation

Gene sets for the functional annotation, network, and expression based prioritization were compiled as follows:

1) The three gene sets corresponding to the three candidate gene prioritization methods were compiled by mining publicly available resources. For functional annotation based gene prioritization (ToppGene), 17024 human genes with functional annotations from at least 3 categories available were compiled; 8340 human genes with at least one known interactant were pooled for network-based prioritization; and 14230 human genes (orthologs of differentially expressed genes in mouse models of asthma) were downloaded for expression level-based prioritization.

2) A common set of 7127 genes which could be used in all 3 analyses was then generated. In other words, this was the intersection of the three gene sets described in step 1. This set represents the universal set of genes in the cross-validation.

3) 252 known asthma-associated genes were compiled from literature (see Appendix A). Of these, 186 genes were present in the 7127 genes described in step 2. This set of 186 genes was considered as the disease-genes (denoted as set D) in the cross-validation.

4) The set of 6903 “non-disease-genes”, denoted as set N, was created by removing disease-gene set D and 1789 ubiquitously expressed human genes (UEHG) [126] from the universal set.

Gene Sets D and N were used for evaluating and comparing the performance from each prioritization method.
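The four steps above amount to a few set operations, sketched here in Python. This is an illustrative sketch only: the function name and the single-letter gene identifiers are hypothetical placeholders, not the gene lists used in the study.

```python
def build_gene_sets(functional, network, expression, known_disease, ubiquitous):
    """Derive the universal set, set D and set N from the input gene sets."""
    # Step 2: genes usable by all three prioritization methods.
    universal = functional & network & expression
    # Step 3: known disease genes that fall inside the universal set (set D).
    disease = known_disease & universal
    # Step 4: remove set D and ubiquitously expressed genes to obtain set N.
    non_disease = universal - disease - ubiquitous
    return universal, disease, non_disease


# Hypothetical toy identifiers standing in for real gene IDs.
universal, set_d, set_n = build_gene_sets(
    functional={"A", "B", "C", "D", "E"},
    network={"A", "B", "C", "D"},
    expression={"A", "B", "C", "D", "F"},
    known_disease={"A", "Z"},
    ubiquitous={"B", "Q"},
)
```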

5.2.2 Prioritization of candidate genes

Three candidate gene prioritization methods were considered, based on genomic functional annotations, the protein interaction network, and expression levels, respectively. The first two methods have been described in detail in the two previous chapters. Here, I used a third prioritization method based on fold change of gene expression levels from microarray experiments. The expression level-based prioritization method differs from the other two approaches in that it does not require a training gene set. Prioritization of candidate genes using expression data alone is seldom practiced. However, I considered it a systematic and less biased approach, and it is used here primarily as a control to see how the two new methods perform compared to an “unbiased”, more traditional method.

The data set GSE1301 was downloaded from the GEO database. CEL files of 3 wild type mice treated with phosphate buffered saline (PBS) and house dust mite (HDM) were RMA normalized using the custom chip description files Mm430A_Mm_ENTREZG.cdf and Mm430B_Mm_ENTREZG.cdf [74]. The genes were ranked in decreasing order of their absolute fold change values to obtain the ranked list.

For the two previously described prioritization approaches, the parameters were set as follows: for the functional annotation-based prioritization, the correction method of the training process was “bonferroni” with p-value cutoff at 0.05; for the test process, the random sample size was set at 2000 and minimum feature count at 3. In the protein interaction network-based prioritization, the “back probability” in the PageRank with Priors method was set at 0.3.

5.2.3 Combination of multiple prioritization results

The result from each of the prioritization methods was a ranked list of candidate genes. In order to combine multiple ranked lists into one prioritization result, a simple method known as Rank Products [127] was used. This method, which was initially proposed to analyze microarray expression results, can also be used to combine ranked lists. Briefly, if there are n genes and k ranked lists of genes (rank 1 corresponds to the best candidate), the rank matrix can be constructed such that r_{g,i} is the rank of gene g from prioritization i. The rank product of gene g is given by . The genes were then ranked in increasing order of their rank product to construct the combined ranked list.
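The combination step can be sketched in Python as follows. The sketch assumes the geometric-mean form of the rank product, i.e. the k-th root of the product of the k ranks of a gene; taking the plain product instead would give the same ordering. The function name, the alphabetical tie-breaking rule and the toy gene labels are hypothetical.

```python
def combine_by_rank_product(ranked_lists):
    """Combine ranked lists of the same genes by their rank product."""
    k = len(ranked_lists)
    # Rank 1 corresponds to the best candidate in each list.
    positions = [{g: i + 1 for i, g in enumerate(lst)} for lst in ranked_lists]
    rank_product = {}
    for gene in ranked_lists[0]:
        product = 1.0
        for pos in positions:
            product *= pos[gene]
        # Assumption: geometric mean of the k ranks.
        rank_product[gene] = product ** (1.0 / k)
    # Smaller rank product means a better combined rank; ties broken by name.
    combined = sorted(rank_product, key=lambda g: (rank_product[g], g))
    return combined, rank_product


# Hypothetical rankings from two prioritization methods.
functional_list = ["G1", "G2", "G3", "G4"]
network_list = ["G2", "G3", "G1", "G4"]
combined, rp = combine_by_rank_product([functional_list, network_list])
```

Here G2 (ranks 2 and 1) ends up ahead of G1 (ranks 1 and 3), because its product of ranks is smaller.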

In order to estimate the significance of the ranks, a permutation method was used. In each permutation, the ranks in each column were shuffled randomly and the RP of each gene was calculated. The same permutation was then repeated 1000 times, and the p-value of each gene was computed as

.

5.2.4 Evaluation of the prioritization results

Gene Set Enrichment Analysis (GSEA) [47] was originally developed to determine whether the members of an a priori defined set of genes are randomly distributed throughout a ranked list or primarily found at the top or bottom. Here I have used it to score the ranked lists instead: given a ranked list and an a priori defined set of genes, I wanted to determine whether this ranked list is significantly correlated with the gene set. Specifically, the known asthma related genes were the a priori defined set of genes and the ranked lists were the results for the test set from the three prioritization methods. The asthma related genes were expected to be ranked towards the top of the test set and therefore to yield a high Enrichment Score (ES). The higher the ES, the more strongly the ranked list is related to asthma. Normalized ES (NES) based on permutations was actually used to account for differences in the ranked lists. NES was defined as

The evaluation and comparison of the prioritization methods was carried out as follows. In each validation run, a random sample of 124 genes (two thirds), denoted as set S, was extracted from the asthma related gene set (Gene Set D described above) and used as the training set and “seeds” for functional annotation- and network-based prioritization, respectively. The remaining 62 asthma genes were then combined with the “non-asthma-genes” (Gene Set N) to form the test set T (i.e. 62 + 6903 = 6965 genes), which was ranked by all three methods. The normalized ES of the 62 asthma-related gene set from GSEA was used to quantitatively evaluate each ranked candidate gene list. This validation run was repeated 10 times for each method to evaluate the mean and variance of ES. In each validation run, prioritization results from different methods were combined by their rank products as described earlier, and the ES of the asthma-related gene set in the combined ranked list was calculated.
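One validation run's split can be sketched in Python as below. The function name, the seed argument and the placeholder gene identifiers are hypothetical; only the set sizes (186 disease genes, 6903 non-disease genes, a two-thirds training fraction) are taken from the text.

```python
import random


def split_for_validation(disease_genes, non_disease_genes,
                         train_fraction=2 / 3, seed=None):
    """Sample training seeds from set D and build the test set T."""
    rng = random.Random(seed)
    genes = sorted(disease_genes)  # fixed order so a seed is reproducible
    n_train = round(len(genes) * train_fraction)
    training = set(rng.sample(genes, n_train))        # set S ("seeds")
    held_out = set(genes) - training                  # remaining disease genes
    test_set = held_out | set(non_disease_genes)      # test set T
    return training, held_out, test_set


# Placeholder identifiers with the same set sizes as in the study.
set_d = {"D%d" % i for i in range(186)}
set_n = {"N%d" % i for i in range(6903)}
training, held_out, test_set = split_for_validation(set_d, set_n, seed=1)
```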

Since only the ranks of the candidate genes were considered in the prioritization results, their associated prioritization score or p-values were ignored by using the “classic” (un-weighted) mode of GSEA to generate the ES, and the number of permutations in each GSEA analysis was set to 1000.
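For illustration, the unweighted running-sum statistic behind the “classic” ES can be sketched in Python. This is a simplified sketch of the general idea (step up on a gene-set member, step down otherwise, report the largest deviation from zero) and not the GSEA software itself; it does not perform the permutation-based normalization, and the function name and toy genes are hypothetical.

```python
def classic_enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score of gene_set in a ranked list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("need at least one gene inside and outside the set")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Equal-sized steps: the ranks matter, the scores do not.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # largest deviation from zero so far
    return best


# Hypothetical ranked list of six genes.
ranked = ["a", "b", "c", "d", "e", "f"]
es_top = classic_enrichment_score(ranked, {"a", "b"})     # set at the top
es_bottom = classic_enrichment_score(ranked, {"e", "f"})  # set at the bottom
```

A gene set concentrated at the top of the list gives a score close to 1, and one concentrated at the bottom gives a score close to -1.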

5.3 Results

I obtained a list of 252 known asthma genes (see Appendix A) that have been published in the literature and were distributed throughout the genome (Figure 5-2).

Figure 5-2: Distribution of known asthma genes in the human genome. Known asthma genes are highlighted in blue.

5.3.1 Over-represented functional annotation terms of known asthma genes

The over-represented functional annotation terms of known asthma genes were identified using ToppGene [41]. Given that asthma is an inflammatory airway disease, I expected related biological processes and pathways to be over-represented. Not surprisingly, immune related processes and pathways were enriched in the results (p-value ≈ 0). For example, 107 genes were related to immune system process (p = 0) and 132 genes were implicated in immune system phenotype (p = 0). Among the pathways, cytokine-cytokine interaction was significantly enriched (p = 0), as expected given the significant number of interleukins and chemokines present in the asthma-gene list. These genes were localized in specific genomic segments that are known to be associated with immune response. For example, 11 of these genes are localized on 6p21, which is known to be a human major histocompatibility complex (MHC) genomic region and is associated with about 100 diseases including asthma [128].

5.3.2 Performance of each prioritization method

As described in the Methods section, the NES was used to measure the quality of the ranked list in the cross-validation. For each validation run, an NES was calculated for each ranked list generated by the corresponding prioritization method. The result is shown in Table 5-1. Since in each validation run the ranked lists were generated from the same set of test genes, a one-sided paired t-test instead of a two-sample t-test was performed to compare the performance between every two methods. The overall result indicated that the functional annotation-based method performed significantly better than the network-based method (p = 0.006243), and the network-based method performed significantly better than the expression-based method (p = 6.70E-10).

Iteration   NES (functional   NES         NES            NES (functional   NES (functional,
            annotation)       (network)   (microarray)   & network)        network, & microarray)
1           4.89              4.79        1.36           5.62              5.34
2           5.66              5.54        0.84           6.34              6.14
3           4.8               4.15        0.84           5.35              5.28
4           6.14              5.31        1.79           6.15              6
5           5.41              4.21        1.53           5.52              5.05
6           5.18              4.85        1.59           5.46              5.69
7           5.74              5.25        1.66           6.03              6.13
8           5.34              4.24        2.04           5.47              5.56
9           5.08              4.41        1.76           5.69              6.19
10          4.42              4.91        0.94           5.1               5.31
Mean        5.266             4.766       1.435          5.673             5.669
SD          0.504             0.5         0.427          0.387             0.422

Table 5-1: Normalized Enrichment Scores (NES) for different prioritization methods for ten iterations.

                      Functional + Network  Functional
Method                + Microarray          + Network   Microarray  Network
Functional            0.008037              0.00054     6.7E-10     0.006243
Network               0.0000333             0.0000102   4.18E-08
Microarray            4.5E-10               7.02E-10
Functional + Network  0.4828

Table 5-2: P values for one-sided paired t-tests comparing different prioritization methods based on NES. A white cell indicates that the P value is based on the alternative hypothesis that the column method performs better than the row method. A gray cell indicates that the P value is based on the alternative hypothesis that the row method performs better than the column method. The order of performance from best to worst is combined > functional > network > microarray, and there is no significant difference between the two combined methods.

5.3.3 Performance of combined ranked lists

The ranked lists from two or more prioritization methods were combined using Rank Product (described in the Methods section) and were also evaluated in the cross-validation. The NES of the combined ranked lists are shown in the last two columns of Table 5-1. The list combined from the functional annotation- and network-based methods performed better than every individual ranked list in all 10 validation runs. A one-sided paired t-test confirmed that this combined list performed significantly better than the functional annotation-based method alone (p = 0.00054). Adding the results from the expression-based method, however, made no significant difference (p = 0.4828). This indicates that combining rankings from the functional annotation- and network-based prioritization methods improves performance, suggesting that the two methods complement each other. Because adding the expression-based method did not yield any further improvement, results from the functional annotation- and network-based methods were combined in the rest of the analysis.
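The rank-product values reported in this chapter are consistent with the geometric mean of a gene's ranks across the combined methods (for example, ranks 32 and 8 give 16 for JAK2 in validation run 1). A minimal sketch under that reading follows; the function names are illustrative, and tie handling is not addressed.

```python
import math


def rank_product(ranks):
    """Geometric mean of one gene's ranks across the methods being combined."""
    return math.prod(ranks) ** (1.0 / len(ranks))


def combine_rankings(per_method_ranks):
    """Map gene -> tuple of ranks; return genes ordered by increasing rank product."""
    scored = {gene: rank_product(r) for gene, r in per_method_ranks.items()}
    return sorted(scored, key=scored.get)


# (functional annotation rank, network rank) for four genes from validation run 1.
example = {"CCR5": (9, 27), "JAK2": (32, 8), "ICOSLG": (163, 1), "IL2RG": (13, 15)}
print(combine_rankings(example))
```

A gene ranked highly by only one method (ICOSLG: 163 and 1) can still reach the top of the combined list, which is how the two methods are able to complement each other.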

Figure 5-3 shows the result of validation run 1 as an example. The cumulative enrichment score (ES) curves are shown for the functional annotation-, network-, and expression-based methods, the combination of the first two methods, and the combination of all three methods, respectively. The top 10 genes in the combined ranked list of the functional annotation- and network-based methods are also shown, with known asthma genes highlighted. The black bars denote the positions of the known asthma genes in the ranked list. As the prioritization becomes more accurate, the vertical bars tend to occur towards the top of the ranked list. For example, C-C chemokine receptor type 5 (CCR5) is a known asthma candidate and has been shown to be definitively involved in the development of OVA-induced allergic airway inflammation. In our prioritization results, CCR5 was ranked 9 by the functional annotation-based method and 27 by the network-based method. In the combined analysis, however, CCR5 was ranked 3 in the list of genes.
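As an illustration of how such a curve is built, the sketch below computes an unweighted running enrichment score in the style of GSEA [47]: the running sum steps up at each known asthma gene and down at every other gene, and the ES is the largest deviation from zero. The unweighted form is an assumption made for illustration; the weighting and the normalization to NES are those of the Methods section and are not reproduced here. Gene names in the toy example are placeholders.

```python
def running_enrichment(ranked_genes, known_genes):
    """Return (curve, es) for an unweighted GSEA-style running sum."""
    known = set(known_genes)
    n_hit = sum(1 for g in ranked_genes if g in known)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("need at least one known gene and one other gene")
    step_up, step_down = 1.0 / n_hit, 1.0 / n_miss
    score, curve = 0.0, []
    for gene in ranked_genes:
        # Known genes push the curve up; all other genes pull it down.
        score += step_up if gene in known else -step_down
        curve.append(score)
    es = max(curve, key=abs)  # largest deviation from zero
    return curve, es


ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]  # best-ranked gene first
curve, es = running_enrichment(ranked, {"g1", "g2"})
print(curve, es)
```

When the known genes sit at the top of the list, as in this toy case, the curve peaks early and the ES reaches its maximum of 1.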


Gene Symbol  Rankf  Rankn  Rank Product  Rankf+n
ICOSLG       163    1      12.76714533   1
IL2RG        13     15     13.96424004   2
CCR5         9      27     15.58845727   3
STAT1        3      82     15.68438714   4
JAK2         32     8      16            5
MYD88        8      40     17.88854382   6
STAT5A       14     23     17.94435844   7
CXCL10       2      189    19.4422221    8
IL23A        4      116    21.54065923   9
IL3          1      558    23.62202362   10

Figure 5-3: Profile of the running enrichment score and positions of known asthma genes on the rank-ordered list based on (a) functional annotation-based prioritization, (b) network-based prioritization, (c) microarray-based prioritization, (d) rank product of (a) and (b), and (e) rank product of (a), (b), and (c). The green curve denotes the cumulative enrichment score (ES). The black vertical bars denote the positions of the known asthma genes on the rank-ordered list.

5.3.4 Genome-wide prioritization of asthma-related candidate genes

As the cross-validation results indicated, the ranked list combined from the functional annotation- and network-based methods achieved the best overall performance. Therefore, the combined ranks were used to prioritize asthma-related candidate genes on a genome-wide scale. To do this, I first identified 8118 genes not known to be related to asthma that were available for both the functional annotation- and network-based prioritizations. The 252 known asthma genes were used as the training set and as “seeds” in the two methods, respectively. The combined ranks were derived from the Rank Products, and the significance level was estimated based on 1000 permutations (see Methods for details). Table 5-3 lists the top 37 genes with p-value < 0.001.

                        Functional
                Entrez  Annotation  Network  Rank
Rank  Symbol    ID      Rank        Rank     Product   P-value  Description
1     CCBP2     1238    4.5         8        6         <0.001   Chemokine-binding protein 2 (Chemokine-binding protein D6) (C-C chemokine receptor D6) (Chemokine receptor CCR-9) (Chemokine receptor CCR-10).
2     JAK2      3717    140         1        11.83216  <0.001   Tyrosine-protein kinase JAK2 (EC 2.7.10.2) (Janus kinase 2) (JAK-2).
3     STAT5A    6776    21.5        7        12.26784  <0.001   Signal transducer and activator of transcription 5A.
4     CD4       920     82.5        5        20.3101   <0.001   T-cell surface CD4 precursor (T-cell surface antigen T4/Leu-3).
5     ICOSLG    23308   41          12       22.18107  <0.001   ICOS ligand precursor (B7 homolog 2) (B7-H2) (B7-like protein Gl50) (B7-related protein 1) (B7RP-1) (CD275 antigen).
6     PIK3R1    5295    140         4        23.66432  <0.001   Phosphatidylinositol 3-kinase regulatory subunit alpha (PI3-kinase p85-subunit alpha) (PtdIns-3-kinase p85-alpha) (PI3K).
7     LTB       4050    1           591      24.31049  <0.001   Lymphotoxin-beta (LT-beta) (Tumor necrosis factor C) (TNF-C) (Tumor necrosis factor ligand superfamily member 3).
8     IL23A     51561   2.5         266      25.78759  <0.001   Interleukin-23 subunit alpha precursor (IL-23 subunit alpha) (Interleukin-23 subunit p19) (IL-23p19).
9     GRB2      2885    360         2        26.83282  <0.001   Growth factor receptor-bound protein 2 (Adapter protein GRB2) (SH2/SH3 adapter GRB2) (Protein Ash).
10    HLA-DPA1  3113    27.5        39       32.74905  <0.001   HLA class II histocompatibility antigen, DP alpha chain precursor (HLA-SB alpha chain) (MHC class II DP3-alpha) (DP(W3)) (DP(W4)).
11    CD3G      917     11          102      33.49627  <0.001   T-cell surface glycoprotein CD3 gamma chain precursor (T-cell receptor T3 gamma chain).
12    TP53      7157    41          30       35.07136  <0.001   Cellular tumor antigen p53 (Tumor suppressor p53) (Phosphoprotein p53) (Antigen NY-CO-13).
13    B2M       567     11          115      35.56684  <0.001   Beta-2-microglobulin precursor
14    IL2RG     3561    82.5        17       37.44997  <0.001   Cytokine receptor common gamma chain precursor (Gamma-C) (Interleukin-2 receptor gamma chain) (IL-2R gamma chain) (p64) (CD132 antigen).
15    PTGIR     5739    2.5         721      42.45586  <0.001   (Prostanoid IP receptor) (PGI receptor) (Prostaglandin I2 receptor).
16    HLA-F     3134    4.5         401      42.47941  <0.001   HLA class I histocompatibility antigen, alpha chain F precursor (HLA F antigen) (Leukocyte antigen F) (CDA12).
17    PTPN6     5777    41          46       43.4281   <0.001   Tyrosine-protein phosphatase non-receptor type 6 (EC 3.1.3.48) (Protein-tyrosine phosphatase 1C) (PTP-1C) (Hematopoietic cell protein-tyrosine phosphatase) (SH-PTP1) (Protein-tyrosine phosphatase SHP-1).
18    IL18R1    8809    11          185      45.11097  <0.001   Interleukin-18 receptor 1 precursor (IL1 receptor-related protein) (IL-1Rrp) (CD218a antigen) (CDw218a).
19    CD3D      915     11          249      52.33546  <0.001   T-cell surface glycoprotein CD3 delta chain precursor (T-cell receptor T3 delta chain).
20    IL6ST     3572    21.5        136      54.07402  <0.001   Interleukin-6 receptor subunit beta precursor (IL-6R-beta) (Interleukin-6 signal transducer) (Membrane glycoprotein 130) (gp130) (Oncostatin-M receptor alpha subunit) (CD130 antigen) (CDw130).
21    IL2RB     3560    41          76       55.82114  <0.001   Interleukin-2 receptor subunit beta precursor (IL-2 receptor) (P70-75) (p75) (High affinity IL-2 receptor subunit beta) (CD122 antigen).
22    JAK3      3640    21.5        149      56.59947  <0.001   Tyrosine-protein kinase JAK3 (EC 2.7.10.2) (Janus kinase 3) (JAK-3) (Leukocyte janus kinase) (L-JAK).
23    CD3E      916     11          410      67.15653  <0.001   T-cell surface glycoprotein CD3 epsilon chain precursor (T-cell surface antigen T3/Leu-4 epsilon chain).
24    RELA      5970    82.5        57       68.57478  <0.001   Transcription factor p65 (Nuclear factor NF-kappa-B p65 subunit).
25    TRAF2     7186    238.5       23       74.06416  <0.001   TNF receptor-associated factor 2 (Tumor necrosis factor type 2 receptor-associated protein 3).
26    PTPN11    5781    360         16       75.89466  <0.001   Tyrosine-protein phosphatase non-receptor type 11 (EC 3.1.3.48) (Protein-tyrosine phosphatase 2C) (PTP-2C) (PTP-1D) (SH-PTP3) (SH-PTP2) (SHP-2) (Shp2).
27    PTGDS     5730    238.5       25       77.21723  <0.001   Prostaglandin-H2 D-isomerase precursor (EC 5.3.99.2) (Lipocalin-type prostaglandin-D synthase) (Glutathione-independent PGD synthetase) (Prostaglandin-D2 synthase) (PGD2 synthase) (PGDS2) (PGDS) (Beta-trace protein) (Cerebrin-28).
28    CCL13     6357    41          165      82.24962  <0.001   Small inducible cytokine A13 precursor (CCL13) (Monocyte chemotactic protein 4) (MCP-4) (Monocyte chemoattractant protein 4) (CK-beta-10) (NCC-1)
29    LYN       4067    516.5       15       88.01988  <0.001   Tyrosine-protein kinase Lyn (EC 2.7.10.2).
30    CCL7      6354    82.5        117      98.24714  <0.001   Small inducible cytokine A7 precursor (CCL7) (Monocyte chemotactic protein 3) (MCP-3) (Monocyte chemoattractant protein 3) (NC28).
31    NFKB1     4790    140         80       105.8301  <0.001   Nuclear factor NF-kappa-B p105 subunit (DNA-binding factor KBF1) (EBP-1)
32    BLR1      643     11          1405     124.3181  <0.001   C-X-C chemokine receptor type 5 (CXC-R5) (CXCR-5) (Burkitt lymphoma receptor 1) (Monocyte-derived receptor 15) (MDR-15) (CD185 antigen).
33    EP300     2033    360         45       127.2792  <0.001   Histone acetyltransferase p300 (EC 2.3.1.48) (E1A-associated protein p300).
34    LRP1      4035    516.5       32       128.5613  <0.001   Low-density lipoprotein receptor-related protein 1 precursor (LRP) (Alpha-2-macroglobulin receptor) (A2MR) (Apolipoprotein E receptor) (APOER) (CD91 antigen).
35    MAG       4099    21.5        918      140.4884  <0.001   Myelin-associated glycoprotein precursor (Siglec-4a).
36    TYK2      7297    360         78       167.5709  <0.001   Non-receptor tyrosine-protein kinase TYK2 (EC 2.7.10.2).
37    TAPBP     6892    238.5       227      232.679   <0.001   tapasin isoform 1 precursor

Table 5-3: Top 37 genes in the combined ranked list of asthma candidate genes
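The permutation procedure behind the P-value column is specified in the Methods section. As a rough sketch of one common scheme (an assumption, not necessarily the exact procedure used here), each method's ranks are shuffled independently across genes, and a gene's p-value is the fraction of permutations in which its permuted rank product is at most the observed one. The function name, the seed, and the toy ranks are illustrative.

```python
import random


def permutation_p_values(ranks_a, ranks_b, n_perm=1000, seed=0):
    """Per-gene permutation p-values for the rank product of two rankings."""
    rng = random.Random(seed)
    n = len(ranks_a)
    # Comparing products is equivalent to comparing geometric means,
    # because the square root is monotone; it also avoids float rounding.
    observed = [a * b for a, b in zip(ranks_a, ranks_b)]
    counts = [0] * n
    perm_a, perm_b = list(ranks_a), list(ranks_b)
    for _ in range(n_perm):
        rng.shuffle(perm_a)
        rng.shuffle(perm_b)
        for i in range(n):
            if perm_a[i] * perm_b[i] <= observed[i]:
                counts[i] += 1
    return [c / n_perm for c in counts]


# Toy data: 50 genes ranked identically by both methods.
toy_a = list(range(1, 51))
toy_b = list(range(1, 51))
p_values = permutation_p_values(toy_a, toy_b)
print(p_values[0], p_values[-1])
```

Under this scheme the resolution of a p-value is 1/n_perm, so with 1000 permutations a gene whose observed rank product is never matched would be reported as p < 0.001.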

Annotation of prioritized genes based on asthma-specific functional annotation and phenotypes

Although I used two different approaches to rank asthma candidate genes and generated a rank product integrating the rankings from both approaches, I am still left with an unmanageably large number of putative candidate genes. Thus, there is a need for further ranking or pruning of this ranked gene list. Other annotations (listed below) can be used to rank the prioritized target genes. Some of the asthma-specific features that can be considered for such ranking are:

i) association with the GO term “immune response”

ii) association with “immune system phenotype”: genes whose mouse knock-out has an immune system phenotype

iii) known targets of NF-Kappa B (the promoters of 252 known asthma-associated genes were enriched for putative NF-Kappa B binding sites)

iv) predicted targets of NF-Kappa B

v) interact directly with a known asthma gene

vi) differentially expressed in mouse models of asthma or other asthma-related expression data sets

vii) genes located in regions classified as asthma QTL

The rationale for annotations (i) and (ii) is that asthma is an immune-related disease [122], and hence any gene directly or indirectly associated with the disease has a higher probability of already having these annotations. NF-Kappa B is a major regulator of immune system genes and is also known to be associated with asthma as a potential regulator of asthma-related genes [129, 130]. Under the hypothesis that targets of an asthma-related immune response transcription factor have a higher probability of being involved in asthma than in any other disease, I annotated whether the prioritized genes are known or predicted targets of NF-Kappa B (annotations iii and iv). Known targets of NF-Kappa B were downloaded from the curated data sets of MSigDB [47]. Based on interaction data obtained from various sources, including the HPRD [78], BIND [131], and BioGRID [113] databases, I annotated the novel genes that interact with any known asthma gene (annotation v). The rationale here is that genes interacting with known asthma genes might also be associated with asthma. Finally, I used expression profiling data from NCBI GEO (Gene Expression Omnibus) [132] of lung tissue obtained from IL-13 knockout mice treated with dust-mite allergen (see Methods). I annotated the prioritized genes (annotation vi) if they were upregulated or downregulated in the expression profile. Based on these annotations, I calculated a separate overall score for each prioritized gene. Each annotation category was assigned a binary score of 0 or 1 depending on whether the gene has that annotation, and the overall score for the gene is the sum of the seven binary scores obtained from the annotation categories. In principle, the more closely a gene is related to asthma, the higher its score should be. Table 5-4 displays the top 40 significant genes in the prioritization result with all the annotations and the overall annotation score.
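The scoring rule can be sketched as follows. The field names are illustrative; the seven flags follow the binary columns of Table 5-4, where the tabulated overall score equals their sum (for example, 5 for CD3G and 6 for LTB).

```python
# Binary annotation columns as they appear in Table 5-4 (names are illustrative).
FLAG_COLUMNS = (
    "go_term",
    "phenotype",
    "known_nfkb_target",
    "predicted_nfkb_target",
    "interacts_with_known_asthma_gene",
    "up_regulated",
    "down_regulated",
)


def annotation_score(flags):
    """Sum of the binary annotation flags; a missing flag counts as 0."""
    return sum(1 for column in FLAG_COLUMNS if flags.get(column, 0))


# Flag values for two rows of Table 5-4, in FLAG_COLUMNS order.
cd3g = dict(zip(FLAG_COLUMNS, (1, 1, 1, 0, 1, 1, 0)))
ltb = dict(zip(FLAG_COLUMNS, (1, 1, 1, 1, 1, 1, 0)))
print(annotation_score(cd3g), annotation_score(ltb))
```

Because every flag carries the same weight, the score orders genes by how many independent lines of evidence they have, not by the strength of any single one.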

A sub-network of the known asthma-related genes (Figure 5-4) was created. This sub-network includes the genes that connect to at least two known asthma genes. As can be seen, two significant candidate genes, JAK2 and EP300, have the largest number of interactions with known asthma genes.

Figure 5-4: Subnetwork of asthma-related genes. Red nodes are the known asthma genes. The remaining nodes represent the genes connecting to at least two of the known asthma genes. Dark and light blue nodes represent the significant (p-value < 0.05) candidate genes in QTL and non-QTL regions, respectively. Other genes are colored in grey. The size of each node is determined by the number of interactions with the known asthma genes. For example, JAK2 and EP300, marked in the figure, interact with 10 and 9 asthma genes, respectively.
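The selection rule for this sub-network can be sketched as a simple neighbour count. The interaction pairs and gene names below are placeholders; in the chapter the interactions come from the HPRD, BIND, and BioGRID databases, and the function name is illustrative.

```python
from collections import defaultdict


def asthma_subnetwork(interactions, known_genes, min_links=2):
    """Map each non-seed gene to its number of distinct known-gene partners,
    keeping only genes linked to at least `min_links` known genes."""
    known = set(known_genes)
    partners = defaultdict(set)
    for a, b in interactions:
        if a in known and b not in known:
            partners[b].add(a)
        elif b in known and a not in known:
            partners[a].add(b)
        # Pairs with both or neither gene known do not add a candidate link.
    return {g: len(p) for g, p in partners.items() if len(p) >= min_links}


# Placeholder data: A, B, C stand for known asthma genes; X, Y for candidates.
pairs = [("A", "X"), ("B", "X"), ("A", "Y"), ("X", "Y"), ("C", "X"), ("A", "B")]
node_sizes = asthma_subnetwork(pairs, {"A", "B", "C"})
print(node_sizes)
```

The returned counts correspond to the node sizes in the figure: a candidate with more known-gene partners is drawn larger.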


                                                                                                                            Interacts
                                                                                                       Known     Predicted  With Known                          Overall
          ToppGene  Network  Rank              QTL                                    GO               NFKB      NFKB       Asthma      Up-         Down-       Anno
Symbol    Rank      Rank     Product  P-value  Count  QTLs                            Term  Phenotype  targets?  Targets?   Genes       regulated?  regulated?  Score
CD3G      11        102      33.50    0                                               1     1          1         0          1           1           0           5
LTB       1         591      24.31    0        2      AASTH22_H, AASTH34_H            1     1          1         1          1           1           0           6
CTSS      360       71       159.87   0.002                                           1     1          0         0          1           1           0           4
IL7R      82.5      203      129.41   0.003                                           1     1          0         0          1           1           0           4
CXCL13    140       361      224.81   0.003                                           1     1          0         0          1           1           0           4
CXCL3     238.5     457.5    330.32   0.006                                           1     0          0         0          1           1           0           3
CTSB      360       390      374.70   0.013    1      AASTH23_H                       0     0          0         0          1           1           0           2
CD48      82.5      2340     439.37   0.017                                           0     1          1         0          0           1           0           3
THBS1     1619      120      440.77   0.018                                           0     1          0         0          1           1           0           3
IGF1      516.5     325      409.71   0.019    3      AASTH7_H, AASTH20_H, AASTH6_H   0     0          0         0          1           1           0           2
LCN2      238.5     1224     540.30   0.024                                           0     1          0         1          1           1           0           4
TYROBP    238.5     1469     591.91   0.027    2      AASTH51_H, AASTH24_H            0     1          0         0          0           1           0           2
SLA       360       1172     649.55   0.029                                           0     1          0         0          0           1           0           2
ITGAX     680.5     779      728.09   0.032                                           1     0          0         0          1           1           0           3
CSF2RA    680.5     483      573.31   0.037                                           0     0          0         0          1           1           0           2
EGR2      82.5      4802     629.42   0.04                                            0     0          0         1          0           1           0           2
CTSK      1220.5    383      683.70   0.044                                           0     1          0         0          1           1           0           3
COTL1     2774.5    200      744.92   0.049                                           0     0          0         0          1           1           0           2
SPIB      360       1776     799.60   0.05     2      AASTH51_H, AASTH24_H            1     1          0         1          0           1           0           4
TEK       238.5     1181     530.72   0.037    1      AASTH47_H                       0     0          0         0          0           0           1           1
ETS2      1619      332      733.15   0.051    1      AASTH53_H                       0     0          0         0          1           0           1           2
JAK2      140       1        11.83    0                                               1     0          0         0          1           0           0           2
STAT5A    21.5      7        12.27    0                                               1     1          1         0          1           0           0           4
CD4       82.5      5        20.31    0                                               1     1          0         0          1           0           0           3
PIK3R1    140       4        23.66    0                                               1     1          0         0          1           0           0           3
GRB2      360       2        26.83    0                                               0     1          0         0          1           0           0           2
B2M       11        115      35.57    0                                               1     1          1         0          1           0           0           4
IL2RG     82.5      17       37.45    0                                               1     1          0         0          1           0           0           3
PTPN6     41        46       43.43    0                                               0     1          0         0          1           0           0           2
IL18R1    11        185      45.11    0                                               1     1          0         1          1           0           0           4
CD3D      11        249      52.34    0                                               1     1          0         0          1           0           0           3
IL6ST     21.5      136      54.07    0                                               1     1          0         1          1           0           0           4
JAK3      21.5      149      56.6     0                                               1     1          0         1          1           0           0           4
CD3E      11        410      67.16    0                                               1     1          0         0          1           0           0           3
TRAF2     238.5     23       74.06    0                                               1     1          1         0          1           0           0           4
PTGDS     238.5     25       77.22    0                                               0     0          1         0          1           0           0           2
LYN       516.5     15       88.02    0                                               1     1          0         0          1           0           0           3
CCL7      82.5      117      98.25    0                                               1     1          1         0          1           0           0           4
NFKB1     140       80       105.83   0                                               1     1          1         0          1           0           0           4
EP300     360       45       127.28   0                                               0     1          0         1          1           0           0           3

Table 5-4: Top 40 significant novel candidate genes sorted by up-regulation, down-regulation and p-values. Highlighted are the two genes, JAK2 and EP300, that have the largest number of interactions with known asthma-related genes.

5.4 Conclusion

In this chapter, three bioinformatics methods were used to prioritize asthma-related candidate genes. They were based on gene functional annotations, protein-protein interaction networks, and differential gene expression in asthma mouse models. Two-thirds cross-validation was used for the first two methods, and the NES obtained from GSEA was used to quantitatively measure the relative performance of the methods. Our results, consistent with the observations in Chapters 2 and 3, showed that the functional annotation-based method performed better than the network-based method, and both performed significantly better than the expression-based method. The best performance was achieved by combining the ranked lists from the first two methods using Rank Product. The genes available in the entire genome were prioritized by the combined rankings, and novel candidate genes were identified.

Bibliography

1. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C et al: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32(Database issue):D258-261. 2. Botstein D, Risch N: Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet 2003, 33 Suppl:228-237. 3. Glazier AM, Nadeau JH, Aitman TJ: Finding genes that underlie complex traits. Science 2002, 298(5602):2345-2349. 4. Lander ES, Schork NJ: Genetic dissection of complex traits. Science 1994, 265(5181):2037-2048. 5. Risch NJ: Searching for genetic determinants in the new millennium. Nature 2000, 405(6788):847-856. 6. Hirschhorn JN, Daly MJ: Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 2005, 6(2):95-108. 7. Tabor HK, Risch NJ, Myers RM: Candidate-gene approaches for studying complex genetic traits: practical considerations. Nat Rev Genet 2002, 3(5):391-397. 8. Zhu M, Zhao S: Candidate gene identification approach: progress and challenges. Int J Biol Sci 2007, 3(7):420-427. 9. Brunner HG, van Driel MA: From syndrome families to functional genomics. Nat Rev Genet 2004, 5(7):545-551. 10. Di Pietro SM, Dell'Angelica EC: The cell biology of Hermansky-Pudlak syndrome: recent advances. Traffic 2005, 6(7):525-533. 11. Mace G, Bogliolo M, Guervilly JH, Dugas du Villard JA, Rosselli F: 3R coordination by Fanconi anemia proteins. Biochimie 2005, 87(7):647-658. 12. Gandhi TK, Zhong J, Mathivanan S, Karthick L, Chandrika KN, Mohan SS, Sharma S, Pinkert S, Nagaraju S, Periaswamy B et al: Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat Genet 2006, 38(3):285-293. 13. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E et al: A protein interaction map of Drosophila melanogaster. Science 2003, 302(5651):1727-1736. 14. 
Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 2001, 98(8):4569-4574. 15. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T et al: A map of the interactome network of the metazoan C. elegans. Science 2004, 303(5657):540-543. 16. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P et al: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403(6770):623-627. 17. Oti M, Snel B, Huynen MA, Brunner HG: Predicting disease genes using protein-protein interactions. J Med Genet 2006, 43(8):691-698. 18. Huynen MA, Snel B, van Noort V: Comparative genomics for reliable protein-function prediction from genomic data. Trends Genet 2004, 20(8):340-344. 19. Oti M, Brunner HG: The modular nature of genetic diseases. Clin Genet 2007, 71(1):1-11. 20. Bortoluzzi S, Romualdi C, Bisognin A, Danieli GA: Disease genes and intracellular protein networks. Physiol Genomics 2003, 15(3):223-227.

87 21. Smith NG, Eyre-Walker A: Human disease genes: patterns and predictions. Gene 2003, 318:169-175. 22. Huang H, Winter EE, Wang H, Weinstock KG, Xing H, Goodstadt L, Stenson PD, Cooper DN, Smith D, Alba MM et al: Evolutionary conservation and selection of human disease gene orthologs in the rat and mouse genomes. Genome biology 2004, 5(7):R47. 23. Lopez-Bigas N, Ouzounis CA: Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic acids research 2004, 32(10):3108-3114. 24. Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS: Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformatics 2005, 6:55. 25. Perez-Iratxeta C, Bork P, Andrade MA: Association of genes to genetically inherited diseases using data mining. Nature genetics 2002, 31(3):316-319. 26. Perez-Iratxeta C, Wjst M, Bork P, Andrade MA: G2D: a tool for mining genes associated with disease. BMC genetics 2005, 6:45. 27. Hristovski D, Peterlin B, Mitchell JA, Humphrey SM: Using literature-based discovery to identify disease candidate genes. International journal of medical informatics 2005, 74(2-4):289-298. 28. Tiffin N, Kelso JF, Powell AR, Pan H, Bajic VB, Hide WA: Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic acids research 2005, 33(5):1544-1552. 29. van Driel MA, Cuelenaere K, Kemmeren PP, Leunissen JA, Brunner HG: A new web-based data mining tool for the identification of candidate genes for human genetic disorders. Eur J Hum Genet 2003, 11(1):57-63. 30. van Driel MA, Cuelenaere K, Kemmeren PP, Leunissen JA, Brunner HG, Vriend G: GeneSeeker: extraction and integration of human disease-related information from web-based genetic databases. Nucleic acids research 2005, 33(Web Server issue):W758-761. 31. Masseroli M, Galati O, Pinciroli F: GFINDer: genetic disease and phenotype location statistical analysis and mining of dynamically annotated gene lists. 
Nucleic acids research 2005, 33(Web Server issue):W717-723. 32. Masseroli M, Martucci D, Pinciroli F: GFINDer: Genome Function INtegrated Discoverer through dynamic annotation, statistical analysis, and mining. Nucleic acids research 2004, 32(Web Server issue):W293-300. 33. Rossi S, Masotti D, Nardini C, Bonora E, Romeo G, Macii E, Benini L, Volinia S: TOM: a web-based integrated approach for identification of candidate disease genes. Nucleic acids research 2006, 34(Web Server issue):W285-292. 34. Freudenberg J, Propping P: A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics (Oxford, England) 2002, 18 Suppl 2:S110-115. 35. van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA: A text-mining analysis of the human phenome. Eur J Hum Genet 2006, 14(5):535-542. 36. Turner FS, Clutterbuck DR, Semple CA: POCUS: mining genomic sequence annotation to predict disease genes. Genome Biol 2003, 4(11):R75. 37. Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS: SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics 2006, 22(6):773-774. 38. Franke L, Bakel H, Fokkens L, de Jong ED, Egmont-Petersen M, Wijmenga C: Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. American journal of human genetics 2006, 78(6):1011-1025.

88 39. Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent LC, De Moor B, Marynen P, Hassan B et al: Gene prioritization through genomic data fusion. Nat Biotechnol 2006, 24(5):537-544. 40. Chen J, Xu H, Aronow BJ, Jegga AG: Improved human disease candidate gene prioritization using mouse phenotype. BMC Bioinformatics 2007, 8(1):392. 41. Chen J, Xu H, Aronow BJ, Jegga AG: Improved human disease candidate gene prioritization using mouse phenotype. BMC bioinformatics 2007, 8:392. 42. King MC, Wilson AC: Evolution at two levels in humans and . Science 1975, 188(4184):107-116. 43. Korstanje R, Paigen B: From QTL to gene: the harvest begins. Nat Genet 2002, 31(3):235-236. 44. Mackay TF: Quantitative trait loci in Drosophila. Nat Rev Genet 2001, 2(1):11-20. 45. Giallourakis C, Henson C, Reich M, Xie X, Mootha VK: Disease gene discovery through integrative genomics. Annu Rev Genomics Hum Genet 2005, 6:381-406. 46. Dennis G, Jr., Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003, 4(5):P3. 47. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES et al: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(43):15545-15550. 48. Al-Shahrour F, Diaz-Uriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 2004, 20(4):578-580. 49. Tiffin N, Adie E, Turner F, Brunner HG, van Driel MA, Oti M, Lopez-Bigas N, Ouzounis C, Perez-Iratxeta C, Andrade-Navarro MA et al: Computational disease gene identification: a concert of methods prioritizes type 2 diabetes and obesity candidate genes. Nucleic Acids Res 2006, 34(10):3067-3081. 50. 
Oti M, Brunner H: The modular nature of genetic diseases. Clin Genet 2007, 71(1):1-11. 51. Mootha VK, Lepage P, Miller K, Bunkenborg J, Reich M, Hjerrild M, Delmonte T, Villeneuve A, Sladek R, Xu F et al: Identification of a gene causing human cytochrome c oxidase deficiency by integrative genomics. Proc Natl Acad Sci U S A 2003, 100(2):605-610. 52. Clarke AR: Murine genetic models of human disease. Curr Opin Genet Dev 1994, 4(3):453-460. 53. Nagaoka I, Matsui K, Ueyama T, Kanemoto M, Wu J, Shimizu A, Matsuzaki M, Horie M: Novel mutation of plakophilin-2 associated with arrhythmogenic right ventricular cardiomyopathy. Circ J 2006, 70(7):933-935. 54. Kostetskii I, Li J, Xiong Y, Zhou R, Ferrari VA, Patel VV, Molkentin JD, Radice GL: Induced deletion of the N-cadherin gene in the heart leads to dissolution of the intercalated disc structure. Circ Res 2005, 96(3):346-354. 55. Smith CL, Goldsmith CA, Eppig JT: The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biol 2005, 6(1):R7. 56. Matsui S, Larsson L, Hayase M, Katsuda S, Teraoka K, Kurihara T, Murano H, Nishikawa K, Fu M: Specific removal of beta1-adrenoceptor autoantibodies by immunoabsorption in rabbits with autoimmune cardiomyopathy improved cardiac structure and function. J Mol Cell Cardiol 2006, 41(1):78-85. 57. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids

89 Res 2006, 34(Database issue):D354-357. 58. Kannankeril PJ, Mitchell BM, Goonasekera SA, Chelu MG, Zhang W, Sood S, Kearney DL, Danila CI, De Biasi M, Wehrens XH et al: Mice with the R176Q cardiac ryanodine receptor mutation exhibit catecholamine-induced ventricular tachycardia and cardiomyopathy. Proc Natl Acad Sci U S A 2006, 103(32):12179-12184. 59. Karp PD, Ouzounis CA, Moore-Kochlacs C, Goldovsky L, Kaipa P, Ahren D, Tsoka S, Darzentas N, Kunin V, Lopez-Bigas N: Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res 2005, 33(19):6083-6089. 60. Joshi-Tope G, Gillespie M, Vastrik I, D'Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L et al: Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005, 33(Database issue):D428-432. 61. Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR: GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet 2002, 31(1):19-20. 62. Jona I, Nanasi PP: Cardiomyopathies and sudden cardiac death caused by RyR2 mutations: are the channels the beginning and the end? Cardiovasc Res 2006, 71(3):416-418. 63. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Buillard V, Cerutti L, Copley R et al: New developments in the InterPro database. Nucleic Acids Res 2007, 35(Database issue):D224-228. 64. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R et al: Pfam: clans, web tools and services. Nucleic Acids Res 2006, 34(Database issue):D247-251. 65. Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, Bork P: SMART 5: domains in the context of genomes and networks. Nucleic Acids Res 2006, 34(Database issue):D257-260. 66. Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, Langendijk-Genevaux PS, Pagni M, Sigrist CJ: The PROSITE database. Nucleic Acids Res 2006, 34(Database issue):D227-230. 67. 
Yeats C, Maibaum M, Marsden R, Dibley M, Lee D, Addou S, Orengo CA: Gene3D: modelling protein structure, function and evolution. Nucleic Acids Res 2006, 34(Database issue):D281-284. 68. Bru C, Courcelle E, Carrere S, Beausse Y, Dalmar S, Kahn D: The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res 2005, 33(Database issue):D212-215. 69. Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shivakumar K, Anuradha N, Reddy R, Raghavan TM et al: Human protein reference database--2006 update. Nucleic Acids Res 2006, 34(Database issue):D411-414. 70. Bader GD, Betel D, Hogue CW: BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res 2003, 31(1):248-250. 71. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006, 34(Database issue):D535-539. 72. Zimmermann HJ, Zysno, P.: Latent connectives in human decision making. Fuzzy Sets and Systems 1980, 4:37-51. 73. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G et al: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 2004, 101(16):6062-6067. 74. Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H et al: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res 2005, 33(20):e175.

75. Popescu M, Keller JM, Mitchell JA: Fuzzy Measures on the Gene Ontology for Gene Product Similarity. IEEE/ACM Trans Comput Biol Bioinformatics 2006, 3(3):263-274.
76. Open source JAVA package FtpBean [http://www.geocities.com/SiliconValley/Code/9129]
77. BioJava Package [http://biojava.org]
78. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M et al: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 2003, 13(10):2363-2371.
79. GOLEM [http://function.princeton.edu/GOLEM/download.html]
80. Colt [http://dsd.lbl.gov/~hoschek/colt]
81. Jakarta Commons-Math [http://jakarta.apache.org/commons/math/]
82. Kabaeva ZT, Perrot A, Wolter B, Dietz R, Cardim N, Correia JM, Schulte HD, Aldashev AA, Mirrakhimov MM, Osterziel KJ: Systematic analysis of the regulatory and essential myosin light chain genes: genetic variants and mutations in hypertrophic cardiomyopathy. Eur J Hum Genet 2002, 10(11):741-748.
83. Khatri P, Bhavsar P, Bawa G, Draghici S: Onto-Tools: an ensemble of web-accessible, ontology-based tools for the functional design and interpretation of high-throughput gene expression experiments. Nucleic Acids Res 2004, 32(Web Server issue):W449-456.
84. van Bokhoven H, Celli J, Kayserili H, van Beusekom E, Balci S, Brussel W, Skovby F, Kerr B, Percin EF, Akarsu N et al: Mutation of the gene encoding the ROR2 tyrosine kinase causes autosomal recessive Robinow syndrome. Nat Genet 2000, 25(4):423-426.
85. Gorgels TG, Hu X, Scheffer GL, van der Wal AC, Toonstra J, de Jong PT, van Kuppevelt TH, Levelt CN, de Wolf A, Loves WJ et al: Disruption of Abcc6 in the mouse: novel insight in the pathogenesis of pseudoxanthoma elasticum. Hum Mol Genet 2005, 14(13):1763-1773.
86. Bamford RN, Roessler E, Burdine RD, Saplakoglu U, dela Cruz J, Splitt M, Goodship JA, Towbin J, Bowers P, Ferrero GB et al: Loss-of-function mutations in the EGF-CFC gene CFC1 are associated with human left-right laterality defects. Nat Genet 2000, 26(3):365-369.
87. Yan YT, Gritsman K, Ding J, Burdine RD, Corrales JD, Price SM, Talbot WS, Schier AF, Shen MM: Conserved requirement for EGF-CFC genes in vertebrate left-right axis formation. Genes Dev 1999, 13(19):2527-2537.
88. Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N et al: Towards a proteome-scale map of the human protein-protein interaction network. Nature 2005, 437(7062):1173-1178.
89. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S et al: A human protein-protein interaction network: a resource for annotating the proteome. Cell 2005, 122(6):957-968.
90. Sharan R, Ideker T: Modeling cellular machinery through biological network comparison. Nat Biotechnol 2006, 24(4):427-433.
91. Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 2005, 21 Suppl 1:i302-310.
92. Lubovac Z, Gamalielsson J, Olsson B: Combining functional and topological properties to identify core modules in protein interaction networks. Proteins 2006, 64(4):948-959.
93. Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M: A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 2003, 302(5644):449-453.
94. Wong SL, Zhang LV, Tong AH, Li Z, Goldberg DS, King OD, Lesage G, Vidal M, Andrews B, Bussey H et al: Combining biological networks to predict genetic interactions. Proc Natl Acad Sci U S A 2004, 101(44):15682-15687.
95. Sam L, Liu Y, Li J, Friedman C, Lussier YA: Discovery of protein interaction networks shared by diseases. Pac Symp Biocomput 2007:76-87.
96. Goehler H, Lalowski M, Stelzl U, Waelter S, Stroedicke M, Worm U, Droege A, Lindenberg KS, Knoblich M, Haenig C et al: A protein interaction network links GIT1, an enhancer of huntingtin aggregation, to Huntington's disease. Mol Cell 2004, 15(6):853-865.
97. Ruffner H, Bauer A, Bouwmeester T: Human protein-protein interaction networks and the value for drug discovery. Drug Discov Today 2007, 12(17-18):709-716.
98. Neduva V, Linding R, Su-Angrand I, Stark A, de Masi F, Gibson TJ, Lewis J, Serrano L, Russell RB: Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol 2005, 3(12):e405.
99. Barabasi AL, Albert R: Emergence of scaling in random networks. Science 1999, 286(5439):509-512.
100. Berg J, Lassig M, Wagner A: Structure and evolution of protein interaction networks: a statistical model for link dynamics and gene duplications. BMC Evol Biol 2004, 4(1):51.
101. Eisenberg E, Levanon EY: Preferential attachment in the protein network evolution. Phys Rev Lett 2003, 91(13):138701.
102. Rzhetsky A, Gomez SM: Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 2001, 17(10):988-996.
103. Wagner A, Fell DA: The small world inside large metabolic networks. Proc Biol Sci 2001, 268(1478):1803-1810.
104. Xu J, Li Y: Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics 2006, 22(22):2800-2805.
105. Berger SI, Posner JM, Ma'ayan A: Genes2Networks: connecting lists of gene symbols using mammalian protein interactions databases. BMC Bioinformatics 2007, 8:372.
106. Freeman LC: Centrality in social networks conceptual clarification. Social Networks 1978, 1(3):215-239.
107. Sabidussi G: The centrality index of a graph. Psychometrika 1966, 31(4):581-603.
108. Freeman LC: A Set of Measures of Centrality Based on Betweenness. Sociometry 1977, 40(1):35-41.
109. Kleinberg JM: Authoritative sources in a hyperlinked environment. J ACM 1999, 46(5):604-632.
110. Page L, Brin S, Motwani R, Winograd T: The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford University; 1998.
111. Junker BH, Koschutzki D, Schreiber F: Exploration of biological network centralities with CentiBiN. BMC Bioinformatics 2006, 7:219.
112. Kohler S, Bauer S, Horn D, Robinson PN: Walking the interactome for prioritization of candidate disease genes. Am J Hum Genet 2008, 82(4):949-958.
113. Breitkreutz BJ, Stark C, Reguly T, Boucher L, Breitkreutz A, Livstone M, Oughtred R, Lackner DH, Bahler J, Wood V et al: The BioGRID Interaction Database: 2008 update. Nucleic Acids Res 2008, 36(Database issue):D637-640.
114. Peri S, Navarro JD, Kristiansen TZ, Amanchy R, Surendranath V, Muthusamy B, Gandhi TK, Chandrika KN, Deshpande N, Suresh S et al: Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res 2004, 32(Database issue):D497-501.
115. MGI Mouse Genome Informatics [http://www.informatics.jax.org/]
116. White S, Smyth P: Algorithms for estimating relative importance in networks. In: KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press; 2003:266-275.
117. Assenov Y, Ramirez F, Schelhorn SE, Lengauer T, Albrecht M: Computing topological parameters of biological networks. Bioinformatics 2008, 24(2):282-284.
118. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13(11):2498-2504.
119. JUNG [http://jung.sourceforge.net/]
120. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005, 33(Database issue):D514-517.
121. Becker KG, Barnes KC, Bright TJ, Wang SA: The genetic association database. Nat Genet 2004, 36(5):431-432.
122. Wills-Karp M: Immunologic basis of antigen-induced airway hyperresponsiveness. Annu Rev Immunol 1999, 17:255-281.
123. Laitinen T, Rasanen M, Kaprio J, Koskenvuo M, Laitinen LA: Importance of genetic factors in adolescent asthma: a population-based twin-family study. Am J Respir Crit Care Med 1998, 157(4 Pt 1):1073-1078.
124. Wills-Karp M, Santeliz J, Karp CL: The germless theory of allergic disease: revisiting the hygiene hypothesis. Nat Rev Immunol 2001, 1(1):69-75.
125. Howard TD, Meyers DA, Bleecker ER: Mapping susceptibility genes for asthma and allergy. J Allergy Clin Immunol 2000, 105(2 Pt 2):S477-481.
126. Tu Z, Wang L, Xu M, Zhou X, Chen T, Sun F: Further understanding human disease genes by comparing with housekeeping genes and other genes. BMC Genomics 2006, 7:31.
127. Breitling R, Armengaud P, Amtmann A, Herzyk P: Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett 2004, 573(1-3):83-92.
128. Shiina T, Inoko H, Kulski JK: An update of the HLA genomic region, locus information and disease associations: 2004. Tissue Antigens 2004, 64(6):631-649.
129. Min YD, Choi CH, Bark H, Son HY, Park HH, Lee S, Park JW, Park EK, Shin HI, Kim SH: Quercetin inhibits expression of inflammatory cytokines through attenuation of NF-kappaB and p38 MAPK in HMC-1 human mast cell line. Inflamm Res 2007, 56(5):210-215.
130. Zhou LF, Zhang MS, Hu AH, Zhu Z, Yin KS: Selective blockade of NF-kappaB by novel mutated IkappaBalpha suppresses CD3/CD28-induced activation of memory CD4(+) T cells in asthma. Allergy 2007.
131. Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E et al: The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res 2005, 33(Database issue):D418-424.
132. Edgar R, Barrett T: NCBI GEO standards and services for microarray data. Nat Biotechnol 2006, 24(12):1471-1472.


Appendix A: 252 known asthma-related human genes compiled from the literature

Entrez Gene ID  Symbol     Gene Name
596             BCL2       B-cell CLL/lymphoma 2
6356            CCL11      chemokine (C-C motif) ligand 11
6347            CCL2       chemokine (C-C motif) ligand 2
6352            CCL5       chemokine (C-C motif) ligand 5
1234            CCR5       chemokine (C-C motif) receptor 5
1437            CSF2       colony stimulating factor 2 (granulocyte-macrophage)
3627            CXCL10     chemokine (C-X-C motif) ligand 10
3586            IL10       interleukin 10
3596            IL13       interleukin 13
3600            IL15       interleukin 15
3552            IL1A       interleukin 1, alpha
3553            IL1B       interleukin 1, beta
3557            IL1RN      interleukin 1 receptor antagonist
3558            IL2        interleukin 2
3569            IL6        interleukin 6 (interferon, beta 2)
3578            IL9        interleukin 9
3659            IRF1       interferon regulatory factor 1
3660            IRF2       interferon regulatory factor 2
4049            LTA        lymphotoxin alpha (TNF superfamily, member 1)
64127           NOD2       nucleotide-binding oligomerization domain containing 2
6890            TAP1       transporter 1, ATP-binding cassette, sub-family B (MDR/TAP)
7124            TNF        tumor necrosis factor (TNF superfamily, member 2)
1493            CTLA4      cytotoxic T-lymphocyte-associated protein 4
2113            ETS1       v-ets erythroblastosis virus E26 oncogene homolog 1 (avian)
100             ADA        adenosine deaminase
718             C3         complement component 3
719             C3AR1      complement component 3a receptor 1
727             C5         complement component 5
728             C5AR1      complement component 5a receptor 1
6369            CCL24      chemokine (C-C motif) ligand 24
1231            CCR2       chemokine (C-C motif) receptor 2
929             CD14       CD14 molecule
940             CD28       CD28 molecule
942             CD86       CD86 molecule
6376            CX3CL1     chemokine (C-X3-C motif) ligand 1
7852            CXCR4      chemokine (C-X-C motif) receptor 4

57105           CYSLTR2    cysteinyl leukotriene receptor 2
2205            FCER1A     Fc fragment of IgE, high affinity I, receptor for; alpha polypeptide
2209            FCGR1A     Fc fragment of IgG, high affinity Ia, receptor (CD64)
50943           FOXP3      forkhead box P3
11251           GPR44      G protein-coupled receptor 44
3117            HLA-DQA1   major histocompatibility complex, class II, DQ alpha 1
3119            HLA-DQB1   major histocompatibility complex, class II, DQ beta 1
29851           ICOS       inducible T-cell co-stimulator
3458            IFNG       interferon, gamma
3592            IL12A      interleukin 12A (natural killer cell stimulatory factor 1, cytotoxic lymphocyte maturation factor 1, p35)
3593            IL12B      interleukin 12B (natural killer cell stimulatory factor 2, cytotoxic lymphocyte maturation factor 2, p40)
3603            IL16       interleukin 16 (lymphocyte chemoattractant factor)
3606            IL18       interleukin 18 (interferon-gamma-inducing factor)
3554            IL1R1      interleukin 1 receptor, type I
9173            IL1RL1     interleukin 1 receptor-like 1
3562            IL3        interleukin 3 (colony-stimulating factor, multiple)
3565            IL4        interleukin 4
3566            IL4R       interleukin 4 receptor
3567            IL5        interleukin 5 (colony-stimulating factor, eosinophil)
4153            MBL2       mannose-binding lectin (protein C) 2, soluble (opsonic defect)
2206            MS4A2      membrane-spanning 4-domains, subfamily A, member 2 (Fc fragment of IgE, high affinity I, receptor for; beta polypeptide)
6556            SLC11A1    solute carrier family 11 (proton-coupled divalent metal ion transporters), member 1
6778            STAT6      signal transducer and activator of transcription 6, interleukin-4 induced
30009           TBX21      T-box 21
7040            TGFB1      transforming growth factor, beta 1
7097            TLR2       toll-like receptor 2
7099            TLR4       toll-like receptor 4
10333           TLR6       toll-like receptor 6
54106           TLR9       toll-like receptor 9
8795            TNFRSF10B  tumor necrosis factor receptor superfamily, member 10b
8764            TNFRSF14   tumor necrosis factor receptor superfamily, member 14 (herpesvirus entry mediator)
7292            TNFSF4     tumor necrosis factor (ligand) superfamily, member 4 (tax-transcriptionally activated glycoprotein 1, 34kDa)
598             BCL2L1     BCL2-like 1
969             CD69       CD69 molecule
3383            ICAM1      intercellular adhesion molecule 1 (CD54), human rhinovirus receptor
3958            LGALS3     lectin, galactoside-binding, soluble, 3
4318            MMP9       matrix metallopeptidase 9 (gelatinase B, 92kDa gelatinase, 92kDa type IV collagenase)
4843            NOS2A      nitric oxide synthase 2A (inducible, hepatocytes)
5328            PLAU       plasminogen activator, urokinase
5743            PTGS2      prostaglandin-endoperoxide synthase 2 (prostaglandin G/H synthase and cyclooxygenase)
6403            SELP       selectin P (granule membrane protein 140kDa, antigen CD62)
3576            IL8        interleukin 8
1636            ACE        angiotensin I converting enzyme (peptidyl-dipeptidase A) 1
57379           AICDA      activation-induced cytidine deaminase
240             ALOX5      arachidonate 5-lipoxygenase
241             ALOX5AP    arachidonate 5-lipoxygenase-activating protein
313             AOAH       acyloxyacyl hydrolase (neutrophil)
374             AREG       amphiregulin (schwannoma-derived growth factor)
624             BDKRB2     bradykinin receptor B2
820             CAMP       cathelicidin antimicrobial peptide
1232            CCR3       chemokine (C-C motif) receptor 3
1080            CFTR       cystic fibrosis transmembrane conductance regulator (ATP-binding cassette sub-family C, member 7)
1215            CMA1       chymase 1, mast cell
1524            CX3CR1     chemokine (C-X3-C motif) receptor 1
2833            CXCR3      chemokine (C-X-C motif) receptor 3
10800           CYSLTR1    cysteinyl leukotriene receptor 1
1910            EDNRB      endothelin receptor type B
1991            ELA2       elastase 2, neutrophil
2150            F2RL1      coagulation factor II (thrombin) receptor-like 1
2323            FLT3LG     fms-related tyrosine kinase 3 ligand
2524            FUT2       fucosyltransferase 2 (secretor status included)
2625            GATA3      GATA binding protein 3
2778            GNAS       GNAS complex locus
27202           GPR77      G protein-coupled receptor 77
3454            IFNAR1     interferon (alpha, beta and omega) receptor 1
3459            IFNGR1     interferon gamma receptor 1
3460            IFNGR2     interferon gamma receptor 2 (interferon gamma transducer 1)
3594            IL12RB1    interleukin 12 receptor, beta 1
3595            IL12RB2    interleukin 12 receptor, beta 2
3568            IL5RA      interleukin 5 receptor, alpha
3579            IL8RB      interleukin 8 receptor, beta
3623            INHA       inhibin, alpha
3676            ITGA4      integrin, alpha 4 (antigen CD49D, alpha 4 subunit of VLA-4 receptor)
3694            ITGB6      integrin, beta 6
3727            JUND       jun D proto-oncogene
3952            LEP        leptin (obesity homolog, mouse)
4048            LTA4H      leukotriene A4 hydrolase

4056            LTC4S      leukotriene C4 synthase
4282            MIF        macrophage migration inhibitory factor (glycosylation-inhibiting factor)
4353            MPO        myeloperoxidase
10392           NOD1       nucleotide-binding oligomerization domain containing 1
2908            NR3C1      nuclear receptor subfamily 3, group C, member 1 (glucocorticoid receptor)
27306           PGDS       prostaglandin D2 synthase, hematopoietic
5729            PTGDR      prostaglandin D2 receptor (DP)
864             RUNX3      runt-related transcription factor 3
6404            SELPLG     selectin P ligand
5054            SERPINE1   serpin peptidase inhibitor, clade E (nexin, plasminogen activator inhibitor type 1), member 1
4087            SMAD2      SMAD family member 2
6774            STAT3      signal transducer and activator of transcription 3 (acute-phase response factor)
6775            STAT4      signal transducer and activator of transcription 4
6915            TBXA2R     thromboxane A2 receptor
7076            TIMP1      TIMP metallopeptidase inhibitor 1
7132            TNFRSF1A   tumor necrosis factor receptor superfamily, member 1A
7133            TNFRSF1B   tumor necrosis factor receptor superfamily, member 1B
6964            TRD@       T cell receptor delta locus
7421            VDR        vitamin D (1,25- dihydroxyvitamin D3) receptor
720             C4A        complement component 4A (Rodgers blood group)
6361            CCL17      chemokine (C-C motif) ligand 17
6367            CCL22      chemokine (C-C motif) ligand 22
10344           CCL26      chemokine (C-C motif) ligand 26
10803           CCR9       chemokine (C-C motif) receptor 9
27159           CHIA       chitinase, acidic
1394            CRHR1      corticotropin releasing hormone receptor 1
6387            CXCL12     chemokine (C-X-C motif) ligand 12 (stromal cell-derived factor 1)
1672            DEFB1      defensin, beta 1
2204            FCAR       Fc fragment of IgA, receptor for
3105            HLA-A      major histocompatibility complex, class I, A
3106            HLA-B      major histocompatibility complex, class I, B
3107            HLA-C      major histocompatibility complex, class I, C
3115            HLA-DPB1   major histocompatibility complex, class II, DP beta 1
3123            HLA-DRB1   major histocompatibility complex, class II, DR beta 1
3135            HLA-G      major histocompatibility complex, class I, G
3274            HRH2       histamine receptor H2
8518            IKBKAP     inhibitor of kappa light polypeptide gene enhancer in B-cells, kinase complex-associated protein
29949           IL19       interleukin 19
6318            SERPINB4   serpin peptidase inhibitor, clade B (ovalbumin), member 4
11005           SPINK5     serine peptidase inhibitor, Kazal type 5

81793           TLR10      toll-like receptor 10
8743            TNFSF10    tumor necrosis factor (ligand) superfamily, member 10
183             AGT        angiotensinogen (serpin peptidase inhibitor, clade A, member 8)
2950            GSTP1      glutathione S-transferase pi
4583            MUC2       mucin 2, oligomeric mucus/gel-forming
5265            SERPINA1   serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 1
3371            TNC        tenascin C (hexabrachion)
7412            VCAM1      vascular cell adhesion molecule 1
404744          AAA1       asthma-associated alternatively spliced gene 1
28              ABO        ABO blood group (transferase A, alpha 1-3-N-acetylgalactosaminyltransferase; transferase B, alpha 1-3-galactosyltransferase)
52              ACP1       acid phosphatase 1, soluble
80332           ADAM33     ADAM metallopeptidase domain 33
154             ADRB2      adrenergic, beta-2-, receptor, surface
217             ALDH2      aldehyde dehydrogenase 2 family (mitochondrial)
302             ANXA2      annexin A2
383             ARG1       arginase, liver
384             ARG2       arginase, type II
627             BDNF       brain-derived neurotrophic factor
847             CAT        catalase
1071            CETP       cholesteryl ester transfer protein, plasma
1128            CHRM1      cholinergic receptor, muscarinic 1
1131            CHRM3      cholinergic receptor, muscarinic 3
1179            CLCA1      chloride channel, calcium activated, family member 1
4513            COX2       cytochrome c oxidase II
26999           CYFIP2     cytoplasmic FMR1 interacting protein 2
1565            CYP2D6     cytochrome P450, family 2, subfamily D, polypeptide 6
4051            CYP4F3     cytochrome P450, family 4, subfamily F, polypeptide 3
7818            DAP3       death associated protein 3
57628           DPP10      dipeptidyl-peptidase 10
1906            EDN1       endothelin 1
1909            EDNRA      endothelin receptor type A
1942            EFNA1      ephrin-A1
1958            EGR1       early growth response 1
2001            ELF5       E74-like factor 5 (ets domain transcription factor)
2053            EPHX2      epoxide hydrolase 2, cytoplasmic
2332            FMR1       fragile X mental retardation 1
2694            GIF        gastric intrinsic factor (vitamin B synthesis)
2944            GSTM1      glutathione S-transferase M1
2952            GSTT1      glutathione S-transferase theta 1
26762           HAVCR1     hepatitis A virus cellular receptor 1

84868           HAVCR2     hepatitis A virus cellular receptor 2
3176            HNMT       histamine N-methyltransferase
3269            HRH1       histamine receptor H1
3303            HSPA1A     heat shock 70kDa protein 1A
3304            HSPA1B     heat shock 70kDa protein 1B
3315            HSPB1      heat shock 27kDa protein 1
3356            HTR2A      5-hydroxytryptamine (serotonin) receptor 2A
3416            IDE        insulin-degrading enzyme
3439            IFNA1      interferon, alpha 1
3478            IGES       immunoglobulin E concentration, serum
3597            IL13RA1    interleukin 13 receptor, alpha 1
3577            IL8RA      interleukin 8 receptor, alpha
3581            IL9R       interleukin 9 receptor
3690            ITGB3      integrin, beta 3 (platelet glycoprotein IIIa, antigen CD61)
3790            KCNS3      potassium voltage-gated channel, delayed-rectifier, subfamily S, member 3
5650            KLK7       kallikrein-related peptidase 7
3911            LAMA5      laminin, alpha 5
932             MS4A3      membrane-spanning 4-domains, subfamily A, member 3 (hematopoietic cell-specific)
4524            MTHFR      5,10-methylenetetrahydrofolate reductase (NADPH)
4582            MUC1       mucin 1, cell surface associated
4585            MUC4       mucin 4, cell surface associated
4586            MUC5AC     mucin 5AC, oligomeric mucus/gel-forming
727897          MUC5B      mucin 5B, oligomeric mucus/gel-forming
4589            MUC7       mucin 7, secreted
9               NAT1       N-acetyltransferase 1 (arylamine N-acetyltransferase)
10              NAT2       N-acetyltransferase 2 (arylamine N-acetyltransferase)
4795            NFKBIL1    nuclear factor of kappa light polypeptide gene enhancer in B-cells inhibitor-like 1
4803            NGFB       nerve growth factor, beta polypeptide
4842            NOS1       nitric oxide synthase 1 (neuronal)
4846            NOS3       nitric oxide synthase 3 (endothelial cell)
387129          NPSR1      neuropeptide S receptor 1
4908            NTF3       neurotrophin 3
4909            NTF5       neurotrophin 5 (neurotrophin 4/5)
5048            PAFAH1B1   platelet-activating factor acetylhydrolase, isoform Ib, alpha subunit 45kDa
5154            PDGFA      platelet-derived growth factor alpha polypeptide
51131           PHF11      PHD finger protein 11
7941            PLA2G7     phospholipase A2, group VII (platelet-activating factor acetylhydrolase, plasma)
5732            PTGER2     prostaglandin E receptor 2 (subtype EP2), 53kDa
115727          RASGRP4    RAS guanyl releasing protein 4
6036            RNASE2     ribonuclease, RNase A family, 2 (liver, eosinophil-derived neurotoxin)

6037            RNASE3     ribonuclease, RNase A family, 3 (eosinophil cationic protein)
6094            ROM1       retinal outer segment membrane protein 1
7356            SCGB1A1    secretoglobin, family 1A, member 1 (uteroglobin)
117156          SCGB3A2    secretoglobin, family 3A, member 2
26135           SERBP1     SERPINE1 mRNA binding protein 1
12              SERPINA3   serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 3
6317            SERPINB3   serpin peptidase inhibitor, clade B (ovalbumin), member 3
6647            SOD1       superoxide dismutase 1, soluble (amyotrophic lateral sclerosis 1 (adult))
6863            TAC1       tachykinin, precursor 1 (substance K, substance P, neurokinin 1, neurokinin 2, neuromedin L, neurokinin alpha, neuropeptide K, neuropeptide gamma)
7046            TGFBR1     transforming growth factor, beta receptor I (activin A receptor type II-like kinase, 53kDa)
8914            TIMELESS   timeless homolog (Drosophila)
8797            TNFRSF10A  tumor necrosis factor receptor superfamily, member 10a
8793            TNFRSF10D  tumor necrosis factor receptor superfamily, member 10d, decoy with truncated death domain
7177            TPSAB1     tryptase alpha/beta 1
7178            TPT1       tumor protein, translationally-controlled 1
6955            TRA@       T cell receptor alpha locus
85480           TSLP       thymic stromal lymphopoietin
7422            VEGFA      vascular endothelial growth factor A
7498            XDH        xanthine dehydrogenase

Appendix B: Web application

The methods described in this thesis were implemented as a web application that is publicly accessible at http://toppgene.cchmc.org. The application was developed in Java, using external libraries from BioJava [77], GOLEM [79], Colt [80] and Jakarta Commons-Math [81].

The following is a brief user guide to the web tool.

ToppGene homepage

This screenshot shows the ToppGene homepage, which lists three tools: gene list enrichment analysis, candidate gene prioritization using functional annotations, and relative importance of candidate genes in networks. Clicking a link opens the input page of the corresponding tool.

Gene list enrichment analysis

1) Type or paste the gene list (symbols or Entrez IDs) in the input page. Clicking the example gene sets link enters an example gene list: seven genes essential to cardiovascular development.

2) Click the “Submit Query” button. The parameter page appears, where the user specifies the multiple-testing correction method (Bonferroni, FDR or none) and the significance cutoff (0.1, 0.05, 0.025 or 0.001).

3) Click the “Submit” button. The output page appears, listing the input information and the over-represented gene annotations.

4) Click “Show Detail” in the Input Parameter box and then click the number next to “Number of genes in training set”. A pop-up lists the genes in the query in detail.

5) Click “Show All” in the Training Result box. The over-represented terms are displayed for all 12 categories. The result shows that the example genes are all transcription factors and are important to heart development.

6) Clicking the “Display Chart” button of an annotation category shows the chart associated with that category.
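The correction options in step 2 can be illustrated with a minimal Python sketch. It is not the ToppGene implementation (which is written in Java): the function names are hypothetical, and "FDR" is assumed here to mean the Benjamini-Hochberg procedure, which the appendix does not state.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (assumed meaning of the "FDR" option)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_terms(term_p_values, method="bonferroni", cutoff=0.05):
    """Return {term: adjusted p-value} for terms passing the cutoff.

    term_p_values is a hypothetical input: annotation term -> raw enrichment p-value.
    """
    terms = list(term_p_values)
    raw = [term_p_values[t] for t in terms]
    if method == "bonferroni":
        adjusted = bonferroni(raw)
    elif method == "fdr":
        adjusted = benjamini_hochberg(raw)
    elif method == "none":
        adjusted = raw
    else:
        raise ValueError("method must be 'bonferroni', 'fdr' or 'none'")
    return {t: a for t, a in zip(terms, adjusted) if a <= cutoff}
```

With the same raw p-values, the Bonferroni option is the most conservative of the three and "none" the least, which is why the choice of method changes how many annotation terms appear on the output page.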

Prioritization of candidate genes based on functional annotations

1) As in gene list enrichment analysis, enter or paste the genes in the input page. In this case there are two gene lists, one for training and the other for testing (the query). Clicking the example gene sets link enters example training and test gene lists.

2) Click the “Submit Query” button. The parameter page appears, where the user specifies the training parameters (multiple-testing correction method and significance cutoff) as well as the test parameters (random sampling size (0, 1000, 1500 or 2000) and minimum feature count (1, 2 or 3)). The random sampling size is the size of the random sample used to estimate the significance level of the test genes. Increasing the size gives a more robust estimate, but the computation time goes up significantly. The minimum feature count is the number of features (annotation categories) a test gene must have in order to be considered. Test genes with fewer features are ignored and assigned a score of 0.

3) Click the “Start prioritization” button to start the prioritization process. Depending on the query gene sets and parameters, this process can take up to thousands of seconds. When it finishes, the output page is shown. In addition to the input parameters and training result, the test result is displayed in the “Test Result” box.

4) Each row in the result table is the result for one test gene. It consists of the score and p-value from each annotation category. The last two columns show the average score and the overall p-value from meta-analysis. By default the rows are sorted in ascending order of overall p-value. The result indicates that EP300 is the best candidate, the test gene most similar to the training set overall.
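The roles of the random sampling size, the minimum feature count and the overall p-value can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the tool's code: the function names are hypothetical, the empirical p-value uses add-one smoothing, and Fisher's method is assumed as the meta-analysis rule because the appendix does not name the rule used.

```python
import math


def empirical_p_value(score, random_scores):
    """Estimate a p-value as the fraction of random-sample scores >= the test score.

    A larger random sample gives a finer, more stable estimate at higher cost.
    Add-one smoothing (an assumption) keeps the estimate above zero.
    """
    hits = sum(1 for s in random_scores if s >= score)
    return (hits + 1) / (len(random_scores) + 1)


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method (assumed combination rule).

    The statistic -2 * sum(ln p) is chi-square with 2k degrees of freedom; for even
    degrees of freedom the upper tail is exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return min(1.0, math.exp(-half_x) * total)


def prioritize(test_gene_p_values, min_feature_count=1):
    """Rank test genes by overall p-value, ascending.

    test_gene_p_values is a hypothetical input: gene -> {category: p-value},
    holding only the categories in which the gene is annotated.
    """
    overall = {}
    for gene, per_category in test_gene_p_values.items():
        if len(per_category) < min_feature_count:
            continue  # too few annotated categories: the gene is ignored
        overall[gene] = fisher_combined_p(list(per_category.values()))
    return sorted(overall.items(), key=lambda item: item[1])
```

Raising `min_feature_count` drops sparsely annotated test genes before ranking, which mirrors the behavior described in step 2.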

Prioritization of candidate genes based on protein-protein interaction networks

1) As in prioritization based on functional annotations, enter or paste the training and test gene lists in the input page. Clicking the example gene sets link enters example training and test gene lists.

2) In the parameter page, specify the prioritization method (K-Step Markov, HITS with Priors or PageRank with Priors) and the associated parameter (step size for K-Step Markov, or bias (back probability) for HITS with Priors and PageRank with Priors). The user can also set the size of the subnetwork used for visualization by choosing a neighborhood distance from 0 to 3. The neighborhood distance is the length of the shortest path from a gene to one of the training genes. A large value such as 2 or 3 can result in a huge network containing thousands of proteins.

3) Click the “Start prioritization” button. The output page appears, with three boxes: training parameters, training gene subnetwork and test gene result.

4) Each row in the result table is the result for one test gene. The rows are sorted in descending order of test gene score. Consistent with prioritization based on functional annotations, EP300 is the best candidate, with the highest prioritization score.

5) Click the training gene subnetwork link (HTML, PNG or SVG format) to open a network image page. Blue nodes are the training genes and green nodes are the test genes. The remaining nodes are colored grey.
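To make the parameters in step 2 concrete, the sketch below shows one way PageRank with Priors and the neighborhood distance could be computed on a small undirected network. It is a simplified Python illustration, not the tool's own code: the function names, the adjacency-list input format and the fixed iteration count are assumptions, and every neighbour is assumed to appear as a key of the adjacency dictionary.

```python
from collections import deque


def pagerank_with_priors(adjacency, training_genes, back_probability=0.3, iterations=100):
    """Random walk that returns to the training genes with probability back_probability."""
    nodes = list(adjacency)
    seeds = [g for g in training_genes if g in adjacency]
    if not seeds:
        raise ValueError("no training gene is present in the network")
    prior = {n: 0.0 for n in nodes}
    for g in seeds:
        prior[g] = 1.0 / len(seeds)
    score = dict(prior)
    for _ in range(iterations):
        # Mass sitting on isolated nodes is returned to the priors.
        dangling = sum(score[n] for n in nodes if not adjacency[n])
        jump = back_probability + (1.0 - back_probability) * dangling
        new_score = {n: jump * prior[n] for n in nodes}
        for n in nodes:
            neighbours = adjacency[n]
            if neighbours:
                share = (1.0 - back_probability) * score[n] / len(neighbours)
                for m in neighbours:
                    new_score[m] += share
        score = new_score
    return score


def rank_test_genes(scores, test_genes):
    """Order the test genes found in the network by descending score."""
    found = [g for g in test_genes if g in scores]
    return sorted(found, key=lambda g: scores[g], reverse=True)


def neighborhood(adjacency, training_genes, max_distance):
    """Breadth-first search: node -> shortest distance to any training gene, up to max_distance."""
    distance = {g: 0 for g in training_genes if g in adjacency}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] == max_distance:
            continue
        for m in adjacency[node]:
            if m not in distance:
                distance[m] = distance[node] + 1
                queue.append(m)
    return distance
```

A higher back probability keeps the walk closer to the training genes, so scores fall off faster with network distance. In `neighborhood`, each extra step of distance can pull in all interaction partners of every node already included, which is why values of 2 or 3 can produce very large subnetworks.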
