DEVELOPMENT OF COMPUTATIONAL APPROACHES FOR MEDICAL IMAGE RETRIEVAL, DISEASE GENE PREDICTION, AND DRUG DISCOVERY

by

YANG CHEN

Submitted in partial fulfillment of the requirements

For the degree of Doctor of Philosophy

Dissertation Advisors: Dr. Rong Xu, Dr. Guo-qiang Zhang

Department of Electrical Engineering and Computer Science

CASE WESTERN RESERVE UNIVERSITY

August 2015

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of Yang Chen

candidate for the degree of Doctor of Philosophy∗

Committee Chair Guo-qiang Zhang

Committee Member Rong Xu

Committee Member Jing Li

Committee Member M. Cenk Cavusoglu

Committee Member Xiang Zhang

Date of Defense June 23, 2015

*We also certify that written approval has been obtained

for any proprietary material contained therein.

Contents

1 Introduction
1.1 Domain knowledge guided strategy for developing computational approaches for biomedical applications
1.2 Retrieving medically-relevant web images
1.3 Detecting novel genetic basis for human diseases
1.4 Predicting novel drug treatments based on disease genetics
1.5 Contribution and organization of the dissertation

2 Ontology guided approach to retrieving medically-relevant web images: application on retrieving disease manifestation images
2.1 Motivation
2.2 Data and methods
2.2.1 Discovering target body parts
2.2.2 Detecting target body parts
2.2.3 Combining detections for disease image classification
2.3 Results
2.3.1 Single-organ disease classification
2.3.2 Multiple-organ disease classification
2.4 Discussion
2.5 Conclusions

3 Analyzing cross-species genetic networks to predict disease-associated genes: application on Plasmodium falciparum malaria
3.1 Motivation
3.2 Data and methods
3.2.1 Construct cross-species gene network
3.2.2 Predict candidate genes for malaria
3.2.3 Evaluate the validity in predicting malaria genes
3.2.4 Evaluate the ranks of druggable genes
3.2.5 Extract and analyze malaria-specific pathways based on gene ranking
3.3 Results
3.3.1 Network-based approach allows the prioritization of known malaria genes from both human and parasite genomes
3.3.2 Network-based approach prioritizes novel malaria genes other than the seeds
3.3.3 Prioritized genes are enriched by druggable genes
3.3.4 Pathway analysis shows functions of prioritized genes
3.4 Discussion
3.5 Conclusions

4 Combining multiple human phenotype networks to predict disease-associated genes: application on Crohn’s disease
4.1 Motivation
4.2 Data and methods
4.2.1 Construct DMN using disease-manifestation associations in UMLS
4.2.2 Compare phenotypic relationships in DMN with genetic disease associations
4.2.3 Compare DMN with the widely-used disease phenotype network mimMiner
4.2.4 Integrate networks
4.2.5 Predict disease genes from the integrated network
4.2.6 Evaluate gene prediction in cross validation analyses
4.2.7 Evaluate gene prediction for different disease classes
4.2.8 Investigate translational potential in drug discovery of the predicted genes for Crohn’s disease
4.3 Results
4.3.1 DMN network properties
4.3.2 DMN partially correlates with the genetic disease networks
4.3.3 DMN contains knowledge different from mimMiner
4.3.4 Integrating DMN with mimMiner significantly improves the performance of disease gene predictions
4.3.5 Our method achieves high but varying performance for different disease classes
4.3.6 Our gene prediction method has the potential to guide the drug discovery for Crohn’s disease
4.4 Discussion
4.5 Conclusions

5 Studying disease comorbidity network to detect genetic evidences for disease links: application on colorectal cancer and obesity
5.1 Motivation
5.2 Data and methods
5.2.1 Construct disease comorbidity network
5.2.2 Prioritize the diseases that have strong associations with both obesity and CRC
5.2.3 Identify gene overlaps through meta-analysis
5.3 Results
5.3.1 Local disease comorbidity network models the connection between obesity and CRC
5.3.2 Osteoporosis shows high comorbidity associations with both CRC and obesity
5.3.3 Innovative genes shared among osteoporosis, obesity and CRC are detected using gene expression meta-analysis
5.4 Discussion
5.5 Conclusions

6 Combining human disease genetics and mouse model phenotypes towards drug repositioning: application on Parkinson’s disease
6.1 Motivation
6.2 Data and methods
6.2.1 Identify mouse model phenotypes for PD using disease genetics in OMIM
6.2.2 Prioritize candidate PD drugs based on the similarities of mouse phenotype profiles between disease and drugs
6.2.3 De novo evaluation in prioritizing FDA-approved PD drugs
6.2.4 Evaluation in ranking novel PD drugs and comparison with an existing drug repositioning approach
6.2.5 Test the top-ranked drugs using gene expression data analysis
6.3 Results
6.3.1 Our disease genetics-based phenotype prioritization algorithm identified PD-specific mouse model phenotypes
6.3.2 Our approach prioritized FDA-approved PD drugs
6.3.3 Our approach outperformed an existing approach in prioritizing novel PD drugs
6.3.4 Gene expression analysis suggests quetiapine as a potential PD drug
6.4 Discussion
6.5 Conclusions

7 Conclusions and future work
7.1 Conclusions
7.2 Future work
7.2.1 Disease image retrieval
7.2.2 Disease gene prediction
7.2.3 Drug repositioning

Appendices

List of Tables

2.1 Performance Comparison on Ten Eye Disease Image Test Sets.
2.2 Performance Comparison on Ten Ear Disease Image Test Sets.
2.3 Performance Comparison on Ten Mouth/Lip Disease Image Test Sets.
2.4 Performance Comparison on Ten Mouth/Lip Disease Image Test Sets.
3.1 Result of the leave-one-out cross validation for human genes. We left out one malaria gene from the seed list each time, and determined the rank of this excluded gene using our method. We showed the rank and percentage among all human genes.
3.2 Top 10 parasite genes in the leave-one-out cross validation.
3.3 Rank of other malaria-associated genes from literature.
3.4 Pathways prioritized over 50% in rank.
4.1 Global properties of DMN and the other disease networks, including HDNs (genetic disease networks) and mimMiner (widely-used phenotype network) based on OMIM text mining. The last three columns represent average shortest path, average clustering coefficient, and connected components, respectively.
4.2 Compare the edge overlaps N between DMN and the genetic disease networks. Network B′ represents the randomized graph that preserves the properties of Network B. Column N(A, B′) represents the average number of edge overlaps between network A and the randomized networks.
4.3 Compare the community structures between DMN and the genetic disease networks. S_A→B and S_B→A represent the two-way similarity in community partitions between networks A and B.
4.4 Compare DMN with mimMiner in nodes, edges and community structures.
4.5 Ratios of successful disease-gene association predictions in the leave-one-out cross validation experiment. All diseases were included in the experiment.
4.6 Success ratio of disease-gene association predictions for all diseases and monogenetic diseases in the nine disease classes.
4.7 Drug candidates for Crohn’s disease that are supported by literature.
5.1 Top five disease nodes in the local network that contains all paths from obesity to colorectal cancer. The diseases were ranked by degree and betweenness, respectively.
5.2 Common genes shared by obesity, colorectal cancer and osteoporosis, and plausible evidence supporting their relationships with the three diseases.
6.1 The top-ranked categories of mouse phenotypes extracted using PD genes in OMIM.
6.2 Common significantly differential genes for PD and quetiapine as well as their directions of regulation and fold change.

List of Figures

1.1 Domain knowledge guided strategy in designing approaches for biomedical applications. The strategy is applied in three contexts: (1) retrieving medically-related web images, (2) detecting genetic basis for human diseases, and (3) repositioning drug treatments.
2.1 (A) Overview of the disease image retrieval approach. (B) The structure of organ detectors.
2.2 (A) Trend for decreasing precision in finer scales. (B) Trend for increasing recall in finer scales.
3.1 The methods contain two parts: gene prediction based on the cross-species genetic networks, and result analysis covering the method validity in predicting malaria genes, the distribution of druggable genes among the rank, and the pathways associated with the top-ranked genes.
3.2 The count of drug target genes among every 500 genes in our rank from the top to the bottom.

4.1 The three steps of network analysis for DMN.
4.2 Integrating the knowledge in DMN, mimMiner and the genetic network.
4.3 Robustness of DMN with respect to the removal of random nodes and hub nodes.
4.4 Randomly selected subgraphs of (a) DMN (b) mimMiner (c) OMIM-based HDN and (d) GWAS-based HDN. Only part of the node labels are shown in the figure due to space limits. In contrast to DMN and mimMiner, the sub-graphs in HDNs are less connected and less cliquish.
4.5 Correlation between manifestation similarities and genetic associations. Left: Correlation between proportion of genetically associated disease pairs (x-axis) and the phenotype similarity ranks (y-axis) in DMN. Right: Correlation between the average numbers of genes shared by disease pairs (x-axis) and the phenotype ranks (y-axis) in DMN. Diseases with larger phenotype similarity in DMN tend to have stronger genetic association.
4.6 The ROC curves and AUCs for our method (red) and the baseline method (blue) in the leave-one-out cross validation analysis.
4.7 Average AUCs of de novo gene prediction for our approach (red) and the baseline approach (green). The comparisons are on overall AUCs, as well as the AUCs when the numbers of false positive genes are up to 10, 50, 100, 300, 500, and 1000.
4.8 The ROC curves for each disease class in de novo gene prediction. The comparisons include the top part of ROC curves and AUC scores based on the top 100 genes in each validation run.
4.9 The ROC curves for each disease class in de novo gene prediction.
4.10 A1-A2: Compare our gene rank with the Crohn’s disease genes from GWAS. B1-B2: Compare our gene rank with the drug target genes. C1-C2: Compare our drug rank with the FDA-approved drugs.

5.1 Approach to detect the diseases that have strong connections with both obesity and CRC in the comorbidity network. Nodes D1, D2 and D3 were prioritized because they play important roles in maintaining the network structure and the connection.
5.2 The approach contains three steps: (1) construct a comorbidity network based on data mining; (2) extract the local network that contains paths from obesity to CRC, and analyze the local network to pinpoint the strong comorbidity for both obesity and CRC; (3) conduct gene expression meta-analysis to identify common genes shared among obesity, CRC and the comorbidity.
5.3 (a) Age distribution of the patients in the adverse event reports. (b) Gender distribution. (c) Distribution of disease semantic types: T047, Disease or Syndrome; T020, Acquired Abnormality; T046, Pathologic Function; T184, Sign or Symptom; T033, Finding; T190, Anatomical Abnormality; T191, Neoplastic Process; T048, Mental or Behavioral Dysfunction; T049, Cell or Molecular Dysfunction; T019, Congenital Abnormality; T037, Injury or Poisoning.
5.4 Automatic pipeline to pre-process the patient-disease data in adverse event reports and mine comorbidity patterns.
5.5 The local network that contains all paths from obesity to colorectal cancer in the comorbidity network.
5.6 The paths from obesity to colorectal cancer that pass through osteoporosis.

6.1 Drug discovery approach for Parkinson’s disease combining human disease genetics and mouse mutation phenotypes.
6.2 Comparison with genetics-based drug discovery methods, which directly match the disease genes and their interacting genes with the drug target genes, to demonstrate the importance of using mouse phenotypes.
6.3 Our approach ranked the approved PD drugs near the top. A total of 10 among 22 approved PD drugs were ranked within the top 10% of all 1197 drugs.
6.4 The drug target genes that are most frequently targeted by our top 10% drugs. (a) The top 10 drug target genes for our prioritized drugs. (b) The distribution of target genes for approved PD drugs among all the drug target genes.
6.5 The distribution of our ranks for two sets of novel PD drugs extracted from clinical trials and Medline texts.
6.6 The distribution of evaluation sets based on clinical trials and Medline texts among the ranks generated by the baseline approach based on mouse phenotypes.
6.7 Precision-recall curves in ranking the novel PD drugs for our approach and Hoehndorf’s approach based on PhenomeNet.

Acknowledgements

This dissertation would have been impossible without the support of my advisors. I would like to thank my research advisor Dr. Rong Xu for her constant support and guidance. Dr. Xu led me into the field of translational biomedical research and guided me in every project. She shared innovative ideas with me, and contributed a lot of time to make my Ph.D. experience productive and exciting. Everything would be different without her.

I would like to thank my advisor and dissertation committee chair Dr. Guo-qiang Zhang, who offered me the opportunity to join CCI, where I met great people and started my research. I appreciate all his insightful discussions and constructive suggestions to improve my dissertation and the overall research.

My sincere thanks also go to the members of my committee, Dr. Jing Li, Dr. Xiang Zhang and Dr. M. Cenk Cavusoglu, for their invaluable feedback, scientific suggestions and insightful discussions, which helped me improve this dissertation. I would like to give special thanks to Dr. Xiang Zhang, Dr. Xiaofeng Ren and Dr. Li Li, who generously offered advice and shared research experiences with me during my Ph.D. study.

I would like to thank all present and past members of CCI and all my friends in the EECS department for their love and friendship through all these years during my Ph.D. I would like to thank my family: my husband Zhuofu Bai, and my parents, for their unconditional love and support.

List of Abbreviations

• OMIM: Online Mendelian Inheritance in Man
• GWAS: Genome-wide association study
• UMLS: Unified Medical Language System
• CBIR: Content-based image retrieval
• SIFT: Scale-invariant feature transform
• HOG: Histograms of oriented gradients
• SVM: Support vector machine
• HPRD: Human Protein Reference Database
• PPI: Protein-protein interaction
• HDN: Human disease network
• CRC: Colorectal cancer
• PD: Parkinson’s disease
• DMN: Disease manifestation network
• IMPC: International Mouse Phenotyping Consortium
• FDA: Food and Drug Administration

Development of Computational Approaches for Medical Image Retrieval, Disease Gene Prediction, and Drug Discovery

Abstract

by

Yang Chen

With the deluge of biomedical data, developing computational approaches for data analysis and interrogation has become a key step in translational biomedical research. It is critical to leverage existing data to ask the right question and design algorithms for specific biomedical applications. This dissertation proposes a domain knowledge guided strategy for data gathering, data fusion and algorithm design in solving specific biomedical problems. This strategy is demonstrated in three distinct application contexts.

The first application is retrieving disease manifestation images from the web to support patients’ self-education and decision making. The challenge is threefold: heterogeneous irrelevant web images need to be filtered; the positive examples of disease images contain diverse objects and complex backgrounds; and the large manual effort needed to generate training data is unaffordable. Based on the key observation that detecting disease-affected abnormal organs may greatly reduce the manual labeling effort, our approach extracts the disease-organ semantic relationships from ontologies to guide organ detection with pre-trained detectors. Compared with a standard supervised method, our approach improved the average precision by 5% while reducing the manual effort by 85%.

The second application is developing disease-specific models to detect genetic basis for human diseases. We first develop a cross-species genetic network analysis approach to study the host-pathogen interactions in parasitic infectious diseases and predict disease-associated genes. This approach was applied to malaria and proved useful in guiding anti-malaria drug discovery. In the second, phenome-driven approach, we explore a new disease phenotype data source in medical ontologies to construct the Disease Manifestation Network (DMN), and integrate multiple phenotype networks with genetic networks to predict genes. An application of this approach to Crohn’s disease demonstrated the translational potential of the predicted genes in drug discovery. The third approach identifies the mutual comorbidity of colorectal cancer and obesity in the comorbidity network to detect genetic basis for the link between the two diseases.

The last application is drug repositioning based on combining disease genetics and mouse phenotype data. Disease-associated genes have the potential to guide drug discovery. On the other hand, mouse phenotypes provide knowledge on gene functions that cannot be obtained in humans. Our approach first identifies disease-specific mouse phenotypes using well-studied disease genes, and then searches all FDA-approved drugs for the candidates that share similar mouse phenotype profiles with the disease. The approach was applied to predict drugs for Parkinson’s disease, and achieved significant improvements compared with a state-of-the-art approach based on mouse phenotype data.

In summary, this dissertation demonstrates the effectiveness of computational algorithms in translational biomedical research. It also demonstrates that computation-based work has great potential in elucidating disease genetic basis, finding innovative drugs, and improving patient health education.

Chapter 1

Introduction

1.1 Domain knowledge guided strategy for developing computational approaches for biomedical applications

Biomedicine and healthcare have become data intensive fields [51]. Researchers have generated and shared access to vast amounts of genetic, genomic, and phenomic data. With the increase in the amount and heterogeneity of biomedical data, developing approaches for data integration, analysis, and interrogation has become a key step to fulfill the translational needs of understanding human diseases, discovering new treatment options and facilitating medically-relevant decision making [25, 182].

One of the major challenges in designing computational approaches for biomedical applications is to ask the right question, gather relevant data and develop algorithms based on a deep understanding of the problem. This dissertation presents a domain knowledge guided strategy towards addressing this challenge. Problem-specific motivations based on domain knowledge are used to guide the process of (1) gathering relevant data from massive amounts of existing biomedical data, (2) connecting heterogeneous data, and (3) designing algorithms to discover knowledge from the data (Fig. 1.1).

We demonstrate the effectiveness of the strategy using applications in three distinct contexts: (1) retrieving medically-related images from massive web images based on their contents, (2) detecting genetic basis for human diseases towards genetics-based drug discovery, and (3) repositioning drug treatments based on disease genetics. Among them, the goal of web medical image retrieval is to support patients’ self-education and decision-making. Disease-associated gene prediction and drug repositioning are fundamental components of translational biomedical research. The rest of this chapter will describe the background, challenges and the application of the knowledge guided strategy in each context.

[Figure 1.1 diagram: a knowledge-based motivation guides data gathering, data fusion, and algorithm design over existing biomedical data. Application context 1 (retrieve medically-relevant web images): application, retrieving disease images; motivation, diseases manifest on body parts; data and method, ontology guided organ detection. Application context 2 (detect genetic basis for human diseases): applications, infectious disease, multifactorial disease, and cancer; motivations, human-pathogen interactions provide insights, disease phenotypic similarity indicates genetic overlaps, and cancer comorbidity provides insights; data and methods, studying cross-species genetic networks, studying multiple phenotype networks, and analyzing the disease comorbidity network. Application context 3 (predict new drug treatments): application, Parkinson’s disease; motivation, mouse gene-phenotype associations provide insights; data and method, combining disease genetics and mouse model phenotypes.]

Figure 1.1: Domain knowledge guided strategy in designing approaches for biomedical applications. The strategy is applied in three contexts: (1) retrieving medically-related web images, (2) detecting genetic basis for human diseases, and (3) repositioning drug treatments.

1.2 Retrieving medically-relevant web images

Medical knowledge in both textual and visual format is important for health information retrieval and clinical applications. A number of comprehensive textual knowledge bases have been constructed and made available in the medical domain, such as the Unified Medical Language System (UMLS) [93]. In comparison, fewer studies have attempted to systematically organize medical knowledge in a visual format. Many medical image bases concentrate on specific domains, such as lung CT images [9], cardiovascular MRI images [101], and human anatomy images [2]. The scale of these databases is limited, largely because the image collection processes are manual and laborious. Also, they annotate images by natural language sentences, which introduce ambiguities in image retrieval. Last but not least, most existing image bases are not freely available.

Our eventual goal is to build a freely accessible, large scale and patient oriented health image base, which contains images of human disease manifestations, organs, drugs and other medical entities. Unlike existing databases, we plan to build up our image base in line with the UMLS structure and annotate images by terms from standard medical ontologies, such as the FMA (Foundational Model of Anatomy) [173], ICD9 (International Classification of Diseases, 9th revision) [184] and RxNorm [124]. For each medical term, the image base provides a set of high quality images with relevant contents, creating a rich and reusable information resource for patient education, patient self-care and web-content illustration. This image base designed for consumers collects photographic images, which are a significant subset of all biomedical images.

The most challenging problem in building the image base is how to collect a large number of credible images for tens of thousands of medical terms. The web is a readily available source: it is free, it contains billions of images and is fast growing, and search engines such as Google can already do reasonable image retrieval based on text queries. However, the web is heterogeneous, and most of the images are non-medical and need to be filtered. Generic image retrieval engines such as Google are not specialized for medical applications. For example, the top Google results for the UMLS concepts “heart,” “ear deformities, acquired,” and “ibuprofen” contain not only images of heart organs, ear deformities and ibuprofen tablets, but also other items such as cartoon symbols, paper snapshots, and molecular formulae. In particular, image retrieval for disease terms is highly challenging, since disease manifestation images contain diverse objects and complex backgrounds. Collecting medically-relevant images from the web clearly needs a content-based image retrieval (CBIR) method that requires minimal manual effort, as the number of disease terms is large.

This application focuses on developing an automatic approach to retrieving web images on human diseases. Traditional supervised methods need a training image set for each disease, and thus will not scale when the number of diseases is large. Our key observation is that although the number of diseases is in the tens of thousands, most disease manifestations are shown on body parts, and the number of body parts is much smaller. We develop an ontology-guided approach to retrieve disease images from the web. In this approach, the knowledge of the affected body parts is first extracted for a given disease term from UMLS. We then use this knowledge to guide the selection of pre-trained organ detectors, and combine the organ detection outputs to retrieve disease images.
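The three steps in the paragraph above (look up the affected body parts, select the matching pre-trained detectors, combine their outputs) can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in: the disease-to-body-part map replaces the UMLS lookup, the stub detectors replace real pre-trained organ detectors, and taking the maximum detector confidence with a fixed threshold is an assumed combination rule, not necessarily the one developed in Chapter 2.

```python
# Hypothetical disease-to-body-part map standing in for UMLS relationships.
DISEASE_TO_BODY_PARTS = {
    "conjunctivitis": ["eye"],
    "cleft lip": ["mouth", "lip"],
}


def make_stub_detector(scores):
    """Return a stand-in detector that looks up a fixed confidence per image id."""
    def detect(image_id):
        return scores.get(image_id, 0.0)
    return detect


# Stand-ins for pre-trained organ detectors; real ones would score image content.
ORGAN_DETECTORS = {
    "eye": make_stub_detector({"img1": 0.9, "img2": 0.1}),
    "mouth": make_stub_detector({"img3": 0.7}),
    "lip": make_stub_detector({"img3": 0.8, "img2": 0.2}),
}


def retrieve_disease_images(disease, image_ids, threshold=0.5):
    """Keep images whose best organ-detector score reaches the threshold."""
    # Step 1: find the body parts affected by the disease.
    body_parts = DISEASE_TO_BODY_PARTS.get(disease, [])
    # Step 2: select the detectors for those body parts.
    detectors = [ORGAN_DETECTORS[p] for p in body_parts if p in ORGAN_DETECTORS]
    kept = []
    for image_id in image_ids:
        # Step 3 (assumed rule): combine detections by taking the maximum score.
        score = max((detect(image_id) for detect in detectors), default=0.0)
        if score >= threshold:
            kept.append((image_id, score))
    # Highest-scoring images first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(retrieve_disease_images("cleft lip", ["img1", "img2", "img3"]))
```

The point of the design is that the detectors are indexed by body part rather than by disease, so adding a new disease term only requires a new entry in the lookup table, not a new training set.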

1.3 Detecting novel genetic basis for human diseases

Identifying genetic basis for human diseases plays an important role in elucidating disease mechanisms and discovering targets of drug treatments [94, 163]. For computational strategies to predict disease genes, mining relevant data for specific disease types can lead to new discoveries [16, 162, 193]. Traditional approaches exploited human genomic data and prioritized genes for a disease if the genes are functionally similar to the known disease genes [3, 68, 105, 214]. A few recent studies incorporated clinical phenotype data to increase the ability of identifying new disease genes [95, 111, 121, 201, 212, 213] and assumed that similar disease phenotypes reflect overlapping genetic causes [34, 90, 153].

In the first study, we develop an approach to predicting genes for parasitic infectious diseases. Traditional disease gene discovery methods that exploit the human protein interactome are insufficient for infectious diseases, which naturally involve human-pathogen protein interactions. The hypothesis is that studying human-parasite protein interactions can provide insights into the molecular signatures of disease-specific host immune responses [96, 103, 211]. We construct a cross-species network to integrate human-human, parasite-parasite and human-parasite protein interactions. Known disease genes are then used as seeds to find novel candidate disease-associated genes. The approach is applied to Plasmodium falciparum malaria, which is the most deadly parasitic infectious disease and killed six million people worldwide in 2012 [128]. The top-ranked candidate genes are shown not only to be associated with malaria but also to have the potential to guide genetics-based anti-malaria drug discovery.

In the second study, we develop a phenotype-driven approach to predicting disease-associated genes. For syndromes and many multifactorial diseases, systematically analyzing disease phenotype networks in combination with protein functional interaction networks has great potential in illuminating disease pathophysiological mechanisms [16, 90, 153]. However, disease phenotype networks remain largely incomplete, and most current disease gene discovery studies used only one data source of human disease phenotypes [111, 121, 201, 212, 213]. Incorporating more comprehensive phenotype data can enhance the performance of disease gene prediction. Therefore, we explore a new disease phenotype data source, the disease-manifestation semantic relationships in the UMLS, and construct a Disease Manifestation Network (DMN). Comparative analysis demonstrates that the phenotype clustering in DMN reflects common disease genetics and contains different knowledge from mimMiner, which is a widely-used phenotype database. We then develop an innovative and generic strategy to combine DMN, mimMiner, and a genetic network, and predict disease-gene associations from the integrated network. The application of this approach to Crohn’s disease demonstrates that the predicted genes have translational potential in drug discovery when integrated with drug-target associations.

In the third study, we develop a comorbidity network analysis approach to infer novel genetic basis for the link between two diseases, and apply the approach to colorectal cancer (CRC) and obesity. Phenotype-driven approaches to predicting novel disease genes may not be suitable for cancers, which usually have non-specific disease manifestations, such as pain, fever and ascites. Disease comorbidity often leads to unexpected disease links [153] and offers novel insights into disease genetic mechanisms [14, 29]. In particular, studying cancer comorbidity has impacted the understanding of cancer mechanisms [43]. The common comorbidity between CRC and obesity in the context of the comorbidity network provides insights into the novel molecular evidence underlying both diseases. Traditional comorbidity studies usually focus on pairwise disease links [82, 157, 171, 175], and the results are often biased due to noise and intrinsic bias in the patient data. We explore new patient data, which are not biased towards patients of certain ages and genders, and develop a comorbidity mining approach to reduce the bias towards rare diseases. Instead of studying pairwise disease comorbidity, we construct a disease comorbidity network and design a network analysis approach to identify common comorbidity between two diseases. Gene expression analysis guided by the detected common comorbidity identifies a few genes that have the potential to explain the link between CRC and obesity.
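As a rough illustration of the third study's network step, the sketch below extracts the local network of simple paths between two diseases and ranks the intermediate nodes. It rests on assumptions: the toy edges are invented for the example (nodes D1, D2, D3 are placeholders, not findings), the path-length cap `max_len` is a hypothetical parameter, and counting the paths through each node is only a crude stand-in for the degree and betweenness ranking that the dissertation reports for the local network.

```python
def simple_paths(graph, source, target, max_len):
    """Yield every simple path from source to target with at most max_len edges."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        if len(path) - 1 >= max_len:
            continue  # path already uses max_len edges; do not extend it
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in path:  # keep paths simple (no repeated nodes)
                stack.append((neighbor, path + [neighbor]))


def rank_intermediate_diseases(graph, source, target, max_len=3):
    """Rank intermediate nodes by how many source-target paths pass through them."""
    counts = {}
    for path in simple_paths(graph, source, target, max_len):
        for node in path[1:-1]:  # skip the two endpoint diseases
            counts[node] = counts.get(node, 0) + 1
    # Most-traversed nodes first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy undirected comorbidity network; the edges are made up for illustration.
TOY_NETWORK = {
    "obesity": {"D1", "D2"},
    "D1": {"obesity", "D2", "CRC"},
    "D2": {"obesity", "D1", "D3"},
    "D3": {"D2", "CRC"},
    "CRC": {"D1", "D3"},
}

if __name__ == "__main__":
    print(rank_intermediate_diseases(TOY_NETWORK, "obesity", "CRC"))
```

Nodes that rank highly in such a local network are candidates for the common comorbidity that then guides the gene expression analysis.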

1.4 Predicting novel drug treatments based on disease genetics

Computational drug repositioning approaches lead to rapid drug discovery. Previous studies have predicted new indications for existing drugs by analyzing multiple types of data, such as drug side effects [36], drug response gene expressions [63], and disease similarities [75]. Recent studies demonstrate that disease genetics in genome-wide association studies (GWAS) [178] and Online Mendelian Inheritance in Man (OMIM) [207] have great potential to guide drug discovery. On the other hand, the International Mouse Phenotyping Consortium (IMPC) [33] has made available large amounts of phenotypic descriptions for mouse genetic mutations based on systematic gene knockouts, which are not possible in humans. The mouse phenotype data enrich the knowledge on disease genetic basis, and have facilitated the detection of new disease genes [88] and drug targets [86]. Combining human disease genetics and mouse phenotype data will provide novel insights into the genetics of many complex diseases, which can guide the discovery of novel drug options.

We develop a novel drug repositioning approach that leverages both disease genetics and mouse model phenotypes, and apply the approach to Parkinson’s disease (PD). In this approach, PD-specific mouse phenotypes are first identified using well-studied human disease genes. Then all FDA-approved drugs are searched for candidates that share similar mouse phenotype profiles with PD. The approach is compared with purely genetics-based approaches and a state-of-the-art drug repositioning approach based on mouse phenotypes [87] to demonstrate the importance of combining these two kinds of data.

1.5 Contribution and organization of the dissertation

This dissertation demonstrates in three application scenarios that the knowledge-guided strategy of combining unique data and novel computational approaches contributes effectively to solving specific medical problems. It makes the following contributions: the development of a novel ontology-guided approach to retrieving disease manifestation images from the web; the development of three computational approaches to predicting the genetic basis of parasitic infectious diseases, multifactorial diseases, and cancers, respectively; the demonstration of the translational potential of predicted disease genes in drug discovery; and the development of a novel drug repositioning approach based on disease genetics and mouse model phenotypes.

The remainder of the dissertation is organized as follows:

Chapter 2 presents an ontology-guided image retrieval method for identifying disease images on the web.

Chapter 3 presents a disease gene prediction approach for malaria based on studying the cross-species genetic networks.

Chapter 4 describes the construction of a novel disease phenotype network and the development of a generalizable disease gene prediction approach based on multiple disease phenotype data sources.

Chapter 5 introduces the construction of a disease comorbidity network and the development of a novel network analysis approach to detecting the genetic basis for the link between colorectal cancer and obesity.

Chapter 6 presents a computational drug repositioning approach combining disease genetics and mouse model phenotypes, and its application to Parkinson’s disease.

Chapter 7 concludes this dissertation and discusses possible improvements for future work.

Chapter 2

Ontology guided approach to retrieving medically-relevant web images: application on retrieving disease manifestation images

2.1 Motivation

Towards the goal of constructing a patient-oriented health image base, this chapter presents a content-based image retrieval (CBIR) method for retrieving disease manifestation images from the web. The image retrieval task is highly challenging, since disease manifestation images contain diverse objects and complex backgrounds. For example, the positive examples of “hand, foot, and mouth disease” may contain infected feet, hands, mouths, or tongues. The body parts appear in different positions and sizes, and more than one infected body part may appear in a single image. The task of collecting disease images from the web therefore requires image analysis at the object level with minimal manual effort.

Most CBIR systems apply machine learning approaches to bridge the semantic gap between image content and users’ interpretations [57]. These approaches include supervised classification [40], similarity-based clustering [120], semi-supervised co-training [66], and active learning based on relevance feedback [195]. A few methods incorporated additional information to improve retrieval precision. For example, Deserno et al. exploited figure types and panel numbers to retrieve literature figures [60]. Muller et al. summarized the retrieval methods for integrating texts with image content [141]. Simpson et al. combined natural language and image processing to map regions in CT scans to concepts in the RadLex ontology, which was automatically extracted from image captions [183]. Deng et al. used semantic prior knowledge to retrieve similar images [59]. One particularly relevant method of reducing human effort in health image collection is the bootstrap image classification method [47]. This approach uses one positive sample as the “seed” to iteratively retrieve more positive images, and is thus appropriate for large-scale image collection. Although this approach effectively collects human organ and drug images, it has limited precision for disease images [47]. Because web images are highly heterogeneous, our task requires supervision of the training data to ensure good precision. However, traditional supervised methods need a training image set for each disease, and thus will not scale when the number of disease terms is large.

To solve the scalability problem, we propose an ontology-guided organ detection method to collect disease manifestation images from the web. Based on observations, we assume that most disease manifestation images contain abnormal human body parts, such as eyes, ears, and hands, which show visible disease symptoms. Therefore, our approach uses the presence of these body parts to discriminate between disease and non-disease images. Instead of training a classifier for each disease, a set of organ detectors is pre-trained, each of which detects one target organ. When retrieving images for a given disease, we extract the disease-organ semantic relationships from ontologies, and use the corresponding detectors to detect the associated organs in web images.

Our method has two major advantages. First, it requires far less training data than the standard supervised method, which trains a classifier for each disease, because it reuses organ detectors across diseases. For example, 428 diseases in the UMLS are recorded as located on the eye. Instead of training 428 classifiers, one for each disease, our approach trains one detector for “eye” and reuses it to classify images of all 428 eye diseases. Second, our knowledge-guided approach achieves high accuracy when disease images contain diverse manifestations on different organs, such as images of “hand, foot, and mouth disease.” For each disease, it uses prior knowledge of disease-organ associations as guidance to scan images at the object level.

2.2 Data and methods

Fig. 2.1A shows the steps of the method. For a given disease, the UMLS is first used to determine which body parts the disease is manifested on. Then a set of pre-trained body part detectors, using state-of-the-art image features, detects the targets at multiple scales. Finally, the scanning results of each detector at all scales are combined into high-level features to classify the input images as relevant or irrelevant.

2.2.1 Discovering target body parts

For each target disease, its corresponding body parts are found through the UMLS semantic network relationship “has finding site.” One body part is typically associated with at least hundreds of diseases, so reusing common body part detectors across diseases can save a large amount of human labeling effort. Each disease can have manifestations on multiple body parts: among the diseases that involve the “has finding site” relation in the UMLS, around 15% are located on more than one body part. For such complex diseases, detection results of multiple body parts are combined into high-level features to boost the retrieval precision.

[Figure 2.1, panel (a): a disease term is mapped through the UMLS to its target body parts, which select the target detectors from the pre-trained body part detectors; the top Google results for the term form the input data, which the detectors scan to make a positive or negative decision. Panel (b): each detector (A, B, ...) extracts low-level features and applies its classifiers at scales 1 to 3; the per-scale results are combined into a high-level feature, and the final decision about the input image is based on logic.]

Figure 2.1: (A) Overview of the disease image retrieval approach. (B) Structure of the organ detectors.

Some diseases affect internal organs that are not directly visible in images. Nonetheless, symptoms on these organs can have direct manifestations on external body parts. We manually map some of the internal organs to their associated external body parts by using the “isa” and “part of” relationships in the UMLS. For example, the “oral mucous membrane structure” is a part of the “entire mouth region,” which is a synonym of “mouth” in the UMLS. Thus, diseases having symptoms in the “oral mucous membrane structure” are considered to be associated with “mouth.” In addition, some diseases are located on body parts that are recorded at too fine a level of detail in the UMLS. We also manually map such body parts to the larger organs that contain them based on the “part of” relationship. For example, “upper eye lid” and “lower eye lid” are parts of “eye,” and are therefore mapped to “eye.” Diseases manifested on the upper and lower eyelids then have “eye” as the target body part.
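The mapping described above was performed manually. Purely as an illustration of the idea, the following sketch walks synonym and “part of” links upward until it reaches a body part for which a detector exists. The dictionaries are tiny hypothetical fragments built from the two examples in the text, and the function names are assumptions, not part of the UMLS or of the original implementation.

```python
# Hypothetical fragments of ontology relations, taken from the examples in the text.
# The real UMLS relations are far larger and were curated by hand in this study.
PART_OF = {
    "oral mucous membrane structure": "entire mouth region",
    "upper eye lid": "eye",
    "lower eye lid": "eye",
}
SYNONYM = {"entire mouth region": "mouth"}
DETECTOR_TARGETS = {"eye", "ear", "mouth", "hand", "foot"}


def resolve_target(site, max_hops=10):
    """Follow synonym and part-of links until a detector target is reached.

    Returns the detector target name, or None if no target is reachable.
    """
    current = site
    for _ in range(max_hops):
        current = SYNONYM.get(current, current)
        if current in DETECTOR_TARGETS:
            return current
        if current not in PART_OF:
            return None
        current = PART_OF[current]
    return None


def targets_for_disease(finding_sites):
    """Map all finding sites of one disease to a sorted list of detector targets."""
    resolved = (resolve_target(site) for site in finding_sites)
    return sorted({target for target in resolved if target is not None})
```

For instance, `targets_for_disease(["upper eye lid", "lower eye lid"])` yields a single target, so one eye detector serves both finding sites.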

2.2.2 Detecting target body parts

A general human organ detection method is developed and adapted to specific targets by tuning the training data as well as the parameters. Currently, this study reuses the detectors for eye, ear, lip/mouth, hand, and foot in retrieving images of a variety of diseases. The detectors are learned in a generic way and can be easily extended to other body parts.

Object detection is a fundamental problem in computer vision. Approaches to object detection typically consist of two major components: feature extraction and model construction. Lowe developed the scale-invariant feature transform (SIFT) as an image patch descriptor [125]. Dalal and Triggs [54] proposed histograms of oriented gradients (HOG) for human detection. These features have proved effective in object detection applications. In addition, Zhang et al. constructed a bag-of-features model to classify texture and object categories [218]. Felzenszwalb et al. developed a generic object detector with deformable part models to handle significant variations in object appearance [65].

Fig. 2.1B shows the structure of our organ detector. Each detector $i$ ($i = A, B, \ldots$) detects one target organ using multiple classifiers. Each classifier $C_{ij}$ scans the input image and searches for the target at detection scale $j$ ($j = 1, 2, \ldots$). For example, if detector A is an eye detector and contains three classifiers, then $C_{A1}$ decides whether the full image is an eye, $C_{A2}$ scans the image with a detection window to search for small eyes, and $C_{A3}$ searches for eyes of an even smaller size. The organ detection results $S_{A1}, S_{A2}, \ldots, S_{B1}, S_{B2}, \ldots$ are binary values and represent the existence of each organ $\{A, B, \ldots\}$ at each detection scale $\{1, 2, \ldots\}$. We then combined these results into high-level features, based on logic, to make final decisions about the input images. The accuracy of our simpler detection system was comparable to that of the detector of Felzenszwalb et al. [65].
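To make the detector structure concrete, the following sketch shows one way a single organ detector could produce its three binary results: the first classifier judges the whole image, and the other two are applied to every position of a sliding window. The classifiers are passed in as plain callables standing in for the HOG-plus-SVM classifiers; the stride value and the function names are assumptions, since the text does not specify how the windows are moved.

```python
def sliding_windows(width, height, win_w, win_h, stride):
    """Yield the (x, y) top-left corner of every window that fits in the image."""
    for y in range(0, height - win_h + 1, stride):
        for x in range(0, width - win_w + 1, stride):
            yield x, y


def detect_organ(image_size, classifiers, window_sizes, stride=16):
    """Return the binary results [S_1, S_2, S_3] of one organ detector.

    classifiers[0] judges the full image. classifiers[j] for j >= 1 is applied
    to each window of size window_sizes[j - 1]. Every classifier is a callable
    taking (x, y, w, h) and returning True or False; here it is only a
    stand-in for a trained classifier. The stride is an assumed parameter.
    """
    width, height = image_size
    results = [int(bool(classifiers[0](0, 0, width, height)))]
    for clf, (win_w, win_h) in zip(classifiers[1:], window_sizes):
        hit = any(
            clf(x, y, win_w, win_h)
            for x, y in sliding_windows(width, height, win_w, win_h, stride)
        )
        results.append(int(hit))
    return results
```

A scale is reported as positive as soon as any window at that scale is accepted, which matches the description of the results as indicating the existence of the organ at each scale.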

Training organ detectors

For all classifiers in each organ detector, the training samples consisted of web images collected through Google. Positive examples were collected by searching with the six body part names as keywords and manually picking 200–300 images of the body part itself with little background. Most positive examples are not medically relevant, but contain different views of the body parts. To collect negative images, we summarized the categories of objects and backgrounds that often appear in the Google query results, such as paper snapshots, animals, and buildings. Negative examples were then collected by searching keywords such as “research paper,” “dog,” and “building.” The negative training set comprised five thousand images, and the same negative examples were used for all the organ detectors.

We trained three standard soft-margin support vector machine (SVM) classifiers for each organ detector to detect targets at three scales. In detector $i$, $C_{i1}$ was trained on the full training images. Since $C_{i2}$ and $C_{i3}$ search for targets with detection windows, they used positive samples that were resized to the window sizes, and image patches of the window sizes randomly selected from negative samples. HOG features [54] were extracted from the training images. HOG is reminiscent of the SIFT descriptor, but uses overlapping local contrast normalization for improved performance [54]. The window sizes of $C_{i2}$ and $C_{i3}$ were empirically chosen as 64 × 96 and 32 × 48 pixels for the eye, lip/mouth, and hand detectors, and 96 × 64 and 48 × 32 pixels for the foot and ear detectors. By browsing 100 eye disease images, we found that images containing only very small target organs were usually false positives; we therefore did not train classifiers at any smaller scale, in order to maintain high retrieval precision.
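The random selection of negative patches can be sketched as follows. This is only an illustration under assumptions: the function name, the fixed seed, and the treatment of the window size as (width, height) are not stated in the text, which gives the sizes without saying which dimension comes first.

```python
import random


def sample_negative_patches(image_w, image_h, win_w, win_h, count, seed=0):
    """Pick `count` random window-sized patches inside one negative image.

    Returns a list of (x, y, win_w, win_h) boxes. Images smaller than the
    window yield no patches. The (width, height) order is an assumption.
    """
    if image_w < win_w or image_h < win_h:
        return []
    rng = random.Random(seed)
    patches = []
    for _ in range(count):
        x = rng.randint(0, image_w - win_w)  # randint is inclusive on both ends
        y = rng.randint(0, image_h - win_h)
        patches.append((x, y, win_w, win_h))
    return patches
```

Each returned box would then be cropped from the negative image and passed to the same feature extraction as the resized positive samples.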

2.2.3 Combining detections for disease image classification

Finally, input images are classified into disease or non-disease categories using the organ detection results that represent the existence of affected organs (high-level features). Ideally, if all the classifiers behaved in the same way and were independent, the high-level combined feature would be:

$$ y = (S_{A1} + S_{A2} + S_{A3}) + (S_{B1} + S_{B2} + S_{B3}) + \cdots, \tag{2.1} $$

where + is the ‘or’ operation between binary values.

However, such a simple combination has a problem: if the whole image itself is the target body part, it is unlikely to contain the same target at smaller scales. If a body part is detected at both the whole-image level and the finer scales, the image is often a false positive. This may be partly due to the incompleteness of the training samples or the difficulty of detecting small-scale objects. Rule (2.1) ignores this problem and concludes that the result is positive even when the classifiers at all three scales are positive. As precision is more important for our retrieval problem, we used the exclusive ‘or’ operation to set the decisions in such cases to negative, even though the recall might decrease. The final decision rule was as follows:

$$ y = (S_{A1} \oplus (S_{A2} + S_{A3})) + (S_{B1} \oplus (S_{B2} + S_{B3})) + \cdots, \tag{2.2} $$

where ⊕ is the exclusive ‘or’ and + is the ‘or’ operation between binary values.

Comparing the truth tables of (2.1) and (2.2) shows that the two rules make different decisions only when the detection results are positive at both the whole-image level and a finer level; in that case, decision rule (2.2) is more desirable.
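The comparison of the two rules can be checked by enumerating all eight combinations of the three per-scale results for one organ. The sketch below does this; the function names are illustrative only, and `classify` simply takes the ‘or’ of the per-organ terms as in rules (2.1) and (2.2).

```python
from itertools import product


def rule_or(s1, s2, s3):
    """Per-organ term of rule (2.1): 'or' over the three scales."""
    return s1 | s2 | s3


def rule_xor(s1, s2, s3):
    """Per-organ term of rule (2.2): S1 xor (S2 or S3)."""
    return s1 ^ (s2 | s3)


def classify(detections, term=rule_xor):
    """Combine per-organ terms with 'or'.

    `detections` maps an organ name to its binary results (S1, S2, S3).
    """
    return int(any(term(*scales) for scales in detections.values()))


# All inputs on which the two per-organ terms disagree.
disagreements = [
    scales for scales in product((0, 1), repeat=3)
    if rule_or(*scales) != rule_xor(*scales)
]
```

The list `disagreements` contains exactly the cases in which the whole-image result and at least one finer-scale result are both positive; rule (2.2) turns these into negatives for that organ.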

2.3 Results

We evaluated the proposed ontology-guided disease image retrieval method on two kinds of image sets: (1) images of multiple diseases that are located on the same body part, and (2) images of diseases that are located on more than one body part, in experiments A and B, respectively. All the test images were top Google search results for the given disease term. Images with widths or heights smaller than 128 pixels were excluded to ensure image quality. Also, to apply the organ detectors with the selected detection window sizes, all test images were resized such that both their widths and heights were between 128 and 256 pixels. For evaluation purposes, the test images were labeled by three human evaluators. Since performance depends on the ground-truth labeling, the ground truth was determined by a majority vote among the individual evaluators. The average agreement rate among the three evaluators was 92%.
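Two of the preprocessing steps above are simple enough to sketch directly: the size filter and the majority vote over the three evaluators. The function names are illustrative assumptions. The resizing step is not sketched, because the text does not state how the target size within the 128 to 256 pixel range was chosen.

```python
def keep_image(width, height, min_side=128):
    """Return True if neither side of the image is smaller than min_side pixels."""
    return width >= min_side and height >= min_side


def majority_vote(labels):
    """Majority of an odd number of binary labels (1 = relevant, 0 = irrelevant)."""
    return int(sum(labels) * 2 > len(labels))
```

With three evaluators and binary labels, a strict majority always exists, so no tie-breaking rule is needed.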

2.3.1 Single-organ disease classification

Our method trained the organ detectors on images of normal organs. Since the test images can be quite different from the training images and much more diverse, experiment A was designed to evaluate the performance of our method by comparing its results with those of a supervised classification method. For each individual disease, the supervised method trained a soft-margin SVM classifier using actual disease images as training data, with the same HOG features.

This experiment repeatedly compared our object detection based method with the supervised classification method on 2000 test images in three groups. The groups contain 10 sets of eye, ear, and mouth/lip disease images, respectively. Our method trains a single object detector to classify the 10 test sets in each group. In contrast, the supervised method trains 10 different classifiers, one for each disease, thus requiring 10 times more human labeling effort. The two methods were compared on precision, recall, and F1 measure. Precision is the most important criterion among the three, because our goal is to collect data for a health image base, and we are more interested in the credibility than the completeness of the images.

Table 2.1 compares the performance for eye disease images. The average positive percentage of the 10 Google test image sets is 52.2%. After applying our method, the average positive percentage of the retrieved images was 79.1%. For 9 out of 10 test sets, our method achieved a precision between 70% and 90%. The precisions, recalls, and F1 measures of our method were comparable (p > 0.1) to those of the supervised method across the 10 test sets, even though our method needs only one tenth of the manual labeling effort, by reusing the organ detector across 10 diseases. In practice, our method will be able to reuse the eye detector for far more than 10 eye diseases and further reduce human effort.
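For reference, the three evaluation measures can be computed as in the following sketch, which uses the standard definitions of precision, recall, and F1. The function names and the counts in the usage note are illustrative; the per-disease counts of true and false positives are not given in the text.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_from(precision, recall)
    return precision, recall, f1


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

As a consistency check against Table 2.1, `f1_from(0.750, 0.720)` rounds to 0.735, the F1 value reported for the Coloboma test set. Hypothetical counts such as 18 true positives, 6 false positives, and 7 false negatives would give that same precision and recall.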

The results also show that the object detectors at smaller detection scales tend to introduce more false positives. For eye disease image retrieval, a finer detection scale yields decreasing precision in 8 out of 10 test sets (Fig. 2.2A) and increasing recall in all test sets (Fig. 2.2B). The second test set, containing Duane retraction syndrome images, has the lowest recall at all detection scales. One possible reason is that many positive images in this set, in order to illustrate the eye movement disorder, contain eyes smaller than our detection windows. Adding organ detectors at smaller scales may increase the recall, but may also introduce many false positives. Since precision is more important, we stopped the object detection at the third detection scale.

Table 2.1: Performance Comparison on Ten Eye Disease Image Test Sets.

                                         Object Detection Based Method   Supervised Classification Method
Disease CUI  Disease Term                Precision  Recall  F1           Precision  Recall  F1
C0009363     Coloboma                    0.750      0.720   0.735        0.707      0.746   0.726
C0013261     Duane retraction syndrome   0.818      0.400   0.537        0.779      0.652   0.710
C0014236     Endophthalmitis             0.852      0.697   0.767        0.817      0.867   0.841
C0015397     Disorder of eye             0.882      0.882   0.882        0.683      0.700   0.692
C0015401     Eye foreign bodies          0.826      0.792   0.809        0.648      0.687   0.667
C0015402     Eye hemorrhage              0.692      0.800   0.742        0.772      0.821   0.796
C0015404     Bacterial eye infections    0.706      0.720   0.713        0.807      0.864   0.834
C0017601     Glaucoma                    0.727      0.800   0.762        0.821      0.827   0.824
C0025210     Ocular melanosis            0.794      0.540   0.643        0.723      0.689   0.706
C0086543     Cataract                    0.862      0.806   0.833        0.862      0.913   0.887
Average                                  0.791      0.716   0.742        0.762      0.777   0.768

Figure 2.2: (A) Trend of decreasing precision at finer scales. (B) Trend of increasing recall at finer scales.

Tables 2.2 and 2.3 show the results for ear and mouth/lip disease images. Our method achieved average precisions of 80.7% and 84.2%, while the baselines of test set quality were 43.4% and 47.5%, respectively. The three evaluation criteria in Table 2.2 are similar between the two methods (p > 0.1), and the average precision of our method for ear disease retrieval was higher than that of the supervised method. In Table 2.3, our method achieved an average precision around 6% higher than that of the supervised method (p = 0.15), at the cost of lower recall. One possible reason is that the body parts in some mouth disease images from the test sets are at very different angles and considerably deformed.

2.3.2 Multiple-organ disease classification

Experiment B evaluated the performance of our method on 220 images of two diseases that are located on multiple organs. Table 2.4 shows that the precision of the proposed method on both test sets was > 80%. Compared with the supervised method, our method improved precision by more than 10% in these two cases. Since the proposed method is guided by the semantic information of body part location, it can detect various kinds of positive images, whereas the supervised method does not make use of the high-level features that carry greater semantic meaning.

Table 2.2: Performance Comparison on Ten Ear Disease Image Test Sets.

                                         Object Detection Based Method   Supervised Classification Method
Disease CUI  Disease Term                Precision  Recall  F1           Precision  Recall  F1
C0008373     Cholesteatoma               0.794      0.818   0.806        0.881      0.953   0.915
C0013446     Acquired Ear Deformity      1.000      0.828   0.906        0.819      0.863   0.840
C0013449     Ear Neoplasms               0.875      0.412   0.560        0.733      0.567   0.639
C0029877     Ear Inflammation            0.921      0.778   0.843        0.822      0.804   0.813
C0154258     Gouty tophi of ear          0.792      0.655   0.717        0.614      0.470   0.532
C0347354     Benign neoplasm of ear      0.571      0.533   0.552        0.820      0.497   0.619
C0423576     Irritation of ear           0.733      0.611   0.667        0.933      0.607   0.735
C0521833     Bacterial ear infection     0.786      0.647   0.710        0.820      0.670   0.737
C0729545     Fungal ear infection        0.800      0.696   0.744        0.728      0.864   0.791
C2350059     Cancer of Ear               0.800      0.414   0.545        0.797      0.699   0.736
Average                                  0.807      0.639   0.705        0.797      0.699   0.736

Table 2.3: Performance Comparison on Ten Mouth/Lip Disease Image Test Sets.

                                         Object Detection Based Method   Supervised Classification Method
Disease CUI  Disease Term                Precision  Recall  F1           Precision  Recall  F1
C0007971     Cheilitis                   0.846      0.667   0.746        0.828      0.893   0.859
C0019345     Herpes Labialis             0.900      0.486   0.632        0.817      0.953   0.880
C0023761     Lip Neoplasms               0.909      0.500   0.645        0.607      0.553   0.579
C0149637     Carcinoma of lip            0.952      0.435   0.597        0.819      0.906   0.860
C0153932     Benign neoplasm of the lip  0.625      0.333   0.435        0.833      0.713   0.769
C0158670     Congenital fistula of lip   0.700      0.368   0.483        0.693      0.800   0.743
C0221264     Cheilosis                   0.750      0.577   0.652        0.813      0.867   0.839
C0267022     Cellulitis of lip           0.923      0.600   0.727        0.810      0.917   0.860
C0267025     Contact cheilitis           0.950      0.543   0.691        0.849      0.943   0.894
C0267032     Granuloma of lip            0.867      0.500   0.634        0.721      0.865   0.786
Average                                  0.842      0.501   0.624        0.779      0.841   0.807

For hand, foot, and mouth disease, 42.9%, 28.6%, and 37.1% of the positive images in the test set contained a hand, foot, and mouth, respectively. A few positive images contained two or three body parts at the same time. The hand, foot, and lip/mouth detectors contributed to finding 28.6%, 14.3%, and 28.6% of the total positive images, respectively. For Ascher’s syndrome, 67.9% and 32.1% of the positive test images contained the lip and the eye, respectively. The corresponding mouth/lip and eye detectors found 57.1% and 28.6% of the positive images, respectively, from the whole test set.

In summary, this study reused five pre-trained organ detectors to filter 2220 web images of 32 different diseases in two experiments. Compared with the supervised approach, which requires training 32 classifiers, one for each disease, we reduced the labeling effort to 15.6%. The average retrieval precision of our method on all the 32 datasets was 81.6%, an improvement of 3.9% compared with the supervised method. For 13 out of 32 disease datasets, we improved the retrieval precision by 10%.
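The totals in this summary follow from counts stated earlier in the chapter: three groups of ten single-organ diseases plus two multiple-organ diseases, 2000 and 220 test images, and five detectors in place of 32 per-disease classifiers. The worked arithmetic below assumes that labeling effort scales with the number of models that must be trained, which is how the 15.6% figure appears to be derived.

```latex
\begin{align*}
\text{diseases} &= 3 \times 10 + 2 = 32, \\
\text{test images} &= 2000 + 220 = 2220, \\
\text{relative labeling effort} &= \frac{5}{32} \approx 0.156 = 15.6\%.
\end{align*}
```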

2.4 Discussion

With the aim of achieving large-scale medical image retrieval, this study proposed an ontology-guided approach and compared it with standard supervised classification. Results showed that the proposed method achieves a precision comparable to that of the supervised method while reducing manual labeling effort by an order of magnitude. The results also showed that our method has two limitations: low recall on some test sets, and decreasing precision as the detection scale becomes smaller. To improve the recall, more robust algorithms and better data are needed to train the organ detectors. To address the decreasing precision, we plan to build a two-layer learning model, in which the first-layer classifiers detect target objects at different scales and the second-layer classifier learns the weights to combine the results from the first layer and make the final decisions.

Table 2.4: Performance Comparison on Two Multiple-Organ Disease Image Test Sets.

Disease CUI                           C001852                                      C0339085
Disease Term                          Hand, Foot and Mouth Disease                 Ascher’s syndrome
Disease Locations                     C0222224 Skin structure of hand              C0023759 Lip structure
                                      C0222289 Skin structure of foot              C0015426 Eyelid structure
                                      C0026639 Oral mucous membrane structure
Detectors                             Hand, Foot and Mouth/Lip                     Mouth/Lip and Eye
Positive percentage                   0.58                                         0.56
Precision  Object detection based     0.8333                                       0.8889
           Supervised classification  0.6944                                       0.7857
Recall     Object detection based     0.7143                                       0.8571
           Supervised classification  0.7753                                       0.9429
F1         Object detection based     0.7692                                       0.8727
           Supervised classification  0.7326                                       0.8571

The scale of the experiments is limited owing to the intensive manual labeling work required for training data and evaluation purposes. Our experiments are based on five organ detectors. The future plan is to train more organ detectors and apply the method to more diseases. In addition, a few organs, such as skin, muscle, and veins, do not appear as concrete objects in images. Our method based on object detection is insufficient for diseases on these organs. Adding texture pattern recognition may further improve the retrieval precision and cover a wider range of diseases.

Our approach also depends on disease-organ relationships in the UMLS, and assumes that the appearance of related organs determines whether an image is disease-related. Although the assumption holds in many cases, as the results have shown, a small number of false-positive samples retrieved by our method are non-disease images (containing only normal organs) or images of a different disease. Another limitation is that the “has finding site” relationship in the UMLS is incomplete. Among the 74,785 disease concepts of the semantic types “disease or syndrome,” “neoplastic process,” “acquired abnormality,” and “congenital abnormality,” only 44.1% have values in “has finding site.”

For disease terms that have no body-site information, our approach can be extended by scanning the web images with all organ detectors. In this way, the “has finding site” relationship in the UMLS can be enriched by mining web images.

2.5 Conclusions

In this work, we developed an ontology-guided disease image retrieval method based on body-part detection, towards mining web images to build a large-scale health image base for consumers. Compared with standard supervised classification, the proposed method improves the retrieval precision of complex disease images by incorporating semantic information from medical ontologies. In addition, our method substantially reduces manual labeling effort by reusing a set of pre-trained organ detectors. The resulting health image database is annotated using terms from standard medical ontologies and will create a rich source of information for multiple descriptive and educational purposes. Although the scale of the study is limited, it provides a proof of concept that the web is a feasible source for automatic health image retrieval, and that only a small amount of manual effort is required to collect and annotate complex disease images. In future work, we plan to improve the accuracy of organ detectors and ontology-based classification, and extend our approach to handle a wider range of diseases.

Chapter 3

Analyzing cross-species genetic networks to predict disease-associated genes: application on Plasmodium falciparum malaria

3.1 Motivation

Malaria is the most deadly parasitic infectious disease [128]. Existing drug treatments show limited efficacy in malaria elimination [62, 99, 104], and the pathogenesis of the disease is not fully understood [136]. Malaria is caused by parasites of the Plasmodium species. After being injected by mosquitoes into human skin, these parasites infect the liver and multiply using the host cell resources. They then invade the red blood cells and cause the disease symptoms [133]. In both the liver and blood stages, the parasites trigger the host’s innate immune responses and remodel the host cells to survive these responses [52, 220]. The complex pathogenesis of malaria involves both human and parasite genomes [71, 103].

Studies of the human-parasite protein interactions have provided insights into the molecular signatures of malaria-specific host immune responses [96, 103, 211]. For example, studies show that the parasite protein PfEMP1 binds the human proteins CD36 [19] and ICAM1 [185], which play critical roles in the adhesion of infected red blood cells to endothelial cells, and eventually lead to the disruption of the blood-brain barrier in cerebral malaria patients [132]. In addition, other studies show that the PfRh family of proteins in the parasites directly interacts with the human protein CR1 during the invasion of red blood cells, and CR1 has the potential to become a target of blood-stage vaccines [192].

Currently, large-scale data have accumulated on the human genome, the parasite genome, and their interactions. Integration and systematic analysis of these data may lead to novel discoveries about the genetic basis of malaria. In this study, we designed a data-driven approach to infer novel malaria-associated genes. Recent computational disease gene discovery algorithms have shown great potential in predicting disease causes [3, 16, 105, 111, 121, 201]. They exploited the protein interactome in the human genome and assumed that genes related to a disease phenotype tend to be located in a neighborhood in the protein-protein interaction network [69]. However, these traditional methods are not sufficient for predicting genes for malaria, which naturally involves human-parasite protein interactions. Our approach represented the interacting human and parasite genomes with a heterogeneous network. We prioritized genes that are functionally related to the known malaria genes in the heterogeneous network and investigated whether the top-ranked genes have the potential to guide drug discovery for malaria.

[Figure 3.1 panels: Predict genes for malaria (human genetic network, parasite genetic network, and host-pathogen interactions feed a network analysis that outputs ranked genes); Evaluate validity (leave-one-out cross validation and a validation gene set); Evaluate drug targets; Evaluate pathways.]

Figure 3.1: The methods contain two parts: gene prediction based on the cross-species genetic networks, and result analysis covering the validity of the method in predicting malaria genes, the distribution of druggable genes in the ranking, and the pathways associated with the top-ranked genes.

3.2 Data and methods

The experimental workflow is depicted in Fig. 3.1 and consists of two steps: (1) prioritizing genes through network analysis and (2) analyzing the results. Genetic networks for the human genome and the parasite genome are first constructed separately, and then connected with host-pathogen protein interactions. A network-based algorithm is then developed to rank the genes in the cross-species network, using the genes that are known to be associated with malaria as the seeds. The approach is validated using a “leave-one-out” cross-validation analysis and a set of malaria genes extracted from the literature. Then the distribution of druggable genes is evaluated among all the ranked genes. Finally, the functions of the prioritized genes are analyzed by extracting pathways on the basis of the gene ranking.

3.2.1 Construct cross-species gene network

We construct the genetic network for human and Plasmodium falciparum (the species that causes the most dangerous form of malaria) from the STRING [186] database. STRING includes gene relationships for over a thousand species from four sources: protein-protein interaction (PPI) databases, PPIs mined from literature abstracts, curated pathway databases and co-expressed genes. All four sources are used to build comprehensive networks for both human and Plasmodium falciparum. The human network contains 20,770 proteins and 4,850,628 interactions, and the Plasmodium falciparum network contains 4,913 proteins and 1,007,938 interactions. In addition, the edges in the two genetic networks are weighted by the scores from STRING.

The two protein networks are connected with 36 interactions from PathogenPortal [13] and literature [26, 96, 169]. These interactions are binary and cover physical associations, direct interactions and chemical reactions between the two species. The interaction pairs from literature are curated manually. The gene identifiers are unified for human and parasites through the HUGO Gene Nomenclature Committee [77] and PlasmoDB [12], respectively.

3.2.2 Predict candidate genes for malaria

A total of 77 known malaria genes are used as the seeds to find additional malaria genes. Among the 77 seed genes, 14 human genes are extracted from Online Mendelian Inheritance in Man (OMIM). In addition, extensive literature evidence suggests that the Plasmodium falciparum proteins PfEMP1 [159], PfRh4 [192] and PfRh5 [210] are essential for parasite growth and red blood cell invasion. These three proteins are encoded by 63 parasite genes, which are added into the seed list.

The network-based algorithm, which is based on the random walk model, ranks all the genes by the probabilities of being reached from the seeds. The jumping probability λ is used to regulate the movements of the random walker between networks. When the random walker stands on a node in the human network H that is connected with a node in the parasite network P, it may jump to P with the probability λ or stay in H with the probability 1 − λ.

The ranking scores for each node are calculated as follows. Let $p_0$ be the vector of initial scores for each node and $p_k$ the score vector at step $k$, which is iteratively updated by:

\[
p_{k+1} = (1-\gamma)\, M^{T} p_k + \gamma\, p_0, \tag{3.1}
\]

where $\gamma$ is the probability that the random walker restarts from the seeds at each step, and $M$ is the transition matrix of the cross-species genetic network:

\[
M = \begin{bmatrix} M_H & M_{HP} \\ M_{HP^T} & M_P \end{bmatrix}. \tag{3.2}
\]

The diagonal sub-matrices $M_H$ and $M_P$ consist of intra-network transition probabilities and are calculated as:

\[
(M_i)_{kl} = (1-\lambda x)\,(A_i)_{kl} \Big/ \sum_{l} (A_i)_{kl}, \tag{3.3}
\]

where $i \in \{H, P\}$, $A_i$ is the adjacency matrix of the network $H$ or $P$, $k$ is the row index, $l$ is the column index, and $x$ is an indicator variable, which equals 1 if $\sum_{l} (A_i)_{kl} \neq 0$ and 0 otherwise. The off-diagonal sub-matrices $M_{HP}$ and $M_{HP^T}$ consist of inter-network transition probabilities and are calculated as:

\[
(M_j)_{kl} = \lambda x\,(A_j)_{kl} \Big/ \sum_{l} (A_j)_{kl}, \tag{3.4}
\]

where $j \in \{HP, HP^T\}$ and $x$ is the same indicator variable. While the method obtains a score for each human and parasite gene, this study focuses on ranking and analyzing the human genes.
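The update rule (3.1) and the block transition matrix (3.2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used in this study: the function names, the toy adjacency matrices, and the convergence tolerance are hypothetical, and the indicator x is assumed to flag nodes that have at least one cross-species edge, following the description of the jump between H and P given above.

```python
import numpy as np

def row_normalize(block, scale):
    """Rescale each non-empty row of `block` so that it sums to scale[row]."""
    sums = block.sum(axis=1)
    out = np.zeros_like(block, dtype=float)
    nz = sums > 0
    out[nz] = block[nz] * (scale[nz] / sums[nz])[:, None]
    return out

def cross_species_transition(A_H, A_P, A_HP, lam):
    """Assemble the block matrix M of (3.2) from the three adjacency matrices."""
    # Assumption: x = 1 for nodes with at least one cross-species edge.
    x_H = (A_HP.sum(axis=1) > 0).astype(float)
    x_P = (A_HP.sum(axis=0) > 0).astype(float)
    M_H = row_normalize(A_H, 1.0 - lam * x_H)    # stay within the human network
    M_P = row_normalize(A_P, 1.0 - lam * x_P)    # stay within the parasite network
    M_HP = row_normalize(A_HP, lam * x_H)        # jump from human to parasite
    M_PH = row_normalize(A_HP.T, lam * x_P)      # jump from parasite to human
    return np.block([[M_H, M_HP], [M_PH, M_P]])

def random_walk_with_restart(M, p0, gamma, tol=1e-10, max_iter=1000):
    """Iterate p_{k+1} = (1 - gamma) M^T p_k + gamma p_0 until the L1 change is small."""
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - gamma) * M.T @ p + gamma * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy data (hypothetical): 3 human nodes, 2 parasite nodes, one cross-species edge.
A_H = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A_P = np.array([[0, 1], [1, 0]], dtype=float)
A_HP = np.array([[0, 0], [0, 0], [1, 0]], dtype=float)

M = cross_species_transition(A_H, A_P, A_HP, lam=0.8)
p0 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # a single seed: human node 0
scores = random_walk_with_restart(M, p0, gamma=0.3)
```

In this toy network every node has at least one neighbor inside its own network, so each row of M sums to one and the scores remain a probability distribution. A node whose only edges cross between species would need separate handling.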

3.2.3 Evaluate the validity in predicting malaria genes

Before using our method to predict genes for malaria, a “leave-one-out” cross validation analysis is performed to validate the method. Each time, one malaria gene is left out of the seed list, the remaining seeds are used as the input, and the rank of the excluded seed is evaluated among the genes from the human or parasite genome. The same procedure is repeated for each of the 77 seeds. If the method works well, the excluded seeds should be ranked highly.
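The leave-one-out procedure can be sketched as follows. The code is an illustration under stated assumptions, not the implementation used here: it runs a plain random walk with restart on a single toy network instead of the cross-species network, the helper names are hypothetical, and it assumes that the held-out seed is ranked among the nodes that are not currently used as seeds (the text does not specify how the remaining seeds are treated in the ranking).

```python
import numpy as np

def rwr_scores(A, seeds, gamma=0.3, n_iter=200):
    """Random walk with restart on one network (assumes no isolated nodes)."""
    M = A / A.sum(axis=1, keepdims=True)
    p0 = np.zeros(A.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)
    p = p0.copy()
    for _ in range(n_iter):
        p = (1.0 - gamma) * M.T @ p + gamma * p0
    return p

def leave_one_out_ranks(A, seeds, score_fn=rwr_scores):
    """Hold out each seed in turn and report its rank among the non-seed nodes."""
    ranks = {}
    for held_out in seeds:
        rest = [s for s in seeds if s != held_out]
        scores = score_fn(A, rest)
        # Assumption: the remaining seeds are excluded from the ranked list.
        candidates = [n for n in range(A.shape[0]) if n not in rest]
        order = sorted(candidates, key=lambda n: -scores[n])
        ranks[held_out] = order.index(held_out) + 1
    return ranks

# Toy network (hypothetical): two triangles joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

seeds = [0, 1, 2]
ranks = leave_one_out_ranks(A, seeds)
```

Because the three seeds form a tightly connected triangle, each held-out seed is recovered at rank 1 in this toy example.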

Then we used all 77 seeds as the input, and evaluated whether the method can prioritize a manually collected validation set of novel malaria genes (other than the seeds). The set contains 27 human genes involved in malaria resistance and the host immune responses triggered by malaria parasites. These genes were extracted from literature references that were mentioned in the textual descriptions of malaria in OMIM, and have zero overlap with the seed genes. This set serves as a proxy for novel malaria genes, and the ranks of its genes among all human genes are evaluated.

3.2.4 Evaluate the ranks of druggable genes

Currently, only a subset of the human genome is druggable [89]. This evaluation experiment investigates whether the top-ranked genes represent opportunities for drug discovery for malaria. We first extracted 1,935 human genes that were targets of the drugs in DrugBank [112]. All these drug target genes appear in our genetic network and have no overlap with the seeds. We used all 77 seeds as the input and ranked the human genes. Then we counted the number of target genes in every 500 human genes of the ranking, from the top to the bottom, and plotted how this number varies.
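The binning step can be sketched as below. The function name and the toy gene identifiers are hypothetical; the default bin size of 500 follows the text.

```python
def targets_per_bin(ranked_genes, target_genes, bin_size=500):
    """Count drug target genes in consecutive windows of a ranked gene list."""
    targets = set(target_genes)
    counts = []
    for start in range(0, len(ranked_genes), bin_size):
        window = ranked_genes[start:start + bin_size]
        counts.append(sum(gene in targets for gene in window))
    return counts

# Toy data (hypothetical): ten ranked genes, four of them drug targets, bins of four.
ranked = [f"g{i}" for i in range(10)]
drug_targets = {"g0", "g1", "g5", "g9"}
counts = targets_per_bin(ranked, drug_targets, bin_size=4)
```

Plotting `counts` against the bin index gives a profile of the kind described above. A count that falls from the first bin onward indicates that drug targets concentrate at the top of the ranking.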

3.2.5 Extract and analyze malaria-specific pathways based on gene ranking

To better understand the functions of the prioritized genes, we linked the top 10% of human genes to their pathways. We downloaded 1,320 canonical pathways from MSigDB [189] and ranked them based on the average of the random walk scores for all the genes in each pathway. We manually examined whether the top pathways are associated with the host response to pathogen invasion.

In addition, we evaluated the impact of introducing the parasite genome into our gene prediction method. We removed the parasite genetic network and the host-parasite interactions from our method, and calculated the random walk scores for human genes. Then we re-ranked the pathways containing the top 10% of genes. We compared the ranks of the pathways before and after using the parasite genetic network, and extracted the ones with the largest rank differences.
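The pathway ranking described in this section can be sketched as follows. The function name, gene names and pathway names are hypothetical. Two details are assumptions not stated in the text: a pathway is kept only if it contains at least one gene from the top fraction of the ranking, and the average is taken over the pathway members that received a score.

```python
def rank_pathways(scores, pathways, top_fraction=0.10):
    """Rank pathways by the mean random walk score of their member genes."""
    ranked_genes = sorted(scores, key=scores.get, reverse=True)
    n_top = max(1, int(len(ranked_genes) * top_fraction))
    top_genes = set(ranked_genes[:n_top])
    results = []
    for name, genes in pathways.items():
        members = [g for g in genes if g in scores]
        # Assumption: skip pathways that contain no top-ranked gene.
        if not members or top_genes.isdisjoint(members):
            continue
        mean_score = sum(scores[g] for g in members) / len(members)
        results.append((name, mean_score))
    return sorted(results, key=lambda item: -item[1])

# Toy data (hypothetical): ten scored genes and three pathways.
gene_scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.09, "e": 0.08,
               "f": 0.07, "g": 0.06, "h": 0.05, "i": 0.04, "j": 0.03}
toy_pathways = {"P1": ["a", "c"], "P2": ["b", "c"], "P3": ["a", "b"]}
pathway_rank = rank_pathways(gene_scores, toy_pathways)
pathway_names = [name for name, _ in pathway_rank]
```

Running the same function on scores computed with and without the parasite network, and comparing the positions of each pathway in the two lists, gives the rank differences discussed above.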

3.3 Results

3.3.1 Network-based approach allows the prioritization of known malaria genes from both human and parasite genomes

Among the 77 seed genes, 14 were human genes and 63 were parasite genes. We evaluated the performance of our algorithm in ranking human and parasite seed genes separately with a leave-one-out cross validation analysis. Our method required two parameters: the jumping probability λ between the human and parasite genetic networks and the probability γ that the random walker restarts from the seeds. We chose λ=0.8 and γ=0.3 to achieve the best performance in the cross validation, but different parameter values affected the result only slightly. We used the same values for the two parameters throughout all the analyses.

Table 3.1 shows that the ranks of the excluded human seed genes were high. In nine cases, the excluded genes directly interact with another seed and were ranked within the top 1% amongst all the human genes. Of these, two genes (CD36, ICAM1) were ranked in the top five. In 13 out of 14 cases, the excluded genes were ranked within the top 3%. The average rank for the excluded human seed genes is 336 (top 2% among all human genes).

We also evaluated the 63 parasite seed genes, and our approach ranked the excluded nodes within the top 5% in 56 out of 62 cases. Table 3.2 shows the top 10 parasite genes and their ranks in the cross validation. The average rank for the excluded parasite genes is 199 (top 4% among all parasite genes). Less comprehensive data in the parasite genome than in the human genome may contribute to the lower rank (in percentage) of the parasite seed genes. Overall, this analysis demonstrated the utility of the extended random walk to accurately prioritize known malaria genes.

3.3.2 Network-based approach prioritizes novel malaria genes other than the seeds

A large body of literature has demonstrated strong associations between individual genes and malaria through transcriptional profiling, biological experiments and genome-wide association studies. These genes include inflammatory response genes, such as NFκB and CXCL1 [198], parasite protein receptors, such as BSG [53] and PROCR [199], and genes involved in protection against malaria, such as HLA-B [83] and HAVCR1 [149]. We then used all the seeds to generate our ranking for human genes, and examined the ranks of 27 malaria genes that have been validated in previously published studies. Table 3.3 shows that 12 out of 27 genes were ranked within the top 1%, and a total of 24 genes within the top 5%.

Table 3.1: Result of the leave-one-out cross validation for human genes. We left out one malaria gene from the seed list each time, and determined the rank of this excluded gene using our method. We showed the rank and percentage among all human genes.

Gene symbol    Rank    Top percentage
CD36           1       0.00%
ICAM1          2       0.01%
CR1            14      0.07%
SLC4A1         78      0.44%
NOS2           99      0.55%
GYPC           121     0.67%
HBB            126     0.70%
GYPA           137     0.76%
FCGR2B         159     0.88%
CISH           232     1.29%
TIRAP          277     1.54%
G6PD           378     2.11%
FCGR2A         403     2.25%
TNF            2679    14.9%

Table 3.2: Top 10 parasite genes in the leave-one-out cross validation.

Gene (ORF name)    Rank    Top percentage
PFD1235W           18      0.37%
PF11_0521          22      0.45%
PF13_0003          29      0.59%
PFD0995C           49      0.99%
PFL1955W           49      0.99%
PFL1950W           50      1.01%
PF07_0050          52      1.05%
PF07_0051          53      1.07%
PFD0630C           53      1.07%
PFD1150C           55      1.11%

We also manually examined the top 50 human genes and found interesting predictions. Among them, TLR4 has been suggested to be protective against malaria in certain populations [67, 180]. In addition, a recent mouse model experiment [100] has demonstrated that P53 was critical in the liver-stage infection of malaria. Together, the result demonstrated that our gene ranking prioritized novel malaria-associated genes other than the seeds.

Table 3.3: Rank of other malaria-associated genes from literature.

Gene symbol    Rank    Top percentage
BSG            15      0.08%
IL6            20      0.11%
IFNG           25      0.14%
IL1B           34      0.19%
IL10           38      0.21%
IL8            65      0.36%
IL4            66      0.37%
IL1A           137     0.77%
CD40LG         142     0.79%
HLA-DRB1       145     0.81%
HLA-B          168     0.94%
HAVCR2         179     0.99%
FUT9           183     1.02%
NFκB1          219     1.22%
HBA1           221     1.23%
HBA2           227     1.27%
HLA-DQB1       230     1.28%
HAVCR1         319     1.78%
GNAS           358     1.99%
IFNGR1         380     2.12%
CXCL1          381     2.12%
MBL2           444     2.48%
CCL20          494     2.76%
IL12B          499     2.79%
IFNAR1         954     5.33%
PROCR          1515    8.46%
IL22           2467    13.8%

3.3.3 Prioritized genes are enriched by druggable genes

Fig. 3.2 shows that the top-ranked genes are enriched for drug targets. The top 500 human genes in our ranking have 235 overlaps with the drug targets, which is a 4.3-fold enrichment compared with the average of 100 random rankings ($p < 10^{-8}$). Among the 235 druggable genes, only 5 have been targeted by FDA-approved anti-malaria drugs, such as chloroquine, proguanil and mefloquine. This result indicated that the top-ranked candidate genes for malaria may provide unique opportunities for malaria drug discovery through novel disease genetics.

Figure 3.2: The count of drug target genes among every 500 genes in our rank from the top to the bottom. [Plot; x-axis: our gene ranking in bins of 500 genes, from 1−500 to 17000−17500; y-axis: number of drug target genes, from 0 to 250.]

3.3.4 Pathway analysis shows functions of prioritized genes

To gain insight into the commonalities underlying the predicted malaria candidate genes, we analyzed the pathways associated with the top-ranked genes. The top-ranked pathways are associated with different aspects of malaria. For example, malaria parasites actively alter the immune function of B cells, which is captured by BIOCARTA BLYMPHOCYTE PATHWAY [181]. BIOCARTA LYM PATHWAY is a pathway of lymphocyte adhesion, and plays a central role in binding bacteria, parasites, viruses and tumor cells [202]. Also, BIOCARTA STEM PATHWAY regulates hematopoiesis and induces hematopoietic activities in the presence of infection [1].

We compared the pathway rankings before and after introducing the parasite genetic network and found that nine pathways improved in rank by over 50%. Table 3.4 lists these pathways and their plausible associations with malaria pathogenesis and protection. Several of these pathways are directly related to parasite infection and inflammatory responses. REACTOME BASIGIN INTERACTIONS was prioritized through the interaction with the parasite protein PfRh5. Other pathways whose ranks improved by less than 50% may also be associated with malaria. For example, the rank of the REACTOME HDL MEDIATED LIPID TRANSPORT pathway improved by 40%. A recent meta-analysis showed that host lipid profile alteration is linked with malaria pathogenesis, though the precise pathway has not yet been elucidated [205].

3.4 Discussion

Malaria is caused by the invasion of deadly parasites into human skin, liver and blood. The parasites trigger human immune responses, but can also manipulate human cells for nutrient uptake and cell growth. Recent studies have shown that host-pathogen protein interactions illuminate the malaria-specific pathways in the human host. With the accumulation of data on both the human and parasite genomes, systematically analyzing these two interacting genomes may discover new malaria-associated genes, which will pave the way to identifying novel anti-malaria drugs.

We developed a data-driven method to infer malaria genes based on random walking on the cross-species genetic networks. We demonstrated that the method can prioritize genes that are both drug targets and associated with malaria. Through comparing the result before and after adding the parasite genetic network into our method, we extracted specific pathways involving human-parasite interactions.

Our approach can be improved with a more comprehensive database of host-pathogen protein interactions. We currently manually curated 36 interactions, mostly from literature, to connect the human and parasite genetic networks. Compared with the human-human protein interactions, the coverage of human-parasite interactions is much lower and might be biased. As more data are introduced into the method, the global structure of the cross-species genetic network may change, which will affect the result of gene ranking. In the future, we plan to automatically mine the human-parasite interactions from literature and construct a database with better coverage.

Table 3.4: Pathways prioritized over 50% in rank.

Pathway: REACTOME PYRUVATE METABOLISM
    Potential association with malaria: Pyruvate kinase deficiency protects against malaria [15]
Pathway: REACTOME BASIGIN INTERACTIONS
    Potential association with malaria: Basigin is a receptor essential for erythrocyte invasion by Plasmodium falciparum [53]
Pathway: PID SYNDECAN1 PATHWAY
    Potential association with malaria: Induced by parasite infection [22]
Pathway: REACTOME PYRUVATE METABOLISM AND CITRIC ACID TCA CYCLE
    Potential association with malaria: Pyruvate kinase deficiency protects against malaria [15] and citric acid cycle activity is involved in chloroquine resistance [92]
Pathway: REACTOME INTEGRIN CELL SURFACE INTERACTIONS
    Potential association with malaria: Associated with Plasmodium-induced thrombocytopenia [37]
Pathway: REACTOME CELL SURFACE INTERACTIONS AT THE VASCULAR WALL
    Potential association with malaria: Associated with red blood cell adhesion to the endothelial cell and cerebral malaria [140, 142]
Pathway: BIOCARTA VDR PATHWAY
    Potential association with malaria: Controls cellular nutrient uptake, differentiation and apoptosis, which may be affected by parasites [133, 217]
Pathway: BIOCARTA MONOCYTE PATHWAY
    Potential association with malaria: Recruitment and activation of monocytes and macrophages are essential for both protection and pathology in malaria-infected individuals [49]
Pathway: REACTOME PLATELET ADHESION TO EXPOSED COLLAGEN
    Potential association with malaria: Platelet adhesion and aggregation may play important roles in facilitating adhesion of infected red blood cells [76, 155, 208]

Since our approach prioritized a set of druggable genes that are associated with malaria, one example of subsequent work is to perform drug repositioning by matching the targets of approved drugs to the predicted genes. In this way, however, some of the candidate drugs may target generic inflammatory responses and may not be specific enough to kill the parasites. In addition, malaria is associated with different pathways when humans are infected by different parasite species (other than Plasmodium falciparum) or different strains [211]. To develop more effective agents against malaria, we need to dissect the genetic basis using more specific data.

3.5 Conclusions

The lack of effective anti-malaria drugs and the poorly understood disease genetics have motivated our study of detecting novel malaria-associated genes from both human and parasite genomes, with the ultimate goal of discovering innovative anti-malaria drugs based on a new genetic understanding of the disease. We developed a data-driven approach to infer malaria-associated genes. Since malaria is caused by the interactions between parasites and humans, we constructed a cross-species genetic network to model these interactions, and prioritized relevant genes using network analysis. We demonstrated the validity of the method in predicting malaria genes, and showed the potential of the predicted genes in drug discovery. We also extracted pathways from the result of gene ranking, and found that these pathways reflect different aspects of malaria pathogenesis.

Chapter 4

Combining multiple human phenotype networks to predict disease-associated genes: application on Crohn’s disease

4.1 Motivation

Systematic study of disease phenotype networks in combination with protein functional interaction networks can offer insights into disease mechanisms. However, the disease phenotype networks remain largely incomplete, and most current disease gene prediction approaches [95, 111, 121, 201, 212, 213] used only a single data source of human disease phenotypes. Phenotypic similarity databases were usually obtained by extracting phenotype knowledge from texts, such as biomedical literature [107] and the phenotype descriptions in Online Mendelian Inheritance in Man (OMIM) [111, 170, 200]. Among them, mimMiner [200] and the human phenotype ontology [170] are based on OMIM and have been widely used in disease gene prediction studies [88, 95, 121, 143, 201].

Combining different phenotype data has the potential to reduce the bias in each data source and improve the network-based prediction models [135, 154]. In this study, we explore new accurate and publicly accessible disease phenotype data in addition to the existing phenotype networks. We create the Disease Manifestation Network (DMN) using the highly accurate and structured clinical manifestation data from the Unified Medical Language System (UMLS) [30, 123, 131]. Clinical manifestation captures a major aspect of disease phenotype and can predict disease causes [34]. For example, Stickler syndrome, Marshall syndrome and otospondylomegaepiphyseal dysplasia (OSMED) have highly similar manifestations and also involve mutations in the interacting collagen genes COL2A1, COL11A1, and COL11A2, respectively [7]. The UMLS semantic network currently uses 50,543 disease-manifestation semantic relationships to explicitly link 2,305 diseases to their clinical manifestations. In this knowledge base, all disease and manifestation terms are formally represented by unified concepts, and the semantic relationships between concepts were collected from multiple different ontologies.

We demonstrate that DMN not only reflects known disease-gene relationships, but also contains different phenotypic knowledge compared with mimMiner. We test the hypothesis through network comparative analysis between DMN, mimMiner [200], and two variants of the human disease network (HDN) [74], which connects diseases if they share genes. The correlation between DMN and the HDNs indicated that DMN reflects existing knowledge on genetic relationships among diseases. The comparison between DMN and mimMiner demonstrated that the two phenotype networks are largely complementary in nodes, edges and community structures. The overall analysis suggests that combining DMN with previous phenotype data sources, such as mimMiner, may improve data-driven methods for biomedical applications, such as disease gene discovery and drug repositioning.

Then we develop a novel and generic approach to combine multiple data sources on human disease phenotype, and predict disease genes from seamlessly integrated phenotypic and genomic data. Specifically, we integrate DMN, mimMiner, a protein interaction network and known disease-gene associations. We predict disease genes from the heterogeneous network, and demonstrate the benefit of incorporating the additional phenotype network DMN by comparing with a baseline approach, which is also based on network analysis but uses only mimMiner.

Several recent studies showed that the genetic basis of diseases from OMIM [207] and GWAS [178] may lead to the discovery of candidate drug treatments. Here, we demonstrate that the disease genes predicted by our approach, in combination with drug-target data, may guide the discovery of new candidate drugs. We use Crohn’s disease as an example, which has increasing worldwide prevalence [137] and is currently incurable [11, 50]. We predict candidate genes for the disease, and prioritize candidate drugs based on the rank of drug target genes. Then we validate the result with the FDA-approved therapies. Our result provides empirical evidence that our disease genetics prediction strategy, which combines unique data and a novel systems approach, can lead to rapid drug discovery.

4.2 Data and methods

In creating the novel phenotype network, our approach consists of the following steps (Fig. 4.1). We first constructed DMN using the disease-manifestation associations from UMLS. Then we compared the phenotypic relationships in DMN with the genetic relationships among diseases. Finally, we compared DMN with mimMiner [200].

Figure 4.1: The three steps of network analysis for DMN. [Diagram panels: (1) Construct a novel phenotype network, DMN: the Disease Manifestation Network (2,305 nodes, 420,567 edges), built from disease-manifestation pairs in the UMLS semantic network (50,543 pairs). (2) Compare DMN with genetic disease networks, HDNs: the human disease network based on OMIM (2,974 nodes, 3,573 edges), built from disease-gene pairs in OMIM (34,007 pairs), and the human disease network based on GWAS (355 nodes, 3,406 edges), built from disease-gene pairs in GWAS (5,895 pairs). (3) Compare DMN with the most widely-used phenotype network, mimMiner (4,391 nodes, 373,527 edges), built from disease-phenotype pairs mined from OMIM disease records (5,080 records).]

In using the novel phenotype network to predict disease-associated genes, our approach consists of the following steps. We first integrated DMN, mimMiner, and a genetic network based on protein-protein interactions (PPIs) into a heterogeneous network (Fig. 4.2). Given a disease, we then prioritized the genes using a ranking algorithm extended from the random walk model. We validated our approach using well-studied disease-gene associations from OMIM and compared the performance with a baseline disease gene prediction method that used only one phenotype network. We also evaluated our approach in predicting genes for diseases of different classes. Finally, we identified candidate drug therapies for Crohn’s disease based on the gene prediction results, and demonstrated the translational potential of our newly predicted genes.

4.2.1 Construct DMN using disease-manifestation associations in UMLS

We first extracted disease-manifestation relationships from the UMLS file MRREL.RRF (2013 version). The file contains 647 different kinds of semantic relationships between biomedical concepts. We collected the concept pairs linked by the “has manifestation” relationship, and obtained 50,543 disease-manifestation pairs. The disease-manifestation relationships come from OMIM [79], Ultrasound Structured Attribute Reporting [23], and Minimal Standard Digestive Endoscopy Terminology [197]. OMIM is the major contributor among these data sources.

Figure 4.2: Integrating the knowledge in DMN, mimMiner and the genetic network. [Diagram: DMN (2,312 disease nodes, 408,029 edges), built from disease-manifestation semantic relationships in UMLS (50,543 pairs), and mimMiner (5,003 disease nodes, 119,740 edges), built from textual disease descriptions in OMIM disease records (5,080 disease records), are linked by a map between UMLS and OMIM identifiers. Both phenotype networks are linked through OMIM disease-gene associations to the PPI gene network (9,465 gene nodes, 37,039 edges), built from protein-protein interactions in HPRD (37,039 interactions). The arrows between networks are labeled with the jumping probabilities λ.]

The manifestation terms vary greatly in abundance. For example, common manifestations such as “seizures” are associated with many diseases, while rare manifestations such as “Amegakaryocytic thrombocytopenia” are only associated with one disease. We used the information content in (4.1) to weight each manifestation concept:

\[
w_c = -\log\left(\frac{n_c}{N}\right), \tag{4.1}
\]

where $w_c$ is the weight of the manifestation concept $c$, $n_c$ is the number of diseases associated with manifestation $c$, and $N$ is the total number of diseases. Then we modeled the manifestation similarity between diseases $x$ and $y$ by the cosine of their feature vectors in (4.2), in which the feature vectors consist of the manifestations $x_i$ and $y_i$ for diseases $x$ and $y$. The cosine similarity was used before [111, 200] to quantify phenotype overlaps.

\[
s(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}} \tag{4.2}
\]

We constructed DMN as a weighted network with the manifestation similarities. The edge weights are in the range (0, 1].
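The weighting in (4.1) and the similarity in (4.2) can be sketched as follows. The function names and the toy disease-manifestation data are hypothetical. Two assumptions are made that the text does not state: the logarithm is natural, and the entry of a feature vector is the weight $w_c$ when the disease has manifestation $c$ and zero otherwise.

```python
import math

def manifestation_weights(disease_to_manifs):
    """Information content w_c = -log(n_c / N) for every manifestation concept."""
    n_diseases = len(disease_to_manifs)
    counts = {}
    for manifs in disease_to_manifs.values():
        for m in set(manifs):
            counts[m] = counts.get(m, 0) + 1
    return {m: -math.log(n / n_diseases) for m, n in counts.items()}

def cosine_similarity(manifs_x, manifs_y, weights):
    """Cosine of two weighted manifestation vectors (absent entries are zero)."""
    vx = {m: weights[m] for m in set(manifs_x)}
    vy = {m: weights[m] for m in set(manifs_y)}
    dot = sum(vx[m] * vy[m] for m in vx.keys() & vy.keys())
    norm_x = math.sqrt(sum(w * w for w in vx.values()))
    norm_y = math.sqrt(sum(w * w for w in vy.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

# Toy data (hypothetical): four diseases and their manifestations.
toy = {
    "D1": ["seizures", "rash"],
    "D2": ["seizures", "rash"],
    "D3": ["seizures", "fever"],
    "D4": ["cough"],
}
w = manifestation_weights(toy)
s_same = cosine_similarity(toy["D1"], toy["D2"], w)      # identical manifestations
s_partial = cosine_similarity(toy["D1"], toy["D3"], w)   # share only "seizures"
s_none = cosine_similarity(toy["D1"], toy["D4"], w)      # nothing in common
```

Because "seizures" occurs in three of the four toy diseases, it receives a small weight, so sharing it alone yields only a low similarity. Under this weighting, a manifestation present in every disease has weight zero and contributes nothing.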

4.2.2 Compare phenotypic relationships in DMN with genetic disease associations

We conducted two experiments to evaluate whether the phenotypic relationships in DMN reflect genetic associations among diseases. The first experiment calculated the correlation between the disease similarities in DMN and two quantified measures of genetic association. We first ranked the edges (disease pairs) in DMN by their weights (disease similarities) from large to small. For the top N disease pairs, we counted the percentage of disease pairs that share associated genes in OMIM and the average number of genes shared by the N disease pairs. Then we calculated the Pearson’s correlations between N and the genetic measures.

In the second experiment, we compared the network topologies of DMN and two genetic disease networks. A well-studied genetic disease network is HDN, in which diseases were connected if they share associated genes in OMIM and edges were weighted by the number of overlapping genes [74]. Here we inherited the network construction method of HDN, but used two different sets of disease-gene association data: the updated data in OMIM (April, 2013) and the GWAS catalog (August, 2013). We represented the disease terms in the OMIM-based HDN and the GWAS-based HDN with 2,974 and 355 UMLS concept unique identifiers, respectively, to enable the comparison with DMN. The two genetic disease networks both contain rich information on disease genetics [114, 119], but are largely different. The OMIM-based HDN mostly contains Mendelian diseases with strong genetic causes; the GWAS-based HDN mostly contains common complex diseases. The two networks share only 45 diseases.

We compared the edges and community structures between DMN and the two HDNs. Network community structure reveals the biological network properties and offers insights into cell functions, protein interactions, and disease dynamics [38, 156, 177]. We applied a widely-used community detection algorithm [146] and calculated the two-way similarities between community groups:

\[
S_{DMN \to HDN} = |X \cap Y| / |X|, \tag{4.3}
\]

\[
S_{HDN \to DMN} = |X \cap Y| / |Y|. \tag{4.4}
\]

$|X|$ and $|Y|$ are the numbers of disease pairs that appear in the same community in DMN and HDN, respectively. $|X \cap Y|$ is the count of disease pairs that were grouped into one community in both networks.
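The two-way similarities (4.3) and (4.4) can be sketched as follows. The function names and the toy communities are hypothetical. The sketch assumes non-overlapping communities and restricts the comparison to the diseases that appear in both networks, which the text does not state explicitly.

```python
from itertools import combinations

def co_clustered_pairs(communities, shared_nodes):
    """All unordered pairs of shared nodes that fall into the same community."""
    pairs = set()
    for members in communities:
        kept = sorted(set(members) & shared_nodes)
        pairs.update(combinations(kept, 2))
    return pairs

def community_similarity(comms_a, comms_b):
    """Return (S_{a->b}, S_{b->a}) = (|X ∩ Y| / |X|, |X ∩ Y| / |Y|)."""
    shared = set().union(*comms_a) & set().union(*comms_b)
    X = co_clustered_pairs(comms_a, shared)
    Y = co_clustered_pairs(comms_b, shared)
    both = len(X & Y)
    s_ab = both / len(X) if X else 0.0
    s_ba = both / len(Y) if Y else 0.0
    return s_ab, s_ba

# Toy communities (hypothetical): five diseases clustered in two networks.
dmn_comms = [["a", "b", "c"], ["d", "e"]]
hdn_comms = [["a", "b"], ["c"], ["d", "e"]]
s_dmn_to_hdn, s_hdn_to_dmn = community_similarity(dmn_comms, hdn_comms)
```

Here DMN groups four disease pairs and HDN groups two, both of which are also grouped in DMN, so the similarity is 0.5 in one direction and 1.0 in the other. The asymmetry is the reason for reporting both directions.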

We tested the significance of the edge and community similarities between DMN and the HDNs by creating a background distribution of the similarities expected at random. We kept the number and size of the communities in DMN, and randomly swapped the assignments of disease nodes to the communities. Then we linked nodes inside a community with probability Pin, and those across communities with probability Pout. Pin and Pout were estimated from the edge density within and between communities in DMN, respectively. We randomized DMN 100 times, and compared each random network to the HDNs to create the background signals. Finally, we compared the observed similarities with the background signals using the Wilcoxon signed-rank test.

4.2.3 Compare DMN with the widely-used disease phenotype network mimMiner

DMN and mimMiner both contain phenotypic knowledge based on clinical observations. Here, we compared DMN with mimMiner to demonstrate that the two phenotype networks contain different knowledge, so that combining them in applications, such as disease gene discovery and drug repositioning, may lead to improved performance. We first mapped the 5,080 diseases in mimMiner from OMIM identifiers to UMLS concept unique identifiers to allow the comparison. Since text mining introduced false positive disease-phenotype relationships, we needed to trade off between data coverage and accuracy in mimMiner. Based on previous analysis [200], we chose to connect two disease nodes if their similarity is above 0.3. The mimMiner network contains 4,391 disease nodes after these processes. We then compared the nodes, edges and community structures between DMN and mimMiner.

4.2.4 Integrate networks

We integrate DMN, mimMiner and the PPI network as shown in Fig. 4.2. To construct DMN, we used 50,543 disease-manifestation pairs from UMLS and calculated pairwise disease similarities based on disease manifestations. Then we downloaded mimMiner [200] and built the PPI network using 37,039 binary interactions among 9,465 genes in the Human Protein Reference Database (HPRD), which has high coverage and accuracy [138] and has been used in many disease gene discovery studies [121, 201, 212, 213].

To construct the heterogeneous network, we linked the disease nodes with the same semantic meanings in DMN and mimMiner using 1,313 pairwise mappings between UMLS and OMIM identifiers from the UMLS metathesaurus. We also connected 1,188 disease nodes in DMN and 1,542 in mimMiner to the gene nodes in the PPI network based on the disease-gene associations in OMIM. Note that our approach can easily incorporate more phenotypic or genetic networks in the same way, given that the new networks contain different knowledge from the existing ones. The adjacency matrix of the heterogeneous network is:

\[
A = \begin{bmatrix}
A_G & A_{GP_1} & A_{GP_2} \\
A_{GP_1}^{T} & A_{P_1} & A_{P_1P_2} \\
A_{GP_2}^{T} & A_{P_1P_2}^{T} & A_{P_2}
\end{bmatrix}, \tag{4.5}
\]

where $P_1$, $P_2$ and $G$ represent DMN, mimMiner and the genetic network, respectively, and the diagonal sub-matrices $A_G$, $A_{P_1}$, and $A_{P_2}$ are their adjacency matrices. The off-diagonal $A_{GP_1}$, $A_{GP_2}$, and $A_{P_1P_2}$ are the adjacency matrices of the bipartite graphs connecting each pair of the three networks, and $A_{GP_1}^{T}$, $A_{GP_2}^{T}$, and $A_{P_1P_2}^{T}$ represent their transposes.

4.2.5 Predict disease genes from the integrated network

Our prediction model was based on random walk with restart, which is a network-based ranking algorithm. The random walk model avoids overemphasizing the connections through high-degree nodes and has been useful in biomedical applications [27, 105, 121]. It simulates a random walker starting from a set of seed nodes and ranks all the nodes by the probability of being reached by the random walker after convergence. We set certain disease nodes as the seeds and ranked all the gene nodes to predict their association with the given diseases.

We extended the algorithm by regulating the movements of the random walker between any two networks among DMN, mimMiner and the PPI network with the jumping probabilities $\lambda_{N_iN_j}$ ($N_i, N_j \in \{P_1, P_2, G\}$) (Fig. 4.2). For example, if the random walker stands on a node in DMN that is connected with both mimMiner and the genetic network, it has the option to walk to mimMiner with the probability $\lambda_{P_1P_2}$, to the PPI network with the probability $\lambda_{P_1G}$, or stay within DMN with the probability $1 - \lambda_{P_1P_2} - \lambda_{P_1G}$.

We calculated the ranking scores for all nodes as follows. Let $p_0$ be the vector of initial scores for each node and $p_k$ the score vector at step $k$, which is iteratively updated by:

\[
p_{k+1} = (1 - \gamma) M^T p_k + \gamma p_0, \tag{4.6}
\]

where $\gamma$ is the probability that the random walker restarts from the seeds at each step, and $M$ is the transition matrix defined based on the adjacency matrix in (4.5).
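The update in (4.6) can be sketched in a few lines. The example below is a minimal illustration, assuming a small row-stochastic transition matrix and an L1 convergence test; the function name, tolerance, and toy matrix are assumptions made for this sketch rather than details of our implementation.

```python
import numpy as np

def random_walk_with_restart(M, p0, gamma=0.7, tol=1e-10, max_iter=1000):
    """Iterate p_{k+1} = (1 - gamma) * M^T p_k + gamma * p0 (Eq. 4.6)
    until the L1 change between two steps falls below tol."""
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - gamma) * (M.T @ p) + gamma * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy row-stochastic transition matrix on a path of four nodes: 0 - 1 - 2 - 3.
M = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.0]])
p0 = np.array([1.0, 0.0, 0.0, 0.0])   # node 0 is the only seed

scores = random_walk_with_restart(M, p0)
ranking = np.argsort(-scores)          # nodes ordered from highest to lowest score
```

In this toy case the scores decrease with the distance from the seed, so nodes closer to the seed are ranked higher.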

The transition matrix consists of three intra-network transition matrices on the diagonal and six inter-network transition matrices off the diagonal:

\[
M = \begin{bmatrix}
M_G & M_{GP_1} & M_{GP_2} \\
M_{GP_1}^T & M_{P_1} & M_{P_1P_2} \\
M_{GP_2}^T & M_{P_1P_2}^T & M_{P_2}
\end{bmatrix} \tag{4.7}
\]

We calculated the inter-network transition matrices in (4.8), which first normalizes the adjacency matrices of the bipartite networks $A_{N_iN_j}$ ($N_i, N_j \in \{P_1, P_2, G\}$), and then weights them with the jumping probabilities between networks $N_i$ and $N_j$:

\[
(M_{N_iN_j})_{kl} =
\begin{cases}
\lambda_{N_iN_j} (A_{N_iN_j})_{kl} \Big/ \sum_l (A_{N_iN_j})_{kl} & \text{if } \sum_l (A_{N_iN_j})_{kl} \neq 0 \\
0 & \text{otherwise}
\end{cases} \tag{4.8}
\]

The intra-network transition matrices were calculated in (4.9), which normalizes the adjacency matrix of a network $N_i$ and weights the matrix with the probability that the random walker jumps within the same network:

\[
(M_{N_i})_{kl} = \Big(1 - \sum_j I_{N_j} \cdot \lambda_{N_iN_j}\Big) (A_{N_i})_{kl} \Big/ \sum_l (A_{N_i})_{kl} \tag{4.9}
\]

In (4.9), $\cdot$ represents the product and $I_{N_j}$ is an indicator function whose value is 1 if the $k$th row of $A_{N_iN_j}$ contains at least one non-zero element and 0 otherwise. For the generic case, where $N$ phenotype networks are incorporated, the transition matrix $M$ is defined as:

\[
M = \begin{bmatrix}
M_G & M_{GP_1} & \cdots & M_{GP_N} \\
M_{GP_1}^T & M_{P_1} & \cdots & M_{P_1P_N} \\
\vdots & \vdots & M_{P_i} & \vdots \\
M_{GP_N}^T & M_{P_1P_N}^T & \cdots & M_{P_N}
\end{bmatrix}. \tag{4.10}
\]

The inter-network transition matrices $M_{N_iN_j}$ (off-diagonal) and intra-network transition matrices $M_{N_i}$ (diagonal) can still be calculated with (4.8) and (4.9), respectively.
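The construction in (4.8) and (4.9) can be sketched for an arbitrary number of networks as follows. This is a minimal illustration under stated assumptions: the helper names, the toy matrices, and the jumping probabilities are chosen for illustration, and the reverse-direction blocks are obtained by row-normalizing the transposed bipartite adjacency matrix with the jumping probability of that direction.

```python
import numpy as np

def row_normalize(A):
    """Divide each row by its sum; rows that sum to zero stay zero."""
    sums = A.sum(axis=1, keepdims=True)
    return np.divide(A, sums, out=np.zeros_like(A, dtype=float), where=sums != 0)

def build_transition(intra, inter, lam):
    """intra: list of square adjacency matrices, one per network.
    inter[(i, j)]: bipartite adjacency (rows: nodes of i, columns: nodes of j).
    lam[(i, j)]: jumping probability from network i to network j."""
    n = len(intra)

    def bipartite(i, j):
        return inter[(i, j)] if (i, j) in inter else inter[(j, i)].T

    blocks = [[None] * n for _ in range(n)]
    for i in range(n):
        stay = np.ones(intra[i].shape[0])           # probability of staying in network i
        for j in range(n):
            if i == j:
                continue
            B = bipartite(i, j).astype(float)
            blocks[i][j] = lam[(i, j)] * row_normalize(B)     # Eq. (4.8)
            has_link = B.sum(axis=1) != 0                      # indicator I_{N_j}
            stay -= has_link * lam[(i, j)]
        blocks[i][i] = stay[:, None] * row_normalize(intra[i].astype(float))  # Eq. (4.9)
    return np.block(blocks)

# Toy data (hypothetical): network 0 = G (3 genes), 1 = P1 (2 diseases), 2 = P2 (2 diseases).
A_G = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
A_P1 = np.array([[0.0, 0.8], [0.8, 0.0]])
A_P2 = np.array([[0.0, 0.5], [0.5, 0.0]])
inter = {
    (0, 1): np.array([[1, 0], [0, 0], [0, 1]]),
    (0, 2): np.array([[1, 0], [0, 1], [0, 0]]),
    (1, 2): np.eye(2),
}
lam = {(1, 2): 0.1, (2, 1): 0.1, (1, 0): 0.7, (2, 0): 0.7, (0, 1): 0.4, (0, 2): 0.4}

M = build_transition([A_G, A_P1, A_P2], inter, lam)
```

In this toy example every node has at least one neighbor inside its own network, so each row of `M` sums to one and `M` can be used directly in the iteration of (4.6).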

Our gene prediction model accumulates evidence from different disease phenotype networks and preserves the unique information in each network.

For example, if a pair of diseases are connected in both DMN and mimMiner, the random walker can reach one disease node from the other with a strengthened probability; if the diseases are connected in only one network, the random walker may still reach one disease from the other through the links between networks, but with a relatively lower probability.

4.2.6 Evaluate gene prediction in cross validation analyses

We first performed a leave-one-out cross validation analysis and compared our approach with a baseline method [121], which used only one phenotype network. We removed one disease-gene association each time, set the disease as the seed, and tested the rank of the retained gene. If the same disease appeared in both phenotype networks (that is, the disease nodes from the two networks have the same semantic meaning) and was connected to the same gene, the redundant disease-gene association was also removed.

We evaluated the ranks of the tested genes with two metrics: (1) we calculated the percentage of successful prioritizations, in which the retained genes were ranked in top one (excluding the other known disease genes), and (2) we generated a receiver operating characteristic (ROC) curve for each method and calculated the area under the curve (AUC). To generate the ROC, we followed the definitions in [3, 105, 121]: sensitivity refers to the percentage of tested genes that are ranked above a particular threshold among all prioritizations, and specificity refers to the percentage of genes ranked below this threshold. For instance, a sensitivity/specificity value of 70/90 indicates that the correct disease gene was ranked among the top 10% of genes in 70% of the prioritizations. The ROC plots sensitivity against 1-specificity as the rank threshold varies from the top to the bottom.

The two metrics are complementary: the AUC evaluates the entire rank of genes, while the success ratio is stricter and evaluates only the top-ranked genes.
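The two metrics can be illustrated with a short sketch. It assumes that each validation run yields the 1-based rank of the retained gene among a fixed number of candidate genes; the function name and the toy ranks are hypothetical, and the trapezoidal integration is one reasonable way to compute the area.

```python
import numpy as np

def loo_metrics(ranks, n_candidates):
    """ranks: 1-based rank of the retained gene in each validation run.
    Returns (success ratio, rank-based AUC)."""
    ranks = np.asarray(ranks)
    success_ratio = float(np.mean(ranks == 1))       # retained gene ranked in top one

    thresholds = np.arange(0, n_candidates + 1)
    # Sensitivity: fraction of runs whose retained gene is ranked within the threshold.
    sensitivity = np.array([np.mean(ranks <= t) for t in thresholds])
    # 1 - specificity: fraction of candidate genes ranked within the threshold.
    one_minus_spec = thresholds / n_candidates

    # Trapezoidal area under the sensitivity vs. (1 - specificity) curve.
    widths = one_minus_spec[1:] - one_minus_spec[:-1]
    heights = (sensitivity[1:] + sensitivity[:-1]) / 2.0
    auc = float(np.sum(widths * heights))
    return success_ratio, auc

# Five hypothetical validation runs over 10 candidate genes.
success_ratio, auc = loo_metrics([1, 1, 2, 5, 10], n_candidates=10)
```

In this toy input two of the five retained genes are ranked first, so the success ratio is 0.4, while the AUC also rewards the runs in which the retained gene is ranked second or fifth.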

Currently, the causal genes for over 1,500 genetic disorders remain unknown [8]. A primary advantage of phenotype-driven gene prediction approaches, compared to the conventional gene function-driven approaches, is that they can predict genes for diseases without a known genetic basis. Therefore, we further conducted a de novo gene prediction analysis to evaluate our approach. In de novo gene prediction, we removed all disease-gene links for a query disease each time. If the disease appeared in both phenotype networks, we removed all its gene associations through both phenotype networks. Then we set the disease as the seed, ranked all the genes, and compared the AUCs between different approaches. The settings of this experiment differ from those of the leave-one-out cross validation because multiple retained genes are tested in each prioritization. We generated an ROC curve for each prioritization following the definitions in [42, 95] and averaged the AUCs across all prioritizations. For each ROC, sensitivity is the percentage of retained genes that are ranked above a threshold among all the retained genes in one prioritization, and specificity is the percentage of negative genes (genes that are not known disease genes) ranked below the threshold among all the negative genes. Since the top-ranked genes are more important than the lower-ranked genes, we highlighted a set of false positive cutoffs for the ROC curves and compared the corresponding average AUCs between methods. A better method will rank more true positive genes above the false positives, resulting in larger average AUCs at smaller cutoffs.
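A per-prioritization AUC restricted to the first false positives can be sketched as below. The normalization used here (the number of retained genes ranked above each of the first n false positives, divided by n times the number of retained genes) is an assumption made for illustration; the definitions in [42, 95] should be consulted for the exact form. The function name and toy ranking are hypothetical.

```python
def auc_at_cutoff(labels_in_rank_order, n_fp=None):
    """labels_in_rank_order: 1 for a retained (positive) gene, 0 for a negative
    gene, listed from the best rank to the worst. Assumes at least one positive
    and one negative. n_fp=None uses all negatives (the full AUC)."""
    total_pos = sum(labels_in_rank_order)
    total_neg = len(labels_in_rank_order) - total_pos
    if n_fp is None or n_fp > total_neg:
        n_fp = total_neg

    tp_seen = 0
    fp_seen = 0
    area = 0
    for label in labels_in_rank_order:
        if label == 1:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen          # positives ranked above this false positive
            if fp_seen == n_fp:
                break
    return area / (n_fp * total_pos)

# One hypothetical prioritization: three retained genes among seven ranked genes.
ranking = [1, 0, 1, 0, 0, 1, 0]
full_auc = auc_at_cutoff(ranking)          # all four negatives
auc_2 = auc_at_cutoff(ranking, n_fp=2)     # only the first two false positives
```

Averaging such values over all query diseases gives one number per cutoff and method, which is the kind of quantity compared across methods.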

4.2.7 Evaluate gene prediction for different disease classes

The degree to which phenotypic associations reflect genetic overlaps varies across disease classes. Thus phenotype-driven gene predictions may have varying performance. We classified diseases into nine groups based on the International Classification of Diseases (10th edition), and repeated the two cross validation experiments within each group to evaluate the performance variance of our method.

4.2.8 Investigate translational potential in drug discovery of the predicted genes for Crohn’s disease

We used Crohn’s disease as an example to demonstrate that our gene prediction method has the translational potential to guide drug discovery. Crohn’s disease is a chronic and relapsing inflammatory disorder that affects millions of people and has an increasing prevalence [137]. It involves genetic abnormalities that lead to overly aggressive responses to commensal enteric bacteria [179]. Current treatment options, such as systemic anti-inflammatory drugs, targeted drugs, and surgeries, may be effective for only a subset of patients or lead to severe side effects [21]. Therefore, discovering new drug therapies for Crohn’s disease is of great interest.

We first predicted genes for Crohn’s disease using our approach. Then we compared the result with the disease-associated genes in the genome-wide association studies (GWAS) catalog [84]. We also evaluated the ranks of drug target genes extracted from DrugBank [112]. We hypothesized that if the predicted genes are useful for guiding drug discovery, the top-ranked candidate disease genes would be enriched for the disease-associated genes in GWAS and for drug target genes.

Then we extracted 1,190 drugs targeting the genes in our PPI network using the drug-target data from DrugBank. We ranked these candidate drugs based on the sum of the random walk scores for their target genes. We validated our rank of candidate drugs with 7 FDA-approved Crohn’s disease drugs (extracted from the drug-indication data in DrugBank), and further investigated the literature evidence for the top 200 candidate drugs.
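The drug ranking step can be sketched as a simple aggregation. The gene scores, gene identifiers, and drug names below are hypothetical placeholders; in particular, treating a target gene without a score as contributing zero is an assumption of this sketch.

```python
def rank_drugs(gene_scores, drug_targets):
    """Score each drug by the sum of the random walk scores of its target genes
    and return (drug, score) pairs sorted from the highest to the lowest score."""
    drug_scores = {
        drug: sum(gene_scores.get(gene, 0.0) for gene in targets)
        for drug, targets in drug_targets.items()
    }
    return sorted(drug_scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical random walk scores and drug-target lists.
gene_scores = {"GENE_A": 0.30, "GENE_B": 0.05, "GENE_C": 0.20}
drug_targets = {
    "drug_x": ["GENE_A", "GENE_C"],
    "drug_y": ["GENE_B"],
    "drug_z": ["GENE_C", "GENE_D"],   # GENE_D has no score and contributes 0
}
ranked = rank_drugs(gene_scores, drug_targets)
```

A summed score favors drugs with several highly ranked targets; other aggregations, such as the maximum or the mean, would weight multi-target drugs differently.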

4.3 Results

4.3.1 DMN network properties

DMN contains 2,305 nodes and 373,527 edges. The network has a long-tail degree distribution and is robust to random removal of nodes, whereas removing the nodes with large degrees can quickly break down the network into small components (Figure 4.3). Table 4.1 lists the network properties of DMN. To understand DMN better, we also show the properties of three other disease networks: OMIM-based HDN, GWAS-based HDN, and mimMiner. DMN is denser than mimMiner, but the nodes tend to cluster into disjoint components. Both phenotype networks are evidently different from the genetic networks: DMN and mimMiner are denser (higher network density), less cliquish (lower clustering coefficients), and more connective (fewer connected components) than HDNs. Figure 4.4 shows example subnetworks from DMN, mimMiner, and HDNs containing randomly sampled nodes. In contrast to the densely connected subnetworks of DMN and mimMiner, OMIM-based HDN mostly contains small components such as triangles and chains. GWAS-based HDN contains complex diseases, which are often associated with multiple genes; thus its edge density is higher than that of OMIM-based HDN, but still lower than that of DMN.

Table 4.1: Global properties of DMN and the other disease networks, including HDNs (genetic disease networks) and mimMiner (a widely used phenotype network based on OMIM text mining). The last three columns represent the average shortest path, average clustering coefficient, and number of connected components, respectively.

Disease network   Number of nodes   Network density   Network diameter   Avg. shortest path   Avg. clu. coeff.   Conn. comp.
DMN               2305              0.14              6                  2.042                0.649              6
HDN(OMIM)         2974              0.001             9                  2.341                0.74               797
HDN(GWAS)         355               0.054             5                  2.505                0.702              17
MimMiner          4391              0.044             7                  2.445                0.421              1

[Figure 4.3 plot: size of the largest component (y-axis, 0 to 1) against the fraction of nodes removed (x-axis, 0 to 50%), with two series: "Attack random nodes" and "Attack hubs".]

Figure 4.3: Robustness of DMN with respect to the removal of random nodes and hub nodes.
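A robustness analysis of this kind can be sketched with a plain breadth-first search. The toy graph, the removal fractions, and the normalization of the largest component by the original number of nodes are assumptions made for illustration; this is not the code used to produce Figure 4.3.

```python
import random
from collections import deque

def largest_component(adj, removed):
    """Size of the largest connected component after deleting the removed nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best

def attack_curve(adj, order, fractions):
    """Relative size of the largest component after removing the first
    int(f * n) nodes of `order`, for each fraction f."""
    n = len(adj)
    return [largest_component(adj, order[:int(f * n)]) / n for f in fractions]

# Toy graph: node 0 is a hub linked to nodes 1..9, plus one extra edge 1-2.
adj = {i: set() for i in range(10)}
for u, v in [(0, i) for i in range(1, 10)] + [(1, 2)]:
    adj[u].add(v)
    adj[v].add(u)

fractions = [0.0, 0.1, 0.2]
hub_order = sorted(adj, key=lambda node: len(adj[node]), reverse=True)
random_order = random.Random(0).sample(list(adj), len(adj))

hub_curve = attack_curve(adj, hub_order, fractions)
random_curve = attack_curve(adj, random_order, fractions)
```

In the toy graph, deleting the single hub already splits the network into components of at most two nodes, which mirrors the qualitative behavior expected of a network with a long-tail degree distribution.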

[Figure 4.4: four network panels, (a) to (d), with disease names as node labels.]

Figure 4.4: Randomly selected subgraphs of (a) DMN (b) mimMiner (c) OMIM-based HDN and (d) GWAS-based HDN. Only part of the node labels are shown in the figure due to space limit. In contrast to DMN and mimMiner, the sub-graphs in HDNs are less connective and cliquish.

The differences in global structures between phenotype and genetic disease networks indicate that we may not have fully discovered the genes accounting for the observed phenotypic connections. Systematically studying the disease phenotype networks offers a chance to detect new disease genes, particularly for diseases whose genetic basis is completely unknown. Note that non-genetic factors, such as common environments and lifestyles, may also contribute to the overlapping phenotypes. To evaluate the potential of phenotype networks to predict disease genes, we show the correlation between phenotypic and genetic relationships in the next section.

4.3.2 DMN partially correlates with the genetic disease networks

In the first experiment, we found that the manifestation similarities in DMN correlate with quantified measures of disease genetic associations. Figure 4.5 (left) shows that the disease pairs with larger manifestation similarities (higher ranks) are more likely to share genes. The Pearson’s correlation between the ranks of manifestation similarities and the probabilities of sharing genes is -0.603 (p ≪ E−8). Also, Figure 4.5 (right) shows that diseases with larger manifestation similarities tend to share more genes. The Pearson’s correlation between the ranks of manifestation similarities and the average number of shared genes is -0.647 (p ≪ E−8).

Figure 4.5: Correlation between manifestation similarities and genetic associations. Left: Correlation between the proportion of genetically associated disease pairs (x-axis) and the phenotype similarity ranks (y-axis) in DMN. Right: Correlation between the average numbers of genes shared by disease pairs (x-axis) and the phenotype ranks (y-axis) in DMN. Diseases with larger phenotype similarity in DMN tend to have stronger genetic association.

We found that only a small percentage of disease pairs share associated genes despite the significant correlations between phenotype similarities and genetic associations. For example, among the top five disease pairs with the highest phenotype similarities, only one pair shared associated genes. This observation indicates that the overlapping manifestations may result from unknown genes, shared pathways, protein complexes, or common environment. Discovering unknown genetic factors responsible for overlapping phenotypes among diseases is one of the goals of studying the disease phenotype networks.

In the second experiment, we compared the edges and community structures of DMN with those of the genetic disease networks. Table 4.2 shows that the number of common edges between DMN and HDNs is significantly higher than the random distribution. We found that mimMiner also contains 520 common edges with OMIM-based HDN and 14 with GWAS-based HDN. However, DMN and mimMiner share different disease connections with HDNs: 76 of 278 (27%) edge overlaps between DMN and OMIM-based HDN do not appear in mimMiner, and 5 of 6 edge overlaps between DMN and GWAS-based HDN do not appear in mimMiner.
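The comparison of an observed edge overlap with a random expectation can be sketched as follows. The randomization used here, a random relabeling of the nodes of network B, keeps the topology of B intact but is only one possible null model and is an assumption of this sketch; the function name, the toy edge lists, and the empirical p-value estimate are likewise illustrative.

```python
import random

def overlap_with_null(edges_a, edges_b, n_random=200, seed=0):
    """Count edges shared by networks A and B, and compare the count with
    the overlaps obtained after randomly relabeling the nodes of B."""
    a = {frozenset(edge) for edge in edges_a}
    b = {frozenset(edge) for edge in edges_b}
    observed = len(a & b)

    nodes_b = sorted({node for edge in b for node in edge})
    rng = random.Random(seed)
    null_counts = []
    for _ in range(n_random):
        shuffled = nodes_b[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(nodes_b, shuffled))
        b_random = {frozenset(relabel[node] for node in edge) for edge in b}
        null_counts.append(len(a & b_random))

    mean_null = sum(null_counts) / n_random
    # Empirical one-sided p-value with a +1 correction.
    p_value = (1 + sum(count >= observed for count in null_counts)) / (1 + n_random)
    return observed, mean_null, p_value

# Hypothetical disease-disease edges in two networks.
edges_a = [("d1", "d2"), ("d2", "d3"), ("d4", "d5")]
edges_b = [("d1", "d2"), ("d2", "d3"), ("d3", "d4"), ("d5", "d6")]
observed, mean_null, p_value = overlap_with_null(edges_a, edges_b)
```

An empirical p-value from a few hundred randomizations cannot resolve very small probabilities, so a parametric test on the null distribution would be needed to report values far below the reciprocal of the number of randomizations.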

Table 4.2: Comparison of the edge overlaps N between DMN and the genetic disease networks. Network B′ represents the randomized graph that preserves the properties of Network B. Column N(A,B′) represents the average number of edge overlaps between network A and the randomized networks.

Network A    Network B   N(A,B)   N(A,B′)   P-value
HDN(OMIM)    DMN         278      65.4      ≪ E−8
HDN(GWAS)    DMN         6        2.93      ≪ E−8

Table 4.3 lists the community structure similarities between DMN and HDNs. If two diseases are grouped together in OMIM-based HDN, they have a greater than 60% chance of staying in one community in DMN. On the other hand, diseases in one community in DMN have a 0.6% chance of being grouped together in OMIM-based HDN. The absolute values of community structure similarities may be biased: OMIM-based HDN mostly contains small clusters, and the probability of two diseases sharing one cluster is naturally low. However, a statistical test shows that the similarities in community partitions between DMN and HDN are significantly higher than the random distribution, indicating that the observed similarities reflect intrinsic correlations between the biological networks. The community structure correlation between DMN and GWAS-based HDN is also significant compared with random signals.
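The directional similarity described above can be read as a conditional frequency: among the disease pairs that share a community in network A, the fraction that also share a community in network B. The sketch below implements this reading, which is an assumption based on the interpretation given in the text; the function name and toy partitions are hypothetical.

```python
from itertools import combinations

def community_similarity(comm_a, comm_b):
    """comm_a, comm_b: dicts mapping a disease to its community label.
    Returns the fraction of disease pairs grouped together in A that are
    also grouped together in B, over diseases present in both networks."""
    shared = sorted(set(comm_a) & set(comm_b))
    together_a = [(u, v) for u, v in combinations(shared, 2)
                  if comm_a[u] == comm_a[v]]
    if not together_a:
        return 0.0
    both = sum(comm_b[u] == comm_b[v] for u, v in together_a)
    return both / len(together_a)

# Hypothetical community partitions of four diseases in two networks.
comm_a = {"d1": 0, "d2": 0, "d3": 1, "d4": 1}
comm_b = {"d1": "x", "d2": "x", "d3": "x", "d4": "y"}

s_a_to_b = community_similarity(comm_a, comm_b)
s_b_to_a = community_similarity(comm_b, comm_a)
```

The measure is asymmetric by construction: a network with many small communities yields few co-clustered pairs, so the similarity measured from the coarser partition toward the finer one tends to be low.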

Table 4.3: Comparison of the community structures between DMN and the genetic disease networks. $S_{A \to B}$ and $S_{B \to A}$ represent the two-way similarity in community partitions between networks A and B.

Network A    Network B   S_{A→B}   S_{A→B′}   P-value   S_{B→A}   S_{B′→A}   P-value
HDN(OMIM)    DMN         0.655     0.281      ≪ E−8     0.006     0.002      ≪ E−8
HDN(GWAS)    DMN         0.611     0.279      ≪ E−8     0.297     0.156      ≪ E−8

In summary, DMN is partially correlated with the genetic disease networks in both edges and community structures. On the one hand, the phenotype relationships among diseases in DMN reflect shared genetic mechanisms. On the other hand, many disease-associated genes and pathways may not have been discovered yet. In addition, the comparative analysis with HDNs also shows that DMN and mimMiner contain different knowledge. The phenotype relationships in DMN have the potential to provide leads for discovering new disease genetics.

4.3.3 DMN contains knowledge different from mimMiner

We compared DMN with the widely used phenotype network mimMiner to show their differences. Table 4.4 summarizes their differences in nodes, edges, and community structures. Though DMN shares 75% of the nodes with mimMiner, 295,975 edges (79.2%) are unique and do not appear in mimMiner. Examples of the unique edges are schizophrenia–myopia, autism–tuberous sclerosis, and familial mediterranean fever–alport syndrome. We extracted all unique disease pairs in DMN and made the data publicly accessible. In addition, the community structures of DMN and mimMiner are partly correlated. The community similarities in the two directions are comparable and both moderate, showing that we cannot completely predict the phenotype clusters in one network based on the other. Therefore, the knowledge captured in DMN and mimMiner is complementary. Integrating these two networks is valuable for better prediction of candidate disease genes.

Table 4.4: Comparison of DMN with mimMiner in nodes, edges, and community structures.

Network A   Network B   Unique Node    Unique Edge       W_{A→B}   W_{B→A}
mimMiner    DMN         582 (25.2%)    295,975 (79.2%)   0.392     0.533

4.3.4 Integrating DMN with mimMiner significantly improves the performance of disease gene predictions

We compared our gene prediction approach with a baseline method, which integrated mimMiner and the PPI network used in our approach and predicted disease genes with a random walk model [121]. We tuned the parameters of both methods to achieve optimal performance in the cross validations, although different parameter values only slightly affect the results. For our method, the jumping probabilities $\lambda_{P_1P_2}$ and $\lambda_{P_2P_1}$ were set to 0.1; $\lambda_{P_1G}$ and $\lambda_{P_2G}$ were set to 0.7; and $\lambda_{GP_1}$ and $\lambda_{GP_2}$ were set to 0.4. For the baseline method, the jumping probability between mimMiner and the PPI network was set to 0.9. The probability of restarting from the seeds ($\gamma$ in (4.6)) was set to 0.7 for both methods.

Leave-one-out cross validation

Our approach achieved a significantly better success ratio and AUC than the baseline method. The integrated network in our approach contains a total of 2,397 unique disease-gene associations. If one disease appeared in both phenotype networks and was connected to the same gene, the two disease-gene links were counted only once. In 1,100 out of 2,397 validation runs (45.89%), our approach successfully ranked the retained gene in top one. The success ratio is significantly higher (p < e−4) than the 10.36% achieved by the baseline method (Table 4.5). In addition, Fig. 4.6 compares the ROC curves of the gene prediction methods. Our approach achieved an AUC of 90.65%, which is significantly higher (p < e−4) than the 84.2% achieved by the baseline approach.

Table 4.5: Ratios of successful disease-gene association predictions in the leave-one-out cross validation experiment. All diseases were included in the experiment.

Phenotype networks   Success number   Success ratio
mimMiner             219              10.36%
DMN and mimMiner     1100             45.89%

[Figure 4.6 plot: true positive rate (y-axis) against false positive rate (x-axis), with two curves: both mimMiner and DMN (AUC: 0.91) and mimMiner (AUC: 0.84).]

Figure 4.6: The ROC curves and AUCs for our method (red) and the baseline method (blue) in the leave-one-out cross validation analysis.

De novo gene prediction

Our approach is effective in de novo gene predictions and outperforms the baseline method by boosting the phenotype knowledge. Specifically, our method achieves an average AUC of 90.33%, which is significantly higher than the 81.28% achieved by the baseline method using mimMiner alone (p < e−12). Fig. 4.7 shows that at six false positive cutoffs, integrating DMN and mimMiner achieves significantly higher AUCs (p < e−18) than using only mimMiner. For example, at the cutoff of 10, we achieve an average AUC of 59.19%, while that of the baseline method is 24.17% (p < e−95). For the diseases that have only one associated gene in OMIM, our method successfully ranked the tested genes in top one for 52.12% of the diseases, while the baseline method succeeded in 11.47% of the prioritizations (p < e−4). These results show that de novo gene prediction highly depends on disease phenotype relationships, and that our method successfully took advantage of the more comprehensive knowledge in multiple phenotypic networks to achieve better performance.

[Figure 4.7 bar chart: AUCs (y-axis) for mimMiner and DMN+mimMiner at the false positive cutoffs 10, 50, 100, 300, 500, 1000, and 9465 (x-axis).]

Figure 4.7: Average AUCs of de novo gene prediction for our approach (red) and the baseline approach (green). The comparisons are on overall AUCs, as well as the AUCs when the numbers of false positive genes are up to 10, 50, 100, 300, 500, and 1000.

4.3.5 Our method achieves high but varying performance for different disease classes

We evaluated the approach for nine disease classes. In the leave-one-out cross validation, 93.4% of the retained genes were ranked within the top 100, and the AUCs for all disease classes are close and above 90%. However, the ranks of the retained genes vary within the top 100 for different disease classes. Fig. 4.8 shows the top part of the ROC curves for each disease class. The corresponding AUC is the highest for “congenital malformations and deformations,” and lowest for “mental diseases” and “malignant neoplasms.” Table 4.6 (the column “All diseases”) compares the success ratio between disease classes and shows that our approach ranked 78% of the retained genes for “congenital malformations and deformations” in top one, but only 26% and 27% of the retained genes for “malignant neoplasms” and “mental diseases,” respectively.

[Figure 4.8 plot: sensitivity (y-axis) against 1−specificity (x-axis, 0 to 0.012), one curve per disease class: Cardiovascular disease (AUC: 0.86), Congenital malformations and deformations (AUC: 0.95), Digestive system disorder (AUC: 0.87), Malignant neoplasm (AUC: 0.84), Mental disorder (AUC: 0.79), Metabolic disorder (AUC: 0.87), Musculoskeletal and connective tissue disorder (AUC: 0.91), Nervous system disorder (AUC: 0.92), Skin and subcutaneous tissue disease (AUC: 0.93).]

Figure 4.8: The ROC curves for each disease class in de novo gene prediction. The comparisons include the top part of ROC curves and AUC scores based on the top 100 genes in each validation run.

Table 4.6: Success ratio of disease-gene association predictions for all diseases and monogenetic diseases in the nine disease classes.

Disease classes                                   All diseases   Monogenetic diseases
Congenital malformations and deformations         77.97%         90.48%
Skin and subcutaneous tissue disease              70.80%         81.58%
Nervous system disorder                           66.67%         89.89%
Musculoskeletal and connective tissue disorder    65.09%         84.06%
Digestive system disorder                         65.06%         80.00%
Metabolic disorder                                61.67%         75.33%
Cardiovascular disease                            48.84%         84.09%
Mental disorder                                   27.12%         71.43%
Malignant neoplasm                                26.04%         50.00%

In the de novo gene prediction, we observed similar performance variance among the nine disease classes. Fig. 4.9 shows that the averaged AUC is the highest for “congenital malformations and deformations” and lowest for “malignant neoplasms” at all cutoffs. Table 4.6 (the column “Monogenetic diseases”) shows that for monogenetic diseases, which have only one gene in OMIM, 90% of the predictions ranked the disease genes for “congenital malformations and deformations” in top one, while 50% of the predictions succeeded for “malignant neoplasms.”

[Figure 4.9 bar chart: AUC (y-axis) at AUC10, AUC50, AUC100, AUC300, AUC500, AUC1000, and overall AUC (x-axis, "Candidate gene selection for diseases in each class"), one series per disease class.]

Figure 4.9: The ROC curves for each disease class in de novo gene prediction.

We traced the disease phenotype features to explain the performance variance. “Congenital malformations and deformations” often have specific phenotypic features. For example, otospondylomegaepiphyseal dysplasia (OSMED) has manifestations such as “Sensorineural Hearing Loss” and “Pierre Robin Syndrome.” These features link OSMED to phenotypically similar diseases in the network, such as Stickler syndrome and Marshall syndrome, which are also genetically related to OSMED. On the other hand, “malignant neoplasms” usually have non-specific manifestations, such as pain, fever, and ascites, which are common in cancers with different genetic causes. Therefore, while our approach achieves high performance for all disease classes, building disease-specific models and introducing prior knowledge of disease phenotypes may further improve the accuracy of disease gene predictions.

4.3.6 Our gene prediction method has the potential to guide the drug discovery for Crohn’s disease

We ranked the 9,465 genes in the PPI network for Crohn’s disease and compared the result with 70 Crohn’s disease genes from the GWAS catalog. These 70 genes also appear in our gene rank and have no overlap with the data in OMIM. Fig. 4.10 A1 shows that the number of GWAS disease genes drops as the rank based on our approach changes from the top to the bottom, while this number is distributed evenly among random ranks (Fig. 4.10 A2). Among the top 10% in our rank, we found 19 overlaps with the GWAS disease genes, which is a 2.5-fold enrichment (p < e−4) compared with the average of 50 random gene ranks. The result shows that our approach tends to prioritize the disease genes obtained through statistical analysis of large-scale patient data.

[Figure 4.10 panels: A1. Distribution of Crohn’s disease genes from GWAS in our rank; A2. Average number of GWAS disease genes among 50 random ranks; B1. Distribution of drug target genes in our rank; B2. Average number of drug target genes among 50 random ranks; C1. Distribution of FDA-approved drugs in our rank; C2. Average number of FDA-approved drugs among 50 random ranks.]

Figure 4.10: A1-A2: Compare our gene rank with the Crohn’s disease genes from GWAS. B1-B2: Compare our gene rank with the drug target genes. C1-C2: Compare our drug rank with the FDA-approved drugs.

Among the top genes in our rank, we found RIPK2, NLRC4, and ERBIN, which have substantial literature support for their roles in Crohn’s disease [72, 97, 110, 127, 161, 187, 194] and directly interact with NOD2 (a Crohn’s disease gene in OMIM). In addition, we found literature evidence supporting a few top-ranked genes that do not directly interact with the disease genes from OMIM and were not identified in GWAS. For example, NLRP3 (ranked top 32), CASP1 (ranked top 45), and BCL10 (ranked top 46) are associated with the innate immune responses to the intestinal microbiota, which have been linked with the pathogenesis of Crohn’s disease [31, 85, 145, 204].

Table 4.7: Drug candidates for Crohn’s disease that are supported by literature.

Rank   Drugs           Current drug indications      References
3      tocilizumab     rheumatoid arthritis          [73, 147]
11     sargramostim    myeloid reconstitution        [108, 174]
31     minocycline     infections                    [130]
78     amitriptyline   depression                    [166]
80     desipramine     depression                    [167]
86     mecasermin      growth failure                [165, 172]
194    thalidomide     erythema nodosum leprosum     [113]

We also investigated the distribution of 1,502 drug target genes (from DrugBank) in our gene rank. Fig. 4.10 B1 and B2 show that our rank is more likely to prioritize druggable genes than the random ranks. The top 10% of genes in our rank contain 331 drug target genes, which is a 2.1-fold enrichment (p < e−21) compared to the average of random cases. The result shows that our top-ranked predicted genes are enriched for druggable genes associated with Crohn’s disease and offer opportunities to detect candidate drugs for Crohn’s disease.
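The fold enrichment reported for the top of the rank can be illustrated with a small sketch that compares the number of reference genes in the top fraction of a ranked list with the average over shuffled lists. The gene identifiers, the reference set, and the function name are hypothetical; the number of random ranks defaults to 50 to match the comparison described above, and the shuffling scheme is an assumption about how random ranks might be generated.

```python
import random

def fold_enrichment(ranked_genes, gene_set, top_fraction=0.1, n_random=50, seed=0):
    """Count members of gene_set in the top fraction of ranked_genes and
    compare with the average count over randomly shuffled ranks."""
    k = int(len(ranked_genes) * top_fraction)
    observed = sum(gene in gene_set for gene in ranked_genes[:k])

    rng = random.Random(seed)
    random_counts = []
    for _ in range(n_random):
        shuffled = ranked_genes[:]
        rng.shuffle(shuffled)
        random_counts.append(sum(gene in gene_set for gene in shuffled[:k]))

    mean_random = sum(random_counts) / n_random
    fold = observed / mean_random if mean_random > 0 else float("inf")
    return observed, mean_random, fold

# Hypothetical rank of 100 genes and a reference set of five genes.
ranked_genes = [f"g{i}" for i in range(100)]
reference = {"g0", "g1", "g2", "g50", "g90"}
observed, mean_random, fold = fold_enrichment(ranked_genes, reference)
```

With three of the five reference genes in the top ten positions, the observed count is well above the expectation of about 0.5 under random ordering.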

We ranked 1,190 candidate drugs (from DrugBank) based on the sum of the random walk scores for their target genes. Fig. 4.10 C1-C2 show that our approach can prioritize the approved Crohn’s disease therapies. The top 200 in our rank contain 4 FDA-approved drugs, which is a 3.3-fold enrichment (p < e−3) compared with the average of random cases. Note that these 4 approved drugs, namely Sulfasalazine, Mesalazine, Adalimumab, and Natalizumab, do not directly target the Crohn’s disease genes in OMIM, and were detected through the genes prioritized by our approach. We further investigated the other candidate drugs in the top 200 of our rank, and found that a number of them are supported by literature evidence as candidate Crohn’s disease treatments. Table 4.7 shows a few examples of candidate drugs and their supporting references. Among them, the efficacy of tocilizumab has recently been tested in a randomized clinical trial [113] and showed positive results in clinical remission.

4.4 Discussion

Incorporating clinical phenotype data can improve the prediction power of disease gene discovery methods. In this study, we developed a disease gene prediction framework leveraging multiple human phenotype data sources. We explored a unique phenotype data source and constructed a new phenotype network called DMN. We designed an innovative strategy to predict disease-associated genes from the heterogeneous network combining DMN with mimMiner (a widely used phenotype database) and a genetic network. Compared with the gene prediction approach using only one phenotype network, our approach significantly improved the performance through boosting phenotypic knowledge. Using Crohn’s disease as an example, we demonstrated that our gene prediction result has translational potential to guide drug discovery.

As more human disease phenotype data become available, our approach can be further improved by integrating new disease phenotype networks, provided that the new networks contain different knowledge. For example, our approach in this study included many Mendelian diseases. Adding phenotypic associations involving common complex diseases may offer novel insights. Also, the phenotypic relationships in this study are primarily based on disease-manifestation pairs. Other kinds of disease phenotype data, such as disease comorbidities and gene expression profiles, may reflect different aspects of genetic mechanisms. In the future, we will develop new approaches to rationally integrate heterogeneous phenotype data. For common complex diseases, we will also incorporate multiple types of genetic associations besides the PPI network, such as the gene regulatory network, into the approach.

In addition, phenotype-driven disease gene prediction approaches are effective to different degrees for different disease classes (as we have demonstrated) and among different patients. Building disease-specific and patient-specific computational models may further improve the quality of disease gene predictions. We recently studied cancer-specific comorbidities and analyzed the variation of comorbidity patterns among patients stratified into different age and gender brackets [43, 44]. Based on these results, we plan to build a cancer-specific gene prediction model.

Currently, we directly use disease-gene associations in drug discovery. The method for identifying candidate drugs can be further enhanced if more detailed information is available, including drug actions and disease pathogenesis, such as the direction of the genetic abnormality. For example, if a disease results from a loss of function, agonists will be potential drugs, whereas antagonists will lead to side effects. In future work, we will develop a rational drug discovery approach on the basis of our result and more data on both diseases and drugs.

4.5 Conclusions

We constructed a new phenotype network, DMN, using a unique data source of human disease phenotypes, and demonstrated that it contains knowledge different from that in the widely used phenotype network mimMiner. We designed an innovative strategy to predict disease-associated genes from the heterogeneous network combining DMN with mimMiner and a genetic network. Our approach achieved significantly better performance than the gene prediction approach using only one phenotype data source. We applied the approach to Crohn’s disease and demonstrated that the gene prediction result has translational potential to guide drug discovery.

Chapter 5

Studying disease comorbidity network to detect genetic evidences for disease links: application on colorectal cancer and obesity

5.1 Motivation

A number of epidemiological studies suggest that obesity increases the risk of colorectal cancer (CRC) [17, 35, 102]. Based on this evidence of co-occurrence, many genetic factors have been proposed to explain the role of obesity in the development of CRC. For example, both animal and human studies have demonstrated that increased release of insulin and reduced insulin signaling play roles in obesity and colorectal carcinogenesis [115, 164, 168]. Experiments also show that obesity leads to altered levels of adipocytokines, such as adiponectin [6, 55, 209] and leptin [188, 190], which may either prevent or foster carcinogenesis.

The mechanism for the association between obesity and CRC is multifactorial and inconclusive [56, 102]. Shared comorbidities between obesity and CRC can provide unique insights into the common genetic basis for the two diseases. For example, type 2 diabetes is highly correlated with obesity and was identified as a risk factor for CRC [28]. A few studies then discovered that genetic factors of insulin resistance, which occurs in type 2 diabetes, contribute to explaining the role of obesity in CRC [106]. However, both obesity and CRC are heterogeneous conditions. Over 40% of the obese population is not characterized by the presence of insulin resistance [134]. We hypothesize that systems approaches to studying the diseases that are phenotypically significant to both CRC and obesity may offer new insights into the common molecular mechanisms between the two interconnected diseases.

Systematic comorbidity studies have been conducted previously, but mostly focused on pairwise comorbidities and their genetic overlaps. Rzhetsky et al. developed a statistical model to estimate the co-occurrence relationship for each pair of 160 diseases [175], and demonstrated that comorbidities are genetically linked. Park et al. [157] and Hidalgo et al. [82] detected comorbidity pairs from Medicare claims (which only contain senior patients aged 65 or older) with statistical measures. Roque et al. mined pairwise disease correlations using similar measures from medical records of a psychiatric hospital [171].

In this study, we developed a novel approach to detect diseases that have strong connections with both obesity and CRC in a comorbidity network. Specifically, we first mined disease comorbidity relationships from a new data source and constructed a novel disease comorbidity network. Then we extracted the local network consisting of all the paths between obesity and CRC, and prioritized the nodes (diseases) that play critical roles in maintaining the connection between the two diseases (Fig. 5.1). Substantial literature evidence supports that the top-ranked diseases have associations with both obesity and CRC. We investigated the gene expression profiles of a prioritized comorbid disease to facilitate detecting the novel genetic basis underlying the link between obesity and CRC. Our approach can be generalized to study the genetic basis of other disease associations.

Figure 5.1: Approach to detect the diseases that have strong connections with both obesity and CRC in the comorbidity network. Nodes D1, D2 and D3 were prioritized because they play important roles in maintaining the network structure and the connection.

5.2 Data and methods

Fig. 5.2 shows the steps of our approach. We first mined disease comorbidity relationships from large amounts of patient records in a public database and constructed a disease comorbidity network. We then extracted the local comorbidity cluster for obesity and CRC and prioritized the candidate comorbidity that plays a critical role in connecting the two diseases. Finally, we conducted gene expression meta-analysis to identify common genes shared by obesity, CRC and the prioritized comorbidity.

5.2.1 Construct disease comorbidity network

Data sets for comorbidity mining

The adverse event reports from the FDA Adverse Event Reporting System (FAERS) contain records of 3,354,043 patients. Among all patients, 66% and 94% have their age and gender information available, respectively. Figure 5.3(a)-(b) shows the distributions of age and gender. Unlike the Medicare system, FAERS contains patients of ages from one day to hundreds of years. The distributions are not severely skewed toward particular gender or age levels.

Figure 5.2: The approach contains three steps: (1) construct a comorbidity network based on data mining; (2) extract the local network that contains paths from obesity to CRC, and analyze the local network to pinpoint the strong comorbidity for both obesity and CRC; (3) conduct gene expression meta-analysis to identify common genes shared among obesity, CRC and the comorbidity.

Figure 5.3: (a) Age distribution of the patients in the adverse event reports. (b) Gender distribution. (c) Distribution of disease semantic types: T047, Disease or Syndrome; T020, Acquired Abnormality; T046, Pathologic Function; T184, Sign or Symptom; T033, Finding; T190, Anatomical Abnormality; T191, Neoplastic Process; T048, Mental or Behavioral Dysfunction; T049, Cell or Molecular Dysfunction; T019, Congenital Abnormality; T037, Injury or Poisoning.

The data represent the diseases that patients have through 10,122 indications of the drugs that patients take. These indication terms include not only diseases, but also treatment procedures, such as surgery; common symptoms, such as pain; and ill-defined events, such as unevaluable events. We mapped the indication terms to concept unique identifiers (CUIs) in the Unified Medical Language System (UMLS) and extracted their semantic types. Figure 5.3(c) lists the distribution of eleven semantic types, in which types such as “disease or syndrome,” “neoplastic process,” and “mental or behavioral dysfunction” contain disorder concepts. With the disease data for millions of patients, we were able to conduct large-scale comorbidity mining and extract interesting disease associations.

Preprocess data

We developed an automatic pipeline to preprocess the patient-indication pairs (Figure 5.4). We mapped all indication terms to CUIs and classified them by semantic types using the UMLS Metathesaurus. Then we selected the identifiers of six semantic types: Mental or Behavioral Dysfunction, Neoplastic Process, Acquired Abnormality, Congenital Abnormality, Disease or Syndrome, and Anatomical Abnormality. We combined the synonyms among terms corresponding to these identifiers and removed those appearing only once in the data, since rare diseases may lead to unstable association patterns. Finally, the data contain 3,033,368 links between 2,371,406 patients and 3,994 diseases.

Mine comorbidity patterns

We explored comorbidity patterns among the 3,994 diseases with association rule mining. Due to the large number of patients and diseases in the adverse event reports, exhaustively enumerating all possible association patterns is computationally impractical. We applied the frequent pattern growth algorithm, which uses a tree structure to compress the input and grows the patterns in a bottom-up manner [80]. Previous work has demonstrated that this algorithm outperforms other popular pattern mining methods, such as the Apriori algorithm [4]. The frequent pattern growth algorithm has also been successfully applied in the biomedical domain to extract drug adverse effects [126]. Association rule mining can flexibly detect strong co-occurrence relationships among sets of diseases, and alleviates the intrinsic bias of traditional comorbidity measures (such as relative risk and ϕ-correlation) towards rare diseases.

Figure 5.4: Automatic pipeline to preprocess the patient-disease data in adverse event reports and mine comorbidity patterns. Stages shown in the figure: download adverse event reports from 2004 to 2012; parse indication records for all patients; indication concept recognition with the UMLS Metathesaurus; select semantic types; combine synonyms; remove rare diseases in the data; build patient-disease database; pattern mining with the Weka implementation of the FP-Growth algorithm; comorbidity patterns. Counts shown in the figure: 10117 CUIs, 4859 terms, 4564 terms, 3996 terms.

We implemented the algorithm using the Weka Java package [78]. The result of the algorithm is a set of patterns indicating how diseases are associated with each other. The pattern between two sets of diseases is represented in the form X ⇒ Y, where X is the pattern body and Y is the pattern head. For example, [anxiety, amnesia] ⇒ [depression] indicates that when patients have anxiety and amnesia, they are also likely to have depression. Note that although each pattern is directed with an arrow, the patterns do not indicate causation between diseases, but represent co-occurrences. To avoid confusion, we currently ignored the directions of the patterns and considered all diseases in sets X and Y to be associated.

The mining algorithm requires a few parameters: the minimum support was set to 0.0008%, which means at least 20 patients should have all the diseases in each pattern at the same time; the maximum number of diseases in each pattern was set to 3; and confidence was chosen to measure and rank the patterns. The confidence score of pattern X ⇒ Y is defined as:

\[ \mathrm{confidence}(X \Rightarrow Y) = \frac{|X \cup Y|}{|X|}, \tag{5.1} \]

where |X ∪ Y| is the number of patients who have the diseases in both X and Y, and |X| is the number of patients who have the diseases in X.
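As a minimal illustration of Eq. (5.1), the following Python sketch computes support counts and confidence on a toy patient-disease table. The function names and the five toy patients are invented for illustration. The study itself used the Weka implementation of frequent pattern growth, which this sketch does not reproduce; it only evaluates the score for one given pattern.

```python
def support_count(records, diseases):
    """Count patients whose disease set contains every disease in `diseases`."""
    wanted = set(diseases)
    return sum(1 for patient_diseases in records.values() if wanted <= patient_diseases)


def confidence(records, body, head):
    """Confidence of the pattern body => head, following Eq. (5.1)."""
    body_count = support_count(records, body)
    if body_count == 0:
        return 0.0  # assumption: a body with no supporting patient gets confidence 0
    return support_count(records, set(body) | set(head)) / body_count


# Toy data: hypothetical patients, not taken from the adverse event reports.
records = {
    "p1": {"anxiety", "amnesia", "depression"},
    "p2": {"anxiety", "amnesia", "depression"},
    "p3": {"anxiety", "amnesia"},
    "p4": {"anxiety", "depression"},
    "p5": {"hypertension"},
}

rule_confidence = confidence(records, {"anxiety", "amnesia"}, {"depression"})
joint_support = support_count(records, {"anxiety", "amnesia", "depression"})
```

In this toy table, two of the three patients with both anxiety and amnesia also have depression, so the confidence of [anxiety, amnesia] ⇒ [depression] is 2/3.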

Construct disease comorbidity network

We constructed an undirected and unweighted comorbidity network based on the result of association rule mining, which is a list of patterns between two sets of diseases, represented in the form X ⇒ Y. We collected all diseases in the sets X and Y of each pattern, assuming they have comorbidity relationships with each other, and established an edge between each pair of diseases in X ∪ Y to construct the comorbidity network.
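The construction step can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the adjacency-dictionary representation and the three toy patterns are assumptions, not the code used in the study.

```python
from collections import defaultdict
from itertools import combinations


def build_comorbidity_network(patterns):
    """Build an undirected, unweighted network from (body, head) patterns.

    Every pair of diseases in the union of body and head receives an edge.
    The network is returned as a dict mapping each disease to its neighbors.
    """
    adjacency = defaultdict(set)
    for body, head in patterns:
        diseases = sorted(set(body) | set(head))
        for disease_a, disease_b in combinations(diseases, 2):
            adjacency[disease_a].add(disease_b)
            adjacency[disease_b].add(disease_a)
    return dict(adjacency)


# Toy patterns (hypothetical, for illustration only).
patterns = [
    ({"anxiety", "amnesia"}, {"depression"}),
    ({"obesity"}, {"hypertension"}),
    ({"hypertension"}, {"diabetes mellitus"}),
]
network = build_comorbidity_network(patterns)
edge_count = sum(len(neighbors) for neighbors in network.values()) // 2
```

Because the direction of each pattern is ignored, the three-disease pattern contributes three edges and each two-disease pattern contributes one.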

5.2.2 Prioritize the diseases that have strong associations with both obesity and CRC

We extracted the local network consisting of the paths from obesity to CRC in the disease comorbidity network. The local network thus includes the nodes that may represent different aspects of the relationship between obesity and CRC. We implemented breadth-first search to enumerate the paths, and limited the paths to at most four steps. Then we ranked the nodes in the local network, except obesity and CRC, based on how important they are in maintaining the local network structure and the connection between obesity and CRC. We used degree and betweenness centrality to characterize the importance of each node in the flow of the network. The degree of a node becomes higher if more paths between obesity and CRC pass through this node. The betweenness counts the number of times that the node acts as a bridge along the shortest paths. Removing the nodes with the highest degree or betweenness can easily break down the connection between obesity and CRC. We investigated the top-ranked diseases based on both ranking methods, and used the unexpected ones to guide the detection of genetic associations between obesity and CRC.
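The path enumeration and degree ranking described above can be sketched as follows. The sketch assumes that paths are simple (no repeated nodes) and that degree is measured within the local network formed by the union of the enumerated paths; both are assumptions about details the text does not spell out. Function names and the toy network are hypothetical. Betweenness is omitted for brevity; a graph library such as NetworkX provides it.

```python
from collections import defaultdict, deque


def enumerate_paths(adjacency, source, target, max_steps=4):
    """Breadth-first enumeration of simple paths with at most `max_steps` edges."""
    paths = []
    queue = deque([(source,)])
    while queue:
        path = queue.popleft()
        last = path[-1]
        if last == target:
            paths.append(path)
            continue
        if len(path) - 1 == max_steps:
            continue  # path already uses the maximum number of edges
        for neighbor in sorted(adjacency.get(last, ())):
            if neighbor not in path:
                queue.append(path + (neighbor,))
    return paths


def local_degrees(paths):
    """Degree of each node in the local network formed by the union of path edges."""
    local = defaultdict(set)
    for path in paths:
        for node_a, node_b in zip(path, path[1:]):
            local[node_a].add(node_b)
            local[node_b].add(node_a)
    return {node: len(neighbors) for node, neighbors in local.items()}


# Toy comorbidity network (hypothetical).
adjacency = {
    "obesity": {"D1", "D2"},
    "D1": {"obesity", "D2", "CRC"},
    "D2": {"obesity", "D1", "D3"},
    "D3": {"D2", "CRC"},
    "CRC": {"D1", "D3"},
}
paths = enumerate_paths(adjacency, "obesity", "CRC", max_steps=4)
degrees = local_degrees(paths)
candidates = sorted(
    (node for node in degrees if node not in ("obesity", "CRC")),
    key=lambda node: (-degrees[node], node),
)
```

In the toy network, four paths of at most four steps connect obesity and CRC, and D1 and D2 rank above D3 by local degree.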

5.2.3 Identify gene overlaps through gene expression meta-analysis

We chose a top-ranked disease on the path between obesity and CRC, and then conducted gene expression meta-analysis for the prioritized disease, obesity and CRC, respectively, to detect new genetic explanations for the relationship between obesity and CRC. Normalized gene expression data (SOFT files) were downloaded from the NCBI Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) using the R package GEOquery [58]. Then, we performed microarray meta-analyses for each disease independently using the R package MetaDE [206]. MetaDE implements meta-analysis methods for differential expression analysis, and we used Fisher’s method. Significant differentially expressed genes (DEGs) were selected as those displaying an FDR-corrected p-value < 0.05. Last, we extracted the common significant genes for the three diseases.
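To make the combination step concrete, the sketch below implements Fisher's method and a false discovery rate correction in plain Python. It is not the MetaDE implementation. The Benjamini-Hochberg procedure is assumed here as the FDR correction, since the text does not name the procedure, and the gene names and p-values are invented.

```python
import math


def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p_i) is chi-square with 2k degrees of freedom.

    For an even number of degrees of freedom the chi-square tail probability
    has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    All p-values must be strictly positive.
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy per-study p-values for two hypothetical genes in one disease.
study_p_values = {
    "geneA": [0.01, 0.02, 0.03],
    "geneB": [0.40, 0.60, 0.50],
}
genes = sorted(study_p_values)
combined = [fisher_combined_p(study_p_values[gene]) for gene in genes]
adjusted = dict(zip(genes, benjamini_hochberg(combined)))
significant = {gene for gene in genes if adjusted[gene] < 0.05}
```

Running the same steps once per disease and intersecting the resulting `significant` sets would give the genes shared by the three diseases.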

5.3 Results

5.3.1 Local disease comorbidity network models the connection between obesity and CRC

We extracted 7006 comorbidity association rules with confidence larger than 50% from the patient records across ten years. The comorbidity network based on these rules contains 771 nodes and 15,667 edges. Fig. 5.5 shows the local network consisting of all the 119 paths (no longer than four steps) from obesity to CRC. A total of 24 nodes in the local network are the candidate diseases, which have associations with both obesity and CRC, and may indicate different aspects of the relationship between the two diseases.

5.3.2 Osteoporosis shows high comorbidity associations with both CRC and obesity

Table 5.1 shows the top five nodes sorted by degree and betweenness in the local network. Under either ranking, hypertension, diabetes and hyperlipaemia were in the top three and closely related with both obesity and CRC. Substantial literature evidence supports that the metabolic syndrome components, hypertension and hyperlipaemia, as well as diabetes are associated with obesity and CRC through insulin resistance [102, 106, 115, 164, 168]. These three disorders also independently increase the risk of CRC and colorectal adenoma [28, 102, 106]. The top-ranked comorbidities demonstrated the validity of our network analysis approach.

Significantly, osteoporosis was ranked highly by both centrality ranking methods. Epidemiological studies suggested an inverse association between bone mineral density and CRC [144], colon cancer among postmenopausal women [70], and colorectal adenoma [148]. On the other hand, patients with obesity and osteoporosis may share common genetic and environmental factors [219]. Different from previous studies, our result shows that osteoporosis is crucial for the association between CRC and obesity. Fig. 5.6 shows the paths of obesity-osteoporosis-CRC. We further investigated the gene expression profiles of osteoporosis patients to gain novel insight into the genetic basis for the link between obesity and CRC.

Figure 5.5: The local network that contains all paths from obesity to colorectal cancer in the comorbidity network.

Table 5.1: Top five disease nodes in the local network that contains all paths from obesity to colorectal cancer. The diseases were ranked by degree and betweenness, respectively.

Rank | Node (ranked by degree) | Degree | Node (ranked by betweenness) | Betweenness
1 | Hypertension | 26 | Hypertension | 60.2
2 | Diabetes mellitus | 24 | Diabetes mellitus | 55.9
3 | Hyperlipaemia | 22 | Hyperlipaemia | 35.2
4 | Osteoporosis | 14 | Osteoporosis | 12.3
5 | Hypothyroid | 14 | Hypothyroid | 9.5

Figure 5.6: The paths from obesity to colorectal cancer that pass through osteoporosis.

5.3.3 Innovative genes shared among osteoporosis, obesity and CRC are detected using gene expression meta-analysis

We downloaded five microarray series (GSE4017, GSE9348, GSE4183, GSE8671, GSE20916) for CRC, three (GSE48964, GSE29718, GSE55205) for obesity and three (GSE7429, GSE2208, GSE7158) for osteoporosis. Through meta-analysis, we obtained 9058 significant differentially expressed genes for CRC, 275 for obesity and 91 for osteoporosis. CRC and obesity shared a total of 192 genes. Among them, we found genes on insulin signaling pathways, such as PDK1, PRKAG2 and PDE3B, and adipocytokines, such as IL6 and IL8.

The three diseases osteoporosis, obesity and CRC shared six genes. Table 5.2 lists the genes and the literature evidence that supports their relationships with each of the three diseases. Among them, FOS, JUN, and FOSB are oncogenes. FOS and JUN are known to be on the insulin signaling pathway. FOSB is on the AP1 pathway, which is associated with the proliferation of colon cancer cells [10]. Several studies suggested that overexpression of FOSB increases responding to high-fat reward while decreasing energy expenditure and promoting adiposity [191, 203].

Interestingly, we found several genes not involved in insulin signaling. Gene PPP1R15A is in the bone morphogenetic protein (BMP) signaling pathway and its superfamily, the TGF beta signaling pathway. Mutations of the BMP pathway have been found in patients with juvenile polyposis, which is a rare syndrome with an increased risk for developing CRC [32, 91]. Mutations in TGF beta signaling have also been linked to CRC susceptibility through genome-wide association studies [24]. A recent mouse experiment also showed that the BMP pathway regulates brown adipogenesis, energy expenditure and appetite, and is thus highly associated with diet-induced obesity [196]. This evidence supports our result. Further investigation is required to confirm and elucidate the role of the BMP pathway in the connection between obesity and CRC.

Gene NRIP1 regulates the estrogen receptor. Its interaction with sex hormone receptors plays a role in both obesity [39] and osteoporosis [139]. Its relationship with CRC is still unclear, but studies suggested that estrogen may have a protective effect on CRC [20]. Gene HADHA is on multiple pathways of fatty acid metabolism, but its role in CRC and osteoporosis is still unknown.

To identify the common genes among obesity, CRC and osteoporosis, we currently analyzed gene expression data, which can be noisy. While we found literature evidence to support the detected genes and their relationships with both obesity and CRC, these candidate genes need further investigation, for example, through mouse model experiments.

5.4 Discussion

The genetic connection between CRC and obesity is multifactorial and inconclusive. In this study, we developed a comorbidity network analysis approach, which suggested that osteoporosis is important for the connection between obesity and CRC. We identified common genes among obesity, CRC and osteoporosis, and found that these genes are associated with the regulation of sex hormone receptors and growth factors inducing bone formation. These genes are candidates for explaining the genetic overlaps between obesity and CRC.

Our comorbidity network may not be inclusive and may be biased toward the diseases whose drugs have high toxicity. The FDA adverse event reporting system collects data from medical product manufacturers, health professionals, and the public. Diseases without drug treatments are not included in the data, and the disease comorbidity relationships are often under-estimated in practice based on these data. In this study, we developed a network analysis approach to compensate for the bias of the comorbidity data. In the future, including more complete patient disease data may facilitate the detection of new interesting comorbidities other than osteoporosis for obesity and CRC.

In addition, we currently detect comorbidities based on disease co-occurrence. The co-occurrence patterns may indicate a mutual increase in risk between two diseases. Incorporating more comprehensive patient-level data, such as time series data, may help refine the disease relationships and control confounding factors.

5.5 Conclusions

We constructed a disease comorbidity network through mining large-scale patient data. We developed an approach to analyze the comorbidity network and detect shared comorbidities between two diseases. Using this approach, we identified osteoporosis as an important comorbidity for both CRC and obesity. We discovered the common genes among obesity, CRC and osteoporosis, and found that these genes are associated with the regulation of sex hormone receptors and growth factors inducing bone formation. We showed that these genes have the potential to explain the genetic overlaps between obesity and CRC.

Table 5.2: Common genes shared by obesity, colorectal cancer and osteoporosis, and plausible evidence supporting their relationships with the three diseases.

GENES | OBESITY | CRC | OSTEOPOROSIS
PPP1R15A* | In the bone morphogenetic protein (BMP) signaling pathway, which regulates appetite [196] | Mutations in the BMP pathway are related with colorectal carcinogenesis [81] | In the bone morphogenetic protein signaling pathway, which is associated with bone-related diseases, such as osteoporosis [41]
FOS | Diet-induced obesity is accompanied by alteration of FOS expression [158] | Proto-oncogene, in the KEGG pathway of colorectal cancer [98] | Mice lacking c-fos develop severe osteopetrosis [150]
FOSB | positive association between maternal obesity [191] | Oncogene, regulator of cell proliferation, has a debatable impact on CRC patient survival [160] | Overexpression of FosB increases bone formation [176]
HADHA* | Associated with multiple fatty acid metabolism pathways [122] | Unknown. Associated with breast cancer [129] | Unknown.
JUN | The c-Jun NH2-terminal kinase promotes insulin resistance [5] | Proto-oncogene, in the KEGG pathway of colorectal cancer [98] | Associated with osteogenesis [109, 118]
NRIP1* | Down-regulated in obese subjects, may suggest a compensatory mechanism to favor energy expenditure and reduce fat accumulation in obesity states [39] | Unknown. Involved in regulation of E2F1, an oncogene [61] | Modulates transcriptional activity of the estrogen receptor. Interacts with ESR1 and ESR2 in osteoporosis [139]

Chapter 6

Combining human disease genetics and mouse model phenotypes towards drug repositioning: application on Parkinson’s disease

6.1 Motivation

Disease genetics information in genome-wide association studies (GWAS) [178] and Online Mendelian Inheritance in Man (OMIM) [207] has great potential to guide drug discovery. In a recent drug repositioning study, Wang and Zhang directly matched the disease genes in OMIM with the drug target genes to reposition existing drugs for new indications [207]. Another approach, proposed by Okada and colleagues, extends the disease-associated genes in GWAS with their functionally related genes based on protein-protein interactions (PPIs), and matches the extended gene set with the drug target genes for drug repositioning [151].

On the other hand, studies on the underlying in vivo biology of animal models are also useful in drug discovery. The phenotypic descriptions for mouse genetic mutations provide an in-depth understanding of gene functions, and thus allow us to gain new insights into human diseases [88] and drug targets [86]. In a recent drug repositioning approach based on mouse phenotypes, Hoehndorf and colleagues linked human diseases to mouse phenotypes through matching human and mouse phenotype ontologies. They then compared mouse phenotype features for the disease and all genes to predict disease-associated genes. After that, they linked the predicted disease-gene associations with the drug-target data to suggest candidate drugs for a given disease [87].

In this study, we developed a novel drug repositioning approach leveraging both disease genetics and mouse model phenotypes. Given a disease, we first identified disease-specific mouse phenotypes using well-studied human disease genes. Then we searched all the FDA-approved drugs for the candidates that share similar mouse phenotype profiles with the disease. We demonstrated the approach using Parkinson’s disease (PD). PD is the second most common neurodegenerative disorder and currently lacks effective drug treatments [152]. We used disease genes in OMIM to identify the PD mouse phenotypes. To date, OMIM has included 15 high-penetrance PD genes that are likely to cause the PD symptoms among the mice carrying their mutations [79]. Even though these genes are mostly associated with familial PD, clinical research and association studies have shown that the familial and sporadic forms of PD usually share the same molecular pathways [116, 117]. We ranked candidate drugs based on the semantic similarities of mouse phenotype profiles between PD and the drugs.

We tested the ranking algorithm in prioritizing FDA-approved PD drugs and novel PD drugs. We compared our approach with the pure genetics-based approaches [151, 207] and demonstrated that mouse model phenotypes are important for improving the performance of PD drug identification. We also compared our approach with Hoehndorf’s approach [87] and showed that incorporating disease genetics using our novel approach achieves significantly better precision. We further examined the top-ranked drugs by comparing their gene expression profiles with that of PD.

6.2 Data and methods

Our hypothesis is that a drug has the potential to treat PD if the drug target genes are associated with PD phenotypes. Gene-phenotype associations based on systematic mouse gene knockouts provide rich information to link drugs and their new indications. Fig. 6.1 shows that our drug repositioning approach based on mouse phenotypes contains two steps. In the first step, we searched for the mouse phenotypes associated with PD using the well-studied disease genes. In the second step, we extracted a set of mouse phenotype features for each candidate drug and systematically calculated the semantic similarities (using the mammalian phenotype ontology) of the phenotype profiles between PD and candidate drugs. Using the mouse phenotype similarity between the drugs and the disease, we predicted how likely the drugs can be used to treat PD.

6.2.1 Identify mouse model phenotypes for PD using disease genetics in OMIM

Figure 6.1: Drug discovery approach for Parkinson’s disease combining human disease genetics and mouse mutation phenotypes. (A) Search for Parkinson disease-specific mouse model phenotypes using disease genetics in OMIM. (B) Search for the candidate drugs that have highly similar mouse phenotype profiles with PD.

We searched for mouse model phenotypes for PD using 15 genes associated with 20 subtypes of PD in OMIM. Mutations of these genes greatly increase the risk for PD and are likely to cause PD phenotypes. All these human genes have homologs in mice. We downloaded the phenotype annotations for mouse genes from Mouse Genome Informatics (MGI) [64], and extracted 358 phenotypes that are linked to the 15 PD genes. Different PD genes may share common phenotype annotations. For example, 7 out of 15 PD genes point to the phenotype of neurodegeneration. We weighted each phenotype with the number of its associated PD genes. The weights intuitively represent the confidence that the phenotype is related with PD.
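The weighting step amounts to counting, for each phenotype, how many seed genes are annotated with it. A minimal Python sketch follows; the function name, the placeholder gene names and the annotations are invented for illustration and are not taken from OMIM or MGI.

```python
from collections import Counter


def weight_phenotypes(gene_to_phenotypes, seed_genes):
    """Weight each phenotype by the number of seed genes annotated with it."""
    weights = Counter()
    for gene in seed_genes:
        # Use a set so a phenotype listed twice for one gene is counted once.
        for phenotype in set(gene_to_phenotypes.get(gene, ())):
            weights[phenotype] += 1
    return weights


# Toy annotations (hypothetical genes and annotations).
gene_to_phenotypes = {
    "geneA": ["neurodegeneration", "tremors"],
    "geneB": ["neurodegeneration"],
    "geneC": ["abnormal gait", "neurodegeneration"],
    "geneD": ["tremors"],
}
seed_genes = ["geneA", "geneB", "geneC"]
phenotype_weights = weight_phenotypes(gene_to_phenotypes, seed_genes)
```

In this toy example the phenotype shared by all three seed genes receives weight 3, while annotations of the non-seed gene are ignored.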

We ranked the PD-specific mouse model phenotypes by their weights, and investigated the categories of the top-ranked phenotypes. The mammalian phenotype ontology classifies mouse phenotypes into 30 categories. We first mapped each PD phenotype to its categories by tracing the is-a relationships in the mammalian phenotype ontology. The 358 phenotypes were mapped into 24 categories. Then we calculated a score for each category by summing the weights of all the phenotypes in it. We ranked the categories based on these scores and examined the top-ranked ones.
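The category scoring can be sketched as below, assuming that the categories are the terms directly under the ontology root and that a phenotype with several parents can belong to several categories. The hierarchy, term names and weights in the example are invented and are not taken from the mammalian phenotype ontology.

```python
from collections import defaultdict


def top_level_categories(term, parents, root):
    """Trace is-a links upward and return the terms directly under `root`."""
    categories, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for parent in parents.get(current, ()):
            if parent == root:
                categories.add(current)  # `current` is a top-level category
            else:
                stack.append(parent)
    return categories


def category_scores(weights, parents, root):
    """Sum phenotype weights within each top-level category."""
    scores = defaultdict(float)
    for term, weight in weights.items():
        for category in top_level_categories(term, parents, root):
            scores[category] += weight
    return dict(scores)


# Toy is-a hierarchy (child -> list of parents), hypothetical.
parents = {
    "nervous system category": ["root"],
    "behavior category": ["root"],
    "neurodegeneration": ["nervous system category"],
    "tremors": ["behavior category", "nervous system category"],
    "abnormal gait": ["behavior category"],
}
weights = {"neurodegeneration": 7, "tremors": 3, "abnormal gait": 2}
scores = category_scores(weights, parents, "root")
ranked_categories = sorted(scores, key=lambda category: -scores[category])
```

A phenotype with two parent categories contributes its weight to both, which is one possible reading of the mapping described in the text.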

6.2.2 Prioritize candidate PD drugs based on the similarities of mouse phenotype profiles between disease and drugs

We collected a set of candidate drugs from DrugBank [112]. The drug-target database in DrugBank contains information for 1427 FDA-approved drugs (for any indication). We extracted 1197 drugs that target human/mouse orthologous genes, and included them in the candidate drug set. Then we combined the drug-target relationships and phenotype annotations for the target genes to link each candidate drug to a set of mouse model phenotypes through the drug target genes. We constructed a vector of mouse phenotypes for each drug, and weighted each phenotype by the number of its associated target genes.

We calculated the semantic similarity between the vector of mouse phenotypes associated with PD and that of each candidate drug to determine how likely the drug can be used to treat PD. We first quantified the information content of each phenotype term t as −log p(t), in which p(t) represents the frequency of t among the phenotype annotations to all the 7568 mouse genes. In calculating the information content, if a gene is annotated by one phenotype term, we assumed that it is also annotated by the ancestors of this term in the hierarchy of the mammalian phenotype ontology. Hence, a phenotype term has higher information content than its ancestors, which lie on higher levels in the ontology. Then we defined the semantic similarity sim(t1, t2) between phenotype terms t1 and t2 as:

\[ \mathrm{sim}(t_1, t_2) = \max_{a \in A(t_1, t_2)} \bigl(-\log p(a)\bigr), \tag{6.1} \]

where A(t1, t2) is the set of common ancestors of t1 and t2. To calculate the similarity from the phenotype vector p1 to p2, we matched each phenotype term in p1 to the most similar term in p2 and took the average:

\[ \mathrm{sim}(p_1 \to p_2) = \operatorname{avg}\Bigl( \sum_{t_1 \in p_1} \max_{t_2 \in p_2} \mathrm{sim}(t_1, t_2) \Bigr). \tag{6.2} \]

The matching similarity was weighted by the product of the weights for phenotype terms t1 and t2. The similarity between p1 and p2 was defined as the average of the semantic similarities in both directions:

\[ \mathrm{sim}(p_1, p_2) = \tfrac{1}{2}\,\mathrm{sim}(p_1 \to p_2) + \tfrac{1}{2}\,\mathrm{sim}(p_2 \to p_1). \tag{6.3} \]

A similar calculation of semantic similarity between two vectors of ontology concepts was used before [170].
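Equations (6.1) to (6.3) can be illustrated with a small Python sketch. Several details are assumptions made only for this example: p(t) is taken as the fraction of genes annotated with t or one of its descendants, a term counts as its own ancestor, and the weighting by phenotype weights is left out because the text does not specify how the weights enter the average. The ontology and annotations are invented.

```python
import math


def ancestors(term, parents):
    """Return the term together with all ancestors reachable via is-a links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def information_content(gene_to_terms, parents):
    """IC(t) = -log p(t), with annotations propagated to all ancestors."""
    counts = {}
    for terms in gene_to_terms.values():
        closure = set()
        for term in terms:
            closure |= ancestors(term, parents)
        for term in closure:
            counts[term] = counts.get(term, 0) + 1
    total = len(gene_to_terms)
    return {term: -math.log(count / total) for term, count in counts.items()}


def term_similarity(t1, t2, parents, ic):
    """Eq. (6.1): highest information content among common ancestors."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max((ic.get(a, 0.0) for a in common), default=0.0)


def directed_similarity(p1, p2, parents, ic):
    """Eq. (6.2), unweighted: average best-match similarity from p1 to p2."""
    best = [max(term_similarity(t1, t2, parents, ic) for t2 in p2) for t1 in p1]
    return sum(best) / len(best)


def profile_similarity(p1, p2, parents, ic):
    """Eq. (6.3): mean of the two directed similarities."""
    return 0.5 * directed_similarity(p1, p2, parents, ic) + 0.5 * directed_similarity(p2, p1, parents, ic)


# Toy ontology (child -> parents) and toy gene annotations, both hypothetical.
parents = {"A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]}
gene_to_terms = {"g1": ["A1"], "g2": ["A2"], "g3": ["B"], "g4": ["A1"]}
ic = information_content(gene_to_terms, parents)
disease_profile = ["A1"]
drug_profile = ["A2", "B"]
similarity = profile_similarity(disease_profile, drug_profile, parents, ic)
```

In the toy ontology, A1 and A2 share the ancestor A, which annotates three of the four genes, so their similarity is −log(0.75); terms that share only the root have similarity zero.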

6.2.3 De novo evaluation in prioritizing FDA-approved PD drugs

We investigated whether our method can prioritize approved PD drugs. We ranked the 1197 candidate drugs using the semantic similarities of the mouse phenotype profiles between the drugs and PD. Then we extracted approved PD drugs from FDA drug labels. Our drug ranking algorithm does not use any information about the approved PD drugs. In the de novo evaluation, we calculated the distribution of approved PD drugs among our ranks by plotting a 10-bin histogram. Specifically, we divided the ranks into 10 ranges, and counted the number of approved PD drugs within each range. In addition, we investigated the target genes for the top 10% of candidate drugs. We ranked these drug target genes by the number of drugs (ranked within the top 10%) that target each gene. We also calculated the distribution of genes targeted by the FDA-approved drugs among all the drug target genes using a histogram.
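The 10-bin rank histogram can be sketched as follows, assuming equal-width rank ranges. The function name, the drug identifiers and the set of positives are placeholders for illustration.

```python
def rank_histogram(ranked_drugs, positives, n_bins=10):
    """Count how many positive drugs fall into each of `n_bins` equal rank ranges."""
    counts = [0] * n_bins
    total = len(ranked_drugs)
    for position, drug in enumerate(ranked_drugs):
        if drug in positives:
            # Integer arithmetic maps positions 0..total-1 onto bins 0..n_bins-1.
            counts[position * n_bins // total] += 1
    return counts


# Toy ranking of 20 hypothetical drugs, best first.
ranked_drugs = [f"drug{i}" for i in range(20)]
approved = {"drug0", "drug1", "drug3", "drug12"}
histogram = rank_histogram(ranked_drugs, approved)
```

A ranking that prioritizes the approved drugs well concentrates the counts in the first few bins, as in the toy output where three of the four positives fall in the first two bins.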

We demonstrated the importance of using mouse phenotypes to predict drugs for PD. Recent studies have shown that disease-associated genes can guide the detection of existing drug therapies and promising candidate drugs [178, 207]. We compared our approach with two genetics-based drug discovery methods (Fig. 6.2). The first method [207] directly matches the disease genes in OMIM with the drug target genes to reposition existing drugs for new indications. The second method [151] extends the disease genes with their functionally related genes based on protein-protein interactions (PPIs), and matches the extended gene set with the drug target genes for drug repositioning. We downloaded the PPIs from the STRING database [186], and used the experiment data source, which contains PPI databases such as HPRD, BIND, and GRID. We evaluated whether the two methods have the ability to identify approved PD drugs without using mouse phenotypes, and compared the result with our approach.
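The two genetics-based baselines reduce to set operations, as the following sketch suggests. It is a simplified reading of the two methods as described here, not a reimplementation of the cited work; the function names and the gene, interaction and drug identifiers are placeholders.

```python
def extend_with_interactors(disease_genes, ppi_edges):
    """Method 2: add every gene that directly interacts with a disease gene."""
    seeds = set(disease_genes)
    extended = set(seeds)
    for gene_a, gene_b in ppi_edges:
        if gene_a in seeds:
            extended.add(gene_b)
        if gene_b in seeds:
            extended.add(gene_a)
    return extended


def match_drugs(gene_set, drug_targets):
    """Return the drugs that have at least one target in `gene_set`."""
    return {drug for drug, targets in drug_targets.items() if gene_set & set(targets)}


# Toy inputs (hypothetical genes, interactions and drugs).
disease_genes = {"g1", "g2"}
ppi_edges = [("g1", "g3"), ("g4", "g5")]
drug_targets = {"drugA": ["g2"], "drugB": ["g3"], "drugC": ["g5"]}

method1_drugs = match_drugs(disease_genes, drug_targets)
method2_drugs = match_drugs(extend_with_interactors(disease_genes, ppi_edges), drug_targets)
```

In the toy data the direct match finds only the drug that targets a disease gene, while the extended match also finds the drug whose target interacts with a disease gene.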

[Figure 6.2 diagram. Node labels: Parkinson disease, disease-associated genes, disease-associated mouse phenotypes, interacting genes, drug target genes, drug-associated mouse phenotypes, candidate drug. Paths: Method 1, Method 2, Our method.]

Figure 6.2: Comparison with genetics-based drug discovery methods, which directly match the disease genes and their interacting genes with the drug target genes, to demonstrate the importance of using mouse phenotypes.

6.2.4 Evaluation in ranking novel PD drugs and comparison with an existing drug repositioning approach

We investigated whether our approach can prioritize novel PD drugs. In our recent studies, we constructed large-scale drug-disease treatment knowledge bases from multiple data resources using techniques including natural language processing, text mining, and data mining [215, 216]. The knowledge bases included 9,216 drug-disease treatment pairs extracted from FDA drug labels, 34,306 pairs extracted from 22 million published biomedical literature abstracts, and 69,724 pairs extracted from 171,805 clinical trials. Based on these knowledge bases, we constructed two evaluation sets as proxies for novel PD drugs: the first set consists of drugs that have been tested for PD in clinical trials, and the second set consists of PD drugs extracted from literature abstracts in Medline. We removed the FDA-approved PD drugs from both sets. We used histograms to investigate the distribution of the drugs in each set across our ranks. We also generated a precision-recall curve and calculated the mean average precision to evaluate the ranking of drugs in the union of the two sets.
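The precision-recall evaluation of a ranked drug list can be sketched as follows. The text does not specify how the precision values are averaged, so this sketch assumes the common definition in which precision is averaged over the ranks at which an evaluation drug is retrieved; the function names and the toy ranking are hypothetical.

```python
def precision_recall_points(ranked_drugs, positives):
    """Return (recall, precision) at each rank where an evaluation drug appears."""
    positives = set(positives)
    points = []
    hits = 0
    for k, drug in enumerate(ranked_drugs, start=1):
        if drug in positives:
            hits += 1
            points.append((hits / len(positives), hits / k))
    return points


def average_precision(ranked_drugs, positives):
    """Average the precision over all evaluation drugs (assumed definition)."""
    positives = set(positives)
    if not positives:
        return 0.0
    points = precision_recall_points(ranked_drugs, positives)
    # Evaluation drugs missing from the ranking contribute zero precision.
    return sum(precision for _, precision in points) / len(positives)


# Toy ranking with three evaluation drugs at ranks 1, 3, and 5 (illustrative only).
toy_ranking = ["a", "b", "c", "d", "e"]
toy_positives = {"a", "c", "e"}
toy_points = precision_recall_points(toy_ranking, toy_positives)
toy_ap = average_precision(toy_ranking, toy_positives)
```

Plotting the `(recall, precision)` pairs gives the precision-recall curve for one ranking; computing them for two rankings of the same candidates allows a direct comparison.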

We compared the performance of our approach with a recent drug discovery approach proposed by Hoehndorf et al. [87]. In their approach, human diseases were linked to mouse phenotypes through phenotype ontology comparison, and then associated with orthologous genes based on the gene-phenotype relationships in animal models. They then linked the predicted disease genes with drug-target data to suggest candidate drugs for a given disease. We compared the two methods using the histograms that represent the distributions of the evaluation drugs as well as the precision-recall curves.

6.2.5 Test the top-ranked drugs using gene expression data analysis

We further examined the top-ranked drugs by comparing their gene expression profiles in the Gene Expression Omnibus (GEO) with that of PD. For the drugs, we extracted data sets that contain gene expression levels before and after adding the drugs to human or animal brain tissues. For PD, we downloaded data sets that compare PD patients with healthy controls. We used the GEO2R software [18] to identify the significantly differentially expressed genes (adjusted p-value < 0.05) for the disease and for each drug. We then investigated whether PD and the drug share significant genes, and whether these common genes are regulated in opposite directions.
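The final comparison step can be sketched as follows, assuming the significant genes and their log fold changes have already been exported (for example from GEO2R) and filtered at adjusted p-value < 0.05. The function name is hypothetical. The HSPB1 and MAOA values are taken from Table 6.2; GENE_X and GENE_Y are invented to show what the filter excludes.

```python
def opposite_regulated_genes(drug_logfc, disease_logfc):
    """Return genes significant in both profiles with opposite signs of log(FC)."""
    common = drug_logfc.keys() & disease_logfc.keys()
    # A negative product means one profile is up-regulated and the other down.
    return sorted(g for g in common if drug_logfc[g] * disease_logfc[g] < 0)


# Log fold changes of significant genes; GENE_X and GENE_Y are hypothetical.
quetiapine_logfc = {"HSPB1": 1.5, "MAOA": -0.6, "GENE_X": 0.9}
pd_logfc = {"HSPB1": -1.4, "MAOA": 0.8, "GENE_X": 0.7, "GENE_Y": -2.0}

reversed_genes = opposite_regulated_genes(quetiapine_logfc, pd_logfc)
```

Here GENE_X is dropped because it is regulated in the same direction in both profiles, and GENE_Y is dropped because it is significant only for the disease.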

6.3 Results

6.3.1 Our disease genetics-based phenotype prioritization algorithm identified PD-specific mouse model phenotypes

We ranked and classified the mouse model phenotypes detected using PD genes in OMIM. As expected, the top-ranked phenotype categories are nervous system and behavior/neurological phenotypes (Table 6.1). Examples of nervous system phenotypes with the highest weights include neurodegeneration and alpha-synuclein inclusion body, which characterize the pathology of PD. In addition, the top-ranked behavior/neurological phenotypes, such as impaired coordination and abnormal gait, mostly correspond to typical motor symptoms of PD. Interestingly, the remaining top-ranked phenotype categories show that the pathology of PD is complex and involves not only the nervous system but also the immune system, homeostasis, and other aspects.

Table 6.1: The top-ranked categories of mouse phenotypes extracted using PD genes in OMIM.

Rank  Phenotype category                 Example top-ranked phenotype
1     nervous system phenotype           neurodegeneration
2     behavior/neurological phenotype    impaired coordination
3     immune system phenotype            decreased double-positive T cell number
4     homeostasis/metabolism phenotype   decreased dopamine level
5     hematopoietic system phenotype     decreased hemoglobin content

6.3.2 Our approach prioritized FDA-approved PD drugs

We extracted 22 FDA-approved drugs for PD and 474 genes targeted by these drugs. The median rank of the 22 drugs is 125 (approximately the top 10% of the 1197 drugs). The histogram in Fig. 6.3 shows that our approach ranked 10 approved PD drugs within the top 10%. The table in Fig. 6.3 shows the rank and percentile of the top 10 approved PD drugs. Among them, the most effective dopamine replacement agent, levodopa, was ranked within the top 5%. Fig. 6.4 shows that the drugs prioritized by our approach frequently target the target genes of approved PD drugs. In Fig. 6.4(a), nine of the top ten drug target genes (all except GABRA1) are target genes of approved PD drugs. Fig. 6.4(b) shows that half of the top 10% of genes have been targeted by approved PD drugs, while the other half are new drug targets that may lead to novel candidate PD drugs.

Approved PD drugs and their target genes cannot be easily detected by matching disease genes with drug target genes. We compared the performance of our approach in identifying approved PD drugs with that of the two genetics-based drug discovery methods. Using the first method, none of the 15 PD genes directly matched the target genes of approved PD drugs, so we detected no approved drugs. Using the second method, we detected one approved PD drug, rasagiline, through its target gene BCL2, which interacts with the PD gene PARK2. Although the disease genes for PD and their interacting genes do not directly provide information on the drug target genes, our approach prioritized 10 of the 22 approved PD drugs by exploiting the gene-phenotype associations in mouse models.

Figure 6.3: Our approach ranked the approved PD drugs near the top. A total of 10 of the 22 approved PD drugs were ranked within the top 10% of all 1197 drugs.

(a) Top 10 drug target genes (b) Ranks of approved PD drug target genes

Figure 6.4: The drug target genes that are most frequently targeted by our top 10% drugs. (a) The top 10 drug target genes for our prioritized drugs. (b) The distribution of target genes for approved PD drugs among all the drug target genes.

6.3.3 Our approach outperformed an existing approach in prioritizing novel PD drugs

The top-ranked drugs generated by our approach are enriched for the novel PD drugs in the two evaluation sets (Fig. 6.5). We extracted 81 drugs from clinical trials to construct the first set, and the candidate drugs in our approach contain 69 of them. Our approach ranked 22 of these drugs in the top 10%, which is 450% more than the 4 drugs in the bottom 10%. Most of the evaluation drugs (68%) in the clinical trial set were ranked within the top 30%. The evaluation set based on Medline contains 102 drugs, of which 85 are among our candidate drugs. We ranked 26 of them within the top 10%, more than a 760% increase compared with the 3 drugs in the bottom 10%. In contrast, Fig. 6.6 shows that the evaluation drugs are spread across the rank ranges when using the existing drug discovery approach based on mouse model phenotypes. A comparison of Figs. 6.5 and 6.6 shows that our approach performed better than Hoehndorf’s approach in ranking novel PD drugs in the two evaluation sets.

Figure 6.5: The distribution of our ranks for two sets of novel PD drugs extracted from clinical trials and Medline texts.

The precision-recall curves in Fig. 6.7 further show that our approach performs significantly better than the previous approach. The mean average precision of our approach is 0.24, which is significantly higher than the 0.16 of Hoehndorf’s approach (p < e−11). This result means that our approach achieved higher precision on average across recall levels and mostly ranked the novel PD drugs higher than the previous approach.

Figure 6.6: The distribution of evaluation sets based on clinical trials and Medline texts among the ranks generated by the baseline approach based on mouse phenotypes.

Figure 6.7: Precision-recall curves in ranking the novel PD drugs for our approach and Hoehndorf’s approach based on PhenomeNet.

6.3.4 Gene expression analysis suggests quetiapine as a potential PD drug

Among the top 10 candidate drugs, we found a set of gene expression samples available for quetiapine in GEO. We identified 61 significant genes for quetiapine from GEO series GSE45229 (footnote 33) and 1650 significant genes for PD from GSE8397 (footnote 34). Table 6.2 lists the genes that are significantly differentially expressed for both PD and quetiapine, together with the direction of regulation and the logarithm of the fold change for each gene. Among these genes, MAOA regulates the metabolism of neurotransmitters such as dopamine and is closely associated with PD (footnote 35). In addition, MAOA is not a drug target gene for quetiapine according to the drug-target data in DrugBank. The gene expression analysis suggests that quetiapine, one of the top-ranked drugs, has the potential to treat PD.

Table 6.2: Common significantly differential genes for PD and quetiapine as well as their directions of regulation and fold change.

          Quetiapine               PD
Gene      Regulation   Log(FC)     Regulation   Log(FC)
HSPB1     Up            1.5        Down         -1.4
CHORDC1   Up            0.6        Down         -1.1
MAOA      Down         -0.6        Up            0.8
MRPL15    Down         -0.6        Up            0.8
SPEN      Up            0.3        Down         -0.5
EIF5      Down         -0.2        Up            0.4

6.4 Discussion

Currently, we use the disease genetics knowledge in OMIM as the seeds to detect PD mouse phenotypes. We have demonstrated in several recent works that disease genes predicted by analyzing human disease phenotype networks and genetic functional relationship networks also have translational potential in drug discovery [45, 46, 48]. In the future, we will develop approaches to integrate disease-associated genes in OMIM, GWAS, and prediction results from computational approaches into the drug repositioning approach. In addition, we will incorporate other information, including human disease phenotypes, disease similarities, and drug similarities, to further prioritize strong candidate drugs.

6.5 Conclusions

In this study, we developed a novel drug repositioning approach to predict new drugs for Parkinson’s disease using both disease genetics knowledge and mouse model phenotypes. Our approach can identify FDA-approved PD drugs and prioritize novel PD drugs. Comparison with purely genetics-based drug repositioning approaches shows the importance of mouse model phenotypes in identifying PD drugs. In addition, by combining disease genetics with mouse model phenotypes, our approach outperformed a recently proposed mouse phenotype-based drug discovery method. Further gene expression analysis of the top-ranked candidate drugs suggested quetiapine as a potential PD therapy.

Chapter 7

Conclusions and future work

7.1 Conclusions

As biomedical data become large, complex, and heterogeneous, computational approaches are necessary to combine different kinds of data and discover new knowledge from them. A major challenge in developing computational approaches for biomedical applications is to ask the right questions, gather relevant data, and design algorithms based on an understanding of the specific problems. To address this challenge, this dissertation presents a knowledge-guided strategy, which uses problem-specific domain knowledge to guide data gathering, data fusion, and algorithm design. The effectiveness of the strategy is demonstrated through applications in disease image retrieval (Chapter 2), disease gene prediction (Chapters 3, 4, and 5), and drug discovery (Chapter 6).

Chapter 2 presents a disease image retrieval method based on organ detection, as a step towards building a patient-oriented health image database. We use knowledge of the body parts affected by each disease to guide the disease image retrieval. Compared with standard supervised classification, which trains a classifier for each disease, our approach significantly reduced manual labeling effort by reusing a set of pre-trained organ detectors across multiple diseases. In addition, the proposed method improved the image retrieval precision for complex diseases that affect multiple body parts. The resulting health image database is automatically annotated using terms from standard medical ontologies and will create a rich source of information to support patient education and decision making.

Chapters 3, 4, and 5 present disease gene prediction approaches for parasitic infectious diseases, multifactorial diseases, and cancers. We used domain knowledge to guide the construction of disease-specific gene prediction models using unique data. In Chapter 3, we modeled the interaction between human and pathogen using a cross-species genetic network and prioritized disease-associated genes using a network analysis approach. The method was applied to Plasmodium falciparum malaria and detected both known and novel genes that are associated with malaria pathogenesis. The predicted genes have translational potential in anti-malaria drug discovery.

In Chapter 4, we constructed a new disease phenotype network from a unique data source of human disease phenotypes, and then designed an innovative strategy to predict disease-associated genes from integrated phenotypic and genetic networks. Our approach achieved significantly better performance than the gene prediction approach using only one phenotype data source. An application to Crohn’s disease demonstrates that the gene prediction results have translational potential to guide drug discovery.

In Chapter 5, we constructed a disease comorbidity network by mining large-scale patient data, and developed an approach to detect shared comorbidities between two diseases from the comorbidity network. This approach identified osteoporosis as an important comorbidity for colorectal cancer and obesity, and discovered common genes that are significant for the three diseases. Results showed that the detected genes have the potential to explain the genetic overlaps between obesity and colorectal cancer.

Finally, Chapter 6 presents a novel drug repositioning approach that combines disease genetics knowledge with mouse model phenotypes. The approach was applied to predict new drugs for Parkinson’s disease (PD) and identified both FDA-approved and novel PD drugs. The proposed approach outperformed a recent drug discovery method that uses mouse phenotype data. Gene expression analysis of the top-ranked candidate drugs suggested quetiapine as a potential PD therapy.

7.2 Future work

7.2.1 Disease image retrieval

In future work, we plan to train more organ detectors and apply the method to more diseases. To cover a wider range of diseases, we plan to use texture pattern recognition to further improve the retrieval precision for organs that do not appear as concrete objects in images, such as skin, muscle, and veins.

For disease terms that have no body-site information in the ontologies, we plan to extend our approach by scanning the web images with all organ detectors.

7.2.2 Disease gene prediction

We have demonstrated that the genes predicted by the proposed computational approaches have potential in drug discovery. One natural next step is to develop drug repositioning methods that match the targets of approved drugs to the predicted genes. Future plans include analyzing the functions of the top-ranked predicted genes, further filtering the strong candidate drug target genes, and systematically integrating gene function and drug action data into the drug discovery approach.

7.2.3 Drug repositioning

Currently, our approach combines the disease genes in OMIM with mouse phenotype data to predict new drug indications. In the future, we will incorporate more data, such as human disease phenotypes, disease similarities, and drug similarities, into the drug repositioning approach to further prioritize strong candidate drugs. We will also develop approaches to incorporate other disease genetics data, including disease-associated genes from GWAS and prediction results from computational approaches. Future work also includes developing methods to validate the predicted drugs using additional data, such as gene expression level changes associated with the diseases and the drug compounds.

Bibliography

[1] S. Abdalla. Hematopoiesis in human malaria. Blood cells, 16(2-3):401–16,

1989.

[2] A.D.A.M. http://www.adam.com/. Accessed: 2012.

[3] S. Aerts, D. Lambrechts, S. Maity, P. Van Loo, B. Coessens, F. De Smet, L.-C.

Tranchevent, B. De Moor, P. Marynen, B. Hassan, et al. Gene prioritization

through genomic data fusion. Nature biotechnology, 24(5):537–544, 2006.

[4] R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules. In int. conf. very large data bases, VLDB, volume 1215, pages 487–499. ACM,

1994.

[5] V. Aguirre, T. Uchida, L. Yenush, R. Davis, and M. F. White. The c-jun nh2-terminal kinase promotes insulin resistance during association with insulin receptor substrate-1 and phosphorylation of ser307. Journal of Biological Chemistry, 275(12):9047–9054, 2000.

[6] W. An, Y. Bai, S.-X. Deng, J. Gao, Q.-W. Ben, Q.-C. Cai, H.-G. Zhang, and Z.-S. Li. Adiponectin levels in patients with colorectal cancer and adenoma:

a meta-analysis. European Journal of Cancer Prevention, 21(2):126–133, 2012.

[7] S. Annunen, J. Körkkö, M. Czarny, M. L. Warman, H. G. Brunner, H. Kääriäinen, J. B. Mulliken, L. Tranebjærg, D. G. Brooks, G. F. Cox, et al. Splicing mutations of 54-bp exons in the col11a1 gene cause marshall syndrome, but other mutations cause overlapping marshall/stickler phenotypes. The American Journal of Human Genetics, 65(4):974–983, 1999.

[8] S. E. Antonarakis and J. S. Beckmann. Mendelian disorders deserve more

attention. Nature Reviews Genetics, 7(4):277–282, 2006.

[9] S. G. Armato III, G. McLennan, M. F. McNitt-Gray, C. R. Meyer, D. Yankelevitz, D. R. Aberle, C. I. Henschke, E. A. Hoffman, E. A. Kazerooni, H. MacMahon, et al. Lung image database consortium: Developing a resource for the medical imaging research community. Radiology, 232(3):739–748, 2004.

[10] R. Ashida, K. Tominaga, E. Sasaki, T. Watanabe, Y. Fujiwara, N. Oshitani,

K. Higuchi, S. Mitsuyama, H. Iwao, and T. Arakawa. Ap-1 and colorectal

cancer. Inflammopharmacology, 13(1-3):113–125, 2005.

[11] R. Atreya, H. Neumann, C. Neufert, M. J. Waldner, U. Billmeier, Y. Zopf, M. Willma, C. App, T. Münster, H. Kessler, et al. In vivo imaging using fluorescent antibodies to tumor necrosis factor predicts therapeutic response in crohn’s disease. Nature medicine, 20(3):313–318, 2014.

[12] C. Aurrecoechea, J. Brestelli, B. P. Brunk, J. Dommer, S. Fischer, B. Gajria,

X. Gao, A. Gingle, G. Grant, O. S. Harb, et al. Plasmodb: a functional

genomic database for malaria parasites. Nucleic acids research, 37(suppl 1):D539–D543, 2009.

[13] C. Aurrecoechea, J. Brestelli, B. P. Brunk, S. Fischer, B. Gajria, X. Gao, A. Gingle, G. Grant, O. S. Harb, M. Heiges, et al. Eupathdb: a portal to eukaryotic pathogen databases. Nucleic acids research, 38(suppl 1):D415–D419, 2010.

[14] C. L. Avery, Q. He, K. E. North, J. L. Ambite, E. Boerwinkle, M. Fornage, L. A. Hindorff, C. Kooperberg, J. B. Meigs, J. S. Pankow, et al. A phenomics-based strategy identifies loci on apoc1, brap, and plcg1 associated with metabolic syndrome phenotype domains. PLoS genetics, 7(10):e1002322, 2011.

[15] K. Ayi, G. Min-Oo, L. Serghides, M. Crockett, M. Kirby-Allen, I. Quirt, P. Gros, and K. C. Kain. Pyruvate kinase deficiency and malaria. New England Journal of Medicine, 358(17):1805–1810, 2008.

[16] A.-L. Barabási, N. Gulbahce, and J. Loscalzo. Network medicine: a network-based approach to human disease. Nature Reviews Genetics, 12(1):56–68, 2011.

[17] M. Bardou, A. N. Barkun, and M. Martel. Obesity and colorectal cancer. Gut, 62(6):933–947, 2013.

[18] T. Barrett, S. E. Wilhite, P. Ledoux, C. Evangelista, I. F. Kim, M. Tomashevsky, K. A. Marshall, K. H. Phillippy, P. M. Sherman, M. Holko, et al. Ncbi geo:

archive for functional genomics data sets—update. Nucleic acids research,

41(D1):D991–D995, 2013.

[19] D. Baruch, J. Gormely, C. Ma, R. Howard, and B. Pasloske. Plasmodium falciparum erythrocyte membrane protein 1 is a parasitized erythrocyte receptor for adherence to cd36, thrombospondin, and intercellular adhesion molecule 1. Proceedings of the National Academy of Sciences, 93(8):3497–3502, 1996.

[20] A. Barzi, A. M. Lenz, M. J. Labonte, and H.-J. Lenz. Molecular pathways: estrogen pathway in colorectal cancer. Clinical Cancer Research, 19(21):5842–

5848, 2013.

[21] D. C. Baumgart and W. J. Sandborn. Inflammatory bowel disease: clinical

aspects and established and evolving therapies. The Lancet, 369(9573):1641–

1657, 2007.

[22] D. P. Beiting, P. W. Park, and J. A. Appleton. Synthesis of syndecan-1 by skeletal muscle cells is an early response to infection with trichinella spiralis but is not essential for nurse cell development. Infection and immunity, 74(3):1941–1943, 2006.

[23] D. S. Bell, R. Greenes, and P. Doubilet. Form-based clinical input from a structured vocabulary: initial application in ultrasound reporting. In Proceedings of the Annual Symposium on Computer Application in Medical Care, page 789. American Medical Informatics Association, 1992.

[24] N. Bellam and B. Pasche. Tgf-β signaling alterations and colon cancer. In

Cancer Genetics, pages 85–103. Springer, 2010.

[25] R. Bellazzi, M. Diomidous, I. N. Sarkar, K. Takabayashi, A. Ziegler, A. T.

McCray, et al. Data analysis and data mining: current issues in biomedical

informatics. Methods of information in medicine, 50(6):536, 2011.

[26] A. Bengtsson, L. Joergensen, T. S. Rask, R. W. Olsen, M. A. Andersen,

L. Turner, T. G. Theander, L. Hviid, M. K. Higgins, A. Craig, et al. A novel domain cassette identifies plasmodium falciparum pfemp1 proteins binding

icam-1 and is a target of cross-reactive, adhesion-inhibitory antibodies. The

Journal of Immunology, 190(1):240–249, 2013.

[27] S. I. Berger, A. Ma’ayan, and R. Iyengar. Systems pharmacology of arrhythmias. Science signaling, 3(118):ra30, 2010.

[28] J. M. Berster and B. Göke. Type 2 diabetes mellitus as risk factor for colorectal cancer. Archives of physiology and biochemistry, 114(1):84–98, 2008.

[29] D. R. Blair, C. S. Lyttle, J. M. Mortensen, C. F. Bearden, A. B. Jensen, H. Khiabanian, R. Melamed, R. Rabadan, E. V. Bernstam, S. Brunak, et al. A nondegenerate code of deleterious variants in mendelian loci contributes to complex disease risk. Cell, 155(1):70–80, 2013.

[30] O. Bodenreider. The unified medical language system (umls): integrating

biomedical terminology. Nucleic acids research, 32(suppl 1):D267–D270, 2004.

[31] A. Borthakur, S. Bhattacharyya, P. K. Dudeja, and J. K. Tobacman. Carrageenan induces interleukin-8 production through distinct bcl10 pathway in normal human colonic epithelial cells. American Journal of Physiology-Gastrointestinal and Liver Physiology, 292(3):G829–G838, 2007.

[32] L. A. Brosens, A. van Hattem, L. M. Hylind, C. Iacobuzio-Donahue, K. E. Romans, J. Axilbund, M. Cruz-Correa, A. C. Tersmette, G. J. A. Offerhaus,

and F. M. Giardiello. Risk of colorectal cancer in juvenile polyposis. Gut,

56(7):965–967, 2007.

[33] S. D. Brown and M. W. Moore. Towards an encyclopaedia of mammalian

gene function: the international mouse phenotyping consortium. Disease

models & mechanisms, 5(3):289–292, 2012.

[34] H. G. Brunner and M. A. Van Driel. From syndrome families to functional genomics. Nature Reviews Genetics, 5(7):545–551, 2004.

[35] E. E. Calle, C. Rodriguez, K. Walker-Thurmond, and M. J. Thun. Overweight,

obesity, and mortality from cancer in a prospectively studied cohort of us

adults. New England Journal of Medicine, 348(17):1625–1638, 2003.

[36] M. Campillos, M. Kuhn, A.-C. Gavin, L. J. Jensen, and P. Bork. Drug target

identification using side-effect similarity. Science, 321(5886):263–266, 2008.

[37] F. M. Campos, M. L. Santos, F. S. Kano, C. J. Fontes, M. V. Lacerda, C. F. Brito, and L. H. Carvalho. Genetic variability in platelet integrin α2β1 density: Possible contributor to plasmodium vivax–induced severe thrombocytopenia. The American journal of tropical medicine and hygiene, 88(2):325–328, 2013.

[38] C. Caretta-Cartozo, P. De Los Rios, F. Piazza, and P. Liò. Bottleneck genes and community structure in the cell cycle network of s. pombe. PLoS computational biology, 3(6):e103, 2007.

[39] V. Catalán, J. Gómez-Ambrosi, A. Lizanzu, A. Rodríguez, C. Silva, F. Rotellar, M. J. Gil, J. A. Cienfuegos, J. Salvador, and G. Frühbeck. Rip140 gene and protein expression levels are downregulated in visceral adipose tissue in human morbid obesity. Obesity surgery, 19(6):771–776, 2009.

[40] O. Chapelle, P. Haffner, and V. N. Vapnik. Support vector machines for

histogram-based image classification. Neural Networks, IEEE Transactions on,

10(5):1055–1064, 1999.

[41] G. Chen, C. Deng, and Y.-P. Li. Tgf-β and bmp signaling in osteoblast differentiation and bone formation. International journal of biological sciences, 8(2):272, 2012.

[42] Y. Chen, T. Jiang, and R. Jiang. Uncover disease genes by maximizing information flow in the phenome–interactome network. Bioinformatics, 27(13):i167–i176, 2011.

[43] Y. Chen and R. Xu. Mining cancer-specific disease comorbidities from a large

observational health database. Cancer informatics, 13(Suppl 1):37, 2014.

[44] Y. Chen and R. Xu. Network analysis of human disease comorbidity patterns

based on large-scale data mining. In Bioinformatics Research and Applications, pages 243–254. Springer, 2014.

[45] Y. Chen and R. Xu. Network-based gene prediction for plasmodium falciparum malaria towards genetics-based drug discovery. BMC Genomics, 2015.

[46] Y. Chen and R. Xu. Phenome-driven disease genetics prediction towards

drug discovery. Bioinformatics, 2015.

[47] Y. Chen, G.-q. Zhang, and R. Xu. Semi-supervised image classification

for automatic construction of a health image library. In Proceedings of the

2nd ACM SIGHIT International Health Informatics Symposium, pages 111–120.

ACM, 2012.

[48] Y. Chen, X. Zhang, G.-q. Zhang, and R. Xu. Comparative analysis of a novel disease phenotype network based on clinical manifestations. Journal

of biomedical informatics, 2014.

[49] C. L. L. Chua, G. Brown, J. A. Hamilton, S. Rogerson, and P. Boeuf. Monocytes and macrophages in malaria: protection or pathology? Trends in parasitology, 29(1):26–34, 2013.

[50] J. Cosnes, C. Gower-Rousseau, P. Seksik, and A. Cortot. Epidemiology and natural history of inflammatory bowel diseases. Gastroenterology, 140(6):1785–1794, 2011.

[51] F. F. Costa. Big data in biomedicine. Drug discovery today, 19(4):433–440, 2014.

[52] P. D. Crompton, J. Moebius, S. Portugal, M. Waisberg, G. Hart, L. S. Garver, L. H. Miller, C. Barillas, and S. K. Pierce. Malaria immunity in man and mosquito: Insights into unsolved mysteries of a deadly infectious disease. Annual review of immunology, 32:157–187, 2014.

[53] C. Crosnier, L. Y. Bustamante, S. J. Bartholdson, A. K. Bei, M. Theron, M. Uchikawa, S. Mboup, O. Ndir, D. P. Kwiatkowski, M. T. Duraisingh, et al. Basigin is a receptor essential for erythrocyte invasion by plasmodium falciparum. Nature, 480(7378):534–537, 2011.

[54] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.

[55] M. Dalamaga, K. N. Diakopoulos, and C. S. Mantzoros. The role of

adiponectin in cancer: a review of current evidence. Endocrine reviews,

33(4):547–594, 2012.

[56] E. Danese, M. Montagnana, A. M. Minicozzi, S. Bonafini, O. Ruzzenente, M. Gelati, G. De Manzoni, G. Lippi, and G. C. Guidi. The role of resistin in

colorectal cancer. Clinica Chimica Acta, 413(7):760–764, 2012.

[57] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Image retrieval: Ideas, influences,

and trends of the new age. ACM Computing Surveys (CSUR), 40(2):5, 2008.

[58] S. Davis and P. S. Meltzer. Geoquery: a bridge between the gene expression

omnibus (geo) and bioconductor. Bioinformatics, 23(14):1846–1847, 2007.

[59] J. Deng, A. C. Berg, and L. Fei-Fei. Hierarchical semantic indexing for large

scale image retrieval. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 785–792. IEEE, 2011.

[60] T. M. Deserno, S. Antani, and L. Rodney Long. Content-based image retrieval for scientific literature access. Methods of information in medicine, 48(4):371, 2009.

[61] A. Docquier, P.-O. Harmand, S. Fritsch, M. Chanrion, J.-M. Darbon, and V. Cavailles. The transcriptional coregulator rip140 represses e2f1 activity and discriminates breast cancer subtypes. Clinical Cancer Research, 16(11):2959–2970, 2010.

[62] A. M. Dondorp, F. Nosten, P. Yi, D. Das, A. P. Phyo, J. Tarning, K. M. Lwin, F. Ariey, W. Hanpithakpong, S. J. Lee, et al. Artemisinin resistance in plasmodium falciparum malaria. New England Journal of Medicine, 361(5):455–467, 2009.

[63] J. T. Dudley, M. Sirota, M. Shenoy, R. K. Pai, S. Roedder, A. P. Chiang, A. A. Morgan, M. M. Sarwal, P. J. Pasricha, and A. J. Butte. Computational repositioning of the anticonvulsant topiramate for inflammatory bowel disease. Science translational medicine, 3(96):96ra76–96ra76, 2011.

[64] J. T. Eppig, J. A. Blake, C. J. Bult, J. A. Kadin, J. E. Richardson, et al. The

mouse genome database (mgd): facilitating mouse as a model for human

biology and disease. Nucleic acids research, 43(D1):D726–D736, 2015.

[65] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object

detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010.

[66] H. Feng, R. Shi, and T.-S. Chua. A bootstrapping framework for annotating and retrieving www images. In Proceedings of the 12th annual ACM international conference on Multimedia, pages 960–967. ACM, 2004.

[67] B. Ferwerda, M. B. McCall, S. Alonso, E. J. Giamarellos-Bourboulis, M. Mouktaroudi, N. Izagirre, D. Syafruddin, G. Kibiki, T. Cristea, A. Hijmans, et al. Tlr4 polymorphisms, infectious diseases, and evolutionary pressure during migration of modern humans. Proceedings of the National Academy of Sciences, 104(42):16645–16650, 2007.

[68] L. Franke, H. Van Bakel, L. Fokkens, E. D. De Jong, M. Egmont-Petersen, and C. Wijmenga. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. The American Journal of Human Genetics, 78(6):1011–1025, 2006.

[69] T. Gandhi, J. Zhong, S. Mathivanan, L. Karthick, K. Chandrika, S. S. Mohan, S. Sharma, S. Pinkert, S. Nagaraju, B. Periaswamy, et al. Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nature genetics, 38(3):285–293, 2006.

[70] O. Ganry, B. Lapotre-Ledoux, P. Fardellone, and A. Dubreuil. Bone mass

density, subsequent risk of colon cancer and survival in postmenopausal

women. European journal of epidemiology, 23(7):467–473, 2008.

[71] M. J. Gardner, N. Hall, E. Fung, O. White, M. Berriman, R. W. Hyman, J. M.

Carlton, A. Pain, K. E. Nelson, S. Bowman, et al. Genome sequence of the human malaria parasite plasmodium falciparum. Nature, 419(6906):498–511,

2002.

[72] R. Gerard, B. Sendid, J.-F. Colombel, D. Poulain, and T. Jouault. An immunological link between candida albicans colonization and crohn’s disease. Critical reviews in microbiology, (0):1–5, 2013.

[73] U. Gergis, J. Arnason, R. Yantiss, T. Shore, U. Wissa, E. Feldman, and

T. Woodworth. Effectiveness and safety of tocilizumab, an anti–interleukin-6 receptor monoclonal antibody, in a patient with refractory gi graft-versus-host disease. Journal of Clinical Oncology, 28(30):e602–e604, 2010.

[74] K.-I. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and A.-L. Barabási.

The human disease network. Proceedings of the National Academy of Sciences,

104(21):8685–8690, 2007.

[75] A. Gottlieb, G. Y. Stein, E. Ruppin, and R. Sharan. Predict: a method for inferring novel drug indications with application to personalized medicine.

Molecular systems biology, 7(1), 2011.

[76] G. E. Grau, C. D. Mackenzie, R. A. Carr, M. Redard, G. Pizzolato, C. Allasia,

C. Cataldo, T. E. Taylor, and M. E. Molyneux. Platelet accumulation in brain

microvessels in fatal pediatric cerebral malaria. Journal of Infectious Diseases,

187(3):461–466, 2003.

[77] K. A. Gray, L. C. Daugherty, S. M. Gordon, R. L. Seal, M. W. Wright, and E. A.

Bruford. Genenames.org: the hgnc resources in 2013. Nucleic acids research, page gks1066, 2012.

[78] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: an update. ACM SIGKDD explorations

newsletter, 11(1):10–18, 2009.

[79] A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, and V. A. McKusick.

Online mendelian inheritance in man (omim), a knowledgebase of human

genes and genetic disorders. Nucleic acids research, 33(suppl 1):D514–D517,

2005.

[80] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD Record, volume 29, pages 1–12. ACM, 2000.

[81] J. C. Hardwick, L. L. Kodach, G. J. Offerhaus, and G. R. Van den Brink. Bone morphogenetic protein signalling in colorectal cancer. Nature Reviews Cancer,

8(10):806–812, 2008.

[82] C. A. Hidalgo, N. Blumm, A.-L. Barabási, and N. A. Christakis. A dynamic

network approach for the study of human phenotypes. PLoS computational

biology, 5(4):e1000353, 2009.

[83] A. V. Hill, C. E. Allsopp, D. Kwiatkowski, N. M. Anstey, P. Twumasi, P. A. Rowe, S. Bennett, D. Brewster, A. J. McMichael, and B. M. Greenwood. Common west african hla antigens are associated with protection from severe

malaria. Nature, 352(6336):595–600, 1991.

[84] L. A. Hindorff, P. Sethupathy, H. A. Junkins, E. M. Ramos, J. P. Mehta, F. S.

Collins, and T. A. Manolio. Potential etiologic and functional implications of

genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences, 106(23):9362–9367, 2009.

[85] S. A. Hirota, J. Ng, A. Lueng, M. Khajah, K. Parhar, Y. Li, V. Lam, M. S. Potentier, K. Ng, M. Bawa, et al. Nlrp3 inflammasome plays a key role in the

regulation of intestinal homeostasis. Inflammatory bowel diseases, 17(6):1359–1372, 2011.

[86] R. Hoehndorf, T. Hiebert, N. W. Hardy, P. N. Schofield, G. V. Gkoutos, and

M. Dumontier. Mouse model phenotypes provide information about human

drug targets. Bioinformatics, 30(5):719–725, 2014.

[87] R. Hoehndorf, A. Oellrich, D. Rebholz-Schuhmann, P. N. Schofield, and G. V.

Gkoutos. Linking pharmgkb to phenotype studies and animal models of disease for drug repurposing. World Scientific.

[88] R. Hoehndorf, P. N. Schofield, and G. V. Gkoutos. Phenomenet: a whole-phenome approach to disease gene discovery. Nucleic acids research,

39(18):e119–e119, 2011.

[89] A. L. Hopkins and C. R. Groom. The druggable genome. Nature reviews Drug

discovery, 1(9):727–730, 2002.

[90] D. Houle, D. R. Govindaraju, and S. Omholt. Phenomics: the next challenge.

Nature Reviews Genetics, 11(12):855–866, 2010.

[91] J. R. Howe, J. L. Bair, M. G. Sayed, M. E. Anderson, F. A. Mitros, G. M. Petersen, V. E. Velculescu, G. Traverso, and B. Vogelstein. Germline mutations

of the gene encoding bone morphogenetic protein receptor 1a in juvenile

polyposis. Nature genetics, 28(2):184–187, 2001.

[92] R. Howells and L. Maxwell. Citric acid cycle activity and chloroquine resistance in rodent malaria parasites: the role of the reticulocyte. Annals of tropical medicine and parasitology, 67(3):285, 1973.

[93] B. L. Humphreys, D. A. Lindberg, H. M. Schoolman, and G. O. Barnett. The unified medical language system: an informatics research collaboration. Journal of the American Medical Informatics Association, 5(1):1–11, 1998.

[94] M. Hurle, L. Yang, Q. Xie, D. Rajpal, P. Sanseau, and P. Agarwal. Computational drug repositioning: from data to therapeutics. Clinical Pharmacology &

Therapeutics, 93(4):335–341, 2013.

[95] T. Hwang, G. Atluri, M. Xie, S. Dey, C. Hong, V. Kumar, and R. Kuang. Co-clustering phenome–genome for phenotype classification and disease gene discovery. Nucleic acids research, 40(19):e146–e146, 2012.

[96] J. H. Janes, C. P. Wang, E. Levin-Edens, I. Vigan-Womas, M. Guillotte,

M. Melcher, O. Mercereau-Puijalon, and J. D. Smith. Investigating the host

binding signature on the plasmodium falciparum pfemp1 protein family.

PLoS pathogens, 7(5):e1002032, 2011.

[97] L. Jostins, S. Ripke, R. K. Weersma, R. H. Duerr, D. P. McGovern, K. Y. Hui,

J. C. Lee, L. P. Schumm, Y. Sharma, C. A. Anderson, et al. Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease.

Nature, 491(7422):119–124, 2012.

[98] M. Kanehisa and S. Goto. Kegg: kyoto encyclopedia of genes and genomes. Nucleic acids research, 28(1):27–30, 2000.

[99] S. Kar and S. Kar. Control of malaria. Nature Reviews Drug Discovery,

9(7):511–512, 2010.

[100] A. Kaushansky, A. S. Ye, L. S. Austin, S. A. Mikolajczak, A. M. Vaughan,

N. Camargo, P. G. Metzger, A. N. Douglass, G. MacBeath, and S. H. Kappe.

Suppression of host p53 is critical for plasmodium liver-stage infection. Cell reports, 3(3):630–637, 2013.

[101] D. B. Keator, J. S. Grethe, D. Marcus, B. Ozyurt, S. Gadde, S. Murphy, S. Pieper, D. Greve, R. Notestine, H. J. Bockholt, et al. A national human

neuroimaging collaboratory enabled by the biomedical informatics research

network (birn). Information Technology in Biomedicine, IEEE Transactions on,

12(2):162–172, 2008.

[102] L. Khaodhiar, K. C. McCowen, and G. L. Blackburn. Obesity and its comorbid conditions. Clinical cornerstone, 2(3):17–31, 1999.

[103] C.-C. Khor and M. L. Hibberd. Revealing the molecular signatures of host-pathogen interactions. Genome Biol, 12(10):229, 2011.

[104] Y. Kim and K. Schneider. Evolution of drug resistance in malaria parasite

populations. Nature Education Knowledge, 4(8):6, 2013.

[105] S. Köhler, S. Bauer, D. Horn, and P. N. Robinson. Walking the interactome

for prioritization of candidate disease genes. The American Journal of Human

Genetics, 82(4):949–958, 2008.

[106] D. Komninou, A. Ayonote, J. P. Richie, and B. Rigas. Insulin resistance and

its contribution to colon carcinogenesis. Experimental Biology and Medicine, 228(4):396–405, 2003.

[107] J. O. Korbel, T. Doerks, L. J. Jensen, C. Perez-Iratxeta, S. Kaczanowski, S. D.

Hooper, M. A. Andrade, and P. Bork. Systematic association of genes to

phenotypes by genome and literature mining. PLoS biology, 3(5):e134, 2005.

[108] J. R. Korzenik, B. K. Dieckgraefe, J. F. Valentine, D. F. Hausman, and M. J.

Gilbert. Sargramostim for active crohn’s disease. New England Journal of

Medicine, 352(21):2193–2201, 2005.

[109] J. Y. Krzeszinski, W. Wei, H. Huynh, Z. Jin, X. Wang, T.-C. Chang, X.-J. Xie, L. He, L. S. Mangala, G. Lopez-Berestein, et al. mir-34a blocks osteoporosis and bone metastasis by inhibiting osteoclastogenesis and tgif2. Nature,

512(7515):431–435, 2014.

[110] T. Kufer, E. Kremmer, D. Banks, and D. Philpott. Role for erbin in bacterial

activation of nod2. Infection and immunity, 74(6):3115–3124, 2006.

[111] K. Lage, E. O. Karlberg, Z. M. Størling, P. I. Olason, A. G. Pedersen, O. Rigina,

A. M. Hinsby, Z. Tümer, F. Pociot, N. Tommerup, et al. A human phenome-interactome network of protein complexes implicated in genetic disorders.

Nature biotechnology, 25(3):309–316, 2007.

[112] V. Law, C. Knox, Y. Djoumbou, T. Jewison, A. C. Guo, Y. Liu, A. Maciejewski,

D. Arndt, M. Wilson, V. Neveu, et al. Drugbank 4.0: shedding new light on

drug metabolism. Nucleic acids research, 42(D1):D1091–D1097, 2014.

[113] M. Lazzerini, S. Martelossi, G. Magazzù, S. Pellegrino, M. C. Lucanto, A. Barabino, A. Calvi, S. Arrigo, P. Lionetti, M. Lorusso, et al. Effect of thalidomide on clinical remission in children and adolescents with refractory crohn disease: a randomized clinical trial. JAMA, 310(20):2164–2173, 2013.

[114] Y. Lee, H. Li, J. Li, E. Rebman, I. Achour, K. E. Regan, E. R. Gamazon, J. L.

Chen, X. H. Yang, N. J. Cox, et al. Network models of genome-wide association studies uncover the topological centrality of protein interactions

in complex diseases. Journal of the American Medical Informatics Association, 20(4):619–629, 2013.

[115] D. LeRoith and C. T. Roberts. The insulin-like growth factor system and cancer. Cancer letters, 195(2):127–137, 2003.

[116] S. Lesage and A. Brice. Parkinson’s disease: from monogenic forms to genetic susceptibility factors. Human molecular genetics, 18(R1):R48–R59, 2009.

[117] S. Lesage and A. Brice. Role of mendelian genes in “sporadic” parkinson’s

disease. Parkinsonism & related disorders, 18:S66–S70, 2012.

[118] D. Lewinson, A. Rachmiel, S. Rihani-Bisharat, Z. Kraiem, P. Schenzer, S. Korem, and Y. Rabinovich. Stimulation of fos- and jun-related genes during distraction osteogenesis. Journal of Histochemistry & Cytochemistry, 51(9):1161–1168, 2003.

[119] H. Li, Y. Lee, J. L. Chen, E. Rebman, J. Li, and Y. A. Lussier. Complex-disease networks of trait-associated single-nucleotide polymorphisms (snps)

unveiled by information theory. Journal of the American Medical Informatics

Association, 19(2):295–305, 2012.

[120] J. Li and J. Z. Wang. Real-time computerized annotation of pictures. Pattern

Analysis and Machine Intelligence, IEEE Transactions on, 30(6):985–1002, 2008.

[121] Y. Li and J. C. Patra. Genome-wide inferring gene–phenotype relationship by walking on the heterogeneous network. Bioinformatics, 26(9):1219–1224,

2010.

[122] A. Liberzon, A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, P. Tamayo,

and J. P. Mesirov. Molecular signatures database (msigdb) 3.0. Bioinformatics,

27(12):1739–1740, 2011.

[123] D. A. Lindberg, B. L. Humphreys, and A. T. McCray. The unified medical

language system. Methods of information in medicine, 32(4):281–291, 1993.

[124] S. Liu, W. Ma, R. Moore, V. Ganesan, and S. Nelson. Rxnorm: prescription for electronic drug information exchange. IT professional, 7(5):17–23, 2005.

[125] D. G. Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. IEEE, 1999.

[126] Z. Luo, G.-Q. Zhang, and R. Xu. Mining patterns of adverse events using

aggregated clinical trial results. AMIA Summits on Translational Science Proceedings, 2013:112, 2013.

[127] C. Lupfer, P. G. Thomas, P. K. Anand, P. Vogel, S. Milasta, J. Martinez, G. Huang, M. Green, M. Kundu, H. Chi, et al. Receptor interacting protein kinase 2-mediated mitophagy regulates inflammasome activation during virus infection. Nature immunology, 14(5):480–488, 2013.

[128] WHO. World malaria report 2013. [online].

http://www.who.int/malaria/publications/world_malaria_report_2013/en/.

Accessed: 2014.

[129] M. Mamtani and H. Kulkarni. Association of hadha expression with the risk of breast cancer: targeted subset analysis and meta-analysis of microarray

data. BMC research notes, 5(1):25, 2012.

[130] D. J. Margolis, M. Fanelli, O. Hoffstad, and J. D. Lewis. Potential association between the oral tetracycline class of antimicrobials used to treat acne

and inflammatory bowel disease. The American journal of gastroenterology, 105(12):2610–2616, 2010.

[131] A. T. McCray. An upper-level ontology for the biomedical domain. Comparative and Functional Genomics, 4(1):80–84, 2003.

[132] I. M. Medana and G. D. Turner. Human cerebral malaria and the blood–brain

barrier. International journal for parasitology, 36(5):555–568, 2006.

[133] R. Ménard, J. Tavares, I. Cockburn, M. Markus, F. Zavala, and R. Amino.

Looking under the skin: the first steps in malarial infection and immunity.

Nature Reviews Microbiology, 11(10):701–712, 2013.

[134] J. Mesquita, S. Souto, A. Varela, P. Freitas, M. J. Matos, M. Ferreira, F. Correia,

D. Braga, D. Carvalho, and J. L. Medina. Metabolically healthy but obese individuals. Lancet, 372(9646):1281–1283, 2008.

[135] J. Mestres, E. Gregori-Puigjané, S. Valverde, and R. V. Sole. Data completeness—the achilles heel of drug-target networks. Nature biotechnology,

26(9):983–984, 2008.

[136] L. H. Miller, H. C. Ackerman, X.-z. Su, and T. E. Wellems. Malaria biology and disease pathogenesis: insights for new treatments. Nature medicine,

19(2):156–167, 2013.

[137] N. A. Molodecky, S. Soon, D. M. Rabi, W. A. Ghali, M. Ferris, G. Chernoff, E. I. Benchimol, R. Panaccione, S. Ghosh, H. W. Barkema, et al. Increasing incidence and prevalence of the inflammatory bowel diseases with time, based

on systematic review. Gastroenterology, 142(1):46–54, 2012.

[138] Y. Moreau and L.-C. Tranchevent. Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nature Reviews Genetics, 13(8):523–536, 2012.

[139] F. J. Morón, N. Mendoza, F. Vázquez, E. Molero, F. Quereda, A. Salinas, J. Fontes, T. Martínez-Astorquiza, R. Sánchez-Borrego, and A. Ruiz. Multi-analysis of estrogen-related genes in spanish postmenopausal women

suggests an interactive role of esr1, esr2 and nrip1 genes in the pathogenesis

of osteoporosis. Bone, 39(1):213–221, 2006.

[140] M. M. Mota, W. Jarra, E. Hirst, P. K. Patnaik, and A. A. Holder. Plasmodium chabaudi-infected erythrocytes adhere to cd36 and bind to microvascular endothelial cells in an organ-specific way. Infection and immunity, 68(7):4135–4144, 2000.

[141] H. Müller, A. G. S. de Herrera, J. Kalpathy-Cramer, D. Demner-Fushman, S. Antani, and I. Eggel. Overview of the imageclef 2012 medical image retrieval and classification tasks. In CLEF (online working notes/labs/workshop), pages 1–16, 2012.

[142] A. Nacer, A. Movila, K. Baer, S. A. Mikolajczak, S. H. Kappe, and U. Frevert.

Neuroimmunological blood brain barrier opening in experimental cerebral

malaria. PLoS pathogens, 8(10):e1002982, 2012.

[143] N. Natarajan and I. S. Dhillon. Inductive matrix completion for predicting

gene–disease associations. Bioinformatics, 30(12):i60–i68, 2014.

[144] R. L. Nelson, M. Turyk, J. Kim, and V. Persky. Bone mineral density and the subsequent risk of cancer in the nhanes i follow-up cohort. BMC cancer,

2(1):22, 2002.

[145] M. G. Netea, A. Simon, F. van de Veerdonk, B.-J. Kullberg, J. W. Van der Meer,

and L. A. Joosten. Il-1β processing in host defense: beyond the inflammasomes. PLoS pathogens, 6(2):e1000661, 2010.

[146] M. E. Newman and M. Girvan. Finding and evaluating community structure

in networks. Physical review E, 69(2):026113, 2004.

[147] N. Nishimoto and T. Kishimoto. Humanized antihuman il-6 receptor antibody, tocilizumab. In Therapeutic Antibodies, pages 151–160. Springer, 2008.

[148] N. L. Nock, A. Patrick-Melin, M. Cook, C. Thompson, J. P. Kirwan, and L. Li.

Higher bone mineral density is associated with a decreased risk of colorectal

adenomas. International Journal of Cancer, 129(4):956–964, 2011.

[149] P. Nuchnoi, J. Ohashi, R. Kimura, H. Hananantachai, I. Naka, S. Krudsood,

S. Looareesuwan, K. Tokunaga, and J. Patarapotikul. Significant association between tim1 promoter polymorphisms and protection against cerebral malaria in thailand. Annals of human genetics, 72(3):327–336, 2008.

[150] S. Okada, Z.-Q. Wang, A. E. Grigoriadis, E. F. Wagner, and T. von Rüden. Mice lacking c-fos have normal hematopoietic stem cells but exhibit altered b-cell differentiation due to an impaired bone marrow environment. Molecular and cellular biology, 14(1):382–390, 1994.

[151] Y. Okada, D. Wu, G. Trynka, T. Raj, C. Terao, K. Ikari, Y. Kochi, K. Ohmura,

A. Suzuki, S. Yoshida, et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature, 506(7488):376–381, 2014.

[152] C. W. Olanow, M. B. Stern, and K. Sethi. The scientific and clinical basis for the treatment of parkinson disease (2009). Neurology, 72(21 Supplement

4):S1–S136, 2009.

[153] M. Oti, M. A. Huynen, and H. G. Brunner. Phenome connections. Trends in

genetics, 24(3):103–106, 2008.

[154] M. Oti, M. A. Huynen, and H. G. Brunner. The biological coherence of human phenome databases. The American Journal of Human Genetics, 85(6):801–808, 2009.

[155] A. Pain, D. J. Ferguson, O. Kai, B. C. Urban, B. Lowe, K. Marsh, and D. J.

Roberts. Platelet-mediated clumping of plasmodium falciparum-infected erythrocytes is a common adhesive phenotype and is associated with severe

malaria. Proceedings of the National Academy of Sciences, 98(4):1805–1810, 2001.

[156] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping

community structure of complex networks in nature and society. Nature,

435(7043):814–818, 2005.

[157] J. Park, D.-S. Lee, N. A. Christakis, and A.-L. Barabási. The impact of cellular

networks on disease comorbidity. Molecular systems biology, 5(1), 2009.

[158] J. Parker, K. McCullough, B. Field, J. Minnion, N. Martin, M. Ghatei, and

S. Bloom. Glucagon and glp-1 inhibit food intake and increase c-fos expression in similar appetite regulating centres in the brainstem and amygdala. International Journal of Obesity, 37(10):1391–1398, 2013.

[159] N. D. Pasternak and R. Dzikowski. Pfemp1: An antigen that plays a key role in the pathogenicity and immune evasion of the malaria parasite plasmodium falciparum. The international journal of biochemistry & cell biology,

41(7):1463–1466, 2009.

[160] J. Pfannschmidt, S. Bade, J. Hoheisel, T. Muley, H. Dienemann, and E. Herpel. Identification of immunohistochemical prognostic markers for survival after resection of pulmonary metastases from colorectal carcinoma. The Thoracic and cardiovascular surgeon, 57(7):403–408, 2009.

[161] D. J. Philpott, M. T. Sorbara, S. J. Robertson, K. Croitoru, and S. E. Girardin.

Nod proteins: regulators of inflammation in health and disease. Nature Reviews Immunology, 14(1):9–23, 2014.

[162] R. M. Piro and F. Di Cunto. Computational approaches to disease-gene prediction: rationale, classification and successes. FEBS Journal, 279(5):678–696,

2012.

[163] R. M. Plenge, E. M. Scolnick, and D. Altshuler. Validating therapeutic targets

through human genetics. Nature Reviews Drug Discovery, 12(8):581–594, 2013.

[164] M. Pollak. Insulin and insulin-like growth factor signalling in neoplasia.

Nature Reviews Cancer, 8(12):915–928, 2008.

[165] J. E. Puche and I. Castilla-Cortázar. Human conditions of insulin-like growth

factor-i (igf-i) deficiency. J Transl Med, 10:224, 2012.

[166] H. R. Rahimi, M. Shiri, and A. Razmi. Antidepressants can treat inflammatory bowel disease through regulation of the nuclear factor-κb/nitric oxide

pathway and inhibition of cytokine production: A hypothesis. World journal

of gastrointestinal pharmacology and therapeutics, 3(6):83, 2012.

[167] R. Rahimi, S. Nikfar, A. Rezaie, and M. Abdollahi. Efficacy of tricyclic antidepressants in irritable bowel syndrome: a meta-analysis. World journal of

gastroenterology: WJG, 15(13):1548, 2009.

[168] A. G. Renehan, M. Zwahlen, C. Minder, S. T. O’Dwyer, S. M. Shalet, and M. Egger. Insulin-like growth factor (igf)-i, igf binding protein-3, and

cancer risk: systematic review and meta-regression analysis. The Lancet,

363(9418):1346–1353, 2004.

[169] B. A. Robinson, T. L. Welch, and J. D. Smith. Widespread functional specialization of plasmodium falciparum erythrocyte membrane protein 1 family members to bind cd36 analysed across a parasite genome. Molecular microbiology, 47(5):1265–1278, 2003.

[170] P. N. Robinson, S. Köhler, S. Bauer, D. Seelow, D. Horn, and S. Mundlos.

The human phenotype ontology: a tool for annotating and analyzing human

hereditary disease. The American Journal of Human Genetics, 83(5):610–615,

2008.

[171] F. S. Roque, P. B. Jensen, H. Schmock, M. Dalgaard, M. Andreatta, T. Hansen, K. Søeby, S. Bredkjær, A. Juul, T. Werge, et al. Using electronic patient records

to discover disease correlations and stratify patient cohorts. PLoS computational biology, 7(8):e1002141, 2011.

[172] A. L. Rosenbloom. Mecasermin (recombinant human insulin-like growth

factor i). Advances in therapy, 26(1):40–54, 2009.

[173] C. Rosse and J. L. Mejino. A reference ontology for biomedical informatics: the foundational model of anatomy. Journal of biomedical informatics, 36(6):478–500, 2003.

[174] L. Roth, J. K. MacDonald, J. W. McDonald, and N. Chande. Sargramostim (gm-csf) for induction of remission in crohn’s disease. The Cochrane Library,

2011.

[175] A. Rzhetsky, D. Wajngurt, N. Park, and T. Zheng. Probing genetic overlap among complex human phenotypes. Proceedings of the National Academy of

Sciences, 104(28):11694–11699, 2007.

[176] G. Sabatakos, N. Sims, J. Chen, K. Aoki, M. Kelz, M. Amling, Y. Bouali,

K. Mukhopadhyay, K. Ford, E. Nestler, et al. Overexpression of δfosb transcription factor(s) increases bone formation and inhibits adipogenesis. Nature medicine, 6(9):985–990, 2000.

[177] M. Salathé and J. H. Jones. Dynamics and control of diseases in networks with community structure. PLoS Computational Biology, 6(4):e1000736, 2010.

[178] P. Sanseau, P. Agarwal, M. R. Barnes, T. Pastinen, J. B. Richards, L. R. Cardon,

and V. Mooser. Use of genome-wide association studies for drug repositioning. Nature biotechnology, 30(4):317–320, 2012.

[179] R. B. Sartor. Mechanisms of disease: pathogenesis of crohn’s disease and ulcerative colitis. Nature clinical practice Gastroenterology & hepatology, 3(7):390–407, 2006.

[180] C. E. Sawian, S. D. Lourembam, A. Banerjee, and S. Baruah. Polymorphisms and expression of tlr4 and 9 in malaria in two ethnic groups of assam, northeast india. Innate immunity, 19(2):174–183, 2013.

[181] A. Scholzen and R. W. Sauerwein. How malaria modulates memory: activation and dysregulation of b cells in plasmodium infection. Trends in

parasitology, 29(5):252–262, 2013.

[182] N. H. Shah and J. D. Tenenbaum. The coming age of data-driven medicine:

translational bioinformatics’ next frontier. Journal of the American Medical Informatics Association, 19(e1):e2–e4, 2012.

[183] M. S. Simpson, D. You, M. M. Rahman, S. K. Antani, G. R. Thoma, and D. Demner-Fushman. Towards the creation of a visual ontology of biomedical imaging entities. In AMIA Annual Symposium proceedings, volume 2012,

page 866. American Medical Informatics Association, 2012.

[184] V. N. Slee. The international classification of diseases: ninth revision (icd-9).

Annals of internal medicine, 88(3):424–426, 1978.

[185] J. D. Smith, A. G. Craig, N. Kriek, D. Hudson-Taylor, S. Kyes, T. Fagen,

R. Pinches, D. I. Baruch, C. I. Newbold, and L. H. Miller. Identification of a plasmodium falciparum intercellular adhesion molecule-1 binding domain:

a parasite adhesion trait implicated in cerebral malaria. Proceedings of the

National Academy of Sciences, 97(4):1766–1771, 2000.

[186] B. Snel, G. Lehmann, P. Bork, and M. A. Huynen. String: a web-server to

retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic acids research, 28(18):3442–3444, 2000.

[187] A. Standaert-Vitse, B. Sendid, M. Joossens, N. François, P. Vandewalle-El Khoury, J. Branche, H. Van Kruiningen, T. Jouault, P. Rutgeerts, C. Gower-Rousseau, et al. Candida albicans colonization and asca in familial crohn’s

disease. The American journal of gastroenterology, 104(7):1745–1753, 2009.

[188] P. Stattin, R. Palmqvist, S. Söderberg, C. Biessy, B. Ardnor, G. Hallmans, R. Kaaks, and T. Olsson. Plasma leptin and colorectal cancer risk: a prospective study in northern sweden. Oncology reports, 10(6):2015–2021, 2003.

[189] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, et al. Gene set

enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545–15550, 2005.

[190] K. Tamakoshi, H. Toyoshima, K. Wakai, M. Kojima, K. Suzuki, Y. Watanabe,

N. Hayakawa, H. Yatsuya, T. Kondo, S. Tokudome, et al. Leptin is associated

with an increased female colorectal cancer risk: a nested case-control study

in japan. Oncology, 68(4-6):454–461, 2004.

[191] K. M. Thakali, J. Saben, J. B. Faske, F. Lindsey, H. Gomez-Acevedo, C. L.

Lowery Jr, T. M. Badger, A. Andres, and K. Shankar. Maternal pregravid obesity changes gene expression profiles toward greater inflammation and

reduced insulin sensitivity in umbilical cord. Pediatric research, 76(2):202–210,

2014.

[192] W.-H. Tham, D. W. Wilson, S. Lopaticki, C. Q. Schmidt, P. B. Tetteh-Quarcoo,

P. N. Barlow, D. Richard, J. E. Corbin, J. G. Beeson, and A. F. Cowman. Complement receptor 1 is the host erythrocyte receptor for plasmodium falciparum pfrh4 invasion ligand. Proceedings of the National Academy of Sciences,

107(40):17327–17332, 2010.

[193] N. Tiffin, M. A. Andrade-Navarro, and C. Perez-Iratxeta. Linking genes to

diseases: it’s all in the data. Genome Med, 1(8):77, 2009.

[194] J. Tomalka, S. Ganesan, E. Azodi, K. Patel, P. Majmudar, B. A. Hall, K. A.

Fitzgerald, and A. G. Hise. A novel role for the nlrc4 inflammasome in mucosal defenses against the fungal pathogen candida albicans. PLoS pathogens,

7(12):e1002379, 2011.

[195] S. Tong and E. Chang. Support vector machine active learning for image retrieval. In Proceedings of the ninth ACM international conference on Multimedia,

pages 107–118. ACM, 2001.

[196] K. L. Townsend, R. Suzuki, T. L. Huang, E. Jing, T. J. Schulz, K. Lee, C. M. Taniguchi, D. O. Espinoza, L. E. McDougall, H. Zhang, et al. Bone morphogenetic protein 7 (bmp7) reverses obesity and regulates appetite through a

central mtor pathway. The FASEB Journal, 26(5):2187–2196, 2012.

[197] M. Tringali, W. T. Hole, and S. Srinivasan. Integration of a standard gastrointestinal endoscopy terminology in the umls metathesaurus. In Proceedings of

the AMIA symposium, page 801. American Medical Informatics Association, 2002.

[198] A. K. Tripathi, W. Sha, V. Shulaev, M. F. Stins, and D. J. Sullivan. Plasmodium falciparum–infected erythrocytes induce nf-κb regulated inflammatory

pathways in human cerebral endothelium. Blood, 114(19):4243–4252, 2009.

[199] L. Turner, T. Lavstsen, S. S. Berger, C. W. Wang, J. E. Petersen, M. Avril,

A. J. Brazier, J. Freeth, J. S. Jespersen, M. A. Nielsen, et al. Severe malaria

is associated with parasite binding to endothelial protein c receptor. Nature,

498(7455):502–505, 2013.

[200] M. A. van Driel, J. Bruggeman, G. Vriend, H. G. Brunner, and J. A. Leunissen.

A text-mining analysis of the human phenome. European journal of human genetics, 14(5):535–542, 2006.

[201] O. Vanunu, O. Magger, E. Ruppin, T. Shlomi, and R. Sharan. Associating genes and protein complexes with disease via network propagation. PLoS

computational biology, 6(1):e1000641, 2010.

[202] D. Vestweber and J. E. Blanks. Mechanisms that regulate the function of the

selectins and their ligands. Physiological reviews, 79(1):181–213, 1999.

[203] V. Vialou, H. Cui, M. Perello, M. Mahgoub, G. Y. Hana, A. J. Rush, H. Pranav,

S. Jung, M. Yangisawa, J. M. Zigman, et al. A role for δfosb in calorie

restriction-induced metabolic changes. Biological psychiatry, 70(2):204–207, 2011.

[204] A.-C. Villani, M. Lemire, G. Fortin, E. Louis, M. S. Silverberg, C. Collette,

N. Baba, C. Libioulle, J. Belaiche, A. Bitton, et al. Common variants in

the nlrp3 region contribute to crohn’s disease susceptibility. Nature genetics,

41(1):71–76, 2009.

[205] B. J. Visser, R. W. Wieten, I. M. Nagel, and M. P. Grobusch. Serum lipids

and lipoproteins in malaria-a systematic review and meta-analysis. Malaria

journal, 12(1):442, 2013.

[206] X. Wang, D. D. Kang, K. Shen, C. Song, S. Lu, L.-C. Chang, S. G. Liao, Z. Huo, S. Tang, Y. Ding, et al. An r package suite for microarray meta-analysis in

quality control, differentially expressed gene analysis and pathway enrichment detection. Bioinformatics, 28(19):2534–2536, 2012.

[207] Z.-Y. Wang and H.-Y. Zhang. Rational drug repositioning by medical genet-

ics. Nature biotechnology, 31(12):1080–1082, 2013.

[208] S. C. Wassmer, C. Lépolard, B. Traoré, B. Pouvelle, J. Gysin, and G. E. Grau. Platelets reorient plasmodium falciparum–infected erythrocyte cytoadhesion to activated endothelial cells. Journal of Infectious Diseases, 189(2):180–189, 2004.

[209] E. K. Wei, E. Giovannucci, C. S. Fuchs, W. C. Willett, and C. S. Mantzoros. Low plasma adiponectin levels and risk of colorectal cancer in men: a

prospective study. Journal of the National Cancer Institute, 97(22):1688–1694,

2005.

[210] A. R. Williams, A. D. Douglas, K. Miura, J. J. Illingworth, P. Choudhary, L. M.

Murungi, J. M. Furze, A. Diouf, O. Miotto, C. Crosnier, et al. Enhancing

blockade of plasmodium falciparum erythrocyte invasion: assessing combinations of antibodies against pfrh5 and other merozoite antigens. PLoS

pathogens, 8(11):e1002991, 2012.

[211] J. Wu, L. Tian, X. Yu, S. Pattaradilokrat, J. Li, M. Wang, W. Yu, Y. Qi, A. E.

Zeituni, S. C. Nair, et al. Strain-specific innate immune signaling pathways

determine malaria parasitemia dynamics and host mortality. Proceedings of

the National Academy of Sciences, 111(4):E511–E520, 2014.

[212] X. Wu, R. Jiang, M. Q. Zhang, and S. Li. Network-based global inference of

human disease genes. Molecular systems biology, 4(1), 2008.

[213] X. Wu, Q. Liu, and R. Jiang. Align human interactome with phenome to identify causative genes and networks underlying disease families. Bioinformatics, 25(1):98–104, 2009.

[214] J. Xu and Y. Li. Discovering disease-genes by topological features in human

protein–protein interaction network. Bioinformatics, 22(22):2800–2805, 2006.

[215] R. Xu and Q. Wang. Large-scale extraction of accurate drug-disease treatment pairs from biomedical literature for drug repurposing. BMC bioinformatics, 14(1):181, 2013.

[216] R. Xu and Q. Wang. A semi-supervised approach to extract

pharmacogenomics-specific drug–gene pairs from biomedical literature

for personalized medicine. Journal of biomedical informatics, 46(4):585–593, 2013.

[217] A. N. Zeba, H. Sorgho, N. Rouamba, I. Zongo, J. Rouamba, R. T. Guiguemdé, D. H. Hamer, N. Mokhtar, and J.-B. Ouedraogo. Major reduction of malaria morbidity with combined vitamin a and zinc supplementation in young children in burkina faso: a randomized double blind trial. Nutr J, 7(7):7, 2008.

[218] J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive

study. International journal of computer vision, 73(2):213–238, 2007.

[219] L.-J. Zhao, Y.-J. Liu, P.-Y. Liu, J. Hamilton, R. R. Recker, and H.-W. Deng. Relationship of obesity with osteoporosis. The Journal of Clinical Endocrinology

& Metabolism, 92(5):1640–1646, 2007.

[220] H. Zheng, Z. Tan, and W. Xu. Immune evasion strategies of pre-erythrocytic

malaria parasites. Mediators of Inflammation, 2014, 2014.
