c 2018 Sheng Wang LEVERAGING KNOWLEDGE NETWORKS FOR PRECISION MEDICINE

BY

SHENG WANG

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2018

Urbana, Illinois

Doctoral Committee: Professor ChengXiang Zhai, Chair Assistant Professor Jian Peng, Director of Research Professor Saurabh Sinha Professor Jiawei Han Professor Xinghua Lu ABSTRACT Akin to the exponential growth of genomic sequencing data, high-throughput techniques in proteomics and biotechnology have been creating ever-expanding repositories of proteomic, pharmacological, and interactomic data. Other molecular data, including expression profiles, genomic mutations and cell conditions, have also been massively generated and they are further refining our understanding of disease mechanisms. In addition, patient data, gathered by electronic medical record systems and social medias, complement biological data and pave the way for personalized treatment strategies. Therefore, efficiently and effectively integrating and mining these invaluable data hold the great promising of making precision medicine a reality. However, integrating and mining these large-scale, heterogeneous, and noisy dataset pose several fundamental computational challenges and have therefore become a bottleneck to clinical decision making and medical knowledge discovery. This thesis is a systematic study of mining these biological and healthcare data for precision medicine. I take a network perspective and integrate these datasets into a large knowledge network where nodes are biological concepts and links are biological relationships. I then propose a novel computa- tional framework to mine these knowledge networks. To demonstrate the effectiveness of mining knowledge networks, I will introduce how this framework can be used to understand molecular functions, accelerate drug discovery, and support clinical decision making. To understand molecular functions, I will show how a knowledge network can substantially im- prove function prediction performance and further annotate novel gene sets by mining scientific literature-based knowledge network. To accelerate drug discovery, I will use the knowledge network to predict drug targets and identify drug associated pathways. To sup- port clinical decision making, I will discuss our efforts in integrating genomics data with clinical data to cluster patients, predict patient survival and visualize patient records. Fi- nally, I will conclude this thesis by summarizing how mining knowledge networks advance precision medicine and discussing the promising future work of this thesis.

ii To my wife and my parents for all their love.

iii ACKNOWLEDGMENTS I would like to thank all the people who give me the tremendous help in the past five years. First of all I would like to express my sincere gratitude to my advisor Prof. ChengXiang Zhai for his patience, earnest encouragements, and strongest supports. He always encouraged me to think critically and independently, defended and improved my ideas through discussion with him. From him I have learned the way of being an independent researcher. Also, I am heartily thankful to the way Prof. Zhai treated his students. I also want to express my deepest appreciation to my advisor Prof. Jian Peng. Prof. Peng introduced me to the wonderful world of bioinformatics and network biology. He later guided me to focus on an important topic of cancer genomics. In the pursuit of my Ph.D thesis, I have been always inspired by his passion, vision and knowledge in the field. Prof. Zhai and Prof. Peng worked with me like peers and colleagues. I could not have imagined having a better advisor and mentor for my Ph.D study. I would also like to thank my thesis committee members, Prof. Saurabh Sinha, Prof. Jiawei Han and Prof. Xinghua Lu, for their great feedback and suggestions on my Ph.D. research and thesis work. Prof. Sinha has been giving me invaluable support in many aspects of my research, especially in my postdoc application. I have also benefited a lot from the weekly meetings with him. Prof. Han has set up a great example of an authentic researcher for me: his keen insights and vision, his passion on research and life, his patience in mentoring students, and his diligence. Prof. Lu has given me many insightful comments about cancer genomics research and invaluable support and advice in my career development. I also sincerely thank Dr. Hoifung Poon, Dr. Ryen White and Dr. Eric Horvitz in Microsoft Research, Prof. Trey Ideker at University of California San Diego and Dr. Liewei Wang at Mayo Clinic. Working with them has given me the unique opportunities to be exposed to practical clinical problems, and the outcome of these collaborations contributed to very important pieces of this thesis. I also want to thank all DAIS group members and visitors, who have built such a great environment for research. Finally and above all, I owe my deepest gratitude to my wife Xiuwen and my parents. I want to thank my wife, for her love, understanding, patience, and support at every moment. And I want to thank my parents for their endless and unreserved love, who have been encouraging and supporting me all the time. This thesis is dedicated to them.

iv TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION ...... 1

CHAPTER 2 A FRAMEWORK FOR MINING KNOWLEDGE NETWORKS . . . 3 2.1 Computational approaches for mining knowledge networks ...... 3 2.2 Related work ...... 11

CHAPTER 3 KNOWLEDGE NETWORK ASSISTED GENE FUNCTION AN- NOTATION ...... 13 3.1 Homogeneous Network Embedding for Gene Function Prediction ...... 13 3.2 Integrating Heterogeneous Information for Gene Function Prediction . . . . . 26 3.3 Annotating gene sets by integrating literature data and molecular networks . 36

CHAPTER 4 KNOWLEDGE NETWORK ASSISTED DRUG DISCOVERY . . . . 48 4.1 Knowledge network assisted drug pathway identification ...... 48 4.2 Network-assisted drug target identification ...... 64

CHAPTER 5 KNOWLEDGE NETWORK ASSISTED CLINICAL DECISION MAKING ...... 85 5.1 Constructing A Symptom, Disease, and Drug Network from Clinical Data . . 85 5.2 Mining Adverse Drug Reactions Network from Online Health Forums . . . . 97 5.3 Incorporating knowledge networks in electronic medical records analysis . . . 106

CHAPTER 6 SUMMARY AND FUTURE DIRECTIONS ...... 114 6.1 Knowledge Network-Supported Biological Finding Interpretation ...... 114 6.2 Refining Knowledge Networks with Experimental Data ...... 115 6.3 Uncovering disease mechanisms through network biology ...... 115

REFERENCES ...... 116

v CHAPTER 1: INTRODUCTION

Akin to the exponential growth of genomic sequencing data, high-throughput techniques in proteomics and biotechnology have been creating ever-expanding repositories of genomic, proteomic, and interactomic data, leading to significant breakthroughs in precision medicine [1, 2, 3, 4, 5, 6, 7, 8]. Other molecular data, including expression profiles, genomic muta- tions and cell conditions, have also been massively generated and they are further refining our understanding of disease progression and drug mechanisms of action [9, 10, 11, 12, 13, 14, 15]. Patients data, including clinical data collected in electronic medical records and user-generated online social media data, is also of considerable interest due to its unique value in clinical data mining tasks such as adverse drug reactions detection, patient survival prediction, patient subtyping, and treatment recommendation [16, 17, 18, 19, 20, 21]. These datasets contain valuable knowledge about biological systems, human diseases, and drug mechanisms, creating an unprecedented opportunity for accelerating scientific discov- ery via discovery of knowledge from these data as well as optimizing clinical decisions and improving healthcare by developing intelligent machine learning-based decision support sys- tems. Moreover, it becomes even more pronounced when we jointly mine these heterogeneous dataset. For example, we can use large-scale genome-wide association studies to interpret unexpected adverse drug reactions identified in electronic medical records [22, 23]. Patient’s response to small molecular inhibitors can also be used to refine our understanding of genetic interaction in cancer [24, 25, 26]. In this thesis, I systematically studied how to leverage knowledge networks to discover new biomedical knowledge and support clinical decision making. Despite the success of pre- vious data-driven approaches, they suffer from the noise and missing values in biomedical datasets. Intuitively, integrating different biomedical datasets reduces noise in biomedical dataset, leading to less false-positive results. However, jointly mining different biomedical datasets is more challenging due to the heterogeneity of biomedical datasets. To address these challenges, I propose to integrate different biomedical datasets into a large-scale, het- erogeneous knowledge network. In this knowledge network, nodes are biomedical entities such as drugs, diseases, and . Links are relations such as drug a inhibits gene b, gene c ’is mutated’ in patient d, and gene e interacts with gene f. This knowledge network’s poten- tial for advancing precision medicine is based on the following two factors. First, integrating knowledge from different datasets offers a multi-perspective view of disease progression and drug mechanisms. Applying ‘guilt by association’ rule [27] to this knowledge network boosts the inference of new knowledge by combining information from different sources. Secondly,

1 despite the success of existing biomedical data mining approaches, many of them still suf- fer from noisy and missing data. Using a network-based approach mitigates the noise by correcting the features or attributes according to neighbors, leading to less false-positive results. This thesis aims to tap into this exciting opportunity of mining knowledge network for precision medicine. I will first propose the framework of mining knowledge networks. This framework consists of three network modeling approaches including diffusion model, ho- mogeneous network embedding model, and hetergeneous network embedding model. I will propose two novel network embedding methods which learnt compact representations for nodes in homogeneous networks and heterogeneous networks, respectively. After that, I will demonstrate how this framework can be used in key precision medicine tasks such as gene function annotation, drug discovery, and clinical decision making. In addition to the com- putational contribution introduced in this thesis, I have also developed relevant practical tools and software in this direction. Consequently, this thesis benefits a large population including working biologists, physicians and patients. The organization of this thesis is as follows. In Chapter 2, I will propose the framework of mining knowledge networks. In Chapter 3, I will introduce how this framework can be used to annotate gene functions. In Chapter 4, I will introduce how I use a pharmacogenmics knowledge network to accelerate drug discovery. In Chapter 5, I will introduce how I use a clnical knowledge network to support clinical decision making. Finally, I will conclude this thesis and propose existing future directions.

2 CHAPTER 2: A FRAMEWORK FOR MINING KNOWLEDGE NETWORKS

Knowledge networks integrate biomedical relations from different datasets. In knowledge networks, nodes are biomedical entities such as drug, disease, and gene. Links are relations such as drug a inhibits gene b, gene c is mutated in patient d, and gene e interacts with gene f. By integrating selective biomedical datasets according to downstream tasks, we can obtain different knowledge networks. In this thesis, I will demonstrate the value of mining these four knowledge networks: 1. A pharmacogenomics network. In this knowledge network, nodes are drugs, diseases, and genes. Links includes drug-target interactions, -protein interactions, and drug- disease associations. This network can be used to infer novel pharmacogenomics knowledge such as drug-drug interactions, drug-target interactions, and drug repurposing.

2. A multi-species network. In this knowledge network, nodes are genes belong to dif- ferent species. Links are protein-protein interactions and homology. This network can be used to transfer biomedical discovery from model organisms to human.

3. A clinical knowledge network. In this knowledge network, nodes are drugs, symptoms, and patients. Links are extracted from patient profiles in electronic medical records. This network can be used to support clinical decision making such as finding similar patients, patient stratification, and treatment recommendation.

4. A biomedical text network. In this knowledge network, nodes are biomedical terms. Links are biomedical terms frequently co-occured in scientific papers. This network can be used to retrieve and summarize existing biomedical knowledge. It could also be used to discover new knowledge such as adverse drug reactions. These knowledge networks are useful because of the following two reasons. Firstly, knowl- edge networks can be used to address missing values and noise in biomedical datasets ac- cording to the “guilt by association” rule. Secondly, knowledge networks provide a multi- perspective view and thus boost the prediction performance. However, mining knowl- edge networks poses unique computational challenges due to the scale, heterogeneity, high- dimensionality, and noise of knowledge networks.

2.1 COMPUTATIONAL APPROACHES FOR MINING KNOWLEDGE NETWORKS

To tackle these computational challenges, I applied three different network modeling ap- proaches to mining knowledge networks. The goal of these three methods is to convert a

3 given knowledge network into feature vectors. In particular, we want to obtain a feature vector for each node and later use these feature vectors in downstream prediction tasks.

1. Diffusion model. Diffusion model is developed based on random walk with restart (RWR) which calculates the probability that each node can be reached by other nodes in the network. Although the diffusion model is effective in many applications, the resulted feature vectors are high-dimensional and noisy, leading to overfitting in downstream clas- sification tasks. This method is suitable for analyzing small-scale (i.e., < 10K nodes) and homogeneous knowledge networks.

2. Homogeneous network embedding. To tackle the overfitting problem due to the high-dimensional feature vectors generated by diffusion model, I propose a novel net- work embedding approach which projects high-dimensional networks into low-dimensional space. The feature vector would be the compact representation for each node in the low- dimensional space. Comparing to feature vectors generated from diffusion models, feature vectors generated from network embedding models are less noisy. This method is suitable for analyzing medium-scale (i.e., < 20K nodes) and homogeneous knowledge networks.

3. Heterogeneous network embedding. Homogeneous network embedding is effective in decomposing a network with a single node type (e.g., a human protein-protein in- teraction network). However, it is not able to model more sophisticated heterogeneous networks which have multiple types of node and link. To this end, I propose a novel heterogeneous network embedding algorithm to project heterogeneous networks into low- dimensional space. Moreover, heterogeneous networks could be very large due to the integration of different biomedical datasets. I propose a fast learning algorithm which significantly reduces the time complexity. Consequently, the proposed method scales to very large networks which have millions of nodes. This method is suitable for analyzing large-scale and heterogeneous knowledge networks.

Next, I will introduce these three models in detail. I will first briefly introduce the diffusion model which serves as the basis for embedding methods. I will then propose two network embedding methods for decomposing homogeneous networks and heterogeneous networks, respectively.

2.1.1 Diffusion model

The goal of diffusion model is to generate equilibrium probability vectors for each node. It runs random walk with restart (RWR) on each node in each network (e.g. protein-protein

4 interaction or co-expression network) to compute the ‘diffusion state’ of each node, which summarizes local topology. RWR is different from conventional random walks in that it introduces a pre-defined probability of restarting at the initial node after every iteration. Formally, let A denote the weighted adjacency matrix of a network with n nodes. Each

entry Bi,j in the transition matrix B represents the probability of a transition from node i to node j and is defined as:

Bi,j Bi,j = P , (2.1) j0 Bi,j0 t Next, letting Si be an n-dimensional distribution vector in which each entry stores the probability of a node being visited from node i after t steps, RWR from node i with restart

probability pr is defined as:

t+1 t Si = (1 − pr)Si B + prqi, (2.2)

where qi is an n-dimensional distribution vector with qi(i) = 1 and qi(j) = 0 ∀i 6= j . Note that the restart probability controls the relative influence of global and local topolog- ical information in the diffusion, where a larger value places greater emphasis on the local ∞ structure. We can obtain the stationary distribution Si of RWR at the fixed point of this ∞ iteration, and we refer to this as the ‘diffusion state’ Si of node i (i.e. Si = Si ) , using the same definition as previous work [28]. Intuitively, the jth entry Sij stores the probability that RWR starts at node i and ends up at node j in equilibrium. The fact that two nodes having similar diffusion states implies they are in similar positions with respect to other nodes in the graph, which may reflect functional similarity.

2.1.2 Homogeneous network embedding

Diffusion states are not entirely accurate, partially due to the noisy and incomplete na- ture of interactomes. Moreover, high dimensionality imposes additional computational con- straints on directly using the diffusion states as features for classification or regression tasks. To address this issue, Cho et al. proposed DCA which employs the following dimensionality reduction scheme. The probability assigned to node j in the diffusion state of node i is modeled as

T x wj 0 e i Sij = T , (2.3) P xi wj0 j0 e

5 d where ∀i, wi, xi ∈ R for d  n. DCA refers to wi as the context feature and xi as the node feature of node i both capturing the topological properties of the network. If xi and wj are close in direction and have large inner product, then it is likely that node j is frequently visited in the random walk starting from node i. DCA takes a set of observed diffusion states

S = {S1,...,Sn} as input and optimizes over w and x for all nodes, using KL-divergence as the objective function:

n ˆ 1 X ˆ min C(S, S) = DKL(Si||Si), (2.4) w,x n i=1 The original framework uses a standard quasi-Newton method L-BFGS [29] to solve this opti- mization problem. Although the learnt low-dimensional vector representation can effectively capture the network structure, we found that optimizing in this way is time consuming.

New contributions

To make DCA more scalable to large networks, we developed a fast, matrix factorization- ˆ based approach to decompose the diffusion states. Based on the definition of Sij , we have:

X T ˆ T xi wj0 log Sij = xi wj − log e . (2.5) j0

The first term in the above equation corresponds to the low-dimensional approximation of ˆ Sij , while the second term is the normalization factor that enforces si ∈ ∆n, where ∆n is the n-dimensional probability simplex. In our new formulation, we relax the constraint that the entries in factorization-based approach to decompose the diffusion states. In particular, we drop the second term in the above equation. While the resulting low-dimensional ap- proximations of diffusion states are no longer strictly valid probability distributions, we find that the approximations are close enough to the true distribution that the relaxation has a ˆ negligible impact. As a result, Sij can be simplified as:

ˆ T log Sij = xi wj. (2.6)

In addition, instead of optimizing the relative entropy between the true and the approxi- mated diffusion states, we use the sum of squared errors as the new objective function:

n n ˆ X X T 2 min C(S, S) = (xi wj − log Sij) . (2.7) w,x i=1 j=1

6 Now, the resulting optimization problem can be easily solved by the classic singular value decomposition (SVD) [30]. To avoid taking a logarithm of zeros, we added a small positive

constant to sij and computed the logarithm diffusion state matrix L as:

L = ln(S + Q) − ln(Q). (2.8)

n×n 1 n×n where Q ∈ R with Qij = n , ∀i, j, and S ∈ R is the concatenation of si, . . . , sn. With SVD, we decompose L into three matrices U , Σ and V :

L = UΣVT , (2.9)

where U ∈ Rn×n, V ∈ Rn×n , and Σ ∈ Rn×n is the diagonal singular value matrix. To ob- tain the low-dimensional vectors wj and xi with d dimensions, we simply choose the first d sin- gular vectors Ud, Vd and the first d singular values Σd . More precisely, let X = {x1, . . . , xn}

denote the low-dimensional vector representation matrix, and W = {w1, . . . , wn} denote the context feature matrix. X and W can be computed as:

0.5 X = UdΣd , (2.10)

0.5 W = VdΣd . (2.11)

Runtime improvements

The key benefit of this new optimization procedure is significantly reduced computational time. For example, decomposing the STRING yeast network with around 6,000 nodes into 500-dimensional vectors takes < 5 min on a standard server (with six 3.07 GHz Intel Xeon CPUs and 32GB RAM) for SVD, while the original approach with L-BFGS takes more than 2 hours. We noticed that the prediction accuracies for both methods are almost identical in predicting yeast gene function. To integrate multiple network data, we extend the above single-network DCA to multiple

networks. Let L = {L1,..., Lk} denote the set of logarithm diffusion state matrices based

on the diffusion states S = {S1,..., Sk} from the k input networks. Here, we optimize the following objective function:

n n k ˆ X X X T r r 2 min C(S, S) = (xi wj − log Lij) . (2.12) w,x i=1 j=1 r=1

7 r where for each node i in network r, we assign a network-specific context feature wi , which encodes the intrinsic topological properties of node i in network r. The node features x are shared across all k networks to be able to capture more global patterns. This objective function can be also optimized by SVD. It is worth noting that it is possible to weight each network differently when concatenating the networks, but we give equal importance to each

network in this work for simplicity. In the following sections, we use X = {x1, . . . , xn} as the low-dimensional vector representations of genes. Note that one can also use w, but we observed that the performances of these two representations are quite similar.

2.1.3 Heterogeneous network embedding

For heterogeneous networks that have multiple node types and edge types, I propose a novel heterogeneous network embedding approach called ProSNet. As an overview, ProSNet first constructs a heterogeneous biomedical network by integrating different datasets. It then performs a novel dimensionality reduction algorithm on this heterogeneous network to optimize a low-dimensional vector representation for each nodes. The vectors of two nodes will be co-localized in the low-dimensional space if these two nodes are close to each other in the heterogeneous biological network. A key computational contribution is that ProSNet obtains low-dimensional vectors through a fast online learning algorithm instead of the batch learning algorithm used by previous work [31, 32]. In each iteration, ProSNet samples a path from the heterogeneous network and optimizes low-dimensional vectors based on this path instead of all pairs of nodes. Therefore, it can easily scale to large networks containing hundreds of thousands or even millions of links and nodes.

Definition 2.1 Heterogeneous Biomedical Networks (HBNs) are biomedical net- works where both nodes and edges are associated with different types. In an HBN G = (V,E,R), V is the set of typed nodes (i.e., each node has its own type), R is the set of edge types in the network, and E is the set of typed edges. An edge e ∈ E in a heterogeneous biomedical network is an ordered triplet e = hu, v, ri, where u ∈ V and v ∈ V are two typed nodes associated with this edge and r ∈ R is the edge type.

Definition 2.2 In an HBN G = (V,E,R), a heterogeneous path is a sequence of com- patible edge types M = hr1, r2, . . . , rLi, ∀i, ri ∈ R. The outgoing node type of ri should match the incoming node type of ri+1. Any path Pe1 eL = he1, e2, . . . , eLi connecting node u1 and uL+1 is a heterogeneous path instance following M, iff ∀i, ei is of type ri.

In particular, any edge type r is a length-1 heterogeneous path M = hri. We show a toy example of an HBN under our function prediction framework in Fig. 2.1.

8 Figure 2.1: An example of the heterogeneous biological network under our function prediction framework. The node set V consists of four types, {“Human protein”, “Yeast protein”, “Mouse protein”, and “ term”}. The edge type set R consists of five types, {“Sequence similarity”, “Protein function annotation”,“Gene Ontology relationship”,“Experimental”, and “Co-expression” }. This HBN explicitly captures interolog and transfer of annotation through heterogeneous paths across different species.

Low-dimensional vector learning in the heterogeneous biological network

ProSNet finds the low-dimensional vector for each node through first sampling a large number of heterogeneous path instances according to the HBN. It then finds the optimal low-dimensional vector so that nodes that appear together in many instances turn to have similar vector representations. We first define the conditional probability of node v connected to node u by a heterogeneous path M as:

exp(f(u, v, M)) P r(v|u, M) = , (2.13) P 0 v0∈V exp(f(u, v , M))

where f is a scoring function modeling the relevance between u and v conditioned on M. Inspired from the previous work[33], we define the following scoring function:

T T T f(u, v, M) = µM + pM xu + qM xv + xu xv. (2.14)

d Here, µM ∈ R is the global bias of the heterogeneous path M. pM and qM ∈ R are local d bias d dimensional vectors of the heterogeneous path M. xu and xv ∈ R are low-dimensional vectors for nodes u and v respectively. Our framework models different heterogeneous paths

differently by using pM and qM to weight different dimensions of node vectors according to the heterogeneous path M.

For a heterogeneous path instance Pe1 eL = he1 = hu1, v1, r1i, . . . , eL = huL, vL, rLii

9 following M = hr1, r2, . . . , rLi, we propose the following approximation.

γ P r(Pe1 eL |M) ∝ C(u1, 1|M) × P r(Pe1 eL |u1, M), (2.15) where C(u, i|M) represents the count of path instances following M with the ith node being u. C(u, i|M) can be efficiently computed through a dynamic programming algorithm. γ is a widely used parameter to control the effect of overly-popular nodes, which is set to 0.75 in previous work [34]. We assume that each node on the path only depends on its previous node. Then we have L Y P r(Pe1 eL |u1, M) = P r(vi|ui, ri). (2.16) i=1 Given the conditional distribution defined in Eq. (3.8) and (2.15), the maximum likelihood training is tractable but expensive because computing the gradient of the log-likelihood takes time linear in the number of nodes. Following the noise-contrastive estimation (NCE)[35], we reduce the problem of density estimation to a binary classification, discriminating between samples from path instances following the heterogeneous path and samples from a known noise distribution. In particular, we assume these samples come from the following mixture. 1 θ P r+(P |M) + P r−(P |M), (2.17) θ + 1 e1 eL θ + 1 e1 eL

+ where θ is the negative sampling weight and P r (Pe1 eL |M) denotes the distribution of − path instances in the HBN following the heterogeneous path M. P r (Pe1 eL |M) is a noise distribution, and for simplicity we set

L+1 − Y γ P r (Pe1 eL |M) ∝ C(ui, i |M) . (2.18) i=1 We further assume noise samples are θ times more frequent than positive path instance samples. The posterior probability that a given sample D came from positive path instance samples following the given heterogeneous path is

+ P r (Pe1 eL |M) P r(D = 1|Pe1 eL , M) = + − , (2.19) P r (Pe1 eL |M) + θ · P r (Pe1 eL |M) where D ∈ {0, 1} is the label of the binary classification. Since we would like to fit + P r(Pe1 eL |M) to P r (Pe1 eL |M), we simply maximize the following expectation.   P r(Pe1 eL |M) LM =EP r+ log − P r(Pe1 eL |M) + θ · P r (Pe1 eL |M)  −  (2.20) θ · P r (Pe1 eL |M) + θ · EP r− log − . P r(Pe1 eL |M) + θ · P r (Pe1 eL |M)

10 The loss function can be derived as L X X LM ≈ log σ( f(ui, vi, ri)) + i=1 Pe1 eL following M (2.21)   Pθ PL j j  j=1 EPj ∼P r−|u ,M log 1 − σ( i=1 f(ui , vi , ri)) , e1 eL 1 where σ(·) is the sigmoid function. Note that when deriving the above equation we used exp(f(u, v, M)) in place of P r(v|u, M), ignoring the normalization term in Eq. (3.8). We can do this because the NCE objective encourages the model to be approximately normalized and recovers a perfectly normalized model if the model class contains the data distribution PL [35]. Following the idea of negative sampling [34], we also replaced i=1 f(ui, vi, ri)−log θ · −  PL P r (Pe1 eL |M) with i=1 f(ui, vi, ri) for ease of computation. We optimize parameters xu, xv, pr, qr, and µr based on Eq. (2.21).

Runtime improvements through online learning

Like diffusion component analysis[31], the number of pairs of nodes hu, vi that are con- nected by some path instances following at least one of the paths is O(|V |2) in the worst case. This is too large for storage or processing when |V | is at the order of hundreds of thousands. Therefore, sampling a subset of path instances according to their distribution is the most feasible choice when optimizing, instead of going through every path instance per iteration. Thus, our method is still very efficient for networks containing large num- bers of edges. Based on Eq. (2.15), we can sample a path instance by sampling the nodes on the heterogeneous path one by one. Once a path instance has been sampled, we use gradient descent to update the parameters xu, xv, pr, qr, and µr based on Eq. (2.21). As a result, our sampling-based framework becomes a stochastic gradient descent framework. The derivations of these gradients are trivial and thus are omitted. Moreover, since stochas- tic gradient descent can generally be parallelized without locks, we can further optimize via multi-threading. Decomposing a heterogeneous network with more than sixty thousand nodes and ten million edges into a 500-dimensional vector space takes less than 30 minutes on a 12-core 3.07GZ Intel Xeon CPU through this online learning framework.

2.2 RELATED WORK

In this section, I will review previous works that are related to this thesis. In particular, I will introduce related works on extensively studied biomedical datasets including genomics data, electronic medical records, and scientific literature data.

11 Large-scale cancer genomics projects, such as the Cancer Genome Atlas [36], the Cancer Genome project [37], and the Cancer Cell Line Encyclopedia project [1], as well as cancer pharmacology projects, such as the Genomics of Drug Sensitivity in Cancer project [38], have generated large volumes of genomics and pharmacological profiling data. As a result, there is an unprecedented opportunity of linking these pharmacological and genomic data to identify therapeutic biomarkers [39, 40, 41]. Large-scale cancer genomics projects, such as the Cancer Genome Atlas [36], the Cancer Genome project [37], and the Cancer Cell Line Encyclopedia project [1], as well as cancer pharmacology projects, such as the Genomics of Drug Sensitivity in Cancer project [38], have generated large volumes of genomics and pharmacological profiling data. As a result, there is an unprecedented opportunity of linking these pharmacological and genomic data to identify therapeutic biomarkers [39, 40, 41]. In pursuit of this vision, significant efforts have been invested in identifying the genetic basis of drug response variation among individual patients [42, 9, 43]. Electronic medical records are also of great interest to health informatics researchers [44, 45, 16, 46]. For example, recent study shows that high quality knowledge graphs can be constructed from medical records by using simple concept extraction [47]. In addition to electronic medical records, we showed that using knowledge network constructed from other datasets (e.g., genomics data, literature data) can substantially help electronic medical records analysis. Scientific paper data is another important source to discover medical knowledge [48, 49, 50]. For example, Hoon et al. used advance natural language processing methods to identify gene regulatory networks from millions of PubMed papers [51]. Despite the encouraging results of previous work, they didn’t integrate literature data with molecular data. We showed more promising results by using the additional information in molecular data while mining literature data.

12 CHAPTER 3: KNOWLEDGE NETWORK ASSISTED GENE FUNCTION ANNOTATION

To evaluate the performance of the proposed network embedding methods, I applied these two methods to gene function annotation. Automatic annotating functions of genes is a fundamental task in life sciences and has significant impact in drug discovery. In particular, I applied these three network models to three different gene annotation tasks. First, I used the proposed homogeneous network embedding approach to predict gene function based on molecular networks. Experiment results show that the proposed network embedding approach has a much better performance in comparison to other network-based approaches. Secondly, I used the heterogeneous network embedding approach to integrate information from different species. Experimental results demonstrate the effectiveness of jointly learning from multiple species. Finally, I used the diffusion model to annotate novel gene sets by mining a network constructed from scientific papers and molecular networks. I proposed NetAnt which annotates novel gene sets by mining millions of PubMed papers. NetAnt helps doctors and biologists understand function of novel gene sets without spending a huge amount of time reading scientific papers.

3.1 HOMOGENEOUS NETWORK EMBEDDING FOR GENE FUNCTION PREDICTION

Systematically predicting gene (or protein) function based on molecular interaction net- works has become an important tool in refining and enhancing the existing annotation catalogs, such as the Gene Ontology (GO) database [52, 53]. Fortunately, an increasing compendium of genomic, proteomic and interactomic data allows us to extract patterns from functionally well-characterized genes (or ) to accurately infer functional prop- erties of lesser-known ones. In particular, recently developed high-throughput experimental techniques, such as yeast two-hybrid screens and genetic interaction assays, have helped to build molecular interaction networks in bulk. Topological structures of these networks can be exploited for function prediction using the ‘guilt-by-association’ principle, which states that genes (or proteins) that share similar neighbors or other topological properties in interaction networks are more likely to be functionally related. To this end, a variety of graph-theoretic and machine learning algorithms [54, 55, 56, 57, 56] have been developed to provide a way of refining and enhancing existing functional annotations (e.g. Gene Ontology database [58]) based on network data. A popular class of graph-theoretic algorithms uses a diffusion process to examine the local topology of nodes,

13 Figure 3.1: A breakdown of GO labels by the number of annotated genes in (a) human and (b) yeast. exploiting both direct and indirect linkages [59, 60, 61, 28]. Alternatively, the number of occurrences of different elementary subgraphs (known as graphlets) in the neighborhood can be used to characterize each node and to establish pairwise affinity scores [62, 63]. More sophisticated machine learning algorithms, such as GeneMANIA [64, 65], have also been proposed. GeneMANIA uses label propagation on an integrated network specifically constructed for each functional label, and is currently available as the state-of-the-art web interface for gene function prediction in multiple organisms. Despite the success of existing algorithms, a major difficulty that has not been sufficiently addressed is that of predicting rare labels. Because many molecular functions (MFs) are inherently specific in their scope, a large number of functional labels have only a few anno- tated genes (or positive annotations); for instance, in the human GO annotation database [58], there are currently 8626 GO labels with at least 3 annotations, 4178 of which have less than 10 annotated genes and 7905 labels have less than 100 genes. The distributions of GO labels with different numbers of annotations in yeast and human are shown in Figure 3.1. Nearly half of the GO labels have less than 10 annotations in both species. Predicting new associations for these sparsely annotated labels is substantially more chal- lenging than those for labels with a lot of annotations, because patterns extracted from the few known genes are more likely to be statistical artifacts that cannot be generalized, which is commonly known as the ‘overfitting’ problem in machine learning and statistics. One way to mitigate overfitting is to take similarities between labels into account. For instance, if we have a priori knowledge that two labels reflect similar MFs (e.g. they are both children of the same parent in the ontology graph), we would also expect the two

14 corresponding sets of genes (or proteins) to be similar. If one of the gene sets contains only a few genes, then the other may provide valuable information about missing associations. Thus, by propagating information along the edges in the ontology graph, one can pool available data together for increased robustness to overfitting. Notably, previous efforts to incorporate label similarity into function prediction algorithms have largely been unsuccessful. They formulated the problem as a single structured-output hierarchical classification (HC) instead of binary classification, but its predictive performance for sparsely annotated functional labels is far from satisfactory [66, 67, 68, 69, 70, 71, 72]. Another related work [73] exploited the similarity between 17 Munich Information Center for Protein Sequences (MIPS) functional categories via a regularization scheme. However, this approach does not scale to tens of thousands of sparse GO annotations in human, as it employs a computationally expensive optimization. Our solution to this problem is based on diffusion component analysis (DCA) [59], a recently developed algorithm that combines network diffusion, such as random walk with restart (RWR) [60], with dimensionality reduction to obtain low-dimensional vector rep- resentations of nodes in a graph that capture topological properties. Topological features extracted from interaction networks in this manner can be used in conjunction with k-nearest neighbors (kNNs) or support vector machines (SVMs) to outperform the corresponding state- of-the-art for predicting hundreds of MIPS labels in yeast [59]. However, DCA also suffers from overfitting for sparsely annotated labels when predicting GO labels for a larger human interactome, if we want to train label-wise classifiers for all labels. To solve this problem, we introduce clusDCA, an improved function prediction algorithm based on DCA, which (i) incorporates the similarity between functional labels and (ii) scales to a large number of annotations. The key idea of clusDCA is to perform DCA also on the ontology graph to obtain compact vector representations of labels. The gene vectors from the original method are then projected onto the space of label vectors so that the projections of positively annotated genes are geometrically close to their assigned labels. Because labels that are similar to each other in the ontology graph are co-localized in the label vector space, classifiers for sparsely annotated labels will now favor genes associated with other similar labels in the neighborhood. This is how information is transferred between labels to avoid overfitting in our approach. When compared with state-of-the-art methods that do not incorporate label similarity, our experiments on yeast, mouse and human datasets demonstrate that our method substantially improves the predictive accuracy of sparsely annotated labels while achieving comparable performance for GO labels with sufficiently many genes. We also demonstrate the perfor- mance improvement of clusDCA over an alternative approach to utilizing the ontology graphs

15 based on HC. Furthermore, our method can be used to predict new genes for a given GO label, even in the extreme case where there are no existing gene annotations and the only information available is the label’s position in the ontology graph and the genes associated with other labels. In addition to improving function prediction on its own, this demonstrates the potential for our method to be used in conjunction with recent methods that extract new ontology terms from data [74, 75] to provide an improved way of refining and extending our knowledge of gene or protein function.

3.1.1 Methods

As an overview, clusDCA first computes the ‘diffusion state’ of each node by performing a RWR on each input network, and subsequently finds a low-dimensional vector representation for each gene via an efficient matrix factorization of the diffusion states. A key contribution is that clusDCA then follows an analogous procedure to obtain a low-dimensional vector representation of each functional label based on the ontology graph. Intuitively, the gene vectors encode the topology of the interactome, which in turn reflects gene function, while the label vectors encode the topology of the ontology graph, which reflects the semantic and relational properties of the labels. Given both the gene and the label vectors, clusDCA novelly finds the best projection of the gene vectors onto the label vector space, thus keeping the projected gene vectors geometrically close to their known labels. In the final step, clusDCA computes its predictions for an uncharacterized gene by sorting the candidate functions by their proximity to the projected gene vector, based on the optimal projection. An illustration of this pipeline is given in Figure 3.2. We give a more detailed description of this pipeline below.

3.1.2 Low-dimensional vector representations of functional labels

The GO graph is a directed acyclic graph (DAG) over functional labels where the edges represent various semantic relationships. We only consider the ‘is a’ and ‘part of’ edges, which result in a hierarchy of labels with edges going from the more specific to the more generic terms. As a consequence of this hierarchical structure, which is generally not present in molecular networks, a naive application of RWR on the ontology graph where the edges are treated as undirected unfairly favors high-level nodes, which tend to have higher centrality. On the other hand, allowing a random walk to only move from high- to low-level nodes would greatly restrict the portion of the graph a random walk can explore. To address these issues, we allow both edge directions but with different weights, whose

16 Figure 3.2: Overview of clusDCA

17 ratio is controlled by the ‘back propagation’ parameter α. With B denoting the transition matrix of the original graph with unidirectional edges, our modified RWR for the ontology graph is defined as

t+1 t T Si = (1 − pr)Si ((1 − α)B + αB B) + prqi, (3.1) We chose a value for α that generally shrinks the diffusion scores of high-level nodes, and confirmed that the final prediction performance is stable for different values of α between 0.5 and 0.8. Based on the diffusion states from this modified random walk, we learned a low- dimensional vector representation of the ontology graph using the same procedure as the one for molecular networks. Importantly, our representation captures not only single-hop parent- child relationships, but also more global patterns such as long-range sibling relationships in the network.

In the following sections, we use Y = {y1, . . . , ym} to denote the low-dimensional vector representation matrix of functional labels. yj is the vector for function j.

3.1.3 Projecting gene vectors into ontology label space

After obtaining the low-dimensional vector representations of both genes and functional labels, we use these vectors to predict gene function. Because the vectors reflect the topo- logical structure of nodes in the network, genes that are close in their vector directions are more likely to be similar in their functions. Analogously, functional labels that are close in their vector directions may be more semantically similar. Based on this intuition, we use a transformation matrix W to project genes from the gene vector space to the function vector 0 space, which allows us to match genes to functions based on geometric proximity. Let yi be the projection of the gene vector xi:

0 yi = xiW. (3.2)

Then we define the pairwise affinity score zij between gene i and function j to be used for function prediction as: T zij = xiWyj . (3.3)

A larger zij indicates that gene i is more likely to be annotated with function j. We want to optimize W so that positively annotated genes are geometrically similar to 0 their assigned GO labels. We use the inner product zij = hyj, yji as the similarity func- tion. We also explored the L2 distance, but it performed generally worse than the inner

18 product, possibly due to the fact that the inner product explicitly models both positive and pos neg negative annotations. Next, we define fj (fj ) as the set of genes that are positively (negatively) annotated with function j. Note that whenever a gene is positively annotated with a particular function we also positively annotated all of its ancestors with the gene. Our constrained optimization problem for finding the best projection that incorporates both positive and negative annotations is given by

  X 1 X 1 X min  neg zij − pos zgj w |f | |f | j j neg j pos i∈fj g∈fj 2 s.t.||W||2 = 1

1 1 where the weights neg and pos correct for the imbalance in the training data. Instead |fj | |fj | of simply maximizing the affinity scores of positive annotations, this formulation aims at maximizing the margin between the affinity scores of positive and negative annotations. This problem can be solved analytically by a closed-form solution:

∗ 1 T W = T X FY, (3.4) ||X FY||2

neg pos pos neg where F is a weight matrix with Fij = |fj |, ∀i ∈ fj and Fij = |fj |, ∀i ∈ fj . Because modeling the complex relationship between genes and functional labels with a single transformation matrix may be overly restrictive, we group functions into different clusters and learn a separate projection model for each cluster. For the main results, we divide the GO labels into the following four groups based on the number of annotated genes in the training data: [3−10], [11−30], [31−100] and [101−300]. We also tested using the partition given by the clustering of our learned label vectors and obtained comparable prediction performance.

3.1.4 Results

Networks and annotations

We obtained a collection of six molecular networks each for human, yeast and mouse from the STRING database v9.1 [76]. These networks are built from heterogeneous data sources, including high-throughput interaction assays, curated protein-protein interaction databases,

19 and conserved co-expression. We excluded text mining-based networks to avoid confounding by links based on functional similarity. There were 16,662 nodes in human, 6,311 in yeast and 18,248 in mouse. The number of edges in these networks varied from 1,183 to 673,410 in human, from 1,059 to 293,921 in yeast and from 3,917 to 1,638,107 in mouse. Note that each edge is associated with a weight between 0 and 1 representing the confidence of interaction. Next, we obtained gene-function associations and the ontology of functional labels from the GO Consortium [58]. We built a DAG of GO labels from all three categories [biological process (BP), molecular function (MF), cellular component (CC)] based only on the ‘is a’ and ‘part of’ relationships for each species. Labels without any associated genes were removed, resulting in three species-specific ontology graphs for yeast, human and mouse. The human ontology graph had 13,708 functions and 19,206 edges, the yeast ontology graph had 4,240 functions and 4,804 edges and the mouse ontology graph had 13,807 functions and 19,704 edges.

Experimental setting

Following previous work [65], we used 3-fold cross-validation to evaluate our method, where a randomly chosen subset of one-third of the genes are held out as the test set. After computing the optimal projection of gene vectors into the functional label space based only on the training data, we calculated the affinity scores to get a ranked list of test genes for each function. Then we measured the extent to which true annotations are concentrated near the top of the list by calculating the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC), which are standard performance metrics in this field [65]. To summarize results across different labels, we used both micro- and macro-averages. The micro-average directly combines the entries in the confusion matrix constructed from different labels prior to calculating the predictive performance, and the macro-average calculates the areas under the curves for each label independently and then takes the average. We compared clusDCA to two state-of-the-art network-based function prediction algo- rithms, GeneMANIA [65] and DCA [59], and another algorithm based on HC [72], which exploits the hierarchical structure of functional labels. For consistency, we used the same dataset (i.e. annotations, genes, networks) and the same evaluation scheme for every method we tested. We obtained the original MATLAB implementation of GeneMANIA from http://morrislab. med.utoronto.ca/Data/GB08/. For DCA, we tested only the kNN version, because the SVM version seriously suffers from overfitting for sparsely annotated labels and also

20 Table 3.1: Number of GO terms in different sparsity levels

3-10 11-30 31-100 101-300 Human MF 886 390 222 99 Human BP 2940 1677 1122 553 Yeast MF 351 156 92 29 Yeast BP 815 408 235 87 Mouse MF 188 215 165 84 Mouse BP 337 568 678 329 does not scale to the human dataset. Importantly, neither GeneMANIA nor DCA lever- ages topological information from the GO graph. Noting that the original formulation of HC does not scale to large datasets, we instead implemented an efficient version that uti- lizes the clusDCA framework. In particular, we associate each gene with a ‘macro-label’ k Y = {y1, . . . , yk} ∈ {0, 1} , where yi is a ‘micro-label’ which is set to 1 when the gene is positively annotated with label i and 0 otherwise. Following HC, we impose a constraint that if yi = 1 then yj = 1 for every ancestor j of i. We then find the optimal projection from DCA gene vectors to the space of macro-labels using the same optimization problem we introduced in Equation 16. After solving the optimization problem, we use the optimized ∗ transformation matrix W to compute the pairwise affinity score zij between gene i and function j. For clusDCA, we set the back propagation parameter α to 0.8 and the restart probability to 0.8 for the GO graph. We observed that our performance is stable for different values of α between 0.5 and 0.8. For the molecular networks, we used a restart probability of 0.5, adopted from our previous work [59]. We set a larger restart probability for the GO graphs because they are generally much sparser. We used d = 2500 as the dimensionality of the learned vectors for the main results. We followed the same procedure as the one in GeneMANIA to group the GO labels into two major gene ontologies: ‘BP’ and ‘MF’. For both ontologies, we further binned GO labels into four sparsity levels, each consisting of GO labels with [3−10], [11−30], [31−100] and [101−300] annotated genes (see Table 3.1). Although we used all of the GO terms in human and yeast, for mouse, we used only the GO terms with evidence codes that are also used in the evaluation of GeneMANIA [65]: TAS, RCA, ND, NAS, ISS, IPI, IMP, IGI, IEP, IEA, IDA and IC [77].

21 GO label vectors capture semantic similarity

Unlike some of the previous work [65, 73] that did not explicitly incorporate the GO graph, our approach exploits the ontology structure to learn a low-dimensional vector representation of each GO label. If these vectors can be clustered into semantically meaningful clusters, it would validate our attempt to enforce gene assignments to be similar between labels that are geometrically close in the label vector space as being well founded. To test this hypothesis, we used K-means to cluster the GO labels based on the cosine similarity of our low-dimensional vector representations of labels. We determined the number of clusters by restricting the largest cluster to have at most 80% of the total number of labels. In Supplementary Figure S1 (Supplementary Data), two of the 41 clusters we identified are visualized with Cytoscape [78]. The complete list of clusters can be found in Supplementary Data. In the visualization, the node size reflects the number of genes that the corresponding function is annotated with, and the edge width reflects the cosine similarity between the vector representations of the two nodes. The first cluster represents functions related to molecular binding, such as cation binding, metal ion binding, nucleotide binding and NAD binding, whereas the second cluster represents functions related to different transmembrane transporter activity. The fact that the set of GO labels in each of these clusters is highly consistent in function provides evidence that the learned vectors faithfully reflect the semantic relationships among the labels.

clusDCA substantially improves prediction of sparsely annotated GO labels

To evaluate clusDCA, we performed large-scale function prediction for human, yeast and mouse. The results are summarized in Figure 3.3 and Supplementary Figure S2 (Supplemen- tary Data). It is clear that our approach significantly outperforms other methods on sparsely annotated labels in all three datasets. For example, in human, our method achieved 0.8491 micro-AUROC and 0.8648 macro-AUROC on BP labels with 3-10 annotations, which is much higher than 0.5815 (micro), 0.5857 (macro) for DCA and 0.7288 (micro), 0.8002 (macro) for GeneMANIA. It is worth noting that DCA performs consistently worse than GeneMANIA at this task, possibly due to the fact that GeneMANIA adaptively integrates the input networks for each functional label to optimize performance on training data. In yeast, clus- DCA achieved 0.9025 micro-AUROC on BP labels with 3−10 annotations, which is again substantially higher than 0.6645 for DCA and 0.8504 for GeneMANIA. In mouse, clusDCA achieved 0.8627 micro-AUROC and 0.8802 macro-AUROC on BP labels with 3-10 annota- tions, which is again substantially higher than 0.5873 (micro), 0.5937 (macro) for DCA and 0.7609 (micro), 0.8245 (macro) for GeneMANIA. A similar improvement was observed for

22 functional labels with 11−30 annotations and also for the MF labels in human, yeast and mouse (Fig. 3.3 and Supplementary Fig. S2). We found most of the improvements to be statistically significant (P < 0.05; paired Wilcoxon signed-rank test). The improvement was most pronounced in human overall, presumably because the human dataset is much sparser than the other two. The above results suggest that the topological information in ontology graphs can be exploited to greatly improve function prediction performance for sparse labels. It remains to be shown whether clusDCA is better than other approaches to incorporating the ontology. To this end, we found that clusDCA substantially outperforms HC. For instance, in human, our method achieved 0.8984 micro-AUROC and 0.9135 macro-AUROC on MF labels with 3−10 annotations, which is much higher than 0.7435 (micro), 0.7580 (macro) for HC. The improvement was more pronounced where the number of GO labels was large (e.g. human BP). This is likely because the number of candidate predictions for HC grows exponentially with the number of GO labels. As a result, HC is highly prone to overfitting in a dataset with a large number of labels. Notably, HC also invariably performed worse than GeneMANIA in most of our experiments. In addition, we observed consistent improvements over GeneMANIA with respect to the AUPRC. In human, our method achieved 0.0429 macro-AUPRC on BP labels with 3−10 annotations, which is higher (better) than 0.0368 AUPRC for GeneMANIA. In yeast, clus- DCA achieved 0.1360 AUPRC on MF labels with 3−10 annotations, which is substantially higher than 0.1075 macro-AUPRC for GeneMANIA. Similarly, in mouse, clusDCA achieved 0.0516 macro-AUPRC on BP labels with 3−10 annotations, which is substantially higher than 0.0389 macro-AUPRC for GeneMANIA Interestingly, we note that the improvement of our method is negatively correlated with the number of annotations of the GO labels. In other words, we observed a greater improve- ment of clusDCA over previous methods for sparser labels. This observation suggests that clusDCA indeed addresses the overfitting issue, which has more significant impact on the sparsely annotated labels. In addition to sparsely annotated labels, our approach also achieved a performance compa- rable to GeneMANIA and greatly outperformed HC and DCA on labels with a large number of annotations (i.e. 31−100 and 101−300) with respect to both AUROC and AUPRC. The difference between clusDCA and GeneMANIA is not statistically significant in this case, but clusDCA is still marginally better than GeneMANIA on most categories.

23 Figure 3.3: Comparison of our approach with other methods in terms of micro-AUROC. Asterisk indicates that our approach is statistically significant in comparison with GeneMANIA. Performance is evaluated for different subsets of GO labels with varying sparsity levels as shown on the x-axis

24 clusDCA accurately predicts genes for new GO labels

Given that the current GO database is likely incomplete, in the event that a new GO label is created, we hope to automatically find genes that this label may be associated with. Remarkably, our framework can be directly used to tag genes with a newly created GO label using only the topological information from the ontology graph and other annotated labels. When the new GO labels are added to the ontology graph, we first obtain the low- dimensional vectors of these labels with DCA. Then, given the low-dimensional vectors for both genes and functions, we can inversely project function vectors onto the gene vector space and predict associated genes for the new GO labels. This approach can also potentially help to refine and enhance the current GO annotation database, thus serving as a verification platform. As a proof-of-concept, we repeatedly held out one-third of the GO labels as the validation set of ‘uncharacterized’ labels. We then used the remaining two-third GO labels to learn the projection model and to predict genes that are associated with the held out labels. Figure 3.4 shows the result of this experiment in yeast. We observed that our framework achieves a promising performance on all categories with micro-AUROC ranging from 0.81 to 0.87. It is worth noting that, to our best knowledge, no other existing method is able to predict associated genes for new GO labels without any existing annotations. Disease gene prioritization is a closely related task where the goal is to predict genes associated with a particular disease, but most algorithms proposed for this problem also require an initial set of associated genes to be able to make predictions.

3.1.5 Conclusion

We introduced a novel algorithm, clusDCA, for gene function prediction. The major idea of clusDCA is to leverage similarity between functional labels in addition to similarity between genes to prevent overfitting of sparsely annotated GO labels. We achieve this goal by learning low-dimensional vector representations of genes and functions and matching gene vectors to function vectors via a projection that best preserves known gene-function associations. Similar labels are co-localized in the vector space, which allows the transfer of information between neighboring functions when genes are assigned to them. Since learning the projection solves the prediction of all labels simultaneously, our method has the added benefit of being scalable to datasets with a large number of annotations. Although based on DCA, clusDCA is a substantial advance, as evidenced by its supe- rior performance over DCA, as well as GeneMANIA and HC, on sparsely annotated GO

25 Figure 3.4: Micro ROC curve of predicting genes for new GO labels on MF in yeast

labels, while maintaining a comparable performance on labels with many genes. Moreover, we demonstrated clusDCAs ability to identify putatively associated genes for newly created GO labels without any annotations, which suggests that our method can be used to im- prove poorly annotated labels and thus takes a significant step towards more comprehensive understanding of gene or protein function in various organisms. Supplementary Data: https://academic.oup.com/bioinformatics/article/ 31/12/i357/216487

3.2 INTEGRATING HETEROGENEOUS INFORMATION FOR GENE FUNCTION PREDICTION

The above homogeneous network embedding based function prediction only used the infor- mation from a single species. Intuitively, integrating molecular networks from multi-species reduces noise in each species and thus may boost the function annotation performance. Consequently, I used the proposed heterogeneous network embedding approach to develop an integrated, multi-species function prediction pipeline which integrates information from homology data and molecular networks [79]. This is the first work that integrates molecular networks from multiple species into a large heterogeneous network for function prediction. To scale our network embedding approach to massive heterogeneous graphs, I performed ran-

26 dom walks on-the-fly from each node and applied the noise-contrastive estimation approach to optimize the low-dimensional gene vectors. Importantly, the impact of the proposed approaches goes beyond function prediction. First, since proteins of different species are projected into the same low-dimensional space, these vector representations can be used to translate discoveries from model organisms to humans for accelerating the process of biomed- ical research. Secondly, our method can be applied to analyzing other heterogeneous data. For example, we constructed a large drug-gene-disease network and observed promising re- sults in medical record retrieval [80], survival analysis [81], medical record visualization [82], and patient clustering [83].

3.2.1 Introduction

Computational prediction of protein function has been extensively studied in the context of molecular evolution. Homologous proteins have most likely evolved from a common ancestor. They often carry out similar protein functions, because functions are generally conserved dur- ing molecular evolution. Consequently, computational approaches can predict the function of query proteins by transferring those of their annotated homologs. In addition to automatic annotations based on orthology or domain information or pre-existing cross-references and keywords [84], a variety of machine learning algorithms [85, 86, 87, 88, 89, 90, 91] have been proposed to extract annotations based on sequence similarity-detection tools such as BLAST, PSI-BLAST[92], and phylogenetic analysis [93, 94]. Despite the success of homology-based approaches, their major constraint arises from a lack of annotated sequences[69]. In fact, among over 65 million protein sequences in publicly accessible databases [95], only 2 million of them are manually curated [96]. Consequently, the predictive power of homology-based methods suffers from the scarcity of annotations. Furthermore, reliable homology relation- ships are sparse between distantly related species, thus posing computational and statistical challenges when making faithful predictions. Fortunately, the rapidly growing interactome data from high-throughput experimental techniques allows us to extract patterns from neighbors in molecular networks [97, 98, 99] in addition to homologous proteins. This idea is supported by the established “guilt-by- association” principle, which states that proteins that are associated or interacting in the network are more likely to be functionally related [100]. Recently, this “guilt-by-association” principle has become the foundation of many network-based function prediction algorithms [31, 101, 102, 28, 103, 104, 61, 105]. Among them, GeneMANIA [106] and clusDCA [32] are state-of-the-art network-based function prediction approaches. In addition to incorporat- ing network topology, clusDCA also leverages the similarity between GO labels and obtains

27 substantial improvement on sparsely annotated functions. GeneMANIA uses a label propa- gation algorithm on an integrated network specifically constructed for each functional label, and is currently available as a state-of-the-art web interface for gene function prediction for many organisms. Intuitively, integrating homology data with molecular networks can synergistically improve function prediction results. On one hand, it enables us to transfer annotations from func- tionally well-characterized neighbors in the molecular network as well as from homologous proteins with conserved similar functions. On the other hand, homology data can further mitigate the incomplete and noisy nature of molecular networks through interologs[107], which states that a conserved interaction occurs between a pair of proteins that have inter- acting homologs in another organism [108]. Nevertheless, integrating homology data with molecular networks is both computationally and statistically challenging. Since they are heterogeneous data sources, it is likely sub- optimal to integrate them in an additive way which simply averages the prediction results of either of these two data sources. Moreover, we also need an efficient algorithm that scales to hundreds of thousands of proteins from multiple species. One way to integrate these two heterogeneous data sources seamlessly is to construct a multiple species heterogeneous network in which both nodes and edges are associated with different types. With this network, we can predict functions for query proteins based on annotations extracted from both their homologs and their neighbors in molecular networks. Furthermore, information can also be transferred between two proteins that are neither homologs nor neighbors in molecular networks. Notably, the only previous attempt to integrate these two heterogeneous data sources is using multi-view learning[109]. However, it does not scale to multiple species. In addition, they formulated protein function prediction as a structured-output hierarchical classification problem whose performance for sparsely annotated functional labels is far from satisfactory [32]. Here, we introduce ProSNet, a novel Protein function prediction algorithm which ef- ficiently integrates Sequence data with molecular Network data across multiple species. Specifically, an integrated heterogeneous network is first constructed to include all molecu- lar networks of multiple species, in which homologous proteins across multiple species are also linked together. Based on this integrated network, a novel dimensionality reduction algorithm is applied to obtain compact low-dimensional vectors for proteins in the network. Proteins that are topologically close in the molecular networks and/or have similar sequences are co-localized in this low-dimensional space based on their vectors. These low-dimensional vectors are then used as input features to two classifiers which utilize annotations from molecular networks and homologous proteins, respectively. In addition, ProSNet is inher-

28 ently parallelized, which further promises scalability. When compared to the state-of-the-art methods that only use homology data or molecular networks, ProSNet substantially improves the function prediction performance on five major species.

3.2.2 Methods

As an overview, ProSNet first constructs a heterogeneous biological network by integrat- ing homology data with molecular network data of multiple species. It then performs a novel dimensionality reduction algorithm on this heterogeneous network to optimize a low- dimensional vector representation for each protein. The vectors of two proteins will be co-localized in the low-dimensional space if the proteins are close to each other in the het- erogeneous biological network. A key computational contribution is that ProSNet obtains low-dimensional vectors through a fast online learning algorithm instead of the batch learn- ing algorithm used by previous work [31, 32]. In each iteration, ProSNet samples a path from the heterogeneous network and optimizes low-dimensional vectors based on this path instead of all pairs of nodes. Therefore, it can easily scale to large networks containing hundreds of thousands or even millions of edges and nodes. After finding low-dimensional vector representation for each node, ProSNet calculates an intra-species affinity score and an inter-species affinity score by transfering annotations within the same species and across different species, respectively. Finally, ProSNet predicts functions for a query protein by averaging these scores and picking the function(s) with the highest score(s).

Runtime improvements through online learning

Like diffusion component analysis[31], the number of pairs of nodes hu, vi that are con- nected by some path instances following at least one of the paths is O(|V |2) in the worst case. This is too large for storage or processing when |V | is at the order of hundreds of thousands. Therefore, sampling a subset of path instances according to their distribution is the most feasible choice when optimizing, instead of going through every path instance per iteration. Thus, our method is still very efficient for networks containing large num- bers of edges. Based on Eq. (2.15), we can sample a path instance by sampling the nodes on the heterogeneous path one by one. Once a path instance has been sampled, we use gradient descent to update the parameters xu, xv, pr, qr, and µr based on Eq. (2.21). As a result, our sampling-based framework becomes a stochastic gradient descent framework. The derivations of these gradients are trivial and thus are omitted. Moreover, since stochas- tic gradient descent can generally be parallelized without locks, we can further optimize

29 via multi-threading. Decomposing a heterogeneous network with more than sixty thousand nodes and ten million edges into a 500-dimensional vector space takes less than 30 minutes on a 12-core 3.07GZ Intel Xeon CPU through this online learning framework.

Function prediction

After using the above framework to find the low-dimensional vector for each protein in the HBN, ProSNet transfers annotations both within the same species and across different species to predict for a query protein. To transfer annotations within the same species, ProSNet first uses diffusion component

analysis [31] on the Gene Ontology graph [110] to find low-dimensional vector yi for each functional label i. It then uses a transformation matrix W to project proteins from the protein vector space to the function vector space, which allows us to match proteins to 0 functions based on geometric proximity. Let yi be the projection of the protein vector xi:

0 yi = xiW. (3.5)

We define the intra-species affinity score zij between gene i and function j to be used for function prediction as: T zij = xiWyj . (3.6)

A larger zij indicates that gene i is more likely to be annotated with function j. We follow clusDCA[32] to find the optimal W. Since proteins from different species are located in the same low-dimensional vector space, ProSNet is able to use the annotations across different species as well. Instead of using the annotations from all the other proteins, ProSNet only considers the k most similar proteins based on the cosine similarity between their low-dimensional vectors. It then calculates the

inter-species affinity score sij between gene i and function j as:

X sij = cos(xi, xg) · 1(g ∈ Tj), (3.7)

g∈Bi

where Bi is the set of k most similar proteins of i and Tj is the set of genes that are annotated to function j in the training data. After obtaining the intra-species affinity score z and inter-species affinity score s, ProSNet normalizes them by z-scores. It predicts functions for a query protein by averaging these two normalized affinity scores and picking the function(s) with the highest score(s)

30 3.2.3 Experimental results

Construction of heterogeneous biological network for function prediction

To construct the heterogeneous biological network (HBN), we obtained six molecular networks for each of five species, including human (Homo sapiens), mouse (Mus mus- culus), yeast (Saccharomyces cerevisiae), fruit fly (Drosophila melanogaster), and worm (Caenorhabditis elegans) from the STRING database v10 [98]. These six molecular networks are built from heterogeneous data sources, including high-throughput interaction assays, cu- rated protein-protein interaction databases, and conserved co-expression data. We excluded text mining-based networks to avoid potential confounding. Each edge in the molecular networks has been associated with a weight between 0 and 1 representing the confidence of interaction. Next, we obtained protein-function annotations and the ontology of functional labels from the GO Consortium [110]. We only used annotations that have experimental ev- idence codes including EXP, IDA, IPI, IMP, IGI, and IEP. As a result, annotations that are based on an in silico analysis of the gene sequence and/or other data are removed to avoid potential leakage of labels. We built a directed acyclic graph of GO labels from all three categories [biological process (BP), molecular function (MF) and cellular component (CC)] based on “is a” and “part of ” relationships. This graph has 13,708 functions and 19,206 edges. We set all edge weights of protein-function links to 1 and all edge weights between GO labels to 1. Finally, we extracted amino acid sequences of all proteins in our five-species network from the STRING database and the Universal Protein Resource (Uniprot) [95]. To construct homology edges, we performed all-vs-all BLAST [92] and excluded edges with E-value larger than 1e-8. We then used the negative logarithm of the E-values as the edge weights and rescaled them into [0, 1]. We showed the statistics of our HBN in Tab. 3.2. For simplicity, all edges are undirected. Note that we excluded the protein-function annotation edges that are in the hold-out test set in the following experiments for rigorous comparisons. Our heterogeneous network is similar to the example network in Fig. 2.1, except that our network has five species and six different types of molecular networks.

Experimental setting

We used 3-fold cross-validation to evaluate the methods of interest. For a given species for evaluation, we randomly split proteins of the species into three equal-size subsets. Each time, the GO annotations of proteins in one subset were held out for testing, and the annotations of the other two subsets were used for intra-species classification training. For inter-species

31 Table 3.2: Statistics of our heterogeneous network

Human Mouse Yeast Fruit fly Worm #proteins 16,544 16,649 6,307 11,261 13,469 #co-expression edges 1,319,562 1,406,572 628,014 2,466,234 2,774,840 #co-occurrence edges 28,334 29,472 5,328 17,962 14,678 #database edges 275,860 347,406 66,972 116,748 69,948 #experimental edges 492,548 672,326 439,956 380,046 298,684 #fusion edges 2,678 3,994 2,722 4,026 4,336 #neighborhood edges 78,440 77,962 91,220 69,934 49,890 #human homology edges 0 525,221 55,884 202,993 159,481 #mouse homology edges 525,221 0 52,916 188,729 151,408 #yeast homology edges 55,884 52,916 0 26,950 28,269 #fruit fly homology edges 202,993 188,729 26,950 0 75,831 #worm homology edges 159,481 151,408 28,269 75,831 0 #annotations 77,950 66,238 28,668 32,259 21,655 training, we used all experimental GO annotations from the other four species, ensuring no leakage of label information in the training data. To evaluate the predictive performance, we measured the extent to which the predicted ranked list was consistent with the ground truth ranked list by computing the receiver operating characteristic curve (AUROC). We used the macro-AUROC as the evaluation metric following previous work[106, 32]. The macro-AUROC is calculated by separately averaging the area under the curves for each label. We set the vector dimension d = 500, the number of nearest neighbors k = 2000, and the negative sampling weight θ = 5 in our experiment. We observed that the performance of our algorithm is quite stable with different d, k, and θ values. We included all edge types in the predefined heterogeneous path set. Additionally, we added “transfer of annotation” to the predefined heterogeneous path set (Fig. 2.1). To show the improvement from integrating homology data with molecular networks of multiple species, we compared our method with three existing state-of-the-art function pre- diction methods: GeneMANIA[106], clusDCA[32], and BLAST[92]. GeneMANIA and clus- DCA integrate protein molecular networks within a given species. Neither of them is able to integrate information across different species. We used the latest released code and the sug- gested parameter settings for these two methods. BLAST uses bit score to rank annotations from significant hits by BLAST. We used the same datasets (i.e. annotations, proteins, and networks) and the same evaluation scheme for every method we tested.

32 Figure 3.5: Comparison of using different data sources for function prediction

Molecular network data and homology data are complementary in function prediction

We first studied whether information extracted from homology and from molecular net- works are complementary. We compared the predictive performance of three different data sources: 1) molecular networks, 2) homology, 3) both molecular network and homology (in- tegrated). We used clusDCA to predict function annotations based on molecular networks. We used BLAST to make predictions of function annotation based on homology. We summa- rized how many functions can be accurately annotated (AUROC>0.9) by each data source (Fig. 3.5). We notice that there are many functions that can only be accurately predicted by homology or network. For example, on mouse MF with 3-10 labels, 9% of functions (difference between yellow bar and green bar) can be accurately predicted only by homology but not by network. In the same category, another 21% of functions (difference between blue bar and green bar) can be accurately predicted only by network but not by homology. This suggests that these two data sources are complementary, and integrating them can syner- gistically improve the function prediction results. To this end, we integrated homology and network data by simply taking average of the z-scores of predicted annotations from these two data sources. We found that the predictive performance using both molecular network data and homology data is significantly better than only using one in all categories on both human and mouse. For example, on human MF with 101-300 labels, using both network data and homology data accurately annotates 60% of functions, which is much higher than

33 Figure 3.6: Comparison of different methods

4% of only using network data and 26% of only using homology data. Notably, we only use the homology data from five species here. When including homology data from more species in the future, homology data may further boost the function prediction performance.

ProSNet substantially improves function prediction performance

We performed large-scale function prediction on all five species to compare our method to other state-of-the-art function prediction approaches. The results are summarized in Fig. 3.6 and Supplementary Fig. 1. It is clear that our approach achieved the best overall results in all five species. When comparing with homology-based methods, we found that ProSNet significantly outperforms BLAST on both sparsely annotated and densely annotated labels (data not shown). For example, ProSNet achieves 0.8690 AUROC on human BP labels with 3-10 annotations, which is much higher than the 0.6326 AUROC by BLAST. Furthermore, we compared ProSNet to existing state-of-the-art network-based methods, including clusDCA and GeneMANIA, which only integrate molecular networks of single species. We found that the overall performance of our approach is substantially higher than that of both of these methods. For instance, in human, our method achieved 0.9211 AUROC on MF labels with 3-10 annotations, which is much higher than 0.8673 by GeneMANIA and 0.8794 by clusDCA. In mouse, our method achieved 0.8523 AUROC on BP labels with 31-

34 100 annotations, which is much higher than 0.8078 AUROC by GeneMANIA and 0.8299 AUROC by clusDCA. To evaluate the integration of homology and network data, we developed a baseline ap- proach that simply merges predictions made from homology data and sequence data, sepa- rately. This additive approach takes the average z-scores of the annotation score of clusDCA and BLAST to rank functional labels for each protein. We note that this baseline approach outperformes both GeneMANIA and clusDCA, indicating that integrating homology with molecular networks can substantially improve the function prediction performance. We then compared this additive approach to our method. We found that ProSNet also outperforms the additive approach. For instance, in human, our method achieves 0.9129 AUROC on MF labels with 11-30 labels, which is higher than 0.8956 AUROC by the additive approach. The improvement of our method in comparison to the additive approach demonstrates a bet- ter data integration by constructing a heterogeneous network and finding low-dimensional vector representations for each node in this network. The improvement of ProSNet over existing network-based approaches is more pronounced on sparsely annotated functions. Since very few proteins are annotated to these functions, it is very easy to overfit any classification algorithm if we only use the data from a single species. With the integrated heterogeneous biological network, ProSNet successfully transfers anno- tations from other species to have a more robust and improved predictive performance on sparsely annotated functions.

3.2.4 Conclusion

We have presented ProSNet, a novel protein function prediction method which seamlessly integrates homology data and molecular network data. ProSNet constructs a heterogeneous network to include molecular networks from all species and homology links across different species. We have designed an efficient dimensionality reduction approach which only takes 30 minutes to decompose a heterogeneous network containing hundreds of thousands of proteins. We have demonstrated that ProSNet outperforms state-of-the-art network-based approaches and homology-based approaches on five major species. Furthermore, ProSNet has achieved improved performance over an additive integration approach that simply adds predictions from network and homology data. This result supports our hypothesis that constructing a heterogeneous network and then finding low-dimensional vector representations for each node in this network is a better data integration approach. In the future, we plan to study how to annotate proteins of species that have very sparse molecular networks or even no molecular network. In addition, we plan to pursue further improvement by integrating networks and

35 homology data from a complete spectrum of reference species.

3.3 ANNOTATING GENE SETS BY INTEGRATING LITERATURE DATA AND MOLECULAR NETWORKS

A large number of functionally related gene sets have been massively generated in high- throughout experiments. Given a novel gene set, an immediate question to ask is the function of this gene set. However, due to the limited coverage of known pathway and function collec- tions, only a small proportion of these gene sets are found to be significantly overlapped with existing functions. Consequently, extensive human efforts have been devoted to manually search and summarize the function of these gene sets in the literature. In this work, I studied the novel problem of gene set description generation. I took a network biology perspective and mined millions of scientific research papers to automatically generate description of gene sets. I evaluated our framework on both gene sets that have known function and novel gene sets that are generated in high-throughput experiments. This work, along with the previous introduced function prediction works, form a unified pipeline to comprehensively annotate gene sets based on different data sources.

3.3.1 Introduction

With significant advances in ‘omics technologies, it has become increasingly routine to identify functionally related sets of genes based on different biological patterns. For example, a gene set may be computationally derived based on differential expression [111, 112], based on associations to the same phenotypes [113, 114], or based on a high density of molecular interactions among the genes [115, 116, 117, 118]. Because of their functional relationships, these gene sets can often be interpreted as cellular pathways or protein complexes, enabling a systems approach to studying human diseases beyond individual genes. Given a gene set of interest, a critical task is to learn what is its overall function as a pathway or complex in the cell. There are two major approaches to address this task. The first approach is to search for significant overlap with known pathways in manually curated databases such as the Gene Ontology (GO) [119] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) [120]. However, it is very likely that little or no overlap can be found due to the limited coverage of these databases, especially when querying with gene sets related to a rare disease. The second approach is to search for scientific articles that describe each gene in the set, and then summarize these articles to describe the aggregate function of the gene set. Man-

36 ually performing this process requires substantial domain knowledge and does not scale to large pathways. While automatic summarization of free text has been proposed by many text-mining methods [121, 122], these methods can describe only one gene rather than a gene set. In particular, automatic summarization for a gene set requires addressing several new challenges. First, the increased number of free text articles introduces diverse and noisy an- notations compared to individual genes. Second, the relationship between pathway functions and gene interactions should be considered, since genes can perform very different functions when participating in different biological processes. Third, literature contains many poten- tial and diverse function annotations, only some of which are relevant. Thus researchers need systematic approaches to filter, organize and display the most useful information in literature to better understand the biological pathways represented by a gene set. Many related approaches mine literature data to study the functions of a group of genes together. CoCiter tests the significance of co-citation of a gene set either from a user-defined queried gene sets or a known pathway [123]. Since the functions of this gene set are provided by user, CoCiter is not able to automatically mine new functional annotations to describe the gene set. Martini is a gene set comparison tool which assesses the similarity of two gene sets by using keywords extracted from Medline abstracts [124]. Although gene sets are compared using keywords, the functional description for each gene set is not explicitly generated. Here we develop a novel approach to automatically mine functional annotations of path- ways from a large corpus of literature supported by biological networks. Our approach has two major advantages over previous text mining methods. First, it integrates semantic in- formation derived from literature with biological information derived from experimental and interactome data. In this framework, annotations and genes are linked through a comprehen- sive similarity network. By propagating information in the network, an annotation can be assigned to a gene even when the two were never mentioned together in the same literature. Second, we adopt a new way to organize and visualize functional annotations using a data structure called a Hierarchical Concept Ontology. This ontology reduces redundant informa- tion and visual complexity to display the complex structure embedded in the network. We evaluate our method on both manually-curated pathway annotations and gene sets derived from computational tools. We observe substantial improvement in predicting the manually curated annotations in comparison to a text-mining baseline (non-network) approach. We further explore two case studies to demonstrate how our method can combine text mining, molecular networks and advanced visualization to discover new pathways related to cancer.

37 Figure 3.7: Diagram of our method

3.3.2 Methods

Our method consists of four major steps 3.7. First, it constructs a vocabulary of high quality phrases (a sequence of one or more words) by processing a large corpus of PubMed journal articles [125] using a software AutoPhrase [126]. Second, phrases are connected within a weighted network based on their probability of co-occurrence within the same articles. Third, our method builds a phrase-gene similarity network by joining the phrase- phrase network with an existing gene-gene network derived from experimental data. Fourth, phrases are ranked by how well they describe the function of a gene set. Finally, top-ranked phrases are projected into a low-dimensional space and hierarchically clustered to create a Concept Ontology.

Constructing a phrase-gene network

We construct a weighted network to quantify the functional similarities between both phrases and genes. The edge weight WAB between phrase A and phrase B is defined as:

P r(A, B) W = (3.8) AB P r(A)P r(B)

where Pr(A) is the marginal probability that phrase A appears in any article and Pr(A, B) is the probability that phrase A and phrase B co-occur in the same article. Intuitively, two phrases receive a large edge weight if they co-occur together more often than expected given their individual probabilities. In practice, non-informative phrases such as ‘cell lines’ and ‘system biology’ have many network neighbors with low edge weights; thus we retain only the top 50 edges for each phrase. To calculate the edge weight between two genes,

38 we integrate multiple heterogeneous data sources, including gene co-expression, protein- protein interaction, protein-domain co-occurrence and genetic interaction. We perform this integration in an unsupervised fashion using a network-fusion-based algorithmic framework [127]. To calculate the edge weight between a phrase and a gene, the name of the gene is considered as a phrase. In this way, the phrase-phrase and gene-gene networks are joined into a single network consisting of both phrases and genes as nodes.

Ranking candidate annotations of a pathway

Based on connections in this initial phrase-gene network, we further identify non-obvious links between phrases and genes through a random walk transformation of the network. An association score between gene A and phrase B is defined as the probability of randomly walking from A to B in the network, with restart probability = 0.5. Similarly, the association score between a queried gene set (pathway) and a phrase is defined as the average association score between the phrase and all genes in the set. We then rank pathways based on these scores. To efficiently rank a large number of phrases in a reasonable time, we only consider phrases that are within a distance of <3 to any of the genes in a queried pathway. Use of this filter in practice did not result in any significant decrease in performance (as evaluated below). Finally, we select all phrases with scores above a threshold as the candidate anno- tations of the queried pathway. We will discuss how to empirically pick this threshold in the below ‘Experimental results’ subsection.

Visualizing results as a Concept Ontology

The number of candidate annotations returned by the previous step can be very large, especially for large pathways. In general, synonyms are connected by the strongest weights because they are exchangeable in the literature. Phrases related to the same topic such as ‘tumor suppressor’ and ‘driver mutations’ will also be assigned strong weights but weaker than synonyms. Such intuition encouraged us to organize the flat phrase networks into a data-driven hierarchical ‘concept’ ontology [128, 129]. For this purpose we adopt a network embedding approach [127] in which phrases are projected into a low-dimensional space and the cosine of two phrase embedding vectors is used as their pairwise distance. Given this new distance matrix, we then apply a network clustering approach, CLiXO [128], to transform the flat phrase network into a data-driven ‘concept’ ontology, where leaf nodes are phrases and internal nodes are clusters of similar phrases suggestive of higher order ‘concepts’. Low-level concepts tend to be relatively concrete, because all phrases are strongly connected with each

39 other, while high-level concepts tend to be more abstract, because phrases are more loosely connected with each other. Similar to a manually curated ontology, we assign each concept a name using a representative phrase having minimum distance with all the other phrases in the same concept cluster. Cytoscape [130] is then applied to visualize the data-driven Concept Ontology.

3.3.3 Experimental results

Dataset and experimental settings

We obtained 33,462,308 journal articles from PubMed published between 1994 to 2017. For each article, we only used the abstract and title rather than the whole article. We obtained 41,367 gene descriptions and gene name synonyms from NCBI [125]. The lengths of descriptions ranged from 100 to 300 words. AutoPhrase [126] then identified 727,289 phrases from the text corpus combining both gene descriptions and journal articles. To calculate gene similarities, we aggregated various types of molecular networks using a Random Forest (RF) model trained to best recover the GO semantic distance between gene pairs. The trained model can be viewed as a nonlinear weighting of different kinds of features to reflect statistical pairwise correlations between two genes. The integrated data sources include coexpression networks, protein-protein interaction networks and protein- domain co-occurrences and genetics interactions, as follows. For co-expression networks, we used 980 genome-wide datasets extracted from the Gene Expression Omnibus (GEO) database [131]. We also used co-expression networks from the Genotype-Tissue Expression (GTEx )[132] project in which both global and tissue-specific co-expression are considered. In addition, we calculated a co-expression matrix on both the Human Protein Atlas and the Cancer Cell Line Encyclopedia [133, 134]. For protein-protein interaction networks, we included all interactions in InBioMap [135] and only physical interactions in BioGRID [136]. In addition, we included genetic interaction data inferred from radiation hybrid genotypes [137] and domain co-occurrence data from InterPro [138] and PFAM [139].

Recovering curated names in GO

We examined the ability of our method to recover the names of known biological processes and cellular components in GO, given only information about their sets of annotated genes. For each GO term, we looked for its curated name among all candidate phrases ranked according to their association scores to the genes in the term. Gene-term annotations were

40 taken using experimental evidence codes (EXP, IDA, IPI, IMP, IGI, and IEP) but not in silico codes (e.g. IEA) to avoid potential leakage of labels. We found that for 40% of terms in the biological process branch of GO, the curated name was among the top 50 candidates (Fig. 3.8a). Similarly, for 50% of terms in the cellular component branch, the curated name was among the top 50 candidates (Fig. 3.8b). More generally, we calculated the proportion of GO terms for which the curated name was among the top K candidate names of the term. For comparison, we set up a baseline approach in which a phrase is scored and ranked simply by the number of articles that mention this phrase together with any of the genes in the gene set. This simple but intuitive baseline mimics a search engine that ranks documents based on word frequency [140]. Our method substantially outperformed this baseline approach in naming terms across all three branches of GO (Fig. 3.8a-c). Here, for each term we only considered the curated name itself and did not reward returning the names of ancestors or descendants. In practice, however, we also observed the names of ancestors and descendants among the top ranked phrases (Fig. S1-3). Further examining these results, we observed that the rank of the curated name identified by our method was positively correlated with the size of the gene set (Fig. 3.8d). That is, our method predicted more accurately when the gene set was small. For sets with fewer than 250 genes, our method found the correct curated term among the top 10 phrases the majority of the time. When the gene set was larger than 750 genes, our method could only detect the curated name among the top 75 phrases. An explanation for this result is that large gene sets tend to cover broad or diverse functions and thus are more difficult to summarize by a short phrase. Next, we studied another critical problem: Given a ranking of phrases, how do we deter- mine the threshold to select the most relevant phrases? To address this problem, for each GO term, we compared the ranking of its curated name with its association score. As shown in Figs. 3.9a-b, better rankings of curated names were generally tied to stronger association scores. This implies that the association scores across different GO terms are comparable. Therefore, we applied a universal threshold on the association score to determine final an- notations for every GO term. We found that when the score is larger than -6 (log domain), we could always find the curated name among top 40 ranked phrases, regardless of term size (Fig. 3.9b). Therefore, we used -6 as our universal threshold to determine annotations.

41 Figure 3.8: Comparison of our method and baseline on recovering term names of three Gene Ontology categories: Biological Process (a), Cellular Component (b) and Molecular Function (c). The fraction of terms for which the curated name was among the top K candidate phrases. (d) The correspondence between the rank of term names and the sizes of terms. The Y-axis shows the distribution of ranks of curated names with varying sparsity levels shown in the X-axis.

42 Figure 3.9: Selecting relevant phrases based on the association score. (a) For each GO term, the association score of its curated name is plotted with the rank of this score among all candidate phrases. (b) Zoom-in of panel (a) reveals that applying a threshold of -6 on the association score guarantees that the curated name of a term is ranked among the top 40 candidate phrases.

Functional annotations for unknown cancer pathways

Encouraged by the ability of our method to recover the curated names of known pathways, we set out to assign names to new gene sets inferred from molecular data. We analyzed a total of 2,132 gene sets detected by the hierarchical clustering algorithm CLiXO [128] based on a human gene similarity network with 19,035 genes and 181,156,095 edges. Of these, we only considered those that did not significantly overlap with known pathways. In this subsection, we chose two example gene sets which were suggested to be highly related to cancer by our approach to demonstrate how our method can help to discover new biological knowledge. As a first case study, we examined a pathway consisting of eight strongly interacting genes: NEDD4, PTEN, SLC11A2, SLC11A1, SFTPC, MT3, NDFIP1 and NDFIP2 (Fig. 3.10a). To our knowledge, this pathway was previously unknown, as it has poor overlap with all catalogued pathways in GO and KEGG (Jaccard Index 0.25). Our method identified 38 literature phrases associated with this set of genes (Fig. 3.10a). Although each of these phrases might represent a distinct biological function, we found that some were highly related to one another, forming a hairball-like subnetwork of gene-phrase linkages (Fig. 3.10a). Thus, it would be very challenging for a human to summarize the overall functions of this pathway. To address this challenge, we applied CLiXO to hierarchically organize these phrases into a Concept Ontology. Visualization of this ontology revealed six major functions

43 Figure 3.10: Discovery and characterization of a new pathway by our method. (a) The pathway is defined by eight genes related by protein interactions, co-expression and protein shared domains. These functions of these genes are collectively described by 38 phrases. (b) Cancer types in which these genes are significant mutated in The Cancer Genome Atlas (TCGA). at multiple scales (Fig. 3.11). On a molecular level, this pathway has functions related to ‘ion transport’, ‘acetoacetate decarboxylase activity’ and ‘ubiquitin ligase’. On a cellular and organismal level, it is involved in ‘epithelial cells’ and ‘lung disease’. These descriptions were supported by direct associations between phrases and genes in multiple articles, such as ‘lung disease’ and NEDD4 in Rotin et al. [139, 142] and ‘lung epithelial cell’ and PTEN in Mao et al. [141], and by indirect associations learned through the random walk transformation. As validation of these descriptions, we found that genes in this pathway were recurrently mutated in lung squamous cell carcinoma (LUSC), a disease in epithelial cells [143], based on MutSigCV scores [144] in The Cancer Genome Atlas (TCGA) [145]. All evidence suggested that this is a novel functional pathway related to lung cancer. As a second case study, we examined another pathway, consisting of eight genes FBXW7, ARL2, FBXW11, FBXW2, BTRC, PWP2, COPA and FBXW10. These genes strongly in- teracted with each other primarily through domain co-occurrence, suggesting their proteins

44 Figure 3.11: Summarization of biological function by a Concept Ontology. The 38 phrases describing the pathway were hierarchically clustered based on their semantic relations using the CLiXO algorithm. These phrases were organized into six major concepts. We list two of the journal articles, Mao et al.[141] and Rotin et al.[139, 142], contributing to the concepts ‘lung disease’ and ‘epithelial cell’. share similar 3D structures (Fig. 3.12a). This pathway was also previously unknown (Jac- card Index 0.1 in GO and KEGG). Our method described its functions with 37 phrases, which could be hierarchically organized into six major concepts (Fig. 3.13). An interesting concept was ‘acute monoblastic leukemia’, suggesting this pathway was cancer-associated. As shown in Fig. 3.13, validation for this pathway was achieved by tracing back the actual literature referencing these genes and diseases simultaneously. One of the articles, Gel- bard et al. [147], related FBXW7 to sinonasal carcinoma, a kind of head and neck cancer. This is consistent with our finding that these genes were recurrently mutated in the HNSC and UCEC patient cohorts in TCGA (Fig. 3.12b). These two examples demonstrate how pathways can be automatically discovered and annotated by integrating years of biomedical knowledge with ‘omics datasets.

3.3.4 Conclusion

In this work, we have developed a novel text mining and visualization tool for automated pathway functional annotation. Our main idea is to integrate literature and molecular in- teraction information into a large heterogeneous network and then use a random walk-based approach to rank candidate pathway descriptions. In the final step, we use a Concept Ontol- ogy to visualize annotations as a more informative alternative to a flat network of biomedical

45 Figure 3.12: Discovery and characterization of another new pathway by our method. (a) The pathway is defined by eight genes related by protein interactions, co-expression, and protein shared domains. The functions of these genes are collectively described by 37 phrases set out around the periphery. (b) Cancer types in which these genes are recurrently mutated in TCGA. phrases. In this work our primary focus is to annotate gene set, however, our framework can be well generalized to other applications. For instance, if the user provides a set of drugs, targets and their corresponding interaction networks, our method should be able to return the potential downstream and upstream pathways where these drugs might influence. Another application is that we can replace gene set with a group of disease symptoms and replace molecular network with symptom similarity network. Then our method might help to define the potential pathways and genes that lead to such symptoms. One of the major limitations of our work is that currently we can not accept users’ input to specify a particular context. For example, the user might want to know the roles of these genes in brain or the user only want to know the location information. Theoretically speaking, these information is all included in our result, however, they might not rank high enough to pass our filter. There are many interesting directions to explore in the future. To name a few, we plan to automatically generate sentences instead of phrases for new pathways. Sentences are more widely accepted and carry more information than

46 Figure 3.13: Summarization of biological function by a Concept Ontology. The 37 phrases describing the pathway were hierarchically clustered based on their semantic relations, using the CLiXO algorithm. These phrases were organized into six major concepts. We list two of the journal articles, Spinella et al.[146] and Gelbard et al.[147], from which the concept ‘acute monoblastic leukemia’ was inferred. phrases. Another direction is to improve our algorithm to move beyond the abstract and title to scanning complete articles and even figures. A more challenging direction is to link functional descriptions more deeply with molecular data. In our current method, the types of interactions among genes do not influence the final functional annotations. However, in practice, a rich protein-protein interactions and genetic interactions usually suggest a protein complex.

47 CHAPTER 4: KNOWLEDGE NETWORK ASSISTED DRUG DISCOVERY

In this Chapter, I will demonstrate how knowledge network can be used to accelerate drug discovery. I will show how to use knowledge networks and the proposed network embedding approaches to understand drug mechnisms of actions through identification of drug targets and associated pathways. To identify drug associated pathways, I integrated moleclar networks and pathways into a large hetergeneous network. I then applied the proposed network embedding to project pathways and genes into the same low-dimensional space. Experiment results showed that this heterogeneous network enables an improved pathway identification performance. To identify drug targets, I applied the diffusion model to yeast genetic interaction networks. Using network neighbors’ phenotypes can reduce noise in the experimental data, thus improving target identification performance.

4.1 KNOWLEDGE NETWORK ASSISTED DRUG PATHWAY IDENTIFICATION

Basal gene expression levels have been shown to be predictive of cellular response to cy- totoxic treatments. However, such analyses do not fully reveal complex genotype-phenotype relationships, which are partly encoded in highly interconnected molecular networks. Bi- ological pathways provide a complementary way of understanding drug response variation among individuals. In this study, we integrate chemosensitivity data from a recent phar- macogenomics study with basal gene expression data from the CCLE project and prior knowledge of molecular networks to identify specific pathways mediating chemical response. We develop a computational method called PACER, which ranks pathways for enrichment in a given set of genes using a novel network embedding method. It examines known rela- tionships among genes as encoded in a molecular network along with gene memberships of all pathways to determine a vector representation of each gene and pathway in the same low- dimensional vector space. The relevance of a pathway to the given gene set is then captured by the similarity between the pathway vector and gene vectors. To apply this approach to chemosensitivity data, we identify genes with basal expression levels in a panel of cell lines that are correlated with cytotoxic response to a compound, and then rank pathways for relevance to these response-correlated genes using PACER. Extensive evaluation of this approach on benchmarks constructed from databases of compound target genes, compound chemical structure, as well as large collections of drug response signatures demonstrates its advantages in identifying compound-pathway associations, compared to existing statistical methods of pathway enrichment analysis. The associations identified by PACER can serve as

48 testable hypotheses about chemosensitivity pathways and help further study the mechanism of action of specific cytotoxic drugs. More broadly, PACER represents a novel technique of identifying enriched properties of any gene set of interest while also taking into account networks of known gene-gene relationships and interactions.

4.1.1 Introduction

Large-scale cancer genomics projects, such as the Cancer Genome Atlas(Cancer Genome Atlas Research [148], the Cancer Genome project [149], and the Cancer Cell Line Encyclope- dia project [150], and cancer pharmacology projects such as the Genomics of Drug Sensitivity in Cancer project [149] have generated a large volume of genomics and pharmacological pro- filing data. As a result, there is an unprecedented opportunity to link pharmacological and genomic data to identify therapeutic biomarkers [40, 39, 41]. In pursuit of this vision, sig- nificant efforts have been invested in identifying the genetic basis of drug response variation among individual patients. For instance, a recent study performed a comprehensive survey of genes with basal expression levels in cancer cell lines that correlate with drug sensitivity, revealing potential gene candidates for explaining mechanisms of action of various drugs [42]. While significant efforts have focused on specific genes that interact with compounds and confer observed cellular phenotypes, there has been relatively little progress in studying the synergistic effects of genes. These effects are key factors in comprehensively deciphering the mechanisms of action of compounds and understanding complex phenotypes [151]. Similarly, pathways, which comprise a set of interacting genes, have emerged as a useful construct for gaining insights into cellular responses to compounds. Analysis at the pathway level not only reduces the analytic complexity from tens of thousands of genes to just hundreds of pathways, but also contains more explanatory power than a simple list of differentially expressed genes [152]. Consequently, an important yet unsolved problem is the effective identification of pathways mediating drug response variation. Although the associated pathways for certain drugs have been studied experimentally [153, 154, 155], in vitro pathway analysis is costly and inherently difficult, making it hard to scale to hundreds of compounds. Fortunately, a growing compendium of genomic, proteomic, and pharmacologic data al- lows us to develop scalable computational approaches to help solve this problem. Although statistical significance tests and enrichment analyses can be naturally applied to compound- pathway association identification (e.g., by testing the overlap between pathway members and differentially expressed genes), these approaches fail to leverage well-established biolog- ical relationships among genes [156, 157, 158, 159]. Even when analyzing individual genes, molecular networks such as protein-protein interaction networks have been shown to play

49 crucial roles in understanding the cellular drug response [160, 151, 161, 162, 163]. Therefore, we propose to combine molecular networks with gene expression and drug response data for pathway identification. However, integrating these heterogeneous data sources is statisti- cally challenging. Moreover, networks are high-dimensional, incomplete, and noisy. Thus, our algorithm needs to accurately and comprehensively identify pathways while exploiting suboptimal networks. Here we present PACER, a novel, network-assisted algorithm that identifies pathway as- sociations for any gene set of interest. Additionally, we apply the algorithm to discover chemosensitivity-related pathways. PACER first constructs a heterogeneous network that includes pathways and genes, pathway membership information, and gene-gene relationships from a molecular network such as protein-protein interaction network. It then applies a novel dimensionality reduction algorithm to this heterogeneous network to obtain compact, low- dimensional vectors for pathways and genes in the network. Pathways that are topologically close to drug response-related genes in the network are co-localized with those genes in this low-dimensional vector space. Hence, PACER ranks each pathway based on its proximity in the low-dimensional space to genes that have basal expressions highly correlated with drug response. We evaluated PACER’s ability to identify compound-pathway associations with three ‘ground truth’ sets built from compound target data [42], compound structure data [164], and LINCS differential expression data [165]. When comparing PACER to state- of-the-art methods that ignore prior knowledge of interactions among genes, we observed substantial improvement of the concordance with the chosen benchmarks. Even though we developed PACER and tested its ability to identify compound-pathway associations, the algorithm is applicable to any scenario in which one seeks to discover pathways related to a pre-specified gene set of interest, while utilizing a given gene network.

4.1.2 Methods

Compound response data and gene expression data

We obtained a large-scale compound response screening dataset from Rees et al. [42], which spans 481 chemical compounds and 842 human cancer cell lines encompassing 25 lineages. These 481 compounds were collected from different sources including clinical can- didates, FDA-approved drugs and previous chemosensitivity profiling experiments. Area under the drug response curve (AUC) was used by the authors of that study to measure cel- lular response to individual compound. We also obtained gene expression profiles for these cell lines from the Cancer Cell Line Encyclopedia (CCLE) project [134], profiled using the

50 GeneChip U133 Plus 2.0 Array. Since these expression measurements were done in each cell line without any drug treatment, they are referred to as ‘basal’ expression levels. In contrast, the expression profiling of a cell line was performed after treatment with a drug in certain studies such as LINCS L1000 [165] and [166]. We obtained the SMILE specification of each drug from PubChem [164] and then calculated the Tanimoto similarity scores between all pairs of drugs based on their SMILE specifications.

STRING-based molecular network and NCI pathway collection

We obtained a collection of six human molecular networks from the STRING database v9.1 [167]. These six networks include experimentally derived protein-protein interactions, manually curated protein-protein interactions, protein-protein interactions transferred from model organism based on orthology, and interactions computed from genomic features such as fusion-fusion events, functional similarity and co-expression data. There are 16,662 genes in the network. We used all the STRING channels except “text-mining” and used the Bayesian integration method provided by STRING. Since our approach can deal with dif- ferent edge weights, we did not set a threshold to remove low confidence edges. We referred to this integrated network as the STRING-based molecular network. To test whether genes that are highly correlated with many compounds tend to have higher degrees in the network, we formed two groups of genes. One group contained genes that are correlated with over 100 compounds, and the other group contained the remaining genes. We then used the Wilcoxon signed-rank test to test whether the degrees of genes in these two groups were from the same distribution. We obtained a collection of 223 cancer-related pathways from the National Cancer Institute (NCI) pathway database. These manually curated pathways include human signaling and regulatory pathways as well as key cellular processes [168].

The PACER Framework

PACER integrates pathway information with the STRING-based molecular network de- scribed above by constructing a heterogeneous network of genes and pathways. An edge exists between two genes if they are connected in the network. An edge exists between a pathway and a gene if the gene belongs to the pathway. There are no direct pathway-pathway edges in the heterogeneous network. PACER adopts diffusion component analysis (DCA), a recently developed network representation algorithm to learn a low-dimensional vector for each node in the network [59]. Because of its ability to handle noisy and missing edges in the biological network, DCA has achieved state-of-the-art results in different tasks [59, 32]. Since

51 compounds are not nodes in the constructed heterogeneous network, only genes and path- ways are projected onto the low-dimensional space. After learning the low-dimensional rep- resentations of all nodes (genes and pathways), DCA ranks pathways based on the weighted cosine similarities between a pathway and the set of 250 genes most correlated with response to a compound. We henceforth refer to this set of genes as “response-correlated genes” (RCG) for the compound. These genes’ expression values are most significantly correlated with chemosensitivity. We found that the performance of the PACER method is stable for different choices (200, 250, and 300) for the number of RCGs considered in this step.

LINCS drug perturbation profiles

LINCS is a data repository of over 1.3 million genome-wide expression profiles of human cell lines subjected to a variety of perturbation conditions, which include treatments with more than 20 thousand unique compounds at various concentrations. Each perturbation experiment is represented by a list of differentially expressed genes that are ranked based on z- scores of perturbation expression relative to basal expression. For each gene, we first took the difference between its expression in a perturbation condition and its expression in a control condition (i.e., treatment with pure DMSO solvent). We then considered the differential expression of the gene in multiple perturbation experiments involving that compound (i.e., different concentrations, time points, and cell lines). We used the maximum differential expression to represent the compound’s effect on that gene’s expression. All genes were then ranked by their differential expression on treatment with the compound, and the top 250 genes were treated as differentially expressed genes (DEGs) of the compound, provided their z-score has an absolute value greater than 2.

Comparison with method of Huang et al.

We implemented the method of Huang et al. [156] ourselves using the exact same input (i.e., chemosensitivity and gene expression data) as PACER. We first computed a gene’s correlation to a drug by calculating the Pearson correlation coefficient between the gene’s expression values and the drug response values across cell lines. Let the set of genes in pathway p be denoted by Gp, and their correlation values to a drug d by C(Gp, d). Conversely, the set of genes not in pathway p is denoted as Gp, and their correlation values to d as

C(Gp, d). We then performed the Kruskal-Wallis H-test, following Huang et al., to test if the medians of C(Gp, d) and C(Gp, d) were significantly different. We used the resulting p-value to rank pathways for each drug.

52 Figure 4.1: Global analysis of correlations between basal gene expression and compound response. (A) Heatmap of the Pearson correlation coefficient between genes (expression) and compounds (chemosensitivity, measured by AUC values). (B) Histogram of the number of compounds associated with each gene. The y-axis shows the number of genes associated with k compounds, where k is shown on the x-axis. (C) Histogram of the number of genes associated with each compound. The y-axis shows the number of compounds associated with k genes, where k is shown on the x-axis. (D) Histogram of the number of compounds significantly associated with each pathway (Fishers exact test FDR 0.05). (E) Histogram of the number of pathways significantly associated with each compound (Fishers exact test FDR 0.05).

53 4.1.3 Experimental results

Global analysis of correlations between basal gene expression and compound response

Following the work of Rees et al. [42], we first examined correlations between the com- pound sensitivity and basal gene expression profiles across hundreds of cell lines. We calcu- lated Pearson correlation coefficients between each gene’s expression and the cellular response (measured as the area under the curve or AUC) to each compound, across different cell lines (Figure 4.1A). Compared to the IC50 and EC50 scores, AUC simultaneously captures the efficacy and potency of a drug. Of the 8.7 million pairs of genes and compounds tested, we found 294,789 to be significantly correlated (p-value < 0.0001 after Bonferroni correction, corresponding to a Pearson correlation coefficient of 0.215.) Since the Rees et al. dataset comprises measurements on 842 cell lines, each correlation was computed over 842 pairs of values (drug response, gene expression). This is why even a modest-looking Pearson correla- tion of 0.215 was deemed highly statistically significant. The key observation from this initial analysis, also noted by Rees et al., is that basal gene expression levels are highly correlated with cytotoxic response for large numbers of compound-gene pairs. Within these signifi- cantly correlated pairs, 1,749 genes were correlated with over 100 compounds (Figure 4.1B, Suppl. Table 1). We note that these key genes tend to be high-degree nodes in STRING- based molecular network (Wilcoxon rank sum test p-value < 9.6e-14, see Methods). We also found that some (10 of 481) compounds were significantly correlated (Pearson correlation p-value < 0.0001 after Bonferroni correction) with more than 3,200 genes (Figure 4.1C). Five of these ten compounds are chemotherapeutic agents (Suppl. Table 2). In contrast, about 100 compounds were not significantly correlated with any genes; these compounds are mostly probes that either lack FDA approval or are not clinically used. The large disparity among the examined compounds in terms of the number of correlated genes reflects the di- versity of these 481 small molecules. While many of them are chemotherapeutic, which can affect the expression of a large number of genes, some compounds may be targeting specific mutations, post-translational modifications, or protein expression. A closer examination re- vealed that the compounds with the highest cytotoxicity had the fewest gene correlations (i.e., fewest genes whose expression correlates with cytotoxic response were mostly those with low cytotoxicity) (Suppl. Figure 1). This suggests that the strategy of identifying compound-associated genes by correlating basal gene expression profiles with cytotoxicity is likely to be more effective for more potent compounds, for which average response is stronger. Note that the gene expression profiles used here are basal and not in response to treatment with compound, hence it was not clear a priori that more effective compounds

54 would have larger numbers of gene correlates. In summary, examination of individual genes’ correlations with chemical response confirmed previous reports [149, 169, 42] that basal gene expression significantly correlates with cytotoxicity across cell lines, especially for effective cytotoxic drugs.

Identifying compound-specific pathways via enrichment tests

The above evidence for correlations between basal gene expression and chemical response raised the possibility that one might discover important biological pathways associated with the response by a systems-level analysis of gene expression data. To explore this, we consid- ered a collection of 223 cancer-related pathways from the National Cancer Institute (NCI) pathway database[168] and used Fisher’s exact test to quantify the overlap between the set of genes in a pathway p and RCGs. A significantly large overlap between the two sets in- dicates an association between the pathway and the compound. We performed a multiple hypothesis correction on all pathway association tests for each compound, using FDR 0.05. The results of this baseline method for predicting pathway associations are shown in Figure 4.1D (distribution of the number of compounds that are significantly associated with each pathway) and Figure 4.1E (distribution of the number of pathways significantly associated with each compound). Both distributions revealed a long tail. For instance, while each pathway was associated with an average of 18 compounds (of the 481 tested), there were 10 pathways that were associated with over 150 compounds (Suppl. Table 3). Likewise, while each compound was associated with an average of eight pathways, there were 12 compounds associated with over 25 pathways (Suppl. Table 4). We show the details of these long tails in Suppl. Figure 2.

A new method for identifying pathways associated with chemical response, based on network embedding

We observed above that key RCGs those correlated with many compounds tend to be enriched in high degree nodes of the STRING-based molecular network. This suggests that an analysis combining this network with pathway enrichment tests might provide additional insights. We therefore developed a novel network-based method, called PACER, for scor- ing compound-pathway associations. PACER (Figure 4.2A) first constructs a heterogeneous network consisting of genes and pathways as nodes. In this network, gene-pathway edges denote pathway memberships based on a compendium of pathways and gene-gene edges from the STRING-based molecular network introduced above (also see Methods). PACER

55 Figure 4.2: Identifying pathways associated with chemical response using PACER. (A) Schematic description of PACER. (B) Heat map of associations between compounds and pathways (PACER scores). Rows are pathways and columns are compounds. (C) Detailed view of a subset of the red branch cluster of pathways, marked by C. (D) Detailed view of a subnet of purple branch cluster of compounds, marked by a rectangle. (E) Comparative evaluation of different methods for predicting compound-pathway associations. The ground truth used here is the pathways that contain any known target gene of the compound. (F) Comparison of PACER, Fishers exact test and Huang et al. on predicting compounds with similar chemical structure. The y-axis shows the number of compounds with an AUROC larger than k, where k is shown on the x-axis. Compound structure similarity is determined by the Tanimoto similarity score calculated based on their SMILE specifications. Prediction is made according to the Spearman correlation between the pathway rankings of two compounds. (G) Number of compounds with significant overlap (p < 0.05) between pathways from LINCS and pathways from PACER, from Huang et al. 2005 and from the baseline method (Fishers exact test) respectively, at different levels of stringency in pathway prediction. Stringency refers to the FDR control used by the baseline method in determining significant pathways. Both PACER and the Huang et al. 2005 method were used to predict the same number of (highest scoring) pathways as the baseline method, for a fair comparison.

56 then creates a low-dimensional vector representation for each gene and pathway node in the heterogeneous network, reflecting the node’s position in this heterogeneous network. This is done by the Diffusion Component Analysis (DCA) approach reported in previous work [59, 32]. Nodes (i.e., pathways or genes) will have similar vector representations if they are near each other in the network. For instance, two pathway nodes will have similar vector representations if the pathways share genes and/or their genes are related by the STRING- based molecular network. In a similar vein, two genes will have similar representations if they belong to the same pathway(s) and/or exhibit the same network neighbors. A gene and a pathway can also be compared in the low-dimensional space, and will be deemed similar if the gene is in the pathway and/or the gene is related by network to other genes of the path- way. DCA performs network-based embedding without utilizing gene expression or chemical response data. Next, PACER identifies RCGs as a fixed number of genes the expression of which shows the greatest correlation with chemical response to a specific compound. Finally, it scores a pathway based on the average cosine similarity between the vector representation of the pathway and those of the RCGs. A pathway can thus be found to be associated with a compound if, in the network, the pathway genes are closely related to the compound’s RCGs; this association can be discovered even if the pathway does not actually include the RCGs. We note that scores assigned by PACER are not statistical significance scores and are meant only to rank pathways for association with a given compound. Also, a negative score assigned to a compound-pathway pair does not imply a negative correlation between expression levels of pathway genes and chemosensitivity. Rather, it only implies a lack of evidence for an association between the compound-pathway pair. The PACER association scores for all combinations of 481 compounds and 223 NCI sig- naling pathways are shown in Figure 4.2B. The pathways cluster into many distinct groups, each with different compound association profiles. One group (cyan branches in the row dendrogram), associated with more than half of the compounds, consists of pathways de- scribing various integrin cell surface interactions (e.g., ‘Integrins in angiogenesis’ pathway, ‘Alpha 4 Beta1 integrin cell surface interactions’ pathway). These pathways are known to play crucial roles in communications among cells in response to small molecules [170]. Notably, integrins are a major family of cell surface adhesion receptors, and are involved in major pathways that contribute to cancer cell survival and resistance to chemotherapy [171]. PACER found 329 of the 481 compounds to be associated with the “integrin family cell surface interactions” pathway. Since PACER scores are not easily assigned statisti- cal significance levels, we chose for each compound n pathways with the highest PACER scores, where n is the number of statistically significant pathway associations (FDR ≤ 0.05) found by the baseline method above for the same compound. We found literature support

57 for some of these associations. For example, ruxolitinib, a JAK/STAT inhibitor, is associ- ated with integrin pathways by PACER analysis. In a previous study, it was shown that beta 4 integrin enhances activation of the transcription factor STAT3, which is a target of ruxolitinib [172]. Furthermore, vorinostat, a member of a larger class of compounds that inhibit histone deacetylases (HDACs), can induce integrin 51 expression and activate MET, leading to resistance [173]. Figure 4.2C reveals another example of functionally related path- ways being grouped together. The pathways ‘VEGFR1 specific signals’, ‘ErbB4 signaling events’, ‘EGFR-dependent Endothelin signaling events’, ‘ErbB receptor signaling network’, and ‘PDGF receptor signaling network’ form one group (red branches in row dendrogram), and are associated with masitinib (PDGFRB inhibitor) and RAF265 (VEGFR2 inhibitor), among other compounds (see Suppl. Table 5). Vascular endothelial growth factor recep- tor (VEGFR inhibitor) and platelet-derived growth factor receptor (PDGFR inhibitor) are both members of the family of 58 known tyrosine kinase receptors in humans [174]. Ty- rosine kinases have various modulatory functions in growth factor signaling and several of their inhibitors are known for their anti-tumor activity [175]. Figure 4.2B also shows compounds clustered into different groups based on their associa- tions with pathways. We found that many compounds with similar structure were grouped together. For example, teniposide and etoposide had a Tanimoto similarity score of 0.94 be- tween their SMILE specifications, which was substantially higher than the average Tanimoto similarity score of 0.3716 for all pairs of drugs. They were clustered together in the same group (Figure 4.2D, also marked as a rectangle in Figure 4.2B), which had seven compounds. This group is associated with a set of similar pathways, including ‘p53 pathway’, ‘Direct p53 effectors’, ‘Signaling mediated by p38-alpha and p38-beta’, and ‘Signaling mediated by p38-gamma and p38-delta’. We found support in the literature in favor of some of these as- sociations. For example, a previous study reported that etoposide activates p38MAPK and can be used as a new combined treatment approach when used with p38MAPK inhibitor SB203580 [176]. To take another example, temsirolimus and tacrolimus, which are both epipodophyllotoxins and inhibit topoisomerase II, have a Tanimoto similarity score of 0.82, and are grouped closely in Figure 4.2B.

PACER improves pathway identification

We noted a substantial degree of complementarity between the top predictions of PACER and those of the baseline method that uses Fisher’s exact test between RCGs and pathway genes (see Suppl. Table 5). For instance, PACER found that PD153035, an ErbB2 inhibitor, is associated with the ‘C-MYC pathway’, reflecting the fact that PD153035 is able to reduce

58 Table 4.1: Compounds for which PACER predicted pathways with greatest precision. Evaluation was performed with known targets.

Compound AUROC bms-536924 0.936652 ml239 0.869446 kx2-391 0.846667 skepinone-l 0.844854 mgcd-265 0.841279 nsc23766 0.835607 vorapaxar 0.834101 pf-3758309 0.824359 cyclophosphamide 0.798150 pf-573228 0.792370 c-Myc protein levels in breast tumor cells [177]. The baseline approach did not find this association to be significant. Similarly, PACER reported that the ‘EGFR-dependent En- dothelin signaling events’ pathway is associated with EGFR inhibitor gefitinib [178], while the baseline method did not. For a more systematic comparison between the two methods, we evaluated PACER based on a database of known compound targets. We performed the evaluation under the as- sumption that a pathway containing at least one known target is an associated pathway. Huang et al. used and suggested this approach [156]. We used it here to evaluate PACER, the baseline method, as well as a third method presented by Huang et al. [156] Although this third method was proposed to detect association between pathways and drug clades, it can directly detect pathway-compound associations. We implemented the method ourselves (see Methods) and included it in our evaluations. We obtained the known targets for 246 compounds in our compound set from Rees et al. [42]. We then computed the AUROC of pathway predictions made by PACER for each compound, and plotted this information alongside analogous information for the baseline method and the method of Huang et al. [156] As shown in Figure 4.2E, PACER identified pathways with higher AUROC compared to the other two methods. For example, PACER identified pathways with an AUROC greater than 0.75 for 22 different compounds, while the baseline method achieved this level of AU- ROC for only 7 compounds. Table 4.1 shows the 10 compounds for which PACER achieved highest AUROC. We present a closer examination of PACER predictions for one of these compounds:‘nsc23766’ (0.84 AUROC) in Table 4.2. This table also shows the complemen- tarity between PACER predictions and those of the baseline method. We note that the AUROC values reported here are likely to be underestimates, as there is

59 Table 4.2: Top 10 pathways predicted by PACER for the compound nsc23766

Contains P-value of Pathway known target? baseline method

Fanconi anemia pathway NO 0.2670 CXCR4-mediated signaling events YES 1 TCR signaling in naive CD4+ T cells YES 1 IL2 signaling events mediated by PI3K YES 1 ATR signaling pathway NO 1 TCR signaling in naive CD8+ T cells YES 1 Class I PI3K signaling events NO 1 p53 pathway YES 0.3185 BARD1 signaling events NO 1 EPO signaling pathway NO 1 literature evidence for some of the reported pathways being associated with the compound, even though the pathway does not include a known target (and is thus considered a false positive in our AUROC estimate). The Rac1-specific inhibitor nsc23766 was identified by PACER as being associated with several pathways that include a known target (Table 4.2), as well as some pathways that do not but whose association is supported by literature evidence. For example, the pathway ‘Beta5 beta6 beta7 and beta8 integrin cell surface interactions’ was predicted as being associated with this compound, and even though the latter’s known target is not in this pathway, various lines of evidence support the association. A study of cholangiocarcinoma found beta6 integrin to promote invasiveness by activating Rac1, and that the compound nsc23766 is able to suppress this invasiveness [179] and can be used to identify poor prognostic HER2 amplified breast cancer patients [180]. Beta8 integrin influences Rac1 levels to promote cell invasiveness in glioblastoma [181]. Also, Beta8 integrin is reported to activate Rac1 signaling in endometrial epithelial cells [182]. In addition, ncs23766 was predicted to be associated with ‘a4b7 Integrin signaling’. While this pathway does not directly include the compound’s target, Rac1 was previously shown to induce a4b7-mediated T cell adhesion to MAdCAM-1 [183], supporting the predicted association. Furthermore, ‘Plexin-D1 Signaling pathway’ was found by PACER to be associated with ncs23766. Plexin-D1 is a receptor for SEMA4A which inhibits Rac activation [184]. Thus, the anecdotal observations made above suggest that compound-pathway predictions made by PACER may sometimes be worth pursuing even if the compound’s target is not included in the pathway. We further evaluated PACER based on the identified pathways for compounds with similar

60 chemical structures. We performed this evaluation under the assumption that compounds with similar chemical structures tend to be associated with the same pathways. To this end, we calculated the Tanimoto similarity between each pair of compounds according to their SMILE specifications. We regarded two compounds as having similar chemical structures if their Tanimoto similarity is larger than 0.8. This threshold was used in previous work to indicate high Tanimoto similarity [185]. For a given compound, we used PACER to rank other compounds according to the Spearman correlation coefficient between their rankings of pathways, and asked if this ranking was predictive of chemical structure similarity. For each compound, an AUROC score was computed to measure this predictive ability. We repeated this evaluation with the three different methods of ranking pathways for a com- pound: PACER, the baseline method, and the method of Huang et al. Figure 4.2F shows the number of compounds for which the AUROC is above a specified threshold, for each of the three methods. We found that PACER achieves better performance in identifying compounds with similar chemical structure based on similarity of pathway ranking. For example, 21 (of 42) compounds yielded an AUROC greater than 0.8 when using PACER, compared to 10 compounds meeting the same criterion when using the baseline method. As compounds with similar chemical structure tend to be functionally similar, our results demonstrate that PACER can be used to identify similar compounds by integrating prior network information into chemosensitivity data. We also compared the associations predicted by the three methods to those identified from an external data set. To this end, we mined the Library of Integrated Network-Based Cellular Signatures (LINCS) L1000 data[165], which reports genes differentially expressed upon treatment of various cell lines with a compound. For each compound in our analysis that is also included as a perturbagen in the L1000 compendium, we established a LINCS- based benchmark of significantly associated pathways. This was based on a Fisher’s exact test (p-value ≤ 0.05) between pathway genes and the most differentially expressed genes from treatments with the same compound (see Methods). We required this criterion to be met in at least one of the cell lines for which data was available from LINCS. We then assessed the concordance between this set of LINCS-based compound-pathway associations and those predicted by either method presented above. We recognize that this is not an ideal benchmark: LINCS data points to genes (and, indirectly, to pathways) that are differentially expressed in response to treatment, while PACER and the compared methods base their pathway predictions on genes that have basal expression levels across cell lines that correlate with chemical response. At the same time, we expect the pathways affected by chemical treatment to also be, to an extent, involved in interpersonal variation of chemosensitivity, making this a suitable evaluation procedure. This was inspired by similar observations

61 in cancer biology: genes and pathways disrupted in cancer tissues overlap with genes and pathways whose mutation status in germline non-tumor samples is informative about disease susceptibility and progression. To test whether the significant pathways identified from LINCS data agree with the path- ways predicted by one of the methods being evaluated (based on chemical response variation in CCLE cell lines), we counted the compounds for which the two sets of predicted path- ways overlapped significantly (Fisher’s exact test p-value ≤ 0.05). As shown in Figure 4.2G, the PACER approach predicts pathways concordant with the corresponding LINCS-based benchmark for more compounds, compared to the baseline method and that of Huang et al.[156] For instance, when the baseline method used an FDR threshold of 10% to desig- nate significant pathway associations for each drug, and the PACER method predicted the same number of pathways, the latter’s predictions were concordant with the LINCS-based benchmark for 110 of the 481 compounds, a nearly two-fold improvement over the baseline method’s predictions. Our evaluations actually provide evidence for the above-mentioned possibility that pathways predictive of drug sensitivity overlap with genes that mediate drug response. In fact, we found 86 compounds for which the pathways identified from basal ex- pression correlations and the pathways identified from LINCS signatures overlap with FDR < 5%. After observing the substantial improvement of PACER, we then investigated whether the performance of PACER is stable when only using experimental derived protein-protein interactions as input. In our current work, we used different types of STRING channels except “text-mining”. One may have the concern that some of these channels (e.g., the “databases” channel) contain direct information about the pathways we want to prioritize in this paper. However, we believe that it is appropriate to use all available information, including this channel, in our method for the following reasons. (1) Our method is completely unsupervised. Any existing resources can therefore be legitimately used by our method for pathway identification. (2) The goal is to rank pathways for each drug. The pathway information encoded in STRING database is not drug-specific and thus should not introduce bias into our results. (3) The “databases” channel can provide additional information that other molecular networks may not have. It improves the quality of the integrated network by using the information from “databases” interactions. For these reasons, we believe that using the “databases” channel is appropriate and decided to use the results that contain the “database” channel in Figure 4.2E,F,G. However, we also report the results of only using “experimental” channel in the supplementary. We found that the performance of PACER, as per the three evaluations presented above, was stable when only using experimental derived protein-protein interactions as input (Suppl. Figure 4-6). We further demonstrated that the

62 result of our method is robust to different numbers of top response-correlated genes used in PACER, as shown in Suppl. Figures 7-9. We compared different values for k in the top k genes chosen by PACER. We found that for two of the three evaluation schemes, results were comparable when using k=100, 150, 200, 250 and 300. For the third evaluation scheme, results were comparable when using k=200, 250, and 300. This demonstrates the stability of the algorithm’s performance to different but reasonable values of k in its choice of top k response-correlated genes.

4.1.4 Conclusion

We have shown that embedding prior knowledge in a gene network can more accurately identify compound-pathway. Our new method, called PACER, identified many compound- pathway associations that are supported by known compound targets as well as literature evidence. Due to its unique ability to incorporate any suitable compendium of gene inter- actions, our approach may provide complementary insights into drug mechanisms of action. Historically, pathways associated with a particular gene set are identified by using popular statistical methods such as Gene Set Enrichment Analysis [186], Fisher’s exact test or the Binomial test (Reactome [187]). These tools test the overlap between differentially expressed genes and pathway members. They may also be applied to the set of drug-response-correlated genes (RCGs) analyzed here. Ingenuity Pathway Analysis [188] is another related tool, which utilizes information about causal interactions between pathway members. Our study is similar to the above tools in that PACER also seeks to find pathways implicated by a gene set. However, our approach differs from these existing tools in that known molecular interactions (e.g., PPI) among different genes are taken into consideration. Thus, a gene set, be it the RCGs of a compound or the members of a pathway, is not treated merely as the sum of its parts, but also includes the relationships among those parts. Since the dominant theme in existing approaches is assessment of overlaps between two gene sets (MSigDB, DAVID, and Reactome adopt variations on this theme), our extensive comparisons between PACER and the baseline method of Fisher’s exact test shed light on the relative merits of the new approach. A related line of work aims to identify differentially expressed subnetworks in a given interaction network, e.g., KeyPathwayMiner [189], but these studies are only superficially relevant to our work since we aim to prioritize existing pathways instead of finding new pathways. We consider two potential reasons for the strong performance of PACER. First, it is widely appreciated that a chemical compound not only affects individual genes, but also combina- tions of genes in molecular networks corresponding to core processes, such as cell proliferation

63 and apoptosis. Our method postulates that even if the RCGs and a pathway may only have a few genes in common, they may be close to each other in the network. Although current compound pathway maps are incomplete, much relevant information is available in public databases of human molecular networks. While traditional pathway enrichment analysis methods like Fisher’s exact test identify pathways according to the number of shared genes, PACER prioritizes pathways based on their proximities to RCGs in molecular networks. Second, manually curated pathways may have arbitrary boundaries due to the need to cap- ture knowledge at different levels of detail. Consequently, identifying drug related pathways might be hindered by pathway boundaries. By leveraging the prior knowledge in molecular networks, PACER is more robust to pathways with different boundaries, thus improving the sensitivity of detecting compound-pathway associations. We see many opportunities to improve upon the basic concept of PACER in future work. First, although the current PACER framework was developed in an unsupervised fashion, the scores assigned to each pathway for the given gene set can be used as the feature and plugged into off-the-shelf machine learning classifiers for compound-pathway association identifica- tion. Second, although this study focused on chemosensitivity response, the PACER method is broadly applicable to testing the association between two sets of genes according to their proximity in the network. Finally, although we use gene expression data as the molecular profile of each cell line, it might be interesting to test our method based on other molecular data such as somatic mutations and copy number alterations.

4.2 NETWORK-ASSISTED DRUG TARGET IDENTIFICATION

Chemical genomic screens have recently emerged as a systematic approach to drug dis- covery on a genome-wide scale. Drug target identification and elucidation of the mech- anism of action (MoA) of hits from these noisy high-throughput screens remain difficult. Here, I present GIT (Genetic Interaction Network-Assisted Target Identification), a net- work analysis method for drug target identification in haploinsufficiency profiling (HIP) and homozygous profiling (HOP) screens. With the drug-induced phenotypic fitness defect of the deletion of a gene, GIT also incorporates the fitness defects of the gene’s neighbors in the genetic interaction network. On three genome-scale yeast chemical genomic screens, GIT substantially outperforms previous scoring methods on target identification on HIP and HOP assays, respectively. Finally, I showed that by combining HIP and HOP assays, GIT further boosts target identification and reveals potential drug’s mechanism of action.

64 4.2.1 Introduction

Chemical genomic screens have been extensively used to discover functional interactions between genes and small molecular compounds in vivo [190, 191, 192, 193, 194, 195, 196, 197]. Due to its short generation time, inexpensive cultivation, and facile genetics, the budding yeast S. cerevisiae has been widely used as a platform for chemical genomic screens to de- cipher proteins and pathways targeted by small molecular compounds [198, 199, 200]. In comparison to other approaches [201, 202, 203, 204], yeast chemical genomic screens pro- vide comprehensive and systematic genome-wide measurements of a complete set of deletion strains. When the target protein(s)’s functions are conserved throughout evolution, results of chemical genomic screens in yeast can be readily transferred to other species, including human [205, 206, 207]. There are two types of yeast chemical genomics assays: haploinsuf- ficiency profiling (HIP) and homozygous profiling (HOP). A HIP assay consists of a set of heterozygous deletion diploid strains that are grown in the presence of a compound. De- creasing gene dosage of a drug target from two copies to one copy will result in increased drug sensitivity, or drug-induced haploinsufficiency [208]. Under normal condition, one copy of gene is adequate for the normal growth for diploid yeast. Haploinsufficiency can happen when a drug is added into the strain. Consequently, HIP experiments are designed to iden- tify the relationship between gene haploinsufficiency and compounds. In contrast, a HOP assay measures drug sensitivities of strains with complete deletion of non-essential genes in either haploid or diploid strains. Because of the complete deletion, HOP assays identify genes that act to buffer the drug target pathway. The fitness defect score (FD-score) is widely used to predict drug targets by comparing the perturbed growth rates to those of a set of control strains [198, 199]. Recently, a large-scale Synthetic Genetic Analysis (SGA) study [209] showed that target’s genetic interaction profiles are highly correlated with the outcomes of chemical genomic screens, suggesting the possibility of combining genetic interaction profiles with chemical ge- nomic screens for drug target identification [210, 211, 212]. A genetic interaction is measured as the difference between the experimentally measured double-mutant phenotype and the expected double-mutant phenotype [213]. A negative genetic interaction occurs when two genes have similar functions that compensate each others absence to support cell viability [214, 215, 216]. In contrast, a positive genetic interaction occurs when a mutation in one gene rescues the fitness defect associated with a mutation in another gene [217, 218, 219]. Intuitively, a gene’s genetic interaction neighbors are also modulated if the gene is tar- geted and perturbed by a compound. Hence, we can use genetic interaction neighbors’ FD-scores to assist the inference of drug targets on chemical genomic screens. To the best

65 of our knowledge, the only previous attempt to combine chemical genomic screens with ge- netic interaction profiles is by computing the Pearson correlation coefficient between their outcomes[212]. A higher and positive Pearson correlation coefficient indicates a potential compound-target interaction because genetic perturbation is inherently similar to chemical perturbation. However, this approach often works poorly because the Pearson correlation coefficient is sensitive to the noise in high-throughput chemical genomic screens and SGA profiles. Moreover, these existing methods ignore the inherent differences between HIP and HOP assays. Since HIP and HOP assays are complementary, combining HIP and HOP should further improve target identification and enhance our understanding of compounds’ mechanisms of action (MoA) in a comprehensive way. We introduce GIT, a novel Genetic Interaction Network-Assisted Target Identification scoring method for HIP-HOP screens. Due to the inherent similarity between genetic pertur- bation and chemical perturbation, it is possible to use genetic interaction neighbors’ chemical genomic profiles to assist drug target inference. Therefore, we adopt a network biology per- spective to detect drug targets by its neighbors. We first constructed a weighted, signed genetic interaction network from SGA profiles. For HIP assays, GIT supplements a gene’s FD-score by the FD-scores of its neighboring genes in the genetic interaction network. If the FD-scores of its positive genetic interaction neighbors are high while the FD-scores of its negative genetic interaction neighbors are low, the gene is more likely to be a target. For HOP assays, GIT incorporates the FD-scores of long-range two-hop neighbors to iden- tify drug targets, since HOP is more likely to prioritize genes that buffer the drug target pathway rather than the direct targets. By combining HIP and HOP assays using GIT, we observed further improvement in target identification. Extensive experiments on three genome-wide chemical genomic screens demonstrated that GIT substantially improves tar- get identification in comparison with existing scoring methods. We also identified many novel compound-target interactions that are currently not in any curated database but are supported by literature. In addition to target identification, we further demonstrated that GIT can be used to reveal the mechanisms of action of compounds and uncover co-functional gene complexes.

4.2.2 Methods

FD-score

The fitness defect score (FD-score) is the log-ratio of the growth defect of a deletion strain in response to a compound treatment, relative to its growth under control conditions. For

66 gene deletion strain i and compound c, the corresponding FD-score is defined as

ric FDic = log , (4.1) ri

where ric is the growth defect of deletion strain i in the presence of compound c, and ri is the average growth defect of deletion strain i measured under multiple control conditions without any compound treatment. FD-score reflects the sensitivity of a gene deletion strain to a compound treatment. Specifically, a negative FDic score means the growth fitness of the strain i in the presence of the chemical c should be weaker than that of the control without treatment. Therefore, a low, negative FD-score indicates a putative interaction between the deleted gene and the compound. The FD-score does not consider epistasis or interactions among genes. However, recent studies indicate that the phenotype of a particular strain can be caused by the deletion of a genetic modifier of a neighboring gene that is responsible for the phenotype [220, 221, 222, 223, 198]. Therefore, it is necessary to consider a gene’s neighboring genes for target identification.

Genetic interaction network

We first obtained genetic interaction profiles of 5.4 million gene-gene pairs in yeast from a recent genome-scale Synthetic Genetic Array (SGA) study [209]. We then constructed a signed, weighted genetic interaction network based on these profiles. The edge weight gij between gene i and gene j in the genetic interaction network is defined as

gij = fij − fifj, (4.2)

where fij is the double-mutant growth fitness, and fi is the single-mutant growth fitness of gene i. A negative genetic interaction refers to a more severe growth fitness observed than expected, with an extreme case being synthetic lethality, whereas a positive genetic interaction refers to double mutants with a less severe growth fitness than expected [224, 24].

Network-assisted target identification in HIP assays

To identify drug targets in HIP assays, we introduce the GITHIP-score, which combines a gene’s FD-score and the FD-scores of its neighboring genes in the genetic interaction

67 network. For gene i and compound c, we define the GITHIP-score as

HIP X GITic = FDic − FDjc · gij. (4.3) j

The GITHIP-score considers two types of information of gene i: the FD-score of gene i and the FD-scores of gene i’s genetic interaction neighbors (Fig. 4.3). To account for different signs and strengths of genetic interactions, we compute a linear combination of the FD-scores of neighboring genes according to their genetic interaction edge weights to gene i. When gene i and gene j are a negative genetic interaction pair

(gij < 0) and gene i is the target of compound c, it is very likely that we also observe a negative FDjc value. This is because the deletion of one copy of gene j will make gene i more essential to the cell growth, according to the SGA assays. Likewise, if gene i and gene j are a positive genetic interaction pair (gij > 0) and gene i is the target of compound c, it is very likely that we observe a positive FDjc value. This is because the deletion of one copy of gene j will make gene i less essential to the cell growth. Therefore, we designed the GIT score to integrate the information from the genetic interaction neighbors to increase the signal-to-noise ratio, thus improving the sensitivity of the target identification. A low GITHIP-score indicates a potential compound-target interaction. For target that cannot be accurately identified by the FD-score because of the noise in chemical genomic screens or neighboring gene effect [220], the GITHIP-score corrects the target’s FD-score according to the FD-scores of its genetic interaction neighbors. Previous study proposed to identify essential genes in human cancer cell lines according to the ex- pression profile of genetic interaction neighbors [225]. Specifically, one gene becomes more essential if its negative genetic interaction neighbors are inactive and its positive genetic interaction neighbors are active [226]. In this paper, we studied more direct phenotypic growth fitness from chemical genomics assays, instead of gene expression data. Accordingly, the GITHIP-score can also be viewed as a conditional essentiality score since it captures the growth fitness of genetic interaction neighbors.

Network-assisted target identification in HOP assays

Compared to HIP, HOP assays delete both copies of non-essential genes in either haploid or diploid strains. Therefore, the FD-score in HOP assays prioritizes genes that buffer the drug target pathway. In contrast to existing studies [212, 198, 199, 200], which apply the same scoring methods for HIP and HOP assays, we introduce a different network-assisted

68 Figure 4.3: Illustration of how GIT identifies drug targets in HIP assays. Red (Blue) nodes indicate genes with high (low) FD-scores. Red (Blue) lines indicates positive (negative) genetic interactions. Yellow node indicates drug target. GIT supplements a gene’s FD-score by the FD-scores of its neighboring genes in the genetic interaction network. GIT identifies a gene as the target if the gene’s positive genetic interaction neighbors have high FD-scores and the gene’s negative genetic interaction neighbors have low FD-scores.

69 approach for HOP assays to tackle the inherent difference between HIP and HOP assays. Since genes with low FD-scores in the HOP assay are often close to or located in the drug target pathway, we extend our framework to consider the FD-scores of long-range two-hop genetic interaction neighbors. To utilize long-range two-hop genetic interaction neighbors’ FD-scores, we first calculate 1st the first-order GIT-score of each gene. We define the first-order GIT-score GITic of gene i in the presence of compound c as

1st X GITic = FDic − FDjc · gij. (4.4) j

We then use the first-order GIT-score to calculate the GITHOP-score for target identifica- tion in HOP assays.

HOP X 1st GITic = FDic − GITjc · gij. (4.5) j

The GITHOP-score considers two types of information for each gene: the FD-score of the gene itself and the first-order GIT-scores of the gene’s genetic interaction neighbors (Fig 4.4). A low GITHOP-score indicates a potential compound-target interaction. While genes that act to buffer the drug target pathway may not be the one-hop neighbors of the target, the GITHOP-score takes into account indirect neighboring genes in the genetic interaction network. Specifically, if the FD-scores of positively (negatively) weighted two-hop neighbors are high (low), the corresponding one-hop neighbors will have a significantly low GIT1st-scores. On the other hand, if the FD-scores of positively (negatively) weighted two- hop neighbors are low (high), the corresponding one-hop neighbors will have a significantly high GIT1st-scores. By correcting a gene’s FD-score according to the GIT1st-scores of the gene’s genetic interaction neighbors, the GITHOP-score explicitly considers the FD-scores of two-hop neighbors, which capture the drug target pathway buffer effect in the HOP assay. GITHOP-scores can be viewed as the second-order GIT-score. It can be calculated by iteratively multiplying the FD-score vector with the genetic interaction network matrix. Although this scoring framework can be naturally extended to kth-order, we did not observe substantial improvement for k > 2 in our experiments.

70 Figure 4.4: Illustration of how GIT identifies drug targets in HOP assays. Red (Blue) nodes indicate genes with high (low) FD-scores. Red (Blue) lines indicate positive (negative) genetic interactions. Yellow node indicates drug target. GIT supplements a gene’s FD-score by the FD-scores of its long-range two-hop neighbors in the genetic interaction network. These long-range two-hop neighbors capture the drug target pathway buffer effect in HOP assays.

71 ρ-score

In addition to the FD-score, we compared our GIT-score with the ρ-score, which combines genetic interaction profiles with chemical genomic screens by computing the Pearson corre- lation coefficient between their outcomes[212]. ρ-score is proposed to leverage the inherent similarity between genetic perturbation and chemical perturbation. Both of them profiles the relative growth fitness when a gene i is knockdown by mutation or chemical compound. Thus if the genetic interaction profile of a gene i is positively correlated to the FD-score profile of a compound c, it means that the effect of mutating this gene i is similar to adding a compound c. For gene i and compound c, the ρ-score is defined as

P k∈ngh(i)(FDkc − FDc)(gik − gi) ρic = q , (4.6) P 2 P 2 k∈ngh(i)(FDkc − FDc) k∈ngh(i)(gik − gi)

where ngh(i) denotes the genetic interaction neighbors of gene i. When calculating ρic, we excluded pairs of k and c if FDkc is missing a value. A high, positive ρ-score indicates a potential compound-target interaction.

4.2.3 Datasets

STITCH compound-target interactions and genetic interaction profiles

We obtained known compound-target interactions from the STITCH 4 database [227] as a benchmark to evaluate the performance of different target identification scoring methods. These compound-target interactions are built from heterogeneous data sources, including experiments, expert-curated databases, and literature mining. Each interaction has a com- bined score (between 0 and 1) representing the confidence. We excluded low-confidence interactions (combined score < 0.4), as suggested by the STITCH 4 database. We further excluded interactions predicted solely from putative homologs from other species. After fil- tering compounds that are not in any of the collected chemical genomics screens, we obtained 1472 compound-target interactions. We obtained genetic interaction profiles of 5.4 million gene-gene pairs in yeast from a recent genome-scale Synthetic Genetic Array (SGA) study [209].

72 Genome-wide HIP-HOP screens

We obtained three yeast HIP-HOP chemical genomic screens from [198, 199, 200]. For evaluation purposes, we only included compounds that had at least one STITCH compound- target interaction. The first screen from Hoepfner et al. 2014 has 4,146 and 4,921 deletion strains grown under 71 and 73 compound treatment conditions for heterozygous and homozy- gous deletion collections, respectively. The second screen from Lee et al. 2014 has 1,095 and 4,810 deletion strains grown under 382 compound treatment conditions for heterozygous and homozygous deletion collections, respectively. The third screen from Hillenmeyer et al. 2008 has 5,307 and 4,810 deletion strains grown under 333 and 162 compound treatment conditions for heterozygous and homozygous deletion collections, respectively.

Experimental settings

We evaluated the performance of each scoring method based on two criteria: 1) the number of compound-target interactions that can be identified in the top k genes and 2) the number of compounds, at least one target of which can be identified in the top k genes. For each criteria, we plotted a curve which describes the number of identified compound- target interactions (drugs for second criteria) against the rank of gene for each scoring method. The y-axis shows the number of compound-target interactions (drugs for second criteria) identified in the top k genes, where k is shown on the x-axis. We first calculated the area under the curve (AUC) of each scoring method. Then we calculated a normalized AUC (nAUC) for scoring method X as

AUCX nAUCX = , (4.7) AUCFD−score

where the AUCFD−score is the AUC of the FD-score and the AUCX is the AUC of the scoring method X. If the nAUC is larger than 1, then the corresponding scoring method is better than the FD-score method.

We denoted the nAUC obtained from the first criteria as nAUCt and the nAUC obtained

from the second criteria as nAUCd. Since one drug may have multiple targets, nAUCt is

more suitable for evaluation. Therefore, nAUCt is the primary metric used for evaluation in this paper. When calculating the GITHIP-score and the GITHOP-score, we only considered the top q positive genetic interaction neighbors with highest weights and the top q negative genetic interaction neighbors with highest absolute weights for each gene in the network. We set q

73 Figure 4.5: Comparison of GIT with other scoring methods in terms of nAUCt in HIP assays (A, B, C) and HOP assays (D, E, F) on three chemical genomic screens. The y-axis shows the number of compound-target interactions identified in the top k genes, where k is shown on the x-axis. to 100 in all three screens.

4.2.4 Experimental results

GIT substantially improves target identification in HIP assays

To evaluate GIT in HIP assays, we performed large-scale target identification on three chemical genomic screens. The results are summarized in Fig 4.5. It is clear that our GITHIP-score substantially outperforms other scoring methods on all three screens. For example, on Hoepfner et al. 2014 screen, GIT achieves 1.270 nAUCt, which is much higher than 1.000 nAUCt of the FD-score. On Hillenmeyer et al. 2008 screen, GIT identifies 89 compound-target interactions in the top 150 genes, which is again substantially higher than the 75 compound-target interactions identified by the FD-score. Similar improvement was observed in terms of nAUCd on all three screens. To further understand how the GITHIP-score achieves improved performance, we listed compound-target interactions that are ranked higher by the GITHIP-score than by the FD- score in Table 4.3. For instance, the FD-score failed to identify the interaction between HSP90 and geldanamycin because of the large FD-score (2.01) of HSP90 in the presence of geldanamycin. In contrast, our GITHIP-score successfully identified HSP90 as the target of

74 geldanamycin by considering relevant HSP90’s negative genetic interaction neighbors (e.g., SGT1 with FD-score -8.77) and its positive genetic interaction neighbors (e.g., NSE3 with FD-score 0.68).

Table 4.3: Top compound-target interactions identified by GIT in HIP assays. We listed examples of compound-target interactions that are ranked higher by the GITHIP-score than by the FD-score. We showed compound-target interactions that are identified in the top 100 genes by the GITHIP-score.

Rank by the Rank by the Compound Target GITHIP-score FD-score 5-fluorouracil GLT1 78 140 5-fluorouracil DFR1 31 760 Curcumin STT1 20 94 Cycloheximide GSP1 64 513 Fenpropimorph ERG11 33 5,567 Fenpropimorph ERG2 26 56 Geldanamycin HSP90 69 5,731 Hydrochloric Acid END7 14 17 Hydroxyurea SGS1 13 406 Latrunculin A END7 13 56 Latrunculin A PSL7 85 125 Methotrexate DFR1 1 4 Radicicol TOP3 79 961 Rapamycin STT1 96 853 Rapamycin TOR2 2 3 Rapamycin LST8 4 30 Triclosan OAR1 62 224

We further compared the GITHIP-score with the ρ-score, which combines SGA profiles with chemical genomic screens via the Pearson correlation coefficient. We found that GIT substantially outperforms the ρ-score on all three chemical genomic screens. For instance,

GIT achieves 1.270 nAUCt on Hoepfner et al. 2014 screen, which is much higher than 0.919 nAUCt of the ρ-score. Same as the observation in the previous work [212], the ρ-score performs consistently worse than the FD-score, possibly due to the fact that the Pearson correlation coefficient is sensitive to the noise in high-throughput chemical genomic screens and SGA profiles. Finally, we examined whether other molecular networks also enable good target identifi- cation performance. Since previous work [228] has used protein-protein interaction network to identify essential genes in CRISPR screens, we hope to test if protein interaction network can also achieve good performance when is used to identify drug targets. Therefore, we com-

75 pared the performance between protein interaction network and genetic interaction network on chemical genomics screens. We first obtained a physical interaction network (PI) of yeast proteins from BioGRID V3.4 [4]. There are no sign but only weighted edges in the obtained physical interaction network. We then used Eq. 4.3 to calculate the GITHIP-score(PI) based on this physical interaction network, where gij is the weight between gene i and gene j in the physical interaction network. We found that the GITHIP-score has better performance than the GITHIP-score(PI). Since the physical interaction network is unsigned, the GITHIP- score(PI) inevitably prioritizes hubs which have a large number of neighbors. In contrast, by considering both negatively weighted edges and positively weighted edges in the genetic interaction network, our GITHIP-score is not biased towards hubs and thus substantially enhances target identification performance.

GIT substantially improves target identification in HOP assays

We next performed large-scale target identification on three chemical genomic screens to evaluate GIT in HOP assays. The results are summarized in Fig 4.5. It is clear that our GITHOP-score achieves the best target identification performance on all three chemical ge- HOP nomic screens. For instance, the GIT -score achieves 1.355 nAUCt on Lee et al. 2014 which is substantially higher than 1.000 nAUCt of the FD-score. We listed compound-target interactions that are ranked higher by the GITHOP-score than by the FD-score in Table 4.4. For instance, the FD-score fails to identify YSR2 as the target of sphingosine because of the close to zero FD-score (-0.017) of YSR2 in the presence of sphingosine. In contrast, the GITHOP-score successfully identifies the interaction between YSR2 and sphingosine, mainly due to the high GIT1st-scores of YSR2’s positive genetic interaction neighbors (e.g., CSF1 with the GIT1st-score 1.60) and the low GIT1st-scores of YSR2’s negative genetic interac- tion neighbors (e.g., VBM2 with the GIT1st-score -3.00). The substantial improvement of GIT over the FD-score demonstrates the promising of utilizing two-hop genetic interaction neighbors’ FD-scores to identify targets in HOP assays. It is crucial to understand whether it is necessary to apply different scoring methods to the HIP assay and the HOP assay. To this end, we calculated the GITHIP-score in the HOP assay based on Eq. 4.3, where FDic (FDjc) is the FD-score of the HOP assay rather than the HIP assay. Consequently, the GITHIP-score in the HOP assay corrects a gene’s FD-score only according to the FD-scores of its one-hop neighbors. Although the GITHIP-score in HOP assays achieves better performance in comparison to the FD-score and the ρ-score, we noticed that it is consistently worse than the GITHOP-score on all three screens, suggesting the necessity of utilizing the FD-scores of two-hop neighbors to capture the drug target

76 Table 4.4: Top compound-target interactions identified by GIT in HOP assays. We listed examples of compound-target interactions that are ranked higher by the GITHOP-score than by the FD-score. We showed compound-target interactions that are identified in the top 100 genes by the GITHOP-score.

Rank by the Rank by the Compound Target GITHOP-score FD-score Caffeine TOR1 4 111 Camptothecin RAD51 39 349 Fenpropimorph ERG11 27 2,414 Fluconazole ERG11 85 2,412 Hydrochloric Acid GET3 79 278 Hydroxyurea RAD51 20 1,751 Hydroxyurea RAD52 82 1,396 Nocodazole YOR29-09 62 393 Nocodazole KAR9 26 256 Sphingosine YSR2 2 1,981 Staurosporin STT1 11 2,261 pathway buffer effect in the HOP assay. We further investigated whether two-hop neighbors also enable better performance in HIP assays. We calculated the GITHOP-score in HIP assays by using the FD-score of HIP assays in Eq. 4.4 and Eq. 4.5. Different from HOP assays, the GITHOP-score does not obtain substantial improvement in comparison to the GITHIP-score in HIP assays, reflecting the inherent differences between the HIP assay and the HOP assay. Finally, we examined whether using the physical interaction network enables good target identification performance in HOP assays. We used Eq. 4.4 and Eq. 4.5 to calculate the HOP GIT -score(PI) based on the obtained physical interaction network, where gij is the edge weight between gene i and gene j in the physical interaction network. We compared the GITHOP-score(PI) with the GITHOP-score in HOP assays. Similar to our observation in HIP assays, we found that using the genetic interaction network has an overall better performance than using the physical interaction network in HOP assays.

Combining the HIP assay with the HOP assay further improves target identification

Since the HIP assay and the HOP assay are inherently different, we then studied whether these two assays can be combined for improving target identification. We noticed that either using the GITHIP-score from the HIP assay or using the GITHOP-score from the HOP assay identifies compound-target interactions that are not identified by the other. For example, in

77 Figure 4.6: Comparison of GIT with other scoring methods in terms of nAUCt on two chemical genomic screens. The y-axis shows the number of compound-target interactions identified in the top k genes, where k is shown on the x-axis. The GITHIP-score is calculated by using the FD-score from the HIP assay. The GITHOP-score is calculated by using the FD-score from the HOP assay. To calculate nAUCt, we divided the AUC of each method by the AUC of the GITHIP-score. the Hoepfner et al. 2014 screen, the GITHOP-score identifies 11 compound-target interactions that are not discovered by the GITHIP-score. Since the HIP assay and the HOP assay are complementary, we sought to combine them in order to further enhance the target identification performance. We proposed a combined scoring method which takes the average of the z-score by the GITHIP-score from the HIP assay and the z-score by the GITHOP-score from the HOP assay. Averaging these two scores can be viewed as boosting two weaker scoring methods to create a more robust and better scoring method. We denote this score as the GIT-score. Since Lee et al. 2014 only measures the FD-scores of the heterozygous strains of essential genes and the homozygous strains of nonessential genes, there is no overlapping genes between the HIP assay and the HOP assay in this screen. Therefore, we evaluated the GIT-score on the other two chemical genomic screens. We observed that the combined GIT-score is substantially better than both the GITHIP-score and the GITHOP-score (Fig 4.6). For example, the GIT- HOP score achieves 1.802 nAUCt, which is much higher than 1.284 nAUCt of the GIT -score HIP and 1.000 nAUCt of the GIT -score on the Hoepfner et al. 2014 screen.

78 Statistical assessment of using genetic interaction network in GIT

Since drug targets are likely to be enriched with high-degree nodes in the network [229], the improvement of GIT may be caused by its ability to prioritize these high-degree nodes instead of utilizing neighboring gene’s FD-scores. We then examined whether the improve- ment of GIT comes from prioritizing high degree nodes. We first constructed a large set of random networks according to the following procedure. For each node, we replaced each of its neighbors to another random node in the network while keeping the same edge weight and sign. Hence, each node in the new random networks has the same number of positive weighted neighbors and negative weighted neighbors as in the original genetic interaction

network. We then used these networks to calculate the GIT-score, where gij is the edge weight between gene i and gene j. We calculated an empirical p-value according to the number of random networks that have better performance than the genetic interaction net- work when used to identify targets. Since each node in the new random networks have the same degree as it does in the original genetic interaction network, this empirical p-value tests whether the improvement of GIT is achieved by identifying high-degree nodes. We obtained significant empirical p-values on both screens (empirical p-value < 0.009 on Hoepfner et al. 2014; empirical p-value < 0.018 on Hillenmeyer et al. 2008). Therefore, we found that GIT on these random networks is significantly worse compared to GIT on the original GI network. This demonstrates that the improvement comes from correcting each genes FD-score with its neighbors FD-scores rather than the network topology only.

GIT elucidates established mechanism of action of compound

To understand how GIT achieves the substantial improvement, we studied how GIT-score elucidates the compound’s MoA. We examined caffeine, which is a distinct, small molecular inhibitor of TOR complex [230, 231]. We analyzed GIT’s performance based on the Hoepfner et al. 2014 chemical genomic screen, which is the most recent screen among all three screens. We first noticed that the FD-scores of TOR1 in the HIP assay and the HOP assay are 0.079 and -1.827 in the presence of caffeine, respectively. Consequently, the FD-score fails to identify TOR1 as the very top target candidate. In contrast, our GIT-score successfully identifies TOR1 as the target of caffeine, mainly due to the high FD-scores of its positive genetic interaction neighbors SAC7 and UGP. The high FD-scores of SAC7 (1.75) and UGP (2.38) indicate that their positive genetic interaction neighbor TOR1 is inhibited by caffeine. Moreover, most of TOR1’s negative genetic interaction neighbors have substantially low FD- scores (e.g., GTR1 has -4.52 FD-score). We show the genetic interaction neighbors of TOR1

79 Figure 4.7: GIT elucidates MoA of caffeine. Green lines indicate positive genetic interactions. Red lines indicate negative genetic interactions. Red nodes (e.g., GTR1) are genes with low FD-scores in the HIP assay and the HOP assay. Green nodes (e.g., SAC7) are genes with high FD-scores in the HIP assay and the HOP assay. TOR1 and TOR2 are established targets of caffeine. The GIT-score successfully identifies their interactions with caffeine by using neighboring genes’ FD-scores. TOR1 complex, TOR2 complex, and GSE complex are identified as crucial cellular components that interact with caffeine. in Fig 4.7. We can see that TOR1’s negative genetic interaction neighbors have low FD- scores, whereas its positive genetic interaction neighbors have high FD-scores. Even though TOR1 does not have a substantially low FD-score, GIT still accurately identifies TOR1 as the target of caffeine by correcting TOR1’s FD-score according to its genetic interaction neighbors’ FD-scores. Notably, the GIT-score also identifies TOR2 as a target of caffeine. Although the GIT- score ranks TOR2 lower than TOR1, it ranks TOR2 higher than the FD-score does. We noticed that it is difficult to identify TOR2 only according to one-hop genetic interaction neighbors’ FD-scores. Both TOR1 and AVO1 have close to zero FD-scores. GTR1 has a substantially low FD-score, but the genetic interaction edge weight between GTR1 and TOR2 is much lower than the one between GTR1 and TOR1. Nevertheless, the GIT-score still identifies TOR2 as a target of caffeine through the GIT1st-scores of TOR2’s genetic interaction neighbors (e.g, GTR1, TOR1 and AVO1). Fig 4.7 not only elucidates how the GIT-score identifies the targets of caffeine, but also

80 reveals the underlying MoA of caffeine. We can see that there are three major functional complexes that interact with caffeine. Both the TOR1 complex and the TOR2 complex are established cellular components that are affected by caffeine [230, 231]. GSE complex, along with PIB2 and YCL062W, are all associated with vacuolar membranes which play important roles in the TOR2 complex [232].

GIT discovers novel compound-target interactions

We then investigated whether those novel compound-target interactions that are discov- ered by the GIT-score can be supported by existing literature. The output of GIT is a score for each compound-gene pairs. The top ranking genes are the potential targets of each drug. According to the average number of targets of each compound, we proposed to use an empirical p-value 0.001 as the cut-off values of significant compound-target interactions. Here, for each compound, we examined the top five genes that were predicted to be poten- tial targets by the GIT-score. We listed the novel compound-target interactions that are supported by literature in Table 4.5. To show the advantage of using the genetic interaction network, we only listed targets that cannot be identified by the FD-score. For example, the GIT-score successfully identifies the interaction between 5-fluorouracil and SSF1. This interaction is verified by a haploid yeast knockout strains screen [233]. In contrast to the FD-score which fails to identify SSF1, the GIT-score identifies this interaction according to the high FD-scores of SSF1’s positive genetic interaction neighbors (YLR407W, TSL26) and the low FD-scores of SSF1’s negative genetic interaction neighbors (RRP6, YMR268W-A). The GIT-score also identifies POL32 as the target of camptothecin, which is verified by a recent cross-species chemical genomics profiling [234]. Again, the FD-score of POL32 in the presence of camptothecin is not significantly low (-0.38 in the HIP assay and -1.38 in the HOP assay). The GIT-score identifies POL32’s interaction with camptothecin through the high FD-scores of its positive genetic interaction neighbors (VPS39, BTS1 and RPL13A) and the low FD-scores of its negative genetic interaction neighbors (YCL060C, XRS1 and RTT110).

GIT groups genes into co-functional gene complexes

In addition to understanding of compound’s MoA, we studied whether the GIT-score can be used to identify co-functional gene complexes. We used k-means to cluster yeast genes into 100 different clusters based on their GIT-scores in the presence of different compounds. For each cluster, we used Fishers exact test to test whether it was enriched with the annotated

81 Table 4.5: Novel compound-target interactions identified by GIT. We listed compound-target interactions that are identified by the GIT-score but are currently not in any curated database. All these interactions are supported by literature.

Compound Target PMID 5-fluorouracil SSF1 18314501 Bafilomycin A1 DOR1 22470510 Caffeine SLT26 16729036, 16738548 Camptothecin POL32 21179023 Curcumin SEC37 21908599 genes of at least one Gene Ontology term. We obtained Gene Ontology annotations from BioGRID V3.4 [4]. We compared the clustering performance of using the GIT-score with using the FD-score from the HIP assay and using the FD-score from the HOP assay on three Gene Ontology categories in Fig 4.8. We can see that the GIT-score discovers more established complexes than the FD-score. For example, 93 of the 100 clusters identified by the GIT-score are significantly enriched with at least one cellular component function by using a false discovery rate of 0.005. In contrast, the FD-score from the HOP(HIP) assay only identifies 75(65) clusters that are significantly enriched with at least one cellular component function. In addition, the GIT-score also exclusively identifies many important cellular component complexes such as pore complex, , centromeric region, and microtubule.

4.2.5 Conclusion

Here we have reported the discovery that, through the use of prior knowledge captured in the genetic interaction network, compound-target interactions can be identified more accu- rately on chemical genomic screens. Our method identifies many compound-target interac- tions that comprise existing curated database as well as novel compound-target interactions that are supported by literature evidence. Due to its ability in modeling the genetic in- teraction among genes, we can better understand the mechanism of action of compounds, which may provide new insight into drug discovery and drug repositioning. Historically, genetic interaction profiles have been integrated with chemical genomic screens to identify compound-target interactions via the Pearson correlation coefficient [212] due to the inherent similarity between genetic perturbation and chemical perturbation. Our study is different from these previous works in that different local network topology features are taken into consideration in HIP and HOP assays. To the best of our knowledge, this is the first time that HIP and HOP assays are used differently to decipher compound-target interactions.

82 Figure 4.8: GIT identifies co-functional gene complexes. The y-axis shows the number of co-functional gene complexes that can be identified by the GIT-score, the FD-score from the HIP assay, and the FD-score from the HOP assay by using false discovery rate r, where r is shown in the x-axis. (A) shows the biological process category. (B) shows the cellular component category.

One might consider at least three potential reasons for the good performance of GIT. First, existing high-throughput chemical genomic screens might be noisy. GIT is more robust to the noise by using genetic interaction neighbors’ FD-scores to assist the inference of drug targets. Second, HOP assay and HIP assay are fundamentally different biological assays, thus prioritizing different sets of genes. We use direct neighbors’ FD-scores to identify compound- target interactions in the HIP assay, whereas we consider two-hop neighbors’ FD-scores to capture the drug target pathway buffer effect in the HOP assay. Moreover, combining predictions from the HIP assay and the HOP assay further makes GIT more robust. Finally, the genetic interaction network reflects the consequence of perturbing gene function and uncovers broader relationships between diverse functional modules, thus provides functional information that is largely invisible to physical interactions. One interesting observation is that GIT achieves a substantial improvement in comparison to the ρ-score. Both the ρ-score and the GIT-score use genetic interactions to assist target identification. However, the ρ-score prioritizes genes according to the Pearson correlation between one genes chemical genomic profile and its genetic interaction profile. Consequently, it is sensitive to the noise in SGA and chemical genomic screens. In contrast, GIT scores a gene according to the dot product between its neighbors FD scores and their genetic in-

83 teraction edge weights, making it more robust to the noise. More importantly, in HOP assays, unlike the ρ-score which only considers one genes one-hop genetic interaction neigh- bors, we also consider its two-hop genetic interaction neighbors. Our observation that the GITHOP-score (two-hop) has much better performance than the GITHIP-score (one-hop) and the ρ-score (one-hop) in HOP assays demonstrates the promising of considering two-hop genetic interaction neighbors in the HOP assay. Finally, we see many opportunities to improve upon the basic concept of GIT in future work. First, although the current GIT framework is developed in an unsupervised fashion, the GIT-score can be used as the feature and plugged into off-the-shelf machine learning classifier for compound target identification on chemical genomic screens. Second, although this study focused on yeast chemical genomic assays, the GIT method is broadly applicable to any drug perturbation screens on other species [166]. Finally, current genetic interaction network is still noisy, whereas GIT can be potentially used to predict the genetic interaction given the compound-target interaction. For example, one gene with a low FD-score might have a negative genetic interaction with the drug target. In comparison to model organisms such as yeast and worm, high-throughput measuring genetic interactions in human is inher- ently difficult due to the lower efficiency of genetic engineering and the absence of resources like the yeast knockout collection. With available large-scale drug perturbation screens [166] in human, GIT offers the intriguing opportunity to explore genetic interactions in human.

84 CHAPTER 5: KNOWLEDGE NETWORK ASSISTED CLINICAL DECISION MAKING

In this Chapter, I will show how knowledge networks can be used to support clinical decision making. I will first propose a novel topic model which is developed to construct knowledge networks from noisy text data. I will then introduce how we use a medical knowl- edge graph to support clinical decision making by analyzing large-scale electronic medical records. Using our network embedding approach, we obtain promising results in many impor- tant clinical decision support tasks, including patient categorization, patient visualization, and survival prediction.

5.1 CONSTRUCTING A SYMPTOM, DISEASE, AND DRUG NETWORK FROM CLINICAL DATA

Recently, electronic medical records (EMRs) have been widely adopted in United States. These EMRs contain invaluable clinical notes that record medical events. However, these free text notes pose ineligible challenges for data mining algorithms to understand and mine these unstructured datasets. Instead of evaluating our approach on western medicine electronic medical record collection, I evaluated my approach on a large traditional Chinese medicine (TCM) medical record collection. I chose to study this dataset because mining TCM data is more challenging as each patient is often prescribed with more than 10 drugs, thus making it difficult to identify the drug symptom relations properly. Consequently, if we can discover symptoms, disease, and drug relations on this dataset, our method would be likely to work well for EMR datasets in other domains. Traditional Chinese medicine (TCM), a style of medicine widely used in China for thou- sands of years, has been used as a complementary and alternative medicine (CAM) [235] and is especially effective for certain diseases such as stomach ailments [236]. According to the 2007 National Health Interview Survey (NHIS) [237], which included a comprehensive survey on the use of complementary health approaches by Americans, an estimated 3.1 million U.S. adults had used TCM treatment in the previous year. Traditional Chinese medicine has also been widely used in Australia, which has an estimated 2.1 million active TCM practitioners each year [238]. In TCM clinical notes, doctors record symptoms that they observe in their patients. The symptoms are captured and described in information entities such as “night sweats” and “night fevers.” The doctors then prescribe herbs, which is the corresponding drugs in TCM, based on the patient’s disease profile. This personalized medical process is then detailed in

85 Figure 5.1: A toy example of symptom separation in a medical record. The latent associations are not explicitly stated in the record. the patient’s medical record. Unlike modern medicine, TCM is flexible with prescriptions, assigning multiple ingredients accordingly to treat a particular patient’s symptoms. Learning how TCM doctors analyze symptoms and arrive at prescriptions can greatly increase our knowledge of medicine. Unfortunately, due to the empirical nature of TCM, key knowledge about disease diag- noses based on TCM-specific symptoms and herb prescriptions is not well-documented. This creates a significant challenge to retain, share, and inherit knowledge of TCM [239]. This challenge is exacerbated by the fact that patients often have comorbid medical conditions and symptoms, which introduces a host of complex, intricately connected factors [240]. In general, medical data is considered to be one of the most difficult domains for data mining [241]. Moreover, because of the personalized treatment principle in TCM, prescribed herbs and observed symptoms are often mixed together in a patient’s medical record. Conse- quently, for patients with multiple diseases, it is difficult to associate each disease with the prescribed herbs and observed symptoms. We refer to this challenge as the mixed pre- scription challenge. Though we focus on TCM medical record data in this paper, mixed

86 prescription is also a fundamental challenge in western medicine record analysis when a pa- tient is prescribed a combination of drugs to simultaneously treat several diseases. We show an example of a mixture in which six symptoms are associated with three different diseases (Figure 5.1). Fortunately, these challenges can be addressed by analyzing large-scale TCM clinical pa- tient data. With this vision, the National Data Center of Traditional Chinese Medicine in Beijing, China has been collecting TCM electronic medical records (EMRs) since 2007, integrating them into a clinical data warehouse [242]. The data warehouse has curated over 300,000 clinical cases and will continue to acquire patient records from six hospitals in subsequent years. Each hospital has received roughly 3,000,000 patient visits annually for a decade. This data provides an unprecedented opportunity to apply powerful mining algorithms to discover useful knowledge. Indeed, several interesting studies have already succeeded in detecting herb combinations for disease treatment [243], latent symptom phe- notype regularities [244], and symptom-herb relationships [245]. In pursuit of this vision, we analyze large-scale datasets obtained from the TCM Data Center. We propose a novel probabilistic model proTCM to jointly analyze symptoms, diseases, and herbs and to discover patterns that reveal knowledge about the following:

1. Typical symptom groups associated with disease conditions. This knowledge can help doctors diagnose diseases.

2. Typical herb groups associated with disease conditions. This knowledge can help doctors prescribe herbs.

3. Correlations among symptom groups and herb groups. This knowledge can facilitate personalized prescriptions and reveal new insights into herb effectiveness.

Generative probabilistic models can model an entire dataset as a whole, which optimizes pattern discovery by using all of the available data. Such models have already been suc- cessfully used for text mining [246] and TCM data analysis [247, 248]. However, previous works did not model the asymmetric causal relations among symptoms, diseases, and herbs; the relations were instead inaccurately considered to be symmetric. In this paper, we pro- pose a new probabilistic model that can represent the causal relations asymmetrically to enable more accurate discovery of medical knowledge. In particular, we are interested in disease-specific knowledge. We first introduce a new pattern, which we call a disease profile. It consists of three elements: a disease, a multinomial distribution over symptoms for the disease, and a multinomial distribution over herbs for the disease. For example, the disease

87 profile for chronic gastritis shows that its most common symptoms are “bloating” and “epi- gastric pain and chills,” and its most frequently prescribed herbs are “stir-fried immature bitter orange in bran” and “crow-dipper.” To discover disease profiles from our dataset, we propose a generative probabilistic model with the profiles as model parameters. We can ob- tain maximum likelihood estimates for these profile parameters, which can be interpreted as knowledge discovered from the data. These profiles are not only meaningful by themselves, but are also useful as references to TCM practitioners for recommending prescriptions.

5.1.1 Methods

Specifically, the disease information implicitly provides important connections between symptoms and herbs, effectively reducing the space complexity of all possible matchings. Moreover, explicit incorporation of diseases into the mining process also enables direct dis- covery of symptom-disease and herb-disease associations. Both types of associations have important clinical applications, with the former aiding diagnosis and the latter aiding herb prescription in real-world clinical settings. To mine these associations, we define our computational problem as follows: the input consists of a set of tripes (S,D,H), where S is a subset of all possible symptoms, D is a subset of all possible diseases, and H is a subset of all possible herbs. The output consists of  a set of r disease profiles p1, p2, . . . , pr . Each profile corresponds to a disease d ∈ D, and is characterized by two distributions. One distribution is over symptoms, Pr(s|d), s ∈ S, reflecting the typical symptoms of disease d. The other is over herbs, Pr(h|d), h ∈ H, capturing the typical herbs prescribed to treat disease d. In a medical record, a single patient possesses a set of symptoms, to which a physician attributes to a set of diseases. However, within the patient record, it is unknown whether a given symptom s ∈ S is related to some assigned disease di ∈ D or another assigned disease dj ∈ D (Figure 5.1). Furthermore, a patient record contains a set of prescribed herbs. The disease that each herb is meant to treat is also unknown. This challenge of symptom and herb separation occurs very frequently in our dataset, in which each record has an average of 15 symptoms and 20 herbs. As a result, separating symptoms and herbs is necessary to optimally deduce their corresponding diseases and thus prescribe the appropriate treatments to the patient. Here, we propose to solve the challenge of symptom and herb separation by using an unsupervised generative probabilistic model. The proposed probabilistic model mimics these causal relations: Pr(s|d) models the symptoms of a disease d, while Pr(h|d) models the prescribed herbs. Using a distribution to represent the causal relations captures uncertainty, which is inevitable given the lack of documented, precise medical knowledge in

88 TCM. There are at least two simplifications of these models, dependent upon the assumptions that we make.

1. Multinomial model. In this model, we assume that the symptoms “compete” with each other. If one symptom has an increased probability of appearing, then it decreases the probability of appearance of other symptoms. This assumption allows us to infer which P symptom is most likely to appear, giving us two constraints: s∈S Pr(s|d) = 1 and P h∈H Pr(h|d) = 1. From the perspective of a generative model, we assume that the observed symptoms are generated by sampling symptoms using Pr(s|d). Similarly, the observed herbs are sampled using Pr(h|d).

2. Multi-Bernoulli model. In this model, we assume that symptoms independently appear. Under this assumption, we have two constraints: Pr(s = 0|d) + Pr(s = 1|d) = 1 and Pr(h = 0|d) + Pr(h = 1|d) = 1. Each symptom is generated independently from its symptom distribution. In other words, we flip a biased coin with probability Pr(s|d) to determine whether symptom s appears with disease d. Herbs are generated in the same way, but with Pr(h|d).

Here, we focus on the multinomial model in the hopes of discovering the most prominent symptoms and herbs for each disease, leaving the exploration of the multi-Bernoulli model as future work. We now give a more detailed description of the multinomial model. Intuitively, we model how symptoms and herbs of a patient record are generated from their corresponding set of diseases by using the conditional distribution Pr(s, h|d). In order to characterize this generative model, we introduce another distribution over diseases for patient record t, Pr(d|t). Pr(d|t) decides the primary disease responsible for generating a specific set of symptoms and prescriptions.  To generate a set of symptoms S = s1, . . . , sn , we generate each symptom si ∈ S independently. We first sample a disease with probability Pr(d|t), and then sample from its corresponding symptom model Pr(s|d). We repeat this process n times to generate the n  symptoms. The herbs h1, . . . , hm are generated in a similar way, using Pr(h|d) instead of Pr(s|d). The probability of patient t having symptom s and being prescribed herb h with disease  diagnoses Dt = d1, . . . , dk will then be:

X Pr(s, h|t) = Pr(d|t) Pr(s|d) Pr(h|d) (5.1)

d∈Dt If we view symptoms and herbs as “words,” our proposed model is similar to basic topic models such as PLSA [246], which has been successfully applied to the task of discovering

89 word clusters from noisy text collections. However, an important difference is that in PLSA, each word w ∈ W in document d ∈ D belongs to an underlying cluster (topic). In contrast, we assume that all of the clusters (diseases) are observed. This assumption not only applies to real clinical situations, but also restricts the model and allows us to directly learn disease- specific knowledge. In effect, we are imposing an infinitely strong prior on latent topics such that they are strictly tied to certain diseases. A previous model for a similar dataset assumed latent associations between symptoms and herbs [249]. As a result of this limitation, the learned associations are not ensured to be aligned with particular diseases, despite using diagnosis information. We hypothesize that our model can discover more accurate disease-specific knowledge than this older, related model. The results of our experiments confirm this hypothesis. Because common symptoms and herbs frequently co-occur with many different diseases, they are not discriminative enough to accurately infer a specific disease. To alleviate this problem, we further define a background cluster Pr(θB|t). The background cluster helps generate these noisy symptoms and herbs, allowing the final disease cluster to be more informative. A patient record in the collection can now be generated by the following mixture model: ! X Pr(s, h|t) = Pr(d|t) Pr(s|d) Pr(h|d)

d∈Dt (5.2)

+ Pr(θB|t) Pr(s|θB) Pr(h|θB)

where Pr(s|θB) and Pr(h|θB) are the background multinomial distributions over symptoms and herbs, respectively. By fitting this mixture model to patient records, we can find the multinomial distributions that best explain our data. These distributions are associated with the symptoms and herbs that correspond the most to each disease. Under the definition of Pr(s, h, t), the log-likelihood function of the patient record collection M is:

X X X log Pr(S,H|T ) = I(t, s, h) × log Pr(s, h|t) (5.3) t∈T s∈S h∈H where I(t, s, h) = 1 if patient t has symptom s and was prescribed herb h, 0 otherwise. Our

model has the following parameters: Pr(d|t), Pr(s|d), Pr(h|d), Pr(θB), Pr(θB|t), Pr(s|θB),

and Pr(h|θB). We use an expectation-maximization (EM) algorithm to perform maximum likelihood estimation of this latent variable model. The EM algorithm alternately performs an expectation (E) step and a maximization (M) step, and is guaranteed to converge [250]. It attempts to find the optimal parameters to explain the model. Because the parameter

90 estimation for the background cluster and herb distributions is similar to that of a symptom distribution, we only give the update formula for a symptom distribution. In the E step, we compute the posterior probabilities for Pr(d|s, h, t).

Pr(d|s, h, t) ∝ P n(d)P n(t|d)P n(s|d)P n(h|d) (5.4)

In the M step, we re-estimate the parameters based on the posterior probabilities obtained in the E step.

5.1.2 Clinical Decision Support Applications

Note that our model is completely unsupervised, so it requires neither human annotation nor prior knowledge. After estimating the model parameters from the patient record, we can use the resulting symptom and herb distributions for TCM clinical decision support. We outline three possible applications in the following subsections.

1. Mining Disease-Specific Symptoms and Herbs Mining TCM symptoms and herbs can help doctors diagnose diseases as well as reveal when certain herbs are prescribed. In our model, Pr(s|d) and Pr(h|d) provide distributions over all symptoms and herbs, respectively. Dis- tributions with the highest probabilities are the most relevant to disease d.

2. Symptom-Herb Relationship Discovery Besides mining symptoms and herbs of a given disease, our model can discover highly correlated symptoms and herbs. Here, we define the correlation between symptom s and herb h as:

X corr(s, h) = Pr(s|d) Pr(h|d) (5.5) d∈D

Thus, if symptom s and herb h co-occur frequently in many disease multinomial distribu- tions, they will have a high correlation score. These symptom-herb relationships can help doctors understand the precise usages of each herb.

3. Disease and Treatment Prediction Our model can be extended to build intelligent systems that can predict a patient’s disease based on his or her symptoms. Assuming that all  diseases are equally likely to occur, the probability of a patient with symptoms s1, . . . , sn having disease d is computed as:

 Pr(d| s1, . . . , sn ) ∝ Pr(s1|d) ... Pr(sn|d) Pr(d) (5.6)

91 Similarly, we can build an intelligent prediction system to recommend herbs to a patient  based on his or her symptoms. The probability of a patient with symptoms s1, . . . , sn receiving effective treatment by using herb h is computed as:

 X Pr(h| s1, . . . , sn ) ∝ Pr(s1|d) ... Pr(sn|d) Pr(h|d) Pr(d) (5.7)

d∈Dt

5.1.3 Experimental Results

The main goal of our experiments is to examine whether the proposed model can indeed extract useful knowledge from patient records. While we hypothesize that our method can successfully mine a large number of patient records obtained from multiple physicians, direct experimentation with such a mixed dataset would make it difficult to choose the most suitable physician to assess the data mining results. This is because different physicians tend to have varying opinions due to the differences in their training and experiences. Thus, we decided to evaluate the proposed model on two real-world medical record collections, each of which was obtained from a single physician.

1. Digestive disorder EMR. In this dataset, we collected 10,907 anonymous electronic med- ical records provided by a TCM physician who specializes in digestive disorders. We normalized the symptom and herb terms to obtain 9,530 records with 3,000 symptoms, 97 diseases, and 652 herbs. All disease terminology in this dataset all have analogs in western medicine (i.e., hypertension, chronic gastritis, and constipation).

2. Lung cancer EMR. In this dataset, we collected 187 anonymous electronic medical records provided by a TCM physician who specializes in lung cancer. We normalized the symptom and herb terms to obtain 187 records with 12 symptoms, 12 syndromes, and 318 herbs. This dataset contains only TCM syndromes and no western medicine diseases besides the initial cancer diagnosis. We treated these 12 syndromes collectively as the disease profile in our model and aimed at mining the associated herbs and symptoms accordingly.

We compare our model with with the closely related, previously proposed symmetric gener- ative model [249].

proTCM substantially improves symptom-herb relationship prediction

To quantitatively evaluate our model, we collected a set of symptom-herb relationship annotations manually curated by TCM experts. These annotations contain 27,285 symptom-

92 Figure 5.2: Comparison of our method and frequent pattern mining for identifying symptom-herb relationships herb relationships. The symptoms and herbs of only 1,624 of these relationships appear in our dataset. Consequently, we evaluate how our method can accurately identify this subset of 1,624 relationships. We use AUROC, one of the most widely used metrics, to evaluate our method. We compare our method with frequent pattern mining (FPM), which is extensively used in pattern mining [251]. FPM ranks symptom-herb pairs by their frequencies in the collection. Consequently, it identifies frequently co-occurring symptom-herb pairs. However, FPM fails to identify symptom-herb pairs that do not frequently co-occur in the medical records. Moreover, FPM biases towards highly frequent herbs (or symptoms) as they often co-occur with many other symptoms (or herbs) in the medical record. In contrast to FPM, our method infers the associated herbs (or symptoms) among a mixed of herbs (or symptoms) by simultaneously modeling the patient record of all patients. It can also discover symptom-herb pairs that do not frequently co-occur in the dataset. Our method significantly outperforms FPM with a p-value < 4.45 × 10−8 (Figure 5.2). Our method achieves an AUROC of 0.7721, which is much higher than FPM’s AUROC of 0.6938. Notably, we found that FPM does not make accurate predictions when the false positive rate exceeds 0.2. This is because FPM requires an herb and a symptom to appear in the same transaction to constitute an symptom-herb relationship, a restriction that does not affect our model.

93 Table 5.1: Comparison between our model and frequent pattern mining in treatment prediction and disease prediction

Treatment prediction Disease prediction F1 Precision Recall F1 Precision Recall proTCM 0.2504 0.1518 0.738 0.5813 0.7281 0.6084 FPM 0.2456 0.1489 0.7250 0.5266 0.7277 0.5130 proTCM improves disease and treatment prediction

Our model can be further extended to build an intelligent system that can predict diseases and recommend herbs as references for physicians or patients. To examine the potential of this direction, we use two-thirds of the patient record collection as training data and the remaining one-third as test data. The test set and training set are disjoint. This simulates a realistic scenario in which the system learns from the patient record history to provide support for the treatment of new patients.

We employ three standard metrics commonly used to evaluate prediction tasks: F1 score, precision, and recall. We take the macro-average of these three metrics (Table 5.1). We note that disease prediction based on symptoms achieves a reasonably high F1 score, precision, and recall (0.58, 0.73, and 0.61, respectively). Herb prediction, based on symptoms and diseases, achieves a relatively low F1 score, a much worse precision, but a very high recall (0.25, 0.15, and 0.74, respectively). This suggests that the system can recover the herbs that the physician actually prescribed to a patient, while also recommending many herbs that were not prescribed. We compare our method with frequent pattern mining (FPM), which is one of the most widely used approaches in pattern mining (Table 5.1). We see that our method outperforms the frequent pattern mining method on all metrics for both disease prediction and herb prediction. proTCM recommends quality herb prescriptions for TCM syndromes

The concept of syndromes is unique to the field of TCM. Syndromes are high-level ab- stractions of health conditions and are not specifically associated with any single organism or disease. For example, the moisture syndrome is used to describe patients who have arthralgia and weakness due to a dysfunctional spleen. These TCM syndromes are often diagnosed by the comprehensive analysis of a patient’s overall condition and thus cannot be determined solely by laboratory tests. Syndromes serve as an alternative method of categorizing patients, and have been used as

94 the principle element in TCM diagnosis. However, current understanding of syndromes is incomplete and inconsistent among different TCM physicians. Here, we take a data-driven perspective to prescription recommendation for each syndrome according to the medical records. This knowledge can greatly expand the TCM textbook and help junior doctors bet- ter understand syndromes. Since our dataset only contains 12 unique syndromes, we only took into account their related herb prescriptions. We applied proTCM to the lung cancer dataset and treated each syndrome as a disease in our model. Four TCM physicians manu- ally verified the results and annotated the associations consistent with their knowledge. We also found that most of these associations are consistent with existing knowledge, reflect- ing proTCM’s ability to mine underlying TCM knowledge from medical records. Among the 12 syndromes, the phlegm syndrome and the lung syndrome show promising overall performance. Since this analysis is applied to a lung cancer dataset, the phlegm and lung syndromes appear for many patients. Our model successfully captures the herbs associated with these two syndromes. In addition, our method also obtains good performance on a rela- tively rare syndrome, the moisture syndrome. All five of its identified herbs were verified by TCM doctors. By using the proposed probabilistic model, we successfully mined knowledge of these rare syndromes by solving the mixed prescription challenge in the dataset. Note that the performance of yin deficiency is worse than that of other syndromes. This decreased performance is possibly due to the obscure and diverse properties of this syndrome across different patients. Overall, our method achieves a 0.79 accuracy in identifying the best herb prescriptions for each syndrome.

5.1.4 Related Work

Mining TCM clinical data has steadily gained popularity in the past decade. A number of studies have been devoted to clustering or determining the hidden structures of symptoms. For example, Zhang et al. proposed an unsupervised latent tree model to learn diagnosis structures from TCM datasets with symptom variables [244]. The study showed that natural clusters exist in the datasets and these clusters correspond well to syndrome types. Chen et al. performed a comparative study of five classification methods (e.g. support vector machines, neural networks, decision trees, etc.) for syndrome differentiation in coro- nary heart disease using 1,069 clinical epidemiology survey cases [252]. The results showed that support vector machines (SVM) performed best in the prediction task. For the herb regularities, He et al. proposed an approach for discovering functional groups of herbs from a large set of prescriptions recorded in TCM texts [253]. Guang et al. constructed prescription association networks by mining literature datasets [254].

95 Two studies most similar to our work are of note [247, 249]. Zhang et al. proposed the Symptom-Herb-Diagnosis topic modeling approach to discover the common relationships among symptom-herb combinations and disease diagnoses [247]. In another study, Zhang et al. proposed a topic model to automatically extract the hierarchical latent topic structures with both symptoms and herbs in TCM clinical data [249]. Although these works also use topic model-based approaches to mine TCM data, none of them explicitly model the asymmetric causal relations among symptoms, diseases, and herbs.

5.1.5 Discussion

We presented a new probabilistic model which captures the asymmetric causal relations of symptoms, diseases, and herbs in TCM patient records. Compared to previous models that analyze similar data, ours captures the asymmetry in the causal relations more accu- rately. The model discovers disease-symptom distributions and disease-herb distributions from medical records in a completely unsupervised manner. Four experienced physicians judged our experimental results to be meaningful, suggest- ing that the proposed method is effective in discovering latent knowledge in TCM medical records. Such knowledge can be very useful as references, especially to inexperienced doc- tors in clinical applications. The derived symptom-herb associations are also meaningful and provide useful insights and knowledge that can help personalize prescriptions according to the patients’ individualized symptoms. Moreover, syndromes, which are the main TCM descriptors in clinical practice, are inferred based on patient symptoms as well as laboratory tests. Consequently, a thorough understanding of TCM syndromes facilitates future work in generalizing the proposed probabilistic model for TCM clinical knowledge discovery. Our current model is general and completely unsupervised, so it can be used to analyze large amounts of medical records collected by the TCM Data Center. This will be a major future direction for us to explore. Another important future work is to perform natural language processing on the symptom descriptions to better understand the effectiveness of prescriptions. This information, coupled with the proposed model in this paper, will further enable us to distinguish groups of herbs that effectively treat a disease from ineffective ones, directly advancing our understanding of treatments.

96 5.2 MINING ADVERSE DRUG REACTIONS NETWORK FROM ONLINE HEALTH FORUMS

Automatic discovery of medical knowledge using data mining has huge benefits in im- proving population health and reducing healthcare cost. Discovering adverse drug reaction (ADR) is especially important because of the huge damage of ADR to patients. Recently, more and more patients describe the ADRs they experienced and seek for help through online health forums, creating great opportunities for mining health forums to discover previously unknown ADRs. Here, I used the proposed generative model to tap into the increasingly available health forums to mine the side effect symptoms of drugs mentioned by forum users. Existing approaches to mine ADRs from health forums are all based on supervised learning and thus require manually created labeled data, which is expensive to create. They also can- not distinguish side effect symptoms from treated symptoms. Our approach overcomes both deficiencies of previous work with a novel probabilistic mixture model of symptoms, where the side effect symptoms and treated symptoms are explicitly modeled with two separate component models, and discovery of side effect symptoms can be achieved in an unsuper- vised way through fitting the mixture model to the forum data. Extensive experiments on online health forums demonstrate that our proposed model is effective for discovering the re- ported ADRs on forums in a completely unsupervised way. The mined knowledge using our model is directly useful for increasing our understanding of long-term side effects, drug-drug interactions, and rare side effects. Since our approach is unsupervised, it can be applied to mine large amounts of growing forum data to discover new knowledge about ADR, helping many patients understand possible ADRs that they should be aware of.

5.2.1 Experimental Results

In this subsection, we perform extensive experiments on real-world online health forum to evaluate the effectiveness of our model. We perform both qualitative evaluation and quantitative evaluation. In the qualitative evaluation, we show the top side effect symptoms we have mined. In the quantitative evaluation, we compare our model with baselines on two ground truth labeling sets. We further use our model to mine some more challenging ADRs, such as long-term side effects, drug-drug interactions and rare side effects. Finally, we give a detailed comparison between the side effect symptoms we have mined and the side effect symptoms reported in the FDA database. We show that our model has mined some suspicious ADRs which have not caught enough attention by FDA.

97 Data Collection and Preprocessing

We crawl data from a real-world online health forum HealthBoards (www.healthboards. com). HealthBoards is an online health forum that allows patients to discuss their condi- tions. It has been rated as one of the top 20 health information communities, with over 10 million monthly visitors, 850,000 registered members and over 4.5 million messages posted. We crawl a large text collection of 330,305 threads. We then extract 886 threads from a board called “Drug Interation/Side Effects.” Existing NLP tools perform reasonable well when we extract drugs from the raw text. We use Metamap [255] to extract 287 drugs from all the 886 documents. Each thread has on average 336 words. As we have mentioned before, each thread may describe more than one drug. In our data, we find an average of 3.18 drugs per thread. We filter the stopwords in the text and use a large medical phrase dictionary to extract all the medical related phrases. We also remove the high-frequency and low-frequency phrases. In total, we build a dictionary with 2107 phrases. We use the large corpus of 330,305 threads to build the phrases distribution of the back- ground topic. We also use these 330,305 threads to build a unigram model as a prior for the disease symptom topic of each kind of drug. The disease symptom topic would be further updated by the training collection.

Qualitative Evaluation

We first show the side effect symptom topics discovered from the health forum using Side- EffectPTM. On one hand, some drugs are related to common diseases such as depression so that they appear frequently in the corpus. On the other hand, some drugs appear infre- quently in the corpus because they are used to treat some uncommon diseases. Since the online health forum is a mixture of drugs with different frequencies, it is important for our model to handle all of them. Therefore, we show the results of drugs with high-frequency, middle-frequency and low-frequency. We list each drug with the medical phrases that have

the highest P (w|Rm). From Table 5.2, we see that SideEffectPTM mines side effect symptom topics success- fully for drugs with different frequencies. Most of the mined top symptoms are side effect symptoms instead of the disease symptoms. This proves our model’s ability to distinguish side effect symptoms from disease symptoms by explicitly modeling two parts separately. The extracted symptoms are also general enough to cover a wide range of symptoms from common symptoms, such as headaches and weight gain, to rare but severe symptoms, such as kidney stones and diabetes. Specifically, for the anti-depressant drug Wellbutrin(R), our

98 Table 5.2: Top-15 Phrases of Side Effect Symptom Topic

Drug(Freq) Drug Use Symptoms in Descending Order Zoloft(R)(84) antidepressant weight gain, weight, depression, side effects, mgs, gain weight, anxiety, nausea, head, brain, pregnancy, pregnant, headaches, depressed, tired Wellbutrin(R)(48) antidepressant wellbutrin, Wellbutrin, seizures, depression, seizure, sleep, mgs, weight loss, period, enery crashes, pharmaceutical, high doses, dosages, crash, depressed Ativan(R)(33) anxiety disorders ativan, sleep,seroquel, doc prescribed seroquel, raising blood sugar levels, anti-psychotic drug, diabetic, constipation, diabetes, 10mg, benzo, addicted, Ativan, plans, vertigo Topamax(R)(20) anticonvulsant Topamax, liver, side effects, migraines, headaches, weight, topamax, pdoc, neurologist, supplement, sleep, fatigue, seizures, liver problems, kidney stones Zocor(R)(3) lipid-lowering agent small pill, membranes, adverse reactions, muscle cramps, peripheral neuropathy, memory problems, memory loss, heart palpitations, healthy diet, reactions, vomiting, tablets, anemia, dizziness, dizzy Ephedrine(2) stimulant dizziness, stomach, benadryl, dizzy, tired, lethargic, tapering, tremors, panic attack, head, pshaw, advil cold, stomach symptoms, cold sweats, bpm

99 Table 5.3: Comparison of Side Effect Symptom Topic and disease symptom Topic

Seroquel(R) Trazodone(R) Rm Tm Rm Tm seroquel seroquel sleep sleep sleep sleep side effects anxiety diabetes depression trazadone depression side effects anxiety trazodone trazodone seroquel pdoc dry mouth fibro doc prescribed seroquel bipolar tired side effects raising blood sugar levels side effects anti-depressant tired anti-psychotic drug diagnosed christina diagnosed ambition depressed blood sugars pdoc mood swings lithium heart racing head diabetic mania muscle twitches insomnia constipation weight blood tests depressed blood work seroquel blood pain meds weight lamictal pill weight model extracts seizures as the most likely side effect. Seizures are known as one of the most important ADRs of Wellbutrin(R) in medical literature [256]. Ativan(R) is used to treat anxiety disorders and is reported to cause high blood sugar in 1.16% of patients who have side effects when taking Ativan(R) [257]. Ephedrine only appears in two threads in our data set. However, our model still successfully captures its side effect such as dizziness, tired, tremors, headaches, panic attack and lethargy. These side effects are all verified by the medical literature [258]. The side effect results of Topamax(R) contain several severe symptoms such as liver problems and kidney stones. Recent study shows that treatment with Topamax(R) has been proved to increase the propensity to form calcium phosphate stones [259]. This proves that our model is able to not only mine the common side effects (e.g, headache, weight loss), but also find uncommon severe side effects. To further demonstrate that our model can separate side effect symptom topics from disease symptom topics, we show both topics of three different drugs in Table 5.3. From Table 5.3, we can naturally guess that Seroquel(R) is used to treat depression from the top phrases in its disease symptom topic. On the other hand, in its side effect symptom topic, only the word “anti-psychotic drug” is related to depression. Trazodone(R) is used to treat depression and insomnia. Its disease symptom topic matches this usage with phrases such as “insomnia”, “sleep” and “depression”. Its side effect symptom topic shows its side effect such as “dry mouth” and “high blood sugars”. Besides these drugs, our model can separate the side effect symptoms from the disease symptoms for most of the drugs and achieve an

100 Table 5.4: Top-10 Most Reported Side Effect Symptoms

Symtpoms #Reported Symtpoms #Reported anxiety 93 depression 79 sleep 74 weight 45 depressed 40 headaches 39 stomach 36 tired 21 sick 19 heart 17 average cosine similarity of 0.00567. Such a low cosine similarity indicates a large difference between treated topic and side effect topic. We also extract the most reported side effect symptoms from our experimental results to study the common symptoms. Table 3 is the top 10 most reported side effect symptoms from our experiment results. Most of these symptoms are easy to observe (e.g., weight gain, headaches) or occur frequently in everyone’s daily life (e.g., depression, anxiety). A previous study also discovered this phenomena and explained that people are more conscious of issues that they can directly observe [260]. We think that patients may connect these symptoms with the drug they are using incorrectly because of the anxiety when they are searching for medical information on the internet [261]. At the same time, patients are not able to detect symptoms that can only be observed with the help of physicians or medical devices, such as liver problem or hypertension. These rare unobserved symptoms may be the actual side effect symptoms which cause those common observed symptoms. For example, a liver problem may cause an eating disorder and weight loss. So if a patient reports weight loss and eating disorder, we should also suspect the patient of having a liver problem. Therefore, it is important to mine the potential rare side effects which are hidden behind these common side effects.

Quantitative Evaluation

The main goal of the quantitative evaluation is to see how accurately our model is for mining the side effects. Our proposed model is completely unsupervised which does not require any labeled data for training. To perform quantitative evaluation, we construct the following two ground truth sets.

• Human Annotation : We first build a human annotation ground truth set. Two Ph.D. students manually extracted the side effects appearing in that thread. Due to the limited manpower, we only labeled threads for four drugs including Zoloft(R), Wellbutrin(R), Tylenol(R) and Topamax(R). These four drugs appear frequently in our collection so that

101 they can provide us more convincible evaluation results. We labeled 96 threads in total. According to the labeling results, each thread contains 9 side effect symptoms on average.

• FDA Database : We use the FDA database as the ground truth set. The FDA database doesn’t always use the same terms as in the health forum, thus there is a mismatch between professional medical term and the plain English term (e.g., dyspnoea) used by patients (e.g., difficulty breathing) [262]. We address this problem by simply doing keyword-matching between mined side effect phrases with those reported side effects to FDA database. Although this is not entirely reliable, it allows us to evaluate our method with many more drug instances. Besides, such a simple matching strategy unlikely in- troduces any systematic bias, thus it is still useful for performing relative comparisons between different methods.

There is no existing unsupervised approach which mines ADRs from online health forums. We use the following two baselines for comparison.

• Basic SideEffectPTM (Basic) : This approach does not incorporate prior in our Side- EffectPTM model.

• PhraseMatch (PM) : This is the model similar to the approach in [260]. Although they focus on short messages which are less noisy and shorter, we can still apply it to forum text. They assign a score to each phrase in the document according to its similarity to a symptom phrase. We used our dictionary to make this approach match our corpus better. This model is unable to separate side effect symptoms from disease symptoms.

We want to use these two baselines to answer two questions. First, is the prior of disease symptom topic beneficial to our model? We compare Basic SideEffectPTM with our model to answer this question. Second, is it necessary to explicitly build two language models for both side effect symptoms and disease symptoms? We compare our model with PhraseMatch to answer this question. Since our method would be able to nominate candidate side effects of each drug for human experts to further verify, the task is essentially an ADR retrieval task. Therefore, we can use standard retrieval measures to evaluate our method. We use the following information retrieval measures: Precision, Recall and Mean Average Precision (MAP). From Table 5.5, we see that our method outperforms all the baselines on all evaluations on both ground truth labeling sets. Our method improves Basic SideEffectPTM by 212% in Precision@3 and 150% in Precision@10 based on the Human Annotation ground truth. This demonstrates that the prior of disease symptom topics can greatly improve the performance when we mine ADRs.

102 Table 5.5: Compare our model with baselines. * represents the improvement of our model over the corresponding baseline is statistically significant.

Human Annotation FDA Databse Basic PM Our Basic PM Our Prec@3 0.1667 0.1667 0.5000 0.1188 0.0991* 0.1282 Recall@3 0.0032 0.0032 0.0122 0.0030 0.0017* 0.0039 MAP@3 0.0032 0.0016 0.0072 0.0015* 0.0013* 0.0020 Prec@10 0.2000 0.2500 0.5000 0.1192 0.0951* 0.1262 Recall@10 0.0191 0.0260 0.0497 0.0103 0.0084* 0.0137 MAP@10 0.0083 0.0058* 0.0280 0.0039 0.0027* 0.0046

Our method improves PhraseMatch by 212% in Precision@3 and 100% in Precision@10. This proves that explicitly modeling side effect symptoms and disease symptoms separately can make the extracted ADRs more accurate. We also observe that the improvement on FDA database results is less significant than the improvement on human annotation ground truth. Although there are mismatches between the terms in the FDA database and health forum, this may indicate that some of our extracted side effects from the health forum haven’t been approved by the FDA. Therefore, we further compare our experiment results with FDA database and find some previously unknown side effects that haven’t been included by FDA in subsection 5.2.1.

Mining Challenging Adverse Drug Reactions

Besides normal side effects, we also apply our generative model to mining three kinds of more challenging side effects. From the final top-10 extracted long-term side effect candidates, we detect three potential long-term side effects and show them in Table 5. All of these symptoms have been reported

Table 5.6: Long-term Side Effects

Drug Symptom Hydroxycoumarin bleeding Cipro(TM) abdominal pain Phentermine bone and joint pain as potential long-term side effects in the medical literatures or in surveys. Phentermine has been reported to cause bone and joint pain after using it for 9 month on average[263]. Below is a patient’s thread about the side effect of Cipro(TM) to him.

103 Figure 5.3: Patient’s Thread Reports the Long-term Side Effect of Cipro(TM)

Table 6 shows the top extracted drug-drug interactions we have mined through our model. Most of these drug pairs have been reported to be dangerous when taken together. Methadone and Xanax(R) are reported to have potentially life-threatening effects if taken together [264]. A recent review of Methamphetamine suggests that Zoloft(R) might not be effective in the treatment of methamphetamine addiction and might even worsen the condi- tion [265]. Valium(TM) has been reported as one of the major interactions with Zoloft(R) and the level of medication in the bloodstream should be monitored closely when taken them together[266].

Table 5.7: Drug-drug Interaction

Drug1 Drug2 Drug1 Drug2 Xanax(R) Zoloft(R) Methadone Xanax(R) Valium(TM) Zoloft(R) OxyContin(R) Methadone Wellbutrin(R) Celexa(TM) Methamphetamine Zoloft(R) Levaquin(R) Avelox(R) Lamictal(R) Concerta(R)

Table 7 is the extracted rare side effects. These top extracted drug-symptom pairs reveal some potential rare side effects. For example, Prozac(R) has been reported to have the sexual side effect, especially ejaculation problems [267]. The anti-inflammatory drug Solaraze(R) has been reported to cause rare but significant cases of serious hepatotoxicity in medical literature [268].

Discovering Unknown ADRs

To further demonstrate the effectiveness of our model and prove that health forums are good information sources to collect ADRs, we compare the top extracted ADRs with the FDA database. Since there is a mismatch between phrases in two corpus (e.g., “difficulty

104 Table 5.8: Rare Side Effects

Drug Symptom Prozac(R) ejaculate Warfarin(TM) leg weakness Solaraze(R) liver enzymes Benicar(R) difficulty breathing breathing” in forum, “dyspnoea” in FDA database) [262], we manually check the extracted ADRs. Table 8 is the potential ADRs listed in our experiment results but not listed in the FDA database. Prozac(R) has been reported to have “sexually inappropriate behaviour” in FDA database. Our method further narrows this symptom to “ejaculate problem” which has been proved in the medical literature [267]. Ativan(R) doesn’t have symptoms related to “blood sugar” or “diabetes” in FDA database. We have discussed its ADRs in subsection 5.2.1. Although there are some existing medical literatures or surveys demonstrate these ADRs, we still need more professional medical tests to help patients understand the possible danger that they should be aware of.

Table 5.9: Discovering Unreported ADRs

Drug Symptom Prozac(R) ejaculate Ativan(R) raising blood sugar levels, diabetic, constipation Adderall(TM) lost weight

We notice that the FDA database often covers more ADRs than the ADRs mined from health forums by using our proposed model. However, our approach still has the following advantages. First, our model could successfully mine some previously unknown ADRs that haven’t been collected by FDA database. Second, although we return fewer ADRs, we still maintain results of high quality and accuracy based on both qualitative and quantitative evaluations. In addition, our model can mine ADRs for new drugs more efficiently than the FDA database and link the ADRs to the corresponding documents.

5.2.2 Conclusion and Future Work

We use the abundant text information from online health forum to mine adverse drug reactions. Our proposed SideEffectPTM is the first approach that separates side effect

105 symptoms from disease symptoms when mining ADRs. We successfully mine many mean- ingful ADRs from real-world online health forum. Some of them have never been reported in FDA database. Furthermore, we also mine some more challenging side effects such as long-term side effects, rare side effects and drug-drug interactions, which are generally very hard to detect. The voluntary participation of huge amounts of patients and the rapid growth of online health forums will bring us more and more forum text data. Our method can be continuously applied to these growing data to discover new ADRs reported and link the extracted ADRs to the corresponding documents. Our approach can be further improved by addressing the data quality problem of the forums, such as misspelled words, different forms of the same word and unrelated information being brought up in a single thread.

5.3 INCORPORATING KNOWLEDGE NETWORKS IN ELECTRONIC MEDICAL RECORDS ANALYSIS

With the increasing adopting of electronic medical record systems in United States, there is an urgent need to analyze the massively generating electronic medical record (EMR) data. To support clinical decision making, we have developed novel data mining algorithms for important clinical applications such as patient clustering [83], EMR visualization [82], similar patient records retrieval [81], and survival analysis [80]. To address the noisy and missing value challenges in mining EMR data, we use a network-based approach to integrate external knowledge from genomics data with electronic medical records. Our methods have been demonstrated to be effective in tens of thousands of patient records.

5.3.1 Method

We propose to create the heterogeneous medical record network (HEMnet) to address the challenges of EMR analysis. The basic idea is to leverage information from several external sources in order to supplement the knowledge contained in clinical data.

Definition of HEMnet

The HEMnet is a network that consists of nodes and edges associated with different types of information. Formally, we define the HEMnet as a graph G = (V,E,R), where V is the set of typed nodes (i.e., each node belongs to a specified type), E is the set of typed edges, and

106 R is the set of edge types. An edge e ∈ E in the HEMnet is an ordered triplet e = {u, v, r}, where u, v ∈ V are typed nodes and r ∈ R is the corresponding edge type. The network in this study utilizes four distinct categories of edges. The first three cat- egories are drawn from external databases, while the last category draws directly from the EMR database.

(1) Protein-protein interaction network: this network is based on an external gene network of protein-encoding genes, called HumanNet [269]. For a functional linkage

between two proteins p1 and p2, we create a node for p1, a node for p2, and an undirected

edge {p1, p2} in the HEMnet. This is the molecular interaction network.

(2) Drug targets: this database contains drugs and the proteins they target, curated from expert knowledge and medical literature. We create a node h, a node p, and an undirected edge {h, p} if a drug h targets a protein p. This is domain knowledge.

(3) Drug-symptom dictionary: this database is a TCM textbook consisting of drugs and the symptoms they treat, also curated from expert knowledge and medical literature. We create a node h, a node s, and an undirected edge {h, s} if a drug h treats a symptom s in the dictionary. This is domain knowledge.

(4) Electronic medical records: we directly add co-occurrence edges from each medical record. For example, if a patient is diagnosed with cancer type c and prescribed a drug d, then we create a node c, a node d, and an undirected edge {c, d}. We repeat this for all elements in each patient’s medical record.

All edges are unweighted and undirected. Because we only use 133 sparse patient records, the external information is critical in discovering relationships among EMR entities that may otherwise be completely hidden. Overall, |V | = 11, 911, |E| = 379, 715, and |R| = 23. It is important to note that there are only 449 actual features in the EMRs, and that the remaining nodes are proteins that are only used in the HEMnet.

Feature Matrix Enrichment

After generating low-dimensional vector representations of nodes in the HEMnet using ProsNet, we can tackle the problem of missing data and semantic mismatches in the patient records. The patient record matrix M is a 133×449 feature matrix, since we have 133 patients and 449 features. Each row of M consists of a patient record with binary, categorical, discrete,

107 Molecular Interaction Network Electronic Medical Records Clustered Medical Records

patient1:

drugb, 10mg patient4 patient1 testc patient2: age 60 patient3 bone metastasis patient2

Clustering

z gene drugb 1 Network gene2 gene2 druga treats symptomx Embedding test drugb targets gene1 gene1 c drugb targets gene2 y drugb testc x Domain Knowledge Node Vector Space HEMnet Figure 5.4: The pipeline for the HEMnet, our proposed heterogeneous information network consisting of electronic medical records, molecular interaction networks, and domain knowledge. We train network embedding vectors on the HEMnet, obtaining low-dimensional vector representations for each node. With these vectors, we enrich the original medical records and cluster them into two groups.

or continuous column values, depending on the corresponding feature. For example, drugs may be continuous features because they are recorded as dosages, while metastasis sites are categorical features that may indicate multiple locations in the body. In this study, we binarize the categorical features. In order to impute missing features, we construct a similarity matrix, S. S is a 449 × 449 matrix such that Sij is the absolute value of the cosine similarity between feature i and feature

j’s embedding vectors. Furthermore, we set a threshold such that any entry Sij is set to 0 if Sij is less than the threshold. We empirically set this threshold to 0.3. After creating S, we can generate the enriched patient record matrix M 0 with the following operation:

M 0 = M × S (5.8)

With this operation, we can simultaneously solve the problems of incomplete data and semantic mismatches by filling in missing values that are highly similar to existing features for each patient. For example, a patient that has “halitosis” in his or her patient record will receive a non-zero value for “bad breath”, as these two symptoms are likely to have very similar embedding vectors.

5.3.2 Patient stratification and survival analysis

clustering squamous-cell carcinoma patients into two sets using HEMnet-enriched feature matrices, clong has 28 patients and cshort has 15 patients. By the log-rank test, clong’s survival

108 Table 5.10: SCC Patient Survival Time Statistics with HEMnet

Cluster Restricted Restricted Mean Median Size Mean Standard Error

clong 28 20.7 2.01 22.7 cshort 15 11.0 1.60 11.0

2 function is significantly better than cshort’s with a χ statistic of 10.79 (p-value = 0.001020). We show the cluster summary statistics in Table 5.10, with u = 33.3.

In contrast, when using the baseline feature matrix, clong has 25 patients and cshort has 18 patients. Here, the survival functions are not significantly different at the 1% significance level, with a χ2 statistic of 6.214 (p-value = 0.01267).

When using feature matrices with mean imputation, clong has 25 patients and cshort has 18 patients. Here, the survival functions are not significantly different with a χ2 statistic of 2.489 (p-value = 0.1148). Of all significant features, only “Karnofsky performance status” has a higher mean in

clong than in cshort. This is expected, as a higher KPS score indicates a relatively healthier

cancer patient [270]. All other significant features have lower means in clong than in cshort. This makes sense for symptom features, as patients with longer survival rates should be in better health. On the other hand, it might seem, at first, as if the higher values of drugs in cshort indicate their ineffectiveness. However, patients with more life-threatening conditions require higher dosages of treatments. The baseline method did not find sputum and cough to be symptoms that are statistically different between clong and cshort, while our method did. However, they are very common symptoms in lung cancer patients that tend to be more severe in patients with lower survival rates. Furthermore, the other three features that our method discovered are all drugs. We can interpret these features’ higher values in cshort as potential drug-symptom relationships. In the drug-symptom dictionary, sulphurweed is known to treat both cough and sputum, while umbrella polypore only treats urinary tract-related symptoms and A. tataricus simply does not appear. Despite this, a study showed that a naturally occurring compound derived from umbrella polypore mycelia induces apoptosis in human lung cancer cells [271]. Fur- thermore, a recent study showed that a polysaccharide isolated from A. tataricus inhibits the growth of cancer cells [272]. Our method was able to capture these meaningful drug-symptom relationships while the baseline methods could not. This is due to the fact that the HEMnet integrates external drug information in the form of drug-protein targets and protein-protein interaction information.

109 5.3.3 Electronic medical records retrieval

We present a study of electronic medical record (EMR) retrieval that simulates situations in which a doctor treats a new patient. Given a query consisting of a new patient’s symptoms, the retrieval system returns the set of most relevant records of previously treated patients. However, due to semantic, functional, and treatment synonyms in medical terminology, queries are often incomplete and thus require enhancement. In this paper, we present a topic model that frames symptoms and treatments as separate languages. Our experimental results show that this method improves retrieval performance over several baselines with statistical significance. These baselines include methods used in prior studies as well as state-of-the-art embedding techniques. Finally, we show that our proposed topic model discovers all three types of synonyms to improve medical record retrieval. BiLDA-based mixed query expansion achieved the best retrieval performance among all expansions. For NDCG@5, 10, 15, and 20, it performed better than the baseline with no query expansion with p-values of 2.842 ×10−3, 4.784 ×10−3, 6.852 ×10−7, and 1.929 ×10−6, respectively. Furthermore, BiLDA mixed expansion performed better than all of the runner- up methods at the 5% significance level. Mixed expansion was only the best-performing expansion type for BiLDA. This is due to the fact that all other methods do not separately mine symptom and treatment synonyms. On the other hand, BiLDA-based query expansion considers symptoms and treatments to be from separate topics, and therefore it successfully added in mixed query terms. Dictionary-based expansion’s poor performance can be explained by the fact that it adds too many treatment synonyms, which dilutes the original query’s symptoms. On average, dictionary-based expansion nearly doubled each query in size. We show an example of a mixed query expansion from BiLDA. In the patient query in Table 5.11, the five expansion terms include three symptoms and two treatment drugs.

5.3.4 Electronic medical records visualization

Here, we present VisAGE, a method that visualizes electronic medical records (EMRs) in a low-dimensional space. Effective visualization of new patients allows doctors to view similar, previously treated patients and to identify the new patients’ disease subtypes, reducing the chance of misdiagnosis. However, EMRs are typically incomplete or fragmented, resulting in patients who are missing many available features being placed near unrelated patients in the visualized space. VisAGE integrates several external data sources to enrich EMR databases to solve this issue. We evaluated VisAGE on a dataset of Parkinson’s disease patients. We

110 Table 5.11: Example of an expanded query created by the BiLDA method. Different query terms are separated by semicolons.

Disease Label Original Query Expansion Terms yellow, greasy tongue coating; epigastric chills; fluttering pulse; heartburn; dark, red tongue; Chronic gastritis bloating; fullness; belching; bitter orange; stomachache; crow-dipper acid reflux; dry mouth qualitatively and quantitatively show that VisAGE can more effectively cluster patients, which allows doctors to better discover patient subtypes and thus improve patient care.

Two-Dimensional Visualization with UPDRS

Using the two-dimensional representations of patient records, we labeled each record ac- cording to its unified Parkinson’s disease rating scale (UPDRS) scores. The UPDRS consists of six subsections, each containing survey questions that evaluate a patient’s physical and mental condition [273]. The questions deal with topics ranging from anxiety to sleeping problems, with scores scaled from 0 to 4. A higher score indicates more severe impairment or disability. As in previous work, we labeled each patient with the sum of his or her UPDRS scores [274]. The main difference between VisAGE and the baseline is that in Figure 5.5, moderately impaired patients (orange circles) on the right side of the plots were clustered more distinctly in the VisAGE visualization, while the same patients were less structured in the baseline visualization. The most severe Parkinson’s disease patients are marked by red squares, and were clustered more tightly together in the VisAGE visualization than in the baseline. The baseline’s worse performance can be attributed to the data sparsity of the EMRs.

Qualitative Evaluation: Drug and Symptom Enrichment

We qualitatively evaluated the visualization results by computing drug and symptom enrichments for each cluster. We used symptoms and drugs because they are strongly con- nected to patient statuses and diagnoses. Thus, if a cluster is highly enriched in a symptom

111 Minor Moderate Severe

(a) Baseline visualization

Minor Moderate Severe

(b) VisAGE visualization

Figure 5.5: The two-dimensional representations of patient records, plotted with color labels determined by each record’s UPDRS scores. VisAGE’s visualization identifies more clusters for moderately impaired patients, and more tightly groups severely impaired patients.

112 or drug, then doctors will have a general idea of the cluster’s disease subtype. We first clustered the two-dimensional patient representations with DBSCAN [275], which is robust to outliers and does not need to specify the number of clusters. Because the PPMI dataset contains control patients to simulate noise, DBSCAN’s robustness to outliers is especially desirable. Additionally, not having to specify the number of clusters a priori is useful for our application, as we do not know the exact number of patient subtypes beforehand. For the DBSCAN parameters, we set  = 1 and minPts = 10. In the baseline method, patients were placed into 10 clusters. With VisAGE, patients were placed into 18 clusters. For each cluster c and each symptom or drug b, we computed Fisher’s exact test to determine if c was significantly enriched in b. We only used symptoms and drugs with binary values to avoid medical tests for which all patients had non-zero values (e.g., the Epworth Sleepiness Scale [276]).

113 CHAPTER 6: SUMMARY AND FUTURE DIRECTIONS

In this thesis, I systematically studied the problem of mining knowledge networks for precision medicine. I used three different network models to tackle the challenges in min- ing knowledge networks. For small-sized network, I applied the diffusion model to capture the topologic similarity of nodes in the network. For larger networks, I proposed novel network embedding approaches to reduce the noise of the diffusion state. The proposed network embedding approaches also enable us to obtain low-dimensional features that can alleviate overfitting in downstream prediction tasks. This thesis covers three important pre- cision medicine tasks: gene function annotatin, drug discovery, and clinical decision support. Promising results on these three tasks show the power of using knowledge networks to ad- vance precision medicine. In particular, I show the promising performance of the proposed approaches in precision medicine tasks, including drug symptom relation detection, adverse drug reaction detection, gene function prediction, drug target prediction, drug pathway as- sociation identification, gene set annotation, and electronic medical record analysis. While these applications demonstrate the encouraging future of mining knowledge network for pre- cision medicine, there are certainly many other research problems we can explore in this direction. This thesis has prepared a solid basis to pursue such a long-term goal, and in my future research I envision quite a few opportunities to extend my current research:

6.1 KNOWLEDGE NETWORK-SUPPORTED BIOLOGICAL FINDING INTERPRETATION

As biomedical data has been massively generated, many predictive models have been developed to guide drug discovery and disease diagnosis. For example, one can train a classifier to identify genes whose expression values are significantly correlated with drug response. Despite the improved prediction performance, many of these models are not interpretable. Consequently, we can gain very limited insight of drug mechanism from these predictive models. Several modern machine learning methods try to interpret the prediction results by finding important features (e.g., genes). However, even if we know that certain gene is crucial to accurately predict the response of a drug, the underlying mechanism and functions are still unclear. The setbacks may be due to restricting the models to only look for interpretations from the limited input feature sets. I plan to use the knowledge from knowledge networks to build an end-to-end knowledge network-assisted interpretable predictive model. The ideal output of our model would be a

114 subnetwork of the original knowledge network. This subnetwork reflects how a drug gradually affects the expression of genes through certain functions and pathways. By using the external information from other datasets, this subnetwork would be much more informative than a simple list of selected features.

6.2 REFINING KNOWLEDGE NETWORKS WITH EXPERIMENTAL DATA

Knowledge networks could be noisy due to experimental errors and lack of enough exper- iment replicates. Consequently, the resulted knowledge networks could also be noisy and need to be further refined. I plan to develop novel machine learning approaches to refine and validate knowledge network. I want to use different data sources to cross validate each other. I also plan to use limited high-quality experimental data to exclude potential false-positive links in knowledge networks. The resulted high-quality knowledge networks could empower better data mining applications for precision medicine with less false-positive results.

6.3 UNCOVERING DISEASE MECHANISMS THROUGH NETWORK BIOLOGY

My work on mining cancer genomics data has shown encouraging results in understand- ing cancer heterogeneity using a novel network-based approach. I have recently applied this approach to Parkinsons disease, which is also driven by a combination of genes, and obtained preliminary but promising results in patient subtyping based on whole exome se- quencing data. In the future, I want to further explore this direction and apply it to a broader range of human diseases. While significant efforts to date have focused on specific genes that interact with compounds and confer observed cellular phenotypes, there has been relatively little progress in studying the interactions of genes in response to the presence of compounds. Comprehensively understanding gene interactions requires us to take a network biology perspective and consider information from different data sources. One direction I plan to explore is to transfer disease-related discovery from model organisms to humans by aligning the molecular interactions in different species. In addition, I am also interested in integrating clinical data with genomics data in the hopes of building an intelligent system that enables disease diagnosis based on not only symptoms but also molecular and cellular abnormalities.

115 REFERENCES

[1] J. Barretina, G. Caponigro, N. Stransky, K. Venkatesan, A. A. Margolin, S. Kim, C. J. Wilson, J. Leh´ar,G. V. Kryukov, D. Sonkin et al., “The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity,” Nature, vol. 483, no. 7391, p. 603, 2012.

[2] F. Iorio, T. A. Knijnenburg, D. J. Vis, G. R. Bignell, M. P. Menden, M. Schubert, N. Aben, E. Gon¸calves, S. Barthorpe, H. Lightfoot et al., “A landscape of pharma- cogenomic interactions in cancer,” Cell, vol. 166, no. 3, pp. 740–754, 2016.

[3] D. Szklarczyk, A. Franceschini, S. Wyder, K. Forslund, D. Heller, J. Huerta-Cepas, M. Simonovic, A. Roth, A. Santos, K. P. Tsafou et al., “String v10: protein–protein in- teraction networks, integrated over the tree of life,” Nucleic acids research, p. gku1003, 2014.

[4] C. Stark, B.-J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers, “Biogrid: a general repository for interaction datasets,” Nucleic acids research, vol. 34, no. suppl 1, pp. D535–D539, 2006.

[5] T. Li, R. Wernersson, R. B. Hansen, H. Horn, J. Mercer, G. Slodkowicz, C. T. Work- man, O. Rigina, K. Rapacki, H. H. Stærfeldt et al., “A scored human protein–protein interaction network to catalyze genomic interpretation,” Nature methods, vol. 14, no. 1, p. 61, 2017.

[6] E. G. Cerami, B. E. Gross, E. Demir, I. Rodchenkov, O.¨ Babur, N. Anwar, N. Schultz, G. D. Bader, and C. Sander, “Pathway commons, a web resource for biological pathway data,” Nucleic acids research, vol. 39, no. suppl 1, pp. D685–D690, 2010.

[7] A. Liberzon, A. Subramanian, R. Pinchback, H. Thorvaldsd´ottir,P. Tamayo, and J. P. Mesirov, “Molecular signatures database (msigdb) 3.0,” Bioinformatics, vol. 27, no. 12, pp. 1739–1740, 2011.

[8] G. Consortium et al., “The genotype-tissue expression (gtex) pilot analysis: Multitis- sue gene regulation in humans,” Science, vol. 348, no. 6235, pp. 648–660, 2015.

[9] M. J. Van De Vijver, Y. D. He, L. J. Van’t Veer, H. Dai, A. A. Hart, D. W. Voskuil, G. J. Schreiber, J. L. Peterse, C. Roberts, M. J. Marton et al., “A gene-expression sig- nature as a predictor of survival in breast cancer,” New England Journal of Medicine, vol. 347, no. 25, pp. 1999–2009, 2002.

[10] C. G. A. R. Network et al., “Integrated genomic analyses of ovarian carcinoma,” Na- ture, vol. 474, no. 7353, p. 609, 2011.

[11] C. G. A. Network et al., “Comprehensive molecular portraits of human breast tu- mours,” Nature, vol. 490, no. 7418, p. 61, 2012.

116 [12] C. G. A. R. Network et al., “Comprehensive molecular characterization of urothelial bladder carcinoma,” Nature, vol. 507, no. 7492, p. 315, 2014.

[13] C. G. A. Network et al., “Comprehensive genomic characterization of head and neck squamous cell carcinomas,” Nature, vol. 517, no. 7536, p. 576, 2015.

[14] C. G. A. R. Network et al., “Comprehensive molecular characterization of clear cell renal cell carcinoma,” Nature, vol. 499, no. 7456, p. 43, 2013.

[15] T. J. Lynch, D. W. Bell, R. Sordella, S. Gurubhagavatula, R. A. Okimoto, B. W. Brannigan, P. L. Harris, S. M. Haserlat, J. G. Supko, F. G. Haluska et al., “Activating mutations in the epidermal growth factor receptor underlying responsiveness of non– small-cell lung cancer to gefitinib,” New England Journal of Medicine, vol. 350, no. 21, pp. 2129–2139, 2004.

[16] R. Harpaz, S. Vilar, W. DuMouchel, H. Salmasian, K. Haerian, N. H. Shah, H. S. Chase, and C. Friedman, “Combing signals from spontaneous reports and electronic health records for detection of adverse drug reactions,” Journal of the American Med- ical Informatics Association, vol. 20, no. 3, pp. 413–419, 2012.

[17] E. Iqbal, R. Mallah, R. G. Jackson, M. Ball, Z. M. Ibrahim, M. Broadbent, O. Dzahini, R. Stewart, C. Johnston, and R. J. Dobson, “Identification of adverse drug events from free text electronic patient records and information in a large mental health case register,” PloS one, vol. 10, no. 8, p. e0134208, 2015.

[18] P. B. Jensen, L. J. Jensen, and S. Brunak, “Mining electronic health records: towards better research applications and clinical care,” Nature Reviews Genetics, vol. 13, no. 6, p. 395, 2012.

[19] C. Shivade, P. Raghavan, E. Fosler-Lussier, P. J. Embi, N. Elhadad, S. B. Johnson, and A. M. Lai, “A review of approaches to identifying patient phenotype cohorts using electronic health records,” Journal of the American Medical Informatics Association, vol. 21, no. 2, pp. 221–230, 2013.

[20] R. Bellazzi, F. Ferrazzi, and L. Sacchi, “Predictive data mining in clinical medicine: a focus on selected methods and applications,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 5, pp. 416–430, 2011.

[21] N. Lavraˇc,“Selected techniques for data mining in medicine,” Artificial intelligence in medicine, vol. 16, no. 1, pp. 3–23, 1999.

[22] A. K. Daly, “Pharmacogenomics of adverse drug reactions,” Genome medicine, vol. 5, no. 1, p. 5, 2013.

[23] S. L. Chan, S. Jin, M. Loh, and L. R. Brunham, “Progress in understanding the genomic basis for adverse drug reactions: a comprehensive review and focus on the role of ethnicity,” Pharmacogenomics, vol. 16, no. 10, pp. 1161–1178, 2015.

117 [24] R. Srivas, J. P. Shen, C. C. Yang, S. M. Sun, J. Li, A. M. Gross, J. Jensen, K. Licon, A. Bojorquez-Gomez, K. Klepper et al., “A network of conserved synthetic lethal interactions for exploration of precision cancer therapy,” Molecular Cell, vol. 63, no. 3, pp. 514–525, 2016.

[25] J. P. Shen, D. Zhao, R. Sasik, J. Luebeck, A. Birmingham, A. Bojorquez-Gomez, K. Licon, K. Klepper, D. Pekin, A. N. Beckett et al., “Combinatorial crispr-cas9 screens for de novo mapping of genetic interactions,” Nature methods, 2017.

[26] K. Han, E. E. Jeng, G. T. Hess, D. W. Morgens, A. Li, and M. C. Bassik, “Syner- gistic drug combinations for cancer identified in a crispr screen for pairwise genetic interactions,” Nature Biotechnology, 2017.

[27] D. Altshuler, M. Daly, and L. Kruglyak, “Guilt by association,” Nature genetics, vol. 26, no. 2, p. 135, 2000.

[28] M. Cao, C. M. Pietras, X. Feng, K. J. Doroschak, T. Schaffner, J. Park, H. Zhang, L. J. Cowen, and B. J. Hescott, “New directions for diffusion-based network prediction of protein function: incorporating pathways with confidence,” Bioinformatics, vol. 30, no. 12, pp. i219–i227, 2014.

[29] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, “Algorithm 778: L-bfgs-b: Fortran subrou- tines for large-scale bound-constrained optimization,” ACM Transactions on Mathe- matical Software (TOMS), vol. 23, no. 4, pp. 550–560, 1997.

[30] G. H. Golub and C. Reinsch, “Singular value decomposition and least squares solu- tions,” Numerische mathematik, vol. 14, no. 5, pp. 403–420, 1970.

[31] H. Cho, B. Berger, and J. Peng, “Diffusion component analysis: unraveling functional topology in biological networks,” in RECOMB. Springer, 2015, pp. 62–64.

[32] S. Wang, H. Cho, C. Zhai, B. Berger, and J. Peng, “Exploiting ontology graph for predicting sparsely annotated gene function,” Bioinformatics, vol. 31, no. 12, pp. i357– i364, 2015.

[33] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word repre- sentation.” in EMNLP, vol. 14, 2014, pp. 1532–1543.

[34] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed rep- resentations of words and phrases and their compositionality,” in NIPS, 2013, pp. 3111–3119.

[35] M. U. Gutmann and A. Hyv¨arinen,“Noise-contrastive estimation of unnormalized sta- tistical models, with applications to natural image statistics,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 307–361, 2012.

118 [36] J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. M. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, J. M. Stuart, C. G. A. R. Network et al., “The cancer genome atlas pan-cancer analysis project,” Nature genetics, vol. 45, no. 10, p. 1113, 2013.

[37] M. R. Stratton, P. J. Campbell, and P. A. Futreal, “The cancer genome,” Nature, vol. 458, no. 7239, p. 719, 2009.

[38] W. Yang, J. Soares, P. Greninger, E. J. Edelman, H. Lightfoot, S. Forbes, N. Bindal, D. Beare, J. A. Smith, I. R. Thompson et al., “Genomics of drug sensitivity in cancer (gdsc): a resource for therapeutic biomarker discovery in cancer cells,” Nucleic acids research, vol. 41, no. D1, pp. D955–D961, 2012.

[39] L. Wang, H. L. McLeod, and R. M. Weinshilboum, “Genomics and drug response,” New England Journal of Medicine, vol. 364, no. 12, pp. 1144–1153, 2011.

[40] L. Wang and R. M. Weinshilboum, “Pharmacogenomics: candidate gene identification, functional validation and mechanisms,” Human molecular genetics, vol. 17, no. R2, pp. R174–R179, 2008.

[41] Y. Xie, G. Xiao, K. R. Coombes, C. Behrens, L. M. Solis, G. Raso, L. Girard, H. S. Erickson, J. Roth, J. V. Heymach et al., “Robust gene expression signature from formalin-fixed paraffin-embedded samples predicts prognosis of non–small-cell lung cancer patients,” Clinical Cancer Research, vol. 17, no. 17, pp. 5705–5714, 2011.

[42] M. G. Rees, B. Seashore-Ludlow, J. H. Cheah, D. J. Adams, E. V. Price, S. Gill, S. Javaid, M. E. Coletti, V. L. Jones, N. E. Bodycombe et al., “Correlating chemical sensitivity and basal gene expression reveals mechanism of action,” Nature chemical biology, vol. 12, no. 2, pp. 109–116, 2016.

[43] H. Gao, J. M. Korn, S. Ferretti, J. E. Monahan, Y. Wang, M. Singh, C. Zhang, C. Schnell, G. Yang, Y. Zhang et al., “High-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response,” Nature medicine, vol. 21, no. 11, p. 1318, 2015.

[44] E. Choi, A. Schuetz, W. F. Stewart, and J. Sun, “Using recurrent neural network models for early detection of heart failure onset,” Journal of the American Medical Informatics Association, vol. 24, no. 2, pp. 361–370, 2016.

[45] Y. Kim, J. Sun, H. Yu, and X. Jiang, “Federated tensor factorization for computational phenotyping,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 887–895.

[46] J. Zhou, F. Wang, J. Hu, and J. Ye, “From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 135–144.

119 [47] M. Rotmensch, Y. Halpern, A. Tlimat, S. Horng, and D. Sontag, “Learning a health knowledge graph from electronic medical records,” Scientific reports, vol. 7, no. 1, p. 5994, 2017. [48] H. Chen and B. M. Sharp, “Content-rich biological network constructed by mining pubmed abstracts,” BMC bioinformatics, vol. 5, no. 1, p. 147, 2004. [49] A. Doms and M. Schroeder, “Gopubmed: exploring pubmed with the gene ontology,” Nucleic acids research, vol. 33, no. suppl 2, pp. W783–W786, 2005. [50] H.-T. Zheng, C. Borchert, and H.-G. Kim, “Goclonto: An ontological clustering approach for conceptualizing pubmed abstracts,” Journal of biomedical informatics, vol. 43, no. 1, pp. 31–40, 2010. [51] H. Poon, C. Quirk, C. DeZiel, and D. Heckerman, “Literome: Pubmed-scale genomic knowledge base in the cloud,” Bioinformatics, vol. 30, no. 19, pp. 2840–2842, 2014. [52] P. Radivojac, W. T. Clark, T. R. Oron, A. M. Schnoes, T. Wittkop, A. Sokolov, K. Graim, C. Funk, K. Verspoor, A. Ben-Hur et al., “A large-scale evaluation of computational protein function prediction,” Nature methods, vol. 10, no. 3, pp. 221– 227, 2013. [53] Y. Jiang, T. R. Oron, W. T. Clark, A. R. Bankapur, D. D’Andrea, R. Lepore, C. S. Funk, I. Kahanda, K. M. Verspoor, A. Ben-Hur et al., “An expanded evaluation of pro- tein function prediction methods shows an improvement in accuracy,” arXiv preprint arXiv:1601.00891, 2016. [54] U. Karaoz, T. Murali, S. Letovsky, Y. Zheng, C. Ding, C. R. Cantor, and S. Kasif, “Whole-genome annotation by using evidence integration in functional-linkage net- works,” Proceedings of the National Academy of Sciences of the United States of Amer- ica, vol. 101, no. 9, pp. 2888–2893, 2004. [55] S. Letovsky and S. Kasif, “Predicting protein function from protein/protein interaction data: a probabilistic approach,” Bioinformatics, vol. 19, no. suppl 1, pp. i197–i204, 2003. [56] E. Sefer and C. Kingsford, “Metric labeling and semi-metric embedding for protein annotation prediction,” in Research in computational molecular biology. Springer, 2011, pp. 392–407. [57] T. Murali, C.-J. Wu, and S. Kasif, “The art of gene function prediction,” Nature biotechnology, vol. 24, no. 12, pp. 1474–1475, 2006. [58] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig et al., “Gene ontology: tool for the unification of biology,” Nature genetics, vol. 25, no. 1, pp. 25–29, 2000. [59] H. Cho, B. Berger, and J. Peng, “Compact integration of multi-network topology for functional analysis of genes,” Cell systems, vol. 3, no. 6, pp. 540–548, 2016.

120 [60] S. K¨ohler,S. Bauer, D. Horn, and P. N. Robinson, “Walking the interactome for prioritization of candidate disease genes,” The American Journal of Human Genetics, vol. 82, no. 4, pp. 949–958, 2008.

[61] E. Nabieva, K. Jim, A. Agarwal, B. Chazelle, and M. Singh, “Whole-proteome pre- diction of protein function via graph-theoretic analysis of interaction maps,” Bioinfor- matics, vol. 21, no. suppl 1, pp. i302–i310, 2005.

[62] V. Gligorijevi´c,V. Janji´c,and N. Prˇzulj,“Integration of molecular network data re- constructs gene ontology,” Bioinformatics, vol. 30, no. 17, pp. i594–i600, 2014.

[63] T. Milenkovi´c,V. Memiˇsevi´c,A. K. Ganesan, and N. Prˇzulj,“Systems-level cancer gene identification from protein interaction network topology applied to melanogenesis- related functional genomics data,” Journal of the Royal Society Interface, vol. 7, no. 44, pp. 423–437, 2010.

[64] S. Mostafavi and Q. Morris, “Fast integration of heterogeneous data sources for pre- dicting gene function with limited annotation,” Bioinformatics, vol. 26, no. 14, pp. 1759–1765, 2010.

[65] S. Mostafavi, D. Ray, D. Warde-Farley, C. Grouios, and Q. Morris, “Genemania: a real- time multiple association network integration algorithm for predicting gene function,” Genome biology, vol. 9, no. 1, p. S4, 2008.

[66] W. T. Clark and P. Radivojac, “Information-theoretic evaluation of predicted onto- logical annotations,” Bioinformatics, vol. 29, no. 13, pp. i53–i61, 2013.

[67] R. Eisner, B. Poulin, D. Szafron, P. Lu, and R. Greiner, “Improving protein function prediction using the hierarchical structure of the gene ontology,” in Computational In- telligence in Bioinformatics and Computational Biology, 2005. CIBCB’05. Proceedings of the 2005 IEEE Symposium on. IEEE, 2005, pp. 1–10.

[68] Y. Guan, C. L. Myers, D. C. Hess, Z. Barutcuoglu, A. A. Caudy, and O. G. Troyan- skaya, “Predicting gene function in a hierarchical context with an ensemble of classi- fiers,” Genome biology, vol. 9, no. 1, p. S3, 2008.

[69] Y. Jiang, W. T. Clark, I. Friedberg, and P. Radivojac, “The impact of incomplete knowledge on the evaluation of protein function prediction: a structured-output learn- ing perspective,” Bioinformatics, vol. 30, no. 17, pp. i609–i616, 2014.

[70] W. K. Kim, C. Krumpelman, and E. M. Marcotte, “Inferring mouse gene functions from genomic-scale data using a combined functional network/classification strategy,” Genome biology, vol. 9, no. 1, p. S5, 2008.

[71] G. Obozinski, G. Lanckriet, C. Grant, M. I. Jordan, and W. S. Noble, “Consistent probabilistic outputs for protein function prediction,” Genome Biology, vol. 9, no. 1, p. S6, 2008.

121 [72] A. Sokolov and A. Ben-Hur, “Hierarchical classification of gene ontology terms using the gostruct method,” Journal of bioinformatics and computational biology, vol. 8, no. 02, pp. 357–376, 2010.

[73] H. Wang, H. Huang, and C. Ding, “Function–function correlated multi-label pro- tein function prediction over interaction networks,” Journal of computational biology, vol. 20, no. 4, pp. 322–343, 2013.

[74] J. Dutkowski, M. Kramer, M. A. Surma, R. Balakrishnan, J. M. Cherry, N. J. Krogan, and T. Ideker, “A gene ontology inferred from molecular networks,” Nature biotech- nology, vol. 31, no. 1, pp. 38–45, 2013.

[75] M. Kramer, J. Dutkowski, M. Yu, V. Bafna, and T. Ideker, “Inferring gene ontologies from pairwise similarity data,” Bioinformatics, vol. 30, no. 12, pp. i34–i42, 2014.

[76] A. Franceschini, D. Szklarczyk, S. Frankild, M. Kuhn, M. Simonovic, A. Roth, J. Lin, P. Minguez, P. Bork, C. Von Mering et al., “String v9. 1: protein-protein interaction networks, with increased coverage and integration,” Nucleic acids research, vol. 41, no. D1, pp. D808–D815, 2012.

[77] L. Pe˜na-Castillo, M. Tasan, C. L. Myers, H. Lee, T. Joshi, C. Zhang, Y. Guan, M. Leone, A. Pagnani, W. K. Kim et al., “A critical assessment of mus musculus gene function prediction using integrated genomic evidence,” Genome biology, vol. 9, no. 1, p. S2, 2008.

[78] M. E. Smoot, K. Ono, J. Ruscheinski, P.-L. Wang, and T. Ideker, “Cytoscape 2.8: new features for data integration and network visualization,” Bioinformatics, vol. 27, no. 3, pp. 431–432, 2010.

[79] S. Wang, M. Qu, and J. Peng, “Prosnet: Integrating homology with molecular networks for protein function prediction,” in PACIFIC SYMPOSIUM ON BIOCOMPUTING 2017. World Scientific, 2017, pp. 27–38.

[80] E. W. Huang, S. Wang, D. J.-L. Lee, R. Zhang, B. Liu, X. Zhou, and C. Zhai, “Hem- net: Integration of electronic medical records with molecular interaction networks and domain knowledge for survival analysis,” in AMIA Annual Symposium Proceedings. American Medical Informatics Association., 2017.

[81] E. W. Huang, S. Wang, B. Li, R. Zhang, B. Liu, R. Zhang, J. Liu, X. Zhou, H. Lin, and C. Zhai, “Hemnet: Integration of electronic medical records with molecular inter- action networks and domain knowledge for survival analysis,” in Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. ACM, 2017, pp. 378–387.

[82] E. W. Huang, S. Wang, and C. Zhai, “Visage: Integrating external knowledge into elec- tronic medical record visualization.” in Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, vol. 23. World Scientific, 2018, pp. 578–589.

122 [83] E. W. Huang, S. Wang, R. Zhang, B. Liu, X. Zhou, and C. Zhai, “Parecat: Patient record subcategorization for precision traditional chinese medicine,” in Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. ACM, 2016, pp. 443–452.

[84] S. Burge, E. Kelly, D. Lonsdale, P. Mutowo-Muellenet, C. McAnulla, A. Mitchell, A. Sangrador-Vegas, S.-Y. Yong, N. Mulder, and S. Hunter, “Manual go annotation of predictive protein signatures: the interpro approach to go curation,” Database, vol. 2012, p. bar068, 2012.

[85] Y. Loewenstein, L. Yaniv, R. Domenico, O. C. Redfern, W. James, F. Dmitrij, L. Michal, O. Christine, T. Janet, and T. Anna, “Protein function annotation by homology-based inference,” Genome Biol., vol. 10, no. 2, p. 207, 2009.

[86] W. T. Clark and P. Radivojac, “Analysis of protein function and its prediction from amino acid sequence,” Proteins: Structure, Function, and Bioinformatics, vol. 79, no. 7, pp. 2086–2096, 2011.

[87] J. Gillis and P. Pavlidis, “Characterizing the state of the art in the computational assignment of gene function: lessons from the first critical assessment of functional annotation (cafa),” BMC bioinformatics, vol. 14, no. 3, p. 1, 2013.

[88] R. Rentzsch and C. A. Orengo, “Protein function prediction using domain families,” BMC bioinformatics, vol. 14, no. 3, p. 1, 2013.

[89] D. Cozzetto, D. W. Buchan, K. Bryson, and D. T. Jones, “Protein function prediction by massive integration of evolutionary analyses and multiple data sources,” BMC bioinformatics, vol. 14, no. Suppl 3, p. S1, 2013.

[90] D. Lee, O. Redfern, and C. Orengo, “Predicting protein function from sequence and structure,” Nature Reviews Molecular Cell Biology, vol. 8, no. 12, pp. 995–1005, 2007.

[91] G. Yachdav, E. Kloppmann, L. Kajan, M. Hecht, T. Goldberg, T. Hamp, P. H¨onigschmid, A. Schafferhans, M. Roos, M. Bernhofer et al., “Predictproteinan open resource for online prediction of protein structural and functional features,” Nu- cleic acids research, p. gku366, 2014.

[92] S. F. Altschul, T. L. Madden, A. A. Sch¨affer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped blast and psi-blast: a new generation of protein database search programs,” Nucleic acids research, vol. 25, no. 17, pp. 3389–3402, 1997.

[93] B. E. Engelhardt, M. I. Jordan, K. E. Muratore, and S. E. Brenner, “Protein molecular function prediction by bayesian phylogenomics,” PLoS Comput Biol, vol. 1, no. 5, p. e45, 2005.

[94] B. E. Engelhardt, M. I. Jordan, J. R. Srouji, and S. E. Brenner, “Genome-scale phy- logenetic function annotation of large and diverse protein families,” Genome research, vol. 21, no. 11, pp. 1969–1980, 2011.

123 [95] U. Consortium et al., “Uniprot: a hub for protein information,” Nucleic acids research, p. gku989, 2014.

[96] R. P. Huntley, T. Sawford, P. Mutowo-Meullenet, A. Shypitsyna, C. Bonilla, M. J. Martin, and C. O’Donovan, “The goa database: gene ontology annotation updates for 2015,” Nucleic acids research, vol. 43, no. D1, pp. D1057–D1063, 2015.

[97] A. Chatr-Aryamontri, B.-J. Breitkreutz, S. Heinicke, L. Boucher, A. Winter, C. Stark, J. Nixon, L. Ramage, N. Kolas, L. ODonnell et al., “The biogrid interaction database: 2013 update,” Nucleic acids research, vol. 41, no. D1, pp. D816–D823, 2013.

[98] D. Szklarczyk, A. Franceschini, S. Wyder, K. Forslund, D. Heller, J. Huerta-Cepas, M. Simonovic, A. Roth, A. Santos, K. P. Tsafou, M. Kuhn, P. Bork, L. J. Jensen, and C. von Mering, “STRING v10: protein-protein interaction networks, integrated over the tree of life,” Nucleic Acids Res., vol. 43, no. Database issue, pp. D447–52, Jan. 2015.

[99] T. Rolland, M. Ta¸san, B. Charloteaux, S. J. Pevzner, Q. Zhong, N. Sahni, S. Yi, I. Lemmens, C. Fontanillo, R. Mosca et al., “A proteome-scale map of the human interactome network,” Cell, vol. 159, no. 5, pp. 1212–1226, 2014.

[100] S. Oliver, “Guilt-by-association goes global,” Nature, vol. 403, no. 6770, pp. 601–603, 10 Feb. 2000.

[101] E. Sefer, S. Emre, and K. Carl, “Metric labeling and semi-metric embedding for protein annotation prediction,” in Lecture Notes in Computer Science, 2011, pp. 392–407.

[102] T. Milenkovic, V. Memisevic, A. K. Ganesan, and N. Przulj, “Systems-level cancer gene identification from protein interaction network topology applied to melanogenesis- related functional genomics data,” J. R. Soc. Interface, vol. 7, no. 44, pp. 423–437, 6 Mar. 2010.

[103] A. K. Wong, A. Krishnan, V. Yao, A. Tadych, and O. G. Troyanskaya, “Imp 2.0: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks,” Nucleic acids research, vol. 43, no. W1, pp. W128– W133, 2015.

[104] R. Sharan, I. Ulitsky, and R. Shamir, “Network-based prediction of protein function,” Molecular systems biology, vol. 3, no. 1, p. 88, 2007.

[105] S. Navlakha and C. Kingsford, “The power of protein interaction networks for associ- ating genes with diseases,” Bioinformatics, vol. 26, no. 8, pp. 1057–1063, 2010.

[106] S. Mostafavi and Q. Morris, “Fast integration of heterogeneous data sources for pre- dicting gene function with limited annotation,” Bioinformatics, vol. 26, no. 14, pp. 1759–1765, 2010.

124 [107] H. Yu, N. M. Luscombe, H. X. Lu, X. Zhu, Y. Xia, J.-D. J. Han, N. Bertin, S. Chung, M. Vidal, and M. Gerstein, “Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs,” Genome Res., vol. 14, no. 6, pp. 1107–1118, June 2004.

[108] A. J. Walhout, “Protein interaction mapping in c. elegans using proteins involved in vulval development,” Science, vol. 287, no. 5450, pp. 116–122, 2000.

[109] A. Sokolov, S. Artem, and B.-H. Asa, “Multi-view prediction of protein function,” in BCB ’11, 2011.

[110] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, “Gene ontology: tool for the unification of biology. the gene ontology consortium,” Nat. Genet., vol. 25, no. 1, pp. 25–29, May 2000.

[111] T. Ideker, O. Ozier, B. Schwikowski, and A. F. Siegel, “Discovering regulatory and signalling circuits in molecular interaction networks,” Bioinformatics, vol. 18 Suppl 1, pp. S233–40, 2002.

[112] T. Ideker and N. J. Krogan, “Differential network biology,” Mol. Syst. Biol., vol. 8, p. 565, Jan. 2012.

[113] A. Califano, A. J. Butte, S. Friend, T. Ideker, and E. Schadt, “Leveraging models of cell regulation and GWAS data in integrative network-based association studies,” Nat. Genet., vol. 44, no. 8, pp. 841–847, July 2012.

[114] P. M. Visscher, M. A. Brown, M. I. McCarthy, and J. Yang, “Five years of GWAS discovery,” Am. J. Hum. Genet., vol. 90, no. 1, pp. 7–24, Jan. 2012.

[115] T. Nepusz, H. Yu, and A. Paccanaro, “Detecting overlapping protein complexes in protein-protein interaction networks,” Nat. Methods, vol. 9, no. 5, pp. 471–472, Mar. 2012.

[116] M. D. M. Leiserson, F. Vandin, H.-T. Wu, J. R. Dobson, J. V. Eldridge, J. L. Thomas, A. Papoutsaki, Y. Kim, B. Niu, M. McLellan, M. S. Lawrence, A. Gonzalez-Perez, D. Tamborero, Y. Cheng, G. A. Ryslik, N. Lopez-Bigas, G. Getz, L. Ding, and B. J. Raphael, “Pan-cancer network analysis identifies combinations of rare somatic muta- tions across pathways and protein complexes,” Nat. Genet., vol. 47, no. 2, pp. 106–114, Feb. 2015.

[117] M. Hofree, J. P. Shen, H. Carter, A. Gross, and T. Ideker, “Network-based stratifica- tion of tumor mutations,” Nat. Methods, vol. 10, no. 11, pp. 1108–1115, 2013.

[118] S. Mostafavi, D. Ray, D. Warde-Farley, C. Grouios, and Q. Morris, “GeneMANIA: a real-time multiple association network integration algorithm for predicting gene func- tion,” Genome Biol., vol. 9 Suppl 1, p. S4, June 2008.

125 [119] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, “Gene ontology: tool for the unification of biology. the gene ontology consortium,” Nat. Genet., vol. 25, no. 1, pp. 25–29, May 2000.

[120] M. Kanehisa, “KEGG: Kyoto encyclopedia of genes and genomes,” Nucleic Acids Res., vol. 28, no. 1, pp. 27–30, 2000.

[121] F. Jin, M. Huang, Z. Lu, and X. Zhu, “Towards automatic generation of gene sum- mary,” in Proceedings of the Workshop on BioNLP - BioNLP ’09, 2009.

[122] X. Ling, J. Jiang, X. He, Q. Mei, C. Zhai, and B. Schatz, “Generating gene summaries from biomedical literature: A study of semi-structured summarization,” Inf. Process. Manag., vol. 43, no. 6, pp. 1777–1791, 2007.

[123] N. Qiao, Y. Huang, H. Naveed, C. D. Green, and J.-D. J. Han, “CoCiter: an efficient tool to infer gene function by assessing the significance of literature co-citation,” PLoS One, vol. 8, no. 9, p. e74074, Sep. 2013.

[124] T. G. Soldatos, S. I. O’Donoghue, V. P. Satagopam, L. J. Jensen, N. P. Brown, A. Barbosa-Silva, and R. Schneider, “Martini: using literature keywords to compare gene sets,” Nucleic Acids Res., vol. 38, no. 1, pp. 26–38, Jan. 2010.

[125] NCBI Resource Coordinators, “Database resources of the national center for biotech- nology information,” Nucleic Acids Res., vol. 45, no. D1, pp. D12–D17, Jan. 2017.

[126] J. Liu, J. Shang, and J. Han, Phrase Mining from Massive Text and Its Applications. Morgan & Claypool Publishers, Mar. 2017.

[127] H. Cho, B. Berger, and J. Peng, “Compact integration of Multi-Network topology for functional analysis of genes,” Cell Syst, vol. 3, no. 6, pp. 540–548.e5, Dec. 2016.

[128] M. Kramer, J. Dutkowski, M. Yu, V. Bafna, and T. Ideker, “Inferring gene ontologies from pairwise similarity data,” Bioinformatics, vol. 30, no. 12, pp. i34–42, June 2014.

[129] J. Dutkowski, M. Kramer, M. A. Surma, R. Balakrishnan, J. M. Cherry, N. J. Krogan, and T. Ideker, “A gene ontology inferred from molecular networks,” Nat. Biotechnol., vol. 31, no. 1, pp. 38–45, Jan. 2013.

[130] P. Shannon, A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker, “Cytoscape: a software environment for integrated models of biomolecular interaction networks,” Genome Res., vol. 13, no. 11, pp. 2498– 2504, Nov. 2003.

[131] R. Edgar, “Gene expression omnibus: NCBI gene expression and hybridization array data repository,” Nucleic Acids Res., vol. 30, no. 1, pp. 207–210, 2002.

126 [132] GTEx Consortium, “Human genomics. the Genotype-Tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans,” Science, vol. 348, no. 6235, pp. 648–660, May 2015.

[133] M. Uhlen, P. Oksvold, L. Fagerberg, E. Lundberg, K. Jonasson, M. Forsberg, M. Zwahlen, C. Kampf, K. Wester, S. Hober, H. Wernerus, L. Bj¨orling,and F. Ponten, “Towards a knowledge-based human protein atlas,” Nat. Biotechnol., vol. 28, no. 12, pp. 1248–1250, 2010.

[134] J. Barretina, G. Caponigro, N. Stransky, K. Venkatesan, A. A. Margolin, S. Kim, C. J. Wilson, J. Leh´ar,G. V. Kryukov, D. Sonkin, A. Reddy, M. Liu, L. Murray, M. F. Berger, J. E. Monahan, P. Morais, J. Meltzer, A. Korejwa, J. Jan´e-Valbuena, F. A. Mapa, J. Thibault, E. Bric-Furlong, P. Raman, A. Shipway, I. H. Engels, J. Cheng, G. K. Yu, J. Yu, P. Aspesi, Jr, M. de Silva, K. Jagtap, M. D. Jones, L. Wang, C. Hat- ton, E. Palescandolo, S. Gupta, S. Mahan, C. Sougnez, R. C. Onofrio, T. Liefeld, L. MacConaill, W. Winckler, M. Reich, N. Li, J. P. Mesirov, S. B. Gabriel, G. Getz, K. Ardlie, V. Chan, V. E. Myer, B. L. Weber, J. Porter, M. Warmuth, P. Finan, J. L. Harris, M. Meyerson, T. R. Golub, M. P. Morrissey, W. R. Sellers, R. Schlegel, and L. A. Garraway, “The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity,” Nature, vol. 483, no. 7391, pp. 603–607, Mar. 2012.

[135] T. Li, R. Wernersson, R. B. Hansen, H. Horn, J. Mercer, G. Slodkowicz, C. T. Work- man, O. Rigina, K. Rapacki, H. H. Stærfeldt, S. Brunak, T. S. Jensen, and K. Lage, “A scored human protein-protein interaction network to catalyze genomic interpretation,” Nat. Methods, vol. 14, no. 1, pp. 61–64, Jan. 2017.

[136] C. Stark, “BioGRID: a general repository for interaction datasets,” Nucleic Acids Res., vol. 34, no. 90001, pp. D535–D539, 2006.

[137] A. Lin, R. T. Wang, S. Ahn, C. C. Park, and D. J. Smith, “A genome-wide map of human genetic interactions inferred from radiation hybrid genotypes,” Genome Res., vol. 20, no. 8, pp. 1122–1132, Aug. 2010.

[138] R. D. Finn, T. K. Attwood, P. C. Babbitt, A. Bateman, P. Bork, A. J. Bridge, H.- Y. Chang, Z. Doszt´anyi, S. El-Gebali, M. Fraser, J. Gough, D. Haft, G. L. Holliday, H. Huang, X. Huang, I. Letunic, R. Lopez, S. Lu, A. Marchler-Bauer, H. Mi, J. Mistry, D. A. Natale, M. Necci, G. Nuka, C. A. Orengo, Y. Park, S. Pesseat, D. Piovesan, S. C. Potter, N. D. Rawlings, N. Redaschi, L. Richardson, C. Rivoire, A. Sangrador-Vegas, C. Sigrist, I. Sillitoe, B. Smithers, S. Squizzato, G. Sutton, N. Thanki, P. D. Thomas, S. C. E. Tosatto, C. H. Wu, I. Xenarios, L.-S. Yeh, S.-Y. Young, and A. L. Mitchell, “InterPro in 2017-beyond protein family and domain annotations,” Nucleic Acids Res., vol. 45, no. D1, pp. D190–D199, Jan. 2017.

[139] R. D. Finn, “Pfam: the protein families database,” in Encyclopedia of Genetics, Ge- nomics, Proteomics and Bioinformatics, 2005.

127 [140] C. Zhai and J. Lafferty, “A study of smoothing methods for language models applied to ad hoc information retrieval,” in Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’01, 2001.

[141] P. Mao, J. Li, Y. Huang, S. Wu, X. Pang, W. He, X. Liu, A. S. Slutsky, H. Zhang, and Y. Li, “MicroRNA-19b mediates lung Epithelial-Mesenchymal transition via Phosphatidylinositol-3,4,5-Trisphosphate 3-phosphatase in response to mechanical stretch,” Am. J. Respir. Cell Mol. Biol., vol. 56, no. 1, pp. 11–19, Jan. 2017.

[142] D. Rotin and O. Staub, “Nedd4-2 and the regulation of epithelial sodium transport,” Front. Physiol., vol. 3, p. 212, June 2012.

[143] K. D. Sutherland and A. Berns, “Cell of origin of lung cancer,” Mol. Oncol., vol. 4, no. 5, pp. 397–403, 2010.

[144] M. S. Lawrence, P. Stojanov, P. Polak, G. V. Kryukov, K. Cibulskis, A. Sivachenko, S. L. Carter, C. Stewart, C. H. Mermel, S. A. Roberts, A. Kiezun, P. S. Hammerman, A. McKenna, Y. Drier, L. Zou, A. H. Ramos, T. J. Pugh, N. Stransky, E. Helman, J. Kim, C. Sougnez, L. Ambrogio, E. Nickerson, E. Shefler, M. L. Cort´es,D. Auclair, G. Saksena, D. Voet, M. Noble, D. DiCara, P. Lin, L. Lichtenstein, D. I. Heiman, T. Fennell, M. Imielinski, B. Hernandez, E. Hodis, S. Baca, A. M. Dulak, J. Lohr, D.-A. Landau, C. J. Wu, J. Melendez-Zajgla, A. Hidalgo-Miranda, A. Koren, S. A. McCarroll, J. Mora, B. Crompton, R. Onofrio, M. Parkin, W. Winckler, K. Ardlie, S. B. Gabriel, C. W. M. Roberts, J. A. Biegel, K. Stegmaier, A. J. Bass, L. A. Garraway, M. Meyerson, T. R. Golub, D. A. Gordenin, S. Sunyaev, E. S. Lander, and G. Getz, “Mutational heterogeneity in cancer and the search for new cancer-associated genes,” Nature, vol. 499, no. 7457, pp. 214–218, July 2013.

[145] Cancer Genome Atlas Research Network, J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. M. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, and J. M. Stuart, “The cancer genome atlas Pan-Cancer analysis project,” Nat. Genet., vol. 45, no. 10, pp. 1113–1120, Oct. 2013.

[146] J.-F. Spinella, P. Cassart, C. Richer, V. Saillour, M. Ouimet, S. Langlois, P. St-Onge, T. Sontag, J. Healy, M. D. Minden, and D. Sinnett, “Genomic characterization of pediatric t-cell acute lymphoblastic leukemia reveals novel recurrent driver mutations,” Oncotarget, vol. 7, no. 40, pp. 65 485–65 503, Oct. 2016.

[147] A. Gelbard, K. S. Hale, Y. Takahashi, M. Davies, M. E. Kupferman, A. K. El-Naggar, J. N. Myers, and E. Y. Hanna, “Molecular profiling of sinonasal undifferentiated car- cinoma,” Head Neck, vol. 36, no. 1, pp. 15–21, Jan. 2014.

[148] C. G. A. Research, J. N. W. N., E. A. Collisson, G. B. Mills, K. R. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, and J. M. Stuart, “The Cancer Genome Atlas Pan-Cancer analysis project,” Nat Genet, 2013.

128 [149] W. Yang, J. Soares, P. Greninger, E. J. Edelman, H. Lightfoot, S. Forbes, N. Bindal, D. Beare, J. A. Smith, I. R. Thompson, S. Ramaswamy, P. A. Futreal, D. A. Haber, M. R. Stratton, C. Benes, U. McDermott, and M. J. Garnett, Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. 41(Database issue): Nucleic Acids Res, 2013.

[150] J. Barretina, G. Caponigro, N. Stransky, K. Venkatesan, A. A. Margolin, S. Kim, C. J. Wilson, J. Lehr, G. V. Kryukov, D. Sonkin, A. Reddy, M. Liu, L. Murray, M. F. Berger, J. E. Monahan, P. Morais, J. Meltzer, A. Korejwa, J. Jan-Valbuena, F. A. Mapa, J. Thibault, E. Bric-Furlong, P. Raman, A. Shipway, I. H. Engels, J. Cheng, G. K. Yu, J. Yu, J. P. Aspesi, M. de Silva, K. Jagtap, M. D. Jones, L. Wang, C. Hat- ton, E. Palescandolo, S. Gupta, S. Mahan, C. Sougnez, R. C. Onofrio, T. Liefeld, L. MacConaill, W. Winckler, M. Reich, N. Li, J. P. Mesirov, S. B. Gabriel, G. Getz, K. Ardlie, V. Chan, V. E. Myer, B. L. Weber, J. Porter, M. Warmuth, P. Finan, J. L. Harris, M. Meyerson, T. R. Golub, M. P. Morrissey, W. R. Sellers, R. Schlegel, and L. A. Garraway, “The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity,” Nature, 2012.

[151] A. B. Castoreno and U. S. Eggert, “Small molecule probes of cellular pathways and networks,” ACS Chem. Biol, 2011.

[152] P. Khatri, M. Sirota, and A. J. Butte, “Ten years of pathway analysis: current ap- proaches and outstanding challenges,” PLoS Comput. Biol, 2012.

[153] H. Clevers, “Wnt/beta-catenin signaling in development and disease,” Cell, 2006.

[154] T. S. Mikkelsen, C. F. Thorn, J. J. Yang, C. M. Ulrich, D. French, G. Zaza, H. M. Dunnenberger, S. Marsh, H. L. McLeod, K. Giacomini, M. L. Becker, R. Gaedigk, J. S. Leeder, L. Kager, M. V. Relling, W. Evans, T. E. Klein, and R. B. Altman, “Pharmgkb summary: methotrexate pathway,” Pharmacogenet. Genomics, 2011.

[155] C. F. Thorn, C. Oshiro, S. Marsh, T. Hernandez-Boussard, H. McLeod, T. E. Klein, and R. B. Altman, “Doxorubicin pathways: pharmacodynamics and adverse effects,” Pharmacogenet. Genomics, 2011.

[156] R. Huang, A. Wallqvist, N. Thanki, and D. G. Covell, “Linking pathway gene expres- sions to the growth inhibition response from the National Cancer Institute’s anticancer screen and drug mechanism of action,” Pharmacogenomics J, 2005.

[157] R. Braun, L. Cope, and G. Parmigiani, “Identifying differential correlation in gene/pathway combinations,” BMC Bioinformatics, vol. 9, p. 488, 2008.

[158] R. Hoehndorf, M. Dumontier, and G. V. Gkoutos, “Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics,” Bioinformatics, 2012.

[159] M. Song, S. Meiyue, Y. Yan, and J. Zhenran, “Drugpathway interaction prediction via multiple feature fusion,” Mol. Biosyst, 2014.

129 [160] A. L. Hopkins, “Network pharmacology: the next paradigm in drug discovery,” Nat. Chem. Biol, 2008. [161] M. Kotlyar, K. Fortney, and I. Jurisica, “Network-based characterization of drug- regulated genes, drug targets, and toxicity,” Methods, 2012. [162] M. Schenone, V. Danˇc´ık,B. K. Wagner, and P. A. Clemons, “Target identification and mechanism of action in chemical biology and drug discovery,” Nature chemical biology, vol. 9, no. 4, pp. 232–240, 2013. [163] H. Guo, J. Dong, S. Hu, X. Cai, G. Tang, J. Dou, M. Tian, F. He, Y. Nie, and D. Fan, “Biased random walk model for the prioritization of drug resistance associated proteins,” Sci. Rep, vol. 5, p. 10857, 2015. [164] Y. Wang, J. Xiao, T. O. Suzek, J. Zhang, J. Wang, and S. H. Bryant, PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res 37(Web Server issue): W623-633, 2009. [165] A. B. Keenan, S. L. Jenkins, K. M. Jagodnik, S. Koplev, E. He, D. Torre, Z. Wang, A. B. Dohlman, M. C. Silverstein, A. Lachmann et al., “The library of integrated network-based cellular signatures nih program: system-level cataloging of human cells response to perturbations,” Cell systems, 2017. [166] J. Lamb, E. D. Crawford, D. Peck, J. W. Modell, I. C. Blat, M. J. Wrobel, J. Lerner, J.-P. Brunet, A. Subramanian, K. N. Ross et al., “The connectivity map: using gene- expression signatures to connect small molecules, genes, and disease,” science, vol. 313, no. 5795, pp. 1929–1935, 2006. [167] A. Franceschini, D. Szklarczyk, S. Frankild, M. Kuhn, M. Simonovic, A. Roth, J. Lin, P. Minguez, P. Bork, C. von Mering, and L. J. Jensen, “String v9,protein-protein interaction networks, with increased coverage and integration.” Nucleic Acids Res. 41(Database issue), 2013. [168] C. F. Schaefer, K. Anthony, S. Krupa, J. Buchoff, M. Day, T. Hannay, and K. H. Buetow, PID: the Pathway Interaction Database. Nucleic Acids Res 37(Database issue): D674-679, 2009. [169] C. Hanson, J. Cairns, L. Wang, and S. Sinha, Computational discovery of transcription factors associated with drug response. Pharmacogenomics J, 2015. [170] J. S. Hayes, E. M. Czekanska, and R. G. Richards, The CellSurface Interaction. Ad- vances in Biochemical Engineering/Biotechnology: 1-31, 2011. [171] F. Aoudjit and K. Vuori, Integrin signaling in cancer cell survival and chemoresistance. Chemother. Res. Pract, 2012. [172] W. Guo, Y. Pylayeva, A. Pepe, T. Yoshioka, W. J. Muller, G. Inghirami, and F. G. Giancotti, “Beta 4 integrin amplifies ErbB2 signaling to promote mammary tumori- genesis,” Cell, 2006.

130 [173] L. Ding, Z. Zhang, G. Liang, Z. Yao, H. Wu, B. Wang, J. Zhang, M. Tariq, M. Ying, and B. Yang, SAHA triggered MET activation contributes to SAHA tolerance in solid cancer cells. Cancer Lett 356(2 Pt B): 828-836, 2015.

[174] D. R. Robinson, W. Yi-Mi, and L. Su-Fang, “The protein tyrosine kinase family of the human genome,” Oncogene, 2000.

[175] A. Arora and E. M. Scholar, “Role of tyrosine kinase inhibitors in cancer therapy,” J Pharmacol Exp Ther, 2005.

[176] B. Marengo, C. G. D. Ciucis, R. Ricciarelli, A. L. Furfaro, R. Colla, E. Canepa, N. Traverso, U. M. Marinari, M. A. Pronzato, and C. Domenicotti, “p38mapk inhibi- tion: a new combined approach to reduce neuroblastoma resistance under etoposide treatment,” Cell Death Dis, 2013.

[177] R. M. Neve, H. Sutterluty, N. Pullen, H. A. Lane, J. M. Daly, W. Krek, and N. E. Hynes, “Effects of oncogenic ErbB2 on G1 cell cycle regulators in breast tumour cells,” Oncogene, 2000.

[178] T. Kitazaki, M. Oka, Y. Nakamura, J. Tsurutani, S. Doi, M. Yasunaga, M. Takemura, H. Yabuuchi, H. Soda, and S. Kohno, “Gefitinib, an EGFR tyrosine kinase inhibitor directly inhibits the function of P-glycoprotein in multidrug resistant cancer cells.” Lung Cancer, 2005.

[179] Z. Li, S. Biswas, B. Liang, X. Zou, L. Shan, Y. Li, R. Fang, and J. Niu, “Integrin beta6 serves as an immunohistochemical marker for lymph node metastasis and promotes cell invasiveness in cholangiocarcinoma,” Sci Rep, vol. 6, p. 30081, 2016.

[180] K. Desai, M. G. Nair, J. S. Prabhu, A. Vinod, A. Korlimarla, S. Rajarajan, R. Aiyappa, R. S. Kaluve, A. Alexander, P. S. Hari, G. Mukherjee, R. V. Kumar, S. Manjunath, M. Correa, B. S. Srinath, S. Patil, M. S. Prasad, K. S. Gopinath, R. N. Rao, S. M. Violette, P. H. Weinreb, and T. S. Sridhar, “High expression of integrin beta6 in association with the Rho-Rac pathway identifies a poor prognostic subgroup within HER2 amplified breast cancers,” Cancer Med, 2016.

[181] S. B. Reyes, A. S. Narayanan, H. S. Lee, J. H. Tchaicha, K. D. Aldape, F. F. Lang, K. F. Tolias, and J. H. McCarty, “alphavbeta8 integrin interacts with RhoGDI1 to regulate Rac1 and Cdc42 activation and drive glioblastoma cell invasion,” Mol Biol Cell, 2013.

[182] V. Kumar, U. K. Soni, V. K. Maurya, K. Singh, and R. K. Jha, “Integrin beta8 (ITGB8) activates VAV-RAC1 signaling via FAK in the acquisition of endometrial epithelial cell receptivity for blastocyst implantation,” Sci Rep, 2017.

[183] W. V. Zhang, Y. Yang, R. W. Berg, E. Leung, and G. W. Krissansen, “The small GTP-binding proteins Rho and Rac induce T cell adhesion to the mucosal addressin MAdCAM-1 in a hierarchical fashion,” Eur J Immunol, 1999.

131 [184] T. Toyofuku, M. Yabuki, J. Kamei, M. Kamei, N. Makino, A. Kumanogoh, and M. Hori, “Semaphorin-4a, an activator for T-cell-mediated immunity,” suppresses an- giogenesis via Plexin-D1. ” EMBO J, 2007.

[185] Y. Cao, T. Jiang, and T. Girke, “Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing,” Bioin- formatics, 2010.

[186] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov, “Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles,” Proc Natl Acad Sci U S A, 2005.

[187] A. Fabregat, K. Sidiropoulos, G. Viteri, O. Forner, P. Marin-Garcia, V. Arnau, P. D’Eustachio, L. Stein, and H. Hermjakob, “Reactome pathway analysis: a high- performance in-memory approach,” BMC Bioinformatics, 2017.

[188] A. Kr¨amer,J. Green, J. Pollard Jr, and S. Tugendreich, “Causal analysis approaches in ingenuity pathway analysis,” Bioinformatics, vol. 30, no. 4, pp. 523–530, 2013.

[189] N. Alcaraz, J. Pauling, R. Batra, E. Barbosa, A. Junge, A. G. Christensen, V. Azevedo, H. J. Ditzel, and J. Baumbach, “Keypathwayminer 4,” vol. 8, p. 99, 2014.

[190] A. M. Smith, R. Ammar, C. Nislow, and G. Giaever, “A survey of yeast genomic assays for drug and target discovery,” Pharmacol. Ther., vol. 127, no. 2, pp. 156–164, Aug. 2010.

[191] S. M. B. Nijman, “Functional genomics to uncover drug mechanism of action,” Nat. Chem. Biol., vol. 11, no. 12, pp. 942–948, Dec. 2015.

[192] G. Giaever, P. Flaherty, J. Kumm, M. Proctor, C. Nislow, D. F. Jaramillo, A. M. Chu, M. I. Jordan, A. P. Arkin, and R. W. Davis, “Chemogenomic profiling: identifying the functional interactions of small molecules in yeast,” Proc. Natl. Acad. Sci. U. S. A., vol. 101, no. 3, pp. 793–798, 20 Jan. 2004.

[193] T. Roemer, J. Davies, G. Giaever, and C. Nislow, “Bugs, drugs and chemical ge- nomics,” Nat. Chem. Biol., vol. 8, no. 1, pp. 46–56, Jan. 2012.

[194] G. Giaever, P. Flaherty, J. Kumm, M. Proctor, C. Nislow, D. F. Jaramillo, A. M. Chu, M. I. Jordan, A. P. Arkin, and R. W. Davis, “Chemogenomic profiling: identifying the functional interactions of small molecules in yeast,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 3, pp. 793–798, 2004.

[195] H. N. Chua and F. P. Roth, “Discovering the targets of drugs via computational systems biology,” Journal of Biological Chemistry, vol. 286, no. 27, pp. 23 653–23 658, 2011.

132 [196] P. Y. Lum, C. D. Armour, S. B. Stepaniants, G. Cavet, M. K. Wolf, J. S. Butler, J. C. Hinshaw, P. Garnier, G. D. Prestwich, A. Leonardson et al., “Discovering modes of action for therapeutic compounds using a genome-wide screen of yeast heterozygotes,” Cell, vol. 116, no. 1, pp. 121–137, 2004.

[197] S. Hoon, A. M. Smith, I. M. Wallace, S. Suresh, M. Miranda, E. Fung, M. Proctor, K. M. Shokat, C. Zhang, R. W. Davis et al., “An integrated platform of genomic assays reveals small-molecule bioactivities,” Nature chemical biology, vol. 4, no. 8, pp. 498–506, 2008.

[198] M. E. Hillenmeyer, E. Fung, J. Wildenhain, S. E. Pierce, S. Hoon, W. Lee, M. Proctor, R. P. St Onge, M. Tyers, D. Koller, R. B. Altman, R. W. Davis, C. Nislow, and G. Giaever, “The chemical genomic portrait of yeast: uncovering a phenotype for all genes,” Science, vol. 320, no. 5874, pp. 362–365, 18 Apr. 2008.

[199] A. Y. Lee, R. P. St Onge, M. J. Proctor, I. M. Wallace, A. H. Nile, P. A. Spagnuolo, Y. Jitkova, M. Gronda, Y. Wu, M. K. Kim, K. Cheung-Ong, N. P. Torres, E. D. Spear, M. K. L. Han, U. Schlecht, S. Suresh, G. Duby, L. E. Heisler, A. Surendra, E. Fung, M. L. Urbanus, M. Gebbia, E. Lissina, M. Miranda, J. H. Chiang, A. M. Aparicio, M. Zeghouf, R. W. Davis, J. Cherfils, M. Boutry, C. A. Kaiser, C. L. Cummins, W. S. Trimble, G. W. Brown, A. D. Schimmer, V. A. Bankaitis, C. Nislow, G. D. Bader, and G. Giaever, “Mapping the cellular response to small molecules using chemogenomic fitness signatures,” Science, vol. 344, no. 6180, pp. 208–211, 11 Apr. 2014.

[200] D. Hoepfner, S. B. Helliwell, H. Sadlish, S. Schuierer, I. Filipuzzi, S. Brachat, B. Bhullar, U. Plikat, Y. Abraham, M. Altorfer, T. Aust, L. Baeriswyl, R. Cerino, L. Chang, D. Estoppey, J. Eichenberger, M. Frederiksen, N. Hartmann, A. Hohendahl, B. Knapp, P. Krastel, N. Melin, F. Nigsch, E. J. Oakeley, V. Petitjean, F. Petersen, R. Riedl, E. K. Schmitt, F. Staedtler, C. Studer, J. A. Tallarico, S. Wetzel, M. C. Fishman, J. A. Porter, and N. R. Movva, “High-resolution chemical dissection of a model eukaryote reveals targets, pathways and gene functions,” Microbiol. Res., vol. 169, no. 2-3, pp. 107–120, Feb. 2014.

[201] U. Rix, R. Uwe, and S.-F. Giulio, “Target profiling of small molecules by chemical proteomics,” Nat. Chem. Biol., vol. 5, no. 9, pp. 616–624, 2009.

[202] D. G. Teotico, K. Babaoglu, G. J. Rocklin, R. S. Ferreira, A. M. Giannetti, and B. K. Shoichet, “Docking for fragment inhibitors of AmpC beta-lactamase,” Proc. Natl. Acad. Sci. U. S. A., vol. 106, no. 18, pp. 7455–7460, 5 May 2009.

[203] S. L. Schreiber, “Target-oriented and diversity-oriented organic synthesis in drug dis- covery,” Science, vol. 287, no. 5460, pp. 1964–1969, 17 Mar. 2000.

[204] M. J. Keiser, S. Vincent, J. J. Irwin, L. Christian, A. I. Abbas, S. J. Hufeisen, N. H. Jensen, M. B. Kuijer, R. C. Matos, T. B. Tran, W. Ryan, R. A. Glennon, H. J´erˆome, K. L. H. Thomas, D. D. Edwards, B. K. Shoichet, and B. L. Roth, “Predicting new molecular targets for known drugs,” Nature, vol. 462, no. 7270, pp. 175–181, 2009.

133 [205] S. A. Chervitz, L. Aravind, G. Sherlock, C. A. Ball, E. V. Koonin, S. S. Dwight, M. A. Harris, K. Dolinski, S. Mohr, T. Smith et al., “Comparison of the complete protein sets of worm and yeast: orthology and divergence,” Science, vol. 282, no. 5396, pp. 2022–2028, 1998. [206] K. L. McGary, T. J. Park, J. O. Woods, H. J. Cha, J. B. Wallingford, and E. M. Mar- cotte, “Systematic discovery of nonobvious human disease models through orthologous phenotypes,” Proceedings of the National Academy of Sciences, vol. 107, no. 14, pp. 6544–6549, 2010. [207] S. D. Kohlwein, “Obese and anorexic yeasts: experimental models to understand the metabolic syndrome and lipotoxicity,” Biochimica et Biophysica Acta (BBA)- Molecular and Cell Biology of Lipids, vol. 1801, no. 3, pp. 222–229, 2010. [208] G. Giaever, D. D. Shoemaker, T. W. Jones, H. Liang, E. A. Winzeler, A. Astromoff, and R. W. Davis, “Genomic profiling of drug sensitivities via induced haploinsuffi- ciency,” Nature genetics, vol. 21, no. 3, pp. 278–283, 1999. [209] M. Costanzo, B. VanderSluis, E. N. Koch, A. Baryshnikova, C. Pons, G. Tan, W. Wang, M. Usaj, J. Hanchard, S. D. Lee et al., “A global genetic interaction network maps a wiring diagram of cellular function,” Science, vol. 353, no. 6306, p. aaf1420, 2016. [210] A. H. Yan Tong and C. Boone, “Synthetic genetic array analysis in saccharomyces cerevisiae,” Yeast Protocol, pp. 171–191, 2006. [211] A. B. Parsons, R. L. Brost, D. Huiming, L. Zhijian, Z. Chaoying, S. Bilal, G. W. Brown, P. M. Kane, T. R. Hughes, and B. Charles, “Integration of chemical-genetic and genetic interaction data links bioactive compounds to cellular target pathways,” Nat. Biotechnol., vol. 22, no. 1, pp. 62–69, 2003. [212] M. A. Heiskanen and T. Aittokallio, “Predicting drug –target interactions through integrative analysis of chemogenetic assays in yeast,” Mol. Biosyst., vol. 9, no. 4, pp. 768–779, 31 Jan. 2013. [213] A. Baryshnikova, M. Costanzo, C. L. Myers, B. Andrews, and C. Boone, “Genetic interaction networks: toward an understanding of heritability,” Annu. Rev. Genomics Hum. Genet., vol. 14, pp. 111–133, 26 June 2013. [214] A. Bender and J. R. Pringle, “Use of a screen for synthetic lethal and multicopy sup- pressee mutants to identify two new genes involved in morphogenesis in saccharomyces cerevisiae,” Mol. Cell. Biol., vol. 11, no. 3, pp. 1295–1305, Mar. 1991. [215] L. Guarente, “Synthetic enhancement in gene interaction: a genetic tool come of age,” Trends Genet., vol. 9, no. 10, pp. 362–366, Oct. 1993. [216] A. H. Tong, M. Evangelista, A. B. Parsons, H. Xu, G. D. Bader, N. Pag´e,M. Robinson, S. Raghibizadeh, C. W. Hogue, H. Bussey, B. Andrews, M. Tyers, and C. Boone, “Systematic genetic analysis with ordered arrays of yeast deletion mutants,” Science, vol. 294, no. 5550, pp. 2364–2368, 14 Dec. 2001.

134 [217] R. P. St Onge, R. Mani, J. Oh, M. Proctor, E. Fung, R. W. Davis, C. Nislow, F. P. Roth, and G. Giaever, “Systematic pathway analysis using high-resolution fitness pro- filing of combinatorial gene deletions,” Nat. Genet., vol. 39, no. 2, pp. 199–206, Feb. 2007. [218] B. L. Drees, V. Thorsson, G. W. Carter, A. W. Rives, M. Z. Raymond, I. Avila- Campillo, P. Shannon, and T. Galitski, “Derivation of genetic interaction networks from quantitative phenotype data,” Genome Biol., vol. 6, no. 4, p. R38, 31 Mar. 2005. [219] R. Mani, R. P. St.Onge, J. L. Hartman, G. Giaever, and F. P. Roth, “Defining genetic interaction,” Proceedings of the National Academy of Sciences, vol. 105, no. 9, pp. 3461–3466, 2008. [220] T. Ben-Shitrit, N. Yosef, K. Shemesh, R. Sharan, E. Ruppin, and M. Kupiec, “Sys- tematic identification of gene annotation errors in the widely used yeast mutation collections,” Nature methods, vol. 9, no. 4, pp. 373–378, 2012. [221] S. G. Addinall, M. Downey, M. Yu, M. K. Zubko, J. Dewar, A. Leake, J. Hallinan, O. Shaw, K. James, D. J. Wilkinson et al., “A genomewide suppressor and enhancer analysis of cdc13-1 reveals varied cellular processes influencing telomere capping in saccharomyces cerevisiae,” Genetics, vol. 180, no. 4, pp. 2251–2266, 2008. [222] X. Pan, P. Ye, D. S. Yuan, X. Wang, J. S. Bader, and J. D. Boeke, “A dna integrity network in the yeast saccharomyces cerevisiae,” Cell, vol. 124, no. 5, pp. 1069–1081, 2006. [223] A. H. Y. Tong, M. Evangelista, A. B. Parsons, H. Xu, G. D. Bader, N. Page, M. Robin- son, S. Raghibizadeh, C. W. Hogue, H. Bussey et al., “Systematic genetic analysis with ordered arrays of yeast deletion mutants,” Science, vol. 294, no. 5550, pp. 2364–2368, 2001. [224] A. Baryshnikova, M. Costanzo, Y. Kim, H. Ding, J. Koh, K. Toufighi, J.-Y. Youn, J. Ou, B.-J. San Luis, S. Bandyopadhyay et al., “Quantitative analysis of fitness and genetic interactions in yeast on a genome scale,” Nature methods, vol. 7, no. 12, pp. 1017–1024, 2010. [225] L. Jerby-Arnon, N. Pfetzer, Y. Y. Waldman, L. McGarry, D. James, E. Shanks, B. Seashore-Ludlow, A. Weinstock, T. Geiger, P. A. Clemons et al., “Predicting cancer- specific vulnerability via data-driven detection of synthetic lethality,” Cell, vol. 158, no. 5, pp. 1199–1209, 2014. [226] J. M. Paul, S. D. Templeton, B. Akanksha, F. Andrew, and F. J. Vizeacoumar, “Build- ing high-resolution synthetic lethal networks: a google map of the cancer cell,” Trends Mol. Med., vol. 20, no. 12, pp. 704–715, 2014. [227] M. Kuhn, K. Michael, S. Damian, P.-F. Sune, T. H. Blicher, C. von Mering, L. J. Jensen, and B. Peer, “STITCH 4: integration of protein–chemical interactions with user data,” Nucleic Acids Res., vol. 42, no. D1, pp. D401–D407, 2013.

135 [228] P. Jiang, H. Wang, W. Li, C. Zang, B. Li, Y. J. Wong, C. Meyer, J. S. Liu, J. C. Aster, and X. S. Liu, “Network analysis of gene essentiality in functional genomics experiments,” Genome biology, vol. 16, no. 1, p. 239, 2015. [229] R. B. Russell and P. Aloy, “Targeting and tinkering with interaction networks,” Nature chemical biology, vol. 4, no. 11, pp. 666–673, 2008. [230] A. Reinke, J. C.-Y. Chen, S. Aronova, and T. Powers, “Caffeine targets TOR complex I and provides evidence for a regulatory link between the FRB and kinase domains of tor1p,” J. Biol. Chem., vol. 281, no. 42, pp. 31 616–31 626, 20 Oct. 2006. [231] V. Wanke, E. Cameroni, A. Uotila, M. Piccolis, J. Urban, R. Loewith, and C. De Vir- gilio, “Caffeine extends yeast lifespan by targeting TORC1,” Mol. Microbiol., vol. 69, no. 1, pp. 277–285, July 2008. [232] M. E. Cardenas and J. Heitman, “FKBP12-rapamycin target TOR2 is a vacuolar protein with an associated phosphatidylinositol-4 kinase activity,” EMBO J., vol. 14, no. 23, pp. 5892–5907, 1 Dec. 1995. [233] M. Gustavsson and H. Ronne, “Evidence that tRNA modifying enzymes are important in vivo targets for 5-fluorouracil in yeast,” RNA, vol. 14, no. 4, pp. 666–674, Apr. 2008. [234] L. Kapitzky, P. Beltrao, T. J. Berens, N. Gassner, C. Zhou, A. W¨uster,J. Wu, M. M. Babu, S. J. Elledge, D. Toczyski, R. S. Lokey, and N. J. Krogan, “Cross-species chemogenomic profiling reveals evolutionarily conserved drug mode of action,” Mol. Syst. Biol., vol. 6, p. 451, 21 Dec. 2010. [235] T. C. Clarke, L. I. Black, B. J. Stussman, P. M. Barnes, and R. L. Nahin, “Trends in the use of complementary health approaches among adults: United states, 2002–2012,” National health statistics reports, no. 79, p. 1, 2015. [236] R. Teschke, A. Wolff, C. Frenzel, A. Eickhoff, and J. Schulze, “Herbal traditional chinese medicine and its evidence base in gastrointestinal disorders,” World J Gas- troenterol, vol. 21, no. 15, pp. 4466–4490, 2015. [237] A. Burke, D. M. Upchurch, C. Dye, and L. Chyu, “Acupuncture use in the united states: Findings from the national health interview survey,” Journal of Alternative & Complementary Medicine, vol. 12, no. 7, pp. 639–648, 2006. [238] C. C. Xue, A. L. Zhang, V. Lin, C. Da Costa, and D. F. Story, “Complementary and alternative medicine use in australia: a national population-based survey,” The Journal of Alternative and Complementary Medicine, vol. 13, no. 6, pp. 643–650, 2007. [239] X. Zhou, B. Liu, X. Zhang, Q. Xie, R. Zhang, Y. Wang, and Y. Peng, “Data Mining in Real-World Traditional Chinese Medicine Clinical Data Warehouse,” pp. 189–213, 2014. [240] M. Oti, M. A. Huynen, and H. G. Brunner, “Phenome connections,” Trends in genetics, vol. 24, no. 3, pp. 103–106, 2008.

136 [241] K. J. Cios and G. W. Moore, “Uniqueness of medical data mining,” Artificial intelli- gence in medicine, vol. 26, no. 1, pp. 1–24, 2002.

[242] X. Zhou, S. Chen, B. Liu, R. Zhang, Y. Wang, P. Li, Y. Guo, H. Zhang, Z. Gao, and X. Yan, “Development of traditional Chinese medicine clinical data warehouse for medical knowledge discovery and decision support,” Artificial Intelligence in Medicine, vol. 48, no. 2, pp. 139–152, 2010.

[243] H. Xu, Y. Wang, and T. Zhang, “Data Mining of Regularities and Rules of Compound Herbal Formulae for Nonalcoholic Fatty Liver Disease,” Chinese Journal of Informa- tion on Traditional Chinese Medicine, pp. 21(8):38–41, 2014.

[244] N. L. Zhang, S. Yuan, T. Chen, and Y. Wang, “Latent tree models and diagnosis in traditional chinese medicine,” Artificial intelligence in medicine, vol. 42, no. 3, pp. 229–245, 2008.

[245] J. Chen, J. Poon, S. K. Poon, L. Xu, and D. M. Y. Sze, “Mining symptom-herb patterns from patient records using tripartite graph,” Evidence-Based Complementary and Alternative Medicine, vol. 2015, 2015.

[246] T. Hofmann, “Probabilistic latent semantic analysis,” in Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 1999, pp. 289–296.

[247] X. Zhang, X. Zhou, H. Huang, S. Chen, and B. Liu, “A hierarchical symptom-herb topic model for analyzing traditional Chinese medicine clinical diabetic data,” vol. 6, pp. 2246–2249, 2010.

[248] W. Xie, Y. Zhao, and L. Du, “Emerging approaches of traditional Chinese medicine formulas for the treatment of hyperlipidemia,” Journal of Ethnopharmacology, vol. 140, no. 2, pp. 345–367, 2012.

[249] X.-p. Zhang, X.-z. Zhou, H.-k. Huang, Q. Feng, S.-b. Chen, and B.-y. Liu, “Topic model for chinese medicine diagnosis and prescription regularities analysis: case on diabetes,” Chinese journal of integrative medicine, vol. 17, pp. 307–313, 2011.

[250] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incom- plete data via the em algorithm,” Journal of the royal statistical society. Series B (methodological), pp. 1–38, 1977.

[251] J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent pattern mining: current status and future directions,” Data Mining and Knowledge Discovery, vol. 15, no. 1, pp. 55–86, 2007.

[252] J. Chen, G. Xi, Y. Xing, J. Chen, and J. Wang, “Global comparison study of five data mining methods in predicting syndrome in coronary heart disease,” J Comput Inform Syst, vol. 4, no. 1, pp. 219–224, 2008.

137 [253] P. He, K. Deng, Z. Liu, D. Liu, J. S. Liu, and Z. Geng, “Discovering herbal functional groups of traditional Chinese medicine,” Statistics in medicine, vol. 31, no. 7, pp. 636–642, 2012.

[254] M. J. Guang Zheng Cheng Lu, and Aiping Lu, “Prescription analysis and mining,” Data Analytics for Traditional Chinese Medicine Research Springer, pp. pages 97–109, 2014.

[255] O. Bodenreider, “The unified medical language system (umls): integrating biomedical terminology,” Nucleic acids research, vol. 32, no. suppl 1, pp. D267–D270, 2004.

[256] J. Davidson, “Seizures and bupropion: a review.” The Journal of clinical psychiatry, vol. 50, no. 7, pp. 256–261, 1989.

[257] eHealthMe, “Review: could ativan(r) cause high blood sugar?” http://www. ehealthme.com/ds/Ativan(R)/high+blood+sugar, 2014.

[258] C. for Disease Control, P. (CDC et al., “Adverse events associated with ephedrine- containing products–texas, december 1993-september 1995.” MMWR. Morbidity and mortality weekly report, vol. 45, no. 32, p. 689, 1996.

[259] B. J. Welch and et al., “Biochemical and stone-risk profiles with topiramate treat- ment,” American journal of kidney diseases, vol. 48, no. 4, pp. 555–563, 2006.

[260] R. Leaman, L. Wojtulewicz, R. Sullivan, A. Skariah, J. Yang, and G. Gonzalez, “To- wards internet-age pharmacovigilance: extracting adverse drug reactions from user posts to health-related social networks,” in Proceedings of the 2010 workshop on biomedical natural language processing. Association for Computational Linguistics, 2010, pp. 117–125.

[261] R. W. White and E. Horvitz, “Studies of the onset and persistence of medical concerns in search logs,” in Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2012, pp. 265–274.

[262] Q. Zeng, S. Kogan, N. Ash, R. Greenes, and A. Boxwala, “Characteristics of consumer terminology for health information retrieval,” Methods of information in medicine, vol. 41, no. 4, pp. 289–298, 2002.

[263] eHealthMe, “Review: could phentermine cause bone and joint pain?” http://www. ehealthme.com/ds/phentermine/bone+and+joint+pain, 2014.

[264] S. Hornby, “Effects of methadone with xanax,” http://www.livestrong.com/article/ 260570-effects-of-methadone-with-xanax/, 2010.

[265] S. J. Kish, “Pharmacologic mechanisms of crystal meth,” Canadian Medical Associa- tion Journal, vol. 178, no. 13, pp. 1679–1682, 2008.

[266] C. Parker, “Zoloft drug interactions,” http://www.drugsdb.com/rx/zoloft/ zoloft-drug-interactions/, 2012.

138 [267] R. Crenshaw, “Prozac and premature ejaculation,” in Annual Meeting of the American Association of Sex Educators, Counselors and Therapists, Orlando, FL, 1992.

[268] U. A. Boelsterli, “Diclofenac-induced liver injury: a paradigm of idiosyncratic drug toxicity,” Toxicology and applied pharmacology, vol. 192, no. 3, pp. 307–322, 2003.

[269] I. Lee, U. M. Blom, P. I. Wang, J. E. Shim, and E. M. Marcotte, “Prioritizing candidate disease genes by network-based boosting of genome-wide association data,” Genome research, vol. 21, no. 7, pp. 1109–1121, 2011.

[270] D. A. Karnofsky, W. H. Abelmann, L. F. Craver, and J. H. Burchenal, “The use of the nitrogen mustards in the palliative treatment of carcinoma. with particular reference to bronchogenic carcinoma,” Cancer, vol. 1, no. 4, pp. 634–656, 1948.

[271] T. H. Kim, J. S. Kim, Z. H. Kim, R. B. Huang, and R. S. Wang, “Khz (fusion of ganoderma lucidum and polyporus umbellatus mycelia) induces apoptosis in a549 human lung cancer cells by generating reactive oxygen species and decreasing the mitochondrial membrane potential,” Food Science and Biotechnology, vol. 23, no. 3, pp. 859–864, 2014.

[272] Y. Zhang, Q. Wang, T. Wang, H. Zhang, Y. Tian, H. Luo, S. Yang, Y. Wang, and X. Huang, “Inhibition of human gastric carcinoma cell growth in vitro by a polysaccha- ride from aster tataricus,” International journal of biological macromolecules, vol. 51, no. 4, pp. 509–513, 2012.

[273] C. Ramaker, J. Marinus, A. M. Stiggelbout, and B. J. Van Hilten, “Systematic evalu- ation of rating scales for impairment and disability in parkinson’s disease,” Movement Disorders, vol. 17, no. 5, pp. 867–876, 2002.

[274] C. Shi, Z. Zheng, Q. Wang, C. Wang, D. Zhang, M. Zhang, P. Chan, and X. Wang, “Exploring the effects of genetic variants on clinical profiles of parkinsons disease assessed by the unified parkinsons disease rating scale and the hoehn–yahr stage,” PloS one, vol. 11, no. 6, p. e0155758, 2016.

[275] M. Ester, H.-P. Kriegel, J. Sander, X. Xu et al., “A density-based algorithm for dis- covering clusters in large spatial databases with noise.” in Kdd, vol. 96, no. 34, 1996, pp. 226–231.

[276] M. W. Johns, “A new method for measuring daytime sleepiness: the epworth sleepiness scale.” sleep, vol. 14, no. 6, pp. 540–545, 1991.

139