The Pharmacogenomics Journal (2005) 5, 381–399 © 2005 Nature Publishing Group All rights reserved 1470-269X/05 $30.00 www.nature.com/tpj ORIGINAL ARTICLE

Linking pathway expressions to the growth inhibition response from the National Cancer Institute’s anticancer screen and drug mechanism of action

R Huang1, A Wallqvist2, N Thanki2 and DG Covell1

1Developmental Therapeutics Program, Screening Technologies Branch, Laboratory of Computational Technologies, National Cancer Institute-Frederick, Frederick, MD, USA; 2Science Applications International Corporation, National Cancer Institute at Frederick, NIH, Frederick, MD, USA

Correspondence: Dr DG Covell, Developmental Therapeutics Program, Screening Technologies Branch, Laboratory of Computational Technologies, National Cancer Institute-Frederick, Bldg. 1052/235, Frederick, MD 21702, USA. Tel: +1 301 846 5785; Fax: +1 301 846 6978; E-mail: [email protected]

ABSTRACT
Novel strategies are proposed to quantitatively analyze and relate biological pathways to drug responses using gene expression and small-molecule growth inhibition data (GI50) derived from the National Cancer Institute’s 60 cancer cells (NCI60). We have annotated groups of drug GI50 responses with pathways defined by the Kyoto Encyclopedia of Genes and Genomes (KEGG) and BioCarta, and functional categories defined by Gene Ontology (GO), through correlations between pathway gene expression patterns and drug GI50 profiles. Drug–gene-pathway relationships may then be utilized to find drug targets or target-specific drugs. Significantly correlated pathways and the gene products involved represent interesting targets for further exploration, whereas drugs that are significantly correlated with only certain pathways are more likely to be target specific. Separate pathway clustering finds that pathways engaged in the same biological process tend to have similar drug correlation patterns. The biological and statistical significances of our method are established by comparison to known small-molecule inhibitor–gene target relationships reported in the literature and by standard randomization procedures. The results of our pathway, gene expression and drug-induced growth inhibition associations can serve as a basis for proposing testable hypotheses about potential anticancer drugs, their targets, and mechanisms of action.
The Pharmacogenomics Journal (2005) 5, 381–399. doi:10.1038/sj.tpj.6500331; published online 16 August 2005

Keywords: data mining; drug discovery; pathway; gene expression; cytotoxicity; small-molecule inhibitors

Received: 26 May 2005; Revised: 11 July 2005; Accepted: 25 July 2005; Published online: 16 August 2005

Our reference to ‘drug’ in this manuscript is not a clinical ‘drug’ per se, but generally a small-molecule compound that has been screened in the NCI60 for anticancer activity.

INTRODUCTION
The highly complex cellular regulatory networks and their interactions with small molecules (A small molecule is a molecule of relatively low weight, perhaps less than 100 atoms. It is the opposite of a macromolecule, such as a protein or DNA.) present challenges to our mechanistic understanding of drug action. Deeper insights into the fundamental mechanisms of cellular functions and pathway regulations are likely to be critical for the development of rational approaches directed at the identification of molecular targets and candidate inhibitors. While the antitumor activity of current anticancer drugs is reflected in cell killing, mechanism-based studies attempt to specifically associate a drug’s effect to one or many cellular regulation mechanisms.1 Individual protein targets of a small molecule may be involved in diverse cellular processes, some or all of which could contribute to the killing potential of a compound. Furthermore, environmental factors such as temperature, radiation, hypoxia, nutrients, as well as drugs stimulate an adaptive sensory and signaling machinery of the cell, and therefore may influence drug sensitivity, cell survival, and apoptosis. It is thus relevant to pursue a global and in-depth vision of the general regulatory circuits of cellular functions when attempting to better understand a compound’s mechanism of action (MOA).

One general approach to identify target-specific agents as a basis for understanding a drug’s MOA is to relate gene expression patterns measured across a diverse set of tumor cell models to drug-induced chemosensitivity of these same cells.2–16 Previous efforts using this strategy have focused mainly on finding causes of drug resistance.2–4,11,17,18 Gene expression signatures have also been used as surrogate markers of cellular states, for example, to identify agents that induce the differentiation of acute myeloid leukemia cells.19 However, nearly all of these investigations have been based on single gene expression–drug response relationships, whereas complex interactions between a drug and highly interconnected biological networks may not be reflected solely by the state of any one gene. Moreover, quantitative assessments that associate significant correlations between gene expression levels and drug sensitivity as a basis for validating a biologically significant connection are not yet a standard practice.

Previously, we examined correlative associations between gene expression coherence within predefined pathways and functionally related gene groups.20 Our goals in the present study are to extend this work by incorporating the existing biological pathway and gene expression information with drug chemosensitivity data, and apply these findings toward the deconvolution of a drug’s MOA. Our analysis proposes a novel strategy for examining entire pathways according to each pathway’s component gene expressions, and establishing a quantitative measure of biological significance represented by correlations in gene expression and drug response patterns. The strongest pathway–drug associations are used to validate or gain a better understanding of a drug’s MOA, and, subsequently, to propose agents that can perturb specific pathways. Our analysis focuses on the microarray constitutive gene expression data measured across the NCI’s 60-tumor cell screen (NCI60), and cytotoxicity data generated in the same cell lines from in vitro anticancer drug screening. These immortalized tumor cell lines reflect diverse cell lineages as they are derived from lung, renal, colorectal, ovarian, breast, prostate, central nervous system, melanoma, and hematological malignancies. Since its inception in 1990, cytotoxicity measures for over 40 000 compounds have been obtained that are publicly available. We have organized, specifically, the tumor cell growth inhibition (GI50) data into self-organizing maps (SOMs).21 GI50 growth patterns have been found to be an information-rich resource for establishing a compound’s MOA.16,22–24 SOM clustering of the GI50 data segregates compounds into nine major response categories: mitosis (M), membrane function (N), nucleic acid metabolism (S), metabolic stress and cell survival (Q), kinases/phosphatases and oxidative stress (P), and four unexplored regions R, F, J, and V.16,23,24 Each of these regions is further divided into a total of 80 clades (a clade is a group of clusters (nodes) that share similar cytotoxic responses) (subregions: M1–M8, N1–N13, P1–P8, Q1–Q7, R1–R7, S1–S13, F1–F8, J1–J8, V1–V8) (Figure 1). The current SOM extends our previously published analysis to include the existing complement of newly screened compounds.23

Figure 1 Cytotoxicity measurements for nearly 30 000 small molecules screened against the NCI’s tumor cell panel (NCI60) were clustered into 1350 groups or nodes using an SOM.23 Each cluster is represented by a hexagon and neighboring clusters identify compounds with similar cytotoxicity responses. These nodes are separated into nine major response categories (regions): mitosis (M), membrane function (N), nucleic acid metabolism (S), metabolic stress and cell survival (Q), kinases/phosphatases and oxidative stress (P), and four unexplored regions R, F, J, and V. Each of these regions is colored differently and further divided into a total of 80 clades (a clade is a group of clusters that share similar cytotoxic responses) (subregions: M1–M8, N1–N13, P1–P8, Q1–Q7, R1–R7, S1–S13, F1–F8, J1–J8, V1–V8), shown here by the black boundary lines.

Gene expression patterns across the NCI60 can be organized in terms of predefined pathways or functional


categories annotated by the Kyoto Encyclopedia of Genes and Genomes (KEGG), BioCarta, and Gene Ontology (GO). These widely used, publicly available gene annotations will be employed to link pathways to drug responses through correlations between pathway gene expression patterns and drug GI50 response profiles clustered in SOM clades. Implicit in this design is the assumption that cytotoxicity profiles most strongly associated with gene expression profiles for genes within a defined pathway are valid indications of a test compound perturbing a specific pathway; conversely, the gene products in the pathway would play major roles in dictating the cytotoxic activity of the compound. Our approach associates drug responses in each clade with subsets of pathways. These subsets of pathways can then be clustered based on their positions on the GI50 SOM. Pathway gene expression coherence levels20 will also be discussed in terms of targetability. We will establish a quantitative measure of the degree to which a significant pathway–drug response correlation represents a biologically significant association via comparison to known drug–target relationships. Assignment of putative MOAs for agents clustered in each SOM response region are then postulated to involve pathways that can be significantly correlated with these agents.

RESULTS
An important challenge to associating gene expression in the context of biological pathways involves the formulation of effective strategies to relate drug action to precise molecular targets. Since the practical goal of our strategy is to utilize drug–gene-pathway relationships to propose novel drug targets or target-specific drugs, the biological response of a molecular activity, such as a GI50 profile, represents a crude, albeit relevant, readout of the drug’s interaction within a cellular milieu. For example, if a drug interacts with one gene product, the entire pathway or pathways having this gene may be disturbed, and a direct correlation may not be apparent. Alternatively, the consequence of a drug action may be revealed by correlations of the drug response to other gene expressions within the pathways containing the putative target. Cases where single drug–gene correlations are not directly apparent may be revealed by this broader examination of related genes within a pathway.

Finding Pathways Correlated with Specific Drug Responses: Mapping Pathways to GI50 SOM Clades
A hallmark of targeted molecular therapies is over-expression of the drug’s molecular target. Strong support for this rationale can be found within the NCI60 screen, as evidenced by positive correlations between gene expressions of the proteasome and heat shock to Velcade® and geldanamycin, respectively.16 Extending these cytotoxicity–gene expression correlations to pathways is an attempt to establish a pathway-centric perspective to a drug’s MOA. Instead of examining each individual drug–gene correlation, correlations between the GI50 SOM clades (clusters of drugs with similar GI50 profiles) and pathways as collections of genes are evaluated as a more general approach. For each pathway, correlations with GI50 clades are compared between genes ‘on a pathway’ with genes ‘off a pathway’, and the Kruskal–Wallis H statistic is calculated (see Data and methods section for computation details). A pathway is considered significantly correlated with the drug clade if H > 0 and P < 0.05. Our underlying reasoning takes advantage of clustering results that partition response data into a smaller number of groups, and at the same time reduces the computational requirements. With this approach, the ~30 000 drug GI50 response profiles are condensed to 80 SOM clades (response regions), and 5K genes are condensed to hundreds of pathways.

Using this method, each pathway in KEGG, BioCarta, and each GO term are mapped onto the GI50 SOM, where each clade has an H-score, representing the strength of correlation between the pathway and the compounds in that clade. The most significantly and specifically correlated pathways are proposed as the most likely targets of the drugs within a clade. Figure 2 (left panel) shows the mapping of the 111 KEGG pathways on the GI50 SOM, where each horizontal band is a pathway, each vertical band is a SOM clade, and each small square is colored according to the correlation strength of that pathway’s gene expressions with that clade’s GI50 responses. A reddish color indicates that the clade’s GI50 patterns correlate stronger with the gene expressions in the pathway compared to expressions for genes not in the pathway, a bluish color indicates the opposite, and a yellow-greenish color indicates that there is no significant difference between the ‘on’ and ‘off’ pathway gene correlation strengths. The right panel of Figure 2 shows two examples of pathways mapped on the GI50 SOM, the MAPK signaling pathway (top) and oxidative phosphorylation (bottom). These two maps are also colored according to the pathway–clade correlation strength in a similar manner. Oxidative phosphorylation is one of the most cohesive KEGG pathways and it shows strong significant correlations with the agents in a few clades in the R-region (R5, P = 1.01 × 10⁻⁴ and R4, P = 1.12 × 10⁻³), V-region (V6, P = 1.33 × 10⁻⁶ and V3, P = 1.72 × 10⁻³), and N-region (N6, P = 4.26 × 10⁻³ and N13, P = 0.012). Noteworthy is that many known inhibitors, including the acetogenins,25–27 of mitochondrial complex I, which plays a critical role in the oxidative phosphorylation pathway, are mostly clustered in R5 and R4. The MAPK signaling pathway, on the other hand, is one of the least cohesive KEGG pathways. It correlates the strongest with parts of the N-region (N5, N7, N8), but none is significant; its strongest correlation only has an H-score of 3.26 (P = 0.07).

Hierarchical clustering of the pathway–clade H-score matrix further segregates the 111 KEGG pathways into 24 clusters (Figure 2, left panel). These clusters can be used to assess whether pathways involved in similar biological processes share similar correlation patterns. KEGG groups pathways into five general categories (Figure 3, top panel), Cellular Processes, Environmental Information Processing, Genetic Information Processing, Human Diseases, and Metabolism. These groups are further divided into 22 subcategories (Figure 3, bottom panel). Figure 3 shows the composition of each of the 24 pathway clusters, shown in

the same order as the dendrogram in Figure 2, i.e., adjacent clusters have similar correlation patterns, according to the five general categories (top panel) and the 22 subcategories (bottom panel). Pathways belonging to the same category generally cluster together, that is, they share similar GI50 clade correlation patterns. The 81 metabolic pathways are scattered across all 24 clusters; however, 15 of these clusters are composed solely of metabolic pathways (Figure 3, top panel). Furthermore, cluster #2 is composed solely of pathways involved in lipid metabolism, which are most correlated with compounds in SOM regions F and J (Figure 2, left panel); cluster #5 solely of amino acid metabolism, which are most correlated with the F-region; cluster #7 is composed of two other amino acid metabolism pathways, which are most correlated with regions J and Q; cluster #11 is composed solely of carbohydrate metabolism pathways, which are most correlated with the N-region; cluster #20 is composed solely of energy metabolism pathways, which are most correlated with regions R and V; and cluster #24 is composed solely of pathways responsible for glycan biosynthesis and metabolism, which are most correlated with the Q-region. Cluster #15 contains five of the eight human diseases pathways, which are most correlated with regions V, P, and M. Four of the eight pathways involved in genetic information processing are also grouped together in cluster #15, which is most correlated with regions N, S, V, and M. Five of the six signal transduction pathways, which is the major component of environmental information processing, fall into one cluster, cluster #9. Collectively, signal transduction pathways are among the least cohesive pathways; their correlations with the GI50 SOM regions are generally diffuse, lacking assignment to any one clade or region in GI50 response space. Conversely, this might also imply that it will be difficult to find drugs that target specifically to one of these signaling pathways.

[Figure 2: image not reproduced. Matrix axes: KEGG Pathways (rows) versus GI50 SOM Clades (columns), with SOM regions F, V, P, R, J, Q, S, N, M marked along the top. Side panels: MAPK signaling pathway (top) and Oxidative phosphorylation (bottom).]

Figure 2 Hierarchical clustering of 111 KEGG pathways. The middle panel of the figure shows a 111 × 80 matrix of row Z-score normalized Kruskal–Wallis H-scores, where each row represents a pathway, each column represents a GI50 SOM clade, and each small square is colored according to the value of the H-score, indicating the strength of association between a pathway and a clade. An orange-red color indicates a large positive H-score and the strongest association, a greenish yellow color indicates an H-score close to zero, and a dark blue color represents a large negative H-score indicative of the weakest association. The names of the SOM regions are listed at the top of the matrix. The dendrogram generated from the hierarchical clustering of pathways based on the H-scores is shown on the left side of the figure and the 111 pathways are arranged in the order as they appear in the dendrogram such that neighboring pathways have similar GI50 response associations. The right panel of the figure shows the mappings of the MAPK signaling pathway (top), a pathway with genes not coherently expressed, and the oxidative phosphorylation pathway (bottom), one of the most cohesive pathways, to the GI50 SOM. Each of these two pathway-to-GI50 maps is colored according to its corresponding row in the H-score matrix.

[Figure 3: image not reproduced. Both panels plot Pathway Count against Pathway Cluster (1–24). Top panel legend: Metabolism; Human Diseases; Genetic Information Processing; Environmental Information Processing; Cellular Processes. Bottom panel legend: Nucleotide Metabolism; Metabolism of Other Amino Acids; Metabolism of Cofactors and Vitamins; Lipid Metabolism; Glycan Biosynthesis and Metabolism; Energy Metabolism; Carbohydrate Metabolism; Biosynthesis of Secondary Metabolites; Biodegradation of Xenobiotics; Amino Acid Metabolism; Neurodegenerative Disorders; Infectious Diseases; Translation; Transcription; Sorting and Degradation; Replication and Repair; Signal Transduction; Ligand-Receptor Interaction; Immune System; Cell Growth and Death; Cell Communication; Behavior.]

Figure 3 Histograms of the 24 clusters generated from hierarchical clustering of the 111 KEGG pathways (see Figure 2 caption for details). Each histogram represents the number of pathways in a cluster and the clusters are ordered as they appear in the dendrogram. Top panel: Histograms are colored according to the five general pathway categories (shown in the figure legend) defined by KEGG. Bottom panel: Histograms are colored according to the 22 KEGG pathway subcategories (shown in the figure legend). Pathways engaged in similar cellular processes tend to cluster together.

Looking at the KEGG pathway–GI50 SOM clade correlation matrix (Figure 4), one can also see that some pathways are significantly correlated with many drug clades, while some are correlated with very few. These KEGG pathways are correlated, on average, with 3.5 drug clades (4.4%). The top histogram in Figure 4 shows the number of pathways significantly correlated with each GI50 SOM clade, which

represent the biological processes that those drug agents are potentially perturbing. On average, each clade is correlated with ~5 KEGG pathways (4.4%). The more the pathways correlated with a drug clade, the less specific the drugs in the clade may be, and vice versa. Three clades (shown as green bars in the histogram), P8 (limonene and pinene degradation), J1 (ribosome), and S1 (ribosome), are only correlated with one pathway each, and nine other clades (shown as yellow bars in the histogram) are each correlated with two pathways. These potentially contain drugs with specific targets. On the other hand, 11 of the 80 drug clades (shown as orange bars in the histogram) each correlate with more than eight pathways (~7%), and Q5 seems to be the most promiscuous drug clade having more than 10 significantly correlated pathways, seven of which are involved in glycan biosynthesis and metabolism. Overall, as illustrated in Figures 2 and 4, the different pathways appear to be fairly evenly spread over all SOM regions. This is consistent with the fact that the collection of screened compounds shows a wide spectrum of responses. Drug clades and their correlated pathways will not be examined individually here, but drug–pathway relationships will be discussed in greater detail in the following sections.

[Figure 4: image not reproduced. Main panel axes: KEGG Pathway (rows) versus GI50 SOM Clade (columns), with SOM regions F, V, P, R, J, Q, S, N, M marked along the top. Top histogram: Significantly Correlated Pathways (%), with labeled clades Q5, Q6, P6, J5, S8, R5, Q1, N6, N2, M6, M1; F2, F1, V5, V3, P1, S13, S4, M5, M4; and P8, J1, S1. Right histogram: Significantly Correlated Clades (%). Labeled pathways (an asterisk marks coherently expressed genes): Tyrosine metabolism*; TGF-beta signaling pathway*; Integrin-mediated cell adhesion*; Biosynthesis of steroids*; Terpenoid biosynthesis*; Pyrimidine metabolism*; Purine metabolism*; Dentatorubropallidoluysian atrophy (DRPLA)*; Neurodegenerative Disorders*; Glycosaminoglycan degradation*; Glycine, serine and threonine metabolism*; Basal transcription factors*; Aminoacyl-tRNA biosynthesis*; Proteasome*; Galactose metabolism; ATP synthesis*; Oxidative phosphorylation*; Methionine metabolism; Selenoamino acid metabolism*; Prion disease; Ribosome*; Glycosphingolipid metabolism; N-Glycan degradation*; DNA polymerase*; Cell cycle*; Blood group glycolipid biosynthesis-neolactoseries; Blood group glycolipid biosynthesis-lactoseries.]

Figure 4 Drug pathway specificity and pathway targetability. The main panel shows the KEGG pathway–GI50 clade correlation H-score matrix as found in Figure 2 (see Figure 2 caption for details). Top panel: Histograms, each representing the number of pathways significantly (Kruskal–Wallis: P < 0.05) correlated with each GI50 SOM clade. The more the pathways correlated with a drug clade, the less specific the compounds in the clade may be, and vice versa. The three clades, each correlated with only one pathway, are highlighted in green. The nine clades each correlated with two pathways are highlighted in yellow. The 11 clades, each correlated with more than eight pathways, are highlighted in orange. Right panel: Histograms, each representing the number of GI50 SOM clades significantly (Kruskal–Wallis: P < 0.05) correlated with a particular pathway. These clades contain compounds that can potentially target a pathway. Cases where many drug clades are correlated with a pathway may indicate a better likelihood of targeting a gene or gene product in that pathway. The names of the pathways with coherently expressed genes are labeled with an asterisk and the pathways that correlate with more than 10 clades are highlighted in orange.

Validation of Pathway–GI50 SOM Clade Correlations
We have shown that some pathways have significant correlations with drug GI50 response profiles; however, a natural question to ask is whether, or to what degree, a significant correlation derived for a pathway and a drug clade represents a real biological connection between the drugs in that clade and the pathway. To answer this question, protein products for our measured set of genes that are known to have small-molecule inhibitors, derived either from the literature or from known protein–ligand complexes with determined structures, are used as representative cases of established gene/pathway–drug connections; and the strengths of the correlations between these agents and their known target pathways are examined. A connection between a small-molecule drug and a pathway is assumed to exist if the drug has been cited in the literature as a known inhibitor of at least one of the gene products in the pathway (~100 000 unique drug–gene pairs), or if the drug has a structural match, at Tanimoto28,29 0.90 or above, to a ligand of a protein (retrieved from the Protein Data Bank (PDB)30) whose corresponding gene (determined by sequence alignment at expectation value ≤ 10⁻²⁰) is a member of the pathway (~1000 drug–gene pairs). For a particular pathway–drug pair, the pathway–clade H-score of the clade where the drug is located is used as their correlation score. Each drug–pathway pair is assigned to one of four categories at various significance levels (P-values): true positive (TP), false positive (FP), false negative (FN), and true negative (TN), according to the following decision table (Table 1). The number of drug–pathway pairs within each category is counted.

Table 1 Decision table for the categorization of drug–pathway pairs

Pathway–drug           Significant correlation   No correlation
Known connection       TP                        FN
No known connection    FP                        TN

TP = true positive; FP = false positive; FN = false negative; TN = true negative.

Out of all unique drug–pathway combinations, the protein–ligand data set contains between 1–6% and the literature-derived known inhibitor–target data set contains ~15% of drug–pathway pairs that have literature-established connections, depending on the pathway system used. Initially, we determined whether a stronger drug–pathway correlation signifies a more likely ‘true hit’, that is, the drug is a true inhibitor of the pathway, or the drug and the pathway have an established connection. The property of interest is then the ability of using the correlation to select ‘true positives’, called the ‘positive predictive value’ (defined as TP/(TP + FP)), which is an important index of the actual performance of a test. The positive predictive value is found to increase with increasing correlation significance level, as shown in Figure 5(a), obtained using the known inhibitor–target data set. (Same calculations are also done using the protein–ligand data set and similar trends are observed.) These observations partly confirm the hypothesis that the more significant the correlation between a drug and a pathway, the more likely that the drug is a true inhibitor of that pathway. At significance level P ~ 10⁻⁵, the TP/(TP + FP) ratio is about 0.25 (increased from 0.2 at P ~ 10⁻²), which means that, in approximately one out of four pathway–drug pairs found to be significantly correlated at this level, the

www.nature.com/tpj Pathway gene expressions and cellular growth inhibition R Huang et al 388

a 0.30

0.25

0.20

0.15 TP/(TP+FP) 0.10

KEGG 0.05 GO

0 0123456 -logP

b 0.001 0.01 0.1 1 1 KEGG GO

0.1 Sensitivity

0.01

0.001 1-Specificity

Figure 5 Quality and predictive power of the pathway–drug correlation method. (a) Plots of the positive predictive value (PPV ¼ TP/ (TP þ FP)) against correlation significance (Àlog(P)) for the KEGG pathways and GO terms. At P ¼ 1, the PPV represents the total fraction (15%) of pathway–drug pairs with literature established connections present in the data set. At P ¼ 10À5, about 25% of the correlated pathway–drug pairs have literature-established connection. (b) The ROC curves for the KEGG pathways and GO terms. The area below the ROC curve and above the diagonal is indicative of the quality of the method used to generate the curve. Both plots are generated by counting the fractions of pathway–drug pairs assigned to each of the four categories: TP, FP, FN, and TN according to Table 1 at various correlation significance levels as indicated by the Kruskal–Wallis test.

drug appears somewhere in the literature as being a known Similarly, specificity is defined as the proportion of subjects inhibitor of, or having a direct or indirect connection with, without the disease who have a negative test (TN/(TN þ FP)), some gene products in that pathway. This confidence level and, in our case, describes how well the correlation establishes a quantitative reference for using our approach identifies the drug–pathway pairs without a known connec- to find drugs that target certain biological pathways with tion. In general, a good testing procedure is characterized by the data set we have at hand. high sensitivity and specificity, whereas, in reality, when the Two metrics commonly used in clinical studies to test the sensitivity is very high, the specificity tends to be low. We accuracy of a certain diagnostic test are ‘sensitivity’ and have observed that, as the correlation strength increases, ‘specificity’. Sensitivity, defined as the proportion of subjects specificity increases while sensitivity decreases. This is with the disease who have a positive test (TP/(TP þ FN)), because when correlation significance levels are higher, describes how well a diagnostic test identifies those people there are fewer correlated drug–pathway pairs (which means with the disease, and, in our case, how well the correlation lower sensitivity), even though the likelihood of them being identifies the drug–pathway pairs with true connections. ‘true hits’ is better (which means higher specificity). A

The Pharmacogenomics Journal Pathway gene expressions and cellular growth inhibition R Huang et al 389

receiver operating characteristic (ROC) curve, which is a plot of sensitivity versus 1−specificity, is commonly used as an efficient way to display the relationship between sensitivity and specificity.31 The preferred test yields the greatest number of true positives with the least number of false positives, resulting in a ROC curve that tends upwards while moving from right to left. Figure 5(b) shows the ROC curves derived from our drug–pathway correlation method using the KEGG and GO annotations. These results exhibit the desired features of a valid testing procedure.

Nevertheless, one might still argue that the known inhibitor–target connections appearing in the literature could be random and the observed trend of greater predictive value with higher correlation significance could happen by chance. To check whether this is true, randomization procedures are employed to estimate the probability of finding this trend, with the observed magnitude, by chance.

The same set of drugs and pathways are used in the procedure, but drugs are assigned randomly as inhibitors of each pathway, while the percentage of drug–pathway pairs with ‘known’ connections is controlled to approximate the percentage in the real (nonrandomized) calculations (approximately 15%). For instance, if there are a total of n unique drug–pathway pairs and in m of them the drugs are mentioned in the literature as known inhibitors of their partner pathways, then in the random simulations m unique pairs will be randomly picked from the n drug–pathway pairs and labeled as having ‘known’ connections. The drug–pathway pairs that fall in each of the four categories (TP, FP, TN, FN) are counted and the predictive values are calculated at various significance levels with the randomized drug–pathway pairs. This procedure is repeated 1000 times and 1000 profiles of TP/(TP + FP) at various P-values are obtained. Figure 6 shows the TP/(TP + FP) profile calculated using the GO system with real drug–pathway connections (dark solid curve on top), plotted together with the average TP/(TP + FP) profile of the thousand profiles generated using randomly assigned drug–pathway connections. The random predictive values show no change with increasing significance levels (decreasing P-values), except for random fluctuations above and below the starting predictive value at P = 1 (−log(P) = 0). Based on the distribution of the 1000 random fluctuation amplitudes, the chance probability of getting the observed predictive value is about 10^−30 (t-test). This rejects, with nearly 100% confidence, the hypothesis that the observed trends could happen by random chance.

The mapping of a pathway to a GI50 clade via the existence of a significant correlation (P<0.05) with that clade’s cytotoxic profiles is therefore considered ‘validated’ if at least one of the drugs in that clade has established connections, either through ligand–protein relationships or from the literature, with the pathway. Limited by the availability of established known inhibitor–gene product relationships and the availability of gene expression and drug GI50 data, not all pathways can be ‘validated’ in this fashion. The validated pathways and their known inhibitors are listed in Tables 2 and 3. (Mapping of the BioCarta Pathways can be found in the Supplementary Information.)
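To make the four categories and the derived quantities concrete, the following is a minimal sketch, not the authors' code. It assumes each drug–pathway pair carries a correlation P-value and a flag saying whether a connection is documented; the function names and the toy data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(p_values: Sequence[float],
                     known: Sequence[bool],
                     alpha: float) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for drug-pathway pairs at significance level alpha.

    A pair counts as 'predicted' when its correlation P-value is below alpha,
    and as 'known' when a connection is documented (assumed input flag).
    """
    tp = fp = tn = fn = 0
    for p, is_known in zip(p_values, known):
        predicted = p < alpha
        if predicted and is_known:
            tp += 1
        elif predicted:
            fp += 1
        elif is_known:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def ppv(tp: int, fp: int) -> float:
    """Positive predictive value TP/(TP + FP); NaN if nothing is predicted."""
    return tp / (tp + fp) if (tp + fp) else math.nan


def roc_point(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (1 - specificity, sensitivity), i.e. one point on a ROC curve."""
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    false_positive_rate = fp / (fp + tn) if (fp + tn) else math.nan
    return false_positive_rate, sensitivity


if __name__ == "__main__":
    # Toy P-values and literature flags, invented for illustration only.
    toy_p = [0.001, 0.02, 0.04, 0.2, 0.5, 0.8]
    toy_known = [True, True, False, False, True, False]
    for alpha in (0.01, 0.05, 0.5):
        counts = confusion_counts(toy_p, toy_known, alpha)
        print(alpha, counts, ppv(counts[0], counts[1]), roc_point(*counts))
```

Sweeping alpha from strict to lenient and collecting the `roc_point` values traces a ROC curve, while the `ppv` values give a predictive-value profile of the kind plotted against −log(P).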

[Figure 6 plot: TP/(TP+FP) on the vertical axis (0 to 0.30) versus −log P on the horizontal axis (0 to 6). Curves: GO (Mean, Mean + 3*Stdev, Mean − 3*Stdev) and randomly generated pathways (Random Mean, Random Mean + 3*Stdev, Random Mean − 3*Stdev).]

Figure 6 Validation of the PPV of the pathway–drug correlation method via comparison to random cases. The top curve is a plot of PPV against −log(P) generated using GO terms, and the bottom three curves are obtained from 1000 randomly generated pathways. In the random case, the average PPV plot (flat line in the middle) does not show any apparent change as correlation significance level (−log(P)) increases. The two curves above and below the average plot represent the three-standard-deviation window for random fluctuations. The deviation of the PPV plot generated using GO terms from random is significant (t-test: P<10^−30). This clearly shows that the increase of PPV with increasing correlation significance can only be observed with vetted gene functional categorization systems, such as pathways.
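The random baseline and its three-standard-deviation window can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the P-values stay fixed, m ‘known’ labels are reassigned at random in each round, and the function names, seed, and toy inputs are hypothetical.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def ppv_at(p_values: Sequence[float], labels: Sequence[bool], alpha: float) -> float:
    """TP/(TP + FP) among pairs whose P-value is below alpha (NaN if none)."""
    selected = [lab for p, lab in zip(p_values, labels) if p < alpha]
    return sum(selected) / len(selected) if selected else float("nan")


def random_ppv_band(p_values: Sequence[float], n_known: int, alpha: float,
                    n_rounds: int = 1000, seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, mean - 3*sd, mean + 3*sd) of PPV under random labeling.

    Each round picks n_known of the n pairs at random and labels them as
    'known', so the fraction of known pairs matches the real calculation.
    """
    rng = random.Random(seed)
    n = len(p_values)
    values: List[float] = []
    for _ in range(n_rounds):
        chosen = set(rng.sample(range(n), n_known))
        labels = [i in chosen for i in range(n)]
        value = ppv_at(p_values, labels, alpha)
        if value == value:  # skip NaN rounds (no pair passed the threshold)
            values.append(value)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean, mean - 3 * sd, mean + 3 * sd


if __name__ == "__main__":
    # Toy input: 200 evenly spaced P-values with 15% of pairs labeled 'known'.
    toy_p = [(i + 1) / 200 for i in range(200)]
    print(random_ppv_band(toy_p, n_known=30, alpha=0.25))
```

Under random labeling the mean PPV stays near the overall fraction of known pairs at every threshold, which is the flat baseline the caption describes; an observed PPV far outside the band indicates a non-random trend.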


Table 2 Pathway–drug connections validated using the protein–ligand data set

(a)

Pathway  Pathway title  Gene  PDB ID  PDB ligand NSC match  Tanimoto  P

hsa04510  Integrin-mediated cell adhesion*  CSK  1BYG  NSC618487  1  0.031
hsa00100  Biosynthesis of steroids*  NQO1  1H66  NSC697726  1  0.042

(b)

Term  Title  Gene  PDB ID  PDB ligand NSC match  Tanimoto  P

Biological process
GO:0006118  Electron transport*  ETFA  1EFV  NSC672972  ≥0.9  1.02E-03
GO:0006730  One-carbon compound metabolism  DHFR  1DLS  NSC740  1  3.19E-03
GO:0006730  One-carbon compound metabolism  DHFR  1OHJ  NSC667640  ≥0.9  7.04E-03
GO:0006310  DNA recombination  FRAP1  1FAP  NSC683864  ≥0.95  2.15E-02
GO:0007517  Muscle development  ACTA1  1EQY  NSC672972  ≥0.95  2.59E-02
GO:0007517  Muscle development  MAPK12  1CM8  NSC672972  ≥0.9  2.59E-02
GO:0007018  Microtubule-based movement*  KIF5B  1BG2  NSC672972  ≥0.95  3.43E-02
GO:0006265  DNA topological change*  TOP1  1K4T  NSC107124  ≥0.9  3.81E-02
GO:0006265  DNA topological change*  TOP1  1K4T  NSC111533  ≥0.9  3.81E-02
GO:0006265  DNA topological change*  TOP1  1K4T  NSC302991  ≥0.9  3.81E-02
GO:0006265  DNA topological change*  TOP1  1K4T  NSC328410  ≥0.9  3.81E-02
GO:0006265  DNA topological change*  TOP1  1K4T  NSC606173  ≥0.9  3.81E-02
GO:0006265  DNA topological change*  TOP1  1K4T  NSC629971  ≥0.9  3.81E-02
GO:0006265  DNA topological change*  TOP1  1K4T  NSC94600  ≥0.9  3.81E-02
GO:0006259  DNA metabolism*  DTYMK  1E2D  NSC672972  ≥0.9  3.97E-02
GO:0006259  DNA metabolism*  DTYMK  1E2D  NSC672972  ≥0.95  3.97E-02
GO:0006265  DNA topological change*  TOP1  1K4T  NSC606172  ≥0.9  4.11E-02
GO:0006265  DNA topological change*  TOP1  1K4T  NSC609699  ≥0.95  4.11E-02
GO:0006265  DNA topological change*  TOP1  1K4T  NSC176323  ≥0.9  4.43E-02
GO:0006265  DNA topological change*  TOP1  1K4T  NSC603071  ≥0.9  4.43E-02
GO:0006265  DNA topological change*  TOP1  1NH3  NSC672974  ≥0.9  4.43E-02
GO:0006265  DNA topological change*  TOP1  1NH3  NSC99445  1  4.43E-02
GO:0006118  Electron transport*  ACADM  1EGC  NSC618486  ≥0.95  4.61E-02
GO:0006118  Electron transport*  IVD  1IVH  NSC618486  ≥0.9  4.61E-02
GO:0006091  Generation of precursor metabolites and energy*  PPARG  1FM6  NSC122758  1  4.75E-02

Cellular component
GO:0005739  Mitochondrion*  FECH  1HRK  NSC657956  ≥0.95  1.29E-04
GO:0005624  Membrane fraction*  LCK  1QPD  NSC618487  1  2.22E-04
GO:0005739  Mitochondrion*  DGUOK  1JAG  NSC672972  ≥0.95  3.71E-04
GO:0005624  Membrane fraction*  HMGCR  1DQ9  NSC630995  ≥0.9  1.82E-03
GO:0005739  Mitochondrion*  HADHSC  1F0Y  NSC618486  ≥0.9  3.51E-03
GO:0005634  Nucleus*  RARA  1DKF  NSC6814  1  5.24E-03
GO:0005624  Membrane fraction*  HMGCR  1HW8  NSC672972  ≥0.95  6.00E-03
GO:0005624  Membrane fraction*  LCK  1QPC  NSC672972  ≥0.9  6.00E-03
GO:0005624  Membrane fraction*  TAP1  1JJ7  NSC672972  ≥0.95  6.00E-03
GO:0016021  Integral to membrane  HMGCR  1HW8  NSC672972  ≥0.95  6.40E-03
GO:0016021  Integral to membrane  IGF1R  1JQH  NSC672972  ≥0.9  6.40E-03
GO:0016021  Integral to membrane  KIT  1PKG  NSC672972  ≥0.95  6.40E-03
GO:0005942  Phosphoinositide 3-kinase complex*  FRAP1  1FAP  NSC683864  ≥0.95  2.15E-02
GO:0005887  Integral to plasma membrane  INSR  1I44  NSC672972  ≥0.9  4.22E-02
GO:0005887  Integral to plasma membrane  TAP1  1JJ7  NSC672972  ≥0.95  4.22E-02

Molecular function
GO:0005509  Calcium ion binding*  S100A9  1IRJ  NSC639634  ≥0.9  2.95E-04
GO:0005509  Calcium ion binding*  PPP3CA  1TCO  NSC625423  ≥0.95  7.79E-03
GO:0005509  Calcium ion binding*  PPP3CA  1TCO  NSC625425  1  7.79E-03
GO:0005509  Calcium ion binding*  PPP3CA  1TCO  NSC625429  ≥0.95  7.79E-03
GO:0005509  Calcium ion binding*  PPP3CA  1TCO  NSC625431  1  7.79E-03
GO:0016491  Oxidoreductase activity*  HMGCR  1HW8  NSC672972  ≥0.95  1.29E-02
GO:0005509  Calcium ion binding*  S100A9  1IRJ  NSC720499  ≥0.9  1.29E-02
GO:0016740  Transferase activity  FRAP1  1FAP  NSC683864  ≥0.95  1.68E-02
GO:0004428  Inositol or phosphatidylinositol kinase activity*  FRAP1  1FAP  NSC683864  ≥0.95  2.05E-02
GO:0003677  DNA binding*  COL10A1  1GR3  NSC720499  ≥0.9  2.38E-02
GO:0016853  Isomerase activity  TOP1  1K4T  NSC176323  ≥0.9  3.19E-02
GO:0016853  Isomerase activity  TOP1  1K4T  NSC603071  ≥0.9  3.19E-02
GO:0016853  Isomerase activity  TOP1  1K4T  NSC606172  ≥0.9  3.19E-02
GO:0016853  Isomerase activity  TOP1  1K4T  NSC609699  ≥0.95  3.19E-02
GO:0016853  Isomerase activity  TOP1  1NH3  NSC672974  ≥0.9  3.19E-02
GO:0016853  Isomerase activity  TOP1  1NH3  NSC99445  1  3.19E-02
GO:0004872  Receptor activity  FKBP3  1PBK  NSC683864  ≥0.95  3.71E-02
GO:0003779  Actin binding  GC  1J78  NSC9856  1  3.72E-02
GO:0004872  Receptor activity  IGF1R  1JQH  NSC672972  ≥0.9  4.20E-02
GO:0004872  Receptor activity  INSR  1I44  NSC672972  ≥0.9  4.20E-02
GO:0004872  Receptor activity  KIT  1PKG  NSC672972  ≥0.95  4.20E-02
GO:0016853  Isomerase activity  FKBP3  1PBK  NSC226080  1  4.52E-02

For each pathway, the table lists the name of the pathway, the expressed protein of the gene within the pathway where a structure exists as a complex with a small molecule ligand deposited in the PDB, the PDB ID of the protein–ligand complex, the NSC number of the small-molecule compound that is structurally similar to the ligand and has its GI50 profile determined in the NCI60, the Tanimoto score that indicates the structural similarity between the small-molecule compound and the protein ligand (only compounds with Tanimoto scores of at least 0.9 are considered in our analysis), and the significance level of the connection between the pathway and the small-molecule compound (P-value). Pathways labeled with an asterisk have their genes significantly coherently expressed.20 (a) Statistically significant KEGG pathway–drug connections (Kruskal–Wallis: P<0.05). (b) Statistically significant GO term–drug connections (Kruskal–Wallis: P<0.05).
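The Tanimoto cutoff mentioned in the footnote can be illustrated with a short sketch. It assumes each molecule is represented as a set of structural fingerprint features; the source does not state which fingerprints were used, so the representation, the function names, and the toy library are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto coefficient: shared features / features present in either set."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen for this sketch when both sets are empty
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similar_compounds(ligand_fp: Set[int],
                      library: Dict[str, Set[int]],
                      threshold: float = 0.9) -> List[Tuple[str, float]]:
    """Return (name, score) for library compounds scoring at least threshold."""
    hits = []
    for name, fp in library.items():
        score = tanimoto(ligand_fp, fp)
        if score >= threshold:
            hits.append((name, score))
    return hits


if __name__ == "__main__":
    # Hypothetical fingerprints: integers stand for structural feature IDs.
    ligand = set(range(10))
    toy_library = {"cmpd_a": set(range(9)), "cmpd_b": {0, 1, 50}}
    print(similar_compounds(ligand, toy_library))
```

A score of 1 means the two feature sets are identical, which matches the rows of Table 2 where the NSC compound is the PDB ligand itself or indistinguishable from it at the fingerprint level.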

Table 2 lists the pathway–drug connections validated using the protein–ligand data set. The mapping of two KEGG pathways, nine BioCarta pathways, and 22 GO terms is validated. Some of these pathway–drug connections have a well-established mechanistic basis; for example, the NSC drug listed for CSK is staurosporine, which is a well-known kinase inhibitor; the drugs listed for DHFR are methotrexate and its close structural analog; and drugs listed for TOP1 are mostly camptothecin analogs. In other cases, the drugs listed are either known ligands (not necessarily established inhibitors) of a gene product (or a sequence analog of the protein) in a pathway, or their close structural analogs; for example, the NSC drugs listed for HMGCR are close structural analogs of HMG.

Table 3 lists the pathway–drug connections validated using the known inhibitor–target data set from the literature. Owing to space limitations, only the validated pathways are listed here and their corresponding genes and drugs can be found in Table IV of Supplementary Information. For BioCarta and GO, only the most significant results (BioCarta: P<0.02; GO: P<0.01) are shown for the same reason. Nonetheless, at P<0.05, the mapping of 34 KEGG pathways, 94 BioCarta pathways, and 144 GO terms is validated. In this case, the drugs listed for each pathway are either well-known inhibitors of gene products in that pathway, or have been linked to that pathway through other indirect mechanisms. For example, the NSC drugs connected to the GO term DNA topoisomerase (ATP-hydrolyzing) activity through TOP2 include well-known TOP2 inhibitors such as etoposide, teniposide, razoxane, dexrazoxane, mitindomide, and other drugs, including DNA-damaging agents which may be indirectly linked to TOP2 activity. Geldanamycin, 17-AAG, herbimycin A, and macbecin I and II, among other drugs, are connected to GO terms such as heat shock protein activity, co-chaperonin activity, and protein folding through HSP90. We consider the pathway–drug connections associated with these validated pathways to be more reliable. Nevertheless, one still needs to be aware of the drug–pathway connections that fall into the false-positive or -negative (FP and FN) categories. It will also be interesting to examine the drugs in clades that are significantly correlated with a pathway, but no known biological connection has been established.

Connecting Pathways with GI50 SOM Response Regions: Putative MOA

One of the primary goals of our drug–pathway analysis is to aid in finding and interpreting drug targets and MOA. Using the methods described above, the biological pathways that are potentially perturbed by the drugs in the nine SOM response regions can be postulated. Conversely, each path-


Table 3 Pathway–drug connections validated using the known inhibitor–gene target data set obtained from the literature

(a)

Pathway  Pathway title  P  Ref. count

hsa03010  Ribosome*  5.11E-16  ≥10
hsa00190  Oxidative phosphorylation*  1.91E-03  ≥10
hsa04110  Cell cycle*  4.78E-03  ≥10
hsa00280  Valine, leucine, and isoleucine degradation  5.20E-03  ≥1
hsa00970  Aminoacyl-tRNA biosynthesis*  6.25E-03  ≥5
hsa00220  Urea cycle and metabolism of amino groups  7.35E-03  ≥5
hsa00350  Tyrosine metabolism*  8.06E-03  ≥5
hsa00340  Histidine metabolism  9.68E-03  ≥1
hsa00720  Reductive carboxylate cycle (CO2 fixation)  1.10E-02  ≥1
hsa04060  Cytokine–cytokine receptor interaction  1.29E-02  ≥5
hsa00052  Galactose metabolism  1.41E-02  ≥1
hsa00252  Alanine and aspartate metabolism  1.42E-02  ≥5
hsa04510  Integrin-mediated cell adhesion*  1.64E-02  ≥10
hsa00630  Glyoxylate and dicarboxylate metabolism  1.76E-02  ≥3
hsa05050  Dentatorubropallidoluysian atrophy (DRPLA)*  1.95E-02  ≥10
hsa00120  Bile acid biosynthesis  2.11E-02  ≥10
hsa03030  DNA polymerase*  2.12E-02  ≥5
hsa04310  Wnt signaling pathway  2.23E-02  ≥5
hsa03050  Proteasome*  2.57E-02  ≥10
hsa00680  Methane metabolism  2.65E-02  ≥1
hsa00632  Benzoate degradation via CoA ligation  2.72E-02  ≥5
hsa00760  Nicotinate and nicotinamide metabolism  2.84E-02  ≥5
hsa05060  Prion disease  2.97E-02  ≥10
hsa04630  Jak-STAT signaling pathway  3.09E-02  ≥10
hsa00670  One carbon pool by folate  3.12E-02  ≥3
hsa01510  Neurodegenerative Disorders*  3.29E-02  ≥10
hsa05010  Alzheimer’s disease  3.29E-02  ≥10
hsa00251  Glutamate metabolism  3.48E-02  ≥5
hsa04350  TGF-beta signaling pathway*  3.53E-02  ≥5
hsa00710  Carbon fixation  3.70E-02  ≥1
hsa00530  Aminosugars metabolism  3.88E-02  ≥1
hsa05110  Cholera infection  4.31E-02  ≥1
hsa04210  Apoptosis  4.32E-02  ≥10
hsa04070  Phosphatidylinositol signaling system  4.62E-02  ≥1

(b)

Term  Title  P  Ref. count

Biological process
GO:0006412  Protein biosynthesis*  1.18E-09  ≥10
GO:0006260  DNA replication*  2.16E-04  ≥10
GO:0000910  Cytokinesis*  1.87E-03  ≥10
GO:0006120  Mitochondrial electron transport, NADH to ubiquinone*  2.14E-03  ≥10
GO:0006414  Translational elongation*  3.35E-03  ≥10
GO:0006461  Protein complex assembly  4.57E-03  ≥10
GO:0007169  Transmembrane receptor protein tyrosine kinase signaling pathway  4.88E-03  ≥10
GO:0000079  Regulation of cyclin-dependent protein kinase activity  5.46E-03  ≥10
GO:0007067  Mitosis*  6.56E-03  ≥10
GO:0007049  Cell cycle*  6.99E-03  ≥10
GO:0045786  Negative regulation of cell cycle  8.20E-03  ≥10
GO:0006915  Apoptosis  9.31E-03  ≥10
GO:0000165  MAPKKK cascade  9.38E-03  ≥10
GO:0006730  One-carbon compound metabolism  9.46E-03  ≥10
GO:0000082  G1/S transition of mitotic cell cycle  9.99E-03  ≥10

Cellular component
GO:0005840  Ribosome*  1.43E-09  ≥10
GO:0005843  Cytosolic small ribosomal subunit (sensu Eukaryota)*  5.27E-08  ≥10
GO:0005842  Cytosolic large ribosomal subunit (sensu Eukaryota)*  4.32E-07  ≥10
GO:0005622  Intracellular*  4.51E-07  ≥10
GO:0005634  Nucleus*  2.51E-04  ≥10
GO:0005739  Mitochondrion*  2.61E-04  ≥10
GO:0005856  Cytoskeleton*  3.25E-03  ≥10
GO:0005792  Microsome*  3.79E-03  ≥3
GO:0005624  Membrane fraction*  5.70E-03  ≥10
GO:0005829  Cytosol  7.95E-03  ≥10

Molecular function
GO:0003735  Structural constituent of ribosome*  3.86E-12  ≥10
GO:0003723  RNA binding*  1.67E-06  ≥10
GO:0003954  NADH dehydrogenase activity*  1.87E-03  ≥10
GO:0008137  NADH dehydrogenase (ubiquinone) activity*  2.06E-03  ≥10
GO:0008094  DNA-dependent ATPase activity*  5.05E-03  ≥3
GO:0019843  rRNA binding*  5.84E-03  ≥10
GO:0003918  DNA topoisomerase (ATP-hydrolyzing) activity*  6.11E-03  ≥10
GO:0009055  Electron carrier activity*  6.25E-03  ≥5
GO:0016740  Transferase activity  7.56E-03  ≥5
GO:0004888  Transmembrane receptor activity  8.39E-03  ≥10

For each pathway, the table lists the name of the pathway, the significance level of the connection between the pathway and its small-molecule inhibitors (P-value), and the number of references where the small molecule is documented as an inhibitor of the product of a gene in the pathway (see Supplementary Information, Table IV, for corresponding gene names and drug NSCs). Pathways labeled with an asterisk have their genes significantly coherently expressed.20 (a) Statistically significant KEGG pathway–drug connections (Kruskal–Wallis: P<0.05). (b) Statistically significant GO term–drug connections (Kruskal–Wallis: P<0.01).

way may be associated with one or more response regions. A pathway is ‘best-mapped’ to a specific SOM region if all the following conditions are met: the region contains at least one clade significantly correlated with the pathway (pathway–clade correlation score >0 and P<0.05), the region average pathway–clade correlation score is positive, and the region has the highest average pathway–clade correlation score (hit score) or the highest percentage (hit rate) of significant clades among all regions. Table 4 lists the primary and secondary response regions for the KEGG pathways and GO terms when mapped in this fashion (Data for BioCarta pathways can be found in Supplementary Information). Due to space limitations, only the pathway–GI50 response region associations that have been validated, as described earlier, through either known inhibitor–target or ligand–protein connections, are listed. For GO terms, only the associations with the best mapping scores are shown.

Some pathways are associated with more response regions than others, that is, these pathways have significant correlations with many different drug GI50 profiles, indicating that many different agents can potentially disrupt these pathways. The degree of GI50 SOM ‘coverage’ of a pathway is determined by both the gene expression coherence of the pathway, and the overall correlation strength of the genes in the pathway with the GI50 profiles. On the other hand, the pathways that are associated with a SOM region collectively represent, putatively, the biological processes or molecular targets of the compounds clustered in that region, that is, those that are involved in their MOA.

We have previously established the general MOA of the agents in some of these SOM regions: mitosis (M), membrane function and oxidative stress (N), nucleic acid metabolism (S), and metabolic stress and cell survival (Q), oxidative metabolism (R), and kinases/phosphatases and oxidative stress (P), via other methods.16,23,24 The pathway mapping results provide additional support for the annotation of some of the SOM regions: for example, the GO terms mitotic checkpoint, cytokinesis, kinetochore, and cell cycle are associated with the M-region; the GO terms mitochondrial inner membrane (data not shown), response to oxidative stress (data not shown), and oxidoreductase activity are associated with the N-region; the KEGG pathway one carbon pool by folate, the BioCarta granzyme A-mediated apoptosis pathway, the GO terms DNA topological change and DNA topoisomerase (ATP-hydrolyzing) activity, are associated with the S-region; the KEGG pathway glutamate metabolism and the GO terms (data not shown) xenobiotic metabolism, cysteine metabolism, and glutathione biosynthesis are associated with the Q-region; the KEGG pathways fatty acid metabolism and oxidative


Table 4 Primary and secondary SOM GI50 response regions for each pathway

Pathway  Pathway title  Primary and secondary regions

(a)
hsa00280  Valine, leucine, and isoleucine degradation  F
hsa00340  Histidine metabolism  F R J
hsa00350  Tyrosine metabolism*  F
hsa04310  Wnt signaling pathway  F
hsa03050  Proteasome*  F V
hsa00120  Bile acid biosynthesis  J R
hsa04110  Cell cycle*  J Q
hsa00530  Aminosugars metabolism  J F
hsa00100  Biosynthesis of steroids*  J S
hsa03030  DNA polymerase*  M
hsa00760  Nicotinate and nicotinamide metabolism  M J Q
hsa01510  Neurodegenerative disorders*  M V
hsa04210  Apoptosis  M V
hsa00052  Galactose metabolism  N
hsa00252  Alanine and aspartate metabolism  N
hsa00680  Methane metabolism  N
hsa04350  TGF-beta signaling pathway*  N
hsa00632  Benzoate degradation via CoA ligation  N M Q
hsa04070  Phosphatidylinositol signaling system  N S
hsa00630  Glyoxylate and dicarboxylate metabolism  P
hsa00720  Reductive carboxylate cycle (CO2 fixation)  P
hsa05110  Cholera infection  P F M
hsa00251  Glutamate metabolism  Q
hsa00670  One carbon pool by folate  Q N S
hsa00710  Carbon fixation  Q
hsa03010  Ribosome*  Q R
hsa05060  Prion disease  Q R
hsa04060  Cytokine–cytokine receptor interaction  R
hsa04510  Integrin-mediated cell adhesion*  R
hsa05010  Alzheimer’s disease  R Q
hsa05050  Dentatorubropallidoluysian atrophy (DRPLA)*  R V M
hsa00190  Oxidative phosphorylation*  V R N
hsa00220  Urea cycle and metabolism of amino groups  V
hsa00970  Aminoacyl-tRNA biosynthesis*  V N M

(b)
Biological process
GO:0007052  Mitotic spindle organization and biogenesis  J
GO:0007169  Transmembrane receptor protein tyrosine kinase signaling pathway  J
GO:0007093  Mitotic checkpoint*  J Q M
GO:0000070  Mitotic sister chromatid segregation*  J
GO:0008284  Positive regulation of cell proliferation  J
GO:0000910  Cytokinesis*  M Q J
GO:0006265  DNA topological change*  M S
GO:0015980  Energy derivation by oxidation of organic compounds  N
GO:0001525  Angiogenesis  P
GO:0007018  Microtubule-based movement*  P F
GO:0006260  DNA replication*  Q
GO:0007155  Cell adhesion*  Q
GO:0006364  rRNA processing*  Q
GO:0006282  Regulation of DNA repair  Q
GO:0006445  Regulation of translation  Q R
GO:0006310  DNA recombination  Q
GO:0007067  Mitosis*  Q J
GO:0007049  Cell cycle*  Q M
GO:0007126  Meiosis*  R
GO:0007010  Cytoskeleton organization and biogenesis  R P J
GO:0006412  Protein biosynthesis*  R Q
GO:0006414  Translational elongation*  V R

Cellular component
GO:0005875  Microtubule-associated complex  J
GO:0005634  Nucleus*  M
GO:0000776  Kinetochore*  M J
GO:0008305  Integrin complex*  P R
GO:0005856  Cytoskeleton*  P
GO:0005840  Ribosome*  Q R
GO:0005842  Cytosolic large ribosomal subunit (sensu Eukaryota)*  Q
GO:0005622  Intracellular*  Q R
GO:0005843  Cytosolic small ribosomal subunit (sensu Eukaryota)*  Q R
GO:0015629  Actin cytoskeleton*  R
GO:0005624  Membrane fraction*  R Q
GO:0005739  Mitochondrion*  V R P

Molecular function
GO:0005198  Structural molecule activity*  F N
GO:0003684  Damaged DNA binding  J F
GO:0016853  Isomerase activity  J
GO:0008270  Zinc ion binding  M
GO:0003918  DNA topoisomerase (ATP-hydrolyzing) activity*  M S J
GO:0003707  Steroid hormone receptor activity  N
GO:0005496  Steroid binding*  N
GO:0009055  Electron carrier activity*  N
GO:0004872  Receptor activity  N J
GO:0016702  Oxidoreductase activity, acting on single donors with incorporation of molecular oxygen, incorporation of two atoms of oxygen  P
GO:0008094  DNA-dependent ATPase activity*  Q
GO:0019843  rRNA binding*  Q R
GO:0003891  Delta DNA polymerase activity*  Q J
GO:0005509  Calcium ion binding*  Q J
GO:0005194  Cell adhesion molecule activity  R
GO:0008137  NADH dehydrogenase (ubiquinone) activity*  R
GO:0003954  NADH dehydrogenase activity*  R
GO:0016491  Oxidoreductase activity*  R N F
GO:0003735  Structural constituent of ribosome*  R Q
GO:0003723  RNA binding*  S Q

These pathways and their associated SOM regions represent the putative targets and MOA for the small-molecule compounds whose GI50 response profiles are clustered in each SOM region. Pathways labeled with an asterisk have their genes significantly coherently expressed.20 (a) The association of validated KEGG pathways to SOM regions. (b) The association of validated GO terms to SOM regions.
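The ‘best-mapped’ rule used to assign pathways to SOM regions can be sketched as below. This is one possible reading, not the authors' code: it assumes the region average is taken over all clades in the region, that hit rate is the fraction of clades with a positive score and P below the cutoff, and that the maxima are taken over all regions. The `CladeResult` type, the function name, and the toy values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class CladeResult:
    region: str      # SOM region label, e.g. "M"
    score: float     # pathway-clade correlation score
    p_value: float   # significance of that correlation


def best_mapped_regions(clades: List[CladeResult], alpha: float = 0.05) -> List[str]:
    """Return the regions to which one pathway is 'best-mapped'."""
    by_region: Dict[str, List[CladeResult]] = {}
    for clade in clades:
        by_region.setdefault(clade.region, []).append(clade)

    # Per region: (hit score, hit rate, has at least one significant clade).
    stats: Dict[str, Tuple[float, float, bool]] = {}
    for region, members in by_region.items():
        hits = [c for c in members if c.score > 0 and c.p_value < alpha]
        hit_score = mean([c.score for c in members])
        hit_rate = len(hits) / len(members)
        stats[region] = (hit_score, hit_rate, bool(hits))

    if not stats:
        return []
    top_score = max(s[0] for s in stats.values())
    top_rate = max(s[1] for s in stats.values())
    return sorted(
        region for region, (score, rate, has_hit) in stats.items()
        if has_hit and score > 0 and (score == top_score or rate == top_rate)
    )


if __name__ == "__main__":
    # Toy clade results for one pathway, invented for illustration only.
    toy = [
        CladeResult("M", 0.6, 0.01), CladeResult("M", 0.4, 0.20),
        CladeResult("N", 0.1, 0.03), CladeResult("N", -0.3, 0.50),
        CladeResult("Q", 0.2, 0.04), CladeResult("Q", 0.3, 0.01),
    ]
    print(best_mapped_regions(toy))
```

Because a region can qualify through either the hit score or the hit rate, a pathway may be returned with more than one region, which is consistent with the multiple regions listed per pathway in Table 4.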

phosphorylation and the GO terms NADH dehydrogenase (ubiquinone) activity, NADH dehydrogenase activity, oxidoreductase activity, and mitochondrion are associated with the R-region; and the BioCarta pathways (data not shown) signaling of hepatocyte growth factor receptor, Erk1/Erk2 MAPK signaling pathway, ATM signaling pathway, FAS signaling pathway (CD95), and the GO terms (data not shown) oxidoreductase activity (acting on single donors with incorporation of molecular oxygen, incorporation of two atoms of oxygen), cell–cell signaling, and DNA damage response (signal transduction resulting in induction of apoptosis) are associated with the P-region.

Moreover, the additional pathways that are associated with each SOM region through this global pathway analysis procedure provide valuable information and new insights into the MOA for similarly clustered drug molecules, especially for the validated pathway–drug connections. Of particular interest are the associations of apoptosis with the M-region, cell adhesion, and immune response signaling pathways with regions N and P, transport with the N-region,


hypoxia and angiogenesis with the P-region, DNA replication, regulation of DNA repair, and translation with the Q-region, and cytoskeleton with the R-region. Finally, the MOA of the agents clustered in the three regions, F, J, and V, can be postulated by examining the pathways associated with these regions: for example, the amino-acid metabolism pathways and the wnt signaling pathway mapped to the F-region, cell cycle and DNA damage-related pathways mapped to the J-region, and urea cycle and metabolism of amino groups and pyruvate metabolism mapped to the V-region, which seems for the latter to share pathways with its neighboring regions. In fact, many pathways are shared among different SOM regions and conversely each region is usually associated with multiple pathways. This is expected because any one biological process can be perturbed by many drugs, but to different degrees.

The agents that can most effectively disrupt a process could be hypothesized by looking at the most significantly correlated sets. Moreover, each SOM region contains the GI50 profiles for thousands of compounds; therefore, it is not surprising that multiple processes, even though usually related, are associated with these compounds. To gain more specific information on the MOA of one compound or a small cluster of related compounds, a detailed drug–pathway analysis, as described earlier, is required. The region-wide analysis of biological pathways and drug response, however, provides a global view of biological activities or features shared by large groups of compounds.

DISCUSSION

Our study presents the first in-depth, large-scale analysis of pathway gene expressions in relation to drug activities. One of the important goals of understanding the nature of gene expression regulation and biological pathways is to apply this knowledge to understanding the mechanism by which small drug molecules interfere with the biological system through interactions with gene products and consequently pathways. Drug–gene-pathway relationships may then be utilized to find drug targets or target-specific drugs. We have mapped each pathway in KEGG, BioCarta, and each GO term onto the drug GI50 SOM based on the strength of correlation between the pathway and the compounds in each GI50 clade. The most significantly and specifically correlated pathways are intuitively the most likely targets of the drugs within a GI50 clade. Clustering of the KEGG pathways based on their GI50 clade correlation patterns shows a higher level of pathway regulation, that is, pathways belonging to the same broader biological process category generally cluster together. This indicates that pathways engaged in the same biological process tend to share similar drug GI50 correlation patterns, or behave similarly toward drugs, and each group of pathways can be associated with specific SOM regions.

As we have established earlier,20 pathways responsible for vital cellular processes or processes that are related to growth or proliferation, specifically in cancer cells, such as those engaged in genetic information processing, cell cycle, energy metabolism and nucleotide metabolism, are found to have significantly more coherent gene expressions than most signaling and regular metabolic pathways. The cohesive pathways (names labeled with an asterisk in the right panel of Figure 4 and in Tables 2–4) are also found here to have stronger pathway–drug correlations, in general, than noncohesive pathways, since the high level of gene coexpression in cohesive pathways makes it more likely for genes in the pathway to have similar correlation patterns with, or act coherently toward, a drug. This may then imply that cohesive pathways are easier to target, since many drugs seem to be able to significantly disrupt these pathways. Conversely, the correlations of the least cohesive pathways with the GI50 SOM regions are generally diffuse and not strong or specific to any one clade or region. This may be an indication that it will be hard to find drugs that can target noncohesive pathways, or the relationship between drugs and these pathways are not reflected or easily interpretable by simple gene–drug correlations. Therefore, instead of looking at noncohesive pathways that do not correlate significantly with any drugs, it may be more interesting to examine those noncohesive pathways that can act coherently toward certain drugs; that is, how correlation or interaction with drugs changes their intrinsic cohesiveness. Taking this one step further, in addition to looking at ‘drug-coupled’ pathway cohesiveness through correlation, valuable information may be obtained by examining ‘drug-exposed’ pathway cohesiveness, that is, to analyze and compare gene expression cohesiveness within a pathway prior to and after drug exposure.

The number of pathways significantly correlated with each GI50 SOM clade, on the other hand, represents the number of biological processes the drug agents in the clade are potentially perturbing. This number can be used as an indicator of the level of target specificity or promiscuity of these drugs. High pathway correlation promiscuity is indicated for some drugs. Although this may seem undesirable because of multiple targets and thus the potential of detrimental side effects implicated for the drug, this may represent cases where assaulting a single target by the drug can cause multiple intracellular effects, as reflected by correlations with multiple pathways. This, on the other hand, can be deemed as a desirable property of the drug, because it presents the potential of overcoming the insufficiency of single target inhibition caused by the inherent ability of heterogeneous tumor populations to activate alternative or redundant pathways.32 Based on this premise, drugs with many significantly correlated pathways may warrant further exploration.

However, a general concern regarding the interpretation and application of pathway–gene-drug correlations is whether, or to what degree, a significant correlation derived for a pathway and a drug represents a real connection between the drug and the genes in that pathway. To address this question, we have established the confidence level of this method by evaluating the correlations between a set of small-molecule inhibitors and their known gene targets. These connections have been verified through either the


literature or known ligand–protein complexes with determined structures. We have shown that, as the correlation strength between a drug and a pathway increases, it is more likely that a true biological connection exists between the drug and the pathway, and the probability of this happening by pure chance is almost zero. These results can be applied directly in target-specific drug discovery or hypothesizing the MOA of a drug. More specifically, if a GI50 SOM clade is found significantly correlated with a pathway, we can then test the drugs in the clade against gene targets in that pathway. However, we still cannot seem to avoid the tradeoff between specificity and sensitivity, that is, higher specificity can only be achieved at the expense of lower sensitivity. If we want a smaller chance of a drug–pathway correlation to be a false positive, we would need a higher level of correlation significance between pathway gene expression and cytotoxicity. This would effectively reduce the number of correlated drug–pathway pairs and increase the risk of missing real connections. Conversely, if we want to cover more drugs and pathways, we would have to lower the correlation significance threshold and at the same time increase the chance of getting false positives.

CONCLUSION

We have proposed novel strategies to comprehensively analyze and link pathway gene expressions to drug responses, for the purpose of generating hypotheses about drug targets and mechanisms of action. We have analyzed pathway gene expression patterns and drug GI50 response profiles derived from the NCI60 cancer cell panels, and annotated GI50 response regions on the SOM with pathways defined by KEGG and BioCarta, and functional categories defined by GO, through correlations between pathway gene expression patterns and drug response profiles. Further organization of pathways based on their mapping patterns on the GI50 SOM reveals that pathways engaged in the same biological process tend to have similar drug responses. We have validated quantitatively the quality of the method relating pathways to drug responses using established drug–target relationships. These results can be used subsequently to provide potential targets and MOAs for drug molecules. We have, in addition, found that pathways with coherently expressed genes tend to have stronger correlations with drug GI50 profiles than pathways that do not have coherent gene expressions, whereas some noncohesive pathways can act coherently toward their known inhibitors. These pathways and the genes involved may represent interesting drug targets for further exploration.

DATA AND METHODS

Gene Expression Data

Constitutive gene expression data from Novartis, measured in triplicate across the 60 tumor cell lines using the Affymetrix DNA oligonucleotide microarray technology, were downloaded from the Developmental Therapeutics Program (DTP) web server at http://www.dtp.nci.nih.gov. This data set contains 12626 mRNA expression profiles and is publicly available. The data set is first filtered to include only measurements that exhibited the strongest intensity in signal (P < 0.05). The logarithm of each signal is taken to suppress extreme data values. Replicate measurements for each gene are then averaged by taking the median. Finally, only gene expression profiles having data available for at least 40 cell lines are included. This yielded a data set of 4923 genes to be included in our analysis.

Pathway Data

Three databases are used for pathway gene analysis: Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.ad.jp/kegg/), GO (http://www.geneontology.org/), and BioCarta (http://www.biocarta.com/). Annotations for 134 human pathways containing 2804 genes are downloaded from the KEGG ftp site (ftp://ftp.genome.ad.jp/pub/kegg/pathways/hsa/). BioCarta annotations for 314 pathways containing 1406 human genes are downloaded from NCI’s Cancer Genome Anatomy Project (CGAP, http://cgap.nci.nih.gov/) ftp site (ftp://ftp1.nci.nih.gov/pub/CGAP). Annotations for 3564 GO terms containing 10 921 human genes are downloaded from the GO ftp site (ftp://ftp.geneontology.org/pub/go/). Within our gene expression data set, 1047 genes are present in KEGG, 604 are present in BioCarta, and 3210 are present in GO.

Pathway–GI50 SOM Clade Correlations and Significance Calculation

Gene–GI50 SOM clade correlation scores

A clade correlation coefficient between a gene and a GI50 SOM clade is generated by first calculating the Pearson correlation coefficient r, between the gene expression data vector x and each node vector y in the clade. A data vector is composed of data values, either expression levels of a gene or GI50 values of a drug, measured across all cell lines. Therefore, the number of cell lines determines the dimension of a data vector. The SOM algorithm identifies node (cluster) vectors by minimizing the deviation between the GI50 data vectors and the node vectors. Here we choose to use node vectors in place of individual drug data vectors, because the drugs clustered in the same node share similar GI50 profiles; thus, the node vector can be used as a representative of the drug profiles within the node. A node vector has the same dimension as a GI50 data vector. The Pearson correlation coefficient is defined by

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (1) \]

in which $x_i$ and $y_i$ denote the expression level of the gene and the GI50 value of the SOM node, respectively, for cell line i; $\bar{x}$ and $\bar{y}$ denote averages over all cell lines; and n is the


number of cell lines. The magnitude of r describes the strength of the association between the two data vectors. The r-values calculated between a gene and each node vector in a clade are then averaged to generate the correlation coefficient between the gene and that clade. Gene–clade correlation coefficients are generated in this fashion for all genes with all 80 SOM clades.

Pathway–GI50 SOM clade correlation scores

For each pathway and each SOM clade, the clade correlation coefficients of all genes in the pathway and of the genes that are not in the pathway are compared as two sample populations using the Kruskal–Wallis rank sum procedure, and an H statistic (H-score) is generated:

\[ H = \frac{12}{N(N+1)} \sum_{t=1}^{k} \frac{R_t^2}{n_t} - 3(N+1) \qquad (2) \]

where $R_t$ is the rank sum of sample population t, $n_t$ is the size of sample population t, k is the number of sample populations being compared, and

\[ N = \sum_{t=1}^{k} n_t \]

When each of the k sample populations being compared includes at least five observations, the sampling distribution of H is a very close approximation of the χ² distribution for k − 1 degrees of freedom. The H statistic closely approximates the χ² statistic even when one or more of the samples includes as few as three observations. In the present study, only pathways that have at least three gene expression data vectors available are included in the calculations. The significance level (P-value) is assigned using the χ² distribution with one degree of freedom. A large H (H > 3.84) indicates a statistically significant difference (P < 0.05) between the two sample populations. A negative sign is added to the H-score if the sum of ranks of the pathway clade–gene correlation coefficients is smaller than that of the nonpathway clade–gene correlation coefficients, that is, if the GI50 clade correlates more strongly with nonpathway genes than with pathway genes. A pathway is considered significantly correlated with a GI50 clade if the H-score is > 3.84 and P < 0.05.

Randomization procedures are also employed to check the probability of getting a large positive H-score by chance as compared to the P-values obtained directly from the χ² distribution. Genes are selected and assigned randomly to build random pathways of sizes from 15 to 100, which is the typical size range of the real pathways that we have investigated, and the H-scores of each random pathway with all GI50 SOM clades are calculated. This procedure is repeated 1000 times and the probabilities of getting H-scores at various levels are assessed. Further increase in the number of randomizations does not appear to affect the outcome. The probabilities obtained this way closely approximate the P-values derived directly from the χ² distribution (the log P vs log P(χ²) plot fits a straight line with an intercept of ~0 and a slope of ~1, R² = 0.99); that is, an H-score of > 3.84 is required to get P < 0.05. The H-scores required to reach a certain significance level (eg, P < 0.05) appear to be independent of pathway size, that is, sample size. Therefore, P-values obtained from the χ² distribution are used directly as the significance measure with no further correction.

ACKNOWLEDGEMENTS

We thank the members of the STB staff, especially Drs Robert Shoemaker and Susan Mertins, for valuable contributions during the preparation of this manuscript. This project has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Contract No. NO1-CO-12400. The content of this publication does not necessarily reflect the view or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government.

DUALITY OF INTEREST

None declared.

Supplementary Information

Supplementary Information accompanies the paper on The Pharmacogenomics Journal website (http://www.nature.com/tpj).

REFERENCES

1 Capranico G. A rational selection of drug targets needs deeper insights into general regulation mechanisms. Curr Med Chem Anti-Cancer Agents 2004; 4: 393–394.
2 Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc Natl Acad Sci USA 2000; 97: 12182–12186.
3 Szakacs G, Annereau J-P, Lababidi S, Shankavaram U, Arciello A, Bussey KJ et al. Predicting drug sensitivity and resistance: profiling ABC transporter genes in cancer cells. Cancer Cell 2004; 6: 129–137.
4 Huang Y, Anderle P, Bussey KJ, Barbacioru C, Shankavaram U, Dai Z et al. Membrane transporters and channels: role of the transportome in cancer chemosensitivity and chemoresistance. Cancer Res 2004; 64: 4294–4301.
5 Blower PE, Yang C, Fligner MA, Verducci JS, Yu L, Richman S et al. Pharmacogenomic analysis: correlating molecular substructure classes with microarray gene expression data. Pharmacogenomics J 2002; 2: 259–271.
6 Zhou Y, Gwadry FG, Reinhold WC, Miller LD, Smith LH, Scherf U et al. Transcriptional regulation of mitotic genes by camptothecin-induced DNA damage: microarray analysis of dose- and time-dependent effects. Cancer Res 2002; 62: 1688–1695.
7 Lee JK, Scherf U, Smith LH, Tanabe L, Weinstein JN. Analysis of gene expression data of the NCI 60 cancer cell lines using Bayesian hierarchical effects model. Proc Int Soc Opt Eng 2001; 4266: 228–235.
8 Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Tanabe L et al. A gene expression database for the molecular pharmacology of cancer. Nat Genet 2000; 24: 236–244.
9 Wosikowski K, Schuurhuis D, Johnson K, Paull KD, Myers TG, Weinstein JN et al. Identification of epidermal growth factor receptor and c-erbB2 pathway inhibitors by correlation with gene expression patterns. J Natl Cancer Inst 1997; 89: 1505–1515.
10 O’Connor PM, Jackman J, Bae I, Myers TG, Fan S, Mutoh M et al. Characterization of the p53 tumor suppressor pathway in cell lines of the National Cancer Institute anticancer drug screen and correlations with the growth-inhibitory potency of 123 anticancer agents. Cancer Res 1997; 57: 4285–4300.
11 Alvarez M, Paull K, Monks A, Hose C, Lee JS, Weinstein J et al. Generation of a drug resistance profile by quantitation of mdr-1/P-glycoprotein in the cell lines of the National Cancer Institute Anticancer Drug Screen. J Clin Invest 1995; 95: 2205–2214.
12 Li KC, Yuan S. A functional genomic study on NCI’s anticancer drug screen. Pharmacogenomics J 2004; 4: 127–135.
13 Wallqvist A, Rabow AA, Shoemaker RH, Sausville EA, Covell DG. Linking the growth inhibition response from the National Cancer Institute’s anticancer screen to gene expression levels and other molecular target data. Bioinformatics 2003; 19: 2212–2224.
14 Freije JMP, Lawrence JA, Hollingshead MG, de la Rosa A, Narayanan V, Grever M et al. Identification of compounds with preferential inhibitory activity against low-NM23-expressing human breast carcinoma and melanoma cell lines. Nat Med 1997; 3: 395–401.
15 Ficenec D, Osborne M, Pradines J, Richards D, Felciano R, Cho Raymond J et al. Computational knowledge integration in biopharmaceutical research. Brief Bioinform 2003; 4: 260–278.
16 Covell DG, Wallqvist A, Huang R, Thanki N, Rabow AA, Lu XJ. Linking tumor cell cytotoxicity to mechanism of drug action: an integrated analysis of gene expression, small-molecule screening and structural databases. Proteins 2005; 59: 403–433.
17 Huang Y, Blower PE, Yang C, Barbacioru C, Dai Z, Zhang Y et al. Correlating gene expression with chemical scaffolds of cytotoxic agents: ellipticines as substrates and inhibitors of MDR1. Pharmacogenomics J 2005; 5: 112–125.
18 Nakatsu N, Yoshida Y, Yamazaki K, Nakamura T, Dan S, Fukui Y et al. Chemosensitivity profile of cancer cell lines and identification of genes determining chemosensitivity by an integrated bioinformatical approach using cDNA arrays. Mol Cancer Therap 2005; 4: 399–412.
19 Stegmaier K, Ross KN, Colavito SA, O’Malley S, Stockwell BR, Golub TR. Gene expression-based high-throughput screening (GE-HTS) and application to leukemia differentiation. Nat Genet 2004; 36: 257–263.
20 Huang R, Wallqvist A, Covell DG. Comprehensive analysis of pathway or functionally related gene expression in the National Cancer Institute’s anticancer screen. Genomics 2005, submitted.
21 Kohonen T. Self-Organizing Maps. Springer Verlag: Berlin, Germany, 1995.
22 Paull KD, Shoemaker RH, Hodes L, Monks A, Scudiero DA, Rubinstein L et al. Display and analysis of patterns of differential activity of drugs against human tumor cell lines: development of mean graph and COMPARE algorithm. J Natl Cancer Inst 1989; 81: 1088–1092.
23 Rabow AA, Shoemaker RH, Sausville EA, Covell DG. Mining the National Cancer Institute’s tumor-screening database: identification of compounds with similar cellular activities. J Med Chem 2002; 45: 818–840.
24 Huang R, Wallqvist A, Covell DG. Anticancer metal compounds in NCI’s tumor-screening database: putative mode of action. Biochem Pharmacol 2005; 69: 1009–1039.
25 Tormo JR, Gallardo T, Peris E, Bermejo A, Cabedo N, Estornell E et al. Inhibitory effects on mitochondrial complex I of semisynthetic mono-tetrahydrofuran acetogenin derivatives. Bioorg Med Chem Lett 2003; 13: 4101–4105.
26 Tormo JR, Royo I, Gallardo T, Zafra-Polo MC, Hernandez P, Cortes D et al. In vitro antitumor structure–activity relationships of threo/trans/threo mono-tetrahydrofuranic acetogenins: correlations with their inhibition of mitochondrial complex I. Oncol Res 2003; 14: 147–154.
27 Lannuzel A, Michel PP, Hoglinger GU, Champy P, Jousset A, Medja F et al. The mitochondrial complex I inhibitor annonacin is toxic to mesencephalic dopaminergic neurons by impairment of energy metabolism. Neuroscience 2003; 121: 287–296.
28 Randic M. On characterization of chemical structure. J Chem Inf Comput Sci 1997; 37: 672–687.
29 Rosen R. An approach to molecular similarity. In: Johnson MAM, Gerald M (eds). Concepts and Applications of Molecular Similarity. Wiley: New York, NY, 1990, pp 369–382.
30 Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H et al. The Protein Data Bank. Nucleic Acids Res 2000; 28: 235–242.
31 Vinatier D, Monnier JC. Receiver operating curve, an aid in decision making. Principles and applications illustrated with some examples. J Gynecol Obstet Biol Reprod (Paris) 1988; 17: 981–989.
32 Westwell AD, Stevens MF. Hitting the chemotherapy jackpot: strategy, productivity and chemistry. Drug Discov Today 2004; 9: 625–627.
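As an illustration of the gene–clade correlation and the signed H-score described under Data and Methods, the calculation might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, tied values receive average ranks with no tie correction (Equation 2 contains none), and the sign is decided by comparing mean ranks, which is one reading of the "sum of ranks" comparison adopted here because the two populations differ in size. The P-value uses the identity that the upper tail of a χ² distribution with one degree of freedom equals erfc(sqrt(H/2)).

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length vectors (Eq. 1)."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den  # raises ZeroDivisionError for a constant vector


def clade_correlation(gene_vector, clade_nodes):
    """Average of r between one gene and every node vector in a clade."""
    return mean([pearson(gene_vector, node) for node in clade_nodes])


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def signed_h_score(pathway_scores, other_scores):
    """Kruskal-Wallis H for two populations (Eq. 2 with k = 2) and its P-value.

    Returns (signed H, P). The sign is negative when pathway genes rank
    lower on average than nonpathway genes (assumption: mean ranks).
    """
    n1, n2 = len(pathway_scores), len(other_scores)
    n = n1 + n2
    ranks = average_ranks(list(pathway_scores) + list(other_scores))
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
    h = 12 / (n * (n + 1)) * (r1 ** 2 / n1 + r2 ** 2 / n2) - 3 * (n + 1)
    h = max(h, 0.0)  # guard against tiny negative values from rounding
    p = math.erfc(math.sqrt(h / 2))  # chi-square upper tail, 1 degree of freedom
    sign = -1.0 if r1 / n1 < r2 / n2 else 1.0
    return sign * h, p
```

For example, `signed_h_score([0.5, 0.6, 0.7], [0.1, 0.2, 0.3, 0.4])` gives H = 4.5, which exceeds the 3.84 threshold; swapping the two arguments gives H = −4.5.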

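The expression-data filtering steps described under Gene Expression Data (keep measurements with P < 0.05, take the logarithm, take the median of replicates, require at least 40 cell lines) could be sketched as below. Several details are assumptions made only for illustration: the input layout (a dictionary of genes, each mapping cell-line names to lists of `(signal, p_value)` replicates), the function names, and the use of base-10 logarithms, since the text does not state the base.

```python
import math
from statistics import median

MIN_CELL_LINES = 40    # threshold stated in the text
P_CUTOFF = 0.05        # per-measurement signal significance stated in the text


def summarize_gene(replicates_by_cell_line):
    """Return {cell_line: median log10 signal} for one gene.

    Replicates that fail the P cutoff (or have a non-positive signal,
    which cannot be log-transformed) are dropped before the median.
    """
    profile = {}
    for cell_line, replicates in replicates_by_cell_line.items():
        kept = [math.log10(signal)
                for signal, p_value in replicates
                if p_value < P_CUTOFF and signal > 0]
        if kept:
            profile[cell_line] = median(kept)
    return profile


def build_expression_set(raw):
    """Keep only genes with data for at least MIN_CELL_LINES cell lines."""
    profiles = {gene: summarize_gene(reps) for gene, reps in raw.items()}
    return {gene: prof for gene, prof in profiles.items()
            if len(prof) >= MIN_CELL_LINES}
```

A gene measured reliably in only 39 cell lines would be excluded by `build_expression_set`, mirroring the final filter described in the text.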