Applications of Machine Learning Approaches Integrating Analytic Methods and Statistics with High Dimensional Visualizations to Different Problems in Cancer Diagnosis and Detection


John McCarthy*, Kenneth A. Marx, Philip O’Neil, M.L. Ujwal, Patrick Hoffman, Alex Gee and Natasha Markuzon

AnVil, Inc. 25 Corporate Drive Burlington, MA 01803

*corresponding author: [email protected]; (781) 272-1600 x460

Abstract

Introduction to Data Analysis by Machine Learning

Overview of Clustering Methods and Cluster Comparison. Clustering is a method of unsupervised learning. In supervised learning, the objective is to learn predetermined class assignments from other data attributes. For example, given a set of gene expression data for samples with known diseases, a supervised learning algorithm might learn to classify disease states based on patterns of gene expression. In unsupervised learning, there either are no predetermined classes or class assignments are ignored. Cluster analysis is the process by which data objects are grouped together based on some relationship defined between the objects. It is an attempt to discover novel relationships within a given dataset independent of a priori knowledge about the data space [1,2].

An understanding of relationships between objects is inherent in any clustering technique. This is encoded in a distance measure or distance metric (sometimes called a similarity or dissimilarity measure). Unlike a distance measure, a distance metric is required to satisfy the triangle inequality. The most commonly used distance metric is Euclidean distance, which in three dimensions corresponds to the physical distance between objects in space. The Manhattan distance is a "city block" measure, in contrast to the Euclidean "shortest distance between two points." There are several other alternatives, including correlation measures. Generally, each measure defines a relationship based on certain assumptions about the underlying data and attempts to capture particular data characteristics [3].

In conjunction with selecting a distance measure, one must also decide on a clustering algorithm, the procedure by which these n-dimensional objects are grouped together to form clusters. Classical clustering techniques are divided into two groups, the partitioning methods and the hierarchical approaches [1, 3-5]. In more recent years, a third group of techniques that includes probabilistic approaches has emerged [6].
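For concreteness, the Euclidean and Manhattan measures described above can be computed in a few lines of Python; the two 3-dimensional profiles below are hypothetical toy data, not from the datasets in this report:

```python
import math

def euclidean(a, b):
    # Straight-line distance; satisfies the triangle inequality (a true metric).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # "City block" distance: the sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

# Two hypothetical 3-dimensional expression profiles.
g1, g2 = (1.0, 2.0, 2.0), (4.0, 6.0, 2.0)
print(euclidean(g1, g2))  # 5.0
print(manhattan(g1, g2))  # 7.0
```

The two measures can rank the same pairs of objects differently, which is one reason the choice of measure shapes the clusters that emerge.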
Similar to the choice of distance metric, the selection of clustering algorithms for specific datasets poses problems, as each algorithm focuses on certain types of relationships within any given dataset, some overlapping and others unique. The partitioning methods, sometimes referred to as iterative relocation algorithms, construct clusters by first partitioning the data objects into some number of clusters and then recursively moving objects between clusters until some cluster measure is minimized. The result is a set of object groups where each object is assigned to only one cluster. Hierarchical approaches are based on tree structures where the data objects occupy the leaves of the tree and the nodes of the tree define the relationship between subtrees or leaves. These hierarchical methods are defined as either agglomerative (bottom-up) or divisive (top-down) approaches. Agglomerative techniques start with each object in a separate cluster and perform a series of successive fusions of clusters into larger clusters. Divisive methods start with all objects in a single cluster and provide successive refinements of the clusters into smaller clusters. Agglomerative methods include: nearest neighbor (single-link method), furthest neighbor (complete-linkage method), centroid cluster analysis, median cluster analysis, the group average method, Ward's method, McQuitty's methods, the Lance and Williams flexible method, and others [7]. Divisive methods include monothetic methods, which are based on the value of a single attribute, and polythetic methods, which are based on the values of all attributes. Hierarchical techniques provide no indication of how many clusters the data should form. The tree structure can be cut at various levels, and the resulting subtrees determine the clusters and their number. Probabilistic techniques provide information as to how well an object belongs to each cluster rather than just providing the cluster memberships.
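As an illustration of the agglomerative (bottom-up) approach with the nearest-neighbor (single-link) fusion rule named above, here is a minimal Python sketch. It is deliberately naive (quadratic search for the closest pair of clusters) and the one-dimensional points are invented for illustration:

```python
def single_link_agglomerative(points, k, dist):
    # Agglomerative: start with each object in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-link
        # (nearest-neighbor) distance and fuse them.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Cutting the fusion process at k = 3 clusters, on toy 1-D data.
dist = lambda a, b: abs(a - b)
print(single_link_agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], 3, dist))
# [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Stopping the fusion loop at k clusters plays the role of "cutting the tree" at a chosen level, as described above.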
Given that most data spaces do not contain well-defined objects, probabilistic techniques provide additional information about a data space. Examples of probabilistic techniques include the fuzzy clustering algorithms [6]. Many additional clustering techniques are mixtures of the basic types of methods discussed above. Over the past five years there have been a significant number of academic and commercial clustering and classification approaches focused on high dimensional data, particularly biological and chemical data. Fasulo [4], for example, describes recent results on clustering, each of which approaches the clustering problem from a different perspective and with different goals. A fundamental question that arises repeatedly is: which clustering technique is better? The answer to this question is an important commercial consideration for pharmaceutical and biotechnology companies when the $500-800 million average cost of developing a successful drug is at stake. It is even more important when the outcome is accurate clinical detection of a specific cancer in a patient. In drug discovery, the outcome of a clustering technique can influence decisions about selecting drug targets and chemical lead compounds. Several investigators, including Schaffer [8], Felders [9], Dietterich [10] and Cheng [11], have attempted to answer this question. In our view, the assessment of which clustering technique is better is domain and data dependent, given the incomplete information on which clusters are usually based. The question of which technique is better is not the correct question to ask. A number of recent papers have discussed the pitfalls of current comparative analyses, especially when using public domain datasets and databases [12].

Combining Contextual Knowledge with Experimental Data in the Mining of Microarray Gene Expression and Other Molecular Datasets. The availability of genome-wide expression profiles promises to have a profound impact on the understanding of basic cellular processes, on diagnosis and treatment, and on the efficacy of designing and delivering targeted therapeutics. Particularly relevant to these objectives is the ability to cross-reference experimental and analytical results with previously known biological facts, hypotheses, theories and results. Biological and biomedical literature databases provide the knowledge warehouses for such extensive cross-referencing. However, the volume of such databases makes the task of cross-referencing lengthy, tedious and daunting [13]. In order to explain the underlying biological mechanisms and assign "biological meaning" to clusters of genes obtained by analytical methods, it is necessary to cross-reference genes with external information sources. Efforts in this direction are particularly relevant to clustering/classification methods, which typically rediscover known associations between genes. It is therefore important to take full advantage of the existing knowledge about classical cellular pathways, including the metabolic and signaling pathways, transcription factors, regulatory elements/motifs in sequence or structure information, and assigned gene functions. Literature databases, which are a rich source of information, can be used to discover and analyze biologically significant information based on co-citations or co-occurrences of gene:gene or gene:disease term pairs in a given scientific paper. Likewise, one can extract biologically meaningful relationships in the semantic framework of ontologies specifically developed to capture such information from experimental results reported in the literature [14].
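A minimal sketch of the co-citation idea described above, counting how often gene pairs appear together in the same paper; the gene symbols and per-paper mention sets are hypothetical stand-ins for what a real literature database would supply:

```python
from itertools import combinations
from collections import Counter

# Hypothetical per-paper gene mention sets (a real pipeline would extract
# these from a literature database such as MEDLINE abstracts).
papers = [
    {"TP53", "MDM2", "RB1"},
    {"TP53", "MDM2"},
    {"RB1", "E2F1"},
]

cocitations = Counter()
for genes in papers:
    # Count each unordered gene pair once per paper.
    for pair in combinations(sorted(genes), 2):
        cocitations[pair] += 1

# Gene pairs co-cited most often are candidate biological associations.
print(cocitations.most_common(1))  # [(('MDM2', 'TP53'), 2)]
```

Treating each co-citation count as an additional data dimension is one concrete way contextual knowledge can be appended to an experimental dataset.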
One of AnVil's strengths is our ability to carry out integrated data mining and visualization analyses on large, complex nonlinear datasets that may have as many as 50,000 data dimensions. Therefore, we have a practical way to overcome the need to reduce dimensionality early on in addressing any specific problem. One advantage this mechanism provides is the ability to simultaneously handle large numbers of data dimensions, enabling us, for example, to add contextual knowledge into the already large-dimensionality datasets that researchers have to analyze; the contextual knowledge is simply treated as additional data dimensions. We discuss the distinct advantages of our technology in greater detail in the following sections. The Importance of High-dimensional Data Visualization and its Integration with Analytic Data Mining Techniques. Visualization, data mining, statistics, as well as mathematical modeling and simulation, are all methodologies that can be used to enhance the discovery process [15]. AnVil's expertise lies in a combination of analytic data mining techniques integrated with advanced high-dimensional visualizations (HDVs). There are numerous visualizations and a good number of valuable taxonomies (see [16] for an overview of taxonomies). Most information visualization systems focus on tables of numerical data (rows and columns), such as 2D and 3D scatterplots [17], although many of the techniques apply to categorical data.
Looking at the taxonomies, the following stand out as high-dimensional visualizations: matrix of scatterplots [17]; heat maps [17]; height maps [17]; table lens [18]; survey plots [19]; iconographic displays [20]; dimensional stacking (general logic diagrams) [21]; parallel coordinates [22]; pixel techniques, circle segments [23]; multidimensional scaling [23]; Sammon plots [24]; polar charts [17]; RadViz [25]; principal component analysis [26]; principal curve analysis [27]; grand tours [28]; projection pursuit [29]; Kohonen self-organizing maps [30]. Grinstein et al. [31] have compared the capabilities of most of these visualizations. Historically, static displays include histograms, scatterplots, and large numbers of their extensions. These can be seen in most commercial graphics and statistical packages (Spotfire, S-PLUS, SPSS, SAS, MATLAB, Clementine, Partek, Visual Insight's Advisor, and SGI's MineSet, to name a few). Most software packages provide limited features that allow interactive and dynamic querying of data. HDVs have been limited to research applications and have not been incorporated into many commercial products. However, HDVs are extremely useful because they provide insight during the analysis process and guide the user to more targeted queries. Visualizations fall into two main categories: (1) low-dimensional, which includes scatterplots, with 2-9 variables (fields, columns, parameters), and (2) high-dimensional, with 100-1000+ variables. Parallel coordinates, or a spider chart or radar display in Microsoft Excel, can display up to 100 dimensions but place a limit on the number of records that can be interpreted: when more than 1000 records are displayed, the lines overlap and cannot be distinguished. There are a few visualizations that deal with a large number (>100) of dimensions quite well: heat maps, height maps, iconographic displays, pixel displays, parallel coordinates, survey plots, and RadViz.
Of these, only RadViz is uniquely capable of dealing with ultra-high-dimensional (>10,000 dimensions) datasets, and we discuss it in detail below. RadViz™ is a visualization and classification tool that uses a spring analogy for the placement of data points and incorporates machine learning feature reduction techniques as selectable algorithms [13-15]. The "force" that any feature exerts on a sample point is determined by Hooke's law: f = kd. The spring constant k, ranging from 0.0 to 1.0, is the value of the feature for that sample, and d is the distance between the sample point and the perimeter point on the RadViz circle assigned to that feature (see Figure 1). The placement of a sample point, as described in Figure 1, is determined by the point where the total force, summed vectorially over all features, is zero. The RadViz display combines the n data dimensions into a single point for the purpose of clustering, but it also integrates embedded analytic algorithms in order to intelligently select and radially arrange the dimensional axes. This arrangement is performed through Autolayout, a unique, proprietary set of algorithmic features based upon the dimensions' significance statistics that optimizes clustering by maximizing the distance separating clusters of points. The default arrangement places all features equally spaced around the perimeter of the circle, but the feature reduction and class discrimination algorithms arrange the features unevenly in order to increase the separation of different classes of sample points. The feature reduction technique used in all figures in the present work is based on the t statistic with Bonferroni correction for multiple tests. The circle is divided into n equal sectors or "pie slices," one for each class. Features assigned to each class are spaced evenly within the sector for that class, counterclockwise in order of significance (as determined by the t statistic, comparing samples in the class with all other samples).
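The spring placement just described has a closed form: setting the vector sum of Hooke's-law forces k_j(S_j - u) to zero gives the sample position u as the k-weighted mean of the feature anchor points S_j. A small Python sketch of this placement, assuming the default equally spaced layout; the function name and toy values are ours for illustration, not RadViz's actual API:

```python
import math

def radviz_point(values, n_features):
    # Anchor each feature at an equally spaced point on the unit circle
    # (the default layout described above).
    anchors = [(math.cos(2 * math.pi * j / n_features),
                math.sin(2 * math.pi * j / n_features))
               for j in range(n_features)]
    # Equilibrium of spring forces k_j * (S_j - u) = 0 over all features
    # gives u as the k-weighted mean of the anchor points.
    total = sum(values)
    x = sum(k * ax for k, (ax, ay) in zip(values, anchors)) / total
    y = sum(k * ay for k, (ax, ay) in zip(values, anchors)) / total
    return x, y

# A sample whose spring constants are all equal lands at the circle's
# center, pulled equally in every direction.
print(radviz_point([0.5, 0.5, 0.5, 0.5], 4))  # approximately (0.0, 0.0)
```

This also makes the behavior noted below easy to see: if all anchors sit on one side of the circle, the weighted mean, and hence every sample point, is pulled to that side.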
As an example, for a 3-class problem, features are assigned to class 1 based on the t-statistic comparing class 1 samples with class 2 and 3 samples combined. Class 2 features are assigned based on the t-statistic comparing class 2 values with class 1 and 3 combined, and class 3 features are assigned based on the t-statistic comparing class 3 values with class 1 and 2 combined. Occasionally, when large portions of the perimeter of the circle have no features assigned to them, the data points all cluster on one side of the circle, pulled by the unbalanced force of the features present in other sectors. In this case, a variation of the spring force calculation is used, in which the features are effectively divided into qualitatively different forces comprising high and low k value classes. This is done by allowing k to range from -1.0 to 1.0. The net effect is to make some of the features 'pull' (high or +k values) and others 'push' (low or -k values) the points, spreading them across the display space while maintaining the relative point separations. It should be noted that one can simply perform feature reduction by choosing the top features by t-statistic significance and then apply those features to a standard classification algorithm; the t-statistic is a standard method for feature reduction in machine learning approaches, independent of RadViz. The most significant chemicals selected with the t-statistic are the same as those selected by RadViz, which has this machine learning feature embedded in it and is responsible for the selections carried out here. The advantage of RadViz is that one immediately sees a "visual" clustering of the results of the t-statistic selection. Generally, the amount of visual class separation correlates with the accuracy of any classifier built from the reduced features.
The additional advantage of this visualization is that subclusters, outliers and misclassified points can quickly be seen in the graphical layout. One of the standard techniques to visualize clusters or class labels is to perform a principal component analysis and show the points in a 2D or 3D scatterplot using the first few principal components as axes. Often this display shows clear class separation, but the most important features contributing to the PCA are not easily seen. RadViz is a "visual" classifier that can help one understand which features are important and how they are related.

Figure 1. How RadViz works.

We have studied the following systems related to cancer detection:

1. GI50 compound data for 60 cancer cell lines
2. Microarray lung cancer data
3. A proteomics MS dataset

1. Data Mining the Public Domain NCI Cancer Cell Line Compound GI50 Data Set using Supervised Learning Techniques

Introduction to the Cheminformatics Problem. Important objectives in the overall process of molecular design for drug discovery are: 1) the ability to represent and identify important structural features of any small molecule, and 2) the ability to select useful molecular structures for further study, usually using linear QSAR models based upon simple partitioning of the structures in n-dimensional space. To date, partitioning using non-linear QSAR models has not been widespread, but the complexity and high dimensionality of the typical data set require them. The machine learning and visualization techniques that we describe and utilize here represent an ideal set of methodologies with which to approach representing the structural features of small molecules and then selecting molecules by constructing and applying non-linear QSAR models. QSAR models might typically use calculated chemical descriptors of compounds along with computed or experimentally determined compound physical properties and interaction parameters (ΔG, Ka, kf, kr, LD50, GI50, etc.) with other large molecules or whole cells. The former types of data would be generated in silico (ΔG) or via high throughput screening of compound libraries against appropriate receptors or important signaling pathway macromolecules (Ka, kf, kr), whereas the LD50 and GI50 types of data would be generated against whole cells appropriate to the disease model being investigated. Once the data have been generated, machine learning can be applied. We provide a sample illustration of this process below. The National Cancer Institute's Developmental Therapeutics Program maintains a compound data set (>700,000 compounds) that is currently being systematically tested for cytotoxicity (generating 50% growth inhibition, GI50, values) against a panel of 60 cancer cell lines representing 9 tissue types.
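As a toy illustration of a non-linear QSAR-style model of the kind discussed above, the sketch below predicts an activity value for a compound from the activities of its nearest neighbors in descriptor space (a k-nearest-neighbor regressor; the descriptor vectors and activity values are invented for illustration and do not come from the NCI data set):

```python
def knn_predict(train_X, train_y, x, k=3):
    # Non-linear QSAR sketch: estimate a GI50-like activity for compound x
    # as the mean activity of its k nearest neighbors in descriptor space.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), y)
        for row, y in zip(train_X, train_y)
    )
    neighbors = [y for _, y in dists[:k]]
    return sum(neighbors) / k

# Hypothetical 2-D chemical descriptor vectors and measured activities.
train_X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 0.9)]
train_y = [5.0, 5.2, 9.0, 9.2]
print(knn_predict(train_X, train_y, (0.05, 0.0), k=2))  # approximately 5.1
```

Unlike a linear QSAR model, this predictor can follow arbitrary local structure in descriptor space, which is the property the text argues high-dimensional compound data require.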
Therefore, this dataset contains a wealth of valuable information concerning potential cancer drug pharmacophores. In a data mining study of the 8 largest public domain chemical structure databases, it was observed that the NCI compound data set contained by far the largest number of unique compounds of all the databases (32). The application of sophisticated machine learning techniques to this unique NCI compound dataset represents an important open problem that motivated the study we present in this report. Previously, this data set has been mined by learning techniques such as cluster correlation, principal component analysis and various neural networks, as well as statistical techniques (33,34). These approaches have identified compound class subsets such as tubulin active compounds (35), pyrimidine biosynthesis inhibitors (36) and topoisomerase II inhibitors (37) that possess similar mechanisms of action (MOA), share similar structures or develop similar patterns of drug resistance. Compound structure classes such as the ellipticine derivatives have also been studied and point to the validity of the concept that fingerprint patterns of activity in the NCI data set encode information concerning MOAs and other biological behavior of tested compounds (38). More recently, gene expression analysis has been added to the data mining activity of the NCI compound data set (39) to predict chemosensitivity, using the GI50 test data for each compound, for a few-hundred-compound subset of the NCI data set (40). After we completed our data mining analysis (41), gene expression data on the 60 cancer cell lines was combined with NCI compound GI50 data and with a 27,000 feature database computed for the NCI compounds to calculate chemical features similar to those identified in the following study and as we have presented elsewhere (42).
In the present data mining study, we use microarray-based gene expression data to first establish a number of 'functional' classes of the 60 cancer cell lines via a hierarchical clustering technique. These functional classes are then used to supervise a 3-Class learning problem, using a small but complete subset of 1400 of the NCI compounds' GI50 values as the input to a clustering algorithm in the RadViz™ program (43). At p < .01 significance, RadViz™ identifies two small compound subsets that accurately classify the cancer cell line classes: melanoma from non-melanoma and leukemia from non-leukemia (41). We then demonstrate that independent analytic classifiers validate the two small compound subsets we selected. We found both to be significantly enriched in quinone compounds of two distinct subtypes. We conclude that our machine learning approach has yielded important new molecular insights into a class of compounds demonstrating a high level of specificity in cancer cell type toxicity.

Specific Methods Used. For the ~4% missing values found in the 1400 compound data set, we tried and compared two approaches to missing value replacement: 1) record average replacement; 2) multiple imputation using Schafer's NORM software (44). Using either missing value replacement method for the starting data set, there was close agreement (always > 90%) between the NCI compound lists selected in the identical 2-Class Problem classifications we present below. Therefore, in the present study, we used the record average replacement method for all the data presented. Clustering of cell lines was done with R-Project software using the hierarchical clustering algorithm with the "average" linkage method and a dissimilarity matrix computed as 1 minus the Pearson correlation of the gene expression data. AnVil Corporation's RadViz™ software (45) was used for feature reduction and initial classification of the cell lines based on the compound GI50 data. The selected features were validated using several classifiers from Weka 3.1.9 (Waikato Environment for Knowledge Analysis, University of Waikato, New Zealand). The classifiers used were IB1 (nearest neighbor), IB3 (3 nearest neighbor), logistic regression, Naïve Bayes, support vector machine, and a neural network with back propagation. ChemOffice 6.0 (CambridgeSoft Corp.) and the NCI website were used to identify compound structures via their NSC numbers, and substructure searches to identify quinone compounds in the larger data set were carried out using ChemFinder (CambridgeSoft).
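The two data preparation steps just described, record average replacement of missing values and the 1 minus Pearson correlation dissimilarity, can be sketched as follows; these are our own minimal Python versions for illustration, not the R-Project code actually used:

```python
import math

def record_average_replace(row):
    # Replace each missing value (None) with the mean of that record's
    # observed values (the record average replacement method above).
    observed = [v for v in row if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in row]

def pearson_dissimilarity(a, b):
    # 1 minus the Pearson correlation, the dissimilarity used for the
    # hierarchical clustering of the cell lines.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)

row = record_average_replace([1.0, None, 3.0])      # [1.0, 2.0, 3.0]
print(pearson_dissimilarity(row, [2.0, 4.0, 6.0]))  # approximately 0.0
```

Note that perfectly correlated profiles get dissimilarity 0 regardless of their absolute scale, which is why correlation-based distances are popular for expression data.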

Results and Discussion. Identifying functional cancer cell line classes using gene expression data. Based upon gene expression data, we identified cancer cell line classes that we could use in a subsequent supervised learning approach. In Figure 2, we present a hierarchical clustering dendrogram using the 1 minus Pearson distances calculated from the T-matrix, comprised of 1376 gene expression values determined for the 60 NCI cancer cell lines (43). Five well defined clusters are observed. Four of the clusters in Figure 2 (renal, leukemia, ovarian and colorectal, from second left to right) represent pure cell line classes. Only the melanoma class contains two members of another clinical tumor type, the breast cancer cell lines MDA-MB-435 and MDA-N. These two breast cancer cell lines behave functionally as melanoma cells and appear to be related to the melanoma cell lines via a neuroendocrine origin (43). The remaining cell lines in the Figure 2 dendrogram, those not found in any of the five functional classes, are defined as the sixth class: the non-melanoma, non-leukemia, non-renal, non-ovarian, non-colorectal class. In the supervised learning studies that follow, we treat these six functional clusters as the ground truth.

3-Class Cancer Cell Classifications and Validation of Selected Compounds. High class number classification problems are difficult to implement when the data are not clearly separable into distinct classes, and we could not successfully carry out a 6-class classification of the cancer cell line classes based upon the starting GI50 compound data. Therefore, we implemented a 3-Class supervised learning classification utilizing RadViz™ (25, 45-47). Starting with the small 1400 compound GI50 data set that contained no missing values for all 60 cell lines, those compounds were selected that were effective in carrying out the classification at the p < .01 (Bonferroni corrected t statistic) significance level. The 3-Class problem at p < .01 significance, for the melanoma, leukemia and non-melanoma, non-leukemia classes, is presented in Figure 3. This produced clear and accurate class separations of the 60 cancer cell lines. There were 14 compounds selected as most effective against melanoma class cells and 30 compounds identified as most effective against leukemia class cells. Similar classification results were obtained for separate 2-Class problems: melanoma vs. non-melanoma and leukemia vs. non-leukemia (data not shown; [41]). For all other possible 2-Class problems, we found that few to no compounds could be selected at p < .01. Our next goal was to validate these results, utilizing six independent analytic classification techniques (Instance Based 1, Instance Based 3, Naïve Bayes, logistic regression, support vector machines and neural networks), with the same selected compounds' GI50 values as a classifier set, using the hold-one-out method (data not shown; see 41). Using these selected compounds resulted in a greater than 6-fold lower level of error compared to using equivalent numbers of randomly selected compounds, thus validating our selection methodology.
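The Bonferroni corrected t-statistic selection used above can be sketched as follows. This illustration uses a normal approximation to the t distribution and invented GI50-like profiles, so it shows the shape of the procedure rather than reproducing the paper's exact computation:

```python
from statistics import NormalDist, mean, stdev

def t_statistic(class_vals, rest_vals):
    # Two-sample (Welch) t statistic comparing one class's values with
    # all other samples, as in the feature-selection step above.
    m1, m2 = mean(class_vals), mean(rest_vals)
    v1, v2 = stdev(class_vals) ** 2, stdev(rest_vals) ** 2
    se = (v1 / len(class_vals) + v2 / len(rest_vals)) ** 0.5
    return (m1 - m2) / se

def select_features(features, class_idx, alpha=0.01):
    # Bonferroni correction: divide alpha by the number of features tested.
    # A normal approximation stands in for the exact t distribution here.
    cutoff = NormalDist().inv_cdf(1 - (alpha / len(features)) / 2)
    selected = []
    for name, vals in features.items():
        cls = [v for i, v in enumerate(vals) if i in class_idx]
        rest = [v for i, v in enumerate(vals) if i not in class_idx]
        if abs(t_statistic(cls, rest)) > cutoff:
            selected.append(name)
    return selected

# Toy GI50-like profiles over 8 cell lines; lines 0-3 form the class.
features = {
    "cmpd_A": [9.0, 9.1, 8.9, 9.2, 1.0, 1.1, 0.9, 1.2],  # discriminates
    "cmpd_B": [5.0, 4.8, 5.2, 5.1, 5.0, 4.9, 5.1, 5.2],  # does not
}
print(select_features(features, {0, 1, 2, 3}))  # ['cmpd_A']
```

The surviving features can then be handed to any standard classifier, mirroring the validation step with the Weka classifiers described above.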

Quinone Compound Subtypes Preferentially Effective Against Melanoma. Next, we examined the chemical identity of the compounds selected as most effective against melanoma and leukemia. To summarize, of the 14 compounds selected as most effective against melanoma, 11 are p-quinones, and all 11 are internal ring quinone structures (41). We display in Figure 4A the most cytotoxic of these structures. These internal ring quinones possess either two neighboring aromatic five- or six-membered fused rings, some of which contain heteroatoms, on either side of the quinone ring, or an aromatic fused ring neighbor on one side and non-H substitutions off the other side of the quinone. In nearly all cases, these substitutions have electronegative atoms covalently bonded to either or both the ortho and meta carbon positions of the quinone ring. A recent analysis, simultaneously correlating gene expression data for the 60 cancer cell lines with GI50 values, identified a sub-class of compounds containing a benzothiophenedione core structure that were most highly correlated with the expression patterns of Rab7 and other melanoma specific genes (42). There is clearly some overlap between the internal quinone subtype we have defined in the present study and the benzothiophenedione core structure members. Of the 11 internal quinone compounds we identified, 3 are of the benzothiophenedione core structure class, but they are not amongst the most effective compounds we identified. The Rab7 gene is a member of the GTP binding protein family involved in the docking of cellular transport vesicles and is a key regulator of aggregation and fusion of late endocytic lysosomes (48). A number of other genes whose expression levels highly correlate with the same compounds express proteins involved in other lysosomal functions, suggesting a link between the quinone oxidation potential, the proton pump and the electron transport chain.
This suggests the possibility that benzothiophenedione compounds may act directly as surrogate oxidizing agents, effectively competing with ubiquinone in the electron transport chain and thereby disrupting cellular redox processes.

Quinone Compound Subtypes Preferentially Effective Against Leukemia. There were 30 compounds selected as most effective against leukemia in the leukemia, non-leukemia 3-Class Problem, of which 8 are structures containing p-quinones (41). In contrast to the internal ring quinones in the melanoma class, 6 of the 8 leukemia p-quinones were external ring quinones. We display the most cytotoxic example of these structures in Figure 4B. In contrast to the internal ring quinones, these external ring quinones had only one aromatic fused ring neighbor, which in all cases had no ring heteroatoms. Also different, the quinone itself was at the periphery of the molecule and had no non-H substituents off the exterior side of the ring at either the ortho or meta carbon positions. Thus, the 'external' and 'internal' quinone rings should possess different electron densities and redox potentials for the quinoid oxygens. Besides redox potentials, other possible subtype differences may exist, such as solubility, steric differences relative to metabolic enzyme active sites, differential cellular adsorption, etc. In the study discussed above (42), a sub-class of compounds comprised of an indolonaphthoquinone core structure was identified as most highly correlated with the expression patterns of LCP1 (lymphocyte cytosolic protein 1), HS1 (a hematopoietic lineage specific gene) and other leukemia specific genes. There is overlap between the external quinone subtype in our study and the indolonaphthoquinone core structure members. This overlap between the two studies is somewhat remarkable, since we included no gene expression data in our analysis of the GI50 values, whereas the other study (42) did. This suggests that there is sufficient information inherent in the compound GI50 values to carry out the basic core discovery presented here using sophisticated machine learning techniques, without the need to include gene expression data in the analysis.

Uniqueness of Two Quinone Subtypes. To ascertain the uniqueness of the two quinone subsets we discovered, we first determined the extent of occurrence of p-quinones of all types in our starting data set via substructure searching using the ChemFinder 6.0 software. The internal and external quinone subtypes represent a significant fraction: 25% (10/41) of all the internal quinones and 40% (6/15) of all the external quinones in the entire data set. In addition, we determined that only one compound, NSC 621179, which is not a quinone but an epoxide, was found to be effective against both melanoma and leukemia in a 2-Class classification where one class was both leukemia and melanoma cell lines and the second class was non-melanoma, non-leukemia cell lines. This result attests to the uniqueness of the specificity of the two quinone subtype classes. Also, the NCI data set lists 92 well studied compounds known to fall within one of 6 Mechanism of Action (MOA) classes: alkylating agents, antimitotic agents, topoisomerase I inhibitors, topoisomerase II inhibitors, RNA/DNA antimetabolites and DNA antimetabolites (33). We determined that the 14 and 30 compounds most effective against melanoma and leukemia, respectively, that we identified in the 3-Class problem do not cluster with any one of these 6 MOA compound classes.

Sub-classification of Leukemia Cell Lines. We next asked whether these machine learning techniques could sub-classify either the melanoma or the leukemia cell lines into distinct clinical sub-classes based upon our two respective quinone subtype classes. The answer is that we could, with a 3-Class leukemia cell sub-classification into the acute lymphoblastic leukemia (ALL), non-ALL leukemia (other) and non-leukemia cell classes at p < .05. To carry out the sub-classification, we used the 30 compounds identified at the p < .01 selection criterion as most effective against all leukemias; this result is presented in Figure 5. Six of the 30 compounds were most effective against the ALL class, while 12 of the 30 were most effective against the non-ALL leukemias. In this result, it is clear that there is a separation of the 2 ALL cell lines (CCRF-CEM and MOLT-4) from the non-ALL leukemia sub-class. These two ALL cell lines were also the most closely clustered leukemia cells in the Figure 2 gene expression based clustering dendrogram. These results suggest the interesting possibility that the chemical identities of the compounds most effective against the 2 ALL cell lines are linked to the gene functions most responsible for closely clustering these 2 ALL cell lines in Figure 2.

NAD(P)H:Quinone Oxidoreductase 1, Quinone Substrates and Leukemias. Different redox potentials and enzymatic reactivities are likely to be the key to how these quinone subtypes differentially affect melanoma and leukemia cells. In addition to the gene candidates identified as potentially involved in quinone activity in the study already discussed (42), a strong candidate enzyme for the differential toxicity we observed is NAD(P)H:quinone oxidoreductase 1 (QR1, NQO1, also DT-diaphorase; EC 1.6.99.2). This enzyme, catalyzing the two-electron reduction of substrates, most efficiently utilizes quinones as substrates (49). The X-ray structures of the apoenzyme at 1.7 Å resolution and of its complex with the substrate duroquinone are known (50,51). NAD(P)H:quinone oxidoreductase 1 is a chemoprotective enzyme that protects cells from oxidative challenge. Antitumor quinones, of the type we have identified above in the NCI data set, may be bioactivated by this enzyme to forms that are cytotoxic. Interestingly, there are a number of reports that correlate altered forms or alleles of this enzyme with leukemia (52-54). These reports, associating leukemias with particular aspects of NAD(P)H:quinone oxidoreductase 1, suggest that the enzyme is likely a significant factor in why the external quinone subtype compounds, acting as particularly potent and effective substrates, exhibit their differential selectivity toward leukemias.

Conclusion. With this cheminformatics example we have demonstrated that the machine learning approach described above, utilizing RadViz™, has produced a novel discovery: two quinone subtypes were identified that possess clearly different and specific toxicity toward the leukemia and melanoma cancer cell types. We believe that this example illustrates the potential of sophisticated machine learning approaches to uncover new and valuable relationships in complex, high dimensional chemical compound data sets.

2. Microarray Analysis of High Throughput Gene Expression Experiments: Effects of Normalization Methods on Gene Expression Clustering Results. Completion of the Human Genome Project has made possible the study of the expression levels of over 30,000 genes [14,15; although a ‘final’ human genome sequence is scheduled for release in Spring, 2003]. Major technological advances have made possible the use of DNA microarrays to speed up this analysis. Even though the first microarray experiment was only published in 1995, by October 2002 a PubMed query of the microarray literature yielded more than 2300 hits, indicating explosive growth in the use of this powerful technique. DNA microarrays take advantage of the convergence of a number of technologies and developments, including: robotics and miniaturization of features to the micron scale (currently 20-200 µm surface feature sizes for spotting/printing and immobilizing sequences for hybridization experiments), DNA amplification by PCR, automated and efficient oligonucleotide synthesis and labeling chemistries, and sophisticated bioinformatics approaches. It is this latter aspect of the development of microarray technology that our Phase II proposal addresses. One significant aspect of analyzing microarray gene expression data is the need for normalization to remove non-biological sources of variation (noise) in order to make meaningful comparisons of data from different microarrays. The noise results from differences in individual chips, labeling chemistry, length of the immobilized oligonucleotide sequence, different optical properties of various data scanners, and other sources. The importance of understanding and controlling these variables has been underscored by the apparent lack of reproducibility of some published microarray studies.
This has led to the establishment of the MIAME publication guidelines, which detail the following requirements for describing microarray experiments: 1) experimental design, 2) array design and the name and location of array spots, 3) sample preparation, extraction and labeling, 4) hybridization protocols, 5) image measurement methods, and 6) controls used [16-18]. Normalization techniques that have been applied include simple linear scaling, locally linear transformations, and other nonlinear methods. To some extent, the techniques used depend on the type of array. In 2-channel arrays, for example cDNA microarrays, the issue is primarily within-chip normalization to correct distortions based on location and signal intensity. Between-chip normalization is less of an issue for these arrays because one channel usually contains a reference tissue that is common to all arrays in the experiment; indeed, between-chip normalization has the potential to introduce more noise than it eliminates. A number of thorough discussions of normalization techniques for cDNA arrays have been presented [19,20]. These normalization approaches include dye swap experiments to correct for differences between the two channels, using the lowess function to correct for global intensity based differences (i.e., across all genes on the chip), and using the lowess function locally to account for spatial and print-tip differences. Affymetrix microarrays are used in the majority of applications. For these arrays, between-chip normalization is an important issue, and is closely related to the method of calculating a gene expression value from the multiple probes for each gene. Techniques proposed for calculating expression include the original Affymetrix method of the average difference between perfect match and mismatch probes, the Model Based Expression Index approach of Li and Wong [21], and the Robust Multichip Average approach of Irizarry et al. [22].
Durbin et al [23] have suggested a variance-stabilizing transformation to aid microarray analysis. There is the additional consideration of whether to normalize data based on probe level measurements or expression calculations, and whether to use a baseline array for comparison or to normalize over the complete set of data. Bolstad et al [24] present comparisons of some of these techniques. They recommend probe level and complete data methods in general, and quantile normalization in particular. They also found that the invariant set normalization approach of Schadt et al [25] using a baseline array gives results that are comparable to complete data methods. Our experience has shown that quantile normalization works well even when probe level data are not available. However, quantile normalization makes the implicit assumption that the data on all chips have the same distribution. For some datasets this may not be appropriate. Different normalization and modeling techniques can lead to widely varying judgments and interpretations of differential gene expression. In this Phase II proposal, we aim to investigate the effects of different data normalizations on clustering. We will compare quantile normalization, invariant set normalization, lowess local regression, and simple linear scaling. We will focus primarily on Affymetrix type arrays, but we will ensure that the platform we develop supports the adaptation and application of these techniques to two channel microarrays where appropriate. We will also investigate the effects of different modeling techniques on clusters. The more successful a technique is at removing noise, the more likely it is that the clusters generated will be accurate and will have biological meaning. On the other hand, the quality and stability of clusters could be a useful measure of the appropriateness of the normalization and modeling techniques used. 
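Quantile normalization as discussed above can be sketched in a few lines. The following is our own illustrative Python (not the implementation of Bolstad et al.), which forces every chip (column) of a small genes × chips matrix onto the same empirical distribution:

```python
import numpy as np

def quantile_normalize(expr):
    """Quantile-normalize a genes x chips matrix: every chip (column)
    is mapped onto the mean of the per-chip sorted value profiles."""
    expr = np.asarray(expr, dtype=float)
    order = np.argsort(expr, axis=0)                 # per-chip sort order
    ranks = np.argsort(order, axis=0)                # rank of each value within its chip
    reference = np.sort(expr, axis=0).mean(axis=1)   # shared target distribution
    return reference[ranks]

# Toy data: 4 genes measured on 3 chips with different intensity scales.
chips = np.array([[5.0, 4.0, 3.0],
                  [2.0, 1.0, 4.0],
                  [3.0, 4.0, 6.0],
                  [4.0, 2.0, 8.0]])
normalized = quantile_normalize(chips)
```

After the transform, the sorted values of every column are identical, which is precisely the equal-distribution assumption flagged above.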
Therefore, a goal of this Phase II proposal is to provide users with decision making tools to decide which normalization approach is optimal or close to optimal for a given microarray dataset. Also, the normalization tools will be integrated with the perturbation algorithm output, discussed below, to determine the stability of clusters from different normalizations. In this way, we can provide users with the identity of those genes that are most stable within clusters, and those that are unstable and jump between clusters as a result of different normalizations.

Section: NCI Lung Cancer – 3 Classes (agee)

Introduction

An important use for gene expression data is the automatic distinction between normal and lung cancer tissue samples. In an attempt to understand the feasibility of such a task, AnVil, in collaboration with the NCI, examined two example data sets of patients with and without various lung cancers. The initial aim of AnVil’s task was simply to determine whether a patient has lung cancer based on microarray data collected from lung tissue samples. However, AnVil went one step further, analyzing a three-class problem that distinguishes between normal tissue and two subclasses of non-small cell lung carcinomas: adenocarcinomas and squamous cell carcinomas. Given the numerous choices and various complexities of this task, AnVil took a systematic approach that included three primary steps: selection, evaluation and relevance. The first step involves making an intelligent selection of genes via some modeling technique. Because the selection of genes depends on the number of genes and the selection algorithm, AnVil experimented with multiple variations. Next, the selected genes are evaluated by some classification algorithm to determine their ability to distinguish between normal tissue and the two cancer types; here AnVil opted to try a number of different classification algorithms and to check for consistency among the resulting models. The final step adds domain knowledge to the process by determining the biological relevance of these genes and their known associations with lung cancer.

Available Data

AnVil was provided with two data sets of patients with and without lung cancer. Both data sets included gene expression measurements of malignant or normal patient tissue samples using Affymetrix’s Human Genome U95 Set [1]; only the first of the five oligonucleotide based GeneChip® arrays was used in this experiment. Chip A of the HG U95 array set contains roughly 12,000 full-length genes and a number of controls. The first data set was provided directly from NCI, courtesy of Jin Jen and Tatiana Dracheva, and included 75 patient samples. This set contained 17 normal samples, 30 adenocarcinomas (6 doubles), and 28 squamous cell carcinomas (2 doubles); doubles represent replicate samples prepared at different times, using different equipment, from the original sample preparation. A second set of 157 patient samples was provided via public access, courtesy of Matthew Meyerson at the Dana-Farber Cancer Institute [2]. This set included 17 normal samples, 139 adenocarcinomas (127 of these with supporting information) and 21 squamous cell carcinomas. The Meyerson data set also included 6 small cell lung cancer samples and 20 pulmonary carcinoid tumors, which AnVil set aside during this analysis. Because AnVil was dealing with two data sets from different sources, with microarray measurements taken at multiple times, we needed to consider a normalization procedure. For this particular analysis we simply scaled each sample to a mean of 200. As with our systematic approach to selecting and validating sets of genes, AnVil has also undertaken an analysis of various normalization techniques, though no conclusions are available yet. In addition to considering normalization of the samples within each data set and between the two data sets, AnVil took this opportunity to treat each data set independently. By keeping the data sets separate we could use one, the NCI data set, for training and gene selection, while using the second, Meyerson data set for independent validation of the selected genes.
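The per-sample scaling just described (a global linear rescale so each sample averages 200) can be sketched as follows; this is an illustrative reconstruction, not AnVil's actual code:

```python
def scale_to_mean(samples, target=200.0):
    """Linearly rescale each sample so its mean expression equals
    `target` -- the simple global normalization described in the text."""
    scaled = []
    for values in samples:
        factor = target / (sum(values) / len(values))   # per-sample scale factor
        scaled.append([v * factor for v in values])
    return scaled

# Two toy samples with different overall intensities.
samples = [[100.0, 300.0, 200.0], [50.0, 150.0, 100.0]]
result = scale_to_mean(samples)   # every sample now averages 200
```

Note that this corrects only overall intensity differences between samples, not the location- or intensity-dependent distortions that the normalization techniques discussed earlier address.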

Gene Sets

The first step of AnVil’s three-part analysis was the selection of genes that could distinguish between normal lung tissue and the two types of non-small cell lung carcinomas, adenocarcinomas and squamous cell carcinomas. When making a selection of genes for this task we needed to consider two requirements: size and procedure. It is quite clear that one does not need to include all the genes present on the HG U95 chip A: there are over 12,000 genes and most of them provide no information, that is, many of these genes do not provide informative expression values when comparing normal versus cancerous lung cells. Consequently a decision had to be made as to how many genes to select. Secondly, there needed to be a mechanism by which these genes could be selected, a reproducible procedure for choosing the best set of genes that defines the three tissue types. In order to understand the best number of genes for this three-class problem, AnVil took a systematic approach, generating sets of genes that varied in size from very small to somewhat large relative to the 12,000 genes available. As such we decided to proceed by generating gene sets ranging in size from one up through one hundred, to provide an initial understanding as to how many genes might best distinguish the three tissue types. AnVil set the upper bound at one hundred since most published research reports small gene sets, mostly around twenty or so genes.

Figure 1. Example RadViz™ Gene Selection

Next came the selection procedures. Once again there are many possible ways to select subsets of genes from the initial 12,000, so the question was which procedure would be most fruitful. AnVil settled on four selection algorithms: random, F-statistic, RadViz™, and PURS™. It was apparent that we needed a baseline for how well any gene set of a given size would perform, so we started by generating random gene sets, ten independent sets for each gene set size. These random sets provided the best unintelligent estimate of how well any set of genes distinguishes between normal tissue and the two cancer types. Secondly, we included an algorithm using the F-statistic to select the genes with the highest significance in distinguishing the three classes; one would assume that by adding some intelligence about the data we could select more appropriate genes than by simply choosing random sets. A third algorithm, proprietary to AnVil, applies the class discrimination algorithm of RadViz™ to this three-class problem (see Figure 1 for an example). The final algorithm, also proprietary to AnVil, is PURS™, or Principal Uncorrelated Record Selection; here genes are selected based on their uniqueness in defining the space of expression values, by repeatedly choosing the genes most different from those already selected. PURS™ chooses genes independently of the three classes, so the initial gene chosen to start the algorithm becomes important.
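Of these four procedures, the F-statistic ranking is the one standard, non-proprietary method, and a minimal sketch of it is shown below. The expression matrix and class labels here are synthetic stand-ins, and this is our own illustration rather than AnVil's implementation:

```python
import numpy as np

def anova_f(expr, labels):
    """One-way ANOVA F-statistic for each gene (row of expr) across the
    classes in labels; a larger F means stronger class separation."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    k, n = len(classes), expr.shape[1]
    grand = expr.mean(axis=1, keepdims=True)
    ss_between = np.zeros(expr.shape[0])
    ss_within = np.zeros(expr.shape[0])
    for c in classes:
        grp = expr[:, labels == c]
        mean_c = grp.mean(axis=1, keepdims=True)
        ss_between += grp.shape[1] * (mean_c - grand).ravel() ** 2
        ss_within += ((grp - mean_c) ** 2).sum(axis=1)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Synthetic stand-in: 100 genes, 15 samples, three classes of five.
rng = np.random.default_rng(0)
labels = np.array([0] * 5 + [1] * 5 + [2] * 5)
expr = rng.normal(size=(100, 15))
expr[0, labels == 1] += 6.0          # make gene 0 clearly class-separating
ranked = np.argsort(anova_f(expr, labels))[::-1]   # best genes first
```

Taking the top N entries of `ranked` yields an F-statistic gene set of any desired size, which is how the gene sets of sizes one through one hundred described above could be generated.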

Set Evaluation

After generating a number of gene sets, ranging in size from small to large and produced by the four selection procedures described above, these sets needed to be evaluated as to how well they truly distinguish the three tissue types: normal, adenocarcinoma and squamous cell carcinoma. To accomplish this step AnVil applied a number of classification algorithms to each gene set in order to fully compare the relationship between the number of genes and the algorithm used to make the gene selection. Furthermore, AnVil performed ten-fold and leave-one-out cross-validation using the NCI data set and independent validation using the Meyerson data set. One thing that was apparent during our independent testing was the unbalanced tissue representation in the Meyerson data set: 139 adenocarcinoma samples versus only 38 combined normal and squamous cell carcinoma samples. In total AnVil used eleven classification algorithm variants, including versions of K-nearest Neighbors, Naïve Bayes, Support Vector Machines, and Neural Networks. Figure 2 provides a visual representation of the ten-fold cross-validation results for all gene sets and algorithms by their associated best classification score.
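Ten-fold cross-validation with the simplest of the classifiers named above, K-nearest neighbors, can be sketched as follows. The data are synthetic and well separated, standing in for a selected gene set, and the fold assignment is deliberately minimal:

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k=3):
    """Plain k-nearest-neighbor majority vote using Euclidean distance."""
    preds = []
    for x in test_x:
        d = np.linalg.norm(train_x - x, axis=1)
        nearest = train_y[np.argsort(d)[:k]]
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

def cross_validate(x, y, folds=10, k=3):
    """Mean accuracy over `folds` held-out folds (index-modulo assignment)."""
    idx = np.arange(len(y))
    scores = []
    for f in range(folds):
        test = idx % folds == f
        preds = knn_predict(x[~test], y[~test], x[test], k)
        scores.append(np.mean(preds == y[test]))
    return float(np.mean(scores))

# Synthetic three-class data: 60 samples, 5 features, well-separated classes.
rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2], 20)
x = rng.normal(size=(60, 5)) + y[:, None] * 3.0
accuracy = cross_validate(x, y)
```

Repeating this evaluation for every gene set and every classifier variant produces the grid of best classification scores visualized in Figure 2.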

Figure 2. Classification Results
- # of Variables – number of genes selected
- Light gray circles – random gene sets
- Yellow squares – F-statistic gene sets
- Blue circles – RadViz™ gene sets
- Red triangles – PURS™ gene sets

Figure 3. Sample Misclassifications
- Gray (left) – normal samples
- Blue (center) – adenocarcinomas
- Yellow (right) – squamous cell carcinomas
- The top row indicates the known tissue type.

An interesting observation that appeared when comparing sample classifications across the different gene sets and classification algorithms was the presence of consistently misclassified samples. In Figure 3 we present an example visualization of the classification results for each sample (displayed vertically) within the NCI data set. Notice the two continuous vertical lines; these represent two samples that were misclassified by all of the classification algorithms. Given that we had no supporting information for the NCI patients, we could not make any inferences about this finding other than recommending that these patients be resampled. When analyzing the consistent misclassifications among the Meyerson samples we were able to identify six patients; after reviewing the patients’ supporting information we found that these samples consisted of mixed tissue types, and the classification algorithms caught the differences.
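The consistent-misclassification check amounts to finding the samples that every algorithm gets wrong. A toy sketch with hypothetical tissue labels (not the actual NCI or Meyerson data):

```python
def consistently_misclassified(predictions, truth):
    """Return indices of samples that every algorithm misclassified --
    candidates for resampling or for mixed-tissue review."""
    flagged = []
    for i, actual in enumerate(truth):
        if all(preds[i] != actual for preds in predictions):
            flagged.append(i)
    return flagged

# Hypothetical calls from two classifiers on four samples.
truth  = ["normal", "adeno",    "squamous", "adeno"]
algo_a = ["normal", "squamous", "squamous", "adeno"]
algo_b = ["normal", "normal",   "squamous", "squamous"]
flagged = consistently_misclassified([algo_a, algo_b], truth)
```

Here only the second sample is wrong under both classifiers, so it alone is flagged; with eleven algorithm variants the same intersection identifies the vertical lines seen in Figure 3.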

Biological Relevance

ML’s stuff…

Mesh – Informax

GO ontology

Conclusion

[Overview of the approach taken]
1. Selection of gene sets
2. Evaluation of gene sets
3. Biological relevance

Random F-statistics Radviz PURS - Intelligent Principal Uncorrelated Record Selection (dissimilar)

K-nearest Neighbors Naïve Bayes Support Vector Machines Neural Network

References

1. Affymetrix, www.affymetrix.com.
2. Matthew Meyerson Lab, Dana-Farber Cancer Institute, http://research.dfci.harvard.edu/meyersonlab/lungca/data.html.

3. Proteomics

Conclusions

Acknowledgements

AnVil and the authors gratefully acknowledge support from two SBIR Phase I grants, R43 CA94429-01 and R43 CA096179-01, from the National Cancer Institute. Also, support is acknowledged from ………..X Y Z

References

1. A. Strehl. Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining. Dissertation, The University of Texas at Austin, May 2002.
2. I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann, 2000.
3. J. A. Hartigan. Clustering Algorithms. New York: John Wiley & Sons, 1975.
4. D. Fasulo. “An Analysis of Recent Work on Clustering Algorithms.” http://www.cs.washington.edu/homes/dfasulo/clustering.ps, April 26, 1999.
5. C. Fraley and A. E. Raftery. “Model-Based Clustering, Discriminant Analysis, and Density Estimation.” Technical Report no. 380, Department of Statistics, University of Washington, Seattle, October 2000.
6. F. Höppner, F. Klawonn, R. Kruse, and T. Runkler. Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. Chichester: John Wiley & Sons, 1999.
7. Everitt, B. Cluster Analysis. Halsted Press, New York, 1980.
8. Schaffer, C. Selecting a classification method by cross-validation. Machine Learning, 13:135-143, 1993.
9. Feelders, A. and Verkooijen, W. Which method learns most from the data? Proc. of the 5th International Workshop on Artificial Intelligence and Statistics, Fort Lauderdale, Florida, pp. 219-225, January 1995.
10. Dietterich, T.G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895-1924, 1998.
11. Cheng, J. and Greiner, R. Comparing Bayesian network classifiers. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI ’99), pp. 101-107, Morgan Kaufmann Publishers, 1999.
12. Salzberg, S.L. On Comparing Classifiers: A Critique of Current Research and Methods. Data Mining and Knowledge Discovery, 1:1-12, Kluwer Academic Publishers, Boston, 1999.
13. Ramaswamy, S., Ross, K.N., Lander, E.S. and Golub, T.R. A molecular signature of metastasis in primary solid tumors. Science, 22, 1-5.
14. Chaussabel, D. and Sher, A. Mining microarray expression data by literature profiling. Genome Biology, 3, 1-16.
15. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. (Eds.) Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
16. B. Shneiderman. “The Eyes Have It: A Task by Data Type Taxonomy of Information Visualization,” presented at the IEEE Symposium on Visual Languages '96, Boulder, CO, 1996.
17. J. W. Tukey. Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977.
18. R. Rao and S. K. Card. “The Table Lens: Merging Graphical and Symbolic Representations in an Interactive Focus+Context Visualization for Tabular Information,” presented at ACM CHI '94, Boston, MA, 1994.
19. D. F. Andrews. “Plots of High-Dimensional Data,” Biometrics, vol. 29, pp. 125-136, 1972.
20. H. Chernoff. “The Use of Faces to Represent Points in k-Dimensional Space Graphically,” Journal of the American Statistical Association, vol. 68, pp. 361-368, 1973.
21. J. Beddow. “Shape Coding of Multidimensional Data on a Microcomputer Display,” presented at IEEE Visualization '90, San Francisco, CA, 1990.
22. A. Inselberg. “The Plane with Parallel Coordinates,” The Visual Computer, vol. 1, pp. 69-91, 1985.
23. D. A. Keim and H.-P. Kriegel. “VisDB: Database Exploration Using Multidimensional Visualization,” IEEE Computer Graphics and Applications, vol. 14, pp. 40-49, 1994.
24. J. W. Sammon, Jr. “A Nonlinear Mapping for Data Structure Analysis,” IEEE Transactions on Computers, vol. 18, pp. 401-409, 1969.
25. P. Hoffman and G. Grinstein. “Dimensional Anchors: A Graphic Primitive for Multidimensional Multivariate Information Visualizations,” presented at NPIV '99 (Workshop on New Paradigms in Information Visualization and Manipulation), 1999.
26. H. Hotelling. “Analysis of a Complex of Statistical Variables into Principal Components,” Journal of Educational Psychology, vol. 24, pp. 417-441, 498-520, 1933.
27. T. Hastie and W. Stuetzle. “Principal Curves,” Journal of the American Statistical Association, vol. 84, pp. 502-516, 1989.
28. D. Asimov. “The Grand Tour: A Tool for Viewing Multidimensional Data,” SIAM Journal on Scientific and Statistical Computing, vol. 6, pp. 128-143, 1985.
29. J. H. Friedman. “Exploratory Projection Pursuit,” Journal of the American Statistical Association, vol. 82, pp. 249-266, 1987.
30. T. Kohonen, E. Oja, O. Simula, A. Visa, and J. Kangas. “Engineering Applications of the Self-Organizing Map,” Proceedings of the IEEE, 1996.
31. G. Grinstein, P. E. Hoffman, S. Laskowski, and R. Pickett. “Benchmark Development for the Evaluation of Visualization for Data Mining,” in Information Visualization in Data Mining and Knowledge Discovery, The Morgan Kaufmann Series in Data Management Systems, U. Fayyad, G. Grinstein, and A. Wierse, Eds., 1st ed., Morgan Kaufmann Publishers, 2001.
32. Voigt, K. and Bruggeman, R. (1995) Toxicology databases in the metadatabank of online databases. Toxicology, 100, 225-240.
33. Weinstein, J.N., et al. (1997) An information-intensive approach to the molecular pharmacology of cancer. Science, 275, 343-349.
34. Shi, L.M., Fan, Y., Lee, J.K., Waltham, M., Andrews, D.T., Scherf, U., Paul, K.D., and Weinstein, J.N. (2000) J. Chem. Inf. Comput. Sci., 40, 367-379.
35. Bai, R.L., Paul, K.D., Herald, C.L., Malspeis, L., Pettit, G.R., and Hamel, E. (1991) Halichondrin B and homohalichondrin B, marine natural products binding in the vinca domain of tubulin: discovery of a tubulin-based mechanism of action by analysis of differential cytotoxicity data. J. Biol. Chem., 266, 15882-15889.
36. Cleveland, E.S., Monks, A., Vaigro-Wolff, A., Zaharevitz, D.W., Paul, K., Ardalan, K., Cooney, D.A., and Ford, H. Jr. (1995) Site of action of two novel pyrimidine biosynthesis inhibitors accurately predicted by the COMPARE program. Biochem. Pharmacol., 49, 947-954.
37. Gupta, M., Abdel-Megeed, M., Hoki, Y., Kohlhagen, G., Paul, K., and Pommier, Y. (1995) Eukaryotic DNA topoisomerase-mediated DNA cleavage induced by a new inhibitor: NSC 665517. Mol. Pharmacol., 48, 658-665.

38. Shi, L.M., Myers, T.G., Fan, Y., O’Connor, P.M., Paul, K.D., Friend, S.H., and Weinstein, J.N. (1998) Mining the National Cancer Institute Anticancer Drug Discovery Database: cluster analysis of ellipticine analogs with p53-inverse and central nervous system-selective patterns of activity. Mol. Pharmacology, 53, 241-251.

39. Ross, D.T., et al. (2000) Systematic variation of gene expression patterns in human cancer cell lines. Nat. Genet., 24, 227-235.

40. Staunton, J.E.; Slonim, D.K.; Coller, H.A.; Tamayo, P.; Angelo, M.P.; Park, J.; Scherf, U.; Lee, J.K.; Reinhold, W.O.; Weinstein, J.N.; Mesirov, J.P.; Lander, E.S.; Golub, T.R. Chemosensitivity prediction by transcriptional profiling, Proc. Natl. Acad. Sci., 2001, 98, 10787-10792.

41. Marx, K.A.; O’Neil, P.; Hoffman, P.; Ujwal, M.L. Data Mining the NCI Cancer Cell Line Compound GI50 Values: Identifying Quinone Subtypes Effective Against Melanoma and Leukemia Cell Classes, J. Chem. Inf. Comput. Sci., 2003, in press.

42. Blower, P.E.; Yang, C.; Fligner, M.A.; Verducci, J.S.; Yu, L.; Richman, S.; Weinstein, J.N. Pharmacogenomic analysis: correlating molecular substructure classes with microarray gene expression data, The Pharmacogenomics Journal, 2002, 2, 259- 271.

43. Scherf, U.; Ross, D.T.; Waltham, M.; Smith, L.H.; Lee, J.K.; Tanabe, L.; Kohn, K.W.; Reinhold, W.C.; Myers, T.G.; Andrews, D.T.; Scudiero, D.A.; Eisen, M.B.; Sausville, E.A.; Pommier, Y.; Botstein, D.; Brown, P.O.; Weinstein, J.N. A gene expression database for the molecular pharmacology of cancer, Nat. Genet., 2000, 24, 236-247.

44. Schafer, J.L. Analysis of Incomplete Multivariate Data, Monographs on Statistics and Applied Probability 72, Chapman & Hall/CRC, 1997.

45. RadViz, URL: www.anvilinfo.com

46. Hoffman, P.; Grinstein, G.; Marx, K.; Grosse, I.; Stanley, E. DNA visual and analytical data mining, IEEE Visualization 1997 Proceedings, pp. 437-441, Phoenix

47. Hoffman, P.; Grinstein, G. Multidimensional information visualization for data mining with application for machine learning classifiers, Information Visualization in Data Mining and Knowledge Discovery, Morgan-Kaufmann, San Francisco, 2000.

48. Bucci, C.; Thompsen, P.; Nicoziani, P.; McCarthy, J.; van Deurs, B. Rab7: a key to lysosome biogenesis, Mol. Biol. Cell, 2000, 11, 467-480.

49. Ross, D. NAD(P)H: quinone oxidoreductases, Encyclopedia of Molecular Medicine, 2001, 2208-2212. 50. Faig, M.; Bianchet, M.A.; Talalay, P.; Chen, S.; Winski, S.; Ross, D.; Amzel, L.M. Structure of recombinant human and mouse NAD(P)H:quinone oxidoreductase: Species comparison and structural changes with substrate binding and release, Proc. Natl. Acad. Sci., 2000, 97, 3177-3182

51. Faig, M.; Bianchet, M.A.; Winsky, S.; Moody, C.J.; Hudnott, A.H.; Ross, D.; Amzel, L.M. Structure-based development of anticancer drugs: complexes of NAD(P)H:quinone oxidoreductase 1 with chemotherapeutic quinones, Structure (Cambridge), 2001, 9, 659- 667

52. Smith, M.T.; Wang, Y.; Kane, E.; Rollinson, S.; Wiemels, J.L.; Roman, E.; Roddam, P.; Cartwright, R.; Morgan, G., Low NAD(P)H: quinone oxidoreductase I activity is associated with increased risk of acute leukemia in adults, Blood, 2001, 97, 1422-1426

53. Wiemels, J.L.; Pagnamenta, A.; Taylor, G.M.; Eden, O.B.; Alexander, F.E.; Greaves, M.F. A lack of a functional NAD(P)H:quinone oxidoreductase allele is selectively associated with pediatric leukemias that have MLL fusions. United Kingdom Childhood Cancer Study Investigators, Cancer Res., 1999, 59, 4095-4099

54. Naoe T.; Takeyama, K.;, Yokozawa, T.; Kiyoi, H.; Seto, M.; Uike, N.; Ino, T.; Utsunomiya, A.; Maruta, A.; Jin-nai, I.; Kamada, N.; Kubota, Y.; Nakamura, H.; Shimazaki, C.; Horiike, S.; Kodera, Y.; Saito, H.; Ueda, R.; Wiemels, J.; Ohno, R. Analysis of the genetic polymorphism in NQO1, GST-M1, GST-T1 and CYP3A4 in 469 Japanese patients with therapy related leukemia/myelodysplastic syndrome and de novo acute myeloid leukemia, Clin. Cancer Res., 2000, 6, 4091-4095

Other References (14-25 in CC Grant)

35. Venter, J.C., et al. (2001) The Sequence of the Human Genome. Science, 291, 1303-1351.
36. Lander, E.S., et al. (2001) Initial Sequencing and Analysis of the Human Genome. Nature, 409, 860-921.
37. Stoeckert, C.J., et al. Microarray databases: standards and ontologies. Nat. Genet., 32 (Suppl), 469-473.
38. [No author] Microarray standards at last. Nature, 419, 323.
39. Ball, C., et al. Standards for microarray data. Science, 298, 539.
40. Quackenbush, J. (2001) Computational analysis of cDNA microarray data. Nature Reviews Genetics, 2(6), 418-428.
41. Dudoit, S., Yang, Y.H., Speed, T.P., and Callow, M.J. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12(1), 111-139.
42. Li, C. and Wong, W.H. (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error applications. Genome Biology, 2(8).
43. Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K., Scherf, U., and Speed, T.P. (2003) Exploration, normalization and summaries of high density oligonucleotide array probe level data. Biostatistics (in press).
44. Durbin, B.P., Hardin, J.S., Hawkins, D.M., and Rocke, D.M. (2002) A variance-stabilizing transformation for gene expression microarray data. Bioinformatics, 18, 105S-110S.
45. Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2), 185-193.
46. Schadt, E.E., Li, C., Ellis, B., and Wong, W.H. (2002) Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J. Cell. Biochem., 84(S37), 120-125.

Figure Legends

Figure 1. RadViz figure

Figure 2. Cancer cell line functional class definition using a hierarchical clustering (1 − Pearson correlation coefficient) dendrogram for 60 cancer cell lines based upon gene expression data. Five well defined clusters are shown highlighted. We treat the highlighted cell line clusters as the truth for the purpose of carrying out studies to identify which chemical compounds are highly significant in their classifying ability.

Figure 3. RadViz™ result for the 3-Class problem classification of melanoma, leukemia and non-melanoma, non-leukemia cancer cell types at the p < .01 criterion. Cell lines are symbol coded as described in the figure. A total of 14 compounds (bottom of layout) were most effective against melanoma and they are laid out in the melanoma sector (counterclockwise from most to least effective). For leukemia, 30 compounds were identified as most effective and are laid out in that sector. Some 8 compounds were found to be most effective against non-melanoma, non-leukemia cell lines and are laid out in that sector.

Figure 4. One example of each of the two quinone subtypes selected in Figure 3 is displayed. A. The most highly effective of the 11 internal quinone subtype compounds most effective against melanoma is shown. B. The most highly effective of the 6 external quinone subtype compounds most effective against leukemia is shown.

Figure 5. RadViz™ result for the 3-Class Problem classifying the following three classes: acute lymphoblastic leukemia (ALL), non-ALL leukemia (other-Leukemia) and non-leukemia cell classes at p < .05. We used as input the 30 compounds identified in the Figure 3 classification as most effective against all leukemias at the p < .01 selection criterion. Cell lines are symbol coded as described in the figure. The NSC numbers of the compounds selected to classify the classes are presented in the order of their ranking from most effective to least effective moving counterclockwise within each class sector.
