Making Sense of Microarray Data to Classify Cancer

The Pharmacogenomics Journal (2003) 3, 308–311
© 2003 Nature Publishing Group. All rights reserved 1470-269X/03 $25.00

CLINICAL IMPLICATION

S Hanash¹ and C Creighton²

¹Department of Pediatrics, University of Michigan, Ann Arbor, MI, USA; ²Bioinformatics Program, Ann Arbor, MI, USA

doi:10.1038/sj.tpj.6500209
Published online 4 November 2003

Profiling gene expression using DNA arrays has had a tremendous impact on biomedical research. From a disease investigation point of view, applications of DNA microarrays include uncovering unsuspected associations between genes and specific clinical features of disease, resulting in novel, molecular-based disease classifications. Cancer is a case in point. Most published studies of cancers using DNA microarrays have either examined a pathologically homogeneous set of tumors to identify clinically relevant subtypes, for example, responders vs nonresponders, or pathologically distinct subtypes of cancer of the same lineage, for example, high-stage vs low-stage tumors to identify molecular correlates, or tumors of different lineages to identify molecular signatures for each lineage. A study of cutaneous T-cell lymphoma by Kari et al,¹ published recently, typifies both what one hopes to gain from disease investigations using DNA microarrays and the limitations of such studies.

Primary cutaneous lymphomas are a heterogeneous group of lymphomas of T- or B-cell origin that represent a relatively common type of lymphoma and their incidence appears to be increasing. The two predominant subtypes of cutaneous T-cell lymphomas are mycosis fungoides, a mostly indolent variety, and its leukemic counterpart the Sezary syndrome, an aggressive variety characterized by skin involvement, lymphadenopathy and circulating atypical lymphocytes, the so-called Sezary cells. Kari et al used cDNA microarrays to study gene expression patterns in peripheral blood mononuclear cells from patients with the leukemic form of cutaneous T-cell lymphoma. The goal of the study was to identify markers that may be useful for diagnosis or prognosis, or that might provide new targets for treating this disease. The approach was to uncover gene expression differences between cells from 18 patients with high Sezary cell counts and an appropriate (Th2-skewed) cell fraction from nine normal controls. The differences in gene expression observed reflected many of the observed characteristics of the disease. Overexpressed genes in disease samples included some genes required for Th2 differentiation characteristic of Sezary cells. The analysis, however, did not uncover changes consistent with the hypothesis of defective apoptotic pathways in this disease. An important objective of the study was to identify markers for cutaneous T-cell lymphoma given the paucity of such markers. A member of the plastin gene family and a chemokine (CX3CR1) inappropriately expressed represented such potential novel markers. Two genes found to have a high predictive power to classify patients and controls were STAT4 and GTPase RhoB. These two genes alone accurately classified the high Sezary cell patients and controls. A signature profile with 10 genes was uncovered that identified a class of patients who succumb to the disease early, irrespective of their tumor burden. The study therefore uncovered a wealth of findings that shed some light on the biology of this disease and uncovered markers that may have a practical utility.

The DNA microarray studies described above and others in the literature indeed point to the great utility of DNA microarrays for uncovering patterns of gene expression that are clinically informative. Have the data been thoroughly analyzed? There is no shortage of analytical tools for uncovering patterns in microarray data. An important challenge for microarray analysis is to understand at a mechanistic level the significance of associations observed between subsets of genes and clinical features of disease. Another challenge is to identify the smallest but most informative sets of genes associated with specific clinical features, which then could be interrogated using technologies available in clinical laboratories, as appears to have been accomplished in this study. Another challenge is to determine how well RNA levels of predictive genes correlate with protein levels. A lack of correlation may imply that the predictive property of the gene(s) is independent of gene function.

To increase the effectiveness of DNA microarray analysis, global gene expression data may be combined with external data sources, such as gene annotation, in order to associate the expression patterns of a set of genes with the biological processes that they may represent. A welcome trend of data sharing allows others to analyze previously published microarray data and to combine multiple data sets. For illustration, we examined the data set published by Kari et al to see what we could uncover. In our analysis, we relied on the Gene Ontology (GO) annotation. The Gene Ontology Consortium² has defined a controlled vocabulary for describing genes in terms of their molecular function, participation in biological processes and cellular locations. The GO annotations are making possible the high-throughput analyses of gene expression in terms of functional gene class associations, which otherwise would require laborious and somewhat subjective manual literature searches.

Using the data set from Kari et al, we searched a set of 122 genes found overexpressed in patients with high blood tumor burden, or Sezary cell count, compared to healthy controls (P<0.01, fold change >1.5), for significantly enriched (over-represented) GO terms, as described elsewhere.³ We made the same search for a set of 280 genes found underexpressed in patients with high Sezary cell count (P<0.01, fold change <0.67). Our premise is that annotation terms that are shared by a significant number of genes within a large gene set may provide clues as to the processes driving the coordinate expression of the genes as a whole. Numerous enriched terms were found for the set of 280 underexpressed genes with P<0.001, including class II major histocompatibility complex antigen (five genes represented), cytokine-binding activity (six), mitochondrion (26), electron transporter activity (12) and nucleotide metabolism (four); these enriched terms could suggest a downregulation in CTCL of processes related to the immune response and mitochondrial function. Terms found enriched for the set of 122 overexpressed genes with P<0.05 include cell adhesion (nine genes represented) and cell cycle arrest (three).

The enriched GO terms listed above represent only a fraction of the genes significantly expressed in CTCL, and additional gene-to-process associations, not currently described in the biomedical literature or public annotation sources, may be inferred from data mining of large expression profile data sets. Our premise in this case is that genes that are coordinately expressed participate in closely related biological processes.⁴ For a given gene, a GO term may be associated if the gene is correlated in expression with a significant number of other genes that share the given GO term annotation. We examined the expression patterns of 60 genes highly underexpressed in the Kari et al data set for patients with high Sezary cell count (P<0.01, fold change <0.33) that were also represented in a large independent data set of leukemia expression profiles from Armstrong et al.⁵ For each of the 60 genes, the set of genes with significant positive correlations (P<0.01) with the given gene in the Armstrong data set was searched for significantly enriched GO terms (P<0.0001). In this way, 1963 gene-to-term associations, involving all 60 genes, were found. We performed two simulation tests to assess the number of random gene-to-term associations that could exist in the Armstrong data set, in one test permuting the expression values and

Figure 1 Hierarchical clustering of associations of GO terms for genes found underexpressed in patients with high Sezary cell count (P<0.01, fold change <0.33). For each gene-to-term association represented here, the given gene was found positively correlated in expression with a significant number of other genes that share the given GO term annotation. The rows in the matrix diagram represent genes; the columns represent terms. An entry in the matrix indicates that the corresponding gene-to-term association was found in the leukemia profile data set from Armstrong et al with P<0.0001. Three major clusters are highlighted corresponding to terms related to (1) intercellular signaling, (2) the immune response, and (3) cell proliferation. Table 1 lists the genes that fall under each cluster.
Table 1 GO term associations from Figure 1 for genes underexpressed in patients with high Sezary cell counts

Gene    Gene product description

Cluster 1: integral to plasma membrane; receptor activity; signal transducer activity; cell surface receptor-linked signal transduction; cell motility; G-protein-coupled receptor protein signaling pathway; cell–cell signaling; development; organogenesis; morphogenesis; extracellular
CCL2    Small inducible cytokine A2
CD8B1   CD8 antigen,
