Mathematical Modeling of Noise and Discovery of Genetic Expression Classes in Gliomas
Oncogene (2002) 21, 7164–7174
© 2002 Nature Publishing Group. All rights reserved 0950–9232/02 $25.00
www.nature.com/onc

Hassan M Fathallah-Shaykh*,1, Mo Rigen1, Li-Juan Zhao1, Kanti Bansal1, Bin He1, Herbert H Engelhard3, Leonard Cerullo2, Kelvin Von Roenn2, Richard Byrne2, Lorenzo Munoz2, Gail L Rosseau2, Roberta Glick4, Terry Lichtor4 and Elia DiSavino1

1Department of Neurological Sciences, Rush Presbyterian – St. Lukes Medical Center, Chicago, Illinois, IL 60612, USA; 2Department of Neurosurgery, Rush Presbyterian – St. Lukes Medical Center, Chicago, Illinois, IL 60612, USA; 3Department of Neurosurgery, The University of Illinois at Chicago, Chicago, Illinois, IL 60612, USA; 4Department of Neurosurgery, The Cook County Hospital, Chicago, Illinois, IL 60612, USA

The microarray experimental system generates noisy data that require validation by other experimental methods for measuring gene expression. Here we present an algebraic modeling of noise that extracts expression measurements true to a high degree of confidence. This work profiles the expression of 19 200 cDNAs in 35 human gliomas; the experiments are designed to generate four replicate spots/gene with switching of probes. The validity of the extracted measurements is confirmed by: (1) cluster analysis that generates a molecular classification differentiating glioblastoma from lower-grade tumors and radiation necrosis; (2) findings that other investigators have reported in gliomas using paradigms for assaying molecular expression other than gene profiling; and (3) real-time RT–PCR. The results yield a genetic analysis of gliomas and identify classes of genetic expression that link novel genes to the biology of gliomas.
Oncogene (2002) 21, 7164–7174. doi:10.1038/sj.onc.1205654

Keywords: glioma; genetics; mathematical modeling; mathematical computing; genetic techniques

*Correspondence: HM Fathallah-Shaykh, Rush University Medical Center, 2242 West Harrison Street, Suite 200, Chicago, IL 60612, USA; E-mail: [email protected]
For supplementary information, send an e-mail to the corresponding author
Received 25 March 2002; revised 30 May 2002; accepted 31 May 2002

Introduction

Despite recent advances in molecular technology and therapeutics, the prognosis of patients suffering from malignant brain tumors has not changed over the past 20 years. Gene expression profiling has emerged as a novel tool for rapid discovery of molecular expression patterns associated with human disease (Alizadeh et al., 2000; Alter et al., 2000; Bittner et al., 2001; Golub et al., 1999). Furthermore, the completion of the human genome project has created the possibility of studying changes in gene expression of the complete genetic repertoire in any disease-affected tissue. However, genome-wide screening is still hampered by the preponderance of false positive data in the gene microarray experimental system (Ting Lee et al., 2000). The following experiments are designed to profile the expression of 19 200 cDNAs in 35 human glioma samples. Here, we apply mathematical principles to separate the noise and extract genes whose expression levels are considered truly changed, to a high degree of confidence, in the tumor samples as compared to normal brain. The results yield a genetic analysis of gliomas, identify genes whose expression patterns differentiate glioblastoma from lower-grade tumors and radiation necrosis, and discover classes of genetic expression that link novel genes to the biology of gliomas.

Results and discussion

The experiments are designed to generate four replicate ratios with probe switching. The replicate data are averaged and expressed in a matrix containing 19 200 gene rows and 35 tumor columns; the overwhelming majority of the standard deviations of the replicate ‘prepared’ data are less than 1. The tumor vectors are analysed using current standard techniques by: (1) omitting gene vectors excluded (see Methods) in more than 20% of the tumor set, which leaves 18 314 rows; and (2) agglomerative hierarchical clustering using single linkage or Ward’s incremental sum of squares of the 1-Pearson product moment correlation matrix (Everitt, 1993). The resulting dendrogram does not distinguish between the different pathological types. The results are not surprising because only a fraction of the 18 314 genes is expected to be modulated in brain tumors. The findings, suggesting that noise vectors mask the pathological distinction, highlight the need for new methods to separate true from false measurements.

To study and model the noise in this experimental system we define the filtering function f4 (applied to four replicate ratio measurements). f4 computes the mean of the four replicate log2 values only if: (1) all four log2 values are of the same sign and different from 0; (2) all four replicate ratios are either <0.71 or >1.4; and (3) a minimum of three spots are not flagged manually because of artifacts. Unless all three conditions are met, f4 ‘filters’ the gene by assigning a 0 to the log2 expression value.

The next experiment is designed to study whether f4 generates false negative data. Here we use microarray chips containing 1720 genes laid in duplicates (1.7 K chips from the Ontario Cancer Institute, Toronto, Canada). Each 1.7 K chip contains a total of 128 spots of Arabidopsis cDNA with no known homology to human genes (64 laid in duplicates) and 256 spots of buffer only (SSC). One ng of Arabidopsis RNA transcribed in vitro is added to tumor RNA and either: (1) not added, or (2) 0.5 ng added to reference RNA. Each of these experiments is repeated six times, for a total of 12 spots. The results reveal that, after applying f4 to four replicate spots, 1.6% of the Arabidopsis spots are false negative, and 0–2% of the SSC spots without cDNA are false positive (Figure 1). Thus, f4 annuls false positive results without significant loss of data reflecting true changes in gene expression.

Figure 1 f4 filters false positive data without significant loss of true changes in gene expression. One ng of Arabidopsis RNA transcribed in vitro is added to tumor RNA and either: (1) not added (dotted lines), or (2) 0.5 ng added to the reference RNA (solid lines). The curves on the right show per cent false negative Arabidopsis spots after applications of the filtering functions (supplementary information) to the measurements of two spots (1,2), ..., four spots (1,2,3,4), ..., and 12 spots (1–12), respectively. Per cent false negative refers to the number of Arabidopsis spots whose expression values are equal to 0 after application of the filtering functions × 100/total number of Arabidopsis spots. The curves on the left show per cent false positive SSC (buffer) spots; the latter refers to the number of buffer spots whose measured expression values are different from 0 after application of the filtering function × 100/total number of buffer spots. The per cent false negative values are 1.6% at four and six spots, 5–6% at eight spots, and 9–11% at 12 spots

Mathematical modeling of noise

Because the predominant majority (>95%) of the data zeroed by f4 (applying the filtering function to four replicate measurements) are false (see Figure 1) and because the expression of only a small fraction of the genes is expected to be truly changed, we reasoned that by separating unfiltered false positive data, one could use them to model the behavior of noise (Figure 2). To generate ‘noise’ matrices containing false positive data, the unfiltered log2 values of the replicate ratios are expressed in four matrices E11, E12, E13, E14 of size (19 200 × 35), each corresponding to one of the four replicate spots. The rows and columns refer to the 19 200 genes and 35 tumors, respectively. The filtered data after application of f4 are assembled to generate a matrix E of size (19 200 × 35); its rows correspond to the 19 200 genes and its columns to the 35 tumor samples. 9155 genes of E have log2 values=0 in all 35 tumor columns. The ‘noise’ matrices N1, N2, N3 and N4 are constructed to contain the unfiltered expression data of the 9155 genes mentioned above in E11, E12, E13 and E14, respectively (Figure 2).

Figure 2 Cartoon depicting the assembly of ‘noise’ matrices. The four replicate data are assembled into four matrices E11, E12, E13 and E14, each corresponding to one of the four groups of measurements; their rows correspond to the 19 200 genes and columns to the 35 tumors. The filtering function f4 is applied to the four replicate measurements of each gene located at the same coordinates in each matrix (stars). f4 generates the matrix E. 9155 gene vectors of E have log2 values=0 in all tumor columns. The noise matrices N1, N2, N3 and N4 are constructed to contain the unfiltered data of the 9155 genes in E11, E12, E13 and E14 respectively

We model the noise by projecting the gene vectors onto spaces defined by linear transformation of their matrices. Singular value decomposition is a mathematical application that transforms the rows of a matrix into vectors in space (eigenspace). The dimension of the eigenspace equals the number of entries in each row; here the 35-dimensional space (Figure 2). Transformation by singular value decomposition generates three matrices (supplementary information); one defines the ‘axes’ of the space (eigenvectors); another includes numbers related to the coordinates of the row vectors onto each axis; and the third contains a quantification of how much information is lost if the eigenspace is reduced to a lower ‘manageable’ dimensionality, like the 3-dimensional space.

3.9, 3.8, 3.8 and standard deviations=1.2, respectively (Figures 4a, b).

Separation of noise and extraction of true measurements

Eigenprojections of the gene row vectors of E11, E12, E13, E14 onto their corresponding 35-dimensional