Int. J. Bioinformatics Research and Applications, Vol. 1, No. 1, 2005

How noisy and replicable are DNA microarray data?

Suman Sundaresh Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, CA 92697, USA E-mail: [email protected]

She-pin Hung and G. Wesley Hatfield Department of Microbiology and Molecular Genetics, Institute for Genomics and Bioinformatics, College of Medicine, University of California, Irvine, CA 92697, USA E-mail: [email protected] E-mail: [email protected]

Pierre Baldi* Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, CA 92697, USA E-mail: [email protected] *Corresponding author

Abstract: This paper analyses variability in highly replicated measurements of DNA microarray data conducted on nylon filters and Affymetrix GeneChipsTM with different cDNA targets, filters, and imaging technology. Replicability is assessed quantitatively using correlation analysis as a global measure and differential expression analysis and ANOVA at the level of individual genes.

Keywords: DNA microarrays; sources of variation; replication; correlation; differential expression analysis; ANOVA.

Reference to this paper should be made as follows: Sundaresh, S., Hung, S-P., Hatfield, G.W. and Baldi, P. (2005) ‘How noisy and replicable are DNA microarray data?’, Int. J. Bioinformatics Research and Applications, Vol. 1, No. 1, pp.31–50.

Biographical notes: Suman Sundaresh is a PhD student in the School of Information and Computer Sciences at UC Irvine. She gained her MSc and BSc (Hons) in Computer Science from the National University of Singapore. Her research interests are in the areas of machine learning and biomedical informatics.

She-pin Hung is a post-doctoral researcher in the Department of Microbiology and Molecular Genetics affiliated with the Institute for Genomics and Bioinformatics at UC Irvine. She received her PhD from the University of California at Irvine in 2002. Her research interests are in the areas of global gene expression profiling with the use of DNA microarrays and bioinformatics.

Copyright © 2005 Inderscience Enterprises Ltd.


G. Wesley Hatfield is Professor of Microbiology and Molecular Genetics in the College of Medicine and Associate Director of the Institute for Genomics and Bioinformatics at the University of California, Irvine. Hatfield holds a PhD degree from Purdue University and a BA degree from the University of California in Santa Barbara. His primary areas of scientific expertise include molecular biology, biochemistry, microbial physiology, functional genomics, and bioinformatics. His recent academic interests include the application and development of genomic and bioinformatics methods to elucidate the effects of chromosome structure and DNA topology on gene expression. He has received national recognition for his scientific contributions including the Eli Lilly and Company Research Award bestowed by the American Society of Microbiology.

Pierre Baldi is a professor in the School of Information and Computer Science and the Department of Biological Chemistry and Director of the Institute for Genomics and Bioinformatics at the University of California, Irvine. He received his PhD from the California Institute of Technology in 1986. From 1986 to 1988, he was a post-doctoral fellow at the University of California, San Diego. From 1988 to 1995, he held faculty and member of the technical staff positions at the California Institute of Technology and at the Jet Propulsion Laboratory. He was CEO of a startup company from 1995 to 1999 and joined UCI in 1999. He is the recipient of the 1993 Lew Allen Award at JPL and the Laurel Wilkening Faculty Innovation Award at UCI. Baldi has written over 100 research papers and four books. His research focuses in biological and chemical informatics, AI, and machine learning.

1 Introduction

This paper analyses and quantifies certain aspects of the ‘noise’ contained in DNA microarray data. A DNA microarray experiment comprises several steps, such as cDNA spotting, mRNA extraction, target preparation, hybridisation, image scanning, and analysis. These procedures can be further subdivided into dozens of elementary steps, each of which can introduce some amount of variability and noise. In addition to the variability introduced by the instruments and the experimenter, there is biological variability, which also has multiple sources, ranging from fluctuations in the environment to the inherently stochastic nature of nano-scale regulatory chemistry (Barkai and Leibler, 2000; Hasty et al., 2000; McAdams and Arkin, 1999) – transcription alone involves dozens of individual molecular interactions. These compounded forms of ‘noise’ may lead one to doubt whether any reliable signal can be extracted at all from DNA microarrays. Here, we show that, while certainly noisy, DNA microarray data do contain reliable information.

In this study, we look at highly replicated (up to 32×) experiments performed by different experimenters at different times in the same laboratory, using wild-type Escherichia coli as a model organism. In addition, we obtain these microarray measurements using two different formats, nylon filters and Affymetrix GeneChipsTM. Given the overwhelming number of variables that can in principle contribute to the variability, we focus on a particular subset of variables of great relevance to biologists. In particular, we measure the consistency of the results obtained using the filter technology across different filters and mRNA preparations. We also compare filters to Affymetrix GeneChipTM technology and study the effects of five different image processing methods.

Replicability is assessed quantitatively using correlation analysis and differential expression analysis. We use correlation as a global measure of similarity between two sets of measurements. While a correlation close to one is a good sign, it is a global measure that provides little information at the level of individual genes. Thus, we use differential expression analysis at the level of individual genes to detect which genes behave differently in two different sets of measurements. The datasets and software used in our analysis are available over the web at http://www.igb.uci.edu/servers/dmss.html.

Our approach differs from and complements previous related studies (Coombes et al., 2002; Piper et al., 2002). In particular, we use higher levels of replication (32×), relatively simpler biological samples (E. coli vs. S. cerevisiae or human B-cell lymphoma cell lines), and more diverse microarray technologies (filters and Affymetrix GeneChips). Parts of these other studies also focus on the analysis of variables that are outside the scope of the present study, such as exposure time or inter-laboratory variability.

2 Methods

2.1 Filter dataset

The first dataset (‘filter dataset’) consists of 32 sets of measurements from 16 nylon filter DNA microarrays, each containing duplicate probe sites for each of the 4,290 open reading frames (ORFs) of the E. coli genome, hybridised with 33P-labelled cDNA targets from wild-type Escherichia coli cells cultured at 37°C under balanced growth conditions in glucose minimal salts medium. The experimental design and methods for these experiments are described in detail in Arfin et al. (2000), Baldi and Hatfield (2002) and Hung et al. (2002, 2003) and illustrated in Figure 1. In Experiment 1, filters 1 and 2 were hybridised with 33P-labelled, random-hexamer-generated cDNA targets complementary to each of three independently prepared RNA preparations (RNA1) obtained from the cells of three individual cultures of a wild-type (wt) E. coli strain. These three 33P-labelled cDNA target preparations were pooled prior to hybridisation to the full-length ORF probes on the filters. Following phosphorimager analysis, these filters were stripped and hybridised again with pooled, 33P-labelled cDNA targets complementary to each of another three independently prepared RNA preparations (RNA2) from the wt strain (Experiment 2). Another pair of filters, filters 3 and 4, was used in the same way for Experiments 3 and 4 with two more independently prepared pools of cDNA targets (Experiment 3, RNA3; Experiment 4, RNA4). This protocol yields duplicate filter data for four experiments performed with cDNA targets complementary to four independently prepared sets of pooled RNA. Thus, since each filter contains duplicate spots for each ORF and duplicate filters were used for each experiment, 16 measurements (D1–D16) for each ORF were obtained from four experiments. These procedures were repeated with another two pairs of filters (filters 5–8) for Experiments 5–8 to obtain another 16 measurements (D17–D32) for each ORF.

Figure 1 Experimental design for nylon filter DNA array experiments (‘filter dataset’)

The filter dataset is fairly representative of other filter datasets in the sense that it corresponds to experiments carried out by different people at different times in the same laboratory. In particular, of the 32 filter measurements, the data from measurements 1–16 were obtained six months later than the data from measurements 17–32. During the intervening period, the efficiency of the 33P labelling was improved. Consequently, more signals marginally above background were detected on the filters for measurements 1–16. In fact, when we edit out all of the genes that contain one or more measurements at or below background in at least one experiment, we observe the expression of 2,607 genes for measurements 1–16 and 1,579 genes for measurements 17–32. If we consider the dataset containing all 32 measurements, we find that only 1,257 genes have all 32 expression measurements above background. The natural-log-transformed values (Speed, 2002) of these 1,257 above-background gene expression values for all 32 measurements were used for the subsequent analyses.
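The gene filtering and transformation step just described can be sketched in a few lines. The following Python snippet is an illustrative reconstruction (not the authors' code), assuming the data are held in a genes × measurements array with a single scalar background threshold.

```python
import numpy as np

def filter_and_log(data, background):
    """Keep only genes (rows) whose every measurement (column) is
    strictly above background, then natural-log transform the rest."""
    above = np.all(data > background, axis=1)  # per-gene boolean mask
    return np.log(data[above]), above

# Toy example: 3 genes x 4 measurements, background threshold 10.
data = np.array([[20.0, 30.0, 25.0, 40.0],
                 [ 5.0, 30.0, 25.0, 40.0],   # one value below background
                 [15.0, 12.0, 18.0, 11.0]])
logged, kept = filter_and_log(data, background=10.0)
```

On the real dataset, this is the step that reduces the 4,290 ORFs to the 1,257 genes with all 32 measurements above background.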


2.2 GeneChip dataset

To address another DNA microarray technology, we use a second dataset (‘GeneChip dataset’) that contains data from four Affymetrix GeneChipTM experiments measuring the expression levels of the same E. coli RNA preparations used for filter Experiments 1–4. The experimental design and methods for these experiments are illustrated in Figure 2 and described in detail by Hung et al. (2002). The four GeneChip measurements are each processed by five methods: the MAS 4.0 and MAS 5.0 software from Affymetrix, the dChip software of Li and Wong (2001), RMA (Irizarry et al., 2003a, 2003b), and GCRMA (Wu and Irizarry, 2004). The dataset thus consists of 20 replicate measurements. The 2,370 genes whose expression levels are above background for all 20 measurements are used in the subsequent analyses. We natural-log transformed the measures processed with MAS 4.0, MAS 5.0, and dChip; RMA and GCRMA differ from most other expression-measure methods in that they already return expression measures in log base 2.
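For reference, the two log scales are related by a constant factor, ln x = (log2 x) · ln 2, so the Pearson correlations used below are unaffected by the choice of base. A minimal illustrative sketch of the conversion (not part of the original pipeline):

```python
import math

def log2_to_ln(x_log2):
    """Convert a log2-scale expression value to the natural-log
    scale: ln(x) = log2(x) * ln(2)."""
    return x_log2 * math.log(2.0)

# A log2 value of 3 corresponds to x = 2**3 = 8, i.e. ln(8) on the ln scale.
value_ln = log2_to_ln(3.0)
```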

Figure 2 Experimental design of the Affymetrix GeneChip experiments (‘GeneChip dataset’). The same twelve total RNA preparations used for the four pooled RNA sets (RNA1–RNA4) in filter Experiments 1–4 were used for the preparation of biotin-labelled RNA targets for hybridisation to four Affymetrix GeneChips. The Affymetrix *.cel file generated by data obtained with a confocal laser scanner was used as the raw data source for all subsequent analyses


The gene expression measurements for each experiment of both datasets, filters and GeneChips, are globally normalised: each expression measurement with a value above background on all sixteen filters or four GeneChips is divided by the sum of all the gene expression measurements of that filter or GeneChip. Thus, the signal for each measurement can be expressed as a fraction of the total signal for each filter or GeneChip or, by implication, as a fraction of total mRNA. This normalisation is not applied to the RMA and GCRMA measurements, which already include a built-in normalisation step. The datasets obtained from the experiments described above allow us to investigate not only the effects of environmental and biological factors, but also the consistency of measurements taken from two different DNA microarray technologies.
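The global normalisation described above amounts to dividing each column (array) of the expression matrix by its total signal. A sketch, assuming a genes × arrays matrix already restricted to above-background genes (illustrative, not the authors' code):

```python
import numpy as np

def global_normalise(expr):
    """Express each measurement as a fraction of its array's total
    signal (rows = genes, columns = filters or GeneChips)."""
    return expr / expr.sum(axis=0, keepdims=True)

expr = np.array([[100.0,  50.0],
                 [300.0, 150.0]])
norm = global_normalise(expr)  # each column now sums to 1
```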

2.3 Image processing software

The GeneChip dataset contains 20 replicates, where each of the four GeneChip measurements is processed with five image processing software packages: MAS 4.0, dChip, MAS 5.0, RMA, and GCRMA.

In the Affymetrix MAS 4.0 software, the mean and standard deviation of the PM (perfect match) – MM (mismatch) differences of a probe set in one array are computed after excluding the maximum and the minimum values obtained for that probe set. If, among the remaining probe pairs, a difference deviates by more than 3 SD from the mean, that probe pair is declared an outlier and not used for the average difference calculation of either the control or the experimental array. A flaw of this approach is that a probe with a large response might well be the most informative but may be consistently discarded. Furthermore, if multiple arrays are compared at the same time, this method tends to exclude probes inconsistently measured among GeneChips.

Li and Wong (2001) developed a statistical model-based analysis method to detect and handle cross-hybridising probes, image and/or GeneChip defects, and to identify outliers across GeneChip sets. A probe set from multiple chips is modelled, and the standard deviation between a fitted curve and the actual curve for each probe set on each GeneChip is calculated. Probe pair sets containing anomalous probe pair measurements are declared outliers and discarded. The remaining probe pair sets are remodelled, and the fitted curve data are used for average difference calculations. These methods are implemented in a software program, dChip, which can be obtained from the authors.

A different, empirical approach to improve the consistency of average difference measurements has been implemented in the more recent Affymetrix MAS 5.0 software. In this implementation, if the MM value is less than the PM value, MAS 5.0 uses the MM value directly.
However, if the MM value is larger than the PM value, MAS 5.0 creates an adjusted MM value based on the average difference intensity between the ln PM and ln MM, or if the measurement is too small, some fraction of PM. The adjusted MM values are used to calculate the ln(PM – adjusted MM) for each probe pair. The signal for a probe set is calculated as a one-step bi-weight estimate of the combined differences of all of the probe pairs of the probe set.
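The MAS 4.0-style outlier rule described at the start of this section can be sketched as follows. This is a simplified, single-array illustration in Python, not Affymetrix code: the real software applies the exclusion jointly to the control and experimental arrays.

```python
import numpy as np

def trimmed_average_difference(pm_mm_diffs, k=3.0):
    """Drop the maximum and minimum PM-MM differences of a probe set,
    flag remaining differences more than k standard deviations from
    the trimmed mean as outliers, and average what is left."""
    d = np.asarray(pm_mm_diffs, dtype=float)
    order = np.argsort(d)
    trimmed = d[order[1:-1]]                 # exclude one min and one max
    mu, sd = trimmed.mean(), trimmed.std(ddof=1)
    kept = trimmed[np.abs(trimmed - mu) <= k * sd]
    return float(kept.mean())

# The extreme values 1 and 100 are dropped before averaging.
avg = trimmed_average_difference([1.0, 2.0, 3.0, 4.0, 100.0])
```

This also makes the cited flaw concrete: the probe with the largest response (here, 100) is always removed, even if it is the most informative one.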


In Irizarry et al. (2003a, 2003b), it is demonstrated that the ln(PM – adjusted MM) technique in MAS 5.0 results in gene expression estimates with elevated variances. The RMA (robust multi-array analysis) approach applies a global background adjustment and normalisation, and robustly fits a log-scale model of an expression effect plus a probe effect to the data. This method has been implemented in the Bioconductor affy package (http://www.bioconductor.org). An extension of RMA, discussed in Wu and Irizarry (2004), is based on molecular hybridisation theory and takes into account the GC content of each probe sequence in the calculation of non-specific binding. This method, called GCRMA, is also available as part of the Bioconductor project in the gcrma package. For both methods, the cel files of this GeneChip dataset were pre-processed at the probe level with the default settings, and the expression value of each gene was obtained.

2.4 Correlation and differential expression analyses

To measure the consistency between two sets of measurements globally, such as different filters or different cDNA target preparations, we use Pearson’s correlation coefficient. We compute and analyse matrices of correlation coefficients between different sets of measurements. The correlation coefficient provides a global measure of similarity but little information about possible fluctuations at the level of individual genes. A high level of global similarity, with a correlation of, for instance, 0.95, can hide significant fluctuations at the level of individual gene measurements. To address the issue of fluctuations at the level of single genes, we also perform differential analysis to detect false positives. The term ‘false’ here is used in reference to the experimental set-up and not to the underlying physical reality. In other words, differences in expression across two different measurements may well be real and result, for instance, from random fluctuations, but they are false positives in the sense that ideally they should not have occurred, since all conditions are supposed to be the ‘same’. Several methods for differential analysis have been developed in the literature, such as fold analysis, the t-test, the regularised t-test (Baldi and Hatfield, 2002; Baldi and Long, 2001), and SAM (Tusher et al., 2001), applied to raw data or to data transformed in different ways (Durbin et al., 2002; Huber et al., 2002). The primary goal here is to get a ‘ballpark’ sense of the false-positive rates between different sets of measurements under a typical and widespread analysis protocol. Thus, for illustration purposes, we use the t-test applied to the log-transformed data with a detection threshold corresponding to p-values of 0.005 or less.
To isolate the effects of the filters, targets, and combination of factors, we obtained the number of genes significantly differentially expressed (as false positives) by comparing duplicate measurements (same cDNA targets or same filters) with other pairs of duplicate measurements. In this case, we assume a normal distribution for the expression levels of the duplicate measurements of each gene. We also perform a two-way factorial ANOVA to estimate the percentage contributions of cDNA targets and filters to the total variance.
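The t-test-based false-positive count can be sketched as follows: compare two groups of replicate measurements of the same condition, gene by gene. This is a reconstruction under stated assumptions, not the authors' code; with four replicates per group the pooled t-test has 6 degrees of freedom, and |t| > 4.317 corresponds to a two-sided p < 0.005, so no external statistics library is needed.

```python
import numpy as np

def count_false_positives(a, b, t_crit=4.317):
    """Per-gene pooled two-sample t-test (rows = genes, columns =
    log-transformed replicates). Returns how many genes exceed the
    critical value, i.e. the 'false positive' calls when both groups
    measure the same condition."""
    n1, n2 = a.shape[1], b.shape[1]
    m1, m2 = a.mean(axis=1), b.mean(axis=1)
    v1, v2 = a.var(axis=1, ddof=1), b.var(axis=1, ddof=1)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)  # pooled variance
    t = (m1 - m2) / np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return int(np.sum(np.abs(t) > t_crit))
```

Comparing a replicate set against an identical copy of itself yields zero calls, as it should.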


3 Results

3.1 Correlations within filter data

A 32 × 32 correlation matrix of the duplicate measurements of all above-background target signals present on the 16 filters described in Figure 1 is shown in Figure 3. The correlations are plotted as an intensity matrix, where darker cells correspond to values closer to one, indicating a stronger correlation. The reference chart for the intensities is shown on the left. These results clearly demonstrate strong correlations among the first 16 measurements of the filter experiments (D1–D16 vs. D1–D16), as well as strong correlations among the measurements of the experiments performed six months earlier (D17–D32 vs. D17–D32). However, low correlation is observed among measurements of experiments performed at different times (D1–D16 vs. D17–D32). These results demonstrate that significant variance can be introduced into a DNA microarray experiment when experimental parameters such as personnel, reagents, protocols, and experimental methods vary. For example, we know that during this time frame the 33P labelling was improved.
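The intensity matrix in Figure 3 is simply the matrix of pairwise Pearson coefficients between measurement columns; a minimal sketch (illustrative, not the original analysis code):

```python
import numpy as np

def correlation_matrix(measurements):
    """Pairwise Pearson correlations between measurement columns
    (rows = genes, columns = measurements, e.g. D1-D32).
    np.corrcoef treats rows as variables, hence the transpose."""
    return np.corrcoef(measurements.T)

# Columns 0 and 1 are perfectly correlated (one is twice the other).
demo = np.array([[1.0, 2.0, 1.5],
                 [2.0, 4.0, 0.5],
                 [3.0, 6.0, 2.5],
                 [4.0, 8.0, 1.0]])
corr = correlation_matrix(demo)
```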

Figure 3 Correlation intensity matrix for the filter dataset

We also notice that the measurements obtained using RNA3, D9–D12, do not correlate as well with the other 12 measurements taken during the same time frame. The reason is unknown and may have to do with that particular RNA preparation, some day-to-day variation in the experimental procedure, or a combination of both.

Since two filters are hybridised with the same cDNA targets and two cDNA target preparations are hybridised to the same filter, for each of the eight cDNA target preparations we are able to examine the correlations both between filters and between cDNA target preparations (Figure 1). Figure 4 shows typical scatter plots (A) and a typical intensity image (B) obtained when analysing groups of eight measurements (e.g., D1–D8, D25–D32) corresponding to a quadrant in Figure 1. Duplicate measurements (e.g., D1–D2, D3–D4) are averaged and compared with other duplicate measurements. The intensity image shows a higher correlation when the same cDNA targets are hybridised to different filters (D1–D2 with D3–D4) than when different cDNA targets are hybridised to the same filters (D1–D2 with D5–D6).

Figure 4 (a) Scatter plots of duplicate log-transformed filter measurements (Line y = x superimposed) and (b) a typical correlation intensity matrix when comparing the effects of different filters and cDNA targets

(a)

(b)

We have summarised the observations from Figure 3 in Table 1. When duplicate measurements of each filter are compared, a high average correlation of 0.97 is observed. A reasonably high correlation is also observed when we compare measurements among different filters hybridised with the same targets (0.95). However, less correlation is observed when different targets are hybridised to the same filters (0.92). This demonstrates greater variance among target preparations (biological variance) than among filters (experimental variance). Thus, it stands to reason that the variance is even greater when different target preparations are hybridised to different filters. It should be noted that the variability among target preparations can be significantly reduced by pooling independently prepared target samples prior to hybridisation (Arfin et al., 2000; Baldi and Hatfield, 2002; Hung et al., 2002).

Table 1 The comparison of average correlation values from the correlation intensity matrix shown in Figure 3

Comparison                                                        Average correlation
Duplicate measurements from each filter (1)                                     0.974
Same targets hybridised to different filters (2)                                0.951
Different targets hybridised to the same filters (3)                            0.917
Different targets hybridised to different filters (4)                           0.859
  (excluding the effects of the time gap between the sets of 16
  measurements and the labelling improvements)

(1) The average of the correlation values, from the correlation matrix illustrated in Figure 3, of D1 vs. D2, D3 vs. D4, D5 vs. D6, D7 vs. D8, D9 vs. D10, D11 vs. D12, D13 vs. D14, D15 vs. D16, D17 vs. D18, D19 vs. D20, D21 vs. D22, D23 vs. D24, D25 vs. D26, D27 vs. D28, D29 vs. D30, and D31 vs. D32.
(2) The average of the correlation values of D1 vs. D3, D1 vs. D4, D2 vs. D3, D2 vs. D4, D5 vs. D7, D5 vs. D8, D6 vs. D7, D6 vs. D8, D9 vs. D11, D9 vs. D12, D10 vs. D11, D10 vs. D12, D13 vs. D15, D13 vs. D16, D14 vs. D15, D14 vs. D16, D17 vs. D19, D17 vs. D20, D18 vs. D19, D18 vs. D20, D21 vs. D23, D21 vs. D24, D22 vs. D23, D22 vs. D24, D25 vs. D27, D25 vs. D28, D26 vs. D27, D26 vs. D28, D29 vs. D31, D29 vs. D32, D30 vs. D31, and D30 vs. D32.
(3) The average of the correlation values of D1 vs. D5, D1 vs. D6, D2 vs. D5, D2 vs. D6, D3 vs. D7, D3 vs. D8, D4 vs. D7, D4 vs. D8, D9 vs. D13, D9 vs. D14, D10 vs. D13, D10 vs. D14, D11 vs. D15, D11 vs. D16, D12 vs. D15, D12 vs. D16, D17 vs. D21, D17 vs. D22, D18 vs. D21, D18 vs. D22, D19 vs. D23, D19 vs. D24, D20 vs. D23, D20 vs. D24, D25 vs. D29, D25 vs. D30, D26 vs. D29, D26 vs. D30, D27 vs. D31, D27 vs. D32, D28 vs. D31, and D28 vs. D32.
(4) The average of the correlation values of all other cells in the quadrants D1–D16 vs. D1–D16 and D17–D32 vs. D17–D32, except those belonging to the other three categories above.
We do not consider cells in the other quadrants because they measure correlations of experiments across the time gap, the effects of which we do not want to include in this comparison.
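Each entry of Table 1 is an average over a hand-picked set of cells of the 32 × 32 correlation matrix; the bookkeeping can be sketched as below (indices are 0-based, so D1 maps to index 0; illustrative only):

```python
import numpy as np

def average_correlation(corr, pairs):
    """Average the correlation-matrix cells for a list of
    (i, j) measurement index pairs."""
    return float(np.mean([corr[i, j] for i, j in pairs]))

# Category 1 of Table 1: duplicate measurements D1 vs. D2, D3 vs. D4, ...
duplicate_pairs = [(i, i + 1) for i in range(0, 32, 2)]

# Demo matrix: 0.5 everywhere off-diagonal, 1.0 on the diagonal.
demo = np.full((32, 32), 0.5)
np.fill_diagonal(demo, 1.0)
category1 = average_correlation(demo, duplicate_pairs)
```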

In addition to confirming earlier suggestions that the experimental and biological variables of a DNA microarray experiment contribute more variance than differences among the microarrays themselves (Arfin et al., 2000), these data demonstrate both the subtle differences among replicated gene measurements obtained from DNA microarray experiments as well as more dramatic differences that are observed when basic changes in experimental protocols are adopted.

3.2 Correlations within GeneChip data

When different array formats require different target preparation methods and a fundamentally different probe design, the sources and magnitudes of the experimental errors differ as well. To illustrate this, we examine the correlations among datasets of another DNA microarray format. We use datasets obtained with Affymetrix GeneChips, which are manufactured by the in situ synthesis of short single-stranded oligonucleotide probes, complementary to sequences within each ORF, directly on a glass surface. Nylon filter arrays, in contrast, are manufactured by the attachment of full-length, double-stranded DNA probes for each E. coli ORF directly onto the filter, as described earlier. The intensity matrix in Figure 5 shows how the log-transformed GeneChip expression data are correlated when processed with Affymetrix MAS 4.0, MAS 5.0, dChip, RMA, or GCRMA. Overall, the correlations among the different measurements of the GeneChip experiments are high (>0.7) and comparable to those observed in the first 16 measurements of the filter dataset.

Figure 5 Correlation intensities among data processed with dChip, MAS 4.0, MAS 5.0, RMA, and GCRMA

The intensity range is between 0.7 and 1.0.

Looking from the bottom left, the first four rows and columns in Figure 5 compare the consistency of measurements obtained from four GeneChips processed with the MAS 4.0 software, the data in rows 5–8 and columns 5–8 compare the consistency of measurements obtained from four GeneChips processed with the dChip software, the data in rows 9–12 and columns 9–12 compare the consistency of measurements obtained from four GeneChips processed with the MAS 5.0 software, and so on. Average correlations are calculated for each of the 15 blocks (see Figure 5) comparing pairs of software packages.


These correlations reveal that the most consistent sets of measurements are obtained with GCRMA (average correlation = 0.97), with RMA a close second (average correlation = 0.96). The MAS 5.0 software performs marginally better (average correlation = 0.83) than the previous MAS 4.0 version (average correlation = 0.81), and dChip is better correlated within itself (average correlation = 0.85) than either of the MAS versions. It is also apparent that close correlations are observed when MAS 4.0- and MAS 5.0-processed data are compared to one another (average correlation = 0.82), and that this correlation is better than that between dChip-processed data and MAS 4.0- or MAS 5.0-processed data (average correlation = 0.78). Another interesting observation is that RMA and GCRMA are better correlated with the MAS versions than with dChip, as shown by the averaged correlation numbers in the boxes.

3.3 Correlation of filter with GeneChip data

To compare GeneChip and filter data, the exact same four pooled total RNA preparations used for the nylon filter experiments (Experiments 1–4, Figure 1) were used for hybridisation to four E. coli Affymetrix GeneChips (Figure 2), as described by Hung et al. (2002). In this case, however, instead of four measurements for each gene expression level, as in each filter experiment, only one measurement was obtained from each GeneChip. On the other hand, this single measurement is the average of the differences between hybridisation signals from approximately 15 perfect match (PM) and mismatch (MM) probe pairs for each ORF. While these are not equivalent to duplicate measurements, because different probes are used, these data can increase the reliability of each gene expression level measurement (Baldi and Hatfield, 2002). Nevertheless, large differences in the average difference of individual probe pairs are often observed. To compare the two technologies, we first averaged the filter measurements obtained with the same cDNA target, resulting in one expression profile per cDNA target. We then compared the four averaged filter measurements to the measurements obtained from the four GeneChips (each corresponding to one cDNA target). The data in Figure 6 indicate weak correlations (<0.4) between the filter measurements and the Affymetrix GeneChip data, no matter which image processing software is used. We also notice that the measurements obtained from Experiment 3 (D9–D12), which were observed earlier to be less consistent with the other measurements, are strikingly uncorrelated with the GeneChip measurements.
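The cross-platform comparison just described (average the four filter measurements per RNA pool, then correlate with the matching GeneChip profile) can be sketched as follows. This is an illustrative reconstruction assuming D1–D16 are ordered so that each consecutive block of four columns belongs to one experiment.

```python
import numpy as np

def platform_correlations(filter_data, chip_data):
    """filter_data: genes x 16 (four measurements per RNA pool,
    D1-D16); chip_data: genes x 4 (one GeneChip per pool).
    Returns the Pearson correlation between each averaged filter
    profile and the matching GeneChip profile."""
    n_genes = filter_data.shape[0]
    pooled = filter_data.reshape(n_genes, 4, 4).mean(axis=2)
    return [float(np.corrcoef(pooled[:, k], chip_data[:, k])[0, 1])
            for k in range(4)]

# Demo: filter and chip profiles proportional to each other -> correlation 1.
profile = np.arange(5.0)
filt = np.tile(profile[:, None], (1, 16))
chip = np.tile(2.0 * profile[:, None] + 1.0, (1, 4))
corrs = platform_correlations(filt, chip)
```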

Figure 6 Low correlation intensities observed (<0.4) when comparing measurements from filters (Expt 1–4) with GeneChips

The intensity range is between 0.0 and 0.4.


The low correlation between the filter and GeneChip measurements is possibly due to probe-specific effects arising from differences in the hybridisation efficiencies of different probes. Individual outliers can have a large effect on the average difference for the probe pair sets of individual GeneChips. In fact, it has been reported that this variance can be five times greater than the variance observed among GeneChips (Li and Wong, 2001). These probe effects are smaller for filters containing full-length ORF probes hybridised to targets generated with random hexamers than for Affymetrix GeneChips, which query only a limited number of target sequences. As a result, it is expected that the signal intensities obtained from Affymetrix GeneChips are less correlated to in vivo transcript levels than signal intensities obtained from filters, providing one rationale for why signal intensities obtained from different microarray platforms may not correlate well with one another.

3.4 Differential analysis within filter data

We performed a statistical t-test between 16 duplicate pairs of measurements to study the magnitudes of the variances attributable to cDNA targets and/or filters. The number of genes found to be significantly different (p < 0.005) in each of these comparisons is shown in Table 2. Light grey cells identify experiments that compare different filters hybridised with the same cDNA targets. Dark grey cells identify experiments that compare the same filters hybridised with different cDNA targets. The values represent the number of false positive measurements observed when duplicate pairs of measurements from different cDNA target preparations or filters are compared to one another. The results of Table 2 demonstrate that the average percentage of ‘false positive’ genes obtained when the same targets are hybridised to different filters is about 2%, whereas when different targets are hybridised to the same filters the number increases to about 6%. When different targets are hybridised to different filters, the average percentage of false positives rises to 10%. This is without taking into account the effects of the large time gap between the two sets of 16 measurements. In other words, even with sample pooling, the variances contributed by different cDNA target preparations are, on average, about three to four times higher than the variances contributed by different filters. Similar relative effects are observed using more sophisticated differential analysis methods that compensate for the relationship between levels of gene expression and their variances. These methods include the approach of Huber et al. (2002), which uses a global arcsinh transformation to eliminate variance fluctuations, as well as the local approach of Baldi and Long (2001) and Long et al. (2001), which uses a regularised t-test in which the variance of each gene is estimated by taking into account the variance of genes with similar expression levels.
We had noted earlier that the correlation between measurements dropped significantly across the six-month gap separating D1–16 from D17–32. This can be attributed to several experimental changes, including reagents, personnel, and labelling. A t-test between the two sets of 16 measurements shows a significant (p < 0.005) difference in the mean values of about 75% of the 1,257 genes.


Table 2 Matrix of significant genes with p < 0.005 found in sets of experiment pairs of log-transformed filter data (out of 1,257 total genes)


To estimate the contribution of the two factors, cDNA targets and filters, to the total variance, we performed an ANOVA (Coombes et al., 2002; Kerr et al., 2000) for each of the four sets of eight measurements involving two filters and two cDNA targets. Each quadrant in Figure 1 (e.g., D1–D8) corresponds to one set of eight measurements. For each gene in each quadrant, we performed a two-way factorial ANOVA to obtain the between-group sum of squares (SSB) for the cDNA target and filter factors, respectively. We summarised the SSB of cDNA targets and filters as a percentage contribution to the total variance, averaged over all genes. On average, 41.3% of the total variance came from differences in cDNA and 20% from filters; the interaction (cDNA × filter) contributed 13.5%. These estimates further validate the effect of differences in biological factors on the total variance in the filter dataset.
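For a single gene, the two-way factorial decomposition can be sketched as below. The eight measurements (2 cDNA targets × 2 filters × 2 replicates, mirroring one quadrant) are invented for illustration; the sums of squares follow the standard balanced-design formulas, not the authors' specific ANOVA code:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data for one gene: 2 cDNA targets x 2 filters x 2 replicates,
# mirroring one quadrant of eight measurements in the study.
y = rng.normal(8.0, 0.3, size=(2, 2, 2))  # axes: (target, filter, replicate)

grand = y.mean()
n_t, n_f, n_r = y.shape

# Between-group sums of squares (SSB) for each factor and their interaction,
# using the balanced two-way factorial decomposition.
ss_target = n_f * n_r * ((y.mean(axis=(1, 2)) - grand) ** 2).sum()
ss_filter = n_t * n_r * ((y.mean(axis=(0, 2)) - grand) ** 2).sum()
cell_means = y.mean(axis=2)
ss_cells = n_r * ((cell_means - grand) ** 2).sum()
ss_interaction = ss_cells - ss_target - ss_filter
ss_total = ((y - grand) ** 2).sum()

for name, ss in [("cDNA target", ss_target), ("filter", ss_filter),
                 ("target x filter", ss_interaction)]:
    print(f"{name}: {100 * ss / ss_total:.1f}% of total variance")
```

Averaging these per-gene percentage contributions over all genes yields summary figures of the kind reported above (41.3% targets, 20% filters, 13.5% interaction).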

3.5 Differential analysis within GeneChip and with filter data

To address the effects of different image processing software packages, we applied differential analysis to the GeneChip measurements. The number of false positive genes with p-values below 0.005 based on a standard t-test is shown in Figure 7. Since the RMA and GCRMA functions transform the data differently from MAS4.0, MAS5.0, and dChip, it is not meaningful to perform t-tests comparing RMA and GCRMA to the other three packages: the means of virtually all 1,794 genes would appear significantly different. We can, however, compare the MAS and dChip packages; these comparisons are presented below.

Figure 7 Number of false positive genes identified by different data processing methods

While the replicates obtained using the similar MAS4.0 and MAS5.0 software packages exhibit very low numbers of falsely identified significant genes, differences of up to 147 genes (5–6%) are observed when comparing dChip to the Affymetrix software. This does not imply that one software package is better than another, but rather that users should be cognisant of the differences generated by different analysis methods. Figure 7 also shows the large number of false positive genes detected (38%) when the filter dataset is compared with the GeneChip dataset (regardless of the software used) using the standard t-test with p < 0.005, supporting the earlier observation from the correlation analysis that these two datasets may not be combined.


4 Discussion

Many experimental designs and applications of DNA array experiments are possible. However, whatever the purpose of a gene expression profiling experiment, a sufficient number of experiments must be performed to permit statistical analysis of the data, either through multiple measurements of homogeneous samples (replication) or multiple sample measurements (e.g., across time or subjects). This is because each gene expression profiling experiment measures the expression levels of thousands of genes simultaneously. In such a high-dimensional experiment, many genes will show large changes in expression levels between two experimental conditions by chance alone; likewise, many truly differentially expressed genes will show only small changes. These false positive and false negative observations arise from chance occurrences exacerbated by biological variance as well as experimental and measurement errors. Thus, if we compare the gene expression patterns of cells grown under two different treatment conditions, or of two genotypes, experimental replication is required to assign statistical significance to each differential gene measurement. Such replications quickly become labour intensive and prohibitively expensive.

This leads to the question: how many replicates are required for the data to be considered reliable for further analyses? The short answer is: enough to provide a robust estimate of the standard deviation of the mean of each gene measurement. Given the prohibitive cost of generating replicates, the problem reduces to finding a method that produces robust estimates of the standard deviation of individual gene measurements from few replications. Several methods address this problem. Techniques by Durbin et al.
(2002) apply a transformation to the entire dataset so as to render the variance constant, independent of the mean. Another approach (Baldi and Long, 2001; Long et al., 2001) has shown that confidence in the interpretation of DNA microarray data with a low number of replicates can be improved by a Bayesian statistical method that incorporates information from within-treatment measurements. This method is based on the observation that genes of similar expression levels exhibit similar variance; hence, more robust estimates of the variance of a gene can be derived by pooling neighbouring genes with comparable expression levels (Arfin et al., 2000; Baldi and Hatfield, 2002; Hatfield et al., 2003; Hung et al., 2002; Long et al., 2001). It would nonetheless be advantageous if the factors introducing variance into the data could be identified upfront and pre-adjusted to minimise noise.

This brings us to the next question: what introduces noise into DNA microarray data? The results presented here demonstrate that the variability inherent in highly replicated (up to 32×) DNA microarray data can result from a number of disparate factors operating at different times and levels in the course of a typical experiment. These factors are often interrelated in complex ways, but for simplicity we have broken them down into two major categories: biological variability and experimental variability. Other sources of variability involve DNA microarray fabrication methods as well as differences in imaging technology, signal extraction, and data processing. This study confirms earlier assertions that, even in carefully controlled experiments with isogenic model organisms, the major sources of variance come from uncontrolled biological factors (Hatfield et al., 2003). Our ability to control biological variation in a model organism such as E. coli, with an easily manipulated genetic system, is an obvious advantage for gene expression profiling experiments. However, most systems are not as easily controlled. For example, human samples obtained from biopsy materials differ not only in genotype but also in cell types. Thus, care should be taken to reduce this source of biological variability as much as possible, for example, with the use of laser-capture techniques for the isolation of single cells from animal and human tissues. A related study conducted on human cells (Coombes et al., 2002) found that differences between two target preparations made a relatively small contribution to the variation compared with membrane reuse and exposure time to phosphorimager screens (the latter was outside the scope of our study). In our experiments, each of the eight sets of pooled E. coli RNA was extracted on a different day, which may account for the greater variation attributed to biological factors, as seen in the correlation and differential analyses, relative to experimental factors. With regard to the second dataset, obtained from Affymetrix GeneChip experiments, the poor correlation between signal intensities from filter and Affymetrix GeneChip experiments can be attributed in part to probe effects. Nevertheless, signal ratios obtained from the same probe on two different arrays can ameliorate these probe effects. Thus, the overall differential expression profiles obtained from different microarray platforms should be comparable. In support of this conclusion, Hung et al. (2002) have demonstrated that, with appropriate statistical analysis, similar results can be obtained when the same experiments are performed with pre-synthesised filters containing full-length ORF probes and with Affymetrix GeneChips.
An additional source of biological variation, even when comparing the gene profiles of isogenic cell types, comes from the conditions under which the cells are cultured. In this regard, it has been recommended that standard cell-specific media be adopted for the growth of cells queried by DNA array experiments (Baldi and Hatfield, 2002). While this is not possible in every case, many experimental conditions for the comparison of two different genotypes of common cell lines can be standardised. The adoption of such medium standards would reduce experimental variation and facilitate the cross-comparison of experimental data obtained from different experiments, different microarray formats, and/or different investigators. However, even with these precautions, non-trivial and sometimes substantial variance in gene expression levels is observed even between genetically identical cells cultured in the same environment, as revealed in this study. This can result from a variety of influences, including environmental differences, phase differences between the cells in a culture, periods of rapid change in gene expression, and multiple additional stochastic effects. To emphasise the importance of the microenvironments encountered during cell growth, Piper et al. (2002) have recently demonstrated that variance among replicated gene measurements is dramatically decreased when isogenic yeast cells are grown in chemostats rather than batch cultures. Biological variance can be further exacerbated by experimental errors, for example, if extreme care is not taken in the treatment and handling of the RNA during its extraction from the cell and subsequent processing. It is often reported that the cells to be analysed are harvested by centrifugation and frozen for RNA extraction at a later time; it is important to consider the effects of these experimental manipulations on gene expression and mRNA stability.
If the cells encounter a temperature shift during the centrifugation step, even for a short time, this could cause a change in the gene expression profiles due to the consequences of temperature stress.


If the cells are centrifuged in a buffer that differs even slightly in osmolarity from the growth medium, this could change the gene expression profiles as a consequence of osmotic stress. Also, removal of essential nutrients during the centrifugation period could cause significant metabolic perturbations that would result in changes in gene expression profiles. Each of these and other experimentally induced gene expression changes will confound the interpretation of the experiment. These variables are not easy to control. Therefore, the best strategy is to harvest the RNA as quickly as possible under conditions that ‘freeze’ it at the levels at which it occurs in the cell population at the time of sampling. Several methods are available that address this issue (Baldi and Hatfield, 2002). There are numerous other sources of experimental variability, such as differences among protocols, different techniques employed by different personnel, differences between reagents, and differences among instruments and their calibrations. While these sources of variance are usually smaller than biological sources, they can dominate the results of a DNA microarray experiment. This is illustrated by the poor correlation between the two replicated data sets reported here, one obtained six months after the other. Although there is good correlation among the replicated measurements within each set, there is much less correlation between the sets. In this case, the major difference can be attributed to improvements in the cDNA target labelling protocol. It is reassuring to observe that carefully executed and replicated DNA microarray experiments produce data with high global correlations (in the 0.9 range). This high correlation, however, should not be interpreted as a sign that replication is unnecessary.
Replication as well as proper statistical analysis remains important in order to monitor experimental variability and because the variability of individual genes can be high. It is also reassuring to know that while correlations of expression measurements across technologies remain low, overall differential expression profiles obtained from different microarray platforms can be compared (Hung et al., 2002). Finally, the comprehensive and diverse datasets for wild-type E. coli under standard growth conditions that have been compiled in the present study and are available via the web (http://www.igb.uci.edu/servers/dmss.html) may serve as a useful set of reference data for DNA microarray researchers and bioinformaticians interested in further developing the technology.
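As a concrete illustration of the neighbour-pooling idea behind the regularised t-test of Baldi and Long (2001) discussed above, the following sketch estimates each gene's variance by shrinking it towards the average variance of genes with similar mean expression. This is a simplified illustration, not the authors' Cyber-T implementation; the window size, the prior weight k, and the simulated data are all assumptions made for demonstration:

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_reps = 1257, 4

# Hypothetical log intensities for two treatments (no true differences).
a = rng.normal(8.0, 0.5, size=(n_genes, n_reps))
b = rng.normal(8.0, 0.5, size=(n_genes, n_reps))

def pooled_var(x, window=101):
    """Background variance: running mean of the sample variances of the
    genes nearest to each gene in mean expression (illustrative window)."""
    means = x.mean(axis=1)
    var = x.var(axis=1, ddof=1)
    order = np.argsort(means)
    kernel = np.ones(window) / window
    bg_sorted = np.convolve(var[order], kernel, mode="same")
    bg = np.empty_like(bg_sorted)
    bg[order] = bg_sorted
    return var, bg

def reg_t(a, b, k=10):
    """Regularised t-statistic: shrink each gene's variance towards the
    local background with k pseudo-replicates (illustrative prior weight)."""
    n = a.shape[1]
    va, bga = pooled_var(a)
    vb, bgb = pooled_var(b)
    ra = (k * bga + (n - 1) * va) / (k + n - 2)
    rb = (k * bgb + (n - 1) * vb) / (k + n - 2)
    se = np.sqrt(ra / n + rb / n)
    return (a.mean(axis=1) - b.mean(axis=1)) / se

t = reg_t(a, b)
print("max |regularised t|:", float(np.abs(t).max()))
```

The shrinkage tempers the unstable per-gene variance estimates that arise with few replicates, which is why such methods reduce the false positive rates seen with the plain t-test.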

Acknowledgements

Suman Sundaresh and She-pin Hung have contributed equally to this work. This work was supported in part by the UCI Institute of Genomics and Bioinformatics, by grants from the NIH (GM-055073 and GM068903) to GWH, by a Laurel Wilkening Faculty Innovation Award to PB, and by a Sun Microsystems Award to PB. SH was supported by a post-doctoral training grant fellowship from the University of California Biotechnology Research and Education Program. We are grateful to Cambridge University Press for permission to reproduce materials from a book by PB and GWH titled DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling, ISBN: 0521800226.


References

Arfin, S.M., Long, A.D., Ito, E.T., Tolleri, L., Riehle, M.M., Paegle, E.S. and Hatfield, G.W. (2000) ‘Global gene expression profiling in Escherichia coli K12. The effects of integration host factor’, J. Biol. Chem., Vol. 275, pp.29672–29684.

Baldi, P. and Hatfield, G.W. (2002) DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling, Cambridge University Press, Cambridge, UK.

Baldi, P. and Long, A.D. (2001) ‘A Bayesian framework for the analysis of microarray expression data: regularised t-test and statistical inferences of gene changes’, Bioinformatics, Vol. 17, pp.509–519.

Barkai, N. and Leibler, S. (2000) ‘Biological rhythms: circadian clocks limited by noise’, Nature, Vol. 403, pp.267–268.

Coombes, K.R., Highsmith, W.E., Krogmann, T.A., Baggerly, K.A., Stivers, D.N. and Abruzzo, L.V. (2002) ‘Identifying and quantifying sources of variation in microarray data using high-density cDNA membrane arrays’, J. Comput. Biol., Vol. 9, pp.655–669.

Durbin, B., Hardin, J., Hawkins, D. and Rocke, D.M. (2002) ‘A variance-stabilising transformation for gene expression microarray data’, Bioinformatics (ISMB 2002), Vol. 18, pp.S105–S110.

Hasty, J., Pradines, J., Dolnik, M. and Collins, J.J. (2000) ‘Noise-based switches and amplifiers for gene expression’, Proc. Natl. Acad. Sci. USA, Vol. 97, pp.2075–2080.

Hatfield, G.W., Hung, S.P. and Baldi, P. (2003) ‘Differential analysis of DNA microarray gene expression data’, Mol. Microbiol., Vol. 47, pp.871–877.

Huber, W., von Heydebreck, A., Sültmann, H., Poustka, A. and Vingron, M. (2002) ‘Variance stabilisation applied to microarray data calibration and to the quantification of differential expression’, Bioinformatics (ISMB 2002), Vol. 18, Suppl. 1, pp.S96–S104.

Hung, S.P., Baldi, P. and Hatfield, G.W. (2002) ‘Global gene expression profiling in Escherichia coli K12: the effects of leucine-responsive regulatory protein’, J. Biol. Chem., Vol. 277, pp.40309–40323.
Hung, S.P., Hatfield, G.W., Sundaresh, S. and Baldi, P. (2003) ‘Understanding DNA microarrays: sources and magnitudes of variances in DNA microarray data sets’, in Grandi, G. (Ed.): Genomics, Proteomics and Vaccines, John Wiley and Sons, West Sussex, England, pp.75–102.

Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U. and Speed, T.P. (2003a) ‘Exploration, normalisation, and summaries of high density oligonucleotide array probe level data’, Biostatistics, Vol. 4, No. 2, pp.249–264.

Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B. and Speed, T.P. (2003b) ‘Summaries of Affymetrix GeneChip probe level data’, Nucleic Acids Research, Vol. 31, No. 4, p.e15.

Kerr, M.K., Martin, M. and Churchill, G.A. (2000) ‘Analysis of variance for gene expression microarray data’, J. Comput. Biol., Vol. 7, pp.819–837.

Li, C. and Wong, W.H. (2001) ‘Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection’, Proc. Natl. Acad. Sci. USA, Vol. 98, pp.31–36.

Long, A.D., Mangalam, H.J., Chan, B.Y., Tolleri, L., Hatfield, G.W. and Baldi, P. (2001) ‘Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. Analysis of global gene expression in Escherichia coli K12’, J. Biol. Chem., Vol. 276, pp.19937–19944.

McAdams, H.H. and Arkin, A. (1999) ‘It is a noisy business! Genetic regulation at the nanomolar scale’, Trends in Genetics, Vol. 15, pp.65–69.


Piper, M.D., Daran-Lapujade, P., Bro, C., Regenberg, B., Knudsen, S., Nielsen, J. and Pronk, J.T. (2002) ‘Reproducibility of oligonucleotide microarray transcriptome analyses. An interlaboratory comparison using chemostat cultures of Saccharomyces cerevisiae’, J. Biol. Chem., Vol. 277, pp.37001–37008.

Speed, T. (2002) ‘Always log spot intensities and ratios’, Speed Group Microarray Page, http://www.stat.berkeley.edu/users/terry/zarray/html/log.html.

Tusher, V.G., Tibshirani, R. and Chu, G. (2001) ‘Significance analysis of microarrays applied to the ionising radiation response’, Proc. Natl. Acad. Sci. USA, Vol. 98, pp.5116–5121.

Wu, Z. and Irizarry, R.A. (2004) ‘Stochastic models inspired by hybridisation theory for short oligonucleotide arrays’, Proceedings of the 8th International Conference on Computational Molecular Biology (RECOMB 2004), pp.98–106.