Suman Sundaresh, She-pin Hung, G. Wesley Hatfield and Pierre Baldi
Int. J. Bioinformatics Research and Applications, Vol. 1, No. 1, 2005

How noisy and replicable are DNA microarray data?

Suman Sundaresh
Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, CA 92697, USA
E-mail: [email protected]

She-pin Hung and G. Wesley Hatfield
Department of Microbiology and Molecular Genetics, Institute for Genomics and Bioinformatics, College of Medicine, University of California, Irvine, CA 92697, USA
E-mail: [email protected]
E-mail: [email protected]

Pierre Baldi*
Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, CA 92697, USA
E-mail: [email protected]
*Corresponding author

Abstract: This paper analyses variability in highly replicated DNA microarray measurements conducted on nylon filters and Affymetrix GeneChips™ with different cDNA targets, filters, and imaging technology. Replicability is assessed quantitatively using correlation analysis as a global measure, and differential expression analysis and ANOVA at the level of individual genes.

Keywords: DNA microarrays; sources of variation; replication; correlation; differential expression analysis; ANOVA.

Reference to this paper should be made as follows: Sundaresh, S., Hung, S-P., Hatfield, G.W. and Baldi, P. (2005) ‘How noisy and replicable are DNA microarray data?’, Int. J. Bioinformatics Research and Applications, Vol. 1, No. 1, pp.31–50.

Biographical notes: Suman Sundaresh is a PhD student in the Computer Science Department at UC Irvine. She gained her MSc and BSc (Hons) in Computer Science from the National University of Singapore. Her research interests are in the areas of data mining, machine learning, and biomedical informatics.

She-pin Hung is a post-doctoral researcher in the Department of Microbiology and Molecular Genetics affiliated with the Institute for Genomics and Bioinformatics at UC Irvine.
She received her PhD from the University of California at Irvine in 2002. Her research interests are in the areas of global gene expression profiling with the use of DNA microarrays and bioinformatics.

Copyright © 2005 Inderscience Enterprises Ltd.

G. Wesley Hatfield is Professor of Microbiology and Molecular Genetics in the College of Medicine and Associate Director of the Institute for Genomics and Bioinformatics at the University of California, Irvine. Hatfield holds a PhD degree from Purdue University and a BA degree from the University of California in Santa Barbara. His primary areas of scientific expertise include molecular biology, biochemistry, microbial physiology, functional genomics, and bioinformatics. His recent academic interests include the application and development of genomic and bioinformatics methods to elucidate the effects of chromosome structure and DNA topology on gene expression. He has received national recognition for his scientific contributions, including the Eli Lilly and Company Research Award bestowed by the American Society for Microbiology.

Pierre Baldi is a professor in the School of Information and Computer Sciences and the Department of Biological Chemistry and Director of the Institute for Genomics and Bioinformatics at the University of California, Irvine. He received his PhD from the California Institute of Technology in 1986. From 1986 to 1988, he was a post-doctoral fellow at the University of California, San Diego. From 1988 to 1995, he held faculty and member of the technical staff positions at the California Institute of Technology and at the Jet Propulsion Laboratory. He was CEO of a startup company from 1995 to 1999 and joined UCI in 1999. He is the recipient of the 1993 Lew Allen Award at JPL and the Laurel Wilkening Faculty Innovation Award at UCI. Baldi has written over 100 research papers and four books.
His research focuses on biological and chemical informatics, AI, and machine learning.

1 Introduction

This paper analyses and quantifies certain aspects of ‘noise’ contained in DNA microarray data. A DNA microarray experiment comprises several steps such as cDNA spotting, mRNA extraction, target preparation, hybridisation, image scanning and analysis. These procedures can be further subdivided into dozens of other elementary steps, each of which can introduce some amount of variability and noise. In addition to the variability introduced by the instruments and the experimenter, there is biological variability, which also has multiple sources ranging from fluctuations in the environment to the inherently stochastic nature of nano-scale regulatory chemistry (Barkai and Leibler, 2000; Hasty et al., 2000; McAdams and Arkin, 1999); transcription alone involves dozens of individual molecular interactions. These compounded forms of ‘noise’ may lead one to doubt whether any reliable signal can be extracted at all from DNA microarrays. Here, we show that, while certainly noisy, DNA microarray data do contain reliable information.

In this study, we look at highly replicated (up to 32×) experiments performed by different experimenters at different times in the same laboratory, using wild-type Escherichia coli as a model organism. In addition, we obtain these microarray measurements using two different formats, nylon filters and Affymetrix GeneChips™. Given the overwhelming number of variables that can in principle contribute to the variability, we focus on a particular subset of variables of great relevance to biologists. In particular, we measure the consistency of the results obtained using the filter technology across different filters and mRNA preparations. We also compare filters to Affymetrix GeneChip™ technology and study the effects of five different image processing methods.
Replicability is assessed quantitatively using correlation analysis and differential expression analysis. We use correlation as a global measure of similarity between two sets of measurements. While a correlation close to one is a good sign, it is a global measure that provides little information at the level of individual genes. Thus, we use differential expression analysis at the level of individual genes to detect which genes seem to behave differently in two different sets of measurements. The data sets and software used in our analysis are available over the web at http://www.igb.uci.edu/servers/dmss.html.

Our approach differs from and complements previous related studies (Coombes et al., 2002; Piper et al., 2002). In particular, we use higher levels of replication (32×), relatively simpler biological samples (E. coli vs. S. cerevisiae or human B-cell lymphoma cell lines), and more diverse microarray technologies (filters and Affymetrix GeneChips). Parts of these other studies also focus on the analysis of variables that are outside the scope of the present study, such as exposure time or inter-laboratory variability.

2 Methods

2.1 Filter dataset

The first dataset (‘filter dataset’) we use consists of 32 sets of measurements from 16 nylon filter DNA microarrays containing duplicate probe sites for each of 4,290 open reading frames (ORFs) hybridised with ³³P-labeled cDNA targets from wild-type Escherichia coli cells cultured at 37°C under balanced growth conditions in glucose minimal salt medium. The experimental design and methods for these experiments are described in detail in Arfin et al. (2000), Baldi and Hatfield (2002) and Hung et al. (2002, 2003) and illustrated in Figure 1. Each filter contains duplicate probes (spots) for each of the 4,290 ORFs of the E. coli genome.
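To make the global correlation measure concrete for a dataset of this shape (4,290 ORFs by 32 sets of measurements), the following is a minimal Python/NumPy sketch on synthetic data. The noise model, the variable names, and the shifted ORF are assumptions chosen for illustration; this is not the authors' analysis software or their data. It shows how every pairwise correlation can stay close to one even though a single ORF differs markedly between two groups of measurements, which is why a gene-level analysis is needed as well.

```python
import numpy as np


def replicate_correlations(data):
    """Pearson correlation between every pair of measurement sets (columns)."""
    return np.corrcoef(data, rowvar=False)


rng = np.random.default_rng(0)
n_orfs, n_sets = 4290, 32  # dimensions taken from the filter dataset

# Assumed synthetic model: one "true" log-expression level per ORF plus
# independent measurement noise in each of the 32 sets.
true_log_expr = rng.normal(loc=0.0, scale=2.0, size=n_orfs)
noise = rng.normal(scale=0.3, size=(n_orfs, n_sets))
data = true_log_expr[:, None] + noise

# Hypothetical ORF (row 0) that is shifted in measurement sets 17-32 only.
data[0, 16:] += 3.0

corr = replicate_correlations(data)
off_diag = corr[~np.eye(n_sets, dtype=bool)]  # drop the trivial diagonal

# Gene-level view: difference of group means for the shifted ORF.
gene0_diff = data[0, 16:].mean() - data[0, :16].mean()

print("smallest pairwise correlation:", off_diag.min())
print("group mean difference for ORF 0:", gene0_diff)
```

In this toy setting the correlation matrix barely registers the shifted ORF, while the per-ORF group difference exposes it directly.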
In Experiment 1, filters 1 and 2 were hybridised with ³³P-labeled, random hexamer generated, cDNA targets complementary to each of three independently prepared RNA preparations (RNA1) obtained from the cells of three individual cultures of a wild-type (wt) E. coli strain. These three ³³P-labeled cDNA target preparations were pooled prior to hybridisation to the full-length ORF probes on the filters. Following phosphorimager analysis, these filters were stripped and again hybridised with pooled, ³³P-labeled cDNA targets complementary to each of another three independently prepared RNA preparations (RNA2) from the wt strain (Experiment 2). This procedure was repeated with a second pair of filters, filters 3 and 4, using two more independently prepared pools of cDNA targets (Experiment 3, RNA3; Experiment 4, RNA4), as described for Experiments 1 and 2.

This protocol results in duplicate filter data for four experiments performed with cDNA targets complementary to four independently prepared sets of pooled RNA. Thus, since each filter contains duplicate spots for each ORF and duplicate filters were used for each experiment, 16 measurements (D1–D16) for each ORF were obtained from the four experiments. These procedures were repeated with another two pairs of filters (filters 5–8) for Experiments 5–8 to obtain another 16 measurements (D17–D32) for each ORF.

Figure 1 Experimental design for nylon filter DNA array experiments (‘filter dataset’)

The filter dataset is fairly representative of other filter datasets in the sense that it corresponds to experiments carried out by different people at different times in the same laboratory. In particular, of the 32 filter measurements, the data from measurements 1–16 were obtained six months later than the data from measurements