Education BeadArray Expression Analysis Using Bioconductor

Matthew E. Ritchie1,2*, Mark J. Dunning3*, Mike L. Smith3,4, Wei Shi1,5, Andy G. Lynch3,4 1 Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia, 2 Department of Medical Biology, The University of Melbourne, Parkville, Victoria, Australia, 3 Cancer Research UK, Cambridge Research Institute, Cambridge, United Kingdom, 4 Department of Oncology, University of Cambridge, Cambridge, United Kingdom, 5 Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria, Australia

The hierarchy of data, from individual points: raw data including the scanned Abstract: Illumina whole- pixels that make up beads on a BeadArray TIFF images, bead-level data without the expression BeadArrays are a popu- for a WG-6 BeadChip, is illustrated in TIFF images, summarized output from lar choice in gene profiling studies. Figure 1A. BeadStudio/GenomeStudio, or data ob- Aside from the vendor-provided The experimental process for measur- tained from a public repository. Depend- software tools for analyzing Bea- ing transcript levels in a sample of ing on the format available, different dArray expression data (- open-source tools from Bioconductor tudio/BeadStudio), there exists a interest involves labelling RNA and may be used to import and analyze the comprehensive set of open-source hybridizing this material to the probes analysis tools in the Bioconductor on a BeadArray. The scanned intensities data (Figure 1B). The methods we rou- project, many of which have been from these probes provide a snapshot of tinely use in our own analyses of Illumina tailored to exploit the unique transcript abundance in a particular gene expression data are summarized in properties of this platform. In this sample. Comparing the intensities ob- Table 1. article, we explore a number of tained from different RNA species can The companion Bioconductor package these software packages and dem- provide researchers with insight into the BeadArrayUseCases [6] provides a vignette onstrate how to perform a com- molecular pathways regulating the system withaseriesofexamplesaimedat plete analysis of BeadArray data in under investigation. There is a rich computational biologists wanting instruc- various formats. The key steps of literature on the analysis of gene expres- tion on the specific commands involved importing data, performing quality sion microarrays (see Smyth et al. 2003 in analyses from any starting level of assessments, preprocessing, and data. Three experiments using three [2], Allison et al. 2006 [3], or Reimers annotation in the common setting generations of BeadArray allow us to 2010 [4] for reviews), and while the main of assessing differential expression span the range of data levels and steps of an analysis such as quality in designed experiments will be illustrate the use of specific functions assessment and normalization still apply, covered. from the beadarray, limma,andGEOquery BeadArraydatapresentanumberof packages. We also demonstrate how to unique opportunities that may not be extract information from chip-specific fully exploited by standard microarray annotation packages. Introduction analysis workflows. These include a high and variable level of intra-array replica- Microarrays are a standard laboratory Choosing a Starting Point for tion of probes and a large set of negative technique for high-throughput gene ex- controls. Specialized algorithms that the Analysis of BeadArray Data pression profiling in genomics research. make use of these features have been The BeadArray microarray platform The first decision facing the bioinfor- developed for Illumina BeadArrays to from Illumina Inc. (San Diego, CA) matician may be what data to use as the consists of an array of randomly packed improve the results obtained from this starting point for their analysis. If all beads, each bead bearing many copies of technology. primary data formats (raw data including a particular 50-mer oligonucleotide se- The aim of this article is to provide a TIFFs, bead-level data without TIFFs, or quence (the ‘‘probe’’). Each BeadArray how-to guide for Illumina expression summarized data) have been made avail- contains a collection of probes designed analysis, using packages from the open- able, then it should be clear that starting to interrogate the majority of protein- source Bioconductor project [5]. The from the TIFF images gives the greatest coding transcripts in a given organism overall workflow of an Illumina analysis amount of control over the steps being (human, mouse, or rat) along with a is summarized in Figure 1B. Analyses may performed at each stage. In most situa- large set of both positive and negative begin with data at one of four starting tions, the default processing methods control probes. Due to the random sampling of beads during the manufac- Citation: Ritchie ME, Dunning MJ, Smith ML, Shi W, Lynch AG (2011) BeadArray Expression Analysis Using turing process, the number and arrange- Bioconductor. PLoS Comput Biol 7(12): e1002276. doi:10.1371/journal.pcbi.1002276 ment of replicate beads varies from array Editor: Fran Lewitter, Whitehead Institute, United States of America to array. Published December 1, 2011 Multiple BeadArrays are grouped to- Copyright: ß 2011 Ritchie et al. This is an open-access article distributed under the terms of the Creative gether to form a BeadChip, with gene Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, expression products configured to have six provided the original author and source are credited. (WG-6), eight (Ref-8), or 12 (HT-12) Funding: This work was supported by NHMRC Program grant 406657 (MER, WS), the University of Cambridge, samples per chip. This format allows Cancer Research UK, and Hutchison Whampoa Limited (MJD, MLS, AGL). The funders had no role in the samples to be processed in parallel with preparation of the manuscript. benefits for experimental design, a key Competing Interests: The authors have declared that no competing interests exist. factor in the experimental workflow [1]. * E-mail: [email protected] (MER); [email protected] (MJD)

PLoS | www.ploscompbiol.org 1 December 2011 | Volume 7 | Issue 12 | e1002276 Figure 1. Overview of the technology and workflow. (A) A zoomed view of a typical bead (top) with the pixels that contribute to the overall (red square) and local background (yellow squares) signals marked. Many replicate beads that contain the same 50-mer oligo are located on each BeadArray (middle) to ensure robust measures of expression can be obtained for each probe in a given sample. Around 48,000 different probe types are assayed in this way per sample. These BeadArrays come from a WG-6 BeadChip (bottom), which is made up of a total of 12 arrays, which are paired to allow transcript abundance to be measured in a total of six samples per BeadChip. (B) Summarizes the various data formats available along with the Illumina workflow associated with the different levels of data. Data can be in raw form, where pixel-level data are available from TIFF images, allowing the complete data processing pipeline, including image analysis, to be carried out in . The next level, referred to as bead-level, refers to the availability of intensity and location information for individual beads. In this format, a given probe will have a variable number of replicate intensities per sample. Processed data, where replicate intensities have been summarized and outliers removed to give a mean, a measure of variability, and a number of observations per probe in each sample, is the most commonly available format. Summary data are usually obtained directly from Illumina’s BeadStudio/GenomeStudio software, but can also be retrieved from public repositories such as GEO or ArrayExpress. The right-hand column of this figure indicates the R/Bioconductor packages that can handle data in these different formats. Probe annotation packages are also listed. List of abbreviations and footnotes used in this figure: QA, quality assessment; DE, differential expression; ‘, package available from CRAN [46]; *, denotes chip-specific part of package name that depends upon platform version (e.g., v1, v2, v3, v4). doi:10.1371/journal.pcbi.1002276.g001

Table 1. Summary of the processing methods recommended for different levels of data.

Data Type Analysis Task Recommended Approach

All levels Quality assessment Examine scanner metrics Rawa Local background adjustment Median background subtraction

Raw Transformation log2 Bead-levelb Spatial artefact detection & removal BASH Bead-level Quality assessment Examine image plots & boxplots Bead-level Summarization Default Illumina method Summary-levelc Data export from BeadStudio/ GenomeStudio Non background corrected, non normalized, Sample and Control ‘‘Probe Profile’’ tables Summary-level Quality assessment Examine boxplots of regular & control probes, MDS plots Summary-level Background correction Normal-exponential convolution using negative controls Summary-level Normalization Quantile

Summary-level Transformation log2 Summary-level Estimation of proportion of expressed probes in a sample Mixture model that uses negative controls (propexpr [29]) Summary-level Probe filtering Based on annotation quality Summary-level Differential expression analysis Linear modelling using weights

aRaw data comprises one observation per pixel, per array. bBead-level data comprises one observation per bead, per array. cSummary-level data comprises one observation per probe type, per sample. doi:10.1371/journal.pcbi.1002276.t001

PLoS Computational Biology | www.ploscompbiol.org 2 December 2011 | Volume 7 | Issue 12 | e1002276 employed by Illumina to extract intensities provides grounds for down-weighting or As with other microarrays, it is usual to from the TIFFs and summarize these removal of this sample from the analysis. analyze data on the log2 scale, and values within each sample produce good The value and interpretation of these therefore the plotting and analysis meth- intensity estimates. Whether these process- metrics will be influenced by many factors, ods used in beadarray employ this transfor- ing steps are carried out in R (see vignette), so it is advisable for laboratories to keep an mation by default. The first within-sample or using vendor-provided software, is historical record of these values to assist in quality plots that one can produce are obviously up to the user; however, the the detection of systematic problems during overall image plots of the array surface ability to perform the entire analysis in a processing and in the identification of (Figure 2B) to look for obvious spatial platform-independent, reproducible, and outlier samples. problems. In addition, the checkRegistra- flexible manner in R will be appealing to tion function provides a convenient way to many computational biologists. In addi- assess whether the reported bead centers tion, image registration issues [7] and Raw and Bead-Level Data agree with the bead locations in the raw spatial artefacts [8] can only be managed Analysis images. if raw or bead-level data are available. Although Illumina’s processing steps While the impact of such events can range To obtain raw or bead-level data, include the removal of outliers for each from mild to catastrophic, if an analysis modifications to the default scanning bead type, we find that this is not sufficient begins with summarized data, then the settings in BeadScan or iScan are re- to account for all spatial artefacts that may user will only see the symptoms of such quired. Currently, the beadarray package occur on the array surface. Although it is a errors, and be unable to deal with the [9] is the only Bioconductor package that computationally expensive operation, we potential cause of the problem. can process these raw data either in the routinely use the BASH tool in beadarray to form obtained from the scanner or in a detect and remove spatial artefacts [8,11]. Quality Assessment for All compact representation via the BeadData- This method is based upon the principles PackR package [10]. Levels of Data of the Harshlight [12] package for Affyme- The import of raw data is handled using trix, but works on a within-array basis Irrespective of whether raw, bead-level, the readIllumina function. The availability rather than between arrays, using the or summarized data are being analyzed, the of TIFF images depends on scanner settings within-array replication to generate simi- first opportunity to assess the quality of an (jpegs are provided by default), and where lar performance. The use of BASH is experiment occurs as the arrays are being present beadarray can extract background, recommended, but the parameters may scanned, and without the need for special- foreground, and total intensities to the need to be tuned to achieve good perfor- ized software. The scanner produces a text user’s specifications. In particular, using a mance between different labs or experi- file that contains various signal-based array more robust measure of the local back- ments. quality measures. As an example, Figure 2A ground intensity (median) has been shown Other useful diagnostic plots such as shows the signal-to-noise ratio (SNR) for to be beneficial [7]. If TIFF images are not boxplots can be used to reveal unusual 200 arrays, including the 12 arrays from the available, then the user begins with Illumi- signal distributions (Figure 2C) and plots first data set analyzed in the vignette. Of na’s foreground intensities, calculated by of control probes (positive or negative) can these 12 samples, one has a very low SNR, subtraction of the local background mea- highlight processing problems that may which warrants further investigation and sure from the total intensity. warrant sample removal. For convenience,

Figure 2. Various diagnostic plots which are useful for quality assessment. Where scanner metrics information is available, arrays within a particular experiment can be compared to each other, or to a wider set from the same core facility. In (A), a per array signal-to-noise value (95th percentile of signal divided by the 5th percentile) is plotted for 200 consecutive BeadArrays, with the arrays from the experiment in question highlighted in color (blue or red). Low signal-to-noise ratios indicate a poor dynamic range of intensities and can highlight problems with array processing when they occur sequentially over time. At the individual array level, sub-array artefacts can be detected using spatial plots of the intensities across the BeadArray surface (B) and removed using BASH and outlier removal. For a between sample display, boxplots of the intensities from different arrays within an experiment can highlight samples with unusual signal distributions (C). The relationships between different samples can also be assessed using a multi-dimensional scaling (MDS) plot (D), which can highlight true biological differences between samples (in this example, the difference between UHRR and Brain in dimension 1 and the pure versus mixed samples in dimension 2), as well as technical effects due to lab, experiment date, etc., which may also need to be accounted for in the modelling. doi:10.1371/journal.pcbi.1002276.g002

PLoS Computational Biology | www.ploscompbiol.org 3 December 2011 | Volume 7 | Issue 12 | e1002276 the expressionQCPipeline function auto- these databases will generally be summa- beads are used to remove the relationship matically generates all recommended rized and probably normalized, and can between intensity and signal variability quality control plots for a given data set. be imported into R using the repository- that typically exists. After image processing, a key step is the specific packages GEOquery [18] and Ar- Negative controls are also useful for reduction of raw data (many values per rayExpress [19]. estimating the proportion of probes that probe type) to summary data (one value Once summarized data have been are expressed in a given sample [29], per probe type) in order that we might imported into R, quality assessment is which can be used to distinguish hetero- apply the methods detailed in the next necessary to identify poor-quality arrays geneous cell samples from pure samples section. beadarray offers flexibility in the and check for systematic biases. The [29] and to filter out non-expressed definition of which beads to include in arrayQualityMetrics [20] package is able to probes. summarization and the choice of summary collate quality assessment plots for sum- For normalization, between-array statistic and transformation applied to the marized data created by beadarray and quantile is the method most frequently raw data (the key aspects of summariza- identify potential outlier arrays. Boxplots applied to Illumina data both in the tion). The standard summary to are commonly used to assess the dynamic literature [21,27,30] and in our own use in beadarray are the mean and standard range from each sample and look for research. More sophisticated variants on error of the log-intensity (Illumina’s stan- unusual signal distributions (Figure 2C). this approach that use control probes or dard statistics to report are the mean and We also recommend making separate robust splines (implemented in rsn in ) standard error of the raw intensity). Note boxplots of regular probes and control have emerged and are increasing in that the standard error is important for probes as a means to highlight unusual popularity. Strip-level processing, which Illumina BeadArrays, as the random samples. separates probes depending on physical design means that we will have differing Before comparisons between different location and normalizes strips containing levels of confidence from one measure of biological samples can be made, it is the same probes between samples, can also intensity to the next. Besides variations of important to remove per-array technical be beneficial for older BeadChip versions the standard Illumina outlier removal effects to ensure the values being analyzed [31]. Ultimately, as with other high- method offered by beadarray, other robust truly reflect the biology. In the microarray throughput technologies, there is no summary options are possible as described literature, the three steps to achieve this ‘‘one-size-fits-all’’ solution for normaliza- in Kohl and Deigner (2010) [13] and are commonly referred to as background tion and the analyst should be prepared to implemented in the RobLoxBioC package. correction (not to be confused with the make an informed decision based on image processing step of the same name), exploratory plots and consideration of Summary Data Analysis between-array normalization, and trans- the assumptions of the method. For The most common entry point for the formation. Two popular methods that instance, classical quantile normalization computational biologist is to begin with implement these steps for Illumina data may be inappropriate in data sets com- summarized data obtained from the gene are neqc and vst from the limma and lumi prising many different tissue types. Stan- expression module of the BeadStudio/ packages, respectively. dardized data sets and methods of com- GenomeStudio software. These PC-based For background correction, the Geno- parison may help guide the analyst in their programs provide a convenient graphical meStudio option of subtracting the aver- choice [32]. user interface to import and process age of the negative controls on an array Next, relationships between a collection BeadArray data from the proprietary has been shown on several occasions to be of samples can be assessed via multidi- format idat files output by Illumina’s flawed [21–23]. One can get by with no mensional scaling (MDS, Figure 2D) or scanning software. Data are exported from background correction and a simple log2 principal component analysis (PCA). MDS this application in tab-delimited files transformation to stabilize variances; how- quantifies sample similarity across many (separate files for the experimental and ever, more sophisticated approaches that genes (typically the 500 most variable), and control probes) with each row giving the use Illumina’s negative control probes reduces the measure to two dimensions for summary information for a particular (sequences with no match to the ge- easy viewing. Ideally, samples would probe, and different columns for each nome/transcriptome) are preferable. separate based on biological variables sample. We recommend exporting raw These controls can be used to correct the (sex, treatment, etc.), but often technical summary values (which have not been observed signal intensities from each array effects (such as samples processed together background corrected, transformed, or using a normal-exponential convolution in batches) may dominate the differences normalized) at the probe level (‘‘probe model [24–27] to reduce bias and the between arrays. These effects may be profiles’’) rather than at the gene level number of false positives. Adding a small accounted for in a differential expression (‘‘gene profiles’’) for both regular and offset to the corrected intensities has been analysis, or managed using tools such as control probes to avoid combining probes shown to improve precision and reduce ComBat [33,34] or removeBatchEffect targeting different transcripts of the same the false discovery rate further. In our within limma (as used in Lim et al. (2010) gene in an undesirable manner. Such files research, we routinely use an offset of 16 [35]). Employing a good experimental can be imported and processed in the R for neqc to give a good trade-off between design that ensures biological factors of software environment using a range of variance stabilization and bias. Alterna- interest are not confounded with known tools that include beadarray [9], lumi [14], tively, the VST (variance stabilizing trans- technical or processing variables is of and limma [15]. formation) method [28] performs variance fundamental importance in any study. Another potential source of summarized stabilization and background correction in Once data are preprocessed into a data are public repositories such as Gene the same transformation. Instead of using normalized ‘‘expression matrix’’ format Expression Omnibus (GEO) [16] or Ar- negative controls, the within-array stan- used throughout Bioconductor, a wide rayExpress [17]. Experimental data from dard errors calculated from the replicate variety of analyses can take place such as

PLoS Computational Biology | www.ploscompbiol.org 4 December 2011 | Volume 7 | Issue 12 | e1002276 clustering, assessing differential expression, moderated t-statistics and F-statistics scriptome. Four basic categories, ‘‘per- classification, and pathway analysis. (when multiple contrasts are present), fect’’, ‘‘good’’, ‘‘bad’’, and ‘‘no match’’, and the resulting p-values are generally are defined and shown to correlate with Differential Expression Analysis used to rank probes in terms of their expression level and measures of differen- evidence for differential expression after tial expression. We routinely remove Throughout the vignette [6], we make adjusting for multiple testing. probes assigned a bad or no match quality use of the linear modelling framework in score after normalization. This approach the limma package for assessing differential Annotation is similar to the common practice of expression [36] due to its flexibility and removing lowly expressed probes, but with the maturity of the statistical methods it By following the steps in the previous the additional benefit of discarding probes provides. For a designed Illumina exper- section, the researcher may be presented with a high expression level caused by iment, which includes some replication of with a list of hundreds if not thousands of non-specific hybridization. Besides the RNA samples, average log-intensities are differentially expressed probes that are obvious benefit of removing probes that estimated for one or more distinct sample named according to their manufacturer- are either off-target or promiscuous, such types simultaneously using linear models assigned IDs. At the very least, these must a filtering step reduces the burden of fitted a probe at a time. The limma package be translated into gene symbols that the multiple testing and thereby improves the also allows for observations in a linear researcher can recognize, or into function- power to detect differential expression. model to be weighted according to the al pathways that can provide insight into Chip-specific packages such as illuminaHu- confidence in which we hold them. For the biological question being investigated. manv3.db and organism-specific packages Illumina BeadArrays, we might naturally The Bioconductor project provides such as lumiHumanIDMapping both provide want to weight observations by the inverse infrastructure for mapping between micro- the user with access to these quality scores. of the squared standard error (so that array probes and functional genomic observations about which we are more annotation to be used in downstream Downstream Analyses certain are given greater weight), as the analyses. For Illumina chips, these pack- standard error should be a function both ages are maintained on a per-organism There are many other analysis tools of the array quality and the number of (e.g., lumiHumanAll.db) or per-chip (e.g., available from R/Bioconductor that can copies of that type of bead. However, illuminaHumanv3.db) basis. The organism- be used for downstream analysis of obtaining an accurate measure of the specific packages use the nuIDs from Du Illumina microarray data. For example, standard error can be problematic. Even et al. (2007) [38] to encode the super-set of gene ontology/pathway enrichment anal- if we start out with one, steps such as all probe sequences used in different ysis can be performed with topGO or outlier removal, trimming, background revisions of chips for the same organism, GOstats and their associated annotation correction at the summary level, and which can be advantageous when analyz- packages (GO.db and KEGG.db), as can normalization will transform the mean ing data from different BeadChip versions. gene set enrichment analysis using the and leave the standard error lacking In these packages, the RefSeq IDs provid- GSEAlm package. In limma, both self- validity unless it is sympathetically trans- ed by Illumina in their own annotation contained gene set testing (using the roast formed. Thus, it is tempting to assume that files are used to provide functional anno- function [43]) and competitive gene set testing (using the battery of gene sets our transformation (if we have performed tation for each probe. available from MSigDB [44]—see the one—e.g., taking logs or using vst) has However, an important issue that is romer function) that operate within the removed any mean-variance relationship sometimes taken for granted in the linear model context are possible. in the data, in which case the number of analysis of microarray data is the assign- beads can be used as a weight to account ment of genomic and transcriptomic for technical variation that may arise in an identifiers to each unique probe sequence. Conclusions experiment. This ignores biological varia- Manufacturers provide their own annota- We have highlighted a number of tion between different arrays, and so we tion, but inevitably the reported mappings specially tailored tools and modelling should really use a weight consisting of the can become outdated as genome or approaches that are available in Biocon- number of beads contributing to the transcriptome versions are updated. This ductor for the analysis of Illumina gene observation adjusted by an array multipli- issue was the subject of extensive research expression data sets in various formats. A er that gives a measure of the reliability of for Affymetrix expression arrays (see Dai summary of the methods that we currently the array from which the observation et al. (2005) [39], amongst others) and has recommend for Illumina expression anal- comes. Array-specific weights have been recently been brought to light for Illumina ysis are listed in Table 1. Code examples shown to improve power to detect differ- expression [40] and methylation [41] that illustrate how to carry out each of ential expression [37] and are especially arrays. A significant proportion of probes these steps in the analysis are provided in useful in human studies where heteroge- on each Illumina expression platform are the separate vignette [6] from the BeadAr- neity can be high. reported to map to non-transcribed geno- rayUseCases package. These Bioconductor Having fitted our weighted linear mod- mic regions or have other properties that tools expand the set of analysis options el, we then set up contrasts between RNA complicate analyses, such as containing offered in the vendor-provided GenomeS- conditions and proceed to estimate be- SNPs or repeat-masked elements. Failure tudio/BeadStudio software, and are con- tween-sample differences of biological to take such factors into account can have tinually being developed to accommodate interest. Empirical Bayes shrinkage of the a profound effect on the interpretation of new applications of BeadArray technolo- probe-wise variances is then applied to microarray data [42]. Barbosa-Morais gy, such as methlyation assays. ensure that inference is reliable and stable, et al. (2010) [40] describe a scheme to The open-source Bioconductor plat- even when the number of replicate assign a quality score to each probe form also presents researchers with a samples is small [36]. These shrunken sequence that captures how well the choice of for their standard errors are used to calculate sequence maps to the genome and tran- analysis and a means to write analysis

PLoS Computational Biology | www.ploscompbiol.org 5 December 2011 | Volume 7 | Issue 12 | e1002276 scripts and generate reports based on them regular release schedule that ensures UK Cambridge Research Institute for generat- using Sweave [45], which assists with the packages are kept up-to-date with changes ing the HT-12 data used in this article, Roslin communication of results and ensures in the R software environment [46], which Russell and Ruijie Liu for feedback on the vignette,andValerieObenchainandDan reproducibility of a data analysis. Help is underpins all of this work. Tenenbaum for assistance in preparing the also easy to come by at various levels from BeadArrayUseCases package for release through manual pages for each function, through Bioconductor. to package-specific vignettes and the Acknowledgments Bioconductor mailing list for posting We thank Sean Davis for help on retrieving questions and reporting problems. Biocon- MAQC data from GEO, James Hadfield and ductor software also benefits from a Michelle Osbourne from the Cancer Research

References 1. Verdugo RA, Deschepper CF, Munoz G, 17. European Bioinformatics Institute (2011) Ar- 33. Johnson WE, Rabinovic A, Li C (2007) Adjusting Pomp D, Churchill GA (2009) Importance of rayExpress. Available: http://www.ebi.ac.uk/ batch effects in microarray expression data using randomization in microarray experimental de- arrayexpress/. Accessed 28 October 2011. empirical Bayes methods. 8: 118– signs with Illumina platforms. Nucleic Acids 18. Davis S, Meltzer PS (2007) GEOquery: a bridge 127. Research 37: 5610–5618. between the Gene Expression Omnibus (GEO) 34. Kitchen RR, Sabine VS, Sims AH, Macaskill EJ, 2. Smyth GK, Yang Y, Speed TP (2003) Statistical and Bioconductor. Bioinformatics 23: 1846– Renshaw L, et al. (2010) Correcting for intraex- issues in cDNA microarray data analysis. Meth- 1847. periment variation in Illumina BeadChip data is ods Mol Biol 224: 111–136. 19. Kauffmann A, Rayner TF, Parkinson H, necessary to generate robust gene-expression 3. Allison DB, Cui X, Page GP, Sabripour M (2006) Kapushesky M, Lukk M, et al. (2009) Importing profiles. BMC Genomics 11: 134. Microarray data analysis: from disarray to consol- Array-ExpressdatasetsintoR/Bioconductor. 35. Lim E, Wu D, Pal B, Bouras T, Asselin- idation and consensus. Nat Rev Genet 7: 55–65. Bioinformatics 25: 2092–2094. Labat ML, et al. (2010) Transcriptome analyses 4. Reimers M (2010) Making informed choices 20. Kauffmann A, Gentleman R, Huber W (2009) of mouse and human mammary cell subpopula- about microarray data analysis. PLoS Comput arrayQualityMetrics – a Bioconductor package tions reveal multiple conserved genes and path- Biol 6: e1000786. doi:10.1371/journal.pcbi. for quality assessment of microarray data. Bioin- ways. Breast Cancer Res 12: R21. 1000786. formatics 25: 415–416. 36. Smyth GK (2004) Linear models and empirical 5. Gentleman RC, Carey VJ, Bates DM, Bolstad B, 21. Dunning MJ, Barbosa-Morais NL, Lynch AG, Bayes methods for assessing differential expres- Dettling M, et al. (2004) Bioconductor: Open Tavare´ S, Ritchie ME (2008) Statistical issues in sion in microarray experiments. Stat Appl Genet for computational biology the analysis of Illumina data. BMC Bioinfor- Mol Biol 3: Article 3. and bioinformatics. Genome Biol 5: R80. matics 9: 85. 37. Ritchie ME, Diyagama D, Neilson J, van Laar R, 6. Bioconductor (2011) BeadArrayUseCases. Avail- 22. Schmid R, Baum P, Ittrich C, Fundel-Clemens K, Dobrovic A, et al. (2006) Empirical array quality able: http://www.bioconductor.org/packages/ Huber W, et al. (2010) Comparison of normal- weights in the analysis of microarray data. BMC release/data/experiment/. Accessed 1 November ization methods for Illumina BeadChip Hu- Bioinformatics 19: 261. 2011. manHT-12 v3. BMC Genomics 11: 349. 38. Du P, Kibbe WA, Lin SM (2007) nuID: a 7. Smith ML, Dunning MJ, Tavare´ S, Lynch AG 23. Dunning MJ, Ritchie ME, Barbosa-Morais universal naming scheme of oligonucleotides for (2010) Identification and correction of previously NL, Tavare´ S, Lynch AG (2008) Spike-in Illumina, Affymetrix, and other microarrays. Biol unreported spatial phenomena using raw Illu- validation of an Illumina-specific variance Direct 2: 16. mina BeadArray data. BMC Bioinformatics 11: stabilizing transformation. BMC Research 39. Dai M, Wang P, Boyd AD, Kostov G, Athey B, 208. Notes 1: 18. et al. (2005) Evolving gene/transcript definitions 24. Ding LH, Xie Y, Park S, Xiao G, Story MD significantly alter the interpretation of GeneChip 8. Cairns JM, Dunning MJ, Ritchie ME, Russell R, (2008) Enhanced identification and biological data. Nucleic Acids Res 33: e175. Lynch AG (2008) BASH: a tool for managing validation of differential gene expression via 40. Barbosa-Morais NL, Dunning MJ, Samarajiwa SA, BeadArray spatial artefacts. Bioinformatics 24: Illumina whole-genome expression arrays Darot JF, Ritchie ME, et al. (2010) A reannotation 2921–2922. through the use of the model-based background pipeline for Illumina BeadArrays: improving the 9. Dunning MJ, Smith ML, Ritchie ME, Tavare´S correction methodology. Nucleic Acids Res 36: interpretation of gene expression data. Nucleic (2007) beadarray: R classes and methods for e58. Acids Res 38: e17. Illumina bead-based data. Bioinformatics 23: 25. Xie Y, Wang X, Story M (2009) Statistical 41. Chen Y, Choufani S, Ferreira J, Grafodatskaya D, 2183–2184. methods of background correction for Illumina Butcher D, et al. (2011) Sequence overlap 10. Smith ML, Lynch AG (2010) BeadDataPackR: a BeadArray data. Bioinformatics 25: 751–757. between autosomal and sex-linked probes on the tool to facilitate the sharing of raw data from 26. Allen JD, Chen M, Xie Y (2009) Model-based Illumina HumanMethylation27 microarray. Illumina BeadArray studies. Cancer Inform 9: background correction (MBCB): R methods and Genomics 97: 214–222. 217–227. GUI for Illumina Bead-array data. J Cancer Sci 42. Dunning MJ, Curtis C, Barbosa-Morais NL, 11. Lynch AG, Smith ML, Dunning MJ, Cairns JM, Ther 1: 25–27. Caldas C, Tavare´ S, et al. (2010) The importance Barbosa-Morais NL, et al. (2009) beadarray, 27. Shi W, Oshlack A, Smyth GK (2010) Optimizing of platform annotation in interpreting microarray BASH and HULK - tools to increase the value the noise versus bias trade-off for Illumina Whole data. Lancet Oncol 11: 717. of Illumina BeadArray experiments. In: Gusnanto Genome Expression BeadChips. Nucleic Acids 43. Wu D, Lim E, Vaillant F, Asselin-Labat ML, A, Mardia K, Fallaize C, eds. Statistical tools for Res 38: e204. Visvader JE, et al. (2010) ROAST: rotation gene challenges in bioinformatics. Leeds: Leeds Uni- 28. Lin SM, Du P, Huber W, Kibbe WA (2008) set tests for complex microarray experiments. versity Press. pp 33–37. Model-based variance-stabilizing transformation Bioinformatics 26: 2176–2182. 12. Sua´rez-Farin˜as M, Pellegrino M, Wittkowski KM, for Illumina microarray data. Nucleic Acids Res 44. Subramanian A, Tamayo P, Mootha VK, Magnasco MO (2005) Harshlight: a ‘‘corrective 36: e11. Mukherjee S, Ebert BL, et al. (2005) Gene set make-up’’ program for microarray chips. BMC 29. ShiW,deGraafCA,KinkelSA,AchtmanAH, enrichment analysis: a knowledge-based ap- Bioinformatics 6: 294. Baldwin T, et al. (2010) Estimating the pro- proach for interpreting genome-wide expression 13. Kohl M, Deigner HP (2010) Preprocessing of portion of microarray probes expressed in an profiles. Proc Natl Acad Sci U S A 102: gene expression data by optimally robust estima- RNA sample. Nucleic Acids Res 38: 2168– 15545–15550. tors. BMC Bioinformatics 11: 583. 2176. 45. Leisch F (2002) Sweave: dynamic generation of 14. Du P, Kibbe WA, Lin SM (2008) lumi: a pipeline 30. Barnes M, Freudenberg J, Thompson S, statistical reports using literate data analysis. In: for processing Illumina microarray. Bioinfor- Aronow B, Pavlidis P (2005) Experimental com- Ha¨rdle W, Ro¨nz B, eds. Compstat 2002 - Pro- matics 24: 1547–1548. parison and cross-validation of the Affymetrix and ceedings in Computational Statistics. Heidelberg: 15. Smyth GK (2005) limma: Linear models for Illumina gene expression analysis platforms. Physica Verlag. pp 575–580. ISBN 3- microarray data. In: Gentleman R, Carey V, Nucleic Acids Res 33: 5914–5923. 7908-1517-9. Available: http://www.stat.uni- Huber W, Irizarry R, Dudoit S, eds. Bioinfor- 31. Shi W, Banerjee A, Ritchie ME, Gerondakis S, muenchen.de/,leisch/Sweave. Accessed 28 Oc- matics and computational biology solutions Smyth GK (2009) Illumina WG-6 BeadChip tober 2011. using R and Bioconductor. New York: Springer. strips should be normalized separately. BMC 46. R Development Core Team (2011) R: a language pp 397–420. Bioinformatics 10: 372. and environment for statistical computing. 16. NCBI (2011) Gene expression omnibus. Avail- 32. McCall MN, Irizarry RA (2008) Consolidated Vienna: R Foundation for Statistical Computing, able: http://www.ncbi.nlm.nih.gov/geo/. Ac- strategy for the analysis of microarray spike-in ISBN 3-900051-07-0. Available: http://www.R- cessed 28 October 2011. data. Nucleic Acids Res 36: e108. project.org/. Accessed 28 October 2011.

PLoS Computational Biology | www.ploscompbiol.org 6 December 2011 | Volume 7 | Issue 12 | e1002276 Copyright of PLoS Computational Biology is the property of Public Library of Science and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use.