An open-access long oligonucleotide microarray resource for analysis of the human and mouse transcriptomes. Kévin Le Brigand, Roslin Russell, Chimène Moreilhon, Jean-Marie Rouillard, Bernard Jost, Franck Amiot, Virginie Magnone, Christine Bole-Feysot, Philippe Rostagno, Virginie Virolle, et al.

To cite this version:

Kévin Le Brigand, Roslin Russell, Chimène Moreilhon, Jean-Marie Rouillard, Bernard Jost, et al. An open-access long oligonucleotide microarray resource for analysis of the human and mouse transcriptomes. Nucleic Acids Research, Oxford University Press, 2006, 34 (12), pp.e87. 10.1093/nar/gkl485. hal-00088266

HAL Id: hal-00088266 https://hal.archives-ouvertes.fr/hal-00088266 Submitted on 16 Aug 2006

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Published online July 19, 2006

Nucleic Acids Research, 2006, Vol. 34, No. 12 e87 doi:10.1093/nar/gkl485

An open-access long oligonucleotide microarray resource for analysis of the human and mouse transcriptomes

Kévin Le Brigand1,2, Roslin Russell3, Chimène Moreilhon1,2, Jean-Marie Rouillard4,5, Bernard Jost6, Franck Amiot7, Virginie Magnone1,2, Christine Bole-Feysot6, Philippe Rostagno1,2, Virginie Virolle1,2, Virginie Defamie1,2, Philippe Dessen8, Gary Williams3, Paul Lyons3, Géraldine Rios1,2, Bernard Mari1,2, Erdogan Gulari4,5, Philippe Kastner6, Xavier Gidrol7, Tom C. Freeman3 and Pascal Barbry1,2,*

1CNRS, Institut de Pharmacologie Moléculaire et Cellulaire, UMR6097, 660, route des Lucioles F-06560 Sophia Antipolis, France, 2University of Nice Sophia Antipolis, Institut de Pharmacologie Moléculaire et Cellulaire, UMR6097, 660, route des Lucioles F-06560 Sophia Antipolis, France, 3MRC Rosalind Franklin Centre for Genomics Research, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SB, UK, 4Department of Chemical Engineering, University of Michigan, Ann Arbor, MI 48109, USA, 5Biodiscovery LLC, 3886 Penberton Dr, Ann Arbor, MI 48109, USA, 6IGBMC, BP163, F67404 Illkirch Cédex, France, 7CEA - Service de Génomique Fonctionnelle, Genopole d'Evry, F91057 Evry Cédex, France and 8Laboratoire de Génétique Oncologique, UMR 1599 CNRS, Institut Gustave Roussy, F-94805 Villejuif Cedex, France

Received April 28, 2006; Revised June 5, 2006; Accepted June 23, 2006

ABSTRACT

Two collections of oligonucleotides have been designed for preparing pangenomic human and mouse microarrays. A total of 148 993 and 121 703 oligonucleotides were designed against human and mouse transcripts. Quality scores were created in order to select 25 342 human and 24 109 mouse oligonucleotides. They correspond to: (i) a BLAST-specificity score; (ii) the number of expressed sequence tags matching each probe; (iii) the distance to the 3′ end of the target mRNA. Scores were also used to compare in silico the two microarrays with commercial microarrays. The sets described here, called RNG/MRC collections, appear at least as specific and sensitive as those from the commercial platforms. The RNG/MRC collections have now been used by an Anglo-French consortium to distribute more than 3500 microarrays to the academic community. Ad hoc identification of tissue-specific transcripts and an 80% correlation with hybridizations performed on Affymetrix GeneChip suggest that the RNG/MRC microarrays perform well. This work provides a comprehensive open resource for investigators working on human and mouse transcriptomes, as well as a generic method to generate new microarray collections in other organisms. All information related to these probes, as well as additional information about commercial microarrays, has been stored in a freely-accessible database called MEDIANTE.

INTRODUCTION

Microarray technologies for expression profiling may be split into two broad categories: platforms that are based on in situ synthesis of oligonucleotide probes and those that are based on the deposition of preassembled DNA probes. The first class of array platforms is dominated by the commercial sector with a number of companies, e.g. Affymetrix (1), Nimblegen (2), Agilent (3), offering a range of off-the-shelf or custom arrays to their customers. Microarrays fabricated using preassembled probes have traditionally been favoured by many academic laboratories and are also available from a number of commercial sources e.g. GE Healthcare's Codelink platform (4), Illumina's 'BeadChip' arrays (5). Primarily

*To whom correspondence should be addressed. Tel: +33 4 9395 7793; Fax: +33 4 9395 7794; Email: [email protected]
*Correspondence may also be addressed to Tom C. Freeman. Tel: +44 131 242 6242; Fax: +44 131 242 6244; Email: [email protected]
Present address: Tom C. Freeman, Scottish Centre for Genomic Technology and Informatics, University of Edinburgh Medical School, The Chancellor's Building, Edinburgh EH16 4SB, UK

© 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

for reasons of flexibility and cost, many academic laboratories still favour the use of spotted arrays made in-house for their research.

For a number of years the fabrication of spotted microarrays largely relied on the attachment of fragments amplified from cDNA libraries (6). Whilst this approach clearly works and can provide useable tools for expression analysis, it suffers from several fundamental limitations: gene representation within cDNA libraries is incomplete; there is often a significant degree of redundancy within clone collections; annotation of clones can be flawed and cDNA libraries often come with legal restrictions on their distribution and use. Furthermore, the relatively large size of the cDNA amplicons can be associated with the presence of repeat sequences or homology to related genes, which can compromise the specificity of the probes in an unpredictable way (7). An alternative approach that addresses this issue involves the production of gene-specific DNA fragments by PCR amplification using specific primers (8–10). The existence of a significant fraction of genes for which a specific PCR amplicon cannot be designed or generated, as well as the high costs and technical difficulty of DNA production, makes this approach impractical for the fabrication of mammalian whole genome expression microarrays.

An alternative approach for probe synthesis for spotted microarray production has come through the use of long (50–70mers) oligonucleotides (11,12). A significant reduction in the cost of production of the synthetic oligonucleotides, an improvement of the quality control provided by the different suppliers and the ability to design one or several specific probes to any given target sequence have made the use of long oligonucleotides for the fabrication of microarrays a very attractive option. As a result, the last few years have seen a number of companies offering aliquots of oligonucleotide libraries for array fabrication. Transcript coverage has since increased alongside our knowledge of transcript diversity. However, these sets have been relatively expensive to purchase and the small aliquots provided can severely limit the utility of the resource. In addition, though less of an issue now, the design criteria and the sequence of the oligonucleotides often remained proprietary. Finally, the use of a diverse range of probe sets by different laboratories has made comparison of data between groups difficult (13–19).

In order to address the need for improved access and standardization of microarray resources within the academic biomedical research community, a programme to develop long-oligonucleotide resources for every human and mouse gene was created. Specifically, a collaboration was launched between the French Genopole Network (RNG), a consortium of French laboratories involved in functional genomics, and the Microarray Programme of the MRC Rosalind Franklin Centre for Genomics Research, which had a remit to provide spotted microarrays for human and mouse expression analysis to the UK academic community. The primary objective of the project was to develop an open-access probe resource that would support the fabrication of high quality cost effective microarrays in UK and French academic laboratories. To ensure that probe design was open, dynamic and that annotation of the resources was kept up to date and available to the wider community, the creation of ad hoc bioinformatics tools was also central to the project.

Here we describe the bioinformatic pipeline that has been used in the design of two pangenomic oligonucleotide collections for expression profiling of human and mouse systems. This includes in silico validation steps and benchmark comparisons with commercial human and mouse oligonucleotide probe collections, and the creation of an open-access database called MEDIANTE, which integrates information about the RNG/MRC, Affymetrix, Agilent and Illumina probe sets. Lastly, we present experimental validation data obtained after hybridizing distinct RNAs originating from human or mouse tissues on microarrays spotted with the RNG/MRC probe collections.

MATERIALS AND METHODS

Oligonucleotide design

Transcript selection. Two non-redundant sets of mRNA sequences (one for human and one for mouse) were assembled from RefSeq, a database derived from GenBank. These were subjected to BLAST sequence analysis (20) against UniGene. Out of the 105 680 representative sequences from human UniGene clusters (build #167), 87 386 did not match this first RefSeq selection (build #33 for human). When UniGene clusters corresponding to less than 4 sequences were excluded, there were 2979 UniGene clusters of more than 4 sequences associated with at least 1 RNA sequence, which did not match any RefSeq transcript. The representative RNA from each of these UniGene clusters was then introduced into the list of transcripts selected for oligo design. Sequences defined in Affymetrix and Agilent human microarray annotations were then compared to this second list in order to identify sequences which were not represented. Following this selection the final number of human transcripts selected for oligo design was 29 894. BLAT analyses (21) ensured that each sequence was correctly positioned on the genome sequence. Similar analyses were performed for the mouse, based on RefSeq (build #32) and the 86 213 Unigene clusters (build #125) and resulted in the selection of 25 002 mouse transcripts.

Calculation of oligonucleotide probes. After transcript selection, OligoArray2.0 (22,23) was used to calculate probes. This software integrates BLAST analysis against a non-redundant set of sequences and probe secondary structure analyses (24). Oligonucleotide calculation parameters were set as follows: oligo length from 50 to 52mers; GC percentage from 40 to 60%; maximum distance to 3′ end of transcript less than 1500 bases; melting temperature from 84 to 94°C. OligoArray 2.0 selected probes with the lowest cross-hybridization and the absence of secondary structure, and balanced the set of probes in terms of melting temperature. After the OligoArray2.0 calculation, all oligonucleotides matching with splice variants were grouped by transcript. Oligonucleotides containing five consecutive A, C, G or T's were discarded.

Sub-selection of an oligonucleotide library for synthesis. Following the calculation of all potential probes, there were approximately five oligos designed against each transcript. An automatic procedure was then set up to select the 'optimal' probe from these. To this end, three distinct criteria were integrated: (i) the specificity of the probe, (ii) the number of EST's matching the probe and (iii) the position of the probe from the 3′ end of the target transcript.

1. Specificity. Whilst a specificity analysis is integral to OligoArray, it was necessary to set up an additional specificity check outside the program in order to re-evaluate the design of the probe sets when new releases of Ensembl or RefSeq became available. To this end, each oligonucleotide was compared to three distinct BLAST formatted databases: the first database was composed of the 29 894 human transcripts selected for this study (or 25 002 mouse transcripts), the second database corresponded to the current release of RefSeq protein-encoding transcripts, and the third to the current Ensembl collections of transcripts. Evaluation of the specificity of each oligo probe against these three databases was adopted in order to minimize biases caused by the process of transcript selection. A perfect match between a 50mer and the corresponding transcript is associated with a BLAST expect-value of 10⁻²⁰. However, any 50mer can perfectly match more than one transcript, for instance when the probe sequence is shared by several splice variants, or when it matches distinct members of the same gene family. Furthermore, a 50mer can also match other transcripts imperfectly. In this case, the number of hits is not constant, but increases along with the BLAST expect-value. We set up a decimal score, called X_HYBRID, where the integer part depicts the BLAST-specificity of the probe, and the decimal part depicts the number of BLAST hits. In that context, the most specific probe available for a transcript will correspond to the one associated with the lowest X_HYBRID value. This definition is explained in detail in Figure 1, which shows the number of hits obtained for a 50mer at different levels of the BLAST expect-values. According to the definition of the X_HYBRID, oligonucleotides with no cross-hybridization at a BLAST expect-value equal to 1 have a X_HYBRID equal to 0. The higher the X_HYBRID score, the lower the specificity of the oligonucleotide. Table 1 shows the relationship existing between X_HYBRID scores and the number of matching nucleotides. Typically, oligonucleotides with X_HYBRID equal to 1.x (where x is any integer from 1 to 9) can match 16 consecutive bases with a homologous sequence, or 19 bases out of 20 (Table 1). Experimental data support the fact that such probes are still specific (data not shown). Oligos with a X_HYBRID value above 2.0 can perfectly match with sequences of more than 18 bases, but can also match 21 bases out of 22, or 27 bases out of 30. Most probes with X_HYBRID > 2 were removed from the final selection. In the rare cases where they were selected, this was only when no better probes were available for the corresponding transcript.

2. Comparison with expressed sequence tags. mRNA species derived from a single locus can vary in exon usage (splice variants) or the length of the 3′ end due to the use

Table 1. Relationship existing between the X_HYBRID scores and the number of matching nucleotides in BLAST hits

Number of mismatches | 0 | 1 | 2 | 3 | 4
X_HYBRID = 0 | 14/14 | 17/18 | 20/22 | 23/26 | 26/30
X_HYBRID = 1.x | 16/16 | 19/20 | 22/24 | 25/28 | 28/32
X_HYBRID = 2.x | 18/18 | 21/22 | 24/26 | 27/30 | 30/34
X_HYBRID = 3.x | 20/20 | 22/23 | 25/27 | 28/31 | 31/35
X_HYBRID = 4.x | 21/21 | 24/25 | 27/29 | 30/33 | 33/37
X_HYBRID = 5.x | 23/23 | 26/27 | 29/31 | 31/34 | 35/39
X_HYBRID = 6.x | 41/41 | 44/45 | 47/49 | - | -

For instance, a probe with a X_HYBRID score equal to 2.1 can match its BLAST hit by 18 identical bases out of 18 (no mismatch), or 21 out of 22 (one mismatch) or 24 out of 26 (two mismatches).

Figure 1. Definition of the X_HYBRID specificity score. Typical picture of a probe specificity analysis, as available from the MEDIANTE interface (http://www.microarray.fr). Each column represents the number of BLAST hits in the MEDIANTE database (blue), Ensembl database (green), RefSeq database (red) for the BLAST expect-value indicated at bottom. The X_HYBRID score for a probe was calculated as the maximal x_hybrid score among the three databases. Based on expect-values equal to 10, 1, 10⁻¹, 10⁻², 10⁻³, 10⁻⁴, 10⁻¹⁵, 10⁻²⁰, a 'rank' was defined, ranging from 0 for an expect-value equal to 10 to 6 for an expect-value equal to 10⁻¹⁵. For instance, the oligonucleotide depicted in Figure 1 has an extra hit in RefSeq for an expect-value of 10⁻¹, thus defining a rank equal to 2. The number of extra hits between the rank and the lowest expect-value is called delta (Δ). In the example shown in Figure 1, Δ is equal to 1. Δ is always kept in the interval from 1 to 9, meaning that when there are more than 9 extra hits, Δ is kept to 9. A x_hybrid score is defined for each BLAST database (i.e. MEDIANTE, RefSeq, Ensembl) as a decimal number, where the integer part corresponds to the rank, and the decimal part to Δ. The final X_HYBRID score for a probe is defined as the maximal x_hybrid score obtained against the 3 BLAST databases.

of alternative poly-adenylation sites (25). In order to select oligonucleotides that hybridized to the most invariant part of each transcript we counted the number of EST hits for each probe, as probes designed against alternatively spliced exons or 5′ end sequences are likely to hit fewer 3′ EST's. We used the LASSAP (26) implementation of BLAST to compare all oligo probes against dbEST databases (27). The number of EST's matching (95% identity) each oligonucleotide in the database (called EST_NUMBER) was counted. Additional information, such as the tissue of origin of the EST, was also recorded. All information was stored in the MEDIANTE database. LASSAP has been optimized for such an intensive task, where 270 700 oligos had to be compared against 7 057 754 human and 4 688 047 mouse EST's.

3. 'Optimal' probe selection and synthesis. After having defined the X_HYBRID, EST_NUMBER and DIST_TO_3′ scores (the latter corresponding to the distance to the 3′ end of the transcript) for every pair of probe and transcript, selection of the 'optimal' human and mouse sets was performed using these three criteria. In order to select highly specific probes, the first criterion selected for each transcript the probes exhibiting the lowest X_HYBRID scores. The second criterion selected within this subset the probe(s) with the highest EST_NUMBER values. Probes from the first subset having an EST_NUMBER greater than or equal to 60% of this maximal value were rescued and stored in a second subset (this cut-off is explained in Supplementary Figure 1). For each transcript, the optimal probe was defined as the most 3′ probe belonging to the second subset. After this first selection, which favours the selection of BLAST-specific oligonucleotides, we checked whether probes characterized by an intermediate X_HYBRID score (X_HYBRID < 2) and by an EST_NUMBER score at least five times greater than the EST_NUMBER score of the selected probe were available. This step allowed the selection of probes with minimal cross-hybridizations, but matching a much larger number of ESTs. Such situations may happen due to alternative splicing, leading to the existence of two transcripts with different levels of expression. Among such probes, the most 3′ sequence was selected.

For several transcripts, we selected several additional oligonucleotides to analyse the variations of the ratios and/or intensities between probes targeting the same transcript. Based on these criteria a subset of 25 342 human and 24 109 mouse oligonucleotides was selected for synthesis. A total of 100 mmoles of each probe were ordered from Sigma–Proligo (Paris, FRANCE) as a 5′ amino-modified oligonucleotide. Oligonucleotide stocks were aliquoted and distributed to participating laboratories for use in microarray fabrication.

Comparison of the probes sets with commercial probe sets

In order to compare our oligonucleotide selection with other probe collections, we performed an in silico comparison between the human and mouse RNG/MRC probe sets and the probes present on Affymetrix (human U133Plus2 and mouse MG-U74), Agilent (HumanGenome and MouseGenome), and Illumina (Illumina_human and Oligator_MEEBO_mouse) microarrays. The analysis of the Affymetrix GeneChips probes was restricted to the first and last perfect-match probes of each probe set. Three comparisons were carried out, for each of the three scores: X_HYBRID, EST_NUMBER, and DIST_TO_3′. These comparisons took place within a subset of probes of 16 303 human and 13 073 mouse transcripts represented on all four microarray platforms. For each of the three commercial platforms the same approach was taken to calculate X_HYBRID, EST_NUMBER and DIST_TO_3′ scores for each probe.

However, a direct comparison of BLAST scores was not possible, due to the different lengths of the probes (25 bases for Affymetrix, 50 for RNG/MRC, 60 for Agilent and 70 for Illumina). Several analyses were therefore performed, where sub-sequences were randomly selected within RNG/MRC, Agilent and Illumina probes in order to generate BLAST queries of uniform length. Libraries of 25mers derived from RNG/MRC, Agilent and Illumina collections were constructed, as well as libraries of 50mers derived from Agilent and Illumina collections. In order to reduce the bias caused by the selection of 'random' 25mers, the procedure was independently repeated three times for each set, and these scores were only used for global descriptive statistics. The three independent measurements indeed led to the same results, therefore demonstrating that shortening the length of the probes had no impact on the results of our analyses.

Mediante web application

The development of the project has required the creation of a dedicated database, aimed at storing all oligonucleotide sequences. This database has been called MEDIANTE. MEDIANTE is a J2EE platform deployed under a Tomcat web server. It is based on a PostgreSQL relational database. This database contains annotations pertaining to transcripts and oligonucleotides in 45 distinct tables. Thirty additional tables are used to store the information about hybridizations (K. Le Brigand and P. Barbry, manuscript in preparation). The human and mouse RNG/MRC probe collections and all associated information can be directly downloaded from the MEDIANTE home page (http://www.microarray.fr). Subscription is managed by the French National Genopole Network, and provides access to some additional tools, such as customized selection of oligonucleotides, or storage of microarray data.

Probe update and evaluation. The current MEDIANTE database has now gone through six different iterations as updated versions of Ensembl, RefSeq and Unigene have been released (Supplementary Figure 2). Upon each new RefSeq release, an automatic process is launched in order to update the 'optimal' RNG/MRC oligonucleotide collections. This process integrates: (i) an update of transcripts with an altered sequence, (ii) the identification of transcripts absent from the current set, (iii) the design of new oligonucleotides and the construction of ad hoc relationships between all oligonucleotides and all transcripts, (iv) an update of the 'optimal' selection of oligonucleotides. The aim of this automatic process is to check, for each transcript, whether the oligonucleotide currently selected is still the 'optimal' probe available. The whole process makes possible the re-use of collections of oligonucleotides that were selected in a previous version of the collections, with no need to re-calculate whole collections with OligoArray.

Annotation of the probe sets. Transcript annotations were derived from several public databases. All information is currently accessible and can also be freely downloaded within the framework of MEDIANTE. A search tool allows the user to browse according to GenBank accession numbers, LocusLink ID, Unigene ID, gene symbol, within sequence descriptors, exact terms (28,29), chromosomal localization. Each query is built dynamically after collection of the information and is returned via a web form. The list of relevant transcripts is identified, and then visualized. An important feature is the publication of information about several commercial platforms (Affymetrix, Agilent and Illumina). BLAST analyses have been used to position all probes in the RNG/MRC and commercial sets on each transcript included in MEDIANTE. This makes it possible to compare probes belonging to any of these four distinct platforms (see Figure 5). This information appears particularly useful to highlight differences existing between probes from different platforms, especially when conflicting data are collected from different platforms or probes.

Personal project manager. One of the initial remits of MEDIANTE was to provide the possibility for a distant user to participate in the improvement of the oligonucleotide collection. A typical scenario corresponds to the selection of a subset of probes for specific transcripts, for instance to allow the design of bespoke microarrays. The user can either upload FASTA-formatted (30) sequences or query MEDIANTE for sequences of interest. When all sequences have been collected, the user can compare their sequences with all the oligonucleotides. This BLAST analysis allows the determination of oligonucleotides that match with the query sequences. A pre-selection of the 'optimal' probe is performed according to the method explained above, but the user can still change these parameters according to their preferred criteria. The selection of probes is very similar to a 'shopping basket', where the user progressively collects the list of probes needed for their project. At the end of the selection process, all information about the selected oligonucleotides and their associated annotation (transcript information, chromosomal localization, probe sequence, etc.) can be either downloaded, or stored in the database for subsequent analysis. This virtual microarray can also be transferred to another collaborator registered to MEDIANTE, so that several users can cross-check the selection of probes. This method has been very useful during the development of the RNG/MRC collection, and has since been used for creating a custom-made human microarray, which is currently used by several laboratories. Information about this microarray is also available on the entry page of MEDIANTE.

Experimental evaluation

Array preparation. Oligonucleotides were diluted to a final concentration of 35–50 µM in 35% dimethyl sulfoxide (DMSO), 100 mM potassium phosphate (pH 8.0). Pangenomic microarrays were printed using human RNG/MRC oligonucleotide collection with a ChipWriter Pro arrayer (Bio-Rad, 1000 Alfred Nobel Drive Hercules, CA) on commercial HydroGel slides (Schott, Hattenbergstr 10 55122 Mainz, Germany), and processed according to the manufacturer's instructions.

RNA labelling and hybridization. RNAs were labelled using an amplification protocol, as described in Moreilhon et al. (9). Briefly, 1 µg of total RNA was amplified with the Amino Allyl MessageAmp aRNA kit (Ambion, 2130 Woodward Austin TX) according to manufacturer's instructions. Cy3 and Cy5 labelled aRNA was fragmented with the Ambion aRNA Fragmentation Reagents, purified and made up in Agilent hybridization buffer. Labelled cRNAs were then hybridized to RNG/MRC human pangenomic microarrays in an oven at 62°C for 16 h. Microarrays were washed and scanned with a Genepix scanner (Axon Instruments, Molecular Devices Corporation 3280 Whipple Road Union City, CA). Hybridization of Affymetrix GeneChip was performed according to standard protocols, as suggested by the supplier.

Microarray analysis. TIF images containing the data from each fluorescence channel were quantified with the Genepix pro 5.0 program (Axon Instruments). Data were log-transformed, mean-centered and reduced for an equal standard deviation between each slide (Z-score), using the GeneANOVA software (31). Normalization was performed using the limma package available in Bioconductor (http://www.bioconductor.org). Tissue-specific probes were defined as probes with a Z-score greater than 100 times the average Z-scores in the other three conditions. Clustering was performed using Bioconductor (32,33) and TM4 (34). GeneChip One-Cycle Target Labeling (Affymetrix) and the recommended protocols from the Affymetrix Eukaryotic Sample Analysis Technical Manual, revision 5 (Affymetrix SOP) were used for the experiment shown on Figure 6.

RESULTS

Oligonucleotide design process

Statistics on RNG/MRC and several commercial probe collections are summarized in Table 2. The X_HYBRID score defines the specificity of the set. A total of 68.9% (17 457 probes) and 23.4% (5921) of the 25 342 RNG/MRC human probes have a X_HYBRID score equal to zero, or below 2, respectively. A total of 7.7% (1964 probes) have a X_HYBRID score above 2 indicating a possible cross-hybridization with other transcripts (as assessed by BLAST analyses). They were nevertheless selected as no better probe could be identified. A very similar picture was obtained for the RNG/MRC mouse probes (24 109 probes) with 69.6% (16 767) probes with a X_HYBRID equal to zero, 23.1% (5563) probes with a X_HYBRID below 2, and 7.4% (1779) probes with a X_HYBRID above 2. Comparison of the probes with EST databases (dbESTs) demonstrated that the 25 342 oligonucleotides of the final human selection matched 1 282 376 distinct EST's (>95% identity). The 24 109 mouse probes collection matched 926 311 distinct EST's. On average, 53 EST's were hit by each human probe and 39 by each mouse probe.
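The three-criterion 'optimal' probe selection described in Materials and Methods (lowest X_HYBRID, then EST_NUMBER within 60% of the maximum, then the most 3′ probe, followed by the rescue of much better supported probes with X_HYBRID < 2) can be summarized as a short Python sketch. It is an illustration under stated assumptions rather than the pipeline actually used: the `Probe` record, its field names and `select_optimal` are hypothetical, and the extra condition that a rescued probe must have strictly more ESTs than the current choice is added here only to handle an EST_NUMBER of zero.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Probe:
    """Hypothetical record holding the three scores for one probe."""
    name: str
    x_hybrid: float    # BLAST-specificity score (lower is more specific)
    est_number: int    # number of matching ESTs
    dist_to_3p: int    # distance to the 3' end of the transcript, in bases


def select_optimal(probes: List[Probe]) -> Probe:
    """Pick one probe among the candidates designed for a single transcript."""
    if not probes:
        raise ValueError("at least one candidate probe is required")

    # Criterion 1: keep the probes with the lowest X_HYBRID score.
    lowest = min(p.x_hybrid for p in probes)
    first = [p for p in probes if p.x_hybrid == lowest]

    # Criterion 2: keep those with at least 60% of the best EST_NUMBER.
    max_est = max(p.est_number for p in first)
    second = [p for p in first if p.est_number >= 0.6 * max_est]

    # Criterion 3: take the probe closest to the 3' end.
    chosen = min(second, key=lambda p: p.dist_to_3p)

    # Rescue step: a probe with X_HYBRID < 2 and at least five times more
    # matching ESTs replaces the first choice (most 3' one if several).
    rescue = [
        p for p in probes
        if p.x_hybrid < 2
        and p.est_number > chosen.est_number        # assumption, see lead-in
        and p.est_number >= 5 * chosen.est_number
    ]
    if rescue:
        chosen = min(rescue, key=lambda p: p.dist_to_3p)
    return chosen


candidates = [
    Probe("a", 0.0, 10, 300),
    Probe("b", 0.0, 7, 100),
    Probe("c", 0.0, 3, 50),
    Probe("d", 1.2, 20, 40),
]
print(select_optimal(candidates).name)
```

In this toy set, probe "c" is the most 3′ but falls below the 60% EST cut-off, and probe "d" does not reach five times the EST_NUMBER of the first choice, so it is not rescued.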

Table 2. Descriptive statistics for RNG/MRC, Affymetrix, Agilent and Illumina probe collections

 | Human probes | Human transcripts | Mouse probes | Mouse transcripts
Number of distinct RNG/MRC sequences | 148 993 | 29 894 | 121 703 | 26 058
Number of distinct gene symbols | - | 21 043 | - | 22 740
Total number of RNG/MRC probes | 25 342 | 29 894 | 24 109 | 25 002
Number of RNG/MRC probes with x_hyb = 0 | 17 457 | - | 16 767 | -
Number of RNG/MRC probes with x_hyb = 1.x | 5921 | - | 5563 | -
Number of RNG/MRC probes with x_hyb > 1.y | 1964 | - | 1779 | -
Average EST_NUMBER per RNG/MRC probe | 53 | - | 39 | -
Average DIST_TO_3′ score per RNG/MRC probe | 692 | - | 488 | -
Affymetrix U133Plus2 / MG-U74 | 108 371 (a) | 27 588 (b) | 89 502 (a) | 20 624 (b)
Agilent whole genome | 40 990 | 25 627 (b) | 20 865 | 18 087 (b)
Illumina human / MEEBO | 22 548 | 21 271 (b) | 36 362 | 22 463 (b)

Oligonucleotides with X_HYBRID equal to 0 or to 1.x are considered 'BLAST-specific' probes; oligonucleotides with a X_HYBRID equal to x.y (x > 1) can possibly cross-hybridize with other transcripts. They were selected only when no better probe was available for a given transcript. For Affymetrix, comparison was performed for the first and last 25mer probes of each perfect-match Affymetrix probe set (a probe set is specific to a gene and is composed of an average of ten 25mer probes). A total of 90% of the transcripts have a majority of their associated oligonucleotides characterized by a X_HYBRID score below 2. Less than 5% of the transcripts are only associated with oligonucleotides characterized by a X_HYBRID score above 2.
(a) Only the first and last probes from each Affymetrix probe set were used for analysis.
(b) Number of transcripts matched by probes from the RNG/MRC transcript selection set.

Figure 2. Blast-specificities of the different probe collections. (A) Average X_HYBRID scores for the different human and mouse collections. (B) Percentage of probes in each set associated with a X_HYBRID above 2, i.e. less ‘BLAST-specific’. This comparison has been performed on a subset of 16,303 human and 13,073 mouse transcripts, common to all platforms. ALL represents the collection of all probes calculated with OligoArray2.0. RNG/MRC represents the selection of probes used for the fabrication of the microarrays.
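The quantity plotted in panel B, the percentage of probes with X_HYBRID above 2 computed only on transcripts common to all platforms, can be sketched in Python as follows. This is a simplified illustration, not the analysis code behind the figure: it assumes one probe score per transcript and per platform, and the function names, platform labels and toy values are hypothetical.

```python
from typing import Dict, Set


def common_transcripts(platforms: Dict[str, Dict[str, float]]) -> Set[str]:
    """Transcripts that have a probe on every platform."""
    if not platforms:
        return set()
    transcript_sets = [set(scores) for scores in platforms.values()]
    return set.intersection(*transcript_sets)


def percent_above(platforms: Dict[str, Dict[str, float]],
                  threshold: float = 2.0) -> Dict[str, float]:
    """Percentage of probes with X_HYBRID > threshold, per platform,
    restricted to the transcripts shared by all platforms."""
    shared = common_transcripts(platforms)
    result = {}
    for name, scores in platforms.items():
        n_above = sum(1 for t in shared if scores[t] > threshold)
        result[name] = 100.0 * n_above / len(shared) if shared else 0.0
    return result


# Toy data: transcript identifier -> X_HYBRID of the probe on that platform.
toy = {
    "platform_A": {"t1": 0.0, "t2": 2.3, "t3": 1.1, "t4": 3.0},
    "platform_B": {"t1": 2.5, "t2": 2.1, "t3": 0.0, "t5": 0.0},
}
print(percent_above(toy))
```

Restricting every platform to the shared transcripts first keeps the percentages comparable, since each platform is then evaluated on the same set of targets.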

Probe update and evaluation. The current MEDIANTE database has now gone through six successive versions, as updates of Ensembl and RefSeq have been released. The fourth version of the human probe selection was used for the design of the synthesized human oligonucleotide collection, the so-called RNG/MRC human probe set. Similarly, the second version of the mouse probe selection corresponds to the version used for the synthesis of the mouse collection, the so-called RNG/MRC mouse probe set. Supplementary Figure 2 indicates the evolution of the probe collection over several successive versions of the databases.

Comparison of the probe sets with commercial probe sets

Comparison of the RNG/MRC probe sets with commercial platforms, i.e. Affymetrix, Agilent and Illumina, was performed on 16 303 human transcripts represented in all four collections of human probes and on 13 073 mouse transcripts represented in all four collections of mouse probes. Figure 2 summarizes the comparison of the X_HYBRID scores between probes from the RNG/MRC, Affymetrix, Agilent and Illumina platforms. As explained in the ‘Materials and Methods’ section, a direct comparison of the four platforms was not possible due to the different probe sizes used by the different platforms: from 25 bases (Affymetrix), 50 bases (RNG/MRC), 60 bases (Agilent) to 70 bases (Illumina). To circumvent this problem, three independent calculations were made with three randomly shortened probes, in order to work with BLAST queries of the same length across the different collections. Figure 2 values represent averages of the three resulting values. Figure 2B shows the percentage of probes with an X_HYBRID above 2 as a function of the length of the BLAST query, according to the source of the oligonucleotide. A total of 10–15% of the 148 993 human and of the 121 703 mouse oligonucleotides selected with OligoArray (entitled ALL in Figures 2–3) have an X_HYBRID score above 2. A similar percentage is observed for the oligonucleotides from Agilent and Illumina. This percentage decreases below 5% for the probes derived from the RNG/MRC or Affymetrix collections. A similar trend is observed in Figure 2A for the average X_HYBRID score of each set. Based on this in silico analysis, the RNG/MRC probe resources would appear to be more specific than the two commercial long-oligonucleotide probe collections.

Figure 3 summarizes the comparison of the EST_NUMBER scores among platforms. Figure 3B shows the percentage of probes in each set that do not match any ESTs. More than 12% of the 148 993 human and of the 121 703 mouse oligonucleotides selected with OligoArray (ALL) do not recognize any EST. A large difference can be noticed between the 5′ and 3′ oligonucleotides from Affymetrix. In that case, the difference in EST_NUMBER can clearly be explained by their relative distance to the 3′ end of the transcript (see also Figure 4). Figure 3A indicates the average EST_NUMBER score per set. As for the X_HYBRID score, we randomly shortened RNG/MRC, Agilent and Illumina probes to 25 bases, so that their size did not differ from the size of Affymetrix probes. This sampling was performed three times, and gave identical results (data not shown). While the average EST_NUMBER led to similar scores for all platforms, the RNG/MRC probes mapped to a slightly larger number of ESTs in both human and mouse than the other platforms.

Figure 3. Matches with human and mouse EST databases for the different probe collections. (A) Average EST_NUMBER scores for the different human and mouse collections. (B) Percentage of probes matching no ESTs for all MEDIANTE probes and for the RNG/MRC, Agilent, Illumina and Affymetrix probe sets. For human, the comparison was performed on a subset of 7,325 transcripts having ‘BLAST-specific’ probes in all sets, i.e. X_HYBRID lower than 2.0. For mouse, the comparison was performed on a subset of 6,358 such transcripts. A matching EST was defined by a 95% identity between one probe and an EST.

The position of the probe with regard to the 3′ end of the transcript, represented by the DIST_TO_3′ score, was then analysed. The distribution of the distance for the RNG/MRC probes and the three other sets is shown in Figure 4. As might be expected, the Affymetrix probe sets show distinct peaks associated with the 5′ and 3′ oligonucleotides. Agilent, Illumina and the RNG/MRC probe collections display a similar pattern of distribution, with the majority of probes being located within the last 600 base pairs of the 3′ region of the target mRNA.

Experimental evaluation of the RNG/MRC oligonucleotides collection

Expression profiling of human cell types versus ‘electronic northern’. A first experimental evaluation of our selection of oligonucleotides was provided by a comparison of experimental data and in silico data (Supplementary Figure 3). For this purpose, we compared the results obtained after hybridization of 45 RNG/MRC microarrays with diverse human RNA originating from leucocytes (7 microarrays), nasal epithelial cells (22 microarrays), keratinocytes (4 microarrays) and liver (12 microarrays), in order to identify a set of tissue-specific transcripts, characterized by a strong differential expression between one tissue/cell type and the three others. A total of 481 oligonucleotides were selected by their high EST_NUMBER scores in at least one of the four studied tissues, and by a differential expression between the four tissues, as assessed by the results of the hybridizations on the RNG/MRC microarrays. Hierarchical clustering of these 481 probes revealed several tissue-specific clusters. Supplementary Figure 3 (left panel) shows a heat map of these 481 genes. A black signal is associated with a high level of expression, as measured by the intensity of the fluorescence. Annotations of the ESTs (see Materials and Methods) were used to count the number of ESTs from immune cells (Immune), respiratory tissues (Respiratory), skin (Skin) or liver (Liver) matching these 481 probes. Supplementary Figure 3 (right panel) shows a heat map derived from this analysis. Despite the fact that the samples analysed here are from similar but ultimately different biological sources, several similarities can be noticed between the two plots. This suggests some relationship between the intensity of the signal and the number of expressed transcripts.
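The length normalization described above for the cross-platform comparison (shortening the 50-, 60- and 70-base probes at random to 25 bases, repeating the draw three times and averaging) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the fixed seed and the placeholder scoring function are hypothetical, and the actual BLAST-based X_HYBRID and EST_NUMBER calculations are not reproduced here.

```python
import random


def random_subprobe(probe: str, length: int = 25, rng=random) -> str:
    """Return a random contiguous window of `length` bases from `probe`."""
    if len(probe) < length:
        raise ValueError("probe is shorter than the requested length")
    start = rng.randrange(len(probe) - length + 1)
    return probe[start:start + length]


def averaged_score(probe: str, score, draws: int = 3, seed: int = 0) -> float:
    """Average a scoring function over several random 25-base windows.

    `score` stands in for the real BLAST-based calculation, which is
    not reproduced in this sketch.
    """
    rng = random.Random(seed)
    windows = [random_subprobe(probe, 25, rng) for _ in range(draws)]
    return sum(score(w) for w in windows) / draws


# Toy 60-base probe and a placeholder score (GC fraction), both invented.
probe = "ACGT" * 15
gc_fraction = lambda s: (s.count("G") + s.count("C")) / len(s)
print(averaged_score(probe, gc_fraction))
```

Drawing the window at random, rather than always trimming from one end, avoids systematically favouring the 5′ or 3′ part of the longer probes when they are compared with the 25mers.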

Figure 4. Distribution of the probes according to their DIST_TO_3′ score. (A) human. (B) mouse. More than 90% of probes for the human sets and 98% of probes for the mouse sets are located within 1,500 bases from the 3′-end of target mRNAs. Legend indicates the average DIST_TO_3′ score for each collection. This comparison has been performed on a subset of 16,303 human and 13,073 mouse transcripts, common to all sets.

Figure 5. MEDIANTE screenshot of the summary data for transcript NM_001652. The different exons of each transcript are represented by dark and light blue boxes. The RNG/MRC probes are represented on the first line; the light green box indicates the ‘current optimal probe’. The red box indicates the RNG/MRC probe(s). The blue box indicates a probe selected for a local microarray production. Each set of probes is represented on a distinct line. Affymetrix probe sets are represented by their first and last 25-mer perfect match probes. Additional information about the transcript or probes, such as gene chromosomal location, probe specificity, etc., is provided as clickable links. Subforms provide information about Gene Ontology annotations, bibliographic references or tissue-specificity.

Comparison of hybridizations on RNG/MRC microarrays versus hybridizations on Affymetrix GeneChip. In order to directly evaluate expression profiles generated with the RNG/MRC microarrays, RNA derived from either HEK293 cells or a human cell line of keratinocytes (DK7) was analysed in parallel on the RNG/MRC platform and on an Affymetrix platform. Figure 6 shows the relationship between ratios established with Affymetrix arrays (x-axis) and ratios established with RNG/MRC arrays (y-axis). When considering only the genes for which the average intensity of the signal was above the 25th percentile for both platforms, the coefficient of correlation between the two measurements was equal to 0.807. It was equal to 0.700 when all data points were considered, and rose to 0.880 when considering the genes for which the average intensity of the signal was above the 75th percentile for both platforms.

Figure 6. Scatter plot of the ratios measured on Affymetrix GeneChip (x-axis) and on RNG/MRC microarrays (y-axis). RNA was derived from either HEK293 cells or a keratinocyte cell line (DK7). 11053 transcripts had at least one Affymetrix probe set and one RNG/MRC probe. Among them, 7054 pairs were further analyzed, as their intensity level was larger than the 25th percentile on both platforms. After quantification of the signals on both platforms, the ratio of the expression levels between the two cell lines was established. The coefficient of correlation was equal to 0.81.

Independent validation of the probes. More than 3500 RNG/MRC microarrays have already been distributed to more than 100 distinct projects. A total of 4666 probes, targeting 4522 distinct transcripts, have so far been confirmed by independent measurements. More precisely, a probe was validated when a ratio above 2 was detected in an experiment using an RNG/MRC microarray and when an independent measurement (quantitative RT–PCR, other microarray platforms, northern blots, protein detection or functional assays) led to a similar variation (i.e. ratio above 2 for RNA detection, or increased protein expression, or increased activity). Such probes are listed in Supplementary Table 1. Validated probes are also flagged in the MEDIANTE interface.

Validation of the EST_NUMBER and DIST_TO_3′ scores. For several transcripts, we selected distinct oligonucleotides to analyse the variations of the ratios and/or intensities between probes targeting the same transcript. In the experiment shown in Figure 6, 75% of such pairs of RNG/MRC probes were correlated (meaning that log2ratio[probe1] = log2ratio[probe2] ± 1). This suggests that our selection was indeed able to select probes with similar properties. We then anticipated that the 25% of the probes exhibiting divergent properties might shed some light on the relative importance of EST_NUMBER, distance to 3′ ends, or Tm (the latter being the most commonly accepted parameter for the selection of probes). In an independent mouse microarray experiment, we selected 34 such pairs of probes, characterized by at least a 2-fold variation in intensity. We wondered whether a positive difference in intensity could be attributed to: (i) a positive difference in EST_NUMBER, (ii) a negative difference in the DIST_TO_3′ and/or (iii) a positive difference in melting temperature of the probes. Figure 7 summarizes our results in a Venn diagram: 100% of such pairs of probes were correlated with either a variation in EST_NUMBER (24 transcripts), DIST_TO_3′ (30 transcripts) and/or Tm (14 transcripts). All information, as well as additional information about sequences, enthalpies and entropies of the probes, is available in Supplementary Table 2.

Figure 7. Analysis of 34 mouse transcripts targeted by 2 distinct RNG/MRC probes. Shown are probes with a variation in intensity greater than 2-fold. Each number corresponds to the number of transcripts for which fluorescence intensity varied along with EST_NUMBER, DIST_TO_3′ and/or Tm.

DISCUSSION

The last 5 years have seen a significant increase in the accessibility and diversification of microarray platforms for performing expression analyses (35–39). In particular, the range of species for which commercial arrays are now available and the number of probe features per microarray have expanded dramatically due to improved sequence resources and technological advances in microarray fabrication. As a result, applications of microarray analysis to many fields of basic and biomedical research have dramatically increased (40–42). However, the cost of commercial arrays is still prohibitive for many large academic projects.

To address the problem of accessing affordable arrays, a number of academic communities have established centralized facilities for microarray fabrication (43). Much of the early work of these centers relied on the use of cDNA clone sets for generating probe resources for microarray fabrication. Incomplete gene coverage, inaccurate gene annotation, contaminated or missing clones, and legal restrictions which were often associated with the use of cDNA libraries, made these resources non-optimal for microarray production. Furthermore, and perhaps most importantly, the large size of the DNA probe used for any given gene has a potential to cross-hybridize to other sequences, due to partial homology to other genes or the presence of repeat sequences. The acceptance and availability of oligonucleotide probe resources for spotted microarray fabrication has provided a powerful alternative to the use of cDNAs (44–53). The first aim of our project was to provide an open probe resource for the fabrication of cost-effective pangenomic microarrays to the Anglo-French communities. However, our work also addressed more general issues:

(i) The development of an open-access repository for human and mouse probes usable in gene expression studies provides a useful tool to compare the position of the probes from several sources. Knowledge and comparison of these probes is crucial in assessing probe specificity and in performing cross-platform comparisons. While this part of our project has some similarities with the Resourcerer (54) and Dragon projects (55), the MEDIANTE web interface provides graphical representations of all probes associated with a specific transcript (Figure 5). Although limited at the moment to RNG/MRC, Agilent, Illumina and Affymetrix probe collections in human and mouse, this visualization tool will be extended in the near future to include other platforms and other organisms.

(ii) Whilst access to the RNG/MRC probe libraries and arrays fabricated from them may be restricted to the French and UK communities, the MEDIANTE interface provides an open-access portal to a detailed description of the probe sets. It also allows any end-user to select a preferred set of probes, according to some specific knowledge. With a selection of probes for 148 993 human and 121 703 mouse transcripts currently precalculated in the database, MEDIANTE can easily be used to define probes against specific splice variants or subsets of genes. Thus the tool can be viewed as both a gene annotation tool and a convenient tool to create dedicated microarrays.

(iii) The storage of hybridization data in a centralized data warehouse, integrated within MEDIANTE, will allow the integration of both probe and experimental information. Validation tools will allow fast quality control of the data and easy generation of a MIAME-compliant export format (56). As such, it will facilitate the transmission by end-users of their curated data to public repositories, such as GEO (57) or ArrayExpress (58), based on the MAGE-ML language (59). In addition, the ongoing production of human and mouse pangenomic microarrays using the RNG/MRC probe sets by groups funded by either the French ‘Réseau National Genopoles’ (RNG, French Genopole Network) or the UK Medical Research Council means that many are now essentially using the same probe resources for their work. This will clearly facilitate the construction of homogeneous large datasets available for meta-analysis.

The quality scores used in the current study provide a convenient way to evaluate probe design and compare between different sets of probes. Following on from the initial design of the ‘full’ set of probes generated by the OligoArray program, our probe selection procedure allowed for further refinement of the probe collection.

The X_HYBRID score represents a simplified output of BLAST analyses and its use integrates BLAST analyses performed on three distinct databases (MEDIANTE, RefSeq, Ensembl). This index has several interesting characteristics. First, it varies along with the BLAST score (the lower the X_HYBRID score is, the more ‘BLAST-specific’ a probe is). Secondly, visualization through MEDIANTE, according to the representation shown in Figure 1, integrates results from our three reference databases, giving indications about variations of the annotations among databases.

A 2-fold enrichment in the number of ‘zero’ probes (i.e. probes having an X_HYBRID score equal to 0) was observed between the ‘ALL’ collection of probes generated by OligoArray and the selection of the optimal probe sets. In the same way, a 4-fold decrease in the number of probes having an X_HYBRID score greater than 2 was observed between the ‘ALL’ collection of probes generated by OligoArray and the selection of the optimal probe sets (see Figure 2B). This enrichment was similar in human and in mouse (data not shown). The presence of ‘low-specificity’ probes in the full collections of probes may appear surprising, since OligoArray supposedly rejects non-specific probe sequences. However, the presence of these probes may be due to: (i) an absence of high quality probes available for a given sequence, since OligoArray provides its ‘best’ available candidates, even though none may be entirely specific, (ii) the presence of new transcripts in more recent releases of RefSeq or Ensembl, which increases the X_HYBRID score of the probe, or (iii) the correction of previous versions of mRNA sequence due to sequencing error. In our hands, the X_HYBRID score allows a dynamic re-evaluation of a set of probes every time a new release of sequences is available. This makes possible the use of a collection of probes over numerous iterations (Supplementary Figure 2). This approach differs from the approach developed with programs such as OligoArray, where a re-calculation of new probes can lead to considerable disparity with older versions of a design. From that perspective, we consider that our approach simplifies the life cycle of a probe collection. The X_HYBRID score was particularly helpful for comparing sets from several distinct microarray platforms. In human as well as in mouse, the RNG/MRC and Affymetrix probe sets were always associated with lower X_HYBRID scores than Agilent or Illumina probe sets.

The second criterion used to select probes corresponded to the EST_NUMBER. It was initially set up in order to avoid the selection of probes specific to rare or poorly expressed splice variants. As a direct count of the number of ESTs associated with each probe, this index provided an easy measure of the relative sensitivity of each probe to detect its target. Comparison of the EST_NUMBER scores of RNG/MRC sets with commercial collections was also favourable to the RNG/MRC collections, in terms of the average number of ESTs identified per probe, and in terms of the number of probes matching no ESTs (see Figure 3). With the availability of additional information about the tissue of origin of each EST, it was possible to divide the EST_NUMBER into 26 categories according to the libraries from which the ESTs were sequenced (Brain, Eye, Heart, Muscle, Pancreas, Liver, Stomach, GI tract, Kidney, Bladder, Testis, Prostate, Respiratory, Otorhinolaryngology, Skin, Immune, Bone, Breast, Uterus, Ovary, Placenta, Stem cell, Embryo, Fibroblast, Adipose tissue and Cancer). The relative number of ESTs in each category therefore provided an estimate of the abundance of the transcripts in the corresponding tissue, category or cell type. These values were used to draw Supplementary Figure 3.
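The per-category breakdown of EST_NUMBER described above can be illustrated with a short sketch. It assumes, purely for the example, that each EST matching a probe carries a single library category label and that the ‘relative number’ is the share of a probe's matching ESTs falling in each category; the function name and the toy data are hypothetical.

```python
from collections import Counter


def relative_abundance(est_categories):
    """Turn per-EST category labels for one probe into category fractions.

    `est_categories` lists the library category of every EST matching
    the probe (hypothetical input); EST_NUMBER is the length of the list.
    """
    counts = Counter(est_categories)
    total = sum(counts.values())
    if total == 0:
        return {}  # probe matching no EST
    return {category: n / total for category, n in counts.items()}


# Toy example: 8 matching ESTs for one invented probe.
labels = ["Liver"] * 6 + ["Kidney", "Brain"]
fractions = relative_abundance(labels)
print(fractions)
```

In this toy case three quarters of the matching ESTs come from liver libraries, which would mark the probe as a candidate liver-enriched transcript in an ‘electronic northern’ view.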

Supplementary Figure 1 indicates that EST_NUMBER might be linked to some extent to DIST_TO_3′. Our selection strategy indeed took into account the few cases where EST_NUMBER scores are not directly related to DIST_TO_3′.

The fact that 75% of pairs of RNG/MRC probes targeting the same transcript (Figure 6) share similar levels of expression is a good indication of the quality of our design. Besides, the analysis of the 25% remaining pairs of probes showed that differences can be correlated with differences in DIST_TO_3′, EST_NUMBER or Tm. This argues in favour of the use of these parameters.

We noticed a convergence over time of the design of an optimal collection of oligonucleotide probes (Supplementary Figure 2). This probably corresponds to the overall reduction in the number of new/novel sequences for each organism over time. This point is suggested by the fact that no significant changes were made to our process over the period of re-evaluations. We anticipate that further iterations will provide marginal improvements to the optimal probe sets already available for studying human and mouse transcriptomes.

The in silico validation studies presented here (Figures 2–4) of the RNG/MRC probe collections suggest that they compare well with three commercial platforms. However, more extensive experimental analyses of these observations will be required to confirm this.

A first experimental validation of RNG/MRC microarrays focused on the identification of clusters of genes specific to at least one of four distinct tissues or cell types (leucocytes, nasal epithelial cells, keratinocytes or liver) (Supplementary Figure 3). Several tissue-specific transcripts were identified after hybridization, which correspond to classical markers of these tissues: this was the case of HLA molecules in immune cells, of keratins in keratinocytes, or of albumin in liver. The similar patterns revealed by heat maps representing hybridization data or EST_NUMBER scores (Supplementary Figure 3) support the idea that we correctly identified tissue-specific traits. These observations suggest good overall agreement between experimental results provided by hybridization of the RNG/MRC arrays and the annotation of the probes. Several additional validations have been provided elsewhere for specific sets of probes that were identified after using the RNG/MRC microarrays (9,60,61). Additional results are summarized in Supplementary Table 1, where validated probes are indicated.

A second, and in our opinion more definitive, demonstration of the quality of the collections was provided after comparing two distinct RNAs on the RNG/MRC human microarray and on the Affymetrix GeneChip. The high coefficient of correlation (>0.8) observed between the two distinct comparisons represents a definitive demonstration of the quality of our design. More elaborate experimental designs, such as those described in Barnes et al. (5) or in de Reynies et al. (62), may in the future help to define more precisely sets of probes providing highly reproducible results.

Whilst it is difficult to anticipate the future of the microarray field (especially the role that will be played by academic facilities in array fabrication), genome-wide analysis of the human and mouse transcriptomes is now almost a routine procedure in an increasing number of laboratories. The need for comprehensive microarrays covering all known human and mouse genes, composed of homogeneous sets of probes, has never been greater. As the technology arrives at this point in maturity, the development of additional properties, for instance in order to discriminate splice variants, will require new efforts. The current work is a contribution to this quest, and represents, to our knowledge, the first report integrating probe design, microarray fabrication and experimental validation.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

This work was performed thanks to support from the Réseau National des Genopoles, the GIP Aventis, the French Ministry of Industry (réseau GenHomme), the French Association Vaincre la Mucoviscidose, the INCA, the CNRS, and the UK Medical Research Council. J.M.R. and E.G. were supported in part by a grant from NIH (1R01GM068564). Microarray experiments were carried out using the facilities of the Nice-Sophia Antipolis, Strasbourg and Evry Transcriptome Platforms of the Genopole. The authors are grateful to Dr Alexandra Louis (Infobiogen, Evry, France) for help with LASSAP, to Dr Jérôme Aubert (Galderma Research and Development, Sophia Antipolis, France) for sharing unpublished information, and to Franck Aguila for artwork. Funding to pay the Open Access publication charges for this article was provided by CNRS.

Conflict of interest statement. None declared.

REFERENCES

1. Fodor,S.P., Read,J.L., Pirrung,M.C., Stryer,L., Lu,A.T. and Solas,D. (1991) Light-directed, spatially addressable parallel chemical synthesis. Science, 251, 767–773.
2. Nuwaysir,E.F., Huang,W., Albert,T.J., Singh,J., Nuwaysir,K., Pitas,A., Richmond,T., Gorski,T., Berg,J.P., Ballin,J. et al. (2002) Gene expression analysis using oligonucleotide arrays produced by maskless photolithography. Genome Res., 12, 1749–1755.
3. Kronick,M.N. (2004) Creation of the whole microarray. Expert. Rev. Proteomics, 1, 19–28.
4. Ramakrishnan,R., Dorris,D., Lublinsky,A., Nguyen,A., Domanus,M., Prokhorova,A., Gieser,L., Touma,E., Lockner,R., Tata,M. et al. (2002) An assessment of Motorola CodeLink microarray performance for gene expression profiling applications. Nucleic Acids Res., 30, e30.
5. Barnes,M., Freudenberg,J., Thompson,S., Aronow,B. and Pavlidis,P. (2005) Experimental comparison and cross-validation of the Affymetrix and Illumina gene expression analysis platforms. Nucleic Acids Res., 33, 5914–5923.
6. Brown,P.O. and Botstein,D. (1999) Exploring the new world of the genome with DNA microarrays. Nature Genet., 21, 33–37.
7. Handley,D., Serban,N., Peters,D., O’Doherty,R., Field,M., Wasserman,L., Spirtes,P., Scheines,R. and Glymour,C. (2004) Evidence of systematic expressed sequence tag IMAGE clone cross-hybridization on cDNA microarrays. Genomics, 83, 1169–1175.
8. Dayem,M.A., Moreilhon,C., Turchi,L., Magnone,V., Christen,R., Ponzio,G. and Barbry,P. (2003) Early gene expression in wounded human keratinocytes revealed by DNA microarray analysis. Comp. Funct. Genom., 4, 460–467.

9. Moreilhon,C., Gras,D., Hologne,C., Bajolet,O., Cottrez,F., Magnone,V., Merten,M., Groux,H., Puchelle,E. and Barbry,P. (2005) Live Staphylococcus aureus and bacterial soluble factors induce different transcriptional responses in human airway cells. Physiol. Genomics, 20, 244–255.
10. Postier,B.L., Wang,H.L., Singh,A., Impson,L., Andrews,H.L., Klahn,J., Li,H., Risinger,G., Pesta,D., Deyholos,M. et al. (2003) The construction and use of bacterial DNA microarrays based on an optimized two-stage PCR strategy. BMC Genomics, 4, 23.
11. Hughes,T.R., Mao,M., Jones,A.R., Burchard,J., Marton,M.J., Shannon,K.W., Lefkowitz,S.M., Ziman,M., Schelter,J.M., Meyer,M.R. et al. (2001) Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat. Biotechnol., 19, 342–347.
12. Kane,M.D., Jatkoe,T.A., Stumpf,C.R., Lu,J., Thomas,J.D. and Madore,S.J. (2000) Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res., 28, 4552–4557.
13. Hu,J., Kapoor,M., Zhang,W., Hamilton,S.R. and Coombes,K.R. (2005) Analysis of dose-response effects on gene expression data with comparison of two microarray platforms. Bioinformatics, 21, 3524–3529.
14. Irizarry,R.A., Warren,D., Spencer,F., Kim,I.F., Biswal,S., Frank,B.C., Gabrielson,E., Garcia,J.G., Geoghegan,J., Germino,G. et al. (2005) Multiple-laboratory comparison of microarray platforms. Nature Meth., 2, 345–350.
15. Larkin,J.E., Frank,B.C., Gavras,H., Sultana,R. and Quackenbush,J. (2005) Independence and reproducibility across microarray platforms. Nature Meth., 2, 337–344.
16. Park,P.J., Cao,Y.A., Lee,S.Y., Kim,J.W., Chang,M.S., Hart,R. and Choi,S. (2004) Current issues for DNA microarrays: platform comparison, double linear amplification, and universal RNA reference. J. Biotechnol., 112, 225–245.
17. Petersen,D., Chandramouli,G.V., Geoghegan,J., Hilburn,J., Paarlberg,J., Kim,C.H., Munroe,D., Gangi,L., Han,J., Puri,R. et al. (2005) Three microarray platforms: an analysis of their concordance in profiling gene expression. BMC Genomics, 6, 63.
18. Schlingemann,J., Habtemichael,N., Ittrich,C., Toedt,G., Kramer,H., Hambek,M., Knecht,R., Lichter,P., Stauber,R. and Hahn,M. (2005) Patient-based cross-platform comparison of oligonucleotide microarray expression profiles. Lab. Invest., 85, 1024–1039.
19. Tan,P.K., Downey,T.J., Spitznagel,E.L.,Jr, Xu,P., Fu,D., Dimitrov,D.S., Lempicki,R.A., Raaka,B.M. and Cam,M.C. (2003) Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res., 31, 5676–5684.
20. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
21. Kent,W.J. (2002) BLAT—the BLAST-like alignment tool. Genome Res., 12, 656–664.
22. Rouillard,J.M., Herbert,C.J. and Zuker,M. (2002) OligoArray: genome-scale oligonucleotide design for microarrays. Bioinformatics, 18, 486–487.
23. Rouillard,J.M., Zuker,M. and Gulari,E. (2003) OligoArray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach. Nucleic Acids Res., 31, 3057–3062.
24. Zuker,M. (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res., 31, 3406–3415.
25. Legendre,M. and Gautheret,D. (2003) Sequence determinants in human polyadenylation site selection. BMC Genomics, 4, 7.
26. Glemet,E. and Codani,J.J. (1997) LASSAP, a LArge Scale Sequence compArison Package. Comput. Appl. Biosci., 13, 137–143.
27. Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) dbEST—database for ‘expressed sequence tags’. Nature Genet., 4, 332–333.
28. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet., 25, 25–29.
29. Harris,M.A., Clark,J., Ireland,A., Lomax,J., Ashburner,M., Foulger,R., Eilbeck,K., Lewis,S., Marshall,B., Mungall,C. et al. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res., 32, D258–D261.
30. Pearson,W.R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Meth. Enzymol., 183, 63–98.
31. Didier,G., Brezellec,P., Remy,E. and Henaut,A. (2002) GeneANOVA—gene expression analysis of variance. Bioinformatics, 18, 490–491.
32. Durinck,S., Moreau,Y., Kasprzyk,A., Davis,S., De Moor,B., Brazma,A. and Huber,W. (2005) BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics, 21, 3439–3440.
33. Gentleman,R.C., Carey,V.J., Bates,D.M., Bolstad,B., Dettling,M., Dudoit,S., Ellis,B., Gautier,L., Ge,Y., Gentry,J. et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome. Biol., 5, R80.
34. Saeed,A.I., Sharov,V., White,J., Li,J., Liang,W., Bhagabati,N., Braisted,J., Klapa,M., Currier,T., Thiagarajan,M. et al. (2003) TM4: a free, open-source system for microarray data management and analysis. Biotechniques, 34, 374–378.
35. Charbonnier,Y., Gettler,B., Francois,P., Bento,M., Renzoni,A., Vaudaux,P., Schlegel,W. and Schrenzel,J. (2005) A generic approach for the design of whole-genome oligoarrays, validated for genomotyping, deletion mapping and gene expression analysis on Staphylococcus aureus. BMC Genomics, 6, 95.
36. Fitzpatrick,J.M., Johnston,D.A., Williams,G.W., Williams,D.J., Freeman,T.C., Dunne,D.W. and Hoffmann,K.F. (2005) An oligonucleotide microarray for transcriptome analysis of Schistosoma mansoni and its application/use to investigate gender-associated gene expression. Mol. Biochem. Parasitol., 141, 1–13.
37. Lyons,P. (2003) Advances in spotted microarray resources for expression profiling. Brief. Funct. Genomic. Proteomic., 2, 21–30.
38. Pylatuik,J.D. and Fobert,P.R. (2005) Comparison of transcript profiling on Arabidopsis microarray platform technologies. Plant. Mol. Biol., 58, 609–624.
39. Talla,E., Tekaia,F., Brino,L. and Dujon,B. (2003) A novel design of whole-genome microarray probes for Saccharomyces cerevisiae which minimizes cross-hybridization. BMC Genomics, 4, 38.
40. Brennan,C., Zhang,Y., Leo,C., Feng,B., Cauwels,C., Aguirre,A.J., Kim,M., Protopopov,A. and Chin,L. (2004) High-resolution global profiling of genomic alterations with long oligonucleotide microarray. Cancer Res., 64, 4744–4748.
41. Chaudhuri,J.D. (2005) Genes arrayed out for you: the amazing world of microarrays. Med. Sci. Monit., 11, RA52–RA62.
42. Ewis,A.A., Zhelev,Z., Bakalova,R., Fukuoka,S., Shinohara,Y., Ishikawa,M. and Baba,Y. (2005) A history of microarrays in biomedicine. Expert. Rev. Mol. Diagn., 5, 315–328.
43. Allemeersch,J., Durinck,S., Vanderhaeghen,R., Alard,P., Maes,R., Seeuws,K., Bogaert,T., Coddens,K., Deschouwer,K., Van Hummelen,P. et al. (2005) Benchmarking the CATMA microarray. A novel tool for Arabidopsis transcriptome analysis. Plant Physiol., 137, 588–601.
44. Chen,H. and Sharp,B.M. (2002) Oliz, a suite of Perl scripts that assist in the design of microarrays using 50mer oligonucleotides from the 3′ untranslated region. BMC Bioinformatics, 3, 27.
45. Chou,H.H., Hsia,A.P., Mooney,D.L. and Schnable,P.S. (2004) Picky: oligo microarray design for large genomes. Bioinformatics, 20, 2893–2902.
46. Giddings,M.C., Matveeva,O.V., Atkins,J.F. and Gesteland,R.F. (2000) ODNBase--a web database for antisense oligonucleotide effectiveness studies. Oligodeoxynucleotides. Bioinformatics, 16, 843–844.
47. Gordon,P.M. and Sensen,C.W. (2004) Osprey: a comprehensive tool employing novel methods for the design of oligonucleotides for DNA sequencing and microarrays. Nucleic Acids Res., 32, e133.
48. Herold,K.E. and Rasooly,A. (2003) Oligo Design: a computer program for development of probes for oligonucleotide microarrays. Biotechniques, 35, 1216–1221.
49. Mrowka,R., Schuchhardt,J. and Gille,C. (2002) Oligodb--interactive design of oligo DNA for transcription profiling of human genes. Bioinformatics, 18, 1686–1687.
50. Nordberg,E.K. (2005) YODA: selecting signature oligonucleotides. Bioinformatics, 21, 1365–1370.
51. Reymond,N., Charles,H., Duret,L., Calevro,F., Beslon,G. and Fayard,J.M. (2004) ROSO: optimizing oligonucleotide probes for microarrays. Bioinformatics, 20, 271–273.
52. Rimour,S., Hill,D., Militon,C. and Peyret,P. (2005) GoArrays: highly dynamic and efficient microarray probe design. Bioinformatics, 21, 1094–1103.

53. Wernersson,R. and Nielsen,H.B. (2005) OligoWiz 2.0--integrating microarray gene expression data at the EBI. Nucleic Acids Res., sequence feature annotation into the design of microarray probes. 33, D553–D555. Nucleic Acids Res., 33, W611–W615. 59. Spellman,P.T., Miller,M., Stewart,J., Troup,C., Sarkans,U., Chervitz,S., 54. Tsai,J., Sultana,R., Lee,Y., Pertea,G., Karamycheva,S., Antonescu,V., Bernhart,D., Sherlock,G., Ball,C., Lepage,M. et al. (2002) Design and Cho,J., Parvizi,B., Cheung,F. and Quackenbush,J. (2001) RESOUR- implementation of microarray gene expression markup language CERER: a database for annotating and linking microarray resources (MAGE-ML). Genome Biol., 3:research0046.1-0046.9. within and across species. Genome Biol., 2:software0002.1-0002.4. 60. Frontini,M., Soutoglou,E., Argentini,M., Bole-Feysot,C., Jost,B., 55. Bouton,C.M. and Pevsner,J. (2000) DRAGON: database referencing of Scheer,E. and Tora,L. (2005) TAF9b (formerly TAF9L) is a bona fide array genes online. Bioinformatics, 16, 1038–1039. TAF that has unique and overlapping roles with TAF9. Mol. Cell Biol., 56. Brazma,A., Hingamp,P., Quackenbush,J., Sherlock,G., Spellman,P., 25, 4638–4649. Stoeckert,C., Aach,J., Ansorge,W., Ball,C.A., Causton,H.C. et al. 61. Hofman,V.M.C., Brest,P., Tripault,F., Le Brigand,K., Selva,E., (2001) Minimum information about a microarray experiment (MIAME)- Sicard,D., Raymond,J., Lamarque,D., Mari,B., He´buterne,X. et al. toward standards for microarray data. Nature Genet., 29, 365–371. (2005) Gene expression profiling in human gastric mucosa infected 57. Edgar,R., Domrachev,M. and Lash,A.E. (2002) Gene Expression with Helicobacter pylori: correlation with gastritis activity, Omnibus: NCBI gene expression and hybridization array data bacterial density and virulence factors. Submitted. repository. Nucleic Acids Res., 30, 207–210. 62. de Reynies,A., Geromin,D., Cayuela,J.M., Petel,F., Dessen,P., 58. 
Parkinson,H., Sarkans,U., Shojatalab,M., Abeygunawardena,N., Sigaux,F. and Rickman,D.S. (2006) Comparison of the latest Contrino,S., Coulson,R., Farne,A., Lara,G.G., Holloway,E., commercial short and long oligonucleotide microarray technologies. Kapushesky,M. et al. (2005) ArrayExpress--a public repository for BMC. Genomics, 7, 51.