Order No.: 00000
Université d’Evry-Val d’Essonne
THESIS
Submitted for the degree of Doctor in Computer Science, Specialty: Bioinformatics
by
Ana Carolina Elisa FIERRO GUTIERREZ
Doctoral School: Des Génomes aux Organismes
Exploiting sequence and DNA microarray data to study the transcriptome
To be defended on 20 November 2007 before a jury composed of:
Tijl De Bie, Reviewer
Philippe Dessen, Reviewer
Pascal Barbry, Examiner
Kathleen Marchal, Examiner
Maurice Wegnez, Examiner
Gilles Bernot, Thesis co-supervisor
Nicolas Pollet, Thesis co-supervisor
Acknowledgements
Introduction
The genome of a species is a "fundamental invariant" that carries information transmitted from generation to generation, ensuring the production of the biological macromolecules required for the physiology of organisms. From a molecular point of view, genome expression proceeds through the transcription of DNA into RNAs, obligatory intermediates that allow the synthesis of biological macromolecules. Consequently, measuring the number and type of RNA molecules provides clues about the function of the corresponding genes. Over the last two decades, various techniques have been developed to perform such measurements at the genome scale, because the transcriptomic approach is simpler than studying gene expression directly through protein quantification.
However, the transcriptome has a level of complexity that was underestimated when this type of study began. Massive cDNA sequencing experiments and high-throughput transcriptome studies have revealed this complexity, which includes alternative splicing, non-coding RNAs and microRNAs, among others. In addition, DNA microarrays have contributed substantially to functional genomics because they have made it possible to acquire numerous measurements that provide clues about gene expression. Today they are one of the preferred means for "high-throughput" studies.
The objective of this work was to analyze the transcriptome in the biological context of Xenopus tropicalis metamorphosis, in order to fill existing gaps in information by making optimal use of the available resources. This thesis is thus an illustration of applied bioinformatics and shows several aspects of contemporary methodologies.
Xenopus tropicalis has become a model organism for amphibian genomics because its genome is simpler than that of the classical model Xenopus laevis. At the start of this thesis resources were still very limited, but in recent years more than one million EST sequences have been made available and the genome of this organism is being sequenced. However, the nervous system and metamorphosis remain to be explored, which is why two techniques for analyzing this transcriptome were used: partial cDNA sequencing (ESTs), which is sequence-based, and DNA microarrays, which are hybridization-based.
The first chapter introduces the techniques most widely used for the large-scale study of the transcriptome. cDNA sequencing and DNA microarrays are described in more detail in order to present how the experimental data are obtained. The aim of this chapter is to describe the concepts needed to understand the analyses described later.
The second chapter presents the usefulness of cDNA sequences for analyzing the Xenopus tropicalis transcriptome. More specifically, the nervous system during embryogenesis and metamorphosis is explored in order to produce a resource for the functional genomics of this organism. This chapter describes the biological analysis (article published in BMC Genomics) as well as the web resource developed and the steps followed to process the data.
The third chapter shows the use of DNA microarrays to study gene expression during X. tropicalis metamorphosis. We describe the bioinformatic challenges encountered during this study as well as the biological problem.
The fourth chapter deals with the use of different experimental microarray strategies and the reconstruction of expression profiles. The potential advantages of using alternative strategies depend largely on the success of profile reconstruction. It is therefore necessary to evaluate microarray analysis methods in order to determine which approach is best. The study conducted to compare methods for reconstructing expression profiles from complex experimental designs is presented (article submitted to BMC Bioinformatics).
The last chapter offers a review of ESTs, DNA microarrays and the state of the art in integrating data from these techniques. Public databases pave the way for integration, in order to make optimal use of the transcriptional resources that a research laboratory can obtain. The chapter describes three routes to integration: integration of data produced by several laboratories or research groups, integration of data across multiple organisms, and integration with a variety of other sources, such as ChIP-on-chip data, protein interaction studies, motif searches, etc.
Trained as a civil engineer specialized in computer science, with a degree from the Universidad de Chile (Chile), I came to study in France as part of the DEA « Applications des Mathématiques et de l'Informatique à la Biologie » (Génopole). After this training, I chose to continue my studies and undertake a doctoral thesis. This work was designed to be multidisciplinary, with joint supervision in computer science and biology. The work was carried out in the laboratory of the Programme d’Epigénomique in Evry (G. Bernot) for the computer science part and in the Développement et Evolution laboratory in Orsay (N. Pollet), where I acquired the biological knowledge and learned about computational needs from the biologist's side. This laboratory also provided the data analyzed in this thesis. Finally, an internship in Kathleen Marchal's bioinformatics team, at the Department of Microbial and Molecular Sciences of the Katholieke Universiteit Leuven (Belgium), allowed me to deepen my knowledge of microarray analysis methods and to carry out the comparative study presented in this thesis.
Table of contents
CHAPTER 1: LARGE-SCALE TECHNIQUES TO EXPLORE THE TRANSCRIPTOME
1.1 Large-scale techniques
1.2 Partial cDNA sequencing: Expressed Sequence Tags (ESTs)
1.3 Microarray technology
    1.3.1 Two-channel arrays
    1.3.2 Single-channel arrays
    1.3.3 Experimental design
    1.3.4 Microarray analysis
CHAPTER 2: EXPLORING THE TRANSCRIPTOMES USING ESTS
2.1 Exploring the nervous system transcriptomes during embryogenesis and metamorphosis in Xenopus tropicalis using EST analysis
2.2 XTScope: Xenopus tropicalis EST, a web resource for the nervous system
    Database content and data production
    Implementation and Architecture
    Web interface
    Extensions
    Conclusion
CHAPTER 3: MICROARRAYS TO STUDY THE TRANSCRIPTOME
3.1 Bioinformatic issues
    Experimental design
    Analysis steps for the microarray experiment
3.2 Xenopus tropicalis metamorphosis transcriptomes analysis using microarrays
CHAPTER 4: EVALUATION OF TIME PROFILE RECONSTRUCTION FROM COMPLEX TWO-COLOR MICROARRAY DESIGNS
CHAPTER 5: FROM GENE EXPRESSION PROFILES TOWARDS DATA INTEGRATION
5.1 Background
5.2 Gene expression profiles from transcriptomic experiments
    Gene expression profiles from microarrays
    Gene expression profiles from ESTs
    ESTs versus Microarrays
5.3 Data Integration
    Data integration within the same organism
    Data integration across species
    Integration across different data sources
CONCLUSION AND PERSPECTIVES
REFERENCES
Chapter 1: Large-scale techniques to explore the transcriptome
The aim of this chapter is to describe the techniques most commonly used to study the transcriptome. Partial cDNA sequencing and DNA microarray technology are described in more detail, because the experimental data analyzed in this thesis come from these two techniques. The basic principles and technical aspects are covered in the chapter.
1.1 Large-scale techniques
A whole range of techniques enables the measurement of transcript levels on a genome-wide scale. Whereas microarray analysis is hybridization-based, others are sequence- and/or fragment-based. ESTs, SAGE, and MPSS are examples of sequence-based techniques, whereas cDNA-AFLP is fragment-based (for expression level) and sequence-based (for identity).
In the microarray technique (DeRisi et al., 1996; Shalon et al., 1996), labelled cDNA targets representing the mRNA population of interest are hybridized with a large number of probes that have been immobilized on a substrate such as a glass, plastic or silicon chip. The probes form an array used for expression profiling, that is, monitoring expression levels for thousands of transcripts simultaneously. Measuring gene expression using microarrays is relevant to many areas of biology and medicine, such as studying treatments, disease, and developmental stages. For example, microarrays can be used to identify disease genes by comparing gene expression in diseased and normal cells.
Gene expression measurement by high-throughput partial sequencing of cDNAs, also called Expressed Sequence Tags or ESTs (Adams et al., 1991; Okubo and Matsubara, 1997), involves counting the ESTs that are sequenced per transcript. As it does not rely on previous sequence information, it has mainly been valuable as a technique for “gene discovery”.
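The counting principle can be sketched in a few lines. This is only an illustration under stated assumptions, not the pipeline used in this work: the transcript identifiers and the EST assignments are hypothetical, and scaling the counts to 1,000 sequenced ESTs is an assumed convention that makes libraries of different sizes comparable.

```python
from collections import Counter

def est_frequencies(assignments, per=1000):
    """Count ESTs per transcript and scale to `per` sequenced ESTs.

    `assignments` holds one transcript identifier per EST
    (hypothetical identifiers; in practice they come from assembly).
    """
    counts = Counter(assignments)
    total = len(assignments)
    return {t: per * n / total for t, n in counts.items()}

# Hypothetical library: 8 ESTs assigned to 3 transcripts.
library = ["tA", "tA", "tB", "tA", "tC", "tA", "tB", "tA"]
freqs = est_frequencies(library)
print(freqs)
```

With so few ESTs the estimates are very coarse, which is why lowly expressed transcripts require deep sequencing before their counts become informative.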
Serial Analysis of Gene Expression or SAGE (Velculescu et al., 1995) reduces the DNA sequencing effort by sequencing concatenated tags derived from transcripts. SAGE is based on counting sequence tags of 14 bp from cDNA libraries. Unlike EST sequencing, SAGE requires that the genome sequence of the organism or a substantial cDNA sequence database is available in order to identify the corresponding genes. To facilitate target identification, the LongSAGE method was developed by Saha et al. (2002). LongSAGE generates 21 bp tags, which allow unique assignment of tags to genomic sequences. However, as for the EST-based approach, quantification of lowly expressed genes requires sequencing of a large number of tags, which implies a high cost.
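A rough calculation illustrates why longer tags can be assigned uniquely. The sketch below assumes a random genome with equal base frequencies, considers one strand only, and uses a hypothetical genome size of 1.5 Gb; none of these values comes from the cited studies.

```python
def expected_random_matches(tag_length, genome_size):
    """Expected number of chance occurrences of one given tag in a
    random genome of `genome_size` bp with equal base frequencies."""
    return genome_size / 4 ** tag_length

GENOME_SIZE = 1_500_000_000  # hypothetical genome size in bp

for k in (14, 21):
    print(k, 4 ** k, expected_random_matches(k, GENOME_SIZE))
```

Under these assumptions a 14 bp tag is expected to occur several times by chance (roughly 5.6 times), whereas a 21 bp tag is expected far less than once (about 0.0003 times), so a genomic match is much more likely to be the true origin of the tag.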
Massively Parallel Signature Sequencing or MPSS (Brenner et al., 2000) improves on SAGE, as it is a parallel sequencing method that can generate 100-1000 short sequence signatures in a single analysis. It also generates longer (16-20 bp) signatures to make gene identification more accurate. However, this method is technically demanding.
The cDNA-AFLP (Bachem et al., 1996) technique applies the standard Amplified Fragment Length Polymorphism or AFLP (Vos et al., 1995) protocol, as described for genomic DNA, to a cDNA template. This low-cost procedure involves cleavage of the cDNA population by two restriction enzymes, followed by adaptor ligation to these fragments to allow for PCR amplification. The amplified fragments are then presented as a banding pattern on a sequencing gel. The differences in the intensity of the bands provide a good measure of the relative differences in the level of gene expression. cDNA-AFLP does not require prior sequence information. Furthermore, the sensitivity and specificity of the method allow the detection of poorly expressed genes and the determination of subtle differences in transcriptional activity. However, cDNA-AFLP needs a large number of PCR reactions to generate a global overview of gene expression.
Of all the techniques described, microarrays are the most commonly used to study gene expression in living organisms. Microarrays have become an important tool for biological studies, as shown by the exponential growth in microarray-related publications. However, the identification of ESTs has also proceeded rapidly, with approximately 37 million ESTs now available in public databases (e.g. GenBank 7/2006), offering a rich data source for gene expression profiling. In many cases, in particular when the genome sequence of an organism is unknown, EST sequencing is used before microarray technology, since probe design requires transcript sequences to be known in advance.
1.2 Partial cDNA sequencing: Expressed Sequence Tags (ESTs)
Expressed Sequence Tags (ESTs) are DNA sequences obtained by sequencing the 5’ and/or 3’ ends of randomly isolated cloned cDNAs representing gene transcripts, i.e. they correspond to a portion of an entire transcript. This section describes the process from obtaining the mRNA samples, through the sequencing step, to the reconstruction of the mRNA sequences.
Using mRNA to generate cDNA libraries
Isolating mRNA from specific tissues or the whole organism is key to finding expressed genes in the vast expanse of the genome. The problem, however, is that mRNA is unstable outside of a cell. Moreover, there is no way to sequence RNA molecules directly in a manner analogous to the sequencing of DNA molecules. Therefore, an enzyme called reverse transcriptase is needed to convert mRNA to complementary DNA (cDNA). cDNA is a much more stable compound and, importantly, because it was generated from an mRNA in which the introns have been removed, cDNA represents only expressed DNA sequence. However, the reverse transcriptase will eventually fall off the template, which terminates the production of the cDNA and generates fragments that represent portions of the original mRNA. A cDNA library is a collection of cDNA molecules generated from the mRNAs contained within a cell or tissue. The complexity of a cDNA library can be estimated from the number of cDNA clones obtained. This complexity can range from tens of thousands to more than a million cDNA clones. Usually, several hundred to several thousand clones are isolated at random and tag sequenced from a given cDNA library.
The first step is the isolation and purification of mRNA from a tissue sample. Most often, those RNA molecules with a polyA tail are isolated. A polyT primer is used to start the reverse transcription (see Figure 1). Then, the cDNA molecules are stored in a vector, which is a circular piece of DNA used to stably propagate cDNA clones within a host (typically bacteria).
Figure 1: Construction of a cDNA library from a tissue sample. mRNA is isolated and purified. Then a poly(T) primer is used to make cDNA copies.
cDNA sequencing: Expressed Sequence Tags
A single sequencing read is obtained from a given clone, from one or both ends of the cDNA insert, using universal primers which are complementary to the vector at the multiple cloning site. Sequence data in the form of trace files are produced by automatic sequencers. The traces are usually displayed in the form of chromatograms consisting of four curves of different colors, each curve representing the signal for one of the four bases (see Figure 2).
Figure 2: The raw data in a chromatogram file can be viewed as four sets of overlapping peaks, one each for the A, C, G and T sequencing reactions. Each curve represents a base, and for each position the corresponding nucleotide needs to be inferred from the peaks.
An idealized trace would consist of evenly spaced, nonoverlapping peaks, each corresponding to the labeled fragments that terminate at a particular base in the sequenced strand. Real traces, however, deviate from this ideal for a variety of reasons. The corresponding nucleotides can be determined by the base-calling step.
Base calling and preprocessing
The purpose of base calling is to determine the nucleotide sequence on the basis of multi-color peaks in the sequence trace (see Figure 2), i.e. the processed trace is translated into a sequence of bases. As mentioned before, traces (and regions within a trace) are of variable quality, and therefore the fidelity of "called" nucleotides is also variable. This accuracy for each called base is measured by what are called base quality values. Base calling programs, such as Phred (Ewing et al., 1998a; Ewing et al., 1998b), provide these base quality values to help realistically evaluate sequence accuracy.
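Base quality values of the Phred type are conventionally defined as Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The following sketch, with function names chosen here for illustration, converts between the two scales.

```python
import math

def error_probability(q):
    """Phred quality value -> estimated probability that the call is wrong."""
    return 10 ** (-q / 10)

def quality_value(p):
    """Estimated error probability -> Phred quality value."""
    return -10 * math.log10(p)

for q in (10, 20, 30):
    print(q, error_probability(q))
```

A quality value of 20 thus corresponds to an estimated error rate of one base in a hundred, and a value of 30 to one base in a thousand.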
The “called” sequences are subject to errors caused by compressions and base calling problems resulting in frameshifts or wrong calls. Sequences have regions of high quality very close to regions of low quality. For these reasons, the sequence quality needs to be assessed before any further analysis step such as EST assembly.
EST data are generally considered to be more sensitive to sequencing errors because the depth of sequencing is variable for a given transcript. However, compared to genome assembly, repeats are less of a problem for assembling an individual gene, since the coding sequence of a gene is unlikely to contain repeats. A commonly used program for repeat and vector masking is RepeatMasker (http://www.repeatmasker.org). Publicly available packages such as Lucy (Chou and Holmes, 2001) can be used to perform quality assessment: they mask repeats and vector sequences and remove poor-quality regions.
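Removal of poor-quality regions can be illustrated with a deliberately simple rule: keep the longest run of consecutive bases whose quality value reaches a threshold. This is a sketch only and is not the algorithm implemented in Lucy; the threshold of 20 and the example read are hypothetical.

```python
def trim_low_quality(seq, quals, min_q=20):
    """Return the longest run of consecutive bases with quality >= min_q."""
    best_start, best_len = 0, 0
    start = None
    # A sentinel lower than any real quality closes a run ending at the last base.
    for i, q in enumerate(list(quals) + [float("-inf")]):
        if q >= min_q:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return seq[best_start:best_start + best_len]

read = "ACGTACGTAC"
quals = [5, 8, 25, 30, 32, 31, 28, 35, 12, 9]
print(trim_low_quality(read, quals))
```

Real tools typically use windowed error estimates rather than a per-base cut-off, so that a single weak base inside a good region does not split the read.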
Only 300-500 readable bases are produced from each sequence read, and yet a full gene transcript may be several thousands of bases long. ESTs thus provide a "tag level" association with an expressed gene sequence and an assembly step is needed in order to reconstruct the cDNA sequence.
EST assembly and annotation
Sequence assembly refers to aligning and merging many fragments of a much longer DNA sequence in order to reconstruct the original sequence. Since ESTs represent mRNA sequences, ideally the assembly step will reconstruct one sequence for each mRNA. However, the assembly step is complicated by features like read quality, repeated regions, alternative splicing, single-nucleotide polymorphisms, and post-transcriptional modification.
The assembler returns contigs and singletons, where contigs are consensus sequences resulting from the alignment of several ESTs, while singletons are ESTs that could not be aligned with other ESTs. Examples of assemblers used for EST assembly are Phrap (http://www.phrap.org) and Cap3 (Huang and Madan, 1999). Other assemblers, like Arachne (Batzoglou et al., 2002), are designed for genomic assemblies.
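The distinction between contigs and singletons can be illustrated with a toy clustering. The sketch below links ESTs that share an exact suffix-prefix overlap and groups them with a union-find structure; it builds no consensus sequence and tolerates no sequencing errors, so it is not how Phrap or Cap3 operate. The sequences and the minimum overlap of 5 bases are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b`
    (0 if it is shorter than `min_len`)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def cluster_ests(ests, min_len=5):
    """Group ESTs linked by exact suffix-prefix overlaps.

    Returns (contigs, singletons) as lists of EST indices.
    """
    parent = list(range(len(ests)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, a in enumerate(ests):
        for j, b in enumerate(ests):
            if i != j and overlap(a, b, min_len):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(ests)):
        groups.setdefault(find(i), []).append(i)
    clusters = sorted(groups.values())
    contigs = [g for g in clusters if len(g) > 1]
    singletons = [g[0] for g in clusters if len(g) == 1]
    return contigs, singletons

ests = [
    "ATGGCGTACGTTAGC",   # overlaps the next EST by GTTAGC
    "GTTAGCCTGAAGTCA",   # overlaps the next EST by AAGTCA
    "AAGTCATTTGGGCCC",
    "TTTTTTTTTT",        # overlaps nothing
]
contigs, singletons = cluster_ests(ests)
print(contigs, singletons)
```

The first three ESTs form one group and the last one remains a singleton. The all-against-all comparison is quadratic in the number of ESTs, which is acceptable for a sketch but not for real libraries.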
Once the ESTs are assembled into contigs and singletons, the aim is to identify the gene to which each corresponds. In the annotation step, contigs are compared to known data to determine similarities and putative functions. Usually contigs are compared by sequence similarity to public databases such as the non-redundant database at NCBI or SWISS-PROT (Boeckmann et al., 2003), in order to assign contigs to known genes. When the sequence similarity is low, protein domains are used to get more information from the contig sequence, using tools like InterProScan (Quevillon et al., 2005).
Given the large amount of public ESTs for several organisms, some public gene indices are available, such as UniGene at NCBI and Gene Indices at DFCI (ex-TIGR)(Quackenbush et al., 2000). These gene indices provide contigs or clusters based on the current publicly available EST data. More details about EST preprocessing and assembly can be found in Chapter 2.
Utility of ESTs
The current understanding of the human set of genes, and other organisms, includes the existence of thousands of genes based solely on EST evidence. In this respect, ESTs become a tool to refine the predicted transcripts for those genes, which leads to prediction of their protein products, and eventually of their function.
Since cDNA libraries can be prepared from different tissues or developmental stages of a single organism, this approach can be useful for the construction of catalogues of tissue-specific or stage-specific genes. The identification of ESTs has proceeded rapidly, with approximately 37 million ESTs now available in public databases (e.g. GenBank 7/2006).
ESTs contain enough information to design precise probes for DNA microarrays that then can be used to determine the gene expression. In addition, the situation in which those ESTs are obtained (tissue, organ, disease state - e.g. cancer) gives information on the conditions in which the corresponding gene is acting. This information is also useful to assess gene expression.
1.3 Microarray technology
Microarrays enable the quantification of transcript levels, and this is done on a global scale: the transcript abundance is measured for thousands of molecules simultaneously. Microarray technology enables the quantification of RNAs that may or may not be translated into active proteins, but measuring gene expression at the protein level genome-wide is more difficult.
Microarrays can be split mainly into two classes: two-channel arrays and single-channel arrays. For two-channel arrays, two labeled samples are hybridized onto a single slide, resulting in two separate intensities. These intensity values are often reported as log-ratios. For single-channel arrays, only one sample is hybridized, which results in absolute measurements. This major difference implies that the data coming from the two types of platform have to be treated differently and require specific normalization methods. A further categorization can be made based on the probes used on the slides. The following sections present the different microarray types. We restrict the description of the analysis steps to the technique that was used in this work.
1.3.1 Two-channel arrays
A first step in building a microarray is to select the probes that will be printed on the arrays. In two-channel microarrays the probes can be DNA molecules (or analogs such as PNA or LNA) of known or unknown sequence: oligonucleotides, cDNA or small fragments of PCR products corresponding to mRNAs. Oligonucleotide arrays are created either by pre-synthesizing the oligonucleotides and printing them on the substrate, or by synthesizing the oligonucleotides in situ directly on the substrate.
The DNA molecules are spotted on the array by a set of pins. These pins dip into the DNA mixture to take a small amount of DNA and then drop it on the microarray surface by contact with the substrate. When the DNA molecules used are double stranded, the array is heated so that the DNA is denatured and can bind to complementary strands.
In the microarray technology, the targets are always cDNA molecules (usually single-stranded) obtained from mRNA through reverse transcription and amplified by PCR to obtain sufficient material to print on a slide.
Two-channel arrays are hybridized with cDNA targets from two samples to be compared (e.g. patient and control). These two samples are labeled with a red and a green fluorescent dye, called Cy5 and Cy3 respectively. If a cDNA target is present in one or both of the samples, then it will bind to its complementary probe. If it is present in both samples, the spot will emit a Cy3 and a Cy5 fluorescent signal. If it is present in only one of the samples, either a Cy5 or a Cy3 intensity will be measured, depending on the sample in which it was found. If it is absent from both samples, it will not hybridize and no signal will be emitted.
Before the intensities can be measured, the array is washed to remove the unbound targets. A scanner is then used to acquire the Cy3 and Cy5 signals. These signals then have to be preprocessed for quantification, as described in section 3.4, and they are often reported as log-ratios.
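The log-ratio of a spot can be computed as follows. The sketch uses the common M and A notation (log-ratio and mean log-intensity); taking Cy5 over Cy3 as the orientation of the ratio, and the intensity values themselves, are assumptions made for illustration.

```python
import math

def log_ratio(cy5, cy3):
    """M value: log2 ratio of the red (Cy5) to the green (Cy3) intensity."""
    return math.log2(cy5 / cy3)

def mean_log_intensity(cy5, cy3):
    """A value: average log2 intensity of the two channels."""
    return 0.5 * (math.log2(cy5) + math.log2(cy3))

# Hypothetical background-corrected intensities for three spots.
for cy5, cy3 in [(4000, 1000), (1000, 4000), (2048, 2048)]:
    print(cy5, cy3, log_ratio(cy5, cy3), mean_log_intensity(cy5, cy3))
```

Working on the log2 scale makes the measure symmetric: a fourfold increase and a fourfold decrease give +2 and -2 respectively, and equal intensities give 0.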
Figure 3: Two-channel microarray experiment. For two different RNA samples the cDNA is synthesized. The two samples are labeled with Cy3 (green) and Cy5 (red) and hybridized on the same array. The array is scanned and then the image is analysed. If a gene is expressed in the sample and present on the array, the gene will bind to the corresponding probe and a Cy3 or Cy5 signal will be emitted, depending on the sample in which the gene is expressed.
1.3.2 Single-channel arrays
For single channel arrays, only one sample is hybridized and this results in more or less absolute measurements. In this case, the probes are designed to match parts of the sequence of known or predicted mRNAs. There are commercially available designs that cover complete genomes from companies such as GE Healthcare, Affymetrix, Ocimum Biosolutions, or Agilent. These microarrays give estimations of the absolute value of gene expression and therefore the comparison of two conditions requires the use of two separate microarrays.
Affymetrix is probably the most used short oligonucleotide single-channel platform and, with its specific probe design, it requires a completely different normalization strategy. Affymetrix uses short oligonucleotides that are synthesized on the slide by using a set of masks. By covering the slide with a mask, a selection of positions on the chip is exposed, and light is then used to activate these unprotected sites. At all these activated positions, nucleotides will bind, resulting in the synthesis of one nucleotide at the positions chosen with the mask. A new mask is then applied and the same process is repeated. After several rounds, the desired set of probes is obtained. In this way a lot of different masks are required, and this makes the Affymetrix chip rather expensive.
On an Affymetrix chip, a gene is no longer represented by one single DNA molecule, but by a probe set consisting of 11 to 20 probe pairs. Each probe pair is composed of two short oligonucleotides of length 25 nt. One matches a part of the sequence of the gene and is called the perfect match (PM). The second oligonucleotide has the same sequence as the perfect match, except for a single mismatch in the middle of the oligonucleotide (at the 13th position), and is therefore called the mismatch (MM) probe. These mismatch probes are assumed to measure the nonspecific binding and are therefore often used as a kind of background correction. A disadvantage of this setup is that these oligonucleotides are so short that they are sometimes not gene specific.
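The probe-pair layout can be made concrete with a short sketch. It assumes that the mismatch is obtained by replacing the 13th base with its complement, and it summarizes a probe set by the mean of the PM minus MM differences, which is only one simple way of using the MM values. The probe sequence and the intensities are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm):
    """Build the MM probe by complementing the 13th base of a 25-nt PM probe."""
    if len(pm) != 25:
        raise ValueError("expected a 25-nt probe")
    mid = 12  # 13th position as a 0-based index
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

def average_difference(pm_intensities, mm_intensities):
    """Mean of PM - MM over the probe pairs of one probe set."""
    diffs = [p - m for p, m in zip(pm_intensities, mm_intensities)]
    return sum(diffs) / len(diffs)

pm = "ACGTACGTACGTACGTACGTACGTA"  # hypothetical 25-mer
mm = mismatch_probe(pm)
print(pm)
print(mm)
print(average_difference([1200, 900, 1500], [200, 300, 400]))
```

Only the central base differs between the two probes, so the MM probe is expected to capture binding that does not depend on a perfect match.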
1.3.3 Experimental design
Probe design and hybridizations
Perhaps the most important concern related to microarrays is that all current technologies are based on the fundamental assumption that most microarray probes produce specific signals under a single, rather permissive hybridization condition. Specificity, in the context of DNA microarrays, refers to the ability of a probe to bind a unique target sequence. A specific probe should provide a signal that is proportional to the amount of the target sequence only. A non-specific probe will provide a signal that is influenced by the presence of other molecules. The specificity of a probe can be diminished by cross-hybridization, a phenomenon in which sequences that are not strictly complementary bind to each other. Cross-hybridization is also called non-specific hybridization. Other factors, such as the formation of secondary structure and the melting temperature of probes, may also cause hybridization errors, which reduce experimental accuracy.
The advantage of cDNA microarrays is their low cost compared to oligonucleotide arrays, and therefore they were often used in the academic world. However, it is difficult to obtain full-genome coverage, and highly similar transcripts may cause cross-hybridization. Long oligonucleotide arrays can overcome these problems by selecting regions within genes in such a way that they offer better specificity.
Signals observed from the immobilized probes vary within a range. The sensitivity threshold of microarray measurements defines the concentration range in which accurate measurements can be made. Some attempts have been made to assess the dynamic range of microarrays. The detection limit of current microarray technology seems to be between one and ten copies of mRNA per cell (Allemeersch et al., 2005; Draghici et al., 2006).
RNA spike-in experiments
An RNA spike-in is an RNA transcript used to calibrate measurements in a DNA microarray experiment. Each spike-in is designed to hybridize with a specific control probe on the target array. Manufacturers of commercially available microarrays typically offer companion RNA spike-in "kits". Known amounts of RNA spike-ins are mixed with the experiment sample during preparation. Subsequently, the measured degree of hybridization between the spike-ins and the control probes can be used to normalize the hybridization measurements of the sample RNA. In addition to normalization, spike-in experiments make it possible to quantify the accuracy of microarrays, i.e. to compare the observed measurement with the real concentration of RNA for the spike probes.
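One simple way to use spike-ins for calibration is to fit a straight line between the logarithm of the known concentration and the logarithm of the measured intensity, and then to invert that line for the other probes. The sketch below assumes such a log-linear relationship; the concentrations, their unit and the intensities are hypothetical, and real normalization procedures are more elaborate.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical spike-ins: known concentration (arbitrary units)
# and the intensity measured on the matching control probes.
spike_conc = [1, 10, 100, 1000]
spike_int = [50, 500, 5000, 50000]

log_c = [math.log10(c) for c in spike_conc]
log_i = [math.log10(i) for i in spike_int]
intercept, slope = fit_line(log_c, log_i)

def estimate_concentration(intensity):
    """Invert the calibration line to estimate a concentration."""
    return 10 ** ((math.log10(intensity) - intercept) / slope)

print(slope, estimate_concentration(2500))
```

Comparing the fitted line with the known concentrations also gives a direct view of accuracy: a slope far from 1 or a poor fit at low concentrations indicates compression or a loss of sensitivity.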
Experimental design for two-channel arrays
In two-channel arrays, mRNAs extracted from two conditions (samples) are hybridized simultaneously on a given array. Which conditions to pair on the same array is a non-trivial issue and relates to the choice of the “microarray design”. When several conditions need to be studied, for instance 10 conditions, co-hybridizing all possible pair combinations is not tractable. A good microarray design estimates the parameters of interest while making efficient use of the available material (number of arrays, amount of mRNA available, or other cost considerations) (Yang and Speed, 2002).
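The number of arrays required to hybridize every pair directly grows quadratically with the number of conditions, as the following short calculation shows for the example of 10 conditions (the condition names are placeholders).

```python
from itertools import combinations

conditions = [f"C{i}" for i in range(1, 11)]  # 10 placeholder conditions
pairs = list(combinations(conditions, 2))

print(len(pairs))      # 45 arrays to hybridize every pair once
print(2 * len(pairs))  # 90 arrays if each pair is also dye-swapped
```

By comparison, a design that uses one array per condition needs only 10 arrays, which is the motivation for the designs described next.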
The ‘reference design’ is the most commonly used design for two-channel microarrays; in it, each experimental sample is hybridized against a common reference sample (Figure 4a). Its advantages are that ratios are easy to interpret and that it extends easily to other experiments, provided the common reference is preserved. However, half of the resources are used to measure this reference sample, which usually has no biological interest, and indirect comparisons are needed to compare two samples of interest.
For comparing a number of samples of equal interest and high quality, a design that uses a large number of direct sample-to-sample comparisons is, from a theoretical perspective, the most accurate for the cost. The simplest of these is a ‘loop’ design, where each sample is hybridized to each of two different samples in two different dye orientations (Figure 4b). The drawback is that if one chip fails, or is of poor quality, then the error variance for all estimates is doubled. In addition, loops are inefficient compared to the reference design for a large number of samples because some pairs of varieties (samples) are too far apart (Kerr and Churchill, 2001). There are other efficient designs which are also robust to failure, such as the interwoven loop design, which is a combination of direct and indirect comparisons that can give more robust and precise estimates (Figure 4c).
Figure 4: Examples of microarray designs. Circles represent samples or time points, and arrows represent a direct hybridization between two samples. The arrows point from the sample labelled with Cy3 to the sample labelled with Cy5. (a) Common reference design. (b) Loop design. (c) Interwoven loop design.
Independent of the chosen design, there are two fundamental principles of good design: balance and replication. In a balanced design, every sample is labeled equally often in the red and green channels. For example, the loop design is by definition a balanced design, and swapped hybridizations provide balance for reference designs. Replication improves the precision of estimates. The number of replicates depends on the goals of the study, the resources, and the reliability of the technology.
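The notion of balance can be checked mechanically. In the sketch below each array is written as a (Cy3 sample, Cy5 sample) pair, following the arrow convention of Figure 4; the function names and the seven time points T1 to T7 are used for illustration only.

```python
from collections import Counter

def reference_design(samples, dye_swap=False):
    """Each sample against a common reference; arrays are (Cy3, Cy5) pairs."""
    arrays = [("Ref", s) for s in samples]
    if dye_swap:
        arrays += [(s, "Ref") for s in samples]
    return arrays

def loop_design(samples):
    """Each sample is hybridized with the next one, closing the loop."""
    n = len(samples)
    return [(samples[i], samples[(i + 1) % n]) for i in range(n)]

def is_balanced(arrays):
    """True if every sample is labeled equally often with Cy3 and Cy5."""
    cy3 = Counter(a for a, _ in arrays)
    cy5 = Counter(b for _, b in arrays)
    return cy3 == cy5

samples = [f"T{i}" for i in range(1, 8)]
print(is_balanced(reference_design(samples)))
print(is_balanced(reference_design(samples, dye_swap=True)))
print(is_balanced(loop_design(samples)))
```

The plain reference design is unbalanced because the reference always carries the same dye; adding the swapped hybridizations doubles the number of arrays and restores balance, whereas the loop is balanced with one array per sample.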
1.3.4 Microarray analysis
Because single-channel and two-channel microarrays differ, the data from the two types of platform have to be treated separately and require specific normalization methods. Here we focus on the analysis of two-channel microarrays, since the data presented in this work come from this type of array.
Image acquisition and analysis
After performing the hybridization experiments, scanning the slide constitutes the first step of data analysis followed by the extraction of the raw intensities from the images. There are four basic steps in image acquisition and analysis: scanning, spot recognition or gridding, segmentation and intensity extraction.
In the scanning step, there are two prerequisites for obtaining a high-quality image. First, all the previous steps have to be performed to the highest possible standard, so that the images are affected as little as possible by contamination and show consistent spots with high signal-to-noise ratios. Second, the choice of scanning parameters is important: the photomultiplier tube (PMT) gain settings are adjusted to balance the overall intensities between the two channels.
Spot recognition is not a difficult task for most image analysis software. It consists of laying the grid, i.e. finding where the printed spots ought to be in the image.
Segmentation is the process that separates the foreground pixels (i.e. the true signal) of each spot in the grid from the background pixels. Proper segmentation can be difficult, because in a poor-quality image the spot morphology can vary substantially and the background can be high. Several algorithms for segmentation and for background estimation are implemented in the different image analysis software packages.
After segmentation, the pixel intensities within the foreground and background masks are averaged separately to give the respective signals. The median or other intensity extraction methods can be used instead of the mean.
Data pre-processing: Quality assessment
The data extracted in the image acquisition step need to be pre-processed to exclude poor-quality spots, and normalized to remove as many systematic errors as possible before downstream analysis. A common criterion is to exclude any spot whose intensity is lower than the background plus two standard deviations; this eliminates spots whose intensity lies within the range of the background. The intensities should also be log-transformed. With log-transformed ratios, up-regulated and down-regulated values are on the same scale: a two-fold increase and a two-fold decrease give log2 ratios of +1 and -1, respectively.
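The filtering criterion and the log-transformation can be sketched as follows. This is a minimal illustration, not the procedure of any particular image analysis package: the field names (`cy3_fg`, `cy3_bg`, `cy3_bg_sd`, ...), the example values and the choice to require the criterion in both channels are assumptions.

```python
import math

def filter_and_log(spots, k=2.0):
    """Keep spots whose foreground exceeds background + k * SD(background)
    in both channels (assumption), and return their log2 intensities."""
    kept = []
    for s in spots:
        above_background = all(
            s[ch + "_fg"] > s[ch + "_bg"] + k * s[ch + "_bg_sd"]
            for ch in ("cy3", "cy5")
        )
        if above_background:
            kept.append({
                "id": s["id"],
                "log2_cy3": math.log2(s["cy3_fg"]),
                "log2_cy5": math.log2(s["cy5_fg"]),
            })
    return kept

# Hypothetical spots: spot_B is too close to the Cy3 background and is dropped
spots = [
    {"id": "spot_A", "cy3_fg": 1024.0, "cy3_bg": 100.0, "cy3_bg_sd": 20.0,
     "cy5_fg": 4096.0, "cy5_bg": 120.0, "cy5_bg_sd": 30.0},
    {"id": "spot_B", "cy3_fg": 130.0, "cy3_bg": 100.0, "cy3_bg_sd": 20.0,
     "cy5_fg": 900.0, "cy5_bg": 120.0, "cy5_bg_sd": 30.0},
]
kept = filter_and_log(spots)
```

For `spot_A` the difference between the two log2 intensities is 2, i.e. a four-fold ratio; the inverse four-fold ratio would give -2, which is the symmetry that motivates the log scale.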
Background subtraction is usually performed, but many researchers have found that it adds noise to the measurements. On the other hand, ignoring the background seems wrong in principle, and at the moment there is no consensus.
Data pre-processing: Normalization
In a self-self experiment, i.e. when two identical samples are hybridized on the same array, one expects a plot of the log2 Cy5 intensities versus the log2 Cy3 intensities to lie along the diagonal. This is not the case, although biologically there is no difference between the samples. The variation may result from the different labeling efficiencies and scanning properties of the Cy3 and Cy5 dyes; from different scanning parameters, such as PMT settings; and from print-tip, spatial, or PCR plate effects.
The purpose of normalization is to minimize systematic variation in the measured gene expression levels of two co-hybridized samples, so that biological differences can be more easily distinguished. The utility of normalization can be visualized in a ratio-intensity plot, or MA-plot (Figure 5). In an MA-plot the log ratios (M values) are plotted against the mean log-intensities (A values). The MA-plot also reveals the dye effects: one expects the plot to be centered around M = 0, i.e. the proportions of up- and down-regulated genes to be similar, which is clearly not the case before normalization.
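As a small illustration of the quantities plotted in an MA-plot, the sketch below computes M and A from raw channel intensities, using the usual convention M = log2(Cy5) - log2(Cy3) and A = (log2(Cy5) + log2(Cy3)) / 2. The function name and the example intensities are hypothetical.

```python
import math

def ma_values(cy5, cy3):
    """Return (M, A): log ratios and mean log-intensities, spot by spot."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(cy5, cy3)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(cy5, cy3)]
    return m, a

# Three hypothetical spots: four-fold up, four-fold down, unchanged
cy5 = [1024.0, 256.0, 512.0]
cy3 = [256.0, 1024.0, 512.0]
m, a = ma_values(cy5, cy3)
print(m)  # [2.0, -2.0, 0.0]
print(a)  # [9.0, 9.0, 9.0]
```

The three spots share the same average intensity A but differ in M, so on the plot they would sit on one vertical line, symmetrically around M = 0.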
The commonly used Loess normalization is a non-linear method that performs a local, intensity-dependent normalization (see Figure 5). The idea is to fit a curve to the M values as a function of the A values, and then to correct the ratios using this curve. Normalization algorithms such as Loess can be applied either globally (to the entire data set) or locally (to some physical subset of the data). For spotted arrays, local normalization is often applied to each group of array elements deposited by a single spotting pen (print-tip group): the data are split into groups printed by the same print tip, a separate Loess curve is fitted for each group, and the log ratios are corrected using the corresponding curve. This local normalization has the advantage that it can help correct for systematic spatial variation in the array, including inconsistencies among the spotting pens used to make the array, variability in the slide surface, and local differences in hybridization conditions across the array.
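The principle of global and print-tip Loess normalization can be sketched in a few lines. The code below is a deliberately simplified local linear regression with tricube weights, written for illustration only: it omits the robustness iterations of a full Loess implementation, and the function names, the `span` default and the way print-tip groups are encoded are assumptions.

```python
def loess_fit(x, y, span=0.4):
    """Local linear fit with tricube weights, evaluated at every x[i]."""
    n = len(x)
    k = min(n, max(2, int(round(span * n))))   # neighbourhood size
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12               # distance to the k-th nearest point
        w = [max(0.0, 1.0 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def loess_normalize(m, a, span=0.4):
    """Subtract the fitted intensity-dependent trend from the M values."""
    return [mi - ci for mi, ci in zip(m, loess_fit(a, m, span))]

def print_tip_loess(m, a, tip, span=0.4):
    """Fit and subtract a separate curve for each print-tip group."""
    out = [0.0] * len(m)
    for t in set(tip):
        idx = [i for i, ti in enumerate(tip) if ti == t]
        norm = loess_normalize([m[i] for i in idx], [a[i] for i in idx], span)
        for i, v in zip(idx, norm):
            out[i] = v
    return out
```

With a purely intensity-dependent dye bias (M a smooth function of A and no differential expression), the corrected M values returned by `loess_normalize` are centered on zero; `print_tip_loess` does the same within each print-tip group, so a different trend per spotting pen is removed as well.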
Figure 5: Ratio-Intensity or MA-plot before (left) and after (right) loess normalization.
An additional normalization step, between-array normalization, may be useful for comparisons across arrays. Scale and quantile normalization are two examples. Scale normalization is a linear method that scales the log ratios (M values) of a series of arrays so that each array has the same median absolute deviation. Quantile normalization makes the distribution of probe intensities the same for all arrays; unlike scale normalization, it is applied to log-intensities rather than to log-ratios.
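Both between-array methods are simple enough to sketch directly. In the illustration below the common target for the scale method is taken to be the geometric mean of the per-array median absolute deviations, and ties are not treated specially in the quantile method; both choices, like the function names, are assumptions made for the example.

```python
import math
from statistics import median

def mad(values):
    """Median absolute deviation around the median."""
    med = median(values)
    return median([abs(v - med) for v in values])

def scale_normalize(m_arrays):
    """Rescale the M values of each array so that all arrays share one MAD."""
    mads = [mad(m) for m in m_arrays]
    target = math.exp(sum(math.log(s) for s in mads) / len(mads))
    return [[v * target / s for v in m] for m, s in zip(m_arrays, mads)]

def quantile_normalize(arrays):
    """Give every array the same empirical distribution of log-intensities."""
    n = len(arrays[0])
    sorted_arrays = [sorted(arr) for arr in arrays]
    # Reference distribution: mean of the r-th smallest value across arrays
    ref = [sum(arr[r] for arr in sorted_arrays) / len(arrays) for r in range(n)]
    result = []
    for arr in arrays:
        order = sorted(range(n), key=lambda i: arr[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = ref[rank]
        result.append(out)
    return result

# Two hypothetical arrays of log-intensities for the same three probes
print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
# [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After quantile normalization each array contains exactly the same set of values, only in a different order, which is what "same distribution" means here; after scale normalization the arrays keep their shape but share the same spread.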
Estimating magnitude and significance of differential gene expression
Inherent to the hybridization technique, measurements are noisy. To assess the significance of gene profiles, replicated experiments and statistical tools are needed. After data preprocessing and normalization, the next stage is to fit a statistical model to estimate magnitude and significance of relative gene expression across samples. Several methods can deal with complex microarray designs. These methods fit a model to estimate the relative gene expression and error terms, which can be used to identify significant differentially expressed genes.
These methods can be classified according to the type of model they use. LIMMA (Linear model for microarray data analysis) is a gene-specific method, because it fits a linear model for each gene separately. LIMMA uses normalized log-ratios as input data. To test for differential expression, the gene specific variance estimates are improved in a Bayesian way by using the information from all genes (Smyth, 2004).
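The variance moderation described by Smyth (2004) can be illustrated for the simplest case, in which the quantity of interest for a gene is the mean of its replicate log-ratios. The gene-wise sample variance s² (with d degrees of freedom) is shrunk towards a prior variance s0² (with d0 prior degrees of freedom) as (d0·s0² + d·s²) / (d0 + d). In LIMMA the prior parameters are estimated from all genes; in the sketch below they are simply passed in as hypothetical values, and the function names are assumptions.

```python
import math

def ordinary_t(log_ratios):
    """One-sample t statistic for the mean log-ratio of one gene."""
    n = len(log_ratios)
    mean = sum(log_ratios) / n
    s_sq = sum((x - mean) ** 2 for x in log_ratios) / (n - 1)
    return mean / math.sqrt(s_sq / n)

def moderated_t(log_ratios, d0, s0_sq):
    """t statistic using a variance shrunk towards the prior value s0_sq.

    d0 and s0_sq are supplied by the caller here (assumption); an empirical
    Bayes procedure would estimate them from the variances of all genes.
    """
    n = len(log_ratios)
    d = n - 1
    mean = sum(log_ratios) / n
    s_sq = sum((x - mean) ** 2 for x in log_ratios) / d
    s_tilde_sq = (d0 * s0_sq + d * s_sq) / (d0 + d)
    return mean / math.sqrt(s_tilde_sq / n)

# Hypothetical gene with three replicate log-ratios and an unusually small variance
gene = [1.0, 1.2, 0.8]
print(ordinary_t(gene), moderated_t(gene, d0=4.0, s0_sq=0.1))
```

A gene whose replicates happen to agree very closely gets a smaller statistic after moderation, which limits false positives driven by accidentally small variance estimates when there are few replicates.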
Another statistical tool used in microarray analysis is the analysis of variance (ANOVA). Initially, researchers used global ANOVA models (Kerr et al., 2000), in which a single model is fitted to the whole data set at once. However, global models are computationally expensive and were subsequently replaced by two-stage models. In a two-stage model, a first global model is applied to the whole dataset to estimate the gene-independent terms. The residuals of the global model become the data for the second stage, which is applied to one gene at a time. Wolfinger et al. (2001) introduced ANOVA mixed models for microarrays. In a mixed model, some of the effects in the experimental design are treated as random samples from a population.
ANOVA models use separate log-intensity values for each channel, as spot effects are explicitly incorporated. They return normalized absolute expression levels for each channel separately. The ANOVA F tests are designed to detect any pattern of differential expression among several conditions by comparing the variation among replicate samples within and between conditions. There are several variations of the F test, and software tools such as MAANOVA can compute appropriate F statistics for mixed models. More details about methods to obtain gene expression profiles can be found in Chapter 4.
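The comparison of within-condition and between-condition variation can be made concrete with the classical one-way, fixed-effects F statistic for a single gene. This is only the textbook form, given for illustration; it is not the mixed-model F statistic computed by MAANOVA, and the function name and example values are hypothetical.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for one gene.

    `groups` holds one list of replicate expression values per condition.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical log-expression values: three conditions, three replicates each
print(f_statistic([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]))
```

A large F value means that the condition means differ by more than the replicate-to-replicate scatter would explain, whatever the pattern of the difference, which is why the F test suits designs with more than two conditions.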
Clustering of gene expression profiles
Clustering is a useful exploratory technique for suggesting resemblances among groups of genes. It is essentially a grouping technique that aims to find patterns in the data that are not predicted by the experimenter’s current knowledge or pre-conceptions.
Different procedures emphasize different types of similarities, and give different resulting clusters. Most cluster programs offer several distance measures (Euclidean, Manhattan distances), some relational measures (correlation, and sometimes relative distance), and mutual information. Standard clustering techniques, such as hierarchical clustering, K-means, and self-organizing maps, are applied to group together the gene profiles with similar patterns across the conditions. Moreover, advanced algorithms have also been developed which are specifically fine-tuned for biological applications.
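How strongly the choice of measure shapes the result can be seen on a toy example. The sketch below implements the Euclidean, Manhattan and correlation-based (1 - Pearson r) distances between two expression profiles; the profiles and function names are hypothetical.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def pearson_distance(p, q):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    return 1.0 - cov / (sp * sq)

# Hypothetical profiles over four conditions
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same shape as gene_a, ten times the amplitude
gene_c = [4.0, 3.0, 2.0, 1.0]       # same amplitude as gene_a, opposite trend

print(euclidean(gene_a, gene_b), pearson_distance(gene_a, gene_b))
print(euclidean(gene_a, gene_c), pearson_distance(gene_a, gene_c))
```

With the Euclidean or Manhattan distance, `gene_a` is much closer to `gene_c` than to `gene_b`; with the correlation-based distance the ranking is reversed, because only the shape of the profile counts. A clustering run on the same data would therefore group the genes differently depending on the measure.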
Chapter 2: Exploring the transcriptomes using ESTs
A large-scale partial cDNA sequencing project was initiated to create a functional genomics resource for Xenopus tropicalis. The cDNAs derived from genes expressed in the nervous system during embryogenesis and metamorphosis of X. tropicalis were studied. This chapter describes the analysis of the ESTs and the construction of a gene index from a collection of about 50,000 high-quality sequences. These ESTs are estimated to represent 9,693 transcripts derived from about 6,000 genes. The analysis carried out for this thesis consisted of the pre-processing and assembly of the ESTs and the annotation of the sequences. The annotation of the contigs (assembled ESTs) includes the prediction of protein domains, functional classification based on Gene Ontology, and the identification of alternative splicing events. In addition, the analyses performed by the biologists made it possible to identify non-coding RNAs, to derive gene expression profiles from EST counts, and to use these profiles to define transcripts specific to the metamorphic stages of development. All the data have been made publicly available through the web application ‘XTScope’ (http://indigene.ibaic.u-psud.fr/EST), whose search interface gives rapid access to the EST and contig sequences, to splicing and expression data, and to statistics and related information. The design and implementation of XTScope are also part of the work carried out in this thesis.
Although an EST represents only a small portion of a coding sequence, en masse this partial sequence data is of substantial utility. EST collections are a relatively quick and inexpensive route to discovering new genes; they confirm coding regions in genomic sequence, provide the basis for the development of microarrays, and can be used to measure transcriptome activity. These small fragments need to be assembled to reconstruct the transcribed gene sequence, so that the corresponding gene and its putative function can then be identified. When the aim is to measure transcriptome activity, expression profiles of the genes across the different libraries are obtained from the analysis.
We have studied the nervous system of Xenopus tropicalis through ESTs to provide a functional genomics resource on genes expressed in the nervous system during early embryogenesis and metamorphosis. The first section of this chapter corresponds to the article published in BMC Genomics, which presents the biological analysis of the gene index derived from ESTs. The second section presents the bioinformatic and technical aspects that I have developed for this thesis: the workflow used to construct a gene index from the raw EST reads and the web resource developed to query this gene index.
2.1 Exploring the nervous system transcriptomes during embryogenesis and metamorphosis in Xenopus tropicalis using EST analysis
BMC Genomics BioMed Central
Research article Open Access
Exploring nervous system transcriptomes during embryogenesis and metamorphosis in Xenopus tropicalis using EST analysis
Ana C Fierro*1,2,5, Raphaël Thuret1,2, Laurent Coen3, Muriel Perron1,2, Barbara A Demeneix3, Maurice Wegnez1,2, Gabor Gyapay4, Jean Weissenbach4, Patrick Wincker4, André Mazabraud1,2 and Nicolas Pollet*1,2,5
Address: 1CNRS UMR 8080, F-91405 Orsay, France, 2Univ Paris Sud, F-91405 Orsay, France, 3CNRS UMR 5166, Evolution des Régulations Endocriniennes, USM 501, Département Régulations, Développement et Diversité Moléculaire, Muséum National d'Histoire Naturelle, 7 rue Cuvier, 75231 Paris Cedex 5, France, 4Genoscope and CNRS UMR 8030, 2 rue Gaston Crémieux CP5706, 91057 Evry, France and 5Programme d'Épigénomique, Univ Evry, Tour Évry 2, 10è étage, 523 Terrasses de l'Agora, 91034 Evry cedex, France
* Corresponding authors
Published: 16 May 2007 Received: 17 November 2006 Accepted: 16 May 2007 BMC Genomics 2007, 8:118 doi:10.1186/1471-2164-8-118 This article is available from: http://www.biomedcentral.com/1471-2164/8/118 © 2007 Fierro et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Background: The western African clawed frog Xenopus tropicalis is an anuran amphibian species now used as model in vertebrate comparative genomics. It provides the same advantages as Xenopus laevis but is diploid and has a smaller genome of 1.7 Gbp. Therefore X. tropicalis is more amenable to systematic transcriptome surveys. We initiated a large-scale partial cDNA sequencing project to provide a functional genomics resource on genes expressed in the nervous system during early embryogenesis and metamorphosis in X. tropicalis.
Results: A gene index was defined and analysed after the collection of over 48,785 high quality sequences. These partial cDNA sequences were obtained from an embryonic head and retina library (30,272 sequences) and from a metamorphic brain and spinal cord library (27,602 sequences). These ESTs are estimated to represent 9,693 transcripts derived from an estimated 6,000 genes. Comparison of these cDNA sequences with protein databases indicates that 46% contain their start codon. Further annotation included Gene Ontology functional classification, InterPro domain analysis, alternative splicing and non-coding RNA identification. Gene expression profiles were derived from EST counts and used to define transcripts specific to metamorphic stages of development. Moreover, these ESTs allowed identification of a set of 225 polymorphic microsatellites that can be used as genetic markers.
Conclusion: These cDNA sequences permit in silico cloning of numerous genes and will facilitate studies aimed at deciphering the roles of cognate genes expressed in the nervous system during neural development and metamorphosis. The genomic resources developed to study X. tropicalis biology will accelerate exploration of amphibian physiology and genetics. In particular, the model will facilitate analysis of key questions related to anuran embryogenesis and metamorphosis and its associated regulatory processes.
Background
Xenopus tropicalis is now an anuran amphibian reference genome for vertebrate comparative genomics. It presents the same advantages as Xenopus laevis but has a smaller genome of 1.7 Gbp and a shorter generation time [1]. Moreover, while X. laevis is an allotetraploid derived from an allopolyploidization event, X. tropicalis is diploid [2,3]. Even though phylogenetic studies indicate that 30 to 50 MY evolution separate the two species [3,4], it has been shown that most methods and resources developed for X. laevis can be readily applied to X. tropicalis [5]. Thus, the genome of X. tropicalis was selected to explore amphibian genome characteristics by whole-genome shotgun sequencing [6].

Working on X. laevis constitutes a challenge when dealing with large-scale transcriptomics, such as microarrays experiments or systematic cDNA sequencing. This is because some X. laevis genes are present as diploids, while others form pairs of paralogs (also called "pseudoalleles") that have been conserved with various degrees of divergence, generally less than 10% [7]. On a genomic scale, recent data has led to the estimation of 12% as the minimal fraction of paralogous gene pairs kept after allotetraploidization [8]. However, this estimate is based on the application of strict and conservative criteria: less than 98% nucleotidic similarity and 93% mean similarity between paralogs. Therefore, it is likely that more than 12% of paralogs are indeed active genes in X. laevis. Moreover, such pairs of genes may have distinct expression patterns [7]. An estimated 14% of paralogs show distinct expression profiles based on EST counts [8]. Given these complications, it follows that the X. tropicalis genome is more amenable to systematic transcriptome surveys than that of X. laevis.

Transcriptome analysis relies heavily on cDNA analysis. Collections of cDNA sequences have multiple uses for the molecular geneticist. They can be used to establish transcript catalogues [9-11] and to provide experimental evidence when building gene models from genomic sequence, particularly for 5' and 3' untranslated sequences [12]. Further, they can be used to provide global views on genome expression in a given cell type by the estimation of the abundance of the different mRNA species (through signatures as in [13]) and therefore can help decipher physiological roles played by a given gene product. Finally, partial cDNA sequences (ESTs) are used to identify full-length clones containing the entire open-reading frame for each transcript [14].

We initiated an EST program so as to provide a functional genomics resource for X. tropicalis containing sequences from the highest possible number of genes expressed in the nervous system. We report the construction of such a gene index and its assessment after the collection of 48,785 partial cDNA sequences. These ESTs are estimated to represent 6,000 genes that were annotated through sequence similarity searches, protein domain searches and Gene Ontology functional classification. Gene expression profiles were derived from EST counts and used to evidence transcripts differentially expressed at metamorphic stages of development. A set of polymorphic intragenic microsatellite markers was deduced from the analysis of ESTs derived from distinct strains of X. tropicalis. We expect that this resource will be valuable for further molecular genetics experiments.

Results and discussion
Construction of cDNA libraries and normalization
Two X. tropicalis cDNA libraries were constructed for this project. The first, designated xthr, was derived from dissected retinas and heads of young tadpoles (Nieuwkoop and Faber st. 25–35). About 500 retinas were dissected from stage 32 X. tropicalis embryos, a stage where differentiating retinal neurons are getting organized into layers. Because these retinas yielded only few polyA+ RNA, the library was enriched by the addition of mRNA from heads of embryos of the same developmental stage. The second library, designated xtbs, was made from central nervous systems of metamorphosing tadpoles. Brains and spinal cords were dissected from tadpoles between stage 58 and 64, the period covering the whole of Xenopus metamorphosis. To build the library, and with the aim of respecting the relative proportion of nervous tissue obtained at the different stages, samples for six animals were pooled for each stage between 58–61 and three animals for each stage between 62–64. All these tissues were combined and the mRNA extracted for preparation of the xtbs library. The SMART technology (Clontech) was used to enrich the representation of full-length cDNA clones (defined here as a copy of the transcript sequences between the 5' cap and a polyA tail).

To increase the information derived from EST projects, it is necessary to sample complex or normalised cDNA libraries with few overrepresented cDNA clones (observed individually with a frequency greater than 1%). To evaluate our libraries quality, samples of 1,989 cDNAs from xthr and 1,694 cDNA from xtbs were partially sequenced (see Methods) to obtain 4,120 ESTs. Next, a normalization step was performed to increase the diversity of sequence tags. We used a set of 53 oligonucleotides (35 mers) corresponding to highly represented clones (≥ 1%, see Methods) in hybridizations on high-density colony filters (See additional file 1). A total of 22,561 clones were scored as positives (20% in both libraries) with an estimated false positive level of 0 and 3%, and an estimated false negative level of 38 and 10% for xthr and xtbs libraries, respectively. The negatively scored clones were re-arrayed to further the project using both 5' and 3' sequencing. The further sequencing of cDNA clones provided 48,785 high-quality sequences derived from 27,806 clones after trimming 57,874 reads (including the 4,120 ESTs of the pre-normalisation step, Table 1, see Methods). Both 5' and 3' end sequences were read for 75% of the cDNA clones, therefore reducing the difficulties associated with EST clustering. Moreover, this strategy helps to determine the choice of a given cDNA clone for further experiments, whether it be full-length cDNA sequencing, overexpression studies or complementary RNA in vitro synthesis.

To determine if the normalization process was successful, the number of sequences containing each oligonucleotide probe was counted before and after normalization (Fig. 1). Before normalization, the 53 clusters from which the probes were derived accounted for 18% of the 4,120 ESTs. This fraction dropped to 1% after normalization, confirming the efficacy of the method. Of the 48 clusters corresponding to nuclear genes, 18 (37%) have 20 or more corresponding ESTs and 17 (35%) have 40 or more ESTs after normalization. We conclude that the abundance of ESTs after normalisation was sufficient in the majority of cases. Even though this strategy requires re-arraying, there is no bias due to insert length compared to normalization by re-association [15] and therefore constitutes a useful alternative.

Table 1: Xenopus tropicalis EST project statistics

                                          xthr     xtbs     xthr and xtbs
Number of sequences reads obtained        30272    27602    57874
Number of clone sequences obtained        16548    14901    31449
Number of valid sequences                 26440    22345    48785
Number of clones with valid sequences     15540    12266    27806
Number of clones with 5' and 3' EST       9354     11486    20840
Number of clones with 5' EST only         4831     0        4831
Number of clones with 3' EST only         1355     780      2135
Average trimmed EST length                522      546      534
Number of contigs                         4327     4002     8756
Number of contigs groups                  497      289      842
Number of contigs grouped                 1210     649      2209
Number of unique contigs                  3117     3353     6547
Number of clones in contigs               9616     9268     15642
Number of singletons                      7958     4262     17018
Number of clones in singletons            5924     2998     12164
Number of putative transcripts            9538     6640     19553
Max. assembled sequence length            3028     3144     3028
Average assembled sequence length         732      782      745
Max. assembled sequence size              147      144      159
Average assembled sequence size           6        5        5
Number of contigs containing
  1                                       509      97       1152
  2                                       1779     2155     3730
  3                                       452      362      905
  4–5                                     587      545      1159
  6–10                                    505      476      952
  11–20                                   283      207      479
  21–30                                   91       90       174
  31–50                                   80       41       118
  50–100                                  31       26       76
  >100                                    10       3        11

EST assembly
We analyzed these sequences with PHRAP [16] to build contigs out of the overlapping and redundant sequences (Table 1). A total of 31,767 sequences were assembled into 8,756 contigs. These were further grouped by virtue of clone links into 6,547 unique groups (scaffolds). Taking into account the 2,982 singletons issued from 2,304 clones, a total of 9,693 transcripts sequences were identified. We compared our results to the global clustering of all X. tropicalis ESTs (including ours) by the UniGene pipeline and the DFCI Gene Index. In UniGene, our set of ESTs belong to 7,778 groups made of between 1 and 220 clones. Similarly, the DFCI Xenopus tropicalis Gene Index clustered these ESTs in 9,350 TCs and 1,160 singletons.

The majority of clusters (66%) contained three or fewer ESTs. Only 11 contigs were composed of more than 100 sequences (See Additional file 2) and the largest contig contained 159 sequences. Most of the corresponding gene products (23/50) are ribosomal proteins, the other being proteins involved in basic cellular processes (tubulin, elongation factor 1 alpha). Two noteworthy exceptions are myelin basic protein (contig8746) and metallothionein (contig8708), for which transcripts are found almost exclusively in the nervous system.

The sequence redundancy (number of ESTs/cluster) of xthr and xtbs libraries was compared to other X. tropicalis cDNA libraries represented in dbEST (See Additional file 3). A statistically significant difference at the 1% level of significance indicates that the complexity is higher for adult-type cDNA libraries, whether or not a size fractionation was performed. Amongst cDNA libraries prepared from embryonic or larval stages of development, the complexity of the xtbs library ranks first, while the complexity of the xthr library is close to the mean value.

Sequences were assembled into contigs of up to 3 kb in size (hsp90 transcript, Contig 8575, See Additional file 4), but the mean contig length of 745 bp indicated that most of them cover only parts of the cognate transcript sequence.

To assess the fraction of clones likely to be full-length, we estimated the number of sequences in our dataset that
extends over the 5' or 3' end of complete cDNA sequences (Figure 2; 1,945 entries from the X. tropicalis Xenopus Gene Collection [8], Xt-XGC and 2,963 entries from the Sanger Institute [17]). Using conservative criteria, at least one 5'EST was found to provide additional 5' upstream or 3' downstream sequence for 854 complete cDNAs (17.4% of the set). Using the same criteria but only on contig sequences, further sequence information was obtained on 355 complete cDNAs (7.2% of the set). Of these full-length cDNAs, 82 are completely matched by 122 contigs, and the latter are all longer. These results provide an indication of the added-value of this sequence resource in the framework of the delineation of gene structure, especially with respect to the determination of the transcription initiation site.

Figure 1: Assessment of normalization effectiveness. Histogram showing the percentage of sequence matching each oligonucleotide used in the procedure of normalization by hybridization. Bars represent percentages of positive clones calculated before (grey bars) and after normalization (black bars). Data before normalization were obtained after partial sequencing of 1,989 cDNAs from xthr and 1,694 cDNAs from xtbs. Note the relatively high abundance of cell-type specific transcript such as gamma crystallin (crystallinG1) or neurogranin (underlined on the figure).

Another way to assess the fraction of clones likely to be full-length has been described by Gilchrist et al. [17]. Using this method on all X. tropicalis cDNA sequences (version Xt6 [18]), the xthr and xtbs libraries were found to contain respectively 42% and 37% of full-length clones (MJ Gilchrist, personal communication). The mean fraction of full-length clones across all libraries is 18%. Therefore, we conclude that our normalization procedure did not impair the proportion of full-length clones compared to non-normalized libraries.

Sequence annotations
In order to further analyse our dataset, we compared our contigs to ENSEMBL predicted transcripts. Altogether, 4,437 contigs (52%) and 1,423 singletons (48%) matched 4,083 transcripts from 3,703 ENSEMBL predicted genes (15%). The extent of the underclustering of our ESTs was estimated from these numbers and used to calculate that our whole EST set represents about 6,000 genes. We conclude that our cDNA sequence collection significantly improved annotation of the X. tropicalis genome sequence. Similarly, we compared our dataset to
(contig7735 and contig8467, See Additional file 5). Using alignments on protein and genomic sequences we found 58 cases (such as elrD represented by contig7817, See Additional file 6).
2,402 X. tropicalis RefSeq mRNA sequences. We found that 2,230 contigs (26%) and 484 singletons (16%) matched 1,342 RefSeq entries (56%). These figures suggest that further extensive sequencing of putative full-length cDNA clones from our collection will be of great use in order to cover the entire Xenopus gene set.

We next estimated the proportion of our cDNA sequences representing mRNA molecules produced by a splicing event and hence most likely to correspond to physiological products. We used "exonerate" to compute alignments between cDNA and genomic sequences. We retained only the alignments satisfying the thresholds of 95% identity and 90% coverage. Evidence of splicing was found for 5,025 contigs (65%) out of 7,718 contigs aligned to the genome. From the 2,693 contigs left, only 274 are significantly similar to a protein sequence and it is likely that the others represent 3' untranslated regions, often encoded by a single exon in vertebrates.

Next, two complementary methods were used to find evidence for alternative splicing. Using genomic sequences (see Methods) we predicted 111 cases of alternative splicing, including conserved ones such as Clathrin light chain

Figure 2: Added-value of xthr and xtbs 5'ESTs. 5' cDNA sequences were compared to 4,908 complete X. tropicalis cDNA sequences from XGC and Sanger Institute. When an EST matched unambiguously (>95% id over more than 50 nt on the same orientation) one of these cDNAs, the position of its first residue (X axis) was plotted as a function of the cDNA size (Y axis). Each dot represents the result of an alignment. A position of 0 on the x axis indicates identical 5' ends between the EST and cDNA. Negative values indicate that the EST extends further 5', positive values superior to the cDNA length indicate that the EST extends further 3', and positive values inferior to the cDNA size indicate the 5' EST position relative to the cDNA.

Our set of transcript sequences was annotated using similarity searches in nucleotidic and protein databases and motif searches (Table 2). Of the 8,756 contigs, 62% have more than 70% nucleotidic similarity to previously described X. laevis regular entries, and may be considered as "known" Xenopus genes. Of these sequences, 4,426 had significant similarity to 2,803 protein sequences in the Swiss-Prot database, and 5,506 to 3,571 clusters of the Uniref90 database. We identified 212 sequences corresponding to the Xenopus orthologs of human disease genes (see Additional file 7). Further molecular studies on these genes in Xenopus will be useful for understanding the physiopathology of these diseases.

Putative coding regions were identified using framesearch, and corresponding protein sequences were annotated using InterProScan, allowing for an automatic Gene Ontology annotation (Table 3).

Several known genes specifically expressed in the eye were identified, including different crystallins (beta, gamma and mu isoforms), vsx1 (visual system homeobox 1), pax6 (paired-box protein 6), rdgb (retinal degeneration B homolog) and rgr (RPE-retinal G protein-coupled receptor). Well-characterised central nervous system specific genes were identified as well, notably elrC, mbp (myelin basic protein) and plp (myelin proteolipid protein 1). The corresponding cDNAs will provide useful differentiation markers for X. tropicalis.

A significant number of the contigs (37%) had no significant similarities to previously described genes, and may represent transcribed pseudogenes, non-coding RNA sequences and undescribed genes. Indeed, comparing our sequences to non-coding RNA sequences (microRNA from RFAM, or ncRNA from the H-INV datasets) we found 2 microRNA precursors (contig7127 and 7850 encoding mir-9-1 and mir-124a respectively) as well as E3 (Contig2965) and 5SN4 (Contig5668) snoRNAs. Contig7127 (452 nt) is derived from the assembly of 6 ESTs derived from 3 distinct cDNA clones of the xtbs library. The alignment of the contig7127 sequence on the X. tropicalis genome sequence reveals 100% identity and indicates that one splicing event is required. Thus, contig7127 represents a bona fide neural transcript of the mir-9-1 gene. Contig7850 (800 nt) is derived from the assembly of 10 ESTs derived from 6 cDNA clones (one from xtbs and 5 from xthr libraries). Four of these clones are identical and characterized by a 409 bp cDNA, while two are longer and have their 3' ends ESTs as singletons but mapping to the same scaffold region.
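The splicing-evidence filter described in this section (alignments kept at 95% identity and 90% coverage, then contigs separated according to whether their genomic alignment contains an intron) can be sketched as follows. This is a minimal Python illustration, not the authors' pipeline: the `Alignment` record and its field names are hypothetical stand-ins for values parsed from exonerate output, and coverage is assumed here to mean the aligned fraction of the contig.

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    contig: str        # contig identifier (hypothetical field names)
    identity: float    # percent identity over the aligned region
    aligned_len: int   # number of contig bases included in the alignment
    contig_len: int    # total contig length
    n_exons: int       # aligned blocks separated by introns

MIN_IDENTITY = 95.0    # thresholds quoted in the text
MIN_COVERAGE = 90.0

def coverage(a: Alignment) -> float:
    """Percentage of the contig covered by the alignment (assumed definition)."""
    return 100.0 * a.aligned_len / a.contig_len

def passes_thresholds(a: Alignment) -> bool:
    return a.identity >= MIN_IDENTITY and coverage(a) >= MIN_COVERAGE

def split_by_splicing(alignments):
    """Return (spliced, unspliced) contig names among alignments passing the filter."""
    kept = [a for a in alignments if passes_thresholds(a)]
    spliced = [a.contig for a in kept if a.n_exons > 1]
    unspliced = [a.contig for a in kept if a.n_exons == 1]
    return spliced, unspliced

# Toy alignments with made-up values.
example = [
    Alignment("contigA", 99.2, 780, 800, 3),   # kept, spliced
    Alignment("contigB", 97.0, 450, 460, 1),   # kept, single exon
    Alignment("contigC", 93.5, 500, 510, 2),   # rejected: identity too low
    Alignment("contigD", 98.0, 300, 600, 4),   # rejected: coverage is 50%
]
spliced, unspliced = split_by_splicing(example)
print(spliced, unspliced)
```

Applied to the figures reported in the text, the spliced fraction is 5,025 of 7,718 aligned contigs, i.e. the 65% quoted, leaving 2,693 contigs without evidence of splicing.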
Page 5 of 13 (page number not for citation purposes) BMC Genomics 2007, 8:118 http://www.biomedcentral.com/1471-2164/8/118
Table 2: Results of sequence comparisons

                BLASTX                        BLASTN
                SwissProt      Uniref 90      X. laevis      X. laevis      X. tropicalis  X. tropicalis  X. tropicalis
                                              reg. entries   UniGene        UniGene        genome         cDNA
ND 2034
All hits        4426 51%       5506 63%       5447 62%       7175 82%       7865 90%       8703 99%       3877 44%       All hits
>= 90%          1594 18%       2753 31%       744 8%         864 10%        7199 82%       8371 96%       3660 42%       >95%
70 – 90%        1705 19%       1917 22%       2963 34%       3843 44%       454 5%         302 3%         217 2%         90 – 95%
< 70%           1127 13%       836 10%        1740 20%       2468 28%       212 2%         30 0%          0 0%           < 90%
No similarity   4330 49%       3250 37%       3309 38%       1581 18%       891 10%        53 1%          4879 56%       No similarity
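The rows of Table 2 bin each query sequence by the percent identity of its best hit. A minimal Python sketch of that binning is given below, assuming the 90% and 70% boundaries shown in the left-hand row labels; the function names, the input format and the treatment of values falling exactly on a boundary are assumptions made for illustration.

```python
from collections import Counter

def identity_class(best_identity):
    """Classify a query by the percent identity of its best hit (None = no hit)."""
    if best_identity is None:
        return "No similarity"
    if best_identity >= 90.0:
        return ">= 90%"
    if best_identity >= 70.0:
        return "70 - 90%"
    return "< 70%"

def summarise(best_hits):
    """Count queries per class and give each count as a rounded share of all queries."""
    counts = Counter(identity_class(v) for v in best_hits.values())
    total = len(best_hits)
    return {label: (n, round(100.0 * n / total)) for label, n in counts.items()}

# Toy input: contig name -> best-hit percent identity (hypothetical values).
hits = {"c1": 98.4, "c2": 91.0, "c3": 75.5, "c4": 62.0, "c5": None}
table = summarise(hits)
print(table)
```

The same bookkeeping can be used to check a column of the table: for SwissProt, 1594 + 1705 + 1127 = 4426 hits, and adding the 4330 sequences without similarity gives the 8,756 contigs.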
Table 3: GO Molecular function classification.

Gene ontology term                                        N
All molecular function terms                              1775
->antioxidant activity                                    14
->binding                                                 623
-->calcium ion binding                                    89
-->carbohydrate binding                                   7
-->lipid binding                                          10
-->nucleic acid binding                                   350
--->DNA binding                                           135
---->chromatin binding                                    0
---->transcription factor activity                        31
--->RNA binding                                           40
--->translation factor activity, nucleic acid binding     58
-->nucleotide binding                                     51
-->oxygen binding                                         0
-->protein binding                                        52
--->cytoskeletal protein binding                          25
---->actin binding                                        23
-->receptor binding                                       28
->catalytic activity                                      455
-->hydrolase activity                                     110
--->nuclease activity                                     9
--->peptidase activity                                    66
--->phosphoprotein phosphatase activity                   6
-->kinase activity                                        56
--->protein kinase activity                               33
-->transferase activity                                   112
->chaperone regulator activity                            0
->enzyme regulator activity                               37
->motor activity                                          15
->nutrient reservoir activity                             0
->signal transducer activity                              48
-->receptor activity                                      13
-->receptor binding                                       28
->structural molecule activity                            398
->transcription regulator activity                        40
->translation regulator activity                          58
->transporter activity                                    120
-->electron transporter activity                          39
-->ion channel activity                                   11
-->neurotransmitter transporter activity                  3
->triplet codon-amino acid adaptor activity               0

Expression profiles
Other collections of X. tropicalis ESTs are ongoing [8,17,19] using a variety of cDNA libraries made from adult tissues or embryos at different stages of development. Hence, we undertook an in silico analysis of gene expression profiles estimated from EST counts [20].

In a first analysis, we searched for transcripts identified by ESTs derived predominantly from our cDNA libraries. We identified 99 and 238 cDNAs found prominently in the heads and retinas of tailbuds or brain and spinal cord of tadpole, respectively (see Additional file 8 and Additional file 9) and 25 clones found predominantly in both structures. These clones are likely to represent genes differentially expressed in the retina or the central or peripheral nervous system during metamorphosis. The study of these genes in Xenopus could well improve our knowledge of CNS development and function in vertebrates.

Metamorphosis
In a second analysis, we explored the metamorphosis transcriptomes using expression profiles derived from EST counts. It is known that amphibian metamorphosis brings about unique regulations triggered by thyroid hormones during late vertebrate development, but relatively few genes are characterised as playing regulatory roles in this process [21]. We extracted 4,187 UniGene clusters containing at least one EST from the xtbs cDNA library. Similarly, we fetched data from 592 UniGene clusters containing at least one and up to 132 ESTs from another library derived from a metamorphic stage of development. Combining both sets gives 4,779 UniGene clusters (13% of all clusters). To generate a useful expression matrix, an initial filtration step was performed whereby clusters composed of fewer than 10 ESTs were removed, leading to a set of 3,422 UniGene clusters. We used the GT test [22] to rank profiles in three categories: strong (64 clusters with GT > 0.66), medium (803 clusters with 0.33 < GT < 0.66) or weak (2,555 clusters with GT < 0.33) differential expression. Because UniGene is prone to overclustering, we focused our analysis on the 64 clusters corresponding to genes with a strong differential expression and analysed their expression profiles using hierarchical clustering (Figure 3). Twelve characteristic expression profiles are observed, corresponding to peaks of expression that are tissue-specific (brain, intestine, kidney, heart, lung, skeletal muscle, skin) or stage-specific (egg, tailbud, tadpole, metamorphosis). The corresponding genes are potential differentiation markers that can be useful in developmental studies and can easily be checked by in situ hybridization on embryos. Only one transcript tagged by ESTs derived solely from a metamorphic stage was identified. This transcript codes for preprocaerulein type-4; it is characterized by 40 ESTs derived from 24 cDNA clones issued from 6 libraries made from stage 62 and 64 tadpoles, i.e. representing late metamorphosis stages. Caerulein is a peptide found predominantly in skin secretions. It belongs to the gastrin/cholecystokinin family of neuropeptides, and may play a role as an antimicrobial molecule [23]. This finding is discussed later.

Since there are currently ten times more ESTs in cDNA libraries derived from metamorphic stages of development in X. laevis than in X. tropicalis, we did a similar survey of the expression profiles of transcripts in X. laevis.

We extracted 6,297 UniGene clusters (24% of all clusters) containing at least one and up to 710 ESTs from at least one cDNA library prepared from metamorphic tadpoles. This corresponds to 24,262 ESTs made from four cDNA libraries: limb, tail, intestine and tadpole (NF stage 62). The level of expression of each transcript was estimated by counting the ESTs belonging to the corresponding UniGene cluster. The 26 clusters containing the highest number of ESTs, and hence corresponding to the most highly expressed genes during metamorphosis, are listed in table 4. We expected to find either ubiquitously or differentially expressed categories among these highly abundant transcripts at metamorphosis stages. Indeed, 16 of these 25 UniGene clusters are characterized by a restricted expression in the tail, limb, or heart (table 4). Interestingly, it is known that 11 are expressed in the muscle cells that compose most of the tail and limb. One remarkable case is a gene coding for a protein involved in freeze-tolerance found predominantly in metamorphic limbs, but expressed in a variety of other tissues both embryonic (starting at gastrula stage) and adult (nearly all adult tissues sampled with the exception of ovary, testis and lung). This can be an artefact due to the handling of tissues at the time of RNA extraction. Alternatively, this may reflect the induction by stress-related hormones (glucocorticoids) during metamorphosis.

We then carried out an in silico reconstruction of the transcriptional profile of X. laevis metamorphic genes using the IDEG suite of statistical tests. We removed clusters composed of fewer than 10 ESTs, leading to a set of 3,599 UniGene clusters. The GT test ranked profiles in three categories: strong (167 clusters with GT > 0.66), medium (1,300 clusters with 0.33 < GT < 0.66) or weak (2,132 clusters with GT < 0.33) differential expression.

From the 167 clusters corresponding to genes with a strong differential expression, only 30 are composed of at least 2 ESTs derived solely from metamorphic or adult tissues, four of which bear no similarity to known proteins. We analysed the expression profiles of these metamorphic genes using unsupervised hierarchical clustering. The resulting clusters could be interpreted along the predominant expression domains (see Additional file 10). Three clusters (metamorphic, tadpole and limb) correspond to larval stages of development and are made of eight, three and four genes, respectively (Fig. 4). These genes are promising candidates, potentially playing important roles during this late developmental event. Below, we describe briefly what is known about each of these genes.

A larval beta chain of globin is among the metamorphic cluster together with an alpha chain, an indication of the relevance of our analysis. The comparison of Xl.56714 EST sequences with known proteins shows that they resemble cell surface receptors of the SLAM (Signalling Lymphocytic Activation Molecule) family. The SLAM receptors regulate immune cell activation. Indeed, it is known that immune system remodelling is a major event of metamorphosis [24]. The gene corresponding to Xl.56714 is expressed in metamorphic tadpoles (including tail and intestine) as well as in the adult kidney. However we could not detect significant similarities to any known gene sequences or proteins. Alpha-1 antichymotrypsin (a plasma protease inhibitor) is highly expressed during metamorphosis and found in adult liver. This correlates with the associated stress condition that occurs during tadpole transformations. The gene encoding alpha-2-HS-glycoprotein (also named fetuin) steadily increases in expression from tailbud stage up to metamorphosis. This gene product is secreted in plasma and plays a physiological role during mammalian fetal development, especially in mineralization and growth. A known Xenopus gene encoding a small peptide named PYLa is found exclusively in a cDNA library prepared from stage 62 tadpoles. As for the preprocaerulein transcripts in X. tropicalis, the PYLa transcripts are abundant in metamorphic stages, with ESTs found in limbs and whole tadpoles. Both caerulein and PYLa peptides may be secreted from skin glands and exert antimicrobial activities. This finding corroborates a previous report on caerulein expression [25]. Remarkably, skin glands are known to express a cocktail of signalling peptides, including neuropeptides such as xenopsin, thyrotropin-releasing hormone and PGLa. Whether these peptides play specific roles in the context of metamorphosis is unknown. The cluster Xl.24674 corresponds to a gene resembling uromodulin.
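The filtering and ranking steps used in this section (discard clusters with fewer than 10 ESTs, then label each profile as strong, medium or weak from its GT value) can be sketched in Python as below. The GT statistic itself is taken as an input, since the text obtains it from the IDEG suite; the dictionary layout, the cluster names and the handling of values exactly equal to 0.33 or 0.66 are assumptions made for illustration.

```python
MIN_ESTS = 10                  # clusters with fewer ESTs are removed
STRONG, MEDIUM = 0.66, 0.33    # GT thresholds quoted in the text

def gt_category(gt):
    """Map a GT value to the three categories used in the text."""
    if gt > STRONG:
        return "strong"
    if gt > MEDIUM:
        return "medium"
    return "weak"

def rank_clusters(clusters):
    """clusters: {cluster_id: (total_est_count, gt_value)} -> {category: [ids]}."""
    ranked = {"strong": [], "medium": [], "weak": []}
    for cluster_id, (n_ests, gt) in sorted(clusters.items()):
        if n_ests < MIN_ESTS:
            continue           # initial filtration step
        ranked[gt_category(gt)].append(cluster_id)
    return ranked

# Hypothetical clusters and GT values, for illustration only.
toy = {
    "clusterA": (40, 0.81),
    "clusterB": (12, 0.47),
    "clusterC": (9, 0.90),     # removed: fewer than 10 ESTs
    "clusterD": (150, 0.05),
}
ranked = rank_clusters(toy)
print(ranked)
```

The category sizes reported in the text are consistent with the filtered set sizes: 64 + 803 + 2,555 = 3,422 clusters for X. tropicalis and 167 + 1,300 + 2,132 = 3,599 clusters for X. laevis.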
Figure 3: Digital expression profiles of X. tropicalis transcripts differentially expressed at metamorphosis. Each line gives the expression profile of a given transcript represented by a UniGene cluster. The expression is deduced from counting the occurrence of ESTs derived from a given cDNA library. The level of expression is colour coded in blue shades: dark blue means evidence for high levels of transcripts and white means no evidence for expression. On the left, clusters of expression profiles are delineated by a vertical bar labelled with the associated characteristic domain of expression. On the right, the cluster name and its annotation (i.e. the corresponding gene product description as deduced from sequence similarity analysis) are given. Each column corresponds to a category of tissue or stage of development: 8 adult tissues and 6 stages of development. Note that a given category may correspond to several cDNA libraries. Here, only clusters showing evidence of differential expression were used to build the matrix of expression. This matrix was analysed by hierarchical clustering on the expression profile dimension using CLUSTER 3.0 as described in the methods section.

Columns (left to right): adipose tissue, brain, digestive, genitourinary, head, heart, lung, skeletal muscle, skin, egg, gastrula, neurula, tailbud, tadpole, metamorphosis.

Expression domains labelled on the left (top to bottom): skin, kidney, intestine, heart, tailbud, tadpole, metamorphosis, muscle, lung, brain.

Rows (top to bottom):
Str.20603 keratin 5
Str.40562 keratin 14
Str.49404 keratin 5
Str.41834 keratin 14
Str.51481 transcribed locus
Str.41889 26 serine protease
Str.41838 transcribed locus
Str.51637 acyl-CoA dehydrogenase
Str.29235 G protein pathway suppressor 2
Str.4962 slc7a7 Solute carrier family 7
Str.8295 elastase I
Str.5512 ctrb2: chymotrypsinogen B2
Str.26623 lactotransferrin
Str.26669 arginosuccinate synthetase (ass)
Str.38826 apolipoprotein A-I precursor
Str.7018 frzb: Frizzled-related protein (frzb)
Str.8426 myl3: Myosin, light polypeptide 3
Str.24081 transcribed locus
Str.2078 NADH dehydrogenase
Str.49042 cytochrome-c oxidase
Str.21355 progonadoliberin-2 precursor
Str.22092 transcribed locus
Str.49431 preprocaerulein type-4
Str.5804 kinesin light chain
Str.17081 myosin heavy chain, fast skeletal muscle, embryo
Str.8341 collagen, type V, alpha 3
Str.21728 transcribed locus
Str.5815 DNA-binding protein A; Y box protein 3
Str.35909 myosin, heavy polypeptide 2, fast muscle
Str.37351 ribosomal protein L3-like
Str.27777 mogat2: Monoacylglycerol O-acyltransferase 2
Str.3439 eukaryotic translation initiation factor 2-alpha
Str.21815 erythrocyte membrane protein band 4.9 (dematin)
Str.28019 discs, large homolog 4
Str.14347 RAB40B, member RAS oncogene family
Str.21084 C14orf37
Str.21253 transcribed locus
Str.52283 Rho-related GTP-binding protein RhoI
Str.21114 neurensin 1
Str.21299 phytanoyl-CoA 2-hydroxylase interacting protein
Str.20386 myelin basic protein
Str.14909 ADP-ribosylation factor 3
Str.28128 microtubule-associated protein 2
Str.21155 neuronal pentraxin II
Str.21068 transcribed locus
Str.21392 transmembrane protein 59
Str.28868 centaurin gamma 1
Str.21176 neurexin 1, isoform alpha
Str.28022 fibrinogen C domain containing 1
Str.20925 protein tyrosine phosphatase, receptor type, N
Str.21977 kinesin-like protein KIF1A
Str.15361 arrestin beta 1, isoform B
Str.38310 latrophilin 1
Str.20880 contactin associated protein 1
Str.21943 transcribed locus
Str.21088 EPH receptor A8
Str.21930 transcribed locus
Str.8446 EF-hand calcium binding protein 1
Str.38334 lysosomal associated protein transmembrane 4 beta
Str.49508 myelin proteolipid protein
Str.26607 C2orf21
Str.21007 glutamate receptor, ionotrophic, AMPA 4
Str.38154 calcium channel, voltage-dependent, alpha 2/delta 1
Str.5352 prickle-like 2
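The clustering step named in the caption can be illustrated with a small, dependency-free Python sketch. It assumes the settings given in the Methods (mean-centred profiles, absolute-correlation similarity, and "complete clustering" read here as complete linkage) and uses a naive agglomerative loop that stops at a requested number of groups. It is not the CLUSTER 3.0 implementation, and the profiles and cluster names are made up.

```python
from math import sqrt

def mean_center(profile):
    m = sum(profile) / len(profile)
    return [x - m for x in profile]

def abs_corr_distance(p, q):
    """1 - |Pearson r| between two profiles; 0 means perfectly (anti)correlated."""
    p, q = mean_center(p), mean_center(q)
    num = sum(a * b for a, b in zip(p, q))
    den = sqrt(sum(a * a for a in p) * sum(b * b for b in q))
    return 1.0 if den == 0 else 1.0 - abs(num / den)

def complete_linkage(profiles, n_groups):
    """Merge the two closest groups until n_groups remain.
    Distance between groups = largest pairwise distance between their members."""
    groups = [[name] for name in profiles]
    while len(groups) > n_groups:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = max(abs_corr_distance(profiles[a], profiles[b])
                        for a in groups[i] for b in groups[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return [sorted(g) for g in groups]

# EST counts per tissue/stage category (columns); hypothetical values.
profiles = {
    "skinA":  [0, 0, 9, 1],
    "skinB":  [1, 0, 20, 2],
    "brainA": [12, 1, 0, 0],
    "brainB": [30, 2, 1, 0],
}
groups = complete_linkage(profiles, 2)
print(groups)
```

One consequence of the absolute-correlation metric is that strongly anti-correlated profiles are treated as close, so the choice of metric matters when peaks in different tissues are to be kept apart.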
Table 4: Highly expressed metamorphic transcripts.
Columns: UniGene ID, Met ESTs, Total ESTs, cDNA source, Note, Restricted expression, Description.
[Table body not recoverable from the source layout.]
cDNA source abbreviations: b: brain; et: embryonic tissue; ey: eye; fb: fat body; h: head; he: heart; k: kidney; l: limb; li: liver; lu: lung; ot: other; ov: ovary; s: spleen; sk: skin; t: tail; te: testis; th: thymus; wb: whole body.
Note codes: U: ubiquitous; D: no differential; N: no restricted expression; R: restricted expression.
Figure 4: Digital expression profiles of X. laevis transcripts differentially expressed at metamorphosis. Using the same representation as in Fig. 3, three clusters are depicted that are associated with differential expression at metamorphosis (top cluster), tadpole stage (middle) and in the forming limb (bottom).

Columns (left to right): animal cap, brain, eye, fat body, genitourinary, head, heart, limb, liver, lung, lymphoreticular, skin, female germ line, blastula, gastrula, gastrula/neurula cusp, neurula, tailbud, tadpole, metamorphosis.

metamorphosis
Xl.56714 transcribed locus cDNA clone IMAGE:6871962
Xl.23966 Alpha-1-antichymotrypsin precursor
Xl.24530 Xl Larval beta II globin
Xl.1127 Xl Hemoglobin alpha chain
Xl.24560 alpha-2-HS-glycoprotein
Xl.144 Xl PYLa precursor
Xl.24674 uromodulin
Xl.26509 Xl stomatin

tadpole
Xl.24591 Xl carboxyl ester lipase
Xl.24594 mucin 2 precursor, intestinal
Xl.26001 carboxypeptidase B

limb
Xl.51631 Keratin, type II cytoskeletal 1
Xl.57017 transcribed locus cDNA clone XL081o03
Xl.57064 transcribed locus cDNA clone IMAGE:8636
Xl.34936 Xl preprocaerulein type I
Corresponding transcripts are found in metamorphic intestine and whole tadpole. In mammals, uromodulin is excreted in urine and plays a role in the cellular defense response. A gene encoding a stomatin homolog is highly expressed in intestine during metamorphosis. Stomatin is a membrane protein regulating cation exchange and cytoskeletal attachment. Among the genes represented in the metamorphic limb cluster are a keratin and two clusters (Xl.57017 and Xl.57064) annotated as lacking significant similarities to known proteins. In the tadpole cluster, a carboxyl ester lipase is found expressed in tadpole and in metamorphic intestine. Mucin 2 is another gut protein highly expressed in tadpole, as well as carboxypeptidase.

Taken together, these expression profiles, based on EST counts, reveal certain genes that are up-regulated during metamorphosis and are possible targets of the thyroid hormone signalling pathway.

Polymorphisms
The cDNA sequences we produced are derived from the Adiopodoume strain of X. tropicalis, originating from the Ivory Coast. The genomic and most other cDNA sequencing efforts are made on the N strain from Nigeria or a distinct IC strain from Ivory Coast. We therefore looked for polymorphisms that could be used in genetic mapping experiments or to discriminate them from mutations obtained by ENU mutagenesis. We identified 8 SNPs derived from mitochondrial genes, three of which (snp1A, snp2A, snp6G) are specific to the Adiopodoume strain (see Additional file 11). The presence of shared alleles for 5 SNPs indicates the close relationship with the N strain as already reported by Evans et al. 2004. We searched for novel polymorphism markers made of di, tri, tetra and pentanucleotide sequence repeats present in our EST collection. We found from two to ten alleles in 225 markers derived from 212 contigs/ESTs clusters. A subset of 107 markers are potentially highly informative since two or more alleles are observed at high frequencies (see Additional file 12). The dinucleotide repeats AT and TA are the most common, accounting for 137 markers. These intragenic markers should be useful once placed on a genetic linkage map.

This dataset will provide an invaluable tool for exon definition when the X. tropicalis genome sequence is finally determined. The results presented here are available through a database on our web site [26]. Users can carry out BLAST and other searches based on GO classification, InterproScan results, and expression information. The cDNA sequences have been deposited with Genbank/EMBL/DDBJ (accession numbers CN072222 – CN121006) and clones are available upon request.

Conclusion
Large-scale cDNA sequencing has provided invaluable resources to decipher vertebrate genome structure and
function. Recent studies on cDNA sequencing with deep coverage provide fundamental knowledge on the complexity of transcriptomes in mammals [27]. Here, we provide information on the transcript sequence and expression of an estimated 6,000 genes in X. tropicalis. A web resource [26] is available with associated annotations. The genetic resources stemming from the cDNA sequencing project described here can be used in diverse research projects, including vertebrate comparative genomics, studies on evolution and development, cell biology and developmental genetics. More specifically, studies of retinogenesis and of the remodelling of the central nervous system during metamorphosis will benefit from this cDNA resource.

We are currently undertaking full cDNA insert sequencing for a set of non-redundant clones, as well as characterizing their expression using a whole-mount in situ hybridization screen [28,29]. The genomic resources developed to study X. tropicalis biology are crucial to explore amphibian physiology and genetics, this model system providing excellent characteristics for addressing key questions related to anuran metamorphosis and its associated regulatory processes.

Methods
Embryo and tissue dissection
Embryos of Xenopus tropicalis Adiopodoumé strain were obtained from parents originating from the Geneva collection [30].

From each of the two libraries made, 58,368 clones were picked, arrayed in microtiter plates and gridded on high-density nylon filters. A sample of 1,989 cDNA clones from the xthr library and 1,694 clones from xtbs were partially sequenced to obtain 4,120 ESTs. This step provided a quality assessment of the two libraries, showing the absence of clones of bacterial origin and few ribosomal (0.45% in xthr, 0% in xtbs) and mitochondrial contaminants (1.2 and 3.6% in xthr and xtbs, respectively). This procedure provided information on overrepresented clones, which were then removed before further sequencing of up to 30,000 clones. These ESTs were grouped into 1,985 clusters.

Normalization of cDNA libraries
Two pools of 25 and 28 oligonucleotide probes of 35 nt in length were labelled using Terminal Transferase (New England Biolabs) and P33-dATP (Amersham). Labelled oligonucleotides were hybridised on two high-density filters representing the xthr and xtbs libraries as in [31]. Positive clones were identified using X-Digitize software on images acquired using a phosphorimager.

High-throughput sequencing, assembling
The reactions were performed with a Big-Dye terminator cycle sequencing kit and analyzed by ABI-3700 and ABI-3730. Sequences were base-called using PHRED, then trimmed using LUCY and custom perl scripts. Sequences shorter than 100 bp were discarded, as well as those identified as derived from ribosomal and mitochondrial RNAs. PHRAP was used to assemble the sequences taking into account quality scores; further clustering was obtained by scaffolding using mate-pair information. We retained scaffolds only if two clone links were available (excepting singletons) and if the orientation of the reads was consistent.

Annotation
Repetitive sequences were masked using CENSOR. Contigs and singletons were used as queries in BLASTX and BLASTN searches of Swissprot, Uniprot, Unigene (rel 70) and Xenopus tropicalis JGI assembly v4.1 databases on the INFOBIOGEN server. The February 2005 release of ENSEMBL was used. Framefinder was used to identify coding sequences, and protein domains were searched using INTERPROSCAN. The results of sequence assembly, scaffolding and annotation were loaded on a custom-made mySQL database. The web interface was developed using PHP scripts.

Assessment of the fraction of clones likely to be full-length
5'ESTs or contigs were compared to full-length cDNAs using BLASTN. Alignments longer than 50 nt and exhibiting more than 95% identities were selected. Overlap between query and subject sequences was scored only when the alignment encompassed up to the last nucleotide at the 5' or 3' end of the sequences.

Alternative splicing
Alignments between cDNA and genomic sequences (masked) were computed using exonerate. Evidence for alternative splicing was found when two contigs were aligned to the same genomic region with at least 95% mean overlap but with a different number of exons.

Alignments between cDNA and UNIREF protein sequences were computed using BLASTX. Alignments characterized by a gap (at least 10 aa) introduced in the contig sequence were retrieved. Alternative splicing was considered present if the contig sequence including the gap could be aligned to the genomic sequence.

Expression profiles
EST counts were downloaded from UniGene (release 70 for X. laevis and 32 for X. tropicalis). Relevant profiles were extracted using custom perl scripts. The GT test was run using the IDEG6 software [22]. Single hierarchical clustering was performed using Cluster 3.0 software [32]. We used
absolute correlation similarity metrics followed by complete clustering on mean centered gene expression profiles. Results were visualised using TreeView.

Polymorphisms
Mitochondrial SNPs were identified on a collection of mitochondrial ESTs downloaded from dbEST, JGI [33] and from our own set. These ESTs were assembled using CAP3 and analyzed using visualization software [34]. Microsatellites were identified using custom perl scripts.

Authors' contributions
RT, LC and MP carried out laboratory and data analysis. AF wrote and ran the EST processing pipeline, including EST assembly and annotation, and is responsible for the web-available database. G.G., J.W. and P.W. managed and conducted the sequencing experiments. BD, MW and AM participated in the coordination of the study. NP participated in the conception and design of the study, carried out laboratory and data analysis and drafted the manuscript.

Additional material

Additional File 1
Oligonucleotides used for normalization. The table provided lists the oligonucleotide identifier, corresponding gene, number of corresponding ESTs, oligonucleotide sequence and Tm.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S1.xls]

Additional File 2
Top 50 of contigs according to the number of constituent ESTs. The table provided lists the Contig identifier, the number of constituent sequence reads, the contig length and the description of the best Swissprot hit identified by blastx.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S2.xls]

Additional File 3
Analysis of X. tropicalis cDNA libraries complexity. The data provided represent the description and complexity of X. tropicalis cDNA libraries sampled by more than 20,000 ESTs.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S3.xls]

Additional File 4
Top 50 of contigs according to their size. The table provided lists the Contig identifier, the contig length and the description of the best Swissprot hit identified by blastx.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S4.xls]

Additional File 5
Alternative splicing detected by a different number of exons in contigs aligned in the same genomic region. The table provided lists the Contig identifier, the number of constituent exons, the description of the best protein hit identified by blastx. Each line describes one alternative splicing case.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S5.xls]

Additional File 6
Alternative splicing detected by a gap in the alignment against UniRef100. The table provided lists the Contig identifier, the identifier of the protein evidencing an alternative splicing event, and the identifier of the protein showing the highest similarity by blastx. Each line describes one alternative splicing case.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S6.xls]

Additional File 7
Xenopus tropicalis genes related to human disease genes. The table provided lists the human disease trait, corresponding human gene name, chromosomal location, OMIM identifier and the corresponding Xenopus tropicalis Contig identifier.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S7.xls]

Additional File 8
cDNA clones found specifically in library xthr. The table provided lists the Contig identifier and the description of the best protein hit identified by blastx.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S8.xls]

Additional File 9
cDNA clones found specifically in library xtbs. The table provided lists the Contig identifier and the description of the best protein hit identified by blastx.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S9.xls]

Additional File 10
Digital expression profiles of X. laevis transcripts. Using the same formalism as in Fig. 3, all X. laevis Unigene clusters associated with differential expression at metamorphosis are depicted.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S10.eps]

Additional File 11
Mitochondrial SNPs. The table lists the occurrence of given alleles of mitochondrial SNPs in the adiopodoume and N strains, in association with the corresponding mitochondrial gene.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S11.xls]
Page 12 of 13 (page number not for citation purposes) BMC Genomics 2007, 8:118 http://www.biomedcentral.com/1471-2164/8/118
Shin T, Steptoe M, Swaller T, Theising B, Underwood K, Wylie T, Additional File 12 Yount T, Wilson R, Waterston R: An encyclopedia of mouse genes. Nat Genet 1999, 21(2):191-194. Highly informative intragenic microsatellite markers. The table lists 12. Wei C, Brent MR: Using ESTs to improve the accuracy of de allelic data for a set of intragenic microsatellite markers, including the novo gene prediction. BMC Bioinformatics 2006, 7:327. Contig ID, corresponding UniGene cluster ID, number of alleles, type of 13. Okubo K, Hori N, Matoba R, Niiyama T, Fukushima A, Kojima Y, Mat- microsatellite. In bold case are figured contig/ESTs/UG clusters for which subara K: Large scale cDNA sequencing for analysis of quanti- at least two alleles have a frequency higher than the mean (calculated as tative and qualitative aspects of gene expression. Nat Genet 1992, 2(3):173-179. the total number of ESTs divided by the number of alleles) or higher than 14. Gomez SM, Eiglmeier K, Segurens B, Dehoux P, Couloux A, Scarpelli 33%. Alleles number and frequency are shown in bold if the frequency is C, Wincker P, Weissenbach J, Brey PT, Roth CW: Pilot Anopheles higher than the mean or higher than 33%. A * is indicative of more than gambiae full-length cDNA study: sequencing and initial char- one repeat polymorphism observed for that cluster. acterization of 35,575 clones. Genome Biol 2005, 6(4):R39. Click here for file 15. Bonaldo MF, Lennon G, Soares MB: Normalization and subtrac- [http://www.biomedcentral.com/content/supplementary/1471- tion: two approaches to facilitate gene discovery. Genome Res 1996, 6(9):791-806. 2164-8-118-S12.xls] 16. Ewing B, Green P: Analysis of expressed sequence tags indi- cates 35,000 human genes. Nat Genet 2000, 25(2):232-234. 17. Gilchrist MJ, Zorn AM, Voigt J, Smith JC, Papalopulu N, Amaya E: Defining a large set of full-length clones from a Xenopus tropicalis EST project. Dev Biol 2004, 271(2):498-516. Acknowledgements 18. Wellcome X. 
tropicalis Full-Length Database [http://infor We thank L. Du Pasquier for the gift of X. tropicalis animals and his contin- matics.gurdon.cam.ac.uk/online/xt-fl-db.html] uous support. This research was funded by grants from l'Association pour 19. Klein SL, Strausberg RL, Wagner L, Pontius J, Clifton SW, Richardson P: Genetic and genomic tools for Xenopus research: The la Recherche contre le Cancer, le Centre National de la Recherche Scien- NIH Xenopus initiative. Dev Dyn 2002, 225(4):384-391. tifique, le Ministère de l'Education, de la Recherche (French Xenopus Stock 20. Ewing RM, Ben Kahla A, Poirot O, Lopez F, Audic S, Claverie JM: Center) et de la Technologie and the University of Paris Sud. Large-scale statistical analyses of rice ESTs reveal correlated patterns of gene expression. Genome Res 1999, 9(10):950-959. 21. Tata JR: Amphibian metamorphosis as a model for the devel- References opmental actions of thyroid hormone. Mol Cell Endocrinol 2006, 1. Amaya E, Offield MF, Grainger RM: Frog genetics: Xenopus trop- 246(1-2):10-20. icalis jumps into the future. Trends Genet 1998, 14(7):253-255. 22. Romualdi C, Bortoluzzi S, Danieli GA: Detecting differentially 2. Bisbee CA, Baker MA, Wilson AC, Haji-Azimi I, Fischberg M: Albu- expressed genes in multiple tag sampling experiments: com- min phylogeny for clawed frogs (Xenopus). Science 1977, parative evaluation of statistical tests. Hum Mol Genet 2001, 195(4280):785-787. 10(19):2133-2141. 3. Evans BJ, Kelley DB, Tinsley RC, Melnick DJ, Cannatella DC: A mito- 23. Gibson BW, Poulter L, Williams DH, Maggio JE: Novel peptide chondrial DNA phylogeny of African clawed frogs: phyloge- fragments originating from PGLa and the caerulein and ography and implications for polyploid evolution. Mol xenopsin precursors from Xenopus laevis. J Biol Chem 1986, Phylogenet Evol 2004, 33(1):197-213. 261(12):5341-5349. 4. Evans BJ, Kelley DB, Melnick DJ, Cannatella DC: Evolution of RAG- 24. 
Izutsu Y, Tochinai S, Maeno M, Iwabuchi K, Onoe K: Larval antigen 1 in polyploid clawed frogs. Mol Biol Evol 2005, 22(5):1193-1207. molecules recognized by adult immune cells of inbred Xeno- 5. Khokha MK, Chung C, Bustamante EL, Gaw LW, Trott KA, J. Y, Lim pus laevis: partial characterization and implication in meta- N, Lin JC, Taverner N, Amaya E, Papalopulu N, Smith JC, Zorn AM, morphosis. Dev Growth Differ 2002, 44(6):477-488. Harland RM, Grammer TC: Techniques and probes for the study 25. Seki T, Kikuyama S, Yanaihara N: Development of Xenopus laevis of Xenopus tropicalis development. Dev Dyn 2002, skin glands producing 5-hydroxytryptamine and caerulein. 225:499-510. Cell Tissue Res 1989, 258(3):483-489. 6. Richardson P, Chapman J: The Xenopus tropicalis genome 26. Xtscope [http://indigene.ibaic.u-psud.fr/EST] project. Current Genomics 2003, 4:645-652. 27. Maeda N, Kasukawa T, Oyama R, Gough J, Frith M, Engstrom PG, 7. Graf JD, Kobel HR: Genetics of Xenopus laevis. Methods Cell Biol Lenhard B, Aturaliya RN, Batalov S, Beisel KW, Bult CJ, Fletcher CF, 1991, 36:19-34. Forrest AR, Furuno M, Hill D, Itoh M, Kanamori-Katayama M, 8. Morin RD, Chang E, Petrescu A, Liao N, Griffith M, Chow W, Kirk- Katayama S, Katoh M, Kawashima T, Quackenbush J, Ravasi T, Ring patrick R, Butterfield YS, Young AC, Stott J, Barber S, Babakaiff R, BZ, Shibata K, Sugiura K, Takenaka Y, Teasdale RD, Wells CA, Zhu Dickson MC, Matsuo C, Wong D, Yang GS, Smailus DE, Wetherby Y, Kai C, Kawai J, Hume DA, Carninci P, Hayashizaki Y: Transcript KD, Kwong PN, Grimwood J, Brinkley CP 3rd, Brown-John M, Red- annotation in FANTOM3: mouse gene catalog based on dix-Dugue ND, Mayo M, Schmutz J, Beland J, Park M, Gibson S, Olson physical cDNAs. PLoS Genet 2006, 2(4):e62. T, Bouffard GG, Tsai M, Featherstone R, Chand S, Siddiqui AS, Jang 28. 
Pollet N, Muncke N, Verbeek B, Li Y, Fenger U, Delius H, Niehrs C: W, Lee E, Klein SL, Blakesley RW, Zeeberg BR, Narasimhan S, Wein- An atlas of differential gene expression during early Xenopus stein JN, Pennacchio CP, Myers RM, Green ED, Wagner L, Gerhard embryogenesis. Mech Dev 2005, 122(3):365-439. DS, Marra MA, Jones SJ, Holt RA: Sequencing and analysis of 29. Pollet N, Schmidt HA, Gawantka V, Vingron M, Niehrs C: Axeldb: a 10,967 full-length cDNA clones from Xenopus laevis and Xenopus laevis database focusing on gene expression. Nucleic Xenopus tropicalis reveals post-tetraploidization transcrip- Acids Res 2000, 28(1):139-140. tome remodeling. Genome Res 2006, 16(6):796-803. 30. Rungger D: Xenopus helveticus, an endangered species? Int J 9. Adams MD, Kerlavage AR, Fleischmann RD, Fuldner RA, Bult CJ, Lee Dev Biol 2002, 46(1):49-63. NH, Kirkness EF, Weinstock KG, Gocayne JD, White O, et al.: Initial 31. Bulle F, Chiannilkulchai N, Pawlak A, Weissenbach J, Gyapay G, Guel- assessment of human gene diversity and expression patterns laen G: Identification and chromosomal localization of human based upon 83 million nucleotides of cDNA sequence. Nature genes containing CAG/CTG repeats expressed in testis and 1995, 377(6547 Suppl):3-174. brain. Genome Res 1997, 7(7):705-715. 10. Houlgatte R, Mariage-Samson R, Duprat S, Tessier A, Bentolila S, 32. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis Lamy B, Auffray C: The Genexpress Index: a resource for gene and display of genome-wide expression patterns. Proc Natl discovery and the genic map of the human genome. Genome Acad Sci U S A 1998, 95(25):14863-14868. Res 1995, 5(3):272-304. 33. JGI X. tropicalis v4.1 home [http://genome.jgi-psf.org/Xentr4/ 11. Marra M, Hillier L, Kucaba T, Allen M, Barstead R, Beck C, Blistain A, Xentr4.download.html] Bonaldo M, Bowers Y, Bowles L, Cardenas M, Chamberlain A, Chap- 34. 
SNP/INDEL Discovery Pipeline based on CAP3 assembly pell J, Clifton S, Favello A, Geisel S, Gibbons M, Harvey N, Hill F, Jack- [http://cgpdb.ucdavis.edu/SNP_Discovery/] son Y, Kohn S, Lennon G, Mardis E, Martin J, Mila L, McCann R, Morales R, Pape D, Person B, Prange C, Ritter E, Soares M, Schurk R,
2.2 XTScope: Xenopus tropicalis EST, a web resource for the nervous system
In recent years Xenopus tropicalis has emerged as a good candidate for vertebrate comparative genomics, due to a simpler diploid genome and a shorter generation time compared to Xenopus laevis, the classic model for developmental biology. To explore amphibian genome characteristics, the genome of X. tropicalis is currently being sequenced and more than 1 million ESTs coming from several tissues and developmental stages are available at NCBI. However, up until now, the nervous system has not been completely covered by these EST libraries.
The previous section presents the EST project carried out to study the nervous system of X. tropicalis from a biological point of view. In such a project, the bioinformatic tasks are essential to derive useful biological information from the original raw data (chromatograms). As described previously, ESTs were assembled to create a gene index, and the assembled cDNA sequences were then annotated to identify the corresponding genes. This information also needs to be available to the scientific community. The collected information has therefore been stored in a public database accessible through a web application, XTScope, a web resource that provides information on expressed sequence data from the tissues studied in this project. This section first describes the database content and data production, i.e. the information stored in XTScope and how it was obtained from the ESTs. It then presents the implementation and architecture of XTScope. Finally, the web interface is described to show the utility of this tool. XTScope is freely accessible at http://indigene.ibaic.u-psud.fr/EST.
Database content and data production
XTScope contains the following information: (i) 11,738 tentative consensus (TC) sequences assembled from 48,785 ESTs; (ii) clusters (scaffolds) of TCs based on clone links; (iii) alignments against the genome sequence; (iv) functional annotations based on BLAST similarity searches; (v) open reading frame predictions; (vi) protein domain predictions and Gene Ontology (GO) annotations.
This information was entirely derived from the raw EST data collected from two libraries covering two developmental stages: embryogenesis and metamorphosis. The first library was derived from dissected retinas and heads of young tadpoles, while the second library was made from central nervous systems (brains and spinal cords) of metamorphosing tadpoles.
Several bioinformatic steps are needed to obtain useful information from ESTs (see chapter 1). The process can be divided into three steps: (1) EST preprocessing, to extract the high-quality regions from the raw data and remove the nucleotides that are not related to the mRNA sequence of interest, such as vector, mitochondrial and polyA sequences; (2) EST assembly, to reconstruct the original transcript sequences, which produces tentative consensus (TC) sequences; and (3) TC annotation, to identify the corresponding genes and extract as much information as possible.
I implemented a specific workflow to extract the EST sequences from the chromatograms and subsequently reconstruct the corresponding mRNA sequences, including the sequence annotation step (see Figure 1).
EST preprocessing
The sequencing step generated 57,874 reads (chromatograms) from the two libraries. To determine the sequence from the chromatograms, Phred (Ewing et al., 1998a; Ewing et al., 1998b) was used for base-calling (see chapter 1). The “called” sequences are subject to errors and contain both high- and low-quality regions. Since Phred only removes low-quality regions from the ends, and such regions may still remain in the middle of sequences, LUCY (Chou and Holmes, 2001) was used to identify the largest high-quality regions. LUCY was also used to identify and remove the vector sequence. PolyA tails were removed by a custom Perl script, and sequences shorter than 100 bp were discarded, as were those identified as derived from ribosomal and mitochondrial RNAs. This step resulted in 48,785 high-quality sequences that were used in the assembly step.
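The last filtering operations described above (polyA removal and the 100 bp length cutoff) can be illustrated with a minimal Python sketch. This is not the custom Perl script used in this work: the function names, the minimum tail length of 10 bases and the `contaminant_ids` argument (standing in for reads flagged as ribosomal or mitochondrial) are assumptions made for illustration only.

```python
import re

MIN_LENGTH = 100  # sequences shorter than this are discarded (value from the text)
MIN_TAIL = 10     # assumed minimum run of A (or T) treated as a polyA/polyT tail


def trim_poly_a(seq, min_tail=MIN_TAIL):
    """Remove a trailing polyA run or a leading polyT run of at least min_tail bases."""
    seq = seq.upper()
    seq = re.sub(r"A{%d,}$" % min_tail, "", seq)
    seq = re.sub(r"^T{%d,}" % min_tail, "", seq)
    return seq


def preprocess(reads, contaminant_ids=frozenset()):
    """Return {read_id: trimmed_sequence} for the reads that pass all filters.

    reads: dict mapping a read identifier to its quality-trimmed sequence.
    contaminant_ids: identifiers of reads flagged elsewhere as ribosomal or
    mitochondrial (hypothetical input, not computed here).
    """
    kept = {}
    for read_id, seq in reads.items():
        if read_id in contaminant_ids:
            continue
        trimmed = trim_poly_a(seq)
        if len(trimmed) >= MIN_LENGTH:
            kept[read_id] = trimmed
    return kept
```

In this sketch the length filter is applied after tail removal, so that a read consisting mostly of polyA is discarded rather than kept on the strength of its tail.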
EST assembly
An EST represents only a small part of a given transcript, so the assembly step attempts to reconstruct the original transcript sequence (see chapter 1). High-quality EST sequences from both libraries were assembled together to build tentative consensus sequences using the PHRAP assembler (http://www.phrap.org). This step generated 8,756 contigs and 2,982 singletons. In theory, a contig should represent an individual gene, but reads from the same clone may fall into different TCs during assembly. To correct for this and to group together the sequences coming from the same gene, the TCs were further grouped by clone links into 6,547 unique groups (scaffolds), that is, by scaffolding using mate-pair information. We retained scaffolds only if two clone links were available (excluding singletons) and if the orientation of the reads was consistent. Inconsistent links are also kept in the database and are displayed on the website as “gap problem”.
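The grouping of TCs by clone links can be sketched as a connected-components problem. The Python example below is a simplified illustration, not the procedure actually implemented for XTScope: it assumes that two TCs are joined when at least `min_links` clones have reads in both, and it ignores the read-orientation check and the special handling of singletons mentioned above. All identifiers are hypothetical.

```python
from collections import Counter, defaultdict


def build_scaffolds(read_to_tc, read_to_clone, min_links=2):
    """Group TCs that share at least min_links clones; return sorted groups."""
    # Collect the TCs touched by the reads of each clone.
    clone_tcs = defaultdict(set)
    for read, tc in read_to_tc.items():
        clone_tcs[read_to_clone[read]].add(tc)

    # Count how many clones link each pair of TCs.
    link_counts = Counter()
    for tcs in clone_tcs.values():
        ordered = sorted(tcs)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                link_counts[(first, second)] += 1

    # Union-find over TCs, merging pairs with enough supporting clones.
    parent = {tc: tc for tc in read_to_tc.values()}

    def find(tc):
        while parent[tc] != tc:
            parent[tc] = parent[parent[tc]]  # path halving
            tc = parent[tc]
        return tc

    for (first, second), count in link_counts.items():
        if count >= min_links:
            parent[find(first)] = find(second)

    groups = defaultdict(set)
    for tc in parent:
        groups[find(tc)].add(tc)
    return sorted(sorted(group) for group in groups.values())
```

With this representation, a TC that is never linked to another one simply forms a group of size one.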
TC annotation
Functional annotation was performed on both contigs and singletons to identify the corresponding genes and determine their putative function. It is based on similarities (E-value ≤ 0.001) with known proteins, found by BLAST (Altschul et al., 1990; Altschul et al., 1997) searches against the Swiss-Prot and TrEMBL (Boeckmann et al., 2003) and the UniRef90 and UniRef100 (Wu et al., 2006) databases. Before any search, contigs and singletons were masked for repetitive sequences using CENSOR (Kohany et al., 2006).
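As an illustration of the E-value threshold, the following Python sketch keeps the best-scoring hit of each query among those with an E-value of at most 0.001. It assumes BLAST tabular output with the standard 12 columns (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score); the annotation pipeline described here parsed full BLAST reports instead, so this is an equivalent sketch and not the code that was used.

```python
E_VALUE_CUTOFF = 1e-3  # threshold stated in the text


def best_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Return {query: (subject, evalue, bitscore)} for the best significant hit.

    tabular_lines: iterable of tab-separated lines in the 12-column BLAST
    tabular format (an assumption of this sketch).
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > cutoff:
            continue  # similarity not considered significant
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```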
Similarity with known proteins also indicates the putative coding regions (ORFs) of TCs, but the exact region cannot be determined by this method. For this task we used Framefinder (Slater, 2000), a tool based on hexamer frequency statistics. Predicted coding regions were further annotated to identify protein domains, which is useful for contigs with low or no similarity to known proteins. We used InterPro (Mulder et al., 2005) with the tool InterProScan (Quevillon et al., 2005), since it integrates the major protein signature databases into a single resource, such as PROSITE (Hulo et al., 2006), which uses regular expressions and profiles, ProDom (Bru et al., 2005), which uses automatic sequence clustering, and Pfam (Finn et al., 2006), which uses hidden Markov models (HMMs). InterPro also associates protein domains with Gene Ontology (GO) categories.
Programs like Framefinder are useful to predict coding regions when no known homologues are available, but they need a training set, which is a limitation for organisms lacking annotated sequence data. More recently we used TargetIdentifier (Min et al., 2005) to predict coding regions. This algorithm is based on BLASTX alignments with TrEMBL.
Masked TCs (contigs and singletons) were aligned against the genomic sequence of Xenopus tropicalis (JGI assembly v4.1) to reconstruct the respective gene models. We used a tool suited to this purpose, EXONERATE (Slater and Birney, 2005), the program used by Ensembl. The BLASTN alignment against the genomic sequence is also available in XTScope for information.
Figure 1: EST preprocessing and annotation workflow.
Putative cases of alternative splicing, obtained by two different methodologies, are stored in XTScope. First, to detect alternative forms of the same gene expressed in the nervous system, we used the alignments between TCs and the (masked) genomic sequences. Evidence for alternative splicing was found when two TCs aligned to the same genomic region with at least 95% mean overlap but with a different number of exons or different splice sites. Second, we used an additional approach to detect alternative forms of the same gene that may be expressed in other tissues. Since no large set of protein sequences is available for X. tropicalis, we used the UniRef database. We computed alignments between cDNA and UniRef protein sequences using BLASTX. Alignments characterized by a gap (at least 10 aa) introduced in the contig sequence were retrieved. Alternative splicing was considered present if the contig sequence including the gap could be aligned to the genomic sequence.
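The first methodology can be sketched in Python as a pairwise comparison of gene models. The example below is only one possible reading of the criterion: "mean overlap" is interpreted here as the average, over the two TCs, of the fraction of each genomic span covered by the other; strand is ignored; and exon coordinates are assumed to be half-open (start, end) pairs on a named genomic sequence. These choices and all identifiers are assumptions of the sketch.

```python
from itertools import combinations


def span(exons):
    """Genomic span (start, end) covered by a list of (start, end) exons."""
    return min(start for start, _ in exons), max(end for _, end in exons)


def mean_overlap(exons_a, exons_b):
    """Average fraction of each genomic span that is shared with the other."""
    a_start, a_end = span(exons_a)
    b_start, b_end = span(exons_b)
    shared = min(a_end, b_end) - max(a_start, b_start)
    if shared <= 0:
        return 0.0
    return (shared / (a_end - a_start) + shared / (b_end - b_start)) / 2


def splice_sites(exons):
    """Internal exon boundaries, tagged as donor (exon end) or acceptor (exon start)."""
    ordered = sorted(exons)
    donors = {("donor", end) for _, end in ordered[:-1]}
    acceptors = {("acceptor", start) for start, _ in ordered[1:]}
    return donors | acceptors


def candidate_splicing_pairs(alignments, min_overlap=0.95):
    """alignments: {tc_id: (genomic_sequence_name, [(exon_start, exon_end), ...])}."""
    pairs = []
    for (tc_a, (seq_a, ex_a)), (tc_b, (seq_b, ex_b)) in combinations(
        sorted(alignments.items()), 2
    ):
        if seq_a != seq_b or mean_overlap(ex_a, ex_b) < min_overlap:
            continue
        if len(ex_a) != len(ex_b):
            pairs.append((tc_a, tc_b, "exon number"))
        elif splice_sites(ex_a) != splice_sites(ex_b):
            pairs.append((tc_a, tc_b, "splice sites"))
    return pairs
```

The pairwise loop is quadratic in the number of TCs; for a full dataset one would first bin the alignments by genomic sequence and position, which is omitted here for brevity.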
ESTs derived predominantly from our cDNA libraries are available in XTScope (for more details see section 2.1). These clones are likely to represent genes differentially expressed in the retina or the central or peripheral nervous system during metamorphosis. These computations were performed by biologists and added to the XTScope database.
Implementation and Architecture
The XTScope architecture that I developed consists of a relational database and a web interface that external users can query. The database was implemented in MySQL and contains the information generated by the workflow described above. The web interface was created using HTML and PHP scripts that dynamically execute MySQL queries. It runs under an Apache web server on a Linux system. The web interface also provides links to the external databases used to perform the TC annotation (Figure 2).
Figure 2: XTScope architecture. The XTScope database contains the information about high quality ESTs, the tentative consensus sequences and their annotation. This information is accessible through the XTScope website, which also provides links to external databases.
An entity-relationship data model was designed to store the information generated by the three steps of the EST preprocessing and annotation workflow (Figure 3). In-house scripts were used to parse the results and load them into the database. Dedicated Perl scripts were developed for the output formats of the PHRAP assembler, InterProScan, Framefinder, Exonerate (genomic alignment) and BLAST reports. BioPerl was mainly used to parse BLAST output files. The current implementation makes it easy to include new BLAST results. The alternative splicing cases were detected using both the data contained in the XTScope database and BLAST results.
The TC annotation step generates a large amount of data that needs to be quickly accessed from PHP scripts. To gain performance in MySQL queries, I used a unique ID generating mechanism to identify elements in tables and establish the relationship between tables. Moreover, additional indices were added to tables, in particular for those related to BLAST results.
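The role of numeric identifiers and additional indices can be illustrated with a small schema. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL so that it is self-contained; the table names, column names and the accession in the usage example are hypothetical and do not reproduce the actual XTScope schema shown in Figure 3.

```python
import sqlite3

# Hypothetical two-table excerpt: integer keys identify rows and link tables,
# and a secondary index covers the lookup "hits of a given TC, by rank".
SCHEMA = """
CREATE TABLE tc (
    tc_id   INTEGER PRIMARY KEY,
    length  INTEGER NOT NULL
);
CREATE TABLE blast_hit (
    hit_id     INTEGER PRIMARY KEY,
    tc_id      INTEGER NOT NULL REFERENCES tc(tc_id),
    source_db  TEXT NOT NULL,
    accession  TEXT NOT NULL,
    evalue     REAL NOT NULL,
    hit_rank   INTEGER NOT NULL
);
CREATE INDEX idx_blast_hit_tc ON blast_hit (tc_id, hit_rank);
"""


def demo_connection():
    """Create an in-memory database holding the example schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def best_hit(conn, tc_id):
    """Return (accession, evalue) of the top-ranked hit of a TC, or None."""
    return conn.execute(
        "SELECT accession, evalue FROM blast_hit "
        "WHERE tc_id = ? ORDER BY hit_rank LIMIT 1",
        (tc_id,),
    ).fetchone()
```

A composite index on (tc_id, hit_rank) is chosen here because the contig report retrieves the hits of one TC at a time; other access patterns would call for other indices.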
The web application is organized in several modules, as described in the next section. Each module contains HTML forms to query the database and/or predefined queries to retrieve the information. The website navigation was implemented in a separate PHP library, which makes it easier to add new modules to XTScope. The main HTML report contains the detailed information about TCs. This report can be accessed through the search module using an HTML form, or through a URL indicating the TC identifier, which allows external websites to link directly to the XTScope contig report.
Figure 3: The entity-relationship data model for the database of XTScope. Only the attributes used as identifiers are displayed (primary and foreign keys). Tables are grouped by annotation procedure. The blue box contains the tables with the protein domain identification and GO terms. The black box contains tables related to scaffold construction. The red box groups the tables with BLAST results, and the green box groups the tables with the genomic alignment results.
Of the external information, we stored locally only what was used for the sequence annotation (protein identifiers, protein domain descriptions, etc.). The accession numbers of the best similarity hits in the UniRef, Swiss-Prot and TrEMBL databases are stored in the XTScope database and are used to build cross-references to these external databases. The Ensembl website is linked through the alignments between the TCs and the genomic sequence.
Web interface
The XTScope web application supports data retrieval through a search engine and provides useful information about the tentative consensus sequences. To design the web interface, we identified several ways of querying the database that may be of interest to users. First, a complete report for each TC, containing detailed information about the assembly and the annotation, is necessary. Users must be able to find TCs by searching the annotations using different criteria. Statistics about the assembly, transcripts specific to our libraries and cases of alternative splicing must be easy to retrieve. Following these criteria, we divided the application into four modules: (i) a ‘Search’ engine and the informative modules (ii) ‘Statistic’, (iii) ‘Expression’ and (iv) ‘Splicing’. Subsequently, as an addition to the implementation that I developed, a new module was incorporated in the website. It is described in the ‘Extensions’ section.
Several options are provided by the ‘Search’ module. A basic search option uses the TC identifier, a number that ranges from 1 to the number of contigs and singletons. For biologists working with physical clones, the clone identifier and plate number are also available as search criteria. For general users, gene annotations may be more interesting: it is possible to search for TCs corresponding to a given protein, TCs containing a specific protein domain or TCs associated with a given GO functional category.
Searching by the TC identifier generates a ‘Contig report page’ that provides detailed information about the assembly and annotations related to a TC sequence. This report displays the nucleotide sequence, the ORF predictions and the scaffold group based on clone link information. The ESTs forming the consensus sequence and the gene model based on genomic alignments are displayed in descriptive and graphic form (see Figure 4). The annotations based on sequence similarities are presented as summary tables, and each protein identifier has a link to the corresponding external record (Swiss-Prot, UniProt or TrEMBL).
Figure 4: Part of the Contig Report page for a given contig. The contig sequences are shown on top. The Open Reading Frame predictions and protein domain predictions are displayed as lines with a size proportional to the contig length, under the relative position within the contig sequence. The positions of the assembled ESTs are shown as grey or green lines, and more information is given in a table. The ESTs are displayed as green lines when the 5’ and the 3’ EST from the same clone fall in the same contig.
The clone name or the plate name can be used to retrieve the TCs in which the corresponding ESTs were assembled. The GO classification for molecular functions can be browsed to retrieve the TCs associated with GO codes. Finally, XTScope allows more complex queries to retrieve the TCs containing a specific annotation. Since protein descriptions are unformatted text, the web application allows full-text searches. In this way one or more words, exact phrases, etc. can be used to search the protein descriptions. Full-text searches can be performed on the protein descriptions of the BLAST results or on the domain descriptions of the predicted InterPro domains (Figure 5). All the search functions and informative pages of XTScope display the TC identifier with a link to the corresponding ‘Contig report page’.
The ‘Statistic’ module of XTScope provides basic information about the assembly. The list of the longest contig sequences, as well as the list of the largest contigs, is displayed. The ‘Expression’ module provides information related to differential expression in the nervous system. This module contains the list of ESTs predominantly found in our cDNA libraries. The complete list shows the ESTs and their associated contigs, while the summary version displays only the contigs concerned. The ‘Splicing’ module retrieves the predicted cases of alternative gene forms. Three lists are available: predictions based on a different number of exons, on different splice sites, or on evidence of gaps in the alignment against known proteins.
Figure 5: Full-text searches in XTScope. The result page of a full-text search is a report with the TC/Contig identifier, the protein description containing the query words and the ‘Hit rank’, i.e. whether the protein description matching the query words corresponds to the best similarity hit, the second or another rank position. The TC or Contig identifier provides a link to the respective ‘Contig report page’.
Extensions
The XTScope database has already been extended to include more annotations for the tentative consensus sequences (TCs). Contigs and singletons were compared to full-length clone sequences from Xenopus tropicalis (Gilchrist et al., 2004), which are named ‘Gurdon’s contigs’. These full-length sequences were aligned to the genome sequence, and they can also be retrieved using the module “Gurdon contigs on Genome”. Moreover, new annotations are available based on similarities with Unigene sequences for Xenopus laevis and Xenopus tropicalis (Unigene at the NCBI).
Conclusion
The XTScope database has been developed to provide information about the transcriptome of the nervous system of Xenopus tropicalis. The collection described here contains 11,738 tentative consensus sequences assembled from 48,785 high-quality EST sequences. TCs with their respective annotation are available through our web-application. The database system provides a tool which will enable the Xenopus community to take advantage of the specific information collected from the nervous system during embryogenesis and metamorphosis. XTScope can easily incorporate new information related to Xenopus research due to the modular design of both the database and the web application. This flexibility is demonstrated by the extensions already incorporated in XTScope, as described in the previous section.
Chapter 3: Microarrays to study the transcriptome
In this chapter, DNA microarrays are used to study the transcriptome of Xenopus tropicalis during metamorphosis. More precisely, we are interested in the genes regulated during metamorphosis and in how thyroid hormone receptors act in the control of transcription. To better understand the gene expression programs specific to certain tissues, six developmental stages were used to produce time series for three different organs. The chapter first describes the bioinformatic analysis performed on the microarray data.
A series of steps must be carried out before microarray data can be analysed. The steps performed in this thesis are: (1) contributing to the choice of the experimental strategy, that is, which samples are hybridized on the same slide (array), in order to use the available resources efficiently. (2) Once the slides have been hybridized and scanned, the data must be analysed to assess their quality and normalized to reduce existing biases. (3) Finally, the expression profiles are obtained using the appropriate statistical tools. After these steps, the analyses carried out by the biologists identified 802 genes differentially expressed in the central nervous system, the tail and the liver. The biological analysis of the expression profiles is described for each organ studied. The structure and dynamics of the transcriptional networks of amphibian metamorphosis are thus illustrated at a genomic scale.
The power of microarrays to analyse thousands of genes in parallel has significantly increased the speed of experimental progress. Microarrays are used in all fields of biology, for plants, animals and humans, and for a variety of biological questions. Expression profiles can be obtained from diverse developmental stages, under different environmental stress conditions, or in different disease states. The general goal of all these experiments is to find the function and the regulation of genes and their interactions with other genes. Gene function is mainly inferred by assuming that genes sharing approximately the same expression pattern are likely to have a similar biological function. Therefore, the classical output of a microarray experiment consists of a number of clusters of genes with similar behaviour under different conditions.
This study involves monitoring gene expression through several developmental stages during the metamorphosis of Xenopus tropicalis. Three tissues were studied independently, generating three time series, one per tissue, each covering six developmental stages. The first section of this chapter describes the bioinformatic aspects that I developed for this thesis: my contribution to the experimental design, the quality assessment of the microarray data, the choice of the normalization method and the profile reconstruction. The second section presents the biological application (article in preparation).
3.1 Bioinformatic issues
Experimental design
The first challenge in this study was the experimental design for the hybridizations. The available resources were mainly constrained by the number of arrays available for the experiments. Although replication is crucial for statistical analysis and reliable results, it requires more resources for each experiment. Three replicates are commonly used in microarray studies; however, a ‘common reference’ design then requires 18 arrays (slides) for a time-series experiment with 6 conditions, as in our case. Because the experiment consists of 3 time series, 54 slides would be needed, in addition to the slides used for tests and controls.
Recent articles on experimental design point to alternative designs that are more efficient in terms of resources (number of slides and RNA samples) and can answer the biological question with similar statistical performance (minimizing the observed variance) (Yang and Speed, 2002; Kerr and Churchill, 2001). One of the main criticisms of the reference design is that half of the resources are used to measure a condition that, in most cases, is of no biological interest. An interwoven design uses half of the resources to generate gene profiles, requiring 9 slides instead of 18 to measure 6 time points with 3 technical replicates per time point. For this reason, we chose an interwoven design for the three time series under study (Figure 1). Although the design used differs slightly from the optimal design described by Kerr and Churchill (2001), it performed well in our biological application.
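The slide counts quoted above can be checked with a short Python sketch. The reference design follows directly from the text (each time point hybridized against a common reference, 3 replicates). The interwoven layout, in contrast, is a hypothetical one chosen only to reproduce the count of 9 slides with each time point measured 3 times: every time point is paired with its neighbour (step 1) and with the time point three steps away (step 3). It is not claimed to be the layout of Figure 1. Each slide is written as a (Cy3 sample, Cy5 sample) pair.

```python
from collections import Counter


def reference_design(time_points, replicates):
    """One slide per replicate, each pairing a time point with a common reference."""
    return [(tp, "REF") for tp in time_points for _ in range(replicates)]


def interwoven_design(time_points, steps=(1, 3)):
    """Hypothetical interwoven design: pair T_i with T_(i+step), without duplicates."""
    n = len(time_points)
    seen = set()
    slides = []
    for step in steps:
        for i in range(n):
            j = (i + step) % n
            pair = frozenset((i, j))
            if pair not in seen:  # the step-3 pairs would otherwise appear twice
                seen.add(pair)
                slides.append((time_points[i], time_points[j]))
    return slides


def usage(slides):
    """Number of slides on which each sample is hybridized."""
    counts = Counter()
    for cy3, cy5 in slides:
        counts[cy3] += 1
        counts[cy5] += 1
    return counts
```

For six time points, `reference_design` yields 18 slides per time series (54 for three series), while `interwoven_design` yields 9 slides on which every time point still appears 3 times. Dye assignment in this sketch is not fully balanced for the step-3 pairs, which a real design would need to consider.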
Figure 1: Interwoven design used in our study. Circles represent samples or time points (T1 to T6), and arrows represent a direct hybridization between two samples. The arrows point from the time point labelled with Cy3 to the time point labelled with Cy5.
Analysis steps for the microarray experiment
The standard analysis of a microarray experiment starts from the data as they come out of the scanner. These ‘raw’ data are tab-delimited files that contain the signal intensities and a number of other characteristics of the spots, such as spot size and a quality flag for each spot. Before these data can actually be analysed, some quality assessment and normalization steps are required (see chapter 1).
Quality assessment
Quality assessment helps to uncover serious quality problems, or even mistakes that occurred during the labeling and hybridization steps. If the quality assessment does not reveal any serious irregularities, the analysis can continue and the data can be normalized. In this step we evaluated the background signal, the intensities relative to the background signal, and the intensities themselves to detect bias related to dye effects.
First, we checked the background signal distribution on the arrays to detect regions where spots may have unreliably high signals. Plots of the background distribution showed that for some arrays these regions are very small (Figure 2, top), whereas for others the noise is higher (Figure 2, bottom).
Figure 2: Background signal distribution for two arrays. Left images correspond to the Cy5 channel (red) and right images correspond to the Cy3 channel (green).
The assessment of the intensity relative to the background signal indicates how well the intensities differ from the background, and how well the array and the experimental procedure serve to measure gene expression. The mean and median of the intensities and of the background were computed for the Cy3 and Cy5 channels, as well as the histograms of the signal intensities after background subtraction (Figure 3). The number of spots with intensity values above 2 times the standard deviation of the background indicates how many spots give a clear expression signal (Table 1).
Figure 3: Histogram of background-corrected intensities for the Cy5 (red) and Cy3 (green) channels. The overall mean and median of the corrected intensities are indicated.
Array           Cy5 (red)   Cy3 (green)
66CMEvs66CME    5629        5914
56CMEvs57CME    5285        4993
55CMEvs57CME    4926        4101
57CMEvs64CME    2939        5205
62CMEvs55CME    5034        5117
66CMEvs55CME    5717        6064
64CMEvs56CME    5356        5566
64CMEvs66CME    5205        5841
62CMEvs56CME    4663        5518
66CMEvs62CME    5154        5980
Table 1: Number of spots with intensities above 2 × the standard deviation of the background. The values correspond to the brain (CME) time series, from a total of 6272 spots, including control spots.
To assess the presence of dye effects, we used the plot of log-ratios versus intensities (MA-plot). In our dataset, the dye bias depends on the array. For most arrays, a normalization step is strongly recommended (see Figure 4).
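As an illustration of the criterion behind Table 1, the following Python sketch counts the spots whose signal stands out from the background. It assumes that a spot gives a clear signal when its background-corrected intensity exceeds k = 2 standard deviations of the background values of the array; the exact definition used to build Table 1 is not detailed here, and the function name and toy values are hypothetical.

```python
from statistics import stdev


def count_clear_spots(foreground, background, k=2.0):
    """Count spots whose background-corrected intensity exceeds k * SD(background).

    `foreground` and `background` are per-spot intensities for one channel.
    The threshold definition is an assumption made for this sketch.
    """
    if len(foreground) != len(background):
        raise ValueError("foreground and background must have the same length")
    threshold = k * stdev(background)
    return sum(1 for fg, bg in zip(foreground, background) if fg - bg > threshold)


if __name__ == "__main__":
    # Toy values for four spots of one channel (not real data).
    fg = [1000.0, 120.0, 105.0, 5000.0]
    bg = [100.0, 110.0, 90.0, 100.0]
    print(count_clear_spots(fg, bg), "of", len(fg), "spots above threshold")
```

Applied to each channel of each array, such a count gives one row of a table like Table 1.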
Figure 4: MA-plot for the array hybridizing stages 64 and 56. Control spots are labeled with colors.
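For reference, the coordinates of an MA-plot follow the usual definitions M = log2(Cy5/Cy3) and A = (log2 Cy5 + log2 Cy3)/2. The short Python sketch below computes them for a list of spots; the function name is an assumption, and skipping spots with non-positive intensities is a simplification made for this example rather than the treatment applied in the study.

```python
import math


def ma_values(red, green):
    """Return a list of (M, A) pairs, one per spot with positive intensities.

    M = log2(red / green) is the log-ratio; A = mean of the two log2 intensities.
    Spots with a non-positive intensity in either channel are skipped.
    """
    pairs = []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            continue  # the logarithm is undefined for these spots
        m = math.log2(r) - math.log2(g)
        a = 0.5 * (math.log2(r) + math.log2(g))
        pairs.append((m, a))
    return pairs


if __name__ == "__main__":
    for m, a in ma_values([400.0, 100.0, 50.0], [100.0, 100.0, 200.0]):
        print(f"M = {m:+.2f}, A = {a:.2f}")
```

In the absence of dye bias, the cloud of points is centred on M = 0 over the whole range of A; a curved or shifted cloud suggests an intensity-dependent dye effect.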
Normalization step
There are many sources of systematic variation in microarray experiments that affect the measured gene expression levels. Normalization serves to remove bias of non-biological origin from the data. Two-color cDNA microarray experiments are comparative in nature; therefore, commonly used normalization methods such as loess focus on adjusting the log-intensity ratios between the red and the green channels.
Loess (Yang et al., 2002) is based on the hypothesis that most genes are not differentially expressed. In our experiments, 3000 genes of Xenopus tropicalis were selected to be printed on the slides (dedicated array). Since the selected genes are a small portion of the total number of X. tropicalis genes and we do not know how many of them are expressed in the tested conditions, we cannot assume that most of them are not differentially expressed; therefore, we do not expect the normalization hypothesis to hold for our data. For this reason, several normalization methods were tested to select the most appropriate one for our experiment.
We tested three methods provided by the Limma package (Smyth, 2004): loess for global normalization; print-tip loess, which corrects each print-tip group separately; and robust spline, which is a compromise between print-tip and global loess normalization, with 5-parameter regression splines used in place of the loess curves.
First we assessed how well these methods corrected dye and spatial bias. MA-plots were used to check the dye bias corrections. Although the methods give different results, all of them corrected the dye bias (Figure 5). Evaluating the ratio distribution per print-tip group showed the advantage of local (print-tip loess and robust spline) over global (loess) normalization methods, since the bias due to the print-tip group was better corrected (Figure 6).
Figure 5: Comparison of non-normalized data with print-tip loess, robust spline and loess normalization using MA-plots.
Each time series includes a self-self array, i.e. the same sample/condition is labelled with both the Cy3 and Cy5 dyes and co-hybridized on the same array. For these data, there is no differential expression and consequently all log-ratios are expected to be zero (all points should lie around zero in the MA-plot). Again, local normalization methods performed better than global normalization (data not shown).
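The difference between a global and a print-tip correction can be illustrated with a deliberately simplified stand-in. The Python sketch below uses median centering instead of loess: it removes a constant offset either over the whole array or within each print-tip group, whereas loess additionally models the dependence of the log-ratio on intensity. It is not the Limma implementation; the function names and toy values are assumptions.

```python
from collections import defaultdict
from statistics import median


def global_median_normalize(m_values):
    """Subtract the array-wide median from every log-ratio."""
    center = median(m_values)
    return [m - center for m in m_values]


def print_tip_median_normalize(m_values, tip_groups):
    """Subtract from every log-ratio the median of its own print-tip group."""
    by_group = defaultdict(list)
    for m, group in zip(m_values, tip_groups):
        by_group[group].append(m)
    centers = {group: median(values) for group, values in by_group.items()}
    return [m - centers[group] for m, group in zip(m_values, tip_groups)]


if __name__ == "__main__":
    # Toy log-ratios: print-tip group 1 is shifted up, group 2 is shifted down.
    m = [1.0, 1.2, 0.8, -0.5, -0.3, -0.7]
    groups = [1, 1, 1, 2, 2, 2]
    print("global  :", [round(v, 2) for v in global_median_normalize(m)])
    print("per tip :", [round(v, 2) for v in print_tip_median_normalize(m, groups)])
```

In this toy case the global correction leaves the two groups offset from each other, while the per-group correction centres both on zero, which is the kind of behaviour a boxplot per print-tip group makes visible.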
Figure 6: Boxplot of ratios per print tip group.
Finally, normalization methods were also compared based on duplicated spots. Since each oligo was spotted in duplicate in two separate grids, the correlation between duplicated spots should be high if the normalization step corrected the bias. The correlation and the standard deviation between duplicated spots confirm the previous results: local normalization methods are best suited for this dataset.
In addition to the normalization methods implemented in the Limma package, other methods exist. A method specially designed for dedicated arrays (Wang et al., 2005) was tested in a preliminary comparison, but it gave results similar to loess, since both are global normalization methods (data not shown).
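The two criteria used on duplicated spots can be sketched as follows. The example computes the Pearson correlation between the two copies of each oligo and the mean per-oligo standard deviation; the exact summary statistics used in the study are not detailed here, so the function names and the choice of the mean as summary are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of log-ratios."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant values")
    return sxy / math.sqrt(sxx * syy)


def mean_duplicate_sd(x, y):
    """Mean over oligos of the standard deviation between the two duplicates.

    For two values a and b the sample standard deviation is |a - b| / sqrt(2).
    """
    return sum(abs(a - b) for a, b in zip(x, y)) / (len(x) * math.sqrt(2))


if __name__ == "__main__":
    grid_1 = [0.10, -0.50, 1.20, 0.00]   # toy normalized log-ratios, first copy
    grid_2 = [0.15, -0.45, 1.10, 0.05]   # second copy of the same oligos
    print("correlation:", round(pearson(grid_1, grid_2), 3))
    print("mean SD    :", round(mean_duplicate_sd(grid_1, grid_2), 3))
```

A normalization method that removes spatial bias well should increase the correlation and decrease the standard deviation between the two grids.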
Profile reconstruction
The interwoven design used in the present study requires complex analysis procedures to reconstruct the factor of interest (e.g. a gene profile across a time series) from the normalized data. Some R packages are available to estimate the time profiles and to assess differentially expressed genes. The Limma package implements a gene-specific linear model to estimate the time profiles and an empirical Bayes method to determine differential gene expression. Maanova, on the other hand, implements a two-stage (fixed or mixed) model (http://www.jax.org/staff/churchill/labsite/software/anova) and uses F tests to assess differential expression. We chose Maanova to carry out our data analysis; it also includes functionalities to perform clustering analysis (for a complete comparison see chapter 4).
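To make the idea of profile reconstruction concrete, the sketch below estimates the profile of a single gene by ordinary least squares, assuming that each array measures M = (log2 expression of the Cy5 sample) - (log2 expression of the Cy3 sample). It is a minimal illustration, not the Limma or Maanova model: dye and array effects and replicate structure are ignored, the first time point is fixed to zero, and DESIGN is a hypothetical layout rather than the one of Figure 1.

```python
# Hypothetical interwoven design; each pair is (Cy3 sample, Cy5 sample).
DESIGN = [
    ("T1", "T3"), ("T3", "T5"), ("T5", "T1"),
    ("T2", "T4"), ("T4", "T6"), ("T6", "T2"),
    ("T1", "T2"), ("T3", "T4"), ("T5", "T6"),
]
SAMPLES = ["T1", "T2", "T3", "T4", "T5", "T6"]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: the design does not connect all samples")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - rest) / aug[i][i]
    return x


def reconstruct_profile(design, log_ratios, samples):
    """Least-squares log2 profile of one gene, relative to samples[0]."""
    free = samples[1:]                      # samples[0] is the baseline (0.0)
    index = {s: i for i, s in enumerate(free)}
    rows = []
    for cy3, cy5 in design:                 # one design-matrix row per array
        row = [0.0] * len(free)
        if cy5 in index:
            row[index[cy5]] += 1.0
        if cy3 in index:
            row[index[cy3]] -= 1.0
        rows.append(row)
    p = len(free)
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, log_ratios)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)    # normal equations
    profile = {samples[0]: 0.0}
    profile.update(zip(free, beta))
    return profile


if __name__ == "__main__":
    truth = {"T1": 0.0, "T2": 1.0, "T3": 2.0, "T4": 1.5, "T5": 0.5, "T6": -1.0}
    ratios = [truth[cy5] - truth[cy3] for cy3, cy5 in DESIGN]  # noise-free toy data
    print(reconstruct_profile(DESIGN, ratios, SAMPLES))
```

Because every time point is linked to the others through several arrays, the nine log-ratios over-determine the five free parameters, and the least-squares fit averages the redundant information.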
Data analysis
The biological application, including the analysis, the interpretation of the clusters and the biological mechanisms behind these expression profiles, is presented in the following section.
3.2 Xenopus tropicalis metamorphosis transcriptomes analysis using microarrays
Xenopus tropicalis metamorphosis transcriptomes analysis using microarrays
Raphaël Thuret1,2*; Ana Carolina Fierro1,2* and Nicolas Pollet1,2#.
1 CNRS UMR 8080, F-91405 Orsay, France. 2 Univ Paris Sud, F-91405 Orsay, France * First co-authors # Corresponding author
Fax: +33 169154949 Phone: +33 169157273 e-mail: [email protected] Address: Laboratoire Développement et Evolution, CNRS UMR 8080, Bat 445, Université Paris-Sud, 91405 Orsay cedex, France.
Abstract
Background: Amphibian metamorphosis is a developmental process that played a critical role during vertebrate evolution. Frog metamorphosis is triggered by thyroid hormones, which act on the cells via their receptors, which are ligand-binding transcription factors. Here we asked which genes are regulated during metamorphosis and how the thyroid hormone receptors mediate the specific modulation of transcription required to change cell fate during metamorphosis. We used Xenopus tropicalis as a model because it benefits from a recent genome project and enables genetic manipulations.
Results: To gain insight into the tissue-specific gene expression programs, profiles of gene expression during metamorphosis were obtained in three different organs taken at six time points. A total of 802 genes differentially expressed in the central nervous system, the tail or the liver were identified. Analysis of the expression profiles provides evidence for up- and down-regulations that are either constant or transient. We show that the repertoire of differential expression of each organ is mainly non-overlapping with that of the others and that this is due to the activity of a specific set of transcription regulators for each organ. We also identify putative Thyroid hormone Responsive Elements by in silico promoter studies.
Conclusions: This report illustrates both the structure and the dynamics of amphibian metamorphosis transcriptional networks on a genomic scale. Moreover this study provides a foundation for functional genomics of amphibian metamorphosis using Xenopus tropicalis as a model to study thyroid hormone roles during vertebrate development.
Background
Amphibian metamorphosis is a unique and complex developmental process that brings about dramatic biochemical and morphological changes. Each tadpole organ is modified during this transition from a larval to an adult organism. For example, cells from the tail are fated to die by apoptosis, while those from the limb buds proliferate, and specific cell types of the nervous system, the intestine or the skin enter a differentiation process. Thyroid hormones (TH) are necessary and sufficient to trigger metamorphosis, which makes metamorphosis a unique model to study the genomic effects of TH. These hormones act on the cells via their receptors, the Thyroid Hormone Receptors (THR), which are ligand-binding transcription factors of the nuclear receptor superfamily [1, 2]. THR act as heterodimers with Retinoid X Receptors (RXR) and bind their respective hormone responsive elements (Thyroid hormone Responsive Elements, TRE, for THR) present in the regulatory regions of transcription units [3]. Before metamorphosis, in the absence of hormone, THR act as transcriptional repressors on their target genes by recruiting co-repressor complexes. When TH is produced, it binds to THR and thus activates the transcription of target genes by replacing co-repressor complexes with co-activator complexes. These receptors therefore modify cell fates through a transcriptional reprogramming that operates by chromatin modifications. The nature of this reprogramming depends on positional cues and cell type. Gene regulation by TH in the context of metamorphosis has been studied using differential display screens on cultured cells, tail, brain and intestine [4-8]. More recently, microarray studies have been reported [9-11]. Overall, two waves of transcriptional activation have been identified. The first wave corresponds to the regulation of direct TH-response genes. The second wave corresponds to the regulation of late TH-response genes that trigger most of the morphological changes occurring during metamorphosis.
Direct TH-response genes encode proteases, proteins involved in various metabolic processes, and transcription factors including TRβ itself, THbZIP, BTEB and Fra2 (for review see [12]). These direct TH-response factors would in turn regulate the transcription of the late-response genes. Most studies on metamorphosis use the anuran amphibian model Xenopus laevis. However, the X. laevis genome is pseudotetraploid, and it is common to find one gene represented by two paralogous loci (“allogenes”). Furthermore, nucleotide sequence similarity between “allogenes” is variable and there is no genomic sequence to help build a global view. As a consequence, when the expression of highly similar “allogenes” differs in time or space, the results of nucleic acid hybridization experiments are impaired. On a genomic scale, such a technical limitation has important consequences. X. tropicalis is a better model for such transcriptomic studies [13, 14]. With more than a million ESTs (Expressed Sequence Tags) and an available genome sequence, large-scale surveys of gene expression are easier to tackle and interpret in this diploid anuran. We took advantage of the X. tropicalis model to answer the following questions: which genes are regulated during metamorphosis, and how does TH/THR mediate the specific modulation of transcription required to accomplish the various cell fate changes during metamorphosis? To gain insight into the tissue-specific gene expression programs, profiles of gene expression during metamorphosis were studied in different tissues with microarrays. The transcription factors expressed in different organs were searched to identify direct TH-response genes by in silico promoter studies.
Repositories of X. tropicalis nucleotide sequences were mined to build a long-oligonucleotide microarray representing 2902 different genes. This microarray was then used to study physiological metamorphosis in the tail, central nervous system and liver of X. tropicalis. After analysis, sets of differentially expressed genes were compared to identify similarities and differences in the transcriptional programs of the three organs. The availability of genomic sequences allowed us to search for TRE in the promoters of regulated genes and to potentially identify TH-regulated genes.
Results
Experimental design for a time-course study on gene expression during metamorphosis
Gene expression modifications were monitored in three dissected organs during metamorphosis to study common and specific transcriptional regulations. Our study is focused on the repertoire of transcription factors that can be regulated in each organ. Time-course experiments were performed on organs with different metamorphic fates: central nervous system (brain and spinal cord, CNS), tail and liver. These organs were selected because they are, respectively, partially remodeled, totally resorbed, or subject to a metabolic shift (from ammonotelism to ureotelism) during metamorphosis. The choice of metamorphic stages to sample was driven by the results gathered in the literature. We selected the metamorphic period where gene expression changes were the most important for each organ. In the tail, major gene expression changes are observed between NF stages 59 and 64 (NF: Nieuwkoop and Faber stage) [5]. Accordingly, we chose NF stages 58, 61, 62, 63, 64 and 65 to study gene expression in the tail during metamorphosis. In the central nervous system, innervation changes tend to occur throughout the metamorphic stages. We chose to cover the period with NF stages 55, 56, 57, 62, 64 and 66 [7]. In the liver, samples covering the whole metamorphosis process (NF 55, NF 58, NF 60, NF 63, NF 65 and NF 66) were selected since few data on gene expression are available [15]. Comparison of the 6 different samples was made using an interwoven loop design [16, 17]. This experimental design offers a good compromise between the number of arrays to use (6 different samples are compared in 9 hybridization experiments) and the statistical representation of each signal (variance minimization). Each sample is subjected to three direct and two indirect comparisons.
These indirect comparisons are made using two independent paths (see the design in fig 1.B for an example: NF 55 and NF 58 are not directly compared, but their difference in expression level can be estimated independently through the comparisons with NF 60 and with NF 63).
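The notion of an indirect comparison through two independent paths can be sketched in Python as follows. The example assumes log2 ratios defined as (Cy5 sample) - (Cy3 sample) and a hypothetical design with generic labels T1 to T6 in which every sample has three direct partners; it is not the layout of fig 1.B, and the function names are introduced for illustration only.

```python
# Hypothetical interwoven design; each pair is (Cy3 sample, Cy5 sample).
DESIGN = [
    ("T1", "T3"), ("T3", "T5"), ("T5", "T1"),
    ("T2", "T4"), ("T4", "T6"), ("T6", "T2"),
    ("T1", "T2"), ("T3", "T4"), ("T5", "T6"),
]


def direct_difference(ratios, a, b):
    """Direct estimate of (b - a) in log2 units, or None if a and b share no array."""
    if (a, b) in ratios:        # a labelled with Cy3, b with Cy5: M = b - a
        return ratios[(a, b)]
    if (b, a) in ratios:        # dyes the other way round: M = a - b
        return -ratios[(b, a)]
    return None


def indirect_differences(ratios, a, b, samples):
    """Estimates of (b - a) obtained through each single intermediate sample."""
    estimates = {}
    for c in samples:
        if c in (a, b):
            continue
        first = direct_difference(ratios, a, c)
        second = direct_difference(ratios, c, b)
        if first is not None and second is not None:
            estimates[c] = first + second   # (c - a) + (b - c) = b - a
    return estimates


if __name__ == "__main__":
    # Noise-free toy log2 expression values for one gene.
    truth = {"T1": 0.0, "T2": 1.0, "T3": 2.0, "T4": 1.5, "T5": 0.5, "T6": -1.0}
    ratios = {(cy3, cy5): truth[cy5] - truth[cy3] for cy3, cy5 in DESIGN}
    print("direct T1 vs T4  :", direct_difference(ratios, "T1", "T4"))
    print("indirect T1 vs T4:", indirect_differences(ratios, "T1", "T4", sorted(truth)))
```

With real, noisy data the two indirect estimates differ slightly, and combining them is what a linear model fitted on the whole design does implicitly.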
Analysis of gene expression profiles in the tail
Previous studies of tail metamorphosis [5, 6, 11, 18] showed that different biological processes (such as matrix-metalloproteinase expression, maintenance of tissue contraction, and apoptosis regulation via the mitochondrial pathway) are engaged to resorb tail tissues in response to thyroid hormone. Our microarray study of tail resorption during metamorphosis identified a set of 381 genes differentially expressed between NF stages 58 and 65. We analyzed the corresponding gene expression profiles using K-means clustering into ten groups. The results are presented in figure 1. Four broad expression patterns can be observed. Transient up-regulation is characteristic of clusters 1 and 2. Genes constantly up-regulated between NF stages 58 and 65 are found in clusters 3, 4 and 5. Cluster 6 is the only group of transiently down-regulated genes. Finally, genes constantly repressed starting from NF stage 58 are gathered in clusters 7, 8, 9 and 10. We analyzed the molecular functions and biological processes associated with each cluster using Gene Ontology term enrichment (Table 1).
Transitory up-regulated profiles: Transitory up-regulation is associated with the Wnt signaling pathway, regulation of transcription, cellular metabolism and cell cycle (figure 1, clusters 1 and 2). Metabolism is more associated with cluster 2 and transcriptional regulation with cluster 2. A complex modulation of the Wnt pathway is observed; both positive (axn, porca, wnt10, frz1 and tcf1c) and negative (wif1) regulators are found in clusters 1 and 2. Coactivators such as hmgn3, hacs and psmc5 account for a global stimulation of transcription. Homeobox genes (tlx1, hoxb7, lhx1, six2) and zinc finger proteins (zn184, zg7, zo71) are recruited for gene-specific transcriptional regulation. Cell cycle checkpoint proteins mcm6, rfa1 and bub3 are found in cluster 2. Modifications of oxidative enzymes of the mitochondria (ldha, succa, gapdh, isocitdhmp) are most likely associated with the activation of apoptosis.
Constantly up-regulated profiles: Several genes previously reported as being transcriptionally modulated during metamorphosis showed up as constantly up-regulated (Fig. 1, clusters 3, 4 and 5): deiod2 (deiodinase 2, an enzyme catalyzing the conversion of T4 to T3; cluster 5), gata1a (a transcription factor regulating the switch from the larval to the adult hemoglobin; cluster 5), intb1 (integrin β1, a receptor of extra-cellular matrix proteins; cluster 4), nr2b2 (RXRβ, an obligatory partner of thyroid hormone receptors; cluster 5) and prlr (prolactin receptor, playing a role in antagonizing TH actions; cluster 4).
These results indicate the efficiency of our experiments in capturing physiological transcriptional regulations. The biological processes found most significantly enriched among these constantly up-regulated genes are response to stress, regulation of transcription, and protein catabolism (table 1). The extra-cellular matrix is remodeled in the tail during metamorphosis [5, 6]. Indeed, we observed up-regulated genes playing a role in proteolysis: proteases mostly found in cluster 1 (mmp2, ctsb, cstd, ctsl, elas3bp) and ubiquitin conjugation (polyubiquitin). The same genes and others (nfat, lhx2 or bmp2r) are involved in the response to stress. This response can be related to the known links between the thyroid and stress axes [19] and to the roles played by the immune system during tail resorption [20, 21]. Among the transcriptional regulators, we found both global and specific modifiers. In the first category, co-activators or chromatin modifiers were found such as tcp4, hmgb1, pc2, p66, runx2, and notably nr2b2. In the second category, we found several known or orphan zinc finger proteins (zo72, xfin, zo6, zn343, zg5, zn207) as well as transcription factors whose role during tail resorption is unexpected (tbx6, anf1, fd4pr and twn).
Transitory down-regulated profiles: Regulation of transcription and macromolecule biosynthesis are associated with transitory down-regulation (figure 1, cluster 6). Both global (anm1, hdac1, cbx6, chd1l, hmg14, hmgn1, z297b) and specific transcriptional regulators (foxd3, vax1, tbx6l, znf84) share this expression profile. We can observe gene products playing a role in distinct growth factor signalling pathways such as frz8 (frizzled 8, receptor of WNT) and fgfr2 (an FGF receptor), as well as hdgf (Hepatoma-Derived Growth Factor).
Constantly down-regulated profiles: Among the constantly down-regulated clusters (fig 1, clusters 7, 8, 9 and 10), regulation of transcription, progression through the cell cycle and transport are the prominent regulated processes (table 1).
Cluster 8 deals mainly with transcriptional regulation. Some known or probable co-activators such as ada3, myst1, hmgb1rs15, usf1, smarcc1, edf1 or pc1 are shut down and therefore alter transcription on a global level. Specific transcription factors are also found down-regulated, such as homeobox genes (optx2, msx2, gsc, mxr, shox2, dlx1), known or orphan zinc finger proteins (zbt17, zbt34, zg8) or others (junb, esr6e, olig2, olig3). In cluster 7, the regulation deals mainly with actin cytoskeleton function, especially the muscular actin network (act3, mle1, mlrs, tnnc2, tnnt3, tpm1). This is concordant with the disappearance of tail muscle, which occurs essentially between NF stages 57 and 63 [22]. The cell cycle is regulated by turning down genes involved in the initiation of mitosis (dpk, gppsup2 and cdcp6) or in the inhibition of cell cycle arrest (cmyc2, ppase2cd). Other genes modulating cell proliferation are also down-regulated (bhh or 143ga). Cluster 9 represents a global down-regulation of cellular physiological processes (macromolecule biosynthesis and metabolism). The observation of housekeeping genes being turned down could be interpreted either as a direct repression or as a consequence of tissue resorption. Cluster 10 is associated with transport and regulation of cellular metabolism.
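The K-means clustering applied to the expression profiles can be sketched as follows. This is a plain Lloyd's algorithm with Euclidean distance written with the Python standard library; it is not the clustering routine used for the analysis, and the initialization, the distance and the toy profiles are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two profiles of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(profiles, k, n_iter=100, seed=0):
    """Cluster profiles into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(profiles, k)]  # k distinct starting profiles
    labels = [0] * len(profiles)
    for iteration in range(n_iter):
        # Assignment step: each profile joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        if iteration > 0 and new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        # Update step: each centroid becomes the mean profile of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    # Toy log2 profiles over six stages: two transient-up and two constantly-down genes.
    profiles = [
        [0.0, 1.0, 2.0, 1.0, 0.0, 0.0],
        [0.0, 1.1, 2.1, 0.9, 0.0, 0.0],
        [0.0, -1.0, -2.0, -3.0, -3.0, -3.0],
        [0.0, -1.1, -2.0, -2.9, -3.0, -3.1],
    ]
    labels, centroids = kmeans(profiles, k=2)
    print(labels)
```

The number of clusters k has to be chosen by the analyst; inspecting the resulting centroids is one way to decide whether neighbouring clusters describe the same broad pattern.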
Analysis of gene expression profiles in the central nervous system
Morphological changes in the nervous system during metamorphosis were previously described [23-27]. However, only a few studies on the gene expression program of the CNS have been done [7, 28]. One of the major difficulties is the identification of the specific transcriptional program of each cell type composing the central nervous system. A set of 502 genes was identified as being differentially expressed in the CNS between NF stages 55 and 66. The gene expression profiles were clustered into 8 groups by K-means (fig 2). Three classes of gene profiles can be seen in the clustering results. Transient up-regulation is observed for genes in clusters 1 and 2. Constantly up-regulated genes are in clusters 3 and 4, while constantly down-regulated genes belong to clusters 5, 6 and 7. The remaining cluster (Cluster H) is composed of differentially expressed genes exhibiting a divergent profile. Results of the GO (Gene Ontology) term enrichment analysis for the three global expression patterns observed are presented in Table 2.
Transitory up-regulated profiles: Transiently up-regulated genes (fig 2, clusters 1 and 2) are implicated in the regulation of the cell cycle, notably in cell cycle arrest (skb1, tsc2) or proliferation inhibition (eststferf). These processes are in agreement with precursor cells engaging in differentiation. Cluster 1 is particularly enriched in transcription factors, among them thbzip, a transcription factor regulated by TH during metamorphosis. nkx2.1, usf1 and sox1 were found similarly expressed, together with unknown zinc finger proteins (zn561, zn300).
Constantly up-regulated profiles: Analysis of the up-regulated clusters 3 and 4 revealed the expression of genes already identified as direct targets of thyroid hormone: nr1a2b1 (TRβ, thyroid hormone receptor up-regulated during metamorphosis, cluster 3) and bzip (fra2, direct TH-response transcription factor, cluster 4).
The glucocorticoid receptor (gr, cluster 4) is another gene up-regulated during metamorphosis in the brain and identified in this category [19]. Transcription factor activity, Wnt signaling pathway and glucose metabolism are the most significant GO terms enriched in this group. Wnt signaling is essentially found in cluster 4 together with transcription factors. Cation transporter activity and neurophysiological processes are significantly enriched terms in clusters 3 and 4. Genes encoding neurotransmitters or neurotransmitter receptors are found up-regulated: oxt, chrnb2, chrnb4 in cluster 3 or chrm2, act1, adra2, gabt4, gabarapl1, grm7, grin1 in cluster 4. This is concordant with a new wave of neurogenesis identified by the expression of different transcription factors involved in this process (xotx2, zic1, ngn1 and co-regulators pcafa and pcafb) as well as molecules implicated in axon growth or guidance (epha2, elfa1, ephrinb3, ntrn1). Global regulators of transcription (co-factors nspc1, bcap37, smarcb1), homeobox proteins (lbhx1, cad2, mxr, hoxb7, gsc) or uncharacterized zinc finger proteins (zn343, zf260) were found up-regulated. None of these transcriptional regulators had been implicated in metamorphosis so far.
Constantly down-regulated profiles: Chromatin silencing, regulation of translation, transcription factor activity and regulation of cell cycle are GO terms found significantly overrepresented in the constantly down-regulated clusters 5, 6 and 7 (Fig 2, table 2). While structural components of the ribosome are represented mostly in clusters 5 and 6, chromatin silencing is more associated with cluster 6. Enhancement of gene expression in the CNS by inactivation of co-repressor activity is evidenced by the down-regulation of several chromatin modifiers such as the CBX genes (cbx1, cbx3 and cbx5), hdac1 or ctcf. We observed the inhibition of the HLH binding proteins id3 and id4. Moreover, many transcription factors are also down-regulated, such as homeobox transcription factors of the HoxC family (hoxc4, hoxc6 and hoxc9), zinc finger proteins (zf161, zn161, zo6 and zo71) or other factors (sox2, zic3). The causes of the global down-regulation of 11 ribosomal proteins are poorly understood. This is correlated with the down-regulation of several translational factors such as if5, if36, if2b or sui1. However, this could be interpreted as the consequence of the loss of specific cell types.
Cluster H: Among the differentially expressed genes exhibiting divergent profiles, we observed key regulators such as jagged1 (jdd11, a NOTCH ligand), smoothened (smo, a hedgehog receptor), igfr1 (insulin-like growth factor 1 receptor), derriere (a TGFβ family member) and frz2 (secreted frizzled 2, a secreted WNT inhibitor). Transcription factors such as p3f2b (pou3), cdx1 and mad4 were found similarly expressed but had not been reported as playing a role during metamorphosis.
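The statistical test behind the GO term enrichment is not specified in this section; a common choice is the one-sided hypergeometric test, sketched below under that assumption. The function name and the counts in the usage example are hypothetical, except for the 2902 genes represented on the array and the 502 genes found differentially expressed in the CNS.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_universe, n_annotated, n_selected, n_hits):
    """P(X >= n_hits) when drawing n_selected genes without replacement.

    n_universe  : genes on the array
    n_annotated : genes on the array carrying the GO term
    n_selected  : genes in the cluster (or list) being tested
    n_hits      : selected genes carrying the GO term
    """
    if not 0 <= n_annotated <= n_universe or not 0 <= n_selected <= n_universe:
        raise ValueError("inconsistent counts")
    max_hits = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(max(n_hits, 0), max_hits + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical example: 150 genes on the array carry a given term,
    # and 40 of the 502 selected genes carry it.
    p = hypergeom_enrichment_pvalue(2902, 150, 502, 40)
    print(f"p = {p:.3g}")
```

When many GO terms are tested, the resulting p-values need a multiple-testing correction before a term is called significantly enriched.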
Analysis of gene expression profiles in the liver
The gene expression cascade occurring in the liver during amphibian metamorphosis is poorly described. The morphological changes of liver cells have been widely documented in the past [23, 25, 29], and modifications of metabolism, especially the transition from ammonotelism to ureotelism, have been reported in amphibians [29-32]. One study reported the identification of 20 genes as being differentially expressed in the liver [15]. Indeed, few genes have been implicated in the transcriptional regulation of the metabolic shift occurring in the liver during metamorphosis. We report a set of 133 genes identified as being differentially expressed in the liver between NF stages 55 and 66. The corresponding expression profiles were clustered by K-means into 6 clusters (figure 3) that can be further grouped into three categories. Three of these clusters are composed of genes showing a transient (clusters 1 and 2) or constant (cluster 3) up-regulation during the period studied. Cluster 4 gathers genes with a transitory down-regulated profile. Clusters 5 and 6 are made of genes showing a constant down-regulation from NF stage 63 to 66 and from NF stage 58 to 66, respectively. Results of the GO term enrichment analysis for the three global expression patterns observed are presented in Table 3.
Transitory up-regulated profiles: The GO terms significantly enriched in these clusters deal essentially with transcription regulation (table 3, figure 3, clusters 1 and 2). The global co-regulators smarcc1, nap1 and piaspg are up-regulated. smarcc1 may act either as a co-activator or a co-repressor. nap1 regulates cell proliferation, and numerous signaling pathways recruit piaspg for transcriptional read-out. We also identify smad10, a regulator of TGFβ signaling, and junb, a transcriptional regulator expressed in response to growth factors, as well as homeobox proteins (hoxc8, pprx2 and six3). Two zinc finger proteins potentially regulating transcription are also found in these clusters (zn206 and zo8). Response to stress is also identified as a significant process in the transitory up-regulated clusters (table 3). The genes involved work at the level of protein degradation, cytoskeleton stability, protein metabolism and transcription. The matrix metalloprotease mmp21 and prspsn are proteins involved in cellular catabolism. Eplin and stmn2 are involved in the regulation of cytoskeleton stability; grp58 and ec5218 act on protein metabolism; and junb, stat3 and six3 are transcriptional regulators associated with the stress response.
Constantly up-regulated profiles: Several genes encoding regulatory functions showed up in cluster 2 (figure 3). frz1 is a receptor for Wnt ligands. xr11 is an anti-apoptotic molecule of the Bcl superfamily. ngn1 is a transcriptional regulator of the bHLH superfamily playing a role in the specification of sensory neurons. Its function in the liver has never been reported.
Transitory down-regulated cluster: Transcription factor activity is the prominent term found enriched in this cluster (figure 3, cluster 4; table 3). The gene nfib1 is found in this cluster. It encodes a transcriptional activator involved in proliferation and identified as up-regulated by TH in the intestine during metamorphosis.
Nuclear receptors (nr2f2 and nr6a1), the homeobox protein hoxb7 and forkhead box proteins (foxf1a, fd4) are similarly expressed in this cluster, together with tcf1c (transcriptional modulator of the Wnt signaling pathway), pparb (responsible for β-oxidation of fatty acids), tcea1 and tcea2 (transcription elongation factors) and zinc finger proteins (znf85, zo6, zn11b).
Constantly down-regulated profiles: GO analysis of the constantly down-regulated clusters underscores components of the ribosome, chromatin modification and nucleic acid binding as significantly enriched GO terms (figure 3, clusters 5 and 6). Numerous histones are identified as down-regulated (h1h1t, h2h4, h1h2ac, h1h3c, h4h4), suggesting a modification of the histone isoforms. As already observed in the CNS, several ribosomal proteins are down-regulated in the liver during metamorphosis (rpl8, l5a, rl15, rs23, rs11, rpl21, sui1).
Quantitative PCR validation
QRT-PCR experiments were performed to independently assess the results of our microarray hybridizations. Since ef1α was not identified as differentially expressed in our experiments and has previously been used for the quantification of TH receptors by RT-PCR [33], it appears to be a good reference for ∆∆Ct experiments. Two sets of genes were tested by QRT-PCR. The first is composed of up- or down-regulated genes identified in the three organs (aldo2, frz1, retdh1, stmn3 and wif1, figure 4A). The second is made of genes regulated during metamorphosis (nr1a1a, nr1a2b1 and thbzip, figure 4B) and only partially identified by our microarray studies. Expression levels were tested for two metamorphic stages. For 11 out of 15 gene-organ combinations, we observed similar expression changes between microarray and QRT-PCR (Figure 4A). In four cases the results are discordant (stmn3 in tail; aldo2, stmn3 and wif1 in liver). It should be noted that in six cases the ratios observed were less than 1 on the log2 scale (i.e. less than a two-fold change) and that we selected a subset of genes for which the differential expression was not the strongest but was observed in the three organs. We conclude that our microarray data are for the most part confirmed by the QRT-PCR experiments, especially for the tail and CNS experiments. Our liver and tail microarray data do not provide evidence for significant changes in the expression of TRβ (nr1a2b1) and thbzip. However, QRT-PCR results revealed their regulation during the period studied in these two tissues (Figure 4B). We confirm the regulation of nr1a2b1 and thbzip observed in the CNS between NF stages 55 and 66 (Figure 4B). Similarly, nr1a1 (TRα) was found differentially expressed in our tail microarray data but not in liver and CNS. QRT-PCR experiments showed that this transcript level increases in the liver (2.4-fold change). As reported in [7, 10, 34, 35], Ribosomal Protein L8 is often chosen as a reference for RT-PCR or Northern blot experiments in metamorphosis studies. We observed a down-regulated profile in our experiments on CNS and liver (cluster 6 in figure 2 and cluster 6 in figure 3, respectively). To extend these observations, we quantified the abundance of this transcript in the CNS at four different metamorphic stages. A slight down-regulation of this transcript was detected by QRT-PCR (sup figure 1), corresponding to a 1.32- to 2.97-fold variation in the quantity of mRNA and thus confirming our microarray results. We conclude that the rpl8 transcript level does decrease in the CNS during metamorphosis. This finding is in agreement with the observed down-regulation of ribosomal proteins in liver and CNS samples. We strongly suggest not using rpl8 as a reference when conducting expression surveys during amphibian metamorphosis.
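For readers unfamiliar with the ∆∆Ct calculation mentioned above, a minimal Python sketch follows. It assumes a PCR efficiency close to 100% (a doubling of product per cycle) and a single reference gene; the function name and the Ct values in the usage example are hypothetical and are not measurements from this study.

```python
def delta_delta_ct_fold_change(ct_target_test, ct_ref_test,
                               ct_target_control, ct_ref_control):
    """Relative expression of a target gene between two stages.

    delta Ct       = Ct(target) - Ct(reference), computed within each stage
    delta delta Ct = delta Ct(test stage) - delta Ct(control stage)
    fold change    = 2 ** (-delta delta Ct), assuming ~100 % PCR efficiency
    """
    delta_test = ct_target_test - ct_ref_test
    delta_control = ct_target_control - ct_ref_control
    return 2.0 ** -(delta_test - delta_control)


if __name__ == "__main__":
    # Hypothetical Ct values: the target crosses the threshold two cycles
    # earlier at the later stage, while the reference gene is unchanged.
    fold = delta_delta_ct_fold_change(
        ct_target_test=24.0, ct_ref_test=18.0,
        ct_target_control=26.0, ct_ref_control=18.0,
    )
    print("fold change:", fold)
```

The calculation makes clear why the choice of reference matters: any change in the reference gene between stages is transferred directly to the apparent fold change of the target.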
Comparisons between the tail, CNS and liver gene expression profiles
To gain insights into the different fates elicited by TH during metamorphosis, we compared the repertoires of differential gene expression between the three organs sampled (Figure 5). Globally, 797 genes are found differentially expressed in at least one of the three organs (Figure 5A). The proportions of genes found differentially expressed exclusively in one organ are 41% for the liver, 72% for the CNS and 61% for the tail (Figure 5A). Similar proportions are found when only transcription factors are taken into account (liver: 47%; CNS: 73%; tail: 61%; Figure 5D). This indicates that the repertoire of differential expression of each organ is mainly non-overlapping with those of the other organs, and is due to the activity of a specific set of transcription regulators in each tissue.
Nearly two-thirds of the genes that are found differentially expressed are up-regulated (513 genes, Figure 5B). Only 2% of these 513 genes are common to the three organs. This set of 10 genes commonly up-regulated by thyroid hormones could represent a part of the genes that promote the onset of metamorphosis in all organs. Tail and CNS express a common set of 44 genes whereas CNS and liver express only 5 genes in common. This underscores the similarity of the tail and CNS transcriptional programmes, and their difference from that of the liver. Looking at down-regulated genes (Figure 5C) shows that no genes are shared in this category between the three tissues. Common sets of genes shared by two organs are small. However, 12 genes are co-regulated between CNS and liver, and these transcripts essentially encode ribosomal proteins.
We compared the sets of transcription regulators expressed in these three tissues (Figure 5D). Each organ expresses a relatively specific set of transcription factors (17 TFs in liver, 67 in CNS and 48 in tail). Only 9 transcription factors are expressed in tail, CNS and liver. Only 2 are shared specifically between CNS and liver, whereas 14 are common to CNS and tail. Our interpretation is that the transcriptional control of metamorphosis is very different in each organ subjected to thyroid hormone influence. This raises once again the question of how a single molecule acting on gene expression can confer such different gene expression programs.
Some genes were found up-regulated in some organs and down-regulated in the others. The result of comparing these gene expression profiles is shown in Table 5. For example, Zo6, a potential transcriptional regulator, is expressed in the three organs. This gene is strongly up-regulated in the tail but strongly repressed in the CNS. The same is observed for Tcp4 (transcriptional co-activator) and Hsp70 (chaperone). Such results show how different metamorphic programs can be between organs. The fact that one gene is up-regulated in one tissue but down-regulated in others, especially in the case of transcriptional regulators, illustrates how thyroid hormone can govern different gene regulations in different organs. Such genes, showing different expression programs in different tissues, could enhance some biological processes and thus facilitate the accomplishment of the metamorphic program in one organ.
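The organ-by-organ comparison above amounts to set operations on lists of differentially expressed genes. The following is a minimal sketch of that bookkeeping, assuming each organ's result is available as a set of gene identifiers; the identifiers and the helper name `exclusive_fraction` are hypothetical and are not taken from the Xenopuce data.

```python
def exclusive_fraction(organ, gene_sets):
    """Fraction of an organ's differentially expressed genes found in no other organ."""
    others = set().union(
        *(genes for name, genes in gene_sets.items() if name != organ)
    )
    own = gene_sets[organ]
    return len(own - others) / len(own)


# Toy identifiers (hypothetical, for illustration only).
gene_sets = {
    "liver": {"g1", "g2", "g3", "g4"},
    "CNS": {"g1", "g5", "g6", "g7", "g8"},
    "tail": {"g1", "g2", "g8", "g9"},
}

# Genes differentially expressed in all three organs.
shared_by_all = set.intersection(*gene_sets.values())

# Organ-specific proportion for each organ.
fractions = {organ: exclusive_fraction(organ, gene_sets) for organ in gene_sets}
```

The same two operations, applied separately to the up-regulated, down-regulated and transcription factor subsets, would give the counts reported for each panel of a Venn diagram.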
Promoter Studies
We next sought to identify potential binding sites of thyroid hormone receptors in the promoters of differentially expressed genes. Thyroid hormone regulates transcription via thyroid hormone receptors. These receptors heterodimerise with RXR receptors and recognize Thyroid hormone Responsive Elements (TREs) on DNA. TREs are sequences composed of two direct hexamer repeats (preferentially AGGTCA) spaced by 4 nucleotides (Direct Repeat 4, DR4 motif). TREs have been identified in various positions around the start site of genes directly regulated by thyroid hormones, for example upstream of the THbZIP coding sequence [36] or in the first intron of the Stromelysin 3 gene [37]. We took advantage of the genomic sequence of X. tropicalis to study the promoter regions of genes identified as differentially expressed during metamorphosis. Although the characterization of promoters remains difficult in X. tropicalis due to the lack of proper genome annotation, upstream regions of predicted transcripts are retrievable using tools like ENSEMBL Biomart [38] or TOUCAN [39]. We extracted 10 kb upstream and 5 kb downstream of the start codon for each Xenopuce transcript showing a significant alignment with X. tropicalis predicted transcripts. These 15 kb were masked for repeated sequences with Censor 4.1 and then analyzed with NHRscan, a program that identifies binding sites of nuclear hormone receptors using a Hidden Markov Model [40].
When DR4 identification in the promoter regions of differentially expressed genes in each organ was compared with a random set of 200 sequences, no difference in distribution was observed (Figure 6A). This suggests that other factors could be required for proper regulation of thyroid response genes. As shown for X. laevis TRβ and THbZIP, several potential TRE sites are identified upstream of their cognate coding sequences but only one or two are known to be functional [36, 41].
We analyzed the distribution of the number of DR4 among promoter sequences. The DR4 distribution by cluster is different from the one observed globally. Some clusters have a larger proportion (more than 15%) of sequences with 3 or more sites (clusters 1, 4, 6, 7 and H in CNS data, clusters 2 and 3 in liver data, and clusters 1, 2, 5, 7 and 8 in tail data, Figure 6B). The distribution of motifs was statistically assessed by comparing DR4 distributions between clusters showing more than 3 DR4 and the others. No difference is seen in the distribution of sequences showing fewer than 3 DR4. Nonetheless, the statistical difference is robust between clusters showing more than 15% of sequences with 3 or more DR4 and the others. Interestingly, these clusters contain identified direct response genes of thyroid hormone (NR1A2B1 in CNS cluster 1, MMP2 in tail cluster 1). These clusters could contain direct target genes of thyroid hormone receptors.
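The cluster-level comparison can be illustrated with a short sketch. It computes, for each cluster, the fraction of sequences carrying 3 or more DR4 sites, flags clusters above the 15% threshold mentioned in the text, and then applies a one-sided hypergeometric test. The choice of test is an assumption made here for illustration, since the text does not state which statistical test was used, and the site counts are invented.

```python
from math import comb


def fraction_with_many_sites(site_counts, min_sites=3):
    """Fraction of sequences in a cluster with at least `min_sites` DR4 sites."""
    if not site_counts:
        return 0.0
    return sum(1 for n in site_counts if n >= min_sites) / len(site_counts)


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n sequences are drawn from N, K of which have many sites."""
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return total / comb(N, n)


# Hypothetical DR4 counts per promoter sequence, grouped by cluster.
clusters = {
    "cluster_A": [0, 1, 3, 4, 0, 2, 5, 1, 0, 3],
    "cluster_B": [0, 0, 1, 2, 1, 0, 0, 3, 1, 0],
}

# Clusters in which more than 15% of sequences carry 3 or more sites.
enriched = [
    name for name, counts in clusters.items()
    if fraction_with_many_sites(counts) > 0.15
]

# Is cluster_A richer in multi-site sequences than expected from the pooled data?
all_counts = [n for counts in clusters.values() for n in counts]
N = len(all_counts)
K = sum(1 for n in all_counts if n >= 3)
k = sum(1 for n in clusters["cluster_A"] if n >= 3)
p_value = hypergeom_upper_tail(k, len(clusters["cluster_A"]), K, N)
```

With real data, the pooled set would contain every analysed promoter sequence, and a multiple-testing correction across clusters would be advisable.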
Discussion
1. General features of the expression data
Here we have characterized the temporal expression profiles of 802 transcripts found significantly regulated in the central nervous system, tail or liver during metamorphosis in X. tropicalis. We showed that most transcriptional regulations during metamorphosis are due to a combination of ubiquitous and organ-specific factors. The global extent of the transcriptional changes triggered by thyroid hormone during metamorphosis was revealed using the biological processes defined in the Gene Ontology thesaurus. Indeed, the functional annotation of the genes belonging to the same cluster enabled us to find statistically significant overrepresentation of processes such as cell cycle, apoptosis, neurogenesis and metabolism. Similarly, the regulation by TH of specific signalling pathways (wnt, notch) was observed. A recurrent theme is the modulation of transcription, either activation or inhibition, using either coregulators or specific factors. Transcriptional regulation of hox genes was observed and might play an important role in the respecification of cell identity during metamorphosis. We conclude that, in terms of regulation, metamorphosis can be conceived essentially as a process of transcriptional reprogramming.
From our in silico promoter characterization, we observed that one-third of the genes identified are presumed to lack canonical DR4 elements in their regulatory regions and might be entirely controlled by factors other than TH receptors. We cannot exclude that thyroid hormone response elements (TREs) exist in the promoters of these genes, but either they are located further away from the transcriptional unit or they are non-canonical DR4. For the remaining two-thirds of the genes, at least one DR4 was found in silico. These transcription units are putative direct targets of THR, and further experiments are required to validate these predictions.
2. TFs composing the metamorphic pathway
Validation of Shi's hypothesis
An important question about metamorphosis is how thyroid hormone drives different cellular processes such as apoptosis, proliferation and differentiation. YB Shi proposed three different models of the transcriptional control of metamorphosis (Figure 7, [12]). The first one is that a common set of transcription factors is required to achieve metamorphosis through the regulation of different target genes in the different tissues. The second model proposes that tissue-specific sets of transcription factors allow the achievement of metamorphosis. The third model is a combination of the two previous models. Our data are suggestive of a composite model (i.e. ubiquitous factors and specific factors). Indeed, we found two populations of transcription factors differentially expressed during metamorphosis. The first is a minimal set which is common to the three organs and could represent a part of a shared transcriptional program. This observation is consistent with what has already been observed. So far, only four transcription regulators common to at least three organs (brain, tail and intestine) have been identified during amphibian metamorphosis. These are TRβ, bZIP, BTEB and THbZIP [5, 7, 8]. These genes are direct TH-response genes and thus form the core of the shared transcriptional regulation program during metamorphosis. Our data implicate new common transcription factors, but functional studies proving a direct T3 regulation for these remain to be done.
Another noticeable observation is that the number of shared transcription factors is greater between tail and central nervous system than between liver and the other tissues. This underlines the fact that the transformations occurring in the liver are very different from those occurring in the tail or central nervous system, and hence that the regulation of gene expression differs markedly between organs accomplishing very different metamorphic transformations.
3. Homeobox genes expression and thyroid hormone signaling
Numerous Hox and homeobox genes have been identified as differentially expressed during metamorphosis in liver (HoxB7, HoxC8, Six3 and Prrx2), tail (Anf1, Twn, Dlx1, Shox2, Mxr, Optx2, Msx2, Gsc, Lhx2, Tlx1, HoxB7, Lhx1, Six2, Brn3A, Vax1) and CNS (Cad2, Mxr, Otx5B, Nkx2.1, HoxB7, Gsc, Brn3A, Hhex1, HoxC9, HoxC4, Pbx2, HoxC6, Meis1, Twn and Cad1). It is now well established that hox genes remain active during adult life in some cell populations [42, 43]. HoxA9, 10 and 11 have been implicated in these processes, giving positional identities to the cells where these genes are expressed. Regulation of hox genes during this period is mainly due to the endocrine system. Retinoic acid, vitamin D, estrogen and progesterone have been shown to directly regulate hox gene transcription [43]. Moreover, response elements to these hormones have been identified upstream of various Hox transcription units [44-47]. As in adult tissues showing cyclic plasticity during life, hox genes could play a predominant role during metamorphosis by redefining cell position in tissues subjected to changes, especially in the CNS. Until now, the regulation of hox genes during metamorphosis has been an unexplored theme. But our data, together with other lines of evidence, suggest a role for hox genes during metamorphosis. First, nuclear receptors and especially retinoic acid receptors are well known to regulate hox gene transcription. The DNA binding elements of these receptors are structurally similar, i.e. direct repeats separated by several nucleotides. DR4 elements are characterized as binding sites of TR/RXR heterodimers. Such protein complexes could recognize other binding sites of the same family such as DR0 or DR1 [48]. Therefore genes known to be regulated by retinoic acid or other nuclear receptors, such as the hox genes, could also be regulated by thyroid hormone.
Moreover, the overexpression of thyroid hormone receptor α1 disrupts the retinoic acid signaling necessary for proper Hox gene expression and a direct competition between TR/RXR and RAR/RXR on genomic DNA has been shown to explain this observation [49]. Altogether, the hypothesis of the direct regulation of hox genes by thyroid hormone during metamorphosis is likely and remains to be experimentally challenged.
4. Comparison with other microarray data on Xenopus metamorphosis
A recent study [11] reported gene expression profiling during induced metamorphosis in X. laevis. In this study, pre-metamorphic tadpoles were treated with thyroid hormone and gene expression in the brain, tail and hind limbs was monitored using microarrays. The gene expression profiles characterized in our study and in that of Das et al. were compared. Identification of orthologous genes between X. laevis and X. tropicalis was only possible for 60 to 70% of our genes identified as differentially expressed, depending on the experiment (CNS, liver or tail, see Table 5). Our results were then separated into up- or down-regulated categories and compared to the results for each organ of the Das et al. study, finally resulting in nine different comparisons. A small proportion of genes are common to both experiments. For example, only 52% of the genes identified as differentially expressed in our study on CNS are found in the Das et al. brain data. Of these, 37% share a similar pattern of expression. Similarly, of the 35% of genes identified as differentially expressed in the tail in our study and found in the Das et al. tail data, 69% show a similar pattern between the two datasets. The remaining genes show an opposite regulation behavior.
Different reasons can explain such differences. First, the studies were not conducted on exactly the same organs (i.e. brain versus complete CNS). Second, even if TRβ and THbzip have been shown to be regulated similarly in physiological metamorphosis and under T3 treatment, there is no evidence that T3 treatment mimics the physiological roles of TH during metamorphosis. Moreover, premetamorphic tadpoles treated with T3 do not undergo a complete metamorphosis. Thus, the differences observed between the Das et al. dataset and ours are logical consequences of the sum of differences in experimental set-up and technique. However, the complementarity of these experiments is valuable for determining which genes are most likely under the influence of thyroid hormone during Xenopus metamorphosis.
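The two percentages reported for each comparison (genes found in both datasets, and among those, genes regulated in the same direction) can be computed as sketched below. The gene names, the "up"/"down" encoding and the function name `concordance` are hypothetical and serve only to make the calculation explicit; orthology mapping is assumed to have been done beforehand.

```python
def concordance(ours, theirs):
    """Return (fraction of our genes found in the other dataset,
    fraction of those shared genes regulated in the same direction)."""
    shared = ours.keys() & theirs.keys()
    if not shared:
        return 0.0, 0.0
    same_direction = sum(1 for gene in shared if ours[gene] == theirs[gene])
    return len(shared) / len(ours), same_direction / len(shared)


# Hypothetical regulation calls keyed by orthologous gene identifier.
ours = {"a": "up", "b": "down", "c": "up", "d": "up"}
theirs = {"a": "up", "b": "up", "e": "down"}

found, similar = concordance(ours, theirs)
```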
Promoter Characterization
Promoter analysis remains difficult despite considerable efforts by the community to annotate this kind of sequence. Many tools are now available, using different methods to characterize TF binding sites or even to extract regulation models from the upstream sequences of transcripts [50]. We took advantage of the X. tropicalis genome to try to identify regulators of gene expression during metamorphosis. Here, we focused on TREs by analyzing the DR4 distribution in the upstream regions of genes identified as differentially expressed. The results showed that DR4 do not seem to be particularly over-represented in our subset of sequences compared to a random set of sequences extracted from X. tropicalis ENSEMBL. A closer look at the results shows that the DR4 distribution is not homogeneous among clusters: some clusters contain more sequences with at least 3 DR4 motifs and could correspond to early response genes. Genes known to be directly regulated by thyroid hormone signaling are members of these clusters. However, experimental characterization of these sequences needs to be conducted to validate DR4 functionality. An analysis of TFBS distribution in promoter sequences should also be carried out. This task promises to be tricky since little is known about TFBS in the Xenopus genus, and phylogenetic footprinting is not easily applicable since no genome of a closely related species has been sequenced. An interesting approach could be to analyse ECR (Extremely Conserved Regions), but how relevant these data are to transcriptional regulation has not yet been determined. Another interesting tool that is still in development is an X. tropicalis genomic array. For the moment, only 3000 genes are planned to be represented on this array, but it could be an extremely useful tool for TFBS studies, especially during metamorphosis, for which a lot of work has already been done on chromatin immunoprecipitation.
Materials and methods
1. Microarray design and genes annotation
Xenopuce is a project gathering the efforts of four laboratories interested in different aspects of Xenopus development. The aim of this project was to build a long-oligonucleotide microarray representing 3000 genes of interest to the consortium members. We initially drew up a list of biological processes characteristic of early development and metamorphosis, corresponding to the research topics of each research team. Gene products playing a role in these processes were compiled using different methods, from manual listing to global selection of genes belonging to a given Gene Ontology term. A vertebrate accession number is associated with each item, allowing the identification of the X. tropicalis orthologue. The reciprocal best BLAST hit method was used to identify orthologous genes using the available X. tropicalis sequence data, especially EST sequences produced from the head and retinas of young tadpoles and the nervous system of metamorphic tadpoles (xtscope project, Thuret et al 2007). Orthologous transcript identification was possible for 1639 genes. A specific search for genes encoding a DNA binding domain was made in EST databases; 563 potential transcriptional regulators were identified in this way. Finally, we included potential full-length cDNA sequences encoding genes of unselected biochemical function or biological process. These cDNAs came either from the XGC project (302 sequences) or from our own cDNA sequencing efforts (404 sequences). This set of 2908 sequences was transmitted to MWG Biotech for oligonucleotide design and spotting on glass slides. One or two 50-nt-long oligonucleotides were selected to represent each cDNA sequence.
The annotation of the microarray transcripts is three-fold. First, a meta-category and a category were associated with each gene while setting up the list. This annotation is biologically oriented since each list designer chose genes identified as playing a role in his or her research topic. Second, a BLASTP comparison of each Xenopuce sequence was made against the SWISSPROT/UNIPROT database. Each transcript is thus associated with a SWISSPROT/UNIPROT entry including the protein definition and function. Finally, systematic GO annotation of Xenopuce transcripts was carried out using GOtcha [51]. This software assigns a GO annotation to a nucleotide or protein sequence by BLAST homology. GO annotation accuracy depends on the similarity threshold accepted. In our case, this threshold was set to 40%. Information on the composition and annotation of the Xenopuce microarray is available on our website [52].
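The reciprocal best BLAST hit criterion used for orthologue identification can be sketched as follows. This is a minimal illustration, assuming that the BLAST results of both search directions have already been parsed into (query, subject, bit score) tuples; the function names and sequence identifiers are hypothetical, and the actual parsing and score thresholds used for Xenopuce are not described in the text.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, backward):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    fwd = best_hits(forward)
    bwd = best_hits(backward)
    return sorted((a, b) for a, b in fwd.items() if bwd.get(b) == a)


# Hypothetical hits: vertebrate reference sequences vs X. tropicalis sequences.
forward = [
    ("hsA", "xt1", 250.0), ("hsA", "xt2", 90.0),
    ("hsB", "xt2", 180.0), ("hsC", "xt3", 60.0),
]
backward = [
    ("xt1", "hsA", 240.0), ("xt2", "hsB", 175.0),
    ("xt3", "hsD", 80.0), ("xt3", "hsC", 55.0),
]

orthologues = reciprocal_best_hits(forward, backward)
```

In this toy example, hsC is not retained because its best X. tropicalis hit (xt3) has a better match elsewhere (hsD).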
2. Animal staging and RNA preparation for microarray experiments
Xenopus tropicalis tadpoles were raised to metamorphosis and then staged according to the Nieuwkoop and Faber developmental table [53]. Ten tadpoles of each NF stage were dissected and the isolated organs were kept in RNAlater at –20°C until RNA preparation. RNAs were then extracted using Trizol™ (Invitrogen™) coupled with Phase Lock Gel (Eppendorf™). After DNase digestion (Ambion Turbo DNase™), RNAs were purified with the Megaclear™ kit (Ambion) and their quality was assessed using an Agilent 2100 Bioanalyzer with the RNA 6000 Nano LabChip kit.
3. Probes preparation and microarray hybridization
10 µg of total RNA were used to prepare probes with the Invitrogen SuperScript™ Indirect cDNA labeling system (using polyA and random hexamer primers) according to the manufacturer’s protocol. These probes were then coupled to Amersham Cy3 or Cy5 monofunctional reactive dye following the Invitrogen protocol. Probe quality was then assessed on a 1% agarose minigel on a glass slide scanned in a Genetac LS IV scanner, and probes were quantified with a Nanodrop ND-1000 spectrophotometer. Dye quantities were then equilibrated for hybridization by quantity of fluorescence per ng of cDNA. Probes were then dried by speedvac, re-suspended in 35 µL of MWG™ hybridization buffer and placed between slide and coverslip on pre-saturated arrays according to the manufacturer’s protocol (QMT ref). Hybridizations were conducted in hybridization chambers (ref) at 45°C for 20 h. Slides were then washed once in 2X SSC, 0.1% SDS for 5 min, and twice in 1X SSC and 0.5X SSC for 5 min. Hybridized arrays were then dried by centrifugation at 500 rpm in slide baskets and kept in the dark until scanning.
4. Data retrieval, normalization and analysis
Microarrays were scanned using an Axon scanner. Gpr files were created with Genepix 3.0 and data normalization was conducted under R/Bioconductor [54, 55] using LIMMA [56]. Flagged spots were excluded before print-tip loess normalization was applied to the raw signals. The logarithms of absolute intensities were extracted to identify differentially expressed genes with the MAANOVA package [57], using array replicates as separate datasets. Mixed model analysis was conducted taking into account the following parameters: Dye, Array and Sample. Transcripts showing a differential expression for the three statistical tests were then retained, and the log2 of absolute expression data for each sample were retrieved. Expression profiles were finally reconstructed by subtracting the initial stage data from the data of the other stages (NF stage 55 for central nervous system and liver, NF stage 58 for tail). Hierarchical and K-means clustering were carried out using Cluster [58] ([59] for the Mac OS X adaptation) and visualized with TreeView [60]. GO term enrichment evaluation was performed with STEM [61].
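The profile reconstruction step described above (subtracting the log2 value of the initial stage from the values of the other stages) is illustrated by the sketch below. The intensity values, the stages other than NF 55, and the function name are hypothetical; the normalization itself was done in R/Bioconductor and is not reproduced here.

```python
def expression_profile(log2_by_stage, reference_stage):
    """Express each stage as a log2 ratio relative to the reference stage."""
    reference = log2_by_stage[reference_stage]
    return {stage: value - reference for stage, value in log2_by_stage.items()}


# Hypothetical log2 absolute intensities of one transcript in the CNS,
# keyed by NF stage; NF stage 55 is the initial stage for CNS and liver.
cns_log2 = {55: 10.0, 58: 10.5, 62: 11.5, 66: 9.0}

profile = expression_profile(cns_log2, reference_stage=55)

# A log2 ratio converts to a fold change as 2 ** ratio.
fold_changes = {stage: 2 ** ratio for stage, ratio in profile.items()}
```

Because the subtraction is done on log2 values, a profile value of 1 corresponds to a two-fold increase relative to the initial stage, and a value of -1 to a two-fold decrease.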
5. Quantitative RT-PCR
Primer design was performed with Primer Express. PCR fragments were chosen around introns to avoid genomic DNA amplification. Table xx lists the primer sequences of the genes tested by QRT-PCR and their location on the X. tropicalis genome version 4.1. Independent RNA isolations were prepared using the same protocol as for microarrays. Five tadpoles of each tested stage (NF 55, 57, 62 and 64) were dissected. RNA quality was assessed on an Agilent 2100 Bioanalyzer with the RNA 6000 Nano LabChip kit. Reverse transcription was conducted with the Invitrogen Superscript III kit following the manufacturer’s protocol. Quantitative RT-PCR was then performed on an ABI Prism 7900 HT using SYBR green PCR Master Mix according to the manufacturer’s protocol. The efficiency of primer pairs was assessed by absolute quantification experiments on 5 orders of dilution (from 100 ng to 0.01 ng). Relative quantification experiments using the ∆∆Ct method were then performed using EF1α as the reference gene. Standard PCR conditions (40 cycles, 15 s at 94°C, 1 min at 60°C, primers) were used. Amplification of each gene in each sample was carried out in triplicate.
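The two calculations mentioned in this section can be sketched as follows: the amplification efficiency of a primer pair estimated from the slope of a dilution series, and the relative quantification by the ∆∆Ct method. The Ct values are invented for illustration, the function names are hypothetical, and the ∆∆Ct formula as written assumes an amplification efficiency close to 100% (a doubling per cycle).

```python
from math import log10


def amplification_efficiency(quantities_ng, ct_values):
    """Efficiency from the least-squares slope of Ct against log10(quantity).

    E = 10 ** (-1 / slope) - 1, so E = 1.0 means a perfect doubling per cycle.
    """
    xs = [log10(q) for q in quantities_ng]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ct_values) / len(ct_values)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return 10 ** (-1 / slope) - 1


def ddct_fold_change(ct_target, ct_ref, ct_target_calibrator, ct_ref_calibrator):
    """Fold change of the target gene relative to the calibrator sample."""
    delta_sample = ct_target - ct_ref
    delta_calibrator = ct_target_calibrator - ct_ref_calibrator
    return 2.0 ** -(delta_sample - delta_calibrator)


# Hypothetical dilution series from 100 ng to 0.01 ng.
efficiency = amplification_efficiency(
    [100, 10, 1, 0.1, 0.01], [20.0, 23.32, 26.64, 29.96, 33.28]
)

# Hypothetical Ct values: target gene and reference gene (EF1α in the text),
# in a later stage (sample) and in the initial stage (calibrator).
fold = ddct_fold_change(
    ct_target=24.0, ct_ref=18.0, ct_target_calibrator=26.0, ct_ref_calibrator=18.0
)
```

In practice each Ct would be the mean of the triplicate amplifications, and a primer pair whose efficiency departs markedly from 1.0 would call for an efficiency-corrected formula instead.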
6. Promoter studies
Microarray oligonucleotide sequences were aligned with BLAST against X. tropicalis ENSEMBL predicted transcripts to identify translation initiation sites. ENSEMBL transcript IDs were then used in TOUCAN [39] to retrieve 10 kb upstream and 5 kb downstream of the start codon. Genomic sequences were masked for repeated sequences with Censor 4.1 and then analyzed with NHRscan [40] to identify potential TRE sites (DR4). These sites were then validated with fuzznuc since NHRscan eliminates stretches of Ns. After verification, DR4 were counted and their positions kept for further PCR primer design.
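As an illustration of what a DR4 site is, the sketch below scans both strands of a sequence for two direct repeats of a hexamer separated by any 4 nucleotides. It is a deliberately simplified stand-in: NHRscan relies on a Hidden Markov Model and tolerates degenerate half-sites, whereas this pattern only finds exact matches. The half-site AGGTCA is assumed here as the consensus commonly described for nuclear receptors, and the test sequence is invented.

```python
import re

HALF_SITE = "AGGTCA"  # assumed consensus half-site
MOTIF_LENGTH = 2 * len(HALF_SITE) + 4
# Lookahead so that overlapping sites are all reported.
DR4 = re.compile(rf"(?=({HALF_SITE}[ACGTN]{{4}}{HALF_SITE}))")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T and N."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def find_dr4(seq):
    """Return sorted (start, strand) pairs; starts are 0-based on the given strand."""
    seq = seq.upper()
    hits = [(m.start(), "+") for m in DR4.finditer(seq)]
    rc = reverse_complement(seq)
    # Convert positions found on the reverse strand back to forward coordinates.
    hits += [(len(seq) - m.start() - MOTIF_LENGTH, "-") for m in DR4.finditer(rc)]
    return sorted(hits)


# Invented 20-nt sequence containing one exact DR4 at position 2.
example = "TTAGGTCACTGAAGGTCAGG"
sites = find_dr4(example)
```

Counting the returned sites per 15 kb region would give the per-sequence DR4 numbers discussed in the results, under the simplifying assumption of exact consensus matches.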
List of abbreviations
CNS: Central Nervous System ; DR4: Direct Repeat 4 ; EST : Expressed Sequence Tag ; GO: Gene Ontology ; NF: Nieuwkoop and Faber ; RXR: Retinoid-X-Receptor ; TF: Transcription Factor ; TH: Thyroid Hormone ; THR: Thyroid Hormone Receptor ; TRE: Thyroid hormone Responsive Element.
Acknowledgements
We thank L. Du Pasquier for the gift of X. tropicalis animals and his continuous support. This research was funded by grants from le Centre National de la Recherche Scientifique, le Ministère de l’Education, de la Recherche et de la Technologie (French Xenopus Stock Center), the University of Paris Sud and the European Community FP6 (X-omics coordinated action No. 512065). We thank the Department of Energy’s Joint Genome Institute for the availability and the use of X. tropicalis genomic sequences. We thank Christophe de Medeiros for taking care of the animals. We acknowledge the technical support from the Plateforme d’Instrumentation et de Compétences en Transcriptomique of INRA Jouy-en-Josas and especially Sophie Pollet and Emmanuelle Zalachas for their assistance as well as the GODMAP microarray facility of CNRS Gif-sur-Yvette. We acknowledge the assistance of David Du Pasquier, Laurent Coen, Catherine Jessus, Yann Audic, Muriel Perron, De-Li Shi for their contribution in the gene selection process for the Xénopuces project. Thanks to André Mazabraud and Maurice Wegnez for general support.
Bibliography
1. Sap J, Munoz A, Damm K, Goldberg Y, Ghysdael J, Leutz A, Beug H, Vennstrom B: The c-erb-A protein is a high-affinity receptor for thyroid hormone. Nature 1986, 324(6098):635-640.
2. Tsai MJ, O'Malley BW: Molecular mechanisms of action of steroid/thyroid receptor superfamily members. Annu Rev Biochem 1994, 63:451-486.
3. Umesono K, Evans RM: Determinants of target gene specificity for steroid/thyroid hormone receptors. Cell 1989, 57(7):1139-1146.
4. Wang Z, Brown DD: A gene expression screen. Proc Natl Acad Sci U S A 1991, 88(24):11505-11509.
5. Wang Z, Brown DD: Thyroid hormone-induced gene expression program for amphibian tail resorption. J Biol Chem 1993, 268(22):16270-16278.
6. Brown DD, Wang Z, Furlow JD, Kanamori A, Schwartzman RA, Remo BF, Pinder A: The thyroid hormone-induced tail resorption program during Xenopus laevis metamorphosis. Proc Natl Acad Sci U S A 1996, 93(5):1924-1929.
7. Denver RJ, Pavgi S, Shi YB: Thyroid hormone-dependent gene expression program for Xenopus neural development. J Biol Chem 1997, 272(13):8179-8188.
8. Shi YB, Brown DD: The earliest changes in gene expression in tadpole intestine induced by thyroid hormone. J Biol Chem 1993, 268(27):20312-20317.
9. Helbing CC, Werry K, Crump D, Domanski D, Veldhoen N, Bailey CM: Expression profiles of novel thyroid hormone-responsive genes and proteins in the tail of Xenopus laevis tadpoles undergoing precocious metamorphosis. Mol Endocrinol 2003, 17(7):1395-1409.
10. Veldhoen N, Crump D, Werry K, Helbing CC: Distinctive gene profiles occur at key points during natural metamorphosis in the Xenopus laevis tadpole tail. Dev Dyn 2002, 225(4):457-468.
11. Das B, Cai L, Carter MG, Piao YL, Sharov AA, Ko MS, Brown DD: Gene expression changes at metamorphosis induced by thyroid hormone in Xenopus laevis tadpoles. Dev Biol 2006, 291(2):342-355.
12. Shi YB: Amphibian Metamorphosis. New York: Wiley-Liss; 2000.
13. Beck CW, Slack JM: An amphibian with ambition: a new role for Xenopus in the 21st century. Genome Biol 2001, 2(10):REVIEWS1029.
14. Carruthers S, Stemple DL: Genetic and genomic prospects for Xenopus tropicalis research. Semin Cell Dev Biol 2006, 17(1):146-153.
15. Lyman DF, White BA: Molecular cloning of hepatic mRNAs in Rana catesbeiana responsive to thyroid hormone during induced and spontaneous metamorphosis. J Biol Chem 1987, 262(11):5233-5237.
16. Wit E, McClure J: Statistical adjustment of signal censoring in gene expression experiments. Bioinformatics 2003, 19(9):1055-1060.
17. Kerr MK, Churchill GA: Experimental design for gene expression microarrays. Biostatistics 2001, 2(2):183-201.
18. Berry DL, Schwartzman RA, Brown DD: The expression pattern of thyroid hormone response genes in the tadpole tail identifies multiple resorption programs. Dev Biol 1998, 203(1):12-23.
19. Krain LP, Denver RJ: Developmental expression and hormonal regulation of glucocorticoid and thyroid hormone receptors during metamorphosis in Xenopus laevis. J Endocrinol 2004, 181(1):91-104.
20. Izutsu Y, Tochinai S, Maeno M, Iwabuchi K, Onoe K: Larval antigen molecules recognized by adult immune cells of inbred Xenopus laevis: partial characterization and implication in metamorphosis. Dev Growth Differ 2002, 44(6):477-488.
21. Watanabe M, Ohshima M, Morohashi M, Maeno M, Izutsu Y: Ontogenic emergence and localization of larval skin antigen molecule recognized by adult T cells of Xenopus laevis: Regulation by thyroid hormone during metamorphosis. Dev Growth Differ 2003, 45(1):77-84.
22. Nakajima K, Yaoita Y: Dual mechanisms governing muscle cell death in tadpole tail during amphibian metamorphosis. Dev Dyn 2003, 227(2):246-255.
23. Dodd MHI, Dodd JM: The Biology of Metamorphosis. In: Physiology of the Amphibia. Edited by Lofts B. New York: Academic Press; 1976: 467-599.
24. Kollros JJ: Transitions in the nervous system during amphibian metamorphosis. In: Metamorphosis: a problem of developmental biology, 2nd edition. Edited by Gilbert LI, Frieden E. New York: Plenum Press; 1981: 445-459.
25. Fox H: Amphibian Morphogenesis. Clifton, N.J.: Humana Press; 1983.
26. Gona AG, Hauser KF, Uray NJ: Ultrastructural studies on Purkinje cell maturation in the cerebellum of the frog tadpole during spontaneous and thyroxine-induced metamorphosis. Brain Behav Evol 1982, 20(3-4):156-171.
27. Tata JR: Gene expression during metamorphosis: an ideal model for post-embryonic development. Bioessays 1993, 15(4):239-248.
28. Denver RJ: The molecular basis of thyroid hormone-dependent central nervous system remodeling during amphibian metamorphosis. Comp Biochem Physiol C Pharmacol Toxicol Endocrinol 1998, 119(3):219-228.
29. Atkinson BG: Metamorphosis: Model systems for studying gene expression in postembryonic development. Dev Genet 1994, 15:313-319.
30. Chen Y, Hu H, Atkinson BG: Characterization and expression of C/EBP-like genes in the liver of Rana catesbeiana tadpoles during spontaneous and thyroid hormone-induced metamorphosis. Dev Genet 1994, 15(4):366-377.
31. Weber R: Biochemistry of amphibian metamorphosis. In: The biochemistry of animal development. Edited by Weber R, vol. 2. New York: Academic Press; 1967: 227-301.
32. Underhay EE, Baldwin W: Nitrogen excretion in tadpoles of Xenopus laevis daudin. Biochem 1955, 61:544-547.
33. Opitz R, Lutz I, Nguyen NH, Scanlan TS, Kloas W: Analysis of thyroid hormone receptor betaA mRNA expression in Xenopus laevis tadpoles as a means to detect agonism and antagonism of thyroid hormone action. Toxicol Appl Pharmacol 2006, 212(1):1-13.
34. Rowe I, Coen L, Le Blay K, Le Mevel S, Demeneix BA: Autonomous regulation of muscle fibre fate during metamorphosis in Xenopus tropicalis. Dev Dyn 2002, 224(4):381-390.
35. Kuiper GG, Klootwijk W, Morvan Dubois G, Destree O, Darras VM, Van der Geyten S, Demeneix B, Visser TJ: Characterization of recombinant Xenopus laevis type I iodothyronine deiodinase: substitution of a proline residue in the catalytic center by serine (Pro132Ser) restores sensitivity to 6-propyl-2-thiouracil. Endocrinology 2006, 147(7):3519-3529.
36. Furlow JD, Brown DD: In vitro and in vivo analysis of the regulation of a transcription factor gene by thyroid hormone during Xenopus laevis metamorphosis. Mol Endocrinol 1999, 13(12):2076-2089.
37. Fu L, Tomita A, Wang H, Buchholz DR, Shi YB: Transcriptional regulation of the Xenopus laevis Stromelysin-3 gene by thyroid hormone is mediated by a DNA element in the first intron. J Biol Chem 2006, 281(25):16870-16878.
38. Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res 2004, 14(1):160-169.
39. Aerts S, Thijs G, Coessens B, Staes M, Moreau Y, De Moor B: Toucan: deciphering the cis-regulatory logic of coregulated genes. Nucleic Acids Res 2003, 31(6):1753-1764.
40. Sandelin A, Wasserman WW: Prediction of nuclear hormone receptor response elements. Mol Endocrinol 2005, 19(3):595-606.
41. Machuca I, Esslemont G, Fairclough L, Tata JR: Analysis of structure and expression of the Xenopus thyroid hormone receptor-beta gene to explain its autoinduction. Mol Endocrinol 1995, 9(1):96-107.
42. Magli MC, Largman C, Lawrence HJ: Effects of HOX homeobox genes in blood cell differentiation. J Cell Physiol 1997, 173(2):168-177.
43. Taylor HS, Igarashi P, Olive DL, Arici A: Sex steroids mediate HOXA11 expression in the human peri-implantation endometrium. J Clin Endocrinol Metab 1999, 84(3):1129-1135.
44. Akbas GE, Song J, Taylor HS: A HOXA10 estrogen response element (ERE) is differentially regulated by 17 beta-estradiol and diethylstilbestrol (DES). J Mol Biol 2004, 340(5):1013-1023.
45. Marshall GM, Cheung B, Stacey KP, Norris MD, Haber M: Regulation of retinoic acid receptor alpha expression in human neuroblastoma cell lines and tumor tissue. Anticancer Res 1994, 14(2A):437-441.
46. Popperl H, Featherstone MS: Identification of a retinoic acid response element upstream of the murine Hox-4.2 gene. Mol Cell Biol 1993, 13(1):257-265.
47. Ogura T, Evans RM: Evidence for two distinct retinoic acid response pathways for HOXB1 gene regulation. Proc Natl Acad Sci U S A 1995, 92(2):392-396.
48. Shin DJ, Plateroti M, Samarut J, Osborne TF: Two uniquely arranged thyroid hormone response elements in the far upstream 5' flanking region confer direct thyroid hormone regulation to the murine cholesterol 7alpha hydroxylase gene. Nucleic Acids Res 2006, 34(14):3853-3861.
49. Essner JJ, Johnson RG, Hackett PB, Jr.: Overexpression of thyroid hormone receptor alpha 1 during zebrafish embryogenesis disrupts hindbrain patterning and implicates retinoic acid receptors in the control of hox gene expression. Differentiation 1999, 65(1):1-11.
50. GuhaThakurta D: Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res 2006, 34(12):3585-3598.
51. Martin DM, Berriman M, Barton GJ: GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics 2004, 5:178.
52. Xenopuce: http://indigene.ibaic.u-psud.fr/xenopuce ; 2005.
53. Nieuwkoop PD, Faber J: Normal Table of Xenopus laevis (Daudin). A systematical and chronological survey of the development from fertilized egg till the end of metamorphosis. New York: Garland; 1994.
54. R package: statistics for microarray analysis: http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html.
55. Bioconductor: http://www.bioconductor.org.
56. Smyth GK, Speed T: Normalization of cDNA microarray data. Methods 2003, 31(4):265-273.
57. Wu H, Kerr M, Cui X, Churchill G: MAANOVA: a software package for the analysis of spotted cDNA microarray experiments. In.
58. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 1998, 95(25):14863-14868.
59. de Hoon MJ, Imoto S, Nolan J, Miyano S: Open source clustering software. Bioinformatics 2004, 20(9):1453-1454.
60. Saldanha AJ: Java Treeview--extensible visualization of microarray data. Bioinformatics 2004, 20(17):3246-3248.
61. Ernst J, Bar-Joseph Z: STEM: a tool for the analysis of short time series gene expression data. BMC Bioinformatics 2006, 7:191.
Figure legends
Figure 1: Clustergrams of tail gene expression profiles during metamorphosis. A. Unsupervised K-means clustering results: K-means clustering was done with 10 classes. Clusters are sorted according to their expression profiles: constantly up-regulated (clusters 1, 2 and 3), transiently up-regulated (clusters 4 and 5), constantly down-regulated (clusters 7, 8, 9 and 10) and transiently down-regulated (cluster 6). The first time point was used as the reference. Histograms at the bottom of each cluster represent the mean of log2 expression values of genes belonging to the cluster at each stage studied. Gene symbols are annotated on the right of each cluster and color-coded as follows: purple: transcription factor involved in proliferation; grey: transcription factor involved in differentiation; turquoise: transcription factor involved in apoptosis; dark blue: genes involved in proliferation/differentiation; brown: gene involved in proliferation/apoptosis; red: transcription factor; green: genes involved in proliferation; light blue: genes involved in differentiation; ochre: genes involved in apoptosis; black: remaining genes. Gene descriptions are given in supplementary table 1. B. Correspondence between intensities (expressed as log2 ratio) and color, metamorphic stages studied according to NF, and scheme of the experimental design.
Figure 2: Clustergram of central nervous system (CNS) gene expression profiles during metamorphosis. A. Unsupervised K-means clustering results: K-means clustering was done with 8 classes. Clusters are sorted according to their expression profiles: up-regulated, transiently up-regulated, down-regulated and transiently down-regulated. The last cluster, showing no particular profile, was clustered using a hierarchical method. The first time point was used as the reference. Histograms at the bottom of each cluster represent the mean of log2 expression values of genes belonging to the cluster at each stage studied. Gene symbols are annotated on the right of each cluster and color-coded as follows: orange: genes involved in proliferation/differentiation/apoptosis; pink: transcription factor involved in proliferation/apoptosis; purple: transcription factor involved in proliferation; grey: transcription factor involved in differentiation; turquoise: transcription factor involved in apoptosis; brown: gene involved in proliferation/apoptosis; blue green: genes involved in differentiation/apoptosis; red: transcription factor; green: genes involved in proliferation; ochre: genes involved in apoptosis; black: remaining genes. Gene descriptions are given in supplementary table 1. B. Correspondence between intensities (expressed as log2 ratio) and color, metamorphic stages studied according to NF, and scheme of the experimental design.
Figure 3: Clustergrams of liver gene expression profiles during metamorphosis. A. Unsupervised K-means clustering results: K-means clustering was done with 6 classes. Clusters are sorted according to their expression profiles: up-regulated, transiently up-regulated, down-regulated and transiently down-regulated. The first time point was used as the reference. Histograms at the bottom of each cluster represent the mean of log2 expression values of genes belonging to the cluster at each stage studied. Gene symbols are annotated on the right of each cluster and color-coded as follows: pink: transcription factor involved in proliferation/apoptosis; purple: transcription factor involved in proliferation; grey: transcription factor involved in differentiation; turquoise: transcription factor involved in apoptosis; brown: gene involved in proliferation/apoptosis; red: transcription factor; green: genes involved in proliferation; ochre: genes involved in apoptosis; black: remaining genes. Gene descriptions are given in supplementary table 1. B. Correspondence between intensities (expressed as log2 ratio) and color, metamorphic stages studied according to NF, and scheme of the experimental design.
Figure 4: Validation of microarray results by quantitative RT-PCR. For each organ, the log2 of MA expression values (curves) and the QRT-PCR log2 of RQ (histograms) were plotted. NF stages studied are indicated at the bottom of each graph. Asterisks (*) mark stages that were tested by QRT-PCR. Since stage 57 in tail and stage 62 in liver were not analyzed on microarrays, MA values for these time points were interpolated as the mean of the neighboring values (dashed lines). A. Genes identified as regulated in every organ. B. Genes known to be regulated by thyroid hormone but identified as regulated in only one organ. C. Absolute quantification of the RPL8 gene in CNS. Mean values of the difference in Ct between NF stage 55 and the other stages are plotted for three different quantities of cDNA.
Figure 5: Venn diagram representations of the overlap between sets of differentially expressed genes in liver, tail and central nervous system during metamorphosis. In A., all genes were considered; in B., only up-regulated genes; in C., only down-regulated genes; and in D., only transcription factors.
Figure 6: Analysis of DR4 distribution in 15 kb of sequence around the start codon of differentially expressed genes. A. Global distribution of DR4 motifs in genes. The Xenopuce random set is a set of 50 sequences corresponding to genes not identified as differentially expressed in the experiments performed. The random set of ENSEMBL transcripts consists of the genomic regions of a set of 200 ENSEMBL transcripts. B. Proportion of sequences containing 1, 2, 3 or more DR4 motifs per cluster.
Figure 7: Hypothesis for the transcriptional control of tissue-specific metamorphic programs (adapted from Shi 2000). In each case, TR/RXR heterodimers repress TH-controlled gene expression until TH is available in the cell. Once complexed with TR/RXR, TH enhances expression of ubiquitous (A) or tissue-specific (B) transcription factors, allowing expression of late tissue-specific response genes. A composite model of A and B is presented in C. In this third model, early-response ubiquitous and tissue-specific transcription factors control expression of late response genes.
Table 1: Tail-related GO terms identified with a maximum p-value of 0.05 for each class, sorted by biological process (P), molecular function (F) and cellular component (C).
Table 2: CNS-related GO terms identified with a maximum p-value of 0.05 for each class, sorted by biological process (P), molecular function (F) and cellular component (C).
Table 3: Liver-related GO terms identified with a maximum p-value of 0.05 for each class, sorted by biological process (P), molecular function (F) and cellular component (C).
Table 4: Expression profiles of genes differently regulated in two or more organs during metamorphosis. The organ-specific expression profile is given for each gene.
Table 5: Comparison of microarray studies on natural and induced metamorphosis.
Table 6: Primers sequences for genes tested by QRT-PCR. Location on X. tropicalis genome and primers sequences are given for each gene used in QRT-PCR experiments.
Chapter 4: Evaluation of time profile reconstruction from complex two-color microarray designs
Constraints on the material available for studying the transcriptome with DNA microarrays led us to choose an experimental strategy of the “interwoven design” type. Given the limits on time, material and money, this option is unavoidable, but it requires an analysis method to reconstruct the information of interest (the expression profiles). The potential advantages of these alternative microarray strategies depend largely on the success of the profile reconstruction. It is therefore necessary to evaluate microarray analysis methods in order to determine which approach performs best. For this reason, we compared the extent to which different linear models are able to reconstruct similar expression profiles.
It has previously been shown that complex two-color microarray designs (e.g., loop design, interwoven design) offer advantages over the commonly used reference design: at the same cost, more balanced measurements in the number of replicates per condition can be obtained. More and more laboratories use these complex designs. What is often ignored, however, is that such designs require more complex analysis procedures to reconstruct the factor of interest (e.g. a gene profile across a time series) from the data. Reconstruction of the gene profile is a critical step in the analysis, as the practical usefulness of complex designs depends on how well analysis methods are able to retrieve this factor of interest from the data. In this study we performed an exhaustive comparison between different profile reconstruction methods (article submitted to BMC Bioinformatics).
Evaluation of time profile reconstruction from complex two-color microarray designs
Ana Fierro1,2,3, Raphael Thuret1,2, K. Engelen4, Gilles Bernot3, Kathleen Marchal4§, Nicolas Pollet1,2,3
1CNRS UMR 8080, Laboratoire Développement et Evolution, Bat 445, F-91405 Orsay, France. 2Univ Paris Sud, F-91405 Orsay, France 3Programme d'Epigenomique – Genopole, Univ Evry, Tour Evry-2, Place des terrasses, 91000 Evry, France 4Dep Microbial and Molecular Sciences, K.U.Leuven, Kasteelpark Arenberg 20, 3000 Leuven, Belgium
§Corresponding authors
Abstract
Background
As an alternative to the frequently used “reference design” for two-channel microarrays, other designs have been proposed. These designs have been shown to be more profitable from a theoretical point of view (more replicates of the conditions of interest for the same number of arrays). However, the interpretation of the measurements is less straightforward and a reconstruction method is needed to convert the observed ratios into the genuine profile of interest (e.g. a time profile). The potential advantages of using these alternative designs thus largely depend on the success of the profile reconstruction. Therefore, we compared to what extent different linear models agree with each other in reconstructing expression ratios and corresponding time profiles from a complex design.
Results
On average the correlation between the estimated ratios was high, and all methods agreed with each other in predicting the same profile, especially for genes whose expression profile showed a large variance across the different time points. Assessing the similarity in profile shape, it appears that the more similar the underlying principles of the methods (model and input data), the more similar their results. Methods with a dye effect seemed more robust against array failure. The influence of a different normalization was not drastic and was independent of the method used. The accuracy of the different methods in estimating the true profiles was assessed using a spike-in experiment: all of the tested methods reconstructed very similar profiles. Only when ratios were to be estimated from low intensity signals (corresponding to low spike-in concentrations) did they fail to approximate the true expression ratios.
Conclusions
Including a dye effect, as in the methods lmbr_dye, anovaFix and anovaMix, compensates for residual dye-related inconsistencies in the data and renders the results more robust against array failure. Including random effects only makes sense if a design with a sufficient number of replicates is used; otherwise it deteriorates the results. Because of this, we believe lmbr_dye, anovaFix and anovaMix are most appropriate for practical use.
Background
Microarray experiments have become an important tool for biological studies, allowing the quantification of thousands of mRNA levels simultaneously. They are being customarily applied in current molecular biology practice.
In contrast to Affymetrix-based technology, in two-channel microarray assays mRNA extracted from two conditions is hybridised simultaneously on a given microarray. Which conditions to pair on the same array is a non-trivial issue and relates to the choice of the “microarray design”. The most intuitively interpretable and frequently used design is the “reference design”, in which a single, fixed reference condition is chosen against which all conditions are compared.
Alternatively, other designs have been proposed (e.g. a loop design). From a theoretical point of view, these alternative designs usually offer, at the same cost, more balanced measurements in the number of replicates per condition than a common reference design. They are thus, on theoretical grounds, potentially more profitable [1, 2]. For instance, a loop design would outperform the common reference design when searching for differentially expressed genes [3]. However, the drawback of such alternative designs is that the interpretation of the measurements becomes less straightforward. More complex analysis procedures are needed to reconstruct the factor of interest (genes being differentially expressed between two particular conditions, a time profile, etc.), so that the practical usefulness of a design depends mainly on how well analysis methods are able to retrieve this factor of interest from the data. Such an analysis requires removing systematic biases from the raw data by the appropriate normalization steps and combining replicate values to reconstruct the factor of interest.
When focusing on profiling the changes in gene expression over time, the factor of interest is the time profile [1, 2]. For such a time series experiment, the “reference design”, where, for instance, time point zero is chosen as the common reference, has a straightforward interpretation: for each array, the gene's mean ratio between replicates readily represents the change in expression of that gene relative to the first time point. However, when using an alternative design, such as an interwoven design, mean ratios represent the mutual comparison between distinct (sometimes consecutive) time points. A reconstruction procedure is needed to obtain the time profile from the observed ratios [3-5].
Several profile reconstruction methods are available for complex designs. They all rely on linear models and, for the purpose of this study, we subdivided them into “gene-specific” and “two-stage” methods. Gene-specific profile reconstruction methods apply a linear model to each gene separately. The underlying linear model is usually designed only for reconstructing a specific gene profile from a complex design, not for normalizing the data. As a result, normalized log-ratios are used as input to these methods (see ‘Methods’). Examples of these methods are described by Vinciotti et al. (2005) [3] and Smyth et al. (2004) (Limma) [4]. Two-stage profile reconstruction methods, on the other hand, first apply a single linear model to all data simultaneously, i.e. the model is fitted on the dataset as a whole. These models use the separate log-intensity values for each channel, as spot effects are explicitly incorporated. They return normalized absolute expression levels for each channel separately, which can then be used to reconstruct the required time profile by a second-stage gene-specific model. An example of such a two-stage method is implemented in the Maanova package [6].
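As a minimal sketch of what a gene-specific reconstruction does (it is not the code of any of the packages compared here), assume normalized log2 ratios for one gene measured on a loop of five arrays. Each array contributes one equation, log-ratio = (Cy5 condition) - (Cy3 condition), optionally plus a dye term, and the profile relative to the first time point is obtained by least squares. The array layout, the data values and the function names are hypothetical.

```python
import numpy as np

# Hypothetical loop design: (Cy3 condition, Cy5 condition) for each array.
# Conditions are indexed 0..4, with condition 0 as the reference time point.
ARRAYS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]

def design_matrix(arrays, n_conditions, dye_effect=False):
    """One row per array; columns are contrasts of conditions 1..n-1 versus condition 0."""
    X = np.zeros((len(arrays), n_conditions - 1))
    for row, (cy3, cy5) in enumerate(arrays):
        if cy5 > 0:
            X[row, cy5 - 1] += 1.0  # condition in the numerator of the ratio
        if cy3 > 0:
            X[row, cy3 - 1] -= 1.0  # condition in the denominator of the ratio
    if dye_effect:
        # Extra column of ones: a gene-specific dye term shared by all arrays.
        X = np.hstack([X, np.ones((len(arrays), 1))])
    return X

def reconstruct_profile(log_ratios, arrays, n_conditions, dye_effect=False):
    """Least-squares estimate of log2(condition k / condition 0) for k = 1..n-1."""
    X = design_matrix(arrays, n_conditions, dye_effect)
    y = np.asarray(log_ratios, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[: n_conditions - 1]

# Noise-free illustration with a made-up profile (condition 0 fixed at 0).
true_profile = np.array([0.0, 1.0, 2.0, 0.5, -1.0])
observed = [true_profile[cy5] - true_profile[cy3] for cy3, cy5 in ARRAYS]
estimated = reconstruct_profile(observed, ARRAYS, 5)
```

With real data the observed ratios are noisy and the least-squares fit averages over all paths that connect a condition to the reference, which is why the choice of design matters for the precision of the reconstructed profile.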
So far, comparative studies focused on the ability of different methods to reconstruct “genes being differentially expressed” from different two-color array based designs [7-9] or the ratio estimation between two particular conditions [5]. In this study, we aimed at performing a comparative study focusing on the time profile as the factor of interest to be reconstructed from the data.
We compared to what extent five existing profile reconstruction methods (lmbr, lmbr_dye, limmaQual, anovaFix, and anovaMix; see ‘Methods’ for details) were able to reconstruct similar profiles from data obtained by two-channel microarrays using either a loop design or an interwoven design. We assessed similarities between the methods, their sensitivity towards using alternative normalizations and their robustness against array failure. Using a spike-in experiment we were able to assess the accuracy of the time profiles estimated by each of the methods.
Results
Assessing the influence of the used methodology on the profile reconstruction
We compared to what extent the different methods agreed with each other in 1) estimating the changes in gene expression relative to the first time point (i.e. the log-ratios of each single time point and the first time point) and 2) estimating the overall gene-specific profile shapes. Results were evaluated using two test sets, each of which represents a different complex design.
The first dataset was a time series experiment consisting of 6 time points measured on 9 arrays using an interwoven design (Figure 1a). This design resulted in three replicate measurements for each time point, with alternating dyes. As a second test, a smaller loop design was derived from the previous dataset by picking the combination of five arrays that connect five time points in a single loop (Figure 1b). A balanced loop is obtained with two replicates per condition, in which each condition is labeled once with the red and once with the green dye (see ‘Methods’).
The balance with respect to the dyes (present in the loop design) ensures that the effect of interest is not confounded with other sources of variation. In this study, the effect of interest corresponds to the time profile. The replication (as present in the interwoven design) improves the precision of the estimates and provides the essential degrees of freedom for error estimation [2]. Moreover, the interwoven design not only has more replicates, but also increases the possible paths to join any two conditions in the design. As they have different characteristics, using both datasets allows us to assess the reconstruction process under two different settings, while the RNA preparations for both designs are the same.
Effect of profile reconstruction methods on the ratio estimates
We first assessed to what extent the different methods agreed with each other in estimating similar log-ratios for each single gene at each single time point. To this end, we calculated the overall correlation per time point between the gene expression ratios estimated by each pair of two different methods. Table 1 gives the results for all mutual comparisons between the methods tested for the loop design. Irrespective of which two methods were compared, the correlation between the estimated ratios was high on average, ranging from 0.94 to 0.98 (Table 1, mean column). Moreover, this high average correlation is due to a high correlation of all individual ratios throughout the complete ratio range (see supplementary Figure S1), with only a few outliers (genes for which a rather different ratio estimate was obtained, depending on the method used). Note that for the loop design, there was no difference between the results of lmbr and lmbr_dye due to the balanced nature of this design (see ‘Methods’ section).
For this loop design the ratio estimates T3/T1 or T4/T1 obtained by each of the different methods are overall more correlated than the estimates of T5/T1 and T6/T1. As can be expected, direct estimates, i.e. estimates of a ratio for which the measurements were assessed on the same array (see Figure 1b: ratios T3/T1 and T4/T1), are more consistent than indirect estimates, i.e. estimates for which the measurements were assessed on different arrays (see Figure 1b: ratios T5/T1 and T6/T1). A similar observation was already made by Kerr and Churchill (2001), and Yang and Speed (2002). For a loop design, both the ANOVA (two-stage) [2] and the gene-specific methods [10] have trouble estimating ratios between conditions not measured on the same array (indirect estimates). The larger the loops (the longer the paths) between indirectly measured pairs of conditions, the less precise the estimates will be.
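The loss of precision for indirect estimates can be illustrated with a small calculation. The sketch below assumes a simple gene-specific model without dye effect and independent errors of equal variance sigma^2 on every array; under those assumptions the variance of each least-squares contrast is sigma^2 times the corresponding diagonal element of the inverse of X'X. The loop layout and function name are hypothetical and do not reproduce Figure 1b.

```python
import numpy as np

def contrast_variances(arrays, n_conditions):
    """Variance multipliers (in units of sigma^2) of the least-squares estimates
    of log2(condition k / condition 0), assuming independent equal-variance errors."""
    X = np.zeros((len(arrays), n_conditions - 1))
    for row, (cy3, cy5) in enumerate(arrays):
        if cy5 > 0:
            X[row, cy5 - 1] += 1.0
        if cy3 > 0:
            X[row, cy3 - 1] -= 1.0
    return np.diag(np.linalg.inv(X.T @ X))

# Hypothetical loop 0 -> 1 -> 2 -> 3 -> 4 -> 0, given as (Cy3, Cy5) pairs.
loop = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
variances = contrast_variances(loop, 5)
# Conditions 1 and 4 share an array with the reference (direct estimates);
# conditions 2 and 3 are two arrays away from it (indirect estimates).
```

For this five-array loop the multipliers are 0.8 for the two direct contrasts and 1.2 for the two indirect ones, so the indirect estimates are expected to be noisier even before any method-specific differences come into play.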
For the interwoven design, the correlation between ratio estimates, obtained by any pair of two different methods was even higher, with values ranging from 0.95 to 0.99 (see supplementary Table S1). For this unbalanced design, the ratio estimates for the lmbr_dye and the lmbr methods were no longer exactly the same. The difference in consistency between direct and indirect ratio estimates was not obviously visible for this design.
Effect of profile reconstruction methods on the profile shape
A high average correlation between the ratio estimates obtained by the different methods at each single time point is a first valuable assessment. However, it is biologically more important that gene-specific profiles reconstructed by the different methods exhibit the same tendency over time. Therefore, we also compared to what extent profile shapes estimated by each of the methods differed from each other. This was done by computing the mean similarity between profile estimates obtained by any combination of two methods (Table 2a).
Figure 2 shows a few illustrative examples of profiles estimated by the different methods. For the ribosomal gene “L22” (Figure 2a), irrespective of the method, highly similar profiles were obtained. However, for the MGC85244 gene (Figure 2c), the observed degree of similarity between profiles derived by each of the different methods is much lower, especially for the last two time points.
Table 2a summarizes the results of the profile comparison expressed as average profile similarities across all genes. The similarity was computed with the cosine similarity measure after mean centering the profiles (see ‘Methods’). It ranges from -1 (anti-correlation) to 1 (perfect correlation), 0 being no correlation. Also here, the overall correlation between different methods was not drastically different. From this table, it appears that the more similar the underlying principles of the used methods (both the model and the input data) are, the more correlated their results. Indeed, correlations between profiles estimated by either limmaQual and lmbr (both gene specific models without dye effect), or anovaMix and anovaFix (both two stage models) are high. The most divergent correlations are observed when comparing a gene-specific method (more specifically lmbr, or limmaQual) with a two-stage method (anovaFix or anovaMix). When using lmbr_dye on the interwoven design, it behaves somewhere in between: although it is a single gene model, it includes a dye effect just like the two stage models. This does not apply for the loop design due to its dye-balance (lmbr and lmbr_dye give the same results for balanced designs; see ‘Methods’).
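For reference, the similarity measure described above can be sketched as follows. Cosine similarity applied to mean-centered vectors is the same quantity as the Pearson correlation coefficient, which is why it compares profile shape independently of the absolute expression level. The function name and example profiles are illustrative only.

```python
import math

def centered_cosine(profile_a, profile_b):
    """Cosine similarity of two profiles after subtracting each profile's own mean."""
    mean_a = sum(profile_a) / len(profile_a)
    mean_b = sum(profile_b) / len(profile_b)
    a = [x - mean_a for x in profile_a]
    b = [y - mean_b for y in profile_b]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return float("nan")  # a perfectly flat profile has no shape to compare
    return dot / norm

# Same shape at a different level and amplitude: similarity close to 1.
same_shape = centered_cosine([0.0, 1.0, 2.0, 1.0], [5.0, 5.5, 6.0, 5.5])
# Mirror-image profile: similarity close to -1.
opposite = centered_cosine([0.0, 1.0, 2.0, 1.0], [2.0, 1.0, 0.0, 1.0])
```

Averaging this value over all genes for a given pair of methods gives a summary of the kind reported in Table 2a.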
Differences in the input data (log ratio versus log expression values) and alterations in the underlying model (including a dye or random effect) are confounded in affecting the final result. Therefore, in order to assess in more detail the specific effect of including either a dye or a random effect in the model, we compared results between methods that share the same input data.
To assess the influence of including a dye effect on profile estimation, we compared the results of the gene-specific methods (see Table 2a, the first two rows). Including a dye effect (present in lmbr_dye but not in limmaQual and lmbr) has a strong effect under the unbalanced interwoven design (seen as a decrease in correlation between lmbr_dye and the other single-gene methods). For the loop design this effect is non-existent because of the loop design’s balance with respect to the dyes (see ‘Methods’).
The mere impact of including a random effect in the model can be assessed by comparing results of anovaFix and anovaMix. Indeed, they both contain the same input data, the same normalization procedure, and the same model except for the random effect. Seemingly, inclusion of the random effect has a higher influence on the loop design than on the interwoven design.
Usually in a microarray experiment, an important proportion of the genes does not change its expression significantly under the conditions tested (global normalization assumption), exhibiting a “flat” profile. We wondered whether removing such flat genes, which have a noisy profile, would affect the similarity in profile estimation between the different methods. Indeed, because the cosine similarity with centering only measures the similarity in profile shape, regardless of the absolute expression level, the similarity we observe between the methods might be influenced by random correlation between the “flat” profiles. Therefore, we applied a filtering procedure by removing those genes for which the profile variance over the different time points was lower than a certain threshold (a range of threshold values going from 0.2 to 0.4 was tested). The similarity was assessed for any pair of cognate profile estimates if at least one of the two profiles passed the filter threshold (Table 2b for the variance threshold of 0.4; results for the other thresholds can be found in the supplementary information, see Table S2).
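The filtering rule can be written down in a few lines. This is a sketch under an assumption: the text does not say whether the sample or the population variance was used, so the sample variance is taken here. The gene names and profile values are invented for illustration.

```python
from statistics import variance

def keep_gene(profile_a, profile_b, threshold=0.4):
    """True if at least one of the two cognate profiles passes the variance filter.
    Assumption: sample variance (n - 1 denominator) over the time points."""
    return variance(profile_a) >= threshold or variance(profile_b) >= threshold

# Hypothetical profiles estimated by two methods for the same two genes.
profiles_method_a = {
    "geneX": [0.0, 0.1, -0.1, 0.0, 0.05, 0.0],   # flat profile
    "geneY": [0.0, 1.2, 2.0, 1.5, 0.3, -0.8],    # variable profile
}
profiles_method_b = {
    "geneX": [0.0, -0.05, 0.1, 0.0, 0.0, 0.05],
    "geneY": [0.0, 1.0, 1.8, 1.6, 0.2, -0.6],
}

kept = sorted(
    gene for gene in profiles_method_a
    if keep_gene(profiles_method_a[gene], profiles_method_b[gene])
)
```

Only the genes in `kept` would then enter the pairwise profile comparison for that threshold.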
Overall, the results obtained with each of the different variance thresholds confirmed the observations of Table 2a: 1) the more similar the models and input data, the more similar the methods behaved (two-stage methods differed most from limmaQual, followed by lmbr, in estimating the gene profiles), 2) including a dye effect has a pronounced effect in an interwoven design (in a loop design there is no distinction due to the balance with respect to the dyes; see ‘Methods’), 3) including a random effect has most influence on the loop design. In addition, it seems that the more flat profiles are filtered from the dataset, the more similar the results obtained by the different methods become.
The effect of array failure on the profile reconstruction
In practice, when performing a microarray experiment some arrays might fail, with their measurements falling below standard quality. When these bad measurements are removed from the analysis, the complete design and the results inferred from it will be affected. Here we evaluated this issue experimentally by simulating array defects. In a first experiment, the interwoven design (dataset 1) was considered as the original design without failure. We tested 9 different possible situations of failure, each time removing a single array from the design, resulting in 9 reduced datasets. The same test was performed with the loop design (dataset 2).
We compared for each of the different profile reconstruction methods, the mean similarity between the ratios obtained either with the full dataset or with each of the reduced datasets (9 comparisons). Table 3 summarizes the results for the interwoven design, and Table 4 for the loop design.
For the interwoven design (Table 3), it appears that in general removing one array from the original design did not really affect the ratio reconstruction. For all methods, ratio estimates tend to be more affected when an array measuring the reference time point was removed (T1) (Table 3). Overall the two-stage methods, and in particular anovaMix, seemed most robust against array failure, while limmaQual was most sensitive (Table 3). Methods including a dye effect were more robust against array failure. Similar results were obtained when the effect of array failure was assessed on the similarity in profiles (see supplementary Table S3).
For the loop design, the situation was quite different (Table 4). Note that here, the lmbr_dye and limmaQual methods were not used for profile reconstruction, as the reduced datasets did not contain sufficient information for estimating all the model parameters. For both lmbr_dye and limmaQual, the linear models lose their main differing characteristics compared to lmbr (see ‘Methods’ section). For all remaining methods, removing one array from the design affected the results considerably more than was the case for the interwoven design. Two-stage methods were the most robust, but in this design anovaMix performed slightly worse than anovaFix. The lmbr method turned out to be very sensitive to array failure, giving a mean similarity around 0.2, indicating little correlation between profiles estimated with and without array failure (see supplementary Table S4).
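Why a reduced loop can lack the information needed for a dye term can be checked by counting parameters against the rank of the design matrix. The sketch below is an illustration for a gene-specific log-ratio model on a hypothetical five-array loop; it is not taken from the packages used in the study and does not claim to reproduce the exact reason limmaQual was excluded.

```python
import numpy as np

def design_matrix(arrays, n_conditions, dye_effect):
    """Rows: arrays given as (Cy3, Cy5). Columns: contrasts versus condition 0,
    plus an optional column of ones for a gene-specific dye term."""
    X = np.zeros((len(arrays), n_conditions - 1))
    for row, (cy3, cy5) in enumerate(arrays):
        if cy5 > 0:
            X[row, cy5 - 1] += 1.0
        if cy3 > 0:
            X[row, cy3 - 1] -= 1.0
    if dye_effect:
        X = np.hstack([X, np.ones((len(arrays), 1))])
    return X

def estimable(arrays, n_conditions, dye_effect):
    """All parameters are estimable only if the design matrix has full column rank."""
    X = design_matrix(arrays, n_conditions, dye_effect)
    return bool(np.linalg.matrix_rank(X) == X.shape[1])

loop = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]  # hypothetical balanced loop
full_with_dye = estimable(loop, 5, dye_effect=True)

# Simulate array failure: drop each array in turn.
reduced_designs = [loop[:i] + loop[i + 1:] for i in range(len(loop))]
reduced_with_dye = [estimable(d, 5, dye_effect=True) for d in reduced_designs]
reduced_without_dye = [estimable(d, 5, dye_effect=False) for d in reduced_designs]
```

In this toy loop the full design supports four contrasts plus a dye term, whereas every reduced design has only four arrays for five parameters. The model without dye term remains estimable but with no residual degrees of freedom, which is in line with the strong sensitivity to array failure reported for the loop design.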
Note that overall, all methods seem to be more robust to array failure under the interwoven design than under the loop design. This is to be expected, as the interwoven design contains more replicates.
Consistency of the methods under different normalization procedures
In the previous section we compared profiles and ratio estimates obtained by the different methods after applying default normalization steps. However, other normalization strategies are possible, and could potentially affect the outcome. To assess the influence of using alternative normalization procedures, we compared profiles reconstructed from data normalized with 1) print tip Loess without additional normalization step (the default setting for anovaMix and anovaFix as used throughout this paper), 2) print tip Loess with a scale-based normalization between arrays [13], and 3) print tip Loess with a quantile-based between array normalization (the default normalization for lmbr, lmbr_dye, and limmaQual as used throughout this paper) [12, 14].
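To make the third option concrete, the following is a generic sketch of quantile normalization between arrays: every array is forced to share the same empirical distribution, namely the mean of the sorted values across arrays. It is a simplified illustration (ties are not averaged, missing values are not handled) and not the implementation used in the study; which quantity is normalized in two-color data (log-ratios, average intensities or single-channel intensities) depends on the software and is not specified here.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of arrays (each a list of values for the same genes).
    Simplification: ties are ranked in input order rather than averaged."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the i-th smallest value over all arrays.
    reference = [sum(col[i] for col in sorted_cols) / len(columns) for i in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices from smallest to largest
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

# Two made-up arrays with three genes each.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After this step each array contains the same set of values, only assigned to genes according to their within-array rank, which removes between-array differences in scale and distribution shape.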
Table 5 shows, for each of the different methods, the mean similarity between reconstructed profiles derived from differently normalized datasets. Overall, the influence of the normalization was not drastic. More importantly, the influence of the additional normalization steps seemed independent of the method used (similar influences were observed for all methods). When assessing the similarity in ratio estimates instead of profile estimates, similar results were obtained (data not shown).
Accuracy of estimation
So far we only assessed to what extent changes in the used methodologies or normalization steps affected the inferred profiles. This, however, does not give any information on the accuracy of the methods, i.e., which of these methods is able to best approximate the true time profiles. Assessing the accuracy is almost impossible as usually the true underlying time profile is not known. However, datasets that contain external controls (spikes) could prove useful in this regard. Spikes are added to the hybridisation solution in known quantities, so that we have a clear view of their actual profile. In the following analysis, we used such a spike-in experiment to test the accuracy of each of the profile reconstruction methods [15]. For the technical details of this dataset we refer to ‘Methods’ and Table 6.
As lmbr, lmbr_dye and limmaQual gave exactly the same results using this balanced design, we further assessed to what extent lmbr, anovaFix and anovaMix agreed with each other in estimating similar profiles. The profile was reconstructed using one specific spike concentration as a reference point. Figure 3 shows the results for two representative spikes using the 10 cpc (copies per cell) measurement as a reference. Similar results were obtained for the other spikes (data not shown). The three compared methods reconstructed very similar profiles for both representative spikes. This is consistent with our previous observations, where these three methods were very consistent with each other (Tables 1 and 2 in the balanced loop design). Judging to what extent these methods were able to approximate the true underlying profile (i.e. approximating the true ratios and not only the profile behaviour), it appeared that the tested linear methods started failing when ratios were to be estimated from low-intensity signals (corresponding to low spike-in concentrations). As shown in Figure 3, differences between low concentrations can no longer be accurately detected by the linear methods, giving estimates around zero instead of approximating the true range of ratio concentrations (which should be between −log2(0.001) and −log2(1)). Clearly, ratios were overestimated relative to their true values. In contrast, at the high concentration range, ratios were consistently underestimated for all three methods (Figure 3).
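The true ratios against which the estimates are judged follow directly from the spiked concentrations. The sketch below assumes that the true log-ratio of a spike is log2(c / c_ref), with c the spiked concentration and c_ref the concentration at the reference point. The function name `expected_log_ratio` and the dilution series are hypothetical and are not taken from the spike-in dataset.

```python
import math

def expected_log_ratio(conc_cpc, reference_cpc):
    """True log2 ratio of a spike concentration relative to the reference.

    A concentration of 0 cpc has no finite log-ratio, so infinities (or NaN
    when both values are zero) are returned instead of raising an error.
    """
    if conc_cpc == 0 and reference_cpc == 0:
        return float("nan")
    if reference_cpc == 0:
        return float("inf")
    if conc_cpc == 0:
        return float("-inf")
    return math.log2(conc_cpc / reference_cpc)

# Hypothetical dilution series in copies per cell (cpc), for illustration only.
series = [0, 1, 10, 100, 1000, 10000]

for reference in (10, 10000, 0):
    profile = [expected_log_ratio(c, reference) for c in series]
    print(reference, profile)
```

With a 10 cpc reference, the range from 10 to 10,000 cpc maps to log-ratios between −log2(1) and −log2(0.001). With a 0 cpc reference no finite true ratio exists, which is one way to see why estimates relative to a very low reference cannot be accurate.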
Figure 4 shows spike-in profiles reconstructed by using a more extreme concentration as reference for the reconstructed ratio profile. In the high concentration range (from 10 to 10,000 cpc) the shape of the estimated profile is highly similar to the expected shape (Figure 4a, dotted line), but the estimated values depend on the concentration used as reference. When the ratios were estimated using the maximum concentration of 10,000 cpc as reference, we observed a high correlation between the estimated ratios and the true profile (Figure 4a). However, when the lowest concentration of 0 cpc was used as reference point, reconstructed ratio profiles were highly underestimated (Figure 4b). These results illustrate how, when estimating ratio profiles, the observed intensity of the reference point does not drastically disturb the profile shape, but can largely bias the accuracy of the estimated ratio.
Discussion
In this study, we evaluated the performance of five methods based on linear models in estimating gene expression ratios and reconstructing time profiles from complex microarray experiments. From a theoretical viewpoint, two major differences can be distinguished between the methods selected for this study: 1) differences related to alterations in the input data: the selected two-stage methods make use of the log-intensity values while the gene-specific methods use log-ratios, 2) differences related to the model characteristics: some of the models include an explicit dye effect (lmbr_dye, anovaFix and anovaMix) or an explicit random effect (anovaMix).
Although Kerr [5] assumed that observed differences in estimates obtained by different models are due to the differences in model characteristics, rather than to the input data, we cannot clearly make this distinction. Indeed, the way the error term is modelled influences the statistical inference, and hence the use of log-intensities or log-ratios does cause a difference between models [5]. However, when focusing on results obtained between methods with similar input data, we can assess, to some extent, the effect of different model specificities. In the following sections, some of these effects are discussed in more detail.
The inclusion of the dye effect
In general we observed that gene-specific methods without dye effect on the one hand, and two-stage models with dye effect on the other, behaved more similarly within each group than when methods from the two groups were compared with each other. Lmbr_dye (a gene-specific model with dye effect) is situated somewhere in between when the design is unbalanced with respect to the dyes. Indeed, the gene-specific models lmbr and limmaQual contain a combination of log-ratios plus an error term. However, when a dye effect is added to these models, as is the case for lmbr_dye, the formulations and estimations converge with those of the two-stage ANOVA models for unbalanced designs.
Originally, Vinciotti et al. (2005) [3] and Wit et al. (2005) [16] added the dye effect for purposes of data normalization when working with non-normalized data. From our results, we also noted a practical advantage of including a dye effect even with normalized data. The fact that adding a dye effect showed pronounced differences for a dye-unbalanced design indicates that, despite the data being normalized, there are still dye-related inconsistencies in the data that might partially be compensated for by including a dye effect. Moreover, models with dye effects seemed more robust in estimating log-ratios from a design disturbed by array failure. Therefore, when working with unbalanced designs, it is advisable to include a dye effect, not only for the two-stage ANOVA models, as was also suggested by Wolfinger (2001) [17], Kerr (2003) [5], and Kerr and Churchill (2001) [2], but also for gene-specific models based on log-ratios.
Mixed models versus Fixed models
Several studies advise the users to model the spot-gene or array-gene effects as random variables [9, 17]. We observed that under the loop design (with 5 arrays), profiles estimated by anovaMix and anovaFix diverged. We also noticed that, for the loop design, anovaMix had a lower capacity than anovaFix to handle array failures. For the interwoven design with 9 arrays these effects were less pronounced. Probably, the loop design used in our study does not contain a sufficient number of arrays to allow for the estimation of the spot-gene effect when using a mixed ANOVA model. As a result, ratios and time profiles estimated by anovaMix are less reliable for an experiment with few arrays than when using the anovaFix model in similar conditions.
The effect of using alternative normalization steps on the methods’ performance
We tested the influence of using additional normalization steps. Differently normalized data give different results, but the effects were not dramatic. Moreover, they had the same influence on all methods, indicating that all methods were equally sensitive to changes in the normalization.
Accuracy of estimated ratios
Based on spike-in experiments for two-channel microarrays, we could also assess to what extent the estimated ratios approximated the true ratios (i.e., the accuracy of the estimated ratios). We observed that all five tested linear methods generated biased estimations, consistently overestimating changes in expression relative to a reference with low mRNA concentration. These results proved to be independent of the method used (gene-specific or two-stage) and of the number of effects included in the model.
Conclusions
On average the correlation between the estimated ratios was high, and all methods more or less agreed with each other in predicting the same profile. The similarity in profile estimation between the different methods improved with an increasing variance of the expression profiles.
We observed that when dealing with unbalanced designs, including a dye effect, such as in the methods lmbr_dye, anovaFix and anovaMix, seems to compensate for residual dye related inconsistencies in the data (despite an earlier normalization step). Adding a dye effect also renders the results more robust against array failure. Including random effects only makes sense if a design is used with a sufficient number of replicates, otherwise it deteriorates the results.
The accuracy of the different methods in estimating the true profiles was assessed using a spike-in experiment: all of the tested methods reconstructed very similar profiles. Only when ratios had to be estimated from low-intensity signals (corresponding to low spike-in concentrations) did the methods fail to approximate the true expression ratios.
In conclusion, because of their robustness against imbalances in the design and array failure, we believe lmbr_dye, anovaFix and anovaMix are most appropriate for practical use (given a sufficient number of replicates in the case of the latter).
Methods
Microarray data
The first dataset used in this study was a temporal Xenopus tropicalis expression profiling experiment. The array used consisted of 3000 50-mer oligonucleotides, corresponding to 2898 unique X. tropicalis gene sequences and negative control spots (Arabidopsis thaliana probes, blanks and empty buffer controls). Each oligo was spotted in duplicate on each array in two separate grids. On each grid, oligonucleotides were spotted in 16 blocks of 14 x 14 spots. Pairs of duplicated oligos of the same gene sequence on the two grids were treated as replicates during analysis, corresponding to a total of 2999 different duplicated measurements (a few oligos were spotted multiple times on the arrays). MWG Biotech performed oligonucleotide design, synthesis and spotting. X. tropicalis gene sequences were derived from the assembly of public and in-house expressed sequence tags. The temporal expression of X. tropicalis during metamorphosis was profiled at 6 time points, using an experimental design consisting of 9 arrays. Each time point was measured three times, with alternating dyes as shown in Figure 1a. This interwoven design was used as a first test set.
From this original design, a second test set containing a smaller loop design was derived by picking the combination of five arrays that connect five time points in a single loop (Figure 1b), with the first time point as a reference. This results in a balanced loop design.
A publicly available spike-in experiment [18] was used as a third test set. This dataset contains 13 spike-ins, i.e. control clones spiked at known concentrations. The control clones were spiked at different concentrations for each of the 7 conditions, so that each spike describes a specific temporal profile (Table 9).
The microarray design used for the spike-in experiment was a common reference design, with a dye swap for each condition. The concentrations of the spikes ranged from 0 to 10,000 copies per cellular equivalent (cpc), assuming that the total RNA contained 1% poly(A) mRNA and that a cell contained on average 300,000 transcripts. This concentration range covered all biologically relevant transcript levels.
Probe preparation and microarray hybridization
10 µg of total RNA were used to prepare probes. Labeling was performed with the Invitrogen SuperScript™ Indirect cDNA labeling system (using polyA and random hexamer primers) and the Amersham Cy3 or Cy5 monofunctional reactive dyes. Probe quality was assessed on an agarose minigel, and probes were quantified with a Nanodrop ND-1000 spectrophotometer. Dye quantities were equilibrated for hybridization by the amount of fluorescence per ng of cDNA. The arrays were hybridized for 20 h at 45 °C according to the manufacturer’s protocol (QMT ref). Washing was performed in 2X SSC, 0.1% SDS at 42 °C for 5 min and then twice at room temperature, in 1X SSC and in 0.5X SSC, each time for 5 min. Arrays were scanned using a GenePix Axon scanner.
Microarray normalization
The raw intensity data were used for further normalization. No background subtraction was performed. Data were log-transformed and the intensity-dependent dye or condition effects were removed by using a local linear fit loess on these log-transformed data (printtiploess command with default settings as implemented in the limma BioConductor package [13]). As this loess fit not only normalizes the data but also linearizes them, applying it before profile reconstruction is a prerequisite, as all linear models used for profile reconstruction assume non-linearities to be absent from the data.
For the gene-specific methods (lmbr, lmbr_dye and limmaQual), Loess corrected log-ratios (per print tip) were subjected to an additional quantile normalization step [4, 12] as suggested by Vinciotti et al. (2005) [3] in order to improve the intercomparability between arrays. This step equalizes the distribution of probe intensities for each array in a set of arrays. For the two-stage profile reconstruction methods (anovaFix and anovaMix), corrected log-intensities for the red ($R_{\mathrm{CORR}}$) and green ($G_{\mathrm{CORR}}$) channels were calculated from the Loess corrected log-ratios ($M_{\mathrm{CORR}}$; no additional quantile normalization was done for the two-stage methods) and mean absolute intensities ($A$) as follows: $R_{\mathrm{CORR}} = A + M_{\mathrm{CORR}}/2$ and $G_{\mathrm{CORR}} = A - M_{\mathrm{CORR}}/2$.
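The two operations described above can be sketched in a few lines. This is an illustrative sketch only, not the limma implementation: it assumes a complete genes × arrays matrix without missing values, breaks ties by position, and uses the usual definitions M = R − G and A = (R + G)/2 on the log2 scale. The function names `quantile_normalize` and `ma_to_channels` are hypothetical.

```python
import numpy as np

def quantile_normalize(values):
    """Give every array (column) the same empirical distribution.

    values: 2-D array, genes x arrays. The shared distribution is the mean
    of the sorted columns; each value is replaced by the entry of that
    distribution that has the same rank within its own column.
    """
    order = np.argsort(values, axis=0)
    ranks = np.argsort(order, axis=0)             # rank of each value per column
    reference = np.sort(values, axis=0).mean(axis=1)
    return reference[ranks]

def ma_to_channels(m_corr, a):
    """Recover corrected log-intensities from corrected log-ratios.

    Assumes M = R - G and A = (R + G) / 2, so that
    R = A + M / 2 and G = A - M / 2.
    """
    r_corr = a + m_corr / 2.0
    g_corr = a - m_corr / 2.0
    return r_corr, g_corr

# Synthetic data for illustration: 6 genes on 3 arrays.
rng = np.random.default_rng(1)
m = rng.normal(size=(6, 3))
a = rng.normal(loc=10.0, size=(6, 3))

m_quantile = quantile_normalize(m)   # input for the gene-specific methods
r, g = ma_to_channels(m, a)          # input for the two-stage methods
print(np.sort(m_quantile, axis=0))
print(np.allclose(r - g, m))
```

After quantile normalization the sorted columns are identical, which is what makes the arrays directly comparable. The back-transformation is exact, so the two-stage methods receive log-intensities that carry the same Loess correction as the log-ratios.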
Profile reconstruction methods used
Available R implementations (BioConductor [19]) of the presented methods were used to perform the analyses.
Gene-specific methods based on log-ratios
Gene-specific profile reconstruction methods apply a linear model to each gene separately. The goal is to estimate the true expression differences between the mRNA of interest and the reference mRNA from the observed log-ratios. The presented models assume that the expression values have been appropriately pre-processed and normalized [3, 20]. The three selected gene-specific models for this study are:
1) lmbr, the linear model described by Vinciotti et al. (2005) [3]: An observation $y_{jk}$ is the log-ratio of condition j and condition k. For each gene, a vector of n observations $y = (y_1, \ldots, y_n)$ can be represented as $y = X\mu + \varepsilon$, where X is the design matrix defining the relationship between the values observed in the experiment and a set of independent parameters