FEBS 18220
FEBS Letters 403 (1997) 221-224

Minireview

The Arabidopsis thaliana cDNA sequencing projects

Michel Delseny*, Richard Cooke, Monique Raynal, Francoise Grellet

University of Perpignan, Laboratoire de Physiologie et Biologie Moleculaire des Plantes, UMR 5545 CNRS, Avenue de Villeneuve, 66860 Perpignan cédex, France

Received 14 January 1997

Abstract

Nearly 30 000 Arabidopsis thaliana EST (Expressed Sequence Tags) have been produced by a French-American consortium. Despite redundancy, these sequences tag about half of the expected Arabidopsis genes. Approximately 40% of the non-redundant EST can be assigned a putative function by simple homology search. This programme allowed the identification of a large number of genes which would have been very difficult to isolate by other classical techniques. It considerably stimulated many areas of plant biology by the rapid discovery of a large number of genes, by revealing multigene families and by allowing the analysis of differential expression of the different members. Finally, this programme facilitated construction of physical maps of the chromosomes and opened the way for complete sequencing of the Arabidopsis genome and comparative mapping of the major plant crops.

© 1997 Federation of European Biochemical Societies

Key words: Expressed Sequence Tag; cDNA; Gene function; Genome sequencing; Crucifer

*Corresponding author. Fax: (33) 4-68-66-84-99. E-mail: delseny@univ-perp.fr

This paper was presented at the 24th FEBS Meeting in Barcelona.

0014-5793/97/$17.00 © 1997 Federation of European Biochemical Societies. All rights reserved. PII S0014-5793(97)00075-6

1. Introduction

Arabidopsis thaliana, a small crucifer weed, was chosen by a number of plant biologists a few years ago as a model to analyse plant genome structure and gene expression as well as plant development. The major advantages of this plant are its small genome, short life cycle, genetics and relative ease of transformation [1]. With a genome size of 120 Mbp, Arabidopsis compares well with animal models such as Drosophila or Caenorhabditis and has one of the smallest plant genomes. More than 1000 mutants have been characterized and half of them have been mapped to one of its five chromosomes [2], making it a suitable model to dissect metabolism, development and transduction pathways using molecular genetics.

Due to specific difficulties, as well as a much smaller scientific community, gene isolation and characterization in higher plants have been lagging behind those in animals. Thus, at the end of 1991, no more than 500 different plant protein sequences were represented in databases. In order to rapidly increase our knowledge of this genome and, hopefully, that of other crop plant genomes, two major strategies have been developed. The first consisted in partially sequencing a large number of cDNA in order to make a catalogue of Arabidopsis expressed genes (EST, or Expressed Sequence Tags) as initially described for human genes [3]. The second strategy consisted in sequencing the complete genome. Although the two strategies are complementary and part of a world-wide integrated project [2], this minireview will be restricted to the EST projects and their impact on plant science.

2. EST strategy and production

Most Arabidopsis EST sequencing has been carried out by two consortia: one in France, started at the end of 1991 with national support and continued after 1993 with European Union funding as part of the ESSA (European Scientists Sequencing Arabidopsis) programme, and one in the US, initiated a few months later at Michigan State University. The initial aim of the French project was to obtain the largest number of EST, to identify as many of them as possible by homology search, and to map them on the genetic and physical maps. In contrast to their American colleagues, who prepared a single library from a mixture of mRNA from different tissues, the French scientists worked with specialized libraries corresponding to specific tissues, organs or developmental stages. This strategy initially helped to avoid overlaps and redundancy between groups, maintained a strong biological interest for the participants, and allows one to compare the different libraries. A computer network was organised so as to automatically identify redundancies and submit sequence information to the EMBL database. All sequences were first edited and analyzed by homology search in each individual laboratory. After validation of the sequence and annotation, non-redundant EST were then transferred to the EMBL database, from where they were automatically integrated into dbEST [4]. Another difference between the two projects was that American EST were sequenced only from the 5' end and redundancy was not eliminated at the submission stage, whereas in the French project non-redundant clones were sequenced from both ends.

Altogether, the two programmes have produced close to 30 000 EST, placing Arabidopsis in fourth position for EST number, after human and mouse, immediately behind Caenorhabditis elegans and ahead of rice [5]. The French groups produced approximately 12 000 EST, but only 6500, which were not redundant with previously submitted EST, were deposited in dbEST. Most of the corresponding clones are available from the Ohio State University Arabidopsis Biological Resource Centre (ABRC) in Columbus, Ohio.

It is difficult to determine precisely how many different genes these EST represent, because EST sequences are not 100% reliable, because the collection is a mixture of 5' and 3' ends, and because cDNA are not always full length. An additional problem is that many cDNA correspond to members of multigene families which are not easily distinguishable. A first estimate comes from assembling overlapping EST into contigs: the 30 000 EST can be organized into about 15 000 contigs [6], but this is certainly an overestimate of gene number, and a more realistic speculation is that 10 000 distinct genes have been tagged by at least one EST. This figure has to be compared with the gene number, which is now estimated to be around 20 000, and with the 50 or 60 genes which have been isolated using T-DNA or transposon tagging or map-based cloning [7]. Thus, just by random sequencing of cDNA, roughly half of the genes have been tagged; they probably correspond to the most abundantly expressed genes. Sequencing work is almost completed in both consortia because redundancy is now becoming too high for the programmes to be cost effective. Tagging genes with rare transcripts cannot be achieved using the present strategies and will require further developments.

EST are usually ca. 300 bp long and represent only part of a gene. However, about 150 full-length cDNA have now been completely sequenced, starting from an EST. In addition, if a gene is abundantly expressed, overlapping EST can be assembled in order to reconstruct a full-length cDNA in silico.

3. What genes have been identified by EST sequencing?

A preliminary identification of a gene relies on the presence of an already identified homologous sequence in databases. Each sequence was translated in all six possible reading frames. Using the BLAST and FASTA algorithms [8,9], the translations were aligned with the non-redundant protein sequence database. In this way, more than 40% of the non-redundant EST were found to have significant homology, at least at the amino acid sequence level, with an already known gene. Lists of putative functions for EST can be found in the papers reporting major progress in the programmes [10-12]. Such an analysis simply gives a hint of what the function might be and, in many cases, extensive biochemical and biological work is necessary to unambiguously identify a gene and its function. This task is clearly impossible for the sequencing labs; therefore, relevant clones were distributed to colleagues who have an interest in a specific gene.

[...] the sequenced region might correspond to a variable region in a conserved gene.

A major surprise was to discover that many genes belonged to multigene families: for example, we found that there were at least five distinct thioredoxin h genes which are differentially expressed, whereas there are three genes in yeast and a single one in human [14]. Similarly, we reconstructed 106 different cDNA coding for 50 distinct types of ribosomal proteins. The majority are coded by two, three and even four genes which differ essentially in their non-coding regions [15].

4. Impact of the Arabidopsis EST programmes

One of the biggest impacts of the Arabidopsis EST programmes is on plant biology studies. A crude measure of this impact is the thousands of clone requests which were received by the ABRC. Availability of the EST sequences accelerated the isolation of many new genes for which a partial protein sequence was available. It also stimulated and facilitated the production of recombinant proteins and the analysis of differential expression of members of multigene families by allowing one to prepare gene-specific probes [14,16-19]. Having a large number of EST available allows analysis of overall gene expression. cDNA clones can be gridded on high-density filters and hybridized with various complex probes corresponding to different developmental stages, different stimuli or different genetic backgrounds [20]. Alternatively, oligonucleotides derived from the EST can be spotted on DNA chips [21]. When several cDNA libraries have been used, the frequency of occurrence of the various clones can also be analysed and compared. Another type of physiological study consists in attempting to classify cDNA and genes according to their control by a given regulatory gene. This strategy was developed for a number of genes expressed during seed maturation to determine whether their expression was dependent upon the regulatory gene ABI3 [22].

EST programmes have an important impact on mapping and sequencing the Arabidopsis genome. From the very beginning, mapping the EST on the chromosomes was a major goal of the French consortium because it was anticipated that it