Annotation of the Arabidopsis Genome1

Annotation of the Arabidopsis Genome1 Jennifer R. Wortman, Brian J. Haas, Linda I. Hannick, Roger K. Smith, Jr., Rama Maiti, Catherine M. Ronning, Agnes P. Chan, Chunhui Yu, Mulu Ayele, Catherine A. Whitelaw, Owen R. White, and Christopher D. Town* The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850 The Arabidopsis Genome Sequencing Project was A CENTRALIZED REANNOTATION EFFORT officially completed in late 2000, leading to the pub- The stated goals of the Arabidopsis reannotation lication of a landmark paper describing, in broad effort are the comprehensive identification of protein outline, many salient features of the Arabidopsis ge- coding genes, the elucidation of accurate gene struc- nome (Arabidopsis Genome Initiative [AGI], 2000). tures, and the assignment of function to each gene However, the genome annotation, generated by the product in the predicted proteome. The approach individual sequencing centers, was heterogeneous, taken at TIGR involves both automated annotation both in terms of gene structure predictions and ter- and manual curation, including the development of minology used in their description. In response, The novel algorithms and custom software interfaces to Institute for Genomic Research (TIGR) was funded facilitate the generation of data that are complete, by the National Science Foundation to carry out a thorough, and of high quality. whole-genome reannotation that would be of a uni- The first challenge is to generate and maintain an form high standard and consistency, eliminating the integrated, nonredundant set of gene models across heterogeneity that had accumulated over time in the the Arabidopsis genome. This involves incorporating public databases. data from multiple original sources and recognizing The first phase of this process took place in the and correcting for duplicate and partial gene models latter half of 2000, as the sequencing itself was on overlapping BACs. The growing availability of drawing to a close. At that time, it was agreed that full-length cDNA (FL-cDNA) and expressed se- two centers, The Munich Information Center for quence tag (EST) data enables robust computational Protein Sequences (MIPS) and TIGR, would carry corrections to the original annotated gene structures. out the bulk of the analyses for the whole-genome Protein data and genomic reads from an evolutionary publication. As part of this effort, all publicly avail- neighbor, such as Brassica oleracea, can also inform able sequence and associated annotation was re- computational gene prediction. With a comprehensive set of gene models avail- trieved from various databases and loaded into the able, the predicted proteome can be clustered, mak- TIGR annotation database. Bacterial artificial chro- ing it possible to examine, model, and describe genes mosomes (BACs) that either lacked annotation or in the context of gene families. This can highlight required review were examined by TIGR annotators inconsistencies in previous annotations and aids in to provide as complete and nonredundant a data set achieving consistency and accuracy in the manual as possible for publication. When the whole-genome curation of both gene structure and function. Anno- analysis, carried out in close collaboration with tation updates are incorporated weekly into the TIGR MIPS, was completed, TIGR embarked upon the Arabidopsis Annotation Web site. reannotation proper, using the integrated annota- TIGR publishes its annotation data in the context tion data set as a starting point. The purpose of this of Arabidopsis chromosome sequences, generated article is to describe the goals, chronology, and ac- based on the available tiling path information. Where complishments to date of this reannotation effort tiling path data are incomplete or inconsistent, over- and to communicate the nature, quality, and basis of laps are being validated experimentally. Updated the latest TIGR data release to the plant research chromosome sequences along with annotation are community. generated and released twice a year to the Arabidop- sis Information Resource (TAIR) and the Genome Division of the National Center for Biotechnology Information (NCBI). 1 This work was supported by the National Science Foundation (Cooperative Agreement no. DBI 9813586). GENERATING A COMPREHENSIVE GENE SET * Corresponding author; e-mail [email protected]; fax 301–838– 0208. To support the reannotation effort, an in-house Article, publication date, and citation information can be found database was populated with a complete set of BAC at www.plantphysiol.org/cgi/doi/10.1104/pp.103.022251. sequences representing the published Arabidopsis Plant Physiology, June 2003, Vol. 132, pp. 461–468, www.plantphysiol.org © 2003 American Society of Plant Biologists 461 Wortman et al. tiling path, along with the genes and other biological Gene Models Supported by Arabidopsis FL-cDNA and features identified on these genomic sequences by EST Sequences the respective sequencing centers. These sequences were then processed through the TIGR annotation Early in 2001, we performed a detailed analysis pipeline, a collection of software known as Eukary- with approximately 5,000 FL-cDNA sequences pro- otic Genome Control (EGC) that serves as the central vided by Ceres Inc. (Malibu, CA) and evaluated the data management system. performance of a number of programs for their suit- EGC processes each BAC sequence through a series ability in generating gene models based on the of algorithms for predicting genes (Genscanϩ, Gen- genomic alignments of these data. We found that emark.hmm, and Glimmer; Burge and Karlin, 1997; approximately 35% of the cognate gene models with Lukashin and Borodovsky, 1998; Salzberg et al., cDNA support required some kind of modification. 1999), splice sites (Hebsgaard et al., 1996; Pertea et al., During this pilot study, 240 new genes were discov- 2001), and tRNAs (Lowe and Eddy, 1997). Homology ered and modeled based upon FL-cDNA alignments. to nucleotide and protein databases is computed us- This analysis also supported the documentation of ing the AAT package (Huang et al., 1997), which many instances of alternative splicing (Haas et al., utilizes a two-step approach consisting of a fast da- 2002) and mini-exons (N. Volfovsky, B.J. Haas, and tabase search step to identify the boundaries of the S.L. Salzberg, unpublished data) and has allowed us sequence match, followed by a rigorous alignment to develop a pipeline in which almost all new FL- step which takes into account consensus splice sig- cDNAs can be used to validate or update a gene nals. Data sets include Arabidopsis-specific cDNA model automatically. We have continued to down- and EST sequences, TIGR gene indices for Arabidop- load FL-cDNAs from GenBank. Currently, there are sis and other plants (Quackenbush et al., 2000), a 24,839 cDNA accessions supporting 12,569 gene nonredundant amino acid database filtered from models for 11,960 genes. From the end user’s per- public sources, and SwissProt (Bairoch and Ap- spective, gene models and predicted coding sequence weiler, 2000). based on FL-cDNAs can be regarded as high confi- The output of the gene prediction algorithms and dence, although a small percentage may be truncated homology-based computes were compared with the or contain unspliced introns or genomic DNA. preexisting gene models. New gene models were Arabidopsis ESTs also provide strong support for created, and existing gene models were updated components of gene structure, including exons, based on different classes of evidence. The process of splice sites, and untranslated regions. Many of the automating gene structure updates is continually be- original gene models relied, at least in part, on ESTs ing improved as new data becomes available and for their support. However, due to the distributed pipeline components become more robust. Current nature of the annotation and the gradual accumula- annotation data is based on the following data sets: tion of sequences, EST evidence was not uniformly or Arabidopsis and plant cDNAs and ESTs downloaded systematically incorporated into gene models. We from GenBank on January 28, 2003, the February 2003 recently have developed an algorithmic approach to release of TIGR gene indices, and a nonredundant incorporate all EST evidence into gene models and, protein set generated on February 1, 2003. Table I where appropriate, provide evidence for alternative describes the numbers of gene models supported by splicing. As a result, the current annotation data set different types of evidence in the current annotation contains approximately 16,000 genes whose models data set. incorporate all available EST evidence (B.J. Haas, Table I. Evidence supporting annotated Arabidopsis gene models Support was determined by BLAST-based similarity between the current, complete set of gene models/proteins, and the following datasets: non-Arabidopsis proteins parsed from the TIGR nonredundant protein set (February 1, 2003); the Arabidopsis protein set as represented in TIGR’s annotation database (February 1, 2003); and Arabidopsis cDNAs, Arabidopsis ESTs, and non-Arabidopsis plant ESTs downloaded from GenBank (January 28, 2003). Note that matches between an Arabidopsis protein and itself were excluded from the count for Arabidopsis protein support. Each gene was counted only once, in the category associated with the highest overall confidence. Gene models supported by both Arabidopsis cDNA and protein similarity to another organism are considered to be highest confidence. Genes with no EST and no protein support are based solely on gene predictions and are the lowest confidence set. Non-Arabidopsis Arabidopsis No Total Protein Protein Protein Arabidopsis cDNA 9,017 2,055 1,664 12,736 Arabidopsis EST 2,607 1,041 1,033 4,681 Other Plant EST 3,691 1,365 1,032 6,088 No cDNA/EST 247 2,280 1,352 3,879 Total 15,562 6,741 5,081 27,384 462 Plant Physiol. Vol. 132, 2003 Annotation of the Arabidopsis Genome A.L. Delcher, J.R. Wortman, R.K. Smith Jr, L.I. Extending Gene Detection and Validation by Hannick, R. Maiti, C.M. Ronning, D.B. Rusch, C.D. Comparative Genomics Town, S.L.

Annotation of the Arabidopsis Genome1

Unicarbkb: Building a Standardised and Scalable Informatics Platform for Glycosciences Research

Hierarchical Classification of Gene Ontology Terms Using the Gostruct

Arabidopsis Thaliana

Gene Ontology Analysis with Cytoscape

To Find Information About Arabidopsis Genes Leonore Reiser1, Shabari

Use and Misuse of the Gene Ontology Annotations

Comprehensive Analysis of CRISPR/Cas9-Mediated Mutagenesis in Arabidopsis Thaliana by Genome-Wide Sequencing

Locdb: Experimental Annotations of Localization for Homo Sapiens and Arabidopsis Thaliana Shruti Rastogi1,* and Burkhard Rost1,2,3

Wormbase Curation Interfaces and Tools

A Bioinformatics Analysis of the Arabidopsis Thaliana Epigenome Ikhlak Ahmed

The Biogrid Interaction Database

Mutator-Like Elements in Arabidopsis Thaliana: Structure, Diversity and Evolution