224–228 Nucleic Acids Research, 2003, Vol. 31, No. 1
© 2003 Oxford University Press
DOI: 10.1093/nar/gkg076

The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community

Seung Yon Rhee*, William Beavis¹, Tanya Z. Berardini, Guanghong Chen¹, David Dixon¹, Aisling Doyle, Margarita Garcia-Hernandez, Eva Huala, Gabriel Lander, Mary Montoya¹, Neil Miller¹, Lukas A. Mueller, Suparna Mundodi, Leonore Reiser, Julie Tacklind, Dan C. Weems¹, Yihe Wu¹, Iris Xu, Daniel Yoo, Jungwon Yoon and Peifen Zhang

Carnegie Institution of Washington, 260 Panama Street, Stanford, CA 94305, USA and ¹National Center for Genome Resources, 2939 Rodeo Park Dr. East, Santa Fe, NM 87505, USA

Received September 16, 2002; Revised and Accepted October 14, 2002

*To whom correspondence should be addressed. Email: [email protected]

ABSTRACT

Arabidopsis thaliana is the most widely-studied plant today. The concerted efforts of over 11 000 researchers and 4000 organizations around the world are generating a rich diversity and quantity of information and materials. This information is made available through a comprehensive on-line resource called the Arabidopsis Information Resource (TAIR) (http://arabidopsis.org), which is accessible via commonly used web browsers and can be searched and downloaded in a number of ways. In the last two years, efforts have been focused on increasing data content and diversity, functionally annotating genes and gene products with controlled vocabularies, and improving data retrieval, analysis and visualization tools. New information includes sequence polymorphisms including alleles, germplasms and phenotypes, Gene Ontology annotations, gene families, protein information, metabolic pathways, gene expression data from microarray experiments and seed and DNA stocks. New data visualization and analysis tools include SeqViewer, which interactively displays the genome from the whole chromosome down to 10 kb of nucleotide sequence, and AraCyc, a metabolic pathway database and map tool that allows overlaying expression data onto the pathway diagrams. Finally, we have recently incorporated seed and DNA stock information from the Arabidopsis Biological Resource Center (ABRC) and implemented a shopping-cart style on-line ordering system.

INTRODUCTION

Arabidopsis thaliana is a small flowering plant that serves as a model organism for understanding the complex processes required for plant growth and development. The goal of The Arabidopsis Information Resource (TAIR) is to provide an integrated, up-to-date view of Arabidopsis biology from genome to phenome, drawing on sources of information ranging from personal communication with researchers to published literature. TAIR aims to facilitate interaction within a research community that collectively generates and refines a common body of knowledge (1–3).

Since TAIR's inception, its usage has increased steadily from 20 000 web page visits per month in November 1999 to about 500 000 per month in August 2002 (http://arabidopsis.org/usage/). Likewise, the content of the database has increased dramatically, largely due to incorporation of data from large-scale genomics projects and stock centers. Current, detailed statistics of TAIR database content are available on-line (http://arabidopsis.org/jsp/tairjsp/pubDbStats.jsp). TAIR continues to evolve and expand to include new data types and analysis tools. It is committed to improving data content by continuous curation and data analysis facilitated by collaborations with and feedback from the research community. Detailed information about the TAIR project, including the full proposal, data sources, projects under development and documentation of the software and curation procedures, is available on-line (http://arabidopsis.org/about/) and in a comprehensive review (4).

UPDATED LOCUS INFORMATION

The genome sequence increased the number of Arabidopsis genes identified about 20-fold. There are currently 28 974 sequenced, physical loci (defined as a non-overlapping region of the genome corresponding to at least one transcription unit) and 549 loci defined genetically. All physical loci follow a standard nomenclature that is based on the genomic location and is synchronized among collaborating databases (TAIR, MIPS' MATDB, TIGR's ATH1 and NCBI's RefSeq). More information about locus and gene nomenclature is found on-line (http://arabidopsis.org/info/guidelines.html). Genes that correspond to the same physical or genetic locus but have different structural or functional annotations or produce different transcribed products are stored as separate gene models and are associated with the corresponding locus. There are currently 35 184 gene models in TAIR.

Organization and structural annotation of these genes on the genome are essential and continual processes, aided by improved analysis methods and incorporation of experimental data supporting the gene models. As a consequence of the large-scale re-annotation effort by The Institute for Genomic Research (TIGR, http://www.tigr.org/tdb/e2k1/ath1/), some of the original loci have been merged or split. Therefore, some locus names have been made obsolete and new locus names have been created. The history of the locus names and the current usage can be searched and downloaded (http://arabidopsis.org/tools/bulk/locushistory/). One of our major efforts has been identifying and grouping gene models with different annotations, including names, arising from many sources such as TIGR, MIPS, GenBank and the literature. There are now over 66 000 aliases including ORF names, protein names, full names and 4800 gene symbols. Aliases have been designated based upon sequence match or published allelism data.

GENOME ANNOTATION AND VISUALIZATION

The use of structured, shared vocabularies to describe the roles of genes and products will facilitate identifying similarities among diverse organisms and revealing the extent or absence of knowledge gathered. TAIR is a member of the Gene Ontology Consortium (http://www.geneontology.org), which has developed widely-adopted structured vocabularies to describe gene products (5). Our main concentration has been modifying structures and adding terms to better accommodate annotation of plant genes. Both TAIR and TIGR are actively annotating Arabidopsis gene products with GO terms using information from the literature, sequence similarity and other computational methods. Currently, TAIR has annotated 13 041 unique gene models with GO terms for gene product function, process and cellular component. We have also developed controlled vocabularies to describe Arabidopsis anatomy and stages of development (ftp://ftp.arabidopsis.org/home/tair/Ontologies/) and are working to extend these ontologies to [...]

[...] chromosome down to a 10 kb nucleotide view. Search options include text (e.g. clone, marker, transcript, polymorphism or gene names, up to 250 names at a time) or short nucleotide sequences (no more than 4 sequences of 15–150 nucleotides each). Multiple hits are displayed on the genome, in a close-up graphical view or in a nucleotide window (Fig. 1). Matching entities can be viewed by clicking on the genome to open a close-up graphical view or by clicking on a match summary to open a table-format view. Several close-up views can be opened in the same or different windows to allow users to compare regions on the same or different chromosomes.

A nucleotide window displays 10 kb of any region of the genome, with all the selected sequenced entities highlighted on the actual nucleotide sequence (Fig. 1). The view can be scrolled up or down 5 kb at a time, and allows users to view genes on either one or both strands with UTRs, start and stop codons, exons and introns highlighted in different colors. EST and full-length cDNA matches are displayed as dashed lines. Genetic markers and different types of polymorphisms such as insertions, deletions, substitutions and compound (a combination of either substitution/insertion or substitution/deletion) are highlighted in different colors. Ends of clones are also precisely located in the nucleotide window. Sequence from this page can be copied and pasted into other applications with the gene annotation preserved as uppercase (coding regions) and lowercase (introns, UTRs and intergenic regions). A complementary tool, MapViewer (http://arabidopsis.org/servlets/mapper), can be used to compare Arabidopsis sequence, physical and genetic maps (1).

METABOLIC PATHWAYS, ENZYMES, REACTIONS AND COMPOUNDS

Biochemical pathways provide another way to group gene products and related data, including compounds, co-factors, reactions, enzymes, enzyme complexes and subcellular enzyme locations. TAIR has relied on the Pathway Tools software (6) to build an Arabidopsis-specific pathway database called AraCyc (http://arabidopsis.org/tools/aracyc). It currently features over 1800 enzymes and 1108 enzymatic reactions involved in 174 pathways containing 822 different compounds. AraCyc pathways were generated computationally, then edited manually to correct mistakes, remove non-plant pathways and add missing plant pathways. To date, more than twenty plant-specific pathways, including carotenoid, brassinosteroid and gibberellin biosyntheses, have been added from the literature.

The main query page for AraCyc provides links to descriptions of the data set, an overview map, and the results of the initial computational build of the database.
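
As an illustration of the SeqViewer conventions described earlier (up to 250 names per text search, no more than 4 query sequences of 15–150 nucleotides each, and pasted sequence with coding regions in uppercase and introns, UTRs and intergenic regions in lowercase), the following Python sketch shows how a user's own script might check a query against those limits and split a pasted sequence by case. It is not TAIR's implementation: the function names and the "coding"/"noncoding" labels are hypothetical, and only the numeric limits and the case convention come from the text.

```python
from itertools import groupby

# Limits as stated in the text; the constant names are illustrative only.
MAX_NAMES = 250
MAX_SEQUENCES = 4
MIN_SEQ_LEN, MAX_SEQ_LEN = 15, 150


def validate_name_query(names):
    """Return a list of problems for a text (name) query; empty if acceptable."""
    problems = []
    if len(names) > MAX_NAMES:
        problems.append(f"{len(names)} names given; at most {MAX_NAMES} allowed")
    return problems


def validate_sequence_query(sequences):
    """Return a list of problems for a short-sequence query; empty if acceptable."""
    problems = []
    if len(sequences) > MAX_SEQUENCES:
        problems.append(
            f"{len(sequences)} sequences given; at most {MAX_SEQUENCES} allowed"
        )
    for i, seq in enumerate(sequences, start=1):
        if not MIN_SEQ_LEN <= len(seq) <= MAX_SEQ_LEN:
            problems.append(
                f"sequence {i} has length {len(seq)}; "
                f"expected {MIN_SEQ_LEN}-{MAX_SEQ_LEN} nucleotides"
            )
    return problems


def split_by_case(sequence):
    """Split pasted sequence into runs of uppercase and lowercase letters.

    Uppercase runs are labelled "coding"; lowercase runs (introns, UTRs and
    intergenic regions) are labelled "noncoding". Whitespace is removed first,
    since pasted text may contain line breaks.
    """
    cleaned = "".join(sequence.split())
    segments = []
    for is_upper, chars in groupby(cleaned, key=str.isupper):
        label = "coding" if is_upper else "noncoding"
        segments.append((label, "".join(chars)))
    return segments


if __name__ == "__main__":
    # A made-up fragment, used only to show the case convention.
    for label, segment in split_by_case("ccATGGCTgtaagtGATTAAtt"):
        print(label, segment)
```

Returning a list of problems rather than raising on the first one lets a script report every issue with a batch query at once. The case split cannot distinguish introns from UTRs or intergenic sequence, because the pasted text encodes only the coding versus non-coding distinction.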