BASys: A Web Server for Automated Bacterial Annotation Gary Van Domselaar†, Paul Stothard, Savita Shrivastava, Joseph A. Cruz, AnChi Guo, Xiaoli Dong, Paul Lu, Duane Szafron, Russ Greiner, and David S. Wishart‡

Departments of Computing Science and Biological Sciences University of Alberta Edmonton AB T6E 2E9

[email protected][email protected]

Abstract Head Node Annotation Reports BASys uses CGView [3] to generate BASys (Bacterial Annotation System) is a web server that clickable genome maps for navigating supports automated, in- depth annotation of bacterial the genome data. An HTML- formatted tabular summary is also provided. The genomic (chromosomal, plasmid, and contig) sequences. Genomic Annotation genome maps are prerendered as a It accepts raw DNA sequence data and an optional list of Sequence Data Reports series of hyperlinked PNG image files. gene identification information and provides extensive Each gene label is hyperlinked to its textual and hyperlinked image output. BASys uses more corresponding HTML- formatted “gene card”. The card is hyperlinked where than 30 programs to determine nearly 60 annotation applicable to external references. Text- subfields for each gene, including gene/ protein name, only gene cards are also provided. BASys GO function, COG function, possible paralogues and also supplies an 'evidence card' (Optional) Data Scheduling describing how each annotation was orthologues, molecular weight, isoelectric point, operon generated. The gene cards, evidence Gene BASys is implemented as a distributed structure, subcellular localization, signal peptides, Identification cards, and graphical genome maps are system. The head node monitors and downloadable for offline viewing. transmembrane regions, secondary structure, 3- D Data manages the job scheduling. Annotation structure, reactions, and pathways. The depth and detail and report generation are carried out by of a BASys annotation matches or exceeds that found in a the compute nodes. standard SwissProt entry. BASys also generates colourful, clickable and fully zoomable maps of each query to permit rapid navigation and detailed Data Submission visual analysis of all resulting gene annotations. The textual annotations and images that are provided by BASys supplies a web form for uploading chromosome, BASys can be generated in approximately 24 hours for an plasmid, or contig sequence average bacterial chromosome (5 Megabases). BASys data. Optional gene annotations may be viewed and downloaded identification data can be provided, or BASys can anonymously or through a password protected access predict protein coding system. The BASys server and databases can also be regions from the genomic downloaded and run locally. BASys is accessible at: data using Glimmer [1]. http:/ / wishart.biology.ualberta.ca/ basys

BASys Annotation Pipeline Validation

BASys annotations were The BASys annotation engine combines database comparison and compared to a set of expertly computational sequence analysis in its annotation pipeline. annotated proteins from C. Search Capability Translated coding sequences are initially compared using BLAST to trachomatis [2]. BASys the expertly annotated reference databases UniProt and the annotations agreed with the BASys supports online keyword searches CyberCell comprehensive molecular database on Escherichia coli. expert annotations 762 times and sequence similarity searches Search The similarity score between the query and database sequence is out of 894. The sensitivity is Compute Compute results contain hyperlinks to their gene compared to the threshold value for each annotation type and 94% ; the specificity is 73% . Node Node cards and graphical genome maps. qualifying annotations are transitively applied to the query sequence. BASys attempts to fill the remaining annotations with additional similarity searches and sequence analyses. BLAST searches are conducted against the protein sequences of C. elegans, human, yeast, and Drosophila; a non- redundant database of bacterial protein sequences, the PDB , KEGG, and Reference DB Model Organism Metabolite Analysis Sequence Analysis Structure Analysis Report Generation COG databases. Various sequence analyses are also performed Similarity Search Similarity Search including Pfam, PROSITE, signal peptide and transmembrane domain predictions, and predicted secondary structure with Pfam Homodeller PSIPRED. If the sequence has sufficient similarity to a sequence represented in the PDB database, then BASys may use PROSITE VADAR HOMODELLER to generate a homology model and subsequently perform a structural analysis using VADAR. Several additional SWISSPROT E. coli KEGG References D. m elanogast er PredictSPTM annotations, such as protein molecular weight, isoelectric point, 1. Delcher AL et al. (1999) Nucleic Acid Res. and operon structure are calculated directly from the H. sapiens 27:4636- 41. chromosomal, protein- coding nucleotide, and translated protein C. elegans 2. Ilioupoulos I et al. (2003) Bioinf orm at ics etc. 19:717- 26. sequence data. In all collection of nearly 60 distinct annotations is S. cerevisiae PDB CGview 3. Stothard P. and Wishart DS (2005) generated for each gene. CCDB Bioinf orm at ics 21:537- 39.