The Biocyc Collection of Microbial Genomes and Metabolic Pathways Peter D
Total Page:16
File Type:pdf, Size:1020Kb
Briefings in Bioinformatics, 2017, 1–9 doi: 10.1093/bib/bbx085 Paper Downloaded from https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbx085/4084231 by guest on 02 May 2019 The BioCyc collection of microbial genomes and metabolic pathways Peter D. Karp, Richard Billington, Ron Caspi, Carol A. Fulcher, Mario Latendresse, Anamika Kothari, Ingrid M. Keseler, Markus Krummenacker, Peter E. Midford, Quang Ong, Wai Kit Ong, Suzanne M. Paley and Pallavi Subhraveti Corresponding author: Peter D. Karp, Bioinformatics Research Group, SRI International, 333 Ravenswood Ave, AE206, Menlo Park, CA 94025, USA. E-mail: [email protected] Abstract BioCyc.org is a microbial genome Web portal that combines thousands of genomes with additional information inferred by computer programs, imported from other databases and curated from the biomedical literature by biologist curators. BioCyc also provides an extensive range of query tools, visualization services and analysis software. Recent advances in BioCyc in- clude an expansion in the content of BioCyc in terms of both the number of genomes and the types of information available for each genome; an expansion in the amount of curated content within BioCyc; and new developments in the BioCyc soft- ware tools including redesigned gene/protein pages and metabolite pages; new search tools; a new sequence-alignment tool; a new tool for visualizing groups of related metabolic pathways; and a facility called SmartTables, which enables biolo- gists to perform analyses that previously would have required a programmer’s assistance. Key words: genome databases; microbial genome databases; metabolic pathway databases Peter Karp is the Director of the SRI Bioinformatics Research Group (BRG). Richard Billington is a Senior Software Engineer at SRI International; he obtained an MS degree from the University of Pennsylvania and a BS degree from the University of California at Santa Cruz. He has held research positions at the University of Michigan, University of Pennsylvania, Georgia Tech, and Sandia National Labs. Ron Caspi has a PhD in Molecular Microbiology. He is currently the sole curator of the MetaCyc database. Carol A. Fulcher, is a Scientific Database Curator at SRI International. Mario Latendresse is a Computer Scientist at SRI International. He has worked on the flux balance analysis module, atom mappings, Gibbs free energies computation, BioVelo, Route Search and visualization tools. Anamika Kothari obtained her bachelor’s degree from K.C. College Mumbai; she is a Scientific Database Curator at SRI International. Ingrid Keseler, is a Scientific Database Curator at SRI International. Markus Krummenacker is a computer scientist. He has worked on the BioCyc genome browser, reaction and compound display pages, and electron trans- fer diagrams. Peter E. Midford is a Chemo/Bioinformatics Scientist at SRI International. Before joining SRI, he worked on phylogenetic databases and phenotype ontolo- gies for several collaborative projects. Quang Ong received a BS in Industrial Management and is a Scientific Programmer at SRI International. Wai Kit Ong is a Metabolic Modeler at BRG. He is responsible for building genome-scale metabolic network models of bacteria with a focus on those found in the human gut microbiome. Suzanne Paley is a Senior Software Developer and has been with the BioCyc project since its inception. She is responsible for Pathway Collages, and many other BioCyc query, analysis and visualization tools. Pallavi Subhraveti is the Release Manager at BRG. She is also responsible for building PGDBs in large scale. Submitted: 18 May 2017; Received (in revised form): 22 June 2017 VC The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] 1 2|Karp et al. Introduction BioCyc.org is a microbial genome Web portal that combines thousands of genomes with additional information inferred by computer programs, imported from other databases (DBs) and curated from the biomedical literature by biologist curators. BioCyc also provides an extensive range of query tools, visual- ization services and analysis software. BioCyc has been developed over a 25 year period, beginning with the EcoCyc DB for Escherichia coli. Over time, the content of Downloaded from https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbx085/4084231 by guest on 02 May 2019 Figure 1. Processing steps involved in generating the BioCyc DBs. Recently added BioCyc has expanded in terms of the number of genomes, the steps are shown in bold. No relative ordering is implied between the steps along types of information available for each genome and the amount the top and the steps along the bottom. of curated content. BioCyc has also grown to include some eu- karyotic genomes (although its main emphasis is microbial). original genome annotation that was present in the down- The software behind BioCyc, called Pathway Tools [1,2], loaded genome file(s) for each organism. has also expanded in many ways during this period, such as to First, Pathway Tools converts the annotated genome from support regulatory networks, omics data analysis and metabolic the Genbank format to its internal PGDB format. Next, the com- modeling. Recent enhancements include redesigned gene/pro- putational operations in the upper portion of Figure 1 are per- tein pages and metabolite pages, new search tools, a new formed. Pathway Tools modules make the following predictions sequence-alignment tool, a new tool for visualizing groups of [2]. Metabolic and transport reactions and metabolic pathways related metabolic pathways and a facility called SmartTables, are predicted [3] from the reactions and pathways in the which enables biologists to perform analyses that previously MetaCyc DB [4]. Next occurs prediction of pathway hole fillers would have required a programmer’s assistance. (genes that code for enzymes catalyzing reactions with no cur- rently assigned enzyme) and prediction of operons using both Expansion of BioCyc DB content structural and functional information [5]. Orthologs among BioCyc genomes are computed by software Each BioCyc DB describes one sequenced genome, with the ex- that runs large-scale bidirectional BLAST (version 2.2.23) com- ception of the MetaCyc DB, which describes experimentally parisons among all pairs of proteins in the BioCyc genomes. We studied metabolic pathways from all domains of life. Since use a BLAST E-value cutoff of 0.001, with all other parameters at 2011, BioCyc has expanded from 1000 genomes to 9300 gen- default settings. We define two proteins A and B as orthologs if omes. The majority of those genomes were obtained from protein A from proteome PA and protein B from proteome PB are Genbank RefSeq and from the Human Microbiome Project com- bidirectional best BLAST hits of one another, meaning that pro- plete genomes DB. As the majority of sequenced microbial gen- tein B is the best BLAST hit of protein A within proteome PB, and omes are of interest to a relatively small number of researchers, protein A is the best BLAST hit of protein B within proteome PA. BioCyc emphasizes breadth and quality of information for more In rare cases, protein A might have multiple orthologs in prote- highly used genomes at the expense of number of genomes. ome PB, as explained below. The best hit of protein A in prote- To facilitate access to the more commonly used BioCyc ome PB is defined by finding the minimal E-value among all hits Pathway/Genome Databases (PGDBs), we have created the set of in proteome PB in the BLAST output, and collecting all the hits home pages listed in Table 1. When entering BioCyc through for A in proteome PB that have the same minimal E-value. In these home pages, the user’s default organism will be set to the other words, ties are possible, as in the case of exact gene dupli- BioCyc PGDB for the primary strain for that species. cations. We attempt to break ties using two methods: taking the hit with the maximum alignment length; and then taking the hit Workflow for generation of BioCyc PGDBs with the maximum alignment amino acid residue identity. For To produce new BioCyc PGDBs, we process each BioCyc genome the first method, we compare the alignment lengths among all through the computational steps shown in Figure 1 both to the hits of protein A in proteome PB that share the same min- computationally infer new information for the genome and to imum E-value, and the protein in proteome PB with the max- integrate additional information from other bioinformatics DBs. imum alignment length is selected. For the second method, we The amount of information found by the different import steps compare the number of identical amino acid residues in the will vary for different organisms. Note that we retain the alignments between protein A and the hits of protein A in prote- ome PB that share the same minimum E-value, and the protein in proteome P with the maximum number of identical amino Table 1. Home pages for BioCyc organisms B acid residues is selected. In the case that ties still remain (as in Home page Genus the case of exact gene duplications), all ties are included in the final set of orthologs used by BioCyc. Thus, protein A could have ecocyc.org Escherichia coli multiple orthologs in PB, such as if multiple proteins B1, B2, etc., helicobacter.biocyc.org Helicobacter pylori exist in PB, and have exactly the same regions align against pro- vibrio.biocyc.org Vibrio cholerae tein A. BioCyc does not calculate paralogs. Listeria monocytogenes listeria.biocyc.org Pfam [6] domains are identified in BioCyc proteins by run- salmonella.biocyc.org Salmonella enterica ning the Pfam software. Finally, zoomable cellular overview shigella.biocyc.org Shigella flexneri (metabolic map) diagrams are generated for each organism. cdifficile.biocyc.org Clostridium difficile Next, data from several third-party DBs are imported into mycobacterium.biocyc.org Mycobacterium tuberculosis BioCyc, as shown in the lower portion of Figure 1.