Metabolic Pathways for the Whole Community

Niels W Hanson1, Kishori M Konwar2, Alyse K Hawley2, Tomer Altman3, Peter D Karp4 and Steven J Hallam1,2*

Hanson et al. BMC Genomics 2014, 15:619
http://www.biomedcentral.com/1471-2164/15/619

METHODOLOGY ARTICLE (Open Access)

Abstract

Background: A convergence of high-throughput sequencing and computational power is transforming biology into information science. Despite these technological advances, converting bits and bytes of sequence information into meaningful insights remains a challenging enterprise. Biological systems operate on multiple hierarchical levels from genomes to biomes. Holistic understanding of biological systems requires agile software tools that permit comparative analyses across multiple information levels (DNA, RNA, protein, and metabolites) to identify emergent properties, diagnose system states, or predict responses to environmental change.

Results: Here we adopt the MetaPathways annotation and analysis pipeline and Pathway Tools to construct environmental pathway/genome databases (ePGDBs) that describe microbial community metabolism using MetaCyc, a highly curated database of metabolic pathways and components covering all domains of life. We evaluate Pathway Tools’ performance on three datasets with different complexity and coding potential, including simulated metagenomes, a symbiotic system, and the Hawaii Ocean Time-series. We define accuracy and sensitivity relationships between read length, coverage and pathway recovery and evaluate the impact of taxonomic pruning on ePGDB construction and interpretation. Resulting ePGDBs provide interactive metabolic maps, predict emergent metabolic pathways associated with biosynthesis and energy production and differentiate between genomic potential and phenotypic expression across defined environmental gradients.
Conclusions: This multi-tiered analysis provides the user community with specific operating guidelines, performance metrics and prediction hazards for more reliable ePGDB construction and interpretation. Moreover, it demonstrates the power of Pathway Tools in predicting metabolic interactions in natural and engineered ecosystems.

* Correspondence: [email protected]
1 Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, 100-570 West 7th Avenue, Vancouver, British Columbia V5Z 4S6, Canada
2 Department of Microbiology & Immunology, University of British Columbia, 2552-2350 Health Sciences Mall, Vancouver, British Columbia V6T 1Z3, Canada
Full list of author information is available at the end of the article

© 2014 Hanson et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Community interactions between uncultivated microorganisms give rise to dynamic metabolic networks integral to ecosystem function and global scale biogeochemical cycles [1]. Metagenomics bridges the “cultivation gap” through plurality or single-cell sequencing by providing direct and quantitative insight into microbial community structure and function [2,3]. Although new technologies are rapidly expanding our capacity to chart microbial sequence space, persistent computational and analytical bottlenecks impede comparative analyses across multiple information levels (DNA, RNA, protein and metabolites) [4,5]. This in turn limits our ability to convert the genetic potential and phenotypic expression of microbial communities into predictive insights and technological or therapeutic innovations.

Functional genes operate within the structure of metabolic pathways and reactions that define metabolic networks. Despite this fact, few metagenomic studies use pathway-centric approaches to predict microbial community interaction networks based on known biochemical rules. Recently, algorithms for pathway prediction and metabolic flux have been developed for environmental sequence information, including the Human Microbiome Project Unified Metabolic Analysis Network (HUMAnN) and Predicted Relative Metabolic Turnover (PRMT). HUMAnN uses an integer optimization algorithm that conservatively computes a parsimonious minimum set of reactions along KEGG pathways based on pathway presence, absence or completion [6,7]. PRMT infers metabolic flux based on normalized enzyme activity counts mapped to KEGG pathways across multiple metagenomes [8]. Because KEGG pathways are coarse and do not discriminate between pathway variants, both modes of analysis have limited metabolic resolution [9]. Moreover, neither HUMAnN nor PRMT provides a coherent structure for exploring and interpreting predicted KEGG pathways.

One alternative to HUMAnN and PRMT is Pathway Tools, a production-quality software environment supporting metabolic inference and flux balance analysis based on the MetaCyc database of metabolic pathways and enzymes representing all domains of life [10-13]. Unlike KEGG or SEED subsystems, MetaCyc emphasizes smaller, evolutionarily conserved or co-regulated units of metabolism and contains the largest collection (over 2000) of experimentally validated metabolic pathways. Extensively commented pathway descriptions, literature citations, and enzyme properties combined within a pathway/genome database (PGDB) provide a coherent structure for exploring and interpreting predicted pathways. Although initially conceived for cellular organisms, recent development of the MetaPathways pipeline extends the PGDB concept to environmental sequence information, enabling pathway-centric insights into microbial community structure and function [14,15].

Here we provide essential guidelines for generating and interpreting ePGDBs inspired by the multi-tiered structure of BioCyc [16] (Figure 1). We begin with genome and metagenome simulations to assess performance on datasets manifesting different read length, coverage and taxonomic diversity, and we develop a weighted taxonomic distance to evaluate concordance between pathways predicted using environmental sequence information and reference pathways in the MetaCyc database. Given these metrics, we demonstrate Pathway Tools’ power to predict emergent metabolism in simulated metagenomes and a previously characterized symbiotic system [17]. Finally, we generate ePGDBs using coupled metagenomic and metatranscriptomic datasets from the Hawaii Ocean Time-series (HOT) to compare and contrast genetic potential and phenotypic expression along defined environmental gradients in the ocean [18-20].

Results and discussion

Performance considerations

Environmental pathway/genome database (ePGDB) construction commences with the MetaPathways automated annotation pipeline using environmental sequence information […] [21]. We previously reported PathoLogic’s performance on combined and incomplete genomes using two simulated metagenomes (Sim1 and Sim2) derived from 10 BioCyc tier-2 PGDBs manifesting different coverage and taxonomic diversity using MetaSim [14,22]. Simulations on increasing proportions of the total component genome length (Gm) showed that the performance of pathway recovery based on multiple metrics (F-measure, Matthews Correlation Coefficient, etc.) increased with sequence coverage and sample diversity, nearing an asymptote at higher coverage (Figure 2a). This suggests that pathway prediction follows a collector’s curve in which common core pathways accumulate in the early part of the curve, followed by less common accessory pathways near the asymptote.

To better constrain pathway recovery and performance in relation to ePGDB construction, we compared results of MetaSim experiments using the Escherichia coli K12 substr. MG1655 genome (basis of the EcoCyc database), Sim1 and Sim2, and a subsampled 25 m metagenome from HOT [19] (Additional file 1: Materials and Methods, Tables S1-S4 and Figure S1). Simulations were performed at progressively larger Gm coverage. Consistent with previous observations for Sim1 and Sim2, all experiments showed that pathway recovery percentage and performance sensitivity increased with sequence coverage and sample diversity, nearing an asymptote at higher coverage (Figure 2a-b). The absolute values of these patterns were sensitive to read length and likely reflected limits imposed by open reading frame prediction and BLAST/LAST-based annotation. In contrast, performance specificity was high (>85%) regardless of read length, coverage, or taxonomic diversity (Figure 2b). The rate of pathway recovery increased proportionally with increasing sample diversity at lower coverage values, as seen in the reduction of pathway recovery percentage between Sim1, Sim2 and E. coli for long read (~700 bp) and between HOT, Sim1/2 and E. coli for short read (~160 bp) datasets. Additional performance metrics can be found in Additional file 1: Tables S5–S8. Because PathoLogic performance improves with increasing read length, coverage and sample diversity, sequencing platform selection and use of assembled versus unassembled sequence information should be considered when generating ePGDBs.

When constructing PGDBs for individual genomes PathoLogic
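The performance metrics named in the simulation results above (sensitivity, specificity, F-measure and Matthews Correlation Coefficient) can be illustrated with a minimal Python sketch. It uses the standard confusion-matrix definitions, which may differ in detail from the calculations behind Additional file 1. The function name `pathway_metrics`, the pathway identifiers `P1` to `P10`, and the choice of a fixed "universe" of candidate pathways for counting true negatives are all assumptions made for illustration and are not taken from the article.

```python
from math import sqrt


def pathway_metrics(predicted, reference, universe):
    """Compare a predicted pathway set against a reference set.

    `universe` is the full set of candidate pathways that could have been
    predicted (an illustrative assumption; it is needed to count true
    negatives for specificity and MCC).
    """
    predicted, reference, universe = set(predicted), set(reference), set(universe)

    tp = len(predicted & reference)             # predicted and truly present
    fp = len(predicted - reference)             # predicted but not in reference
    fn = len(reference - predicted)             # in reference but missed
    tn = len(universe - predicted - reference)  # correctly not predicted

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    # F-measure: harmonic mean of precision and sensitivity (recall).
    pr_sum = precision + sensitivity
    f_measure = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0

    # Matthews Correlation Coefficient; defined as 0 when a margin is empty.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical example: 10 candidate pathways, 4 truly present, 4 predicted.
universe = [f"P{i}" for i in range(1, 11)]
reference = ["P1", "P2", "P3", "P4"]
predicted = ["P1", "P2", "P3", "P5"]

for name, value in pathway_metrics(predicted, reference, universe).items():
    print(name, value)
```

In this toy case, three of the four reference pathways are recovered and one spurious pathway is predicted. Because most candidate pathways are correctly left unpredicted, specificity stays high even when sensitivity is moderate, in line with the pattern reported for Figure 2b.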
