BLUE WATERS ANNUAL REPORT 2016

BIG DATA ON SMALL ORGANISMS: -SCALE FIGURE 2: Layers of biological MODELING AND PHENOTYPIC PREDICTION OF organization captured and information used to train the model. Allocation: NSF PRAC/300 Knh PI: Ilias Tagkopoulos1

1University of California-Davis

EXECUTIVE SUMMARY ontologies. Second, the integration of all layers led to the predictor with the highest performance, We developed semi-supervised normalization although the increase in accuracy was incremental. pipelines and performed experimental We have validated 16 new predictions by genome- characterization (growth, transcriptional, ) scale transcriptional profiling. This work constitutes to create a consistent, quality-controlled multiomics the largest -based simulation and demonstrates compendium for Escherichia coli with cohesive how large high-performance computing metadata information. We then used this resource infrastructure, and novel computational to train on Blue Waters a multi-scale model that methods can lead to an integrative framework to integrates four omics layers to predict genome- guide biological discovery. wide concentrations and growth dynamics. Large scale simulations and subsequent validation led to microbial behaviors emerge and b) engineer the several interesting results. First, the genetic and WHY BLUE WATERS INTRODUCTION organisms in a fast, robust and accurate manner environmental ontology reconstructed from the for biomedical and biotechnological applications. Due to the complexity and size of the datasets (18 Predicting microbial behaviors through data-driven omics data was found to be substantially different Realistic models of bacterial cells can lead to new million points from various platforms and molecular computational tools is key to a) understand how and complementary to the genetic and chemical paradigms related to bacterial physiology as well species), the computational complexity of the as the rational redesign of with desired algorithms that range from constrained regression to FIGURE 1: Dataset behavior. For this task, multi-scale modeling and artificial neural networks, Blue Waters was critical and Methodology visualization tools are of paramount importance for the completion of the large-scale simulations as they provide insight of the systems under study. we performed.

METHODS & RESULTS NEXT GENERATION WORK This work had two parts, the generation of a Our first publication appeared in Nature consistent omics compendium and the development Communications. We have developed a deep of the integrative model that uses it as a training learning method for proteome reconstruction, which set. As such, we first constructed Ecomics, a has been tested already in Blue Waters and will be normalized, well-annotated, multi-omics database integrated into the model. Aside from algorithmic for E. coli, developed to provide high-quality data improvements, we aspire to move to population and associated meta-data for performing predictive simulations of this model to investigate and predict analysis and training data-driven algorithms. This population-level phenomena, similarly what we have compendium houses 4,389 normalized expression done before, with simpler models [1,2] profiles across 649 different conditions. We then proceeded to create the Multi-Omics Model and Analytics (MOMA) platform, an integrated model PUBLICATIONS AND DATA SETS that learns from the Ecomics and other available Kim, M., N. Rai, V. Zorraquino, and I. network data to predict genome-wide expression Tagkopoulos, Multi-omics integration accurately and growth. predicts cellular state in unexplored conditions for Escherichia coli, Nat. Commun. 7:13090 EP (2016), doi:10.1038/ncomms13090.

244 245