A Mechanism-Aware and Multiomic Machine-Learning Pipeline
A mechanism-aware and multiomic machine-learning pipeline characterizes yeast cell growth

Christopher Culley(a,b), Supreeta Vijayakumar(b), Guido Zampieri(b), and Claudio Angione(b,c,1)

(a) Faculty of Engineering and Physical Sciences, University of Southampton, Southampton SO17 1BJ, United Kingdom; (b) Department of Computer Science and Information Systems, Teesside University, Middlesbrough TS1 3BX, United Kingdom; and (c) Healthcare Innovation Centre, Teesside University, Middlesbrough TS1 3BX, United Kingdom

Edited by Jens Nielsen, BioInnovation Institute, Copenhagen, Denmark, and approved June 12, 2020 (received for review February 16, 2020)

Metabolic modeling and machine learning are key components in the emerging next generation of systems and synthetic biology tools, targeting the genotype–phenotype–environment relationship. Rather than being used in isolation, it is becoming clear that their value is maximized when they are combined. However, the potential of integrating these two frameworks for omic data augmentation and integration is largely unexplored. We propose, rigorously assess, and compare machine-learning–based data integration techniques, combining gene expression profiles with computationally generated metabolic flux data to predict yeast cell growth. To this end, we create strain-specific metabolic models for 1,143 Saccharomyces cerevisiae mutants and we test 27 machine-learning methods, incorporating state-of-the-art feature selection and multiview learning approaches. We propose a multiview neural network using fluxomic and transcriptomic data, showing that the former increases the predictive accuracy of the latter and reveals functional patterns that are not directly deducible from gene expression alone. We test the proposed neural network on a further 86 strains generated in a different experiment, therefore verifying its robustness to an additional independent dataset. Finally, we show that introducing mechanistic flux features improves the predictions also for knockout strains whose genes were not modeled in the metabolic reconstruction. Our results thus demonstrate that fusing experimental cues with in silico models, based on known biochemistry, can contribute with disjoint information toward biologically informed and interpretable machine learning. Overall, this study provides tools for understanding and manipulating complex phenotypes, increasing both the prediction accuracy and the extent of discernible mechanistic biological insights.

metabolic modeling | machine learning | flux balance analysis | systems biology | multimodal learning

BIOPHYSICS AND COMPUTATIONAL BIOLOGY | SYSTEMS BIOLOGY

Significance

Linking genotype and phenotype is a fundamental problem in biology, key to several biomedical and biotechnological applications. Cell growth is a central phenotypic trait, resulting from interactions between environment, gene regulation, and metabolism, yet its functional bases are still not completely understood. We propose and test a machine-learning approach that integrates large-scale gene expression profiles and mechanistic metabolic models, for characterizing cell growth and understanding its driving mechanisms in Saccharomyces cerevisiae. At its core, a custom-built multimodal learning method merges experimentally generated and model-generated data. We show that our approach can leverage the advantages of both machine learning and metabolic modeling, revealing unknown interactions between biological domains, incorporating mechanistic knowledge, and therefore overcoming black-box limitations of conventional data-driven approaches.

Author contributions: G.Z. and C.A. designed research; C.C. performed research; G.Z. and C.A. contributed new reagents/analytic tools; C.C. and S.V. analyzed data; C.C., S.V., G.Z., and C.A. wrote the paper; G.Z. and C.A. administered the project; and C.A. acquired funding and supervised the project.

The authors declare no competing interest.

This article is a PNAS Direct Submission.

Published under the PNAS license.

Data deposition: All data, models, and code used in this work are available on GitHub at https://github.com/multiOmicMechanismAwareML/CodeBase, along with the information for replicating the results presented.

(1) To whom correspondence may be addressed. Email: [email protected]

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2002959117/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.2002959117

The analysis of complex, high-dimensional biological data from heterogeneous sources is currently one of the main bottlenecks in molecular biology. Such data are generated by a range of high-throughput devices that target specific biomolecules or biological processes and are collectively known as omic data. Representative examples are the global genetic composition of an organism (the genome) and the overall activation level of its genes at a certain time (the transcriptome).

Popular technologies permit the monitoring of various phenomena on a genetic and epigenetic level. However, in several applications, information on genes may have limited relevance to the task at hand, describing only a part of the processes taking place in biological organisms. Metabolic data are closer to the cellular phenotype but, despite recent innovations in omic technologies, sampling metabolic activity on a large scale is still challenging (1). Machine learning provides tools to identify and exploit patterns within this metabolic information, which can aid in our understanding of the underlying biological mechanisms (2). In this context, the heterogeneity of omic data has fostered the development and application of multimodal learning methods (3).

Machine-learning techniques generally ignore previous biological knowledge in driving the pattern analysis, limiting the trustworthiness and interpretability of any obtained model. To fill these gaps, constraint-based modeling (CBM) can be used to simulate steady-state metabolism on a cellular scale. Metabolic flux profiles generated in silico have been previously used to inform specific machine-learning models (4–9), in some cases providing predictive advantages, as recently reviewed (10). However, an integrative approach that fully exploits the multimodal learning potential to integrate such models with experimental omics and is therefore able to incorporate mechanistic biological knowledge in the learning process is still lacking.

In this work, we propose a multimodal learning framework that leverages both transcriptomic data and strain-specific metabolic models to predict phenotypic traits of interest. We use this framework to predict the cellular growth for 1,143 strains of Saccharomyces cerevisiae, one of the main eukaryotic platforms in basic research as well as in biotechnology and, more recently, used for characterizing the processes associated with human diseases (11).

Cellular growth and gene expression are closely related in unicellular organisms, as they coparticipate in mutual regulation. On the one hand, growth is sustained by genes implicated in ribosomal and translational functions. In parallel, the expression of genes is affected by global and unspecific regulation originating from the physiological state of the cell (12). This relationship has yet to be fully understood, and therefore predicting cellular growth following genetic manipulations is still challenging. Understanding and controlling cellular growth have important applications in disease modeling, in biotechnology, and for the development of efficient cell factories (13). CRISPR-Cas9–enabled genetic engineering now allows modifying yeast DNA with single-nucleotide precision in vivo (14), achieving engineered strains that maximize a desired output. However, the identification of such strains is a complex issue (15). For instance, streamlining yeast metabolism for the production of valuable compounds often requires the deletion of multiple genes and efficient diversion of resources toward production pathways (16).

In an attempt to fully elucidate relationships between cellular growth and other processes, mathematical models have been developed, particularly in bacteria and yeast (17–19). For instance, coarse-grained models were designed to describe the global relationship between the allocation

gration approaches and 2) an examination of the benefits of using metabolic modeling in building multimodal machine-learning predictive models, evaluating to what extent these mechanistic data are used to drive the learning process.

Results

Our goal was to develop and evaluate a multiomic mechanism-aware pipeline for predicting S. cerevisiae growth rate. To this end, we developed the workflow summarized in Fig. 1. In brief, we used CBM of metabolism to estimate the metabolic activity of each yeast mutant in the exponential growth phase, starting from their transcriptional activity. Then, we built and cross-compared 27 machine-learning models of yeast growth from a combination of transcript abundance and metabolic flux information. These steps and their output are described in detail in the following.

Strain-Specific Metabolic Modeling of Yeast Mutants. Genome-scale metabolic models (GSMMs) aim to capture and simulate the entire metabolic activity within a cell. Since different transcription rates lead to alterations of cell behavior, we used gene expression data to create 1,229 strain-specific models that emu-