A Fast Machine Learning Workflow for Metagenomic Data for Phenotype

The Thirty-First AAAI Conference on Innovative Applications of Artificial Intelligence (IAAI-19) A Fast Machine Learning Workflow for Rapid Phenotype Prediction from Whole Shotgun Metagenomes Anna Paola Carrieri* Will PM Rowe∗ Martyn Winn Edward O. Pyzer-Knappy IBM Research UK Scientific Computing Dept. Scientific Computing Dept. IBM Research UK Sci-Tech Daresbury STFC Daresbury Lab. STFC Daresbury Lab. Sci-Tech Daresbury Warrington, UK. Warrington, UK. Warrington, UK. Warrington, UK. [email protected] Abstract Metagenomics: study of the genetic material of microorgan- isms constituting the microbiome. Research on the microbiome is an emerging and crucial Read: inferred sequence of nucleotides corresponding to all science that finds many applications in healthcare, food or part of a single DNA fragment, as measured in a sequenc- safety, precision agriculture and environmental studies. Huge amounts of DNA from microbial communities are being se- ing experiment. quenced and analyzed by scientists interested in extracting Whole metagenome shotgun sequencing: sequencing of meaningful biological information from this big data. Ana- the total genetic material of a microbial community. lyzing massive microbiome sequencing datasets, which em- Taxonomy: the process of naming and classifying organ- bed the functions and interactions of thousands of different isms into groups within a larger system, according to their bacterial, fungal and viral species, is a significant computa- similarities. tional challenge. Artificial intelligence has the potential for Marker genes: gene families that can be used to quantify building predictive models that can provide insights for spe- taxonomic diversity. cific cutting edge applications such as guiding diagnostics OTUs - Operational Taxonomy Units: cluster of microor- and developing personalised treatments, as well as maintain- ganisms grouped by sequence similarity of a specific marker ing soil health and fertility. Current machine learning work- flows that predict traits of host organisms from their com- gene (such as 16S rRNA) and representing a taxon. mensal microbiome do not take into account the whole genetic material constituting the microbiome, instead basing the Introduction analysis on specific marker genes. In this paper, to the best A research objective of great interest in the last decade is to of our knowledge, we introduce the first machine learning be able to understand the structure, organization, and func- workflow that efficiently performs host phenotype prediction from whole shotgun metagenomes by computing similarity- tionality of the microbiome and how this affects, and is preserving compact representations of the genetic material. affected by, the environment that surrounds it. The micro- Our workflow enables prediction tasks, such as classification biome is made up of the whole genetic material and interac- and regression, from Terabytes of raw sequencing data that do tions of a community of micro-organisms (bacteria, archaea, not necessitate any pre-prossessing through expensive bioin- viruses or fungi) that live in a natural environment. The envi- formatics pipelines. We compare the performance in terms of ronment could be an entire organism (e.g. a human being or time, accuracy and uncertainty of predictions for four differ- a mouse), part of it (e.g. the intestine, the skin or the mouth) ent classifiers. More precisely, we demonstrate that our ML or a natural habitat (e.g. water or soil). workflow can efficiently classify real data with high accuracy, Trillions of microbes have evolved and continue to live in using examples from dog and human metagenomic studies, the human body and can play a positive or negative role in representing a step forward towards real time diagnostics and a potential for cloud applications. determining the well-being of individuals. An imbalance in the microbial community can lead to the development and progression of various diseases, such as infections, respi- Glossary ratory diseases, metabolic and even psychological illnesses (e.g. depression and anxiety). Phenotype: observable characteristics of an organism re- Understanding the relationship between microbial com- sulting from the interaction of its gene products with the en- munities and the health or disease state of individuals can vironment. help with designing effective targeted treatments focused Microbiome: collective genomes of a microbial community on re-balancing the microbiome. On the other hand, micro- inhabiting a particular environment such as a surface of the organisms residing and interacting in the soil perform im- human body. portant processes such as support of the plant growth and cy- ∗These authors contributed equally to the work cling of carbon and other nutrients (Jansson and Hofmockel yCorresponding author 2018). Many beneficial functions carried out by the soil mi- Copyright c 2019, Association for the Advancement of Artificial crobiome are currently threatened due to changing climate, Intelligence (www.aaai.org). All rights reserved. soil degradation and poor land management practices. Un- 9434 derstanding which species of micro-organisms are essential cessed through bioinformatics pipelines such as Quantita- for maintaining soil health and fertility is important to com- tive Insights Into Microbial Ecology (QIIME) and clustered prehend how to manipulate the soil microbiome and restore into groups called Operational Taxonomy Units (OTUs). ecosystem function. Each OTU represents a different microbial species or taxon. The complex structure in which most of the micro- OTU tables, summarizing the relative abundance of micro- organisms that make up the microbiome are organized rep- bial species for a set of biological samples, are sparse and resents an obstacle to traditional in vitro culture. However, high dimensional data that have been recently used to train recent advances in high-throughput next generation DNA se- machine learning models such as SVMs, RFs and NNs. quencing (NGS) technologies have enabled researchers to A 2016 machine learning framework (Pasolli et al. 2016) characterize and compare microbial communities in diverse uses quantitative microbiome profiles, including species- natural environments, such as the human gut microbiome, level relative abundances and presence/absence of species- via the analysis of their nucleic acid content. Metagenomics, and strain-specific markers, as features for ML models. The the study of genetic material of micro-organisms constitut- authors evaluate the use of SVMs, RFs, Lasso and ENet to ing the microbiome, is proving very promising in research predict five diseases (liver cirrhosis, colorectal cancer, in- studies on the environment, biomedicine and food safety. flammatory bowel diseases (IBD), obesity, and type 2 dia- This is exemplified by several large-scale sequencing initia- betes) from six available metagenomic datasets. tives such as the Human Microbiome Project (HMP Consor- In (Carrieri, Haiminen, and Parida 2017) various normal- tium 2012), Global Ocean Survey (Rusch and al 2007) and ization methods for OTUs table and their impact on phe- the Earth Microbiome Project (Consortium 2017). notype prediction are evaluated for human, mouse, and en- Given the decreasing cost of NGS technologies and the vironmental studies. The authors also address the problem resultant increase in the amount of metagenomic sequenc- of identifying the most relevant microbial features (OTUs) ing data being generated, a compelling need for efficient an- that could give insight into the structure and function of the alytic tools able to manage and address the big data chal- differential microbial communities observed between phe- lenges of microbiome research arises. Machine learning cur- notype groups. rently offers some of the most promising tools for building The OTU representation of metagenomic data has several predictive models for classification or regression tasks from disadvantages. Firstly, the construction of the OTU table in- biological data, such as metagenomic data. (Libbrecht and volves a very large number of sequence alignments, either Noble 2015) presents an overview of machine learning ap- to the reference genomes (in closed reference strategies) or plications for the analysis of genome sequencing data sets. among sequences present in the sample (de novo strategies), Data produced in metagenomic studies are both unbalanced which makes it computationally expensive (Cai et al. 2017). and heterogeneous, thus reflecting the current challenges of Finally, the resulting OTU abundance tables are biased and machine learning in the era of Big Data. (Soueidan and sensitive to the specific bioinformatic pipeline used to gen- Macha 2018) reviews the contribution of machine learning erate them (He et al. 2015). This could have an impact on methods for metagenomics research, focusing on answering accuracy when attempting to predict phenotypes from OTU several important questions such as microbial species clus- tables (Ross et al. 2013), (Karlsson et al. 2013). tering, taxonomic assignment, comparative metagenomics A recent work (Asgari et al. 2018) presents a refer- and gene prediction. ence and alignment-free approach for predicting the phe- In this paper, we focus on phenotype prediction tasks ap- notype from microbial community samples based on k-mer plying data driven machine learning to metagenomic data. distributions in 16S rRNA marker gene sequences. K-mers The phenotype is an observable characteristic or trait of the are short overlapping

Load more