bioRxiv preprint doi: https://doi.org/10.1101/2020.09.01.277632; this version posted September 1, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Tree-Aggregated Predictive Modeling of Microbiome Data Jacob Bien 1;∗, Xiaohan Yan 2, L´eoSimpson 3;4, and Christian L. M¨uller4;5;6;∗ 1Department of Data Sciences and Operations, University of Southern California, CA, USA 2Microsoft Azure, Redmond, WA, USA 3Technische Universit¨atM¨unchen, Germany 4Institute of Computational Biology, Helmholtz Zentrum M¨unchen, Germany 5Department of Statistics, Ludwig-Maximilians-Universit¨atM¨unchen, Germany 6Center for Computational Mathematics, Flatiron Institute, Simons Foundation, NY, USA ∗correspondence to:
[email protected], cmueller@flatironinstitute.org September 1, 2020 Abstract Modern high-throughput sequencing technologies provide low-cost microbiome sur- vey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group informa- tion. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven, parameter-free, and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional mi- crobiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity.