Microbial Community Metagenomics with the bioBakery and Bioconductor

CLASS SESSIONS

Wednesday, June 5, 2019; 8:30am - 5:30pm

Thursday, June 6, 2019; 8:30am - 5:30pm

Location: TBD

Directions can be found here: http://www.cuepisummer.org/contactpage

INSTRUCTORS Curtis Huttenhower, PhD Associate Professor of Computational Biology and Harvard T.H. Chan School of email: [email protected]

Levi Waldron, PhD Assistant Professor of CUNY Graduate School of Public Health and Health Policy email: [email protected]

COURSE DESCRIPTION Microbial ecology is one of many fields that have benefitted greatly from technical advances in DNA sequencing. In particular, low-cost culture-independent sequencing has made metagenomic and metatranscriptomic surveys of microbial communities practical, including bacteria, archaea, viruses, and fungi associated with the human body, other hosts, and the environment. The resulting data complement amplicon-based sequencing approaches and have stimulated the development of new computational approaches to meta'omic sequence analysis, including metagenomic assembly, microbial identification, and gene, transcript, and pathway functional profiling.

We will present a high-level introduction to computational metagenomics, highlighting the state of the art in the field as well as outstanding challenges. This includes an introduction to the biological goals of typical meta'omic studies and the bioinformatic processes currently available to achieve them. We will briefly summarize the major aspects of metagenomic and metatranscriptomic analysis to be covered here: reference genome-based community composition and functional profiling, along with methods for constructing new genomic references by de novo assembly. We will also go into detail regarding the statistical challenges associated with human microbiome epidemiology, precise quantification of members of a microbial community, functional analysis of gene families, association of gene families with source organisms and strains, and the combination of genes into pathways for metabolic profiling.

Finally, as sequencing technologies deliver more data for the same price, our ability to examine complex microbial communities using sequencing grows. For environmental communities, many fewer reference genomes or transcriptomes are typically available than for human-associated microbes, and the substantial diversity of many communities means that terabases of sequencing may be needed to recover a significant fraction of the community metagenome. We will introduce large-scale de novo assembly, reference-free methods for investigating community coverage, and diversity estimation for shotgun sequencing data. We will conclude with an overview of the statistical challenges inherent to analyzing the compositional and count data arising from meta'omic studies, and present Bioconductor solutions for simplifying these analyses. The workshop will include standardized protocols for microbial profiling, functional profiling, and metagenome/metatranscriptome assembly with benchmarks and examples.

PREREQUISITES Registrants will be required to bring a personal laptop with R/Bioconductor, RStudio, and Anaconda (Python 3) installed. Additional software installations may be necessary; any such requirements will be communicated to registrants closer to the dates of instruction.

COURSE LEARNING OBJECTIVES By the end of this course, students will be able to:

1. Understand all stages of meta'omic analysis, from raw sequence data handling and quality control to statistical identification of differentially abundant microbes, microbial genes, and gene functions

2. Perform metagenomic taxonomic profiling with MetaPhlAn2

3. Perform metagenomic and metatranscriptomic functional profiling with HUMAnN2

4. Generate synthetic microbiome data using SparseDOSSA and perform quality control, visualization, and dimensionality reduction in preparation for downstream analysis

5. Perform epidemiological association testing, clustering, and visualization in R/Bioconductor

RECOMMENDED COURSE READING LIST

UNIX Tutorial for Beginners: http://www.ee.surrey.ac.uk/Teaching/Unix/

Quick-R: http://www.statmethods.net/

Abubucker S et al.: Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS Comput. Biol. 2012, 8:e1002358.

Franzosa EA et al.: Relating the metatranscriptome and metagenome of the human gut. Proc. Natl. Acad. Sci. U. S. A. 2014, 111:E2329–38.

Truong DT et al.: MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat. Methods 2015, 12:902–903.

COURSE STRUCTURE Class time is 16 hours total. The workshop will combine lectures and guided exercises with hands-on assistance from the instructors.

COURSE SCHEDULE

DAY 1

9:30-10:45 am Introduction to meta’omics (Curtis)

10:50-12:00 pm Lab 1: Taxonomic profiling (Curtis)

12:30-2:15 pm Meta’omic analysis concepts (Curtis)

2:15-3:00 pm Lab 2: Functional profiling (Curtis)

DAY 2

10:30-10:45 am Downstream analyses for meta’omics (Curtis)

10:50-12:00 pm Lab 3: Visualization (Curtis)

12:30-1:15 pm Meta’omics statistical challenges (Levi)

1:15-3:00 pm Lab 4: Data visualization and differential abundance (Levi)
