QUANTITATIVE ANALYSIS OF MICROBIAL SPECIES IN A METAGENOME BASED ON THEIR SIGNATURE SEQUENCES

Pooja Yadav

A Thesis Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

August 2017

Committee:
Xu, Zhaohui, Advisor
McKay, Robert Michael
Roy, Sankardas

Copyright © August 2017 Pooja Yadav
All rights reserved

ABSTRACT

Xu, Zhaohui, Advisor

Shotgun metagenomics provides a relatively new and powerful approach for characterizing the microbial communities in environmental samples, in contrast to the study of pure cultures by conventional techniques. To determine microbial diversity and to understand the role of microbes in the ecosystem, quantitative studies are needed whose values are comparable across different studies and samples. We have developed a statistical approach to microbial profiling which encompasses quantitative characterization and comparison of the relative abundance of the microbes in a metagenome sample based on their signature sequences (unique k-mers). We demonstrated the utility of this approach by characterizing and quantifying the relative abundance of the microbes in four simulated metagenome samples (Comp_25, Comp_50, Comp_75, and Comp_100). The suffix of each simulated metagenome's name represents the gene content percentage of the reporter species in that metagenome. Analysis of the simulated metagenomes at data volumes of 6e9 and 6e10 provides information about species abundance by identifying the unique k-mers (signature sequences) of the six reporter species B. licheniformis, L. brevis, L. fermentum, L. plantarum, P. ananatis, and P. vagans. Our approach successfully identified the abundance of four reporter species (B. licheniformis, L. brevis, L. fermentum, and P. ananatis), whereas two species (L. plantarum and P. vagans) were overestimated in the simulated metagenomes.
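To make the idea of signature sequences concrete, the following is a minimal Python sketch of how unique k-mers could be extracted and then used to estimate relative abundance. It is an illustration under stated assumptions, not the implementation used in this thesis: the function names (kmers, unique_kmers, relative_abundance), the toy sequences, and the choice to normalise hit counts by the number of signature k-mers per species are all hypothetical, and reverse complements are ignored for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(genomes, k):
    """Map each species to the k-mers present in its genome and in no other."""
    per_species = {name: kmers(seq, k) for name, seq in genomes.items()}
    # Count in how many species each k-mer occurs.
    occurrence = Counter(km for ks in per_species.values() for km in ks)
    return {name: {km for km in ks if occurrence[km] == 1}
            for name, ks in per_species.items()}


def relative_abundance(reads, signatures, k):
    """Estimate relative abundance from signature k-mer hits in the reads."""
    hits = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            for name, sig in signatures.items():
                if km in sig:
                    hits[name] += 1
    # Assumption: divide by the signature count so that species with
    # many unique k-mers are not favoured over species with few.
    rates = {name: hits[name] / len(sig)
             for name, sig in signatures.items() if sig}
    total = sum(rates.values())
    return {name: (r / total if total else 0.0) for name, r in rates.items()}


# Toy data (hypothetical sequences, not real genomes).
toy_genomes = {"sp_a": "ACGTACGGA", "sp_b": "TTGCACGTT"}
toy_reads = ["GTACGG", "GTACGG", "TTGCAC"]

signatures = unique_kmers(toy_genomes, k=4)
print(relative_abundance(toy_reads, signatures, k=4))
```

In this toy case the 4-mer ACGT occurs in both sequences, so it is excluded from both signature sets, and only the remaining k-mers contribute to the abundance estimate.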
Therefore, applying advanced statistics, refining the algorithm, and increasing the data volume will be our next steps to improve the accuracy of our approach in estimating the ratio of species in a metagenome.

ACKNOWLEDGMENTS

To start, I would like to acknowledge the support of my advisor, Dr. Zhaohui Xu, who constantly encouraged and motivated me to learn new skills, such as Python coding, during this project. I truly appreciate all her guidance at the personal and professional levels. I would also like to thank my committee members, Dr. Robert Michael McKay and Dr. Sankardas Roy, for their valuable suggestions and their time whenever needed. I would also like to acknowledge my parents, my brothers Prashant and Bittoo, and my friends Shubhankar, Navneet, Jaspreet, and Sayali for their constant support throughout my master's. Moreover, I would like to thank BGSU for the opportunity and the financial support.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
1.1 Metagenomics
1.2 Metagenome Analytical Techniques
  1.2.1 Marker gene analysis
  1.2.2 Metagenome Binning
  1.2.3 Metagenome Assembly
1.3 Genomic Signature Sequences
1.4 Compositional Characteristics of Signature Sequences
  1.4.1 GC content
  1.4.2 Amino acid content
1.5 Quantitative Analysis of Metagenome

CHAPTER 2 MATERIALS AND METHODS
2.1 Database Creation
2.2 Extraction of k-mers in Genomes
2.3 Identification of Common k-mers
2.4 Extraction of Unique k-mers
2.5 Simulation and Quantitative Analysis of Metagenome

CHAPTER 3 RESULTS AND DISCUSSION
3.1 Database Creation
3.2 Extraction of k-mers
3.3 Identification of Common k-mers
3.4 Extraction of Unique k-mers
3.5 Analysis of Simulated Metagenome
3.6 Conclusion

REFERENCES

LIST OF FIGURES

1.1 Limitations correlated with the quantification of microbes in a metagenome sample
1.2 Metagenome analytical strategies
1.3 Analysis of a metagenome DNA fragment by MEGAN
1.4 Analysis of three different metagenome samples by unsupervised techniques
1.5 Kraken classification algorithm
1.6 Study of the 4-mer spectrum obtained by genomic signatures
2.1 Flow chart: Steps involved in the metagenome analysis
3.1 Correlation between k-mer length and its uniqueness
3.2 Comparison of the unique k-mers extracted by utilizing single genomes vs all genomes for reporter species
3.3 Analysis of the simulated metagenome for data volume 6e9 and sequence deviation 0.0
3.4 Analysis of the simulated metagenome for data volume 6e10 and sequence deviation 0.0

LIST OF TABLES

3.1 Excel sheet comprising organisms' names and their accession numbers
3.2 k-mers extracted from genomes
3.3 Common k-mers of E. coli
3.4 Unique k-mers of E. coli
3.5 Number of common 12-mers at the species level
3.6 Number of unique k-mers for k = 11, 12, and 13
3.7 Composition data for the metagenome simulation

CHAPTER 1 INTRODUCTION

1.1 Metagenomics

Microbes carry out numerous life processes on Earth and are involved in essential ecosystem services (Arrigo, 2005; Van der Heijden et al., 2008). Microorganisms account for a major proportion of the biomass of the biosphere and contribute to many processes in ecology, agriculture, food production, and medicine. By a rough estimate, more than 99 percent of microbes are not readily cultivated in laboratories (Schloss and Handelsman, 2005). However, DNA sequencing and biocomputing help in exploring the genetic diversity of uncultured microbes. Metagenomics is the culture-independent study of microbial communities sampled directly from the environment. It plays a critical role in the identification and characterization of the uncultivated microbes of the ecosystem.
Additionally, it provides information about biomarkers of the functional states of the ecosystem (Erickson et al., 2012). A genetic marker approach can therefore be used to investigate the functions and physiology of uncultured microbes (Frias-Lopez et al., 2008; Gilbert et al., 2008).

Before shotgun metagenomics, uncultured microbes were characterized by amplicon sequencing, which involves extracting DNA from cells and then using PCR to amplify a taxonomically informative genomic marker common to all organisms of interest. The amplicons were then characterized bioinformatically to determine the diversity and relative abundance of the microbes in a sample. The 16S rRNA gene, an informative marker for taxonomic and phylogenetic classification, was used to characterize the microbes in a sample. 16S rRNA profiling provides information about microbial diversity and relative abundance across different environmental conditions. These observations have generated data for studying host-microbe relationships and microbiota-based disease mechanisms.

However, sequencing errors and incorrectly assembled amplicons (chimeras) produce artificial sequences that make identification difficult (Wylie et al., 2012), and biases associated with PCR result in failure to detect a substantial fraction of the microbes in a community (Sharpton et al., 2011). Moreover, amplicon sequencing does not answer questions about the biological functions associated with these taxa, and the 16S locus can be transferred between distantly related organisms by horizontal gene transfer, which leads to overestimation of the microbes in a sample (Acinas et al., 2004). Finally, amplicon sequencing limits analysis to taxa for which informative markers are already classified. Novel microbes, particularly viruses, are therefore difficult to study by amplicon sequencing (Edwards, 2005).
Hence, to account for the contribution of uncultured microbes in the environment, shotgun metagenomics offers a paradigm shift in microbiology that overcomes the limitations of amplicon sequencing. Rather than targeting a specific genomic locus for amplification, shotgun metagenomics shears all the DNA into small fragments and sequences them independently. This produces several million reads, which can then be used for either taxonomic or functional studies. Novel enzymes such as bacterial laccases (Ausec et al., 2011), chitinases (Hjort et al., 2010), dioxygenases (Zaprasis et al., 2010), hydrogenases (Schmidt et al., 2010), and hydrazine oxidoreductases (Li et al., 2010) have been discovered through metagenomics. Metatranscriptomics provides information about the activity of genes in the environment. Combined with metatranscriptomics, metagenomics provides information about variations in genes, functional genomic linkages, and evolutionary profiles of microorganisms (Thomas et al., 2012).

Although shotgun metagenomics offers many benefits, metagenome data present myriad challenges. Foremost are the quantity and complexity of the data, which make it cumbersome to assign each read to its corresponding genome in a metagenome sample. Fortunately, software development has increased the ease and efficiency of metagenome analysis, although the vast amount of data still challenges the computational power of programs. Moreover, the abundance of some communities is not represented by reads alone; therefore, it is difficult to detect the abundance of rare species (Schloss and