QUANTITATIVE ANALYSIS OF MICROBIAL IN A METAGENOME BASED ON THEIR SIGNATURE SEQUENCES

Pooja Yadav

A Thesis

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

August 2017

Committee:

Xu, Zhaohui, Advisor

Mckay, Robert Michael

Roy, Sankardas

Copyright © August 2017 Pooja Yadav
All rights reserved

ABSTRACT

Xu, Zhaohui, Advisor

Shotgun metagenomics provides a relatively new and powerful approach for characterizing the microbial communities of environmental samples directly, in contrast to conventional techniques that rely on pure cultures. To determine microbial diversity and to understand the role of microbes in an ecosystem, quantitative studies whose values are comparable across different studies and samples are important. We have developed a statistical approach to microbial profiling that encompasses the quantitative characterization and comparison of the relative abundance of microbes in a metagenome sample based on their signature sequences (unique k-mers). We demonstrated the utility of this approach by characterizing and quantifying the relative abundance of microbes in four simulated metagenome samples (Comp_25, Comp_50, Comp_75, and Comp_100). The suffix of each simulated metagenome’s name represents the gene content percentage of the reporter species in that metagenome.

The analysis of the simulated metagenomes at data volumes of 6e9 and 6e10 furnishes information about the abundance of species by identifying the unique k-mers (signature sequences) of the six reporter species: B. licheniformis, L. brevis, L. fermentum, L. plantarum, P. ananatis, and P. vagans. Our approach efficiently estimated the abundance of four reporter species (B. licheniformis, L. brevis, L. fermentum, and P. ananatis), whereas two species, L. plantarum and P. vagans, were overestimated in the simulated metagenomes. Therefore, the application of advanced statistics, refinement of the algorithm, and an increase in data volume will be our next steps to improve the accuracy with which our approach estimates the ratio of species in a metagenome.

ACKNOWLEDGMENTS

To start, I would like to acknowledge the support of my advisor, Dr. Zhaohui Xu, who constantly encouraged and motivated me to learn new skills, such as Python coding, during this project. I truly appreciate all her guidance at the personal and professional levels. I would also like to thank my committee members, Dr. Robert Michael McKay and Dr. Sankardas Roy, for their valuable suggestions and their time whenever needed. I would also like to acknowledge my parents, my brothers Prashant and Bittoo, and my friends Shubhankar, Navneet, Jaspreet, and Sayali for their constant support throughout my master's studies. Moreover, I would like to thank BGSU for the opportunity and the financial support.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
1.1 Metagenomics
1.2 Metagenome Analytical Techniques
1.2.1 Marker gene analysis
1.2.2 Metagenome Binning
1.2.3 Metagenome Assembly
1.3 Genomic Signature Sequences
1.4 Compositional Characteristics of Signature Sequences
1.4.1 GC content
1.4.2 Amino acid content
1.5 Quantitative Analysis of Metagenome

CHAPTER 2 MATERIALS AND METHODS
2.1 Database Creation
2.2 Extraction of k-mers in Genomes
2.3 Identification of Common k-mers
2.4 Extraction of Unique k-mers
2.5 Simulation and Quantitative Analysis of Metagenome

CHAPTER 3 RESULTS AND DISCUSSION
3.1 Database Creation
3.2 Extraction of k-mers
3.3 Identification of Common k-mers
3.4 Extraction of Unique k-mers
3.5 Analysis of Simulated Metagenome
3.6 Conclusion

REFERENCES

LIST OF FIGURES

1.1 Limitations correlated with the quantification of microbes in a metagenome sample
1.2 Metagenome analytical strategies
1.3 Analysis of a metagenome DNA fragment by MEGAN
1.4 Analysis of three different metagenome samples by unsupervised techniques
1.5 Kraken classification algorithm
1.6 Study of the 4-mer spectrum obtained by genomic signatures
2.1 Flow chart: Steps involved in the metagenome analysis
3.1 Correlation between k-mer length and its uniqueness
3.2 Comparison of the unique k-mers extracted by utilizing single genomes vs all genomes for reporter species
3.3 Analysis of the simulated metagenome for data volume 6e9 and sequence deviation 0.0
3.4 Analysis of the simulated metagenome for data volume 6e10 and sequence deviation 0.0

LIST OF TABLES

3.1 Excel sheet comprising organism’s names and their accession numbers
3.2 k-mers extracted from genomes
3.3 Common k-mers of E. coli
3.4 Unique k-mers of E. coli
3.5 Number of common 12-mers at the species level
3.6 Number of unique k-mers for k = 11, 12, and 13
3.7 Composition data for the metagenome simulation

CHAPTER 1 INTRODUCTION

1.1 Metagenomics

Microbes carry out numerous life processes on Earth and are involved in essential ecosystem services (Arrigo, 2005; Van der Heijden et al., 2008). Microorganisms account for a major proportion of the biomass of the biosphere and contribute to many processes in ecology, agriculture, food production, and medicine. A rough estimate states that more than 99 percent of microbes are not readily cultivated in laboratories (Schloss and Handelsman, 2005). However, DNA sequencing and biocomputing help in exploring the genetic diversity of uncultured microbes. Metagenomics is the culture-independent study of microbial communities sampled directly from the environment. It plays a critical role in the identification and characterization of the uncultivated microbes of an ecosystem. Additionally, it furnishes information about biomarkers of the functional states of the ecosystem (Erickson et al., 2012). Therefore, a genetic marker approach can be utilized to delve into the functions and physiology of uncultured microbes (Frias-Lopez et al., 2008; Gilbert et al., 2008). Before shotgun metagenomics, the characterization of uncultured microbes was performed by amplicon sequencing, which involves extraction of DNA from cells followed by PCR amplification of a taxonomically informative genomic marker common to all organisms of interest. The amplicons were then characterized bioinformatically to determine the diversity and relative abundance of the microbes in a sample. The 16S rRNA gene, which is an informative marker for phylogenetic classification, was utilized to characterize the microbes in a sample. 16S rRNA profiling furnishes information about microbial diversity and relative abundance across different environmental conditions. These observations have generated data for studying host-microbe relationships and microbiota-based disease mechanisms.
However, sequencing errors and inaccurately assembled amplicons (chimeras) led to the production of artificial sequences, which made identification difficult (Wylie et al., 2012), and biases associated with PCR resulted in a failure to detect a substantial fraction of the microbes in a community (Sharpton et al., 2011). Moreover, amplicon sequencing does not resolve questions about the biological functions associated with these taxa, and the 16S locus can be transferred between distantly related organisms by horizontal gene transfer, which leads to the overestimation of microbes in a sample (Acinas et al., 2004). Finally, amplicon sequencing limits the analysis to taxa for which informative markers are already classified. Therefore, novel microbes, particularly viruses, are difficult to study using amplicon sequencing (Edwards, 2005). Hence, to acknowledge the contribution of uncultured microbes in the environment, shotgun metagenomics provides a paradigm shift in microbiology that overcomes the limitations of amplicon sequencing. Instead of targeting a specific genomic locus for amplification, shotgun metagenomics shears all the DNA into tiny fragments and sequences them independently. This produces several million reads, which can be subjected to either taxonomic or functional studies. Novel enzymes such as bacterial laccases (Ausec et al., 2011), chitinases (Hjort et al., 2010), dioxygenases (Zaprasis et al., 2010), hydrogenases (Schmidt et al., 2010), and hydrazine oxidoreductases (Li et al., 2010) were discovered through metagenomics. Metatranscriptomics provides information about the activity of genes in the environment. Metagenomics accompanied by metatranscriptomics provides information about variations in genes, functional genomic linkages, and evolutionary profiles of microorganisms (Thomas et al., 2012).
Although shotgun metagenomics provides various benefits, metagenome data present myriad challenges. Foremost are the quantity and complexity of metagenome data, which make it cumbersome to assign a read to its corresponding genome in a metagenome sample. Fortunately, software development has increased the ease and efficiency of metagenome analysis, although the vast amount of data still challenges the computational power of programs. Moreover, the abundance of some communities is not represented by the reads alone; it is therefore difficult to detect the abundance of rare species (Schloss and Handelsman, 2003; Sharpton et al., 2011). Furthermore, host DNA in a metagenome sample can dominate the microbial community DNA; hence, before sequencing, selective enrichment of the microbial DNA should be performed using molecular techniques. Metagenomes may also contain DNA contamination, which bioinformatic quality control approaches can be used to remove. After quality control, the high-scoring reads are used to infer the compositional characteristics of a metagenome, either by reference-based classification, which can overestimate the abundance of known species, or by metagenome assembly, which helps in the classification of novel genes but overestimates the abundance of abundant taxa in a metagenome sample (Figure 1.1).

Figure 1.1: Limitations correlated with the quantification of microbes in a metagenome sample (Nayfach and Pollard, 2016).

1.2 Metagenome Analytical Techniques

Metagenome data can be used to explore either taxonomic diversity or functional features. Quantification of microbial diversity in a metagenome is one of the primary ways to characterize the microbial community. It includes information about species richness and abundance in the sample, and it also furnishes some insight into the biological functions of the microbial community. Taxonomic diversity can be quantified by utilizing taxonomically informative marker genes, by assembling metagenomic reads, and by binning techniques, which classify the reads into defined taxa. On the other hand, exploring the functional characteristics associated with metagenomic reads involves identifying protein-coding sequences and comparing the coding reads to a protein database. The resulting profile can be utilized to compare the functional composition of communities (Looft et al., 2012). It also allows a specific environment to be correlated with its microbial community, and it supports the discovery of novel genes and provides insight into the environmental conditions associated with those genes (Sharpton, 2014) (Figure 1.2).

Figure 1.2: Metagenome analytical strategies: Who is there represents diversity classification; What are they doing represents prediction of genes and their functions (Sharpton, 2014).

1.2.1 Marker gene analysis

Marker gene analysis involves the identification of taxonomically informative genes (marker genes) to classify the reads of a metagenome sample. Most marker gene-based programs, such as multilocus sequence typing (MLST) (Mahenthiralingam et al., 2006), the automated pipeline for phylogenomic analysis (AMPHORA) (Wu and Eisen, 2008), 16S rRNA (Langille et al., 2013), and MLTreeMAP (Stark et al., 2010), use housekeeping genes to study phylogenetic diversity. These programs read a set of marker genes, such as 16S rRNA genes (DeSantis et al., 2006), polyketide synthases (PKS) and peptide synthetases (Courtois et al., 2003), and rpoB (Case et al., 2007), and correlate them to possible operational taxonomic units (OTUs) to furnish phylogenetic information. However, marker gene analysis has several limitations. First, the small fraction of a metagenome that is homologous to marker genes does not provide an accurate sampling of the taxonomic diversity of a community; moreover, some taxa have not yet been discovered or sequenced. Second, the accuracy of marker gene analysis depends on the marker’s properties, which vary from marker to marker. Therefore, overcoming the limitations of the marker gene approach requires expanding the genetic diversity that the markers represent.

1.2.2 Metagenome Binning

Binning is defined as the assignment of OTUs and the grouping of contigs based on sequence similarity, sequence composition, or read coverage (Droge et al., 2012). Binning provides information about novel genes along with the diversity of the microbial community in a metagenome sample. It also helps in reducing the complexity of a metagenome sample. However, metagenome samples consist of short DNA reads, and binning short scaffolds into OTUs is a cumbersome task. Metagenome binning approaches are categorized into three main classes: similitude search techniques, unsupervised compositional techniques, and supervised compositional techniques.

(a) Similitude Search-Based Binning Technique

In the similarity search-based binning technique, large molecular sequences are utilized for sequence alignment. A homology search for an unknown genomic fragment is carried out against a reference database, and the public databases serve as the reference for this search. The technique provides a high ratio of true positives if sequences homologous to the query exist in the searched databases. Furthermore, the homology search can utilize DNA sequences as well as protein domains; examples include MEGAN (MEtaGenome ANalyser) (Huson et al., 2007), MG-RAST (Metagenomics-RAST) (Glass et al., 2010), SOrt-ITEMS (Sequence ORTholog based approach for binning and Improved Taxonomic Estimation of Metagenomic Sequences) (Monzoorul et al., 2009), and MetaPhyler (Liu et al., 2011). MEGAN utilizes BLAST for the search, and a lowest common ancestor algorithm assigns the reads to taxa (Figure 1.3). However, it does not target phylogenetic markers and provides sufficient results only when the metagenome fragments have a close relative in the reference databases. To infer the taxonomy of reads, MG-RAST exploits the phylogenomic reconstruction of database sequences, whereas MetaPhyler relies on a list of 31 phylogenetic markers as a taxonomic reference (Wu and Eisen, 2008).

Figure 1.3: Analysis of a metagenome DNA fragment by MEGAN (Huson et al., 2007).

(b) Unsupervised Binning Techniques

Supervised binning techniques and similarity search-based methods rely upon marker genes and the reference genomes of known organisms (Yang et al., 2010). Therefore, these methods are limited to a small portion of microorganisms or to their close relatives; in addition, current sequencing platforms lack accuracy and precision for short sequencing reads, and the instability of the marker genes introduces errors. Moreover, some species possess multiple markers, which results in incorrect classification of the reads. Furthermore, a major part of a metagenome sample consists of anonymous and non-sequenced genomes, which makes taxonomic classification a cumbersome process. Hence, unsupervised methods can be exploited to study novel organisms, as these methods do not require trained models to identify novel species in a metagenome sample.

Unsupervised binning techniques exploit algorithms to project high-dimensional metagenome data into a low-dimensional space. Machine learning-based methods (Phymm, in combination with Markov models) and support vector machine methods (PhyloPythia) can provide better resolution for unknown reads at the genus level (Brady and Salzberg, 2009). Similarly, other methods such as COGs (Clusters of Orthologous Groups), KEGG maps, operons, and COG functional categories were exploited by a group of researchers to study the unknown reads of metagenome samples collected from soil, a whale fall, and the Sargasso Sea (Tringe et al., 2005) (Figure 1.4).

Figure 1.4: Comparison of soil, whale fall, and Sargasso Sea samples by COGs, operons, KEGG processes, and COG functional categories. The relative abundance of an item in the respective sample is represented by the dots, and the level of enrichment is shown by the proximity to the vertex (Tringe et al., 2005).

Gregory Dick and his colleagues used self-organizing maps (SOM) to study two acidophilic communities and unidentified fragments of various strain variants. They observed more than 90% binning accuracy for fragments of 5 Kb or larger. For fragments smaller than 5 Kb, accuracy remained the same but sensitivity declined (Dick et al., 2009). Similarly, Chan and colleagues used a growing self-organizing map (GSOM) instead of SOM, which increased the binning speed by 7-15% (Chan et al., 2008). Moreover, the hyperbolic self-organizing map (HSOM) clusters short DNA fragments of 0.2-50 Kb from a metagenome in two-dimensional maps. The researchers reported that longer sequences, i.e. those greater than 10 Kb, are easier to classify and are classified more accurately than 1 Kb sequences (Martin et al., 2008).

(c) Supervised Compositional Binning Techniques

Supervised compositional binning models compare genomic signature sequences. Genomic signature sequences are defined as taxon-specific, unique nucleotide configurational characteristics in a genome. Models based on this approach provide a more accustomed representation and reduce the computational load. The supervised binning method consists of a derived and adapted method combined with machine learning techniques to identify the unknown sequences in a metagenomic sample (Wood and Salzberg, 2014). The programs based on similarity binning techniques rely on the recognition of unique signature sequences to identify the microbes in a shotgun metagenome sample.

Most of these programs create their own datasets, which are smaller than the public genome databases; this makes classification much faster than with similarity-based methods. The foundation of the supervised binning method is the classification of the ‘genes’ or ‘genomic signature sequences’ in the database. However, these databases contain information only about sequenced organisms; thus, these methods restrict classification to close relatives of the reference genomes. The marker gene approach led to the foundation of various programs, such as Kraken, MetaID, Neptune, and STAMP (Sequence tag-based analysis). Programs like Kraken perform exact matching of k-mers (a k-mer is a subsequence of fixed length ‘k’ of a nucleotide sequence) against the database by utilizing the lowest common ancestor (LCA) algorithm, followed by selection of the highest-weighted path to classify reads in a metagenome sample (Wood and Salzberg, 2014) (Figure 1.5), whereas MetaID assigns a high score to n-grams that are unique. These programs thus make effective use of unique and common n-grams for species identification (Srinivasan et al., 2013). Meanwhile, Neptune considers both exact k-mer matches, for speed, and k-mer mismatches, for accuracy (Marinier et al., 2015).
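The LCA-plus-weighted-path idea attributed to Kraken above can be illustrated with a toy sketch. This is not Kraken's implementation: the taxonomy (`PARENT`), the k-mer table (`KMER_TO_LCA`), and the function names are hypothetical, and ties between equally weighted paths are simply broken by first occurrence here.

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "Lactobacillus": "root",
    "L. brevis": "Lactobacillus",
    "L. fermentum": "Lactobacillus",
}

# Hypothetical database: each k-mer maps to the lowest common ancestor
# of all reference genomes that contain it.
KMER_TO_LCA = {"ACGT": "L. brevis", "CGTA": "Lactobacillus", "GGGG": "L. fermentum"}


def path_to_root(taxon):
    """Return the list of taxa from `taxon` up to the root."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def classify_read(read, k=4):
    """Assign a read to the taxon at the end of the highest-weighted path."""
    hits = Counter()
    for start in range(len(read) - k + 1):
        taxon = KMER_TO_LCA.get(read[start:start + k])
        if taxon is not None:
            hits[taxon] += 1
    if not hits:
        return None  # no k-mer of the read is in the database
    # Weight of a path = number of k-mer hits on the taxon and all its ancestors.
    scores = {t: sum(hits[a] for a in path_to_root(t)) for t in hits}
    return max(scores, key=scores.get)
```

For example, `classify_read("ACGTA")` sees one hit on "L. brevis" and one on its ancestor "Lactobacillus", so the path ending at "L. brevis" carries the highest weight.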

Figure 1.5: The Kraken classification algorithm assigns a lowest common ancestor score to each read and chooses the highest-weighted path for the taxonomic classification (Wood and Salzberg, 2014).

1.2.3 Metagenome Assembly

Metagenome assembly involves the generation of longer sequences by merging collinear metagenomic reads from the same genome into a single contig (contiguous sequence) (Sharpton, 2014). Assembly makes bioinformatic analysis easier than the analysis of unassembled short metagenomic reads. Sometimes complete genomes can be assembled, which provides insight into the genetic composition of uncultured microbes (Iverson et al., 2012; Wrighton et al., 2012; Ruby et al., 2013). A limitation of assembly is the production of chimeras, in which reads from different genomes are assembled into a single contig because of the similarity of the two sequences (Luo et al., 2012). For metagenomic assemblies, several factors must be considered, such as the abundance of a taxon, chimera production, and computational requirements. It is therefore difficult to assemble rare genomes without extensive sequencing, which in turn requires high computational power. Most metagenome assembly tools are based on the de Bruijn graph approach to genome assembly (Compeau et al., 2011). For example, Meta-IDBA (Peng et al., 2011) and MetaVelvet (Namiki et al., 2012) use a de Bruijn graph to represent the entire metagenome and utilize the properties of the graph to derive sub-graphs that correspond to genome-specific assemblies. To increase the efficiency and decrease the complexity of de Bruijn graphs, several tools have adopted data reduction strategies, which also decrease the amount of memory and time needed to assemble the reads.
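As a rough illustration of the de Bruijn graph idea mentioned above, the following minimal sketch builds the graph from a handful of reads. It is a textbook construction, not the data structure used by any particular assembler; the function name and the adjacency-list layout are assumptions made for this example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge.

    Returns a dict mapping a prefix (k-1)-mer to the list of suffix (k-1)-mers
    it connects to. Repeated k-mers produce repeated edges.
    """
    graph = defaultdict(list)
    for read in reads:
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

With `reads = ["ACGT"]` and `k = 3`, the k-mers ACG and CGT give the edges AC -> CG and CG -> GT; walking such edges is what lets an assembler spell out a contig.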

1.3 Genomic Signature Sequences

A genomic signature sequence is defined as a string of nucleotide characters found only in the target genome and not in background genomes. Signature sequences are also known as unique k-mers or n-grams. Genomic signature sequences are a useful tool for the classification and quantitative analysis of the microbial community in a metagenome. These unique sequences also provide information about metabolic processes and potential interactions among community members (Frank and Sorensen, 2011). To classify the reads in a metagenome, a set of nucleotide features can be exploited, such as GC content, amino acid content, and codon usage. Oligonucleotide frequency provides another feature for differentiating microbes in a metagenome sample (Diaz et al., 2009), as the characteristic profile of oligonucleotide frequencies is specific to a species. Moreover, the combination of oligonucleotides plays an important role in the biological properties of different genomes and genome portions (Abe et al., 2002). However, it has been observed that genomic signatures are less specific than signature sequences for identifying a read in a metagenome. Yang et al. (2010) described the caveats of the genomic signature approach by producing 4-mer spectra of different DNA fragments taken from E. coli and Lactobacillus. The 4-mer spectra of two DNA fragments from different E. coli genomes showed similar oligonucleotide frequencies, whereas the 4-mer spectra of two DNA fragments from the same E. coli genome showed variation in oligonucleotide frequencies. Moreover, the 4-mer spectra of DNA fragments from E. coli and Lactobacillus showed similar genomic signatures (Figure 1.6). The accuracy provided by signature sequences is the reason why we used the signature sequence (k-mer) approach for our study instead of genomic signatures.
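To make the notion of a 4-mer spectrum concrete, the sketch below computes the relative frequency of every tetranucleotide in a DNA fragment. It is only an illustration of the oligonucleotide-frequency idea discussed above, not the procedure used by Yang et al.; the function name and the choice to ignore windows containing ambiguous bases are assumptions.

```python
from collections import Counter
from itertools import product


def kmer_spectrum(sequence, k=4):
    """Return the relative frequency of every possible k-mer over A, C, G, T.

    Windows that contain any other character are ignored.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in all_kmers}
    return {kmer: counts[kmer] / total for kmer in all_kmers}
```

Two fragments can then be compared by looking at how far apart their spectra are, for instance with a sum of absolute differences over the 256 entries.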

1.4 Compositional Characteristics of Signature Sequences

1.4.1 GC content

Figure 1.6: (A) 4-mer spectrum of DNA fragments from two different E. coli genomes, (B) 4-mer spectrum of DNA fragments from a single E. coli genome, and (C) 4-mer spectrum of DNA fragments from E. coli and Lactobacillus, which belong to the same kingdom but different phyla (Yang et al., 2010).

GC content is defined as the ratio of guanine and cytosine to the whole genome content. Although GC content varies among species, it remains the same within a species. The tree of life spans a wide range of GC content, from 16.5% (Carsonella ruddii) to 75% (Anaeromyxobacter dehalogenans) (Nakabachi et al., 2006). Therefore, in early genomic studies, GC content was considered a signature for taxonomic classification. However, evolutionary processes such as mutation and genetic drift affect the GC content of an organism and vice versa. Moreover, population genetics models have illustrated that GC-biased gene conversion (gBGC) influences selection and intra-genetic recombination (Lassalle et al., 2015). Thus, the single parameter of GC% was not considered a potential signature sequence because of its high variation in response to small environmental changes.
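The GC content definition above can be written as a short calculation. This is a minimal sketch; the function name is hypothetical, and excluding ambiguous bases from the denominator is an assumption of this example rather than something the text specifies.

```python
def gc_content(sequence):
    """Return the fraction of G and C among the unambiguous bases of a sequence."""
    sequence = sequence.upper()
    counted = sum(sequence.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return (sequence.count("G") + sequence.count("C")) / counted
```

For instance, `gc_content("GGCCAATT")` is 0.5, since four of the eight bases are G or C.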

1.4.2 Amino acid content

Since genes do not evolve significantly in terms of amino acids, the predilection of an organism for specific amino acids is conserved across genomes (Sandberg et al., 2003). Therefore, the phylogenetic classification of an organism at the protein level is less error-prone than at the nucleotide level. CARMA is one algorithm that classifies protein domains such as Pfam domains (Krause et al., 2008). Additionally, the constrained changes at the third codon position make amino acid content a suitable candidate for signature sequences (Wu and Eisen, 2008). Moreover, it has been reported that the N-terminus of genes is enriched in rare codons, but the reasons for this codon bias at the N-terminus are unclear (Goodman et al., 2013). Researchers have also reported a 14-fold increase in protein expression in E. coli due to the presence of rare codons at the N-terminus. So, there is a strong relationship between the usage of rare codons and protein expression (Gouy and Gautier, 1982). However, the molecular and chemical landscapes of organisms can be influenced by environmental variations (Moura et al., 2013), which are difficult to track by cloning protein-encoding genes. Furthermore, rare codons are not pervasive across the whole genome; hence, signature sequences at the nucleotide level remain the gold standard.

1.5 Quantitative Analysis of Metagenome

The databases of publicly available genomes, gene sequences, and metagenomes are growing exponentially (Kodama et al., 2012). Therefore, to understand the role of microbes in human biology and in other environmental samples, it is imperative to produce quantitative data whose values are comparable across samples. Shotgun metagenomics and computational analysis furnish quantitative data that are used to study the taxonomic and functional profiles of microbes. Although high complexity and short reads make the classification and estimation of microbes in a metagenome difficult, bioinformatic and statistical studies have provided first-generation tools to identify the reads in a shotgun metagenome sample. A general approach adopted by most programs to quantify organisms in a shotgun metagenome is to identify the sequences by alignment against a reference database of genomes. The counts of the matched reads are then used to compute statistics for estimating the abundance of a taxon. Many of the programs developed so far cannot accurately identify and quantify microbes in a metagenome sample beyond the genus level, as is the case for machine learning methods such as PhyloPythia, Phymm, and PhymmBL (Brady and Salzberg, 2009). Moreover, similarity-based methods such as MEGAN, CARMA, and MetaPhyler cannot identify reads beyond the genus level, with the exception of MetaPhlAn, which can classify reads at the species level (Liu et al., 2011; Huson et al., 2007; Gerlach and Stoye, 2011; Segata et al., 2012). Here, we tried to develop an approach that can distinguish reads at the species level and can estimate the relative ratios of the microbes in the metagenome of a specific environment (Daqu of a fermentation plant). Most metagenome studies have concentrated on community composition, which constitutes information about relative gene abundance.
The relative gene abundance can be correlated with the relative cell abundance of a species in a metagenome. Relative abundance is compositional; therefore, an increase in the number of one taxon leads to a decrease in the proportion of the other taxa in the sample. Our approach estimates the relative ratio of the species in a metagenome by identifying the unique k-mers in the metagenome reads. Although this approach was developed to monitor the ratios of microbes in a metagenome sample of a fermentation plant at certain time points, it can also be used for other metagenome samples. As many laboratories do not have access to supercomputers, we designed our method so that it works with low computational power, and we attempted to accelerate signature sequence identification by restricting it to a specific environment (a fermentation plant) rather than the entire public domain.

CHAPTER 2 MATERIALS AND METHODS

The flow chart represents the steps involved in the identification of unique k-mers at the species level for the analysis of the 4 different simulated metagenomes. (Figure 2.1)

[Flow chart: Database Creation → Fasta Files Extraction → k-mers Detection → species comprising >1 strain: Common k-mers detection → Unique k-mers; species comprising only 1 strain: Unique k-mers → Simulation and Analysis of Metagenomes]

Figure 2.1: Steps involved in the identification and exploitation of signature sequences for the analysis of metagenome.

2.1 Database Creation

The input database contained a list of 562 completely sequenced species that were relevant to a previous study in which a consortium of microbes was used to ferment a type of alcoholic drink (Zhang et al., 2016). An Excel spreadsheet of the listed microbes was constructed, comprising the microbe names along with their accession numbers, total genome sizes, and average species genome sizes. Additionally, the organisms were clustered by genus and species. The spreadsheet was checked by a Python program to ensure its accuracy. The list was last updated in April 2017.

2.2 Extraction of k-mers in Genomes

A Python program was used to retrieve the 562 fasta files of the listed genomes from the NCBI reference database by using their respective accession numbers. Since ambiguous nucleotides in a genomic sequence can hinder the identification of k-mers, the extracted files were screened for ambiguous nucleotides, i.e. nucleotides other than ‘A’, ‘T’, ‘G’, and ‘C’. For instance, the letter ‘N’ represents an unknown base, the letter ‘R’ represents either ‘A’ or ‘G’, and the letter ‘Y’ represents either ‘C’ or ‘T’. Therefore, ambiguous nucleotides were excluded from the identification of k-mers. The first step of k-mer extraction was the detection of k-mers in each individual organism’s genome, irrespective of taxonomic level. Next, the common k-mers of species comprising multiple strains were retrieved, followed by the identification of the species k-mers from the common k-mer files. For species comprising a single strain, however, the species k-mers were identified directly from their genomes (Figure 2.1). The main aim was to detect unique k-mers numbering in the thousands at the species level.
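The screening for ambiguous nucleotides described above could look roughly like the following sketch. It is not the thesis program: the function names are hypothetical, and skipping every window that overlaps an ambiguous base is one possible reading of how such bases were exempted.

```python
VALID_BASES = set("ACGT")


def count_ambiguous(sequence):
    """Count the characters that are not A, C, G, or T (for example N, R, Y)."""
    return sum(1 for base in sequence.upper() if base not in VALID_BASES)


def iter_kmers(sequence, k):
    """Yield every k-mer of `sequence` that consists only of A, C, G, and T."""
    sequence = sequence.upper()
    run_length = 0  # consecutive unambiguous bases ending at the current position
    for position, base in enumerate(sequence):
        run_length = run_length + 1 if base in VALID_BASES else 0
        if run_length >= k:
            yield sequence[position - k + 1:position + 1]
```

For the sequence "ACGTNAC" and k = 3, only ACG and CGT are produced; every window touching the ‘N’ is dropped, and the two bases after it are too few to form a 3-mer.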

Let S be the database of genome sequences and let $s_i$ be the length of the genome of an organism $O_i$ in S, where every position of the genome is one of the four nucleotides ‘A’, ‘T’, ‘G’, and ‘C’. The total number of k-mers that can be obtained from this genome is

Total number of k-mers = $(s_i - k + 1)$ (2.2.1)

The total number of possible k-mers of length x over ‘A’, ‘T’, ‘G’, and ‘C’ is $4^{x}$. The total size of the database is 2.36 GB. Because DNA is a double-stranded molecule, the number of possible k-mers was calculated by using the equation given below.

$4^{x} = 2 \times 2.36$ (2.2.2)

By solving this, x comes out to 14, which means that, statistically, every 14-mer in our database is unique. Therefore, 14 is the upper limit of our k-mer size. A Python code was used to retrieve k-mers from the genome of each organism by using their respective accession numbers. The code processes each sequence and its complementary strand for the identification of k-mers. The input files were text files comprising the accession numbers grouped at the species level, and the output files were in CSV (comma-separated values) format, containing the k-mers from the genomes along with their corresponding frequencies. The k-mer files of species with a single strain were named by their accession numbers, whereas the k-mer files of species consisting of multiple strains were grouped together for the identification of common k-mers.
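The extraction step can be sketched as follows. This is an illustrative sketch, not the code used in the study: the function names and the CSV column names are hypothetical, and the second strand is assumed here to be read as the reverse complement. Each strand of a genome of length $s_i$ contributes $(s_i - k + 1)$ windows, as in equation 2.2.1.

```python
import csv
import io
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Assumption: the second strand is read as the reverse complement."""
    return seq.translate(COMPLEMENT)[::-1]

def count_kmers(sequence, k):
    """Count k-mers on both strands, skipping windows with ambiguous bases."""
    counts = Counter()
    for strand in (sequence, reverse_complement(sequence)):
        for i in range(len(strand) - k + 1):   # (s_i - k + 1) windows per strand
            kmer = strand[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts

def write_kmer_csv(counts, handle):
    """Write one 'kmer,count' row per k-mer (column names are hypothetical)."""
    writer = csv.writer(handle)
    writer.writerow(["kmer", "count"])
    for kmer, n in sorted(counts.items()):
        writer.writerow([kmer, n])

toy_genome = "ACGTACGTTA"          # toy sequence, not a real genome
counts = count_kmers(toy_genome, 4)
buffer = io.StringIO()
write_kmer_csv(counts, buffer)
print(buffer.getvalue())
```

Keeping the counts in a `Counter` keyed by k-mer mirrors the k-mer/frequency layout of the CSV output files described above.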

2.3 Identification of Common k-mers

Following the extraction of k-mers across all the genomes, the common k-mers were detected in the species comprising multiple strains. The input files for the identification of common k-mers were the extracted k-mer files grouped at the species level. The common k-mer files for species with multiple strains were named by the first letter of the genus followed by the full species name; for instance, the common k-mer file of the Bacillus licheniformis strains was named B licheniformis.
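A minimal sketch of this step is given below, assuming that each strain is represented by a k-mer count table and that a common k-mer is one present in every strain of the species. The function name and the toy data are hypothetical. The average copy number is computed across strains.

```python
from collections import Counter

def common_kmers(strain_counts):
    """Return {kmer: mean copy number} for k-mers present in every strain.

    strain_counts: list of Counter objects, one per strain of a species.
    """
    shared = set(strain_counts[0])
    for counts in strain_counts[1:]:
        shared &= set(counts)          # keep only k-mers seen in all strains so far
    n_strains = len(strain_counts)
    return {kmer: sum(c[kmer] for c in strain_counts) / n_strains
            for kmer in shared}

# Toy k-mer tables for two hypothetical strains of one species.
strain_a = Counter({"AAAA": 1, "CCCC": 2})
strain_b = Counter({"AAAA": 3, "GGGG": 1})
print(common_kmers([strain_a, strain_b]))
```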

2.4 Extraction of Unique k-mers

Unique k-mers are defined as the distinguishing nucleotide sequences that are found only in the target genomes. The selection of unique k-mers is crucial for the quantitative analysis, as unique k-mers provide information about gene abundance by identifying the reads in a shotgun metagenome sample. The primary operation was a cross-comparison of all the k-mers at the species level and the extraction of the unique k-mers corresponding to each species. To perform this task, a Python code was developed that extracted the unique k-mers and assessed their uniqueness by treating all other k-mers as the background.

2.5 Simulation and Quantitative Analysis of Metagenome

Quantitative analysis involves the comparison of the relative abundance of the unique k-mers corresponding to each species in a metagenome sample. The reporter species are defined as the species whose unique k-mers were used to classify reads in the simulated metagenomes. The reporter species considered for the metagenome simulation were B. licheniformis, L. brevis, L. fermentum, L. plantarum, P. ananatis, and P. vagans. They were selected by considering the abundance of different species in the MH daqu stage of a fermentation plant, and only completely sequenced species having unique k-mers in the range of thousands were selected as reporter species. For profiling the relative abundance of the reporter species, the unique k-mers were identified in reads of length 150 bp in 4 different simulated shotgun metagenome samples. Table 3.7 shows the gene content of the reporter species for the four simulated metagenomes, i.e. Comp_25, Comp_50, Comp_75, and Comp_100, whereas the theoretical ratio was considered as the control for the simulated metagenomes. The following steps explain the complete process of the simulation and analysis of the metagenome.

(a) The first step of the program included the reading of the composition file which comprised the name of the reporter species along with their accession numbers and the percentage of the DNA to be considered for the simulation of the metagenome sample.

(b) The second step comprised reading of the unique k-mers of the reporter species and construction of a Python dictionary which contained the k-mers, the species name, the frequency of k-mers occurrence, sample counts, and background counts.

(c) The third step involved reading the fasta files of the genomes and generating random reads of length 150 bp.

(d) The fourth step searched for the unique k-mers in the generated reads of both the reporter and the background genomes. The information associated with each unique k-mer, such as copy numbers, was then tabulated at the species level.

(e) The final step was the analysis of the metagenome sample, which determined the abundance of reporter species reads by using the unique k-mers in the simulated metagenomes. The ratio of each reporter genome was obtained by dividing the total number of unique k-mers detected for that reporter species by the sum total of all the unique k-mers detected in all the randomly generated reads of the metagenome sample.
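Steps (a) to (e) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the program used in this study: all names and the toy genomes are hypothetical, reads are drawn from the forward strand only, the number of reads per genome is taken to be proportional to its share in the composition, and copy-number and background-count bookkeeping is omitted.

```python
import random
from collections import Counter

def random_reads(genome, n_reads, read_len, rng):
    """Yield n_reads substrings of length read_len from random start positions."""
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        yield genome[start:start + read_len]

def estimate_ratios(genomes, composition, unique_kmers, k, total_reads,
                    read_len=150, seed=0):
    """Estimate reporter ratios from unique k-mer hits in simulated reads."""
    rng = random.Random(seed)
    # Step (b): map every unique k-mer to the reporter species that owns it.
    kmer_owner = {kmer: species
                  for species, kmers in unique_kmers.items()
                  for kmer in kmers}
    hits = Counter()
    total_share = sum(composition.values())
    for name, share in composition.items():
        # Step (c): reads per genome proportional to its share (assumption).
        n_reads = round(total_reads * share / total_share)
        for read in random_reads(genomes[name], n_reads, read_len, rng):
            # Step (d): look up every k-mer window of the read.
            for i in range(len(read) - k + 1):
                owner = kmer_owner.get(read[i:i + k])
                if owner is not None:
                    hits[owner] += 1
    # Step (e): hits per species divided by all unique k-mer hits.
    total_hits = sum(hits.values())
    if total_hits == 0:
        return {}
    return {species: hits[species] / total_hits for species in unique_kmers}

# Toy data: two "reporter" genomes and one "background" genome.
toy_genomes = {"sp1": "ACGT" * 50, "sp2": "GGCC" * 50, "background": "AT" * 100}
toy_composition = {"sp1": 2, "sp2": 1, "background": 1}   # step (a)
toy_unique = {"sp1": {"ACGT"}, "sp2": {"GGCC"}}
ratios = estimate_ratios(toy_genomes, toy_composition, toy_unique,
                         k=4, total_reads=400, read_len=20)
print(ratios)
```

In this toy setting the two reporters were mixed at 2:1, so the estimated share of sp1 should come out close to two thirds. With real genomes the estimate also depends on how many unique k-mers each species has and how often they occur.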

CHAPTER 3 RESULTS AND DISCUSSION

3.1 Database Creation

An Excel spreadsheet was constructed comprising the names of the organisms, accession numbers, genome sizes, and the level of assembly of the sequenced genomes. In total, 562 organisms were classified into 93 species and 20 genera. Thirty-nine of the 93 species comprised multiple strains (genomes). Table 3.1 shows a few of the organisms along with their accession numbers as documented in the Excel spreadsheet.

Table 3.1: List of a few organisms with their accession numbers clustered at the species level

| Organisms | Accession Numbers |
| --- | --- |
| Bifidobacterium asteroides | CP003325.1 |
| Bifidobacterium thermophilum | NC_020546.1 |
| Clostridium kluyveri | NC_009706.1, NC_011837.1 |
| Clostridium beijerinckii | NC_009617.1, NZ_CP006777.1, NZ_CP010086.2 |
| Syntrophomonas wolfei | NC_008346.1 |
| Tepidanaerobacter acetatoxydans | NC_019954.2 |
| Weissella cibaria | NZ_CP012873.1 |
| Virgibacillus halodenitrificans | NZ_CP017962.1 |
| ...... | ...... |

3.2 Extraction of k-mers

The input for this step was the genome fasta files, and the output was the k-mer files in CSV format, containing the nucleotide sequences of the k-mers and their respective frequencies. In total, 562 k-mer files were extracted, and the 508 files belonging to species with multiple strains were then grouped at the species level. As an example, a list of the first 15 k-mers extracted from one of the E. coli strains (accession number AE014075.1) is shown in Table 3.2.

Table 3.2: k-mers extracted from E. coli (accession number AE014075.1)

| 12-mer | Number of 12-mers |
| --- | --- |
| AGCTTTTCATTC | 1 |
| GCTTTTCATTCT | 1 |
| CTTTTCATTCTG | 1 |
| TTTTCATTCTGA | 3 |
| TTTCATTCTGAC | 4 |
| TTCATTCTGACT | 1 |
| TCATTCTGACTG | 3 |
| CATTCTGACTGC | 3 |
| ATTCTGACTGCA | 2 |
| TTCTGACTGCAA | 2 |
| TCTGACTGCAAC | 3 |
| CTGACTGCAACG | 3 |
| TGACTGCAACGG | 2 |
| GACTGCAACGGG | 2 |
| ...... | ...... |

3.3 Identification of Common k-mers

In our study, 39 of 93 species comprised multiple strains, so common k-mers were detected in those 39 species. The 508 k-mer files grouped at the species level served as the input for the extraction of common k-mers. Each output file comprised the k-mers common to all strains of a specific species. Table 3.3 lists the first 15 common k-mers of E. coli together with their average copy numbers across all strains.

Table 3.3: Common k-mers of E. coli

| Common 12-mers | Average copy number of the common 12-mers |
| --- | --- |
| GCTTTTCATTCT | 1.155689 |
| CTTTTCATTCTG | 1.101796 |
| TTTTCATTCTGA | 2.730539 |
| TTTCATTCTGAC | 1.946108 |
| TCATTCTGACTG | 3.479042 |
| CATTCTGACTGC | 2.772455 |
| TTCTGACTGCAA | 2.023952 |
| GACTGCAACGGG | 1.988024 |
| ACTGCAACGGGC | 2.203593 |
| TGCAACGGGCAA | 3.730539 |
| GCAACGGGCAAT | 4.580838 |
| AACGGGCAATAT | 3.413174 |
| ACGGGCAATATG | 2.149701 |
| CGGGCAATATGT | 3.197605 |
| ...... | ...... |

3.4 Extraction of Unique k-mers

The files used to extract the unique k-mers at the species level consisted of the common k-mer files of 39 species (multiple strains) and the k-mer files of 54 species (single strains). To obtain unique k-mers in the range of thousands, the k-mer length was optimized. With k=13 and 14, the number of unique k-mers obtained was far above this range, reaching more than a hundred thousand for some organisms (Table 3.6), because increasing the k-mer length by even one nucleotide increases the uniqueness exponentially for small genome sizes (1-6 Mbp) (Figure 3.1), whereas for k=10 and 11 the number of unique k-mers was too small (Table 3.6). So, the k-mer length was set to 12 nucleotides to obtain a sufficient number of distinct unique k-mers for each species. Table 3.4 lists 15 unique k-mers of E. coli and their average copy numbers.

Table 3.4: Unique k-mers of E. coli

| Unique 12-mers | Average copy number of the unique 12-mers |
| --- | --- |
| CGACTTACGTGT | 1 |
| ACTTACGTGTCT | 1.011976 |
| CGCACTATAGGT | 1 |
| CGCGTAACTATG | 1.299401 |
| CGATACGTACCC | 1.832335 |
| CTCAAGTCTGGG | 1.005988 |
| TTAGTTATCGAG | 1.017964 |
| TACACACTCGGG | 1.107784 |
| TACGCTAGTGTT | 1.017964 |
| ACGTAGTGACTG | 1.934132 |
| TGGTTTAGTCCG | 2.91018 |
| GGTCCGTACTTA | 1 |
| ACGAGTAAGTGT | 1 |
| TAGCACTGAGTG | 1.017964 |
| ...... | ...... |
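The cross-comparison that yields species-level unique k-mers such as those in Table 3.4 can be sketched as follows. This is an illustrative sketch with hypothetical names and toy data, assuming that a k-mer is unique to a species when it occurs in the k-mer set of no other species in the database.

```python
from collections import Counter

def unique_kmers(species_kmers):
    """Return {species: set of k-mers that occur in no other species}.

    species_kmers: {species: set of k-mers}, i.e. the common k-mers for
    multi-strain species or the genome k-mers for single-strain species.
    """
    owners = Counter()
    for kmers in species_kmers.values():
        owners.update(kmers)           # number of species containing each k-mer
    return {species: {kmer for kmer in kmers if owners[kmer] == 1}
            for species, kmers in species_kmers.items()}

# Toy k-mer sets for three hypothetical species.
toy_sets = {
    "species_A": {"AAAA", "ACGT", "TTTT"},
    "species_B": {"ACGT", "CCCC"},
    "species_C": {"TTTT", "GGGG"},
}
print(unique_kmers(toy_sets))
```

Counting owners in a single pass avoids comparing every pair of species directly, which matters when each species contributes millions of 12-mers.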

However, at this stage we also observed an interesting correlation between the number of strains (genomes) in a species and the abundance of its unique k-mers. The number of unique k-mers declined as the number of strains (genomes) in a species increased. A possible explanation is that an increase in the number of genomes (strains) leads to a decrease in the number of common k-mers in that species, which in turn leaves fewer k-mers that can be unique at the species level (Figure 3.2).

Figure 3.1: Uniqueness of the k-mer increases with the increase in the k-mer length (Greenfield & Roehm, 2013)

Figure 3.2: The number of the unique k-mers extracted by utilizing a single genome (strain) vs utilizing all the available genomes (strains) at the species level.

The common k-mers and the unique k-mers at the species level are summarized in Table 3.5 and Table 3.6, respectively.

Table 3.5: Number of common 12-mers at the species level

| Organisms | Number of Common 12-mers |
| --- | --- |
| B. amyloliquefaciens | 2410503 |
| B. angulatum | 2428191 |
| B. anthracis | 5247363 |
| B. animalis | 1625171 |
| B. bifidum | 2270463 |
| B. breve | 2115498 |
| B. cereus | 2001486 |
| B. dentium | 3541582 |
| B. kashiwanohense | 2201381 |
| B. licheniformis | 4885476 |
| B. longum | 1554677 |
| B. pumilus | 1737944 |
| B. subtilis | 836339 |
| C. acetobutylicum | 3775248 |
| C. beijerinckii | 3815520 |
| C. botulinum | 807152 |
| C. freundii | 2952683 |
| C. kluyveri | 3886193 |
| C. pasteurianum | 1767708 |
| C. perfringens | 1820711 |
| C. sporogenes | 3339068 |
| C. tetani | 2131630 |
| C. tyrobutyricum | 3165422 |
| E. aerogenes | 4799403 |
| E. asburiae | 2605757 |
| E. hormaechei | 702027 |
| E. coli | 2437675 |
| L. brevis | 2854993 |
| L. casei | 1447024 |
| L. fermentum | 2388130 |
| L. paracasei | 3138032 |
| L. plantarum | 3188314 |
| P. agglomerans | 3048385 |
| P. ananatis | 4633080 |
| P. putida | 1336149 |
| S. bongori | 4997315 |
| S. maltophilia | 1590307 |
| S. saprophyticus | 2994517 |

Table 3.6: Number of unique k-mers for k = 11, 12, and 13

| Organisms | 11-mers | 12-mers | 13-mers |
| --- | --- | --- | --- |
| AP014719.1 (A. sp. Hiyo8) | 4 | 2683 | 167826 |
| B. adolescentis | 0 | 372 | 31280 |
| B. amyloliquefaciens | 0 | 310 | 31652 |
| B. angulatum | 0 | 562 | 47198 |
| B. animalis | 0 | 357 | 28338 |
| B. anthracis | 2 | 2774 | 166138 |
| B. bifidum | 0 | 515 | 41374 |
| B. breve | 2 | 558 | 38462 |
| B. cereus | 0 | 22 | |
| B. dentium | 2 | 1230 | |
| B. kashiwanohense | 0 | 116 | |
| B. licheniformis | 0 | 1279 | |
| B. longum | 0 | 248 | |
| B. pumilus | 2 | 352 | |
| B. subtilis | 0 | 2 | |
| CP000382.1 (C. novyi) | 0 | 505 | |
| CP002770.1 (D. kuznetsovii) | 2 | 1769 | |
| CP003325.1 (B. asteroides) | 0 | 1311 | |
| CP011005.1 (A. sp. IHBB 11108) | 0 | 5108 | |
| CP012171.1 (A. sp. LS16) | 0 | 773 | |
| CP012677.1 (A. alpinus strain R3.8) | 0 | 3211 | |
| CP014196.1 (A. sp. ATCC 21022) | 2 | 2083 | |
| C. acetobutylicum | 0 | 1327 | |
| C. beijerinckii | 0 | 993 | |
| C. botulinum | 0 | 0 | |
| C. freundii | 0 | 322 | |
| C. kluyveri | 2 | 1045 | |
| C. pasteurianum | 0 | 2 | |
| C. perfringens | 0 | 246 | |
| C. sporogenes | 0 | 967 | |
| C. tetani | 2 | 311 | |
| C. tyrobutyricum | 0 | 706 | |
| E. aerogenes | 2 | 1434 | |
| E. asburiae | 0 | 208 | |
| E. coli | 0 | 337 | |
| E. hormaechei | 0 | 0 | |
| L. brevis | 0 | 2166 | |
| L. casei | 0 | 28 | |
| L. fermentum | 0 | 1956 | |
| L. paracasei | 0 | 1377 | |
| L. plantarum | 8 | 2225 | |
| NC_008346.1 (S. wolfei) | 2 | 1504 | |
| NC_008530.1 (L. gasseri) | 0 | 1174 | |
| NC_008541.1 (A. sp. FB24) | 4 | 2393 | |
| NC_009253.1 (D. reducens MI-1) | 2 | 3130 | |
| NC_010471.1 (L. citreum) | 0 | 1694 | |
| NC_012004.1 (S. uberis) | 4 | 1256 | |
| NC_013216.1 (D. acetoxidans) | 2 | 2302 | |
| NC_014011.1 (A. colombiense) | 0 | 1812 | |
| NC_014328.1 (C. ljungdahlii) | 0 | 282 | |
| NC_014393.1 (C. cellulovorans) | 0 | 1915 | |
| NC_014562.1 (P. vagans) | 2 | 986 | |
| NC_015565.1 (D. nigrificans) | 0 | 1562 | |
| NC_015589.1 (D. ruminis) | 2 | 1953 | |
| NC_015737.1 (C. sp. SY8519) | 0 | 718 | |
| NC_015978.1 (L. sanfranciscensis) | 2 | 664 | |
| NC_016791.1 (C. sp. BNL1100) | 0 | 1899 | |
| NC_019954.2 (T. acetatoxydans) | 0 | 1593 | |
| NC_020291.1 (C. saccharoperbutylacetonicum) | 0 | 1542 | |
| NC_020546.1 (B. thermophilum) | 2 | 1312 | |
| NC_021184.1 (D. gibsoniae) | 2 | 2598 | |
| NC_022571.1 (C. saccharobutylicum) | 2 | 1114 | |
| NC_022592.1 (C. autoethanogenum) | 0 | 210 | |
| NZ_AP012325.1 (B. catenulatum) | 0 | 682 | |
| NZ_AP012330.1 (B. pseudocatenulatum) | 0 | 713 | |
| NZ_AP012331.1 (B. scardovii) | 0 | 1171 | |
| NZ_CP006018.1 (B. indicum) | 0 | 249 | |
| NZ_CP006905.1 (C. baratii) | 4 | 599 | |
| NZ_CP007161.1 (V. sp. SK37) | 0 | 1633 | |
| NZ_CP007287.1 (B. coryneforme) | 0 | 366 | |
| NZ_CP007457.1 (B. pseudolongum) | 0 | 885 | |
| NZ_CP007595.1 (A. sp. PAMC25486) | 0 | 2669 | |
| NZ_CP009687.1 (C. aceticum) | 0 | 3011 | |
| NZ_CP009933.1 (C. scatologenes) | 0 | 1028 | |
| NZ_CP011786.1 (B. actinocoloniiforme) | 4 | 1560 | |
| NZ_CP011803.1 (C. carboxidivorans) | 0 | 969 | |
| NZ_CP012024.1 (B. smithii) | 0 | 1559 | |
| NZ_CP012479.1 (A. sp. ERGS1:01) | 0 | 1775 | |
| NZ_CP012873.1 (W. cibaria) | 0 | 2081 | |
| NZ_CP013297.1 (A. sp. YC-RL1) | 0 | 800 | |
| NZ_CP013745.1 (A. sp. A3) | 2 | 3316 | |
| NZ_CP015249.1 (D. koreensis) | 0 | 1133 | |
| NZ_CP015732.1 (A. sp. U41) | 0 | 1978 | |
| NZ_CP017253.1 (C. taeniosporum) | 0 | 464 | |
| NZ_CP017762.1 (V. sp. 6R) | 0 | 3446 | |
| NZ_CP017962.1 (V. halodenitrificans) | 0 | 1570 | |
| NZ_HG917868.1 (C. bornimense) | 0 | 824 | |
| P. agglomerans | 0 | 74 | |
| P. ananatis | 0 | 1217 | |
| P. putida | 0 | 121 | |
| S. bongori | 0 | 1547 | |
| S. maltophilia | 0 | 138 | |
| S. saprophyticus | 2 | 1276 | |

3.5 Analysis of Simulated Metagenome

Table 3.7: Composition data for the metagenome simulation

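| Organism | Theoretical ratio | Comp_25 (%) | Comp_50 (%) | Comp_75 (%) | Comp_100 (%) |
| --- | --- | --- | --- | --- | --- |
| B. licheniformis | 8 | 3.17 | 6.34 | 9.52 | 12.69 |
| L. brevis | 16 | 6.34 | 12.69 | 19.04 | 25.39 |
| L. fermentum | 32 | 12.69 | 25.39 | 38.09 | 50.79 |
| L. plantarum | 2 | 0.79 | 1.58 | 2.38 | 3.17 |
| P. ananatis | 4 | 1.53 | 3.17 | 4.76 | 6.34 |
| P. vagans | 1 | 0.39 | 0.79 | 1.19 | 1.58 |
| Total (Reporter Genomes) | - | 25 | 50 | 75 | 100 |
| Rest (Background Genomes) | - | 75 | 50 | 25 | 0 |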
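The percentages in Table 3.7 appear to follow from distributing the reporter share of each composition in proportion to the theoretical ratio. This relation is inferred from the tabulated values and is not stated explicitly in the text, so the sketch below should be read as an assumption; the function name is hypothetical.

```python
# Theoretical ratios of the six reporter species, as listed in Table 3.7.
theoretical_ratio = {
    "B. licheniformis": 8,
    "L. brevis": 16,
    "L. fermentum": 32,
    "L. plantarum": 2,
    "P. ananatis": 4,
    "P. vagans": 1,
}

def reporter_percentages(ratios, reporter_percent):
    """Split reporter_percent among the species in proportion to their ratios.

    Assumed relation: percentage_i = reporter_percent * ratio_i / sum(ratios).
    """
    total = sum(ratios.values())
    return {species: reporter_percent * ratio / total
            for species, ratio in ratios.items()}

for percent in (25, 50, 75, 100):
    shares = reporter_percentages(theoretical_ratio, percent)
    print(percent, {species: round(value, 2) for species, value in shares.items()})
```

Because the ratios sum to 63, B. licheniformis in Comp_25 would receive 25 × 8 / 63, or about 3.17%. The printed values are rounded and may differ in the last digit from the tabulated ones.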

We analyzed the simulated metagenomes (Comp_25, Comp_50, Comp_75, and Comp_100) by identifying the signature sequences in metagenomic reads of length 150 bp. These compositions were analyzed at different data volumes. Figures 3.3 and 3.4 show the estimated ratios of the reporter species in the metagenomes at data volumes 6e9 (600 Mbp) and 6e10 (6 Gbp), whereas when we tried to identify the reads in the metagenomes for 6e11 (60 Gbp), the computer ran out of memory. The results for data volumes 6e9 and 6e10 show that Comp_25 was comparable to the control for the detection of B. licheniformis, L. plantarum, and L. fermentum, whereas Comp_50, Comp_75, and Comp_100 showed remarkable detection of L. brevis, L. fermentum, and P. ananatis. However, overestimation of L. plantarum and P. vagans was detected (Figures 3.3 and 3.4).

The results of the metagenome analysis showed a trend of underestimation of the reporter genomes in the metagenome. A possible explanation for the underestimation is either the small data volume or the noise produced by the background genomes. Moreover, the frequency of the unique k-mers can also influence the analysis of the metagenome. A statistical method was used to rectify the underestimation of the ratios of the reporter species. We are still searching for a mathematical solution to this problem.

Figure 3.3: Analysis of the simulated metagenome for data volume 6e9 and sequence deviation 0.0. Theoretical ratio: Composition of the metagenome data as described in the composition table, Comp_25: 25% of the marker genome included from the reporter species as described in the composition table, Comp_50: 50% of the marker genome included from the reporter species, Comp_75: 75% of the marker genome included from the reporter species, Comp_100: 100% of the marker genome included from the reporter species.

Figure 3.4: Analysis of the simulated metagenome for data volume 6e10 and sequence deviation 0.0. Theoretical ratio: Composition of the metagenome data as described in the composition table, Comp_25: 25% of the marker genome included from the reporter species as described in the composition table, Comp_50: 50% of the marker genome included from the reporter species, Comp_75: 75% of the marker genome included from the reporter species, Comp_100: 100% of the marker genome included from the reporter species.

3.6 Conclusion

The work discussed here concentrates entirely on completely sequenced bacterial genomes relevant to a consortium used to ferment a type of alcoholic drink (Zhang et al., 2016). The first part of the study focused on the identification of the unique k-mers in the organisms at the species level, whereas the second part covered the quantitative analysis of microbial species in simulated metagenomes by utilizing the unique k-mers. Our approach accurately estimated the abundance ratios of 4 of the 6 reporter species, which were comparable to the control, whereas 2 of the 6 reporter species were overestimated in the simulated metagenomes. Nevertheless, refining the algorithm, increasing computational power, and applying advanced statistics can help to improve the accuracy of the method.

REFERENCES

Abe, T., Ichiba, Y., Kanaya, S., Kozuki, T., Kinouchi, M., & Ikemura, T. (2002). A novel bioinformatic strategy for unveiling hidden genome signatures of eukaryotes: self-organizing map of oligonucleotide frequency. Genome Informatics, 13, 12-20.

Acinas, S. G., Marcelino, L. A., Klepac-Ceraj, V., & Polz, M. F. (2004). Divergence and redundancy of 16S rRNA sequences in genomes with multiple rrn operons. Journal of Bacteriology, 186(9), 2629-2635.

Arrigo, K. R. (2005). Marine microorganisms and global nutrient cycles. Nature, 437, 349-355. doi:10.1038/nature04159.

Ausec, L., Van Elsas, J. D., & Mandic-Mulec, I. (2011). Two- and three-domain bacterial laccase-like genes are present in drained peat soils. Soil Biology and Biochemistry, 43(5), 975-983.

Brady, A., & Salzberg, S. L. (2009). Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nature Methods, 6(9), 673-676.

Case, R. J., Boucher, Y., Dahllöf, I., Holmström, C., Doolittle, W. F., & Kjelleberg, S. (2007). Use of 16S rRNA and rpoB genes as molecular markers for microbial ecology studies. Applied and Environmental Microbiology, 73(1), 278-288.

Chan, C. K. K., Hsu, A. L., Tang, S. L., & Halgamuge, S. K. (2007). Using growing self-organising maps to improve the binning process in environmental whole-genome shotgun sequencing. BioMed Research International, 2008.

Compeau, P. E., Pevzner, P. A., & Tesler, G. (2011). How to apply de Bruijn graphs to genome assembly. Nature Biotechnology, 29(11), 987-991.

Courtois, S., Cappellano, C. M., Ball, M., Francou, F. X., Normand, P., Helynck, G., ... & August, P. R. (2003). Recombinant environmental libraries provide access to microbial diversity for drug discovery from natural products. Applied and Environmental Microbiology, 69(1), 49-55.

DeSantis, T. Z., Hugenholtz, P., Larsen, N., Rojas, M., Brodie, E. L., Keller, K., ... & Andersen, G. L. (2006). Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Applied and Environmental Microbiology, 72(7), 5069-5072.

Diaz, N. N., Krause, L., Goesmann, A., Niehaus, K., & Nattkemper, T. W. (2009). TACOA: Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinformatics, 10(1), 56.

Dick, G. J., Andersson, A. F., Baker, B. J., Simmons, S. L., Thomas, B. C., Yelton, A. P., & Banfield, J. F. (2009). Community-wide analysis of microbial genome sequence signatures. Genome Biology, 10(8), R85.

Dröge, J., & McHardy, A. C. (2012). Taxonomic binning of metagenome samples generated by next-generation sequencing technologies. Briefings in Bioinformatics, 13(6), 646-655.

Edwards, R. A., & Rohwer, F. (2005). Viral metagenomics. Nature Reviews Microbiology, 3(6), 504-510.

Entcheva, P., Liebl, W., Johann, A., Hartsch, T., & Streit, W. R. (2001). Direct cloning from enrichment cultures, a reliable strategy for isolation of complete operons and genes from microbial consortia. Applied and Environmental Microbiology, 67(1), 89-99.

Erickson, A. R., Cantarel, B. L., Lamendella, R., Darzi, Y., Mongodin, E. F., Pan, C., ... & Raes, J. (2012). Integrated metagenomics/metaproteomics reveals human host-microbiota signatures of Crohn’s disease. PLoS One, 7(11), e49138.

Frank, J. A., & Sørensen, S. J. (2011). Quantitative metagenomic analyses based on average genome size normalization. Applied and Environmental Microbiology, 77(7), 2513-2521.

Frias-Lopez, J., Shi, Y., Tyson, G. W., Coleman, M. L., Schuster, S. C., Chisholm, S. W., & DeLong, E. F. (2008). Microbial community gene expression in ocean surface waters. Proceedings of the National Academy of Sciences, 105(10), 3805-3810.

Gerlach, W., & Stoye, J. (2011). Taxonomic classification of metagenomic shotgun sequences with CARMA3. Nucleic Acids Research, 39(14), e91-e91.

Gilbert, J. A., Field, D., Huang, Y., Edwards, R., Li, W., Gilna, P., & Joint, I. (2008). Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PLoS One, 3(8), e3042.

Glass, E. M., Wilkening, J., Wilke, A., Antonopoulos, D., & Meyer, F. (2010). Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold Spring Harbor Protocols, 2010(1), pdb-prot5368.

Goodman, D. B., Church, G. M., & Kosuri, S. (2013). Causes and effects of N-terminal codon bias in bacterial genes. Science, 342(6157), 475-479.

Gouy, M., & Gautier, C. (1982). Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Research, 10(22), 7055-7074.

Greenfield, P., & Roehm, U. (2013). Answering biological questions by querying kmer databases. Concurrency and Computation: Practice and Experience, 25(4), 497-509.

Hjort, K., Bergström, M., Adesina, M. F., Jansson, J. K., Smalla, K., & Sjöling, S. (2010). Chitinase genes revealed and compared in bacterial isolates, DNA extracts and a metagenomic library from a phytopathogen-suppressive soil. FEMS Microbiology Ecology, 71(2), 197-207.

Huson, D. H., Richter, D. C., Mitra, S., Auch, A. F., & Schuster, S. C. (2009). Methods for comparative metagenomics. BMC Bioinformatics, 10(1), 1.

Huson, D. H., Auch, A. F., Qi, J., & Schuster, S. C. (2007). MEGAN analysis of metagenomic data. Genome Research, 17(3), 377-386.

Iverson, V., Morris, R. M., Frazar, C. D., Berthiaume, C. T., Morales, R. L., & Armbrust, E. V. (2012). Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota. Science, 335(6068), 587-590.

Karlin, S., & Burge, C. (1995). Dinucleotide relative abundance extremes: a genomic signature. Trends in Genetics, 11(7), 283-290.

Kodama, Y., Shumway, M., & Leinonen, R. (2012). The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Research, 40(D1), D54-D56.

Krause, L., Diaz, N. N., Goesmann, A., Kelley, S., Nattkemper, T. W., Rohwer, F., ... & Stoye, J. (2008). Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Research, 36(7), 2230-2239.

Langille, M. G., Zaneveld, J., Caporaso, J. G., McDonald, D., Knights, D., Reyes, J. A., ... & Beiko, R. G. (2013). Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nature Biotechnology, 31(9), 814-821.

Lassalle, F., Périan, S., Bataillon, T., Nesme, X., Duret, L., & Daubin, V. (2015). GC-content evolution in bacterial genomes: the biased gene conversion hypothesis expands. PLoS Genet, 11(2), e1004941.

Li, M., Hong, Y., Klotz, M. G., & Gu, J. D. (2010). A comparison of primer sets for detecting 16S rRNA and hydrazine oxidoreductase genes of anaerobic ammonium-oxidizing bacteria in marine sediments. Applied Microbiology and Biotechnology, 86(2), 781-790.

Li, Y., Wang, H., Nie, K., Zhang, C., Zhang, Y., Wang, J., ... & Ma, X. (2016). VIP: an integrated pipeline for metagenomics of virus identification and discovery. Scientific Reports, 6.

Liu, B., Gibbons, T., Ghodsi, M., Treangen, T., & Pop, M. (2011). Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. Genome Biology, 12(1), P11.

Liu, B., Gibbons, T., Ghodsi, M., & Pop, M. (2010, December). MetaPhyler: Taxonomic profiling for metagenomic sequences. In Bioinformatics and Biomedicine (BIBM), 2010 IEEE International Conference on (pp. 95-100). IEEE.

Looft, T., Johnson, T. A., Allen, H. K., Bayles, D. O., Alt, D. P., Stedtfeld, R. D., ... & Hashsham, S. A. (2012). In-feed antibiotic effects on the swine intestinal microbiome. Proceedings of the National Academy of Sciences, 109(5), 1691-1696.

Luo, C., Tsementzi, D., Kyrpides, N. C., & Konstantinidis, K. T. (2012). Individual genome assembly from complex community short-read metagenomic datasets. The ISME Journal, 6(4), 898-901.

Mahenthiralingam, E., Baldwin, A., Drevinek, P., Vanlaere, E., Vandamme, P., LiPuma, J. J., & Dowson, C. G. (2006). Multilocus sequence typing breathes life into a microbial metagenome. PLoS One, 1(1), e17.

Marinier, E., Zaheer, R., Berry, C., Weedmark, K. A., Domaratzki, M., Mabon, P., ... & LiDS-NG Consortium. (2016). Neptune: A Bioinformatics Tool for Rapid Discovery of Genomic Variation in Bacterial Populations. bioRxiv, 032227.

Marinier, E., Berry, C., Weedmark, K. A., Domaratzki, M., Mabon, P., Knox, N. C., ... & LiDS-NG Consortium. (2015). Neptune: A Tool for Rapid Genomic Signature Discovery. bioRxiv, 032227.

Martin, C., Diaz, N. N., Ontrup, J., & Nattkemper, T. W. (2008). Hyperbolic SOM-based clustering of DNA fragment features for taxonomic visualization and classification. Bioinformatics, 24(14), 1568-1574.

Monzoorul Haque, M., Ghosh, T. S., Komanduri, D., & Mande, S. S. (2009). SOrt-ITEMS: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics, 25(14), 1722-1730.

Moura, A., Savageau, M. A., & Alves, R. (2013). Relative amino acid composition signatures of organisms and environments. PLoS One, 8(10), e77319.

Nakabachi, A., Yamashita, A., Toh, H., Ishikawa, H., Dunbar, H. E., Moran, N. A., & Hattori, M. (2006). The 160-kilobase genome of the bacterial endosymbiont Carsonella. Science, 314(5797), 267-267.

Namiki, T., Hachiya, T., Tanaka, H., & Sakakibara, Y. (2012). MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20), e155-e155.

Nayfach, S., & Pollard, K. S. (2016). Toward accurate and quantitative comparative metagenomics. Cell, 166(5), 1103-1116.

Nayfach, S., Rodriguez-Mueller, B., Garud, N., & Pollard, K. S. (2016). An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography. Genome Research, 26(11), 1612-1625.

Peng, Y., Leung, H. C., Yiu, S. M., & Chin, F. Y. (2011). Meta-IDBA: a de Novo assembler for metagenomic data. Bioinformatics, 27(13), i94-i101.

Prakash, T., & Taylor, T. D. (2012). Functional assignment of metagenomic data: challenges and applications. Briefings in Bioinformatics, 13(6), 711-727.

Ruby, J. G., Bellare, P., & DeRisi, J. L. (2013). PRICE: software for the targeted assembly of components of (Meta) genomic sequence data. G3: Genes, Genomes, Genetics, 3(5), 865-880.

Sandberg, R., Bränden, C. I., Ernberg, I., & Cöster, J. (2003). Quantifying the species-specificity in genomic signatures, synonymous codon choice, amino acid usage and G+C content. Gene, 311, 35-42.

Schloss, P. D., & Handelsman, J. (2003). Biotechnological prospects from metagenomics. Current Opinion in Biotechnology, 14(3), 303-310.

Schloss, P. D., & Handelsman, J. (2005). Metagenomics for studying unculturable microorganisms: cutting the Gordian knot. Genome Biology, 6(8), 229.

Schmidt, O., Drake, H. L., & Horn, M. A. (2010). Hitherto unknown [Fe-Fe]-hydrogenase gene diversity in anaerobes and anoxic enrichments from a moderately acidic fen. Applied and Environmental Microbiology, 76(6), 2027-2031.

Segata, N., Waldron, L., Ballarini, A., Narasimhan, V., Jousson, O., & Huttenhower, C. (2012). Metagenomic microbial community profiling using unique clade-specific marker genes. Nature Methods, 9(8), 811-814.

Gupta, S. K., Majumdar, S., Bhattacharya, T. K., & Ghosh, T. C. (2000). Studies on the relationships between the synonymous codon usage and protein secondary structural units. Biochem Biophys Res Commun, 269, 692-696.

Sharpton, T. J. (2014). An introduction to the analysis of shotgun metagenomic data. Frontiers in Plant Science, 5.

Sharpton, T. J., Riesenfeld, S. J., Kembel, S. W., Ladau, J., O’Dwyer, J. P., Green, J. L., ... & Pollard, K. S. (2011). PhylOTU: a high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data. PLoS Computational Biology, 7(1), e1001061.

Srinivasan, S. M., & Guda, C. (2013). MetaID: A novel method for identification and quantification of metagenomic samples. BMC Genomics, 14(8), S4.

Stark, M., Berger, S. A., Stamatakis, A., & von Mering, C. (2010). MLTreeMap - accurate Maximum Likelihood placement of environmental DNA sequences into taxonomic and functional reference phylogenies. BMC Genomics, 11(1), 461.

Sueoka, N. (1961). Variation and heterogeneity of base composition of deoxyribonucleic acids: a compilation of old and new data. Journal of Molecular Biology, 3(1), 31IN15-40.

Thomas, T., Gilbert, J., & Meyer, F. (2012). Metagenomics - a guide from sampling to data analysis. Microbial Informatics and Experimentation, 2(1), 1.

Torsvik, V., Goksøyr, J., & Daae, F. L. (1990). High diversity in DNA of soil bacteria. Applied and Environmental Microbiology, 56, 782-787.

Tringe, S. G., Von Mering, C., Kobayashi, A., Salamov, A. A., Chen, K., Chang, H. W., ... & Bork, P. (2005). Comparative metagenomics of microbial communities. Science, 308(5721), 554-557.

Van Der Heijden, M. G., Bardgett, R. D., & Van Straalen, N. M. (2008). The unseen majority: soil microbes as drivers of plant diversity and productivity in terrestrial ecosystems. Ecology Letters, 11(3), 296-310.

Wrighton, K. C. (2012). Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla (vol 337, pg 1661, 2012). Science, 338(6108), 742-742.

Wood, D. E., & Salzberg, S. L. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3), 1.

Wooley, J. C., & Ye, Y. (2010). Metagenomics: facts and artifacts, and computational challenges. Journal of Computer Science and Technology, 25(1), 71-81.

Wu, M., & Eisen, J. A. (2008). A simple, fast, and accurate method of phylogenomic inference. Genome Biology, 9(10), R151.

Wylie, K. M., Truty, R. M., Sharpton, T. J., Mihindukulasuriya, K. A., Zhou, Y., Gao, H., Sodergren, E., Weinstock, G. M., & Pollard, K. S. (2012). Novel bacterial taxa in the human microbiome. PLoS One, no. 6, e35294.

Yang, B., Peng, Y., Leung, H. C. M., Yiu, S. M., Chen, J. C., & Chin, F. Y. L. (2010). Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers. BMC Bioinformatics, 11(2), S5.

Zaprasis, A., Liu, Y. J., Liu, S. J., Drake, H. L., & Horn, M. A. (2010). Abundance of novel and diverse tfdA-like genes, encoding putative phenoxyalkanoic acid herbicide-degrading dioxygenases, in soil. Applied and Environmental Microbiology, 76(1), 119-128.

Zhang, H., He, H., Yu, X., Xu, Z., & Zhang, Z. (2016). Employment of Near Full-Length Ribosome Gene TA-Cloning and Primer-Blast to Detect Multiple Species in a Natural Complex Microbial Community Using Species-Specific Primers Designed with Their Genome Sequences. Molecular Biotechnology, 58(11), 729-737.