M. Sc. Thesis—Quan Yao McMaster—Biology
IDENTIFICATION OF ENVIRONMENTAL ALPHAPROTEOBACTERIA WITH CONSERVED SIGNATURE PROTEINS IN METAGENOMIC DATASETS
BY
QUAN YAO, B.Sc.
A Thesis
Submitted to the School of Graduate Studies
in Partial Fulfillment of the Requirements
For the Degree
Master of Science
McMaster University
© Copyright by Quan Yao, Dec 2013
MASTER OF SCIENCE (2013) McMaster University
(Biology) Hamilton, Ontario
TITLE: Identification of Environmental Alphaproteobacteria with Conserved
Signature Proteins in Metagenomic Datasets
AUTHOR: Quan Yao, B.Sc. (Ocean University of China)
SUPERVISOR: Professor H.E. Schellhorn
NUMBER OF PAGES: ix, 94
Abstract
Microbial metagenomics is the exploration of the taxonomic diversity of microbial communities in environmental habitats using large, exhaustive DNA sequence datasets. However, due to inherent limitations of sequencing technology and the complexity of environmental genomes, current analytical approaches do not reveal all of the microbes that may be present. In this study, a new classification approach is proposed that uses proteins unique to different clades of Alphaproteobacteria to predict the presence or absence of species from these groups in published metagenomic datasets. In this work, 264 previously identified, published conserved signature proteins (CSPs) characteristic of individual taxonomic clades of Alphaproteobacteria are used as probes to detect the presence of bacteria in metagenomic datasets. Although public genome sequence information has increased manifold since these CSPs were initially identified 6 years ago, the results indicate that nearly all of these CSPs (259 of 265) remain specific for their previously characterized clades. Furthermore, they are confirmed to be present in the newly identified and sequenced members of these clades. In view of their specificity and predictive ability in different monophyletic clades of Alphaproteobacteria, the sequences of these CSPs provide reliable probes to determine the presence or absence of these Alphaproteobacteria in metagenomic datasets. Here, CSPs are used to assess Alphaproteobacterial diversity in 10 published metagenomic datasets (bioreactor, compost, wastewater, activated sludge, groundwater, freshwater sediment, microbial mat, marine, hydrothermal vent and whale fall metagenomes), which cover diverse environments and ecosystems. The results indicate that BLAST searches with these CSPs can efficiently identify Alphaproteobacteria species in these metagenomic datasets, and that the distribution and relative abundance of different Alphaproteobacteria species differ substantially among the tested datasets. Thus the CSPs, which are specific for different microbial taxa, provide a novel and powerful means for the identification of microbes and for their taxonomic profiling in metagenomic datasets.
Acknowledgements
First, I must thank my supervisor, Dr. Herb Schellhorn, who gave many valuable suggestions and recommendations during my research work, and who generously took us to the conference of the Canadian Society of Microbiologists in Ottawa, where we had a great opportunity to share our research and communicate with the world’s top researchers. The second summer at Dr. Schellhorn’s cottage is an unforgettable memory, where we enjoyed a fascinating retreat after a year of hard work.
Equally important, I would like to thank my co-supervisor, Dr. Gupta, for his continuous support of my work and the inspiration he ignited in my mind, and my committee chair, Dr. Igdoura, for his kindness and assistance in my defense.
Secondly, I have to thank my lab mates, who accompanied me over the past 2 years both in the lab and off campus. I want to acknowledge Lingzi, Mohammed, Shirley, Sohail, Steve, Rachel, and Pardis. The coffee break chats on casual and entertaining topics, the cooperative work we accomplished when encountering bottlenecks in research, and the in-depth exchanges of ideas and thoughts about philosophy, the universe and ourselves all make up an indispensable part of my life and helped establish my values and beliefs.
Finally, I must thank my parents for their encouragement throughout my life. Without their guidance and instruction, I could never have achieved the goals I dreamed of. Their love is my forever treasure and provides the motivation to help me overcome future obstacles in my career.
Table of Contents
Part I. Uniqueness of Alphaproteobacteria specific CSPs
  Chapter 1 Introduction
    1.1 Significance of Alphaproteobacteria
    1.2 Conserved signature proteins as phylogenetic markers
    1.3 Standards for taxonomic hierarchy
  Chapter 2 Materials and methods
    2.1 Confirmation of the uniqueness of CSPs
    2.2 Grouping of CSP into Taxonomic levels
  Chapter 3. Results
    3.1 Confirmation of the uniqueness of CSPs
    3.2 Grouping of CSP into Taxonomic levels
  Chapter 4 Discussion
    4.1 Confirmation of the uniqueness of CSP
    4.2 Grouping of CSP into Taxonomic levels
    4.3 Future experiments
Part II Identification of Alphaproteobacteria specific CSPs in metagenomic samples
  Chapter 1 Introduction
    1.1 Metagenome, environmental genomes
    1.2 Taxonomic classification of metagenomic reads: methods and challenges
    1.3 Application of metagenomics
    1.4 Project objectives
  Chapter 2 Materials and methods
    2.1 Metagenome selection
    2.2 Identification of CSP in metagenomic samples
    2.3 Comparative analysis of Alphaproteobacteria in metagenomes
  Chapter 3 Results
    3.1 Metagenome selection
    3.2 Identification of CSPs in metagenomic samples
    3.3 Comparative analysis of Alphaproteobacteria in metagenomes
  Chapter 4 Discussion
    4.1 Metagenome selection
    4.2 Identification of CSPs in metagenomic samples
    4.3 Comparative analysis of Alphaproteobacteria in metagenomes
    4.4 Overall conclusions
    4.5 Future directions
References
List of Figures
Figure 1: Summary heatmap of 16 Alphaproteobacteria specific CSPs in 10 metagenomes
Figure 2: Alphaproteobacteria specific CSPs identified in 10 metagenomes
Figure 3: Similarity of significant hits in 10 metagenomes
Figure 4: Overall relative abundance of Alphaproteobacteria based on CSP distribution in 10 metagenomes
Figure 5: The relative abundance of Alphaproteobacteria and its different sub-clades in the studied metagenomes based upon BLASTp searches with CSPs
Figure 6: Comparative results of Alphaproteobacteria distribution in 4 metagenomes derived from (A) CSPs-based binning and (B) similarity-based binning
List of Tables
Table 1: Alphaproteobacteria specificity and predictive ability of CSPs identified in 2007 and 2013
Table 2: Comparison of the Results of BLAST Search with Protein and Nucleotide Sequences
Table 3: CSPs specific to Alphaproteobacteria
Table 4: CSPs specific to Rhizobiales
Table 5: CSPs specific to Bradyrhizobiaceae and Xanthobacteraceae
Table 6: CSPs specific to Rhodobacterales
Table 7: CSPs specific to Caulobacterales
Table 8: CSPs specific to Sphingomonadales
Table 9: CSPs specific to Rhodospirillales
Table 10: CSPs specific to Rickettsiales
Table 11: Characteristics of Metagenomic Datasets Investigated in this Study
Part I. Uniqueness of Alphaproteobacteria specific CSPs
Chapter 1 Introduction
1.1 Significance of Alphaproteobacteria
Alphaproteobacteria is one of the largest classes of the phylum Proteobacteria, which also comprises 4 other major classes: Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria and Epsilonproteobacteria (Kersters et al., 2006). Alphaproteobacteria contains 6 main orders: Rhizobiales, Rhodobacterales, Caulobacterales, Sphingomonadales, Rhodospirillales and Rickettsiales, each with distinct characteristics (Williams et al., 2007). Alphaproteobacterial species are morphologically, physiologically and metabolically diverse and have adapted to a range of terrestrial and marine habitats (Rathsack et al., 2011; Williams et al., 2007). Most characterized Alphaproteobacteria species are Gram-negative bacteria (Olson et al., 2002). Many of them have developed mechanisms to adopt an intracellular lifestyle, either as plant mutualists or as animal pathogens (Dumler et al., 2001). Some Alphaproteobacterial species can grow at low nutrient levels (Kang et al., 2010). Alphaproteobacteria employ several important metabolic strategies such as photosynthesis, nitrogen fixation, ammonia oxidation and methylotrophy (Campagne et al., 2012). They are also morphologically diverse, with stellate, spiral and prosthecate forms (Hallez et al., 2004). Alphaproteobacteria are the most abundant cellular organisms in marine environments (Williams et al., 2007). Pelagibacter ubique, which was isolated in 2002, was found to comprise 1/4 of all plankton cells in the ocean (Sowell et al., 2008).
Rhizobiales is the largest order of Alphaproteobacteria, constituting 1/3 of all sequenced Alphaproteobacteria species (Carvalho et al., 2010). Rhizobiales species have developed several strategies to adapt to both intracellular and extracellular niches (Carvalho et al., 2010). Plant mutualists such as Rhizobium, Sinorhizobium and Bradyrhizobium are capable of fixing nitrogen in symbiosis with most leguminous plants (Fischer, 1996). Agricultural and animal pathogens such as Agrobacterium, Bartonella and Brucella are obligate or facultative intracellular parasites of either plants or animals and have been studied extensively (Bowman, 2011). Bartonella henselae, the chief causative agent of cat scratch disease (CSD), is a Gram-negative bacillus (English, 1988). Intimate contact with infected cats, through scratches, bites or saliva, can transmit B. henselae (Andersson and Kempf, 2004). Fortunately, infection by Bartonella sp. causes only a mild illness, which can be easily treated with common antibiotics (Holley, 1991). Brucella, another obligate parasite of mammals, comprises small, non-motile coccobacilli that are more severe pathogens than Bartonella sp. (Alsmark et al., 2004). They are usually transmitted between animals through the gastrointestinal (GI) tract, respiration and skin wounds, subsequently causing brucellosis in many animals due to their ability to survive phagocytosis (Breitschwerdt and Kordick, 2000). Severe infections may affect the central nervous system or circulatory system, and antibiotic treatment, such as a combination of doxycycline and rifampin, is necessary for at least 6 weeks; the treatment period mainly depends on the timing of treatment and the severity of illness (Raoult et al., 2003).
Most Rhodobacterales are purple non-sulfur bacteria, which belong to a larger group called photolithotrophic bacteria (Dang et al., 2008). They employ several metabolic mechanisms, including photosynthesis, nitrogen fixation and fermentation, under either aerobic or anaerobic conditions (Dang et al., 2008). Rhodobacter sphaeroides, first isolated from deep lakes and stagnant waters (Choudhary and Kaplan, 2000), is remarkable for two unique characteristics: an innate oxygen sensing system based on invaginations, and two sets of chromosomes responsible for distinct functions. The metabolic versatility of Rhodobacterales species enables them to dominate many ecological niches, and they are especially abundant in oceans (Oh and Kaplan, 2001).
Caulobacterales are typically found in low-nutrient aquatic environments such as lakes and rivers (Riemann et al., 2008). They have a characteristic stalk that anchors them to the surfaces of nearby organisms (Poindexter and Staley, 1996). This attachment strategy increases their nutrient uptake, since it exposes them to a continuously changing flow of fluid (Poindexter and Staley, 1996). In addition, Caulobacterales can exploit the host’s excretions as extra nutrients when environmental nutrients are depleted (Abraham et al., 2008).
Sphingomonadales are oval or rod-shaped bacteria characterized by sphingolipids located in the outer membrane of the cell wall (Yabuuchi and Kosako, 2005). Some of them are pleomorphic, with cell shapes that change over time, while other members carry out phototrophic metabolism (Yurkov and Beatty, 1998). Most Sphingomonadales species are widespread in diverse terrestrial and aquatic habitats because of their ability to survive in low-nutrient environments (Boersma et al., 2009). Sphingomonadales can be applied in bioremediation, since some species isolated from contaminated environments feed on toxic aromatic compounds as their main nutrient source (Boersma et al., 2009).
Rhodospirillales comprise 2 distinct families: Acetobacteraceae and Rhodospirillaceae (Gupta and Mok, 2007a). Within this order, the soil bacterium Azospirillum uses nutrients excreted by plants and in exchange fixes atmospheric nitrogen into ammonia for its host plants (Steenhoudt and Vanderleyden, 2000). Acetobacter and Gluconobacter are industrially important aerobic organisms widely used in the fermentation of wine into vinegar, converting ethyl alcohol into acetic acid (Gullo and Giudici, 2008). Rhodospirillum is a facultatively anaerobic bacterium (Yildiz et al., 1991). When oxygen is exhausted, Rhodospirillum activates its photosynthetic apparatus to acquire nutrition (Yildiz et al., 1991). However, the mechanism by which photosynthesis is repressed under aerobic conditions is poorly understood (Matsuda et al., 1984).
The order Rickettsiales is mostly composed of human pathogens and marine bacteria (Fredricks, 2006). The typical genus, Rickettsia, comprises Gram-negative, rod-shaped pathogenic bacteria (Zomorodipour and Andersson, 1999). These obligate intracellular parasites reproduce only within mammalian cells. Laboratory isolation and purification are feasible with tissue culture or embryos. Rickettsia enter host cells by inducing phagocytosis (Sahni and Rydkina, 2009). Once in the cytoplasm of the host cell, they reproduce by binary fission. Infection by Rickettsia compromises the permeability of blood capillaries, which is clinically characterized by a spotted rash (Walker et al., 2003). Another obligate pathogen of clinical significance is the genus Ehrlichia. Ehrlichia cause parasitemia by living in blood cells (Arraga-Alvarado et al., 2003). They are often transmitted from animals to humans through the bites of infected ticks, eventually resulting in ehrlichiosis (Arraga-Alvarado et al., 2003). Apart from their pathogenic features, Rickettsiales are also the closest relatives of eukaryotic mitochondria, based on high genomic similarity (Gray, 2012).
1.2 Conserved signature proteins as phylogenetic markers
Conserved signature proteins (CSPs) are a type of rare genomic change (RGC) often applied in phylogenetic analysis and taxonomic classification, because they are whole proteins uniquely present in certain groups of bacteria and not found anywhere else (Gao et al., 2006; Gupta and Lorenzini, 2007). Although most identified CSPs are of unknown function, their distribution patterns at different phylogenetic depths provide reliable evidence to distinguish taxonomically coherent clades (Bhandari et al., 2012). Because CSPs, like other RGCs, are mostly inherited vertically rather than horizontally, they are applied to elucidate the evolutionary relationships among closely related clades (Bhandari et al., 2012). Recent studies showed that these CSPs could be identified in newly sequenced species (Bhandari et al., 2012; Gao and Gupta, 2012). Due to their clade specificity and conservation, it is postulated that the CSPs may also be present in uncharacterized Alphaproteobacterial species. Environmental Alphaproteobacterial species may likewise carry such molecular markers, demonstrating their affiliation with their laboratory relatives. A previous analysis of approximately 60 Alphaproteobacteria genomes identified 265 CSPs specific to different phylogenetic clades (Gupta and Mok, 2007a). Serving as reliable molecular markers, these CSPs are used here to predict the presence of Alphaproteobacteria species in environmental samples when similar sequences are identified.
1.3 Standards for taxonomic hierarchy
The most widely accepted criterion currently used for taxonomic purposes is the branching pattern of 16S rRNA trees (Nguimbi et al., 2003). The 16S rRNA gene is present in almost all bacterial species and has the dual characteristic that conserved and variable regions alternate along the gene (Nguimbi et al., 2003). The conserved regions of 16S rRNA are used to infer common ancestry, while the variable regions differentiate one species from another (Moine et al., 2000). Currently, the Bacteria domain is classified into 23 major groups according to the 16S rRNA phylogenetic tree (Ludwig et al., 1998). However, species are not evenly distributed among phyla, and the numbers are biased by the fact that some genera have been studied more intensively than others. For instance, Proteobacteria, Actinobacteria, Firmicutes, Cyanobacteria and Bacteroidetes are the 5 largest phyla, comprising 90 to 95% of all bacteria known from laboratory studies (Binnewies et al., 2006), whereas small phyla such as Ignavibacteriae, Caldiserica, Chrysiogenetes, Dictyoglomi and Thermodesulfobacteria account for less than 1% of the bacteria studied (Binnewies et al., 2006). Furthermore, because the 16S rRNA gene marker has low resolution below the genus level, phylogenetic trees based on a single gene cannot robustly resolve all questions regarding the evolutionary history of different bacterial species (Kunisawa, 2007). Hence, the taxonomic hierarchy of the Bacteria domain is primarily subjective and there is, as yet, no consistent agreement on its phylogeny (Gupta, 2005a).
To describe the evolutionary relationships of bacteria appropriately, phylogenetic trees based on topological models such as rooted, unrooted and bifurcating trees can be constructed (Williams et al., 2011). In an idealized rooted phylogenetic tree, all bacteria are derived from a common ancestral bacterium, and this earliest bacterium is found at the root of the tree (Arisue et al., 2005). Each branch indicates the divergence of a large bacterial clade, such as a phylum or class, in evolutionary history. The closer a branch is to the root, the earlier the divergence event occurred. More recent branches denote the further evolution of sub-clades such as order, family, genus and species. Bacteria on the same branch have more characteristics in common than those on different branches (Doolittle and Bapteste, 2007).
The purpose of identifying CSPs is to provide reliable evidence for each node of a phylogenetic tree and to support the validity of its inferred branching pattern (Gupta and Griffiths, 2002). Previous studies have identified a large number of CSPs specific to different clades within Alphaproteobacteria. These molecular markers resolved the phylogeny of Alphaproteobacterial species well (Gupta, 2005b; Gupta and Mok, 2007b; Kainth and Gupta, 2005). With the increased availability of large datasets, a sufficient number of CSPs could support the construction of a comprehensive and reliable phylogenetic tree for both Alphaproteobacteria and the whole Bacteria domain.
1.4 Project objectives
Alphaproteobacteria-specific CSPs have proved useful in inferring phylogenetic trees and branching patterns within Alphaproteobacteria clades (Gupta and Mok, 2007a). Although the majority of CSPs are hypothetical proteins, these proteins may confer certain functions or characteristics that distinguish species belonging to Alphaproteobacteria clades from all others. The aim of this project is to confirm the specificity of previously identified Alphaproteobacteria-specific CSPs at different phylogenetic depths by performing BLAST searches against the latest nr protein database. Then, according to the distribution of the CSPs in bacterial taxonomy, all confirmed CSPs are grouped based on their specificity. Finally, a CSP database that represents different clades of Alphaproteobacteria from the class level to the family level will be constructed to serve as signature markers for detecting bacteria in environmental samples.
Chapter 2 Materials and methods
2.1 Confirmation of the uniqueness of CSPs
In view of the large increase in the number of sequenced bacterial genomes in the past 6 years, the current CSPs may now be found in new species, whether or not these are members of Alphaproteobacteria. Therefore, systematic BLASTp searches (Altschul et al., 1990) were performed with each CSP against the NCBI non-redundant protein sequences (nr) database (all non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF, excluding environmental samples from WGS projects) with an E-value threshold of 1e-04 to confirm their specificity. In parallel, BLASTn searches were conducted with the nucleotide sequences of the corresponding CSPs to compare the specificity of amino acid and nucleotide sequences. By convention, BLAST hits with E-values >1e-04 do not support orthology, so hits exceeding this threshold were excluded from phylogenetic analysis. However, when query proteins are too short to yield sufficient information (bits) to produce a discriminating E-value, higher E-values can be employed (Sharon et al., 2005). A potential CSP is considered clade specific if all significant BLAST hits come from within a monophyletic clade of Alphaproteobacteria, or if there is a large difference between the E-value of the last hit from an Alphaproteobacterial relative and that of the first hit from a non-Alphaproteobacterium (Gupta and Mok, 2007a). All significant hits of CSPs meeting the criteria described above were further analyzed as described below.
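As an illustration only, the specificity criteria above can be sketched in Python. The sketch assumes that each BLAST hit has already been annotated with whether its source organism belongs to the target clade. The `Hit` record, the function names and the `min_gap_orders` parameter are hypothetical, and because the text does not quantify what counts as a "large difference" in E-value, the default of 10 orders of magnitude is an arbitrary placeholder rather than the threshold used in this study.

```python
import math
from typing import NamedTuple

E_VALUE_CUTOFF = 1e-4  # significance threshold stated in the text


class Hit(NamedTuple):
    subject: str    # hit identifier (made-up values in the example below)
    in_clade: bool  # True if the source organism belongs to the target clade
    evalue: float


def significant(hits):
    """Keep hits at or below the cutoff, sorted from best to worst E-value."""
    kept = [h for h in hits if h.evalue <= E_VALUE_CUTOFF]
    return sorted(kept, key=lambda h: h.evalue)


def is_clade_specific(hits, min_gap_orders=10.0):
    """Apply the two criteria: all significant hits lie inside the clade, or
    a large E-value gap separates the last in-clade hit from the first
    out-of-clade hit. min_gap_orders is an assumed, illustrative setting."""
    sig = significant(hits)
    inside = [h.evalue for h in sig if h.in_clade]
    outside = [h.evalue for h in sig if not h.in_clade]
    if not inside:
        return False
    if not outside:
        return True
    floor = 1e-300  # BLAST can report 0.0; avoid taking log10 of zero
    worst_inside = math.log10(max(max(inside), floor))
    best_outside = math.log10(max(min(outside), floor))
    return best_outside - worst_inside >= min_gap_orders


# Made-up hits for a single query protein
example = [
    Hit("in_clade_1", True, 1e-80),
    Hit("in_clade_2", True, 1e-40),
    Hit("out_clade_1", False, 1e-6),
    Hit("out_clade_2", False, 0.5),  # above the cutoff, so it is ignored
]
print(len(significant(example)), is_clade_specific(example))
```

Here three hits pass the cutoff, and the query counts as clade specific under the assumed gap setting because the weakest in-clade hit (1e-40) is far stronger than the best out-of-clade hit (1e-6).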
2.2 Grouping of CSP into Taxonomic levels
We determined the taxonomic placement of the significant hits for each CSP from the BLASTp searches. A CSP should have multiple, similar sequences that are shared among several closely related species. The taxonomy report produced by a BLASTp search gives the distribution of the query CSP across all Bacteria. The lowest common ancestor (LCA) of the reported taxa was then identified. The LCA is the most recent taxon from which all of the reported organisms descend (Travers et al., 2004). For example, if a CSP is identified in 50 species, and these species belong to 3 genera X, Y and Z in 2 families M and N within 1 order A, the CSP is defined as an order A-specific CSP. It is not designated a genus X-specific or family M-specific CSP, because the marker is not unique to a single genus or family but is also present in genera Y and Z and in family N. LCA analysis thus yields the most parsimonious definition of the specificity of a CSP (Travers et al., 2004). A few organisms outside the clade may also share some CSPs found within a monophyletic clade of Alphaproteobacteria. Such cases are likely due to lateral gene transfer (LGT) events, and these proteins may still be regarded as clade-specific markers (Beiko and Ragan, 2008). Occasionally, a CSP might be found sporadically distributed in several distantly related bacterial clades. Such markers were probably misidentified as clade specific because of the limited number of sequenced Alphaproteobacterial species available at that time, and these non-specific markers are excluded from the CSP database.
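The LCA rule in the worked example above can be sketched as a longest-common-prefix computation over lineages. This is a minimal illustration, not the procedure used to process the BLASTp taxonomy reports: the function name is hypothetical, and it assumes each lineage is supplied as a root-to-tip list in which the same position always holds the same taxonomic rank.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by every lineage, or None if the
    lineages have nothing in common (or the input is empty)."""
    lca = None
    # zip(*lineages) walks the lineages rank by rank, stopping at the shortest
    for taxa_at_rank in zip(*lineages):
        if all(t == taxa_at_rank[0] for t in taxa_at_rank):
            lca = taxa_at_rank[0]
        else:
            break
    return lca


# Lineages of hits for one CSP, mirroring the X/Y/Z, M/N, A example in the text
lineages = [
    ["Alphaproteobacteria", "order A", "family M", "genus X"],
    ["Alphaproteobacteria", "order A", "family M", "genus Y"],
    ["Alphaproteobacteria", "order A", "family N", "genus Z"],
]
print(lowest_common_ancestor(lineages))      # order A
print(lowest_common_ancestor(lineages[:2]))  # family M
```

With hits in genera X, Y and Z the marker is assigned to order A. If the hits were confined to genera X and Y, the same rule would assign it to family M.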
Table 1: Alphaproteobacteria specificity and predictive ability of CSPs identified in 2007 and 2013
| Clade specificity | # of sequenced genomes (2007) | # of sequenced genomes (2013) | # of identified CSPs (2007) | # of identified CSPs (2013) | Accession # and other information |
|---|---|---|---|---|---|
| Alphaproteobacteria | 60 | 250 | 4 | 4 | Table 3A |
| Alphaproteobacteria except Rickettsiales | 45 | 180 | 7 | 7 | Table 3B |
| Rhizobiales | 24 | 96 | 3 | 3 | Table 4A |
| Clade 1 Rhizobiales | 14 | 58 | 16 | 16 | Table 4B |
| Rhizobiaceae and Phyllobacteriaceae | 6 | 30 | 18 | 18 | Table 5C |
| Bradyrhizobiaceae and Xanthobacteraceae | 10 | 20 | 74 | 74 | Table 5A, 5B |
| Rhodobacterales | 8 | 26 | 35 | 35 | Table 6A |
| Rhodobacteraceae | 3 | 4 | 13 | 13 | Table 6B |
| Caulobacterales | 3 | 7 | 11 | 11 | Table 7 |
| Sphingomonadales | 5 | 14 | 31 | 31 | Table 8 |
| Rhodospirillales | 5 | 27 | 4 | 0 | N/A |
| Acetobacteraceae | 3 | 17 | 14 | 17 | Table 9A |
| Rhodospirillaceae | 2 | 10 | 14 | 14 | Table 9B |
| Rickettsiales | 15 | 69 | 3 | 2 | Table 10A |
| Anaplasmataceae | 7 | 23 | 15 | 16 | Table 10B |
| Rickettsiaceae | 7 | 45 | 3 | 3 | Table 10C |
Note: The values underlined highlight the changes of CSP specificity during the periods
Table 2: Comparison of the Results of BLAST Search with Protein and Nucleotide Sequences
| Accession | # of Hits (1) | Protein Specificity | Gene ID | # of Hits (2) | Nucleotide Specificity |
|---|---|---|---|---|---|
| NP_422086 | 621 | α-proteobacteria | 943808 | 8 | Caulobacteraceae |
| NP_105743 | 276 | Clade1 Rhizobiales | 1228404 | 13 | Mesorhizobium and Sinorhizobium |
| NP_102577 | 76 | Rhizobiaceae | 1225240 | 2 | Mesorhizobium |
| YP_317328 | 32 | Bradyrhizobiaceae | 3674956 | 2 | Nitrobacter |
| YP_611978 | 92 | Rhodobacterales | 4075456 | 1 | Ruegeria sp. TM1040 |
| YP_614100 | 21 | Rhodobacteraceae | 4077857 | 1 | Silicibacter sp. TM1040 |
| YP_495301 | 76 | Sphingomonadales | 3916060 | 1 | Novosphingobium aromaticivorans |
| AAW62008 | 45 | Acetobacteraceae | 3249894 | 1 | Gluconobacter oxydans |
| YP_428643 | 23 | Rhodospirillaceae | 3837017 | 2 | Rhodospirillum rubrum |
| NP_220498 | 92 | Rickettsiales | 883719 | 42 | Rickettsia |

(1) Significant hits (hits with E-values below 1e-04) of protein sequences were obtained using BLASTp.
(2) Significant hits (hits with E-values below 1e-04) of nucleotide sequences were obtained using BLASTn.
Chapter 3. Results
3.1 Confirmation of the uniqueness of CSPs
Most CSPs were found to remain specific to their original taxa, even though the number of sequenced Alphaproteobacteria species has increased almost 4-fold (Table 1). In the CSP database, 4 Alphaproteobacteria-specific CSPs that used to be shared by 60 sequenced Alphaproteobacteria species are now uniquely shared by more than 250 Alphaproteobacteria species, including many members sequenced between 2007 and 2013. Similar results were seen for the other 7 Alphaproteobacteria-specific CSPs (which are absent from the order Rickettsiales). The 37 Rhizobiales-specific CSPs were also confirmed to be specific for most Rhizobiales species. For example, 3 Rhizobiales-specific CSPs were identified in almost all 96 sequenced Rhizobiales species. In detail, 16 Clade 1 Rhizobiales-specific CSPs were commonly shared by 11 Rhizobiaceae species, 8 Phyllobacteriaceae species, 2 Aurantimonadaceae species, 16 Brucellaceae species and 12 Bartonellaceae species (another 18 CSPs were present only in Rhizobiaceae and Phyllobacteriaceae species). Likewise, the other important clade, Bradyrhizobiaceae and Xanthobacteraceae, yielded a similar pattern: 74 CSPs were identified in 18 Bradyrhizobiaceae and 3 Xanthobacteraceae species. BLAST search results for the other 4 important orders of Alphaproteobacteria also validated the prediction that CSPs previously identified from a limited number of sequenced Alphaproteobacteria would be present in newly sequenced Alphaproteobacterial species. The 35 Rhodobacterales-specific CSPs were highly conserved in 41 Rhodobacterales species, while 13 CSPs previously specific to Silicibacter and Roseobacter were also present in other Rhodobacteraceae, such as Phaeobacter and Ruegeria species. These 13 CSPs are now defined as Rhodobacteraceae-specific CSPs. The 11 Caulobacterales-specific CSPs were found to be unique to 9 Caulobacterales species and 4 Hyphomonadaceae species. The 31 Sphingomonadales-specific CSPs are now uniquely present in 3 Erythrobacteraceae species and 17 Sphingomonadaceae species. Most Rhodospirillales-specific CSPs and Rickettsiales-specific CSPs were conserved within their groups. However, 4 Rhodospirillales-specific CSPs were found to be specific only to Acetobacteraceae, and 1 Rickettsiales-specific CSP was found to be specific to Anaplasmataceae species (underlined in Table 1). Only 1 non-specific CSP was identified (Accession number: AAW61951), which was previously considered specific to Acetobacteraceae. This was the only CSP that did not meet the classification criterion, and as a result the CSP database contains 264 qualified CSPs in total.
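The total of 264 can be cross-checked against Table 1 by summing the per-clade CSP counts for each year. The snippet below is only an arithmetic check: the counts are copied from Table 1, and the variable names are illustrative.

```python
# (clade, CSPs identified in 2007, CSPs identified in 2013), copied from Table 1
TABLE_1 = [
    ("Alphaproteobacteria", 4, 4),
    ("Alphaproteobacteria except Rickettsiales", 7, 7),
    ("Rhizobiales", 3, 3),
    ("Clade 1 Rhizobiales", 16, 16),
    ("Rhizobiaceae and Phyllobacteriaceae", 18, 18),
    ("Bradyrhizobiaceae and Xanthobacteraceae", 74, 74),
    ("Rhodobacterales", 35, 35),
    ("Rhodobacteraceae", 13, 13),
    ("Caulobacterales", 11, 11),
    ("Sphingomonadales", 31, 31),
    ("Rhodospirillales", 4, 0),
    ("Acetobacteraceae", 14, 17),
    ("Rhodospirillaceae", 14, 14),
    ("Rickettsiales", 3, 2),
    ("Anaplasmataceae", 15, 16),
    ("Rickettsiaceae", 3, 3),
]

total_2007 = sum(n_2007 for _, n_2007, _ in TABLE_1)
total_2013 = sum(n_2013 for _, _, n_2013 in TABLE_1)
# Clades whose CSP count differs between the two years
changed = [clade for clade, n_2007, n_2013 in TABLE_1 if n_2007 != n_2013]

print(total_2007, total_2013)  # 265 264
print(changed)
```

The two totals differ by one, matching the single CSP (AAW61951) excluded as non-specific, and the four clades whose counts change are the ones involved in the reassignments described above.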
Important differences were observed in the clade specificity of the same genes when BLAST searches were performed using nucleotide sequence data rather than protein sequence data (Table 2). For example, for two of the signature proteins specific for the family Anaplasmataceae (viz. NP_966526 and NP_965909), BLASTp searches with the amino acid sequences yielded significant hits for all of the sequenced species from the family Anaplasmataceae (e.g. Wolbachia, Anaplasma, Ehrlichia, etc.). In contrast, when the searches were carried out with the gene sequences for the same proteins, all significant hits were restricted to Wolbachia or to Anaplasma species, depending on whether the Wolbachia or the Anaplasma gene sequence was used as the query. Similarly, for a signature protein specific for Caulobacterales (viz. NP_419305), the BLASTp search with its amino acid sequence identified >30 significant hits covering all of the sequenced Caulobacterales species, while the BLASTn search with its nucleotide sequence identified only 6 significant hits, most of which were from the genus Caulobacter. Similar differences were observed in the BLAST results for the signature proteins of other bacterial clades. Thus, the use of gene sequences as markers may grossly underestimate the taxonomic diversity of microbial species in environments compared with that revealed by the use of CSPs.
3.2 Grouping of CSP into Taxonomic levels
Once all qualified CSPs had been filtered, they could be grouped based on their taxonomic specificity. All of these CSPs are specific either to the class Alphaproteobacteria or to different orders and families within it. In the CSP database, they are divided into 8 major groups: 11 Alphaproteobacteria-specific CSPs (Table 3), 37 Rhizobiales-specific CSPs (Table 4), 74 Bradyrhizobiaceae- and Xanthobacteraceae-specific CSPs (Table 5), 48 Rhodobacterales-specific CSPs (Table 6), 11 Caulobacterales-specific CSPs (Table 7), 31 Sphingomonadales-specific CSPs (Table 8), 31 Rhodospirillales-specific CSPs (Table 9) and 21 Rickettsiales-specific CSPs (Table 10).
Table 3: CSPs specific to Alphaproteobacteria
Gene ID Accession # Length Gene ID Accession # Length
A. CSPs unique to all Alphaproteobacteria
CC2102 NP_420905 162 CC3319 NP_422113 89
CC3292 NP_422086 224 CC1365 NP_420178 161
B. CSPs unique to Alphaproteobacteria except Rickettsiales
CC1211 NP_420025 167 CC0520 NP_419339 284
CC1886 NP_420693 223 CC3010 NP_421804 216
CC2245 NP_421048 190 CC0100 NP_418919 576
CC3470 NP_422264 253
Table 4: CSPs specific to Rhizobiales
Gene ID Accession # Length Gene ID Accession # Length
A. CSPs unique to Rhizobiales
BQ00720 YP_031797 83 BQ12030 YP_032733 91
BQ07670 YP_032395 336
B. CSPs unique to Brucellaceae, Bartonellaceae, Phyllobacteriaceae, Rhizobiaceae and Aurantimonadaceae
mll0062 NP_101943 107 mll1268 NP_102895 108
mll4068 NP_105027 144 mll2847 NP_104087 186
mll7791 NP_108034 263 mll2898 NP_104130 144
mlr0777 NP_102510 186 mll4298 NP_105201 171
mlr0789 NP_102519 207 mll5001 NP_105743 324
mlr3016 NP_104217 166 mll8359 NP_108472 415
msl6526 NP_107016 80 mlr1823 NP_103319 198
mll0122 NP_101988 349 mlr0094 NP_101965 299
C. CSPs unique to Rhizobiaceae and Phyllobacteriaceae
mll0080 NP_101954 172 mll0459 NP_102252 108
mll0867 NP_102577 168 mll1779 NP_103286 141
mll9619 NP_109472 296 mll6195 NP_106741 174
mlr5174 NP_105883 181 mll8758 NP_106740 205
mll6303 NP_106835 292 mlr3037 NP_104236 281
mll6703 NP_107159 198 mll2007 NP_103455 289
mlr1904 NP_103376 146 mlr1999 NP_103450 111
mlr3274 NP_104418 461 mlr2029 NP_103476 238
mlr4951 NP_105704 84 mlr6601 NP_107075 141
Table 5: CSPs specific to Bradyrhizobiaceae and Xanthobacteraceae
Gene ID Accession # Length Gene ID Accession # Length
A. CSPs unique to Bradyrhizobiaceae and Xanthobacteraceae
bll6014 NP_772654 193 Nwi_1674 YP_318287 185
Nwi_1093 YP_317707 195 Nwi_1705 YP_318318 63
Nwi_1227 YP_317841 106 Nwi_1711 YP_318324 77
Nwi_1786 YP_318399 126 Nwi_1785 YP_318398 422
Nwi_1788 YP_318401 190 Nwi_1793 YP_318406 165
Nwi_2147 YP_318753 82 Nwi_1800 YP_318413 84
B. CSPs unique to Bradyrhizobiaceae
Nwi_2179 YP_318785 161 Nwi_2021 YP_318632 172
Nwi_2432 YP_319038 110 Nwi_2063 YP_318673 186
Nwi_2476 YP_319081 85 Nwi_2064 YP_318674 148
Nwi_2572 YP_319177 171 Nwi_2163 YP_318769 156
Nwi_2623 YP_319228 87 Nwi_2173 YP_318779 109
Nwi_2707 YP_319312 198 Nwi_2183 YP_318789 129
bll5899 NP_772539 131 Nwi_2208 YP_318814 174
blr6106 NP_772746 141 Nwi_2244 YP_318850 164
Nwi_0278 YP_316897 398 Nwi_2247 YP_318853 230
Nwi_0503 YP_317122 108 Nwi_2379 YP_318985 450
Nwi_0528 YP_317147 66 Nwi_2381 YP_318987 63
Nwi_0605 YP_317224 71 Nwi_2414 YP_319020 89
Nwi_0710 YP_317328 248 Nwi_2489 YP_319094 259
Nwi_0925 YP_317539 86 Nwi_2492 YP_319097 122
Nwi_0966 YP_317580 260 Nwi_2500 YP_319105 152
Nwi_1084 YP_317698 385 Nwi_2506 YP_319111 72
Nwi_1092 YP_317706 145 Nwi_2509 YP_319114 98
Nwi_1107 YP_317721 121 Nwi_2531 YP_319136 96
Nwi_1108 YP_317722 121 Nwi_2575 YP_319180 399
Nwi_1336 YP_317949 146 Nwi_2577 YP_319182 135
Nwi_1139 YP_317753 321 Nwi_2588 YP_319193 62
Nwi_1247 YP_317861 113 Nwi_2630 YP_319235 141
Nwi_1270 YP_317883 137 Nwi_2676 YP_319281 217
Nwi_1275 YP_317888 126 Nwi_2677 YP_319282 102
Nwi_1454 YP_318067 160 Nwi_2769 YP_319374 127
Nwi_1498 YP_318111 142 Nwi_2789 YP_319394 112
Nwi_1512 YP_318125 409 Nwi_2984 YP_319586 68
Nwi_1581 YP_318194 99 Nwi_2959 YP_319561 87
Nwi_1582 YP_318195 83 Nwi_3035 YP_319637 582
Nwi_1586 YP_318199 182 Nwi_3140 YP_319739 156
Nwi_1649 YP_318262 101 Nwi_3141 YP_319740 104
Table 6: CSPs specific to Rhodobacterales
Gene ID Accession # Length Gene ID Accession # Length
A. CSPs unique to Rhodobacterales
TM1040_0093 YP_612088 168 TM1040_1988 YP_613982 105
TM1040_0184 YP_612179 289 TM1040_2263 YP_614257 761
TM1040_0236 YP_612231 270 TM1040_2370 YP_614364 221
TM1040_0471 YP_612466 179 TM1040_2425 YP_614419 278
TM1040_0586 YP_612581 329 TM1040_2466 YP_614460 241
TM1040_0587 YP_612582 291 TM1040_2487 YP_614481 272
TM1040_0697 YP_612692 80 TM1040_2582 YP_614576 122
TM1040_0750 YP_612745 154 TM1040_2999 YP_614993 121
TM1040_0752 YP_612747 130 TM1040_3077 YP_611313 175
TM1040_1063 YP_613058 112 TM1040_3749 YP_611978 343
TM1040_1064 YP_613059 135 TM1040_3759 YP_611988 207
TM1040_1247 YP_613242 161 TM1040_3764 YP_611993 276
TM1040_1350 YP_613345 179 TM1040_1558 YP_613553 70
TM1040_1406 YP_613401 181 TM1040_1735 YP_613730 138
TM1040_1567 YP_613562 351 TM1040_2157 YP_613732 360
TM1040_1842 YP_613837 148 TM1040_2443 YP_613733 212
TM1040_1967 YP_613961 732 TM1040_2680 YP_613734 202
TM1040_1844 YP_613839 256
B. CSPs unique to Rhodobacteraceae
TM1040_1099 YP_613094 149 TM1040_3189 YP_611425 93
TM1040_1423 YP_613418 124 TM1040_3202 YP_611438 109
TM1040_1451 YP_613446 194 TM1040_3208 YP_611444 100
TM1040_1986 YP_613980 193 TM1040_3226 YP_611462 270
TM1040_2106 YP_614100 105 TM1040_3529 YP_611763 288
TM1040_2139 YP_614133 102 TM1040_3626 YP_611855 192
TM1040_3075 YP_611311 84
Table 7: CSPs specific to Caulobacterales
Gene ID Accession # Length Gene ID Accession # Length
CC0486 NP_419305 258 CC1066 NP_419882 126
CC2480 NP_421283 253 CC1586 NP_420397 214
CC2764 NP_421560 415 CC2207 NP_421010 222
CC3101 NP_421895 379 CC2628 NP_421428 147
CC0512 NP_419331 289 CC2639 NP_421438 309
CC1064 NP_419880 296
Table 8: CSPs specific to Sphingomonadales
Gene ID Accession # Length Gene ID Accession # Length
Saro_0018 YP_495301 300 Saro_0044 YP_495327 129
Saro_0052 YP_495335 193 Saro_0154 YP_495437 97
Saro_0087 YP_495370 221 Saro_0415 YP_495697 140
Saro_0150 YP_495433 133 Saro_0458 YP_495740 319
Saro_0232 YP_495514 448 Saro_1078 YP_496357 223
Saro_0409 YP_495691 175 Saro_1126 YP_496405 286
Saro_1088 YP_496367 220 Saro_1160 YP_496439 103
Saro_1144 YP_496423 243 Saro_1163 YP_496442 70
Saro_1291 YP_496569 190 Saro_1748 YP_497022 221
Saro_1378 YP_496656 227 Saro_1785 YP_497059 117
Saro_1914 YP_497188 156 Saro_1972 YP_497246 72
Saro_2130 YP_497403 184 Saro_2036 YP_497309 414
Saro_2788 YP_498058 296 Saro_2037 YP_497310 99
Saro_2958 YP_498227 251 Saro_2333 YP_497604 568
Saro_3138 YP_498407 159 Saro_2548 YP_497818 290
Saro_3213 YP_498482 246
Table 9: CSPs specific to Rhodospirillales
Gene ID Accession # Length Gene ID Accession # Length
A. CSPs unique to Acetobacteraceae
GOX0633 AAW60410 347 GOX1222 AAW60983 304
GOX0695 AAW60472 165 GOX1224 AAW60985 207
GOX0963 AAW60735 311 GOX2275 AAW62008 201
GOX1258 AAW61019 186 GOX2316 AAW62049 628
GOX0143 AAW59936 198 GOX2452 AAW62183 143
GOX1616 AAW61357 430 GOX2454 AAW62185 466
GOX0343 AAW60126 232 GOX1233 AAW60994 272
GOX1212 AAW60973 472 GOX2456 AAW62187 497
GOX1215 AAW60976 133
B. CSPs unique to Rhodospirillaceae
Rru_A0125 YP_425217 449 Rru_A2592 YP_427676 231
Rru_A0152 YP_425244 138 Rru_A2828 YP_427912 169
Rru_A0531 YP_425622 588 Rru_A3562 YP_428643 349
Rru_A1689 YP_426776 178 Rru_A3636 YP_428717 464
Rru_A1756 YP_426843 139 Rru_A3662 YP_428743 119
Rru_A2112 YP_427199 237 Rru_A3739 YP_428820 464
Rru_A2510 YP_427597 184 Rru_A3800 YP_428881 153
Table 10: CSPs specific to Rickettsiales
Gene ID Accession # Length Gene ID Accession # Length
A. CSPs unique to Rickettsiales
WD0161 NP_965979 70 WD0715 NP_966474 94
B. CSPs unique to Anaplasmataceae
WD0083 NP_965909 271 WD0821 NP_966574 156
WD0827 NP_966580 191 WD0863 NP_966613 147
WD0157 NP_965975 242 WD0771 NP_966526 460
WD0148 NP_965966 139 WD0764 NP_966520 138
WD0772 NP_966527 202 WD1025 NP_966750 97
WD0412 NP_966202 143 WD1056 NP_966779 92
WD0467 NP_966253 106 WD1220 NP_966932 204
WD0757 NP_966513 290 WD1230 NP_966942 243
C. CSPs unique to Rickettsiaceae
RP030 NP_220424 219 RP187 NP_220576 194
RP192 NP_220581 128
Chapter 4 Discussion
4.1 Confirmation of the uniqueness of CSP
The purpose of this study was to determine whether the CSPs identified in earlier studies can still be regarded as specific for their designated groups, so that results obtained with them in metagenomic analyses will be reliable. The results of the repeated BLAST searches indicate that most of these CSPs are still specific for the previously reported taxonomic units, with a small number of exceptions (Table 1). Among the 265 CSPs examined in this study, only 6 proteins are no longer diagnostic for their originally reported clades. One CSP previously regarded as Rickettsiales-specific (Accession No.: NP_966526) is now specific to Anaplasmataceae (Table 10B). Similarly, four CSPs that were previously regarded as unique to the order Rhodospirillales (Gupta and Mok, 2007a) have now been determined to be uniquely present in either Gluconobacter (Accession No.: AAW60410, AAW60472) or Acetobacteraceae (Accession No.: AAW60735, AAW61019) (Table 9A). Thus, no CSP specific to the order Rhodospirillales as a whole has yet been identified. Another Acetobacteraceae-specific CSP (Accession No.: AAW61951) was found to be a sporadically distributed protein that is also present in some distantly related bacterial groups, including Verrucomicrobia and Planctomycetes. These CSPs were probably misidentified earlier because of the limited number of sequenced Rhodospirillales species available in 2007 (3 Acetobacteraceae species and 2 Rhodospirillaceae species) (Gupta and Mok, 2007a). The majority of CSPs retain their original taxonomic specificity and have been identified in the relevant bacterial species that were fully sequenced after 2007 (Table 1).
4.2 Grouping of CSP into Taxonomic levels
The CSPs used in this work were first identified when sequence information was available for only a limited number of alphaproteobacterial species (Gupta and Mok, 2007a). Hence, an initial undertaking in this work was to confirm their group specificities. BLAST search results confirmed that most of these proteins are still specific for the originally indicated taxonomic clades despite a manyfold increase in the number of sequenced Alphaproteobacteria genomes (Table 1). Most of these signature markers are present in the genomes of newly sequenced Alphaproteobacteria species belonging to the appropriate taxonomic groupings, but not in any other bacteria. Given their observed specificities for different clades of Alphaproteobacteria, these CSPs serve as distinctive markers of the divergence of Alphaproteobacteria clades during evolution, and they provide reliable evidence to support the branching pattern of Alphaproteobacteria in an evolutionary context.
The grouping of molecular markers is based on phylogenetic analysis of the specificity of the CSPs. Each molecular marker is shared by several closely related taxa at a given taxonomic rank, such as class, order or family. Phylum- or genus-specific markers were not considered in this study. Because a sufficient number of CSPs have been identified previously and they represent almost all major clades of Alphaproteobacteria, these CSPs can be divided into 8 groups based on their taxonomic ranks.
They are specific either to Alphaproteobacteria as a whole or to sub-clades of Alphaproteobacteria.
The CSP database consists of three tiers. Tier 1 CSPs are specific to the class Alphaproteobacteria. Tier 2 CSPs are specific to different orders of Alphaproteobacteria such as
Rhizobiales, Rhodobacterales, and Caulobacterales. Tier 3 CSPs are specific to constituent families within these orders. With these three tiers of CSPs, it is possible to diagnose the presence of organisms in a hierarchical manner. Tier 2 CSPs are not evenly distributed among the 6 orders of Alphaproteobacteria. The largest order, Rhizobiales, contains 121 CSPs, almost 45% of all CSPs, while Caulobacterales has only 11 CSPs. The disparity in CSP numbers among orders results from bias in the set of fully sequenced alphaproteobacterial genomes: pathogenic and agriculturally important alphaproteobacterial species have been studied more extensively. Apart from the CSPs unique at the class, order and family levels, phylum-specific and genus-specific CSPs are also available for Proteobacteria and Brucella. Since this project concentrates on the class Alphaproteobacteria, CSPs specific to Betaproteobacteria/Gammaproteobacteria were not considered during database construction. As Brucella is an intracellular pathogen, Brucella-specific CSPs are unlikely to be readily detected in environmental samples, and thus they were not included in the CSP database.
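The hierarchical use of the three tiers can be sketched as a simple lookup. The example below is a minimal illustration, assuming a mapping from CSP accession to its tier and clade. The five accessions are taken from Tables 3, 7 and 10, while the dictionary layout, the example hit set and the function name `detected_clades` are hypothetical and not part of the CSP database itself.

```python
from collections import defaultdict

# Tier 1 = class, tier 2 = order, tier 3 = family (as described in the text).
# A handful of accessions from Tables 3, 7 and 10; the layout is an assumption.
CSP_DB = {
    "NP_420905": (1, "Alphaproteobacteria"),  # Table 3A
    "NP_419305": (2, "Caulobacterales"),      # Table 7
    "NP_965979": (2, "Rickettsiales"),        # Table 10A
    "NP_965909": (3, "Anaplasmataceae"),      # Table 10B
    "NP_220424": (3, "Rickettsiaceae"),       # Table 10C
}

def detected_clades(csp_hits):
    """Group the clades supported by a set of detected CSPs by tier."""
    by_tier = defaultdict(set)
    for accession in csp_hits:
        if accession in CSP_DB:          # ignore accessions outside the database
            tier, clade = CSP_DB[accession]
            by_tier[tier].add(clade)
    return {tier: sorted(clades) for tier, clades in sorted(by_tier.items())}

# Hypothetical set of CSPs with significant hits in one metagenome.
hits = {"NP_420905", "NP_965979", "NP_965909"}
report = detected_clades(hits)
```

Reading the report from tier 1 to tier 3 narrows the diagnosis step by step, here from the class to the order Rickettsiales and then to the family Anaplasmataceae.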
4.3 Future experiments
The next objective of this project is to detect the presence of different
Alphaproteobacteria clades in metagenomic samples. Further experiments are planned as follows:
(i) Selection of suitable metagenomes for Alphaproteobacteria detection. Parameters such as the relative abundance of Alphaproteobacteria in metagenomic datasets will be taken into account in selecting metagenomes. Qualified metagenomes will be used for organism identification.
(ii) Application of the CSP database to metagenomes. This will test whether the CSP database can be used to identify environmental bacteria.
(iii) Comparative analysis of metagenomes for taxonomic profiling of Alphaproteobacteria. Results obtained with CSPs will be compared with those of traditional similarity-based binning to verify whether the CSP-based similarity search produces comparably reliable results.
The experiments described above, once completed, are expected to address the issues and objectives of this project.
Part II Identification of Alphaproteobacteria specific CSPs in metagenomic samples
Chapter 1 Introduction
1.1 Metagenome, environmental genomes
A metagenome is the composite of the genomes of all organisms in an environmental sample (Thomas et al., 2012). Metagenomics investigates the microbial world by applying sequencing and bioinformatics technologies to environmental microbial communities, bypassing the need to isolate and culture individual microbial members (Ghazanfar et al., 2010). Only 1.0% of all microorganisms on Earth can be cultured successfully in artificial media (Ferrari et al., 2005). For instance, soil microbial communities are estimated to comprise 5000~20000 different species, but only 50~200 of them can be isolated and cultured (Handelsman, 2004). Metagenomic studies may therefore provide more information on microbial diversity in the environment (Gilbert and Dupont, 2011).
All sequence-based metagenomic studies follow a similar procedure:
(1) Total genomic DNA is extracted directly from environmental samples such as soil, permafrost, marine water, termite gut or human intestine, without isolation and culturing (Solonenko et al., 2013). Before further analysis, quality control (QC) and duplicate clustering (DC) are performed to reduce potential artificial sequences present in unassembled raw read data. The QC filter calculates the average quality score of each read.
Based on a statistical analysis of the input reads, overall quality is assessed and the high-quality reads are retained for further analysis (Lindner et al., 2013). Duplicate clustering is another important preparatory step that identifies duplicates in the raw read data.
These duplicates are mainly sequencing artifacts in the metagenomic library, such as vector and plasmid sequences. Duplicate clustering also reduces the redundancy of metagenomic reads to yield a non-redundant dataset (Li et al., 2012). Since raw metagenomic reads are almost non-redundant owing to the complexity of environmental bacterial communities, DC does not bias the results of subsequent experiments (Lindner et al., 2013). However, most duplicates in transcriptomes are not artifacts, so running the DC workflow is not recommended for metatranscriptomic datasets (Li et al., 2012).
(2) Metagenomic samples are sequenced either through vector sequencing or direct sequencing (Morgan et al., 2010). In the former protocol, environmental DNA is fragmented into small pieces, which are subsequently inserted into vectors in Escherichia coli to build a metagenomic library (Lussier et al., 2011). Direct sequencing skips metagenomic library construction and sequences the fragmented microbial genomes from environmental samples directly (Kisand et al., 2012).
(3) The purpose of metagenomic assembly is to assemble similar sequences from related genomes while preventing the assembly of similar sequences from unrelated genomes (Ruby et al., 2013). The metagenomic reads are assembled into contigs and scaffolds (Nijkamp et al., 2013). However, metagenomic sequence assembly is a major bottleneck in metagenomic studies. Repeats lead to ambiguity in genome recovery. Insufficient coverage leaves many gaps within genomes. Sequencing errors are an inherent flaw that precedes any bioinformatic analysis (Huang et al., 2012). In many metagenomic studies, analysis is therefore performed directly on raw reads without sequence assembly (Takacs-Vesbach et al., 2013).
(4) Prediction of RNA genes and open reading frames (ORFs) is performed with the basic local alignment search tool (BLAST) (Altschul et al., 1990). BLAST is an algorithm for assessing the similarity between two sequences, and it can be applied to both amino acid and nucleotide sequences. A BLAST search compares query sequences to a database of sequences to identify known sequences whose similarity to the query exceeds a cutoff threshold (Altschul et al., 1990). Apart from alignment-based similarity searches with BLAST, hidden Markov models offer an alternative way to predict rRNA-specific structures, and six-frame translation can be applied to identify all potential ORFs within a DNA sequence of any size (Siepel and Haussler, 2004). Prediction of RNA genes and ORFs uncovers taxonomic information and functional categories in metagenomic reads (Leimena et al., 2013).
(5) After RNA genes and protein-coding ORFs have been predicted, all annotated sequences are classified according to their most likely taxonomic origin and functional category (Strous et al., 2012). For taxonomic clustering, all metagenomic reads showing similar phylogenetic affiliations are placed on a particular taxon in the bacterial taxonomy (Dröge and McHardy, 2012). There are two algorithms for determining the phylogenetic affiliation of metagenomic sequences. One relies on the best hit of a BLAST search to determine the taxonomic origin of a read, while the other, which is more parsimonious and reliable, takes the lowest common ancestor of all significant hits above the threshold to determine the taxonomic placement of the read (Albertsen et al., 2013). For functional binning, all annotated genes are mapped to database resources such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and SEED classifications
based on higher functional categories and subordinate biological subsystems (Mitra et al.,
2011).
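The difference between the two assignment strategies in step (5) can be sketched as follows. This is a minimal illustration, assuming each significant hit carries a bit score and a root-to-leaf lineage. The hit values are invented and the function names are hypothetical, not those of any published binning tool.

```python
def best_hit_assignment(hits):
    """Assign the read to the full lineage of its highest-scoring hit."""
    return max(hits, key=lambda h: h["bitscore"])["lineage"]

def lca_assignment(hits):
    """Assign the read to the lowest common ancestor of all significant hits,
    i.e. the longest shared prefix of their root-to-leaf lineages."""
    lineages = [h["lineage"] for h in hits]
    common = []
    for ranks in zip(*lineages):         # walk the lineages rank by rank
        if all(r == ranks[0] for r in ranks):
            common.append(ranks[0])
        else:
            break
    return common

# Invented significant hits for one metagenomic read.
hits = [
    {"bitscore": 210.0, "lineage": ["Bacteria", "Proteobacteria",
                                    "Alphaproteobacteria", "Rickettsiales",
                                    "Anaplasmataceae"]},
    {"bitscore": 198.5, "lineage": ["Bacteria", "Proteobacteria",
                                    "Alphaproteobacteria", "Rickettsiales",
                                    "Rickettsiaceae"]},
    {"bitscore": 150.2, "lineage": ["Bacteria", "Proteobacteria",
                                    "Alphaproteobacteria", "Rhizobiales",
                                    "Rhizobiaceae"]},
]

best = best_hit_assignment(hits)   # most specific, but relies on a single hit
lca = lca_assignment(hits)         # more conservative consensus placement
```

With these invented hits, the best-hit rule places the read in Anaplasmataceae, whereas the lowest-common-ancestor rule stops at the class Alphaproteobacteria because the hits disagree at the order level.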
Unveiling the taxonomic and functional diversity of the microbial community in a particular environment enables us to answer two questions: “Who is there?” and “What are they doing?” (Handelsman, 2004). By constructing networks that link environmental sequences to microbial attributes, it becomes feasible to predict the potential presence of similar or identical species and functional pathways in other similar environments (Ghai et al., 2013). Understanding the composition of microbial communities and their interaction networks allows identification of the core bacterial metabolic pathways that sustain the balanced development of bacterial communities, thus providing valuable information for environmental scientists seeking to inhibit the production of toxic compounds or enhance the production of beneficial metabolites for the well-being of the ecosystem (Brennerova et al., 2009).
1.2 Taxonomic classification of metagenomic reads: methods and challenges
Measuring species diversity in metagenomes answers the question “who is there” (Chistoserdova, 2013). To connect each metagenomic sequence to a particular taxon, binning is a necessary process. Traditional binning comprises two approaches: composition-based binning and similarity-based binning (Dröge and McHardy, 2012).
In composition-based binning, metagenomic software is used to extract the inherent features of sequences, such as GC content, codon usage bias and tetranucleotide frequency (Roller et al., 2013; Teeling et al., 2004). These approaches identify the
differentiation of new species in the environment, the so-called operational taxonomic units (OTUs), because most species in natural environments have not been successfully cultured and are beyond laboratory characterization (Wooley et al., 2010).
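Two of the compositional features named above, GC content and tetranucleotide frequency, can be computed directly from a read. The following is a minimal sketch under simple assumptions (overlapping windows, a single strand, 4-mers containing ambiguous bases skipped). The function names and the example read are hypothetical and do not reproduce any specific binning software.

```python
from collections import Counter
from itertools import product

def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def tetranucleotide_frequencies(seq):
    """Relative frequency of every 4-mer (overlapping windows, ACGT only)."""
    seq = seq.upper()
    windows = [seq[i:i + 4] for i in range(len(seq) - 3)]
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=4))
    return {k: (counts[k] / total if total else 0.0) for k in all_kmers}

# Hypothetical read used only to exercise the functions.
read = "ATGCGCGCATTA"
gc = gc_content(read)
tetra = tetranucleotide_frequencies(read)
```

Each read is thereby reduced to a 256-dimensional frequency vector plus a GC value, and reads with similar vectors can then be clustered without reference to any database.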
Similarity-based binning, also called alignment-based binning, matches metagenomic sequences to reference databases. Methods such as BLAST (Altschul et al., 1990), PhymmBL (Brady and Salzberg, 2009) and MetaPhlAn (Segata et al., 2012) are employed in metagenomic research. These methods not only identify and measure the relative abundance and diversity of known microbial organisms in the environment, but also reveal the functional impact of bacterial communities, because extensive laboratory studies on individual microbial species have characterized the functions of genes and proteins within these cultivable species (Leung et al., 2011).
Both binning strategies are important and complement each other in metagenomic taxonomic profiling. Composition-based binning discovers unknown species in natural environments, while similarity-based binning investigates known species (Wu and Ye, 2011). However, neither can fully reveal environmental species diversity, given that 99% of environmental microbes have not yet been cultured (Schloss and Handelsman, 2005).
Usually, similarity-based binning is more accurate and sensitive than composition-based binning, but its performance depends heavily on the reference resources (Xia et al., 2011). Composition-based binning clusters all sequences into groups, but it fails to associate metagenomic reads with individual bacterial taxa (Thomas et al., 2012).
Other issues, such as run time and computing requirements, remain to be resolved (Thomas et al., 2012). For example, a BLASTX analysis of permafrost soil samples comprising 176 million Illumina DNA reads (Mackelprang et al., 2011) cost 800000 CPU hours on a workstation server (64 cores, 512 GB main memory) (Huson and Xie, 2013). Given these limitations, a hybrid of the two approaches is preferable for accurate taxonomic classification (Mohammed et al., 2011).
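To put the BLASTX cost quoted above in perspective, the figures can be converted into CPU time per read and an idealized wall-clock time. This is a back-of-the-envelope sketch only: it assumes the work is spread perfectly across all 64 cores, which real runs do not achieve, and the variable names are illustrative.

```python
cpu_hours = 800_000          # total BLASTX cost quoted in the text
reads = 176_000_000          # Illumina reads in the permafrost dataset
cores = 64                   # cores of the server mentioned in the text

# CPU time spent per read, in seconds.
cpu_seconds_per_read = cpu_hours * 3600 / reads

# Idealized wall-clock time if the work were spread perfectly over 64 cores.
wall_clock_hours = cpu_hours / cores
wall_clock_days = wall_clock_hours / 24
```

Under these assumptions each read costs roughly 16 CPU-seconds, and a single 64-core machine would need about 12500 hours, or well over a year, to finish the search.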
1.3 Application of metagenomics
Metagenomics has a broad range of potential applications that translate current knowledge into solutions for practical problems. Some pioneering attempts have proved successful in fields such as energy, agriculture, environmental science, medicine, and engineering (National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications, 2007).
Microbial communities in the human gut have an essential impact on human health. However, the composition of gastrointestinal microbes and the mechanisms by which they influence the human body remain cryptic (Bäckhed et al., 2012). In view of this, metagenomic technology is used to characterize the human microbiome (Lepage et al., 2013). One of the largest projects involving the human gut microbiome, Metagenomics of the Human Intestinal Tract (MetaHIT), was initiated by the European Commission to explore the relationships between changes in the human microbiome and human health by gathering genomic sequences of all microbial organisms on 15~18 different body sites from 250 European individuals (Qin et al., 2010). The primary goal of the
MetaHIT project is to determine the core human microbiome that maintains human health (Ursell et al., 2012). Another clinical aim of the MetaHIT project is to characterize the phylogenetic variation of gastrointestinal microbes between healthy people and patients suffering from diseases and disorders such as Crohn’s disease, irritable bowel syndrome (IBS) and obesity (Moloney et al., 2013). The results showed that two bacterial phyla, Bacteroidetes and Firmicutes, dominate the distal gut, comprising >90% of known bacteria (Le Chatelier et al., 2013). Gene frequency profiling identified 1244 metagenomic functional clusters of crucial importance to the health of the human intestinal tract, which fall into two categories: housekeeping clusters and intestine-specific clusters (Qin et al., 2010). Housekeeping functions are indispensable in the human gut and are required by all microbial members because they play a key role in the main metabolic pathways, including the citric acid cycle and amino acid synthesis, whereas gut-specific functions deal with adhesion to host proteins and sugar harvesting (Qin et al., 2010). One of the findings regarding IBS is that the number of microbiome genes in patients is 25% lower than in healthy controls, and bacterial diversity is also lower in IBS patients. This strongly suggests that gut-associated disease and obesity result from a reduction in gut microbiome diversity (Qin et al., 2010). Despite the potential of human gut metagenomics, it is notable that only 7.6~21.2% of the metagenomic reads can be matched to bacterial genomes in GenBank, and many more novel bacterial species in the human distal gut have not yet been studied (Qin et al., 2010). The characterization of the unknown microbiome may
shed light on new medical therapies for human gastrointestinal diseases (Kinross et al., 2011).
Metagenomics also advances the search for new green energy sources. Bioenergy is expected to be the next-generation fuel that could replace fossil fuels (Hess et al., 2011). Biofuels are derived from biomass conversion, which transforms plant material such as grain, starch, sugar, oil, cellulose, hemicellulose, and lignin into cellulosic ethanol, methane and hydrogen (van der Lelie et al., 2012). The transformation process relies on microbial consortia from host-associated and other habitats, ranging from herbivorous mammals, insects and birds to rainforest soils (Allgaier et al., 2010). The microbial communities in these habitats share core cellulolytic genes coding for enzymes that degrade biomass (Scully et al., 2013). In view of the importance of such enzymes and the vast pool of environmental microbes, metagenomics aims to analyze complex microbial consortia in order to discover novel enzymes that fulfil industrial requirements: biomass-deconstructing enzymes with higher productivity and lower cost (Hess et al., 2011). Meanwhile, metagenomic technologies permit comparative analysis between convergent microbial ecosystems, which in turn improves the understanding of different biomass degradation mechanisms (Lu et al., 2012). Metagenomic approaches not only identify the diverse enzymes of interest, but also help to control the activation and repression of these catalytic processes (Zhang et al., 2013).
Industrial-scale biofuel production may reduce the release of greenhouse gases and improve environmental quality (Sommer et al., 2010).
Microbial communities in soils are recognized as the most diverse and complex bacterial ecosystems, with 10^9~10^10 microbial cells in one gram of soil (Vogel et al., 2009). In spite of the enormous amount of sequence information held per unit of soil (one gigabase per gram of soil), the taxonomic composition and functional categories are poorly understood (Vogel et al., 2009). Many bacteria develop stable symbiotic relationships with specific plants and provide diverse ecological services as symbionts or epibionts for plant growth, including atmospheric nitrogen fixation, nutrient cycling, pathogen resistance, and trace element enrichment (Rascovan et al., 2013). Functional metagenomic pipelines seek to decipher the complex interactions and communication between soil microbes and plants by screening for novel genes of interest in microbial communities (Rout and Callaway, 2012). Insight into the rare, uncultivable bacterial members responsible for mutualism and intra-species competitive inhibition also offers a new perspective on plant disease resistance and improved farming practice (Rosen et al., 2009). The application of metagenomic techniques to agriculture enables the improvement and maintenance of crop health, provided that the dynamic equilibrium between microbes and plants is kept under control (Rascovan et al., 2013).
Apart from the applications described above, metagenomic approaches also address environmental issues. In the field of environmental remediation, new policies and strategies based on metagenomic principles are advocated for monitoring the impact of pollutants and cleaning up environmental contamination (Yergeau et al., 2012). One metagenomic project targets a wastewater treatment plant in which microbial organisms
remove excess inorganic phosphate from wastewater, a treatment process called enhanced biological phosphorus removal (EBPR) (Nielsen et al., 2012). Another wastewater treatment project, in a common effluent treatment plant (CETP), investigates the activated biomass formed by the particular microbial communities in this niche (Kapley et al., 2007). Metagenomic studies aim to identify novel bacterial members in these niches and to explore new catabolic pathways that help reduce the chemical oxygen demand (COD), so that wastewater treated by the activated sludge process can subsequently be released into the environment safely (Ravi P More, 2013). Although the metabolic traits of this process are not yet well understood, a better understanding of how microbial communities deal with pollutants provides a theoretical foundation for environmental scientists to assess sites vulnerable to contaminants and to develop appropriate strategies that increase the chance of removing pollution for habitat rehabilitation (Gomez-Alvarez et al., 2012).
1.4 Project objectives
To profile bacterial diversity and functional categories, it is necessary to map metagenomic sequences to reference databases (Mitra et al., 2011). Since 99% of environmental microbial species cannot be cultured in laboratories, it is a major challenge to identify wild species based on the limited genomic information from sequenced, cultured organisms (Albertsen et al., 2013). Based on the results of the first project, several molecular markers have been found to be present in newly sequenced species of different clades within Alphaproteobacteria. Given that these molecular markers are
ubiquitously present in Alphaproteobacteria species of the corresponding clades, it is highly likely that environmental Alphaproteobacteria also carry these signatures.
The second project consists of 3 related components. The first step is to choose several metagenomic samples that are likely to contain Alphaproteobacteria, for which a large-scale screening test was performed on 200 metagenomic samples. Subsequently, a more detailed and comprehensive profiling of Alphaproteobacteria clades was performed on the selected metagenomes. Finally, a comparative analysis will be carried out to compare the relative abundance of Alphaproteobacteria among the selected metagenomes. Once these experiments are completed, the overall performance of molecular signatures in identifying environmental microorganisms can be assessed, and a new molecular marker-based method can be developed to determine the taxonomic classification of metagenomes.
Chapter 2 Materials and methods
2.1 Metagenome selection
A systematic tBLASTn search was conducted with an E-value threshold of 1 × 10⁻⁴, using 4 Alphaproteobacteria class-specific CSPs as queries against 201 metagenomes. These 201 metagenomic samples comprise 15,049,531 genomic sequences in total and are classified as either ecological or organismal metagenomes on the NCBI genomic BLAST webpage. All significant hits passing the threshold were collected, and a metagenome taxonomy report was generated to show the distribution of CSPs across candidate metagenomes. Metagenomes containing sequences highly similar to the CSPs suggest that Alphaproteobacteria may be abundant in those habitats, and were therefore preferred in this study. Based on sequence similarity and the number of positive hits found in the candidate metagenomes, qualifying metagenomic projects were selected for subsequent Alphaproteobacteria profiling.
2.2 Identification of CSP in metagenomic samples
A systematic tBLASTn search was performed with all 264 CSPs against the 10 qualifying metagenomic projects. An E-value threshold of 1 × 10⁻⁴ was used, together with the default low-complexity filter, to eliminate statistically significant but biologically uninteresting hits from the BLAST output (Coletta et al., 2010). The best bit score and the number of positive hits for each CSP were then collected, to capture both the quality and the quantity of the BLAST results for further analysis. The bit score is a numerical value that describes the overall quality of an alignment: the higher the score, the better the alignment (Altschul et al., 1990).
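The collection of best bit scores and hit counts can be sketched in Python. This is a minimal sketch, assuming the search results are available as BLAST+ tabular output (`-outfmt 6`, whose 12 default columns end with the E-value and the bit score). The searches in this study were run through the NCBI genomic BLAST webpage, so the file format, the function name `summarize_hits` and the example rows are illustrative assumptions, not the pipeline actually used.

```python
import csv
import io
from collections import defaultdict

E_VALUE_CUTOFF = 1e-4  # E-value threshold stated in the text


def summarize_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return {query_id: (best_bit_score, n_significant_hits)}.

    Assumes the 12 default columns of BLAST+ tabular output:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore.
    """
    best = defaultdict(float)
    count = defaultdict(int)
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, comment or malformed lines
        query, evalue, bitscore = row[0], float(row[10]), float(row[11])
        if evalue <= cutoff:
            count[query] += 1
            best[query] = max(best[query], bitscore)
    return {query: (best[query], count[query]) for query in count}


# Invented example rows (hypothetical, not data from this study).
example = "\n".join([
    "CSP_A\tcontig_1\t62.1\t120\t45\t1\t1\t120\t10\t369\t1e-30\t110.0",
    "CSP_A\tcontig_2\t70.3\t118\t35\t0\t2\t119\t4\t357\t1e-50\t203.0",
    "CSP_A\tcontig_3\t31.0\t40\t27\t1\t5\t44\t90\t209\t0.5\t28.0",
    "CSP_B\tcontig_4\t55.0\t90\t40\t1\t1\t90\t1\t270\t2e-12\t70.0",
])
summary = summarize_hits(example)
print(summary)
```

In this example the third row is discarded because its E-value exceeds the threshold, so the hypothetical query `CSP_A` keeps two significant hits and the higher of their two bit scores.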
2.3 Comparative analysis of Alphaproteobacteria in metagenomes
The distribution of Alphaproteobacteria clades was plotted based on the best bit scores and the number of positive hits for all CSPs, and a heatmap was created from the distribution pattern of the 264 CSPs. The average bit score and the total number of positive hits for all CSPs in each metagenome were also calculated, to indicate the overall relative abundance of Alphaproteobacteria in each metagenome. A comparative analysis of the CSPs was then conducted to show the proportions of the different Alphaproteobacteria clades across the 10 metagenomes. To validate the reliability of the CSP-based methodology, the relative abundances obtained with CSPs in this study were compared with the similarity-based taxonomic classifications provided by public metagenomic servers. Taxonomic hit distributions for the 10 metagenomes were obtained from either the Metagenome Rapid Annotation using Subsystem Technology server (MG-RAST) (Meyer et al., 2008) or Integrated Microbial Genomes with Microbiome Samples (IMG/M) (Markowitz et al., 2012).
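The per-metagenome summary described above can be illustrated with a short sketch. The text does not state whether CSPs without any hit are included in the average bit score; the code below assumes they are excluded, and the function name `summarize_metagenome` is hypothetical. The example values are taken from the first value column of Figure 1 for four CSPs.

```python
def summarize_metagenome(per_csp):
    """per_csp maps CSP accession -> (best_bit_score, n_significant_hits)."""
    detected = [(score, hits) for score, hits in per_csp.values() if hits > 0]
    total_hits = sum(hits for _, hits in detected)
    # Assumption: the average is taken over CSPs with at least one hit.
    mean_best = sum(score for score, _ in detected) / len(detected) if detected else 0.0
    return mean_best, total_hits


# Four CSPs from the first value column of Figure 1.
first_column = {
    "NP_422086": (203, 23),
    "NP_420178": (263, 11),
    "YP_031797": (70, 3),
    "NP_965979": (0, 0),
}
mean_best, total_hits = summarize_metagenome(first_column)
print(f"mean best bit score: {mean_best:.1f}, total hits: {total_hits}")
```

Including the zero-hit CSPs in the average instead would lower the mean for metagenomes in which many CSPs are absent, so the choice should be stated whenever such averages are compared.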
Table 11 Characteristics of Metagenomic Datasets Investigated in this Study
Metagenomic project    # of contigs¹    Total length (Mb)    Average length (bp)    Raw sequencing data (Gb)    α-proteobacteria %²
Wastewater             172,804          421.6                2,440                  157.50                      16.8%
Marine                 54,509           77.4                 1,420                  1.17                        30.7%
Bioreactor             748,672          317.9                425                    1.44                        17.2%
Compost                218,885          104.9                479                    0.28                        27.0%
Activated sludge       36,270           27.9                 769                    N/A                         N/A
Whale fall             84,317           89.6                 1,062                  0.14                        23.8%
Freshwater sediment    252,427          214.8                850                    8.20                        5.3%
Microbial mat          112,984          84.2                 745                    0.12                        21.9%
Hydrothermal vent      26,573           24.9                 937                    0.03                        19.8%
Groundwater            37,367           104.7                2,801                  7.20                        4.6%
1. Contigs are assembled metagenomic sequences.
2. The percentage of α-proteobacteria is calculated as the ratio of reads annotated to α-proteobacteria to all metagenomic reads.
Chapter 3 Results
3.1 Metagenome selection
A total of 201 metagenomic datasets were accessible through the NCBI genomic BLAST webpage. To determine which of these datasets were dominated by Alphaproteobacteria, a large-scale BLAST search was undertaken with Alphaproteobacteria-specific CSPs. The results indicated that Alphaproteobacteria were most likely present in roughly 10 metagenomic projects. The 4 CSPs used for preliminary screening have been confirmed to be the most conserved and specific molecular signatures shared by all available Alphaproteobacteria species (Table 3A). The tBLASTn searches with these CSPs identified 220 BLAST hits in 10 metagenomic projects (the accession number on NCBI and the project ID on the MG-RAST or IMG/M server are listed in brackets): Microbial mat metagenome (PRJNA29795, 4440964.3) (Harris et al., 2012), Marine metagenome (PRJNA16339, 4443701.3) (DeLong et al., 2006), Wastewater metagenome (PRJNA167559, 4455295.3) (Mielczarek et al., 2013), Freshwater sediment metagenome (PRJNA30541, 2006543005) (Kalyuzhnaya et al., 2008), Hydrothermal vent metagenome (PRJNA37895, 4461585.3) (Brazelton and Baross, 2009), Bioreactor metagenome (PRJNA73603, 20220044000) (van der Lelie et al., 2012), Activated sludge metagenome (PRJNA61401, N/A) (Kapley et al., 2007), Compost metagenome (PRJNA41493, 4446153.3) (Allgaier et al., 2010), Whale fall metagenome (PRJNA81625, 4441619.3) (Tringe et al., 2005) and Groundwater metagenome (PRJNA114691, 3300000815) (Wrighton et al., 2012). Although positive hits were also sporadically distributed in other metagenomic projects, such as the freshwater metagenome and the mosquito metagenome, these two projects were not pursued further because neither the bit scores nor the numbers of hits were sufficient to classify them as Alphaproteobacteria-abundant metagenomes.
The 10 metagenomic projects described above comprised more than 50 metagenomic samples, so one representative sample was selected from each project. Six of the 10 projects (the wastewater, hydrothermal vent, activated sludge, compost, bioreactor and groundwater metagenomes) each contained only one sample, which automatically became the representative sample for that project. The whale fall, freshwater sediment and microbial mat metagenomes were made up of 3, 5 and 10 datasets respectively. Since the samples within each of these projects addressed a single topic, the datasets in each project could be combined and treated as a single sample. The marine metagenome comprised 35 metagenomic samples gathered from around the world. Further analysis indicated that the most significant hits for Alphaproteobacteria-specific CSPs were found in the North Pacific Subtropical Gyre Planktonic Microbial Community, so it was selected as the representative marine metagenome sample in this project.
The sample size of each metagenomic project, the number of contigs and the total length of all reads were collected from the WGS master webpage (shotgun assembly sequences for genome and transcriptome). Table 11 shows that the number of assembled sequences per metagenomic project ranges from tens of thousands to hundreds of thousands. The total length of the metagenomic reads ranges from tens of millions to hundreds of millions of base pairs. The average read length, calculated as the total length divided by the number of contigs, ranges from roughly 400 bp to 2,800 bp. These metagenomic reads are appropriate for CSP-based similarity searches because their average length is comparable to the length of the CSPs. To gain an overall understanding of how abundant Alphaproteobacteria are assumed to be in these metagenomic projects, the organism abundance of each metagenome was retrieved from the MG-RAST and IMG/M servers. The numbers of reads annotated to Alphaproteobacteria were collected to calculate the relative proportion of all Alphaproteobacteria species in each metagenome (Table 11). The relative abundance of Alphaproteobacteria, based on the proportion of annotated metagenomic reads, ranges from roughly 5% to 30%, which indicates that Alphaproteobacteria are one of the major groups in the selected metagenomic projects.
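The average-length calculation can be checked directly against Table 11. The sketch below assumes that 1 Mb equals 1,000,000 bp; the dictionary name `TABLE_11` and the function name `average_length_bp` are hypothetical, while the numbers are copied from the table.

```python
# (number of contigs, total length in Mb, reported average length in bp), from Table 11
TABLE_11 = {
    "Wastewater": (172_804, 421.6, 2440),
    "Marine": (54_509, 77.4, 1420),
    "Bioreactor": (748_672, 317.9, 425),
    "Compost": (218_885, 104.9, 479),
    "Activated sludge": (36_270, 27.9, 769),
    "Whale fall": (84_317, 89.6, 1062),
    "Freshwater sediment": (252_427, 214.8, 850),
    "Microbial mat": (112_984, 84.2, 745),
    "Hydrothermal vent": (26_573, 24.9, 937),
    "Groundwater": (37_367, 104.7, 2801),
}


def average_length_bp(n_contigs, total_length_mb):
    """Average contig length in bp, assuming 1 Mb = 1,000,000 bp."""
    return total_length_mb * 1_000_000 / n_contigs


for name, (n_contigs, total_mb, reported) in TABLE_11.items():
    computed = average_length_bp(n_contigs, total_mb)
    print(f"{name:20s} computed {computed:7.1f} bp, reported {reported} bp")
```

Because the total lengths in the table are rounded to 0.1 Mb, the recomputed averages can differ from the reported ones by a fraction of a base pair.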
3.2 Identification of CSPs in metagenomic samples
After the 10 appropriate metagenomes had been selected, the distribution of all CSPs in these metagenomic reads was investigated. The best bit scores and the numbers of significant hits obtained for 16 CSPs unique to different clades of Alphaproteobacteria are summarized in Figure 1. In addition, two heatmaps were built to depict the detailed distribution of Alphaproteobacterial clades in the 10 metagenomes (Figures 2 and 3). In this study, 11 CSPs specific for Alphaproteobacteria at the class level (i.e. found specifically in all or most Alphaproteobacteria) were used to identify the presence of Alphaproteobacteria in metagenomic samples. Significant hits for these CSPs, with high bit scores, were identified in the metagenomic datasets, and in most cases multiple metagenomic reads yielded positive hits. However, both the total number of significant hits and the bit scores for these 11 CSPs varied considerably among the metagenomic datasets. Notably, the bioreactor, wastewater and whale fall metagenomes contained more Alphaproteobacterial sequences than the other 7 metagenomes (Figures 2 and 3). These differences may be related to the size of the datasets as well as to the relative abundance of Alphaproteobacteria in these metagenomes. These findings indicate that CSPs specific for the class Alphaproteobacteria are present in all 10 metagenomic datasets and are particularly enriched in three of them.
Multiple CSPs that are specific either for all Rhizobiales or for two major clades within this order have been identified: 3 CSPs specific for Rhizobiales, 16 CSPs specific for Brucellaceae, Bartonellaceae, Phyllobacteriaceae, Rhizobiaceae and Aurantimonadaceae (called clade-1 Rhizobiales), and 18 CSPs specific for Rhizobiaceae and Phyllobacteriaceae. The tBLASTn searches with these CSPs showed that significant hits were highly concentrated in the wastewater metagenome, followed by the bioreactor, compost and whale fall metagenomes. In the other metagenomes studied, these CSPs were either sporadically distributed or entirely absent (Figures 2 and 3).
Another important sub-clade within Rhizobiales is the Bradyrhizobiaceae and Xanthobacteraceae group. All 74 CSPs for this group were confirmed to be consistently present either in the Bradyrhizobiaceae family alone or in both the Bradyrhizobiaceae and Xanthobacteraceae families. The tBLASTn results indicated that their distribution differed somewhat from that of the clade-1 Rhizobiales-specific CSPs. In this case, the maximum number of BLAST hits was observed for the bioreactor metagenome, while the marine and compost metagenomes also yielded significant hits (Figures 2 and 3).
The distribution of CSPs specific for the orders Rhodobacterales, Caulobacterales, Sphingomonadales, Rhodospirillales and Rickettsiales was investigated in turn. For the 35 Rhodobacterales-specific CSPs, multiple significant hits were detected in 6 metagenomes: wastewater, marine, microbial mat, hydrothermal vent, whale fall and groundwater. Furthermore, the average bit scores of the CSPs in these 6 metagenomes were higher than those in the other 4 metagenomes, which lends confidence to these results and indicates that Rhodobacterales species are important constituents of these metagenomes (Table 11). Likewise, the distribution of significant BLAST hits for the 11 Caulobacterales-specific CSPs indicated that Caulobacterales were likely enriched in the bioreactor, wastewater and whale fall metagenomes (Figures 2 and 3).
The results of the tBLASTn searches with the 31 Sphingomonadales-specific CSPs in the 10 metagenomes are shown in Figures 2 and 3. Hits for these CSPs were highly concentrated in the bioreactor and wastewater metagenomes and moderately represented in the marine and whale fall metagenomes. It was inferred that Sphingomonadales species may prefer engineered or marine habitats over the other environments examined in this study.
Analysis of the tBLASTn results with Rhodospirillales-specific CSPs indicated that Rhodospirillales were most abundant in the bioreactor metagenome, although the corresponding CSPs were also present in low numbers in the marine, compost, freshwater sediment and whale fall metagenomes (Figures 2 and 3).
In addition, 21 CSPs specific for the order Rickettsiales were used to detect potential pathogens in the environmental datasets. Only 2 significant hits were observed, one in the wastewater metagenome and one in the freshwater sediment metagenome (Figures 2 and 3). This is probably because intracellular pathogenic bacteria are not common in environmental metagenomic samples.
3.3 Comparative analysis of Alphaproteobacteria in metagenomes
The best bit scores and the numbers of significant hits from the BLAST searches with all CSPs were collected. In summary, 4 metagenomic datasets enriched in Alphaproteobacteria were identified: the bioreactor, wastewater, marine and whale fall metagenomes. The experimental results for the other 7 metagenomes are shown in Figure 4. All of these significant hits were derived from either Alphaproteobacteria class-specific CSPs or clade-specific CSPs. For instance, of the 410 hits found in the bioreactor metagenome, 125 came from the 11 Alphaproteobacteria class-specific CSPs, 73 from Bradyrhizobiaceae-specific CSPs and 98 from Sphingomonadales-specific CSPs. In the wastewater metagenome, the 551 significant hits were mainly from CSPs specific for the class Alphaproteobacteria (179 hits), Rhizobiales (130 hits), Rhodobacterales (87 hits) and Sphingomonadales (109 hits). In the whale fall metagenome, more than 75% of the significant hits were derived from Alphaproteobacteria class-specific CSPs (109 hits) and Rhodobacterales-specific CSPs (114 hits).
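The hit counts quoted above can be turned into shares of the total with a few lines of Python. This is only a sketch of the arithmetic: the function name `hit_shares` is hypothetical, and these raw shares of significant hits are not necessarily the same quantity as the relative abundances reported in Figures 5 and 6, whose exact derivation is not given here.

```python
def hit_shares(total_hits, hits_by_group):
    """Return each group's percentage of the total significant hits."""
    return {group: 100.0 * n / total_hits for group, n in hits_by_group.items()}


# Counts quoted in the text for two metagenomes.
bioreactor = hit_shares(410, {
    "Alphaproteobacteria (class)": 125,
    "Bradyrhizobiaceae": 73,
    "Sphingomonadales": 98,
})
wastewater = hit_shares(551, {
    "Alphaproteobacteria (class)": 179,
    "Rhizobiales": 130,
    "Rhodobacterales": 87,
    "Sphingomonadales": 109,
})

for label, shares in (("bioreactor", bioreactor), ("wastewater", wastewater)):
    for group, pct in shares.items():
        print(f"{label:11s} {group:28s} {pct:5.1f}%")
```

The listed groups do not add up to the full totals (296 of 410 hits for the bioreactor, 505 of 551 for wastewater); the remaining hits belong to clades that are not itemized in the text.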
By calculating the total number of significant hits for all CSPs and grouping them by order, a comparative analysis was made to show the detailed distribution of Alphaproteobacteria clades in each metagenome. According to Figure 5, Alphaproteobacteria were most abundant in the bioreactor, wastewater and whale fall metagenomes, not only for the class as a whole but also for its individual orders. For example, in the bioreactor metagenome, the relative abundances of Rhizobiales, Bradyrhizobiaceae, Sphingomonadales and Rhodospirillales were higher than in the other metagenomes. Although the wastewater metagenome was also enriched in Alphaproteobacteria, its composition differed from that of the bioreactor metagenome: Rhizobiales, Rhodobacterales and Sphingomonadales were the dominant groups of Alphaproteobacteria, whereas Bradyrhizobiaceae were less abundant than in the bioreactor. According to the CSP distribution, only Rhodobacterales and Sphingomonadales were abundant in the whale fall metagenome, although other clades were also moderately present.
To compare organism abundance between CSP-based binning and similarity-based binning, the taxonomic classifications from the MG-RAST and IMG/M servers were collected (Figure 6). Four metagenomes (bioreactor, wastewater, whale fall and marine) were compared. In the bioreactor metagenome, the relative abundances of Alphaproteobacteria clades based on CSP distribution were 5% for Rhizobiales, 11% for Bradyrhizobiaceae, 2% for Rhodobacterales, 3% for Caulobacterales, 14% for Sphingomonadales and 7% for Rhodospirillales. The relative proportions for the same metagenome derived from the IMG/M server were 7% for Rhizobiales, 11% for Bradyrhizobiaceae, 2% for Rhodobacterales, 6% for Caulobacterales, 11% for Sphingomonadales and 10% for Rhodospirillales. The two sets of results were highly correlated. The organism abundances for the other 3 metagenomes on the MG-RAST server were also similar to the results based on the CSP searches (Figure 6).
Metagenome labels: Marine, Compost, Whale fall, Bioreactor, Wastewater, Groundwater, Microbial mat, Activated sludge, Hydrothermal vent, Freshwater sediment

Clade specificity      CSP          Best bit score
Alphaproteobacteria    NP_422086    203  205  212  132  206  190  180  191  209  143
Alphaproteobacteria    NP_420178    263  124   94  102   83   80   61   92  118   96
Rhizobiales            YP_031797     70   77   51   67    0    0   53    0   56    0
Rhizobiales            YP_032395    169  265    0  107    0    0  162    0  106    0
Bradyrhizobiaceae      YP_317328    125    0  118  145    0    0    0    0    0    0
Bradyrhizobiaceae      YP_317580    128    0  111  147    0    0    0    0    0    0
Rhodobacterales        YP_614257      0  721    0    0    0  333  199  330  311  719
Rhodobacterales        YP_611978      0  273  249    0  183    0    0    0  286  196
Caulobacterales        NP_419305     82   62   85    0    0    0    0    0   81    0
Caulobacterales        NP_421895    127  291   73    0    0    0  115    0  119    0
Sphingomonadales       YP_495301    116  216  115    0    0    0    0    0  225    0
Sphingomonadales       YP_496569    110  139   70    0    0    0    0  105   77    0
Rhodospirillales       AAW62049      70    0  471    0    0    0    0    0  158    0
Rhodospirillales       YP_425217    173    0  177   54   68    0    0    0  171    0
Rickettsiales          NP_965979      0    0    0    0    0    0    0    0    0    0
Rickettsiales          NP_966474      0    0    0    0    0    0    0    0    0    0

Clade specificity      CSP          Significant hits
Alphaproteobacteria    NP_422086     23   18    7    2    3    2    1    3    8    1
Alphaproteobacteria    NP_420178     11   18    6    3    3    2    2    2    9    2
Rhizobiales            YP_031797      3   11    1    2    0    0    1    0    1    0
Rhizobiales            YP_032395      4   11    0    5    0    0    1    0    2    0
Bradyrhizobiaceae      YP_317328      2    0    1    1    0    0    0    0    0    0
Bradyrhizobiaceae      YP_317580      2    0    1    1    0    0    0    0    0    0
Rhodobacterales        YP_614257      0    1    0    0    0    1    1    2    3    1
Rhodobacterales        YP_611978      0    1    2    0    2    0    0    0    3    1
Caulobacterales        NP_419305      2    1    1    0    0    0    0    0    3    0
Caulobacterales        NP_421895     11    6    1    0    0    0    1    0    1    0
Sphingomonadales       YP_495301      4    5    2    0    0    0    0    0    3    0
Sphingomonadales       YP_496569      3    5    1    0    0    0    0    1    1    0
Rhodospirillales       AAW62049       1    0    2    0    0    0    0    0    1    0
Rhodospirillales       YP_425217      9    0    2    1    1    0    0    0    2    0
Rickettsiales          NP_965979      0    0    0    0    0    0    0    0    0    0
Rickettsiales          NP_966474      0    0    0    0    0    0    0    0    0    0
Figure 1: Summary heatmap of 16 Alphaproteobacteria-specific CSPs in 10 metagenomes. The upper heatmap specifies the best bit score within each metagenome assigned to the listed taxa. The lower heatmap indicates the number of significant hits within each metagenome assigned to the listed taxa. Color formatting indicates high and low values: zero values are in green, values from 1 to 10 are in yellow, and red indicates the highest values in the chart.
CSP, with metagenome labels: Marine, Compost, Whale fall, Bioreactor, Wastewater, Groundwater, Microbial mat, Activated sludge, Hydrothermal vent, Freshwater sediment
NP_420905 15 17 3 3 2 1 2 2 7 2 NP_422086 23 18 7 2 3 2 1 3 8 1
NP_422113 8 20 1 2 2 2 2 0 8 1 NP_420178 11 18 6 3 3 2 2 2 9 2 NP_420025 9 17 5 1 1 1 3 0 11 0 NP_420693 6 13 5 1 1 1 1 0 5 1 NP_421048 14 17 5 3 1 1 3 0 11 0 proteobacteria
- NP_422264 6 18 1 1 0 1 1 1 8 2 α NP_419339 10 11 0 0 2 0 1 2 12 1 NP_421804 15 21 8 3 0 0 1 1 9 1 NP_418919 8 9 4 4 2 2 1 3 21 1 YP_031797 3 11 1 2 0 0 1 0 1 0 YP_032733 1 5 0 1 0 0 0 1 1 0 YP_032395 4 11 0 5 0 0 1 0 2 0 NP_101943 0 4 0 0 0 0 0 0 0 0 NP_105027 0 3 0 0 0 1 0 0 0 0 NP_108034 4 8 0 2 0 0 1 1 2 0
NP_102510 0 3 0 0 0 0 0 0 0 0 NP_102519 0 5 1 1 0 0 0 0 0 0 NP_104217 0 4 1 0 0 0 0 0 0 0 NP_107016 2 4 0 0 0 1 0 0 0 0 Rhizobiales NP_101988 4 8 0 1 0 1 0 0 1 0 NP_102895 0 5 0 0 0 0 0 0 0 0 NP_104087 0 1 0 1 0 0 0 0 0 0 NP_104130 2 6 0 0 0 0 0 0 1 0 NP_105201 0 4 0 0 0 0 0 0 0 0 NP_105743 4 5 0 1 0 1 0 0 1 0 NP_108472 3 4 0 0 0 0 0 0 0 0
NP_103319 1 6 0 0 0 0 0 0 2 0 NP_101965 4 4 0 0 0 0 0 1 0 0 NP_101954 0 4 0 0 0 0 0 0 0 0 NP_102577 1 1 0 0 0 0 0 0 0 0 NP_109472 0 2 0 0 0 0 0 0 0 0 NP_105883 0 5 0 0 0 0 0 0 0 0 NP_106835 1 0 0 0 0 0 0 0 0 0 NP_107159 1 1 0 0 0 0 0 0 0 0 NP_103376 1 9 0 0 0 0 0 0 0 0 NP_104418 0 1 0 0 0 0 0 0 0 0 NP_105704 0 1 0 0 0 0 1 0 0 0 NP_102252 0 0 0 0 0 0 0 0 0 0 NP_103286 0 1 0 0 0 0 0 0 0 0 NP_106741 0 2 0 0 0 0 0 0 0 0 NP_106740 0 0 0 0 0 0 0 0 0 0 NP_104236 2 0 0 0 0 0 0 0 0 0 NP_103455 0 0 0 0 0 0 0 0 0 0 NP_103450 0 0 0 0 0 0 0 0 0 0 NP_103476 0 2 0 0 0 0 0 0 0 0 NP_107075 0 0 0 0 0 0 0 0 0 0 NP_772654 0 1 0 0 0 0 0 0 0 0 YP_317707 0 1 0 0 0 0 0 1 0 0
YP_317841 1 4 0 0 0 0 0 0 0 0 YP_318399 3 0 1 0 0 0 0 0 0 0 YP_318401 3 0 0 0 0 0 0 0 0 0 YP_318753 0 0 0 0 0 0 0 0 0 0 YP_318785 0 0 0 0 0 0 0 0 0 0
Xanthobacteraceae YP_319038 0 0 0 0 0 0 0 0 0 0
YP_319081 3 2 0 0 0 0 0 0 0 0 and YP_319177 1 1 0 1 0 0 0 0 0 0 YP_319228 1 2 0 0 0 0 0 0 0 0 YP_319312 6 0 0 2 0 0 0 0 0 0 NP_772539 0 0 0 0 0 0 0 0 0 0 NP_772746 0 1 0 0 0 0 0 0 0 0 YP_316897 2 0 0 4 0 0 1 0 0 0 Bradyrhizobiaceae YP_317122 0 0 0 0 0 0 0 0 0 0 YP_317147 0 0 0 0 0 0 0 0 0 0
YP_317224 1 0 0 0 0 0 0 0 0 0 YP_317328 2 0 1 1 0 0 0 0 0 0 YP_317539 0 1 0 1 0 0 0 0 0 0 YP_317580 2 0 1 1 0 0 0 0 0 0 YP_317698 0 4 0 1 0 0 1 1 0 0 YP_317706 0 0 0 0 0 0 0 0 0 0 YP_317721 1 1 0 0 0 0 0 0 0 0 YP_317722 0 1 0 0 0 0 0 0 0 0 YP_317949 1 1 0 1 0 0 1 1 0 0 YP_317753 0 1 0 0 0 0 0 6 0 0 YP_317861 0 0 0 0 0 0 0 0 0 0 YP_317883 0 0 0 0 0 0 0 0 0 0 YP_317888 0 0 0 0 0 0 0 0 0 0 YP_318067 0 0 0 0 0 0 0 0 0 0 YP_318111 0 0 0 0 0 0 0 0 0 0 YP_318125 0 0 0 0 0 0 0 0 0 0 YP_318194 1 1 0 0 0 0 0 0 0 0 YP_318195 0 0 0 0 0 0 0 0 0 0 YP_318199 2 2 0 0 0 0 0 0 1 0 YP_318262 1 0 0 0 0 0 0 0 0 0 YP_318287 2 0 0 0 0 0 0 1 0 0 YP_318318 6 1 0 1 0 0 0 0 0 0 YP_318324 1 0 0 0 0 0 0 0 0 0 YP_318398 0 0 0 0 0 0 0 0 0 0 YP_318406 3 2 0 4 0 0 2 0 1 0 YP_318413 0 0 0 0 0 0 0 0 0 0 YP_318632 0 0 0 0 0 0 0 0 0 0 YP_318673 0 0 0 0 0 0 0 0 0 0 YP_318674 0 0 0 1 0 0 0 0 0 0 YP_318769 0 0 0 0 0 0 0 0 0 0 YP_318779 1 0 0 0 0 0 0 0 0 0 YP_318789 0 0 0 0 0 0 0 0 0 0 YP_318814 0 0 0 0 0 0 0 0 0 0 YP_318850 0 0 0 1 0 0 0 0 0 0 YP_318853 2 0 0 0 0 0 0 0 0 0 YP_318985 20 1 8 5 0 0 0 6 0 0 YP_318987 0 0 0 0 0 0 0 0 0 0
YP_319020 0 0 0 0 0 0 0 0 0 0 YP_319094 0 0 0 0 0 0 0 0 0 0 YP_319097 2 0 0 0 0 0 0 0 0 0 YP_319105 0 0 0 0 0 0 0 0 0 0 YP_319111 0 0 0 0 0 0 0 0 0 0 YP_319114 0 0 0 1 0 0 1 0 0 0 YP_319136 1 0 0 0 0 0 0 0 0 0 YP_319180 0 0 0 0 0 0 0 0 0 0 YP_319182 0 0 0 2 0 0 0 0 0 0 YP_319193 0 0 0 0 0 0 0 0 0 0 YP_319235 0 0 0 0 0 0 0 0 0 0 YP_319281 0 0 0 0 0 0 0 0 0 0 YP_319282 0 0 0 0 0 0 0 0 0 0 YP_319374 0 0 0 0 0 0 0 0 0 0 YP_319394 0 0 0 0 0 0 0 0 0 0 YP_319586 0 0 0 1 0 0 0 0 0 0 YP_319561 1 0 0 0 0 0 0 0 0 0 YP_319637 0 0 0 0 0 0 0 0 0 0 YP_319739 1 0 0 0 0 0 0 0 0 0 YP_319740 2 0 0 0 0 0 0 0 0 0 YP_612088 0 5 3 0 2 3 0 0 3 1 YP_612179 0 1 0 0 0 2 0 1 5 0 YP_612231 0 0 0 0 0 1 0 0 2 0 YP_612466 0 1 0 0 0 0 0 0 2 1 YP_612581 0 1 0 0 0 1 0 0 4 2 YP_612582 0 1 0 0 0 2 0 0 3 1
YP_612692 0 1 0 0 4 0 0 0 2 0 YP_612745 0 1 0 0 0 0 0 0 1 0 YP_612747 0 1 1 0 0 1 0 0 0 1 YP_613058 0 4 0 0 1 1 0 0 2 0 YP_613059 0 2 1 0 2 0 0 0 4 0 Rhodobacterales YP_613242 0 0 0 0 1 2 0 0 0 1 YP_613345 0 1 0 0 4 1 0 0 2 1 YP_613401 0 1 1 0 1 0 0 0 0 1 YP_613562 0 1 0 0 3 1 1 0 4 0 YP_613837 0 2 1 0 0 1 0 0 2 0 YP_613961 0 1 0 0 2 4 0 0 1 0
YP_613982 0 1 0 0 1 2 0 0 3 0 YP_614257 0 1 0 0 0 1 1 2 3 1 YP_614364 0 1 0 0 1 2 0 0 2 0 YP_614419 0 2 1 0 0 1 0 0 5 1 YP_614460 0 1 0 0 0 1 0 0 4 0 YP_614481 0 2 0 0 1 1 0 0 3 0 YP_614576 0 1 0 0 0 1 0 0 3 1 YP_614993 0 1 2 0 0 2 0 0 5 1 YP_611313 0 1 0 0 0 1 0 0 1 0 YP_611978 0 1 2 0 2 0 0 0 3 1 YP_611988 0 2 2 0 2 1 0 0 2 0 YP_611993 0 2 0 0 1 2 0 0 5 1 YP_613553 0 1 0 0 1 2 0 0 1 1 YP_613730 0 1 0 0 0 0 0 0 2 1 YP_613732 3 5 1 1 0 4 1 0 5 0 YP_613733 0 0 0 0 0 1 0 0 3 0 YP_613734 4 0 13 2 2 2 0 5 3 3 YP_613731 4 39 17 1 2 3 2 5 20 13 YP_613094 0 0 0 0 0 0 0 1 0 0 YP_611425 0 0 2 0 0 0 0 0 0 0 YP_613418 0 0 0 0 0 0 0 0 0 0 YP_613446 0 0 0 0 0 0 0 0 2 0 YP_613980 0 0 0 0 0 0 0 0 0 0 YP_614100 0 0 0 0 0 0 0 0 2 0 YP_614133 0 0 0 0 0 0 0 0 0 0 YP_611311 0 0 0 0 0 0 0 0 0 0 YP_611438 0 0 0 0 0 0 0 0 0 0 YP_611444 0 1 0 0 0 0 0 0 0 0 YP_611462 0 0 0 0 0 0 0 0 0 0 YP_611763 0 0 0 0 0 1 0 0 0 0 YP_611855 0 0 0 0 0 0 0 0 0 0
NP_419305 2 1 1 0 0 0 0 0 3 0 NP_421283 3 0 0 0 0 0 0 0 3 0 NP_421560 0 0 0 0 0 0 0 0 1 0 NP_421895 11 6 1 0 0 0 1 0 1 0 NP_419331 0 3 0 0 0 0 0 0 0 0 Caulobacterales
NP_419880 0 2 0 0 0 0 0 1 1 0 NP_419882 0 1 1 0 0 0 0 0 2 0 NP_420397 0 0 0 0 0 0 0 1 0 0 NP_421010 2 1 1 0 0 0 0 0 0 0 NP_421428 0 0 0 0 1 0 0 0 0 0 NP_421438 0 2 0 0 0 0 0 0 1 0 YP_495301 4 5 2 0 0 0 0 0 3 0 YP_495335 3 5 0 0 0 0 0 1 0 0 YP_495370 1 3 1 0 0 0 0 0 0 0 YP_495433 0 2 1 0 0 0 0 0 1 0 YP_495514 10 4 2 0 0 0 0 0 0 0 YP_495691 4 4 0 0 0 0 0 0 2 0 YP_496367 8 5 0 0 0 0 0 0 2 0 YP_496423 1 2 0 0 0 0 0 0 0 0 YP_496569 3 5 1 0 0 0 0 1 1 0 YP_496656 1 2 0 0 0 0 0 0 0 0 YP_497188 2 4 1 0 0 0 0 0 0 0 YP_497403 2 4 0 0 0 0 0 0 1 0
YP_498058 1 3 0 0 0 0 0 0 2 0 YP_498227 4 4 0 0 0 0 0 0 0 0 YP_498407 2 3 0 0 0 0 0 0 1 0 YP_498482 1 3 0 0 0 0 0 0 1 0 YP_495327 1 1 3 0 0 0 0 0 0 0 YP_495437 1 3 0 0 0 0 0 0 2 0 Sphingomonadales YP_495697 9 5 0 0 0 0 0 0 2 0 YP_495740 1 3 1 0 0 0 0 0 3 0 YP_496357 5 4 1 1 0 0 0 0 1 0 YP_496405 0 3 2 0 0 0 0 0 0 0 YP_496439 6 4 0 0 0 0 0 1 3 0 YP_496442 0 3 0 0 0 0 0 0 2 0 YP_497022 4 7 0 0 0 0 0 1 4 0 YP_497059 0 1 0 0 0 0 0 0 1 0 YP_497246 0 2 1 0 0 0 0 0 0 0 YP_497309 0 1 0 0 0 0 0 0 2 0 YP_497310 4 2 0 0 0 0 0 0 1 0 YP_497604 16 6 2 0 0 0 0 0 1 0 YP_497818 4 6 4 0 0 0 2 0 1 0
AAW60410 0 0 0 0 0 0 0 0 0 0 AAW60472 1 0 0 0 0 0 0 0 0 0 AAW60735 1 0 0 0 0 0 1 0 0 0 AAW61019 2 0 0 1 0 0 0 0 0 0 AAW59936 1 0 0 0 0 0 0 1 0 0 AAW61357 0 0 0 0 0 0 0 0 0 0 AAW60126 0 0 0 0 0 0 0 0 0 0 AAW60973 0 0 0 0 0 0 0 0 0 0 AAW60976 0 0 0 0 0 0 0 0 0 0 AAW60983 0 0 0 0 0 0 0 0 0 0 AAW60985 0 0 0 0 0 0 0 0 0 0 AAW62008 0 0 0 0 0 0 0 1 0 0 AAW62049 1 0 2 0 0 0 0 0 1 0
AAW62183 0 0 0 0 0 0 0 0 0 0 AAW62185 3 0 0 0 0 0 0 0 0 0 AAW60994 0 0 0 0 0 0 0 0 0 0 AAW62187 1 0 0 0 0 0 0 1 0 0 YP_425217 9 0 2 1 1 0 0 0 2 0 Rhodospirillales YP_425244 0 0 0 0 0 0 0 0 0 0 YP_425622 1 0 0 0 0 0 0 1 1 0 YP_426776 1 0 1 0 0 0 0 0 0 0 YP_426843 1 0 1 3 0 0 0 0 0 0 YP_427199 2 0 0 3 0 0 0 1 0 0 YP_427597 4 0 1 0 0 0 0 1 0 0 YP_427676 2 0 0 0 1 0 0 0 0 0 YP_427912 1 0 0 0 0 0 0 0 0 0 YP_428643 8 0 2 0 0 0 0 0 1 0 YP_428717 2 0 0 1 1 0 0 0 1 0 YP_428743 4 0 0 0 0 0 0 0 0 0 YP_428820 0 0 0 0 0 0 0 0 0 0 YP_428881 2 1 2 0 0 0 0 0 0 0 NP_965979 0 0 0 0 0 0 0 0 0 0
NP_966474 0 0 0 0 0 0 0 0 0 0 NP_965909 0 0 0 0 0 0 0 0 0 0 NP_966580 0 0 0 0 0 0 0 0 0 0
Rickettsiales NP_965975 0 0 0 0 0 0 0 0 0 0 NP_965966 0 0 0 0 0 0 0 0 0 0
NP_966527 0 0 0 0 0 0 0 0 0 0 NP_966202 0 0 0 0 0 0 0 0 0 0 NP_966253 0 0 0 0 0 0 0 0 0 0 NP_966513 0 0 0 0 0 0 0 0 0 0 NP_966574 0 0 0 0 0 0 0 0 0 0 NP_966613 0 0 0 0 0 0 0 0 0 0 NP_966526 0 1 0 0 0 0 0 0 0 0 NP_966520 0 0 0 0 0 0 0 0 0 0 NP_966750 0 0 0 0 0 0 0 0 0 0 NP_966779 0 0 0 0 0 0 0 0 0 0 NP_966932 0 0 0 0 0 0 0 0 0 0 NP_966942 0 0 0 0 0 0 0 0 0 0 NP_220581 0 0 0 0 0 0 0 0 0 0 NP_220424 0 0 0 0 0 0 0 0 0 0 NP_220576 0 0 0 0 0 0 0 1 0 0
Figure 2: Alphaproteobacteria-specific CSPs identified in 10 metagenomes. Numbers of significant hits within each metagenome are assigned to the listed taxa. Color formatting indicates high and low values: negative results are in green, positive results are in yellow, and red indicates the highest values in the chart.
CSP, with metagenome labels: Marine, Compost, Whale fall, Bioreactor, Wastewater, Groundwater, Microbial mat, Activated sludge, Hydrothermal vent, Freshwater sediment
NP_420905 128 130 63 98 93 85 87 78 108 85 NP_422086 203 205 212 132 206 190 180 191 209 143 NP_422113 67 98 77 67 57 55 71 0 66 65
NP_420178 263 124 94 102 83 80 61 92 118 96 NP_420025 94 110 85 73 48 72 109 0 103 0 NP_420693 100 75 72 71 49 72 78 0 78 71 NP_421048 117 138 89 69 47 91 105 0 122 0 proteobacteria - NP_422264 155 196 58 69 0 60 59 53 130 60 α NP_419339 147 177 0 0 119 0 159 113 161 158 NP_421804 98 188 84 83 0 0 90 59 135 72 NP_418919 117 140 87 75 69 115 118 222 132 56 YP_031797 70 77 51 67 0 0 53 0 56 0 YP_032733 63 81 0 61 0 0 0 48 77 0 YP_032395 169 265 0 107 0 0 162 0 106 0 NP_101943 0 171 0 0 0 0 0 0 0 0 NP_105027 0 104 0 0 0 80 0 0 0 0 NP_108034 174 231 0 196 0 0 107 90 186 0
NP_102510 0 189 0 0 0 0 0 0 0 0 NP_102519 0 173 49 53 0 0 0 0 0 0 NP_104217 0 112 60 0 0 0 0 0 0 0 Rhizobiales NP_107016 82 133 0 0 0 128 0 0 0 0 NP_101988 154 411 0 112 0 69 0 0 202 0 NP_102895 0 110 0 0 0 0 0 0 0 0 NP_104087 0 55 0 81 0 0 0 0 0 0 NP_104130 78 200 0 0 0 0 0 0 115 0 NP_105201 0 207 0 0 0 0 0 0 0 0
NP_105743 213 487 0 176 0 92 0 0 291 0 NP_108472 102 462 0 0 0 0 0 0 0 0 NP_103319 115 184 0 0 0 0 0 0 167 0 NP_101965 128 410 0 0 0 0 0 56 0 0 NP_101954 0 126 0 0 0 0 0 0 0 0 NP_102577 51 164 0 0 0 0 0 0 0 0 NP_109472 0 238 0 0 0 0 0 0 0 0 NP_105883 0 189 0 0 0 0 0 0 0 0 NP_106835 54 0 0 0 0 0 0 0 0 0 NP_107159 50 63 0 0 0 0 0 0 0 0 NP_103376 65 200 0 0 0 0 0 0 0 0 NP_104418 0 107 0 0 0 0 0 0 0 0 NP_105704 0 120 0 0 0 0 49 0 0 0 NP_102252 0 0 0 0 0 0 0 0 0 0 NP_103286 0 107 0 0 0 0 0 0 0 0 NP_106741 0 112 0 0 0 0 0 0 0 0 NP_106740 0 0 0 0 0 0 0 0 0 0 NP_104236 127 0 0 0 0 0 0 0 0 0 NP_103455 0 0 0 0 0 0 0 0 0 0 NP_103450 0 0 0 0 0 0 0 0 0 0 NP_103476 0 47 0 0 0 0 0 0 0 0 NP_107075 0 0 0 0 0 0 0 0 0 0
NP_772654 0 91 0 0 0 0 0 0 0 0 YP_317707 0 56 0 0 0 0 0 204 0 0 YP_317841 65 47 0 0 0 0 0 0 0 0 YP_318399 112 0 163 0 0 0 0 0 0 0 YP_318401 156 0 0 0 0 0 0 0 0 0 Xanthobacteraceae YP_318753 0 0 0 0 0 0 0 0 0 0 and
YP_318785 0 0 0 0 0 0 0 0 0 0 YP_319038 0 0 0 0 0 0 0 0 0 0 YP_319081 60 61 0 0 0 0 0 0 0 0 YP_319177 127 140 0 95 0 0 0 0 0 0 YP_319228 56 93 0 0 0 0 0 0 0 0 YP_319312 117 0 0 76 0 0 0 0 0 0 Bradyrhizobiaceae
NP_772539 0 0 0 0 0 0 0 0 0 0 NP_772746 0 73 0 0 0 0 0 0 0 0 YP_316897 216 0 0 148 0 0 176 0 0 0 YP_317122 0 0 0 0 0 0 0 0 0 0 YP_317147 0 0 0 0 0 0 0 0 0 0 YP_317224 46 0 0 0 0 0 0 0 0 0 YP_317328 125 0 118 145 0 0 0 0 0 0 YP_317539 0 54 0 54 0 0 0 0 0 0 YP_317580 128 0 111 147 0 0 0 0 0 0 YP_317698 0 65 0 55 0 0 56 57 0 0 YP_317706 0 0 0 0 0 0 0 0 0 0 YP_317721 61 131 0 0 0 0 0 0 0 0 YP_317722 0 68 0 0 0 0 0 0 0 0 YP_317949 49 56 0 79 0 0 159 105 0 0 YP_317753 0 235 0 0 0 0 0 88 0 0 YP_317861 0 0 0 0 0 0 0 0 0 0 YP_317883 0 0 0 0 0 0 0 0 0 0 YP_317888 0 0 0 0 0 0 0 0 0 0 YP_318067 0 0 0 0 0 0 0 0 0 0 YP_318111 0 0 0 0 0 0 0 0 0 0 YP_318125 0 0 0 0 0 0 0 0 0 0 YP_318194 66 58 0 0 0 0 0 0 0 0 YP_318195 0 0 0 0 0 0 0 0 0 0 YP_318199 103 63 0 0 0 0 0 0 87 0 YP_318262 46 0 0 0 0 0 0 0 0 0 YP_318287 57 0 0 0 0 0 0 71 0 0 YP_318318 90 53 0 63 0 0 0 0 0 0 YP_318324 51 0 0 0 0 0 0 0 0 0 YP_318398 0 0 0 0 0 0 0 0 0 0 YP_318406 96 120 0 54 0 0 92 0 49 0 YP_318413 0 0 0 0 0 0 0 0 0 0 YP_318632 0 0 0 0 0 0 0 0 0 0 YP_318673 0 0 0 0 0 0 0 0 0 0 YP_318674 0 0 0 52 0 0 0 0 0 0
YP_318769 0 0 0 0 0 0 0 0 0 0
YP_318779 78 0 0 0 0 0 0 0 0 0
YP_318789 0 0 0 0 0 0 0 0 0 0
YP_318814 0 0 0 0 0 0 0 0 0 0
YP_318850 0 0 0 55 0 0 0 0 0 0
YP_318853 90 0 0 0 0 0 0 0 0 0
YP_318985 339 59 441 169 0 0 0 376 0 0
YP_318987 0 0 0 0 0 0 0 0 0 0
YP_319020 0 0 0 0 0 0 0 0 0 0
YP_319094 0 0 0 0 0 0 0 0 0 0
YP_319097 108 0 0 0 0 0 0 0 0 0
YP_319105 0 0 0 0 0 0 0 0 0 0
YP_319111 0 0 0 0 0 0 0 0 0 0
YP_319114 0 0 0 48 0 0 56 0 0 0
YP_319136 52 0 0 0 0 0 0 0 0 0
YP_319180 0 0 0 0 0 0 0 0 0 0
YP_319182 0 0 0 46 0 0 0 0 0 0
YP_319193 0 0 0 0 0 0 0 0 0 0
YP_319235 0 0 0 0 0 0 0 0 0 0
YP_319281 0 0 0 0 0 0 0 0 0 0
YP_319282 0 0 0 0 0 0 0 0 0 0
YP_319374 0 0 0 0 0 0 0 0 0 0
YP_319394 0 0 0 0 0 0 0 0 0 0
YP_319586 0 0 0 52 0 0 0 0 0 0
YP_319561 73 0 0 0 0 0 0 0 0 0
YP_319637 0 0 0 0 0 0 0 0 0 0
YP_319739 102 0 0 0 0 0 0 0 0 0
YP_319740 92 0 0 0 0 0 0 0 0 0
YP_612088 0 125 107 0 68 65 0 0 119 119
YP_612179 0 383 0 0 0 405 0 49 236 0
YP_612231 0 0 0 0 0 60 0 0 108 0
YP_612466 0 131 0 0 0 0 0 0 137 107
YP_612581 0 365 0 0 0 114 0 0 357 352
Rhodobacterales
YP_612582 0 183 0 0 0 124 0 0 170 196
YP_612692 0 92 0 0 90 0 0 0 86 0
YP_612745 0 55 0 0 0 0 0 0 130 0
YP_612747 0 138 114 0 0 97 0 0 0 115
YP_613058 0 113 0 0 85 117 0 0 109 0
YP_613059 0 103 83 0 78 0 0 0 93 0
YP_613242 0 0 0 0 94 63 0 0 0 50
YP_613345 0 172 0 0 147 161 0 0 114 197
YP_613401 0 81 104 0 97 0 0 0 0 92
YP_613562 0 218 0 0 194 184 98 0 213 0
YP_613837 0 124 150 0 0 84 0 0 135 0
YP_613961 0 136 0 0 131 152 0 0 142 0
YP_613982 0 148 0 0 152 159 0 0 151 0
YP_614257 0 721 0 0 0 333 199 330 311 719
YP_614364 0 271 0 0 115 293 0 0 237 0
YP_614419 0 192 199 0 0 153 0 0 232 225
YP_614460 0 156 0 0 0 202 0 0 139 0
YP_614481 0 459 0 0 158 460 0 0 496 0
YP_614576 0 111 0 0 0 102 0 0 102 124
YP_614993 0 139 178 0 0 176 0 0 174 156
YP_611313 0 75 0 0 0 72 0 0 55 0
YP_611978 0 273 249 0 183 0 0 0 286 196
YP_611988 0 143 122 0 165 160 0 0 198 0
YP_611993 0 231 0 0 227 259 0 0 184 225
YP_613553 0 123 0 0 117 125 0 0 112 125
YP_613730 0 194 0 0 0 0 0 0 203 195
YP_613732 54 57 54 60 0 105 61 0 103 0
YP_613733 0 0 0 0 0 129 0 0 99 0
YP_613734 99 0 211 57 60 68 0 115 82 115
YP_613731 159 405 209 135 140 204 86 136 221 318
YP_613094 0 0 0 0 0 0 0 60 0 0
YP_611425 0 0 45 0 0 0 0 0 0 0
YP_613418 0 0 0 0 0 0 0 0 0 0
YP_613446 0 0 0 0 0 0 0 0 142 0
YP_613980 0 0 0 0 0 0 0 0 0 0
YP_614100 0 0 0 0 0 0 0 0 151 0
YP_614133 0 0 0 0 0 0 0 0 0 0
YP_611311 0 0 0 0 0 0 0 0 0 0
YP_611438 0 0 0 0 0 0 0 0 0 0
YP_611444 0 50 0 0 0 0 0 0 0 0
YP_611462 0 0 0 0 0 0 0 0 0 0
YP_611763 0 0 0 0 0 84 0 0 0 0
YP_611855 0 0 0 0 0 0 0 0 0 0
NP_419305 82 62 85 0 0 0 0 0 81 0
NP_421283 181 0 0 0 0 0 0 0 191 0
NP_421560 0 0 0 0 0 0 0 0 56 0
NP_421895 127 291 73 0 0 0 115 0 119 0
NP_419331 0 100 0 0 0 0 0 0 0 0
NP_419880 0 51 0 0 0 0 0 50 88 0
NP_419882 0 71 51 0 0 0 0 0 63 0
NP_420397 0 0 0 0 0 0 0 82 0 0
Caulobacterales
NP_421010 104 75 100 0 0 0 0 0 0 0
NP_421428 0 0 0 0 60 0 0 0 0 0
NP_421438 0 94 0 0 0 0 0 0 95 0
YP_495301 116 216 115 0 0 0 0 0 225 0
YP_495335 132 87 0 0 0 0 0 248 0 0
YP_495370 64 309 168 0 0 0 0 0 0 0
YP_495433 0 112 82 0 0 0 0 0 65 0
YP_495514 174 464 61 0 0 0 0 0 0 0
YP_495691 175 206 0 0 0 0 0 0 127 0
YP_496367 93 284 0 0 0 0 0 0 215 0
YP_496423 57 276 0 0 0 0 0 0 0 0
YP_496569 110 139 70 0 0 0 0 105 77 0
YP_496656 69 192 0 0 0 0 0 0 0 0
YP_497188 110 142 49 0 0 0 0 0 0 0
YP_497403 77 202 0 0 0 0 0 0 112 0
YP_498058 68 225 0 0 0 0 0 0 92 0
Sphingomonadales
YP_498227 141 311 0 0 0 0 0 0 0 0
YP_498407 88 145 0 0 0 0 0 0 53 0
YP_498482 73 246 0 0 0 0 0 0 103 0
YP_495327 74 134 89 0 0 0 0 0 0 0
YP_495437 74 89 0 0 0 0 0 0 66 0
YP_495697 119 156 0 0 0 0 0 0 57 0
YP_495740 60 73 48 0 0 0 0 0 114 0
YP_496357 122 240 78 159 0 0 0 0 59 0
YP_496405 0 168 94 0 0 0 0 0 0 0
YP_496439 70 76 0 0 0 0 0 68 102 0
YP_496442 0 61 0 0 0 0 0 0 47 0
YP_497022 74 191 0 0 0 0 0 62 117 0
YP_497059 0 67 0 0 0 0 0 0 68 0
YP_497246 0 64 50 0 0 0 0 0 0 0
YP_497309 0 285 0 0 0 0 0 0 99 0
YP_497310 80 85 0 0 0 0 0 0 73 0
YP_497604 147 663 303 0 0 0 0 0 149 0
YP_497818 130 305 308 0 0 0 127 0 164 0
AAW60410 0 0 0 0 0 0 0 0 0 0
AAW60472 64 0 0 0 0 0 0 0 0 0
AAW60735 52 0 0 0 0 0 53 0 0 0
AAW61019 53 0 0 56 0 0 0 0 0 0
AAW59936 78 0 0 0 0 0 0 68 0 0
AAW61357 0 0 0 0 0 0 0 0 0 0
AAW60126 0 0 0 0 0 0 0 0 0 0
AAW60973 0 0 0 0 0 0 0 0 0 0
AAW60976 0 0 0 0 0 0 0 0 0 0
AAW60983 0 0 0 0 0 0 0 0 0 0
AAW60985 0 0 0 0 0 0 0 0 0 0
AAW62008 0 0 0 0 0 0 0 56 0 0
AAW62049 70 0 471 0 0 0 0 0 158 0
AAW62183 0 0 0 0 0 0 0 0 0 0
AAW62185 162 0 0 0 0 0 0 0 0 0
AAW60994 0 0 0 0 0 0 0 0 0 0
Rhodospirillales
AAW62187 70 0 0 0 0 0 0 69 0 0
YP_425217 173 0 177 54 68 0 0 0 171 0
YP_425244 0 0 0 0 0 0 0 0 0 0
YP_425622 94 0 0 0 0 0 0 88 89 0
YP_426776 104 0 67 0 0 0 0 0 0 0
YP_426843 155 0 62 53 0 0 0 0 0 0
YP_427199 144 0 0 87 0 0 0 78 0 0
YP_427597 181 0 125 0 0 0 0 60 0 0
YP_427676 227 0 0 0 59 0 0 0 0 0
YP_427912 55 0 0 0 0 0 0 0 0 0
YP_428643 410 0 342 0 0 0 0 0 195 0
YP_428717 78 0 0 74 83 0 0 0 72 0
YP_428743 114 0 0 0 0 0 0 0 0 0
YP_428820 0 0 0 0 0 0 0 0 0 0
YP_428881 86 56 82 0 0 0 0 0 0 0
NP_965979 0 0 0 0 0 0 0 0 0 0
NP_966474 0 0 0 0 0 0 0 0 0 0
NP_965909 0 0 0 0 0 0 0 0 0 0
NP_966580 0 0 0 0 0 0 0 0 0 0
NP_965975 0 0 0 0 0 0 0 0 0 0
NP_965966 0 0 0 0 0 0 0 0 0 0
NP_966527 0 0 0 0 0 0 0 0 0 0
NP_966202 0 0 0 0 0 0 0 0 0 0
NP_966253 0 0 0 0 0 0 0 0 0 0
NP_966513 0 0 0 0 0 0 0 0 0 0
NP_966574 0 0 0 0 0 0 0 0 0 0
NP_966613 0 0 0 0 0 0 0 0 0 0
Rickettsiales
NP_966526 0 69 0 0 0 0 0 0 0 0
NP_966520 0 0 0 0 0 0 0 0 0 0
NP_966750 0 0 0 0 0 0 0 0 0 0
NP_966779 0 0 0 0 0 0 0 0 0 0
NP_966932 0 0 0 0 0 0 0 0 0 0
NP_966942 0 0 0 0 0 0 0 0 0 0
NP_220581 0 0 0 0 0 0 0 0 0 0
NP_220424 0 0 0 0 0 0 0 0 0 0
NP_220576 0 0 0 0 0 0 0 82 0 0
Figure 3: Similarity of significant hits in 10 metagenomes. The best bit scores within each metagenome are assigned to the listed taxa. Color formatting indicates high and low values: negative results are in green, positive results are in yellow, and red indicates the highest values in the chart.
Figure 4: Overall relative abundance of Alphaproteobacteria based on CSP distribution in 10 metagenomes. Note: The dots indicate the average bit score obtained from BLAST searches and show the average extent of similarity between metagenomic reads and CSPs.
Figure 5: The relative abundance of Alphaproteobacteria and its different sub-clades in the studied metagenomes based upon BLASTp searches with CSPs. Note: The colored bars indicate the numbers of significant hits detected in each metagenome with CSPs that are specific for different groups of Alphaproteobacteria.
Figure 6: Comparative results of Alphaproteobacteria distribution in 4 metagenomes derived from (A) CSP-based binning and (B) similarity-based binning. Note: The lower pie charts are obtained from the MG-RAST and IMG/M databases. The color scheme denoting the different Alphaproteobacteria subgroups is shown below.
Chapter 4 Discussion
4.1 Metagenome selection
The 10 metagenomes selected from the NCBI metagenomic database for this work represent different environmental habitats around the world and cover all 3 metagenomic ecosystem categories: 4 from engineered ecosystems (bioreactor, compost, wastewater and activated sludge metagenomes), 5 from environmental ecosystems (groundwater, freshwater sediment, microbial mat, marine and hydrothermal vent metagenomes) and 1 from a host-associated ecosystem (whale fall metagenome). The habitats of Alphaproteobacterial communities are highly divergent, including saline water, sediment, marine, fossil, green-waste compost and wastewater treatment plant environments. Public taxonomic classification of these metagenomes, from either the MG-RAST or the JGI platform, identified a myriad of Alphaproteobacteria-associated sequences, further suggesting that Alphaproteobacteria can adapt to the diverse environments described above.
It is notable that the metagenomic projects differ from each other in total length, number of contigs and average contig length (Table 11). These differences have a considerable influence on downstream bioinformatics analysis. For instance, the total length of whole genome shotgun (WGS) sequences in this study ranges from 24.9 Mbases to 421.6 Mbases, whereas the corresponding raw sequencing data reach tens or hundreds of Gbases. Data are lost during bioinformatics steps such as quality control and duplicate clustering. Likewise, an enormous number of metagenomic sequences are discarded because the ever-increasing size of metagenomic projects has surpassed the volume of any existing public database, so that they cannot be matched to any
reference sequence in a public database (Thomas et al., 2012). The number of contigs assembled in these metagenomes ranges from 26573 to 748672. Sequence coverage is a key factor in producing assembled contigs. However, the mixture of genomes challenges the assembly process and leads to a low yield of assembled contigs, because metagenomic sequences are less redundant than single-genome sequences. Lastly, the average length of contigs in each metagenomic project ranges from 425 bp to 2801 bp. The longer a metagenomic sequence is, the higher the mapping accuracy (Wommack et al., 2008). The depth of sequencing determines the length of assembled contigs, so it is possible to reconstruct a single draft genome from metagenomic sequences as long as sequencing is deep enough to give sufficient coverage for joining DNA fragments. However, sequencing technology unveils merely a small portion of the microbes in an environment, because incomplete sequencing is a major and inevitable limitation of most metagenomic studies. As a consequence, the species that can be predicted from metagenomic datasets are still very limited and are likely biased by the information asymmetry between databases and metagenomes (Wooley et al., 2010).
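The three summary statistics compared above (number of contigs, total length and average contig length) can be computed directly from an assembly in FASTA format. The following is a minimal Python sketch under stated assumptions: the function names `parse_fasta` and `contig_stats` and the toy contigs are hypothetical illustrations and are not taken from the datasets or pipeline of this study.

```python
from statistics import mean


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def contig_stats(text):
    """Return contig count, total length and mean contig length."""
    lengths = [len(seq) for _, seq in parse_fasta(text)]
    return {
        "n_contigs": len(lengths),
        "total_length": sum(lengths),
        "mean_length": mean(lengths) if lengths else 0.0,
    }


# Toy input: two hypothetical contigs, not thesis data.
toy = ">contig_1\nACGTACGT\nACGT\n>contig_2\nGGCC\n"
stats = contig_stats(toy)
print(stats)
```

Applied to each of the 10 assemblies, such a summary would yield the kind of values listed in the project statistics table.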
4.2 Identification of CSPs in metagenomic samples
An important advantage of using CSPs for metagenomic profiling is that the presence of these protein markers can be detected more reliably than that of the corresponding gene markers. When the gene markers of the corresponding CSPs are used in similar searches, far fewer significant hits are obtained than with protein markers (Table 2). This can be explained by both the redundancy of the genetic code and the variation of gene sequences in metagenomes (Kembel et al., 2011). In view of this, CSPs may be
able to decipher the taxonomic origin of some unassigned metagenomic sequences beyond what nucleotide markers can do. Unlike MetaPhlAn, a similarity-based binning program that relies on unique clade-specific gene markers, the CSP-based methodology emphasizes identifying and exploiting clade-specific protein markers at taxonomic levels ranging from phylum to genus. However, very few genus-specific CSPs have been identified within Alphaproteobacteria so far, which leads to low resolution of taxonomic profiling at lower taxonomic levels for metagenomic projects. More genus-specific CSPs are likely to be identified as more reference genomes become publicly available.
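The effect of genetic code redundancy on marker detection, mentioned earlier in this section, can be illustrated with a small Python sketch. It builds the standard genetic code and translates two hypothetical gene fragments that differ at most nucleotide positions yet encode the same peptide. The sequences and the helper names `translate` and `count_matches` are illustrative assumptions, and the sketch covers only codon redundancy, not the sequence variation among metagenomic reads.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA string in frame 1, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_matches(a, b):
    """Count positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


# Two hypothetical gene fragments using different synonymous codons.
gene_a = "ATGCTGTCTCGT"
gene_b = "ATGTTAAGCAGA"

protein_a = translate(gene_a)
protein_b = translate(gene_b)
dna_matches = count_matches(gene_a, gene_b)

print(protein_a, protein_b)          # identical peptides
print(dna_matches, "of", len(gene_a), "nucleotides match")
```

In this toy case the protein sequences are identical while fewer than half of the nucleotide positions agree, which is one reason a protein-level search can recover hits that a nucleotide-level search misses.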
According to the heat-map distribution of the 264 Alphaproteobacterial CSPs, most of the class-specific CSPs could be detected in all 10 metagenomic projects, whose habitats abound with Alphaproteobacteria. Compared with the formidable task of assembling each individual genome in a metagenomic sample, it is more feasible to build a molecular marker database that contains all commonly shared genes within Alphaproteobacteria. The results also indicate that CSPs at higher taxonomic levels, such as the class and phylum levels, tend to be detected readily in metagenomic datasets. Alphaproteobacteria are enriched in 3 metagenomic projects (bioreactor, wastewater and whale fall metagenomes) (Figure 4). The detailed distribution of the 6 major orders of Alphaproteobacteria indicates that Rhodobacterales is the most abundant clade in 6 of the metagenomic projects studied (Figure 5). The relative abundance of Alphaproteobacteria reported by MG-RAST and IMG/M also supports the CSP-based results in 4 metagenomes (Figure 6). The proportion of Alphaproteobacterial clades in these 4 metagenomic projects is highly correlated with the proportion predicted by
Alphaproteobacteria-specific CSPs (Figure 6). This finding indicates a potential application of CSPs: Alphaproteobacteria-specific CSPs are able to predict the distribution pattern of Alphaproteobacterial clades in metagenomic samples.
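A comparison of this kind can be sketched in Python by converting per-order counts into proportions and correlating the two profiles. This section does not state which correlation measure was used, so the Pearson coefficient below is an assumption, and all counts are invented toy values rather than results of this study.

```python
from math import sqrt


def proportions(counts):
    """Convert a {clade: count} mapping into {clade: fraction of total}."""
    total = sum(counts.values())
    if total == 0:
        return {clade: 0.0 for clade in counts}
    return {clade: n / total for clade, n in counts.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical counts: significant CSP hits versus reads binned by a public server.
csp_hits = {"Rhizobiales": 40, "Rhodobacterales": 30,
            "Sphingomonadales": 20, "Rhodospirillales": 10}
server_reads = {"Rhizobiales": 900, "Rhodobacterales": 500,
                "Sphingomonadales": 450, "Rhodospirillales": 150}

csp_prop = proportions(csp_hits)
server_prop = proportions(server_reads)
clades = sorted(csp_hits)
r = pearson([csp_prop[c] for c in clades], [server_prop[c] for c in clades])
print(round(r, 3))
```

With only a handful of clades per metagenome, a coefficient of this kind should be read as a rough indication of agreement rather than as a statistical test.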
4.3 Comparative analysis of Alphaproteobacteria in metagenomes
The heatmaps in Figures 2 and 3 reflect the overall distribution pattern of all 264 CSPs in the 10 metagenomes. It is noticeable that the 11 CSPs unique to the class Alphaproteobacteria are ubiquitously present in all 10 metagenomes. Their average bit scores are higher than those of the order-specific CSPs, and the total number of significant hits identified exceeds that of all other CSPs. Based on this finding, it is concluded that class-specific CSPs are detected much more easily than order- or family-specific CSPs. This is because all potential Alphaproteobacteria species are assumed to contribute class-specific CSPs to metagenomic datasets, so the predictive ability of the class-specific CSPs is much stronger than that of order- or family-specific CSPs. The best bit scores of Rhodobacterales-specific CSPs in 6 metagenomic samples are very high compared with those of other clade-specific CSPs. This suggests that Rhodobacterales is the dominant Alphaproteobacterial member in those metagenomes and that the high concentration of Rhodobacterales increases the coverage during sequence assembly, thus producing more complete genomic sequences of Rhodobacterales. Similar results are seen for Sphingomonadales-specific and Rhizobiales-specific CSPs in the wastewater metagenome. In summary, the occurrence rate of Alphaproteobacteria-specific CSPs is influenced by the specificity of the CSPs as well as by the concentration of the corresponding bacterial clade.
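The quantities compared above (number of significant hits, best bit score and average bit score per CSP) can be derived from BLAST tabular output. The following Python sketch assumes the standard 12-column tabular format, in which the E-value and the bit score are the last two columns, with the CSP as the query. The E-value cutoff of 1e-3, the function name `summarize_hits` and the toy records are assumptions for illustration and do not reproduce the thresholds or data of this study.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-3  # assumed significance threshold, not taken from the thesis


def summarize_hits(lines, cutoff=E_VALUE_CUTOFF):
    """Summarize significant hits per query from 12-column BLAST tabular lines."""
    scores = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query = fields[0]
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue <= cutoff:
            scores[query].append(bitscore)
    return {
        query: {"hits": len(s), "best": max(s), "mean": sum(s) / len(s)}
        for query, s in scores.items()
    }


# Toy records (hypothetical CSP and read names, invented values).
toy_lines = [
    "\t".join(["CSP_A", "read_1", "62.0", "150", "57", "0", "1", "150", "1", "450", "1e-40", "159"]),
    "\t".join(["CSP_A", "read_2", "48.0", "100", "52", "0", "10", "109", "3", "302", "2e-12", "86"]),
    "\t".join(["CSP_A", "read_3", "30.0", "40", "28", "0", "5", "44", "7", "126", "4.1", "22.3"]),
    "\t".join(["CSP_B", "read_9", "45.0", "80", "44", "0", "1", "80", "2", "241", "3e-08", "69"]),
]
summary = summarize_hits(toy_lines)
print(summary)
```

Running such a summary once per metagenome would give one column of best bit scores per dataset, which is the layout used in the heatmaps.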
A comprehensive profiling of Alphaproteobacteria was performed based on the CSP distribution in the metagenomic projects. Alphaproteobacteria dominate 3 metagenomes. This indicates that, in these three metagenomes, different clades of Alphaproteobacteria may each perform particular functions that maintain the balance and development of the habitat. Alphaproteobacteria are less abundant in the other 7 metagenomes, which are occupied either by 1 order alone or by 1 order together with other orders at low concentration (Figure 5).
Microbial mat, hydrothermal vent and groundwater metagenomes represent three typical habitats that are composed mainly of Rhodobacterales. These habitats are characterized by extreme environmental conditions such as hypersalinity, high temperature and exposure to radiation. The detection of Rhodobacterales-specific CSPs suggests that members of this order have a strong ability to adapt to harsh environments. In the remaining 4 metagenomes (marine, compost, activated sludge and freshwater sediment), the overall concentration of Alphaproteobacteria is not very high, but the diversity of Alphaproteobacteria is higher than in the three metagenomes discussed above. Several Alphaproteobacterial clades are present at low concentration in these metagenomes, which suggests that Alphaproteobacteria carry out auxiliary functions that help maintain the equilibrium of the habitat. In brief, different environments, each with its own growth conditions, are preferred by different Alphaproteobacteria species. This link between organism and environment may help predict the presence of similar lineages before fieldwork and laboratory experiments are carried out.
An important goal of this study is to validate the CSP-based methodology for organism identification and abundance prediction. The relative abundances obtained from CSP-based binning and from traditional similarity-based binning on public metagenomic servers are highly correlated. The distribution of Alphaproteobacteria and its sub-clades in the bioreactor metagenome matches perfectly with the proportions on the IMG/M server (Figure 6). In the wastewater metagenome, 3 groups of Alphaproteobacteria (Rhizobiales, Rhodobacterales and Sphingomonadales) are identified as the major members both by CSP searches and by similarity searches on the MG-RAST server. Comparison between CSP-based binning and similarity-based binning on MG-RAST for the whale fall and marine metagenomes also shows similar results. CSP-based binning is therefore a reliable way to predict the relative abundance of Alphaproteobacteria species in metagenomic samples. Since the constructed database is smaller and more specific than the NCBI non-redundant database, taxonomic clustering of environmental datasets can be achieved faster and more accurately.
4.4 Overall conclusions
In previous centuries, the study of microbiology was mainly restricted to single species in laboratory culture (Madigan et al., 2008). Since the vast majority of microbes cannot be grown in the laboratory, research on microbial community interactions beyond the substrates has lagged behind (Hugenholtz et al., 1998). Nevertheless, under environmental conditions, microbial activities such as photosynthesis, organic degradation and nitrogen fixation are carried out by complex microbial communities that have evolved over millions of years to adapt to different habitats and ecosystems (Davey and
O’Toole, 2000). In order to understand the complex interactions within a microbial community, it is necessary to explore the species diversity as well as the relative abundance of species in the environment (Kuramitsu et al., 2007). In this study, 264 CSPs were used to investigate Alphaproteobacterial diversity in 10 metagenomic projects. The results indicate that most CSPs could be detected in different metagenomic projects. By analyzing and comparing the distributions of bit scores and numbers of significant hits, a comprehensive profile of Alphaproteobacterial diversity in metagenomic datasets was obtained. In essence, CSP-based binning is a refinement of traditional similarity-based binning that improves both efficiency and effectiveness: computational expense is reduced while mapping accuracy increases. Although CSPs cannot robustly resolve issues such as bacterial quantification or species- and strain-level diagnosis, they shed light on the profiling of bacterial clades above the species level and provide a new way to predict the relative abundance of microbial clades in different metagenomes with clade-specific protein markers.
4.5 Future directions
Beyond the work completed here, further analyses could extend the results above:
CSPs specific to other bacterial phyla such as Actinobacteria, Cyanobacteria and Bacteroidetes have already been identified in previous studies. With these molecular markers, it is possible to forecast the presence and relative abundance of the corresponding bacteria in more metagenomic projects. By constructing a database that contains all CSPs unique to every taxon from the genus level to the phylum level in the domain Bacteria, a
comprehensive blueprint of metagenomic taxonomic classification can be created to profile the presence and relative abundance of all microorganisms in metagenomic datasets.
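One possible shape for such a database is a lookup from each CSP to the rank and name of the taxon it is specific for, which can then be queried with the set of CSPs detected in a metagenome. The Python sketch below is only an illustration of that idea: the identifiers in `CSP_DB`, the function name `profile` and the detected set are hypothetical, and the fraction of detected CSPs per taxon is an assumed summary, not a measure defined in this study.

```python
from collections import defaultdict

# Hypothetical CSP identifiers mapped to (taxonomic rank, taxon name).
CSP_DB = {
    "csp_001": ("class", "Alphaproteobacteria"),
    "csp_002": ("class", "Alphaproteobacteria"),
    "csp_101": ("order", "Rhodobacterales"),
    "csp_102": ("order", "Rhodobacterales"),
    "csp_201": ("order", "Rickettsiales"),
}


def profile(detected, db=CSP_DB):
    """Return, per taxon, the fraction of its CSPs found in a metagenome."""
    total = defaultdict(int)
    found = defaultdict(int)
    for csp, taxon in db.items():
        total[taxon] += 1
        if csp in detected:
            found[taxon] += 1
    return {taxon: found[taxon] / total[taxon] for taxon in total}


# CSPs assumed to have significant hits in one hypothetical metagenome.
result = profile({"csp_001", "csp_002", "csp_101"})
for (rank, name), fraction in sorted(result.items()):
    print(rank, name, fraction)
```

Keying the table by CSP rather than by taxon keeps each search result a single lookup, and additional phyla can be added without changing the query code.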
References
Abraham, W.-R., Macedo, A.J., Lünsdorf, H., Fischer, R., Pawelczyk, S., Smit, J., and Vancanneyt, M. (2008). Phylogeny by a polyphasic approach of the order Caulobacterales, proposal of Caulobacter mirabilis sp. nov., Phenylobacterium haematophilum sp. nov. and Phenylobacterium conjunctum sp. nov., and emendation of the genus Phenylobacterium. Int. J. Syst. Evol. Microbiol. 58, 1939–1949.
Albertsen, M., Hugenholtz, P., Skarshewski, A., Nielsen, K.L., Tyson, G.W., and Nielsen, P.H. (2013). Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat. Biotechnol. 31, 533–538.
Allgaier, M., Reddy, A., Park, J.I., Ivanova, N., D’haeseleer, P., Lowry, S., Sapra, R., Hazen, T.C., Simmons, B.A., VanderGheynst, J.S., et al. (2010). Targeted discovery of glycoside hydrolases from a switchgrass-adapted compost community. PLoS One 5, e8812.
Alsmark, C.M., Frank, A.C., Karlberg, E.O., Legault, B.A., Ardell, D.H., Canback, B., Eriksson, A.S., Naslund, A.K., Handley, S.A., Huvet, M., et al. (2004). The louse-borne human pathogen Bartonella quintana is a genomic derivative of the zoonotic agent Bartonella henselae. Proc. Natl. Acad. Sci. U. S. A. 101, 9716–9721.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
Andersson, S.G., and Kempf, V.A. (2004). Host cell modulation by human, animal and plant pathogens. Int. J. Med. Microbiol. 293, 463–470.
Arisue, N., Hasegawa, M., and Hashimoto, T. (2005). Root of the Eukaryota tree as inferred from combined maximum likelihood analyses of multiple molecular sequence data. Mol. Biol. Evol. 22, 409–420.
Arraga-Alvarado, C., Palmar, M., Parra, O., and Salas, P. (2003). Ehrlichia platys (Anaplasma platys) in dogs from Maracaibo, Venezuela: an ultrastructural study of experimental and natural infections. Vet. Pathol. 40, 149–156.
Bäckhed, F., Fraser, C.M., Ringel, Y., Sanders, M.E., Sartor, R.B., Sherman, P.M., Versalovic, J., Young, V., and Finlay, B.B. (2012). Defining a healthy human gut microbiome: current concepts, future directions, and clinical applications. Cell Host Microbe 12, 611–622.
Beiko, R.G., and Ragan, M.A. (2008). Detecting lateral genetic transfer: a phylogenetic approach. Methods Mol. Biol. 452, 457–469.
Bhandari, V., Naushad, H.S., and Gupta, R.S. (2012). Protein based molecular markers provide reliable means to understand prokaryotic phylogeny and support Darwinian mode of evolution. Front. Cell. Infect. Microbiol. 2, 98.
Binnewies, T.T., Motro, Y., Hallin, P.F., Lund, O., Dunn, D., La, T., Hampson, D.J., Bellgard, M., Wassenaar, T.M., and Ussery, D.W. (2006). Ten years of bacterial genome sequencing: comparative-genomics-based discoveries. Funct. Integr. Genomics 6, 165–185.
Boersma, F.G.H., Warmink, J.A., Andreote, F.A., and van Elsas, J.D. (2009). Selection of Sphingomonadaceae at the base of Laccaria proxima and Russula exalbicans fruiting bodies. Appl. Environ. Microbiol. 75, 1979–1989.
Bowman, D.D. (2011). Introduction to the alpha-proteobacteria: Wolbachia and Bartonella, Rickettsia, Brucella, Ehrlichia, and Anaplasma. Top. Companion Anim. Med. 26, 173–177.
Brady, A., and Salzberg, S.L. (2009). Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat. Methods 6, 673–676.
Brazelton, W.J., and Baross, J.A. (2009). Abundant transposases encoded by the metagenome of a hydrothermal chimney biofilm. ISME J. 3, 1420–1424.
Breitschwerdt, E.B., and Kordick, D.L. (2000). Bartonella Infection in Animals: Carriership, Reservoir Potential, Pathogenicity, and Zoonotic Potential for Human Infection. Clin. Microbiol. Rev. 13, 428–438.
Brennerova, M.V., Josefiova, J., Brenner, V., Pieper, D.H., and Junca, H. (2009). Metagenomics reveals diversity and abundance of meta-cleavage pathways in microbial communities from soil highly contaminated with jet fuel under air-sparging bioremediation. Environ. Microbiol. 11, 2216–2227.
Campagne, S., Damberger, F.F., Kaczmarczyk, A., Francez-Charlot, A., Allain, F.H.-T., and Vorholt, J.A. (2012). Structural basis for sigma factor mimicry in the general stress response of Alphaproteobacteria. Proc. Natl. Acad. Sci. U. S. A. 109, E1405–14.
Carvalho, F.M., Souza, R.C., Barcellos, F.G., Hungria, M., and Vasconcelos, A.T.R. (2010). Genomic and evolutionary comparisons of diazotrophic and pathogenic bacteria of the order Rhizobiales. BMC Microbiol. 10, 37.
Le Chatelier, E., Nielsen, T., Qin, J., Prifti, E., Hildebrand, F., Falony, G., Almeida, M., Arumugam, M., Batto, J.-M., Kennedy, S., et al. (2013). Richness of human gut microbiome correlates with metabolic markers. Nature 500, 541–546.
Chistoserdova, L. (2013). Is metagenomics resolving identification of functions in microbial communities? Microb. Biotechnol.
Choudhary, M., and Kaplan, S. (2000). DNA sequence analysis of the photosynthesis region of Rhodobacter sphaeroides 2.4.1. Nucleic Acids Res. 28, 862–867.
Coletta, A., Pinney, J.W., Solís, D.Y.W., Marsh, J., Pettifer, S.R., and Attwood, T.K. (2010). Low-complexity regions within protein sequences have position-dependent roles. BMC Syst. Biol. 4, 43.
Dang, H., Li, T., Chen, M., and Huang, G. (2008). Cross-ocean distribution of Rhodobacterales bacteria as primary surface colonizers in temperate coastal marine waters. Appl. Environ. Microbiol. 74, 52–60.
Davey, M.E., and O’Toole, G.A. (2000). Microbial biofilms: from ecology to molecular genetics. Microbiol. Mol. Biol. Rev. 64, 847–867.
DeLong, E.F., Preston, C.M., Mincer, T., Rich, V., Hallam, S.J., Frigaard, N.-U., Martinez, A., Sullivan, M.B., Edwards, R., Brito, B.R., et al. (2006). Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311, 496–503.
Doolittle, W.F., and Bapteste, E. (2007). Pattern pluralism and the Tree of Life hypothesis. Proc. Natl. Acad. Sci. U. S. A. 104, 2043–2049.
Dröge, J., and McHardy, A.C. (2012). Taxonomic binning of metagenome samples generated by next-generation sequencing technologies. Brief. Bioinform. 13, 646–655.
Dumler, J.S., Barbet, A.F., Bekker, C.P., Dasch, G.A., Palmer, G.H., Ray, S.C., Rikihisa, Y., and Rurangirwa, F.R. (2001). Reorganization of genera in the families Rickettsiaceae and Anaplasmataceae in the order Rickettsiales: unification of some species of Ehrlichia with Anaplasma, Cowdria with Ehrlichia and Ehrlichia with Neorickettsia, descriptions of six new species combi. Int. J. Syst. Evol. Microbiol. 51, 2145–2165.
English, C.K. (1988). Cat-Scratch Disease. JAMA 259, 1347.
Ferrari, B.C., Binnerup, S.J., and Gillings, M. (2005). Microcolony cultivation on a soil substrate membrane system selects for previously uncultured soil bacteria. Appl. Environ. Microbiol. 71, 8714–8720.
Fischer, H.M. (1996). Environmental regulation of rhizobial symbiotic nitrogen fixation genes. Trends Microbiol. 4, 317–320.
Fredricks, D.N. (2006). Introduction to the Rickettsiales and other intracellular prokaryotes. In The Prokaryotes: A Handbook on the Biology of Bacteria, M. Dworkin, S. Falkow, E. Rosenberg, K.H. Schleifer, and E. Stackebrandt, eds. (New York: Springer), pp. 457–466.
Gao, B., and Gupta, R.S. (2012). Microbial systematics in the post-genomics era. Antonie Van Leeuwenhoek 101, 45–54.
Gao, B., Parmanathan, R., and Gupta, R.S. (2006). Signature proteins that are distinctive characteristics of Actinobacteria and their subgroups. Antonie Van Leeuwenhoek 90, 69–91.
Ghai, R., Mizuno, C.M., Picazo, A., Camacho, A., and Rodriguez-Valera, F. (2013). Metagenomics uncovers a new group of low GC and ultra-small marine Actinobacteria. Sci. Rep. 3, 2471.
Ghazanfar, S., Azim, A., Ghazanfar, M.A.M.A., Iqbal, M., and Anjum, I.B. (2010). Metagenomics and its application in soil microbial community studies: biotechnological prospects. J. Anim. … 6, 611–622.
Gilbert, J.A., and Dupont, C.L. (2011). Microbial Metagenomics: Beyond the Genome. Ann. Rev. Mar. Sci. 3, 347–371.
Gomez-Alvarez, V., Revetta, R.P., and Santo Domingo, J.W. (2012). Metagenome analyses of corroded concrete wastewater pipe biofilms reveal a complex microbial system. BMC Microbiol. 12, 122.
Gray, M.W. (2012). Mitochondrial evolution. Cold Spring Harb. Perspect. Biol. 4, a011403.
Gullo, M., and Giudici, P. (2008). Acetic acid bacteria in traditional balsamic vinegar: phenotypic traits relevant for starter cultures selection. Int. J. Food Microbiol. 125, 46–53.
Gupta, R.S. (2000). The phylogeny of proteobacteria: relationships to other eubacterial phyla and eukaryotes. FEMS Microbiol. Rev. 24, 367–402.
Gupta, R.S. (2005a). Critical issues in prokaryotic phylogeny and taxonomy. ASM News 71, 393–394.
Gupta, R.S. (2005b). Protein signatures distinctive of alpha proteobacteria and its subgroups and a model for alpha-proteobacterial evolution. Crit. Rev. Microbiol. 31, 101–135.
Gupta, R.S., and Griffiths, E. (2002). Critical issues in bacterial phylogeny. Theor. Popul. Biol. 61, 423–434.
Gupta, R.S., and Lorenzini, E. (2007). Phylogeny and molecular signatures (conserved proteins and indels) that are specific for the Bacteroidetes and Chlorobi species. BMC Evol. Biol. 7, 71.
Gupta, R.S., and Mok, A. (2007a). Phylogenomics and signature proteins for the alpha proteobacteria and its main groups. BMC Microbiol. 7, 106.
Gupta, R.S., and Mok, A. (2007b). Phylogenomics and signature proteins for the alpha Proteobacteria and its main groups. BMC Microbiol. 7, 106.
Hallez, R., Bellefontaine, A.-F., Letesson, J.-J., and De Bolle, X. (2004). Morphological and functional asymmetry in alpha-proteobacteria. Trends Microbiol. 12, 361–365.
Handelsman, J. (2004). Metagenomics: application of genomics to uncultured microorganisms. Microbiol. Mol. Biol. Rev. 68, 669–685.
Harris, J.K., Caporaso, J.G., and Walker, J.J. (2012). Phylogenetic stratigraphy in the Guerrero Negro hypersaline microbial mat. ISME … 1–11.
Hess, M., Sczyrba, A., Egan, R., Kim, T.-W., Chokhawala, H., Schroth, G., Luo, S., Clark, D.S., Chen, F., Zhang, T., et al. (2011). Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 331, 463–467.
Holley, H.P. (1991). Successful Treatment of Cat-scratch Disease With Ciprofloxacin. JAMA 265, 1563.
Huang, W.E., Zhou, J., Scholz, M.B., Lo, C.-C., and Chain, P.S. (2012). Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Curr. Opin. Biotechnol. 23, 9–15.
Hugenholtz, P., Goebel, B.M., and Pace, N.R. (1998). Impact of Culture-Independent Studies on the Emerging Phylogenetic View of Bacterial Diversity. J. Bacteriol. 180, 4765–4774.
Huson, D.H., and Xie, C. (2013). A poor man’s BLASTX--high-throughput metagenomic protein database search using PAUDA. Bioinformatics.
Kainth, P., and Gupta, R.S. (2005). Signature proteins that are distinctive of alpha proteobacteria. BMC Genomics 6, 94.
Kalyuzhnaya, M.G., Lapidus, A., Ivanova, N., Copeland, A.C., McHardy, A.C., Szeto, E., Salamov, A., Grigoriev, I.V., Suciu, D., Levine, S.R., et al. (2008). High-resolution metagenomics targets specific functional types in complex microbial communities. Nat. Biotechnol. 26, 1029–1034.
Kang, I., Oh, H.-M., Vergin, K.L., Giovannoni, S.J., and Cho, J.-C. (2010). Genome sequence of the marine alphaproteobacterium HTCC2150, assigned to the Roseobacter clade. J. Bacteriol. 192, 6315–6316.
Kapley, A., De Baere, T., and Purohit, H.J. (2007). Eubacterial diversity of activated biomass from a common effluent treatment plant. Res. Microbiol. 158, 494–500.
Kembel, S.W., Eisen, J.A., Pollard, K.S., and Green, J.L. (2011). The Phylogenetic Diversity of Metagenomes. PLoS One 6, 9.
Kersters, K., Devos, P., Gillis, M., Swings, J., Vandamme, P., and Stackebrandt, E. (2006). Introduction to the Proteobacteria. In The Prokaryotes: A Handbook on the Biology of Bacteria, M. Dworkin, S. Falkow, E. Rosenberg, K.H. Schleifer, and E. Stackebrandt, eds. (New York: Springer), pp. 3–37.
Kinross, J.M., Darzi, A.W., and Nicholson, J.K. (2011). Gut microbiome-host interactions in health and disease. Genome Med. 3, 14.
Kisand, V., Valente, A., Lahm, A., Tanet, G., and Lettieri, T. (2012). Phylogenetic and functional metagenomic profiling for assessing microbial biodiversity in environmental monitoring. PLoS One 7, e43630.
Kunisawa, T. (2007). Gene arrangements characteristic of the phylum Actinobacteria. Antonie Van Leeuwenhoek 92, 359–365.
Kuramitsu, H.K., He, X., Lux, R., Anderson, M.H., and Shi, W. (2007). Interspecies interactions within oral microbial communities. Microbiol. Mol. Biol. Rev. 71, 653–670.
Leimena, M.M., Ramiro-Garcia, J., Davids, M., van den Bogert, B., Smidt, H., Smid, E.J., Boekhorst, J., Zoetendal, E.G., Schaap, P.J., and Kleerebezem, M. (2013). A comprehensive metatranscriptome analysis pipeline and its validation using human small intestine microbiota datasets. BMC Genomics 14, 530.
Van der Lelie, D., Taghavi, S., McCorkle, S.M., Li, L.-L.L., Malfatti, S.A., Monteleone, D., Donohoe, B.S., Ding, S.-Y.Y., Adney, W.S., Himmel, M.E., et al. (2012). The metagenome of an anaerobic microbial community decomposing poplar wood chips. PLoS One 7, e36740.
Lepage, P., Leclerc, M.C., Joossens, M., Mondot, S., Blottière, H.M., Raes, J., Ehrlich, D., and Doré, J. (2013). A metagenomic insight into our gut’s microbiome. Gut 62, 146–158.
Leung, H.C.M., Yiu, S.M., Yang, B., Peng, Y., Wang, Y., Liu, Z., Chen, J., Qin, J., Li, R., and Chin, F.Y.L. (2011). A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics 27, 1489–1495.
Li, W., Fu, L., Niu, B., Wu, S., and Wooley, J. (2012). Ultrafast clustering algorithms for metagenomic sequence analysis. Brief. Bioinform. 13, 656–668.
Lindner, M.S., Kollock, M., Zickmann, F., and Renard, B.Y. (2013). Analyzing genome coverage profiles with applications to quality control in metagenomics. Bioinformatics 29, 1260–1267.
Lu, H.-P., Wang, Y., Huang, S.-W., Lin, C.-Y., Wu, M., Hsieh, C., and Yu, H.-T. (2012). Metagenomic analysis reveals a functional signature for biomass degradation by cecal microbiota in the leaf-eating flying squirrel (Petaurista alborufus lena). BMC Genomics 13, 466.
Ludwig, W., Strunk, O., Klugbauer, S., Klugbauer, N., Weizenegger, M., Neumaier, J., Bachleitner, M., and Schleifer, K.H. (1998). Bacterial phylogeny based on comparative sequence analysis. Electrophoresis 19, 554–568.
Lussier, F.-X., Chambenoit, O., Côté, A., Hupé, J.-F., Denis, F., Juteau, P., Beaudet, R., and Shareck, F. (2011). Construction and functional screening of a metagenomic library using a T7 RNA polymerase-based expression cosmid vector. J. Ind. Microbiol. Biotechnol. 38, 1321–1328.
Mackelprang, R., Waldrop, M.P., DeAngelis, K.M., David, M.M., Chavarria, K.L., Blazewicz, S.J., Rubin, E.M., and Jansson, J.K. (2011). Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw. Nature 480, 368–371.
Madigan, M.T., Martinko, J.M., Dunlap, P.V., and Clark, D.P. (2008). Brock Biology of Microorganisms (12th Edition) (Benjamin Cummings).
Markowitz, V.M., Chen, I.-M.A., Chu, K., Szeto, E., Palaniappan, K., Grechkin, Y., Ratner, A., Jacob, B., Pati, A., Huntemann, M., et al. (2012). IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Res. 40, D123–D129.
Matsuda, H., Nishi, N., Tsuji, K., Tanaka, K., Kakuno, T., Yamashita, J., and Horio, T. (1984). Reconstruction of photosynthetic, cyclic electron transport system from photoreaction unit, ubiquinone-10 protein, cytochrome c2 and polar lipids purified from Rhodospirillum rubrum. J. Biochem. 95, 431–442.
Meyer, F., Paarmann, D., D’Souza, M., Olson, R., Glass, E.M., Kubal, M., Paczian, T., Rodriguez, A., Stevens, R., Wilke, A., et al. (2008). The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9, 386.
Mielczarek, A.T., Saunders, A.M., Larsen, P., Albertsen, M., Stevenson, M., Nielsen, J.L., and Nielsen, P.H. (2013). The Microbial Database for Danish wastewater treatment plants with nutrient removal (MiDas-DK) - a tool for understanding activated sludge population dynamics and community stability. Water Sci. Technol. 67, 2519–2526.
Mitra, S., Rupek, P., Richter, D.C., Urich, T., Gilbert, J.A., Meyer, F., Wilke, A., and Huson, D.H. (2011). Functional analysis of metagenomes and metatranscriptomes using SEED and KEGG. BMC Bioinformatics 12 Suppl 1, S21.
Mohammed, M.H., Ghosh, T.S., Singh, N.K., and Mande, S.S. (2011). SPHINX--an algorithm for taxonomic binning of metagenomic sequences. Bioinformatics 27, 22–30.
Moine, H., Squires, C.L., Ehresmann, B., and Ehresmann, C. (2000). In vivo selection of functional ribosomes with variations in the rRNA-binding site of Escherichia coli ribosomal protein S8: evolutionary implications. Proc. Natl. Acad. Sci. U.S.A. 97, 605–610.
Moloney, R.D., Desbonnet, L., Clarke, G., Dinan, T.G., and Cryan, J.F. (2013). The microbiome: stress, health and disease. Mamm. Genome.
Morgan, J.L., Darling, A.E., and Eisen, J.A. (2010). Metagenomic sequencing of an in vitro-simulated microbial community. PLoS One 5, e10209.
National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications (2007). The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet (The National Academies Press).
Nguimbi, E., Li, Y.Z., Gao, B.L., Li, Z.F., Wang, B., Wu, Z.H., Yan, B.X., Qu, Y.B., and Gao, P.J. (2003). 16S-23S ribosomal DNA intergenic spacer regions in cellulolytic myxobacteria and differentiation of closely related strains. Syst. Appl. Microbiol. 26, 262–268.
Nielsen, P.H., Saunders, A.M., Hansen, A.A., Larsen, P., and Nielsen, J.L. (2012). Microbial communities involved in enhanced biological phosphorus removal from wastewater--a model system in environmental biotechnology. Curr. Opin. Biotechnol. 23, 452–459.
Nijkamp, J.F., Pop, M., Reinders, M.J.T., and de Ridder, D. (2013). Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold. Bioinformatics 29, 2826–2834.
Oh, J.I., and Kaplan, S. (2001). Generalized approach to the regulation and integration of gene expression. Mol. Microbiol. 39, 1116–1123.
Olson, J.B., Harmody, D.K., and McCarthy, P.J. (2002). Alpha-proteobacteria cultivated from marine sponges display branching rod morphology. FEMS Microbiol. Lett. 211, 169–173.
Poindexter, J.S., and Staley, J.T. (1996). Caulobacter and Asticcacaulis stalk bands as indicators of stalk age. J. Bacteriol. 178, 3939–3948.
Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K.S., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T., et al. (2010). A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65.
Raoult, D., Fournier, P.-E., Vandenesch, F., Mainardi, J.-L., Eykyn, S.J., Nash, J., James, E., Benoit-Lemercier, C., and Marrie, T.J. (2003). Outcome and Treatment of Bartonella Endocarditis. Arch. Intern. Med. 163, 226.
Rascovan, N., Carbonetto, B., Revale, S., Reinert, M.D., Alvarez, R., Godeas, A.M., Colombo, R., Aguilar, M., Novas, M., Iannone, L., et al. (2013). The PAMPA datasets: a metagenomic survey of microbial communities in Argentinean pampean soils. Microbiome 1, 21.
Rathsack, K., Reitner, J., Stackebrandt, E., and Tindall, B.J. (2011). Reclassification of Aurantimonas altamirensis (Jurado et al. 2006), Aurantimonas ureilytica (Weon et al. 2007) and Aurantimonas frigidaquae (Kim et al. 2008) as members of a new genus, Aureimonas gen. nov., as Aureimonas altamirensis gen. nov., comb. nov. Int. J. Syst. Evol. Microbiol. 61, 2722–2728.
Ravi P More, S.M. (2013). Mining and assessment of catabolic pathways in the metagenome of a common effluent treatment plant to induce the degradative capacity of biomass. Bioresour. Technol.
Riemann, L., Leitet, C., Pommier, T., Simu, K., Holmfeldt, K., Larsson, U., and Hagström, A. (2008). The native bacterioplankton community in the central Baltic Sea is influenced by freshwater bacterial species. Appl. Environ. Microbiol. 74, 503–515.
Roller, M., Lucić, V., Nagy, I., Perica, T., and Vlahovicek, K. (2013). Environmental shaping of codon usage and functional adaptation across microbial communities. Nucleic Acids Res. 41, 8842–8852.
Rosen, G.L., Sokhansanj, B.A., Polikar, R., Bruns, M.A., Russell, J., Garbarine, E., Essinger, S., and Yok, N. (2009). Signal Processing for Metagenomics: Extracting Information from the Soup. Curr. Genomics 10, 493–510.
Rout, M.E., and Callaway, R.M. (2012). Interactions between exotic invasive plants and soil microbes in the rhizosphere suggest that “everything is not everywhere”. Ann. Bot. 110, 213–222.
Ruby, J.G., Bellare, P., and Derisi, J.L. (2013). PRICE: software for the targeted assembly of components of (meta)genomic sequence data. G3 (Bethesda) 3, 865–880.
Sahni, S.K., and Rydkina, E. (2009). Host-cell interactions with pathogenic Rickettsia species. Future Microbiol. 4, 323–339.
Schloss, P.D., and Handelsman, J. (2005). Metagenomics for studying unculturable microorganisms: cutting the Gordian knot. Genome Biol. 6, 229.
Scully, E.D., Geib, S.M., Hoover, K., Tien, M., Tringe, S.G., Barry, K.W., Glavina del Rio, T., Chovatia, M., Herr, J.R., and Carlson, J.E. (2013). Metagenomic profiling reveals lignocellulose degrading system in a microbial community associated with a wood-feeding beetle. PLoS One 8, e73827.
Segata, N., Waldron, L., Ballarini, A., Narasimhan, V., Jousson, O., and Huttenhower, C. (2012). Metagenomic microbial community profiling using unique clade-specific marker genes. Nat. Methods 9, 811–814.
Sharon, I., Birkland, A., Chang, K., El-Yaniv, R., and Yona, G. (2005). Correcting BLAST e-Values for Low-Complexity Segments. J. Comput. Biol. 12, 980–1003.
Siepel, A., and Haussler, D. (2004). Combining phylogenetic and hidden Markov models in biosequence analysis. J. Comput. Biol. 11, 413–428.
Solonenko, S.A., Ignacio-Espinoza, J.C., Alberti, A., Cruaud, C., Hallam, S., Konstantinidis, K., Tyson, G., Wincker, P., and Sullivan, M.B. (2013). Sequencing platform and library preparation choices impact viral metagenomes. BMC Genomics 14, 320.
Sommer, M.O.A., Church, G.M., and Dantas, G. (2010). A functional metagenomic approach for expanding the synthetic biology toolbox for biomass conversion. Mol. Syst. Biol. 6, 360.
Sowell, S.M., Norbeck, A.D., Lipton, M.S., Nicora, C.D., Callister, S.J., Smith, R.D., Barofsky, D.F., and Giovannoni, S.J. (2008). Proteomic analysis of stationary phase in the marine bacterium “Candidatus Pelagibacter ubique”. Appl. Environ. Microbiol. 74, 4091–4100.
Steenhoudt, O., and Vanderleyden, J. (2000). Azospirillum, a free-living nitrogen-fixing bacterium closely associated with grasses: genetic, biochemical and ecological aspects. FEMS Microbiol. Rev. 24, 487–506.
Strous, M., Kraft, B., Bisdorf, R., and Tegetmeyer, H.E. (2012). The binning of metagenomic contigs for microbial physiology of mixed cultures. Front. Microbiol. 3, 410.
Takacs-Vesbach, C., Inskeep, W.P., Jay, Z.J., Herrgard, M.J., Rusch, D.B., Tringe, S.G., Kozubal, M.A., Hamamura, N., Macur, R.E., Fouke, B.W., et al. (2013). Metagenome sequence analysis of filamentous microbial communities obtained from geochemically distinct geothermal channels reveals specialization of three Aquificales lineages. Front. Microbiol. 4, 84.
Teeling, H., Waldmann, J., Lombardot, T., Bauer, M., and Glöckner, F.O. (2004). TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 5, 163.
Thomas, T., Gilbert, J., and Meyer, F. (2012). Metagenomics - a guide from sampling to data analysis. Microb. Inform. Exp. 2, 3.
Travers, S.A.A., Clewley, J.P., Glynn, J.R., Fine, P.E.M., Crampin, A.C., Sibande, F., Mulawa, D., McInerney, J.O., and McCormack, G.P. (2004). Timing and reconstruction of the most recent common ancestor of the subtype C clade of human immunodeficiency virus type 1. J. Virol. 78, 10501–10506.
Tringe, S.G., von Mering, C., Kobayashi, A., Salamov, A.A., Chen, K., Chang, H.W., Podar, M., Short, J.M., Mathur, E.J., Detter, J.C., et al. (2005). Comparative metagenomics of microbial communities. Science 308, 554–557.
Ursell, L.K., Metcalf, J.L., Parfrey, L.W., and Knight, R. (2012). Defining the human microbiome. Nutr. Rev. 70 Suppl 1, S38–44.
Vogel, T.M., Simonet, P., Jansson, J.K., Hirsch, P.R., Tiedje, J.M., van Elsas, J.D., Bailey, M.J., Nalin, R., and Philippot, L. (2009). TerraGenome: a consortium for the sequencing of a soil metagenome. Nat. Rev. Microbiol. 7, 252.
Walker, D.H., Valbuena, G.A., and Olano, J.P. (2003). Pathogenic mechanisms of diseases caused by Rickettsia. Ann. N. Y. Acad. Sci. 990, 1–11.
Williams, D., Fournier, G.P., Lapierre, P., Swithers, K.S., Green, A.G., Andam, C.P., and Gogarten, J.P. (2011). A rooted net of life. Biol. Direct 6, 45.
Williams, K.P., Sobral, B.W., and Dickerman, A.W. (2007). A Robust Species Tree for the Alphaproteobacteria. J. Bacteriol. 189, 4578–4586.
Wommack, K.E., Bhavsar, J., and Ravel, J. (2008). Metagenomics: Read Length Matters. Appl. Environ. Microbiol. 74, 1453–1463.
Wooley, J.C., Godzik, A., and Friedberg, I. (2010). A primer on metagenomics. PLoS Comput. Biol. 6, e1000667.
Wrighton, K.C., Thomas, B.C., Sharon, I., Miller, C.S., Castelle, C.J., VerBerkmoes, N.C., Wilkins, M.J., Hettich, R.L., Lipton, M.S., Williams, K.H., et al. (2012). Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science 337, 1661–1665.
Wu, Y.-W., and Ye, Y. (2011). A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. J. Comput. Biol. 18, 523–534.
Xia, L.C., Cram, J.A., Chen, T., Fuhrman, J.A., and Sun, F. (2011). Accurate genome relative abundance estimation based on shotgun metagenomic reads. PLoS One 6, e27992.
Yabuuchi, E., and Kosako, Y. (2005). Order IV. Sphingomonadales ord. nov. In Bergey’s Manual of Systematic Bacteriology, D.J. Brenner, N.R. Krieg, and J.T. Staley, eds. (New York: Springer), pp. 230–258.
Yergeau, E., Sanschagrin, S., Beaumier, D., and Greer, C.W. (2012). Metagenomic analysis of the bioremediation of diesel-contaminated Canadian high arctic soils. PLoS One 7, e30058.
Yildiz, F.H., Gest, H., and Bauer, C.E. (1991). Attenuated effect of oxygen on photopigment synthesis in Rhodospirillum centenum. J. Bacteriol. 173, 5502–5506.
Yurkov, V.V., and Beatty, J.T. (1998). Aerobic anoxygenic phototrophic bacteria. Microbiol. Mol. Biol. Rev. 62, 695–724.
Zhang, W., Wang, Y., Lee, O.O., Tian, R., Cao, H., Gao, Z., Li, Y., Yu, L., Xu, Y., and Qian, P.-Y. (2013). Adaptation of intertidal biofilm communities is driven by metal ion and oxidative stresses. Sci. Rep. 3, 3180.
Zomorodipour, A., and Andersson, S.G. (1999). Obligate intracellular parasites: Rickettsia prowazekii and Chlamydia trachomatis. FEBS Lett. 452, 11–15.