
M. Sc. Thesis—Quan Yao McMaster—Biology


IDENTIFICATION OF ENVIRONMENTAL ALPHAPROTEOBACTERIA WITH CONSERVED SIGNATURE PROTEINS IN METAGENOMIC DATASETS

BY

QUAN YAO, B.Sc.

A Thesis

Submitted to the School of Graduate Studies

in Partial Fulfillment of the Requirements

For the Degree

Master of Science

McMaster University

© Copyright by Quan Yao, Dec 2013

MASTER OF SCIENCE (2013) McMaster University

(Biology) Hamilton, Ontario

TITLE: Identification of Environmental Alphaproteobacteria with Conserved Signature Proteins in Metagenomic Datasets

AUTHOR: Quan Yao, B.Sc. (Ocean University of China)

SUPERVISOR: Professor H.E. Schellhorn

NUMBER OF PAGES: ix, 94


Abstract

Microbial metagenomics is the exploration of the taxonomic diversity of microbial communities in environmental habitats using large, exhaustive DNA sequence datasets. However, due to inherent limitations of sequencing technology and the complexity of environmental samples, current analytical approaches do not reveal the existence of all microbes that may be present. In this study, a new classification approach is proposed that uses unique proteins specific for different clades of Alphaproteobacteria to predict the presence or absence of members of these groups in published metagenomic datasets. In this work, 264 previously identified, published conserved signature proteins (CSPs) characteristic of individual taxonomic clades of Alphaproteobacteria are used as probes to detect the presence of bacteria in metagenomic datasets. Although public sequence information has increased manifold since these CSPs were initially identified 6 years ago, results indicate that nearly all of these CSPs (259 of 265) are specific for their previously characterized clades. Furthermore, they are confirmed to be present in the newly identified and sequenced members of these clades. In view of their specificity and predictive ability in different monophyletic clades of Alphaproteobacteria, the sequences of these CSPs provide reliable probes to determine the presence or absence of these Alphaproteobacteria in metagenomic datasets. In this work, CSPs are used to assess Alphaproteobacteria diversity in 10 published metagenomic datasets (bioreactor, , wastewater, activated sludge, groundwater, freshwater sediment, microbial mat, marine, hydrothermal vent and whale fall metagenomes), which cover diverse environments and ecosystems. The results indicate that BLAST searches with these CSPs can be used to efficiently identify Alphaproteobacteria species in these metagenomic datasets and that substantial differences can be detected in the distribution and relative abundance of different Alphaproteobacteria species in the tested metagenomic datasets. Thus the CSPs, which are specific for different microbial taxa, provide a novel and powerful means for identification of microbes and for their taxonomic profiling in metagenomic datasets.


Acknowledgements

First, I must thank my supervisor, Dr. Herb Schellhorn, who gave many valuable suggestions and recommendations during my research work, and who generously took us to attend the conference of the Canadian Society of Microbiologists in Ottawa, where we had a great opportunity to share our research and communicate with the world’s top researchers. The second summer at Dr. Schellhorn’s cottage is an unforgettable memory, where we enjoyed a fascinating retreat after a year of hard work.

Equally important, I would like to thank my co-supervisor, Dr. Gupta, for his continuous support in my work and the inspiration he ignited in my mind, and my committee chair, Dr. Igdoura, for his kindness and assistance in my defense.

Secondly, I have to thank my lab mates, who accompanied me over the past 2 years both in the lab and off campus. I want to acknowledge Lingzi, Mohammed, Shirley, Sohail, Steve, Rachel, and Pardis. The coffee break chats on casual and entertaining topics, the cooperative work we managed to accomplish when encountering bottlenecks in research, and the in-depth exchanges of ideas and thoughts about philosophy, the universe and ourselves: all these pieces make up an indispensable part of my life and helped establish my values and faith.

Finally, I must thank my parents for their encouragement throughout my life. Without their guidance and instruction, I could never have achieved the goals that I have dreamed of. Their love is my forever treasure and provides the motivation to help me conquer future obstacles in my career.


Table of Contents

Part I. Uniqueness of Alphaproteobacteria specific CSPs
Chapter 1 Introduction
1.1 Significance of Alphaproteobacteria
1.2 Conserved signature proteins as phylogenetic markers
1.3 Standards for taxonomic hierarchy
Chapter 2 Materials and methods
2.1 Confirmation of the uniqueness of CSPs
2.2 Grouping of CSP into Taxonomic levels
Chapter 3 Results
3.1 Confirmation of the uniqueness of CSPs
3.2 Grouping of CSP into Taxonomic levels
Chapter 4 Discussion
4.1 Confirmation of the uniqueness of CSP
4.2 Grouping of CSP into Taxonomic levels
4.3 Future experiments

Part II. Identification of Alphaproteobacteria specific CSPs in metagenomic samples
Chapter 1 Introduction
1.1 Metagenome, environmental genomes
1.2 Taxonomic classification of metagenomic reads: methods and challenges
1.3 Application of metagenomics
1.4 Project objectives
Chapter 2 Materials and methods
2.1 Metagenome selection
2.2 Identification of CSP in metagenomic samples
2.3 Comparative analysis of Alphaproteobacteria in metagenomes
Chapter 3 Results
3.1 Metagenome selection
3.2 Identification of CSPs in metagenomic samples
3.3 Comparative analysis of Alphaproteobacteria in metagenomes
Chapter 4 Discussion
4.1 Metagenome selection
4.2 Identification of CSPs in metagenomic samples
4.3 Comparative analysis of Alphaproteobacteria in metagenomes
4.4 Overall conclusions
4.5 Future directions

References


List of Figures

Figure 1: Summary heatmap of 16 Alphaproteobacteria specific CSPs in 10 metagenomes
Figure 2: Alphaproteobacteria specific CSPs identified in 10 metagenomes
Figure 3: Similarity of significant hits in 10 metagenomes
Figure 4: Overall relative abundance of Alphaproteobacteria based on CSP distribution in 10 metagenomes
Figure 5: The relative abundance of Alphaproteobacteria and its different sub-clades in the studied metagenomes based upon BLASTp searches with CSPs
Figure 6: Comparative results of Alphaproteobacteria distribution in 4 metagenomes derived from (A) CSPs-based binning and (B) similarity-based binning


List of Tables

Table 1: Alphaproteobacteria specificity and predictive ability of CSPs identified in 2007 and 2013
Table 2: Comparison of the Results of BLAST Search with Protein and Nucleotide Sequences
Table 3: CSPs specific to Alphaproteobacteria
Table 4: CSPs specific to Rhizobiales
Table 5: CSPs specific to Bradyrhizobiaceae and Xanthobacteraceae
Table 6: CSPs specific to Rhodobacterales
Table 7: CSPs specific to Caulobacterales
Table 8: CSPs specific to Sphingomonadales
Table 9: CSPs specific to Rhodospirillales
Table 10: CSPs specific to Rickettsiales
Table 11: Characteristics of Metagenomic Datasets Investigated in this Study


Part I. Uniqueness of Alphaproteobacteria specific CSPs

Chapter 1 Introduction

1.1 Significance of Alphaproteobacteria

Alphaproteobacteria is one of the largest classes of Proteobacteria, which also comprises 4 other major classes: Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria and Epsilonproteobacteria (Kersters et al., 2006). Alphaproteobacteria contains 6 main orders: Rhizobiales, Rhodobacterales, Caulobacterales, Sphingomonadales, Rhodospirillales and Rickettsiales, each distinguished by different characteristics (Williams et al., 2007). Alphaproteobacterial species are morphologically, physiologically and metabolically diverse and are adapted to different habitats in both terrestrial and marine conditions (Rathsack et al., 2011; Williams et al., 2007). Most characterized Alphaproteobacteria species are Gram-negative bacteria (Olson et al., 2002). Many of them have developed mechanisms to adopt an intracellular lifestyle, either as plant mutualists or as animal pathogens (Dumler et al., 2001). Some Alphaproteobacterial species can grow at low nutrient levels (Kang et al., 2010). Alphaproteobacteria employ several important metabolic strategies such as photosynthesis, nitrogen fixation, oxidation and methylotrophy (Campagne et al., 2012). They are also morphologically diverse, with stellate, spiral and prosthecate forms (Hallez et al., 2004). Alphaproteobacteria are the most abundant cellular organisms in marine environments (Williams et al., 2007). Pelagibacter ubique, which was isolated in 2002, was found to comprise 1/4 of all plankton cells in the ocean (Sowell et al., 2008).


Rhizobiales is the largest order of Alphaproteobacteria, constituting 1/3 of all sequenced Alphaproteobacteria species (Carvalho et al., 2010). Rhizobiales species have developed several strategies to adapt to both intracellular and extracellular niches (Carvalho et al., 2010). Plant mutualists such as , and are capable of fixing nitrogen in symbiosis with most leguminous plants (Fischer, 1996). Agricultural and animal pathogens such as , and are obligate or facultative intracellular parasites of either plants or animals and have been studied extensively (Bowman, 2011). Bartonella henselae, the chief causative agent of cat scratch disease (CSD), is a Gram-negative bacterium (English, 1988). Intimate contact with infected cats, such as scratches, bites and saliva, can cause the transmission of B. henselae (Andersson and Kempf, 2004). Fortunately, infection by Bartonella sp. causes a mild injury, which can be easily treated with common antibiotics (Holley, 1991). Brucella, another obligate parasite, comprises small, non-motile coccobacilli that are more severe pathogens than Bartonella sp. (Alsmark et al., 2004). They are usually transmitted to animals through the gastrointestinal tract (GI tract), respiration and skin wounds, subsequently causing infection in many animals due to their ability to survive (Breitschwerdt and Kordick, 2000). Severe infection may affect the central nervous system or circulatory system, and treatment such as a combination of and rifampin is necessary for at least 6 weeks, while the treatment period mainly depends on the timing of treatment and the severity of illness (Raoult et al., 2003).


Most Rhodobacterales are purple non-sulfur bacteria, belonging to a larger group called photolithotrophic bacteria (Dang et al., 2008). They employ several metabolic mechanisms including photosynthesis, nitrogen fixation and , under either aerobic or anaerobic conditions (Dang et al., 2008). Rhodobacter sphaeroides, first isolated from deep lakes and stagnant waters (Choudhary and Kaplan, 2000), is remarkable for two unique characteristics: an innate sensing system based on invaginations and two sets of chromosomes responsible for distinct functions. The versatility of Rhodobacterales species in enables them to dominate many ecological niches, and they are especially abundant in oceans (Oh and Kaplan, 2001).

Caulobacterales are typically found in low-nutrient aquatic environments such as lakes and rivers (Riemann et al., 2008). They have a characteristic stalk that can anchor to the surfaces of nearby (Poindexter and Staley, 1996). This attachment strategy increases their nutrient uptake, since it exposes them to a continuously changing flow of fluids (Poindexter and Staley, 1996). Meanwhile, Caulobacterales can exploit the ’s excretions as extra nutrients when environmental nutrients are depleted (Abraham et al., 2008).

Sphingomonadales are oval or rod-shaped bacteria characterized by sphingolipids located in the outer membrane of the cell wall (Yabuuchi and Kosako, 2005). Some of them are pleomorphic, with cell shapes that change over time, while other relatives undertake phototrophic metabolism (Yurkov and Beatty, 1998). Most Sphingomonadales species are widespread in diverse terrestrial and aquatic habitats due to their ability to survive in low-nutrient environments (Boersma et al., 2009). Sphingomonadales can be applied in bioremediation, since some of the species isolated from contaminated environments feed on toxic aromatic compounds as their main nutrient source (Boersma et al., 2009).

Rhodospirillales comprise 2 distinct families: Acetobacteraceae and Rhodospirillaceae (Gupta and Mok, 2007a). In Acetobacteraceae, the soil bacterium Azospirillum uses the nutrients excreted by plants and in exchange fixes atmospheric nitrogen into ammonia for its host plants (Steenhoudt and Vanderleyden, 2000). Acetobacter and are industrially important aerobic organisms widely used in brewing for the fermentation of wine and by converting ethyl alcohol into acetic acid (Gullo and Giudici, 2008). Rhodospirillum is a facultatively anaerobic bacterium (Yildiz et al., 1991). When oxygen is exhausted, Rhodospirillum activates its photosynthetic apparatus to acquire nutrition (Yildiz et al., 1991). However, the mechanism of photosynthesis depression under aerobic conditions is poorly understood (Matsuda et al., 1984).

The order Rickettsiales is mostly composed of pathogens and marine bacteria (Fredricks, 2006). The typical members, Rickettsia, are Gram-negative and rod shaped (Zomorodipour and Andersson, 1999). These obligate intracellular parasites only reproduce within mammalian cells. Laboratory and purification is feasible with culture or . Rickettsia enter host cells by inducing phagocytosis (Sahni and Rydkina, 2009). Once they penetrate into the cytoplasm of the cell, they reproduce by binary fission to ensure their survival. Infection by Rickettsia disrupts the permeability of capillaries, which is clinically characterized by a spotted rash (Walker et al., 2003). Another obligate pathogen of clinical significance is the genus Ehrlichia. Ehrlichia cause parasitemia by living in blood cells (Arraga-Alvarado et al., 2003). They are often transmitted from animals to humans through bites of infected ticks, which eventually results in ehrlichiosis (Arraga-Alvarado et al., 2003). Apart from their pathogenic features, Rickettsiales are also the closest relatives of eukaryotic mitochondria based on high genomic similarity (Gray, 2012).

1.2 Conserved signature proteins as phylogenetic markers

Conserved signature proteins (CSPs) are a type of rare genomic change (RGC) often applied in phylogenetic analysis and taxonomic classification, because they are whole proteins uniquely present in certain groups of bacteria but not found anywhere else (Gao et al., 2006; Gupta and Lorenzini, 2007). Although most identified CSPs are of unknown function, their distribution pattern at different phylogenetic depths provides reliable evidence to distinguish taxonomically coherent clades (Bhandari et al., 2012). Because CSPs, like other RGCs, are mostly inherited vertically rather than horizontally, they are applied to elucidate the evolutionary relationships among closely related clades (Bhandari et al., 2012). Recent studies showed that these CSPs could be identified in newly sequenced species (Bhandari et al., 2012; Gao and Gupta, 2012). Due to their specificity and conservation, it is postulated that the CSPs may be present in uncharacterized Alphaproteobacterial species. Environmental Alphaproteobacterial species may also carry such molecular markers, demonstrating their affiliation with their laboratory relatives. A previous analysis of approximately 60 Alphaproteobacteria genomes identified 265 CSPs specific to different phylogenetic clades (Gupta and Mok, 2007a). Serving as reliable molecular markers, these CSPs are used to predict the presence of Alphaproteobacteria species in environmental samples if similar sequences are identified.

1.3 Standards for taxonomic hierarchy

The most widely accepted criterion currently used for taxonomic purposes is the branching pattern of 16S rRNA trees (Nguimbi et al., 2003), because 16S rRNA is present in almost all bacterial species and has a dual character: conserved and variable regions alternate along the gene (Nguimbi et al., 2003). The conserved regions of 16S rRNA are used to infer common ancestry, while the variable regions differentiate one species from another (Moine et al., 2000). Currently, Bacteria are classified into 23 major groups on the basis of 16S rRNA (Ludwig et al., 1998). However, species are not evenly distributed among phyla, and the numbers are biased by the fact that some genera have been studied more intensively than others. For instance, Proteobacteria, , Firmicutes, and are the 5 largest phyla, which comprise 90-95% of all known bacteria in laboratory (Binnewies et al., 2006). In contrast, some small phyla such as Ignavibacteriae, Caldiserica, Chrysiogenetes, Dictyoglomi and Thermodesulfobacteria account for less than 1% of the bacteria studied (Binnewies et al., 2006). Furthermore, due to the low resolution of the 16S rRNA gene marker below genus level, phylogenetic trees based on a single gene cannot robustly resolve all the issues regarding evolutionary events of different bacterial species (Kunisawa, 2007). Hence, the taxonomic hierarchy of the domain Bacteria is primarily subjective and there is, as yet, no consistent agreement on their phylogeny (Gupta, 2005a).

To describe the evolutionary relationships of bacteria appropriately, phylogenetic trees based on topological models such as rooted, unrooted and bifurcating trees can be constructed (Williams et al., 2011). In an idealized rooted phylogenetic tree, all bacteria are derived from a common ancestral bacterium, and the earliest bacterium is found at the root of the tree (Arisue et al., 2005). Each branch indicates the divergence of a large bacterial clade such as a phylum or class in evolutionary history. The closer a branch is to the root, the earlier the divergence event occurred. Recent branches denote the further divergence of different sub-clades such as order, family, genus and species. Bacteria on the same branch have more characteristics in common than those on different branches (Doolittle and Bapteste, 2007).

The purpose of identifying CSPs is to provide reliable evidence for each node of the phylogenetic tree and to support the validity of its branching pattern (Gupta and Griffiths, 2002). Previous studies have identified a myriad of CSPs specific to different clades within Alphaproteobacteria. These molecular markers resolved the phylogeny of Alphaproteobacterial species (Gupta, 2005b; Gupta and Mok, 2007b; Kainth and Gupta, 2005). With the increased availability of large datasets, sufficient CSPs can support the construction of a comprehensive and reliable phylogenetic tree for both Alphaproteobacteria and the whole Bacteria domain.

1.4 Project objectives


Alphaproteobacteria-specific CSPs have proved useful in inferring phylogenetic trees and branching patterns within Alphaproteobacteria clades (Gupta and Mok, 2007a). Although the majority of CSPs are hypothetical proteins, these proteins may confer certain functions or characteristics that distinguish species belonging to Alphaproteobacteria clades from all others. The aim of this project is to confirm the specificity of previously identified Alphaproteobacteria-specific CSPs at different phylogenetic depths by performing BLAST searches against the latest nr protein database. Then, according to the distribution of the CSPs among bacteria, all confirmed CSPs are grouped based on their specificity. Finally, a CSPs database that represents different clades of Alphaproteobacteria from class level to family level will be constructed to serve as signature markers for bacterial diagnosis in environments.


Chapter 2 Materials and methods

2.1 Confirmation of the uniqueness of CSPs

In view of the large increase in the number of sequenced bacterial genomes in the past 6 years, current CSPs may now be found in new species, whether or not these are members of Alphaproteobacteria. Therefore, systematic BLASTp searches (Altschul et al., 1990) were performed with each CSP against the NCBI non-redundant protein sequences (nr) database (all non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF, excluding environmental samples from WGS projects) with an E-value threshold of 1e-04 to confirm their specificity. Meanwhile, a parallel BLASTn search was conducted with the nucleotide sequences of the corresponding CSPs to compare the uniqueness of protein sequences and nucleotide sequences. By convention, BLAST hits with associated E-values >1e-04 do not support orthology; thus hits exceeding this E-value threshold are excluded from phylogenetic analysis. However, in some cases, when query proteins are too short to yield sufficient information (bits of information) to determine a discriminating E-value, higher E-values can be employed (Sharon et al., 2005). A potential CSP is considered to be clade specific if all significant BLAST hits are derived from within a monophyletic clade of Alphaproteobacteria, or if there is a large difference between the E-value of the last hit belonging to Alphaproteobacterial relatives and that of the first hit from non-Alphaproteobacteria (Gupta and Mok, 2007a).

All significant hits of CSPs meeting the criteria described above were further analyzed as described below.
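The two specificity criteria above can be illustrated with a minimal Python sketch. This is not the procedure used in the thesis, which does not quantify "a large difference"; the `Hit` record, the `in_clade` flag and the `min_gap_orders` parameter are hypothetical names introduced here, and the example hits are invented for illustration.

```python
import math
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-4  # significance threshold stated in the text


@dataclass
class Hit:
    organism: str
    in_clade: bool   # assumed to be known from the taxonomy of the hit
    evalue: float


def significant_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Return hits whose E-value does not exceed the cutoff."""
    return [h for h in hits if h.evalue <= cutoff]


def is_clade_specific(hits, cutoff=E_VALUE_CUTOFF, min_gap_orders=10.0):
    """Apply the two criteria described above.

    min_gap_orders is a hypothetical stand-in for "a large difference":
    the number of orders of magnitude that must separate the worst
    in-clade E-value from the best out-of-clade E-value.
    """
    sig = significant_hits(hits, cutoff)
    inside = [h.evalue for h in sig if h.in_clade]
    outside = [h.evalue for h in sig if not h.in_clade]
    if not inside:
        return False          # no significant hit within the clade
    if not outside:
        return True           # criterion 1: every significant hit is in-clade
    # Criterion 2: large E-value gap. Clamp to avoid log10(0) when E = 0.0.
    worst_in = math.log10(max(max(inside), 1e-300))
    best_out = math.log10(max(min(outside), 1e-300))
    return (best_out - worst_in) >= min_gap_orders


# Invented example: two in-clade hits and one weak out-of-clade hit.
hits = [
    Hit("Caulobacter crescentus", True, 1e-80),
    Hit("Brucella suis", True, 1e-40),
    Hit("Escherichia coli", False, 1e-6),
]
print(is_clade_specific(hits))
```

Measuring the gap in orders of magnitude is only one possible reading of the criterion; a different threshold or a bit-score comparison would fit the description equally well.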


2.2 Grouping of CSP into Taxonomic levels

We determined the taxonomic placement of significant hits for each CSP from BLASTp searches. A CSP should have multiple, similar sequences that are shared among several closely related species. The taxonomic report produced by BLASTp searches yields the distribution of the query CSP across all Bacteria. The lowest common ancestor (LCA) of the reported taxa was then identified. LCA analysis indicates the most recent taxon from which all descendant organisms are derived (Travers et al., 2004). For example, if a CSP is identified in 50 species, and these species belong to 3 genera X, Y, Z under 2 families M, N under 1 order A, this CSP will be defined as an order A-specific CSP. It will not be named a genus X-specific or family M-specific CSP because this marker is not uniquely present in a single genus or family but is also present in genera Y, Z and family N. The principles of LCA analysis yield the most parsimonious definition for the specificity of this CSP (Travers et al., 2004). A few organisms outside the clade may also share some CSPs found within a monophyletic clade of Alphaproteobacteria. These cases are likely due to lateral gene transfer (LGT) events, but these protein markers may still be regarded as clade-specific markers (Beiko and Ragan, 2008). Occasionally, a few CSPs might be found sporadically distributed in several distantly related bacterial clades. These signature markers are likely to have been misdiagnosed due to the limited number of sequenced Alphaproteobacterial species at that time, and such non-specific markers are excluded from the CSPs database.
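The LCA rule in the example above can be sketched in a few lines of Python. This is an illustration rather than the tool used in the study: lineages are assumed to be given as tuples ordered from the highest rank to the lowest, and the assignment of genera X and Y to family M and genus Z to family N is an assumption, since the text does not say which genus belongs to which family.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    Each lineage is a sequence of taxon names ordered from the highest
    rank to the lowest (e.g. order, family, genus).
    """
    lca = None
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break             # lineages diverge at this rank
        lca = taxa_at_rank[0]
    return lca


# The example from the text: genera X, Y, Z in families M, N of order A.
hit_lineages = [
    ("order A", "family M", "genus X"),
    ("order A", "family M", "genus Y"),
    ("order A", "family N", "genus Z"),
]
print(lowest_common_ancestor(hit_lineages))  # order A
```

If only the first two lineages were hit, the same function would return family M, which matches the rule that a CSP is named after the narrowest taxon containing all of its hits.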


Table 1: Alphaproteobacteria specificity and predictive ability of CSPs identified in 2007 and 2013

Clade Specificity | # of sequenced genomes, 2007 | # of sequenced genomes, 2013 | # of identified CSPs, 2007 | # of identified CSPs, 2013 | Accession # and other information
Alphaproteobacteria | 60 | 250 | 4 | 4 | Table 3A
Alphaproteobacteria except Rickettsiales | 45 | 180 | 7 | 7 | Table 3B
Rhizobiales | 24 | 96 | 3 | 3 | Table 4A
Clade 1 Rhizobiales | 14 | 58 | 16 | 16 | Table 4B
Rhizobiaceae and Phyllobacteriaceae | 6 | 30 | 18 | 18 | Table 4C
Bradyrhizobiaceae and Xanthobacteraceae | 10 | 20 | 74 | 74 | Table 5A, 5B
Rhodobacterales | 8 | 26 | 35 | 35 | Table 6A
Rhodobacteraceae | 3 | 4 | 13 | 13 | Table 6B
Caulobacterales | 3 | 7 | 11 | 11 | Table 7
Sphingomonadales | 5 | 14 | 31 | 31 | Table 8
Rhodospirillales | 5 | 27 | 4 | 0 | N/A
Acetobacteraceae | 3 | 17 | 14 | 17 | Table 9A
Rhodospirillaceae | 2 | 10 | 14 | 14 | Table 9B
Rickettsiales | 15 | 69 | 3 | 2 | Table 10A
Anaplasmataceae | 7 | 23 | 15 | 16 | Table 10B
Rickettsiaceae | 7 | 45 | 3 | 3 | Table 10C

Note: The values underlined highlight the changes of CSP specificity during the periods
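As a quick consistency check on Table 1, the CSP counts in the 2007 and 2013 columns can be tallied. The short Python sketch below simply copies the two columns in row order; the variable names are illustrative and not part of the original analysis.

```python
# CSP counts per row of Table 1, in row order.
csps_2007 = [4, 7, 3, 16, 18, 74, 35, 13, 11, 31, 4, 14, 14, 3, 15, 3]
csps_2013 = [4, 7, 3, 16, 18, 74, 35, 13, 11, 31, 0, 17, 14, 2, 16, 3]

total_2007 = sum(csps_2007)
total_2013 = sum(csps_2013)

# Rows whose CSP count differs between the two years, as (2007, 2013) pairs.
changed = [(old, new) for old, new in zip(csps_2007, csps_2013) if old != new]

print(total_2007, total_2013)  # 265 264
print(changed)                 # [(4, 0), (14, 17), (3, 2), (15, 16)]
```

The totals agree with the 265 CSPs originally identified and the 264 CSPs retained, and only four rows change between the two years.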


Table 2: Comparison of the Results of BLAST Search with Protein and Nucleotide Sequences

Accession | # of Hits (1) | Protein Specificity | Gene ID | # of Hits (2) | Nucleotide Specificity
NP_422086 | 621 | α-proteobacteria | 943808 | 8 |
NP_105743 | 276 | Clade 1 Rhizobiales | 1228404 | 13 | Mesorhizobium and Sinorhizobium
NP_102577 | 76 | Rhizobiaceae | 1225240 | 2 |
YP_317328 | 32 | Bradyrhizobiaceae | 3674956 | 2 |
YP_611978 | 92 | Rhodobacterales | 4075456 | 1 | Silicibacter sp. TM1040
YP_614100 | 21 | Rhodobacteraceae | 4077857 | 1 | Silicibacter sp. TM1040
YP_495301 | 76 | Sphingomonadales | 3916060 | 1 | Novosphingobium aromaticivorans
AAW62008 | 45 | Acetobacteraceae | 3249894 | 1 | Gluconobacter oxydans
YP_428643 | 23 | Rhodospirillaceae | 3837017 | 2 | Rhodospirillum rubrum
NP_220498 | 92 | Rickettsiales | 883719 | 42 | Rickettsia

1. Significant hits (hits with E-values below 1e-04) of protein sequences were obtained using BLASTp.
2. Significant hits (hits with E-values below 1e-04) of nucleotide sequences were obtained using BLASTn.


Chapter 3. Results

3.1 Confirmation of the uniqueness of CSPs

Most CSPs were found to be specific to their original taxa even though the number of sequenced Alphaproteobacteria species has increased almost 4-fold (Table 1). In the CSPs database, the 4 Alphaproteobacteria-specific CSPs that used to be shared by 60 sequenced Alphaproteobacteria species are now uniquely shared by more than 250 Alphaproteobacteria species, including many members sequenced between 2007 and 2013. Similar results were also seen for the other 7 Alphaproteobacteria-specific CSPs (which are absent in the order Rickettsiales). The 37 Rhizobiales-specific CSPs were also confirmed to be specific for most Rhizobiales species. For example, 3 Rhizobiales-specific CSPs have been identified in almost all 96 sequenced Rhizobiales species. In detail, 16 Clade 1 Rhizobiales CSPs were commonly shared by 11 Rhizobiaceae species, 8 Phyllobacteriaceae species, 2 species, 16 species and 12 Bartonellaceae species (another 18 CSPs were only present in Rhizobiaceae and Phyllobacteriaceae species). Likewise, another important clade, Bradyrhizobiaceae and Xanthobacteraceae, yielded a similar pattern: 74 CSPs were found to be present in 18 Bradyrhizobiaceae and 3 Xanthobacteraceae species. BLAST search results for the other 4 important orders of Alphaproteobacteria also validated the prediction that CSPs previously identified from a limited number of sequenced Alphaproteobacteria are present in newly sequenced Alphaproteobacterial species. The 35 Rhodobacterales-specific CSPs were highly conserved in 41 Rhodobacterales species, while 13 CSPs previously specific to Silicibacter and Roseobacter were present in other Rhodobacteraceae sp., such as Phaeobacter and Ruegeria species. These 13 CSPs are now defined as Rhodobacteraceae-specific CSPs. The 11 Caulobacterales-specific CSPs were found to be unique to 9 Caulobacterales species and 4 species. The 31 Sphingomonadales-specific CSPs are now uniquely present in 3 species and 17 species.

Most Rhodospirillales-specific CSPs and Rickettsiales-specific CSPs were conserved within their groups. However, 4 Rhodospirillales-specific CSPs proved to be specific only to Acetobacteraceae, and 1 Rickettsiales-specific CSP proved to be specific to Anaplasmataceae species (underlined in Table 1). Only 1 non-specific CSP was identified (Accession number: AAW61951), which used to be specific to Acetobacteraceae. This was the only CSP that did not meet the classification criterion; as a result, the CSP database contained 264 qualified CSPs in total.

Important differences were observed in the clade specificity of the same CSPs when BLAST searches were performed using the nucleotide sequence data versus the protein sequence data (Table 2). For example, for two of the signature proteins specific for the family Anaplasmataceae (viz. NP_966526 and NP_965909), when BLASTp searches were carried out using the amino acid sequence data, significant hits were observed for all of the sequenced species from the family Anaplasmataceae (e.g. Wolbachia, Anaplasma, Ehrlichia, etc.). In contrast, when the BLAST searches were carried out using the gene sequences for the same proteins, then depending upon whether the searches were carried out with the Wolbachia or Anaplasma gene sequences, all significant hits obtained were only for the Wolbachia or the Anaplasma species. Similarly, for a signature protein that is specific for Caulobacterales (viz. NP_419305), the BLASTp search with its amino acid sequence identified >30 significant hits covering all of the sequenced Caulobacterales species, while the BLASTn search with its nucleotide sequence identified only 6 significant hits, most of which were from the genus Caulobacter. Similar differences were observed in the results of BLAST searches for the signature proteins of other bacterial clades. Thus, the use of gene sequences as markers may grossly underestimate the taxonomic diversity of microbial species in environments compared with that revealed by the use of CSPs.
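The contrast between protein and nucleotide searches can be expressed as the fraction of a clade's genera recovered by each search. The Python sketch below is illustrative only: the function names are hypothetical, the genus is assumed to be the first word of each organism name, and the hit lists are invented to mirror the Anaplasmataceae example rather than taken from the study's results.

```python
def genera(hit_organisms):
    """Collect genus names, assuming the genus is the first word of each name."""
    return {name.split()[0] for name in hit_organisms}


def genus_coverage(hit_organisms, clade_genera):
    """Fraction of the clade's genera that have at least one hit."""
    clade = set(clade_genera)
    return len(genera(hit_organisms) & clade) / len(clade)


# Illustrative inputs only; these are not hit lists from the study.
anaplasmataceae = {"Wolbachia", "Anaplasma", "Ehrlichia"}
blastp_hits = ["Wolbachia pipientis", "Anaplasma marginale", "Ehrlichia chaffeensis"]
blastn_hits = ["Wolbachia pipientis"]

print(genus_coverage(blastp_hits, anaplasmataceae))  # 1.0
print(genus_coverage(blastn_hits, anaplasmataceae))
```

In this toy case the protein search recovers every genus in the clade while the nucleotide search recovers one of three, which is the pattern the paragraph above describes.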

3.2 Grouping of CSP into Taxonomic levels

Once all qualified CSPs had been filtered, it was possible to group them based on their taxonomic specificity. All these CSPs are specific either to the class Alphaproteobacteria or to different orders and families within Alphaproteobacteria. In the CSPs database, they are divided into 8 major groups: 11 Alphaproteobacteria specific CSPs (Table 3), 37 Rhizobiales specific CSPs (Table 4), 74 Bradyrhizobiaceae and Xanthobacteraceae specific CSPs (Table 5), 48 Rhodobacterales specific CSPs (Table 6), 11 Caulobacterales specific CSPs (Table 7), 31 Sphingomonadales specific CSPs (Table 8), 31 Rhodospirillales specific CSPs (Table 9) and 21 Rickettsiales specific CSPs (Table 10).


Table 3: CSPs specific to Alphaproteobacteria

Gene ID Accession # Length Gene ID Accession # Length

A. CSPs unique to all Alphaproteobacteria

CC2102 NP_420905 162 CC3319 NP_422113 89

CC3292 NP_422086 224 CC1365 NP_420178 161

B. CSPs unique to Alphaproteobacteria except Rickettsiales

CC1211 NP_420025 167 CC0520 NP_419339 284

CC1886 NP_420693 223 CC3010 NP_421804 216

CC2245 NP_421048 190 CC0100 NP_418919 576

CC3470 NP_422264 253


Table 4: CSPs specific to Rhizobiales

Gene ID Accession # Length Gene ID Accession # Length

A. CSPs unique to Rhizobiales

BQ00720 YP_031797 83 BQ12030 YP_032733 91

BQ07670 YP_032395 336

B. CSPs unique to Brucellaceae, Bartonellaceae, Phyllobacteriaceae, Rhizobiaceae and Aurantimonadaceae

mll0062 NP_101943 107 mll1268 NP_102895 108

mll4068 NP_105027 144 mll2847 NP_104087 186

mll7791 NP_108034 263 mll2898 NP_104130 144

mlr0777 NP_102510 186 mll4298 NP_105201 171

mlr0789 NP_102519 207 mll5001 NP_105743 324

mlr3016 NP_104217 166 mll8359 NP_108472 415

msl6526 NP_107016 80 mlr1823 NP_103319 198

mll0122 NP_101988 349 mlr0094 NP_101965 299

C. CSPs unique to Rhizobiaceae and Phyllobacteriaceae

mll0080 NP_101954 172 mll0459 NP_102252 108

mll0867 NP_102577 168 mll1779 NP_103286 141

mll9619 NP_109472 296 mll6195 NP_106741 174

mlr5174 NP_105883 181 mll8758 NP_106740 205

mll6303 NP_106835 292 mlr3037 NP_104236 281

mll6703 NP_107159 198 mll2007 NP_103455 289

mlr1904 NP_103376 146 mlr1999 NP_103450 111

mlr3274 NP_104418 461 mlr2029 NP_103476 238

mlr4951 NP_105704 84 mlr6601 NP_107075 141

Table 5: CSPs specific to Bradyrhizobiaceae and Xanthobacteraceae

Gene ID Accession # Length Gene ID Accession # Length

A. CSPs unique to Bradyrhizobiaceae and Xanthobacteraceae

bll6014 NP_772654 193 Nwi_1674 YP_318287 185

Nwi_1093 YP_317707 195 Nwi_1705 YP_318318 63

Nwi_1227 YP_317841 106 Nwi_1711 YP_318324 77

Nwi_1786 YP_318399 126 Nwi_1785 YP_318398 422

Nwi_1788 YP_318401 190 Nwi_1793 YP_318406 165

Nwi_2147 YP_318753 82 Nwi_1800 YP_318413 84

B. CSPs unique to Bradyrhizobiaceae

Nwi_2179 YP_318785 161 Nwi_2021 YP_318632 172

Nwi_2432 YP_319038 110 Nwi_2063 YP_318673 186

Nwi_2476 YP_319081 85 Nwi_2064 YP_318674 148

Nwi_2572 YP_319177 171 Nwi_2163 YP_318769 156

Nwi_2623 YP_319228 87 Nwi_2173 YP_318779 109

Nwi_2707 YP_319312 198 Nwi_2183 YP_318789 129

bll5899 NP_772539 131 Nwi_2208 YP_318814 174

blr6106 NP_772746 141 Nwi_2244 YP_318850 164

Nwi_0278 YP_316897 398 Nwi_2247 YP_318853 230

Nwi_0503 YP_317122 108 Nwi_2379 YP_318985 450

Nwi_0528 YP_317147 66 Nwi_2381 YP_318987 63

Nwi_0605 YP_317224 71 Nwi_2414 YP_319020 89

Nwi_0710 YP_317328 248 Nwi_2489 YP_319094 259

Nwi_0925 YP_317539 86 Nwi_2492 YP_319097 122

Nwi_0966 YP_317580 260 Nwi_2500 YP_319105 152

Nwi_1084 YP_317698 385 Nwi_2506 YP_319111 72

Nwi_1092 YP_317706 145 Nwi_2509 YP_319114 98

Nwi_1107 YP_317721 121 Nwi_2531 YP_319136 96

Nwi_1108 YP_317722 121 Nwi_2575 YP_319180 399

Nwi_1336 YP_317949 146 Nwi_2577 YP_319182 135

Nwi_1139 YP_317753 321 Nwi_2588 YP_319193 62

Nwi_1247 YP_317861 113 Nwi_2630 YP_319235 141

Nwi_1270 YP_317883 137 Nwi_2676 YP_319281 217

Nwi_1275 YP_317888 126 Nwi_2677 YP_319282 102

Nwi_1454 YP_318067 160 Nwi_2769 YP_319374 127

Nwi_1498 YP_318111 142 Nwi_2789 YP_319394 112

Nwi_1512 YP_318125 409 Nwi_2984 YP_319586 68

Nwi_1581 YP_318194 99 Nwi_2959 YP_319561 87

Nwi_1582 YP_318195 83 Nwi_3035 YP_319637 582

Nwi_1586 YP_318199 182 Nwi_3140 YP_319739 156

Nwi_1649 YP_318262 101 Nwi_3141 YP_319740 104

Table 6: CSPs specific to Rhodobacterales

Gene ID Accession # Length Gene ID Accession # Length

A. CSPs unique to Rhodobacterales

TM1040_0093 YP_612088 168 TM1040_1988 YP_613982 105

TM1040_0184 YP_612179 289 TM1040_2263 YP_614257 761

TM1040_0236 YP_612231 270 TM1040_2370 YP_614364 221

TM1040_0471 YP_612466 179 TM1040_2425 YP_614419 278

TM1040_0586 YP_612581 329 TM1040_2466 YP_614460 241

TM1040_0587 YP_612582 291 TM1040_2487 YP_614481 272

TM1040_0697 YP_612692 80 TM1040_2582 YP_614576 122

TM1040_0750 YP_612745 154 TM1040_2999 YP_614993 121

TM1040_0752 YP_612747 130 TM1040_3077 YP_611313 175

TM1040_1063 YP_613058 112 TM1040_3749 YP_611978 343

TM1040_1064 YP_613059 135 TM1040_3759 YP_611988 207

TM1040_1247 YP_613242 161 TM1040_3764 YP_611993 276

TM1040_1350 YP_613345 179 TM1040_1558 YP_613553 70

TM1040_1406 YP_613401 181 TM1040_1735 YP_613730 138

TM1040_1567 YP_613562 351 TM1040_2157 YP_613732 360

TM1040_1842 YP_613837 148 TM1040_2443 YP_613733 212

TM1040_1967 YP_613961 732 TM1040_2680 YP_613734 202

TM1040_1844 YP_613839 256

B. CSPs unique to Rhodobacteraceae

TM1040_1099 YP_613094 149 TM1040_3189 YP_611425 93

TM1040_1423 YP_613418 124 TM1040_3202 YP_611438 109

TM1040_1451 YP_613446 194 TM1040_3208 YP_611444 100

TM1040_1986 YP_613980 193 TM1040_3226 YP_611462 270

TM1040_2106 YP_614100 105 TM1040_3529 YP_611763 288

TM1040_2139 YP_614133 102 TM1040_3626 YP_611855 192

TM1040_3075 YP_611311 84

Table 7: CSPs specific to Caulobacterales

Gene ID Accession # Length Gene ID Accession # Length

CC0486 NP_419305 258 CC1066 NP_419882 126

CC2480 NP_421283 253 CC1586 NP_420397 214

CC2764 NP_421560 415 CC2207 NP_421010 222

CC3101 NP_421895 379 CC2628 NP_421428 147

CC0512 NP_419331 289 CC2639 NP_421438 309

CC1064 NP_419880 296

Table 8: CSPs specific to Sphingomonadales

Gene ID Accession # Length Gene ID Accession # Length

Saro_0018 YP_495301 300 Saro_0044 YP_495327 129

Saro_0052 YP_495335 193 Saro_0154 YP_495437 97

Saro_0087 YP_495370 221 Saro_0415 YP_495697 140

Saro_0150 YP_495433 133 Saro_0458 YP_495740 319

Saro_0232 YP_495514 448 Saro_1078 YP_496357 223

Saro_0409 YP_495691 175 Saro_1126 YP_496405 286

Saro_1088 YP_496367 220 Saro_1160 YP_496439 103

Saro_1144 YP_496423 243 Saro_1163 YP_496442 70

Saro_1291 YP_496569 190 Saro_1748 YP_497022 221

Saro_1378 YP_496656 227 Saro_1785 YP_497059 117

Saro_1914 YP_497188 156 Saro_1972 YP_497246 72

Saro_2130 YP_497403 184 Saro_2036 YP_497309 414

Saro_2788 YP_498058 296 Saro_2037 YP_497310 99

Saro_2958 YP_498227 251 Saro_2333 YP_497604 568

Saro_3138 YP_498407 159 Saro_2548 YP_497818 290

Saro_3213 YP_498482 246

Table 9: CSPs specific to Rhodospirillales

Gene ID Accession # Length Gene ID Accession # Length

A. CSPs unique to Acetobacteraceae

GOX0633 AAW60410 347 GOX1222 AAW60983 304

GOX0695 AAW60472 165 GOX1224 AAW60985 207

GOX0963 AAW60735 311 GOX2275 AAW62008 201

GOX1258 AAW61019 186 GOX2316 AAW62049 628

GOX0143 AAW59936 198 GOX2452 AAW62183 143

GOX1616 AAW61357 430 GOX2454 AAW62185 466

GOX0343 AAW60126 232 GOX1233 AAW60994 272

GOX1212 AAW60973 472 GOX2456 AAW62187 497

GOX1215 AAW60976 133

B. CSPs unique to Rhodospirillaceae

Rru_A0125 YP_425217 449 Rru_A2592 YP_427676 231

Rru_A0152 YP_425244 138 Rru_A2828 YP_427912 169

Rru_A0531 YP_425622 588 Rru_A3562 YP_428643 349

Rru_A1689 YP_426776 178 Rru_A3636 YP_428717 464

Rru_A1756 YP_426843 139 Rru_A3662 YP_428743 119

Rru_A2112 YP_427199 237 Rru_A3739 YP_428820 464

Rru_A2510 YP_427597 184 Rru_A3800 YP_428881 153

Table 10: CSPs specific to Rickettsiales

Gene ID Accession # Length Gene ID Accession # Length

A. CSPs unique to Rickettsiales

WD0161 NP_965979 70 WD0715 NP_966474 94

B. CSPs unique to Anaplasmataceae

WD0083 NP_965909 271 WD0821 NP_966574 156

WD0827 NP_966580 191 WD0863 NP_966613 147

WD0157 NP_965975 242 WD0771 NP_966526 460

WD0148 NP_965966 139 WD0764 NP_966520 138

WD0772 NP_966527 202 WD1025 NP_966750 97

WD0412 NP_966202 143 WD1056 NP_966779 92

WD0467 NP_966253 106 WD1220 NP_966932 204

WD0757 NP_966513 290 WD1230 NP_966942 243

C. CSPs unique to Rickettsiaceae

RP030 NP_220424 219 RP187 NP_220576 194

RP192 NP_220581 128

Chapter 4 Discussion

4.1 Confirmation of the uniqueness of CSPs

The purpose of this study was to determine whether the CSPs identified in earlier studies can still be regarded as specific for their intended groups, so that results obtained with them in metagenomic analysis will be reliable. The results of the re-BLAST searches indicate that most of these CSPs remain specific for the previously reported taxonomic units, with a small number of exceptions (Table 1). Among the 265 CSPs examined in this study, only 6 proteins are no longer diagnostic for their original clades. One CSP previously regarded as Rickettsiales-specific (accession number NP_966526) is now specific to Anaplasmataceae (Table 10B). Similarly, four CSPs previously regarded as unique to the order Rhodospirillales (Gupta and Mok, 2007a) have now been found to be uniquely present in either Gluconobacter (accession numbers AAW60410 and AAW60472) or Acetobacteraceae (accession numbers AAW60735 and AAW61019) (Table 9A). Thus, no CSP specific to the whole order Rhodospirillales has yet been identified. Another Acetobacteraceae-specific CSP (accession number AAW61951) was found to be a sporadically distributed protein that is also present in some distantly related bacterial groups, including Planctomycetes. These CSPs were probably misassigned earlier because of the limited number of sequenced Rhodospirillales species available in 2007 (3 Acetobacteraceae species and 2 Rhodospirillaceae species) (Gupta and Mok, 2007a). The majority of CSPs retain their original taxonomic specificity and have also been identified in species of the expected clades that were fully sequenced after 2007 (Table 1).
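The re-evaluation logic described above can be illustrated with a minimal sketch. It assumes that the taxa of all significant BLAST hits for a CSP have already been collected; the function name, the toy taxon labels and the notion of a "narrowed" label are illustrative assumptions, not the procedure actually used in this work.

```python
from typing import Dict, Iterable, Optional, Set


def classify_csp(hit_taxa: Iterable[str],
                 clade: Set[str],
                 subclades: Optional[Dict[str, Set[str]]] = None) -> str:
    """Label a CSP from the taxa of its significant BLAST hits.

    hit_taxa:  taxon of every significant hit (hypothetical input).
    clade:     taxa belonging to the clade originally assigned to the CSP.
    subclades: optional mapping of sub-clade name -> member taxa.
    """
    hits = set(hit_taxa)
    if not hits:
        return "no hits"
    if hits - clade:
        # At least one hit lies outside the expected clade.
        return "not specific"
    for name, members in (subclades or {}).items():
        if hits <= members:
            # All hits fall inside a single sub-clade.
            return f"narrowed to {name}"
    return "specific"


# Toy example with made-up taxon labels.
order = {"a1", "a2", "b1", "b2"}
families = {"family_A": {"a1", "a2"}, "family_B": {"b1", "b2"}}
print(classify_csp(["a1", "b2"], order, families))
print(classify_csp(["a1", "a2"], order, families))
print(classify_csp(["a1", "x9"], order, families))
```

In this sketch, a CSP whose hits are confined to one family of the original order is reported as narrowed, which mirrors the cases in Tables 9A and 10B.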

4.2 Grouping of CSPs into taxonomic levels

The CSPs used in this work were first identified when sequence information was available for only a limited number of Alphaproteobacterial species (Gupta and Mok, 2007a). Hence, an initial undertaking in this work was to confirm their group specificities. BLAST searches confirmed that most of these proteins are still specific for the originally indicated taxonomic clades, despite a many-fold increase in the number of sequenced Alphaproteobacteria genomes (Table 1). Most of these signature markers are present in the genomes of newly sequenced Alphaproteobacteria species belonging to the appropriate taxonomic groups, but not in any other bacteria. Given their observed specificities for different clades of Alphaproteobacteria, these CSPs are distinctive characteristics that mark the divergence of Alphaproteobacteria clades during evolutionary history, and they provide reliable evidence to support the branching pattern of Alphaproteobacteria in an evolutionary context.

The grouping of molecular markers is based on phylogenetic analysis of CSP specificity. Each molecular marker is shared by several closely related taxa at a given taxonomic rank, such as class, order or family. Phylum-specific and genus-specific markers were not considered in this study. Because sufficient CSPs have been identified previously and they represent almost all major clades of Alphaproteobacteria, these CSPs can be divided into 8 groups based on their taxonomic ranks. They are specific either to Alphaproteobacteria as a whole or to sub-clades of Alphaproteobacteria.

The CSP database consists of three tiers. Tier 1 CSPs are specific to the class Alphaproteobacteria. Tier 2 CSPs are specific to different orders of Alphaproteobacteria, such as Rhizobiales, Rhodobacterales and Caulobacterales. Tier 3 CSPs are specific to constituent families within these orders. With these three tiers of CSPs, the presence of organisms can be diagnosed in a hierarchical manner. Tier 2 CSPs are not evenly distributed among the 6 orders of Alphaproteobacteria. The largest order, Rhizobiales, accounts for 121 CSPs, almost 45% of all CSPs, whereas Caulobacterales accounts for merely 11 CSPs. This disparity in the number of CSPs per order results from bias in which Alphaproteobacterial genomes have been fully sequenced: pathogenic and agriculturally important species have been studied more extensively. Apart from the CSPs unique to the class, order and family levels, phylum-specific and genus-specific CSPs are also available, for Proteobacteria and Brucella respectively. Since this project concentrates on the class Alphaproteobacteria, CSPs specific to Betaproteobacteria/Gammaproteobacteria were not considered during database construction. As Brucella is an intracellular pathogen, Brucella-specific CSPs are unlikely to be readily detected in environmental samples, and they were therefore not included in the CSP database.
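The hierarchical diagnosis enabled by the three tiers can be sketched as follows. This is only an illustration under stated assumptions: the CSP identifiers are placeholders, the miniature database covers a few clades named in this chapter, and the rule that a family-level hit also implies its parent order and class is an assumption of the sketch rather than a documented step of the method.

```python
# Hypothetical miniature CSP database: placeholder CSP id -> clade it marks.
CSP_CLADE = {
    "csp_class_1": "Alphaproteobacteria",
    "csp_order_1": "Rhizobiales",
    "csp_order_2": "Caulobacterales",
    "csp_family_1": "Bradyrhizobiaceae",
}

# Parent of each clade in the three-tier hierarchy (class > order > family).
PARENT = {
    "Rhizobiales": "Alphaproteobacteria",
    "Caulobacterales": "Alphaproteobacteria",
    "Bradyrhizobiaceae": "Rhizobiales",
}

TIER = {
    "Alphaproteobacteria": 1,
    "Rhizobiales": 2,
    "Caulobacterales": 2,
    "Bradyrhizobiaceae": 3,
}


def diagnose(detected_csps):
    """Return the clades supported by the detected CSPs, grouped by tier."""
    present = set()
    for csp in detected_csps:
        clade = CSP_CLADE.get(csp)
        # Walk up the hierarchy: a lower-tier hit supports its ancestors too
        # (an assumption of this sketch).
        while clade is not None:
            present.add(clade)
            clade = PARENT.get(clade)
    result = {1: [], 2: [], 3: []}
    for clade in sorted(present):
        result[TIER[clade]].append(clade)
    return result


print(diagnose({"csp_family_1", "csp_order_2"}))
```

A real database would hold all 264 CSPs from Tables 3 to 10 instead of the four placeholder entries used here.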

4.3 Future experiments

The next objective of this project is to detect the presence of different Alphaproteobacteria clades in metagenomic samples. Further experiments are planned as follows:

(i) Selection of suitable metagenomes for Alphaproteobacteria detection. Parameters such as the relative abundance of Alphaproteobacteria in metagenomic datasets will be taken into account when selecting metagenomes. Qualified metagenomes will be used for organism identification.

(ii) Application of the CSP database to metagenomes. This will test whether the CSP database can be used to identify environmental bacteria.

(iii) Comparative analysis of metagenomes for taxonomic profiling of Alphaproteobacteria. The results obtained with CSPs will be compared with those of traditional similarity-based binning to verify whether CSP-based similarity searches produce comparably reliable results.

Once accomplished, the experiments described above are expected to address the issues and objectives of this project.

Part II Identification of Alphaproteobacteria specific CSPs in metagenomic samples

Chapter 1 Introduction

1.1 Metagenome, environmental genomes

A metagenome is the composite of the genomes of all organisms in an environmental sample (Thomas et al., 2012). Metagenomics investigates the microbial world by applying sequencing methods and bioinformatic technologies to environmental microbial communities, bypassing the need to isolate and culture individual microbial members (Ghazanfar et al., 2010). Only about 1.0% of all microorganisms on Earth can be cultured successfully in artificial media (Ferrari et al., 2005). For instance, soil microbial communities are estimated to comprise 5000~20000 different species, but only 50~200 of them can be isolated and cultured (Handelsman, 2004). Metagenomic studies may therefore provide more information on microbial diversity in the environment (Gilbert and Dupont, 2011).

All sequence-based metagenomic studies follow a similar procedure:

(1) Total genomic DNA is extracted directly from environmental samples (for example soil, permafrost, marine, termite gut and human intestine samples) without isolation and culturing (Solonenko et al., 2013). Before further analysis, quality control (QC) and duplicate clustering (DC) are performed to reduce potentially artificial sequences in the unassembled raw reads. The QC filter calculates the average quality score of each read; based on statistics computed over the input reads, overall quality is assessed and high-quality reads are retained for further analysis (Lindner et al., 2013). Duplicate clustering is another important preparatory step that identifies duplicates in the raw reads. These duplicates are mainly sequencing artifacts in the metagenomic library, such as vectors. Duplicate clustering also reduces the redundancy of metagenomic reads to yield a non-redundant dataset (Li et al., 2012). Since raw metagenomic reads are almost non-redundant owing to the complexity of environmental bacterial communities, DC does not bias the results of subsequent experiments (Lindner et al., 2013). However, most duplicates in transcriptomes are not nonsense sequences, so running the DC workflow on meta-transcriptomic datasets is not recommended (Li et al., 2012).

(2) Metagenomic samples are sequenced either through vector sequencing or direct sequencing (Morgan et al., 2010). In the former protocol, environmental DNA is fragmented into small pieces, which are subsequently inserted into Escherichia coli vectors to build a metagenomic library (Lussier et al., 2011). Direct sequencing skips metagenomic library construction and sequences the fragmented microbial genomes directly from environmental samples (Kisand et al., 2012).

(3) The purpose of metagenomic assembly is to assemble similar sequences from related genomes while preventing the assembly of similar sequences from unrelated genomes (Ruby et al., 2013). The metagenomic reads are assembled into contigs and scaffolds (Nijkamp et al., 2013). However, sequence assembly is a major bottleneck in metagenomic studies. Repeats lead to ambiguity in genome recovery, insufficient coverage leaves many gaps in the recovered genomes, and sequencing errors are an inherent flaw that precedes any bioinformatic analysis (Huang et al., 2012). In many metagenomic studies, analysis is performed directly on raw reads without sequence assembly (Takacs-Vesbach et al., 2013).

(4) Prediction of RNA genes and open reading frames (ORFs) is performed with the Basic Local Alignment Search Tool (BLAST) (Altschul et al., 1990). BLAST is an algorithm that measures the extent of similarity between two sequences and applies to both amino acid and nucleotide sequences. A BLAST search compares query sequences against a sequence database to identify known sequences whose similarity to the query exceeds a cutoff threshold (Altschul et al., 1990). Apart from sequence similarity searches with BLAST, hidden Markov models offer an alternative way to predict rRNA-specific structures, and six-frame translation is applied to identify all potential ORFs within a DNA sequence of any size (Siepel and Haussler, 2004). Prediction of RNA genes and ORFs reveals the taxonomic information and functional categories contained in metagenomic reads (Leimena et al., 2013).

(5) After the phylogeny of the predicted tRNA and protein-coding ORFs has been inferred, all annotated sequences are classified according to their most likely taxonomic origin and functional category (Strous et al., 2012). For taxonomic clustering, all metagenomic reads showing similar phylogenetic affiliations are placed on a particular taxon (Dröge and McHardy, 2012). There are two algorithms for calculating the phylogenetic affiliation of metagenomic sequences. One relies on the best hit of a BLAST search to determine the taxonomic origin of a read, whereas the other, which is more parsimonious and reliable, takes the lowest common ancestor of all significant hits above a threshold to assign the taxonomic placement of the read (Albertsen et al., 2013). For functional binning, all annotated genes are mapped to database resources such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and SEED classifications, based on higher functional categories and subordinate biological subsystems (Mitra et al., 2011).
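The two assignment strategies mentioned in step (5), best hit and lowest common ancestor (LCA), can be contrasted in a minimal sketch. The lineage table, the bit scores and the function names below are illustrative assumptions; published tools use full reference taxonomies and more elaborate scoring rules.

```python
# Toy lineages (root first) for three genera; a real analysis would read
# these from a reference taxonomy.
LINEAGES = {
    "Brucella": ["Bacteria", "Proteobacteria", "Alphaproteobacteria",
                 "Rhizobiales", "Brucellaceae", "Brucella"],
    "Bartonella": ["Bacteria", "Proteobacteria", "Alphaproteobacteria",
                   "Rhizobiales", "Bartonellaceae", "Bartonella"],
    "Caulobacter": ["Bacteria", "Proteobacteria", "Alphaproteobacteria",
                    "Caulobacterales", "Caulobacteraceae", "Caulobacter"],
}


def best_hit(hits):
    """Assign the read to the taxon of its highest-scoring hit."""
    return max(hits, key=lambda hit: hit[1])[0]


def lowest_common_ancestor(hits, min_score):
    """Assign the read to the deepest taxon shared by all hits >= min_score."""
    lineages = [LINEAGES[taxon] for taxon, score in hits if score >= min_score]
    if not lineages:
        return None
    lca = None
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # lineages diverge at this rank
        lca = ranks[0]
    return lca


# Hypothetical hits for one read: (taxon, bit score).
hits = [("Brucella", 120.0), ("Bartonella", 110.0), ("Caulobacter", 40.0)]
print(best_hit(hits))
print(lowest_common_ancestor(hits, min_score=100.0))
print(lowest_common_ancestor(hits, min_score=30.0))
```

The LCA approach trades resolution for caution: the more divergent the significant hits, the higher the rank at which the read is placed.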

Unveiling the taxonomic and functional diversity of the microbial community in a particular environment enables us to answer two questions: “Who is there?” and “What are they doing?” (Handelsman, 2004). By constructing networks that link environmental sequences to microbial attributes, it becomes feasible to predict the potential presence of similar or identical species and functional pathways in other, similar environments (Ghai et al., 2013). Understanding the composition of microbial communities and their interaction networks allows identification of the core bacterial metabolic pathways that sustain the balanced development of bacterial communities, thus providing valuable information for environmentalists seeking to inhibit the production of toxic compounds or enhance the production of beneficial metabolites for the well-being of the ecosystem (Brennerova et al., 2009).

1.2 Taxonomic classification of metagenomic reads: methods and challenges

Measuring species diversity in metagenomes answers the question “who is there” (Chistoserdova, 2013). Connecting each metagenomic sequence to a particular taxon requires binning, and traditional binning consists of two approaches: composition-based binning and similarity-based binning (Dröge and McHardy, 2012).

In composition-based binning, metagenomic software is used to extract inherent features of sequences, such as GC content and tetranucleotide frequency (Roller et al., 2013; Teeling et al., 2004). These approaches can distinguish new species in the environment, the so-called operational taxonomic units (OTUs), which is valuable because most species in natural environments have not been successfully cultured and are beyond laboratory characterization (Wooley et al., 2010).
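The two compositional features named above can be computed with a few lines of code. The sketch below is illustrative only: the function names are assumptions, ambiguous bases such as N are simply ignored, and real binning tools usually also combine each tetranucleotide with its reverse complement.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def tetranucleotide_frequencies(seq):
    """Relative frequency of each of the 256 tetranucleotides in a sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    # Windows containing ambiguous bases are not among the 256 k-mers
    # and are therefore excluded from the total.
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}


read = "ATGCGCGTTAACCGGATGCGC"  # made-up sequence
print(round(gc_content(read), 3))
profile = tetranucleotide_frequencies(read)
print(len(profile), profile["ATGC"])
```

Reads or contigs with similar feature vectors can then be clustered, which is the basis on which composition-based methods group sequences without a reference database.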

Similarity-based binning, also called alignment-based binning, matches metagenomic sequences to reference databases. Methods such as BLAST (Altschul et al., 1990), PhymmBL (Brady and Salzberg, 2009) and MetaPhlAn (Segata et al., 2012) are employed in metagenomic research. These methods not only identify known microbial organisms in the environment and measure their relative abundance and diversity, but also reveal the functional impact of bacterial communities, because extensive laboratory studies of individual microbial species have characterized the functions of genes and proteins in these cultivable species (Leung et al., 2011).

Both binning strategies are important and complement each other in metagenomic taxonomic profiling. The former discovers unknown species in natural environments, while the latter investigates known species (Wu and Ye, 2011). However, neither can fully reveal environmental species diversity, given that 99% of environmental microbes have not yet been cultured (Schloss and Handelsman, 2005). Similarity-based binning is usually more accurate and sensitive than composition-based binning, but its performance depends heavily on the reference resources (Xia et al., 2011). Composition-based binning clusters all sequences into groups but fails to associate metagenomic reads with individual bacterial taxa (Thomas et al., 2012).

Other issues, such as running time and computing requirements, remain to be resolved (Thomas et al., 2012). A BLASTX analysis was once performed on permafrost soil samples comprising 176 million Illumina DNA reads (Mackelprang et al., 2011), which ultimately cost 800000 CPU hours on a similar workstation server (64 cores, 512 GB main memory) (Huson and Xie, 2013). Given these limitations, a hybrid of the two approaches is preferable for accurate taxonomic classification (Mohammed et al., 2011).

1.3 Application of metagenomics

Metagenomics has a broad range of potential applications in which current knowledge is translated into solutions to practical problems. Some pioneering attempts have proved successful in fields such as energy, agriculture, the environment, medicine and engineering (National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications, 2007).

Microbial communities in the human gut have an essential impact on human health. However, the composition of the gastrointestinal microbiota and the mechanisms by which it influences the human body remain cryptic (Bäckhed et al., 2012). In view of this, metagenomic technology is used to characterize these communities (Lepage et al., 2013). One of the largest projects on the human gut microbiome, Metagenomics of the Human Intestinal Tract (MetaHIT), was initiated by the European Commission to explore the relationships between changes in the human microbiome and human health by gathering genomic sequences of all microbial organisms on 15~18 different body sites from 250 European individuals (Qin et al., 2010). The primary goal of the MetaHIT project is to determine the core human microbiome that maintains human health (Ursell et al., 2012). Another line of clinical research within MetaHIT is to characterize the phylogenetic variation of gastrointestinal microbes between healthy people and patients suffering from diseases and disorders such as Crohn’s disease, irritable bowel syndrome (IBS) and obesity (Moloney et al., 2013). The results showed that two phyla, Bacteroidetes and Firmicutes, dominate the distal gut, comprising >90% of known bacteria (Le Chatelier et al., 2013). Gene frequency profiling identified 1244 metagenomic functional clusters of crucial importance to the health of the human intestinal tract, whose functions fall into two categories: housekeeping clusters and intestine-specific clusters (Qin et al., 2010). Housekeeping functions are indispensable in the human gut and are required by all microbial members because they play a key role in the main metabolic pathways, including amino acid synthesis, whereas gut-specific functions deal with adhesion to host proteins and sugar harvesting (Qin et al., 2010). One finding regarding IBS is that the number of microbiome genes in patients is 25% lower than in healthy controls, and bacterial diversity is also lower in IBS patients. This strongly suggests that gut-associated disease and obesity result from a reduction in gut microbiome diversity (Qin et al., 2010). Despite the potential of human gut metagenomics, it is notable that only 7.6~21.2% of the metagenomic reads can be matched to bacterial genomes in GenBank, and many novel bacterial species in the human distal gut have not yet been studied (Qin et al., 2010). The characterization of this unknown microbiome may shed light on new medical therapies for human gastrointestinal diseases (Kinross et al., 2011).

Metagenomics also advances the search for new green energy sources. Bioenergy is expected to be the next-generation fuel that could replace fossil fuels (Hess et al., 2011). Biofuels are derived from biomass conversion, which transforms plant material such as grain, starch, sugar, oil, cellulose, hemicellulose and lignin into products such as cellulosic methane (van der Lelie et al., 2012). The transformation process relies on microbial consortia from host-associated habitats ranging from herbivorous mammals and insects to rainforest soils (Allgaier et al., 2010). The microbial communities in these habitats share core cellulosic genes coding for enzymes that degrade biomass (Scully et al., 2013). In view of the importance of such enzymes and the inexhaustible pool of environmental microbes, metagenomics aims to analyse complex microbial consortia in order to discover novel enzymes that fulfil industrial requirements, namely biomass-deconstructing enzymes at lower cost (Hess et al., 2011). Meanwhile, metagenomic technologies permit comparative analysis between convergent microbial ecosystems, which in turn improves the understanding of different biomass degradation mechanisms (Lu et al., 2012). Metagenomic approaches not only identify the diverse enzymes of interest but also help to control the activation and repression of these catalytic processes (Zhang et al., 2013). Industrial-scale production may reduce the release of greenhouse gases and improve environmental quality (Sommer et al., 2010).

Microbial communities in soils are recognized as the most diverse and complex bacterial ecosystems, with 10^9~10^10 microbial cells in one gram of soil (Vogel et al., 2009). In spite of the enormous amount of sequence information held per unit of soil (one gigabase per gram of soil), its taxonomic composition and functional categories are poorly understood (Vogel et al., 2009). Many bacteria develop stable symbiotic relationships with specific plants and provide diverse ecological services as symbionts or epibionts that support plant growth, including atmospheric nitrogen fixation, nutrient cycling, pathogen resistance and trace element enrichment (Rascovan et al., 2013). Functional metagenomic pipelines seek to decipher the sophisticated interactions and communication between soil microbes and plants by screening microbial communities for novel genes of interest (Rout and Callaway, 2012). Insight into the rare uncultivable bacterial members responsible for intra-species competitive inhibition also offers a new perspective on plant disease resistance and the improvement of farming practice (Rosen et al., 2009). The application of metagenomic techniques to agriculture enables the improvement and maintenance of crop health, provided that the dynamic equilibrium between microbes and plants is kept under control (Rascovan et al., 2013).

Apart from the applications described above, metagenomic approaches also tackle environmental issues. In the field of environmental remediation, new policies and strategies based on metagenomic principles are advocated for monitoring the impact of pollutants and cleaning up environmental contamination (Yergeau et al., 2012). One metagenomic project targets wastewater treatment plants in which microorganisms remove excess inorganic phosphate from wastewater, a process called enhanced biological phosphorus removal (EBPR) (Nielsen et al., 2012). Another wastewater treatment project, in a common effluent treatment plant (CETP), investigates the activated biomass occupied by particular microbial communities in this niche (Kapley et al., 2007). Metagenomic studies aim to identify novel bacterial members in these niches and to explore new catabolic pathways that help reduce the chemical oxygen demand (COD), so that wastewater treated by the activated sludge process can subsequently be released safely into the environment (Ravi P More, 2013). Although the metabolic traits of this process are not yet well understood, a better understanding of how microbial communities deal with pollutants provides a theoretical foundation for environmentalists to assess which sites are vulnerable to contaminants and to develop appropriate strategies for removing pollution and rehabilitating habitats (Gomez-Alvarez et al., 2012).

1.4 Project objectives

To profile bacterial diversity and functional categories, metagenomic sequences must be mapped to reference databases (Mitra et al., 2011). Since 99% of environmental microbial species cannot be cultured in the laboratory, it is a major challenge to identify wild-type species from the limited genomic information of sequenced, laboratory-adapted isolates (Albertsen et al., 2013). The results of the first project showed that several molecular markers are present in newly sequenced species of different clades within Alphaproteobacteria. Given that these molecular markers are ubiquitously present in the sequenced species of their respective Alphaproteobacteria clades, it is highly likely that environmental Alphaproteobacteria also carry these signatures.

The second project consists of 3 related components. The first step is to choose several metagenomic samples that may contain Alphaproteobacteria; for this purpose, a large-scale screening test was performed on 200 metagenomic samples. Subsequently, a more detailed and comprehensive profiling of Alphaproteobacteria clades was performed on the selected metagenomes. Finally, a comparative analysis will be carried out to compare the relative abundance of Alphaproteobacteria among the selected metagenomes. Once the experiments are completed, the overall performance of molecular signatures in identifying environmental Alphaproteobacteria can be assessed, and a new molecular marker-based method can be developed to determine the taxonomic classification of metagenomes.

Chapter 2 Materials and methods

2.1 Metagenome selection

A systematic tBLASTn search was conducted with an E-value threshold of 1 × 10^-4, using 4 Alphaproteobacteria class-specific CSPs as queries against 201 metagenomes. These 201 metagenomic samples comprise 15,049,531 genomic sequences in total and are divided into either ecological metagenomes or organismal metagenomes on the NCBI genomic BLAST webpage. All significant hits passing the threshold were collected, and a metagenome taxonomy report was generated to show the distribution of CSPs across the candidate metagenomes. Metagenomes containing sequences highly similar to the CSPs are likely to be rich in Alphaproteobacteria and were therefore preferred in this study.

Based on the sequence similarity and the number of positive hits found in the candidate metagenomes, qualifying metagenomic projects were selected for subsequent Alphaproteobacteria profiling.
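The screening step can be sketched in Python as follows. This is a minimal illustration, not the procedure used in this study: it assumes that the tBLASTn results for one metagenome have been exported in the 12-column tabular layout of BLAST+ (`-outfmt 6`), and it treats a hit as significant when its E-value is at or below the threshold. The function name `count_significant_hits`, the contig names and the example rows are hypothetical.

```python
E_VALUE_THRESHOLD = 1e-4  # threshold stated in Section 2.1


def count_significant_hits(tabular_text, threshold=E_VALUE_THRESHOLD):
    """Count hits per query CSP whose E-value is at or below the threshold.

    Assumes the 12 standard tab-separated columns of BLAST+ "-outfmt 6":
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore.
    """
    counts = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        e_value = float(fields[10])
        if e_value <= threshold:
            counts[query_id] = counts.get(query_id, 0) + 1
    return counts


# Hypothetical hits for one metagenome (contig names and values are made up).
example_rows = [
    ["NP_422086", "contig_1", "62.5", "120", "45", "0", "1", "120", "10", "369", "3e-40", "203"],
    ["NP_422086", "contig_2", "30.1", "80", "50", "2", "5", "84", "400", "161", "0.5", "28.1"],
    ["NP_420178", "contig_3", "55.0", "100", "45", "0", "1", "100", "1", "300", "1e-20", "94.0"],
]
example_text = "\n".join("\t".join(row) for row in example_rows)

hits_per_csp = count_significant_hits(example_text)
print(hits_per_csp)  # {'NP_422086': 1, 'NP_420178': 1}
```

Running the same function on the results for each candidate metagenome gives the per-metagenome hit counts that the selection is based on.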

2.2 Identification of CSP in metagenomic samples

A systematic tBLASTn search was performed with all 264 CSPs against the 10 qualifying metagenomic projects. An E-value threshold of 1 × 10^-4 was used, together with the default low-complexity filter, to eliminate hits that are statistically significant but biologically uninteresting from the BLAST output (Coletta et al., 2010). The best bit score and the number of positive hits for each CSP were then collected to capture the quality and quantity of the BLAST results for further analysis. The bit score is a numerical value that describes the overall quality of an alignment: the higher the score, the better the alignment (Altschul et al., 1990).
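For orientation, the relation between the bit score and the E-value threshold can be sketched as follows. In the Karlin-Altschul statistics used by BLAST, the expected number of chance alignments with a bit score of at least S' in a search space of size m × n is E = m · n · 2^(-S'). The query and database lengths below are illustrative assumptions, not values from this study, and BLAST applies effective-length corrections, so the E-values it reports will differ somewhat from this simplified calculation.

```python
import math


def expected_chance_hits(bit_score, query_length, database_length):
    """E = m * n * 2**(-S') for bit score S' and search space m * n."""
    return query_length * database_length * 2.0 ** (-bit_score)


def bit_score_for_e_value(e_value, query_length, database_length):
    """Invert the relation: the bit score at which E equals e_value."""
    return math.log2(query_length * database_length / e_value)


# Assumed, illustrative search space (not taken from the thesis).
QUERY_LENGTH = 300             # residues in a hypothetical CSP
DATABASE_LENGTH = 100_000_000  # total letters in a hypothetical metagenome

cutoff_bits = bit_score_for_e_value(1e-4, QUERY_LENGTH, DATABASE_LENGTH)
print(round(cutoff_bits, 1))  # about 48.1 bits under these assumptions
```

Under these assumptions, each additional bit of score halves the expected number of chance hits, which is why a higher bit score indicates a more reliable alignment.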

2.3 Comparative analysis of Alphaproteobacteria in metagenomes

The distribution of Alphaproteobacteria clades was plotted based on the best bit scores and the number of positive hits for all CSPs. A heatmap was then created from the distribution pattern of the 264 CSPs. Meanwhile, the average bit score and the total number of positive hits for all CSPs in each metagenome were calculated to indicate the overall relative abundance of Alphaproteobacteria in each metagenome. Afterwards, a comparative analysis of CSPs was conducted to show the detailed proportions of the different Alphaproteobacteria clades across the 10 metagenomes. To validate the reliability of the CSP-based methodology, the relative abundances obtained with the CSP-based approach in this study were compared with the similarity-based taxonomic classifications available on public metagenomic servers. The taxonomic hit distributions of the 10 metagenomes were obtained from either the Metagenome Rapid Annotation using Subsystem Technology server (MG-RAST) (Meyer et al., 2008) or Integrated Microbial Genomes with Microbiome Samples (IMG/M) (Markowitz et al., 2012).
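The per-metagenome summary statistics described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact calculation used in this study: the average is taken over the best bit score of every CSP, with CSPs that have no significant hit counted as 0, and the CSP names and scores in the example are hypothetical.

```python
from collections import defaultdict


def summarize_csps(hits):
    """Return (best bit score, hit count) per CSP for one metagenome.

    hits: iterable of (csp_id, bit_score) pairs for significant hits.
    """
    best = defaultdict(float)
    count = defaultdict(int)
    for csp_id, bit_score in hits:
        best[csp_id] = max(best[csp_id], bit_score)
        count[csp_id] += 1
    return dict(best), dict(count)


def summarize_metagenome(hits, all_csps):
    """Return (average best bit score over all CSPs, total hit count)."""
    best, count = summarize_csps(hits)
    average_best = sum(best.get(csp, 0.0) for csp in all_csps) / len(all_csps)
    total_hits = sum(count.values())
    return average_best, total_hits


# Hypothetical significant hits in one metagenome.
example_hits = [("CSP_A", 120.0), ("CSP_A", 95.5), ("CSP_B", 60.0)]
example_csps = ["CSP_A", "CSP_B", "CSP_C"]

average_best, total_hits = summarize_metagenome(example_hits, example_csps)
print(average_best, total_hits)  # 60.0 3
```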


Table 11 Characteristics of Metagenomic Datasets Investigated in this Study

Metagenomic project   # of Contigs¹   Total length (Mb)   Average length (bp)   Raw sequencing data (Gb)   α-proteobacteria %²
Wastewater            172,804         421.6               2,440                 157.50                     16.8%
Marine                54,509          77.4                1,420                 1.17                       30.7%
Bioreactor            748,672         317.9               425                   1.44                       17.2%
Compost               218,885         104.9               479                   0.28                       27.0%
Activated sludge      36,270          27.9                769                   N/A                        N/A
Whale fall            84,317          89.6                1,062                 0.14                       23.8%
Freshwater sediment   252,427         214.8               850                   8.20                       5.3%
Microbial mat         112,984         84.2                745                   0.12                       21.9%
Hydrothermal vent     26,573          24.9                937                   0.03                       19.8%
Groundwater           37,367          104.7               2,801                 7.20                       4.6%

1. Contigs are assembled metagenomic sequences.
2. The percentage of α-proteobacteria is calculated as the ratio of reads annotated to α-proteobacteria to all metagenomic reads.


Chapter 3 Results

3.1 Metagenome selection

A total of 201 metagenomic datasets were accessible on the NCBI genomic BLAST webpage. To determine which metagenomic datasets were dominated by Alphaproteobacteria, a large-scale BLAST search was undertaken with Alphaproteobacteria-specific CSPs. The results indicated that Alphaproteobacteria were most likely present in roughly 10 metagenomic projects. The 4 CSPs used for preliminary screening have been confirmed to be the most conserved and specific molecular signatures shared by all available Alphaproteobacteria species (Table 3A). The tBLASTn results for these CSPs identified 220 BLAST hits in 10 metagenomic projects (the NCBI accession number and the project ID on the MG-RAST or IMG/M server are listed in brackets): Microbial mat metagenome (PRJNA29795, 4440964.3) (Harris et al., 2012), Marine metagenome (PRJNA16339, 4443701.3) (DeLong et al., 2006), Wastewater metagenome (PRJNA167559, 4455295.3) (Mielczarek et al., 2013), Freshwater sediment metagenome (PRJNA30541, 2006543005) (Kalyuzhnaya et al., 2008), Hydrothermal vent metagenome (PRJNA37895, 4461585.3) (Brazelton and Baross, 2009), Bioreactor metagenome (PRJNA73603, 20220044000) (van der Lelie et al., 2012), Activated sludge metagenome (PRJNA61401, N/A) (Kapley et al., 2007), Compost metagenome (PRJNA41493, 4446153.3) (Allgaier et al., 2010), Whale fall metagenome (PRJNA81625, 4441619.3) (Tringe et al., 2005) and Groundwater metagenome (PRJNA114691, 3300000815) (Wrighton et al., 2012). Although positive hits were also sporadically distributed in other metagenomic projects, such as the freshwater metagenome and the mosquito metagenome, these two metagenomic projects were not pursued further because the BLAST analysis indicated that neither the bit scores nor the numbers of hits were sufficient to classify them as Alphaproteobacteria-abundant metagenomes.

The 10 metagenomic projects described above comprised more than 50 metagenomic samples, so the most representative metagenomic sample was selected from each project. Six of the 10 metagenomic projects (the wastewater, hydrothermal vent, activated sludge, compost, bioreactor and groundwater metagenomes) contained only one sample each, and these samples automatically became the representatives of their projects. The whale fall, freshwater sediment and microbial mat metagenomes consisted of 3, 5 and 10 datasets, respectively. Since the samples within each of these projects addressed a single topic, the datasets in each project could be combined and treated as a single sample. The marine metagenome comprised 35 metagenomic samples gathered from around the world. Further analysis indicated that the most significant hits for Alphaproteobacteria-specific CSPs were found in the North Pacific Subtropical Gyre Planktonic Microbial Community, so it was selected as the representative marine metagenome sample in this project.

The sample size of each metagenomic project, the number of contigs and the total length of all reads were collected from the WGS master webpage (shotgun assembly sequences for genome and transcriptome). Table 11 shows that the number of assembled sequences per metagenomic project ranges from tens of thousands to hundreds of thousands. The total length of the metagenomic reads ranges from tens of millions to hundreds of millions of base pairs. The average length of a metagenomic read can be calculated as the total length divided by the number of contigs, and ranges from roughly 500 bp to 2500 bp. The metagenomic reads are appropriate for CSP-based similarity searches because their average length is comparable to the length of the CSPs. To gain an overall understanding of how many Alphaproteobacteria are expected to be present in these metagenomic projects, the organism abundance of each metagenome was retrieved from the MG-RAST and IMG/M servers. The numbers of reads annotated to Alphaproteobacteria were collected to calculate the relative proportion of all Alphaproteobacteria species in each metagenome (Table 11). The relative abundance of Alphaproteobacteria based on the proportion of related metagenomic reads ranges from 5% to 30%, which indicates that Alphaproteobacteria is one of the major groups in the selected metagenomic projects.
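The average lengths in Table 11 can be reproduced from the other two columns. The sketch below assumes 1 Mb = 1,000,000 bp; the dictionary copies the contig counts, total lengths and reported averages from Table 11, and the names `TABLE_11` and `average_length_bp` are illustrative.

```python
# (number of contigs, total length in Mb, reported average length in bp),
# copied from Table 11.
TABLE_11 = {
    "Wastewater": (172_804, 421.6, 2_440),
    "Marine": (54_509, 77.4, 1_420),
    "Bioreactor": (748_672, 317.9, 425),
    "Compost": (218_885, 104.9, 479),
    "Activated sludge": (36_270, 27.9, 769),
    "Whale fall": (84_317, 89.6, 1_062),
    "Freshwater sediment": (252_427, 214.8, 850),
    "Microbial mat": (112_984, 84.2, 745),
    "Hydrothermal vent": (26_573, 24.9, 937),
    "Groundwater": (37_367, 104.7, 2_801),
}


def average_length_bp(contigs, total_length_mb):
    """Average sequence length = total length / number of contigs."""
    return total_length_mb * 1_000_000 / contigs


for project, (contigs, total_mb, reported) in TABLE_11.items():
    computed = average_length_bp(contigs, total_mb)
    print(f"{project}: computed {computed:.1f} bp, reported {reported} bp")
```

Because the total lengths are rounded to 0.1 Mb, the recomputed averages agree with the tabulated values only to within about 1 bp.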

3.2 Identification of CSPs in metagenomic samples

After the 10 appropriate metagenomes were selected, the distribution of all CSPs in these metagenomic reads was investigated. The bit scores and the numbers of significant hits obtained for 16 CSPs unique to different clades of Alphaproteobacteria are tabulated in Figure 1. In addition, two heatmaps were built to depict the detailed distribution of Alphaproteobacterial clades in the 10 metagenomes (Figure 2 and 3). In this study, 11 CSPs specific for Alphaproteobacteria at the class level (i.e. specifically found in all or most Alphaproteobacteria) were applied to identify the presence of Alphaproteobacteria in metagenomic samples. Significant hits for these CSPs with high bit scores were identified in the metagenomic datasets, and in most cases multiple metagenomic reads exhibited positive hits. However, both the total number of significant hits for these 11 CSPs and the bit scores varied considerably between metagenomic datasets. Notably, the bioreactor, wastewater and whale fall metagenomes have more Alphaproteobacterial sequences than the other 7 metagenomes (Figure 2 and 3). These differences may be related to the size of the datasets themselves as well as to the relative abundance of Alphaproteobacteria in these metagenomes. These findings indicate that CSPs specific for the class Alphaproteobacteria are present in all 10 metagenomic datasets and are particularly enriched in three of them.

Multiple CSPs that are specific for either all Rhizobiales or two major clades within this order have been identified. They comprise 3 CSPs specific for Rhizobiales, 16 CSPs specific for Brucellaceae, Bartonellaceae, Phyllobacteriaceae, Rhizobiaceae and Aurantimonadaceae (called clade-1 Rhizobiales) and 18 CSPs specific for Rhizobiaceae and Phyllobacteriaceae. The tBLASTn searches with these CSPs showed that their significant hits were highly concentrated in the wastewater metagenome, followed by the bioreactor, compost and whale fall metagenomes. In contrast, these CSPs were either sporadically distributed or totally absent in the other metagenomes studied (Figure 2 and 3).

Another important sub-clade within Rhizobiales is the Bradyrhizobiaceae and Xanthobacteraceae group. All 74 CSPs for this group were found to be consistently present in either the Bradyrhizobiaceae family alone or both the Bradyrhizobiaceae and Xanthobacteraceae families. The tBLASTn results indicated that their distribution differed somewhat from that of the clade-1 Rhizobiales-specific CSPs. In this case, the maximum number of BLAST hits was observed for the bioreactor metagenome, while the marine and compost metagenomes also yielded significant hits, in similar numbers to each other (Figure 2 and 3).

The distributions of CSPs specific for the orders Rhodobacterales, Caulobacterales, Sphingomonadales, Rhodospirillales and Rickettsiales were each investigated. For the 35 Rhodobacterales-specific CSPs, multiple significant hits were detected in the following 6 metagenomes: wastewater, marine, microbial mat, hydrothermal vent, whale fall and groundwater. Furthermore, the average bit scores of the CSPs in these 6 metagenomes were higher than those in the other 4 metagenomes, which increases confidence in the reliability of these results and indicates that Rhodobacterales species were important constituents of these metagenomes (Table 11). In addition, the distribution of significant BLAST hits for the 11 Caulobacterales-specific CSPs indicated that the Caulobacterales were likely enriched in the bioreactor, wastewater and whale fall metagenomes (Figure 2 and 3).

The results of the tBLASTn searches for the 31 Sphingomonadales-specific CSPs in the 10 metagenomes are displayed in Figure 2 and 3. These CSPs were highly concentrated in the bioreactor and wastewater metagenomes, and moderately scattered in the marine and whale fall metagenomes. This suggests that Sphingomonadales species may prefer engineered or marine habitats over the other environments examined in this study.

The analysis of the tBLASTn results for the Rhodospirillales-specific CSPs indicated that Rhodospirillales were most abundant in the bioreactor metagenome, although the corresponding CSPs were also present in low numbers in the marine, compost, freshwater sediment and whale fall metagenomes (Figure 2 and 3).

In addition, 21 CSPs specific for the order Rickettsiales were employed to detect potential pathogens in the environmental datasets. Only 2 significant hits were observed, one in the wastewater metagenome and one in the freshwater sediment metagenome (Figure 2 and 3). This is probably because intracellular pathogenic bacteria are not common in environmental metagenomic samples.

3.3 Comparative analysis of Alphaproteobacteria in metagenomes

The best bit scores and the numbers of significant hits from the BLAST search results of all CSPs were collected. In summary, 4 metagenomic datasets enriched in Alphaproteobacteria were identified: the bioreactor, wastewater, marine and whale fall metagenomes. The results for the remaining metagenomes are shown in Figure 4. All of the significant hits were derived from either Alphaproteobacteria class-specific CSPs or clade-specific CSPs. For instance, among the 410 hits found in the bioreactor metagenome, 125 were from the 11 Alphaproteobacteria class-specific CSPs, 73 were from Bradyrhizobiaceae-specific CSPs and 98 were from Sphingomonadales-specific CSPs. In the wastewater metagenome, the 551 significant hits were mainly from CSPs specific for the class Alphaproteobacteria (179 hits), Rhizobiales (130 hits), Rhodobacterales (87 hits) and Sphingomonadales (109 hits). As for the whale fall metagenome, more than 75% of the significant hits were derived from Alphaproteobacteria class-specific CSPs (109 hits) and Rhodobacterales-specific CSPs (114 hits).
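As a consistency check on the class-level counts quoted above, the totals can be recomputed from the 11 class-specific CSP rows of Figure 2. The sketch below copies the first 11 rows of Figure 2 (the rows grouped under α-proteobacteria), one value per metagenome column; the names `CLASS_LEVEL_HITS` and `column_totals` are illustrative. Three of the resulting column totals, 125, 179 and 109, equal the class-level hit counts reported above for the bioreactor, wastewater and whale fall metagenomes.

```python
# Significant-hit counts for the 11 class-specific CSPs, copied from the
# first 11 rows of Figure 2 (one value per metagenome column).
CLASS_LEVEL_HITS = {
    "NP_420905": [15, 17, 3, 3, 2, 1, 2, 2, 7, 2],
    "NP_422086": [23, 18, 7, 2, 3, 2, 1, 3, 8, 1],
    "NP_422113": [8, 20, 1, 2, 2, 2, 2, 0, 8, 1],
    "NP_420178": [11, 18, 6, 3, 3, 2, 2, 2, 9, 2],
    "NP_420025": [9, 17, 5, 1, 1, 1, 3, 0, 11, 0],
    "NP_420693": [6, 13, 5, 1, 1, 1, 1, 0, 5, 1],
    "NP_421048": [14, 17, 5, 3, 1, 1, 3, 0, 11, 0],
    "NP_422264": [6, 18, 1, 1, 0, 1, 1, 1, 8, 2],
    "NP_419339": [10, 11, 0, 0, 2, 0, 1, 2, 12, 1],
    "NP_421804": [15, 21, 8, 3, 0, 0, 1, 1, 9, 1],
    "NP_418919": [8, 9, 4, 4, 2, 2, 1, 3, 21, 1],
}


def column_totals(rows):
    """Sum the hit counts of all CSPs for each metagenome column."""
    return [sum(column) for column in zip(*rows.values())]


totals = column_totals(CLASS_LEVEL_HITS)
print(totals)  # [125, 179, 45, 23, 17, 13, 18, 14, 109, 12]
```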


By calculating the total number of significant hits for all CSPs and grouping them by order, a comparative analysis was made to show the detailed distribution of Alphaproteobacteria clades in each metagenome. According to Figure 5, Alphaproteobacteria were most abundant in the bioreactor, wastewater and whale fall metagenomes, not only for the class as a whole but also for its individual orders. For example, in the bioreactor metagenome, the relative abundances of Rhizobiales, Bradyrhizobiaceae, Sphingomonadales and Rhodospirillales were higher than in the other metagenomes. Although the wastewater metagenome was also enriched in Alphaproteobacteria, its composition differed from that of the bioreactor metagenome: Rhizobiales, Rhodobacterales and Sphingomonadales were the most dominant groups of Alphaproteobacteria, whereas Bradyrhizobiaceae were less abundant than in the bioreactor. According to the CSP distribution, only Rhodobacterales and Sphingomonadales were abundant in the whale fall metagenome, although other clades were also moderately present.

To compare organism abundance between CSP-based binning and similarity-based binning, the taxonomic classifications from the MG-RAST and IMG/M servers were collected (Figure 6). Four metagenomes (bioreactor, wastewater, whale fall and marine) were compared in this study. In the bioreactor metagenome, the relative abundances of Alphaproteobacteria clades based on the CSP distribution were 5% for Rhizobiales, 11% for Bradyrhizobiaceae, 2% for Rhodobacterales, 3% for Caulobacterales, 14% for Sphingomonadales and 7% for Rhodospirillales. The relative proportions for the same metagenome derived from the IMG/M server were 7% for Rhizobiales, 11% for Bradyrhizobiaceae, 2% for Rhodobacterales, 6% for Caulobacterales, 11% for Sphingomonadales and 10% for Rhodospirillales. The two sets of results were highly correlated. The organism abundances for the other 3 metagenomes on the MG-RAST server were also similar to the results of the CSP-based search (Figure 6).
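One way to quantify the agreement between the two sets of bioreactor percentages is a Pearson correlation coefficient. The choice of statistic is an assumption made here for illustration, since the text does not state how the correlation was assessed; the percentages are the six clade values quoted above, and the names `CSP_BASED`, `IMG_M_BASED` and `pearson_r` are illustrative.

```python
import math

CLADES = ["Rhizobiales", "Bradyrhizobiaceae", "Rhodobacterales",
          "Caulobacterales", "Sphingomonadales", "Rhodospirillales"]
CSP_BASED = [5, 11, 2, 3, 14, 7]     # % per clade from the CSP distribution
IMG_M_BASED = [7, 11, 2, 6, 11, 10]  # % per clade from the IMG/M server


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)


r = pearson_r(CSP_BASED, IMG_M_BASED)
print(round(r, 2))  # 0.88
```

With only six clades the coefficient is a rough indicator, but it is in line with the close agreement described above.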


Metagenomes: Marine, Compost, Whale fall, Bioreactor, Wastewater, Groundwater, Microbial mat, Activated sludge, Hydrothermal vent, Freshwater sediment

Clade specificity     CSP         Best bit score
Alphaproteobacteria   NP_422086   203  205  212  132  206  190  180  191  209  143
Alphaproteobacteria   NP_420178   263  124   94  102   83   80   61   92  118   96
Rhizobiales           YP_031797    70   77   51   67    0    0   53    0   56    0
Rhizobiales           YP_032395   169  265    0  107    0    0  162    0  106    0
Bradyrhizobiaceae     YP_317328   125    0  118  145    0    0    0    0    0    0
Bradyrhizobiaceae     YP_317580   128    0  111  147    0    0    0    0    0    0
Rhodobacterales       YP_614257     0  721    0    0    0  333  199  330  311  719
Rhodobacterales       YP_611978     0  273  249    0  183    0    0    0  286  196
Caulobacterales       NP_419305    82   62   85    0    0    0    0    0   81    0
Caulobacterales       NP_421895   127  291   73    0    0    0  115    0  119    0
Sphingomonadales      YP_495301   116  216  115    0    0    0    0    0  225    0
Sphingomonadales      YP_496569   110  139   70    0    0    0    0  105   77    0
Rhodospirillales      AAW62049     70    0  471    0    0    0    0    0  158    0
Rhodospirillales      YP_425217   173    0  177   54   68    0    0    0  171    0
Rickettsiales         NP_965979     0    0    0    0    0    0    0    0    0    0
Rickettsiales         NP_966474     0    0    0    0    0    0    0    0    0    0

Clade specificity     CSP         Significant hits
Alphaproteobacteria   NP_422086    23   18    7    2    3    2    1    3    8    1
Alphaproteobacteria   NP_420178    11   18    6    3    3    2    2    2    9    2
Rhizobiales           YP_031797     3   11    1    2    0    0    1    0    1    0
Rhizobiales           YP_032395     4   11    0    5    0    0    1    0    2    0
Bradyrhizobiaceae     YP_317328     2    0    1    1    0    0    0    0    0    0
Bradyrhizobiaceae     YP_317580     2    0    1    1    0    0    0    0    0    0
Rhodobacterales       YP_614257     0    1    0    0    0    1    1    2    3    1
Rhodobacterales       YP_611978     0    1    2    0    2    0    0    0    3    1
Caulobacterales       NP_419305     2    1    1    0    0    0    0    0    3    0
Caulobacterales       NP_421895    11    6    1    0    0    0    1    0    1    0
Sphingomonadales      YP_495301     4    5    2    0    0    0    0    0    3    0
Sphingomonadales      YP_496569     3    5    1    0    0    0    0    1    1    0
Rhodospirillales      AAW62049      1    0    2    0    0    0    0    0    1    0
Rhodospirillales      YP_425217     9    0    2    1    1    0    0    0    2    0
Rickettsiales         NP_965979     0    0    0    0    0    0    0    0    0    0
Rickettsiales         NP_966474     0    0    0    0    0    0    0    0    0    0


Figure 1: Summary heatmap of 16 Alphaproteobacteria-specific CSPs in 10 metagenomes. The upper heatmap specifies the best bit score within each metagenome assigned to the listed taxa. The lower heatmap indicates the number of significant hits within each metagenome assigned to the listed taxa. Color formatting indicates high and low values: zero values are in green, values between 1 and 10 are in yellow, and red indicates the highest values in the chart.


CSP, followed by one value per metagenome (Marine, Compost, Whale fall, Bioreactor, Wastewater, Groundwater, Microbial mat, Activated sludge, Hydrothermal vent, Freshwater sediment)

NP_420905 15 17 3 3 2 1 2 2 7 2 NP_422086 23 18 7 2 3 2 1 3 8 1

NP_422113 8 20 1 2 2 2 2 0 8 1 NP_420178 11 18 6 3 3 2 2 2 9 2 NP_420025 9 17 5 1 1 1 3 0 11 0 NP_420693 6 13 5 1 1 1 1 0 5 1 NP_421048 14 17 5 3 1 1 3 0 11 0 proteobacteria

- NP_422264 6 18 1 1 0 1 1 1 8 2 α NP_419339 10 11 0 0 2 0 1 2 12 1 NP_421804 15 21 8 3 0 0 1 1 9 1 NP_418919 8 9 4 4 2 2 1 3 21 1 YP_031797 3 11 1 2 0 0 1 0 1 0 YP_032733 1 5 0 1 0 0 0 1 1 0 YP_032395 4 11 0 5 0 0 1 0 2 0 NP_101943 0 4 0 0 0 0 0 0 0 0 NP_105027 0 3 0 0 0 1 0 0 0 0 NP_108034 4 8 0 2 0 0 1 1 2 0

NP_102510 0 3 0 0 0 0 0 0 0 0 NP_102519 0 5 1 1 0 0 0 0 0 0 NP_104217 0 4 1 0 0 0 0 0 0 0 NP_107016 2 4 0 0 0 1 0 0 0 0 Rhizobiales NP_101988 4 8 0 1 0 1 0 0 1 0 NP_102895 0 5 0 0 0 0 0 0 0 0 NP_104087 0 1 0 1 0 0 0 0 0 0 NP_104130 2 6 0 0 0 0 0 0 1 0 NP_105201 0 4 0 0 0 0 0 0 0 0 NP_105743 4 5 0 1 0 1 0 0 1 0 NP_108472 3 4 0 0 0 0 0 0 0 0


NP_103319 1 6 0 0 0 0 0 0 2 0 NP_101965 4 4 0 0 0 0 0 1 0 0 NP_101954 0 4 0 0 0 0 0 0 0 0 NP_102577 1 1 0 0 0 0 0 0 0 0 NP_109472 0 2 0 0 0 0 0 0 0 0 NP_105883 0 5 0 0 0 0 0 0 0 0 NP_106835 1 0 0 0 0 0 0 0 0 0 NP_107159 1 1 0 0 0 0 0 0 0 0 NP_103376 1 9 0 0 0 0 0 0 0 0 NP_104418 0 1 0 0 0 0 0 0 0 0 NP_105704 0 1 0 0 0 0 1 0 0 0 NP_102252 0 0 0 0 0 0 0 0 0 0 NP_103286 0 1 0 0 0 0 0 0 0 0 NP_106741 0 2 0 0 0 0 0 0 0 0 NP_106740 0 0 0 0 0 0 0 0 0 0 NP_104236 2 0 0 0 0 0 0 0 0 0 NP_103455 0 0 0 0 0 0 0 0 0 0 NP_103450 0 0 0 0 0 0 0 0 0 0 NP_103476 0 2 0 0 0 0 0 0 0 0 NP_107075 0 0 0 0 0 0 0 0 0 0 NP_772654 0 1 0 0 0 0 0 0 0 0 YP_317707 0 1 0 0 0 0 0 1 0 0

YP_317841 1 4 0 0 0 0 0 0 0 0 YP_318399 3 0 1 0 0 0 0 0 0 0 YP_318401 3 0 0 0 0 0 0 0 0 0 YP_318753 0 0 0 0 0 0 0 0 0 0 YP_318785 0 0 0 0 0 0 0 0 0 0

Xanthobacteraceae YP_319038 0 0 0 0 0 0 0 0 0 0

YP_319081 3 2 0 0 0 0 0 0 0 0 and YP_319177 1 1 0 1 0 0 0 0 0 0 YP_319228 1 2 0 0 0 0 0 0 0 0 YP_319312 6 0 0 2 0 0 0 0 0 0 NP_772539 0 0 0 0 0 0 0 0 0 0 NP_772746 0 1 0 0 0 0 0 0 0 0 YP_316897 2 0 0 4 0 0 1 0 0 0 Bradyrhizobiaceae YP_317122 0 0 0 0 0 0 0 0 0 0 YP_317147 0 0 0 0 0 0 0 0 0 0


YP_317224 1 0 0 0 0 0 0 0 0 0 YP_317328 2 0 1 1 0 0 0 0 0 0 YP_317539 0 1 0 1 0 0 0 0 0 0 YP_317580 2 0 1 1 0 0 0 0 0 0 YP_317698 0 4 0 1 0 0 1 1 0 0 YP_317706 0 0 0 0 0 0 0 0 0 0 YP_317721 1 1 0 0 0 0 0 0 0 0 YP_317722 0 1 0 0 0 0 0 0 0 0 YP_317949 1 1 0 1 0 0 1 1 0 0 YP_317753 0 1 0 0 0 0 0 6 0 0 YP_317861 0 0 0 0 0 0 0 0 0 0 YP_317883 0 0 0 0 0 0 0 0 0 0 YP_317888 0 0 0 0 0 0 0 0 0 0 YP_318067 0 0 0 0 0 0 0 0 0 0 YP_318111 0 0 0 0 0 0 0 0 0 0 YP_318125 0 0 0 0 0 0 0 0 0 0 YP_318194 1 1 0 0 0 0 0 0 0 0 YP_318195 0 0 0 0 0 0 0 0 0 0 YP_318199 2 2 0 0 0 0 0 0 1 0 YP_318262 1 0 0 0 0 0 0 0 0 0 YP_318287 2 0 0 0 0 0 0 1 0 0 YP_318318 6 1 0 1 0 0 0 0 0 0 YP_318324 1 0 0 0 0 0 0 0 0 0 YP_318398 0 0 0 0 0 0 0 0 0 0 YP_318406 3 2 0 4 0 0 2 0 1 0 YP_318413 0 0 0 0 0 0 0 0 0 0 YP_318632 0 0 0 0 0 0 0 0 0 0 YP_318673 0 0 0 0 0 0 0 0 0 0 YP_318674 0 0 0 1 0 0 0 0 0 0 YP_318769 0 0 0 0 0 0 0 0 0 0 YP_318779 1 0 0 0 0 0 0 0 0 0 YP_318789 0 0 0 0 0 0 0 0 0 0 YP_318814 0 0 0 0 0 0 0 0 0 0 YP_318850 0 0 0 1 0 0 0 0 0 0 YP_318853 2 0 0 0 0 0 0 0 0 0 YP_318985 20 1 8 5 0 0 0 6 0 0 YP_318987 0 0 0 0 0 0 0 0 0 0


YP_319020 0 0 0 0 0 0 0 0 0 0 YP_319094 0 0 0 0 0 0 0 0 0 0 YP_319097 2 0 0 0 0 0 0 0 0 0 YP_319105 0 0 0 0 0 0 0 0 0 0 YP_319111 0 0 0 0 0 0 0 0 0 0 YP_319114 0 0 0 1 0 0 1 0 0 0 YP_319136 1 0 0 0 0 0 0 0 0 0 YP_319180 0 0 0 0 0 0 0 0 0 0 YP_319182 0 0 0 2 0 0 0 0 0 0 YP_319193 0 0 0 0 0 0 0 0 0 0 YP_319235 0 0 0 0 0 0 0 0 0 0 YP_319281 0 0 0 0 0 0 0 0 0 0 YP_319282 0 0 0 0 0 0 0 0 0 0 YP_319374 0 0 0 0 0 0 0 0 0 0 YP_319394 0 0 0 0 0 0 0 0 0 0 YP_319586 0 0 0 1 0 0 0 0 0 0 YP_319561 1 0 0 0 0 0 0 0 0 0 YP_319637 0 0 0 0 0 0 0 0 0 0 YP_319739 1 0 0 0 0 0 0 0 0 0 YP_319740 2 0 0 0 0 0 0 0 0 0 YP_612088 0 5 3 0 2 3 0 0 3 1 YP_612179 0 1 0 0 0 2 0 1 5 0 YP_612231 0 0 0 0 0 1 0 0 2 0 YP_612466 0 1 0 0 0 0 0 0 2 1 YP_612581 0 1 0 0 0 1 0 0 4 2 YP_612582 0 1 0 0 0 2 0 0 3 1

YP_612692 0 1 0 0 4 0 0 0 2 0 YP_612745 0 1 0 0 0 0 0 0 1 0 YP_612747 0 1 1 0 0 1 0 0 0 1 YP_613058 0 4 0 0 1 1 0 0 2 0 YP_613059 0 2 1 0 2 0 0 0 4 0 Rhodobacterales YP_613242 0 0 0 0 1 2 0 0 0 1 YP_613345 0 1 0 0 4 1 0 0 2 1 YP_613401 0 1 1 0 1 0 0 0 0 1 YP_613562 0 1 0 0 3 1 1 0 4 0 YP_613837 0 2 1 0 0 1 0 0 2 0 YP_613961 0 1 0 0 2 4 0 0 1 0


YP_613982 0 1 0 0 1 2 0 0 3 0 YP_614257 0 1 0 0 0 1 1 2 3 1 YP_614364 0 1 0 0 1 2 0 0 2 0 YP_614419 0 2 1 0 0 1 0 0 5 1 YP_614460 0 1 0 0 0 1 0 0 4 0 YP_614481 0 2 0 0 1 1 0 0 3 0 YP_614576 0 1 0 0 0 1 0 0 3 1 YP_614993 0 1 2 0 0 2 0 0 5 1 YP_611313 0 1 0 0 0 1 0 0 1 0 YP_611978 0 1 2 0 2 0 0 0 3 1 YP_611988 0 2 2 0 2 1 0 0 2 0 YP_611993 0 2 0 0 1 2 0 0 5 1 YP_613553 0 1 0 0 1 2 0 0 1 1 YP_613730 0 1 0 0 0 0 0 0 2 1 YP_613732 3 5 1 1 0 4 1 0 5 0 YP_613733 0 0 0 0 0 1 0 0 3 0 YP_613734 4 0 13 2 2 2 0 5 3 3 YP_613731 4 39 17 1 2 3 2 5 20 13 YP_613094 0 0 0 0 0 0 0 1 0 0 YP_611425 0 0 2 0 0 0 0 0 0 0 YP_613418 0 0 0 0 0 0 0 0 0 0 YP_613446 0 0 0 0 0 0 0 0 2 0 YP_613980 0 0 0 0 0 0 0 0 0 0 YP_614100 0 0 0 0 0 0 0 0 2 0 YP_614133 0 0 0 0 0 0 0 0 0 0 YP_611311 0 0 0 0 0 0 0 0 0 0 YP_611438 0 0 0 0 0 0 0 0 0 0 YP_611444 0 1 0 0 0 0 0 0 0 0 YP_611462 0 0 0 0 0 0 0 0 0 0 YP_611763 0 0 0 0 0 1 0 0 0 0 YP_611855 0 0 0 0 0 0 0 0 0 0

NP_419305 2 1 1 0 0 0 0 0 3 0 NP_421283 3 0 0 0 0 0 0 0 3 0 NP_421560 0 0 0 0 0 0 0 0 1 0 NP_421895 11 6 1 0 0 0 1 0 1 0 NP_419331 0 3 0 0 0 0 0 0 0 0 Caulobacterales


NP_419880 0 2 0 0 0 0 0 1 1 0 NP_419882 0 1 1 0 0 0 0 0 2 0 NP_420397 0 0 0 0 0 0 0 1 0 0 NP_421010 2 1 1 0 0 0 0 0 0 0 NP_421428 0 0 0 0 1 0 0 0 0 0 NP_421438 0 2 0 0 0 0 0 0 1 0 YP_495301 4 5 2 0 0 0 0 0 3 0 YP_495335 3 5 0 0 0 0 0 1 0 0 YP_495370 1 3 1 0 0 0 0 0 0 0 YP_495433 0 2 1 0 0 0 0 0 1 0 YP_495514 10 4 2 0 0 0 0 0 0 0 YP_495691 4 4 0 0 0 0 0 0 2 0 YP_496367 8 5 0 0 0 0 0 0 2 0 YP_496423 1 2 0 0 0 0 0 0 0 0 YP_496569 3 5 1 0 0 0 0 1 1 0 YP_496656 1 2 0 0 0 0 0 0 0 0 YP_497188 2 4 1 0 0 0 0 0 0 0 YP_497403 2 4 0 0 0 0 0 0 1 0

YP_498058 1 3 0 0 0 0 0 0 2 0 YP_498227 4 4 0 0 0 0 0 0 0 0 YP_498407 2 3 0 0 0 0 0 0 1 0 YP_498482 1 3 0 0 0 0 0 0 1 0 YP_495327 1 1 3 0 0 0 0 0 0 0 YP_495437 1 3 0 0 0 0 0 0 2 0 Sphingomonadales YP_495697 9 5 0 0 0 0 0 0 2 0 YP_495740 1 3 1 0 0 0 0 0 3 0 YP_496357 5 4 1 1 0 0 0 0 1 0 YP_496405 0 3 2 0 0 0 0 0 0 0 YP_496439 6 4 0 0 0 0 0 1 3 0 YP_496442 0 3 0 0 0 0 0 0 2 0 YP_497022 4 7 0 0 0 0 0 1 4 0 YP_497059 0 1 0 0 0 0 0 0 1 0 YP_497246 0 2 1 0 0 0 0 0 0 0 YP_497309 0 1 0 0 0 0 0 0 2 0 YP_497310 4 2 0 0 0 0 0 0 1 0 YP_497604 16 6 2 0 0 0 0 0 1 0 YP_497818 4 6 4 0 0 0 2 0 1 0


AAW60410 0 0 0 0 0 0 0 0 0 0 AAW60472 1 0 0 0 0 0 0 0 0 0 AAW60735 1 0 0 0 0 0 1 0 0 0 AAW61019 2 0 0 1 0 0 0 0 0 0 AAW59936 1 0 0 0 0 0 0 1 0 0 AAW61357 0 0 0 0 0 0 0 0 0 0 AAW60126 0 0 0 0 0 0 0 0 0 0 AAW60973 0 0 0 0 0 0 0 0 0 0 AAW60976 0 0 0 0 0 0 0 0 0 0 AAW60983 0 0 0 0 0 0 0 0 0 0 AAW60985 0 0 0 0 0 0 0 0 0 0 AAW62008 0 0 0 0 0 0 0 1 0 0 AAW62049 1 0 2 0 0 0 0 0 1 0

AAW62183 0 0 0 0 0 0 0 0 0 0 AAW62185 3 0 0 0 0 0 0 0 0 0 AAW60994 0 0 0 0 0 0 0 0 0 0 AAW62187 1 0 0 0 0 0 0 1 0 0 YP_425217 9 0 2 1 1 0 0 0 2 0 Rhodospirillales YP_425244 0 0 0 0 0 0 0 0 0 0 YP_425622 1 0 0 0 0 0 0 1 1 0 YP_426776 1 0 1 0 0 0 0 0 0 0 YP_426843 1 0 1 3 0 0 0 0 0 0 YP_427199 2 0 0 3 0 0 0 1 0 0 YP_427597 4 0 1 0 0 0 0 1 0 0 YP_427676 2 0 0 0 1 0 0 0 0 0 YP_427912 1 0 0 0 0 0 0 0 0 0 YP_428643 8 0 2 0 0 0 0 0 1 0 YP_428717 2 0 0 1 1 0 0 0 1 0 YP_428743 4 0 0 0 0 0 0 0 0 0 YP_428820 0 0 0 0 0 0 0 0 0 0 YP_428881 2 1 2 0 0 0 0 0 0 0 NP_965979 0 0 0 0 0 0 0 0 0 0

NP_966474 0 0 0 0 0 0 0 0 0 0 NP_965909 0 0 0 0 0 0 0 0 0 0 NP_966580 0 0 0 0 0 0 0 0 0 0

Rickettsiales NP_965975 0 0 0 0 0 0 0 0 0 0 NP_965966 0 0 0 0 0 0 0 0 0 0


NP_966527 0 0 0 0 0 0 0 0 0 0 NP_966202 0 0 0 0 0 0 0 0 0 0 NP_966253 0 0 0 0 0 0 0 0 0 0 NP_966513 0 0 0 0 0 0 0 0 0 0 NP_966574 0 0 0 0 0 0 0 0 0 0 NP_966613 0 0 0 0 0 0 0 0 0 0 NP_966526 0 1 0 0 0 0 0 0 0 0 NP_966520 0 0 0 0 0 0 0 0 0 0 NP_966750 0 0 0 0 0 0 0 0 0 0 NP_966779 0 0 0 0 0 0 0 0 0 0 NP_966932 0 0 0 0 0 0 0 0 0 0 NP_966942 0 0 0 0 0 0 0 0 0 0 NP_220581 0 0 0 0 0 0 0 0 0 0 NP_220424 0 0 0 0 0 0 0 0 0 0 NP_220576 0 0 0 0 0 0 0 1 0 0

Figure 2: Alphaproteobacteria-specific CSPs identified in 10 metagenomes. Numbers of significant hits within each metagenome are assigned to the listed taxa. Color formatting indicates high and low values: negative results are in green, positive results are in yellow, and red indicates the highest values in the chart.


CSP, followed by one value per metagenome (Marine, Compost, Whale fall, Bioreactor, Wastewater, Groundwater, Microbial mat, Activated sludge, Hydrothermal vent, Freshwater sediment)

NP_420905 128 130 63 98 93 85 87 78 108 85 NP_422086 203 205 212 132 206 190 180 191 209 143 NP_422113 67 98 77 67 57 55 71 0 66 65

NP_420178 263 124 94 102 83 80 61 92 118 96 NP_420025 94 110 85 73 48 72 109 0 103 0 NP_420693 100 75 72 71 49 72 78 0 78 71 NP_421048 117 138 89 69 47 91 105 0 122 0 proteobacteria - NP_422264 155 196 58 69 0 60 59 53 130 60 α NP_419339 147 177 0 0 119 0 159 113 161 158 NP_421804 98 188 84 83 0 0 90 59 135 72 NP_418919 117 140 87 75 69 115 118 222 132 56 YP_031797 70 77 51 67 0 0 53 0 56 0 YP_032733 63 81 0 61 0 0 0 48 77 0 YP_032395 169 265 0 107 0 0 162 0 106 0 NP_101943 0 171 0 0 0 0 0 0 0 0 NP_105027 0 104 0 0 0 80 0 0 0 0 NP_108034 174 231 0 196 0 0 107 90 186 0

NP_102510 0 189 0 0 0 0 0 0 0 0 NP_102519 0 173 49 53 0 0 0 0 0 0 NP_104217 0 112 60 0 0 0 0 0 0 0 Rhizobiales NP_107016 82 133 0 0 0 128 0 0 0 0 NP_101988 154 411 0 112 0 69 0 0 202 0 NP_102895 0 110 0 0 0 0 0 0 0 0 NP_104087 0 55 0 81 0 0 0 0 0 0 NP_104130 78 200 0 0 0 0 0 0 115 0 NP_105201 0 207 0 0 0 0 0 0 0 0


NP_105743 213 487 0 176 0 92 0 0 291 0 NP_108472 102 462 0 0 0 0 0 0 0 0 NP_103319 115 184 0 0 0 0 0 0 167 0 NP_101965 128 410 0 0 0 0 0 56 0 0 NP_101954 0 126 0 0 0 0 0 0 0 0 NP_102577 51 164 0 0 0 0 0 0 0 0 NP_109472 0 238 0 0 0 0 0 0 0 0 NP_105883 0 189 0 0 0 0 0 0 0 0 NP_106835 54 0 0 0 0 0 0 0 0 0 NP_107159 50 63 0 0 0 0 0 0 0 0 NP_103376 65 200 0 0 0 0 0 0 0 0 NP_104418 0 107 0 0 0 0 0 0 0 0 NP_105704 0 120 0 0 0 0 49 0 0 0 NP_102252 0 0 0 0 0 0 0 0 0 0 NP_103286 0 107 0 0 0 0 0 0 0 0 NP_106741 0 112 0 0 0 0 0 0 0 0 NP_106740 0 0 0 0 0 0 0 0 0 0 NP_104236 127 0 0 0 0 0 0 0 0 0 NP_103455 0 0 0 0 0 0 0 0 0 0 NP_103450 0 0 0 0 0 0 0 0 0 0 NP_103476 0 47 0 0 0 0 0 0 0 0 NP_107075 0 0 0 0 0 0 0 0 0 0

NP_772654 0 91 0 0 0 0 0 0 0 0 YP_317707 0 56 0 0 0 0 0 204 0 0 YP_317841 65 47 0 0 0 0 0 0 0 0 YP_318399 112 0 163 0 0 0 0 0 0 0 YP_318401 156 0 0 0 0 0 0 0 0 0 Xanthobacteraceae YP_318753 0 0 0 0 0 0 0 0 0 0 and

YP_318785 0 0 0 0 0 0 0 0 0 0 YP_319038 0 0 0 0 0 0 0 0 0 0 YP_319081 60 61 0 0 0 0 0 0 0 0 YP_319177 127 140 0 95 0 0 0 0 0 0 YP_319228 56 93 0 0 0 0 0 0 0 0 YP_319312 117 0 0 76 0 0 0 0 0 0 Bradyrhizobiaceae


NP_772539 0 0 0 0 0 0 0 0 0 0 NP_772746 0 73 0 0 0 0 0 0 0 0 YP_316897 216 0 0 148 0 0 176 0 0 0 YP_317122 0 0 0 0 0 0 0 0 0 0 YP_317147 0 0 0 0 0 0 0 0 0 0 YP_317224 46 0 0 0 0 0 0 0 0 0 YP_317328 125 0 118 145 0 0 0 0 0 0 YP_317539 0 54 0 54 0 0 0 0 0 0 YP_317580 128 0 111 147 0 0 0 0 0 0 YP_317698 0 65 0 55 0 0 56 57 0 0 YP_317706 0 0 0 0 0 0 0 0 0 0 YP_317721 61 131 0 0 0 0 0 0 0 0 YP_317722 0 68 0 0 0 0 0 0 0 0 YP_317949 49 56 0 79 0 0 159 105 0 0 YP_317753 0 235 0 0 0 0 0 88 0 0 YP_317861 0 0 0 0 0 0 0 0 0 0 YP_317883 0 0 0 0 0 0 0 0 0 0 YP_317888 0 0 0 0 0 0 0 0 0 0 YP_318067 0 0 0 0 0 0 0 0 0 0 YP_318111 0 0 0 0 0 0 0 0 0 0 YP_318125 0 0 0 0 0 0 0 0 0 0 YP_318194 66 58 0 0 0 0 0 0 0 0 YP_318195 0 0 0 0 0 0 0 0 0 0 YP_318199 103 63 0 0 0 0 0 0 87 0 YP_318262 46 0 0 0 0 0 0 0 0 0 YP_318287 57 0 0 0 0 0 0 71 0 0 YP_318318 90 53 0 63 0 0 0 0 0 0 YP_318324 51 0 0 0 0 0 0 0 0 0 YP_318398 0 0 0 0 0 0 0 0 0 0 YP_318406 96 120 0 54 0 0 92 0 49 0 YP_318413 0 0 0 0 0 0 0 0 0 0 YP_318632 0 0 0 0 0 0 0 0 0 0 YP_318673 0 0 0 0 0 0 0 0 0 0 YP_318674 0 0 0 52 0 0 0 0 0 0


YP_318769 0 0 0 0 0 0 0 0 0 0 YP_318779 78 0 0 0 0 0 0 0 0 0 YP_318789 0 0 0 0 0 0 0 0 0 0 YP_318814 0 0 0 0 0 0 0 0 0 0 YP_318850 0 0 0 55 0 0 0 0 0 0 YP_318853 90 0 0 0 0 0 0 0 0 0 YP_318985 339 59 441 169 0 0 0 376 0 0 YP_318987 0 0 0 0 0 0 0 0 0 0 YP_319020 0 0 0 0 0 0 0 0 0 0 YP_319094 0 0 0 0 0 0 0 0 0 0 YP_319097 108 0 0 0 0 0 0 0 0 0 YP_319105 0 0 0 0 0 0 0 0 0 0 YP_319111 0 0 0 0 0 0 0 0 0 0 YP_319114 0 0 0 48 0 0 56 0 0 0 YP_319136 52 0 0 0 0 0 0 0 0 0 YP_319180 0 0 0 0 0 0 0 0 0 0 YP_319182 0 0 0 46 0 0 0 0 0 0 YP_319193 0 0 0 0 0 0 0 0 0 0 YP_319235 0 0 0 0 0 0 0 0 0 0 YP_319281 0 0 0 0 0 0 0 0 0 0 YP_319282 0 0 0 0 0 0 0 0 0 0 YP_319374 0 0 0 0 0 0 0 0 0 0 YP_319394 0 0 0 0 0 0 0 0 0 0 YP_319586 0 0 0 52 0 0 0 0 0 0 YP_319561 73 0 0 0 0 0 0 0 0 0 YP_319637 0 0 0 0 0 0 0 0 0 0 YP_319739 102 0 0 0 0 0 0 0 0 0 YP_319740 92 0 0 0 0 0 0 0 0 0

YP_612088 0 125 107 0 68 65 0 0 119 119 YP_612179 0 383 0 0 0 405 0 49 236 0 YP_612231 0 0 0 0 0 60 0 0 108 0 YP_612466 0 131 0 0 0 0 0 0 137 107 YP_612581 0 365 0 0 0 114 0 0 357 352

Rhodobacterales YP_612582 0 183 0 0 0 124 0 0 170 196


YP_612692 0 92 0 0 90 0 0 0 86 0 YP_612745 0 55 0 0 0 0 0 0 130 0 YP_612747 0 138 114 0 0 97 0 0 0 115 YP_613058 0 113 0 0 85 117 0 0 109 0 YP_613059 0 103 83 0 78 0 0 0 93 0 YP_613242 0 0 0 0 94 63 0 0 0 50 YP_613345 0 172 0 0 147 161 0 0 114 197 YP_613401 0 81 104 0 97 0 0 0 0 92 YP_613562 0 218 0 0 194 184 98 0 213 0 YP_613837 0 124 150 0 0 84 0 0 135 0 YP_613961 0 136 0 0 131 152 0 0 142 0 YP_613982 0 148 0 0 152 159 0 0 151 0 YP_614257 0 721 0 0 0 333 199 330 311 719 YP_614364 0 271 0 0 115 293 0 0 237 0 YP_614419 0 192 199 0 0 153 0 0 232 225 YP_614460 0 156 0 0 0 202 0 0 139 0 YP_614481 0 459 0 0 158 460 0 0 496 0 YP_614576 0 111 0 0 0 102 0 0 102 124 YP_614993 0 139 178 0 0 176 0 0 174 156 YP_611313 0 75 0 0 0 72 0 0 55 0 YP_611978 0 273 249 0 183 0 0 0 286 196 YP_611988 0 143 122 0 165 160 0 0 198 0 YP_611993 0 231 0 0 227 259 0 0 184 225 YP_613553 0 123 0 0 117 125 0 0 112 125 YP_613730 0 194 0 0 0 0 0 0 203 195 YP_613732 54 57 54 60 0 105 61 0 103 0 YP_613733 0 0 0 0 0 129 0 0 99 0 YP_613734 99 0 211 57 60 68 0 115 82 115 YP_613731 159 405 209 135 140 204 86 136 221 318 YP_613094 0 0 0 0 0 0 0 60 0 0 YP_611425 0 0 45 0 0 0 0 0 0 0 YP_613418 0 0 0 0 0 0 0 0 0 0 YP_613446 0 0 0 0 0 0 0 0 142 0 YP_613980 0 0 0 0 0 0 0 0 0 0 YP_614100 0 0 0 0 0 0 0 0 151 0 YP_614133 0 0 0 0 0 0 0 0 0 0


YP_611311 0 0 0 0 0 0 0 0 0 0
YP_611438 0 0 0 0 0 0 0 0 0 0
YP_611444 0 50 0 0 0 0 0 0 0 0
YP_611462 0 0 0 0 0 0 0 0 0 0
YP_611763 0 0 0 0 0 84 0 0 0 0
YP_611855 0 0 0 0 0 0 0 0 0 0
NP_419305 82 62 85 0 0 0 0 0 81 0
NP_421283 181 0 0 0 0 0 0 0 191 0
NP_421560 0 0 0 0 0 0 0 0 56 0
NP_421895 127 291 73 0 0 0 115 0 119 0
NP_419331 0 100 0 0 0 0 0 0 0 0
NP_419880 0 51 0 0 0 0 0 50 88 0
NP_419882 0 71 51 0 0 0 0 0 63 0
NP_420397 0 0 0 0 0 0 0 82 0 0
Caulobacterales
NP_421010 104 75 100 0 0 0 0 0 0 0
NP_421428 0 0 0 0 60 0 0 0 0 0
NP_421438 0 94 0 0 0 0 0 0 95 0
YP_495301 116 216 115 0 0 0 0 0 225 0
YP_495335 132 87 0 0 0 0 0 248 0 0
YP_495370 64 309 168 0 0 0 0 0 0 0
YP_495433 0 112 82 0 0 0 0 0 65 0
YP_495514 174 464 61 0 0 0 0 0 0 0
YP_495691 175 206 0 0 0 0 0 0 127 0
YP_496367 93 284 0 0 0 0 0 0 215 0
YP_496423 57 276 0 0 0 0 0 0 0 0
YP_496569 110 139 70 0 0 0 0 105 77 0
YP_496656 69 192 0 0 0 0 0 0 0 0
YP_497188 110 142 49 0 0 0 0 0 0 0
YP_497403 77 202 0 0 0 0 0 0 112 0
YP_498058 68 225 0 0 0 0 0 0 92 0
Sphingomonadales
YP_498227 141 311 0 0 0 0 0 0 0 0
YP_498407 88 145 0 0 0 0 0 0 53 0
YP_498482 73 246 0 0 0 0 0 0 103 0
YP_495327 74 134 89 0 0 0 0 0 0 0
YP_495437 74 89 0 0 0 0 0 0 66 0
YP_495697 119 156 0 0 0 0 0 0 57 0
YP_495740 60 73 48 0 0 0 0 0 114 0


YP_496357 122 240 78 159 0 0 0 0 59 0
YP_496405 0 168 94 0 0 0 0 0 0 0
YP_496439 70 76 0 0 0 0 0 68 102 0
YP_496442 0 61 0 0 0 0 0 0 47 0
YP_497022 74 191 0 0 0 0 0 62 117 0
YP_497059 0 67 0 0 0 0 0 0 68 0
YP_497246 0 64 50 0 0 0 0 0 0 0
YP_497309 0 285 0 0 0 0 0 0 99 0
YP_497310 80 85 0 0 0 0 0 0 73 0
YP_497604 147 663 303 0 0 0 0 0 149 0
YP_497818 130 305 308 0 0 0 127 0 164 0
AAW60410 0 0 0 0 0 0 0 0 0 0
AAW60472 64 0 0 0 0 0 0 0 0 0
AAW60735 52 0 0 0 0 0 53 0 0 0
AAW61019 53 0 0 56 0 0 0 0 0 0
AAW59936 78 0 0 0 0 0 0 68 0 0
AAW61357 0 0 0 0 0 0 0 0 0 0
AAW60126 0 0 0 0 0 0 0 0 0 0
AAW60973 0 0 0 0 0 0 0 0 0 0
AAW60976 0 0 0 0 0 0 0 0 0 0
AAW60983 0 0 0 0 0 0 0 0 0 0
AAW60985 0 0 0 0 0 0 0 0 0 0
AAW62008 0 0 0 0 0 0 0 56 0 0
AAW62049 70 0 471 0 0 0 0 0 158 0
AAW62183 0 0 0 0 0 0 0 0 0 0
AAW62185 162 0 0 0 0 0 0 0 0 0
AAW60994 0 0 0 0 0 0 0 0 0 0
Rhodospirillales
AAW62187 70 0 0 0 0 0 0 69 0 0
YP_425217 173 0 177 54 68 0 0 0 171 0
YP_425244 0 0 0 0 0 0 0 0 0 0
YP_425622 94 0 0 0 0 0 0 88 89 0
YP_426776 104 0 67 0 0 0 0 0 0 0
YP_426843 155 0 62 53 0 0 0 0 0 0
YP_427199 144 0 0 87 0 0 0 78 0 0
YP_427597 181 0 125 0 0 0 0 60 0 0
YP_427676 227 0 0 0 59 0 0 0 0 0
YP_427912 55 0 0 0 0 0 0 0 0 0
YP_428643 410 0 342 0 0 0 0 0 195 0


YP_428717 78 0 0 74 83 0 0 0 72 0
YP_428743 114 0 0 0 0 0 0 0 0 0
YP_428820 0 0 0 0 0 0 0 0 0 0
YP_428881 86 56 82 0 0 0 0 0 0 0
NP_965979 0 0 0 0 0 0 0 0 0 0
NP_966474 0 0 0 0 0 0 0 0 0 0
NP_965909 0 0 0 0 0 0 0 0 0 0
NP_966580 0 0 0 0 0 0 0 0 0 0
NP_965975 0 0 0 0 0 0 0 0 0 0
NP_965966 0 0 0 0 0 0 0 0 0 0
NP_966527 0 0 0 0 0 0 0 0 0 0
NP_966202 0 0 0 0 0 0 0 0 0 0
NP_966253 0 0 0 0 0 0 0 0 0 0
NP_966513 0 0 0 0 0 0 0 0 0 0
NP_966574 0 0 0 0 0 0 0 0 0 0
NP_966613 0 0 0 0 0 0 0 0 0 0
Rickettsiales
NP_966526 0 69 0 0 0 0 0 0 0 0
NP_966520 0 0 0 0 0 0 0 0 0 0
NP_966750 0 0 0 0 0 0 0 0 0 0
NP_966779 0 0 0 0 0 0 0 0 0 0
NP_966932 0 0 0 0 0 0 0 0 0 0
NP_966942 0 0 0 0 0 0 0 0 0 0
NP_220581 0 0 0 0 0 0 0 0 0 0
NP_220424 0 0 0 0 0 0 0 0 0 0
NP_220576 0 0 0 0 0 0 0 82 0 0

Figure 3: Similarity of significant hits in 10 metagenomes. The best bit score within each metagenome is assigned to each listed taxon. Color formatting indicates high and low values: negative results are in green, positive results are in yellow, and red indicates the highest values in the chart.


Figure 4: Overall relative abundance of Alphaproteobacteria based on CSP distribution in 10 metagenomes. Note: The dots indicate the average bit score obtained from BLAST searches and show the average extent of similarity between metagenomic reads and CSPs.


Figure 5: The relative abundance of Alphaproteobacteria and its different sub-clades in the studied metagenomes based upon BLASTp searches with CSPs. Note: The colored bars indicate the number of significant hits detected in each metagenome with CSPs that are specific for different groups of Alphaproteobacteria.


Figure 6: Comparative results of Alphaproteobacteria distribution in 4 metagenomes derived from (A) CSP-based binning and (B) similarity-based binning. Note: The lower pie charts are obtained from the MG-RAST and IMG/M databases. The color scheme denoting the different Alphaproteobacteria subgroups is shown below.


Chapter 4 Discussion

4.1 Metagenome selection

The 10 metagenomes selected in this work from the NCBI metagenomic database represent different environmental habitats around the world and cover all 3 metagenomic ecosystems: 4 from engineered ecosystems (bioreactor, compost, wastewater and activated sludge metagenomes), 5 from environmental ecosystems (groundwater, freshwater sediment, microbial mat, marine and hydrothermal vent metagenomes) and 1 from a host-associated ecosystem (whale fall metagenome). The habitats of Alphaproteobacterial communities are highly divergent, including saline water, sediment, marine and fossil environments, green-waste compost and wastewater treatment plants.

Public taxonomic classifications of these metagenomes, from either the MG-RAST or the JGI platform, identified a myriad of Alphaproteobacteria-associated sequences, further suggesting that Alphaproteobacteria can adapt to the diverse environments described above.

It is notable that the metagenomic projects differ from one another in total length, number of contigs and average contig length (Table 11). These differences have a marked influence on downstream bioinformatics analysis. For instance, the total length of the whole genome shotgun (WGS) sequences in this study ranges from 24.9 Mbases to 421.6 Mbases, whereas the corresponding raw sequencing data reach tens or hundreds of Gbases. Data are lost during bioinformatics steps such as quality control and duplicate clustering. Likewise, an enormous number of metagenomic sequences are discarded because the ever-increasing size of metagenomic projects has surpassed the volume of any existing public database, so that these sequences cannot be matched to any reference sequence (Thomas et al., 2012). The number of contigs assembled in these metagenomes ranges from 26573 to 748672. Sequence coverage is a key factor in producing assembled contigs. However, the mixture of genomes challenges the assembly process and leads to a low yield of assembled contigs, because metagenomic sequences are less redundant than single-genome sequences. Lastly, the average contig length in each metagenomic project ranges from 425 bp to 2801 bp. The longer a metagenomic sequence is, the higher the mapping accuracy (Wommack et al., 2008). The depth of sequencing determines the length of assembled contigs, so it is possible to reconstruct a single draft genome from metagenomic sequences, provided that sequencing is deep enough to give sufficient fold coverage for joining DNA fragments.
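The three summary statistics discussed above (number of contigs, total length and average contig length) can be computed directly from an assembly file. The following is a minimal Python sketch, assuming the contigs are stored in FASTA format; the function name `contig_stats` and the toy records are hypothetical and are not part of the pipeline used in this work.

```python
from io import StringIO

def contig_stats(fasta_handle):
    """Return (number of contigs, total length, average length) for a FASTA stream."""
    lengths = []
    current = 0
    in_record = False
    for line in fasta_handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if in_record:
                lengths.append(current)
            current = 0
            in_record = True
        elif line and in_record:
            current += len(line)
    if in_record:
        lengths.append(current)
    total = sum(lengths)
    average = total / len(lengths) if lengths else 0.0
    return len(lengths), total, average

# Hypothetical toy assembly with three contigs (illustration only).
example = StringIO(">contig1\nACGTACGT\nACGT\n>contig2\nGGCC\n>contig3\nATATAT\n")
n_contigs, total_length, average_length = contig_stats(example)
print(n_contigs, total_length, round(average_length, 2))
```

Applied to each metagenomic project, such a summary would give the kind of per-project figures that are compared in this section.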

However, sequencing technology unveils only a small portion of the microbes in an environment, because incomplete sequencing is a major and inevitable limitation of most metagenomic studies. As a consequence, the species that can be predicted from metagenomic datasets are still very limited and are likely biased by the information asymmetry between databases and metagenomes (Wooley et al., 2010).

4.2 Identification of CSPs in metagenomic samples

An important advantage of using CSPs for metagenomic profiling is that the presence of these protein markers can be detected more reliably than that of the corresponding gene markers. When the gene markers of the corresponding CSPs are used in similar searches, the number of significant hits obtained is much lower than that obtained using protein markers (Table 2). This can be explained by both the redundancy of codons and the variation of gene sequences in metagenomes (Kembel et al., 2011). In view of this, CSPs may be able to decipher the taxonomic origin of some unassigned metagenomic sequences beyond what nucleotide markers can do. Unlike MetaPhlAn, a similarity-based binning tool that relies on unique clade-specific gene markers, the CSP-based methodology emphasizes identifying and exploiting clade-specific protein markers at levels ranging from phylum to genus. However, very few genus-specific CSPs have so far been identified within Alphaproteobacteria, which leads to low resolution of taxonomic profiling at lower taxonomic levels in metagenomic projects. More genus-specific CSPs are likely to be identified as more reference genomes become publicly available.

According to the heat-map distribution of the 264 Alphaproteobacterial CSPs, most of the class-specific CSPs could be detected in all 10 metagenomic projects, whose habitats abound with Alphaproteobacteria. Compared with the formidable task of assembling each individual genome in a metagenomic sample, it is more feasible to build a molecular marker database that contains all commonly shared genes within Alphaproteobacteria. The results also indicate that CSPs at higher taxonomic levels, such as the class and phylum levels, tend to be readily detected in metagenomic datasets. Alphaproteobacteria are enriched in 3 metagenomic projects (bioreactor, wastewater and whale fall metagenomes) (Figure 4). The detailed distribution of the 6 major orders of Alphaproteobacteria indicates that Rhodobacterales is the most abundant clade in 6 of the metagenomic projects studied (Figure 5). The relative abundances of Alphaproteobacteria reported by MG-RAST and IMG/M also support the CSP-based results in 4 metagenomes (Figure 6). The proportions of Alphaproteobacterial clades in these 4 metagenomic projects are highly correlated with the proportions predicted by Alphaproteobacteria-specific CSPs (Figure 6). This finding indicates a potential application of CSPs: Alphaproteobacteria-specific CSPs are able to predict the distribution pattern of Alphaproteobacterial clades in metagenomic samples.
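To make the notion of a predicted distribution pattern concrete, the sketch below converts per-order counts of significant hits into proportions. It is an illustration only: it assumes that relative abundance is taken as the simple share of significant hits per order, without normalizing for the number of CSPs available for each order, and the function name and the hit counts are hypothetical rather than values from this study.

```python
def relative_abundance(hit_counts):
    """Convert per-order significant-hit counts into proportions that sum to 1."""
    total = sum(hit_counts.values())
    if total == 0:
        # No significant hits at all: report zero for every order.
        return {order: 0.0 for order in hit_counts}
    return {order: count / total for order, count in hit_counts.items()}

# Hypothetical hit counts for one metagenome (not data from this thesis).
example_hits = {
    "Rhodobacterales": 60,
    "Rhizobiales": 25,
    "Sphingomonadales": 10,
    "Caulobacterales": 5,
}
proportions = relative_abundance(example_hits)
for order, share in proportions.items():
    print(f"{order}: {share:.2f}")
```

Proportions of this kind are what would be set side by side with the pie charts from MG-RAST and IMG/M.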

4.3 Comparative analysis of Alphaproteobacteria in metagenomes

The heat maps in Figures 2 and 3 reflect the overall distribution pattern of all 264 CSPs in the 10 metagenomes. It is noticeable that the 11 CSPs unique to the class Alphaproteobacteria are ubiquitously present in all 10 metagenomes. Their average bit scores are higher than those of the order-specific CSPs, and the total number of significant hits identified exceeds that of all other CSPs. Based on this finding, it is concluded that class-specific CSPs are much more easily detected than order-specific or family-specific CSPs. This is because all Alphaproteobacterial species potentially present are assumed to contribute class-specific CSPs to metagenomic datasets. The predictive ability of class-specific CSPs is therefore much stronger than that of order-specific or family-specific CSPs. The best bit scores of Rhodobacterales-specific CSPs in 6 metagenomic samples are very high compared with those of other clade-specific CSPs. This suggests that Rhodobacterales is the dominant Alphaproteobacterial member in those metagenomes and that the high concentration of Rhodobacterales increases coverage during sequence assembly, thus producing more complete genomic sequences of Rhodobacterales. Similar results are seen for Sphingomonadales-specific and Rhizobiales-specific CSPs in the wastewater metagenome. In summary, the occurrence rate of Alphaproteobacteria-specific CSPs is influenced by the specificity of the CSPs as well as by the concentration of the corresponding bacterial clade.
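The two quantities compared in this section, the number of metagenomes in which a CSP is detected and its average bit score, can be summarized per CSP as sketched below. The three rows are copied from the Figure 3 heat map; the treatment of a zero as "no significant hit" is an assumption made for this illustration, the metagenome columns are referred to by position only, and the helper name `summarize` is hypothetical.

```python
# Best bit scores per metagenome column, copied from three rows of the Figure 3 heat map.
# Assumption: a value of 0 means that no significant hit was found in that metagenome.
best_bit_scores = {
    "YP_614257": [0, 721, 0, 0, 0, 333, 199, 330, 311, 719],
    "YP_497604": [147, 663, 303, 0, 0, 0, 0, 0, 149, 0],
    "NP_966526": [0, 69, 0, 0, 0, 0, 0, 0, 0, 0],
}

def summarize(scores):
    """Return (number of metagenomes with a hit, mean bit score over those hits)."""
    hits = [s for s in scores if s > 0]
    mean = sum(hits) / len(hits) if hits else 0.0
    return len(hits), mean

summary = {accession: summarize(row) for accession, row in best_bit_scores.items()}
for accession, (n_hits, mean_score) in summary.items():
    print(accession, n_hits, round(mean_score, 1))
```

Averaging such per-CSP summaries within a group of CSPs would give one way of comparing class-specific with order-specific markers.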


A comprehensive profile of Alphaproteobacteria was produced from the distribution of CSPs in the metagenomic projects. Alphaproteobacteria dominate 3 metagenomes. This indicates that, in these three metagenomes, different clades of Alphaproteobacteria may each perform particular functions that maintain the balance and development of the habitat. Alphaproteobacteria are less abundant in the other 7 metagenomes, which are occupied either by a single order or by several orders at low concentration (Figure 5). The microbial mat, hydrothermal vent and groundwater metagenomes represent three typical habitats whose Alphaproteobacteria are mainly Rhodobacterales. These habitats are characterized by extreme environmental conditions such as hypersalinity, high temperature and exposure to radiation. The detection of Rhodobacterales-specific CSPs suggests that members of this order have a strong ability to adapt to harsh environments. In the remaining 4 metagenomes (marine, compost, activated sludge and freshwater sediment), the overall concentration of Alphaproteobacteria is not very high, but their diversity is higher than in the three metagenomes discussed above. Several Alphaproteobacterial clades are present at low concentration in these metagenomes, which suggests that Alphaproteobacteria carry out auxiliary functions that help maintain the equilibrium of the habitat. In brief, different environments with distinct growth conditions are preferred by different Alphaproteobacterial species. This link between organism and environment may allow the presence of similar lineages to be predicted before fieldwork and laboratory experiments are carried out.


An important goal of this study is to validate the CSP methodology for organism identification and abundance prediction. The relative abundances obtained by CSP-based binning and by traditional similarity-based binning on public metagenomic servers are highly correlated. The distribution of Alphaproteobacteria and its sub-clades in the bioreactor metagenome closely matches the proportions reported by the IMG/M server (Figure 6). In the wastewater metagenome, 3 groups of Alphaproteobacteria (Rhizobiales, Rhodobacterales and Sphingomonadales) are shown to be the major members both by CSP searches and by similarity searches on the MG-RAST server. Comparison between CSP-based binning and similarity-based binning on MG-RAST for the whale fall and marine metagenomes also shows similar results. CSP-based binning therefore appears reliable for predicting the relative abundance of Alphaproteobacteria in metagenomic samples. Because the CSP database constructed here is smaller and more specific than the NCBI non-redundant database, taxonomic clustering of environmental datasets can be achieved more quickly and accurately.
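One way to quantify the agreement between the two binning approaches is a Pearson correlation coefficient computed over the per-clade proportions. The sketch below is a hedged illustration: the choice of Pearson correlation is an assumption (the measure of correlation is not specified in the text), and the two proportion vectors are hypothetical rather than values read from Figure 6.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length and at least two values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical proportions of four clades in one metagenome (illustration only).
csp_based = [0.50, 0.30, 0.15, 0.05]
similarity_based = [0.45, 0.35, 0.12, 0.08]
r = pearson(csp_based, similarity_based)
print(round(r, 3))
```

With only a handful of clades per metagenome, such a coefficient is a rough indicator and is best read alongside the proportions themselves.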

4.4 Overall conclusions

In previous centuries, the study of microorganisms was mainly restricted to single species in laboratory culture (Madigan et al., 2008). Since the vast majority of microbes cannot be grown in the laboratory, research on microbial community interactions beyond culture substrates has lagged behind (Hugenholtz et al., 1998). Nevertheless, under environmental conditions, all microbial activities, such as photosynthesis, organic degradation and nitrogen fixation, are conducted by complex microbial communities that have evolved over millions of years to adapt to different habitats and ecosystems (Davey and O’Toole, 2000). In order to understand the complex interactions within a microbial community, it is necessary to explore species diversity as well as relative abundance in the environment (Kuramitsu et al., 2007). In this study, 264 CSPs were used to investigate Alphaproteobacterial diversity in 10 metagenomic projects. The results indicate that most CSPs could be detected in different metagenomic projects. By analyzing and comparing the distributions of bit scores and numbers of significant hits, a comprehensive profile of Alphaproteobacterial diversity in metagenomic datasets was obtained. In essence, CSP-based binning is a refinement of traditional similarity-based binning that improves efficiency and effectiveness: computational expense is reduced while mapping accuracy increases.

Although CSPs cannot robustly resolve issues such as bacterial quantification or species- and strain-level diagnosis, they shed light on the profiling of bacterial clades above the species level and provide a new way to predict the relative abundance of microbial clades in different metagenomes using clade-specific protein markers.

4.5 Future directions

Beyond the work completed here, further studies could extend these results:

CSPs specific to other bacterial phyla, such as Actinobacteria, Cyanobacteria and Bacteroidetes, have already been identified in previous studies. With these molecular markers, it is possible to predict the presence and relative abundance of the corresponding bacteria in more metagenomic projects. By constructing a database that contains all CSPs unique to every taxon from the genus level to the phylum level in the domain Bacteria, a comprehensive blueprint for metagenomic taxonomic classification can be created to profile the presence and relative abundance of all microorganisms in metagenomic datasets.
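A database of this kind could be organized as a lookup table from each CSP to the taxon and rank it is specific for. The following minimal sketch shows the idea; the identifiers, the taxon assignments, the function name `profile_hits` and the bit-score cutoff of 50 are all placeholders chosen for illustration and do not correspond to the markers or thresholds used in this work.

```python
from collections import Counter

# Hypothetical marker table: CSP identifier -> (taxon, rank). All entries are placeholders.
csp_database = {
    "CSP_0001": ("Alphaproteobacteria", "class"),
    "CSP_0002": ("Rhodobacterales", "order"),
    "CSP_0003": ("Actinobacteria", "phylum"),
}

def profile_hits(hits, database, min_bit_score=50.0):
    """Count significant hits per taxon; `hits` is an iterable of (csp_id, bit_score)."""
    counts = Counter()
    for csp_id, bit_score in hits:
        # Ignore queries that are not in the marker table and hits below the cutoff.
        if csp_id in database and bit_score >= min_bit_score:
            taxon, _rank = database[csp_id]
            counts[taxon] += 1
    return counts

# Hypothetical search results for one metagenome (illustration only).
example_hits = [
    ("CSP_0001", 120.0),
    ("CSP_0002", 45.0),
    ("CSP_0002", 88.0),
    ("CSP_0003", 300.0),
    ("CSP_9999", 500.0),
]
profile = profile_hits(example_hits, csp_database)
print(dict(profile))
```

Storing the rank next to the taxon would allow the same table to serve profiling at any level from genus to phylum.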


References

Abraham, W.-R., Macedo, A.J., Lünsdorf, H., Fischer, R., Pawelczyk, S., Smit, J., and Vancanneyt, M. (2008). Phylogeny by a polyphasic approach of the order Caulobacterales, proposal of Caulobacter mirabilis sp. nov., Phenylobacterium haematophilum sp. nov. and Phenylobacterium conjunctum sp. nov., and emendation of the genus Phenylobacterium. Int. J. Syst. Evol. Microbiol. 58, 1939–1949.

Albertsen, M., Hugenholtz, P., Skarshewski, A., Nielsen, K.L., Tyson, G.W., and Nielsen, P.H. (2013). Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat. Biotechnol. 31, 533–538.

Allgaier, M., Reddy, A., Park, J.I., Ivanova, N., D’haeseleer, P., Lowry, S., Sapra, R., Hazen, T.C., Simmons, B.A., VanderGheynst, J.S., et al. (2010). Targeted discovery of glycoside hydrolases from a switchgrass-adapted compost community. PLoS One 5, e8812.

Alsmark, C.M., Frank, A.C., Karlberg, E.O., Legault, B.A., Ardell, D.H., Canback, B., Eriksson, A.S., Naslund, A.K., Handley, S.A., Huvet, M., et al. (2004). The louse-borne human pathogen Bartonella quintana is a genomic derivative of the zoonotic agent Bartonella henselae. Proc. Natl. Acad. Sci. U. S. A. 101, 9716–9721.

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403–410.

Andersson, S.G., and Kempf, V.A. (2004). Host cell modulation by human, animal and plant pathogens. Int. J. Med. Microbiol. 293, 463–470.

Arisue, N., Hasegawa, M., and Hashimoto, T. (2005). Root of the Eukaryota tree as inferred from combined maximum likelihood analyses of multiple molecular sequence data. Mol. Biol. Evol. 22, 409–420.

Arraga-Alvarado, C., Palmar, M., Parra, O., and Salas, P. (2003). Ehrlichia platys (Anaplasma platys) in dogs from Maracaibo, Venezuela: an ultrastructural study of experimental and natural infections. Vet. Pathol. 40, 149–156.

Bäckhed, F., Fraser, C.M., Ringel, Y., Sanders, M.E., Sartor, R.B., Sherman, P.M., Versalovic, J., Young, V., and Finlay, B.B. (2012). Defining a healthy human gut microbiome: current concepts, future directions, and clinical applications. Cell Host Microbe 12, 611–622.


Beiko, R.G., and Ragan, M.A. (2008). Detecting lateral genetic transfer: a phylogenetic approach. Methods Mol. Biol. 452, 457–469.

Bhandari, V., Naushad, H.S., and Gupta, R.S. (2012). Protein based molecular markers provide reliable means to understand prokaryotic phylogeny and support Darwinian mode of evolution. Front. Cell. Infect. Microbiol. 2, 98.

Binnewies, T.T., Motro, Y., Hallin, P.F., Lund, O., Dunn, D., La, T., Hampson, D.J., Bellgard, M., Wassenaar, T.M., and Ussery, D.W. (2006). Ten years of bacterial genome sequencing: comparative-genomics-based discoveries. Funct. Integr. Genomics 6, 165–185.

Boersma, F.G.H., Warmink, J.A., Andreote, F.A., and van Elsas, J.D. (2009). Selection of Sphingomonadaceae at the base of Laccaria proxima and Russula exalbicans fruiting bodies. Appl. Environ. Microbiol. 75, 1979–1989.

Bowman, D.D. (2011). Introduction to the alpha-proteobacteria: Wolbachia and Bartonella, Rickettsia, Brucella, Ehrlichia, and Anaplasma. Top. Companion Anim. Med. 26, 173–177.

Brady, A., and Salzberg, S.L. (2009). Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat. Methods 6, 673–676.

Brazelton, W.J., and Baross, J.A. (2009). Abundant transposases encoded by the metagenome of a hydrothermal chimney biofilm. ISME J. 3, 1420–1424.

Breitschwerdt, E.B., and Kordick, D.L. (2000). Bartonella Infection in Animals: Carriership, Reservoir Potential, Pathogenicity, and Zoonotic Potential for Human Infection. Clin. Microbiol. Rev. 13, 428–438.

Brennerova, M.V., Josefiova, J., Brenner, V., Pieper, D.H., and Junca, H. (2009). Metagenomics reveals diversity and abundance of meta-cleavage pathways in microbial communities from soil highly contaminated with jet fuel under air-sparging bioremediation. Environ. Microbiol. 11, 2216–2227.

Campagne, S., Damberger, F.F., Kaczmarczyk, A., Francez-Charlot, A., Allain, F.H.-T., and Vorholt, J.A. (2012). Structural basis for sigma factor mimicry in the general stress response of Alphaproteobacteria. Proc. Natl. Acad. Sci. U. S. A. 109, E1405–14.

Carvalho, F.M., Souza, R.C., Barcellos, F.G., Hungria, M., and Vasconcelos, A.T.R. (2010). Genomic and evolutionary comparisons of diazotrophic and pathogenic bacteria of the order Rhizobiales. BMC Microbiol. 10, 37.


Le Chatelier, E., Nielsen, T., Qin, J., Prifti, E., Hildebrand, F., Falony, G., Almeida, M., Arumugam, M., Batto, J.-M., Kennedy, S., et al. (2013). Richness of human gut microbiome correlates with metabolic markers. Nature 500, 541–546.

Chistoserdova, L. (2013). Is metagenomics resolving identification of functions in microbial communities? Microb. Biotechnol.

Choudhary, M., and Kaplan, S. (2000). DNA sequence analysis of the photosynthesis region of Rhodobacter sphaeroides 2.4.1. Nucleic Acids Res. 28, 862–867.

Coletta, A., Pinney, J.W., Solís, D.Y.W., Marsh, J., Pettifer, S.R., and Attwood, T.K. (2010). Low-complexity regions within protein sequences have position-dependent roles. BMC Syst. Biol. 4, 43.

Dang, H., Li, T., Chen, M., and Huang, G. (2008). Cross-ocean distribution of Rhodobacterales bacteria as primary surface colonizers in temperate coastal marine waters. Appl. Environ. Microbiol. 74, 52–60.

Davey, M.E., and O’Toole, G.A. (2000). Microbial biofilms: from ecology to molecular genetics. Microbiol. Mol. Biol. Rev. 64, 847–867.

DeLong, E.F., Preston, C.M., Mincer, T., Rich, V., Hallam, S.J., Frigaard, N.-U.U., Martinez, A., Sullivan, M.B., Edwards, R., Brito, B.R., et al. (2006). Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311, 496–503.

Doolittle, W.F., and Bapteste, E. (2007). Pattern pluralism and the Tree of Life hypothesis. Proc. Natl. Acad. Sci. U. S. A. 104, 2043–2049.

Dröge, J., and McHardy, A.C. (2012). Taxonomic binning of metagenome samples generated by next-generation sequencing technologies. Brief. Bioinform. 13, 646–655.

Dumler, J.S., Barbet, A.F., Bekker, C.P., Dasch, G.A., Palmer, G.H., Ray, S.C., Rikihisa, Y., and Rurangirwa, F.R. (2001). Reorganization of genera in the families Rickettsiaceae and Anaplasmataceae in the order Rickettsiales: unification of some species of Ehrlichia with Anaplasma, Cowdria with Ehrlichia and Ehrlichia with Neorickettsia, descriptions of six new species combi. Int. J. Syst. Evol. Microbiol. 51, 2145–2165.

English, C.K. (1988). Cat-Scratch Disease. JAMA 259, 1347.

Ferrari, B.C., Binnerup, S.J., and Gillings, M. (2005). Microcolony cultivation on a soil substrate membrane system selects for previously uncultured soil bacteria. Appl. Environ. Microbiol. 71, 8714–8720.


Fischer, H.M. (1996). Environmental regulation of rhizobial symbiotic nitrogen fixation genes. Trends Microbiol. 4, 317–320.

Fredricks, D.N. (2006). Introduction to the Rickettsiales and other intracellular prokaryotes. In The Prokaryotes: A Handbook on the Biology of Bacteria, M. Dworkin, S. Falkow, E. Rosenberg, K.H. Schleifer, and E. Stackebrandt, eds. (New York: Springer), pp. 457–466.

Gao, B., and Gupta, R.S. (2012). Microbial systematics in the post-genomics era. Antonie Van Leeuwenhoek 101, 45–54.

Gao, B., Parmanathan, R., and Gupta, R.S. (2006). Signature proteins that are distinctive characteristics of Actinobacteria and their subgroups. Antonie Van Leeuwenhoek 90, 69–91.

Ghai, R., Mizuno, C.M., Picazo, A., Camacho, A., and Rodriguez-Valera, F. (2013). Metagenomics uncovers a new group of low GC and ultra-small marine Actinobacteria. Sci. Rep. 3, 2471.

Ghazanfar, S., Azim, A., Ghazanfar, M.A.M.A., Iqbal, M., and Anjum, I.B. (2010). Metagenomics and its application in soil microbial community studies: biotechnological prospects. J. Anim. … 6, 611–622.

Gilbert, J.A., and Dupont, C.L. (2011). Microbial Metagenomics: Beyond the Genome. Ann. Rev. Mar. Sci. 3, 347–371.

Gomez-Alvarez, V., Revetta, R.P., and Santo Domingo, J.W. (2012). Metagenome analyses of corroded concrete wastewater pipe biofilms reveal a complex microbial system. BMC Microbiol. 12, 122.

Gray, M.W. (2012). Mitochondrial evolution. Cold Spring Harb. Perspect. Biol. 4, a011403.

Gullo, M., and Giudici, P. (2008). Acetic acid bacteria in traditional balsamic vinegar: phenotypic traits relevant for starter cultures selection. Int. J. Food Microbiol. 125, 46–53.

Gupta, R.S. (2000). The phylogeny of proteobacteria: relationships to other eubacterial phyla and eukaryotes. FEMS Microbiol. Rev. 24, 367–402.

Gupta, R.S. (2005a). Critical issues in prokaryotic phylogeny and taxonomy. ASM News 71, 393–394.


Gupta, R.S. (2005b). Protein signatures distinctive of alpha proteobacteria and its subgroups and a model for alpha-proteobacterial evolution. Crit. Rev. Microbiol. 31, 101–135.

Gupta, R.S., and Griffiths, E. (2002). Critical issues in bacterial phylogeny. Theor. Popul. Biol. 61, 423–434.

Gupta, R.S., and Lorenzini, E. (2007). Phylogeny and molecular signatures (conserved proteins and indels) that are specific for the Bacteroidetes and Chlorobi species. BMC Evol. Biol. 7, 71.

Gupta, R.S., and Mok, A. (2007a). Phylogenomics and signature proteins for the alpha proteobacteria and its main groups. BMC Microbiol. 7, 106.

Gupta, R.S., and Mok, A. (2007b). Phylogenomics and signature proteins for the alpha Proteobacteria and its main groups. BMC Microbiol. 7, 106.

Hallez, R., Bellefontaine, A.-F., Letesson, J.-J., and De Bolle, X. (2004). Morphological and functional asymmetry in alpha-proteobacteria. Trends Microbiol. 12, 361–365.

Handelsman, J. (2004). Metagenomics: application of genomics to uncultured microorganisms. Microbiol. Mol. Biol. Rev. 68, 669–685.

Harris, J.K., Caporaso, J.G., and Walker, J.J. (2012). Phylogenetic stratigraphy in the Guerrero Negro hypersaline microbial mat. ISME … 1–11.

Hess, M., Sczyrba, A., Egan, R., Kim, T.-W., Chokhawala, H., Schroth, G., Luo, S., Clark, D.S., Chen, F., Zhang, T., et al. (2011). Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 331, 463–467.

Holley, H.P. (1991). Successful Treatment of Cat-scratch Disease With Ciprofloxacin. JAMA J. Am. Med. Assoc. 265, 1563.

Huang, W.E., Zhou, J., Scholz, M.B., Lo, C.-C., and Chain, P.S. (2012). Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Curr. Opin. Biotechnol. 23, 9–15.

Hugenholtz, P., Goebel, B.M., and Pace, N.R. (1998). Impact of Culture-Independent Studies on the Emerging Phylogenetic View of Bacterial Diversity. J. Bacteriol. 180, 4765–4774.

Huson, D.H., and Xie, C. (2013). A poor man’s BLASTX--high-throughput metagenomic protein database search using PAUDA. Bioinformatics.


Kainth, P., and Gupta, R.S. (2005). Signature proteins that are distinctive of alpha proteobacteria. BMC Genomics 6, 94.

Kalyuzhnaya, M.G., Lapidus, A., Ivanova, N., Copeland, A.C., McHardy, A.C., Szeto, E., Salamov, A., Grigoriev, I.V., Suciu, D., Levine, S.R., et al. (2008). High-resolution metagenomics targets specific functional types in complex microbial communities. Nat. Biotechnol. 26, 1029–1034.

Kang, I., Oh, H.-M., Vergin, K.L., Giovannoni, S.J., and Cho, J.-C. (2010). Genome sequence of the marine alphaproteobacterium HTCC2150, assigned to the Roseobacter clade. J. Bacteriol. 192, 6315–6316.

Kapley, A., De Baere, T., and Purohit, H.J. (2007). Eubacterial diversity of activated biomass from a common effluent treatment plant. Res. Microbiol. 158, 494–500.

Kembel, S.W., Eisen, J.A., Pollard, K.S., and Green, J.L. (2011). The Phylogenetic Diversity of Metagenomes. PLoS One 6, 9.

Kersters, K., Devos, P., Gillis, M., Swings, J., Vandamme, P., and Stackebrandt, E. (2006). Introduction to the Proteobacteria. In The Prokaryotes: A Handbook on the Biology of Bacteria, M. Dworkin, S. Falkow, E. Rosenberg, K.H. Schleifer, and E. Stackebrandt, eds. (New York: Springer), pp. 3–37.

Kinross, J.M., Darzi, A.W., and Nicholson, J.K. (2011). Gut microbiome-host interactions in health and disease. Genome Med. 3, 14.

Kisand, V., Valente, A., Lahm, A., Tanet, G., and Lettieri, T. (2012). Phylogenetic and functional metagenomic profiling for assessing microbial biodiversity in environmental monitoring. PLoS One 7, e43630.

Kunisawa, T. (2007). Gene arrangements characteristic of the phylum Actinobacteria. Antonie Van Leeuwenhoek 92, 359–365.

Kuramitsu, H.K., He, X., Lux, R., Anderson, M.H., and Shi, W. (2007). Interspecies interactions within oral microbial communities. Microbiol. Mol. Biol. Rev. MMBR 71, 653–670.

Leimena, M.M., Ramiro-Garcia, J., Davids, M., van den Bogert, B., Smidt, H., Smid, E.J., Boekhorst, J., Zoetendal, E.G., Schaap, P.J., and Kleerebezem, M. (2013). A comprehensive metatranscriptome analysis pipeline and its validation using human microbiota datasets. BMC Genomics 14, 530.

Van der Lelie, D., Taghavi, S., McCorkle, S.M., Li, L.-L.L., Malfatti, S.A., Monteleone, D., Donohoe, B.S., Ding, S.-Y.Y., Adney, W.S., Himmel, M.E., et al. (2012). The metagenome of an anaerobic microbial community decomposing poplar wood chips. PLoS One 7, e36740.

Lepage, P., Leclerc, M.C., Joossens, M., Mondot, S., Blottière, H.M., Raes, J., Ehrlich, D., and Doré, J. (2013). A metagenomic insight into our gut’s microbiome. Gut 62, 146–158.

Leung, H.C.M., Yiu, S.M., Yang, B., Peng, Y., Wang, Y., Liu, Z., Chen, J., Qin, J., Li, R., and Chin, F.Y.L. (2011). A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics 27, 1489–1495.

Li, W., Fu, L., Niu, B., Wu, S., and Wooley, J. (2012). Ultrafast clustering algorithms for metagenomic sequence analysis. Brief. Bioinform. 13, 656–668.

Lindner, M.S., Kollock, M., Zickmann, F., and Renard, B.Y. (2013). Analyzing genome coverage profiles with applications to quality control in metagenomics. Bioinformatics 29, 1260–1267.

Lu, H.-P., Wang, Y., Huang, S.-W., Lin, C.-Y., Wu, M., Hsieh, C., and Yu, H.-T. (2012). Metagenomic analysis reveals a functional signature for biomass degradation by cecal microbiota in the leaf-eating flying squirrel (Petaurista alborufus lena). BMC Genomics 13, 466.

Ludwig, W., Strunk, O., Klugbauer, S., Klugbauer, N., Weizenegger, M., Neumaier, J., Bachleitner, M., and Schleifer, K.H. (1998). Bacterial phylogeny based on comparative sequence analysis. Electrophoresis 19, 554–568.

Lussier, F.-X., Chambenoit, O., Côté, A., Hupé, J.-F., Denis, F., Juteau, P., Beaudet, R., and Shareck, F. (2011). Construction and functional screening of a metagenomic library using a T7 RNA polymerase-based expression cosmid vector. J. Ind. Microbiol. Biotechnol. 38, 1321–1328.

Mackelprang, R., Waldrop, M.P., DeAngelis, K.M., David, M.M., Chavarria, K.L., Blazewicz, S.J., Rubin, E.M., and Jansson, J.K. (2011). Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw. Nature 480, 368–371.

Madigan, M.T., Martinko, J.M., Dunlap, P.V., and Clark, D.P. (2008). Brock Biology of Microorganisms (12th Edition) (Benjamin Cummings).

Markowitz, V.M., Chen, I.-M.A., Chu, K., Szeto, E., Palaniappan, K., Grechkin, Y., Ratner, A., Jacob, B., Pati, A., Huntemann, M., et al. (2012). IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Res. 40, D123–D129.

Matsuda, H., Nishi, N., Tsuji, K., Tanaka, K., Kakuno, T., Yamashita, J., and Horio, T. (1984). Reconstruction of photosynthetic, cyclic electron transport system from photoreaction unit, ubiquinone-10 protein, cytochrome c2 and polar lipids purified from Rhodospirillum rubrum. J. Biochem. 95, 431–442.

Meyer, F., Paarmann, D., D’Souza, M., Olson, R., Glass, E.M., Kubal, M., Paczian, T., Rodriguez, A., Stevens, R., Wilke, A., et al. (2008). The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9, 386.

Mielczarek, A.T., Saunders, A.M., Larsen, P., Albertsen, M., Stevenson, M., Nielsen, J.L., and Nielsen, P.H. (2013). The Microbial Database for Danish wastewater treatment plants with nutrient removal (MiDas-DK) - a tool for understanding activated sludge population dynamics and community stability. Water Sci. Technol. 67, 2519–2526.

Mitra, S., Rupek, P., Richter, D.C., Urich, T., Gilbert, J.A., Meyer, F., Wilke, A., and Huson, D.H. (2011). Functional analysis of metagenomes and metatranscriptomes using SEED and KEGG. BMC Bioinformatics 12 Suppl 1, S21.

Mohammed, M.H., Ghosh, T.S., Singh, N.K., and Mande, S.S. (2011). SPHINX--an algorithm for taxonomic binning of metagenomic sequences. Bioinformatics 27, 22–30.

Moine, H., Squires, C.L., Ehresmann, B., and Ehresmann, C. (2000). In vivo selection of functional ribosomes with variations in the rRNA-binding site of Escherichia coli S8: evolutionary implications. Proc. Natl. Acad. Sci. U.S.A. 97, 605–610.

Moloney, R.D., Desbonnet, L., Clarke, G., Dinan, T.G., and Cryan, J.F. (2013). The microbiome: stress, health and disease. Mamm. Genome.

Morgan, J.L., Darling, A.E., and Eisen, J.A. (2010). Metagenomic sequencing of an in vitro-simulated microbial community. PLoS One 5, e10209.

National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications (2007). The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet (The National Academies Press).

Nguimbi, E., Li, Y.Z., Gao, B.L., Li, Z.F., Wang, B., Wu, Z.H., Yan, B.X., Qu, Y.B., and Gao, P.J. (2003). 16S-23S ribosomal DNA intergenic spacer regions in cellulolytic myxobacteria and differentiation of closely related strains. Syst. Appl. Microbiol. 26, 262–268.

Nielsen, P.H., Saunders, A.M., Hansen, A.A., Larsen, P., and Nielsen, J.L. (2012). Microbial communities involved in enhanced biological phosphorus removal from wastewater--a model system in environmental biotechnology. Curr. Opin. Biotechnol. 23, 452–459.

Nijkamp, J.F., Pop, M., Reinders, M.J.T., and de Ridder, D. (2013). Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold. Bioinformatics 29, 2826–2834.

Oh, J.I., and Kaplan, S. (2001). Generalized approach to the regulation and integration of gene expression. Mol. Microbiol. 39, 1116–1123.

Olson, J.B., Harmody, D.K., and McCarthy, P.J. (2002). Alpha-proteobacteria cultivated from marine sponges display branching rod morphology. FEMS Microbiol. Lett. 211, 169–173.

Poindexter, J.S., and Staley, J.T. (1996). Caulobacter and Asticcacaulis stalk bands as indicators of stalk age. J. Bacteriol. 178, 3939–3948.

Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K.S., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T., et al. (2010). A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65.

Raoult, D., Fournier, P.-E., Vandenesch, F., Mainardi, J.-L., Eykyn, S.J., Nash, J., James, E., Benoit-Lemercier, C., and Marrie, T.J. (2003). Outcome and Treatment of Bartonella Endocarditis. Arch. Intern. Med. 163, 226.

Rascovan, N., Carbonetto, B., Revale, S., Reinert, M.D., Alvarez, R., Godeas, A.M., Colombo, R., Aguilar, M., Novas, M., Iannone, L., et al. (2013). The PAMPA datasets: a metagenomic survey of microbial communities in Argentinean pampean soils. Microbiome 1, 21.

Rathsack, K., Reitner, J., Stackebrandt, E., and Tindall, B.J. (2011). Reclassification of Aurantimonas altamirensis (Jurado et al. 2006), Aurantimonas ureilytica (Weon et al. 2007) and Aurantimonas frigidaquae (Kim et al. 2008) as members of a new genus, Aureimonas gen. nov., as Aureimonas altamirensis gen. nov., comb. nov. Int. J. Syst. Evol. Microbiol. 61, 2722–2728.

Ravi P More, S.M. (2013). Mining and assessment of catabolic pathways in the metagenome of a common effluent treatment plant to induce the degradative capacity of biomass. Bioresour. Technol.

Riemann, L., Leitet, C., Pommier, T., Simu, K., Holmfeldt, K., Larsson, U., and Hagström, A. (2008). The native bacterioplankton community in the central Baltic Sea is influenced by freshwater bacterial species. Appl. Environ. Microbiol. 74, 503–515.

Roller, M., Lucić, V., Nagy, I., Perica, T., and Vlahovicek, K. (2013). Environmental shaping of codon usage and functional adaptation across microbial communities. Nucleic Acids Res. 41, 8842–8852.

Rosen, G.L., Sokhansanj, B.A., Polikar, R., Bruns, M.A., Russell, J., Garbarine, E., Essinger, S., and Yok, N. (2009). Signal Processing for Metagenomics: Extracting Information from the Soup. Curr. Genomics 10, 493–510.

Rout, M.E., and Callaway, R.M. (2012). Interactions between exotic invasive plants and soil microbes in the rhizosphere suggest that “everything is not everywhere”. Ann. Bot. 110, 213–222.

Ruby, J.G., Bellare, P., and Derisi, J.L. (2013). PRICE: software for the targeted assembly of components of (Meta) genomic sequence data. G3 (Bethesda). 3, 865–880.

Sahni, S.K., and Rydkina, E. (2009). Host-cell interactions with pathogenic Rickettsia species. Future Microbiol. 4, 323–339.

Schloss, P.D., and Handelsman, J. (2005). Metagenomics for studying unculturable microorganisms: cutting the Gordian knot. Genome Biol. 6, 229.

Scully, E.D., Geib, S.M., Hoover, K., Tien, M., Tringe, S.G., Barry, K.W., Glavina del Rio, T., Chovatia, M., Herr, J.R., and Carlson, J.E. (2013). Metagenomic profiling reveals lignocellulose degrading system in a microbial community associated with a wood-feeding beetle. PLoS One 8, e73827.

Segata, N., Waldron, L., Ballarini, A., Narasimhan, V., Jousson, O., and Huttenhower, C. (2012). Metagenomic microbial community profiling using unique clade-specific marker genes. Nat. Methods 9, 811–814.

Sharon, I., Birkland, A., Chang, K., El-Yaniv, R., and Yona, G. (2005). Correcting BLAST e-Values for Low-Complexity Segments. J. Comput. Biol. 12, 980–1003.

Siepel, A., and Haussler, D. (2004). Combining phylogenetic and hidden Markov models in biosequence analysis. J. Comput. Biol. 11, 413–428.

Solonenko, S.A., Ignacio-Espinoza, J.C., Alberti, A., Cruaud, C., Hallam, S., Konstantinidis, K., Tyson, G., Wincker, P., and Sullivan, M.B. (2013). Sequencing platform and library preparation choices impact viral metagenomes. BMC Genomics 14, 320.

Sommer, M.O.A., Church, G.M., and Dantas, G. (2010). A functional metagenomic approach for expanding the toolbox for biomass conversion. Mol. Syst. Biol. 6, 360.

Sowell, S.M., Norbeck, A.D., Lipton, M.S., Nicora, C.D., Callister, S.J., Smith, R.D., Barofsky, D.F., and Giovannoni, S.J. (2008). Proteomic analysis of stationary phase in the marine bacterium “Candidatus Pelagibacter ubique”. Appl. Environ. Microbiol. 74, 4091–4100.

Steenhoudt, O., and Vanderleyden, J. (2000). Azospirillum, a free-living nitrogen-fixing bacterium closely associated with grasses: genetic, biochemical and ecological aspects. FEMS Microbiol. Rev. 24, 487–506.

Strous, M., Kraft, B., Bisdorf, R., and Tegetmeyer, H.E. (2012). The binning of metagenomic contigs for microbial physiology of mixed cultures. Front. Microbiol. 3, 410.

Takacs-Vesbach, C., Inskeep, W.P., Jay, Z.J., Herrgard, M.J., Rusch, D.B., Tringe, S.G., Kozubal, M.A., Hamamura, N., Macur, R.E., Fouke, B.W., et al. (2013). Metagenome sequence analysis of filamentous microbial communities obtained from geochemically distinct geothermal channels reveals specialization of three Aquificales lineages. Front. Microbiol. 4, 84.

Teeling, H., Waldmann, J., Lombardot, T., Bauer, M., and Glöckner, F.O. (2004). TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 5, 163.

Thomas, T., Gilbert, J., and Meyer, F. (2012). Metagenomics - a guide from sampling to data analysis. Microb. Inform. Exp. 2, 3.

Travers, S.A.A., Clewley, J.P., Glynn, J.R., Fine, P.E.M., Crampin, A.C., Sibande, F., Mulawa, D., McInerney, J.O., and McCormack, G.P. (2004). Timing and reconstruction of the most recent common ancestor of the subtype C clade of human immunodeficiency virus type 1. J. Virol. 78, 10501–10506.

Tringe, S.G., von Mering, C., Kobayashi, A., Salamov, A.A., Chen, K., Chang, H.W., Podar, M., Short, J.M., Mathur, E.J., Detter, J.C., et al. (2005). Comparative metagenomics of microbial communities. Science 308, 554–557.

Ursell, L.K., Metcalf, J.L., Parfrey, L.W., and Knight, R. (2012). Defining the human microbiome. Nutr. Rev. 70 Suppl 1, S38–44.

Vogel, T.M., Simonet, P., Jansson, J.K., Hirsch, P.R., Tiedje, J.M., van Elsas, J.D., Bailey, M.J., Nalin, R., and Philippot, L. (2009). TerraGenome: a consortium for the sequencing of a soil metagenome. Nat. Rev. Microbiol. 7, 252.

Walker, D.H., Valbuena, G.A., and Olano, J.P. (2003). Pathogenic mechanisms of diseases caused by Rickettsia. Ann. N. Y. Acad. Sci. 990, 1–11.

Williams, D., Fournier, G.P., Lapierre, P., Swithers, K.S., Green, A.G., Andam, C.P., and Gogarten, J.P. (2011). A rooted net of life. Biol. Direct 6, 45.

Williams, K.P., Sobral, B.W., and Dickerman, A.W. (2007). A Robust Species Tree for the Alphaproteobacteria. J. Bacteriol. 189, 4578–4586.

Wommack, K.E., Bhavsar, J., and Ravel, J. (2008). Metagenomics: Read Length Matters. Appl. Environ. Microbiol. 74, 1453–1463.

Wooley, J.C., Godzik, A., and Friedberg, I. (2010). A primer on metagenomics. PLoS Comput. Biol. 6, e1000667.

Wrighton, K.C., Thomas, B.C., Sharon, I., Miller, C.S., Castelle, C.J., VerBerkmoes, N.C., Wilkins, M.J., Hettich, R.L., Lipton, M.S., Williams, K.H., et al. (2012). Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science 337, 1661–1665.

Wu, Y.-W., and Ye, Y. (2011). A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. J. Comput. Biol. 18, 523–534.

Xia, L.C., Cram, J.A., Chen, T., Fuhrman, J.A., and Sun, F. (2011). Accurate genome relative abundance estimation based on shotgun metagenomic reads. PLoS One 6, e27992.

Yabuuchi, E., and Kosako, Y. (2005). Order IV. Sphingomonadales ord. nov. In Bergey’s Manual of Systematic Bacteriology, D.J. Brenner, N.R. Krieg, and J.T. Staley, eds. (New York: Springer), pp. 230–258.

Yergeau, E., Sanschagrin, S., Beaumier, D., and Greer, C.W. (2012). Metagenomic analysis of the bioremediation of diesel-contaminated Canadian high arctic soils. PLoS One 7, e30058.

Yildiz, F.H., Gest, H., and Bauer, C.E. (1991). Attenuated effect of oxygen on photopigment synthesis in Rhodospirillum centenum. J. Bacteriol. 173, 5502–5506.

Yurkov, V.V., and Beatty, J.T. (1998). Aerobic anoxygenic phototrophic bacteria. Microbiol. Mol. Biol. Rev. 62, 695–724.

Zhang, W., Wang, Y., Lee, O.O., Tian, R., Cao, H., Gao, Z., Li, Y., Yu, L., Xu, Y., and Qian, P.-Y. (2013). Adaptation of intertidal biofilm communities is driven by metal ion and oxidative stresses. Sci. Rep. 3, 3180.

Zomorodipour, A., and Andersson, S.G. (1999). Obligate intracellular parasites: Rickettsia prowazekii and Chlamydia trachomatis. FEBS Lett. 452, 11–15.