Identifying the Activities of Rhizosphere Microbial Communities Using Metatranscriptomics

IDENTIFYING THE ACTIVITIES OF RHIZOSPHERE MICROBIAL COMMUNITIES USING METATRANSCRIPTOMICS

By

Aaron Garoutte

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Microbiology and Molecular Genetics – Doctor of Philosophy

2016

ABSTRACT

IDENTIFYING THE ACTIVITIES OF RHIZOSPHERE MICROBIAL COMMUNITIES USING METATRANSCRIPTOMICS

By

Aaron Garoutte

Soil microbial communities carry out many functions, most of which are beneficial to the planet as well as to humans. Soil microbial communities control the biogeochemical cycling rates of key elements such as carbon, nitrogen, sulfur and phosphorus and can also aid in plant growth and disease defense. Microbial ecologists have studied the functional activities of microbial communities for decades, often using laboratory incubations.

Metagenomics has allowed the identification of the microbes and the potentially functional genes in an environmental sample, but does not allow an assessment of activity. Direct observation of microbial community activity in the field is the desired strategy to build the foundational knowledge required to assess, predict and potentially manage soil microbial community activity.

In this dissertation I combine the use of metagenomics with metatranscriptomics to identify functional activity of the microbial community in the soil and rhizosphere of candidate biofuel crops. First, I assessed the efficiency of a novel method of rRNA removal, called duplex-specific nuclease normalization, at removing the dominating rRNA from samples of total RNA to allow greater sequence coverage of mRNA. While this method did result in about 17% non-rRNA, it did not provide a major gain in sequencing depth. I also established best practices for computational metatranscriptomic analysis, especially the importance of assembling short reads into longer contigs to improve annotation accuracy.

Second, I examined the activity of the rhizosphere microbial community of switchgrass, a candidate biofuel crop, using a combination of metagenomics, metatranscriptomics and metaproteomics. I defined a minimum core of microbial community functions of both metagenomic and metatranscriptomic sequences to focus the analysis on the most common sequences that were expressed. Beyond the expected housekeeping functions, the expressed ecologically important functions related to biogeochemical cycling were glycoside hydrolases, ligninolytic enzymes, ammonia assimilation and phosphate metabolism, and those related to plant-microbe interactions were production of auxin, trehalose and ACC deaminase. Ecologically important genes had lower abundance than housekeeping functions, indicating that ecologically important genes may represent keystone functions.

I also examined the effect of two plants, switchgrass and corn, on the presence and activity of microbial community functions at various distances from living roots using metagenomics and metatranscriptomics. The metagenomic data were able to differentiate between microbial communities associated with the two different crops and to differentiate communities in direct contact with the roots from those not in direct contact. The metatranscriptomic data were unable to differentiate between bulk and rhizosphere samples, indicating that other factors are stronger determinants of community transcription.

I show that direct observation of the activity of microbial communities associated with biofuel crops in field-collected samples is possible through metatranscriptomics and aided by metagenomics and metaproteomics. These data allow the detection of microbial activities related to biogeochemical cycling and plant-microbe interactions as well as reveal differences in genetic potential across different soil treatments.

This thesis is dedicated to my parents, Kristin Magnuson and Michael Garoutte, and to my wife Jennifer Garoutte

ACKNOWLEDGEMENTS

They say it takes a whole village to raise a child; I think the same could be said about finishing a PhD. I would like to thank my friends and family (Garoutte and Gable). You have all been a fantastic support! Your interest in my work has always encouraged me. I never tire of the question “So, what is it you do again?” I would like to thank my children, Charlie and Claire. Watching you grow, learn and explore has been an inspiration to me as a scientist and as a father. You give me immeasurable joy and remind me there is a world outside of the lab. I will always be proud of you. I love you both dearly. I would also like to thank my wife, Jenny. You have been my rock. Your unwavering confidence in me and your constant support have enabled me to achieve this goal. I love you more than words can convey. Mom, thank you for all your support over the years. Through all the ups and downs of life you have always been there for me. It is only now, as I have my own children, that I am starting to understand the sacrifices you and Dad made for me. Dad, you were taken from us too soon. You will always be my role model and I will always strive to be the great father you were to me. I know you’d be proud of me. You said it so many times that now it is the only thing I hear you say when I remember your voice.

I would like to thank a few specific teachers and professors who have helped me throughout my now 22 years of education. First, I’d like to thank Mrs. Christine Arenson. You taught this little dyslexic boy to read and helped me overcome my learning disability. Next, I would like to thank Dr. Richard Cass. You showed me that even though I was a slow reader and not the best writer, I could succeed in a college level English course. Finally, I would like to thank Drs. Greg and Kathy Murray. You both shaped me as an ecologist and gave me my first research experiences. Your classes were hard but the time I put in was never work. Thank you for investing in me.

I would like to thank Dion Antonopoulos for taking a chance and hiring a field ecologist with minimal knowledge of microbiology, and next to nothing in terms of lab skills, to work in your lab. You helped ignite my passion for microbial ecology and set me on the path of working with metatranscriptomics. You taught me everything I know about microbiology and molecular biology lab work. I would also like to thank Sarah O’Brien for sharing with me your passion for soil. The two of you set me on a path to explore soil microbial communities in my PhD work.

I would also like to thank the Tiedje lab, all current and past members. I was fortunate to have been able to interact with so many people with such diverse backgrounds. Thank you for your support, encouragement and constructive criticism. You helped to make me a better scientist and a more well-rounded person. Specifically, I’d like to thank Adina Howe. Adina, you were like my second mentor. You taught me to code, how to use the HPCC, and how to interview and train students. You helped improve my writing and you were always up for Indian buffet! Thank you for investing in me. Dr. Tiedje, thank you for taking me on as a student. You had no idea what you were getting into! You have helped me learn how to think about science and how to present my work both in publication and presentation form. You have always had the ability to push me outside of my scientific comfort zone, and your unique perspective has always helped bring my work up to the next level. Thank you.

Finally, I would like to thank my current committee members Sheng Yang He, Maren Friesen and Ashley Shade. Each of you has helped me grow as a scientist. I value the feedback you have given me and appreciate your time. I would also like to thank a former committee member, Titus Brown. Thank you for supporting me as this hybrid biologist bioinformatician. You have always made me feel like I had a place in your lab. You taught me the value of open science and helped me think critically about the code I write.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER 1: Bridging the gap between the lab and the field
    Introduction to soil microbial communities
    Soil microbial community functions
    The rhizosphere: a focal point of microbial community activity
    Disconnect between laboratory-based studies and field microbial communities
    “Omics” opens new avenues of research
    Evolution of sequencing technology
    Metagenomics
    Metatranscriptomics
    The dark side of high throughput sequencing
    Observation of microbial community functions to answer ecological questions
    Importance of direct observation of microbial community functional activity
    Questions addressed in this thesis
    REFERENCES

CHAPTER 2: Methodologies for probing the metatranscriptome of grassland soil
    Abstract
    Introduction
    Methods
        Metatranscriptome sample collection and library preparation
        MG-RAST databases used for annotation of unassembled raw reads
        Metatranscriptome assembly & annotation
        Previous metagenomes used for comparison to soil metatranscriptome
        Metagenome assembly & annotation
        Estimation of abundance of assembled contigs or reference sequences
        Curation of soil reference genome database, RefSoil
        Unassembled read mapping to metagenomes and RefSoil genomes
    Results
        Characterization of sequences in the unassembled soil metatranscriptome
        Characterization of sequences in the assembled soil metatranscriptome
        Comparison of the metatranscriptome datasets to metagenomes
    Discussion
    Conclusion
    Acknowledgements
    APPENDIX
    REFERENCES

CHAPTER 3: Using a multi-omics approach to identify active microbial functions in the rhizosphere of Switchgrass
    Abstract
    Introduction
    Methods
        Soil collection
        Sample preparation and sequencing
        Metaproteome sample preparation and characterization
        Indirect extraction
        Metatranscriptome and metagenome peptide analysis
        Metatranscriptome and metagenome search and peptide identification
        Metatranscriptomic and metagenomic sequencing data analysis
        Defining the minimum functional core and its representation of the field site
    Results
        Building minimum functional core
        Metaproteome characterization and core comparison
        Comparison of the minimum functional core to rhizoplane functional cores
        Characterization of the minimum functional core of switchgrass
        Abundant functions within the minimum functional core are related to housekeeping processes
        Functions of ecological importance within the minimum functional core
        Carbon cycling functions within the minimum functional core
    Discussion
    Conclusion
    APPENDIX
    REFERENCES

CHAPTER 4: Plant root effects on soil microbial community functions as viewed through metagenomics and metatranscriptomics
    Abstract
    Introduction
    Methods
        Site description and sample collection
        Sample preparation and sequencing
        Data analysis
    Results
        Metatranscriptome analysis
        Metagenome analysis
    Discussion
    Conclusion
    APPENDIX
    REFERENCES

CHAPTER 5: Conclusions and Future directions
    Conclusions
    Future directions

LIST OF TABLES

Table 2.1: Summary of sequence annotations of the unassembled and assembled soil metatranscriptome against various reference databases. Results of annotation by MG-RAST.

Table 2.2: Summary of transcript mapping. Transcripts mapped to reference assemblies (available in MG-RAST with IDs indicated) or genomes with proportion of reads identified as similar to rRNA genes and mapping uniquely to a specific reference assembly.

Table 2.3: Summary of assembled metagenomes in number of assembled contigs and total base pairs represented in the assembly. Results of short read assembly of metagenome samples.

Table 2.4: Manually curated soil-associated genomes comprising the RefSoil database.

Table 2.5: Most abundant functional annotations of the unassembled metatranscriptome against the SEED reference database. Annotations in this table appear as they do in the SEED reference database.

Table 2.6: Most abundant functional annotations of the unassembled metatranscriptome against the GenBank reference database. Annotations in this table appear as they do in the GenBank reference database.

Table 2.7: Most abundant functional annotations of the unassembled metatranscriptome against the RefSeq reference database. Annotations in this table appear as they do in the RefSeq reference database.

Table 2.8: Most abundant functional annotations of the unassembled metatranscriptome against the KEGG reference database. Annotations in this table appear as they do in the KEGG reference database.

Table 2.9: 50 RefSoil genomes with the greatest number of metatranscriptome reads mapping.

Table 2.10: RefSoil genomes with metatranscriptome reads mapping to the most unique genes.

Table 2.11: Most abundant functional annotations of the assembled metatranscriptome against the SEED reference database. Annotations in this table appear as they do in the SEED reference database.

Table 2.12: Most abundant functional annotations of the assembled metatranscriptome against the GenBank reference database. Annotations in this table appear as they do in the GenBank reference database.

Table 2.13: Most abundant functional annotations of the assembled metatranscriptome against the RefSeq reference database. Annotations in this table appear as they do in the RefSeq reference database.

Table 2.14: Most abundant functional annotations of the assembled metatranscriptome against the KEGG reference database. Annotations in this table appear as they do in the KEGG reference database.

Table 2.15: RFam abundance (based on base pair coverage) of the assembled metatranscriptome.

Table 2.16: Top 50 CAZy annotations by number of contigs. Most abundant CAZy annotations by the number of contigs without regard to the abundance or number of reads mapping to the contigs.

Table 2.17: Top 50 CAZy annotations by abundance. Abundance of CAZy annotation by the number of reads mapping to annotated contigs.

Table 3.1: Summary of rhizosphere sequencing, assembly and annotation. Metagenome sequencing samples did not have rRNA removed before sequencing, therefore in the “Non-rRNA reads” column these samples are marked as NA. Additionally, metagenome samples were not mapped to the metatranscriptome. Therefore these samples are also marked as NA in the “Reads mapping to the MetaG contigs” column.

Table 3.2: Summary of core annotations. The SEED Subsystems is a hierarchical database, which annotates gene functions, not specific genes. RefSeq database annotates specific genes from model organisms. The Carbohydrate Active Enzyme database (CAZy) specifically annotates enzymes related to synthesis, metabolism and transport of carbohydrates.

Table 3.3: Summary of minimum functional core annotations found in rhizoplane metagenomes. Minimum core represents the functions found in five out of seven samples. Percent core represents the percent of the crop specific core that is found within our established minimum functional core. Percent abundance captured represents the abundance of the crop specific samples found in the minimum functional core.

Table 3.4: Summary of rhizoplane metagenome reads, assembly and assembled read abundance.

Table 4.1: PERMANOVA analysis of metagenome and metatranscriptome samples. Permutational multivariate analysis of variance of samples. SRT is the switchgrass metatranscriptome, CBT is the corn bulk metatranscriptome, switchgrass rhizosphere metagenome, CBG is the corn bulk metagenome, C is for the corn rhizoplane metagenome, S is the switchgrass rhizoplane metagenome, SR is the switchgrass rhizosphere metagenome, CR is the corn rhizosphere metagenome, SRG is the switchgrass rhizosphere metagenome samples collected with the metatranscriptome samples, CBG is the corn bulk metagenome samples collected with the metatranscriptome samples, CNRP is the combination of the corn bulk and rhizosphere metagenome samples, and SNRP is the combination of the two treatments of switchgrass rhizosphere metagenome samples.

Table 4.2: Summary of metagenome and metatranscriptome reads, assembly and assembled read abundance.

Table 4.3: Metatranscriptome protein coding annotations.

LIST OF FIGURES

Figure 2.1: Metatranscriptome data analysis workflow. Various methods for metatranscriptome data analysis are shown. (a) Direct annotation of short reads. (b) Assembly of short reads into longer contigs and subsequent annotation. (c) Short read mapping to genomes compiled in the RefSoil database.

Figure 2.2: Phylogenetic distribution of sequence annotations identified in unassembled and assembled metatranscriptome and associated soil metagenomes. Phylogeny of rRNA from the unassembled metatranscriptome compared to the phylogeny of MG-RAST’s best-hit classification of protein-coding genes for the assembled metatranscriptome and the reference metagenomes.

Figure 2.3: Distribution of assembled metatranscriptome annotations. Proportion of assembled metatranscriptome sequences associated with known rRNA, gene function (SEED), or non-coding sequences (RFAM).

Figure 2.4: Comparison of functional profiles of metatranscriptomes and metagenomes. Annotations were identified in the assembled and unassembled metatranscriptome datasets as well as the three metagenome assemblies against the MG-RAST SEED database.

Figure 2.5: Comparison of the total number of gene annotations identified in the unassembled and assembled metatranscriptomes. Results were generated using the MG-RAST Metagenome Analysis page.

Figure 2.6: Comparison of annotation alignment lengths of the assembled and unassembled datasets. Amino acid alignment lengths of SEED subsystem annotations for the assembled and unassembled datasets. The minimum alignment length is set to the MG-RAST default of 15 amino acids.

Figure 2.7: Comparison of annotation E-values of the assembled and unassembled datasets. E-value of SEED Subsystem annotations of the assembled and unassembled datasets. The minimum e-value is set to the MG-RAST default of 1e-5.

Figure 3.1: Rank Abundance Curve of Multi-omics Subsystem Annotations. The metaproteome data set is smaller than metagenome and metatranscriptome data sets as indicated by the shorter lines in the MetaP-MetaG and MetaP-MetaT samples.

Figure 3.2: Diversity of core functions by subsystem. Number of annotations in the minimum functional core as annotated by the SEED Subsystems database.

Figure 3.3: Relative abundance of multi-omics data by SEED Subsystem annotations. Relative abundance here is averaged across each of the three individual samples.

Figure 3.4: Relative abundance of biogeochemical cycling functions in the minimum functional core. Allantoin utilization, Ammonia assimilation, Denitrification and Nitrate and Nitrite ammonification are subsystems within the Nitrogen Metabolism subsystem. Alkylphosphate utilization and Phosphate metabolism are subsystems within the Phosphorus metabolism subsystem.

Figure 3.5: Relative abundance of plant growth promoting functions in the minimum functional core. Auxin biosynthesis and Trehalose Biosynthesis are the second level in the SEED Subsystem hierarchy within the Secondary metabolites and Carbohydrates subsystems, respectively.

Figure 3.6: Relative abundance of CAZy annotations in the minimum functional core. CAZy enzyme classes are: Glycoside Hydrolases (GH), Glycosyl Transferases (GT), Carbohydrate Esterases (CE), Polysaccharide Lyases (PL) and Auxiliary Activities (AA).

Figure 4.1: Average relative abundance of metatranscriptome annotations. Shows annotations based on MG-RAST SEED Subsystem database. (a) Average relative abundance of corn and switchgrass metatranscriptome annotations. (b) Relative abundance of metatranscriptome annotations in the Carbohydrate subsystem level 2.

Figure 4.2: Nonmetric multidimensional scaling (NMDS) analysis of metagenome samples. Metagenome samples were log2 plus one transformed to increase normality.

Figure 4.3: Comparison of differentially abundant annotations in corn rhizoplane and non-rhizoplane samples. Number of differentially abundant annotations in corn rhizoplane and non-rhizoplane samples based on SEED Subsystems.

Figure 4.4: Comparison of differentially abundant annotations in switchgrass rhizoplane and non-rhizoplane samples. Number of differentially abundant annotations in switchgrass rhizoplane and non-rhizoplane samples based on SEED Subsystems.

CHAPTER 1:

Bridging the gap between the lab and the field

Introduction to soil microbial communities

Franklin D. Roosevelt once said, “The nation that destroys its soil destroys itself.” While he may have been referring only to soil as the matrix in which plants grow, we now know the many important benefits provided by the microbes that live within soil. Soil microbial communities are unique compared to other environmental microbial communities. One reason is the sheer size of the microbial community that soil contains [1]. In addition to hosting a large community, soil is one of the most diverse ecosystems on the planet [2]. Just one gram of soil is thought to contain as many as 52,000 distinct microbial species [3].

Soil microbial communities provide numerous beneficial ecosystem services such as bioremediation, controlling biogeochemical cycles, restoring water quality and aiding in plant growth and development. The extreme diversity and vital ecosystem services provided by soil microbial communities have drawn the interest of many microbial ecologists. Understanding and learning to enhance these natural processes can aid in our ability to detoxify the environment, mitigate global climate change, produce “green” energy and produce more food crops with less nutrient input.

Soil microbial community functions

Irresponsible waste disposal practices have led to contamination of soils with toxic chemicals. These toxic chemicals have the potential to leach into ground water and cause illness. One method to mitigate contaminated soils involves the use of microbial communities to detoxify contaminants. This process is called bioremediation [4]. Microbial populations and communities have been shown to be capable of detoxifying a wide range of compounds, including organic pollutants such as hexachlorocyclohexane [5], polycyclic aromatic hydrocarbons [6], polychlorinated biphenyls [7] and dioxins [8], to name some of the more prominent.

Evidence has shown that some genes responsible for bioremediation processes can be transferred via horizontal gene transfer, potentially improving the ability of the microbial community to detoxify contaminated soils [5]. Other evidence has shown that biostimulation via root exudates can enhance bioremediation processes [6]. In addition, microbial communities have been shown to reduce uranium(VI), a soluble form, to the insoluble uranium(IV), preventing the spread of radioactive uranium through ground water [9]. The ability to detoxify contaminated soils represents one of the many benefits provided by soil microbial communities that could be harnessed and enhanced to provide greater benefit to society.

Soil microbial communities play a key role in biogeochemical cycling processes. Biogeochemistry is an interdisciplinary science that draws on chemistry, geology and biology to study processes that regulate the chemical cycles of elements. Soil microbial communities are major contributors to the biogeochemical cycling of elements such as carbon [10], nitrogen [11, 12], sulfur [13] and phosphorus [14]. Much of the microbially mediated biogeochemical cycling takes place in the rhizosphere, thereby allowing plants to access important macro- and micronutrients needed for growth [15-17]. Plant root exudates have been shown to elicit the rhizosphere priming effect, in which root exudates stimulate the turnover of soil organic matter (SOM) and can increase microbially mediated rates of biogeochemical cycling processes [17-19]. Due to the importance of CO2 and CH4 as greenhouse gases, many biogeochemistry studies focus on carbon trapped in the soil as SOM and seek to understand how a warming global climate will influence the rates at which SOM is respired as CO2 and CH4 [20-22]. Microbially mediated biogeochemical processes play a critical role in influencing climate change and affect the ability of plants to access macro- and micronutrients needed for growth.

Finally, soil microbial communities living in the rhizoplane and rhizosphere provide many important benefits to the plants with which they interact. Rhizosphere microbial communities provide plants with macronutrients such as nitrogen and phosphorus, and micronutrients such as iron [16, 23, 24]. Nitrogen is the most limiting nutrient for plant growth [25]. Microbial communities can make nitrogen available to plants. This occurs via direct fixation of atmospheric nitrogen by both free-living (associative) nitrogen fixers [26] and by rhizobia in a symbiotic relationship with leguminous plants [27]. Microbes also can provide nitrogen to plants by transforming nitrogen-containing compounds into bioavailable compounds, i.e. mineralization [28]. Besides providing nitrogen to plants, microbial communities can promote plant growth via production of the plant growth hormone auxin and the enzyme ACC deaminase [29, 30]. Auxin can aid in root development [31] and suppress plant defense mechanisms, possibly allowing non-pathogenic microbes to colonize plant root systems [32]. ACC deaminase lowers the concentration of ethylene in roots; at high concentrations ethylene can reduce plant growth [29, 33]. While much of the current research focuses on single microbial species interactions with plants, recent work has shown that microbial populations can cooperatively enhance plant growth. In one study a secondary metabolite produced by Pseudomonas fluorescens F113 enhanced the expression of many Azospirillum brasilense Sp245-Rif genes known to have phytostimulatory effects [34]. Finally, microbial communities can protect plants from pathogenic microbes [35]. Many can produce antibiotics against pathogens [28, 36, 37]. Microbes can also protect plants from pathogens by activating the plant's own immune functions [38, 39]. Soil microbial communities provide many benefits to plants, including providing a source of growth limiting macro- and micronutrients, promoting plant growth through exogenous addition of plant hormones, and protecting plants from pathogens.

The rhizosphere: a focal point of microbial community activity

The rhizosphere is a hotspot of microbial activity, i.e. a small volume of soil with higher process rates and greater interaction among community members compared to average soil [40]. Many studies have shown bioremediation processes to be enhanced in the rhizosphere [41]. These studies include depletion of contaminants such as PCBs [42, 43], petrochemical residues [44], bensulfuron-methyl [45], arsenate, chromate [46], and heavy metals (Zn, Cd and Cu) by bioaccumulation, such as in white mustard, i.e. phytoextraction [47]. Many plant-microbe interactions are also tied to biogeochemical cycling processes related to the transformation of carbon and macro- and micronutrients into bioavailable forms for plant utilization [17-19, 48, 49]. It’s thought that plant secretions of carbon rich compounds, i.e. root exudates, into the rhizosphere provide the fuel for these enhanced microbial processes [50]. As previously stated, rhizosphere microbial communities provide many other benefits to plants, including plant growth promotion, activating plant defense mechanisms, increasing stress tolerance, and protecting plants from pathogens.

Disconnect between laboratory-based studies and field microbial communities

Many of the previously mentioned studies were carried out under laboratory, growth chamber or greenhouse conditions using single populations of microbes and/or a single plant species. However, naturally occurring soil microbes live within a diverse community of microbes and plants. It is unknown how interactions between community members will affect the functional activity of microbes that carry out anthropogenically important functions. Closing this gap in knowledge is central to our ability to predict and manipulate these microbial functions to meet modern day challenges such as global climate change and an increasing human population on earth. To bridge the gap between laboratory studies and microbial community activity in the environment, microbial ecologists must rely on direct observation of functions actively carried out by microbial community members in the environment. Advances in high-throughput sequencing technologies such as metatranscriptomics allow direct observation of microbial community gene expression, a proxy of functional activity, thereby enabling microbial ecologists to study the activity of microbial communities in the environment.

“Omics” opens new avenues of research

Evolution of Sequencing Technology

Recent advances in high throughput sequencing, also called next-generation sequencing, have led to a dramatic decrease in sequencing cost per nucleotide since the release of these technologies, beginning in 2005 with 454 Life Sciences [51] and soon followed by Illumina. These advances have opened avenues of research that were previously infeasible by allowing deeper and more cost-effective sampling. Specifically, advances in high throughput sequencing have allowed microbial ecologists to sample fragments of whole genomes from environmental microbial communities without the bias of culture-based methods [52]. Previously, Sanger sequencing required labor-intensive cloning of environmental DNA into a host cell prior to sequencing of that clone. Compared to current high throughput sequencing methodologies, the yield of Sanger sequencing is very low [53], although its accuracy is much higher.

Metagenomics

Metagenomics is an approach that allows DNA to be sequenced directly from the sampled community. Current technologies enable metagenomic samples to be sequenced resulting in billions of reads without the need for labor intensive cloning.

There are two primary approaches to metagenomics. The first is a “targeted” approach in which a particular gene of interest is PCR amplified from an environmental DNA sample and then sequenced. This is often done for phylogenetic studies of 16S rRNA genes. The second is a “shotgun” approach in which environmental DNA, fragmented into short pieces, is sequenced [54]. Shotgun metagenomics is frequently referred to simply as metagenomics. After sequencing, the DNA sequences are typically assigned a functional or taxonomic annotation based on similarity to previously sequenced organisms or genes [53]. Sequencing technologies will likely continue to improve, providing deeper sampling and longer reads to the point at which whole genomes of environmental organisms can be fully assembled, which cannot currently be done. Eventually, metagenomic data will consist of full genomes of organisms present in an environmental sample rather than the current collection of annotations of gene or genome fragments.

It is important to note that additional care must be taken in the interpretation of metagenomic data. DNA represents the genetic potential of an organism or community, not necessarily its activity. Metagenomic sequencing can include sequences from dead or dormant microbes, which may skew data interpretation. Estimates of dormancy in soil microbial communities are as high as 80% [55]. Transcription of a gene may also be tightly regulated so that it occurs only under specific circumstances, as is the case with many microbes that engage in symbiotic interactions and quorum sensing behaviors. Therefore, metagenomic functional annotations must be recognized as the functional potential of a microbial community and not its functional activity.

Metatranscriptomics

Advanced high throughput sequencing technologies also enable sequencing of environmental RNA, referred to as metatranscriptomics. Metatranscriptomics presents many unique methodological challenges compared to metagenomics, which has led to low adoption by many microbial community ecologists, especially in complex habitats like soil and sediments. Like metagenomics, metatranscriptomics can be targeted to a specific gene [56] or an untargeted shotgun approach can be used [57]. Often an intermediate approach is desired, as the goal of many metatranscriptomic studies is to sequence mRNA to sample actively transcribed genes. Messenger RNAs typically comprise only 4% of total RNA, while rRNA is thought to comprise over 90%, so rRNA must be removed to increase the sequencing yield of mRNA [58]. Removal of rRNA is challenging. Various methods have been developed to remove rRNA but have met with mixed success depending on the complexity of the environmental system [59-62]. Another challenge unique to metatranscriptomics is the unstable nature of the mRNA molecule and its short half-life, estimated to be between 2.4 and 5 minutes [63]. These factors make sample preservation and extraction of RNA more difficult than that of DNA for metagenomics. Before sequencing, a reverse transcription reaction must be carried out to convert the RNA to cDNA. All of these factors can result in lower sequence yield and lower sequence quality compared to (DNA-based) metagenomics.

Despite the challenges inherent in metatranscriptomics, this method has the potential to deliver valuable insights and build foundational knowledge of microbial community functional activity. As with all methods, there remain limits to metatranscriptomics. For example, due to the short turnover time of mRNA, metatranscriptomic data represent microbial community activity at the time of sampling only. To overcome this limitation, studies can adopt a time series approach to capture transcription information over a relevant time period. Additionally, after transcription, mRNA molecules may still not be translated into proteins [64], and even if the protein is made, its actual activity depends on the availability of substrate and any other conditions required for function. Consequently, metatranscriptomics represents the likely functional activity of a microbial community, not necessarily its actual functional activity at the sampling time.
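To give a sense of why the short mRNA half-life complicates sample preservation, the following minimal sketch computes the fraction of a transcript pool that would remain after a delay between sampling and preservation. It assumes simple first-order decay with no new transcription, which is an illustrative simplification and not a model taken from the cited work; the function name and the delay values are hypothetical, while the half-lives of 2.4 and 5 minutes are the bounds quoted above.

```python
def fraction_remaining(elapsed_min: float, half_life_min: float) -> float:
    """Fraction of an mRNA pool left after elapsed_min minutes.

    Assumes first-order (exponential) decay and no new transcription,
    which is a simplification made for illustration only.
    """
    if half_life_min <= 0:
        raise ValueError("half_life_min must be positive")
    if elapsed_min < 0:
        raise ValueError("elapsed_min must be non-negative")
    return 0.5 ** (elapsed_min / half_life_min)


if __name__ == "__main__":
    # Half-life bounds quoted in the text; the delays are hypothetical.
    for half_life in (2.4, 5.0):
        for delay in (1.0, 5.0, 15.0):
            left = fraction_remaining(delay, half_life)
            print(f"half-life {half_life} min, delay {delay} min: "
                  f"{left:.1%} of transcripts remain")
```

Under these assumptions, a delay equal to one half-life already halves the original transcript pool, which is consistent with the emphasis on rapid preservation and on time series sampling.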

The dark side of high throughput sequencing

Annotation of metagenomic and metatranscriptomic reads relies on gene annotations from previously sequenced and typically cultured microbes [65]. It is estimated that only 0.1 to 1% of microbial species have been isolated in culture [66]. Therefore, many of the sequences from metagenomic and metatranscriptomic studies cannot be annotated. Processes such as short read assembly can substantially improve the quality and percentage of sequences annotated [67]. Further improvement in sequence annotation relies on increasing the isolation of currently uncultivated microbes and the characterization of their gene functions, especially those of hypothetical proteins. The Joint Genome Institute’s Microbial Dark Matter Project [68] uses single cell genome sequencing to fill in the phylogenetic gaps in the current database of cultured strains. This is a step in the right direction; however, much work remains. Identifying the functions of unknown or hypothetical proteins will lead to the most dramatic improvements in annotation of metagenomic and metatranscriptomic sequences.

Observation of microbial community functions to answer ecological questions

Direct observation of microbial community functions can enable microbial ecologists to answer questions of ecological importance. For example, one study examined the microbial community structure associated with Ulva australis, a green macroalga. The study showed that only 15% of community members were shared across samples while 70% of functional gene annotations were shared [69]. These results led the authors to suggest that the functional genes possessed by individuals, rather than their species identity, better explain community structure. In another study, soil and sediments were sampled in space and time along an environmental gradient near a stream corridor. Microbial community membership was measured using denaturing gradient gel electrophoresis. Substrate analogs with fluorescent molecules were used to measure the activity of ten different enzymes. In this study, bacterial community structure was not associated with spatial or temporal components, while enzyme activity was associated with temporal dynamics [70]. The disconnect between bacterial community structure and enzymatic activity provides further evidence that not all microbial functional activities are limited to a particular microbial species. Both of these examples illustrate how the genomic diversity found within a species and gene exchange between species result in a decoupling of microbial species and their functions. This decoupling necessitates the observation of microbial community functions, through more direct methods such as metagenomics and metatranscriptomics, to answer important ecological questions.
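The contrast between shared community members and shared functions can be made concrete with a small sketch. The code below computes, for a set of samples, the fraction of all observed items that occur in every sample. This is one possible definition of "shared" chosen for illustration; how the cited study [69] defined it is not stated here, and the taxon and function labels are hypothetical toy data, not results.

```python
from typing import Hashable, Iterable, List, Set


def shared_fraction(samples: Iterable[Iterable[Hashable]]) -> float:
    """Fraction of all observed items that are present in every sample.

    Returns 0.0 when there are no samples or no items at all.
    """
    sets: List[Set[Hashable]] = [set(s) for s in samples]
    union: Set[Hashable] = set().union(*sets)
    if not union:
        return 0.0
    core = set.intersection(*sets)
    return len(core) / len(union)


if __name__ == "__main__":
    # Hypothetical toy data: taxa differ widely, functions mostly overlap.
    taxa = [{"A", "B", "C"}, {"A", "D", "E"}, {"A", "F", "G"}]
    functions = [{"f1", "f2", "f3"}, {"f1", "f2", "f4"}, {"f1", "f2", "f3"}]
    print(f"shared taxa:      {shared_fraction(taxa):.0%}")
    print(f"shared functions: {shared_fraction(functions):.0%}")
```

In this toy case few taxa are common to all samples while half of the functions are, which mirrors the qualitative pattern of decoupling described above without reproducing the study's numbers.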

Importance of direct observation of microbial community functional activity

Microbial species can differ not only in genomic content but also in the regulation of functional genes. The mere presence of a gene does not imply the gene is expressed. A simple species survey or a metagenomic study does not provide sufficient information to determine microbial community functional activity. Direct observation of microbial community functional activity through metatranscriptomics is more directly related to function and hence a better method to answer this question and bridge the knowledge gap between the laboratory and the environment.

Although methodologically more challenging than metagenomics, metatranscriptomics has advanced foundational knowledge of the interactions within microbial communities that shape community composition, richness and ecosystem function. For example, a study of methanogenic microbial communities illustrates that amino acid auxotrophy promoted syntrophic relationships within the community and regulated carbon and energy flow within the community [71]. Without direct observation of microbial community activity, the basis of the syntrophic interactions could only have been postulated. In another study, the biogeochemistry of hydrothermal plumes was found to be regulated by microbes inhabiting seawater rather than microbes inhabiting the seafloor [72]. Direct observation of microbial community activity allowed the authors to identify which groups of microbes regulate important biogeochemical processes in water surrounding hydrothermal vents. These examples illustrate the power of direct observation of microbial activity through metatranscriptomics to illuminate what microbes are doing and highlight their ecological importance.

Questions addressed in this thesis

Soil microbial communities provide many environmentally and anthropogenically beneficial services. Most of these services, i.e., functions, are carried out or enhanced in the rhizosphere. The ability to predict and control this activity would provide great societal benefit. To see these benefits realized, microbial ecologists must first bridge the gap between laboratory-based studies and microbial community activity in the environment. This requires direct observation of microbial community functions. While metagenomics enables microbial ecologists to examine the functional potential of microbial communities, metatranscriptomics offers direct observation of genes actively transcribed by microbial community members. Hence, I focused my studies on answering the question of which genes are expressed in the rhizosphere of bioenergy crops. The three research chapters of my dissertation focus on i) examination of a novel method of rRNA removal and establishment of best practices for metatranscriptome data analysis, ii) use of a multiomics approach to integrate next generation sequencing of DNA and mRNA with advanced proteomic information, and iii) examination of gene presence and transcription in three zones at and around the root: the rhizoplane, the rhizosphere and the bulk soil. These studies are conducted in the context of a biofuel cropping system study in which corn (Zea mays), Miscanthus (Miscanthus x giganteus) and switchgrass (Panicum virgatum) are compared. Biofuels, particularly cellulosic biofuels such as those derived from switchgrass and Miscanthus, represent an important energy source that can reduce our dependence on greenhouse gas emitting fossil fuels. Given the previously discussed benefits microbial communities provide to plants, developing a deeper understanding of how to manage microbial functions carried out in the rhizosphere can aid low-cost and sustainable cultivation of cellulosic biofuel crops on marginal lands, i.e., lands that are currently not economical for cultivated food crops.

In chapter two I examine a novel method of rRNA removal called duplex-specific nuclease (DSN) normalization. This method provides several advantages over current probe-based rRNA removal methods, as it requires less total RNA input and does not rely on probes to remove rRNAs. This chapter also examines various data analysis approaches to establish best practices. There are no publications that provide a thorough examination of data analysis practices for metatranscriptomic data. In chapter three I use a multiomics approach to identify genes present, transcribed and translated in the rhizosphere of switchgrass. These data traverse all the steps within the central dogma of molecular biology and so present an integrated profile of microbial community activity in the switchgrass rhizosphere. This chapter focuses on microbially mediated biogeochemical cycles and functions related to plant-microbe interactions. Finally, chapter four examines differential gene presence in the rhizoplane and rhizosphere of corn and switchgrass as well as bulk soil. This chapter also compares microbial community transcription in the rhizosphere of switchgrass with that of bulk soil. The goals were to i) identify genes enriched at various distances from living roots, ii) identify microbial genes associated with different plants (corn and switchgrass), and iii) determine whether any observed differences in gene abundance correlate with microbial community transcription. These studies are aimed at directly observing microbial community activity under field conditions with the ultimate goal of learning to enhance the beneficial services provided by the microbial community to the plant. This would help enhance the sustainable production of cellulosic biofuel crops on marginal lands.

REFERENCES

1. Whitman, W.B., et al., Genomic Encyclopedia of Bacterial and Archaeal Type Strains, Phase III: the genomes of soil and plant-associated and newly described type strains. Standards in Genomic Sciences, 2015. 10(1): p. 8-13.

2. York, L.M., et al., The holistic rhizosphere: integrating zones, processes, and semantics in the soil influenced by roots. Journal of experimental botany, 2016. 67(12): p. 3629-43.

3. Roesch, L.F., et al., Pyrosequencing enumerates and contrasts soil microbial diversity. Isme Journal, 2007. 1(4): p. 283-290.

4. Juwarkar, A.A., S.K. Singh, and A. Mudhoo, A comprehensive overview of elements in bioremediation. Reviews in Environmental Science and Biotechnology, 2010. 9(3): p. 215-288.

5. Sangwan, N., et al., Comparative Metagenomic Analysis of Soil Microbial Communities across Three Hexachlorocyclohexane Contamination Levels. PloS one, 2012. 7(9): p. e46219-e46219.

6. Techer, D., et al., Contribution of Miscanthus x giganteus root exudates to the biostimulation of PAH degradation: An in vitro study. Science of the Total Environment, 2011. 409(20): p. 4489-4495.

7. Chen, F., et al., Enhanced biodegradation of polychlorinated biphenyls by defined bacteria-yeast consortium. Annals of Microbiology, 2015. 65(4): p. 1847-1854.

8. Benli Chai, T.V.T., Shoko Iwai, Cun Liu, Jordan A. Fish, Cheng Gu, Timothy A. Johnson, Gerben Zylstra, Brian J. Teppen, Hui Li, Syed A. Hashsham, Stephen A. Boyd, James R. Cole, James M. Tiedje, Sphingomonas wittichii Strain RW1 Genome-Wide Gene Expression Shifts in Response to Dioxins and Clay. PLoS One, 2016. 11(6).

9. O'Loughlin, E.J., et al., Reduction of Uranium(VI) by mixed iron(II)/iron(III) hydroxide (green rust): Formation of UO2 nanoparticles. Environmental Science & Technology, 2003. 37(4): p. 721-727.

10. Zhao, M., et al., Microbial mediation of biogeochemical cycles revealed by simulation of global changes with soil transplant and cropping. The ISME journal, 2014. 8(10): p. 2045-55.

11. Jia, Z. and R. Conrad, Bacteria rather than Archaea dominate microbial ammonia oxidation in an agricultural soil. Environmental microbiology, 2009. 11(7): p. 1658-71.

12. Reed, S.C., C.C. Cleveland, and A.R. Townsend, Functional Ecology of Free-Living Nitrogen Fixation: A Contemporary Perspective. Annual Review of Ecology, Evolution, and Systematics, 2011. 42(1): p. 489-512.

13. Schmalenberger, A., et al., The role of Variovorax and other Comamonadaceae in sulfur transformations by microbial wheat rhizosphere communities exposed to different sulfur fertilization regimes. Environmental Microbiology, 2008. 10(6): p. 1486-1500.

14. Marschner, P., D. Crowley, and Z. Rengel, Rhizosphere interactions between microorganisms and plants govern iron and phosphorus acquisition along the root axis – model and research methods. Soil Biology and Biochemistry, 2011. 43(5): p. 883-894.

15. Jackson, L.E., M. Burger, and T.R. Cavagnaro, Roots, nitrogen transformations, and ecosystem services. Annual review of plant biology, 2008. 59: p. 341-63.

16. Richardson, A.E., et al., Acquisition of phosphorus and nitrogen in the rhizosphere and plant growth promotion by microorganisms. Plant and Soil, 2009. 321(1-2): p. 305-339.

17. Zhu, B., et al., Rhizosphere priming effects on soil carbon and nitrogen mineralization. Soil Biology and Biochemistry, 2014. 76: p. 183-192.

18. Bird, J.a., D.J. Herman, and M.K. Firestone, Rhizosphere priming of soil organic matter by bacterial groups in a grassland soil. Soil Biology and Biochemistry, 2011. 43(4): p. 718-725.

19. Cheng, W., Rhizosphere priming effect: Its functional relationships with microbial turnover, evapotranspiration, and C–N budgets. Soil Biology and Biochemistry, 2009. 41(9): p. 1795-1801.

20. Bracho, R., et al., Temperature sensitivity of organic matter decomposition of permafrost-region soils during laboratory incubations. Soil Biology and Biochemistry, 2016(February): p. 1-14.

21. Conant, R.T., J.M. Steinweg, and M.L. Haddix, EXPERIMENTAL WARMING SHOWS THAT DECOMPOSITION TEMPERATURE SENSITIVITY INCREASES WITH SOIL ORGANIC MATTER RECALCITRANCE. Ecology, 2008. 89(9): p. 2384-2391.

22. Fang, C., et al., Similar response of labile and resistant soil organic matter pools to changes in temperature. Nature, 2005: p. 57-59.

23. Lemanceau, P., et al., Iron dynamics in the rhizosphere as a case study for analyzing interactions between soils, plants and microbes. Plant and Soil, 2009. 321(1-2): p. 513-535.

24. Van Der Heijden, M.G.A., R.D. Bardgett, and N.M. Van Straalen, The unseen majority: Soil microbes as drivers of plant diversity and productivity in terrestrial ecosystems. Ecology Letters, 2008. 11(3): p. 296-310.

25. Clode, P.L., et al., In situ mapping of nutrient uptake in the rhizosphere using nanoscale secondary ion mass spectrometry. Plant physiology, 2009. 151(4): p. 1751-7.

26. Saikia, S.P., & Jain, V. , Biological nitrogen fixation with non-legumes: an achievable target or a dogma. Current Science, 2007. 92(3): p. 317-322.

27. Friesen, M.L., Widespread fitness alignment in the legume-rhizobium symbiosis. The New phytologist, 2012. 194(4): p. 1096-111.

28. Morgan, J.a.W., G.D. Bending, and P.J. White, Biological costs and benefits to plant-microbe interactions in the rhizosphere. Journal of experimental botany, 2005. 56(417): p. 1729-39.

29. Bhattacharyya, P.N. and D.K. Jha, Plant growth-promoting rhizobacteria (PGPR): emergence in agriculture. World Journal of Microbiology and Biotechnology, 2011: p. 1327-1350.

30. Spaepen, S. and J. Vanderleyden, Auxin and plant-microbe interactions. Cold Spring Harbor perspectives in biology, 2011. 3(4).

31. Patten, C.L. and B.R. Glick, Role of Pseudomonas putida indoleacetic acid in development of the host plant root system. Applied and Environmental Microbiology, 2002. 68(8).

32. Spaepen, S., J. Vanderleyden, and R. Remans, Indole-3-acetic acid in microbial and microorganism-plant signaling. FEMS microbiology reviews, 2007. 31(4): p. 425-48.

33. Bhattacharjee, R.B., et al., Indole acetic acid and ACC deaminase-producing Rhizobium leguminosarum bv. trifolii SN10 promote rice growth, and in the process undergo colonization and chemotaxis. Biology and Fertility of Soils, 2011. 48(2): p. 173-182.

34. Combes-Meynet, E., et al., The Pseudomonas secondary metabolite 2,4-diacetylphloroglucinol is a signal inducing rhizoplane expression of Azospirillum genes involved in plant-growth promotion. Molecular plant-microbe interactions : MPMI, 2011. 24(2): p. 271-84.

35. Mendes, R., P. Garbeva, and J.M. Raaijmakers, The rhizosphere microbiome: significance of plant beneficial, plant pathogenic, and human pathogenic microorganisms. FEMS microbiology reviews, 2013. 37(5): p. 634-63.

36. Jousset, A., et al., Plants respond to pathogen infection by enhancing the antifungal gene expression of root-associated bacteria. Molecular plant-microbe interactions : MPMI, 2011. 24(3): p. 352-8.

37. Mazurier, S., et al., Phenazine antibiotics produced by fluorescent pseudomonads contribute to natural soil suppressiveness to Fusarium wilt. The ISME journal, 2009. 3(8): p. 977-91.

38. Rudrappa, T., et al., Root-secreted malic acid recruits beneficial soil bacteria. Plant physiology, 2008. 148(3): p. 1547-56.

39. Saravanakumar, D., Harish, S., Loganathan, M., Vivekananthan, R., Rajendran, L., Raguchander, T., & Samiyappan, R., Rhizobacterial bioformulation for the effective management of Macrophomina root rot in mungbean. Archives of Phytopathology and Plant Protection, 2007. 40(5): p. 323-337.

40. Kuzyakov, Y. and E. Blagodatskaya, Microbial hotspots and hot moments in soil: Concept & review. Soil Biology and Biochemistry, 2015. 83: p. 184-199.

41. Shukla, K.P. and S. Sharma, Nature and role of root exudates: Efficacy in bioremediation. African Journal of …, 2011. 10(48): p. 9717-9724.

42. Narasimhan, K., et al., Enhancement of Plant-Microbe Interactions Using a Rhizosphere Metabolomics-Driven Polychlorinated Biphenyls. 2003. 132(May): p. 146-153.

43. Xu, L., et al., Enhanced removal of polychlorinated biphenyls from alfalfa rhizosphere soil in a field study: The impact of a rhizobial inoculum. Science of the Total Environment, 2010. 408(5): p. 1007-1013.

44. Yergeau, E., et al., Microbial expression profiles in the rhizosphere of willows depend on soil contamination. The ISME journal, 2013: p. 1-15.

45. Yang, C., Y. Wang, and J. Li, Plant Species Mediate Rhizosphere Microbial Activity and Biodegradation Dynamics in a Riparian Soil Treated with Bensulfuron-methyl. Clean - Soil, Air, Water, 2011. 39(4): p. 338-344.

46. Bolan, N., A. Kunhikrishnan, and J. Gibbs, Rhizoreduction of arsenate and chromate in Australian native grass, shrub and tree vegetation. Plant and Soil, 2013. 367(1-2): p. 615-625.

47. Płociniczak, T., et al., Rhizospheric bacterial strain Brevibacterium casei MH8a colonizes plant tissues and enhances Cd, Zn, Cu phytoextraction by white mustard. Frontiers in Plant Science, 2016. 7(February): p. 101-101.

48. Hinsinger, P., C. Plassard, and B. Jaillard, Rhizosphere: A new frontier for soil biogeochemistry. Journal of Geochemical Exploration, 2006. 88(1-3): p. 210-213.

49. Murphy, C.J., et al., Rhizosphere priming can promote mobilisation of N-rich compounds from soil organic matter. Soil Biology and Biochemistry, 2015. 81: p. 236-243.

50. Chaparro, J.M., et al., Manipulating the soil microbiome to increase soil health and plant fertility. Biology and Fertility of Soils, 2012. 48(5): p. 489-499.

51. Lemmon, E.M. and A.R. Lemmon, High-Throughput Genomic Data in Systematics and Phylogenetics. Annual Review of Ecology, Evolution, and Systematics, 2013. 44(1): p. 99-121.

52. Tringe, S.G. and E.M. Rubin, Metagenomics: DNA sequencing of environmental samples. Nature reviews. Genetics, 2005. 6(11): p. 805-14.

53. Teeling, H. and F.O. Glöckner, Current opportunities and challenges in microbial metagenome analysis--a bioinformatic perspective. Briefings in bioinformatics, 2012.

54. Eisen, J.a., Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes. PLoS biology, 2007. 5(3): p. e82-e82.

55. Lennon, J.T. and S.E. Jones, Microbial seed banks: the ecological and evolutionary implications of dormancy. Nature Reviews Microbiology, 2011. 9(2): p. 119-130.

56. Baldrian, P., et al., Active and total microbial communities in forest soil are largely different and highly stratified during decomposition. The ISME journal, 2012. 6(2): p. 248-58.

57. Urich, T., et al., Simultaneous assessment of soil microbial community structure and function through analysis of the meta-transcriptome. PloS one, 2008. 3(6): p. e2527-e2527.

58. Neidhardt, F.C. and H.E. Umbarger, Chemical composition of Escherichia coli. 1996. 13-16.

59. Gilbert, J.a., et al., Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PloS one, 2008. 3(8): p. e3042-e3042.

60. He, S., et al., Validation of two ribosomal RNA removal methods for microbial metatranscriptomics. … methods, 2010. 7(10).

61. Stewart, F.J., E.a. Ottesen, and E.F. DeLong, Development and quantitative analyses of a universal rRNA-subtraction protocol for microbial metatranscriptomics. The ISME journal, 2010. 4(7): p. 896-907.

62. Yi, H., et al., Duplex-specific nuclease efficiently removes rRNA for prokaryotic RNA-seq. Nucleic acids research, 2011. 39(20): p. e140-e140.

63. Moran, M.A., et al., Sizing up metatranscriptomics. The ISME journal, 2013. 7(2): p. 237-43.

64. Zhang, Y.P., et al., Regulation of nitrogen fixation in Azospirillum brasilense. Fems Microbiology Letters, 1997. 152(2): p. 195-204.

65. Yang Y, J.X.-T., Zhang T Evaluation of a Hybrid Approach Using UBLAST and BLASTX for Metagenomic Sequences Annotation of Specific Functional Genes. PLoS One, 2014. 9(10): p. e110947.

66. Head, I.M., J.R. Saunders, and R.W. Pickup, Microbial evolution, diversity, and ecology: A decade of ribosomal RNA analysis of uncultivated microorganisms. Microbial Ecology, 1998. 35(1): p. 1-21.

67. Wommack, K.E., J. Bhavsar, and J. Ravel, Metagenomics: Read length matters. Applied and Environmental Microbiology, 2008. 74(5): p. 1453-1463.

68. Rinke, C., et al., Insights into the phylogeny and coding potential of microbial dark matter. Nature, 2013. 499(7459): p. 431-437.

69. Burke, C. and P. Steinberg, Bacterial community assembly based on functional genes rather than species. Proceedings of the …, 2011.

70. Frossard, A., et al., Disconnect of microbial structure and function: enzyme activities and bacterial communities in nascent stream corridors. The ISME journal, 2012. 6(3): p. 680-91.

71. Embree, M., et al., Networks of energetic and metabolic interactions define dynamics in microbial communities. Proceedings of the National Academy of Sciences, 2015. 112(50): p. 201506034-201506034.

72. Lesniewski, R.a., et al., The metatranscriptome of a deep-sea hydrothermal plume is dominated by water column methanotrophs and lithotrophs. The ISME journal, 2012. 6(12): p. 2257-2268.

Chapter 2:

Methodologies for probing the metatranscriptome of grassland soil

This chapter has been published in: Garoutte, A., Cardenas, E., Tiedje, J., & Howe, A. (2016). Methodologies for probing the metatranscriptome of grassland soil. Journal of Microbiological Methods, 131, 122–129. http://doi.org/10.1016/j.mimet.2016.10.018

Abstract

Metatranscriptomics provides an opportunity to identify active microbes and expressed genes in complex soil communities in response to particular conditions. Currently, there are a limited number of soil metatranscriptome studies to provide guidance for using this approach in this challenging matrix. Hence, we evaluated the technical challenges of applying soil metatranscriptomics to a highly diverse, low activity natural system. We used a non-targeted rRNA removal approach, duplex-specific nuclease (DSN) normalization, to generate a metatranscriptomic library from field collected soil supporting a perennial grass, Miscanthus x giganteus (a biofuel crop), and evaluated its ability to provide insight into the active community members and their expressed protein-coding genes. We also evaluated various bioinformatics approaches for analyzing our soil metatranscriptome, including annotation of unassembled transcripts, de novo assembly, and aligning reads to known genomes. Further, we evaluated various databases for their ability to provide annotations for our metatranscriptome. Overall, our results emphasize that low activity, highly genetically diverse and relatively stable microbiomes, like soil, require very deep sequencing to sample the transcriptome beyond the common core functions. We identified several key areas from which metatranscriptomic analyses will benefit, including increased rRNA removal, assembly of short read transcripts, and more relevant reference databases, while providing a priority set of expressed genes for functional assessment.

Introduction

Metatranscriptomics holds promise for providing insight into which organisms are active and which gene subsets are expressed within microbial communities, but its use is particularly challenging in complex systems, especially soil. Metatranscriptomics has been most prevalently used in marine ecology studies, where, as examples, it has helped identify key nutrient transformations in hydrothermal plumes [1]; patterns of niche diversification in coastal waters [2]; seasonal and diurnal patterns of gene expression in the English Channel [3]; and patterns of diazotroph diversity along salinity and nutrient gradients [4]. In contrast, the application of metatranscriptomics in terrestrial environments has been limited, mostly to studies targeting specific genes (e.g., phylogenetic markers or functional genes) [5,6], experimentally enriched soil communities [7,8], or greenhouse pot-based experiments [9]. In forest soils, fungal-targeted metatranscriptomics has been used to identify novel hydrolase enzymes [5], and a targeted approach (16S rRNA, ITS, and cellobiohydrolase) has shown that low-abundance species play an important role in carbon decomposition [7]. Metatranscriptomics has also been used to contrast expression in pristine soils and those contaminated with polycyclic aromatic hydrocarbons [10] and to examine domain-level changes in the rhizosphere of potted plants [11]. While these examples demonstrate the feasibility and usefulness of soil metatranscriptomics, the application of non-targeted metatranscriptomics to field collected agricultural soils, e.g., croplands and pastures, has yet to be demonstrated; these soils comprise over 40% of global land use [12] and are essential to food production and ecosystem services.

Soil metatranscriptomics presents several obstacles. First, soil microbial communities are incredibly diverse; one gram of soil is estimated to contain nearly one million distinct genomes [13], magnitudes higher than aquatic and host- associated habitats [14]. Second, reference genomes from soil are limited, making sequence annotation difficult. Third, RNA, especially mRNA, is in low abundance because of the primarily dormant or starved states of the community members, with few perturbations to induce expression. For example, turnover rates of soil microbes has been calculated to be 30- to 300-fold slower than that of microbes in the ocean [15]. Overall, the mRNA comprises only about 4% of total RNA [16], highlighting the challenge of isolating or enriching the mRNA prior to sequencing to achieve greater sequence depth. A common approach for mRNA enrichment is to remove rRNA through subtractive hybridization [17]. This approach presents challenges of its own in that it is hindered by the difficulty of obtaining intact rRNA through soil RNA extraction methods. Finally, soil metatranscriptomics is challenged by the high temporal and spatial diversity in soil populations due to habitat complexity at small scales (<1 mm), and various stochastic perturbations (e.g., rainfall, plant litter introduction, micro and mesofauna movements). Consequently, capturing appropriate snapshots of targeted activity in soil requires sampling high biodiversity within complex and often unknown and unpredictable dynamics.

Furthermore, the lack of soil metatranscriptome reference datasets makes it difficult to evaluate appropriate sampling and experimental strategies and to gain insight into common system responses.

In this study, we evaluated mRNA enrichment as well as various bioinformatic approaches to analyze soil metatranscriptomes, and present recommendations for such studies. Our metatranscriptome originated from bulk soil associated with a Miscanthus x giganteus crop, an important bioenergy crop due to its perenniality and its high biomass yield compared to other crops [18]. Soils were sampled at the period of most active plant growth (early August), at midday when photosynthesis was at its maximum, and after a rainfall period so that soil water was not limiting, in order to maximize the soil microbial community response to potentially new substrates. Our objective was to determine our ability to access and identify actively transcribed genes in this soil microbial community from the soil’s metatranscriptome.

To address the challenge of low concentration of mRNA, we enriched for mRNA by removing rRNA using duplex-specific nuclease (DSN) normalization. DSN is a non-targeted approach that has several advantages over subtractive hybridization, including less stringent RNA quality requirements, a lower required amount of RNA (100 ng compared to 1 µg), and an increased rRNA removal efficiency [17]. An alternative to DSN is subtractive hybridization, which targets the removal of specific rRNAs based on genomic primer targets. In contrast to subtractive hybridization, which removes rRNAs with bias, DSN does not target specific rRNAs and consequently the remaining rRNAs are more likely to reflect their original distributions. To assess the gene content of our metatranscriptome, we annotated against several gene or genome reference databases, including MG-RAST M5NR, the Carbohydrate Active Enzyme (CAZy) database, a soil genome dataset termed RefSoil, and three de novo assembled metagenomes obtained from the same plot during the Spring of 2009. Additionally, we assessed the value of assembly of longer sequences, or contigs, from the metatranscriptome for improved insight into community activity.

Methods

Various methods that can be used for soil metatranscriptome analysis, including (a) direct annotation of short read sequences, (b) assembly and annotation, and (c) alignment of reads to existing reference genomes, are summarized in Figure 2.1 along with the advantages and disadvantages of each. We analyzed our dataset using all three of these methods and compared the results of each method to the others to determine best practices.

Figure 2.1: Metatranscriptome data analysis workflow. Various methods for metatranscriptome data analysis are shown. (a) Direct annotation of short reads. (b) Assembly of short reads into longer contigs and subsequent annotation. (c) Short read mapping to genomes compiled in the RefSoil database.

Metatranscriptome sample collection and library preparation

The bulk soil samples for metatranscriptomics were obtained from a four-year-old stand of Miscanthus (Miscanthus x giganteus), plot G6R1 at the Bioenergy Cropping Systems Experiment (BCSE; also known as the Intensive Site) (42°23’47”N, 85°22’26”W) at the W.K. Kellogg Biological Station in southwest Michigan, USA. Samples were collected midday on August 1st, 2012. The mean air temperature for the previous week was 24 °C, and there had been 117 mm of rain in the week preceding the sampling with 2 days of no rain directly prior to sampling; the soil was still moist. A composite sample composed of three soil samples was taken from random points in the plot. The soil was quickly sieved (4 mm) to remove roots and frozen on dry ice to prevent mRNA degradation. Samples were stored at -80 °C until RNA extraction. RNA was extracted from 2 g of soil using the PowerSoil RNA kit (MoBio, Carlsbad, CA), and DNA was then removed by DNase treatment (Invitrogen, Carlsbad, CA). RNA (100 ng) was converted to cDNA and treated with duplex-specific nuclease (DSN) to reduce the abundance of rRNA as described in [17]. Samples were sequenced with the Illumina HiSeq sequencing platform at the Research Technology Support Facility, Michigan State University, East Lansing, MI, USA, generating 100 base pair (bp) reads.

MG-RAST databases used for annotation of unassembled raw reads

Ribosomal RNA sequences were identified using riboPicker [21] (Figure 2.1a), the Rfam [22] databases, and MG-RAST [23]. The resulting non-rRNA sequences were submitted to MG-RAST (v 3.3.7.3) using the M5NR [24] for gene annotation. Many reads were annotated as Enterobacteria phage phiX174, which is commonly used as a control in sequencing facilities. Hence, the sequences were mapped to the Enterobacteria phage phiX174 sensu lato genome (NC_001422.1) using Bowtie 2 (v2.0.0-beta6, [21]) and removed from the analysis, as they were likely the result of contamination. Additionally, rRNA sequences within the unassembled dataset were annotated using the MG-RAST M5RNA database (MG-RAST ID 4554103.3, Unassembled Metatranscriptome). Annotations were identified using the following preset quality filter parameters: maximum e-value cutoff of 1e-5, minimum percent identity cutoff of 60%, and minimum alignment length cutoff of 15.
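The three annotation cutoffs above can be illustrated with a minimal sketch. The `Hit` record and the function names are hypothetical and are not part of MG-RAST; the sketch only shows how such thresholds would be applied jointly to a list of similarity hits, assuming each hit carries an e-value, a percent identity, and an alignment length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One similarity hit of a read against a reference (hypothetical record)."""
    read_id: str
    evalue: float
    percent_identity: float
    alignment_length: int  # in the units reported by the annotation server


# Cutoffs quoted in the text above.
MAX_EVALUE = 1e-5
MIN_PERCENT_IDENTITY = 60.0
MIN_ALIGNMENT_LENGTH = 15


def passes_cutoffs(hit: Hit) -> bool:
    """Return True only if the hit satisfies all three cutoffs."""
    return (
        hit.evalue <= MAX_EVALUE
        and hit.percent_identity >= MIN_PERCENT_IDENTITY
        and hit.alignment_length >= MIN_ALIGNMENT_LENGTH
    )


def filter_hits(hits):
    """Keep the hits that pass every cutoff, preserving input order."""
    return [hit for hit in hits if passes_cutoffs(hit)]
```

A hit is retained only when all three conditions hold, so tightening any single cutoff can only reduce the number of annotations.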

Metatranscriptome assembly & annotation

Sequences were filtered using digital normalization (flags: -C 20, -k 20, -N 4, -x 2e9) as described in [27–29]. Normalized reads were assembled using Velvet (v 1.2.10) [30] with odd numbered k-mers from length 19 to 59 (Figure 2.1b). Assemblies produced from different k-mer lengths were merged using AMOS (v 3.1.0) [31] and CD-HIT (v 4.5.7) [32]. Resulting assembled contigs with lengths greater than 200 bp were annotated with MG-RAST (v 3.3.7.3) [24] (MG-RAST ID 4532564.3, Assembled Metatranscriptome), the CAZy database (date accessed: July 13, 2008) [33], which contains enzymes involved in carbon compound synthesis and decomposition, and the Rfam database, which contains non-coding RNAs.
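The idea behind digital normalization can be sketched as follows: a read is kept only if the median abundance of its k-mers, counted over the reads kept so far, is below a coverage cutoff. This is a simplified illustration and not the tool used in this study; it stores exact counts in a dictionary rather than in fixed-size probabilistic tables, so the table options quoted above (which appear to set the number and size of those tables) have no counterpart here. The function names are hypothetical.

```python
from collections import Counter
from statistics import median

# Odd k-mer lengths from 19 to 59, as used for the multi-k assembly above.
VELVET_KMER_LENGTHS = list(range(19, 60, 2))


def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def digital_normalize(reads, k=20, cutoff=20):
    """Keep a read only if the median count of its k-mers is below the cutoff.

    Counts are accumulated from kept reads only, so highly covered
    regions stop contributing new reads once they reach the cutoff.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k and cannot be assessed
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept
```

With the defaults shown (k = 20, cutoff = 20), reads from abundant transcripts such as rRNA are discarded once their k-mers have been seen about 20 times, which reduces the data volume passed to the assembler.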

Previous metagenomes used for comparison to soil metatranscriptome

The reference metagenomes used in this study were obtained from one bulk and two rhizosphere soil samples collected from the same Miscanthus bioenergy plot in October 2009. DNA was extracted from 2.5 g soil as described in [19]. The high molecular weight DNA was then gel purified, electroeluted, and concentrated using methods described in [20]. Samples were sequenced with both the Illumina GAII and 454 sequencing platforms at the Joint Genome Institute (Walnut Creek, CA), generating 100-base reads.

Metagenome assembly & annotation

Sequences were trimmed using a quality score of 20. Reads were assembled using SOAPdenovo [34] with k-mer lengths of 21, 23, 25, 27, 29 and 31. All of the default settings were used for the SOAPdenovo assembly (flags -d1 and -R). Contigs were then merged using SGA [35] with all default parameters. Contigs greater than 500 bp were annotated by MG-RAST (v3, 2011-02-22) (MG-RAST IDs 4465947.3, Bulk MetaG; 4465942.3, Rhizo MetaG1; and 4465943.3, Rhizo MetaG2).

Estimation of abundance of assembled contigs or reference sequences

The abundance of assembled contigs and reference sequences (e.g., soil genomes) was estimated as the median base pair coverage of all transcript alignments to contigs (assembled metatranscriptome and reference metagenomes) or genomes within the RefSoil database. Mapping of unassembled metatranscriptome reads to contigs or genomes was performed using Bowtie 2 (v2.0.0-beta6, [25]) with the following default parameters: end-to-end alignment, a minimum score threshold of -60.6 for 100 bp reads, -D 100, and distinct alignments for each read. Base pair coverage was estimated using BedTools (v 2.17.0) [26]. For metagenomic, metatranscriptomic, and soil reference annotated genes, coverage was estimated on the genic region rather than the complete originating contig (which may contain both genic and intergenic sequences).
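The median base pair coverage described above can be sketched in a few lines. This is an illustrative calculation and not the BedTools implementation; it assumes alignments are given as 0-based, half-open (start, end) intervals on a single contig or gene, and the function names are hypothetical.

```python
from statistics import median


def per_base_depth(length, alignments):
    """Return the read depth at each position of a reference of given length.

    alignments: iterable of (start, end) pairs, 0-based and half-open.
    """
    diff = [0] * (length + 1)
    for start, end in alignments:
        start = max(start, 0)
        end = min(end, length)
        if start < end:
            diff[start] += 1  # depth rises where an alignment begins
            diff[end] -= 1    # and falls just past where it ends
    depth = []
    running = 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


def median_coverage(length, alignments, region=None):
    """Median per-base depth over the whole reference or over a genic region.

    region: optional (start, end) pair restricting the estimate, e.g. to
    the annotated gene rather than the complete originating contig.
    """
    depth = per_base_depth(length, alignments)
    if region is not None:
        start, end = region
        depth = depth[start:end]
    return median(depth) if depth else 0
```

Using the median rather than the mean makes the estimate less sensitive to short stretches of very high depth, such as conserved regions shared among many transcripts.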

Curation of soil reference genome database, RefSoil

A manually curated database of soil bacterial genomes was built to provide a soil-specific reference set. Strains with completely sequenced genomes were selected from the GOLD database (http://genomesonline.org) on August 19th, 2011. The inclusion criteria involved both information on isolation of the sequenced organism and literature searches regarding the ecology of the species. For example, Erwinia amylovora CFBP 1430 was selected even though it was originally isolated from a Crataegus plant, because it is commonly detected in soils. Obligate human pathogens and non-soil relevant extremophiles were excluded. If redundant genomes were found at the species level, only two per species were kept to reduce database bias. A total of 492 organisms, representing 19 different phyla and contributing a total of 1,031 replicons (chromosomes and plasmids), formed the database.

Complete GenBank accessions were downloaded and parsed to extract whole genome sequences and features (gene coordinates and annotations). A complete list of genomes and accession numbers used in the RefSoil database is in Supp Table 2.1.
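The redundancy rule described above (no more than two genomes per species) can be expressed as a short filter. This is a sketch under an assumption: the text does not state which two genomes were retained when more were available, so the example simply keeps the first two encountered. The function name is hypothetical.

```python
from collections import defaultdict


def cap_genomes_per_species(genomes, max_per_species=2):
    """Keep at most max_per_species genomes for each species.

    genomes: iterable of (species, strain) pairs. Which genomes are
    retained is an assumption here: the first ones encountered win.
    """
    kept = []
    seen = defaultdict(int)
    for species, strain in genomes:
        if seen[species] < max_per_species:
            kept.append((species, strain))
            seen[species] += 1
    return kept
```

For instance, given the two Alicycliphilus denitrificans strains in the database (BC and K601) plus a hypothetical third strain, only the first two would be retained, while single-strain species pass through unchanged.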

Unassembled read mapping to metagenomes and RefSoil genomes

Sequences were mapped to the three metagenome assemblies and the genomes within the RefSoil database using Bowtie 2 (v2.0.0-beta6, [25]) with the following default parameters: end-to-end alignment, a minimum score threshold of -60.6 for 100 bp reads, -D 100, and distinct alignments for each read (Figure 2.1c). Coverage of annotated regions was estimated using BedTools (v 2.17.0) [26]. Only reads with a minimum alignment length of 100 bp (to references) and contigs (or genes/genomes) with at least two mapped reads were considered.
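The score threshold of -60.6 quoted above follows from the default minimum-score function of Bowtie 2 in end-to-end mode, which is linear in read length with a constant of -0.6 and a coefficient of -0.6. The sketch below reproduces that arithmetic and illustrates the two post-mapping filters. The tuple layout and function names are hypothetical, and the order in which the two filters are applied is an assumption.

```python
from collections import Counter


def bowtie2_linear_min_score(read_length, constant=-0.6, coefficient=-0.6):
    """Minimum valid alignment score for a linear score-min function."""
    return constant + coefficient * read_length


def filter_mappings(mappings, min_aligned_bp=100, min_reads_per_reference=2):
    """Apply the two filters described in the text.

    mappings: list of (read_id, reference_id, aligned_bp) tuples.
    Short alignments are dropped first; references left with fewer than
    min_reads_per_reference mapped reads are then dropped as well.
    """
    long_enough = [m for m in mappings if m[2] >= min_aligned_bp]
    reads_per_reference = Counter(ref for _, ref, _ in long_enough)
    return [
        m for m in long_enough
        if reads_per_reference[m[1]] >= min_reads_per_reference
    ]
```

For a 100 bp read the threshold evaluates to -0.6 + (-0.6 × 100) = -60.6, matching the value reported for this study.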

Results

Characterization of sequences in the unassembled soil metatranscriptome

Genes identified in transcripts include sequences associated with both rRNA and mRNA genes, informative of the active community structure and function. The large majority of transcripts, 169 million reads (82.8%), shared similarity to known rRNA genes. The justification for using the DSN approach for mRNA enrichment over probe-based rRNA removal was the unbiased removal of rRNA gene fragments. Using the DSN metatranscriptome library preparation, the remaining rRNA sequences were evaluated to determine the taxonomic composition of the active community members, resulting in nearest matches to over 22,000 species in our metatranscriptome. The most abundant “active” bacterial phyla were Actinobacteria and Proteobacteria (Figure 2.2, blue bar), and Ascomycota was the most abundant fungal phylum.

Figure 2.2: Phylogenetic distribution of sequence annotations identified in unassembled and assembled metatranscriptome and associated soil metagenomes. Phylogeny of rRNA from the unassembled metatranscriptome compared to the phylogeny of MG-RAST’s best-hit classification of protein-coding genes for the assembled metatranscriptome and the reference metagenomes.

* The pink bar in the Assembled MetaT (mRNA) Firmicutes represents the proportion of misannotation in the sample (explained below).

In unassembled transcripts sharing sequence similarity with known genes, the most abundant protein coding annotations from the SEED database were associated with hypothetical (3.7%) or housekeeping proteins such as GroEL (2.9%), DNA-directed RNA polymerase (1.9%), and the translation elongation factor Tu (1.8%). These non-rRNA sequences account for 179,088 reads, representing 8,345 protein coding genes (Table 2.1). Similar functional profiles were also obtained when annotating the unassembled metatranscriptome against other databases in MG-RAST including GenBank, KEGG, and RefSeq (Table 2.1, Supp Table 2-5). Overall, the reads comprising the unassembled metatranscriptome were associated with a few dominant annotations, where the five most abundant annotations represented 12% of the total abundance of annotations in our metatranscriptome.

Table 2.1: Summary of sequence annotations of the unassembled and assembled soil metatranscriptome against various reference databases. Results of annotation by MG-RAST.

          Unassembled metatranscriptome                         Assembled metatranscriptome
Database  Abundance   Unique annotations  Unique features (a)   Abundance   Unique annotations  Unique features (a)
SEED      480,802     8,345               59,189                388,030     3,882               13,754
GenBank   681,148     45,204              116,790               174,438     6,946               16,653
KEGG      385,794     24,479              82,977                204,263     5,442               15,687
RefSeq    470,518     24,444              94,699                319,974     5,518               17,365

(a) Note that annotations are defined within the MG-RAST M5NR database, where distinct annotations may be represented by multiple features. Features may be associated with a specific gene in a reference genome.

To further explore both the taxonomic and functional content of the metatranscriptome, transcripts were also compared against the RefSoil database, which resulted in 94 million reads aligning to RefSoil genomes, the large majority of which were associated with rRNA gene annotations (Table 2.2). Similar to SEED-associated annotations, the most represented functions of the metatranscriptome in the RefSoil database were associated with hypothetical proteins, ribosomal structure, and housekeeping genes. Overall, the most abundantly represented RefSoil genomes in the soil metatranscriptome included Syntrophus aciditrophicus SB, Methylococcus capsulatus str. Bath, and Novosphingobium sp. PP1Y (Supp Table 6). The largest numbers of genes (i.e., presence rather than abundance) were identified in the genomes of Nocardioides sp. JS614, Bradyrhizobium japonicum USDA 110, and Streptomyces scabiei 87.22 (Supp Table 2.7).

Table 2.2: Summary of transcript mapping. Transcripts mapped to reference assemblies (available in MG-RAST with IDs indicated) or genomes with proportion of reads identified as similar to rRNA genes and mapping uniquely to a specific reference assembly.

Unassembled read mapping sources  Transcripts mapped to reference  Transcripts mapping to protein coding regions in assembled contigs
MetaG Bulk                        30,769,638 (15.0%)               3,461,504 (1.7%)
MetaG Rhizo1                      39,728,854 (19.4%)               277,158 (0.1%)
MetaG Rhizo2                      35,837,442 (17.5%)               876,690 (0.4%)
RefSoil                           94,104,227 (45.9%)               9,693,354 (4.7%)

Characterization of sequences in the assembled soil metatranscriptome

Assembly of the metatranscriptome was highly successful and incorporated 73.8% of the reads into 116,556 contigs totaling 32.4 Mbp. In contrast to the unassembled metatranscriptome, the majority of assembled contigs (78.3%) were not associated with rRNA genes. Overall, a total of 15,032 (13.3%) contigs shared sequence similarity with known proteins in the SEED database (Figure 2.3, Supp Table 2.8-11). To estimate abundance of assembled sequences, unassembled reads were aligned to contigs, and the median base pair coverage of each contig was calculated. The most abundant gene functions identified within the soil metatranscriptome assembly included those associated with hypothetical proteins or with functions associated with RNA and protein metabolism (Figure 2.4, dark blue and red lines only). Comparing the number of annotations identified with the unassembled and assembled metatranscriptome datasets, we found a larger number of annotations in the unassembled dataset (Figure 2.5). However, the five most abundant subsystems in the SEED annotations were shared between the assembled and unassembled metatranscriptome datasets, though with differing rank abundances (Figure 2.4).

Figure 2.3: Distribution of assembled metatranscriptome annotations. Proportion of assembled metatranscriptome sequences associated with known rRNA, gene function (SEED), or non-coding sequences (Rfam).

Figure 2.4: Comparison of functional profiles of metatranscriptomes and metagenomes. Annotations were identified in the assembled and unassembled metatranscriptome datasets as well as the three metagenome assemblies against the MG-RAST SEED database.

Figure 2.5: Comparison of the total number of gene annotations identified in the unassembled and assembled metatranscriptomes. Results were generated using the MG-RAST Metagenome Analysis page.

Community composition represented by the assembled metatranscriptome was identified by comparing the contigs to the taxonomic origins of proteins in the MG-RAST M5NR, resulting in the identification of 2,200 species. Similar to the rRNA in the unassembled metatranscriptome dataset, the dominant phyla represented were Actinobacteria and Proteobacteria. In contrast to the unassembled metatranscriptome, Firmicutes also represented a large portion of protein annotations in the assembled dataset (Figure 2.2, red bar), but this was mainly due to hypothetical proteins associated with Heliobacterium modesticaldum, Lactobacillus rhamnosus, and Staphylococcus aureus. The detection of abundant Firmicutes only in the assembled dataset, and their absence from the unassembled transcripts, suggests the presence of a database bias in rRNA gene sequences within the MG-RAST M5NR, and hence a likely annotation error (as noted by the different shading of the red bar in Figure 2.2). Such a bias is plausible because the three previously mentioned organisms are associated with human disease and therefore comprise a larger portion of available genomes.

The large majority of contigs within the soil metatranscriptome (greater than 65%) could not be annotated with any of the reference databases used in this study (MG-RAST, RefSoil, CAZy, or associated metagenomes) (Figure 2.3). To evaluate the possible presence of non-coding RNAs, sequences were compared to known non-coding RNAs in the Rfam database, resulting in a total of 3,036 contigs (2.6%) sharing similarity to RNA genes, regulatory RNAs, or self-splicing RNAs. The major RNA families identified included RNAs associated with transcription and translation (5/5.8S, tmRNA, and RNaseP), signal recognition particles, and riboswitches (Supp Table 2.12).

Further, longer sequence lengths of assembled contigs significantly improved annotations, doubling the median alignment lengths to known proteins (Figure 2.6). To assess the impacts of sequence length, we evaluated the influence of varying similarity thresholds. Stricter criteria for alignment scores (e.g., a decreased E-value cutoff) reduced the abundance and total number of unique features in the unassembled dataset. Overall, confidence in annotations (e.g., median E-value scores) was much higher (lower E-value) for the assembled annotations than for the unassembled (Figure 2.7), and variations in the E-value thresholds did not have as pronounced an effect on the total number of annotations or the number of unique features. Importantly, assembled contigs provide longer sequence lengths for annotation (62 aa vs. 31 aa in the unassembled set), allowing for improved annotations (e.g., similarity comparisons to CAZy enzymes) (Figure 2.6). In total, assembly resulted in 688 contigs, comprising 194,985 bp, which could be classified into five CAZy categories: glycoside hydrolases (GH), glycosyltransferases (GT), polysaccharide lyases (PL), carbohydrate esterases (CE), and carbohydrate-binding modules (CBM). The large majority of these sequences (572 contigs, 83%) were associated with GH, GT, and CBM-containing enzymes. The most frequent CAZy gene families were GT2, GH36, CBM13, and GH18 (Supp Table 2.12). Overall, these CAZy-associated contigs were present at relatively low abundances within the metatranscriptome, averaging 4.1-fold coverage. The most abundant enzyme classes included GH19, GH17, and CBM14 with 137-, 47-, and 19-fold coverage, respectively (Supp Table 2.13).

Figure 2.6: Comparison of annotation alignment lengths of the assembled and unassembled datasets. Amino acid alignment lengths of SEED subsystem annotations for the assembled and unassembled datasets. The minimum alignment length is set to the MG-RAST default of 15 amino acids.

Figure 2.7: Comparison of annotation E-values of the assembled and unassembled datasets. E-values of SEED subsystem annotations of the assembled and unassembled datasets. The minimum e-value is set to the MG-RAST default of 1e-5.

Comparison of the metatranscriptome datasets to metagenomes

To provide further insight into the active subset of microbial communities, we evaluated the membership identified in the soil metatranscriptome (gene expression) and compared this to membership identified in the soil metagenomes (gene potential). Assemblies of the reference metagenomes produced 1.3 million contigs and represent over one billion bases (Table 2.3). Approximately 30 to 40 million unassembled metatranscriptome reads mapped to each of these metagenomes, with the majority of these reads being rRNA (98%, Table 2.2). The remaining non-rRNA transcripts mapped to a total of 147 genes, the most abundant of which were related to housekeeping functions, e.g., RNA polymerase, chaperone proteins, and translation elongation factors. The functional profiles of the assembled metatranscriptome and metagenomes from the same site revealed that the metatranscriptome was greatly enriched in genes related to RNA and protein metabolism. In contrast, the metagenomes were enriched in genes related to carbohydrate, amino acid (and derivatives), and DNA metabolism (Figure 2.4). The overlap of functional annotations between the assembled metagenomes and the assembled metatranscriptome (i.e., at the functional level) comprised 2,413 annotations (62% of the metatranscriptome). Comparing taxonomic profiles of the metatranscriptomes to those of the metagenomes, we found that sequences associated with Proteobacteria were enriched in the metagenomes, while sequences associated with Actinobacteria were enriched in both the assembled and unassembled metatranscriptomes (Figure 2.2).
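The overlap calculation described above amounts to a set intersection between the annotation lists of the two dataset types. The sketch below is illustrative; the function name is hypothetical, and taking the 3,882 unique SEED annotations of the assembled metatranscriptome (Table 2.1) as the denominator for the reported 62% is an assumption rather than something stated in the text.

```python
def shared_annotations(metatranscriptome_annotations, metagenome_annotations):
    """Count annotations shared between two datasets.

    Returns (number shared, fraction of the metatranscriptome's unique
    annotations that are also present in the metagenome).
    """
    metatranscriptome = set(metatranscriptome_annotations)
    metagenome = set(metagenome_annotations)
    shared = metatranscriptome & metagenome
    if not metatranscriptome:
        return 0, 0.0
    return len(shared), len(shared) / len(metatranscriptome)


# Worked arithmetic under the stated assumption about the denominator.
SHARED = 2413
ASSEMBLED_SEED_UNIQUE_ANNOTATIONS = 3882
percent_shared = 100 * SHARED / ASSEMBLED_SEED_UNIQUE_ANNOTATIONS
```

Under that assumption the shared fraction is 2,413 / 3,882, or roughly 62%, which is consistent with the value reported in the text.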

Table 2.3: Summary of assembled metagenomes in number of assembled contigs and total base pairs represented in the assembly. Results of short read assembly of metagenome samples.

Assembled metagenomes  Contigs    Base pairs
MetaG Bulk             617,602    457,810,820
MetaG Rhizo1           303,353    216,957,151
MetaG Rhizo2           453,481    360,952,806
Total                  1,374,436  1,035,720,777

Discussion

Our aim was to use metatranscriptomics to assess biological information of a (normally) marginally active soil microbiome and to understand the technical and methodological challenges of this approach. Towards this end, we assessed approaches to generate a soil metatranscriptome library (e.g., mRNA enrichment), analysis approaches (e.g., de novo assembly), and the gene content of the dataset. Overall, we identified multiple expressed genes in our soil metatranscriptome (Figure 2.4), though it was largely dominated by rRNA genes as well as sequences of unknown origin and function (Table 2.2, Figure 2.3). To increase the information gleaned from soil metatranscriptomics in the future, we identify below several areas for improvement.

The abundance of rRNA in metatranscriptomes must be further reduced in order to improve sampling of mRNA from protein-coding genes. The large proportion of rRNA within our soil metatranscriptome library compromised our ability to sample deeply and consequently access more protein-coding genes. Although DSN normalization was expected to remove diverse rRNA, it did not: 83% rRNA remained in our metatranscriptome library. This fraction is comparable to that reported for a human gut metatranscriptome following subtractive hybridization [37] and for a sandy soil metatranscriptome with no rRNA removal [17]. Direct comparison of RNA extraction efficiencies in the two soils may not be appropriate because of different soil characteristics and the sampling season; their much lighter textured (sandy) soil was sampled in winter, and our medium textured (loamy) soil was sampled in late summer [38]. In general, reports on rRNA remaining based on multiple approaches and environments vary from 50 to 85% [2,37–39], evidence that rRNA removal in metatranscriptomes remains inefficient for complex communities, regardless of extraction methodology. Though DSN normalization has improved performance compared to subtractive hybridization in pure cultures, its effectiveness in high diversity soil systems remains unclear. An alternative approach is to bypass rRNA removal, sequence more deeply, and computationally remove rRNA reads. This approach becomes more feasible as sequencing prices decrease.
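The trade-off between rRNA removal and sequencing depth can be made concrete with simple arithmetic: the total number of reads required scales inversely with the fraction of reads that are not rRNA. The sketch below uses the 17% non-rRNA fraction implied by the 83% rRNA reported here and the approximately 4% mRNA share of total RNA cited in the introduction; the target of 10 million non-rRNA reads is a hypothetical value chosen only for illustration, and the function name is likewise hypothetical.

```python
import math


def total_reads_needed(target_non_rrna_reads, non_rrna_fraction):
    """Total reads to sequence so that the expected non-rRNA yield meets a target."""
    if not 0 < non_rrna_fraction <= 1:
        raise ValueError("non_rrna_fraction must be in the interval (0, 1]")
    return math.ceil(target_non_rrna_reads / non_rrna_fraction)


HYPOTHETICAL_TARGET = 10_000_000  # illustrative target, not from the study

# After DSN normalization as observed here (about 17% non-rRNA reads).
reads_with_dsn = total_reads_needed(HYPOTHETICAL_TARGET, 0.17)

# With no enrichment, assuming mRNA is about 4% of total RNA.
reads_without_enrichment = total_reads_needed(HYPOTHETICAL_TARGET, 0.04)
```

Under these assumptions, skipping enrichment would require roughly four times as many total reads as the DSN-treated library to reach the same non-rRNA yield, which is the cost that falling sequencing prices would need to offset.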

A useful result of the presence of rRNA in our metatranscriptome was that it allowed us to make taxonomic inferences about active members of the community. Unlike samples prepared using subtractive hybridization, DSN normalization preserves the relative abundance of sequences within the sample [17]. Since the relative abundance of sequences is preserved, taxonomic annotations associated with the remaining rRNA sequences (in unassembled reads) are reflective of the relative abundances in the original sample. Notably, ribosomal RNA sequences from the assembled metatranscriptome dataset were not used because assemblers typically cannot assemble highly conserved sequences like 16S rRNA genes. Therefore, as a proxy, the taxonomic classification of the most similar known homologous protein was used for community analysis of the assembled metatranscriptome. Taxonomic annotation of both metatranscriptome datasets (unassembled rRNA and assembled protein coding contigs) suggests that they share a similar taxonomic profile that contrasts with those observed in the metagenomes, highlighting the increased activity of sequences associated with Actinobacteria and diminished activity of sequences associated with Proteobacteria. This result is consistent with other findings that indicate Actinobacteria are more abundant and active in bulk soils while Proteobacteria tend to be more abundant in the rhizosphere [41,42].

The curation and availability of the RefSoil database allows for the evaluation of sequencing datasets in the context of cultivated soil organisms. Despite the diversity of soil microbial communities and the difficulty of cultivating microbial representatives, this database was surprisingly well represented within our soil metatranscriptome. Many transcripts could be aligned to RefSoil genomes, although most were associated with rRNA genes. This result suggests that the RefSoil database captures a large amount of the SSU rRNA (taxonomic) diversity in our sample. The functions contained within RefSoil were not nearly as well represented in the soil metatranscriptome, suggesting that although this database may capture much of the genus-to-species level diversity, the genetic diversity within those groups is still very large.

We found de novo assembly of this soil metatranscriptome to be an important step towards providing improved references for soil sequencing approaches, evidenced by longer sequence lengths, data reduction, improved confidence in annotation, and the development of reference sequences that do not rely on a priori information. Previously, the high diversity of soil communities has resulted in only a fraction of sequences being assembled in metagenomic studies [28]. For this soil metatranscriptome, 73.8% of the reads in the unassembled dataset mapped to our assembly, suggesting that the diversity of soil metatranscriptomes is significantly less than that of metagenomes. As a consequence, if rRNA can be efficiently removed prior to sequencing, metatranscriptomic efforts may require less sequencing depth than previously suggested by soil metagenomes. The longer sequence lengths provided by the assembly also provide higher confidence in annotations as well as the identification of multiple novel and abundant sequences. Importantly, our metatranscriptome assembly provides a specific set of genomic references that can be used for comparative soil studies. The presence of shared (highly) expressed sequences in multiple datasets can be used to prioritize encoded genes for characterization. An indirect advantage of soil metatranscriptome assembly is that it discards many rRNA-associated sequences because these sequences are difficult to assemble, allowing it to be used as a method for rRNA removal that does not rely on having known references. As datasets continue to grow in volume, assembly may become an increasingly efficient method for both improving gene annotation and removing rRNA.

We evaluated the novel information gained through our metatranscriptome by comparing our soil metatranscriptome to available metagenomes from the same plot. The majority of genes were unshared between these datasets, though the majority of encoded functions were similar. Within functional annotations, 62% of the metatranscriptome annotations were shared with the metagenomes. However, many of these were associated with rRNA genes; relatively few non-rRNA transcripts (~2.7 million) were aligned to the metagenomes, with only 3.7% of non-rRNA annotations shared. This result indicates that the metatranscriptome is functionally similar to the metagenomes but is composed of distinct genotypes with high levels of functional redundancy between members. Previous studies have also shown little overlap between metagenome and metatranscriptome libraries [42,43].

A possible explanation for this observation is a change in the microbial communities over the time (2 years) between sampling the soil metatranscriptome and metagenomes. While changes during this time are very likely, we expect that these changes are relatively small in the metagenomes, as soil microbial populations are thought to have turnover times ranging from 6.8 to 0.24 years [15,44,45]. Another possible explanation for the low overlap between samples is the soil subhabitat (metatranscriptome of bulk vs. metagenome of rhizosphere). Rhizosphere soils generally contain more active communities compared to bulk soils [10]. Differences in the sequencing depth of the metatranscriptomic and metagenomic efforts may have also contributed to differences in these datasets. A final explanation for the distinct communities identified between these sequencing efforts is that the biologically active communities in the soil may not be represented in metagenomes due to undersampling or spatial differences. In this case, metagenomic libraries may be most useful for generating gene references reflecting possible soil diversity, while metatranscriptomics may be most appropriate for targeting active communities.

Overall, our metatranscriptome was dominated by sequences that could not be associated with previously studied genes or for which no function is known (e.g., hypothetical proteins). Abundant hypothetical proteins are observed in other metatranscriptomes [40,44]. Insight into these sequences (these “known unknowns”) is necessary to determine if they generally play an important role in function. Additionally, as increasing numbers of metatranscriptomes become available, the development of novel approaches that use unsupervised classification methods to identify patterns of codon usage across microbial communities [47] or co-occurrence of sequences [48–50] within multiple datasets should prove useful in characterizing these sequences.

Conclusion

Based on our evaluation of this Miscanthus soil sample, soil metatranscriptomics holds promise for identifying actively transcribed genes in the soil. The methods for leveraging this technology still require much development to reach genes important to ecological fitness or ecosystem functions. From this relatively small sample (20 Gbp), we were able to produce an assembly that captured the majority of reads in the sequenced dataset. The resulting assembly allowed us to identify, with high confidence, several sequences similar to known genes and soil genomes that are actively transcribed and of interest to carbon cycling. The development of a soil-specific database was helpful for analyzing our soil metatranscriptome, but a large majority of the assembled sequences still lack references in databases. This does show, however, the value of expanding the soil isolate genome and physiology database. Overall, this study illustrates that metatranscriptomic sequencing can be performed on samples of field collected soil.

Acknowledgements

We thank Tamara Cole for helpful discussions regarding this manuscript and Jeff Landgraf for troubleshooting the duplex specific nuclease normalization method. This work was funded by the DOE Great Lakes Bioenergy Research Center (DOE BER Office of Science DE-FC02-07ER64494).


APPENDIX


Table 2.4: Manually curated soil-associated genomes comprising the RefSoil database.

Organism | Phylum | GenBank accession numbers
Acetobacter pasteurianus IFO 3283-01 | Proteobacteria-Alpha | AP011121, AP011122, AP011123, AP011124, AP011125, AP011126, AP011127
Acholeplasma laidlawii PG-8A | Tenericutes | CP000896
Achromobacter xylosoxidans A8 | Proteobacteria-Beta | CP002287, CP002288, CP002289
Acidithiobacillus caldus SM-1 | Proteobacteria-Gamma | CP002573, CP002574, CP002575, CP002576, CP002577
Acidithiobacillus ferrooxidans ATCC 23270 | Proteobacteria-Gamma | CP001219
Acidobacterium capsulatum ATCC 51196 | Acidobacteria | CP001472
Acidovorax avenae avenae ATCC 19860 | Proteobacteria-Beta | CP002521
Acidovorax avenae citrulli AAC00-1 | Proteobacteria-Beta | CP000512
Acidovorax ebreus TPSY | Proteobacteria-Beta | CP001392
Acidovorax sp. JS42 | Proteobacteria-Beta | CP000539, CP000540, CP000541
Acinetobacter baumannii ATCC 17978 | Proteobacteria-Gamma | CP000521, CP000522, CP000523
Acinetobacter baylyi ADP1 | Proteobacteria-Gamma | CR543861
Acinetobacter calcoaceticus PHEA-2 | Proteobacteria-Gamma | CP002177
Acinetobacter sp. DR1 | Proteobacteria-Gamma | CP002080
Actinosynnema mirum 101, DSM 43827 | Actinobacteria | CP001630
Agrobacterium sp. H13-3 | Proteobacteria-Alpha | CP002248, CP002249, CP002250


Agrobacterium tumefaciens C58- Proteobacteria- AE007869, AE007870, AE007871, UWash Alpha AE007872 CP000633, CP000634, CP000635, Agrobacterium Proteobacteria- CP000636, CP000637, CP000638, vitis S4 Alpha CP000639 Akkermansia muciniphila ATCC BAA-835 Verrucomicrobia CP001071 Alicycliphilus Proteobacteria- denitrificans BC Beta CP002449, CP002450, CP002451 Alicycliphilus Proteobacteria- denitrificans K601 Beta CP002657, CP002658 Alkalilimnicola Proteobacteria- ehrlichei MLHE-1 Gamma CP000453 Alkaliphilus metalliredigens QYMF Firmicutes CP000724 Alkaliphilus oremlandii OhILAs Firmicutes CP000853 Amycolatopsis mediterranei U32 Actinobacteria CP002000 Anabaena variabilis ATCC CP000117, CP000118, CP000119, 29413 Cyanobacteria CP000120, CP000121 Anaeromyxobacter dehalogenans 2CP- Proteobacteria- C Delta CP000251 Anaeromyxobacter Proteobacteria- sp K Delta CP001131 Anaeromyxobacter Proteobacteria- sp. Fw109-5 Delta CP000769 Arcobacter nitrofigilis DSM Proteobacteria- 7299 Epsilon CP001999 Aromatoleum Proteobacteria- aromaticum EbN1 Beta CR555306, CR555307, CR555308 Arthrobacter arilaitensis re117, CIP108037 Actinobacteria FQ311875, FQ311475, FQ311476


Arthrobacter aurescens TC1 Actinobacteria CP000474, CP000475, CP000476 Arthrobacter chlorophenolicus A6 Actinobacteria CP001341, CP001342, CP001343 Arthrobacter phenanthrenivorans Sphe3 Actinobacteria CP002379, CP002380, CP002381 Arthrobacter sp. CP000454, CP000455, CP000456, FB24 Actinobacteria CP000457 Asticcacaulis Proteobacteria- CP002395, CP002396, CP002397, excentricus CB 48 Alpha CP002398 Proteobacteria- Azoarcus sp. BH72 Beta AM406670 Azorhizobium caulinodans ORS Proteobacteria- 571 Alpha AP009384 AP010946, AP010947, AP010948, Azospirillum sp. Proteobacteria- AP010949, AP010950, AP010951, B510 Alpha AP010952 Azotobacter vinelandii DJ, ATCC Proteobacteria- BAA-1303 Gamma CP001157 Bacillus amyloliquefaciens Campbell F Firmicutes FN597644 Bacillus amyloliquefaciens FZB42 Firmicutes CP000560 Bacillus anthracis Ames Firmicutes AE016879 Bacillus anthracis Ames Ancestor A2084 (0581) Firmicutes AE017334, AE017335, AE017336 CP001746, CP001747, CP001748, Bacillus anthracis CI Firmicutes CP001749 Bacillus atrophaeus 1942 Firmicutes CP002207 Bacillus cellulosilyticus N-4, DSM 2522 Firmicutes CP002394


Bacillus cereus ATCC 10987 | Firmicutes | AE017194, AE017195
Bacillus cereus ATCC 14579 | Firmicutes | AE016877, AE016878
Bacillus clausii KSM-K16 | Firmicutes | AP006627
Bacillus coagulans 2-6 | Firmicutes | CP002472
Bacillus halodurans C-125 | Firmicutes | BA000004
Bacillus licheniformis DSM 13 Goettingen | Firmicutes | AE017333
Bacillus megaterium QM B1551 | Firmicutes | CP001983, CP001984, CP001985, CP001986, CP001987, CP001988, CP001989, CP001990
Bacillus pseudofirmus OF4 | Firmicutes | CP001878, CP001879, CP001880
Bacillus pumilus SAFR-032 | Firmicutes | CP000813
Bacillus selenitireducens MLS10 | Firmicutes | CP001791
Bacillus subtilis BSn5 | Firmicutes | CP002468
Bacillus subtilis subtilis 168 | Firmicutes | AL009126
Bacillus thuringiensis CT43 | Firmicutes | CP001907, CP001908, CP001909, CP001910, CP001911, CP001912
Bacillus thuringiensis sv. finitimus YBT-020 | Firmicutes | CP002508, CP002509, CP002510
Bacillus tusciae T2, DSM 2912 | Firmicutes | CP002017
Bacillus weihenstephanensis KBAB4 | Firmicutes | CP000903, CP000904, CP000905, CP000906, CP000907


Beijerinckia indica indica Proteobacteria- ATCC 9039 Alpha CP001016, CP001017, CP001018 Beutenbergia cavernae HKI 0122, DSM 12333 Actinobacteria CP001618 Brachybacterium faecium 6-10, DSM 4810 Actinobacteria CP001643 Bradyrhizobium japonicum USDA Proteobacteria- 110 Alpha BA000040 Bradyrhizobium Proteobacteria- sp. BTAi1 Alpha CP000494, CP000495 Bradyrhizobium Proteobacteria- sp. ORS278 Alpha CU234118 Brevibacillus brevis NBRC 100599 Firmicutes AP008955 Brevundimonas subvibrioides Proteobacteria- ATCC 15264 Alpha CP002102 Brucella microti Proteobacteria- CCM 4915 Alpha CP001578, CP001579 Burkholderia Proteobacteria- ambifaria MC40-6 Beta CP001025, CP001026, CP001027, CP001028 Burkholderia cenocepacia Proteobacteria- HI2424, BCC1 Beta CP000458, CP000459, CP000460, CP000461 Burkholderia cenocepacia MC0- Proteobacteria- 3 Beta CP000958, CP000959, CP000960 Burkholderia cepacia 383 Proteobacteria- (R18194) Beta CP000151, CP000150, CP000152 Burkholderia Proteobacteria- cepacia AMMD Beta CP000440, CP000441, CP000442, CP000443 Burkholderia Proteobacteria- CP002599, CP002600, CP002601, gladioli BSR3 Beta CP002602, CP002603, CP002604 Burkholderia Proteobacteria- CP001503, CP001504, CP001505, glumae BGR1 Beta CP001506, CP001507, CP001508


Burkholderia multivorans ATCC Proteobacteria- CP000868.1, CP000869.1, CP000870.1, 17616 Beta CP000871.1 Burkholderia phymatum Proteobacteria- STM815 Beta CP001043, CP001044, CP001045, CP001046 Burkholderia phytofirmans Proteobacteria- PsJN Beta CP001052, CP001053, CP001054 Burkholderia rhizoxinica HKI Proteobacteria- 454 Beta FR687359, FR687360, FR687361 Burkholderia sp. Proteobacteria- CCGE1001 Beta CP002519, CP002520 Burkholderia sp. Proteobacteria- CCGE1002 Beta CP002013, CP002014, CP002015, CP002016 Burkholderia thailandensis Proteobacteria- E264 Beta CP000085, CP000086 Burkholderia CP000614, CP000615, CP000616, vietnamiensis G4 Proteobacteria- CP000617, CP000618, CP000619, (R1808) Beta CP000620, CP000621 Burkholderia xenovorans Proteobacteria- LB400 Beta CP000270, CP000271, CP000272 Campylobacter Proteobacteria- lari RM2100 Epsilon CP000932, CP000933 Candidatus Accumulibacter phosphatis Type Proteobacteria- CP001715, CP001716, CP001717, IIA UW-1 Beta CP0017185, Candidatus Blochmannia Proteobacteria- floridanus Gamma BX248583 Candidatus Blochmannia pennsylvanicus Proteobacteria- BPEN Gamma CP000016 Candidatus Blochmannia Proteobacteria- vafer BVAF Gamma CP002189 Candidatus Proteobacteria- Carsonella ruddii Gamma AP009180


Candidatus Cloacamonas acidaminovorans WWE1 CU466930 Candidatus Hamiltonella Proteobacteria- defensa 5AT Gamma CP001277, CP001278 Candidatus Hodgkinia Proteobacteria- cicadicola Dsem Alpha CP001226 Candidatus Liberibacter Proteobacteria- asiaticus psy62 Alpha CP001677 Candidatus Liberibacter solanacearum Proteobacteria- CLso-ZC1 Alpha CP002371 Candidatus Methylomirabilis oxyfera NC10 FP565575 Candidatus Nitrospira defluvii Nitrospirae FP929003.1 Candidatus Phytoplasma aster yellows witches'-broom CP000061, CP000062, CP000063, AY-WB Tenericutes CP000064, CP000065 Candidatus Phytoplasma australiense Tenericutes AM422018 Candidatus Phytoplasma mali AT Tenericutes CU469464 Candidatus Phytoplasma onion yellows OY- M Tenericutes AP006628 Candidatus Protochlamydia amoebophila UWE25 Chlamydiae BX908798 Candidatus Sulcia muelleri DMIN Bacteroidetes CP001981


Candidatus Tremblaya Proteobacteria- princeps PCIT Beta CP002244 Candidatus Zinderia Proteobacteria- insecticola CARI Beta CP002161 Catenulispora acidiphila ID139908, DSM 44928 Actinobacteria CP001700 Caulobacter Proteobacteria- crescentus CB15 Alpha AE005673 Caulobacter crescentus Proteobacteria- NA1000 Alpha CP001340 Caulobacter segnis ATCC Proteobacteria- 21756 Alpha CP002008 Caulobacter sp. Proteobacteria- K31 Alpha CP000927,CP000928,CP000929 Cellulomonas flavigena 134, DSM 20109 Actinobacteria CP001964 Cellulomonas flavigena NRS 133, ATCC 484 Actinobacteria CP002666 Cellvibrio gilvus Proteobacteria- ATCC 13127 Gamma CP002665 Cellvibrio japonicus Proteobacteria- Ueda107 Gamma CP000934 Chelativorans sp. Proteobacteria- BNC1 Alpha CP000390, CP000389, CP000391, CP000392 Chitinophaga pinensis UQM 2034, DSM 2588 Bacteroidetes CP001699 Chromobacterium violaceum ATCC Proteobacteria- 12472 Beta AE016825 Citrobacter koseri Proteobacteria- ATCC BAA-895 Gamma CP000822, CP000823, CP000824 Citrobacter Proteobacteria- FN543502, FN543503, FN543504, rodentium Gamma FN543505


Clavibacter michiganensis michiganensis NCPPB 382 Actinobacteria AM711867, AM711866, AM711865 Clavibacter michiganensis sepedonicus ATCC 33113 Actinobacteria AM849034, AM849035, AM849036 Clostridium acetobutylicum ATCC 824 Firmicutes AE001437, AE001438 Clostridium acetobutylicum DSM 1731 Firmicutes CP002660, CP002661, CP002662 Clostridium beijerinckii NCIMB 8052 Firmicutes CP000721 Clostridium botulinum BoNT/B1 Okra Firmicutes CP000939, CP000940 Clostridium botulinum type A - Hall Firmicutes AM412317, AM412318 Clostridium cellulolyticum H10 Firmicutes CP001348 Clostridium cellulovorans 743B, ATCC 35296 Firmicutes CP002160 Clostridium cf. saccharolyticum K10 Firmicutes FP929037 Clostridium kluyveri DSM 555 Firmicutes CP000673, CP000674 Clostridium kluyveri NBRC 12016 Firmicutes AP009049,AP009050 Clostridium ljungdahlii PETC, DSM 13528 Firmicutes CP001666 Clostridium novyi NT Firmicutes CP000382


Clostridium perfringens 13 Firmicutes BA000016, AP003515.1 Clostridium perfringens ATCC 13124 Firmicutes CP000246 Clostridium phytofermentans ISDg Firmicutes CP000885 Clostridium saccharolyticum WM1, DSM 2544 Firmicutes CP002109 Clostridium sticklandii DSM 519 Firmicutes FP565809 Clostridium tetani Massachusetts E88 Firmicutes AE015927, AF528097.1 Clostridium thermocellum ATCC 27405 Firmicutes CP000568 Clostridium thermocellum LQ8, DSM 1313 Firmicutes CP002416 Comamonas testosteroni CNB- Proteobacteria- 1 Beta CP001220, EF079106.1 Conexibacter woesei ID131577, DSM 14684 Actinobacteria CP001854 Coraliomargarita akajimensis DSM 45221 Verrucomicrobia CP001998 Corynebacterium aurimucosum CN- 1, ATCC 700975 Actinobacteria CP001601, FM164414 Corynebacterium efficiens YS-314T Actinobacteria BA000035, AP005225, AP005226 Corynebacterium glutamicum Nakagawa, ATCC 13032 Actinobacteria BA000036


Corynebacterium glutamicum R Actinobacteria AP009044, AP009045 Corynebacterium jeikeium K411 Actinobacteria CR931997, AF401314.1 Corynebacterium kroppenstedtii DSM 44385 Actinobacteria CP001620 Corynebacterium pseudotuberculosis 1002 Actinobacteria CP001809 Corynebacterium pseudotuberculosis C231 Actinobacteria CP001829 Corynebacterium resistens DSM 45100 Actinobacteria CP002857 Corynebacterium urealyticum DSM 7109 Actinobacteria AM942444 Corynenebacterium ulcerans 809 Actinobacteria CP002790 Corynenebacterium ulcerans BR-AD22 Actinobacteria CP002791 Cronobacter sakazakii ATCC Proteobacteria- BAA-894 Gamma CP000783, CP000784, CP000785 Cronobacter Proteobacteria- FN543093, FN543094, FN543095, turicensis z3032 Gamma FN543096 Cupriavidus metallidurans Proteobacteria- CP000352, CP000353, CP000354, CH34 Beta CP000355 Cupriavidus Proteobacteria- CP000090, CP000091, CP000092, necator JMP134 Beta CP000093 Cupriavidus Proteobacteria- CP002877, CP002878, CP002879, necator N-1 Beta CP002880 Cupriavidus taiwanensis LMG Proteobacteria- 19424 Beta CU633749, CU633750, CU633751 CP001291, CP001292, CP001293, Cyanothece sp. PCC CP001294, CP001295, CP001296, 7424 Cyanobacteria CP001297,


Cyanothece sp. PCC CP001701, CP001702, CP001703, 8802 Cyanobacteria CP001704, CP001705 Cytophaga hutchinsonii ATCC 33406 Bacteroidetes CP000383 Dechloromonas Proteobacteria- aromatica RCB Beta CP000089 Dehalococcoides ethenogenes 195 Chloroflexi CP000027 Dehalococcoides sp. VS Chloroflexi CP001827 Dehalogenimonas lykanthroporepellens BL-DC-9 Chloroflexi CP002084 Deinococcus deserti CP001114, CP001115, CP001116, VCD115 Thermi CP001117 Deinococcus maricopensis LB-34, DSM 21211 Thermi CP002454 Deinococcus proteolyticus MRP, CP002536, CP002537, CP002538, DSM 20540 Thermi CP002539, CP002540 Deinococcus radiodurans USUHS AE000513, AE001825, AE001826, (R1) Thermi AE001827 Delftia acidovorans Proteobacteria- SPH-1 Beta CP000884 Proteobacteria- Delftia sp. Cs1-4 Beta CP002735 Desulfarculus baarsii Proteobacteria- 2st14, DSM 2075 Delta CP002085 Desulfitobacterium hafniense DCB-2 Firmicutes CP001336 Desulfitobacterium hafniense Y51 Firmicutes AP008230 Desulfobacca Proteobacteria- acetoxidans ASRB2 Delta CP002629


Desulfobacterium autotrophicum Proteobacteria- HRM2, DSM 3382 Delta CP001087, CP001088 Desulfobulbus propionicus 1pr3, Proteobacteria- DSM 2032 Delta CP002364 Desulfotomaculum acetoxidans 5575, DSM 771 Firmicutes CP001720 Desulfotomaculum carboxydivorans CO-1-SRB, DSM 14880 Firmicutes CP002736 Desulfotomaculum kuznetsovii 17 Firmicutes CP002770 Desulfotomaculum reducens MI-1 Firmicutes CP000612 Desulfotomaculum ruminis DL, DSM 2154 Firmicutes CP002780 Desulfovibrio aespoeensis Aspo- Proteobacteria- 2 Delta CP002431 Desulfovibrio Proteobacteria- alaskensis G20 Delta CP000112 Desulfovibrio desulfuricans desulfuricans Proteobacteria- 27774 Delta CP001358 Desulfovibrio Proteobacteria- magneticus RS-1 Delta AP010904, AP010905, AP010906 Desulfovibrio salexigens DSM Proteobacteria- 2638 Delta CP001649 Desulfovibrio Proteobacteria- vulgaris RCH1 Delta CP002297, CP002298 Desulfovibrio vulgaris vulgaris Proteobacteria- Hildenborough Delta AE017285, AE017286 Desulfurispirillum indicum S5 Chrysiogenetes CP002432


Desulfurivibrio Proteobacteria- alkaliphilus AHT2 Delta CP001940 Dickeya dadantii Proteobacteria- 3937 Gamma CP002038 Dickeya dadantii Proteobacteria- Ech703 Gamma CP001654 Dickeya zea Proteobacteria- Ech1591 Gamma CP001655 Dyadobacter fermentans NS114, DSM 18053 Bacteroidetes CP001619 Ensifer medicae Proteobacteria- WSM419 Alpha CP000738, CP000739, CP000740, CP000741 Ensifer meliloti Proteobacteria- CP002781, CP002782, CP002783, AK83 Alpha CP002784, CP002785 Ensifer meliloti Proteobacteria- BL225C Alpha CP002740, CP002741,CP002742 Enterobacter aerogenes KCTC Proteobacteria- 2190 Gamma CP002824 Enterobacter cloacae cloacae Proteobacteria- ATCC 13047 Gamma CP001918, CP001919, CP001920 Enterobacter cloacae cloacae Proteobacteria- NCTC 9394 Gamma FP929040 Enterobacter sp. Proteobacteria- 638 Gamma CP000653, CP000654 Erwinia amylovora Proteobacteria- CFBP1430 Gamma FN434113, FN434114 Erwinia amylovora Ea273, Proteobacteria- ATCC 49946 Gamma FN666575, FN666576, FN666577 Erwinia billingiae Proteobacteria- Eb661 Gamma FP236843, FP236826, FP236830, Erwinia pyrifoliae Proteobacteria- FP236842, FP236827, FP236828, Ep1/96 Gamma FP236829, FP928999 Erwinia pyrifoliae Proteobacteria- FN392235, FN392236, FN392237, Ep1/96 Gamma FN392238, FN392239


Erwinia tasmaniensis Proteobacteria- CU468135, CU468128.1, CU468130, Et1/99 Gamma CU468131, CU468132, CU468133

Escherichia coli Proteobacteria- W, ATCC 9739 Gamma CP002185, AY639886 Eubacterium cylindroides T2- 87 Firmicutes FP929041 Eubacterium limosum KIST612 Firmicutes CP002273 Eubacterium rectale M104/1 Firmicutes FP929043 Eubacterium siraeum 70/3 Firmicutes FP929044 Exiguobacterium sibiricum 255-15 Firmicutes CP001022, CP001023, CP001024 Exiguobacterium sp. AT1b Firmicutes CP001615 Flavobacterium johnsoniae UW101, ATCC 17061 Bacteroidetes CP000685

Frankia sp EuI1c Actinobacteria CP002299 Gallionella capsiferriformans Proteobacteria- ES-2 Beta CP002159 Gemmatimonas aurantiaca T-27T Gemmatimonadetes AP009153 Geobacter bemidjiensis Bem, Proteobacteria- DSM 16622 Delta CP001124 Geobacter lovleyi Proteobacteria- SZ Delta CP001089, CP001090 Geobacter metallireducens Proteobacteria- GS-15 Delta CP000148, CP000149 Geobacter sp. Proteobacteria- FRC-32 Delta CP001390 Geobacter Proteobacteria- sulfurreducens Delta AE017180


Geobacter uraniireducens Proteobacteria- Rf4 Delta CP000698 Geodermatophilus obscurus G-20, DSM 43160 Actinobacteria CP001867 Gloeobacter violaceus PCC 7421 Cyanobacteria BA000045 Gluconacetobacter diazotrophicus Proteobacteria- PAl 5, DSM 5601 Alpha AM889285, AM889286, AM889287 Gluconacetobacter diazotrophicus Proteobacteria- PAl 5,DSM 5601 Alpha CP001189, CP001190 Gluconobacter Proteobacteria- CP000009, CP000004, CP000005, oxydans 621H Alpha CP000006, CP000007, CP000008 Gordonia bronchialis 3410, DSM 43247 Actinobacteria CP001802, CP001803 Gordonibacter pamelaeae 7-10-1- bT Actinobacteria FP929047 Granulibacter bethesdensis Proteobacteria- CGDNIH1 Alpha CP000394 Granulicella tundricola CP002480, CP002481, CP002482, MP5ACTX9 Acidobacteria CP002483, CP002484, CP002485 Herbaspirillum Proteobacteria- seropedicae SmR1 Beta CP002039 Herminiimonas arsenicoxydans Proteobacteria- ULPAs1 Beta CU207211 Hyphomicrobium denitrificans ATCC Proteobacteria- 51888 Alpha CP002083 Intrasporangium calvum 7KIP, DSM 43043 Actinobacteria CP002343 Isoptericola variabilis 225 Actinobacteria CP002810


Janthinobacterium Proteobacteria- sp. Marseille Beta CP000269 Ketogulonicigenium Proteobacteria- vulgare Y25 Alpha CP002224, CP002225, CP002226 Kineococcus radiotolerans SRS30216 Actinobacteria CP000750, CP000751, CP000752 Kitasatospora setae KM-6054, NBRC 14216, DSM 43861 Actinobacteria AP010968 Klebsiella Proteobacteria- pneumoniae 342 Gamma CP000964, CP000965, CP000966 Klebsiella pneumoniae pneumoniae Proteobacteria- CP000647, CP000648, CP000649, MGH78578 Gamma CP000650, CP000651, CP000652 Klebsiella variicola Proteobacteria- At-22 Gamma CP001891 Kocuria rhizophila DC2201 Actinobacteria AP009152 Korebacter versatilis Ellin345 Acidobacteria CP000360 Kribbella flavida IFO 14399, DSM 17836 Actinobacteria CP001736 Lactobacillus brevis ATCC 367 Firmicutes CP000416, CP000417, CP000418 Leadbetterella byssophila 4M15, DSM 17132 Bacteroidetes CP002305 Legionella longbeachae Proteobacteria- NSW150 Gamma FN650140, FN650141 Legionella pneumophila Proteobacteria- 2300/99 Alcoy Gamma CP001828 Legionella Proteobacteria- pneumophila Paris Gamma CR628336, CR628338 Leifsonia xyli xyli CTCB07 Actinobacteria AE016822 Leptospira biflexa Spirochaetes CP000777,CP000778,CP000779


Leptospira borgpetersenii JB197 Spirochaetes CP000350, CP000351 Leptospira borgpetersenii L550 Spirochaetes CP000348, CP000349 Leptospira interrogans 56601 Spirochaetes AE010301, AE010300 Leptospira interrogans Fiocruz L1-130 Spirochaetes AE016823, AE016824 Leptothrix Proteobacteria- cholodnii SP-6 Beta CP001013 Leuconostoc DQ489736, DQ489737, DQ489738, citreum KM20 Firmicutes DQ489739, DQ489740 Leuconostoc gasicomitatum LMG 18811 Firmicutes FN822744 Leuconostoc kimchii CP001758, CP001753, CP001754, IMSNU11154 Firmicutes CP001755, CP001756, CP001757 Leuconostoc mesenteroides mesenteroides ATCC 8293 Firmicutes CP000414, CP000415

Leuconostoc sp. Firmicutes CP002898 Listeria monocytogenes 4b F2365 Firmicutes AE017262 Listeria monocytogenes HCC23 Firmicutes CP001175 Listeria monocytogenes M7 Firmicutes CP002816 Listeria seeligeri SLCC3954 Firmicutes FN557490 Listeria welshimeri SLCC5334 Firmicutes AM263198

Lysinibacillus sphaericus C3-41 Firmicutes CP000817, CP000818


Mesoplasma florum L1 | Firmicutes | AE017263
Mesorhizobium ciceri bv biserrulae WSM1271 | Proteobacteria-Alpha | CP002447, CP002448
Mesorhizobium loti MAFF303099 | Proteobacteria-Alpha | BA000012, BA000013, AP003017
Mesorhizobium opportunistum WSM2075 | Proteobacteria-Alpha | CP002279
Methylacidiphilum infernorum V4 | Verrucomicrobia | CP000975
Methylibium petroleiphilum PM1 | Proteobacteria-Beta | CP000555, CP000556
Methylobacterium chloromethanicum CM4 | Proteobacteria-Alpha | CP001298, CP001299, CP001300
Methylobacterium extorquens AM1 | Proteobacteria-Alpha | CP001510, CP001511, CP001512, CP001513, CP001514
Methylobacterium extorquens DM4 | Proteobacteria-Alpha | FP103042, FP103043, FP103044
Methylobacterium nodulans ORS 2060 | Proteobacteria-Alpha | CP001349, CP001350, CP001351, CP001352, CP001353, CP001354, CP001355, CP001356
Methylobacterium populi BJ001 | Proteobacteria-Alpha | CP001029, CP001030, CP001031
Methylobacterium radiotolerans JCM 2831 | Proteobacteria-Alpha | CP001001, CP001002, CP001003, CP001004, CP001005, CP001006, CP001007, CP001008, CP001009
Methylocella silvestris BL2 | Proteobacteria-Alpha | CP001280
Methylococcus capsulatus Bath | Proteobacteria-Gamma | AE017282
Methylotenera mobilis JLW8 | Proteobacteria-Beta | CP001672


Methylovorus glucosetrophus SIP3- Proteobacteria- 4 Beta CP001674, CP001675, CP001676 Methylovorus sp. Proteobacteria- MP688 Beta CP002252 Microbacterium testaceum StLB037 Actinobacteria AP012052 Microlunatus phosphovorus NM-1 Actinobacteria AP012204 Micromonospora aurantiaca ATCC 27029 Actinobacteria CP002162 Micromonospora sp. L5 Actinobacteria CP002399 Moorella thermoacetica ATCC 39073 Firmicutes CP000232 Mycobacterium abscessus CIP 104536 Actinobacteria CU458896, CU458745 Mycobacterium avium 104 Actinobacteria CP000479 Mycobacterium avium paratuberculosis K- 10 Actinobacteria AE016958 Mycobacterium bovis BCG Moreau RDJ Actinobacteria AM412059 Mycobacterium bovis BCG Tokyo 172 Actinobacteria AP010918 Mycobacterium CP000656, CP000657, CP000658, gilvum PYR-GCK Actinobacteria CP000659 Mycobacterium leprae Br4923 Actinobacteria FM211192 Mycobacterium smegmatis MC2 155 Actinobacteria CP000480 Mycobacterium tuberculosis F11 (ExPEC) Actinobacteria CP000717.1 Mycobacterium tuberculosis KZN 1435 (MDR) Actinobacteria CP001658


Mycobacterium vanbaalenii PYR-1 Actinobacteria CP000511 Myxococcus xanthus Proteobacteria- DK 1622 Delta CP000113 Nakamurella multipartita Y-104, DSM 44233 Actinobacteria CP001737 Nitrobacter Proteobacteria- CP000319, CP000320, CP000321, hamburgensis X14 Alpha CP000322 Nitrobacter Proteobacteria- winogradskyi Nb-255 Alpha CP000115 Nitrosomonas europaea ATCC Proteobacteria- 19718 Beta AL954747 Nitrosomonas Proteobacteria- eutropha C91 Beta CP000450, CP000451, CP000452 Nitrosomonas sp. Proteobacteria- AL212 Beta CP002552, CP002553,CP002554 Nitrosomonas sp. Proteobacteria- Is79A3 Beta CP002876 Nitrosospira multiformis ATCC Proteobacteria- CP000103, CP000104, CP000105, 25196 Beta CP000106 Nocardia farcinica IFM 10152 Actinobacteria AP006618, AP006619, AP006620 Nocardioides sp. JS614 Actinobacteria CP000509, CP000508 Nocardiopsis dassonvillei dassonvillei DSM 43111 Actinobacteria CP002040, CP002041

Nostoc azollae 0708 Cyanobacteria CP002059, CP002060, CP002061 Nostoc punctiforme CP001037, CP001038, CP001039, ATCC 29133 Cyanobacteria CP001040, CP001041, CP001042 BA000019, BA000020, AP003602, AP003603, AP003604, AP003605, Nostoc sp. PCC 7120 Cyanobacteria AP003606, Novosphingobium aromaticivorans DSM Proteobacteria- 12444 Alpha CP000248, CP000676, CP000677


Novosphingobium sp. Proteobacteria- FR856862, FR856859, FR856860, PP1Y Alpha FR856861 Oceanobacillus iheyensis HTE831 Firmicutes BA000028 Ochrobactrum Proteobacteria- CP000758, CP000759, CP000760, anthropi ATCC 49188 Alpha CP000761, CP000762, CP000763 Oligotropha Proteobacteria- carboxidovorans OM4 Alpha CP002821, CP002822, CP002823 Oligotropha Proteobacteria- carboxidovorans OM5 Alpha CP002826, CP002827, CP002828 Opitutus terrae PB90- 1 Verrucomicrobia CP001032 Paenibacillus mucilaginosus KNP414 Firmicutes CP002869 Paenibacillus polymyxa E681 Firmicutes CP000154 Paenibacillus polymyxa SC2 Firmicutes CP002213, CP002214 Paludibacter propionicigenes WB4, DSM 17365 Bacteroidetes CP002345 Pantoea ananatis Proteobacteria- AJ13355 Gamma AP012032, AP012033 Pantoea ananatis Proteobacteria- LMG 20103 Gamma CP001875 Proteobacteria- CP002433, CP002434, CP002435, Pantoea sp. At-9b. Gamma CP002436, CP002437, CP002438 Proteobacteria- CP002206.1,CP001893, CP001894, Pantoea vagans C9-1 Gamma CP001895 Paracoccus Proteobacteria- denitrificans PD1222 Alpha CP000489, CP000490, CP000491 Pectobacterium atrosepticum Proteobacteria- SCRI1043 Gamma BX950851 Pectobacterium Proteobacteria- carotovorum PC1 Gamma CP001657 Pectobacterium Proteobacteria- wasabiae WPP163 Gamma CP001790 Pediococcus pentosaceus Firmicutes CP000422


Pedobacter heparinus HIM 762-3, DSM 2366 Bacteroidetes CP001681 Pedobacter saltans Stey 113, DSM 12145 Bacteroidetes CP002545 Pelobacter carbinolicus DSM Proteobacteria- 2380 Delta CP000142 Pelobacter propionicus DSM Proteobacteria- 2379 Delta CP000482, CP000483, CP000484 Pirellula staleyi DSM 6068 Planctomycetes CP001848 Planctomyces brasiliensis IFAM 1448, DSM 5305 Planctomycetes CP002546 Planctomyces limnophilus Mu 290, DSM 3776 Planctomycetes CP001744, CP001745 Polaromonas CP000529, CP000530, CP000531, naphthalenivorans Proteobacteria- CP000532, CP000533, CP000534, CJ2 Beta CP000535, CP000536, CP000537

Polaromonas sp. Proteobacteria- JS666 Beta CP000316, CP000317, CP000318 Polymorphum gilvum Proteobacteria- SL003B-26A1 Alpha CP002568, CP002569 Polynucleobacter necessarius asymbioticus QLW- Proteobacteria- P1DMWA-1 Beta CP000655 Polynucleobacter necessarius Proteobacteria- necessarius STIR1 Beta CP001010 Pseudomonas Proteobacteria- aeruginosa PA7 Gamma CP000744 Pseudomonas Proteobacteria- aeruginosa PAO1 Gamma AE004091 Pseudomonas brassicacearum brassicacearum Proteobacteria- NFM421 Gamma CP002585


Pseudomonas Proteobacteria- entomophila L48 Gamma CT573326 Pseudomonas Proteobacteria- fluorescens Pf0-1 Gamma CP000094 Pseudomonas Proteobacteria- fluorescens SBW25 Gamma AM181176, AM235768.1 Pseudomonas fulva Proteobacteria- 12-X Gamma CP002727 Pseudomonas Proteobacteria- mendocina NK-01 Gamma CP002620 Pseudomonas Proteobacteria- mendocina ymp Gamma CP000680 Pseudomonas putida Proteobacteria- BIRD-1 Gamma CP002290 Pseudomonas putida Proteobacteria- F1 Gamma CP000712 Pseudomonas putida Proteobacteria- KT2440 Gamma AE015451 Pseudomonas Proteobacteria- stutzeri ATCC 17588 Gamma CP002881 Pseudomonas stutzeri CMT.A.9, Proteobacteria- DSM 4166 Gamma CP002622 Pseudomonas Proteobacteria- syringae 1448A Gamma CP000058, CP000059, CP000060 Pseudomonas Proteobacteria- syringae B728a Gamma CP000075 Pseudomonas syringae tomato Proteobacteria- DC3000 Gamma AE016853, AE016854, AE016855 Pseudoxanthomonas Proteobacteria- suwonensis 11-1 Gamma CP002446 Psychrobacter Proteobacteria- arcticus 273-4 Gamma CP000082 Psychrobacter Proteobacteria- cryohalolentis K5 Gamma CP000323, CP000324 Psychrobacter sp. Proteobacteria- PRwf-1 Gamma CP000713, CP000714, CP000715 Proteobacteria- Rahnella sp. Y9602 Gamma CP002505, CP002506, CP002507 Ralstonia eutropha Proteobacteria- H16 Beta AM260479, AM260480, AY305378


Ralstonia pickettii Proteobacteria- CP001644, CP001645, CP001646, 12D Beta CP001647, CP001648 Proteobacteria- Ralstonia pickettii 12J Beta CP001068, CP001069, CP001070 Ralstonia solanacearum Proteobacteria- CFBP2957 Beta FP885897, FP885907 Ralstonia Proteobacteria- solanacearum Po82 Beta CP002819, CP002820 CP000133, CP000134, CP000135, Rhizobium etli CFN Proteobacteria- CP000136, CP000137, CP000138, 42, DSM 11541 Alpha U80928

Rhizobium etli CIAT Proteobacteria- CP001074, CP001075, CP001076, 652 Alpha CP001077 Rhizobium leguminosarum bv. Proteobacteria- CP001622, CP001623, CP001624, trifolii WSM1325 Alpha CP001625, CP001626 , CP001627 Rhizobium AM236080, AM236081, AM236082, leguminosarum bv. Proteobacteria- AM236083, AM236084, AM236085, viciae 3841 Alpha AM236086

Rhizobium Proteobacteria- CP000628, CP000629, CP000630, rhizogenes K84 Alpha CP000631, CP000632 Rhizobium sp. Proteobacteria- NGR234 (ANU265) Alpha CP001389, CP000874, U00090 Rhodobacter Proteobacteria- capsulatus SB1003 Alpha CP001312, CP001313 CP000143, CP000144, , CP000145, Rhodobacter Proteobacteria- CP000146, CP000147, DQ232586, sphaeroides 2.4.1 Alpha DQ232587 Rhodobacter sphaeroides ATCC Proteobacteria- CP000661, CP000662, CP000663, 17025 Alpha CP000664, CP000665, CP000666 Rhodocista Proteobacteria- centenaria SW Alpha CP000613 Rhodococcus equi 103S Actinobacteria FN563149 Rhodococcus AP008957, AP008931, AP008932, erythropolis PR4 Actinobacteria AP008933 Rhodococcus jostii CP000431, CP000432, CP000433, RHA1 Actinobacteria CP000434


Rhodococcus opacus AP011115, AP011116, AP011117, B4 Actinobacteria AP011118, AP011119, AP011120 Rhodopseudomonas Proteobacteria- palustris BisB5 Alpha CP000283 Rhodopseudomonas Proteobacteria- palustris CGA009 Alpha BX571963, BX571964 Rubrobacter xylanophilus DSM 9941 Actinobacteria CP000386 Runella slithyformis CP002859, CP002860, CP002861, LSU4, DSM 19594 Bacteroidetes CP002862, CP002863, CP002864 Saccharomonospora viridis P101, DSM 43017 Actinobacteria CP001683 Saccharophagus Proteobacteria- degradans 2-40 Gamma CP000282 Saccharopolyspora erythraea NRRL2338 Actinobacteria AM420293 Salmonella bongori Proteobacteria- NCTC 12419 Gamma FR877557 Salmonella enterica Proteobacteria- Agona SL483 Gamma CP001138, CP001137 Salmonella enterica Proteobacteria- Newport SL254 Gamma CP001113, CP000604.1, CP001112 Serratia Proteobacteria- proteamaculans 568 Gamma CP000826, CP000827 Proteobacteria- Serratia sp. AS9 Gamma CP002773 Shewanella Proteobacteria- amazonensis SB2B Gamma CP000507 Shewanella Proteobacteria- denitrificans OS217 Gamma CP000302 Shewanella halifaxensis HAW- Proteobacteria- EB4 Gamma CP000931 Shewanella Proteobacteria- oneidensis MR-1 Gamma AE014299, AE014300 Shewanella Proteobacteria- putrefaciens 200 Gamma CP002457 Shewanella Proteobacteria- putrefaciens CN-32 Gamma CP000681


Shewanella sediminis Proteobacteria- HAW-EB3 Gamma CP000821 Proteobacteria- Shewanella sp. ANA-3 Gamma CP000469, CP000470 Shewanella violacea Proteobacteria- DSS12 Gamma AP011177 Sideroxydans Proteobacteria- lithotrophicus ES-1 Gamma CP001965 Solibacter usitatus Ellin6076 Acidobacteria CP000473 Sorangium Proteobacteria- cellulosum So ce 56 Delta AM746676 Sphaerobacter thermophilus 4ac11, DSM 20745 Chloroflexi CP001823, CP001824 Sphingobacterium sp. 21 Bacteroidetes CP002584 Sphingobium Proteobacteria- chlorophenolicum L-1 Alpha CP002798, CP002799, CP002800, Sphingobium Proteobacteria- AP010803, AP010804, AP010805, japonicum UT26S Alpha AP010806, AP010807 Sphingomonas Proteobacteria- wittichii RW1 Alpha CP000699, CP000700, CP000701 CP001769, CP001770, CP001771, Spirosoma linguale CP001772, CP001773, CP001774, DSM 74 Bacteroidetes CP001775, CP001776, CP001777

Stackebrandtia nassauensis LLR-40K- 21, DSM 44728 Actinobacteria CP001778 Staphylococcus aureus aureus Newman Firmicutes AP009351 Staphylococcus aureus RF122 Firmicutes AJ938182 Staphylococcus carnosus carnosus TM300 Firmicutes AM295250 Staphylococcus AE015929, AE015930, AE015931, epidermidis ATCC AE015932, AE015933, AE015934, 12228 Firmicutes AE015935


Staphylococcus epidermidis RP62A Firmicutes CP000029, CP000028 Staphylococcus haemolyticus AP006716, AP006717, AP006718, JCSC1435 Firmicutes AP006719 Staphylococcus lugdunensis HKU09- 01 Firmicutes CP001837 Staphylococcus lugdunensis N920143 Firmicutes FR870271 Staphylococcus pseudintermedius ED99 Firmicutes CP002478 Staphylococcus pseudintermedius HKU10-03 Firmicutes CP002439 Staphylococcus saprophyticus saprophyticus ATCC 15305 Firmicutes AP008934, AP008935, AP008936

Starkeya novella DSM Proteobacteria- 506 Alpha CP002026 Stenotrophomonas Proteobacteria- maltophilia K279a Gamma AM743169 Stenotrophomonas Proteobacteria- maltophilia R551-3 Gamma CP001111.1 Stigmatella Proteobacteria- aurantiaca DW4 /3-1 Delta CP002271 Streptomyces avermitilis MA-4680 Actinobacteria BA000030, AP005645.1 Streptomyces bingchenggensis BCW-1 Actinobacteria CP002047 Streptomyces flavogriseus IAF 45 CD, ATCC 33331 Actinobacteria CP002475, CP002476, CP002477 Streptomyces griseus griseus NBRC 13350 Actinobacteria AP009493 Streptomyces scabiei 87.22 Actinobacteria FN554889.1

85

Table 2.4 (cont’d)

Streptomyces venezuelae Actinobacteria FR845719 Streptosporangium roseum NI 9100, DSM 43021 Actinobacteria CP001814, CP001815 Sulfurihydrogenibium azorense Az-Fu1 Aquificae CP001229 Sulfurihydrogenibium sp. YO3AOP1 Aquificae CP001080 Sulfurospirillum deleyianum 5175, Proteobacteria- DSM 6946 Epsilon CP001816 Symbiobacterium thermophilum IAM 14863 Firmicutes AP006840 Syntrophobacter Proteobacteria- fumaroxidans MPOB Delta CP000478 Syntrophomonas wolfei Goettingen, DSM 2245B Firmicutes CP000448 Syntrophothermus lipocalidus DSM 12680 Firmicutes CP002048 Syntrophus Proteobacteria- aciditrophicus SB Delta CP000252 Terriglobus saanensis SP1PR4 Acidobacteria CP002467 Proteobacteria- Thauera sp. MZ1T Beta CP001281, CP001282 Thermobaculum terrenum YNP1, ATCC BAA-798 Chloroflexi CP001825, CP001826 Thermobifida fusca YX Actinobacteria CP000088 Thermobispora bispora R51, DSM 43833 Actinobacteria CP001874 Thiobacillus denitrificans ATCC Proteobacteria- 25259 Beta CP000116 Variovorax Proteobacteria- paradoxus EPS Beta CP002417

86

Table 2.4 (cont’d)

Variovorax Proteobacteria- paradoxus S110 Beta CP001635, CP001636 Verminephrobacter Proteobacteria- eiseniae EF01-2 Beta CP000542,CP000543 Xanthobacter Proteobacteria- autotrophicus Py2 Alpha CP000781, CP000782 Xanthomonas Proteobacteria- albilineans GPE PC73 Gamma FP565176 Xanthomonas Proteobacteria- axonopodis 306 Gamma AE008923, AE008924, AE008925 Xanthomonas campestris ATCC Proteobacteria- 33913 Gamma AE008922 Xanthomonas Proteobacteria- campestris B100 Gamma AM920689 Xanthomonas oryzae Proteobacteria- MAFF 311018 Gamma AP008229 Xanthomonas oryzae Proteobacteria- pv. oryzae PXO99A Gamma CP000967 Xenorhabdus bovienii Proteobacteria- SS-2004 Gamma FN667741 Xenorhabdus nematophila Proteobacteria- ATCC19061 Gamma FN667742, FN667743 Xylanimonas cellulosilytica XIL07, DSM 15894 Actinobacteria CP001821, CP001822 Xylella fastidiosa CVC Proteobacteria- 9a5c Gamma AE003849, AE003850, AE003851 Xylella fastidiosa Proteobacteria- Temecula1 Gamma AE009442, AE009443 Proteobacteria- AL590842, AL109969.1, AL117189.1, Yersinia pestis CO-92 Gamma AL117211.1

Yersinia pestis KIM Proteobacteria- 10 Gamma AE009952, AF074611.1 Yersinia pseudotuberculosis Proteobacteria- IP 32953 Gamma BX936398, BX936399, BX936400 Yersinia pseudotuberculosis Proteobacteria- YPIII Gamma CP000950

87

Table 2.4 (cont’d)

Zymomonas mobilis Proteobacteria- CP001722, CP001723, CP001724, mobilis NCIB 11163 Alpha CP001725 Zymomonas mobilis CP002850, CP002851, CP002852, mobilis T.H.Delft 1, Proteobacteria- CP002853, CP002854, CP002855, ATCC 10988 Alpha CP002856 Zymomonas mobilis pomaceae Barker 1, Proteobacteria- ATCC 29192 Alpha CP002865, CP002866, CP002867

Table 2.5: Most abundant functional annotations of the unassembled metatranscriptome against the SEED reference database. Annotations in this table appear as they do in the SEED reference database.

Rank | SEED Functions | Abundance
1 | hypothetical protein | 21,668
2 | Heat shock protein 60 family chaperone GroEL | 16,685
3 | DNA-directed RNA polymerase beta subunit (EC 2.7.7.6) | 11,413
4 | Translation elongation factor Tu | 10,495
5 | DNA-directed RNA polymerase beta' subunit (EC 2.7.7.6) | 8,849
6 | Translation elongation factor G | 7,646
7 | Chaperone protein DnaK | 7,644
8 | SSU ribosomal protein S1p | 6,270
9 | Aldehyde dehydrogenase (EC 1.2.1.3) | 6,034
10 | RNA polymerase sigma factor RpoD | 3,845
11 | hyphothetical protein | 3,806
12 | Iron-sulfur cluster assembly protein SufB | 3,410
13 | Glutamine synthetase type I (EC 6.3.1.2) | 3,340
14 | Cell division protein FtsH (EC 3.4.24.-) | 3,196
15 | DNA-directed RNA polymerase alpha subunit (EC 2.7.7.6) | 2,962

Table 2.6: Most abundant functional annotations of the unassembled metatranscriptome against the GenBank reference database. Annotations in this table appear as they do in the GenBank reference database.

Rank | GenBank Function | Abundance
1 | conserved hypothetical protein | 59012
2 | chaperonin GroEL | 14315
3 | DNA-directed RNA polymerase, beta subunit | 9891
4 | DNA-directed RNA polymerase, beta' subunit | 6669
5 | translation elongation factor Tu | 6144
6 | chaperone protein DnaK | 6023
7 | predicted protein | 5755
8 | translation elongation factor G | 5060
9 | DNA-directed RNA polymerase subunit beta | 3644
10 | ATPase AAA-2 domain protein | 3474
11 | LOW QUALITY PROTEIN: conserved hypothetical protein | 3107
12 | adenosylhomocysteinase | 3059
13 | ABC transporter related | 2970
14 | translation elongation factor 2 (EF-2/EF-G) | 2957
15 | SSU ribosomal protein S1P | 2450

Table 2.7: Most abundant functional annotations of the unassembled metatranscriptome against the RefSeq reference database. Annotations in this table appear as they do in the RefSeq reference database.

Rank | RefSeq Function | Abundance
1 | 18S ribosomal RNA | 477043
2 | hypothetical protein | 89899
3 | conserved hypothetical protein | 45183
4 | chaperonin GroEL | 19437
5 | DNA-directed RNA polymerase subunit beta | 12153
6 | elongation factor Tu | 9373
7 | DNA-directed RNA polymerase subunit beta' | 6977
8 | 28S ribosomal RNA | 6880
9 | elongation factor G | 6160
10 | 30S ribosomal protein S1 | 5776
11 | aldehyde dehydrogenase | 5632
12 | molecular chaperone DnaK | 4757
13 | chaperone protein DnaK | 3882
14 | DNA-directed RNA polymerase, beta subunit | 3405
15 | translation elongation factor Tu | 3175

Table 2.8: Most abundant functional annotations of the unassembled metatranscriptome against the KEGG reference database. Annotations in this table appear as they do in the KEGG reference database.

Rank | KEGG Function | Abundance
1 | hypothetical protein | 69437
2 | chaperonin GroEL | 18763
3 | DNA-directed RNA polymerase subunit beta (EC:2.7.7.6) | 8857
4 | elongation factor G | 5880
5 | elongation factor Tu (EC:3.6.5.3) | 5725
6 | 30S ribosomal protein S1 | 5510
7 | DNA-directed RNA polymerase subunit beta' (EC:2.7.7.6) | 4748
8 | molecular chaperone DnaK | 4330
9 | aldehyde dehydrogenase | 3403
10 | S-adenosyl-L-homocysteine hydrolase (EC:3.3.1.1) | 2710
11 | ABC transporter related | 2628
12 | DNA-directed RNA polymerase subunit beta | 2622
13 | elongation factor Tu | 2617
14 | chaperone protein DnaK | 2351
15 | ATPase | 2348

Table 2.9: 50 RefSoil genomes with the greatest number of metatranscriptome reads mapping

Genbank Accession No. | Avg Median Coverage (bp) | # of annotated regions similar to transcriptome | Description
CP000252 | 7459 | 1 | Syntrophus aciditrophicus SB
AE017282 | 3454 | 4 | Methylococcus capsulatus str. Bath
FR856862 | 2407 | 12 | Novosphingobium sp. PP1Y
AP010904 | 1126 | 3 | Desulfovibrio magneticus RS-1
AE015929 | 808 | 2 | Staphylococcus epidermidis ATCC 12228
AP012204 | 747 | 20 | Microlunatus phosphovorus NM-1
CP002472 | 458 | 16 | Bacillus coagulans 2-6
CP002629 | 243 | 1 | Desulfobacca acetoxidans DSM 11109
AE015927 | 128 | 4 | Clostridium tetani E88
CP000382 | 93 | 10 | Clostridium novyi NT
CP000352 | 70 | 1 | Cupriavidus metallidurans CH34
BA000035 | 59 | 2 | Corynebacterium efficiens YS-314
CP000783 | 40 | 3 | Cronobacter sakazakii ATCC BAA-894
AE016958 | 30 | 2 | Mycobacterium avium subsp. paratuberculosis K-10
CP000822 | 17 | 5 | Citrobacter koseri ATCC BAA-895
CP002213 | 11 | 1 | Paenibacillus polymyxa SC2
AE017263 | 11 | 1 | Mesoplasma florum L1
CP000061 | 10 | 2 | Candidatus Phytoplasma aster yellows witches'-broom AY-WB
CP000509 | 9 | 76 | Nocardioides sp. JS614
AP006628 | 8 | 2 | Candidatus Phytoplasma onion yellows OY-M
FP565176 | 7 | 1 | Xanthomonas albilineans GPE PC73
AE016877 | 6 | 3 | Bacillus cereus ATCC 14579
CP000903 | 6 | 2 | Bacillus weihenstephanensis KBAB4
CP001983 | 5 | 29 | Bacillus megaterium QM B1551
CP001854 | 5 | 8 | Conexibacter woesei DSM 14684
CP000512 | 5 | 1 | Acidovorax citrulli AAC00-1
CP002343 | 5 | 32 | Intrasporangium calvum DSM 43043
CP002000 | 5 | 10 | Amycolatopsis mediterranei U32
BA000012 | 5 | 6 | Mesorhizobium loti MAFF303099
FP929003 | 5 | 4 | Candidatus Nitrospira defluvii
CP000813 | 4 | 3 | Bacillus pumilus SAFR-032
AP012052 | 4 | 40 | Microbacterium testaceum StLB037
CP002810 | 4 | 2 | Isoptericola variabilis 225
CP002821 | 4 | 1 | Oligotropha carboxidovorans OM4
CP001736 | 4 | 43 | Kribbella flavida DSM 17836
CP001630 | 4 | 16 | Actinosynnema mirum DSM 43827
CP001700 | 4 | 2 | Catenulispora acidiphila DSM 44928
CP001341 | 4 | 2 | Arthrobacter chlorophenolicus A6
AE017194 | 4 | 1 | Bacillus cereus ATCC 10987
CP000656 | 3 | 16 | Mycobacterium gilvum PYR-GCK
FN554889 | 3 | 47 | Streptomyces scabiei 87.22
AP009493 | 3 | 5 | Streptomyces griseus subsp. griseus NBRC 13350
AM746676 | 3 | 9 | Sorangium cellulosum So ce56
CP001821 | 3 | 3 | Xylanimonas cellulosilytica DSM 15894
BA000040 | 3 | 48 | Bradyrhizobium japonicum USDA 110
CP001867 | 3 | 40 | Geodermatophilus obscurus DSM 43160
CP002665 | 3 | 10 | [Cellvibrio] gilvus ATCC 13127
CP002279 | 3 | 8 | Mesorhizobium opportunistum WSM2075
AE016822 | 3 | 4 | Leifsonia xyli subsp. xyli str. CTCB07
CP002162 | 3 | 22 | Micromonospora aurantiaca ATCC 27029

Table 2.10: RefSoil genomes with metatranscriptome reads mapping to the most unique genes

Genbank Accession No. | # of annotated regions similar to transcriptome | Avg Median Coverage (bp) | Description
CP000509 | 76 | 9 | Nocardioides sp. JS614
BA000040 | 48 | 3 | Bradyrhizobium japonicum USDA 110
FN554889 | 47 | 3 | Streptomyces scabiei 87.22
CP001736 | 43 | 4 | Kribbella flavida DSM 17836
CP000454 | 42 | 3 | Arthrobacter sp. FB24
BA000030 | 41 | 3 | Streptomyces avermitilis MA-4680
CP001867 | 40 | 3 | Geodermatophilus obscurus DSM 43160
AP012052 | 40 | 4 | Microbacterium testaceum StLB037
CP002343 | 32 | 5 | Intrasporangium calvum DSM 43043
CP001635 | 32 | 3 | Variovorax paradoxus S110
CP001983 | 29 | 5 | Bacillus megaterium QM B1551
CP000511 | 25 | 3 | Mycobacterium vanbaalenii PYR-1
CP002162 | 22 | 3 | Micromonospora aurantiaca ATCC 27029
CP002666 | 22 | 2 | Cellulomonas fimi ATCC 484
CP000480 | 22 | 3 | Mycobacterium smegmatis str. MC2 155
CP002399 | 21 | 3 | Micromonospora sp. L5
AP012204 | 20 | 747 | Microlunatus phosphovorus NM-1
CP000555 | 18 | 2 | Methylibium petroleiphilum PM1
CP000474 | 18 | 3 | Arthrobacter aurescens TC1
CP001630 | 16 | 4 | Actinosynnema mirum DSM 43827
CP000656 | 16 | 3 | Mycobacterium gilvum PYR-GCK
CP002472 | 16 | 458 | Bacillus coagulans 2-6
CP001737 | 15 | 3 | Nakamurella multipartita DSM 44233
FR845719 | 14 | 3 | Streptomyces venezuelae ATCC 10712
FR856862 | 12 | 2407 | Novosphingobium sp. PP1Y
CP002417 | 12 | 3 | Variovorax paradoxus EPS
CP001013 | 11 | 3 | Leptothrix cholodnii SP-6
CP002000 | 10 | 5 | Amycolatopsis mediterranei U32
CP002665 | 10 | 3 | [Cellvibrio] gilvus ATCC 13127
AP010968 | 10 | 2 | Kitasatospora setae KM-6054
CP000382 | 10 | 93 | Clostridium novyi NT
AM746676 | 9 | 3 | Sorangium cellulosum So ce56
CU234118 | 9 | 3 | Bradyrhizobium sp. ORS 278
CP002047 | 9 | 3 | Streptomyces bingchenggensis BCW-1
CP001854 | 8 | 5 | Conexibacter woesei DSM 14684
CP000699 | 8 | 2 | Sphingomonas wittichii RW1
CP000115 | 8 | 3 | Nitrobacter winogradskyi Nb-255
CP002279 | 8 | 3 | Mesorhizobium opportunistum WSM2075
CP002379 | 7 | 3 | Arthrobacter phenanthrenivorans Sphe3
CP000494 | 6 | 2 | Bradyrhizobium sp. BTAi1
CP001814 | 6 | 2 | Streptosporangium roseum DSM 43021
CP000319 | 6 | 3 | Nitrobacter hamburgensis X14
CP002475 | 6 | 3 | Streptomyces flavogriseus ATCC 33331
CP001096 | 6 | 2 | Rhodopseudomonas palustris TIE-1
BA000012 | 6 | 5 | Mesorhizobium loti MAFF303099
FN563149 | 5 | 3 | Rhodococcus equi 103S
CP000283 | 5 | 2 | Rhodopseudomonas palustris BisB5
CP000316 | 5 | 3 | Polaromonas sp. JS666
AP009493 | 5 | 3 | Streptomyces griseus subsp. griseus NBRC 13350
CP002447 | 5 | 3 | Mesorhizobium ciceri biovar biserrulae WSM1271

Table 2.11: Most abundant functional annotations of the assembled metatranscriptome against the SEED reference database. Annotations in this table appear as they do in the SEED reference database.

Rank | SEED Functions | Abundance
1 | hypothetical protein | 256 556
2 | hyphothetical protein | 45 493
3 | Retron-type reverse transcriptase | 24 961
4 | Cell wall-associated hydrolase | 3 636
5 | FOG: WD40 repeat | 1 213
6 | Heat shock protein 60 family chaperone GroEL | 1 207
7 | Hypothetical ORF | 1 042
8 | predicted protein | 1 004
9 | DNA-directed RNA polymerase beta subunit (EC 2.7.7.6) | 948
10 | Translation elongation factor Tu | 728
11 | SSU ribosomal protein S1p | 689
12 | DNA-directed RNA polymerase beta' subunit (EC 2.7.7.6) | 598
13 | Aldehyde dehydrogenase (EC 1.2.1.3) | 587
14 | Translation elongation factor G | 499
15 | lipoprotein, putative | 438

Table 2.12: Most abundant functional annotations of the assembled metatranscriptome against the GenBank reference database. Annotations in this table appear as they do in the GenBank reference database.

Rank | GenBank Functions | Abundance
1 | conserved hypothetical protein | 277 402
2 | hypothetical protein BACCAP_03833 | 67 029
3 | hypothetical protein BACCAP_04473 | 67 029
4 | predicted protein | 49 548
5 | hypothetical protein HMPREF9529_01276 | 26 513
6 | hypothetical protein BACUNI_00158 | 20 088
7 | hypothetical protein BACUNI_02471 | 20 088
8 | unknown | 15 933
9 | hypothetical protein RAZWK3B_00595 | 10 419
10 | hypothetical protein RAZWK3B_11306 | 10 419
11 | cell wall-associated hydrolase | 7 180
12 | LOW QUALITY PROTEIN: hypothetical protein SSBG_02741 | 5 110
13 | LOW QUALITY PROTEIN: hypothetical protein SSBG_06429 | 5 110
14 | conserved domain protein | 4 878
15 | hypothetical protein ANACOL_02136 | 4 130

Table 2.13: Most abundant functional annotations of the assembled metatranscriptome against the RefSeq reference database. Annotations in this table appear as they do in the RefSeq reference database.

Rank | RefSeq Functions | Abundance
1 | hypothetical protein | 773 813
2 | conserved hypothetical protein | 190 198
3 | 3,4-dihydroxy-2-butanone-4-phosphate synthase | 5 462
4 | Senescence-associated protein | 4 171
5 | ORF58e | 3 813
6 | chaperonin GroEL | 1 070
7 | GLP_748_1200_211 | 984
8 | Putative protein of unknown function; overexpression confers resistance to the antimicrobial peptide MiAMP1 | 927
9 | DNA-directed RNA polymerase subunit beta | 665
10 | putative cytoplasmic protein | 506
11 | multi-sensor hybrid histidine kinase | 501
12 | elongation factor Tu | 475
13 | hypothetical | 440
14 | 30S ribosomal protein S1 | 433
15 | methane monooxygenase | 402

Table 2.14: Most abundant functional annotations of the assembled metatranscriptome against the KEGG reference database. Annotations in this table appear as they do in the KEGG reference database.

Rank | KEGG Functions | Abundance
1 | hypothetical protein | 509 073
2 | Senescence-associated protein | 4 171
3 | hypothetical protein LOC100337426 | 1 523
4 | chaperonin GroEL | 1 107
5 | Putative protein of unknown function; overexpression confers resistance to the antimicrobial peptide MiAMP1 | 927
6 | putative cytoplasmic protein | 506
7 | DNA-directed RNA polymerase subunit beta (EC:2.7.7.6) | 500
8 | multi-sensor hybrid histidine kinase | 493
9 | hypothetical LOC783710 | 468
10 | hypothetical protein LOC100335677 | 468
11 | hypothetical protein LOC100336571 | 468
12 | hypothetical protein LOC100336585 | 468
13 | hypothetical protein LOC100337004 | 468
14 | hypothetical protein LOC100337158 | 468
15 | 30S ribosomal protein S1 | 405

Table 2.15: RFam abundance (based on base pair coverage) of the assembled metatranscriptome.

RFam Annotation | RFam ID number | Abundance
5_8S_rRNA | RF00002 | 46,190.5
tmRNA | RF00023 | 44,298.0
PK-G12rRNA | RF01118 | 36,375.5
RNaseP_bact_a | RF00010 | 18,232.0
Bacteria_small_SRP | RF00169 | 11,915.0
Bacteria_large_SRP | RF01854 | 11,728.5
5S_rRNA | RF00001 | 6,622.0
c-di-GMP-I | RF01051 | 2,458.5
Metazoa_SRP | RF00017 | 1,716.5
Fungi_SRP | RF01502 | 1,486.5
Archaea_SRP | RF01857 | 1,456.0
beta_tmRNA | RF01850 | 1,446.0
tRNA | RF00005 | 824.5
Plant_SRP | RF01855 | 725.5
6S | RF00013 | 356.0
SSU_rRNA_bacteria | RF00177 | 225.0
RNaseP_bact_b | RF00011 | 200.0
RNaseP_arch | RF00373 | 148.0
SSU_rRNA_archaea | RF01959 | 129.0
alpha_tmRNA | RF01849 | 63.0
ydaO-yuaA | RF00379 | 56.0
RNaseP_nuc | RF00009 | 49.0
group-II-D1D4-2 | RF01999 | 33.0
c-di-GMP-II | RF01786 | 31.5
Intron_gpII | RF00029 | 28.0
group-II-D1D4-6 | RF02005 | 21.0
GOLLD | RF02032 | 19.0
speF | RF00518 | 16.0
Alpha_RBS | RF00140 | 16.0
Afu_309 | RF01512 | 15.0
U3 | RF00012 | 13.0
U2 | RF00004 | 12.0
cspA | RF01766 | 11.0
ROSE | RF00435 | 9.0
Cobalamin | RF00174 | 9.0
Intron_gpI | RF00028 | 9.0
Glycine | RF00504 | 7.0
group-II-D1D4-7 | RF02012 | 7.0
group-II-D1D4-4 | RF02003 | 6.0
RNase_MRP | RF00030 | 6.0
HEARO | RF02033 | 6.0
group-II-D1D4-3 | RF02001 | 5.0
FMN | RF00050 | 4.0
Rhizobiales-2 | RF01723 | 4.0
Hammerhead_3 | RF00008 | 4.0
msiK | RF01747 | 3.0
SAH_riboswitch | RF01057 | 3.0
CrcZ | RF01675 | 3.0
T-box | RF00230 | 3.0
suhB | RF00519 | 2.0
Acido-Lenti-1 | RF01687 | 1.0

Table 2.16: Top 50 CAZy annotations by number of contigs. Most abundant CAZy annotations by the number of contigs without regard to the abundance or number of reads mapping to the contigs.

CAZy enzyme class | # of Contigs Matching | Abundance
GT2 | 72 | 3.5
GH36 | 41 | 3.2
CBM13 | 36 | 3.5
GH18 | 34 | 4.9
CBM33 | 30 | 5.4
CE11 | 29 | 2.4
CE1 | 26 | 2.2
GH13 | 22 | 3.1
GT4 | 20 | 2.6
GT51 | 20 | 2.4
GH76 | 18 | 8.4
CE10 | 18 | 4.1
CBM2 | 15 | 7.4
CBM47 | 15 | 3.1
GH16 | 14 | 2.2
CBM50 | 13 | 3.8
GT41 | 13 | 2.5
GH23 | 12 | 2.8
GH35 | 11 | 6.2
GH15 | 9 | 3.0
CBM14 | 8 | 19.1
GT22 | 8 | 7.3
CBM5 | 8 | 2.6
GT35 | 8 | 2.3
CBM32 | 7 | 2.9
GH5 | 6 | 4.2
GT87 | 6 | 4.2
CBM48 | 6 | 2.5
GH3 | 6 | 2.3
GH6 | 5 | 2.8
CE9 | 5 | 2.6
GT26 | 5 | 2.6
CBM20 | 5 | 2.4
CBM12 | 5 | 2.0
CE4 | 5 | 2.0
GH28 | 5 | 2.0
GH92 | 4 | 3.3
GT20 | 4 | 2.3
GT55 | 4 | 2.3
GT1 | 4 | 2.0
GH9 | 3 | 3.3
GH73 | 3 | 2.7
GH48 | 3 | 2.3
CBM3 | 3 | 2.2
GH1 | 3 | 2.0
GH38 | 3 | 2.0
GH43 | 3 | 2.0
GH19 | 2 | 137.0
GH17 | 2 | 47.0
GH102 | 2 | 7.3

Table 2.17: Top 50 CAZy annotations by abundance. Abundance of CAZy annotation by the number of reads mapping to annotated contigs.

CAZy enzyme class | Abundance | # of Contigs Matching
GH19 | 137.0 | 2
GH17 | 47.0 | 2
CBM14 | 19.1 | 8
GH76 | 8.4 | 18
CBM2 | 7.4 | 15
GT22 | 7.3 | 8
GH102 | 7.3 | 2
GH35 | 6.2 | 11
CBM33 | 5.4 | 30
GH66 | 5.0 | 1
GH18 | 4.9 | 34
GH5 | 4.2 | 6
GT87 | 4.2 | 6
CE10 | 4.1 | 18
GH14 | 4.0 | 1
PL11 | 4.0 | 1
CBM50 | 3.8 | 13
GT2 | 3.5 | 72
CBM13 | 3.5 | 36
GH9 | 3.3 | 3
GH92 | 3.3 | 4
GH36 | 3.2 | 41
GH13 | 3.1 | 22
CBM47 | 3.1 | 15
GH15 | 3.0 | 9
CBM1 | 3.0 | 1
GT77 | 3.0 | 1
CBM32 | 2.9 | 7
GH6 | 2.8 | 5
GH23 | 2.8 | 12
GH73 | 2.7 | 3
CBM5 | 2.6 | 8
GT4 | 2.6 | 20
CE9 | 2.6 | 5
GT26 | 2.6 | 5
GT41 | 2.5 | 13
CBM48 | 2.5 | 6
CBM43 | 2.5 | 2
GH62 | 2.5 | 2
CE11 | 2.4 | 29
CBM20 | 2.4 | 5
GT51 | 2.4 | 20
GH3 | 2.3 | 6
GH48 | 2.3 | 3
GT35 | 2.3 | 8
GT20 | 2.3 | 4
GT55 | 2.3 | 4
GH16 | 2.2 | 14

REFERENCES

1. Lesniewski RA, Jain S, Anantharaman K, Schloss PD, Dick GJ (2012) The metatranscriptome of a deep-sea hydrothermal plume is dominated by water column methanotrophs and lithotrophs. ISME J 6: 2257–2268. Available: http://www.ncbi.nlm.nih.gov/pubmed/22695860. Accessed 26 October 2012.

2. Gifford SM, Sharma S, Booth M, Moran MA (2013) Expression patterns reveal niche diversification in a marine microbial assemblage. ISME J 7: 281–298. Available: http://www.ncbi.nlm.nih.gov/pubmed/22931830. Accessed 28 February 2013.

3. Gilbert J, Meyer F, Schriml L (2010) Metagenomes and metatranscriptomes from the L4 long-term coastal monitoring station in the Western English Channel. Stand genomic …: 183–193. Available: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3035373/. Accessed 19 September 2012.

4. Hilton JA, Satinsky BM, Doherty M, Zielinski B, Zehr JP (2014) Metatranscriptomics of N2-fixing cyanobacteria in the Amazon River plume. ISME J 9: 1557–1569. Available: http://www.nature.com/doifinder/10.1038/ismej.2014.240.

5. Baldrian P, Kolařík M, Stursová M, Kopecký J, Valášková V, et al. (2012) Active and total microbial communities in forest soil are largely different and highly stratified during decomposition. ISME J 6: 248–258. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3260513&tool=pmcentrez&rendertype=abstract. Accessed 5 March 2012.

6. Geisen S, Tveit AT, Clark IM, Richter A, Svenning MM, et al. (2015) Metatranscriptomic census of active protists in soils. ISME J: 1–13. Available: http://www.nature.com/doifinder/10.1038/ismej.2015.30.

7. Takasaki K, Miura T, Kanno M, Tamaki H, Hanada S, et al. (2013) Discovery of glycoside hydrolase enzymes in an avicel-adapted forest soil fungal community by a metatranscriptomic approach. PLoS One 8: e55485. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3564753&tool=pmcentrez&rendertype=abstract. Accessed 16 September 2013.

8. Yergeau E, Schoondermark-Stolk SA, Brodie EL, Déjean S, DeSantis TZ, et al. (2009) Environmental microarray analyses of Antarctic soil microbial communities. ISME J 3: 340–351. Available: http://www.ncbi.nlm.nih.gov/pubmed/19020556. Accessed 12 March 2012.

9. Ofek-Lalzar M, Sela N, Goldman-Voronov M, Green SJ, Hadar Y, et al. (2014) Niche and host-associated functional signatures of the root surface microbiome. Nat Commun 5: 1–9. Available: http://dx.doi.org/10.1038/ncomms5950.

10. Yergeau E, Sanschagrin S, Maynard C, St-Arnaud M, Greer CW (2013) Microbial expression profiles in the rhizosphere of willows depend on soil contamination. ISME J: 1–15. Available: http://www.ncbi.nlm.nih.gov/pubmed/24067257. Accessed 6 November 2013.

11. Turner TR, Ramakrishnan K, Walshaw J, Heavens D, Alston M, et al. (2013) Comparative metatranscriptomics reveals kingdom level changes in the rhizosphere microbiome of plants. ISME J: 1–11. Available: http://www.nature.com/doifinder/10.1038/ismej.2013.119. Accessed 19 July 2013.

12. Foley JA, DeFries R, Asner GP, Barford C, Bonan G, et al. (2005) Global consequences of land use. Science 309: 570–574. Available: http://www.ncbi.nlm.nih.gov/pubmed/16040698. Accessed 6 November 2013.

13. Gans J, Wolinsky M, Dunbar J (2005) Computational improvements reveal great bacterial diversity and high metal toxicity in soil. Science 309: 1387–1390. Available: http://www.ncbi.nlm.nih.gov/pubmed/16123304. Accessed 19 March 2012.

14. Rodriguez-R LM, Konstantinidis KT (2013) Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics: 1–7. Available: http://www.ncbi.nlm.nih.gov/pubmed/24123672. Accessed 10 December 2013.

15. Whitman W (1998) Prokaryotes: the unseen majority. Proc … 95: 6578–6583. Available: http://www.pnas.org/content/95/12/6578.full.pdf&embedded=true. Accessed 23 July 2014.

16. Neidhardt F, Umbarger H (1996) Chemical composition of Escherichia coli.

17. Yi H, Cho Y-J, Won S, Lee J-E, Jin Yu H, et al. (2011) Duplex-specific nuclease efficiently removes rRNA for prokaryotic RNA-seq. Nucleic Acids Res 39: e140. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3203590&tool=pmcentrez&rendertype=abstract. Accessed 22 March 2012.

18. Heaton EA, Dohleman FG, Long SP (2008) Meeting US biofuel goals with less land: the potential of Miscanthus. Glob Chang Biol 14: 2000–2014. Available: http://doi.wiley.com/10.1111/j.1365-2486.2008.01662.x. Accessed 29 February 2012.

19. Zhou J, Bruns M, Tiedje J (1996) DNA recovery from soils of diverse composition. Appl Environ … 62. Available: http://aem.asm.org/content/62/2/316.short. Accessed 11 February 2014.

20. Brady S (2007) Construction of soil environmental DNA cosmid libraries and screening for clones that produce biologically active small molecules. Nat Protoc 2: 1297–1305.

21. Schmieder R, Lim YW, Edwards R (2012) Identification and removal of ribosomal RNA sequences from metatranscriptomes. Bioinformatics 28: 433–435. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3268242&tool=pmcentrez&rendertype=abstract. Accessed 16 September 2013.

22. Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, et al. (2011) Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res 39: D141–5. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3013711&tool=pmcentrez&rendertype=abstract. Accessed 18 September 2013.

23. Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, et al. (2008) The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9: 386. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2563014&tool=pmcentrez&rendertype=abstract. Accessed 3 March 2013.

24. Wilke A, Harrison T, Wilkening J, Field D, Glass EM, et al. (2012) The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools. BMC Bioinformatics 13: 141. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3410781&tool=pmcentrez&rendertype=abstract. Accessed 25 September 2013.

25. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2690996&tool=pmcentrez&rendertype=abstract. Accessed 4 October 2012.

26. Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841–842. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2832824&tool=pmcentrez&rendertype=abstract. Accessed 20 September 2013.

27. Brown CT, Howe A, Zhang Q, Pyrkosz A, Brom T (2012) A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. arXiv Prepr arXiv …: 1–18. Available: http://ged.msu.edu/downloads/2012-diginorm.pdf. Accessed 11 February 2014.

28. Howe AC, Jansson J, Malfatti S (2012) Assembling large, complex environmental metagenomes. Available: http://adsabs.harvard.edu/abs/2012arXiv1212.2832C. Accessed 11 February 2014.

29. Pell J, Hintze A, Canino-Koning R (2012) Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc … I: 1–12. Available: http://buonmathuot.vn/ws/r/www.pnas.org/content/109/33/13272.full. Accessed 25 April 2013.

30. Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18: 821–829. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2336801&tool=pmcentrez&rendertype=abstract. Accessed 17 September 2013.

31. Treangen TJ, Sommer DD, Angly FE, Koren S, Pop M (2011) Next generation sequence assembly with AMOS. Curr Protoc Bioinformatics Chapter 11: Unit 11.8. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3072823&tool=pmcentrez&rendertype=abstract. Accessed 20 September 2013.

32. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22: 1658–1659. Available: http://www.ncbi.nlm.nih.gov/pubmed/16731699. Accessed 28 February 2013.

33. Cantarel BL, Coutinho PM, Rancurel C, Bernard T, Lombard V, et al. (2009) The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics. Nucleic Acids Res 37: D233–8. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2686590&tool=pmcentrez&rendertype=abstract. Accessed 19 September 2013.

34. Li R, Zhu H, Ruan J, Qian W, Fang X, et al. (2010) De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 20: 265–272. doi:10.1101/gr.097261.109.

35. Simpson JT, Durbin R (2010) Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26: i367–73. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2881401&tool=pmcentrez&rendertype=abstract. Accessed 16 September 2013.

36. Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, et al. (2013) Rfam 11.0: 10 years of RNA families. Nucleic Acids Res 41: D226–32. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3531072&tool=pmcentrez&rendertype=abstract. Accessed 19 February 2014.

37. Leimena MM, Ramiro-Garcia J, Davids M, van den Bogert B, Smidt H, et al. (2013) A comprehensive metatranscriptome analysis pipeline and its validation using human small intestine microbiota datasets. BMC Genomics 14: 530. Available: http://www.biomedcentral.com/1471-2164/14/530. Accessed 6 August 2013.

38. Urich T, Lanzén A, Qi J, Huson DH, Schleper C, et al. (2008) Simultaneous assessment of soil microbial community structure and function through analysis of the meta-transcriptome. PLoS One 3: e2527. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2424134&tool=pmcentrez&rendertype=abstract. Accessed 6 August 2013.

39. Gifford SM, Sharma S, Rinta-Kanto JM, Moran MA (2011) Quantitative analysis of a deeply sequenced marine microbial metatranscriptome. ISME J 5: 461–472. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3105723&tool=pmcentrez&rendertype=abstract. Accessed 9 March 2012.

40. Singh BK, et al. (2007) Influence of grass species and soil type on rhizosphere microbial community structure in grassland soils. Applied Soil Ecology 36(2-3): 147–155.

41. Ridl J, et al. (2016) Plants Rather than Mineral Fertilization Shape Microbial Community Structure and Functional Potential in Legacy Contaminated Soil. Frontiers in Microbiology 7.

42. Frias-Lopez J, Shi Y, Tyson GW, Coleman ML, Schuster SC, et al. (2008) Microbial community gene expression in ocean surface waters. Proc Natl Acad Sci U S A 105: 3805–3810. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2268829&tool=pmcentrez&rendertype=abstract.

43. Gilbert JA, Field D, Huang Y, Edwards R, Li W, et al. (2008) Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PLoS One 3: e3042. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2518522&tool=pmcentrez&rendertype=abstract. Accessed 1 March 2012.

44. Paul EA, Voroney RP (1984) Field interpretation of microbial biomass activity measurements. Curr Perspect Microb Ecol p 509-514.

45. Clark FE, Paul EA (1970) The Microflora of Grassland. Adv Agron. doi:10.1016/S0065-2113(08)60273-4.

46. Shi Y, Tyson GW, DeLong EF (2009) Metatranscriptomics reveals unique microbial small RNAs in the ocean’s water column. Nature 459: 266–269. Available: http://www.ncbi.nlm.nih.gov/pubmed/19444216. Accessed 9 March 2012.

47. Roller M, Lucic V, Nagy I, Perica T, Vlahovicek K (2013) Environmental shaping of codon usage and functional adaptation across microbial communities. Nucleic Acids Res: 1–11. Available: http://www.ncbi.nlm.nih.gov/pubmed/23921637. Accessed 20 September 2013.

48. Raes J, Bork P (2008) Molecular eco-systems biology: towards an understanding of community function. Nat Rev Microbiol 6: 693–699. Available: http://www.ncbi.nlm.nih.gov/pubmed/18587409.

49. Fuhrman JA (2009) Microbial community structure and its functional implications. Nature 459: 193–199. Available: http://www.ncbi.nlm.nih.gov/pubmed/19444205. Accessed 1 March 2012.

50. Williams RJ, Howe A, Hofmockel K (2014) Demonstrating Microbial Co-occurrence Pattern Analyses Within and Between Ecosystems. doi:10.3389/fmicb.2014.00358.

Chapter 3:

Using a multi-omics approach to identify active microbial functions in the rhizosphere of switchgrass

Abstract

Rhizosphere microbial communities provide many ecologically important services, such as regulating the biogeochemical cycles of carbon, nitrogen, phosphorus and iron. They are also known to aid in plant growth and defense. Advances in high-throughput sequencing have allowed the use of metagenomics to survey the functional potential of microbial communities without culture-based bias. To assay the functional activity of rhizosphere microbial communities we used a multi-omics approach combining metagenomics, metatranscriptomics and metaproteomics. In this article we establish a minimum functional core from the multi-omics data collected from the rhizosphere microbial community associated with the biofuel crop switchgrass (Panicum virgatum) grown in agricultural soil. The minimum functional core is defined by annotations found in both metagenomic and metatranscriptomic data from the switchgrass rhizosphere, which represent ubiquitous and dominant functions found in the microbial community. We compare the minimum functional core to rhizoplane metagenomes from switchgrass, Miscanthus (Miscanthus × giganteus) and corn (Zea mays) to determine whether the minimum functional core is representative of the field. Functions related to cellular maintenance and housekeeping were highly abundant within the minimum functional core, while functions related to biogeochemical cycling and plant growth promotion were present in low abundance. Specifically, we found evidence that, at the time of sampling, the most abundant nitrogen cycling processes were related to ammonia assimilation. Phosphate metabolism was a highly active component of the phosphorus cycle. Because of their low abundance and their great importance to the community, these biogeochemical functions likely represent keystone functions. Carbon cycling enzymes related to glycoside hydrolases and lignin breakdown were abundant, especially in the metaproteome. The multi-omics approach used in this article enabled the identification of active microbial processes from field-collected rhizosphere soil.

Introduction

Soil microbes in the rhizosphere play a key role in environmental processes such as the cycling of carbon, nitrogen, sulfur, phosphorus and iron [1-3]. Rhizosphere microbes also aid plant growth and development through protection from pathogens [4, 5], liberation of micronutrients [1] and secretion of plant growth-promoting compounds [6, 7].

Metagenomic surveys of soil microbial communities have been instrumental in identifying the presence of microbial functions of environmental importance [8, 9]. However, metagenomic surveys cannot discriminate between active, dormant or dead cells. In a study of microbial dormancy, 80% of microbial cells and 60% of operational taxonomic units (OTUs) in soil microbial communities were identified as dormant [10]. These data indicate that a significant portion of the information gathered from metagenomic studies comes from non-active (dead or dormant) cells.

To characterize the contribution of the microbial community to biogeochemical cycles and the types of plant-microbe interactions taking place in soil, it is paramount to know which functions are actively being carried out. In this study we used a multi-omics approach, taking into account metagenomic data as functional potential, metatranscriptomic data to assess transcriptional activity and metaproteomic data to assess translated proteins. We used this multi-omics approach to provide an integrated view of the stages of microbial community activity in the rhizosphere of switchgrass (Panicum virgatum), a crop of major interest for biofuel production.

Because of the complexity of the multi-omic soil data, we identified a minimum functional core representing the dominant functions found in both the metagenome and expressed in the metatranscriptome data. This provides several benefits. First, it identifies the functional activity of the microbial community carried out in the majority of samples, which provides an overview of general activity. For example, many functions related to central metabolism are present in the minimum functional core while there are few functions related to cell division. Second, it relieves the effect of undersampling the microbial community, which occurs because of the extensive genetic diversity in soil, i.e. in the terabase range [11, 12]. Current sequencing technologies and computational methods cannot accommodate this level of deep sequencing. As a result of undersampling the metagenomic data are incomplete [13]. Third, the minimum functional core will identify functions necessary for microbial community survival in agricultural soil. These core functions will best characterize the functional diversity present at the site and identify likely central functions in soil microbial community activities of the sampled area [14].

While this core approach will undoubtedly include many housekeeping genes it should also capture information about the most abundant, ecologically important, non-housekeeping functions. This approach has been successfully applied to the human microbiome [15] where the authors were able to identify functional genes related to housekeeping functions as well as functional genes potentially specific to the gut.


The metaproteomic data were used to assess which of the annotations in the minimum functional core were translated as well as transcribed. We also compared the minimum functional core to metagenomes from field-collected rhizoplane soils associated with two other major biofuel candidate crops, corn (Zea mays) and Miscanthus (Miscanthus giganteus), to determine whether the minimum functional core is broadly representative of the field site. While housekeeping functions are expected to be prominent in the multi-omics data set, these functions will provide insight into the state of cellular maintenance of the microbial community at the time of sampling. More central to this study is the examination of microbial community functions related to carbon and plant nutrient cycling and to plant-microbe interactions, since these processes are central to a sustainable and renewable-fuel production ecosystem.

Methods

Soil collection

All soil samples were collected from the Great Lakes Bioenergy Research Center's (GLBRC) Biofuel Cropping Systems Experiment (BCSE) at the Kellogg Biological Station on July 31, 2013 at midday, a time near maximum plant photosynthesis. The sky was cloudless and the soil was moist from an 11 mm rain 2 days before. Rhizosphere samples used for metatranscriptomics, metagenomics and metaproteomics were collected from switchgrass plot G5R2 (http://lter.kbs.msu.edu/research/long-term-experiments/glbrc-intensive-experiment). Three switchgrass root systems were dug up and vigorously shaken to remove excess soil. Each root system was then placed in a sterile bag and vigorously shaken again. The soil that fell into the bag was subdivided into Whirl-Pak bags and placed in liquid nitrogen for rapid freezing. This process was completed in less than 5 min to minimize transcript turnover.

Rhizoplane samples for metagenome sequencing were collected at the same time as the rhizosphere samples, from corn, Miscanthus and switchgrass plants in plots G1R2 through R4, G6R2 through R4 and G5R2 through R4, respectively. Plants were dug up and the root systems were placed in gallon bags that were placed on ice for transport to the laboratory, where they were stored at -20°C. At the R2 sites three samples were collected from two adjacent replicate plants. At the R3 and R4 sites two samples were collected from two adjacent replicate plants. Later, excess soil and macroaggregates were removed from the roots so that only a very thin film of soil and microaggregates surrounding the roots remained. Each sample, which contained approximately 5 g of root and soil material, was placed in phosphate buffer. Samples were gently shaken to remove soil attached to the roots. The samples were spun down to pellet the suspended soil particles and microbes. Root material was carefully removed from the soil pellet, and the soil was saved for DNA extraction.

Sample preparation and sequencing

RNA was extracted from the three replicates of rhizosphere soil using the MoBio PowerSoil RNA extraction kit (MoBio, Carlsbad, CA). Samples were treated with DNase (Invitrogen, Carlsbad, CA) to remove any potential contaminating DNA. Sample quality was checked by NanoDrop and RNA was quantified using the Qubit RNA quantification kit (Invitrogen, Carlsbad, CA). The three RNA samples collected from switchgrass are termed SRT1, SRT2 and SRT3. DNA was extracted from the same aliquot of rhizosphere soil as was used for the metatranscriptome samples. Approximately 0.5 g of soil was used for DNA extraction with the MoBio PowerSoil DNA kit (MoBio, Carlsbad, CA) according to the manufacturer's protocols. Sample quality was checked by NanoDrop and DNA was quantified using the Qubit DNA quantification kit (Invitrogen, Carlsbad, CA). The three rhizosphere DNA samples are termed SRG1, SRG2 and SRG3.

DNA was also extracted from an additional 21 samples collected from the rhizoplanes of corn, Miscanthus and switchgrass (seven replicates for each plant, termed C1-7, M1-7 and S1-7, respectively). Both DNA and RNA samples were sequenced by the Joint Genome Institute (JGI) in Walnut Creek, CA. Ribosomal RNA subtraction was performed by JGI on the metatranscriptome samples using the RiboZero kit [16]. All samples were sequenced using the HiSeq-1TB. All samples were filtered for quality and trimmed to remove adapters using BBDuk (ktrim=r, k=25, mink=12, tpe=t, tbo=t, qtrim=10, maq=10, maxns=3, minlen=50). Reads were then filtered for artifacts using BBDuk (k-16). For metatranscriptomic samples, rRNA was removed by mapping to the Silva database with BBMap (fast=t, minid=0.90, local=t). Reads were then assembled using MEGAHIT (v 0.2.0) [16] (--cpu-only -m 100e9 --k-max 123 -l 155).

Metaproteome sample preparation and characterization¹

¹ This work was performed by Angela M Norbeck, Carrie Nicora, Sam Purvine and Ljiljana Paša-Tolić.

Indirect extraction

Metaproteomic sample preparation and data analysis were performed at the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory in Richland, WA. Rhizosphere soil was sieved through a 35mm mesh and weighed into 20 g aliquots in 50 mL tubes with 20 mL of ice-cold phosphate buffered saline (PBS), pH 8. The samples were kept on ice and homogenized at full speed with a hand-held OMNI tool and disposable probes (OMNI, Kennesaw, GA) for 30 s, allowed to cool and homogenized again. The samples were then centrifuged at 2,500 x g for 5 min at 4°C to remove large soil particulates. The supernatants were transferred to a fresh 50 mL tube. Again, 20 mL of buffer was added to the soil pellet and the samples were homogenized and centrifuged as described previously. The supernatants were combined and centrifuged at 10,000 x g for 15 min at 4°C to pellet the intact microbial cells. The cell pellet was washed with 1 mL of ammonium bicarbonate buffer, pH 8.0 (ABC), and transferred into a 2 mL snap-cap centrifuge tube (Eppendorf, Hamburg, Germany). The sample was then centrifuged at 10,000 x g for 10 min at 4°C, the supernatant was removed, 200 µl of ABC was added along with 0.1 mm zirconia beads, and the sample was bead beaten in a Bullet Blender (Next Advance, Averill Park, NY) at speed 8 for 3 min at 4°C. After bead beating, the lysate was spun into a 15 mL Falcon tube at 2,000 x g for 10 min at 4°C.

The sample was removed to a clean tube and a methanol/chloroform extraction was done to separate the protein, metabolites and lipids. Ice-cold (-20°C) chloroform:methanol mix (prepared 2:1 (v/v)) was added to the sample in a 5:1 ratio over sample volume and vigorously vortexed. The sample was then placed on ice for 5 min and vortexed for 10 s, followed by centrifugation at 10,000 x g for 10 min at 4°C. The upper, water-soluble metabolite phase was collected into a glass vial, the lower, lipid-soluble phase was collected into another fresh glass vial, and both samples were dried completely in a SpeedVac and then stored at -80°C until analysis. The remaining protein interlayer was washed with 100% ice-cold methanol and, after pelleting, placed in a fume hood to dry. The protein pellet was solubilized by adding up to 100 µl of SDS-Tris buffer (4% SDS, 100 mM DTT in 100 mM Tris-HCl, pH 8.0), gently sonicated into solution and then added to the microbial pellet. The solution was incubated at 95°C for 5 min to reduce and denature the protein and allowed to cool at 4°C for 10 min.

Filter Aided Sample Preparation (FASP) [17] kits (Expedeon, San Diego, CA) were used for protein digestion according to the manufacturer's instructions. Briefly, 400 µl of 8 M urea (all reagents included in the kit) was added to each 500 µl 30K molecular weight cut off (MWCO) FASP spin column, up to 100 µl of the sample in SDS buffer was added, and the column was centrifuged at 14,000 x g for 30 min to bring the sample to the dead volume. The waste was removed from the bottom of the tube, another 400 µl of 8 M urea was added to the column, and the column was centrifuged again at 14,000 x g for 30 min; this step was repeated once more. Each column was then washed with 400 µl of 50 mM ABC and centrifuged for 30 min, twice. The column was placed into a fresh, clean, labeled collection tube. Digestion solution was made by dissolving 4 µg trypsin in 75 µl of 50 mM ABC solution and was added to the sample. Each sample was incubated for 3 h at 37°C with 800 rpm shaking on a thermomixer with a thermotop (Eppendorf, Hamburg, Germany) to reduce condensation into the cap. Additional ABC (40 µl) was added to the filter and the resultant peptides were centrifuged through the filter and into the collection tube at 14,000 x g for 15 min, repeated twice. The peptides were then concentrated to ~30 µl using a SpeedVac and stored in vials until analysis. Final peptide concentrations were determined using a bicinchoninic acid (BCA) assay (Thermo Scientific, Waltham, MA, USA).


Metatranscriptome and metagenome peptide analysis

Three contig files were converted to amino acids via six-frame translation using Python, with 50 amino acids as the shortest sequence allowed. All sequences were converted to tab-delimited text using Protein Digestion Simulator (PDS) (http://omics.pnl.gov/software/protein-digestion-simulator) and imported into Microsoft SQL Server 2008. Contig source names were appended to the contig names to eliminate name collisions downstream. Assuming at most one protein sequence per contig, the longest resultant sequence per contig was retained. For all protein collections, 16 common contaminant protein sequences were added, including porcine and bovine trypsin, human and bovine serum albumin, and commonly observed keratin sequences. For all metaproteome searches, the fasta files were split into 25 roughly equally sized files to accommodate the memory limitations of the MSGFPlus search program. Top-scoring identifications for each searched MS/MS spectrum were retained for the final output.
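The translation step described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names, the splitting of each frame at stop codons and the handling of ambiguous bases (translated to X) are assumptions made for illustration. Only the six-frame translation, the 50 amino acid minimum and the retention of the longest sequence per contig come from the text.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_peptides(dna, min_len=50):
    """Return stop-free peptides of at least min_len residues from all six frames."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    peptides = []
    for strand in (dna, reverse_complement):
        for offset in range(3):
            # Assumption: each frame is split at stop codons ('*').
            for piece in translate(strand[offset:]).split("*"):
                if len(piece) >= min_len:
                    peptides.append(piece)
    return peptides


def longest_peptide(dna, min_len=50):
    """Keep only the longest peptide per contig, or None if none qualifies."""
    peptides = six_frame_peptides(dna, min_len)
    return max(peptides, key=len) if peptides else None
```

Calling `longest_peptide(contig_sequence)` on each contig would give at most one protein sequence per contig, matching the assumption stated in the text.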

Metatranscriptome and metagenome search and peptide identification

Spectra to peptide identification: Peptide mass spectra (MS/MS) were searched against the metatranscriptome using the MS-GF+ algorithm [18], accepting MSGF scores [19] of less than or equal to 1e-12. This yielded a false discovery rate (FDR) for the entire set of 0.81%. In the metagenome searches, an MSGF score threshold of 1e-12 was used, resulting in an FDR of 2.47%. Peptide sequences and redundant matches to protein parents are reported. Because of the large file sizes, the fasta files containing sequences were split into 25 pieces and separate searches were performed on a distributed CPU system. The results were then merged for each metaproteome or transcriptome.


Metatranscriptomic and metagenomic sequence data analysis

To determine the median number of reads mapping to the assembled contigs in each sample, JGI quality-filtered reads were mapped to the assembled contigs using bowtie2 (v2.0.0-beta6, [20]) with the following default parameters: end-to-end alignment, minimum score threshold for 100 bp reads of -60.6, -D 100, distinct alignments for each read. Median base pair coverage was estimated using BedTools [21]. All contigs shorter than 300 bp were removed. Samples were submitted to MG-RAST (v3.6, [22]) for annotation using the assembled pipeline and no other filtering methods or quality controls.
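The contig length filter mentioned above can be sketched in a few lines of Python. This is an illustrative sketch, not the pipeline used in the study: the function names and the plain FASTA parser are assumptions, and only the 300 bp cutoff comes from the text.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_contigs(lines, min_len=300):
    """Keep contigs whose length is at least min_len (shorter ones are removed)."""
    return [(name, seq) for name, seq in read_fasta(lines) if len(seq) >= min_len]
```

Passing an open file handle, for example `filter_contigs(open(path))` with a hypothetical `path`, would return the contigs that survive the cutoff.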

To examine carbon cycling genes more closely, a BLAST search (blastx, e-value cutoff 1e-5) was used to identify contigs that could be annotated using the Carbohydrate-Active Enzymes (CAZy) database [23] (code used can be found at: https://github.com/Garoutte/Chapter_3/tree/master).

Defining the minimum functional core and its representation of the field site's dominant genetic composition

Annotations from the three rhizosphere metagenomes were compared. Annotations present in two of the three samples were considered core. Presence in two of three samples was chosen as the criterion for two reasons. First, the metatranscriptome sample SRT-1 had a high percentage of rRNA sequences, resulting in many fewer non-rRNA contigs (Table 3.1). Second, we are undoubtedly undersampling the soil's genetic composition; requiring a function to be present in all samples would therefore be too stringent and would result in a lack of diverse metabolic functions. The same selection criterion, presence in two of the three replicates, was also applied to the three rhizosphere metatranscriptomes, all of which had similar sequence yields. The metatranscriptome and metagenome cores were compared; functions found in both cores were considered to comprise the minimum functional core. This process was carried out for annotations based on the SEED Subsystems, RefSeq and CAZy databases (code used can be found at: https://github.com/Garoutte/Chapter_3/tree/master).

A functional core was also established for each of the three rhizoplane metagenome plant treatments: corn, switchgrass and Miscanthus. Annotations found in five of the seven rhizoplane replicate metagenomes were considered core. We chose this more stringent cutoff for the rhizoplane samples to enhance the rigor of the core while still allowing for undersampling. The minimum functional core was compared to each rhizoplane functional core to determine whether the minimum functional core is broadly representative of the microbial functional diversity of the field site.
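The core definitions in this section reduce to set operations, as the following Python sketch shows. It is not taken from the repository cited above: the function names are hypothetical, and "present in two of the three samples" is interpreted here as present in at least two, which is an assumption.

```python
from collections import Counter


def replicate_core(replicates, min_present):
    """Annotations found in at least min_present of the replicate annotation sets."""
    counts = Counter()
    for annotations in replicates:
        counts.update(set(annotations))  # count each annotation once per replicate
    return {name for name, n in counts.items() if n >= min_present}


def minimum_functional_core(metagenomes, metatranscriptomes, min_present=2):
    """Annotations that are core in both the metagenomes and the metatranscriptomes."""
    metagenome_core = replicate_core(metagenomes, min_present)
    metatranscriptome_core = replicate_core(metatranscriptomes, min_present)
    return metagenome_core & metatranscriptome_core
```

Under the same assumptions, a rhizoplane core for one crop would be `replicate_core(seven_replicates, min_present=5)`, where `seven_replicates` is a hypothetical list of seven annotation sets.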


Table 3.1: Summary of switchgrass rhizosphere metagenome (SRG) and metatranscriptome (SRT) sequence yield and its assembly and annotation.

Sample  Total reads  Non-rRNA reads  Assembled contigs  Percent assembled (a)  Annotated contigs (percent)  Reads mapping to MetaG contigs
SRT-1   246,895,742  68,949,934      440,213            81.23%                 45%                          10.92%
SRT-2   284,791,354  166,978,397     1,825,857          80.34%                 46.3%                        27.70%
SRT-3   397,351,240  250,124,715     2,237,997          82.42%                 46.90%                       22.90%
SRG-1   298,716,384  NA (b)          6,606,700          40.82%                 68.70%                       NA (c)
SRG-2   338,846,620  NA              4,076,354          44.87%                 68.7%                        NA
SRG-3   298,364,910  NA              6,207,377          40.93%                 68.7%                        NA

(a) For SRT samples, percent assembled is based on non-rRNA reads.
(b) Metagenome samples did not have rRNA removed before sequencing; therefore the "Non-rRNA reads" for these samples are marked as NA.
(c) Metagenome samples were not mapped to the metatranscriptome.
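As a quick arithmetic check on Table 3.1, the short Python sketch below recomputes the share of non-rRNA reads in each metatranscriptome and the summed read totals. The read counts are copied from the table; the variable and function names are illustrative only.

```python
# (total reads, non-rRNA reads) per metatranscriptome sample, from Table 3.1
metatranscriptome_reads = {
    "SRT-1": (246_895_742, 68_949_934),
    "SRT-2": (284_791_354, 166_978_397),
    "SRT-3": (397_351_240, 250_124_715),
}
# total reads per metagenome sample, from Table 3.1
metagenome_reads = {
    "SRG-1": 298_716_384,
    "SRG-2": 338_846_620,
    "SRG-3": 298_364_910,
}


def non_rrna_percent(total, non_rrna):
    """Percentage of reads that remain after rRNA removal."""
    return 100 * non_rrna / total


total_non_rrna = sum(non_rrna for _, non_rrna in metatranscriptome_reads.values())
total_metagenome = sum(metagenome_reads.values())

for sample, (total, non_rrna) in metatranscriptome_reads.items():
    print(f"{sample}: {non_rrna_percent(total, non_rrna):.1f}% non-rRNA reads")
print(f"non-rRNA metatranscriptome reads: {total_non_rrna:,}")
print(f"metagenome reads: {total_metagenome:,}")
```

The per-sample percentages make the lower non-rRNA yield of SRT-1 relative to SRT-2 and SRT-3 explicit.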


Results

Building minimum functional core

To define the minimum functional core, a metagenomic core and a metatranscriptomic core were created separately and then combined. Because samples used for metatranscriptomics must be frozen quickly to prevent mRNA turnover, the samples used to define the minimum functional core were taken from the rhizosphere of switchgrass rather than from the rhizoplane. Collecting rhizoplane samples requires time-consuming removal of roots from the plant root system and is therefore not amenable to metatranscriptomic sampling. The metatranscriptome samples comprised approximately 486 million non-rRNA reads and the metagenome samples approximately 935 million reads (Table 3.1). The switchgrass rhizosphere metatranscriptome (SRT) and switchgrass rhizosphere metagenome (SRG) samples have similar rank abundance curves (Figure 3.1), indicating that the occurrence of their annotations in the core is similar. Both curves have a similar slope, indicating a similar level of evenness across both core datasets. This shows that a variety of functions are transcribed at varying abundance levels, whereas one might have expected a few transcripts to be extremely abundant and the remaining transcripts to be present at very low abundance. The datasets do differ in some respects: the SRT slope is steeper, indicating fewer annotations with high abundance, while the SRG line extends farther than the SRT line, indicating a greater number of annotations. For the metaproteomes, however, the lines are much shorter and the slopes steeper than those of the metagenome and metatranscriptome samples. This indicates that the metaproteome samples contain fewer annotations and have a greater disparity in abundance.
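For readers unfamiliar with rank abundance curves, the following Python sketch shows one way such a curve can be computed from annotation counts. It is a generic illustration, not the plotting code behind Figure 3.1; the function name and the use of fractional relative abundance are assumptions.

```python
def rank_abundance(counts):
    """Return (rank, relative abundance) pairs, most abundant annotation first.

    counts maps each functional annotation to its abundance in one data set.
    """
    total = sum(counts.values())
    ranked = sorted(counts.values(), reverse=True)
    return [(rank, value / total) for rank, value in enumerate(ranked, start=1)]
```

The length of the returned list corresponds to how far a line extends along the annotation axis, and how quickly the values fall off with rank corresponds to the steepness of the curve.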


Figure 3.1: Rank abundance curve of multi-omics subsystem annotations. The metaproteome data set is smaller than metagenome and metatranscriptome data sets as indicated by their shorter lines in the MetaP-MetaG and MetaP-MetaT samples.

[Figure 3.1 plot: x-axis, Functional Annotations (0 to 10,000); y-axis, Relative Abundance; series: MetaG, MetaT, MetaP-MetaG, MetaP-MetaT.]

Near-complete sampling of soil microbial communities is calculated to require terabytes of sequencing [24]; hence we are likely undersampling with our current dataset. To minimize undersampling effects, we defined the metagenome and metatranscriptome functional cores by the presence of a functional annotation in two of the three samples. The minimum functional core was then composed of the functional annotations present in both the metagenome and metatranscriptome cores. Table 3.2 shows the number of core functional annotations for the SEED Subsystems, RefSeq and CAZy databases. The metagenome core is larger than the metatranscriptome core for all annotation databases. The minimum functional core accounts for 99% of the abundance of the SEED Subsystems annotations, 92% of RefSeq annotations and 99% of CAZy annotations. The RefSeq minimum functional core is much larger than the SEED Subsystems and CAZy minimum functional cores because RefSeq annotations are finer in scale, leading to more redundancy in functional annotations. The CAZy minimum functional core is much smaller than those of the other annotation databases because the CAZy database is specific to proteins that act on carbohydrates. From this point on, "minimum functional core" refers to the minimum functional core derived from the SEED Subsystems unless specified otherwise.


Table 3.2: Summary of minimal core annotations. The SEED Subsystems is a hierarchical database, which annotates gene functions not specific genes. RefSeq database annotates specific genes from model organisms. The Carbohydrate Active Enzyme database (CAZy) specifically annotates enzymes related to synthesis, metabolism and transport of carbohydrates.

Reference database  Number of annotations*: SRT  Number of annotations*: SRG  Minimum functional core (MFC) annotations**  Percent of MFC annotation abundance represented by reference database
SEED Subsystems     8,180                        9,729                        7,781                                        0.99
RefSeq              38,672                       85,193                       27,988                                       0.92
CAZy                380                          410                          375                                          0.99

* Represents the number of annotations in the core of each data type.
** Represents the combination of the SRG and SRT functional cores.

Metaproteome characterization and core comparison

Analysis of the metaproteome data sets found 460 unique SEED Subsystem annotations with a total abundance of 876,429 in the metatranscriptome-derived metaproteome and 766 unique SEED Subsystem annotations with a total abundance of 607,281 in the metagenome-derived metaproteome. The rank abundance curves of the metaproteome data (Figure 3.1) are fairly steep, indicating that a few proteins are highly abundant, although the metagenome-derived metaproteome is more diverse.

When compared to the minimum functional core, 448 of the 460 SEED subsystem annotations from the metatranscriptome-derived metaproteome are found within the core, while 727 of the 766 SEED subsystem annotations from the metagenome-derived metaproteome are found within the minimum functional core. Of the 12 annotations from the metatranscriptome-derived metaproteome that are not found in the minimum functional core, most are eukaryotic or archaeal ribosomal proteins. These 12 annotations represent only 0.25% of the total relative abundance. These data indicate that our sequencing effort was insufficient to adequately sample the archaeal and eukaryotic portion of the microbial community. The 39 annotations from the metagenome-derived metaproteome that are not found in the minimum functional core come from 15 different SEED subsystems and represent only 0.3% of the total relative abundance. The metaproteome derived from the metatranscriptome contained 102 CAZy annotations, all of which were found in the CAZy minimum functional core, while the metaproteome derived from the metagenome contained 271 CAZy annotations, all but one of which were found in the minimum functional core.

Comparison of the minimum functional core to rhizoplane functional cores

Twenty-one metagenomic samples originating from corn, Miscanthus and switchgrass rhizoplane soils, seven from each plant, were compared to the minimum functional core to determine whether the minimum functional core is representative of the functional diversity of the field site or specific to the crop. Samples averaged approximately 363 million bases each and formed an average of 5.5 million contigs (Table 3.4). Rhizoplane core functions were defined as those found in five of seven replicates for each plant. This criterion is more stringent than the criterion of two out of three used in the construction of the minimum functional core.

Of the 7,781 SEED Subsystem functional annotations present in the minimum functional core, between 97.8% and 98.4% were found in the core of each of the three plant metagenomes (Table 3.3). The minimum functional core captured approximately 99.3% of the total annotation abundance of each plant, indicating that the minimum functional core is representative of the functional diversity of the field site irrespective of the plant. For the RefSeq minimum functional core, composed of 27,988 functional annotations, between 92.9% and 94.1% were found in the core of each of the three plant metagenomes. The RefSeq functional annotations represent approximately 91% of the annotation abundance of the rhizoplane metagenomes.
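The two percentages reported for each crop can be reproduced with a short Python sketch. The function and argument names are hypothetical, and the interpretation of "percent of core" as the share of minimum functional core annotations that also occur in the crop core is an assumption, chosen because it matches the reported values (for corn, 7,628 of 7,781).

```python
def core_overlap(minimum_core, crop_core, crop_abundance):
    """Return (percent of core, percent abundance captured) for one crop.

    minimum_core   -- set of annotations in the minimum functional core
    crop_core      -- set of annotations in the crop-specific rhizoplane core
    crop_abundance -- dict mapping every crop annotation to its abundance
    """
    shared = minimum_core & crop_core
    percent_of_core = 100 * len(shared) / len(minimum_core)
    total = sum(crop_abundance.values())
    captured = sum(n for name, n in crop_abundance.items() if name in minimum_core)
    return percent_of_core, 100 * captured / total


# Worked check against the corn SEED Subsystems values in Table 3.3.
corn_percent_of_core = round(100 * 7628 / 7781, 1)
```

Here `corn_percent_of_core` evaluates to 98.0, in agreement with the corn row for SEED Subsystems.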


Table 3.3: Summary of minimum functional core annotations found in rhizoplane metagenomes of three crops. Minimum core represents the functions found in five out of seven samples. Percent core represents the percent of the crop specific core that is found within our established minimum functional core. Percent abundance captured represents the abundance of the crop specific samples found in the minimum functional core.

SEED Subsystems
Crop         Minimum Core  Percent of Core  Percent Abundance Captured
Corn         7628          98.0%            99.3%
Miscanthus   7606          97.8%            99.4%
Switchgrass  7656          98.4%            99.3%

RefSeq
Crop         Minimum Core  Percent of Core  Percent Abundance Captured
Corn         26106         93.3%            91.2%
Miscanthus   25993         92.9%            91.7%
Switchgrass  26333         94.1%            91.3%


Characterization of the minimum functional core of switchgrass

Abundant functions within the minimum functional core are related to housekeeping processes

The two subsystems with the greatest number of functions within the minimum functional core are Carbohydrates and Clustering-based subsystems, with 1151 and 1049 functional annotations, respectively (Figure 3.2). The Carbohydrates subsystem includes functions related to central metabolism, which can be classified as housekeeping functions, as well as functions related to the utilization of organic compounds, which are important when considering carbon cycling in the rhizosphere. The Clustering-based subsystem comprises genes that evidence suggests belong together but that have no known function. This subsystem therefore represents "known unknowns". Many other subsystems also contain functional annotations that represent housekeeping functions; these include Amino Acids and Derivatives, Protein Metabolism and RNA Metabolism. Functions within these subsystems are related to the expression of proteins, which is an important process for all microbes. The Protein Metabolism subsystem is the only one in which the number of functions in the metatranscriptome core is greater than in the metagenome core, 561 and 474, respectively.


Figure 3.2: Diversity of switchgrass rhizosphere core functions by subsystem. Number of annotations in the minimum functional core as annotated by the SEED Subsystems database.

[Figure 3.2 plot: x-axis, SEED Subsystem; y-axis, Number of Functional Annotations (0 to 1600); series: Core, MetaG core, MetaT core.]

When the relative abundance of functions in the minimum functional core is taken into account (Figure 3.3), Carbohydrates and Clustering-based subsystems remain the most abundant in the metagenome, with 14.3% and 13.9% relative abundance, respectively. In the metatranscriptome, the Clustering-based subsystems, Protein Metabolism and Carbohydrates have the greatest relative abundance, 15%, 14.6% and 12.2%, respectively. The Protein Metabolism and RNA Metabolism subsystems are the most abundant in the metatranscriptome-derived metaproteome, with 45% and 19.6% relative abundance, respectively. The most abundant subsystems in the metagenome-derived metaproteome are RNA Metabolism and Respiration, with relative abundances of 28.7% and 22.4%, respectively. Considering the important role that functions within the Protein Metabolism, RNA Metabolism and Respiration subsystems have in basic cellular maintenance, their abundance in the metatranscriptome and metaproteome is not surprising. Taken together, these data indicate that the rhizosphere microbial community is actively carrying out housekeeping functions related to transcription and translation.


Figure 3.3: Relative abundance of switchgrass rhizosphere core multi-omics data by SEED Subsystem annotations. Relative abundance is averaged across each of the three replicates.

[Figure 3.3 plot: x-axis, SEED Subsystem; y-axis, Average Relative Abundance (0 to 0.5); series: MetaG, MetaT, MetaP-MetaT, MetaP-MetaG.]
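The averaging described in the caption of Figure 3.3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the analysis code: the function name is hypothetical, and normalizing each replicate to relative abundance before averaging (rather than pooling counts first) is an assumption.

```python
def mean_relative_abundance(replicates):
    """Average per-replicate relative abundance for each subsystem.

    replicates is a list of dicts, one per replicate, mapping a SEED
    subsystem name to its summed annotation abundance in that replicate.
    """
    subsystems = set().union(*replicates)
    means = {}
    for subsystem in subsystems:
        fractions = [
            replicate.get(subsystem, 0) / sum(replicate.values())
            for replicate in replicates
        ]
        means[subsystem] = sum(fractions) / len(fractions)
    return means
```

A subsystem missing from a replicate is treated as having zero abundance there, so it still contributes to the mean.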

Functions of ecological importance within the minimum functional core

Rhizosphere microbes are known to play an important role in many ecologically important functions such as biogeochemical cycling and plant growth and defense. Subsystems representing biogeochemical cycling include Carbohydrates, Nitrogen Metabolism, Phosphorus Metabolism, and Iron Acquisition and Metabolism. The Secondary Metabolism subsystem contains functions related to plant defense and growth promotion. With the exception of Carbohydrates, all of these subsystems have a low relative abundance (>1%) in the minimum functional core (Figure 3.3). However, the minimum functional core accounts for the majority of the relative abundance of each subsystem, ranging from 97% to 99.8%. While these functions of ecological importance have low relative abundance in our samples compared to other, more ubiquitous housekeeping functions, they are present and active in the minimum functional core.

Nitrogen Metabolism is one of the most important microbial community functions for plant growth. Major subsystem categories within the Nitrogen Metabolism subsystem found in the minimum functional core include Allantoin Utilization, Ammonia Assimilation, Denitrification, Nitrate and Nitrite Ammonification, Nitrogen Fixation and Nitrosative Stress. The nitrogen fixation functions in the minimum functional core are related to nitrogenase transcription factors. The protein components of nitrogenase are found in the metagenome functional core but are not expressed and are therefore not part of the minimum functional core. Noticeably absent from these data are genes related to nitrification. The metaproteome supports these findings, with annotations related to the Nitrogen Metabolism subcategories Allantoin Utilization, Ammonia Assimilation, Denitrification and Nitrate and Nitrite Ammonification (Figure 3.4). Ammonia Assimilation was the most active nitrogen cycling process in the rhizosphere at the time of sampling, as the metatranscriptome and both metaproteome data sets have their greatest relative abundance in the Ammonia Assimilation subcategory of the Nitrogen Metabolism subsystem.

Figure 3.4: Relative abundance of biogeochemical cycling functions in the minimum functional core of the switchgrass rhizosphere. Allantoin utilization, Ammonia assimilation, Denitrification, and Nitrate and Nitrite ammonification are subsystems within the Nitrogen Metabolism subsystem. Alkylphosphonate utilization and Phosphate metabolism are subsystems within the Phosphorus Metabolism subsystem.

[Figure 3.4 plot. y-axis: Average Relative Abundance (0 to 0.045); x-axis: SEED Subsystem Level 3; series: MetaG, MetaT, MetaP-MetaT, MetaP-MetaG]

In the minimum functional core, the Phosphorus Metabolism subsystem contains 75% of the functional annotations and 99.8% of the relative abundance of this subsystem. Major subcategories within the Phosphorus Metabolism subsystem include Phosphate Metabolism, Alkylphosphonate Utilization and High Affinity Phosphate Transporters. The metaproteomes contained functional annotations for the Alkylphosphonate Utilization and Phosphate Metabolism subcategories within the Phosphorus Metabolism subsystem (Figure 3.4). The Iron Acquisition and Metabolism subsystem is very diverse, with 230 functional annotations that are core to the metagenome. The minimum functional core contains 117 functional annotations representing 97% of the relative abundance. The major subcategories of the Iron Acquisition and Metabolism subsystem are related to Siderophores and to Heme and Hemin Uptake and Utilization.

Many beneficial services provided by rhizosphere microbes are also reflected in the minimum functional core. Aside from providing plants with bioavailable sources of nitrogen, phosphorus and iron, microbes also produce plant growth hormones such as auxin. Many functions related to auxin biosynthesis are found in the Secondary Metabolism subsystem. Additionally, rhizosphere microbes have been shown to reduce plant levels of ethylene, a plant stress hormone, through the production of ACC-deaminase. This function was found in the minimum functional core and is classified as belonging to the Miscellaneous subsystem. Finally, rhizosphere microbes produce the sugar trehalose, which can be used by plants to protect against drought stress. Many functions related to trehalose biosynthesis are found in the Carbohydrates subsystem (Figure 3.5). While these functions comprise a rather small portion of the overall abundance, they can have a significant effect on plant-microbe interactions within the microbial community.

Figure 3.5: Relative abundance of plant growth promoting functions in the minimum functional core of the switchgrass rhizosphere. Auxin Biosynthesis and Trehalose Biosynthesis are at the second level of the SEED Subsystem hierarchy, within the Secondary Metabolism and Carbohydrates subsystems, respectively. *ACC-deaminase is at the fourth and lowest level of the SEED Subsystem hierarchy.

Carbon cycling functions within the minimum functional core

The Carbohydrates subsystem is highly abundant in the metagenome and the metatranscriptome functional cores (Figure 3.3). In both derivations of the metaproteome the Carbohydrates subsystem falls in relative abundance, but it still represents 6.5% and 5.7% of the relative abundance, respectively. To further examine carbon cycling processes, we used the Carbohydrate-Active Enzymes (CAZy) database to annotate our sequences. The CAZy minimum functional core is composed of 375 different functional annotations. The most common enzyme class was the Glycoside Hydrolases (GH), with 200 functional annotations. These enzymes are common and are known to hydrolyze or rearrange glycosidic bonds. When relative abundance is taken into account, the Glycoside Hydrolases class again stands out, with the greatest relative abundance of all classes (Figure 3.6). Glycosyl Transferases (GT) show a downward trend in relative abundance: GT annotations represent 33% of the relative abundance in the metagenome, 17% in the metatranscriptome and 8% in the metatranscriptome-derived metaproteome. However, in the metagenome-derived metaproteome the relative abundance of GT annotations is 39%. Finally, the Auxiliary Activities class comprises 3% of the relative abundance in the metagenome and 11% in the metatranscriptome. It comprises 29% of the relative abundance of the metatranscriptome-derived metaproteome and 9% of the relative abundance of the metagenome-derived metaproteome. This indicates that the Auxiliary Activities class, which has low representation in the metagenomes, was being highly expressed and translated at the time of sampling. The Auxiliary Activities class is described as redox enzymes that act in conjunction with CAZy enzymes. This class is predominantly related to ligninolytic enzymes.
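As a minimal sketch of how per-class relative abundances such as those above could be computed, the snippet below normalizes annotation counts for each CAZy class to proportions within one dataset. The counts are hypothetical placeholders rather than values from this study, and the function name is an assumption made for illustration.

```python
def relative_abundance(counts):
    """Convert raw annotation counts per class into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize an empty or all-zero profile")
    return {name: value / total for name, value in counts.items()}


# Hypothetical counts of CAZy class annotations in a single dataset.
example_counts = {"GH": 50, "GT": 30, "CE": 10, "PL": 5, "AA": 5}
example_profile = relative_abundance(example_counts)
```

Computing the same profile separately for the metagenome, the metatranscriptome and each metaproteome would allow the classes to be compared across data types on a common scale.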

Figure 3.6: Relative abundance of CAZy annotations in the minimum functional core of the switchgrass rhizosphere. CAZy enzyme classes are: Glycoside Hydrolases (GH), Glycosyl Transferases (GT), Carbohydrate Esterases (CE), Polysaccharide Lyases (PL) and Auxiliary Activities (AA).

Discussion

Many methods can be used to build a minimum functional core, ranging from conservative (functions must be found in all samples) to lenient (all functions are included regardless of how many times they were found). We chose an intermediate approach, requiring functions to be present in two of the three samples for each of the metagenome and metatranscriptome datasets. This removes singletons while still preserving much of the functional diversity found within the samples.
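The selection rule described above can be sketched as follows. This is an illustrative reading, not the code used in the study: it assumes each sample is a mapping from functional annotation to abundance, that "present" means an abundance above zero, and that the minimum functional core is the intersection of the metagenome and metatranscriptome cores. The function names and annotation labels are hypothetical.

```python
from collections import Counter


def core_functions(samples, min_present):
    """Return the functions detected (abundance > 0) in at least `min_present` samples."""
    presence = Counter()
    for sample in samples:
        presence.update(f for f, abundance in sample.items() if abundance > 0)
    return {f for f, n in presence.items() if n >= min_present}


def minimum_functional_core(metagenomes, metatranscriptomes, min_present=2):
    """Functions that pass the presence threshold in BOTH data types (assumed rule)."""
    return (core_functions(metagenomes, min_present)
            & core_functions(metatranscriptomes, min_present))


# Hypothetical annotation tables for three replicates of each data type.
metagenomes = [
    {"func_A": 3, "func_B": 1, "func_C": 0},
    {"func_A": 2, "func_B": 0, "func_C": 5},
    {"func_A": 1, "func_B": 4, "func_C": 0},
]
metatranscriptomes = [
    {"func_A": 1, "func_B": 0},
    {"func_A": 0, "func_B": 0, "func_C": 2},
    {"func_A": 4, "func_B": 1},
]
core = minimum_functional_core(metagenomes, metatranscriptomes)
```

Setting `min_present` to the number of replicates gives the conservative variant, and setting it to one gives the lenient variant.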

We define this as the minimum functional core for several reasons. We are almost certainly undersampling the soil’s genetic composition, and not all of the contigs are annotated (Table 3.6). In the case of the metagenomes used to build the minimum functional core, less than 50% of the contigs could be annotated by MG-RAST. This represents a large portion of the dataset from which no functional information can be obtained. Typically a core of functions shrinks when more data are included. However, we define this dataset as the minimum functional core because the unannotated portion of the data is large.

Based on our comparison of the minimum functional core to 21 rhizoplane metagenomic samples from the same field site we are confident that our minimum functional core is representative of major functional processes being carried out in the rhizosphere of our field site regardless of associated plant. In our comparison we used a highly-bred, high-nutrient responsive annual crop (corn) and two recently domesticated low nutrient input perennials (switchgrass and Miscanthus). Using a more stringent core identification method (defined as five of seven replicates per plant treatment) we identified over 97% of the minimum functional core annotations and accounted for over 99% of the relative abundance of each plant treatment based on the SEED Subsystem annotations. In the RefSeq minimum functional core over 92% of the functional annotations were observed and comprised over 91% of the rhizoplane abundance (Table 3.3).
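The two quantities reported in this comparison, the share of core annotations recovered and the share of rhizoplane abundance they account for, could be computed along the following lines. This is a sketch under assumptions: the comparison dataset is taken to be a mapping from annotation to abundance, and the function name and example values are hypothetical rather than drawn from Table 3.3.

```python
def core_recovery(core, comparison_abundance):
    """Return (fraction of core annotations found, fraction of comparison abundance they cover)."""
    if not core:
        raise ValueError("core must contain at least one annotation")
    total = sum(comparison_abundance.values())
    if total == 0:
        raise ValueError("comparison dataset has no abundance")
    # Core annotations that are actually detected in the comparison dataset.
    found = {f for f in core if comparison_abundance.get(f, 0) > 0}
    fraction_found = len(found) / len(core)
    abundance_covered = sum(comparison_abundance[f] for f in found) / total
    return fraction_found, abundance_covered


# Hypothetical core and rhizoplane abundance table.
example_core = {"func_A", "func_B", "func_C", "func_D"}
example_rhizoplane = {"func_A": 50, "func_B": 30, "func_C": 15, "func_E": 5}
fraction_found, abundance_covered = core_recovery(example_core, example_rhizoplane)
```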

This multi-omics approach has shown that many environmentally important, microbially mediated biogeochemical cycling functions are active in the rhizosphere. The Nitrogen Metabolism subsystem within the minimum functional core contains many key elements of the nitrogen cycle. The greater abundance of the Ammonia Assimilation subcategory, especially in the metatranscriptomic and metaproteomic data (Figure 3.4), highlights the relative importance of this function to the community compared to other subcategories of the Nitrogen Metabolism subsystem. The Phosphorus Metabolism subsystem functions within the minimum functional core also contain many elements of phosphorus cycling. The Phosphate Metabolism subcategory has the highest relative abundance across the multi-omics data set, indicating its importance to the rhizosphere microbial community. The Iron Acquisition and Metabolism subsystem is represented in the minimum functional core built from the metagenomes and metatranscriptomes. No metatranscriptome-derived metaproteome functional annotations were found, and very few annotations were found in the metagenome-derived metaproteome. This may be the result of undersampling in the metaproteomic data combined with the methodology used to obtain the metaproteome.

Despite their low proportional abundance, microbially mediated biogeochemical cycling functions are vital to the microbial community. These functions provide bioavailable micronutrients to the rhizosphere microbial community and the associated plant. We speculate that these low-abundance functional annotations related to biogeochemical cycling within the minimum functional core should be classified as rare keystone community functions, as they have a disproportionately large effect on the microbial community. These rare functions and their associated taxa may be involved in synergistic crossfeeding. Syntrophy occurs in various habitats including hot springs, fresh water sediments, marine sediments, eutrophic bog sediments, marsh sediments, rumen [25], sewage treatment plants [26] and petroleum muck [27], as well as in constructed communities [28, 29]. To support these claims a more direct and quantitative approach is required.

Plant growth promoting functions are also found within the minimum functional core at low abundance. It was once thought that 80% of microbes living in the rhizosphere could produce the plant growth hormone auxin [30]. These data also suggest that many or most of the organisms in the rhizosphere have a commensal relationship with the plant. These organisms benefit from carbon secreted through root exudates while not providing a benefit to the plant, other than perhaps outcompeting plant pathogens in the rhizosphere.

An alternative explanation is that timing plays a more relevant role in the abundance and expression of microbially mediated element cycles and plant-microbe interactions. As can be seen in Figure 3.5, the relative abundance in the metagenomes is greater than the metatranscriptome at the time of sampling indicating the system has a greater capacity to aid plant growth than was observed at the time of sampling. Sampling at different times throughout the growing season may reveal different trends in the abundance and expression of elemental cycling functions and functions related to plant-microbe interactions.

Annotations of our data to the CAZy minimum functional core suggest that proteins related to lignin breakdown are highly active in the rhizosphere microbial community. The Auxiliary Activities class, which has the lowest relative abundance in the metagenome minimum functional core and has intermediate abundance in the metatranscriptome, has the second highest relative abundance in the metatranscriptome-derived metaproteome. The metagenome-derived metaproteome has lower abundance than the metatranscriptome-derived metaproteome, suggesting that many of the annotated proteins were recently synthesized by active microbes. The breakdown of lignin indicates that a portion of the microbial community was actively involved in the breakdown of plant biomass. Our use of a multi-omics approach greatly contributed to this finding; without the metaproteomic data this insight into rhizosphere microbial community function would not have been identified. These data also show that Glycoside Hydrolases are the predominant CAZy class that was active in the rhizosphere community across all samples. The high relative abundance of the metagenome-derived metaproteome GT annotations indicates these proteins are older and originate from microbes that were not actively transcribing these enzymes at the time of sampling.

Annotations in the Clustering-based subsystem were found at high abundance in both the metagenome and metatranscriptome (Figures 3.2 and 3.3). The metagenomic data clearly show that many genomes possess a wide range of different annotations belonging to the Clustering-based subsystem (Figure 3.2). Metatranscriptomic data also show a wide range of highly expressed transcripts in this class. Meanwhile, both metaproteomic data sets show few annotations belonging to the Clustering-based subsystem, indicating that in the recent past few of the proteins of Clustering-based subsystem functions were produced in detectable quantities. The cumulative evidence here also reinforces the need to further investigate the actual functions found in the Clustering-based subsystem, as they are abundant in rhizosphere community genomes and are actively transcribed. More evidence is needed to determine whether functions found in the Clustering-based subsystem are translated to protein at levels similar to their transcription levels.

Conclusion

To fully understand the dynamics of environmentally important microbially mediated processes, microbes must be studied in a community context. Our multi-omics approach for establishing a core of active microbial functions has led to several insights into microbial community function within the rhizosphere. While the most abundant functional annotations within the minimum functional core are related to housekeeping functions, biogeochemical cycling and plant growth promoting functions are present and active in the rhizosphere. Our use of metaproteomics greatly contributed to the findings that the Ammonia Assimilation and Phosphate Metabolism subsystems were highly active during the sampling period compared to other biogeochemical functions. Sampling at multiple time points throughout the season may reveal seasonal patterns of activity and further increase our understanding of biogeochemical cycling and plant-microbe interactions in the rhizosphere. The use of a multi-omics approach allowed the identification of microbial community activity in the rhizosphere.

APPENDIX

Table 3.4: Summary of rhizoplane metagenome reads, assembly and assembled read abundance

Sample           Total Reads    Assembled Contigs   Reads Assembled
Corn-1           335,539,202    4,985,554           58.37%
Corn-2           362,890,032    5,795,619           48.31%
Corn-3           340,306,582    5,159,075           49.56%
Corn-4           353,097,824    5,306,152           44.89%
Corn-5           396,880,408    5,227,891           44.44%
Corn-6           419,729,170    5,989,941           43.18%
Corn-7           411,276,622    6,235,593           49.49%
Miscanthus-1     349,441,596    5,733,645           47.42%
Miscanthus-2     356,902,550    6,365,326           47.73%
Miscanthus-3     355,019,026    6,058,969           44.76%
Miscanthus-4     334,711,394    5,764,967           50.37%
Miscanthus-5     282,058,246    4,566,237           40.79%
Miscanthus-6     213,309,852    2,879,035           30.27%
Miscanthus-7     399,148,934    6,517,537           44.98%
Switchgrass-1    449,331,454    6,606,700           58.36%
Switchgrass-2    352,965,222    4,076,354           41.81%
Switchgrass-3    405,940,264    6,207,377           48.30%
Switchgrass-4    415,253,680    6,156,833           45.96%
Switchgrass-5    389,928,900    5,430,532           43.88%
Switchgrass-6    353,644,618    5,843,003           49.98%
Switchgrass-7    364,517,026    5,715,056           47.57%

REFERENCES

1. Philippot, L., et al., Going back to the roots: the microbial ecology of the rhizosphere. Nature reviews. Microbiology, 2013. 11: p. 789-99.

2. Bird, J.A., D.J. Herman, and M.K. Firestone, Rhizosphere priming of soil organic matter by bacterial groups in a grassland soil. Soil Biology and Biochemistry, 2011. 43: p. 718-725.

3. Marschner, P., D. Crowley, and Z. Rengel, Rhizosphere interactions between microorganisms and plants govern iron and phosphorus acquisition along the root axis – model and research methods. Soil Biology and Biochemistry, 2011. 43: p. 883-894.

4. Berendsen, R.L., C.M.J. Pieterse, and P.A.H.M. Bakker, The rhizosphere microbiome and plant health. Trends in plant science, 2012: p. 1-9.

5. Mendes, R., P. Garbeva, and J.M. Raaijmakers, The rhizosphere microbiome: significance of plant beneficial, plant pathogenic, and human pathogenic microorganisms. FEMS microbiology reviews, 2013. 37: p. 634-63.

6. Spaepen, S. and J. Vanderleyden, Auxin and plant-microbe interactions. Cold Spring Harbor perspectives in biology, 2011. 3.

7. Dodd, I.C., et al., Rhizobacterial mediation of plant hormone status. Annals of Applied Biology, 2010. 157: p. 361-379.

8. Fierer, N., et al., Cross-biome metagenomic analyses of soil microbial communities and their functional attributes. Proceedings of the National Academy of Sciences of the United States of America, 2012. 109(52): p. 21390-21395.

9. Mendes, L.W., et al., Taxonomical and functional microbial community selection in soybean rhizosphere. Isme Journal, 2014. 8(8): p. 1577-1587.

10. Lennon, J.T. and S.E. Jones, Microbial seed banks: the ecological and evolutionary implications of dormancy. Nature Reviews Microbiology, 2011. 9(2): p. 119-130.

11. Howe, A.C., et al., Tackling soil diversity with the assembly of large, complex metagenomes. Proceedings of the National Academy of Sciences of the United States of America, 2014. 111(13): p. 4904-4909.

12. Rodriguez-R, L.M. and K.T. Konstantinidis, Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics, 2014. 30(5): p. 629-635.

13. Delmont, T.O., P. Simonet, and T.M. Vogel, Describing microbial communities and performing global comparisons in the 'omic era. Isme Journal, 2012. 6(9): p. 1625-1628.

14. Shade, A. and J. Handelsman, Beyond the Venn diagram: the hunt for a core microbiome. Environmental Microbiology, 2012. 14(1): p. 4-12.

15. Qin, J., et al., A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 2010. 464(7285): p. 59-U70.

16. Li, D., et al., MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 2015. 31(10): p. 1674-1676.

17. Wisniewski, J.R., A. Zougman, and M. Mann, Combination of FASP and StageTip-Based Fractionation Allows In-Depth Analysis of the Hippocampal Membrane Proteome. Journal of Proteome Research, 2009. 8(12): p. 5674-5678.

18. Kim, S., et al., The Generating Function of CID, ETD, and CID/ETD Pairs of Tandem Mass Spectra: Applications to Database Search. Molecular & Cellular Proteomics, 2010. 9(12): p. 2840-2852.

19. Kim, S., N. Gupta, and P.A. Pevzner, Spectral probabilities and generating functions of tandem mass spectra: A strike against decoy databases. Journal of Proteome Research, 2008. 7(8): p. 3354-3363.

20. Langmead, B., et al., Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 2009. 10(3): p. R25.

21. Quinlan, A.R. and I.M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 2010. 26(6): p. 841-2.

22. Meyer, F., et al., The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 2008. 9: p. 386.

23. Cantarel, B.L., et al., The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics. Nucleic Acids Res, 2009. 37(Database issue): p. D233-8.

24. Rodriguez, R.L. and K.T. Konstantinidis, Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics, 2014. 30(5): p. 629-35.

25. McInerney, M.J., et al., Physiology, ecology, phylogeny, and genomics of microorganisms capable of syntrophic metabolism. Annals of the New York Academy of Sciences, 2008. 1125: p. 58-72.

26. Jackson, B.E., et al., a new anaerobic bacterium that degrades fatty acids and benzoate in syntrophic association with hydrogen-using microorganisms. 1999: p. 107-114.

27. Joshi, M.N., et al., Metagenomics of petroleum muck: Revealing microbial diversity and depicting microbial syntrophy. Archives of Microbiology, 2014. 196: p. 531-544.

28. Mee, M.T., et al., Syntrophic exchange in synthetic microbial communities. Proceedings of the National Academy of Sciences of the United States of America, 2014. 111: p. E2149-56.

29. D'Souza, G., et al., Less is more: Selective advantages can explain the prevalent loss of biosynthetic genes in bacteria. Evolution, 2014. 68: p. 2559-2570.

30. Spaepen, S. and J. Vanderleyden, Auxin and plant-microbe interactions. Cold Spring Harb Perspect Biol, 2011. 3(4).

Chapter 4

Plant root effects on soil microbial community functions as viewed through metagenomics and metatranscriptomics

Abstract

We used metagenomics and metatranscriptomics to identify microbial community functions enriched in the rhizoplane, the rhizosphere and bulk soil (not influenced by living plant roots). We postulated that metagenomes of all three soil zones would show significant differences in the communities’ functional compositions and their metatranscriptomes, and that bulk versus rhizosphere soil would show the greatest difference. To test this, we obtained metagenome sequence from the rhizoplane of switchgrass (Panicum virgatum) and corn (Zea mays), and the rhizosphere of corn. For metatranscriptomics we obtained sequence from rhizosphere samples of switchgrass and bulk samples from between corn rows. Contrary to our hypothesis, the metagenomes of rhizosphere and bulk soil showed no statistical differences, but there was a significant difference between rhizoplane and rhizosphere soils. We therefore combined the corn bulk and rhizosphere samples and termed them non-rhizoplane samples; the rhizosphere samples from switchgrass were also termed non-rhizoplane samples. Additionally, bulk corn and switchgrass rhizosphere metagenomes showed significant differences in terms of functional composition and gene abundances; however, the associated metatranscriptomes were not statistically different. In this study we show that bulk soil microbial communities are affected by the plant even though samples were collected away from living roots. This study also illustrates that factors other than proximity to living plant roots more strongly affect microbial community functions.

Introduction

Soils exist in a continuum with regard to their exposure to living plant roots and plant detritus. As rhizodeposition occurs, carbon rich compounds are first accessible to organisms living in the rhizoplane, the external surface of plant roots together with any closely adhering particles of soil or debris [1]. Carbon rich compounds then diffuse to the surrounding soil, i.e. the rhizosphere, the soil influenced by living plant roots [1]. Bulk soil is defined in this study as soil not influenced by living plant roots. Differences between bulk and rhizosphere soils have been widely studied, while comparisons to the rhizoplane are relatively rare. The rhizosphere is characterized as having lower species diversity than bulk soils [2] but also as having a larger and more complex network of interactions than bulk soil [3]. These differences in microbial community composition are also accompanied by differences in the functional potential of bulk and rhizosphere microbial communities [4]. These results have not been universally consistent. In studies of rice (Oryza sativa) and Merlot grapevine, the microbial community structure of the rhizosphere and bulk soil could not be differentiated [5, 6]. Many experiments in plant-microbe interactions utilize bulk soil as a control [7-9]; however, there is disagreement in the literature about the effect of plant roots and their extension into the rhizosphere.

In this study we utilize metagenomics as well as metatranscriptomics to more deeply examine functional differences in microbial communities associated with the rhizoplane, rhizosphere and bulk soil of switchgrass and corn. We postulate that rhizoplane and rhizosphere samples from corn and switchgrass associated soils will differ from bulk soil, with the bulk soil being intermediate between corn and switchgrass associated soils. We also postulate that while the metagenome samples will show a significant difference between bulk soil and the rhizoplane and rhizosphere soils, the metatranscriptome will show a more statistically significant difference between bulk and rhizosphere samples, as it is not obscured by dead or dormant cells.

Methods

Site description and sample collection

Samples were collected from the Kellogg Biological Station Great Lakes Bioenergy Research Center BSCE. Prior to 2008, dating back to 1988, the site was conventionally farmed, mostly for soybeans. In 2008 the site was set up in a randomized block design to study various bioenergy cropping systems including continuous corn and switchgrass, the focus of this study.

Soil samples were collected on July 31st 2013. Bulk and rhizosphere samples used for metatranscriptomic and metagenomic sequencing were collected from corn plot G1R4 and adjacent switchgrass plot G5R2, respectively. Bulk soil samples were collected from in between crop rows using a 3 cm diameter soil corer; only the top 10 cm of soil were collected. For the switchgrass rhizosphere samples, the root systems were dug up and vigorously shaken to remove excess, potentially non-rhizosphere, soil. The switchgrass root system was then placed in a sterile bag and vigorously shaken to loosen the closely attached rhizosphere soil. The separated soil was placed in Whirl-Pak bags and frozen in liquid nitrogen. This process was completed in less than 5 min to minimize transcript turnover.

Rhizoplane metagenome samples were collected at the same time and place as the previous samples. The root systems were placed in sealed bags and kept on ice for transport to the laboratory. Samples were stored at -20 °C until needed. Small roots with their surrounding thin layer of soil and microaggregates were removed from the root system and placed in a phosphate buffer. Each sample contained about five grams of root and soil material. Samples were shaken to remove soil attached to the roots. The suspended soil and microbes were then pelleted. Root material was then carefully removed from the soil pellet. Soil that fell off of the roots during transport or rhizoplane root picking was considered rhizosphere soil, as it was likely not in direct contact with the plant root system. This soil was therefore collected for both switchgrass and corn samples for use as rhizosphere metagenomes.

Sample preparation and sequencing

RNA was extracted from three replicate rhizosphere soils using the MoBio PowerSoil RNA extraction kit (MoBio, Carlsbad, CA). Samples were treated with DNase (Invitrogen, Carlsbad, CA) to remove any potentially co-extracted DNA. Sample quality was checked using a NanoDrop and RNA was quantified using the Qubit RNA quantification kit (Invitrogen, Carlsbad, CA). The three switchgrass RNA replicates were termed SRT1, SRT2 and SRT3 (SRT, switchgrass rhizosphere treatment), while the three corn samples were termed CBT1, CBT2 and CBT3 (CBT, corn bulk (soil) treatment). Six DNA samples were extracted from the same samples used for metatranscriptomics. About 0.5 grams of soil was used for DNA extraction using the MoBio PowerSoil DNA kit (MoBio, Carlsbad, CA) according to the manufacturer’s protocols. The quality of each sample was checked using a NanoDrop and DNA was quantified using the Qubit DNA quantification kit (Invitrogen, Carlsbad, CA). The three switchgrass rhizosphere DNA samples were termed SRG1, SRG2 and SRG3 (SRG, switchgrass rhizosphere (meta)genome), while the three corn metagenome samples were termed CBG1, CBG2 and CBG3 (CBG, corn bulk (soil) (meta)genome). DNA was also extracted from an additional 20 samples (as above): seven were collected from the rhizoplane of corn and seven from the rhizoplane of switchgrass, termed C1-7 (C refers to corn plot) and S1-7 (S refers to switchgrass plot). The remaining six samples, three corn rhizosphere and three switchgrass rhizosphere samples, were termed CR1, CR2 and CR3 (CR, corn rhizosphere) and SR1, SR2 and SR3 (SR, switchgrass rhizosphere).

All DNA samples were sent to the Joint Genome Institute (JGI) in Walnut Creek, CA for sequencing. Ribosomal RNA subtraction was performed on the metatranscriptome samples using the Ribo-Zero kit (Illumina, San Diego, CA) at JGI. All samples were sequenced using the HiSeq-1TB. All data were processed using JGI’s standard method, then assembled using MEGAHIT (v 0.2.0) [10] (--cpu-only -m 100e9 --k-max 123 -l 155).

Data analysis

The median coverage of reads mapping to the contigs was identified as follows. Quality filtered reads were mapped to the contigs using bowtie2 (v2.0.0-beta6, [11]) with the following default parameters: end-to-end alignment, minimum score threshold for 100 bp reads was -60.6, -D 100, distinct alignments for each read. Median base pair coverage was estimated using BEDTools [12]. All contigs shorter than 300 bp were removed from the data set. Samples were then submitted to MG-RAST (v3.6, [13]) for annotation using the assembled pipeline and no other filtering methods or quality controls (code available at https://github.com/Garoutte/Chapter_4).
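The contig length filter described above can be illustrated with a short, standard-library sketch. This is not the code from the repository cited above; the parser, the function names and the toy sequences are assumptions made for illustration, and contigs of exactly 300 bp are kept because the text removes only those shorter than 300 bp.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_contigs(records, min_length=300):
    """Keep contigs whose length is at least `min_length` base pairs."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# Toy input: one contig below the cutoff, one exactly at it, one above it.
example_fasta = [
    ">contig_1", "A" * 299,
    ">contig_2", "A" * 200, "C" * 100,
    ">contig_3", "G" * 450,
]
kept = filter_contigs(read_fasta(example_fasta))
```

In practice the lines would come from an open file handle for the assembly output rather than from an in-memory list.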

Statistical analysis was carried out in the R statistics environment (v3.1.3) using the vegan package (v2.2-1). Nonmetric multidimensional scaling (NMDS) was used to visualize the data, which were log2(x + 1) transformed to increase normality. To establish significance of sampling groups, permutational multivariate analysis of variance (PERMANOVA) was used with a Bray-Curtis distance matrix. Differential abundance analysis of functional annotations was carried out using a modified version of Fisher’s exact test according to [14] in the R package edgeR (v3.8.6). Functional annotations were considered differentially abundant if the log fold change was one or greater and the false discovery rate was 0.05 or less (code available at https://github.com/Garoutte/Chapter_4).
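To make the statistical procedure more concrete, the following sketch implements a one-way PERMANOVA on Bray-Curtis distances in plain Python. It is an illustration only and not the vegan implementation used in the study: it assumes the log2(x + 1) transform is applied before the distances are computed, draws random label permutations with a fixed seed, and uses hypothetical abundance profiles and function names.

```python
import math
import random


def log2_plus_one(profile):
    """Apply the log2(x + 1) transform to one abundance profile."""
    return [math.log2(x + 1) for x in profile]


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two non-negative profiles."""
    denominator = sum(a + b for a, b in zip(x, y))
    if denominator == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(x, y)) / denominator


def pseudo_f(dist, labels):
    """Pseudo-F statistic computed from a distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for group in groups:
        idx = [i for i, label in enumerate(labels) if label == group]
        ss_within += sum(dist[i][j] ** 2
                         for k, i in enumerate(idx) for j in idx[k + 1:]) / len(idx)
    if ss_within == 0:
        return float("inf")
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova(profiles, labels, n_perm=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    data = [log2_plus_one(p) for p in profiles]
    n = len(data)
    dist = [[bray_curtis(data[i], data[j]) for j in range(n)] for i in range(n)]
    observed = pseudo_f(dist, labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical annotation counts for two groups of three samples each.
example_profiles = [
    [100, 50, 0, 1], [90, 60, 1, 0], [110, 40, 0, 2],
    [0, 2, 80, 120], [1, 0, 95, 100], [2, 1, 70, 130],
]
example_labels = ["rhizosphere"] * 3 + ["bulk"] * 3
f_stat, p_value = permanova(example_profiles, example_labels, n_perm=199)
```

With only three samples per group there are few distinct ways to relabel the samples, so the smallest p-value a permutation test of this kind can return is bounded well above zero regardless of how strongly the groups differ.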

Results

Metatranscriptome analysis

Ribosomal RNA subtraction by JGI was mostly successful, with the majority of samples having approximately 30-40% rRNA after sequencing (Table 4.2). However, for one sample, SRT1, rRNA removal was less successful, with rRNA reads making up approximately 72% of the sequences. The resulting low number of non-rRNA sequences led to a much lower number of contigs (Table 4.2) and fewer overall annotations (Table 4.3). Therefore, sample SRT1 was treated as a technical outlier and was not used for further analysis. Bulk soil samples were not collected from the switchgrass field because no root free area of soil could be found. Metatranscriptome samples were tested using PERMANOVA to determine if there was a statistically significant difference between SRT and CBT samples. The PERMANOVA shows that the corn and switchgrass metatranscriptome samples are not significantly different from one another (Table 4.1). Metagenome samples of the switchgrass rhizosphere (SRG) and the corn bulk soil (CBG), collected from the same samples as the metatranscriptome samples, were statistically different with a PERMANOVA p-value of 0.001 (Table 4.1).

Table 4.1: PERMANOVA analysis of metagenome and metatranscriptome samples.

Permutational multivariate analysis of variance of samples. SRT is the switchgrass metatranscriptome, CBT is the corn bulk metatranscriptome, switchgrass rhizosphere metagenome, CBG is the corn bulk metagenome, C is for the corn rhizoplane metagenome, S is the switchgrass rhizoplane metagenome, SR is the switchgrass rhizosphere metagenome,

CR is the switchgrass rhizosphere metagenome, SRG is the switchgrass rhizosphere metagenome samples collected with the metatranscriptome samples, CBG is the corn bulk metagenome samples collected with the metatranscriptome samples, CNRP is the combination of the corn bulk and rhizosphere metagenome samples, and SNRP is the combination of the two treatments of switchgrass rhizosphere metagenome samples.

PERMANOVA

Comparison p-value

SRT-CBT 0.1

SRG-CBG 0.001389

C-S 0.031

S-SR 0.01

S-SRG 0.029

SR-SRG 0.2014

159

Table 4.1 (cont’d)

C-CR 0.001

C-CBG 0.001

CR-CBG 0.1

CNRP-SNRP 0.002

Both metatranscriptome sample sets share the top three subsystem annotations, namely Clustering-based subsystems, Protein Metabolism and Carbohydrates (Figure 4.1a). The Clustering-based subsystems category is defined as a subsystem in which there is evidence of functional coupling between annotations with no known function. The high abundance of these “known unknown” functions suggests that they perform important cellular functions. Another very abundant subsystem is Protein Metabolism, which contains mostly housekeeping functions such as translation. The majority of the relatively abundant functional annotations within the Carbohydrates subsystem are related to housekeeping functions such as central metabolism (Figure 4.1b). Other level two functions within the Carbohydrates subsystem relate to the metabolism of various carbohydrates. Other abundant housekeeping-related functions include Amino Acids and Derivatives and RNA Metabolism.
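The average relative abundances discussed above can be illustrated with a short Python sketch. It assumes that each replicate is first converted to proportions and that the proportions are then averaged across replicates. The counts and function names are hypothetical and do not come from the data set.

```python
def relative_abundance(counts):
    """Convert raw annotation counts for one sample into proportions."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}

def average_relative_abundance(samples):
    """Average per-sample proportions across replicates of one treatment."""
    profiles = [relative_abundance(sample) for sample in samples]
    names = sorted({name for profile in profiles for name in profile})
    return {
        name: sum(profile.get(name, 0.0) for profile in profiles) / len(profiles)
        for name in names
    }

# Hypothetical counts for two replicates (illustration only).
replicates = [
    {"Clustering-based subsystems": 30, "Protein Metabolism": 20, "Carbohydrates": 50},
    {"Clustering-based subsystems": 10, "Protein Metabolism": 30, "Carbohydrates": 60},
]

averaged = average_relative_abundance(replicates)
for subsystem, value in averaged.items():
    print(f"{subsystem}: {value:.3f}")
```

Averaging proportions rather than pooled counts keeps a deeply sequenced replicate from dominating the treatment mean.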

Figure 4.1: Average relative abundance of metatranscriptome annotations

Shows annotations based on the MG-RAST SEED Subsystem database. (a) Average relative abundance of corn and switchgrass metatranscriptome annotations. (b) Relative abundance of metatranscriptome annotations in the Carbohydrate subsystem level 2.

[Figure 4.1a: bar chart. x-axis: SEED Subsystems; y-axis: average relative abundance; series: Switchgrass Rhizosphere Metatranscriptome and Corn Bulk Metatranscriptome.]

[Figure 4.1b: bar chart. x-axis: Carbohydrate Subsystem level 2; y-axis: average relative abundance; series: CBT and SRT.]

Metagenome analysis

To explore the continuum of root effects on microbial communities we used nonmetric multidimensional scaling (NMDS) analysis and PERMANOVA on the rhizoplane metagenomes (S and C) and rhizosphere metagenomes (SR and CR). The two plant rhizoplane metagenomes (C and S) were statistically different (Table 4.1) even though they appear to cluster together in the graph of the NMDS analysis (Figure 4.2). Furthermore, the switchgrass rhizoplane metagenomic samples (S) are also statistically different from both the rhizosphere samples collected during the metagenome sampling (SR) and the rhizosphere samples collected during the metatranscriptome sampling (SRG) (Table 4.1).

The two sets of switchgrass rhizosphere metagenomic samples collected by different methods (SRG and SR) are, as expected, not statistically different (Table 4.1). As with the switchgrass rhizosphere metagenomic samples, the corn bulk (CBG) and corn rhizosphere (CR) metagenomic samples are both significantly different (Table 4.1) from the corn rhizoplane (C). However, the corn bulk (CBG) and rhizosphere (CR) metagenomic samples are not statistically different from one another (Table 4.1) even though they appear to cluster independently in the graph of the NMDS analysis (Figure 4.2).

Figure 4.2: Nonmetric multidimensional scaling (NMDS) analysis of metagenome samples. Metagenome samples were log2(x + 1) transformed to increase normality.

Since the two switchgrass rhizosphere metagenomic samples (SR and SRG) and the corn bulk (CBG) and rhizosphere (CR) metagenomic samples are not statistically different from one another, the samples were combined by associated plant and termed non-rhizoplane samples (CNRP and SNRP). A PERMANOVA analysis of the two non-rhizoplane treatments shows that the two treatments are significantly different from one another (Table 4.1).

A modified version of Fisher’s exact test using the R package edgeR was used to identify functional annotations that are differentially abundant in the various treatments.
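As background, the classical two-sided Fisher's exact test on a 2x2 table can be sketched in a few lines of Python. This is only the textbook test. The modified test in edgeR, as we understand it, adapts the same idea to overdispersed count data and is not reproduced here. The table values and the function name are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def probability(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    low = max(0, col1 - row2)
    high = min(row1, col1)
    observed = probability(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(
        probability(x)
        for x in range(low, high + 1)
        if probability(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, total)

# Hypothetical table: rows are two treatments, columns are
# (reads assigned to one annotation, reads assigned elsewhere).
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"p = {p_value:.4f}")
```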

Comparison of the corn rhizoplane metagenomic samples with the non-rhizoplane corn metagenomic samples identified 294 functional annotations enriched in the rhizoplane versus 73 enriched in the non-rhizoplane samples (Figure 4.3). Many of the differentially abundant functions are the only members of the subcategories within the hierarchical structure of the SEED Subsystem to which they belong. These functional annotations do not comprise a complete or even partial functional pathway and hence do not reveal likely major functional changes. They may be the result of undersampling. Therefore we will only present differentially abundant annotations with reasonable representation within a pathway. The corn rhizoplane metagenomic samples are enriched for many functions thought to be common in plant-microbe interactions. These include subcategories of the Carbohydrates subsystem (Di- and Oligosaccharides, Monosaccharides, and Organic Acids); four chemotaxis and five flagellum associated functional annotations; and protein secretion systems, with 14 annotations. Interestingly, there are many functional annotations related to DNA exchange, with five functions related to plasmid-encoded T-DNA and five annotations associated with conjugative transfer. The subsystem with the greatest number of enriched functional annotations in the non-rhizoplane corn samples is Phages, Prophages, Transposable Elements and Plasmids. All but one of these annotations are related to phage replication and reproduction. The non-rhizoplane corn metagenomic samples are also enriched for phage shock proteins and have two CRISPR-associated hypothetical protein annotations.
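The counting behind comparisons of this kind can be sketched as follows. The sketch applies a Benjamini-Hochberg adjustment, keeps annotations that meet the thresholds stated in the Methods (log fold change of at least one, false discovery rate of at most 0.05) and tallies them by subsystem and direction. The input rows are hypothetical. Using the absolute log fold change and treating a positive value as enrichment in the rhizoplane are assumptions of this example, not statements about the edgeR output.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

def tally_enriched(results, min_log_fc=1.0, max_fdr=0.05):
    """Count enriched annotations per subsystem and direction.

    results: list of (subsystem, log_fold_change, p_value) tuples.
    """
    fdr = benjamini_hochberg([row[2] for row in results])
    tally = {}
    for (subsystem, log_fc, _), q in zip(results, fdr):
        if q > max_fdr or abs(log_fc) < min_log_fc:
            continue
        # Assumed sign convention: positive means higher in the rhizoplane.
        direction = "rhizoplane" if log_fc > 0 else "non-rhizoplane"
        counts = tally.setdefault(subsystem, {"rhizoplane": 0, "non-rhizoplane": 0})
        counts[direction] += 1
    return tally

# Hypothetical test results (illustration only).
results = [
    ("Carbohydrates", 2.1, 0.001),
    ("Carbohydrates", 1.4, 0.010),
    ("Motility and Chemotaxis", 1.8, 0.004),
    ("Phages, Prophages, Transposable Elements and Plasmids", -1.6, 0.020),
    ("Respiration", 0.4, 0.030),
    ("Stress Response", 1.2, 0.400),
]

tally = tally_enriched(results)
for subsystem, counts in tally.items():
    print(subsystem, counts)
```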

Figure 4.3: Comparison of differentially abundant annotations in corn rhizoplane and non-rhizoplane metagenomic samples. Number of differentially abundant annotations in corn rhizoplane and non-rhizoplane samples based on SEED Subsystems.

[Figure 4.3: bar chart. x-axis: SEED Subsystems; y-axis: number of differentially abundant annotations; series: Corn Rhizoplane and Corn Non-rhizoplane.]

When the switchgrass rhizoplane was compared to the non-rhizoplane switchgrass-associated samples using edgeR, 119 functional annotations were associated with the non-rhizoplane switchgrass metagenomic samples, while 391 functional annotations were associated with the rhizoplane switchgrass metagenomic samples (Figure 4.4). The annotations enriched in the switchgrass rhizoplane metagenomic samples, like the corn rhizoplane enriched functions, relate to plant-microbe interactions such as the utilization of root exudates. The Carbohydrates subsystem contains enriched functions related to utilization of di- and oligosaccharides, monosaccharides, organic acids and sugar alcohols. The Iron acquisition and metabolism subsystem contains 21 enriched functions related to siderophores. The Membrane Transport subsystem contains 33 enriched functions related to type IV secretion, 17 of which are related to conjugative transfer. There are 11 enriched functions related to T-DNA and ten related to resistance to antibiotics and toxins. Functional annotations associated with non-rhizoplane switchgrass metagenomic samples include seven related to central metabolism and 11 related to phage.

Figure 4.4: Comparison of differentially abundant annotations in switchgrass rhizoplane and non-rhizoplane metagenomic samples. Number of differentially abundant annotations in switchgrass rhizoplane and non-rhizoplane samples based on SEED Subsystems.

[Figure 4.4: bar chart. x-axis: SEED Subsystems; y-axis: number of differentially abundant annotations; series: Switchgrass Rhizoplane and Switchgrass Non-rhizoplane.]

Discussion

Analysis of corn rhizosphere and bulk soil associated samples showed (via PERMANOVA analysis) that the two sample sets are not statistically different (Table 4.1) even though the samples appear to cluster separately in the NMDS analysis (Figure 4.2). Additional sampling may aid in resolving this discrepancy, as a PERMANOVA p-value of 0.1 is considered by some to be marginally significant. This result led us to combine the corn bulk and rhizosphere samples into a single treatment called corn-associated non-rhizoplane samples, as we could not reliably differentiate between bulk and rhizosphere soils. We subsequently reclassified and combined the two treatments of switchgrass rhizosphere into switchgrass-associated non-rhizoplane soil. These results indicate that the “bulk” soil samples (CBG) are still influenced by the presence of the plant (Figure 4.2).

We initially predicted that bulk soil samples (not influenced by plants) should ordinate at an intermediate position between the corn and switchgrass treatments. It may be that decaying organic matter from previous years’ crops influences the microbial community in the “bulk” soil sample. The previous year’s crop residue (primarily old root systems, since the tops are harvested for biofuel) was found near the bulk soil sample collection site. No live roots were observed in the “bulk” soil samples. Rhizosphere microbial communities associated with an annual crop such as corn would also come into contact with dead plant material during a growing season. This continuous exposure to decaying plant material may influence the bulk, rhizosphere and rhizoplane communities to possess similar functional traits, making it difficult to distinguish one from the other.

Non-rhizoplane samples could easily be differentiated from their associated rhizoplane samples based on their PERMANOVA p-values (Table 4.1) as well as the NMDS (Figure 4.2). In the rhizoplane samples we see enrichment of functions commonly associated with plant-microbe interactions. Both corn and switchgrass rhizoplanes are enriched for annotations related to carbohydrate utilization (many of which are potentially related to utilization of root exudates) and protein secretion, which can be a form of chemical communication between plants and microbes [15]. The corn rhizoplane is also enriched for chemotaxis and flagellum related annotations. Root exudates have been shown to act as a chemo-attractant to some bacteria; therefore it is not surprising to find the rhizoplane enriched with functional annotations related to chemotaxis and flagella [16]. Furthermore, bacterial movement is more common in the rhizosphere than in bulk soil. The switchgrass rhizoplane is enriched for capsular polysaccharide biosynthesis, indicating bacterial cell growth, as well as for phosphorus and iron utilization, nutrients commonly liberated by microbes and utilized by associated plants [17].

Many of the non-rhizoplane samples were collected as rhizosphere samples. By labeling them non-rhizoplane samples we are not suggesting that they are devoid of rhizosphere influence, only that we cannot differentiate them from bulk soils by the sequence information. Furthermore, the presence of differentially abundant functions associated with plant-microbe interactions in the rhizoplane samples does not negate the possibility that these samples originate from the rhizosphere. Instead, these data only show that plant-microbe associated functions are more abundant in the rhizoplane. When the corn rhizoplane was compared to the corn non-rhizoplane sequences, all of the differentially abundant functions in the rhizoplane were also present in the corn non-rhizoplane annotations. The differentially abundant switchgrass rhizoplane sequences contain only two functions not found in the switchgrass non-rhizoplane sequences. We simply lack sufficient data, specifically “bulk” soil samples, to validate these samples as rhizosphere samples. The non-rhizoplane samples have fewer differentially abundant annotations, and those annotations for the most part do not include many annotations for the same pathways. The lack of differentially abundant functions in the non-rhizoplane samples indicates the functional similarity between the two sample types. It can be inferred that the non-rhizoplane samples are a reflection of the rhizoplane samples with a lower abundance of classical functional annotations associated with plant-microbe interactions.

In our initial hypothesis we postulated that metagenomic techniques would allow us to differentiate among bulk, rhizosphere and rhizoplane soils associated with corn and switchgrass. Additionally, we postulated that the metatranscriptome would show a greater statistical difference between bulk and rhizosphere soils associated with switchgrass. Interestingly, when compared using PERMANOVA the two metatranscriptome sample sets did not show a statistically significant difference (p = 0.1, Table 4.1). Counter to our hypothesis, the metagenome sequences associated with the metatranscriptome (SRG and CBG) showed a very significant difference. The assumption underlying our hypothesis was that the main driver of microbial community transcription (activity) is plant-microbe interactions. However, the relative abundance data from the metatranscriptome suggest that the microbial community activity was mostly related to housekeeping functions such as transcription, translation and central metabolism. Influence from plant roots or detritus seems to have had little impact on microbial community transcription at the time of sampling.

Other factors common to both samples, such as environmental factors like temperature and moisture content of the soil, may be playing a larger role in transcription by the microbial communities. Another factor could be that some of the roots we collected were older, lignified roots and therefore secreting fewer exudates. We did, however, collect only the small roots, i.e. < 1 mm diameter. Sampling throughout the growing season or closer to root tips may reveal different patterns of transcriptional activity.

Conclusion

We investigated the effect of plant root exudates on microbial communities across a continuum of samples ranging from the rhizoplane through the rhizosphere to bulk soil. We were able to differentiate rhizoplane soil samples from non-rhizoplane soil samples using metagenomic sequence. Rhizoplane samples showed differential abundance of functions related to plant-microbe interactions such as carbohydrate utilization (potentially related to root exudates), protein secretion and biogeochemical cycling. Bulk soil samples could not be differentiated from rhizosphere soil, possibly due to undersampling. However, the bulk soil is clearly influenced by the presence of nearby or recent plants, not necessarily by living plant roots. The metatranscriptome did not show a statistically significant difference between bulk and rhizosphere soils, while the associated metagenomes did. The metatranscriptome samples were enriched for functional annotations related to housekeeping processes, indicating that plant-microbe interactions were not the main driver of microbial community transcription at the time of sampling. Taken together, these data illustrate the complexity of natural soil systems and the need for further efforts to develop more accurate conceptual models of plant effects on soil microbial communities or to develop better methods to sample active root tips.


APPENDIX

Table 4.2: Summary of metagenome and metatranscriptome reads, assembly and assembled read abundance

Sample           Total Reads    Non-rRNA reads   Assembled contigs   Percent Assembled*
SRT-1            246,895,742    68,949,934       440,213             81.23%
SRT-2            284,791,354    166,978,397      1,825,857           80.34%
SRT-3            397,351,240    250,124,715      2,237,997           82.42%
CBT-1            271,044,518    168,104,823      2,070,236           77.22%
CBT-2            395,933,348    272,908,023      2,884,984           81.24%
CBT-3            272,295,622    147,577,500      1,972,841           79.44%
SRG-1            298,716,384    NA               6,606,700           40.82%
SRG-2            338,846,620    NA               4,076,354           44.87%
SRG-3            298,364,910    NA               6,207,377           40.93%
CBG-1            367,768,170    NA               5,471,300           54.25%
CBG-2            343,375,648    NA               5,254,567           46.66%
CBG-3            342,045,434    NA               4,960,472           57.39%
Corn-1           335,539,202    NA               4,985,554           58.37%
Corn-2           362,890,032    NA               5,795,619           48.31%
Corn-3           340,306,582    NA               5,159,075           49.56%
Corn-4           353,097,824    NA               5,306,152           44.89%
Corn-5           396,880,408    NA               5,227,891           44.44%
Corn-6           419,729,170    NA               5,989,941           43.18%
Corn-7           411,276,622    NA               6,235,593           49.49%
Corn-R1          416,788,124    NA               6,209,886           35.57%
Corn-R2          379,476,104    NA               6,254,363           39.26%
Corn-R3          362,767,630    NA               6,119,938           49.40%
Miscanthus-1     349,441,596    NA               5,733,645           47.42%
Miscanthus-2     356,902,550    NA               6,365,326           47.73%
Miscanthus-3     355,019,026    NA               6,058,969           44.76%
Miscanthus-4     334,711,394    NA               5,764,967           50.37%
Miscanthus-5     282,058,246    NA               4,566,237           40.79%
Miscanthus-6     213,309,852    NA               2,879,035           30.27%
Miscanthus-7     399,148,934    NA               6,517,537           44.98%
Switchgrass-1    449,331,454    NA               6,606,700           58.36%
Switchgrass-2    352,965,222    NA               4,076,354           41.81%
Switchgrass-3    405,940,264    NA               6,207,377           48.30%
Switchgrass-4    415,253,680    NA               6,156,833           45.96%
Switchgrass-5    389,928,900    NA               5,430,532           43.88%
Switchgrass-6    353,644,618    NA               5,843,003           49.98%
Switchgrass-7    364,517,026    NA               5,715,056           47.57%
Switchgrass-R1   305,009,580    NA               4,592,503           35.57%
Switchgrass-R2   332,653,200    NA               5,545,283           39.29%
Switchgrass-R3   312,867,640    NA               4,239,083           34.52%
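The rRNA percentages quoted in the Results can be approximated from the metatranscriptome rows of Table 4.2. The short Python sketch below assumes that every read not counted as non-rRNA is an rRNA read. The read counts are copied from the table and the function name is ours.

```python
# (total reads, non-rRNA reads) for the metatranscriptome samples in Table 4.2.
READ_COUNTS = {
    "SRT-1": (246_895_742, 68_949_934),
    "SRT-2": (284_791_354, 166_978_397),
    "SRT-3": (397_351_240, 250_124_715),
    "CBT-1": (271_044_518, 168_104_823),
    "CBT-2": (395_933_348, 272_908_023),
    "CBT-3": (272_295_622, 147_577_500),
}

def rrna_percent(total_reads, non_rrna_reads):
    """Percentage of reads attributed to rRNA (assumed to be total minus non-rRNA)."""
    return 100.0 * (total_reads - non_rrna_reads) / total_reads

rrna = {sample: rrna_percent(*counts) for sample, counts in READ_COUNTS.items()}
for sample, percent in rrna.items():
    print(f"{sample}: {percent:.1f}% rRNA")
```

Under this assumption SRT-1 stands out with roughly 72% rRNA, in line with its exclusion as a technical outlier.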

Table 4.3: Metatranscriptome protein coding annotations

Sample    Protein Coding Annotations
CBT-1     8,780
CBT-2     9,110
CBT-3     8,798
SRT-1     6,606
SRT-2     8,522
SRT-3     8,967

REFERENCES


1. York, L.M., et al., The holistic rhizosphere: integrating zones, processes, and semantics in the soil influenced by roots. Journal of Experimental Botany, 2016. 67(12): p. 3629-43.

2. Peiffer, J.A. and R.E. Ley, Exploring the maize rhizosphere microbiome in the field. Communicative & Integrative Biology, 2013(October): p. 5-7.

3. Shi, S., et al., The interconnected rhizosphere: High network complexity dominates rhizosphere assemblages. Ecology Letters, 2016. 19: p. 926-936.

4. Li, X., et al., Functional Potential of Soil Microbial Communities in the Maize Rhizosphere. PLoS One, 2014. 9(11).

5. Zarraonaindia, I., et al., The Soil Microbiome Influences Grapevine-Associated Microbiota. mBio, 2015. 6(2).

6. Edwards, J., et al., Structure, variation, and assembly of the root-associated microbiomes of rice. Proceedings of the National Academy of Sciences, 2015. 112(8): p. 911-920.

7. Yergeau, E., et al., Microbial expression profiles in the rhizosphere of willows depend on soil contamination. The ISME Journal, 2013: p. 1-15.

8. DeAngelis, K.M., et al., Selective progressive response of soil microbial community to wild oat roots. The ISME Journal, 2009. 3(2): p. 168-78.

9. Chaudhary, D.R., et al., Microbial profiles of rhizosphere and bulk soil microbial communities of biofuel crops switchgrass (Panicum virgatum L.) and jatropha (Jatropha curcas L.). Applied and Environmental Soil Science, 2012. 2012: Article ID 906864.

10. Li, D., et al., MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 2015. 31(10): p. 1674-1676.

11. Langmead, B. and S.L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nature Methods, 2012. 9(4): p. 357-U54.

12. Quinlan, A.R. and I.M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 2010. 26(6): p. 841-842.

13. Meyer, F., et al., The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 2008. 9.

14. McMurdie, P.J. and S. Holmes, Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Computational Biology, 2014. 10(4).

15. De-la-Pena, C., et al., Root-microbe communication through protein secretion. Journal of Biological Chemistry, 2008. 283(37): p. 25247-25255.

16. Zhang, N., et al., Effects of different plant root exudates and their organic acid components on chemotaxis, biofilm formation and colonization by beneficial rhizosphere-associated bacterial strains. Plant and Soil, 2013. 374(1-2): p. 689-700.

17. Mendes, L.W., et al., Taxonomical and functional microbial community selection in soybean rhizosphere. The ISME Journal, 2014: p. 1577-1587.


Chapter 5

Conclusions and Future Directions


Conclusions

Soil microbial communities provide many beneficial services that humans rely on. For many years microbial ecologists have studied these beneficial services in laboratory conditions, in growth chambers and greenhouses. However, we lack foundational knowledge of the beneficial services provided by soil microbes under natural environmental conditions and in communities. This dissertation attempts to narrow the knowledge gap through the development and use of a novel technological method, metatranscriptomics. This dissertation first evaluates the practicality and use of metatranscriptomics on field-collected agricultural soil samples. To my knowledge this has not been done before. Additionally, this dissertation attempts to establish best practices for metatranscriptomic analysis. Second, this dissertation utilizes metatranscriptomics to identify the actively transcribed genes through the use of a minimum functional core composed of annotations found in both metagenome and metatranscriptome samples collected from the same site. Finally, this dissertation explores the relationship between microbial potential activity and actual activity in relation to distance from living plant roots.

In chapter two of this dissertation a novel method of rRNA removal, called duplex-specific nuclease (DSN) normalization, is explored using a soil sample collected from an agricultural field site. DSN normalization is not as efficient at removing rRNA as probe-based methods such as the RiboZero kit. However, it still offers several advantages over probe-based methods. It requires only 10 ng of total RNA as input, while the RiboZero kit requires at least one microgram of material. Another advantage is that the rRNA that is not removed from the sample can be used for phylogenetic analysis. This is because DSN is a normalization procedure that decreases the relative abundance of the most abundant sequences while preserving the relative abundance of all sequences across the sample. The RiboZero kit removes rRNA based on probes, so any contaminating rRNA in the metatranscriptome sample is present because there was no probe in the RiboZero kit with a close enough match to remove the sequence. Chapter two of this dissertation also identifies best practices for metatranscriptomic analysis, namely the need for short read assembly. Assembly was shown to greatly improve the confidence in annotation.

Chapter three of this dissertation utilized a multi-omics approach to identify actively transcribed and translated genes. To accomplish this goal, a minimum functional core of functional annotations present in two of three metagenome and metatranscriptome samples was established. The metaproteomes derived from the metagenomic and metatranscriptomic data were compared to the minimum functional core. Finally, the minimum functional core was compared to plant-specific functional cores derived from rhizoplane soil taken from switchgrass, corn and Miscanthus. This comparison showed that over 90% of functions in the minimum functional core were found throughout the field site, indicating that it is broadly representative of the site. The minimum functional core was composed of many housekeeping functions, mostly related to transcription, translation and central metabolism. The minimum functional core also contained functions related to important biogeochemical cycles such as those of carbon, nitrogen, phosphorus and iron. Functions related to plant-microbe interactions were also found within the minimum functional core.

Chapter four of this dissertation examines the effect of proximity to living roots on the functional composition of the microbial community. In this study metagenomes and metatranscriptomes from corn bulk soil and switchgrass rhizosphere are compared. Corn bulk soil (from the root-free soil between the corn rows) was used to approximate switchgrass bulk soil, as roots from the 5-year-old switchgrass stand had fully penetrated the plot and suitable bulk (root-free) soil could not be found. Analysis of the corn bulk and switchgrass rhizosphere metagenomes showed a significant difference, while the metatranscriptomes did not show a significant difference. These data show that while the metagenomes underlying the metatranscriptome show differences in functional composition, the activity of the community is very similar. These data indicate that at the time of sampling plant treatment was not a strong driver of community function. The similarity in community activity could be due to environmental conditions at the time of sampling. Metagenome samples collected from the rhizoplane of switchgrass and corn show a statistically significant difference. Metagenomes from both corn and switchgrass rhizoplanes were compared to their respective rhizosphere samples and were statistically different. However, the corn bulk soil samples were not statistically different from the corn rhizosphere samples. Due to these results the rhizosphere samples were labeled as non-rhizoplane samples to better reflect the fact that they were not statistically different from the bulk samples. Nonmetric multidimensional scaling analysis shows that the corn bulk samples do not cluster between rhizosphere samples as expected. Instead, the bulk soil samples cluster along a shared trajectory with the corn rhizosphere and rhizoplane samples. This indicates that the bulk soil is under the influence of the plant treatment, possibly due to root exudates but more likely due to corn plant residues from previous seasons.

Overall, my dissertation highlights some of the challenges of using field-collected samples for plant-microbe interaction -omics sequencing studies. Studies commonly conducted in a growth chamber or a greenhouse have the luxury of a relatively stable, controlled and more homogeneous environment, in contrast to the heterogeneous environment experienced by microbes in nature. Laboratory-based studies of plant-microbe interactions are also better able to collect the desired sample types, as their access may be built into the study design. For example, one can have a design providing for more confidence in collecting a rhizosphere sample with a more defined relationship to living active roots, and to roots of a particular age. In the field one must uproot plants to collect samples from, and the most accessible roots may be older and lignified. The more active growing root tips may not be accessible in the field. Greater care must be given to field sample collection in order to collect the most informative samples. For these reasons field-collected samples may show additional variation in microbial activity. Any patterns of activity may be more obscured by variance in environmental conditions and sampling procedures. These findings, while not ideal for experimental design, identify factors that must be taken into account when collecting field samples.

Future directions

A common theme in this dissertation and a common problem among all metagenome and metatranscriptome studies is the existence of hypothetical proteins. A meta-analysis of published metagenomes and metatranscriptomes could be conducted to describe the distribution of hypothetical genes across different habitats. This work could potentially identify hypothetical proteins as enriched in a particular habitat or ubiquitous throughout many different types of microbiomes. Of equal or greater importance are unannotated contigs. A similar meta-analysis approach could be used to bin common unannotated contigs by habitat based on sequence identity. Again, this work could establish links between unannotated contigs and particular environments or conditions, or identify them as cosmopolitan.

The minimum functional core described in chapter three could be applied to other soils and environments. Comparing the minimum functional core to other soils and environments could aid in determining which functions are cosmopolitan in all environments, such as many housekeeping functions, and identify functions that are specific to a given environment such as soil, gut, aquatic habitat, etc. This would also allow for better comparison between soil types to identify how environment shapes microbial community functions.

Identification of biotic and abiotic factors that influence the activity of plant-microbe interaction genes and functions involved in biogeochemical cycling is of particular importance. The ability to promote and control these functions could have a variety of environmental impacts such as reducing fertilizer use, promoting plant growth and mitigating climate change by sequestering carbon in soils. A time series sampling approach combined with metatranscriptomics could provide valuable insight into biotic and abiotic factors affecting microbial community activity related to plant-microbe interactions and biogeochemical cycling. If full metatranscriptomics is not deemed feasible, a targeted metatranscriptome approach could be used. The program Xander could be used to assemble low abundance genes of interest, and primers could then be designed to target the desired set of genes.

Finally, chapter four of this dissertation raises questions about the effect of plants, both through living roots and potentially leftover residues, on the microbial community functional composition. To further investigate the effect of plant treatment in agricultural soils, the metagenomics portion of the experiment in chapter four could be repeated for the corn plot only. Additional samples would be collected from both the rhizosphere and the bulk soil (between crop rows). I would recommend at least six samples be taken from each soil zone to ensure enough statistical power. Alternatively, soil could be collected from between crop rows of corn, sieved and taken to the lab. The soil could then be divided into three treatments: a corn treatment where corn stover is mixed in the soil, a switchgrass treatment where cuttings from switchgrass are mixed in the soil, and a control where nothing is added to the soil. Metagenome sequence could be collected before the treatment and then after six or so months.
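One way to see why more replicates help a permutation-based test is to count the distinct ways the samples can be relabeled. The Python sketch below is a simplified illustration. It assumes a single two-level factor, equal group sizes, unrestricted permutations and a statistic that is unchanged when the two group labels are swapped. Other designs, and Monte Carlo p-values as reported by software, may behave differently. The function names are ours.

```python
from math import comb

def distinct_two_group_splits(n_per_group):
    """Number of distinct ways to split 2n samples into two unlabeled groups of n."""
    return comb(2 * n_per_group, n_per_group) // 2

def smallest_exact_p(n_per_group):
    """Lower bound on an exact permutation p-value under the stated assumptions."""
    return 1 / distinct_two_group_splits(n_per_group)

for n in (3, 6):
    print(
        f"{n} samples per group: {distinct_two_group_splits(n)} distinct splits, "
        f"p-value no smaller than {smallest_exact_p(n):.4f}"
    )
```

Under these assumptions, three samples per group allow only ten distinct splits, whereas six per group allow several hundred, which gives the test far more room to detect a difference.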
