FUNCTIONAL METAGENOMICS AND CONSOLIDATED BIOPROCESSING FOR VALORIZATION OF PULP AND PAPER MILL SLUDGE

by

Anupama Achal Sharan

B.E., Birla Institute of Technology, Mesra, Ranchi, 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF APPLIED SCIENCE

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES

(Chemical and Biological Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA

(Vancouver)

April 2018

© Anupama Achal Sharan, 2018


Abstract

Biocatalyst discovery is integral to bioeconomy development, enabling design of scalable bioprocesses that can compete with the resource-intensive petrochemical industry. Uncultivated microbial communities within natural and engineered ecosystems provide a near-infinite reservoir of genomic diversity and metabolic potential that can be harnessed for this purpose. To bridge the cultivation gap, functional metagenomic screens have been developed to recover active genes directly from environmental samples. In this thesis, a pipeline for recovery of biomass-deconstructing biocatalysts sourced from the pulp and paper mill sludge (PPS) metagenome is described. This environment is targeted because its high cellulose content is hypothesized to direct enrichment of enzymes capable of hydrolysing cellulose. The resulting oligosaccharides represent platform molecules that can be fed to downstream applications using consolidated process design for converting biological waste streams into value-added products.

High-molecular-weight DNA was extracted from sludge and used to construct a fosmid library containing 15,000 clones using the copy control system in EPI300™-T1R E. coli. Extracted DNA was also used in whole genome shotgun sequencing to compare the metabolic potential of the sludge community with fosmid screening outcomes, as well as with other waste biomass environments, using the MetaPathways v2.5 software pipeline, with specific emphasis on carbohydrate-active enzymes (CAZymes). Metagenomic assembly, open reading frame (ORF) prediction, binning and taxonomic assignment approaches were also used to bring out correlations between function and taxonomy. In total, 32,232 ORFs predicted to encode glycoside hydrolases, glycosyl transferases, and carbohydrate binding module families were mapped to the CAZy database.

The fosmid library was screened for glycoside hydrolase activities using a pool of sensitive fluorogenic glycosides of 6-chloro-4-methylumbelliferone (CMU). A total of 744 clones capable of converting pooled substrates were recovered, indicating an extremely high hit rate (1 hit per 43 clones). Following fosmid sequencing and annotation, two of the most promising hits with defined single GH family loci were sub-cloned and overexpressed in the E. coli BL21(DE3) strain to conduct basic biochemical characterization. Activity of purified enzymes was demonstrated on model lignocellulosic substrates to evaluate the potential of implementing the proposed circular bioprocess with waste PPS as both the feedstock and source of enriched biocatalysts.


Lay Summary

In an era of growing global consciousness for bioeconomy development, it is ironic that the pulp and paper industry, which is one of the most prominent biomass-based industries in terms of revenue generation, is declining in Canada. The strategic importance of this industry within the bioeconomy and an opportunity to reduce environmental burden by not only remediating but valorising pulp and paper mill waste forms the primary motivation behind this work. This study presents an approach to add value to the solid waste stream, paper sludge, from this industry by unravelling its environmental genome to discover novel genes that produce enzymes capable of breaking down biomass. This was done using both experimental techniques and data analysis that involved high-throughput robotic functional screening and bioinformatics approaches. The findings of this study point to definitive cost-reduction approaches in industrial bioprocessing using cheap, waste biomass feedstocks for reciprocal enzyme discovery and enhanced bioconversion.


Preface

Several sections of this work are being used for composing manuscripts for publication in peer-reviewed journals.

The research was conducted under the co-supervision of Dr. Vikramaditya G. Yadav (Chemical and Biological Engineering) and Dr. Steven J. Hallam (Microbiology and Immunology).

I conducted the literature review, defined research questions and methodologies, designed and conducted experiments, analysed data, and compiled and interpreted results under their guidance and with assistance from members of both the Hallam and Yadav lab groups. The thesis was written by me and contains original, unpublished work with input from both Dr. Hallam and Dr. Yadav.

Several bioinformatics-based workflows presented in chapters 2-3, including shotgun metagenome DNA QC, trimming and assembly, binning, EMIRGE and Phylosift taxonomy assignment, and FabFos pipeline assembly of positive fosmid clone sequences, were implemented in collaboration with Connor Morgan-Lang. Small subunit ribosomal RNA (SSU rRNA or 16S) gene pyrotag sequencing and operational taxonomic unit (OTU) data analysis for the metagenomic DNA, as presented in chapter 2, were done in collaboration with Ashley Arnold.

Annotation of the pulp and paper sludge (PPS) metagenome and other metagenomes included in the comparative analysis presented in chapter 2 was done using Metapathways 2.5, a metagenomic DNA annotation pipeline developed in the Hallam lab (Hanson et al. 2014), with suitable in-house updates. The same pipeline was also used for positive fosmid clone annotations as presented in chapter 3. The fosmid sequence ORF figure was produced by Kateryna Ievdokymenko and edited by me.

The fluorogenic substrates for fosmid library screening, as presented in chapter 3, were kindly provided by Zach Armstrong from the Withers lab group at UBC, who also guided the experimental design for fosmid library screening, biochemical assays and characterization. Dr. Aria Hahn and Dr. Keith Mewis also guided several aspects of metagenomic annotation and interpretation of results.

Compositional analysis of the paper mill sludge as presented in chapter 4 was guided by Dr. Jinguang Hu, Saddler lab group (Forestry), and done in collaboration with Daniela Vargas Figueroa, MASc student in the same group. The Saddler group also provided the Celluclast enzyme used as positive control in enzyme assays. The mutant Chit-O enzyme used for cellulase assaying was kindly provided by Alessandro R. Ferrari from the University of Groningen, Netherlands. Dr. Sandip Pawar and Benson Chang from the Yadav group helped with sub-cloning work presented in this chapter.


Table of Contents

Abstract

Lay Summary

Preface

Table of Contents

List of Tables

List of Figures

List of Abbreviations

Acknowledgements

Dedication

Chapter 1: Introduction

1.1 Sustainability and the Circular Economy

1.2 The Bioeconomy

1.2.1 Industrial Biocatalysis

1.3 Rejuvenating the Paper Industry by Valorising Solid Waste Stream – Pulp and Paper Mill Sludge (PPS)

1.4 Research Overview

Chapter 2: In-silico Analysis of the Paper Mill Sludge Microbiome

2.1 Background

2.1.1 Metagenomics – Unearthing Nature’s Biocatalytic Potential

2.1.1.1 Lignocellulose Hydrolysis and Valorization

2.1.2 In-silico Approaches in Metagenomics – what all can the microbiome map tell us?

2.1.2.1 Bioinformatics Tools – making sense in the noise

2.1.2.2 Standardizing Data and Streamlining Analysis

2.2 Materials and Methods

2.2.1 Sample Collection and Processing

2.2.2 Whole-metagenome Sequencing, Binning and Assembly

2.2.2.1 High Molecular Weight Genomic DNA Extraction

2.2.2.2 Sequencing

2.2.2.3 Metagenome Assembly and Binning

2.2.3 Taxonomic Analysis

2.2.3.1 454 16S rRNA gene Pyrotag Sequencing

2.2.3.2 Expectation Maximization Iterative Reconstruction of Genes from the Environment (EMIRGE) based 16S Prediction

2.2.3.3 Phylosift Annotation

2.2.4 Functional annotation of metagenome

2.2.5 Mapping Reads to Assembly and Generating Normalized Abundance Values for Open Reading Frames (ORFs)

2.2.6 Qualitative comparison of Carbohydrate Active Enzyme (CAZyme) abundance across different environments

2.3 Results

2.3.1 Metagenome Assembly and Annotation

2.3.2 CAZy Families Relevant to Plant Polysaccharide Degradation

2.3.3 Binning the PPS metagenome

2.3.4 The Paper Sludge Microbiome and Metabolic Potential

2.3.4.1 Taxonomic Distribution

2.3.4.2 CAZy Distribution

2.3.5 Comparison to Other Environmental Microbiomes

2.4 Conclusion

Chapter 3: High-Throughput Biocatalyst Discovery from Paper Sludge Metagenome by Functional Metagenomics

3.1 Background

3.1.1 Functional Metagenomics - discovering novel industrial biocatalysts

3.1.2 What goes into metagenomic functional screen design?

3.1.3 Carbohydrate Active Enzymes (CAZy) Database: Glycoside Hydrolase (GH) families and Polysaccharide utilisation loci (PULs)

3.2 Materials and Methods

3.2.1 Fosmid Library Construction

3.2.2 Functional Screening

3.2.3 Fosmid Purification and Sequencing

3.2.4 Fosmid Assembly and Annotation

3.3 Results and Discussion

3.3.1 Metagenomic Library Construction and Host Selection

3.3.2 High-throughput Functional Screening

3.3.3 Fosmid Sequencing and Annotations

3.4 Conclusion

Chapter 4: Function to Application – Consolidated Bioprocessing

4.1 Background

4.1.1 Coming Full Circle - design and implementation of consolidated bioprocessing

4.1.2 Biochemical Assays for Cellulose Hydrolysis Kinetics

4.1.3 Matching the Scales of Discovery and Application - what does functional metagenomics need to go all the way?

4.2 Materials and Methods

4.2.1 Biomass Compositional Analysis

4.2.2 Detection of Hydrolytic Activity using Colorimetric Assay

4.2.3 Different Systems for Testing Cellulolytic Activity

4.2.4 Bench-scale Bioprocess Development

4.3 Results

4.3.1 Compositional Analysis

4.3.2 Colorimetric Detection of Cellulolytic Activity

4.3.2.1 Whole-cell Lysates

4.3.2.2 Fosmid Whole Cell Lysate and 50X Concentrated Culture Supernatant Protein Fractions

4.3.2.3 Sub-cloned GH genes from fosmids over-expressed in E. coli BL21(DE3) strain

4.3.3 Bench-scale Hydrolysis

4.4 Conclusion

Chapter 5: Thesis Conclusion and Future Directions of Work

5.1 Concluding Discussion

5.2 Future Perspectives

5.2.1 Microbiome Metabolic Potential

5.2.2 Functional Metagenomic Screening

5.2.3 Consolidated Bioprocess Development

Bibliography

Appendix A: Chapter 4 - Sub-cloning Details


List of Tables

Table 1.1: Bioeconomy development strategies

Table 1.2: Approximate content of cellulose, hemicellulose and lignin in different types of waste lignocellulosic materials (Prasetyo and Park 2013)

Table 2.1: The most prominent enzyme activities needed to completely hydrolyse lignocellulosic biomass (adapted from Salehi Jouzani and Taherzadeh 2015)

Table 2.2: Genome Standards Consortium recommendations for metadata generation for metagenome assembled genomes (MIGS – Minimum information about a Genomic Sequence) (Bowers et al. 2017)

Table 2.3: Metagenome assembled genome (MAG) metadata requirements (Bowers et al. 2017)

Table 2.4: General Assembly Statistics

Table 2.5: Summary of phylogenetic (Phylosift) and functional (Metapathways) information of the medium-high quality draft genomes as assembled through binning of the PPS metagenome (*≤ 1 gene count)

Table 2.6: Taxonomic assessment comparison pipeline across the major phyla depicted in Figure 2.9

Table 2.7: ORF statistics and metadata for metagenomes in comparative analysis

Table 3.1: Hit rate of different functional metagenomic libraries screened for glycoside hydrolase genes using a soluble, chromogenic model compound, 2,4-dinitrophenyl cellobioside (DNP-C) (data from Mewis 2016)

Table 3.2: Fosmid assembly statistics – generated using FabFos pipeline and Quast online tool

Table 3.3: Taxonomic assignment of ORFs across fosmids: ORF taxonomic annotation was done through the LCA algorithm implemented using NCBI taxonomy tree in Metapathways pipeline (in cases of multiple GH loci from individual fosmid annotations – the annotation for ≥50% of instances was reported)


List of Figures

Figure 1.1: Types of sustainability and meaningful sustainable development at their intersection (adapted from Rehmann 2010)

Figure 1.2: The bioeconomy circular by nature (BioVale 2015)

Figure 1.3: Flowchart showing pathways to products from biomass that are conventionally produced from petroleum based feedstocks (U.S. DOE. 2015)

Figure 1.4: Major current industry statistics for the Pulp and Paper industry in Canada (Source: IBIS world open access content)

Figure 1.5: Research Overview

Figure 2.1: Dot plot to show abundance of enzymes discovered through functional screening in metagenomic libraries (Taupp et al. 2011) Copyright © 2011 Elsevier Ltd.

Figure 2.2: Structure of lignocellulosic biomass containing cellulose (composed of a β-1,4-linked chain of glucose molecules), hemicellulose (composed of various 5- and 6-carbon sugars) and lignin (composed of three major phenolic components) (Rubin 2008) Copyright © 2008, Springer Nature

Figure 2.3: Quality assessment pipeline for single amplified genomes (SAGs) and genomes from metagenomes (MAGs) (Bowers et al. 2017)

Figure 2.4: Phylosift workflow schematic (Darling et al. 2014)


Figure 2.5: Nx curve for Megahit assembled Pulp and Paper sludge (PPS) metagenome (Figure in collaboration with Connor Morgan-Lang)

Figure 2.6: Manhattan distance hierarchical clustering of CAZymes used in this study

Figure 2.7: Completeness and contamination estimation of the bins of PPS metagenome

Figure 2.8: Left to right (a) Taxonomic assignment distribution at phylum level for the bins as determined through Phylosift (b) Cumulative percentage distribution of CAZymes across the annotated bins using Metapathways

Figure 2.9: Qualitative assessment of taxonomy distribution of major phyla through different pipelines

Figure 2.10: Overall CAZyme family distribution in the PPS metagenome as annotated using Metapathways

Figure 2.11: Phylum level distribution of relevant Glycoside Hydrolase (GH) genes in pulp and paper sludge (PPS) metagenome (only phyla constituting > 90% of the taxonomy as annotated through Metapathways included)

Figure 2.12: Hierarchical clustering of the different microbiomes in this study based on relevant CAZy gene counts (tree distances calculated using Manhattan method)

Figure 2.13: Heat map showing differential abundance of CAZyme families across different metagenomic environments (The color coding represents the conversion of VST GH count values to an enrichment-depletion scale based on calculated z-scores for each family across different environments)


Figure 3.1: Production and functional screening of metagenomic libraries (Taupp et al. 2011) Copyright © 2011 Elsevier Ltd.

Figure 3.2: Metagenomic library and functional screening schematic for pulp and paper sludge (PPS) metagenome

Figure 3.3: The β-1,4-glycoside substrates of 6-chloro-4-methylumbelliferyl (CMU) used for functional screening of the pulp and paper sludge (PPS) metagenomic library clones (A) CMU-cellobioside (B) CMU-xylobioside (C) CMU-Mannoside (figures by Zach Armstrong)

Figure 3.4: FabFos pipeline schematic (https://github.com/hallamlab/FabFos)

Figure 3.5: Co-occurrence and co-localization of glycoside hydrolase (GH) genes as presented in literature (A) Heat map showing frequencies of co-occurrence of GH43 subfamily domains with major noncatalytic modules including CBM, carbohydrate binding module; DOC, cellulosomal dockerin domain; X19, conserved noncatalytic module with subfamilies clustered as per respective HMM profiles (Mewis et al. 2016). (B) Schematic representation of gram-positive polysaccharide utilization loci (gpPULs) concerned with xylan, pectin and arabinogalactan utilization © (Harris et al. 2016)

Figure 3.6: Schematic of testing for different cellulolytic activities using the 6-chloro-4-methylumbelliferyl (CMU) glycoside of cellobiose and resultant products from enzymatic breakdown that results in fluorescent signal detection

Figure 3.7: Initial functional screening results with all clones in the pulp and paper sludge (PPS) metagenomic library


Figure 3.8: Second round of screening – validation of top-128 hits in triplicate and deconvolution of activity on CMU-cellobioside

Figure 3.9: Reproducibility test of top-29 hits using CMU-cellobiose alongside background control ePCC1FOS and positive control Celluclast enzyme cocktail; inset shows ePCC1FOS values on the two runs (error bars represent 5% error)

Figure 3.10: Percentage breakdown of Metapathways annotations of fosmid ORFs with focus on CAZy annotations

Figure 3.11: Fosmid linked genomic map - each line represents a fosmid clone with some fosmids represented by multiple contigs. Each predicted gene is represented by an arrow showing the direction of transcription. Grey links connect protein homologs with e-value ≤ 1e-10 (Figure in collaboration with Kateryna Ievdokymenko)

Figure 3.12: Taxonomic distribution across sequenced fosmids (Taxonomy assigned based on LCA assignment of taxonomy at phylum level represented in ≥50% of ORFs for each fosmid assembly)

Figure 4.1: Schematic of proposed circular, consolidated process using paper sludge feedstock as direct (smaller circle) and indirect (bigger circle) applications to the paper industry

Figure 4.2: Different bioprocessing strategies available for the conversion of lignocellulosic biomass to bioalcohols. Abbreviations: SHF, separate hydrolysis and fermentation; SHCF, separate hydrolysis and co-fermentation; SSF, simultaneous saccharification and fermentation; SSCF, simultaneous saccharification and co-fermentation; CBP, consolidated bioprocessing (Salehi Jouzani and Taherzadeh 2015)

Figure 4.3: Pulp and paper sludge (PPS) feedstock (left-right) (a) Wet PPS cakes obtained after filtration of water content (b) Dry PPS (constant weight) (c) Dried, milled and sieved PPS

Figure 4.4: Schematic of the colorimetric assay used for cellulolytic activity detection

Figure 4.5: Experimental set-up for bio-hydrolysis

Figure 4.6: Percentage composition of dried, milled PPS (left-right) (a) Klason method (b) CHN elemental analysis (5% error)

Figure 4.7: Titration of Celluclast FPU in the chito-oligosaccharide oxidase assay (net absorbance values after subtracting assay mixture blank)

Figure 4.8: Colorimetric assay - Columns 1-3 titration of Celluclast at different FPU loading; other wells show supernatant from hydrolysis of PPS at different solid loadings (fixed Celluclast loading 500mU) showing color development in contrast to blanks (D-F 10-12)

Figure 4.9: (Left-right) T0 and T0+24 hours incubation of fosmid whole cell lysates with filter paper substrate. Only wells spiked with Celluclast show activity and colour change

Figure 4.10: Absorbance readings at specific time intervals during incubation – representative results for two fosmid clone reactions - the cellulolytic activity signal is resulting only from Celluclast enzyme action with no contribution from fosmid whole cell lysates (‘+cel’ refers to spiking of reaction with 10mU of Celluclast, 5% error)


Figure 4.11: Protein content estimation of fosmid whole-cell lysate and supernatant fractions using BCA assay (50mL cultures; 5% error)

Figure 4.12: SDS PAGE visualisation of whole cell lysate and secreted protein fractions

Figure 4.13: Measurement of colorimetric signal after incubation with filter paper substrate for 72 hours (left-right) (a) colour development and (b) absorbance values at end of incubation period

Figure 4.14: Supplementing Celluclast enzyme mixture with fosmid protein fractions (left-right) and application to filter paper substrate (a) Replacement (1:1) with total protein content fixed at 35mg/g cellulose (b) Addition of protein fractions to give net double increase in total protein content (Celluclast + fosmid protein)

Figure 4.15: SDS-PAGE results of sub-cloned BL21 DE3 cell lysate and supernatant fraction with genes from fosmids P04P08 and P14I01 respectively (sup - supernatant fraction; CL – cell lysate)

Figure 4.16: Activity testing of sub-cloned cell lysate and protein fraction using CMU substrates (CMU-C2: Cellobioside; CMU-X2: Xyloside; CMU-Man: Mannoside; CMU-3X: mixture of all three substrates; readings at end of 4-hour incubation period with 5% error and inset shows deconvolution tests for P14I01 supernatant fraction)

Figure 4.17: Percentage conversion of glucan in PPS to glucose during the hydrolysis experiment


Figure 5.1: Material and energy-based revenue flow streams around the paper mill using a biorefinery for valorization of pulp and paper mill sludge

Figure A.1: Plasmid pET-21 a(+) circular map (Addgene database)


List of Abbreviations

AAI – Amino acid identity

AA – Auxiliary activities

BACs - Bacterial artificial chromosomes

BCA - Bicinchoninic acid

BSA – Bovine serum albumin

CAZyme – Carbohydrate active enzyme

CBM – Carbohydrate binding module

CBP – Consolidated bioprocessing

CE – Carbohydrate esterase

CFU – Colony forming unit

CMU - 6-chloro-4-methylumbelliferyl

COG – Clusters of orthologous groups

DNS - 3,5-dinitrosalicylic acid

EMIRGE - Expectation maximization iterative reconstruction of genes from the environment

ePGDB - Environmental pathway/genome database

GH – Glycoside hydrolase


GHK - Glucose hexokinase

GO - Glucose oxidase

GT – Glycosyl transferase

HMM – Hidden Markov model

HPLC – High performance liquid chromatography

IMG-M - Integrated Microbial Genomes & Microbiomes

JGI – Joint genome institute

KEGG - Kyoto Encyclopedia of Genes and Genomes

MIMAG - Minimum information about metagenome-assembled genome

MIMS - Minimum information about a metagenomic sequence

MIxS - Minimum information about any (x) sequence

MP – Metapathways

NCBI – National center for biotechnology information

NGS – Next generation sequencing

nr – non-redundant

ORF – Open reading frame

OTU – Operational taxonomic unit


PCR – Polymerase chain reaction

PFGE – Pulse field gel electrophoresis

PL – Polysaccharide lyase

PMSF - Phenylmethane sulfonyl fluoride

PPS – Pulp and paper mill sludge

PULs – Polysaccharide utilization loci

QIIME - Quantitative Insights into Microbial Ecology

RPKM - Reads per kilobase per million mapped reads

SDS PAGE – Sodium dodecyl sulphate polyacrylamide gel electrophoresis

SIP – Stable isotope probing

SIGEX - Substrate induced gene expression

TMP – Thermo-mechanical pulping

VST - Variance stabilizing transformation

XyGULs – Xyloglucan utilization loci


Acknowledgements

I would like to thank my co-supervisors, Dr. Steven Hallam and Dr. Vikramaditya Yadav for their support and guidance in conducting this work. Both of them encouraged me to think creatively and undertake novel approaches to answer my research questions. Without their mentorship this amazing journey at UBC and participation in the ECOSCOPE NSERC-CREATE training program would not have been possible. This experience has been extremely beneficial to me both personally and professionally.

I am also very grateful to them and Dr. Heather Trajano for providing inputs on the thesis and directing me to relevant literature resources for reading. Thanks also to Dr. Dhanesh Kannangara for being a great mentor and for her constant support during my program here. I would also like to acknowledge Brian Houle, Vaughan Blackman and Prashanth Krishnamoorthy for their help with sampling of pulp and paper mill sludge. Big thanks to Dr. Hubert Timmenga for his constant mentorship throughout the project.

A big part of my work was done in collaboration with the members of both the Hallam lab and Yadav lab (Biofoundry). Special thanks to Zach Armstrong, Connor Morgan-Lang, Ashley Arnold and Kateryna Ievdokymenko for their constant advice on experiments and help with data analysis and bioinformatics. I would also like to thank Dr. Aria Hahn for her guidance and mentorship and Dr. Keith Mewis for providing insights from his work. Thanks to Joe Ho and Jewel Ocampo for their constant support. Big thanks to Biofoundry lab members, Sonal, Roza, Julia, Dr. Protiva, Dr. Sandip, Carmen and Benson for their advice and support.


Thanks also to Andrew Wieczorek for his guidance during his time in the Hallam lab. I am also grateful to Dr. Cara Haney and Dr. Ranil Waliwitya (and the team at Active Agri Science) for coordinating my amazing internship experience.

Last but not least, I would like to thank all my friends in Vancouver and Canada for being my constant support system. They have made this place a home away from home. My biggest motivation is my parents and grandparents, who have always encouraged me to push the limits and try to excel in all my endeavours.

I thank you all!


To my parents and grandparents!


CHAPTER 1 Introduction

1.1 Sustainability and the Circular Economy

There is a worldwide movement towards sustainability in almost all manufacturing and process industries, and it is hence no surprise that several emerging research areas within the field of Chemical Engineering are dedicated to improving, modelling and assessing the sustainability of these industries.

Sustainability appears simple and straightforward in one of its most widely adopted definitions, “meeting the needs of the present without compromising the ability of future generations to meet their own needs” (United Nations General Assembly 1987), yet it is quite ambiguous in how it is approached. Often, sustainability is used interchangeably with environmental protection. While preservation of the environment is paramount, making it the sole basis for revamping a manufacturing process to conform to sustainability norms can often produce undesirable consequences. This has been observed as a major factor in the decline of the paper industry within Canada, where the increasing costs of environmental regulations levied on the industry are being passed on to customers and have decreased the industry’s global competitiveness (Bogdanski 2014). True sustainable development can only be efficiently achieved at the intersections of economic, environmental and socio-political sustainability (Rehmann 2010) (Figure 1.1).


[Figure panel labels: Environmental Sustainability; Economic Sustainability; Socio-political Sustainability; Sustainable development]

Figure 1.1: Types of sustainability and meaningful sustainable development at their intersection (adapted from Rehmann 2010)

To this end, circularization of the economy is being put forward as a potential framework for approaching sustainable development in industrial manufacturing. Within the circular economy model, the resource demands of global economic growth are reconciled with scarce environmental resources by focusing on product reuse, remanufacture, and resource recycling. This is postulated to extract the maximum value from the current, linear economy through an approach combining industrial reworking, policy incentives and technological innovation. Based on the different definitions (Geng and Doberstein 2008; The Ellen MacArthur Foundation 2015; Bocken et al. 2016) and theoretical influences for the circular economy (McDonough and Braungart 2002; Benyus 2009; Stahel 2010) presented in scientific literature, the following definition is adopted for this study:


“The circular economy is the material, energy and financial flow system that minimizes the resource input from the environment, prioritizes sustainable development, and minimizes the disposal of unused outputs from processes by employing recycle streams to valorize waste and maximize value addition to the global economic market. The ultimate intention is to maximize economic efficiency of the linear economy by the reallocation of scarce environmental resources, utilizing all product streams by integration of each product flow into an alternate input stream to totality.”

Circularization requires coordination between industry, consumers, and government, and path dependency makes it difficult to implement process improvements that are possible in theory. Due to the lack of global governance, the coordination of industries on a global scale is unlikely to occur anytime soon (Korhonen et al. 2018), placing an upper limit on the capabilities of circularization. In traditional consumption markets, the bulk of recycling and reuse responsibility is placed on the consumer, so that producers internalize profits and externalize negative effects like pollution. The European Union and the United Nations are attempting to implement the circular economy using a top-down approach, in which policy is expected to foment a culture of sustainability by forcing national governments to adhere to international objectives. According to Lieder and Rashid, a bottom-up approach, in which industry takes responsibility for its product life cycles, must be applied simultaneously to achieve circularization (Lieder and Rashid 2016). This has indeed been reflected in the most recent circularization initiatives launched at the World Economic Forum (2018), which are increasingly business and global supply chain centric.

There are, however, several technological bottlenecks that need to be overcome before circularization of manufacturing processes can be achieved. The technologies for efficient recycling are lagging behind our circularization demands (Stahel 2016; Skene 2017). Integrated resource recovery is the disruptive paradigm shift at the core of the closed-loop, circular future. However, splitting heterogeneous waste process streams so that they can be used viably as feedstocks is an expensive endeavor, and there is too little research in the chemical sciences to facilitate this (Stahel 2016).

The bioeconomy is the collection of industries in which renewable biomass is the primary feedstock and biotechnology and bioprocessing are major contributors to economic productivity (Bueso and Tangney 2017). Its ability to valorize renewable resources into a variety of products, together with its technological flexibility compared to other manufacturing industries, allows it to be a catalyst in the success of the global circular economy.

1.2 The Bioeconomy

While the global circular economy community is actively searching for case studies to evaluate economic feasibility and social impact of cyclic workflows, the perfect models can be potentially provided by the bioeconomy (Carrez and Van Leeuwen 2015). Bioprocesses making up the bioeconomy are inherently circular and often operate in closed, feedback-driven loops and the waste products used to supplement other process inputs are typically inexpensive and simple to collect (Figure 1.2). The scientific rationale for pursuing development of bioprocesses is not just formed by the global resource and energy paucity arguments but also by the ability of these processes to restore environmental damage and regenerate value from waste.


Figure 1.2: The bioeconomy, circular by nature (BioVale 2015)

The inhibitory energy requirements that other industries face that offset the profit margin in the conventional circular economy are drastically reduced, as the labor is performed by microbial communities breaking down wastes into valorized products through metabolic pathways. While traditional recycling is energy intensive, the energy requirements of recycling through bioprocess engineering can be provided by renewable feedstocks. Where many markets require extensive research to facilitate the circularization shift, the bioeconomy has experienced significant technology advancements especially in synthetic biology and industrial biotechnology that feed directly into making possible closed-loop bioprocesses.

Biorefining, as defined by the International Energy Agency (IEA Bioenergy Task 42, Biorefineries), is the “sustainable processing of biomass into a spectrum of bio-based products (food, feed, chemicals, and/or materials) and bioenergy (biofuels, power, and/or heat)” (Ree and Jong 2017). Theoretically, biorefining can result in production of all the consumer products that are currently made from petrochemicals (Figure 1.3). However, it must be acknowledged that the differences between biomass and crude oil as the central feedstock put an upper limit on the efficiency with which biomass can be converted into the targeted end-products.

Fluctuations in global petroleum prices, together with the development of technologies to access shale and natural gas reserves more cheaply, make it difficult to justify the high investments entailed in developing technologies to refine or valorize waste or lignocellulosic biomass (Chen 2012). A mismatch in the required end-product yields, in particular, makes complete replacement by bio-based products economically unfeasible and not truly sustainable. In conclusion, a “one-dimensional departure from the fossil economy” to the bioeconomy is not possible (Schütte 2017).


Figure 1.3: Flowchart showing pathways to products from biomass that are conventionally produced from petroleum-based feedstocks (U.S. DOE. 2015)

The scale of operation, feedstock constraints (including supply chain networks) and intensive water usage of bioprocesses are some of the key hindrances that deter their inclusion as strong models leading the transition to circularity in the global economy. Several strategies to overcome these hindrances have been proposed in interdisciplinary literature across ecological economics, industrial biotechnology, chemical engineering and socio-politics. These are summarized in Table 1.1.


Table 1.1: Bioeconomy development strategies

Strategy discipline | Proposed strategy | References
Industrial Biotechnology and Microbiology | Large-scale biocatalyst discovery and optimization of industrial hosts | (Pellis et al. 2018) (Bio-TIC 2015)
Bioengineering and Bioprocessing | Continuous fermentations and whole-cell/immobilized biocatalysis development and scale-up | (Committee on Industrialization of Biology 2015; Sheldon and Woodley 2018)
Synthetic Biology | Metabolic engineering and protein engineering | (Bueso and Tangney 2017)
Engineering process design | Consolidated/closed-loop bioprocessing; effective utilisation of mixed waste feedstocks; modularization of unit processes | (Brown 2013) (Lamers et al. 2016)
Engineering process economics | Focus on waste, abundant feedstocks, growing biomass on marginalised lands, resource recovery to bring down manufacturing costs | (Venkata Mohan et al. 2016)
Education and human resource development | Knowledge mobilization through open access of biological data, course development across disciplines to foster bioeconomy leadership | (European Commission 2017) (El-Chichakli et al. 2016)
Policy and governance | Tax credit incentivization systems for renewable/waste feedstock usage and support to emerging research and industrial technology development; promoting technology transfer | (Lange et al. 2016) (Pellerin and Taylor 2008)
Economics and Marketing | Targeting niche market segments for fine chemicals and compounds difficult to synthesize chemically | (Browne et al. 2011; Rabaçal et al. 2017)
Ecological Economics/Industrial Ecology | Industrial symbiosis networks/integrated biorefineries | (Sillanpää and Ncibi 2017) (Philp and Winickoff 2017)
Social economics | Focus on small communities in rural agricultural areas with strong biomass supply network for biorefinery development and rural economy revitalization | (Lopolito et al. 2011; Owen 2018)


Within the scope of this study, the focus will be on the technological aspect of biocatalysis, which has been identified as one of the most important means to take bioeconomy development forward.

1.2.1 Industrial biocatalysis

A strong push for industrial biocatalysis development arises from the advantages of enzymes: their high specificities (particularly enantioselectivity, as in transaminases) and the improved process economics that follow from their low price in comparison to chemical catalysts and from their mild operating conditions, which lead to savings in energy costs (Sheldon and Woodley 2018). This has been especially beneficial for industries needing fine chemical synthesis, such as pharmaceuticals, and has accelerated the growth of industries that utilize specific enzymes in their production processes, viz. food and beverage processing (hydrolases), detergents and surfactants (proteases) and polymer synthesis (hydratases, peroxidases), among others (Patel et al. 2017). Some key areas of focus in industrial biocatalysis development have been immobilized enzyme systems (for ease of recovery and enzyme regeneration) and whole-cell biocatalysis, which eliminates cell lysis and enzyme recovery unit operations from production processes.

However, there remains a vast expanse of biological information that is untapped for biocatalyst production. Advances in synthetic biology, bioinformatics and data analytics have now made it possible to cost-effectively mine this large dataset for putative enzyme candidates with interesting properties like thermo-stability, broad-range pH tolerance and identification of promiscuous activities that can be applied for conversion of several different types of substrates.

Moreover, nature’s own directed evolution, in which a unique environmental composition enriches particular microbial enzyme activities, can be translated to process applications that utilize difficult-to-degrade substrates. This has been observed for lignocellulosic biomass hydrolysis through fungal or bacterial enzymes (Prasetyo et al. 2011; Kharayat and Thakur 2012), for bioremediation of mine tailings (Nancucheo et al. 2017) and for oil sands process water treatment using enrichment cultures derived from these environments (Rochman et al. 2017). These promising applications of biocatalysis feed directly into bioeconomy development. In addition to biochemical and synthetic biology tools, metagenomics is arguably the single most powerful recent area of research that has made possible high-throughput exploration and development of biocatalysts from environmental genomic information.

In the following sections of this study, a combined approach of metagenomics and consolidated bioprocessing is presented in the context of valorizing the solid waste stream from the pulp and paper industry, one of the largest and more prominent sectors of the bioeconomy, in a closed-loop, sustainable process. This is motivated by the identification of bottlenecks in the growth of the bioeconomy and of promising technological tools that can potentially be used to overcome them, coupled with the decline of the paper industry within Canada, which demands an economically promising repositioning solution. The findings presented herein are relevant and applicable to the broader development of the global circular economy.


1.3 Rejuvenating the Paper Industry by Valorising Solid Waste Stream – Pulp and Paper Mill Sludge (PPS)

As per the nominal GDP data on the Canadian economy, the pulp and paper industry contributed only 0.45% of the overall GDP (sourced from National Research Council Canada statistics data) in 2016. From being the world’s leader in pulp and paper production, Canada is currently ranked 8th. IBIS world key statistics for the industry (depicted in Figure 1.4) clearly indicate a decline phase with the decline forecasted to plateau in the next five years with negligible revival. The industry has taken this hard hit due to the diminishing demand for printed paper products - newspapers, magazines, directories - owing to the digital media revolution.

Industry mills have steadily consolidated in recent years due to rising competitive pressures from digital media and foreign manufacturers of paper, packaging materials, hygiene products and other paper products. Pulp, especially Northern Bleached Softwood Kraft (NBSK), however, remains a competitive product of this industry, both domestically as well as in exported products.

Falling demand for newsprint and intensifying import competition have made it difficult for smaller mills to remain profitable in this industry, causing many operators to shut down entirely over the past five years and greatly affecting employment in small communities supported by paper mills. Globally, there is an annual decline of 2.4% for Canadian paper exports.

At the same time, this industry is also heavily regulated environmentally, and the resulting compliance costs are transferred to consumers, limiting the industry’s international competitiveness. There is hence a dire need to look at ways to strategically reposition this industry in a manner that can capture an emerging market. Biorefining to valorise its solid waste stream, pulp and paper sludge (PPS), presents one such compelling solution.


PPS is a solid by-product of the pulping and paper-making process which is produced abundantly, in quantities of about 300–350 million tons every year (Ioelovich 2014). Landfill disposal of PPS accounts for a significant share, around 60%, of the total wastewater treatment plant operating costs, creating both economic and environmental problems (Chen et al. 2014).

Even in plants utilising incineration of PPS to generate heat energy from the organic content, the process becomes a net cost due to extremely high moisture content (>80%) which makes drying of PPS prior to burning a high energy and cost intensive process.

Figure 1.4: Major current industry statistics for the Pulp and Paper industry in Canada (Source: IBIS world open access content)

The high cellulosic content of paper mill sludge (45-50%) compared to other waste lignocellulosic biomass sources (Table 1.2) makes it uniquely suited for conversion to bio-products of value. Moreover, due to the thermo-mechanical nature of paper pulp processing, the cellulose fibres within paper sludge are more readily accessible for hydrolytic conversion to oligosaccharides than those in other typical lignocellulosic materials, fostering confidence in the economic viability of such conversions (Gurram et al. 2015). Several studies have already investigated the feasibility of converting paper sludge to the biofuel bio-ethanol (Marques et al. 2008; Kang et al. 2010) and to other bio-products.

Table 1.2: Approximate content of cellulose, hemicellulose and lignin in different types of waste lignocellulosic materials (Prasetyo and Park 2013)

1.4 Research Overview

The goal of this research thesis is to examine pulp and paper mill sludge (PPS), an important waste lignocellulosic biomass material, both as a source of potentially novel glycoside hydrolase (GH) enzymes and as a feedstock in a consolidated bioprocess that generates a bioproduct valorising its cellulosic content.

This investigation is undertaken with the broader objective of establishing proof-of-concept of a hypothetical closed-loop bioprocess that can potentially be retrofitted to the pulp and paper making industry and add value to its waste stream upon proper optimization and scale-up. The biocatalyst discovery is done using a novel metagenomic approach, in contrast to the traditional microbiological culturing reported previously in the literature (Maki et al. 2011). The proof-of-concept of the closed-loop bioprocess is demonstrated by hydrolysis of PPS, assessment of the hydrolysate and application of the hydrolysate to generate a bioproduct. Throughout this study, the term paper sludge refers to cellulose fibre rejects from the paper making process, sampled from the primary unit of the wastewater treatment operation.

The specific research objectives are discussed in the thesis chapters as follows (Figure 1.5):

1. In Chapter 2, the overall metabolic potential of the pulp and paper metagenome is discussed with a focus on carbohydrate-active enzyme (CAZyme) activity using in silico methods. This chapter introduces metagenomics with a focus on the bioinformatics methods as applicable to this study.

2. In Chapter 3, the functional metagenomics part of the study is detailed. The experimental design to mine biocatalysts with GH function from the PPS metagenome is presented.

3. Chapter 4 is based on taking the metagenomic discoveries to bioprocess application. The feasibility of using PPS as a bioprocess platform feedstock is presented, and findings from tests assessing the hydrolysis potential of whole-cell fosmid lysates are also discussed.


Figure 1.5: Research overview

There is inherent, synergistic linking between all the modules of research in this study. It reflects the overall motivation behind this work to contribute towards development of a sustainable and circular bioeconomy. In silico findings from the microbiome are important to assess the potential of an environment for mining biocatalysts. The functional metagenomic activity discovery in turn must be corroborated and supported by preliminary assessments using bioinformatics. Further, biochemical testing and experimental validation of function of biocatalysts using substrates that closely model the real biomass feedstock is important from the perspectives of both bioprocess application, as well as annotation of novel enzyme families, that adds to expansion of existing knowledge databases.


CHAPTER 2 In silico Analysis of the Paper Mill Sludge Microbiome

2.1 Background

2.1.1 Metagenomics – Unearthing Nature’s Biocatalytic Potential

Metagenomics is the isolate-independent study and analysis of microbial communities and the metabolic potential contained in the collection of their genomes across different environments (Handelsman 2004; Council 2007b; Thomas et al. 2012). It combines the methods of genomic sequencing, high-performance computing and classical microbiology to unravel microbial community structure and metabolism with genome-scale resolution (Hawley et al. 2017). Given that the vast majority of the bacteria and archaea dwelling in nature have not been isolated under laboratory conditions, the metagenomics approach has enabled access to their genomic potential and gene expression patterns (Schloss and Handelsman 2005). Together with genetic engineering and synthetic biology, this is leading to the transformation of tractable industrial strains with novel environmental genes to produce biocatalysts with improved properties (Madhavan et al. 2017) or novel metabolites through pathway engineering (Bao et al. 2017; Cuadrat et al. 2018). The findings generated through metagenomics have contributed significantly to the body of knowledge about the elegant mechanisms through which the “unseen majority” (Whitman et al. 1998) of life directs and guides several biogeochemical processes that are crucial for the survival of the “visible” life on earth (Falkowski et al. 2008).


2.1.1.1 Lignocellulose Hydrolysis and Valorization

One of the biggest areas of research in biocatalysis is lignocellulose breakdown. This is also proportionately reflected in the number of metagenomic studies that have aimed to discover enzyme families and complexes for this objective. In fact, glycoside hydrolase gene families occupy the greatest number of families discovered through metagenomics in the past decade as depicted in Figure 2.1.

Figure 2.1: Dot plot to show abundance of enzymes discovered through functional screening in metagenomic libraries (Taupp et al. 2011) Copyright © 2011 Elsevier Ltd.


Lignocellulose is the most abundant natural renewable material on earth, while cellulose is the most abundant naturally occurring polymer (Salehi Jouzani and Taherzadeh 2015; Cai et al. 2017). It is currently the target feedstock for several biorefining processes that are increasingly looking to valorise second-generation biomass, including harvest and forestry residues, underutilised crops, food and municipal waste, and paper and sawmill residues. These processes seek to capitalize on the low cost of these feedstocks and their lack of competition with starch-based food resources, but suffer from a lack of technological readiness for economically feasible and efficient conversion of the cellulose, hemicellulose and lignin fractions within the biomass. The structure of lignocellulose is quite complex: cellulose, which is the primary source of C6 monomers, is encapsulated within a reinforced structure of lignin and cross-linked through hemicellulose chains (depicted in Figure 2.2). Also, hydrolysis of biomass after lignin removal produces a mixture of C5 and C6 sugars which is not readily fermentable by industrial strains.

Figure 2.2: Structure of lignocellulosic biomass containing cellulose (composed of a β-1,4-linked chain of glucose molecules), hemicellulose (composed of various 5- and 6-carbon sugars) and lignin (composed of three major phenolic components) (Rubin 2008) Copyright © 2008, Springer Nature


The low cost coupled with the high cellulosic content of these biomass sources presents both an intellectual and an industrial incentive to discover enzyme mixtures capable of hydrolysing them and funnelling the carbon obtained into bioproducts of high value like biofuels, biopolymers and other fine chemicals. In their comprehensive review on consolidated bioprocesses (CBP) for butanol production from lignocellulosic biomass, Salehi Jouzani and Taherzadeh (2015) describe almost twenty-five different activities that are needed to completely hydrolyse any given lignocellulosic biomass without any pre-treatment (Table 2.1). Engineering a single “biorefining organism” that encodes all these activities alongside metabolizing the sugars and producing the desired bioproduct would entail an extremely heavy metabolic load for any biological system to bear by itself. It is hence no surprise that thermo-mechanical or chemical processes for pre-treatment of the biomass are needed prior to hydrolysis. These energy-intensive processes can potentially be replaced by a microbial consortium displaying all these activities in the following sequential steps (including fermentation to a bioproduct):

Step 1 - Secreting several glycoside hydrolase enzymes

Step 2 - Hydrolyzing both cellulose and hemicellulose to soluble sugars

Step 3 - Metabolizing soluble sugars

Step 4 - Producing bioproducts

Step 5 - Tolerating lignin-derived compounds and the bioproduct produced.


Table 2.1: The most prominent enzyme activities needed to completely hydrolyse lignocellulosic biomass (adapted from Salehi Jouzani and Taherzadeh 2015)

Cellulases | Hemi-cellulases | Ligninases | Pectinolytic and cell wall loosening enzymes
cellobiohydrolase | endoxylanase | lignin peroxidase | expansin
endoglucanase | β-xylosidase | aryl-alcohol oxidase | swollenin
β-glucosidase | acetyl xylan esterase | laccase | loosinin
phospho-β-glucosidase | glucuronyl esterase | glyoxal oxidase | cellulose induced protein
 | arabinofuranosidase | cellobiose dehydrogenase |
 | galactosidase | |
 | glucuronidase | |
 | mannanase | |
 | xyloglucan hydrolase | |

There is hence an immense opportunity using metagenomics to find environmental genes with the desired properties that can lead to the design of a “one-pot” consolidated bioprocess that can not only break down but also valorise lignocellulosic biomass.

2.1.2 In silico Approaches in Metagenomics - what all can the microbiome map tell us?

The fast-paced development of bioinformatics tools and pipelines specifically for metagenomic data, coupled with a great reduction in the cost of next-generation DNA sequencing methods, has led to an enormous wealth of information that is available for in silico analysis before using functional screening approaches for biocatalyst discovery or biochemical assays for validation of enzyme function.


There is a highly interdependent relationship between the functional gene diversity, microbial community structure and the metabolic potential of any environment (Wang et al. 2017) that needs to be carefully navigated before making any interpretations of “who’s doing what?” in a given environment. Given the complex and mixed nature of metagenomic environmental DNA, bioinformatics tools become very important both for making sense of potentially confounding noise and for guiding functional hypotheses and experimental design.

2.1.2.1 Bioinformatics Tools – making sense of the noise

The development of next-generation sequencing platforms has made possible enough sequencing depth to confidently analyse complex microbiomes. This is done through either second-generation (short reads of 150-400 bp; Illumina (Illumina Inc. 2015) and Ion Torrent (Ansorge et al. 2017)) or third-generation (longer reads of 6-20 kb; PacBio (Rhoads and Au 2015) and Nanopore technologies (Bainomugisa et al. 2018)) technologies. These reads must then be assembled or recruited into the genomes which comprise the metagenome to enable reconstruction of open reading frames (ORFs) on single or multiple gene loci, which eventually leads to analysis of the metabolic potential through pathway reconstruction (MetaCyc or Pathway Tools) or annotation of genes.

Metagenomic assembly is essentially the assembly of many different genomes at once and is hence more complex than single-genome assembly (which is itself a complicated process, because repetitive elements within genomes make it difficult to assign reads to individual contiguous sequences). Several de novo assembly pipelines exist for this purpose, and most of them use an iterative approach in which k-mers (short sequences of fixed length) are used to construct de Bruijn graphs, which are then used to systematically construct larger contiguous sequences (contigs). Several statistics are used for judging the quality of an assembly, such as N50 and L50 (terms are defined and reviewed elsewhere (Mäkinen et al. 2012)), the largest contig length, or the Nx curve, which depicts the proportion of the genome contained within specific contig lengths. These can be readily computed using QUAST, a quality assessment tool for genome assemblies (Gurevich et al. 2013). Some common pipelines that produce reproducible results are MEGAHIT, Minia, Meraga, A*, Ray Meta and Velour (Sczyrba et al. 2017).
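To make the N50 and L50 statistics concrete, the following is a minimal Python sketch of how they can be computed from a list of contig lengths. It is an illustration only and not the implementation used by QUAST or by this study; the function name `nx_lx` and the example contig lengths are hypothetical.

```python
from typing import List, Tuple


def nx_lx(contig_lengths: List[int], x: float = 50.0) -> Tuple[int, int]:
    """Return (Nx, Lx) for a list of contig lengths.

    Nx is the length of the shortest contig in the smallest set of
    longest contigs that together cover at least x% of the assembly.
    Lx is the number of contigs in that set.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    if not 0 < x <= 100:
        raise ValueError("x must be in the interval (0, 100]")

    # Walk through contigs from longest to shortest, accumulating length.
    lengths = sorted(contig_lengths, reverse=True)
    threshold = sum(lengths) * x / 100.0
    running_total = 0
    for count, length in enumerate(lengths, start=1):
        running_total += length
        if running_total >= threshold:
            return length, count
    # Only reachable through floating-point rounding at x = 100.
    return lengths[-1], len(lengths)


# Hypothetical assembly of seven contigs (total length 300).
example = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = nx_lx(example, 50)
n90, l90 = nx_lx(example, 90)
```

Evaluating `nx_lx` over a range of x values gives the points of an Nx curve for the same assembly.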

The metagenomic data can also be binned, or grouped into clusters of data arising from related genomic or taxonomic sequences, either prior to or after assembly (Roumpeka et al. 2017). This is done to better understand the phylogenetic distribution of the functional/metabolic clusters within the metagenome. The quality of the bins is determined by a suitable measurement of the contamination, or presence of reads from other taxonomic groups, within the bin (calculated as % contamination and strain heterogeneity index). Binning can be done in either a supervised manner (using a reference genome) or an unsupervised manner (without a reference genome). The latter is better suited to metagenomic data, where many different genomes might exist. Pipelines that include both genome and taxonomic binners include MaxBin, MetaBAT, MetaWatt, CONCOCT, Kraken etc. (Sczyrba et al. 2017).
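As a rough illustration of the kind of sequence-composition signal that unsupervised binning can draw on, the sketch below computes a tetranucleotide frequency profile and GC content for a contig and a distance between two profiles. This is a simplified assumption about the general idea, not the algorithm of any binner named above (which typically combine composition with coverage and statistical models); all function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt
from typing import List

# All 256 possible tetranucleotides over the DNA alphabet, in fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_profile(sequence: str) -> List[float]:
    """Return relative frequencies of the 256 tetranucleotides in a contig.

    Windows containing characters other than A, C, G, T are ignored.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among the A, C, G, T bases."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def profile_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two composition profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Contigs with similar profiles are candidates for the same bin.
p1 = tetranucleotide_profile("ACGTACGT")
p2 = tetranucleotide_profile("ACGTACGT")
d = profile_distance(p1, p2)
```

In practice such profiles would be clustered across all contigs, and the resulting bins checked for contamination as described above.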

The distribution of taxa within a metagenome is also not easy to assess. Often, multiple-marker strategies, which might also include lowest common ancestor (LCA) tree construction using PhyloSift (Darling et al. 2014), taxator-tk (Dröge et al. 2015) or MEGAN (Huson et al. 2016), are used for determining taxonomic IDs rather than only SSU or 16S rRNA recovery strategies such as SILVA database alignment, EMIRGE (Miller et al. 2011) or amplicon sequencing (Pilloni et al. 2012).
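The lowest common ancestor idea can be sketched in a few lines of Python: given the taxonomic lineages of all database hits for a sequence, the assignment is the deepest rank on which every hit agrees. This is a minimal illustration under the assumption that lineages are supplied as root-to-leaf lists; it is not the implementation used by the tools cited above, and the function name and example lineages are hypothetical.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the shared root-to-node path of several taxonomic lineages.

    Each lineage is an ordered list of taxon names from the root
    (e.g. domain) towards the leaf (e.g. genus).
    """
    if not lineages:
        return []
    shared: List[str] = []
    # zip stops at the shortest lineage, so ranks are compared pairwise.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical hits for one read: they agree down to the phylum level.
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
    ["Bacteria", "Proteobacteria", "Betaproteobacteria"],
    ["Bacteria", "Proteobacteria"],
]
assignment = lowest_common_ancestor(hits)
```

The design trades resolution for safety: a read whose hits disagree is assigned to a higher rank rather than to a possibly wrong lower one.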

Finally, gene annotation of the metagenomic data is done and, as with the other analyses, several approaches exist for this purpose. Some pipelines can predict genes from fragmented or short-read data, such as MetaGeneAnnotator (Noguchi et al. 2008), Glimmer-MG (Kelley et al. 2012) and FragGeneScan (Rho et al. 2010), among others. However, for prediction of multiple gene loci, especially those contained within large-insert fosmid/cosmid metagenomic libraries or relevant to activities like cellulose hydrolysis that involve more than one catalytic family domain, it is important to assemble the data prior to annotation. This might also be very important for PCR or gene synthesis strategies for experimental validation of function, where recovery of complete gene sequences is necessary. There are also pipelines, tailored for metagenomic assemblies or short-read sequence data, that combine gene prediction with annotation using databases of interest. MetaPathways (Hanson et al. 2014) is one such pipeline and has been used in this study.
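To show what recovery of a complete ORF from an assembled contig means at the sequence level, the following naive Python sketch scans all six reading frames for stretches that begin with ATG and end at the first in-frame stop codon. It is a deliberately simplified assumption: the gene predictors cited above rely on statistical models and alternative start codons, which this sketch ignores, and the function names and example sequence are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_codons: int) -> List[Tuple[int, int]]:
    """Find ATG-to-stop ORFs in the three frames of one strand.

    Returns half-open (start, end) intervals that include the stop codon.
    min_codons counts the start and stop codons as well.
    """
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    found.append((start, i + 3))
                start = None
    return found


def find_orfs(sequence: str, min_codons: int = 30) -> List[Tuple[str, int, int]]:
    """Return (strand, start, end) for ORFs on both strands.

    Coordinates are always given on the forward strand.
    """
    seq = sequence.upper()
    n = len(seq)
    results = [("+", s, e) for s, e in orfs_in_strand(seq, min_codons)]
    for s, e in orfs_in_strand(reverse_complement(seq), min_codons):
        results.append(("-", n - e, n - s))
    return results


# Toy contig containing one short forward-strand ORF.
toy = "CCATGAAATTTGGGTAACC"
orfs = find_orfs(toy, min_codons=3)
```

An ORF that runs off the end of a read or contig without reaching a stop codon is not reported here, which is one reason assembly prior to annotation helps recover complete genes.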

2.1.2.2 Standardizing Data and Streamlining Analysis

The relative newness of metagenomics, together with the explosion of different pipelines to analyse environmental sequence information (as noted in the previous section), has resulted in considerable disarray in the forms of outputs and in how published results affect subsequent interpretations about the function of discovered gene products. This has also partly inhibited industry-wide adoption of metagenomically discovered biocatalysts, given the time-scale for biocatalyst optimization under process conditions. This time-scale can be reduced at the upstream R&D end through standardization of the way the metadata of a metagenome is organized and of how the recovered genomes are generated and annotated. Specifically, with respect to industrial biocatalysis and biorefining, standardization can help with:

1. Better comparison of functional and/or taxonomic enrichment across different environments as it relates to environmental metadata

2. Stronger linkage of functional discoveries with taxonomic composition

3. Synthetic biology or genetic engineering-based optimization of industrial hosts with environmental genes, for example by optimizing codon usage or using native promoters.

To facilitate this, the Genomic Standards Consortium, in continuation of its Minimum Information about Any (x) Sequence (MIxS) standards, has established the Minimum Information about a Single Amplified Genome (MISAG) and Minimum Information about a Metagenome-Assembled Genome (MIMAG) of bacteria and archaea standards (Bowers et al. 2017). These are recommendations to provide statistics for assembly quality, genome completeness, and a measure of contamination to assess genome quality prior to further downstream analysis and annotation. The detailed criteria and qualifications for high, medium and low-quality genome drafts are provided in Tables 2.2 and 2.3 and depicted in Figure 2.3.
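The completeness and contamination thresholds listed in Table 2.2 can be read as a simple decision rule, sketched below in Python. This is an illustrative reading of the table rather than an official implementation: the function name, its parameters and the "unclassified" label are hypothetical, and treating exactly 50% completeness as medium quality is an assumption, since the table leaves that boundary unassigned.

```python
def classify_genome_draft(
    completeness: float,
    contamination: float,
    has_rrna_genes: bool = False,
    trna_count: int = 0,
) -> str:
    """Classify a draft genome as 'high', 'medium' or 'low' quality.

    completeness and contamination are percentages.
    has_rrna_genes is True when the 23S, 16S and 5S rRNA genes are present.
    """
    if not 0 <= completeness <= 100:
        raise ValueError("completeness must be between 0 and 100")
    if contamination < 0:
        raise ValueError("contamination cannot be negative")

    # High quality: >90% complete, <5% contaminated, rRNAs and >=18 tRNAs.
    if (completeness > 90 and contamination < 5
            and has_rrna_genes and trna_count >= 18):
        return "high"
    # Medium quality: at least 50% complete (assumption at exactly 50%).
    if completeness >= 50 and contamination < 10:
        return "medium"
    # Low quality: less than 50% complete, still <10% contaminated.
    if completeness < 50 and contamination < 10:
        return "low"
    # Hypothetical label for drafts exceeding the contamination limit.
    return "unclassified"


# Hypothetical bins with their completeness and contamination estimates.
tier_a = classify_genome_draft(95.0, 2.0, has_rrna_genes=True, trna_count=20)
tier_b = classify_genome_draft(95.0, 2.0)  # rRNA/tRNA evidence missing
tier_c = classify_genome_draft(30.0, 3.0)
```

Note that a nearly complete bin without the rRNA and tRNA evidence falls back to medium quality under this reading.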


Figure 2.3: Quality assessment pipeline for single amplified genomes (SAGs) and genomes from metagenomes (MAGs) (Bowers et al. 2017)

Table 2.2: Genomic Standards Consortium recommendations for metadata generation for metagenome assembled genomes (MIGS – Minimum Information about a Genome Sequence) (Bowers et al. 2017)

Category | Requirement | Field | Description / permitted values
General genome metadata currently not in MIGS | mandatory | analysis project type | single amplified genome (SAG); metagenome-assembled genome (MAG)
General genome metadata currently not in MIGS | mandatory | taxa id | 16S rRNA gene; multi marker approach; other
General genome metadata currently not in MIGS | mandatory | assembly software | tool used for assembly
General genome metadata currently not in MIGS | optional | annotation | tool or pipeline used for annotation
Genome quality | mandatory | assembly quality | Finished: Single, validated, contiguous sequence per replicon without gaps or ambiguities with a consensus error rate equivalent to Q50 or better. Assembly statistics*. High Quality Draft: Multiple fragments where gaps span repetitive regions. Assembly statistics*. Presence of the 23S, 16S and 5S rRNA genes and at least 18 tRNAs. Medium Quality Draft: Many fragments with little to no review of assembly other than reporting of standard assembly statistics*. Low Quality Draft: Many fragments with little to no review of assembly other than reporting of standard assembly statistics*.
Genome quality | mandatory | completeness score | High Quality Draft: >90%; Medium Quality Draft: >50%; Low Quality Draft: <50%
Genome quality | mandatory | contamination score | High Quality Draft: <5%; Medium Quality Draft: <10%; Low Quality Draft: <10%
Genome quality | mandatory | completeness software | name of software
Genome quality | optional | number of contigs | value
Genome quality | optional | 16S recovered | yes/no
Genome quality | optional | 16S recovery software | name of software
Genome quality | optional | number of standard tRNAs extracted | value 0-21
Genome quality | optional | tRNA extraction software | name of software
Genome quality | optional | completeness approach | marker gene based; reference genome based; other
Genome quality | optional | contamination screen input | reads; contigs
Genome quality | optional | contamination screen parameters | ref db; kmer; coverage; ref db+kmer; ref db+coverage; ref db+kmer+coverage; kmer+coverage; other
Genome quality | optional | decontamination software | checkm/refinem; anvi'o; prodege; bbtools:decontaminate.sh; acdc; combination; other
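The completeness and contamination thresholds listed in Table 2.2 can be read as a simple decision rule. The sketch below is illustrative only: the function name `mimag_draft_quality` is hypothetical, the additional rRNA and tRNA requirements for high-quality drafts are deliberately ignored, and assigning exactly 50% completeness to the medium class is an assumption, since the table lists only >50% and <50%.

```python
def mimag_draft_quality(completeness: float, contamination: float) -> str:
    """Classify a draft genome from completeness/contamination (both in %).

    Only the numeric thresholds from Table 2.2 are applied; the rRNA/tRNA
    criteria for high-quality drafts are NOT checked in this sketch.
    """
    if completeness > 90 and contamination < 5:
        return "high"
    # Assumption: a completeness of exactly 50% is treated as medium.
    if completeness >= 50 and contamination < 10:
        return "medium"
    if contamination < 10:
        return "low"
    # Contamination of 10% or more falls outside all three draft classes.
    return "unclassified"


# Illustrative values only (not taken from the thesis data).
examples = [(95.0, 2.0), (70.0, 8.0), (30.0, 4.0), (95.0, 20.0)]
labels = [mimag_draft_quality(c, x) for c, x in examples]
```

A rule of this kind is convenient for tabulating bins by draft class before deciding which to carry forward for annotation.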

Table 2.3: Metagenome assembled genome (MAG) metadata requirements (Bowers et al. 2017)

Category | Requirement | Field | Permitted values
MAG metadata | mandatory | bin parameters | homology search; kmer; coverage; codon usage; other; combinations
MAG metadata | mandatory | binning software | metabat; maxbin; anvi'o; concoct; groopm; esom; metawatt; combination; other
MAG metadata | optional | reassembly post binning | yes/no
MAG metadata | optional | mag coverage software | bwa; bbmap; bowtie; other

2.2 Materials and Methods

2.2.1 Sample Collection and Processing

The paper sludge was sampled from a BC coastal paper mill (48° 52' N, 123° 39' W) on 19 February 2016. The pulping at this facility combines thermomechanical pulping (TMP) and the kraft process, and the major product is Northern Bleached Softwood Kraft (NBSK) pulp. The sample in this study, however, was taken from the primary wastewater treatment reactor prior to mixing of the TMP and kraft process rejects. Sampling was done using three methods:

1. 8 × 15 mL cryo-temperature resistant Falcon tubes were filled with liquid sludge, frozen, transported over dry ice and stored at -80 °C for metagenomic DNA extraction and fosmid library construction.

2. 3 × 5 gal (19 L) plastic jugs were filled with liquid sludge sampled from the primary wastewater reactor on site at the mill. The samples were chilled at 4 °C overnight prior to transport at ambient temperature and then stored at 4 °C.

3. Approximately 1 kg total weight of dewatered sludge cakes was obtained by filtering the liquid sludge using a vacuum-based Buchner funnel apparatus and stored at 4 °C.

The dewatered sludge cakes were dried to a constant weight by controlled drying for 72 hours at 45 °C in an incubator (redline, Binder) in accordance with standard procedures for preparation of wet biomass for compositional analysis (Hames et al. 2008). Detailed methods for compositional analysis are covered in Chapter 4.

2.2.2 Whole-metagenome Sequencing, Binning and Assembly

2.2.2.1 High Molecular Weight Genomic DNA Extraction

The high molecular weight genomic DNA from sludge was extracted following a protocol previously developed for DNA extraction from forest soils and sediments (Lee and Hallam 2009), modified for paper sludge. Briefly, the sludge was dewatered by centrifugation (3000 g, 5 min) and approximately 5 g was used for each DNA extraction. It was ground in liquid N2 to a powdery consistency with regular addition of denaturing solution. Hybridisation based extraction was carried out at 65 °C followed by centrifugation to remove solids. Chloroform-isoamyl alcohol extraction was done repeatedly on the supernatant. Precipitates were removed from the collected aqueous phase, which was then concentrated using 10 kDa cut-off Amicon filters (Millipore) and buffer-exchanged into 1X Tris-EDTA (TE) buffer to a final volume of 250-500 µL. The DNA was then precipitated using 0.6 volumes of isopropyl alcohol, and the air-dried pellet was solubilised in 100 µL 1X TE through an overnight incubation at 4 °C. The integrity of the DNA was checked on a 0.8% agarose gel using λ/HindIII (Thermo Fisher) and 1 kb+ (New England Biolabs, NEB) ladders, with 23 kb as the lower cut-off. The extracted samples were quantified for double-stranded (ds) DNA content using the Quant-iT PicoGreen dsDNA kit (Thermo Fisher) protocol.

2.2.2.2 Sequencing

DNA library preparation and HiSeq sequencing were done at GENEWIZ (NJ, USA). For QC, the DNA sample was quantified using a Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA) and the DNA integrity was checked on a 0.6% agarose gel with 50-60 ng of sample loaded in the well. The NEBNext Ultra DNA Library Preparation kit was used following the manufacturer’s recommendations (Illumina). Briefly, the genomic DNA was fragmented by acoustic shearing with a Covaris S220 instrument. The DNA was end repaired and adenylated. Adapters were ligated after adenylation of the 3’ ends. Adapter-ligated DNA was indexed and enriched by limited cycle PCR. The DNA library was validated using a TapeStation (Agilent Technologies) and quantified using a Qubit 2.0 Fluorometer.

The sequencing libraries were multiplexed and clustered on one lane of a flowcell. After clustering, the flowcell was loaded on the Illumina HiSeq 4000 instrument according to the manufacturer’s instructions. The samples were sequenced using a 2x150 Paired End (PE) configuration. Image analysis and base calling were conducted by the HiSeq Control Software (HCS) on the HiSeq instrument. Raw sequence data (.bcl files) generated from the Illumina HiSeq were converted into fastq files and de-multiplexed using the Illumina bcl2fastq v. 2.17 program. One mismatch was allowed for index sequence identification.


2.2.2.3 Metagenome Assembly and Binning

Metagenomic assembly and binning were done in collaboration with Connor Morgan-Lang. Prior to assembly, the reads were quality filtered and trimmed to remove Illumina TruSeq paired-end sequencing adaptors using Trimmomatic-0.35 (Bolger et al. 2014), which resulted in removal of 0.23% of the total reads.

PrefixPair used: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'

The resulting reads were stored as interleaved paired-end read files and used for assembly with Megahit v1.0.6 (Li et al. 2015) using a minimum k-mer size of 27, increased in steps of 10 up to a maximum k-mer size (kmax) of 147.
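To make the k-mer schedule concrete, the short sketch below enumerates the k-mer sizes implied by a minimum of 27, a step of 10 and a maximum of 147. The helper `kmer_series` is a hypothetical name used only for illustration and is not part of Megahit.

```python
from typing import List


def kmer_series(k_min: int, k_max: int, step: int) -> List[int]:
    """Return the k-mer sizes visited when stepping from k_min to k_max."""
    if step <= 0 or k_min > k_max:
        raise ValueError("expected step > 0 and k_min <= k_max")
    return list(range(k_min, k_max + 1, step))


# Values taken from the assembly description above.
ks = kmer_series(27, 147, 10)
```

With these settings the series contains 13 k-mer sizes, all of them odd, starting at 27 and ending exactly at 147.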

Binning was done post-assembly using MaxBin 2.0 (Wu et al. 2016) with a minimum contig length of 2000 bp and a probability threshold of 0.95. The bins were further assessed for quality using the CheckM tool (Parks et al. 2015), which gives values for % completeness and % contamination of the bins. The former is calculated from the number of marker genes present in the bin (from the set of 104 markers used for bacteria) and the latter from the number of copies of each marker gene (>1 copy implies contamination). The pipeline also reports a strain heterogeneity index, which uses the amino acid identity (AAI) between multi-copy marker genes to indicate how many of these genes arise from closely related strains (above an AAI threshold) and so informs on strain-level contamination of the bins. The scale runs from 0 (no strain heterogeneity) to 100 (all markers present in >1 copy appear to be from closely related organisms).
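The logic behind the two scores can be sketched with a deliberately simplified calculation. This is not CheckM's implementation (which, among other things, works with lineage-specific sets of collocated markers); the function name and the toy marker identifiers below are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def completeness_contamination(
    marker_hits: Iterable[str], marker_set: Set[str]
) -> Tuple[float, float]:
    """Simplified estimate from single-copy marker genes.

    completeness  = % of expected markers found at least once
    contamination = % of extra copies (copies beyond the first) per marker
    """
    if not marker_set:
        raise ValueError("marker_set must not be empty")
    counts = Counter(m for m in marker_hits if m in marker_set)
    n = len(marker_set)
    present = sum(1 for m in marker_set if counts[m] >= 1)
    extra_copies = sum(counts[m] - 1 for m in marker_set if counts[m] > 1)
    return 100.0 * present / n, 100.0 * extra_copies / n


# Toy example: 4 expected markers, one missing (m4), one duplicated (m2).
expected = {"m1", "m2", "m3", "m4"}
found = ["m1", "m2", "m2", "m3"]
completeness, contamination = completeness_contamination(found, expected)
```

In this toy case three of four markers are present (75% completeness) and one extra copy is observed (25% contamination).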


2.2.3 Taxonomic Analysis

Both in silico and experimental approaches were taken for the taxonomic analysis of the PPS microbiome, with the objective of comparing the differing pipelines qualitatively. Metapathways annotation of taxonomy using RefSeq database gene detection is discussed in section 2.2.4.

2.2.3.1 454 16S rRNA gene Pyrotag Sequencing

The sample preparation and analysis were done in collaboration with Ashely Arnold from the Hallam lab group. Briefly, the V6–V8 region of the 16S rRNA gene was amplified from the DNA template using the universal primer pair 926F (5′-AAACTYAAAKGAATTGRCGG-3′) and 1392R (5′-ACGGGCGGTGTGTRC-3′) (Engelbrektson et al. 2010) to generate molecular barcodes of microbial community diversity. Primers were modified to include 454 adaptor sequences, and barcodes were added to the reverse primers to allow multiplexing during sample sequencing. PCR amplification was performed in 50 μL reaction volumes in duplicate to minimize PCR bias. Each reaction well contained 5 μL Bishop 10X PCR buffer, 3 μL 25 mM MgCl2, 4 μL 2 mM dNTPs, 2 μL of forward and reverse primers, 2 μL 10 mg/mL BSA, 0.6 μL Bishop Taq DNA Polymerase (5 U μL−1), 33.4 μL nuclease-free water and 1–10 ng of the DNA or cDNA template. Negative controls were included for each sample reaction. The thermal cycler protocol started with denaturation at 95 °C for 3 minutes, followed by 25 cycles of denaturation at 95 °C for 30 seconds, annealing at 55 °C for 45 seconds, and extension at 72 °C for 90 seconds. A final extension of 10 minutes at 72 °C completed the amplification process.


Successful PCR products were pooled, cleaned using the NucleoSpin Gel and PCR Clean-up kit (Macherey-Nagel), eluted in 30 μL of 5 mM Tris buffer at pH 8.5, and quantified following the Quant-iT PicoGreen dsDNA kit (Thermo Fisher) protocol. The 16S SSU rRNA amplicons were each pooled at 100 ng DNA prior to sequencing. Emulsion PCR and sequencing were performed at Genome Quebec (Montreal, Canada) on the Roche 454 GS FLX Titanium platform (454 Life Sciences) according to the manufacturer’s protocol.

6,137 SSU rDNA pyrotag sequences were processed using the Quantitative Insights Into Microbial Ecology (QIIME) version 1.9.0 software package (Caporaso et al. 2010). Sequences with fewer than 200 bases, ambiguous ‘N’ bases, or homopolymer runs were removed before a quality filtering step using the usearch quality filtering pipeline developed by Robert Edgar and implemented in QIIME. De novo and reference-based chimeric sequences were identified via UCHIME and removed prior to taxonomic assignment. Non-chimeric sequences were clustered at 97% identity into operational taxonomic units (OTUs) with UCLUST, and representative sequences from each cluster were queried against the SILVA 128 ribosomal RNA database (Quast et al. 2013) using the Ribosomal Database Project (RDP) classifier to assign taxonomy. Singleton OTUs (represented by one read only) were omitted from downstream analysis to reduce over-prediction of rare OTUs (Kunin et al. 2010).
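The read-level filters and the singleton rule described above can be sketched as follows. This is a minimal stand-alone illustration, not the QIIME implementation: the function names are hypothetical, and the homopolymer cut-off of 6 is an assumed value because the threshold used is not stated in the text.

```python
import re
from collections import Counter
from typing import Dict


def passes_filters(seq: str, min_len: int = 200, max_homopolymer: int = 6) -> bool:
    """Apply the three read-level filters: length, ambiguous bases, homopolymers.

    max_homopolymer is an ASSUMED cut-off; runs longer than this are rejected.
    """
    seq = seq.upper()
    if len(seq) < min_len:
        return False
    if "N" in seq:
        return False
    # One base followed by at least max_homopolymer repeats of itself,
    # i.e. a run strictly longer than max_homopolymer.
    if re.search(r"(.)\1{%d,}" % max_homopolymer, seq):
        return False
    return True


def drop_singletons(read_to_otu: Dict[str, str]) -> Dict[str, str]:
    """Remove reads whose OTU is represented by a single read."""
    otu_sizes = Counter(read_to_otu.values())
    return {r: o for r, o in read_to_otu.items() if otu_sizes[o] > 1}


# Toy data: reads r1 and r2 share an OTU, r3 forms a singleton OTU.
kept = drop_singletons({"r1": "otu1", "r2": "otu1", "r3": "otu2"})
```

Keeping the filters as separate, testable steps makes it easier to report how many reads each criterion removes.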


2.2.3.2 Expectation Maximization Iterative Reconstruction of Genes from the Environment (EMIRGE) based 16S Prediction

The raw reads of the PPS metagenome, obtained using the methods described in section 2.2.2.2, were analyzed for 16S rRNA gene reads using the EMIRGE software for reconstructing full-length ribosomal genes from short-read sequencing data (Miller et al. 2011), in collaboration with Connor Morgan-Lang in the Hallam lab. Briefly, this pipeline uses a probabilistic method based on the expectation maximization algorithm in an iterative manner to determine the abundance of taxa in the short-read dataset. The FASTQ input files used Phred+33 quality score encoding, and the SILVA 132 SSU rRNA database was used as the candidate SSU database. The maximum input read length was specified as 150, and the insert mean and insert standard deviation parameters were set to a value of 500 to allow all 16S rRNA gene read pairs to map.

2.2.3.3 Phylosift Annotation

Both the whole metagenome dataset and the bins were analyzed for taxonomy through a multiple marker gene detection approach using the Phylosift pipeline (Darling et al. 2014). This pipeline incorporates the LAST alignment algorithm, hidden Markov model profiling (HMMER) and pplacer tools to automate phylogenetic prediction through identification of protein-coding regions and RNA sequences. The outputs are presented in the form of dynamic Krona plots. The workflow is presented in Figure 2.4 below; the rRNA workflow shown there is the one relevant to the objective of this study.


Figure 2.4: Phylosift workflow schematic (Darling et al. 2014)

2.2.4 Functional annotation of metagenome

Metapathways (MP) (v 2.5 with in-house updates) (Hanson et al. 2014), a modular annotation and analysis pipeline developed in the Hallam lab for environmental sequence information, was used for the functional annotation of the assembled metagenome and bins. Gene prediction and annotation information generated by the pipeline was then used to construct an Environmental Pathway/Genome Database (ePGDB) comprising MetaCyc pathways, using the Pathway Tools software and its pathway prediction algorithm “PathoLogic”. The results could be viewed interactively in the graphical user interface of the pipeline, and the ePGDB data structure of sequences, genes, pathways, and literature annotations was used for integrative interpretation. The metagenome contigs were compared to the latest curated and annotated versions of the KEGG (Kanehisa et al. 2016), COG (Tatusov et al. 2003), RefSeq (O’Leary et al. 2016), MetaCyc (Caspi et al. 2016), and CAZy (Terrapon et al. 2017) databases using the LAST algorithm. The parameters used with Prodigal (Hyatt et al. 2010) for open reading frame (ORF) prediction were: minimum length of 60 bp, minimum bitscore of 20, minimum (B)LAST score ratio (BSR) of 0.4, and maximum e-value of 1 × 10^-6.

2.2.5 Mapping Reads to Assembly and Generating Normalized Abundance Values for Open Reading Frames (ORFs)

The Burrows-Wheeler Aligner (BWA), a software package for mapping low-divergent sequences (such as Illumina short reads) against a large reference genome, was used to map the PPS metagenome raw reads to the assembly (Li and Durbin 2010). This was done to estimate the fraction of unmapped reads and to correct for abundance measures lost during assembly. The MEM algorithm was used because it supports longer sequences, ranging from 70 bp to 1 Mbp, and split alignment, which was the case for the PPS reads. The developers also recommend it for high-quality queries as it is faster and more accurate.

Reads Per Kilobase per Million mapped reads (RPKM) values were calculated for all predicted ORFs to generate abundance values normalised for sequencing depth and gene length. This was done with a custom script developed in the Hallam lab, run with the “multi-read” mode flag so that the multi-reads generated through the bwa alignment were considered. The results were fed into MP, and the GUI was used to interactively view the ORF RPKM relative abundance across different databases.
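For reference, the standard RPKM normalisation divides the number of reads mapped to an ORF by the ORF length in kilobases and by the total number of mapped reads in millions. The sketch below shows only this arithmetic; it is not the Hallam lab script, the function name is hypothetical, and the way that script apportions multi-reads is not described in the text, so fractional read counts are simply accepted as input here.

```python
def rpkm(read_count: float, orf_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase per Million mapped reads.

    RPKM = reads / (length_bp / 1e3) / (total_mapped_reads / 1e6)
         = reads * 1e9 / (length_bp * total_mapped_reads)

    read_count may be fractional if multi-reads are split between ORFs
    (an assumption; the splitting rule itself is not modelled here).
    """
    if orf_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (orf_length_bp * total_mapped_reads)


# Illustrative values: 50 reads on a 1,000 bp ORF, 10 million mapped reads.
example_rpkm = rpkm(50, 1000, 10_000_000)
```

Because both length and depth appear in the denominator, the same read count on an ORF twice as long, or in a library twice as deep, yields half the RPKM.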


2.2.6 Qualitative comparison of Carbohydrate Active Enzyme (CAZyme) abundance across different environments

To qualitatively assess where the PPS microbiome CAZy profile lies in the spectrum of CAZy abundance across different waste-biomass environments, an in silico comparison of CAZy family counts was done. Datasets were obtained from publicly available sources (NCBI and IMG-M/JGI) as well as from metagenomic data generated by members of the Hallam lab, comprising different environments. The metadata and related information for these datasets are given in Table 2.7. This comparison was qualitative, given the different pipelines through which assemblies were generated for the microbiomes before feeding into MP. In addition, because of variations in sequencing depth between samples, a variance stabilizing transformation (VST) was applied to the data using the DESeq2 package (Love et al. 2014) in R. Hierarchical clustering based on Manhattan distances of z-scores was used. The Manhattan distance metric was chosen because it weights each additional gene equally regardless of abundance, whereas a Euclidean distance metric places lesser importance on additional genes.

Next, the relative abundance of each family was calculated, and a z-score (the number of standard deviations above or below the mean count of a CAZy family) was determined for each CAZy family in the different environments. A color scale was chosen to represent this in a heat map implemented in R, with the mean of each GH family assigned 0 and z-scores on a scale of +4 to -4 used to color code for enrichment and depletion, respectively.
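The z-score scaling and the two distance metrics mentioned in this section can be illustrated with a few lines of standard-library Python. This is a sketch under stated assumptions rather than the R code used in the study: the function names are hypothetical, the population standard deviation is assumed for the z-score, and clipping to the ±4 range mirrors the heat map color scale.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def z_scores(values: Sequence[float]) -> List[float]:
    """Z-score of one CAZy family across environments (population SD assumed)."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def clip(z: float, limit: float = 4.0) -> float:
    """Restrict a z-score to the -limit..+limit range of the color scale."""
    return max(-limit, min(limit, z))


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Square root of the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Toy profiles for two environments over two CAZy families.
d_manhattan = manhattan([0.0, 0.0], [3.0, 4.0])
d_euclidean = euclidean([0.0, 0.0], [3.0, 4.0])
```

For the toy profiles the Manhattan distance is 7 while the Euclidean distance is 5, which shows how the two metrics aggregate per-family differences differently.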


2.3 Results

2.3.1 Metagenome Assembly and Annotation

The assembly statistics of the PPS metagenome were satisfactory with 93.73% of reads mapping to the assembly. The average RPKM of the contigs across the sample was 0.71. The Nx curve for the assembly is shown in Figure 2.5 below and Table 2.4 gives the general assembly statistics. These values give confidence in further functional annotation of the pulp and paper sludge (PPS) metagenome.

Figure 2.5: Nx curve for Megahit assembled Pulp and Paper sludge (PPS) metagenome (Figure in collaboration with Connor Morgan-Lang)
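As a reminder of how each point on an Nx curve is defined, the sketch below computes Nx for a list of contig lengths: the length of the shortest contig in the smallest set of longest contigs that together cover at least x% of the assembly. The function name `nx` and the toy contig lengths are illustrative assumptions and are unrelated to the PPS assembly.

```python
from typing import Sequence


def nx(contig_lengths: Sequence[int], x: float = 50.0) -> int:
    """Return Nx: walk contigs from longest to shortest until x% is covered."""
    ordered = sorted(contig_lengths, reverse=True)
    threshold = sum(ordered) * x / 100.0
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= threshold:
            return length
    raise ValueError("contig_lengths must not be empty")


# Toy assembly of nine contigs (total length 54).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = nx(toy_contigs, 50)
toy_n90 = nx(toy_contigs, 90)
```

In the toy assembly the three longest contigs (10, 9 and 8) reach half of the total length, so N50 is 8, while N90 drops to 4.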


Table 2.4: General Assembly Statistics

General Assembly Statistic Parameter Value

Total number of contigs 807686

Total cumulative length 839622602 bp

Minimum contig length 200 bp

Maximum contig length 626031 bp

Average contig length 1040 bp

N50 1652 bp

2.3.2 CAZy Families Relevant to Plant Polysaccharide Degradation

The primary objective of this study was to find the most relevant enzymes that can hydrolyse the polysaccharide content within the waste biomass feedstock PPS. The majority of the solid content within PPS consists of rejects from the pulping process during paper making, i.e. the fibres that are deemed too short to be processed further. The pulp in turn is derived from the woody tree species used at the respective mills. The sample of PPS used in this study was derived from a coastal BC mill and is composed of a mixture of softwood tree species typical of mills in this area, with only small adjustments from year to year. The major tree species comprising the source of pulp include Hemlock, Douglas-Fir, Cedar, Cypress and an SPF (Spruce-Pine-Fir) mixture (data from mill).

It is interesting to note the differences in the major polysaccharide components of the starting woody material and the sludge produced after the pulping treatment. The composition of softwood polysaccharides has been reviewed by Willför et al., whose analysis of the major tree species found the most prevalent components to be galactoglucomannans (mannans) and arabinoglucuronoxylans (xylans) (components of hemicellulose), followed by glucans (derived mostly from cellulose). However, given their existence within the amorphous hemicellulose fractions and their undesirability for the paper making process, the mannans and xylans are systematically removed during pulping, which leaves the cellulose-related polymers (glucans) as the major components of process pulp and consequently of sludge. This is reflected in the composition of PPS in this study (Figure 4.6, section 4.3.1) as well as of sludge used in other studies involving NBSK pulp rejects (Jackson and Line 1997; Mabee 2001; Rangu 2018).

Using this information from the target feedstock and environment, the CAZy families most relevant to this study have been determined. These families have been shown to have activities on the substrates arabinogalactan, arabinan, cellulose (or glucan), glucomannan, glucuronoxylan, homogalacturonan backbone, rhamnogalacturonan backbone, xyloglucan and pectin. This list also represents some of the most well-characterized enzyme activities found in bacterial and fungal systems across different microbiomes, which often act in synergy to carry out the hydrolysis of polysaccharides in woody biomass (van den Brink and de Vries 2011; Simmons et al. 2014; López-Mondéjar et al. 2016; Ndeh et al. 2017; Berlemont 2017). Figure 2.6 shows a hierarchical clustering of the selected enzymes (using Manhattan distance and clustering based on average counts) derived from their variance stabilization transformed counts across the different microbiomes used in the comparative assessment in this study (section 2.3.4).


Figure 2.6: Manhattan distance hierarchical clustering of CAZymes used in this study

2.3.3 Binning the PPS metagenome

Binning the metagenome resulted in 87 bins. The total size of the sample assembled into bins was 246 Mb, which represents around 30% of the total assembly. These bins were then checked for quality, and the bins for annotation of CAZy functions were selected based on their completeness and contamination scores. Bins with >90% completeness were considered for further analysis. Six of these bins had contamination >20%, which put them in the “medium-quality” draft standard under the MIMAG standards mentioned before (Bowers et al. 2017). The assembled bin draft quality ranges are indicated in Figure 2.7.

Figure 2.7: Completeness and contamination estimation of the bins of PPS metagenome

The legend in the figure shows the taxonomy assigned by the CheckM tool. Given the nature of the metagenomic data, several bins were assigned taxonomy only at the “bacteria” domain level due to the absence of the marker genes used by this tool to assign taxonomy. Where such bins were included for further analysis, their taxonomy was determined using Phylosift, as shown in Table 2.5. In total, these 19 high- to medium-quality bins represent only 9% of the assembled metagenome. Despite good closure (80-100%) of the bins in assignment to a dominant phylum (except in 3 cases), this low coverage of the metagenome by binning makes it difficult to link taxonomic distribution to function using this approach, because there is a high probability that several CAZy families remain undetected by chance owing to insufficient coverage or completeness.

Table 2.5: Summary of phylogenetic (Phylosift) and functional (Metapathways) information of the medium-high quality draft genomes as assembled through binning of the PPS metagenome (*≤ 1 gene count)

Bin | Size (Mb) | Phylosift assigned phylum (%) | Est. Completeness (%) | Est. Contamination (%) | Strain Heterogeneity Index (0-100) | Major GH family annotations (% of total CAZy ORF annotations) | Function relevant to plant polysaccharide degradation
4 | 3.5 | Proteobacteria (88%) | 95 | 3 | 6 | GH 13 (8.3); GH 23 (7.4) | α-amylase; peptidoglycan lyase
5 | 5.2 | Proteobacteria (86%) | 98.6 | 7.7 | 2 | GH 13 (5.7); GH 23 (4.9) | α-amylase; peptidoglycan lyase
6 | 2.4 | Firmicutes (96%) | 91.9 | 2.6 | 0 | GH 1 (21); GH 13 (15); GH 65 (6) | β-glucosidase; α-amylase; phosphorylase
7 | 4.1 | Bacteroidetes/Chlorobi group (92%) | 97.8 | 2.4 | 0 | GH 13 (5.9); GH 2/3 (5) | α-amylase; β-galactosidase; β-glucosidase
16 | 5.5 | Bacteroidetes/Chlorobi group (100%) | 97.5 | 0.9 | 0 | GH 13 (8.2); GH 3 (4.1) | α-amylase; β-glucosidase
26 | 5.9 | Bacteroidetes/Chlorobi group (100%) | 96.3 | 23.6 | 14 | GH 13 (6.6); GH 3/5 (3.3) | α-amylase; β-glucosidase; endo-β-1,4-glucanase / cellulase
31 | 5.5 | Bacteroidetes/Chlorobi group (100%) | 99 | 1.7 | 20 | GH 13 (5.3); GH 29 (4); GH 3/5 (2.67) | α-amylase; α-L-fucosidase; β-glucosidase; endo-β-1,4-glucanase / cellulase
32 | 4.1 | Proteobacteria (74%) | 94.3 | 0.98 | 0 | GH 23 (10.5); GH 3/13/57 (3.5) | peptidoglycan lyase; β-glucosidase; α-amylase
34 | 3.3 | Bacteroidetes/Chlorobi group (100%) | 95.6 | 0.86 | 66.7 | GH 43 (8); GH 3/13 (5.3) | β-xylosidase; β-glucosidase; α-amylase
35 | 3.3 | Proteobacteria (100%) | 88.1 | 0.76 | 33.3 | GH 23 (3.1); GH 6/10/43 (2.1) | Chitinase; endoglucanase; cellobiohydrolase; endo-β-1,4-xylanase; β-xylosidase
38 | 3.3 | Chlamydiae/Verrucomicrobia group (57%) | 95.4 | 2.1 | 0 | GH 13 (8.5); GH 57 (4.2) | α-amylase
40 | 2.2 | Firmicutes (100%) | 94.2 | 1.45 | 66.7 | GH 9 (11.3); GH 5/13/43 (7) | Endoglucanase; endo-β-1,4-glucanase; α-amylase; β-xylosidase
42 | 3.3 | Bacteroidetes/Chlorobi group (100%) | 91.2 | 6.1 | 70.6 | GH 3 (6); GH 13 (5.1) | β-glucosidase; α-amylase
48 | 2.7 | Firmicutes (90%) | 96.8 | 6.45 | 0 | GH 13 (22); GH 3/30 (4.9) | α-amylase; β-glucosidase; endo-β-1,4-xylanase
49 | 3.4 | Bacteroidetes/Chlorobi group (100%) | 94.6 | 1.7 | 50 | No significant hits* |
50 | 4.5 | Bacteroidetes/Chlorobi group (100%) | 95 | 2 | 20 | GH 1 (5.1); GH 92 (3.4) | β-glucosidase
51 | 2.2 | Firmicutes (100%) | 93.6 | 2.1 | 100 | GH 13 (13.2); GH 3 (10.5) | α-amylase; β-glucosidase
53 | 3.8 | Planctomycetes (77%) | 95.6 | 17.4 | 4.8 | GH 31/15/29 (6.8) | α-glucosidase; glucoamylase; α-L-fucosidase
57 | 5.1 | Chlamydiae/Verrucomicrobia group (100%) | 89.7 | 2.3 | 25 | GH 43 (15.4); GH 2 (6.7) | β-galactosidase; β-xylosidase

Most of the bins were assigned to phylum Bacteroidetes (44%) (Figure 2.8 (a)). These bacteria are known for their organic polysaccharide degradation activities and are considered model organisms for polysaccharide utilization loci (PULs) studies and annotations. Firmicutes represent another phylum well known for organic matter degradation and were observed at 16.7% of the total bin phylogeny. It is therefore not surprising that CAZy annotations representing the majority of the hydrolytic activities most relevant to cellulose and hemicellulose degradation, including β-glucosidase (GH3, GH1, GH92), endo-β-1,4-glucanase (GH5, GH9), β-xylosidase (GH43) and endo-β-1,4-xylanase (GH30), are present in the bins assigned to these phyla, indicating a phylogenetic correlation with these functions and supporting previous observations from the literature (Terrapon et al. 2015; Ransom-Jones et al. 2017). GH13 was observed to be the CAZy family with the highest percentage abundance across ORFs annotated through CAZy, which is expected given the high abundance of this CAZy family within bacteria. However, this activity is not very relevant to cellulose or hemicellulose degradation. Overall, CAZy families with cellulolytic and hemicellulolytic activities dominated the CAZyme distribution across the bins, representing a cumulative total of 54% (Figure 2.8 (b)).

Figure 2.8: Left to right (a) Taxonomic assignment distribution at phylum level for the bins as determined through Phylosift (b) Cumulative percentage distribution of CAZymes across the annotated bins using Metapathways

2.3.4 The Paper Sludge Microbiome and Metabolic Potential

2.3.4.1 Taxonomic Distribution

Taxonomic assessment of metagenomic data or derived assemblies presents many challenges (Sedlar et al. 2017). These arise from a combination of sampling techniques, the nature of the short-read data produced by next-generation platforms, and the downstream bioinformatics processing pipelines used to extract marker genes to assign taxonomy (and the lack of consensus for benchmarking these techniques) (Sczyrba et al. 2017).


In this study, I sought to qualitatively assess different means of estimating the taxonomic composition of a metagenome and how they differ from each other. Taxonomic binning results have been presented previously, but because only a small percentage of the total microbiome is represented in high quality bins, it is not possible to conclude the taxonomic distribution with confidence using bin distribution alone.

Therefore, the whole metagenome taxonomic distribution was assessed using different bioinformatics pipelines. The results were qualitatively compared based on the distribution of the major phyla (those that comprised >90% of the metagenome). For 454 pyrotag sequencing, 421 unique OTUs were identified through the QIIME (Caporaso et al. 2010) pipeline using 5,818 QC reads clustered at 97%.

It was observed that there was closer agreement between EMIRGE and 454 pyrotag sequencing, and between Metapathways and Phylosift results (Figure 2.9). However, it is intriguing to note that these pipelines did not agree on the major phylum. This disparity in taxonomic assignment between amplicon-based sequencing and NGS/shotgun sequencing pipelines has been noted before, and Tessler et al. claim that amplicon sequencing results might be more robust (Tessler et al. 2017). However, it is also well known that amplicon-based sequencing suffers from over-estimation of phyla with bigger genome sizes, because PCR bias can skew detection and consequently the phylogenetic distribution. Amplicon-based approaches can also miss rare taxa that other bioinformatics pipelines might capture, and which could lead to the discovery of new candidate phyla (Gonzalez et al. 2012).


Table 2.6: Taxonomic assessment comparison pipeline across the major phyla depicted in Figure 2.9

Pipeline | Input data | Reference | % assigned to other phyla (non-major)
454 Pyrotag sequencing and QIIME assessment | 16S pyrotag samples from PPS metagenome | (Caporaso et al. 2010; Pilloni et al. 2012) | 4
EMIRGE | Short-read sequence data from Illumina sequencing of whole PPS metagenomic DNA | (Miller et al. 2011) | 0.5
PhyloSift | PPS metagenome assembly | (Darling et al. 2014) | 10
Metapathways (LCA using NCBI taxonomy and RefSeq tree) | PPS metagenome assembly | (Hanson et al. 2014) | 2

Figure 2.9: Qualitative assessment of taxonomy distribution of major phyla through different pipelines

Data on the taxonomic profile of the PPS metagenome are not available in the literature. Some studies give data on strains isolated from these samples (Maki et al. 2011; Ghribi et al. 2016). However, investigations of similar environments, namely activated sludge from a municipal wastewater reactor (Yu and Zhang 2012) and a thermophilic cellulose-degrading microbial sludge consortium (Xia et al. 2013), have shown similar trends in the distribution of major phyla. The databases and methods used for annotation in these pipelines can also explain the trend observed: Phylosift incorporates a multi-marker approach using HMMs, which is similar to the LCA algorithm implemented in Metapathways using information from the NCBI non-redundant (nr) protein dataset. Similarly, EMIRGE and QIIME both rely on SILVA SSU databases for taxonomic assessment. It is therefore important to be aware of these differences in methods in order to select the tool best suited to the question most relevant to this study. Increasing the sample size to include PPS from different coastal mills in the geographic region is needed to confidently determine which taxonomic assessment is the most representative of this environment.
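The idea behind a lowest common ancestor (LCA) assignment can be shown with a generic sketch: given the lineages of all database hits for one ORF, the ORF is assigned to the deepest taxonomic rank on which every hit agrees. This is not the Metapathways implementation, which operates on the NCBI taxonomy tree; the function name and the example lineages are assumptions for illustration.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the shared lineage prefix (root first) of all hit lineages."""
    if not lineages:
        return []
    shared: List[str] = []
    # zip stops at the shortest lineage, so ragged inputs are handled.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical hits for one ORF that agree down to phylum level only.
hits = [
    ["Bacteria", "Bacteroidetes", "Bacteroidia"],
    ["Bacteria", "Bacteroidetes", "Flavobacteriia"],
]
lca = lowest_common_ancestor(hits)
```

Here the two hits disagree at class level, so the ORF would be assigned to the phylum Bacteroidetes; hits from different phyla would fall back to the domain.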

2.3.4.2 CAZy Distribution

The CAZyme annotation of the PPS metagenome was done using the Metapathways pipeline as explained before. A total of 32,232 unique CAZy ORFs were detected, representing around 2.69% of the total ORFs meeting QC thresholds in the pipeline. The overall distribution of the CAZy families is given in Figure 2.10.


Figure 2.10: Overall CAZyme family distribution in the PPS metagenome as annotated using Metapathways

Glycoside hydrolases form the biggest fraction of the CAZy families for the PPS sample. It is interesting to note that within the GH fraction, the highest activity fraction observed is related to hemicellulose and structural biomass degradation (viz. pectin), while cellulosic biomass degradation forms only 11% of this fraction. It should also be noted that other CAZy families needed for the synergistic breakdown of hemicellulose in this environment are present in the overall annotations. These results point to interesting targets for functional screening for other enzyme activities (apart from GHs) that might be important for total biomass deconstruction.

The taxonomic enrichment and distribution of the CAZy GH families most relevant to plant polysaccharide degradation was plotted (Figure 2.11). This figure was generated using the LCA taxonomy assignment of GH genes through Metapathways, including only those gene counts that could be assigned at phylum level. As noted before, GH13 remained the most predominant family and was found across all major phyla. It is very interesting to note that the phylum Bacteroidetes has the greatest abundance of the most relevant GH families involved in cellulose and hemicellulose degradation. These include all 7 β-glucosidase families (GH1, GH3, GH5, GH9, GH30, GH39 and GH116), 6 of 14 cellulase families (GH5, GH6, GH9, GH10, GH12, GH26), β-xylosidase families (GH2, GH43) and endo-β-1,4-xylanase families (GH10, GH30).

The same trend was also observed through binning and assignment of CAZy functionality to the taxonomy. The estimate of the Bacteroidetes phylum in the PPS metagenome is also the one on which the different pipelines agree most closely, which lends credence to the hypothesis that this phylum is the major GH activity provider. This observation might also point to the possibility that the majority of the CAZy functionality of this metagenome is carried out by a taxonomic group which is not the most abundant (based on mapped reads), and that other taxa might depend on this group for their metabolism. The genes belonging to this phylum might be potential candidates for sub-cloning and optimization of expression of both cellulolytic and hemicellulolytic functions. This is also corroborated by findings from the literature that show a strong correlation between members of the Bacteroidetes phylum and GH activity. Berlemont and Martiny have done a comprehensive study linking GH family profiles (functionally and taxonomically) across around 1,934 annotated metagenomes from 13 broadly defined ecosystems and found that Bacteroidetes include degrader genera found in almost all types of ecosystems (Berlemont and Martiny 2016) and are also strongly associated with xylan degradation, which is the predominant type of GH activity found in the PPS environment.


Figure 2.11: Phylum-level distribution of relevant Glycoside Hydrolase (GH) genes in the pulp and paper sludge (PPS) metagenome (only phyla constituting > 90% of the taxonomy, as annotated through MetaPathways, are included)

2.3.5 Comparison to Other Environmental Microbiomes

To better understand the relevance of the CAZy functionality of the PPS metagenome in the broader context of environmental metagenomes, I compared the distribution of the CAZy families relevant to plant polysaccharide degradation across a set of them. The metadata for these environments are given in Table 2.7 and should be considered when analysing the CAZyme distribution.


Table 2.7: ORF statistics and metadata for metagenomes in comparative analysis

Metagenome ID | Metagenome description (metadata) | No. of ORFs predicted | No. of CAZy hits (and % of total ORFs) | Reference
--- | --- | --- | --- | ---
WL-0 | LFH (organic) harvested soil horizon, 0.3 cm depth, pH 5.87 | 160378 | 16113 (2.85) | (Wilhelm et al. 2017)
WL-1 | Mineral (Ae) harvested soil horizon, 6.2 cm depth, pH 6.13 | 714203 | 18627 (2.61) | (Wilhelm et al. 2017)
WL-2 | Mineral (AB) harvested soil horizon, 33.3 cm depth, pH 6.5 | 313767 | 25681 (2.37) | (Wilhelm et al. 2017)
WL-3 | Mineral (Bt) harvested soil horizon, 52.8 cm depth, pH 6.9 | 653135 | 52910 (2.34) | (Wilhelm et al. 2017)
WL-4 | LFH (organic) harvested soil horizon, 0.2 cm depth, pH 6.18 | 320397 | 8497 (2.65) | (Wilhelm et al. 2017)
WL-5 | Mineral (Ae) harvested soil horizon, 6.7 cm depth, pH 6.05 | 921308 | 86655 (2.62) | (Wilhelm et al. 2017)
WL-6 | Mineral (AB) harvested soil horizon, 21.7 cm depth, pH 6.27 | 477359 | 48868 (2.9) | (Wilhelm et al. 2017)
WL-7 | Mineral (Bt) harvested soil horizon, 59.7 cm depth, pH 6.76 | 868411 | 75413 (2.45) | (Wilhelm et al. 2017)
WL-8 | LFH (organic) harvested soil horizon, 0.2 cm depth, pH 6.03 | 532576 | 12552 (2.36) | (Wilhelm et al. 2017)
WL-9 | Mineral (Ae) harvested soil horizon, 4.2 cm depth, pH 5.16 | 153718 | 3379 (2.2) | (Wilhelm et al. 2017)
BEAVER | Beaver feces metagenome | 179831 | 6375 (3.54) | (Mewis 2016)
WWT-SLUDGE | Activated sludge wastewater microbial communities from Ann Arbor, Michigan, USA | 1104030 | 113964 (2.61) | (Stadler et al. 2017)
COALBED-1 | Coalbed methane (CBM) samples, sampled as pieces of core obtained by rotary drilling (cuttings from less than 1000 mbs); 182 m depth; 10–20 °C | 1370665 | 30088 (2.2) | (An et al. 2013)
COALBED-2 | Coalbed methane (CBM) samples, sampled as pieces of core obtained by rotary drilling (cuttings from less than 1000 mbs); 182 m depth; 10–20 °C | 987759 | 23766 (2.41) | (An et al. 2013)
SAKINAW_LAKE | Permanently stratified meromictic Sakinaw Lake | 5721366 | 82269 (1.44) | (Gies et al. 2014)
COMPOST | Rice straw-adapted microbial consortia enriched from compost ecosystems (Day 30, 29.2 °C) | 159941 | 16317 (2.74) | (Wang et al. 2016)
PPS | Primary wastewater reactor sludge sampled from BC coastal paper mill (19-20 °C, pH = 7.0 ± 0.02) | 1200162 | 32031 (2.67) | This study (Sharan et al. 2017)
ZOO_COMPOST | Time series study of thermophilic zoo composting facility (sampled on Day 67, 70.5 ± 4.8 °C, pH = 7.4) | 371820 | 10452 (2.81) | (Antunes et al. 2016)

All WL samples were from O’Connor Lake, BC.

Figure 2.12 shows a hierarchical clustering of the environments based on the variance-stabilized GH gene counts. As expected, the beaver metagenome assembly clusters away from all other environmental metagenomes owing to the fundamental differences between this system (a mammalian xylotroph) and the other environments. This is also reflected in the CAZyme enrichment profile (Figure 2.13). Given that the beaver diet consists mostly of woody biomass, which drives a dedicated microbial flora to assist with digestion, it is not surprising to see an enrichment of most major cellulolytic and hemicellulolytic activities (as described before). This environment also has the highest percentage of ORFs with CAZy annotations. The other environments have similar overall percentages, and the PPS metagenome falls within this range (2-3%). The PPS metagenome clusters with the compost microbial enrichment and the zoo compost, which gives further confidence in using these environments to explain the CAZy taxonomy trends observed for PPS earlier, given the lack of literature on the metagenome of this environment. However, it is surprising to see activated sludge clustering with a coalbed sample. The anaerobic nature of this sludge, leading to methanogenesis, might explain a potential correlation with the dominant phyla found in the solid coalbed methane core samples. The soil metagenomes (WL 0-9) cluster with each other, with unexpected overlaps between organic and mineral horizons (potentially attributable to errors in sampling methods), and the stratified meromictic lake sample also clusters away from most environments, as expected.

Figure 2.12: Hierarchical clustering of the different microbiomes in this study based on relevant CAZy gene counts (tree distances calculated using the Manhattan method)
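The clustering idea behind Figure 2.12 can be illustrated with a minimal Python sketch that computes Manhattan distances between count profiles and repeatedly merges the closest clusters. This is not the code used in this study: the profile values are hypothetical, and average linkage is an assumption, since the linkage method is not stated here.

```python
from itertools import combinations


def manhattan(a, b):
    """Sum of absolute differences between two count profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def average_linkage(profiles):
    """Agglomerative clustering on Manhattan distances.

    profiles maps an environment name to its list of counts.
    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = {name: [name] for name in profiles}
    merges = []

    def dist(c1, c2):
        # Average linkage (assumed): mean distance over all cross-cluster pairs.
        pairs = [(m1, m2) for m1 in clusters[c1] for m2 in clusters[c2]]
        total = sum(manhattan(profiles[m1], profiles[m2]) for m1, m2 in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: dist(*p))
        merges.append((a, b, dist(a, b)))
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
    return merges


# Hypothetical variance-stabilized counts for three GH families.
profiles = {
    "PPS": [5.0, 7.0, 3.0],
    "COMPOST": [5.5, 6.5, 3.5],
    "BEAVER": [9.0, 9.5, 8.0],
}
merges = average_linkage(profiles)
print(merges)
```

With these toy values the two most similar profiles (PPS and COMPOST) merge first, and the outlying profile joins last, which mirrors how a distant environment ends up on its own branch of the tree.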


The overall CAZyme abundance profiles of these environments agree well with the metadata. In fact, the gene functionality they capture helps explain the clustering. For instance, the unexpected overlap of the mineral horizons WL 1, 7 and 9 with the organic horizons WL 0, 4 and 8 is explained by the enrichment of similar CAZyme activities. The other mineral horizons, as expected, lack these activities. Harvesting of the soil and intermittent mixing of horizons during sampling might be driving these profiles.

The clustering of the coalbed samples with the activated sludge microbiome can be explained similarly. These metagenomes, along with the mineral soil horizons (WL 2, 3, 5 and 6), represent some of the environments with the poorest CAZy enrichment profiles and therefore do not seem to be good candidates for functional profiling of CAZy activities. It is surprising that this particular activated sludge environment is not enriched in CAZy activities, given that activated sludge is regularly prospected for lignocellulolytic activity in the literature.

A closer look at the CAZyme heat map across all these environments shows that the PPS metagenome is not exceptionally rich in the CAZy activities most relevant to plant polysaccharide degradation. However, the enzyme families GH43, CE7, GH30, GH115, GH113, GH52, GH27, GH35, GH51, GH106 and GH127, which represent most of the differentially abundant enzyme categories in this environment, are related to the degradation of hemicellulose and structural compounds rather than cellulose. This reflects the CAZy profile as annotated through MetaPathways and places further emphasis on revisiting the question of targeted enzymatic mining of this environment.


Figure 2.13: Heat map showing differential abundance of CAZyme families across different metagenomic environments (the color coding represents the conversion of VST GH count values to an enrichment-depletion scale based on z-scores calculated for each family across the different environments)
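The enrichment-depletion scale described in the caption can be sketched as a per-family z-score across environments. The following Python example is illustrative only: the values are hypothetical, and the use of the sample standard deviation is an assumption rather than a documented detail of the analysis.

```python
from statistics import mean, stdev


def family_z_scores(values_by_env):
    """Convert one family's values across environments to z-scores."""
    values = list(values_by_env.values())
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation (assumption)
    if sd == 0:
        # A family with identical values everywhere is neither enriched nor depleted.
        return {env: 0.0 for env in values_by_env}
    return {env: (v - mu) / sd for env, v in values_by_env.items()}


# Hypothetical VST values for a single GH family in three environments.
gh_family = {"PPS": 8.0, "COMPOST": 6.0, "COALBED-1": 4.0}
z = family_z_scores(gh_family)
print(z)
```

Positive scores correspond to enrichment relative to the other environments and negative scores to depletion, so each row of the heat map is read relative to its own family mean.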


The presence of hemicellulolytic genes specific for polysaccharides and oligosaccharides derived from softwood species is intriguing, given that sludge is not compositionally enriched in the hemicellulosic fraction. However, on an “as received” basis, sludge is 80% moisture and 20% solids, and these microbial communities occur naturally in this high-moisture environment. It is possible that hemicellulose-derived sugars present in dissolved form in the environment enrich these activities in the microbial communities. Because cellulose is crystalline, its hydrolysis and uptake are difficult and do not represent an energetically favourable metabolic pathway. Its persistence in PPS solids after the harsh pulping process attests to its recalcitrance. This could explain the observation and suggests hemicellulolytic gene mining as the targeted approach for the PPS microbiome. However, an analysis of the liquid fraction of sludge is needed to validate this reasoning and should be included in further studies.

2.4 Conclusion

In-silico analysis of microbiomes prior to functional analysis represents a powerful pre-validation step to maximize efficiency in downstream experimental design and use of wet-lab resources. It can also be used for quick testing of hypotheses about the functionality and taxonomy of metagenomes. However, the lack of benchmarked bioinformatic tools and the variability in outputs across pipelines make informed interpretation of such results difficult. As far as possible, and especially for comparative analysis across microbiomes, it must be ensured that the draft assemblies of metagenomes conform to MIMAG standards and are processed equivalently. This can make these analyses more quantitative and less qualitative. In addition, supplementing shotgun metagenomic DNA sequencing with metatranscriptomic sequencing can lead to new insights about the actual functional distribution of genes and the phyla that are most active in driving the activities of interest to any study.


CHAPTER 3 High-throughput Biocatalyst Discovery from Paper Sludge Metagenome by Functional Metagenomics

3.1 Background

3.1.1 Functional Metagenomics - discovering novel industrial biocatalysts

Functional metagenomics, or the expression of mixed environmental microbial DNA in heterologous host systems (Lam et al. 2015), is the experimental complement to sequence-based metagenomics. Sequence-based metagenomics concerns itself with in-silico annotation of environmental DNA using known nucleotide or protein databases and can potentially lead to the discovery of novel gene families through clustering approaches. However, it is only through functional metagenomic methods that a desired phenotype or function can be linked directly to novel gene discoveries, potentially generating tractable systems for biocatalyst or metabolite production (Cheng et al. 2017). This is done by cloning and expressing the environmental DNA of interest in a suitable host system to generate what is called a “metagenomic library”, i.e. several thousand clones each harbouring a fragment of the environmental DNA. This library is then screened for the desired function using assays that include, but are not limited to, fluorescence detection, colorimetric or chromogenic methods, or selection for growth or inhibition of growth, preferably in high throughput (Streit and Daniel (Eds.) 2010).

Functional metagenomic findings are very important for closing the gap between knowledge-based discovery and application to bioprocess engineering and industrial biocatalysis. They also help answer key questions in environmental microbial ecology by linking function to taxonomy, since function can be assessed independently of the ability to cultivate the microbes in a laboratory setting (Chistoserdova 2010). Detection of desired phenotypes in a functional metagenomic library requires a strong, hypothesis-driven approach. Correct screen design is arguably the single most important step in discovering a gene or biocatalytic activity of interest and is discussed in detail in section 3.1.2.

However, prior to screen design it is also important to consider other factors that influence hypothesis formation around the functions one can hope to discover from an environment. The sampling methodology and environmental metadata such as temperature, pH and the presence of target substrate molecules are very important in choosing screening targets, substrates or even host systems (Taupp et al. 2011; Thies et al. 2016). For instance, several glycoside hydrolases have been discovered in ruminant gut and faecal microbiomes, as the main dietary components of these animals are feedstocks rich in cellulose (Hess et al. 2011; Geng et al. 2012; Ilmberger et al. 2014). Similarly, biosurfactant-producing genes have been recovered from the microbiota of oil-contaminated environments, which are enriched for microbes that can break down oil molecules, and these genes have immense application in bioremediation (Oliveira et al. 2015).

Apart from the metadata, as evidenced in chapter 2 of this study, in-silico metagenome analysis also plays an important part in guiding screen design, as it provides information about the pathways and genes that might exist in the metagenome. Sequence-based metagenomics and functional metagenomics are almost mutually dependent with regard to experimentation and validation.

3.1.2 What goes into metagenomic functional screen design?

Screen design is the most important part of functional metagenomics. A consistent observation across both culture-dependent (enrichment cultures, directed evolution) and culture-independent (metagenomic screening) phenotypic screening experiments is that we get what we screen for. This puts even more emphasis on the careful considerations that must go into screen design. The design principles for metagenomic functional screens usually follow the general guidelines used for high-throughput screening (HTS) applications (Acker and Auld 2014), as the preliminary screen of the whole metagenomic library often involves several thousand reactions that need to be conducted simultaneously and reproducibly. The following are some important factors that affect both screen design and outcomes and have been reviewed previously (Taupp et al. 2011; Armstrong et al. 2015) (Figure 3.1):

1. Environmental DNA: The quality and quantity of the isolated genomic DNA affect several aspects of the screening process. The quantity of DNA captured in the metagenomic library determines the coverage of the environmental genome in the library and can therefore bias the types of functions that might be observed. Strategies like substrate-based enrichment of mixed environmental cultures (Wang et al. 2016; Arshad et al. 2017) [in some cases with the use of special growth chambers (Nichols et al. 2010)] and substrate-induced gene expression (SIGEX) (Simon and Daniel 2011) have been used to enhance the probability of discovering targeted gene products. Stable isotope probing (SIP), a common technique applied in microbial ecology studies to link taxonomy to metabolic function (Radajewski et al. 2000), is now increasingly integrated with metagenomic studies to isolate the fraction of environmental DNA to be targeted for screening (Grob et al. 2015; Ziels et al. 2018). The DNA can then be size-fractionated to construct libraries harbouring different genes of interest.

   For low-biomass environments, especially extreme environments targeted for thermostable or broad-pH-range activities, it is challenging to obtain sufficient DNA for library construction. In these cases, techniques like whole genome (multiple displacement) amplification (Binga et al. 2008), linker-amplified shotgun libraries (LASLs) or expressed LASLs can be used to amplify extracted DNA with negligible bias; these have been reviewed for phage metagenomic studies (Henn et al. 2010).

2. Vector and expression system: Plasmids can usually harbour inserts up to 15 kb in size, and while their high transformation efficiency greatly increases the number of clones available for screening, only single-locus gene functions can be tested. Other vectors like fosmids and cosmids (≤40 kb) and bacterial artificial chromosomes (BACs) (>40 kb) can support screens for activities that depend on linked genes and multi-locus traits, and also enable better taxonomic annotation. Some systems like pCC1FOS™ (Jendrisak et al. 2002) and conditionally amplifiable BACs (Wild et al. 2002) also support inducible copy control, which is useful for obtaining high yields of cloned DNA of interest while maintaining a stable clonal population and minimizing expression of toxic genes (Simon and Daniel 2011).

   Selection of the expression system or host in which the environmental DNA is cloned is very important to ensure optimal expression of the desired genes. E. coli strains remain the most frequent host. This choice is supported by an in-silico analysis of transcriptional, translational and post-translational controls, along with promoter recognition and initiation factors, which suggests that E. coli can express approximately 40% of genes within a subset of 32 taxonomically diverse genomes, with wide variation in expression potential between genomes (7–73%) (Gabor et al. 2004). However, depending on codon usage bias, which varies between species, other hosts like Bacillus (Steels et al. 2013), Caulobacter or Pseudomonas spp. (Craig et al. 2010) have been found to be better hosts for genes linked to other taxonomic groups. It is therefore important to consider the taxonomic linkages of target genes and to comparatively assess expression of vectors in different hosts (Lam et al. 2015) to obtain the clone with the best-performing expression system, which would reduce downstream optimization of biocatalyst production.

3. Activity/function: Biochemically, there are several nuances to this factor. Enzymatic functions often belong to different classes, so it is misleading to use generic names like “metagenomic screen for cellulases”, as this is a very broad term covering different enzyme activity sub-classes that together are responsible for complete degradation of the cellulose polymer (Wilson 2009). There are dedicated databases for several important large enzyme families (specific or collections of different activities), and they are used for annotation and identification of potential novel enzymes in conjunction with functional metagenomics (Schomburg and Schomburg 2010). This also informs the structure of systematic screening approaches, in which a repertoire of different activities is captured step-wise, followed by deconvolution of clones with very specific activities applicable to target substrates (Chen et al. 2016), as presented in this study.

4. Substrate, co-factors and co-enzymes: Substrate specificity is one of the fundamental characteristics of enzymes. Substrate selection for the metagenomic screen should therefore be done very carefully to avoid confounding effects of substrate analogues or the presence of other reactive groups that might lead to promiscuity and discovery of false positives. Macdonald et al. have recently described a novel high-throughput assay for discovering broad-substrate-specificity glycoside phosphorylases using inorganic phosphate in the enzyme assay medium (Macdonald et al. 2018).

   Several enzymes might also need co-factors or co-enzymes to be functional. Although many of these co-factors might be naturally present in the cellular medium, the specific requirements of the targeted enzymatic reactions should be assessed prior to screening (Bisswanger 2014). This can be done through an in-silico survey of known enzyme families with similar activities, and the screening medium or assay mixture should be supplemented as required.

5. General considerations for good functional screen design: Given the anticipated issues with effective expression of environmental genes in single host systems, sensitivity of activity detection is very important and can be achieved through substrate design. Related factors include the need for a high dynamic range, a pKa that allows screening at different environmental pH values, and a high signal-to-noise ratio during screening. Insensitivity to cellular contents and compatibility with typical cell lysis reagents are also important. For example, Chen et al. have demonstrated the synthesis of fluorescent phenols with pKa <7, such as halogenated coumarins, modified to be stable so that they can be used even at extended assay times without generating significant background signal. These substrates have been used for screening in this work (Chen et al. 2016).

Figure 3.1: Production and functional screening of metagenomic libraries (Taupp et al. 2011). Copyright © 2011 Elsevier Ltd.


3.1.3 Carbohydrate Active Enzymes (CAZy) Database: Glycoside Hydrolase (GH) families and Polysaccharide utilisation loci (PULs)

The Carbohydrate Active enZymes (CAZy) database (http://www.cazy.org/) has been developed as a curated and annotated reference resource for determining CAZy activities and is increasingly being applied to metagenomic data (Armstrong et al. 2015). The polysaccharide degradation genes are classified as glycoside hydrolases (GH), polysaccharide lyases (PL), carbohydrate esterases (CE), carbohydrate-binding modules (CBM) and auxiliary activities (AA) (Terrapon et al. 2017). The database allows targeted searches alongside gene expression studies from functional screening of genomes and metagenomes, thereby allowing ready functional validation.

Polysaccharide utilisation loci (PULs) are minimally defined as a SusC/SusD gene pair in close proximity to genes that encode carbohydrate-active enzymes (Grondin et al. 2017). Hypothetical genes predicted within a PUL have the potential to provide functional clues and inform downstream expression studies, especially those related to lignocellulosic bioprocessing. Terrapon et al. developed an automated Bacteroidetes PUL prediction pipeline and web interface that uses genomic context information and domain annotations from the CAZy database (Terrapon et al. 2015). It is built on sequence information from the Bacteroidetes group, where PULs occur most commonly, and presents a powerful tool to investigate the occurrence of such orchestrated cellulolytic gene cassettes within metagenomes.
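The minimal PUL definition given above (a SusC/SusD pair near CAZyme genes) can be sketched in Python as a scan over an ordered list of gene annotations. This is a simplified illustration and not the method of Terrapon et al.: the gene labels, the `window` parameter and the prefix-based CAZyme test are all hypothetical choices made for the example.

```python
CAZY_PREFIXES = ("GH", "PL", "CE", "CBM", "AA")  # CAZy classes named in the text


def find_candidate_puls(genes, window=5):
    """Return (index, nearby_cazymes) for each adjacent SusC/SusD pair.

    genes is a list of annotation labels in contig order. window is a
    hypothetical number of flanking genes searched on each side.
    """
    candidates = []
    for i in range(len(genes) - 1):
        if {genes[i], genes[i + 1]} == {"SusC", "SusD"}:
            lo = max(0, i - window)
            hi = min(len(genes), i + 2 + window)
            neighbours = genes[lo:i] + genes[i + 2:hi]
            cazymes = [g for g in neighbours if g.startswith(CAZY_PREFIXES)]
            if cazymes:
                candidates.append((i, cazymes))
    return candidates


# Hypothetical annotations along one fosmid-sized contig.
contig = ["hyp", "GH43", "SusC", "SusD", "GH10", "CE7", "hyp", "hyp"]
print(find_candidate_puls(contig))
```

A SusC/SusD pair with no CAZyme gene in the flanking window is not reported, which reflects the requirement that both elements be present for the minimal definition.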


3.2 Materials and Methods

3.2.1 Fosmid Library Construction

The fosmid library was created following previously established protocols (Taupp et al. 2009) using the Epicentre (now Lucigen) CopyControl™ Fosmid Library Production Kit with pCC1FOS™ Vector. The high molecular weight DNA used in fosmid library production was extracted from pulp and paper mill sludge (PPS) sampled for in-situ analysis, as described in sections 2.2.1 and 2.2.2.1. The DNA was purified from commonly present contaminants using CsCl density-gradient ultracentrifugation (Wright et al. 2009). The desired pure genomic DNA band was withdrawn using a syringe, the dye ethidium bromide (EtBr) was removed by water-saturated butanol extraction, and the DNA was further purified and concentrated using a YM-30 Microcon (Millipore) unit. The insert DNA was then end-repaired using reagents supplied in the library production kit and size-separated on low-melting-point agarose by pulsed-field gel electrophoresis (PFGE). The gel was stained using SYBR Gold (Molecular Probes, Life Technologies), and a thin band corresponding to the size range for fosmid library creation (23-40 kb), as determined using control DNA, a mid-range PFGE marker and a λ/HindIII ladder, was excised. The gel slice was melted and treated with GELase (Epicentre) enzyme. The DNA was purified and concentrated to a volume of 14 µL using Amicon and Microcon filters in succession. The DNA was then ligated overnight with the pCC1FOS vector and packaged using MaxPlax Lambda Packaging Extract. After incubation, an actively growing culture of the host strain E. coli EPI300-T1R was infected with the phage particles. Following an hour-long incubation at 37 °C, the cells were centrifuged, resuspended in 1 mL Luria-Bertani (LB) medium with 10% glycerol and stored as glycerol stocks at -80 °C. A titre test of phage infection efficiency (using the control DNA reaction) and a colony forming unit (CFU) count of the culture were done simultaneously. The following formulae were used to determine the packaging efficiency (1), the CFU/µL of the glycerol stock (2) and the number of clones needed to make the pulp metagenome library (3):

\[
\text{Packaging efficiency (pfu/}\mu\text{g DNA)} = \frac{(\text{No. of plaques})\,(\text{dilution factor})\,(\text{total reaction volume})}{(\text{volume of dilution plated})\,(\text{amount of DNA packaged})} \tag{1}
\]

\[
\text{CFU/}\mu\text{L of the glycerol stock} = \frac{(\text{number of colonies counted}) \times \dfrac{\text{volume of culture}}{\text{volume of culture plated}}}{\text{volume of the glycerol stock}} \tag{2}
\]

\[
N = \frac{\ln(1 - P)}{\ln(1 - f)} \tag{3}
\]

where P = desired probability fraction; f = proportion of genome contained in insert; N = no. of clones.

For equation 3, in the case of a mixed environmental metagenome, f is indeterminate. The clones were picked and transferred into 384-well plates containing LB with 12.5 µg/mL chloramphenicol and 10% glycerol using a QPix2 robot (Genetix) and stored at -80 °C prior to screening. Figure 3.2 depicts the general workflow for metagenomic library construction and screening adopted in this work.
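The packaging efficiency and library size formulae can be sketched in Python as follows. This is an illustrative example under stated assumptions: the function and parameter names are hypothetical, the input numbers are invented, and the single-genome case (a 40 kb insert in a 4.6 Mb genome) is used only because f cannot be determined for a mixed metagenome.

```python
import math


def packaging_efficiency(plaques, dilution_factor, total_reaction_volume_ul,
                         volume_plated_ul, dna_packaged_ug):
    """Packaging efficiency in pfu per microgram of DNA packaged."""
    numerator = plaques * dilution_factor * total_reaction_volume_ul
    return numerator / (volume_plated_ul * dna_packaged_ug)


def clones_required(p, f):
    """Number of clones N = ln(1 - P) / ln(1 - f), rounded up.

    p is the desired probability of capturing a given sequence and
    f is the fraction of the genome contained in one insert.
    """
    if not (0 < p < 1 and 0 < f < 1):
        raise ValueError("p and f must both lie strictly between 0 and 1")
    return math.ceil(math.log(1 - p) / math.log(1 - f))


# Hypothetical titre plate: 200 plaques from 10 uL of a 1:100 dilution,
# 1000 uL total reaction volume, 0.25 ug DNA packaged.
efficiency = packaging_efficiency(200, 100, 1000, 10, 0.25)

# Hypothetical single genome of 4.6 Mb with 40 kb inserts, 99% probability.
n_clones = clones_required(0.99, 40_000 / 4_600_000)
print(efficiency, n_clones)
```

For a single genome of this size a few hundred clones suffice, which illustrates why a library of thousands of clones is needed once the unknown combined genome size of a mixed community is considered.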


Figure 3.2: Metagenomic library and functional screening schematic for pulp and paper sludge (PPS) metagenome

3.2.2 Functional Screening

The fosmid library was screened for cellulolytic activity using a mixture of previously developed model fluorogenic substrates (Chen et al. 2016) containing the functional group 6-chloro-4-methylumbelliferyl (CMU). The substrates were synthesized as described in the publication and were provided by Zach Armstrong from the Withers lab group at UBC. In the first round, the entire library (viz. 15,000 clones) was screened with a mixture of the substrates CMU-cellobioside (CMU-C), CMU-D-mannoside (CMU-Man) and CMU-xylobioside (CMU-X2) (Figure 3.3) to capture the repertoire of catalytic activities of interest from the library. In the subsequent screening rounds, the cellulolytic activity was deconvoluted by using only the CMU-C substrate to screen and validate clones with specificity only towards cellobiose (and/or cellulose, for which the substrate served as the model proxy).

Figure 3.3: The β-1,4-glycoside substrates of 6-chloro-4-methylumbelliferyl (CMU) used for functional screening of the pulp and paper sludge (PPS) metagenomic library clones: (A) CMU-cellobioside, (B) CMU-xylobioside, (C) CMU-mannoside (figures by Zach Armstrong)

The clones from the working glycerol stock library were replicated and grown for 24 hours at 37 °C in supplemented LB medium with 12.5 µg/mL chloramphenicol in a volume of 45 µL per well. Following OD600 measurement using a VarioSkan multi-mode plate reader coupled with a RapidStak device (Thermo Fisher Scientific), the lysis mixture containing the substrates was added to each well using a high-throughput liquid handler, giving a final volume ratio of 1:1. The lysis mixture was prepared in 50 mM potassium acetate buffer adjusted to pH 7.02, with 1% Triton X-100 as the lysing agent and 200 µM of each substrate. The plates were incubated overnight (16-18 hours), after which fluorescence was measured on the VarioSkan as before (excitation: 365 nm, emission: 450 nm, gain: auto).

Clones with relative fluorescence unit (RFU) values, expressed as robust z-scores, greater than the specified σ-value cut-off in each round of screening were designated as positive clones. The σ-value cut-off varied with the dataset and was chosen based on a significant inflection in the data values or on the baseline value observed with the background control (EPI300 strain with empty pCC1FOS™ vector). Data analysis and visualisation were done using suitable packages in R version 3.4.3.
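The hit-calling step can be illustrated with a short Python sketch. It assumes the common definition of the robust z-score, (x − median) / (1.4826 × MAD), which is not spelled out in the text, and it uses invented RFU values and an arbitrary cut-off. The analysis in this work was done in R, so this is only a sketch of the idea.

```python
from statistics import median


def robust_z_scores(values):
    """Robust z-scores based on the median and the median absolute deviation."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
    if scale == 0:
        raise ValueError("MAD is zero; robust z-scores are undefined")
    return [(v - med) / scale for v in values]


def call_hits(rfu_by_well, cutoff):
    """Return the wells whose robust z-score exceeds the cut-off."""
    wells = list(rfu_by_well)
    scores = robust_z_scores([rfu_by_well[w] for w in wells])
    return [w for w, z in zip(wells, scores) if z > cutoff]


# Hypothetical RFU readings for six wells; A6 mimics a positive clone.
plate = {"A1": 100, "A2": 102, "A3": 98, "A4": 101, "A5": 99, "A6": 500}
print(call_hits(plate, cutoff=3.0))
```

Because the median and MAD are barely affected by a handful of strong positives, the background distribution of a plate stays stable even when a few wells fluoresce brightly, which is the usual reason for preferring robust z-scores in high-throughput screens.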

3.2.3 Fosmid Purification and Sequencing

Following identification of positive clones through several rounds of screening and deconvolution, the fosmid DNA was extracted from the clones using the GeneJET Plasmid Miniprep kit (Thermo Fisher Scientific). E. coli (EPI300) genomic DNA was removed using Plasmid-Safe DNase (Epicentre). Plasmid DNA concentrations were determined using the Quant-iT PicoGreen Assay (Invitrogen). The DNA was sent to seqWell Inc. (Beverly, MA) for next-generation sequencing (NGS) library preparation, sequencing and short-read data generation using their plexWell Pro™ service platform, and sequencing was done on an Illumina MiSeq.

3.2.4 Fosmid Assembly and Annotation

The short-read DNA sequence data obtained from seqWell was checked for quality using FastQC version 0.11.6 (Andrews 2017). The assembly was done using the FabFos pipeline, developed by Connor Morgan-Lang and freely available from the Hallam Lab GitHub account (https://github.com/hallamlab/FabFos). The pipeline performs sequential QC of reads and trimming of the vector backbone (pCC1FOS™) using Trimmomatic (v 0.35) (Bolger et al. 2014), assembly of the trimmed fasta files into contigs using MEGAHIT (v 1.0.6) (Li et al. 2015) and (optional) mapping of end-sequences to assembled contigs using BLAST (blastn) (Figure 3.4). It outputs the assembly statistics, including the major Nx statistics, in a tsv file, using input supplied by the user in a minimum information for fosmid environmental data (MIFFED) file.
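For readers unfamiliar with Nx statistics, the following Python sketch shows the standard calculation: Nx is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches x% of the total assembly length. The function is a generic illustration with hypothetical contig lengths and is not taken from the FabFos code.

```python
def nx(contig_lengths, x=50):
    """Return the Nx statistic (N50 by default) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    threshold = sum(lengths) * x / 100
    running = 0
    for length in lengths:
        running += length
        if running >= threshold:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly of one fosmid into four contigs (lengths in bp).
contigs = [40000, 30000, 20000, 10000]
print(nx(contigs, 50), nx(contigs, 90))
```

A fosmid that assembles into a single contig close to the expected insert size gives an N50 equal to that contig length, so low Nx values relative to the insert size are a quick indication of a fragmented assembly.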

Fosmid sequences were annotated for gene content using MetaPathways v2.5 (Hanson et al. 2014) and compared to the latest curated and annotated versions of the KEGG (Kanehisa et al. 2016), COG (Tatusov et al. 2003), RefSeq (O’Leary et al. 2016), MetaCyc (Caspi et al. 2016) and CAZy (Terrapon et al. 2017) databases using the LAST algorithm. Open reading frames (ORFs) were predicted with Prodigal (Hyatt et al. 2010); the parameters used for ORF prediction and annotation were a minimum length of 60 bp, a minimum bitscore of 20, a minimum (B)LAST score ratio (BSR) of 0.4 and a maximum e-value of 1 × 10⁻⁶. Fosmid gene content was annotated using the results files generated by MetaPathways (MP).
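The thresholds listed above can be expressed as a simple filter over annotation hits. The Python sketch below is illustrative and does not reproduce MetaPathways internals: the `Hit` record and its field names are hypothetical, and the BSR is assumed to be the hit bitscore divided by the bitscore of the ORF aligned against itself.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    orf_id: str
    orf_length_bp: int
    bitscore: float
    self_bitscore: float  # bitscore of the ORF aligned to itself (assumed BSR denominator)
    evalue: float


def passes_thresholds(hit, min_length=60, min_bitscore=20.0,
                      min_bsr=0.4, max_evalue=1e-6):
    """Apply the length, bitscore, BSR and e-value cut-offs to one hit."""
    bsr = hit.bitscore / hit.self_bitscore
    return (hit.orf_length_bp >= min_length
            and hit.bitscore >= min_bitscore
            and bsr >= min_bsr
            and hit.evalue <= max_evalue)


# Hypothetical hits: only the first one satisfies every cut-off.
hits = [
    Hit("orf_1", 900, 250.0, 500.0, 1e-50),
    Hit("orf_2", 45, 30.0, 40.0, 1e-8),     # too short
    Hit("orf_3", 600, 100.0, 400.0, 1e-20),  # BSR of 0.25 is below 0.4
]
kept = [h.orf_id for h in hits if passes_thresholds(h)]
print(kept)
```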

Taxonomic assignment was done using the lowest common ancestor (LCA) algorithm as implemented in MetaPathways with the NCBI taxonomy tree, and the assignment of all ORFs found on a fosmid served as a marker for fosmid taxonomy.
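The idea of assigning a fosmid to the lowest common ancestor of its ORFs can be sketched in Python as the longest shared prefix of the ORF lineages. This is a simplified illustration with hypothetical lineages; the MetaPathways implementation works on the NCBI taxonomy tree and may handle unassigned or conflicting ORFs differently.

```python
def lowest_common_ancestor(lineages):
    """Return the shared root-to-leaf prefix of a list of lineages.

    Each lineage is a list of taxon names ordered from the root downwards.
    The last element of the result is the lowest common ancestor.
    """
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical lineages of three ORFs found on one fosmid.
orf_lineages = [
    ["Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales"],
    ["Bacteria", "Bacteroidetes", "Flavobacteriia", "Flavobacteriales"],
    ["Bacteria", "Bacteroidetes", "Bacteroidia"],
]
print(lowest_common_ancestor(orf_lineages))
```

In this toy case the ORFs agree only down to the phylum, so the fosmid would be assigned to Bacteroidetes rather than to any of the lower ranks.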

Figure 3.4: FabFos pipeline schematic (https://github.com/hallamlab/FabFos)


3.3 Results and Discussion

3.3.1 Metagenomic Library Construction and Host Selection

A functional metagenomic library based on a fosmid vector system was chosen to allow expression of the larger gene clusters required for more complex metabolic pathways (Martínez and Osburne 2013). Although the gene cluster capacity is limited to an upper threshold of ~40 kb by the λ phage packaging system (Epicentre 2010), it is a good compromise between ease of construction and capture of complex gene cluster expression. Phage T1-Resistant TransforMax™ EPI300™-T1R Electrocompetent E. coli was used as the host strain in this study. It has been optimized for heterologous gene expression, and the copy-control pCC1FOS™ system allows tight, inducible control over fosmid copy number, which is relevant to gene expression (Lucigen 2016).

Complex, multi-gene cluster expression systems are also important for the major objective of this study, the discovery of glycoside hydrolase (GH) families. Co-occurrence analysis in the literature has shown that some GH family genes (for instance GH43) are co-localised with carbohydrate binding module (CBM) genes (Mewis et al. 2016) (Figure 3.5 (A)). Co-occurrence of multiple GH families within the same operon has also commonly been observed in previous fosmid-based functional screening work. It is therefore important to capture them together for correct interpretation of the activity of the gene loci. Well-characterized systems like PULs, such as those observed in the cellulolytic bacterial phylum Bacteroidetes, including xyloglucan utilization loci (XyGULs) (Attia et al. 2018), also occur as co-localized gene clusters that encode the enzymes and protein ensembles required for the saccharification of complex carbohydrates (Grondin et al. 2017) (Figure 3.5 (B)). The same is true for bacterial cellulosome systems, which occur as multienzyme complexes composed of structural (scaffoldin) and enzymatic subunits (Artzi et al. 2017). These represent some of the most potent cellulolytic systems in the prokaryotic domain and are the prime targets for the functional screen in this study, thereby necessitating the use of fosmid-based libraries.

Figure 3.5: Co-occurrence and co-localization of glycoside hydrolase (GH) genes as presented in the literature. (A) Heat map showing frequencies of co-occurrence of GH43 subfamily domains with major noncatalytic modules (CBM, carbohydrate binding module; DOC, cellulosomal dockerin domain; X19, conserved noncatalytic module), with subfamilies clustered as per respective HMM profiles (Mewis et al. 2016). (B) Schematic representation of gram-positive polysaccharide utilization loci (gpPULs) concerned with xylan, pectin and arabinogalactan utilization (© Harris et al. 2016)

In this study, high molecular weight DNA (23-40 kb), purified by caesium chloride density-gradient centrifugation, was subjected to random shearing and end-repair. It was then size-separated by pulsed-field gel electrophoresis, and the correct size fraction was excised, digested and ligated to the pCC1FOS™ vector. The ligation mixture was phage packaged, and EPI300™-T1R cells were transfected and plated to create the metagenomic library. Around 15,000 clones were generated, assumed to give sufficient coverage of the pulp and paper sludge metagenome, and stored as a master copy and a working stock. The working stock was subsequently used for high-throughput functional screening.

3.3.2 High-throughput Functional Screening

The fluorogenic substrates used in this study were selected for their high sensitivity, relatively stable half-lives and emission spectra, which allow rapid, high-throughput screening of environmental metagenomic libraries (Chen et al. 2016). This is especially important for increasing the rate of positive clone recovery from functional screening of metagenomic libraries. Pooling substrates also allows several activities to be surveyed in a single screening run (depicted in Figure 3.6 for the substrates used in this study), increasing efficiency and reducing materials cost.

Figure 3.6: Schematic of testing for different cellulolytic activities using the 6-chloro-4-methylumbelliferyl (CMU) glycoside of cellobiose, and the products of enzymatic breakdown that result in fluorescent signal detection


To this end, we chose to pool three distinct substrates: the 6-chloro-4-methylumbelliferyl β-glycosides of cellobiose (CMU-C), xylobiose (CMU-X2) and D-mannose (CMU-Man). It had been previously established that the screening host (E. coli EPI300™) does not catalyse significant turnover of these substrates. This allowed simultaneous testing for endo-acting cellulases, xylanases, or the sequential action of exo-acting β-glucosidases.

To the best of my knowledge, this is the only attempt so far at generating a metagenomic library for functional screening of GH family genes from the pulp and paper primary sludge metagenome. GH activity is one of the most sought-after activities in functional screening of metagenomic libraries, given its direct applicability to industrial lignocellulose biocatalysis (Armstrong et al. 2015).

The initial screening of 15,000 clones yielded around 384 hits with a robust z-score (also referred to here as the sigma value) ≥10. This represents an extremely high hit rate of almost 1 in 39 clones, much higher than values reported for other microbiomes investigated for GH genes, such as an anaerobic bioreactor system (1 in 410) (Mewis et al. 2013), and than those observed in functional screening of in-house metagenomic libraries constructed in the Hallam lab using a previously developed high-throughput screen (Mewis et al. 2011) (Table 3.1). Even at the more conservative 40 sigma cut-off (where the values show a significant inflection away from the other robust z-score values, which cluster together and might confound results), the hit rate of 1 in 517 is well within the range typically observed for metagenomic environments.
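The hit-calling step can be illustrated with a minimal sketch. It assumes the common median/MAD definition of the robust z-score (the exact formula used in the screen is not stated here), and the fluorescence readings and function names are hypothetical.

```python
import statistics


def robust_z_scores(signals):
    """Robust z-score per well: (x - median) / (1.4826 * MAD).

    Assumption: the standard median/MAD form; 1.4826 scales the MAD so
    that it matches the standard deviation for normally distributed data.
    """
    med = statistics.median(signals)
    mad = statistics.median([abs(x - med) for x in signals])
    scale = 1.4826 * mad
    if scale == 0:
        raise ValueError("MAD is zero; robust z-score is undefined")
    return [(x - med) / scale for x in signals]


def call_hits(signals, cutoff=10.0):
    """Return the indices of wells whose robust z-score meets the cutoff."""
    scores = robust_z_scores(signals)
    return [i for i, z in enumerate(scores) if z >= cutoff]


# Hypothetical plate readings (RFU): six background wells and one active clone.
readings = [100, 102, 98, 101, 99, 100, 5000]
print(call_hits(readings, cutoff=10.0))
```

Because the median and MAD are insensitive to a few very bright wells, strongly active clones do not inflate the background estimate the way they would with a mean and standard deviation.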


Figure 3.7: Initial functional screening results with all clones in the pulp and paper sludge (PPS) metagenomic library

Table 3.1: Hit rates of different functional metagenomic libraries screened for glycoside hydrolase genes using a soluble, chromogenic model compound, 2,4-dinitrophenyl cellobioside (DNP-C) (data from Mewis 2016)

Library | Clones | Hits | Hit rate (1 per x clones)
Forest soils | 115584 | 194 | 596
Hydrocarbon | 121728 | 193 | 631
Marine | 53760 | 30 | 1792
Beaver | 44928 | 184 | 244
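The hit rates quoted above follow directly from the clone and hit counts. The short sketch below recomputes them from the Table 3.1 figures and adds the PPS library of this study (15,000 clones, 384 hits at sigma ≥10) for comparison; note that the PPS screen used CMU substrates rather than DNP-C, so the comparison is indicative only. The dictionary and function names are illustrative.

```python
# Clone and hit counts taken from Table 3.1 and from the initial PPS screen.
libraries = {
    "Forest soils": (115584, 194),
    "Hydrocarbon": (121728, 193),
    "Marine": (53760, 30),
    "Beaver": (44928, 184),
    "PPS (this study, sigma >= 10)": (15000, 384),
}


def clones_per_hit(clones, hits):
    """Express the hit rate as '1 hit per x clones', rounded to an integer."""
    return round(clones / hits)


for name, (clones, hits) in libraries.items():
    print(f"{name}: 1 hit per {clones_per_hit(clones, hits)} clones")
```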

From the initial screen, the top 128 clones were picked for validation in triplicate in a 384-well plate format and for deconvolution of activity on CMU-cellobiose, the closest model proxy for cellulose. Owing to method limitations such as plate-edge effects and batch-to-batch variation between screening runs, and to the restriction to activity on cellobiose, this second round of screening led to a marked decrease in the number of significant hits (evaluated at sigma value ≥10), giving a subset of ~29 clones that were investigated further (Figure 3.8).


Figure 3.8: Second round of screening – validation of top-128 hits in triplicate and deconvolution of activity on CMU-cellobioside

These top-29 clones were then re-screened alongside a background control, strain EPI300™ containing the empty pCC1FOS™ vector, and a positive control, the commercial Celluclast (Novozymes) enzyme cocktail, to check potency of activity on CMU-cellobiose. The controls define the spectrum of activity on a relative scale, with the background strain at the low (insignificant) end and Celluclast (Hu et al. 2011) at the high end. Testing was done at two distinct time points to assess reproducibility, and it was gratifying to observe that, within a 5% error, all 13 significant hits except one were recovered. It was also intriguing to see that, at the microlitre scale, several clones outperformed Celluclast at 0.5 and 1 mU of enzyme loading.


Figure 3.9: Reproducibility test of top-29 hits using CMU-cellobiose alongside background control ePCC1FOS and positive control Celluclast enzyme cocktail; inset shows ePCC1FOS values on the two runs (error bars represent 5% error)

This step-by-step approach of eliminating hits to cherry-pick clones was taken to ensure rigour in the selection of clones for direct application to cellulose hydrolysis and downstream bioprocess development for valorization of PPS, the ultimate objective of this work. The screen is reproducible (within the limits of variation in dynamic biological systems) and sensitive, enabling a high rate of GH gene recovery (Figure 3.9).

3.3.3 Fosmid Sequencing and Annotations

The top-13 clones recovered in the section above were sent for next-generation sequencing (NGS) library preparation and sequencing on the Illumina MiSeq platform. Of the 13 fosmids sent, 11 had sufficient coverage to give confidence in assembly and further analysis. The assembly statistics generated using the FabFos pipeline are given in Table 3.2. Fosmid ID is designated as the library plate serial number followed by the well serial number; for example, P04P08 is the fosmid in well location P08 of PPS library plate (PPSLIBM) ID 04. The assemblies were largely successful, yielding few contigs and retaining the major portion of the insert sequence on the largest contig (with the possible exception of sample ID P13D18), giving confidence in using the assembled contigs for functional annotation and taxonomic assignment of the genes in the fosmid inserts.

Table 3.2: Fosmid assembly statistics, generated using the FabFos pipeline and the QUAST online tool

Fosmid ID | Coverage | Length of largest contig (bp) | No. of contigs >1000 bp | Cumulative length (bp) | N50 (bp)
P04P08 | 230 | 28534 | 2 | 37020 | 28534
P08H17 | 535 | 23747 | 2 | 35719 | 23747
P13D18 | 488 | 10810 | 7 | 25611 | 10810
P14I01 | 435 | 35484 | 3 | 41268 | 39228
P14K01 | 389 | 33549 | 1 | 33549 | 33549
P22O04 | 643 | 35319 | 3 | 43400 | 43400
P28E08 | 440 | 32830 | 1 | 32830 | 32830
P29G23 | 528 | 40754 | 1 | 40754 | 40754
P31H11 | 300 | 31614 | 1 | 31614 | 31614
P37P01 | 591 | 23101 | 2 | 31207 | 23101
P38I02 | 244 | 21955 | 3 | 43708 | 21955
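For readers unfamiliar with the N50 statistic reported above, the following minimal sketch shows the usual definition: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly length. The example uses fosmid P04P08 and assumes that its two contigs account for the whole cumulative length, so the second contig length (8486 bp) is inferred rather than reported.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths in bp."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs supplied")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that carries the running total past 50% is the N50.
        if running >= half_total:
            return length


# P04P08: largest contig 28534 bp; second contig inferred as 37020 - 28534.
print(n50([28534, 8486]))
```

An N50 equal to the largest contig, as seen for most fosmids here, indicates that a single contig carries at least half of the assembled sequence.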

MetaPathways v2.5 (MP) (Hanson et al. 2014), with in-house updates, was used to annotate the concatenated input FASTA file comprising all fosmid contigs against curated and annotated databases, as described in Materials and Methods. The focus was to assess CAZy annotations to validate the empirical observations of functional activity. For taxonomic assessment of genes, all fosmid FASTA files were also run separately through MP, including RefSeq annotation for LCA taxonomic tree construction of annotated genes. The results presented in Table 3.3 therefore differ from Figure 3.11 because gene annotation frequencies change between the two kinds of run (concatenated vs separate).

A total of 327 translated ORFs were predicted across all fosmids following QC, with 39 hits in the CAZy database (8%). All of them, except one hit in a carbohydrate esterase (CE) family, belonged to glycoside hydrolase (GH) families (Figure 3.10). The majority of hits across the COG and KEGG databases (which yielded the most ORF annotations) were related to carbohydrate transport and metabolism. A comprehensive interpretation of all these database annotations is beyond the scope of this study. However, these annotations foster confidence in the possibility of discovering genes that would not only enable hydrolysis of lignocellulosic polymers but also help with uptake of these polymers in a heterologous host, specifically carbohydrate ABC transporter membrane proteins, sugar transporter permeases, glycosylation-related proteins (sugar kinases) and accessory activities such as sugar isomerases. Lignocellulose hydrolysis requires synergy between different activities, as discussed before, and these genes can be used to engineer a biological system that can lead to consolidated bioprocessing.


Figure 3.10: Percentage breakdown of MetaPathways annotations of fosmid ORFs with focus on CAZy annotations

Figure 3.11: Fosmid linked genomic map. Each line represents a fosmid clone, with some fosmids represented by multiple contigs. Each predicted gene is represented by an arrow showing the direction of transcription. Grey links connect homologous proteins with e-value ≤1e-10 (figure in collaboration with Kateryna Ievdokymenko)


The MetaPathways output was then fed into a custom Python script developed by Ivan Minevskiy in collaboration with Kateryna Ievdokymenko (https://github.com/minevskiy/bioinformatics/tree/master/genomic-map-with-links) to visualise ORFs and determine protein homology between the gene sequences of the different fosmid clones using BLASTP 2.2.2.28 (e-value ≤1.00e-10). The output was retouched and coloured using Adobe Illustrator CS6 (Figure 3.11) to depict the different glycoside hydrolase family genes.
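The e-value filter behind the homology links can be sketched as follows. This is not the script referenced above; it is a minimal illustration that assumes BLASTP results are available in the default 12-column tabular layout (-outfmt 6 in BLAST+), and the ORF identifiers in the example are hypothetical.

```python
# Column order of the default BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def homology_links(blast_tab_text, max_evalue=1e-10):
    """Return (query, subject, evalue) for non-self hits with evalue <= cutoff."""
    links = []
    for line in blast_tab_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = dict(zip(BLAST_COLUMNS, line.split("\t")))
        if fields["qseqid"] == fields["sseqid"]:
            continue  # a protein always matches itself; not a link
        evalue = float(fields["evalue"])
        if evalue <= max_evalue:
            links.append((fields["qseqid"], fields["sseqid"], evalue))
    return links


# Hypothetical hits: one strong cross-fosmid match, one self hit, one weak match.
example = "\n".join([
    "\t".join(["P14K01_orf5", "P31H11_orf7", "62.1", "340", "120", "3",
               "1", "338", "4", "341", "2e-120", "410"]),
    "\t".join(["P14K01_orf5", "P14K01_orf5", "100.0", "340", "0", "0",
               "1", "340", "1", "340", "0.0", "700"]),
    "\t".join(["P04P08_orf2", "P28E08_orf9", "28.4", "95", "60", "4",
               "10", "104", "22", "116", "3e-04", "38.1"]),
])
print(homology_links(example))
```

Only the first hit survives the filter, which corresponds to a single grey link between two fosmids in a map such as Figure 3.11.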

Each fosmid gene cassette (shown in Figure 3.11) harbours at least one GH locus, which explains and validates the functional screening result. The fosmid gene cassettes were quite distinct from each other. For fosmids where only one or two significant GH ORFs are present, it is simple to draw conclusions about the GH family responsible for the observed activity. This is the case for fosmids P04P08, P14I01 and P08H17, and interestingly the first two were the hits with the highest RFU signal during screening (Figure 3.9). For the other fosmids, however, interpreting which gene locus or loci are active is more convoluted, given the co-occurrence and co-localization of several GH families with each other.

GH 3, GH 43 and GH 10 are equally the most abundant GH family groups in the fosmids. No endo-glucanase groups were detected on these fosmids, potentially indicating that the observed activity was a result of exo-acting glucanases or β-glucosidases. However, endo-xylanase groups were detected (GH 43, GH 10, GH 30) and were found to be frequently co-localized. These families have been commonly observed to co-occur on PULs, especially XyGULs (Grondin et al. 2017; Attia et al. 2018), across different groups of bacteria. For example, several groups within the phyla Bacteroidetes (the model group for PUL studies) (Larsbrink et al. 2014) and Firmicutes (Harris et al. 2016), which are well-established degraders of organic matter, contain these XyGULs and can degrade the xylan polymers in hemicellulose. From the perspective of biohydrolysis, this is an important function that makes cellulose more accessible and generates total reducing C5-C6 sugars that might be used further downstream as carbon sources. For these loci, synteny was also observed with GH 3 and GH 62 families (Figure 3.11), and the former might explain the observed activity on the CMU-C substrate.

GH3 is a large family that includes exo-acting β-D-glucosidases and β-D-glucan glucohydrolases, α-L-arabinofuranosidases, β-D-xylopyranosidases, N-acetyl-β-D-glucosaminidases, and N-acetyl-β-D-glucosaminide phosphorylases. The enzyme activities span cellulosic biomass degradation, plant and bacterial cell wall remodelling, energy metabolism and defence mechanisms. The enzymes are also known to be promiscuous in action, with dual or broad substrate specificities with respect to monosaccharide residues, linkage position and chain length of the substrate (Fincher et al. 2017).

The taxonomic assignment of the fosmid genes is presented in Table 3.3. Although the fosmid insert ideally contains sufficient taxonomic resolution to constrain the taxonomy of donor genotypes, functional screening often recovers active clones with mixed heritage, consistent with horizontal gene transfer in the environment. Especially for large groups such as Proteobacteria, confidently assigning taxonomy to the fosmid is difficult, as observed in this study. Some of the genes of interest that could not be assigned through the LCA algorithm, which uses RefSeq and NCBI taxonomic tree information within MP, are indicated below. These need further investigation and are candidates for novel phylogenetic placements or candidate phylum discovery. This can be done through MEGAN (Huson et al. 2016), MLTreeMap (Stark et al. 2010) or other suitable software tailored for metagenomic sequence information (both raw and assembled sequences). The individual protein sequences can also be searched for homology-based taxonomic assignment in protein-specific databases such as Pfam (Bateman et al. 2004) or the Universal Protein Resource (UniProt) (Wu et al. 2006). The increased presence of co-localised XyGULs on the fosmid sequences with ambiguous taxonomic results may also point to assembly artefacts that hinder taxonomic assignment through missing sequence information. These loci could be used as a composite query in the PUL database (Terrapon et al. 2015) to predict the taxonomy of these conserved regions. The ambiguity also presents an opportunity to investigate in detail the GH families that are missing taxonomic assignments and to place them in enzyme-family-specific phylogenetic trees created using RAxML (Stamatakis 2014) or other suitable software.
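The idea behind lowest common ancestor (LCA) assignment can be illustrated with a small sketch. It is not the MetaPathways implementation: it simply takes root-to-tip lineages for the reference hits of one ORF and returns the deepest taxon they all share. The lineages below are simplified and given for illustration only.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-tip lineages."""
    if not lineages:
        raise ValueError("at least one lineage is required")
    shared = []
    # zip stops at the shortest lineage, so ranks are compared level by level.
    for taxa_at_rank in zip(*lineages):
        if all(taxon == taxa_at_rank[0] for taxon in taxa_at_rank):
            shared.append(taxa_at_rank[0])
        else:
            break
    return shared[-1] if shared else "root"


# Simplified, illustrative lineages (domain, phylum, class, order).
rhodocyclales = ["Bacteria", "Proteobacteria", "Betaproteobacteria", "Rhodocyclales"]
burkholderiales = ["Bacteria", "Proteobacteria", "Betaproteobacteria", "Burkholderiales"]
xanthomonadales = ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Xanthomonadales"]

print(lowest_common_ancestor([rhodocyclales, burkholderiales]))
print(lowest_common_ancestor([rhodocyclales, burkholderiales, xanthomonadales]))
```

Hits split between two orders of the same class resolve only to that class, and adding a hit from another class pushes the assignment up to the phylum, which is why mixed-heritage inserts tend to receive shallow assignments.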

Table 3.3: Taxonomic assignment of ORFs across fosmids. ORF taxonomic annotation was done through the LCA algorithm implemented using the NCBI taxonomy tree in the MetaPathways pipeline (in cases of multiple GH loci from individual fosmid annotations, the annotation for ≥50% of instances was reported)

Fosmid ID | Taxonomic group represented in ≥50% of ORFs | Taxonomic rank and phylum information | MP taxonomic assignment of GH gene(s) | Notes
P04P08 | Chloroflexia | Class | GH 3: Prokaryotes | GH 3 gene not assigned
P08H17 | Caldilinea aerophila strain STL-6-01 | Strain; Chloroflexi | GH 1: Caldilinea aerophila strain STL-6-01 |
P13D18 | Microbacteriaceae | Family; Actinobacteria | GH 3: Microbacteriaceae; CE 7: Microbacteriaceae; GH 20: Microbacteriaceae |
P14I01 | Caldilinea aerophila strain STL-6-01 | Strain; Chloroflexi | GH 1: Prokaryotes; GH 13: Caldilinea aerophila strain STL-6-01 | GH 1 gene not assigned
P14K01 | Betaproteobacteria | Class; Proteobacteria | GH 30: Gammaproteobacteria; CE 1: Burkholderiales; GH 62: Cellvibrionaceae; GH 10: Uliginosibacterium strain 5YN10-9; GH 43: Betaproteobacteria | Equal assignments to orders Rhodocyclales and Burkholderiales
P22O04 | Caldilinea aerophila strain STL-6-01 | Strain; Chloroflexi | GH 92 and GH 1: Caldilinea aerophila strain STL-6-01 |
P28E08 | Bacteroidales | Order; Bacteroidetes | GH 3: Porphyromonadaceae; GH 10: Bacteroidales; GH 43: Bacteroides; GH 67: Porphyromonadaceae |
P29G23 | Caldilinea aerophila strain STL-6-01 | Strain; Chloroflexi | GH 3: Chloroflexi; GH 29: Caldilinea aerophila strain STL-6-01 |
P31H11 | Betaproteobacteria | Class; Proteobacteria | GH 30: Xanthomonas; GH 43: Betaproteobacteria; GH 10: Uliginosibacterium; GH 62: Gammaproteobacteria | Presence of both beta and gamma sub-groups of Proteobacteria
P37P01 | Prokaryotes | Domain | GH 2: Ktedonobacter; GH 3: Proteobacteria; GH 88: Proteobacteria | The fosmid shows different phyla; one GH 3 ORF not specifically assigned
P38I02 | Rhodocyclales | Order; Proteobacteria | GH 10: Uliginosibacterium strain 5YN10-9; GH 30 and GH 43: Proteobacteria; GH 62: gamma Proteobacteria | Presence of Burkholderiales group

[Bar chart: percentage assignment across fosmid ORFs (y-axis, 0 to 40) for Chloroflexi, Bacteroidetes, Actinobacteria and Proteobacteria]

Figure 3.12: Taxonomic distribution across sequenced fosmids (taxonomy assigned based on LCA assignment at phylum level represented in ≥50% of ORFs for each fosmid assembly)

Comparing Figure 3.12 with Figure 2.9 (taxonomic distribution of the whole metagenome) makes evident the disparity between the functional gene abundance of active clones obtained through functional screening and whole-metagenome taxonomic assessments. As with the binning taxonomic distribution results (Figure 2.8), this may point to the importance of certain phyla that are not very abundant in the environment but play a crucial role as the primary degraders of organic matter. Interestingly, the Chloroflexi species with the highest number of ORFs assigned to it, Caldilinea aerophila (Sekiguchi et al. 2003), is a thermophile commonly found in anaerobic granular sludge and wastewater treatment bioreactors, so its presence in the PPS environment is not surprising. However, it does not have the highest representation in the taxonomy of the GH ORFs (Table 3.3). Members of the Chloroflexi are present in environments similar to PPS, but they are not usually the phylum with the most predominant GH activities. These activities are attributed more to Bacteroidetes, Firmicutes and Actinobacteria (Wang et al. 2016; Berlemont and Martiny 2016; Berlemont 2017). There is, however, some evidence for organic matter degradation by Chloroflexi in marine environments (Jessen et al. 2017; Landry et al. 2017), and a patent on an undisclosed thermophilic Chloroflexi-like organism capable of degrading cellulose (Stott et al. 2009).

In light of the findings from this study and the literature, it is difficult to draw definite conclusions about the disparity between in silico and functional taxonomic distributions. The disparity may allude to the need to evaluate biases in environmental DNA capture for metagenomic library preparation and in substrate choice for functional screening, and potentially to conduct metatranscriptomic studies on the environment to determine which taxa are functionally most active for GH activity.

3.4 Conclusion

The fosmid clones obtained through functional metagenomic screening mostly show exo-acting β-glucosidase activities. There is also an interesting enrichment of XyGULs and endo-xylanase activities in the screened clones, which might be attributed to the promiscuity of the GH families present within these groups. Transposon mutagenesis can be used to assign specific GH loci to observed function. Taxonomically, more investigation is needed to confidently assign the fosmid genes and map them back to the original environment. The findings presented are within the boundaries of the functional screening paradigm set up in this study and therefore suffer from the general limitations of functional metagenomic screen design, including choice of substrate, expression host and batch-to-batch variation. Upstream of library preparation, techniques such as SIP-based DNA enrichment focused on cellulose hydrolysis function can also be used to narrow down screening efforts. There is a need to couple functional screening with high-throughput biochemical characterization of clone products to truly translate these findings into bioprocessing applications with lignocellulosic substrates.


CHAPTER 4 Function to Application – Consolidated Bioprocessing

4.1 Background

4.1.1 Coming Full Circle - design and implementation of consolidated bioprocessing

The term consolidated bioprocessing (CBP) is often used in the literature in a narrow sense that is specific to lignocellulosic biomass processing for biofuel production. However, a closer analysis reveals it to be a more general process design concept concerned with the consolidation, or merging, of different unit processes into “one-pot” processing. This design makes the process very modular and easy to retrofit to other existing bioprocesses. If the objective of process integration, or of any similar brown-field operation, is to utilise waste bio-streams for making value-added products, then this process design can be implemented with ease to close the loop around the bioprocesses.

A closed-loop, circular bioprocess schematic for utilization of the feedstock in this study, pulp and paper sludge (PPS), is depicted in Figure 4.1. The smaller circle represents direct application of PPS hydrolysis products to the paper industry as strength additives in pulp making. The bigger circle shows the possibility of consolidation with other engineered bioprocesses that produce bioproducts and biofuels, which can potentially be applied in the paper industry or represent alternative revenue streams for the industry through integration of on-site bioproduction facilities.


Figure 4.1: Schematic of proposed circular, consolidated processes using paper sludge feedstock as direct (smaller circle) and indirect (bigger circle) applications to the paper industry

Consolidation of unit processes is especially beneficial for processes concerned with lignocellulosic biomass refining, because the complex nature of the biomass entails several separate pre-treatment and processing steps. The consolidation of several steps in comparison to other fermentation strategies for bioprocessing is depicted in Figure 4.2. As per an estimate by Bayer et al., the costs of feedstock, enzyme, and pre-treatment account for about two-thirds of the total production cost, of which the enzyme cost is the largest (Bayer et al. 2007). This is where CBP design has its greatest impact, through both microbial platform or organism engineering (Parisutham et al. 2014) and in-situ production of enzymes.


Figure 4.2: Different bioprocessing strategies available for the conversion of lignocellulosic biomass to bioalcohols. Abbreviations: SHF, separate hydrolysis and fermentation; SHCF, separate hydrolysis and co-fermentation; SSF, simultaneous saccharification and fermentation; SSCF, simultaneous saccharification and co-fermentation; CBP, consolidated bioprocessing (Salehi Jouzani and Taherzadeh 2015)

CBP design for lignocellulosic degradation is done through two major approaches: native and recombinant (Olson et al. 2012; Kricka et al. 2014).

1. Native strategy: Using organisms that are naturally cellulolytic to generate the desired bioproduct. These can include fungi and bacteria (including both free-enzyme systems and cellulosomes).

2. Recombinant strategy: Genetically engineering all processes from scratch in one organism or a consortium. This can prove beneficial, and almost necessary, if the objective is to utilise both C5 and C6 sugars, which cannot be done through native systems.

That no industrial-scale demonstration of a CBP process has been done to date may reflect a need to improve upon biological host design. Precise genetic circuitry control that would allow temporal gene expression and help control the release of enzyme according to the stage of the process (Committee on Industrialization of Biology et al. 2015) is needed. The metagenomic functional genes that are integrated into bioprocess development should also be optimized within this framework. Functional metagenomics has led to the discovery of several functional genes (especially glycoside hydrolases) that can potentially be used to engineer CBP organisms (Maruthamuthu et al. 2016; Tiwari et al. 2018). Sommer et al. have shown its application in expanding the synthetic biology toolbox, in the context of increased tolerance to inhibitory compounds arising from lignin degradation in biomass (Sommer et al. 2010).

4.1.2 Biochemical Assays for Cellulose Hydrolysis Kinetics

Biochemical assay design requires the use of pure substrates to establish activity values and other kinetic parameters such as kcat (turnover number) and Km (Michaelis constant), as well as the optimum temperature and pH of enzyme activity. This makes it challenging to screen enzymes directly on complex substrates such as waste lignocellulosic biomass on an “as-received basis”, where there is interference from other components. Signal must also be detected in a turbid medium, and potential substrate inaccessibility hampers activity detection. This is in direct conflict with the need to develop enzymes that act on these complex substrates.
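How kcat and Km relate to a measured rate can be shown with the Michaelis-Menten equation, v = kcat·[E]·[S] / (Km + [S]). The sketch below is a generic illustration of that relationship and not an analysis performed in this work; the parameter values and units are hypothetical.

```python
def michaelis_menten_rate(substrate, enzyme, kcat, km):
    """Initial reaction rate v = kcat * [E] * [S] / (Km + [S]).

    substrate and km share one concentration unit; enzyme is a
    concentration and kcat is per unit time, so v is concentration/time.
    """
    if substrate < 0 or enzyme < 0 or kcat < 0 or km <= 0:
        raise ValueError("concentrations and constants must be non-negative, Km > 0")
    return kcat * enzyme * substrate / (km + substrate)


# Hypothetical enzyme: kcat = 10 per second, Km = 2 mM, [E] = 1 (arbitrary unit).
for s in (0.5, 2.0, 20.0, 200.0):
    print(s, michaelis_menten_rate(s, enzyme=1.0, kcat=10.0, km=2.0))
```

At a substrate concentration equal to Km the rate is half of its maximum (kcat·[E]), which is why Km is read as the substrate concentration giving half-maximal velocity rather than as an inhibition constant.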

Dashtban et al. have reviewed the different assays that are traditionally used for cellulase activity detection (Dashtban et al. 2010). Broadly, detecting cellulase activity can be done using three major approaches:

1. Assays in which the accumulation of products after hydrolysis is measured (e.g. assays that measure reducing sugar content, such as the 3,5-dinitrosalicylic acid (DNS), glucose oxidase (GO) or glucose hexokinase (GHK) assays).

2. Assays in which the reduction in substrate quantity is monitored (these range from simple mass difference measurements to sophisticated bio-analytical techniques using size-exclusion chromatography to determine the quantity of oligosaccharides released).

3. Assays in which the change in the physical properties of the substrate is measured (microscopic analysis of surface characteristics using scanning electron microscopy and/or fibre staining).

4.1.3 Matching the Scales of Discovery and Application - what does functional metagenomics need to go all the way?

The power of discovery can be best realized through application.

Despite the potency of functional metagenomics in unveiling unique biocatalysts and novel metabolic pathways from different environmental microbiomes, several questions have recently been raised as to how far it has fulfilled its promise of delivering biocatalysts (Bergholz et al. 2014; Ferrer et al. 2016) that would make biochemical processing more sustainable and environmentally benign. This is indeed difficult to assess quantitatively, which may result in part from a lack of application-based studies in metagenomics combined with the reluctance of industries to share information about proprietary strains and bioprocesses. The terms bioprospecting and biorefining are often used in the experimental metagenomics literature (Strachan et al. 2014; Zhang et al. 2016a). However, the inherent connotation of commercialization (Timmermans 2001) that comes with them has been left unaddressed by many of these studies.


A closer look at the history of the field and its discovery-focused structure might answer this question. In 2006, the National Academic Research Press published the key challenges faced by metagenomics as a new field, as outlined by the Global Microbiome Initiative (Council 2007a). These challenges, however, stop at optimising the discovery of functional genes coding for putative biocatalytic functions. The logical next step is the optimization of the discovered functional clones for bioprocess applications, to allow true “translation” of biocatalytic products into industry or commercial ventures. The seventh challenge in this sequence would be to integrate metagenomics with bioprocessing.

A major challenge faced by functional discovery and application is the inability of biochemical characterization to keep pace with the generation and analysis of metagenomic sequence data. Recently, an in silico approach using an iterative hidden Markov model (HMM) has been proposed for narrowing the functional hits from a metagenomic library down to a reasonable number of protein candidates for experimental characterization and validation of function, without any significant loss of information (Kusnezowa and Leichert 2017). However, there is as yet no literature on an experimental high-throughput biochemical characterization or assay platform that can be coupled directly with functionally discovered metagenomic clones, especially one applicable to detecting glycoside hydrolase (GH) activities. Microfluidics and microarray approaches can be used to partially overcome these technical hurdles (Abot et al. 2016).

Downstream biochemical characterization of enzymes and optimization of clones should be informed by real process conditions for the intended application. Some aspects that should be given importance when designing these experiments are (Prather 2004):

1. Genetic or genomic engineering to maximize substrate flux and approach the maximum theoretical yield of the enzyme product as closely as possible.

2. Using reference enzyme homolog models to guide protein engineering applications to include synthesis of co-factors or co-enzymes that might be missing from the environmental catalytic code captured through metagenomics.

3. Stoichiometric analysis of the feedstock to product process to determine the yield needed for an economically viable process and this should set the parameters for enzyme kinetic performance.

4.2 Materials and Methods

4.2.1 Biomass Compositional Analysis

Sampling of biomass has been previously described in section 2.2.1.

The pre-treatment of biomass prior to compositional analysis (on a dry basis) was done in accordance with the National Renewable Energy Laboratory (NREL) Laboratory Analytical Procedure (LAP) for preparation of biomass for compositional analysis (Hames et al. 2008). The biomass was dried to a constant weight by incubation for around 72 hours at 45 °C in an incubator. The moisture content was determined after the dry PPS reached a constant weight. It was then ground using a Wiley laboratory knife mill with a 40-mesh sieve. The dry ground PPS was also fractionated using a sieve shaker, and the +20/-80 fraction was retained for analysis (Figure 4.3). This was done to remove excess ash in the sludge, which has been shown previously to inhibit hydrolysis by changing the pH of the hydrolysate (Gurram et al. 2015).
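The moisture determination amounts to a simple mass balance, sketched below. The calculation assumes a wet-basis definition (mass lost on drying divided by the initial wet mass); the masses in the example are hypothetical and are not measurements from this work.

```python
def moisture_content_percent(wet_mass, dry_mass):
    """Moisture content on a wet basis: 100 * (wet - dry) / wet."""
    if wet_mass <= 0:
        raise ValueError("wet mass must be positive")
    if not 0 <= dry_mass <= wet_mass:
        raise ValueError("dry mass must lie between 0 and the wet mass")
    return 100 * (wet_mass - dry_mass) / wet_mass


def total_solids_percent(wet_mass, dry_mass):
    """Total solids are the complement of the wet-basis moisture content."""
    return 100 - moisture_content_percent(wet_mass, dry_mass)


# Hypothetical sample: 100 g of wet sludge cake dried to a constant 35 g.
print(moisture_content_percent(100.0, 35.0))
print(total_solids_percent(100.0, 35.0))
```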


Figure 4.3: Pulp and paper sludge (PPS) feedstock (left to right): (a) wet PPS cakes obtained after filtration of water content; (b) dry PPS (constant weight); (c) dried, milled and sieved PPS

The extractive content of the biomass was also determined prior to compositional analysis, and the biomass was subsequently analysed on an “extractives-free” basis. The analysis was done using a Dionex ASE 350 Accelerated Solvent Extractor, and the % extractives in the biomass was determined according to the NREL LAP “Determination of Extractives in Biomass” (Sluiter et al. 2008). The biomass analysis was conducted using a modified Klason method based on the Technical Association of the Pulp and Paper Industry (TAPPI) standard compositional analysis protocol T222om-88 (TAPPI 2006). Briefly, in this method the composition of pre-treated (dried and milled) woody biomass is determined using acid hydrolysis. The structural carbohydrates are determined through hydrolysis and solubilization by sulfuric acid; acid-soluble and acid-insoluble lignin and the ash content are determined simultaneously.

4.2.2 Detection of Hydrolytic Activity using Colorimetric Assay

The colorimetric assay used for the detection of cellulolytic activity of positive fosmid clone lysates was developed by Ferrari et al. (2014). The assay uses a mutant oxidase (chito-oligosaccharide oxidase; designated as ChitO-Q268R) engineered and produced in E. coli (E. coli ORIGAMI2 DE3) that releases hydrogen peroxide upon the oxidation of cellulase-produced hydrolytic products (oligomers and monomers). The hydrogen peroxide (H2O2) produced is then monitored using a horseradish peroxidase (HRP) mediated reaction in which HRP uses the released H2O2 to convert 4-aminoantipyrine (AAP) and 3,5-dichloro-2-hydroxybenzenesulfonic acid (DCHBS) into a stable pink compound (Figure 4.4).

Figure 4.4: Schematic of the colorimetric assay used for cellulolytic activity detection

This assay was chosen as it presents a fast, sensitive method to detect cellulolytic activity without the need for any acid hydrolysis or boiling, as is used in the 3,5-dinitrosalicylic acid (DNS) method. It has been tested previously on complex lignocellulosic substrates (like wheat straw) (Ferrari et al. 2014) and, being colorimetric, allows rapid detection of cellulolytic activity even in coloured or turbid media. The assay is also independent of the need for reducing sugar production (i.e. only glucose) as the mutant enzyme is capable of oxidizing different oligosaccharides that result from breakdown of cellulose (glucose, cellobiose, cellotriose, cellotetraose). The chito-oligosaccharide oxidase used in this study was kindly provided by Dr. Marco W Fraaije and Dr. Alessandro Ferrari from the Molecular Enzymology Group, University of Groningen, Netherlands. It was in the form of a lyophilized powder stored at -20 °C. HRP (Sigma, 179.2 U/mg), 4-AAP (Acros Organics, 98%) and DCHBS (Alfa Aesar, 99%) were purchased from different vendors.

For detection of cellulolytic activity in a high-throughput format, filter paper disks (Whatman filter paper grade 1; 0.5 cm in diameter) were punched, deposited at the bottom of a 96-well plate and used as substrate (containing 95% crystalline α-cellulose). The disks were incubated overnight with 200 µL of the enzyme mixture (either fosmid lysates or purified proteins as described in the next section). The plate was briefly centrifuged (2000g, 5 min) and 100 µL of the supernatant was transferred to another well and supplemented with 100 µL of the reaction mixture. The plate was immediately set for continuous read measurement at 515 nm using a Bio-Tek Synergy H1 Hybrid Multi-Mode Reader and readings were taken for up to 72 hours. Development of the coloured product was monitored by plotting absorbance values against time and was correlated to cellulolytic activity by comparison with the positive control, the Celluclast enzyme cocktail, which was kindly provided by the Saddler lab group (Hu et al. 2011). The EPI300 strain with the empty vector pCC1FOS™ (“ePCC1FOS”) was used as the background control, BSA as the protein additive control (in addition/replacement tests with Celluclast) and the assay mixture as blank.
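The plate-reader time courses can be reduced to net absorbance values by subtracting the assay-mixture blank at each time point, which is how the net values reported later for the assay are described. The sketch below is illustrative only: the function names and all absorbance readings are hypothetical, and time-matched blank subtraction is assumed.

```python
def blank_subtract(readings, blank):
    """Return time-matched net A515 values (reading minus assay-mixture blank)."""
    if len(readings) != len(blank):
        raise ValueError("readings and blank must cover the same time points")
    return [r - b for r, b in zip(readings, blank)]


def final_net_signals(wells, blank):
    """Net absorbance at the last time point for each named well."""
    return {name: blank_subtract(values, blank)[-1] for name, values in wells.items()}


# Hypothetical A515 time courses (illustrative only, not data from this study).
blank = [0.05, 0.05, 0.06]
wells = {
    "Celluclast": [0.06, 0.15, 0.24],    # positive control
    "ePCC1FOS": [0.05, 0.06, 0.07],      # background control (empty vector)
    "fosmid clone": [0.05, 0.06, 0.08],  # test lysate
}
for name, net in final_net_signals(wells, blank).items():
    print(f"{name}: net A515 = {net:.2f}")
```

Comparing the net signal of each clone against the empty-vector control, rather than against the blank alone, separates host background from any activity contributed by the fosmid insert.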


4.2.3 Different Systems for Testing Cellulolytic Activity

Tests for cellulolytic activity of the positive fosmid clones noted in Table 3.3, section 3.3.3, were done using three different protein sources (please note that “fosmids” from this point forward refers to the positive clones). The order in which these systems are listed reflects their ease of implementation for cellulolytic assaying, with the objective of reducing processing steps at scale-up. The protein lysis methods used are derived from established protein purification protocols (Burgess and Deutscher 2009):

1. OD600 normalized fosmid whole cell lysates: 5 mL fosmid cultures and ePCC1FOS were grown for 24 hours following induction with arabinose at inoculation. 1 mL of each culture was normalized following OD600 measurements by diluting with LB to get the lowest OD value across all samples. 100 µL of culture was supplied to each well of a 96-well plate with 100 µL lysis mixture (as specified in section 3.2.2) and incubated with filter paper overnight. 100 µL of the centrifuged supernatant was used for the assay the next day.

2. Fosmid whole cell lysate and 50X concentrated culture supernatant protein fractions: 50 mL fosmid cell cultures were grown for 24 hours following induction with arabinose at inoculation. The whole cell lysate protein fraction was obtained by a combination of enzymatic and freeze-thaw based lysis. Briefly, cells were harvested by pelleting at 3000g (20 min) and the supernatant was stored separately at 4 °C prior to processing. 3 mL lysis buffer (50 mM Tris pH 8.0, 0.1% Triton X-100, 10% glycerol, 300 mM NaCl) supplemented with 100 µL lysozyme and 25 µL PMSF was used to resuspend the pellets from the 50 mL culture. This was followed by incubation at 37 °C for 30 min. The cultures were then flash frozen in liquid nitrogen for 3 min, thawed at 42 °C in a water-bath and pulse vortexed. This was repeated 3 times. Finally, the cell debris was separated out by spinning at 15000 rpm at 4 °C for 20 min. DNAse and RNAse (Molecular Probes, Invitrogen) were added to the supernatant if it was too viscous. The lysate fraction was concentrated 3X using an Amicon filter (Merck, Millipore) with a 10 kDa cut-off. The supernatant fraction was concentrated 10X and buffer exchanged into lysis buffer using an Amicon filter as with the lysate. The protein fractions were measured for concentration using the Pierce™ bicinchoninic acid (BCA) protein assay kit (ThermoFisher Scientific) and then visualised using Coomassie Blue staining after SDS-PAGE.

3. Sub-cloned GH genes from fosmids over-expressed in E. coli BL21(DE3) strain: Following sequencing of fosmids and gene annotation in MetaPathways, the top two hits from CMU-C2 tests (section 3.3.2) harbouring a GH locus on their contigs were selected for sub-cloning (P04P08: GH3, P14I01: GH1). Details of the gene sequence, primer design and sub-cloning protocols are given in Appendix A. The sub-cloned fosmid genes were expressed in the BL21(DE3) E. coli strain transformed with the pET-21a(+) vector and induced using IPTG at OD600 = 0.6. Following induction, they were grown at 30 °C. Untransformed BL21(DE3) cells were also included as negative controls. The cells were lysed to obtain proteins using a combination of enzymatic (lysozyme) and mechanical (probe sonication) lysis. Briefly, 3 mL of the same lysis buffer (composition same as before) was used to resuspend the cells. This was followed by incubation on ice for 30 min. The cultures were then subjected to probe sonication by pulsing for 15 s followed by ice-incubation for 30 s (total 10 cycles per sample). The proteins were then concentrated, quantified and visualised as explained before. They were tested for activity using CMU substrates as explained in the chapter 3 methods.
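The OD600 normalization used for the first system (diluting each 1 mL culture with LB down to the lowest OD600 across samples) follows from C1V1 = C2V2. The sketch below is a minimal illustration, assuming OD600 scales linearly with cell density over the measured range; the function name and the OD600 readings are hypothetical.

```python
def lb_volumes_for_normalization(od600, culture_volume_ml=1.0):
    """Volume of LB (mL) to add to each culture so that all reach the lowest OD600.

    From C1*V1 = C2*V2: final volume = V1 * OD / OD_min, so LB added = V1 * (OD / OD_min - 1).
    """
    target = min(od600.values())
    if target <= 0:
        raise ValueError("OD600 values must be positive")
    return {name: culture_volume_ml * (od / target - 1.0) for name, od in od600.items()}


# Hypothetical OD600 readings (illustrative only).
readings = {"ePCC1FOS": 2.0, "clone A": 3.0, "clone B": 2.5}
for name, volume in lb_volumes_for_normalization(readings).items():
    print(f"{name}: add {volume:.2f} mL LB to 1 mL culture")
```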

4.2.4 Bench-scale Bioprocess Development

The fosmid clones were tested for their hydrolytic potential on PPS in 100 mL reactions using Erlenmeyer flasks and monitored for 72 hours with 2 mL sample withdrawal at definite intervals. The flasks were supplemented with 10% inoculum at T0 and whole-cell hydrolysis was carried out without addition of any lysing reagent to test the feasibility of whole-cell biocatalysis and the use of PPS as growth medium to induce cellulase production. Each flask had 2.5% solids loading (dried and milled sludge) and was supplemented with 5 g/L peptone as N-source (given the negligible N-content in PPS). The pH was measured before inoculation and adjusted if necessary to be near the natural pH of PPS (7.1-7.3 ±0.08). The media was then autoclaved and inoculated with 10% (v/v) inoculum from a 10 mL seed culture grown for 24 hours prior to the start of biohydrolysis. One blank control was also included along with positive Celluclast controls, and the incubation was done at 37 °C and 250 rpm. Each reaction was set up in duplicate (Figure 4.5). pH measurements were made after stopping the incubation at T0 + 72 hours.

Figure 4.5: Experimental set-up for bio-hydrolysis


The collected samples were stored at 4 °C prior to processing. They were centrifuged at high speed in a bench top centrifuge at 4 °C to pellet the solid fraction and the clear supernatant was passed through a 0.2 µm filter and stored for sugar analysis. The samples were then analysed for crude glucose content using glucose oxidase membrane-based detection (YSI 2300 STAT Plus Glucose Lactate Analyzer, Marshall Scientific) to select samples for further analysis using HPLC. The sugar content of the selected samples was then analysed using HPLC (Dionex DX-3000, Sunnyvale, CA) using a Dionex PA1 column equipped with a pulsed amperometric detector and autosampler (Dionex). The column was equilibrated with 0.25 mM NaOH and eluted with pure water at a flowrate of 0.8 mL/min. L-arabinose, D-galactose, D-glucose, D-xylose and D-mannose were used as calibration standards and fucose as an internal standard. The standards were prepared in five different concentrations for each sugar to cover the estimated range in the samples. Each sample was injected in triplicate using the auto-sampler set-up.
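Quantification against a five-point calibration with an internal standard is commonly done by regressing the analyte-to-internal-standard peak area ratio on the standard concentration. The sketch below assumes that approach; it is not taken from the instrument software used here, the helper names are hypothetical, and the peak area ratios are invented for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def concentration_from_areas(analyte_area, internal_std_area, slope, intercept):
    """Invert the calibration line using the analyte / internal-standard area ratio."""
    ratio = analyte_area / internal_std_area
    return (ratio - intercept) / slope


# Hypothetical glucose standards (mg/L) and glucose/fucose peak area ratios.
standards_mg_per_l = [50.0, 100.0, 250.0, 500.0, 750.0]
area_ratios = [0.1, 0.2, 0.5, 1.0, 1.5]
slope, intercept = fit_line(standards_mg_per_l, area_ratios)

# Hypothetical sample injection: glucose peak area 1200, fucose peak area 2000.
estimate = concentration_from_areas(1200.0, 2000.0, slope, intercept)
print(f"estimated glucose: {estimate:.0f} mg/L")
```

Using the area ratio rather than the raw analyte area compensates for injection-to-injection variation, which is the usual reason for including an internal standard.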

4.3 Results

4.3.1 Compositional Analysis

The compositional analysis was conducted on the biomass both with and without extractives, and the “extractive-free” basis gave a much better mass closure (~99%), as depicted in Figure 4.6 (a). The sugars are reported as % polymers of the detected reduced sugar monomers; the cellulose content is considered equal to the glucan (C6) content and hemicellulose as the sum of the xylan and arabinan (C5) and mannan and galactan (C6) contents. The ash content of PPS was quite low given that the source is mostly thermo-mechanically pulped fibre rejects. For PPS with high ash content, milling and sieving of the dry material has been shown previously in the literature to effectively remove the ash (Gurram et al. 2015) and reduce the undesirable buffering effects that basic components might have on hydrolysis of PPS (Kang et al. 2010). The CHN elemental composition results are also included (Figure 4.6 (b)).
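Reporting monomeric sugars as polymers relies on an anhydro correction, since each sugar unit loses one water molecule when incorporated into a polysaccharide (162/180 = 0.90 for hexoses, 132/150 = 0.88 for pentoses). The sketch below applies these standard factors; whether exactly these factors were used in this study is an assumption, and the monomer percentages are hypothetical.

```python
# Anhydro correction factors: polymer mass per unit monomer mass.
ANHYDRO_FACTOR = {
    "glucose": 0.90, "mannose": 0.90, "galactose": 0.90,  # hexoses (C6)
    "xylose": 0.88, "arabinose": 0.88,                    # pentoses (C5)
}
POLYMER_NAME = {
    "glucose": "glucan", "mannose": "mannan", "galactose": "galactan",
    "xylose": "xylan", "arabinose": "arabinan",
}


def polymer_percentages(monomer_percent):
    """Convert % monomeric sugars (dry basis) to % polymeric sugars."""
    return {POLYMER_NAME[s]: pct * ANHYDRO_FACTOR[s] for s, pct in monomer_percent.items()}


def cellulose_and_hemicellulose(polymers):
    """Cellulose is taken as glucan; hemicellulose as the sum of the other polymers."""
    cellulose = polymers.get("glucan", 0.0)
    hemicellulose = sum(v for k, v in polymers.items() if k != "glucan")
    return cellulose, hemicellulose


# Hypothetical monomer percentages (illustrative only, not data from this study).
monomers = {"glucose": 50.0, "xylose": 10.0, "mannose": 5.0, "arabinose": 1.0, "galactose": 1.0}
cellulose, hemicellulose = cellulose_and_hemicellulose(polymer_percentages(monomers))
print(f"cellulose: {cellulose:.1f}%, hemicellulose: {hemicellulose:.1f}%")
```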

Figure 4.6: Percentage composition of dried, milled PPS (left-right) (a) Klason method (b) CHN elemental analysis (5% error)

From the compositional analysis, the sampled PPS shows promise as a lignocellulosic feedstock for hydrolysis and could potentially support whole-cell catalysis if supplied with nitrogen sources. This also presents an interesting opportunity for supplementing with other local agricultural waste streams rich in available nitrogen, like used growth medium from mushroom farming and chicken manure (data from Timmenga & Associates Inc., Vancouver, BC).

4.3.2 Colorimetric Detection of Cellulolytic Activity

The mutant chito-oligosaccharide oxidase enzyme was first titrated against different activities of the positive control enzyme mixture, Celluclast, used in this study. The lowest visible colour signal detection threshold was found to be 0.5 mU (FPU). Figure 4.7 shows the range of linear signal measurement using this assay. The results were readily reproducible and in accordance with the original results reported in the literature (Ferrari et al. 2014), with the linear range of FPU detection being 6-100 mU.

[Plot: net absorbance at 515 nm (absorbance units) against cellulase loading in mU (Celluclast, Novozymes, from T. reesei)]

Figure 4.7: Titration of Celluclast FPU in the chito-oligosaccharide oxidase assay (net absorbance values after subtracting assay mixture blank)

The assay was also found to be able to distinguish between different solid loadings at the same enzyme loading and yield detectable colorimetric signal even while using turbid media or whole cell lysates (Figure 4.8, 4.9). The sequencing results of the positive fosmid clone hits were obtained after these assays were done and hence the interpretations from functional gene annotations were not applied for choice of substrate for assaying and only filter paper was used as model cellulose proxy.


Figure 4.8: Colorimetric assay. Columns 1-3: titration of Celluclast at different FPU loadings; other wells show supernatant from hydrolysis of PPS at different solid loadings (fixed Celluclast loading 500 mU), showing colour development in contrast to blanks (D-F 10-12)

4.3.2.1 Whole-cell Lysates

The top-10 positive fosmid clones obtained after deconvolution with CMU-C2 in section 3.3.2 were cultured and OD600 normalized as described. Lysis mixture was added, and the lysate was incubated on filter paper overnight. Each clone culture was tested in triplicate, and for each clone culture there was another identical set that was spiked with 10 mU Celluclast enzyme cocktail. Both the visual observations (Figure 4.9) and the absorbance readings (Figure 4.10) clearly show that the cell lysates did not yield any significant activity that could be detected by the colorimetric assay.


Figure 4.9: (Left-right) T0 and T0+24 hours incubation of fosmid whole cell lysates with filter paper substrate. Only wells spiked with Celluclast show activity and colour change

Figure 4.10: Absorbance readings at specific time intervals during incubation. Representative results for two fosmid clone reactions: the cellulolytic activity signal results only from Celluclast enzyme action, with no contribution from fosmid whole cell lysates (‘+cel’ refers to spiking of reaction with 10 mU of Celluclast, 5% error)

Several optimization steps could help explain the lack of detectable cellulolytic activity despite the use of a sensitive assay. Given the low volume of the reaction, the cellulolytic proteins in the fosmid lysates might not be concentrated enough to catalyse substrate breakdown. Also, since they are environmental genes in a non-native host, issues of improper protein folding or truncation might occur. Different, more rigorous methods of cell lysis could also be used. To address the concentration issue, tests were done with whole-cell crude protein lysates as well as the secreted fractions of fosmid clones (50 mL volume). This was to check whether activity could be detected by increasing the protein quantity of the fosmid whole cell lysates, and to test the secreted fraction for activity (several bacterial GH enzymes or cellulosomes are secreted) (Gladden et al. 2011; Rashamuse et al. 2016).

4.3.2.2 Fosmid Whole Cell Lysate and 50X Concentrated Culture Supernatant Protein Fractions

The total protein fractions from the fosmid hits (both cell lysate and supernatant fractions) were obtained, and quantification showed that the cell lysates of all the clones had much higher protein contents than the control (ePCC1FOS) strain. The supernatant fraction, however, did not seem to be significantly enriched in protein content when compared to the control strain (Figure #). However, when these proteins were run on SDS-PAGE, significant overexpression was seen in almost all the hits in the size range of 20-25 kDa. Some hits also showed overexpression in the size range 75-30 kDa, which can be corroborated with the expected molecular weights predicted by the ExPASy pI/Mw tool (https://web.expasy.org/compute_pi/) from the amino acid sequence information of the respective ORFs.
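A theoretical molecular weight of the kind compared against the gel bands can be approximated by summing average residue masses and adding one water molecule for the free termini. The sketch below is not the ExPASy implementation; the residue masses are approximate average values, the sequence is a hypothetical placeholder rather than an ORF from this study, and modifications or signal-peptide cleavage are ignored.

```python
# Approximate average residue masses (Da) for the 20 standard amino acids.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # Da, added once for the N- and C-terminal H and OH


def theoretical_mw_da(sequence):
    """Approximate average molecular weight (Da) of an unmodified polypeptide."""
    sequence = sequence.strip().upper()
    if not sequence:
        raise ValueError("empty sequence")
    unknown = set(sequence) - set(AVERAGE_RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


# Hypothetical peptide used only to show the call.
peptide = "MKTAYIAKQR"
print(f"{peptide}: {theoretical_mw_da(peptide) / 1000:.2f} kDa")
```

A large gap between such a prediction and the observed band size, as reported for several clones here, points to truncation, processing or a different expressed product rather than to the calculation itself.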

For most of the clones, however, the sizes of the overexpressed protein bands (20-25 kDa) are less readily relatable to the theoretical calculations from the sequence information (Figure 4.12). These proteins might still show promise and be targeted for sub-cloning, as there are cellulolytic proteins belonging to the glycoside hydrolase (GH) families in the literature that fall within this size range and might be catalysing breakdown of the model substrates. GH12, which is the most well-known GH family in this size range, includes endoglucanases. Proteins in this family are known to be induced in fungal hosts after xylanase expression when exposed to hydrolysates of lignocellulose polysaccharides (Xing et al. 2013), are among the smallest of the GH families (around 20 kDa) (Karlsson et al. 2002), typically lack a carbohydrate-binding module (CBM) and show multifunctionality including both endoglucanase and endoxylanase activities (Zhang et al. 2016b). Given the overall enrichment of endoxylanase genes in the fosmids (section 3.3.3), a gene sequence from this family could be present and could potentially be getting annotated as an endoxylanase (for example in P38I02). A rigorous test of this hypothesis would, however, entail construction of a phylogenetic tree of the endoxylanase-annotated enzymes and structural homology studies between observed endoxylanase clusters and GH12 sequences in curated databases. This possibility is less likely, since the annotation pipelines use structural and sequence similarity signatures and allocate the best possible match for gene annotation as a specific GH family. But given the environmental source of these genes, there could also be other low-scoring hits for the same gene that might help explain these observations.

There is also another recent report of the discovery of a thermostable alkaline cellulase of compost microbial origin that falls within this size range and shows CMCase (CMC: carboxymethyl cellulose) activity (De Marco et al. 2017).


Figure 4.11: Protein content estimation of fosmid whole-cell lysate and supernatant fractions using BCA assay (50 mL cultures; 5% error)

Colorimetric assay of both the lysate and the supernatant protein fractions was not very promising as neither of these fractions showed significant activity towards the model substrate filter paper. Even after an extended incubation of 72 hours, the values were not significantly greater than the control strain (cell lysates performed marginally better). Representative results for fosmids P38I02 and P13D18 have been depicted in Figure 4.13.

Figure 4.12: SDS PAGE visualisation of whole cell lysate and secreted protein fractions


Figure 4.13: Measurement of colorimetric signal after incubation with filter paper substrate for 72 hours (left-right) (a) colour development and (b) absorbance values at end of incubation period

Tests were also done to assess the efficiency of the protein fractions as accessory enzymes to existing cellulolytic enzyme mixtures (Figure 4.14). This was done using two methods:

1. Replacing a part of cellulolytic enzyme mixture with the protein fraction

2. Adding a specific amount of protein fraction to the cellulolytic enzyme mixture

No marked increase in hydrolysis of filter paper was observed through either of these methods. Fosmid sequence information as depicted in Figure 3.11 (section 3.3.3) suggests that fosmid P38I02 has a xyloglucan utilization locus (XyGUL) and might be encoding xylanase genes. Similarly, P13D18 might encode glucuronic hydrolase (GH88) (breaking down glucuronic acid, a component within hemi-cellulose) and β-glucosidase (GH3) activities. Supplementation of xylanase enzymes to the same cellulolytic mixture (Celluclast) used in this study has previously been shown to increase hydrolysis when supplemented by replacement, rather than addition; the latter was observed to decrease the saccharolytic output (Hu et al. 2011). However, these results were not readily observable for fosmid protein fractions. On the contrary, a very minor increase in cellulolytic signal was seen for the additive tests with P38I02. This might be related to the difference in the type of substrates used for the assay. While Hu et al. used lignocellulosic substrates with a major hemicellulose component that was hydrolysed by xylanases and led to an increase in cellulolytic activity, the substrate used in this assay does not have any hemi-cellulose content. So, use of these lysates might be redundant, and other substrates like xylans, mannans or even Avicel (for testing exo-acting glycosidases) should be used to best assess hydrolytic potential.

Figure 4.14: Supplementing Celluclast enzyme mixture with fosmid protein fractions (left-right) and application to filter paper substrate (a) Replacement (1:1) with total protein content fixed at 35 mg/g cellulose (b) Addition of protein fractions to give net double increase in total protein content (Celluclast + fosmid protein)
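The two supplementation schemes in Figure 4.14 amount to simple dose arithmetic, sketched below. The 1:1 replacement at a fixed total of 35 mg protein per g cellulose follows the figure caption; using 35 mg/g as the Celluclast base loading for the addition scheme is an assumption made here for illustration, and the function names are hypothetical.

```python
def replacement_doses(total_mg_per_g, fraction_replaced=0.5):
    """Hold total protein constant and swap a fraction of Celluclast for fosmid protein."""
    if not 0.0 <= fraction_replaced <= 1.0:
        raise ValueError("fraction_replaced must be between 0 and 1")
    return {
        "celluclast": total_mg_per_g * (1.0 - fraction_replaced),
        "fosmid": total_mg_per_g * fraction_replaced,
    }


def addition_doses(celluclast_mg_per_g, fold_total=2.0):
    """Keep the Celluclast dose and add fosmid protein until total protein is fold_total x."""
    if fold_total < 1.0:
        raise ValueError("fold_total must be at least 1")
    return {
        "celluclast": celluclast_mg_per_g,
        "fosmid": celluclast_mg_per_g * (fold_total - 1.0),
    }


# 35 mg/g cellulose is from the Figure 4.14 caption; its use as the base dose
# in the addition scheme is an assumption.
for label, doses in (("replacement", replacement_doses(35.0)), ("addition", addition_doses(35.0))):
    total = sum(doses.values())
    print(f"{label}: {doses} (total {total:.1f} mg/g cellulose)")
```

Replacement tests whether fosmid proteins can substitute for part of the commercial cocktail at equal protein cost, whereas addition only shows whether extra protein of any kind helps, which is why a BSA additive control is needed alongside it.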

4.3.2.3 Sub-cloned GH genes from fosmids over-expressed in E. coli BL21(DE3) strain

SDS-PAGE was used to visualise the gene products from the P04P08 and P14I01 fosmids with BL21(DE3) as control. The first three lanes in the figure depict the supernatant fraction. Both P04P08 and P14I01 show bands of ~25 kDa that are absent from the control strain in both the cell lysate and the supernatant (secreted) fractions, similar to what was observed before in fosmid whole cell lysates (Figure 4.12). However, the size of the observed overexpressed protein fraction is not readily explained, since the ExPASy tool (https://web.expasy.org/compute_pi/) prediction of the molecular weights of the translated amino acid sequences of the genes puts the expected sizes at around 77 and 50 kDa respectively (Figure 4.15).

Figure 4.15: SDS-PAGE results of sub-cloned BL21 DE3 cell lysate and supernatant fraction with genes from fosmids P04P08 and P14I01 respectively (sup: supernatant fraction; CL: cell lysate)

Some initial tests for functionality of proteins contained within both the lysates and the supernatant fraction were done with the CMU substrates used for screening the metagenomic libraries following the same methods as chapter 3, to check if the proteins within these fractions are active. After incubation for 4 hours, it was observed that only cell lysate and supernatant fraction of P14I01 had activity on both CMU-C2 (cellobioside) and CMU-3X (mixture of cellobioside, xyloside and mannoside) (Figure 4.16). While the cell lysate fraction had expected increased activity for the mixture (3X), it was surprising to see that the supernatant fraction showed a marked decrease in activity for the mixture vs cellobioside only.

To confirm if there was substrate inhibition, the P14I01 supernatant fraction was tested further on each substrate individually (Figure 4.16 inset). Given the low activities on both xyloside and mannoside, there could potentially be some inhibition occurring due to their presence or competition for enzyme activity (potentially with mannoside). GH1 family enzymes (domain annotation for the P14I01 gene sequence) are known to contain all these activities (β-glucosidase/xylosidase/mannosidase), so these results are not surprising. However, to understand the kinetics of inhibition better, further combinatorial experiments with the substrates are needed. P04P08 surprisingly did not show any activity in either the supernatant or the lysate fraction. This might be attributed to incorrect expression of the gene leading to truncated proteins, improper folding or other post-translational modifications leading to lack of activity.

To correctly establish activity of enzymes within these fractions, his-tag purification of the proteins is required. The obtained proteins should be then tested for activity on CMU substrates along with cellobiose and Avicel (which are the more relevant compounds for optimizing these biocatalysts for biomass degradation).

Figure 4.16: Activity testing of sub-cloned cell lysate and protein fraction using CMU substrates (CMU-C2: Cellobioside; CMU-X2: Xyloside; CMU-Man: Mannoside; CMU-3X: mixture of all three substrates; readings at end of 4-hour incubation period with 5% error and inset shows deconvolution tests for P14I01 supernatant fraction)


4.3.3 Bench-scale Hydrolysis

The hydrolysis tests using whole fosmid cultures showed a maximum glucan conversion of around 2.7% for two clone cultures, P22O04 and P04P08 (Figure 4.17). This was very low compared to the Celluclast control (< 70%). However, given that these clones were not biologically optimized or engineered to overproduce or secrete enzymes, they were selected for further investigation using HPLC analysis of the reduced sugar profile. This was also done to validate results from the crude analysis using the glucose oxidase membrane-based detection, which had a lower detection threshold of 25 mg/L.

Figure 4.17: Percentage conversion of glucan in PPS to glucose during the hydrolysis experiment

HPLC, however, revealed that these results were not reliable, as none of the fosmid clone cultures yielded glucose in amounts that could be quantified within the selected standard range (50-750 mg/L, with the lower value representing a minimum conversion of 5%). The theoretical minimum conversion needed to completely replace current cellulolytic mixtures is around 50-70% and fosmid whole cell cultures do not meet that criterion. This is not surprising given the low copy number of fosmids and the need for improving and optimizing the expression of GH genes specifically. Celluclast produced around 50-60% glucose from the initial glucan content, as expected.
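Percentage glucan conversion relates the glucose measured in the hydrolysate to the glucose that the glucan in the loaded solids could release, using the 0.90 anhydro factor (162/180) to convert between glucan and glucose. The sketch below is a generic version of that calculation, not the exact worksheet used here; the glucan fraction and glucose titre are hypothetical, and 2.5% solids is taken as 25 g/L (w/v).

```python
GLUCAN_TO_GLUCOSE = 1.0 / 0.90  # g glucose released per g glucan fully hydrolysed


def glucan_conversion_percent(glucose_g_per_l, solids_g_per_l, glucan_fraction):
    """Percent of the loaded glucan recovered as glucose in the hydrolysate."""
    if solids_g_per_l <= 0 or not 0.0 < glucan_fraction <= 1.0:
        raise ValueError("need positive solids loading and glucan fraction in (0, 1]")
    potential_glucose = solids_g_per_l * glucan_fraction * GLUCAN_TO_GLUCOSE
    return glucose_g_per_l / potential_glucose * 100.0


# 25 g/L corresponds to the 2.5% solids loading; the glucan fraction (0.36)
# and glucose titre (5 g/L) are hypothetical values for illustration.
conversion = glucan_conversion_percent(5.0, 25.0, 0.36)
print(f"glucan conversion: {conversion:.1f}%")
```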

4.4 Conclusion

Biocatalysts are an integral part of bioeconomy development, and functional metagenomics has the potential to unearth not only new biocatalysts but also whole new metabolic pathways that might already be adapted to complex organic matter or biomass degradation. However, to truly fulfil its promise of enhancing cost-effectiveness in bioprocess development, there is a pressing need to biochemically optimize the clones discovered through the functional metagenomics pipeline. These biochemical kinetic assays need to be done in high-throughput and preferably using real biomass substrates to generate activity values as “lignocellulose units (glucan/xylan)” instead of approximating “filter paper units” or “CMC units”. Substrate choice is crucial, as it affects the screening process by enriching for very specific activities, and given the complex nature of biomass, detection of PULs rather than single GH genes is beneficial for biocatalytic clone design and development. The clones obtained in this study show high potential for application as hemi-cellulolytic and exo-acting cellulolytic catalysts. This has important implications for lignocellulosic biomass degradation, and engineering a consortium of these clones in tandem with other metagenomically discovered endocellulolytic clones will enable the economic bioconversion of PPS. As such, it is demonstrated through compositional analysis and conversion using conventional cellulolytic enzyme mixtures that PPS indeed represents a low-cost, easily available biomass resource that can be biologically converted to valuable downstream products through production of readily utilisable C6 sugars.


Chapter 5 Thesis Conclusion and Future Directions of Work

5.1 Concluding Discussion

The paper sludge microbiome presents a promising source of plant polysaccharide degradation genes that can be tapped for engineering optimized biological systems for biocatalyst production. The microbiome seems uniquely enriched in hemi-cellulose and structural biomass degradation genes and these present interesting targets for developing enzymatic cocktails for lignocellulose degradation or even enzymatic pre-treatment of lignocellulosic biomass to selectively hydrolyse and remove hemicellulose content from biomass.

This is also reflected in functional genes uncovered through high-throughput screening with model fluorogenic substrates. The positive clones uncovered show enrichment for exo-acting β-glucosidase activities and xyloglucan utilization loci. This might be reflective of the specific substrates used during the screening along with the close association of several glycoside hydrolase genes in polysaccharide utilization loci commonly observed in prokaryotic systems that bring about degradation of organic matter through synergistic action of different enzymes.

Finally, consolidated bioprocess development using the fosmid proteins remains to be thoroughly investigated. It has been challenging to translate functional metagenomic findings to bioprocess applications due to the time lag between fosmid sequencing information and proper substrate selection for designing kinetics experiments. Given the abundance of xylan-hydrolysing loci in fosmids, experiments in which concentrated proteins from fosmid hits are used as xylanase supplements to current cellulolytic enzyme mixtures should be conducted to assess their feasibility of application. Biochemical and kinetic characterizations should be done using hemi-cellulose polymers, and cellobiose or Avicel for exo-acting β-glucosidases.

To demonstrate proof-of-concept of a closed-loop biomass hydrolysis process, further steps would need to be informed by an interdisciplinary approach. There has to be an alliance between microbial ecology, metagenomics, synthetic biology and bioprocess engineering to truly advance progress towards bioeconomy development.

5.2 Future Perspectives

5.2.1 Microbiome metabolic potential

The paper sludge microbiome represents an interesting environment for bioprospecting of plant polysaccharide degradation genes. To better understand the microbial ecological metabolic networks driving the functions of interest and better assign function to taxonomy, representative sampling of similar environments is needed. The findings from this study provide initial insights into the potential linkages between function and taxonomy. However, to establish these linkages confidently, it is imperative to assess several different metagenomes from mills in a similar geographic location that use the same kind of plant biomass for pulp making.

Metagenomic bioinformatic annotation pipelines are sensitive not only to the nature of the input data but also to how it has been processed, i.e. assembled or binned (tools for which also lack benchmarking). This, together with the differences in methodologies and reference databases used, creates variations in taxonomic annotations as observed. Therefore, results from multiple samples will allow a more confident assessment of the taxonomic profile.


Experimental validation of the functional genes can also be done by using metatranscriptomic studies and construction of cDNA libraries instead of only relying on annotations from pipelines. This is also applicable to the comparative analysis of microbiomes for differential expression of genes of interest. The findings presented in this study are qualitative at best and only present inferences good for hypothesis design. In order to quantitatively validate these findings, the metagenomic data should be processed using same pipelines for all samples prior to annotation and potentially also present enrichment data from rRNA profiles from these environments.

5.2.2 Functional Metagenomic Screening

The findings presented from functional metagenomic screening in this study are within the constraints of the specific screening experimental design. The number of clones discovered is proportional to the number of metagenomic genes that can be expressed within the host system E. coli EPI300™. Choosing a vector that has a broad host range for expression can potentially uncover other genes with similar activities when expressed using different promoters in other host systems.

The effect of the substrate on the screening outcomes is very important, and when the objective is to discover genes that can produce biocatalysts with improved or equivalent activities to industrial cellulolytic mixtures, it is important to include complex substrates that represent lignocellulosic biomass. To reach a compromise between the inability to design high-throughput screens with complex biomass (space and volume considerations in 96/384 well formats) and the need to recover clones with the required functionalities, substrates with a mixture of lignocellulose monomers should ideally be used. This will enable focus on clones that are most relevant to bioprocess development with targeted lignocelluloses. This presents a challenge in substrate design that can potentially be answered by synthetic chemistry.

Another strategy that can be explored is targeted screening of enrichment cultures. Inoculum from PPS can be used to grow cultures on different recalcitrant biomass; following metagenomic DNA extraction, functional screening can then recover clones that together represent a powerful consortium with hydrolytic activity towards biomass. Varying the temperature and pH of the enrichment would also allow thermostable and/or broad-range pH-stable enzymes to be uncovered.

5.2.3 Consolidated Bioprocess Development

Complete scale-up of a process that uses paper sludge as the feedstock for hydrolysis, with on-site production of cellulase enzymes from the PPS microbiome, involves several steps beyond proof of concept.

Assays of the biochemical activity of the screened clones need to be optimized in alignment with bioprocess development. High titres of biocatalyst can be achieved in expression strains under appropriate induction and growth conditions. The enzymes should be tested for functionality step-wise on model and real substrates to guide optimization and to rule out problems with recombinant enzyme folding or post-translational modification.

It is also important to justify the repurposing of PPS from an economic perspective. To ensure that the proposed closed-loop bioprocess is truly sustainable and adds value to the industrial waste stream, a techno-economic analysis and risk-assessment studies should be carried out on the proposed system boundary (Figure 5.1). Such analyses are needed to quantify the biocatalyst units required, the C5 and C6 sugar titres from PPS hydrolysis needed to sustain the biomass, and finally the downstream product yield needed to ensure a profitable return on investment.

These findings will bring the financial and business perspectives needed to attract industrial investment and partnerships for demonstration of the bioprocess at pilot scale and eventual conversion to industrial-scale production. This has been done previously for bioethanol production from PPS (Venditti 2014).
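
As a first-pass illustration of the quantities such an analysis would track, the sketch below estimates theoretical sugar release and enzyme demand for a batch of dry sludge. Every input value (sludge mass, glucan and xylan fractions, conversions, enzyme loading) is a hypothetical placeholder rather than a measurement from this work, and the function names are introduced only for illustration. The only fixed quantities are the stoichiometric hydration factors for glucan to glucose (180/162) and xylan to xylose (150/132).

```python
# Mass gained on hydrolysis: anhydro-sugar unit + water -> free sugar.
GLUCAN_TO_GLUCOSE = 180.0 / 162.0  # C6 sugars
XYLAN_TO_XYLOSE = 150.0 / 132.0    # C5 sugars


def sugar_yield_kg(dry_mass_kg, glucan_fraction, xylan_fraction,
                   glucan_conversion, xylan_conversion):
    """Return (glucose_kg, xylose_kg) released from a batch of dry sludge."""
    for value in (glucan_fraction, xylan_fraction, glucan_conversion, xylan_conversion):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions and conversions must lie in [0, 1]")
    glucose = dry_mass_kg * glucan_fraction * glucan_conversion * GLUCAN_TO_GLUCOSE
    xylose = dry_mass_kg * xylan_fraction * xylan_conversion * XYLAN_TO_XYLOSE
    return glucose, xylose


def enzyme_units_needed(dry_mass_kg, glucan_fraction, units_per_g_glucan):
    """Total enzyme units for a given loading expressed per gram of glucan."""
    glucan_g = dry_mass_kg * 1000.0 * glucan_fraction
    return glucan_g * units_per_g_glucan


# Hypothetical batch: 1 tonne dry sludge, 50% glucan, 10% xylan, 80% conversion.
glucose_kg, xylose_kg = sugar_yield_kg(1000.0, 0.5, 0.1, 0.8, 0.8)
units = enzyme_units_needed(1000.0, 0.5, units_per_g_glucan=10.0)

print(f"glucose: {glucose_kg:.1f} kg, xylose: {xylose_kg:.1f} kg")
print(f"enzyme units required: {units:.0f}")
```

A full techno-economic model would add costs, energy flows and product yields on top of this mass balance; the sketch only shows how the sugar and biocatalyst terms relate to sludge composition.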

Figure 5.1: Material and energy-based revenue flow streams around the paper mill using a biorefinery for valorization of pulp and paper mill sludge

Bibliography

Abot A, Arnal G, Auer L, et al (2016) CAZyChip: dynamic assessment of exploration of glycoside

hydrolases in microbial ecosystems. BMC Genomics 17:671. doi: 10.1186/s12864-016-

2988-4

Acker MG, Auld DS (2014) Considerations for the design and reporting of enzyme assays in high-

throughput screening applications. Perspect Sci 1:56–73. doi: 10.1016/j.pisc.2013.12.001

An D, Caffrey SM, Soh J, et al (2013) Metagenomics of hydrocarbon resource environments

indicates aerobic taxa and genes to be unexpectedly common. Environ Sci Technol

47:10708–17. doi: 10.1021/es4020184

Andrews S (2017) FastQC: A Quality Control tool for High Throughput Sequence Data.

Ansorge WJ, Katsila T, Patrinos GP (2017) Perspectives for Future DNA Sequencing Techniques

and Applications. In: Molecular Diagnostics. Elsevier, pp 141–153

Antunes LP, Martins LF, Pereira RV, et al (2016) Microbial community structure and dynamics in

thermophilic composting viewed through metagenomics and metatranscriptomics. Sci Rep

6:38915. doi: 10.1038/srep38915

Armstrong Z, Mewis K, Strachan C, Hallam SJ (2015) Biocatalysts for biomass deconstruction

from environmental genomics. Curr Opin Chem Biol 29:18–25. doi:

10.1016/j.cbpa.2015.06.032

Arshad A, Dalcin Martins P, Frank J, et al (2017) Mimicking microbial interactions under nitrate-

reducing conditions in an anoxic bioreactor: enrichment of novel Nitrospirae bacteria

distantly related to Thermodesulfovibrio. Environ Microbiol 19:4965–4977. doi:

10.1111/1462-2920.13977

Artzi L, Bayer EA, Moraïs S (2017) Cellulosomes: bacterial nanomachines for dismantling plant

polysaccharides. Nat Rev Microbiol 15:83–95. doi: 10.1038/nrmicro.2016.164

Attia MA, Nelson CE, Offen WA, et al (2018) In vitro and in vivo characterization of three

Cellvibrio japonicus glycoside hydrolase family 5 members reveals potent xyloglucan

backbone-cleaving functions. Biotechnol Biofuels. doi: 10.14288/1.0363937

Bainomugisa A, Duarte T, Lavu E, et al (2018) A complete nanopore-only assembly of an XDR

Mycobacterium tuberculosis Beijing lineage strain identifies novel genetic variation in

repetitive PE/PPE gene regions. bioRxiv 256719. doi: 10.1101/256719

Bao Y-J, Xu Z, Li Y, et al (2017) High-throughput metagenomic analysis of petroleum-

contaminated soil microbiome reveals the versatility in xenobiotic aromatics metabolism. J

Environ Sci 56:25–35. doi: 10.1016/J.JES.2016.08.022

Bateman A, Coin L, Durbin R, et al (2004) The Pfam protein families database. Nucleic Acids Res

32:138D–141. doi: 10.1093/nar/gkh121

Bayer EA, Lamed R, Himmel ME (2007) The potential of cellulases and cellulosomes for

cellulosic waste management. Curr Opin Biotechnol 18:237–245. doi:

10.1016/J.COPBIO.2007.04.004

Benyus JM (2009) Biomimicry : innovation inspired by nature. HarperCollins e-books

Bergholz TM, Moreno Switt AI, Wiedmann M (2014) Omics approaches in food safety: fulfilling

the promise? Trends Microbiol 22:275–81. doi: 10.1016/j.tim.2014.01.006

Berlemont R (2017) Distribution and diversity of enzymes for polysaccharide degradation in

fungi. Sci Rep 7:222. doi: 10.1038/s41598-017-00258-w

Berlemont R, Martiny AC (2016) Glycoside Hydrolases across Environmental Microbial

Communities. PLoS Comput Biol 12:e1005300. doi: 10.1371/journal.pcbi.1005300

Binga EK, Lasken RS, Neufeld JD (2008) Something from (almost) nothing: the impact of multiple

displacement amplification on microbial ecology. ISME J 2:233–241. doi:

10.1038/ismej.2008.10

Bio-TIC (2015) The bioeconomy enabled - A roadmap to a thriving industrial biotechnology

sector in Europe.

BioVale (2015) BioVale: A strategy for a bioeconomy innovation cluster across Yorkshire and

Humber.

Bisswanger H (2014) Enzyme assays. Perspect Sci 1:41–55. doi: 10.1016/J.PISC.2014.02.005

Bocken NMP, de Pauw I, Bakker C, van der Grinten B (2016) Product design and business model

strategies for a circular economy. J Ind Prod Eng 33:308–320. doi:

10.1080/21681015.2016.1172124

Bogdanski BEC (2014) The rise and fall of the Canadian pulp and paper sector. For Chron

90:785–793.

Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence

data. Bioinformatics 30:2114–2120. doi: 10.1093/bioinformatics/btu170

Bowers RM, Kyrpides NC, Stepanauskas R, et al (2017) Minimum information about a single

amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria

and archaea. Nat Biotechnol 35:725–731. doi: 10.1038/nbt.3893

Brown RC (2013) Distributed Production of Biobased Products with Biomass Processing

Modules. Ames

Browne T, Gilsenan R, Singbeil D, Paleologou M (2011) Bio-energy and Bio-chemicals Synthesis

Report.

Bueso YF, Tangney M (2017) Synthetic Biology in the Driving Seat of the Bioeconomy. Trends

Biotechnol 35:373–378. doi: 10.1016/j.tibtech.2017.02.002

Burgess RR, Deutscher MP (2009) Guide to protein purification, 2nd edn. Elsevier/Academic

Press

Cai J, He Y, Yu X, et al (2017) Review of physicochemical properties and analytical

characterization of lignocellulosic biomass. Renew Sustain Energy Rev 76:309–322. doi:

10.1016/J.RSER.2017.03.072

Caporaso JG, Kuczynski J, Stombaugh J, et al (2010) QIIME allows analysis of high-throughput

community sequencing data. Nat Methods 7:335–336. doi: 10.1038/nmeth.f.303

Carrez D, Van Leeuwen P (2015) Bioeconomy: circular by nature. Eur Files 34–35.

Caspi R, Billington R, Ferrer L, et al (2016) The MetaCyc database of metabolic pathways and

enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res

44:D471-80. doi: 10.1093/nar/gkv1164

Chen G-Q (2012) New challenges and opportunities for industrial biotechnology. Microb Cell

Fact 11:111. doi: 10.1186/1475-2859-11-111

Chen H, Armstrong Z, Hallam S, Withers S (2016) Synthesis and evaluation of a series of 6-

chloro-4-methylumbelliferyl glycosides as fluorogenic reagents for screening metagenomic

libraries for glycosidase activity.

Chen H, Han Q, Daniel K, et al (2014) Conversion of Industrial Paper Sludge to Ethanol:

Fractionation of Sludge and Its Impact. Appl Biochem Biotechnol 174:2096–2113. doi:

10.1007/s12010-014-1083-z

Cheng J, Romantsov T, Engel K, et al (2017) Functional metagenomics reveals novel β-

galactosidases not predictable from gene sequences. PLoS One 12:1–20. doi:

10.1371/journal.pone.0172545

Chistoserdova L (2010) Functional metagenomics: recent advances and future challenges.

Biotechnol Genet Eng Rev 26:335–52.

Committee on Industrialization of Biology, Board on Chemical Sciences and Technology, Board

on Life Sciences, Division on Earth and Life Studies (2015) Industrialization of Biology: A

Roadmap to Accelerate the Advanced Manufacturing of Chemicals.

Council C on MC and FNR (2007a) A Balanced Portfolio: Multi-Scale Projects in the “Global

Metagenomics Initiative.” In: The New Science of Metagenomics Revealing the Secrets of

Our Microbial Planet. THE NATIONAL ACADEMIES PRESS, pp 107–123

Council NR (2007b) The New Science of Metagenomics: Revealing the Secrets of Our Microbial

Planet. National Academies Press, Washington, D.C.

Craig JW, Chang F-Y, Kim JH, et al (2010) Expanding Small-Molecule Functional Metagenomics

through Parallel Screening of Broad-Host-Range Cosmid Environmental DNA Libraries in

Diverse Proteobacteria. Appl Environ Microbiol 76:1633–1641. doi: 10.1128/AEM.02169-

09

Cuadrat RRC, Ionescu D, Dávila AMR, Grossart H-P (2018) Recovering Genomics Clusters of

Secondary Metabolites from Lakes Using Genome-Resolved Metagenomics. Front

Microbiol 9:251. doi: 10.3389/fmicb.2018.00251

Darling AE, Jospin G, Lowe E, et al (2014) PhyloSift: phylogenetic analysis of genomes and

metagenomes. PeerJ 2:e243. doi: 10.7717/peerj.243

Dashtban M, Maki M, Leung KT, et al (2010) Cellulase activities in biomass conversion:

Measurement methods and comparison. Crit Rev Biotechnol 30:302–309. doi:

10.3109/07388551.2010.490938

De Marco ÉG, Heck K, Martos ET, Van Der Sand ST (2017) Purification and characterization of a

thermostable alkaline cellulase produced by Bacillus licheniformis 380 isolated from

compost. Ann Brazilian Acad Sci 8933:2359–2370. doi: 10.1590/0001-3765201720170408

DOE. U (2015) Bioenergy Workshop.

Dröge J, Gregor I, McHardy AC (2015) Taxator-tk: precise taxonomic assignment of

metagenomes by fast approximation of evolutionary neighborhoods. Bioinformatics

31:817–24. doi: 10.1093/bioinformatics/btu745

El-Chichakli B, von Braun J, Lang C, et al (2016) Policy: Five cornerstones of a global

bioeconomy. Nature 535:221–223. doi: 10.1038/535221a

Engelbrektson A, Kunin V, Wrighton KC, et al (2010) Experimental factors affecting PCR-based

estimates of microbial species richness and evenness. ISME J 4:642–647. doi:

10.1038/ismej.2009.153

Epicentre (2010) CopyControl™ Fosmid Library Production Kit with pCC1FOS™ Vector

CopyControl™ HTP Fosmid Library Production Kit with pCC2FOS™ Vector. Control 1–28.

doi: CCFOSS110

European Commission (2017) Review of the 2012 European Bioeconomy Strategy.

Falkowski PG, Fenchel T, Delong EF (2008) The Microbial Engines That Drive Earth's

Biogeochemical Cycles. Science 320:1034–1039. doi: 10.1126/science.1153213

Ferrari AR, Gaber Y, Fraaije MW (2014) A fast, sensitive and easy colorimetric assay for chitinase

and cellulase activity detection. Biotechnol Biofuels 7:37. doi: 10.1186/1754-6834-7-37

Ferrer M, Martínez-Martínez M, Bargiela R, et al (2016) Estimating the success of enzyme

bioprospecting through metagenomics: current status and future trends. Microb

Biotechnol 9:22–34. doi: 10.1111/1751-7915.12309

Fincher G, Mark B, Brumer H (2017) Glycoside Hydrolase Family 3. In: CAZYpedia.

//www.cazypedia.org/index.php?title=Glycoside_Hydrolase_Family_3&oldid=11467.

Accessed 20 Mar 2018

Gabor EM, Alkema WBL, Janssen DB (2004) Quantifying the accessibility of the metagenome by

random expression cloning techniques. Environ Microbiol 6:879–886. doi: 10.1111/j.1462-

2920.2004.00640.x

General Assembly UN (1987) Our Common Future: Report of the World Commission on

Environment and Development. Oslo

Geng A, Zou G, Yan X, et al (2012) Expression and characterization of a novel metagenome-

derived cellulase Exo2b and its application to improve cellulase activity in Trichoderma

reesei. Appl Microbiol Biotechnol 96:951–962. doi: 10.1007/s00253-012-3873-y

Geng Y, Doberstein B (2008) Developing the circular economy in China: Challenges and

opportunities for achieving “leapfrog development.” Int J Sustain Dev World Ecol 15:231–

239. doi: 10.3843/SusDev.15.3:6

Ghribi M, Meddeb-Mouelhi F, Beauregard M (2016) Microbial diversity in various types of paper

mill sludge: identification of enzyme activities with potential industrial applications.

Springerplus. doi: 10.1186/s40064-016-3147-8

Gies EA, Konwar KM, Beatty JT, Hallam SJ (2014) Illuminating microbial dark matter in

meromictic Sakinaw Lake. Appl Environ Microbiol 80:6807–18. doi: 10.1128/AEM.01774-

14

Gladden JM, Allgaier M, Miller CS, et al (2011) Glycoside hydrolase activities of thermophilic

bacterial consortia adapted to switchgrass. Appl Environ Microbiol 77:5804–12. doi:

10.1128/AEM.00032-11

Gonzalez JM, Portillo MC, Belda-Ferre P, Mira A (2012) Amplification by PCR artificially reduces

the proportion of the rare biosphere in microbial communities. PLoS One 7:e29973. doi:

10.1371/journal.pone.0029973

Grob C, Taubert M, Howat AM, et al (2015) Combining metagenomics with metaproteomics and

stable isotope probing reveals metabolic pathways used by a naturally occurring marine

methylotroph. Environ Microbiol 17:4007–4018. doi: 10.1111/1462-2920.12935

Grondin JM, Tamura K, Déjean G, et al (2017) Polysaccharide Utilization Loci: Fueling Microbial

Communities. J Bacteriol 199:JB.00860-16. doi: 10.1128/JB.00860-16

Gurevich A, Saveliev V, Vyahhi N, Tesler G (2013) QUAST: quality assessment tool for genome

assemblies. Bioinformatics 29:1072–5. doi: 10.1093/bioinformatics/btt086

Gurram RN, Al-Shannag M, Lecher NJ, et al (2015) Bioconversion of paper mill sludge to

bioethanol in the presence of accelerants or hydrogen peroxide pretreatment. Bioresour

Technol 192:529–539. doi: 10.1016/j.biortech.2015.06.010

Hames B, Ruiz R, Scarlata C, et al (2008) Preparation of Samples for Compositional Analysis:

Laboratory Analytical Procedure (LAP); Issue Date 08/08/2008.

Handelsman J (2004) Metagenomics: Application of Genomics to Uncultured Microorganisms.

Microbiol Mol Biol Rev 68:669–685. doi: 10.1128/MMBR.68.4.669-685.2004

Hanson NW, Konwar KM, Wu S-J, Hallam SJ (2014) MetaPathways v2.0: A master-worker model

for environmental Pathway/Genome Database construction on grids and clouds. In: 2014

IEEE Conference on Computational Intelligence in Bioinformatics and Computational

Biology. IEEE, pp 1–7

Harris HMB, Duncan SH, P. Scott K, et al (2016) Polysaccharide utilization loci and nutritional

specialization in a dominant group of butyrate-producing human colonic Firmicutes.

Microb Genomics 2:e000043. doi: 10.1099/mgen.0.000043

Hawley AK, Nobu MK, Wright JJ, et al (2017) Diverse Marinimicrobia bacteria may mediate

coupled biogeochemical cycles along eco-thermodynamic gradients. Nat Commun 8:1507.

doi: 10.1038/s41467-017-01376-9

Henn MR, Sullivan MB, Stange-Thomann N, et al (2010) Analysis of high-throughput sequencing

and annotation strategies for phage genomes. PLoS One 5:e9083. doi:

10.1371/journal.pone.0009083

Hess M, Sczyrba A, Egan R, et al (2011) Metagenomic discovery of biomass-degrading genes and

genomes from cow rumen. Science 331:463–7. doi: 10.1126/science.1200387

Hu J, Arantes V, Saddler JN (2011) The enhancement of enzymatic hydrolysis of lignocellulosic

substrates by the addition of accessory enzymes such as xylanase: is it an additive or

synergistic effect? Biotechnol Biofuels 4:36. doi: 10.1186/1754-6834-4-36

Huson DH, Beier S, Flade I, et al (2016) MEGAN Community Edition - Interactive Exploration and

Analysis of Large-Scale Microbiome Sequencing Data. PLOS Comput Biol 12:e1004957. doi:

10.1371/journal.pcbi.1004957

Hyatt D, Chen G-L, Locascio PF, et al (2010) Prodigal: prokaryotic gene recognition and

translation initiation site identification. BMC Bioinformatics 11:119. doi: 10.1186/1471-

2105-11-119

Illumina Inc. (2015) An introduction to Next-Generation Sequencing Technology.

Ilmberger N, Güllert S, Dannenberg J, et al (2014) A Comparative Metagenome Survey of the

Fecal Microbiota of a Breast- and a Plant-Fed Asian Elephant Reveals an Unexpectedly High

Diversity of Glycoside Hydrolase Family Enzymes. PLoS One 9:e106707. doi:

10.1371/journal.pone.0106707

Ioelovich M (2014) Waste Paper as Promising Feedstock for Production of Biofuel.

Jackson MA, Line MA (1997) Organic Composition of a Pulp and Paper Mill Sludge Determined

by FTIR, 13C CP MAS NMR, and Chemical Extraction Techniques. doi: 10.1021/JF960946L

Jendrisak JJ, Hoffman LM, Fiandt MJ, Haskins D (2002) Methods and compositions for

amplifying DNA clone copy number. 14.

Jessen GL, Lichtschlag A, Ramette A, et al (2017) Hypoxia causes preservation of labile organic

matter and changes seafloor microbial community composition (Black Sea). Sci Adv

3:e1601897. doi: 10.1126/sciadv.1601897

Kanehisa M, Sato Y, Kawashima M, et al (2016) KEGG as a reference resource for gene and

protein annotation. Nucleic Acids Res 44:D457–D462. doi: 10.1093/nar/gkv1070

Kang L, Wang W, Lee YY (2010) Bioconversion of kraft paper mill sludges to ethanol by SSF and

SSCF. Appl Biochem Biotechnol 161:53–66. doi: 10.1007/s12010-009-8893-4

Karlsson J, Siika-aho M, Tenkanen M, Tjerneld F (2002) Enzymatic properties of the low

molecular mass endoglucanases Cel12A (EG III) and Cel45A (EG V) of Trichoderma reesei. J

Biotechnol 99:63–78. doi: 10.1016/S0168-1656(02)00156-6

Kelley DR, Liu B, Delcher AL, et al (2012) Gene prediction with Glimmer for metagenomic

sequences augmented by classification and clustering. Nucleic Acids Res 40:e9–e9. doi:

10.1093/nar/gkr1067

Kharayat Y, Thakur IS (2012) Isolation of bacterial strain from sediment core of Pulp and Paper

Mill industries for production and purification of lignin peroxidase (LiP) enzyme.

Bioremediat J 16:125–130. doi: 10.1080/10889868.2012.665964

Korhonen J, Honkasalo A, Seppälä J (2018) Circular Economy: The Concept and its Limitations.

Ecol Econ 143:37–46.

Kricka W, Fitzpatrick J, Bond U (2014) Metabolic engineering of yeasts by heterologous enzyme

production for degradation of cellulose and hemicellulose from biomass: a perspective.

Front Microbiol 5:174. doi: 10.3389/fmicb.2014.00174

Kunin V, Engelbrektson A, Ochman H, Hugenholtz P (2010) Wrinkles in the rare biosphere:

pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ

Microbiol 12:118–123. doi: 10.1111/j.1462-2920.2009.02051.x

Kusnezowa A, Leichert LI (2017) In silico approach to designing rational metagenomic libraries

for functional studies. BMC Bioinformatics 18:267. doi: 10.1186/s12859-017-1668-y

Lam KN, Cheng J, Engel K, et al (2015) Current and future resources for functional

metagenomics. Front Microbiol 6:1–8. doi: 10.3389/fmicb.2015.01196

Lamers P, Searcy E, Hess JR, Stichnothe H (2016) Developing the global bioeconomy : technical,

market, and environmental lessons from bioenergy.

Landry Z, Swan BK, Herndl GJ, et al (2017) SAR202 Genomes from the Dark Ocean Predict

Pathways for the Oxidation of Recalcitrant Dissolved Organic Matter. MBio 8:e00413-17.

doi: 10.1128/mBio.00413-17

Lange L, Hreggviðsson GÓ, Björnsdóttir B, et al (2016) Development of the Nordic bioeconomy.

Rosendahls-SchultzGrafisk

Larsbrink J, Rogers TE, Hemsworth GR, et al (2014) A discrete genetic locus confers xyloglucan

metabolism in select human gut Bacteroidetes. Nature 506:498–502. doi:

10.1038/nature12907

Lee S, Hallam SJ (2009) Extraction of High Molecular Weight Genomic DNA from Soils and

Sediments. J Vis Exp 2–5. doi: 10.3791/1569

Li D, Liu C-M, Luo R, et al (2015) MEGAHIT: an ultra-fast single-node solution for large and

complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31:1674–

1676. doi: 10.1093/bioinformatics/btv033

Li H, Durbin R (2010) Fast and accurate long-read alignment with Burrows–Wheeler transform.

Bioinformatics 26:589–595. doi: 10.1093/bioinformatics/btp698

Lieder M, Rashid A (2016) Towards circular economy implementation: A comprehensive review

in context of manufacturing industry. J Clean Prod 115:36–51. doi:

10.1016/j.jclepro.2015.12.042

López-Mondéjar R, Zühlke D, Becher D, et al (2016) Cellulose and hemicellulose decomposition

by forest soil bacteria proceeds by the action of structurally variable enzymatic systems.

Nat Publ Gr. doi: 10.1038/srep25279

Lopolito A, Nardone G, Prosperi M, et al (2011) Modeling the bio-refinery industry in rural

areas: A participatory approach for policy options comparison. Ecol Econ 72:18–27. doi:

10.1016/j.ecolecon.2011.09.010

Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for

RNA-seq data with DESeq2. Genome Biol 15:550. doi: 10.1186/s13059-014-0550-8

Lucigen (2016) Phage T1-Resistant TransforMax™ EPI300™-T1.

Mabee W (2001) Study of woody fibre in papermill sludge. University of Toronto

Macdonald SS, Patel A, Larmour VLC, et al (2018) Structural and mechanistic analysis of a β-

glycoside phosphorylase identified by screening a metagenomic library. J Biol Chem

jbc.RA117.000948. doi: 10.1074/jbc.RA117.000948

Madhavan A, Sindhu R, Parameswaran B, et al (2017) Metagenome Analysis: a Powerful Tool

for Enzyme Bioprospecting. Appl Biochem Biotechnol 183:636–651. doi: 10.1007/s12010-

017-2568-3

Maki ML, Broere M, Leung KT, Qin W (2011) Characterization of some efficient cellulase

producing bacteria isolated from paper mill sludges and organic fertilizers. Int J Biochem

Mol Biol 2:146–154.

Mäkinen V, Salmela L, Ylinen J (2012) Normalized N50 assembly metric using gap-restricted co-

linear chaining. BMC Bioinformatics 13:255. doi: 10.1186/1471-2105-13-255

Marques S, Alves L, Roseiro JC, Gírio FM (2008) Conversion of recycled paper sludge to ethanol

by SHF and SSF using Pichia stipitis. Biomass and Bioenergy 32:400–406. doi:

10.1016/j.biombioe.2007.10.011

Martínez A, Osburne MS (2013) Preparation of Fosmid Libraries and Functional Metagenomic

Analysis of Microbial Community DNA. Methods Enzymol 531:123–142. doi:

10.1016/B978-0-12-407863-5.00007-1

Maruthamuthu M, Jiménez DJ, Stevens P, Dirk Van Elsas J (2016) A multi-substrate approach for

functional metagenomics-based screening for (hemi)cellulases in two wheat straw-

degrading microbial consortia unveils novel thermoalkaliphilic enzymes. BMC Genomics.

doi: 10.1186/s12864-016-2404-0

McDonough W, Braungart M (2002) Cradle to cradle : remaking the way we make things. North

Point Press

Mewis K (2016) Functional Metagenomic Screening for Glycoside Hydrolases. University of

British Columbia

Mewis K, Armstrong Z, Song YC, et al (2013) Biomining active cellulases from a mining

bioremediation system. J Biotechnol 167:462–471. doi: 10.1016/j.jbiotec.2013.07.015

Mewis K, Lenfant N, Lombard V, Henrissat B (2016) Dividing the Large Glycoside Hydrolase

Family 43 into Subfamilies: a Motivation for Detailed Enzyme Characterization. Appl

Environ Microbiol 82:1686–92. doi: 10.1128/AEM.03453-15

Mewis K, Taupp M, Hallam S (2011) A high throughput screen for biomining cellulase activity

from metagenomic libraries.

Miller CS, Baker BJ, Thomas BC, et al (2011) EMIRGE: reconstruction of full-length ribosomal

genes from microbial community short read sequencing data. Genome Biol 12:R44. doi:

10.1186/gb-2011-12-5-r44

Nancucheo I, Bitencourt JAP, Sahoo PK, et al (2017) Recent Developments for Remediating

Acidic Mine Waters Using Sulfidogenic Bacteria. Biomed Res Int 2017:7256582. doi:

10.1155/2017/7256582

Ndeh D, Rogowski A, Cartmell A, et al (2017) Complex pectin metabolism by gut bacteria

reveals novel catalytic functions. Nature 544:65–70. doi: 10.1038/nature21725

Nichols D, Cahoon N, Trakhtenberg EM, et al (2010) Use of ichip for high-throughput in situ

cultivation of "uncultivable" microbial species. Appl Environ Microbiol 76:2445–2450.

doi: 10.1128/AEM.01754-09

Noguchi H, Taniguchi T, Itoh T (2008) MetaGeneAnnotator: Detecting Species-Specific Patterns

of Ribosomal Binding Site for Precise Gene Prediction in Anonymous Prokaryotic and

Phage Genomes. DNA Res 15:387–396. doi: 10.1093/dnares/dsn027

O’Leary NA, Wright MW, Brister JR, et al (2016) Reference sequence (RefSeq) database at NCBI:

current status, taxonomic expansion, and functional annotation. Nucleic Acids Res

44:D733-45. doi: 10.1093/nar/gkv1189

Oliveira JS, Araújo W, Lopes Sales AI, et al (2015) BioSurfDB: knowledge and algorithms to

support biosurfactants and biodegradation studies. Database (Oxford). doi:

10.1093/database/bav033

Olson DG, McBride JE, Joe Shaw A, Lynd LR (2012) Recent progress in consolidated

bioprocessing. Curr Opin Biotechnol 23:396–405. doi: 10.1016/J.COPBIO.2011.11.026

Owen PW (2018) Special Report Renewable energy for sustainable rural development:

significant potential synergies, but mostly unrealised.

Parisutham V, Kim TH, Lee SK (2014) Feasibilities of consolidated bioprocessing microbes: From

pretreatment to biofuel production. Bioresour Technol 161:431–440. doi:

10.1016/J.BIORTECH.2014.03.114

Parks DH, Imelfort M, Skennerton CT, et al (2015) CheckM: assessing the quality of microbial

genomes recovered from isolates, single cells, and metagenomes. Genome Res 25:1043–

55. doi: 10.1101/gr.186072.114

Patel AK, Singhania RR, Pandey A (2017) Production, Purification, and Application of Microbial

Enzymes. In: Biotechnology of Microbial Enzymes. Elsevier, pp 13–41

Pellerin W, Taylor DW (2008) Measuring the biobased economy: A Canadian perspective. Ind

Biotechnol 4:363–366. doi: 10.1089/ind.2008.4.363

Pellis A, Cantone S, Ebert C, Gardossi L (2018) Evolving biocatalysis to meet bioeconomy

challenges and opportunities. N Biotechnol 40:154–169. doi: 10.1016/J.NBT.2017.07.005

Philp J, Winickoff DE (2017) Clusters in Industrial Biotechnology and Bioeconomy: The Roles of

the Public Sector. Trends Biotechnol 35:682–686. doi:

10.1016/j.tibtech.2017.04.001

Pilloni G, Granitsiotis MS, Engel M, Lueders T (2012) Testing the Limits of 454 Pyrotag

Sequencing: Reproducibility, Quantitative Assessment and Comparison to T-RFLP

Fingerprinting of Aquifer Microbes. PLoS One 7:e40467. doi:

10.1371/journal.pone.0040467

Prasetyo J, Naruse K, Kato T, et al (2011) Bioconversion of paper sludge to biofuel by

simultaneous saccharification and fermentation using a cellulase of paper sludge origin

and thermotolerant Saccharomyces cerevisiae TJ14. Biotechnol Biofuels 4:35. doi:

10.1186/1754-6834-4-35

Prasetyo J, Park EY (2013) Waste paper sludge as a potential biomass for bio-ethanol

production. Korean J Chem Eng 30:253–261. doi: 10.1007/s11814-013-0003-1

Prather KLJ (2004) Integrated Chemical Engineering Topics I. In: MIT OpenCourseWare.

Massachusetts Institute of Technology: MIT OpenCourseWare,

Quast C, Pruesse E, Yilmaz P, et al (2013) The SILVA ribosomal RNA gene database project:

improved data processing and web-based tools. Nucleic Acids Res 41:D590-6. doi:

10.1093/nar/gks1219

Rabaçal M, Ferreira AF, Silva CAM da, Costa M (2017) Biorefineries : targeting energy, high

value products and waste valorisation.

Radajewski S, Ineson P, Parekh NR, Murrell JC (2000) Stable-isotope probing as a tool in

microbial ecology. Nature 403:646–649. doi: 10.1038/35001054

Rangu V (2018) Fractionation of pulp mill waste to produce hemicellulose oligomers for

adsorption onto NBSK pulp. University of British Columbia

Ransom-Jones E, McCarthy AJ, Haldenby S, et al (2017) Lignocellulose-Degrading Microbial

Communities in Landfill Sites Represent a Repository of Unexplored Biomass-Degrading

Diversity. mSphere. doi: 10.1128/mSphere.00300-17

Rashamuse K, Sanyika Tendai W, Mathiba K, et al (2016) Metagenomic mining of glycoside

hydrolases from the hindgut bacterial symbionts of a termite (Trinervitermes trinervoides)

and the characterization of a multimodular β-1,4-xylanase (GH11). Biotechnol Appl

Biochem. doi: 10.1002/bab.1480

Ree R van, Jong E de (2017) Biorefining in a future BioEconomy.

http://www.ieabioenergy.com/task/biorefining-sustainable-processing-of-biomass-into-a-

spectrum-of-marketable-biobased-products-and-bioenergy/. Accessed 11 Mar 2018

Rehmann M (2010) Overview of Sustainability Concepts. In: International Forum on Sustainable

Operations for Uranium Production. International Atomic Energy Agency, pp 1–38

Rho M, Tang H, Ye Y (2010) FragGeneScan: predicting genes in short and error-prone reads.

Nucleic Acids Res 38:e191–e191. doi: 10.1093/nar/gkq747

Rhoads A, Au KF (2015) PacBio Sequencing and Its Applications. Genomics Proteomics

Bioinformatics 13:278–289. doi: 10.1016/J.GPB.2015.08.002

Rochman FF, Sheremet A, Tamas I, et al (2017) Benzene and Naphthalene Degrading Bacterial

Communities in an Oil Sands Tailings Pond. Front Microbiol 8:1845. doi:

10.3389/fmicb.2017.01845

Roumpeka DD, Wallace RJ, Escalettes F, et al (2017) A Review of Bioinformatics Tools for Bio-

Prospecting from Metagenomic Sequence Data. Front Genet 8:23. doi:

10.3389/fgene.2017.00023

Rubin EM (2008) Genomics of cellulosic biofuels. Nature 454:841–845. doi:

10.1038/nature07190

Salehi Jouzani G, Taherzadeh MJ (2015) Advances in consolidated bioprocessing systems for

bioethanol and butanol production from biomass: a comprehensive review. Biofuel Res J

5:152–195. doi: 10.18331/BRJ2015.2.1.4

Schloss PD, Handelsman J (2005) Metagenomics for studying unculturable microorganisms:

cutting the Gordian knot. Genome Biol 6:229. doi: 10.1186/gb-2005-6-8-229

Schomburg D, Schomburg I (2010) Enzyme Databases. Humana Press, pp 113–128

Schütte G (2017) What kind of innovation policy does the bioeconomy need? N Biotechnol 3–7.

doi: 10.1016/j.nbt.2017.04.003

Sczyrba A, Hofmann P, Belmann P, et al (2017) Critical Assessment of Metagenome

Interpretation – a comprehensive benchmark of computational metagenomics software.

bioRxiv 1–33. doi: 10.1101/0991277

Sedlar K, Kupkova K, Provaznik I (2017) Bioinformatics strategies for taxonomy independent

binning and visualization of sequences in shotgun metagenomics. Comput Struct

Biotechnol J 15:48–55. doi: 10.1016/j.csbj.2016.11.005

Sekiguchi Y, Yamada T, Hanada S, et al (2003) Anaerolinea thermophila gen. nov., sp. nov. and

Caldilinea aerophila gen. nov., sp. nov., novel filamentous thermophiles that represent a

previously uncultured lineage of the domain Bacteria at the subphylum level. Int J Syst Evol

Microbiol 53:1843–1851. doi: 10.1099/ijs.0.02699-0

Sharan AA, Yadav VG, Hallam SJ (2017) Deep Device Mining for Carbohydrate-Active Enzymes in

Pulp and Paper Mill Sludge Metagenome and Applications to Bioprocess Development |

2017 Synthetic Biology: Engineering, Evolution & Design (SEED). In: 2017 Synthetic

Biology: Engineering, Evolution & Design (SEED). Vancouver,

Sheldon RA, Woodley JM (2018) Role of Biocatalysis in Sustainable Chemistry. Chem Rev

118:801–838. doi: 10.1021/acs.chemrev.7b00203

Sillanpää M, Ncibi C (2017) A sustainable bioeconomy: The green industrial revolution.

Simmons CW, Reddy AP, D’haeseleer P, et al (2014) Metatranscriptomic analysis of

lignocellulolytic microbial communities involved in high-solids decomposition of rice straw.

Biotechnol Biofuels 7:495. doi: 10.1186/s13068-014-0180-0

Simon C, Daniel R (2011) Metagenomic analyses: past and future trends. Appl Environ Microbiol

77:1153–61. doi: 10.1128/AEM.02345-10

Skene KR (2017) Circles, spirals, pyramids and cubes: why the circular economy cannot work.

Sustain Sci. doi: 10.1007/s11625-017-0443-3

Sluiter A, Ruiz R, Scarlata C, et al (2008) Determination of Extractives in Biomass: Laboratory

Analytical Procedure (LAP).

Sommer MOA, Church GM, Dantas G (2010) A functional metagenomic approach for expanding

the synthetic biology toolbox for biomass conversion. Mol Syst Biol 6:360. doi:

10.1038/msb.2010.16

Stadler LB, Delgado Vela J, Jain S, et al (2017) Elucidating the impact of microbial community

biodiversity on pharmaceutical biotransformation during wastewater treatment. Microb

Biotechnol. doi: 10.1111/1751-7915.12870

Stahel WR (2016) Circular economy. 6–9.

Stahel WR (2010) The Performance Economy. Palgrave Macmillan UK, London

Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large

phylogenies. Bioinformatics 30:1312–3. doi: 10.1093/bioinformatics/btu033

Stark M, Berger SA, Stamatakis A, von Mering C (2010) MLTreeMap - accurate Maximum

Likelihood placement of environmental DNA sequences into taxonomic and functional

reference phylogenies. BMC Genomics 11:461. doi: 10.1186/1471-2164-11-461

Steels S, Portetelle D, Vandenbol M (2013) Bacillus subtilis as a Tool for Screening Soil

Metagenomic Libraries for Antimicrobial Activities. J Microbiol Biotechnol 23:850–855. doi:

10.4014/jmb.1212.12008

Stott MB, Dunfield PF, Crowe MA (2009) Class of Chloroflexi-like thermophilic cellulose

degrading bacteria. 40.

Strachan CR, Singh R, Vaninsberghe D, et al (2014) Metagenomic scaffolds enable combinatorial

lignin transformation. Proc Natl Acad Sci USA 111:10143–10148.

Streit WR, Daniel R (2010) Metagenomics: methods and protocols. Humana Press

TAPPI (2006) Acid-insoluble lignin in wood and pulp (Reaffirmation of T 222 om-02).

Tatusov RL, Fedorova ND, Jackson JD, et al (2003) The COG database: an updated version

includes eukaryotes. BMC Bioinformatics 4:41. doi: 10.1186/1471-2105-4-41

Taupp M, Lee S, Hawley A, et al (2009) Large Insert Environmental Genomic Library Production.

J Vis Exp 2–7. doi: 10.3791/1387

Taupp M, Mewis K, Hallam SJ (2011) The art and design of functional metagenomic screens.

Curr Opin Biotechnol 22:465–472. doi: 10.1016/j.copbio.2011.02.010

Terrapon N, Lombard V, Drula E, et al (2017) The CAZy Database/the Carbohydrate-Active

Enzyme (CAZy) Database: Principles and Usage Guidelines. In: A Practical Guide to Using

Glycomics Databases. Springer Japan, Tokyo, pp 117–131

Terrapon N, Lombard V, Gilbert HJ, Henrissat B (2015) Automatic prediction of polysaccharide

utilization loci in Bacteroidetes species. Bioinformatics 31:647–655. doi:

10.1093/bioinformatics/btu716

Tessler M, Neumann JS, Afshinnekoo E, et al (2017) Large-scale differences in microbial

biodiversity discovery between 16S amplicon and shotgun sequencing. Sci Rep 7:6589. doi:

10.1038/s41598-017-06665-3

The Ellen MacArthur Foundation (2015) Towards a Circular Economy - Economic and Business

Rationale for an Accelerated Transition. Greener Manag Int 97. doi: 2012-04-03

Thies S, Rausch SC, Kovacic F, et al (2016) Metagenomic discovery of novel enzymes and

biosurfactants in a slaughterhouse biofilm microbial community. Sci Rep 6:27035. doi:

10.1038/srep27035

Thomas T, Gilbert J, Meyer F (2012) Metagenomics - a guide from sampling to data analysis.

Microb Inform Exp 2:3. doi: 10.1186/2042-5783-2-3

Timmermans K (2001) Trips, CBD and Traditional Medicines: Concepts and Questions. Report of

an ASEAN Workshop on the TRIPS Agreement and Traditional Medicine, Jakarta, February

2001: II. CONTEXT: 2.3 Bioprospecting. Jakarta

Tiwari R, Nain L, Labrou NE, Shukla P (2018) Bioprospecting of functional cellulases from

metagenome for second generation biofuel production: a review. Crit Rev Microbiol

44:244–257. doi: 10.1080/1040841X.2017.1337713

van den Brink J, de Vries RP (2011) Fungal enzyme sets for plant polysaccharide degradation.

Appl Microbiol Biotechnol 91:1477–92. doi: 10.1007/s00253-011-3473-2

Venditti RA (2014) Selected Topics in Lignocellulosics for Biofuels: Sludge to Ethanol.

http://www4.ncsu.edu/~richardv/ethanol.html. Accessed 30 Mar 2018

Venkata Mohan S, Nikhil GN, Chiranjeevi P, et al (2016) Waste biorefinery models towards

sustainable circular bioeconomy: Critical review and future perspectives. Bioresour

Technol 215:2–12. doi: 10.1016/j.biortech.2016.03.130

Wang C, Dong D, Wang H, et al (2016) Metagenomic analysis of microbial consortia enriched

from compost: new insights into the role of Actinobacteria in lignocellulose

decomposition. Biotechnol Biofuels 9:22. doi: 10.1186/s13068-016-0440-2

Wang Y, Zhang R, He Z, et al (2017) Functional Gene Diversity and Metabolic Potential of the

Microbial Community in an Estuary-Shelf Environment. Front Microbiol 8:1153. doi:

10.3389/fmicb.2017.01153

Whitman WB, Coleman DC, Wiebe WJ (1998) Prokaryotes: The unseen majority. Proc Natl Acad Sci USA 95:6578–6583.

Wild J, Hradecna Z, Szybalski W (2002) Conditionally amplifiable BACs: switching from single-

copy to high-copy vectors and genomic clones. Genome Res 12:1434–44. doi:

10.1101/gr.130502

Wilhelm RC, Cardenas E, Leung H, et al (2017) Data Descriptor: A metagenomic survey of forest

soil microbial communities more than a decade after timber harvesting. Sci Data. doi:

10.1038/sdata.2017.92

Wilson DB (2009) Cellulases. In: Encyclopedia of Microbiology. Elsevier, pp 252–258

Wright JJ, Lee S, Zaikova E, et al (2009) DNA Extraction from 0.22 μM Sterivex Filters

and Cesium Chloride Density Gradient Centrifugation. J Vis Exp. doi: 10.3791/1352

Wu CH, Apweiler R, Bairoch A, et al (2006) The Universal Protein Resource (UniProt): an

expanding universe of protein information. Nucleic Acids Res 34:D187-91. doi:

10.1093/nar/gkj161

Wu Y-W, Simmons BA, Singer SW (2016) MaxBin 2.0: an automated binning algorithm to

recover genomes from multiple metagenomic datasets. Bioinformatics 32:605–607. doi:

10.1093/bioinformatics/btv638

Xia Y, Ju F, Fang HHP, Zhang T (2013) Mining of Novel Thermo-Stable Cellulolytic Genes from a

Thermophilic Cellulose-Degrading Consortium by Metagenomics. PLoS One 8:e53779. doi:

10.1371/journal.pone.0053779

Xing S, Li G, Sun X, et al (2013) Dynamic Changes in Xylanases and β-1,4-Endoglucanases

Secreted by Aspergillus niger An-76 in Response to Hydrolysates of Lignocellulose

Polysaccharide. Appl Biochem Biotechnol 171:832–846. doi: 10.1007/s12010-013-0402-0

Yu K, Zhang T (2012) Metagenomic and Metatranscriptomic Analysis of Microbial Community

Structure and Gene Expression of Activated Sludge. PLoS One 7:e38183. doi:

10.1371/journal.pone.0038183

Zhang G, Liu P, Zhang L, et al (2016a) Bioprospecting metagenomics of a microbial community

on cotton degradation: Mining for new glycoside hydrolases. J Biotechnol 234:35–42. doi:

10.1016/J.JBIOTEC.2016.07.017

Zhang X, Wang S, Wu X, et al (2016b) Subsite-specific contributions of different aromatic

residues in the active site architecture of glycoside hydrolase family 12. Sci Rep 5:18357.

doi: 10.1038/srep18357

Ziels RM, Sousa DZ, Stensel HD, Beck DAC (2018) DNA-SIP based genome-centric metagenomics

identifies key long-chain fatty acid-degrading populations in anaerobic digesters with

different feeding frequencies. ISME J 12:112–123. doi: 10.1038/ismej.2017.143

Appendix A Chapter 4 - Sub-cloning Details

1. P04P08 Glycoside hydrolase (GH) 3
ORF length: 2150 bp
Contig location: 3
Strand orientation: “-”
LCA taxonomy: Prokaryotes
Start position: 14083 bp
End position: 16233 bp
ORF_ID: 2_9

Nucleotide Sequence: >PPSLIBM-04-P08_2_9 ATGAGCCTGGAAGAGAAGGTCGCGCAACTGGCGCAGATCAGCGGAGGCGACTTTATGCCAGGGCCAA AGGCCGCCGACATCATCCGCAAGAGCGGGGCTGGCTCTGTGCTGTGGCTGAACGACACCAGGCGGTTC AACGAATTGCAGAAGATCGCCGTGGAGGAAAGCCCGTCCGGCATCCCCGTGCTGTTTGCGCTGGATGT GATTCACGGCTACCGCACGATCTTCCCCGTGCCGCTGGCGATGGCTTCTTCATGGGACCCCGCTGTGGC GGAACAGGCGCAGACGGTGGCAGCGCGCGAAACCCGCGCCGCCGGGCCGCATTGGACGTTTGGCCCG ATGCTGGACATCGCGCGCGATGCGCGCTGGGGGCGAATCGTGGAAGGGGCGGGCGAAGATCCCTATC TCGGGGCGGCGATGGCAGCCGCACAGGTGCGCGGATTCCAGGGCGCCGACCTGTCGGACCCGGAGCG CGTGCTGGCGTGCGCCAAGCACTTTGCCGGCTATGGCGCGGCGGAAGGCGGCCGTGACTACGACGAGG TGCACCTGTCGGAGACGGAGCTGCGCAACACGTACTTTCCGCCCTTCGAGGCGGCGGTGAAAGCCGGC GTGGGTTCCTTCATGGGCGCCTATATGGATTTGAACCATGTCCCGGCCAGCGCCAACCGCTGGCTGCTG CGCGACATGCTGCGCAGCGAATTTGGCTTTGAAGGGTTTGTGGTCAGCGATGCCCTGGCGATTGGCAAC CTGGTCATCCAGGGCCACGCGCGCGACAAGCGCGATGCTGCGCTGCGGGCGCTCAAGGCCGGCATGAA CATGGACATGGCTTCCGGTTCGTACCTGGAAAACCTGGCCGACCTGGTAAAGGATGGCTCCCTTTCGGC AGAGCAGATCGATGAGATGGTGCGGCCAATCCTGGCGATCAAGTTCAAGATGGGGCTGTTCGAGAACC CTTACGTGGAAGAAGGACTGCTGGAGAAGGTGGCCGCCAGGCCCGACCACCGTGAGTTGGCGCGCTG GGCGGCGCAACGCTCGATGGTGCTGCTCAAGAACGAAGGCGGGCTGCTGCCGCTTGCCAAGAGCCTGC AGAAGGTTGCCGTGCTCGGCCCACTGGCCGACTCGATGGCGGCCACCGAAGGATCGTGGATGGTCTTC GGCCATCAACCGTCTGCCGTGACCGTGCTGCAAGGCATTCGGGCCAAGCTGCCCGACGCCAATGTGCAG TACGCCCCCGGGCCGGATATCCGGCGCGATTTCCCCTCGTTCTTTGACGAACTCTTCTCGGAAGCCAAGA

AACCCGTCCAAACGCCCGCAGAGGCCGACGCAGCCTTGGCAACGGCCGTAGCAACTGCGCAGGCTGCC GACCTGGTCGTGATGGTGCTGGGCGAAGATGCCAACATGGCCGGCGAGTACGCCAGCCGCGGCTCGCT GGACTTGCCGGGCCGGCAGGAAGAACTGCTCAAGGCGGTCTGTGCGCTGGGCAAACCGGTGGTGCTG GTGCTGCTGAACGGCCGCCCGCTGAGCATCAACTGGGCAGCCGAACACGTGCCCGCCATTCTCGAAGCG TGGGAACCCGGCACGGAGGGCGGCAATGCGGTGGCCGACATTCTGTTCGGCGATGTCAACCCAGGCGG CAAGCTGCCTGTTACCTTTCCGCGCAGCGGCAGCCACGCGCCCATGTATTACGCGCACACGCTCAGCCAC CAGCCCGAGGGCCACCCGCAGTACACGTCACGCTACTGGGACAGCCCAACCTCGCCATTGTTCCCGTTT GGCTTTGGCCTCAGCTACACCAGCTTTGCGTTTAGCAACCTCACGCTGTCGGCCCCGCAGGTCAAGCTGG GCGCATCCCTCAGCGTGAACGCCGACGTGACCAACACCGGCCCGGTTGCCGGCGACGAGGTGGTGCAG TTGTACATCCACCAACGCTGGGGAACTGACACGCGCCCGATCCGCGAGTTGAAGGGTTTCCAGCGCATT ACCCTGCAGCCGGGAGAAACCAAGACGGTCAGCTTCCCGCTGGGGCCGGAGGAACTGCGCTACTGGAG CACGAATGCCGGCGCGTGGATTCAAGATGCCACAACTTTCGACGTGTGGGTTGGCAGTGACTCGCAGG CCACCCTGCACGCTGAATTTGAGGTGACTGCCTAG

Amino acid sequence: >PPSLIBM-04-P08_2_9
MSLEEKVAQLAQISGGDFMPGPKAADIIRKSGAGSVLWLNDTRRFNELQKIAVEESPSGIPVLFALDVIHGYR TIFPVPLAMASSWDPAVAEQAQTVAARETRAAGPHWTFGPMLDIARDARWGRIVEGAGEDPYLGAAMA AAQVRGFQGADLSDPERVLACAKHFAGYGAAEGGRDYDEVHLSETELRNTYFPPFEAAVKAGVGSFMGAY MDLNHVPASANRWLLRDMLRSEFGFEGFVVSDALAIGNLVIQGHARDKRDAALRALKAGMNMDMASGS YLENLADLVKDGSLSAEQIDEMVRPILAIKFKMGLFENPYVEEGLLEKVAARPDHRELARWAAQRSMVLLKN EGGLLPLAKSLQKVAVLGPLADSMAATEGSWMVFGHQPSAVTVLQGIRAKLPDANVQYAPGPDIRRDFPSF FDELFSEAKKPVQTPAEADAALATAVATAQAADLVVMVLGEDANMAGEYASRGSLDLPGRQEELLKAVCAL GKPVVLVLLNGRPLSINWAAEHVPAILEAWEPGTEGGNAVADILFGDVNPGGKLPVTFPRSGSHAPMYYAH TLSHQPEGHPQYTSRYWDSPTSPLFPFGFGLSYTSFAFSNLTLSAPQVKLGASLSVNADVTNTGPVAGDEVV QLYIHQRWGTDTRPIRELKGFQRITLQPGETKTVSFPLGPEELRYWSTNAGAWIQDATTFDVWVGSDSQAT LHAEFEVTA

Fw Primer - EcoRI HF: GTTACTTCGAATTCATGAGCCTGGAAGAGAAGG (Tm 63 °C)
Re Primer - HindIII HF: GTTACTTCAAGCTTCTAGGCAGTCACCTCAAATT (Tm 61 °C)
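The two primers listed for P04P08 appear to follow a common sub-cloning layout: a short 5' clamp, the restriction site, and an ORF-specific region (the reverse primer carrying the reverse complement of the ORF's 3' end, stop codon included). The sketch below reconstructs them under that assumption. The clamp `GTTACTTC` is read off the listed primers, the ORF end fragments are copied from the nucleotide sequence above, and the function names are hypothetical.

```python
# Minimal sketch: rebuild the sub-cloning primers from the ORF ends.
# Assumed layout (inferred from the listed primers, not stated in the text):
#   5' clamp + restriction site + ORF-specific region.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_primers(orf_5prime: str, orf_3prime: str,
                  fwd_site: str, rev_site: str, clamp: str):
    """Forward primer anneals to the ORF start; reverse primer to its end."""
    forward = clamp + fwd_site + orf_5prime
    reverse = clamp + rev_site + reverse_complement(orf_3prime)
    return forward, reverse


CLAMP = "GTTACTTC"    # 5' extension read off the listed primers
ECORI = "GAATTC"      # EcoRI recognition site
HINDIII = "AAGCTT"    # HindIII recognition site

# First 19 nt and last 20 nt of the P04P08 GH3 ORF, copied from the entry.
GH3_5PRIME = "ATGAGCCTGGAAGAGAAGG"
GH3_3PRIME = "AATTTGAGGTGACTGCCTAG"

fwd, rev = build_primers(GH3_5PRIME, GH3_3PRIME, ECORI, HINDIII, CLAMP)
print("Fw:", fwd)
print("Re:", rev)
```

Comparing the printed strings with the Fw and Re primers in the entry is a quick way to confirm that the restriction sites and annealing regions were transcribed correctly.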

2. P14I01 Glycoside hydrolase (GH) 1
ORF length: 1358 bp
Contig location: 3
Strand orientation: “+”
LCA taxonomy: Prokaryotes
Start position: 552 bp
End position: 1910 bp
ORF_ID: 2_1

Nucleotide Sequence: >PPSLIBM-14-I01_2_1 ATGCCCAGCTTTAACTTCCCGGCAGGCTTTCTATGGGGTTCTGCCACTGCTTCTTACCAGATTGAAGGCG CCGTCAACGAAGATGGTCGCAGCGAATCGATCTGGGACCGCTTCTCGCACACGCCCGGCAAGGTTCTTA ACGGAGACACCGGCGACGTTGCGTGCGACCATTACCACCGCTGGCGCGACGACGTAGCGCTGATGAAG TCGCTGGGCCTCAAAGCCTACCGCTTCTCGGTCGCGTGGCCGCGCATCTTGCCCAACGGCGCCGGCGAG GTCAACCAGAAGGGGCTGGACTTCTACAGCGCGCTGGTGGACGAGCTGCTGGCGGCGGGGATTACGCC GTTCGTCACCTTGTATCACTGGGATTTGCCGCAGGTGTTGCAGGATGCCGGCGGCTGGCCCGAGCGCGC CACCTGCGCCGCCTTTGTGGAGTATGCCGACGTGGTCAGCCGCCACTTGGGCGATCGTGTCAAGAACTG GATCACGCACAACGAGCCGTGGTGTGTCAGCTTCCTCAGCCATCAGATTGGCGAGCACGCGCCGGGGT GGAAGGATGACTGGATGGCGGCCTTCCGCGCCGCCCATCACGTGCTGCTGTCACACGGCCAGGCTGTG CCGGTGATCCGCGCCAACAGCGCCGGGGCCGAGGTCGGCATCGCGCTCAACTTCAGTTGGGTGGAAGC CGCTTCCTCCGCCGCCGCCGACCAAATGGCTGCGCGCTGGGCTGACGGCTATTCCAACCGCTGGTTCATC GACCCGGTGTATGGGCGGCGCTACCCGGCGGACATGGTGGAGGCGTTCACCACAGCCGGGCTGTTGCC CAACGGGTTGGACTTTGTGCAGCCGGGCGACATGGATGTGATCGCCACGCAGACGGACTTCTTGGGCG TCAACTACTACACGCGCGATGTGGTCAAGGCGCGAAGTGCGGAGACGCCGCTGCCCGAGCCGGCGCGC GAGGTTGCCACGTTGCCGCGCACCGAGATGGACTGGGAGGTCTACCCGGATGGGCTGTACAAGCTCTT GTGCCGCCTGTATTTTGACTATGACATTCCGAAGCTGTATGTGACGGAGAACGGCTGCAGCTACGGCGA TGGGCCGGGGGCCGACGGGGCCGTGCACGACAGGCGGCGCACCGAGTACCTGCGCAGCCACTTCCTG GCGGTGCATCGCGCCATGCTGGCGGGCGCGCCGGTGCAGGGGTATTTCGTGTGGTCGCTGCTGGACAA CTTCGAATGGGCCAAGGGGTATACGCAACGCTTTGGGATCGTGTGGGTGGACTACAACACGCAGCAGC GCATTCCCAAGGACAGCGCGCTGTGGGTCAAGCAAGTGATCGCCAATAACGGTTTCTA

Amino acid sequence: >PPSLIBM-14-I01_2_1
MPSFNFPAGFLWGSATASYQIEGAVNEDGRSESIWDRFSHTPGKVLNGDTGDVACDHYHRWRDDVALMK SLGLKAYRFSVAWPRILPNGAGEVNQKGLDFYSALVDELLAAGITPFVTLYHWDLPQVLQDAGGWPERATC AAFVEYADVVSRHLGDRVKNWITHNEPWCVSFLSHQIGEHAPGWKDDWMAAFRAAHHVLLSHGQAVPVI RANSAGAEVGIALNFSWVEAASSAAADQMAARWADGYSNRWFIDPVYGRRYPADMVEAFTTAGLLPNGL DFVQPGDMDVIATQTDFLGVNYYTRDVVKARSAETPLPEPAREVATLPRTEMDWEVYPDGLYKLLCRLYFDY DIPKLYVTENGCSYGDGPGADGAVHDRRRTEYLRSHFLAVHRAMLAGAPVQGYFVWSLLDNFEWAKGYTQ RFGIVWVDYNTQQRIPKDSALWVKQVIANNGF

Fw Primer - EcoRI HF: GTTACTTCGAATTCATGCCCAGCTTTAACTTCC (Tm 63 °C)
Re Primer - HindIII HF: GTTACTTCAAGCTTCTAGAAACCGTTATTGGCG (Tm 62 °C)
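Each entry pairs a nucleotide sequence with its predicted amino acid sequence, so the two can be cross-checked by translating the ORF in frame. The sketch below does this for the first codons of both ORFs using the standard genetic code. It is only an illustration: the function name is hypothetical, and the choice of the standard code table is an assumption, since the appendix does not state which table the ORF predictor used.

```python
# Minimal sketch: translate the first codons of each ORF and compare with
# the listed amino acid sequences. Standard genetic code is assumed.
from itertools import product

BASES = "TCAG"
# Amino acids in the conventional TCAG codon order (TTT, TTC, TTA, TTG, ...).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame DNA string; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Leading bases copied from the two nucleotide sequences in this appendix.
GH3_START = "ATGAGCCTGGAAGAGAAGGTCGCG"   # P04P08, ORF_ID 2_9
GH1_START = "ATGCCCAGCTTTAACTTCCCG"      # P14I01, ORF_ID 2_1

print("GH3:", translate(GH3_START))
print("GH1:", translate(GH1_START))
```

The printed peptides should match the first residues of the corresponding amino acid sequences. Running the same function over a full ORF would also expose frame shifts or truncations introduced during transcription of the sequence.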

Plasmid Information
Plasmid Name: pET-21 a(+)

Figure A.1: Plasmid pET-21 a(+) circular map (Addgene database)

Plasmid sequence (with selected EcoR I and Hind III cut sites highlighted): ATCCGGATATAGTTCCTCCTTTCAGCAAAAAACCCCTCAAGACCCGTTTAGAGGCCCCAAGGGGTTATGC TAGTTATTGCTCAGCGGTGGCAGCAGCCAACTCAGCTTCCTTTCGGGCTTTGTTAGCAGCCGGATCTCAG TGGTGGTGGTGGTGGTGCTCGAGTGCGGCCGCAAGCTTGTCGACGGAGCTCGAATTCGGATCCGCGAC CCATTTGCTGTCCACCAGTCATGCTAGCCATATGTATATCTCCTTCTTAAAGTTAAACAAAATTATTTCTAG AGGGGAATTGTTATCCGCTCACAATTCCCCTATAGTGAGTCGTATTAATTTCGCGGGATCGAGATCTCGA TCCTCTACGCCGGACGCATCGTGGCCGGCATCACCGGCGCCACAGGTGCGGTTGCTGGCGCCTATATCG CCGACATCACCGATGGGGAAGATCGGGCTCGCCACTTCGGGCTCATGAGCGCTTGTTTCGGCGTGGGTA TGGTGGCAGGCCCCGTGGCCGGGGGACTGTTGGGCGCCATCTCCTTGCATGCACCATTCCTTGCGGCGG CGGTGCTCAACGGCCTCAACCTACTACTGGGCTGCTTCCTAATGCAGGAGTCGCATAAGGGAGAGCGTC GAGATCCCGGACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAG TCAATTCAGGGTGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTA TCAGACCGTTTCCCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAG CGGCGATGGCGGAGCTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTG CTGATTGGCGTTGCCACCTCCAGTCTGGCCCTGCACGCGCCGTCGCAAATTGTCGCGGCGATTAAATCTC GCGCCGATCAACTGGGTGCCAGCGTGGTGGTGTCGATGGTAGAACGAAGCGGCGTCGAAGCCTGTAAA GCGGCGGTGCACAATCTTCTCGCGCAACGCGTCAGTGGGCTGATCATTAACTATCCGCTGGATGACCAG GATGCCATTGCTGTGGAAGCTGCCTGCACTAATGTTCCGGCGTTATTTCTTGATGTCTCTGACCAGACAC CCATCAACAGTATTATTTTCTCCCATGAAGACGGTACGCGACTGGGCGTGGAGCATCTGGTCGCATTGG GTCACCAGCAAATCGCGCTGTTAGCGGGCCCATTAAGTTCTGTCTCGGCGCGTCTGCGTCTGGCTGGCT GGCATAAATATCTCACTCGCAATCAAATTCAGCCGATAGCGGAACGGGAAGGCGACTGGAGTGCCATG TCCGGTTTTCAACAAACCATGCAAATGCTGAATGAGGGCATCGTTCCCACTGCGATGCTGGTTGCCAACG ATCAGATGGCGCTGGGCGCAATGCGCGCCATTACCGAGTCCGGGCTGCGCGTTGGTGCGGATATCTCG GTAGTGGGATACGACGATACCGAAGACAGCTCATGTTATATCCCGCCGTTAACCACCATCAAACAGGAT TTTCGCCTGCTGGGGCAAACCAGCGTGGACCGCTTGCTGCAACTCTCTCAGGGCCAGGCGGTGAAGGG CAATCAGCTGTTGCCCGTCTCACTGGTGAAAAGAAAAACCACCCTGGCGCCCAATACGCAAACCGCCTCT CCCCGCGCGTTGGCCGATTCATTAATGCAGCTGGCACGACAGGTTTCCCGACTGGAAAGCGGGCAGTGA GCGCAACGCAATTAATGTAAGTTAGCTCACTCATTAGGCACCGGGATCTCGACCGATGCCCTTGAGAGC 
CTTCAACCCAGTCAGCTCCTTCCGGTGGGCGCGGGGCATGACTATCGTCGCCGCACTTATGACTGTCTTC TTTATCATGCAACTCGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGC TGGAGCGCGACGATGATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTC GTCACTGGTCCCGCCACCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCCCACGG GTGCGCATGATCGTGCTCCTGTCGTTGAGGACCCGGCTAGGCTGGCGGGGTTGCCTTACTGGTTAGCAG AATGAATCACCGATACGCGAGCGAACGTGAAGCGACTGCTGCTGCAAAACGTCTGCGACCTGAGCAAC AACATGAATGGTCTTCGGTTTCCGTGTTTCGTAAAGTCTGGAAACGCGGAAGTCAGCGCCCTGCACCATT ATGTTCCGGATCTGCATCGCAGGATGCTGCTGGCTACCCTGTGGAACACCTACATCTGTATTAACGAAGC GCTGGCATTGACCCTGAGTGATTTTTCTCTGGTCCCGCCGCATCCATACCGCCAGTTGTTTACCCTCACAA CGTTCCAGTAACCGGGCATGTTCATCATCAGTAACCCGTATCGTGAGCATCCTCTCTCGTTTCATCGGTAT CATTACCCCCATGAACAGAAATCCCCCTTACACGGAGGCATCAGTGACCAAACAGGAAAAAACCGCCCT TAACATGGCCCGCTTTATCAGAAGCCAGACATTAACGCTTCTGGAGAAACTCAACGAGCTGGACGCGGA

TGAACAGGCAGACATCTGTGAATCGCTTCACGACCACGCTGATGAGCTTTACCGCAGCTGCCTCGCGCG TTTCGGTGATGACGGTGAAAACCTCTGACACATGCAGCTCCCGGAGACGGTCACAGCTTGTCTGTAAGC GGATGCCGGGAGCAGACAAGCCCGTCAGGGCGCGTCAGCGGGTGTTGGCGGGTGTCGGGGCGCAGCC ATGACCCAGTCACGTAGCGATAGCGGAGTGTATACTGGCTTAACTATGCGGCATCAGAGCAGATTGTAC TGAGAGTGCACCATATATGCGGTGTGAAATACCGCACAGATGCGTAAGGAGAAAATACCGCATCAGGC GCTCTTCCGCTTCCTCGCTCACTGACTCGCTGCGCTCGGTCGTTCGGCTGCGGCGAGCGGTATCAGCTCA CTCAAAGGCGGTAATACGGTTATCCACAGAATCAGGGGATAACGCAGGAAAGAACATGTGAGCAAAAG GCCAGCAAAAGGCCAGGAACCGTAAAAAGGCCGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTG ACGAGCATCACAAAAATCGACGCTCAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAG GCGTTTCCCCCTGGAAGCTCCCTCGTGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGC CTTTCTCCCTTCGGGAAGCGTGGCGCTTTCTCATAGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTC GTTCGCTCCAAGCTGGGCTGTGTGCACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACT ATCGTCTTGAGTCCAACCCGGTAAGACACGACTTATCGCCACTGGCAGCAGCCACTGGTAACAGGATTA GCAGAGCGAGGTATGTAGGCGGTGCTACAGAGTTCTTGAAGTGGTGGCCTAACTACGGCTACACTAGA AGGACAGTATTTGGTATCTGCGCTCTGCTGAAGCCAGTTACCTTCGGAAAAAGAGTTGGTAGCTCTTGAT CCGGCAAACAAACCACCGCTGGTAGCGGTGGTTTTTTTGTTTGCAAGCAGCAGATTACGCGCAGAAAAA AAGGATCTCAAGAAGATCCTTTGATCTTTTCTACGGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTA AGGGATTTTGGTCATGAGATTATCAAAAAGGATCTTCACCTAGATCCTTTTAAATTAAAAATGAAGTTTT AAATCAATCTAAAGTATATATGAGTAAACTTGGTCTGACAGTTACCAATGCTTAATCAGTGAGGCACCTA TCTCAGCGATCTGTCTATTTCGTTCATCCATAGTTGCCTGACTCCCCGTCGTGTAGATAACTACGATACGG GAGGGCTTACCATCTGGCCCCAGTGCTGCAATGATACCGCGAGACCCACGCTCACCGGCTCCAGATTTA TCAGCAATAAACCAGCCAGCCGGAAGGGCCGAGCGCAGAAGTGGTCCTGCAACTTTATCCGCCTCCATC CAGTCTATTAATTGTTGCCGGGAAGCTAGAGTAAGTAGTTCGCCAGTTAATAGTTTGCGCAACGTTGTTG CCATTGCTGCAGGCATCGTGGTGTCACGCTCGTCGTTTGGTATGGCTTCATTCAGCTCCGGTTCCCAACG ATCAAGGCGAGTTACATGATCCCCCATGTTGTGCAAAAAAGCGGTTAGCTCCTTCGGTCCTCCGATCGTT GTCAGAAGTAAGTTGGCCGCAGTGTTATCACTCATGGTTATGGCAGCACTGCATAATTCTCTTACTGTCA TGCCATCCGTAAGATGCTTTTCTGTGACTGGTGAGTACTCAACCAAGTCATTCTGAGAATAGTGTATGCG GCGACCGAGTTGCTCTTGCCCGGCGTCAATACGGGATAATACCGCGCCACATAGCAGAACTTTAAAAGT 
GCTCATCATTGGAAAACGTTCTTCGGGGCGAAAACTCTCAAGGATCTTACCGCTGTTGAGATCCAGTTCG ATGTAACCCACTCGTGCACCCAACTGATCTTCAGCATCTTTTACTTTCACCAGCGTTTCTGGGTGAGCAAA AACAGGAAGGCAAAATGCCGCAAAAAAGGGAATAAGGGCGACACGGAAATGTTGAATACTCATACTCT TCCTTTTTCAATATTATTGAAGCATTTATCAGGGTTATTGTCTCATGAGCGGATACATATTTGAATGTATTT AGAAAAATAAACAAATAGGGGTTCCGCGCACATTTCCCCGAAAAGTGCCACCTGAAATTGTAAACGTTA ATATTTTGTTAAAATTCGCGTTAAATTTTTGTTAAATCAGCTCATTTTTTAACCAATAGGCCGAAATCGGC AAAATCCCTTATAAATCAAAAGAATAGACCGAGATAGGGTTGAGTGTTGTTCCAGTTTGGAACAAGAGT CCACTATTAAAGAACGTGGACTCCAACGTCAAAGGGCGAAAAACCGTCTATCAGGGCGATGGCCCACTA CGTGAACCATCACCCTAATCAAGTTTTTTGGGGTCGAGGTGCCGTAAAGCACTAAATCGGAACCCTAAA GGGAGCCCCCGATTTAGAGCTTGACGGGGAAAGCCGGCGAACGTGGCGAGAAAGGAAGGGAAGAAA GCGAAAGGAGCGGGCGCTAGGGCGCTGGCAAGTGTAGCGGTCACGCTGCGCGTAACCACCACACCCG CCGCGCTTAATGCGCCGCTACAGGGCGCGTCCCATTCGCCA
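The highlighting of the EcoRI and HindIII cut sites does not survive in plain text, but the sites can be located again by scanning for their recognition sequences. The sketch below does this on a short excerpt of the multiple cloning region copied from the plasmid sequence above. The function name is hypothetical, and the scan covers only this excerpt, so it makes no claim about how often either enzyme cuts elsewhere in the plasmid.

```python
# Minimal sketch: locate restriction sites by plain substring search.
# MCS_EXCERPT is copied from the plasmid sequence listed in this appendix.

def find_sites(seq: str, site: str) -> list:
    """Return 0-based start positions of every occurrence of `site` in `seq`."""
    hits = []
    start = seq.find(site)
    while start != -1:
        hits.append(start)
        start = seq.find(site, start + 1)
    return hits


MCS_EXCERPT = "CTCGAGTGCGGCCGCAAGCTTGTCGACGGAGCTCGAATTCGGATCC"

SITES = {"HindIII": "AAGCTT", "EcoRI": "GAATTC"}

positions = {name: find_sites(MCS_EXCERPT, site) for name, site in SITES.items()}
for name, hits in positions.items():
    print(name, hits)
```

On the strand as listed, the HindIII site precedes the EcoRI site. This is consistent with the listing showing the strand complementary to the insert's coding strand, although that reading is an inference from the sequence rather than something stated in the appendix.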
