Reconstructing the evolution of xylose fermentation in Scheffersomyces stipitis

by

Kevin Correia

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Chemical Engineering and Applied Chemistry
University of Toronto

© Copyright 2019 by Kevin Correia

Abstract

Reconstructing the evolution of xylose fermentation in Scheffersomyces stipitis

Kevin Correia
Doctor of Philosophy
Graduate Department of Chemical Engineering and Applied Chemistry
University of Toronto
2019

Lignocellulose is a renewable feedstock that can be fermented to biofuels and biochemicals, but its use is limited by several techno-economic challenges, including xylose fermentation. Scheffersomyces stipitis has been identified as an efficient xylose fermenter, but does not ferment well in industrial conditions. The scientific community has been engineering Saccharomyces cerevisiae, which is preferred for industrial biotechnology, to ferment xylose to ethanol with the xylose reductase (XR)-xylitol dehydrogenase (XDH) genes from Sch. stipitis for 30 years; however, recombinant Sac. cerevisiae’s titres, rates, and yields for ethanol from xylose lag those of Sch. stipitis. This performance gap may be due to aspects of Sac. cerevisiae’s metabolism that hinder xylose fermentation, aspects of Sch. stipitis’ metabolism that enable xylose fermentation, or a combination of both. The focus of this thesis is to improve our understanding of budding yeast metabolism beyond Sac. cerevisiae, and ultimately to reverse engineer how Sch. stipitis’ metabolism has evolved to ferment xylose to ethanol efficiently, using improved genome annotations and metabolic modelling. To obtain higher-quality genome annotations, ortholog groups in 33 yeasts and fungi spanning 600 million years in Dikarya were predicted using OrthoMCL. Over 500 families were curated with phylogenetic reconstruction to resolve inconsistencies. These ortholog assignments are more accurate than existing ortholog databases, and span diverse yeasts. Next, the gains and losses of metabolic genes were reviewed to identify important and recurring events in the evolution of metabolism in budding yeasts. Duplications were found to play an important role in the evolution of metabolism, including changes in enzyme preference and localization. The curated pan-genome and functional annotations were used to reconstruct genome-scale metabolic networks of the 33 species; the reconstructions have more genomic and metabolic coverage than those made with previous methods.

Finally, the Sch. stipitis metabolic network was used to simulate xylose fermentation. NADH and NADP phosphatase (NADPase), an orphan enzyme in eukaryotes, were found to be critical to xylose fermentation in these metabolic flux simulations. Pho3.2p was the sole NADPase candidate showing expression during xylose fermentation; its activity was confirmed in vitro. Xylose fermentation evolved in the Scheffersomyces-Spathaspora clade following gene duplications for XR and an acid phosphatase.

This thesis provides a foundation for unravelling how metabolism has evolved in the yeast pan-genome.

Dedicated to my family, and especially to my late father, who was my first math and science teacher. Luis da Costa Correia (December 12, 1959-December 29, 2014)

Acknowledgements

My doctoral experience has been one of the most rewarding experiences in my life. I have been amazed at how much I have been able to learn during my time at the University of Toronto. First and foremost, I would like to thank Prof. Radhakrishnan Mahadevan for taking me on as a Master’s student and eventually allowing me to bypass to the Ph.D. program. I am thankful for the freedom he gave me to pursue my research interests. I also thank my reading committee for guiding me throughout my doctoral research: Prof. Grant Allen for challenging me to make my research more accessible, Prof. Amy Caudy for inviting me to her lab meetings, Prof. Elizabeth Edwards for sitting as an examiner in my defences, and Prof. Jack Pronk for sitting as my external examiner and for his helpful suggestions to improve my thesis. My labmates in the Laboratory for Metabolic Systems Engineering have been a great source of friendship and learning, including Peter Li, Pratish Gawand, Patrick Hyland, Chris Gowen, Kayla Nemr, Naveen Venayak, Shyam Srinivasan, Jeong Chan Joo, and Taeho Kim. I thank Prof. Alexander F. Yakunin and Anna Khusnutdinova for their help and guidance with enzymology; Prof. Hung Lee, Xin Wen, and Mehdi Dashtban for their help with Scheffersomyces stipitis genetics; and Dean Robson, Weijun Gao, and Daniel Tomchyshyn for their help with IT. I thank my family and for their support throughout my Ph.D. It was surprising to hear my parents ask when I would graduate because when I was growing up they always told me to stay in school. They should have specified a time limit. I thank Maiko Sugai, my better half and second brain, for her support, and Pierre Sawtell for his sense of humour. I would also like to thank the taxpayers of Ontario and Canada who fund research programs across the country. Direct financial support was provided by the Natural Sciences and Engineering Research Council of Canada from the Bioconversion project and the CREATE M3 scholarship.
A special thanks to all the failures and rejections that brought me here.

Contents

List of Tables
List of Figures
1 Abbreviations

2 Introduction
2.1 Motivation
2.2 Challenges and Objectives
2.3 Contributions
2.3.1 Pan-genome creation
2.3.2 Pan-genome analysis
2.3.3 Pan-genome-scale metabolic network reconstruction
2.3.4 Analysis of xylose fermentation in Sch. stipitis
2.3.5 Other contributions

3 Literature review
3.1 History and physiology of xylose fermentation in yeasts
3.1.1 Xylose catabolic pathways
3.1.2 Native pentose fermentation by yeasts
3.1.3 Engineering xylose fermentation in Saccharomyces cerevisiae
3.1.4 Opportunities to improve our understanding of xylose fermentation
3.2 Functional genome annotation
3.2.1 Biology is technology
3.2.2 Evolving machines with unknown origins
3.2.3 Homology and analogy in biology
3.2.4 Structural and functional genome annotation
3.2.5 Identifying orthologous proteins
3.2.6 Opportunities to improve functional genome annotation
3.3 Reconstruction and analysis of metabolism in silico
3.3.1 The worm: the first full-scale reconstructed network of biology
3.3.2 Genome-scale network reconstruction
3.3.3 Flux balance analysis
3.3.4 Scheffersomyces stipitis genome-scale network reconstruction and analysis
3.3.5 Opportunities to improve genome-scale network reconstruction and analysis for Scheffersomyces stipitis
3.4 Summary of the literature and synthesis
3.4.1 Xylose fermentation
3.4.2 Functional genome annotation
3.4.3 Genome-scale network reconstruction
3.4.4 Synthesis of xylose fermentation in yeasts, genome annotations, and metabolic modelling
3.5 Hypotheses, objectives, and organization of the thesis
3.5.1 Curation of the yeast pan-genome
3.5.2 Analysis of the yeast pan-genome
3.5.3 In silico metabolic network reconstruction for the yeast pan-genome
3.5.4 Reverse engineering xylose fermentation in Scheffersomyces stipitis

4 AYbRAH: an open-source ortholog database for yeasts and fungi
4.1 Abstract
4.2 Introduction
4.3 Methods
4.3.1 Initial construction of AYbRAH
4.3.2 AYbRAH curation
4.3.3 Annotating misidentified and unidentified proteins
4.3.4 Comparison of AYbRAH to existing phylogenomic databases
4.3.5 Subcellular localization prediction
4.3.6 Literature references
4.4 AYbRAH overview
4.4.1 The AYbRAH web portal
4.5 AYbRAH curation
4.5.1 Over-clustering by OrthoMCL
4.5.2 Under-clustering by OrthoMCL
4.6 Comparison of AYbRAH to other ortholog identification methods
4.6.1 BLASTP scoring metrics
4.6.2 Comparison of AYbRAH to well-established phylogenomic databases
4.7 Applications of a curated ortholog database
4.8 Conclusion

5 Reconstructing the evolution of metabolism in yeasts
5.1 Abstract
5.2 Introduction
5.3 Methods
5.3.1 Refinement of the yeast species topology
5.3.2 Reconstruction of gene duplications and losses
5.3.3 Flux balance analysis
5.4 Results & Discussion
5.4.1 Refined yeast species tree topology
5.4.2 Evolution of the pyruvate dehydrogenase bypass
5.4.3 Evolution of NADH dehydrogenase
5.4.4 Heteromerization of enzyme complexes
5.4.5 Ribosomal subunit duplications
5.4.6 Changes in enzyme localization via gene duplications
5.4.7 Redox cofactor changes in
5.4.8 Horizontal gene transfer
5.5 Conclusion

6 Fungi pan-genome-scale network reconstruction
6.1 Abstract
6.2 Introduction
6.3 Methods
6.3.1 Network reconstruction
6.3.2 Biomass equation formulations
6.3.3 Flux balance analysis
6.4 Results
6.4.1 Expanded genomic coverage in the fungal pan-GENRE
6.4.2 Expanded metabolic coverage in the fungal pan-GENRE
6.4.3 Amino acid yields vary across yeast strains
6.4.4 Comparison of pan-GENRE framework to CoReCo
6.5 Discussion
6.6 Conclusions
6.7 Data availability

7 Reverse engineering xylose fermentation in Scheffersomyces stipitis
7.1 Abstract
7.2 Introduction
7.3 Methods
7.3.1 Genome-scale network reconstruction and analysis
7.3.2 Cloning PHO3/PHO3.2, and transformation in Kom. phaffii
7.3.3 Agar acid phosphatase assay
7.3.4 General phosphatase assay with para-nitrophenyl phosphate (pNPP)
7.3.5 Protein expression and purification
7.3.6 Alcohol dehydrogenase enzyme assay
7.3.7 NADPase phosphatase enzyme assay
7.3.8 Phosphatase screen with natural substrates using the malachite green assay
7.3.9 Syntenic and phylogenetic analysis of XYL1 and PHO3.2 homologs
7.4 Results & Discussion
7.4.1 In vivo NADPH source in Sch. stipitis during xylose fermentation
7.4.2 Impact of XR cofactor preference on xylose fermentation
7.4.3 PHO3 and PHO3.2 characterization
7.4.4 The phylogenetic origin of xylose fermentation to ethanol in Scheffersomyces-Spathaspora
7.4.5 Cofactor balancing in metabolic pathways
7.5 Conclusion

8 Conclusions
8.1 Recommendations
8.1.1 Next steps for AYbRAH
8.1.2 Recommendations for other community-driven ortholog databases
8.1.3 Comparative genetics as a tool to understand physiology
8.1.4 Pan-genome-scale network reconstruction
8.2 Outlook
8.2.1 The future of genome annotation and curation
8.2.2 Lessons for the Metabolic Engineering field
8.2.3 Comparative genomics, genetics and physiology across the tree of life

Appendices

A Commission and omission errors in published yeast and fungal genome-scale network reconstructions
B Comparison of AYbRAH orthology assignments to published ortholog databases
C Sample webpages of AYbRAH portal for Acs
D phylogenetic reconstruction
E Placement of Ascoidea in the budding yeast species topology
F Distribution of BLASTP sequence similarity scores in yeast clades
G Composite biomass equations
H Reaction and ortholog counts by subsystem in the Dikarya pan-genome-scale network reconstructions
I Cloning, enzyme expression, optimization, and characterization of Pho3p and Pho3.2p
J UTR1 amino acid alignment
K Phylogenetic and syntenic analysis of PHO3.2 and XYL1 homologs
L Review and discussion of additional orthologs in yeast central metabolism

9 Bibliography

List of Tables

3.1 Biomass, ethanol, and polyol yields from xylose fermentation under varying aeration for Pachysolen tannophilus, Scheffersomyces shehatae, and Scheffersomyces stipitis [Ligthelm et al., 1988b]. Sch. stipitis has a robust ability to ferment xylose to ethanol with little to no polyol yield. The accumulation of ethanol during aerobic conditions suggests the fermentation conditions are not fully aerobic. No acetate was measured but it is an expected byproduct, especially for Pac. tannophilus [Jeffries, 1983].

3.2 Xylose dehydrogenase forward and reverse affinities for Pachysolen tannophilus, Scheffersomyces shehatae, and Scheffersomyces stipitis.

3.3 Manually reviewed and uncharacterized proteins in select reference proteomes in UniProt.

3.4 Homology and analogy definitions, and examples in biology and the automotive industry.

3.5 Functional annotations for proteins in the aldehyde dehydrogenase family.

3.6 Protein names for orthologs of NDE1, UTR1, PHO3, and ACS1 in 33 Dikarya yeasts and fungi. Sac. cerevisiae’s annotations are bolded.

3.7 BLASTP best hits for Saccharomyces cerevisiae ALD6 against Kluyveromyces lactis proteome, and its best hit against Sac. cerevisiae.

3.8 Phylogenomic databases used to identify homologous and orthologous proteins. The most popular methods are graph-based; they rely on sequence similarity and are scalable, but not specific. Tree-based methods are more computationally intensive and expected to be more accurate, but are not as scalable. Hybrid methods are often developed for specialized taxonomic groups.

3.9 Automatic and semiautomatic tools for genome-scale network reconstruction in prokaryotes and eukaryotes.

4.1 Fungal and yeast strain genomes in AYbRAH. Protein sequences were downloaded from UniProt or MycoCosm. Species were assigned to monophyletic or paraphyletic groups based on divergence time with Saccharomyces cerevisiae.

4.2 AYbRAH ortholog database statistics before and after curation. The initial ortholog assignments were obtained with OrthoMCL and OrthoDB. Additional proteins were annotated using TBLASTN. Ortholog groups for enzymes and small metabolite transporters were manually curated by visual inspection of homolog phylogeny, and by identifying ortholog groups with an ETE 3 script [Huerta-Cepas et al., 2016a]. Ortholog groups were modified by adding proteins to existing groups via pplacer [Matsen et al., 2010], or by collapsing homolog groups into a single ortholog group if there were no gene duplications in the homolog group (under-clustering).

4.3 Comparison of ortholog assignments between AYbRAH and well-established phylogenomic databases. OMA and PANTHER are the most congruous with AYbRAH. Bold numbers indicate the greatest source of incongruency with AYbRAH. OMA and PANTHER are predicted to have more under-clustered and over-clustered groups relative to AYbRAH, respectively. HOGENOM, eggNOG, and KO have a large number of proteins with no ortholog assignment.

6.1 Comparison of genes, reactions, and metabolites in yeast and fungal GENREs. AYbRAHAM GENREs have more genes and reactions than GENREs from manual curation or automatic reconstruction, with the exception of the reaction count in Sac. cerevisiae.

7.1 Xylose fermentation related genes in Scheffersomyces stipitis, including the xylose reductase (XR)-xylitol dehydrogenase (XDH) pathway, NADPH regeneration, and NADP phosphatase candidates. AYbRAH annotations, transcriptomics, proteomics, and functional characterization are outlined for all the genes.

7.2 Xylose fermentation genotypes and phenotypes for Debaryomyces hansenii, Suhomyces tanzawaensis, Spathaspora, and Scheffersomyces species. GenBank Assembly Accessions were used to analyze the synteny of XYL1 and PHO3.2 loci.

7.3 Estimated xylose reductase cofactor selectivity using various techniques in the literature.

B.1 Comparison of manually curated acetyl-Coenzyme A synthetase orthologs in AYbRAH to highly cited ortholog databases. N/A indicates genomes that do not have orthology relationships in the public database but have been assigned orthology with AYbRAH. Omission indicates genomes that have annotations in the public ortholog database but do not have an annotation for the given gene. PANTHER is the only database that can distinguish between the three ACS ortholog groups; ACS3 is assigned to a different PANTHER family despite the shared ancestry of all the orthologs. KEGG can only differentiate between ACS1 and ACS3 ortholog groups, while all other database orthology assignments are polyphyletic. EggNOG includes FOG07524 and FOG07525 in the ACS ortholog group, which both have predicted acetoacetate-CoA activity.

B.2 Comparison of manually curated Type II NADH dehydrogenase (NDH2) orthologs in AYbRAH to highly cited ortholog databases. N/A indicates genomes that do not have orthology relationships in the public database but have been assigned orthology with AYbRAH. Omission indicates genomes that have annotations in the public ortholog database but do not have an annotation for the given gene. PANTHER is able to distinguish between most orthologs in the NDH2 family, with the exception of NDE1 and NDE2; NDE0 is in a different PANTHER family than the rest of the NDH2 genes. KEGG is the only other database that can differentiate between some NDH2 genes; the genes are split between the older NDI0/NDE0 ortholog group and more recent NDE1/NDI1 ortholog group. PANTHER and EggNOG contain additional genes not included in other ortholog databases, which may represent ancient paralogs having lower sequence similarities than other NDH2 paralogs. AIF1, which can localize to the mitochondria in S. cerevisiae, is in the same subfamily as NDE0 in PANTHER; the other inconsistency is an Aspergillus niger gene (FOG07265) paralogous to a characterized external NADH dehydrogenase in Neurospora crassa (FOG07264) in EggNOG [Carneiro et al., 2007].

E.1 Count of orthologs shared by each yeast species with a defined clade.

E.2 Count of de novo genes shared by each yeast species with a defined clade.

I.1 Primer sequences for PHO3 and PHO3.2 inserts into pPICZα,B without their native signal peptides.

List of Figures

3.1 Maximum parsimony of xylose growth and fermentation in the CTG clade. Growth on xylose was hypothesized to be absent in the common ancestor of budding yeasts. Growth on xylose was gained in the CTG clade (node 1). Xylose fermentation was subsequently gained at node 2; it was independently lost at nodes 4 and 7. Xylose growth was lost in Lodderomyces elongisporus. Adapted from Wohlbach et al. [2011].

4.1 Ortholog database coverage for fungal and yeast genomes in AYbRAH, YGOB, CGOB, PANTHER, HOGENOM, KO, OMA, and eggNOG. Ortholog assignments based on the manual curation of sequence similarity and synteny are shown in green columns; tree-based methods in red columns; graph-based methods in blue columns; a hybrid graph and tree-based method in the purple column. Many ortholog databases are well represented in Saccharomycetaceae and the CTG clade, which had their genomes sequenced during the 2000s [Dujon, 2010]. AYbRAH has ortholog assignments for species in Pichiaceae, Phaffomycetaceae and several incertae sedis families, which are not well represented in other ortholog databases, as these yeasts were recently sequenced [Riley et al., 2016]. The well-established phylogenomic databases span other yeast species not shown in this phylogeny, but they mostly belong to Saccharomycetaceae or the CTG clade.

4.2 AYbRAH workflow for ortholog curation. 33 fungal and yeast proteomes were downloaded from UniProt and MycoCosm. BLASTP computed the sequence similarity between all proteins. OrthoMCL clustered the proteins into putative Fungal Ortholog Groups (FOGs) using the BLASTP results. FOGs were clustered into HOmolog Groups (HOGs) using Fungi-level homolog assignments from OrthoDB. Multiple sequence alignments for each homolog group were obtained with MAFFT, and 100 bootstrap phylogenetic trees were reconstructed with PhyML. The consensus phylogenetic trees for enzymes and transporters were reviewed and curated to differentiate between orthologs, paralogs, ohnologs, and xenologs.

4.3 Annotation features of a sample phylogenetic tree in AYbRAH. Square and circle leaves indicate protein sequences in Basidiomycota or Ascomycota, respectively. Leaf nodes are coloured based on taxonomic groups. Circle leaves are used for proteins with no paralogs in the same species, whereas sphere leaves are used to designate proteins with paralogs in the same species. Vertical bold lines indicate species-lineage expansions, which are sometimes called in-paralogs or co-orthologs [Remm et al., 2001]. Horizontal bold lines designate proteins from Sac. cerevisiae, which is the most widely studied eukaryote. Dashed lines indicate the most anciently diverged protein sequence in the ortholog group. Ortholog groups can be identified by colour groups to help the visual inspection of ortholog assignments. The leaf names include a three-letter species code and a sequence accession. Internal nodes are labelled with the bootstrap values from phylogenetic reconstruction with PhyML.

4.4 Localization predictions for internal NADH dehydrogenase (NDI1_YEAST) in AYbRAH. (A) Histogram plots are shown for mitochondrial localization predictions of Ndi1p orthologs by Predotar, TargetP, and MitoProt. (B) Transmembrane domain predictions computed for orthologous proteins by the Phobius web server.

4.5 Example of over-clustering by OrthoMCL with the hexokinase family and its curation in AYbRAH. A gene duplication of HXK2 in Pichiaceae led to the HXK3 paralog. HXK2 was subsequently lost in Ogataea parapolymorpha but maintained in Komagataella phaffii. OrthoMCL was unable to differentiate between the Hxk2p and Hxk3p orthologs. Both ortholog groups are also assigned to the same Fungi-level ortholog group in OrthoDB.

4.6 Example of under-clustering by OrthoMCL in the FLO8 ortholog group and its curation in AYbRAH. OrthoMCL dispersed the Flo8p proteins into multiple ortholog groups due to the low sequence similarity between the proteins. The proteins were merged into one ortholog group.

4.7 Distribution of BLASTP percent identities, logarithm of bit scores, and negative logarithm of expect-values for proteins orthologous to Saccharomyces cerevisiae. The bottom half of orthologous proteins in the Saccharomycotina outgroup and Saccharomycetaceae have percent identities of less than 40% and 58%, respectively; the bottom half of the expect-value range is more than 1e-60 and 1e-125 for the same groups. The wide and skewed distribution in the Saccharomycotina outgroup highlights the difficulty in making pairwise ortholog predictions with BLASTP results for proteins with more than 400 million years of divergence in Dikarya fungi; however, orthologs can be easily identified in the Saccharomycetaceae family because of their high sequence similarities and low expect-values.

5.1 Manually curated glycolytic orthologs (red columns), TCA orthologs (green columns), pyruvate metabolism-related orthologs (blue columns), and other metabolic orthologs (orange) mapped to a budding yeast species tree; a Basidiomycete yeast, Taphrinomycotina yeasts, and Pezizomycotina fungi were used as outgroups. Family/clade names are shown next to species names. Proto-yeast is the ancestor to all budding yeasts in Saccharomycotina; proto-fermenter is the first budding yeast to have used the PDH bypass following the duplication of ACS1 to ACS2, gaining the ability to ferment sugars to ethanol aerobically or with oxygen limitation; hetero-oligomer PFK denotes the transition from the ancestral homo-oligomeric PFK to the hetero-oligomeric PFK with altered allostery in budding yeasts; loss of ancestral tRNA(CUG) indicates the earliest known point for the loss of ancestral CUG tRNA in budding yeasts and subsequent reassignment in Pac. tannophilus and the CTG clade [Mühlhausen et al., 2016]. Ethanologenic/Crabtree-positive yeasts are labelled with red stars and citrogenic fungi/yeasts with yellow stars. Species with published genome-scale network reconstructions (GENREs) are labelled with green text. Gene names reflect Sac. cerevisiae orthologs except: GPM4, the cofactor independent phosphoglycerate mutase; AOX1, alcohol oxidase; MLS1.2, the peroxisomal malate synthase paralog from the ancestral cytoplasmic and peroxisomal malate synthase; OSM2, an uncharacterized fumarate reductase paralogous to OSM1 in Sac. cerevisiae; GAL10.2, UDP-glucose-4-epimerase orthologous to GAL10 in Sac. cerevisiae; IDP1.2, mitochondrial isocitrate dehydrogenase orthologous to IDP1 in Sac. cerevisiae; FUM2, an uncharacterized cytoplasmic fumarase; UGA1.2, a mitochondrial gamma-aminobutyrate transaminase; LYS21.2, a cytoplasmic homocitrate synthase; HXK3, a paralog of Sac. cerevisiae’s HXK2; ARO10.2, an uncharacterized transaminated amino acid decarboxylase. Arrows indicate the predicted direction of the gene duplication. The topology of our tree is based on recently published species trees [Mühlhausen and Kollmar, 2014, Riley et al., 2016, Shen et al., 2016], but with a modified topology for Bla. adeninivorans, Nad. fulvescens and Asc. rubescens to minimize the homoplasy of gene duplications. Ortholog columns are sorted by their earliest duplication (except for the gene families for GPM, CIT, OSM, GAL10, HXK) to show support for our species tree. Bla. adeninivorans’ topology is supported with OSM1 and ACS2; Nad. fulvescens’ placement with GAL10.2, PFK1, IDP1.2, ADH3; Asc. rubescens’ topology with AKR, ELO3, LYS21.2, and to a lesser extent FUM2 and UGA1.2. Kluyveromyces lactis is the only yeast in Saccharomycetaceae which is Crabtree-negative and can grow with xylose; it differs within its family by the gain of the GDP1 paralog and loss of THI3.

5.2 Manually curated electron transport chain orthologs (blue columns), acetyl-CoA-related orthologs (purple columns) and ribosomal protein subunit duplications mapped to a budding yeast species tree. See Figure 5.1 for shared annotations. Subphylum names are shown next to species names. Enzymes with predicted cytoplasmic and mitochondrial localizations are indicated with [c] and [m], respectively. Enzymes using NAD(H) and NADP(H) are designated with x and y, respectively. NUO, Complex I; STO1, alternative oxidase; NNT, membrane-bound transhydrogenase; NDI, internal NADH dehydrogenase; NDE, external NAD(P)H dehydrogenase; ALD, aldehyde dehydrogenase; PHK, phosphoketolase; ACL, ATP citrate lyase; ACS, acetyl-CoA synthetase; RPL, ribosome protein of the large subunit; RPS, ribosome protein of the small subunit; MRPL, mitochondrial ribosomal protein of the large subunit; MRPS, mitochondrial ribosomal protein of the small subunit. No yeasts have maintained alternative oxidase while having lost Complex I; however, Pac. tannophilus and Oga. parapolymorpha have both lost alternative oxidase yet maintain Complex I. Transhydrogenase is absent in Taphrinomycotina and Saccharomycotina, the two independent yeast lineages in this study. The only internal alternative NADH dehydrogenase (NDI0) that was present in proto-yeast is maintained in Lip. starkeyi within Saccharomycotina, while the older external NADH dehydrogenase (NDE0) is present in most clades except the CTG clade. NDE1, an external NADH dehydrogenase, is the only NDH2 ortholog maintained in all yeast lineages, although its activity with NADPH is not conserved in Sac. cerevisiae. Internal NADH dehydrogenase re-emerged as NDI1 from an NDE1 duplication, but was not maintained in the CTG clade or most of the Pichiaceae; NDI2 independently evolved in Nad. fulvescens. Lip. starkeyi is the only budding yeast that has kept all orthologs from the three cytosolic acetyl-CoA sources: ATP citrate lyase, acetyl-CoA synthase, and phosphoketolase. ATP citrate lyase and phosphoketolase both had duplications in Dikarya that led to hetero-oligomeric enzymes. Cytoplasmic NAD-dependent-aldehyde dehydrogenase was present in proto-yeast, but duplications led to mitochondrial localizations (ALD2.3, ALD5, ALD6.3) and NADP activity (ALD2.4, ALD5, ALD5.2, ALD6.1, ALD6.2, ALD6.3). ALD5.2 and ALD6.3 are recent duplications that are present in Dek. bruxellensis and Debaryomyces, two independent lineages of Crabtree-positive yeasts. Schizo. pombe, Nad. fulvescens, and Sac. cerevisiae are three Crabtree-positive yeasts that have significant independent duplications in their cytoplasmic ribosome subunits.

5.3 (A) Phylogenetic tree showing the three major ortholog groups in the acetyl-CoA synthetase family: ACS1, acetate-inducible acetyl-CoA synthetase; ACS2, glucose-inducible acetyl-CoA synthetase; ACS3, propionyl-CoA synthetase. (B) Metabolic pathways for acetyl-CoA production in Sac. cerevisiae via the PDH bypass and Yar. lipolytica via ACL. Gene annotations not conserved with Sac. cerevisiae’s genes: MCP, mitochondrial pyruvate carrier; PDH, PDH complex; CITm, mitochondrial citrate synthase; CITc, cytoplasmic citrate synthase; MDHm, mitochondrial malate dehydrogenase; MDHc, cytoplasmic malate dehydrogenase; CAT, carnitine o-acetyltransferase. (C) FBA predictions for protein and phospholipid yield from glucose with the PDH bypass (via NADP-ALD), PDH bypass (via NAD-ALD), and ATP citrate lyase using a GENRE for Sac. cerevisiae. Highest yields of protein and phospholipids occur with the PDH bypass via ALD6, which is absent in proto-yeast.

5.4 (A) Phylogenetic reconstruction of the type II membrane-bound NADH dehydrogenase family. Grey stars indicate the origin of duplications. Major ortholog groups are highlighted in different colors and designated as internal alternative NADH dehydrogenase with NDI and external alternative NADH dehydrogenase with NDE. (B) Organization of the ETC in Fungi. Complex I, NDE, NDI, and alternative oxidase are enzymes not conserved across all fungi in this study and are designated with red text with underlines. (C) Transmembrane posterior probability for aligned Sac. cerevisiae Ndi1p (NDI1_YEAST) and Nde1p (NDE1_YEAST). Ndi1p orthologs have C-terminal transmembrane predictions; Nde1p orthologs have N-terminal transmembrane predictions and sometimes lower C-terminal transmembrane predictions. (D) Biomass yield as a function of non-proton-translocating NDI1 and proton-translocating Complex I in the Sac. cerevisiae GEMS.

6.1 Non-conventional organism genome-scale network reconstruction (GENRE) using a model organism GENRE as a template versus databases from community-driven curation of a pan-genome, pan-reactome, pan-metabolome, and pan-phenome. Saccharomyces Genome Database (SGD) mines the literature to describe the physiology of Saccharomyces cerevisiae (a model organism). This knowledge base is leveraged to compile a Sac. cerevisiae GENRE, which is its state-of-the-art metabolic network; the gap between its true metabolic network and the state-of-the-art GENRE represents undiscovered and promiscuous enzymes. In the absence of a curated genome database for Scheffersomyces stipitis (a non-conventional organism), the Sac. cerevisiae GENRE is used as a template to guide the Sch. stipitis GENRE. Anchoring bias makes the Sch. stipitis GENRE skew towards the Sac. cerevisiae GENRE with commission errors; enzymes that have been characterized in Sac. cerevisiae, such as cytosolic NADP-dependent acetaldehyde dehydrogenase, have been included in past Sch. stipitis reconstructions despite no evidence based on orthology or enzymology. Availability bias prevents the GENRE curators from describing metabolism that is unique to Sch. stipitis and not well studied or documented, leading to omission errors; Sch. stipitis shares alkane hydroxylase orthologs with Candida tropicalis, a known alkane degrader [Lebeault et al., 1971], and yet alkane degradation has never been described in any Sch. stipitis or yeast GENRE. Community-driven curation of a pan-genome, pan-reactome, pan-metabolome, and pan-phenome reduces anchoring and availability biases by removing Sac. cerevisiae as the focal point for comparative metabolic network reconstruction, assigning orthology through rigorous phylogenetic analysis of gene families rather than error-prone methods, and capturing non-canonical reactions catalyzed by promiscuous enzymes.

6.2 Pan-genome-scale network reconstruction framework for improving the quantity and quality of GENREs. (A) A research community for a taxon, such as yeasts and fungi, curates its pan-genome, pan-reactome, pan-metabolome, and pan-phenome. Orthologs, paralogs, ohnologs, and xenologs are identified in the pan-genome by rigorous phylogenetic analysis. Reactions catalyzed by enzymes within a taxon are described in the pan-reactome and annotated with ortholog-protein-reaction associations (OPR). The presence and dynamics of metabolites are described in the pan-metabolome of a taxon; metabolites in the pan-metabolome not captured in the S-matrix can lead to new pathway discovery in the pan-reactome. Phenotypes within the taxon are transcribed into machine-readable formats to test, validate and improve genome-scale metabolic models. These databases are the basis for the pan-GENRE. (B) The pan-reactome is created from literature mining, reactions inferred from growth assays and Biolog, pathway analysis and gap-filling, in vitro enzyme characterization and crude enzyme assays, comparative genomics and comparative physiology. (C) Our Fungi pan-GENRE spans 600 million years of evolution for 33 yeasts and fungi. It was used as a template to compile GENREs for each taxonomic rank from sub-kingdom to strain. GENREs for any taxonomic rank can be curated by updating the AYbRAH ortholog and AYbRAHAM reaction databases. These changes are pulled to the Fungi pan-GENRE, and pushed to the lower taxonomic rank GENREs, enabling all the GENREs to be synchronized. A GENRE can be used for genome-scale metabolic modelling (GSMM) to test new pathways and reconcile in silico and in vivo data. Further GENRE curation can bridge the gap between in silico and in vivo data...... 78

6.3 Yeast and fungi GENRE statistics. GENRE names, genes, reactions, metabolites, subphylum, and family mapped to a yeast/fungi species tree...... 80

6.4 Heat map showing the yields of amino acids with glucose and ammonium as substrates for each strain genome-scale network reconstruction (GENRE). Schizosaccharomyces pombe, Lipomyces starkeyi, Nadsonia fulvescens, and Hanseniaspora valbyensis all have lower yields of glutamate, glutamine, glycine, histidine, leucine, methionine, and valine than the rest of the fungi and yeasts. Lip. starkeyi shares many genomic features with the proto-yeast, the first budding yeast, while the remaining yeasts independently evolved the Crabtree effect. Schizo. pombe and Han. valbyensis have reduced genomes and primarily rely on glucose fermentation to ethanol...... 82

7.2 (A) Proposed redox balancing during xylose fermentation in Scheffersomyces stipitis. NADH kinase regenerates NADPH; NAD(P)H flux drives xylose reductase (XR); NADP phosphatase (NADPase) dephosphorylates NADP to NAD; NAD is reduced to NADH by xylitol dehydrogenase (XDH). This redox balancing scheme is consistent with the 13C results from [Ligthelm et al., 1988c], is independent of oxygen availability, and does not have a loss of CO2 from the oxidative pentose phosphate pathway, but requires ATP. (B) Redox balancing during xylose fermentation in engineered Sac. cerevisiae with the XR-XDH pathway from Sch. stipitis. NAD kinase phosphorylates a fraction of the NAD pool for de novo NADP synthesis (dotted line); the oxidative pentose phosphate pathway regenerates NADPH; NAD(P)H drives XR. XDH regenerates NADH; NADH is reoxidized to NAD by the ETC. Under this redox balancing scheme, there is a loss of CO2 from the oxidative pentose phosphate pathway, oxygen is required to reoxidize NADH, and therefore xylose cannot be anaerobically fermented to ethanol at the maximum theoretical yield...... 92

7.1 Simplified map of xylose fermentation in Scheffersomyces stipitis and potential redox balancing mechanisms. Uncertainties in xylose fermentation are highlighted in red squares: (A) the impact of the redox cofactor imbalance on metabolism, (B) the in vivo XR cofactor preference, (C) the use of the succinate bypass to regenerate NADPH, (D) the presence of non-phosphorylating or phosphorylating glyceraldehyde 3-phosphate dehydrogenase (GAPDH) to regenerate NADPH, (E) the impact of bypassing Complex I (NUO) during xylose fermentation, (F) the ability of alternative oxidase (AOX) to oxidize NADH during xylose fermentation, and (G) the presence of novel redox balancing mechanisms...... 92

7.3 Ethanol yield as a function of NADPH source and oxygen uptake rate (OUR). There is a drop in the ethanol yield when the oxidative pentose phosphate pathway regenerates NADPH at OURs close to anaerobic levels. The highest ethanol yields were obtained with NADP phosphatase/NADH kinase, phosphorylating glyceraldehyde 3-phosphate dehydrogenase (GAPDH), and non-phosphorylating GAPDH. NADP-dependent isocitrate dehydrogenase and the succinate bypass were unable to ferment xylose to ethanol below 10 mmol · gDCW⁻¹ · h⁻¹...... 98

7.4 Xylitol yield sensitivity to oxygen uptake rate (OUR) and growth rate. (A) Simulations maximized xylose to ethanol with and without NADP phosphatase (NADPase) and NADH kinase. Anaerobic xylose fermentation is only feasible in silico when xylose reductase (XR) is driven by more than 80% NADPH. The in silico xylitol yield exceeds the 10% polyol yield typically observed in vivo when OUR is less than 2 mmol · gDCW⁻¹ · h⁻¹. The presence of NADPase and NADH kinase in the metabolic model eliminates xylitol yield at all OURs and XR cofactor selectivities. (B) Xylitol production envelope with and without NADPase and NADH kinase when XR is driven by 60% NADPH. OUR was constrained to 1 mmol · gDCW⁻¹ · h⁻¹. Simulations without NADPase/NADH kinase lead to xylitol accumulation at the optimal growth rate and at all suboptimal growth rates. Bypassing Complex I (NUO) only reduces the maximum growth rate and has a marginal decrease in the xylitol yield at the optimal growth rate. The presence of NADH kinase or NADPase alone does not reduce the xylitol yield to the experimental polyol range; however, the addition of both NADPase and NADH kinase enables the xylitol yield to fall within the experimental polyol range...... 100

7.5 Pho3.2p Michaelis-Menten kinetics with NADP as a substrate. Reaction conditions: 50 mM HEPES pH 7.5, 50 mM NaFormate, 20 μg formate dehydrogenase, 5 mM MgCl2, 0.5 mM MnCl2, and 2.9 μg pure phosphatase added. The enzyme assay was not optimized for Km or Vmax...... 102

D.1 Phylogenetic reconstruction of 6-phosphofructokinase. The ancestral ortholog group (red) and its paralog (blue)...... 131

F.1 Distributions of BLASTP percent identities for proteins identified as orthologous to Saccharomyces cerevisiae in AYbRAH...... 134

F.2 Distributions of logarithm BLASTP bitscores for proteins orthologous to Saccharomyces cerevisiae in AYbRAH...... 135

F.3 Distributions of negative logarithm of BLASTP expect-values for proteins orthologous to Saccharomyces cerevisiae in AYbRAH...... 136

H.1 Strain-level GENRE model statistics and reactions summarized by subsystem...... 139

H.2 Strain-level GENRE model statistics and genes summarized by subsystem...... 140

I.1 Acid phosphatase screen as described by Dorn [1965]. Three out of 45 colonies did not have any detectable acid phosphatase activity...... 142

I.2 Impact of purification method on phosphatase activity after 24 hours of growth on methanol. Sonication led to the highest activity of phosphatase in the supernatant...... 143

I.3 PAAG showing purified proteins from wild-type, PHO3, and PHO3.2-expressing mutants in Komagataella phaffii. The Komagataella phaffii Adh2p contaminant is also shown in the gel...... 144

I.4 Malachite green assay results for Pho3p and Pho3.2p (no replicates). Scale is % change in absorbance after five minutes. Pho3.2p has broader activity than Pho3p...... 144

K.1 Phylogenetic reconstruction of PHO3.2 (acid phosphatase) homologs in budding yeasts. PHO3.2 derived from a tandem duplication in a common ancestor of Suhomyces tanzawaensis, Scheffersomyces and Spathaspora species. The red leaves highlight the PHO3.2 paralogs. The purple leaves highlight an additional uncharacterized PHO3 or PHO3.2 paralog...... 151

K.2 Synteny of the PHO3 (blue) and PHO3.2 (red) loci in Suhomyces tanzawaensis, Scheffersomyces and Spathaspora species. A genomic inversion of PHO3 occurred in an ancestor of Suhomyces tanzawaensis. PHO3 is less conserved in Scheffersomyces species than PHO3.2...... 152

K.3 Phylogenetic reconstruction of XYL1 (xylose reductase) homologs in budding yeasts, Pezizomycotina fungi, and Saitoella complicata; the XYL1 ortholog appears to be absent in Basidiomycota [Mi et al., 2012]. Red leaves highlight the XYL1.2 paralog, which has NAD(P)H-dependent xylose reductase activity. Pachysolen tannophilus has an independent duplication of XYL1, which also led to NAD(P)H-dependent XR activity [Ditzelmüller et al., 1985]...... 153

K.4 Synteny of XYL1 (blue) and XYL1.2 (red) loci in Scheffersomyces and Spathaspora species. XYL1.2 originated from a tandem duplication of XYL1 upstream of trimethyllysine dioxygenase (FOG01414). XYL1 was subsequently lost in some Scheffersomyces species...... 154

Chapter 1

Abbreviations

Acs             Acetyl-CoA synthetase
Ald             Acetaldehyde dehydrogenase
Acl             ATP citrate lyase
AYbRAH          Analyzing Yeasts by Reconstructing Ancestry of Homologs
AYbRAHAM        Analyzing Yeasts by Reconstructing Ancestry of Homologs about Metabolism
BLAST           Basic Local Alignment Search Tool
BMGY            Buffered complex medium containing glycerol
BMMY            Buffered complex medium containing methanol
CGOB            Candida Gene Order Browser
CPPSS           CTG clade-Pichiaceae-Phaffomycetaceae-Saccharomycodaceae-Saccharomycetaceae
EDD             Experimental Data Depot
ETC             Electron transport chain
F1,6P           Fructose 1,6-bisphosphate
FBA             Flux Balance Analysis
FOG             Fungal Ortholog Group
FVA             Flux Variability Analysis
GAPDH           Glyceraldehyde 3-phosphate dehydrogenase
GENRE           GEnome-scale Network REconstruction
GPR             Gene-Protein-Reaction-association
HEPES           (4-(2-HydroxyEthyl)-1-PiperazineEthaneSulfonic acid)
HGT             Horizontal gene transfer
HOG             Homolog Group
KO              KEGG Orthology
MPC             Mitochondrial pyruvate carrier
NAD-Ald         NAD-dependent acetaldehyde dehydrogenase
NAD(P)-Ald      NAD(P)-dependent acetaldehyde dehydrogenase
NADP-Ald        NADP-dependent acetaldehyde dehydrogenase
NADP-GAPDH      NADP-dependent glyceraldehyde 3-phosphate dehydrogenase
NAD(P)-GAPDH    NAD(P)-dependent glyceraldehyde 3-phosphate dehydrogenase
NADPase         NADP Phosphatase
Nde             external alternative NADH dehydrogenase
Ndi             internal alternative NADH dehydrogenase
OPR             Ortholog-Protein-Reaction-association
PAAG            Polyacrylamide gel
pan-GENRE       Pan-genome-scale network reconstruction
Pdc             Pyruvate decarboxylase
PDH             Pyruvate dehydrogenase
Pfk             6-Phosphofructokinase
Phk             phosphoketolase
pNPP            para-nitrophenyl phosphate
PSS             Phaffomycetaceae-Saccharomycodaceae-Saccharomycetaceae
PTM             Post translational modification
RBH             Reciprocal best hit
ROS             Reactive oxygen species
RPKM            Reads per kilobase of transcript per million mapped reads
XI              Xylose isomerase
XDH             Xylitol dehydrogenase
XR              Xylose reductase
WGD             Whole Genome Duplication
YGOB            Yeast Gene Order Browser
YPD             Yeast extract-peptone-dextrose

Chapter 2

Introduction

Ethanol makes a beautiful, clean and efficient fuel... that can be manufactured from corn stalks, and in fact from almost any vegetable matter capable of fermentation... we need never fear the exhaustion of our present fuel supplies so long as we can produce an annual crop of alcohol to any extent desired

Alexander Graham Bell

2.1 Motivation

In the late 19th and early 20th centuries, the world was transitioning from the traditional biomass energy diet of the agricultural age to the fossil fuel binge of the industrial age [Smil, 2010]. Henry Ford, the American founder of the Ford Motor Company, often seen as the father of mass production, championed the strong integration of agriculture and industry. He constructed his first automobile, the Quadricycle, to run on 100% ethanol in 1896, and famously shared his vision that cars of the future would be powered by ethanol from fermented "fruit... weeds, sawdust - almost anything" [Ford, 1925]. Contrary to popular American folklore, the dream of transportation fuelled by ethanol was by no means Ford’s idea, just as he did not invent mass production. In 1826, Samuel Morey, an American inventor, developed an internal combustion engine prototype fuelled by camphine, a common lamp fuel of the time composed of ethanol and turpentine [Singh, 2013]. Nikolaus August Otto, a German engineer and father of the modern internal combustion engine, created an engine that could run on ethanol in 1860 [Singh, 2013]. Germany, France, and Britain promoted ethanol as a biofuel at various times from 1899 until the end of World War II with government research programs and mandated ethanol requirements in gasoline blends, because of oil scarcity in Europe and a wariness of the United States and Russia [Singh and Walia, 2016]. The dream of ethanol as a sustainable fuel was ultimately shattered by the strategic importance of crude oil in World War II, and the discovery of cheap oil reserves worldwide. It has since become a recurring dream, revived by the Arab Oil Embargo, the push for U.S. energy independence during the 2000s, and more recently the urgent action needed to curb greenhouse gas emissions.


Forestry residues and farm waste, such as sawdust, tree branches, wheat straw, and corn stover, are all composed of a polymeric material called lignocellulose. This material provides structure to most plant life. Lignocellulose has been viewed as an attractive feedstock for biofuels and biochemicals because it is renewable, it does not contribute to the food versus fuel debate, and it can reduce anthropogenic CO2 emissions [Singhvi et al., 2014]. Lignocellulose consists of three major fractions: lignin, a phenolic polymer that provides structure and hydrophobicity to the cell wall; cellulose, a glucose-based polysaccharide; and hemicellulose, a mixed polysaccharide linking cellulose to lignin [Pandey, 2011]. Cellulose accounts for the largest mass fraction in lignocellulose for all biomass sources, but the relative abundance of lignin and hemicellulose depends on the source. The mass fraction of xylose, the most abundant sugar in hemicellulose, can range from 5-10% in softwoods to 15-37% in hardwood and agricultural residues [Hahn-Hägerdal et al., 2001, Jeffries and Shi, 1999]. Regardless of the source, glucose and xylose are the most common sugars in lignocellulose.
Fermentation of lignocellulose to ethanol has been intensely researched for over a century, yet techno-economic and policy challenges remain for it to successfully compete with the fossil fuel industry at scale. These challenges include the economic harvesting and transportation of low density biomass over large areas [Miao et al., 2012]; processing and pretreatment of lignocellulose to make its surface more accessible for enzyme hydrolysis [Meng and Ragauskas, 2014, Kumar and Sharma, 2017]; enzymatic hydrolysis of cellulose and hemicellulose into fermentable sugars at high yields and productivities [Van Dyk and Pletschke, 2012]; techno-economic evaluation of consolidated versus sequential fermentations [Öhgren et al., 2007]; microbial tolerance to hydrolysate inhibitors [Jönsson et al., 2013]; the fermentation of xylose and other pentose sugars by microbial hosts at high yields and productivities [Jansen et al., 2017]; and the search for higher value products [Deepa et al., 2015, Gillet et al., 2017]. Research into each of these challenges has been conducted in Canada and around the world for over a century, with xylose fermentation being one of the most heavily studied problems.
Xylose fermentation to ethanol in yeasts was not known to exist until it was found in Pachysolen tannophilus in the early 1980s, by government researchers at the National Research Council of Canada [Schneider et al., 1981] and the United States Department of Agriculture [Slininger et al., 1982]. This discovery sparked an interest in xylose-fermenting yeasts in laboratories around the world, including Australia [Dekker, 1982], the Netherlands [Toivola et al., 1984], Germany [Dellweg et al., 1984], Japan [Morikawa et al., 1985], Sweden [Hahn-Hägerdal et al., 1985], Portugal [Lucas and Van Uden, 1985], France [Delgenes et al., 1986], New Zealand [Clark et al., 1986], and South Africa [Ligthelm et al., 1988b]. Scheffersomyces stipitis emerged as the most efficient xylose fermenter during this period [Bruinenberg et al., 1984, Slininger et al., 1985, Delgenes et al., 1986, Du Preez et al., 1986]. Research into the physiology of native xylose fermenters waned as advancements in genetics led scientists to engineer the XR-XDH pathway from Sch. stipitis into Saccharomyces cerevisiae [Kötter and Ciriacy, 1993], which is unable to naturally metabolize xylose but is the preferred industrial host for hexose fermentation to ethanol. The scientific community has never been able to engineer Sac. cerevisiae strains that match the yields or volumetric productivities of ethanol from xylose achieved by Sch. stipitis, despite expressing the same genes that enable xylose fermentation in Sch. stipitis. This fermentation performance gap may be due to aspects of Sac. cerevisiae’s metabolism that hinder xylose fermentation, aspects of Sch. stipitis’ metabolism that enable xylose fermentation, or a combination of both. The scientific literature is inundated with attempts to improve xylose fermentation in Sac. cerevisiae, while not thoroughly exploring how Sch. stipitis natively ferments xylose.
This thesis aims to bridge the genotype-phenotype gap for xylose fermentation in Sch. stipitis by studying the evolution of yeast metabolism beyond Sac. cerevisiae and Saccharomycetaceae. In biology it is easier to disrupt function in an existing network than it is to add function. Therefore, if we want to understand xylose fermentation it is best done through Sch. stipitis, because there are a finite number of genes that can be disrupted in Sch. stipitis to hinder its xylose fermentation, and an infinite number of gene changes that can be tested in Sac. cerevisiae to improve its xylose fermentation. The main objective of this thesis is to probe the metabolism of Sch. stipitis for enzymes that may promote xylose fermentation. The genome sequence of Sch. stipitis [Jeffries et al., 2007] provides the scientific community with the blueprint to reverse engineer how it can ferment xylose efficiently. By identifying all the known and putative enzymes in Sch. stipitis’ metabolism, it is possible to reconstruct its metabolic network in silico, generate hypotheses of how xylose is fermented with metabolic flux simulations, and test these hypotheses with gene knockouts in Sch. stipitis and enzyme characterization.

2.2 Challenges and Objectives

Today, we have the luxury of combining the in-depth xylose fermentation physiology studies from the genomic dark ages, mostly from the 1980s, with the plethora of budding yeast genome sequences of today’s genomics golden age. Genome-scale network reconstructions (GENREs) provide a means to bridge the genotype and phenotype of xylose fermentation with metabolic simulations. Many metabolic network reconstructions have been created for yeasts, including several for Sch. stipitis, but these often use the metabolic network of Sac. cerevisiae, the most widely studied yeast and eukaryote, as a template. The net result is that the reconstructed metabolic networks of these non-conventional yeasts resemble Sac. cerevisiae’s metabolism more than they reflect their own unique metabolism (Appendix A). Using Sac. cerevisiae’s metabolic network as a template to reconstruct the metabolism of Sch. stipitis is unlikely to shed light on how Sch. stipitis ferments xylose because this phenotype is absent in Sac. cerevisiae. In this thesis, I move beyond Sac. cerevisiae as the central focus of our understanding of yeast metabolism to study the evolution of metabolism in the Dikarya subkingdom. A group of 33 diverse yeasts and fungi in Dikarya spanning 600 million years of evolution were selected for the pan-genome; these species include well studied model organisms such as Neurospora crassa, Aspergillus niger, Schizosaccharomyces pombe, and Yarrowia lipolytica. Using this approach, we can increase the breadth and depth of our understanding of how metabolism has evolved in yeasts, and what genes may have evolved to enable xylose fermentation in Sch. stipitis. In Chapter 4, I curate the pan-genome of these yeasts and fungi into ortholog groups, with a focus on enzymes. In Chapter 5, I analyze the gains and losses of metabolic genes in the pan-genome to identify important and recurring events in the evolution of budding yeast metabolism.
In Chapter 6, I reconstruct the metabolic networks of the 33 yeasts and fungi using the curated pan-genome from Chapter 4, and the metabolic functions assigned to ortholog groups in Chapter 5. Finally, in Chapter 7, I test how various redox balancing mechanisms impact xylose fermentation in silico for Sch. stipitis, review expression data to understand xylose fermentation from the bottom-up, and analyze the genotypes of Sch. stipitis’ close relatives in the Scheffersomyces-Spathaspora clade to propose how redox cofactors are balanced during xylose fermentation. Although there are more challenges to studying Sch. stipitis metabolism than Sac. cerevisiae, it offers the best way to understand how xylose is metabolized efficiently.

2.3 Contributions

The goal of this thesis is to bridge the genotype and phenotype gap for metabolism in yeasts, with a focus on xylose fermentation in Sch. stipitis. The work in this thesis is outlined below in four parts: pan-genome curation, pan-genome analysis, in silico reconstruction of yeast and fungal metabolic networks, and analysis of xylose fermentation in Sch. stipitis.

2.3.1 Pan-genome creation

Current ortholog databases do not have assignments for diverse yeasts and fungi; furthermore, the ortholog databases that do exist for well-defined yeast clades sometimes have errors in their annotations. An ortholog database called AYbRAH was created using phylogenetic reconstruction to serve as the genomics foundation of this and future studies.

• K Correia, MY Shi, R Mahadevan. AYbRAH: a curated ortholog database for yeasts and fungi spanning 600 million years of evolution. Database: The Journal of Biological Databases and Curation, 2019. https://doi.org/10.1093/database/baz022

2.3.2 Pan-genome analysis

Little is known about the genomic evolution of budding yeasts beyond Saccharomycetaceae. The gains and losses of metabolic genes were analyzed in yeasts and fungi within the Dikarya pan-genome. Important and recurring events in the evolution of budding yeast metabolism are discussed, along with their impact on yeast physiology.

• K Correia, MY Shi, R Mahadevan. Reconstructing the evolution of metabolism in budding yeasts. bioRxiv, 237974.

2.3.3 Pan-genome-scale metabolic network reconstruction

Reconstructing genome-scale metabolic networks is a tedious process and the scientific community has been hampered by parallel reconstructions for the same species. A new framework is introduced to increase the quantity and quality of GENREs using AYbRAH. Previous pan-GENREs have been built at the genus level, but this work reconstructs metabolism at the kingdom level, as well as at lower taxonomic levels. This method results in reconstructions with more genomic and metabolic coverage than existing reconstructions.

• K Correia, R Mahadevan. Pan-genome-scale network reconstruction: a framework to increase the quantity and quality of metabolic network reconstructions throughout the tree of life. bioRxiv, 412593.

2.3.4 Analysis of xylose fermentation in Sch. stipitis

It is not known how Sch. stipitis balances its redox cofactors during xylose fermentation. Metabolic simulations suggest NADP phosphatase (NADPase), an orphan enzyme in eukaryotes, and NADH kinase are critical to balancing NADP(H) and NAD(H) under oxygen limitation. PHO3.2 was the only NADPase candidate showing evidence of expression in a proteomics study. NADPase activity was confirmed in Pho3.2p, but not Pho3p, using Komagataella phaffii as an expression host. PHO3.2 and XYL1 both originated from gene duplications that evolved in the Scheffersomyces-Spathaspora clade.

• K Correia, A Khusnutdinova, PY Li, JC Joo, G Brown, AF Yakunin, R Mahadevan. Flux balance analysis predicts NADP phosphatase and NADH kinase are critical to balancing redox during xylose fermentation in Scheffersomyces stipitis. bioRxiv, 390401.

Individual contributions:

• Metabolic model curation, and metabolic flux analysis.

• Phylogenetic and syntenic analysis of XYL1 and PHO3.

• Cloning PHO3 and PHO3.2 into pPICZalpha,B for extracellular expression in Komagataella phaffii.

• Transformation, screening of PHO3 and PHO3.2 expressing mutants in Kom. phaffii.

2.3.5 Other contributions

Additional contributions to other projects related to metabolic network reconstruction and modelling, bioinformatics, yeast genetics, and fermentation are listed below:

Peer-reviewed articles

• AN Khusnutdinova, R Flick, A Popovic, G Brown, A Tchigvintsev, B Nocek, K Correia, JC Chan, R Mahadevan, AF Yakunin. Exploring bacterial carboxylate reductases for the reduction of bifunctional carboxylic acids. Biotechnology Journal 12 (11), 1600751.

• Phylogenetic analysis. Reviewed the manuscript.

• K Raj, S Partow, K Correia, AN Khusnutdinova, AF Yakunin, et al. Biocatalytic production of adipic acid from glucose using engineered Saccharomyces cerevisiae. Metabolic Engineering Communications 6, 28-32.

• Designed and set up bottle fermentors to allow fermentations to switch from aerobic to anaerobic conditions.

• Reviewed the manuscript.

• VE Balderas-Hernandez, K Correia, R Mahadevan. Inactivation of the transcription factor mig1 (YGL035C) in Saccharomyces cerevisiae improves tolerance towards monocarboxylic weak acids: acetic, formic acid. Journal of Industrial Microbiology & Biotechnology, 1-17.

• Confirmed mig1 mutant by PCR sequencing, and its tolerance to acetic acid.

• Wrote and reviewed the manuscript.

• PH Wang*, K Correia*, HC Ho, N Venayak, K Nemr, R Flick, R Mahadevan, EA Edwards. Interspecies malate-pyruvate shuttle drives amino acid exchange in organohalide-respiring microbial communities. ISME, 2019.

• Curation of Dehalobacter sp CF metabolic model.

• Phylogenetic reconstruction and ortholog analysis.

• Analyzed metabolic flux with the metabolic model

• Discussed and reviewed experimental data with PH Wang.

• Wrote and reviewed the manuscript.

* Equal contributions

Preprints

• K Correia, H Ho, R Mahadevan. Genome-scale metabolic network reconstruction of the chloroform-respiring Dehalobacter restrictus strain CF. bioRxiv, 375063.

• Metabolic model curation.

• C Lieven, ME Beber, BG Olivier, FT Bergmann, P Babaei, JA Bartell, K Correia, C Diener, A Drager, BE Ebert, JN Edirisinghe, RMT Fleming, B Garcia-Jimenez, W van Helvoirt, C Henry, H Hermjakob, MJ Herrgard, Hyun Uk Kim, Z King, JJ Koehorst, S Klamt, E Klipp, M Lakshmanan, N le Novere, DY Lee, SY Lee, S Lee, NE Lewis, H Ma, D Machado, R Mahadevan, P Maia, A Mardinoglu, GL Medlock, J Monk, J Nielsen, LK Nielsen, J Nogales, I Nookaew, O Resendis, B Palsson, JA Papin, KR Patil, ND Price, A Richelle, I Rocha, P Schaap, RSM Sheriff, S Shoaie, N Sonnenschein, B Teusink, P Vilaca, JO Vik, JA Wodke, JC Xavier, Q Yuan, M Zakhartsev, C Zhang. Memote: A community-driven effort towards a standardized genome-scale metabolic model test suite. bioRxiv, 350991. [Under revision for Nature Biotechnology]

• Contributed code to compile partitioned biomass equations for protein, RNA, DNA, and lipids.

Chapter 3

Literature review

A couple of months in the laboratory can frequently save a couple of hours in the library.

Frank Westheimer

In this chapter, I review the history and physiology of xylose fermentation in wild-type and recombinant yeasts; how genomes are annotated, which is critical to identifying enzyme function; and GENREs and how they can be used to bridge the genotype-phenotype gap.
The importance of xylose is discussed, along with how it is currently known to be metabolized. The physiology of xylose fermentation is reviewed in Sch. stipitis and other yeasts. Attempts to engineer xylose fermentation in Sac. cerevisiae with the four natural catabolic pathways are described, with a focus on the XR-XDH pathway. The gap in fermentation performance between Sac. cerevisiae and Sch. stipitis is discussed, pointing to the possibility of a different redox balancing mechanism in Sch. stipitis.
The similarities and differences between biology and technology are described, along with how biology as a science is akin to reverse engineering. Homology and analogy, fundamental concepts in biology since Darwin’s time, are discussed, along with how homology has taken on new meanings with orthology, paralogy, and xenology in the molecular biology era. Several examples are shown to demonstrate how a handful of proteins are used to annotate millions of proteins by sequence homology, and how current practices can lead to genome annotation errors.
GENREs are discussed as a means to capture knowledge about an organism’s physiology in a structured format. This process relies on genome annotations. Various methods used to reconstruct these networks are described, including their advantages and disadvantages. Flux balance analysis (FBA) is introduced, along with how it can be used to predict metabolic fluxes using constraints and optimization. A brief overview of reconstruction and analysis of metabolism in Sch. stipitis is outlined.

3.1 History and physiology of xylose fermentation in yeasts

3.1.1 Xylose catabolic pathways

Most biomass on Earth is from plants [Bar-On et al., 2018], making lignocellulose the most abundant polymer on Earth, with glucose and xylose being the most abundant sugars. The availability of these sugars makes them prime candidate feedstocks that can be fermented to biofuels and biochemicals. Glucose is metabolized by many organisms via the Embden-Meyerhof-Parnas pathway (BioCyc: GLYCOLYSIS), the Entner-Doudoroff pathway (BioCyc: ENTNER-DOUDOROFF-PWY), the pentose phosphate pathway (BioCyc: PENTOSE-P-PWY) or the phosphoketolase pathway (BioCyc: P122-PWY). Glucose can be fermented to ethanol aerobically or anaerobically, depending on whether NADH or NADPH are completely reoxidized. On the other hand, xylose degradation pathways are not as conserved across organisms, and xylose is not easily fermented in the absence of oxygen in many organisms. Known xylose catabolic pathways include the xylose isomerase (XI) pathway (BioCyc: XYLCAT-PWY) [Schellenberg et al., 1984], the xylose reductase (XR)-xylitol dehydrogenase (XDH) pathway (BioCyc: PWY-5516) [Horitsu et al., 1968], the Weimberg pathway (BioCyc: PWY-8020) [Weimberg, 1961], and the Dahms pathway (BioCyc: PWY-294) [Dahms, 1974].
Mitsuhashi and Lampen [1953] first described XI in Lactobacillus pentosus. This enzyme, some forms of which use an FeS cluster, converts xylose to xylulose. XI was later isolated from higher plants [Pubols and Axelrod, 1959], and its protein has been characterized from barley [Kristo et al., 1996], Cereus pterogonus [Ravikumar and Srikumar, 2008], and Arabidopsis thaliana [Maehara et al., 2013]. XI was not known to exist in fungi until it was discovered in Piromyces sp. E2 (ATCC 76762), an anaerobic fungus in the Neocallimastigomycota phylum, isolated from elephant dung [Kuyper et al., 2003]. This pathway does not require oxygen to ferment xylose to ethanol.
The XR-XDH pathway has been characterized in fungi and yeasts and was first described by Chiang and Knight [1959]. XR converts xylose to xylitol, with NADPH as a cofactor in most fungi and yeasts. XDH converts xylitol to xylulose, with NAD+ as a cofactor.
This creates a redox cofactor imbalance when oxygen is limiting, preventing xylitol from being oxidized to xylulose. Xylulose enters the pentose phosphate pathway via xylulose kinase. The Weimberg pathway is found in archaea and bacteria, while the Dahms pathway has only been found in archaea. NAD+- or NADP+-dependent D-xylose dehydrogenases catalyze the first step in both pathways; these enzymes are not always homologous. The Weimberg pathway metabolizes xylose to α-ketoglutarate in five enzymatic steps. The Dahms pathway converts xylose to glycolaldehyde and pyruvate. Glycolaldehyde can be further metabolized by several pathways, including to malate via glycolate and glyoxylate.
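
The cofactor imbalance in the XR-XDH pathway can be made concrete with a small bookkeeping sketch. It assumes only the stoichiometry stated above (XR consumes one NAD(P)H per xylose, XDH produces one NADH per xylitol); the function name and the `nadh_fraction` parameter, the share of XR flux drawing on NADH, are hypothetical and introduced purely for illustration.

```python
def xr_xdh_cofactor_balance(nadh_fraction: float) -> dict:
    """Net cofactor change per mole of xylose converted to xylulose.

    nadh_fraction is a hypothetical parameter: the share of XR flux
    that uses NADH (the remainder uses NADPH). Positive values mean
    net production, negative values mean net consumption.
    """
    if not 0.0 <= nadh_fraction <= 1.0:
        raise ValueError("nadh_fraction must lie between 0 and 1")
    xr_nadh = -nadh_fraction          # XR: xylose + NADH -> xylitol + NAD+
    xr_nadph = nadh_fraction - 1.0    # XR: xylose + NADPH -> xylitol + NADP+
    xdh_nadh = 1.0                    # XDH: xylitol + NAD+ -> xylulose + NADH
    return {"NADH": xr_nadh + xdh_nadh, "NADPH": xr_nadph}


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, xr_xdh_cofactor_balance(fraction))
```

With a strictly NADPH-dependent XR (`nadh_fraction = 0`), each xylose leaves one surplus NADH that must be reoxidized elsewhere and one NADPH that must be regenerated. Only a fully NADH-linked XR closes the balance within the two steps.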

3.1.2 Native pentose fermentation by yeasts

Discovery of xylose fermenting yeasts

In the late 1890s, xylose was known to be assimilated by some yeasts and fungi [Tollens, 1898, Van Laer, 1898, Schellenberg, 1908], but not fermented under oxygen limitation. A.J. Kluyver corroborated these findings in his Ph.D. thesis Biochemische suikerbepalingen (Biochemical sugar determinations) [Kluyver, 1914, Bruinenberg et al., 1983a], in which he did not see any evidence of xylose fermentation in his limited tests. The XR-XDH pathway was found to metabolize xylose in fungi and yeasts. Xylose fermentation to ethanol was unknown until it was discovered in Pac. tannophilus in the early 1980s [Schneider et al., 1981, Slininger et al., 1982]. Additional quantitative and qualitative screening of yeasts in the 1980s found xylose fermentation in more yeasts. These include Kluyveromyces marxianus, Nakazawaea holstii (Hansenula holstii), Brettanomyces naardenensis, Pichia etchellsii, Yamadazyma mexicana (Candida terebra), Ambrosiozyma angophorae (Pichia angophorae), Schwanniomyces polymorphus (Debaryomyces polymorpha), Meyerozyma guilliermondii (Candida guilliermondii), Yamadazyma tenuis (Candida tenuis), Candida tropicalis, Scheffersomyces segobiensis (Pichia segobiensis), Scheffersomyces shehatae (Candida shehatae), and Sch. stipitis [Maleszka and Schneider, 1982, Toivola et al., 1984, Du Preez and Prior, 1985, Nigam et al., 1985]; surprisingly, most of these yeasts belong to the CTG clade [Kurtzman and Suzuki, 2010, Papon et al., 2014], a group of yeasts that translate the CUG codon as serine rather than leucine. The field then shifted its focus to improving the yield and volumetric productivity of ethanol production with process optimization [Slininger et al., 1985, 1990, du Preez, 1994]. Sch. stipitis emerged as the most efficient xylose fermenter in this period (Table 3.1). Xylose fermentation is rare in budding yeasts, and appears to have evolved independently in several lineages.

Table 3.1: Biomass, ethanol, and polyol yields from xylose fermentation under varying aeration for Pachysolen tannophilus, Scheffersomyces shehatae, and Scheffersomyces stipitis [Ligthelm et al., 1988b]. Sch. stipitis has a robust ability to ferment xylose to ethanol with little to no polyol yield. The accumulation of ethanol during aerobic conditions suggests the fermentation conditions are not fully aerobic. No acetate was measured but it is an expected byproduct, especially for Pac. tannophilus [Jeffries, 1983].

             Pachysolen tannophilus           Scheffersomyces shehatae         Scheffersomyces stipitis
Yield (g/g)  Aerobic  Oxygen-limited  Anoxic  Aerobic  Oxygen-limited  Anoxic  Aerobic  Oxygen-limited  Anoxic
Biomass      0.25     0.014           0.015   0.33     0.01            0.01    0.39     0.05            0.03
Ethanol      0.25     0.014           0.015   0.22     0.37            0.41    0.18     0.47            0.4
Xylitol      0.17     0.3             0.3     0.04     0.13            0.18    0        0.06            0
Glycerol     0        0.07            0       0        0.02            0.015   0        0.004           0.004
Arabitol     0        0.05            0       0        0.07            0       0        0               0
Ribitol      0        0               0       0        0               0       0.01     0.01            0.06

Xylose-fermenting yeasts closely related to Sch. stipitis have often been isolated from the gut of Odontotaenius disjunctus and Phrenapates bennetti, both of which are wood-ingesting passalid beetles [Suh et al., 2003, Urbina and Blackwell, 2012]. These yeasts are speculated to have a symbiotic role in the gut of these beetles [Suh et al., 2003], and may have evolved the rare ability to ferment xylose in the oxygen-limiting conditions of the gut. Spathaspora passalidarum, from a sister genus of Scheffersomyces, has recently been found to ferment xylose at higher yields than Sch. stipitis during anaerobic conditions [Hou, 2012, Su et al., 2015]. Additional yeasts in the Scheffersomyces-Spathaspora clade have been described to grow on xylose or ferment it to ethanol, and have had their genomes sequenced [Lopes et al., 2016, Morais et al., 2017, Lopes et al., 2018]. Other xylose-fermenting yeasts include Ogataea parapolymorpha (Hansenula polymorpha, Pichia angusta) [Ryabova et al., 2003] and Ogataea boidinii (Candida boidinii) [Vongsuvanlert and Tani, 1989, Vandeska et al., 1995], which are methylotrophic yeasts in the Pichiaceae clade, and members of the Sugiyamaella genus, which is closely related to Blastobotrys [Urbina et al., 2013, Morais et al., 2013, Sena et al., 2017].

Physiology of xylose fermenting yeasts

The most seminal works on understanding xylose fermentation are from a quadfecta of studies led by Bruinenberg from Scheffers’ lab at the Delft University of Technology. Bruinenberg et al. [1983c,b] published two joint papers: a theoretical analysis of the production and consumption of NADPH for various carbon and nitrogen sources in Cyberlindnera jadinii (Candida utilis), and an enzymatic analysis of NADPH sources in chemostat-grown Cyb. jadinii. Transhydrogenase was not known to be present in budding yeasts, which has since been confirmed by genome sequencing; its absence restricts the interconversion of NAD(H) and NADP(H) under oxygen limitation. These studies were followed up with an experiment in which Cyb. jadinii could only ferment xylose to ethanol when XI or acetoin was added to the media [Bruinenberg et al., 1983a]. This ingenious experiment provided conclusive evidence that the first two steps in xylose catabolism, XR and XDH, are what impede Cyb. jadinii from fermenting xylose to ethanol. First, extracellular XI converts xylose to xylulose, bypassing the intracellular cofactor imbalance. Second, supplemented acetoin is used as an external electron acceptor for NADH, allowing XDH to proceed under oxygen limitation. In the culminating study, xylose fermentation and the XR cofactor preference were studied in Cyb. jadinii, Pac. tannophilus, and Sch. stipitis [Bruinenberg et al., 1984]. NADH- and NADPH-dependent XR activity was found in Pac. tannophilus and Sch. stipitis, the xylose-fermenting yeasts, while Cyb. jadinii, which can grow on xylose but not ferment it, only had NADPH-dependent XR. Thus NADH-dependent XR was found to be essential for alleviating the cofactor imbalance in xylose-fermenting yeasts.

The discovery of NADH-linked XR activity in xylose fermenting yeasts by Bruinenberg et al. [1984] led researchers to characterize other XRs in vitro. Two distinct XR enzymes were discovered in Pac. tannophilus [Ditzelmüller et al., 1985, Verduyn et al., 1985b]: an XR expressed during aerobic conditions that is solely catalyzed by NADPH [Verduyn et al., 1985b], and a second XR having activity with NADH and NADPH which is responsible for ethanol fermentation [Ditzelmüller et al., 1985, Verduyn et al., 1985b]. XR in Sch. stipitis and Sch. shehatae were found to have higher activities with NADH than XR in Pac. tannophilus [Bruinenberg et al., 1984, Verduyn et al., 1985a, Ho et al., 1990]. This difference was assumed to account for their higher ethanol yields and specific ethanol productivities. Spa. passalidarum’s genome sequence revealed that it had two genes encoding XR, XYL1.1 and XYL1.2 [Wohlbach et al., 2011]. Xyl1.2p was expressed in Sac. cerevisiae and had higher activity with NADH than NADPH [Mamoori et al., 2013], which made it unlike previously studied XRs. Expression of XYL1.2 in Sac. cerevisiae led to a higher growth rate, higher ethanol yield, and lower xylitol yield than strains expressing the Sch. stipitis XR [Cadete et al., 2016]. Yeasts with XR with higher preference for NADH are consistently found to be efficient xylose fermenters.

The XR from xylose-fermenting yeasts can catalyze the reduction of xylose by NADH or NADPH in vitro, but the in vivo cofactor of these enzymes remains unknown. It has been proposed that xylose is mostly reduced by NADH under anaerobic conditions via XR in Sch. stipitis, and by NADPH under aerobic conditions [Ligthelm et al., 1988c, Dellweg et al., 1990]; however, no underlying mechanism has been provided to explain why XR would prefer NADH in vivo under certain conditions, despite higher activity with NADPH under all in vitro conditions. Several lines of evidence suggest NADPH is the preferred cofactor in vivo. First, Verduyn et al. [1985a] demonstrated that in vitro XR prefers NADPH to NADH when both cofactors are present in equimolar amounts. Second, the in vitro XR cofactor preference does not change when Sch. stipitis is grown at different aeration rates in bioreactors [du Preez et al., 1989, Skoog and Hahn-Hägerdal, 1990]; this eliminates the possibility of multiple XR enzymes or a post-translational modification of XR that might change its cofactor preference. Finally, the accumulation of xylitol by recombinant Sac. cerevisiae strains expressing the Sch. stipitis XR-XDH pathway suggests the apparent in vivo XR cofactor selectivity favours NADPH [Kötter and Ciriacy, 1993, Wahlbom et al., 2003]. Although NADH-dependent XR activity is present in all xylose-fermenting yeasts, indirect evidence indicates that XR mostly uses NADPH in Sch. stipitis during oxygen limitation. XDH kinetics were also characterized in xylose-fermenting yeasts [Ditzelmüller et al., 1984, Rizzi et al., 1989, Yang and Jeffries, 1990] (Table 3.2). Yang and Jeffries [1990] hypothesized that the lower affinity of XDH in Sch. stipitis could explain its lower xylitol yield relative to Sch. shehatae. No XDH enzymes have been found to have significant activity with NADP+.

Table 3.2: Xylitol dehydrogenase forward and reverse substrate affinities for Pachysolen tannophilus, Scheffersomyces shehatae, and Scheffersomyces stipitis.

Substrate affinity (mM)             Pac. tannophilus  Sch. shehatae  Sch. stipitis
Forward reaction  Km (xylitol)      70                18.5           26
                  Km (NAD)          0.1               0.24           0.16
Reverse reaction  Km (xylulose)     8.3               13.8           9.6
                  Km (NADH)         0.013             0.037          0.072
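
To give a feel for what the xylitol Km values in Table 3.2 imply, the sketch below evaluates the single-substrate Michaelis-Menten saturation term S / (Km + S). This is a deliberate simplification: XDH is a two-substrate enzyme, Vmax values are not given here, and the 10 mM xylitol concentration is an arbitrary illustrative value rather than a measurement.

```python
# Km values for xylitol (mM), copied from Table 3.2.
KM_XYLITOL_MM = {
    "Pac. tannophilus": 70.0,
    "Sch. shehatae": 18.5,
    "Sch. stipitis": 26.0,
}


def saturation(substrate_mm: float, km_mm: float) -> float:
    """Fraction of Vmax reached at a given substrate concentration,
    assuming simple single-substrate Michaelis-Menten kinetics."""
    if substrate_mm < 0 or km_mm <= 0:
        raise ValueError("concentrations must be non-negative and Km positive")
    return substrate_mm / (km_mm + substrate_mm)


if __name__ == "__main__":
    xylitol_mm = 10.0  # hypothetical intracellular xylitol concentration
    for species, km in KM_XYLITOL_MM.items():
        print(f"{species}: {saturation(xylitol_mm, km):.2f} of Vmax")
```

Under this simplification, an enzyme with a lower Km is closer to saturation at the same xylitol concentration. How this translates to flux depends on Vmax and cofactor levels, which the sketch does not model.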

Wohlbach et al. [2011] analyzed the genomes and physiologies of 14 Ascomycete yeasts to find candidate genes enabling xylose fermentation. This seminal work is reminiscent of the comparative approach used by Bruinenberg et al. [1984] to study xylose fermentation. In this study, Wohlbach et al. [2011] classified yeasts as non-xylose growing, xylose growing, and xylose-fermenting. XR and XDH transcripts were some of the most abundant during xylose fermentation, consistent with expression patterns in Sch. stipitis [Jeffries et al., 2007, Jeffries and van Vleet, 2009, Yuan et al., 2011]. HGT2, HXT2.4, and XUT1 all encode sugar transporters that had higher expression with xylose than glucose in most xylose-growing yeasts, but knockouts of these genes have not been studied in xylose-fermenting yeasts. HGT2 encodes a high-affinity sugar transporter, and had high expression during xylose fermentation for all the xylose-growing yeasts; this gene was the most abundantly expressed sugar transporter in the Sch. stipitis transcriptome [Yuan et al., 2011]. HXT2.4 and XUT1, which encode a low-affinity and a high-affinity sugar transporter, respectively, were expressed in all the xylose-growing yeasts, except for Yam. tenuis. Interestingly, genes encoding β-glucosidases and cellulases were expressed in the presence of xylose, even though the media did not contain cellulose; this Pavlovian-like transcriptional response supports the hypothesis that xylose-fermenting yeasts have co-evolved with wood-boring beetles digesting lignocellulose. Increased expression of the oxidative pentose phosphate pathway during xylose fermentation provides indirect evidence that XR uses NADPH in vivo [Wohlbach et al., 2011]; this is also consistent with expression patterns of ZWF1 in Sch. stipitis [Jeffries et al., 2007, Jeffries and van Vleet, 2009, Yuan et al., 2011], and increased NAD(H) kinase expression, encoded by UTR1, during xylose fermentation [Yuan et al., 2011]. Transcriptomics by Wohlbach et al. [2011] and Yuan et al. [2011] highlight the metabolic reprogramming from glucose to xylose fermentation, with NADPH playing a major role and common transporters expressed during xylose fermentation. Wohlbach et al. [2011] mapped the three xylose phenotypes onto a CTG species tree to reconstruct the phylogeny of xylose growth and fermentation, assuming maximum parsimony (Figure 3.1). Xylose growth was hypothesized to be absent in the budding yeast common ancestor but gained in the CTG clade at node 1. Xylose fermentation was gained at node 2 but lost at nodes 4 and 7. Xylose growth was lost in Lod. elongisporus. Comparative genomics was used to identify genes that may play a role in xylose growth and xylose fermentation. First, 43 orthologous genes were conserved in the xylose utilizers, but absent in yeasts unable to use xylose. Second, three orthologous genes were only present in Yam. tenuis, Sch. stipitis, and Spa. passalidarum, the xylose-fermenting yeasts. These two sets of genes were hypothesized to enable xylose fermentation but mostly belong to gene families with unknown functions.

[Figure 3.1: species tree of the CTG clade with internal nodes numbered 1-8 and tips C. lusitaniae, C. tenuis, D. hansenii, M. guilliermondii, S. stipitis, S. passalidarum, L. elongisporus, C. tropicalis, and C. albicans. Legend: gain of xylose growth; loss of xylose growth; gain of xylose fermentation; loss of xylose fermentation.]

Figure 3.1: Maximum parsimony of xylose growth and fermentation in the CTG clade. Growth on xylose was hypothesized to be absent in the common ancestor of budding yeasts. Growth on xylose was gained in the CTG clade (node 1). Xylose fermentation was subsequently gained at node 2; it was independently lost at nodes 4 and 7. Xylose growth was lost in Lodderomyces elongisporus. Adapted from Wohlbach et al. [2011].
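
The idea behind a maximum-parsimony reconstruction of a binary trait can be sketched with the bottom-up pass of Fitch's algorithm, which counts the minimum number of state changes needed on a fixed tree. This is an illustration of the general technique, not the procedure used by Wohlbach et al. [2011]: the tree, the tip names A to E, and the trait states below are hypothetical, and Fitch parsimony weights gains and losses equally.

```python
def fitch(tree, tip_states):
    """Bottom-up Fitch pass on a rooted binary tree.

    tree: a tip name (str) or a 2-tuple of subtrees.
    tip_states: dict mapping tip name -> observed state (e.g. 0 or 1).
    Returns (possible states at this node, minimum number of changes below it).
    """
    if isinstance(tree, str):
        return {tip_states[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, tip_states)
    right_states, right_changes = fitch(right, tip_states)
    shared = left_states & right_states
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_changes + right_changes
    # Children disagree: take the union and count one change.
    return left_states | right_states, left_changes + right_changes + 1


if __name__ == "__main__":
    # Hypothetical five-tip tree; 1 = trait present, 0 = trait absent.
    toy_tree = (("A", "B"), (("C", "D"), "E"))
    toy_states = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 0}
    root_states, changes = fitch(toy_tree, toy_states)
    print(root_states, changes)
```

For the toy data, the trait is inferred to be absent at the root and a single gain on the branch leading to C and D explains the tips. Changing the assumed tree topology or the tip states can change both the count and where events are placed, which is why the accuracy of the species tree matters for this kind of inference.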

Physiology of Scheffersomyces stipitis

Sch. stipitis is one of the most efficient xylose fermenters that has been discovered. Its physiology was heavily studied in the 1980s, but interest waned as the community sought to engineer xylose fermentation in Sac. cerevisiae, mostly with the XR-XDH genes from Sch. stipitis in the 1990s and 2000s.

The xylose fermentation products vary between yeasts, including Pac. tannophilus, Sch. stipitis, and Sac. cerevisiae engineered with the XR-XDH pathway from Sch. stipitis. Jeppsson et al. [1995] hypothesized that additional "metabolic events" could reoxidize NADH during oxygen limitation. Alternative oxidase, which is absent in Sac. cerevisiae and Pac. tannophilus, was postulated to resolve the redox cofactor imbalance by providing an additional sink for NADH reoxidation beyond the standard electron transport chain (ETC). Inhibition of alternative oxidase by salicylhydroxamic acid was found to marginally increase polyol yields in Sch. stipitis during xylose fermentation. Shi et al. [2002] tested the claim that alternative oxidase reoxidizes NADH during xylose fermentation by disrupting its gene in Sch. stipitis, STO1. Unlike chemical inhibition, which can introduce unintended effects, this knockout provided direct evidence of the enzyme’s impact on xylose fermentation. Unexpectedly, the sto1 mutant showed an increase in the ethanol yield from xylose in their fermentation conditions. These results rejected the hypothesis proposed by Jeppsson et al. [1995] that alternative oxidase plays a role in reoxidizing NADH during xylose fermentation in Sch. stipitis. Shi et al. [2002] also provided evidence that Complex I is bypassed during xylose fermentation in Sch. stipitis, but not during glucose fermentation. Disruption of STO1 restored the use of Complex I during xylose fermentation. No transcriptomics or proteomics studies support the downregulation or repression of Complex I subunits in Sch. stipitis [Yuan et al., 2011, Huang and Lefsrud, 2012]. External alternative NAD(P)H dehydrogenase (Nde), encoded by NDE1 in Sch. stipitis, was expressed during xylose fermentation [Huang and Lefsrud, 2012] and may provide an alternative mechanism to reoxidize NADH to NAD+ in the cytoplasm. Xylose fermentation in Sch. stipitis is suboptimal in comparison to glucose fermentation.
The preference of Nde, which is not itself linked to proton translocation, over Complex I may account for the suboptimal growth observed for xylose fermentation relative to glucose fermentation in Sch. stipitis [Ligthelm et al., 1988b, Shi et al., 2002]. It is not obvious what advantages bypassing Complex I for Nde would confer. The cytoplasmic succinate bypass was proposed to balance redox cofactors during xylose fermentation in Sch. stipitis by regenerating NADPH from NADH [Jeffries et al., 2007, Jeffries and van Vleet, 2009]. This mechanism was inferred from the increased expression of GDH2, encoding NAD+-dependent glutamate dehydrogenase, and UGA1, encoding 4-aminobutyrate aminotransferase, in microarray data. Characterization of GDH2 knockouts in Sch. stipitis found its role to be related to glutamate catabolism [Freese et al., 2011], and not glutamate anabolism, which does not support its proposed role in redox balancing.

3.1.3 Engineering xylose fermentation in Saccharomyces cerevisiae

Most strains of Sac. cerevisiae cannot grow on xylose [Wenger et al., 2010], despite having XR and XDH genes orthologous to those of known xylose utilizers, and the ability to ferment xylulose [Chiang et al., 1981]. Recent studies have found that intracellular xylose, but not extracellular xylose, can be sensed by Sac. cerevisiae [Brink et al., 2016]. The scientific community has attempted to engineer Sac. cerevisiae with all four known natural xylose catabolic pathways because it is the preferred industrial host for ethanol fermentation. Reviews of xylose fermentation in Sac. cerevisiae are frequently published as it is a heavily researched topic [Jansen et al., 2017, Kwak and Jin, 2017].

Xylose isomerase

Initial attempts to express bacterial XI genes from Bacillus and Actinoplanes did not result in any functional proteins in Sac. cerevisiae [Amore et al., 1989]. Walfridsson et al. [1996] successfully demonstrated xylose fermentation to ethanol by expressing XI from Thermus thermophilus; unexpectedly, the strain accumulated acetate and xylitol. Karhumaa et al. [2005] demonstrated that xylose catabolism could be improved with increased expression of enzymes converting xylose to xylulose, either via the XI or XR-XDH pathways, and of the non-oxidative pentose phosphate pathway. Despite these changes, Sac. cerevisiae engineered with XI still grew slower than strains expressing XR-XDH under aerobic conditions. The discovery of XI in Piromyces marked an important turning point in the relative performance of XI and the XR-XDH pathway in Sac. cerevisiae [Kuyper et al., 2003, 2005]. Xylose fermentation in Sac. cerevisiae with XI led to faster rates and higher yields than the XR-XDH pathway, but still did not match native xylose fermenters. Other labs have found success with fungal XI, with some help from adaptive laboratory evolution, including those of Hal Alper [Lee et al., 2012] and Gregory Stephanopoulos [Zhou et al., 2012]. More recently, manganese has been found to be an important cofactor in XI [Lee et al., 2017] and was found to be limiting in some strains [Verhoeven et al., 2017].

XR-XDH pathway

The XR-XDH pathway is one of the most studied pathways in metabolic engineering. It was first expressed in Sac. cerevisiae by Kötter et al. [1990] and Kötter and Ciriacy [1993], using genes from Sch. stipitis. Many labs have devoted multi-year efforts to improving the performance of this pathway in Sac. cerevisiae, including the labs of Hal Alper, Bärbel Hahn-Hägerdal, Maria F. Gorwa-Grauslund, Nancy Ho, T.W. Jeffries, and Yong-Su Jin. Recombinant Sac. cerevisiae strains expressing XR-XDH from Sch. stipitis accumulated significant amounts of xylitol during oxygen-limiting and anaerobic conditions [Kötter and Ciriacy, 1993, Sonderegger and Sauer, 2003]; this has been viewed as an anomaly since these genes appear to endow Sch. stipitis with the ability to ferment xylose to ethanol with little to no xylitol as a byproduct (Table 3.1). Wahlbom et al. [2003] demonstrated this striking difference in xylose fermentation between a recombinant Sac. cerevisiae strain and wild-type Sch. stipitis grown in bioreactors with decreasing aeration. Sch. stipitis had a robust ability to ferment xylose to ethanol during oxygen-limiting and anaerobic conditions with little to no xylitol accumulation, whereas Sac. cerevisiae accumulated significant xylitol and glycerol during these conditions. Rather than try to understand how Sch. stipitis ferments xylose, the metabolic engineering field has mostly sought to improve xylose fermentation in Sac. cerevisiae through random mutagenesis [Wahlbom et al., 2003, Ni et al., 2007], rational metabolic engineering, and inverse metabolic engineering [Bailey et al., 1996]. Rational metabolic engineering strategies for improving xylose fermentation in Sac. 
cerevisiae include: increasing expression of the non-oxidative pentose phosphate pathway [Walfridsson et al., 1995, Jin et al., 2005]; fine-tuning gene expression for XR, XDH, and xylulose kinase [Walfridsson et al., 1997, Ho et al., 1998, Eliasson et al., 2000, Johansson et al., 2001, Jin et al., 2003, Jin and Jeffries, 2003]; deletion or downregulation of the endogenous aldose reductases [Träff et al., 2001, Träff-Bjerre et al., 2004], encoded by GRE3 in Sac. cerevisiae; perturbing NADPH regeneration through the oxidative pentose phosphate pathway [Jeppsson et al., 2002, 2003], NADP-dependent aldehyde dehydrogenase (NADP-Ald) [Karhumaa et al., 2009, Kim et al., 2013b], NADP-dependent glyceraldehyde 3-phosphate dehydrogenase [Verho et al., 2003], and NADH kinase [Hou et al., 2009]; protein engineering to alleviate the cofactor imbalance between XR and XDH [Runquist et al., 2010]; expression of engineered transhydrogenase-like cycles to alleviate the redox cofactor imbalance [Suga et al., 2013]; and deletion of FPS1, which encodes a polyol transporter in Sac. cerevisiae [Wei et al., 2013b]. In a systems-level rational approach, Wei et al. [2013a] used acetate as an external electron acceptor to alleviate the cofactor imbalance between XR and XDH in Sac. cerevisiae. The ETC cannot reoxidize NADH under oxygen limitation in Sac. cerevisiae, preventing XDH from converting xylitol to xylulose. In this scheme, acetate, which is present as an inhibitor in lignocellulose hydrolyzates, was co-consumed with xylose in recombinant Sac. cerevisiae, and converted to ethanol via acetylating acetaldehyde dehydrogenase. The polyol yield decreased from 0.33 to 0.22 mol/mol xylose in Sac. cerevisiae. The use of external electron acceptors can increase the ethanol yield, but this strategy is not natively used by Sch. stipitis. In one notable inverse metabolic engineering study, Jin et al. [2005] transformed recombinant Sac. 
cerevisiae with genomic fragments from Sch. stipitis in a plasmid. This evolutionary engineering experiment provided Sac. cerevisiae with several options to improve xylose fermentation: increasing the expression of genes that are already in its genome but limited in their expression, expressing genes that have suppressor-like impacts, or expressing genes that are not present in Sac. cerevisiae’s genome. XYL3 and TAL1, encoding xylulose 5-kinase and transaldolase, were found to improve xylose fermentation. This observation is consistent with previous findings that increased expression of xylulose kinase and the non-oxidative pentose phosphate pathway can improve xylose fermentation. Ni et al. [2007] improved xylose fermentation in Sac. cerevisiae engineered with the XR-XDH pathway via mutations in PHO13 and the promoter of TAL1 using transposon mutagenesis. Both mutations led to increased expression of TAL1. Deletion of PHO13 has been consistently found to improve xylose fermentation in Sac. cerevisiae engineered with XR-XDH in other studies [van Vleet et al., 2008, Fujitomi et al., 2012, Kim et al., 2013b]. In a whole cell assay, Kim et al. [2013b] demonstrated that PHO13 may encode a xylulose 5-phosphatase, and therefore assumed it created a futile cycle between xylulose 5-kinase and xylulose 5-phosphatase. Recent studies suggest a different role, or multiple roles, for Pho13p. Collard et al. [2016] found that Pho13p and its mammalian homologs serve as metabolite repair enzymes, or guardian angel phosphatases [Beaudoin and Hanson, 2016], by which toxic glycolytic byproducts are eliminated. Sac. cerevisiae strains lacking Pho13p showed reduced sedoheptulose 7-phosphate levels [Xu et al., 2016]. This metabolite was first reported to accumulate in the first recombinant XR-XDH strain [Kötter and Ciriacy, 1993]. 
These mutants also revealed a transcriptional reprogramming with an increased expression of STB5 [Kim et al., 2015, Xu et al., 2016], a transcription factor involved in NADPH regeneration [Larochelle et al., 2006]. The direct role of Pho13p remains unknown, including its cascading effects on STB5, but Xu et al. [2016] have suggested studying this gene in Sch. stipitis. Despite decades of engineering the XR-XDH pathway in Sac. cerevisiae, these strains do not match the rates and yields of xylose conversion by native xylose fermenters, like Spa. passalidarum and Sch. stipitis.

Weimberg and Dahms pathways

The Weimberg pathway has been expressed in Sac. cerevisiae, using genes from Caulobacter crescentus [Wasserstrom et al., 2018]. The pathway was not functionally active. D-xylonate dehydratase, which requires FeS clusters [Rahman et al., 2018], was thought to be limiting. The Weimberg pathway provides a five-step shunt to α-ketoglutarate, an important intermediate in metabolism; reaching this intermediate would normally require xylose to pass through the pentose phosphate pathway, glycolysis, and the oxidative TCA cycle in the cytoplasm and mitochondria. Salusjärvi et al. [2017] reconstituted the Dahms pathway in Sac. cerevisiae using enzymes from multiple sources. This pathway provides a short-circuit route to glycolate or ethylene glycol. As with the Weimberg pathway, D-xylonate was found to accumulate. The authors speculated that insufficient FeS cluster assembly in the cytosol might limit D-xylonate dehydratase activity.

3.1.4 Opportunities to improve our understanding of xylose fermentation

Bruinenberg et al. [1984] identified NADH-dependent XR as a key determinant of xylose fermentation in yeasts, but not all xylose fermenters are equal (Table 3.1). Sch. stipitis ferments xylose to ethanol at a faster rate and higher yield than Pac. tannophilus, Yam. tenuis, and Sch. shehatae; Pac. tannophilus accumulates large amounts of acetate, which has not been described in other pentose fermenters [Jeffries, 1983]; recombinant Sac. cerevisiae strains accumulate different byproducts than Sch. stipitis [Wahlbom et al., 2003], despite expressing the same pathway. It is likely that each organism has a unique redox metabolism that hinders or supports xylose fermentation, beyond the XR cofactor preference. More can be done to understand how the metabolic networks of xylose fermenters have evolved. The in vivo redox cofactor preference of XR remains an unresolved aspect of xylose fermentation. In vitro methods have shown XR in Sch. stipitis has higher activity with NADPH than NADH [Verduyn et al., 1985a], even in the presence of equimolar NADH and NADPH, or with decreasing aeration in chemostats [Skoog and Hahn-Hägerdal, 1990]; increased expression of ZWF1 and UTR1 during xylose fermentation also indicates NADPH is the preferred cofactor. 13C [Ligthelm et al., 1988c] and metabolic modelling studies [Dellweg et al., 1990, Balagurunathan et al., 2012, Hilliard et al., 2018] have reported that xylose is preferentially reduced with NADH via XR during oxygen limitation. All the methods that have predicted XR preferring NADH in vivo rely on metabolic network reconstructions, since flux cannot be directly measured in most cases. The inconsistency between methods that predict XR having a preference for NADPH and those that predict a preference for NADH could be explained by an incomplete and inaccurate metabolic network reconstruction for Sch. stipitis. 
Metabolic simulations with varying XR cofactor preference and different metabolic networks can provide further evidence of how redox cofactors are balanced during oxygen limitation. Wohlbach et al. [2011] used a comparative genomics approach to identify genes common in xylose growers and xylose fermenters, but absent in yeasts unable to grow on xylose. These orthologous genes were hypothesized to enable xylose fermentation in Yam. tenuis, Spa. passalidarum, and Sch. stipitis. This hypothesis has implicit assumptions that were not addressed by Wohlbach et al. [2011]. Are there additional factors beyond NADH-dependent XR that enable xylose fermentation in Yam. tenuis, Sch. stipitis, and Spa. passalidarum [Bruinenberg et al., 1984]? Is the mechanism enabling xylose fermentation not conserved with glucose fermentation?¹ Can their ortholog assignments distinguish between gene duplications, even if one gene copy was lost? What is the accuracy of their species tree topology (Figure 3.1)? A top-down comparative genomics approach requires the selection of an appropriate taxon with a binary phenotype. Complex I in budding yeasts meets this criterion because it is unlikely that assembly proteins for Complex I would play a role in other processes, whereas xylose fermentation does not because it is not a binary phenotype, and multiple enzymes may impact redox balance. An alternative approach to Wohlbach et al. [2011] is to study the pan-genome and physiology of closely related yeasts with obvious differences in xylose fermentation. Yeasts in the Scheffersomyces-Spathaspora clade are well suited because these yeasts can all grow on xylose, and some of the most efficient xylose fermenters have been found in this clade [Veras et al., 2017]. This clade is one useful group that can be studied in more detail to unravel how redox metabolism is balanced in these yeasts.

¹ A gene promoting xylose fermentation that is present in Lodderomyces elongisporus would not show up in their list of candidates because Lod. elongisporus has lost the ability to metabolize xylose.
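
The top-down presence/absence filter discussed above can be sketched as a set operation over ortholog groups. It shows why a gene retained in a lineage that has lost the phenotype drops out of the candidate list. The species labels (F1 to F3 for fermenters, N1 and N2 for non-utilizers) and ortholog group names are hypothetical and do not correspond to any published dataset.

```python
def candidate_orthogroups(presence, with_trait, without_trait):
    """Return ortholog groups present in every species that has the trait
    and absent from every species that lacks it.

    presence: dict mapping ortholog group -> set of species carrying it.
    """
    return sorted(
        group
        for group, species in presence.items()
        if with_trait <= species and not (species & without_trait)
    )


if __name__ == "__main__":
    fermenters = {"F1", "F2", "F3"}      # hypothetical xylose fermenters
    non_utilizers = {"N1", "N2"}         # hypothetical yeasts lacking the trait
    presence = {
        "OG1": {"F1", "F2", "F3"},        # fits the strict pattern
        "OG2": {"F1", "F2", "F3", "N1"},  # retained in a species lacking the trait
        "OG3": {"F1", "F2"},              # missing from one fermenter
    }
    print(candidate_orthogroups(presence, fermenters, non_utilizers))
```

In this toy example only OG1 survives. OG2 is excluded even though every fermenter carries it, because one species without the phenotype has retained the gene; this is the situation described in the footnote above.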

3.2 Functional genome annotation

3.2.1 Biology is technology

The concept of "biology is technology" is currently in vogue with the rise of synthetic biology [Carlson, 2010]. This idea is certainly not new as organisms have often been referred to as machines [Thurston, 1895]. In this regard, biology should be taught in the Faculty of Engineering, rather than the Faculty of Science. There are only a handful of equations relevant to molecular biology, such as the Monod equation and Michaelis-Menten kinetics, and there are no real laws in molecular biology, except that evolution is a fundamental process. As Theodosius Dobzhansky, a Ukrainian-American evolutionary biologist, famously stated: "nothing in biology makes sense except in the light of evolution." One proponent of "biology is technology" has been Sydney Brenner, who has drawn parallels between organisms and both the Turing Machine and von Neumann’s self-reproducing machine [Brenner, 2012a,b]. He remarks that if physics was once called natural philosophy, biology should be called "natural engineering" [Brenner, 2012a]. Over the course of 3.5 billion years, evolution has given the world a diverse set of machines from the microscopic to macroscopic scale before humans invented their counterparts. These include the ribosome, nature’s 3D printer before humans invented it in the 1980s; ATP synthase, nature’s motor before Faraday demonstrated one in 1821; photosynthetic leaves able to harvest energy from the sun before Charles Fritts’ photovoltaic cells in 1883; the cornea, an adjustable lens crafted by Nature before it was invented in ancient Greece; and heavier-than-air flight before the Wright brothers demonstrated it in 1903. Xylose fermentation by Sch. stipitis can be thought of as a technology, much like the previously listed machines. Fundamentally, biology is the study of how these machines work and how they evolve(d) [Brenner, 2012b].

3.2.2 Evolving machines with unknown origins

Cars are often used in analogies to describe how organisms are like machines [Hood, 2003]. Cars and organisms have simple but unknown origins with the wheel and ribosome, respectively. The wheel was invented sometime around 3500 BC in Mesopotamia, while the ribosome has origins in the RNA world some 3500 million years ago. Cars are composed of thousands of parts that fit into integrated structural, mechanical, and electrical networks, just as organisms have thousands of proteins and RNA molecules that fit into integrated structural, metabolic, and transcriptional networks. Cars and organisms have always been evolving, leading to them having fitness advantages and niches. Horizontal technology transfer has led to drastic changes in cars and organisms; the use of horses in chariots and fossil fuels marked significant turning points in cars, while endosymbiosis drove genomic and phenotypic diversity in Eukaryotes. The common origin of cars and organisms means they have a phylogeny at the parts and machine-level.

One key difference between cars and organisms is that we use forward engineering to design cars, but we are left to reverse engineer nature’s tinkerings, which are not quite designed. Cars have parts with names and are designed for specific functions, while organisms have proteins with promiscuous functions that are always evolving; unlike car parts, proteins do not have names except for their chemical sequence [Brenner, 2002]. Researchers christen proteins after they have been characterized, although promiscuous activities can make these names less meaningful or misleading.2 Proteins that have not been characterized but are conserved in many organisms are designated as "conserved hypothetical proteins", but to the trained mechanic there are no conserved hypothetical automobile parts.

There is an unconcerted effort by the scientific community to characterize all the genes in Escherichia coli and Sac. cerevisiae, the lodestars of prokaryotic and eukaryotic molecular biology. In contrast, non-conventional organisms, such as Sch. stipitis, only have a handful of genes that have been functionally characterized. Table 3.3 outlines the number of proteins that have been manually curated and uncharacterized in select model organisms and Sch. stipitis in UniProt, a database of proteins and their functions. Sac. cerevisiae S288c and Sch. stipitis CBS 6054 currently have 6049 and 5797 proteins in UniProt, respectively. All of Sac. cerevisiae’s proteins have been manually curated in UniProt, but only 265 of Sch. stipitis’ proteins have been. The remaining proteins in Sch. stipitis are mostly annotated with homology-based methods [Jeffries et al., 2007], but are unreviewed by UniProt curators. In the Sac. cerevisiae and Sch. stipitis proteomes, 629 and 2314 proteins are annotated as "uncharacterized protein", respectively. This could indicate the proteins are truly uncharacterized and have unknown functions in both yeasts, or that they could be annotated by homology or orthology in Sch. stipitis but have not been included in the current genome annotation. In total, UniProt currently has 134 066 004 protein entries in its database. Only 558 681 (0.4%) have been reviewed manually, which means curators have reviewed a protein’s function either from experimental evidence or through sequence homology. Therefore, obedient but fallible bioinformatic programs ordain the vast majority of protein function through sequence homology.

Table 3.3: Manually reviewed and uncharacterized proteins in select reference proteomes in UniProt.

| Species | Proteome ID | Proteins | Manually Reviewed | Manually Reviewed (%) | Uncharacterized Proteins | Uncharacterized Proteins (%) |
|---|---|---|---|---|---|---|
| Saccharomyces cerevisiae | UP000002311 | 6049 | 6049 | 100% | 629 | 10% |
| Scheffersomyces stipitis | UP000002258 | 5797 | 265 | 5% | 2314 | 40% |
| Schizosaccharomyces pombe | UP000002485 | 5142 | 5141 | 100% | 635 | 12% |
| Caenorhabditis elegans | UP000001940 | 26 788 | 4035 | 15% | 12 174 | 45% |
| Japanese pufferfish | UP000005226 | 47 858 | 174 | 0.4% | 16 258 | 34% |
| Homo sapiens | UP000005640 | 73 101 | 20 395 | 28% | 1649 | 2% |
| Arabidopsis thaliana | UP000006548 | 39 380 | 15 743 | 40% | 2255 | 6% |
| Escherichia coli | UP000000625 | 4347 | 4347 | 100% | 541 | 12% |
| UniProt total | | 134 066 004 | 558 681 | 0.4% | 48 501 881 | 36% |

2 YKR043C was found to encode a metal-independent fructose-1,6-bisphosphatase in Sac. cerevisiae by Kuznetsova et al. [2010], but has since been rechristened as sedoheptulose bisphosphatase by Clasquin et al. [2011] based on its physiological role. It is more accurate to describe YKR043C as a gene that currently encodes a protein with sedoheptulose bisphosphatase activity, octulose bisphosphatase activity, and fructose-1,6-bisphosphatase activity, rather than a gene that encodes sedoheptulose bisphosphatase or fructose-1,6-bisphosphatase.

3.2.3 Homology and analogy in biology

Homology and analogy are cornerstone concepts in biology. These terms have their origins in mathematics, but became part of the modern biology lexicon through Richard Owen,3 a British palaeontologist and one of Charles Darwin’s contemporaries known for comparative anatomy of skeletons and organs. In 1843, he used the term "homologue" to describe "the same organ in different animals under every variety of form and function." To Darwin and Owen, the conserved position and structure of organs and skeletons showed common descent from an archetype, but with "modification" [Hubbs, 1944, Wilkins and Ebach, 2013]. A classic example of homology is the structure of the bird wing, bat wing, human arm, and whale pectoral fin. These appear different from the outside and have different functions, but a look at their limb structures shows they have the same arrangement. The humerus is found in the upper limb, while the radius and ulna are in the lower limb for these animals. The conserved bone structure of these limbs arose from their common origin in a sarcopterygian [Sanchez et al., 2014], a bony fish with four limbs. A shared phenotype does not necessarily mean that it was inherited from a common ancestor. The polar bear and Arctic fox are both white mammals, but their white fur came from local adaptation, rather than from a common ancestor with white fur. Features are said to be analogous when they arose from convergent evolution. Fitch [1970] extended the concept of homology in biology to orthology, paralogy, and xenology in molecular biology; he revisited these terms in the genomic era [Fitch, 2000]. Table 3.4 outlines examples of these terms and other relevant genomics terms in biology and in the automotive industry, which can help the reader gain a deeper understanding of these often confused terms. Proteins are orthologous if they originate from a common ancestor by speciation; the ortholog conjecture states these proteins are likely to have a conserved function. 
Proteins are paralogous and ohnologous4 if they arose from a single locus duplication and a whole genome duplication in a common ancestor, respectively; paralogs and ohnologs often undergo neofunctionalization or subfunctionalization, in which a protein gains a new function or inherits a specialized function of its ancestor, respectively [Conant and Wolfe, 2008]. Proteins are xenologous if they share ancestry but one was gained from horizontal gene transfer (HGT); these may or may not share function. Although biology and technology have many similarities, the concept of homology is unique and essential to biology because biology is a form of reverse engineering, and it is expensive to characterize all proteins.
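The definitions above can be summarized in a short sketch that maps the evolutionary event separating two homologous proteins to the corresponding term. The `Event` labels, the function name `classify_homologs`, and the event assigned to each example pair are illustrative assumptions that follow Table 3.4; they are not the output of any annotation tool.

```python
from enum import Enum


class Event(Enum):
    """Evolutionary event at the node where two homologs diverged (assumed labels)."""
    SPECIATION = "speciation"
    SINGLE_LOCUS_DUPLICATION = "single locus duplication"
    WHOLE_GENOME_DUPLICATION = "whole genome duplication"
    HORIZONTAL_GENE_TRANSFER = "horizontal gene transfer"


def classify_homologs(divergence_event: Event) -> str:
    """Map the divergence event of two homologs to the term used in the text."""
    terms = {
        Event.SPECIATION: "orthologs",
        Event.SINGLE_LOCUS_DUPLICATION: "paralogs",
        Event.WHOLE_GENOME_DUPLICATION: "ohnologs",
        Event.HORIZONTAL_GENE_TRANSFER: "xenologs",
    }
    return terms[divergence_event]


# Example pairs from Table 3.4; the event assigned to each pair is an assumption
# made for illustration, following the category the table places them in.
examples = {
    ("Kluyveromyces lactis XYL1", "Sac. cerevisiae GRE3"): Event.SPECIATION,
    ("Sac. cerevisiae ADH1", "Sac. cerevisiae ADH2"): Event.SINGLE_LOCUS_DUPLICATION,
    ("Sac. cerevisiae FRD1", "Sac. cerevisiae OSM1"): Event.WHOLE_GENOME_DUPLICATION,
}

for (gene_a, gene_b), event in examples.items():
    print(f"{gene_a} / {gene_b}: {classify_homologs(event)}")
```

In practice the divergence event is not known in advance; it has to be inferred, which is why the identification methods reviewed in Section 3.2.5 matter.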

3.2.4 Structural and functional genome annotation

A genome annotation is the process and product of identifying all structural coding and non-coding genes in an organism’s complete DNA sequence, and their function [Yandell and Ence, 2012]. There are two phases in structural genome annotation: the computational phase, which includes the assembly of long DNA sequence reads, repeat masking, and gene prediction from ab initio and evidence-based methods; and the annotation phase, where the coordinates are finalized by manual curation or automated programs [Yandell and Ence, 2012]. Methods for structural genome annotation include AUGUSTUS [Stanke et al., 2004], Yeast Genome Annotation Pipeline (YGAP) [Proux-Wéra et al., 2012], and mapping expressed sequence tags (ESTs) onto the genome sequence. Several pipelines are available for functional genome annotation, such as RAST [Aziz et al., 2008], UniProt [Consortium, 2012], and the JGI pipeline [Huntemann et al., 2015]. In the JGI pipeline, proteins are functionally annotated with Clusters of Orthologous Groups (COGs) [Galperin et al., 2014], KEGG Orthology (KO) [Kanehisa et al., 2011], MetaCyc [Caspi et al., 2015], Pfam [Finn et al., 2015], TIGRFAMs [Haft et al., 2012], and InterPro [Jones et al., 2014]. Functional annotations from pipelines can be thought of as a "first pass" annotation, which is later refined by curators.

3 Richard Owen famously termed Dinosauria (terrible lizards) for the dinosaur fossils that were being discovered around the world.
4 The term ohnolog was coined by Wolfe [2000], in honour of Susumu Ohno, a Japanese-American scientist who predicted that whole-genome duplications could play a role in evolution.

Table 3.4: Homology and analogy definitions, and examples in biology and the automotive industry.

| Term | Biological Definition | Biological Example | Analogous car example |
|---|---|---|---|
| Characters | Any genic, structural, or behavioural feature of an organism having at least two forms of the feature called character states. | Ability to ferment xylose; ability to fly. | Transmission type; model/make; tire types; fuel type; interior type. |
| Cenancestor | The most recent common ancestor of the taxa under consideration. | The first budding yeast to Lipomyces and Saccharomyces; Homo antecessor to Homo sapiens and Homo neanderthalensis. | The first internal combustion engine vehicle by Otto to the Model T and Toyota Corolla. |
| Homology | The relationship of any two characters that have descended, usually with divergence, from a common ancestral character. | UV eye cones in birds and red eye cones in humans; HXT1 and SUT1 genes encoding sugar transporters in Saccharomyces cerevisiae and Scheffersomyces stipitis. | 4 wheel drive in trains and SUV. |
| Orthology | The relationship of any two homologous characters whose common ancestor lies in the cenancestor of the taxa from which the two sequences were obtained. | Xylose reductase encoded by Kluyveromyces lactis XYL1 and Sac. cerevisiae GRE3; the structure of bird wings, human arm, whale pectoral fin, and bat wings. | 3-season tires from different tire manufacturers; 12 V car batteries; 12 V port in the main console of modern cars and car cigarette lighter port of older cars. |
| Paralogy | The relationship of any two homologous characters arising from a duplication of the gene for that character. | Adh1p and Adh2p in Sac. cerevisiae, which produce and consume ethanol. | Automatic Saturn seat belt and standard buckle seat belt; 3-season tires and studded tires. |
| Ohnology | A special case of paralogy describing the relationship of any two homologous characters arising from a whole genome duplication. | Fumarate reductases in Sac. cerevisiae, encoded by FRD1 and OSM1; mitochondrial pyruvate carrier subunits, encoded by MPC2 and MPC3. | N/A. |
| Xenology | The relationship of any two homologous characters whose history, since their common ancestor, involves an interspecies (horizontal) transfer of the genetic material for at least one of those characters. | URA1 in Saccharomycetaceae family, which encodes dihydroorotate dehydrogenase via fumarate. | Nickel metal hydride battery pack in the Toyota Prius; Bluetooth; radio antenna. |
| Analogy | The relationship of any two characters that have descended convergently from unrelated ancestors. | Flight between birds and bats; Crabtree effect in Schizosaccharomyces pombe and Sac. cerevisiae; xylose fermentation in Pachysolen tannophilus and Sch. stipitis; yeast formation in budding yeasts, fission yeasts, black yeasts. | Convertible cover and truck bed cover. |
| De novo gene | A gene that arises from non-coding DNA. | Phosphofructokinase subunit in Komagataella phaffii (A7MAS3). | Seat belts; windshield wipers. |
| Orphan enzyme | An enzyme that has biochemical evidence, or is thought to exist, but does not have a known genetic basis. | NADP phosphatase in eukaryotes. | Flux capacitor in the DeLorean time machine. |
As an example, functional annotations for proteins in the aldehyde dehydrogenase family are outlined in Table 3.5 for Sac. cerevisiae. KO annotations are more granular than Pfam and InterPro, but none accurately or fully describe protein function, including protein localization and redox cofactor preference. For example, all 10 genes belong to one family in InterPro and Pfam, whereas KEGG Orthology separates the genes into five families, except for MSC7, which has no KO annotation. Ald6p has been characterized as a cytosolic NADP-dependent aldehyde dehydrogenase, but it belongs to a KO family described as an NAD-dependent aldehyde dehydrogenase without a specified localization. No current annotation pipelines contain accurate and complete information on protein function, even for model organisms.

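Table 3.5: Functional annotations for proteins in the aldehyde dehydrogenase family.

| Gene | Protein name | Pfam family | Pfam e-value | InterPro accession | InterPro family | KEGG Orthology |
|---|---|---|---|---|---|---|
| ALD2 | Cytoplasmic NAD-dependent aldehyde dehydrogenase | Aldedh | 2.20E-169 | IPR016161 | Aldehyde/histidinol dehydrogenase | K00129 aldehyde dehydrogenase (NAD(P)+) |
| ALD3 | Cytoplasmic NAD-dependent aldehyde dehydrogenase | Aldedh | 1.00E-169 | IPR016161 | Aldehyde/histidinol dehydrogenase | K00129 aldehyde dehydrogenase (NAD(P)+) |
| ALD4 | Mitochondrial NAD(P)-dependent aldehyde dehydrogenase | Aldedh | 3.70E-182 | IPR016161 | Aldehyde/histidinol dehydrogenase | K00128 aldehyde dehydrogenase (NAD+) |
| ALD5 | Mitochondrial NAD(P)-dependent aldehyde dehydrogenase | Aldedh | 1.00E-184 | IPR016161 | Aldehyde/histidinol dehydrogenase | K00128 aldehyde dehydrogenase (NAD+) |
| ALD6 | Cytoplasmic NADP-dependent aldehyde dehydrogenase | Aldedh | 1.00E-184 | IPR016161 | Aldehyde/histidinol dehydrogenase | K00128 aldehyde dehydrogenase (NAD+) |
| HFD1 | Dehydrogenase involved in ubiquinone and sphingolipid metabolism | Aldedh | 9.00E-79 | IPR016161 | Aldehyde/histidinol dehydrogenase | K00128 aldehyde dehydrogenase (NAD+) |
| UGA2 | Cytoplasmic NADP-dependent succinate semialdehyde dehydrogenase | Aldedh | 1.80E-157 | IPR016161 | Aldehyde/histidinol dehydrogenase | K00135 succinate-semialdehyde dehydrogenase |
| MSC7 | Protein of unknown function | Aldedh | 6.60E-130 | IPR016161 | Aldehyde/histidinol dehydrogenase | None |
| PUT2 | Delta-1-pyrroline-5-carboxylate dehydrogenase | Aldedh | 7.60E-102 | IPR016161 | Aldehyde/histidinol dehydrogenase | K00294 1-pyrroline-5-carboxylate dehydrogenase |
| PRO2 | Gamma-glutamyl phosphate reductase | Aldedh | 1.40E-10 | IPR016161 | Aldehyde/histidinol dehydrogenase | K00147 glutamate-5-semialdehyde dehydrogenase |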

Table 3.6 outlines protein names for several orthologs in Dikarya. Despite being orthologous, these proteins have different names ranging from enzyme function (NAD+ kinase), enzyme family (uncharacterized kinase), and locus names (DEHA2C13464p), to uncharacterized protein. Although it is possible that orthologous proteins have different functions, the ortholog conjecture states that orthologous proteins are more functionally similar than paralogs. In other words, unlike the American justice system, bioinformatics operates on the principle of guilty by association until proven innocent. Current genome annotation practices therefore often fail to accurately annotate protein function.

3.2.5 Identifying orthologous proteins

Identifying orthologous proteins is a critical step in functional genome annotation. Several methods are used to identify orthologous or homologous proteins, including reciprocal best hit, phylogenomic databases and software. These methods are reviewed here for their advantages and disadvantages.

Table 3.6: Protein names for orthologs of NDE1, UTR1, PHO3, and ACS1 in 33 Dikarya yeasts and fungi. Sac. cerevisiae’s annotations are bolded.

| Gene | Protein name | Count |
|---|---|---|
| NDE1 | Uncharacterized protein | 12 |
| NDE1 | FAD/NAD(P)-binding domain-containing protein | 6 |
| NDE1 | Mitochondrial external NADH dehydrogenase, a type II NAD(P)H:quinone | 2 |
| NDE1 | External NADH-ubiquinone oxidoreductase 1, mitochondrial | 2 |
| NDE1 | NADH dehydrogenase | 1 |
| NDE1 | Probable NADH-ubiquinone oxidoreductase C3A11.07, mitochondrial | 1 |
| NDE1 | External alternative NADH-ubiquinone oxidoreductase, mitochondrial | 1 |
| NDE1 | Alternative NADH-dehydrogenase | 1 |
| NDE1 | Uncharacterized protein NDE1 | 1 |
| NDE1 | Predicted protein | 1 |
| NDE1 | ARAD1D20086p | 1 |
| NDE1 | DEHA2D07568p | 1 |
| NDE1 | KLLA0E21891p | 1 |
| NDE1 | KLTH0H15708p | 1 |
| NDE1 | ZYRO0A08228p | 1 |
| NDE1 | Aspergillus niger contig An08c0120, genomic contig | 1 |
| UTR1 | Uncharacterized protein | 14 |
| UTR1 | ATP-NAD kinase | 6 |
| UTR1 | NAD(+) kinase | 2 |
| UTR1 | NAD+ kinase Utr1 | 1 |
| UTR1 | Uncharacterized kinase C24B10.02c | 1 |
| UTR1 | Predicted protein | 1 |
| UTR1 | YALI0E27874p | 1 |
| UTR1 | DEHA2C13464p | 1 |
| UTR1 | KLTH0C03322p | 1 |
| UTR1 | KLLA0F16885p | 1 |
| UTR1 | ZYRO0F05302p | 1 |
| UTR1 | Aspergillus niger contig An03c0160, genomic contig | 1 |
| PHO3 | Uncharacterized protein | 8 |
| PHO3 | Sure-like protein | 3 |
| PHO3 | Acid phosphatase | 2 |
| PHO3 | 5’-nucleotidase | 2 |
| PHO3 | YALI0C19866p | 1 |
| PHO3 | YALI0A08217p | 1 |
| PHO3 | DEHA2C01012p | 1 |
| ACS1 | Acetyl-coenzyme A synthetase | 28 |
| ACS1 | Acetyl-coenzyme A synthetase 1 | 3 |
| ACS1 | Probable acetyl-coenzyme A synthetase | 1 |
| ACS1 | acetate–CoA ligase | 1 |

Reciprocal best hit.

The reciprocal best hit (RBH) method is a widely used method to annotate proteins, especially when using a model organism as a reference. Two proteins in two different organisms are identified as orthologous with RBH if they each return the other protein as a best hit in a Basic Local Alignment Search Tool (BLAST) search. A sample output of RBH is provided in Table 3.7 for Ald6p between Sac. cerevisiae and Kluyveromyces lactis. This method is useful for mapping uncharacterized genes to characterized genes, especially when there are few duplications in the gene family, or when there are shared duplications with no gene loss in the family. RBH was used to create the initial COG collection [Tatusov et al., 1997].

Table 3.7: BLASTP best hits for Saccharomyces cerevisiae ALD6 against Kluyveromyces lactis proteome, and its best hit against Sac. cerevisiae.

| Query sequence | Subject sequence | E-value | Identity | Bitscore |
|---|---|---|---|---|
| Ald6p | XP_454992.1 (best hit) | 0 | 74 | 783 |
| Ald6p | XP_455099.1 | 0 | 57 | 587 |
| Ald6p | XP_452546.1 | 0 | 51 | 546 |
| Ald6p | XP_453506.1 | 7.00E-132 | 43 | 390 |
| Ald6p | XP_453507.1 | 3.00E-128 | 42 | 380 |
| XP_454992.1 | Ald6p (reciprocal best hit) | 0 | 74 | 783 |
| XP_454992.1 | ALD4 | 0 | 61 | 624 |
| XP_454992.1 | ALD5 | 0 | 60 | 623 |
| XP_454992.1 | ALD2 | 5.00E-146 | 46 | 426 |
| XP_454992.1 | ALD3 | 3.00E-143 | 46 | 419 |

BLASTP searches return proteins with sequence similarity, but not sequence homology, to a query sequence. Therefore, the RBH method can lead to errors in identifying orthologous proteins because RBHs may be homologous, but not orthologous, to a query sequence, or have a sequence similarity without homology. These limitations can be illustrated with a pairwise comparison of the parts in a convertible car and a pickup truck, as a query and subject, respectively. In this example, we will assume that the convertible is a well-characterized model vehicle and a pickup truck is a non-conventional vehicle. RBH would identify the tires on the convertible and pickup truck as orthologous, regardless of the kind of tires installed on either vehicle. If the convertible has 3-season tires, the pickup truck would be assumed to have 3-season tires even if it has studded tires; this annotation would also impact our understanding of how the truck would drive in winter conditions. RBH would also identify the convertible top and truck bed cover as orthologous based on their structural similarity, despite no shared origin in their design or function. These examples highlight the drawbacks of pairwise methods like RBH in identifying orthologous proteins, which impact our understanding of physiology.
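The RBH criterion described above can be sketched in a few lines, using the bitscores reported in Table 3.7 as input. The function names and the dictionary layout are assumptions made for illustration; ranking hits by bitscore is one common choice, and a real analysis would parse BLASTP output rather than hard-code the scores.

```python
def best_hit(hits):
    """Return the subject with the highest bitscore from a {subject: bitscore} dict."""
    return max(hits, key=hits.get)


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    pairs = []
    for query_a, hits in a_vs_b.items():
        if not hits:
            continue
        candidate_b = best_hit(hits)
        reverse_hits = b_vs_a.get(candidate_b)
        if reverse_hits and best_hit(reverse_hits) == query_a:
            pairs.append((query_a, candidate_b))
    return pairs


# Bitscores from Table 3.7 (Sac. cerevisiae Ald6p against K. lactis, and back).
sce_vs_kla = {
    "Ald6p": {
        "XP_454992.1": 783, "XP_455099.1": 587, "XP_452546.1": 546,
        "XP_453506.1": 390, "XP_453507.1": 380,
    }
}
kla_vs_sce = {
    "XP_454992.1": {"Ald6p": 783, "ALD4": 624, "ALD5": 623, "ALD2": 426, "ALD3": 419}
}

rbh_pairs = reciprocal_best_hits(sce_vs_kla, kla_vs_sce)
print(rbh_pairs)
```

The sketch also makes the limitation visible: only the top-scoring pair survives, so the remaining aldehyde dehydrogenase homologs in either proteome are left without an assignment, and the top pair is reported as orthologous whether or not a phylogeny would support it.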

Phylogenomic and ortholog databases.

Ortholog databases, also called phylogenomic databases, have been created to help identify orthologous and homologous proteins across the tree of life. These databases can be categorized as graph-based methods that rely on sequence similarity to cluster protein sequences into ortholog and homolog groups; phylogenetic methods, which also require clustering, but use tree-based methods to distinguish between orthologs, paralogs, and xenologs; and hybrid methods that use sequence similarity, phylogenetic reconstruction, or other methods. Table 3.8 outlines commonly used databases for phylogenomics. Tree-based methods are expected to be more accurate than clustering-based methods since orthology is inherently defined by phylogeny. Phylogenomic databases overcome some of the limitations of RBH because multiple sequences are compared at once, whereas pairwise methods are more prone to errors.
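The idea behind graph-based clustering can be illustrated with a deliberately simplified sketch. Proteins are nodes, an edge is kept when a pairwise similarity score passes a threshold, and each connected component becomes a group. This single-linkage grouping is a stand-in chosen for brevity; it is not the algorithm used by OrthoMCL or any other database in Table 3.8, and the protein names and scores are invented.

```python
def cluster_by_similarity(scores, threshold):
    """Group proteins into connected components of a thresholded similarity graph.

    scores: {(protein_a, protein_b): similarity score}
    """
    neighbours = {}
    for (a, b), score in scores.items():
        neighbours.setdefault(a, set())
        neighbours.setdefault(b, set())
        if score >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    clusters, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical proteins from two species (spA, spB); all scores are invented.
toy_scores = {
    ("spA_ALD4", "spB_ALD4"): 700,
    ("spA_ALD6", "spB_ALD6"): 780,
    ("spA_ALD4", "spA_ALD6"): 620,  # paralogs within species A
    ("spA_ALD6", "spB_PRO2"): 40,   # distant family member
}

loose = cluster_by_similarity(toy_scores, threshold=500)
strict = cluster_by_similarity(toy_scores, threshold=650)
print(loose)
print(strict)
```

With the lower threshold the two paralogous pairs merge into a single group, while the higher threshold separates them; the outcome depends on a score cut-off rather than on the duplication history, which is the sense in which such methods are scalable but not specific.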

Table 3.8: Phylogenomic databases used to identify homologous and orthologous proteins. The most popular methods are graph-based; they rely on sequence similarity and are scalable, but not specific. Tree-based methods are more computationally intensive and are expected to be more accurate, but are not as scalable. Hybrid methods are often developed for specialized taxonomic groups.

| Database | Method | Reference |
|---|---|---|
| Clusters of Orthologous Groups (COG) | Graph | Tatusov et al. [1997] |
| eggNOG | Graph | Jensen et al. [2007], Huerta-Cepas et al. [2018] |
| OrthoDB | Graph | Kriventseva et al. [2007, 2018] |
| OrthoMCL | Graph | Li et al. [2003], Fischer et al. [2011] |
| Orthologous MAtrix (OMA) | Graph | Schneider et al. [2007], Altenhoff et al. [2017] |
| InParanoid | Graph | O’Brien et al. [2005] |
| RoundUp | Graph | DeLuca et al. [2006] |
| KEGG Orthology (KO) | Graph | Mao et al. [2005] |
| TreeFam | Tree | Li et al. [2006] |
| PANTHER | Tree | Thomas et al. [2003], Mi et al. [2016] |
| MetaPhOrs | Tree | Pryszcz et al. [2010] |
| PhylomeDB | Tree | Huerta-Cepas et al. [2007] |
| Yeast Gene Order Browser (YGOB) | Hybrid | Byrne and Wolfe [2005] |
| Candida Gene Order Browser (CGOB) | Hybrid | Maguire et al. [2013] |
| EnsemblCompara GeneTrees | Hybrid | Vilella et al. [2009] |
| Ortholuge | Hybrid | Fulton et al. [2006] |

PANTHER Database

Protein ANalysis THrough Evolutionary Relationships (PANTHER) is a curated database of protein families based on phylogenetic reconstruction. It was created in 1998 by Molecular Applications Group, acquired by Celera Genomics, the company that led the private human genome sequencing effort, but has since been spun off and is maintained at the University of Southern California [Thomas et al., 2003]. Homologous proteins are clustered into PANTHER families, which are then divided into subfamilies based on gene duplications or HGT. Functional annotations, such as GO ontologies, are assigned to nodes on the trees. PANTHER currently has 112 genomes, which span almost four billion years of evolution for bacteria, archaea, and eukaryotes. PANTHER has four budding yeast genomes: Yarrowia lipolytica, Candida albicans, Eremothecium gossypii, and Sac. cerevisiae.

Yeast Gene Order Browser

Yeast Gene Order Browser (YGOB) was created by Byrne and Wolfe [2005] using sequence similarity and synteny in closely related budding yeasts. The latest version of YGOB includes orthology information for 20 yeasts in the Saccharomycetaceae family. This pan-genome spans 100 million years of evolution, which includes the budding yeast Whole Genome Duplication (WGD). A similar database exists for yeasts in the CTG clade called Candida Gene Order Browser (CGOB). The YGOB dataset has served as a gold standard for comparing ortholog identification methods [Salichos and Rokas, 2011].

Evaluation of methods identifying orthologs

Salichos and Rokas [2011] evaluated MultiParanoid, OrthoMCL, RBH, and Reciprocal Smallest Distance (RSD) with manually curated assignments in YGOB. Simple algorithms perform better for low copy proteins, but all methods fail when paralogs are rampant. A comparison of OrthoMCL, KEGG, eggNOG, OrthoDB, and PANTHER demonstrates that most databases cannot distinguish between paralogs for ACS1 and NDE1 homologs (Appendix B). PANTHER can distinguish between orthologs and paralogs, despite having fewer genomes in its database.
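One simple way to score a method against a curated reference such as YGOB is to treat ortholog assignments as unordered pairs and compute precision and recall. The sketch below assumes this pairwise scoring for illustration; it is not necessarily the metric used in the cited evaluation, and the gene identifiers are hypothetical.

```python
def precision_recall(predicted_pairs, reference_pairs):
    """Score predicted ortholog pairs against a curated reference set.

    Pairs are treated as unordered, so (a, b) and (b, a) are the same pair.
    """
    predicted = {frozenset(pair) for pair in predicted_pairs}
    reference = {frozenset(pair) for pair in reference_pairs}
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical gene identifiers for two species.
reference_pairs = [("sceA", "klaA"), ("sceB", "klaB"), ("sceC", "klaC")]
predicted_pairs = [("klaA", "sceA"), ("sceB", "klaC")]  # second pair is a wrong call

precision, recall = precision_recall(predicted_pairs, reference_pairs)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

A family with many paralogs tends to lower both numbers at once, because a wrong partner is both a false positive and a missed true pair.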

3.2.6 Opportunities to improve functional genome annotation

The ortholog conjecture is important to biology because not all proteins can be directly characterized, and therefore must be annotated with sequence homology. Most methods assign function by clustering protein sequences by sequence similarity, or with RBHs in pairwise genome comparisons, even though orthology is inherently based on phylogenetic reconstruction. The lack of consistency in protein names for orthologous proteins shows the need to improve functional genome annotation (Table 3.6). Curated ortholog databases provide a valuable resource in identifying orthologous proteins, but these are biased towards model organisms and are not easily mapped to non-conventional organisms. PANTHER has a low genomic coverage over a wide evolutionary timeframe, whereas YGOB has a high genomic coverage over a short evolutionary timeframe. Mapping proteins from Sch. stipitis is akin to linear interpolation with a sparse curve for PANTHER, or linear extrapolation with a dense curve with YGOB. Therefore, existing curated ortholog databases cannot sufficiently annotate protein function in Sch. stipitis. Curation of homologous proteins into ortholog groups using phylogenetic reconstruction with more diverse yeast species than YGOB, CGOB, or PANTHER can be used to improve the functional genome annotation for Sch. stipitis and other non-conventional yeasts. Fungal genomes are essential to include in this pan-genome as an outgroup because they are closer to the archetype of the first budding yeast before it underwent gene loss and genome streamlining. Phylogenetic reconstruction is especially crucial for identifying gene duplications that may result in new gene function, which may be relevant to xylose fermentation in Sch. stipitis. These databases can be used to reconstruct the evolution of metabolism in budding yeasts.

3.3 Reconstruction and analysis of metabolism in silico

3.3.1 The worm: the first full-scale reconstructed network of biology

After the nascent molecular biology field cracked the Genetic Code in the 1960s, Brenner sought to close the gap between genes and behaviour. He chose Caenorhabditis elegans, a nematode, to study this problem because its development could be tracked under a microscope, and all of its cells could be counted. By reconstructing the nervous system, as specified by its genes, Brenner believed that he could simulate how inputs (food) get relayed into outputs (movement towards the food) through this network [White et al., 1986]. His team was able to reconstruct the nervous system into a network of precisely 302 neurons, but they were not able to predict behaviour with this "wiring diagram." Nonetheless, this work constitutes one of the significant steps in trying to reconstruct full biological networks for simulation, and not merely imitation [Brenner, 2010].

Brenner’s Nobel Prize Lecture in 2002 outlined his vision for computing biology with CellMap, which has obvious roots in his work with the nematode. Brenner has argued that merely reporting biological facts is not science, but a kind of journalism. Instead, biology must work towards developing models that predict biological facts. In this framework, biological parts, such as proteins, are characterized and fit into networks [Brenner, 2010], which can be used to predict and compute biology in silico. According to Brenner, if we understand all of biology, we could read the DNA sequence of a zebra zygote and simulate its development into a zebra, and not a horse. Brenner has lamented that the biology field has not adopted his CellMap idea, but elements have been assimilated by the metabolic modelling community.

3.3.2 Genome-scale network reconstruction

A genome-scale network reconstruction (GENRE) is a structured knowledge base of an organism’s physiology, derived from its genome sequence and available biochemical knowledge. The most common reconstruction is the metabolic network, which represents the interactions between metabolites and proteins. This reaction network is often transcribed as a stoichiometric matrix, denoted as the S matrix, of M × N dimensions, where M is the number of metabolites in the matrix, and N is the number of reactions. The first GENRE was developed for Haemophilus influenzae [Edwards and Palsson, 1999]. Thiele and Palsson [2010] described the gold standard protocol for network reconstruction, which involves a draft reconstruction, reconstruction refinement, conversion of the reconstruction into a computable format, and network evaluation.

Reconstructing the metabolic networks for non-conventional organisms, such as Sch. stipitis, is inherently more difficult than for model organisms, like Sac. cerevisiae, because non-conventional organisms do not have large research communities studying their physiologies and proteins, nor genome-scale knockout collections. Bioinformatics is required to functionally annotate their genomes, but Section 3.2 illustrates there is room for improvement. Current practices often lead to commission and omission errors in GENREs, where reactions present in model organisms are erroneously included in a non-conventional organism’s GENRE, and reactions present in a non-conventional organism are not included in its GENRE because they are absent in model organisms. Examples of these errors are provided in Appendix A.
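As a minimal illustration of how a reaction list is transcribed into the S matrix, the sketch below builds the matrix for a toy two-reaction network consisting of the XR and XDH steps, with NADPH and NAD+ assumed as cofactors and protons omitted for brevity. The reaction dictionary, metabolite names, and the function `build_s_matrix` are assumptions made for this example; genome-scale reconstructions contain hundreds to thousands of reactions and are usually stored in dedicated formats.

```python
# Toy network: stoichiometric coefficients are negative for substrates
# and positive for products. Protons are omitted for brevity.
reactions = {
    "XR": {"xylose": -1, "NADPH": -1, "xylitol": 1, "NADP": 1},
    "XDH": {"xylitol": -1, "NAD": -1, "xylulose": 1, "NADH": 1},
}


def build_s_matrix(reactions):
    """Return (metabolites, reaction_ids, S) with S as an M x N list of lists."""
    metabolites = sorted({met for stoich in reactions.values() for met in stoich})
    row_index = {met: i for i, met in enumerate(metabolites)}
    matrix = [[0] * len(reactions) for _ in metabolites]
    for column, stoich in enumerate(reactions.values()):
        for met, coefficient in stoich.items():
            matrix[row_index[met]][column] = coefficient
    return metabolites, list(reactions), matrix


metabolites, reaction_ids, S = build_s_matrix(reactions)
print("M =", len(S), "metabolites; N =", len(S[0]), "reactions")
for met, coefficients in zip(metabolites, S):
    print(f"{met:10s} {coefficients}")
```

Each row is a metabolite and each column a reaction; xylitol, which is produced by XR and consumed by XDH, is the only row with non-zero entries in both columns.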

Automated genome-scale network reconstruction

Automatic and semiautomatic GENRE tools for prokaryotes and eukaryotes are outlined in Table 3.9. These methods are used to speed the development of GENREs using existing GENREs as templates or with bioinformatic tools. More methods are available for prokaryote reconstructions because prokaryotes do not have as many compartments as eukaryotes, and have fewer paralogs.

Table 3.9: Automatic and semiautomatic tools for genome-scale network reconstruction in prokaryotes and eukaryotes.

| Reconstruction | Prokaryote Eukaryote | Reference |
|---|---|---|
| MEMOSys | X | Pabinger et al. [2011] |
| FAME | X | Boele et al. [2012] |
| MicrobesFlux | X | Feng et al. [2012] |
| CoReCo | X | Pitkänen et al. [2014] |
| Pathway Tools | XX | Karp et al. [2015] |
| RAVEN 2.0 | XX | Wang et al. [2018] |
| Model SEED | X | Devoid et al. [2013] |
| merlin | XX | Dias et al. [2015] |
| CarveMe | X | Machado et al. [2018] |

Comparative Genome-Scale Reconstruction (CoReCo) is the first method to consider the simultaneous reconstruction of GENREs for a wide range of species [Pitkänen et al., 2014]. It relies on a probabilistic model to identify orthologs using RBH with BLASTP, which can complicate or misidentify orthology relationships for gene families with many duplications or high sequence similarity (Section 3.2.5) [Dalquen and Dessimoz, 2013]. The quality of these GENREs was not compared to curated reconstructions, with the exception of Sac. cerevisiae.

CarveMe is a top-down approach recently outlined by Machado et al. [2018]. They introduce the concept of a universal model for bacteria which can build GENREs by homology using DIAMOND [Buchfink et al., 2014], or orthology via eggNOG-mapper [Huerta-Cepas et al., 2017]. Using the GENREs from 23 species, they were able to scale their universal model to more than 5000 bacterial GENREs. This method can lead to several complications with eukaryotes, which have more paralogs than prokaryotes [Makarova et al., 2005]. For example, Emericella nidulans encodes mitochondrial, cytosolic, and peroxisomal isocitrate dehydrogenase from one gene, IdpA, but most budding yeasts have these enzymes encoded by two or three genes [Szewczyk et al., 2001]. Assigning a metabolic function to IdpA via orthology from Sac. cerevisiae would annotate it as a mitochondrial NADP-dependent isocitrate dehydrogenase, and omit its cytosolic and peroxisomal localizations. Therefore, homolog and ortholog-based assignments can both lead to incorrect gene-protein-reaction associations (GPRs) and enzyme localization. Furthermore, errors in genome annotations and existing GENREs can be inherited in GENREs derived from CarveMe. For example, the genome annotation of Sch. stipitis indicates that internal alternative NADH dehydrogenase (Ndi) is encoded by PICST_58800 [Jeffries et al., 2007], and this gene was subsequently included in iBB814 [Balagurunathan et al., 2012] even though there is no evidence of this gene function. CarveMe offers the best approach to scaling GENREs across the tree of life, but its results hinge on the accuracy of ortholog identification and on the curated GENREs, which are largely unreviewed.
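The commission and omission errors discussed in this section can be expressed as set differences between a template-derived draft and a curated reaction list. The reaction identifiers below are hypothetical placeholders and the function name `reconstruction_errors` is an assumption; the sketch only illustrates the bookkeeping, not how a curated list is obtained.

```python
def reconstruction_errors(draft_reactions, curated_reactions):
    """Compare a draft GENRE's reactions with a curated reaction list.

    commission: reactions in the draft that curation does not support
    omission:   curated reactions that are missing from the draft
    """
    draft, curated = set(draft_reactions), set(curated_reactions)
    return {
        "commission": sorted(draft - curated),
        "omission": sorted(curated - draft),
    }


# Hypothetical reaction identifiers.
draft_from_template = ["R1", "R2", "R3"]  # inherited from a model-organism template
curated_for_species = ["R1", "R2", "R4"]  # supported by evidence in the target species

errors = reconstruction_errors(draft_from_template, curated_for_species)
print(errors)
```

Here R3 would be a commission error inherited from the template, and R4 an omission error for a reaction that the template organism lacks.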

3.3.3 Flux balance analysis

FBA is one of the most widely used metabolic modelling techniques; it attempts to predict phenotype from genotype using GENREs. FBA was first described by Watson [1984] as a means to teach biochemistry, but Varma and Palsson [1994] is often regarded as the most seminal paper in metabolic modelling. FBA assumes steady-state conditions, in which no metabolites accumulate. The stoichiometric matrix, S, usually has more reactions (columns) than metabolites (rows), making it an underdetermined system. An objective function and constraints must be applied to the linear system of equations to obtain a solution with FBA. Common objective functions include maximizing growth rate, ATP production, or the yield of a biochemical. FBA is well suited to study how cofactors such as NADP(H), NAD(H), and ATP are balanced, because these cannot accumulate during steady-state conditions [King and Feist, 2013]. OptKnock was the first genome-scale method to analyze metabolic networks [Burgard et al., 2003]. It seeks to maximize the growth rate and the production of metabolites in a bi-level optimization. Lewis et al. [2012] reviewed OptKnock and other computational methods for metabolic modelling.
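The steady-state assumption, constraints, and objective function described in this section are commonly written as a linear program. The notation below is a generic sketch rather than a formulation taken from the cited papers; the symbols v (flux vector), c (objective weights), and the bounds v^lb and v^ub are introduced here for illustration.

```latex
\begin{aligned}
\max_{v}\quad & c^{\mathsf{T}} v \\
\text{subject to}\quad & S v = 0, \\
& v^{\mathrm{lb}} \le v \le v^{\mathrm{ub}}
\end{aligned}
```

If S has m metabolites (rows) and n reactions (columns) with n > m, the steady-state condition S v = 0 leaves at least n - m degrees of freedom, which is why an objective and flux bounds are needed to select a flux distribution.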

3.3.4 Scheffersomyces stipitis genome-scale network reconstruction and analysis

Five genome-scale reconstructions are available for Sch. stipitis [Balagurunathan et al., 2012, Caspeta et al., 2012, Li, 2012, Liu et al., 2012, Hilliard et al., 2018], making it one of the most widely studied yeasts with FBA. Different approaches were used to reconstruct its metabolic network, including KEGG, Pathway Tools, and a pairwise genome comparison with Sac. cerevisiae. No effort has been made to create a consensus GENRE for Sch. stipitis.

The aim of each reconstruction was to understand how Sch. stipitis metabolizes xylose. Balagurunathan et al. [2012] provided the most thorough analysis of Sch. stipitis metabolism, indicating that the succinate bypass (Section 3.1.2) proposed by Jeffries et al. [2007] is not sufficient to regenerate NADPH in silico. Caspeta et al. [2012] and Li [2012] found that their xylose fermentation simulations did not predict xylitol accumulation, but provided no mechanism to account for the redox balancing. Hilliard et al. [2018] used their system identification-based framework to propose that XR switches its cofactor preference from NADPH to NADH, similar to what had been proposed previously [Ligthelm et al., 1988c, Dellweg et al., 1990]. Brenner has warned that there is a difference between simulation and imitation, and that what a model predicts is less important than how a model predicts it. In the case of xylose fermentation in Sch. stipitis, some authors have claimed their models do not predict xylitol accumulation, but this does not confirm the accuracy of the GENRE or model predictions.

3.3.5 Opportunities to improve genome-scale network reconstruction and analysis for Scheffersomyces stipitis

Current methods for genome-scale network reconstruction cause commission and omission errors for non-conventional organisms, including Sch. stipitis. One of the primary causes of these errors is the use of homology or sequence similarity to assign canonical enzyme function to proteins, rather than the use of stringent orthology to assign promiscuous enzyme function. An alternative approach to CoReCo and CarveMe that can be scaled to more organisms is to curate the pan-genome of a taxon with phylogenetic reconstruction, assign promiscuous function to tree nodes, and reconstruct metabolic networks for the pan-genome.

It is unknown how Sch. stipitis can ferment xylose to ethanol under oxygen limitation, despite multiple labs reconstructing and analyzing its metabolic network. No new pathways or mechanisms have been proposed, likely because of an overreliance on the current literature and on Sac. cerevisiae’s metabolism as a template. Including promiscuous enzyme activity in GENREs provides a way to explore unknown metabolic function.

3.4 Summary of the literature and synthesis

3.4.1 Xylose fermentation

• Xylose is the second most abundant sugar on Earth

• Xylose can be fermented to biofuels and biochemicals

• Xylose can be metabolized by the XI, XR-XDH, Weimberg, and Dahms pathways

• The XR-XDH pathway exists in budding yeasts

• Pac. tannophilus was the first yeast discovered to ferment xylose to ethanol

• Sch. stipitis emerged as the most efficient xylose fermenter in the 1980s

• Yeasts that can ferment xylose to ethanol have an XR that can use NADH as a cofactor

• Other mechanisms beyond NADH-dependent XR are thought to exist in xylose fermenting yeasts to balance redox cofactors during oxygen limitation

• Transcriptomics and proteomics studies have been carried out for xylose fermenting yeasts, but no experimentally validated mechanism has been described to balance redox cofactors during xylose fermentation

• All four xylose catabolic pathways have been expressed in Sac. cerevisiae with varying success

• Expression of XI from an anaerobic fungus with adaptive laboratory evolution has led to the best xylose fermentation performance in Sac. cerevisiae

• Expression of the Sch. stipitis XR-XDH pathway in Sac. cerevisiae causes xylitol and glycerol to accumulate, which is not observed with Sch. stipitis

• Rational and irrational methods have been used to improve the performance of xylose fermentation in Sac. cerevisiae, but the field has never attained the same titres, rates, or yields as Sch. stipitis

Xylose fermentation is one of the most studied metabolic engineering problems. Many yeasts have independently evolved to ferment xylose to ethanol, but Sch. stipitis and Spa. passalidarum have a robust ability to ferment xylose to ethanol with little to no polyols. The two most seminal works in xylose fermentation used comparative approaches to further the understanding of xylose fermentation. In the genomic dark age, Bruinenberg et al. [1984] studied xylose fermentation in Cyb. jadinii, Pac. tannophilus, and Sch. stipitis, which all belong to different budding yeast families. The ability of Pac. tannophilus and Sch. stipitis to ferment xylose to ethanol under oxygen-limiting conditions was attributed to XR having a preference for NADH that is absent in most yeast XR enzymes. The inability of Sac. cerevisiae to ferment xylose to the same extent as Sch. stipitis led researchers to believe additional redox balancing mechanisms may exist in some xylose fermenters. In the genomics golden age, Wohlbach et al. [2011] analyzed the genomes, transcriptomes, and xylose fermentation of 12 yeasts. Orthologous genes that are present in xylose fermenters or growers were hypothesized to play a role in xylose fermentation. Their study provided no experimental evidence from knockouts and did not explain how these genes may balance redox metabolism; furthermore, the quality of the ortholog assignments is not known.

Sac. cerevisiae engineered with bacterial XI showed slower xylose fermentation than strains expressing the XR-XDH pathway from Sch. stipitis for nearly two decades. The discovery of fungal XI and its expression in Sac. cerevisiae marked an important inflection point in the performance, challenges, and fate of both pathways [Kuyper et al., 2003, Jansen et al., 2017]. Recently, engineered and evolved Sac. cerevisiae strains harbouring XI have performed similarly to native xylose fermenters, but Sac. cerevisiae engineered with the XR-XDH pathway ferments xylose at lower ethanol yields and lower growth rates than native fermenters. Xylose fermentation with the XR-XDH pathway is not likely to play a role in industrial ethanol fermentation, but it remains unresolved why Sac. cerevisiae cannot ferment xylose to ethanol at the same titres, rates, and yields as Sch. stipitis, despite expressing the same initial pathway. The community has largely focused on testing metabolic strategies in Sac. cerevisiae, including the deletion of PHO13, but the metabolism of Sch. stipitis and other xylose fermenters has not been thoroughly explored for new redox balancing mechanisms.
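The role of XR cofactor preference in this section can be made concrete with a small bookkeeping sketch. It assumes that XDH is strictly NAD+-dependent and that XR draws a fraction f of its reducing equivalents from NADH and the remainder from NADPH; the function name `cofactor_balance` and the parameter `nadh_fraction` are introduced here for illustration only.

```python
from fractions import Fraction


def cofactor_balance(nadh_fraction):
    """Net cofactor change per xylose converted to xylulose via XR-XDH.

    nadh_fraction: share of XR flux that uses NADH (0 = NADPH only, 1 = NADH only).
    Negative values mean net consumption, positive values mean net production.
    Assumption: XDH forms exactly one NADH per xylitol oxidized.
    """
    f = Fraction(nadh_fraction)
    if not 0 <= f <= 1:
        raise ValueError("nadh_fraction must lie between 0 and 1")
    nadh = 1 - f        # XDH forms 1 NADH; XR consumes f NADH
    nadph = -(1 - f)    # XR consumes (1 - f) NADPH
    return {"NADH": nadh, "NADPH": nadph}


for f in (Fraction(0), Fraction(1, 2), Fraction(1)):
    print(f, cofactor_balance(f))
```

Under these assumptions, an NADPH-only XR (f = 0) leaves one surplus NADH and one NADPH deficit per xylose, which other reactions must resolve when oxygen is limiting, whereas an NADH-only XR (f = 1) makes the two steps redox neutral.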

3.4.2 Functional genome annotation

• Homology and analogy are cornerstone concepts in biology

• Homology is used to describe a feature that shares ancestry, whereas analogy is used to describe a feature that arose from convergent evolution

• Homologs can be further defined as orthologs, which arise from speciation and likely have conserved function; paralogs, which arise from gene duplications and likely have new function; and xenologs, which arise from HGT and may have conserved function

• Sequence homology is fundamental to functional genome annotation

• RBH and phylogenomic databases are often used to annotate genomes

• RBH and phylogenomic databases often cannot distinguish between orthologs and paralogs, and therefore cannot accurately identify function

• Phylogenetic reconstruction is the most accurate method to annotate proteins, but is computationally intensive

Homology and analogy are cornerstone concepts in biology. These terms were first used to describe the conserved structure and function of body systems in animals during Darwin’s time. A homolog is defined as a character, such as a protein, structure, or behaviour, that shares ancestry with another character but with some divergence. An analog is a character that shares the same state as another character through convergent evolution. Homology was extended to molecular biology with orthology, paralogy, and xenology. The ortholog conjecture states that orthologous proteins are more likely to share function than paralogous proteins. Gene duplications are associated with new gene function, often described as neofunctionalization and subfunctionalization. Xenologs can allow for new function to be gained across the species barrier. Sequence homology is critical to annotating protein function.

An understanding of orthologs, paralogs, and xenologs in an organism is critical to genome annotations, and ultimately to bridging the genotype-phenotype gap. The genome annotations of model organisms, such as Sac. cerevisiae and Esc. coli, are more accurate than those of other organisms because their genes and proteins have been directly characterized. A handful of characterized proteins are used to annotate millions of proteins in non-conventional organisms using bioinformatics, leading to the possibility of genome annotation errors. Phylogenomic databases and the RBH method offer a scalable means to identify an approximate protein function across the tree of life. Current methods often do not distinguish between orthologs and paralogs (Appendix B), which can misidentify protein function. Distinguishing between orthologs and paralogs can help identify, with some confidence, changes in redox cofactor preference or cellular localization for proteins that have not been characterized. Curation of eukaryotic or yeast pan-genomes with phylogenetic reconstruction is essential to improving our understanding of the genetic repertoire in budding yeasts because current methods lead to inaccurate genome annotations.
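The limitation of the RBH method described in this section can be illustrated with a minimal sketch. The protein identifiers and bit scores below are invented for illustration, and `best_hits` and `reciprocal_best_hits` are hypothetical helpers rather than functions from any published tool.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring subject protein."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > scores[(query, best[query])]:
            best[query] = subject
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical bit scores: species A has one gene (A1); species B has two
# paralogs (B1a, B1b) that arose from a duplication after the A/B speciation.
a_vs_b = {("A1", "B1a"): 310.0, ("A1", "B1b"): 305.0}
b_vs_a = {("B1a", "A1"): 310.0, ("B1b", "A1"): 305.0}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy case both B1a and B1b are co-orthologs of A1, yet RBH reports only the pair with the marginally higher score, so the second paralog is left without an ortholog assignment.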

3.4.3 Genome-scale network reconstruction

• A GENRE represents the structured knowledge base of an organism’s physiology

• Metabolic network reconstructions are the most common GENRE

• Eukaryotic GENREs are more complex than prokaryotic GENREs because of their compartments and gene duplications

• Programs have been developed to automate GENREs for prokaryotes and eukaryotes

• Commission and omission errors can occur in GENREs for non-conventional organisms

• FBA is the most common metabolic modelling technique used to analyze metabolism with GENREs

• FBA assumes metabolite concentrations are at steady-state

• An objective function is required for solving the metabolic flux distribution since the number of reactions generally exceeds the number of metabolites

• Typical objective functions are maximum growth rate, ATP yield, or biochemical yield

• Five GENREs have been developed for Sch. stipitis

• None of the five studies has proposed new mechanisms to explain how Sch. stipitis ferments xylose to ethanol

A GENRE represents the knowledge base of an organism, as predicted by its genome and biochemical evidence. Metabolic networks are the most common GENRE, and are often represented as a stoichiometric matrix. GENREs of non-conventional (non-model) organisms are more likely to have errors than those of model organisms such as Sac. cerevisiae, because their genome annotations rely on identifying function through sequence homology, rather than direct enzyme characterization. This limitation causes commission and omission errors in GENREs, especially when a distantly related model organism’s GENRE is used as a template, and can impact metabolic simulations. Several methods are available to automate and semiautomate GENREs; fewer are available for eukaryotes because these GENREs have more complexity than those of prokaryotes.

FBA is the most common metabolic modelling technique used to analyze the metabolic network with GENREs. FBA assumes steady-state concentrations for metabolites. An objective function is required to solve the flux distribution from this linear system of equations. Typical analyses with FBA include determining growth requirements and predicting byproducts during fermentation. Several groups have independently reconstructed the metabolic network of Sch. stipitis using different methods. None have described how redox cofactors are balanced during xylose fermentation beyond what has already been proposed regarding XR changing its cofactor preference in vivo. A bottom-up consensus GENRE is required for Sch. stipitis to reassess how redox cofactors can be balanced during xylose fermentation.

3.4.4 Synthesis of xylose fermentation in yeasts, genome annotations, and metabolic modelling

Sch. stipitis has a robust ability to ferment xylose to ethanol under oxygen-limiting conditions that is not easily engineered in Sac. cerevisiae, and has not been observed in most xylose fermenting yeasts. There are two important outcomes to these phenotype differences. First, the inability of Sac. cerevisiae to ferment xylose to the same extent as Sch. stipitis, despite expressing the same XR-XDH pathway and decades of tinkering with its metabolic network, suggests that Sch. stipitis has additional redox balancing mechanisms that are not present in Sac. cerevisiae. Second, the low xylitol yield during xylose fermentation by Sch. stipitis and Spa. passalidarum, but not in other xylose fermenting yeasts, indicates that the redox balancing mechanism may have recently evolved in the Scheffersomyces-Spathaspora clade; gene duplications are generally prime targets for gains in function.

Genome-scale metabolic modelling provides a means to generate hypotheses on how Sch. stipitis balances its redox cofactors during xylose fermentation. Several metabolic reconstructions are available in the literature for Sch. stipitis, but no recent studies have proposed or demonstrated new redox balancing mechanisms for xylose fermentation. I think there are two reasons no new mechanisms have been proposed. First, most of what is known about yeast metabolism is from Sac. cerevisiae, causing an anchoring bias for GENRE curators; mapping what is known from Sac. cerevisiae’s metabolism to Sch. stipitis’ genome results in commission and omission errors. Second, some metabolic modelling studies are focused on demonstrating that their xylose fermentation simulations do not accumulate xylitol in silico, which is what has been demonstrated experimentally, but this is only half of what should be expected from these studies. The other half of a modelling study seeking to understand xylose fermentation is to perturb Sch. stipitis’ metabolic network with in silico knockouts, using an algorithm like OptKnock, to show how xylitol can be accumulated. A consensus metabolic network that is not as reliant on Sac. cerevisiae as a reference is needed for Sch. stipitis for future studies.

Functional protein annotations are important to bridging the genotype-phenotype gap, and are especially critical to GENREs [Thiele and Palsson, 2010]. As our knowledge of biology is incomplete, so too are genome annotations; however, not all genome annotations are created equal. Sac. cerevisiae’s genome annotation is superior to Sch. stipitis’ because most of its proteins have been experimentally characterized and its annotations are kept up-to-date. On the other hand, Sch. stipitis’ genome annotation is not periodically updated, few of its proteins have been directly characterized, and most of its protein annotations are inferred from homology. Clustering proteins by sequence similarity and the RBH method are widely used to functionally annotate proteins, but these approaches are better described as first approximations of protein function. Furthermore, clustering methods often cannot detect functional differences between duplicated genes, which likely play a role in xylose fermentation. Phylogenetic reconstruction of protein families can improve the accuracy of protein annotations and GENREs. These annotations should focus on , enzyme localization, and redox cofactor preference. A pan-genome curation using phylogenetic reconstruction is more likely to provide an accurate genome annotation than a pairwise comparison of Sch. stipitis and Sac. cerevisiae.

In summary, Sch. stipitis has likely evolved a new redox balancing strategy used during xylose fermentation that is absent in other yeasts, including Sac. cerevisiae. Genome-scale metabolic modelling can be used to elucidate this mechanism in Sch. stipitis, but this requires a genome annotation based on phylogenetic reconstruction.

3.5 Hypotheses, objectives, and organization of the thesis

Based on the stated opportunities for advancement, I propose the following hypotheses and objectives to bridge the genotype-phenotype gap in yeast metabolism, including xylose fermentation in Sch. stipitis.

3.5.1 Curation of the yeast pan-genome

Problem statement: Current practices limit the quality of genome annotation in non-conventional yeasts, including Sch. stipitis, leading to misannotated and unannotated proteins.

Hypothesis: Orthologs are easier to identify using phylogenetic reconstruction than by clustering-based methods.

Objectives:

1. Curate enzyme families from 33 yeasts and fungi from Dikarya into ortholog groups using phylogenetic reconstruction

2. Assign metabolic function to ortholog groups

3. Compare curated ortholog assignments to phylogenomic databases

4. Make the pan-genome open source to allow assignments to be scrutinized and continuously improved

Rationale: Current genome annotation practices, including phylogenomic databases, mostly use sequence similarity, which is only a proxy for orthology. These clustering-based methods are not computationally intensive, but their results are cognitively demanding to interpret for gene families with many duplications and losses. On the other hand, phylogenetic reconstruction is inherently based on orthology; it is more computationally intensive than clustering approaches, but easier to interpret. Therefore, a yeast pan-genome based on phylogenetic reconstruction will have fewer errors than one based on sequence similarity methods, but will require more effort. Fungal genomes should be included in this pan-genome because they are larger and likely resemble the genomes of the first budding yeasts, which had many enzymes that were subsequently lost with genome streamlining.

The phylogenetic reconstruction of the Dikarya pan-genome, which is the foundation of this thesis, is described in Chapter 3. It is used in all subsequent research chapters. Protein function annotations were reviewed and updated in each chapter.

3.5.2 Analysis of the yeast pan-genome

Problem statement: The metabolism of Sch. stipitis and other yeasts is not well known beyond what little has been directly characterized and what has been mapped from Sac. cerevisiae.

Hypothesis: Greater insight into the metabolism of Sch. stipitis and other non-conventional yeasts can be gained by reconciling the gains and losses of metabolic genes with their physiologies in the Dikarya pan-genome.

Objectives:

1. Identify ancient and recent metabolic gene duplications in the Dikarya pan-genome

2. Identify metabolic gene losses in the Dikarya pan-genome

3. Reconstruct the Dikarya species tree based on existing trees and gene duplications with a phylogenetic signal

4. Reconcile the genotypes and phenotypes of important metabolic properties, including fermentation characteristics and

Rationale: All yeast species are inherently unique, but Sac. cerevisiae is an exceptional yeast because of its specialized Crabtree lifestyle and its ancestors’ WGD. Therefore it cannot be used as a model organism to understand specific aspects of metabolism in all budding yeasts, such as xylose fermentation. By analyzing the pan-genome and physiology of species in Dikarya, it is possible to improve our understanding of budding yeast metabolism beyond Sac. cerevisiae and the Saccharomycetaceae family. These objectives are discussed in Chapter 4, and subsequently used to reconstruct the metabolic network in Chapter 5.

3.5.3 In silico metabolic network reconstruction for the yeast pan-genome

Problem statement: Current practices limit the quantity and quality of genome-scale reconstructions for non-conventional organisms.

Hypothesis: Curation of the pan-genome using phylogenetic reconstruction, and mapping function onto phylogenetic trees, can lead to increased genomic and metabolic coverage in GENREs.

Objectives:

1. Develop a new scalable framework to reconstruct high quality metabolic networks of multiple species by leveraging the pan-genome and its functional annotation

2. Gap-fill the GENREs for each essential biomass precursor

3. Compare GENREs with previously published GENREs

Rationale: The dominant practice for metabolic network reconstruction in non-conventional organisms uses model organism GENREs as a template, leading to commission and omission errors. Curating the pan-genome into ortholog groups, and assigning function to those ortholog groups, can improve the quality of GENREs by reducing anchoring and availability bias. The metabolic reconstruction is described in Chapter, and is based on the pan-genome curation from Chapter 3 and the analysis of the pan-genome from Chapter 4.
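The idea of assigning function to ortholog groups once, and then deriving species-level networks from the presence or absence of group members, can be sketched as follows. The ortholog group identifiers, reaction labels, species codes, and data layout are hypothetical placeholders chosen for illustration; they are not the format used in this thesis.

```python
def reconstruct_networks(group_reactions, group_members):
    """Build gene-reaction associations per species from ortholog groups.

    group_reactions: ortholog group -> set of reactions it can catalyze
    group_members:   ortholog group -> {species: [gene ids]}
    Returns species -> {reaction: sorted list of genes}.
    """
    networks = {}
    for group, reactions in group_reactions.items():
        for species, genes in group_members.get(group, {}).items():
            species_net = networks.setdefault(species, {})
            for reaction in reactions:
                species_net.setdefault(reaction, set()).update(genes)
    return {
        species: {rxn: sorted(genes) for rxn, genes in net.items()}
        for species, net in networks.items()
    }


# Hypothetical annotation: one group carries two (promiscuous) activities.
group_reactions = {
    "OG0001": {"XR_NADPH", "XR_NADH"},
    "OG0002": {"XDH_NAD"},
}
group_members = {
    "OG0001": {"sp_A": ["A_g1"], "sp_B": ["B_g1", "B_g2"]},
    "OG0002": {"sp_A": ["A_g7"]},
}

networks = reconstruct_networks(group_reactions, group_members)
print(networks["sp_B"])
```

Because function is attached to the group rather than to a template organism, a species that lacks a member of a group (sp_B and OG0002 here) simply does not receive that reaction, and a species with extra paralogs receives all of them as candidate genes.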

3.5.4 Reverse engineering xylose fermentation in Scheffersomyces stipitis

Problem statement: Sac. cerevisiae engineered with the Sch. stipitis XR-XDH pathway cannot ferment xylose to ethanol as efficiently as Sch. stipitis, despite expressing the same pathway.

Hypothesis: Sch. stipitis has enzymes that enable it to balance redox cofactors during xylose fermentation to ethanol, which are absent in Sac. cerevisiae’s metabolism.

Objectives:

1. Bottom-up analysis of existing xylose fermentation omics data to identify gene targets

2. Use metabolic simulations to reconcile xylose fermentation data with previously proposed and new mechanisms that can balance redox cofactors during oxygen limitation

3. Bioinformatic analysis of candidate genes in Scheffersomyces-Spathaspora yeasts

4. In vitro characterization of the leading candidate proteins

Rationale: The growth rate of Sac. cerevisiae engineered with the XR-XDH pathway is often 10 times lower than that of Sch. stipitis, and its xylitol yield is higher. This may be due to something in Sac. cerevisiae metabolism that hinders redox balancing during oxygen limitation, something in Sch. stipitis metabolism that promotes redox balancing, or a combination of both. Thirty years of various engineering attempts have not enabled Sac. cerevisiae to efficiently ferment xylose to ethanol, and Sch. stipitis’ metabolism has largely remained unexplored. This indicates Sch. stipitis may indeed have a novel redox balancing mechanism which evolved in the Scheffersomyces-Spathaspora clade. These objectives and results are described in Chapter 6, which uses the genome-scale metabolic network of Sch. stipitis that was reconstructed in Chapter 5.

Chapter 4

AYbRAH: an open-source ortholog database for yeasts and fungi

As soon as two genomes were sequenced, all genomics became evolutionary.

Eugene V. Koonin

4.1 Abstract

Budding yeasts inhabit a range of environments by exploiting various metabolic traits. The genetic bases for these traits are mostly unknown, preventing their addition or removal in a chassis organism for metabolic engineering. Analyzing Yeasts by Reconstructing Ancestry of Homologs (AYbRAH), a curated and open-source ortholog database, was created with the aim to understand these traits in yeasts. This pan-genome contains 33 diverse fungi and yeasts in Dikarya, spanning 600 million years of evolution. OrthoMCL and OrthoDB were used to cluster protein sequences into ortholog and homolog groups, respectively; MAFFT and PhyML reconstructed the phylogeny of all homolog groups. Ortholog assignments for enzymes and small metabolite transporters were evaluated against their phylogenetic reconstruction, and curated to resolve any discrepancies. Information on homolog and ortholog groups can be viewed in the AYbRAH web portal (https://lmse.github.io/aybrah/), including functional annotations, predictions for mitochondrial localization and transmembrane domains, literature references, and phylogenetic reconstructions. Ortholog assignments in AYbRAH were compared to HOGENOM, KEGG Orthology, OMA, eggNOG, and PANTHER. PANTHER and OMA had the most congruent ortholog groups with AYbRAH, while the other phylogenomic databases had greater amounts of under-clustering, over-clustering, or no ortholog annotations for proteins. The large discrepancy between AYbRAH and other ortholog databases indicates existing methods must be improved to correctly identify orthologs, or new approaches must be developed. Applications for the pan-genome database are discussed.

Associated publication: K Correia, MY Shi, R Mahadevan. AYbRAH: a curated ortholog database for yeasts and fungi spanning 600 million years of evolution. Database: The Journal of Biological Databases and Curation, 2019. https://doi.org/10.1093/database/baz022


4.2 Introduction

Yeasts are unicellular fungi that exploit diverse habitats on every continent, including the gut of wood boring beetles, insect frass, tree exudate, rotting wood, rotting cactus tissue, soil, brine solutions, and fermenting juice [Kurtzman et al., 2011]. The most widely studied yeasts are true budding yeasts, which span roughly 400 million years of evolution in the subphylum Saccharomycotina [Hedges et al., 2015], and possess a broad range of traits important to metabolic engineering. These include citrate and lipid accumulation in Yarrowia [Aiba and Matsuoka, 1979] and Lipomyces [Boulton and Ratledge, 1983], thermotolerance in multiple lineages [Banat et al., 1996, Ryabova et al., 2003], acid tolerance in Pichia [Rush and Fosmer, 2013] and Zygosaccharomyces [Lindberg et al., 2013], methanol utilization in Komagataella [Hazeu and Donker, 1983], osmotolerance in Debaryomyces [Larsson and Gustafsson, 1993], xylose to ethanol fermentation in multiple yeast lineages [Schneider et al., 1981, Slininger et al., 1982, Toivola et al., 1984], alternative nuclear codon assignments [Mühlhausen et al., 2016], glucose and acetic acid co-consumption in Zygosaccharomyces [Sousa et al., 1998], and aerobic ethanol production (the Crabtree effect) in multiple lineages [de Deken, 1966, van Urk et al., 1990, Blank et al., 2005, Christen and Sauer, 2011]. The complete genetic bases of these traits are mostly unknown, preventing their addition or removal in a chassis organism for biotechnology.

The distinction between orthologs, paralogs, ohnologs and xenologs plays an important role in bridging the genotype-phenotype gap across the tree of life [Koonin, 2001].
Briefly, orthologs are genes that arise from speciation and typically have a conserved function; paralogs and ohnologs emerge from locus and whole genome duplications, respectively, and may have a novel function; xenologs derive from HGT between organisms, and do not necessarily have conserved function [Jensen, 2001, Koonin, 2005]. Knowledge of these types of genes has played an important role in deciphering Sac. cerevisiae’s physiology. For example, the Adh2p paralog in Sac. cerevisiae has favourable kinetics for ethanol oxidation and evolved from an ancient Adh1p duplication whose kinetics favoured ethanol production [Thomson et al., 2005]; the Saccharomycetaceae WGD led to the MPC2 and MPC3 ohnologs, which encode the fermentative and respirative subunits of the mitochondrial pyruvate carrier [Bender et al., 2015], respectively; the URA1 xenolog from Lactobacillales enables uracil to be synthesized anaerobically in most Saccharomycetaceae yeasts [Hall et al., 2005]. These examples demonstrate how understanding the origin of genes has narrowed the genotype-phenotype gap for fermentation in Saccharomycetaceae.

Many genomics studies have focused on the Saccharomycetaceae family, and to a lesser extent the CTG clade [Dujon, 2010], but more can be learned about yeast metabolism by studying its evolution over a longer time horizon, especially with yeasts having deeper phylogeny [Gaucher et al., 2010]. If we could study the metabolism of the mother of all budding yeasts, which we refer to as the proto-yeast, we could track the gains and losses of orthologs, and their function, in all of her descendants to bridge the genotype-phenotype gap. Proto-yeast has evolved from her original state, making this direct study impossible, but we can reconstruct her metabolism through her living descendants.
In recent years, dozens of yeasts with deep phylogeny have been sequenced [Riley et al., 2016], paving the way for greater insight into the evolution of metabolism in yeasts beyond Saccharomycetaceae.

Ortholog databases are critical to facilitating comparative genomics studies and inferring protein function. Most of these databases are constructed using graph-based methods that rely on sequence similarity, while fewer databases use tree-based methods [Kuzniar et al., 2008]. Existing ortholog databases do not span diverse yeasts (Figure 4.1), and sometimes cannot distinguish between orthologs and paralogs (Appendix B). In addition to these databases, orthologs are identified on an ad hoc basis with OrthoMCL for comparative genomics studies [Wohlbach et al., 2011, Papini et al., 2012], or with the RBH method for genome-scale network reconstructions [Caspeta et al., 2012]; these ortholog assignments often lack transparency or traceability, and therefore cannot be scrutinized or continuously improved by research communities.

Analyzing Yeasts by Reconstructing Ancestry of Homologs (AYbRAH) (Figure 4.2) was created to solve these problems, and ultimately improve our understanding of budding yeast physiology. AYbRAH, derived from the Hebrew name Abra, mother of many, is an open-source database of predicted and manually curated orthologs, their function, and origin. The initial AYbRAH database was constructed using OrthoMCL and OrthoDB. PhyML was used to reconstruct the phylogeny of each homolog group. AYbRAH ortholog assignments for enzymes and small metabolite transporters were compared against their phylogenetic reconstruction and curated to resolve any discrepancies.
This chapter outlines information available in the AYbRAH web portal (https://lmse.github.io/aybrah/), issues that arose from reviewing the accuracy of ortholog predictions, a comparison of AYbRAH to established phylogenomic databases, and a discussion of the benefits of open-source ortholog databases.

4.3 Methods

4.3.1 Initial construction of AYbRAH

AYbRAH was created by combining several algorithms and databases in a pipeline (Figure 4.2). A total of 212 836 protein sequences from 33 organisms (Table 4.1) in Dikarya were downloaded from UniProt [Consortium, 2014] and MycoCosm [Grigoriev et al., 2013]. OrthoMCL [Fischer et al., 2011] clustered protein sequences into putative Fungal Ortholog Groups (FOGs); default parameters were used for BLASTP and OrthoMCL. The FOGs from OrthoMCL were coalesced into HOmolog Groups (HOGs) using Fungi-level homolog group assignments from OrthoDB v8 [Kriventseva et al., 2015].

4.3.2 AYbRAH curation

Multiple sequence alignments were obtained for each homolog group with MAFFT v7.245 [Katoh and Standley, 2013] using a gap and extension penalty of 1.5. 100 bootstrap trees were reconstructed for each HOG with PhyML v3.2.0 [Guindon et al., 2010], optimized for tree topology and branch length. Consensus phylogenetic trees were generated for each HOG with SumTrees from DendroPy v4.1.0 [Sukumaran and Holder, 2010], and trees were rendered with ETE v3 [Huerta-Cepas et al., 2016a]. The phylogenetic reconstructions for enzymes and metabolite transporters were reviewed when OrthoMCL failed to differentiate between orthologs and paralogs, caused by over-clustering (Figure 4.5), or when orthologous proteins were dispersed into multiple ortholog groups, caused by under-clustering (Figure 4.6). Orthologs were identified by visual inspection of the phylogenetic trees or with a custom ETE 3-based script [Huerta-Cepas et al., 2016a].

4.3.3 Annotating misidentified and unidentified proteins

Additional steps were required to assign proteins to ortholog groups because OrthoMCL did not cluster all related proteins to ortholog groups, or because whole genome protein annotations were incomplete. First, proteins in OrthoDB homolog groups were added to new FOGs if they were not assigned to any FOG by OrthoMCL. Next, each organism had its genome nucleotide sequence queried with a protein

[Figure 4.1 image: species tree with an ortholog database coverage matrix.]

Columns (ortholog databases): HOGENOM, PANTHER, AYbRAH, eggNOG, YGOB, CGOB, OMA, KO.

Rows (species code | species): rgm | Rhodotorula graminis; sai | Saitoella complicata; spo | Schizosaccharomyces pombe; ang | Aspergillus niger; ncr | Neurospora crassa; tre | Trichoderma reesei; lst | Lipomyces starkeyi; yli | Yarrowia lipolytica; arx | Blastobotrys adeninivorans; nfu | Nadsonia fulvescens; aru | Ascoidea rubescens; pta | Pachysolen tannophilus; ppa | Komagataella phaffii; kcp | Kuraishia capsulata; car | Ogataea arabinofermentans; opm | Ogataea parapolymorpha; dbx | Dekkera bruxellensis; pme | Pichia membranifaciens; pku | Pichia kudriavzevii; bin | Babjeviella inositovora; mbi | Metschnikowia bicuspidata; mgu | Meyerozyma guilliermondii; dha | Debaryomyces hansenii; pic | Scheffersomyces stipitis; spa | Spathaspora passalidarum; wan | Wickerhamomyces anomalus; cut | Cyberlindnera jadinii; hva | Hanseniaspora valbyensis; kla | Kluyveromyces lactis; lth | Lachancea thermotolerans; zro | Zygosaccharomyces rouxii; sce | Saccharomyces cerevisiae; vpo | Vanderwaltozyma polyspora.

Clade labels on the species tree: Pucciniomycotina, Taphrinomycotina, Pezizomycotina, Saccharomycotina, Sporidiobolaceae, Schizosaccharomycetaceae, Aspergillaceae, Sordariaceae, Hypocreaceae, Lipomycetaceae, Dipodascaceae, Ascoideaceae, Pichiaceae, CTG clade, Phaffomycetaceae, Saccharomycodaceae, Saccharomycetaceae.

Figure 4.1: Ortholog database coverage for fungal and yeast genomes in AYbRAH, YGOB, CGOB, PANTHER, HOGENOM, KO, OMA, and eggNOG. Ortholog assignments based on the manual curation of sequence similarity and synteny are shown in green columns; tree-based methods in red columns; graph-based methods in blue columns; a hybrid graph and tree-based method in the purple column. Many ortholog databases are well represented in Saccharomycetaceae and the CTG clade, which had their genomes sequenced during the 2000s [Dujon, 2010]. AYbRAH has ortholog assignments for species in Pichiaceae, Phaffomycetaceae and several incertae sedis families, which are not well represented in other ortholog databases, as these yeasts were recently sequenced [Riley et al., 2016]. The well-established phylogenomic databases span other yeast species not shown in this phylogeny, but they mostly belong to Saccharomycetaceae or the CTG clade.

[Figure 4.2 image: workflow diagram. Legend: review, file, program, curation, data. Elements shown: 33 fungal whole genome protein sequences (*.fasta); all-v-all blastp results (blastp.tsv); Fungal Ortholog Groups and Homolog Groups (aybrah.tsv, aybrah.json, aybrah.xml, aybrah.xlsx); HOG protein sequences (HOG*.fasta), iterated for each HOG; MAFFT 7.245; HOG multiple sequence alignment (HOG*_msa.fasta); PhyML 3.2.0; HOG bootstrap trees (HOG*.newick); review of HOG phylogeny; identification of paralogs, ohnologs, and xenologs.]

Figure 4.2: AYbRAH workflow for ortholog curation. 33 fungal and yeast proteomes were downloaded from UniProt and MycoCosm. BLASTP computed the sequence similarity between all proteins. OrthoMCL clustered the proteins into putative Fungal Ortholog Groups (FOGs) using the BLASTP results. FOGs were clustered into HOmolog Groups (HOGs) using Fungi-level homolog assignments from OrthoDB. Multiple sequence alignments for each homolog group were obtained with MAFFT, and 100 bootstrap phylogenetic trees were reconstructed with PhyML. The consensus phylogenetic trees for enzymes and transporters were reviewed and curated to differentiate between orthologs, paralogs, ohnologs, and xenologs.

of the species’ closest relative from each FOG using TBLASTN (expect threshold of 1e-20). Annotated proteins were then queried against the TBLASTN hits to determine which proteins were annotated but not assigned to a FOG by OrthoMCL, and which proteins were unannotated despite a match in their nucleotide sequences. Proteins identified via TBLASTN with a sequence length less than 75% of the mean FOG sequence length were discarded from the candidate list. The remaining proteins were assigned to a HOG by their best hit via BLASTP, and to a FOG with pplacer [Matsen et al., 2010] via the MAFFT add alignment option.

The following examples highlight how misidentified and unidentified protein annotations were resolved in AYbRAH, respectively. First, Cybja1_169606 (A0A1E4RV95), which encodes NADP-dependent isocitrate dehydrogenase in Cyb. jadinii, was not assigned to any ortholog group by OrthoMCL despite its high sequence similarity to other proteins. It was added to FOG00618 by pplacer [Matsen et al., 2010] with high confidence.
Second, no 60S ribosomal protein L6 (FOG00006) was present in Meyerozyma guilliermondii’s protein annotation; it was identified by TBLASTN, annotated as mgu_AYbRAH_00173, and added to FOG00006 by pplacer [Matsen et al., 2010].
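The length filter applied to TBLASTN candidates can be sketched in a few lines of Python. This is a minimal illustration, not the script used to build AYbRAH: the function name, the tuple layout, and the example identifiers and sequence lengths are hypothetical, and only the 75% threshold is taken from the text.

```python
from statistics import mean


def filter_tblastn_candidates(candidates, fog_lengths, min_fraction=0.75):
    """Return IDs of candidates at least min_fraction of the mean FOG length.

    candidates:  list of (protein_id, fog_id, sequence_length) tuples
    fog_lengths: dict mapping fog_id to lengths of proteins already in the FOG
    """
    kept = []
    for protein_id, fog_id, length in candidates:
        mean_length = mean(fog_lengths[fog_id])
        # Candidates shorter than the threshold are discarded.
        if length >= min_fraction * mean_length:
            kept.append(protein_id)
    return kept


# Hypothetical FOG with a mean sequence length of 400 residues.
fog_lengths = {"FOG_example": [400, 420, 380]}
candidates = [
    ("hit_1", "FOG_example", 390),  # 97.5% of the mean length
    ("hit_2", "FOG_example", 250),  # 62.5% of the mean length
    ("hit_3", "FOG_example", 300),  # exactly 75% of the mean length
]
kept = filter_tblastn_candidates(candidates, fog_lengths)
```

Here `hit_2` is dropped because it is shorter than 75% of the mean FOG length, while a candidate exactly at the threshold is retained, matching the "less than 75%" wording.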

4.3.4 Comparison of AYbRAH to existing phylogenomic databases

AYbRAH ortholog assignments were compared to OMA [Altenhoff et al., 2015], PANTHER [Mi et al., 2016], HOGENOM [Penel et al., 2009], eggNOG [Huerta-Cepas et al., 2016b], and KEGG Orthology [Mao et al., 2005]. Phylogenomic annotations were downloaded from UniProt. Ortholog groups were assessed as congruent, over-clustered, under-clustered, over and under-clustered, or no ortholog assignment relative to AYbRAH. AYbRAH ortholog groups were only compared with an ortholog database if an ortholog group in AYbRAH had proteins from species present in the other ortholog database. For example, FOG19691 consists of proteins from Ascoidea rubescens, Pac. tannophilus, Kuraishia capsulata, Ogataea parapolymorpha, Dekkera bruxellensis, Pichia kudriavzevii, Pichia membranifaciens, Babjeviella inositovora, Wickerhamomyces anomalus, and Cyberlindnera jadinii. None of the phylogenomic databases have ortholog assignments for these organisms, and therefore cannot be compared with AYbRAH. Evolview v2 [He et al., 2016] was used to map ortholog database coverage onto the yeast species tree.
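One way to express the five assessment categories is sketched below. The exact decision rules are not spelled out in the text, so the logic here is an assumption: a reference group is "under-clustered" if the other database splits its proteins across several groups, "over-clustered" if the other database's group also contains proteins outside the reference group, and proteins without an assignment are ignored unless every protein is unassigned. The function name and the protein and group identifiers are hypothetical.

```python
def classify_group(reference, other_assignment):
    """Classify one reference ortholog group against another database.

    reference:        set of protein IDs in the reference (AYbRAH) group
    other_assignment: dict mapping protein ID to group ID in the other
                      database; unassigned proteins are absent from the dict
    """
    other_groups = {other_assignment[p] for p in reference if p in other_assignment}
    if not other_groups:
        return "no ortholog assignment"
    # Every protein that the other database places in the matched groups.
    members = {p for p, g in other_assignment.items() if g in other_groups}
    split = len(other_groups) > 1          # reference proteins are dispersed
    extra = bool(members - reference)      # matched groups contain outsiders
    if split and extra:
        return "over and under-clustered"
    if split:
        return "under-clustered"
    if extra:
        return "over-clustered"
    return "congruent"


reference = {"a1", "a2", "a3"}
results = [
    classify_group(reference, {"a1": "G1", "a2": "G1", "a3": "G1"}),
    classify_group(reference, {"a1": "G1", "a2": "G1", "a3": "G1", "b1": "G1"}),
    classify_group(reference, {"a1": "G1", "a2": "G2", "a3": "G2"}),
    classify_group(reference, {"a1": "G1", "a2": "G2", "b1": "G2"}),
    classify_group(reference, {"b1": "G9"}),
]
```

A stricter variant could treat partially unassigned groups as a separate category; the choice made here keeps the five labels used in the comparison.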

4.3.5 Subcellular localization prediction

Subcellular localization predictions for all proteins in the pan-genome were computed with MitoProt II [Claros and Vincens, 1996], Predotar [Small et al., 2004], and TargetP [Emanuelsson et al., 2007]. The Phobius web server [Käll et al., 2007] was used to predict transmembrane domains for all proteins.

4.3.6 Literature references

Literature references for characterized proteins were assigned to FOGs in AYbRAH. Additional references were obtained from paperBLAST [Price and Arkin, 2017], UniProt [Consortium, 2014], Saccharomyces Genome Database [Cherry et al., 2011], PomBase [McDowall et al., 2014], Candida Genome Database [Inglis et al., 2011], and Aspergillus Genome Database [Cerqueira et al., 2013].

Table 4.1: Fungal and yeast strain genomes in AYbRAH. Protein sequences were downloaded from UniProt or MycoCosm. Species were assigned to monophyletic or paraphyletic groups based on divergence time with Saccharomyces cerevisiae.

Species | Strain | Group | Database | Reference
Rhodotorula graminis | WP1 | Saccharomycotina outgroup | MycoCosm | Firrincieli et al. [2015]
Saitoella complicata | NRRL Y-17804 | Saccharomycotina outgroup | MycoCosm | Riley et al. [2016]
Schizosaccharomyces pombe | 972h- | Saccharomycotina outgroup | UniProt | Wood et al. [2002]
Aspergillus niger | CBS 513.88 | Saccharomycotina outgroup | UniProt | Pel et al. [2007]
Neurospora crassa | CBS708.71 | Saccharomycotina outgroup | UniProt | Galagan et al. [2003]
Trichoderma reesei | QM6a | Saccharomycotina outgroup | UniProt | Martinez et al. [2008]
Lipomyces starkeyi | NRRL Y-11557 | basal Saccharomycotina | MycoCosm | Riley et al. [2016]
Yarrowia lipolytica | CLIB 122 | basal Saccharomycotina | UniProt | Dujon et al. [2004]
Blastobotrys adeninivorans | LS3 | basal Saccharomycotina | MycoCosm | Kunze et al. [2014]
Nadsonia fulvescens var. elongata | DSM 6959 | basal Saccharomycotina | MycoCosm | Riley et al. [2016]
Ascoidea rubescens | NRRL Y17699 | basal Saccharomycotina | MycoCosm | Riley et al. [2016]
Pachysolen tannophilus | NRRL Y-2460 | Pichiaceae | MycoCosm | Riley et al. [2016]
Komagataella phaffii | GS115 | Pichiaceae | UniProt | De Schutter et al. [2009]
Kuraishia capsulata | CBS 1993 | Pichiaceae | UniProt | Morales et al. [2013]
Ogataea arabinofermentans | NRRL YB-2248 | Pichiaceae | MycoCosm | Riley et al. [2016]
Ogataea parapolymorpha | NRRL Y-7560 | Pichiaceae | UniProt | Ravin et al. [2013]
Dekkera bruxellensis | CBS 2499 | Pichiaceae | MycoCosm | Piškur et al. [2012]
Pichia membranifaciens | NRRL Y-2026 | Pichiaceae | MycoCosm | Riley et al. [2016]
Pichia kudriavzevii | SD108 | Pichiaceae | UniProt | Xiao et al. [2014]
Babjeviella inositovora | NRRL Y-12698 | CTG clade | MycoCosm | Riley et al. [2016]
Metschnikowia bicuspidata | NRRL YB-4993 | CTG clade | MycoCosm | Riley et al. [2016]
Meyerozyma guilliermondii | CBS 566 | CTG clade | UniProt | Butler et al. [2009]
Debaryomyces hansenii | CBS 767 | CTG clade | UniProt | Dujon et al. [2004]
Scheffersomyces stipitis | CBS 6054 | CTG clade | UniProt | Jeffries et al. [2007]
Spathaspora passalidarum | NRRL Y-27907 | CTG clade | UniProt | Wohlbach et al. [2011]
Wickerhamomyces anomalus | NRRL Y-366-8 | Phaffomycetaceae & Saccharomycodaceae | MycoCosm | Riley et al. [2016]
Cyberlindnera jadinii | NRRL Y-1542 | Phaffomycetaceae & Saccharomycodaceae | MycoCosm | Riley et al. [2016]
Hanseniaspora valbyensis | NRRL Y-1626 | Phaffomycetaceae & Saccharomycodaceae | MycoCosm | Riley et al. [2016]
Kluyveromyces lactis | CBS 2359 | Saccharomycetaceae | UniProt | Dujon et al. [2004]
Lachancea thermotolerans | CBS 6340 | Saccharomycetaceae | UniProt | Souciet et al. [2009]
Zygosaccharomyces rouxii | CBS 732 | Saccharomycetaceae | UniProt | Souciet et al. [2009]
Saccharomyces cerevisiae | S288C | Saccharomycetaceae | UniProt | Goffeau et al. [1996]
Vanderwaltozyma polyspora | DSM 70294 | Saccharomycetaceae | UniProt | Scannell et al. [2007]

Table 4.2: AYbRAH ortholog database statistics before and after curation. The initial ortholog assignments were obtained with OrthoMCL and OrthoDB. Additional proteins were annotated using TBLASTN. Ortholog groups for enzymes and small metabolite transporters were manually curated by visual inspection of homolog phylogeny, and by identifying ortholog groups with an ETE 3 script [Huerta-Cepas et al., 2016a]. Ortholog groups were modified by adding proteins to existing groups via pplacer [Matsen et al., 2010], or by collapsing homolog groups into a single ortholog group if there were no gene duplications in the homolog group (under-clustering).

Statistic | AYbRAH v0.1 | AYbRAH v0.2.3
Proteins | 212 551 | 214 498
Proteins in AYbRAH | 169 118 (79%) | 187 555 (87%)
Ortholog groups | 14 249 | 22 538
Homolog groups | 0 | 18 202
Manually curated ortholog groups | 0 | 625
Electronically modified ortholog groups | 0 | 3760

4.4 AYbRAH overview

AYbRAH v0.1 and v0.2.3 database statistics are summarized in Table 4.2. In total, there are 214 498 protein sequences in the pan-genome for 33 yeasts and fungi; Pezizomycotina fungi were included in the database as an outgroup because they have genes that were present in proto-yeast’s ancestor, but subsequently lost. AYbRAH has 187 555 proteins (87% of the pan-proteome) that were assigned to 22 538 ortholog groups, and 18 202 homolog groups. Ortholog assignments are available in an Excel spreadsheet, a tab-separated file, orthoXML [Schmitt et al., 2011], and a JSON format.

4.4.1 The AYbRAH web portal.

AYbRAH has a web page for each homolog group with information on gene names, descriptions, gene origin (paralog, ohnolog, xenolog), literature references, localization predictions, and phylogenetic reconstruction. A sample webpage for the acetyl-CoA synthetase can be seen in Appendix C. Homolog groups can be searched by FOG (FOG00404) or HOG (HOG00229) identification codes, gene names (ACS1), ordered locus (YAL054C), UniProt entry names (ACS1_YEAST), or protein accession codes from UniProt (Q01574), NCBI RefSeq (NP_009347.1) or EMBL (CAA47054.1).

A sample phylogenetic tree rendered by ETE v3 [Huerta-Cepas et al., 2016a] and descriptions of its annotation features are shown in Figure 4.3 for the acetyl-CoA synthetase family (HOG00229). The initial ortholog assignments by OrthoMCL did not distinguish between the Acs1p (FOG00404) and Acs2p (FOG00405) paralogs. Discrepancies in ortholog assignments can be identified by comparing bootstrap support values for clades with the assigned ortholog groups; issues can be reported on GitHub, and pull requests can be issued for large changes to ortholog groups.

Ortholog groups are considered to be equal or siblings in current ortholog databases and COG collections. AYbRAH adopts multi-level hierarchical relationships between ortholog groups, as recommended by Galperin et al. [2017]. This can be seen in Figure 4.3 for the acetyl-CoA synthetase homolog group. Acs1p was present in the common ancestor of all yeasts and fungi, but a duplication of ACS1 led to ACS2 in budding yeasts. Therefore, Acs1p (FOG00404) is the parent ortholog group to Acs2p (FOG00405). These relationships can be mapped to a species tree to illustrate how function has been gained and lost.

[Figure 4.3 image: sample phylogenetic tree with labelled leaf node annotations, line annotations, and other annotations, as described in the caption.]

Figure 4.3: Annotation features of a sample phylogenetic tree in AYbRAH. Square and circle leaves indicate protein sequences in Basidiomycota or Ascomycota, respectively. Leaf nodes are coloured based on taxonomic groups. Circle leaves are used for proteins with no paralogs in the same species, whereas sphere leaves are used to designate proteins with paralogs in the same species. Vertical bold lines indicate species-lineage expansions, which are sometimes called in-paralogs or co-orthologs [Remm et al., 2001]. Horizontal bold lines designate proteins from Sac. cerevisiae, the most widely studied eukaryote. Dashed lines indicate the most anciently diverged protein sequence in the ortholog group. Ortholog groups can be identified by colour groups to help the visual inspection of ortholog assignments. The leaf names include a three-letter species code and a sequence accession. Internal nodes are labelled with the bootstrap values from phylogenetic reconstruction with PhyML.

Snapshots for mitochondrial localization and transmembrane domain predictions are shown in Figures 4.4A and 4.4B for internal alternative NADH dehydrogenase, encoded by NDI1 (FOG00846). Reviewing localization predictions for orthologous proteins with multiple algorithms enables researchers to make prudent decisions about protein localization, rather than relying on one method for one protein sequence. For example, Cybja1_131289 encodes internal alternative NADH dehydrogenase, yet its mitochondrial localization probability is 0.0019 with MitoProt II; all other mitochondrial predictions for Ndi1p orthologs are greater than 0.80 with MitoProt II. A review of the upstream nucleotide sequence of Cybja1_131289 indicates additional start codons that were not included in the protein annotation. MitoProt II predicts a mitochondrial localization probability of 0.5191 for the full protein sequence, which is more consistent with its orthologs.


Figure 4.4: Localization predictions for internal NADH dehydrogenase (NDI1_YEAST) in AYbRAH. (A) Histogram plots are shown for mitochondrial localization predictions of Ndi1p orthologs predicted by Predotar, TargetP, and MitoProt. (B) Transmembrane domain predictions computed for orthologous proteins by the Phobius web server.

4.5 AYbRAH curation

OrthoMCL and OrthoDB are less computationally intensive than phylogenetic-based methods, but are not always accurate [Salichos and Rokas, 2011]. Curation was required to resolve incorrect ortholog assignments due to over-clustering and under-clustering.

4.5.1 Over-clustering by OrthoMCL

Over-clustering has been described in past studies [Jothi et al., 2006], and occurs when graph-based methods create ortholog groups that do not distinguish between orthologs and paralogs. Over-clustering by OrthoMCL was common in gene families with many duplications or high sequence similarities, such as the aldehyde dehydrogenase (HOG00216) and the major facilitator superfamily (HOG01031); adjusting parameters for BLASTP and OrthoMCL did not help differentiate between orthologs and paralogs in HOG00216. Figure 4.5 illustrates an example of over-clustering with a subset of the hexokinase family (HOG00193). In this phylogenetic reconstruction, one hexokinase gene was present in the ancestral yeast species, but a gene duplication in Pichiaceae led to the HXK3 paralog; the HXK2 ortholog was subsequently not maintained in Oga. parapolymorpha’s genome. OrthoMCL assigned the HXK3 paralog to the same ortholog group as HXK2. The RBH method, commonly used for ortholog identification [Moreno-Hagelsieb and Latimer, 2008], would have also falsely identified Oga. parapolymorpha’s HXK3 as orthologous to Sac. cerevisiae’s HXK2. This example highlights how the greediness of graph-based methods can misidentify orthologs, which has been shown for yeast ohnologs [Salichos and Rokas, 2011], and how incorrect ortholog assignments can be made with pairwise comparisons. Paralogs were identified from over-clustered ortholog groups by finding nodes with high bootstrap support in the consensus phylogenetic trees for homologs and migrating the proteins to new ortholog groups; in some cases orthologs were identified by reviewing the sequence alignment of homologs.
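The RBH failure mode described for the hexokinase example can be reproduced with a toy calculation. The sketch below is illustrative only: the similarity scores are invented to mimic the scenario (they are not BLASTP results from AYbRAH), the gene labels are shorthand for the proteins in Figure 4.5, and each genome is reduced to the hexokinases discussed in the text.

```python
def best_hit(query, targets, scores):
    """Return the target with the highest symmetric similarity score."""
    return max(targets, key=lambda t: scores[frozenset((query, t))])


def reciprocal_best_hits(genome_a, genome_b, scores):
    """Return (a, b) pairs that are each other's best hit."""
    pairs = []
    for a in genome_a:
        b = best_hit(a, genome_b, scores)
        if best_hit(b, genome_a, scores) == a:
            pairs.append((a, b))
    return pairs


# Invented scores: the paralog HXK3 is less similar to HXK2 than the
# true ortholog, but it is still the only candidate in a genome that
# has lost HXK2.
scores = {
    frozenset(("sce_HXK2", "ppa_HXK2")): 600,
    frozenset(("sce_HXK2", "ppa_HXK3")): 450,
    frozenset(("sce_HXK2", "opm_HXK3")): 440,
}

sce = ["sce_HXK2"]
ppa = ["ppa_HXK2", "ppa_HXK3"]  # both copies retained
opm = ["opm_HXK3"]              # HXK2 lost after the duplication

rbh_ppa = reciprocal_best_hits(sce, ppa, scores)
rbh_opm = reciprocal_best_hits(sce, opm, scores)
```

With both copies present, RBH pairs `sce_HXK2` with `ppa_HXK2`. Once HXK2 is lost, the paralog `opm_HXK3` becomes the reciprocal best hit by default, which is why a pairwise method cannot detect the earlier duplication without the phylogeny.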

4.5.2 Under-clustering by OrthoMCL

Under-clustering occurs when orthologous proteins are assigned to multiple ortholog groups. OrthoMCL was more prone to under-clustering by short protein sequences and protein sequences with low sequence similarity, such as ETC complex subunits and Flo8p. Figure 4.6 demonstrates under-clustering with a subset of the Flo8p family that was incorrectly assigned to multiple ortholog groups by OrthoMCL. Under-clustering was mostly resolved via a Python script that coalesced proteins into a new ortholog group when multiple ortholog groups were present in the homolog group yet no organism had any gene duplications.
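The coalescing rule can be sketched as follows. This is a minimal reconstruction from the description above, not the original Python script: the function name, the tuple layout, and the protein labels are hypothetical, and "no gene duplications" is interpreted as every species contributing at most one protein to the homolog group. The group identifiers in the first example are those shown in Figure 4.6.

```python
from collections import Counter


def coalesce_under_clustered(hog_members, new_fog_id):
    """Merge the ortholog groups of one homolog group when it is safe to do so.

    hog_members: list of (protein_id, species_code, fog_id) tuples
    new_fog_id:  identifier to use for the merged ortholog group
    Returns a dict mapping protein_id to its (possibly merged) group ID.
    """
    species_counts = Counter(species for _, species, _ in hog_members)
    fog_ids = {fog for _, _, fog in hog_members}
    has_duplication = any(count > 1 for count in species_counts.values())
    # Merge only if several groups exist and no species has more than one copy.
    if len(fog_ids) > 1 and not has_duplication:
        return {protein: new_fog_id for protein, _, _ in hog_members}
    return {protein: fog for protein, _, fog in hog_members}


# Flo8p subset from Figure 4.6: four OrthoMCL groups, one protein per species.
flo8 = [
    ("yli_FLO8", "yli", "YOG17627"),
    ("kcp_FLO8", "kcp", "YOG04713"),
    ("pic_FLO8", "pic", "YOG13014"),
    ("sce_FLO8", "sce", "YOG14200"),
]
merged = coalesce_under_clustered(flo8, "FOG04178")

# Hypothetical family with a duplication in one species: left unchanged.
duplicated = [("ppa_a", "ppa", "G1"), ("ppa_b", "ppa", "G2"), ("sce_a", "sce", "G1")]
unchanged = coalesce_under_clustered(duplicated, "G_new")
```

Families with a duplication in any species are deliberately left alone, since separate groups there may reflect genuine paralogs that need review against the phylogeny.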

4.6 Comparison of AYbRAH to other ortholog identification methods.

4.6.1 BLASTP scoring metrics.

BLASTP is used as the basis for many ortholog predictions, including graph-based methods [Kuzniar et al., 2008] and RBH [Moreno-Hagelsieb and Latimer, 2008]. The distributions of percent identity, log(bitscore), and -log(expect-value) for proteins identified as orthologous to Sac. cerevisiae’s proteins

[Figure 4.5 image: gene tree for a subset of the hexokinase family in Saccharomycotina, marking speciation, duplication, and gene loss events across Pichiaceae (Komagataella phaffii, Ogataea parapolymorpha) and Saccharomycetaceae (Saccharomyces cerevisiae). HXK2 is lost in opm.]

Gene | Yeast Ortholog Group (via OrthoMCL) | Fungal Ortholog Group (curated) | Homolog Group | OrthoDB
ppa HXK2 | YOG00210 | FOG00269 | HOG00193 | EOG092C2JW4
ppa HXK3 | YOG00210 | FOG00271 | HOG00193 | EOG092C2JW4
opm HXK3 | YOG00210 | FOG00271 | HOG00193 | EOG092C2JW4
sce HXK2 | YOG00210 | FOG00269 | HOG00193 | EOG092C2JW4

Figure 4.5: Example of over-clustering by OrthoMCL with the hexokinase family and its curation in AYbRAH. A gene duplication of HXK2 in Pichiaceae led to the HXK3 paralog. HXK2 was subsequently lost in Ogataea parapolymorpha but maintained in Komagataella phaffii. OrthoMCL was unable to differentiate between the Hxk2p and Hxk3p orthologs. Both ortholog groups are also assigned to the same Fungi-level ortholog group in OrthoDB.

[Figure 4.6 image: gene tree for a subset of the FLO8 family in Saccharomycotina, with speciation events separating Dipodascaceae (Yarrowia lipolytica), Pichiaceae (Kuraishia capsulata), the CTG clade (Scheffersomyces stipitis), and Saccharomycetaceae (Saccharomyces cerevisiae).]

Gene | Yeast Ortholog Group (via OrthoMCL) | Fungal Ortholog Group (curated) | Homolog Group | OrthoDB
yli FLO8 | YOG17627 | FOG04178 | HOG01445 | EOG092C2ONV
kcp FLO8 | YOG04713 | FOG04178 | HOG01445 | no annotation
pic FLO8 | YOG13014 | FOG04178 | HOG01445 | EOG092C2ONV
sce FLO8 | YOG14200 | FOG04178 | HOG01445 | EOG092C2ONV

Figure 4.6: Example of under-clustering by OrthoMCL in the FLO8 ortholog group and its curation in AYbRAH. OrthoMCL dispersed the Flo8p proteins into multiple ortholog groups due to the low sequence similarity between the proteins. The proteins were merged into one ortholog group.

Table 4.3: Comparison of ortholog assignments between AYbRAH and well-established phylogenomic databases. OMA and PANTHER are the most congruous with AYbRAH. Bold numbers indicate the greatest source of incongruency with AYbRAH. OMA and PANTHER are predicted to have more under-clustered and over-clustered groups relative to AYbRAH, respectively. HOGENOM, eggNOG, and KO have a large number of proteins with no ortholog assignment.

Ortholog database | FOGs compared | Congruent groups | Over-clustered groups | Under-clustered groups | Over and under-clustered groups | No ortholog group assignment
OMA | 8505 | 59% | 5% | 19% | 3% | 14%
PANTHER | 7014 | 58% | 29% | 1% | 4% | 8%
HOGENOM | 9393 | 50% | 14% | 11% | 1% | 24%
eggNOG | 7827 | 48% | 10% | 4% | 1% | 37%
KO | 9027 | 22% | 16% | 0% | 0% | 62%

in AYbRAH are shown in Figure 4.7. Taxonomic groups include the Saccharomycotina outgroup, basal Saccharomycotina, Pichiaceae, the CTG clade, Phaffomycetaceae and Saccharomycodaceae, and Saccharomycetaceae (Table 4.1). The approximate divergence time with Saccharomyces cerevisiae is 400-600 million years with the Saccharomycotina outgroup, 200-400 million years with the basal Saccharomycotina yeasts, 200 million years with Pichiaceae and CTG clades, 100-200 million years with Phaffomycetaceae and Saccharomycodaceae, and 0-100 million years with Saccharomycetaceae. The distributions of percent identity, log(bitscore), and -log(expect-value) for proteins with 100-400 million years of divergence with Sac. cerevisiae are similar; however, the distributions skew differently for percent identity and -log(expect-value) for the Saccharomycotina outgroup (more than 400 million years of divergence) and Saccharomycetaceae (less than 100 million years of divergence). Distributions for percent identity, log(bitscore), and -log(expect-value) for each species in AYbRAH are shown in Appendix F. These results highlight the need to use phylogenetic methods and hidden Markov models to identify orthologs over long evolutionary timescales [Mi et al., 2016], but also enable orthologs to be identified by synteny and sequence similarity over smaller evolutionary time ranges [Byrne and Wolfe, 2005, Scannell et al., 2006].

4.6.2 Comparison of AYbRAH to well-established phylogenomic databases.

Ortholog assignments in AYbRAH were compared with OMA, PANTHER, HOGENOM, eggNOG, and KO (Table 4.3). OMA and PANTHER have the highest number of congruous ortholog groups with AYbRAH. Interestingly, PANTHER tends to over-cluster protein sequences into ortholog groups, while OMA tends to under-cluster. HOGENOM, eggNOG, and KO have a high fraction of proteins not assigned to any ortholog groups.

Ten ortholog groups were randomly selected from the over-clustered groups in PANTHER and under-clustered groups in OMA to determine the source of the incongruency. It was found that three of the 10 over-clustered ortholog groups in PANTHER were correctly annotated in AYbRAH, one ortholog group was correctly identified in PANTHER but under-clustered in AYbRAH, one ortholog group was not correctly identified in either database, and five ortholog groups required further curation since the phylogenies were ambiguous. All 10 ortholog groups from OMA were under-clustered, suggesting a systematic bias to not cluster proteins with lower sequence similarity; i.e., proteins identified as orthologous in AYbRAH were separated into two or more ortholog groups in OMA. Therefore, the PANTHER database is most closely aligned with AYbRAH. All other databases appear to be more prone to over-clustering, or to lack annotations, indicating that their methods should be improved.

[Figure 4.7 image: three stacked panels with y-axes percent identity (0 to 100), log(bitscore) (1.0 to 4.0), and -log(expect-value) (0 to 200). The x-axis lists taxonomic groups by divergence time with Saccharomyces cerevisiae: Saccharomycotina outgroup (400-600 MY), basal Saccharomycotina (200-400 MY), Pichiaceae (200 MY), CTG clade (200 MY), Phaffomycetaceae & Saccharomycodaceae (100-200 MY), and Saccharomycetaceae (0-100 MY).]

Figure 4.7: Distribution of BLASTP percent identities, logarithm of bit scores, and negative logarithm of expect-values for proteins orthologous to Saccharomyces cerevisiae. The bottom half of orthologous proteins in the Saccharomycotina outgroup and Saccharomycetaceae have percent identities of less than 40% and 58%, respectively; the bottom half of the expect-values are greater than 1e-60 and 1e-125 for the same groups. The wide and skewed distribution in the Saccharomycotina outgroup highlights the difficulty in making pairwise ortholog predictions from BLASTP results for proteins with more than 400 million years of divergence in Dikarya fungi; however, orthologs can be easily identified in the Saccharomycetaceae family because of their high sequence similarities and low expect-values.

4.7 Applications of a curated ortholog database.

Ortholog databases offer additional benefits beyond simply identifying orthologous proteins. These databases can be used to identify gene targets for functional characterization to accelerate functional genome annotation and to streamline GENREs; Galperin et al. [2017] recently outlined some of the benefits and challenges to ortholog databases for microbial genomics.

First, a curated ortholog database can serve as a repository for orthologs that have been screened and orthologs that require screening [Galperin and Koonin, 2004]. Rather than characterizing all the orthologs in a handful of model organisms, research communities can broaden their efforts to understand the orthologs that do not exist in model organisms and the set of orthologs that do not have a conserved function with orthologs in model organisms.

Second, a curated ortholog database can be used to improve and simplify genome annotation [Galperin and Koonin, 2004]. Genes from newly sequenced organisms can be mapped to curated ortholog groups, similar to eggNOG-mapper [Huerta-Cepas et al., 2017], rather than using protein sequences from ortholog databases as queries in TBLASTN searches [Proux-Wéra et al., 2012]. New ortholog groups can be created for de novo genes or genes from recent duplications. Pulling annotations from a curated ortholog database has the advantage of unifying the names and descriptions of genes between organisms, as has been proposed for ribosomal subunits [Ban et al., 2014], and can reduce the number of genes that are misannotated or annotated as conserved hypothetical proteins.

Finally, a curated ortholog database can be used to improve the quality and quantity of GENREs. GENREs inherently require a great deal of curation to identify orthologous proteins and their function, and the process for doing so is often not transparent. Refocusing this effort to curate ortholog groups and their function in an open-source knowledge base for pan-genomes can allow for improvements to be pushed to all GENREs, and for GENREs to be compiled for any taxonomic level, from Kingdom to strain.

4.8 Conclusion

In conclusion, AYbRAH was developed as an open-source ortholog database for yeasts and fungi. Manual curation was required for gene families with high sequence similarity, often arising from recent gene duplications, and for gene families with low sequence similarity. Ortholog assignments in AYbRAH are more accurate than commonly used phylogenomic databases, but further curation is needed.

Chapter 5

Reconstructing the evolution of metabolism in yeasts

As was predicted at the beginning of the Human Genome Project, getting the sequence will be the easy part as only technical issues are involved. The hard part will be finding out what it means, because this poses intellectual problems of how to understand the participation of the genes in the functions of living cells.

Sydney Brenner

5.1 Abstract

Sac. cerevisiae serves as an important model organism for yeasts and eukaryotes. One consequence of its role in molecular biology is that current knowledge of budding yeast genomics and genetics is heavily skewed towards Sac. cerevisiae. To broaden our understanding of yeast metabolism beyond Sac. cerevisiae and the Saccharomycetaceae family, AYbRAH, an ortholog database spanning 600 million years of evolution, was used to refine the budding yeast species tree topology, identify important and reoccurring events in the radiation of budding yeasts, and partially reconstruct the genomes of proto-yeast and proto-fermenter, i.e., the first budding yeast and the first budding yeast that fermented glucose to ethanol, respectively. Analysis of budding yeast gene duplications places Nadsonia fulvescens in a clade sister to Blastobotrys adeninivorans, and Bla. adeninivorans in a clade sister to Yar. lipolytica. This topology has not been predicted by any concatenation or coalescence-based methods, indicating that genes chosen for phylogenetic reconstruction do not have a strong phylogenetic signal for Nadsonia. The shift from a citrogenic to ethanologenic lifestyle in budding yeasts coincides with an ancient duplication of ACS1 to ACS2. These genes encode acetyl-CoA synthetase expressed during growth on acetate and glucose, respectively. This duplication led to the emergence of the pyruvate dehydrogenase (PDH) bypass in budding yeasts. Flux balance analysis (FBA) indicates the PDH bypass has a higher protein and phospholipid yield from glucose than ATP citrate lyase, the ancestral acetyl-CoA source in proto-yeast. Complex I and internal alternative NADH dehydrogenase (Ndi) orthologs have been independently lost in multiple yeast lineages. FBA simulations demonstrate Complex I has a significantly higher biomass yield than Ndi with glucose. The repeated loss of Complex I in yeasts, despite having a higher biomass yield than Ndi or aerobic fermentation, is proposed to be a result of its inactivation during cell proliferation to reduce harmful reactive oxygen species generation. Additional gene duplications have played a role in changing the expression of genes for different carbon sources, transitioning homomer enzyme complexes to heteromers with altered enzyme kinetics, possible ribosome heterogeneity in independent yeast lineages, and changes in enzyme localization and cofactor preference. HGT appears to be an underappreciated factor in yeast evolution. The Dikarya pan-genome highlights the role chance duplications play in shaping the evolution of metabolism in yeasts and fungi.

Associated publication: K Correia, MY Shi, R Mahadevan. Reconstructing the evolution of metabolism in budding yeasts. bioRxiv, 237974.

5.2 Introduction

Sac. cerevisiae is one of the most widely studied organisms and has served as the lodestar for understanding the molecular biology of eukaryotes. Although Sac. cerevisiae plays an outsized role in science and biotechnology, it is not a perfect yeast or eukaryote model organism because of its unique evolutionary history. For example, Sac. cerevisiae’s ancestor did not retain the widely conserved Complex I in its ETC [Gabaldón et al., 2005], gained new functions after the Saccharomycetaceae WGD [Byrne and Wolfe, 2007], has point centromeres rather than regional centromeres [Dujon, 2010], and underwent changes from a lipogenic-citrogenic lifestyle to an ethanologenic lifestyle. Sac. cerevisiae’s status as a model organism has skewed our understanding of budding yeast metabolism to the Saccharomycetaceae family [Dujon, 2010].

To broaden our understanding of yeast metabolism beyond Saccharomycetaceae, elements of proto-yeast’s and proto-fermenter’s genomes and physiologies were reconstructed; these yeasts are the first budding yeast and the first budding yeast to ferment glucose to ethanol, respectively, and serve as alternative references to Sac. cerevisiae for comparative genomics. By tracking the gains and losses of genes in proto-yeast’s and proto-fermenter’s descendants using AYbRAH, recent Saccharomycotina species tree topologies can be refined, and important and reoccurring events in the radiation of budding yeasts can be identified. These events include the evolution of the pyruvate dehydrogenase (PDH) bypass, the evolution of the types of NADH dehydrogenases in the ETC, changes in the composition of enzyme complexes, changes in enzyme localization via subfunctionalization and neofunctionalization, and HGT events. FBA was used to evaluate the fitness differences of metabolic pathways.

5.3 Methods

5.3.1 Refinement of the yeast species topology

The budding yeast species topology reconstructed by Mühlhausen and Kollmar [2014] was used as the starting species tree in this study and adjusted to minimize homoplasy of gene duplications in central metabolism.
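The homoplasy criterion can be illustrated with a minimal sketch, which is not the procedure used in this study. It assumes Dollo-style parsimony (each paralog arises once by duplication and can afterwards only be lost), two simplified five-taxon topologies, and a placeholder leaf `other` standing in for all later diverging budding yeasts, which are assumed to carry every paralog listed. The function names are hypothetical; the presence and absence calls follow Section 5.4.1.

```python
# Toy illustration: count the gene losses implied by a candidate species tree.
# Assumption: each paralog arose once (at the MRCA of its carriers) and can only be lost.

def leaves(tree):
    """Return the set of leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def count_absent_clades(node, carriers):
    """Count maximal clades under `node` in which no species carries the paralog."""
    if not (leaves(node) & carriers):
        return 1  # one loss at the root of this clade explains the absence
    if isinstance(node, str):
        return 0
    return sum(count_absent_clades(child, carriers) for child in node)

def dollo_losses(tree, carriers):
    """Minimum number of losses given a single origin of the paralog."""
    node = tree
    while not isinstance(node, str):
        # Descend while one child clade still contains every carrier.
        inside = [child for child in node if carriers <= leaves(child)]
        if not inside:
            break
        node = inside[0]
    return count_absent_clades(node, carriers)

# Simplified topologies (assumed for illustration only).
published = ("Lip", (("Bla", ("Nad", "Yar")), "other"))
refined = ("Lip", ("Yar", ("Bla", ("Nad", "other"))))

# Carriers of each paralog; "other" is assumed to carry all of them.
paralogs = {"ACS2": {"Bla", "Nad", "other"}}
for name in ("GPM2", "GAL10.2", "PFK1", "IDP1.2", "ADH3"):
    paralogs[name] = {"Nad", "other"}

def total_losses(tree):
    """Sum the implied losses over all paralogs for one topology."""
    return sum(dollo_losses(tree, carriers) for carriers in paralogs.values())

print("published:", total_losses(published))
print("refined:", total_losses(refined))
```

Under these assumptions the simplified published arrangement implies 11 losses (one for ACS2 and two for each of the other five paralogs), whereas the refined arrangement implies none. A fuller treatment would also weigh independent duplications against losses.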

5.3.2 Reconstruction of gene duplications and losses

The phylogenies of all metabolic homolog groups were reviewed in AYbRAH, and cross-referenced with PANTHER [Mi et al., 2016] for homolog groups with complex or ambiguous phylogeny. For example, the fumarate reductase family (HOG00276) has two major ortholog groups in AYbRAH. The OSM1 gene phylogeny in AYbRAH appears as a gene duplication in Saccharomycotina and HGT to Pezizomycotina. In PANTHER, which spans all domains of life, this paralog arose from an ancient duplication that predates Fungi and did not involve HGT. Major incongruences between the species tree and gene tree topologies, or in sequence alignments and tree branch lengths, were used to identify HGT events; PANTHER was also used to cross-reference these events. For example, Schizosaccharomyces pombe and Debaryomyces hansenii belong to different taxonomic classes, yet Schizo. pombe has a gene sister to Deb. hansenii’s GAL10.2 (HOG00206). This has previously been identified as a HGT by Slot and Rokas [2010]. Evolview v2 was used to plot the presence and absence of orthologs onto the species tree [He et al., 2016]. Additional gene duplications not discussed in this chapter are outlined in Appendix L.

5.3.3 Flux balance analysis

FBA was carried out with a Sac. cerevisiae genome-scale metabolic model (Chapter 6) in MATLAB with COBRA. Either NAD+ or NADP+-dependent acetaldehyde dehydrogenase was used. ATP citrate lyase (Acl) was added to the network.
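For reference, FBA solves a linear program of the following standard form. The symbols are generic notation introduced here for illustration and are not taken from the model in Chapter 6.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S\,v = 0, \\
& v^{\mathrm{lb}}_{j} \le v_{j} \le v^{\mathrm{ub}}_{j} \quad \text{for every reaction } j
\end{aligned}
```

Here S is the stoichiometric matrix, v the vector of reaction fluxes, and c selects the objective, for example the production of a single biomass precursor. One way to compare alternative routes under this formulation is to set both bounds of a reaction to zero, which removes it from the feasible flux space; whether the simulations in this chapter were configured in exactly this way is an assumption.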

5.4 Results & Discussion

5.4.1 Refined yeast species tree topology

Three Saccharomycotina phylogenies have been recently published [Mühlhausen and Kollmar, 2014, Riley et al., 2016, Shen et al., 2016], as well as a species tree for early diverging budding yeasts [Gonçalves et al., 2018]. These species trees are mostly supported with high bootstrap values using concatenation analysis, and coalescent analysis by Shen et al. [2016]; however, there are some inconsistencies amongst the species trees and with the literature. Tetrapisispora phaffii and Vanderwaltozyma polyspora were shown to be sister to Zygosaccharomyces bailii and Torulaspora delbrueckii by Mühlhausen and Kollmar [2014]. This implies two independent whole genome duplications in Saccharomycetaceae rather than the currently accepted view of a single event [Scannell et al., 2007, Marcet-Houben and Gabaldón, 2015]. Asc. rubescens is also shown to be sister to the Phaffomycetaceae-Saccharomycodaceae-Saccharomycetaceae (PSS) clade in studies led by Rokas, Hittinger, and colleagues [Shen et al., 2016, Krassowski et al., 2018], or sister to the CTG clade-Pichiaceae-Phaffomycetaceae-Saccharomycodaceae-Saccharomycetaceae (CPPSS) clade by Kollmar and colleagues [Mühlhausen and Kollmar, 2014, Mühlhausen et al., 2018]. Although reconstruction of species trees with concatenation analysis has become standard practice, it does not guarantee the true species tree. Inferring ancient divergences requires genes with a strong phylogenetic signal [Salichos and Rokas, 2013], otherwise the species tree topologies can be driven by a handful of genes [Shen et al., 2017]. Evolution is a messy process that does not always lend itself to being modelled by a conventional tree-based method [Bapteste et al., 2013]. Therefore, the emergence of paralogs in budding yeasts was reviewed to constrain the topology of Nadsonia fulvescens, Blastobotrys adeninivorans, and Asc. rubescens, which have the most contentious phylogenies in Saccharomycotina.

All recent phylogenies predict Nad. fulvescens and Yar. lipolytica to be sister species, and Bla. adeninivorans as sister to the Nadsonia-Yarrowia clade. This topology is not consistent with the consequential paralogs that emerged in Saccharomycotina (Figures 5.1 and 5.2). The presence and absence of ACS2 (FOG00405), which encodes the glucose-inducible acetyl-CoA synthetase (Figure 5.2), divides the budding yeasts in this study into ethanologenic and citrogenic-lipogenic lifestyles [de Deken, 1966, Büttner et al., 1992], respectively. This marks Lipomyces and Yarrowia as early diverging lipogenic yeasts with a metabolism similar to proto-yeast, while Bla. adeninivorans and Nad. fulvescens are later diverging ethanologenic yeasts with the ACS2 paralog. Nad. fulvescens also has additional paralogs which arose within Saccharomycotina but are absent in Lipomyces starkeyi, Yar. lipolytica, and Bla. adeninivorans. These include GPM2 (FOG00293), GAL10.2 (FOG00302), PFK1 (FOG00279), IDP1.2 (FOG00621), and ADH3 (FOG00328) (Figure 5.1). These paralogs place Nad. fulvescens in a clade sister to Bla. adeninivorans. If the recent Saccharomycotina phylogenies [Mühlhausen and Kollmar, 2014, Riley et al., 2016, Shen et al., 2016, Gonçalves et al., 2018] represent the true yeast species topology, this would imply that these paralogs were present in the common ancestor of Yar. lipolytica, Nad. fulvescens, and Bla. adeninivorans, but lost in Yar. lipolytica and Bla. adeninivorans; 6-phosphofructokinase would also have transitioned from a homomeric to a heteromeric enzyme complex, and reverted back to a homomeric enzyme complex in Yar. lipolytica and Bla. adeninivorans. HGT is unlikely because the PFK1

Figure 5.1: Manually curated glycolytic orthologs (red columns), TCA orthologs (green columns), pyruvate metabolism-related orthologs (blue columns), and other metabolic orthologs (orange) mapped to a budding yeast species tree; a Basidiomycete yeast, Taphrinomycotina yeasts, and Pezizomycotina fungi were used as outgroups. Family/clade names are shown next to species names. Proto-yeast is the ancestor to all budding yeasts in Saccharomycotina; proto-fermenter is the first budding yeast to have used the PDH bypass following the duplication of ACS1 to ACS2, gaining the ability to ferment sugars to ethanol aerobically or with oxygen limitation; hetero-oligomer PFK denotes the transition from the ancestral homo-oligomeric PFK to the hetero-oligomeric PFK with altered allostery in budding yeasts; loss of ancestral tRNA(CUG) indicates the earliest known point for the loss of ancestral CUG tRNA in budding yeasts and subsequent reassignment in Pac. tannophilus and the CTG clade [Mühlhausen et al., 2016]. Ethanologenic/Crabtree-positive yeasts are labelled with red stars and citrogenic fungi/yeasts with yellow stars. Species with published genome-scale network reconstructions (GENREs) are labelled with green text. Gene names reflect Sac. cerevisiae orthologs except: GPM4, the cofactor independent phosphoglycerate mutase; AOX1, alcohol oxidase; MLS1.2, the peroxisomal malate synthase paralog from the ancestral cytoplasmic and peroxisomal malate synthase; OSM2, an uncharacterized fumarate reductase paralogous to OSM1 in Sac. cerevisiae; GAL10.2, UDP-glucose-4-epimerase orthologous to GAL10 in Sac. cerevisiae; IDP1.2, mitochondrial isocitrate dehydrogenase orthologous to IDP1 in Sac. cerevisiae; FUM2, an uncharacterized cytoplasmic fumarase; UGA1.2, a mitochondrial gamma-aminobutyrate transaminase; LYS21.2, a cytoplasmic homocitrate synthase; HXK3, a paralog of Sac. cerevisiae’s HXK2; ARO10.2, an uncharacterized transaminated amino acid decarboxylase. Arrows indicate the predicted direction of the gene duplication. The topology of our tree is based on recently published species trees [Mühlhausen and Kollmar, 2014, Riley et al., 2016, Shen et al., 2016], but with a modified topology for Bla. adeninivorans, Nad. fulvescens and Asc. rubescens to minimize the homoplasy of gene duplications. Ortholog columns are sorted by their earliest duplication (except for the gene families for GPM, CIT, OSM, GAL10, HXK) to show support for our species tree. Bla. adeninivorans’ topology is supported with OSM1 and ACS2; Nad. fulvescens’ placement with GAL10.2, PFK1, IDP1.2, ADH3; Asc. rubescens’ topology with AKR, ELO3, LYS21.2, and to a lesser extent FUM2 and UGA1.2. Kluyveromyces lactis is the only yeast in Saccharomycetaceae which is Crabtree-negative and can grow with xylose; it differs within its family by the gain of the GDP1 paralog and loss of THI3.

Figure 5.2: Manually curated electron transport chain orthologs (blue columns), acetyl-CoA-related orthologs (purple columns) and ribosomal protein subunit duplications mapped to a budding yeast species tree. See Figure 5.1 for shared annotations. Subphylum names are shown next to species names. Enzymes with predicted cytoplasmic and mitochondrial localizations are indicated with [c] and [m], respectively. Oxidoreductases using NAD(H) and NADP(H) are designated with x and y, respectively. NUO, Complex I; STO1, alternative oxidase; NNT, membrane-bound transhydrogenase; NDI, internal NADH dehydrogenase; NDE, external NAD(P)H dehydrogenase; ALD, aldehyde dehydrogenase; PHK, phosphoketolase; ACL, ATP citrate lyase; ACS, acetyl-CoA synthetase; RPL, ribosome protein of the large subunit; RPS, ribosome protein of the small subunit; MRPL, mitochondrial ribosomal protein of the large subunit; MRPS, mitochondrial ribosomal protein of the small subunit. No yeasts have maintained alternative oxidase while having lost Complex I; however, Pac. tannophilus and Oga. parapolymorpha have both lost alternative oxidase yet maintain Complex I. Transhydrogenase is absent in Taphrinomycotina and Saccharomycotina, the two independent yeast lineages in this study. The only internal alternative NADH dehydrogenase that was present in proto-yeast (NDI0) is maintained in Lip. starkeyi within Saccharomycotina, while the older external NADH dehydrogenase (NDE0) is present in most clades except the CTG clade. NDE1, an external NADH dehydrogenase, is the only NDH2 ortholog maintained in all yeast lineages, although its activity with NADPH is not conserved in Sac. cerevisiae. Internal NADH dehydrogenase re-emerged as NDI1 from an NDE1 duplication, but was not maintained in the CTG clade or most of the Pichiaceae; NDI2 independently evolved in Nad. fulvescens. Lip. starkeyi is the only budding yeast that has kept all orthologs from the three cytosolic acetyl-CoA sources: ATP citrate lyase, acetyl-CoA synthase, and phosphoketolase. ATP citrate lyase and phosphoketolase both had duplications in Dikarya that led to hetero-oligomeric enzymes. Cytoplasmic NAD-dependent-aldehyde dehydrogenase was present in proto-yeast, but duplications led to mitochondrial localizations (ALD2.3, ALD5, ALD6.3) and NADP activity (ALD2.4, ALD5, ALD5.2, ALD6.1, ALD6.2, ALD6.3). ALD6.3 and ALD5.2 are recent duplications that are present in Dek. bruxellensis and Debaryomyces, respectively, two independent lineages of Crabtree-positive yeasts. Schizo. pombe, Nad. fulvescens, and Sac. cerevisiae are three Crabtree-positive yeasts that have significant independent duplications in their cytoplasmic ribosome subunits.

gene tree has the same topology for Nad. fulvescens as our species tree (Appendix D). These duplications are consistent with our topology for Yar. lipolytica, Bla. adeninivorans, and Nad. fulvescens.

There were no obvious gene duplications that could constrain the topology of Asc. rubescens. A clustering-based approach was used to find which clade of yeasts Asc. rubescens shared the most paralogs and de novo genes with to resolve its phylogenetic placement (Appendix E). Asc. rubescens shares more paralogs and de novo genes with the CPPSS clade than the PSS clade. Therefore the placement of Asc. rubescens as sister to the CPPSS clade [Mühlhausen et al., 2018] has more indirect support. The refined species tree topology is shown in Figure 5.1.
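The counting idea behind such a comparison can be sketched as below. This is only an illustration: the actual procedure is described in Appendix E, and the family identifiers, clade names, and function name are hypothetical placeholders rather than AYbRAH data.

```python
# Illustrative sketch: count gene families shared with each candidate clade.
# All identifiers below are made-up placeholders, not real AYbRAH ortholog groups.

def shared_family_counts(query_families, clade_families):
    """Return {clade: number of families the query shares with that clade}."""
    return {clade: len(query_families & families)
            for clade, families in clade_families.items()}

# Paralogs and de novo families found in the query species (hypothetical).
query = {"famA", "famB", "famC", "famD"}

# Families that characterize each candidate clade (hypothetical).
clades = {
    "clade_1": {"famA", "famB", "famC", "famX"},
    "clade_2": {"famA", "famY"},
}

counts = shared_family_counts(query, clades)
best = max(counts, key=counts.get)  # clade sharing the most families
print(counts, best)
```

The clade with the highest count is the better-supported placement under this simple criterion; ties or near-ties would leave the placement unresolved.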

5.4.2 Evolution of the pyruvate dehydrogenase bypass

Impact of acetyl-CoA source on biomass precursor yields in Sac. cerevisiae

The PDH bypass is a well-known route for cytosolic acetyl-CoA synthesis in Sac. cerevisiae [Pronk et al., 1996]. It consists of pyruvate decarboxylase (Pdc), NAD(P)-dependent acetaldehyde dehydrogenase (NAD(P)-Ald), and acetyl-CoA synthetase (Acs) (Figure 5.3B). The PDH bypass functionally replaced Acl, the ancestral acetyl-CoA source in proto-yeast (Figure 5.3B). To help understand which source of cytosolic acetyl-CoA is more advantageous for macromolecule biosynthesis, the production of carbohydrates, protein, DNA, RNA and phospholipids was maximized via the PDH bypass and Acl with glucose as a carbon source using a Sac. cerevisiae genome-scale metabolic model. No changes were observed in carbohydrate, RNA, and DNA yields for the PDH bypass with NADP-Ald, the PDH bypass with NAD-Ald, or Acl. Protein and phospholipid yields were highest with the PDH bypass, at 0.53 g/g and 0.02 g/g, respectively (Figure 5.3C). Replacing NADP+ with NAD+ as a cofactor for acetaldehyde dehydrogenase (Ald) marginally reduced both yields by 0.01%. Acl reduced the protein and phospholipid yields by 2.9% and 0.07%, respectively (Figure 5.3). FBA also indicates the PDH bypass has a higher ATP yield under oxygen-limiting conditions than the Acl pathway. Hence, the different acetyl-CoA sources may have an impact on fitness, which may have led budding yeasts to select the higher-yielding PDH bypass for protein and phospholipid biosynthesis, or for growth under oxygen-limiting conditions.
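As a worked check of the figures above, and assuming the percentage reductions are relative to the PDH bypass yields (the text does not state this explicitly), the Acl yields follow as:

```latex
\begin{aligned}
Y_{\text{protein}}^{\text{Acl}} &\approx 0.53 \times (1 - 0.029) \approx 0.515\ \text{g/g},\\
Y_{\text{phospholipid}}^{\text{Acl}} &\approx 0.02 \times (1 - 0.0007) \approx 0.020\ \text{g/g}
\end{aligned}
```

The symbol Y is introduced here for illustration. On this reading, the protein yield differs by roughly 0.015 g/g, whereas the phospholipid yield is unchanged at the reported precision.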

Evolution of acetyl-CoA sources in Dikarya

The PDH bypass and Acl are often associated with fermentative yeasts [van Urk et al., 1990] and oleaginous yeasts [Ratledge, 1991], respectively, but the phylogenetic origin of the PDH bypass has not been explored, nor why Acl was lost in most budding yeasts. The primary source of cytosolic acetyl-CoA in proto-yeast was likely Acl [Hynes and Murray, 2010] (Figure 5.3B); it exists as a homomer in Basidiomycota [Shashi et al., 1990], encoded by ACL1 (FOG00612), and as a heteromer in Ascomycota encoded by ACL1 and ACL2 (FOG00613) following an ancient gene duplication (Figure 5.2). Acl was lost in the lineage sister to Nad. fulvescens. Another source of acetyl-CoA in proto-yeast was phosphoketolase (Phk), which was lost in budding yeasts sister to Lip. starkeyi. Phk exists as a homomer in Basidiomycota and Taphrinomycotina, encoded by PHK1 (FOG00394), but emerged as a heteromer in the Pezizomycotina-Saccharomycotina clade following an ancient gene duplication (FOG00395). Proto-yeast had one copy of acetyl-CoA synthetase, encoded by ACS1 (FOG00404), which was likely expressed when grown on ethanol or acetate [Connerton et al., 1990] (Figure 5.3). A duplication of ACS1 in proto-fermenter led to ACS2 (Figure 5.3A), the glucose-inducible acetyl-CoA synthetase present in the PDH bypass [van den Berg et al., 1996, Zeeman and Steensma, 2003]. ACS3 (FOG00406), encoding a propionyl-CoA synthetase, derived from an additional duplication of ACS1 in Pezizomycotina; it may have been present in proto-yeast. These duplications illustrate that proto-yeast had multiple sources of cytosolic acetyl-CoA, but these were lost in most budding yeasts.

All budding yeasts in this study have homologs of the PDH bypass genes (Pdc, NAD-Ald or NADP-Ald, and Acs), but the delimiting gene for an active PDH bypass during glucose growth for budding yeasts in this study, and ultimately glucose fermentation to ethanol, is the ACS2 paralog. This duplication of ACS1 to ACS2 in proto-fermenter pushes back the origin of alcoholic fermentation in budding yeasts from 125-150 million years ago [Hagman et al., 2013], which is often associated with the proliferation of angiosperms [Piškur et al., 2006], to somewhere between 300-400 million years ago [Mühlhausen and Kollmar, 2014, Hedges et al., 2015]. Schizo. pombe [van Urk et al., 1990], Nad. fulvescens and Bla. adeninivorans, which can all ferment glucose to ethanol, are the only yeasts in this study with Acl, but still use the PDH bypass to synthesize acetyl-CoA with glucose as a carbon source. It is not known when or why Acl or the PDH bypass is preferred for cytosolic acetyl-CoA production in these species; both are likely to be used during glucose fermentation by Schizo. pombe since it can only grow on glucose and fructose. The absence of an active PDH bypass in Lip. starkeyi and Yar. lipolytica counters the theory that microbes will maximize their biomass yields given their metabolic network [Segre et al., 2002, Fong et al., 2003]. The activation of the PDH bypass with the ACS2 paralog reinforces the important role chance gene duplications play in evolution.

Expansion and loss in the aldehyde dehydrogenase family

At least two genes encoding NAD-Ald, ALD2 and ALD2.2, were present in proto-yeast’s genome for ethanol and acetate catabolism. Multiple duplications throughout Saccharomycotina led to new isozymes with different localizations, cofactor preferences, and functions. NADP-Ald emerged as cytosolic and

Figure 5.3: (A) Phylogenetic tree showing the three major ortholog groups in the acetyl-CoA synthetase family: ACS1, acetate-inducible acetyl-CoA synthetase; ACS2, glucose-inducible acetyl-CoA synthetase; ACS3, propionyl-CoA synthetase. (B) Metabolic pathways for acetyl-CoA production in Sac. cerevisiae via the PDH bypass and Yar. lipolytica via ACL. Gene annotations not conserved with Sac. cerevisiae’s genes: MCP, mitochondrial pyruvate carrier; PDH, PDH complex; CITm, mitochondrial citrate synthase; CITc, cytoplasmic citrate synthase; MDHm, mitochondrial malate dehydrogenase; MDHc, cytoplasmic malate dehydrogenase; CAT, carnitine o-acetyltransferase. (C) FBA predictions for protein and phospholipid yield from glucose with the PDH bypass (via NADP-ALD), PDH bypass (via NAD-ALD), and ATP citrate lyase using a GENRE for Sac. cerevisiae. Highest yields of protein and phospholipids occur with the PDH bypass via ALD6, which is absent in proto-yeast.

mitochondrial enzymes from several duplications in budding yeasts (Figure 5.2). ALD6.1 (FOG00361), a cytosolic NADP-Ald, likely originated from an ALD2 duplication; this is the major Ald isozyme expressed during glucose fermentation in the PDH bypass. Additional duplications led to ALD6.2 (FOG00362), orthologous to Sac. cerevisiae’s well studied ALD6 gene, and ALD5, the mitochondrial NAD(P)-Ald, from ALD6.1. ALD2.4 (FOG00350), which encodes a putative cytosolic NADP-Ald, emerged within Taphrinomycotina from another ALD2 duplication but has not been directly characterized [van Urk et al., 1990]. The loss of cytosolic NADP-Ald in the CTG clade, with the exception of Deb. hansenii, Cephaloascus albidus and Cephaloascus fragrans [Grigoriev et al., 2013], is puzzling as ALD6 has been shown to be indispensable for Sac. cerevisiae [Grabowska and Chelstowska, 2003]. An alternative NADPH source may have evolved in the CTG clade that reduced the need for cytosolic NADP-Ald. The presence and absence of ALD2.3 (FOG00349), encoding the only mitochondrial Ald in Pezizomycotina species examined in this study, appears to divide the fungal species into ethanologenic and citrogenic species; the PDH bypass may be active in Neu. crassa and Trichoderma reesei under some conditions [Sanchis et al., 1994]. Duplications in the Ald family enabled NADPH regeneration in the PDH bypass in ethanol fermenting yeasts in Saccharomycotina and in Schizosaccharomyces.

Dekkera bruxellensis and Deb. hansenii are two Crabtree-positive yeasts with peculiar recent duplications in the Ald family. First, Dek. bruxellensis has a mitochondrial NADP-Ald, ALD6.3 (FOG00363), originating from a duplication of ALD6.1; this complements the older mitochondrial NAD(P) paralog, ALD5, present in most yeasts sister to Nad. fulvescens. ALD6.3 is also present in other Crabtree-positive Dekkera species, but absent in Brettanomyces naardenensis, a closely related Crabtree-negative yeast. Second, Deb. hansenii has a cytosolic NAD(P)-Ald, ALD5.2 (FOG00360), derived from a mitochondrial NAD(P)-Ald duplication, ALD5; it is the only putative cytosolic NADP-ALD in the CTG clade covered in this study. These yeasts also deviate from the typical Crabtree effect observed in Saccharomycetaceae. Both yeasts possess Complex I and alternative oxidase, while not having any internal alternative NADH dehydrogenase (Ndi) orthologs (Figure 5.4B). Dek. bruxellensis accumulates large amounts of acetic acid when grown aerobically with glucose, for reasons that are not entirely understood. Fructose 1,6-bisphosphate, an important metabolite elevated during overflow metabolism in many prokaryotic and eukaryotic species, was found at a lower concentration in Deb. hansenii than Sac. cerevisiae following glucose addition to non-fasted cells [Sánchez et al., 2006]. Moreover, both yeasts exhibit the Crabtree effect at a lower glycolytic flux than Saccharomycetaceae yeasts [van Urk et al., 1990]. It is unclear how these paralogs may directly impact the Crabtree effect, but there is evidence NADP(H) may play an underappreciated role. Most Crabtree-positive yeasts cannot metabolize xylose via NADPH-dependent XR; Kluyveromyces is the only known Crabtree-negative lineage in Saccharomycetaceae, and the only lineage with a known NADP-dependent glyceraldehyde 3-phosphate dehydrogenase (NADP-GAPDH) [Verho et al., 2002]; NADPH-dependent nitrate reductase impacts aerobic ethanol and acetic acid production in Pac. tannophilus and Dek. bruxellensis [Jeffries, 1983, Galafassi et al., 2013, Moktaduzzaman et al., 2015]. Recent studies have shown the importance metabolites play in and may offer a template to study how NAD(P)H directly or indirectly impacts the Crabtree effect [Gerosa et al., 2015, Hackett et al., 2016].

5.4.3 Evolution of NADH dehydrogenase

Branched electron transport chain in fungi

The branched-ETC in Dikarya fungi is outlined in Figure 5.4B. Complex I, external alternative dehydrogenase (Nde), and Ndi are the three mechanisms for NADH reoxidation by the ETC. Complex I and Ndi are sometimes confused in the literature, so it is important to outline their key similarities and differences. Both oxidize NADH in the matrix, and form supercomplexes with Complex III and IV [Mileykovskaya et al., 2012]. In terms of differences, Ndi is encoded by a single nuclear gene, while Complex I is one of the largest complexes known, requires coordinated assembly from proteins encoded by the nuclear and mitochondrial DNA, and is a large source of reactive oxygen species (ROS) [Seo et al., 2006]. All yeasts in this study encode at least one Nde. In contrast, some yeast lineages have lost either Complex I or Ndi. Complex II, Complex III, Complex IV, and ATP synthase are conserved in all yeasts and fungi in this pan-genome. Alternative oxidase is absent in several lineages, most notably in yeasts that have lost Complex I [Riley et al., 2016], Pac. tannophilus, and Oga. parapolymorpha.

Repeated loss of Complex I in yeasts

Genes encoding Complex I (FOG00684-FOG00734) were independently lost in Schizosaccharomyces [Gabaldón et al., 2005], Nadsonia [Riley et al., 2016], and the Saccharomycodaceae-Saccharomycetaceae clade [Dujon, 2010, Riley et al., 2016]. Complex I has also been independently lost in Starmerella bacillaris, Ogataea philodendri, and Wickerhamomyces pijperi [Freel et al., 2015]. Starmerella is a known Crabtree-positive yeast, but the physiologies of Oga. philodendri and Wic. pijperi have not been studied in the literature. Kerscher [2000] hypothesized Sac. cerevisiae’s ancestor lost Complex I because it is more energy efficient to translate one protein than to coordinate the assembly of more than 35 subunits for Complex I during fast cell proliferation. Contrary to this hypothesis, Complex I appears to be assembled even when it is inactive in some yeasts [Ohnishi, 1972].

Orientation of type II NADH dehydrogenase

The orientation of type II NADH dehydrogenase has been confirmed using genetics in some organisms [Kerscher et al., 1999, Bakker et al., 2000]. Phobius was used to show the position of transmembrane domains in the primary sequence of type II NADH dehydrogenase can predict its membrane orientation (Figure 5.4B and C). Ndi enzymes, which orient towards the matrix, have C-terminal hydrophobic domains; Nde enzymes, which orient towards the cytosol, have dominant N-terminal hydrophobic domains and sometimes a C-terminal hydrophobic domain (Figure 5.4C). Schizo. pombe does not encode any directly characterized Ndi orthologs but can oxidize NADH within the mitochondrial matrix [Crichton et al., 2007]. Alternative translation start sites in its two Nde orthologs, which are conserved in other Schizosaccharomyces species, may translate Ndi isoforms. Future studies can confirm Phobius predictions for Schizo. pombe’s dual localization from its Nde genes and other uncharacterized Ndi-Nde genes.
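The orientation rule described above can be written as a small classifier. This is a sketch under stated assumptions: it takes transmembrane segments as plain (start, end) residue coordinates rather than parsing Phobius output, the one-third cutoff for calling a segment N-terminal or C-terminal is an arbitrary illustrative choice, and the function name and example coordinates are hypothetical.

```python
# Sketch of the orientation rule: N-terminal hydrophobic domain -> Nde (external),
# C-terminal hydrophobic domain only -> Ndi (internal).
# The one-third cutoff is an assumption made for illustration.

def predict_orientation(seq_length, tm_segments, terminal_fraction=1/3):
    """Classify a type II NADH dehydrogenase from predicted TM segment positions.

    tm_segments: list of (start, end) residue positions, 1-based inclusive.
    Returns "Nde", "Ndi", or "unresolved".
    """
    n_cutoff = seq_length * terminal_fraction
    c_cutoff = seq_length * (1 - terminal_fraction)
    midpoints = [(start + end) / 2 for start, end in tm_segments]
    has_n_terminal = any(m <= n_cutoff for m in midpoints)
    has_c_terminal = any(m >= c_cutoff for m in midpoints)
    if has_n_terminal:
        # Nde enzymes may also carry a weaker C-terminal segment.
        return "Nde"
    if has_c_terminal:
        return "Ndi"
    return "unresolved"

# Hypothetical coordinates for three 500-residue proteins.
print(predict_orientation(500, [(20, 40)]))              # N-terminal segment only
print(predict_orientation(500, [(25, 45), (470, 490)]))  # N- and C-terminal segments
print(predict_orientation(500, [(465, 485)]))            # C-terminal segment only
```

A rule of this kind ignores prediction confidence; in practice the posterior probabilities reported by the predictor would need a threshold before a segment is counted.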

Expansion and loss of type II NADH dehydrogenase

Proto-yeast likely had three type II NADH dehydrogenase orthologs (Figure 5.4A). The oldest ortholog group in this family is NDI0 (FOG00837), which encodes Ndi in fungi [Duarte et al., 2003] and some yeasts. NDE0 (FOG00838) and NDE1 (FOG00839) emerged from independent duplications of NDI0, and encode Nde. NDE1 is conserved in all species reviewed in this study, but there is a repeated loss of Ndi in yeasts. NDI0 was lost in Schizosaccharomyces and in the clade sister to Lip. starkeyi. Ndi re-emerged as NDI1 (FOG00846) from a duplication of NDE1 in the clade sister to Asc. rubescens; NDI1 was subsequently lost in the CTG clade and Pichiaceae, with the exception of Pac. tannophilus. Ndi also independently re-emerged as NDI2 (FOG00842) in Nad. fulvescens, likely before the loss of its Complex I. NDE1.2 (FOG00840) originated from an NDE1 duplication in Schizo. pombe.

Unlike Sac. cerevisiae, proto-yeast’s Nde enzymes had NADPH dehydrogenase activity [Melo et al., 2001, Carneiro et al., 2004, Tarrío et al., 2006a,b]. It is not known where the NADPH dehydrogenase activity was lost in Saccharomycetaceae, or whether its loss contributed to the Crabtree effect. Repeated duplications in the type II NADH dehydrogenase family led to new isozyme orientations in the mitochondrial membrane, but there are reoccurring losses of Ndi orthologs in the budding yeast pan-genome.

Impact of NADH dehydrogenase sink on biomass yield

The impact of reoxidizing NADH with Complex I versus Ndi on the biomass yield was assessed with FBA. The proton-translocating Complex I and non-proton translocating Ndi (Figure 5.4D) yielded 0.63

Figure 5.4: (A) Phylogenetic reconstruction of the type II membrane-bound NADH dehydrogenase family. Grey stars indicate the origin of duplications. Major ortholog groups are highlighted in different colors and designated as internal alternative NADH dehydrogenase with NDI and external alternative NADH dehydrogenase with NDE. (B) Organization of the ETC in Fungi. Complex I, NDE, NDI, and alternative oxidase are enzymes not conserved across all fungi in this study and are designated with red text with underlines. (C) Transmembrane posterior probability for aligned Sac. cerevisiae Ndi1p (NDI1_YEAST) and Nde1p (NDE1_YEAST). Ndi1p orthologs have C-terminal transmembrane predictions; Nde1p orthologs have N-terminal transmembrane predictions and sometimes lower C-terminal transmembrane predictions. (D) Biomass yield as a function of non-proton-translocating NDI1 and proton-translocating Complex I in the Sac. cerevisiae GEMS.

and 0.50 g biomass/g glucose, respectively, when the protein content of the biomass was 40%. This is in agreement with experimental yields for aerobic, respiratory growth of yeasts on glucose [van Urk et al., 1990, Van Hoek et al., 1998]. These in silico results indicate there is a clear advantage in using Complex I to obtain higher biomass yields; however, unaccounted biophysical or biochemical constraints [Goel et al., 2015] may explain why many proliferating prokaryotes, lower eukaryotes, and higher eukaryotes prefer aerobic fermentation to type I or type II NADH dehydrogenase.
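Taking the two simulated yields at face value, the relative advantage of Complex I can be made explicit; the symbol Y is introduced here for illustration:

```latex
\frac{Y_{\text{Complex I}} - Y_{\text{Ndi}}}{Y_{\text{Ndi}}} = \frac{0.63 - 0.50}{0.50} = 0.26
```

That is, the simulated biomass yield with Complex I is about 26% higher than with Ndi at a biomass protein content of 40%.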

Reactive oxygen species-driven loss of Complex I

Recent hypotheses have suggested tradeoffs in proteome allocation or membrane economics account for the preference of aerobic fermentation over oxidative phosphorylation during fast cell proliferation, but there is no consensus on a root cause in all organisms. One hypothesis that has not been considered in the metabolic modelling community is the elimination of ROS by aerobic fermentation in proliferating cells; this was first demonstrated in rat thymocytes by Brand and Hermfisse [1997]. ROS generation by Complex I can divert NADPH from biosynthesis of biomass precursors to ROS detoxification, and left untreated can cause damage to lipids, DNA, and proteins. Therefore, FBA may predict a fitness benefit via Complex I, but this may come at a net negative cost to yeasts when membrane economics, proteome allocation, or ROS detoxification are considered.

Helliwell et al. [2013] proposed that vitamin auxotrophies can arise when organisms have a readily available supply of a vitamin with a large biosynthetic cost. One example that demonstrates this is the independent loss of L-gulonolactone oxidase [Helliwell et al., 2013], the last step in vitamin C biosynthesis, in apes, guinea pigs, and bats. L-gulonolactone oxidase is a known source of ROS, and these organisms’ ancestors may have had ample supply of vitamin C. The repeated loss of Complex I in yeasts may fit this model for evolution. If a yeast found itself in an environment in which Complex I was persistently inhibited or inactivated to reduce ROS production, a deleterious mutation to one subunit would have been enough to cause Complex I to become defunct, rendering its more than 35 genes into pseudogenes, and eventually to junk DNA without a trace of Complex I, despite having a higher biomass yield than Ndi.

5.4.4 Heteromerization of enzyme complexes

Proto-yeast had one gene encoding a homomeric phosphofructokinase. The emergence of ACS2 in proto-fermenter was followed by a duplication of PFK2, causing the ancestral homomeric 6-phosphofructokinase (Pfk) to become a heteromeric enzyme with altered allosteric regulation [Habison et al., 1983, Reuter et al., 2000, Flores et al., 2005]. Both forms are activated by ATP at low concentrations, and inhibited by ATP at higher concentrations. Pfk in Pezizomycotina fungi and citrogenic yeasts is inhibited by citrate or phosphoenolpyruvate, in contrast to the lack of citrate inhibition in Sac. cerevisiae.

Heteromeric Pfks in Saccharomycotina are also activated by AMP and fructose 2,6-bisphosphate [Bär et al., 1997, Lorberg et al., 1999, Kirchberger et al., 2002], an activation that is absent in Yar. lipolytica’s Pfk [Flores et al., 2005]. The physiological significance of these changes in allostery has not been explored in the literature, but the emergence of the PDH bypass, which generates AMP from acetyl-CoA synthetase, and reduces the demand for citrate in the cytosol, may play a role in stimulating upper glycolysis in ethanologenic yeasts, while citrate accumulation in lipogenic yeasts may reduce upper glycolytic flux.

Crabtree-positive yeasts in Saccharomycetaceae have been shown to have elevated AMP levels relative to Crabtree-negative yeasts [Christen and Sauer, 2011], but further characterization is needed in other Crabtree-positive and citrogenic yeast lineages. The absence of citrate inhibition in upper glycolysis in Sac. cerevisiae could allow for fructose 1,6-bisphosphate (F1,6P) accumulation, an important source of metabolic control [Díaz-Ruiz et al., 2008] during nitrogen-limiting conditions [Hackett et al., 2016]. Coarse-grained kinetic studies, similar to recent investigations on the glycolytic imbalance in Sac. cerevisiae [van Heerden et al., 2014] and bistability in metabolic networks [Srinivasan et al., 2017], could help elucidate the impact of various Pfk kinetics on overflow metabolism given different metabolic networks and environments.

The evolution of heteromers from homomers has been studied in Saccharomyces by Diss et al. [2017] as an alternative to the enzyme dosing hypothesis. Heteromers were found to have evolved from small-scale duplications and the whole genome duplication, covering a wide range of biological functions and time-scales. Diss et al. [2017] demonstrate these duplications lead to robustness and fragility in protein complexes. The evolution of Pfk enzyme kinetics, and previously discussed Phk and Acl, demonstrate the heteromerization of enzyme complexes may lead to a fitness advantage with enzyme kinetics. Ancient transitions to heteromers that predate Fungi include NAD-dependent isocitrate dehydrogenase and the mitochondrial pyruvate carrier; Mpc3p emerged as a new heteromeric complex with Mpc1p within the Saccharomyces genus after the WGD. The repeated emergence of heteromer enzymes from homomer enzymes also illustrates an underappreciated mechanism for evolving enzyme kinetics.

5.4.5 Ribosomal subunit duplications

Proto-yeast likely had one or two paralogs of its ribosome protein subunits; additional paralogs were gained via duplications throughout Saccharomycotina evolution (Figure 5.2). Schizo. pombe, Nad. fulvescens, and Sac. cerevisiae represent three independent lineages of Crabtree-positive species in Ascomycota with significant independent duplications of large and small ribosome protein subunits; coincidentally, their ETCs do not contain Complex I. It is unknown if similar duplications occurred in Sta. bacillaris, Oga. philodendri, or Wic. pijperi, the remaining known yeast lineages without Complex I, since they lack nuclear genome sequences or genome annotations. No yeasts have significant mitochondrial ribosome protein subunit duplications. rRNA genes were not studied in the pan-genome, but have been analyzed by Riley et al. [2016].

Gerst [2018] recently reviewed the impact of ribosome paralogs on translation control. Ribosome heterogeneity, which arises from different ribosome paralog subunit composition in ribosomes, can add an extra layer of control that selectively translates mRNAs or mRNA subsets. A ribosome favouring a fermentative or respirative proteome would be analogous to the fermentative and respirative MPC that evolved in WGD yeasts. The increase in ribosome heterogeneity appears to be a beneficial trait in Crabtree-positive yeasts that evolved independently, similar to promoter rewiring [Rozpędowska et al., 2011].

5.4.6 Changes in enzyme localization via gene duplications

Subfunctionalization and neofunctionalization have been studied in WGD yeasts, but they have not been thoroughly studied in other taxonomic groups. Gene duplications have driven changes in enzyme localization in this pan-genome.

Enzyme localization via subfunctionalization

Proto-yeast had one gene encoding cytosolic, mitochondrial, and peroxisomal NADP-isocitrate dehydrogenase; a duplication led to a mitochondrial NADP-isocitrate dehydrogenase, IDP1.2, before the divergence of the clade sister to Bla. adeninivorans. Similarly, proto-yeast had one gene encoding cytoplasmic and peroxisomal malate synthase, MLS1; peroxisomal malate synthase, MLS1.2, derived from a duplication of MLS1, before the clade sister to Lip. starkeyi emerged. The cytosolic ortholog was subsequently lost in Sac. cerevisiae, possibly after the rise of the DAL7 ohnolog. Peroxisomal malate dehydrogenase, MDH3, originated from the ancestral cytoplasmic and peroxisomal malate dehydrogenase, MDH1, before the divergence of the CPPSS clade; mitochondrial malate dehydrogenase, MDH2, is present in all Dikarya fungi covered in this study.

Enzyme localization via neofunctionalization

Changes in enzyme localization also occurred via neofunctionalization within Saccharomycotina. A duplication of a cytosolic succinate semialdehyde dehydrogenase (FOG00976), UGA1, led to a mitochondrial succinate semialdehyde dehydrogenase gene (FOG00977), UGA1.2, before the divergence of the Asc. rubescens-CPPSS clade; there is no evidence of a mitochondrial succinate semialdehyde dehydrogenase in earlier fungal lineages. Similarly, proto-yeast may have only had a mitochondrial homocitrate synthase, LYS21, but its duplication led to a cytosolic isozyme, LYS21.2, before the Asc. rubescens-CPPSS clade divergence.

A mitochondrial trifunctional enzyme C1-tetrahydrofolate synthase, MIS1, derived from a cytoplasmic isozyme, ADE3.2, before the emergence of the clade sister to Lip. starkeyi; it was subsequently lost in Sac. cerevisiae and has not been characterized. An independent duplication in Neurospora crassa also led to a mitochondrial paralog. Mitochondrial NAD-dependent glycerol 3-phosphate dehydrogenase, GPD2, emerged from the cytosolic isozyme, GPD1, before the clade sister to Lip. starkeyi diverged; an independent duplication is also present in Pezizomycotina.

Proto-yeast had a single mitochondrial alcohol dehydrogenase, encoded by ADH4. A duplication of a cytosolic alcohol dehydrogenase, ADH1, which does not have shared ancestry with ADH4, led to an additional mitochondrial alcohol dehydrogenase, ADH3, in the clade sister to Bla. adeninivorans. ADH4 was lost before the radiation of the Asc. rubescens-CPPSS clade, but regained in the PSS clade following a HGT from a Schizosaccharomyces species. ADH3 plays a role in shuttling NADH across the mitochondrial membrane in anaerobic cultures of Sac. cerevisiae [Bakker et al., 2000], but it is unclear if ADH4 had the same function in proto-yeast or proto-fermenter. Nad. fulvescens, the enigmatic Crabtree-positive yeast with genomic attributes of proto-yeast and proto-fermenter, has retained both orthologs, whereas Bla. adeninivorans only has the ADH4 ortholog and can ferment glucose to ethanol.

These duplications indicate that neofunctionalization and subfunctionalization have occurred throughout budding yeast evolution and were not exclusive to yeasts that underwent the WGD [Sémon and Wolfe, 2007]. In several cases, ohnologs in Sac. cerevisiae overtook the function of a lost paralog (e.g. MDH3 and OSM1).

5.4.7 Redox cofactor changes in enzymes

Gene duplications have led to changes in enzyme cofactor preferences for budding yeasts. The expansion of the aldehyde dehydrogenase family outlines some of these changes in redox cofactor preference; these enzymes also have different metal preferences. One striking change in cofactor preference in yeasts is NAD(P)-glyceraldehyde 3-phosphate dehydrogenase (NAD(P)-GAPDH), encoded by GDP1 (FOG00290), in Kluyveromyces lactis. This is the only phosphorylating glyceraldehyde 3-phosphate dehydrogenase that has been isolated from fungi [Verho et al., 2002], which originated from a TDH3 duplication (FOG00285).

The presence of GDP1 coincides with a loss of THI3 (FOG00316), which encodes a regulatory protein arising from a Pdc duplication, with an ability to assimilate xylose that is not present in other Saccharomycetaceae yeasts [Kurtzman et al., 2011], and with a lack of aerobic ethanol fermentation in batch and chemostat [Kiers et al., 1998]; Klu. lactis does accumulate small amounts of ethanol after more than 30 minutes following a glucose pulse. The GDP1 paralog may have alleviated the Crabtree effect in Klu. lactis or the imbalance between upper and lower glycolysis [van Heerden et al., 2014].

5.4.8 Horizontal gene transfer

HGT is widespread between prokaryotes, somewhat common from prokaryotes to eukaryotes, and less common from eukaryote to eukaryote [Danchin, 2016, Husnik and McCutcheon, 2017]. Previous HGTs that have been identified in Ascomycota include the transfer of URA1 (FOG01561) from a Lactobacillus species to Saccharomycetaceae, and of GAL10.2 from a Candida species to an ancestor of Schizo. pombe [Slot and Rokas, 2010].

ADH4 is the most noteworthy xenolog discovered in this yeast pan-genome that has not previously been described. ADH4 encodes a mitochondrial alcohol dehydrogenase, which belongs to a different class than ADH1-ADH3. As previously noted, ADH4 was lost in the Ascoidea-CPPSS clade, but re-emerged in the PSS clade following a HGT from Schizosaccharomyces. Surprisingly, the Schizosaccharomyces ADH4 is also a xenolog from Neisseriales, or a higher taxonomic rank; Schizosaccharomyces either had lost the ancestral ADH4 and reacquired it via HGT, or had two ADH4 homologs. Schizosaccharomyces seemed most prone to HGT based on the genes we reviewed. These xenologs highlight that prokaryote-to-eukaryote and eukaryote-to-eukaryote HGTs have been overlooked.

5.5 Conclusion

Existing methods cannot reconstruct the correct topology for yeasts with deep phylogeny, including Nad. fulvescens, Bla. adeninivorans, and Asc. rubescens. FBA simulations indicate the PDH bypass has a higher yield of protein and phospholipids from glucose than Acl, the ancestral acetyl-CoA source in proto-yeast. All yeasts in this study have homologs of the PDH bypass but the delimiting gene for its expression with glucose appears to be ACS2, a paralog of the ancestral acetate-inducible acetyl-CoA synthetase. FBA simulations demonstrate Complex I results in a 26% higher biomass yield than Ndi when cells are grown with glucose. Complex I is bypassed for Ndi in some yeast lineages, and both enzymes have been lost in many independent yeast lineages. Duplications in the Ndi family have enabled the enzyme to orient paralogs to the cytosol or matrix. The repeated loss of Complex I in yeasts is hypothesized to be beneficial because it eliminates ROS production during fast growth. Additional gene duplications led to enzymes with altered enzyme kinetics, changes in cofactor preference, and enzyme localization. By reconstructing the evolution of homologs into orthologs at the pan-genome level, bioinformaticians can improve genome annotations for existing and future genome sequences and improve the accuracy of species trees, which can ultimately help biologists understand the evolution of physiology throughout the tree of life.

Chapter 6

Fungi pan-genome-scale network reconstruction

What I cannot create, I do not understand.

Richard Feynman

6.1 Abstract

A GENRE represents a compendium of knowledge of an organism and can be used in a variety of applications. The drop in genome sequencing costs has led to an increase in sequenced genomes, but the number of curated GENREs has not kept pace. This gap hinders our ability to study physiology across the tree of life. Furthermore, non-conventional yeast GENREs have significant commission and omission errors, especially in central metabolism. To address these quantity and quality issues for GENREs, an open and transparent framework is outlined for the curation of the pan-genome, pan-reactome, pan-metabolome, and pan-phenome for taxa by research communities, rather than for a single species. This is demonstrated with a Fungi pan-GENRE by integrating AYbRAH and AYbRAHAM, a new fungal reaction database. This pan-GENRE was used to compile 33 yeast and fungal GENREs in the Dikarya subkingdom, spanning 600 million years of evolution. The fungal pan-GENRE contains 1547 orthologs, 2726 reactions, 2226 metabolites, and 10 compartments. The strain GENREs have a wider genomic and metabolic coverage than previous yeast and fungi GENREs. Metabolic simulations show the amino acid yields from glucose differ between yeast lineages, indicating metabolic networks have evolved in yeasts. Curating ortholog and reaction databases for a taxon can be used to increase the quantity and quality of strain GENREs. This pan-GENRE framework provides the ability to scale high-quality GENREs to more branches in the tree of life.

Associated publication: K Correia, R Mahadevan. Pan-genome-scale network reconstruction: a framework to increase the quantity and quality of metabolic network reconstructions throughout the tree of life. bioRxiv, 412593.

6.2 Introduction

A GENRE represents the knowledge base of an organism and has a variety of applications. Studies have used GENREs in metabolic simulations to uncover new enzyme-metabolite interactions [Hackett et al., 2016] and understand the pharmacokinetic response of drugs in the human body [Thiele et al., 2017]. They can also be used as a scaffold for omics integration [Österlund et al., 2013].

Current practices limit the quantity and quality of GENREs, and their application to study metabolism throughout the tree of life. First, the drop in genome sequencing costs and the common practice of having one research team reconstruct the metabolic network for an organism has led to a growing gap between genome sequences and curated GENREs. Second, most GENRE efforts are directed to a handful of organisms with existing GENREs [Monk et al., 2014]; the yeast/fungal GENRE community primarily focuses on industrially-relevant organisms, which has led to parallel reconstructions for Sac. cerevisiae, Yar. lipolytica, Scheffersomyces stipitis, Kom. phaffii, and Asp. niger [Lopes and Rocha, 2017]. Third, anchoring and availability biases cause non-conventional organisms’ GENREs to be skewed towards the metabolism of model organisms, while neglecting their own unique metabolism (Figure 6.1); examples of these commission and omission errors in yeast and fungi GENREs can be found in Appendix A. Finally, eukaryotic GENREs have additional challenges not found in prokaryotes since they have larger genomes, fewer operons/gene clusters, numerous paralogs, alternative splicing, and compartments. These quantity and quality issues for GENREs hinder our ability to understand and control metabolism throughout the tree of life and highlight the need to build upon the bottom-up GENRE protocol outlined by Thiele and Palsson [2010].

Figure 6.1: Non-conventional organism genome-scale network reconstruction (GENRE) using a model organism GENRE as a template versus databases from community-driven curation of a pan-genome, pan-reactome, pan-metabolome, and pan-phenome. Saccharomyces Genome Database (SGD) mines the literature to describe the physiology of Saccharomyces cerevisiae (a model organism). This knowledge base is leveraged to compile a Sac. cerevisiae GENRE, which is its state-of-the-art metabolic network; the gap between its true metabolic network and the state-of-the-art GENRE represents undiscovered and promiscuous enzymes. In the absence of a curated genome database for Scheffersomyces stipitis (a non-conventional organism), the Sac. cerevisiae GENRE is used as a template to guide the Sch. stipitis GENRE. Anchoring bias makes the Sch. stipitis GENRE skew towards the Sac. cerevisiae GENRE with commission errors; enzymes that have been characterized in Sac. cerevisiae, such as cytosolic NADP-dependent acetaldehyde dehydrogenase, have been included in past Sch. stipitis reconstructions despite no evidence based on orthology or enzymology. Availability bias prevents the GENRE curators from describing metabolism that is unique to Sch. stipitis and not well studied or documented, leading to omission errors; Sch. stipitis shares alkane hydroxylase orthologs with Candida tropicalis, a known alkane degrader [Lebeault et al., 1971], and yet alkane degradation has never been described in any Sch. stipitis or yeast GENRE. Community-driven curation of a pan-genome, pan-reactome, pan-metabolome, and pan-phenome reduces anchoring and availability biases by removing Sac. cerevisiae as the focal point for comparative metabolic network reconstruction, assigning orthology from rigorous phylogenetic analysis of gene families rather than error-prone methods, and capturing non-canonical reactions catalyzed by promiscuous enzymes.

GENRE curators must make hundreds or thousands of ad hoc decisions, often without traceability [Ravikrishnan and Raman, 2015], because existing databases for orthologs, reactions, metabolites, and phenotypes are not always accurate or complete. This approach prevents GENREs from being scaled throughout the tree of life. To solve these problems, a new approach is outlined for pan-GENRE with the formalized curation of orthologs, reactions, metabolites, and phenotypes for a larger group of organisms by research communities (Figure 6.2A). Previous studies have used this pan-genome approach for different strains of the same species to reconstruct metabolic networks [Islam et al., 2010, Monk et al., 2013], but the pan-genome can be expanded to a broader range of organisms [Vernikos et al., 2015]. The first step in this framework is the curation of orthologs, paralogs, ohnologs, and xenologs within a pan-genome; many ortholog databases are available but these often cannot distinguish between orthologs and paralogs. The second step is compiling the canonical and non-canonical reactions within a taxon and annotating them with ortholog-protein-reaction associations (OPRs). This reaction database can be collected from various sources (Figure 6.2B); existing reaction databases generally focus on canonical metabolism and underrepresent the metabolic potential of enzymes [Notebaart et al., 2014]. Third, curating the presence and dynamics of metabolites from a larger group of organisms can facilitate the discovery of new pathways and enzymes [Caudy et al., 2018], especially when the metabolites are not found within the S-matrix. One notable example is the discovery of riboneogenesis, which was detected by sedoheptulose-1,7-bisphosphate changes in a shb17 knockout strain of Sac. cerevisiae [Clasquin et al., 2011]. Finally, curation of phenotypic data into machine-readable formats can help test, validate and improve GENREs; Experiment Data Depot (EDD) has been recently described [Morrell et al., 2017] but phenotype databases are less developed than genome, reaction, and metabolite databases. The primary purpose of these databases is for generating high-quality GENREs, but they can also benefit the broader scientific community.
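To make the ortholog-protein-reaction (OPR) idea concrete, the sketch below shows one possible way an OPR could decide whether a reaction is carried into a strain reconstruction. The data layout, ortholog group identifiers, and gene names are hypothetical and are not taken from AYbRAH or AYbRAHAM; an OPR is modelled here, as an assumption, as a list of alternative enzyme forms, each requiring a set of ortholog groups.

```python
from typing import Dict, List, Set

# Assumed layout: an OPR is a list of alternatives (OR), each alternative
# being the set of ortholog groups that must all be present (AND).
OPR = List[Set[str]]


def reaction_genes(opr: OPR, strain_orthologs: Dict[str, List[str]]) -> List[str]:
    """Return the strain genes supporting a reaction, or [] if no alternative is complete."""
    genes: Set[str] = set()
    for required_groups in opr:
        # An alternative counts only if every required ortholog group has a gene.
        if all(strain_orthologs.get(group) for group in required_groups):
            for group in required_groups:
                genes.update(strain_orthologs[group])
    return sorted(genes)


# Hypothetical heteromeric enzyme that needs two ortholog groups.
heteromer_opr: OPR = [{"FOG_A", "FOG_B"}]
strain_1 = {"FOG_A": ["gene1"], "FOG_B": ["gene2"]}
strain_2 = {"FOG_A": ["gene1"]}  # second subunit is missing

print(reaction_genes(heteromer_opr, strain_1))
print(reaction_genes(heteromer_opr, strain_2))
```

Expressing the association at the ortholog-group level rather than the gene level is what would let a single curated reaction entry be reused across every strain in the taxon.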

This outlined framework is demonstrated with a pan-GENRE for 33 yeasts/fungi spanning more than 600 million years of evolution. It was created by integrating AYbRAH, our yeast/fungal ortholog database, and Analyzing Yeasts by Reconstructing Ancestry of Homologs about Metabolism (AYbRAHAM), a new yeast/fungal reaction database. This approach enables GENREs for lower taxonomic levels, such as phylum, family or strains, to be compiled from the fungal pan-GENRE, and permits GENREs to be synchronized with each other as the community characterizes more enzymes or research teams improve individual strain GENREs (Figure 6.2C). The benefits and drawbacks of this approach are discussed, and compared with CoReCo [Pitkänen et al., 2014].

6.3 Methods

6.3.1 Network reconstruction

A draft consensus pan-GENRE for Fungi was compiled from the following GENREs using BiGG nomenclature [King et al., 2015]: Schizo. pombe [Sohn et al., 2012], Asp. niger [Andersen et al., 2008], Yar. lipolytica [Pan and Hua, 2012, Loira et al., 2012], Kom. phaffii [Caspeta et al., 2012], Sch. stipitis [Balagurunathan et al., 2012, Caspeta et al., 2012, Li, 2012, Liu et al., 2012], Eremothecium gossypii [Ledesma-Amaro et al., 2014], and Sac. cerevisiae [Aung et al., 2013]. Additional reactions were added to the pan-GENRE by reviewing enzyme assays and by inferring reactions from growth assays and Biolog data; these are referenced in the notes and reference section of the GENRE. Reactions were annotated with OPRs using fungal ortholog groups (FOGs) from AYbRAH if there was strong genomic evidence. OPRs for protein complexes were reviewed manually to determine the essential subunits. Protein complexes for Sac. cerevisiae and Schizo. pombe were cross-referenced with the Complex Portal [Meldal et al., 2014]. Reactions were also annotated with ORPHAN_FOG if a reaction was catalyzed by an orphan enzyme and had a prerequisite FOG; ORPHAN_RXN if a reaction had a prerequisite reaction; or an organism code if there was evidence in a species or strain but the enzyme was unknown. Reactions annotated with BIOMASS, ESSENTIAL, ORPHAN, SPONTANEOUS, EQUILIBRIUM, DIFFUSION, EXCHANGE, or DEMAND were automatically included in all child GENREs. A Python script was used to compile GENREs for taxonomic ranks below Fungi in SBML3 and XLSX using the reaction annotations. Reactions requiring charge and mass balancing in the Fungi pan-GENRE were identified with Memote [Lieven et al., 2018]. PaperBLAST was used to find literature evidence for reactions in some ortholog groups [Price and Arkin, 2017].
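The annotation-driven compilation described above can be illustrated with a minimal sketch. This is not the script used in this work: the function, the data layout, and the reaction and ortholog identifiers are hypothetical, and the handling of ORPHAN_FOG, ORPHAN_RXN, organism codes, and protein complexes is deliberately left out.

```python
from typing import Dict, List, Set

# Annotations that the text says are always carried into child GENREs.
ALWAYS_INCLUDED = {
    "BIOMASS", "ESSENTIAL", "ORPHAN", "SPONTANEOUS",
    "EQUILIBRIUM", "DIFFUSION", "EXCHANGE", "DEMAND",
}


def compile_child(reactions: Dict[str, Set[str]], present_groups: Set[str]) -> List[str]:
    """Select reaction IDs for a child GENRE.

    `reactions` maps a reaction ID to its annotations (tags or ortholog
    group IDs). A reaction is kept if it carries an always-included tag
    or if any of its ortholog groups is present in the child taxon.
    """
    kept = []
    for rxn_id, annotations in reactions.items():
        if annotations & ALWAYS_INCLUDED or annotations & present_groups:
            kept.append(rxn_id)
    return sorted(kept)


# Hypothetical pan-GENRE fragment and ortholog set for one strain.
pan_reactions = {
    "RXN_1": {"FOG_X"},
    "RXN_2": {"FOG_Y"},
    "EX_glc": {"EXCHANGE"},
}
print(compile_child(pan_reactions, present_groups={"FOG_X"}))
```

Because selection depends only on the annotations, rerunning such a step after the ortholog or reaction databases are updated would regenerate every child reconstruction consistently.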

6.3.2 Biomass equation formulations

Figure 6.2: Pan-genome-scale network reconstruction framework for improving the quantity and quality of GENREs. (A) A research community for a taxon, such as yeasts and fungi, curates its pan-genome, pan-reactome, pan-metabolome, and pan-phenome. Orthologs, paralogs, ohnologs, and xenologs are identified in the pan-genome by rigorous phylogenetic analysis. Reactions catalyzed by enzymes within a taxon are described in the pan-reactome and annotated with ortholog-protein-reaction associations (OPR). The presence and dynamics of metabolites are described in the pan-metabolome of a taxon; metabolites in the pan-metabolome not captured in the S-matrix can lead to new pathway discovery in the pan-reactome. Phenotypes within the taxon are transcribed into machine-readable formats to test, validate and improve genome-scale metabolic models. These databases are the basis for the pan-GENRE. (B) The pan-reactome is created by mining the literature, inferred reactions from growth assays and Biolog, pathway analysis and gap-filling, in vitro enzyme characterization and crude enzyme assays, comparative genomics and comparative physiology. (C) Our Fungi pan-GENRE spans 600 million years of evolution for 33 yeasts and fungi. It was used as a template to compile GENREs for each taxonomic rank from subkingdom to strain. GENREs for any taxonomic rank can be curated by updating the AYbRAH ortholog and AYbRAHAM reaction databases. These changes are pulled to the Fungi pan-GENRE, and pushed to the lower taxonomic rank GENREs, enabling all the GENREs to be synchronized. A GENRE can be used for genome-scale metabolic modelling (GSMM) to test new pathways and reconcile in silico and in vivo data. Further GENRE curation can bridge the gap between in silico and in vivo data.

The biomass equation comprised protein, RNA, DNA, phospholipids, sterols, chitin, β-D-glucan, glycogen, trehalose, and mannose. Each macromolecule was converted to a new metabolite species having a molar mass of 1000 g · mol−1 (Appendix G). These macromolecule species allow the coefficients in the biomass equation to be set to the weight fraction of each macromolecule as measured in experiments; a similar approach was previously employed with SpoMBEL1693 [Sohn et al., 2012]. The DNA composition for each strain GENRE was adjusted according to its GC content. Two protein synthesis reactions were constructed in the pan-GENRE: condensation of amino acids without a GTP cost, and condensation of amino acids from charged tRNAs with an additional cost of 2 mol GTP per mol amino acid [Schimmel, 1993]. FOGs for the ribosome, RNA polymerase, and DNA polymerase were assigned to the protein, RNA, and DNA biosynthesis reactions, respectively.
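The GC-content adjustment of the DNA composition mentioned above can be sketched as follows. This is an illustrative calculation that assumes double-stranded DNA with standard base pairing, not the code used for the reconstructions; the function name, the metabolite labels, and the example GC content are hypothetical.

```python
from typing import Dict


def dna_mole_fractions(gc_content: float) -> Dict[str, float]:
    """Mole fractions of the four deoxynucleotides in double-stranded DNA."""
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must be a fraction between 0 and 1")
    gc_each = gc_content / 2.0          # G pairs with C, so they are equimolar
    at_each = (1.0 - gc_content) / 2.0  # A pairs with T, so they are equimolar
    return {"dAMP": at_each, "dTMP": at_each, "dGMP": gc_each, "dCMP": gc_each}


fractions = dna_mole_fractions(0.40)  # hypothetical genome with 40% GC
print(fractions)
```

These mole fractions would then serve as the stoichiometric coefficients of the DNA synthesis reaction, after scaling to the molar mass chosen for the DNA species.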

6.3.3 Flux balance analysis

FBA was carried out with COBRApy v0.9.1. Reactions enabling artificial proton motive force cycles with glucose as a carbon source were blocked in the pan-GENRE [Pereira et al., 2016]. Glucose and ammonium were used as the carbon and nitrogen sources, respectively, to determine the yields of biomass precursors and correct any auxotrophies in all the strain GENREs. Evolview v2 was used to map GENRE statistics and amino acid yields onto the species tree [He et al., 2016].
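Once FBA returns a maximal production flux for a biomass precursor, the yield and auxotrophy checks reduce to simple post-processing. The sketch below illustrates that step only; it does not call COBRApy, and the function names, tolerance, and example fluxes are assumptions made for illustration.

```python
GLUCOSE_MOLAR_MASS = 180.16  # g/mol


def mass_yield(product_flux: float, product_molar_mass: float, glucose_uptake: float) -> float:
    """Convert molar fluxes (same units, e.g. mmol/gDW/h) to g product per g glucose."""
    if glucose_uptake <= 0:
        raise ValueError("glucose uptake must be positive")
    return (product_flux * product_molar_mass) / (glucose_uptake * GLUCOSE_MOLAR_MASS)


def is_auxotroph(max_product_flux: float, tolerance: float = 1e-9) -> bool:
    """Flag a precursor whose maximal production flux is effectively zero."""
    return max_product_flux <= tolerance


# Hypothetical case: 1 mol of a 90.08 g/mol product formed per mol glucose.
print(round(mass_yield(1.0, 90.08, 1.0), 3))
print(is_auxotroph(0.0))
```

A precursor flagged in this way would point to a gap in the strain reconstruction, or to a true auxotrophy, that needs to be checked against the literature.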

6.4 Results

The Fungi pan-GENRE contains 1547 orthologs, 2726 reactions, 2226 metabolites, and 10 compartments. GENRE statistics for each strain are shown in Figure 6.3. Genes for the strain GENREs range from 942 in Han. valbyensis to 1558 in Aspergillus niger. The top subsystems in the strain GENREs are biomass, transporters, amino acid metabolism, and central metabolism. Reaction and gene counts by subsystem for each strain GENRE are available in Appendix H.

6.4.1 Expanded genomic coverage in the fungal pan-GENRE

The strain GENREs in the pan-GENRE framework have more genes than GENREs derived from manual curation or CoReCo (Table 6.1). Only yeastGEM and iMT1026, the most recent GENREs for Sac. cerevisiae and Kom. phaffii, have more genes than our GENREs when the reconstructions are compared without ribosome, RNA polymerase, and DNA polymerase genes, which are not usually included in GENREs. Inspection of the genes in iMT1026 indicated that it also includes RNA polymerase genes and uncharacterized genes/enzymes. Curating ortholog groups and annotating reactions with OPRs minimizes the chance that distantly related homologous genes are added to GENREs, and helps ensure that orthologous genes with low sequence similarity are captured in GENREs.

Figure 6.3: Yeast and fungi GENRE statistics. GENRE names, genes, reactions, metabolites, subphylum, and family mapped to a yeast/fungi species tree.

6.4.2 Expanded metabolic coverage in the fungal pan-GENRE

The GENREs from the fungal pan-GENRE have a wider metabolic coverage than prior yeast and fungal GENREs. The pan-GENRE includes the degradation of alkanes and aromatics, which have been studied in Fungi for decades [Middelhoven et al., 1991, 1992, Van Beilen and Funhoff, 2007], and the degradation of aliphatic amines, which has been recently studied in Sch. stipitis by Linder [2014, 2018]. These pathways are mediated by hydroxylases, monooxygenases, and dioxygenases from the cytochrome P450 family, which do not have well described gene-protein-reaction associations in current reaction databases. Enzymes in the cytochrome P450 family are known to be promiscuous and likely have vast unexplored metabolic capabilities in yeasts and fungi. Non-canonical metabolic reactions were also added to the GENRE, such as reactions catalyzed by ene-reductases. These reactions have no obvious role in metabolism [Zhang et al., 2016], but can be useful to know when engineering pathways in microbial hosts. The expanded metabolic coverage in the yeast and fungal GENREs highlights the need for curated reaction and ortholog databases.

6.4.3 Amino acid yields vary across yeast strains

The amino acid yields from glucose and ammonium vary in yeasts (Figure 6.4). There are strikingly lower yields in Schizo. pombe, Lip. starkeyi, Nad. fulvescens, and Hanseniaspora valbyensis for glutamate, glutamine, glycine, histidine, leucine, methionine, and valine. Lip. starkeyi has many genomic and phenotypic features shared with proto-yeast, the first budding yeast, while the remaining yeasts independently evolved the Crabtree effect. The overall protein yield, assuming the amino acid composition of Sac. cerevisiae from iMM904 [Mo et al., 2009], is lowest in yeasts that lost Complex I. The decrease in protein yield for these yeasts indicates that microbes are not always maximizing growth rate or biomass yield, and that additional constraints [Bachmann et al., 2017] or environmental conditions can lead to loss of function. The differences in yield highlight the need to select microbial hosts with the highest potential for producing target biochemicals in metabolic engineering, or to engineer higher-yielding pathways in preferred hosts [Meadows et al., 2016]. These hosts or pathways can be identified by running metabolic simulations with GENREs from higher taxonomic ranks. Reconstructing GENREs from curated databases can avoid commission errors that would lead to yields converging to the yields found in Sac. cerevisiae or other model organisms.

Figure 6.4: Heat map showing the yields of amino acids with glucose and ammonium as substrates with each strain genome-scale network reconstruction (GENRE). Schizosaccharomyces pombe, Lipomyces starkeyi, Nadsonia fulvescens, and Hanseniaspora valbyensis all have lower yields of glutamate, glutamine, glycine, histidine, leucine, methionine, and valine than the rest of the fungi and yeasts. Lip. starkeyi shares many genomic features with proto-yeast, the first budding yeast, while the remaining yeasts independently evolved the Crabtree effect. Schizo. pombe and Han. valbyensis have reduced genomes and primarily rely on glucose fermentation to ethanol.

6.4.4 Comparison of pan-GENRE framework to CoReCo

CoReCo was better at capturing reactions that were omitted by manual curation in past reconstructions, such as Acl in Schizo. pombe [Sohn et al., 2012]. On average, GENREs from the pan-GENRE framework have 74% more genes than CoReCo GENREs (Table 6.1). CoReCo is an important milestone in expanding GENREs to diverse species, but curated ortholog and reaction databases are needed to ensure that the reconstructions are complete and accurate.

Table 6.1: Comparison of genes, reactions, and metabolites in yeast and fungal GENREs. AYbRAHAM GENREs have more genes and reactions than GENREs from manual curation or automatic reconstruction, with the exception of the reaction count in Sac. cerevisiae.

| Organism | Model | Genes | Reactions | Metabolites | Reference |
|---|---|---|---|---|---|
| Schizosaccharomyces pombe | SpoMBEL1693 | 605 | 1693 | 1712 | Sohn et al. [2012] |
| Schizosaccharomyces pombe | iSchpo_Pitkanen2014 | 646 | 1789 | 1694 | Pitkänen et al. [2014] |
| Schizosaccharomyces pombe | iSchpo_972h | 1079 | 2025 | 1791 | This work |
| Aspergillus niger | iAspni_Pitkanen2014 | 1109 | 2249 | 2068 | Pitkänen et al. [2014] |
| Aspergillus niger | iHL1210 | 1210 | 1764 | 906 | Lu et al. [2017] |
| Aspergillus niger | iAspni_CBS_513.88 | 1558 | 2475 | 2096 | This work |
| Neurospora crassa | iJDZ836 | 837 | 1845 | 1008 | Dreyfuss et al. [2013] |
| Neurospora crassa | iNeucr_Pitkanen2014 | 858 | 2189 | 2031 | Pitkänen et al. [2014] |
| Neurospora crassa | iNeucr_OR74A | 1319 | 2424 | 2073 | This work |
| Trichoderma reesei | iHypje_Pitkanen2014 | 939 | 2145 | 1985 | Pitkänen et al. [2014] |
| Trichoderma reesei | iHypje_QM6a | 1336 | 2420 | 2083 | This work |
| Yarrowia lipolytica | iNL895 | 899 | 2002 | 1847 | Loira et al. [2012] |
| Yarrowia lipolytica | iYL619_PCP | 619 | 1142 | 969 | Pan and Hua [2012] |
| Yarrowia lipolytica | iYarli_Pitkanen2014 | 727 | 1797 | 1695 | Pitkänen et al. [2014] |
| Yarrowia lipolytica | iYali | 847 | 1924 | 1671 | Kerkhoven et al. [2016] |
| Yarrowia lipolytica | iYarli_CLIB122 | 1300 | 2373 | 2036 | This work |
| Komagataella phaffii | iPP668 | 668 | 1227 | 1177 | Chung et al. [2010] |
| Komagataella phaffii | iLC915 | 915 | 1448 | 2301 | Caspeta et al. [2012] |
| Komagataella phaffii | iKomph_Pitkanen2014 | 649 | 1670 | 1617 | Pitkänen et al. [2014] |
| Komagataella phaffii | iMT1026 | 1026 | 2237 | 1706 | Tomàs-Gamisans et al. [2018] |
| Komagataella phaffii | iKomph_GS115 | 1151 | 2343 | 2003 | This work |
| Debaryomyces hansenii | iDebha_Pitkanen2014 | 763 | 1965 | 1856 | Pitkänen et al. [2014] |
| Debaryomyces hansenii | iDebha_CBS_767 | 1316 | 2427 | 2065 | This work |
| Meyerozyma guilliermondii | iMeygu_Pitkanen2014 | 676 | 1703 | 1639 | Pitkänen et al. [2014] |
| Meyerozyma guilliermondii | iMeygu_ATCC_6260 | 1326 | 2412 | 2054 | This work |
| Scheffersomyces stipitis | iBB814 | 814 | 1371 | 644 | Balagurunathan et al. [2012] |
| Scheffersomyces stipitis | iSS884 | 884 | 1376 | 2228 | Caspeta et al. [2012] |
| Scheffersomyces stipitis | iTL885 | 885 | 1240 | 589 | Liu et al. [2012] |
| Scheffersomyces stipitis | iPicst_Pitkanen2014 | 762 | 1952 | 1841 | Pitkänen et al. [2014] |
| Scheffersomyces stipitis | iPicst_CBS_6054 | 1453 | 2496 | 2078 | This work |
| Kluyveromyces lactis | iKlula_Pitkanen2014 | 637 | 1760 | 1668 | Pitkänen et al. [2014] |
| Kluyveromyces lactis | iOD907 | 910 | 2180 | 2338 | Dias et al. [2014] |
| Kluyveromyces lactis | iKlula_NRRL_Y1140 | 1162 | 2376 | 2026 | This work |
| Saccharomyces cerevisiae | iSacce_Pitkanen2014 | 630 | 1694 | 1613 | Pitkänen et al. [2014] |
| Saccharomyces cerevisiae | yeastGEM | 1133 | 3703 | 2518 | Sanchez et al. [2018] |
| Saccharomyces cerevisiae | iSacce_S288c | 1347 | 2395 | 2040 | This work |
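The average gain in gene count over CoReCo can be checked directly from Table 6.1. The sketch below pairs each CoReCo GENRE (the models ending in _Pitkanen2014) with the GENRE from this work for the same organism and averages the per-organism percentage increase. It assumes the average was taken over the eleven organisms listed in Table 6.1 and as a mean of per-organism ratios; under those assumptions the result is about 74%, in line with the value reported above.

```python
# Gene counts from Table 6.1: (CoReCo GENRE, GENRE from this work).
gene_counts = {
    "Schizosaccharomyces pombe": (646, 1079),
    "Aspergillus niger": (1109, 1558),
    "Neurospora crassa": (858, 1319),
    "Trichoderma reesei": (939, 1336),
    "Yarrowia lipolytica": (727, 1300),
    "Komagataella phaffii": (649, 1151),
    "Debaryomyces hansenii": (763, 1316),
    "Meyerozyma guilliermondii": (676, 1326),
    "Scheffersomyces stipitis": (762, 1453),
    "Kluyveromyces lactis": (637, 1162),
    "Saccharomyces cerevisiae": (630, 1347),
}


def percent_more(reference, new):
    """Percentage by which `new` exceeds `reference`."""
    return 100.0 * (new - reference) / reference


# Per-organism increase, then the unweighted mean across organisms.
increases = {
    organism: percent_more(coreco, this_work)
    for organism, (coreco, this_work) in gene_counts.items()
}
mean_increase = sum(increases.values()) / len(increases)
```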

6.5 Discussion

Most reaction databases reflect canonical metabolism, which understates the true metabolic capability of enzymes. AYbRAHAM includes non-canonical reactions identified in enzyme assays, which are relevant to metabolic engineering and enzyme evolution. Engineering the phosphoketolase pathway in Sac. cerevisiae required the deletion of GPP1 and GPP2, which encode glycerol-3-phosphate phosphatase but have promiscuous acetyl-phosphate phosphatase activity [Hawkins et al., 2016]. Acetyl-phosphate phosphatase would be considered a blocked reaction in a Sac. cerevisiae genome-scale metabolic model, even though this enzyme activity is important to metabolic engineers. Another example is the rescue of a serine auxotrophy in Esc. coli by promiscuous phosphatases [Yip and Matsumura, 2013]. These examples highlight the need to capture the true metabolic network of microbes beyond canonical metabolism.

GENRE involves a review of the literature to capture metabolism unique to strains [Österlund et al., 2012]. One limitation of this approach is that a literature review based on Sch. stipitis would ignore metabolism that it shares with other non-conventional organisms but that has not been characterized directly, such as alkane degradation. In this framework, a high-quality ortholog and reaction database with OPRs enables discoveries in one organism to be pushed to the GENREs of all organisms within a taxon (Figure 6.2C). This synchronization can reduce the parallel GENRE efforts that have hampered the yeast community [Lopes and Rocha, 2017].

The pan-GENRE framework relies on the ortholog conjecture, which states that orthologs are likely to have conserved function [Koonin, 2005]; however, biological function is not always conserved amongst orthologs. For example, NDE1 encodes external alternative NADH dehydrogenase and is present in all 33 yeasts and fungi in this study. Nde1p has been found to oxidize NADH and NADPH in several species [Bruinenberg et al., 1985, Melo et al., 2001, Overkamp et al., 2002, Miranda, 2011], but this ability was lost in the ancestor of Sac. cerevisiae. Furthermore, Schizo. pombe appears to encode internal and external NADH dehydrogenase from the same gene [Crichton et al., 2007]. To resolve these issues of non-conserved function in ortholog groups, new "functional ortholog groups" were created to assign these exceptions. Future methods can address these issues by assigning biological function to internal nodes in phylogenetic trees, or annotating reactions with transcripts instead of genes [Pfau et al., 2015].

6.6 Conclusions

The heart of the GENRE process involves integrating information about orthologs and reactions, and to a lesser extent information about metabolites and phenotypes. The lack of complete and accurate ortholog and reaction databases forces research teams to make ad hoc decisions about the presence or absence of reactions, especially in non-conventional organisms. This time-consuming process limits the quantity and quality of GENREs. A new pan-GENRE framework was created to resolve these issues and was demonstrated with the metabolic network reconstruction of 33 yeasts and fungi, spanning 600 million years of evolution. These GENREs have more genomic and metabolic coverage than previous yeast and fungi GENREs. The unified ortholog and reaction nomenclature in our GENREs enables them to be synchronized as the pan-GENRE or strain GENREs are improved. This approach can be scaled to other groups of organisms throughout the tree of life to increase the quantity and quality of GENREs.

6.7 Data availability

The AYbRAH ortholog database is available at https://github.com/lmse/aybrah. The AYbRAHAM reaction database, the code used to compile the GENREs, and all GENREs for each taxonomic rank in Fungi are available at https://github.com/lmse/aybraham.

Chapter 7

Reverse engineering xylose fermentation in Scheffersomyces stipitis

There are known knowns; there are things

we know we know. We also know there are

known unknowns; that is to say we know

there are some things we do not know. But

there are also unknown unknowns - the ones

we don’t know we don’t know.

Donald Rumsfeld

7.1 Abstract

Xylose is the second most abundant sugar in lignocellulose and can be used as a feedstock for next-generation biofuels production by industry. Saccharomyces cerevisiae, one of the main workhorses in biotechnology, is unable to metabolize xylose natively but has been engineered to ferment xylose to ethanol with the xylose reductase (XR) and xylitol dehydrogenase (XDH) genes from Scheffersomyces stipitis. In the scientific literature, the yield and volumetric productivity of xylose fermentation to ethanol in engineered Sac. cerevisiae still lag Sch. stipitis, despite expression of the same XR-XDH genes. These contrasting phenotypes can be due to differences in Sac. cerevisiae’s redox metabolism that hinder xylose fermentation, differences in Sch. stipitis’ redox metabolism that promote xylose fermentation, or both. To help elucidate how Sch. stipitis ferments xylose, flux balance analysis was used to test various redox balancing mechanisms, published omics datasets were reviewed, and the phylogeny of key genes in xylose fermentation was analyzed. In vivo and in silico xylose fermentation could not be reconciled without involving NADP phosphatase (NADPase) and NADH kinase. Eight candidate genes for NADPase were identified. PHO3.2 was the sole candidate showing evidence of expression during xylose fermentation. Pho3.2p and Pho3p, a recent paralog, were purified and characterized for their substrate preferences. Only Pho3.2p was found to have NADPase activity. Both NADPase and NAD(P)H-dependent XR emerged from recent duplications in a common ancestor of Scheffersomyces and Spathaspora. This study demonstrates the advantages of using metabolic simulations, omics data, bioinformatics, and enzymology to reverse engineer metabolism.

Associated publication: K Correia, A Khusnutdinova, PY Li, JC Joo, G Brown, AF Yakunin, R Mahadevan. Flux balance analysis predicts NADP phosphatase and NADH kinase are critical to balancing redox during xylose fermentation in Scheffersomyces stipitis. bioRxiv, 390401.

Individual contributions:

• Metabolic model curation, and metabolic flux analysis.

• Phylogenetic and syntenic analysis of XYL1 and PHO3.

• Cloning PHO3 and PHO3.2 into pPICZα,B for extracellular expression in Komagataella phaffii.

• Transformation and screening of PHO3- and PHO3.2-expressing mutants in Kom. phaffii.

• Debugging how to purify phosphatase enzymes.

7.2 Introduction

Xylose is the second most abundant sugar in lignocellulose and can be used as a feedstock for the production of biofuels and biochemicals by industry [Jeffries and Jin, 2004]. Known catabolic pathways include the XI pathway [Schellenberg et al., 1984], the XR-XDH pathway [Horitsu et al., 1968], the Weimberg pathway [Weimberg, 1961], and the Dahms pathway [Dahms, 1974]. Sac. cerevisiae, one of the main workhorses of industrial biotechnology, has evolved to ferment glucose to ethanol rapidly but cannot natively grow on xylose or ferment it to ethanol [Jeffries and Jin, 2004], even though it has XR and XDH genes. Screening of various yeasts has found that ethanol fermentation with xylose in yeasts is rare [Toivola et al., 1984] and requires NADH-linked XR to alleviate the cofactor imbalance in the XR-XDH pathway during oxygen limitation [Schneider et al., 1981, Slininger et al., 1982, Bruinenberg et al., 1984].

Early research on xylose fermentation in yeasts focused on process optimization of native xylose fermenters [Slininger et al., 1985, 1990, du Preez, 1994], but advancements in genetic engineering enabled the expression of bacterial XI [Sarthy et al., 1987, Amore et al., 1989] and the XR-XDH pathway in Sac. cerevisiae [Kötter and Ciriacy, 1993]. Sac. cerevisiae strains engineered with the XR-XDH pathway from Sch. stipitis, encoded by XYL1 and XYL2, have typically outperformed strains expressing bacterial XI. More recently, evolved strains of Sac. cerevisiae expressing fungal XI have led to faster growth rates, higher ethanol yields, and fewer byproducts [Zhou et al., 2012, Verhoeven et al., 2017]. Kwak and Jin [2017] provide a review of engineering xylose fermentation in Sac. cerevisiae. After more than 30 years of engineering xylose fermentation in Sac. cerevisiae, the yield and volumetric productivity of engineered Sac. cerevisiae with XI or XR-XDH have lagged native xylose fermenters like Sch. stipitis and Spa. passalidarum in the scientific literature [van Vleet and Jeffries, 2009, Kim et al., 2013a].

The inability of Sac. cerevisiae engineered with the XR-XDH pathway from Sch. stipitis to anaerobically ferment xylose to ethanol at high yields is especially puzzling because the same genes appear to enable xylose fermentation in wild-type Sch. stipitis [Wahlbom et al., 2003]. These contrasting phenotypes can be due to differences in the transcriptome, proteome, and metabolome of Sac. cerevisiae that hinder xylose fermentation, differences in the transcriptome, proteome, and metabolome of Sch. stipitis that promote xylose fermentation, or a combination of both. The metabolic engineering community has largely focused on targets in Sac. cerevisiae [van Vleet et al., 2008, Wei et al., 2013b, Kim et al., 2013b], while few studies have probed for targets in Sch. stipitis beyond central metabolism [Jeppsson et al., 1995, Freese et al., 2011, Wohlbach et al., 2011]. Although the XI pathway has fewer technical challenges for industrial fermentation than the XR-XDH pathway in Sac. cerevisiae, the redox balancing of the XR-XDH pathway in Sch. stipitis has not been fully elucidated, and its elucidation may therefore offer insight into new redox balancing strategies.

FBA is a computational method often used to gain insight into metabolism [Österlund et al., 2013, McCloskey et al., 2013], and is well suited to study redox metabolism in yeasts [Pereira et al., 2016]. To date, there are five GENREs for Sch. stipitis, the most widely studied xylose fermenting yeast: iBB814 [Balagurunathan et al., 2012], iSS884 [Caspeta et al., 2012], iTL885 [Liu et al., 2012], iPL912 [Li, 2012], and iDH814 [Hilliard et al., 2018]. Xylose fermentation simulations with all models predicted xylitol accumulation when the XR cofactor selectivity was constrained to its in vitro preference (60% NADPH), cytosolic NADP-Ald was removed from the metabolic models because it is not encoded in Sch. stipitis’ genome, flux through degradation pathways was prevented, and alternative optima were considered. Failing to reconcile in silico predictions and in vivo xylose fermentation in Sch. stipitis, a consensus GENRE for Sch. stipitis was created to analyze redox balancing mechanisms during xylose fermentation (Figure 7.1), and published omics datasets were reviewed to guide potential flux constraints.

The only mechanisms that were able to eliminate in silico xylitol accumulation during anaerobic xylose fermentation were phosphorylating NADP-GAPDH, non-phosphorylating NADP-GAPDH, or NADPase and NADH kinase. The expression of phosphorylating NADP-GAPDH, encoded by Klu. lactis’ GDP1 [Verho et al., 2002], in XR-XDH engineered Sac. cerevisiae decreased its xylitol yield and increased its ethanol yield [Verho et al., 2003]; however, there is no strong biochemical or bioinformatic evidence for the occurrence of either form of GAPDH in Sch. stipitis. Expression of cytosolic NADH kinase led to an increase in the xylitol yield of XR-XDH engineered Sac. cerevisiae [Hou et al., 2009], but these simulations indicate that NADPase and NADH kinase are both required to completely balance redox cofactors.

The presence of NADPase in Sac. cerevisiae or Sch. stipitis is unknown since it is a eukaryotic orphan enzyme. Eight NADPase candidate genes were identified in iSS885 and iPL912, which had functional annotations from KEGG Orthology [Kanehisa et al., 2011] and PathwayTools [Karp et al., 2009], respectively. The genes encoding the NADPase candidates, the XR-XDH pathway, and enzymes related to NADPH regeneration are outlined in Table 7.1, along with expression data [Yuan et al., 2011, Huang and Lefsrud, 2012] and orthology information. PHO3.2 was the most promising candidate since Sac. cerevisiae does not encode any homologs (Table 7.1), and its expression in Sch. stipitis was confirmed during xylose fermentation via shotgun proteomics [Huang and Lefsrud, 2012]. Therefore, Pho3.2p and Pho3p, its paralog with 77% identity, were expressed in Kom. phaffii, purified via 6xHis tags, and characterized for their substrate preferences. The phylogenetic origin of the proposed mechanism in the Scheffersomyces-Spathaspora clade was also studied. NADPase and NADH kinase are proposed to be critical for xylose fermentation in Sch. stipitis since they can balance redox cofactors in the absence of oxygen (Figure 7.2). In contrast, Sac. cerevisiae requires oxygen to balance redox, and loses CO2 when the oxidative pentose phosphate pathway regenerates NADPH.
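The cofactor bookkeeping behind the proposed mechanism can be worked through with simple arithmetic. The sketch below assumes the textbook stoichiometries of the four reactions (XR: xylose + NAD(P)H → xylitol + NAD(P); XDH: xylitol + NAD → xylulose + NADH; NADH kinase: NADH + ATP → NADPH + ADP; NADPase: NADP → NAD + phosphate) and computes the NADH kinase and NADPase fluxes needed to close all four cofactor balances for a given NADPH share of XR flux. The function and key names are hypothetical, and this is not the genome-scale simulation used in this chapter.

```python
def redox_cycle_fluxes(nadph_fraction, xylose_uptake=1.0):
    """Fluxes that balance NAD(H)/NADP(H) for the XR-XDH pathway.

    nadph_fraction: share of XR flux driven by NADPH (0 to 1).
    xylose_uptake: xylose flux through XR and XDH (any flux unit).
    """
    if not 0.0 <= nadph_fraction <= 1.0:
        raise ValueError("nadph_fraction must be between 0 and 1")
    # XDH makes one NADH per xylose; XR uses NADH for the non-NADPH share.
    nadh_surplus = xylose_uptake - xylose_uptake * (1.0 - nadph_fraction)
    # XR oxidizes NADPH to NADP for the NADPH share.
    nadph_deficit = xylose_uptake * nadph_fraction
    # NADH kinase converts the NADH surplus into the missing NADPH (1 ATP each).
    nadh_kinase = nadph_deficit
    # NADPase returns the NADP made by XR to the NAD pool consumed by XDH.
    nadpase = nadph_deficit
    return {
        "nadh_surplus": nadh_surplus,
        "nadh_kinase": nadh_kinase,
        "nadpase": nadpase,
        "atp_cost": nadh_kinase,
    }


# XR at its in vitro preference (60% NADPH) with 10 flux units of xylose.
fluxes = redox_cycle_fluxes(0.6, xylose_uptake=10.0)
```

Under these assumptions the NADH surplus from XDH exactly matches the NADPH demand of XR, so the cycle closes without oxygen at a cost of one ATP per NADPH regenerated.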

Table 7.1: Xylose fermentation related genes in Scheffersomyces stipitis, including the xylose reductase (XR)-xylitol dehydrogenase (XDH) pathway, NADPH regeneration, and NADP phosphatase candidates. AYbRAH annotations, transcriptomics, proteomics, and functional characterization are outlined for all the genes.

| Group | Gene name | Locus tag | Protein name | HOG (a) | FOG (a) | Sac. cerevisiae homolog (a) | Transcriptomics (b): RPKM (d), glucose | Transcriptomics (b): RPKM (d), xylose | Proteomics (c): average spectral hits | Characterized function |
|---|---|---|---|---|---|---|---|---|---|---|
| XR-XDH pathway | XYL1 | PICST_89614 | NAD(P)H-dependent xylose reductase | HOG00232 | FOG00421 | paralog | 67.1 | 11175.4 | 126 | NADH and NADPH-linked D-xylose reductase [Verduyn et al., 1985b] |
| XR-XDH pathway | XYL2 | PICST_86924 | NAD-linked xylitol dehydrogenase | HOG00428 | FOG00908 | ortholog | 40.0 | 2555.3 | 136 | NAD-xylitol dehydrogenase [Rizzi et al., 1989] |
| XR-XDH pathway | XKS1 | PICST_68734 | D-xylulose 5-kinase | HOG00427 | FOG00907 | ortholog | 11.5 | 2156.4 | 42 | phosphorylates D-xylulose for entry into the pentose phosphate pathway |
| NADPH regeneration | ZWF1 | PICST_85065 | glucose-6-phosphate 1-dehydrogenase | HOG00414 | FOG00879 | ortholog | 250.0 | 1772.6 | 19 | first committed step in the oxidative pentose phosphate pathway, regenerates NADPH [Horne et al., 1970] |
| NADPH regeneration | GND1 | PICST_69500 | 6-phosphogluconate dehydrogenase, decarboxylating | HOG00417 | FOG00889 | ortholog | 730.4 | 1658.3 | 50 | last step in the oxidative pentose phosphate pathway, regenerates NADPH |
| NADPH regeneration | IDP2 | PICST_43870 | NADP-linked isocitrate dehydrogenase, cytosolic | HOG00262 | FOG00618 | ortholog | 4.6 | 16.8 | 5 | NADPH source during growth on non-fermentable carbon sources [Haselbeck and McAlister-Henn, 1993] |
| NADPH regeneration | UGA2 | PICST_40468 | NADP-linked succinate semialdehyde dehydrogenase, cytosolic | HOG00216 | FOG00365 | ortholog | 12.2 | 15.56 | 0 | involved in the utilization of gamma-aminobutyrate [Ramos et al., 1985] |
| NADPH regeneration | UTR1 | PICST_87580 | NAD(H) kinase, cytosolic | HOG00407 | FOG00863 | ortholog | 4.0 | 18.0 | 0 | de novo synthesis of NAD(H) pool [Shi et al., 2005] |
| NADPase candidates | PHO3 | PICST_32593 | acid phosphatase | HOG00624 | FOG01534 | absent | 6.6 | 8.4 | 0 | secreted acid phosphatase by Candida albicans under low-phosphate conditions [MacCallum et al., 2009] |
| NADPase candidates | PHO3.2 | PICST_47650 | acid phosphatase | HOG00624 | FOG01535 | absent | 2.4 | 2.9 | 4 | uncharacterized ortholog group |
| NADPase candidates | PHO3.3 | PICST_83142 | acid phosphatase-like protein | HOG00624 | FOG01626 | absent | 2.1 | 0.9 | 0 | uncharacterized ortholog group |
| NADPase candidates | PHO5 | PICST_46975 | acid phosphatase | HOG00677 | FOG01619 | paralog | 1.5 | 3.0 | 0 | secreted acid phosphatase |
| NADPase candidates | PHO6 | PICST_61096 | acid phosphatase | HOG00677 | FOG01619 | paralog | 16.5 | 23.1 | 0 | secreted acid phosphatase |
| NADPase candidates | PHO12 | PICST_46121 | acid phosphatase | HOG00677 | FOG01619 | paralog | 8.7 | 11.9 | 0 | secreted acid phosphatase |
| NADPase candidates | YBU4 | PICST_74933 | predicted tubulin-tyrosine ligase | HOG00671 | FOG01612 | ortholog | 12.7 | 9.8 | 0 | uncharacterized ortholog group |
| NADPase candidates | YMR1 | PICST_82137 | myotubularin-related dual specificity phosphatase | HOG00670 | FOG01611 | ortholog | 4.5 | 6.4 | 0 | myotubularin phosphatase family, pathway signalling [Parrish et al., 2004] |

(a) Homolog Group (HOG) and Fungal Ortholog Group (FOG) identifications from AYbRAH.
(b) RNA-seq from glucose and xylose fermentation [Yuan et al., 2011].
(c) Average spectral hits sampled from three time points in two independent xylose fermentation runs [Huang and Lefsrud, 2012].
(d) Reads Per Kilobase Million (RPKM).

Figure 7.2: (A) Proposed redox balancing during xylose fermentation in Scheffersomyces stipitis. NADH kinase regenerates NADPH; NAD(P)H flux drives xylose reductase (XR); NADP phosphatase (NADPase) dephosphorylates NADP to NAD; NAD is reduced to NADH by xylitol dehydrogenase (XDH). This redox balancing scheme is consistent with the 13C results from Ligthelm et al. [1988c], is independent of oxygen availability, and does not have a loss of CO2 from the oxidative pentose phosphate pathway, but requires ATP. (B) Redox balancing during xylose fermentation in engineered Sac. cerevisiae with the XR-XDH pathway from Sch. stipitis. NAD kinase phosphorylates a fraction of the NAD pool for de novo NADP synthesis (dotted line); the oxidative pentose phosphate pathway regenerates NADPH; NAD(P)H drives XR. XDH regenerates NADH; NADH is reoxidized to NAD by the ETC. Under this redox balancing scheme, there is a loss of CO2 from the oxidative pentose phosphate pathway, oxygen is required to reoxidize NADH, and therefore xylose cannot be anaerobically fermented to ethanol at the maximum theoretical yield.

Figure 7.1: Simplified map of xylose fermentation in Scheffersomyces stipitis and potential redox balancing mechanisms. Uncertainties in xylose fermentation are highlighted in red squares: (A) the impact of the redox cofactor imbalance on metabolism, (B) the in vivo XR cofactor preference, (C) the use of the succinate bypass to regenerate NADPH, (D) the presence of non-phosphorylating or phosphorylating glyceraldehyde 3-phosphate dehydrogenase (GAPDH) to regenerate NADPH, (E) the impact of bypassing Complex I (NUO) during xylose fermentation, (F) the ability of alternative oxidase (AOX) to oxidize NADH during xylose fermentation, and (G) the presence of novel redox balancing mechanisms.

7.3 Methods

7.3.1 Genome-scale network reconstruction and analysis

The Sch. stipitis GENRE from Chapter 6 was used to simulate xylose fermentation. Model simulations were carried out using COnstraints Based Reconstruction and Analysis for Python version 0.9.1 (COBRApy) [Ebrahim et al., 2013]. The xylose uptake rate was set to 10 mmol · gDCW⁻¹ · h⁻¹. The growth associated maintenance (GAM) and non-growth associated maintenance (NGAM) were set to 60 mmolATP · gDCW⁻¹ and 0 mmolATP · gDCW⁻¹ · h⁻¹, respectively. Flux variability analysis (FVA) was used to evaluate alternative optimum solutions [Mahadevan and Schilling, 2003]. The exchange bounds for erythritol, ribitol, arabitol, sorbitol, and glycerol were all set to zero to simplify the solution space for polyols. The cofactor selectivities of NADH and NADPH were varied in a single XR reaction [Balagurunathan et al., 2012]. XR reactions driven solely by NADH or NADPH were blocked. Either the ethanol yield or the biomass growth rate was maximized, depending on the simulation. The data from all published transcriptomics and proteomics studies for Sch. stipitis were reviewed to guide the understanding of its metabolism [Jeffries et al., 2007, Jeffries and van Vleet, 2009, Wohlbach et al., 2011, Yuan et al., 2011, Huang and Lefsrud, 2012, Papini et al., 2012, Huang and Lefsrud, 2014].
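One way to picture the single XR reaction with variable cofactor selectivity is as a lumped stoichiometry in which the NADPH share is a parameter. The sketch below is a minimal illustration in plain Python; the metabolite keys are hypothetical placeholders rather than identifiers from the Sch. stipitis GENRE, and the sweep values are chosen only for illustration.

```python
def xr_stoichiometry(nadph_fraction):
    """Coefficients of a lumped XR reaction (negative = consumed).

    xylose + f NADPH + (1 - f) NADH + H+ -> xylitol + f NADP + (1 - f) NAD
    """
    if not 0.0 < nadph_fraction < 1.0:
        # Mirrors the text: XR driven solely by NADH or NADPH is excluded.
        raise ValueError("nadph_fraction must be strictly between 0 and 1")
    f = nadph_fraction
    return {
        "xylose": -1.0,
        "h": -1.0,
        "nadph": -f,
        "nadh": -(1.0 - f),
        "xylitol": 1.0,
        "nadp": f,
        "nad": 1.0 - f,
    }


# Sweep the NADPH share from 10% to 90%; 60% is the in vitro preference.
sweep = {i / 10: xr_stoichiometry(i / 10) for i in range(1, 10)}
```

In COBRApy, a dictionary like this could be mapped onto Metabolite objects and passed to Reaction.add_metabolites before each simulation, although the exact procedure used in this work is not shown here.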

7.3.2 Cloning PHO3 /PHO3.2, and transformation in Kom. phaffii

PHO3 and PHO3.2 were amplified from Sch. stipitis CBS 5773 genomic DNA with Phusion polymerase (NEB), without their native signal peptides, as predicted by SignalP [Petersen et al., 2011]. The genes were inserted into pPICZα,B using restriction enzyme digestion and ligation. Primer sequences can be found in Appendix I. The plasmids were transformed into Esc. coli BL21 using electroporation, selected on low salt LB (Lennox)-Zeocin™ (10 µg/mL), and sequence verified (the Centre for Applied Genomics, Toronto). The plasmids were linearized with PmeI, and 5 µg of each digestion was transformed into Kom. phaffii KM71H using electroporation (EasySelect Pichia Expression Kit, Invitrogen). Cells were recovered in 4 mL of YPD for 2 hours and plated on YPD agar with Zeocin™ (100 µg/mL).

7.3.3 Agar acid phosphatase assay

An acid phosphatase assay was used to screen for Kom. phaffii colonies with the highest expression of Pho3p and Pho3.2p [Dorn, 1965]. Colonies were plated on BMMY agar and incubated at 30°C. 100 µL of 100% methanol was dispensed on the lid of each inverted Petri dish after 24 and 48 hours of growth. 20 mg of Fast Garnet GBC sulphate salt (MilliporeSigma) and 2 mg of 1-naphthyl phosphate disodium salt (MilliporeSigma) were dissolved in 4 mL of 0.6 M acetate buffer (pH 4.8). 4 mL of the solution was flooded into the Petri dish, after which it was examined for five minutes (Appendix I).

7.3.4 General phosphatase assay with para-nitrophenyl phosphate (pNPP)

In addition to the agar acid phosphatase assay, a general phosphatase assay was used to monitor the activity of various Kom. phaffii clones for secreted phosphatase in liquid media [Kuznetsova et al., 2005]. 100 µL of the fermentation broth from wild-type Kom. phaffii and mutants expressing Pho3p and Pho3.2p were assayed for pNPP phosphatase after 24 hours of induction. The supernatant enriched with phosphatase from each culture was prepared by treatment with 0.1% Triton X100 or Tween 20 on a rotator at 4°C for 30 minutes, or by sonication in a 1 mL volume for 10 seconds on ice. The supernatant was separated from the cells by centrifugation at 13 000 rpm in a benchtop Eppendorf 5424 centrifuge. The cells were resuspended in an equal volume of BMMY media. The pNPP phosphatase assay was carried out in 200 µL reactions in sealed 96-well plates incubated overnight at 30°C. The reaction mixture consisted of the phosphatase-enriched supernatant collected from the fermentation, 4 mM pNPP, 0.5 mM MnCl2, 5 mM MgCl2, and 100 mM HEPES pH 7.5. pNPP phosphatase activity was estimated from the increase in absorbance of para-nitrophenol at 410 nm; the absorbance of wild-type Kom. phaffii cultures was subtracted as background.

7.3.5 Protein expression and purification

The clones with the highest expression of Pho3p and Pho3.2p were grown in BMGY as described in the manual of the EasySelect Pichia Expression Kit (Invitrogen™). 5 mL of 100% methanol was added at 24 and 48 hours after inoculation in 1 L of broth in a 4 L baffled flask. No secreted protein could be purified from the fermentation broth at 24, 48, or 72 hours after methanol induction; for Ni-NTA binding, the broth was adjusted to a final pH of 7.5 and buffered with 50 mM HEPES, 0.4 M NaCl, and 5 mM imidazole. Most of the phosphatase activity was found to be associated with the cells (Appendix I). The cells were collected and stored at -20°C until purification. The cells were sonicated (Qsonica, dual horn probe, 2.5 minutes, 80% of maximal amplitude) to detach the acid phosphatase from the cell surface. The relative sizes of the glycosylated phosphatases were estimated to be 55 kDa on 12% PAAG (Appendix I), and their sequences were confirmed by mass spectrometry. The phosphatases were further purified with a Ni-NTA agarose column (Qiagen) by their 6xHis tag according to the manufacturer’s protocol. It should be noted that a longer sonication time or the use of Y-PER Yeast Protein Extraction Reagent (Thermo) increased contamination with an alcohol dehydrogenase (40 kDa band), identified as Kom. phaffii Adh2p (A0A1B2JBQ8) (Appendix I). Size exclusion chromatography (Superdex 10/300 GL) failed to separate the phosphatases from Kom. phaffii Adh2p. The proteins formed a tight dimer, with a relative size of 110 kDa (data not shown). Utr1p from Sac. cerevisiae and Sch. stipitis were expressed in Esc. coli and Kom. phaffii, but no soluble protein was collected.

7.3.6 Alcohol dehydrogenase enzyme assay

The Kom. phaffii Adh2p contaminant was purified from wild-type Kom. phaffii and assayed for NADH and NADPH oxidation at 340 nm in 96-well plates at 30°C, with 0.01-10 mM butyraldehyde, 1 mM ZnCl2, and 50 mM HEPES pH 7.5. Both NADH and NADPH reduced butyraldehyde, but no butanol oxidation with NAD or NADP was detected.

7.3.7 NADPase phosphatase enzyme assay

NADPase activity was assayed in a coupled reaction with formate dehydrogenase (P33160). 2.9 µg of Pho3p and Pho3.2p were incubated at 30°C in 200 µL reactions with NADP, 50 mM sodium formate, 20 µg of strictly NAD-dependent formate dehydrogenase, 50 mM HEPES pH 7.5, 5 mM MgCl2 and 0.5 mM MnCl2. The reduction of NAD was monitored at 340 nm in 96-well microplates. 10 mM butyraldehyde was used to estimate the concentration of the contaminated Pho3p and Pho3.2p Ni-NTA eluted samples via NADH-dependent butyraldehyde reductase activity.
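As an illustration of how a trace from this coupled assay could be turned into a specific activity, the following minimal sketch applies the Beer-Lambert law to the slope of absorbance at 340 nm. It assumes the commonly used NADH extinction coefficient of 6220 M-1 cm-1 and a hypothetical effective path length of 0.56 cm for a 200 µL well; neither value is stated in the methods, and the function name and the example slope are illustrative only. It also assumes formate dehydrogenase is not rate limiting, so that each NADH formed corresponds to one NADP dephosphorylated.

```python
# Assumed molar extinction coefficient of NADH at 340 nm (M^-1 cm^-1).
NADH_EXTINCTION_M_CM = 6220.0


def specific_activity(slope_a340_per_min, path_cm, volume_ul, enzyme_ug,
                      epsilon=NADH_EXTINCTION_M_CM):
    """Return specific activity in umol NADH formed per min per mg enzyme."""
    # Beer-Lambert: concentration change (mol/L/min) = dA/dt / (epsilon * path)
    molar_per_min = slope_a340_per_min / (epsilon * path_cm)
    # mol/L/min * reaction volume (L) = mol/min, then convert to umol/min
    umol_per_min = molar_per_min * (volume_ul * 1e-6) * 1e6
    # Normalize by the enzyme mass in mg
    return umol_per_min / (enzyme_ug * 1e-3)


# Hypothetical slope of 0.010 A340/min in a 200 uL reaction with 2.9 ug enzyme.
example_activity = specific_activity(0.010, path_cm=0.56, volume_ul=200,
                                     enzyme_ug=2.9)
print(f"{example_activity:.3f} umol/min/mg")
```

The path length has the largest influence on the result in a microplate format, so in practice it would be measured or corrected for rather than assumed.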

7.3.8 Phosphatase screen with natural substrates using the malachite green assay

Pho3p and Pho3.2p were assayed for their substrate preferences with the malachite green assay [Kuznetsova et al., 2015]. The assay was performed in 96-well microplates in 160 µL reactions containing 100 mM HEPES pH 7.5, 5 mM MgCl2, 0.5 mM MnCl2, and 3 µg of the acid phosphatase. The final substrate concentration was 3.125 mM for cytidine 2’ monophosphate, inosine triphosphate, NADP, FMN, coenzyme A, glyphosate, and adenosine 3’,5’ diphosphate. The remaining substrates were assayed at 6.25 mM: AMP, CMP, GMP, IMP, UMP, XMP, 2’AMP, 2’CMP, 3’AMP, 3’CMP, dAMP, dCMP, dGMP, dIMP, dTMP, dUMP, ADP, CDP, GDP, IDP, TDP, UDP, dADP, dCDP, dGDP, ATP, CTP, GTP, ITP, TTP, UTP, dATP, dCTP, dGTP, dITP, dUTP, NADP, FMN, PEP, CoA, phosphocholine, α-glucose 1-phosphate, β-glucose 1-phosphate, β-glucose 6-phosphate, fructose 1-phosphate, fructose 6-phosphate, ribose 5-phosphate, mannose 1-phosphate, mannose 6-phosphate, galactose 1-phosphate, fructose 1,6-bisphosphate, erythrose 4-phosphate, trehalose 6-phosphate, glucose 1,6-bisphosphate, sucrose 6-phosphate, 2-deoxy-D-glucose 6-phosphate, 2-D-ribose 5-phosphate, glucosamine 6-phosphate, 6-phospho-D-gluconate, L-2-phosphoglycerate, 3-phosphoglycerate, glyceraldehyde 3-phosphate, phytic acid, thiamine monophosphate, thiamine diphosphate, phosphoserine, phosphothreonine, phosphotyrosine, 2-phosphoascorbate, pyridoxal 5-phosphate, polyphosphate, glycerol 1-phosphate, glycerol 2-phosphate, glycerol 3-phosphate, ribulose 1,5-bisphosphate, lactose 1-phosphate, N-acetyl-α-D-glucosamine 1-phosphate, α-D-glucosamine 1-phosphate, N-acetyl-α-D-glucosamine 6-phosphate, dihydroxyacetone phosphate, phosphono-acetate, phosphono-formate, phosphono-methyl-glycine, AMP-ramidate, D-sorbitol 6-phosphate, glyphosate, phosphoryl-ethanolamine, NMN, diphosphate, 2’,5’-ADP, PAP (3’,5’-ADP), PAPS, pNPP, 5-methyl dCMP. After 30 minutes of incubation at 30°C, the free phosphate concentration was estimated by mixing a reaction aliquot with 40 µL of the malachite green solution to a final volume of 200 µL. After 1 minute of orbital shaking at 1000 rpm, the production of phosphate was measured from the optical density at 630 nm. The malachite green solution was prepared fresh by mixing 1 mL of 7.5% NH4MoO4 with 80 µL of 11% Tween-20 and malachite stock solution (1 L: 1.1 g malachite green, 150 mL H2SO4, 750 mL water).
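Optical densities from a malachite green assay are usually converted to phosphate amounts with a standard curve. The sketch below fits a straight line to phosphate standards by ordinary least squares and inverts it for a sample reading. The standard amounts, the optical densities, and the function names are hypothetical and chosen only to illustrate the calculation; the methods above do not report the standards that were used.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical phosphate standards (nmol per well) and their OD630 readings.
standards_nmol = [0.0, 2.0, 4.0, 8.0, 16.0]
standards_od630 = [0.05, 0.15, 0.25, 0.45, 0.85]

slope, intercept = fit_line(standards_nmol, standards_od630)


def phosphate_nmol(od630):
    """Convert an OD630 reading to nmol phosphate using the standard curve."""
    return (od630 - intercept) / slope


# A hypothetical sample reading of 0.55 falls inside the standard range.
released_nmol = phosphate_nmol(0.55)
print(f"{released_nmol:.1f} nmol phosphate")
```

Readings from a no-enzyme control would normally be subtracted before conversion, since several of the substrates listed above can carry contaminating free phosphate.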

7.3.9 Syntenic and phylogenetic analysis of XYL1 and PHO3.2 homologs

AYbRAH was used to annotate proteins in the XYL1 and PHO3.2 loci, since genome annotations are lacking for several Scheffersomyces and Spathaspora species. The protein sequences of Xyl1p and Pho3.2p from Sch. stipitis were queried against the genomic nucleotide sequences of Deb. hansenii, Suhomyces tanzawaensis, Scheffersomyces species, and Spathaspora species using TBLASTN; assembly accessions are listed in Table 7.2. The genomic loci 50 kbp upstream and downstream of the XYL1 and PHO3.2 hits were then queried with a protein sequence of the organism’s closest relative for each FOG in AYbRAH using BLASTP. The gene coordinates were manually reviewed to remove spurious open reading frames. Biopython’s GenomeDiagram was used to illustrate the synteny of the XYL1 and PHO3.2 loci [Cock et al., 2009]. The XYL1 and PHO3.2 homologs were aligned with MAFFT version 7.24. PhyML version 3.2.0 was used to reconstruct the phylogeny of XYL1 and PHO3.2 homologs with 1000 bootstrap replicates. These trees were used to help distinguish between the XYL1/XYL1.2 and PHO3/PHO3.2 paralogs.
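The step of extracting the locus 50 kbp upstream and downstream of a TBLASTN hit can be sketched as a small coordinate calculation. The function below is a hypothetical helper, not the script used in this work; it assumes 1-based inclusive subject coordinates, that hits on the reverse strand may be reported with a start greater than the end, and that the window should be clamped to the contig boundaries.

```python
def flanking_window(hit_start, hit_end, contig_length, flank=50_000):
    """Return the 1-based inclusive (start, end) of a hit plus its flanks.

    The window is clamped so that it never extends past either end of
    the contig.
    """
    # Normalize the orientation so that low <= high for either strand.
    low, high = sorted((hit_start, hit_end))
    window_start = max(1, low - flank)
    window_end = min(contig_length, high + flank)
    return window_start, window_end


# Hypothetical forward-strand hit in the middle of a 500 kbp contig.
print(flanking_window(120_000, 121_000, 500_000))

# Hypothetical reverse-strand hit near both ends of a short 60 kbp contig.
print(flanking_window(30_000, 29_000, 60_000))
```

Clamping matters for fragmented assemblies, where a hit may sit closer than 50 kbp to a contig edge and the synteny diagram can only show the genes that were actually assembled.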

Table 7.2: Xylose fermentation genotypes and phenotypes for Debaryomyces hansenii, Suhomyces tanzawaensis, Spathaspora, and Scheffersomyces species. GenBank Assembly Accessions were used to analyze the synteny of XYL1 and PHO3.2 loci.

Organism | Species code | Genotype (XYL1.2, PHO3.2) | Phenotype (XR NADPH selectivity; xylitol yield, g/g; ethanol yield, g/g) | GenBank Assembly Accession | Genome sequence reference
Debaryomyces hansenii CBS 767 | dha | | 0.14-0.45a; 0.02-0.16a | GCA_000006445.2 | Dujon et al. [2004]
Suhomyces tanzawaensis NRRL Y-17324 | ctz | X | | GCA_001661415.1 | Riley et al. [2016]
Scheffersomyces stipitis CBS 6054 | pic | XX | 0.58b; 0.01-0.06cd; 0.45-0.47cd | GCA_000209165.1 | Jeffries et al. [2007]
Scheffersomyces stambukii UFMG-CM-Y427 | sheu | X | 0.58-0.66e; 0.12-0.17e | GCA_002245345.1 | Lopes et al. [2018]
Scheffersomyces lignosus JCM 9837 | shel | XX | | GCA_001599395.1 |
Scheffersomyces shehatae ATY839 | she | XX | 0.71f; 0.18c; 0.37c | GCA_002118035.1 | Okada et al. [2017]
Spathaspora passalidarum NRRL Y-27907 | spa | XX | 0.36g; 0.02-0.05gd; 0.44-0.48gd | GCA_000223485.1 | Wohlbach et al. [2011]
Spathaspora arborariae UFMG-19.1A | spaa | X | 0.57g; 0.04-0.18g; 0.31-0.32g | GCA_000497715.1 | Lobo et al. [2014]
Spathaspora xylofermentans UFMG-HMD23.3 | spax | X | 1.00g; 0.51g; 0.08g | GCA_002105455.1 | Lopes et al. [2017]
Spathaspora boniae UFMG-CM-Y306 | spau | | 0.96h; 0.46h; 0.19h | GCA_002094185.1 | Morais et al. [2017]
Spathaspora girioi UFMG-CM-Y302 | spagi | X | 1.00i; 0.35i; 0.22i | GCA_001657455.1 | Lopes et al. [2016]
Spathaspora hagerdaliae UFMG-CM-Y303 | spah | X | 1.00i; 0.21-0.24i; 0.25-0.28i | GCA_001655755.1 | Lopes et al. [2016]
Spathaspora gorwiae UFMG-CM-Y312 | spago | X | 0.49i; 0i; 0.1i | GCA_001655765.1 | Lopes et al. [2016]

a Roseiro et al. [1991]; b Verduyn et al. [1985a]; c Ligthelm et al. [1988a]; d Veras et al. [2017]; e Lopes et al. [2018]; f Ho et al. [1990]; g Cadete et al. [2016]; h Morais et al. [2017]; i Lopes et al. [2016]

7.4 Results & Discussion

7.4.1 In vivo NADPH source in Sch. stipitis during xylose fermentation.

Sch. stipitis Xyl1p prefers NADPH to NADH in vitro, but its NADPH source and XR selectivity for NADPH are unknown in vivo. FVA was used to analyze the impact of each NADPH source on the maximum ethanol yield from xylose, assuming the in vitro NADPH selectivity for XR, in the presence and absence of NADPase (Figure 7.3). The oxidative pentose phosphate pathway is able to support xylose fermentation to ethanol in silico, but its ethanol yields are less than the in vivo yields when the oxygen uptake rate (OUR) is less than 3 mmol O2 · gDCW-1 · h-1; the addition of NADPase is unable to resolve the redox imbalance with NADPH regenerated from the oxidative pentose phosphate pathway. Jeffries et al. [2007] proposed the succinate bypass to regenerate NADPH, but it is unable to support xylose fermentation to ethanol when the OUR is less than 10 mmol O2 · gDCW-1 · h-1 in these simulations. NADP-dependent isocitrate dehydrogenase can also enable ethanol fermentation, but not below 10 mmol O2 · gDCW-1 · h-1. Only NADPase and NADH kinase, phosphorylating NADP-GAPDH, and non-phosphorylating NADP-GAPDH were predicted to be able to anaerobically ferment xylose to ethanol at the maximum theoretical yield. There is no strong evidence for NADPH regeneration from NADP-GAPDH in Sch. stipitis, since it does not have any genes orthologous to Klu. lactis’ phosphorylating NADP-GAPDH, and there is no significant upregulation of UGA2, the most likely source of non-phosphorylating NADP-GAPDH in Sch. stipitis [Brunner et al., 1998], in its xylose-fermenting transcriptome [Jeffries and van Vleet, 2009, Yuan et al., 2011]. Therefore, these simulations and omics data support NADPase/NADH kinase balancing redox cofactors during xylose fermentation in Sch. stipitis if XR prefers NADPH in vivo.
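For reference, the maximum theoretical yields referred to above can be reproduced with a short calculation. The sketch assumes the standard overall stoichiometry of 3 xylose to 5 ethanol and 5 CO2, a one-to-one conversion of xylose to xylitol, and rounded molar masses; the function and variable names are illustrative and are not taken from the metabolic model.

```python
# Approximate molar masses in g/mol (rounded values, assumed here).
MW_XYLOSE = 150.13
MW_ETHANOL = 46.07
MW_XYLITOL = 152.15


def mass_yield(mol_product_per_mol_substrate, mw_product, mw_substrate):
    """Convert a molar yield (mol/mol) to a mass yield (g/g)."""
    return mol_product_per_mol_substrate * mw_product / mw_substrate


# 3 xylose -> 5 ethanol + 5 CO2, so 5/3 mol ethanol per mol xylose.
max_ethanol_yield = mass_yield(5 / 3, MW_ETHANOL, MW_XYLOSE)

# Xylose reductase converts 1 mol xylose to 1 mol xylitol.
max_xylitol_yield = mass_yield(1.0, MW_XYLITOL, MW_XYLOSE)

# Fraction of the theoretical maximum for a reported yield of 0.47 g/g,
# the upper ethanol yield listed for Sch. stipitis in Table 7.2.
fraction_of_max = 0.47 / max_ethanol_yield

print(f"ethanol: {max_ethanol_yield:.3f} g/g, xylitol: {max_xylitol_yield:.3f} g/g")
print(f"0.47 g/g is {fraction_of_max:.0%} of the theoretical ethanol yield")
```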

Figure 7.3: Ethanol yield as a function of NADPH source and oxygen uptake rate (OUR). There is a drop in the ethanol yield when the oxidative pentose phosphate pathway regenerates NADPH at OUR’s close to anaerobic levels. The highest ethanol yields were obtained with NADP phosphatase/NADH kinase, phosphorylating glyceraldehyde 3-phosphate dehydrogenase (GAPDH), and non-phosphorylating GAPDH. NADP-dependent isocitrate dehydrogenase and the succinate bypass were unable to ferment xylose to ethanol below 10 mmol · gDCW-1 · h-1.

7.4.2 Impact of XR cofactor preference on xylose fermentation.

The in vivo XR cofactor preference is not known, so FVA was used to explore the maximum xylitol yield with the ethanol yield as the objective function, with varying OUR’s and NADPH selectivities, in the presence and absence of NADPase/NADH kinase (Figure 7.4A). Anaerobic xylose fermentation leads to xylitol accumulation when any amount of NADPH drives XR flux in silico. The in silico xylitol yield falls within the experimental polyol range when the NADPH selectivity is less than 10% NADPH, far less than the XR’s in vitro selectivity (60% NADPH). In silico anaerobic xylose fermentation to ethanol is infeasible when the XR selectivity for NADPH is greater than 80% in the absence of NADPase and NADH kinase. The in vitro cofactor selectivity predicts the xylitol yield to be greater than 0.3 g/g when the OUR is less than 2 mmol · gDCW-1 · h-1, which is more than the typical polyol yield of 0.1 g/g observed with Sch. stipitis [Ligthelm et al., 1988b, Su et al., 2015]. The presence of NADPase/NADH kinase in the model enables the fermentation of xylose to ethanol at all OUR’s and all NADPH selectivities. There is no impact on the ethanol yield when they are present individually. These simulations indicate that either XR can change its cofactor selectivity to NADH during oxygen-limiting conditions, NADPase/NADH kinase play a role in redox cofactor balancing in Sch. stipitis, or additional enzymes are absent in the metabolic model.
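The redox imbalance explored in these simulations can be illustrated with a cofactor balance over the XR and XDH steps alone. The sketch below is a simplification under stated assumptions and is not the genome-scale model: XDH is taken as strictly NAD-dependent, NADH kinase is written as NADH + ATP -> NADPH + ADP, NADPase as NADP+ + H2O -> NAD+ + Pi, and all other sources and sinks of NAD(P)H are ignored. The function name is hypothetical.

```python
def redox_balance(nadph_fraction):
    """Cofactor balance per mol xylose converted to xylulose by XR and XDH.

    nadph_fraction is the share of XR flux driven by NADPH (0 to 1).
    """
    if not 0.0 <= nadph_fraction <= 1.0:
        raise ValueError("nadph_fraction must be between 0 and 1")
    nadph_consumed_by_xr = nadph_fraction
    nadh_consumed_by_xr = 1.0 - nadph_fraction
    nadh_produced_by_xdh = 1.0
    # NADH left over after XR has reused its share of the NADH from XDH.
    nadh_surplus = nadh_produced_by_xdh - nadh_consumed_by_xr
    # One turn of NADH kinase plus NADPase converts one NADH and one NADP+
    # into one NADPH and one NAD+ at the cost of one ATP.
    cycle_turns = nadh_surplus
    return {
        "nadh_surplus": nadh_surplus,
        "nadph_deficit": nadph_consumed_by_xr,
        "atp_cost": cycle_turns,
    }


# In vitro selectivity of 60% NADPH for XR.
balance = redox_balance(0.6)
print(balance)
```

Under these assumptions the NADH surplus always equals the NADPH deficit, so the paired enzymes can close the balance without oxygen, and neither enzyme alone can, which is consistent with the simulation results described above.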

Previous attempts to determine the XR cofactor preference are summarized in Table 7.3. In vitro enzyme characterization and crude enzyme assays for XR at varying aeration rates with Sch. stipitis both show higher selectivity for NADPH than NADH. Verduyn et al. [1985a] found that XR preferred NADPH as a cofactor even in the presence of NADH. The accumulation of xylitol in the first engineered Sac. cerevisiae strain with the XR-XDH pathway [Kötter and Ciriacy, 1993] also suggests NADPH is the preferred cofactor in vivo. In contrast, Ligthelm et al. [1988c] used nuclear magnetic resonance (NMR) spectroscopy and 13C labelling to infer that NADH is the preferred XR cofactor during anaerobic conditions. Dellweg et al. [1990] corroborated their conclusions via metabolic flux analysis and predicted that NADPH is preferred by XR aerobically. Dellweg et al. [1990] suggested the concentration of redox cofactors may exert metabolic control over the in vivo XR cofactor preference. The conflicting conclusions for the in vivo XR preference inferred from in vitro enzyme activities and model-based analysis require a reexamination of the assumptions made during the pre-genomic age.

Table 7.3: Estimated xylose reductase cofactor selectivity using various techniques in the literature.

Method | NADH | NADPH | Reference
In vitro: purified enzyme characterization from Sch. stipitis | 40% | 60% | Verduyn et al. [1985a]
In vitro: crude enzyme assay in Sch. stipitis at varying aeration rates | 40% | 60% | Skoog and Hahn-Hägerdal [1990]
In vivo: XYL1 and XYL2 expression in Sac. cerevisiae | Minor | Major | Kötter and Ciriacy [1993]
In vivo: 13C xylose (anaerobic) | 100% | 0% | Ligthelm et al. [1988c]
In silico: MFA (aerobic) | 5% | 95% | Dellweg et al. [1990]
In silico: MFA (anaerobic) | 85% | 15% | Dellweg et al. [1990]

Although metabolic control over the XR cofactor preference may exist to an extent, the discrepancy between the in vitro and in vivo XR cofactor preferences is likely due to an incomplete metabolic network reconstruction. In vivo flux cannot be directly measured in most cases but can be inferred using accurate metabolic network reconstructions. Ligthelm et al. [1988c] demonstrated that anaerobic xylose fermentation did not use NADPH regenerated from the oxidative pentose phosphate pathway and concluded that XR likely used NADH regenerated from glycolysis during anaerobic xylose fermentation. These results do not prove that XR uses NADH as a cofactor anaerobically, since the presence of cytosolic NADPase/NADH kinase or NADP-GAPDH in Sch. stipitis’ metabolic network would confound attempts to resolve the XR cofactor preference using D-1-13C-xylose and NMR. The increased expression of ZWF1 and UTR1 in the xylose-fermenting transcriptome (Table 7.1), the lack of evidence for NADP-GAPDH in Sch. stipitis, and the complete anaerobic fermentation of xylose to ethanol support our proposed redox balancing mechanism of NADPase and NADH kinase.

Figure 7.4: Xylitol yield sensitivity to oxygen uptake rate (OUR) and growth rate. (A) Simulations maximized the conversion of xylose to ethanol with and without NADP phosphatase (NADPase) and NADH kinase. In the absence of NADPase and NADH kinase, anaerobic xylose fermentation is only feasible in silico when xylose reductase (XR) is driven by no more than 80% NADPH. The in silico xylitol yield exceeds the 10% polyol yield typically observed in vivo when OUR is less than 2 mmol · gDCW-1 · h-1. The presence of NADPase and NADH kinase in the metabolic model eliminates xylitol production at all OUR’s and XR cofactor selectivities. (B) Xylitol production envelope with and without NADPase and NADH kinase when XR is driven by 60% NADPH. OUR was constrained to 1 mmol · gDCW-1 · h-1. Simulations without NADPase/NADH kinase lead to xylitol accumulation at the optimal growth rate and at all suboptimal growth rates. Bypassing Complex I (NUO) only reduces the maximum growth rate and marginally decreases the xylitol yield at the optimal growth rate. The presence of NADH kinase or NADPase alone does not reduce the xylitol yield to the experimental polyol range; however, the addition of both NADPase and NADH kinase enables the xylitol yield to fall within the experimental polyol range.

The previous simulations demonstrated that xylitol accumulates when the ethanol yield is maximized in the absence of NADPase/NADH kinase, but the impact of the growth rate on xylose fermentation was not studied. Growth-coupled xylose fermentation is especially relevant to Sch. stipitis since it has suboptimal growth with xylose, but not glucose [Ligthelm et al., 1988c, Shi et al., 2002]. FVA was used to explore the xylitol yield solution space by assuming the in vitro XR cofactor preference in the wild-type background, a bypassed Complex I [Shi et al., 2002], and in the presence and absence of NADPase/NADH kinase. The minimum xylitol yield was predicted to be 0.52 g/g and 0.27 g/g, at the maximum and minimum growth rates, respectively. These predicted yields were greater than the maximum 0.10 g/g polyol yield typically seen in xylose fermentation with Sch. stipitis [Cadete et al., 2012]. Bypassing Complex I (type I NADH dehydrogenase) only had a minor impact on reducing the predicted xylitol yield at the maximum growth rate. The addition of NADPase or NADH kinase alone does not change the solution space; however, the presence of both NADPase and NADH kinase enables the in silico xylitol yield to overlap with the experimental range [Ligthelm et al., 1988c, Shi et al., 2002, Wahlbom et al., 2003, Jeffries et al., 2007, Jeffries and van Vleet, 2009]. These results demonstrate that no growth-coupled mechanism balances redox cofactors given our metabolic constraints and further support NADPase and NADH kinase as critical to balancing redox cofactors anaerobically.

7.4.3 PHO3 and PHO3.2 characterization.

Pho3p and Pho3.2p were expressed in Kom. phaffii and purified for characterization. Purified Pho3.2p showed Michaelis-Menten kinetics with NADP (Figure 7.5); no activity was detected with Pho3p. The assay conditions were not optimized, and therefore the in vitro NADPase activity may not reflect its in vivo activity. Pho3.2p was found to be more promiscuous than Pho3p using the malachite green phosphate assay (Appendix I). Although Pho3.2p’s NADPase activity may balance redox cofactors under oxygen limitation, its broad activity may have suboptimal effects if the in vitro activities are relevant in vivo. These include creating futile cycles with its phosphatase activity on ribose 5-phosphate, erythrose 4-phosphate, and fructose 6-phosphate. PHO3.2 is not highly expressed, and is not significantly upregulated during xylose fermentation (Table 7.1). Pho3.2p’s low expression has allowed it to evade top-down omics approaches. Characterization of Pho3.2p has confirmed it has NADPase activity, in addition to a broader range of activities.
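Kinetic parameters for a Michaelis-Menten enzyme can be estimated from initial rates with a linearized fit. The sketch below uses the Hanes-Woolf form, [S]/v = [S]/Vmax + Km/Vmax, solved by ordinary least squares. The substrate concentrations and the parameters used to generate the rates are hypothetical and noise free; they are not the measurements behind Figure 7.5, and a nonlinear fit would generally be preferred for real data.

```python
def michaelis_menten(s, vmax, km):
    """Initial rate for substrate concentration s."""
    return vmax * s / (km + s)


def hanes_woolf_fit(substrate, rates):
    """Estimate (Vmax, Km) from the line [S]/v = [S]/Vmax + Km/Vmax."""
    ys = [s / v for s, v in zip(substrate, rates)]
    n = len(substrate)
    mean_x = sum(substrate) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in substrate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(substrate, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    vmax = 1.0 / slope
    km = intercept * vmax
    return vmax, km


# Hypothetical NADP concentrations (mM) and rates generated from
# assumed parameters Vmax = 2.0 and Km = 0.8 (arbitrary units).
nadp_mM = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
rates = [michaelis_menten(s, vmax=2.0, km=0.8) for s in nadp_mM]

vmax_fit, km_fit = hanes_woolf_fit(nadp_mM, rates)
print(f"Vmax = {vmax_fit:.2f}, Km = {km_fit:.2f} mM")
```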

Figure 7.5: Pho3.2p Michaelis-Menten kinetics with NADP as a substrate. Reaction conditions: 50 mM HEPES pH 7.5, 50 mM sodium formate, 20 μg formate dehydrogenase, 5 mM MgCl2, 0.5 mM MnCl2, and 2.9 μg pure phosphatase added. The enzyme assay was not optimized for Km or Vmax.

7.4.4 The phylogenetic origin of xylose fermentation to ethanol in Scheffersomyces-Spathaspora.

NADP phosphatase.

PHO3.2 is derived from a tandem duplication of PHO3 in a common ancestor of Scheffersomyces, Spathaspora, and Suhomyces tanzawaensis (Appendix K). PHO3 was subsequently lost in several Scheffersomyces and Spathaspora species. The presence of PHO3.2 in yeasts unable to ferment xylose to ethanol indicates that it may offer a fitness advantage beyond balancing redox cofactors; one possibility is fine-tuning the concentrations or ratios of redox cofactors [Kawai et al., 2005]. The absence of PHO3.2 homologs in wild-type Sac. cerevisiae prevents adaptive laboratory evolution from overcoming the redox imbalance hurdle created by expressing the XR-XDH pathway from Sch. stipitis.

PHO3 and PHO3.2 belong to the acid phosphatase family and contain the survival E protein motif, which was first characterized in Thermotoga maritima [Zhang et al., 2001, Lee et al., 2001]. Candida albicans’ Pho100p is the only eukaryotic homolog of Pho3p to be characterized [MacCallum et al., 2009] and was shown to scavenge phosphate from the extracellular medium. The gain of NADPase activity by Pho3.2p is an example of neofunctionalization, which would have required several changes in its regulation and sequence. First, Pho3.2p would have required a change in its expression, from induction by phosphate limitation to induction by at least oxygen limitation, to enable redox cofactor balancing during xylose fermentation. Second, a change in its signal peptide or mature protein sequence would have redirected it from the extracellular space to the cytoplasm or to the membrane of an organelle. Lastly, Pho3.2p would have likely required sequence mutations to gain NADPase activity, since this activity is absent in Sch. stipitis Pho3p; as a consequence, it may have gained greater enzyme promiscuity. These changes are consistent with the view that duplications are critical to the evolution of metabolism, and that a duplication enabled the PHO3.2 paralog to gain NADPase activity in the Scheffersomyces-Spathaspora clade.

NAD(P)H-dependent xylose reductase.

XYL1 is part of the large aldo-keto reductase family, which has a broad range of activities and generally prefers NADPH as a cofactor [Bennett et al., 1997]. Several yeasts maintain XYL1 in their genome despite the inability to grow on xylose. Yeasts that have NAD(P)H-dependent XR are able to ferment xylose to ethanol under oxygen limitation [Bruinenberg et al., 1984]. The best xylose fermenters belong to the Scheffersomyces-Spathaspora clade, and have NAD(P)H-dependent XR encoded by XYL1.2 (Table 7.1) [Mamoori et al., 2013]. XYL1.2 originated from a tandem duplication of XYL1 in a common ancestor of Spathaspora and Scheffersomyces (Appendix I). Other yeasts that can ferment xylose to ethanol, although at lower yields than Scheffersomyces-Spathaspora species, have independently evolved NAD(P)H-dependent XR. These include Pac. tannophilus, from a recent duplication of XYL1 [Ditzelmüller et al., 1985], and likely convergent evolution in the Xyl1p orthologs of Spathaspora hagerdaliae and Candida tropicalis [Bruinenberg et al., 1984]. NAD(P)H-dependent XR evolved independently in yeasts, but Xyl1.2p’s higher preference for NADH enables superior xylose fermentation to ethanol in the Scheffersomyces-Spathaspora clade.

NADH kinase.

UTR1 encodes a cytoplasmic NAD+ kinase in Sac. cerevisiae but has been shown to have slight activity with NADH [Mori et al., 2005]. Its physiological role is the de novo synthesis of NADP+, and not the de novo synthesis or regeneration of NADPH [Kawai et al., 2001, Mori et al., 2005]. Utr1p from Sac. cerevisiae or Sch. stipitis could not be purified to compare their enzyme activities, but there is some indirect evidence that Utr1p in Sch. stipitis regenerates NADPH. Swapping Sch. stipitis UTR1 in place of Sac. cerevisiae UTR1 led to a decrease in Sac. cerevisiae’s growth rate on glucose; its growth rate was restored to near wild-type by further swapping Sch. stipitis NDE1 in place of Sac. cerevisiae NDE1, which encode NAD(P)H and NADH dehydrogenase, respectively. Furthermore, the amino acid alignment of Utr1p in budding yeasts reveals conserved motifs found exclusively in the CTG clade, but outside the conserved NAD kinase domain (Appendix J). Many of these yeasts grow well on xylose and can ferment it to xylitol and ethanol [Papon et al., 2014]. NAD+ kinase may have evolved to NADH kinase in the CTG clade and replaced NADP-dependent acetaldehyde dehydrogenase as an alternative NADPH source. Additional characterization of Utr1p orthologs can trace its preference for phosphorylating NAD+ or NADH in budding yeasts.

7.4.5 Cofactor balancing in metabolic pathways.

NADP phosphatase and NADH kinase appear to have evolved millions of years ago to balance redox cofactors during xylose fermentation, in a possible symbiotic relationship between Scheffersomyces-Spathaspora species and wood-ingesting beetles [Suh et al., 2003], but these enzymes are relevant to today’s metabolic engineers. Gevo Inc. filed a patent which proposed the use of NADH kinase and NADP(H) phosphatase to resolve the redox imbalance between NAD-dependent GAPDH and NADPH-dependent isobutanal reductase in the isobutanol pathway; no results were provided in their patent to confirm its feasibility [Buelter et al., 2008]. The xylose fermentation simulations show that the sole expression of NADPase or NADH kinase cannot increase the ethanol yield from xylose, but the expression of both NADPase and NADH kinase may balance redox cofactors anaerobically. This is not consistent with the bioinformatic and physiological analysis of Scheffersomyces-Spathaspora yeasts (Table 7.1). These yeasts demonstrate that NAD(P)H-dependent XR enables xylose fermentation to ethanol, but yeasts with NADPH-dependent XR and NADPase/NADH kinase still accumulate xylitol [Bruinenberg et al., 1984]. Further investigation is needed to determine if our proposed mechanism can support sustained anaerobic fermentation with an entirely imbalanced pathway, or if other factors, such as inhibition of XR or XDH by NAD(P)(H) [Verduyn et al., 1985a, Dellweg et al., 1990], limit its ability in vivo. This analysis of xylose fermentation in Sch. stipitis highlights the need to understand and engineer metabolism at a systems level, which has been advocated by Kim et al. [2011] and Meadows et al. [2016].

7.5 Conclusion

In this study, a bottom-up analysis was undertaken to understand how Sch. stipitis balances redox cofactors during xylose fermentation by integrating transcriptomics, proteomics, phylogenetics, and metabolic modelling. Metabolic flux simulations could not be reconciled with Sch. stipitis’ near theoretical ethanol yields during oxygen limitation without the presence of NADPase and NADH kinase in the metabolic model when NADPH was the dominant cofactor driving XR flux. The proposed mechanism involving NADPase and NADH kinase can balance redox cofactors independent of oxygen, but requires ATP. In contrast, most yeasts generate NADPH from the oxidative pentose phosphate pathway but require oxygen to reoxidize NADH. NADPase activity was confirmed from purified Pho3.2p using Kom. phaffii; it has a broader activity than Pho3p, its recent paralog. NADH-linked XR activity and NADPase from Pho3.2p both originated from tandem gene duplications in a common ancestor of Scheffersomyces and Spathaspora species. This study demonstrates the advantages of using a bottom-up approach that combines metabolic modelling, omics analysis, bioinformatics, and enzymology to reverse engineer metabolism.

Chapter 8

Conclusions

In this thesis I sought to unravel how Sch. stipitis is able to balance its redox cofactors during xylose fermentation. The objectives and hypotheses of this thesis were identified in detail in Chapter 2. Broadly speaking, the hypotheses are that: curation of a pan-genome using phylogenetic reconstruction could lead to more accurate ortholog assignments than clustering-based methods, which can improve genome annotation; important changes in the evolution of yeast metabolism could be identified by reconciling the gains and losses of genes in the pan-genome with yeast physiology; metabolic network reconstructions with more accurate genomic and metabolic coverage could be obtained by leveraging a functionally annotated pan-genome based on phylogenetic reconstruction; and the redox balancing mechanism enabling xylose fermentation in Sch. stipitis could be reverse engineered with metabolic modelling, omics analysis, and bioinformatics.

The outcomes from each chapter are outlined below:

1. Pan-genome curation of 33 Dikarya fungi and yeasts into ortholog groups, and its comparison with existing phylogenomic databases

2. Analysis of the pan-genome, which identified important and reoccurring events in the radiation of budding yeasts, and functional annotation of metabolic ortholog groups

3. Reconstruction of 33 genome-scale metabolic networks with more genomic and metabolic coverage than existing methods

4. Prediction and preliminary confirmation that xylose fermentation in Sch. stipitis requires NADPase, encoded by PHO3.2, and cytosolic NADH kinase


The first objective in this thesis was the pan-genome curation of budding yeasts and fungi (Chapter 3), spanning 600 million years of evolution in Dikarya. Existing phylogenomic databases do not span diverse budding yeast taxa, and sometimes do not distinguish between orthologs and paralogs. AYbRAH was created to increase the breadth and depth of the budding yeast pan-genome by clustering proteins into homolog and ortholog groups with OrthoDB and OrthoMCL, respectively, and curating homologous proteins with metabolic function into ortholog groups with phylogenetic reconstruction for protein families with ambiguous phylogeny. Although phylogenetic reconstruction is more computationally intensive than clustering approaches, it makes it easier to identify paralogs that may have new functions. A comparison of AYbRAH to existing phylogenomic databases shows that it performs as well as PANTHER, and better than OMA, KO, and eggNOG.

In Chapter 4, it was hypothesized that the gains and losses of metabolic genes could help identify important and reoccurring events in the radiation of budding yeasts. As expected, gene duplications were identified as playing a critical role in the evolution of budding yeasts, including the duplication of ACS1 to ACS2, enabling the PDH bypass in fermentative yeasts; the duplication of the aldehyde dehydrogenase family giving rise to NADP-Ald, an alternative NADPH regeneration source; the duplication of Nde and Ndi in the ETC; the duplication of genes encoding homomeric enzymes, such as Pfk, and their subsequent evolution into heteromeric enzymes with altered enzyme kinetics; duplications of genes leading to enzymes being localized in new compartments via neofunctionalization; and duplications of ribosomal protein subunits, and their possible role in the Crabtree effect.

The third hypothesis, tested in Chapter 5, was that genome-scale metabolic networks with more accurate genomic and metabolic coverage can be created by leveraging a functionally annotated pan-genome based on phylogenetic reconstruction. Comparison of these models to published GENREs showed this hypothesis to be true for Schizo. pombe, Asp. niger, Neu. crassa, Yar. lipolytica, Sch. stipitis, and Kom. phaffii; Sac. cerevisiae was the only exception. Reactions that were not included in past GENREs include aromatic degradation, alkane degradation, and branched-chain amino acid degradation. The genome-scale metabolic networks of 33 fungi and yeasts were reconstructed using a new simultaneous bottom-up approach that relied on the pan-genome curated in Chapter 3, and the functional annotation of gene duplications in Chapter 4.

In Chapter 7, the Sch. stipitis GENRE was analysed to test its ability to ferment xylose to ethanol in silico. Previously proposed and new redox balancing mechanisms were tested. NADH kinase and NADPase, a eukaryotic orphan enzyme, were identified as critical to balancing redox cofactors. Pho3.2p was the only candidate phosphatase showing expression during xylose fermentation, via proteomics. Purification and characterization of Pho3.2p and Pho3p, its paralog with 77% identity, indicate that Pho3.2p, but not Pho3p, has NADPase activity. XR and Pho3.2p both emerged from independent gene duplications in a common ancestor of Scheffersomyces and Spathaspora.

In conclusion, it was demonstrated that current genome annotation practices have limited the ability to bridge the genotype-to-phenotype gap in non-conventional organisms. Phylogenetic reconstruction was critical to curating the pan-genome in gene families with many duplications and gene losses. Analysis of this pan-genome identified duplications as important events in the radiation of budding yeasts. A new scalable framework was developed that can enable future GENREs to have more accurate genomic and metabolic coverage, and can be synchronized as individual GENREs are curated. Finally, NADPase and NADH kinase were proposed to balance redox cofactors during xylose fermentation. Analysis of omics expression yielded Pho3.2p as the most promising candidate; its activity was confirmed with protein expressed in Kom. phaffii. This thesis increased the breadth and depth of our understanding of the evolution of yeast metabolism.

8.1 Recommendations

8.1.1 Next steps for AYbRAH

Integration with PANTHER

OrthoDB was chosen to cluster ortholog groups in AYbRAH into homolog groups because it spans more taxa than other phylogenomic databases, and has ortholog assignments for different taxonomic ranks; however, it is less specific than PANTHER, despite the latter only having a few fungal proteome annotations (Appendix B). Future updates to AYbRAH should migrate the AYbRAH homolog backbone from OrthoDB to PANTHER, and add the remaining fungi in PANTHER to increase its phylogenomic span. These include other fungal model organisms, fungi and yeasts pathogenic to humans or plants, or fungi and yeasts occupying important taxonomic ranks: Batrachochytrium dendrobatidis, Cryptococcus neoformans, Puccinia graminis, Ustilago maydis, Eme. nidulans, Neosartorya fumigata, Phaeosphaeria nodorum, Sclerotinia sclerotiorum, Can. albicans, and Eremothecium gossypii.

Reconciling AYbRAH with YGOB and CGOB

YGOB [Byrne and Wolfe, 2005] and CGOB [Maguire et al., 2013] are the gold standard for ortholog databases in yeast genomics, and were created using sequence similarity and synteny. YGOB and CGOB span roughly 112 and 239 million years of evolution, respectively, while AYbRAH spans more than 600 million years of evolution [Hedges et al., 2015]. Although AYbRAH has a broader pan-genomic coverage, YGOB and CGOB are expected to have better paralog and ohnolog assignments than AYbRAH because of their use of synteny. Future versions of AYbRAH should be reconciled with YGOB and CGOB.

Coordinate-based protein annotations

It has been noted that genome protein annotations sometimes contain inaccuracies [Maguire et al., 2013]. For example, the protein translation Cybja1_131289 does not include its full N-terminal sequence. Another common problem with genome protein annotations is genes that are not annotated at all. Spa. passalidarum’s genome encodes two PHO3 homologs, but only one protein is currently annotated. AYbRAH should adopt the genomic coordinate-based system used in YGOB and CGOB [Maguire et al., 2013] to improve protein annotations.

Multi-level phylogenetic placement

Czech et al. [2018] recently outlined an approach to automate multilevel phylogenetic placement, which is often used to map environmental sequences onto an existing species tree reconstruction. This method can be adapted to ortholog identification with TreeGrafter [Tang et al., 2018] or pplacer [Matsen et al., 2010]. Two sets of phylogenetic trees can be maintained in this multi-level framework for homolog groups: smaller reference trees trimmed from phylogenetic trees of homologs that maintain sequence diversity of orthologs, and clade phylogenetic trees with higher sequence resolution. Proteins from new genome sequences can then be queried against the reference tree, and then against a clade tree with higher sequence resolution, to be assigned annotations from an ortholog group.
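The two-level query described above can be sketched in a few lines of Python. This is only an illustration of the control flow (coarse reference level first, then the matching clade level), not an implementation of TreeGrafter or pplacer: a k-mer Jaccard similarity stands in for true phylogenetic placement, and all sequences, clade names, and ortholog group identifiers below are hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets (a crude stand-in for placement likelihood)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def best_match(query, candidates):
    """Return the label of the candidate sequence most similar to the query."""
    return max(candidates, key=lambda label: similarity(query, candidates[label]))


def place(query, reference_level, clade_level):
    """Level 1: pick a clade on the trimmed reference set.
    Level 2: pick an ortholog group within that clade's higher-resolution set."""
    clade = best_match(query, reference_level)
    group = best_match(query, clade_level[clade])
    return clade, group


# Hypothetical toy data: one representative per clade, several groups per clade.
reference_level = {"clade_A": "MKTAYIAKQR", "clade_B": "GSHMLEDPVW"}
clade_level = {
    "clade_A": {"FOG_A1": "MKTAYIAKQRQ", "FOG_A2": "MKTAYLAKHR"},
    "clade_B": {"FOG_B1": "GSHMLEDPVWK", "FOG_B2": "GSHMIEDPLW"},
}

assignment = place("MKTAYIAKQRQI", reference_level, clade_level)
```

In a real pipeline, `best_match` would be replaced by a call to a placement tool at each level; the point of the sketch is that only the clade selected at the first level needs to be searched at the second.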

Alternative phylogenetic reconstruction programs

Some phylogenetic trees had low bootstrap support for ortholog groups, which prevented identification of orthologs and paralogs from the trees. For example, NADP-dependent serine dehydrogenase (HOG00233) has several closely related ortholog groups. Additional programs for maximum likelihood-based phylogenetic analysis can be used to reconstruct the phylogeny of homologs, such as IQ-TREE [Nguyen et al., 2014], which has been shown to outperform RAxML and PHYML for species tree reconstruction [Zhou et al., 2017]. Smaller phylogenetic trees can also be reconstructed from subfamilies of the homolog groups.

8.1.2 Recommendations for other community-driven ortholog databases

PANTHER and YGOB are the highest quality ortholog databases for yeasts despite being built from different methods and for different purposes. PANTHER spans some four billion years of evolution, has the most accurate ortholog assignments for large-scale phylogenomic databases when compared to AYbRAH, but has limited proteomes and cannot distinguish between recently emerged paralogs and ohnologs. On the other hand, YGOB spans 100 million years of evolution, also covers limited proteomes, but has high accuracy for recently evolved paralogs and ohnologs in Saccharomycetaceae.

An alternative approach to curated ortholog databases for other taxa can leverage both methods. First, a research community can select a group of organisms of interest, and choose several proteomes in PANTHER that span the phylogenomic range. Next, TreeGrafter [Tang et al., 2018], or other phylogenetic placement algorithms [Matsen et al., 2010], can be used to map the proteomes of interest to PANTHER families/subfamilies. Different algorithms can be used to align and reconstruct these protein families at different hierarchies. Tree-crawling scripts, based on ETE [Huerta-Cepas et al., 2016a], and visual inspection can identify ambiguous ortholog, paralog, ohnolog, and xenolog assignments. Ambiguous paralog or ohnolog assignments can be resolved by phylogenetic reconstruction, sequence analysis, or by using genomic synteny of closely related organisms, similar to YGOB [Byrne and Wolfe, 2005]. This framework leverages the accuracy of PANTHER from the top down, and genomic synteny approaches like YGOB from the bottom up.
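As an illustration of what a tree-crawling script might check, the sketch below flags internal nodes of a gene tree whose child subtrees share species, a common heuristic (species overlap) for separating duplication nodes from speciation nodes. It deliberately avoids any library API: the tree is a nested tuple, leaves are hypothetical `species|gene` labels, and the gene names are placeholders rather than curated AYbRAH assignments.

```python
def leaf_species(node):
    """Return the set of species codes found under a node.

    Leaves are strings of the form 'species|gene'; internal nodes are tuples.
    """
    if isinstance(node, str):
        return {node.split("|")[0]}
    species = set()
    for child in node:
        species |= leaf_species(child)
    return species


def find_duplications(node, found=None):
    """Collect, in pre-order, the species shared between children of each node.

    A node whose children share at least one species is treated as a
    putative duplication; nodes with disjoint children are speciations.
    """
    if found is None:
        found = []
    if isinstance(node, str):
        return found
    seen, overlap = set(), set()
    for child in node:
        child_set = leaf_species(child)
        overlap |= seen & child_set
        seen |= child_set
    if overlap:
        found.append(sorted(overlap))
    for child in node:
        find_duplications(child, found)
    return found


# Hypothetical gene tree: a duplication shared by sce and kla, with yli as outgroup.
gene_tree = (
    (("sce|geneA1", "kla|geneA1"), ("sce|geneA2", "kla|geneA2")),
    "yli|geneA",
)

duplications = find_duplications(gene_tree)
```

Nodes flagged this way are only candidates for inspection; low branch support or gene loss can make a speciation look like a duplication, which is where synteny or visual inspection would come in.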

8.1.3 Comparative genetics as a tool to understand physiology

Comparative genetics, not to be confused with comparative genomics, can be used to further advance our understanding of physiology. This approach was championed by Brenner in a recent interview. Orthologous or paralogous genes in two organisms can be swapped to test their impact on physiology. These genetic swaps can lead to beneficial, neutral, or harmful phenotypes and provide indirect insight into the ortholog conjecture and protein function. If there is a change in a phenotype from the genetic swap, it must be understood from the sequence-structure-function space for the protein, and its impact on a cellular network.

For example, UTR1 encodes NAD+ kinase in Sac. cerevisiae and is predicted to encode NADH kinase in Sch. stipitis. At the protein level, the amino acid sequence in one or both of the Utr1p enzymes must have significantly evolved from the archetypal Utr1p sequence to make the enzymes incompatible or not fully interconvertible. At the network level, the genetic swap in Sac. cerevisiae would generate excess NADH, slowing down its growth rate, while Sch. stipitis would lose its ability to ferment xylose to ethanol at high yields. These genetic swaps are expected to play a more important role in deciphering the physiology of non-conventional yeasts than model organisms.

Genes that have been lost in yeasts can be resurrected to study their impact on yeast physiology. These genes may have been lost because they were not being used (the "use it or lose it" phenomenon), or they were no longer compatible with an organism’s environment and metabolic network. For example, Ndi0p is absent in all budding yeasts in this pan-genome, except for Lip. starkeyi. This enzyme can be reintroduced in Sac. cerevisiae and tested for its ability to reoxidize NADH in different conditions relative to Ndi1p. Likewise, Ndi1p can be expressed in Lip. starkeyi. These genetic swaps can inform how proteins impact an organism’s network.

Chapter 4 identified important gene duplications and losses in the radiation of budding yeasts. Some genes that may be studied with this approach include: swapping the heteromeric Pfk from Sac. cerevisiae with homomeric Pfk in Yar. lipolytica; deletion of NADP-GAPDH in Klu. lactis, and its variable expression in Saccharomycetaceae yeasts; deletion of ALD6.3 in Dek. bruxellensis, and its expression in Bre. naardenensis; deletion of ALD5.2 in Deb. hansenii, and its expression in other CTG yeasts. These gene duplications and losses are prime targets for genetic swaps and genetic resurrections in the budding yeast pan-genome.

Further elucidating xylose fermentation in Scheffersomyces stipitis

Comparative genetics with CRISPR-Cas9 can be used to further validate or refute the proposed NADH kinase-NADPase redox balancing mechanism outlined in Chapter 6. This can be achieved by perturbing the mechanism in part or in whole in Sch. stipitis, or by introducing it in yeasts unable to ferment xylose to ethanol, such as Mey. guilliermondii [Nolleau et al., 1995].

For example, PHO3.2’s role in metabolism can be demonstrated by a knockout in Sch. stipitis or expression in Mey. guilliermondii. Swapping Mey. guilliermondii’s NADPH-dependent XR for Sch. stipitis’ NAD(P)H-dependent XR can test Sch. stipitis’ ability to ferment xylose to ethanol with a completely imbalanced pathway in the presence of NADPase and a putative NADH kinase; the converse swap tests the ability of Mey. guilliermondii to ferment xylose to ethanol with NAD(P)H-dependent XR but without NADPase. Furthermore, the putative NADH kinase in Sch. stipitis can be swapped with NAD+ kinase in Sac. cerevisiae to test its impact on redox metabolism and xylose fermentation, especially during anaerobic fermentation.
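The cofactor arithmetic behind these proposed swaps can be made explicit with a small stoichiometric sketch. It assumes textbook forms of the reactions (NADH kinase: ATP + NADH -> ADP + NADPH; NADPase: NADP+ + H2O -> NAD+ + Pi), omits water and protons, and uses a hypothetical 50% NADPH preference for XR; none of these numbers are taken from Chapter 6. Under these assumptions, converting one xylose to xylulose leaves a surplus of NADH and a deficit of NADPH equal to the NADPH fraction used by XR, and running the kinase and phosphatase at that same flux closes both cofactor pools at the cost of ATP.

```python
# Stoichiometries per reaction (negative = consumed, positive = produced).
# Water and protons are omitted; the NADH kinase and NADPase equations are
# assumed textbook forms, labelled here as assumptions.
REACTIONS = {
    "XR_NADPH": {"xylose": -1, "NADPH": -1, "xylitol": 1, "NADP+": 1},
    "XR_NADH": {"xylose": -1, "NADH": -1, "xylitol": 1, "NAD+": 1},
    "XDH": {"xylitol": -1, "NAD+": -1, "xylulose": 1, "NADH": 1},
    "NADH_kinase": {"NADH": -1, "ATP": -1, "NADPH": 1, "ADP": 1},
    "NADPase": {"NADP+": -1, "NAD+": 1, "Pi": 1},
}


def net_balance(fluxes):
    """Sum stoichiometric coefficients weighted by flux for every metabolite."""
    totals = {}
    for name, flux in fluxes.items():
        for metabolite, coeff in REACTIONS[name].items():
            totals[metabolite] = totals.get(metabolite, 0.0) + coeff * flux
    return totals


def xylose_to_xylulose(nadph_fraction, with_cycle):
    """Net balance for one xylose when XR draws `nadph_fraction` of its
    reducing equivalents from NADPH (a hypothetical parameter)."""
    f = nadph_fraction
    cycle = f if with_cycle else 0.0
    return net_balance({
        "XR_NADPH": f,
        "XR_NADH": 1.0 - f,
        "XDH": 1.0,
        "NADH_kinase": cycle,
        "NADPase": cycle,
    })


unbalanced = xylose_to_xylulose(0.5, with_cycle=False)
balanced = xylose_to_xylulose(0.5, with_cycle=True)
```

With the cycle switched off, `unbalanced` shows NADH accumulating and NADPH being drained; with it switched on, all four nicotinamide pools in `balanced` net to zero and ATP is the only cofactor consumed.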

8.1.4 Pan-genome-scale network reconstruction

A new pan-GENRE framework was developed that allows high-quality GENREs to be scaled throughout the tree of life, but this hinges on the willingness of the metabolic modelling community to adopt ortholog and reaction standards. Autocatalysis offers a way to help the community adopt standards. A curated and functionally annotated pan-genome for animals, plants, archaea, and bacteria would provide the community with templates from which a GENRE for a new species could be reconstructed more easily than from scratch or from an automated GENRE. As outlined above, PANTHER is the higher-level ortholog database that can bridge these genomes.

Consequences of including promiscuous enzyme activities in metabolic networks

Promiscuous enzyme activities were a focus of the yeast and fungal GENREs in this thesis, but including them presents two problems: not all functions are conserved in orthologous proteins, and these reactions can increase the computational burden of metabolic simulations.

Promiscuous enzyme activity, and even canonical enzyme activity, is not always conserved. To solve this problem, experimentally verified function can be mapped to a tree and propagated to other nodes. Homoplasy would be used to determine which nodes inherit conflicting function. For example, it is clear that Nde1p in proto-yeast had NAD(P)H dehydrogenase activity, which is also present in Klu. lactis. This enzyme only carries out NADH dehydrogenase activity in Sac. cerevisiae. An Nde1p ortholog that diverged after proto-yeast and before Klu. lactis would inherit the NAD(P)H dehydrogenase activity.
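A minimal sketch of this propagation, using the Nde1p example, is given below. It assumes the simplest possible rule: each unannotated node inherits the function of its nearest annotated ancestor. It does not attempt to resolve conflicts through homoplasy, and the tree topology, the node names `ancestor_1` and `hypothetical_ortholog`, and the function labels are illustrative placeholders.

```python
def propagate(parent, verified):
    """Assign each node the function of its nearest annotated ancestor.

    parent   -- dict mapping each node to its parent (root maps to None)
    verified -- dict mapping annotated nodes to a function label
    Nodes with no annotated ancestor receive None.
    """
    inferred = {}
    for node in parent:
        current = node
        # Walk towards the root until an annotated node is reached.
        while current is not None and current not in verified:
            current = parent[current]
        inferred[node] = verified.get(current)
    return inferred


# Hypothetical Nde1p tree: an unannotated ortholog branches off after
# proto-yeast and before the Klu. lactis / Sac. cerevisiae split.
parent = {
    "proto_yeast": None,
    "hypothetical_ortholog": "proto_yeast",
    "ancestor_1": "proto_yeast",
    "Klu_lactis": "ancestor_1",
    "Sac_cerevisiae": "ancestor_1",
}
verified = {
    "proto_yeast": "NAD(P)H dehydrogenase",
    "Klu_lactis": "NAD(P)H dehydrogenase",
    "Sac_cerevisiae": "NADH dehydrogenase",
}

inferred = propagate(parent, verified)
```

Annotated nodes keep their own label, so the Sac. cerevisiae enzyme remains NADH-specific while the hypothetical ortholog inherits the ancestral NAD(P)H activity. A fuller treatment would replace the nearest-ancestor rule with an ancestral state reconstruction.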

Although promiscuous activities are important to biologists and metabolic engineers, non-canonical reactions can increase the computational burden in genome-scale metabolic modelling. Reactions should be labelled as core canonical, canonical, promiscuous, etc., so that different metabolic network reconstructions can be compiled from the same reaction set, including a core metabolic network.
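One way such labels could be used is sketched below: each reaction record carries a tier, and a network is compiled by selecting the tiers of interest. The reaction identifiers and the tier names are hypothetical placeholders, not entries from the reconstructions in this thesis.

```python
# Hypothetical reaction records; IDs and tier labels are illustrative only.
REACTIONS = [
    {"id": "rxn_core", "tier": "core canonical"},
    {"id": "rxn_canonical", "tier": "canonical"},
    {"id": "rxn_promiscuous", "tier": "promiscuous"},
]


def compile_network(reactions, tiers):
    """Return the IDs of reactions whose tier is in the requested set."""
    wanted = set(tiers)
    return [r["id"] for r in reactions if r["tier"] in wanted]


core_network = compile_network(REACTIONS, ["core canonical"])
full_network = compile_network(
    REACTIONS, ["core canonical", "canonical", "promiscuous"]
)
```

Keeping the tier on the reaction record, rather than maintaining separate models, means the core and expanded networks stay synchronized when a reaction is curated.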

8.2 Outlook

8.2.1 The future of genome annotation and curation

Orthology is the cornerstone of comparative genomics, but it is surprisingly difficult to identify orthologous proteins. This results from orthology being treated as an afterthought, or approximated through sequence similarity, and limits our ability to bridge the genotype-phenotype gap across the tree of life. Genome annotations should be based on an ortholog-first mindset. Phylogenetic placement algorithms are critical to achieving this because the ortholog conjecture is inherently based on phylogeny, even though they are more computationally intensive than today’s popular methods.

Pan-genomes should be curated on at least two levels based on phylogenetic reconstruction: a pan-genome that spans all kingdoms in the tree of life, and a smaller pan-genome that maps to the larger pan-genome but is targeted to specific research communities. In this approach, PANTHER can serve as the lingua franca for comparative genomics, since its pan-genome spans almost four billion years of evolution for bacteria, archaea, protists, plants, animals, and fungi, and it has been curated since 1998. Organism-specific databases can represent dialects to the lingua franca for certain branches in the tree of life, such as Dikarya fungi and yeasts in AYbRAH, or Saccharomycetaceae yeasts in YGOB. Identifying orthologous proteins in this two-tier approach can allow genome annotations to be improved and synchronized across the tree of life.

8.2.2 Lessons for the Metabolic Engineering field

A detailed look at the yeast pan-genome offers lessons that can improve metabolic engineering designs and adaptive laboratory evolution.

The repeated emergence of heteromer enzyme complexes from homomer enzyme complexes, such as NAD-dependent isocitrate dehydrogenase, Pfk, and Acl, illustrates that the sequence-structure-function space for enzymes can be expanded by evolving new tertiary structures. Therefore, directed evolution of two or more copies of the same gene, or orthologous genes, can expand the trajectories for improved enzyme kinetics.

Ribosome protein subunit paralogs are hypothesized to selectively translate mRNA. Uncovering these mRNA motifs can enable pseudo-orthogonal ribosomes that selectively translate desired pathways.

Horizontal gene transfer highlights the benefit of expanding an organism’s capabilities. ALE with gene-rich DNA from organisms with exotic genes can expand the potential of evolution. This was previously attempted for xylose fermentation, but no new enzymes were taken up by Sac. cerevisiae.

8.2.3 Comparative genomics, genetics and physiology across the tree of life

Research communities are heavily swayed by findings in model organisms. Yeast and Sac. cerevisiae are often used interchangeably, even though yeasts are paraphyletic with over 1200 budding yeast species. These model organisms are important to optimize research resources, but the research community can benefit from studying the genomes and physiologies of more organisms between the lodestars of molecular biology.

To continue our analogy of cars and organisms, it is not possible to understand how a Ford F-150 pickup truck works by fully understanding the Quadricycle and a modern-day Ford Mustang Convertible. An alternative approach might be to study the evolution of various vehicles every 5-10 years. Likewise, it is impossible to understand xylose fermentation in Sch. stipitis by knowing everything about the physiology of Sac. cerevisiae and Schizo. pombe, the most heavily studied fungi spanning 500 million years of evolution. Studying yeasts and fungi that are separated by 50-100 million years, which can include gene knockout and gene swap libraries, can improve our understanding of how metabolic, transcriptional, and other cellular networks are evolving. In this approach, proteins enabling xylose fermentation in Sch. stipitis could have been identified without metabolic modelling, using gene knockouts in Sch. stipitis and gene swaps with Sac. cerevisiae.

Appendices

Appendix A

Commission and omission errors in published yeast and fungal genome-scale network reconstructions

Genome-scale metabolic reconstruction | Gene | Protein IDs | Error type | Comment | Reaction ID | Reaction | Reference
Schizosaccharomyces pombe | URA3 | PYRD_SCHPO | Commission | Enzyme in spo does not use fumarate; sce has a xenolog that uses fumarate | RXN00585 | DHOR-S + FUM -> orot + SUCC | PMC3390277
 | ACL | ACL1_SCHPO;ACL2_SCHPO | Omission | Enzyme is absent in sce | N/A | atp[c] + cit[c] + coa[c] -> accoa[c] + adp[c] + oaa[c] + pi[c] |
 | PHK | PHK_SCHPO | Omission | Enzyme is absent in sce | N/A | f6p[c] + pi[c] <=> actp[c] + e4p[c] + h2o[c]; xu5p__D[c] + pi[c] <=> actp[c] + g3p[c] + h2o[c] |
Kluyveromyces lactis | GDP1 | F2Z641_KLULA | Omission | Enzyme is absent in sce | N/A | g3p[c] + nadp[c] + h2o[c] <=> 3pg[c] + 2 h[c] + nadph[c] | PMID:23025710
Yarrowia lipolytica | NUO | Q6CGB4_YARLI;N7BM_YARLI;Q6CEK9_YARLI;F2Z660_YARLI;Q6CD73_YARLI;F2Z6C0_YARLI;Q6CA88_YARLI;F2Z6F | Commission | Reaction modelled as sce NDI1, not as Complex I; NDI is absent in yli | R0430 | h[m] + nadh[m] + q6[m] <=> nad[m] + q6h2[m] | PMID:23236514
 | ALD5 | N/A | Commission | Enzyme is absent in yli, but present in sce | R0362 | acald[m] + h2o[m] + nadp[m] <=> ac[m] + 2 h[m] + nadph[m] |
 | ALD5 | N/A | Commission | Enzyme is absent in yli, but present in sce | R0361 | acald[m] + h2o[m] + nad[m] <=> ac[m] + 2 h[m] + nadh[m] |
 | ALD2 | Q6C9V7_YARLI;Q6C2W9_YARLI | Omission | Enzyme is present in yli and sce | N/A | acald[c] + h2o[c] + nad[c] <=> ac[c] + 2 h[c] + nadh[c] |
 | ACL | Q6C3H5_YARLI;Q6C7Y1_YARLI | Omission | Enzyme is absent in sce | N/A | atp[c] + cit[c] + coa[c] -> accoa[c] + adp[c] + oaa[c] + pi[c] |
 | AOX | Q6C9M5_YARLI;AOX_YARLI | Omission | Enzyme is absent in sce | N/A | 2 q6h2[m] + o2[m] -> 2 q6[m] + 2 h2o[m] |
Yarrowia lipolytica | NDE | NDH2_YARLI | Commission | Enzyme is absent in yli, but present in sce | r_0745 | NADH [mitochondrion] + ubiquinone-6 [mitochondrion] <=> NAD [mitochondrion] + ubiquinol-6 [mitochondrion] | PMID:26503450
 | ADH | Q6CAL4_YARLI;Q6CCV6_YARLI;Q6C648_YARLI;Q6CGX5_YARLI;Q6C0G2_YARLI;Q6C2G3_YARLI;Q6CFC2_YARLI;Q6C597_YARLI;Q6BZZ4_YARLI | Commission | Enzyme should be annotated as ADH4 (Q6C5R5_YARLI). Evidence not provided for genes included in | r_0183 | acetaldehyde [mitochondrion] + NADH [mitochondrion] <=> ethanol [mitochondrion] + NAD [mitochondrion] |
 | ALD6 | Q6C2W9_YARLI | Commission | Enzyme is absent in yli, but present in sce | r_0191 | acetaldehyde [cytoplasm] + NADP(+) [cytoplasm] <=> acetate [cytoplasm] + NADPH [cytoplasm] |
 | ALD5 | Q6CD79_YARLI or Q6C7J6_YARLI | Commission | Q6C7J6_YARLI is correct annotation but not Q6CD79_YARLI. | r_0192 | acetaldehyde [mitochondrion] + NAD [mitochondrion] <=> acetate [mitochondrion] + NADH [mitochondrion] |
 | ALD5 | Q6CD79_YARLI or Q6C7J6_YARLI | Commission | There is no evidence either gene has activity with NADP. | r_0193 | acetaldehyde [mitochondrion] + NADP(+) [mitochondrion] <=> acetate [mitochondrion] + NADPH [mitochondrion] |
 | AOX | | Omission | | | 2 q6h2[m] + o2[m] -> 2 q6[m] + 2 h2o[m] |
Komagataella pastoris | NUO | | Commission | | R02163 | NADH[m] + 1 H+[m] + Ubiquinone[m] => NAD+[m] + Ubiquinol[m] + 0 H+[c] | PMID:22472172
 | ADH3 | C4R0S8_PICPG | Commission | ppa does not have ADH3 or ADH4 orthologs. | R00754 | Ethanol[m] + NAD+[m] <=> Acetaldehyde[m] + NADH[m] + H+[m] |
 | LYS21 | C4QYE7_PICPG;C4QWR7_PICPG | Commission | Ancestral [m] enzyme appears to have been lost in ppa. Reconstruction should include | R00271 | Acetyl-CoA[m] + H2O[m] + 2-Oxoglutarate[m] => Homocitrate[m] + CoA[m] |
 | ALD6 | | Omission | ppa does not have an ortholog to sce ALD6 gene but does have a paralog with strong evidence for a cytosolic NADP ALD. | N/A | acald[c] + nadp[c] + h2o[c] -> nadph[c] + ac[c] + 2 h[c] |
Scheffersomyces stipitis (iBB) | ALD6 | A3M013_PICST | Commission | There is strong phylogenetic and experimental evidence that pic does not have a cytosolic NADP ALD (PMC183316). | ALDDH1 | acald[c] + nadp[c] + h2o[c] -> nadph[c] + ac[c] + 2 h[c] | PMID:22356827
 | NDI | A3GHE2_PICST | Commission | No genomic evidence enzyme is present in pic; reaction is present in sce. Listed gene encodes external NADH dehydrogenase. | NADH2-u6m | h[m] + nadh[m] + q6[m] -> nad[m] + q6h2[m] |
 | UGA1.2 | | Omission | [m] paralog is present in pic but is absent in sce | N/A | succsal[m] + nadp[m] + h2o[m] -> succ[m] + nadph[m] + 2 h[m] |
Scheffersomyces stipitis (iSS) | | | Omission | NDE is the only type II NADH dehydrogenase in pic | N/A | h[c] + nadh[c] + q6[c] -> nad[c] + q6h2[c] | PMID:22472172
 | | | Omission | pic has genes that encode a [c] and [m] enzyme; sce only has the [c] ortholog | | succsal[m] + nadp[m] + h2o[m] -> succ[m] + nadph[m] + 2 h[m] |
Scheffersomyces stipitis (iPL) | | | Commission | There is strong phylogenetic and experimental evidence that pic does not have a cytosolic NADP ALD (PMC183316). | RXN8LE__45__4802__91__c__93____45__ACETALD__47__NADP__47_ | acald[c] + nadp[c] + h2o[c] -> nadph[c] + ac[c] + 2 h[c] | Li, P. Y. (2012). In silico metabolic network reconstruction of
 | | | Commission | NDE is the only type II NADH dehydrogenase in pic | RXN8LE__45__5218 | h[m] + nadh[m] + q6[m] -> nad[m] + q6h2[m] |
Scheffersomyces stipitis (iAD) | | | Commission | There is strong phylogenetic and experimental evidence that pic does not have a cytosolic NADP ALD | | acald[c] + nadp[c] + h2o[c] -> nadph[c] + ac[c] + 2 h[c] | Damiani, A. (2015). Control Engineering
Aspergillus niger | NUO / NDE | A2QD14_ASPNC;A2QD54_ASPNC;A2QE81_ASPNC;A2QEI3_ASPNC;A2QHJ1_ASPNC;A2QHJ6_ASPNC;A2QJ32_ASPNC;A2QLI8_ASPNC;A2QN27_ASPNC;A2QQZ5_ASPNC;A2QR63_ASPNC;A2QSH0_ASPNC;A2QUU8_ASPNC;A2QWS1_ASPNC;A2QXG1_ASPNC;A2QXL0_ASPNC;A2QXL4_ASPNC;A2QZF9_ASPNC;A2R062_ASPNC;A2R2A5_ASPNC | Commission | Enzyme is formulated as a combination of Complex I and external NADH dehydrogenase. | r240 | NADH + Qm + 4 HPLUS_POm => NAD + QH2m + 4 HPLUS_PO | PMC2290933
 | ADH3 | A2R0U3_ASPNC;A2R9I3_ASPNC;A2QAN5_ASPNC;A2QBA8_ASPNC;A2QC27_ASPNC;A2QCA7_ASPNC;A2QFZ8_ASPNC;A2QI91_ASPNC;A5ABB8_ASPNC;A2QTS8_ASPNC;A2QW77_ASPNC;A2QW91_ASPNC;A2R0V6_ASPNC;A2R1E6_ASPNC;A2R229_ASPNC;A2R2W2_ASPNC;A2R4B1_ASPNC;A2R6H3_ASPNC;A2R6L2_ASPNC;A2R886_ASPNC | Commission | Although most of these genes have putative alcohol dehydrogenase there is no evidence provided this enzyme is present in ang. | r118 | ETHm + NADm <-> ACALm + NADHm |
 | UGA1 | A2R9C8_ASPNC | Commission | Mitochondrial paralog is only present in budding yeasts. | r76 | GABAm + AKGm -> SUCCSALm + GLUm |

Appendix B

Comparison of AYbRAH orthology assignments to published ortholog databases

Table B.1: Comparison of manually curated acetyl-Coenzyme A synthetase orthologs in AYbRAH to highly cited ortholog databases. N/A indicates genomes that do not have orthology relationships in the public database but have been assigned orthology with AYbRAH. Omission indicates genomes that have annotations in the public ortholog database but do not have an annotation for the given gene. PANTHER is the only database that can distinguish between the three ACS ortholog groups; ACS3 is assigned to a different PANTHER family despite the shared ancestry of all the orthologs. KEGG can only differentiate between ACS1 and ACS3 ortholog groups, while all other database orthology assignments are polyphyletic. EggNOG includes FOG07524 and FOG07525 in the ACS ortholog group, which both have predicted acetoacetate-CoA ligase activity.

[Table B.1 body: assignments from PANTHER, KEGG, OrthoMCL, OrthoDB, and EggNOG for the AYbRAH ortholog groups FOG00404 (ACS1), FOG00405 (ACS2), FOG00406 (ACS3), FOG07524 (HYP), and FOG07525 (HYP) in spo, ang, ncr, yli, ppa, pic, kla, and sce. Identifiers appearing in the table include PTHR24095:SF14, PTHR24095:SF245, PTHR43347:SF3, K01895, K01908, OG5_126680, EOG8HQC0H, and KOG1175; the cell layout is not recoverable from the extracted text.]

Table B.2: Comparison of manually curated Type II NADH dehydrogenase (NDH2) orthologs in AYbRAH to highly cited ortholog databases. N/A indicates genomes that do not have orthology relationships in the public database but have been assigned orthology with AYbRAH. Omission indicates genomes that have annotations in the public ortholog database but do not have an annotation for the given gene. PANTHER is able to distinguish between most orthologs in the NDH2 family, with the exception of NDE1 and NDE2. KEGG is the only other database that can differentiate between some NDH2 genes; genes are split between the older NDI0/NDE0 ortholog group and more recent NDE1/NDI1 ortholog group. PANTHER and EggNOG contain additional genes not included in other ortholog databases, which may represent ancient paralogs having lower sequence similarities than other NDH2 paralogs. [The remainder of the caption, which concerns the PANTHER assignment of NDE0, AIF1, a Neurospora crassa gene (FOG07264) [Carneiro et al., 2007], and an Aspergillus niger gene (FOG07265), is not recoverable from the extracted text.]

[Table B.2 body: assignments from PANTHER, KEGG, OrthoMCL, OrthoDB, and EggNOG for the AYbRAH annotations NDI0, NDE0, AIF1, NDE1, NDE2, NDI1, and HYP in spo, ang, ncr, yli, ppa, pic, kla, and sce. Identifiers appearing in the table include PTHR42913:SF2, PTHR43706:SF1, PTHR43706:SF10, PTHR43706:SF11, K17871, K03885, OG5_126960, EOG8P8D18, KOG2495, and the AYbRAH groups FOG00837, FOG00838, FOG00839, FOG00845, FOG00846, FOG11982, and FOG07265; the cell layout is not recoverable from the extracted text.]

Appendix C

Sample webpages of AYbRAH portal for Acs

Analyzing Yeasts by Reconstructing Ancestry of Homologs

Home

HOG00227 HAD-like superfamily. DOG/GPP family HOG00229 HOG00230 acetyl-CoA hydrolase/ family

FASTA MAFFT sequence alignment Phyml trees Gblocks Phobius predictions

FOG00404 EOG8HQC0H sce:ACS1

Genes: 34

Protein description: ACS1 Acetyl-coA synthetase isoform expressed with non-fermentable carbon sources. Spo gene expressed with fermentable carbon sources.

SGD Description: Acetyl-coA synthetase isoform; along with Acs2p, acetyl-coA synthetase isoform is the nuclear source of acetyl-coA for histone acetylation; expressed during growth on nonfermentable carbon sources and under aerobic conditions

PomBase Description: acetyl-CoA ligase (predicted)

AspGD Description: Putative acetyl-CoA synthase

References

Armitt S, et al. (1976 Feb). Analysis of acetate non-utilizing (acu) mutants in Aspergillus nidulans.

Payton M, et al. (1976 May). Agar as a carbon source and its effect on the utilization of other carbon sources by acetate non-utilizing (acu) mutants of Aspergillus nidulans.

Frenkel EP, et al. (1977 Jan 25). Purification and properties of acetyl coenzyme A synthetase from bakers' yeast.

Hynes MJ, et al. (1977 Sep). Induction of the acetamidase of Aspergillus nidulans by acetate metabolism.

Midelfort CF, et al. (1978 Oct 25). The stereochemical course of acetate activation by yeast acetyl-CoA synthetase.

Bal J, et al. (1979 Dec). Allele specific and locus non-specific suppressors in Aspergillus nidulans.

Kelly JM, et al. (1981 Apr). The regulation of phosphoenolpyruvate carboxykinase and the NADP-linked malic enzyme in Aspergillus nidulans.

Sandeman RA, et al. (1989 Jul). Isolation of the facA (acetyl-coenzyme A synthetase) and acuE (malate synthase) genes of Aspergillus nidulans.

Connerton IF, et al. (1990 Mar). Comparison and cross-species expression of the acetyl-CoA synthetase genes of the Ascomycete fungi, Aspergillus nidulans and Neurospora crassa.

Lloyd AT, et al. (1991 Nov). Codon usage in Aspergillus nidulans.

Sandeman RA, et al. (1991 Sep). Molecular organisation of the malate synthase genes of Aspergillus nidulans and Neurospora crassa.

Birch PR, et al. (1992). Nucleotide sequence of a gene from Phanerochaete chrysosporium that shows homology to the facA gene of Aspergillus nidulans.

Maconochie MK, et al. (1992 Aug). The acu-1 gene of Coprinus cinereus is a regulatory gene required for induction of acetate utilisation enzymes.

De Virgilio C, et al. (1992 Dec). Cloning and disruption of a gene required for growth on acetate but not on ethanol: the acetyl-coenzyme A synthetase gene of Saccharomyces cerevisiae.

Kujau M, et al. (1992 Mar). Characterization of mutants of the yeast Yarrowia lipolytica defective in acetyl-coenzyme A synthetase.

Saleeba JA, et al. (1992 Nov). Characterization of the amdA-regulated aciA gene of Aspergillus nidulans.

Steensma HY, et al. (1993 Apr). Genetic and physical localization of the acetyl-coenzyme A synthetase gene ACS1 on chromosome I of Saccharomyces cerevisiae.

Martínez-Blanco H, et al. (1993 Aug 25). Characterisation of the gene encoding acetyl-CoA synthetase in Penicillium chrysogenum: conservation of intron position in plectomycetes.

Gouka RJ, et al. (1993 Jan). Development of a new transformant selection system for Penicillium chrysogenum: isolation and characterization of the P. chrysogenum acetyl- coenzyme A synthetase gene (facA) and its use as a homologous selection marker.

Sealy-Lewis HM, et al. (1994 Jan). A new selection method for isolating mutants defective in acetate utilisation in Aspergillus nidulans.

Van den Berg MA, et al. (1995 Aug 1). ACS2, a Saccharomyces cerevisiae gene encoding acetyl-coenzyme A synthetase, essential for growth on glucose.

Kratzer S, et al. (1995 Aug 8). Carbon source-dependent regulation of the acetyl-coenzyme A synthetase-encoding gene ACS1 from Saccharomyces cerevisiae.

van den Berg MA, et al. (1996 Nov 15). The two acetyl-coenzyme A synthetases of Saccharomyces cerevisiae differ with respect to kinetic properties and transcriptional regulation.

de Jong-Gubbels P, et al. (1997 Aug 1). The Saccharomyces cerevisiae acetyl-coenzyme A synthetase encoded by the ACS1 gene, but not the ACS2-encoded enzyme, is subject to glucose catabolite inactivation.

Clutterbuck AJ, et al. (1997 Jun). The validity of the Aspergillus nidulans linkage map.

Todd RB, et al. (1998 Apr 1). FacB, the Aspergillus nidulans activator of acetate utilization genes, binds dissimilar DNA sequences.

Stemple CJ, et al. (1998 Dec). The facC gene of Aspergillus nidulans encodes an acetate- inducible carnitine acetyltransferase.

Papadopoulou S, et al. (1999 Sep 1). The Aspergillus niger acuA and acuB genes correspond to the facA and facB genes in Aspergillus nidulans.

Dessen P, et al. (2000 Feb 22). The PAUSE software for analysis of translational control over protein targeting: application to E. nidulans membrane proteins.

Brock M, et al. (2000 Mar). Methylcitrate synthase from Aspergillus nidulans: implications for propionate as an antifungal agent.

Jones IG, et al. (2001 Feb). ADHII in Aspergillus nidulans is induced by carbon starvation stress.

Lodi T, et al. (2001 Sep). Three target genes for the transcriptional activator Cat8p of Kluyveromyces lactis: acetyl coenzyme A synthetase genes KlACS1 and KlACS2 and lactate permease gene KlJEN1.

Hynes MJ, et al. (2002 Jan). Regulation of the acuF gene, encoding phosphoenolpyruvate carboxykinase in the filamentous fungus Aspergillus nidulans.

Kumar A, et al. (2002 Mar 15). Subcellular localization of the yeast proteome.

Flipphi M, et al. (2002 May 15). Characteristics of physiological inducers of the ethanol utilization (alc) pathway in Aspergillus nidulans.

Flipphi M, et al. (2003 Apr 4). Onset of carbon catabolite repression in Aspergillus nidulans. Parallel involvement of hexokinase and glucokinase in sugar signaling.

Zeeman AM, et al. (2003 Jan 15). The acetyl co-enzyme A synthetase genes of Kluyveromyces lactis.

Sickmann A, et al. (2003 Nov 11). The proteome of Saccharomyces cerevisiae mitochondria.

Flipphi M, et al. (2003 Sep). Relationships between the ethanol utilization (alc) pathway and unrelated catabolic pathways in Aspergillus nidulans.

Brock M, et al. (2004 Aug). On the mechanism of action of the antifungal agent propionate.

Sims AH, et al. (2004 Feb). Use of expressed sequence tag analysis and cDNA microarrays of the filamentous fungus Aspergillus nidulans.

Jogl G, et al. (2004 Feb 17). Crystal structure of yeast acetyl-coenzyme A synthetase in complex with AMP.

Takasaki K, et al. (2004 Mar 26). Fungal ammonia fermentation, a novel metabolic mechanism that couples the dissimilatory and assimilatory pathways of both nitrate and ethanol. Role of acetyl CoA synthetase in anaerobic ATP synthesis.

Zhang YQ, et al. (2004 Oct). Connection of propionyl-CoA metabolism to polyketide biosynthesis in Aspergillus nidulans.

David H, et al. (2006). Metabolic network driven analysis of genome-wide transcription data from Aspergillus nidulans.

Mogensen J, et al. (2006 Aug). Transcription analysis using high-density micro-arrays of Aspergillus nidulans wild-type and creA mutant during growth on glucose or ethanol.

Takahashi H, et al. (2006 Jul 21). Nucleocytosolic acetyl-coenzyme a synthetase is required for histone acetylation and global transcription.

Hynes MJ, et al. (2006 May). Regulatory genes controlling fatty acid catabolism and peroxisomal functions in the filamentous fungus Aspergillus nidulans.

Salazar M, et al. (2009 Dec). Uncovering transcriptional regulation of glycerol metabolism in Aspergilli through genome-wide gene expression data analysis.

Shimizu M, et al. (2009 Jan). Proteomic analysis of Aspergillus nidulans cultured under hypoxic conditions.

Flipphi M, et al. (2009 Mar). Biodiversity and evolution of primary carbon metabolism in Aspergillus nidulans and other Aspergillus spp.

Hynes MJ, et al. (2010 Jul). ATP-citrate lyase is required for production of cytosolic acetyl coenzyme A and development in Aspergillus nidulans.

Oh YT, et al. (2010 Mar). Proteomic analysis of early phase of conidia germination in Aspergillus nidulans.

Wendland J, et al. (2011 Dec). Genome evolution in the eremothecium clade of the Saccharomyces complex revealed by comparative genomics.

Saykhedkar S, et al. (2012 Jul 26). A time course analysis of the extracellular proteome of Aspergillus nidulans growing on sorghum stover.

Georgakopoulos P, et al. (2012 Nov). SAGA complex components and acetate repression in Aspergillus nidulans.

Nakamura T, et al. (2012 Sep). Impaired coenzyme A synthesis in fission yeast causes defective mitosis, quiescence-exit failure, histone hypoacetylation and fragile DNA.

Carpy A, et al. (2014 Aug). Absolute proteome and phosphoproteome dynamics during the cell cycle of Schizosaccharomyces pombe (Fission Yeast).

Beckley JR, et al. (2015 Dec). A Degenerate Cohort of Yeast Membrane Trafficking DUBs Mediates Cell Polarity and Survival.

Malecki M, et al. (2016 Nov 25). Functional and regulatory profiling of energy metabolism in fission yeast.

Mitochondrial localization predictions Predotar TargetP MitoProt

Raw data

Phobius transmembrane predictions 24 genes with posterior transmembrane prediction > 50%

FOG00405 EOG8HQC0H Protein description ACS2 Acetyl-coA synthetase isoform expressed with fermentable carbon sources sce:ACS2

Genes: 26 Parent paralog:FOG00404

SGD Description Acetyl-coA synthetase isoform; along with Acs1p, acetyl-coA synthetase isoform is the nuclear source of acetyl-coA for histone acetylation; mutants affect global transcription; required for growth on glucose; expressed under anaerobic conditions

References

Frenkel EP, et al. (1977 Jan 25). Purification and properties of acetyl coenzyme A synthetase from bakers' yeast.

Midelfort CF, et al. (1978 Oct 25). The stereochemical course of acetate activation by yeast acetyl-CoA synthetase.

De Virgilio C, et al. (1992 Dec). Cloning and disruption of a gene required for growth on acetate but not on ethanol: the acetyl-coenzyme A synthetase gene of Saccharomyces cerevisiae.

Van den Berg MA, et al. (1995 Aug 1). ACS2, a Saccharomyces cerevisiae gene encoding acetyl-coenzyme A synthetase, essential for growth on glucose.

Kratzer S, et al. (1995 Aug 8). Carbon source-dependent regulation of the acetyl-coenzyme A synthetase-encoding gene ACS1 from Saccharomyces cerevisiae.

van den Berg MA, et al. (1996 Nov 15). The two acetyl-coenzyme A synthetases of Saccharomyces cerevisiae differ with respect to kinetic properties and transcriptional regulation.

de Jong-Gubbels P, et al. (1997 Aug 1). The Saccharomyces cerevisiae acetyl-coenzyme A synthetase encoded by the ACS1 gene, but not the ACS2-encoded enzyme, is subject to glucose catabolite inactivation.

Lodi T, et al. (2001 Sep). Three target genes for the transcriptional activator Cat8p of Kluyveromyces lactis: acetyl coenzyme A synthetase genes KlACS1 and KlACS2 and lactate permease gene KlJEN1.

Peng J, et al. (2003 Aug). A proteomics approach to understanding protein ubiquitination.

Zeeman AM, et al. (2003 Jan 15). The acetyl co-enzyme A synthetase genes of Kluyveromyces lactis.

Takahashi H, et al. (2006 Jul 21). Nucleocytosolic acetyl-coenzyme a synthetase is required for histone acetylation and global transcription.

Mitochondrial localization predictions Predotar TargetP MitoProt

Raw data

Phobius transmembrane predictions 16 genes with posterior transmembrane prediction > 50%

FOG00406 EOG8HQC0H Protein description ACS3 Uncharacterized acetyl-coA synthetase paralog sce:absent

Genes: 3 Parent paralog:FOG00404

AspGD Description Putative acetyl-CoA synthase

References

Zhang YQ, et al. (2004 Oct). Connection of propionyl-CoA metabolism to polyketide biosynthesis in Aspergillus nidulans.

Salazar M, et al. (2009 Dec). Uncovering transcriptional regulation of glycerol metabolism in Aspergilli through genome-wide gene expression data analysis.

Flipphi M, et al. (2009 Mar). Biodiversity and evolution of primary carbon metabolism in Aspergillus nidulans and other Aspergillus spp.

Lee MK, et al. (2014 May). NsdD is a key repressor of asexual development in Aspergillus nidulans.

Mitochondrial localization predictions Predotar TargetP MitoProt

Raw data

Phobius transmembrane predictions 3 genes with posterior transmembrane prediction > 50%

Appendix D

Phosphofructokinase phylogenetic reconstruction


Figure D.1: Phylogenetic reconstruction of 6-phosphofructokinase. The ancestral ortholog group (red) and its paralog (blue).

Appendix E

Placement of Ascoidea in the budding yeast species topology

The placement of Ascoidea rubescens as sister to Phaffomycetaceae-Saccharomycodaceae-Saccharomycetaceae by Shen et al. [2016], rather than sister to the larger clade comprising the CTG, Pichiaceae, and Phaffomycetaceae-Saccharomycodaceae-Saccharomycetaceae clades (CPPSS), is their only major disagreement with Mühlhausen and Kollmar [2014] and Riley et al. [2016]; Shen et al. [2016] admit that their placement is not conclusive and is open to further study. ACS2 and PFK1 were both used to constrain the topology of Bla. adeninivorans and Nad. fulvescens; however, no metabolic genes had an obvious phylogenetic signal to restrict Asc. rubescens’ topology. To find support for Asc. rubescens’ placement in the species tree, AYbRAH was used to count the clade-specific ortholog groups, which arose from duplication or de novo, shared between Asc. rubescens and the following well-defined clades:

• CTG-Pichiaceae-Phaffomycetaceae-Saccharomycodaceae-Saccharomycetaceae (CPPSS) clade (monophyletic)

• CTG-Pichiaceae (CP) clade (monophyletic)

• Pichiaceae - Saccharomycetaceae (PS) clade (paraphyletic)

• CTG - Phaffomycetaceae-Saccharomycodaceae-Saccharomycetaceae (CPSS) clade (paraphyletic)

• Pichiaceae clade

• CTG clade

• Phaffomycetaceae-Saccharomycodaceae-Saccharomycetaceae (PSS) clade (monophyletic)

Asc. rubescens shared most clade-specific orthologs with the three major clades (CPPSS) for both duplicated genes and de novo genes, suggesting that it is sister to the three major clades and not to Phaffomycetaceae-Saccharomycetaceae-Saccharomycodaceae. If we consider tRNA loss driven codon reassignment and Asc. rubescens’ near loss of all CTG codons [Mühlhausen and Kollmar, 2014], it is more fitting for Asc. rubescens to be sister to CPPSS than to Phaffomycetaceae. Additional phylogenomic studies with Arthroascus fermentans, Sporopachydermia lactativora, or Sporopachydermia quercuum, three recently sequenced yeasts, may help resolve Asc. rubescens’ phylogenomic placement with higher confidence [Mühlhausen and Kollmar, 2014].
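The counting procedure described in this appendix can be sketched as follows. This is a minimal illustration and not the AYbRAH implementation: the function name, the three-letter species codes, and the toy ortholog groups are hypothetical, and the rule that each group is credited to the narrowest clade containing all of its other members is an assumption made for the sketch.

```python
from collections import Counter


def count_clade_specific(groups, clades, query):
    """Count ortholog groups that `query` shares only with members of each clade.

    groups: dict mapping ortholog group id -> set of species codes
    clades: dict mapping clade name -> set of species codes
    """
    # Try the smallest clades first so that a group is credited to the
    # narrowest clade that contains all of its non-query members (assumption).
    ordered = sorted(clades.items(), key=lambda item: len(item[1]))
    counts = Counter()
    for members in groups.values():
        if query not in members:
            continue
        others = set(members) - {query}
        if not others:
            counts["no other orthologs"] += 1
            continue
        for name, clade in ordered:
            if others <= clade:
                counts[name] += 1
                break
    return counts


# Hypothetical clade definitions with two species each.
clades = {
    "CTG": {"sst", "dha"},
    "Pichiaceae": {"kph", "pku"},
    "PSS": {"sce", "kla"},
}
clades["CP"] = clades["CTG"] | clades["Pichiaceae"]
clades["CPPSS"] = clades["CP"] | clades["PSS"]

# Hypothetical ortholog groups; "aru" stands in for the query species.
groups = {
    "g1": {"aru", "sst", "kph", "sce"},  # spans all three major clades
    "g2": {"aru", "sst", "dha"},         # CTG only
    "g3": {"aru"},                       # species-specific
    "g4": {"aru", "kph", "sst"},         # CTG and Pichiaceae
    "g5": {"sst", "sce"},                # query absent, ignored
    "g6": {"aru", "xyz"},                # partner outside every clade, ignored
}

counts = count_clade_specific(groups, clades, "aru")
```

Crediting each group to a single, narrowest clade keeps the columns mutually exclusive; a different rule would be needed if groups were meant to be counted in every clade that contains them.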


Table E.1: Count of orthologs shared by each yeast species with each defined clade.

| Species | CPPSS | CP | PPSS | CTG and PSS clades | Pichiaceae | CTG clade | PSS | no other orthologs |
|---|---|---|---|---|---|---|---|---|
| Nadsonia fulvescens | 25 | 0 | 2 | 7 | 3 | 3 | 2 | 24 |
| Ascoidea rubescens | 161 | 19 | 31 | 14 | 16 | 9 | 20 | 83 |
| Pachysolen tannophilus | 85 | 36 | 32 | 8 | 95 | 8 | 21 | 22 |
| Komagataella phaffii | 117 | 125 | 77 | 0 | 290 | 0 | 0 | 0 |
| Kuraishia capsulata | 87 | 49 | 47 | 0 | 214 | 0 | 0 | 0 |
| Ogataea arabinofermentans | 73 | 38 | 42 | 0 | 532 | 0 | 0 | 0 |
| Ogataea parapolymorpha | 78 | 33 | 40 | 0 | 484 | 0 | 0 | 0 |
| Dekkera bruxellensis | 62 | 33 | 40 | 0 | 388 | 0 | 0 | 0 |
| Pichia membranifaciens | 57 | 24 | 34 | 0 | 643 | 0 | 0 | 0 |
| Pichia kudriavzevii | 47 | 17 | 28 | 0 | 871 | 0 | 0 | 0 |
| Babjeviella inositovora | 83 | 33 | 14 | 13 | 19 | 35 | 23 | 39 |
| Metschnikowia bicuspidata | 101 | 77 | 0 | 33 | 0 | 293 | 0 | 0 |
| Meyerozyma guilliermondii | 116 | 124 | 0 | 57 | 0 | 611 | 0 | 0 |
| Debaryomyces hansenii | 128 | 122 | 0 | 62 | 0 | 631 | 0 | 0 |
| Scheffersomyces stipitis | 116 | 113 | 0 | 55 | 0 | 584 | 0 | 0 |
| Spathaspora passalidarum | 119 | 110 | 0 | 53 | 0 | 634 | 0 | 0 |
| Cyberlindnera jadinii | 94 | 0 | 44 | 24 | 0 | 0 | 402 | 0 |
| Wickerhamomyces anomalus | 97 | 0 | 51 | 34 | 0 | 0 | 413 | 0 |
| Hanseniaspora valbyensis | 16 | 0 | 11 | 6 | 0 | 0 | 175 | 0 |
| Kluyveromyces lactis | 86 | 0 | 64 | 44 | 0 | 0 | 638 | 0 |
| Lachancea thermotolerans | 80 | 0 | 63 | 33 | 0 | 0 | 615 | 0 |
| Zygosaccharomyces rouxii | 75 | 0 | 59 | 33 | 0 | 0 | 622 | 0 |
| Saccharomyces cerevisiae | 79 | 0 | 64 | 42 | 0 | 0 | 1558 | 0 |
| Vanderwaltozyma polyspora | 75 | 0 | 57 | 39 | 0 | 0 | 746 | 0 |

Table E.2: Count of de novo genes shared by each yeast species with each defined clade.

| Species | CPPSS | CP | PPSS | CTG and PSS clades | Pichiaceae | CTG clade | PSS | no other orthologs |
|---|---|---|---|---|---|---|---|---|
| Nadsonia fulvescens | 10 | 0 | 1 | 0 | 0 | 2 | 1 | 0 |
| Ascoidea rubescens | 73 | 8 | 9 | 9 | 2 | 2 | 1 | 0 |
| Pachysolen tannophilus | 54 | 20 | 12 | 3 | 27 | 4 | 2 | 0 |
| Komagataella phaffii | 82 | 102 | 61 | 0 | 225 | 0 | 0 | 0 |
| Kuraishia capsulata | 58 | 27 | 19 | 0 | 32 | 0 | 0 | 0 |
| Ogataea arabinofermentans | 44 | 21 | 17 | 0 | 28 | 0 | 0 | 0 |
| Ogataea parapolymorpha | 47 | 20 | 17 | 0 | 26 | 0 | 0 | 0 |
| Dekkera bruxellensis | 38 | 17 | 16 | 0 | 20 | 0 | 0 | 0 |
| Pichia membranifaciens | 34 | 12 | 11 | 0 | 20 | 0 | 0 | 0 |
| Pichia kudriavzevii | 30 | 8 | 9 | 0 | 15 | 0 | 0 | 0 |
| Babjeviella inositovora | 58 | 20 | 4 | 5 | 3 | 13 | 4 | 0 |
| Metschnikowia bicuspidata | 73 | 58 | 0 | 14 | 0 | 176 | 0 | 0 |
| Meyerozyma guilliermondii | 79 | 94 | 0 | 26 | 0 | 416 | 0 | 0 |
| Debaryomyces hansenii | 88 | 96 | 0 | 36 | 0 | 459 | 0 | 0 |
| Scheffersomyces stipitis | 80 | 85 | 0 | 26 | 0 | 404 | 0 | 0 |
| Spathaspora passalidarum | 82 | 84 | 0 | 26 | 0 | 456 | 0 | 0 |
| Cyberlindnera jadinii | 59 | 0 | 13 | 8 | 0 | 0 | 8 | 0 |
| Wickerhamomyces anomalus | 61 | 0 | 16 | 11 | 0 | 0 | 7 | 0 |
| Hanseniaspora valbyensis | 9 | 0 | 7 | 2 | 0 | 0 | 14 | 0 |
| Kluyveromyces lactis | 64 | 0 | 54 | 30 | 0 | 0 | 441 | 0 |
| Lachancea thermotolerans | 61 | 0 | 53 | 25 | 0 | 0 | 436 | 0 |
| Zygosaccharomyces rouxii | 58 | 0 | 48 | 21 | 0 | 0 | 444 | 0 |
| Saccharomyces cerevisiae | 61 | 0 | 53 | 27 | 0 | 0 | 1176 | 0 |
| Vanderwaltozyma polyspora | 60 | 0 | 45 | 24 | 0 | 0 | 456 | 0 |

Appendix F

Distribution of BLASTP sequence similarity scores in yeast clades

Figure F.1: Distributions of BLASTP percent identities for proteins identified as orthologous to Saccharomyces cerevisiae in AYbRAH.


Figure F.2: Distributions of the logarithm of BLASTP bitscores for proteins orthologous to Saccharomyces cerevisiae in AYbRAH.

Figure F.3: Distributions of the negative logarithm of BLASTP expect-values for proteins orthologous to Saccharomyces cerevisiae in AYbRAH.

Appendix G

Composite biomass equations

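\[
\sum_{aa=\text{ala\_L}}^{\text{val\_L}} v_{aa}\,\text{metid}_{aa} \;-\; Q_{\text{netprotein}}\,\text{h[c]} \;\rightarrow\; \Bigl(\sum_{aa=\text{ala\_L}}^{\text{val\_L}} v_{aa} - 1\Bigr)\,\text{h2o[c]} + \text{protein\_1g[c]} \tag{G.1}
\]

\[
MW_{\text{protein\_1g}} = 1000\ \text{g}\cdot\text{mmol}^{-1} \tag{G.2}
\]

\[
\sum_{aa=\text{ala\_L}}^{\text{val\_L}} v_{aa}\,Q_{aa} = Q_{\text{netprotein}} \tag{G.3}
\]

\[
\sum_{n=\text{ATP}}^{\text{UTP}} v_{n}\,\text{metid}_{n} + \Bigl(\sum_{n=\text{ATP}}^{\text{UTP}} v_{n}\Bigr)\,\text{h[c]} + \text{h2o[c]} \;\rightarrow\; \Bigl(\sum_{n=\text{ATP}}^{\text{UTP}} v_{n}\Bigr)\,\text{ppi[c]} + \text{rna\_1g[c]} \tag{G.4}
\]

\[
MW_{\text{rna\_1g}} = 1000\ \text{g}\cdot\text{mmol}^{-1} \tag{G.5}
\]

\[
\sum_{n=\text{dATP}}^{\text{dTTP}} v_{n}\,\text{metid}_{n} + \Bigl(\sum_{n=\text{dATP}}^{\text{dTTP}} v_{n}\Bigr)\,\text{h[c]} + \text{h2o[c]} \;\rightarrow\; \Bigl(\sum_{n=\text{dATP}}^{\text{dTTP}} v_{n}\Bigr)\,\text{ppi[c]} + \text{dna\_1g[c]} \tag{G.6}
\]

\[
MW_{\text{dna\_1g}} = 1000\ \text{g}\cdot\text{mmol}^{-1} \tag{G.7}
\]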
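As a small worked illustration of how the protein equation could be evaluated, the sketch below computes the net charge term of Eq. G.3 and the water coefficient of Eq. G.1 from a set of amino acid coefficients. It assumes one reading of Eq. G.1, in which the water produced equals the sum of the amino acid coefficients minus one. The function name, the coefficients, and the charges are hypothetical values chosen for the example, not values from the reconstructions.

```python
def protein_equation_terms(coeffs, charges):
    """Return the net charge and water coefficient for a composite protein reaction.

    coeffs:  dict mapping amino acid id -> stoichiometric coefficient v_aa
    charges: dict mapping amino acid id -> charge Q_aa of that metabolite
    """
    # Eq. G.3: Q_netprotein is the coefficient-weighted sum of the charges.
    q_net = sum(coeffs[aa] * charges[aa] for aa in coeffs)
    # Eq. G.1 (assumed reading): water released is sum(v_aa) - 1.
    h2o = sum(coeffs.values()) - 1
    return {"Q_netprotein": q_net, "h2o": h2o}


# Hypothetical three-residue composition, for illustration only.
coeffs = {"ala_L": 4.0, "asp_L": 2.5, "lys_L": 2.0}
charges = {"ala_L": 0, "asp_L": -1, "lys_L": 1}

terms = protein_equation_terms(coeffs, charges)
```

With these toy inputs the acidic residue outweighs the basic one, so the net charge term is negative and protons would appear with a positive coefficient on the left-hand side of Eq. G.1.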

Appendix H

Reaction and ortholog counts by subsystem in the Dikarya pan-genome-scale network reconstructions


Figure H.1: Strain-level GENRE model statistics and reactions summarized by subsystem.

Figure H.2: Strain-level GENRE model statistics and genes summarized by subsystem.

Appendix I

Cloning, enzyme expression, optimization, and characterization of Pho3p and Pho3.2p

Primer sequences used to clone PHO3 and PHO3.2 into pPICZα B for expression in Komagataella phaffii are listed in Table I.1. We used K. phaffii as a host because we were unable to collect any soluble protein by expressing Pho3p and Pho3.2p in Escherichia coli; presumably glycosylation was required or the signal peptide impacted expression. The N-terminal sequences of PHO3 and PHO3.2 were truncated to remove their native signal peptides, and the proteins were tagged at their N-termini with the α-factor secretion signal peptide from S. cerevisiae.

His-tag antibodies (results not shown) and the agar acid phosphatase assay [Dorn, 1965] were used to find the K. phaffii clones with the highest expression of Pho3p and Pho3.2p (Figure I.1). No phosphatase activity was initially detected in the supernatant of mutants expressing Pho3p and Pho3.2p after 24 hours of growth on methanol (Figure I.2). Sonicating the cells increased the phosphatase activity in the supernatant for both Pho3p and Pho3.2p (Figure I.2). Pho3p and Pho3.2p were predicted to be 36 kDa without glycosylation, but ran at 55 kDa with glycosylation (Figure I.3). The malachite green assay shows higher promiscuity for Pho3.2p than for Pho3p (Figure I.4).

Table I.1: Primer sequences for PHO3 and PHO3.2 inserts into pPICZα B without their native signal peptides.

| Primer | Sequence |
|---|---|
| PstI-PHO3 (mature) FR | GTT GTT CTG CAG TTA AAA CAA TTC TCT TGT CTA ACG AC |
| PHO3 (mature)-NotI RC | GTT GTT GCG GCC GCG CTG GAA AAC AAA GGT TGC A |
| EcoRI-PHO3.2 (mature) FR | GTT GTT GAA TTC TGA AGA CCA TCC TCT TGA CCA A |
| PHO3.2 (mature)-NotI RC | GTT GTT GCG GCC GCA GAG AAC AAT GGT TCC AAC A |
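A quick consistency check on the primers in Table I.1 is to confirm that each one carries the recognition site of the restriction enzyme named in its label. The sketch below does this with plain string matching. The recognition sequences for PstI, NotI, and EcoRI are the standard ones; the helper names are hypothetical, and the check says nothing about reading frame or annealing.

```python
# Standard recognition sequences (5' to 3') for the enzymes named in Table I.1.
SITES = {"PstI": "CTGCAG", "NotI": "GCGGCCGC", "EcoRI": "GAATTC"}

# Primer labels and sequences copied from Table I.1.
PRIMERS = {
    "PstI-PHO3 (mature) FR": "GTT GTT CTG CAG TTA AAA CAA TTC TCT TGT CTA ACG AC",
    "PHO3 (mature)-NotI RC": "GTT GTT GCG GCC GCG CTG GAA AAC AAA GGT TGC A",
    "EcoRI-PHO3.2 (mature) FR": "GTT GTT GAA TTC TGA AGA CCA TCC TCT TGA CCA A",
    "PHO3.2 (mature)-NotI RC": "GTT GTT GCG GCC GCA GAG AAC AAT GGT TCC AAC A",
}


def enzyme_in_label(label):
    """Return the enzyme name that appears in a primer label."""
    return next(enzyme for enzyme in SITES if enzyme in label)


def site_position(primer, enzyme):
    """Return the 0-based index of the enzyme's site in the primer, or -1 if absent."""
    sequence = primer.replace(" ", "").upper()
    return sequence.find(SITES[enzyme])


positions = {
    label: site_position(sequence, enzyme_in_label(label))
    for label, sequence in PRIMERS.items()
}
```

A value of -1 in `positions` would flag a primer whose label and sequence disagree.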


Figure I.1: Acid phosphatase screen as described by Dorn [1965]. Three out of 45 colonies did not have any detectable acid phosphatase activity.

Figure I.2: Impact of purification method on phosphatase activity after 24 hours of growth on methanol. Sonication led to the highest activity of phosphatase in the supernatant.

Figure I.3: PAAG showing purified proteins from wild-type, PHO3-, and PHO3.2-expressing mutants in Komagataella phaffii. The Komagataella phaffii Adh2p contaminant is also shown in the gel.

Figure I.4: Malachite green assay results for Pho3p and Pho3.2p (no replicates). Scale is % change in absorbance after five minutes. Pho3.2p has broader activity than Pho3p.

Appendix J

UTR1 amino acid alignment

Flux balance analysis predicts that NADPase and NADH kinase are required to balance redox cofactors during xylose fermentation. We were unable to confirm S. stipitis Utr1p activity using Escherichia coli or K. phaffii as expression hosts. The protein alignment of the CTG clade and Saccharomycetaceae Utr1p sequences shows that the CTG clade has unique motifs at the N- and C-termini.
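One simple way to flag candidate clade-specific positions in such an alignment is to look for columns that are invariant within each clade but differ between the two clades. The sketch below illustrates the idea. It is not the method used to annotate the alignment in this appendix; the function name is hypothetical, and the short sequences are toy strings, not Utr1p fragments.

```python
def clade_specific_columns(clade_a, clade_b, gap="."):
    """Return 1-based alignment columns that are invariant within each clade
    but differ between the two clades. Columns containing gaps are skipped.

    clade_a, clade_b: lists of aligned sequences of equal length
    """
    length = len(clade_a[0])
    hits = []
    for i in range(length):
        column_a = {sequence[i] for sequence in clade_a}
        column_b = {sequence[i] for sequence in clade_b}
        if gap in column_a | column_b:
            continue
        if len(column_a) == 1 and len(column_b) == 1 and column_a != column_b:
            hits.append(i + 1)
    return hits


# Toy aligned sequences for two hypothetical clades.
clade_one = ["ALGSLG", "ALGSLG", "SLGSLG"]
clade_two = ["SLGTLG", "ALGTLG", "SLGTLG"]

columns = clade_specific_columns(clade_one, clade_two)
```

Requiring strict invariance is conservative; a tolerance for conservative substitutions or for one divergent sequence per clade would recover longer motifs at the cost of more false positives.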

UTR1 NAD(H) kinase protein alignment

Figure: Multiple sequence alignment of Utr1p, with ruler positions 1 to 530. Legend: conserved submotif in CTG clade; conserved submotif in Saccharomycetaceae clade. Sequences, in order: S_stipitis, D_hanseni, M_guilliermondii, C_albicans, M_farinosa, C_tenuis, C_parapsilosis, C_orthopsilosis, C_dubliniensis, C_tropicalis, C_maltosa, T_delbrueckii, E_gossypii, K_lactis, K_marxianus, L_thermotolerans, Z_bailii, Z_rouxii, C_glabrata, N_castelli, S_cerevisiae, V_polyspora.
480 490 500 510 520 530 S_stipitis TQDLENLHIDNGAAEDDF D INYSSDSDSTPSDNETEEDLPYIPLPGDGINTPPPGNT. D_hanseni LDSHIN...NLSLAPDEF D IDYTDEENSSLSDEDIPFVPSPEEGLETPPTYGT....F M_guilliermondii NDDELGTADTEITGSEPS E VEYDDDRPCFAHPNAKITLDHQ...... C_albicans ESNH.....EEPEITEDF D INYTDNERDSSSSTPSEESNEECANTTT...... M_farinosa DGAQHRGSRSSVPLDDDY D INYSAYDSSDSELNSPMPSSPNSGFCTPFSSNFNGL... C_tenuis ESLSLNASITIDNDNADY D INFSDENPEENTSQYSSKDIPFLPSLGGGLSTPPSH.TI C_parapsilosis ENDNGEFKPEQNEDNEDF D INYSDQEESEEKTSSSNTTSETNSEDMSYLPLGGGTQTP C_orthopsilosis ENDNGEDKPKENDDNDDF D INYSDQEETEEKTSSSNTTSETNSEDMSYLPLGGGTQTP C_dubliniensis SSEQDEVNHEEHEITEDF D INYTDNERDSSSSTPSEESNEEHVNATT...... C_tropicalis DEESEPDITEDDEEDDEF D INFTDTERSSYSSTPSSDDIHYLSTNGAETPQSMSY... C_maltosa LHISTEVSAPHSGEEEIF D INFDGSFSSTPNSEENLFLPSNGGANTPQNTSYL..... T_delbrueckii VIIDKNKKPPKFRLHDNA D DDNDDENEEALEDDTSLAPVSPKSKPIDQIRSPQANFTI E_gossypii DKSASDPEEPAPTEQQPA D SGSDSSSNTGRDSLVLRRAPNRRRKGRRPACRRPD.... K_lactis LEDDQSDDYSTDSDSELN E ...... K_marxianus LEESTNTAASDSERDSNS E SDSDSDVE...... L_thermotolerans PSDEEEEEGESDSESPSQFRPKAASSAASPKLKPNFQL...... Z_bailii REDDEVLVVQAEDPAQASNMMHKASDRNSIGGKPSFTV...... Z_rouxii EGDHREVVVLQAEDKDQAQKMIEERTLAEQKAEKDSVCANGGAAKTNFTV...... C_glabrata IEERKLSSSAFDMSSLKEAVKEEAKEDEGDDEDETADRCLLKKTSGSK...... N_castelli RKLEKQLSGEHLTDSSET E NENSNDYEEEIRVALPKIETVHV...... S_cerevisiae IRDKYSLEADATKENNNGSDDESDDESVNCEACKLKPSSVPKPSQARFSV...... V_polyspora LEVNEKIGDEKLDMDKIESLLDQADIKEKVHFSD......

540 550 560 570 S_stipitis ..YGGFEGTCFAHPHAKITLDSSATSTTNSTRYSSSTTGSGSED...... D_hanseni .....EDRPCFAHPNAKISLNGSSSMSNFSSPTSAESDTLTMYPKHRINNNHTPNN M_guilliermondii ...... C_albicans ...... M_farinosa ...SYDERQCFAHPNAKISLNPSALNSISSSPLSTNSEVLTINPESALHHK..... C_tenuis SYTNLEDRHCFAHPNAHVTFGSSSSDNSGHSNFL...... C_parapsilosis TTNNQDDRCCYAHPHARVHLNGN...... C_orthopsilosis TTNNQDDRCCYAHPHARVHLNGS...... C_dubliniensis ...... C_tropicalis .LNNVDERCCFAHPNARVHLSGGKS...... C_maltosa ..NNIDERCCFAHPKARIHLNGRE...... T_delbrueckii ...... E_gossypii ...... K_lactis ...... K_marxianus ...... L_thermotolerans ...... Z_bailii ...... Z_rouxii ...... C_glabrata ...... N_castelli ...... S_cerevisiae ...... V_polyspora ......


Appendix K

Phylogenetic and syntenic analysis of PHO3.2 and XYL1 homologs

Figure K.1: Phylogenetic reconstruction of PHO3.2 (acid phosphatase) homologs in budding yeasts. PHO3.2 derived from a tandem duplication in a common ancestor of Suhomyces tanzawaensis, Scheffersomyces, and Spathaspora species. The red leaves highlight the PHO3.2 paralogs. The purple leaves highlight an additional uncharacterized PHO3 or PHO3.2 paralog.

Figure K.2: Synteny of the PHO3 (blue) and PHO3.2 (red) loci in Suhomyces tanzawaensis, Scheffersomyces, and Spathaspora species. A genomic inversion of PHO3 occurred in an ancestor of Suhomyces tanzawaensis. PHO3 is less conserved in Scheffersomyces species than PHO3.2.

Figure K.3: Phylogenetic reconstruction of XYL1 (xylose reductase) homologs in budding yeasts, Pezizomycotina fungi, and Saitoella complicata; the XYL1 ortholog appears to be absent in Basidiomycota [Mi et al., 2012]. Red leaves highlight the XYL1.2 paralog, which has NAD(P)H-dependent xylose reductase activity. Pachysolen tannophilus has an independent duplication of XYL1, which also led to NAD(P)H-dependent XR activity [Ditzelmüller et al., 1985].

Figure K.4: Synteny of XYL1 (blue) and XYL1.2 (red) loci in Scheffersomyces and Spathaspora species. XYL1.2 originated from a tandem duplication of XYL1 upstream of trimethyllysine dioxygenase (FOG01414). XYL1 was subsequently lost in some Scheffersomyces species.

Appendix L

Review and discussion of additional orthologs in yeast central metabolism

Phosphoglycerate mutase. Most fungi use either a cofactor-dependent or a cofactor-independent phosphoglycerate mutase [Price et al., 1983], GPM1 or GPM4, respectively. Exophiala dermatitidis, Fusarium oxysporum, Pseudocercospora fijiensis, and Zymoseptoria tritici were identified as having both orthologs with OrthoDB. GPM1 and GPM4 share no homology. Proto-Yeast may have had both orthologs; there is no clear indication of horizontal gene transfer within Dikarya (results not shown).

Citrate synthase. A duplication of the mitochondrial citrate synthase, CIT1, led to the mitochondrial methylcitrate synthase, CIT3, in Dikarya after the divergence of Taphrinomycotina; CIT3 plays a role in propionate metabolism.

Hexokinase. HXK3 originated from an HXK2 duplication before the emergence of the Pichiaceae clade. Ogataea polymorpha subsequently lost the original HXK2 ortholog. It is not known how the gain of HXK3 or the loss of HXK2 affected metabolic regulation in Pichiaceae, since HXK2 acts as both an enzyme and a regulator in glycolysis/gluconeogenesis.

Methanol assimilation. Alcohol oxidase, the first enzyme in methanol metabolism, was present in Proto-Yeast but is currently found in the Pichiaceae clade, within which it was subsequently lost in Pichia membranifaciens, Pichia kudriavzevii, and Dekkera bruxellensis. The enzyme was already present in early fungal lineages and did not emerge within Saccharomycotina.

Other paralogs. ARO9, an aromatic aminotransferase involved in catabolism, originated from a duplication of ARO8, an aromatic aminotransferase involved in amino acid biosynthesis, before the clade sister to L. starkeyi diverged. Additional duplications occurred within this gene family, but these orthologs have not been characterized. ARO10.2 derived from a duplication of the phenylpyruvate decarboxylase ARO10 before the emergence of the CPPSS clade; this paralog is not present in S. cerevisiae and has not been characterized. ELO3 originated from a duplication of ELO2 before the A. rubescens-CPPSS clade; ELO3 synthesizes very long chain fatty acids.
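The phosphoglycerate mutase observation above (species carrying both GPM1 and GPM4) amounts to a query on an ortholog presence/absence table. The following is a minimal sketch of that query, not the pipeline used in this thesis. The four species with both orthologs are taken from the text; the entry `hypothetical_species` and the function name `species_with_all` are invented for illustration only.

```python
# Presence/absence of ortholog groups per species.
# The first four entries reflect the species named in the text as having
# both GPM1 and GPM4; "hypothetical_species" is an invented placeholder,
# not a real annotation.
presence = {
    "Exophiala dermatitidis": {"GPM1", "GPM4"},
    "Fusarium oxysporum": {"GPM1", "GPM4"},
    "Pseudocercospora fijiensis": {"GPM1", "GPM4"},
    "Zymoseptoria tritici": {"GPM1", "GPM4"},
    "hypothetical_species": {"GPM1"},
}


def species_with_all(presence, required):
    """Return the sorted species names whose ortholog set contains every required group."""
    required = set(required)
    return sorted(
        species for species, orthologs in presence.items()
        if required <= orthologs  # subset test: all required groups present
    )


both = species_with_all(presence, {"GPM1", "GPM4"})
print(both)
```

The same subset test extends to any combination of ortholog groups, for example screening for species that retain HXK3 but lack HXK2, provided a curated presence/absence table is available.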

Chapter 9

Bibliography

Shuichi Aiba and Masayoshi Matsuoka. Identification of metabolic model: citrate production from glucose by Candida lipolytica. Biotechnology and Bioengineering, 21(8):1373–1386, 1979.

Adrian M Altenhoff, Nives Škunca, Natasha Glover, Clément-Marie Train, Anna Sueki, Ivana Piližota, Kevin Gori, Bartlomiej Tomiczek, Steven Müller, Henning Redestig, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Research, 43(Database issue):D240–249, Jan 2015.

Adrian M Altenhoff, Natasha M Glover, Clément-Marie Train, Klara Kaleb, Alex Warwick Vesztrocy, David Dylus, Tarcisio M de Farias, Karina Zile, Charles Stevenson, Jiao Long, et al. The OMA orthology database in 2018: retrieving evolutionary relationships among all domains of life through richer web and programmatic interfaces. Nucleic Acids Research, 46(D1):D477–D485, 2017.

Rene Amore, Martin Wilhelm, and Cornelis P Hollenberg. The fermentation of xylose – an analysis of the expression of Bacillus and Actinoplanes xylose isomerase genes in yeast. Applied Microbiology and Biotechnology, 30(4):351–357, 1989.

Mikael Rørdam Andersen, Michael Lynge Nielsen, and Jens Nielsen. Metabolic model integration of the bibliome, genome, metabolome and reactome of Aspergillus niger. Molecular Systems Biology, 4(1):178, 2008.

Hnin W Aung, Susan A Henry, and Larry P Walker. Revising the representation of fatty acid, glycerolipid, and glycerophospholipid metabolism in the consensus model of yeast metabolism. Industrial Biotechnology, 9(4):215–228, 2013.

Ramy K Aziz, Daniela Bartels, Aaron A Best, Matthew DeJongh, Terrence Disz, Robert A Edwards, Kevin Formsma, Svetlana Gerdes, Elizabeth M Glass, Michael Kubal, et al. The RAST Server: rapid annotations using subsystems technology. BMC Genomics, 9(1):75, 2008.

Herwig Bachmann, Douwe Molenaar, Filipe Branco dos Santos, and Bas Teusink. Experimental evolution and the adjustment of metabolic strategies in lactic acid bacteria. FEMS Microbiology Reviews, page fux024, 2017.

James E Bailey, Adriana Sburlati, Vassily Hatzimanikatis, Kelvin Lee, Wolfgang A Renner, and Philip S Tsai. Inverse metabolic engineering: a strategy for directed genetic engineering of useful phenotypes. Biotechnology and Bioengineering, 52(1):109–121, 1996.

Barbara M Bakker, Christoffer Bro, Peter Kötter, Marijke AH Luttik, Johannes P Van Dijken, and Jack T Pronk. The mitochondrial alcohol dehydrogenase Adh3p is involved in a redox shuttle in Saccharomyces cerevisiae. Journal of Bacteriology, 182(17):4730–4737, 2000.

Balaji Balagurunathan, Sudhakar Jonnalagadda, Lily Tan, and Rajagopalan Srinivasan. Reconstruction and analysis of a genome-scale metabolic model for Scheffersomyces stipitis. Microbial Cell Factories, 11(1):1, 2012.

Nenad Ban, Roland Beckmann, Jamie HD Cate, Jonathan D Dinman, François Dragon, Steven R Ellis, Denis LJ Lafontaine, Lasse Lindahl, Anders Liljas, Jeffrey M Lipton, et al. A new system for naming ribosomal proteins. Current Opinion in Structural Biology, 24:165–169, 2014.

IM Banat, D Singh, and Roger Marchant. The use of a thermotolerant fermentative Kluyveromyces marxianus IMB3 yeast strain for ethanol production. Acta Biotechnologica, 16(2-3):215–223, 1996.

Eric Bapteste, Leo van Iersel, Axel Janke, Scot Kelchner, Steven Kelk, James O McInerney, David A Morrison, Luay Nakhleh, Mike Steel, Leen Stougie, et al. Networks: expanding evolutionary thinking. Trends in Genetics, 29(8):439–441, 2013.

Jörg Bär, Wolfgang Schellenberger, and Gerhard Kopperschläger. Purification and characterization of phosphofructokinase from the yeast Kluyveromyces lactis. Yeast, 13(14):1309–1317, 1997.

Yinon M Bar-On, Rob Phillips, and Ron Milo. The biomass distribution on Earth. Proceedings of the National Academy of Sciences, page 201711842, 2018.

Guillaume A Beaudoin and Andrew D Hanson. A guardian angel phosphatase for mainline carbon metabolism. Trends in Biochemical Sciences, 41(11):893–894, 2016.

Tom Bender, Gabrielle Pena, and Jean-Claude Martinou. Regulation of mitochondrial pyruvate uptake by alternative pyruvate carrier complexes. The EMBO Journal, 34(7):911–924, 2015.

Melanie J Bennett, Brian P Schlegel, Mitchell Lewis, M Trevor, et al. Comparative anatomy of the aldo–keto reductase superfamily. Biochemical Journal, 326(3):625–636, 1997.

Lars M Blank, Frank Lehmbeck, and Uwe Sauer. Metabolic-flux and network analysis in fourteen hemiascomycetous yeasts. FEMS Yeast Research, 5(6-7):545–558, 2005.

Joost Boele, Brett G Olivier, and Bas Teusink. FAME, the Flux Analysis and Modeling Environment. BMC Systems Biology, 6(1):8, 2012.

Christopher A Boulton and Colin Ratledge. Use of transition studies in continuous cultures of Lipomyces starkeyi, an oleaginous yeast, to investigate the physiology of lipid accumulation. Microbiology, 129(9):2871–2876, 1983.

Karl A Brand and Ulrich Hermfisse. Aerobic glycolysis by proliferating cells: a protective strategy against reactive oxygen species. The FASEB Journal, 11(5):388–395, 1997.

Sydney Brenner. Life sentences: Ontology recapitulates philology. Genome Biology, 3(4):comment1006–1, 2002.

Sydney Brenner. Sequences and consequences. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 365(1537):207–212, 2010.

Sydney Brenner. The revolution in the life sciences. Science, 338(6113):1427–1428, 2012a.

Sydney Brenner. Turing centenary: Life’s code script. Nature, 482(7386):461, 2012b.

Daniel P Brink, Celina Borgström, Felipe G Tueros, and Marie F Gorwa-Grauslund. Real-time monitoring of the sugar sensing in Saccharomyces cerevisiae indicates endogenous mechanisms for xylose signaling. Microbial Cell Factories, 15(1):183, 2016.

Peter M Bruinenberg, Peter HM de Bot, Johannes P van Dijken, and W Alexander Scheffers. The role of redox balances in the anaerobic fermentation of xylose by yeasts. European Journal of Applied Microbiology and Biotechnology, 18(5):287–292, 1983a.

Peter M Bruinenberg, Johannes P Van Dijken, and W Alexander Scheffers. An enzymic analysis of NADPH production and consumption in Candida utilis. Microbiology, 129(4):965–971, 1983b.

Peter M Bruinenberg, Johannes P Van Dijken, and W Alexander Scheffers. A theoretical analysis of NADPH production and consumption in yeasts. Microbiology, 129(4):953–964, 1983c.

Peter M Bruinenberg, Peter HM de Bot, Johannes P van Dijken, and W Alexander Scheffers. NADH-linked aldose reductase: the key to anaerobic alcoholic fermentation of xylose by yeasts. Applied Microbiology and Biotechnology, 19(4):256–260, 1984.

Peter M Bruinenberg, Johannes P van Dijken, J Gijs Kuenen, and W Alexander Scheffers. Oxidation of NADH and NADPH by mitochondria from the yeast Candida utilis. Microbiology, 131(5):1043–1051, 1985.

Nina A Brunner, Henner Brinkmann, Bettina Siebers, and Reinhard Hensel. NAD+-dependent glyceraldehyde-3-phosphate dehydrogenase from Thermoproteus tenax: The first identified archaeal member of the aldehyde dehydrogenase superfamily is a glycolytic enzyme with unusual regulatory properties. Journal of Biological Chemistry, 273(11):6149–6156, 1998.

Benjamin Buchfink, Chao Xie, and Daniel H Huson. Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1):59, 2014.

Thomas Buelter, Peter Meinhold, Reid M. Renny Feldman, Eva Eckl, Andrew Hawkins, Aristos Aristidou, Catherine Asleson Dundon, Sabine Bastian, Doug Lies, Frances Arnold, and Jun Urano. Engineered microorganisms capable of producing target compounds under anaerobic conditions, 10 2008.

Anthony P Burgard, Priti Pharkya, and Costas D Maranas. OptKnock: a bilevel programming framework for identifying gene knockout strategies for microbial strain optimization. Biotechnology and Bioengineering, 84(6):647–657, 2003.

Geraldine Butler, Matthew D Rasmussen, Michael F Lin, Manuel AS Santos, Sharadha Sakthikumar, Carol A Munro, Esther Rheinbay, Manfred Grabherr, Anja Forche, Jennifer L Reedy, et al. Evolution of pathogenicity and sexual reproduction in eight Candida genomes. Nature, 459(7247):657, 2009.

R Büttner, R Bode, and D Birnbaum. Alcoholic fermentation of starch by Arxula adeninivorans. Zentralblatt für Mikrobiologie, 147(3):225–230, 1992.

Kevin P Byrne and Kenneth H Wolfe. The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Research, 15(10):1456–1461, 2005.

Kevin P Byrne and Kenneth H Wolfe. Consistent patterns of rate asymmetry and gene loss indicate widespread neofunctionalization of yeast genes after whole-genome duplication. Genetics, 175(3):1341–1350, 2007.

Raquel M Cadete, Monaliza A Melo, Kelly J Dussán, Rita CLB Rodrigues, Silvio S Silva, Jerri E Zilli, Marcos JS Vital, Fátima CO Gomes, Marc-André Lachance, and Carlos A Rosa. Diversity and physiological characterization of D-xylose-fermenting yeasts isolated from the Brazilian Amazonian forest. PLoS One, 7(8):e43135, 2012.

Raquel M Cadete, M Alejandro, Anders G Sandström, Carla Ferreira, Francisco Gírio, Marie-Françoise Gorwa-Grauslund, Carlos A Rosa, and César Fonseca. Exploring xylose metabolism in Spathaspora species: XYL1.2 from Spathaspora passalidarum as the key for efficient anaerobic xylose fermentation in metabolic engineered Saccharomyces cerevisiae. Biotechnology for Biofuels, 9(1):167, 2016.

Robert H Carlson. Biology is technology. Harvard University Press, 2010.

Patrícia Carneiro, Margarida Duarte, and Arnaldo Videira. The external alternative NAD(P)H dehydrogenase NDE3 is localized both in the mitochondria and in the cytoplasm of Neurospora crassa. Journal of Molecular Biology, 368(4):1114–1121, 2007.

Patrícia Carneiro, Margarida Duarte, and Arnaldo Videira. The main external alternative NAD(P)H dehydrogenase of Neurospora crassa mitochondria. Biochimica et Biophysica Acta, 1608(1):45–52, 2004.

Luis Caspeta, Saeed Shoaie, Rasmus Agren, Intawat Nookaew, and Jens Nielsen. Genome-scale metabolic reconstructions of Pichia stipitis and Pichia pastoris and in silico evaluation of their potentials. BMC Systems Biology, 6(1):1, 2012.

Ron Caspi, Richard Billington, Luciana Ferrer, Hartmut Foerster, Carol A Fulcher, Ingrid M Keseler, Anamika Kothari, Markus Krummenacker, Mario Latendresse, Lukas A Mueller, et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Research, 44(D1):D471–D480, 2015.

Amy A Caudy, Julia A Hanchard, Alan Hsieh, Saravannan Shaan, and Adam P Rosebrock. Functional genetic discovery of enzymes using full-scan mass spectrometry metabolomics. Biochemistry and Cell Biology, (999):1–12, 2018.

Gustavo C Cerqueira, Martha B Arnaud, Diane O Inglis, Marek S Skrzypek, Gail Binkley, Matt Simison, Stuart R Miyasato, Jonathan Binkley, Joshua Orvis, Prachi Shah, et al. The Aspergillus Genome Database: multispecies curation and incorporation of RNA-Seq data to improve structural gene annotations. Nucleic Acids Research, 42(D1):D705–D710, 2013.

J Michael Cherry, Eurie L Hong, Craig Amundsen, Rama Balakrishnan, Gail Binkley, Esther T Chan, Karen R Christie, Maria C Costanzo, Selina S Dwight, Stacia R Engel, et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Research, 40(D1):D700–D705, 2011.

Ching Chiang and SG Knight. D-xylose metabolism by cell-free extracts of Penicillium chrysogenum. Biochimica et Biophysica Acta, 35:454–463, 1959.

Lin-Chang Chiang, Cheng-Shung Gong, Li-Fu Chen, and George T Tsao. D-xylulose fermentation to ethanol by Saccharomyces cerevisiae. Applied and Environmental Microbiology, 42(2):284–289, 1981.

Stefan Christen and Uwe Sauer. Intracellular characterization of aerobic glucose metabolism in seven yeast species by 13C flux analysis and metabolomics. FEMS Yeast Research, 11(3):263–272, 2011.

Bevan KS Chung, Suresh Selvarasu, Andrea Camattari, Jimyoung Ryu, Hyeokweon Lee, Jungoh Ahn, Hongweon Lee, and Dong-Yup Lee. Genome-scale metabolic reconstruction and in silico analysis of methylotrophic yeast Pichia pastoris for strain improvement. Microbial Cell Factories, 9(1):50, 2010.

Tom Clark, Neil Wedlock, Allen P James, Kay Deverell, and Roy J Thornton. Strain improvement of the xylose-fermenting yeast Pachysolen tannophilus by hybridisation of two mutant strains. Biotechnology Letters, 8(11):801–806, 1986.

Manuel G Claros and Pierre Vincens. Computational method to predict mitochondrially imported proteins and their targeting sequences. European Journal of Biochemistry, 241(3):779–786, 1996.

Michelle F Clasquin, Eugene Melamud, Alexander Singer, Jessica R Gooding, Xiaohui Xu, Aiping Dong, Hong Cui, Shawn R Campagna, Alexei Savchenko, Alexander F Yakunin, et al. Riboneogenesis in yeast. Cell, 145(6):969–980, 2011.

Peter JA Cock, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422–1423, 2009.

François Collard, Francesca Baldin, Isabelle Gerin, Jennifer Bolsée, Gaëtane Noël, Julie Graff, Maria Veiga-da Cunha, Vincent Stroobant, Didier Vertommen, Amina Houddane, et al. A conserved phosphatase destroys toxic glycolytic side products in mammals and yeast. Nature Chemical Biology, 12(8):601, 2016.

Gavin C Conant and Kenneth H Wolfe. Turning a hobby into a job: how duplicated genes find new functions. Nature Reviews Genetics, 9(12):938, 2008.

IF Connerton, JRS Fincham, RA Sandeman, and MJ Hynes. Comparison and cross-species expression of the acetyl-CoA synthetase genes of the Ascomycete fungi, Aspergillus nidulans and Neurospora crassa. Molecular Microbiology, 4(3):451–460, 1990.

UniProt Consortium. Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Research, 41(D1):D43–D47, 2012.

UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Research, 43(D1):D204–D212, 2014.

Paul G Crichton, Charles Affourtit, and Anthony L Moore. Identification of a mitochondrial alcohol dehydrogenase in Schizosaccharomyces pombe: new insights into energy metabolism. Biochemical Journal, 401(2):459–464, 2007.

Lucas Czech, Pierre Barbera, and Alexandros Stamatakis. Methods for automatic reference trees and multilevel phylogenetic placement. Bioinformatics, page bty767, 2018. doi: 10.1093/bioinformatics/bty767. URL http://dx.doi.org/10.1093/bioinformatics/bty767.

A Stephen Dahms. 3-Deoxy-D-pentulosonic acid aldolase and its role in a new pathway of D-xylose degradation. Biochemical and Biophysical Research Communications, 60(4):1433–1439, 1974.

Daniel A Dalquen and Christophe Dessimoz. Bidirectional best hits miss many orthologs in duplication-rich clades such as plants and animals. Genome Biology and Evolution, 5(10):1800–1806, 2013.

Etienne GJ Danchin. Lateral gene transfer in eukaryotes: tip of the iceberg or of the ice cube? BMC Biology, 14(1):101, 2016.

RH de Deken. The Crabtree effect: a regulatory system in yeast. Microbiology, 44(2):149–156, 1966.

Kristof De Schutter, Yao-Cheng Lin, Petra Tiels, Annelies Van Hecke, Sascha Glinka, Jacqueline Weber-Lehmann, Pierre Rouzé, Yves Van de Peer, and Nico Callewaert. Genome sequence of the recombinant protein production host Pichia pastoris. Nature Biotechnology, 27(6):561, 2009.

B Deepa, Eldho Abraham, Nereida Cordeiro, Miran Mozetic, Aji P Mathew, Kristiina Oksman, Marisa Faria, Sabu Thomas, and Laly A Pothan. Utilization of various lignocellulosic biomass for the production of nanocellulose: a comparative study. Cellulose, 22(2):1075–1090, 2015.

Robert FH Dekker. Ethanol production from D-xylose and other sugars by the yeast Pachysolen tannophilus. Biotechnology Letters, 4(7):411–416, 1982.

JP Delgenes, R Moletta, and JM Navarro. The effect of aeration on D-xylose fermentation by Pachysolen tannophilus, Pichia stipitis, Kluyveromyces marxianus and Candida shehatae. Biotechnology Letters, 8(12):897–900, 1986.

H Dellweg, M Rizzi, H Methner, and D Debus. Xylose fermentation by yeasts. Biotechnology Letters, 6(6):395–400, 1984.

H Dellweg, C Klein, S Prahl, M Rizzi, and B Weigert. Kinetics of ethanol production from D-xylose by the yeast Pichia stipitis. Food Biotechnology, 4(1), 1990.

Todd F DeLuca, I-Hsien Wu, Jian Pu, Thomas Monaghan, Leonid Peshkin, Saurav Singh, and Dennis P Wall. Roundup: a multi-genome repository of orthologs and evolutionary distances. Bioinformatics, 22(16):2044–2046, 2006.

Scott Devoid, Ross Overbeek, Matthew DeJongh, Veronika Vonstein, Aaron A Best, and Christopher Henry. Automated genome annotation and metabolic model reconstruction in the SEED and Model SEED. In Systems Metabolic Engineering, pages 17–45. Springer, 2013.

Oscar Dias, Rui Pereira, Andreas K Gombert, Eugénio C Ferreira, and Isabel Rocha. iOD907, the first genome-scale metabolic model for the milk yeast Kluyveromyces lactis. Biotechnology Journal, 9(6):776–790, 2014.

Oscar Dias, Miguel Rocha, Eugénio C Ferreira, and Isabel Rocha. Reconstructing genome-scale metabolic models with merlin. Nucleic Acids Research, 43(8):3899–3910, 2015.

Rodrigo Díaz-Ruiz, Nicole Avéret, Daniela Araiza, Benoît Pinson, Salvador Uribe-Carvajal, Anne Devin, and Michel Rigoulet. Mitochondrial oxidative phosphorylation is regulated by fructose 1,6-bisphosphate. A possible role in Crabtree effect induction? Journal of Biological Chemistry, 283(40):26948–26955, 2008.

Guillaume Diss, Isabelle Gagnon-Arsenault, Anne-Marie Dion-Coté, Hélène Vignaud, Diana I Ascencio, Caroline M Berger, and Christian R Landry. Gene duplication can impart fragility, not robustness, in the yeast protein interaction network. Science, 355(6325):630–634, 2017.

G Ditzelmüller, CP Kubicek, W Wöhrer, and M Röhr. Xylitol dehydrogenase from Pachysolen tannophilus. FEMS Microbiology Letters, 25(2-3):195–198, 1984.

G Ditzelmüller, EM Kubicek-Pranz, M Röhr, and CP Kubicek. NADPH-specific and NADH-specific xylose reduction is catalyzed by two separate enzymes in Pachysolen tannophilus. Applied Microbiology and Biotechnology, 22(4):297–299, 1985.

G Dorn. Genetic analysis of the phosphatases in Aspergillus nidulans. Genetics Research, 6(1):13–26, 1965.

Jonathan M Dreyfuss, Jeremy D Zucker, Heather M Hood, Linda R Ocasio, Matthew S Sachs, and James E Galagan. Reconstruction and validation of a genome-scale metabolic model for the filamentous fungus Neurospora crassa using FARM. PLoS Computational Biology, 9(7):e1003126, 2013.

James C du Preez, Brian Van Driessel, and Bernard A Prior. Effect of aerobiosis on fermentation and key enzyme levels during growth of Pichia stipitis, Candida shehatae and Candida tenuis on D-xylose. Archives of Microbiology, 152(2):143–147, 1989.

JC du Preez. Process parameters and environmental factors affecting D-xylose fermentation by yeasts. Enzyme and Microbial Technology, 16(11):944–956, 1994.

JC Du Preez and BA Prior. A quantitative screening of some xylose-fermenting yeast isolates. Biotechnology Letters, 7(4):241–246, 1985.

JC Du Preez, M Bosch, and BA Prior. The fermentation of hexose and pentose sugars by Candida shehatae and Pichia stipitis. Applied Microbiology and Biotechnology, 23(3-4):228–233, 1986.

Margarida Duarte, Markus Peters, Ulrich Schulte, and Arnaldo Videira. The internal alternative NADH dehydrogenase of Neurospora crassa mitochondria. Biochemical Journal, 371(3):1005–1011, 2003.

Bernard Dujon. Yeast evolutionary genomics. Nature Reviews Genetics, 11(7):512–524, 2010.

Bernard Dujon, David Sherman, Gilles Fischer, Pascal Durrens, Serge Casaregola, Ingrid Lafontaine, Jacky De Montigny, Christian Marck, Cécile Neuvéglise, Emmanuel Talla, et al. Genome evolution in yeasts. Nature, 430(6995):35, 2004.

Ali Ebrahim, Joshua A Lerman, Bernhard O Palsson, and Daniel R Hyduke. COBRApy: COnstraints-Based Reconstruction and Analysis for Python. BMC Systems Biology, 7(1):74, 2013.

Jeremy S Edwards and Bernhard O Palsson. Systems properties of the Haemophilus influenzae Rd metabolic genotype. Journal of Biological Chemistry, 274(25):17410–17416, 1999.

Anna Eliasson, Camilla Christensson, C Fredrik Wahlbom, and Bärbel Hahn-Hägerdal. Anaerobic xylose fermentation by recombinant Saccharomyces cerevisiae carrying XYL1, XYL2, and XKS1 in mineral medium chemostat cultures. Applied and Environmental Microbiology, 66(8):3381–3386, 2000.

Olof Emanuelsson, Søren Brunak, Gunnar von Heijne, and Henrik Nielsen. Locating proteins in the cell using TargetP, SignalP and related tools. Nature Protocols, 2(4):953–971, 2007.

Xueyang Feng, You Xu, Yixin Chen, and Yinjie J Tang. MicrobesFlux: a web platform for drafting metabolic models from the database. BMC Systems Biology, 6(1):94, 2012.

Robert D Finn, Penelope Coggill, Ruth Y Eberhardt, Sean R Eddy, Jaina Mistry, Alex L Mitchell, Simon C Potter, Marco Punta, Matloob Qureshi, Amaia Sangrador-Vegas, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research, 44(D1):D279–D285, 2015.

Andrea Firrincieli, Robert Otillar, Asaf Salamov, Jeremy Schmutz, Zareen Khan, Regina S Redman, Neil David Fleck, Erika Lindquist, Igor V Grigoriev, and Sharon Lafferty Doty. Genome sequence of the plant growth promoting endophytic yeast Rhodotorula graminis WP1. Frontiers in Microbiology, 6:978, 2015.

Steve Fischer, Brian P Brunk, Feng Chen, Xin Gao, Omar S Harb, John B Iodice, Dhanasekaran Shanmugam, David S Roos, and Christian J Stoeckert. Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups. Current Protocols in Bioinformatics, Chapter 6:1–9, 2011.

Walter M Fitch. Distinguishing homologous from analogous proteins. Systematic Zoology, 19(2):99–113, 1970.

Walter M Fitch. Homology: a personal view on some of the problems. Trends in Genetics, 16(5):227–231, 2000.

Carmen-Lisset Flores, Oscar H Martínez-Costa, Valentina Sanchez, Carlos Gancedo, and Juan J Aragon. The dimorphic yeast Yarrowia lipolytica possesses an atypical phosphofructokinase: characterization of the enzyme and its encoding gene. Microbiology, 151(5):1465–1474, 2005.

Stephen S Fong, Jennifer Y Marciniak, and Bernhard Ø Palsson. Description and interpretation of adaptive evolution of Escherichia coli K-12 MG1655 by using a genome-scale in silico metabolic model. Journal of Bacteriology, 185(21):6400–6408, 2003.

Henry Ford. Ford predicts fuel from vegetation. The New York Times, page 24, 1925.

Kelle C Freel, Anne Friedrich, and Joseph Schacherer. Mitochondrial genome evolution in yeasts: an

all-encompassing view. FEMS yeast research, 15(4), 2015.

Stefan Freese, Tanja Vogts, Falk Speer, Bernd Schäfer, Volkmar Passoth, and Ulrich Klinner. C-and

N-catabolic utilization of tricarboxylic acid cycle-related amino acids by Scheffersomyces stipitis and

other yeasts. Yeast, 28(5):375–390, 2011.

Keisuke Fujitomi, Tomoya Sanda, Tomohisa Hasunuma, and Akihiko Kondo. Deletion of the pho13

gene in saccharomyces cerevisiae improves ethanol production from lignocellulosic hydrolysate in the

presence of acetic and formic acids, and furfural. Bioresource technology, 111:161–166, 2012.

Debra L Fulton, Yvonne Y Li, Matthew R Laird, Benjamin GS Horsman, Fiona M Roche, and Fiona SL

Brinkman. Improving the specificity of high-throughput ortholog prediction. BMC bioinformatics, 7

(1):270, 2006.

Toni Gabaldón, Daphne Rainey, and Martijn A Huynen. Tracing the evolution of a large protein complex

in the eukaryotes, NADH:ubiquinone oxidoreductase (Complex I). Journal of Molecular Biology, 348

(4):857–870, 2005.

Silvia Galafassi, Claudia Capusoni, Md Moktaduzzaman, and Concetta Compagno. Utilization of ni-

trate abolishes the "Custers effect" in Dekkera bruxellensis and determines a different pattern of

fermentation products. Journal of Industrial Microbiology & Biotechnology, 40(3-4):297–303, 2013.

James E Galagan, Sarah E Calvo, Katherine A Borkovich, Eric U Selker, Nick D Read, David Jaffe,

William FitzHugh, Li-Jun Ma, Serge Smirnov, Seth Purcell, et al. The genome sequence of the

filamentous fungus Neurospora crassa. Nature, 422(6934):859, 2003.

Michael Y Galperin and Eugene V Koonin. ‘Conserved hypothetical’ proteins: prioritization of targets

for experimental study. Nucleic acids research, 32(18):5452–5463, 2004.

Michael Y Galperin, Kira S Makarova, Yuri I Wolf, and Eugene V Koonin. Expanded microbial genome

coverage and improved annotation in the COG database. Nucleic Acids Research, 43(D1):

D261–D269, 2014.

Michael Y Galperin, David M Kristensen, Kira S Makarova, Yuri I Wolf, and Eugene V Koonin. Microbial

genome analysis: the COG approach. Briefings in Bioinformatics, 2017.

Eric A Gaucher, James T Kratzer, and Ryan N Randall. Deep phylogeny–how a tree can help characterize

early life on Earth. Cold Spring Harbor Perspectives in Biology, 2(1):a002238, 2010.

Luca Gerosa, Bart RB Haverkorn van Rijsewijk, Dimitris Christodoulou, Karl Kochanowski, Thomas SB

Schmidt, Elad Noor, and Uwe Sauer. Pseudo-transition analysis identifies the key regulators of dynamic

metabolic adaptations from steady-state data. Cell Systems, 1(4):270–282, 2015.

Jeffrey E Gerst. Pimp my ribosome: Ribosomal protein paralogs specify translational control. Trends

in Genetics, 2018.

S Gillet, M Aguedo, L Petitjean, ARC Morais, AM da Costa Lopes, RM Łukasik, and PT Anastas. Lignin

transformations for high value applications: towards targeted modifications using green chemistry.

Green Chemistry, 19(18):4200–4233, 2017.

Anisha Goel, Thomas H Eckhardt, Pranav Puri, Anne de Jong, Filipe Branco dos Santos, Martin Giera,

Fabrizia Fusetti, Willem M de Vos, Jan Kok, Bert Poolman, et al. Protein costs do not explain

evolution of metabolic strategies and regulation of ribosomal content: does protein investment explain

an anaerobic bacterial Crabtree effect? Molecular Microbiology, 97(1):77–92, 2015.

André Goffeau, Bart G Barrell, Howard Bussey, RW Davis, Bernard Dujon, Heinz Feldmann, Francis

Galibert, JD Hoheisel, Cr Jacq, Michael Johnston, et al. Life with 6000 genes. Science, 274(5287):

546–567, 1996.

Carla Gonçalves, Jennifer H Wisecaver, Jacek Kominek, Madalena Salema Oom, Maria Jose Leandro,

Xing-Xing Shen, Dana A Opulente, Xiaofan Zhou, David Peris, Cletus P Kurtzman, et al. Evidence

for loss and reacquisition of alcoholic fermentation in a fructophilic yeast lineage. eLife, 7:e33034,

2018.

Dorota Grabowska and Anna Chelstowska. The ALD6 gene product is indispensable for providing

NADPH in yeast cells lacking glucose-6-phosphate dehydrogenase activity. Journal of Biological Chemistry,

278(16):13984–13988, 2003.

Igor V Grigoriev, Roman Nikitin, Sajeet Haridas, Alan Kuo, Robin Ohm, Robert Otillar, Robert Riley,

Asaf Salamov, Xueling Zhao, Frank Korzeniewski, et al. MycoCosm portal: gearing up for 1000 fungal

genomes. Nucleic Acids Research, 42(Database issue):699–704, 2013.

Stéphane Guindon, Jean-François Dufayard, Vincent Lefort, Maria Anisimova, Wim Hordijk, and Olivier

Gascuel. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the

performance of PhyML 3.0. Systematic Biology, 59(3):307–321, 2010.

Aysen Habison, Christian P Kubicek, and Max Röhr. Partial purification and regulatory properties of

phosphofructokinase from Aspergillus niger. Biochemical Journal, 209(3):669–676, 1983.

Sean R Hackett, Vito RT Zanotelli, Wenxin Xu, Jonathan Goya, Junyoung O Park, David H Perlman,

Patrick A Gibney, David Botstein, John D Storey, and Joshua D Rabinowitz. Systems-level analysis

of mechanisms regulating yeast metabolic flux. Science, 354(6311), 2016.

Daniel H Haft, Jeremy D Selengut, Roland A Richter, Derek Harkins, Malay K Basu, and Erin Beck.

TIGRFAMs and Genome Properties in 2013. Nucleic Acids Research, 41(D1):D387–D395, 2012.

Arne Hagman, Torbjörn Säll, Concetta Compagno, and Jure Piskur. Yeast "make-accumulate-consume"

life strategy evolved as a multi-step process that predates the whole genome duplication. PLoS One,

8(7):e68734, 2013.

Bärbel Hahn-Hägerdal, Birgitta Jönsson, and Elke Lohmeier-Vogel. Shifting product formation from

xylitol to ethanol in pentose fermentations using Candida tropicalis by adding polyethylene glycol

(PEG). Applied Microbiology and Biotechnology, 21(3-4):173–175, 1985.

Bärbel Hahn-Hägerdal, C Fredrik Wahlbom, Márk Gárdonyi, Willem H van Zyl, Ricardo R Cordero

Otero, and Leif J Jönsson. Metabolic engineering of Saccharomyces cerevisiae for xylose utilization.

In Metabolic Engineering, pages 53–84. Springer, 2001.

Charles Hall, Sophie Brachat, and Fred S Dietrich. Contribution of horizontal gene transfer to the

evolution of Saccharomyces cerevisiae. Eukaryotic Cell, 4(6):1102–1115, 2005.

Robert J Haselbeck and Lee McAlister-Henn. Function and expression of yeast mitochondrial NAD-

and NADP-specific isocitrate dehydrogenases. Journal of Biological Chemistry, 268(16):12116–12122,

1993.

Kristy Michelle Hawkins, Tina Tipawan Mahatdejkul-Meadows, Adam Leon Meadows, Lauren Barbara

Pickens, Anna Tai, and Annie Ening Tsong. Use of phosphoketolase and phosphotransacetylase for

production of acetyl-coenzyme A derived compounds, August 9 2016. US Patent 9,410,214.

W Hazeu and RA Donker. A continuous culture study of methanol and formate utilization by the yeast

Pichia pastoris. Biotechnology Letters, 5(6):399–404, 1983.

Zilong He, Huangkai Zhang, Shenghan Gao, Martin J Lercher, Wei-Hua Chen, and Songnian Hu.

Evolview v2: an online visualization and management tool for customized and annotated phylogenetic

trees. Nucleic Acids Research, pages W236–241, 2016.

S Blair Hedges, Julie Marin, Michael Suleski, Madeline Paymer, and Sudhir Kumar. Tree of life reveals

clock-like speciation and diversification. Molecular Biology and Evolution, pages 835–845, 2015.

Katherine E Helliwell, Glen L Wheeler, and Alison G Smith. Widespread decay of vitamin-related

pathways: coincidence or consequence? Trends in Genetics, 29(8):469–478, 2013.

Matthew Hilliard, Andrew Damiani, Q Peter He, Thomas Jeffries, and Jin Wang. Elucidating redox

balance shift in Scheffersomyces stipitis’ fermentative metabolism using a modified genome-scale

metabolic model. Microbial Cell Factories, 17(1):140, 2018.

Nancy WY Ho, Zhengdao Chen, and Adam P Brainard. Genetically engineered Saccharomyces yeast

capable of effective cofermentation of glucose and xylose. Applied and Environmental Microbiology,

64(5):1852–1859, 1998.

NWY Ho, FP Lin, S Huang, PC Andrews, and GT Tsao. Purification, characterization, and amino

terminal sequence of xylose reductase from Candida shehatae. Enzyme and Microbial Technology, 12

(1):33–39, 1990.

Leroy Hood. Systems biology: integrating technology, biology, and computation. Mechanisms of Ageing

and Development, 124(1):9–16, 2003.

Hiroyuki Horitsu, Mikio Tomoeda, and Katsushi Kumagai. Pentose metabolism in Candida utilis: Part

IV. NADP specific polyol dehydrogenase. Agricultural and Biological Chemistry, 32(4):514–517, 1968.

Richard N Horne, Wayne Bartley Anderson, and Robert C Nordlie. Glucose dehydrogenase activity of

yeast glucose 6-phosphate dehydrogenase. Inhibition by adenosine 5’-triphosphate and other nucleoside

5’-triphosphates and diphosphates. Biochemistry, 9(3):610–616, 1970.

Jin Hou, Goutham N Vemuri, Xiaoming Bao, and Lisbeth Olsson. Impact of overexpressing NADH

kinase on glucose and xylose metabolism in recombinant xylose-utilizing Saccharomyces cerevisiae.

Applied Microbiology and Biotechnology, 82(5):909–919, 2009.

Xiaoru Hou. Anaerobic xylose fermentation by Spathaspora passalidarum. Applied Microbiology and

Biotechnology, 94(1):205–214, 2012.

E Huang and M Lefsrud. Fermentation monitoring of a co-culture process with Saccharomyces cerevisiae

and Scheffersomyces stipitis using shotgun proteomics. Journal of Bioprocessing and Biotechniques,

144:1–7, 2014.

Eric L Huang and Mark G Lefsrud. Temporal analysis of xylose fermentation by Scheffersomyces stipitis

using shotgun proteomics. Journal of Industrial Microbiology & Biotechnology, 39(10):1507–1514,

2012.

Carl L Hubbs. Concepts of homology and analogy. The American Naturalist, 78(777):289–307, 1944.

Jaime Huerta-Cepas, Anibal Bueno, Joaquín Dopazo, and Toni Gabaldón. PhylomeDB: a database for

genome-wide collections of gene phylogenies. Nucleic Acids Research, 36(suppl_1):D491–D496, 2007.

Jaime Huerta-Cepas, François Serra, and Peer Bork. ETE 3: Reconstruction, analysis, and visualization

of phylogenomic data. Molecular Biology and Evolution, 33(6):1635–1638, 2016a.

Jaime Huerta-Cepas, Damian Szklarczyk, Kristoffer Forslund, Helen Cook, Davide Heller, Mathias C

Walter, Thomas Rattei, Daniel R Mende, Shinichi Sunagawa, Michael Kuhn, et al. eggNOG 4.5:

a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic

and viral sequences. Nucleic Acids Research, 44(D1):D286–293, 2016b.

Jaime Huerta-Cepas, Kristoffer Forslund, Luis Pedro Coelho, Damian Szklarczyk, Lars Juhl Jensen,

Christian von Mering, and Peer Bork. Fast genome-wide functional annotation through orthology

assignment by eggNOG-Mapper. Molecular Biology and Evolution, 34(8):2115–2122, 2017.

Jaime Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K Forslund, Helen

Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars J Jensen, et al. eggNOG 5.0: a hierarchical,

functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502

viruses. Nucleic Acids Research, 2018.

Marcel Huntemann, Natalia N Ivanova, Konstantinos Mavromatis, H James Tripp, David Paez-Espino,

Krishnaveni Palaniappan, Ernest Szeto, Manoj Pillay, I-Min A Chen, Amrita Pati, et al. The standard

operating procedure of the DOE-JGI microbial genome annotation pipeline (MGAP v. 4). Standards

in Genomic Sciences, 10(1):86, 2015.

Filip Husnik and John P McCutcheon. Functional horizontal gene transfer from bacteria to eukaryotes.

Nature Reviews Microbiology, 2017.

Michael J Hynes and Sandra L Murray. ATP-citrate lyase is required for production of cytosolic acetyl

coenzyme A and development in Aspergillus nidulans. Eukaryotic Cell, 9(7):1039–1048, 2010.

Diane O Inglis, Martha B Arnaud, Jonathan Binkley, Prachi Shah, Marek S Skrzypek, Farrell Wymore,

Gail Binkley, Stuart R Miyasato, Matt Simison, and Gavin Sherlock. The Candida genome database

incorporates multiple Candida species: multispecies search and analysis tools with curated gene and

protein information for Candida albicans and Candida glabrata. Nucleic Acids Research, 40(D1):

D667–D674, 2011.

M Ahsanul Islam, Elizabeth A Edwards, and Radhakrishnan Mahadevan. Characterizing the metabolism

of Dehalococcoides with a constraint-based model. PLoS Computational Biology, 6(8):e1000887, 2010.

Mickel LA Jansen, Jasmine M Bracher, Ioannis Papapetridis, Maarten D Verhoeven, Hans de Bruijn,

Paul P de Waal, Antonius JA van Maris, Paul Klaassen, and Jack T Pronk. Saccharomyces cerevisiae

strains for second-generation ethanol production: from academic exploration to industrial implementation.

FEMS Yeast Research, 17(5), 2017.

Thomas W Jeffries. Effects of nitrate on fermentation of xylose and glucose by Pachysolen tannophilus.

Nature Biotechnology, 1(6):503–506, 1983.

Thomas W Jeffries and Nian-Qing Shi. Genetic engineering for improved xylose fermentation by yeasts.

In Recent progress in bioconversion of lignocellulosics, pages 117–161. Springer, 1999.

Thomas W Jeffries and Jennifer R Headman van Vleet. Pichia stipitis genomics, transcriptomics, and

gene clusters. FEMS Yeast Research, 9(6):793–807, 2009.

Thomas W Jeffries, Igor V Grigoriev, Jane Grimwood, José M Laplaza, Andrea Aerts, Asaf Salamov,

Jeremy Schmutz, Erika Lindquist, Paramvir Dehal, Harris Shapiro, et al. Genome sequence of the

lignocellulose-bioconverting and xylose-fermenting yeast Pichia stipitis. Nature Biotechnology, 25(3):

319–326, 2007.

TW Jeffries and Y-S Jin. Metabolic engineering for improved fermentation of pentoses by yeasts. Applied

Microbiology and Biotechnology, 63(5):495–509, 2004.

Lars Juhl Jensen, Philippe Julien, Michael Kuhn, Christian von Mering, Jean Muller, Tobias Doerks, and

Peer Bork. eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic

Acids Research, 36(suppl_1):D250–D254, 2007.

Roy A Jensen. Orthologs and paralogs - we need to get it right. Genome Biology, 2(8):interactions1002–1,

2001.

H Jeppsson, NJ Alexander, and B Hahn-Hagerdal. Existence of cyanide-insensitive respiration in the

yeast Pichia stipitis and its possible influence on product formation during xylose utilization. Applied

and Environmental Microbiology, 61(7):2596–2600, 1995.

Marie Jeppsson, Björn Johansson, Bärbel Hahn-Hägerdal, and Marie F Gorwa-Grauslund. Reduced

oxidative pentose phosphate pathway flux in recombinant xylose-utilizing Saccharomyces cerevisiae

strains improves the ethanol yield from xylose. Applied and Environmental Microbiology, 68(4):1604–

1609, 2002.

Marie Jeppsson, Björn Johansson, Peter Ruhdal Jensen, Bärbel Hahn-Hägerdal, and Marie F Gorwa-

Grauslund. The level of glucose-6-phosphate dehydrogenase activity strongly influences xylose fermentation

and inhibitor sensitivity in recombinant Saccharomyces cerevisiae strains. Yeast, 20(15):

1263–1272, 2003.

Yong-Su Jin and Thomas W Jeffries. Changing flux of xylose metabolites by altering expression of xylose

reductase and xylitol dehydrogenase in recombinant Saccharomyces cerevisiae. In Biotechnology for

Fuels and Chemicals, pages 277–285. Springer, 2003.

Yong-Su Jin, Haiying Ni, Jose M Laplaza, and Thomas W Jeffries. Optimal growth and ethanol production

from xylose by recombinant Saccharomyces cerevisiae require moderate D-xylulokinase activity.

Applied and Environmental Microbiology, 69(1):495–503, 2003.

Yong-Su Jin, Hal Alper, Yea-Tyng Yang, and Gregory Stephanopoulos. Improvement of xylose uptake

and ethanol production in recombinant Saccharomyces cerevisiae through an inverse metabolic

engineering approach. Applied and Environmental Microbiology, 71(12):8249–8256, 2005.

Björn Johansson, Camilla Christensson, Timothy Hobley, and Bärbel Hahn-Hägerdal. Xylulokinase

overexpression in two strains of Saccharomyces cerevisiae also expressing xylose reductase and xylitol

dehydrogenase and its effect on fermentation of xylose and lignocellulosic hydrolysate. Applied and

Environmental Microbiology, 67(9):4249–4255, 2001.

Philip Jones, David Binns, Hsin-Yu Chang, Matthew Fraser, Weizhong Li, Craig McAnulla, Hamish

McWilliam, John Maslen, Alex Mitchell, Gift Nuka, et al. InterProScan 5: genome-scale protein

function classification. Bioinformatics, 30(9):1236–1240, 2014.

Leif J Jönsson, Björn Alriksson, and Nils-Olof Nilvebrant. Bioconversion of lignocellulose: inhibitors

and detoxification. Biotechnology for Biofuels, 6(1):16, 2013.

Raja Jothi, Elena Zotenko, Asba Tasneem, and Teresa M Przytycka. COCO-CL: hierarchical clustering

of homology relations based on evolutionary correlations. Bioinformatics, 22(7):779–788, 2006.

Lukas Käll, Anders Krogh, and Erik LL Sonnhammer. Advantages of combined transmembrane topology

and signal peptide prediction—the Phobius web server. Nucleic Acids Research, 35(Web Server issue):

W429–W432, 2007.

Minoru Kanehisa, Susumu Goto, Yoko Sato, Miho Furumichi, and Mao Tanabe. KEGG for integration

and interpretation of large-scale molecular data sets. Nucleic Acids Research, 40(D1):D109–D114,

2011.

Kaisa Karhumaa, Bärbel Hahn-Hägerdal, and Marie-F Gorwa-Grauslund. Investigation of limiting

metabolic steps in the utilization of xylose by recombinant Saccharomyces cerevisiae using metabolic

engineering. Yeast, 22(5):359–368, 2005.

Kaisa Karhumaa, Anna-Karin Påhlman, Bärbel Hahn-Hägerdal, Fredrik Levander, and Marie-F Gorwa-

Grauslund. Proteome analysis of the xylose-fermenting mutant yeast strain TMB 3400. Yeast, 26(7):

371–382, 2009.

Peter D Karp, Suzanne M Paley, Markus Krummenacker, Mario Latendresse, Joseph M Dale, Thomas J

Lee, Pallavi Kaipa, Fred Gilham, Aaron Spaulding, Liviu Popescu, et al. Pathway Tools version 13.0:

integrated software for pathway/genome informatics and systems biology. Briefings in Bioinformatics,

11(1):40–79, 2009.

Peter D Karp, Mario Latendresse, Suzanne M Paley, Markus Krummenacker, Quang D Ong, Richard

Billington, Anamika Kothari, Daniel Weaver, Thomas Lee, Pallavi Subhraveti, et al. Pathway Tools

version 19.0 update: software for pathway/genome informatics and systems biology. Briefings in

Bioinformatics, 17(5):877–890, 2015.

Kazutaka Katoh and Daron M Standley. MAFFT multiple sequence alignment software version 7:

improvements in performance and usability. Molecular Biology and Evolution, 30(4):772–780, 2013.

Shigeyuki Kawai, Sachiko Suzuki, Shigetarou Mori, and Kousaku Murata. Molecular cloning and identification

of UTR1 of a yeast Saccharomyces cerevisiae as a gene encoding an NAD kinase. FEMS

Microbiology Letters, 200(2):181–184, 2001.

Shigeyuki Kawai, Chikako Fukuda, Takako Mukai, and Kousaku Murata. MJ0917 in archaeon

Methanococcus jannaschii is a novel NADP phosphatase/NAD kinase. Journal of Biological Chemistry,

280(47):39200–39207, 2005.

Eduard J Kerkhoven, Kyle R Pomraning, Scott E Baker, and Jens Nielsen. Regulation of amino-acid

metabolism controls flux to lipid accumulation in Yarrowia lipolytica. NPJ Systems Biology and

Applications, 2:16005, 2016.

Stefan J Kerscher. Diversity and origin of alternative NADH:ubiquinone oxidoreductases. Biochimica et

Biophysica Acta, 1459(2):274–283, 2000.

Stefan J Kerscher, Jürgen G Okun, and Ulrich Brandt. A single external enzyme confers alternative

NADH: ubiquinone oxidoreductase activity in Yarrowia lipolytica. Journal of Cell Science, 112(14):

2347–2354, 1999.

Janine Kiers, Anne-Marie Zeeman, Marijke Luttik, Claudia Thiele, Juan I Castrillo, HY Steensma, Johannes

P Van Dijken, and Jack T Pronk. Regulation of alcoholic fermentation in batch and chemostat

cultures of Kluyveromyces lactis CBS 2359. Yeast, 14(5):459–469, 1998.

Joonhoon Kim, Jennifer L Reed, and Christos T Maravelias. Large-scale bi-level strain design approaches

and mixed-integer programming solution techniques. PLoS One, 6(9):e24162, 2011.

Soo Rin Kim, Yong-Cheol Park, Yong-Su Jin, and Jin-Ho Seo. Strain engineering of Saccharomyces

cerevisiae for enhanced xylose metabolism. Biotechnology Advances, 31(6):851–861, 2013a.

Soo Rin Kim, Jeffrey M Skerker, Wei Kang, Anastashia Lesmana, Na Wei, Adam P Arkin, and Yong-Su

Jin. Rational and evolutionary engineering approaches uncover a small set of genetic changes efficient

for rapid xylose fermentation in Saccharomyces cerevisiae. PLoS One, 8(2):e57048, 2013b.

Soo Rin Kim, Haiqing Xu, Anastashia Lesmana, Uros Kuzmanovic, Matthew Au, Clarissa Florencia,

Eun Joong Oh, Guochang Zhang, Kyoung Heon Kim, and Yong-Su Jin. Deletion of PHO13, encoding

haloacid dehalogenase type IIA phosphatase, results in upregulation of the pentose phosphate pathway

in Saccharomyces cerevisiae. Applied and Environmental Microbiology, 81(5):1601–1609, 2015.

Zachary A King and Adam M Feist. Optimizing cofactor specificity of oxidoreductase enzymes for the

generation of microbial production strains–OptSwap. Industrial Biotechnology, 9(4):236–246, 2013.

Zachary A King, Justin Lu, Andreas Dräger, Philip Miller, Stephen Federowicz, Joshua A Lerman,

Ali Ebrahim, Bernhard O Palsson, and Nathan E Lewis. BiGG models: A platform for integrating,

standardizing and sharing genome-scale models. Nucleic Acids Research, 44(D1):D515–D522, 2015.

Jürgen Kirchberger, Jörg Bär, Wolfgang Schellenberger, Hassan Dihazi, and Gerhard Kopperschläger.

6-phosphofructokinase from Pichia pastoris: purification, kinetic and molecular characterization of

the enzyme. Yeast, 19(11):933–947, 2002.

Albert Jan Kluyver. Biochemische suikerbepalingen. PhD thesis, TU Delft, Delft University of Technology,

1914.

Eugene V Koonin. An apology for orthologs-or brave new memes. Genome Biology, 2(4):comment1005–1,

2001.

Eugene V Koonin. Orthologs, paralogs, and evolutionary genomics. Annual Review of Genetics, 39:

309–338, 2005.

Peter Kötter and Michael Ciriacy. Xylose fermentation by Saccharomyces cerevisiae. Applied Microbiology

and Biotechnology, 38(6):776–783, 1993.

Peter Kötter, René Amore, Cornelis P Hollenberg, and Michael Ciriacy. Isolation and characterization

of the Pichia stipitis xylitol dehydrogenase gene, XYL2, and construction of a xylose-utilizing

Saccharomyces cerevisiae transformant. Current Genetics, 18(6):493–500, 1990.

Tadeusz Krassowski, Aisling Y Coughlan, Xing-Xing Shen, Xiaofan Zhou, Jacek Kominek, Dana A

Opulente, Robert Riley, Igor V Grigoriev, Nikunj Maheshwari, Denis C Shields, et al. Evolutionary

instability of CUG-Leu in the genetic code of budding yeasts. Nature Communications, 9(1):1887, 2018.

Paula Kristo, Ritva Saarelainen, Richard Fagerström, Sirpa Aho, and Matti Korhola. Protein purification,

and cloning and characterization of the cDNA and gene for xylose isomerase of barley. European

Journal of Biochemistry, 237(1):240–246, 1996.

Evgenia V Kriventseva, Nazim Rahman, Octavio Espinosa, and Evgeny M Zdobnov. OrthoDB: the

hierarchical catalog of eukaryotic orthologs. Nucleic Acids Research, 36(suppl_1):D271–D275, 2007.

Evgenia V Kriventseva, Fredrik Tegenfeldt, Tom J Petty, Robert M Waterhouse, Felipe A Simão, Igor A

Pozdnyakov, Panagiotis Ioannidis, and Evgeny M Zdobnov. OrthoDB v8: update of the hierarchical

catalog of orthologs and the underlying free software. Nucleic Acids Research, 43(D1):D250–D256,

2015.

Evgenia V Kriventseva, Dmitry Kuznetsov, Fredrik Tegenfeldt, Mosè Manni, Renata Dias, Felipe A

Simão, and Evgeny M Zdobnov. OrthoDB v10: sampling the diversity of animal, plant, fungal, protist,

bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic Acids

Research, 2018.

Adepu Kiran Kumar and Shaishav Sharma. Recent updates on different methods of pretreatment of

lignocellulosic feedstocks: a review. Bioresources and Bioprocessing, 4(1):7, 2017.

Gotthard Kunze, Claude Gaillardin, Małgorzata Czernicka, Pascal Durrens, Tiphaine Martin, Erik Böer,

Toni Gabaldón, Jose A Cruz, Emmanuel Talla, Christian Marck, et al. The complete genome of Blastobotrys

(Arxula) adeninivorans LS3 - a yeast of biotechnological interest. Biotechnology for Biofuels,

7(1):66, 2014.

Cletus Kurtzman, Jack W Fell, and Teun Boekhout. The Yeasts: a taxonomic study. Elsevier, 2011.

Cletus P Kurtzman and Motofumi Suzuki. Phylogenetic analysis of ascomycete yeasts that form coenzyme

Q-9 and the proposal of the new genera Babjeviella, Meyerozyma, Millerozyma, Priceomyces,

and Scheffersomyces. Mycoscience, 51(1):2–14, 2010.

Marko Kuyper, Harry R Harhangi, Ann Kristin Stave, Aaron A Winkler, Mike SM Jetten, Wim TAM

de Laat, Jan JJ den Ridder, Huub JM Op den Camp, Johannes P van Dijken, and Jack T Pronk. High-

level functional expression of a fungal xylose isomerase: the key to efficient ethanolic fermentation of

xylose by Saccharomyces cerevisiae? FEMS Yeast Research, 4(1):69–78, 2003.

Marko Kuyper, Miranda MP Hartog, Maurice J Toirkens, Marinka JH Almering, Aaron A Winkler,

Johannes P van Dijken, and Jack T Pronk. Metabolic engineering of a xylose-isomerase-expressing

Saccharomyces cerevisiae strain for rapid anaerobic xylose fermentation. FEMS Yeast Research, 5

(4-5):399–409, 2005.

Ekaterina Kuznetsova, Michael Proudfoot, Stephen A Sanders, Jeffrey Reinking, Alexei Savchenko,

Cheryl H Arrowsmith, Aled M Edwards, and Alexander F Yakunin. Enzyme genomics: Application

of general enzymatic screens to discover new enzymes. FEMS Microbiology Reviews, 29(2):263–279,

2005.

Ekaterina Kuznetsova, Linda Xu, Alexander Singer, Greg Brown, Aiping Dong, Robert Flick, Hong Cui,

Marianne Cuff, Andrzej Joachimiak, Alexei Savchenko, et al. Structure and activity of the metal-

independent fructose-1,6-bisphosphatase YK23 from Saccharomyces cerevisiae. Journal of Biological

Chemistry, pages jbc–M110, 2010.

Ekaterina Kuznetsova, Boguslaw Nocek, Greg Brown, Kira S Makarova, Robert Flick, Yuri I Wolf,

Anna Khusnutdinova, Elena Evdokimova, Ke Jin, Kemin Tan, et al. Functional diversity of haloacid

dehalogenase superfamily phosphatases from Saccharomyces cerevisiae: biochemical, structural, and

evolutionary insights. Journal of Biological Chemistry, pages jbc–M115, 2015.

Arnold Kuzniar, Roeland CHJ van Ham, Sándor Pongor, and Jack AM Leunissen. The quest for

orthologs: finding the corresponding gene across genomes. Trends in Genetics, 24(11):539–551, 2008.

Suryang Kwak and Yong-Su Jin. Production of fuels and chemicals from xylose by engineered Saccharomyces

cerevisiae: a review and perspective. Microbial Cell Factories, 16(1):82, 2017.

Marc Larochelle, Simon Drouin, François Robert, and Bernard Turcotte. Oxidative stress-activated zinc

cluster protein Stb5 has dual activator/repressor functions required for pentose phosphate pathway

regulation and NADPH production. Molecular and Cellular Biology, 26(17):6690–6701, 2006.

Christer Larsson and Lena Gustafsson. The role of physiological state in osmotolerance of the salt-

tolerant yeast Debaryomyces hansenii. Canadian Journal of Microbiology, 39(6):603–609, 1993.

Jean-Michel Lebeault, Eglis T Lode, and Minor J Coon. Fatty acid and hydrocarbon hydroxylation

in yeast: role of cytochrome P-450 in Candida tropicalis. Biochemical and Biophysical Research

Communications, 42(3):413–419, 1971.

Rodrigo Ledesma-Amaro, Eduard J Kerkhoven, José Luis Revuelta, and Jens Nielsen. Genome scale

metabolic modeling of the riboflavin overproducer Ashbya gossypii. Biotechnology and Bioengineering,

111(6):1191–1199, 2014.

Jae Young Lee, Jae Eun Kwak, Jinho Moon, Soo Hyun Eom, Elaine C Liong, Jean-Denis Pedelacq, Joel

Berendzen, and Se Won Suh. Crystal structure and functional analysis of the SurE protein identify a

novel phosphatase family. Nature Structural and Molecular Biology, 8(9):789, 2001.

Misun Lee, Henriëtte J Rozeboom, Paul P de Waal, Rene M de Jong, Hanna M Dudek, and Dick B

Janssen. Metal dependence of the xylose isomerase from Piromyces sp. E2 explored by activity profiling

and protein crystallography. Biochemistry, 56(45):5991–6005, 2017.

Sun-Mi Lee, Taylor Jellison, and Hal S Alper. Directed evolution of xylose isomerase for improved

xylose catabolism and fermentation in the yeast Saccharomyces cerevisiae. Applied and Environmental

Microbiology, pages AEM–01419, 2012.

Nathan E Lewis, Harish Nagarajan, and Bernhard O Palsson. Constraining the metabolic genotype–

phenotype relationship using a phylogeny of in silico methods. Nature Reviews Microbiology, 10(4):

291, 2012.

Heng Li, Avril Coghlan, Jue Ruan, Lachlan James Coin, Jean-Karim Heriche, Lara Osmotherly, Ruiqiang

Li, Tao Liu, Zhang Zhang, Lars Bolund, et al. TreeFam: a curated database of phylogenetic trees of

animal gene families. Nucleic Acids Research, 34(suppl_1):D572–D580, 2006.

Li Li, Christian J Stoeckert, and David S Roos. OrthoMCL: identification of ortholog groups for eukaryotic

genomes. Genome Research, 13(9):2178–2189, 2003.

Peter Yan Li. In silico metabolic network reconstruction of Scheffersomyces stipitis. Master’s thesis,

University of Toronto, 2012.

Christian Lieven, Moritz Emanuel Beber, Brett G Olivier, Frank T Bergmann, Parizad Babaei, Jennifer

A Bartell, Lars M Blank, Siddharth Chauhan, Kevin Correia, Christian Diener, et al. Memote:

A community-driven effort towards a standardized genome-scale metabolic model test suite. bioRxiv,

page 350991, 2018.

Magdalena E Ligthelm, Bernard A Prior, and James C du Preez. The effect of respiratory inhibitors

on the fermentative ability of Pichia stipitis, Pachysolen tannophilus and Saccharomyces cerevisiae

under various conditions of aerobiosis. Applied Microbiology and Biotechnology, 29(1):67–71, 1988a.

Magdalena E Ligthelm, Bernard A Prior, and James C du Preez. The oxygen requirements of yeasts for

the fermentation of D-xylose and D-glucose to ethanol. Applied Microbiology and Biotechnology, 28

(1):63–68, 1988b.

Magdalena E Ligthelm, Bernard A Prior, James C du Preez, and Vincent Brandt. An investigation

of D-{1-13C} xylose metabolism in Pichia stipitis under aerobic and anaerobic conditions. Applied

Microbiology and Biotechnology, 28(3):293–296, 1988c.

Lina Lindberg, Aline XS Santos, Howard Riezman, Lisbeth Olsson, and Maurizio Bettiga. Lipidomic

profiling of Saccharomyces cerevisiae and Zygosaccharomyces bailii reveals critical changes in lipid

composition in response to acetic acid stress. PLoS One, 8(9):e73936, 2013.

Tomas Linder. CMO1 encodes a putative choline monooxygenase and is required for the utilization of

choline as the sole nitrogen source in the yeast Scheffersomyces stipitis (syn. Pichia stipitis). Microbiology,

160(5):929–940, 2014.

Tomas Linder. Genetic redundancy in the catabolism of methylated amines in the yeast Scheffersomyces

stipitis. Antonie van Leeuwenhoek, 111(3):401–411, 2018.

Ting Liu, Wei Zou, Liming Liu, and Jian Chen. A constraint-based model of Scheffersomyces stipitis

for improved ethanol production. Biotechnology for Biofuels, 5(1):72, 2012.

Francisco P Lobo, Davi L Gonçalves, Sergio L Alves, Alexandra L Gerber, Ana Tereza R de Vasconcelos,

Luiz C Basso, Glória R Franco, Marco A Soares, Raquel M Cadete, Carlos A Rosa, et al.

Draft genome sequence of the D-xylose-fermenting yeast Spathaspora arborariae UFMG-HM19.1AT.

Genome Announcements, 2(1):e01163–13, 2014.

Nicolas Loira, Thierry Dulermo, Jean-Marc Nicaud, and David James Sherman. A genome-scale

metabolic model of the lipid-accumulating yeast Yarrowia lipolytica. BMC Systems Biology, 6(1):

35, 2012.

Daiane D Lopes, Samuel P Cibulski, Fabiana Q Mayer, Franciele M Siqueira, Carlos A Rosa, Ronald E

Hector, and Marco Antônio Z Ayub. Draft genome sequence of the D-xylose-fermenting yeast Spathaspora

xylofermentans UFMG-HMD23.3. Genome Announcements, 5(33):e00815–17, 2017.

Helder Lopes and Isabel Rocha. Genome-scale modeling of yeast: chronology, applications and critical

perspectives. FEMS Yeast Research, 17(5), 2017.

Mariana R Lopes, Camila G Morais, Jacek Kominek, Raquel M Cadete, Marco A Soares, Ana Paula T

Uetanabaro, César Fonseca, Marc-André Lachance, Chris Todd Hittinger, and Carlos A Rosa. Genomic

analysis and D-xylose fermentation of three novel Spathaspora species: Spathaspora girioi sp. nov.,

Spathaspora hagerdaliae fa, sp. nov. and Spathaspora gorwiae fa, sp. nov. FEMS Yeast Research, 16

(4):fow044, 2016.

Mariana R Lopes, Thiago M Batista, Glória R Franco, Lucas R Ribeiro, Ana RO Santos, Carolina Furtado,

Rennan G Moreira, Aristóteles Goes-Neto, Marcos JS Vital, Luiz H Rosa, et al. Scheffersomyces

stambukii fa, sp. nov., a D-xylose-fermenting species isolated from rotting wood. International Journal

of Systematic and Evolutionary Microbiology, 2018.

Anja Lorberg, Lutz Kirchrath, Joachim F Ernst, and Jürgen J Heinisch. Genetic and biochemical

characterization of phosphofructokinase from the opportunistic pathogenic yeast Candida albicans.

European Journal of Biochemistry, 260(1):217–226, 1999.

Hongzhong Lu, Weiqiang Cao, Liming Ouyang, Jianye Xia, Mingzhi Huang, Ju Chu, Yingping Zhuang,

Siliang Zhang, and Henk Noorman. Comprehensive reconstruction and in silico analysis of Aspergillus

niger genome-scale metabolic network model that accounts for 1210 ORFs. Biotechnology and Bioengineering,

114(3):685–695, 2017.

Candida Lucas and N Van Uden. The temperature profiles of growth, thermal death and ethanol

tolerance of the xylose-fermenting yeast Candida shehatae. Journal of Basic Microbiology, 25(8):

547–550, 1985.

Donna M MacCallum, Luis Castillo, Kerstin Nather, Carol A Munro, Alistair JP Brown, Neil AR

Gow, and Frank C Odds. Property differences among the four major Candida albicans strain clades.

Eukaryotic Cell, 8(3):373–387, 2009.

Daniel Machado, Sergej Andrejev, Melanie Tramontano, and Kiran Raosaheb Patil. Fast automated

reconstruction of genome-scale metabolic models for microbial species and communities. Nucleic

Acids Research, 2018.

Tomoko Maehara, Koji Takabatake, and Satoshi Kaneko. Expression of Arabidopsis thaliana xylose

isomerase gene and its effect on ethanol production in Flammulina velutipes. Fungal Biology, 117

(11-12):776–782, 2013.

Sarah L Maguire, Seán S OhÉigeartaigh, Kevin P Byrne, Markus S Schröder, Peadar O’Gaora, Ken-

neth H Wolfe, and Geraldine Butler. Comparative genome analysis and gene finding in Candida

species using CGOB. Molecular Biology and Evolution, 30(6):1281–1291, 2013.

R Mahadevan and CH Schilling. The effects of alternate optimal solutions in constraint-based genome-

scale metabolic models. Metabolic Engineering, 5(4):264–276, 2003.

Kira S Makarova, Yuri I Wolf, Sergey L Mekhedov, Boris G Mirkin, and Eugene V Koonin. Ancestral

paralogs and pseudoparalogs and their role in the emergence of the eukaryotic cell. Nucleic Acids

Research, 33(14):4626–4638, 2005.

Ryszard Maleszka and Henry Schneider. Fermentation of D-xylose, xylitol, and D-xylulose by yeasts.

Canadian Journal of Microbiology, 28(3):360–363, 1982.

Yaseen I Mamoori, Abdul Ghani I Yahya, and Majed H AL-Jelawi. Expression of xylose reductase

enzyme from Spathaspora passalidarum in Saccharomyces cerevisiae. Iraqi Journal of Science, 54:

316–323, 2013.

Xizeng Mao, Tao Cai, John G Olyarchuk, and Liping Wei. Automated genome annotation and pathway

identification using the KEGG orthology (KO) as a controlled vocabulary. Bioinformatics, 21(19):

3787–3793, 2005.

Marina Marcet-Houben and Toni Gabaldón. Beyond the whole-genome duplication: Phylogenetic ev-

idence for an ancient interspecies hybridization in the baker’s yeast lineage. PLoS Biology, 13(8):

e1002220, 2015.

Diego Martinez, Randy M Berka, Bernard Henrissat, Markku Saloheimo, Mikko Arvas, Scott E Baker,

Jarod Chapman, Olga Chertkov, Pedro M Coutinho, Dan Cullen, et al. Genome sequencing and anal-

ysis of the biomass-degrading fungus Trichoderma reesei (syn. Hypocrea jecorina). Nature Biotech-

nology, 26(5):553, 2008.

F.A. Matsen, R.B. Kodner, and E. Armbrust. pplacer: linear time maximum-likelihood and Bayesian

phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics, 11(1):538, 2010.

Douglas McCloskey, Bernhard Ø Palsson, and Adam M Feist. Basic and applied uses of genome-scale

metabolic network reconstructions of Escherichia coli. Molecular Systems Biology, 9(1):661, 2013.

Mark D McDowall, Midori A Harris, Antonia Lock, Kim Rutherford, Daniel M Staines, Jürg Bähler,

Paul J Kersey, Stephen G Oliver, and Valerie Wood. PomBase 2015: updates to the fission yeast

database. Nucleic Acids Research, 43(D1):D656–D661, 2014.

Adam L Meadows, Kristy M Hawkins, Yoseph Tsegaye, Eugene Antipov, Youngnyun Kim, Lauren Raetz,

Robert H Dahl, Anna Tai, Tina Mahatdejkul-Meadows, Lan Xu, et al. Rewriting yeast central carbon

metabolism for industrial isoprenoid production. Nature, 537(7622):694–697, 2016.

Birgit HM Meldal, Oscar Forner-Martinez, Maria C Costanzo, Jose Dana, Janos Demeter, Marine Du-

mousseau, Selina S Dwight, Anna Gaulton, Luana Licata, Anna N Melidoni, et al. The complex

portal-an encyclopaedia of macromolecular complexes. Nucleic Acids Research, 43(D1):D479–D484,

2014.

Ana MP Melo, Margarida Duarte, Ian M Møller, Holger Prokisch, Patricia L Dolan, Laura Pinto,

Mary Anne Nelson, and Arnaldo Videira. The external calcium-dependent NADPH dehydrogenase

from Neurospora crassa mitochondria. Journal of Biological Chemistry, 276(6):3947–3951, 2001.

Xianzhi Meng and Arthur Jonas Ragauskas. Recent advances in understanding the role of cellulose

accessibility in enzymatic hydrolysis of lignocellulosic substrates. Current Opinion in Biotechnology,

27:150–158, 2014.

Huaiyu Mi, Anushya Muruganujan, and Paul D Thomas. PANTHER in 2013: modeling the evolution of

gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Research,

41(D1):D377–D386, 2012.

Huaiyu Mi, Xiaosong Huang, Anushya Muruganujan, Haiming Tang, Caitlin Mills, Diane Kang, and

Paul D Thomas. PANTHER version 11: expanded annotation data from Gene Ontology and Reactome

pathways, and data analysis tool enhancements. Nucleic Acids Research, 45(D1):D183–D189, 2016.

Zewei Miao, Yogendra Shastri, Tony E Grift, Alan C Hansen, and KC Ting. Lignocellulosic biomass

feedstock transportation alternatives, logistics, equipment configurations, and modeling. Biofuels,

Bioproducts and Biorefining, 6(3):351–362, 2012.

Wouter J Middelhoven, Ilona M de Jong, and Marleen de Winter. Arxula adeninivorans, a yeast as-

similating many nitrogenous and aromatic compounds. Antonie van Leeuwenhoek, 59(2):129–137,

1991.

Wouter J Middelhoven, Alex Coenen, Bart Kraakman, and Maarten D Sollewijn Gelpke. Degradation of

some phenols and hydroxybenzoates by the imperfect ascomycetous yeasts Candida parapsilosis and

Arxula adeninivorans: evidence for an operative gentisate pathway. Antonie van Leeuwenhoek, 62(3):

181–187, 1992.

Eugenia Mileykovskaya, Pawel A Penczek, Jia Fang, Venkata KPS Mallampalli, Genevieve C Sparagna,

and William Dowhan. Arrangement of the respiratory chain complexes in Saccharomyces cerevisiae

supercomplex III2IV2 revealed by single particle cryo-electron microscopy (EM). Journal of Biological

Chemistry, pages jbc–M112, 2012.

André Ribas de Miranda. Deleção do gene PGI1 da levedura Pichia stipitis para aumentar o rendimento

fermentativo a etanol. 2011.

S Mitsuhashi and JO Lampen. Conversion of D-xylose to D-xylulose in extracts of Lactobacillus pentosus.

Journal of Biological Chemistry, 204(2):1011–1018, 1953.

Monica L Mo, Bernhard Ø Palsson, and Markus J Herrgård. Connecting extracellular metabolomic

measurements to intracellular flux states in yeast. BMC Systems Biology, 3(1):37, 2009.

Md Moktaduzzaman, Silvia Galafassi, Claudia Capusoni, Ileana Vigentini, Zhihao Ling, Jure Piškur,

and Concetta Compagno. Galactose utilization sheds new light on sugar metabolism in the sequenced

strain Dekkera bruxellensis CBS 2499. FEMS Yeast Research, 15(2), 2015.

Jonathan Monk, Juan Nogales, and Bernhard O Palsson. Optimizing genome-scale network reconstruc-

tions. Nature Biotechnology, 32(5):447, 2014.

Jonathan M Monk, Pep Charusanti, Ramy K Aziz, Joshua A Lerman, Ned Premyodhin, Jeffrey D

Orth, Adam M Feist, and Bernhard Ø Palsson. Genome-scale metabolic reconstructions of multiple

Escherichia coli strains highlight strain-specific adaptations to nutritional environments. Proceedings

of the National Academy of Sciences, 110(50):20338–20343, 2013.

Camila G Morais, Raquel M Cadete, Ana Paula T Uetanabaro, Luiz H Rosa, Marc-André Lachance,

and Carlos A Rosa. D-xylose-fermenting and xylanase-producing yeast species from rotting wood of

two Atlantic Rainforest habitats in Brazil. Fungal Genetics and Biology, 60:19–28, 2013.

Camila G Morais, Thiago M Batista, Jacek Kominek, Beatriz M Borelli, Carolina Furtado, Rennan G

Moreira, Gloria R Franco, Luiz H Rosa, César Fonseca, Chris T Hittinger, et al. Spathaspora boniae

sp. nov., a D-xylose-fermenting species in the Candida albicans/Lodderomyces clade. International

Journal of Systematic and Evolutionary Microbiology, 67(10):3798–3805, 2017.

Lucia Morales, Benjamin Noel, Betina Porcel, Marina Marcet-Houben, Marie-Francoise Hullo, Christine

Sacerdot, Fredj Tekaia, Veronique Leh-Louis, Laurence Despons, Varun Khanna, et al. Complete

DNA sequence of Kuraishia capsulata illustrates novel genomic features among budding yeasts (Sac-

charomycotina). Genome Biology and Evolution, 5(12):2524–2539, 2013.

Gabriel Moreno-Hagelsieb and Kristen Latimer. Choosing BLAST options for better detection of or-

thologs as reciprocal best hits. Bioinformatics, 24(3):319–324, 2008.

Shigetarou Mori, Shigeyuki Kawai, Feng Shi, Bunzo Mikami, and Kousaku Murata. Molecular conversion

of NAD kinase to NADH kinase through single amino acid residue substitution. Journal of Biological

Chemistry, 280(25):24104–24112, 2005.

Y Morikawa, S Takasawa, I Masunaga, and K Takayama. Ethanol productions from D-xylose and

cellobiose by Kluyveromyces cellobiovorus. Biotechnology and Bioengineering, 27(4):509–513, 1985.

William C Morrell, Garrett W Birkel, Mark Forrer, Teresa Lopez, Tyler WH Backman, Michael Dussault,

Christopher J Petzold, Edward EK Baidoo, Zak Costello, David Ando, et al. The Experiment Data

Depot: a web-based software tool for biological experimental data storage, sharing, and visualization.

ACS Synthetic Biology, 6(12):2248–2259, 2017.

Stefanie Mühlhausen and Martin Kollmar. Molecular phylogeny of sequenced Saccharomycetes reveals

polyphyly of the alternative yeast codon usage. Genome Biology and Evolution, 6(12):3222–3237,

2014.

Stefanie Mühlhausen, Peggy Findeisen, Uwe Plessmann, Henning Urlaub, and Martin Kollmar. A

novel nuclear genetic code alteration in yeasts and the evolution of codon reassignment in eukaryotes.

Genome Research, 26(7):945–955, 2016.

Stefanie Mühlhausen, Hans Dieter Schmitt, Kuan-Ting Pan, Uwe Plessmann, Henning Urlaub, Lau-

rence D Hurst, and Martin Kollmar. Endogenous stochastic decoding of the CUG codon by competing

Ser- and Leu-tRNAs in Ascoidea asiatica. Current Biology, 2018.

Lam-Tung Nguyen, Heiko A Schmidt, Arndt von Haeseler, and Bui Quang Minh. IQ-TREE: a fast and

effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and

Evolution, 32(1):268–274, 2014.

Haiying Ni, José M Laplaza, and Thomas W Jeffries. Transposon mutagenesis to improve the growth of

recombinant Saccharomyces cerevisiae on D-xylose. Applied and Environmental Microbiology, 73(7):

2061–2066, 2007.

JN Nigam, RS Ireland, A Margaritis, and MA Lachance. Isolation and screening of yeasts that ferment

D-xylose directly to ethanol. Applied and Environmental Microbiology, 50(6):1486–1489, 1985.

V Nolleau, L Preziosi-Belloy, and JM Navarro. The reduction of xylose to xylitol by Candida guillier-

mondii and Candida parapsilosis: incidence of oxygen and pH. Biotechnology Letters, 17(4):417–422,

1995.

Richard A Notebaart, Balázs Szappanos, Bálint Kintses, Ferenc Pál, Ádám Györkei, Balázs Bogos,

Viktória Lázár, Réka Spohn, Bálint Csörgő, Allon Wagner, et al. Network-level architecture and the

evolutionary potential of underground metabolism. Proceedings of the National Academy of Sciences,

111(32):11762–11767, 2014.

Kevin P O’Brien, Maido Remm, and Erik LL Sonnhammer. Inparanoid: a comprehensive database of

eukaryotic orthologs. Nucleic Acids Research, 33(suppl_1):D476–D480, 2005.

Karin Öhgren, Renata Bura, Gary Lesnicki, Jack Saddler, and Guido Zacchi. A comparison between

simultaneous saccharification and fermentation and separate hydrolysis and fermentation using steam-

pretreated corn stover. Process Biochemistry, 42(5):834–839, 2007.

Tomoko Ohnishi. Factors controlling the occurrence of site I phosphorylation in C. utilis mitochondria.

FEBS Letters, 24(3):305–309, 1972.

Natsumi Okada, Ayumi Tanimura, Hideki Hirakawa, Masako Takashima, Jun Ogawa, and Jun Shima.

Draft genome sequences of the xylose-fermenting yeast Scheffersomyces shehatae NBRC 1983T and a

thermotolerant isolate of S. shehatae ATY839 (JCM 18690). Genome Announcements, 5(20):e00347–

17, 2017.

Tobias Österlund, Intawat Nookaew, and Jens Nielsen. Fifteen years of large scale metabolic modeling

of yeast: developments and impacts. Biotechnology Advances, 30(5):979–988, 2012.

Tobias Österlund, Intawat Nookaew, Sergio Bordel, and Jens Nielsen. Mapping condition-dependent

regulation of metabolism in yeast through genome-scale modeling. BMC Systems Biology, 7(1):1,

2013.

Karin M Overkamp, Barbara M Bakker, HY Steensma, Johannes P van Dijken, and Jack T Pronk. Two

mechanisms for oxidation of cytosolic NADPH by Kluyveromyces lactis mitochondria. Yeast, 19(10):

813–824, 2002.

Stephan Pabinger, Robert Rader, Rasmus Agren, Jens Nielsen, and Zlatko Trajanoski. MEMOSys: Bioin-

formatics platform for genome-scale metabolic models. BMC Systems Biology, 5(1):20, 2011.

Pengcheng Pan and Qiang Hua. Reconstruction and in silico analysis of metabolic network for an

oleaginous yeast, Yarrowia lipolytica. PLoS One, 7(12):e51535, 2012.

Ashok Pandey. Biofuels: alternative feedstocks and conversion processes. Academic Press, 2011.

Marta Papini, Intawat Nookaew, Mathias Uhlén, and Jens Nielsen. Scheffersomyces stipitis: a compar-

ative systems biology study with the Crabtree positive yeast Saccharomyces cerevisiae. Microbial Cell

Factories, 11(1):1, 2012.

Nicolas Papon, Vincent Courdavault, and Marc Clastre. Biotechnological potential of the fungal CTG

clade species in the synthetic biology era. Trends in Biotechnology, 32(4):167–168, 2014.

William R Parrish, Christopher J Stefan, and Scott D Emr. Essential role for the myotubularin-related

phosphatase Ymr1p and the synaptojanin-like phosphatases Sjl2p and Sjl3p in regulation of phos-

phatidylinositol 3-phosphate in yeast. Molecular Biology of the Cell, 15(8):3567–3579, 2004.

Herman J Pel, Johannes H de Winde, David B Archer, Paul S Dyer, Gerald Hofmann, Peter J Schaap,

Geoffrey Turner, Ronald P de Vries, Richard Albang, Kaj Albermann, et al. Genome sequencing and

analysis of the versatile cell factory Aspergillus niger CBS 513.88. Nature Biotechnology, 25(2):221,

2007.

Simon Penel, Anne-Muriel Arigon, Jean-François Dufayard, Anne-Sophie Sertier, Vincent Daubin, Lau-

rent Duret, Manolo Gouy, and Guy Perrière. Databases of homologous gene families for comparative

genomics. BMC Bioinformatics, 10(6):S3, 2009.

Rui Pereira, Jens Nielsen, and Isabel Rocha. Improving the flux distributions simulated with genome-

scale metabolic models of Saccharomyces cerevisiae. Metabolic Engineering Communications, 3:153–

163, 2016.

Thomas Nordahl Petersen, Søren Brunak, Gunnar von Heijne, and Henrik Nielsen. SignalP 4.0: dis-

criminating signal peptides from transmembrane regions. Nature Methods, 8(10):785, 2011.

Thomas Pfau, Maria Pires Pacheco, and Thomas Sauter. Towards improved genome-scale metabolic

network reconstructions: unification, transcript specificity and beyond. Briefings in Bioinformatics,

17(6):1060–1069, 2015.

Jure Piškur, Elżbieta Rozpędowska, Silvia Polakova, Annamaria Merico, and Concetta Compagno. How

did Saccharomyces evolve to become a good brewer? Trends in Genetics, 22(4):183–186, 2006.

Jure Piškur, Zhihao Ling, Marina Marcet-Houben, Olena P Ishchuk, Andrea Aerts, Kurt LaButti, Alex

Copeland, Erika Lindquist, Kerrie Barry, Concetta Compagno, et al. The genome of wine yeast

Dekkera bruxellensis provides a tool to explore its food-related properties. International Journal of

Food Microbiology, 157(2):202–209, 2012.

Esa Pitkänen, Paula Jouhten, Jian Hou, Muhammad Fahad Syed, Peter Blomberg, Jana Kludas, Merja

Oja, Liisa Holm, Merja Penttilä, Juho Rousu, et al. Comparative genome-scale reconstruction of

gapless metabolic networks for present and ancestral species. PLoS Computational Biology, 10(2):

e1003465, 2014.

Morgan N Price and Adam P Arkin. PaperBLAST: text mining papers for information about homologs.

mSystems, 2(4):e00039–17, 2017.

Nicholas C Price, Evelyn Stevens, and Paul M Rogers. Cofactor-dependence of phosphoglycerate mutase

activity in a variety of fungi. FEMS Microbiology Letters, 19(2-3):257–259, 1983.

Jack T Pronk, H Yde Steensma, and Johannes P van Dijken. Pyruvate metabolism in Saccharomyces

cerevisiae. Yeast, 12(16):1607–1633, 1996.

Estelle Proux-Wéra, David Armisén, Kevin P Byrne, and Kenneth H Wolfe. A pipeline for automated

annotation of yeast genome sequences by a conserved-synteny approach. BMC Bioinformatics, 13(1):

237, 2012.

Leszek P Pryszcz, Jaime Huerta-Cepas, and Toni Gabaldon. MetaPhOrs: orthology and paralogy predic-

tions from multiple phylogenetic evidence using a consistency-based confidence score. Nucleic Acids

Research, 39(5):e32–e32, 2010.

M. H. Pubols and Bernard Axelrod. Xylose assimilation in higher plants. Biochimica et Biophysica Acta,

36:582–583, 1959.

Mohammad Mubinur Rahman, Martina Andberg, Anu Koivula, Juha Rouvinen, and Nina Hakulinen.

The crystal structure of D-xylonate dehydratase reveals functional features of enzymes from the Ilv/ED

dehydratase family. Scientific Reports, 8(1):865, 2018.

Fernando Ramos, Mounir el Guezzar, Marcelle Grendon, and Jean-Marie Wiame. Mutations affect-

ing the enzymes involved in the utilization of 4-aminobutyric acid as nitrogen source by the yeast

Saccharomyces cerevisiae. European Journal of Biochemistry, 149(2):401–404, 1985.

C Ratledge. Microorganisms for lipids. Acta Biotechnologica, 11(5):429–438, 1991.

Aarthi Ravikrishnan and Karthik Raman. Critical assessment of genome-scale metabolic networks: the

need for a unified standard. Briefings in Bioinformatics, 16(6):1057–1068, 2015.

S Ravikumar and K Srikumar. Xerophytic Cereus pterogonus xylose isomerase is a thermostable enzyme.

Chemistry of Natural Compounds, 44(2):213, 2008.

Nikolai V Ravin, Michael A Eldarov, Vitaly V Kadnikov, Alexey V Beletsky, Jessica Schneider, Eugenia S

Mardanova, Elena M Smekalova, Maria I Zvereva, Olga A Dontsova, Andrey V Mardanov, et al.

Genome sequence and analysis of methylotrophic yeast Hansenula polymorpha DL1. BMC Genomics,

14(1):837, 2013.

Maido Remm, Christian EV Storm, and Erik LL Sonnhammer. Automatic clustering of orthologs and

in-paralogs from pairwise species comparisons. Journal of Molecular Biology, 314(5):1041–1052, 2001.

Renate Reuter, Manfred Naumann, Jörg Bär, Dieter Haferburg, and Gerhard Kopperschläger. Purifi-

cation, molecular and kinetic characterization of phosphofructokinase-1 from the yeast Schizosaccha-

romyces pombe: evidence for an unusual subunit composition. Yeast, 16(14):1273–1285, 2000.

Robert Riley, Sajeet Haridas, Kenneth H Wolfe, Mariana R Lopes, Chris Todd Hittinger, Markus Göker,

Asaf A Salamov, Jennifer H Wisecaver, Tanya M Long, Christopher H Calvey, et al. Comparative

genomics of biotechnologically important yeasts. Proceedings of the National Academy of Sciences,

113(35):9882–9887, 2016.

Manfred Rizzi, Katharina Harwart, Petra Erlemann, Ngoc-Anh Bui-Thanh, and Hanswerner Dellweg.

Purification and properties of the NAD+-xylitol-dehydrogenase from the yeast Pichia stipitis. Journal

of Fermentation and Bioengineering, 67(1):20–24, 1989.

J Carlos Roseiro, M Amália Peito, Francisco M Gírio, and MT Amaral-Collaço. The effects of the oxygen

transfer coefficient and substrate concentration on the xylose fermentation by Debaryomyces hansenii.

Archives of Microbiology, 156(6):484–490, 1991.

Elżbieta Rozpędowska, Linda Hellborg, Olena P Ishchuk, Furkan Orhan, Silvia Galafassi, Annamaria

Merico, Megan Woolfit, Concetta Compagno, and Jure Piškur. Parallel evolution of the make-

accumulate-consume strategy in Saccharomyces and Dekkera yeasts. Nature Communications, 2:302,

2011.

David Runquist, Bärbel Hahn-Hägerdal, and Maurizio Bettiga. Increased ethanol productivity in xylose-

utilizing Saccharomyces cerevisiae via a randomly mutagenized xylose reductase. Applied and Envi-

ronmental Microbiology, 76(23):7796–7802, 2010.

Brian J Rush and Arlene M Fosmer. Methods for succinate production, January 25 2013. US Patent

App. 14/374,464.

Olena B Ryabova, Oksana M Chmil, and Andrii A Sibirny. Xylose and cellobiose fermentation to

ethanol by the thermotolerant methylotrophic yeast Hansenula polymorpha. FEMS Yeast Research, 4

(2):157–164, 2003.

Leonidas Salichos and Antonis Rokas. Evaluating ortholog prediction algorithms in a yeast model clade.

PLoS One, 6(4):e18755, 2011.

Leonidas Salichos and Antonis Rokas. Inferring ancient divergences requires genes with strong phyloge-

netic signals. Nature, 497(7449):327, 2013.

Laura Salusjärvi, Mervi Toivari, Maija-Leena Vehkomäki, Outi Koivistoinen, Dominik Mojzita, Klaus

Niemelä, Merja Penttilä, and Laura Ruohonen. Production of ethylene glycol or glycolic acid from

D-xylose in Saccharomyces cerevisiae. Applied Microbiology and Biotechnology, 101(22):8151–8163,

2017.

Benjamin J Sanchez, Feiran Li, Eduard J Kerkhoven, and Jens Nielsen. SLIMEr: probing flexibility

of lipid metabolism in yeast with an improved constraint-based modeling framework. bioRxiv, page

324863, 2018.

NS Sánchez, M Calahorra, JC González-Hernández, and A Peña. Glycolytic sequence and respiration

of Debaryomyces hansenii as compared to Saccharomyces cerevisiae. Yeast, 23(5):361–374, 2006.

Sophie Sanchez, P Tafforeau, and Per E Ahlberg. The humerus of Eusthenopteron: a puzzling organiza-

tion presaging the establishment of tetrapod limb bone marrow. Proc. R. Soc. B, 281(1782):20140299,

2014.

Vicente Sanchis, Inmaculada Vinas, Ian N Roberts, David J Jeenes, Adrian J Watson, and David B

Archer. A pyruvate decarboxylase gene from Aspergillus parasiticus. FEMS Microbiology Letters, 117

(2):207–210, 1994.

AV Sarthy, BL McConaughy, Z Lobo, JA Sundstrom, CE Furlong, and BD Hall. Expression of the

Escherichia coli xylose isomerase gene in Saccharomyces cerevisiae. Applied and Environmental Mi-

crobiology, 53(9):1996–2000, 1987.

Devin R Scannell, Kevin P Byrne, Jonathan L Gordon, Simon Wong, and Kenneth H Wolfe. Multiple

rounds of speciation associated with reciprocal gene loss in polyploid yeasts. Nature, 440(7082):341,

2006.

Devin R Scannell, A Carolin Frank, Gavin C Conant, Kevin P Byrne, Megan Woolfit, and Kenneth H

Wolfe. Independent sorting-out of thousands of duplicated gene pairs in two yeast species descended

from a whole-genome duplication. Proceedings of the National Academy of Sciences, 104(20):8397–

8402, 2007.

GD Schellenberg, A Sarthy, AE Larson, MP Backer, JW Crabb, M Lidstrom, BD Hall, and CE Fur-

long. Xylose isomerase from Escherichia coli. Characterization of the protein and the structural gene.

Journal of Biological Chemistry, 259(11):6826–6832, 1984.

HC Schellenberg. Untersuchungen über das Verhalten einiger Pilze gegen Hemizellulosen. Flora oder

Allgemeine Botanische Zeitung, 98(3):257–308, 1908.

Paul Schimmel. GTP hydrolysis in protein synthesis: two for Tu? Science, 259(5099):1264–1266, 1993.

Thomas Schmitt, David N Messina, Fabian Schreiber, and Erik LL Sonnhammer. Letter to the editor:

SeqXML and OrthoXML: standards for sequence and orthology information. Briefings in Bioinfor-

matics, 12(5):485–488, 2011.

Adrian Schneider, Christophe Dessimoz, and Gaston H Gonnet. OMA browser–exploring orthologous

relations across 352 complete genomes. Bioinformatics, 23(16):2180–2182, 2007.

H Schneider, PY Wang, YK Chan, and R Maleszka. Conversion of D-xylose into ethanol by the yeast

Pachysolen tannophilus. Biotechnology Letters, 3(2):89–92, 1981.

Daniel Segre, Dennis Vitkup, and George M Church. Analysis of optimality in natural and perturbed

metabolic networks. Proceedings of the National Academy of Sciences, 99(23):15112–15117, 2002.

Marie Sémon and Kenneth H Wolfe. Consequences of genome duplication. Current Opinion in Genetics

& Development, 17(6):505–512, 2007.

Letícia MF Sena, Camila G Morais, Mariana R Lopes, Renata O Santos, Ana PT Uetanabaro, Paula B

Morais, Marcos JS Vital, Marcos A de Morais, Marc-André Lachance, and Carlos A Rosa. D-xylose

fermentation, xylitol production and xylanase activities by seven new species of Sugiyamaella. Antonie

van Leeuwenhoek, 110(1):53–67, 2017.

Byoung Boo Seo, Mathieu Marella, Takao Yagi, and Akemi Matsuno-Yagi. The single subunit NADH

dehydrogenase reduces generation of reactive oxygen species from complex I. FEBS Letters, 580(26):

6105–6108, 2006.

Kaithamana Shashi, Anand K Bachhawat, and Richard Joseph. ATP:citrate lyase of Rhodotorula gra-

cilis: purification and properties. Biochimica et Biophysica Acta, 1033(1):23–30, 1990.

Xing-Xing Shen, Xiaofan Zhou, Jacek Kominek, Cletus P Kurtzman, Chris Todd Hittinger, and Antonis

Rokas. Reconstructing the backbone of the Saccharomycotina yeast phylogeny using genome-scale

data. G3: Genes, Genomes, Genetics, 6(12):3927–3939, 2016.

Xing-Xing Shen, Chris Todd Hittinger, and Antonis Rokas. Contentious relationships in phylogenomic

studies can be driven by a handful of genes. Nature Ecology & Evolution, 1(5):0126, 2017.

Feng Shi, Shigeyuki Kawai, Shigetarou Mori, Emi Kono, and Kousaku Murata. Identification of ATP-

NADH kinase isozymes and their contribution to supply of NADP(H) in Saccharomyces cerevisiae.

The FEBS Journal, 272(13):3337–3349, 2005.

Nian-Qing Shi, Jose Cruz, Fred Sherman, and Thomas W Jeffries. SHAM-sensitive alternative respiration

in the xylose-metabolizing yeast Pichia stipitis. Yeast, 19(14):1203–1220, 2002.

B.P. Singh. Biofuel Crops: Production, Physiology and Genetics. CABI, 2013. ISBN 9781845938857.

URL https://books.google.ca/books?id=Bp10AZ_g2IsC.

Ram Sarup Singh and Amandeep Kaur Walia. Historical perspectives and public opinions. Biofuels:

Production and Future Perspectives, page 1, 2016.

Mamata S Singhvi, Shivani Chaudhari, and Digambar V Gokhale. Lignocellulose processing: a current

challenge. RSC Advances, 4(16):8271–8277, 2014.

Kerstin Skoog and Bärbel Hahn-Hägerdal. Effect of oxygenation on xylose fermentation by Pichia

stipitis. Applied and Environmental Microbiology, 56(11):3389–3394, 1990.

PJ Slininger, RJ Bothast, JE van Cauwenberge, and CP Kurtzman. Conversion of D-xylose to ethanol

by the yeast Pachysolen tannophilus. Biotechnology and Bioengineering, 24(2):371–384, 1982.

PJ Slininger, RJ Bothast, MR Okos, and MR Ladisch. Comparative evaluation of ethanol production by

xylose-fermenting yeasts presented high xylose concentrations. Biotechnology Letters, 7(6):431–436,

1985.

PJ Slininger, JE Branstrator, JM Lomont, BS Dien, MR Okos, MR Ladisch, and RJ Bothast. Stoi-

chiometry and kinetics of xylose fermentation by Pichia stipitis. Annals of the New York Academy of

Sciences, 589(1):25–40, 1990.

Jason C Slot and Antonis Rokas. Multiple gal pathway gene clusters evolved independently and by

different mechanisms in fungi. Proceedings of the National Academy of Sciences, 107(22):10136–10141,

2010.

Ian Small, Nemo Peeters, Fabrice Legeai, and Claire Lurin. Predotar: A tool for rapidly screening

proteomes for N-terminal targeting sequences. Proteomics, 4(6):1581–1590, 2004.

Vaclav Smil. Energy transitions: history, requirements, prospects. ABC-CLIO, 2010.

Seung Bum Sohn, Tae Yong Kim, Jay H Lee, and Sang Yup Lee. Genome-scale metabolic model of the

fission yeast Schizosaccharomyces pombe and the reconciliation of in silico/in vivo mutant growth.

BMC Systems Biology, 6(1):1, 2012.

Marco Sonderegger and Uwe Sauer. Evolutionary engineering of Saccharomyces cerevisiae for anaerobic

growth on xylose. Applied and Environmental Microbiology, 69(4):1990–1998, 2003.

Jean-Luc Souciet, Bernard Dujon, Claude Gaillardin, Mark Johnston, Philippe V Baret, Paul Cliften,

David J Sherman, Jean Weissenbach, Eric Westhof, Patrick Wincker, et al. Comparative genomics of

protoploid Saccharomycetaceae. Genome Research, 2009.

Maria João Sousa, Fernando Rodrigues, Manuela Côrte-Real, and Cecília Leão. Mechanisms underlying

the transport and intracellular metabolism of acetic acid in the presence of glucose in the yeast

Zygosaccharomyces bailii. Microbiology, 144(3):665–670, 1998.

Shyam Srinivasan, William R Cluett, and Radhakrishnan Mahadevan. Model-based design of bistable

cell factories for metabolic engineering. Bioinformatics, 1:9, 2017.

Mario Stanke, Rasmus Steinkamp, Stephan Waack, and Burkhard Morgenstern. AUGUSTUS: a web server

for gene finding in eukaryotes. Nucleic Acids Research, 32(suppl_2):W309–W312, 2004.

Yi-Kai Su, Laura B Willis, and Thomas W Jeffries. Effects of aeration on growth, ethanol and polyol

accumulation by Spathaspora passalidarum NRRL Y-27907 and Scheffersomyces stipitis NRRL Y-

7124. Biotechnology and Bioengineering, 112(3):457–469, 2015.

Hiroyuki Suga, Fumio Matsuda, Tomohisa Hasunuma, Jun Ishii, and Akihiko Kondo. Implementation of

a transhydrogenase-like shunt to counter redox imbalance during xylose fermentation in Saccharomyces

cerevisiae. Applied Microbiology and Biotechnology, 97(4):1669–1678, 2013.

Sung-Oui Suh, Christopher J Marshall, Joseph V Mchugh, and Meredith Blackwell. Wood ingestion by

passalid beetles in the presence of xylose-fermenting gut yeasts. Molecular Ecology, 12(11):3137–3145,

2003.

Jeet Sukumaran and Mark T Holder. DendroPy: a Python library for phylogenetic computing. Bioin-

formatics, 26(12):1569–1571, 2010.

Edyta Szewczyk, Alex Andrianopoulos, Meryl A Davis, and Michael J Hynes. A single gene produces

mitochondrial, cytoplasmic, and peroxisomal NADP-dependent isocitrate dehydrogenase in Aspergillus

nidulans. Journal of Biological Chemistry, 276(40):37722–37729, 2001.

Haiming Tang, Robert D Finn, and Paul D Thomas. TreeGrafter: phylogenetic tree-based annotation

of proteins with Gene Ontology terms and other annotations. Bioinformatics, page bty625, 2018. doi:

10.1093/bioinformatics/bty625. URL http://dx.doi.org/10.1093/bioinformatics/bty625.

Nuria Tarrío, Manuel Becerra, María Esperanza Cerdán, and María Isabel González Siso. Reoxidation

of cytosolic NADPH in Kluyveromyces lactis. FEMS Yeast Research, 6(3):371–380, 2006a.

Nuria Tarrío, M Esperanza Cerdán, and M Isabel González Siso. Characterization of the second external

alternative dehydrogenase from mitochondria of the respiratory yeast Kluyveromyces lactis. Biochimica

et Biophysica Acta, 1757(11):1476–1484, 2006b.

Roman L Tatusov, Eugene V Koonin, and David J Lipman. A genomic perspective on protein families.

Science, 278(5338):631–637, 1997.

Ines Thiele and Bernhard Ø Palsson. A protocol for generating a high-quality genome-scale metabolic

reconstruction. Nature Protocols, 5(1):93, 2010.

Ines Thiele, Catherine M Clancy, Almut Heinken, and Ronan MT Fleming. Quantitative systems phar-

macology and the personalized drug–microbiota–diet axis. Current Opinion in Systems Biology, 4:

43–52, 2017.

Paul D Thomas, Michael J Campbell, Anish Kejariwal, Huaiyu Mi, Brian Karlak, Robin Daverman,

Karen Diemer, Anushya Muruganujan, and Apurva Narechania. PANTHER: a library of protein families

and subfamilies indexed by function. Genome Research, 13(9):2129–2141, 2003.

J Michael Thomson, Eric A Gaucher, Michelle F Burgan, Danny W De Kee, Tang Li, John P Aris, and

Steven A Benner. Resurrecting ancestral alcohol dehydrogenases from yeast. Nature Genetics, 37(6):

630–635, 2005.

RH Thurston. The animal as a machine and prime mover. Science, 1(14):365–371, 1895.

Ansa Toivola, David Yarrow, Eduard Van Den Bosch, Johannes P Van Dijken, and W Alexander Schef-

fers. Alcoholic fermentation of D-xylose by yeasts. Applied and Environmental Microbiology, 47(6):

1221–1223, 1984.

B Tollens. On the carbohydrates of barley and malt, with special reference to the pentosans. The

behaviour of the pentosans during the preparation of malt, and during mashing and fermentation.

Journal of the Federated Institutes of Brewing, 4(4):438–454, 1898.

Màrius Tomàs-Gamisans, Pau Ferrer, and Joan Albiol. Fine-tuning the P. pastoris iMT1026 genome-

scale metabolic model for improved prediction of growth on methanol or glycerol as sole carbon sources.

Microbial Biotechnology, 11(1):224–237, 2018.

KL Träff, RR Otero Cordero, WH Van Zyl, and B Hahn-Hägerdal. Deletion of the GRE3 aldose reduc-

tase gene and its influence on xylose metabolism in recombinant strains of Saccharomyces cerevisiae

expressing the xylA and XKS1 genes. Applied and Environmental Microbiology, 67(12):5668–5674,

2001.

KL Träff-Bjerre, Marie Jeppsson, Bärbel Hahn-Hägerdal, and M-F Gorwa-Grauslund. Endogenous

NADPH-dependent aldose reductase activity influences product formation during xylose consumption

in recombinant Saccharomyces cerevisiae. Yeast, 21(2):141–150, 2004.

Hector Urbina and Meredith Blackwell. Multilocus phylogenetic study of the Scheffersomyces yeast

clade and characterization of the N-terminal region of xylose reductase gene. PLoS One, 7(6):e39128,

2012.

Hector Urbina, Robert Frank, and Meredith Blackwell. Scheffersomyces cryptocercus: a new xylose-

fermenting yeast associated with the gut of wood roaches and new combinations in the Sugiyamaella

yeast clade. Mycologia, 105(3):650–660, 2013.

Jan B Van Beilen and Enrico G Funhoff. Alkane hydroxylases involved in microbial alkane degradation.

Applied Microbiology and Biotechnology, 74(1):13–21, 2007.

Marco A van den Berg, Patricia de Jong-Gubbels, Christine J Kortland, Johannes P van Dijken, Jack T

Pronk, and H Yde Steensma. The two acetyl-coenzyme A synthetases of Saccharomyces cerevisiae differ

with respect to kinetic properties and transcriptional regulation. Journal of Biological Chemistry, 271

(46):28953–28959, 1996.

JS Van Dyk and BI Pletschke. A review of lignocellulose bioconversion using enzymatic hydrolysis

and synergistic cooperation between enzymes—factors affecting enzymes, conversion and synergy.

Biotechnology Advances, 30(6):1458–1480, 2012.

Johan H van Heerden, Meike T Wortel, Frank J Bruggeman, Joseph J Heijnen, Yves JM Bollen, Robert

Planqué, Josephus Hulshof, Tom G O’Toole, S Aljoscha Wahl, and Bas Teusink. Lost in transition:

start-up of glycolysis yields subpopulations of nongrowing cells. Science, 343(6174):1245114, 2014.

PIM Van Hoek, Johannes P Van Dijken, and Jack T Pronk. Effect of specific growth rate on fermentative

capacity of baker’s yeast. Applied and Environmental Microbiology, 64(11):4226–4233, 1998.

H Van Laer. A few words on the occurrence of furfural in alcoholic liquids. Journal of the Federated

Institutes of Brewing, 4(6):2–6, 1898.

Hendrik van Urk, WS Leopold Voll, W Alexander Scheffers, and Johannes P van Dijken. Transient-

state analysis of metabolic fluxes in Crabtree-positive and Crabtree-negative yeasts. Applied and

Environmental Microbiology, 56(1):281–287, 1990.

Jennifer Headman van Vleet, Thomas W Jeffries, and Lisbeth Olsson. Deleting the para-nitrophenyl

phosphatase (pNPPase), PHO13, in recombinant Saccharomyces cerevisiae improves growth and

ethanol production on D-xylose. Metabolic Engineering, 10(6):360–369, 2008.

JH van Vleet and Thomas W Jeffries. Yeast metabolic engineering for hemicellulosic ethanol production.

Current Opinion in Biotechnology, 20(3):300–306, 2009.

Eleonora Vandeska, Slobodanka Kuzmanova, Thomas W Jeffries, et al. Xylitol formation and key

enzyme activities in Candida boidinii under different oxygen transfer rates. Journal of Fermentation

and Bioengineering, 80(5):513–516, 1995.

Amit Varma and Bernhard O Palsson. Metabolic flux balancing: basic concepts, scientific and practical

use. Nature Biotechnology, 12(10):994, 1994.

Henrique César Teixeira Veras, Nádia Skorupa Parachin, and João Ricardo Moreira Almeida. Comparative

assessment of fermentative capacity of different xylose-consuming yeasts. Microbial Cell Factories,

16(1):153, 2017.

C Verduyn, R Van Kleef, J Frank, H Schreuder, JP Van Dijken, and WA Scheffers. Properties of the

NAD(P)H-dependent xylose reductase from the xylose-fermenting yeast Pichia stipitis. Biochemical

Journal, 226(3):669–677, 1985a.

Cornelis Verduyn, Johannes Frank Jzn, Johannes P van Dijken, and W Alexander Scheffers. Multiple

forms of xylose reductase in Pachysolen tannophilus CBS 4044. FEMS Microbiology Letters, 30(3):

313–317, 1985b.

Ritva Verho, Peter Richard, Per Harald Jonson, Lena Sundqvist, John Londesborough, and Merja

Penttilä. Identification of the first fungal NADP-GAPDH from Kluyveromyces lactis. Biochemistry,

41(46):13833–13838, 2002.

Ritva Verho, John Londesborough, Merja Penttilä, and Peter Richard. Engineering redox cofactor regeneration

for improved pentose fermentation in Saccharomyces cerevisiae. Applied and Environmental

Microbiology, 69(10):5892–5897, 2003.

Maarten D Verhoeven, Misun Lee, Lycka Kamoen, Marcel Van Den Broek, Dick B Janssen, Jean-Marc G

Daran, Antonius JA Van Maris, and Jack T Pronk. Mutations in PMR1 stimulate xylose isomerase

activity and anaerobic growth on xylose of engineered Saccharomyces cerevisiae by influencing manganese

homeostasis. Scientific Reports, 7:46155, 2017.

George Vernikos, Duccio Medini, David R Riley, and Herve Tettelin. Ten years of pan-genome analyses.

Current Opinion in Microbiology, 23:148–154, 2015.

Albert J Vilella, Jessica Severin, Abel Ureta-Vidal, Li Heng, Richard Durbin, and Ewan Birney.

EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome

Research, 19(2):327–335, 2009.

Vitchuporn Vongsuvanlert and Yoshiki Tani. Xylitol production by a methanol yeast, Candida boidinii

(Kloeckera sp.) no. 2201. Journal of Fermentation and Bioengineering, 67(1):35–39, 1989.

C Fredrik Wahlbom, Willem H van Zyl, Leif J Jönsson, Bärbel Hahn-Hägerdal, and Ricardo R Cordero

Otero. Generation of the improved recombinant xylose-utilizing Saccharomyces cerevisiae TMB 3400

by random mutagenesis and physiological comparison with Pichia stipitis CBS 6054. FEMS Yeast

Research, 3(3):319–326, 2003.

M Walfridsson, M Anderlund, X Bao, and B Hahn-Hägerdal. Expression of different levels of enzymes

from the Pichia stipitis XYL1 and XYL2 genes in Saccharomyces cerevisiae and its effects on product

formation during xylose utilisation. Applied Microbiology and Biotechnology, 48(2):218–224, 1997.

Mats Walfridsson, Johan Hallborn, Merja Penttilä, Sirkka Keränen, and Bärbel Hahn-Hägerdal.

Xylose-metabolizing Saccharomyces cerevisiae strains overexpressing the TKL1 and TAL1 genes encoding

the pentose phosphate pathway enzymes transketolase and transaldolase. Applied and Environmental

Microbiology, 61(12):4184–4190, 1995.

Mats Walfridsson, Xiaoming Bao, Mikael Anderlund, Gosta Lilius, L Bülow, and B Hahn-Hägerdal.

Ethanolic fermentation of xylose with Saccharomyces cerevisiae harboring the Thermus thermophilus

xylA gene, which expresses an active xylose (glucose) isomerase. Applied and Environmental

Microbiology, 62(12):4648–4651, 1996.

Hao Wang, Simonas Marcišauskas, Benjamín J Sánchez, Iván Domenzain, Daniel Hermansson, Rasmus

Agren, Jens Nielsen, and Eduard J Kerkhoven. RAVEN 2.0: A versatile toolbox for metabolic network

reconstruction and a case study on Streptomyces coelicolor. PLoS Computational Biology, 14(10):

e1006541, 2018.

Lisa Wasserstrom, Diogo Portugal-Nunes, Henrik Almqvist, Anders G Sandström, Gunnar Lidén, and

Marie F Gorwa-Grauslund. Exploring D-xylose oxidation in Saccharomyces cerevisiae through the

Weimberg pathway. AMB Express, 8(1):33, 2018.

MR Watson. Metabolic maps for the Apple II, 1984.

Na Wei, Josh Quarterman, Soo Rin Kim, Jamie HD Cate, and Yong-Su Jin. Enhanced biofuel production

through coupled acetic acid and xylose consumption by engineered yeast. Nature Communications, 4:

2580, 2013a.

Na Wei, Haiqing Xu, Soo Rin Kim, and Yong-Su Jin. Deletion of FPS1, encoding aquaglyceroporin

Fps1p, improves xylose fermentation by engineered Saccharomyces cerevisiae. Applied and Environmental

Microbiology, 79(10):3193–3201, 2013b.

Ralph Weimberg. Pentose oxidation by Pseudomonas fragi. Journal of Biological Chemistry, 236(3):

629–635, 1961.

Jared W Wenger, Katja Schwartz, and Gavin Sherlock. Bulk segregant analysis by high-throughput

sequencing reveals a novel xylose utilization gene from Saccharomyces cerevisiae. PLoS Genetics, 6

(5):e1000942, 2010.

John G White, Eileen Southgate, J Nichol Thomson, and Sydney Brenner. The structure of the nervous

system of the nematode Caenorhabditis elegans. Philosophical Transactions of the Royal Society B:

Biological Sciences, 314(1165):1–340, 1986.

J Wilkins and M Ebach. The nature of classification: relationships and kinds in the natural sciences.

Springer, 2013.

Dana J Wohlbach, Alan Kuo, Trey K Sato, Katlyn M Potts, Asaf A Salamov, Kurt M LaButti, Hui

Sun, Alicia Clum, Jasmyn L Pangilinan, Erika A Lindquist, et al. Comparative genomics of xylose-

fermenting fungi for enhanced biofuel production. Proceedings of the National Academy of Sciences,

108(32):13212–13217, 2011.

Ken Wolfe. Robustness–it’s not where you think it is. Nature Genetics, 25(1):3, 2000.

V Wood, R Gwilliam, M-A Rajandream, M Lyne, R Lyne, A Stewart, J Sgouros, N Peat, J Hayles,

S Baker, et al. The genome sequence of Schizosaccharomyces pombe. Nature, 415(6874):871, 2002.

Han Xiao, Zengyi Shao, Yu Jiang, Sudhanshu Dole, and Huimin Zhao. Exploiting Issatchenkia orientalis

SD108 for succinic acid production. Microbial Cell Factories, 13(1):121, 2014.

Haiqing Xu, Sooah Kim, Hagit Sorek, Youngsuk Lee, Deokyeol Jeong, Jungyeon Kim, Eun Joong Oh,

Eun Ju Yun, David E Wemmer, Kyoung Heon Kim, et al. PHO13 deletion-induced transcriptional activation

prevents sedoheptulose accumulation during xylose metabolism in engineered Saccharomyces

cerevisiae. Metabolic Engineering, 34:88–96, 2016.

Mark Yandell and Daniel Ence. A beginner’s guide to eukaryotic genome annotation. Nature Reviews

Genetics, 13(5):329, 2012.

Vina W Yang and Thomas W Jeffries. Purification and properties of xylitol dehydrogenase from the

xylose-fermenting yeast Candida shehatae. Applied Biochemistry and Biotechnology, 26(2):197–206,

1990.

Sylvia Hsu-Chen Yip and Ichiro Matsumura. Substrate ambiguous enzymes within the Escherichia coli

proteome offer different evolutionary solutions to the same problem. Molecular Biology and Evolution,

30(9):2001–2012, 2013.

Tiezheng Yuan, Yan Ren, Kun Meng, Yun Feng, Peilong Yang, Shaojing Wang, Pengjun Shi, Lei Wang,

Daoxin Xie, and Bin Yao. RNA-Seq of the xylose-fermenting yeast Scheffersomyces stipitis cultivated

in glucose or xylose. Applied Microbiology and Biotechnology, 92(6):1237–1249, 2011.

A-M Zeeman and HY Steensma. The acetyl co-enzyme A synthetase genes of Kluyveromyces lactis.

Yeast, 20(1):13–23, 2003.

Baoqi Zhang, Liandan Zheng, Jinping Lin, and Dongzhi Wei. Characterization of an ene-reductase from

Meyerozyma guilliermondii for asymmetric bioreduction of α,β-unsaturated compounds. Biotechnology

Letters, 38(9):1527–1534, 2016.

R-G Zhang, T Skarina, JE Katz, S Beasley, A Khachatryan, S Vyas, CH Arrowsmith, S Clarke, A Edwards,

A Joachimiak, et al. Structure of Thermotoga maritima stationary phase survival protein SurE:

a novel acid phosphatase. Structure, 9(11):1095–1106, 2001.

Hang Zhou, Jing-sheng Cheng, Benjamin L Wang, Gerald R Fink, and Gregory Stephanopoulos. Xylose

isomerase overexpression along with engineering of the pentose phosphate pathway and evolutionary

engineering enable rapid xylose utilization and ethanol production by Saccharomyces cerevisiae.

Metabolic Engineering, 14(6):611–622, 2012.

Xiaofan Zhou, Xing-Xing Shen, Chris Todd Hittinger, and Antonis Rokas. Evaluating fast maximum

likelihood-based phylogenetic programs using empirical phylogenomic data sets. Molecular Biology

and Evolution, 35(2):486–503, 2017.