Evolutionary History of Nickel-Dependent Enzymes:

Implications for the Origins of Life

Reem Hallak

Degree project in biology (60 credits), Master of Science (2 years), 2020
Biology Education Centre and Department of Earth Sciences, Palaeobiology, Uppsala University
Supervisors: Aodhán Butler and Anna Neubeck
External opponent: Brenda Irene Medina Jimenez

Table of Contents

ABSTRACT
INTRODUCTION
METHODOLOGY
  ▪ Brief Summary
  ▪ Operating Principle of HMMER
  ▪ Detailed Summary - Collecting Query Sequences
      a. From Published Study:
      b. From Databases:
  ▪ Searching for Homologs
      a. Concatenating the Sequences:
      b. Setting Up the Sequence Databases:
      c. Building the Profile HMMs + Conducting the Search:
      d. Visualizing the Results:
  ▪ Gene Trees
      a. Building the Trees:
      b. Annotation & Visualization of the Trees:
      c. Comparison of Tree Topologies:
  ▪ Ancestral State Reconstruction
RESULTS
  ▪ HMMER Results Summary
  ▪ Gene Trees + Ancestral State Reconstruction
      A. Redox Enzymes
          1. CO-methylating acetyl-CoA synthase (Enzymatic complex of ACS & CODH)
          2. Carbon monoxide dehydrogenase (Ni-CODH)
          3. Methyl-coenzyme M reductase (MCR)
          4. Superoxide dismutase (Ni-SOD)
          5. Coenzyme F420 hydrogenase (FrhABG): A Group 3a Ni-Fe hydrogenase
          6. Hydrogen:quinone oxidoreductase (or quinone-reactive Ni-Fe hydrogenase (Group 1))
          7. Ni-dependent hydrogenase (NiFeSe_Hases)
      B. Non-redox Enzymes
          1. Acireductone dioxygenase (Ni-ARD)
          2. Glyoxalase I (GloI or GlxI)
          3. Lactate racemase (LarA)
          4. Urease
  ▪ Tree Comparison
      → CO-methylating acetyl-CoA synthase
      → Carbon monoxide dehydrogenase (Ni-CODH)
  ▪ Supplementary Files
DISCUSSION
  ▪ General Trends in Ancestral State Reconstruction Results
  ▪ The Analogous Histories within Nickel Enzyme Groups: Single or Multiple Origins?
  ▪ Selective Pressures: Nickel Famine or the Great Oxidation Event?
  ▪ The Odd Case of Methyl-coenzyme M Reductase
  ▪ Partial Genes in the LCA of Prokaryotes
  ▪ Non-redox Enzymes: Convergent Evolution or LGT?
  ▪ Conflicting Tree Topologies: Evidence of Lateral Gene Transfer (LGT)?
  ▪ Tree Comparison Results
      → CO-methylating acetyl-CoA synthase
      → Carbon monoxide dehydrogenase (Ni-CODH)
  ▪ Implications for the Origin of Life
      → Why are nickel enzymes interesting for the origins of life?
      → Implications from this study
      → In what context do these new results matter?
  ▪ Inherent Assumptions of the Models
CONCLUSION
ACKNOWLEDGEMENTS
REFERENCES
LIST OF APPENDICES
  Appendix A: Hits
  Appendix B: Lingula reevi Hits
  Appendix C: Occupancy Matrices
  Appendix D: Gene Trees
  Appendix E: Ancestral State Reconstruction (ASR) Trees
  Appendix F: Tanglegrams
  Appendix G: Reference Tree

ABSTRACT

Nickel enzymes have been suggested, through numerous phylogenetic studies, to have been among the very first catalytic compounds on the early Earth, possibly present in the last universal common ancestor (LUCA) or prior to the onset of life. This is because of the type of reactions catalyzed by some of these enzymes, the nature of organisms that utilize them, their distribution in the tree of life, and their key roles in what is now thought of as possibly one of the oldest carbon fixation pathways, the Wood-Ljungdahl (WL) pathway. Additionally, nickel is generally thought to have been an abundant element on the early Earth, highly soluble in what were, theoretically, euxinic (anoxic and sulfidic) ocean waters.

Combined with the fact that the enzymes involved in the WL pathway have an active-center configuration resembling that of minerals found in hydrothermal vent walls, this makes nickel enzymes likely candidates to have evolved from proto-enzymes that were responsible for the prebiotic catalysis of the first simple organic molecules prior to the origins of life, according to the so-called submarine alkaline hydrothermal vent theory, first presented by Michael J. Russell in 1993 (Russell et al. 1994).

In this study, I expand the known coverage of the distribution of these enzymes by mapping them in 10,575 OTUs of microbial taxa. Using their pattern of distribution, I reconstruct their histories along the branches of a reference phylogeny of the same taxa through ancestral reconstruction of discrete traits. Additionally, I construct an individual gene tree for each of the enzymes in order to consolidate gene history with species history.

My results showed that the redox nickel enzymes (except methyl-coenzyme M reductase) are ancestral to all prokaryotes, while non-redox enzymes are derived and have multiple origins, possibly due to lateral gene transfer events or convergent evolution. I propose that the patterns observed are a product of the drastic changes during early Earth history, namely a hypothesized “nickel famine” or the Great Oxidation Event, which acted as selective pressures.

INTRODUCTION

Nickel (Ni) is a transition metal and the fifth-most common element on Earth, where it occurs naturally as nickel oxides, nickel sulfides or nickel silicates (Nickel Institute 2012). It is found abundantly in the earth’s core (around 8% of the core) and in smaller amounts in the earth’s crust (0.01%) (Walsh & Orme-Johnson 1987) and aqueous systems (Sawers 2013). It is also very commonly found in meteorites along with iron (Brown & Patterson 1947).

Biologically, it is an important trace metal as it comprises the metal cofactor of ten enzymes discovered to date (Siegbahn et al. 2019), which are:

1. Urease

2. Ni-Fe hydrogenase

3. Carbon monoxide dehydrogenase (Ni-CODH)

4. Acetyl-coenzyme A synthase (ACS)

5. Superoxide dismutase (Ni-SOD)

6. Methyl-coenzyme M reductase (MCR)

7. Glyoxalase I (GloI or GlxI)

8. Lactate racemase (LarA)

9. Acireductone dioxygenase (Ni-ARD)

10. Quercetin 2,4-dioxygenase (Ni-QueD)

Despite playing critical roles in many living organisms, such as , , fungi, algae, and higher (Boer et al. 2014), the first protein shown to contain functionally significant nickel, jack bean urease, was discovered only in 1975 (Walsh & Orme-Johnson 1987). This surprising discovery came after 50 years of mistakenly thinking that ureases consist entirely of protein, because urease was the first enzyme that scientists were able to crystallize and was used as the textbook proof that enzymes can be composed of pure protein (B. Zamble 2015).

Therefore, it follows that what we know about nickel enzymes is “relatively” recent. However, the knowledge on the composition, functions and biochemistries of these enzymes is rather broad and is continuously being investigated and expanded (Siegbahn et al. 2019).

Nickel enzymes can be split into two groups depending on how nickel behaves during the reaction:

1. Redox enzymes, where nickel (Ni2+) is able to cycle through three redox states (1+, 2+, 3+) and can catalyze reactions over a potential range of 1.5 V (Ragsdale 2009).

2. Non-redox enzymes, where nickel (Ni2+) acts as a Lewis acid (Alfano & Cavazza 2020).

Due to this plasticity in nickel redox chemistry and coordination, nickel enzymes play a wide variety of roles (Table 1), some of which are important on a global biogeochemical scale. For example, CO dehydrogenase, acetyl-CoA synthase, and methyl-coenzyme M reductase are key contributors to the carbon cycle, catalyzing key steps in carbon fixation and reduction (Ragsdale 2007). Methyl-coenzyme M reductase (MCR) is also obligatory for growth and methane synthesis in methanogenic archaea (Walsh & Orme-Johnson 1987, Duin 2009). The diverse nickel-dependent hydrogenases are essential parts of prokaryotic redox reactions (Alfano & Cavazza 2020). Ureases participate in nitrogen metabolism by breaking down urea and releasing its nitrogen as ammonia (NH3), which certain microorganisms use for growth (Mulrooney & Hausinger 2003).

Table 1 The Reactions and Functions of Nickel Enzymes

| Enzyme | Reaction [1,3,6,9] | Function | Comments |
|---|---|---|---|
| Acetyl-coenzyme A synthase | CO + CoA-S− + CH3-Co3+FeSP → CH3C(O)-S-CoA + Co+FeSP | Autotrophic carbon fixation/energy conservation [1] | Part of the Wood-Ljungdahl pathway [1] |
| Carbon monoxide dehydrogenase | CO + H2O → CO2 + 2H+ + 2e− | Allows microorganism to use CO & CO2 as a source of carbon and energy [2,3] | Part of the global carbon cycle & the Wood-Ljungdahl pathway [1] |
| Methyl-coenzyme M reductase | CH3-SCoM + CoBSH → CH4 + CoBS-SCoM | Catalyzes final step of methanogenesis in archaea | Part of global carbon cycle/responsible for all biologically produced methane on Earth [1,3] |
| Superoxide dismutase | 2 O2•− + 2H+ → H2O2 + O2 | Protects against reactive oxygen species [2] | No sequence homology with other SODs [1]/Thought to have evolved around the GOE [4,5] |
| Coenzyme F420 hydrogenase | H2 + coenzyme F420 → reduced coenzyme F420 | Methanogenesis [6]/Metabolism of complex organic molecules [7] | Might contribute to the ability of mycobacteria to persist in challenging environments [8] |
| Quinone-reactive Ni-Fe hydrogenase | H2 + quinone → quinol | Reduces quinones [9] | Quinones are electron carriers in bacterial electron transport chains |
| Nickel-dependent hydrogenase | H2 → 2H+ + 2e− | Vital part of redox reactions in prokaryotes [1,2] | Phylogenetically distinct from iron-only hydrogenases [1] |
| Acireductone dioxygenase | Acireductone + O2 → 3-methylthiopropanoate + formate + CO | Regulating metabolism [10] | Part of the methionine salvage pathway |
| Glyoxalase I | CH3-CO-C(OH)-SG → CH3-CH(OH)-CO-SG | Detoxification [1] | Found in , such as plants and (Trypanosoma and Leishmania) [1] |
| Lactate racemase | L-lactate → D-lactate | Thought essential for integrity of some microbial cell walls under stress conditions [11] | Most recently discovered nickel enzyme (2014) [11] |
| Quercetin 2,4-dioxygenase | Quercetin → 2-protocatechuoylphloroglucinol carboxylic acid | Potentially detoxifies flavonol [12] | Only found in Streptomyces sp. strain FLA [12] |
| Urease | (NH2)2CO + H2O → CO2 + 2 NH3 | Nitrogen fixation/Virulence factor for certain pathogens [2,3] | Part of the global nitrogen cycle |

[1] (Alfano & Cavazza 2020)
[2] (Ragsdale 2009)
[3] (Boer et al. 2014)
[4] Great Oxidation Event
[5] (Dupont et al. 2008)
[6] (Peters et al. 2015)
[7] (Ney et al. 2017)
[8] (Greening et al. 2016)
[9] (Ferber & Maier 1993, Bernhard et al. 1997)
[10] (Maroney & Ciurli 2014, Sekowska et al. 2018)
[11] (Desguin et al. 2014)
[12] (Nianios et al. 2015)

Overall, there is no shortage of works that describe and characterize all known nickel-dependent enzymes and their associated transporters (Eitinger et al. 2005, Rodionov et al. 2006, Boer et al. 2014, Maroney & Ciurli 2014, Siegbahn et al. 2019). Still lacking, however, is a comprehensive view of the distribution and evolution of these enzymes in the tree of life.

Research on the subject is sparse, with the most recent studies dating back to 2009 and 2010 (Zhang et al. 2009, Zhang & Gladyshev 2010), the first of which examined a relatively small sample of only 740 organisms across the three domains of life (Zhang et al. 2009). These studies confirmed what was already suspected about the distribution of nickel enzymes: they are mostly found in prokaryotes and are lacking in the majority of eukaryotes, except for some plants and fungi (Zhang et al. 2009).

According to Zhang et al. (2009), nearly all bacterial phyla sampled in their study utilized nickel enzymes except for groups that have a parasitic lifestyle and small genomes, like all members of Chlamydiae and Alphaproteobacteria/Rickettsiales (obligate intracellular parasites) and most members of /Mollicutes and (extracellular parasites). Archaea showed an even wider phylogenetic distribution of nickel enzymes among their phyla, especially among methanogens (present in all 18 sequenced methanogenic archaea) (Zhang et al. 2009). In eukaryotes, on the other hand, utilization of nickel enzymes seemed to be absent in almost all sequenced phyla, except some belonging to Stramenopiles, Viridiplantae/Chlorophyta+Streptophyta, Fungi/++ and Metazoa/Coelomata/Others (Zhang et al. 2009).

Urease appeared to be the most widespread nickel enzyme in bacteria, whereas in archaea, where urease was rare or even absent, the most widespread was Ni-Fe hydrogenase (Zhang & Gladyshev 2010). Additionally, urease seems to be the only known nickel-dependent enzyme used by eukaryotes (Zhang et al. 2009), apart from a few rare examples of other enzymes, such as glyoxalase I (GloI) in Leishmania major (eukaryotic parasite) (Maroney & Ciurli 2014) and Oryza sativa (rice) (Mustafiz et al. 2014), and superoxide dismutase (Ni-SOD) in Ostreococcus tauri and Ostreococcus lucimarinus (marine eukaryotes - green algae), which was only explored at the gene sequence level (in silico) (Ragsdale 2009).

Finally, Zhang et al. (2009) concluded that, in prokaryotes, a host-associated lifestyle and/or a small genome with low GC content (<40%) is correlated with the loss of Ni utilization. As for eukaryotes, the lack of nickel utilization in almost all phyla except some lower eukaryotes and land plants was suggested to be due to the nature of the reactions that most nickel-dependent enzymes catalyze, that is, anaerobic reactions that are not utilized by most eukaryotes, which depend on oxygen to survive (Zhang et al. 2009).

In this study, I search the proteomes of 10,575 taxa from a reference phylogeny by Zhu et al. (2019) for nickel enzyme homologs in order to uncover their phyletic pattern (or presence-absence pattern). I then estimate their evolutionary history via ancestral reconstruction methods over the branches of said phylogeny, as well as via the construction of an individual gene tree for each of the enzymes, in order to consolidate gene history with species history.

I aim to answer the following:

1. Do these enzymes have single or multiple origins? (Pattern of evolution)

2. What were some of the selective pressures that might have led to the observed patterns of evolution?

3. Is there evidence of lateral gene transfer?

4. What are the implications of my results in relation to origin of life models?


METHODOLOGY

▪ Brief Summary

The scope of this project is to conduct a comparative genomic analysis of nickel-dependent enzymes in microbial genomes. The aim is to map their phyletic patterns (presence-absence patterns) in as large a taxonomic sample as possible, using a dataset more than an order of magnitude larger than the 740 organisms examined by Zhang et al. (2009), and to infer their origins and history through ancestral reconstruction methods and molecular phylogenetic analysis of the enzyme genes.

The methodology utilizes the HMMER3 software (www.hmmer.org) and consists of building representative “profiles” out of multiple sequence alignments (MSAs) of nickel-dependent enzymes that belong to the same protein group (for example, ureases or Ni-Fe hydrogenases), and using these profiles to search for homologs in a given sequence database. The presence/absence of the enzymes will then be mapped as a discrete trait on a phylogenetic tree (from published works) and traced back in time using ancestral reconstruction methods. Furthermore, gene trees will be built for each enzyme group to examine the evolutionary relationships of the amino acid sequences of the enzymes and to address possible events of lateral gene transfer.

▪ Operating Principle of HMMER

HMMER is a software package that, similar to NCBI’s BLAST and others, is a tool to search for homology among protein or nucleotide sequences in a given sequence database.

HMMER works by first creating a profile (or profile hidden Markov model, profile HMM) out of the query sequences (in the form of a multiple sequence alignment) that are fed to the program, and then using that profile to search a database for target sequences. What sets HMMER apart from other tools, such as NCBI’s BLAST, is that these profile HMMs are probabilistic, position-specific scoring models that assign a likelihood of observing each nucleotide or amino acid, as well as the frequency of insertions/deletions, at each column of a multiple sequence alignment. The likelihood values are derived from priors (probabilities based on established knowledge) and depend on the position in question. This is in contrast to substitution scoring matrices such as BLOSUM and PAM, used by pairwise sequence alignment methods such as BLAST and FASTA, which are position-independent and penalize substitutions, insertions and deletions equally regardless of where they occur in an alignment (Eddy 2019). Additionally, profile HMMs have a large number of parameters (at least 22) for each one of the approximately 200 or so positions in a typical protein , and these parameters are estimated from each particular family alignment. This contrasts with a total of 210 parameters used in BLOSUM and PAM amino acid substitution matrices, which are averaged over large collections of different known sequence alignments (Eddy 2019).
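To make the contrast between position-specific and position-independent scoring concrete, the following minimal Python sketch builds per-column log-odds scores from a toy alignment. It is an illustration only and not HMMER's implementation: the uniform background frequency, the flat pseudocount (standing in for the priors), the toy sequences and the function names `column_log_odds` and `build_profile` are all assumptions, and insertion/deletion parameters are left out. The last lines reproduce the parameter counts quoted above.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_log_odds(column, pseudocount=1.0):
    """Log2-odds score of each residue in one alignment column.

    Assumptions: uniform background (1/20) and a flat pseudocount that
    stands in for the prior. HMMER itself uses more elaborate priors.
    """
    counts = Counter(res for res in column if res in AMINO_ACIDS)  # gaps ignored
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    background = 1.0 / len(AMINO_ACIDS)
    return {
        aa: math.log2(((counts[aa] + pseudocount) / total) / background)
        for aa in AMINO_ACIDS
    }


def build_profile(alignment):
    """One score table per alignment column (match emissions only)."""
    return [column_log_odds(column) for column in zip(*alignment)]


# Toy alignment (hypothetical sequences, not data from this study).
toy_msa = ["HCGH", "HCAH", "HCSH", "HMGH"]
profile = build_profile(toy_msa)

# The same residue scores differently depending on the column it falls in,
# which a single substitution matrix cannot express.
print(round(profile[0]["H"], 2), round(profile[2]["H"], 2))

# Parameter counts quoted in the text.
substitution_matrix_params = 20 * 21 // 2   # symmetric 20 x 20 matrix
profile_params_lower_bound = 22 * 200       # >= 22 per position, ~200 positions
print(substitution_matrix_params, profile_params_lower_bound)
```

Histidine is rewarded in the first column, where it is fully conserved, and penalized in the third, where it never occurs.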

Moreover, unlike most sequence homology search tools, HMMER does not rely on a single optimal high-scoring alignment. Instead, it uses ensemble algorithms that consider all possible alignments, weight each according to its relative likelihood, and calculate log-odds scores summed over the posterior alignment ensemble, that is, over the collective posterior probabilities of the possible alignments. This greatly enhances the power of statistical inference because the uncertainty of each possible alignment is taken into account, as alignments are inherently approximations themselves (Eddy 2019).
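The difference between scoring one optimal alignment and scoring the whole ensemble can be sketched with a toy hidden Markov model. In the Python example below, the single best state path (the Viterbi score) is compared with the sum over every path (the Forward score). The two-state model, its probabilities and the function name `score` are hypothetical and far simpler than the profile architecture HMMER uses; the comparison against a null model that turns these probabilities into log-odds scores is also omitted.

```python
import math

# Toy two-state HMM (hypothetical numbers, for illustration only).
STATES = ("match", "insert")
START = {"match": 0.9, "insert": 0.1}
TRANS = {
    "match": {"match": 0.8, "insert": 0.2},
    "insert": {"match": 0.6, "insert": 0.4},
}
EMIT = {
    "match": {"H": 0.7, "C": 0.2, "G": 0.1},
    "insert": {"H": 0.2, "C": 0.3, "G": 0.5},
}


def score(sequence, combine):
    """Dynamic programming over state paths.

    combine=max keeps only the single best path (Viterbi);
    combine=sum adds up the probability of every path (Forward).
    """
    prev = {s: START[s] * EMIT[s][sequence[0]] for s in STATES}
    for symbol in sequence[1:]:
        prev = {
            s: EMIT[s][symbol] * combine(prev[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return combine(prev.values())


seq = "HGCH"
best_path = score(seq, max)
all_paths = score(seq, sum)
print(f"best path: log2 P = {math.log2(best_path):.3f}")
print(f"all paths: log2 P = {math.log2(all_paths):.3f}")
print(f"share of probability carried by the best path: {best_path / all_paths:.2f}")
```

The ensemble score is never lower than the best-path score, and the gap between them grows when many alternative alignments are nearly as plausible as the best one.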

All the previously mentioned attributes of the software give it an exceptional sensitivity in detecting even remote homologs in a timely manner, making it a very suitable tool when working on a specific sequence family, such as nickel enzymes in this project.


▪ Detailed Summary - Collecting Query Sequences

a. From Published Study:

Nickel enzyme sequences from the study “Comparative genomic analyses of nickel, cobalt and vitamin B12 utilization” (Zhang et al. 2009) were chosen as query sequences. This was to ensure that the sequences used to build profile HMMs for this study were reliably those of nickel-dependent enzymes, as data from published works are both more reliable and more convenient than other customary but time-consuming ways of collecting sequences, such as searching for homologs with BLAST.

At least 16 amino acid sequences (and up to 145) were extracted for each of six nickel-dependent enzyme groups out of the ten groups that are known (Siegbahn et al. 2019). Sequence data on the four remaining enzymes were lacking in the paper. The extracted sequences were sorted into their respective groups: urease, Ni-Fe hydrogenase (further sorted into subgroups), carbon monoxide dehydrogenase, CO-methylating acetyl-CoA synthase, superoxide dismutase, and methyl-coenzyme M reductase.

Each file of sequences was then aligned using the MAFFT G-INS-i (version 7) online alignment tool (https://mafft.cbrc.jp/alignment/server) in order to construct multiple sequence alignment files of each enzyme group (in FASTA and CLUSTAL format).

b. From Databases:

Sequences for the four remaining enzyme groups (glyoxalase I, lactate racemase, quercetin 2,4-dioxygenase and acireductone dioxygenase), which were not present among the data in the Zhang et al. (2009) paper, were instead retrieved from the UniProt database (https://www.uniprot.org/). The number of sequences for each enzyme group ranged from 11 to 25, except for quercetin 2,4-dioxygenase, for which only one sequence was retrieved, from Streptomyces sp. FLA. This is because the latter is the only known occurrence of this enzyme thus far (Merkens et al. 2008, Siegbahn et al. 2019), making it only the second nickel-dependent dioxygenase known, after nickel acireductone dioxygenase (Merkens et al. 2008). As in the previous step, each of the four sequence files was aligned using the MAFFT G-INS-i (version 7) online alignment tool and downloaded as a FASTA file (some in CLUSTAL format).

▪ Searching for Homologs

The following steps were performed on Ubuntu 18.04.4 LTS (Bionic Beaver) (https://ubuntu.com/), run in Oracle VM VirtualBox, version 6.1.12 r139181 (Qt5.6.2) (https://www.virtualbox.org/).

a. Concatenating the Sequences:

Prior to concatenation, poorly aligned sequences were trimmed from the MSAs using trimAl (version 3), a tool for automated alignment trimming in large-scale phylogenetic analyses. For further inspection, trimmed alignments were viewed with AliView (version 2019), an alignment viewer and editor for large datasets. Trimmed sequences were concatenated into a supermatrix with partitions using the FASconCAT (v1.04) software package.
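As a rough illustration of what concatenation into a partitioned supermatrix involves, the Python sketch below joins per-gene alignments end to end, pads taxa that are missing from a gene with gap characters, and records the column range of each partition. It is not FASconCAT's code; the function name `concatenate`, the input layout and the toy sequences are assumptions made for the example.

```python
def concatenate(alignments):
    """Join per-gene alignments into one supermatrix.

    alignments: {gene_name: {taxon: aligned_sequence}}
    Returns (supermatrix, partitions). Taxa missing from a gene are padded
    with gaps so that every row ends up with the same length.
    """
    taxa = sorted({taxon for aln in alignments.values() for taxon in aln})
    supermatrix = {taxon: "" for taxon in taxa}
    partitions = []
    start = 1
    for gene, aln in alignments.items():
        length = len(next(iter(aln.values())))
        if any(len(seq) != length for seq in aln.values()):
            raise ValueError(f"{gene}: sequences are not aligned to one length")
        for taxon in taxa:
            supermatrix[taxon] += aln.get(taxon, "-" * length)
        partitions.append((gene, start, start + length - 1))  # 1-based, inclusive
        start += length
    return supermatrix, partitions


# Toy input (hypothetical gene and taxon names).
toy = {
    "geneA": {"tax1": "MK-L", "tax2": "MKAL"},
    "geneB": {"tax1": "HH", "tax3": "HC"},
}
supermatrix, partitions = concatenate(toy)
for taxon, row in supermatrix.items():
    print(taxon, row)
for gene, first, last in partitions:
    print(f"{gene} = {first}-{last}")
```

Recording the partition boundaries keeps each gene identifiable inside the concatenated alignment, so that later steps can treat the genes separately.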

b. Setting Up the Sequence Databases:

Since most nickel-dependent enzymes are found in prokaryotes (Zhang et al. 2009), I used the extensive protein sequence data from the recent Zhu et al. (2019) study, “Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea”, as a sequence database against which to search for nickel-enzyme homologs. This study was chosen because of its large number of assembled genomes (10,575, of which 9906 are bacteria and 669 are archaea) and its associated phylogeny based on 381 globally sampled marker genes, which maximizes the scope of my search. The search was also run against sequence databases of eukaryotes, using data from the following studies:


- Phylogenomics of Lophotrochozoa with Consideration of Systematic Error (Kocot et al. 2017) (N.B. [13]: 10 taxa were excluded due to problems with translating the sequences to amino acids. The taxa were: Brachionus plicatilis, Macrodasys sp., Bugula neritina, Megadasys sp., Cephalothrix linearis, pandora, Gnathostomula paradoxa, Tubulanus polymorphus-Halanych, Idiosepius paradoxus, Tubulanus polymorphus-Struck). (For results, refer to “Appendix A”)

The search was also run against the complete transcriptome of Lingula reevi from unpublished data by Aodhán Butler. (For results, refer to “Appendix B”)

All translation of sequences from nucleotides to amino acids was done using a Python script written by Warren Francis (https://bitbucket.org/wrf/sequences/src/master/prottrans.py). Translating nucleotides to amino acids avoids the degeneracy of codons and thereby increases the phylogenetic signal, since the methodology for this study consists of searching for conserved protein domains that evolve slowly in deep time.
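The effect of codon degeneracy can be shown with a short, self-contained Python sketch. It is not the prottrans.py script used in this study; the function name `translate`, the single-frame translation and the two toy sequences are assumptions for illustration. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(nucleotides):
    """Translate a coding sequence in frame 1; unknown codons become 'X'."""
    seq = nucleotides.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Two hypothetical coding sequences that differ only by synonymous changes.
seq_a = "ATGCTTCATTGT"
seq_b = "ATGTTGCACTGC"

nucleotide_differences = sum(x != y for x, y in zip(seq_a, seq_b))
print(nucleotide_differences, translate(seq_a), translate(seq_b))
```

The two sequences differ at several nucleotide positions yet encode the same peptide, so a comparison at the protein level is not distracted by synonymous substitutions.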

c. Building the Profile HMMs + Conducting the Search:

For building profile HMMs and conducting searches, I used “supermatrix” (https://github.com/wrf/supermatrix), a collection of scripts written by Warren Francis that use HMMER to build profile HMMs from multiple sequence alignment files and search for homologs against a sequence database that the user specifies. It then adds those homologs, or “hits”, to the original multiple sequence alignment so that they can be viewed together as one complete alignment (Francis 2020). Biopython (version 1.70+dfsg-4) (https://biopython.org/) was used to run the “supermatrix” scripts.
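For orientation, the two HMMER steps that such scripts wrap can be written out as plain command lines. The Python sketch below only assembles the `hmmbuild` and `hmmsearch` calls; it does not reproduce the “supermatrix” scripts, and the helper name `hmmer_commands`, the file names and the E-value threshold are hypothetical, since the exact options used in this study are not stated here.

```python
import shlex


def hmmer_commands(msa_path, database_path, prefix, evalue=1e-5):
    """Return the hmmbuild and hmmsearch command lines for one enzyme group."""
    hmm_path = f"{prefix}.hmm"
    build = ["hmmbuild", hmm_path, msa_path]  # alignment -> profile HMM
    search = [
        "hmmsearch",
        "-E", str(evalue),              # reporting threshold (assumed value)
        "--tblout", f"{prefix}.tbl",    # per-sequence hits as a parseable table
        hmm_path,
        database_path,
    ]
    return build, search


# Hypothetical file names.
build, search = hmmer_commands("urease_msa.fasta", "proteomes.fasta", "urease")
for command in (build, search):
    print(shlex.join(command))
    # With HMMER installed, each command could be run with:
    # subprocess.run(command, check=True)
```

Keeping the commands as argument lists avoids shell quoting problems when file names contain spaces.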

13 Nota Bene


d. Visualizing the Results:

Besides viewing search results as a multiple sequence alignment, the supermatrix scripts include code to construct occupancy matrices, or phyletic vectors, of the results, which encode the presence, incomplete presence and absence of the target sequences in the database as “2”, “1” and “0”, respectively, along with the original query sequences (Francis 2020). This code was used to generate matrices of my search results, which were also visualized as colored charts (refer to “Appendix C”) using R code (R version 3.4.4) (https://www.r-project.org/) that is also part of the “supermatrix” project.
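The 2/1/0 encoding can be sketched in a few lines of Python. How the “supermatrix” scripts decide that a hit is incomplete is not described here, so the coverage threshold below (`partial_below`), the function names and all lengths are assumptions chosen only to show the shape of an occupancy matrix.

```python
def occupancy_code(hit_length, query_length, partial_below=0.5):
    """Return 2 (present), 1 (incomplete presence) or 0 (absent)."""
    if hit_length <= 0:
        return 0
    coverage = hit_length / query_length
    return 1 if coverage < partial_below else 2  # threshold is an assumption


def occupancy_matrix(hit_lengths, query_lengths, partial_below=0.5):
    """One row per taxon, one column per enzyme (columns sorted by name)."""
    enzymes = sorted(query_lengths)
    return {
        taxon: [
            occupancy_code(found.get(enzyme, 0), query_lengths[enzyme], partial_below)
            for enzyme in enzymes
        ]
        for taxon, found in hit_lengths.items()
    }


# Hypothetical query lengths and hit lengths (in residues).
query_lengths = {"MCR": 550, "Ni-CODH": 630, "Urease": 570}
hit_lengths = {
    "taxon_A": {"Urease": 565, "Ni-CODH": 210},
    "taxon_B": {"MCR": 548},
    "taxon_C": {},
}
matrix = occupancy_matrix(hit_lengths, query_lengths)
print("taxon", *sorted(query_lengths))
for taxon, row in matrix.items():
    print(taxon, *row)
```

Each row of such a matrix is the phyletic vector of one taxon and can be passed on directly as a discrete character for mapping on a tree.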

▪ Gene Trees

a. Building the Trees:

IQ-TREE (v. 1.6.1) (http://www.iqtree.org/) was used to infer a phylogenetic tree for each enzyme group, using the original query sequences plus the target sequences resulting from the search in the large database of prokaryotes. The sequences were aligned using MAFFT G-INS-i (version 7) before being fed to the software. Model selection was done using ModelFinder (Kalyaanamoorthy et al. 2017), and branch support was obtained with UFBoot2 (Ultrafast Bootstrap) (Hoang et al. 2018) with 1000 replicates; both are implemented in IQ-TREE.

b. Annotation & Visualization of the Trees:

TaxNameConvert (version 2.4) (http://www.cibiv.at/software/taxnameconvert/) was used to convert gene tree taxa from their unique identifiers to the corresponding binomial names provided by the metadata file from the Zhu et al. (2019) study. In cases where TaxNameConvert failed to produce an operable Newick file, Mesquite (version 3.61) (https://www.mesquiteproject.org/) was used for name conversion instead.
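The name conversion itself amounts to replacing tip labels in a Newick string according to a lookup table. The Python sketch below shows the idea with a regular expression; it is not how the conversion tools named above work internally, and the identifiers, names and the function `rename_tips` are hypothetical. It assumes unquoted labels and no whitespace inside the Newick string.

```python
import re

# A tip label directly follows "(" or "," and contains no Newick punctuation.
# Labels that follow ")" are internal node labels (e.g. support values) and
# are left untouched, as are branch lengths, which follow ":".
TIP_LABEL = re.compile(r"(?<=[(,])[^(),:;]+")


def rename_tips(newick, id_to_name):
    """Replace tip identifiers with names; unknown identifiers are kept."""
    def swap(match):
        label = match.group(0)
        # Newick labels cannot contain unquoted spaces, so use underscores.
        return id_to_name.get(label, label).replace(" ", "_")
    return TIP_LABEL.sub(swap, newick)


# Hypothetical identifiers and names.
tree = "((G000001:0.1,G000002:0.2)95:0.05,G000003:0.3);"
names = {"G000001": "Genus alpha", "G000002": "Genus beta"}
renamed = rename_tips(tree, names)
print(renamed)
```

Identifiers without an entry in the lookup table pass through unchanged, which makes missing metadata easy to spot in the output tree.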

Tree visualization and manipulation was done in RStudio (version 1.3.1073) (https://rstudio.com/) using R (version 4.0.2) (https://www.r-project.org/) and the following packages: ggtree (version 2.2.1), treeio (version 1.12.0), ape (version 5.4-1), tidytree (version 0.3.3), tibble (version 3.0.3), ggplot2 (version 3.3.2), and phytools (version 0.7-7).

c. Comparison of Tree Topologies:

Using the “cophylo” function in the “phytools” package (version 0.7-7) in RStudio (version 1.3.1073) with R (version 4.0.2), each gene tree was plotted facing the reference tree from Zhu et al. (2019), with lines linking the corresponding tips, in order to examine topological congruence.

N.B.: The function links taxa after intensive rotation of the nodes to optimize matching.

▪ Ancestral State Reconstruction

Enzyme presence, incomplete presence or absence was plotted as a discrete character on the reference tree from Zhu et al. (2019) as “P”, “I” and “N”, respectively, for all nickel enzymes (except for QueD, which is present in only two species) (refer to “Appendix C”). Ancestral states were then reconstructed through stochastic character mapping (Nielsen 2002) using the “make.simmap” function in the “phytools” package. Three models were used in the function: an “Equal Rates” model (ER), in which all transitions between states share a single rate; a “Symmetric” model (SYM), in which forward and reverse transitions between states are constrained to be equal; and an “All-Rates-Different” model (ARD), in which every possible transition between states receives its own parameter (Paradis et al. 2004). All models were unordered, that is, character states were allowed to transition in any order.

N.B.: For characters with only two states, the “SYM” model was not used since it would yield the same results as an “ER” model.
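The number of free rate parameters in each model, and the reason why SYM collapses to ER for a two-state character, can be verified with a small Python sketch. The function names `rate_parameter_index` and `n_free_rates` are hypothetical; the code only counts parameters and does not fit any model.

```python
from itertools import permutations


def rate_parameter_index(states, model):
    """Map every transition (from_state, to_state) to a free-parameter index."""
    pairs = list(permutations(states, 2))  # all off-diagonal transitions
    if model == "ER":    # one rate shared by every transition
        return {pair: 0 for pair in pairs}
    if model == "SYM":   # forward and reverse transitions share a rate
        unordered = sorted({tuple(sorted(pair)) for pair in pairs})
        return {pair: unordered.index(tuple(sorted(pair))) for pair in pairs}
    if model == "ARD":   # every transition has its own rate
        return {pair: index for index, pair in enumerate(pairs)}
    raise ValueError(f"unknown model: {model}")


def n_free_rates(states, model):
    return len(set(rate_parameter_index(states, model).values()))


for states in (("P", "I", "N"), ("P", "N")):
    counts = {m: n_free_rates(states, m) for m in ("ER", "SYM", "ARD")}
    print(states, counts)
```

With three states the models have one, three and six free rates, respectively; with two states both ER and SYM have a single rate, so fitting SYM adds nothing.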

Using the same package, each model was fitted to the data in order to calculate the AIC (Akaike Information Criterion) score and AIC weights, which were used to evaluate which of the three models mentioned above best fits the data.
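The arithmetic behind this comparison is compact: AIC = 2k − 2 ln L, where k is the number of free parameters and ln L the maximized log-likelihood, and the Akaike weight of model i is exp(−Δi/2) divided by the sum of that quantity over all models, with Δi the difference between the AIC of model i and the lowest AIC. The Python sketch below applies these formulas to made-up log-likelihoods; the numbers are hypothetical and are not results from this study, where the calculation was done in R.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def akaike_weights(aic_scores):
    """Relative support for each model, normalized to sum to 1."""
    best = min(aic_scores.values())
    relative = {m: math.exp(-0.5 * (a - best)) for m, a in aic_scores.items()}
    total = sum(relative.values())
    return {m: value / total for m, value in relative.items()}


# Hypothetical fits for a three-state character: (log-likelihood, free rates).
fits = {"ER": (-1250.0, 1), "SYM": (-1240.0, 3), "ARD": (-1225.0, 6)}
scores = {model: aic(lnl, k) for model, (lnl, k) in fits.items()}
weights = akaike_weights(scores)
for model in fits:
    print(f"{model}: AIC = {scores[model]:.1f}, weight = {weights[model]:.4f}")
```

A model with more parameters is preferred only if its gain in likelihood outweighs the penalty of two AIC units per added parameter.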


RESULTS

▪ HMMER Results Summary

Table 2 Summary table of the HMMER search results in the large proteome database from the Zhu et al. (2019) study.

| Enzyme | Number of Query Sequences | Number of Database Sequences | Number of Sequence Hits | Percentage of Hits in Database (%) |
|---|---|---|---|---|
| CO-methylating acetyl-CoA synthase | 25 | 10,575 (9906 bacteria & 669 archaea) | 73 | 0.69% |
| Carbon monoxide dehydrogenase | 42 | 10,575 | 154 | 1.45% |
| Methyl-coenzyme M reductase | 17 | 10,575 | 171 | 1.62% |
| Superoxide dismutase | 21 | 10,575 | 480 | 4.54% |
| Coenzyme F420 hydrogenase | 19 | 10,575 | 504 | 4.76% |
| Quinone-reactive Ni-Fe hydrogenase | 8 | 10,575 | 202 | 1.91% |
| Nickel-dependent hydrogenase | 82 | 10,575 | 1332 | 12.60% |
| Acireductone dioxygenase | 25 | 10,575 | 1002 | 9.47% |
| Glyoxalase I | 12 | 10,575 | 1320 | 12.48% |
| Lactate racemase | 9 | 10,575 | 603 | 5.70% |
| Quercetin 2,4-dioxygenase | 1 | 10,575 | 1 | 0.009% |
| Urease | 140 | 10,575 | 2296 | 21.71% |

▪ Gene Trees + Ancestral State Reconstruction

Results of the ancestral state reconstruction were included only in the supplementary folder under “Ancestral State Reconstruction (ASR) Trees” due to the unusually large size of the mapped trees (Refer to “Appendix E”).

A. Redox Enzymes

1. CO-methylating acetyl-CoA synthase (Enzymatic complex of ACS & CODH)

CO-methylating acetyl-CoA synthase homologs were found in 73 taxa (or 0.69%) out of 10,575 taxa in the large proteome database (Table 2). All 73 taxa belonged to the Archaea, particularly the Euryarchaeota, while the few bacterial sequences (Firmicutes, Proteobacteria and Chloroflexi) were part of the query sequences used for the search (Figure 1 & 2). This enzyme had the fewest homologs in the database compared to the other nickel enzymes (excluding the enzyme QueD with only 1 hit).

As for ancestral state reconstruction (ARD model), results from supplementary file “AcetylTree (ASR-ARD)” show that this enzyme originated in the last common ancestor of archaea and was retained only in some methanogenic Euryarchaeota, while it was lost in all other archaeal phyla represented in the reference tree.

Figure 1 Gene tree for the CO-methylating acetyl-CoA synthase enzymatic complex generated from the search result sequences + query sequences using IQTREE (Minh et al. 2020). The tree is color-coded based on corresponding phylum rank. Color coding is unified across all gene trees presented in this report.


[Figure 2: bar chart. Y-axis: Percentage of Taxa (%). Series: Archaea, Eubacteria. Values: Euryarchaeota 89.80% (Archaea); Firmicutes 6.12%, Chloroflexi 2.04% and [label missing] 2.04% (Eubacteria).]

Figure 2 Summary figure of the percentage and type of taxa composing the CO-methylating acetyl-CoA synthase gene tree in Figure 1.

2. Carbon monoxide dehydrogenase (Ni-CODH)

Ni-CODH homologs were found in 154 taxa (or 1.45%) out of 10,575 taxa in the database (Table 2). The hits were distributed somewhat evenly between the archaeal and bacterial kingdoms, with most archaeal hits belonging to the Euryarchaeota phylum (41.56%) and most bacterial hits belonging to the Proteobacteria phylum (38.31%). The rest belonged to Archaea/TACK+Crenarchaeota and Eubacteria/Chloroflexi+FCB+Firmicutes+Bacteria (Figure 3 and 4). This enzyme was the second least widely distributed of the nickel enzymes in the database (excluding QueD). As for ancestral reconstruction, results from supplementary file “CODHTree (ASR-ARD)” show that Ni-CODH was present in the last common ancestor of prokaryotes and was subsequently lost in almost all prokaryotic phyla represented in the reference tree but retained in some anaerobic archaea and anaerobic bacteria.


Figure 3 Gene tree for the carbon monoxide dehydrogenase enzyme generated from the search result sequences + query sequences using IQTREE (Minh et al. 2020). The tree is color-coded based on corresponding phylum rank. Color coding is unified across all gene trees presented in this report.

[Figure 4: bar chart. Y-axis: Percentage of Taxa (%). Series: Archaea, Eubacteria. Values: Euryarchaeota 41.56%, Crenarchaeota 0.65% and TACK 0.65% (Archaea); Proteobacteria 38.31%, Firmicutes 11.69%, Bacteria 4.55%, Chloroflexi 1.95% and FCB 0.65% (Eubacteria).]

Figure 4 Summary figure of the percentage and type of taxa composing the carbon monoxide dehydrogenase gene tree in Figure 3.


3. Methyl-coenzyme M reductase (MCR)

MCR homologs were found in 171 taxa (or 1.62%) out of 10,575 taxa in the database (Table 2). The vast majority of MCR homologs (97.34%) were found in the archaeal kingdom, specifically the Euryarchaeota phylum, as expected. The rest of the homologs belonged to the archaeal phylum TACK, in addition to one homolog in Bacteroidetes (Kingdom Eubacteria) (Figures 5 and 6). Ancestral reconstruction results (supplementary file: “MCRTree (ASR-ARD)”) show that, contrary to the other redox nickel enzymes, MCR was inferred as a derived character, exclusive to and originating in methanogens through what appear to be three independent events, or otherwise one event and two subsequent horizontal gene transfers.

Figure 5 Gene tree for the methyl-coenzyme M reductase enzyme generated from the search result sequences + query sequences using IQTREE (Minh et al. 2020). The tree is color-coded based on corresponding phylum rank. Color coding is unified across all gene trees presented in this report.


[Bar chart: percentage of taxa (%) by phylum. Archaea: Euryarchaeota 97.34%, TACK 2.13%. Eubacteria: Bacteroidetes 0.53%.]

Figure 6 Summary figure of the percentage and type of taxa composing the methyl-coenzyme M reductase gene tree in Figure 5.

4. Superoxide dismutase (Ni-SOD)

Ni-SOD homologs were found in 480 taxa (or 4.54%) out of 10,575 taxa in the database (Table 2). The vast majority of those homologs were found in Eubacteria/Actinobacteria+Bacteroidetes+Chloroflexi+Cyanobacteria+Proteobacteria+Bacteria, and only a few sequences in Archaea/Euryarchaeota and CPR (Candidate Phyla Radiation)/Microgenomates+Parcubacteria (Figures 7 and 8). Ancestral reconstruction results from supplementary file “NiSODTree (ASR-ARD)” infer a partial Ni-SOD gene in the common ancestor of archaea and bacteria that was lost at the beginning of the branching of both domains. A complete gene then emerges, first in clades of Actinobacteria, then in Cyanobacteria and in a few members of Proteobacteria, Euryarchaeota, Chloroflexi and CPR. This pattern of evolution is called a reversal homoplasy.


Figure 7 Gene tree for the superoxide dismutase enzyme generated from the search result sequences + query sequences using IQTREE (Minh et al. 2020). The tree is color-coded based on corresponding phylum rank. Color coding is unified across all gene trees presented in this report.

[Bar chart: percentage of taxa (%) by phylum. Archaea: Euryarchaeota 0.80%. CPR: Microgenomates 2.80%, Parcubacteria 2.00%. Eubacteria: Actinobacteria 67.60%, Cyanobacteria 19.20%, Proteobacteria 5.60%, Chloroflexi 0.80%, Bacteroidetes 0.60% (plus 0.40% listed under the spelling “Bacteriodetes”), Bacteria 0.20%.]

Figure 8 Summary figure of the percentage and type of taxa composing the superoxide dismutase gene tree in Figure 7. Note that “CPR” stands for “Candidate Phyla Radiation”.


5. Coenzyme F420 hydrogenase (FrhABG): A Group 3a Ni-Fe hydrogenase

FrhABG homologs were found in 504 taxa (or 4.76%) out of 10,575 taxa in the database (Table 2). The hits belonged to Eubacteria/Proteobacteria+Actinobacteria+Bacteroidetes+Chloroflexi+Cyanobacteria+FCB+Firmicutes+PVC+Spirochaetes+Bacteria, Archaea/Euryarchaeota+Asgard+TACK and CPR/Parcubacteria (Figures 9 and 10). Ancestral reconstruction results (supplementary file: “CoenzymeF420Tree (ASR-ARD)”) suggest that this enzyme was present in the last common ancestor (LCA) of prokaryotes. It appears to have been retained in a small number of species that span diverse phyla, giving it a plesiomorphic pattern of evolution.

Figure 9 Gene tree for the coenzyme F420 hydrogenase enzyme generated from the search result sequences + query sequences using IQTREE (Minh et al. 2020). The tree is color-coded based on corresponding phylum rank. Color coding is unified across all gene trees presented in this report.


[Bar chart: percentage of taxa (%) by phylum. Archaea: Euryarchaeota 29.45%, TACK 1.15%, Asgard 0.57%. CPR: Parcubacteria 0.19%. Eubacteria: Proteobacteria 33.65%, Cyanobacteria 8.03%, Bacteria 6.50%, Chloroflexi 6.12%, FCB 4.59%, Actinobacteria 2.87%, Bacteroidetes 2.87%, Firmicutes 1.91%, PVC 1.72%, Spirochaetes 0.38%.]

Figure 10 Summary figure of the percentage and type of taxa composing the coenzyme F420 hydrogenase gene tree in Figure 9. Note that “CPR” stands for “Candidate Phyla Radiation”.

6. Hydrogen:quinone oxidoreductase (or quinone-reactive Ni-Fe hydrogenase (Group 1))

Hydrogen:quinone oxidoreductase homologs were found in 202 taxa (or 1.91%) out of 10,575 taxa in the database (Table 2). All hits belonged to Eubacteria, with the vast majority of homologs in the Actinobacteria phylum (85.17%). The rest were found in Proteobacteria+Acidobacteria+PVC+Chloroflexi+Cyanobacteria+Firmicutes+Bacteria (Figures 11 and 12). Ancestral reconstruction results (supplementary file: “HydroQuiTree (ASR-ARD)”) infer a complete gene in the common ancestor of bacteria and a partial gene in the common ancestor of archaea, despite the lack of any homologous sequences found in archaeal species represented in the reference tree. The enzyme is retained in a few bacterial phyla and shows a plesiomorphic pattern of evolution.

Figure 11 Gene tree for the hydrogen:quinone oxidoreductase enzyme generated from the search result sequences + query sequences using IQTREE (Minh et al. 2020). The tree is color-coded based on corresponding phylum rank. Color coding is unified across all gene trees presented in this report.


[Bar chart: percentage of taxa (%) by phylum. Eubacteria: Actinobacteria 85.17%, Proteobacteria 4.78%, Bacteria 2.87%, Chloroflexi 2.87%, PVC 2.87%, Acidobacteria 0.48%, Cyanobacteria 0.48%, Firmicutes 0.48%.]

Figure 12 Summary figure of the percentage and type of taxa composing the hydrogen:quinone oxidoreductase gene tree in Figure 11.

7. Ni-dependent hydrogenase (NiFeSe_Hases)

NiFeSe_Hases homologs were found in 1332 taxa (or 12.60%) out of 10,575 taxa in the database (Table 2). This was the second most widely distributed nickel enzyme in the prokaryotic database. Most hits belonged to Eubacteria/Proteobacteria+Actinobacteria+Cyanobacteria+Acidobacteria+Aquificae+Bacteria+Bacteroidetes+Chlorobi+Chloroflexi+FCB+Firmicutes+PVC+Spirochaetes+Terrabacteria, while the rest turned up in CPR and Archaea/Euryarchaeota+Crenarchaeota+TACK+Archaea (Figures 13 and 14). Ancestral reconstruction results (supplementary file: “NidephydroTree (ASR-ARD)”) imply a partial gene in the last common ancestor of all prokaryotes and the gain of complete genes in several clades in an “underlying synapomorphy” pattern.


Figure 13 Gene tree for the nickel-dependent hydrogenase enzyme generated from the search result sequences + query sequences using IQTREE (Minh et al. 2020). The tree is color-coded based on corresponding phylum rank. Color coding is unified across all gene trees presented in this report.


[Bar chart: percentage of taxa (%) by phylum. Archaea: Euryarchaeota 7.15%, Crenarchaeota 0.92%, TACK 0.71%, Archaea 0.07%. CPR: CPR 0.07%. Eubacteria: Proteobacteria 48.69%, Cyanobacteria 9.34%, Bacteria 7.64%, Firmicutes 6.94%, Chloroflexi 5.73%, Actinobacteria 5.52%, FCB 2.55%, Bacteroidetes 2.19%, PVC 1.63%, Chlorobi 0.35%, Acidobacteria 0.14%, Spirochaetes 0.14%, Terrabacteria 0.14%, Aquificae 0.07%.]

Figure 14 Summary figure of the percentage and type of taxa composing the nickel-dependent hydrogenase gene tree in Figure 13. Note that “CPR” stands for “Candidate Phyla Radiation”.

B. Non-redox Enzymes

1. Acireductone dioxygenase (Ni-ARD)

Ni-ARD homologs were found in 1002 taxa (or 9.47%) out of 10,575 taxa in the database (Table 2). The hits belonged to Eubacteria/Proteobacteria+Firmicutes+Cyanobacteria+Actinobacteria+Bacteroidetes+Spirochaetes+PVC+Terrabacteria+FCB+Aquificae+Bacteria, Archaea/Euryarchaeota and CPR/Parcubacteria (Figures 15 and 16). As for ancestral reconstruction, I refrain from commenting on the history of acireductone dioxygenase as a nickel enzyme, because the same enzyme can utilize iron or nickel in its active center, and I am therefore not able to make observations based on sequence hits alone.

Figure 15 Gene tree for the acireductone dioxygenase enzyme generated from the search result sequences + query sequences using IQTREE (Minh et al. 2020). The tree is color-coded based on corresponding phylum rank. Color coding is unified across all gene trees presented in this report.


[Bar chart: percentage of taxa (%) by phylum. Archaea: Euryarchaeota 0.10%. CPR: Parcubacteria 0.10%. Eubacteria: Proteobacteria 40.80%, Firmicutes 26.97%, Cyanobacteria 14.31%, Actinobacteria 9.64%, Bacteria 3.31%, Spirochaetes 3.02%, Bacteroidetes 0.88%, PVC 0.39%, Terrabacteria 0.29%, Aquificae 0.10%, FCB 0.10%.]

Figure 16 Summary figure of the percentage and type of taxa composing the acireductone dioxygenase gene tree in Figure 15. Note that “CPR” stands for “Candidate Phyla Radiation”.

2. Glyoxalase I (GloI or GlxI)

GlxI homologs were found in 1320 taxa (or 12.48%) out of 10,575 taxa in the database (Table 2). The vast majority of GlxI hits were found in Eubacteria/Proteobacteria (79.73%). The rest belonged to Cyanobacteria, Bacteria, and PVC. Other minimal hits were found in Archaea/TACK (Figures 17 and 18). Homologs from Viridiplantae were part of the query sequences used for the search. Ancestral reconstruction results (supplementary file: “GloITree (ASR-ARD)”) show this enzyme to have multiple origins with an “underlying synapomorphic” pattern of evolution. The earliest point of origin seems to be in Proteobacteria (specifically members of Alpha/Beta/Gammaproteobacteria). Other points of origin are a clade of cyanobacteria and some members of PVC bacteria and TACK archaea.


Figure 17 Gene tree for the glyoxalase I enzyme generated from the search result sequences + query sequences using IQTREE (Minh et al. 2020). The tree is color-coded based on corresponding phylum rank. Color coding is unified across all gene trees presented in this report.

[Bar chart: percentage of taxa (%) by phylum. Archaea: TACK 0.08%. Eubacteria: Proteobacteria 79.73%, Cyanobacteria 19.59%, Bacteria 0.45%, PVC 0.08%. Viridiplantae: Streptophyta 0.08%.]

Figure 18 Summary figure of the percentage and type of taxa composing the glyoxalase I gene tree in Figure 17.


3. Lactate racemase (LarA)

LarA homologs were found in 603 taxa (or 5.70%) out of 10,575 taxa in the database (Table 2). The majority of LarA hits were found in Eubacteria/Firmicutes+Actinobacteria+Proteobacteria+Spirochaetes+Bacteria+Chloroflexi+PVC+FCB+Bacteroidetes+Terrabacteria, and the rest were found in Archaea/Euryarchaeota+TACK (Figures 19 and 20). LarA was the only nickel enzyme found entirely as partial hits. Ancestral reconstruction results (supplementary file: “LarATree (ASR-ARD)”) show that a partial gene, most likely the conserved enzyme domain, was present in the last common ancestor of prokaryotes and retained in only a handful of diverse members of bacteria and methanogenic archaea.

Figure 19 Gene tree for the lactate racemase enzyme generated from the search result sequences + query sequences using IQTREE (Minh et al. 2020). The tree is color-coded based on corresponding phylum rank. Color coding is unified across all gene trees presented in this report.


[Bar chart: percentage of taxa (%) by phylum. Archaea: Euryarchaeota 11.11%, TACK 0.98%. Eubacteria: Firmicutes 58.99%, Actinobacteria 7.19%, Proteobacteria 7.03%, Bacteria 6.86%, Chloroflexi 2.61%, Spirochaetes 2.29%, PVC 1.63%, Bacteroidetes 0.82%, FCB 0.33%, Terrabacteria 0.16%.]

Figure 20 Summary figure of the percentage and type of taxa composing the lactate racemase gene tree in Figure 19.

4. Urease

Urease homologs were found in 2296 taxa (or 21.71%) out of 10,575 taxa in the database (Table 2). This was the most widespread nickel enzyme in the database. The majority of urease hits were found in Eubacteria/Proteobacteria+Actinobacteria+Firmicutes+Cyanobacteria+Bacteroidetes+Bacteria+Terrabacteria+Tenericutes+Spirochaetes+PVC+FCB+Chloroflexi, and the rest were found in Archaea/Euryarchaeota+Crenarchaeota+TACK+Archaea (Figures 21 and 22). Ancestral reconstruction results (supplementary file: “ProkUreaseTree (ASR-ARD)”) resemble those of GlxI in having multiple origins with an “underlying synapomorphic” pattern of evolution, but differ in the significantly larger number and diversity of the members having the enzyme in each synapomorphy.


Figure 21 Gene tree for the urease enzyme generated from the search result sequences + query sequences using IQTREE (Minh et al. 2020). The tree is color-coded based on corresponding phylum rank. Color coding is unified across all gene trees presented in this report.


[Bar chart: percentage of taxa (%) by phylum. Archaea: Euryarchaeota 1.52%, TACK 0.86%, Crenarchaeota 0.66%, Archaea 0.08%. Eubacteria: Proteobacteria 47.05%, Actinobacteria 21.11%, Firmicutes 10.98%, Cyanobacteria 10.53%, Bacteroidetes 4.51%, PVC 0.90%, Bacteria 0.82%, Chloroflexi 0.33%, Terrabacteria 0.29%, FCB 0.16%, Spirochaetes 0.12%, Tenericutes 0.08%.]

Figure 22 Summary figure of the percentage and type of taxa composing the urease gene tree in Figure 21.

▪ Tree Comparison

Comparisons (co-phylogenetic plots) between the topology of the reference tree and that of every gene tree reported in the results section can be found in the supplementary folder enclosed with this report (Refer to “Appendix F”). Given the large amount of data and the aims and objectives of this study, I only relay results here on two nickel enzymes, CO-methylating acetyl-CoA synthase and carbon monoxide dehydrogenase, which are the most relevant to the chemoautotrophic origins of life model I rely on in this study (See “In what context do these new results matter?” in the discussion). The aim of these comparisons is to explore patterns of inheritance and possible horizontal gene transfer events.


→ CO-methylating acetyl-CoA synthase

Results (Figure 25) show very minimal entanglement, that is, overall topological congruence between the enzyme complex gene tree and the reference tree.

→ Carbon monoxide dehydrogenase (Ni-CODH)

Results (Figure 26) show moderate entanglement, that is, moderate incongruence between the enzyme gene tree and the reference tree.
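As a rough illustration of what “entanglement” measures in a co-phylogenetic plot, the sketch below scores how differently two trees order the same set of tips. It is a minimal sketch under stated assumptions: the function name `entanglement` and the tip labels are hypothetical, and the normalization (sum of squared rank differences, divided by the value obtained when one order is the exact reverse of the other) is one common choice and not necessarily the one used to produce the plots in this report.

```python
def entanglement(left_order, right_order):
    """Score in [0, 1]: 0 = identical tip order, 1 = fully reversed order.

    Both arguments are lists of tip labels read top to bottom from the
    two trees of a tanglegram (hypothetical inputs for illustration).
    """
    if sorted(left_order) != sorted(right_order):
        raise ValueError("both trees must contain the same tips")
    n = len(left_order)
    if n < 2:
        return 0.0
    # Position of every tip in the right-hand tree
    right_rank = {tip: i for i, tip in enumerate(right_order)}
    # Sum of squared differences between left and right positions
    observed = sum((i - right_rank[tip]) ** 2
                   for i, tip in enumerate(left_order))
    # Largest possible value: right order is the reverse of the left order
    worst = sum((i - (n - 1 - i)) ** 2 for i in range(n))
    return observed / worst


# Hypothetical tips: swapping two neighbours gives a small score
score = entanglement(["A", "B", "C", "D"], ["A", "B", "D", "C"])
```

A score near 0 corresponds to the “minimal entanglement” case described above, while larger scores correspond to increasing incongruence between the two tip orders.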

▪ Supplementary Files

Additional files of the reference tree + gene trees with taxa labels and color-coding based on corresponding kingdom rank can be found in the supplementary folder enclosed with this report (Refer to appendices D & G).

DISCUSSION

In this work, I uncover the phyletic profiles (presence-absence) of all nickel enzymes known to date in 10,575 microbial genomes. Using these profiles, I apply stochastic character mapping, which utilizes maximum likelihood-based models (Nielsen 2002, Huelsenbeck et al. 2003), to infer the status of each enzyme gene along the branches of the reference tree and in the common ancestor of all examined species. In this case, that ancestor is the last common ancestor of 10,575 prokaryotic species, or the LUCA (last universal common ancestor) if a two-domain tree of life is considered. In the following paragraphs I explain the revealed patterns of gene loss and gain and reconcile these patterns with Earth history and models for the origins of life.
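The basic operation behind stochastic character mapping, simulating a history of state changes along a branch, can be sketched as follows. This is a minimal illustration and not the implementation used in this study: the three states (absent, partial, complete), the rate values and the function name `simulate_branch` are assumptions made for the example, and the full method additionally conditions the simulated histories on the states observed at the tips, which this sketch does not do.

```python
import random

STATES = ("absent", "partial", "complete")  # assumed character states


def simulate_branch(start, length, rates, rng):
    """Simulate one character history along a branch of the given length.

    rates[i][j] is the instantaneous rate of change from state i to
    state j (i != j); diagonal entries are ignored.
    Returns a list of (state, time_spent) segments summing to `length`.
    """
    history = []
    state, remaining = start, length
    while True:
        # Total rate of leaving the current state
        total = sum(r for j, r in enumerate(rates[state]) if j != state)
        wait = rng.expovariate(total) if total > 0 else float("inf")
        if wait >= remaining:
            history.append((STATES[state], remaining))
            return history
        history.append((STATES[state], wait))
        remaining -= wait
        # Pick the next state with probability proportional to its rate
        targets = [j for j in range(len(STATES)) if j != state]
        weights = [rates[state][j] for j in targets]
        state = rng.choices(targets, weights=weights)[0]


# Hypothetical rates in which losses are faster than gains
example_rates = [
    [0.0, 0.5, 0.1],   # absent   -> partial, complete
    [2.0, 0.0, 0.3],   # partial  -> absent, complete
    [1.0, 0.2, 0.0],   # complete -> absent, partial
]
history = simulate_branch(2, 3.0, example_rates, random.Random(1))
```

Repeating such simulations many times over every branch, and keeping only histories compatible with the observed presence-absence data, is what yields the per-branch state probabilities discussed below.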

▪ General Trends in Ancestral State Reconstruction Results

Results revealed a few common themes. The first theme was an overall homoplastic pattern of character evolution (multiple origins) when using the ER (Equal Rates) and SYM (Symmetric) models, a one-parameter and a three-parameter model in this case, respectively. The second theme was an overwhelming bias towards loss over gain or partial gain when allowing all transition rates to differ using the ARD (All-Rates-Different) model, a six-parameter model in this case (Refer to “Appendix E” for estimated parameter values). This agrees with the results of Kannan et al. (2013), who found that estimated gene loss rates tend to be higher than estimated gene gain rates when more than two gene states are allowed, and that models that specify different rates generally explain the observed data better than those that do not. The third and most important theme was the analogous histories that were inferred within each of the two groups of nickel enzymes, redox and non-redox, when using the ARD model.
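The parameter counts quoted above follow directly from how each model ties transition rates together. The sketch below makes this explicit, assuming the three character states are absent, partial and complete (an assumption that is consistent with the one, three and six parameters stated above); the function names and rate labels are hypothetical and introduced only for illustration.

```python
from itertools import permutations

STATES = ("absent", "partial", "complete")  # assumed character states


def rate_classes(model, states=STATES):
    """Map each ordered state pair (i, j), i != j, to a free-parameter label."""
    pairs = list(permutations(states, 2))
    if model == "ER":    # every transition shares a single rate
        return {p: "q" for p in pairs}
    if model == "SYM":   # forward and reverse transitions share a rate
        return {p: "q_" + "_".join(sorted(p)) for p in pairs}
    if model == "ARD":   # every transition has its own rate
        return {p: "q_" + "_to_".join(p) for p in pairs}
    raise ValueError(f"unknown model: {model}")


def n_parameters(model, states=STATES):
    """Number of distinct free rates the model estimates."""
    return len(set(rate_classes(model, states).values()))


counts = {m: n_parameters(m) for m in ("ER", "SYM", "ARD")}
```

With three states, `counts` holds 1 for ER, 3 for SYM and 6 for ARD. Only the ARD model can assign a different rate to a loss than to the corresponding gain, which is why the bias towards loss becomes visible under that model.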

▪ The Analogous Histories within Nickel Enzyme Groups14: Single or Multiple Origins?

Under the All-Rates-Different (ARD) model, where rates of gene loss and gene gain or partial gain are allowed to be different in all directions, stochastic mapping results revealed these patterns:

- All redox nickel enzymes, except for methyl-coenzyme M reductase (MCR) and Ni-SOD, displayed a symplesiomorphic (shared ancestral) pattern of evolution, originating from a single point at the root of the tree, the last common ancestor of all 10,575 examined species (Discussed in the following paragraphs). Ni-SOD was also inferred to have a single point of origin but displayed a reversal homoplastic pattern, in which an ancestral trait is not present in the direct descendants but reappears in later ones, and MCR displayed a homoplastic pattern with multiple origins. Three of the seven redox nickel enzymes, which are superoxide dismutase, hydrogen:quinone oxidoreductase and nickel-dependent hydrogenase, were inferred only partially in the root of the tree.

14 Excluding Ni-ARD, for the reasons mentioned previously.


- On the other hand, all non-redox nickel enzymes, except for lactate racemase (LarA), displayed a parallel homoplastic (shared derived with no direct common ancestor) pattern of character evolution with multiple origins (See “Non-redox Enzymes: Convergent Evolution or LGT?”).

▪ Selective Pressures: Nickel Famine or the Great Oxidation Event?

I propose that the inferred patterns of evolution of redox nickel enzymes (except MCR) can be explained by the selective pressure against nickel-dependent organisms caused by the “nickel famine”, theorized to have occurred at the end of the Archaean, 2.5 billion years ago. Based on a decline in the molar nickel-to-iron ratio reported from banded iron formations around 2.7 Gyr ago, Konhauser et al. (2009) suggest that nickel concentrations in Precambrian ocean waters dropped significantly at the end of the Archaean (from ~400 nM to ~200 nM). They attribute this to a reduced flux of nickel to the waters, caused by a cooling of upper-mantle temperatures and a decreased eruption of nickel-rich ultramafic rocks, which are igneous rocks found in the Earth’s mantle. This decline, in turn, would have caused a decline in the number of oceanic methanogen communities, which use nickel in several of their enzymes, leading to a significant decrease in biologically produced methane and the thriving of other microorganisms that do not require as much nickel, especially cyanobacteria, thereby setting the stage for the Great Oxidation Event (GOE). The rise in oxygen would consequently have killed off many other microorganisms that rely on anaerobic pathways, like H2 and/or CO2 reduction, leaving survivors only in the remaining anoxic niches (like guts, volcanic mud, hydrothermal vents, and ocean/lake basins), where many of our current nickel-utilizing microbes live. Another explanation could be that the GOE directly caused the demise of nickel-utilizing lifeforms, without any drastic changes in the molar concentrations of oceanic nickel. The patterns of massive loss of nickel utilization inferred in my trees would then be the result of the dramatic change from a reducing atmosphere and anoxic or euxinic (anoxic and sulfidic) oceans to oxidizing ones, causing a rise in reactive oxygen species that are toxic to anaerobes, which presumably lacked oxygen detoxification methods at the time (Sessions et al. 2009). Moreover, the decreased sulfur content in oceans upon the GOE would in turn have reduced the solubility of nickel and other metals necessary for essential enzymes in the majority of microbes during that period (Case 2017).

▪ The Odd Case of Methyl-coenzyme M Reductase

In the case of MCR, ancestral reconstruction results as well as the gene tree suggested this enzyme to be a derived innovation in Euryarchaeota that was most likely transferred through LGT events to some members of TACK and one member of Bacteroidetes (Refer to “MCRTree (Phyla)” in the supplementary folder). This agrees with similar findings by Muñoz-Velasco et al. (2018), which argue against a methanogenic origin in the LCA of Archaea, and contradicts the findings of Weiss et al. (2016) and Borrel et al. (2016), which placed methanogens as the closest relatives of the LUCA and proposed that methanogenic pathways evolved at the early stages of life.

However, a recent study (Berghuis et al. 2019) that found a complete and divergent methanogenic pathway in the archaeal phylum Verstraetearchaeota reiterated the shared ancestry of methanogenesis, placed the origins of methanogenesis prior to Euryarchaeota, and claimed that its observations support the notion of a CO2-reducing methanogenic LUCA.

I think that my results regarding MCR are contingent upon the level of archaeal diversity included in the reference tree as well as the very topology of the tree, and could therefore be concealing the true history of this enzyme. In addition, the methodology used in this study cannot discern LGT events on its own, which could further skew the results. Thus, with the emerging field of metagenomics and the expanding tree of Archaea, we can hope to reveal a more accurate history of this enzyme by increasing the number and diversity of sampled archaeal genomes and testing different models of gene evolution.


▪ Partial Genes in the LCA of Prokaryotes

Three of the seven redox nickel enzymes in this study, which are superoxide dismutase, hydrogen:quinone oxidoreductase and nickel-dependent hydrogenase, were inferred only partially in the last common ancestor of prokaryotes in the reference tree. According to a study by Baymann et al. (2003), many enzymes involved in bioenergetic processes in bacteria and archaea, such as redox enzymes, are made up of different combinations of a few conserved basic units (Figure 23), indicated phylogenetically to have existed pre-LUCA, that get rearranged according to the catalytic requirements of the enzyme (Baymann et al. 2003). One such enzyme, as found by the study by Baymann et al. (2003), is the nickel-dependent hydrogenase. I suggest, thus, that the partial gene inferred in the LCA of prokaryotes in my tree represents the conserved subunit or domain of the enzyme that is shared and ancestral to all nickel hydrogenases, and that can be put together with other units, represented by the flanking sequences of the gene, to give the complete enzyme. I extend this logic to the other two redox enzymes that were also inferred partially in this study, superoxide dismutase and hydrogen:quinone oxidoreductase.

Figure 23 Schematic representation of the basic units that can be put together in various ways to make different bioenergetic enzymes. Recreated from “The redox protein construction kit: pre-last universal common ancestor evolution of energy-conserving enzymes” by Baymann et al. (2003). Created in BioRender.com

▪ Non-redox Enzymes: Convergent Evolution or LGT?

As for non-redox nickel enzymes (except LarA and Ni-ARD), the inferred patterns of multiple origins can be the result of convergence or lateral gene transfer events. Judging from the number of discrepancies between the gene trees and the reference tree (Refer to “Appendix F”) and the extremely low probability of two or more homologous amino acid sequences evolving multiple times by chance (a 1/20 chance of the same amino acid occurring at a given position in two molecules by chance alone), lateral gene transfer seems the more likely explanation. However, it is outside the scope of this study to confirm either one with any degree of certainty.
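To show how quickly the 1/20 per-position figure compounds over a whole sequence, the following minimal sketch computes the probability that two sequences match at every one of a given number of positions by chance alone. It rests on simplifying assumptions that are mine and not part of the analysis in this study: all 20 amino acids are treated as equally likely and positions as independent, and the function name `chance_identity` is hypothetical.

```python
from fractions import Fraction


def chance_identity(n_positions, alphabet_size=20):
    """Probability that two independent random sequences carry the same
    residue at each of n_positions sites, with all residues equally likely."""
    return Fraction(1, alphabet_size) ** n_positions


# One position: 1/20. Ten positions: 1/20**10.
p_one = chance_identity(1)
p_ten = chance_identity(10)
```

Under these assumptions, identity at only ten positions already has a probability of 1/20^10, which is about 9.8e-14. Real amino acid frequencies are unequal and selection constrains which residues are tolerated, so this figure should be read as an order-of-magnitude illustration only.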

Regarding the ancestral history of LarA, I propose that its inference in the LCA of prokaryotes could be due to its important role in the assembly of cell walls (Shi et al. 2019). Unfortunately, studies on this particular enzyme have been very scarce since its discovery in 2014 (Desguin et al. 2014), and thus I cannot rely on previous findings to support or refute my claims. However, I have significantly expanded the knowledge of its distribution, and this could be the seed for more research on this poorly understood enzyme.

▪ Conflicting Tree Topologies: Evidence of Lateral Gene Transfer (LGT)?

Nearly all nickel enzyme gene trees displayed some level of topological discrepancy with the reference tree (Figure 24), as expected when dealing with prokaryotic taxa. Such discrepancies were found both between the two kingdoms of Archaea and Bacteria and within them. Inter-kingdom discrepancies can be seen by comparing the reference tree (#12 in Figure 24), where bacteria and archaea split into distinct clades, with the individual gene trees in which the enzymes are found in both kingdoms in significant numbers (#1, 2, 4, 5, 7, 10 & 11 in Figure 24), where we see clustering of archaeal taxa (in blue) within bacterial clades (in red) or vice versa. More severe discrepancies occur within kingdoms, as seen in the gene trees in the results section (Figures 3, 7, 9, 11, 13, 15, 17, 19 & 21) and indicated by the color groupings: almost all monophyletic clades in the gene trees consist of members of more than one phylum.

Inconsistencies with the species tree could be attributed to several different causes, such as artifacts of the tree-building methods used in this study or uncertainty in the underlying reference phylogeny. In addition, the gene trees contain the query sequences used in the search, which are not part of the reference phylogeny, potentially contributing to further inconsistencies. However, even with the addition of external sequences, one would still expect members of the different kingdoms or phyla to cluster within their respective groups. Other scenarios are events of lateral (or horizontal) gene transfer and/or non-horizontal events like lineage-specific gene duplication and loss, which are very likely since these are major evolutionary phenomena in the prokaryotic world (Mirkin et al. 2003). Incomplete lineage sorting could potentially produce similar results. In conclusion, the explicit comparative phylogenetic methods used in this study do provide potential, albeit inconclusive, evidence of lateral gene transfer, but further simulations using model-based methods are needed to test evolutionary scenarios with different kinds of loss/gain/duplication events and compare their fit to the data.

In the next section I take a more detailed look at the gene trees of two nickel enzymes involved in the ancient Wood-Ljungdahl pathway and attempt to discern their pattern of inheritance based on my results and those of other studies.


Figure 24 Nickel enzyme gene trees (1-11) + reference tree (12) colored based on corresponding kingdom and supergroup rank (Red=Eubacteria, blue=Archaea, yellow or black=CPR (Candidate Phyla Radiation)). (1) CO-methylating acetyl-CoA synthase (2) Ni-CODH (3) MCR (4) Ni-SOD (5) FrhABG (6) Hydrogen:quinone oxidoreductase (7) NiFeSe_Hases (8) Ni-ARD (9) GlxI (10) LarA (11) Urease (12) Reference tree of 10,575 prokaryotes. Individual diagrams of the same trees can be found in the supplementary folder.

▪ Tree Comparison Results

→ CO-methylating acetyl-CoA synthase

My results showed CO-methylating acetyl-CoA synthase to be present primarily in Archaea, originating from the LCA of archaea (See “Results” and “AcetylTree (ASR-ARD)” in the supplementary folder). However, according to the comprehensive study by Adam et al. (2018), the complex is thought to have existed not only in the LCA of Archaea but in the LUCA. I think the difference in results between our studies stems from the fact that the sequences I used to build the profile HMMs represented the archaeal form of the enzyme (particularly the beta subunit involved in the CODH activity), which is homologous with the bacterial form in only 4 of the 5 subunits that make up the complex (Adam et al. 2018). This also explains why the enzyme was primarily found in Archaea and in only a few bacterial taxa (Firmicutes, Chloroflexi and Proteobacteria), in contrast to the wide distribution of this enzyme complex in both domains reported by Adam et al. (2018). Hence, I base my analysis on both my results and those of Adam et al. (2018) to reach a more accurate conclusion.

The minimal entanglement in the co-phylogenetic plot (Figure 25) suggests that the majority of sequences in the gene tree are orthologs and can be easily traced to a common ancestor. This indicates conservation and argues against major events of duplication or lateral gene transfer. It agrees with the findings of Adam et al. (2018), who concluded that the enzyme complex shows a largely vertical inheritance pattern and a great amount of conservation in both the archaeal and bacterial domains, which points to the strong possibility that at least 4 of its 5 subunits were present in the last universal common ancestor.

Since we are dealing with the archaeal form of the enzyme, the clustering of bacteria in a clade inside the tree rather than as an outgroup suggests a possible event of lateral gene transfer from archaea to bacteria that share the same environment. This is corroborated by Adam et al. (2018), who found archaeal-type clusters in Dehalococcoidia (present in the tree in Figure 1) and concluded that a single lateral gene transfer of this form from a methanogen to bacteria was followed by subsequent transfers among other bacterial lineages.

Figure 25 Tanglegram between reference tree (Zhu et al. 2019) (left) and gene tree of the CO-methylating acetyl-CoA synthase enzyme complex (right). The minimal entanglement indicates congruence. The yellow box highlights the sole bacterial clade in the gene tree, which explains the incongruence with that part of the tree. The remaining taxa are methanogenic archaea.

→ Carbon monoxide dehydrogenase (Ni-CODH)

The most glaring reason for incongruence between the two trees (Figure 26) seems to be the sporadic clades of archaea (green outlines) clustering within bacterial clades (red outlines). A possible explanation is horizontal gene transfer between the two domains, and subsequent transfer within domains. However, since my research is not sufficient for clearer conclusions about such events, I turn to published literature to support or falsify my claims. According to a study by Techtmann et al. (2012), some disparities between the species tree and the Ni-CODH gene tree could arise from functional distinctions between CODH lineages. It is thought that the function of Ni-CODHs is determined by their associated accessory proteins encoded within gene clusters, so that certain CODHs from one clade evolved to associate with proteins from another clade, thereby resulting in convergent evolution of sequence motifs for functionally similar CODHs that have different origins (Techtmann et al. 2012). More in-depth studies are needed to discern the evolutionary history of the different types of these enzymes, but it is generally thought that the types of CODHs associated with the CO-methylating acetyl-CoA synthase complex were likely present in the LUCA (Inoue et al. 2019).

Figure 26 Tanglegram between reference tree (Zhu et al. 2019) (left) and gene tree of Ni-CODH (right). Red outlines refer to bacteria, while green outlines refer to archaea. Blue brackets refer to sulfur-reducing bacteria and delta proteobacteria. Green arrow refers to Thermococci. Blue arrow refers to methanogenic archaea. Black arrows refer to possible HGT events.

▪ Implications for the Origin of Life

→ Why are nickel enzymes interesting for the origins of life?

Nickel enzymes have been proposed as good candidates for the direct descendants of the first molecules that catalyzed abiotic reactions where life emerged (Zamble et al. 2017). This is due to the following:

1) Presence of nickel enzymes seems to span nearly all prokaryotic phyla as well as some lower eukaryotes, so that a LUCA common origin for these enzymes can be arguably theorized (Zhang et al. 2009).

2) Several nickel enzymes strictly catalyze anaerobic reactions (Ragsdale 2009), which would have been the dominant form of reactions on early anoxic Earth up until the Great Oxidation Event (GOE) around 2.4 billion years ago (Alfano & Cavazza 2020).

3) Several of these enzymes also strictly catalyze redox reactions of H2 and CO2, which are generally thought to have been present in high concentrations in the early Earth atmosphere (Alfano & Cavazza 2020) as well as the early ocean waters (CO2) and hydrothermal vents (H2) (Martin & Russell 2007). Renditions of those redox reactions have been proposed in several origin of life models, namely metabolism-first models, to have supplied the energy for further synthesis of simple organic molecules (Lanier & Williams 2017).

4) Nickel ions (Ni2+) (along with iron ions (Fe2+) and other transition metals) are also thought to have been highly abundant in the hot oceans of early Earth (Alfano & Cavazza 2020), which would have sufficiently supplied the metal centers of emergent proto-enzyme systems (Nitschke et al. 2013).

5) The active centers of some nickel enzymes, like Ni-Fe hydrogenase, CODH and ACS have been likened to the transition metal sulphides and oxides that made up the presumed mineral precipitates that would form upon mixing of hydrothermal fluids and the Hadean ocean waters (>4 billion years ago), such as nickelian mackinawite ([Fe > Ni]2S2), a nickel-bearing greigite (~FeSS[Fe3NiS4]SSFe) and violarite (~NiSS[Fe2Ni2S4]SSNi) (see Figure 27) (Nitschke et al. 2013).

Figure 27 Correspondence between some naturally occurring minerals (Left to right: Nickelian mackinawite, Ni-substituted greigite & violarite) with the active centers of some nickel enzymes (Ni-Fe hydrogenase, CODH & ACS, respectively) (Nitschke et al. 2013).

Reprinted from BBA - Bioenergetics, Volume 1827, Nitschke et al., On the antiquity of metalloenzymes and their substrates in bioenergetics, Pages 871-881, Copyright (2013), with permission from Elsevier.

→ Implications from this study.

My results placed all redox nickel enzymes (except MCR) as having originated fully or partially from the root of the reference tree, that is, from the last common ancestor of all 10,575 bacterial and archaeal species examined in this study. If we were to accept Eukarya as a derived group, considering the accumulating support for a two-domain tree of life (Williams et al. 2013, McInerney et al. 2015, Williams et al. 2020) with Archaea and Bacteria as the primary domains, then the root of our reference tree would represent the last universal common ancestor (LUCA) of all present life. As a result, one would conclude an origin for redox nickel enzymes in the LUCA, similar to what was concluded in a well-known phylogenetic study that attempted to uncover the physiology and habitat of the LUCA (Weiss et al. 2016). That study examined 286,541 protein families from fully sequenced prokaryotic genomes for tree construction. Only 355 remained after accounting for lateral gene transfer by discarding genes that were not present in at least two higher taxa of bacteria and archaea and genes whose trees did not recover bacterial and archaeal monophyly (Weiss et al. 2016). These 355 protein families, traced to the LUCA by phylogenetic criteria, included some redox nickel enzymes, such as the CO dehydrogenase/acetyl-CoA synthase complex, coenzyme F420 hydrogenase, Ni-Fe hydrogenase, and a menaquinone (a hydrogen:quinone oxidoreductase), similar to the results of this study (Weiss et al. 2016).
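The two filtering criteria attributed to Weiss et al. (2016) above can be illustrated with a short sketch. This is not their pipeline; it is a toy version under stated assumptions: a gene tree is written as nested tuples of leaf names, each leaf is annotated with a (domain, higher taxon) pair, and “at least two higher taxa of bacteria and archaea” is read here as at least two higher taxa in each domain. All function names, leaf names and annotations are hypothetical.

```python
def collect_clades(tree, clades):
    """Append the leaf set below every node of a nested-tuple tree to `clades`
    and return the leaf set of `tree` itself."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(collect_clades(child, clades) for child in tree))
    clades.append(leaves)
    return leaves


def passes_filter(tree, annotation, min_taxa=2):
    """Return True if a gene family passes both illustrative criteria.

    annotation maps leaf name -> (domain, higher_taxon), where domain is
    "Archaea" or "Bacteria" (an assumed, simplified input format).
    """
    clades = []
    all_leaves = collect_clades(tree, clades)
    taxa = {"Archaea": set(), "Bacteria": set()}
    members = {"Archaea": set(), "Bacteria": set()}
    for leaf in all_leaves:
        domain, higher_taxon = annotation[leaf]
        taxa[domain].add(higher_taxon)
        members[domain].add(leaf)
    # Criterion 1: the family occurs in enough higher taxa of each domain.
    if any(len(found) < min_taxa for found in taxa.values()):
        return False
    # Criterion 2: the domains are separated by a single branch. In a rooted
    # drawing of the tree this means at least one domain appears as a clade.
    return (frozenset(members["Archaea"]) in clades
            or frozenset(members["Bacteria"]) in clades)


if __name__ == "__main__":
    annotation = {
        "a1": ("Archaea", "Euryarchaeota"),
        "a2": ("Archaea", "Crenarchaeota"),
        "b1": ("Bacteria", "Firmicutes"),
        "b2": ("Bacteria", "Proteobacteria"),
    }
    print(passes_filter((("a1", "a2"), ("b1", "b2")), annotation))
    print(passes_filter((("a1", "b1"), ("a2", "b2")), annotation))
```

In the first example the archaeal and bacterial sequences fall on opposite sides of one branch, so the family is retained; in the second the domains are interleaved, which is the signature of lateral gene transfer that the filter is meant to exclude.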

→ In what context do these new results matter?

Acetyl-coenzyme A synthase (ACS) and carbon monoxide dehydrogenase (CODH) are key enzymes in the Wood-Ljungdahl pathway (or the reductive acetyl-CoA pathway) (Figure 28), which is considered one of the most ancient carbon-fixation pathways and potentially prebiotic (Peretó et al. 1999). The WL pathway is an important component of one of the most prominent models for the origins of life, the Submarine Alkaline Hydrothermal Vent theory (SAVT), which was put forth by Michael J. Russell in 1993, and argues that the first abiotic metabolic pathways leading up to life probably took place in the mounds of hydrothermal vents, which are laminated with Fe(Ni)S minerals, involving a primitive analogue of the exergonic WL pathway, using H2 as the initial electron donor and CO2 as the initial acceptor, and metal sulphides like FeS and NiS as catalysts (Figure 29) (Russell et al. 1994, Russell & Martin 2004).

Figure 28 The Wood-Ljungdahl pathway (or the Acetyl-CoA pathway). The pathway consists of two branches: the methyl and the carbonyl branch. In this pathway CO2 is reduced to CO (carbonyl branch) and formic acid or directly into a formyl group that is reduced to a methyl group (methyl branch) and then combined with the CO and Coenzyme A to produce acetyl-CoA. The pathway uses two nickel enzymes: Ni-CODH, which catalyzes the reduction of CO2 and acetyl-CoA synthase, which combines the resulting CO with a methyl group to give acetyl-CoA.

Reprinted from Biogas production through syntrophic acetate oxidation and deliberate operating strategies for improved digester performance, https://doi.org/10.1016/j.apenergy.2016.06.061, Creative Commons License: https://creativecommons.org/licenses/by-nc-nd/4.0/.

Figure 29 Schematic of an alkaline hydrothermal vent, as perceived by the SAVT model. As seen, H2 dissolved in water travels through the vent chimneys, reducing CO2 and supplying electrons for the synthesis of organic molecules that are retained in the microcompartments of the vent, while the carbon reacts to form acetic acid and escapes the chimneys. Hydrosulfides traveling through the chimneys react with the metal-rich water to form metal sulphide precipitates that inflate hydrostatically, form the microcompartments seen in the enlarged image, and act as catalysts for the reactions inside the vent (Russell et al. 1994, Russell & Martin 2004).

Reprinted from Trends in Biochemical Sciences, Volume 29, Russell and Martin, The rocky roots of the acetyl-CoA pathway, Pages 358-363, Copyright (2004), with permission from Elsevier.

This study is by no means the first to infer that redox nickel enzymes were present in the LUCA, or at least to imply their ancient origins. In fact, many studies of individual nickel enzymes discuss their history in some form (Techtmann et al. 2012, Case 2017, Adam et al. 2018, Boyd et al. 2019). However, it is unique in the diversity and density of the sampled genomes used, which has been shown to increase the predictive accuracy of phylogenetic profiling and ancestral inference (Škunca & Dessimoz 2015), and the results thus provide further evidence for the antiquity of redox nickel enzymes and their possible origins in the LUCA. Such conclusions are not without complications, as with most studies of evolutionary history in deep time: one can never fully account for missing data or lateral gene transfer events, which have undoubtedly shaped a great deal of prokaryotic evolution and which could completely erase large portions of history at large evolutionary distances.

▪ Inherent Assumptions of the Models

"All models are wrong, but some are useful" - George E. P. Box (1919-2013)

The quote by George E. P. Box is a useful reminder that recovering the true evolutionary history will always remain out of reach, no matter the model in use, and especially when studying deep time. Although ancestral reconstruction has proven very useful since its inception, most methods still fail to incorporate trait non-neutrality and heterogeneity of transition rates across the underlying tree, in favor of a more straightforward analysis of data (Holland et al. 2020). Non-neutral traits, such as some essential nickel enzymes in this case, are susceptible to natural selection and can affect the fitness of the microorganism in terms of carbon assimilation and other reactions essential for survival. Failure to account for non-neutrality leads to systematic biases and the assumption that the estimate of the phylogenetic tree is independent of the estimate of ancestral states (Holland et al. 2020). Moreover, one cannot exclude model artifacts when considering results, as it has been found that maximum-likelihood models are more likely to assign rare genes as highly ancestral, provided that these genes are found in different clades, and that more heavily parameterized models tend to favor gene loss over gene gain (Kannan et al. 2013). It is also important to keep in mind that even with heavier sampling of genomes, reconstruction accuracy depends heavily on the underlying tree topology, and that gene inference can only take us to the last universal common ancestor of all currently living organisms, failing to account for any missing data in the tree of life, which is made more complicated by the rampant occurrence of horizontal gene transfer in the microbial world (Kannan et al. 2013).
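To make the assumptions discussed above tangible, the following sketch implements the simplest kind of model they refer to: a two-state (absent/present) continuous-time Markov model with one gain rate and one loss rate that are held constant over the entire tree, combined with the standard pruning calculation of the likelihood at the root. It is a generic textbook construction offered as an illustration under these assumptions, not the exact model or software configuration used in this study; the tree encoding, function names, rates and taxon labels are all hypothetical.

```python
import math


def transition_probs(gain, loss, t):
    """2x2 matrix P[i][j] = probability of state j after time t given state i
    (0 = enzyme absent, 1 = enzyme present), with rates fixed across the tree."""
    total = gain + loss
    decay = math.exp(-total * t)
    p01 = gain / total * (1.0 - decay)
    p10 = loss / total * (1.0 - decay)
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]


def partial_likelihoods(node, states, gain, loss):
    """Return [L(absent), L(present)] for the data below `node`.

    A node is (branch_length, leaf_name) or (branch_length, [child, ...]).
    """
    _, content = node
    if isinstance(content, str):
        observed = states[content]
        return [1.0 if observed == 0 else 0.0, 1.0 if observed == 1 else 0.0]
    result = [1.0, 1.0]
    for child in content:
        child_like = partial_likelihoods(child, states, gain, loss)
        p = transition_probs(gain, loss, child[0])
        for parent_state in (0, 1):
            result[parent_state] *= sum(
                p[parent_state][s] * child_like[s] for s in (0, 1)
            )
    return result


def root_posterior_present(tree, states, gain, loss, prior_present=0.5):
    """Posterior probability that the trait was present at the root."""
    absent, present = partial_likelihoods(tree, states, gain, loss)
    numerator = prior_present * present
    return numerator / (numerator + (1.0 - prior_present) * absent)


if __name__ == "__main__":
    # Hypothetical three-taxon tree; all taxa carry the enzyme.
    tree = (0.0, [(0.1, "taxon_A"), (0.1, "taxon_B"), (0.1, "taxon_C")])
    states = {"taxon_A": 1, "taxon_B": 1, "taxon_C": 1}
    print(root_posterior_present(tree, states, gain=1.0, loss=1.0))
```

Because `gain` and `loss` are single numbers shared by every branch, the sketch embodies exactly the homogeneity and neutrality assumptions criticized above: nothing in it allows selection on the trait or lineage-specific rates, and the inferred root state depends entirely on the tree that is supplied.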

CONCLUSION

This study aimed to expand our knowledge of the distribution of nickel-dependent enzymes, which are significant for early Earth history and possibly the origins of life, in addition to their importance on a global scale and in organisms as both a toxicity and a nutritional factor. The purpose was to uncover their evolutionary history by reconstructing their ancestral states, and to explore the relationships between their sequences by generating gene trees for each enzyme. My main findings show single origins for redox nickel enzymes (except for MCR) with a symplesiomorphic pattern of evolution, and multiple origins for non-redox nickel enzymes (except for LarA) with a homoplastic pattern of evolution. Both patterns could be disputed given the possible lateral gene transfer (LGT) events implied in the enzyme gene trees; however, my phylogenetic analysis alone is not sufficient to confirm any such events. I propose that these patterns arose as a result of selective pressure brought about by the “nickel famine” event or the Great Oxidation Event.

My findings largely support previous work on the very ancient origins of these enzymes, which in my case stretch back to the last common ancestor of all prokaryotes and, in a two-domain tree, to the last universal common ancestor of all living organisms. These enzymes could very well have evolved from the proto-metalloenzymes that catalyzed the primitive metabolic reactions in the mounds of hydrothermal vents, as proposed by the SAVT model for the origins of life.

More conclusive results could be obtained in the future through testing different statistical models for ancestral reconstruction, calibration of the ancestrally reconstructed trees and synteny analysis to distinguish LGT events from genuine shared ancestry.

ACKNOWLEDGEMENTS

Thank you, my two amazing supervisors!

Thank you, Aodhán Butler, for without you, this project would not have gone anywhere! Thank you for your patience and your expertise and your humble and encouraging attitude towards me, and for always finding a solution to every problem we encountered.

Thank you, Anna Neubeck, for your guidance and for giving me, without hesitation, the opportunity to work with you on a subject near and dear to my heart, and for your every valuable input on the thesis and for this very project that keeps on giving.

Thank you, Brenda Irene Medina Jimenez and Xuejing Hu for being my opponents.

Thank you, my partner, for your continuous support throughout my two-year studies and my entire journey in Sweden. You have been my rock!

Thank you, my father, for influencing my love of science and for being my role model, and for everything you have ever done for me.

Thank you, my mother, for always supporting and believing in me.

Thank you, my siblings, for your continuous support and enthusiasm towards my work and for always being the rock I lean on. Thank you, brothers-in-law, for being my non-blood siblings!

Thank you, my niece, for simply existing.

A special thanks to my very talented sister Yvette for illustrating my final presentation.

Finally…thank you, my new country Sweden!

Thank you, Sweden, for welcoming me, for helping me and for all the opportunities that I hope to pay back one day. Thank you for the safety away from war.

Thank you, my friend Kinan, for all the help in Sweden and for your kind soul.

REFERENCES

1. Adam PS, Borrel G, Gribaldo S. 2018. Evolutionary history of carbon monoxide

dehydrogenase/acetyl-CoA synthase, one of the oldest enzymatic complexes.

Proceedings of the National Academy of Sciences 115: E1166–E1173.

2. Alfano M, Cavazza C. 2020. Structure, function, and biosynthesis of nickel-dependent

enzymes. Protein Science 29: 1071–1089.

3. Baymann F, Lebrun E, Brugna M, Schoepp-Cothenet B, Giudici-Orticoni M-T, Nitschke

W. 2003. The redox protein construction kit: pre-last universal common ancestor

evolution of energy-conserving enzymes. Philosophical Transactions of the Royal Society

B: Biological Sciences 358: 267–274.

4. Berghuis BA, Yu FB, Schulz F, Blainey PC, Woyke T, Quake SR. 2019. Hydrogenotrophic

methanogenesis in archaeal phylum Verstraetearchaeota reveals the shared ancestry of

all methanogens. Proceedings of the National Academy of Sciences 116: 5037–5044.

5. Bernhard M, Benelli B, Hochkoeppler A, Zannoni D, Friedrich B. 1997. Functional and

Structural Role of the Cytochrome b Subunit of the Membrane-Bound Hydrogenase

Complex of Alcaligenes Eutrophus H16. European Journal of Biochemistry 248: 179–

186.

6. Boer JL, Mulrooney SB, Hausinger RP. 2014. Nickel-dependent metalloenzymes.

Archives of Biochemistry and Biophysics 544: 142–152.

7. Boyd JA, Jungbluth SP, Leu AO, Evans PN, Woodcroft BJ, Chadwick GL, Orphan VJ,

Amend JP, Rappé MS, Tyson GW. 2019. Divergent methyl-coenzyme M reductase

genes in a deep-subseafloor Archaeoglobi. The ISME Journal 13: 1269–1279.

8. Brown H, Patterson C. 1947. The Composition of Meteoritic Matter: II. The

Composition of Iron Meteorites and of the Metal Phase of Stony Meteorites. The

Journal of Geology 55: 508–510.

9. Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T. 2009. trimAl: a tool for automated

alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25: 1972–1973.

10. Case AJ. 2017. On the Origin of Superoxide Dismutase: An Evolutionary Perspective of

Superoxide-Mediated Redox Signaling. Antioxidants, doi 10.3390/antiox6040082.

11. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T,

Kauff F, Wilczynski B, de Hoon MJL. 2009. Biopython: freely available Python tools for

computational molecular biology and bioinformatics. Bioinformatics 25: 1422–1423.

12. Desguin B, Goffin P, Viaene E, Kleerebezem M, Martin-Diaconescu V, Maroney MJ,

Declercq J-P, Soumillion P, Hols P. 2014. Lactate racemase is a nickel-dependent

enzyme activated by a widespread maturation system. Nature communications 5: 3615.

13. Duin EC. 2009. Role of Coenzyme F430 in Methanogenesis. In: Warren MJ, Smith AG

(ed.). Tetrapyrroles: Birth, Life and Death, pp. 352–374. Springer, New York, NY.

14. Dupont CL, Neupane K, Shearer J, Palenik B. 2008. Diversity, function and evolution of

genes coding for putative Ni-containing superoxide dismutases. Environmental

Microbiology 10: 1831–1843.

15. Eddy SR. 2015. HMMER: biosequence analysis using profile hidden Markov models.

United States. URL: hmmer.org

16. Eddy SR. 2019. HMMER User’s Guide. 229.

17. Eitinger T, Suhr J, Moore L, Smith JAC. 2005. Secondary transporters for nickel and

cobalt ions: theme and variations. Biometals: An International Journal on the Role of

Metal Ions in Biology, Biochemistry, and Medicine 18: 399–405.

18. Ferber DM, Maier RJ. 1993. Hydrogen-ubiquinone oxidoreductase activity by the

Bradyrhizobium japonicum membrane-bound hydrogenase. FEMS Microbiology Letters

110: 257–264.

19. Francis W. 2020. supermatrix.

20. Francis W. 2020. supermatrix: scripts to add new proteins to alignment using hmms.

URL: https://github.com/wrf/supermatrix.

21. Greening C, Ahmed FH, Mohamed AE, Lee BM, Pandey G, Warden AC, Scott C,

Oakeshott JG, Taylor MC, Jackson CJ. 2016. Physiology, Biochemistry, and Applications

of F420- and Fo-Dependent Redox Reactions. Microbiology and Molecular Biology

Reviews 80: 451–493.

22. Helmke M, Joseph EK, Rey JA, Hill BM. 2017. The official Ubuntu book, Ninth. Prentice

Hall, Boston.

23. Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. 2018. UFBoot2:

Improving the Ultrafast Bootstrap Approximation. Molecular Biology and Evolution 35:

518–522.

24. Holland BR, Ketelaar-Jones S, O’Mara AR, Woodhams MD, Jordan GJ. 2020. Accuracy

of ancestral state reconstruction for non-neutral traits. Scientific Reports 10: 7644.

25. Huelsenbeck JP, Nielsen R, Bollback JP. 2003. Stochastic Mapping of Morphological

Characters. Systematic Biology 52: 131–158.

26. Inoue M, Nakamoto I, Omae K, Oguro T, Ogata H, Yoshida T, Sako Y. 2019. Structural

and Phylogenetic Diversity of Anaerobic Carbon-Monoxide Dehydrogenases. Frontiers

in Microbiology, doi 10.3389/fmicb.2018.03353.

27. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. 2017.

ModelFinder: fast model selection for accurate phylogenetic estimates. Nature Methods

14: 587–589.

28. Kannan L, Li H, Rubinstein B, Mushegian A. 2013. Models of gene gain and gene loss for

probabilistic reconstruction of gene content in the last universal common ancestor of

life. Biology Direct 8: 32.

29. Kocot KM, Struck TH, Merkel J, Waits DS, Todt C, Brannock PM, Weese DA, Cannon

JT, Moroz LL, Lieb B, Halanych KM. 2017. Phylogenomics of Lophotrochozoa with

Consideration of Systematic Error. Systematic Biology 66: 256–282.

30. Konhauser KO, Pecoits E, Lalonde SV, Papineau D, Nisbet EG, Barley ME, Arndt NT,

Zahnle K, Kamber BS. 2009. Oceanic nickel depletion and a methanogen famine before

the Great Oxidation Event. Nature 458: 750–753.

31. Kück P, Meusemann K. 2010. FASconCAT: Convenient handling of data matrices.

Molecular Phylogenetics and Evolution 56: 1115–1118.

32. Lanier KA, Williams LD. 2017. The Origin of Life: Models and Data. Journal of Molecular

Evolution 84: 85–92.

33. Larsson A. 2014. AliView: a fast and lightweight alignment viewer and editor for large

datasets. Bioinformatics 30: 3276–3278.

34. Maddison WP, Maddison DR. 2019. Mesquite: a modular system for evolutionary

analysis. URL: http://www.mesquiteproject.org.

35. Maroney MJ, Ciurli S. 2014. Nonredox Nickel Enzymes. Chemical Reviews 114: 4206–

4228.

36. Martin W, Russell MJ. 2007. On the origin of biochemistry at an alkaline hydrothermal

vent. Philosophical Transactions of the Royal Society B: Biological Sciences 362: 1887–

1926.

37. McInerney J, Pisani D, O’Connell MJ. 2015. The ring of life hypothesis for eukaryote

origins is supported by multiple kinds of data. Philosophical Transactions of the Royal

Society B: Biological Sciences 370: 20140323.

38. Merkens H, Kappl R, Jakob RP, Schmid FX, Fetzner S. 2008. Quercetinase QueD of

Streptomyces sp. FLA, a Monocupin Dioxygenase with a Preference for Nickel and

Cobalt. Biochemistry 47: 12185–12196.

39. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A,

Lanfear R. 2020. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic

Inference in the Genomic Era. Molecular Biology and Evolution 37: 1530–1534.

40. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A,

Lanfear R. 2020. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic

Inference in the Genomic Era. Molecular Biology and Evolution 37: 1530–1534.

41. Mirkin BG, Fenner TI, Galperin MY, Koonin EV. 2003. Algorithms for computing

parsimonious evolutionary scenarios for genome evolution, the last universal common

ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes.

BMC Evolutionary Biology 3: 2.

42. Müller K, Wickham H. 2020. tibble: Simple Data Frames.

43. Mulrooney SB, Hausinger RP. 2003. Nickel uptake and utilization by microorganisms.

FEMS Microbiology Reviews 27: 239–261.

44. Mustafiz A, Ghosh A, Tripathi AK, Kaur C, Ganguly AK, Bhavesh NS, Tripathi JK, Pareek

A, Sopory SK, Singla‐Pareek SL. 2014. A unique Ni2+ -dependent and methylglyoxal-

inducible rice glyoxalase I possesses a single active site and functions in abiotic stress

response. The Plant Journal 78: 951–963.

45. Ney B, Ahmed FH, Carere CR, Biswas A, Warden AC, Morales SE, Pandey G, Watt SJ,

Oakeshott JG, Taylor MC, Stott MB, Jackson CJ, Greening C. 2017. The methanogenic

redox cofactor F420 is widely synthesized by aerobic soil bacteria. The ISME Journal 11:

125–137.

46. Nianios D, Thierbach S, Steimer L, Lulchev P, Klostermeier D, Fetzner S. 2015. Nickel

quercetinase, a “promiscuous” metalloenzyme: metal incorporation and metal ligand

substitution studies. BMC Biochemistry, doi 10.1186/s12858-015-0039-4.

47. Nickel Institute. 2012. About nickel. online May 19, 2012:

https://www.nickelinstitute.org/about-nickel. Accessed May 14, 2020.

48. Nielsen R. 2002. Mapping Mutations on Phylogenies. Systematic Biology 51: 729–739.

49. Nitschke W, McGlynn SE, Milner-White EJ, Russell MJ. 2013. On the antiquity of

metalloenzymes and their substrates in bioenergetics. BBA - Bioenergetics 1827: 871–

881.

50. Oracle Corporation. 2020. Oracle VM VirtualBox. Oracle Corporation, Redwood City,

California, United States. URL: https://www.virtualbox.org/.

51. Paradis E, Claude J, Strimmer K. 2004. APE: Analyses of Phylogenetics and Evolution in R

language. Bioinformatics 20: 289–290.

52. Paradis E, Schliep K. 2019. ape 5.0: an environment for modern phylogenetics and

evolutionary analyses in R. Bioinformatics 35: 526–528.

53. Peretó JG, Velasco AM, Becerra A, Lazcano A. 1999. Comparative biochemistry of CO2

fixation and the evolution of autotrophy. International Microbiology: The Official Journal

of the Spanish Society for Microbiology 2: 3–10.

54. Peters JW, Schut GJ, Boyd ES, Mulder DW, Shepard EM, Broderick JB, King PW, Adams

MWW. 2015. [FeFe]- and [NiFe]-hydrogenase diversity, mechanism, and maturation.

Biochimica et Biophysica Acta (BBA) - Molecular Cell Research 1853: 1350–1369.

55. R Core Team. 2020. R: A Language and Environment for Statistical Computing. R

Foundation for Statistical Computing, Vienna, Austria.

56. Ragsdale SW. 2007. Nickel and the Carbon Cycle. Journal of inorganic biochemistry 101:

1657–1666.

57. Ragsdale SW. 2009. Nickel-based Enzyme Systems. Journal of Biological Chemistry 284:

18571–18575.

58. Revell LJ. 2012. phytools: An R package for phylogenetic comparative biology (and other

things). Methods in Ecology and Evolution 3: 217–223.

59. Rodionov DA, Hebbeln P, Gelfand MS, Eitinger T. 2006. Comparative and Functional

Genomic Analysis of Prokaryotic Nickel and Cobalt Uptake Transporters: Evidence for a

Novel Group of ATP-Binding Cassette Transporters. Journal of Bacteriology 188: 317–

327.

60. Rozewicki J, Li S, Amada KM, Standley DM, Katoh K. 2019. MAFFT-DASH: integrated

protein sequence and structural alignment. Nucleic Acids Research 47: W5–W10.

61. RStudio Team. 2020. RStudio: Integrated Development Environment for R. RStudio,

PBC., Boston, MA. URL: http://www.rstudio.com/.

62. Russell MJ, Daniel RM, Hall AJ, Sherringham JA. 1994. A hydrothermally precipitated

catalytic iron sulphide membrane as a first step toward life. Journal of Molecular

Evolution 39: 231–243.

63. Russell MJ, Martin W. 2004. The rocky roots of the acetyl-CoA pathway. Trends in

Biochemical Sciences 29: 358–363.

64. Sawers RG. 2013. Nickel in Bacteria and Archaea. In: Kretsinger RH, Uversky VN,

Permyakov EA (ed.). Encyclopedia of Metalloproteins, pp. 1490–1496. Springer, New

York, NY.

65. Schmidt HA. 2004. TAXNAMECONVERT. Distributed by the author. URL:

http://www.cibiv.at/software/taxnameconvert/.

66. Sekowska A, Ashida H, Danchin A. 2018. Revisiting the methionine salvage pathway and

its paralogues. Microbial Biotechnology 12: 77–97.

67. Sessions AL, Doughty DM, Welander PV, Summons RE, Newman DK. 2009. The

Continuing Puzzle of the Great Oxidation Event. Current Biology 19: R567–R574.

68. Shi R, Wodrich MD, Pan H-J, Tirani FF, Hu X. 2019. Functional Models of the Nickel

Pincer Nucleotide Cofactor of Lactate Racemase. Angewandte Chemie International

Edition 58: 16869–16872.

69. Siegbahn PEM, Chen S-L, Liao R-Z. 2019. Theoretical Studies of Nickel-Dependent

Enzymes. Inorganics 7: 95.

70. Škunca N, Dessimoz C. 2015. Phylogenetic Profiling: How Much Input Data Is Enough?

PLOS ONE 10: e0114701.

71. Techtmann S, Colman AS, Lebedinsky AV, Sokolova TG, Robb FT. 2012. Evidence for

Horizontal Gene Transfer of Anaerobic Carbon Monoxide Dehydrogenases. Frontiers in

Microbiology, doi 10.3389/fmicb.2012.00132.

72. The UniProt Consortium. 2018. UniProt: a worldwide hub of protein knowledge.

Nucleic Acids Research 47: D506–D515.

73. Walsh CT, Orme-Johnson WH. 1987. Nickel enzymes. Biochemistry 26: 4901–4906.

74. Wang L-G, Lam TT-Y, Xu S, Dai Z, Zhou L, Feng T, Guo P, Dunn CW, Jones BR,

Bradley T, Zhu H, Guan Y, Jiang Y, Yu G. 2020. treeio: an R package for phylogenetic

tree input and output with richly annotated and associated data. Molecular Biology and

Evolution 37: 599–603.

75. Weiss MC, Sousa FL, Mrnjavac N, Neukirchen S, Roettger M, Nelson-Sathi S, Martin

WF. 2016. The physiology and habitat of the last universal common ancestor. Nature

Microbiology 1: 16116.

76. Westerholm M, Moestedt J, Schnürer A. 2016. Biogas production through syntrophic

acetate oxidation and deliberate operating strategies for improved digester

performance. Applied Energy 179: 124–135.

77. Wickham H. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New

York

78. Williams TA, Cox CJ, Foster PG, Szöllősi GJ, Embley TM. 2020. Phylogenomics provides

robust support for a two-domains tree of life. Nature Ecology & Evolution 4: 138–147.

79. Williams TA, Foster PG, Cox CJ, Embley TM. 2013. An archaeal origin of eukaryotes

supports only two primary domains of life. Nature 504: 231–236.

80. Yu G, Smith DK, Zhu H, Guan Y, Lam TT-Y. 2017. ggtree: an r package for visualization

and annotation of phylogenetic trees with their covariates and other associated data.

Methods in Ecology and Evolution 8: 28–36.

81. Yu G. 2020. tidytree: A Tidy Tool for Phylogenetic Tree Data Manipulation.

82. Zamble D, Rowińska-Żyrek M, Kozlowski H. 2017. The Biological Chemistry of Nickel.

Royal Society of Chemistry

83. Zhang Y, Gladyshev VN. 2010. General Trends in Trace Element Utilization Revealed by

Comparative Genomic Analyses of Co, Cu, Mo, Ni, and Se. Journal of Biological

Chemistry 285: 3393–3405.

84. Zhang Y, Rodionov DA, Gelfand MS, Gladyshev VN. 2009. Comparative genomic

analyses of nickel, cobalt and vitamin B12 utilization. BMC Genomics 10: 78.

LIST OF APPENDICES

Appendix A: Lophotrochozoa Hits

The supplementary folder “Lophotrochozoa Hits” includes a collection of files that resulted from running the HMMER search for nickel enzymes against a database of Lophotrochozoa sequences taken from the study “Phylogenomics of Lophotrochozoa with Consideration of Systematic Error” (Kocot et al. 2017).

Appendix B: Lingula reevi Hits

The supplementary folder “Lingula reevi Hits” includes a collection of files that resulted from running the HMMER search for nickel enzymes against the complete transcriptome of Lingula reevi from unpublished data by Aodhán Butler.

Appendix C: Occupancy Matrices

The supplementary folder “Occupancy Matrices” includes all occupancy matrices produced in the study both as colored charts and TAB-delimited files.

Appendix D: Gene Trees

The supplementary folder “Gene Trees” includes all gene trees produced in this study with taxa labels and color-coding based on corresponding phylum and kingdom rank.

Appendix E: Ancestral State Reconstruction (ASR) Trees

The supplementary folder “Ancestral State Reconstruction (ASR) Trees” includes all ancestral reconstruction results produced in this study in addition to an “Estimated Parameter Values” file, which contains all parameter values estimated from the data for each ancestral reconstruction model.

Appendix F: Tanglegrams

The supplementary folder “Tanglegrams” includes all co-phylogenetic plots created between the reference tree and the individual gene trees.

Appendix G: Reference Tree

The supplementary folder “Reference Tree” includes the reference tree from the study “Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea” (Zhu et al. 2019), with taxa labels and color-coding based on corresponding phylum and kingdom rank, in addition to a metadata file “Metadata of the 10,575 reference genomes (Zhu et al. 2019)” on the taxa used in that study.
