
The Evolution of Nucleotide Excision Repair in Eukaryotes

Hrafnhildur Agnarsdóttir

Faculty of Life and Environmental Sciences
University of Iceland
2018

The Evolution of Nucleotide Excision Repair Genes in Eukaryotes

Hrafnhildur Agnarsdóttir

14 credit thesis as part of a Baccalaureus Scientiarum degree in biology

Instructor Arnar Pálsson

Faculty of Life and Environmental Sciences
School of Engineering and Natural Sciences
University of Iceland
Reykjavík, May 14

The Evolution of Nucleotide Excision Repair Genes in Eukaryotes The Evolution of NER Genes 14 credit thesis as a part of Baccalaureus Scientiarum degree in biology

Copyright © 2018 Hrafnhildur Agnarsdóttir. All rights reserved.

Faculty of Life and Environmental Sciences School of Engineering and Natural Sciences University of Iceland Askja, Sturlugata 7 107 Reykjavík

Phone: 525 4000

Reference information: Hrafnhildur Agnarsdóttir, 2018, The Evolution of Nucleotide Excision Repair Genes in Eukaryotes, BS thesis, Faculty of Life and Environmental Sciences, University of Iceland, 52 pages

Print: Háskólaprent Reykjavík, May 2018


Abstract

Major advances in genetics and biotechnology have enabled us to study and compare the genomes of different organisms. With this information available, the evolutionary origin of specific genes has been established, and it has been shown that some genes are more conserved than others, mainly genes encoding key metabolic and cellular processes such as replication, transcription and translation. Genes encoding DNA repair proteins seem to be highly conserved, which may indicate parallel evolution of similar functions in bacteria and archaea. Curiously, many DNA repair genes are conserved in humans and yeast, but some seem to have been lost in certain eukaryotic taxa. This study set out to examine the preservation of genes involved in the nucleotide excision repair (NER) pathway and to map the losses of specific genes among eukaryotic groups. The approach was to bioinformatically identify orthologs of proteins involved in this process, using KEGG and OrthoDB, and to construct a map of preservation and losses. The key genes in the process were preserved in all taxa, but six genes thought to be involved in DNA damage detection, DDB2, ERCC6, ERCC8, XPA, TTDA and CETN2, showed taxon-specific losses throughout the eukaryotic domain. For instance, DDB2 was found to be lost in three clades (fungi, nematodes and insects), while it was preserved in vertebrates and plants. ERCC6 and ERCC8 were found to be lost two and three times, respectively: both were lost from species within Basidiomycota and Diptera, while ERCC8 was lost earlier in the insect lineage (it is also missing from Hymenoptera) and from four nematode species. A more comprehensive map of gene content and of the phylogenetic origins of eukaryotes, together with the functions of these genes, could help us better understand the mechanism of DNA repair and the evolution of conserved systems.


Ágrip (abstract in Icelandic, translated)

Enormous advances in genetics and biotechnology have made it possible to examine and compare the genomes of different organisms. Guided by this information, the evolutionary origin of particular genes has been defined, and it has been shown that some genes are better conserved than others, especially those encoding important metabolic and cellular processes such as replication, transcription and translation. Genes encoding DNA repair appear to be well conserved, which could indicate that parallel evolution of similar function has taken place in bacteria and archaea. Interestingly, many DNA repair genes are conserved in humans and yeast, but some appear to have been lost in certain groups of eukaryotes. The aim of this project was to examine the conservation of the genes that take part in nucleotide excision repair (NER) and to map the loss of particular genes among eukaryotic groups. A bioinformatic approach was used to identify orthologs of proteins that take part in this process, using the KEGG and OrthoDB databases, and from these to map conservation and loss. Key genes in the nucleotide excision repair process were conserved in all groups, but six of those thought to take part in recognizing DNA damage, DDB2, ERCC6, ERCC8, XPA, TTDA and CETN2, showed losses within certain groups of eukaryotes. For example, DDB2 appeared to have been lost in three clades (fungi, nematodes and insects) but remained conserved in vertebrates and plants. ERCC6 and ERCC8 also appeared to have been lost two to three times, as both genes had been lost in species within the basidiomycetes and dipterans, while ERCC8 was lost earlier in insects (it had been lost in hymenopterans) and in four nematode species. A more comprehensive map of gene content and of the taxonomic origin of eukaryotes, and of the function of these genes, could give us a better understanding of the interplay of DNA repair and the evolution of conserved systems.


Table of contents

Abstract

Figures

Tables

Acknowledgements

1 Introduction
  1.1 Historical perspectives
  1.2 Genomic trends
  1.3 Mutations and DNA repair
  1.4 NER defects in humans
    1.4.1 Xeroderma Pigmentosum
    1.4.2 Cockayne Syndrome
    1.4.3 Trichothiodystrophy
  1.5 Transcription coupled repair
    1.5.1 ERCC6/CSB
    1.5.2 ERCC8/CSA
  1.6 Global genome repair
    1.6.1 XPC
    1.6.2 DDB2/XPE
  1.7 Evolutionary perspectives

2 Materials and Methods
  2.1 Data collection
  2.2 Ascertainment of gene study set
  2.3 Taxonomic representation and subsets
  2.4 Statistical testing for gene losses
  2.5 Phylogenetic tree construction
  2.6 Supporting analyses

3 Results
  3.1 Representation of NER genes in Eukaryotes
  3.2 Phylogenetic reconstruction

4 Discussion
  4.1 DNA Repair in the Three Domains of Life
  4.2 DDB2
  4.3 ERCC6 and ERCC8
  4.4 XPA
  4.5 CETN2
  4.6 Challenges
  4.7 Conclusion

References

Appendix A

Appendix B

Appendix C

Appendix D

Appendix E

Figures

Figure 1.1 Proteins involved in the nucleotide excision repair pathway in humans.

Figure 3.1 Presence and loss of 18 NER genes in 429 whole genome sequenced eukaryotes.

Figure 3.2 Gene representation in vertebrates, arthropods and nematodes.

Figure 3.3 Presence of ERCC6, ERCC8 and DDB2 in eukaryotes.

Figure 3.4 Phylogenetic trees based on distances between amino acid sequences of baseline genes, ERCC3 and ERCC5.

Figure 3.5 Phylogenetic trees based on distances between amino acid sequences of unpreserved genes, ERCC6, ERCC8 and DDB2.

Tables

Table 1.1 Genes of importance in the nucleotide excision repair pathway (NER).

Table 2.1 Genes involved in the nucleotide excision repair pathway post filtering of rare genes.

Table 2.2 Genes involved in NER concerned with this focus study.

Table 3.1 P-values from tests on gene loss within 5 taxonomic groups.

Table 3.2 The presence (1) and absence (0) of genes involved in the nucleotide excision repair pathway in eukaryotes.

Acknowledgements

I would like to thank my instructor Arnar Pálsson for his excellent guidance and endless patience during the writing process. I would also like to thank my family and friends for their encouraging words during this period.


1. Introduction

1.1 Historical perspectives

In this age of rapid change, where technological advances have never been greater, one may conclude that we have made excellent progress as a species in almost every respect. However, we have been puzzled by the same question for centuries: how did life originate? Explaining life's origin has been a central aspect of all major religions, and endless attempts have been made to answer this seemingly unsolvable problem. One could argue that humanity's quest to find the origin of life will never come to an end, but each day we come closer to explaining how life began.

Although we are nowhere close to the absolute truth about how life originated, one thing is known with confidence: all species currently inhabiting the planet share a common ancestor. This was proposed by Charles Robert Darwin in 1859 in his greatest scientific contribution, “On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life”. This central tenet of the theory of evolution has stood the test of time (for almost 160 years), and multiple studies have shown that humans descended from another primate species. Since the publication of “On the Origin of Species”, a myriad of milestones have been passed in the field of biology, and the quest to understand in detail how organismal diversity, including human beings, evolved from the simply structured last universal common ancestor is certainly not slowing down. One such milestone was the discovery of DNA by Miescher in 1869, although its role as the hereditary material of life was only established much later.

Although DNA was discovered in the late 1800s, the idea that organisms carry hereditary units had been put forward a few years earlier with Mendel's experiments on pea plants (Abbott & Fairbanks, 2016). This work was, however, ignored by a majority of scientists until it was rediscovered in 1900 (Futuyma, 2005). Seminal work by Thomas Hunt Morgan and his colleagues later demonstrated genetic linkage and chromosomal inheritance, and Mendel's theory has since been accepted as an integral cog in classical genetics (Morgan, 1915). Hershey and Chase later provided the groundwork for showing that it was indeed the molecule discovered by Miescher that carried this hereditary information, not proteins as was believed by many of their contemporaries (Hershey & Chase, 1952). Less than a century after Miescher's discovery, the double helical structure of DNA was solved by Watson and Crick (Watson & Crick, 1953).

Understanding the structure of the DNA molecule and the central dogma revolutionized studies in biology and medicine, giving rise to many groundbreaking discoveries (Crick, 1958). Less than 20 years later, the genetic code had been cracked and a technique for sequencing DNA had been invented (Sanger et al., 1977). Consequently, a whole new subfield of biology was born: comparative genetics, and later comparative genomics, in which genetic and genomic features of different organisms are studied. One could argue that the genomic age began in 1995 with the whole genome sequencing of Haemophilus influenzae by Fleischmann et al. (1995). Five years later, Drosophila melanogaster had been sequenced by Adams et al. (2000), and thousands of genomes have now been sequenced, giving us a more coherent idea of how species differ in genetic composition. This has also allowed us to deduce where, when and whether certain genomic elements are expressed or functional in organisms of different taxa, shedding light on the evolutionary history of life on earth.

1.2 Genomic trends

Major advances in genetics and biotechnology led to the conclusion that genomes are composed of coding and non-coding DNA (such as cis- and trans-regulatory elements, introns, rRNA and tRNA genes, transposons, etc.), which holds true for all organisms (apart from certain viruses, which carry RNA genomes). However, the composition of genomes varies between organisms, both in size and content. It may be easy to assume that larger organisms have larger genomes, but this is not the case; e.g. the amoeba Amoeba dubia has a 200-fold larger genome than H. sapiens (Gee, 2001). This has been dubbed the “C-value paradox”, and together with its counterpart, the “G-value paradox”, it states that simpler organisms do not necessarily have smaller genomes, nor do they necessarily have fewer genes. Genome size does, however, show a certain trend across the major branches of life. Bacteria and viruses tend to have small and condensed genomes, supposedly efficient and quickly replicated (for instance by having few introns). In contrast, eukaryotic genomes tend to be larger and more complex, with large sections of non-coding DNA of mostly unknown function.

How does this variance in genome size and gene content come about? The obvious answer is that there have been changes in heritable traits between generations of organisms, i.e. evolution. These changes arise through genomic mutations, recombination and gene flow, and are favored by natural selection if they enhance an organism's fitness. A small genome, for example, provides an advantage for bacterial species, as it promotes quicker DNA replication, which corresponds to higher fitness. It is, however, critical to abandon teleological thought: evolution is not goal oriented, and bacteria do not have smaller genomes for the purpose of quicker DNA replication. Rather, bacteria with smaller genomes proved to be more fit than bacteria with larger genomes because they replicate their DNA at a faster rate. We nevertheless tend to adopt a natural teleology in our way of thinking, in which parts of an organism serve some intrinsic purpose; e.g. D. melanogaster evolved the vestigial gene, which guides the development of the wing discs, so that the organism can eventually fly, making the species better fit for its environment. We must therefore perceive evolutionary changes as products of a blind process, rather than as means to some definite end. These changes in genomic composition do result in phenotypic effects that alter fitness in some way, and they reflect the organism's evolutionary and ecological history. This is clearly demonstrated in living things, with species seemingly specifically adapted to their geographical and ecological niches.

Thermus thermophilus provides a good example of this and has been used as a model species for thermophilic organisms, due to its structural and physiological features for genetic applications, its most definitive characteristic being the ability to grow optimally at temperatures between 65 and 72°C (Oshima & Imahori, 1974). Its ability not only to tolerate but to prefer these extreme temperatures is due to the evolution of thermostable enzymes, making it fit for extreme environmental conditions. Another example of evolutionary adaptation in bacteria is the pathogenic genus Mycoplasma. These bacteria lack cell walls and are the smallest known microorganisms capable of self-replication. Along with lacking a cell wall, they also lack various genes involved in DNA repair, making them more prone to mutation, which consequently increases antibiotic resistance. These characteristics have evolved through reduction in genome size, as the bacteria obtain amino acids, lipoproteins and other nutrients from their host, enabling more rapid DNA replication (Andersson & Kurland, 1998). This raises the question of why pathogens such as Mycoplasma do not require DNA repair mechanisms. One can hypothesize that pathogens rely on rapid mutation to maintain the species' population, which would be slowed down if DNA repair were present. Alternatively, they may be protected from radiation by living inside their host.

The essential factor in all of evolution is mutation, the raw material contributing to phenotypic variation. Mutations, i.e. changes in a DNA sequence, are only able to affect evolution if they are passed on to successive generations. These mutations can take the form of substitutions, indels, duplications, frameshifts and intragenic recombinations. Evolutionary biologists have proposed several mechanisms of gene formation, including gene duplication, transposable element domestication, lateral gene transfer, gene fusion, gene fission and de novo formation.


Gene duplication is the most common mode of gene formation, and even whole genomes can be duplicated. Since a duplicated genome or genomic element is not necessary for the survival of the organism, the duplicates are free to accumulate mutations, which can lead to innovation. The duplication of genes is thought to occur relatively frequently; for instance, over 100 gene duplications per million years have been reported (Hahn, Demuth, & Han, 2007). Lateral transfer occurs when genes are not inherited from parents but acquired from another organism's genome, such as between Wolbachia pipientis and Drosophila spp. (Dunning Hotopp et al., 2007). Genes can also be generated by fusion of two or more genes, or by fission, where a transcriptional unit breaks into multiple transcripts. This mainly occurs through chromosomal changes and retrotransposition of genes, and often involves duplicated genes. Genes can also be generated de novo. De novo mutations arise in the germ line of the parent organisms and become apparent in the offspring. They are generally more deleterious, as less selection has acted against their phenotypic effects (Veltman & Brunner, 2012). Origination by de novo mechanisms must have been how the first genes formed, and was probably very important in the early phases of life on earth. It is unclear how important it is in modern species.

Through evolution, some genes have persisted over time through purifying natural selection. These can be considered conserved genes, as they differ minimally between species, and they are most likely involved in key metabolic or cellular processes such as metabolism, replication, transcription, translation and DNA repair. Their conservation is thought to reflect purifying selection: mutations in these genes have such adverse effects on organisms that the genes must carry out extremely important functional roles (Siepel et al., 2005). Conservation of genes and their function can manifest in two ways. First, as conservation of the amino acid sequence or RNA fold, which is seen as very slow sequence evolution with few changes over millions of years. Second, a gene can be preserved in all lineages because it serves a vital function. Such genes may experience a higher rate of sequence evolution than genes in the first class, but still be essential to life.
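The first kind of conservation, slow change in the amino acid sequence, can be illustrated with a simple percent identity calculation on two aligned sequences. The following is only a minimal sketch: the function name and the toy sequences are hypothetical and are not taken from this study, and it assumes the sequences have already been aligned to equal length with "-" marking gaps.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of ungapped alignment columns in which both residues match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        # Skip columns where either sequence has a gap.
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


# Hypothetical toy alignments (not real NER protein sequences).
slowly_evolving = ("MKTLLVAG", "MKTLLVAG")
faster_evolving = ("MKTLLVAG", "MRSL-IAG")

print(percent_identity(*slowly_evolving))
print(round(percent_identity(*faster_evolving), 1))
```

A gene of the second kind could score low on such a measure and still be present in every lineage, which is why presence and absence across taxa is treated separately from sequence similarity.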

1.3 Mutations and DNA repair

As cell divisions are fundamental to living organisms, DNA replication is required for growth, development and reproduction. DNA replication is a precise process that occurs rapidly, at about 50 nucleotides/second in humans (Miko & LeJeune, 2009). Although this happens with high precision, errors do occur. These errors, in the form of damaged DNA, may result from endogenous chemical events, such as hydrolysis and exposure to reactive oxygen species, or from exposure to exogenous agents such as viruses and ultraviolet radiation. The rate of errors is estimated to be 10,000 to 1,000,000 molecular lesions per cell per day in humans (Hoeijmakers, 2009). One of the main causes of DNA damage is the formation of pyrimidine dimers caused by UV irradiation. These dimers take the form of cyclobutane pyrimidine dimers (CPD) and pyrimidine (6-4) pyrimidone photoproducts (6-4PP) (Mouret et al., 2006). The effects of UV-induced damage have been well characterized and are known to be adverse to organisms. To counteract these relatively frequent errors, organisms have evolved repair mechanisms that ensure DNA stability and unaltered transcription (Gahan, 2005). The importance of these processes is demonstrated by the severity of human disorders known to be caused by DNA repair deficiency. These disorders have been linked to two major DNA repair pathways: a) base excision repair (BER) and b) nucleotide excision repair (NER). In base excision repair, DNA glycosylases identify altered bases in the DNA sequence and catalyze hydrolysis at those bases, while in nucleotide excision repair, larger alterations in the DNA double helix are identified and excised. After the oligonucleotide containing the lesion has been excised, the DNA is restored using the non-damaged strand as a template for polymerization. In eukaryotes, this large-scale repair can occur in two ways, through transcription-coupled repair (TC-NER) and global genomic repair (GG-NER), which differ in how damage is detected. Once damage has been detected, the same NER core factors are required in both pathways (Fousteri & Mullenders, 2008). A distinct feature of NER is its wide substrate specificity and its ability to remove bulky lesions and thermodynamically unstable DNA. In humans, over 30 proteins are required to carry out this pathway through sequential protein-protein interactions (Figure 1.1; Table 1.1).


Figure 1.1 Proteins involved in the nucleotide excision repair pathway in humans. In global genomic repair, products of XPC and XPE/DDB2 recognize damage such as cyclobutane pyrimidine dimers (CPD) and 6-4 photoproducts (6-4PP). In transcription-coupled repair, DNA lesions are recognized through a stall in the progression of RNA polymerase II, where CSA and CSB play an integral part. After initial damage recognition, the two pathways converge and are identical. The area surrounding the DNA damage site is unwound by helicases encoded by XPB/ERCC3 and XPD/ERCC2, with the assistance of XPA and XPG. Endonucleases encoded by XPF/ERCC4 and XPG/ERCC5 excise the lesion by making single-strand breaks. The non-damaged strand is then used as a template for polymerization. Adapted from Kraemer et al. (2007): “Xeroderma pigmentosum, trichothiodystrophy and Cockayne syndrome: a complex genotype-phenotype relationship” by K.H. Kraemer, N.J. Patronas, R. Schiffmann, B.P. Brooks, D. Tamura and J.J. DiGiovanna, 2007, Neuroscience, 145(4), p. 1388-1396. Copyright 2007 by Elsevier.


1.4 NER defects in humans

DNA repair is important for viability, as mutations in several genes encoding key proteins in the nucleotide excision repair pathway cause disorders in humans. Although varying in phenotype, these disorders all stem from a reduced capacity to repair cyclobutane pyrimidine dimers and 6-4 pyrimidine-pyrimidone photoproducts caused by UV irradiation, and all are inherited in an autosomal recessive manner (Kraemer et al., 2007). These disorders, their characteristics and their molecular similarities are briefly described below.

1.4.1 Xeroderma Pigmentosum

The first of the major disorders caused by defects in the nucleotide excision repair pathway is Xeroderma Pigmentosum (XP). XP is estimated to occur in 1 in 22,000 individuals in Japan (Hirai et al., 2006), 1 in 250,000 in the United States (Robbins, Kraemer, Lutzner, Festoff, & Coon, 1974) and 2.3 per million live births in Western Europe (Kleijer et al., 2008). It is characterized by extreme sensitivity to sun, sunlight-induced ocular involvement (e.g. photophobia) and a highly elevated risk of cancer in sun-exposed areas, with 25% of affected individuals also showing neurodegeneration (Niedernhofer, Bohr, Sander, & Kraemer, 2011). Approximately 60% of XP cases show extreme sensitivity to sun, resulting in sunburns that can last for weeks. The other 40% develop freckle-like pigmentation on the face, usually by the age of two. The severity of symptoms depends on the individual's exposure to sun, and patients who live in areas with less sunlight and stay well protected from UV irradiation may therefore show mild symptoms. Individuals with XP under the age of 20 are estimated to have a 10,000-fold increased risk of non-melanoma skin cancers and a 2,000-fold increased risk of melanoma.

XP can be caused by defects in one of eight genes. Seven of these genes are involved in the NER pathway: XPA, XPB/ERCC3, XPC, XPD/ERCC2, XPE/DDB2, XPF/ERCC4 and XPG/ERCC5. The eighth, XPV/POLH, encodes the translesion polymerase that is required to bypass DNA damage during translesion synthesis (Lehmann, McGibbon, & Stefanini, 2011). The products of XPC and XPE/DDB2 are confined to the global genomic repair pathway, while the other five NER genes contribute to both global genomic repair and transcription-coupled repair.

1.4.2 Cockayne Syndrome

Cockayne syndrome (CS) is a disorder estimated to occur in 2 to 3 per million newborns in the USA and Europe (Kleijer et al., 2008). CS is characterized by extreme photosensitivity, microcephaly, failure to thrive leading to short stature, delayed development, decreased hearing, loss of speech and gait, and the appearance of rapid aging (progeroid symptoms). Unlike individuals with XP, individuals with CS do not show pigmentation changes or skin lesions, nor are they at an increased risk of developing cancer. The disorder has been divided into four types, I, II, III and XP-CS, depending on the time of onset and severity of symptoms (Laugel, 1993). Individuals with type I CS display the classic form of the disorder, where symptoms of growth and developmental abnormalities are generally apparent by the age of two and lifespan is usually below 20 years. Type II individuals show a more severe form, where symptoms are present at birth and postnatal brain growth is lacking. Patients with type II CS typically die before the age of seven. Type III patients show the mildest form of CS, where growth and neurological development are practically normal or symptoms become apparent late. Xeroderma pigmentosum-Cockayne syndrome (XP-CS) patients show classic XP symptoms such as pigmentation changes and increased risk of cancer, while also showing some CS symptoms such as cognitive impairment, short stature and hypogonadism. Cockayne syndrome is thought to be caused by a mutation in one of two genes, ERCC6/CSB or ERCC8/CSA. 65% of CS patients carry a mutation in ERCC6 and 35% a mutation in ERCC8 (Laugel, 1993). Both genes are thought to be involved in the early detection of damaged DNA in the transcription-coupled repair pathway and are described further below.

1.4.3 Trichothiodystrophy

Trichothiodystrophy (TTD) is characterized by brittle hair and nails and has a wide spectrum of severity. The most severe cases show delayed development, cognitive impairment and recurring infections (often respiratory), all of which can lead to death in infancy or childhood, while milder cases may only show sulfur-deficient hair (Kraemer et al., 2007). However, most affected individuals show symptoms similar to those of CS, i.e. short stature, cognitive impairment and delayed development, and half of TTD patients also show photosensitivity, but without the pigmentation changes that occur in XP patients. This disorder is believed to occur in 1 per million newborns in the USA and Europe, with only 100 cases reported globally (Atkinson et al., 2014).

TTD is thought to be caused by defective proteins in the transcription factor IIH (TFIIH) complex. These proteins are encoded by XPD/ERCC2, XPB/ERCC3 and TTDA/GTF2H5.


Table 1.1 Genes of importance in the nucleotide excision repair pathway (NER) and associated pathways and phenotypes. Their assumed function, the subpathways they are involved in and the result of defective proteins, either Xeroderma pigmentosum, Cockayne syndrome, Trichothiodystrophy or a combination of the three.

Gene          Protein function                          Pathway                 Defective phenotype
XPA           Damage verification                       NER                     XP
XPB/ERCC3     Helicase                                  NER                     XP, TTD, XP/CS
XPC           Damage recognition                        GG-NER                  XP
XPD/ERCC2     Helicase                                  NER                     XP, XP/TTD, TTD, XP/CS, CS/TTD
XPE/DDB2      Damage recognition                        GG-NER                  XP
XPF/ERCC4     Nuclease                                  NER                     XP
XPG/ERCC5     Nuclease                                  NER                     XP, XP/CS
XPV/POLH      Polymerase                                Translesion synthesis   XP
CSA/ERCC8     Recruitment regulation/damage detection   TC-NER                  CS
CSB/ERCC6     Recruitment regulation/damage detection   TC-NER                  CS
GTF2H5/TTDA   Stabilization                             NER                     TTD

Note: This table is adapted from Niedernhofer et al. (2011). Abbreviations: NER, nucleotide excision repair; XP, Xeroderma pigmentosum; TTD, Trichothiodystrophy; CS, Cockayne syndrome; GG-NER, global genome repair; TC-NER, transcription coupled repair.
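For readers who want to work with Table 1.1 programmatically, its rows can be held in a small lookup structure. The sketch below is illustrative only and is not part of the analysis in this thesis: the names NER_GENES, genes_by_pathway and genes_with_phenotype are hypothetical, and the values are copied directly from Table 1.1.

```python
from collections import defaultdict

# Rows of Table 1.1: gene -> (protein function, pathway, defective phenotypes).
NER_GENES = {
    "XPA": ("Damage verification", "NER", ["XP"]),
    "XPB/ERCC3": ("Helicase", "NER", ["XP", "TTD", "XP/CS"]),
    "XPC": ("Damage recognition", "GG-NER", ["XP"]),
    "XPD/ERCC2": ("Helicase", "NER", ["XP", "XP/TTD", "TTD", "XP/CS", "CS/TTD"]),
    "XPE/DDB2": ("Damage recognition", "GG-NER", ["XP"]),
    "XPF/ERCC4": ("Nuclease", "NER", ["XP"]),
    "XPG/ERCC5": ("Nuclease", "NER", ["XP", "XP/CS"]),
    "XPV/POLH": ("Polymerase", "Translesion synthesis", ["XP"]),
    "CSA/ERCC8": ("Recruitment regulation/damage detection", "TC-NER", ["CS"]),
    "CSB/ERCC6": ("Recruitment regulation/damage detection", "TC-NER", ["CS"]),
    "GTF2H5/TTDA": ("Stabilization", "NER", ["TTD"]),
}


def genes_by_pathway(table):
    """Group gene names by the pathway column of the table."""
    groups = defaultdict(list)
    for gene, (_function, pathway, _phenotypes) in table.items():
        groups[pathway].append(gene)
    return dict(groups)


def genes_with_phenotype(table, phenotype):
    """List genes whose defects are associated with exactly this phenotype label."""
    return [gene for gene, (_f, _p, phenotypes) in table.items()
            if phenotype in phenotypes]


print(genes_by_pathway(NER_GENES)["GG-NER"])
print(genes_by_pathway(NER_GENES)["TC-NER"])
print(genes_with_phenotype(NER_GENES, "TTD"))
```

Grouping by pathway separates the sub-pathway-specific damage recognition genes (GG-NER and TC-NER) from the genes listed under NER as a whole.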

1.5 Transcription coupled repair

Despite advances in biochemistry and molecular biology, it is not fully understood how cells detect DNA damage and how this leads to activation of the signaling pathways that recruit the repair machinery. However, it has been suggested that in transcription-coupled repair, DNA lesions affect the progression of RNA polymerase during RNA synthesis, causing the polymerase to stall or become blocked. This may cause a collapse of replication forks, resulting in apoptosis or altered and faulty transcripts. Many factors in this process are highly conserved from Escherichia coli to yeast to mammals, showing similarity in protein sequences and functional domains (Beerens, Hoeijmakers, Kanaar, Vermeulen, & Wyman, 2005).

It has been hypothesized that when the RNA polymerase II (RNAPII) complex becomes stalled, the NER machinery of the TC-NER pathway becomes activated. This machinery includes factors such as ERCC8/CSA and ERCC6/CSB, both integral to the TC-NER subpathway, which are described briefly below.

1.5.1 ERCC6/CSB

The ERCC6 protein (called CSB in Homo sapiens), encoded by the ERCC6/CSB gene, seems to play a vital role in the TC-NER pathway, as demonstrated by the lack of TC-NER in ERCC6-deficient cells. Multiple models have been suggested to explain how ERCC6 interacts with RNAPII, but this interaction remains poorly understood. ERCC6 is thought to act similarly to the transcription-repair coupling factor (TRCF) in E. coli, by activating recruitment of other repair factors. As ERCC6 belongs to the SWI2/SNF2 family of DNA-dependent ATPases, which includes nucleosome remodeling factors, it has also been suggested to act as a chromatin remodeling factor at the damage site by exerting its DNA-stimulated ATPase activity and altering the double helical conformation of DNA, although the specifics of this process are still unexplained (Beerens et al., 2005). Despite the incomplete molecular knowledge of precisely how ERCC6 affects DNA conformation, it is still considered an important early factor in the TC-NER subpathway in humans, with its C-terminus interacting with RNAPII, translocating the ERCC8 protein to the nucleus and summoning core NER factors such as XPA, TFIIH, ERCC5/XPG, the ERCC4/XPF-ERCC1 complex and RPA (Sin, Tanaka, & Saijo, 2016).

1.5.2 ERCC8/CSA

Another vital protein in the TC-NER pathway is ERCC8 (CSA in humans), encoded by the ERCC8/CSA gene. This protein is a part of the Cul4-ERCC8 ubiquitin ligase complex along with DDB1, RBX1 and CUL4A, and acts as a substrate recognition factor.

This complex binds to RNAPII and displays ubiquitin ligase activity (Groisman et al., 2003). After ERCC6 has recognized a stalled RNAPII, it activates recruitment of the Cul4-ERCC8 complex, which is necessary for further recruitment of non-core NER factors such as HMGN1, XAB2 and TFIIS. While RNAPII is in the state of transcription elongation, it becomes co-localized with ERCC8 (Spivak, 2016). An interaction of the Cul4-ERCC8 complex with the UVSSA-USP7 complex has also been established. The UVSSA-USP7 complex, which contains the ubiquitin-specific protease USP7, has been shown to coimmunoprecipitate with ERCC8, and it has been suggested that it is loaded onto RNAPII in an ERCC6-dependent manner. Structurally, the ERCC8 protein shares sequence similarity with DDB2, both containing WD40 repeats in their middle region. DDB2, the product of the DDB2 gene, is involved in the GG-NER subpathway, and the Cul4-DDB2 complex has the same architecture as the Cul4-ERCC8 complex, except that DDB2 is present instead of ERCC8 (Saijo, 2013).


Transcription-coupled repair is believed to remove DNA lesions in actively transcribed genes more quickly than global genome repair. Since a deficiency of ERCC6 or ERCC8 has such detrimental effects on DNA repair, and since the genes have vital functions in both humans and yeast, one could hypothesize that these genes should be relatively well conserved in eukaryotes.

1.6 Global genome repair

As mentioned before, GG-NER and TC-NER differ in their mode of damage detection. The majority of DNA lesions in NER are detected in a TC-NER-independent manner, so the correct function of the alternative to TC-NER is important. GG-NER recognizes lesions throughout the entire genome, including untranscribed regions and silent chromatin, while TC-NER recognizes arrested RNAPII in actively transcribed genes (Petruseva, Evdokimov, & Lavrik, 2014).

Proteins vital for the GG-NER pathway include DDB2, XPC, hRAD23b and Cetn2, but after the recruitment and activation of these proteins, the two NER subpathways converge and the later stages of the process are identical for both GG-NER and TC-NER.

1.6.1 XPC

Eukaryotes can detect DNA damage independently of transcription-coupled repair, through a heterotrimer consisting of XPC, hRAD23 and Cetn2, which will be referred to as the XPC complex. This complex screens for thermodynamically unstable DNA and subsequently binds to the non-damaged ssDNA, on the strand opposite to the lesion, which has been prevented from base pairing. Therefore, the complex never makes direct contact with the damaged bases. This indicates that the initial damage detection relies not on the structure of the lesion, but on the destabilized state of the DNA duplex (although the lesion must be bulky enough to stall the verification step) (Yeo, Khoo, Fagbemi, & Schärer, 2012). Apart from damage detection, XPC has been shown to play a vital role in recruitment of the basal transcription initiation complex (TFIIH), which is required for continuation of nucleotide excision repair (Yokoi et al., 2000).

1.6.2 DDB2/XPE

Damaged DNA-binding protein-2 (DDB2), encoded by the DDB2 gene (XPE in humans), is thought to be involved in early DNA damage detection in global genome repair (Roy et al., 2010). Along with DDB1, DDB2 constitutes a part of the UV-damaged DNA-binding complex (UV-DDB), which preferentially binds to CPD and 6-4PP lesions. As mentioned before, DDB2 shares sequence similarity with the ERCC8 protein (a part of the TC-NER subpathway), with both proteins making up structurally identical complexes with Cul4 and RBX1, where ERCC8 and DDB2 are the varying factors. The E3 ubiquitin-protein ligase complex Cul4-DDB2 has been suggested to ubiquitinate histones and enhance XPC binding, which has been shown to be imperative in GG-NER. The role of DDB2 in these complexes has been suggested to be substrate recognition (Wakasugi et al., 2002).

The genes described above are currently thought to function in damage recognition in NER. However, new information is constantly emerging, shedding more light on the functional roles of their gene products. Although they definitely play a part in damage recognition, their exact roles must be studied further.

1.7 Evolutionary perspectives

Above we outlined the central agents in the nucleotide excision repair pathway and their supposed mode of action, as summarized by studies on mainly human cells. But how stable have these pathways been during evolution, and moreover, how have the pathways evolved in eukaryotes? Mutation rates differ between species and some genomic sites have been shown to be more conserved than others, e.g. non-coding DNA is less conserved than coding sequences of genes. This mainly reflects variation in the strength of purifying selection, as most mutations have detrimental effects on organisms. Only a small minority of mutations is thought to lead to enhanced fitness, for instance a better functioning protein. This reiterates the essentiality of functioning DNA repair in preserving the genomes of organisms. This is supported by the fact that the genomes of all cellular organisms and some DNA viruses encode DNA repair proteins (Friedberg & Friedberg, 2006). Of course, in every generation some mutations go undetected by the repair mechanisms, and fuel future evolution.

This seemingly paradoxical interplay between ensuring stability through DNA repair whilst demanding some mutations presents a curious problem for evolution. Most molecular biologists consider mutations to be errors in DNA repair mechanisms, rather than an adaptation. To evolutionary biologists, this error-prone system, which purges both the soma and germline of mutations, must stay in balance with the evolutionary need for raw material. Consequently, it is of importance to look at the evolutionary history of such repair mechanisms, for instance by studying the history and distribution of different proteins involved in the pathways of various organisms. Different scenarios are possible: genes may have altered or lost their function, been duplicated, deleted, or rearranged with other genes to produce proteins with new combinations of domains.

This study follows up on a curious observation by Sekelsky, Brodsky, and Burtis (2000), who noted that two of the key genes in TC-NER, ERCC6 and ERCC8, are missing from the Drosophila melanogaster genome. The overall objective of this project is to utilize available whole genome sequencing data to explore the occurrence of evolutionary losses of the key genes in the nucleotide excision repair pathway in eukaryotic organisms.

The specific questions I aim to answer are:

• How are NER genes distributed among taxa and the main eukaryotic groups?

• Are they preserved in all groups, or are some confined to particular taxonomic units?

• Do genes differ with respect to their preservation and conservation?

• If so, when did these genes vanish from specific lineages?

• Are there indications of multiple losses of the same genes?


2. Methods

2.1 Data collection

To obtain a list of genes considered to encode for proteins involved in the eukaryotic nucleotide excision repair pathway, I utilized the pathway map from the Kyoto Encyclopedia of Genes and Genomes database (Kanehisa, Furumichi, Tanabe, Sato, & Morishima, 2017; Kanehisa & Goto, 2000; Kanehisa, Sato, Kawashima, Furumichi, & Tanabe, 2016). This was accessed January 25th 2018. This returned a list of 47 genes or their orthologs in 429 sequenced species. For quality control, seven genes (mfd, uvrA, uvrB, uvrC, uvrD, polA and E6.5.1.2) were excluded from the table due to their sole presence in protist species, leaving 40 genes to be analyzed (Table 2.2).

Table 2.2. Genes involved in the nucleotide excision repair pathway after filtering of rare genes. Table adapted from Kanehisa and Goto (2000).

CETN2 DDB2 ERCC1 ERCC2 ERCC3 ERCC4 ERCC5 ERCC6 ERCC8 LIG1 MNAT1 RFA1 RFA2 RFC1 TTDA XPA XPC DDB1 RBX1 CUL4 RAD23 CDK7 CCNH TFIIH1 TFIIH2 TFIIH3 TFIIH4 RPA3 RPA4 POLD1 POLD2 POLD3 POLD4 POLE POLE2 POLE3 POLE4 PCNA RFC2_4 RFC3_5

In order to study the patterns of gene loss, the species were separated by kingdoms, (a) Animalia, (b) Plantae, (c) Fungi, and (d) Protista. The assignment of species to taxonomic units relied on groups provided in the KEGG database. Kingdom Animalia was further divided into subgroups as larger genetic discrepancies were found to be present within the kingdom. The remaining three kingdoms remained undivided. This resulted in seven eukaryotic groups, (a) vertebrates, (b) arthropods, (c) nematodes, (d) invertebrates (including arthropods and nematodes), (e) plants, (f) fungi, and (g) protists.

2.2 Ascertainment of gene study set

The focus of the project was to study gene loss through the evolutionary history of eukaryotes. This was first estimated by calculating the ratios of absence versus presence of each gene in the entire dataset. The same ratios were then calculated within each organism group. Using the software environment R and the package ggplot2, stacked bar plots were constructed from these data for an initial visual analysis to guide further filtering (R Core Team, 2017; Wickham, 2009).
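The ratio calculation described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study, and it is not the original analysis script. The ERCC8 calls are copied from the presence/absence table in the Results for a handful of species chosen arbitrarily; the function name `presence_fraction` is hypothetical.

```python
from collections import defaultdict

# Presence (1) / absence (0) of ERCC8 for a few species, copied from the
# presence/absence table in the Results. The selection is illustrative only.
ERCC8_CALLS = [
    ("Homo sapiens", "Vertebrate", 1),
    ("Mus musculus", "Vertebrate", 1),
    ("Drosophila melanogaster", "Insect", 0),
    ("Bombus impatiens", "Insect", 0),
    ("Pediculus humanus corporis", "Insect", 1),
    ("Caenorhabditis elegans", "Nematoda", 0),
    ("Loa loa", "Nematoda", 1),
    ("Arabidopsis thaliana", "Plant", 1),
]


def presence_fraction(calls):
    """Return the fraction of species carrying the gene, overall and per group."""
    totals = defaultdict(int)
    present = defaultdict(int)
    for _species, group, call in calls:
        totals[group] += 1
        present[group] += call
    per_group = {group: present[group] / totals[group] for group in totals}
    overall = sum(present.values()) / sum(totals.values())
    return overall, per_group


overall, per_group = presence_fraction(ERCC8_CALLS)
for group, fraction in per_group.items():
    print(f"{group}: {fraction:.2f} present, {1 - fraction:.2f} absent")
```

The per-group fractions are the quantities that a stacked bar plot of presence versus absence would display.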

Next, this list of genes was filtered further. The filtering retained (a) genes showing deviations between taxonomic and phylogenetic groups in the visual analysis and whose product functions and roles were well understood (ERCC8, ERCC6, DDB2, CETN2), (b) genes in which mutations are known to have detrimental effects on humans (XPA, XPC, ERCC2, ERCC3, ERCC4, ERCC5, TTDA), and (c) a few other genes of various functions (LIG1, MNAT1, RFA1, RFA2, RFC1). Thus the project focused on 18 genes for further analysis (Table 2.2).

Table 3.2 Genes involved in NER that are the focus of this study: protein function, pathway involvement, and defective phenotypes.

Gene | Protein function | Pathway | Defective phenotype
CETN2 | Stimulate binding of XPC/HR23B dimer | GG-NER (a) | XPC (b)
DDB2 | Damage recognition | GG-NER | XPE
ERCC1 | Nuclease | NER (c) | Possibly XP
ERCC2 | Helicase | NER | XP, XP/TTD, TTD (d), XP/CS, CS/TTD
ERCC3 | Helicase | NER | XPB, TTD, XP/CS
ERCC4 | Nuclease | NER | XPF
ERCC5 | Nuclease | NER | XPG, XP/CS
ERCC6 | Recruitment regulation | TC-NER (e) | CS (f)
ERCC8 | Recruitment regulation | TC-NER | CS
LIG1 | Ligase | NER | Bloom Syndrome, Aaa Syndrome
MNAT1 | Stabilizing CCNH-CDK7 complex for activation of CAK | NER | Unknown
RPA1 | Binding and stabilizing ssDNA intermediates | NER | XPA, Werner Syndrome
RPA2 | Binding and stabilizing ssDNA intermediates | NER | XPV, Ataxia-Telangiectasia
RFC1 | ATPase | NER | Progeria, Pediatric Osteosarcoma
TTDA | Stabilization | NER | TTD
XPA | Damage verification | NER | XPA
XPC | Damage recognition | GG-NER | XPC
DDB1 | Damage binding, possible damage recognition and ubiquitination | GG-NER, TC-NER | Unknown (possibly XPE)

(a) Global genomic nucleotide excision repair.
(b) Xeroderma Pigmentosum (group C, the last letter indicating the defective variants).
(c) Nucleotide excision repair.
(d) Trichothiodystrophy.
(e) Transcription coupled nucleotide excision repair.
(f) Cockayne Syndrome.


2.3 Taxonomic representation and subsets

The study relies on whole genome sequencing of eukaryotic species. However, the quality of the available whole genome sequence data varies greatly between species, as do genome size and complexity. Furthermore, as the sampling within taxonomic and phylogenetic groups is not standardized, different branches in the evolutionary tree were not equally represented. The data were analyzed for the whole set, but larger taxonomic units were also compared by choosing representative species from each group. The criteria were (a) model species with well sequenced genomes, (b) species representing their taxonomic unit, (c) species displaying deviation from the average gene presence within their organism group, and (d) species located around boundaries of closely ranked species that lack specific genes. This resulted in 37 representative species meeting the above criteria.

Taxonomic status for the 37 species was also tabulated in order to get a comprehensive idea of exactly where gene losses occurred, which for instance led to the division of invertebrates into diptera, hymenoptera, cladocera, nematodes, annelids, arachnids, phthiraptera, and mollusks. Although these are not of equal taxonomic rank, they are specific enough to provide a general idea of evolutionary events. Based on similar reasoning, fungi were further divided into ascomycota, basidiomycota and microsporidia.

2.4 Statistical testing for gene losses

Occurrence of gene losses within the entire set or between taxonomic units was assessed with chi-square tests (on 2 × N contingency tables). This was performed on all genes of study, and subsequently enabled me to distinguish three candidate baseline genes that are highly conserved within eukaryotes (ERCC2, ERCC3 and ERCC5) from less conserved genes. Close inspection of the full dataset indicated several gene losses in two clades within the invertebrate group, so tests were performed on smaller taxonomic groups within the invertebrates (arthropods and nematodes), rather than on the invertebrate group as a whole.
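To make the test concrete, the following is a minimal Python sketch of the Pearson chi-square statistic for a 2 × N table (presence and absence counts across N groups). It is an illustration under stated assumptions, not the code used in the study, which was run in R. The function name `chi_square_2xN` is hypothetical, and the example counts are ERCC8 calls tallied from the representative species in the presence/absence table of the Results, not the full 429-species dataset that the reported tests were based on.

```python
def chi_square_2xN(present, absent):
    """Pearson chi-square statistic and degrees of freedom for a 2 x N table.

    present[i] and absent[i] are the numbers of species with and without
    the gene in group i.
    """
    if len(present) != len(absent) or len(present) < 2:
        raise ValueError("need matching counts for at least two groups")
    col_totals = [p + a for p, a in zip(present, absent)]
    row_totals = (sum(present), sum(absent))
    n = sum(col_totals)
    if min(row_totals) == 0 or min(col_totals) == 0:
        raise ValueError("every row and column must contain at least one species")
    statistic = 0.0
    for observed_row, row_total in zip((present, absent), row_totals):
        for observed, col_total in zip(observed_row, col_totals):
            expected = row_total * col_total / n  # count expected under independence
            statistic += (observed - expected) ** 2 / expected
    dof = len(present) - 1  # (rows - 1) * (columns - 1) with two rows
    return statistic, dof


# ERCC8 among representative species: vertebrates, insects, nematodes, plants.
ercc8_present = [3, 1, 2, 2]
ercc8_absent = [0, 6, 4, 0]
statistic, dof = chi_square_2xN(ercc8_present, ercc8_absent)
print(f"chi-square = {statistic:.2f} with {dof} degrees of freedom")
```

With counts this small the chi-square approximation is unreliable, so the example only shows the mechanics. If SciPy is available, a P-value can be obtained with `scipy.stats.chi2.sf(statistic, dof)`.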

2.5 Phylogenetic tree construction

Two genes present in all representative species, ERCC3 and ERCC5, were defined as baseline genes and used for constructing phylogenetic trees. Since they are well preserved, amino acid FASTA sequences of these genes or their orthologs in each representative species were collected from the KEGG database (Kanehisa et al., 2017; Kanehisa & Goto, 2000; Kanehisa et al., 2016). As sequences from different species may be misaligned due to gaps, the online Clustal Omega tool was used to align the sequences of the 37 species (Larkin et al., 2007).

The trees were then calculated using the phangorn package in R, with the JC69 substitution model. Pairwise distances between the aligned sequences were calculated and then used to construct neighbor-joining phylogenetic trees, with Oryza sativa japonica as the root (K. Schliep, Potts, Morrison, Grimm, & Fitzjohn, 2017; K. P. Schliep, 2011).
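As an illustration of what a pairwise distance under a Jukes-Cantor-type model means, the Python sketch below computes the proportion of differing sites between two aligned sequences and applies the standard correction for multiple substitutions, generalized to any number of equally likely states. Using 20 states for amino acids is an assumption made here for illustration; whether phangorn applies exactly this formula internally is not established by the text. The function names and the toy sequences are hypothetical.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def jc_distance(p, n_states=20):
    """Jukes-Cantor-type distance for n_states equally likely states.

    d = -b * ln(1 - p / b), with b = (n_states - 1) / n_states.
    """
    b = (n_states - 1) / n_states
    if p >= b:
        return math.inf  # sequences are saturated, the correction is undefined
    return -b * math.log(1.0 - p / b)


# Made-up aligned fragments, for illustration only.
p = p_distance("MKTAYIAK-QR", "MKSAYLAKEQR")
print(f"p = {p:.3f}, corrected distance = {jc_distance(p):.3f}")
```

A matrix of such corrected distances for all species pairs is the input that a neighbor-joining algorithm would work from.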

Subsequently, a maximum likelihood tree was constructed with optimized model parameters (topology, proportion of variable size, rate matrix) to get more accurate results. Furthermore, to estimate the accuracy of clades and branching events, bootstrapping was performed with 100 samples to construct the optimal phylogenetic tree. The phylogenetic trees were subsequently written in Newick format and visually reconstructed by using iTOL, where branch lengths were ignored for aesthetic purposes (Letunic & Bork, 2007, 2011, 2016; Paradis, Claude, & Strimmer, 2004).
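The resampling step behind bootstrapping can be sketched as below: each replicate draws alignment columns with replacement, and a tree would then be re-estimated from every replicate to see how often each clade recurs. This is a minimal Python illustration of the general technique, not the phangorn routine used in the study; the function name `bootstrap_alignment`, the toy alignment and the random seed are all hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate).

    alignment maps a species name to its aligned sequence; all sequences
    must have the same length.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


# Hypothetical toy alignment; the sequences are made up for illustration.
toy = {"sp1": "MKTAY", "sp2": "MKSAY", "sp3": "MRTAF"}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
print(replicates[0])
```

Because whole columns are resampled, the site-wise correspondence between species is kept intact in every replicate.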

These steps were repeated for three genes, DDB2, ERCC6 and ERCC8, which by visual analysis were deemed to depict the evolutionary history of the organisms and groups of study.

2.6 Supporting analyses

In order to further support the data retrieved from the KEGG database, the 18 genes were extracted from OrthoDB (Zdobnov et al., 2017). The data collected from OrthoDB on October 11th 2017 did not contain the protist group, and the combination of sequenced species differed slightly. The same procedure of ratio calculation and construction of stacked bar plots with respect to groups and genes was followed. Neither statistical testing nor phylogenetic construction was deemed necessary for conclusive purposes.

Stacked bar plots with respect to groups carrying the genes were constructed to compare with the KEGG database data for visual analysis but no phylogenetic construction was performed.


3. Results

3.1 Representation of NER genes in Eukaryotes

The overall aim of this project was to explore the pattern of conservation and loss of genes involved in the nucleotide excision repair pathway in eukaryotes. Here, conservation indicates preservation of the gene in organisms or lineages, not the rate of protein evolution. Data from 429 species with 47 genes were obtained from the Kyoto Encyclopedia of Genes and Genomes database to obtain an overview of the pathway. The analysis focused on 18 genes: (a) genes that showed deviations between organism groups in the visual analysis and whose product functions and roles were well understood (ERCC8, ERCC6, DDB2, CETN2), (b) genes in which mutations are known to have detrimental effects on humans (XPA, XPC, ERCC2, ERCC3, ERCC4, ERCC5, TTDA), and (c) a few other genes of various functions (LIG1, MNAT1, RFA1, RFA2, RFC1).

The genes of interest did not all appear to be present in the same proportions in the 429 whole genome sequenced species. This was supported by the Pearson χ² test, which indicated that gene loss was not randomly distributed among groups, as 12 genes had P-values < 0.001 (Table 3.1).

Figure 3.1 Presence and loss of 18 NER genes in 429 whole genome sequenced eukaryotes. Represented are the percentages of absence and presence of genes.

The question of interest was whether these gene losses were tied to specific taxonomic units and if so, how often did these losses occur in evolution?


To determine the association between gene loss and taxonomic units, species were divided into seven taxonomic units, (a) vertebrates, (b) arthropods, (c) nematodes, (d) invertebrates (including arthropods and nematodes), (e) protists, (f) fungi and (g) plants (Fig 3.2).

Overall gene preservation seemed to be very high in vertebrates and plants. More losses were observed in the arthropods, nematodes and the overall invertebrate group. Preservation in protists was relatively low. None of the genes were completely missing from the protist group, but most genes were present in only a fraction of protist species. Furthermore, no protist lacked all genes.

Most genes are relatively well preserved overall in the eukaryotic domain, with the exception of CETN2, DDB2, ERCC6, ERCC8, XPA and TTDA, all of which were lost in >20% of eukaryotes (Fig 3.1). These genes were further explored, with a special focus on DDB2, ERCC6 and ERCC8 as they are thought to be directly involved in DNA damage detection.

CETN2 was found to be present in all vertebrates but lost in roughly 70% of plants. An almost complete loss was found in all other eukaryotic organisms, with the gene being lost in >95% of species in every other taxonomic unit.

DDB2 was found to be completely lost in protists, fungi and nematodes, and almost completely lost in arthropods, with only around 11% of the species carrying the gene. It was, however, highly conserved in vertebrates, where it was lost in only 3% of the sequenced species, and prevalent in plants.

ERCC6 was not completely lost in any taxonomic unit, but it was lost in around 30% of arthropods, protists and fungi. It was lost in ~24% of invertebrates, and in a large fraction of basidiomycetes. It was however almost completely preserved in nematodes, vertebrates and plants, suggesting evolutionary losses restricted to specific taxonomic units.

ERCC8 displayed a checkered pattern, as it was found to be present in almost all plant species but lost from ~35% of fungi, ~70% of protists, ~82% of arthropods and a substantial number of nematodes. In contrast, ERCC8 was absent in only 3% of vertebrate genome sequences.

TTDA also showed a more spotted pattern. The observed losses were scattered among the invertebrates, fungi and protists, but no losses were found within the vertebrates.

XPA did not display an overall lack of preservation in eukaryotes, but it was not found in the plant genomes. It seemed to be preserved in all other groups apart from protists, as was the case for many other genes studied here.

To test if the frequency of genes differed between taxonomic units, a Pearson χ² test was performed. Since the invertebrate group contained both nematodes and arthropods, it was not included in statistical testing, as it would have introduced duplicate values and the majority of invertebrate species were arthropods to begin with.

Table 3.4 P-values from tests on gene loss within 5 taxonomic groups. 105 vertebrates, 44 arthropods, 6 nematodes, 49 protists, 121 fungi and 84 plants.

Gene P value
CETN2* <2.2 × 10^-16
DDB2* <2.2 × 10^-16
ERCC1* 3.4 × 10^-5
ERCC2* 2.645 × 10^-6
ERCC3 0.8559
ERCC4 0.002769
ERCC5 0.2539
ERCC6* 2.357 × 10^-12
ERCC8* <2.2 × 10^-16
LIG1* 3.762 × 10^-5
MNAT1* <2.2 × 10^-16
RFA1 0.0968
RFA2* <2.2 × 10^-16
RFC1 0.7688
TTDA* 3.065 × 10^-9
XPA* <2.2 × 10^-16
XPC* <2.2 × 10^-16
DDB1* <2.2 × 10^-16

Note: *Significance level <0.05.

The results of the χ² contingency tests, with a significance level of α = 0.05 and df = 4, show that losses of the baseline genes ERCC3 and ERCC5 did not differ significantly between groups (P > 0.05; Table 3.1), supporting the visual analysis that defined them as baseline genes (Fig 3.1). ERCC2, however, did not qualify as a baseline gene, as the frequency of gene losses was not the same between groups (P = 2.673 × 10^-7).

The frequency of loss of ERCC6, ERCC8 and DDB2 deviated between groups, all were significant with P<0.0001. Interestingly, most genes (13/18) did show a significant deviation in loss between groups. Since protists are a paraphyletic group, statistical tests without protists were also performed with 10/13 genes showing significant deviation.


Figure 3.2 Gene representation in vertebrates, arthropods and nematodes. The figure shows the absence and presence of genes involved in the nucleotide excision repair pathway in a) 105 vertebrate species, b) 44 arthropod species, c) 6 nematode species, d) 70 invertebrate species (including nematodes and arthropods), e) 49 protist species, f) 121 fungal species and g) 84 plant species.


Figure 3.3 Presence of ERCC6, ERCC8 and DDB2 in eukaryotes. The figure shows the absence and presence of these three genes in i) 105 vertebrate species, ii) 44 arthropod species, iii) 6 nematode species, iv) 70 invertebrate species (including arthropods and nematodes), v) 84 plant species, vi) 121 fungal species and vii) 49 protist species.

3.2 Phylogenetic reconstruction

The assignment of genes to taxonomic units relied on KEGG categories. Next I wanted to confirm the evolutionary history of NER genes and to study the patterns of gene losses on the phylogeny.

First, to study the evolutionary history of a few NER genes, the phylogeny of two baseline genes was estimated (ERCC3 and ERCC5). Phylogenetic trees were constructed for the 37 species with the R package ape (Paradis et al., 2004), first creating a neighbor-joining tree and subsequently a maximum likelihood tree. The tree was rooted with O. sativa japonica and visualized with iTOL (Letunic & Bork, 2016).

The two baseline trees for ERCC3 and ERCC5 were nearly identical, depicting only minor differences in branching. For example, with respect to ERCC5, the microsporidia Encephalitozoon cuniculi and Nosema ceranae show less relation to other fungal species such as Saccharomyces cerevisiae, and more relation to protist species such as Paramecium tetraurelia (Fig 3.4b). The trees also agreed with the general eukaryotic tree hypothesized by Ren et al. (2016).

Next I also calculated the phylogenetic trees of the three genes that showed a significant frequency of gene loss. Naturally, a tree cannot depict a branch for a species if the gene has been lost from its genome. Thus these trees did not contain all representative species, as the genes are not present in all groups. Only 26 of the 37 representative species carry the ERCC6 gene in their genome, 18 carry the ERCC8 gene and 9 carry the DDB2 gene.
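The species counts given here follow directly from the presence/absence table of the representative species (Table 3.2, at the end of this chapter). The short Python sketch below copies the DDB2, ERCC6 and ERCC8 columns from that table and tallies them; it is only a cross-check written for illustration, and the variable names are hypothetical.

```python
# (species, DDB2, ERCC6, ERCC8), copied from the presence/absence table.
CALLS = [
    ("Homo sapiens", 1, 1, 1),
    ("Mus musculus", 1, 1, 1),
    ("Xenopus laevis", 1, 1, 1),
    ("Drosophila melanogaster", 0, 0, 0),
    ("Drosophila virilis", 0, 0, 0),
    ("Aedes aegypti", 0, 0, 0),
    ("Culex quinquefasciatus", 0, 0, 0),
    ("Bombus impatiens", 0, 1, 0),
    ("Nasonia vitripennis", 0, 1, 0),
    ("Pediculus humanus corporis", 1, 1, 1),
    ("Daphnia pulex", 1, 1, 1),
    ("Ixodes scapularis", 1, 1, 1),
    ("Caenorhabditis elegans", 0, 1, 0),
    ("Caenorhabditis briggsae", 0, 1, 0),
    ("Necator americanus", 0, 1, 0),
    ("Brugia malayi", 0, 1, 1),
    ("Loa loa", 0, 1, 1),
    ("Trichinella spiralis", 0, 1, 0),
    ("Helobdella robusta", 0, 1, 1),
    ("Octopus bimaculoides", 0, 1, 1),
    ("Amphimedon queenslandica", 1, 1, 1),
    ("Saccharomyces cerevisiae", 0, 1, 1),
    ("Tetrapisispora phaffii", 0, 1, 1),
    ("Podospora anserina", 0, 1, 0),
    ("Fusarium oxysporum", 0, 1, 1),
    ("Schizosaccharomyces pombe", 0, 1, 1),
    ("Cryptococcus neoformans", 0, 0, 0),
    ("Trametes versicolor", 0, 0, 0),
    ("Dichomitus squalens", 0, 0, 0),
    ("Encephalitozoon cuniculi", 0, 1, 0),
    ("Nosema ceranae", 0, 0, 0),
    ("Monosiga brevicollis", 0, 0, 1),
    ("Plasmodium vivax", 0, 0, 0),
    ("Toxoplasma gondii", 0, 1, 0),
    ("Paramecium tetraurelia", 0, 0, 0),
    ("Arabidopsis thaliana", 1, 1, 1),
    ("Oryza sativa japonica", 1, 1, 1),
]

GENES = ("DDB2", "ERCC6", "ERCC8")

# Number of representative species carrying each gene.
carriers = {gene: sum(row[i + 1] for row in CALLS) for i, gene in enumerate(GENES)}
print(len(CALLS), carriers)
```

Each tree can therefore contain at most as many tips as there are carriers of the corresponding gene.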

All three trees, i.e. ERCC6, ERCC8 and DDB2 have the same topology as the baseline trees, apart from the missing taxa.


Figure 3.4 Phylogenetic trees based on distances between amino acid sequences of baseline genes, ERCC3 and ERCC5. Rooted phylogenetic trees of maximum likelihood in eukaryotes based on pairwise distances between 37 species under the JC69 substitution model with O. sativa japonica used as root. a) Tree based on the amino acid distances of the ERCC3 ortholog. b) Tree based on the amino acid distances of the ERCC5 ortholog. Note: This is a topological representation; the branch lengths are not informative.


Next, I wanted to map the clear losses of specific NER genes in particular taxonomic groups.

As determined before, CETN2 was not well preserved in eukaryotes. It was found in one species of the annelid phylum, Helobdella robusta, in one protist species, Paramecium tetraurelia, and in all vertebrates. It is difficult to estimate how often the gene was lost, or when it emerged in evolution, as it was found in only one annelid species, so whether it emerged in annelids or sometime earlier is difficult to say.

DDB2 was clearly absent in almost all eukaryotic groups with the exception of vertebrates and plants (Fig 3.3). From the phylogenetic tree of DDB2 (Fig 3.5c), when compared to the baseline genes ERCC3 and ERCC5 (Fig 3.4), it is quite apparent that protists and fungi lack this gene completely. Conversely, the data showed the gene to be present in the genomes of plants. The invertebrate group showed an interesting trend, as DDB2 was lacking in all insects apart from the human body louse, Pediculus humanus corporis. This was the only insect species among the sequenced KEGG species that carries this gene, which may indicate that the gene has disappeared relatively recently and may originally have been present in eukaryotes. A few observations from the full KEGG table support this hypothesis.

a) The DDB2 gene was found in deuterostomes (both chordates and hemichordates), and in some groups of protostomes and sponges.

b) The gene was found among mollusks and in Crassostrea spp., Lottia spp. and Lingula spp., although it was not found in Octopus bimaculoides (in retrospect, more mollusk species should have been chosen for analysis to study this further). This may indicate that the gene was present at least within the superphylum Lophotrochozoa.

c) Parallel to the superphylum Lophotrochozoa, the DDB2 gene was found within the superphylum Ecdysozoa and its subgroups, although the proportion of lineages missing the gene was high. For instance, the gene was not found in nematodes, but was observed in several arthropod lineages. This might indicate that as nematodes and arthropods diverged, DDB2 vanished from the genome of nematodes but remained in arthropods.

d) Chelicerates (represented by Ixodes scapularis), a subphylum basal to insects within the arthropod phylum, also have DDB2. The gene was, however, not found in a large number of insects, but was found within their sister group, the crustaceans (as in D. pulex).


Figure 3.5 Phylogenetic trees based on distances between amino acid sequences of unpreserved genes, ERCC6, ERCC8 and DDB2. The figure depicts three rooted phylogenetic trees of maximum likelihood in eukaryotes based on pairwise distances between 37 species under the JC69 substitution model with O. sativa japonica used as root. a) Tree based on the amino acid distances of the ERCC6 ortholog. b) Tree based on the amino acid distances of the ERCC8 ortholog c) Tree based on the amino acid distances of the DDB2 ortholog.


The ERCC6 gene was not highly preserved within the eukaryotic domain, although notably better than DDB2 (Fig 3.5). ERCC6 was found in the genomes of plants and vertebrates, and to a varying degree in invertebrates, protists and fungi. The fact that the gene was detected within all eukaryotic kingdoms, strongly suggests that this gene emerged early in evolutionary history, and suggests that it has been lost at least two times, from groups of invertebrates and fungi, and possibly also from some protists.

ERCC6 appears to have been lost from a clade within the Basidiomycota in the fungal kingdom. In contrast, ERCC6 was found in all Ascomycota, the largest fungal phylum. Furthermore, microsporidia, which were once considered to belong to the protists but have now been reclassified as fungi, showed inconsistencies with respect to the presence of ERCC6. The gene was found in E. cuniculi but not in N. ceranae. This might stem from imperfect sequencing of N. ceranae, as three different species of Encephalitozoon had the gene, so no significant deductions can be made. The same caveat applies to the protist group, where among the representative species only Toxoplasma gondii carries the ERCC6 gene (Table 3.2), which reflects either evolutionary loss or gaps in the genome sequencing of the other species.

The deep taxonomic genome sequencing of Drosophila spp. showed that this clade lacks the ERCC6 gene completely. This loss appears to have occurred earlier in evolutionary history, as the gene was not found in the genomes of other species in the dipteran order, such as Culex quinquefasciatus, Aedes aegypti and Musca domestica. Notably, ERCC6 was found to be present in all other arthropods except diptera. Thus, since ERCC6 is preserved in all those arthropod lineages, it was most likely lost around the time the dipteran ancestor diverged from the rest of the insects.

The phylogenetic tree of ERCC6 (Fig 3.5a) placed the microsporidian E. cuniculi and the arthropod D. pulex as sister groups, with E. cuniculi showing more sequence similarity to arthropods than to its closer relatives, the fungi and protists. The gene appears to have been present originally in all eukaryotes.

ERCC8 showed a similar trend to ERCC6 but was found to be less preserved, especially in invertebrates. It was found in almost all vertebrates and plants but lost three times, in fungal, arthropod and nematode phyla.

Similar to ERCC6, the loss of ERCC8 seems to have occurred within fungi, with a similar pattern between Basidiomycota and Ascomycota. Although the gene was not found in one Ascomycota species (Podospora anserina), there was a significant difference in the proportion of species with the gene in the two divisions. The gene was not found in any Basidiomycota. Similar to ERCC6, a mixed pattern was observed for ERCC8 in microsporidia and protists, with one protist, Monosiga brevicollis, carrying the gene. The same caveats as above, however, apply to the representation of the genes in the protist group.

An evolutionary loss of ERCC8 was observed in nematodes. Loa loa and Brugia malayi both contain the gene, while it was not found in the rest. This indicates that the loss of the gene occurred within the nematode phylum, with the usual caveat that these genomes may not be well sequenced. An interesting similarity between DDB2 and ERCC8 can be seen in I. scapularis, D. pulex and P. h. corporis, as ERCC8 was also found to be preserved in these species but missing from other arthropods. Again (as in the case of DDB2) the gene was found conserved in the genome of P. h. corporis, which may implicate this species as basal to the insect class.

As the ERCC8 gene was found present in plants and at least one protist, one can deduce that the gene emerged early in the eukaryotic domain.

Although TTDA showed a statistically significant deviation of losses between taxa, its distribution was found to be relatively stable in eukaryotes. The group that showed a high frequency of TTDA absences was the microsporidia, but such a small sample, with only four microsporidia species (three of them from the Encephalitozoon genus), might not be indicative.

Finally, XPA was found to be highly conserved in eukaryotes. However, it was not found in the plant genomes (and was also lacking in some protists), which might indicate that it emerged with protists or with the common ancestor of fungi and animals.

These data suggest that these 6 genes involved in damage detection in the pathway may be more prone to evolutionary loss than the other genes in the NER pathway. The genes that code for the proteins in the later steps of NER, after the two subpathways converge, appear to be more conserved throughout the eukaryotic domain.

Table 3.2 clearly shows that CETN2, DDB2, ERCC6, and ERCC8 are not equally preserved between eukaryotic groups, demonstrated by a large gap between plants and vertebrates with respect to DDB2.


Table 3.2 The presence (1) and absence (0) of genes involved in the nucleotide excision repair pathway in eukaryotes.

Species Group CETN2 DDB2 ERCC1 ERCC2 ERCC3 ERCC4 ERCC5 ERCC6 ERCC8 LIG1 MNAT1 RFA1 RFA2 RFC1 TTDA XPA XPC DDB1
Homo sapiens Vertebrate 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Mus musculus Vertebrate 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Xenopus laevis Vertebrate 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Drosophila melanogaster Insect 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1
Drosophila virilis Insect 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1
Aedes aegypti Insect 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1
Culex quinquefasciatus Insect 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1
Bombus impatiens Insect 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
Nasonia vitripennis Insect 0 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1
Pediculus humanus corporis Insect 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Daphnia pulex Cladocera 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Ixodes scapularis Arachnida 0 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1
Caenorhabditis elegans Nematoda 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
Caenorhabditis briggsae Nematoda 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
Necator americanus Nematoda 0 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1
Brugia malayi Nematoda 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Loa loa Nematoda 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Trichinella spiralis Nematoda 0 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1
Helobdella robusta Annelida 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Octopus bimaculoides Mollusca 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
Amphimedon queenslandica Sponge 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Saccharomyces cerevisiae Ascomycota 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
Tetrapisispora phaffii Ascomycota 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
Podospora anserina Ascomycota 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
Fusarium oxysporum Ascomycota 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Schizosaccharomyces pombe Ascomycota 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Cryptococcus neoformans Basidiomycota 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1
Trametes versicolor Basidiomycota 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1
Dichomitus squalens Basidiomycota 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1
Encephalitozoon cuniculi Microsporidia 0 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 0
Nosema ceranae Microsporidia 0 0 1 1 1 1 1 0 0 1 1 1 1 1 0 0 0 0
Monosiga brevicollis Protist 0 0 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1
Plasmodium vivax Protist 0 0 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0 0
Toxoplasma gondii Protist 0 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0
Paramecium tetraurelia Protist 1 0 0 1 1 1 1 0 0 1 0 1 0 1 0 0 1 0
Arabidopsis thaliana Plant 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
Oryza sativa japonica Plant 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1

*Note: Taxonomic groupings based on KEGG


4. Discussion

4.1 DNA Repair in the Three Domains of Life

The importance of DNA repair systems can be inferred from the generally negative effects of mutations on viability and evolvability, though it must be acknowledged that mutation rate itself can evolve, for instance in heterogeneous environments (Levins, 1967). The importance of DNA repair is supported both by the variety of repair mechanisms present in extant organisms and by the observation that some repair enzymes seem to have evolved independently more than once, given that many similarities between repair proteins are shared between Archaea and Eukarya but not Bacteria (Forterre, Filée, & Myllykallio, 2004). This seems to be the case for the sophisticated system of nucleotide excision repair. The NER pathways of bacteria and eukaryotes show functional parallelism, differing in which proteins carry out lesion detection, helix unwinding and incision. However, in spite of these functional similarities, the two systems are non-homologous and appear to have evolved independently (Rouillon & White, 2011). Interestingly, most archaeal genomes seem to encode NER-type proteins that show homology to eukaryotic NER proteins, such as the product of the ERCC2 gene, a helicase believed to be vital for NER in eukaryotes. The archaeal and eukaryotic ERCC3 proteins also share 25-30% protein sequence identity (Richards, Cubeddu, Roberts, Liu, & White, 2008). The endonuclease-coding genes ERCC4 and ERCC5 are also thought to have homologous counterparts in archaea. However, because the exact roles of specific proteins in the pathway are poorly understood across the major branches of life, it is difficult to establish whether these gene products actually have an NER function in archaea. This is relevant for our understanding of eukaryotic NER, as some genes may not be directly involved in NER but may rather be mediators of transcription or other nuclear processes.

With that in mind, it seems reasonable to postulate that NER mechanisms were already present when eukaryotes diverged from the other two domains. The results of this project support this hypothesis, as most of the core NER genes studied were found to be maintained in all eukaryotic taxonomic units studied here. On the other hand, I observed that 10 genes were lost in 5% or more of the lineages and that the losses were clustered on specific branches in the tree, arguing for true evolutionary loss of the genes rather than incomplete genome sequences or partial chromosomal deletions that differ within species. These 10 NER genes can be termed less preserved than the core pathway genes. I must stress that preservation refers to the distribution of a gene across the evolutionary tree, not to the rate of evolution of the gene itself (that would be conservation). The less preserved genes include six associated with DNA damage detection (DDB2, ERCC6, ERCC8, CETN2, XPA, TTDA), raising questions about their evolutionary dynamics and their assumed or known functions in NER. Does the absence of specific proteins indicate that certain organisms or taxonomic groups have no need for these damage detection proteins? Or, alternatively, have other genes and proteins been co-opted for damage detection in these groups?
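The 5% criterion above can be illustrated with a small calculation. The following is a minimal sketch, not the procedure actually used in this project: it takes presence/absence calls for six genes in six species, copied from Table 3.2, and reports the fraction of species lacking each gene. The function names and the choice of subset are my own assumptions for illustration, and with only six species the fractions are not representative of the full dataset.

```python
# Presence (1) / absence (0) calls copied from six rows of Table 3.2.
# The subset of genes and species is chosen only for illustration.
GENES = ["CETN2", "DDB2", "ERCC6", "ERCC8", "XPA", "ERCC2"]

PRESENCE = {
    "Homo sapiens":             [1, 1, 1, 1, 1, 1],
    "Drosophila melanogaster":  [0, 0, 0, 0, 1, 1],
    "Bombus impatiens":         [0, 0, 1, 0, 1, 1],
    "Caenorhabditis elegans":   [0, 0, 1, 0, 1, 1],
    "Saccharomyces cerevisiae": [0, 0, 1, 1, 1, 1],
    "Arabidopsis thaliana":     [0, 1, 1, 1, 0, 1],
}


def absence_fraction(presence, genes):
    """Return, for each gene, the fraction of species in which it is absent."""
    n_species = len(presence)
    return {
        gene: sum(1 - row[i] for row in presence.values()) / n_species
        for i, gene in enumerate(genes)
    }


def less_preserved(presence, genes, threshold=0.05):
    """List genes absent in at least `threshold` of species, most absent first."""
    frac = absence_fraction(presence, genes)
    flagged = [gene for gene in genes if frac[gene] >= threshold]
    return sorted(flagged, key=lambda gene: -frac[gene])


if __name__ == "__main__":
    for gene, frac in absence_fraction(PRESENCE, GENES).items():
        print(f"{gene}: absent in {frac:.0%} of species")
    print("Less preserved:", less_preserved(PRESENCE, GENES))
```

In this subset ERCC2 is present in every species and is therefore not flagged, while the damage detection genes are. A fraction of this kind says nothing about where on the tree the losses occurred, which is why the clustering of losses on specific branches was considered separately.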

The literature cited above and the data presented here indicate that most genes involved in NER were present in the ancestor of eukaryotes, after the divergence from bacteria and archaea. This is the most plausible explanation, based on Dollo’s Law of irreversibility, which Gould (1970) summarized:

An organism never returns exactly to a former state, even if it finds itself placed in conditions of existence identical to those in which it has previously lived ... it always keeps some trace of the intermediate stages through which it has passed.

Therefore it is reasonable to assume that these genes were originally present and were later lost in some lineages, rather than having evolved independently in separate lineages.
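The reasoning above (a gene arises once and can afterwards only be lost) can be turned into a simple count of the minimum number of losses on a tree. The sketch below is an illustration under stated assumptions, not the method used in this project: the tree `TREE` is a simplified, hypothetical topology for eight of the species in Table 3.2, and the function `dollo_losses` is my own naming. The presence sets for ERCC6 and ERCC8 are taken from Table 3.2.

```python
# A tree is a nested tuple; leaves are species names (strings).
# ASSUMPTION: simplified topology for illustration only.
TREE = (
    "Homo sapiens",
    (
        "Pediculus humanus corporis",
        (
            ("Bombus impatiens", "Nasonia vitripennis"),          # Hymenoptera
            (
                ("Drosophila melanogaster", "Drosophila virilis"),
                ("Aedes aegypti", "Culex quinquefasciatus"),
            ),                                                    # Diptera
        ),
    ),
)

# Species in which the gene was found, from Table 3.2.
ERCC8_PRESENT = {"Homo sapiens", "Pediculus humanus corporis"}
ERCC6_PRESENT = ERCC8_PRESENT | {"Bombus impatiens", "Nasonia vitripennis"}


def leaves(node):
    """Return all leaf names below a node."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def dollo_losses(tree, present):
    """Minimum number of losses, assuming a single gain and no regain."""
    def has_gene(node):
        return any(name in present for name in leaves(node))

    if not has_gene(tree):
        return 0  # gene never observed, so no loss can be inferred

    # The single gain is placed at the last common ancestor of all carriers.
    node = tree
    while not isinstance(node, str):
        carrying = [child for child in node if has_gene(child)]
        if len(carrying) != 1:
            break
        node = carrying[0]

    # Each maximal gene-free subtree below the gain counts as one loss.
    def count(sub):
        if not has_gene(sub):
            return 1
        if isinstance(sub, str):
            return 0
        return sum(count(child) for child in sub)

    return count(node)


if __name__ == "__main__":
    print("ERCC8 losses:", dollo_losses(TREE, ERCC8_PRESENT))
    print("ERCC6 losses:", dollo_losses(TREE, ERCC6_PRESENT))
```

On this assumed topology each gene requires a single loss, ERCC8 on the branch leading to the hymenopteran and dipteran species and ERCC6 on the branch leading to the dipterans. The count depends entirely on the tree supplied, so an unresolved or incorrect topology would change the inferred number of losses.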

This raises the questions of how long ago the genes that appear to be lost in specific clades of the eukaryotic tree were rendered non-functional, and by which mutational mechanism. If a gene was removed by a deletion, we should not expect any traces of it unless the deletion was only partial. Conversely, if the gene was inactivated by a point mutation, one can ask how fast pseudogenes decay. In the Xenopus spp. genome, pseudogenes were no more than 17 million years old, because the species underwent tetraploidization around that time (Session et al., 2016). Furthermore, in a study of Drosophila spp. mitochondrial genes that translocated to the nucleus, identification of pseudogenes was found to work best for closely related species and proved much harder for deeper branches (20-50 million years) (Rogers & Griffiths-Jones, 2012). This suggests that inactivated genes can only be detected relatively shortly after the initial mutation event, most likely on a timescale of tens of millions of years rather than hundreds of millions of years. Most of the clades that show loss of NER genes are a hundred million years old or more.

The discussion covers five individual genes, DDB2, ERCC6, ERCC8, XPA and CETN2, moving from the poorly preserved DDB2 to XPA, which may be lost only in plants. CETN2 is discussed last because its preservation in eukaryotes is debatable.

4.2 DDB2

The results indicate that the DDB2 gene, which is thought to be involved in early damage detection in GG-NER, was present in the ancestor of eukaryotes after they diverged from archaea and bacteria. The data also suggest that the gene has been lost multiple times within the domain.

Loss of DDB2 appears to have occurred three or four times in evolutionary history. Losses could be clearly placed within nematodes, insects and fungi, and possibly also among unicellular protists, where they are harder to estimate. However, protein sequence identity between species may not be very high, since Mus musculus and H. sapiens show only a 73% sequence match (Roy, Bagchi, & Raychaudhuri, 2012).

The fact that the gene is preserved in some taxonomic groups but not in others may shed light on the repair mechanisms these different organisms utilize. Since the gene was lost in nematodes, insects, fungi and possibly some protists, it is also possible that other mechanisms are in place in those groups. In humans, DDB2 is thought to be a factor in premature senescence, as was demonstrated by a failure of cells to respond to reactive oxygen species (ROS). Also, studies on DDB2-deficient humans have shown a 40-60% reduction in repair levels, resulting in the mildest form of XP (XP-E), with the failure of repair most likely caused by an inability to bind damaged DNA. However, since defects in DDB2 cause such a mild form of the disorder, they have not been studied as extensively as mutations in other XP genes. In fact, not all XP-E patients carry mutations in DDB2, and some do not show abnormal DNA binding activity in vitro but instead show altered interactions with other repair factors. Microinjection of the UV-DDB complex into XP-E cells did increase the repair level, indicating that DDB2 does have a role in the process, but its significance remains undetermined (Rapic Otrin et al., 1998). The fact that the other part of the UV-DDB heterodimer, DDB1, is preserved within the domain (Fig. 3.1), while DDB2 is not found in "lower" organisms, could indicate that DDB1 works somewhat independently of DDB2, particularly in those organisms, and that DDB2 evolved later than DDB1.

A better understanding of the role of DDB2, and of its functional interactions with and without DDB1, could provide opportunities to further explore the mechanism of aging (Molinier, Lechner, Dumbliauskas, & Genschik, 2008) and could have implications for patients suffering from XP-E.


4.3 ERCC6 and ERCC8

As noted by Sekelsky et al. (2000), ERCC6 and ERCC8 were not found in the D. melanogaster genome. Here I found that the ERCC8 gene was also missing in all other Holometabola examined, a group that consists of 11 orders, including Hymenoptera, Diptera and Lepidoptera. I also found that the ERCC6 gene was missing in all dipteran species while preserved in Hymenoptera. This result indicates that the two genes must have been lost between 400 and 47 million years ago, first ERCC8 and then ERCC6. The oldest insects are estimated to be 400 million years old, dating back to the Devonian period, and the Holometabola are thought to have undergone a radiation between 298.9 and 47 million years ago, starting in the Permian Period. This may indicate that ERCC8 disappeared within that timeframe, as it is missing in all Holometabola examined. Following the loss of ERCC8, the loss of ERCC6 must have occurred about 240 million years ago (Grimaldi & Engel, 2005). The oldest known dipteran has an estimated age of 240 million years, so if the gene was lost in early dipterans, the loss must have occurred around that time (Krzeminski, 2000).

Furthermore, the ERCC6 (CSB) gene was not found to be highly preserved within the eukaryotic domain, although it was notably better preserved than DDB2 (Fig. 3.6). ERCC6 is present in both plants and vertebrates, but seems to have been lost on two or three occasions during eukaryotic evolution, in invertebrates, fungi and possibly protists. The fact that orthologs of ERCC6 are present within all eukaryotic kingdoms strongly suggests that this gene emerged early in eukaryotic history.

By comparison, the functionally related ERCC8 (CSA) was found to be notably less preserved than ERCC6. It was lost in all Holometabola examined and also seems to have been lost in the ancestor of several nematode species. The two genes overlap in the phylogenetic groups from which they were lost. However, the fact that ERCC6 seems to have been lost in the ancestor of dipterans, whereas ERCC8 is missing in the older Holometabola clade, might indicate differences in functional importance. Sekelsky (2017) speculated that ERCC8 might have functions besides TC-NER that work independently of ERCC6. The opposite could also be true, since ERCC6 is present in other Holometabola.

A recent study by Epanchintsev et al. (2017) established that fully functioning ERCC6 and ERCC8 proteins are required for optimal TC-NER in vertebrates. In mammalian cells, a deficiency in ERCC6 and ERCC8 causes a failure of TC-NER, and in humans severe mutations in these genes cause a shortened life span and a failure to repair UV-damaged DNA. ERCC6 and ERCC8 are thought to work together to regulate the timing of recruitment of DNA-binding factors, and their roles in targeting ATF3 for ubiquitination and degradation were shown to depend on each other. ATF3 is a transcription factor and the product of an immediate early gene, which is able to respond quickly to regulatory signals; it prevents the arrival of RNAPolII, hindering the restart of RNA synthesis (Bahrami & Drabløs, 2016). ERCC6 and ERCC8 target ATF3 and are thought to be involved in its degradation, enabling RNAPolII to continue transcription. This implies that ERCC6 and ERCC8 might not act on the stalled RNAPolII complex directly, but rather promote the degradation of ATF3, allowing the arrival of RNAPolII. However, the exact function of ERCC6 has not been pinpointed, apart from its role in a chromatin remodeling protein complex, so it may have functions other than its interaction with ATF3.

Sekelsky et al. (2000) previously observed that the fruit fly lacks ERCC6 and ERCC8, and concluded that it most likely lacks the TC-NER pathway. Supporting this conclusion, a biochemical study failed to find evidence of TC-NER in Drosophila spp. (de Cock et al., 1992). Interestingly, the lack of these genes does not seem to cause problems for Drosophila or other Diptera.

The question remains how Drosophila spp. deal with stalled RNA polymerases without a functioning TC-NER pathway. Sekelsky et al. (2000) suggested several scenarios, or mechanisms by which the organism might solve the problem. The first suggestion is that they do not deal with stalled polymerases at all. Some of the embryonic tissues have polytene chromosomes during the period when most of their growth occurs, so the absence of a single template copy may have little effect. Second, it was proposed that there is an alternative mechanism for removing stalled RNA polymerases, one that does not recruit repair proteins. Finally, it could also be that other proteins carry out the function of relieving stalled polymerases, but that runs counter to the data of de Cock et al. (1992).

Similar to human ERCC6, mutations in its ortholog in Saccharomyces cerevisiae, RAD26, also result in altered UV sensitivity. A study by Ghosh-Roy, Das, Chowdhury, Smerdon, and Chaudhuri (2013) determined that a deletion of RAD26 did not affect the expression of NER factors, suggesting that the increased UV sensitivity is not due to stalled transcription but rather to altered transcription of the RPB2 gene, which encodes a subunit of RNAPolII. How exactly this occurs is unknown.

It is difficult to speculate about how organisms deficient in ERCC6 or ERCC8 remove arrested polymerases, or whether they are removed at all, because the functions of the ERCC6 and ERCC8 proteins have not yet been established (it has, however, been suggested that ERCC8 functions in assembly, while ERCC6 functions in catalysis). Knowing how stalled RNAPolII complexes are removed by the ERCC6 and ERCC8 proteins could help explain how organisms lacking TC-NER carry out this function (if they do so in the first place). Understanding this mechanism and the role of these proteins could have important implications in light of recent developments in genome engineering.

4.4 XPA

XPA is one of the most highly preserved genes in the NER pathway, and was found in most eukaryotes studied, with plants as the notable exception (Table 3.2 also records it as absent from three unicellular species, Nosema ceranae, Toxoplasma gondii and Paramecium tetraurelia). Plants are sedentary organisms that depend on sunlight for photosynthesis and are therefore continuously exposed to UV radiation. This exposure causes DNA damage, which is mostly dealt with by photolyases, enzymes that repair UV-induced damage and are present in most eukaryotes apart from placental mammals. Plants also rely on NER, which is unfortunately seriously under-investigated in plants. The data suggest that the NER machinery and processes of plants differ from those of other organisms. For instance, as observed here, plants do not have an XPA ortholog. However, Arabidopsis thaliana and some other plants carry two copies of ERCC3, which appears to reflect a genus-specific expansion of ERCC3. A notable difference can also be found in the duplications of certain genes in A. thaliana vs. O. sativa, which may reflect a mechanistic difference in DNA repair. In line with the results above, XPA is yet another DNA damage recognition protein that is systematically lacking in a group of eukaryotes (Canturk et al., 2016).

4.5 CETN2

The least preserved gene studied here is CETN2.

This gene is mainly found in vertebrates and annelids, and is possibly also present in some protist species. The fact that it was missing in so many taxa and found mainly in more derived groups such as vertebrates might suggest that this gene emerged late in evolutionary history. Although the gene seems to be present in one protist species, the evolutionary history of this group is very obscure and many species have not yet been completely sequenced. H. robusta is the only annelid in the dataset. It is most unlikely that the same gene arose twice de novo, but whether the gene emerged within annelids is impossible to determine from such incomplete data.

According to the KEGG database, no ortholog of CETN2 exists in A. thaliana. However, BLAST analyses confirm that higher plants do indeed have centrin2 orthologs (Molinier, Ramos, Fritsch, & Hohn, 2004), and the centrin2 gene in A. thaliana shows 49% sequence identity with human CETN2 (Liang, Flury, Kalck, Hohn, & Molinier, 2006). The CETN2 protein was shown to interact with XPC and to be important in centrosome duplication. This suggests that the KEGG table may contain some errors, or that its criteria for calling orthologs are stricter. It is possible that because CETN2 is a short protein, the human version being only 172 amino acids long, orthology is harder to call accurately. Thus the data on CETN2 should be revisited.
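Sequence identity figures such as the 49% quoted above depend on how identity is defined. As a minimal sketch, assuming two sequences that have already been aligned and an identity defined over columns without gaps (other tools may use a different denominator), the calculation could look as follows. The function name and the two short sequences are hypothetical and are not real CETN2 sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over alignment columns without gaps.

    ASSUMPTION: both inputs come from the same pairwise alignment and
    therefore have equal length; gap columns are excluded from the count.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if res_a == res_b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Hypothetical toy alignment, for illustration only.
    seq_a = "MKT-AYLQ"
    seq_b = "MRTGAY-Q"
    print(f"{percent_identity(seq_a, seq_b):.1f}% identity")
```

For a protein as short as CETN2, a few mismatched or gapped columns change the percentage considerably, which may be one reason why automated orthology calls differ between databases.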

4.6 Challenges

Several challenges have to be faced in a study of this type. One limiting factor in mapping evolutionary events such as gene loss is that the chronology of early eukaryotic splits and clades has not yet been determined. This makes it difficult to deduce the gene content of the ancestors of deep eukaryotic taxonomic groups, or whether specific genes were lost in these early-diverging groups. Multiple hypotheses exist about which groups emerged first. One appealing hypothesis, based on physiological characteristics, places the eukaryotic root between a supergroup called unikonts, consisting of metazoa, fungi, some heterotrophic protista and amoebozoa, and all other organisms (Burki, 2014). The difficulty of pinpointing the eukaryotic root stems from the age of the event, the small size of the groups, their mostly simple phenotypic traits and the fact that different criteria have been used to estimate the root. For instance, if the taxonomic distribution of the endoplasmic reticulum is used, the root is placed between a group consisting of unikonts and excavata, and all other eukaryotes (Ren et al., 2016). One possibly reliable way to infer the early branches in eukaryotic evolution would be to use the method of this project, but with a larger set of highly conserved genes and a larger set of species with whole-genome sequences. This was done by Ren et al. (2016) to map fungal relationships using highly conserved nuclear genes, and could be extended by similar studies in other eukaryotic lineages.

Such might be the case with ERCC6 for example. Its presence in plants and absence in some fungi and protists can either suggest that it was present in eukaryotic ancestors, and was later lost in some fungal and protist species, or that distribution of the gene in these groups actually maps the evolutionary relatedness of them, suggesting plants, fungi, certain protists and bilateria are all of the same origin.

4.7 Conclusion

Our understanding of DNA repair is incomplete, which calls for the development of more model systems for the study of core biological and biochemical processes. Every model organism has its own history and idiosyncrasies in particular processes and enzyme systems. Broad genomic and functional analyses are needed to understand the fundamental processes of the cell and how organisms have solved the challenges of replicating, transcribing and repairing their DNA.

Preservation of the 18 NER genes is generally very high in eukaryotes, as was predicted, since DNA repair genes are generally well conserved throughout the tree of life. However, the repeated loss of several genes functioning in DNA damage recognition raises many questions. Could these genes have alternative functions, and thus function in DNA damage recognition only in certain organisms? And do other genes serve these DNA damage detection functions in other species, or even within some of the model species?

To answer these questions, further studies must be conducted to determine the functional roles of the genes discussed in this chapter. Furthermore, their distribution must be mapped out in more detail, possibly with the addition of archaeal outgroups, as archaea have shown parallelism with eukaryotic NER.

Combined with recent developments in genome engineering through systems such as CRISPR/Cas9, this knowledge could eventually aid genetic treatment of patients suffering from NER deficiency disorders.

References

Abbott, S., & Fairbanks, D. J. (2016). Experiments on Plant Hybrids by Gregor Mendel. Genetics, 204(2), 407-422. doi:10.1534/genetics.116.195198 Adams, M. D., Celniker, S. E., Holt, R. A., Evans, C. A., Gocayne, J. D., Amanatides, P. G., . . . Venter, J. C. (2000). The genome sequence of Drosophila melanogaster. Science, 287(5461), 2185- 2195. Andersson, S. G., & Kurland, C. G. (1998). Reductive evolution of resident genomes. Trends Microbiol, 6(7), 263-268. Atkinson, E. C., Thiara, D., Tamura, D., DiGiovanna, J. J., Kraemer, K. H., & Hadigan, C. (2014). Growth and nutrition in children with trichothiodystrophy. J Pediatr Gastroenterol Nutr, 59(4), 458-464. doi:10.1097/MPG.0000000000000458 Bahrami, S., & Drabløs, F. (2016). Gene regulation in the immediate- early response process. Advances in Biological Regulation, 62, 37- 49. doi:https://doi.org/10.1016/j.jbior.2016.05.001 Beerens, N., Hoeijmakers, J. H., Kanaar, R., Vermeulen, W., & Wyman, C. (2005). The CSB protein actively wraps DNA. J Biol Chem, 280(6), 4722-4729. doi:10.1074/jbc.M409147200 Burki, F. (2014). The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol, 6(5), a016147. doi:10.1101/cshperspect.a016147 Canturk, F., Karaman, M., Selby, C. P., Kemp, M. G., Kulaksiz-Erkmen, G., Hu, J., . . . Sancar, A. (2016). Nucleotide excision repair by dual incisions in plants. Proc Natl Acad Sci U S A, 113(17), 4706- 4710. doi:10.1073/pnas.1604097113 Crick, F. H. (1958). On protein synthesis. Symp Soc Exp Biol, 12, 138- 163. de Cock, J. G., van Hoffen, A., Wijnands, J., Molenaar, G., Lohman, P. H., & Eeken, J. C. (1992). Repair of UV-induced (6- 4)photoproducts measured in individual genes in the Drosophila embryonic Kc cell line. Nucleic Acids Res, 20(18), 4789-4793. Dunning Hotopp, J. C., Clark, M. E., Oliveira, D. C., Foster, J. M., Fischer, P., Munoz Torres, M. C., . . . Werren, J. H. (2007). 
Widespread lateral gene transfer from intracellular bacteria to multicellular eukaryotes. Science, 317(5845), 1753-1756. doi:10.1126/science.1142490 Epanchintsev, A., Costanzo, F., Rauschendorf, M. A., Caputo, M., Ye, T., Donnio, L. M., . . . Egly, J. M. (2017). Cockayne's Syndrome A and B Proteins Regulate Transcription Arrest after Genotoxic Stress by Promoting ATF3 Degradation. Mol Cell, 68(6), 1054- 1066 e1056. doi:10.1016/j.molcel.2017.11.009 Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., . . . et al. (1995). Whole-genome random 39 sequencing and assembly of Haemophilus influenzae Rd. Science, 269(5223), 496-512. Forterre, P., Filée, J., & Myllykallio, H. (2004). Origin and Evolution of DNA and DNA Replication Machineries The Genetic Code and the Origin of Life (pp. 145-168). Boston, MA: Springer US. Fousteri, M., & Mullenders, L. H. F. (2008). Transcription-coupled nucleotide excision repair in mammalian cells: molecular mechanisms and biological effects. Cell Research, 18, 73. doi:10.1038/cr.2008.6 Friedberg, E. C., & Friedberg, E. C. (2006). DNA repair and mutagenesis (2nd ed.). Washington, D.C.: ASM Press. Futuyma, D. J. (2005). Evolution. Sunderland, MA: Sinauer Associates. Gahan, P. B. (2005). Molecular biology of the cell (4th edn) B. Alberts, A. Johnson, J. Lewis, K. Roberts and P. Walter (eds), Garland Science, 1463 pp., ISBN 0‐8153‐4072‐9 (paperback) (2002). Cell Biochemistry and Function, 23(2), 150-150. doi:doi:10.1002/cbf.1142 Gee, H. (2001). A journey into the genome: what's there Nature News. Retrieved from https://www.nature.com/news/2001/010215/full/news010215- 3.html Ghosh-Roy, S., Das, D., Chowdhury, D., Smerdon, M. J., & Chaudhuri, R. N. (2013). Rad26, the transcription-coupled repair factor in yeast, is required for removal of stalled RNA polymerase-II following UV irradiation. PLoS One, 8(8), e72090. doi:10.1371/journal.pone.0072090 Gould, S. J. (1970). 
Dollo on Dollo's law: irreversibility and the status of evolutionary laws. J Hist Biol, 3, 189-212. Grimaldi, D. A., & Engel, M. S. (2005). Evolution of the insects. Cambridge U.K. ; New York: Cambridge University Press. Groisman, R., Polanowska, J., Kuraoka, I., Sawada, J.-i., Saijo, M., Drapkin, R., . . . Nakatani, Y. (2003). The Ubiquitin Ligase Activity in the DDB2 and CSA Complexes Is Differentially Regulated by the COP9 Signalosome in Response to DNA Damage. Cell, 113(3), 357-367. doi:10.1016/S0092-8674(03)00316-7 Hahn, M. W., Demuth, J. P., & Han, S. G. (2007). Accelerated rate of gene gain and loss in primates. Genetics, 177(3), 1941-1949. doi:10.1534/genetics.107.080077 Hershey, A. D., & Chase, M. (1952). Independent functions of viral protein and nucleic acid in growth of bacteriophage. J Gen Physiol, 36(1), 39-56. Hirai, Y., Kodama, Y., Moriwaki, S., Noda, A., Cullings, H. M., Macphee, D. G., . . . Nakamura, N. (2006). Heterozygous individuals bearing a founder mutation in the XPA DNA repair gene comprise nearly 1% of the Japanese population. Mutat Res, 601(1-2), 171- 178. doi:10.1016/j.mrfmmm.2006.06.010 Hoeijmakers, J. H. (2009). DNA damage, aging, and cancer. N Engl J Med, 361(15), 1475-1485. doi:10.1056/NEJMra0804615 40 Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New Perspectives on Genomes, Pathways, Diseases and Drugs. Nucleic Acids Res, 45(D1), D353-D361. doi:10.1093/nar/gkw1092 Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res, 28(1), 27-30. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M., & Tanabe, M. (2016). KEGG as a Reference Resource For Gene and Protein Annotation. Nucleic Acids Res, 44(D1), D457-462. doi:10.1093/nar/gkv1070 Kleijer, W. J., Laugel, V., Berneburg, M., Nardo, T., Fawcett, H., Gratchev, A., . . . Lehmann, A. R. (2008). 
Incidence of DNA repair deficiency disorders in western Europe: Xeroderma pigmentosum, Cockayne syndrome and trichothiodystrophy. DNA Repair (Amst), 7(5), 744-750. doi:10.1016/j.dnarep.2008.01.014 Kraemer, K. H., Patronas, N. J., Schiffmann, R., Brooks, B. P., Tamura, D., & DiGiovanna, J. J. (2007). Xeroderma pigmentosum, trichothiodystrophy and Cockayne syndrome: a complex genotype-phenotype relationship. Neuroscience, 145(4), 1388- 1396. doi:10.1016/j.neuroscience.2006.12.020 Krzeminski, W. (2000). Review of Diptera palaeontological records. In: Contributions to Manual of Palaearctic Diptera (Vol. 1). Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A., McWilliam, H., . . . Higgins, D. G. (2007). Clustal W and Clustal X version 2.0. Bioinformatics, 23(21), 2947-2948. doi:10.1093/bioinformatics/btm404 Laugel, V. (1993). Cockayne Syndrome. In M. P. Adam, H. H. Ardinger, R. A. Pagon, S. E. Wallace, L. J. H. Bean, K. Stephens, & A. Amemiya (Eds.), GeneReviews((R)). Seattle (WA). Lehmann, A. R., McGibbon, D., & Stefanini, M. (2011). Xeroderma pigmentosum. Orphanet J Rare Dis, 6, 70. doi:10.1186/1750- 1172-6-70 Letunic, I., & Bork, P. (2007). Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics, 23(1), 127-128. doi:10.1093/bioinformatics/btl529 Letunic, I., & Bork, P. (2011). Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy. Nucleic Acids Res, 39(Web Server issue), W475-478. doi:10.1093/nar/gkr201 Letunic, I., & Bork, P. (2016). Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res, 44(W1), W242-245. doi:10.1093/nar/gkw290 Levins, R. (1967). Theory of fitness in a heterogeneous environment. VI. The adaptive significance of mutation. Genetics, 56(1), 163- 178. Liang, L., Flury, S., Kalck, V., Hohn, B., & Molinier, J. (2006). 
CENTRIN2 interacts with the Arabidopsis homolog of the human XPC protein 41 (AtRAD4) and contributes to efficient synthesis-dependent repair of bulky DNA lesions. Plant Mol Biol, 61(1-2), 345-356. doi:10.1007/s11103-006-0016-9 Miko, I., & LeJeune, L. (2009). Essentials of Genetics Molinier, J., Lechner, E., Dumbliauskas, E., & Genschik, P. (2008). Regulation and role of Arabidopsis CUL4-DDB1A-DDB2 in maintaining genome integrity upon UV stress. PLoS Genet, 4(6), e1000093. doi:10.1371/journal.pgen.1000093 Molinier, J., Ramos, C., Fritsch, O., & Hohn, B. (2004). CENTRIN2 modulates and nucleotide excision repair in Arabidopsis. Plant Cell, 16(6), 1633-1643. doi:10.1105/tpc.021378 Morgan, T. H. (1915). The Mechanism of Mendelian Heredity: H. Holt. Mouret, S., Baudouin, C., Charveron, M., Favier, A., Cadet, J., & Douki, T. (2006). Cyclobutane pyrimidine dimers are predominant DNA lesions in whole human skin exposed to UVA radiation. Proceedings of the National Academy of Sciences, 103, 13765- 13770. Niedernhofer, L. J., Bohr, V. A., Sander, M., & Kraemer, K. H. (2011). Xeroderma pigmentosum and other diseases of human premature aging and DNA repair: molecules to patients. Mech Ageing Dev, 132(6-7), 340-347. doi:10.1016/j.mad.2011.06.004 Oshima, T., & Imahori, K. (1974). Description of Thermus thermophilus (Yoshida and Oshima) comb. nov., a Nonsporulating Thermophilic Bacterium from a Japanese Thermal Spa. International Journal of Systematic and Evolutionary Microbiology, 24(1), 102-112. doi:doi:10.1099/00207713-24-1-102 Paradis, E., Claude, J., & Strimmer, K. (2004). APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics, 20(2), 289-290. doi:10.1093/bioinformatics/btg412 Petruseva, I. O., Evdokimov, A. N., & Lavrik, O. I. (2014). Molecular mechanism of global genome nucleotide excision repair. Acta Naturae, 6(1), 23-34. Rapic Otrin, V., Kuraoka, I., Nardo, T., McLenigan, M., Eker, A. P., Stefanini, M., . . . Wood, R. D. (1998). 
Relationship of the xeroderma pigmentosum group E DNA repair defect to the chromatin and DNA binding proteins UV-DDB and . Mol Cell Biol, 18(6), 3182-3190. Ren, R., Sun, Y., Zhao, Y., Geiser, D., Ma, H., & Zhou, X. (2016). Phylogenetic Resolution of Deep Eukaryotic and Fungal Relationships Using Highly Conserved Low-Copy Nuclear Genes. Genome Biology and Evolution, 8(9), 2683-2701. doi:10.1093/gbe/evw196 Richards, J. D., Cubeddu, L., Roberts, J., Liu, H., & White, M. F. (2008). The archaeal XPB protein is a ssDNA-dependent ATPase with a novel partner. J Mol Biol, 376(3), 634-644. doi:10.1016/j.jmb.2007.12.019

Robbins, J. H., Kraemer, K. H., Lutzner, M. A., Festoff, B. W., & Coon, H. G. (1974). Xeroderma pigmentosum. An inherited diseases with sun sensitivity, multiple cutaneous neoplasms, and abnormal DNA repair. Ann Intern Med, 80(2), 221-248.
Rogers, H. H., & Griffiths-Jones, S. (2012). Mitochondrial pseudogenes in the nuclear genomes of Drosophila. PLoS One, 7(3), e32593. doi:10.1371/journal.pone.0032593
Rouillon, C., & White, M. F. (2011). The evolution and mechanisms of nucleotide excision repair proteins. Res Microbiol, 162(1), 19-26. doi:10.1016/j.resmic.2010.09.003
Roy, N., Bagchi, S., & Raychaudhuri, P. (2012). Damaged DNA binding protein 2 in reactive oxygen species (ROS) regulation and premature senescence. Int J Mol Sci, 13(9), 11012-11026. doi:10.3390/ijms130911012
Roy, N., Stoyanova, T., Dominguez-Brauer, C., Park, H. J., Bagchi, S., & Raychaudhuri, P. (2010). DDB2, an essential mediator of premature senescence. Mol Cell Biol, 30(11), 2681-2692. doi:10.1128/MCB.01480-09
Saijo, M. (2013). The role of Cockayne syndrome group A (CSA) protein in transcription-coupled nucleotide excision repair. Mech Ageing Dev, 134(5-6), 196-201. doi:10.1016/j.mad.2013.03.008
Sanger, F., Air, G. M., Barrell, B. G., Brown, N. L., Coulson, A. R., Fiddes, C. A., . . . Smith, M. (1977). Nucleotide sequence of bacteriophage phi X174 DNA. Nature, 265(5596), 687-695.
Schliep, K., Potts, A. J., Morrison, D. A., Grimm, G. W., & Fitzjohn, R. (2017). Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution, 8(10), 1212-1220. doi:10.1111/2041-210X.12760
Schliep, K. P. (2011). phangorn: phylogenetic analysis in R. Bioinformatics, 27(4), 592-593. doi:10.1093/bioinformatics/btq706
Sekelsky, J. J., Brodsky, M. H., & Burtis, K. C. (2000). DNA repair in Drosophila: insights from the Drosophila genome sequence. J Cell Biol, 150(2), F31-36.
Sekelsky, J. (2017). DNA Repair in Drosophila: Mutagens, Models, and Missing Genes. Genetics, 205(2), 471-490. doi:10.1534/genetics.116.186759
Session, A. M., Uno, Y., Kwon, T., Chapman, J. A., Toyoda, A., Takahashi, S., . . . Rokhsar, D. S. (2016). Genome evolution in the allotetraploid frog Xenopus laevis. Nature, 538(7625), 336-343. doi:10.1038/nature19840
Siepel, A., Bejerano, G., Pedersen, J. S., Hinrichs, A. S., Hou, M., Rosenbloom, K., . . . Haussler, D. (2005). Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res, 15(8), 1034-1050. doi:10.1101/gr.3715005
Sin, Y., Tanaka, K., & Saijo, M. (2016). The C-terminal Region and SUMOylation of Cockayne Syndrome Group B Protein Play Critical Roles in Transcription-coupled Nucleotide Excision Repair. J Biol Chem, 291(3), 1387-1397. doi:10.1074/jbc.M115.683235
Spivak, G. (2016). Transcription-coupled repair: an update. Archives of Toxicology, 90(11), 2583-2594. doi:10.1007/s00204-016-1820-x
R Core Team. (2017). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Veltman, J. A., & Brunner, H. G. (2012). De novo mutations in human genetic disease. Nat Rev Genet, 13(8), 565-575. doi:10.1038/nrg3241
Wakasugi, M., Kawashima, A., Morioka, H., Linn, S., Sancar, A., Mori, T., . . . Matsunaga, T. (2002). DDB accumulates at DNA damage sites immediately after UV irradiation and directly stimulates nucleotide excision repair. J Biol Chem, 277(3), 1637-1640. doi:10.1074/jbc.C100610200
Watson, J. D., & Crick, F. H. (1953). Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 171(4356), 737-738.
Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag.
Yeo, J.-E., Khoo, A., Fagbemi, A. F., & Schärer, O. D. (2012). The Efficiencies of Damage Recognition and Excision Correlate with Duplex Destabilization Induced by Acetylaminofluorene Adducts in Human Nucleotide Excision Repair. Chemical Research in Toxicology, 25(11), 2462-2468. doi:10.1021/tx3003033
Yokoi, M., Masutani, C., Maekawa, T., Sugasawa, K., Ohkuma, Y., & Hanaoka, F. (2000). The xeroderma pigmentosum group C protein complex XPC-HR23B plays an important role in the recruitment of transcription factor IIH to damaged DNA. J Biol Chem, 275(13), 9870-9875.
Zdobnov, E. M., Tegenfeldt, F., Kuznetsov, D., Waterhouse, R. M., Simao, F. A., Ioannidis, P., . . . Kriventseva, E. V. (2017). OrthoDB v9.1: cataloging evolutionary and functional annotations for animal, fungal, plant, archaeal, bacterial and viral orthologs. Nucleic Acids Res, 45(D1), D744-D749. doi:10.1093/nar/gkw1119


Appendix A

Summary of taxonomic units

Number of species in each taxonomic unit possessing or missing each gene.

Taxonomic unit             CETN2  DDB2 ERCC1 ERCC2 ERCC3 ERCC4 ERCC5 ERCC6 ERCC8  LIG1 MNAT1  RFA1  RFA2  RFC1  TTDA   XPA   XPC  DDB1
Vertebrates possessing       105   102    90    91   104   103   103   104   102    93   105   105   105   105    99   104   105   105
Vertebrates missing            0     3    15    14     1     2     2     1     3    12     0     0     0     0     6     1     0     0
Arthropods possessing          1     5    44    44    44    43    43    28     8    43    43    44    43    44    33    43    44    44
Arthropods missing            43    39     0     0     0     1     1    16    36     1     1     0     1     0    11     1     0     0
Nematodes possessing           0     0     5     6     6     5     6     6     2     6     6     6     6     6     4     6     5     6
Nematodes missing              6     6     1     0     0     1     0     0     4     0     0     0     0     0     2     0     1     0
Plants possessing             26    76    82    84    83    84    82    82    81    84    82    84    82    83    74     3    82    84
Plants missing                58     8     2     0     1     0     2     2     3     0     2     0     2     1    10    81     2     0
Fungi possessing               0     0   117   120   118   119   119    88    78   120   119   120   118   120    93   118   112    83
Fungi missing                121   121     4     1     3     2     2    33    43     1     2     1     3     1    28     3     9    38
Protists possessing            2     0    39    48    48    44    45    33    15    49    32    47    27    48    24    18    25    19
Protists missing              47    49    10     1     1     5     4    16    34     0    17     2    22     1    25    31    24    30
Invertebrates possessing       2    14    66    70    70    67    69    53    28    68    67    70    67    70    45    69    69    70
Invertebrates missing         68    56     4     0     0     3     1    17    42     2     3     0     3     0    25     1     1     0


Appendix B

Sample of construction of stacked barplots. The process for vertebrates was repeated for other taxonomic units.

library(ggplot2)   # ggplot(), geom_bar(), theme()
library(reshape2)  # melt(); assumed here, the original listing does not name the package

Kegg = read.table('KeggTableSummary.csv', sep=';', header=TRUE)

#####ALL SPECIES#######

#Choosing appropriate columns and rows for organism group
Keggstats = Kegg[1:2,5:22]

#Create a column with row numbers
Keggstats$row <- seq_len(nrow(Keggstats))

#Melt table
Keggstats2 <- melt(Keggstats, id.vars = "row")

#Convert row to a factor so that the fill scale in the legend is discrete
Keggstats2$row = as.factor(Keggstats2$row)

#Create plot
Keggallspeciesplot = ggplot(Keggstats2, aes(x=variable, y=value, fill=row)) +
  geom_bar(stat="identity", position=position_fill(vjust=1, reverse=TRUE)) +
  scale_fill_manual(values=c("navy","skyblue"), labels=c("Present","Absent")) +
  ggtitle("All Species") +
  xlab("Genes") +
  ylab("Representation in Species(%)\n") +
  theme(legend.title=element_blank()) +
  theme(axis.text.x=element_text(angle=90, hjust=1)) +
  theme(plot.title=element_text(hjust=0.5))
Keggallspeciesplot

###VERTEBRATES###

#Extract number of species within taxonomic unit with and without gene
Keggvert = Kegg[3:4,5:22]
Keggvert$row <- seq_len(nrow(Keggvert))
Keggvert2 <- melt(Keggvert, id.vars = "row")
Keggvert2$row = as.factor(Keggvert2$row)

#Create plot
Keggvert2plot = ggplot(Keggvert2, aes(x=variable, y=value, fill=row)) +
  geom_bar(stat="identity", position=position_fill(vjust=1, reverse=TRUE)) +
  scale_fill_manual(values=c("navy","skyblue"), labels=c("Present","Absent")) +
  ggtitle("a) Vertebrates ") +
  xlab("Genes") +
  ylab("Representation in Species(%)\n") +
  guides(fill="none") +
  #theme(axis.text.x=element_text(angle=90, hjust=1)) +
  theme(axis.text.x=element_blank()) +
  theme(axis.title.x=element_blank()) +
  theme(axis.title.y=element_blank()) +
  theme(legend.title=element_blank()) +
  theme(plot.title=element_text(hjust=0.5))

Keggvert2plot
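The reshaping step above can be illustrated without any packages. The following is a minimal sketch, not the code used for the thesis figures: it rebuilds in base R what melt() is assumed to return for a two-row slice, and computes the shares that position_fill() draws as bar heights. The object names vert, long and share are illustrative only; the counts are the first three vertebrate columns of Appendix A.

```r
# Two-row slice in the layout used above:
# row 1 = species possessing the gene, row 2 = species missing it.
vert <- data.frame(CETN2 = c(105, 0), DDB2 = c(102, 3), ERCC1 = c(90, 15))
vert$row <- seq_len(nrow(vert))

# Base-R equivalent of melt(vert, id.vars = "row"): one line per gene and row.
genes <- setdiff(names(vert), "row")
long <- data.frame(
  row      = rep(vert$row, times = length(genes)),
  variable = factor(rep(genes, each = nrow(vert)), levels = genes),
  value    = unlist(vert[genes], use.names = FALSE)
)
long$row <- as.factor(long$row)

# position_fill() rescales each bar to sum to 1; the same shares computed directly.
long$share <- long$value / ave(long$value, long$variable, FUN = sum)
print(long)
```

Each gene contributes two lines to the long table, and the two shares for a gene sum to 1, which is why every bar in the stacked plots has the same total height.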


Appendix C

Construction of stacked barplots for each gene. The process for CETN2 was repeated for the other 17 genes.

###CETN2###

#Make data frame of the organism groups, excluding the "All species" group, as it is irrelevant here
Keggcetn2 = data.frame(Keggvert$CETN2, Keggplant$CETN2, Keggfungi$CETN2, Keggnematoda$CETN2,
                       Keggarthropoda$CETN2, Kegginvert$CETN2, Keggprotist$CETN2)
names(Keggcetn2) = c("Vertebrates","Plants","Fungi","Nematodes","Arthropods","Invertebrates","Protists")

#Create rows
Keggcetn2$row = seq_len(nrow(Keggcetn2))

#Melt data.frame for stacked plot
Keggcetn22 = melt(Keggcetn2, id.vars="row")
Keggcetn22$row = as.factor(Keggcetn22$row)

#Create plot
Keggcetn2plot = ggplot(Keggcetn22, aes(x=variable, y=value, fill=row)) +
  geom_bar(stat="identity", position=position_fill(vjust=1, reverse=TRUE)) +
  scale_fill_manual(values=c("navy","skyblue"), labels=c("Present","Absent")) +
  ggtitle("CETN2") +
  xlab("Organism group") +
  ylab("Representation in Species(%)\n") +
  theme(axis.text.x=element_text(angle=90, hjust=1)) +
  theme(legend.title=element_blank()) +
  theme(plot.title=element_text(hjust=0.5))
Keggcetn2plot
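The values behind the per-gene plots can also be tabulated directly. The sketch below is an illustration rather than part of the original analysis: it enters the Appendix A counts for three of the genes by hand and computes the percentage of species in each group that possess the gene. The object names present, absent and percent_present are illustrative.

```r
groups <- c("Vertebrates", "Arthropods", "Nematodes", "Plants",
            "Fungi", "Protists", "Invertebrates")

# Species possessing / missing each gene, copied from Appendix A.
present <- rbind(
  CETN2 = c(105,  1, 0, 26,   0,  2,  2),
  ERCC6 = c(104, 28, 6, 82,  88, 33, 53),
  XPA   = c(104, 43, 6,  3, 118, 18, 69)
)
absent <- rbind(
  CETN2 = c(0, 43, 6, 58, 121, 47, 68),
  ERCC6 = c(1, 16, 0,  2,  33, 16, 17),
  XPA   = c(1,  1, 0, 81,   3, 31,  1)
)
colnames(present) <- colnames(absent) <- groups

# Percentage of species in each group in which the gene was found.
percent_present <- round(100 * present / (present + absent), 1)
print(percent_present)
```

Read row by row, the table shows the same contrast as the plots: CETN2 is found in all vertebrates but in no fungi, while XPA is found in almost every group except plants.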


Appendix D

Construction of phylogenetic trees. The process shown here for ERCC3 was repeated for ERCC5, ERCC6, ERCC8 and DDB2.

install.packages("ape")
install.packages("seqinr")
install.packages("phangorn")
library("ape")
library("seqinr")
library("phangorn")

#Read sequence after clustal alignment
ERCC3.clu <- read.aa("ERCC3Aligned.fasta.txt", format="fasta")
ERCC3.clu

#Set data as phyDat class object for further manipulation
ERCC3.clu.phyDat <- phyDat(ERCC3.clu, type="AA")

ERCC3.clu.phyDat

#Calculate distance between taxa (default "JC69")
#dist.ml computes pairwise distances for an object of class phyDat.
#dist.ml uses DNA / AA sequences to compute distances under different substitution models
ERCC30.clu <- dist.ml(ERCC3.clu.phyDat)
ERCC30.clu

#Create neighbor-joining tree rooted by Oryza sativa japonica
ERCC3.clu.nj = NJ(ERCC30.clu)
ERCC3nj = plot(root(ERCC3.clu.nj, "Oryza.sativa.japonica"), main="ERCC3")

#Maximum likelihood tree with highest log likelihood
ERCC3fit.clu = pml(root(ERCC3.clu.nj, "Oryza.sativa.japonica"), ERCC3.clu.phyDat)
ERCC3fit.clu <- optim.pml(ERCC3fit.clu, optNni=TRUE, optInv=TRUE, optBf=TRUE, optQ=TRUE, getRooted=TRUE)

#Plot phylogram with Oryza as root
plot(root(ERCC3fit.clu$tree, "Oryza.sativa.japonica"), main="ERCC3")

#Save file in high-res
tiff('ERCC3Max.tiff', units="in", width=7, height=7, res=300)
plot(root(ERCC3fit.clu$tree, "Oryza.sativa.japonica"), main="ERCC3")
dev.off()

#Set the seed of the random number generator so that the bootstrap is reproducible
set.seed(1)

#Bootstrapping 100 trees
ERCC3bs.clu <- bootstrap.pml(ERCC3fit.clu, bs=100, optNni=TRUE, optInv=TRUE, optBf=TRUE, optQ=TRUE, getRooted=TRUE)
ERCC3bs.clu

#Draw tree with bootstrap values
tiff('ERCC3BS.tiff', units="in", width=8, height=8, res=500)
plotBS(ERCC3fit.clu$tree, ERCC3bs.clu, "phylogram", main="ERCC3")
dev.off()
pml(ERCC3fit.clu$tree, ERCC3.clu.phyDat)
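To make the distance step more concrete, the following base-R sketch shows the idea behind a Jukes-Cantor type distance between two aligned protein sequences. It is an illustration under simplifying assumptions, not a reimplementation of dist.ml: all 20 amino acids are treated as equally frequent, gapped columns are simply dropped, and the two short sequences are made up. Values from phangorn may differ, for example in how gaps and ambiguous residues are handled. The function names p_distance and jc_distance are illustrative.

```r
# Proportion of aligned sites at which two sequences differ,
# ignoring columns where either sequence has a gap.
p_distance <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  stopifnot(length(a) == length(b))
  keep <- a != "-" & b != "-"
  mean(a[keep] != b[keep])
}

# Jukes-Cantor type correction for k equally frequent states
# (k = 20 for amino acids): expected substitutions per site.
jc_distance <- function(p, k = 20) {
  -((k - 1) / k) * log(1 - p * k / (k - 1))
}

# Hypothetical aligned fragments, differing at 2 of 10 sites.
seq1 <- "MKTAYIAKQR"
seq2 <- "MKTAHIAKQK"

p <- p_distance(seq1, seq2)
d <- jc_distance(p)
print(c(p_distance = p, jc_distance = d))
```

The corrected distance is always at least as large as the raw proportion of differing sites, because it accounts for repeated substitutions at the same site. A matrix of such pairwise distances is the input to the neighbor-joining step.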


Appendix E

R commands for Chi-squared tests.

#Each matrix has two rows (species possessing / missing the gene) and six columns:
#vertebrates, arthropods, nematodes, plants, fungi and protists (counts from Appendix A)
cetn2 = matrix(c(105,0,1,43,0,6,26,58,0,121,2,47), ncol=6)
ddb2  = matrix(c(102,3,5,39,0,6,76,8,0,121,0,49), ncol=6)
ercc1 = matrix(c(90,15,44,0,5,1,82,2,117,4,39,10), ncol=6)
ercc2 = matrix(c(91,14,44,0,6,0,84,0,120,1,48,1), ncol=6)
ercc3 = matrix(c(104,1,44,0,6,0,83,1,118,3,48,1), ncol=6)
ercc4 = matrix(c(103,2,43,1,5,1,84,0,119,2,44,5), ncol=6)
ercc5 = matrix(c(103,2,43,1,6,0,82,2,119,2,45,4), ncol=6)
ercc6 = matrix(c(104,1,28,16,6,0,82,2,88,33,33,16), ncol=6)
ercc8 = matrix(c(102,3,8,36,2,4,81,3,78,43,15,34), ncol=6)
lig1  = matrix(c(93,12,43,1,6,0,84,0,120,1,49,0), ncol=6)
mnat1 = matrix(c(105,0,43,1,6,0,82,2,119,2,32,17), ncol=6)
rfa1  = matrix(c(105,0,44,0,6,0,84,0,120,1,47,2), ncol=6)
rfa2  = matrix(c(105,0,43,1,6,0,82,2,118,3,27,22), ncol=6)
rfc1  = matrix(c(105,0,44,0,6,0,83,1,120,1,48,1), ncol=6)
ttda  = matrix(c(99,6,33,11,4,2,74,10,93,28,24,25), ncol=6)
xpa   = matrix(c(104,1,43,1,6,0,3,81,118,3,18,31), ncol=6)
#Fungi missing XPC: 9, as in Appendix A (112 + 9 = 121 species)
xpc   = matrix(c(105,0,44,0,5,1,82,2,112,9,25,24), ncol=6)
ddb1  = matrix(c(105,0,44,0,6,0,84,0,83,38,19,30), ncol=6)

chisq.test(cetn2)
chisq.test(ddb2)
chisq.test(ercc1)
chisq.test(ercc2)
chisq.test(ercc3)
chisq.test(ercc4)
chisq.test(ercc5)
chisq.test(ercc6)
chisq.test(ercc8)
chisq.test(lig1)
chisq.test(mnat1)
chisq.test(rfa1)
chisq.test(rfa2)
chisq.test(rfc1)
chisq.test(ttda)
chisq.test(xpa)
chisq.test(xpc)
chisq.test(ddb1)
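As a worked example of one of these tests, the sketch below labels the ERCC6 table and inspects the expected counts. Several cells in these tables are small or zero, and R warns when expected counts are low because the chi-squared approximation may then be inaccurate. The Monte Carlo p-value at the end is an optional check that is not part of the original analysis; the number of replicates B = 2000 is an arbitrary choice.

```r
groups <- c("Vertebrates", "Arthropods", "Nematodes", "Plants", "Fungi", "Protists")

# ERCC6 counts from Appendix A: row 1 = species possessing, row 2 = species missing.
ercc6 <- matrix(c(104, 1, 28, 16, 6, 0, 82, 2, 88, 33, 33, 16),
                nrow = 2,
                dimnames = list(status = c("Present", "Absent"), group = groups))

# Pearson chi-squared test of independence between gene status and group.
# suppressWarnings() hides the low-expected-count warning, which is checked below.
test <- suppressWarnings(chisq.test(ercc6))
print(test)

# Expected counts under independence; small values flag an unreliable approximation.
print(round(test$expected, 1))

# Optional: p-value by Monte Carlo simulation instead of the chi-squared distribution.
set.seed(1)
sim <- chisq.test(ercc6, simulate.p.value = TRUE, B = 2000)
print(sim$p.value)
```

With two rows and six columns the asymptotic test has (2 - 1) x (6 - 1) = 5 degrees of freedom. The expected count for nematodes missing ERCC6 is about 1, so the simulated p-value is a useful cross-check for this table.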
