Research Collection

Doctoral Thesis

The genome of the trematode parasite Atriophallophorus winterbourni, Blasco-Costa et al., 2019: A macro- and microevolutionary perspective

Author(s): Zajac, Natalia

Publication Date: 2021

Permanent Link: https://doi.org/10.3929/ethz-b-000475035

Rights / License: In Copyright - Non-Commercial Use Permitted


ETH Library DISS. ETH NO. 27280

The genome of the trematode parasite Atriophallophorus winterbourni, Blasco-Costa et al., 2019: A macro- and microevolutionary perspective

A thesis submitted to attain the degree of DOCTOR OF SCIENCES of ETH ZURICH (Dr. sc. ETH Zurich)

presented by

Natalia Halina Zając MSc, Uppsala University

Born on 01.07.1992 citizen of Poland

Accepted on the recommendation of

Prof. Dr. Roger Butlin, Department of and Plant Sciences, University of Sheffield
Prof. Dr. Christophe Dessimoz, Computational Evolutionary Biology, University of Lausanne
Prof. Dr. Jukka Jokela, D-USYS, ETH Zurich
Prof. Dr. Hanna Hartikainen, Faculty of Medicine & Health Sciences, University of Nottingham

2020

Table of Contents

Summary
Zusammenfassung
Introduction
Chapter 1: Gene duplication and gain in the trematode Atriophallophorus winterbourni contributes to adaptation to parasitism
Chapter 2: Divergence of gene structure in genes originating through duplication
Chapter 3: Genomic footprint of divergence under high gene flow in the trematode parasite Atriophallophorus winterbourni
Chapter 4: Genomic signature of local adaptation to a host population in a connected parasite population of the trematode, Atriophallophorus winterbourni
Concluding remarks
Acknowledgments
Curriculum Vitae

Summary


Macro- and microevolutionary processes impact the structure and the function of a genome. Macroevolution mediates character state transitions that diagnose evolutionary differences of major taxonomic rank. The study of genomic features that have evolved on a macroevolutionary scale, including gene duplications or the evolution of novel genes, reveals outcomes of evolutionary processes affecting species as a whole. Microevolutionary processes, on the other hand, including natural selection, mutation, gene flow and genetic drift, change allele frequencies within and among populations of a species. The study of genomic regions impacted by those processes reveals their effect on fitness-correlated trait values and on the diversity of a genome within a species. Ultimately, the two kinds of process inevitably work in concert: macroevolution restricts the repertoire of genes that microevolutionary processes can act on, while the accumulation of small modifications has an impact over longer evolutionary time scales. The genome of an organism at any point in time is the product of both. Parasitic species are ideal model systems for the study of macro- and microevolutionary processes: the transition to parasitism is a macroevolutionary event, whereas rapid evolutionary responses to the hosts with which parasites co-evolve, diversification and specialization, and the evolution of complex population structures are results of microevolution.

This thesis addresses the genomic signature of macro- and microevolutionary processes that have shaped the genome of the trematode parasite Atriophallophorus winterbourni (Blasco-Costa et al., 2019). Firstly, we sequenced, assembled de novo and annotated a reference genome for A. winterbourni, creating a resource and an opportunity for subsequent whole-genome research. Secondly, through comparative genomic analysis with other major parasitic worms and reconstruction of three ancestral genomes in the trematode phylogeny, we investigated gene families that might have contributed to the adaptation of A. winterbourni to parasitism. We inferred the timing of these evolutionary changes through the inference of a robust phylogeny of 18 Platyhelminthes and a time tree with divergence times of the trematode speciation events. In order to better understand the trematode genome architecture, we studied the gene structure of conserved genes, of genes that evolved through duplication and of genes that are novel since the trematode ancestor in the 13 focal trematode species. The research was performed in collaboration with the authors of OMA (“Orthologous MAtrix”), which is a method and a database for the inference of orthologs among complete genomes. The results indicated that A. winterbourni split from the Opisthorchiata suborder approximately 237.4 MYA (± 120.4 MY). We found that, since that speciation event, 24% of A. winterbourni genes have arisen through duplication events and 31.9% have been newly acquired, suggesting a high contribution of novelty to lineage-specific adaptation. Among those genes, we found 13 gene families with over 10 genes that arose through duplication, all of which have functions potentially relating to host behavioural manipulation, host tissue penetration, and hiding from host immunity through antigen presentation. Interestingly, we found that the process of duplication results in a change in gene structure. In the 13 studied trematodes, duplicated copies of a gene tended to be shorter and to have fewer exons than the gene they originated from. Additionally, gene families coding for shorter proteins tended to evolve through duplication more often than gene families coding for longer proteins.

Thirdly, using population genomic techniques, we searched for the genomic regions involved in the adaptation of A. winterbourni to its intermediate host, the New Zealand mud snail Potamopyrgus antipodarum (Gray, 1843), across different spatial scales. We examined the genomic footprint of divergence and the extent of gene flow between the north-west and south-east populations of A. winterbourni across its native range, the South Island of New Zealand. These populations were previously separated during glacial periods, and gene flow between them is restricted by a mountain range, the Southern Alps. We found that the divergent genomic regions code for proteins possibly involved in an extracellular vesicle biogenesis pathway. In other trematodes, this pathway has been found to play a role in parasite migration through host tissue and in countering the attack of the host immune system. The functional genomic differentiation between these populations could possibly be due to host-parasite co-divergence and local adaptation. Previous research, however, has suggested population structure not only between lakes but also within a single lake, the extensively studied Lake Alexandrina, where negative frequency-dependent dynamics between the host and the parasite have been observed. In order to fully understand the geographic mosaic of co-evolution shaping the genomic diversity of the parasite, we searched for a genomic signature of divergence in Lake Alexandrina between 6 highly interconnected parts of the lake, encompassing all the lake banks. The study found a high level of polymorphism in the parasite populations in the lake, but no clear signature of divergence between any of the sites. We speculated that the population might be undergoing an expansion in diversity after a recent selective sweep caused by adaptation to the most common host genotype. The observations were supported by an investigation of the infected host population using neutral markers.

Altogether, the thesis presents a comprehensive study of the different forces shaping a parasite genome, acting at either the species or the population level, and addresses the levels of complexity of the host-parasite co-evolutionary dynamics. Although the system has been extensively studied, this is the first time that the questions of orthology, phylogeny, population structure, population divergence, and the geographic mosaic of co-evolution have been addressed using whole-genome data.


Zusammenfassung


Macro- and microevolutionary processes influence the structure and function of a genome. Macroevolution mediates the character state transitions that diagnose evolutionary differences of major taxonomic rank. Studying genomic features that arose through macroevolutionary processes, such as gene duplication or the origin of new genes, shows how strongly evolutionary processes affect a species as a whole. Microevolutionary processes, on the other hand, such as natural selection, mutation, gene flow and genetic drift, change allele frequencies within and among populations of a species. Studying the genomic regions shaped by these processes reveals their influence on fitness-correlated traits and on the diversity of a genome within a species. Ultimately, the two kinds of process inevitably work together: macroevolution restricts the repertoire of genes on which microevolutionary processes can act, while the accumulation of small changes has an effect over longer evolutionary time scales. The genome of an organism is therefore always the product of both. Parasitic species are an ideal model for studying macro- and microevolutionary processes: the transition to a parasitic way of life is a macroevolutionary event, whereas rapid, short-term evolutionary adaptation to the host with which the parasite co-evolves, diversification and specialization, and the evolution of complex population structures are results of microevolution.

This doctoral thesis investigates the genomic signature of macro- and microevolutionary processes in the genome of the parasitic trematode Atriophallophorus winterbourni (Blasco-Costa et al., 2019). First, we sequenced, assembled de novo and annotated the reference genome of A. winterbourni, creating an important resource and an opportunity for subsequent whole-genome research. Second, we investigated gene families that may have contributed to the adaptation of A. winterbourni to the parasitic lifestyle. To do so, we compared genomic sequences with those of other relevant parasitic worms and reconstructed three ancestral genomes in the trematode phylogeny. We determined the timing of these evolutionary changes by inferring a robust phylogeny of 18 Platyhelminthes and a time tree with divergence times of the trematode speciation events. To better understand the genomic architecture of trematodes, we examined the gene structure of conserved genes, of genes that evolved through duplication and of genes newly arisen since the common ancestor of the 13 focal trematode species. This work was part of a collaboration with the authors of OMA (“Orthologous MAtrix”), a method and database for the inference of orthologs among complete genomes. The results indicate that A. winterbourni split from the suborder Opisthorchiata approximately 237.4 million years ago (± 120.4 million years). We also found that, since that speciation event, 24% of A. winterbourni genes have arisen through duplication and 31.9% have been newly acquired, which suggests a high contribution of novelty to lineage-specific adaptation. Among these genes we found 13 gene families with more than 10 genes that arose through duplication, all of which have functions presumably related to manipulation of host behaviour, penetration of host tissue, or hiding from the host immune system through antigen presentation. In the 13 trematodes studied, duplicated genes tended to be somewhat shorter and to have fewer exons than the genes from which they originated. In addition, gene families coding for shorter proteins evolved through duplication more often than gene families coding for longer proteins.

Third, using population genomic methods, we searched geographically separated parasite populations for genomic regions involved in the adaptation of A. winterbourni to its intermediate host, the New Zealand mud snail Potamopyrgus antipodarum (Gray, 1843). We examined the genomic footprint of divergence and the extent of gene flow between north-western and south-eastern populations of A. winterbourni across its native range, the South Island of New Zealand. These two regions were separated from each other during glacial periods, and gene flow between them has been restricted by the Southern Alps. We found that the divergent, protein-coding genomic regions are possibly involved in an extracellular vesicle biogenesis pathway. In other trematodes this pathway plays a role in migration through host tissue and counters the attack of the host immune system. The functional genomic differentiation between these populations could have arisen through host-parasite co-divergence and local adaptation. Previous research on this system indicates, however, that population structure exists not only between lakes but also within a single lake. This is the case in Lake Alexandrina, which has been studied intensively for decades and in which negative frequency-dependent dynamics between host and parasite have been observed. To understand the full geographic mosaic of co-evolution that shapes the genomic diversity of the parasite, we searched for signatures of divergence in the genomes of parasite populations at six highly interconnected sites that together cover all shores of Lake Alexandrina. The results showed a high level of polymorphism in the parasite populations of the lake but no clear sign of divergence between individual sites. We therefore hypothesized that the diversity of the population is now increasing again after a recent selective sweep caused by adaptation of the parasites to the most common host genotype. These hypotheses were supported by an investigation of the infected host populations using neutral markers.

In summary, this doctoral thesis presents a comprehensive investigation of the different factors that shape a parasite genome, at either the species or the population level, and addresses the complexity of the co-evolutionary dynamics between host and parasite. Although the A. winterbourni system is already very well studied, this work is the first to examine phylogeny, population structure, orthology, population divergence and the geographic mosaic of co-evolution using whole-genome data.


Introduction


Macroevolution vs Microevolution: historical perspective

The study of genome evolution lies at the intersection of the deeply intertwined fields of evolutionary biology and genomics. A multitude of processes shape a species’ genome: its structure, architecture, function and intra-species diversity. These processes can be divided into macroevolutionary and microevolutionary processes. Macroevolutionary processes create broader patterns above the species level and over the course of Earth’s history (Gregory, 2005) or, as Levinton put it, are “the sum of those processes that explain the character state transitions that diagnose evolutionary differences of major taxonomic rank” (Levinton, 2001). Microevolutionary processes involve changes in allele frequencies within and among populations; they are usually observed over shorter evolutionary time scales, for example over the course of several generations (Gregory, 2005; Levinton, 1988), but have been argued by others to also be evident from paleontological data (Hendry & Kinnison, 2001).

Microevolution is driven by four major processes: mutation, which generates new alleles; gene flow, which exchanges alleles between populations; random genetic drift, which adds an element of stochasticity and contingency; and natural selection, which acts on polymorphism and retains alleles conferring adaptation (Ridley, 1993). The term ‘microevolution’ was coined by Yuri Filipchenko (Philiptschenko) in 1927 (Philiptschenko, 1927), and the first half of the 20th century, including the periods of Neo-Darwinism and the Modern Synthesis, marked a time of major contributions to the understanding of evolutionary change within and between populations (Levinton, 2001; Ridley, 1993). The most prominent were the works of Sewall Wright, J. B. S. Haldane and Ronald A. Fisher, who advanced the theoretical synthesis of Darwin’s theory of natural selection with the theory of heredity and broadened the field of population genetics (Fisher, 1930; Haldane, 1932; Wright, 1931). The Neo-Darwinists and the architects of the Modern Synthesis held that natural selection is the major driver of evolution and that evolutionary change is the product of the accumulation of many small modifications. The major thinkers of the time, such as Theodosius Dobzhansky, Ernst Mayr and George Gaylord Simpson, favoured the equivalency of macro- and microevolution. They asserted that “all evolution is due to accumulation of small genetic changes, guided by natural selection” and that evolution above the species level is “nothing but an extrapolation and magnification of events that take place within populations and species” (Dobzhansky, 1937; Mayr, 1963; Simpson, 1944).

However, this view was criticised by macroevolutionists claiming that aggregation of changes resulting from natural selection at an individual level is insufficient to explain evolution as a whole and that character-state transitions, that distinguish lineages, are not just a simple sum of events occurring at an organismal level (Levinton, 2001). Richard Goldschmidt held an opposing view to the proponents of Modern Synthesis and was the first to propose the idea of the ‘hopeful monsters’ - organisms with a profound mutant phenotype that have the potential to establish a new evolutionary lineage (Goldschmidt, 1940). He did not agree that genes were independently acting entities but rather that they are subject to large scale integrated effects of chromosomes (Goldschmidt, 1940). He and others, including Stephen Jay Gould and Otto Heinrich Schindewolf, believed in a single revolutionary step from ancestral to descendant species; in other words, evolution through punctuated equilibrium (Gould, 1977, 1980). They favoured mutationism, which explains how species can form from single mutations, and saltationism, explaining how major chromosomal rearrangements can form new species and explain evolutionary change (Bateson, 1894; Berg, 1969; De Vries, 1922; Depew, 2017; Gregory, 2005; Ulett, 2014). Macroevolutionists denied the value of experimental genetics in studying evolution (Levinton, 2001).

The distinction between macro- and microevolutionary processes, or their interconnectedness, has been the subject of a long-standing and often acrimonious debate. Although the extent of this dichotomy and its usefulness are still debated, the importance of both macroevolutionary and microevolutionary processes in shaping a genome is now recognized (Erwin, 2010; Grantham, 2007; Hendry & Kinnison, 2001; Levinton, 1988). It is widely accepted that microevolutionary events are linked to the evolution of individual genes; every allele must pass a filter of selection at the population level. But the repertoire of genes a particular species has imposes developmental constraints on evolutionary pathways. Genome characteristics that enhance the survival of an individual in a population or act in local adaptation may be different from genome characteristics that facilitate speciation and are involved in the persistence of a species as a whole. The death of an individual organism does not mean the loss of a gene from a population. The death of a species (extinction), however, inevitably does. The genome of a species at any particular point in time is the product of both of these processes.

Genome structure and evolution

An organism’s genome consists of genes (the coding regions) and non-coding material, which is not transcribed and translated into proteins but instead consists of regulatory elements, transposable elements or repetitive sequence (Clark, 2001). Mutations in coding regions can be either synonymous, not affecting the protein structure, or nonsynonymous, altering the amino acid sequence of a protein and thus potentially having functional consequences for the organism’s phenotype (Nei & Gojobori, 1986). Mutations in the regulatory regions of genes, on the other hand, can cause changes in gene expression and thus also have a functional impact. Effects on phenotype, and thus potential adaptation, can also involve other forms of genetic change (Stapley et al., 2010). These include large-scale deletions or insertions (Chan et al., 2010; Droma et al., 2008; Eizirik et al., 2003) and inversions or chromosomal rearrangements that alter the overall genome organization (Fouet, Gray, Besansky, & Costantini, 2012; Rane, Rako, Kapun, Lee, & Hoffmann, 2015). The way a genomic change influences a phenotype depends on the genomic architecture underlying the trait, including the number of genomic regions involved, the strength of their relative influence, recombination rates in the genome, degree of redundancy, pattern of dominance, pleiotropy and epistasis (Hansen, 2006; Hoban et al., 2016).
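The distinction between synonymous and nonsynonymous mutations can be illustrated with a minimal sketch. It assumes the standard nuclear genetic code and single-codon comparisons; the function name and the example codons are illustrative and are not taken from the thesis.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
# (first base varies slowest, third base fastest, each over "TCAG").
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Classify a codon change (hypothetical helper, for illustration only)."""
    ref, alt = ref_codon.upper(), alt_codon.upper()
    if ref == alt:
        return "identical"
    # Same encoded amino acid (or stop) means the change is synonymous.
    if CODON_TABLE[ref] == CODON_TABLE[alt]:
        return "synonymous"
    return "nonsynonymous"


print(classify_substitution("CTT", "CTC"))  # both codons encode leucine
print(classify_substitution("GAG", "GTG"))  # glutamate replaced by valine
```

Many flatworm mitochondrial genomes use an alternative genetic code, so for mitochondrial genes the table above would need to be swapped for the appropriate one.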

Study of genome evolution and function has been facilitated by spectacular advances in sequencing technology which is now providing access to whole genome data for a growing number of species and opening the door to the study of the impact of both macroevolutionary and microevolutionary processes. Addressing the two processes requires different questions. Macroevolution deals with phylogenetic relationships among taxa, nature of evolutionary novelty, as well as genetic or developmental constraints, patterns of change over long time scales and processes that regulate that evolutionary change (Levinton, 1988). Questions on microevolutionary processes address population structure, heredity, local adaptation, intraspecific diversity, adaptive divergence and admixture (Levinton, 1988).


Adaptation to parasitism: macro- and microevolutionary perspective

Parasites are an ideal model for the study of macro- and microevolutionary processes. Parasitic species are known for their high potential for diversification and specialization, complex population structures and rapid evolutionary responses to the hosts with which they co-evolve (Brooks, 1988; Huyse, Poulin, & Theron, 2005). On the other hand, the transition of a species to parasitism is inarguably an occurrence on a macroevolutionary scale. The era of genomics has brought about new opportunities for studying parasite macroevolution, as parasite fossil records are very sparse (Leung, 2017).

Parasitism arose independently hundreds of times across the tree of life (Poulin & Randhawa, 2015). It has been shown to have arisen at least four times in the phylum Nematoda (Blaxter et al., 1998), at least three times in kinetoplastid protozoans (Maslov & Simpson, 1995), multiple times among copepods, amphipods and isopods (Poulin, 2011), and several times among the free-living Turbellaria, giving rise to major groups of parasitic Platyhelminthes (Rohde, 1994). Despite the diverse range of hosts and mechanisms that facilitated the origins of parasitism, all parasites faced a similar set of pressures associated with transmission, within-host survival and invasion of host tissues (Poulin & Randhawa, 2015). Convergence of phenotypic solutions to achieve successful parasitism has led to recurring genomic patterns across diverse taxonomic groups, including gene duplications, novel genes and gene losses in key functional pathways (Poulin & Randhawa, 2015).

The most common view is that parasites are associated with morphological regression and evolutionary degeneration, with loss of evolutionarily costly structures and functions (Corradi, 2015; O’Malley, Wideman, & Ruiz-Trillo, 2016; Poulin & Randhawa, 2015; Zarowiecki & Berriman, 2015). The biggest reductions have been observed in metabolic functions, owing to altered food availability compared with free-living species (Zarowiecki & Berriman, 2015). For example, parasitic cestodes have been found to lack the ability to synthesize cholesterol de novo, several species of nematodes and trematodes have lost metabolic proteases, and Clonorchis sinensis is the only helminth found to date to have all genes coding for enzymes of the fatty acid β-oxidation pathway, whose product, acetyl-CoA, is used for ATP synthesis in mitochondria. Indeed, in several helminths the loss of mitochondrial genes including atp8 (Egger, Bachmann, & Fromm, 2017), of cytochrome P450 redox enzymes (Abad et al., 2008; Tsai et al., 2013) and of peroxisome genes has been associated with reduced metabolic capacities.

However, gene loss and genome size reduction have mostly been associated with intracellular parasites and have not been so prominent in helminth parasites (Poulin & Randhawa, 2015). In many extracellular endo- and ectoparasites that have evolved complex, multihost life cycles, gene duplication has played a crucial role in developing adaptations to parasitism. Expansion of gene families has been especially important in antigens, in transmembrane proteins, such as excretory and secretory proteins, and in proteases. An example of the expansion of gene families coding for proteins that facilitate evasion of host immune responses is the large range of duplicated surface proteins (Mucins) in species (Roger et al., 2008) or the apomucins and galactosyltransferases in Echinococcus multilocularis (Tsai et al., 2013). Expanded families of proteases, enzymes often involved in host tissue digestion and invasion, have been found in Schistosoma species (metallopeptidase major surface proteins (MSP); Zhou et al., 2009) and in Strongyloides nematodes (astacins; Hunt et al., 2016).

Duplicated genes at first generate functional redundancy (Zhang, Dyer, & Rosenberg, 2000). If an extra amount of RNA product or protein is beneficial, the two genes are maintained through purifying selection to perform the same function (Zhang et al., 2000). More often, however, the duplicated gene undergoes neofunctionalization, with the second copy assuming a novel function, or subfunctionalization, with each gene copy taking on part of the original function. Consequently, species-specific gene duplications can lead to lineage-specific gene functions and confer specific traits important in the adaptation of that lineage to its particular niche. This gene duplication model, proposed by Ohno (1970; Ohno, 2013) and further developed by Force, Lynch, Tirosh and others (Force et al., 1999; Lynch & Conery, 2000; Tirosh & Barkai, 2007; Yang et al., 2014), widely applies to the development of new gene functions in parasites. Lineage-specific expansions of gene families have been observed in a family of nuclear receptors in (Wu, Niles, El-Sayed, Berriman, & LoVerde, 2006) or among tyrosinase genes across many Platyhelminthes (Kim & Bae, 2017).

The changes described above refer to characteristics present in all individuals of the same species. Understanding the process of adaptation of parasites, however, requires investigation of heritable allele frequency changes at the individual level across populations of the same species. This type of analysis takes into account the diversity of a genome within a species and the variation upon which microevolutionary forces act. The cumulative effects of reciprocal selection, subdivision of populations and co-divergence of parasite populations with host populations, gene flow, mutation and genetic drift can all contribute to speciation.

In host-parasite systems that coevolve, reciprocal selection can vary in space and over the course of generations. The gradient in intensity of reciprocal selection can produce coevolutionary “hot spots”, where selection is strong, and “cold spots”, where selection is weak, with the spatial distribution of hot spots in the landscape changing over time (Thompson, 1994). Variation in time can be caused by negative frequency-dependent selection, with the parasite adapting to the common host genotypes in a population that then decrease in frequency and are replaced by previously rare, uninfected host genotypes (Carius, Little, & Ebert, 2001; Koskella & Lively, 2009). Gene flow between parasite communities with different intensities of reciprocal selection can create local mismatches and maladaptations in a coevolving interaction, but can also provide rare, favourable variants (Nuismer, Thompson, & Gomulkiewicz, 1999). When extensive, gene flow can erode variation at neutral loci with selection maintaining divergence only in regions relevant for local adaptation. Consideration of both population dynamics and population genetics in the study of maintenance of polymorphism in structured populations was already emphasized by Frank (Frank, 1991, 1993) but the theoretical framework that describes these dynamics and addresses their combined impact on the outcome of co-evolution is the geographic mosaic of coevolution developed by Thompson (Thompson, 1994).

As Thompson himself pointed out, the mosaic of reciprocal selection is highly dependent on the ecology of parasites and their life history traits, including mode of reproduction, mode of transmission and host specificity, as well as on population connectivity and the genetic basis of the host-parasite interaction (Criscione, Poulin, & Blouin, 2005; Nadler, 1995; Thompson, 1994).

Parasites with complex life cycles, including Platyhelminthes (e.g. trematodes, some species of cestodes) or Nematodes, rely on multiple hosts in their life cycle. The interconnectedness and admixture of parasite populations are often mediated through interaction with the different host populations. The presence of multiple hosts in the life cycle, combined with extensive mobility of at least one of the host species, can greatly facilitate gene flow between isolated parasite populations. For instance, a population genetic structure reflecting isolation by distance that was observed for an aquatic snail, Biomphalaria glabrata, on the island of Guadeloupe was not observed for its trematode parasite, Schistosoma mansoni, owing to its dispersal by the definitive host (Rattus rattus) (Prugnolle et al., 2005). However, multiple hosts do not always mediate the exchange of genetic material between populations. Asexual amplification, which Platyhelminthes or Nematode parasites undergo in the intermediate hosts, often results in their aggregation into clusters and thus in an increased probability of inbreeding in the definitive host. Such a process can result in a reduction in diversity and heterozygosity, a so-called Wahlund effect. The phenomenon is often exacerbated by the presence of multiple paratenic hosts, which are not necessary for development but serve to maintain the parasite’s life cycle. Examples have been observed in Fascioloides magna (Lydeard, Mulvey, Aho, & Kennedy, 1989), Schistosoma mansoni (Sire, Durand, Pointier, & Théron, 2001) and Lecithochirium fusiforme (Vilas, Paniagua, & Sanmartín, 2003).
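The heterozygosity deficit associated with the Wahlund effect can be sketched numerically in its textbook form. The sketch assumes one biallelic locus and equally sized subpopulations (for example, parasite clusters), each in Hardy-Weinberg proportions; the function name and the allele frequencies are illustrative and are not data from the thesis. Under these assumptions, the pooled expected heterozygosity exceeds the mean within-subpopulation heterozygosity by twice the variance in allele frequency among subpopulations.

```python
def wahlund_summary(subpop_freqs):
    """Return (mean within-subpopulation heterozygosity,
    pooled expected heterozygosity,
    variance of allele frequency among subpopulations).

    Assumes a biallelic locus and equally sized subpopulations,
    each in Hardy-Weinberg proportions (illustrative assumptions).
    """
    n = len(subpop_freqs)
    mean_p = sum(subpop_freqs) / n
    # Heterozygosity actually present, averaged over subpopulations.
    h_within = sum(2 * p * (1 - p) for p in subpop_freqs) / n
    # Heterozygosity expected if the pooled sample were one random-mating unit.
    h_pooled = 2 * mean_p * (1 - mean_p)
    variance = sum((p - mean_p) ** 2 for p in subpop_freqs) / n
    return h_within, h_pooled, variance


h_within, h_pooled, variance = wahlund_summary([0.2, 0.8])
deficit = h_pooled - h_within
# The deficit equals twice the among-subpopulation variance in allele frequency.
print(round(h_within, 2), round(h_pooled, 2), round(deficit, 2), round(2 * variance, 2))
```

When all subpopulations share the same allele frequency the variance is zero and no deficit is expected, which is why the effect signals hidden substructure in a pooled sample.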

Platyhelminthes and Nematode parasites often alternate between sexual and asexual reproduction in their life cycle (Galaktionov & Dobrovolskij, 2003). Recombination associated with sexual reproduction increases genetic diversity, which increases the chances of evolutionary change and of more rapid adaptation to resistant hosts. The rate of adaptation of the parasite to its host (most often the intermediate host) depends greatly on the genetics underlying the antagonistic relationship. It is thought to be defined on a continuum between two major models: the gene-for-gene model and the matching alleles model (Agrawal & Lively, 2002). The former represents a scenario in which the parasite and the host can have a very broad range of genotypes they can infect or resist, respectively, a sort of “universal” infectivity and resistance; the latter describes a close match between the two, meaning that each can infect or resist only one specific genotype. The matching alleles model has played a central role in the modelling and theory of host-parasite coevolution (Agrawal, 2009; Hamilton, 1993; Hamilton, Axelrod, & Tanese, 1990; Lively, 2010). Characteristics of the matching alleles model often imply negative frequency-dependent selection (Agrawal & Lively, 2002). The gene-for-gene model was developed based on data from plant-pathogen interactions (Flor, 1956; Frank, 1992; Keen, 1990), whereas the matching alleles model has been based on research on invertebrates (Agrawal & Lively, 2002; Grosberg & Hart, 2000; Luijckx, Ben‐Ami, Mouton, Du Pasquier, & Ebert, 2011).
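
The contrast between the two infection-genetics models can be illustrated with a minimal sketch. It assumes the simplest textbook form of each model (one haploid locus with two alleles in both host and parasite); the genotype labels and function names are hypothetical and are not taken from the cited studies.

```python
from itertools import product


def matching_alleles_infects(parasite: str, host: str) -> bool:
    """Matching alleles: infection succeeds only on the matching host genotype."""
    return parasite == host


def gene_for_gene_infects(parasite: str, host: str) -> bool:
    """Gene-for-gene: the virulent parasite 'V' infects every host, whereas the
    avirulent parasite 'v' is stopped by the resistant host 'R'."""
    return parasite == "V" or host == "s"


def infection_matrix(rule, parasites, hosts):
    """Tabulate which parasite genotype can infect which host genotype."""
    return {(p, h): rule(p, h) for p, h in product(parasites, hosts)}


# Hypothetical genotype labels, chosen only for illustration.
maa = infection_matrix(matching_alleles_infects, ["A", "B"], ["A", "B"])
gfg = infection_matrix(gene_for_gene_infects, ["V", "v"], ["R", "s"])

for name, matrix in (("matching alleles", maa), ("gene-for-gene", gfg)):
    print(name)
    for (parasite, host), infects in matrix.items():
        outcome = "infects" if infects else "is resisted"
        print(f"  parasite {parasite} on host {host}: {outcome}")
```

Under matching alleles no parasite genotype can infect all hosts, so a rare host genotype escapes the common parasite, which is the basis of negative frequency-dependent selection. Under gene-for-gene one parasite genotype is universally infective.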

The similarities observed in life history traits and population dynamics between closely related species indicate that studying microevolutionary processes in one parasitic species can be informative about the evolution of others.

Studying macro- and microevolutionary processes: methods

Macroevolution and microevolution are both studied through comparative methods. The former involves the comparison of species, the latter the comparison of individuals and populations. Here I provide an overview of the methods used in my Thesis to study macroevolutionary and microevolutionary processes.

The understanding of macroevolution previously relied heavily on the fossil record but has been accelerated by the rapidly expanding tools of comparative genomics, which are based on the inference of homology between species. The study of gene homology elucidates species and gene relatedness and thus the protein domains, genomic architecture, intracellular structures and signaling pathways that are conserved within, or that differ between, species in taxonomically related groups (Rubin et al., 2000). The closer the relatedness between species, the higher the degree of conservation; in other words, related species carry more orthologous genes, that is, genes that can be traced back to the same gene in the last common ancestral genome (Glover et al., 2019). Orthologs are pairs of genes that have originated via speciation, unlike paralogs, which have originated through duplication (Glover et al., 2019). Orthologs are of interest for studying species relatedness through phylogenies and for the extrapolation of gene functions, as they are assumed to have retained equivalent functions in different organisms (Fitch, 1970; Gabaldón, 2008). Paralogs, on the other hand, are commonly used for studying functional innovation (Glover et al., 2019).
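
The distinction between orthologs and paralogs can be made concrete with a small sketch. It assumes a toy gene tree whose internal nodes are already labelled as speciation or duplication events; the tree, the gene names and the function names are hypothetical. Two genes are then orthologs if their most recent common ancestor node is a speciation, and paralogs if it is a duplication.

```python
# Hypothetical toy gene tree: a leaf is a gene name (str), an internal node is
# a tuple (event, left_subtree, right_subtree).
tree = (
    "speciation",
    ("duplication", "speciesA_gene1a", "speciesA_gene1b"),
    "speciesB_gene1",
)


def leaves(node):
    """Return all gene names found below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, relations=None):
    """Label every gene pair by the event at its most recent common ancestor."""
    if relations is None:
        relations = {}
    if isinstance(node, str):
        return relations
    event, left, right = node
    label = "orthologs" if event == "speciation" else "paralogs"
    # Genes on opposite sides of this node have this node as their common ancestor.
    for a in leaves(left):
        for b in leaves(right):
            relations[frozenset((a, b))] = label
    classify_pairs(left, relations)
    classify_pairs(right, relations)
    return relations


pairs = classify_pairs(tree)
for pair, label in pairs.items():
    print(" / ".join(sorted(pair)), "->", label)
```

In this toy tree the duplication happened after the speciation, so both copies in species A are orthologous to the single gene in species B and paralogous to each other.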

Orthology inference is done using either graph-based or tree-based approaches. Tree-based methods rely on comparing gene trees with species trees and deriving all pairs of orthologous and paralogous genes (Glover et al., 2019; Kristensen, Wolf, Mushegian, & Koonin, 2011). Graph-based methods rely on graphs in which proteins or genes are the nodes and the edges represent evolutionary relationships between them (Glover et al., 2019). Orthologs are inferred through pairwise gene comparisons; by definition, they have branched out at the latest possible time (Altenhoff, Gil, Gonnet, & Dessimoz, 2013; Glover et al., 2019). A graph-based approach is implemented in OMA (“Orthology MAtrix” software) (Train, Glover, Gonnet, Altenhoff, & Dessimoz, 2017). The software uses Smith-Waterman dynamic programming for pairwise alignments, and orthologs are inferred through the Stable Pair approach, in which the two genes of a pair are more closely related to each other than to any other sequence (Roth, Gonnet, & Dessimoz, 2008). The relatedness of sequences is measured in PAM units (percent accepted mutations), where 1 PAM corresponds to the amount of evolution over which one substitution per 100 sites is expected to occur (Roth et al., 2008). The final goal of the algorithm in OMA is to cluster orthologs into Orthologous Groups and Hierarchical Orthologous Groups. Orthologous Groups are groups in which every pair of sequences is orthologous, and they are established by finding completely connected subgraphs in the graph. Hierarchical Orthologous Groups are defined as sets of genes that have descended from a single common ancestral gene within a specific taxonomic range of interest (Train et al., 2017). OMA was used in Chapter 1 to identify genes that have arisen through duplication, that were retained in 1:1 copy, or that have been lost or gained at different stages of the phylogeny of Platyhelminthes.
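
The core idea of pairing genes that are mutually closest can be sketched as follows. This is a deliberately simplified illustration and not the OMA implementation, which is considerably more elaborate; the distance values, the gene names and the function name are hypothetical, and a complete distance table between the two genomes is assumed.

```python
def mutually_closest_pairs(dist):
    """Return gene pairs (x, y) such that y is the closest gene to x and x is
    the closest gene to y.

    dist maps (gene_in_genome_X, gene_in_genome_Y) to an evolutionary distance,
    for example in PAM units, and must contain every X-Y combination.
    """
    genes_x = sorted({x for x, _ in dist})
    genes_y = sorted({y for _, y in dist})
    best_for_x = {x: min(genes_y, key=lambda y: dist[(x, y)]) for x in genes_x}
    best_for_y = {y: min(genes_x, key=lambda x: dist[(x, y)]) for y in genes_y}
    return sorted((x, y) for x, y in best_for_x.items() if best_for_y[y] == x)


# Hypothetical pairwise distances between two genes of genome X and three of genome Y.
toy_distances = {
    ("x1", "y1"): 10.0, ("x1", "y2"): 60.0, ("x1", "y3"): 40.0,
    ("x2", "y1"): 55.0, ("x2", "y2"): 12.0, ("x2", "y3"): 35.0,
}

stable_pairs = mutually_closest_pairs(toy_distances)
print(stable_pairs)
```

In this toy input the gene y3 is closest to x2, but x2 is closer to y2, so y3 remains unpaired; it would be a candidate paralog or a gene without an ortholog in genome X.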

Microevolution in natural populations has mostly been studied through experimental evolution or through population genetic studies applying genetic markers, such as isozymes, microsatellites, SNPs (Single Nucleotide Polymorphisms) or AFLPs (Amplified Fragment Length Polymorphisms) (Criscione et al., 2005). Different parts of the genome are differentially affected by the different microevolutionary forces. For instance, studies on humans and non-model species estimate that only about 5-10% of the genome is affected by selection (Andolfatto & Przeworski, 2000; Nielsen et al., 2005; Nosil, Funk, & Ortiz-Barrientos, 2009; Stinchcombe & Hoekstra, 2008). The odds that a genetic marker falls within a region under selection, or coincides with an adaptive mutation, are therefore low (Bonin, 2008). Different types of selection pressure also leave distinct signatures on the genomes in a population over time (Nielsen et al., 2007). Balancing selection, which maintains diversity, leads to more even allele frequencies across populations and results in an excess of genetic diversity in the vicinity of a selected locus (Excoffier, Hofer, & Foll, 2009). On the other hand, positive divergent selection causes large differences in allele frequencies between populations and results in a reduced level of diversity in and around a locus, an altered allele frequency spectrum and a locally increased extent of linkage disequilibrium within a population (Excoffier et al., 2009). Strong divergent selection between populations can also reduce immigration locally in the vicinity of a selected locus (Beaumont & Balding, 2004). Whole genome sequencing has improved genome-wide estimates of divergence between populations, has made it possible to distinguish genome-wide patterns from locus-specific effects, and has allowed a more comprehensive study of the genomic effects of different types of selection (Crellen et al., 2016; Stinchcombe & Hoekstra, 2008). The majority of whole genome resequencing studies focus on two approaches to detect divergence between populations: identifying loci with unusually high genetic differentiation (Deitz et al., 2016; Karlsen et al., 2013; Kurland et al., 2019) and searching for correlations between allele frequencies and the environment (Dennenmoser, Vamosi, Nolte, & Rogers, 2017; Fischer et al., 2013; Guggisberg et al., 2018; Sailer et al., 2018). Both methods are based on the assumption that alleles involved in local adaptation should occur at a higher frequency because they increase fitness, and should show greater differentiation if the populations differ in their selective pressures (Hoban et al., 2016; Lewontin & Krakauer, 1973). The methods require uniform sampling across the range of interest, accounting for demography between populations, using a high-quality reference genome, sequencing a high number of individuals at a reasonable sequencing depth, accounting for linkage disequilibrium and measuring the ecological landscape that affects fitness (Hoban et al., 2016). A novel method that aims to optimize several of these parameters is whole genome resequencing of pools of individuals (Pool-Seq).
The data consist of pooled DNA samples, with no individual tagging, providing genome-wide polymorphism data at a considerably lower cost and thus from a larger proportion of each population (Hivert, Leblois, Petit, Gautier, & Vitalis, 2018). Chapters 3 and 4 focus on comparing populations of Atriophallophorus winterbourni at different spatial scales using the Pool-Seq approach.

Selection on genomic regions can be studied as a feature of both macro- and microevolution. Different methods are used to detect deviations from neutrality of particular genomic features over long or short evolutionary time scales. Table 1 lists some of the methods for detecting possible signatures of selection within and between species that are used in this Thesis.
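
Two of the within-species statistics listed in Table 1 can be sketched in a few lines. The first function follows the standard formula of Tajima (1989) for individually sequenced haplotypes; the second computes a simple heterozygosity-based FST for one biallelic SNP from the allele frequencies of two pools with equal weights. Both are illustrations under stated assumptions: the function names and toy inputs are hypothetical, and neither function applies the corrections for pool size and read depth that dedicated Pool-Seq estimators use.

```python
from collections import Counter
from math import comb, sqrt


def tajimas_d(haplotypes):
    """Tajima's D for equal-length, aligned haplotype strings."""
    n = len(haplotypes)
    columns = [Counter(column) for column in zip(*haplotypes)]
    segregating = [c for c in columns if len(c) > 1]
    s = len(segregating)
    if n < 4 or s == 0:
        # The variance term is zero for n < 4, and D is undefined without variation.
        raise ValueError("need at least 4 sequences and 1 segregating site")
    # pi: mean number of pairwise differences, summed over segregating sites.
    pi = sum((n * n - sum(v * v for v in c.values())) / 2 for c in segregating) / comb(n, 2)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    theta_w = s / a1  # Watterson's estimator
    return (pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))


def fst_two_pools(p1, p2):
    """Heterozygosity-based FST, (HT - HS) / HT, for one biallelic SNP."""
    p_mean = (p1 + p2) / 2
    h_total = 2 * p_mean * (1 - p_mean)
    if h_total == 0:
        return 0.0  # the site is monomorphic across both pools
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


# Hypothetical toy data: only rare variants versus only intermediate-frequency variants.
d_rare = tajimas_d(["AAAA", "AAAT", "AATA", "ATAA"])
d_intermediate = tajimas_d(["AA", "AA", "TT", "TT"])
print(d_rare, d_intermediate, fst_two_pools(0.1, 0.9))
```

The toy sample containing only singletons gives a negative D (excess of rare alleles), whereas the sample with variants at intermediate frequency gives a positive D, matching the interpretation given in Table 1.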


Table 1. Approaches for detecting signatures of selection (i.e. deviations from neutrality) within and between species.

Statistic: Description

Tests based on polymorphisms within a species

Tajima's D: a statistic that measures the difference between two estimates of the population mutation rate: the nucleotide diversity (defined as the mean number of pairwise differences, Tajima's Pi, π) and the estimate based on the number of segregating sites (Watterson's Theta, θ) (Tajima, 1989). Under neutrality the two estimates should have approximately the same value and thus Tajima's D should be close to 0. Positive or negative values of Tajima's D indicate deviations from neutrality. A positive value indicates π > θ, meaning a lack of rare alleles, which suggests balancing selection or a sudden population contraction. A negative value indicates π < θ, a possible excess of rare alleles, which suggests a recent selective sweep or a population expansion after a recent bottleneck (Biswas & Akey, 2006; Tajima, 1989).

FST: the fixation index; a test statistic that quantifies differentiation between subpopulations (Weir & Cockerham, 1984). Several measures have been proposed, based on allele frequencies, on heterozygosity within and between subpopulations (Biswas & Akey, 2006) or on the probability of identity by descent (Huerta-Sanchez, Durrett, & Bustamante, 2008). Under neutrality, FST levels are largely determined by genetic drift and migration, but local adaptation and selection can accentuate the levels of population differentiation at particular loci (Biswas & Akey, 2006).

Tests for selection between species

dN/dS: the ratio of non-synonymous to synonymous substitution rates; under neutrality the expected value of the ratio is 1. For protein coding loci subject to functional constraints or to purifying selection the ratio is expected to be less than 1. dN/dS > 1 indicates positive selection (Biswas & Akey, 2006; Nei & Gojobori, 1986; Yang & Nielsen, 1998).

Study systems

The main focus of the thesis is on Atriophallophorus winterbourni (Blasco-Costa et al., 2019), a digenean, microphallid trematode previously known as Microphallus sp. or Microphallus livelyi. It is native to the lakes of New Zealand. It belongs to the order Plagiorchiida and the suborder Xiphidiata (WoRMS, 2020).


The digenean trematodes, which encompass both the Plagiorchiida and Diplostomida orders, consist entirely of parasites and belong to the phylum of flatworms (Platyhelminthes) (Figure 1) (Combes & Simberloff, 2005). They have most likely evolved from turbellarians and were initially solely parasites of molluscs (Galaktionov & Dobrovolskij, 2003). The complex life cycles seen in digenean trematodes are thought to have evolved together with the evolution and radiation of vertebrates (Galaktionov & Dobrovolskij, 2003). A subgroup of digenean trematodes have complex life cycles that include alternation between 2 or 3 hosts, with a molluscan species always as an intermediate host and a vertebrate as the final host (Galaktionov & Dobrovolskij, 2003).

Figure 1. Phylogenetic tree of the class Trematoda from the phylum Platyhelminthes. The tree contains species from the Diplostomida and Plagiorchiida orders. The families to which the species belong and the bootstrap support are indicated on the branches. For details on how the phylogenetic tree was constructed see Chapter 1.

In its life cycle A. winterbourni alternates between two hosts: a prosobranch, dioecious, hydrobiid snail, Potamopyrgus antipodarum (Gray, 1843), native to the freshwater habitats of New Zealand (Warwick, 1952; Winterbourn, 1970), and waterfowl (Lively & McKenzie, 1991). The snail populations have often been found to be a mixture of dioecious sexual and clonal individuals (Warwick, 1952; Winterbourn, 1970), and the snail has therefore been widely used to test hypotheses for the maintenance of sexual reproduction (Dybdahl & Lively, 1995; Jokela, Lively, Dybdahl, & Fox, 1997; Lively, 1987; Lively, 1992). The snail is known to be infected by more than 14 parasite species, but A. winterbourni is the most abundant (Winterbourn, 1974).

The eggs of A. winterbourni are ingested by the snail from submerged surfaces (Figure 2). The asexual metacercarial stage develops in the gonad of the snail, which is eventually castrated (Winterbourn, 1974). The metacercariae become infective within 120 days under laboratory conditions (Dybdahl & Lively, 1995; Levri & Lively, 1996). The metacercariae hatch into adult worms upon ingestion by waterfowl and mature in the gut within a few days. The adult worms reproduce sexually, producing eggs that are released in the faeces of dabbling ducks (Levri & Lively, 1996; Lively & McKenzie, 1991). A. winterbourni is hermaphroditic, just like the most closely related species from the Opisthorchiidae family, but is known to have an abbreviated life cycle in comparison to other digenean trematodes. It lacks the daughter sporocyst parthenitae-bearing germ cells (or at least these have a very reduced life span), rediae, cercariae and possibly miracidia (Blasco-Costa et al., 2019).

Figure 2. Life cycle of the trematode parasite, Atriophallophorus winterbourni.

In order to elucidate characteristics of the A. winterbourni genome involved in adaptation to parasitism as a species, we compared it to 13 other digenean trematodes: 5 species from the Plagiorchiida order, including 3 species from the family Opisthorchiidae (Clonorchis sinensis, Opisthorchis viverrini and Opisthorchis felineus), Fasciola hepatica and Echinostoma caproni, and 8 species from the Diplostomida order, more specifically from the family Schistosomatidae (7 species of schistosomes and Trichobilharzia regenti) (Figure 1) (WoRMS, 2020). Opisthorchiidae species alternate between three hosts in their life cycle: a snail species from the genus Bithynia, a freshwater fish species and a vertebrate species (King & Scholz, 2001). They are monoecious: each fluke has a complete set of male and female reproductive organs (King & Scholz, 2001). They are most often found in Southeast Asia and they parasitize humans, causing opisthorchiasis. Fasciola hepatica, a liver fluke, and Echinostoma caproni, an intestinal fluke, are also hermaphroditic but alternate between two hosts in their life cycle, a snail from the Lymnaeidae family and a vertebrate definitive host, including humans (Galaktionov & Dobrovolskij, 2003). Both of them have a global distribution. Schistosomes are also distributed throughout the world but most often cause severe human disease (schistosomiasis) on the African continent and in Southeast Asia. They are dioecious; separate male and female worms exist (Galaktionov & Dobrovolskij, 2003). Transmission to the mammalian definitive hosts is via freshwater gastropod (Gastropoda) snails, e.g., the amphibious snail, Oncomelania hupensis, which transmits S. japonicum, or Biomphalaria glabrata and Bulinus sp., which transmit S. mansoni and S. haematobium, respectively, to humans (Martin, 1999). Adult flatworms parasitise the blood capillaries of either the mesenteries or the plexus of the bladder (Martin, 1999). Trichobilharzia regenti, from the same family, despite sharing many characteristics, is a neuropathogenic parasite of anatid birds (e.g. Anas platyrhynchos), alternating in its life cycle between aquatic birds as the final hosts and aquatic snails, Radix sp., as the intermediate hosts (Horák, Kolářová, & Dvořák, 1998).

Summary of chapters

The aim of the thesis was to investigate the genomic signature of macro- and microevolutionary processes that have shaped the genome of the parasite Atriophallophorus winterbourni. As the first part of the thesis, we performed a de novo assembly of the reference genome of A. winterbourni, which was necessary to address all the research questions in this thesis. The results are reported in Chapter 1. The reference genome is available through NCBI (https://www.ncbi.nlm.nih.gov/nuccore/JACCGJ000000000) and will soon be available at WormBase ParaSite.


In Chapter 1, through the reconstruction of 3 ancestral genomes in the trematode phylogeny, we investigated gene families that might have contributed to the adaptation of A. winterbourni to parasitism and investigated their functions in silico. Additionally, we inferred a robust phylogeny of 18 Platyhelminthes and generated a time tree with divergence times of the trematode speciation events. The results indicated that the split of A. winterbourni from the Opisthorchiata suborder occurred approximately 237.4 MYA. We found that out of 11,499 genes, 24% have arisen through duplication events since then and 31.9% have been newly acquired. We found 13 gene families in A. winterbourni with over 10 genes arising through recent duplication, all of which have functions potentially relating to host behavioural manipulation, host tissue penetration, and hiding from host immunity through antigen presentation. We studied the two most expanded gene families and within them identified genes evolving under positive selection. The chapter has been accepted for publication in Genome Biology and Evolution.

In Chapter 2 we investigated the structure of genes that arose through duplication, that have been gained and that have been retained in 1:1 copy in 13 extant trematode species, including A. winterbourni, since the ancestral trematode genome. The results indicated that gene families coding for shorter proteins tend to evolve through duplication more often, and that a duplicated copy of a gene tends to be shorter and to have fewer exons than the gene it originated from. Our observations aligned with results found in free-living organisms. This chapter is in a submission-ready form.

In Chapter 3 we turned our focus to microevolutionary processes shaping the A. winterbourni genome. We investigated the population genetic structure of A. winterbourni across the South Island of New Zealand, previously only studied with genetic markers. Additionally, we examined the genomic footprint of divergence between the north-west and south-east populations and the extent of gene flow between them. We found a population structure that reflects post-glacial recolonization of north-west lakes from the northern refugia and of south-east lakes from the southern refugia after the Last Glacial Maximum, and a somewhat restricted gene flow between the two regions due to the Southern Alps mountain range. We found 6 genes differentiated between lakes on either side of the Southern Alps, most of them possibly involved in an extracellular vesicle biogenesis pathway. In other trematodes this pathway has been found to play a role in parasite migration through the host tissue and in countering the attack of the host immune system. The chapter is in a submission-ready form.

In Chapter 4 we examined the population genetic structure of A. winterbourni at a finer spatial scale, within one lake in New Zealand, Lake Alexandrina. We searched for a genomic signature of divergence between 6 highly interconnected parts of the lake, encompassing all the lake banks, to address the extent of the geographic mosaic of coevolution. Additionally, we studied the population structure of the snails from which the parasites were extracted, using SNP markers. We found a high level of polymorphism in the parasite populations in the lake but no clear signature of divergence between any of the sites. The results potentially indicated that the whole lake population is undergoing population expansion after a recent decrease in genetic diversity. This result corresponded with the high diversity of the infected snail genotypes assessed with 20 SNP markers. The chapter discusses the results in the context of negative frequency-dependent dynamics in a host-parasite system.

References

Abad, P., Gouzy, J., Aury, J.-M., Castagnone-Sereno, P., Danchin, E. G., Deleury, E., . . . Blok, V. C. (2008). Genome sequence of the metazoan plant-parasitic nematode Meloidogyne incognita. Nature biotechnology, 26(8), 909-915.
Agrawal, A., & Lively, C. M. (2002). Infection genetics: gene-for-gene versus matching-alleles models and all points in between. Evolutionary Ecology Research, 4, 79-90.
Agrawal, A. F. (2009). Differences between selection on sex versus recombination in red queen models with diploid hosts. Evolution: International Journal of Organic Evolution, 63(8), 2131-2141.
Altenhoff, A. M., Gil, M., Gonnet, G. H., & Dessimoz, C. (2013). Inferring hierarchical orthologous groups from orthologous gene pairs. PloS one, 8(1).
Andolfatto, P., & Przeworski, M. (2000). A Genome-Wide Departure From the Standard Neutral Model in Natural Populations of Drosophila. Genetics, 156(1), 257-268. Retrieved from https://www.genetics.org/content/genetics/156/1/257.full.pdf
Bateson, W. (1894). Materials for the study of variation: treated with especial regard to discontinuity in the origin of species: Macmillan.
Beaumont, M. A., & Balding, D. J. (2004). Identifying adaptive genetic divergence among populations from genome scans. Molecular ecology, 13(4), 969-980.
Berg, L. S. (1969). Nomogenesis: Or, Evolution Determined by Law: M.I.T. Press.
Biswas, S., & Akey, J. M. (2006). Genomic insights into positive selection. Trends in Genetics, 22(8), 437-446. doi:https://doi.org/10.1016/j.tig.2006.06.005
Blasco-Costa, I., Seppälä, K., Feijen, F., Zajac, N., Klappert, K., & Jokela, J. (2019). A new species of Atriophallophorus Deblock & Rosé, 1964 (Trematoda: Microphallidae) described from in vitro-grown adults and metacercariae from Potamopyrgus antipodarum (Gray, 1843) (Mollusca: Tateidae). Journal of helminthology, 94, e108-e108.


Blaxter, M. L., De Ley, P., Garey, J. R., Liu, L. X., Scheldeman, P., Vierstraete, A., . . . Frisse, L. M. (1998). A molecular evolutionary framework for the phylum Nematoda. Nature, 392(6671), 71-75.
Bonin, A. (2008). Population genomics: a new generation of genome scans to bridge the gap with functional genomics. Molecular ecology, 17(16), 3583-3584. doi:10.1111/j.1365-294X.2008.03854.x
Brooks, D. R. (1988). Macroevolutionary comparisons of host and parasite phylogenies. Annual Review of Ecology and Systematics, 19(1), 235-259.
Carius, H. J., Little, T. J., & Ebert, D. (2001). Genetic variation in a host‐parasite association: potential for coevolution and frequency‐dependent selection. Evolution, 55(6), 1136-1145.
Chan, Y. F., Marks, M. E., Jones, F. C., Villarreal, G., Shapiro, M. D., Brady, S. D., . . . Kingsley, D. M. (2010). Adaptive Evolution of Pelvic Reduction in Sticklebacks by Recurrent Deletion of a Pitx1 Enhancer. Science, 327(5963), 302-305. doi:10.1126/science.1182213
Clark, A. G. (2001). The Search for Meaning in Noncoding DNA. Genome Research, 11(8), 1319-1320. doi:10.1101/gr.201601
Combes, C., & Simberloff, D. (2005). The Art of Being a Parasite: University of Chicago Press.
Corradi, N. (2015). Microsporidia: eukaryotic intracellular parasites shaped by gene loss and horizontal gene transfers. Annual review of microbiology, 69, 167-183.
Crellen, T., Allan, F., David, S., Durrant, C., Huckvale, T., Holroyd, N., . . . Berriman, M. (2016). Whole genome resequencing of the human parasite Schistosoma mansoni reveals population history and effects of selection. Scientific reports, 6, 20954.
Criscione, C. D., Poulin, R., & Blouin, M. S. (2005). Molecular ecology of parasites: elucidating ecological and microevolutionary processes. Molecular ecology, 14(8), 2247-2257.
De Vries, H. (1922). Age and area and the mutation theory. A Study in Geographical Distribution and Origin of Species (ed. Willis, JC). Cambridge University Press, Cambridge, 222-227.
Deitz, K. C., Athrey, G. A., Jawara, M., Overgaard, H. J., Matias, A., & Slotman, M. A. (2016). Genome-wide divergence in the West-African malaria vector Anopheles melas. G3: Genes, Genomes, Genetics, 6(9), 2867-2879.
Dennenmoser, S., Vamosi, S. M., Nolte, A. W., & Rogers, S. M. (2017). Adaptive genomic divergence under high gene flow between freshwater and brackish-water ecotypes of prickly sculpin (Cottus asper) revealed by Pool-Seq. Molecular ecology, 26(1), 25-42. doi:10.1111/mec.13805
Depew, D. J. (2017). Darwinism in the twentieth century: Productive encounters with saltation, acquired characteristics, and development. In The Darwinian Tradition in Context (pp. 61-88): Springer.
Dobzhansky, T. G. (1937). Genetics and the origin of species.
Droma, Y., Hanaoka, M., Basnyat, B., Arjyal, A., Neupane, P., Pandit, A., . . . Katsuyama, Y. (2008). Adaptation to high altitude in Sherpas: association with the insertion/deletion polymorphism in the Angiotensin-converting enzyme gene. Wilderness & environmental medicine, 19(1), 22-29.
Dybdahl, M. F., & Lively, C. M. (1995). Host–parasite interactions: infection of common clones in natural populations of a freshwater snail (Potamopyrgus antipodarum). Proceedings of the Royal Society of London. Series B: Biological Sciences, 260(1357), 99-103.


Egger, B., Bachmann, L., & Fromm, B. (2017). Atp8 is in the ground pattern of flatworm mitochondrial genomes. BMC Genomics, 18(1), 414. doi:10.1186/s12864-017-3807-2
Eizirik, E., Yuhki, N., Johnson, W. E., Menotti-Raymond, M., Hannah, S. S., & O'Brien, S. J. (2003). Molecular Genetics and Evolution of Melanism in the Cat Family. Current Biology, 13(5), 448-453. doi:https://doi.org/10.1016/S0960-9822(03)00128-3
Erwin, D. H. (2010). Microevolution and macroevolution are not governed by the same processes. Contemporary debates in philosophy of biology. Wiley-Blackwell, Chichester, UK, 180-193.
Excoffier, L., Hofer, T., & Foll, M. (2009). Detecting loci under selection in a hierarchically structured population. Heredity, 103(4), 285-298.
Fischer, M. C., Rellstab, C., Tedder, A., Zoller, S., Gugerli, F., Shimizu, K. K., . . . Widmer, A. (2013). Population genomic footprints of selection and associations with climate in natural populations of Arabidopsis halleri from the Alps. Molecular ecology, 22(22), 5594-5607. doi:10.1111/mec.12521
Fisher, R. (1930). The genetical theory of natural selection.
Fitch, W. M. (1970). Distinguishing homologous from analogous proteins. Systematic zoology, 19(2), 99-113.
Flor, H. H. (1956). The Complementary Genic Systems in Flax and Flax Rust. In M. Demerec (Ed.), Advances in Genetics (Vol. 8, pp. 29-54): Academic Press.
Force, A., Lynch, M., Pickett, F. B., Amores, A., Yan, Y.-l., & Postlethwait, J. (1999). Preservation of duplicate genes by complementary, degenerative mutations. Genetics, 151(4), 1531-1545.
Fouet, C., Gray, E., Besansky, N. J., & Costantini, C. (2012). Adaptation to aridity in the malaria mosquito Anopheles gambiae: chromosomal inversion polymorphism and body size influence resistance to desiccation. PloS one, 7(4), e34841-e34841. doi:10.1371/journal.pone.0034841
Frank, S. A. (1991). Ecological and genetic models of host-pathogen coevolution. Heredity, 67(1), 73-83.
Frank, S. A. (1992). Models of plant-pathogen coevolution. Trends in Genetics, 8(6), 213-219. doi:https://doi.org/10.1016/0168-9525(92)90236-W
Frank, S. A. (1993). Coevolutionary genetics of plants and pathogens. Evolutionary Ecology, 7(1), 45-75.
Gabaldón, T. (2008). Large-scale assignment of orthology: back to phylogenetics? Genome biology, 9(10), 235. doi:10.1186/gb-2008-9-10-235
Galaktionov, K., & Dobrovolskij, A. (2003). The Biology and Evolution of Trematodes.
Glover, N., Dessimoz, C., Ebersberger, I., Forslund, S. K., Gabaldón, T., Huerta-Cepas, J., . . . Consortium, Q. f. O. (2019). Advances and Applications in the Quest for Orthologs. Molecular Biology and Evolution, 36(10), 2157-2164. doi:10.1093/molbev/msz150
Goldschmidt, R. (1940). The Material Basis of Evolution: Yale University Press.
Gould, S. J. (1977). The return of hopeful monsters. Natural history, 86(6), 22-30.
Gould, S. J. (1980). GG Simpson, paleontology, and the modern synthesis: na.
Grantham, T. (2007). Is macroevolution more than successive rounds of microevolution? Palaeontology, 50(1), 75-85.


Gregory, T. R. (2005). CHAPTER 11 - Macroevolution and the Genome. In T. R. Gregory (Ed.), The Evolution of the Genome (pp. 679-729). Burlington: Academic Press.
Grosberg, R. K., & Hart, M. W. (2000). Mate selection and the evolution of highly polymorphic self/nonself recognition genes. Science, 289(5487), 2111-2114.
Guggisberg, A., Liu, X., Suter, L., Mansion, G., Fischer, M. C., Fior, S., . . . Widmer, A. (2018). The genomic basis of adaptation to calcareous and siliceous soils in Arabidopsis lyrata. Molecular ecology, 27(24), 5088-5103. doi:10.1111/mec.14930
Haldane, J. (1932). The causes of evolution.
Hamilton, W. D. (1993). Haploid dynamic polymorphism in a host with matching parasites: effects of mutation/subdivision, linkage, and patterns of selection. Journal of Heredity, 84(5), 328-338.
Hamilton, W. D., Axelrod, R., & Tanese, R. (1990). Sexual reproduction as an adaptation to resist parasites (a review). Proceedings of the National Academy of Sciences, 87(9), 3566-3573.
Hansen, T. F. (2006). The evolution of genetic architecture. Annu. Rev. Ecol. Evol. Syst., 37, 123-157.
Hendry, A. P., & Kinnison, M. T. (2001). An introduction to microevolution: rate, pattern, process. Genetica, 112(1), 1-8. doi:10.1023/A:1013368628607
Hivert, V., Leblois, R., Petit, E. J., Gautier, M., & Vitalis, R. (2018). Measuring genetic differentiation from pool-seq data. Genetics, 210(1), 315-330.
Hoban, S., Kelley, J. L., Lotterhos, K. E., Antolin, M. F., Bradburd, G., Lowry, D. B., . . . Whitlock, M. C. (2016). Finding the genomic basis of local adaptation: pitfalls, practical solutions, and future directions. The American Naturalist, 188(4), 379-397.
Horák, P., Kolářová, L., & Dvořák, J. (1998). Trichobilharzia regenti n. sp. (Schistosomatidae, Bilharziellinae), a new nasal schistosome from Europe. Parasite, 5(4), 349-357. Retrieved from https://doi.org/10.1051/parasite/1998054349
Huerta-Sanchez, E., Durrett, R., & Bustamante, C. D. (2008). Population Genetics of Polymorphism and Divergence Under Fluctuating Selection. Genetics, 178(1), 325-337. doi:10.1534/genetics.107.073361
Hunt, V. L., Tsai, I. J., Coghlan, A., Reid, A. J., Holroyd, N., Foth, B. J., . . . Beasley, H. (2016). The genomic basis of parasitism in the Strongyloides clade of nematodes. Nature genetics, 48(3), 299.
Huyse, T., Poulin, R., & Theron, A. (2005). Speciation in parasites: a population genetics approach. Trends in parasitology, 21(10), 469-475.
Jokela, J., Lively, C. M., Dybdahl, M. F., & Fox, J. A. (1997). Evidence for a cost of sex in the freshwater snail Potamopyrgus antipodarum. Ecology, 78(2), 452-460.
Karlsen, B. O., Klingan, K., Emblem, Å., Jørgensen, T. E., Jueterbock, A., Furmanek, T., . . . Moum, T. (2013). Genomic divergence between the migratory and stationary ecotypes of Atlantic cod. Molecular ecology, 22(20), 5098-5111.
Keen, N. (1990). Gene-for-gene complementarity in plant-pathogen interactions. Annual review of genetics, 24(1), 447-463.
Kim, S.-H., & Bae, Y.-A. (2017). Lineage-specific expansion and loss of tyrosinase genes across platyhelminths and their induction profiles in the carcinogenic oriental liver fluke, Clonorchis sinensis. Parasitology, 144(10), 1316.
King, S., & Scholz, T. (2001). Trematodes of the family Opisthorchiidae: a minireview. The Korean journal of parasitology, 39(3), 209.


Koskella, B., & Lively, C. M. (2009). Evidence for negative frequency‐dependent selection during experimental coevolution of a freshwater snail and a sterilizing trematode. Evolution: International Journal of Organic Evolution, 63(9), 2213-2221.
Kristensen, D. M., Wolf, Y. I., Mushegian, A. R., & Koonin, E. V. (2011). Computational methods for Gene Orthology inference. Briefings in bioinformatics, 12(5), 379-391.
Kurland, S., Wheat, C. W., de la Paz Celorio Mancera, M., Kutschera, V. E., Hill, J., Andersson, A., . . . Laikre, L. (2019). Exploring a Pool‐seq‐only approach for gaining population genomic insights in nonmodel species. Ecology and evolution, 9(19), 11448-11463.
Leung, T. L. F. (2017). Fossils of parasites: what can the fossil record tell us about the evolution of parasitism? Biological Reviews, 92(1), 410-430. doi:10.1111/brv.12238
Levinton, J. (1988). Genetics, Paleontology, and Macroevolution. United States of America: Cambridge University Press.
Levinton, J. S. (2001). Macroevolution: The Problem and the Field. In J. S. Levinton (Ed.), Genetics, Paleontology, and Macroevolution (2 ed., pp. 1-31). Cambridge: Cambridge University Press.
Levri, E. P., & Lively, C. M. (1996). The effects of size, reproductive condition, and parasitism on foraging behaviour in a freshwater snail, Potamopyrgus antipodarum. Animal Behaviour, 51(4), 891-901.
Lewontin, R. C., & Krakauer, J. (1973). Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics, 74(1), 175-195. Retrieved from https://www.genetics.org/content/genetics/74/1/175.full.pdf
Lively, C., & McKenzie, J. (1991). Experimental infection of a freshwater snail, Potamopyrgus antipodarum, with a digenetic trematode, Microphallus sp. New Zealand Natural Sciences, 18(0), 59-62.
Lively, C. M. (1987). Evidence from a New Zealand snail for the maintenance of sex by parasitism. Nature, 328(6130), 519-521.
Lively, C. M. (1992). Parthenogenesis in a freshwater snail: reproductive assurance versus parasitic release. Evolution, 46(4), 907-913.
Lively, C. M. (2010). A review of Red Queen models for the persistence of obligate sexual reproduction. Journal of Heredity, 101(suppl_1), S13-S20.
Luijckx, P., Ben‐Ami, F., Mouton, L., Du Pasquier, L., & Ebert, D. (2011). Cloning of the unculturable parasite Pasteuria ramosa and its Daphnia host reveals extreme genotype–genotype interactions. Ecology Letters, 14(2), 125-131.
Lydeard, C., Mulvey, M., Aho, J. M., & Kennedy, P. K. (1989). Genetic variability among natural populations of the liver fluke Fascioloides magna in white-tailed deer, Odocoileus virginianus. Canadian Journal of Zoology, 67(8), 2021-2025. doi:10.1139/z89-287
Lynch, M., & Conery, J. S. (2000). The evolutionary fate and consequences of duplicate genes. Science, 290(5494), 1151-1155.
Martin, A. P. (1999). Increasing Genomic Complexity by Gene Duplication and the Origin of Vertebrates. The American Naturalist, 154(2), 111-128. doi:10.1086/303231
Maslov, D., & Simpson, L. (1995). Evolution of parasitism in kinetoplastid protozoa. Parasitology Today, 11(1), 30-32.
Mayr, E. (1963). Animal species and evolution.
Nadler, S. A. (1995). Microevolution and the genetic structure of parasite populations. The Journal of parasitology, 395-403.
Nei, M., & Gojobori, T. (1986). Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Molecular Biology and Evolution, 3(5), 418-426.

Nielsen, R., Williamson, S., Kim, Y., Hubisz, M. J., Clark, A. G., & Bustamante, C. (2005). Genomic scans for selective sweeps using SNP data. Genome Research, 15(11), 1566-1575.
Nosil, P., Funk, D. J., & Ortiz-Barrientos, D. (2009). Divergent selection and heterogeneous genomic divergence. Molecular ecology, 18(3), 375-402.
Nuismer, S. L., Thompson, J. N., & Gomulkiewicz, R. (1999). Gene flow and geographically structured coevolution. Proceedings of the Royal Society of London. Series B: Biological Sciences, 266(1419), 605-609.
O’Malley, M. A., Wideman, J. G., & Ruiz-Trillo, I. (2016). Losing complexity: the role of simplification in macroevolution. Trends in ecology & evolution, 31(8), 608-621.
Ohno, S. (2013). Evolution by gene duplication: Springer Science & Business Media.
Philiptschenko, J. (1927). Variabilität und Variation.
Poulin, R. (2011). Chapter 1 - The Many Roads to Parasitism: A Tale of Convergence. In D. Rollinson & S. I. Hay (Eds.), Advances in Parasitology (Vol. 74, pp. 1-40): Academic Press.
Poulin, R., & Randhawa, H. S. (2015). Evolution of parasitism along convergent lines: from ecology to genomics. Parasitology, 142(S1), S6-S15.
Prugnolle, F., Théron, A., Pointier, J. P., Jabbour-Zahab, R., Jarne, P., Durand, P., & Meeûs, T. d. (2005). Dispersal in a parasitic worm and its two hosts: consequence for local adaptation. Evolution, 59(2), 296-303. doi:10.1111/j.0014-3820.2005.tb00990.x
Rane, R. V., Rako, L., Kapun, M., Lee, S. F., & Hoffmann, A. A. (2015). Genomic evidence for role of inversion 3RP of Drosophila melanogaster in facilitating climate change adaptation. Molecular ecology, 24(10), 2423-2432.
Ridley, M. (1993). Evolution: Blackwell Scientific.
Roger, E., Mitta, G., Moné, Y., Bouchut, A., Rognon, A., Grunau, C., . . . Gourbal, B. E. (2008). Molecular determinants of compatibility polymorphism in the Biomphalaria glabrata/Schistosoma mansoni model: new candidates identified by a global comparative proteomics approach. Molecular and biochemical parasitology, 157(2), 205-216.
Rohde, K. (1994). The origins of parasitism in the Platyhelminthes. International Journal for Parasitology, 24(8), 1099-1115.
Roth, A. C., Gonnet, G. H., & Dessimoz, C. (2008). Algorithm of OMA for large-scale orthology inference. BMC bioinformatics, 9(1), 518.
Rubin, G. M., Yandell, M. D., Wortman, J. R., Gabor Miklos, G. L., Nelson, C. R., Hariharan, I. K., . . . Fleischmann, W. (2000). Comparative genomics of the eukaryotes. Science, 287(5461), 2204-2215. doi:10.1126/science.287.5461.2204
Sailer, C., Babst-Kostecka, A., Fischer, M. C., Zoller, S., Widmer, A., Vollenweider, P., . . . Rellstab, C. (2018). Transmembrane transport and stress response genes play an important role in adaptation of Arabidopsis halleri to metalliferous soils. Scientific reports, 8(1), 16085. doi:10.1038/s41598-018-33938-2
Simpson, G. G. (1944). Tempo and mode in evolution: Columbia University Press.
Sire, C., Durand, P., Pointier, J. P., & Théron, A. (2001). Genetic diversity of Schistosoma mansoni within and among individual hosts (Rattus rattus): infrapopulation differentiation at microspatial scale. International Journal for Parasitology, 31(14), 1609-1616. doi:10.1016/S0020-7519(01)00294-6
Stapley, J., Reger, J., Feulner, P. G., Smadja, C., Galindo, J., Ekblom, R., . . . Slate, J. (2010). Adaptation genomics: the next generation. Trends in ecology & evolution, 25(12), 705-712.
Stinchcombe, J. R., & Hoekstra, H. E. (2008). Combining population genomics and quantitative genetics: finding the genes underlying ecologically important traits. Heredity, 100(2), 158-170. doi:10.1038/sj.hdy.6800937
Tajima, F. (1989). Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123(3), 585-595.
Thompson, J. N. (1994). The coevolutionary process: University of Chicago Press.
Tirosh, I., & Barkai, N. (2007). Comparative analysis indicates regulatory neofunctionalization of yeast duplicates. Genome biology, 8(4), R50.
Train, C.-M., Glover, N. M., Gonnet, G. H., Altenhoff, A. M., & Dessimoz, C. (2017). Orthologous Matrix (OMA) algorithm 2.0: more robust to asymmetric evolutionary rates and more scalable hierarchical orthologous group inference. Bioinformatics, 33(14), i75-i82.
Tsai, I. J., Zarowiecki, M., Holroyd, N., Garciarrubio, A., Sanchez-Flores, A., Brooks, K. L., . . . Sciutto, E. (2013). The genomes of four tapeworm species reveal adaptations to parasitism. Nature, 496(7443), 57-63.
Ulett, M. A. (2014). Making the case for orthogenesis: the popularization of definitely directed evolution (1890–1926). Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 45, 124-132.
Vilas, R., Paniagua, E., & Sanmartín, M. L. (2003). Genetic variation within and among infrapopulations of the marine digenetic trematode Lecithochirium fusiforme. Parasitology, 126(5), 465-472. doi:10.1017/S0031182003003081
Warwick, T. (1952). Strains in the mollusc Potamopyrgus jenkinsi (Smith). Nature, 169(4300), 551-552.
Weir, B. S., & Cockerham, C. C. (1984). Estimating F-statistics for the analysis of population structure. Evolution, 38(6), 1358-1370.
Winterbourn, M. (1970). Population studies on the New Zealand freshwater gastropod, Potamopyrgus antipodarum (Gray). Journal of Molluscan Studies, 39(2-3), 139-149.
Winterbourn, M. (1974). Larval Trematoda parasitizing the New Zealand species of Potamopyrgus (Gastropoda: Hydrobiidae). Mauri Ora, 2, 17-30.
WoRMS Editorial Board (2020). World Register of Marine Species. Retrieved from http://www.marinespecies.org
Wright, S. (1931). Evolution in Mendelian populations. Genetics, 16(2), 97.
Wu, W., Niles, E. G., El-Sayed, N., Berriman, M., & LoVerde, P. T. (2006). Schistosoma mansoni (Platyhelminthes, Trematoda) nuclear receptors: Sixteen new members and a novel subfamily. Gene, 366(2), 303-315. doi:10.1016/j.gene.2005.09.013
Yang, Z., & Nielsen, R. (1998). Synonymous and nonsynonymous rate variation in nuclear genes of mammals. Journal of molecular evolution, 46(4), 409-418.
Yang, Z., Wafula, E. K., Honaas, L. A., Zhang, H., Das, M., Fernandez-Aparicio, M., . . . dePamphilis, C. W. (2014). Comparative Transcriptome Analyses Reveal Core Parasitism Genes and Suggest Gene Duplication and Repurposing as Sources of Structural Novelty. Molecular Biology and Evolution, 32(3), 767-790. doi:10.1093/molbev/msu343

Zarowiecki, M., & Berriman, M. (2015). What helminth genomes have taught us about parasite evolution. Parasitology, 142(S1), S85-S97.
Zhang, J., Dyer, K. D., & Rosenberg, H. F. (2000). Evolution of the rodent eosinophil-associated RNase gene family by rapid gene sorting and positive selection. Proceedings of the National Academy of Sciences, 97(9), 4701-4706. doi:10.1073/pnas.080071397
Zhou, Y., Zheng, H., Chen, X., Zhang, L., Wang, K., Guo, J., . . . Jin, K. (2009). The Schistosoma japonicum genome reveals features of host-parasite interplay. Nature, 460(7253), 345.

Chapter 1

1: Gene duplication and gain in the trematode Atriophallophorus winterbourni contributes to adaptation to parasitism

Natalia Zajac*,1,2, Stefan Zoller2, Katri Seppälä1,4, David Moi5,6,7, Christophe Dessimoz5,6,7,8,9, Jukka Jokela1,2, Hanna Hartikainen1,2,3, Natasha Glover5,6,7

1. Eawag, Swiss Federal Institute of Aquatic Science and Technology, CH-8600 Dübendorf, Switzerland
2. ETH Zurich, Department of Environmental Systems Science, Institute of Integrative Biology, CH-8092 Zurich, Switzerland
3. School of Life Sciences, University of Nottingham, University Park, NG7 2RD, Nottingham, UK
4. Research Department for Limnology, University of Innsbruck, 5310 Mondsee, Austria
5. Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
6. Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
7. Center for Integrative Genomics, 1015 Lausanne, Switzerland
8. Centre for Life’s Origins and Evolution, Department of Genetics Evolution and Environment, University College London, Gower St, London WC1E 6BT, UK
9. Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK

*Author for Correspondence: Natalia Zajac, ETH Zurich, Department of Environmental Systems Science, Institute of Integrative Biology, Zurich, Switzerland, +41 58 765 1122, [email protected]

In press in Genome Biology and Evolution.

Abstract

Gene duplications and novel genes have been shown to play a major role in helminth adaptation to a parasitic lifestyle because they provide the novelty necessary for adaptation to a changing environment, such as living in multiple hosts. Here we present the de novo sequenced and annotated genome of the parasitic trematode Atriophallophorus winterbourni and its comparative genomic analysis with other major parasitic trematodes. First, we reconstructed the species phylogeny and dated the split of A. winterbourni from the Opisthorchiata suborder to approximately 237.4 MYA (± 120.4 MY). We then addressed the question of which expanded gene families and gained genes are potentially involved in adaptation to parasitism. To do this, we used Hierarchical Orthologous Groups to reconstruct three ancestral genomes on the phylogeny leading to A. winterbourni and performed a GO enrichment analysis of the gene composition of each ancestral genome, allowing us to characterize the subsequent genomic changes. Out of the 11,499 genes in the A. winterbourni genome, as many as 24% have arisen through duplication events since the speciation of A. winterbourni from the Opisthorchiata, and as many as 31.9% appear to be novel, i.e. newly acquired. We found 13 gene families in A. winterbourni to have had more than 10 genes arising through these recent duplications, all of which have functions potentially relating to host behavioural manipulation, host tissue penetration, and hiding from host immunity through antigen presentation. We identified several families with genes evolving under positive selection. Our results provide a valuable resource for future studies on the genomic basis of adaptation to parasitism and point to specific candidate genes putatively involved in antagonistic host-parasite adaptation.

Keywords: comparative genomics, evolution, phylogeny, selection

Significance statement

Transition to parasitism has been associated with gene duplication and gain of novel genes for host exploitation, invasion, and escape from host immunity. In our study, we trace gene duplications and gains across a phylogeny from an ancestral trematode genome to our focal species, the newly sequenced trematode Atriophallophorus winterbourni. We characterize gene duplications and gains in three ancestral genomes leading to A. winterbourni and outline candidate gene families that have recently undergone duplication and are potentially involved in parasitism.

Introduction

The adoption of a parasitic lifestyle represents a major niche shift that has occurred multiple times across the tree of life (Poulin & Randhawa, 2015; Weinstein & Kuris, 2016). The similar selective pressures involved in exploiting hosts have resulted in convergent macroevolutionary features, such as a tendency for morphological simplification (O’Malley et al., 2016) and the associated genome compaction, reduction and streamlining across many parasite lineages (Chang et al., 2015; Lu et al., 2019; Peyretaillade et al., 2011; Slyusarev et al., 2020). At the same time, parts of the parasite genome involved in e.g. host exploitation and life-cycle complexity may have experienced expansions. Comparative genomic analyses have implied that gene duplications can drive innovation in gene function during radiations of parasitic lineages (Zarowiecki & Berriman, 2015).

Novel gene functions involved in the response to host immunity may be particularly important for the evolution of parasitism. For example, mucins, a family of heavily glycosylated surface epithelial proteins, have undergone multiple rounds of duplication in the blood fluke, Schistosoma mansoni. Mucins frequently recombine, generating antigenic variation through splice variants (Roger et al., 2008). Increased life-cycle complexity, especially within the parasitic flatworms (Poulin & Randhawa, 2015), may have also driven the evolution of functional novelty involved in host exploitation strategies. For instance, in S. mansoni, multiple duplication events in the gene superfamily SCP/TAPS (sperm-coating protein/TPx/antigen 5/pathogenesis-related protein 1) have led to an array of proteins that are now associated with an active role in penetration of the snail host tissues (Cantacessi & Gasser, 2012). Duplicated genes that evolve beyond sequence recognition can also give rise to lineage-specific genes (“gained” genes), which can confer specific, novel traits that are important in the adaptation of that lineage to its particular niche (David et al., 2008; Takeuchi et al., 2016).

With the whole genome sequences of over 30 nematode (roundworm) and 25 platyhelminth (flatworm, including trematode) species, it has been possible to characterize the births and expansions of new gene families arising by duplication at key taxonomic levels (Rödelsperger, 2018). Nematodes and platyhelminths are two invertebrate animal phyla consisting of parasitic and free-living organisms, with the parasitic ones causing major animal, crop and human diseases, as well as being a major economic burden (Disease and Injury Incidence and Prevalence Collaborators, GBD, 2016; International Helminth Genomes Consortium, 2019).

The microphallid Atriophallophorus winterbourni (syn. Microphallus sp. or Microphallus livelyi) is a digenean trematode parasite native to the lakes of New Zealand (Blasco-Costa et al., 2020). It alternates between two hosts in its life cycle: the intermediate host is Potamopyrgus antipodarum, a prosobranch dioecious mud snail (Warwick, 1952; Winterbourn, 1970), and the final hosts are waterfowl, mainly dabbling ducks (Lively & McKenzie, 1991). A multi-host life cycle is a general characteristic of all digenean trematodes, and always includes a molluscan species as an intermediate host and a vertebrate as the final host (Galaktionov & Dobrovolskij, 2003) (Supplementary Box 1). The metacercarial asexual stage of A. winterbourni develops in the gonad of the snail, which is consequently castrated. The adult worm stage occurs in the gut of waterfowl, where the worms reproduce sexually, producing eggs released with the waterfowl faeces (Lively & McKenzie, 1991). A. winterbourni notably lacks several life cycle stages known to occur in other digenean trematodes, including sporocyst, redia, cercaria, and possibly miracidia stages (Figure 1). Unlike some other well-studied digenean trematodes (see Figure 1 and Supplementary Box 1), A. winterbourni is not known to infect humans and has low virulence in its final bird host. The Potamopyrgus-Atriophallophorus system has been studied intensively because the parasite seems to be in a tight coevolutionary relationship with its host in natural populations (Lively et al., 2004). The host-parasite interaction has been used to test alternative explanations for the maintenance of sex in Potamopyrgus snails (Lively, 1987, 1989). Previous field and laboratory studies suggest that A. winterbourni adaptation to local host populations is genotype-specific to the degree that the parasite population can adapt to specifically infect the most common host genotypes, which creates negative frequency-dependent dynamics between the two (Dybdahl & Lively, 1996; Jokela et al., 2009; Lively et al., 2004). Additionally, recent experimental evidence has indicated that the parasite alters the behaviour of the snail, causing it to migrate to the shallow parts of the lake where the final host resides (Feijen et al., in prep).

Figure 1. A summary table showing several shared life-cycle characteristics of the trematodes used in the study. The first seven columns indicate the presence (blue) or absence (grey) of developmental stages in each parasite’s life cycle. “Host number” indicates the number of hosts in a parasite’s life cycle; “Type of adult worm” indicates whether the adult worms in the final host are hermaphroditic or dioecious (both males and females present). Species within the genera Schistosoma and Opisthorchis are grouped due to identical characteristics. The photographs below show the metacercaria and adult stage of A. winterbourni and the intermediate host of A. winterbourni (the snail P. antipodarum) (photographs taken by N. Zajac and K. Seppälä).

In this study, we assembled de novo the A. winterbourni reference genome, annotated protein-coding genes, and assigned putative functions using Gene Ontology. With the knowledge from previous studies of pathways and gene families potentially important in trematode adaptation to parasitism, we used comparative genomics to contrast A. winterbourni with other trematodes. We studied the evolution of homologous gene families across the phylogeny of platyhelminths using Hierarchical Orthologous Groups (HOGs), i.e. sets of orthologs and paralogs that all originate from a single gene in the last common ancestor of a clade of interest (Altenhoff et al., 2013). By tracing HOGs along the species tree, it is possible to infer the evolutionary history of gene loss, gain, and duplication since the ancestral gene. Using HOGs, we reconstructed the ancestral digenean trematode genome, the Plagiorchiida ancestral genome, and the ancestral genome before the split of the Xiphidiata and Opisthorchiata suborders. Using these ancestral genomes, we identified the evolutionary events (duplications, gains, and retention of 1:1 orthologs) that shaped each gene family in the lineage leading to A. winterbourni. We characterized the duplicated, gained, and retained (i.e. conserved 1:1 ortholog) genes shared between all trematodes, as well as those specific to A. winterbourni. We discuss the relevance and function of these gene families in A. winterbourni and search for signatures of positive selection in two of the largest gene families. We use the inferred changes in gene content to better understand the genetic novelty necessary for adaptation to parasitic lifestyles in the lineage leading to A. winterbourni. By outlining candidate genes for parasitism, we provide a basis for future studies on the genomics of parasite-host coevolution and broaden the knowledge of trematode evolutionary history.

Methods

Parasite collection and DNA extraction

P. antipodarum snails infected with A. winterbourni were collected from Lake Alexandrina (New Zealand, South Island) in January 2017 from several shallow localities (< 1.5 m) by pushing a kicknet through the vegetation. The snails were transported to the Swiss Federal Institute of Aquatic Science and Technology (Eawag, Dübendorf, Switzerland) within two weeks of collection and were kept in boxes of 500 snails in a flow-through system that filtered the water every 12 h. Snails were fed spirulina ad libitum (Arthrospira platensis, Spirulina California, Earthrise) once a day.

Infected snails were individually dissected and 200-1000 A. winterbourni metacercariae were isolated under 10x-20x magnification. The metacercariae were hatched into adult worms (see Supplementary Methods 1 for details). Obtaining adult worms was necessary to separate the parasite from the double-walled metacercarial cyst that contained both the parasite and the snail DNA (Galaktionov & Dobrovolskij, 2003). The worms were lysed using a CTAB buffer and Proteinase K (2 mg/ml) with overnight incubation at 55 °C (Yap & Thompson, 1987). DNA was isolated using a chloroform:isoamyl alcohol solution (24:1) and precipitated with sodium acetate (3 M). The resulting pellet was washed twice with 70% ethanol. DNA was stored in RNase/DNase-free water (Sigma-Aldrich, Missouri, United States) at -20 °C until sequencing library preparation.

Estimation of genome size

To guide the de novo genome assembly, genome size was estimated using flow cytometry with Propidium Iodide staining (CyFlow Space, Sysmex). A. winterbourni worms were hatched according to the above described protocol. A pool of 15 worms was stained for 1 h with Propidium Iodide (according to the Partec protocol of CyStain PI Absolute T kit) and treated with DNase-free RNase. Three batches of 15 worms were measured, each taken from a different snail host. The DNA content of 2C nuclei was calculated using heads of isoline Drosophila melanogaster males and a laboratory clone of Daphnia galeata as two independent standards. For the haploid DNA content of Drosophila melanogaster, a value of 175 Mb (Bennett et al., 2003) was used and for Daphnia galeata a value of 158 Mb (S. Dennis, personal communication, December 12, 2019). Each standard was run separately with each batch of worms.
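
The arithmetic behind such an estimate can be sketched as follows. With a standard of known genome size run alongside the sample, the haploid genome size of the sample is the ratio of the two 2C fluorescence peak positions multiplied by the haploid size of the standard. Only the two standard sizes (175 Mb and 158 Mb) come from the text; the function name and the peak positions are hypothetical placeholders, not measurements from this study.

```python
from statistics import mean

# Haploid (1C) genome sizes of the two standards in Mb, as given in the text.
STANDARDS_MB = {"Drosophila melanogaster": 175.0, "Daphnia galeata": 158.0}


def genome_size_mb(sample_peak, standard_peak, standard_size_mb):
    """Estimate a haploid genome size from 2C fluorescence peak positions.

    Both peaks come from 2C nuclei, so the factor of two cancels and the
    peak ratio scales the haploid size of the standard directly.
    """
    if sample_peak <= 0 or standard_peak <= 0:
        raise ValueError("peak positions must be positive")
    return sample_peak / standard_peak * standard_size_mb


# Hypothetical peak positions (arbitrary fluorescence units) for one batch.
example_peaks = {
    "Drosophila melanogaster": {"sample": 200.0, "standard": 100.0},
    "Daphnia galeata": {"sample": 200.0, "standard": 100.0},
}

# One estimate per standard, then the mean across the two standards.
estimates = {
    name: genome_size_mb(p["sample"], p["standard"], STANDARDS_MB[name])
    for name, p in example_peaks.items()
}
mean_estimate = mean(estimates.values())
```

Running each standard separately with each batch, as described above, yields one such estimate per standard and batch; how these were combined into a final value is not stated here, so the mean in the sketch is only one plausible choice.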

Sequencing

The DNA of A. winterbourni was sequenced using Illumina and Pacific Biosciences technologies (Ambardar et al., 2016). For Illumina sequencing, two infected snails were selected from a shallow water habitat from one site sampled at Lake Alexandrina. A total of 200 ng DNA was extracted from approximately 800-1000 worms and was sent to the Functional Genomics Center Zurich (University of Zurich, Zurich) for library preparation and paired-end sequencing using the Illumina HiSeq4000 sequencing platform. A single TruSeq library was constructed from the DNA using the TruSeq Nano DNA library prep kit according to Illumina protocols, obtaining an average insert size of 500 bp. The library was sequenced without indexing on a single Illumina lane. For Pacific Biosciences sequencing, we selected 33 infected snails from two different sites from a shallow water habitat with a high infection prevalence within Lake Alexandrina. We assumed no distinct or significant population structure for the parasite from different sites within the same habitat zone, as previously shown for the snail host (Paczesniak et al., 2013). Genomic material was isolated from a pool of approximately 13,000-30,000 worms. The high molecular weight DNA with an average length of 45,000 bp (assessed with a Bioanalyser) was sent for sequencing to the Functional Genomics Center Zurich (University of Zurich, Zurich), where it was sequenced with the Pacific Biosciences RSII sequencing platform. A 10 kb SMRT-bell library was constructed from a total of 10 µg of DNA. The library was sequenced on three SMRT cells using P6/C4 chemistry. Primary filtering was performed by the Functional Genomics Center Zurich using the SMRT Link software from Pacific Biosciences. We performed secondary filtering, choosing only reads of at least 1000 bp in length and with read quality > 80%. No error correction was performed on the PacBio data at this stage, as it was corrected later with the Illumina data during the hybrid assembly.
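
The secondary filtering criterion can be written as a simple predicate over reads. The sketch below is illustrative only: the `Read` record and the function name are hypothetical, and it assumes that read quality is available as a fraction between 0 and 1 for each read, which is not specified in the text.

```python
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal record for one long read."""
    name: str
    sequence: str
    read_quality: float  # assumed to be a fraction between 0 and 1


MIN_LENGTH = 1000   # "at least 1000 bp in length"
MIN_QUALITY = 0.80  # "read quality > 80%"


def secondary_filter(reads: Iterable[Read]) -> Iterator[Read]:
    """Yield only reads that pass both the length and the quality threshold."""
    for read in reads:
        long_enough = len(read.sequence) >= MIN_LENGTH
        good_enough = read.read_quality > MIN_QUALITY
        if long_enough and good_enough:
            yield read
```

Note that the length threshold is inclusive while the quality threshold is strict, mirroring the wording "at least 1000 bp" and "> 80%".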

Illumina data correction

A quality trimming step was performed with Trimmomatic 0.35 on the raw Illumina HiSeq data before proceeding with the assembly. Adapter sequences were removed and bases with a phred quality score below 5 were removed from the start and the end of the reads. Reads were scanned with a sliding window of 4 bases and were clipped if the average quality per base dropped below 15. Reads shorter than 50 bp were discarded. The reads were then submitted to PRINSEQ (Schmieder & Edwards, 2011) for filtering of ambiguous bases (Ns) and of characters other than A, C, G, T or N, and for removal of exact duplicates. For assessment of contamination, we used taxonomic interrogation of the paired reads with Kraken v2 and its standard database (Wood & Salzberg, 2014).
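
The trimming logic described above can be illustrated with a simplified sketch. This is not Trimmomatic's implementation and omits adapter removal; the function name and its parameters are hypothetical, with defaults set to the thresholds quoted in the text (end quality 5, window of 4 bases, window mean 15, minimum length 50).

```python
def trim_read(seq, quals, end_min=5, window=4, window_min=15, min_len=50):
    """Return the trimmed (seq, quals) pair, or None if the read is discarded.

    seq is the base string and quals the matching list of phred scores.
    """
    # 1. Remove low-quality bases from the start and the end of the read.
    start = 0
    while start < len(quals) and quals[start] < end_min:
        start += 1
    end = len(quals)
    while end > start and quals[end - 1] < end_min:
        end -= 1
    seq, quals = seq[start:end], quals[start:end]

    # 2. Slide a window along the read and clip at the first window whose
    #    mean quality drops below the threshold.
    keep = len(quals)
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < window_min:
            keep = i
            break
    seq, quals = seq[:keep], quals[:keep]

    # 3. Discard reads that have become too short.
    if len(seq) < min_len:
        return None
    return seq, quals
```

In this sketch the read is cut at the start of the first failing window; the actual tool may place the cut slightly differently within that window.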

Hybrid assembly and annotation

Paired reads from Illumina were used together with long reads from Pacific Biosciences for a hybrid assembly with the MaSuRCA 3.2.3 assembler using default parameters (Zimin et al., 2013). Redundans 0.13c (Pryszcz & Gabaldón, 2016) and AGOUTI (Zhang et al., 2016) were used for improvements. Redundans improves the quality of the assembly by reduction, scaffolding and gap closing (Pryszcz & Gabaldón, 2016). The reduction steps consist of identification and removal of heterozygous contigs, based on pairwise sequence similarity searches. Heterozygous contigs are expected to have high sequence identity (Pryszcz & Gabaldón, 2016). The quality was assessed using the N50 statistic, BUSCO 3.0.2 (Benchmarking Universal Single Copy Orthologs) (Waterhouse et al., 2017), and Blobtools 0.9.19.5 (Laetsch & Blaxter, 2017). BUSCO 3.0.2 assesses the completeness of single copy orthologs based on evolutionary-informed expectations about gene content using the lineage dataset metazoa_odb9. Blobtools 0.9.19.5 was used for taxonomic partitioning of the assembly. All scaffolds >50,000 bp (2718 scaffolds) plus a random sample of scaffolds <50,000 bp from the assembly (2661 scaffolds) were submitted to BLAST 2.3.0 using the NCBI nr database for taxonomic annotation. Taxonomic assessment of those scaffolds was used as input for Blobtools. The paired and filtered Illumina reads and PacBio reads of at least 1000 bp in length and with read quality > 80% were mapped back to the final assembly with BWA-MEM 0.7.17, yielding an average of 143x coverage per base (125x from the Illumina reads and 18x from the PacBio reads).
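
For readers unfamiliar with the N50 statistic mentioned above, a minimal sketch of its calculation is given here. The function name is hypothetical and the scaffold lengths in the usage note are toy values, not data from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    The N50 is the length L such that scaffolds of length >= L together
    contain at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold downwards until half the total is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

For example, for toy scaffold lengths 2 to 10 (total 54), the three longest scaffolds (10, 9 and 8) sum to 27, exactly half of the total, so the N50 is 8.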

Genome annotation was performed using the Maker 2.31.9 annotation pipeline (for details see Supplementary Methods 2) (Cantarel et al., 2008). The completeness and quality of the annotation were assessed with BUSCO and with full-length transcript analysis using BLAST+ (see Supplementary Methods 3). Gene Ontology annotation of the coding sequences was performed with the Pannzer2 (Törönen, Medlar, & Holm, 2018), EggNOG (Diamond mapping mode) (Huerta-Cepas et al., 2016) and OMA (“Orthologous MAtrix”) (Altenhoff et al., 2017) web browsers, with each dataset used separately for GO enrichment analyses (http://ekhidna2.biocenter.helsinki.fi/sanspanz/ (last accessed: 10.2019), http://eggnogdb.embl.de/#/app/home (last accessed: 01.2020), https://omabrowser.org/oma/functions/ (last accessed: 12.2019)). We also assessed the percentage of all GO terms annotated in A. winterbourni with experimental evidence in nematodes or trematodes (Supplementary Methods 4).

Comparative genomics and ancestral genome reconstruction

We selected 20 species of platyhelminths and nematodes for comparative genomic analysis. The choice of both nematodes and trematodes was based on their comparison in other helminth genomic analyses (International Helminth Genomes Consortium, 2019; Zarowiecki & Berriman, 2015) and allows for future comparison of trematodes with model species of nematodes. The species consisted of 14 digenean trematodes (including our focal species), 3 species of parasitic cestodes, 1 species of parasitic monogenean, and 2 species of free-living nematodes (see Supplementary Box 1, Figure 2). We chose these species on the basis of close relatedness to A. winterbourni and the quality of the genome assembly and annotation (the same species were used by the International Helminth Genomes Consortium, 2019). The proteomic, genomic and transcriptomic sequences for analysis were obtained from the NCBI database of invertebrate genomes (ftp.ncbi.nlm.nih.gov) and from the EBI database (ftp://ftp.ebi.ac.uk/). For the analysis, we used the most recent genomes from those databases with available transcriptomic data (CDS_genomic) and protein annotation (see Supplementary Table 1).

The OMA standalone (Orthologous Matrix) software was used for inference of Hierarchical Orthologous Groups (HOGs) of genes shared between species (Altenhoff et al., 2019). This software conducts an all-against-all comparison to identify the evolutionary relationships between all pairs of proteins included in the custom-made database of the 20 genomes. The program was run with default parameters and with the “bottom-up” algorithm for inference of HOGs. C. elegans and P. pacificus were specified as outgroup species. After obtaining the phylogenetic species tree (see next section), OMA was rerun with the precise species tree specified.

The data obtained from OMA were then analyzed with the Python library pyHam (Train et al., 2018). With pyHam we reconstructed a model of the ancestral genomes at each stage of the phylogeny leading to A. winterbourni and carried out all comparisons between ancestral and extant genomes to obtain classes of duplicated, gained, retained, or lost genes (see Jupyter notebook Supplementary Material 6). We also used pyHam to visualise genomic changes along each branch of the phylogenetic tree.
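
The four gene classes can be illustrated with a toy sketch. It does not use pyHam or reproduce its API; the function, the data layout and all identifiers are hypothetical. The assumption is that each extant gene is mapped to the ancestral HOG it descends from, or to no HOG at all if it has no ancestor at that taxonomic level.

```python
from collections import defaultdict


def classify_genes(ancestral_hogs, extant_gene_to_hog):
    """Classify genes relative to one ancestral genome.

    ancestral_hogs: set of HOG identifiers present in the ancestor.
    extant_gene_to_hog: dict mapping each extant gene to its ancestral
        HOG identifier, or to None if it has no ancestor at that level.
    """
    descendants = defaultdict(list)
    gained = []
    for gene, hog in extant_gene_to_hog.items():
        if hog in ancestral_hogs:
            descendants[hog].append(gene)
        else:
            gained.append(gene)  # no ancestral gene at this level

    result = {"retained": [], "duplicated": [], "lost": [],
              "gained": sorted(gained)}
    for hog in sorted(ancestral_hogs):
        genes = sorted(descendants.get(hog, []))
        if not genes:
            result["lost"].append(hog)          # ancestral gene, no descendant
        elif len(genes) == 1:
            result["retained"].extend(genes)    # 1:1 ortholog
        else:
            result["duplicated"].extend(genes)  # several descendants
    return result


# Toy example with hypothetical identifiers.
toy = classify_genes(
    {"HOG1", "HOG2", "HOG3"},
    {"g1": "HOG1", "g2": "HOG2", "g3": "HOG2", "g4": None},
)
```

In the toy example, `g1` is retained, `g2` and `g3` arose by duplication, `HOG3` was lost and `g4` was gained.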

Phylogenetic tree

OMA Groups, i.e. Orthologous Groups, from the OMA output were used for phylogenetic tree construction, as they are stringent groups of orthologs and do not contain paralogs (Zahn-Zabal, Dessimoz, Glover, 2020). The phylogenetic tree was constructed following the protocol of Dylus et al. (2020). Briefly, Orthologous Groups containing at least 15 species of monogeneans, cestodes and trematodes were extracted using the custom script filter_groups.py from the git repository https://github.com/DessimozLab/f1000_PhylogeneticTree. Nematodes were excluded from precise phylogenetic and time tree reconstruction, as they are too evolutionarily distant. Within each Orthologous Group, sequences were aligned using MAFFT (mafft 7.273, 1000 cycles of iterative refinement) (Katoh et al., 2009). The separate alignments were concatenated into one supermatrix using the custom script concat_alignment.py from the same repository. The final size of the supermatrix was 145,802 sites for all 18 species. No columns were excluded from the supermatrix. The supermatrix was used as input for IQ-TREE maximum likelihood phylogenetic tree construction (Hoang et al., 2017; Kalyaanamoorthy et al., 2017; Trifinopoulos et al., 2016) using the ModelFinder Plus option to find the best-fitting model. Branch support was calculated with 1000 ultrafast bootstrap alignments and 1000 iterations. The maximum likelihood tree was confirmed with ASTRAL III (Zhang et al., 2018) by constructing a species tree from gene trees of the 238 Orthologous Groups. Each gene tree was first constructed with IQ-TREE using ModelFinder Plus to choose an appropriate model; branch support was calculated with 1000 bootstrap alignments and 1000 iterations. The IQ-TREE tree, together with the supermatrix, was used in Mega-X 6.06 for time tree reconstruction using the Maximum Likelihood RelTime method (Tamura et al., 2013). We used two pieces of evidence for time calibration, discussed in the Results.
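The filtering and concatenation steps described above can be illustrated with a minimal Python sketch. It is not the filter_groups.py or concat_alignment.py script from the cited repository; the function names, the toy species labels and the gap-padding of species missing from a group are assumptions made for illustration only.

```python
from typing import Dict, List


def keep_group(alignment: Dict[str, str], min_species: int = 15) -> bool:
    """Keep an Orthologous Group only if enough species are represented."""
    return len(alignment) >= min_species


def concatenate(alignments: List[Dict[str, str]], species: List[str]) -> Dict[str, str]:
    """Join per-group alignments column-wise into one supermatrix.

    A species missing from a group is padded with gaps (an assumption of
    this sketch) so that every row of the supermatrix has the same length.
    """
    rows: Dict[str, List[str]] = {name: [] for name in species}
    for alignment in alignments:
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError("all sequences in one alignment must have equal length")
        width = widths.pop()
        for name in species:
            rows[name].append(alignment.get(name, "-" * width))
    return {name: "".join(parts) for name, parts in rows.items()}


# Toy data: three hypothetical species and two aligned groups.
species = ["sp_A", "sp_B", "sp_C"]
group_1 = {"sp_A": "MK-L", "sp_B": "MKAL", "sp_C": "MRAL"}
group_2 = {"sp_A": "GGT", "sp_C": "GAT"}  # sp_B has no ortholog in this group
kept = [g for g in (group_1, group_2) if keep_group(g, min_species=2)]
supermatrix = concatenate(kept, species)
```

In this toy case both groups pass the (lowered) species threshold, and the row for sp_B is padded with three gap characters for the group in which it has no ortholog.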

Gene Ontology Enrichment Analysis

We functionally annotated each extant species genome with Gene Ontology (GO) terms, i.e. orthology-informed putative functions, using Pannzer2, EggNOG (Diamond mapping mode) and OMA (Törönen et al., 2018; Huerta-Cepas et al., 2016; Altenhoff et al., 2017), reaching between 26% and 96% of genes annotated per species (Supplementary Table 2). We then performed GO enrichment analysis using GOATOOLS (Klopfenstein et al., 2018), which finds statistically over- and under-represented GO terms in the set of genes of interest compared to all the GO terms in the background population. For analyses that were species-specific, the background set was all the genes in the genome. For analyses of ancestral genomes, the background population was all the ancestral genes, i.e. the set of HOGs at that taxonomic level. To get the GO terms for any particular ancestral gene/HOG, we took the union of all the GO terms in the extant “children” species. Fisher’s exact test was used for computing uncorrected p-values. The p-values were then corrected using the Bonferroni method and retained if the corrected p-value was <0.05. Subsequently, all enriched GO terms were categorized into GO slim categories using the AGR subset (Alliance of Genome Resources, http://geneontology.org/docs/download-ontology/, last accessed: 7.05.2020) and unique genes within each enriched GO slim category were counted. For each GO term, the IC (Information Content) score was calculated as IC(t) = −log(p(t)), with p(t) estimated as the empirical frequency of the term in the UniProt-GOA database (Barrell et al., 2009). The average IC was calculated for each GO slim term using the IC values of all enriched GO terms in each category (Mazandu & Mulder, 2014; Mistry & Pavlidis, 2008). GO slim terms were used in summarizing the data.
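The statistical core of the enrichment step (Fisher’s exact test followed by Bonferroni correction) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the GOATOOLS implementation: the two-sided test is computed directly from hypergeometric probabilities, the study set is assumed to be a subset of the background, and the term labels and counts are invented toy values.

```python
from math import comb
from typing import Dict


def hypergeom_pmf(k: int, study_n: int, pop_k: int, pop_n: int) -> float:
    """Probability of exactly k annotated genes in a study set of size study_n,
    drawn from pop_n background genes of which pop_k carry the term."""
    return comb(pop_k, k) * comb(pop_n - pop_k, study_n - k) / comb(pop_n, study_n)


def fisher_two_sided(k: int, study_n: int, pop_k: int, pop_n: int) -> float:
    """Two-sided Fisher's exact p-value: sum of the probabilities of all
    outcomes that are no more likely than the observed one."""
    lowest = max(0, study_n - (pop_n - pop_k))
    highest = min(study_n, pop_k)
    p_obs = hypergeom_pmf(k, study_n, pop_k, pop_n)
    total = sum(
        p
        for p in (hypergeom_pmf(i, study_n, pop_k, pop_n) for i in range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-7)  # small tolerance for floating-point ties
    )
    return min(1.0, total)


def bonferroni(pvalues: Dict[str, float]) -> Dict[str, float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    n_tests = len(pvalues)
    return {term: min(1.0, p * n_tests) for term, p in pvalues.items()}


# Toy example with hypothetical term labels: 20 background genes, 5 in the study set.
raw = {
    "TERM_A": fisher_two_sided(k=4, study_n=5, pop_k=5, pop_n=20),
    "TERM_B": fisher_two_sided(k=2, study_n=5, pop_k=6, pop_n=20),
    "TERM_C": fisher_two_sided(k=1, study_n=5, pop_k=4, pop_n=20),
}
corrected = bonferroni(raw)
enriched = sorted(term for term, p in corrected.items() if p < 0.05)
```

With these toy counts only TERM_A survives the 0.05 threshold after correction.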

Estimation of dN/dS in gene families in Atriophallophorus winterbourni

HOGs 25969 and 36190, each with over 30 A. winterbourni genes, were investigated for signatures of positive selection. All proteins within the two families were submitted to NCBI BLASTP to find their best hit against the nr database and obtain putative functions. We then applied the protocol from Jeffares et al. (2015) to estimate the non-synonymous to synonymous substitution rate ratio within each HOG and to investigate whether selection models explain the data better than null models (Yang, 1997; Kohlhase, 2006). Protein sequences were aligned using Clustal Omega (Madeira et al., 2019), then converted to a codon alignment in Phylip format with PAL2NAL (Suyama et al., 2006). Positive selection analyses are sensitive to alignment errors; thus the gap-ridden alignment of HOG 36190 was subjected to more stringent alignment filtering, guided by the approach proposed by Moretti et al. (2014) (for details see Supplementary Methods 5). Branch-site models in codeml were used to estimate dN, dS and ω (dN/dS) (model=2, NSsites=2). The likelihood ratio test (LRT) was used to determine significance. Gene trees were constructed from protein sequence alignments using IQ-TREE (Hoang et al., 2017; Kalyaanamoorthy et al., 2017; Trifinopoulos et al., 2016). First, an initial parsimony tree was created by a phylogenetic likelihood library; 168 protein models were then tested for best fit to the data according to the Bayesian Information Criterion. Branch support was calculated with 1000 ultrafast bootstrap alignments and 1000 iterations. The models chosen were JTT+F+G4 for HOG 25969 (general matrix with empirical amino acid frequencies from the data and a discrete Gamma model with 4 categories) and WAG+G4 for HOG 36190 (general matrix with a discrete Gamma model with 4 categories).

Results and Discussion

Genome of A. winterbourni

The de novo sequenced genome of A. winterbourni resulted in a final assembly of 601.7 Mb, consisting of 26,114 scaffolds with an N50 of 40,108 (see Table 1 and Supplementary Results 1 for details). The assembly size was similar to the flow cytometry-based genome size estimate of 550-600 Mb (Supplementary Figure 1). The annotation yielded 11,499 predicted protein-coding loci spanning 163.7 Mb, with a mean of 5.8 exons and a median of 4 exons per gene (Table 1). The final BUSCO gene set completeness for the annotation was 72% of complete single-copy conserved orthologs (see Supplementary Results 3 for protein coding sequence length analysis using BLAST+). Relative to other published trematode genomes, the A. winterbourni genome showed a good protein sequence length distribution and a comparable proportion of BUSCO complete single-copy conserved orthologs (Supplementary Figure 4, Table 1). Functional annotation via Gene Ontology (GO) was successful for 84% of genes using OMA, Pannzer2, and EggNOG (9,674 genes, see Supplementary Figure 5 and Supplementary Table 2), with 45.3% of the OMA GO terms and 32% of Pannzer2 GO terms assigned to A. winterbourni having experimental evidence in nematodes or trematodes (see Supplementary Table 2 and Supplementary Results 2). In comparison to other Plagiorchiida genomes, the A. winterbourni assembly was of similar size and showed similar percentages of non-coding regions, suggesting that no significant genome reduction has occurred in this species (Table 1). Transposable elements, interspersed repeats and low complexity DNA comprised 51.7% of the genome (Supplementary Table 11). This elevated TE content in comparison to closely related Opisthorchiata species (33% in C. sinensis, 30.3% in O. felineus, 30.9% in O. viverrini) (Esch et al., 2002) might indicate an increased importance of transposable elements in A. winterbourni genome evolution.

Table 1. Information on the genome assemblies used in the analysis. For more information, see Supplementary Table 1. The BUSCO results refer to the protein annotation. The results on exon/intron number and length were calculated from the gff files with genestats script (available at: https://gist.github.com/darencard/fcb32168c243b92734e85c5f8b59a1c3, date accessed 14.07.2020) or obtained from (Tsai et al. 2013).

| Species | Genome size (Mb) | No. of genes | Scaffold count | N50 | GC content (%) | BUSCO complete single (%) | BUSCO duplicated (%) | Total exon number | Average exon length (bp) | Total intron number | Average intron length (bp) | Total coding sequence (Mb) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Atriophallophorus winterbourni | 601.7 | 11499 | 26114 | 40108 | 40.73 | 56.2 | 15.7 | 66672 | 233 | 54987 | 1732 | 163.7 |
| Taenia solium | 122 | 12467 | 11237 | 68000 | 42.9 | 77.9 | 1.9 | 69770 | 223 | 57289 | 574 | 48.3 |
| Echinococcus granulosus | 110.8 | 11319 | 957 | 712683 | 41.7 | 76.2 | 2.1 | 75264 | 211 | 63945 | 722 | 62 |
| Gyrodactylus salaris | 67.4 | 15436 | 6075 | 18400 | 33.9 | 67.7 | 1.5 | 61693 | 229 | 46257 | 584 | 41.1 |
| Fasciola hepatica | 1138 | 14642 | 23604 | 161103 | 44.1 | 71 | 1.2 | 83777 | 488 | 72560 | 4168 | 343.3 |
| Echinostoma caproni | 834.6 | 18607 | 86083 | 27000 | 42.5 | 50.3 | 1.2 | 65273 | 267 | 46666 | 2451 | 131.8 |
| Opisthorchis felineus | 679.25 | 11427 | 13306 | 621022 | 44.1 | 53.4 | 33 | 180879 | 261 | 160011 | 3527 | 291.9 |
| Opisthorchis viverrini | 472.26 | 13555 | 16038 | 79767 | 44 | 67 | 0.7 | 59112 | 242 | 48358 | 2734 | 146.5 |
| Clonorchis sinensis | 562.7 | 14538 | 2776 | 1628761 | 42.6 | 72.6 | 1 | 89304 | 234 | 74766 | 2745 | 226.2 |
| Trichobilharzia regenti | 701.76 | 22185 | 188369 | 7696 | 37.4 | 36.4 | 0.7 | 54402 | 277 | 32217 | 1829 | 74 |
| Schistosoma japonicum | 369.9 | 11416 | 1789 | 1093989 | 33.8 | 47.2 | 34.6 | 130068 | 336 | 113132 | 2372 | 185.4 |
| Schistosoma mansoni | 364.5 | 10772 | 885 | 32115376 | 35.5 | 71.5 | 8.6 | 70430 | 204 | 57138 | 2475 | 148.8 |
| Schistosoma margrebowiei | 367.4 | 26189 | 23355 | 35236 | 34.3 | 65.7 | 1.6 | 79991 | 262 | 53802 | 1925 | 122.3 |
| Schistosoma haematobium | 375.89 | 11140 | 29834 | 317484 | 34.2 | 71.4 | 1.6 | 64235 | 246 | 53148 | 2488 | 148.8 |
| [species name not recoverable] | 373.4 | 11576 | 4774 | 202989 | 34.4 | 68.3 | 4.3 | 65265 | 259 | 53689 | 2406 | 146.1 |
| Schistosoma mattheei | 340.82 | 22997 | 62061 | 12303 | 34.1 | 51.3 | 1.5 | 65852 | 263 | 43672 | 1569 | 84.1 |
| Schistosoma curassoni | 344.2 | 23546 | 60140 | 13861 | 34.2 | 54.6 | 1.4 | 69606 | 259 | 46060 | 1576 | 88.5 |
| Caenorhabditis elegans | 102.3 | 20184 | 7 | 17493829 | 35.4 | 98 | 0.6 | 285984 | 239 | 250855 | 438 | 63.3 |
| Echinococcus multilocularis | 115 | 10663 | 1217 | 13800000 | 42.2 | 79.8 | 2.8 | 71022 | 205 | 60677 | 663 | 49 |
| Pristionchus pacificus | 158.5 | 25991 | 47 | 23900000 | 42.8 | 91.6 | 1.3 | 312244 | 106 | 287319 | 275 | 112.2 |

Species phylogeny and molecular clock

To reconstruct a robust maximum likelihood phylogenetic tree, 238 Orthologous Groups (groups containing only orthologs, with a maximum of one gene per species) shared between at least 15 out of 18 species of Platyhelminthes were used. The phylogenetic estimate was well resolved and congruent with previous publications based on genetic markers or whole genomes (Figure 2) (Blasco-Costa et al., 2020; Galaktionov & Dobrovolskij, 2003; International Helminth Genomes Consortium, 2019; Lee et al., 2013). A. winterbourni was placed as sister to the Opisthorchiata clade with 100% bootstrap support. The split of A. winterbourni from the Opisthorchiata species was estimated at 237.4 MYA (± 120.4 MY), i.e. during the Carboniferous through the Cretaceous period (Figure 2). The divergence time estimates across the phylogeny were inferred using two independent pieces of evidence as calibration points for the time tree: the existence of the proto-trematode first associated with a molluscan host around 400 MYA, and the origin of Schistosoma species in the Cretaceous period (66-145 MYA) (Blair et al., 2001; Gibson, 1987; Hausdorf, 2000; Parfrey et al., 2011; Peterson et al., 2004).

Figure 2. Phylogenetic tree and classification of species used in the analysis. The data used for the tree were all orthologous groups from the OMA analysis with genes from at least 15 species present (238 groups of orthologs). The combined data were used in IQ-TREE to create a robust consensus species tree. The tree and the combined alignment of 238 groups of orthologs were used in Mega-X 6.06 for reconstruction of the time tree. The scale below indicates divergence times in million years (MY). Each node shows a divergence time, with the confidence interval in million years in brackets, followed by the bootstrap support after a slash.

Evolutionary patterns of gene 1:1 orthology, gain, loss and duplication across Trematoda

The OMA analysis identified 38,144 HOGs among all the species included (2 Nematodes and 18 Platyhelminthes). Specifically, in A. winterbourni 5,815 gene families were found (comprising 7,828 out of a total of 11,499 genes, 68.1%), with the rest being identified by OMA as singletons not belonging to any family (3,671 genes, 31.9%). Comparisons of three ancestral genomes across the trematode phylogeny (the ancestral Trematoda, the ancestral Plagiorchiida and the Opisthorchiata/Xiphidiata ancestor) revealed many duplicated and gained gene families (Figure 3, Supplementary Figure 7). A particularly high proportion of genomic novelty was inferred during the initial speciation of Trematoda from the Trematoda/Cestoda common ancestor (37.2% of newly acquired genes), and again during the divergence of A. winterbourni from the most recent Opisthorchiata/Xiphidiata ancestor (31.9% of newly acquired genes, Figure 3B). The proportion of duplicated genes in the A. winterbourni genome was also high (24%) when compared to the Trematoda/Cestoda split (10.9%) (Figure 3B). In A. winterbourni, many of the duplicated genes were found in expanded gene families (503 genes comprising 66 HOGs with a minimum of 5 duplications per HOG) and 13 of these HOGs were massively expanded, with over 10 duplicated genes since the Opisthorchiata/Xiphidiata speciation (Supplementary Table 4).

We found only 660 genes lost in the ancestral trematode relative to the previous ancestor. We observed a progressive increase in the number of lost genes up to the Opisthorchiata/Xiphidiata ancestor (Figure 3A). The Plagiorchiida ancestor exhibited gene loss comparable to gene gain and duplication, whereas in the Opisthorchiata/Xiphidiata ancestor, gene loss exceeded the number of duplications or gains (Figure 3A).

Figure 3 A. Number of duplicated, retained (1:1 orthologs) and gained genes resulting after each point of speciation obtained from the analysis of Orthologous Groups in pyHam, mapped onto a phylogenetic tree of nematodes and trematodes (for original see Supplementary Figure 6). The total number of genes at each point is indicated on the left-hand side of the bar and the total number of retained (pink), duplicated (green) and gained (yellow) genes are indicated on the right-hand side of the bar. The bars indicate the proportions of genes in each category. The lost genes are indicated only for the three ancestral genomes: the Trematoda ancestor, the Plagiorchiida ancestor and the Opisthorchiata/Xiphidiata ancestor. B. The proportions (on the bars) and the total numbers (next to the bars) of retained (pink), duplicated (green) and gained (yellow) genes in each reconstructed ancestral genome leading to A. winterbourni. The oldest ancestral genome is on the left-hand side and the extant A. winterbourni genome on the right-hand side. The total number of genes per genome is above each bar beneath the name.

C. Heatmaps summarising the GO enrichment analysis of the duplicated and gained genes in the 3 reconstructed ancestral genomes and the extant genome of A. winterbourni. All enriched GO terms were categorized into GO slims listed on the y-axis of each heatmap. The colours indicate the mean IC value of each GO slim category and the number printed on top is the number of unique genes within that GO slim category (see Methods).

1:1 orthologs in trematodes

Based on previous studies, we assumed that many of the genes that remain conserved throughout speciation are housekeeping genes, the building blocks of the organism, and necessary for life, growth, and reproduction (Wu et al., 2006; Duarte et al., 2010). This prediction was confirmed by the GO annotations associated with the genes retained at a 1:1 orthologous gene ratio for each of the ancestral genomes (Supplementary Table 5). The enriched GO terms for retained genes over all ancestors and A. winterbourni can be summarized as: RNA processing, the establishment of protein localization, organelle organization, embryo development, cellular catabolism, developmental process, reproduction, and response to stress and stimulus. Moreover, since the ancestral trematode species 400 MYA, the number of genes retained at a 1:1 ratio has remained relatively constant for each of the 14 extant trematodes, between 2,966 and 5,203 genes (Supplementary Table 6).

Additionally, we found 28 single-copy orthologs present in all species, which have been maintained since the trematode ancestor. Examination of their functions through the annotations of the best-studied trematodes, Fasciola hepatica (NCBI, 2017) and Schistosoma mansoni (Protasio et al., 2012; Wang et al., 2016), revealed that the 28 retained gene families shared between them all were largely involved in cell functioning and growth, division, and cell-to-cell or protein-to-protein interactions (Supplementary Results 4, Supplementary Table 7).

Genes duplicated and gained in trematodes

We hypothesized that duplicated genes are more likely to be adaptive than single-copy orthologs, because the redundant second copy is functionally maintained through positive selection to play a new or the same role within the organism (Ohno, 2013; Yang et al., 2015). Multiple duplications within gene families would further suggest an adaptive importance of these key HOGs. The novel (gained) genes may similarly indicate areas of genetic innovation that were crucial in the adoption of new hosts, expansion/streamlining of life cycles and adaptation to changing environments. Gained genes may originate from neofunctionalization or high divergence of duplicated genes, and may therefore also be involved in adaptive functions, as suggested by the gene duplication model of Ohno (2013).

The enriched functions from the trematode ancestor to the most recent ancestor of A. winterbourni are presented in Figure 3C and appear to indicate a progressive gain and duplication of potentially adaptive genes. An “ancestral GO enrichment” analysis of the ancestral genomes was used to retrieve the putative functions of all gained genes (shared between at least 70% of the extant species in Trematoda, 66.7% of Plagiorchiida, or 50% of Xiphidiata/Opisthorchiata) and duplicated genes (minimum 5 duplicated genes per family) (Supplementary Methods 6). Here, we concentrate on functional analysis of ancestral genomes because the inferred gene duplications, gains, and losses are based on evidence present in all of the extant genomes. For example, a gene is inferred to be gained at a particular ancestral level if it is present in at least two species, and only in that clade. Therefore, inferred duplications, gains, and losses are more robust for ancestral genomes (i.e. internal nodes in the species tree) than for extant genomes. Additionally, by only considering gained genes present in the majority of the extant species of a given clade, or duplicated genes present with at least five copies, we have more confidence that we are looking at bona fide gains and duplications. The ancestral genome annotation was based on combining the GO terms assigned to the extant genomes. We further categorized the enriched GO terms into GO slim categories to give a broader overview of the functions, and counted unique genes within each of those categories (summarised in Figure 3C). Although there was a similar number of enriched functions for the duplicated and gained genes in the Trematoda and Plagiorchiida ancestors, we found more functions enriched in duplicated than in gained genes in the Xiphidiata/Opisthorchiata ancestor and in A. winterbourni. Considering only the duplicated genes, from the trematode ancestral genome to the A. winterbourni genome, there was a progressive increase in the number of enriched GO slim functions over time, and an overall increase in the number of unique genes contributing to each function. The increase in the number of unique genes could reflect the increasing importance of these functions over time or an increased duplication rate of certain families.
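The two confidence filters described above (a minimum fraction of extant species carrying a gained gene, and a minimum number of duplicated genes per family) can be sketched as follows. This is an illustration only and not the procedure of Supplementary Methods 6; the function names and the species labels are hypothetical, and only the 70% Trematoda threshold and the five-copy minimum are taken from the text.

```python
from typing import Iterable


def is_confident_gain(species_with_gene: Iterable[str],
                      clade_species: Iterable[str],
                      min_fraction: float) -> bool:
    """A gained HOG is kept if it is present in at least min_fraction
    of the extant species of the clade."""
    clade = set(clade_species)
    present = len(set(species_with_gene) & clade)
    return present / len(clade) >= min_fraction


def is_confident_duplication(n_duplicated_genes: int, min_copies: int = 5) -> bool:
    """A duplicated HOG is kept if it has at least min_copies duplicated genes."""
    return n_duplicated_genes >= min_copies


# Hypothetical labels standing in for the 14 extant trematodes.
trematodes = [f"trematode_{i}" for i in range(1, 15)]

gain_in_10 = is_confident_gain(trematodes[:10], trematodes, min_fraction=0.70)
gain_in_9 = is_confident_gain(trematodes[:9], trematodes, min_fraction=0.70)
dup_kept = is_confident_duplication(5)
dup_dropped = is_confident_duplication(4)
```

With 14 species, a HOG present in 10 of them (about 71%) passes the 70% threshold, whereas one present in 9 (about 64%) does not.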

We present the average Information Content (IC) per GO slim category, which can be used as a proxy for the specificity of a particular GO term (see Methods). The higher the IC, the more specific a term. For the gained genes, we found a progressive increase in the IC value of the different GO slim categories but we did not find an increase in the number of enriched GO slim functions or the number of unique genes within them (Figure 3C). The increase in average IC values of GO slim categories enriched for gained genes could suggest an increase in specificity of functions over time (Figure 3C). These observations are best illustrated with enriched GO slim functions such as catalytic activity (GO:0003824), including microtubule motor activity, but also cellular component organization (GO:0005634), including actin bundle filament organization and response to stimulus. A literature review relates them to the importance of the microtubule-based and actin-based cytoskeletal system building the outer body layering (tegument), through which the parasite interacts with the host environment. Microtubule-associated proteins in the tegument, including tubulin, paramyosin, actin, dynein light chains and various antiporters, participate in absorption and secretion (e.g. nitrogen utilization), transport of vesicles from sub-tegumental cells to the tegument cytoplasm, and cell motility (Githui et al., 2009; Young et al., 2010). Molecular characterization and immunostaining studies have also shown dynein light chains to function as tegument-associated antigens (Hoffmann & Strand, 1997; Jones et al., 2004; Yang et al., 1999), important in hiding from host immunity. The tegument has been shown to be an essential structure for adaptation to the external environment (Kim et al., 2012), including the pH of the digestive system of the hosts. Indeed, our results show dynein light chain, tegument-associated antigen, and a tubulin-beta chain to be the functions of 3 of the 12 HOGs duplicated since the Trematode ancestor and with at least 3 copies in 75% of the extant species (Supplementary Table 8). We also found dynein light chain to be the putative function of one of the most duplicated HOGs in A. winterbourni (Supplementary Table 4, see next section), as well as of a HOG duplicated in all 14 trematode species (Supplementary Table 9). Thus, we speculate that functions related to the tegument are also of great importance in our focal species.

The results might indicate acquisition of more complex and specific adaptations to hosts and environments over time. More experimentally-validated GO annotations in our species of interest could shed light on this hypothesis in the future.
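The Information Content summary discussed in this section follows IC(t) = −log(p(t)), averaged over the enriched GO terms of each GO slim category. A minimal Python sketch is given below; the natural logarithm is an assumption (the base is not stated in the Methods), and the term labels, slim labels and frequencies are invented toy values rather than UniProt-GOA data.

```python
from math import log
from typing import Dict


def information_content(term_frequency: float) -> float:
    """IC(t) = -log(p(t)); rarer (more specific) terms get a higher IC.
    The natural logarithm is an assumption of this sketch."""
    if not 0.0 < term_frequency <= 1.0:
        raise ValueError("term frequency must be in (0, 1]")
    return -log(term_frequency)


def mean_ic_per_slim(term_to_slim: Dict[str, str],
                     term_frequency: Dict[str, float]) -> Dict[str, float]:
    """Average the IC of all enriched terms mapped to each GO slim category."""
    grouped: Dict[str, list] = {}
    for term, slim in term_to_slim.items():
        grouped.setdefault(slim, []).append(information_content(term_frequency[term]))
    return {slim: sum(values) / len(values) for slim, values in grouped.items()}


# Hypothetical enriched terms, their slim categories and empirical frequencies.
term_to_slim = {"term_1": "slim_X", "term_2": "slim_X", "term_3": "slim_Y"}
term_frequency = {"term_1": 0.1, "term_2": 0.001, "term_3": 0.5}
mean_ic = mean_ic_per_slim(term_to_slim, term_frequency)
```

In this toy case slim_X, which groups two rare terms, receives a higher mean IC than slim_Y, which groups one common term.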

Gene loss in trematodes

Gene loss is known to be common in intracellular parasites (Sakharkar et al. 2004; Corradi 2015) and is much rarer in parasites with complex life cycles and multiple hosts (Zarowiecki and Berriman 2015). However, several helminths have lost the mitochondrial gene atp8 (Egger et al. 2017) or cytochrome P450 redox enzymes (Tsai et al. 2013), and show other functional losses and gene family contractions (International Helminth Genomes Consortium, 2019). Here, we again focused on ancestral genomes because they are inferred by accumulating gene presence and absence information from the extant genomes, i.e. if a gene is absent from all the extant genomes of a clade, we can assume it was lost in the last common ancestor of that clade. Thus, ancestral genome analysis is less prone to being undermined by poorer quality genomes (Deutekom et al., 2019). In our study, this robustness was reflected in the number of losses always being much lower in ancestral than in extant genomes (Supplementary Figure 7). We also performed a GO enrichment of the lost genes for A. winterbourni as well as the ancestors leading to it. For the ancestral genomes, the background population for GO enrichment was the union of all the GO terms in the extant children species constituting the previous ancestor to the ancestor of interest.

Although there was a progressive increase in the number of genes lost from Trematoda to Opisthorchiata/Xiphidiata ancestor, a GO enrichment analysis of lost genes did not reveal any functions to be enriched in the Trematoda or Opisthorchiata/Xiphidiata ancestor. In the Plagiorchiida ancestor we found loss of genes related to intrinsic components of membrane (GO:0016021) and wide pore channel activity (GO:0022829). We did not find any enrichment of GO terms for the lost genes in A. winterbourni. Since the functions of the lost genes appear to not be related to any specific biological processes, we speculate that there is a greater importance of gene gains and duplications in adaptation to parasitism.

Role of gene duplication and gain in driving adaptation of A. winterbourni

The Opisthorchiata species exhibit a high similarity in life cycle traits and set of hosts. The A. winterbourni genome exhibited proportions of gained, retained and duplicated genes since the Opisthorchiata/Xiphidiata ancestor (31.9%, 44.1%, 24% respectively) comparable to those of Opisthorchis viverrini (41%, 50.5%, 8.5% respectively), i.e. in both species the highest proportion of genes was retained and the smallest proportion of genes was duplicated. On the other hand, Opisthorchis felineus exhibited a much higher proportion of genes originating through duplication since the Opisthorchiata/Xiphidiata ancestor (52.4%) and Clonorchis sinensis had the most genes originating through gain since the Opisthorchiata/Xiphidiata ancestor (54.3%). Thus, across the four species, sometimes gene duplication and sometimes gene gain seems to play the greater role in gene family evolution.

However, it is important to note that inferences regarding gene duplications, gains, and losses in extant species rather than ancestral species are impacted to a greater extent by fragmentation in genome assemblies, likely inflating the numbers in these categories of genes.

The A. winterbourni genome revealed a massive expansion of 13 HOGs that occurred after the speciation from the Opisthorchiata/Xiphidiata ancestor (over 10 duplicated genes/HOG, comprising 221 genes, Supplementary Table 4). Comparing A. winterbourni to the Opisthorchiata/Xiphidiata ancestor, two gene families stood out due to the presence of more than 30 A. winterbourni genes: HOG 25969, with 31 genes in A. winterbourni out of 56 genes in all trematodes, and HOG 36190, with 36 genes in A. winterbourni out of 72 genes. In these two families, 29 and 31 genes originated through duplication since the Opisthorchiata/Xiphidiata ancestor for HOG 25969 and HOG 36190, respectively. In any other trematode, only 1-5 copies were found. Because a high proportion of BUSCO duplicated genes was found within the assembly, we investigated whether these genes were artificial duplications. Genes can appear artificially duplicated when they are fragmented by breaks between scaffolds (Alkan et al., 2011). We looked at the positions of the duplicated genes of HOG 25969 and 36190 on their scaffolds, and we did not find this to be the case (Supplementary Table 10). We thus concluded that these genes are likely real duplications rather than artificial duplications due to assembly fragmentation.

Functions of massively expanded gene families in A. winterbourni

Examination of GO annotations of the 13 HOGs with over 10 recently duplicated genes (Supplementary Table 4) led us to speculate that the genes are likely involved in host tissue invasion and exploitation (metallohydrolases, Baskaran et al. 2017), escape from host immunity (serpins, Bao et al. 2018), and host behavioural manipulation (glutamine synthase, Helluy et al. 2010) (Supplementary Table 4).

Specifically, we examined the two most highly duplicated gene families in depth. We determined HOG 36190 (36 genes in A. winterbourni) to be a gene family of putative glutamine synthases (Supplementary Table 4). Already from the Plagiorchiida ancestor to the Opisthorchiata/Xiphidiata ancestor there was a significant enrichment in biological processes and cellular components related to glutamine family amino acid metabolic processes, including glutamate ammonia ligase activity (GO:0004356), positive regulation of synaptic transmission, glutamatergic (GO:0051968), glutamate binding (GO:0016595), glutamate catabolic process (GO:0006538), and glial cell projection (GO:0097386) (Supplementary Table 5). The glutamine biosynthesis pathway is a pathway in which one of the end products is proline, a non-essential amino acid. An extremely active proline pathway has already been observed in most helminths infecting humans (Fasciola hepatica, schistosomes), with host-derived arginine used as a substrate (Ertel & Isseroff, 1976; Mehlhorn, 2016; Toledo & Fried, 2010). These excessive proline levels have been implicated in the pathogenesis of trematode infections. Proline alters antioxidant defenses, activating secondary metabolite virulence factors, but also provides an energy source for a metabolic shift appropriate for adaptation to the host environment (Ertel & Isseroff, 1976; Mehlhorn, 2016; Toledo & Fried, 2010). Glutamine synthase has also been found to be a marker for glial cells, immune cells of the central nervous system. A study on Microphallus papillorobustus, a trematode parasite of Gammarus crustaceans, found disruption of the glutamine metabolism in the brain of the gammarids due to astrocyte-like glia and nitric oxide production by the parasite metacercariae, resulting in altered neuromodulation and behaviour of the host (Helluy & Thomas, 2010). The gene family is thus especially interesting and a potential candidate in parasite-host interactions, as previous research has shown A. winterbourni to affect the behaviour of its snail host (Feijen et al. in prep; Levri & Lively, 1996).

The second most highly duplicated gene family in A. winterbourni was HOG 25969, with 31 genes. It consists of genes putatively encoding O-sialoglycoprotein endopeptidase, tRNA N6-adenosine threonylcarbamoyltransferase, metallohydrolase and/or glycoprotease/Kae1, all related to DNA repair, protein binding and metal ion binding (Supplementary Table 4). The GO annotation indicates the family to be potentially involved in DNA repair, nuclease activity and nucleic acid phosphodiester bond hydrolysis. Metalloproteases have been found to be duplicated and under positive selection in other parasitic worms (Strongyloides papillosus), where they are involved in host tissue penetration at the final larval stage (Baskaran et al., 2017).

Signatures of selection in two expanded gene families of A. winterbourni

We next investigated the two highly duplicated (> 30 genes) HOGs described above for signatures of positive selection. Signatures of positive selection were detected by comparing the dN/dS ratio at branches leading to the radiation of A. winterbourni genes, indicated with a #1 on the gene tree, with the dN/dS ratio of background branches (Figure 4, Supplementary Figure 9). Selection is generally considered negative/purifying if ω (or dN/dS) is less than one, neutral if ω is one, and positive if ω is greater than one.
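The model comparison used in this section rests on a likelihood ratio test between the null and the alternative branch-site model. A minimal sketch of that calculation is shown below, assuming a chi-square distribution with one degree of freedom (as reported in Table 2). The log-likelihood values are hypothetical, and the helper names are introduced for illustration; they are not part of codeml.

```python
from math import erfc, sqrt


def lrt_statistic(lnl_null: float, lnl_alt: float) -> float:
    """Likelihood ratio test statistic: twice the log-likelihood difference."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_pvalue_df1(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with one degree of
    freedom: P(X > x) = erfc(sqrt(x / 2))."""
    if statistic <= 0.0:
        return 1.0
    return erfc(sqrt(statistic / 2.0))


def selection_regime(omega: float, tolerance: float = 1e-9) -> str:
    """Interpretation of omega (dN/dS) as given in the text."""
    if omega > 1.0 + tolerance:
        return "positive"
    if omega < 1.0 - tolerance:
        return "negative/purifying"
    return "neutral"


# Hypothetical log-likelihoods of the null and the alternative model.
statistic = lrt_statistic(lnl_null=-1000.00, lnl_alt=-991.55)
p_value = chi2_pvalue_df1(statistic)
```

With these toy log-likelihoods the statistic is 16.9 and the p-value is on the order of 4E-05, so the alternative model would be preferred at the 0.05 level.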

HOG 36190 was the most massively expanded HOG, and selection was found to be acting on some but not all genes within this family. In the dN/dS ratio analysis, the null model (allowing ω ≤ 1) explained the data better than the alternative model (allowing ω > 1) for 2 out of 3 of the investigated branches, indicating neutral evolution (Table 2, Supplementary Figure 9). The signature of selection was detected on only one branch, a long branch leading to a subset of 13 A. winterbourni genes within this family (Supplementary Figure 9, branch #1.3). Eleven sites had a >50% probability of being under positive selection, one of them with a probability >90%. From this we conclude that selection might be acting on some, but not all, genes within this family, potentially indicating a certain structure evolving under positive selection. However, considering that we do not find selection on any other branches in the gene tree, it must also be taken into account that genes in this family might be proliferating because they are in genomic locations prone to duplication events. Their increasing number may cause redundancy, which can ultimately be deleterious to the organism (Schiffer et al., 2016).

Figure 4. Gene tree of gene family HOG 25969 created with IQ-TREE. The tree is unrooted. Each name is a species name followed by the original gene name (protein name). A. winterbourni gene names are shortened versions of the gene names in Supplementary Table 10. The numbers above branches indicate bootstrap support; for the #1 branches the bootstrap support is given after a backslash. The branches labelled with #1.X indicate the separation between the foreground branches and the background branches (distinction used in codeml for investigation of selection). The test for selection compares the dN/dS between the foreground branch and the background branches. The total number of genes in this HOG per trematode species is given next to each species name.

For the gene family HOG 25969, the alternative model (allowing ω > 1) explained the data better than the null model (allowing ω ≤ 1) for all the investigated branches, indicating a signature of positive selection on all of the investigated foreground branches (Table 2, Figure 4). We followed up this result with the post hoc BEB (Bayes Empirical Bayes) analysis implemented in the alternative model (Yang et al., 2005). For the branch leading to all 31 A. winterbourni genes, the BEB analysis identified 39 amino acid residues to be under positive selection in the alignment, with 4 sites having an over 95% probability of being selected. For sites under positive selection among different subsets of foreground lineages see Table 3. Analysis of positive selection on the structures of the enzyme showed the active, DNA or metal binding site to have the highest probability of being under selection, suggesting an important role (Supplementary Figure 8, Supplementary Results 5). However, without experimental characterization it is difficult to say what role the family might be playing in A. winterbourni.


Table 2. Results of studying positive selection in two majorly expanded gene families in A. winterbourni. The HOG indicates the ID of the gene family. The node relates to the nodes indicated on the gene trees of each HOG. LRT: result of the likelihood ratio test; the p-value is the result of the χ² test of the LRT. Positively selected sites are the result of the BEB (Bayes empirical Bayes) test implemented in codeml. The starred values indicate sites under significantly high probability of selection (>95%).

HOG | node | LRT | df | p-value | positively selected sites (position in the alignment, amino acid, probability of being under positive selection)
25969 | #1.1 | 25.6 | 1 | 4.13E-07 | 603 A 0.767; 607 S 0.664; 638 N 0.707; 649 V 0.935; 675 S 0.669; 680 C 0.912; 701 I 0.684; 705 K 0.541; 716 Y 0.852; 719 C 0.966*; 794 A 0.575; 795 K 0.694; 798 I 0.662; 803 S 0.762; 804 G 0.556; 812 R 0.893; 833 S 0.514; 871 A 0.695; 879 Q 0.889; 921 N 0.983*; 922 I 0.548; 932 N 0.516; 951 K 0.683; 955 H 0.878; 970 T 0.513; 976 Q 0.846; 986 N 0.906; 989 F 0.544; 994 S 0.624; 996 F 0.957*; 998 V 0.504; 1038 Q 0.542; 1073 S 0.605; 1090 S 0.875; 1095 Y 0.951*; 1121 S 0.508; 1184 R 0.931; 1188 I 0.521; 1198 H 0.700
25969 | #1.2 | 5.5 | 1 | 1.80E-02 | 366 K 0.537; 394 W 0.53; 464 K 0.661; 473 N 0.832; 637 T 0.623; 657 H 0.830; 662 D 0.692; 665 S 0.707
25969 | #1.3 | 4.3 | 1 | 3.70E-02 | 402 R 0.867; 873 F 0.624
25969 | #1.4 | 26.7 | 1 | 2.30E-07 | 370 T 0.756; 394 W 0.559; 480 R 0.868; 482 L 0.590; 611 C 0.632; 710 N 0.697; 714 S 0.733; 722 M 0.740; 749 P 0.791; 814 E 0.637; 831 N 0.534; 857 E 0.984*; 928 S 0.766; 929 G 0.986*; 931 N 0.687; 932 N 0.546; 967 H 0.955*; 970 T 0.811; 988 M 0.940; 994 S 0.729; 1004 N 0.974*; 1029 F 0.676; 1044 D 0.846; 1080 E 0.767; 1157 N 0.674; 1162 Y 0.635; 1184 R 0.922; 1189 L 0.886
36190 | #1.1 | 0.00017 | 1 | 0.98 | -
36190 | #1.2 | 1.9 | 1 | 0.17 | -
36190 | #1.3 | 16.9 | 1 | 3.9E-05 | 291 A 0.725; 367 E 0.910; 370 K 0.745; 371 K 0.827; 873 K 0.626; 874 L 0.731; 875 S 0.800; 878 Y 0.564; 879 V 0.518; 880 P 0.707; 1030 - 0.681

Conclusions

In our study we report a new de novo sequenced genome of a digenean trematode parasite, Atriophallophorus winterbourni, its phylogenetic position among other digenean trematodes, and the time of speciation of its ancestor from the Opisthorchiata suborder. Using 14 other currently available and well-studied parasitic digenean trematodes we reconstruct the ancestral trematode genome and investigate which genes have originated through duplication, which were gained and which have remained conserved (retained) through each speciation point until the extant genome of A. winterbourni. The comparative genomic approach is a powerful tool for identifying candidate duplicated gene families involved in adaptation. We find 13 gene families expanded recently in A. winterbourni, and for two we infer signatures of positive selection. Our description of candidate gene families putatively involved in parasite infectivity will facilitate the identification of genomic regions directly involved in the host-parasite coevolutionary arms race and will facilitate studying coadaptation in the laboratory. Gene expression studies in diverse life-cycle stages and functional confirmation via e.g. RNAi knock-out studies will be required to provide a direct link between the genes and phenotypes involved. By focusing on gene duplications and retention across the digenean trematodes our work informs on the genomic basis of adaptation to parasitic lifestyles and paves the way for future adaptation genomics focusing on antagonistic relationships between host and parasites.

Data Availability Statement

Data deposition: The reference genome of Atriophallophorus winterbourni has been deposited on NCBI: https://www.ncbi.nlm.nih.gov/nuccore/JACCGJ000000000. It will soon be released with annotation at WormBase Parasite.

Acknowledgments

We thank Frida Feijen for helpful comments on the manuscript and Kirsten Klappert for help in field work and sample collection. We thank Alex Warwick Vesztrocy for data of IC scores calculated for the GO terms in the OMA database. Finally, we thank the reviewers for their constructive comments on this work. The research was funded by an ETH grant ETH-36 15-2 obtained by Jukka Jokela and Hanna Hartikainen. Jukka Jokela also acknowledges Swiss National Science Foundation grant #31003A_166667. Christophe Dessimoz further acknowledges Swiss National Science Foundation grant #183723. Data produced and analyzed in this article were generated in collaboration with the Genetic Diversity Centre (GDC), ETH Zurich.


Supplementary Information

Table of contents:
1. Supplementary Methods
   a. Supplementary methods 1. Hatching adult worms from metacercariae for DNA extraction.
   b. Supplementary methods 2. Details of annotation of the protein coding sequence of the reference genome of Atriophallophorus winterbourni.
   c. Supplementary methods 3. The assessment of completeness and quality of the annotation.
   d. Supplementary methods 4. Verification of experimental evidence for GO annotation of the A. winterbourni genome.
   e. Supplementary methods 5. Filtering of codon alignments of HOGs for positive selection analysis.
   f. Supplementary methods 6. Code used for the analysis in Jupyter notebooks:
      i. Each_ancestral_genome_reconstruction_until_Atriophallophorus
      ii. Extant_trematodes_compared_to_ancestral_trematode
      iii. Atriophallophorus_compared_to_Opisthorchiata_Xiphidiata_ancestor
   g. Supplementary methods 7. Study of protein structure of HOG 25969.
2. Supplementary Results
   a. Supplementary results 1. Genome assembly.
   b. Supplementary results 2. Gene Ontology annotation of Atriophallophorus winterbourni using OMA, Pannzer2 and EggNOG.
   c. Supplementary results 3. Quality assessment of annotation of protein coding sequences in the genome of A. winterbourni using full length transcript analysis.
   d. Supplementary results 4. Protein length of duplicated, gained and retained genes in extant trematodes since the trematode ancestor.
   e. Supplementary results 5. Study of protein structure of HOG 25969.


3. Supplementary Figures
   a. Supplementary Box 1. Details of life cycles of digenean trematodes used in the study.
   b. Supplementary Figure 1. Propidium iodide staining plot showing peaks for Drosophila melanogaster (standard, haploid genome size 175 Mb) and Atriophallophorus winterbourni. PI stains all DNA, therefore the genome size can be calculated from a relative measure between a standard and a focal sample.
   c. Supplementary Figure 2. GenomeScope k-mer profile plot (21-mer frequency distribution) for Illumina filtered data. The first peak, located at coverage 21X, corresponds to the heterozygous peak. The second peak, at coverage 42X, corresponds to the homozygous peak. The estimate of the heterozygous portion is 3.6%.
   d. Supplementary Figure 3. BlobPlot of the A. winterbourni assembly. The assembly was assessed for taxonomic uniformity using Blobtools 0.9.19.5. All assembly scaffolds are depicted as circles and the diameter of the circle indicates sequence length. All scaffolds >50 000 bp and a random sample of scaffolds <50 000 bp from the assembly were submitted to BLAST using the NCBI nr database for taxonomic annotation. The circles are coloured by taxonomic annotation. Grey indicates the scaffold was not submitted to BLAST. Circles are positioned according to their proportion of GC content and the coverage of the reads which map onto the scaffolds.
   e. Supplementary Figure 4. Frequency distribution of the total length of proteins of 14 species of trematodes used in the analysis. The higher peaks belong to genomes with a BUSCO completeness score (proportion of single copy orthologs) below 45% (indicated by an arrow on the figure and by a star in the legend) and the lower peaks belong to genomes with a BUSCO completeness score above 45%. Atriophallophorus winterbourni falls into the group of genomes with higher BUSCO completeness and a greater number of longer proteins.
   f. Supplementary Figure 5. The number of functionally annotated genes of Atriophallophorus winterbourni by Pannzer2, OMA, EggNOG and their overlap in the genes they annotated. All three programs annotated 1281 genes.
   g. Supplementary Figure 6. Species tree created with ASTRAL III. 238 gene trees for 238 Orthologous Groups were created with MAFFT and together used in ASTRAL III to create a unifying species tree.
   h. Supplementary Figure 7. Original tree created in pyHam showing the total number of gained, duplicated, retained and lost genes on each branch of the phylogeny of species used in this study.
   i. Supplementary Figure 8. (A) A HOG 25969 PDB model (red) superimposed with DNA bound FEN1 nuclease (5V07, white) viewed from two different rotations. Green colour indicates sites with more than 90% probability of being under selection. Phe280 and Tyr127 indicate two sites within two alpha helices which correspond to part of the typical DNA binding site present in FEN1 nucleases and show positive selection at the site of two aromatic residues. The site numbers correspond to the MSA alignment on which the model was based and are equivalent to sites 996 (Phe280) and 1095 (Tyr127) in Table 4 for branch #1.1. (B) A zoomed-in image of the sites under selection.
   j. Supplementary Figure 9. Gene tree of gene family HOG 36190 created with IQTree. The tree is unrooted. Each original gene name (protein name) is followed by the species name; the names of genes of A. winterbourni refer to the names in Supplementary Table 11. The numbers above branches indicate bootstrap support; for the #1 branch the bootstrap support is separated with a slash. The #1 indicates the separation between the foreground branches and the background branches (the distinction used in codeml for investigation of selection). The test for selection compares the dN/dS between the foreground and the background branches.
4. Supplementary Tables
   a. Supplementary Table 1. Information on the genome assemblies used in the analysis. For more information, see Table 1. The BUSCO results refer to reference genome data.


   b. Supplementary Table 2. Number of genes from all 14 trematode genomes with GO terms after annotation with Pannzer2, OMA and EggNOG.
   c. Supplementary Table 3. Number of Illumina and PacBio reads at each stage of data filtering leading to the number of reads used for the assembly of the reference genome of A. winterbourni. Illumina data were filtered with Trimmomatic 0.35, see methods (phred score 5, average quality per base 15, removal of reads shorter than 50 bp, removal of exact duplicates).
   d. Supplementary Table 4. Top 13 highly expanded gene families (>=10 duplicated genes) in A. winterbourni since the Opisthorchiata/Xiphidiata ancestor. For all duplicated genes of A. winterbourni in these families we performed BLAST putative function annotation, annotation of Interpro superfamily or domain and GO annotation using OMA, Pannzer2 and EggNOG.
   e. Supplementary Table 5. GO enrichment analysis of duplicated, gained and retained (1:1 orthologs) genes from every previous ancestor to every next ancestor (Trematoda ancestor, Plagiorchiida ancestor, Opisthorchiata/Xiphidiata ancestor) in the phylogeny leading to Atriophallophorus winterbourni. The table also includes GO enrichment analysis of gained, duplicated and retained genes from the Opisthorchiata/Xiphidiata ancestor to the extant genome of A. winterbourni. The GO annotations of all extant species that were then used to annotate reconstructed ancestral genomes were performed with EggNOG, OMA and Pannzer2 (indicated by the “Source” column). The GO enrichment analysis was performed with GOAtools. The GO functions are divided into Biological Processes (BP), Cellular Components (CC) and Molecular Functions (MF). “Ratio in study” indicates the number of duplicated, gained or retained genes from a particular genome (ancestral or extant) related to a particular function. “Ratio in pop” indicates all the genes related to that function from the whole genome (ancestral or extant). Ratios are calculated from those two values and the fold change is estimated by comparing ratio in study and ratio in population. “p-bonferroni” are Bonferroni corrected p-values for the statistical overrepresentation of these GO terms in study vs in population.
   f. Supplementary Table 6. The total number and proportions of genes that originated through duplication (duplicated), number of genes that originated through duplication with a minimum of 3 genes per family (duplicated >=3), number of 1:1 orthologs (retained) and number of gained genes in each extant species when compared to the model of the ancestral trematode genome.
   g. Supplementary Table 7. Retained genes (1:1 orthologs) shared between all extant trematodes since the trematode ancestor. For each species and each HOG a gene ID is provided, together with a putative function established through the annotation of the best studied species (F. hepatica, S. mansoni, S. japonicum).
   h. Supplementary Table 8. A list of gene families (HOGs) in which genes have originated through duplication since the most recent ancestor of all trematode species, with a minimum of three genes per family and shared between at least 11 out of the 14 species. Their putative function is assessed through annotation of the best studied trematodes (Fasciola hepatica, Schistosoma mansoni and Schistosoma japonicum) and all gene IDs from the 3 species are provided.
   i. Supplementary Table 9. GO enrichment analysis of duplicated genes from HOG 35588 and HOG 33101 for all trematode species. HOG 35588 and HOG 33101 are HOGs with duplicated genes (more than 3 copies) for all trematode species since the trematode ancestor. The GO enrichment analysis was performed with GOAtools. The GO functions are divided into Biological Processes (BP), Cellular Components (CC) and Molecular Functions (MF). “Ratio in study” indicates the number of each species' duplicated genes related to a particular GO term. “Ratio in pop” indicates all the genes related to that function from the whole genome. Ratios are calculated from those two values and the fold change is estimated by comparing ratio in study and ratio in population. "Fold change" compares the two values. “p-bonferroni” are Bonferroni corrected p-values for the statistical overrepresentation of these GO terms in study vs in population.
   j. Supplementary Table 10. Atriophallophorus winterbourni genes within the two biggest HOGs (see Supplementary Table 12) investigated for a signature of positive selection. Here we provide the position of each gene on each scaffold and the distance of the gene from the end edge of the scaffold. It is visualized with the figure next to it, where genes are indicated in yellow. Figures were created with the package chromoMap in R.
   k. Supplementary Table 11. Classification of genome content masked from annotation. RepeatModeler 1.0.11 and RepeatMasker 4.0.7 were used to classify repetitive content within the assembly.
   l. Supplementary Table 12. GO enrichment analysis using OMA, Pannzer2 and EggNOG for all 14 trematode species used in the study of: (1) duplicated genes (>=3 copies) from HOGs shared between at least 11 trematode species since the trematode ancestor (see Supplementary Table 6); (2) retained genes shared between all 14 trematode species since the trematode ancestor (Supplementary Table 5). The GO enrichment analysis was performed with GOAtools. The GO functions are divided into Biological Processes (BP), Cellular Components (CC) and Molecular Functions (MF). “Ratio in study” indicates the number of each species' duplicated genes related to a particular GO term. “Ratio in pop” indicates all the genes related to that function from the whole genome. Ratios are calculated from those two values and the fold change is estimated by comparing ratio in study and ratio in population. "Fold change" compares the two values. “p-bonferroni” are Bonferroni corrected p-values for the statistical overrepresentation of these GO terms in study vs in population.
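The quantities reported in the GO enrichment tables ("Ratio in study", "Ratio in pop", fold change and "p-bonferroni") can be illustrated with a minimal Python sketch. It assumes a one-sided hypergeometric test for overrepresentation as a stand-in; the test and defaults actually applied by GOAtools may differ, and all function names and counts below are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n genes are drawn from N genes, K of which carry the term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


def go_enrichment(study_hits: int, study_size: int,
                  pop_hits: int, pop_size: int, n_tests: int) -> dict:
    """Ratios, fold change and Bonferroni corrected p-value for one GO term."""
    ratio_in_study = study_hits / study_size
    ratio_in_pop = pop_hits / pop_size
    p_raw = hypergeom_upper_tail(study_hits, study_size, pop_hits, pop_size)
    return {
        "ratio_in_study": ratio_in_study,
        "ratio_in_pop": ratio_in_pop,
        "fold_change": ratio_in_study / ratio_in_pop,
        "p_uncorrected": p_raw,
        # Bonferroni: multiply by the number of GO terms tested, cap at 1.
        "p_bonferroni": min(1.0, p_raw * n_tests),
    }


# Hypothetical counts: 5 of 10 study genes and 10 of 100 genome genes carry
# the term, and 20 GO terms were tested in total.
result = go_enrichment(study_hits=5, study_size=10,
                       pop_hits=10, pop_size=100, n_tests=20)
print(result)
```

In this toy case the term is five times more frequent in the study set than in the whole genome, and the corrected p-value is simply the uncorrected one multiplied by the number of tests.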


Supplementary Methods

1. Hatching adult worms from metacercariae for DNA extraction.
Metacercariae were obtained from each snail separately through dissection under the microscope. To initiate hatching, the metacercariae were incubated at 40 °C for 2-4 h in Tyrode’s salt solution supplemented with pancreatin (Sigma P3292; 0.15 g per 50 mL of Tyrode’s salt solution), 100 mg/mL Penicillin G (Fluka 13752) and 0.1 g/mL Streptomycin (Fluka 85880). For Tyrode’s salt solution we mixed Tyrode’s salts (Sigma T2145-10x1L) with 1 L of MilliQ water and 1 g of sodium bicarbonate (NaHCO3). After all the worms had hatched to their adult stage they were washed twice with Tyrode’s salt solution and antibiotics (100 mg/mL Penicillin G, Fluka 13752 and 0.1 g/mL Streptomycin, Fluka 85880) to remove all the remaining cysts shed by the hatched worms, and transferred to a 1.5 mL tube (Eppendorf, safe lock). The worms were immersed in no more than 10 µL of washing solution and washed with 200 µL of PBS to be immediately used for DNA extraction.

2. Details of annotation of the protein coding sequence of the reference genome of Atriophallophorus winterbourni.
Coding sequence annotation was performed with Maker 2.31.9. Transcriptome data (Bankers & Neiman, 2017) from the same species, available at DDBJ/EMBL/GenBank under the accession GFFK00000000, were used as input together with the invertebrate protein database from UniProt for the initial creation of gene models, which were subsequently used to train the gene finding tools Augustus 3.2.1 and SNAP (version 2006-07-28). Transcriptomic evidence from our species as well as closely related species (Schistosoma mansoni (Protasio et al., 2012) and Caenorhabditis elegans (Kaletsky et al., 2018)) provided a more accurate result for the gene models. We produced a customized species library of repetitive elements for the annotation using RepeatModeler 1.0.11 and RepeatMasker 4.0.7 on the scaffolds (Smit & Hubley, 2018; Smit, Hubley, & Green, 2015). All repetitive content collected by RepeatModeler was then submitted to blastx 2.3.0 to confirm that no proteins, hypothetical proteins or coding sequences were excluded from annotation. The confirmed repetitive content was masked in all rounds of Maker annotation.

3. Assessment of completeness and quality of the annotation
The completeness and quality of the annotation were assessed with BUSCO and with full-length protein coding sequence analysis using BLAST+ according to the following protocol:

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Counting-Full-Length-Trinity-Transcripts (last accessed: 11.2018). The “Trinity transcript” analysis was done to assess how many genes are full length or nearly full length. All coding sequences from the Maker annotation were compared to all proteins from the UniProt database (558681 proteins, release 7.11.2018) using blastx (e-value 1e-20).

4. Verification of experimental evidence for GO annotation of the A. winterbourni genome
GO annotation of the A. winterbourni genome was performed with Pannzer2, OMA and EggNOG (see main methods). QuickGO was used to check whether each GO term was supported by experimental evidence in manual assertion (ECO:0000269), mutant phenotype evidence in manual assertion (ECO:0000315), high throughput direct evidence in manual assertion (ECO:0007005) or high throughput mutant phenotypic evidence used in manual assertion (ECO:0007001) in trematodes or nematodes (species: Schistosoma haematobium, Schistosoma japonicum, Schistosoma mansoni, Schistosoma mattheei, Fasciola hepatica, Clonorchis sinensis and Caenorhabditis elegans, taxon identifiers in https://www.uniprot.org/taxonomy/: 6185, 6182, 6183, 31246, 6912, 79923, 6239 respectively) (https://www.ebi.ac.uk/QuickGO/annotations, last accessed: 9.03.2020).

5. Filtering of codon alignments of HOGs for positive selection analysis.
Positive selection analyses are sensitive to alignment errors. We therefore used the filtering approach suggested in the Selectome database (Moretti et al., 2014) for the HOG 36190 alignment, which was ridden with gaps. Short, low quality sequences disrupting the alignment were filtered with MaxAlign, leaving 55 sequences (out of 72). The alignment was performed with Clustal Omega and quality scores of each amino acid were computed with MCoffee 11.00.d625267 and Guidance2. Amino acids with quality scores below 93% in Guidance2 and 60% (below “Good”) in MCoffee were replaced with “X” and the alignment was converted to a codon alignment with PAL2NAL. Poor alignment columns were filtered out using TrimAl v1.3 (noallgaps, columns composed only of gaps).

6. See Jupyter notebook for supplementary method 6 or https://github.com/zajacn/comparative_genomics_trematodes

7. Study of structure of HOG 25969.


To get a better description of this highly expanded gene family with no evident functional annotation or obvious homology to known proteins, we used remote homology detection and structural modelling tools. To model each of the sequences in the expanded family, we chose to use MODELLER (Webb & Sali, 2016) in tandem with HHBlits to select appropriate structural templates, align them taking into account structural constraints and finally generate models for each sequence using multiple crystal structure templates. This approach allows for template based modelling using limited computational resources and enables modelling of each member of the expanded family to allow for comparative analysis between models if needed. To generate a query alignment for HHBlits (Hildebrand, Remmert, Biegert, & Söding, 2009; Remmert, Biegert, Hauser, & Söding, 2011), the sequences assigned to HOG 25969 from the A. winterbourni genome were aligned with Clustal Omega (Sievers & Higgins, 2014) for 3 iterations on default parameters. The resulting MSA was used to query the pdb70 database with HHBlits. Details on the content and construction of the pdb70 database are available on the mmseqs (Mirdita et al., 2017) website (https://uniclust.mmseqs.com/, last accessed 04.2020). The protein structure was coupled with the BEB results on the probability of selection at each site from the codeml analysis with #1.1 as the foreground branch (see methods, Estimation of dN/dS in gene families in Atriophallophorus winterbourni). A PyMOL (pymol.org) plugin was written to colour the residues of the models by the probabilities calculated for their positive selection, so that positive selection could be visualized in an intuitive way on the final models.

Supplementary Results

1. Genome assembly.
The sequencing data were obtained using Illumina and Pacific Biosciences technologies (for the number of reads at each stage, see Supplementary Table 3). After filtering, 244,185,272 paired Illumina reads and 1,981,809 PacBio reads spanning 1 to 35 kb in length were used for the assembly. Due to the high heterozygosity of the Illumina data (assessed by 21-mer frequency distributions, Supplementary Fig 2) and the inflation of the genome size in the initial assembly, we used Redundans for correction. We submitted the resulting assembly, combined with the existing A. winterbourni transcriptome (Bankers & Neiman, 2017), to AGOUTI to improve the scaffolding and contiguity of the assembly (Zhang et al., 2016). The scaffolds ranged from 3,947 bp to 525,914 bp with an N50 of 40,108 bp. Blobtools showed that 63.7% of the reads mapped to scaffolds with no taxonomic assignment, as those scaffolds had not been submitted to BLAST (Supplementary Fig 3). Out of the 5739 scaffolds submitted to BLAST, 1069 scaffolds mapped to Platyhelminthes (Supplementary Fig 3). 644 scaffolds mapped to Mollusca, amounting to a total of 0.92% of the length of the 5739 scaffolds. Due to such negligible contamination with potential host DNA, the Blobtools output was not used to filter the assembly. Nevertheless, to eliminate any concern about host contamination, the Potamopyrgus antipodarum transcriptome (Wilton, Sloan, Logsdon, Doddapaneni, & Neiman, 2013) obtained from http://bioweb.biology.uiowa.edu/neiman/download.php (date accessed 15.07.2020) was used to filter the assembly. The host transcriptome was converted to a BLAST database and A. winterbourni genes were mapped to it, searching for 100% matches (e-value = 1e-7). Consequently, 4 genes were eliminated. The assembly yielded 70.1% of the complete BUSCO genes, including 54.5% single copy and 15.6% duplicated genes.

2. Gene Ontology annotation of Atriophallophorus winterbourni using OMA, Pannzer2 and EggNOG.
OMA (Altenhoff et al., 2017), Pannzer2 (Törönen et al., 2018) and EggNOG (Huerta-Cepas et al., 2016) were used for Gene Ontology annotation of the A. winterbourni genome. All three algorithms perform orthology-based GO annotation; however, each of them uses a different database of species and they differ in prioritizing precision vs coverage. We used all three to better understand our results. Pannzer2 annotated 5222 genes, yielding 88,789 gene-GO term combinations; OMA annotated 8106 genes, yielding 42,253 gene-GO term annotations; and EggNOG annotated 3032 genes, yielding 413,485 gene-GO term annotations (Supplementary Fig 5).
From EBI QuickGO annotations we compiled a dataset of experimentally annotated GO terms from nematodes and trematodes (species: Schistosoma haematobium, Schistosoma japonicum, Schistosoma mansoni, Schistosoma mattheei, Fasciola hepatica, Clonorchis sinensis and Caenorhabditis elegans, see Methods for details). The dataset comprised 30,907 GO terms. We compared the GO terms annotated in A. winterbourni to this dataset. We found 28,255 GO terms out of 88,789 (32%) for the Pannzer2 GO annotation, 19,164 GO terms out of 42,253 (45.3%) for the OMA GO annotation and none of the EggNOG annotations to be based on experimental evidence in nematodes or trematodes (see methods).
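The comparison against the experimentally supported GO terms amounts to a set-membership count over gene-GO term combinations. The following Python sketch illustrates the idea under the assumption that annotations are available as (gene, GO term) pairs; the function name, gene identifiers and the toy term set are hypothetical and do not reproduce the actual dataset.

```python
from typing import Iterable, Set, Tuple


def experimental_support(annotations: Iterable[Tuple[str, str]],
                         experimental_terms: Set[str]) -> Tuple[int, int, float]:
    """Count gene-GO combinations whose GO term has experimental evidence."""
    pairs = set(annotations)  # each gene-GO combination is counted once
    supported = sum(1 for _gene, term in pairs if term in experimental_terms)
    fraction = supported / len(pairs) if pairs else 0.0
    return supported, len(pairs), fraction


# Toy data only: three genes annotated with three GO terms.
toy_annotations = [
    ("gene1", "GO:0005515"),
    ("gene1", "GO:0008150"),
    ("gene2", "GO:0005515"),
    ("gene3", "GO:0003677"),
]
toy_experimental = {"GO:0005515", "GO:0003677"}

supported, total, fraction = experimental_support(toy_annotations, toy_experimental)
print(f"{supported} of {total} gene-GO combinations ({fraction:.0%})")
# prints "3 of 4 gene-GO combinations (75%)"
```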


3. Quality assessment of annotation of protein coding sequences in the genome of A. winterbourni using full length transcript analysis.
A “Trinity transcript” analysis using BLAST+ was performed on proteins annotated with Maker by comparing them to already existing sequences from other species (all proteins from the UniProt database, release 7.11.2018, 558681 proteins) to assess how many proteins are full length or nearly full length. It has to be taken into account that only a single best matching coding sequence is reported for each top matching database entry; in other words, if multiple proteins match a database entry, or if any database entry matches multiple proteins from the studied species, each protein is counted once along with its best matching database entry (highest BLAST score) and the rest of the results are discarded. Thus the results have to be understood as a measure of improvement of the annotation over multiple Maker annotation rounds rather than as an absolute assessment of completeness of the assembly. The first round of Maker annotation yielded 1747 proteins covered over more than 80% of their protein length (3349 covered over more than 10% of their full length) and the final round of Maker annotation yielded 2317 proteins covered over more than 80% of their protein length (4409 covered over more than 10% of their full length).

4. Comparison of extant trematodes to the ancestral trematode genome.
According to our analysis, the ancestral trematode had 13,296 genes, most of which were retained since speciation from Cestoda (6907 genes, 51.9%), 1452 of which originated through duplication (10.9%), and 4937 of which were newly acquired (37.2%) (Fig 3). Subsequently, we compared each extant trematode species to the ancestral genome to trace the evolutionary history of the extant genomes, focusing on duplicated and retained genes since the trematode common ancestor. We then used these categories of extant genes to find shared gene families between A. winterbourni and other trematodes. The total number of 1:1 orthologous (retained) gene families per species can be seen in Supplementary Table 6. As mentioned in the main results, there were 28 gene families which were retained at a 1:1 orthologous gene ratio in all 14 trematode species (Supplementary Table 7). Examination of their putative functions through the annotations of the best studied trematodes (Fasciola hepatica (NCBI, 2017), Schistosoma mansoni (Protasio et al., 2012; Wang et al., 2016)) showed that the retained gene families were largely involved in cell functioning and growth, division and cell-to-cell or protein-to-protein interactions. However, Gene Ontology enrichment analysis of genes from every species from these HOGs showed no enrichment of any specific GO functions (Supplementary Table 12). We studied duplicated gene families (or HOGs) with at least 3 genes that originated through duplication since the most recent common ancestor of all trematodes. The total number per species can be seen in Supplementary Table 6. There were 12 families with at least 3 duplicated genes, shared between at least 11 trematode species (Supplementary Table 8). For the putative functions of these families based on annotation of the most well-known and studied trematode species (Fasciola hepatica (NCBI, 2017), Schistosoma mansoni (Protasio et al., 2012; Wang et al., 2016)) see Supplementary Table 8, and for GO enrichment analysis results see Supplementary Table 12. There were 2 duplicated gene families, HOG 33101 and HOG 35588, present in all 14 trematode species (138 and 105 genes respectively). The putative functions of these gene families based on BLASTP showed them to be dynein light chain type 1 (HOG 33101) and cathepsin B-like peptidase (HOG 35588), and GO annotation of genes from these families for all species confirmed the findings. GO enrichment analysis of the 138 genes from HOG 33101 indicated the genes to be implicated in dynein complex and microtubule based process. GO enrichment of the 105 genes from HOG 35588 indicated the genes to be involved in cysteine-type endopeptidase activity and production of pigment granule (Supplementary Table 9).

5. Protein structure of HOG 25969.
One domain appears to be conserved across the entire HOG 25969 and has an appropriate structural template, detectable using HHBlits, in the FEN1 nuclease family of enzymes. Several sites of the domain show significant positive selection according to the codeml BEB results of the #1.1 branch (Fig 4).
We were able to leverage our existing knowledge of the FEN1 structure (pdb70 database) and our observations on selection acting on the nucleotide sequence of this gene family to gain possible insight into functionally important sites of the final protein. Multiple FEN1 nuclease templates mapped with ~15% identity to the template sequences. With such a low mapping identity, only limited conclusions can be drawn from an Ångström-scale assessment of the models. We would therefore like to draw attention to multiple other factors that could influence the folding or conformation of the FEN1 domain, including other domains of the protein which are not modelled in our study and the DNA typically bound to FEN1, whose inclusion while optimizing the placement of side chains typically results in more accurate modelling. The residues of the domain were coloured with the probability values of them being under positive selection (see Supplementary method 7). Two residues, Phe280 and Tyr127, stood out due to having a high (>90%) probability of being under positive selection and appearing as highly conserved in the protein sequence alignment (see Supplementary Fig. 8). The two sites are within two alpha helices which correspond to part of the typical DNA binding site present in FEN1 nucleases and show positive selection at the site of two aromatic residues (Shi, Hellinga, & Beese, 2017). This is particularly important for DNA binding enzymes, since pi stacking interactions involving aromatic rings enable binding between the DNA and the enzyme by lowering the overall energy through a configuration where aromatic and conjugated double bonds in the DNA and enzyme are in close proximity. This pi-stacking interaction, which is dependent on the orientation and distance of the aromatic residue and nucleic acid, has a binding energy of approximately –43 kJ mol−1 (Gallivan & Dougherty, 1999). That these sites would be positively selected in this expanded gene family may point to their importance in the DNA binding function of this particular gene product. However, no robust evolutionary explanations or conclusions can be drawn on the recent expansion without in-vivo experimentation.
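One common way to colour model residues by a per-site value is to write that value into the B-factor column of the PDB file and then colour by B-factor in a viewer. The Python sketch below illustrates this under stated assumptions: it is not the plugin used in this study, the function names and coordinates are hypothetical, and only the fixed-column layout of PDB ATOM records (residue number in columns 23-26, B-factor in columns 61-66) is relied upon. The two probabilities are those listed in Table 2 for the sites corresponding to Phe280 and Tyr127.

```python
from typing import Dict, Iterable, List


def write_probabilities_to_bfactor(pdb_lines: Iterable[str],
                                   site_probability: Dict[int, float],
                                   default: float = 0.0) -> List[str]:
    """Replace the B-factor of every atom with its residue's probability."""
    recoloured = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            residue_number = int(line[22:26])                 # columns 23-26
            value = site_probability.get(residue_number, default)
            line = f"{line[:60]}{value:6.2f}{line[66:]}"      # columns 61-66
        recoloured.append(line)
    return recoloured


def make_atom_line(serial: int, atom: str, residue: str, chain: str,
                   number: int, x: float, y: float, z: float) -> str:
    """Hypothetical helper building a fixed-width ATOM record for the demo."""
    return (f"ATOM  {serial:5d} {atom:<4s} {residue:>3s} {chain}{number:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}")


# Toy model with made-up coordinates; residue numbers follow the model numbering.
model = [
    make_atom_line(1, "CA", "PHE", "A", 280, 11.104, 13.207, 2.100),
    make_atom_line(2, "CA", "TYR", "A", 127, 14.220, 10.015, 5.340),
    make_atom_line(3, "CA", "GLY", "A", 5, 20.001, 8.450, 1.275),
]
probabilities = {280: 0.957, 127: 0.951}

for record in write_probabilities_to_bfactor(model, probabilities):
    print(record)
```

A model written this way could then be coloured by B-factor in PyMOL, for example with its spectrum command applied to the b property.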

Chapter 1

Supplementary Box 1. Details of life cycles of digenean trematodes used in the study.

Life cycle of Atriophallophorus winterbourni

The final hosts of Atriophallophorus are waterfowl. Ducks feed on aquatic snails, and the trematode enters the digestive system of its final host as encysted larvae (metacercariae). The hermaphroditic worms hatch and reach their adult stage within the gut of the waterfowl. The worms cross-fertilize, producing eggs that are released with the bird’s faeces into the water, where they are passively ingested by snails feeding on algae on the rocks. When ingested by the snail, the egg hatches into a cercaria that moves into the reproductive tract of the snail and reproduces asexually into about 200-1000 metacercariae, consequently castrating the snail.

Life cycle of Schistosoma species

Humans are often the final host, but the species are rather generalist. The eggs are released from the definitive host into freshwater through feces or urine and hatch in the water, releasing ciliated miracidia. The miracidia actively invade a snail host (Biomphalaria, Bulinus or Oncomelania, depending on the species of Schistosoma). Inside the snail, the miracidia develop into cercariae and reproduce asexually. Attracted by light, the fork-tailed cercariae leave the host during daylight hours. They penetrate the skin of the vertebrate host, shedding their tails, and develop into schistosomula. Eventually they gain access to the host’s lymphatic system where (in different tissues depending on the species of Schistosoma) they develop into adults. The adult stages cause severe symptoms in their final hosts.

Life cycle of Trichobilharzia regenti

The life cycle of Trichobilharzia regenti is somewhat analogous to that of Schistosoma species. The adult flukes mate in the nasal mucosa of water birds and produce eggs, which hatch into miracidia. The miracidia leak out of the tissue while the bird is feeding or drinking. They swim to the intermediate host (snail species of the genus Radix), where they develop into sporocysts and eventually cercariae. The cercariae are released and directly penetrate the skin of their avian host. They shed the glycoprotein layer and the tail and transform into schistosomula. The schistosomula move through the nervous system into the brain and eventually into the nasal tissue in the bill.

Life cycle of Fasciola hepatica

The eggs are released with the stool of cattle, sheep or buffalo into freshwater. Humans are also often the final host. The eggs hatch into miracidia, which find a suitable snail host (Lymnaeidae family). In the snail, the miracidia develop into sporocysts, then rediae, then cercariae. The cercariae are released from the snail and form cysts on different surfaces, including aquatic vegetation, which is then consumed by the mammalian hosts. The cercariae burrow into the intestinal walls and gradually make their way into the bile duct, where they develop into adult flukes.

Life cycle of Echinostoma caproni

Echinostoma caproni has a three-host life cycle. The two intermediate hosts belong either to the Lymnaeidae family. In the first intermediate host, the miracidium undergoes asexual reproduction, transforming into sporocysts, rediae and then cercariae. The free-swimming cercariae are released from the first host and penetrate the second intermediate host. The final host (usually an aquatic bird) becomes infected through consumption of the second intermediate host. Humans are often the final host.

Life cycle of Clonorchis sinensis

Clonorchis sinensis has a three-host life cycle. The definitive hosts are usually humans or other animals that consume raw freshwater fish. The metacercariae penetrate the intestinal walls and move towards the bile duct. The eggs are released with feces and are ingested by the first intermediate host (most often the freshwater snail Parafossarulus manchouricus). The miracidia hatch and develop into sporocysts, rediae and cercariae. The cercariae are not free swimming. When released, they initially hang at the surface of the water and then sink to the bottom. This movement is repeated several times until the cercariae sense the disturbance of the water by fish, which they attack. They burrow their way through the fish scales and develop in the muscles.

Life cycle of Opisthorchis species

Opisthorchis viverrini (Southeast Asian liver fluke) and Opisthorchis felineus (cat liver fluke) have a life cycle very similar to that of the closely related Clonorchis sinensis. Humans are the usual definitive hosts of the two species (other hosts are animals that eat raw freshwater fish, including human pets). The eggs are released in feces into freshwater bodies, where they are consumed by the first intermediate host, snails of the genus Bithynia. Within the snail, the parasite undergoes all stages of development (miracidia, sporocysts, rediae and cercariae), and the free-swimming cercariae are released. They attack freshwater fish, encysting in their muscles.

Information from: (CDC, 2018; Coon, 2005; Dybdahl & Lively, 1996; Esch, Barger, & Fellis, 2002; Galaktionov & Dobrovolskij, 2003; Hechinger, 2012; Krist & Lively, 1998; Wanger et al., 2017)

Supplementary Figure 1. Propidium iodide (PI) staining plot showing peaks for Drosophila melanogaster (standard, haploid genome size 175 Mb) and Atriophallophorus winterbourni. PI stains all DNA; therefore, the genome size can be calculated from a relative measure between a standard and a focal sample.
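
The relative calculation described in this caption can be sketched as follows. This is a minimal illustration rather than the analysis code used in the thesis: the function name and the fluorescence peak positions are hypothetical, and only the 175 Mb Drosophila standard is taken from the caption. The sketch assumes that fluorescence scales linearly with DNA content.

```python
def genome_size_from_pi(sample_peak, standard_peak, standard_size_mb=175.0):
    """Estimate genome size (Mb) from propidium iodide fluorescence peaks.

    Assumes fluorescence scales linearly with DNA content and that both
    peaks come from nuclei of the same ploidy level.
    """
    if sample_peak <= 0 or standard_peak <= 0:
        raise ValueError("peak positions must be positive")
    return standard_size_mb * sample_peak / standard_peak


# Hypothetical peak positions (arbitrary fluorescence units), for
# illustration only; they are not measurements from the thesis.
example_size = genome_size_from_pi(sample_peak=200.0, standard_peak=100.0)
print(f"Estimated genome size: {example_size:.0f} Mb")
```

In practice the peak positions would be read from the flow cytometry histogram of the co-stained standard and focal sample.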

Supplementary Figure 2. GenomeScope k-mer profile plot (21-mer frequency distribution) for filtered Illumina data. The first peak, at 21X coverage, is the heterozygous peak. The second peak, at 42X coverage, is the homozygous peak. The estimated heterozygous portion is 3.6%.
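
The relationship between the two peaks can be illustrated with a small sketch. It is not GenomeScope itself, which fits a mixture model to the observed histogram. The histogram below is synthetic, the helper names are hypothetical, and only the peak positions (21X and 42X) come from the caption. The point is that the heterozygous peak is expected at half the coverage of the homozygous peak.

```python
import math


def local_maxima(counts, min_coverage=5):
    """Return coverages at which the k-mer histogram has a local maximum.

    counts[c] is the number of distinct k-mers observed c times. Coverages
    below min_coverage are skipped because, in real data, they are
    dominated by sequencing errors.
    """
    peaks = []
    for c in range(max(min_coverage, 1), len(counts) - 1):
        if counts[c] > counts[c - 1] and counts[c] > counts[c + 1]:
            peaks.append(c)
    return peaks


def gaussian(x, mean, sd, height):
    """Bell-shaped curve used only to build the synthetic histogram."""
    return height * math.exp(-((x - mean) ** 2) / (2 * sd ** 2))


# Synthetic histogram: peak positions follow the caption, the peak
# shapes and heights are invented for illustration.
histogram = [gaussian(c, 21, 3, 800) + gaussian(c, 42, 4, 1000) for c in range(100)]

het_peak, hom_peak = local_maxima(histogram)
print(het_peak, hom_peak, hom_peak / het_peak)
```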

Supplementary Figure 3. BlobPlot of the A. winterbourni assembly. The assembly was assessed for taxonomic uniformity using Blobtools 0.9.19.5. All assembly scaffolds are depicted as circles, and the diameter of each circle indicates sequence length. All scaffolds >50 000 bp and a random sample of scaffolds <50 000 bp from the assembly were submitted to BLAST against the NCBI nr database for taxonomic annotation. The circles are coloured by taxonomic annotation; grey indicates that the scaffold was not submitted to BLAST. Circles are positioned according to their proportion of GC content and the coverage of the reads that map onto the scaffolds.
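
As an illustration of two of the per-scaffold quantities shown in the plot (circle size and position on the GC axis), the sketch below computes sequence length and GC proportion for toy sequences. Blobtools computes these values itself; the function name and the scaffold sequences here are hypothetical, and the coverage axis, which requires mapped reads, is not covered.

```python
def gc_proportion(sequence):
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Scaffolds made only of ambiguous bases (e.g. N) have no defined GC.
    return gc / total if total else 0.0


# Hypothetical toy scaffolds; real scaffolds would be read from the
# assembly FASTA file.
scaffolds = {
    "scaffold_1": "ATGCGCGCATATNNGC",
    "scaffold_2": "ATATATATGC",
}

for name, seq in scaffolds.items():
    print(name, len(seq), round(gc_proportion(seq), 3))
```

Ambiguous bases are excluded from the denominator so that gaps between contigs do not lower the GC estimate of a scaffold.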

Supplementary Figure 4. Frequency distribution of the total length of proteins of the 14 species of trematodes used in the analysis. The higher peaks belong to genomes with a BUSCO completeness score (proportion of single-copy orthologs) below 45% (indicated by an arrow in the figure and by a star in the legend), and the lower peaks belong to genomes with a BUSCO completeness score above 45%. Atriophallophorus winterbourni falls into the group of genomes with higher BUSCO completeness and a greater number of longer proteins.

Supplementary Figure 5. The number of functionally annotated genes of Atriophallophorus winterbourni by Pannzer2, OMA, EggNOG and their overlap in the genes they annotated. All three programs annotated 1281 genes.

Supplementary Figure 6. Species tree created with ASTRAL III. Gene trees for 238 Orthologous Groups were built from MAFFT alignments and used together in ASTRAL III to create a unifying species tree.

Supplementary Figure 7. Original tree created in pyHam showing the total number of gained, duplicated, retained and lost genes on each branch of the phylogeny of the species used in this study.

Supplementary Figure 8. A. A HOG25969 PDB model (red) superimposed on DNA-bound FEN1 nuclease (5V07, white), viewed from two different rotations. Green colour indicates sites with more than 90% probability of being under selection. Phe280 and Tyr127 mark two sites within two alpha helices that correspond to part of the typical DNA binding site present in FEN1 nucleases, showing positive selection at the site of two aromatic residues. The site numbers correspond to the MSA alignment on which the model was based and are equivalent to sites 996 (Phe280) and 1095 (Tyr127) in Table 4 for branch #1.1. B. A zoomed-in image of the sites under selection.

Supplementary Figure 9. Gene tree of gene family HOG 36190 created with IQ-TREE. The tree is unrooted. Each original gene name (protein name) is followed by the species name; the names of A. winterbourni genes refer to the names in Supplementary Table 11. The numbers above branches indicate bootstrap support; for the #1 branch, the bootstrap support is separated with a slash. The #1 indicates the separation between the foreground and background branches (the distinction used in codeml to investigate selection). The test for selection compares dN/dS between the foreground and background branches.
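
Comparisons of this kind are commonly framed as a likelihood ratio test between two nested codeml models. The sketch below assumes that framing; the caption does not state the exact models or degrees of freedom used. The function name and the log-likelihood values are hypothetical and are not results from the thesis. One degree of freedom is assumed.

```python
import math


def lrt_p_value(lnl_null, lnl_alt):
    """Likelihood ratio test for two nested models differing by one parameter.

    Returns the test statistic 2 * (lnL_alt - lnL_null) and its p-value
    under a chi-square distribution with 1 degree of freedom (an
    assumption of this sketch).
    """
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods of the null model (same dN/dS on all
# branches) and the alternative model (separate foreground dN/dS).
stat, p = lrt_p_value(lnl_null=-10234.6, lnl_alt=-10229.1)
print(f"2*deltaLnL = {stat:.2f}, p = {p:.4f}")
```

A small p-value would indicate that allowing a separate dN/dS on the foreground branches fits the alignment significantly better than a single ratio.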

Chapter 1 References

Centers for Disease Control and Prevention (CDC). (2018). Alphabetical Index of Parasitic Diseases.
Alkan, C., Sajjadian, S., & Eichler, E. E. (2011). Limitations of next-generation genome sequence assembly. Nat. Methods, 8(1), 61-65. doi:10.1038/nmeth.1527
Altenhoff, A. M., Gil, M., Gonnet, G. H., & Dessimoz, C. (2013). Inferring Hierarchical Orthologous Groups from Orthologous Gene Pairs. PLoS One, 8(1), e53786. doi:10.1371/journal.pone.0053786
Altenhoff, A. M., Glover, N. M., Train, C.-M., Kaleb, K., Warwick Vesztrocy, A., Dylus, D., . . . Dessimoz, C. (2017). The OMA orthology database in 2018: retrieving evolutionary relationships among all domains of life through richer web and programmatic interfaces. Nucleic Acids Res., 46(D1), D477-D485. doi:10.1093/nar/gkx1019
Altenhoff, A. M., Levy, J., Zarowiecki, M., Tomiczek, B., Warwick Vesztrocy, A., Dalquen, D. A., . . . Dessimoz, C. (2019). OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome Res., 29(7), 1152-1163. doi:10.1101/gr.243212.118
Ambardar, S., Gupta, R., Trakroo, D., Lal, R., & Vakhlu, J. (2016). High Throughput Sequencing: An Overview of Sequencing Chemistry. Indian J. Microbiol., 56(4), 394-404. doi:10.1007/s12088-016-0606-4
Bankers, L., & Neiman, M. (2017). De novo Transcriptome Characterization of a Sterilizing Trematode Parasite (Microphallus sp.) from Two Species of New Zealand Snails. G3: Genes|Genomes|Genetics, 7(3), 871. doi:10.1534/g3.116.037275
Bao, J., Pan, G., Poncz, M., Wei, J., Ran, M., & Zhou, Z. (2018). Serpin functions in host-pathogen interactions. PeerJ, 6, e4557. doi:10.7717/peerj.4557
Barrell, D., Dimmer, E., Huntley, R. P., Binns, D., O’Donovan, C., & Apweiler, R. (2009). The GOA database in 2009—an integrated Gene Ontology Annotation resource. Nucleic Acids Res., 37(suppl_1), D396-D403. doi:10.1093/nar/gkn803
Baskaran, P., Jaleta, T. G., Streit, A., & Rödelsperger, C. (2017). Duplications and Positive Selection Drive the Evolution of Parasitism-Associated Gene Families in the Nematode Strongyloides papillosus. Genome Biol. Evol., 9(3), 790-801. doi:10.1093/gbe/evx040
Bennett, M. D., Leitch, I. J., Price, H. J., & Johnston, J. S. (2003). Comparisons with Caenorhabditis (∼100 Mb) and Drosophila (∼175 Mb) Using Flow Cytometry Show Genome Size in Arabidopsis to be ∼157 Mb and thus ∼25 % Larger than the Arabidopsis Genome Initiative Estimate of ∼125 Mb. Ann. Bot., 91(5), 547-557. doi:10.1093/aob/mcg057
Blair, D., Davis, G. M., & Wu, B. (2001). Evolutionary relationships between trematodes and snails emphasizing schistosomes and paragonimids. Parasitology, 123 Suppl, S229-243. doi:10.1017/s003118200100837x
Blasco-Costa, I., Seppälä, K., Feijen, F., Zajac, N., Klappert, K., & Jokela, J. (2019). A new species of Atriophallophorus Deblock & Rosé, 1964 (Trematoda: Microphallidae) described from in vitro-grown adults and metacercariae from Potamopyrgus antipodarum (Gray, 1843) (Mollusca: Tateidae). J. Helminthol., 94, e108. doi:10.1017/S0022149X19000993
Cantacessi, C., & Gasser, R. B. (2012). SCP/TAPS proteins in helminths – Where to from now? Mol. Cell. Probes, 26(1), 54-59. doi:10.1016/j.mcp.2011.10.001
Cantarel, B. L., Korf, I., Robb, S. M. C., Parra, G., Ross, E., Moore, B., . . . Yandell, M. (2008). MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res., 18(1), 188-196. doi:10.1101/gr.6743907

Chang, E. S., Neuhof, M., Rubinstein, N. D., Diamant, A., Philippe, H., Huchon, D., & Cartwright, P. (2015). Genomic insights into the evolutionary origin of Myxozoa within Cnidaria. Proc. Natl. Acad. Sci. U. S. A., 112(48), 14912-14917. doi:10.1073/pnas.1511468112
Coon, D. R. (2005). Schistosomiasis: Overview of the history, biology, clinicopathology, and laboratory diagnosis. Clinical Microbiology Newsletter, 27(21), 163-168. doi:10.1016/j.clinmicnews.2005.10.001
Corradi, N. (2015). Microsporidia: Eukaryotic Intracellular Parasites Shaped by Gene Loss and Horizontal Gene Transfers. Annu. Rev. Microbiol., 69, 167-183. doi:10.1146/annurev-micro-091014-104136
David, C. N., Ozbek, S., Adamczyk, P., Meier, S., Pauly, B., Chapman, J., . . . Holstein, T. W. (2008). Evolution of complex structures: minicollagens shape the cnidarian nematocyst. Trends Genet., 24(9), 431-438. doi:10.1016/j.tig.2008.07.001
Dennis, S. (2019, 2019/12/12). Personal communication.
Disease and Injury Incidence and Prevalence Collaborators GBD. (2016). Global, regional, and national incidence, prevalence, and years lived with disability for 310 diseases and injuries, 1990–2015: a systematic analysis for the Global Burden of Disease Study 2015. Lancet, 388(10053), 1545-1602. doi:10.1016/S0140-6736(16)31678-6
Duarte, J. M., Wall, P. K., Edger, P. P., Landherr, L. L., Ma, H., Pires, P. K., . . . Claude, W. d. (2010). Identification of shared single copy nuclear genes in Arabidopsis, Populus, Vitis and Oryza and their phylogenetic utility across various taxonomic levels. BMC Evolutionary Biology, 10(1), 61.
Dybdahl, M. F., & Lively, C. M. (1996). The geography of coevolution: comparative population structures for a snail and its trematode parasite. Evolution, 50(6), 2264-2275. doi:10.1111/j.1558-5646.1996.tb03615.x
Dylus, D., Nevers, Y., Altenhoff, A. M., Gürtler, A., Dessimoz, C., & Glover, N. M. (2020). How to build phylogenetic species trees with OMA. F1000Research, 9(511), 511.
Egger, B., Bachmann, L., & Fromm, B. (2017). Atp8 is in the ground pattern of flatworm mitochondrial genomes. BMC Genomics, 18(1), 414. doi:10.1186/s12864-017-3807-2
Ertel, J. C., & Isseroff, H. (1976). Proline in fascioliasis: II. Characteristics of partially purified ornithine-δ-transaminase from Fasciola. Rice Institute Pamphlet-Rice University Studies, 62(4).
Esch, G. W., Barger, M. A., & Fellis, K. J. (2002). The Transmission of Digenetic Trematodes: Style, Elegance, Complexity. Integrative and Comparative Biology, 42(2), 304-312. doi:10.1093/icb/42.2.304
Feijen, F. A. A., Buser, C. C., Klappert, K., Kopp, K., Lively, C. M., Zajac, N. H., & Jokela, J. (in prep). Hotspots for parasite transmission emerge from large infection source habitats.
Galaktionov, K., & Dobrovolskij, A. (2003). The Biology and Evolution of Trematodes.
Gallivan, J. P., & Dougherty, D. A. (1999). Cation-π interactions in structural biology. Proc. Natl. Acad. Sci. U. S. A.
Gibson, D. I. (1987). Questions in digenean systematics and evolution. Parasitology, 95(Pt 2), 429-460. doi:10.1017/s0031182000057851
Githui, E. K., Damian, R. T., Aman, R. A., Ali, M. A., & Kamau, J. M. (2009). Schistosoma spp.: Isolation of microtubule associated proteins in the tegument and the definition of dynein light chains components. Exp. Parasitol., 121(1), 96-104. doi:10.1016/j.exppara.2008.10.007
Hausdorf, B. (2000). Early evolution of the bilateria. Syst. Biol., 49(1), 130-142. doi:10.1080/10635150050207438
Hechinger, R. F. (2012). Faunal survey and identification key for the trematodes (Platyhelminthes: ) infecting Potamopyrgus antipodarum (Gastropoda: Hydrobiidae) as first intermediate host. Zootaxa, 3418(1), 1-27.
Helluy, S., & Thomas, F. (2010). Parasitic manipulation and neuroinflammation: Evidence from the system Microphallus papillorobustus (Trematoda) - Gammarus (Crustacea). Parasit. Vectors, 3, 38. doi:10.1186/1756-3305-3-38
Hildebrand, A., Remmert, M., Biegert, A., & Söding, J. (2009). Fast and accurate automatic structure prediction with HHpred. Proteins, 77 Suppl 9, 128-132. doi:10.1002/prot.22499
Hoang, D. T., Chernomor, O., von Haeseler, A., Minh, B. Q., & Vinh, L. S. (2017). UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol. Biol. Evol., 35(2), 518-522. doi:10.1093/molbev/msx281
Hoffmann, K. F., & Strand, M. (1997). Molecular Characterization of a 20.8-kDa Schistosoma mansoni Antigen: Sequence similarity to tegumental associated antigens and dynein light chains. J. Biol. Chem., 272(23), 14509-14515. doi:10.1074/jbc.272.23.14509
Huerta-Cepas, J., Szklarczyk, D., Forslund, K., Cook, H., Heller, D., Walter, M. C., . . . Bork, P. (2016). eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res., 44(D1), D286-293. doi:10.1093/nar/gkv1248
International Helminth Genomes Consortium. (2019). Comparative genomics of the major parasitic worms. Nat. Genet., 51(1), 163-174. doi:10.1038/s41588-018-0262-1
Jeffares, D. C., Tomiczek, B., Sojo, V., & dos Reis, M. (2015). A Beginners Guide to Estimating the Non-synonymous to Synonymous Rate Ratio of all Protein-Coding Genes in a Genome. In C. Peacock (Ed.), Parasite Genomics Protocols (pp. 65-90). New York, NY: Springer New York.
Jokela, J., Dybdahl, M. F., & Lively, C. M. (2009). The maintenance of sex, clonal dynamics, and host-parasite coevolution in a mixed population of sexual and asexual snails. The American Naturalist, 174(S1), S43-S53.
Jones, M. K., Gobert, G. N., Zhang, L., Sunderland, P., & McManus, D. P. (2004). The cytoskeleton and motor proteins of human schistosomes and their roles in surface maintenance and host-parasite interactions. Bioessays, 26(7), 752-765. doi:10.1002/bies.20058
Kaletsky, R., Yao, V., Williams, A., Runnels, A. M., Tadych, A., Zhou, S., . . . Murphy, C. T. (2018). Transcriptome analysis of adult Caenorhabditis elegans cells reveals tissue-specific gene and isoform expression. PLoS Genet., 14(8), e1007559. doi:10.1371/journal.pgen.1007559
Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., von Haeseler, A., & Jermiin, L. S. (2017). ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods, 14(6), 587-589. doi:10.1038/nmeth.4285
Katoh, K., Asimenos, G., & Toh, H. (2009). Multiple Alignment of DNA Sequences with MAFFT. In D. Posada (Ed.), Bioinformatics for DNA Sequence Analysis (pp. 39-64). Totowa, NJ: Humana Press.
Kim, Y.-J., Yoo, W. G., Lee, M.-R., Kim, D.-W., Lee, W.-J., Kang, J.-M., . . . Ju, J.-W. (2012). Identification and characterization of a novel 21.6-kDa tegumental protein from Clonorchis sinensis. Parasitol. Res., 110(5), 2061-2066. doi:10.1007/s00436-011-2681-0
Klopfenstein, D. V., Zhang, L., Pedersen, B. S., Ramírez, F., Warwick Vesztrocy, A., Naldi, A., . . . Tang, H. (2018). GOATOOLS: A Python library for Gene Ontology analyses. Sci. Rep., 8(1), 10872. doi:10.1038/s41598-018-28948-z

Krist, A. C., & Lively, C. M. (1998). Experimental exposure of juvenile snails (Potamopyrgus antipodarum) to infection by trematode larvae (Microphallus sp.): infectivity, fecundity compensation and growth. Oecologia, 116(4), 575-582. doi:10.1007/s004420050623
Laetsch, D. R., & Blaxter, M. L. (2017). BlobTools: Interrogation of genome assemblies [version 1; peer review: 2 approved with reservations]. F1000Res., 6(1287). doi:10.12688/f1000research.12232.1
Lee, D., Choe, S., Park, H., Jeon, H.-K., Chai, J.-Y., Sohn, W.-M., . . . Eom, K. S. (2013). Complete mitochondrial genome of Haplorchis taichui and comparative analysis with other trematodes. Korean J. Parasitol., 51(6), 719.
Levri, E. P., & Lively, C. M. (1996). The effects of size, reproductive condition, and parasitism on foraging behaviour in a freshwater snail, Potamopyrgus antipodarum. Animal Behaviour, 51(4), 891-901.
Lively, C. M. (1987). Evidence from a New Zealand snail for the maintenance of sex by parasitism. Nature, 328(6130), 519-521.
Lively, C. M. (1989). Adaptation by a parasitic trematode to local populations of its snail host. Evolution, 43(8), 1663-1671.
Lively, C. M., Dybdahl, M. F., Jokela, J., Osnas, E. E., & Delph, L. F. (2004). Host Sex and Local Adaptation by Parasites in a Snail‐Trematode Interaction. Am. Nat., 164(S5), S6-S18. doi:10.1086/424605
Lively, C. M., & McKenzie, J. C. (1991). Experimental infection of a freshwater snail, Potamopyrgus antipodarum, with a digenetic trematode, Microphallus sp.
Lu, T.-M., Kanda, M., Furuya, H., & Satoh, N. (2019). Dicyemid Mesozoans: a unique parasitic lifestyle and a reduced genome. Genome Biol. Evol., 11(8), 2232-2243.
Madeira, F., Park, Y. M., Lee, J., Buso, N., Gur, T., Madhusoodanan, N., . . . Lopez, R. (2019). The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic Acids Res., 47(W1), W636-W641. doi:10.1093/nar/gkz268
Mazandu, G. K., & Mulder, N. J. (2014). Information content-based Gene Ontology functional similarity measures: which one to use for a given biological data type? PLoS One, 9(12), e113859. doi:10.1371/journal.pone.0113859
Mehlhorn, H. (2016). Amino Acids. In H. Mehlhorn (Ed.), Encyclopedia of Parasitology (pp. 104-107). Berlin, Heidelberg: Springer Berlin Heidelberg.
Mirdita, M., von den Driesch, L., Galiez, C., Martin, M. J., Söding, J., & Steinegger, M. (2017). Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res., 45(D1), D170-D176. doi:10.1093/nar/gkw1081
Moretti, S., Laurenczy, B., Gharib, W. H., Castella, B., Kuzniar, A., Schabauer, H., . . . Robinson-Rechavi, M. (2014). Selectome update: quality control and computational improvements to a database of positive selection. Nucleic Acids Res., 42(Database issue), D917-921. doi:10.1093/nar/gkt1065
NCBI. (2017). Fasciola hepatica genome assembly, unpublished. Retrieved from https://www.ncbi.nlm.nih.gov/assembly/GCA_002763495.1
O'Malley, M. A., Wideman, J. G., & Ruiz-Trillo, I. (2016). Losing Complexity: The Role of Simplification in Macroevolution. Trends Ecol. Evol., 31(8), 608-621. doi:10.1016/j.tree.2016.04.004
Ohno, S. (2013). Evolution by Gene Duplication: Springer Science & Business Media.
Paczesniak, D., Jokela, J., Larkin, K., & Neiman, M. (2013). Discordance between nuclear and mitochondrial genomes in sexual and asexual lineages of the freshwater snail Potamopyrgus antipodarum. Mol. Ecol., 22(18), 4695-4710. doi:10.1111/mec.12422

Parfrey, L. W., Lahr, D. J. G., Knoll, A. H., & Katz, L. A. (2011). Estimating the timing of early eukaryotic diversification with multigene molecular clocks. Proc. Natl. Acad. Sci. U. S. A., 108(33), 13624-13629. doi:10.1073/pnas.1110633108
Peterson, K. J., Lyons, J. B., Nowak, K. S., Takacs, C. M., Wargo, M. J., & McPeek, M. A. (2004). Estimating metazoan divergence times with a molecular clock. Proc. Natl. Acad. Sci. U. S. A., 101(17), 6536-6541. doi:10.1073/pnas.0401670101
Peyretaillade, E., El Alaoui, H., Diogon, M., Polonais, V., Parisot, N., Biron, D. G., . . . Delbac, F. (2011). Extreme reduction and compaction of microsporidian genomes. Res. Microbiol., 162(6), 598-606. doi:10.1016/j.resmic.2011.03.004
Poulin, R., & Randhawa, H. S. (2015). Evolution of parasitism along convergent lines: from ecology to genomics. Parasitology, 142(S1), S6-S15.
Protasio, A. V., Tsai, I. J., Babbage, A., Nichol, S., Hunt, M., Aslett, M. A., . . . Berriman, M. (2012). A systematically improved high quality genome and transcriptome of the human blood fluke Schistosoma mansoni. PLoS Negl. Trop. Dis., 6(1), e1455. doi:10.1371/journal.pntd.0001455
Pryszcz, L. P., & Gabaldón, T. (2016). Redundans: an assembly pipeline for highly heterozygous genomes. Nucleic Acids Res., 44(12), e113-e113. doi:10.1093/nar/gkw294
Remmert, M., Biegert, A., Hauser, A., & Söding, J. (2011). HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods, 9(2), 173-175. doi:10.1038/nmeth.1818
Rödelsperger, C. (2018). Comparative Genomics of Gene Loss and Gain in Caenorhabditis and Other Nematodes. In J. C. Setubal, J. Stoye, & P. F. Stadler (Eds.), Comparative Genomics: Methods and Protocols (pp. 419-432). New York, NY: Springer New York.
Roger, E., Mitta, G., Moné, Y., Bouchut, A., Rognon, A., Grunau, C., . . . Gourbal, B. E. F. (2008). Molecular determinants of compatibility polymorphism in the Biomphalaria glabrata/Schistosoma mansoni model: New candidates identified by a global comparative proteomics approach. Mol. Biochem. Parasitol., 157(2), 205-216. doi:10.1016/j.molbiopara.2007.11.003
Sakharkar, K. R., Dhar, P. K., & Chow, V. T. K. (2004). Genome reduction in prokaryotic obligatory intracellular parasites of humans: a comparative analysis. Int. J. Syst. Evol. Microbiol., 54(Pt 6), 1937-1941. doi:10.1099/ijs.0.63090-0
Schiffer, P. H., Gravemeyer, J., Rauscher, M., & Wiehe, T. (2016). Ultra Large Gene Families: A Matter of Adaptation or Genomic Parasites? Life, 6(3). doi:10.3390/life6030032
Schmieder, R., & Edwards, R. (2011). Quality control and preprocessing of metagenomic datasets. Bioinformatics, 27(6), 863-864. doi:10.1093/bioinformatics/btr026
Shi, Y., Hellinga, H. W., & Beese, L. S. (2017). Interplay of catalysis, fidelity, threading, and processivity in the exo- and endonucleolytic reactions of human exonuclease I. Proc. Natl. Acad. Sci. U. S. A., 114(23), 6010-6015. doi:10.1073/pnas.1704845114
Sievers, F., & Higgins, D. G. (2014). Clustal Omega, accurate alignment of very large numbers of sequences. In Multiple sequence alignment methods (pp. 105-116): Springer.
Slyusarev, G. S., Starunov, V. V., Bondarenko, A. S., Zorina, N. A., & Bondarenko, N. I. (2020). Extreme Genome and Nervous System Streamlining in the Invertebrate Parasite Intoshia variabili. Curr. Biol., 30(7), 1292-1298.e1293. doi:10.1016/j.cub.2020.01.061
Smit, A. F. A., & Hubley, R. (2018). 2008–2015. RepeatModeler Open-1.0.
Smit, A. F. A., Hubley, R., & Green, P. (2015). RepeatMasker Open-4.0. 2013–2015.
Suyama, M., Torrents, D., & Bork, P. (2006). PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res., 34(suppl_2), W609-W612.
Takeuchi, T., Koyanagi, R., Gyoja, F., Kanda, M., Hisata, K., Fujie, M., . . . Kawashima, T. (2016). Bivalve-specific gene expansion in the pearl oyster genome: implications of adaptation to a sessile lifestyle. Zoological Letters, 2(1), 3. doi:10.1186/s40851-016-0039-2
Tamura, K., Stecher, G., Peterson, D., Filipski, A., & Kumar, S. (2013). MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol. Biol. Evol., 30(12), 2725-2729. doi:10.1093/molbev/mst197
Toledo, R., & Fried, B. (2010). Biomphalaria snails and larval trematodes: Springer Science & Business Media.
Törönen, P., Medlar, A., & Holm, L. (2018). PANNZER2: a rapid functional annotation web server. Nucleic Acids Res., 46(W1), W84-W88. doi:10.1093/nar/gky350
Train, C.-M., Pignatelli, M., Altenhoff, A., & Dessimoz, C. (2018). iHam and pyHam: visualizing and processing hierarchical orthologous groups. Bioinformatics, 35(14), 2504-2506. doi:10.1093/bioinformatics/bty994
Trifinopoulos, J., Nguyen, L.-T., von Haeseler, A., & Minh, B. Q. (2016). W-IQ-TREE: a fast online phylogenetic tool for maximum likelihood analysis. Nucleic Acids Res., 44(W1), W232-W235. doi:10.1093/nar/gkw256
Tsai, I. J., Zarowiecki, M., Holroyd, N., Garciarrubio, A., Sánchez-Flores, A., Brooks, K. L., . . . Berriman, M. (2013). The genomes of four tapeworm species reveal adaptations to parasitism. Nature, 496(7443), 57-63. doi:10.1038/nature12031
Wang, T., Zhao, M., Rotgans, B. A., Strong, A., Liang, D., Ni, G., . . . Cummins, S. F. (2016). Proteomic Analysis of the Schistosoma mansoni Miracidium. PLoS One, 11(1), e0147247. doi:10.1371/journal.pone.0147247
Wanger, A., Chavez, V., Huang, R. S. P., Wahed, A., Actor, J. K., & Dasgupta, A. (2017). Chapter 10 - Infections Caused by Parasites. In A. Wanger, V. Chavez, R. S. P. Huang, A. Wahed, J. K. Actor, & A. Dasgupta (Eds.), Microbiology and Molecular Diagnosis in Pathology (pp. 191-219): Elsevier.
Warwick, T. (1952). Strains in the mollusc Potamopyrgus jenkinsi (Smith). Nature, 169(4300), 551-552.
Waterhouse, R. M., Seppey, M., Simão, F. A., Manni, M., Ioannidis, P., Klioutchnikov, G., . . . Zdobnov, E. M. (2017). BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Mol. Biol. Evol., 35(3), 543-548. doi:10.1093/molbev/msx319
Webb, B., & Sali, A. (2016). Comparative protein structure modeling using MODELLER. Current Protocols in Bioinformatics, 54(1), 5.6.1-5.6.37.
Weinstein, S. B., & Kuris, A. M. (2016). Independent origins of parasitism in Animalia. Biology Letters, 12(7), 20160324.
Wilton, P. R., Sloan, D. B., Logsdon, J. M., Jr., Doddapaneni, H., & Neiman, M. (2013). Characterization of transcriptomes from sexual and asexual lineages of a New Zealand snail (Potamopyrgus antipodarum). Mol. Ecol. Resour., 13(2), 289-294. doi:10.1111/1755-0998.12051
Winterbourn, M. J. (1974). Larval Trematoda parasitizing the New Zealand species of Potamopyrgus (Gastropoda: Hydrobiidae). Mauri Ora, 2, 17-30.
Wood, D. E., & Salzberg, S. L. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol., 15(3), R46. doi:10.1186/gb-2014-15-3-r46
Wu, F., Mueller, L. A., Crouzillat, D., Pétiard, V., & Tanksley, S. D. (2006). Combining bioinformatics and phylogenetics to identify large sets of single-copy orthologous genes (COSII) for comparative, evolutionary and systematic studies: a test case in the euasterid plant clade. Genetics, 174(3), 1407-1420.
Yang, W., Jones, M. K., Fan, J., Hughes-Stamm, S. R., & McManus, D. P. (1999). Characterisation of a family of Schistosoma japonicum proteins related to dynein light chains. Biochimica et Biophysica Acta (BBA) - Protein Structure and Molecular Enzymology, 1432(1), 13-26. doi:10.1016/S0167-4838(99)00089-8
Yang, Z., Wafula, E. K., Honaas, L. A., Zhang, H., Das, M., Fernandez-Aparicio, M., . . . dePamphilis, C. W. (2015). Comparative transcriptome analyses reveal core parasitism genes and suggest gene duplication and repurposing as sources of structural novelty. Mol. Biol. Evol., 32(3), 767-790. doi:10.1093/molbev/msu343
Yang, Z., Wong, W. S. W., & Nielsen, R. (2005). Bayes empirical Bayes inference of amino acid sites under positive selection. Mol. Biol. Evol., 22(4), 1107-1118.
Yap, K. W., & Thompson, R. C. (1987). CTAB precipitation of cestode DNA. Parasitol. Today, 3(7), 220-222. doi:10.1016/0169-4758(87)90065-2
Young, N. D., Hall, R. S., Jex, A. R., Cantacessi, C., & Gasser, R. B. (2010). Elucidating the transcriptome of Fasciola hepatica—a key to fundamental and biotechnological discoveries for a neglected parasite. Biotechnol. Adv., 28(2), 222-231.
Zahn-Zabal, M., Dessimoz, C., & Glover, N. M. (2020). Identifying orthologs with OMA: A primer. F1000Research, 9.
Zarowiecki, M., & Berriman, M. (2015). What helminth genomes have taught us about parasite evolution. Parasitology, 142(S1), S85-S97.
Zhang, C., Rabiee, M., Sayyari, E., & Mirarab, S. (2018). ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics, 19(6), 153. doi:10.1186/s12859-018-2129-y
Zhang, S. V., Zhuo, L., & Hahn, M. W. (2016). AGOUTI: improving genome assembly and annotation using transcriptome data. Gigascience, 5(1), 31. doi:10.1186/s13742-016-0136-3
Zimin, A. V., Marçais, G., Puiu, D., Roberts, M., Salzberg, S. L., & Yorke, J. A. (2013). The MaSuRCA genome assembler. Bioinformatics, 29(21), 2669-2677. doi:10.1093/bioinformatics/btt476


Chapter 2

2: Divergence of gene structure in genes originating through duplication

Natalia Zajac*,1,2, Jukka Jokela1,2, Hanna Hartikainen1,2,3, Natasha Glover4,5,6

1. Eawag, Swiss Federal Institute of Aquatic Science and Technology, CH-8600 Dübendorf, Switzerland
2. ETH Zurich, Department of Environmental Systems Science, Institute of Integrative Biology, CH-8092 Zurich, Switzerland
3. School of Life Sciences, University of Nottingham, University Park, NG7 2RD, Nottingham, UK
4. Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
5. Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
6. Center for Integrative Genomics, 1015 Lausanne, Switzerland

*Author for Correspondence: Natalia Zajac, ETH Zurich, Department of Environmental Systems Science, Institute of Integrative Biology, Zurich, Switzerland, +41 58 765 1122, [email protected]


Abstract

Duplicated and novel genes and proteins tend to be shorter than conserved, non-duplicated ones. Shorter length in duplicated genes may reflect the mechanisms by which they arise, or shorter genes may have a higher propensity to evolve via duplication because structural changes are less likely to render them dysfunctional. Gene duplication is common among parasites, occurs in a broad range of gene families, and has contributed to parasite adaptive radiations. In this study we examine the evolution of a set of homologous gene families in 13 digenean trematodes, all of which are parasites, since their most recent common ancestor. We compared the gene and protein lengths of conserved, non-duplicated genes (1:1 orthologs) with those of duplicated and novel genes. We examined all genes in order to understand whether the pattern of gene length variation between 1:1 orthologs, duplicated genes and novel genes in a group of parasites was similar to that previously reported for free-living animals. We explored two different sets of hierarchical orthologous groups in all 13 species to understand whether evolution through duplication is a characteristic of certain gene families and depends on gene length, and/or whether it is the process of duplication itself that renders a gene shorter. Our results indicate that both processes play a role. Gene families consisting only of duplicated genes coded for shorter proteins. But in gene families consisting of both duplicated and 1:1 orthologous genes, duplicated genes were significantly shorter at the genomic level (including the length of introns, exons and regulatory regions) and had fewer exons. The results suggest that certain gene families evolve through duplication because the product of duplication is more likely to become fixed and remain functional if it occurs in a family producing short proteins. However, the mechanisms of duplication also play a role. Retrotransposition or exon/intron loss in segmental duplication can render a gene shorter without necessarily producing a shorter protein.


Introduction

Major transitions in evolution, such as evolution of eukaryotes or multicellularity, have been marked by gain of functional complexity and can be traced through genomic changes (Freeling & Thomas, 2006; Herron, 2016; Koonin, 2015; Petrov, 2001). Gene duplications give rise to genomic novelty and complexity (Martin, 1999; Rivera et al., 2010; Yang et al., 2014), thus providing heritable gene variation subject to selection and genetic drift. The second copy of a duplicated gene can be maintained to perform the same function and create double the amount of the same product, or it can undergo divergence leading to subfunctionalisation or neofunctionalisation, assuming a different function (Ohno, 2013).

Duplicated and novel genes have been observed to be shorter than conserved, non-duplicated genes in many species. Studies of the genomes of model organisms, such as the nematode Caenorhabditis elegans or the mouse Mus musculus, have found that the duplicated copy tends to be shorter than its ancestral counterpart (Katju & Lynch, 2006; Mallon et al., 2004). Lineage-specific genes in the Poaceae family were found to be shorter and to have an elevated GC content compared to non-lineage-restricted genes with significant sequence similarity in a species outside the Poaceae (Campbell et al., 2007). A similar phenomenon has been observed in zebrafish, Danio rerio, and sweet orange, Citrus sinensis, with lineage-specific genes being shorter and having fewer exons (Xu et al., 2015; Yang, Zou, Fu, & He, 2013). However, it is not only the duplicated and novel genes that have been observed to be shorter but also their resulting proteins. Proteins exclude introns, transposable elements and UTR elements, and reflect only the coding sequence of a gene. In a study on mammalian-specific gene families, the proteins of those genes were found to be short and depleted of aromatic and negatively charged residues important for the stabilization of the protein three-dimensional structure (Luis Villanueva-Cañas et al., 2017).

The answer to why duplicated and novel genes tend to be shorter might lie in the various mechanisms by which they arise. Genes duplicate through three major mechanisms: segmental duplication with unequal crossing over, occurring through homologous recombination; non-homologous segmental duplication, for example due to replication-dependent chromosome breakages; and retrotransposition, a process of reverse transcription of mature RNA (Hurles, 2004; Koszul, Caburet, Dujon, & Fischer, 2004). The first two processes can result in longer or shorter genes depending on the structural divergence that follows. This divergence includes shifts in the open reading frame or in the position of the regulatory regions, point mutations introducing premature stop codons, deletions, and exonization/pseudoexonization, the processes by which intronic or intergenic sequence becomes exonic and vice versa (Casewell, Wagstaff, Harrison, Renjifo, & Wüster, 2011; Xu, Guo, Shan, & Kong, 2012). Retrotransposition always results in shorter genes that lack introns, are often single-exon and carry poly-A tails, which are otherwise only associated with mRNA (Hurles, 2004; Jorquera, González, Clausen, Petersen, & Holmes, 2018). A high level of structural divergence can cause the duplicated gene copy to appear novel, especially when it performs a highly divergent function (Ohno, 2013). Novel genes can also arise de novo from non-coding regions or through exon shuffling, a process of recombination between parts of genes, often promoted by introns and resulting in chimeric genes (Gilbert, 1978; Katju & Lynch, 2006; Kawasaki, Lafont, & Sire, 2011; Luis Villanueva-Cañas et al., 2017). These processes indicate that duplicated and gained genes might differ in length either at the genomic level or at the proteomic level, depending on the process of origin.

The cause and effect relationship between duplication and gene length might, however, not be so straightforward. Knowles and McLysaght (2006) found that duplicated genes in Arabidopsis thaliana exhibiting intron loss show evidence of GO enrichment for membrane location and transporter activity functions, demonstrating that duplication does not occur in gene families at random. Thus an additional or alternative explanation for the tendency of duplicated and novel genes to be shorter is that gene families composed of shorter genes have a higher propensity to evolve via duplication. The product of duplication in gene families consisting of shorter genes might be more likely to survive than the product of duplication in families consisting of longer genes. Indeed, Grishkevich and Yanai (2014), in their study on human and mouse genes, suggest that evolution of novelty is dependent on gene length. They postulate that longer genes are less likely to produce duplicates in the first place and that families of longer genes evolve through alternative splicing. On the other hand, gene families comprising shorter genes, depleted of transposable elements and insertions, are more likely to evolve through duplication. Other studies confirm this bias towards duplication among genes depleted in splice variants (Roux & Robinson-Rechavi, 2011). Longer genes also appear to be under stronger selective constraints, as suggested in a study on E. coli, where selection in favour of codons increasing translational accuracy was found to be stronger in longer genes (Eyre-Walker, 1996). Indeed, a correlation between gene length and level of gene conservation was documented across several model organisms (Neme & Tautz, 2013). Disentangling the relative importance of these mechanisms leading to gene length evolution is directly relevant for understanding the evolution of genomic novelty and thus the adaptive process.

Previous studies investigating the evolution of length of duplicated and novel genes have focused on free-living species and model organisms. However, gene duplication is also known to occur in parasites and has contributed to their adaptation to a parasitic life-style (International Helminth Genomes Consortium, 2019; Zarowiecki & Berriman, 2015). It is especially common in parasites with complex life cycles and multiple hosts, where gene duplication and novelty provide the innovation advantageous for host exploitation, invasion, and escape from host immunity (Cwiklinski et al., 2015; Robinson, Dalton, & Donnelly, 2008; Yang et al., 2014). Genes evolving through duplication and novel genes have been shown to encode proteins with a wide range of functions (Cwiklinski et al., 2015; Danchin et al., 2010; International Helminth Genomes Consortium, 2019; Yang et al., 2014; Zarowiecki & Berriman, 2015), but whether that contributes to a reduction of gene lengths and perhaps has an impact on genome size has not been fully explored. Here we use a dataset of diverse parasitic worms in the Phylum Platyhelminthes to explore 1) whether the pattern of gene length variation between 1:1 orthologs and duplicated genes in a group of parasites is similar to that previously reported for free-living animals, 2) whether the number of duplicated genes correlates with genome size, 3) whether evolution through duplication is a characteristic of certain gene families and dependent on gene length and 4) whether it is the process of duplication itself that renders a gene shorter.

We studied the evolution of Hierarchical Orthologous Groups (HOGs), or sets of orthologs/paralogs originating from a single gene (Altenhoff, Gil, Gonnet, & Dessimoz, 2013), in 13 extant digenean trematodes, since their most recent common ancestor. The ancestral trematode genome was reconstructed in Zajac et al. (Chapter 1). It was estimated to have existed about 400 Mya (± 202.4 Mya) (Chapter 1). Since then, trematodes, all of which are parasites, have radiated to exploit a wide range of invertebrate and vertebrate hosts, and to incorporate multiple hosts in some of the most complex life-cycles known (Chapter 1). Here, we examined the HOGs including 1:1 orthologous (retained) genes and genes that originated through duplication; and genes not classified into HOGs and not present in the ancestral trematode but present in the extant species (the gained or the novel genes). The gene lengths, protein lengths and exon numbers in the whole dataset were calculated to contrast gene properties between the retained, the duplicated and the gained genes. Gene lengths included the lengths of introns, coding sequences and regulatory regions. HOGs consisting either of retained genes in all 13 species or of duplicated genes in all species were compared to reveal whether evolution through duplication is a characteristic of certain gene families and is dependent on gene length. To test the role of duplication itself in the reduction of gene length, we contrasted a set of HOGs for each species where both retained and duplicated genes were present within the same HOG (sharing the same ancestral gene). We also examined the number of single exon genes in each gene type category to understand the possible relative rate of retrotransposition in gene duplication.

Methods

Characterization of Genome Architecture

In order to understand whether gene duplication and gain of novel genes have an impact on genome architecture, a phylogenetic generalized least squares (PGLS) method was used to test the correlation between genome size and each of non-coding sequence length, coding sequence length and the number of genes in each gene type category, accounting for the fact that lineages are not independent. The analysis was carried out using the packages phytools and rms in R 4.0.2 according to the protocol at http://blog.phytools.org/2017/08/pearson-correlation-with-phylogenetic.html.
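As an illustration of what a phylogenetically corrected correlation computes, the following is a minimal sketch in plain Python rather than the R workflow used in this chapter. It assumes a Brownian-motion model in which the expected covariance between species is given by a matrix C of shared branch lengths (what phytools calls the phylogenetic variance-covariance matrix). The function names, the toy matrix and the trait values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def phylo_mean(c, x):
    """GLS estimate of the ancestral (root) value of trait x."""
    cinv_ones = solve(c, [1.0] * len(x))
    return sum(w * xi for w, xi in zip(cinv_ones, x)) / sum(cinv_ones)


def phylo_cov(c, x, y):
    """Evolutionary covariance of traits x and y given tree covariance c."""
    ax, ay = phylo_mean(c, x), phylo_mean(c, y)
    dx = [xi - ax for xi in x]
    dy = [yi - ay for yi in y]
    cinv_dy = solve(c, dy)
    return sum(p * q for p, q in zip(dx, cinv_dy)) / (len(x) - 1)


def phylo_corr(c, x, y):
    """Correlation between x and y that accounts for shared ancestry."""
    return phylo_cov(c, x, y) / sqrt(phylo_cov(c, x, x) * phylo_cov(c, y, y))


# Hypothetical three-species tree: species 1 and 2 share half their history.
C = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
genome_size = [400.0, 650.0, 1200.0]   # made-up values, Mb
non_coding = [320.0, 560.0, 1100.0]    # made-up values, Mb
print(phylo_corr(C, genome_size, non_coding))
```

When C is the identity matrix (a star phylogeny with equal branch lengths), the estimate reduces to the ordinary Pearson correlation, which is a convenient sanity check.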

Comparative genomics and ancestral genome reconstruction

The ancestral trematode genome was reconstructed by Zajac et al. (Chapter 1) in the following manner. The OMA standalone (“Orthologous MAtrix”) software was used for inference of Hierarchical Orthologous Groups (HOGs) of genes shared between species (Altenhoff et al., 2019). OMA standalone conducts an all-against-all comparison to identify the evolutionary relationships between all pairs of proteins included in the custom-made database (Altenhoff et al., 2013). Zajac et al. selected 20 species of platyhelminthes and nematodes for the custom-made database: 14 species of digenean trematodes (including our focal species), 3 species of parasitic cestodes, 1 species of parasitic monogeneans, and 2 species of free-living nematodes (see Supplementary Box 1 Chapter 1). As the study by Zajac et al. (Chapter 1) focused on A. winterbourni, these species were chosen on the basis of close relatedness to A. winterbourni and the quality of their genome assemblies and annotations (species also used in International Helminth Genomes Consortium (2019)). The pyHam library (Train et al., 2018) was then used to reconstruct the ancestral trematode genome and, using the data obtained from OMA, to characterize the genes of each extant trematode species as gained, duplicated or retained since the ancestral trematode genome. The proteomic, genomic and transcriptomic sequences for analysis were obtained from the NCBI database of invertebrate genomes (ftp.ncbi.nlm.nih.gov) and from the EBI database (ftp://ftp.ebi.ac.uk/) (see Supplementary Table 1 in Chapter 1).

Gene length analysis

We performed our analysis on 13 trematode species. Schistosoma mansoni was excluded because its gff file provided no mRNA evidence for genes, contained very scarce exon data, and described most gene lengths by CDS length only, making its gene descriptions incomparable with those of the other trematodes. For each species we measured 4 characteristics per gene: gene length in nucleotides, protein length in amino acids, exon number per gene and average exon length per gene. In our analysis gene length included exons (CDS + UTR regions) and introns. Genes and proteins without start (ATG) or stop codons (TGA, TAG, TAA) were eliminated from the analyses. For the number of genes without start or stop codons per species see Supplementary Table 1. After filtering out the incomplete genes, we kept only HOGs with at least 2 species present (Supplementary Table 1). For the raw data, see Supplementary Table 2.
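The completeness filter and the four per-gene characteristics described above can be sketched as follows. This is an illustrative sketch, not the deposited analysis code: the function names and the input layout (GFF-style, 1-based inclusive coordinates and a coding sequence string) are assumptions, and protein length is derived here from the coding sequence, whereas the analysis may have read it from the proteome files.

```python
STOP_CODONS = {"TGA", "TAG", "TAA"}


def is_complete_cds(cds):
    """True if the coding sequence starts with ATG and ends with a stop codon."""
    cds = cds.upper()
    return len(cds) >= 6 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS


def gene_metrics(gene_start, gene_end, exons, cds):
    """Return the four per-gene characteristics used in the comparison.

    exons is a list of (start, end) pairs in 1-based inclusive coordinates,
    so gene length spans exons (CDS + UTR) and introns.
    """
    exon_lengths = [end - start + 1 for start, end in exons]
    return {
        "gene_length_nt": gene_end - gene_start + 1,
        # Assumption: the terminal stop codon is not counted as an amino acid.
        "protein_length_aa": len(cds) // 3 - 1,
        "exon_number": len(exons),
        "mean_exon_length_nt": sum(exon_lengths) / len(exon_lengths),
    }


# Hypothetical two-exon gene with a tiny coding sequence.
toy_cds = "ATGAAATAA"
if is_complete_cds(toy_cds):
    print(gene_metrics(101, 400, [(101, 200), (301, 400)], toy_cds))
```

Genes failing `is_complete_cds` would be dropped before any HOG-level filtering, mirroring the order of the steps in the text.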

Four types of analyses were performed on the dataset. Firstly, a two-way ANOVA was performed in Python using the statsmodels.api v0.12.0 module, fitting an ordinary least squares regression model to the data of gene length, protein length and exon number per gene for all genes of all species. The data were log transformed and normalized using the sklearn.preprocessing library from scikit-learn 0.23.2 to minimize the effect of outliers on the result. The purpose of this analysis was to measure the effect sizes of two independent factors, species and gene type (retained, duplicated or gained), and of the interaction between the two factors on the three dependent variables (gene length, protein length and exon number). A post-hoc Tukey’s Honest Significant Difference test was then performed to determine meaningful differences between pairs of groups (gene types).

Additionally, the analysis of variance was repeated on two subsets. The first dataset included all HOGs with retained genes from all 13 trematodes and all HOGs with duplicated genes from all 13 trematodes. A linear mixed effects regression model from statsmodels.api v0.12.0 was fitted to the data, measuring the effect of gene type on the three dependent variables (gene length, protein length and exon number) with the factors HOG and species treated as random effects (using HOG as a grouping factor and allowing a random slope for differences between species between HOGs). The data were log transformed and normalized using the sklearn.preprocessing library from scikit-learn 0.23.2.

For the second dataset, we chose all HOGs for each species where both duplicated and retained genes were present for that species in the same HOG. In each HOG we calculated the average protein length, gene length and exon number for all the retained and all the duplicated genes. For each of the three dependent variables, the HOGs were then allocated to three categories: “retained genes > duplicated genes” (greater on average in protein length, gene length or exon number), “retained genes < duplicated genes” or “retained genes = duplicated genes”. To account for the differences between species in the number of HOGs, we then calculated the proportion of HOGs in each category for each species. A linear mixed effects regression model from statsmodels.api v0.12.0 was fitted to the data, measuring the effect of the HOG category on the proportion of HOGs. The analysis was run separately on the data of the protein length, the gene length and the exon number.
Species was treated as the random effect (grouping factor) and a random slope was allowed for differences between the HOG categories between species.

From the last analysis, we then chose all HOGs for which Atriophallophorus winterbourni had retained proteins on average longer than duplicated proteins. We used A. winterbourni as an example as it was the focal species in Zajac et al. (Chapter 1). We compared the structure and number of protein domains in one randomly selected retained protein and one duplicated protein from each such HOG using InterProScan (Jones et al., 2014) and Clustal Omega (Sievers & Higgins, 2014).

Lastly, in order to assess the effect of retrotransposition, we measured the number of single exon genes for each gene type category. For each species, we compared the proportion of single exon genes out of the number of genes in each gene type category and out of the total number of genes in the species. To test whether the gene type category or the species explains most variance in the data we used an ordinary least squares regression model.
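To make the effect-size measure of the first analysis concrete, the sketch below computes eta-squared for a two-way layout (species by gene type) as the share of the total sum of squares attributable to each term. It is a hand-rolled illustration for a balanced toy design, not the statsmodels-based analysis described above, which was run on an unbalanced dataset after log transformation and normalization. The function name and the toy values are hypothetical, and eta-squared is assumed here to be the classical (non-partial) version.

```python
from itertools import product


def mean(values):
    return sum(values) / len(values)


def eta_squared_two_way(cells):
    """Eta-squared per term for a balanced two-way design.

    cells maps (species, gene_type) to a list of observations; every
    combination must be present with the same number of observations.
    """
    species = sorted({s for s, _ in cells})
    gene_types = sorted({g for _, g in cells})
    n = len(next(iter(cells.values())))
    for key in product(species, gene_types):
        if len(cells.get(key, [])) != n:
            raise ValueError("design must be complete and balanced")

    everything = [v for values in cells.values() for v in values]
    grand = mean(everything)
    ss_total = sum((v - grand) ** 2 for v in everything)

    sp_means = {s: mean([v for g in gene_types for v in cells[(s, g)]])
                for s in species}
    gt_means = {g: mean([v for s in species for v in cells[(s, g)]])
                for g in gene_types}
    ss_species = n * len(gene_types) * sum(
        (m - grand) ** 2 for m in sp_means.values())
    ss_type = n * len(species) * sum(
        (m - grand) ** 2 for m in gt_means.values())
    ss_cells = n * sum((mean(v) - grand) ** 2 for v in cells.values())
    ss_interaction = ss_cells - ss_species - ss_type

    return {
        "species": ss_species / ss_total,
        "gene_type": ss_type / ss_total,
        "interaction": ss_interaction / ss_total,
        "residual": (ss_total - ss_cells) / ss_total,
    }


# Hypothetical log-scaled lengths, two observations per cell.
toy = {
    ("sp1", "retained"): [5.0, 7.0],
    ("sp1", "duplicated"): [3.0, 5.0],
    ("sp2", "retained"): [9.0, 11.0],
    ("sp2", "duplicated"): [7.0, 9.0],
}
print(eta_squared_two_way(toy))
```

In a balanced design the four shares sum to one; with unequal cell sizes, as in the real dataset, the decomposition depends on the type of sums of squares chosen, which is why a regression-based ANOVA was used.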

The code in the form of a Jupyter notebook has been deposited at: https://github.com/zajacn/comparative_genomics_trematodes/blob/master/Anova_Gene_and_Protein_Length_of_Duplicated_And_Gained_Genes.ipynb

Results

The OMA analysis identified the ancestral trematode genome to have consisted of 13,296 genes (Figure 1). After filtering out fragmented genes and Hierarchical Orthologous Groups comprising only 1 species, we identified 13,416 HOGs encompassing the retained and duplicated genes. By definition, all gained genes were singletons not belonging to any gene family. The majority of the genes in most of the extant trematode genomes were not present in the ancestral trematode, as the prevalence of gained genes varied between species from 33.6% to 80.6% (Figure 1, Supplementary Table 1). Duplicated genes accounted for the smallest proportion of genes in each species (ranging from 2.5% to 26.8%), except in Schistosoma japonicum and Opisthorchis felineus, with 51.6% and 53.5% of duplicated genes respectively (Figure 1, Supplementary Table 1).


Figure 1. Phylogenetic relationship between 13 digenean trematodes used in this study and the proportion of 1:1 orthologs (retained), duplicated and gained genes since the ancestral trematode in each species. The proportions refer to data compiled into Supplementary Table 1. The ancestral trematode genome was reconstructed by Zajac et al. (Chapter 1) using OMA standalone and pyHam. The maximum likelihood phylogeny was reconstructed by Zajac et al. (Chapter 1) in IQTree with ModelFinder Plus, using 238 Orthologous Groups reconstructed with OMA standalone, shared between at least 15 species of the 18 species of trematodes, cestodes and monogeneans used in the study. Each branch in the phylogeny has a bootstrap support of 100, indicated on top of each branch.

The phylogenetic generalized least squares model revealed no significant correlation between the genome size and the number of retained (corr.coef = 0.33, p-value = 0.222), the number of gained (corr.coef = 0.032, p-value = 0.922) or the number of duplicated genes (corr.coef = 0.038, p-value = 0.9153). In fact, the non-coding sequence length (corr.coef = 0.837, p-value = 0.0003) rather than the coding sequence length (corr.coef = -0.015, p-value = 0.98) had predictive power for the genome size (Figure 2).


Figure 2. Correlation between genome size and non-coding genome length between the 13 species used in the study. The correlation between the two characters is visualised on a phylomorphospace (a projection of the phylogenetic tree into a morphospace).

In each species, we compared the average exon number, the protein length and the gene length of all genes from the three gene types (retained, duplicated and gained). Per-species results were compiled into Supplementary Table 3. Two-way analysis of variance on log transformed values revealed a significant effect of species, gene type and the interaction between the two independent factors on all three dependent variables (Table 1). Protein length was more conserved between species than was gene length or exon number, as a comparable amount of variance in protein length was explained by species and by gene type (η² = 8.9% and 9.5% respectively). On the other hand, greater differences were observed between species than between gene types for gene length (η² = 19.5% and 4.7% respectively) and average exon number per gene (η² = 12.1% and 6.7% respectively). The highest average gene length together with the highest average exon number was observed in Fasciola hepatica (mean = 31,306 bp, 7.7 exons), Schistosoma japonicum (mean = 18,037.3 bp, 7 exons) and Opisthorchis felineus (mean = 28,519.7 bp, 8.1 exons) (Supplementary Table 3).


Table 1. Results of ordinary least squares regression models (Table A) and post-hoc Tukey’s HSD tests (Table B) fitted to the whole dataset of retained, duplicated and gained genes in all species. The tests were carried out with and without Opisthorchis felineus because it showed a different pattern between the 3 gene types (Figure 3A), making it an outlier. Species, gene type and gene type by species interaction were used as fixed effects. Protein length, gene length and average exon number were the 3 dependent variables. Eta-squared (η²) is a measure of effect size for use in ANOVA (Analysis of variance).

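Table A.

Analysis with Opisthorchis felineus

| Dependent variable | Effect | df | F | P | η² |
|---|---|---|---|---|---|
| Protein length | Model | 38, 156758 | 1561.9 | <0.001 | |
| Protein length | Species | 12, 156758 | 1455.6 | <0.001 | 8.90% |
| Protein length | Gene Type | 2, 156758 | 9339.6 | <0.001 | 9.50% |
| Protein length | Species x Gene Type | 24, 156758 | 111.6 | <0.001 | 1.40% |
| Gene length | Model | 38, 156758 | 2154 | <0.001 | |
| Gene length | Species | 12, 156758 | 3431.9 | <0.001 | 19.50% |
| Gene length | Gene Type | 2, 156758 | 5005.8 | <0.001 | 4.70% |
| Gene length | Species x Gene Type | 24, 156758 | 123.8 | <0.001 | 1.40% |
| Average exon number | Model | 38, 156758 | 1656 | <0.001 | |
| Average exon number | Species | 12, 156758 | 1972.1 | <0.001 | 12.10% |
| Average exon number | Gene Type | 2, 156758 | 6576.1 | <0.001 | 6.70% |
| Average exon number | Species x Gene Type | 24, 156758 | 102.3 | <0.001 | 1.30% |

Analysis without Opisthorchis felineus

| Dependent variable | Effect | df | F | P | η² |
|---|---|---|---|---|---|
| Protein length | Model | 35, 140081 | 1440 | <0.001 | |
| Protein length | Species | 11, 140081 | 1321.6 | <0.001 | 8.30% |
| Protein length | Gene Type | 2, 140081 | 8907.6 | <0.001 | 10.20% |
| Protein length | Species x Gene Type | 22, 140081 | 104.5 | <0.001 | 1.30% |
| Gene length | Model | 35, 140081 | 1712 | <0.001 | |
| Gene length | Species | 11, 140081 | 2776.8 | <0.001 | 16.70% |
| Gene length | Gene Type | 2, 140081 | 4639.7 | <0.001 | 6% |
| Gene length | Species x Gene Type | 22, 140081 | 100.8 | <0.001 | 1.20% |
| Average exon number | Model | 35, 140081 | 1468 | <0.001 | |
| Average exon number | Species | 11, 140081 | 1857.7 | <0.001 | 11.70% |
| Average exon number | Gene Type | 2, 140081 | 6046 | <0.001 | 7% |
| Average exon number | Species x Gene Type | 22, 140081 | 76.9 | <0.001 | 0.90% |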


Table B. Tukey's HSD Test

Analysis with Opisthorchis felineus

| Gene Type Categories | Mean difference protein length (aa) | Mean difference gene length (bp) | Mean difference average exon number |
|---|---|---|---|
| gained_duplicated | -244 | -12640 | -4.0 |
| retained_duplicated | 28 | -3498 | -0.7 |
| retained_gained | 273 | 9141 | 3.3 |

Analysis without Opisthorchis felineus

| Gene Type Categories | Mean difference protein length (aa) | Mean difference gene length (bp) | Mean difference average exon number |
|---|---|---|---|
| gained_duplicated | -193 | -7301 | -3 |
| retained_duplicated | 85 | 2192 | 0.33 |
| retained_gained | 278.1 | 9494 | 3.4 |


As can be observed from Figure 3A, Opisthorchis felineus was the only species in which the protein length, gene length and average exon number of duplicated genes were greater than those of the other two gene types. Thus, Tukey’s HSD post-hoc testing was performed with and without it. Results indicated retained proteins to be the longest (with/without O. felineus: mean = 543.3 aa (amino acids)/538.7 aa, median = 396 aa/393 aa) and gained proteins to be the shortest (with/without O. felineus: mean = 270.4 aa/260.6 aa, median = 166 aa/162 aa) regardless of whether the analysis was performed with or without O. felineus (Table 1). On the other hand, excluding O. felineus had a significant impact on gene length and exon number. Only with O. felineus excluded were retained genes found to be the longest (mean = 15845 bp, median = 8516 bp) and to contain on average more exons (mean = 6.6, median = 5) than duplicated or gained genes. With O. felineus included, duplicated genes were the longest (mean = 19905 bp, median = 12645 bp) and had the most exons (mean = 7.3, median = 6). In both analyses gained genes were found to be the shortest (with/without O. felineus: mean = 7265.4 bp/6350.7 bp, median = 2066.5 bp/1877 bp) and to contain the fewest exons (with/without O. felineus: mean = 3.3/3.2, median = 2/2).


Figure 3. A. Boxplots of the protein length, gene length and average exon number of the duplicated, gained and retained genes for the 13 trematode species in the dataset.


Figure 3. B. Boxplots showing protein length of retained and duplicated genes per species from all HOGs consisting of retained genes for all species and all HOGs consisting of duplicated genes for all species. Corresponding graphs of gene length and exon number can be found in Supplementary Figure 1. C. Boxplots showing average exon length for retained and duplicated genes from all HOGs consisting of retained genes for all species and all HOGs consisting of duplicated genes for all species, the same dataset as portrayed in B. D. Bar graphs depicting data from a set of HOGs per species with retained and duplicated genes in the same HOG, originating from the same ancestral gene. The bar graph shows the proportions of HOGs with retained genes longer than, shorter than or equal to duplicated genes in protein length and gene length and proportion of HOGs with average exon number greater, lower or equal in retained genes than/as in duplicated genes. The star indicates significantly higher proportion of HOGs with the retained genes being longer or having on average more exons per gene than the duplicated genes, a result of linear mixed effects model.


Comparing the three gene types revealed the gained genes to have the highest proportion of single exon genes out of all genes in each species (1-38% of total gene number), except for Atriophallophorus winterbourni (1% of total gene number) (Figure 4B). Additionally, gained genes had the highest proportion of single exon genes within their gene type category compared to the other two gene types in most species (8%-48% of genes in each gene category) (Figure 4A). Analysis of variance and post-hoc Tukey’s HSD test found no significant difference in the proportion of single exon genes out of all genes between retained and duplicated genes (F(14, 24) = 2.6, R² = 0.604, p-value = 0.87).
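The two proportions compared here (single exon genes within a gene type category, and single exon genes of that category out of all genes in a species) can be computed as in the following sketch. The function name and the toy input are hypothetical; the input is assumed to be one (gene type, exon number) pair per gene for a single species.

```python
from collections import Counter


def single_exon_proportions(genes):
    """Proportion of single exon genes per gene type for one species.

    genes is an iterable of (gene_type, exon_number) pairs. For each gene
    type the result holds the share within the category and the share
    out of all genes in the species.
    """
    per_type = Counter()
    single = Counter()
    for gene_type, exon_number in genes:
        per_type[gene_type] += 1
        if exon_number == 1:
            single[gene_type] += 1
    total = sum(per_type.values())
    return {
        gene_type: {
            "within_category": single[gene_type] / count,
            "of_all_genes": single[gene_type] / total,
        }
        for gene_type, count in per_type.items()
    }


# Hypothetical species with eight annotated genes.
toy_genes = [
    ("gained", 1), ("gained", 1), ("gained", 3), ("gained", 2),
    ("retained", 1), ("retained", 5),
    ("duplicated", 4), ("duplicated", 2),
]
print(single_exon_proportions(toy_genes))
```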

Figure 4.A. Proportion of single exon genes per total number of duplicated (green), retained (pink) and gained (yellow) genes. B. Proportion of single exon genes of each gene type category in total number of genes in each species’ genome.

In order to investigate whether retained genes are longer than duplicated genes because certain gene families consist of longer genes and are less prone to duplication, we selected all HOGs with either consistently retained genes across all 13 trematodes or consistently duplicating genes in all 13 trematodes. We obtained 24 HOGs with retained genes consisting of 13 genes each (1 for each species) and 13 HOGs with duplicated genes containing 32 to 93 genes per species in total (Supplementary Table 4). The number of HOGs in the two categories was different from that in Zajac et al. (Chapter 1) (28 HOGs with retained genes and 2 HOGs with duplicated genes) due to the focus of our analysis on 13 species and because incomplete genes were filtered from the data. The linear mixed effects model revealed the retained proteins to be significantly longer than the duplicated proteins (Table 2, Figure 3B) but not significantly different in gene length or exon number (Table 2). Nevertheless, we still found a significant positive correlation between protein length and exon number (corr.coef = 0.25, p-value = 1.7e-16) and a strong positive correlation between protein length and average exon length per gene (corr.coef = 0.63, p-value = 3.7e-115), showing retained genes to have on average more and longer exons than the duplicated genes (Figure 3C, Supplementary Figure 1). Most variance in gene length, protein length and exon number was explained by differences between species and between HOGs (Table 2).

In order to investigate whether the retained genes are longer than the duplicated genes when they originate from the same ancestral gene within the same family, we selected HOGs for each species consisting of both retained genes and duplicated genes. We obtained between 13 and 84 HOGs per species and a total of 205 unique HOGs, as some HOGs were represented in more than 1 species (Supplementary Table 5). We calculated the average protein length, gene length and exon number for the retained and the duplicated genes in each HOG. We then categorized HOGs into those with retained genes on average greater than duplicated genes (in protein length, gene length or exon number), those with retained genes smaller than duplicated genes and those with retained genes equal to duplicated genes. We calculated the proportion of HOGs for each species in each of those 3 HOG categories. The linear mixed effects model revealed a significantly higher proportion of HOGs with the retained genes being longer (between 44% and 80% of HOGs per species) and having on average more exons per gene (between 31% and 52% per species) than the duplicated genes (Figure 3D). There were more HOGs for which the retained proteins were longer than the duplicated proteins (between 52% and 65% per species), but the difference was not significant (p-value = 0.171). Nevertheless, for both the duplicated (corr.coef = 0.5, p-value = 3.26e-70) and the retained genes (corr.coef = 0.58, p-value = 2.04e-46) we found a positive correlation between the gene length and the protein length.
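The categorisation of HOGs and the per-species proportions described in this paragraph can be sketched as below. This is an illustrative reimplementation under stated assumptions, not the deposited notebook: the function name, the record layout (species, HOG identifier, gene type, measured value) and the toy values are hypothetical, and the same function would be called once per dependent variable.

```python
from collections import defaultdict

CATEGORIES = ("retained>duplicated", "retained<duplicated", "retained=duplicated")


def hog_category_proportions(records):
    """Per-species proportion of HOGs in each retained-vs-duplicated category.

    records is an iterable of (species, hog, gene_type, value), where value
    is one measurement per gene (e.g. protein length). Only HOGs containing
    both retained and duplicated genes of a species are counted.
    """
    values = defaultdict(lambda: {"retained": [], "duplicated": []})
    for species, hog, gene_type, value in records:
        if gene_type in ("retained", "duplicated"):
            values[(species, hog)][gene_type].append(value)

    counts = defaultdict(lambda: dict.fromkeys(CATEGORIES, 0))
    for (species, _hog), groups in values.items():
        if not groups["retained"] or not groups["duplicated"]:
            continue
        retained = sum(groups["retained"]) / len(groups["retained"])
        duplicated = sum(groups["duplicated"]) / len(groups["duplicated"])
        # Exact comparison of the two averages decides the category.
        if retained > duplicated:
            counts[species]["retained>duplicated"] += 1
        elif retained < duplicated:
            counts[species]["retained<duplicated"] += 1
        else:
            counts[species]["retained=duplicated"] += 1

    return {
        species: {cat: n / sum(c.values()) for cat, n in c.items()}
        for species, c in counts.items()
    }


# Hypothetical protein lengths (aa) for one species.
toy_records = [
    ("A", "HOG1", "retained", 300), ("A", "HOG1", "duplicated", 200),
    ("A", "HOG1", "duplicated", 250),
    ("A", "HOG2", "retained", 100), ("A", "HOG2", "duplicated", 150),
    ("A", "HOG3", "retained", 120), ("A", "HOG3", "duplicated", 120),
    ("A", "HOG4", "retained", 500), ("A", "HOG4", "duplicated", 400),
    ("A", "HOG5", "retained", 250),
]
print(hog_category_proportions(toy_records))
```

Working with proportions rather than raw counts is what allows species with very different numbers of qualifying HOGs to be compared in the same mixed model.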


Table 2. Results of linear mixed effects models fitted to the data of A. a set of HOGs for each species with both duplicated and retained genes and B. HOGs with retained genes for all species and HOGs with duplicated genes for all species.

A. HOGs with both duplicated and retained genes, grouping factor: Species

                                          Protein length             Gene length                Average exon number
Term                                      Coef.    Std.Err.  P       Coef.    Std.Err.  P       Coef.    Std.Err.  P
Intercept                                 46.69    2.67      <0.001  40.92    2.37      <0.001  25.5     2.08      <0.001
C(HOG category)[T.%retained=duplicated]   -44.30   3.79      <0.001  -40.08   3.43      <0.001  9.85     1.84      <0.001
C(HOG category)[T.%retained>duplicated]   4.31     3.14      0.171   17.23    3.37      <0.001  13.8     2.7       <0.001
HOG category per species variance         47.28                      40.28                      29.8

B. HOGs with duplicated genes and HOGs with retained genes, grouping factor: HOG

                                          Protein length             Gene length                Average exon number
Term                                      Coef.    Std.Err.  P       Coef.    Std.Err.  P       Coef.    Std.Err.  P
Intercept                                 0.49     0.01      <0.001  0.41     0.012     <0.001  0.31     0.015     <0.001
C(gene type)[T.retained]                  0.013    0.006     0.041   -0.013   0.014     0.323   -0.02    0.014     0.133
Species by HOG variance                   0.04                       0.027                      0.06
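As a reading aid for panel A of Table 2, the sketch below converts the coefficients into fitted values per HOG category. It assumes treatment (dummy) coding, as the `[T. ...]` labels suggest, with the unlisted category (retained genes shorter than duplicated genes) as the reference level represented by the intercept. That reading is an assumption and is not stated explicitly in the table. The dictionary `PANEL_A` and the function `fitted_means` are hypothetical names introduced here for illustration.

```python
# Fixed-effect coefficients copied from Table 2, panel A (percentage of HOGs per species).
PANEL_A = {
    "protein length": {"intercept": 46.69, "retained=duplicated": -44.30, "retained>duplicated": 4.31},
    "gene length": {"intercept": 40.92, "retained=duplicated": -40.08, "retained>duplicated": 17.23},
    "average exon number": {"intercept": 25.5, "retained=duplicated": 9.85, "retained>duplicated": 13.8},
}


def fitted_means(coefs):
    """Fitted percentage of HOGs per category under treatment coding.

    Assumption: the intercept is the fitted value of the reference category
    (taken here to be retained<duplicated) and each other coefficient is the
    difference from that reference.
    """
    base = coefs["intercept"]
    return {
        "retained<duplicated (assumed reference)": base,
        "retained=duplicated": base + coefs["retained=duplicated"],
        "retained>duplicated": base + coefs["retained>duplicated"],
    }


for trait, coefs in PANEL_A.items():
    means = fitted_means(coefs)
    rounded = {category: round(value, 2) for category, value in means.items()}
    print(trait, rounded, "sum:", round(sum(means.values()), 2))
```

Under this reading the three fitted values for each trait add up to approximately 100% (differences arise from rounding of the published coefficients), which is consistent with the response being the percentage of HOGs per category. For protein length, for example, the fitted share of HOGs with longer retained proteins is 46.69 + 4.31 = 51.00%, against 46.69% for the assumed reference category.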

Out of 25 HOGs with both duplicated and retained genes from Atriophallophorus winterbourni, 13 had retained proteins that were longer (mean = 330 aa) than the duplicated proteins (mean = 282 aa). We inspected the similarity in structure between a retained gene and one duplicated gene from each HOG using Clustal Omega and InterProScan. In only 3 HOGs did we observe a loss of protein domains in the duplicated gene (HOG 37106, 36605, 36190) (Table 3, Supplementary Figure 2). No differences in the number of protein domains were observed between the retained and duplicated genes in the other 10 HOGs (Table 3). For all 13 HOGs the Gene Ontology annotation was the same for the retained and the duplicated genes (Table 3). Despite this high similarity in functional annotation, the percent identity of the amino acid alignments in Clustal Omega varied from 17.5% to 94% (Table 3).
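For orientation, percent identity between two aligned protein sequences can be computed as in the minimal sketch below. The denominator convention used here (only alignment columns without a gap in either sequence are counted) is an assumption made for illustration. Alignment tools differ in how they treat gaps, so values from this sketch need not match those reported by Clustal Omega. The function name `percent_identity` and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two sequences taken from the same alignment.

    Assumed convention: columns with a gap in either sequence are skipped,
    and identity is matches / compared columns * 100.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == gap or residue_b == gap:
            continue  # gapped columns are not counted
        compared += 1
        if residue_a == residue_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Invented example: 6 ungapped columns, 5 of them identical.
print(round(percent_identity("MKT-AILV", "MKTSAV-V"), 2))
```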

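HOG 37106
Retained gene ID: maker-jcf7180000220891-snap-gene-0.7-mRNA-1
Duplicated gene ID: augustus_masked-jcf7180000242521-processed-gene-0.1-mRNA-1
Domains in retained gene (summary): 3 non cytoplasmic domains, 6 transmembrane domains, 4 cytoplasmic domains, 6 Tmhelix domains
Domains in duplicated gene (summary): 2 non cytoplasmic domains, 3 cytoplasmic domains, 5 transmembrane domains, 4 Tmhelix domains
GO annotation results for both genes: BP: transmembrane transport; MF: channel activity
Clustal Omega % identity of alignment: 55.42

HOG 36847
Retained gene ID: maker-agouti_scaf_400-augustus-gene-0.8-mRNA-1
Duplicated gene ID: augustus_masked-jcf7180000219352-processed-gene-1.2-mRNA-1
Domains in retained gene (summary): 1 non cytoplasmic, 1 transmembrane, 1 Tmhelix, 1 cytoplasmic
Domains in duplicated gene (summary): 1 non cytoplasmic, 1 transmembrane, 1 Tmhelix, 1 cytoplasmic
GO annotation results for both genes: BP: oxidation-reduction process
Clustal Omega % identity of alignment: 61.59

HOG 36605
Retained gene ID: maker-jcf7180000220459-augustus-gene-0.2-mRNA-1
Duplicated gene ID: maker-jcf7180000277652-snap-gene-0.5-mRNA-1
Domains in retained gene (summary): 10 domains
Domains in duplicated gene (summary): 6 domains
GO annotation results for both genes: MF: protein binding
Clustal Omega % identity of alignment: 21.00

HOG 36398
Retained gene ID: maker-jcf7180000224832-snap-gene-0.2-mRNA-1
Duplicated gene ID: maker-jcf7180000243698-augustus-gene-0.5-mRNA-1
Domains in retained gene (summary): 4 unintegrated aldolase domains
Domains in duplicated gene (summary): 3 unintegrated aldolase domains
GO annotation results for both genes: MF: catalytic activity; BP: glycolytic process
Clustal Omega % identity of alignment: 76.71

HOG 36190
Retained gene ID: snap_masked-jcf7180000249873-processed-gene-0.1-mRNA-1
Duplicated gene ID: maker-jcf7180000252680-snap-gene-0.6-mRNA-1
Domains in retained gene (summary): 3 Gln-synt domains
Domains in duplicated gene (summary): 3 Gln-synt domains
GO annotation results for both genes: BP: nitrogen compound metabolic process; CC: glutamate ammonia ligase
Clustal Omega % identity of alignment: 86.99

HOG 36076
Retained gene ID: maker-jcf7180000234612-snap-gene-0.0-mRNA-1
Duplicated gene ID: maker-jcf7180000271882-augustus-gene-0.2-mRNA-1
Domains in retained gene (summary): 2 domains
Domains in duplicated gene (summary): 2 domains
GO annotation results for both genes: BP: protein transport
Clustal Omega % identity of alignment: 63.27

HOG 35901
Retained gene ID: maker-agouti_scaf_885-snap-gene-0.4-mRNA-1
Duplicated gene ID: maker-jcf7180000246078-snap-gene-0.2-mRNA-1
Domains in retained gene (summary): 2 transmembrane domains, 2 non cytoplasmic domains, 1 cytoplasmic domain and 2 Tmhelices
Domains in duplicated gene (summary): 2 transmembrane domains, 1 non cytoplasmic domain, 2 cytoplasmic domains and 2 Tmhelices
GO annotation results for both genes: No GO terms
Clustal Omega % identity of alignment: 43.32

HOG 35842
Retained gene ID: maker-jcf7180000234885-snap-gene-0.0-mRNA-1
Duplicated gene ID: maker-jcf7180000249012-augustus-gene-0.1-mRNA-1
Domains in retained gene (summary): 8 Zf domains
Domains in duplicated gene (summary): 8 Zf domains
GO annotation results for both genes: MF: DNA binding, zinc ion binding
Clustal Omega % identity of alignment: 31.55

HOG 33453
Retained gene ID: snap_masked-agouti_scaf_164-processed-gene-0.12-mRNA-1
Duplicated gene ID: maker-agouti_scaf_164-augustus-gene-0.18-mRNA-1
Domains in retained gene (summary): 5 domains
Domains in duplicated gene (summary): 5 domains
GO annotation results for both genes: MF: catalytic activity; BP: carbohydrate metabolic process
Clustal Omega % identity of alignment: 93.73

HOG 33248
Retained gene ID: maker-agouti_scaf_54-augustus-gene-0.11-mRNA-1
Duplicated gene ID: maker-jcf7180000254477-augustus-gene-0.3-mRNA-1
Domains in retained gene (summary): SHIPPO repeat and 2 unintegrated domains
Domains in duplicated gene (summary): SHIPPO repeat and 2 unintegrated domains
GO annotation results for both genes: No GO terms
Clustal Omega % identity of alignment: 36.65

HOG 33344
Retained gene ID: maker-jcf7180000227911-augustus-gene-0.7-mRNA-1
Duplicated gene ID: maker-jcf7180000275997-augustus-gene-0.3-mRNA-1
Domains in retained gene (summary): 4 Ferritin domains
Domains in duplicated gene (summary): 4 Ferritin domains
GO annotation results for both genes: MF: ferric ion binding, iron transport
Clustal Omega % identity of alignment: 54.97

HOG 33601
Retained gene ID: maker-jcf7180000250065-snap-gene-0.5-mRNA-1
Duplicated gene ID: maker-jcf7180000276787-snap-gene-0.4-mRNA-1
Domains in retained gene (summary): 4 domains, salt bridge, phospholipid binding pocket
Domains in duplicated gene (summary): domains not characterized
GO annotation results for both genes: No GO terms
Clustal Omega % identity of alignment: 17.46

HOG 31570
Retained gene ID: maker-jcf7180000243912-snap-gene-0.1-mRNA-1
Duplicated gene ID: augustus_masked-jcf7180000243912-processed-gene-0.6-mRNA-1
Domains in retained gene (summary): 13 predicted domains: 4 Tmhelices, 4 transmembrane proteins, 3 cytoplasmic domains, 2 non cytoplasmic domains
Domains in duplicated gene (summary): 13 predicted domains: 4 Tmhelices, 4 transmembrane proteins, 3 cytoplasmic domains, 2 non cytoplasmic domains
GO annotation results for both genes: No GO terms
Clustal Omega % identity of alignment: 37.84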

Table 3. Analysis of protein domains and identity of the alignment between a retained gene and a duplicated gene from the same gene family (HOG) in Atriophallophorus winterbourni. In the set of 25 HOGs with duplicated and retained genes in the same family, there were 13 HOGs with retained proteins longer than duplicated proteins. With InterProScan we studied their protein domains and GO annotation, and with Clustal Omega we measured the % identity of the alignment.

Discussion

Through the creation of novel adaptive functions, gene duplication and the formation of new genes may play a key role in organismal evolution (Hurles, 2004; Martin, 1999; Zhang, 2003). Therefore, examining the mechanisms underlying gene duplication and novel gene emergence is important for understanding how speciation proceeds. Using a group of parasitic flatworms as a study system, we confirmed that, compared to retained genes, duplicated genes are generally significantly shorter, a pattern similar to that observed in free-living organisms (Katju & Lynch, 2006; Mallon et al., 2004; Yang et al., 2013).

Whilst protein length of all genes appears relatively similar between the species of parasites investigated, variation between species in gene length was significant. The gained genes and proteins tended to be the shortest, and the retained genes and proteins were the longest. Such variation in gene and protein structure between species and between conserved and duplicating genes has previously been attributed to exon or intron number differences (Gardiner, Barker, Butlin, Jordan, & Ritchie, 2008; Katju & Lynch, 2006; Lin, Zhu, Silva, Gu, & Buell, 2006; Patthy, 1999; Xu et al., 2012). In this study, retained genes also had the highest number of exons (Figure 3), and species with the longest genes and proteins, such as O. felineus, F. hepatica and S. japonicum, had the highest number of exons across all genes. These findings support the notion that protein and gene length variability is explained by exon number differences between species and between gene types.

We hypothesised that the difference in length between 1:1 orthologs and duplicated genes might be due either to 1) the mechanism by which duplication occurs, with retrotransposition or tandem duplication resulting in gene truncation (Hurles, 2004), or to 2) duplicates of shorter genes facing a lower probability of loss from shifts in open reading frame, pseudoexonization, or accumulation of stop codons (Xu et al., 2012). The two mechanisms are not mutually exclusive; duplication of shorter genes might be less prone to degradation but might also result in a shorter copy of the gene. To tease apart signatures of these two processes, we contrasted two sets of HOGs. Firstly, across the 13 trematode genomes, presence of shorter genes in HOGs that contained duplicated genes compared to those that contained retained genes would indicate that gene families with shorter genes are more likely to be evolving by duplication, and less likely to be structurally constrained.

Secondly, within each species, shorter genes in duplicated versus retained genes originating from the same ancestral gene within HOGs would indicate that duplication results in truncation. If in both scenarios duplicated genes were shorter, both processes play a role.

We find that HOGs with retained genes produced longer proteins than gene families with duplicated genes, with longer exons per gene on average. Although the retained genes were also longer at the genomic level (including the total length of introns, coding sequences and regulatory regions) and had more exons, these differences between retained and duplicated genes were not significant, as the relationship was not observed across all species. The results suggest that the largest selective constraint is the protein structure. We observe that the product of duplication is more likely to become fixed in a population if it occurs in a family producing short proteins. Genes with longer exons on average, encoding longer proteins, are less likely to evolve through duplication, likely because of the higher probability of incomplete duplication rendering a protein dysfunctional. In housekeeping genes such events can have lethal consequences. Indeed, we observe the average length of retained proteins in those 24 HOGs with retained proteins shared between all species to be longer (mean length = 418.3 bp) than the retained proteins from HOGs where duplication has also occurred (mean length = 395.3 bp).

Within HOGs where both duplicated and retained genes originated from the same ancestral gene, the retained gene was more likely to be longer at the genomic level (including coding regions, regulatory elements and introns) and to have more exons than, or the same number of exons as, the duplicated gene. However, there was no significant difference in protein length between the two gene types, despite a positive correlation between gene length and protein length observed for both duplicated and retained genes. These results suggest that a duplication event also renders the gene shorter, perhaps through processes such as retrotransposition or exon/intron loss in segmental duplication. Importantly, our results imply that the process does not necessarily have the same impact on the length at the protein level. The differences between protein and gene length may be related to the divergence in regulatory regions between the duplicated and retained genes, where the regulatory regions have different impacts on expression patterns and patterns of alternative splicing. Further studies are required to confirm the role and divergence of regulatory regions in retained and duplicated genes.

The highest proportion of genes per species were the gained genes, those not present in the ancestral trematode genome. These can be genes that have evolved in the subsequent ancestral genomes or those that are species-specific. Our results indicate those genes to be the shortest and to represent the highest proportion of single exon genes in the whole genome. The single exon genes can indicate a high contribution of retrotransposition to the formation of novel genes. Additionally, novel genes can originate through processes such as exon shuffling, arise from previously non-coding material (Patthy, 1999), or be highly diverged duplicated genes without a recognizable similarity to any other existing genes (Ohno, 2013; Yang et al., 2014). All of these processes could render the genes shorter than the other two gene types. However, it also has to be taken into consideration that, despite our best efforts to filter out incomplete genes, the gained genes might also be artefacts of poor quality assemblies. Thus, the result should be interpreted with caution.

We also note that in our dataset the genome size was not correlated with the proportion of duplicated and gained genes. This suggests that our analyses were not biased by genome size effects in the coding regions. These results also illustrate that novelty in the form of duplicated and gained genes and generally gene count plays an insignificant role in genome size variation in trematodes. However, the proportion of non-coding content, including repetitive and non-repetitive DNA, does strongly correlate with genome size, as noted by previous studies in nematodes and Platyhelminthes (International Helminth Genomes Consortium, 2019) as well as in other species (Pagel & Johnstone, 1992).

Opisthorchis felineus displayed an unusual genome structure compared to all the other trematodes analysed here, causing a considerable shift in results. The proportion of duplicated genes in this species was 53.5%, the highest among all the investigated trematodes. The protein and gene length of duplicated genes in O. felineus was greater than that of the retained and gained genes, as was the average exon number per gene. The high proportion of duplicated genes may not be associated with the pattern of them being longer and having more exons. This was suggested by the opposite results in the same analyses involving S. japonicum, which showed a comparable proportion of duplicated genes at 51.6% (Figure 1, Figure 4A). The two comparative analyses indicate that the pattern of gene duplication length and proportion is species-specific. In gene families shared with all other trematodes, retained proteins in O. felineus were longer than the duplicated ones. However, when selecting species-specific HOGs containing both retained and duplicated genes, longer protein and gene sequences were associated with duplicated copies. O. felineus is a cat liver fluke alternating in its life cycle between a gastropod snail, a cyprinoid fish and a fish-eating mammal such as a cat or a human (see Supplementary Box 1 in Chapter 1 for full life cycle description). The species’ native range extends beyond those of other Opisthorchiidae species, spanning from the Arctic Circle to Southern Europe (Ershov et al., 2019). It has the same amount of repetitive content in the genome as other Opisthorchiidae species (Ershov et al., 2019) but greater average protein and gene length (Supplementary Table 3). Previous studies found high similarity in biochemical pathways and ecology of all Opisthorchiidae species (Ershov et al., 2019; Shekhovtsov, Katokhin, Kolchanov, & Mordvinov, 2010). Also, all species are carcinogenic in their final host (Sripa et al., 2012). From our study it is impossible to say what makes the species different, but this pattern suggests an interesting path for further research. However, it has to be taken into account that the reference genome of O. felineus was assembled from DNA of pooled individuals and, perhaps not surprisingly, has been observed to have an exceptionally high heterozygosity rate, which might have hampered the contiguity of the assembly and thus might also impact our results (Ershov et al., 2019).

In conclusion, our findings suggest significant differences in gene length and protein length between species and between genes of different origin. We find loss of exons and loss of gene length in duplicated genes originating from the same ancestral gene as the 1:1 orthologous genes. We also find gene families of shorter genes to be more likely to evolve through duplication. Studying selection pressures and functionality of duplicated and gained genes can indicate their adaptive advantage and their importance in lineage evolution. Therefore, further research into mechanisms of gene duplication in trematode species, the relative contribution of the different mechanisms on genome architecture and reasons for length differences between species and gene families would be an interesting step forward. Additionally, in order to confirm that the pattern found in trematodes is also found in closely related free-living species, the data from Zajac et al. (Chapter 1) could be explored to compare gene length of 1:1 orthologous genes and duplicated genes between parasitic trematodes and free-living nematodes.

Acknowledgments

We would like to thank all the people who have contributed to the work for Chapter 1 which made this Chapter possible. We would also like to thank the Statistical Consulting Team at ETH Zurich for their advice in the statistical analysis and Frida Feijen for comments on the manuscript.

Supplementary Information

Table of contents:

1. Supplementary Figures
   a. Supplementary Figure 1. Boxplot of gene length (A) and exon number (B) of retained and duplicated genes for each species in a dataset of HOGs with retained genes for all species and HOGs with duplicated genes for all species. The two figures relate to Figure 3B. Gene length and exon number were not significantly different between the retained and the duplicated genes.
   b. Supplementary Figure 2. Protein domains characterized by InterProScan of a duplicated and a retained gene of Atriophallophorus winterbourni in 3 HOGs where the retained protein was longer than the duplicated protein and where a loss of protein domain in the duplicated gene was observed.
2. Supplementary Tables
   a. Supplementary Table 1. Number of duplicated, retained and gained genes per species before and after the two levels of filtering.
   b. Supplementary Table 2. Gene ID, Protein ID, Gene Length, Protein Length, Exon number per duplicated, retained and gained genes per species since the trematode ancestor. Raw data.
   c. Supplementary Table 3. The total number of duplicated, retained and gained genes per species since the trematode ancestor and the median, the mean and the mode length of all genes and proteins. Genes without start and stop codons and HOGs with only 1 species have already been eliminated from the analysis. The numbers relate to Figure 3.
   d. Supplementary Table 4. Total number of genes per species belonging to the 13 HOGs with only duplicated genes from all species.
   e. Supplementary Table 5. Number of hierarchical orthologous groups and the total number of genes per species in a dataset of both retained and duplicated genes within the same HOG. The table summarises the total number of HOGs and the percentage of HOGs for which retained genes are longer or have more exons than duplicated genes, for which they are equal for the measurements and for which the retained genes are shorter or have fewer exons than the duplicated genes.

Supplementary Figures

Supplementary Figure 1. Boxplot of gene length (A) and exon number (B) of retained and duplicated genes for each species in a dataset of HOGs with retained genes for all species and HOGs with duplicated genes for all species. The two figures relate to Figure 3B. Gene length and exon number were not significantly different between the retained and the duplicated genes.

Supplementary Figure 2. Protein domains characterized by InterProScan of a duplicated and a retained gene of Atriophallophorus winterbourni in 3 HOGs where the retained protein was longer than the duplicated protein and where a loss of protein domain in the duplicated gene was observed.

References

Altenhoff, A. M., Gil, M., Gonnet, G. H., & Dessimoz, C. (2013). Inferring Hierarchical Orthologous Groups from Orthologous Gene Pairs. PLoS One, 8(1), e53786. doi:10.1371/journal.pone.0053786
Campbell, M. A., Zhu, W., Jiang, N., Lin, H., Ouyang, S., Childs, K. L., . . . Buell, C. R. (2007). Identification and Characterization of Lineage-Specific Genes within the Poaceae. Plant Physiology, 145(4), 1311-1322. doi:10.1104/pp.107.104513
Casewell, N. R., Wagstaff, S. C., Harrison, R. A., Renjifo, C., & Wüster, W. (2011). Domain Loss Facilitates Accelerated Evolution and Neofunctionalization of Duplicate Snake Venom Metalloproteinase Toxin Genes. Molecular Biology and Evolution, 28(9), 2637-2649. doi:10.1093/molbev/msr091
Cwiklinski, K., Dalton, J. P., Dufresne, P. J., La Course, J., Williams, D. J. L., Hodgkinson, J., & Paterson, S. (2015). The Fasciola hepatica genome: gene duplication and polymorphism reveals adaptation to the host environment and the capacity for rapid evolution. Genome Biol., 16(1), 71.
Danchin, E. G. J., Rosso, M.-N., Vieira, P., de Almeida-Engler, J., Coutinho, P. M., Henrissat, B., & Abad, P. (2010). Multiple lateral gene transfers and duplications have promoted plant parasitism ability in nematodes. Proceedings of the National Academy of Sciences, 107(41), 17651. doi:10.1073/pnas.1008486107
Ershov, N. I., Mordvinov, V. A., Prokhortchouk, E. B., Pakharukova, M. Y., Gunbin, K. V., Ustyantsev, K., . . . Skryabin, K. G. (2019). New insights from Opisthorchis felineus genome: update on genomics of the epidemiologically important liver flukes. BMC Genomics, 20(1), 399. doi:10.1186/s12864-019-5752-8
Eyre-Walker, A. (1996). Synonymous codon bias is related to gene length in Escherichia coli: selection for translational accuracy? Molecular Biology and Evolution, 13(6), 864-872. doi:10.1093/oxfordjournals.molbev.a025646
Freeling, M., & Thomas, B. C. (2006). Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome Research, 16(7), 805-814. doi:10.1101/gr.3681406
Gardiner, A., Barker, D., Butlin, R. K., Jordan, W. C., & Ritchie, M. G. (2008). Evolution of a Complex Locus: Exon Gain, Loss and Divergence at the Gr39a Locus in Drosophila. PLoS One, 3(1), e1513. doi:10.1371/journal.pone.0001513
Gilbert, W. (1978). Why genes in pieces? Nature, 271(5645), 501-501.
Grishkevich, V., & Yanai, I. (2014). Gene length and expression level shape genomic novelties. Genome Research, 24(9), 1497-1503. doi:10.1101/gr.169722.113
Herron, M. D. (2016). Origins of multicellular complexity: Volvox and the volvocine algae. Molecular Ecology, 25(6), 1213-1223. doi:10.1111/mec.13551
Hurles, M. (2004). Gene duplication: the genomic trade in spare parts. PLoS Biology, 2(7), E206-E206. doi:10.1371/journal.pbio.0020206
International Helminth Genomes Consortium. (2019). Comparative genomics of the major parasitic worms. Nat. Genet., 51(1), 163-174. doi:10.1038/s41588-018-0262-1
Jones, P., Binns, D., Chang, H.-Y., Fraser, M., Li, W., McAnulla, C., . . . Nuka, G. (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics, 30(9), 1236-1240.
Jorquera, R., González, C., Clausen, P., Petersen, B., & Holmes, D. S. (2018). Improved ontology for eukaryotic single-exon coding sequences in biological databases. Database: the journal of biological databases and curation, 2018, 1-6. doi:10.1093/database/bay089
Katju, V., & Lynch, M. (2006). On the Formation of Novel Genes by Duplication in the Caenorhabditis elegans Genome. Molecular Biology and Evolution, 23(5), 1056-1067. doi:10.1093/molbev/msj114
Kawasaki, K., Lafont, A.-G., & Sire, J.-Y. (2011). The evolution of milk casein genes from tooth genes before the origin of mammals. Molecular Biology and Evolution, 28(7), 2053-2061.
Knowles, D. G., & McLysaght, A. (2006). High Rate of Recent Intron Gain and Loss in Simultaneously Duplicated Arabidopsis Genes. Molecular Biology and Evolution, 23(8), 1548-1557. doi:10.1093/molbev/msl017
Koonin, E. V. (2015). Origin of eukaryotes from within archaea, archaeal eukaryome and bursts of gene gain: eukaryogenesis just made easier? Philosophical Transactions of the Royal Society B: Biological Sciences, 370(1678), 20140333. doi:10.1098/rstb.2014.0333
Koszul, R., Caburet, S., Dujon, B., & Fischer, G. (2004). Eucaryotic genome evolution through the spontaneous duplication of large chromosomal segments. The EMBO Journal, 23(1), 234-243. doi:10.1038/sj.emboj.7600024
Lin, H., Zhu, W., Silva, J. C., Gu, X., & Buell, C. R. (2006). Intron gain and loss in segmentally duplicated genes in rice. Genome Biology, 7(5), R41. doi:10.1186/gb-2006-7-5-r41
Luis Villanueva-Cañas, J., Ruiz-Orera, J., Agea, M. I., Gallo, M., Andreu, D., & Albà, M. M. (2017). New Genes and Functional Innovation in Mammals. Genome Biology and Evolution, 9(7), 1886-1900. doi:10.1093/gbe/evx136
Mallon, A.-M., Wilming, L., Weekes, J., Gilbert, J. G. R., Ashurst, J., Peyrefitte, S., . . . Brown, S. D. M. (2004). Organization and Evolution of a Gene-Rich Region of the Mouse Genome: A 12.7-Mb Region Deleted in the Del(13)Svea36H Mouse. Genome Research, 14(10a), 1888-1901. doi:10.1101/gr.2478604
Martin, A. P. (1999). Increasing Genomic Complexity by Gene Duplication and the Origin of Vertebrates. The American Naturalist, 154(2), 111-128. doi:10.1086/303231
Neme, R., & Tautz, D. (2013). Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution. BMC Genomics, 14(1), 117. doi:10.1186/1471-2164-14-117
Ohno, S. (2013). Evolution by gene duplication: Springer Science & Business Media.
Pagel, M., & Johnstone, R. A. (1992). Variation across species in the size of the nuclear genome supports the junk-DNA explanation for the C-value paradox. Proceedings of the Royal Society of London. Series B: Biological Sciences, 249(1325), 119-124.
Patthy, L. (1999). Genome evolution and the evolution of exon-shuffling - a review. Gene, 238(1), 103-114.
Petrov, D. A. (2001). Evolution of genome size: new approaches to an old problem. Trends in Genetics, 17(1), 23-28.
Rivera, A. S., Pankey, M. S., Plachetzki, D. C., Villacorta, C., Syme, A. E., Serb, J. M., . . . Oakley, T. H. (2010). Gene duplication and the origins of morphological complexity in pancrustacean eyes, a genomic approach. BMC Evolutionary Biology, 10(1), 123. doi:10.1186/1471-2148-10-123
Robinson, M. W., Dalton, J. P., & Donnelly, S. (2008). Helminth pathogen cathepsin proteases: it’s a family affair. Trends Biochem. Sci., 33(12), 601-608. doi:10.1016/j.tibs.2008.09.001
Roux, J., & Robinson-Rechavi, M. (2011). Age-dependent gain of alternative splice forms and biased duplication explain the relation between splicing and duplication. Genome Research, 21(3), 357-363. doi:10.1101/gr.113803.110
Shekhovtsov, S. V., Katokhin, A. V., Kolchanov, N. A., & Mordvinov, V. A. (2010). The complete mitochondrial genomes of the liver flukes Opisthorchis felineus and Clonorchis sinensis (Trematoda). Parasitology International, 59(1), 100-103.
Sievers, F., & Higgins, D. G. (2014). Clustal Omega, accurate alignment of very large numbers of sequences. In Multiple sequence alignment methods (pp. 105-116): Springer.
Sripa, B., Brindley, P. J., Mulvenna, J., Laha, T., Smout, M. J., Mairiang, E., . . . Loukas, A. (2012). The tumorigenic liver fluke Opisthorchis viverrini–multiple pathways to cancer. Trends in Parasitology, 28(10), 395-407.
Xu, G., Guo, C., Shan, H., & Kong, H. (2012). Divergence of duplicate genes in exon–intron structure. Proceedings of the National Academy of Sciences, 109(4), 1187-1192.
Xu, Y., Wu, G., Hao, B., Chen, L., Deng, X., & Xu, Q. (2015). Identification, characterization and expression analysis of lineage-specific genes within sweet orange (Citrus sinensis). BMC Genomics, 16(1), 995. doi:10.1186/s12864-015-2211-z
Yang, L., Zou, M., Fu, B., & He, S. (2013). Genome-wide identification, characterization, and expression analysis of lineage-specific genes within zebrafish. BMC Genomics, 14(1), 65. doi:10.1186/1471-2164-14-65
Yang, Z., Wafula, E. K., Honaas, L. A., Zhang, H., Das, M., Fernandez-Aparicio, M., . . . dePamphilis, C. W. (2014). Comparative Transcriptome Analyses Reveal Core Parasitism Genes and Suggest Gene Duplication and Repurposing as Sources of Structural Novelty. Molecular Biology and Evolution, 32(3), 767-790. doi:10.1093/molbev/msu343
Zarowiecki, M., & Berriman, M. (2015). What helminth genomes have taught us about parasite evolution. Parasitology, 142(S1), S85-S97.
Zhang, J. (2003). Evolution by gene duplication: an update. Trends Ecol. Evol., 18(6), 292-298. doi:10.1016/S0169-5347(03)00033-8

Chapter 3

3: Genomic footprint of divergence under high gene flow in the trematode parasite Atriophallophorus winterbourni

Natalia Zajac*,1,2, Frida A. A. Feijen1,2, Niklas Zemp2, Jukka Jokela1,2

1. Eawag, Swiss Federal Institute of Aquatic Science and Technology, CH-8600 Dübendorf, Switzerland

2. ETH Zurich, Department of Environmental Systems Science, Institute of Integrative Biology, CH-8092 Zurich, Switzerland

*Author for Correspondence: Natalia Zajac, ETH Zurich, Department of Environmental Systems Science, Institute of Integrative Biology, Zurich, Switzerland, +41 58 765 1122, [email protected]

Abstract

Past population divergence, partly restored through secondary contact and gene flow, may still be observed as genome-wide variation in the present populations. Formation of the Southern Alps and a series of glacial periods on the South Island of New Zealand have created a north-west and south-east division in animal and plant communities. The populations of dispersing and mobile species have since resumed connectedness, but the genomic signature of the historic divergence may still be seen. Here, using a pooled population sequencing approach (Pool-Seq), we examined the genomic footprint of divergence between the north-west and south-east populations of the trematode parasite Atriophallophorus winterbourni (Blasco-Costa et al., 2019), native to the Island. Examining the 18.2 million SNPs across the genome found in 5 lakes around the Island, we found the population genomic structure to still reflect the glacial isolation and division due to the Southern Alps, despite a signature of gene flow across the Alpine Fault. Of the 1013 SNPs strongly associated with the differentiation between the lakes on the north-west and the south-east, 5.2% were detected in 6 genes. Three of them encoded proteins associated with the extracellular vesicle biogenesis pathway, found in other trematodes to play a role in parasite invasion of the host tissues. One gene encoding a charged multivesicular body protein, an element of this pathway, was found to harbour a decreased level of polymorphism across all lakes, possibly a signature of positive selection. The observed divergence might be a signature of local adaptation of the parasite to its host.

Introduction

Populations under allopatric isolation can experience divergent selection, which can modulate the underlying genomic architecture, thereby promoting local adaptation (Grivet et al., 2010; Hartmann, McDonald, & Croll, 2018; Lamichhaney et al., 2012). This will manifest itself in significant heterogeneity of genetic differentiation across the genome, reflecting patterns of diversifying selection distinct from the background level of divergence caused by neutral processes (Ferchaud & Hansen, 2016; Nosil, Funk, & Ortiz‐Barrientos, 2009; Stölting et al., 2015). In conjunction with gene flow, occurring for example due to secondary contact, divergence will be reduced in regions not under selection, leading to ‘genomic islands’ of differentiation (Cruickshank & Hahn, 2014; Feder, Egan, & Nosil, 2012; Ferchaud & Hansen, 2016; Papadopulos et al., 2011). Loci linked to regions under divergent selection will hitchhike towards fixation and divergence will be seen even in regions not directly of functional importance (Emelianov, Marec, & Mallet, 2004). Strictly speaking, however, neutral forces such as strong genetic drift in geographically isolated populations or varying effects of bottlenecks between populations can result in similar genomic patterns (Ferchaud & Hansen, 2016). Interpreting genome-wide divergence thus requires a critical consideration of the different forces through the study of species ecology and phylogeography (Nunez, Elyanow, Ferranti, & Rand, 2020; Stölting et al., 2015; Walden, Lucek, & Willi, 2020). Rapid advancements in next generation sequencing (NGS) have facilitated the detection of signatures of divergence, shifting the focus from analysis of neutral markers, or markers sparsely distributed across the genome, to analysis of candidate genes and genome-wide studies (Fischer et al., 2013). NGS analysis of pools of individuals (Pool-Seq) has provided an effective way of studying large population samples in a cost-effective manner and has proved accurate in estimating population-wide allele frequencies (Dauphin, 2020; Hivert, Leblois, Petit, Gautier, & Vitalis, 2018).

On the South Island of New Zealand, the geographic barriers promoting divergence in many taxa have been the formation of the Southern Alps and periods of glaciation (Heads, 1998; Wagstaff, Heenan, & Sanderson, 1999). Pliocene and Pleistocene events were major factors shaping the present-day distribution of species (Wallis & Trewick, 2009). Uplift of the Southern Alps during the Pliocene (11-16 Mya) formed a geographical barrier stretching around 600 km along the South Island from the north-east to the south-west, eventually reaching over 3000 m above sea level (a.s.l.) (Figure 1) (Sutherland, Carrivick, Shulmeister, Quincey, & James, 2019; Wallis & Trewick, 2009). The barrier between the north-west and the south-east was strengthened by glacial periods during the Pleistocene (0.01-1.8 Mya), with the Last Glacial Maximum (LGM) ending as recently as 18 kya (Pillans et al., 1993; Shulmeister, Thackray, Rittenour, Fink, & Patton, 2019). The ice sheets during that period covered as much as 30% of the South Island and rendered areas uninhabitable (Pillans et al., 1993; Shulmeister et al., 2019; Trewick & Wallis, 2001). The combined effects of ice sheets and glacial outwash, during the interglacials, resulted in the so-called ‘Beech gap’, a discontinuous distribution of taxa with range fragmentation at the Manawatu gap in the very centre of the South Island (Gibbs, 2006; Liggins, Chapple, Daugherty, & Ritchie, 2008; Wallis & Trewick, 2009). Suitable habitats free of ice, and thus biological refugia, have been mostly identified in the north of the South Island, in the Nelson/Marlborough region, and in the south, in the Otago region (Figure 1) (Liggins et al., 2008; Wallis & Trewick, 2009).

After the retreat of the glaciers since the LGM, gene flow through migration and dispersal has contributed substantially to the elimination of genetic imprints of Pleistocene refugial isolation. High gene flow through migration and dispersal facilitated postglacial recolonization of the Island for a variety of plant species (Wardle, 1988), invertebrates (Marske, Leschen, Barker, & Buckley, 2009; Trewick & Wallis, 2001), birds such as the kea, Nestor notabilis (Dussex, Wegmann, & Robertson, 2014), and even some species generally considered poor dispersers, such as the New Zealand common skink (Liggins et al., 2008). Nevertheless, even for highly dispersive and migratory species, gene flow remains somewhat restricted by the Southern Alps (Liggins et al., 2008; Marshall, Hill, Fontaine, Buckley, & Simon, 2009).

One of the most prominent examples of postglacial recolonization is found in aquatic species, whose dispersal is limited (Leathwick, Elith, Chadderton, Rowe, & Hastie, 2008; Neiman & Lively, 2004; Trewick, Wallis, & Morgan-Richards, 2011). The majority of the contemporary lakes on the South Island were under ice or associated with glaciers during the LGM (Sutherland et al., 2019). Glacial periods have also led to the formation of new lakes (Sutherland et al., 2019) and the disappearance of others, such as the paleolake Manuherikia (existing 15-8 Mya) (Wallis & Trewick, 2009).

Atriophallophorus winterbourni (also known as Microphallus sp. or Microphallus livelyi) (Blasco-Costa et al., 2019) is a digenean trematode parasite whose population genetic structure is determined by the presence of two hosts in the life cycle, an aquatic and a terrestrial one (Dybdahl & Lively, 1996). The intermediate host is Potamopyrgus antipodarum (Gray, 1843), a prosobranch aquatic snail, common in most freshwaters of New Zealand (Warwick, 1952; Winterbourn, 1970), while the definitive hosts are waterfowl, including Grey Duck (Anas superciliosa (Gmelin, 1789)) and European Mallard (Anas platyrhynchos (von Linné & Lange, 1760)) (Lively & McKenzie, 1991). The parasite is native to lakes across all of New Zealand (Blasco-Costa et al., 2019). It reproduces asexually within the gonads of the snail, producing genetically identical metacercariae. Infection eventually castrates the snail (Lively & McKenzie, 1991). In the gut of the final host, the adult worms reproduce sexually, producing eggs that are released with faeces (Lively & McKenzie, 1991).

Earlier population genetic studies suggest that the parasite populations recolonized the South Island from northern and southern refugia (Feijen, Zajac, Vorburger, & Jokela, in prep). Additionally, the population genetic structure of the parasite suggests a degree of west/east separation (Dybdahl & Lively, 1996; Feijen, Zajac, et al., in prep). A survey of mitochondrial haplotypes from 28 lakes across the South Island found three haplotype groups: haplotype group W, present in the lakes west of the Southern Alps; haplotype group CA (closely related to haplotype group W), present in mountain lakes in the north-east (east of Arthur's Pass); and haplotype group M, present in the lakes in the south-east (Figure 1) (Feijen, Zajac, et al., in prep). The mitochondrial haplotype pattern, together with earlier allozyme studies and present SNP marker studies, points to greater admixture among populations on either side of the Southern Alps (Dybdahl & Lively, 1996; Feijen, Zajac, et al., in prep). With the current data we predicted that isolation of the parasite in the refugia with its intermediate host most likely led to co-divergence and local adaptation. Recolonization, secondary contact and connectedness between the populations afterwards were most likely mediated through bird flight and should thus lead to a certain degree of erosion of the genomic signature of divergence at regions not involved in local adaptation. The extent of that erosion would depend on the degree of connectedness. Genome-wide population genetic studies of the parasite, however, have not yet been performed.

Figure 1. Map of the South Island of New Zealand. The map shows the extent of the Southern Alps across the Island, the location of the sampled lakes and indicates which lakes are considered east or west of the Southern Alps in our study. It also points to the haplotype groups found in each lake according to Feijen et al. (Feijen, Zajac, et al., in prep).

In this study, our goals were i.) to provide a better resolution of the population structure of Atriophallophorus winterbourni across the South Island of New Zealand using whole genome data, ii.) to characterize patterns of gene flow between the lakes and iii.) to identify which genomic regions have been subject to divergence due to isolation in northern/southern refugia and restricted gene flow across the Southern Alps by focusing on the lakes on the west and on the east side of the Alpine Fault (Figure 1). To achieve these goals, we collected parasites from 5 lakes across the South Island of New Zealand: Lake Mapourika, Lake Paringa, Lake Middleton, Lake Alexandrina and Lake Selfe (Figure 1), and performed whole genome resequencing of pools of individuals. The recently sequenced reference genome of Atriophallophorus winterbourni (Chapter 1) provided us with the opportunity to address these goals.

Methods

Parasite collection and DNA extraction

P. antipodarum snails were collected in January 2019 from 5 lakes across the South Island of New Zealand (Figure 1). At each lake we collected snails from multiple shallow localities (<1.5 m) along the shore with a kicknet pushed through the vegetation. Within two weeks of collection, the snails were transported to the Swiss Federal Institute of Aquatic Science and Technology (Eawag, Dübendorf, Switzerland), where they were kept in boxes of 200-500 snails in a flow-through system filtering the water every 12 h during daytime. Snails were fed spirulina ad libitum (Arthrospira platensis, Spirulina California, Earthrise) once a day.

Each snail was dissected separately and all A. winterbourni metacercariae were isolated under 10x-20x magnification. The metacercariae were hatched into adult worms to separate the parasite from the double-walled metacercarial cyst, which contains both parasite and snail DNA (Galaktionov & Dobrovolskij, 2013). To initiate hatching, the metacercariae were incubated at 40 °C for 2-4 h in Tyrode’s salt solution, supplemented with pancreatin (Sigma P3292) (0.15 g/50 mL of Tyrode’s salt solution), 100 mg/mL Penicillin G (Fluka 13752) and 0.1 g/mL Streptomycin (Fluka 85880). For Tyrode’s salt solution we mixed Tyrode’s salts (Sigma T2145-10x1L) with 1 L of Milli-Q water and 1 g of sodium bicarbonate (NaHCO3). After all the worms had hatched to their adult stage, they were washed twice with Tyrode’s salt solution and antibiotics (100 mg/mL Penicillin G, Fluka 13752 and 0.1 g/mL Streptomycin, Fluka 85880) to remove all the remaining cysts shed by the hatched worms. From each infection exactly 200 cysts were counted and transferred to a 1.5 mL tube (Eppendorf, safe lock). The worms were immersed in no more than 10 µL of washing solution. The samples were frozen in liquid nitrogen and immediately transferred to a -80 °C freezer until further processing.

For extraction of DNA, the worms were lysed using a CTAB buffer with Proteinase K (2 mg/mL) and incubated overnight at 55 °C (Yap & Thompson, 1987). DNA was isolated using chloroform:isoamyl alcohol (24:1) and precipitated with sodium acetate (3 M). The resulting pellet was washed twice with 70% ethanol. DNA was stored in RNase/DNase-free water (Sigma-Aldrich, Missouri, United States) at -20 °C until pooling and sequencing library preparation. The quality and quantity of the extracted DNA were assessed with NanoDrop ND1000 (ThermoFisher, Waltham, Massachusetts, USA) and Qubit 2.0 Fluorometer (dsDNA, HS, Invitrogen, Carlsbad, California, USA).

Confirming focal species

Atriophallophorus winterbourni was recently found to coexist with a rare and phenotypically very similar undescribed Atriophallophorus sp. (Feijen, Zajac, et al., in prep). We therefore verified, by amplification of a 16S fragment, that samples were taken from the correct parasite species. All infections by the rare Atriophallophorus sp. and coinfections between Atriophallophorus sp. and A. winterbourni were removed from further analysis. A fragment of 600 bp was amplified with the Promega GoTaq® G2 DNA Polymerase kit. The primers used, Trem_16S_F1: 5’-GTACCTTTTGCATCATGA-3’ and Trem_16S_R1: 5’-TTACCTAGTTATCCCCGG-3’, were designed based on mitochondrial genomes of trematodes available on GenBank. The PCR protocol involved a 2 min initial denaturation step (95 °C) followed by 30 cycles (0.5 min denaturation at 95 °C, 1 min annealing at 47.3 °C and 1 min extension at 72 °C) and was completed with a 5 min final extension step. All samples were sent to Microsynth AG (Balgach) for sequencing. Consensus sequence construction and further analyses were performed using Geneious 9.1.8 (Biomatters Limited).

Pool-Seq whole genome resequencing

After selecting only Atriophallophorus winterbourni, for each pool we combined equal quantities of DNA from twelve infections (200 cysts per infection) using a liquid handling station (BrandTech Scientific) at the Genetic Diversity Center (ETH Zurich), resulting in 200-600 ng samples. The pooled DNA was sent to the Functional Genomics Center Zurich (University of Zurich, Zurich) for quality assessment with Agilent Tape Station 4200 (Agilent, California, USA) and paired-end sequencing (PE150) using the Illumina Novaseq 6000 platform. A single TruSeq library was constructed from the DNA using the TruSeq Nano DNA library prep kit according to Illumina protocols with a 500 bp insert size. The library was sequenced after indexing each pool on half of an S4 flowcell. Three pools were obtained for each of Lake Middleton and Lake Selfe, 6 pools for Lake Alexandrina, 2 pools for Lake Mapourika and 4 pools for Lake Paringa (Figure 1).

Read mapping and SNP calling

Owing to the high quality scores of the sequenced reads (mean Phred quality score >30), no data correction was performed on the Illumina data; adapter sequences were trimmed at the Functional Genomics Center Zurich (Zurich). Paired-end reads for each population were mapped to the Atriophallophorus winterbourni reference genome (data available at: https://www.ncbi.nlm.nih.gov/nuccore/JACCGJ000000000 (Chapter 1)) using BWA MEM v0.7.17 and Sambamba v0.6.8 with default parameters, creating BAM files (Li, 2013; Tarasov, Vilella, Cuppen, Nijman, & Prins, 2015). Low quality mappings (quality score <20) and PCR duplicates were removed with Sambamba v0.6.8 and Picard tools v2.20.2 (Tarasov et al., 2015; Wysoker, Tibbetts, & Fennell, 2013). BEDTools v2.28.0 was used to calculate per-base coverage statistics (Quinlan & Hall, 2010). Single nucleotide polymorphisms were called with SAMTOOLS v1.9, creating mpileup files which were then synchronized with the Perl script mpileup2sync.pl available through the PoPoolation2 software (Kofler, Pandey, & Schlötterer, 2011; Li et al., 2009).

Previous experimental research on the snail and parasite populations of Lake Alexandrina has shown that the number of parasite genotypes per snail (number of coinfections) correlates with the prevalence of infection (Feijen, Widmer, et al., in prep). Multiple-genotype infections were found in 14% of the infected snails in the shallow water habitat of Lake Alexandrina in 2017, a year with one of the highest infection frequencies in the past two decades (Feijen, Widmer, et al., in prep). In other lakes the rate of coinfections is unknown. Thus, with the available data we concluded that double coinfections were possible while triple coinfections were unlikely. We therefore assumed a pool size of 48 (infections from 12 snails x 2 genotypes x 2 for diploid worms) and set a minimum allele count of 10 (21%). The sync files were then subjected to successive levels of filtering. First, the Perl script snp-freq-diff.pl available through PoPoolation2 was used to filter out SNPs with coverage lower than 25 or higher than 200 across all pools (Kofler, Pandey, et al., 2011). Then an in-house script was used to filter out triallelic SNPs and keep biallelic SNPs with a minimum allele count of 10. The final sync files thus consisted only of biallelic SNPs fulfilling the criteria of minimum coverage of 25, maximum coverage of 200 in all pools and minimum allele count of 10 per pool. Before further analysis, SNPs from all interspersed repeats and low complexity DNA were filtered out. The interspersed repeat and low complexity DNA regions from the whole genome were assessed with RepeatModeler v1.0.11 and RepeatMasker v4.0.7 (Smit, Hubley, & Green, 2015; Smit, Hubley, & Green, 2008). A customized library of repetitive elements produced with RepeatModeler v1.0.11 was verified with blastx v2.3.0 to confirm that no proteins, hypothetical proteins or coding sequences were excluded. The stringent SNP filtering of Pool-Seq data was used to eliminate false positives, which are the biggest challenge for Pool-Seq data analysis (Anand et al., 2016). Due to genome fragmentation we limited our analysis to scaffolds containing protein coding genes.
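The filtering criteria described above can be illustrated with a minimal Python sketch. It is not the in-house script used in this study. It assumes the PoPoolation2 sync layout (chromosome, position, reference base, then one A:T:C:G:N:del count field per pool), measures coverage as the sum of A, T, C and G counts, and interprets the minimum allele count as the minor allele count summed over all pools; these choices and the function names `parse_sync_line` and `keep_snp` are illustrative assumptions.

```python
# Assumed pool size: 12 infections x 2 genotypes x 2 chromosomes (diploid worms).
POOL_SIZE = 12 * 2 * 2


def parse_sync_line(line):
    """Split one tab-separated sync line into (chrom, pos, ref, pools).

    Each pool is a tuple of six integer counts in the order A:T:C:G:N:del.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref = fields[0], int(fields[1]), fields[2]
    pools = [tuple(int(x) for x in field.split(":")) for field in fields[3:]]
    return chrom, pos, ref, pools


def keep_snp(pools, min_cov=25, max_cov=200, min_count=10):
    """Return True for a biallelic SNP that passes coverage and count filters.

    Assumption of this sketch: the minor allele count is summed over pools.
    """
    totals = [0, 0, 0, 0]
    for counts in pools:
        acgt = counts[:4]
        coverage = sum(acgt)
        # Every pool must fall inside the coverage window.
        if coverage < min_cov or coverage > max_cov:
            return False
        for i, count in enumerate(acgt):
            totals[i] += count
    observed = [t for t in totals if t > 0]
    # Monomorphic, triallelic and tetra-allelic sites are discarded.
    if len(observed) != 2:
        return False
    return min(observed) >= min_count
```

With a pool size of 48, a minimum allele count of 10 corresponds to 10/48, or roughly 21% of the chromosomes in a pool.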

Functional diversity

To obtain pool-specific genome-wide estimates of genetic diversity, nucleotide diversity (Tajima’s Pi, π) and population mutation rate based on the number of segregating sites (Watterson’s theta, Θ Watterson) were calculated per pool per gene with the Variance-at-position.pl script available through PoPoolation v1.2.1, setting 10 as the minimum count, 25 as the minimum and 200 as the maximum coverage (Kofler, Orozco-terWengel, et al., 2011). Estimating genetic diversity only in coding regions eliminated the excess of diversity from other, less conserved regions of the genome, which are not under purifying selection and thus not directly of functional interest (Fischer et al., 2017). Primarily, the analysis allowed us to establish whether there were considerable differences in nucleotide diversity between any pools, which would lead to biased $F_{ST}$ estimates (Jost, 2008). Additionally, we measured the nucleotide diversity of the candidate genes differentiated between lakes on the eastern and western sides of the Southern Alps. The significance of the deviation from the population mean in polymorphism averaged across all pools per lake for the outlier genes was tested with a one-sample t-test in SciPy Stats v0.14.0. The deviation from the mean was also tested with a one-sample t-test per lake for each gene separately. Ordinary least squares analysis of variance was applied in Python using the statsmodels.api v0.12.0 module to test differences in nucleotide diversity between the lakes across all the outlier genes.
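For orientation, the two diversity measures can be sketched in Python using their classical definitions for a sample of individually sequenced chromosomes. This is a simplified illustration only: PoPoolation applies additional corrections for pooled sequencing that are not reproduced here, and the function names are hypothetical.

```python
def watterson_a(n_chrom):
    """Return a_n, the sum of 1/i for i = 1 .. n-1."""
    return sum(1.0 / i for i in range(1, n_chrom))


def nucleotide_diversity(site_counts, seq_len):
    """Per-site nucleotide diversity (pi) from allele counts at variable sites.

    site_counts holds one tuple of allele counts per variable site;
    seq_len is the total number of sites considered (e.g. gene length).
    """
    total = 0.0
    for counts in site_counts:
        n = sum(counts)
        if n < 2:
            continue
        heterozygosity = 1.0 - sum((c / n) ** 2 for c in counts)
        # n / (n - 1) is the usual small-sample correction.
        total += heterozygosity * n / (n - 1)
    return total / seq_len


def watterson_theta(num_segregating, n_chrom, seq_len):
    """Per-site Watterson's theta from the number of segregating sites."""
    return num_segregating / (watterson_a(n_chrom) * seq_len)
```

Under neutrality and constant population size both quantities estimate the same parameter, which is why they are usually reported side by side.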

Population structure

Two methods were used to characterize the population structure of the parasite across the South Island of New Zealand. All analyses in R were run using R 4.0.2. First, allele frequencies for all SNPs were obtained with the sync_to_frequencies function of the R package haploReconstruct v0.1.2 (Franssen, Barton, & Schlötterer, 2016). Next, we performed Hierarchical Clustering on Principal Components (HCPC) on multiple sets of 100,000 randomly subsampled SNPs, using the FactoMineR v2.3 R package (Lê, Josse, & Husson, 2008). For this analysis we first carried out the Principal Component Analysis with the function PCA and retained all dimensions, together explaining 100% of the variance in the data. Using the HCPC function we then divided the data into clusters. The number of clusters in the HCPC analysis is decided based on inertia gain, and clustering is performed using Ward’s criterion on the selected principal components (Husson, Josse, & J., 2010; Rellstab et al., 2016). The clusters were visualised on a factor map using the factoextra v1.0.7 R package. However, the small sample sizes associated with Pool-Seq data limit the interpretation of multivariate analyses. Therefore, population structure was also assessed with the population covariance matrix calculated with the core model implemented in Baypass2.2 (Gautier, 2015). The data in sync format were converted into Baypass2.2 format using the R package poolfstat v1.1.1 (pooldata2genobaypass) (Hivert et al., 2018). The SNPs converted to genobaypass format had to have a minimum read count of 10 per pool, a minimum coverage per pool of 25, a maximum coverage per pool of 200 and not be in an insertion or deletion. The scaled covariance matrix Ω, calculated across population allele frequencies of 50,000 randomly chosen SNPs, was then converted into a correlation matrix using the function cov2cor in R package corrplot v6.0.1 (Gautier, 2015).
The scaled population covariance matrix is a parameter that takes into account the neutral correlation structure between populations without being biased by outliers and thus is highly informative for demographic inference (Gautier, 2015; Pickrell & Pritchard, 2012). To confirm that our data reflect the mitochondrial population structure shown by Feijen et al. (Feijen, Zajac, et al., in prep), we converted a 25% subsample of the raw Illumina reads into a BLAST database and blasted the 3 known haplotypes (haplotype W, M and CA) against it. We assumed presence of a certain haplotype in a pool when a 100% match was found (word size of 14, e-value < 1e-7). We assessed the presence or absence of a haplotype group in a pool without quantification.
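The conversion of the covariance matrix Ω into a correlation matrix mentioned above amounts to dividing each entry by the product of the standard deviations of the two pools involved. A minimal Python sketch of that rescaling is given below; the function name `cov_to_cor` and the example matrix are illustrative and do not come from the analysis itself.

```python
import math


def cov_to_cor(cov):
    """Rescale a square covariance matrix (list of lists) to correlations.

    cor[i][j] = cov[i][j] / sqrt(cov[i][i] * cov[j][j])
    """
    n = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(n)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]


# Hypothetical 2 x 2 covariance matrix, for illustration only.
example = cov_to_cor([[4.0, 2.0], [2.0, 9.0]])
```

The diagonal of the result is 1 by construction, so only the off-diagonal entries carry information about similarity between pools.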

Connectedness

Connectedness between the lakes was assessed with TreeMix v1.13, a software using a graph-based model for building a population tree and testing for the presence of gene flow (Pickrell & Pritchard, 2012). The software first builds a maximum likelihood tree between all populations, then calculates a previously mentioned covariance matrix, implied by the tree structure (Pickrell & Pritchard, 2012). This way the model takes into account the shared variance in allele frequencies between populations due to shared ancestry. The model also applies the Gaussian approximation of genetic drift occurring between ancestral/descendant populations and between extant populations on the tree (Pickrell & Pritchard, 2012). To improve the fit of the model, TreeMix implements a migration parameter which represents possible admixture/ gene flow events (Pickrell & Pritchard, 2012). Migration edges are first introduced on the tree between pairs of populations with the worst fit under the model. Each migration edge is quantified as the proportion of alleles originating from a particular source represented by the weight parameter (Pickrell & Pritchard, 2012). The maximum number of possible migration edges on the tree is 4m2 with m being the number of populations (Pickrell & Pritchard, 2012). In principle, the software adds migration edges until they no longer improve model fit significantly (Pickrell & Pritchard, 2012).

Our analysis was repeated on multiple random subsamples of 50,000 SNPs. The optimal number of migration events was chosen by taking into account the log likelihood of the model and the matrix of the standard error of the residuals. In our analysis, we used all 18 pools from all 5 lakes (referred to as analysis A) and confirmed our results by repeating the analysis using only 2 pools per lake to account for variation in the number of pools between the lakes (referred to as analysis B). The numbers of migration edges tested in analysis A were 1-20, 38-45, 70, 90, 91 and 100, with 1-4 iterations for each value; between 20 and 40 migration edges were tested for analysis B. Using a different number of pools results in a different amount of per-lake diversity, thus potentially causing a different number of migration edges to improve the fit of the model. We present the results from analysis A supported by the results from analysis B. The significance of the admixture found was tested with the “4 Population Test” available through TreeMix v1.13 (Pickrell & Pritchard, 2012), which was first introduced by Keinan et al. (Keinan, Mullikin, Patterson, & Reich, 2007; Reich, Thangaraj, Patterson, Price, & Singh, 2009) and further developed by Reich et al. (Reich et al., 2009). The test calculates the $f_4$ statistic and checks whether the topology (A,B;C,D) is correct; in other words, whether the allele frequency difference between A and B reflects genetic drift that is uncorrelated with C and D (Reich et al., 2009). A significant negative deviation of the $f_4$ statistic from 0 indicates that the four populations cannot be related by a simple phylogeny without mixture; that is, that A and D or B and C are more closely related than expected by the tree. The standard error of the f-statistic is estimated using the Block Jackknife method, which corrects for linkage disequilibrium among SNPs (Reich et al., 2009). Block Jackknife is a resampling technique, similar to bootstrapping, that is used to estimate parameter values and corresponding standard deviations (Busing, Meijer, & Van Der Leeden, 1999; Kunsch, 1989). Using the Jackknife standard error, the $f_4$ statistic is then converted to a normally distributed Z-score with a mean of 0 and a variance of 1. A |Z-score| > 2 is considered significant, but it is not directly convertible to a p-value because normality assumptions become imperfect at high Z-scores (Reich et al., 2009). The $f_4$ statistic was measured across the whole genome in blocks of 1000 SNPs.
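The logic of the test can be sketched in Python as follows. The sketch assumes the usual definition of the statistic as the average over SNPs of the product of allele frequency differences, (p_A - p_B)(p_C - p_D), and uses a plain delete-one-block jackknife with equally sized blocks of consecutive SNPs. It is an illustration under these assumptions, not a reimplementation of TreeMix, whose internal details may differ; all function names are hypothetical.

```python
import math


def f4_per_snp(pa, pb, pc, pd):
    """Per-SNP products (pA - pB) * (pC - pD) for four lists of frequencies."""
    return [(a - b) * (c - d) for a, b, c, d in zip(pa, pb, pc, pd)]


def block_jackknife(values, block_size):
    """Return (mean, jackknife standard error) using delete-one-block resampling.

    SNPs beyond the last complete block are ignored.
    """
    n_blocks = len(values) // block_size
    if n_blocks < 2:
        raise ValueError("at least two complete blocks are required")
    used = values[: n_blocks * block_size]
    total = sum(used)
    estimate = total / len(used)
    block_sums = [
        sum(used[i * block_size:(i + 1) * block_size]) for i in range(n_blocks)
    ]
    # Mean of the statistic with each block left out in turn.
    leave_one_out = [(total - s) / (len(used) - block_size) for s in block_sums]
    mean_loo = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum(
        (x - mean_loo) ** 2 for x in leave_one_out
    )
    return estimate, math.sqrt(variance)


def f4_z_score(pa, pb, pc, pd, block_size=1000):
    """Return (f4, standard error, Z-score) for the topology (A,B;C,D)."""
    estimate, se = block_jackknife(f4_per_snp(pa, pb, pc, pd), block_size)
    z = estimate / se if se > 0 else float("nan")
    return estimate, se, z
```

Grouping consecutive SNPs into blocks, rather than resampling single SNPs, is what allows the standard error to reflect linkage disequilibrium between neighbouring sites.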

Additionally, population structure and connectedness were assessed with $F_{ST}$ analysis using the poolfstat v1.1.1 R package, calculating the average pairwise $F_{ST}$ matrix between each pair of pools (Hivert et al., 2018). We then averaged the pairwise $F_{ST}$ for between-lake comparisons and correlated the between-lake $F_{ST}$ values with the shortest driving distance between the lakes measured with GoogleMaps (GoogleMaps, 2020). The driving distance was considered a proxy for separation by distance due to source of recolonization and restricted connectivity by the Southern Alps. The correlation was tested with Pearson’s correlation test with the Python package SciPy Stats v0.14.0.
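As a reminder of what the correlation test measures, the following standard-library Python sketch computes Pearson's r and the associated t statistic with n - 2 degrees of freedom. It does not compute the p-value, for which the analysis relied on SciPy; the function names are illustrative and no values from this study are used.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pearson_t(r, n):
    """t statistic for testing r against 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```

Here `x` would hold the pairwise distances between lakes and `y` the matching between-lake differentiation values, one entry per pair of lakes.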

Detection of differentiation

Two types of analyses were carried out to identify differentiated regions in the genome between lakes on the west and on the east of the Southern Alps (Figure 1). Lake Selfe was excluded from both analyses based on the results of the analysis of the population structure (see Results).

$F_{ST}$ outliers

$F_{ST}$ was measured in sliding non-overlapping windows of 2 Kb across the whole genome with the script fst-sliding.pl available through PoPoolation2 (Kofler, Pandey, et al., 2011). We chose windows of 2 Kb to minimize the impact of false positive SNPs on the result. $F_{ST}$ in PoPoolation2 is calculated with the following formula:

$$F_{ST} = \frac{\mathrm{Pi}_{total} - \mathrm{Pi}_{within}}{\mathrm{Pi}_{total}}$$

where $\mathrm{Pi} = 1 - f_A^2 - f_T^2 - f_C^2 - f_G^2$, $f$ is the frequency of each nucleotide, $\mathrm{Pi}_{total} = 1 - f_A^2 - f_T^2 - f_C^2 - f_G^2$ calculated across both pools and $\mathrm{Pi}_{within} = \frac{\mathrm{Pi}_{pop1} + \mathrm{Pi}_{pop2}}{2}$.

This method has been shown to provide consistent results for $F_{ST}$ estimation in large populations (Hivert et al., 2018) and Morales et al. (2019) have shown that the slight bias observed by Hivert et al. (2018) in the $F_{ST}$ estimates does not influence outlier detection because it has little effect on the ranking of loci. The data were used to assess the distribution of $F_{ST}$ across the genome and to select the outliers. Two pools per lake were used for the analysis from lakes Mapourika and Paringa, located on the west side of the Southern Alps, and lakes Middleton and Alexandrina, located on the east side. For each pairwise comparison we calculated the mean, the median and the standard deviation. Outliers were identified by having an $F_{ST}$ value above 4 times the standard deviation in pairwise comparisons (around 1.75% of each pairwise $F_{ST}$ distribution) between pools from east and west lakes and below 4 times the standard deviation in comparisons between pools of lakes on the same side of the Southern Alps (Montague et al., 2014). All the SNPs from outlier windows were inspected with Fisher’s exact test for significance (Fischer et al., 2013; Kofler, Pandey, et al., 2011). The p-values from Fisher’s exact test were corrected for multiple testing and false discovery rate with the R package qvalue v2.20.0 (Storey & Tibshirani, 2003). All SNPs within those windows with a q-value < 0.01 were included in further analysis.
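The formula given above can be illustrated for a single site with a short Python sketch. Two simplifications are assumptions of this sketch rather than statements about PoPoolation2: the frequencies across both pools are taken as the unweighted mean of the two pool frequencies, and no correction for pool size or coverage is applied. Function names are hypothetical.

```python
def allele_freqs(counts):
    """Convert (A, T, C, G) read counts into frequencies."""
    total = sum(counts)
    return [c / total for c in counts]


def pi_from_freqs(freqs):
    """Pi = 1 - fA^2 - fT^2 - fC^2 - fG^2."""
    return 1.0 - sum(f * f for f in freqs)


def fst_single_site(counts_pop1, counts_pop2):
    """F_ST = (Pi_total - Pi_within) / Pi_total for one site and two pools."""
    f1 = allele_freqs(counts_pop1)
    f2 = allele_freqs(counts_pop2)
    # Assumption: both pools contribute equally to the total frequencies.
    f_total = [(a + b) / 2.0 for a, b in zip(f1, f2)]
    pi_total = pi_from_freqs(f_total)
    pi_within = (pi_from_freqs(f1) + pi_from_freqs(f2)) / 2.0
    if pi_total == 0.0:
        # Monomorphic site: F_ST is undefined; this sketch reports 0.0.
        return 0.0
    return (pi_total - pi_within) / pi_total
```

Two pools fixed for different nucleotides give a value of 1, and two pools with identical frequencies give 0, which matches the intended range of the statistic.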

Baypass

Baypass2.2 is based on the principles of the Bayesian hierarchical model proposed by Coop et al. (Coop, Witonsky, Di Rienzo, & Pritchard, 2010; Gautier, 2015), also implemented in Bayenv2. The software allows accounting for neutral population structure in outlier analysis through incorporation of the previously mentioned covariance matrix Ω between populations (Gautier, 2015). First, the core model in Baypass2.2 was used to calculate the Ω matrix from multiple random samples of 50,000 SNPs. The core model was also used for estimation of the $y_{ij}$ parameter, used in the subsequent model as the initial value of the prior distribution of the population SNP allele count in the Metropolis-Hastings updates. Both parameters were implemented in the standard model to calculate the association of SNP allele counts with a population-specific binary trait, measured with the C2 statistic implemented in Baypass2.2 (Olazcuaga et al., 2020). We defined the binary contrast between lakes Mapourika and Paringa on the west and lakes Middleton and Alexandrina on the east. We chose the C2 statistic due to its documented best performance in correction for false positives (Olazcuaga et al., 2020). The p-value for each C2 statistic was converted into a q-value with the R package qvalue v2.20.0 to correct for the false discovery rate and multiple testing (Storey & Tibshirani, 2003).

As outliers we considered all SNPs below the 1% q-value threshold (q-value < 0.01). Because Baypass2.2 calculations are very computationally intensive, we divided our allele count tables into datasets of 50,000 SNPs and supplied the previously computed Ω matrix and the prior allele count $y_{ij}$ in each run. For each BayPass run we used the default parameters of 20 pilot runs of 1000 steps each, 5000 steps of burn-in and 100 steps after burn-in with a thinning sampling of 10.

Functional characterization of highly differentiated candidate genes

A genomic region was considered a possible candidate for a high degree of differentiation between the lakes on the west and the east of the Southern Alps if it appeared as an outlier in both outlier tests (see above). If the regions fell within coding sequences, they were annotated with NCBI BLASTP (e-value cut-off 1e-10) and with InterProScan for protein domain analysis (Mulder & Apweiler, 2007), and they were subjected to Gene Ontology enrichment analysis using GOATOOLS (Klopfenstein et al., 2018). GOATOOLS finds statistically over- and under-represented GO terms in the set of genes of interest compared to all the GO terms annotating the genome. Annotation of the coding regions of the reference genome was performed using the Maker v2.31.9 annotation pipeline (Cantarel et al., 2008). GO annotation of the reference genome was performed using OMA, Pannzer2 and EggNOG (Altenhoff et al., 2019; Huerta-Cepas et al., 2016; Törönen, Medlar, & Holm, 2018). For further details on the genome annotation see Chapter 1.

Results

Sequence read quality

In our study we performed whole genome resequencing, using the Pool-Seq method, of the trematode parasite Atriophallophorus winterbourni from 5 lakes across the South Island of New Zealand. On average 97.5% of the Illumina reads mapped to the reference genome and 48.4% of mapped reads passed the quality filtering step. The results were similar across all 18 pools from 5 lakes. The reads had a mean length of 150 bp, yielding a depth of coverage ranging from 56.5x to 97x for a genome size of 601.7 Mb (Supplementary Table 1). The number of reads obtained with Illumina sequencing for each pool, the number of reads mapping to the reference genome and the number of mapped reads passing the quality filtering (MAPQ values > 20) are compiled in Supplementary Table 1.
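The relation between read number, read length and depth of coverage used above is simple arithmetic, sketched below in Python. Which set of reads enters the calculation (all reads, mapped reads or reads passing the quality filter) is not specified here and is left to the caller; the function name and the example numbers are illustrative assumptions, not values from Supplementary Table 1.

```python
def depth_of_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


# Hypothetical example: 2 million reads of 150 bp on a 3 Mb genome.
example_depth = depth_of_coverage(2_000_000, 150, 3_000_000)
```

For the 601.7 Mb reference genome the third argument would be 601.7e6.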

Functional Diversity

For pool-specific estimates of genetic diversity, we calculated Tajima’s Pi (π) and Watterson’s Theta (Θ Watterson) across all coding regions for each pool. We observed no differences among the pools for either measure. The majority of values of Tajima’s π (between the 25th and 75th percentiles) ranged between 0.007 and 0.017, with a mean of 0.013 or 0.014 and a median of 0.011 or 0.012 per pool. Θ Watterson values were slightly higher, with the interquartile range between 0.008 and 0.02, a mean of 0.014 to 0.016 and a median of 0.012 or 0.013 per pool (Supplementary Figure 1, Supplementary Table 2).

Population structure

In total 18,184,991 SNPs across 7473 scaffolds with a total length of 340.4 Mb were identified between all 5 lakes with PoPoolation2. The average pairwise $F_{ST}$, calculated with all detected markers across the nuclear genome, varied between 5.7% (±0.7%) between the most closely located lakes and 7.6% (±0.7%) between lakes further apart (Figure 2A, Supplementary Table 3). Average pairwise $F_{ST}$ correlated positively with the shortest driving distance between the lakes (Pearson r = 0.834, p-value = 0.027) (Figure 2A), but not with the flight distance (Pearson r = -0.127, p-value = 0.72).

Figure 2. A. Average $F_{ST}$ between each pair of lakes measured across the whole genome correlated with the shortest driving distance between the lakes. The correlation was tested with Pearson’s correlation test (results indicated in the bottom right corner). B. Representation of the scaled covariance matrix Ω as a correlation matrix among 18 pools from 5 lakes. The covariance matrix was estimated with the Baypass2.2 core model using 50,000 SNPs randomly chosen across the genome and subsampled multiple times (Coop et al., 2010; Gautier, 2015). The greater the similarity (the higher the correlation) between each pair of pools, the darker the colour.

Similar results were obtained with Baypass2.2 using the covariance matrix Ω calculated between all pools converted into a correlation matrix (Figure 2B). Correlation between lakes varied between 0.89 and 1. Lakes located either on the east or on the west side of the Southern Alps were most similar (corr.coef = 0.92) and correlation was the lowest between any pair of lakes from opposite sides of the mountain range (corr. coef = 0.89). Lake Selfe was equally correlated with both the west and the east lakes (0.9).


Consistent with previous studies, we found haplotype groups to reflect a west-east division, but our results revealed greater diversity of the already known haplotype groups on the west and in Lake Selfe. Haplotype groups CA and W were both present in lakes on the west side of the Southern Alps, haplotype group M was present on the east side of the Alps, and Lake Selfe had all 3 haplotype groups present (100% BLAST match, e-value < 1e-7).

For an additional population structure analysis using HCPC, please refer to Supplementary Figure 2.

As shown by all the analyses, Lake Selfe appears to be equally differentiated from the lakes on the eastern and on the western sides of the Southern Alps. Therefore, it was excluded from the analysis on the genomic regions associated with consistent differentiation between the lakes on the west and on the east of the Alpine Fault.

Connectedness

The maximum likelihood tree calculated with TreeMix v1.13 showed clear geographical clustering, with the lakes on the same side of the Southern Alps most closely related to each other, confirming our results on the population structure. The branches of the tree were very shallow, implying high gene flow or recent splits (Figure 3A). We concluded that 90 migration events gave the best model fit to the data, yielding the lowest residual standard error and the highest likelihood of the model (log likelihood = 1549.19). For a comparison of the likelihood values and corresponding residual standard errors see Supplementary Figure 3. Out of 90 migration events, 77 (85%) were between tips of the branches (Figure 3B). We quantified the number of migration branches, identifying the source (migration origin) and the sink (migration destination), and the total weight of those migration branches (Figure 3). The weight of the branches going between tips of the tree reflects the fraction of alleles explained by a particular migration event. However, due to the very high number of migration events in our analysis, we believe the data cannot be interpreted as cumulative; we cannot interpret each migration branch to be explained by an independent set of alleles.

The analysis showed Lake Alexandrina and Lake Middleton to have the highest fraction of alleles explained by migration events (Figure 3C). Lake Alexandrina and Lake Middleton showed the highest level of admixture, measured in total weight of migration edges (41.4% from Lake Middleton to Lake Alexandrina). A high proportion of admixture was also observed between Lake Middleton and Lake Paringa (17.2% from Lake Paringa to Lake Middleton), and between Lake Alexandrina and Lake Selfe (18.1% from Lake Selfe to Lake Alexandrina). The results were confirmed as significant by the f4 statistic (Table 1). The least amount of gene flow was observed for Lake Mapourika, especially when it was considered as a source population. The analyses showed Lake Selfe to be a strong source of migration into other lakes, but not to exhibit much inflow of genetic material from the other lakes.
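The per-lake summary of migration branches by source and sink can be sketched as a simple tally. The edge list below is hypothetical (placeholder lakes and weights, not TreeMix output), and `summarise_migration` is a name introduced here for illustration.

```python
from collections import defaultdict

def summarise_migration(edges):
    """Sum migration-edge weights per lake, separately for outgoing and incoming edges."""
    outgoing = defaultdict(float)
    incoming = defaultdict(float)
    for source, sink, weight in edges:
        outgoing[source] += weight
        incoming[sink] += weight
    return dict(outgoing), dict(incoming)

# Hypothetical edges as (source lake, sink lake, weight); placeholder values only.
edges = [
    ("Middleton", "Alexandrina", 0.20),
    ("Middleton", "Alexandrina", 0.10),
    ("Selfe", "Alexandrina", 0.15),
    ("Paringa", "Middleton", 0.05),
]
outgoing, incoming = summarise_migration(edges)
print("outgoing:", outgoing)
print("incoming:", incoming)
```

Such totals are descriptive only: with many migration edges, the summed weights do not correspond to independent sets of alleles.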

For the results on analysis B using only 2 pools per lake (see methods) see Supplementary Figure 4.

Table 1. Results of the “4 Population Test” measuring the f4 statistic. The test checks whether the topology (A, B; C, D) is correct; in other words, whether the allele frequency difference between A and B reflects genetic drift that is uncorrelated with C and D (Reich et al., 2009). The topology tested is indicated in the first column and the type of admixture tested with that topology is indicated in the second column. A significant negative deviation of the f4 statistic from 0 (column 3, |Z-score| > 2) indicates that the four populations cannot be related by a simple phylogeny without mixture.

Tested topology (A,B;C,D) | Tested admixture | f4 value | Z-score

(Lake Alexandrina any pool, Lake Alexandrina any pool; Lake Middleton any pool, Lake Middleton any pool) | Between Lake Alexandrina and Lake Middleton | -0.0024 ≤ f4 ≤ -0.00034 | -5.34 ≤ Z-score ≤ -2.001
(Lake Middleton any pool, Lake Middleton any pool; Lake Paringa any pool, Lake Paringa any pool) | Between Lake Middleton and Lake Paringa | -0.0032 ≤ f4 ≤ -0.0004 | -4.39 ≤ Z-score ≤ -2.007
(Lake Alexandrina any pool, Lake Middleton any pool; Lake Paringa any pool, Lake Mapourika any pool) | Between Lake Middleton and Lake Paringa | -0.005 ≤ f4 ≤ -0.00025 | -5.56 ≤ Z-score ≤ -2.00001
(Lake Alexandrina any pool, Lake Alexandrina any pool; Lake Selfe any pool, Lake Selfe any pool) | Between Lake Alexandrina and Lake Selfe | -0.004 ≤ f4 ≤ -0.00024 | -4.97 ≤ Z-score ≤ -2.0005
(Lake Alexandrina any pool, Lake Middleton any pool; Lake Paringa any pool, Lake Selfe any pool) | Either between Lake Alexandrina and Lake Selfe or between Lake Middleton and Lake Paringa | -0.011 ≤ f4 ≤ -0.00033 | -5.87 ≤ Z-score ≤ -2.0001
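For orientation, the f4 statistic for a topology (A, B; C, D) is the average over SNPs of (pA - pB)(pC - pD), where p denotes the allele frequency in each population. The sketch below computes this plain definition together with a Z-score from a delete-one-block jackknife over equally sized contiguous SNP blocks. It is an illustration under simplifying assumptions (no sample-size correction, made-up frequencies, hypothetical function names), not the implementation used for Table 1.

```python
import math

def f4(p_a, p_b, p_c, p_d):
    """Mean over SNPs of (pA - pB) * (pC - pD), given per-SNP allele frequencies."""
    products = [(a - b) * (c - d) for a, b, c, d in zip(p_a, p_b, p_c, p_d)]
    return sum(products) / len(products)

def f4_z_score(p_a, p_b, p_c, p_d, n_blocks):
    """Return (f4, Z) using a delete-one-block jackknife with equal block sizes."""
    size = len(p_a) // n_blocks
    if n_blocks < 2 or size == 0:
        raise ValueError("need at least 2 blocks with at least 1 SNP each")
    used = size * n_blocks  # SNPs beyond the last full block are ignored
    products = [(a - b) * (c - d)
                for a, b, c, d in zip(p_a[:used], p_b[:used], p_c[:used], p_d[:used])]
    total = sum(products)
    estimate = total / used
    # Re-estimate f4 with each block left out in turn.
    leave_one_out = [
        (total - sum(products[k * size:(k + 1) * size])) / (used - size)
        for k in range(n_blocks)
    ]
    mean_loo = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum((x - mean_loo) ** 2 for x in leave_one_out)
    if variance == 0:
        raise ValueError("jackknife variance is zero; Z-score undefined")
    return estimate, estimate / math.sqrt(variance)

# Made-up allele frequencies for six SNPs in four hypothetical pools.
p_a = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
p_b = [0.20, 0.35, 0.40, 0.55, 0.60, 0.75]
p_c = [0.50, 0.40, 0.60, 0.30, 0.70, 0.20]
p_d = [0.40, 0.30, 0.45, 0.20, 0.55, 0.10]
estimate, z = f4_z_score(p_a, p_b, p_c, p_d, n_blocks=3)
print(f"f4 = {estimate:.5f}, Z = {z:.2f}")
```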


Figure 3. Results of admixture inference method TreeMix v1.13. A. An unrooted maximum likelihood tree based on allele frequencies of 50,000 SNPs randomly sampled across the genome. The tree was built as a first step in TreeMix v1.13 admixture inference and was visualised with phylo.io. “P” next to each name indicates a pool. B. The inferred tree with admixture events indicated between the branches, coloured by the weight (in proportion, between 0 and 1). The tree was visualised using plotting_funcs.R available within the TreeMix v1.13 software. The horizontal branch lengths are proportional to the amount of genetic drift that has occurred on the branch. The drift is assumed to be 0 at the level of the ancestral population. The scale bar shows ten times the average standard error of the entries in the covariance matrix. C. Summary per lake of the total weight (in proportion) of the incoming and outgoing migration branches indicated on the tree in B.


Outlier analysis

Regions in the genome associated with consistent differentiation between the eastern (Lake Alexandrina and Lake Middleton) and the western (Lake Mapourika and Lake Paringa) sides of the Southern Alps were studied using two methods. In the FST analysis, plotting 128,452 windows of 2 Kb revealed 152 outlier windows differentiated between lakes on the western and the eastern sides of the Southern Alps (Figure 4A). The 152 windows were distributed across 52 scaffolds and contained 7048 SNPs. Of these, 1973 SNPs passed the q-value threshold and were encompassed by 151 of the outlier windows (Figure 4A). The mean, the median and the standard deviation of FST for each pairwise comparison have been compiled in Supplementary Table 5.

A total of 9,955,793 SNPs passed the filtering criteria applied during conversion of sync files into genobaypass files in Baypass2.2. The outlier detection, based on the C2 statistic, identified 4363 SNPs as significantly associated with the west versus east contrast (q-value < 0.01), found over 331 scaffolds (Figure 4B). The distribution of p-values derived from the C2 statistic was well-behaved (Supplementary Figure 5).

Candidate SNPs significantly associated with differentiation between the western and the eastern side of the Southern Alps were only those SNPs that appeared as outliers in both analyses. The two analyses resulted in 1013 SNPs clustering across 44 scaffolds (Figure 4B).
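Retaining only SNPs flagged by both methods is a set intersection on SNP coordinates. A minimal sketch follows, assuming each SNP is keyed by a (scaffold, position) pair; the scaffold names, positions and function names are hypothetical.

```python
def shared_outliers(fst_outliers, c2_outliers):
    """Return SNPs flagged by both methods, each keyed by (scaffold, position)."""
    return sorted(set(fst_outliers) & set(c2_outliers))

def count_scaffolds(snps):
    """Number of distinct scaffolds carrying at least one SNP in the list."""
    return len({scaffold for scaffold, _ in snps})

# Hypothetical outlier lists from the two analyses (placeholder coordinates).
fst_outliers = [("scaf_1", 120), ("scaf_1", 450), ("scaf_7", 33)]
c2_outliers = [("scaf_1", 450), ("scaf_7", 33), ("scaf_9", 8)]
candidates = shared_outliers(fst_outliers, c2_outliers)
print(candidates, count_scaffolds(candidates))
```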


Figure 4. A. A scatterplot of pairwise FST values measured in 2 Kb windows arranged in scaffolds. Pairwise comparisons between pools from lakes on opposite sides (west and east) of the Southern Alps are indicated in orange. Pairwise comparisons between pools from the same side of the Southern Alps are indicated in green (Lake Mapourika – Lake Paringa comparisons) or blue (Lake Alexandrina – Lake Middleton). Black indicates the outlier windows in FST measured between the lakes on the west and the east of the Southern Alps. B. A Manhattan plot of the SNP q-values on a −log10 scale derived from the estimated C2 statistics for the west vs east status contrast. SNPs are arranged by their position on their scaffold of origin. The order of scaffolds is the same as for the FST analysis. The red colour indicates the 1% q-value threshold and yellow indicates the outlier SNPs identified by both outlier analyses.

Functional annotation

Out of the 1013 SNPs, 53 (5.2%) were exonic and found within 6 coding sequences. GO enrichment analysis performed with GOATOOLS on the coding sequences identified the biological process mitotic cytokinesis to be significantly enriched (Bonferroni corrected p-value < 0.05). Additionally, annotation of the genes with NCBI BLASTP and InterProScan found the genes to be coding for transmembrane proteins and antigens (tetraspanin/peripherin), and for proteins involved in regulation of transcription (general transcription factor, MULE transposase/putative beta arrestin), protein binding, vacuolar transport (charged multivesicular body protein 4), signal transduction and cell division (GTPases) (Table 2). The gene coding for tetraspanin and CD63 antigen harboured the highest number of significantly differentiated SNPs (Table 2).

Table 2. Description of 6 genes represented by the 5.2% of 1013 SNPs significantly associated with the west/east contrast between parasite populations across the South Island of New Zealand. The genes were annotated with NCBI BLASTP (e-value cut-off 1e-10) and with InterProScan for protein domain annotation. The number of significant SNPs found within the sequence of those genes is indicated in the last column.

Gene ID | Functional annotation with BLASTP | InterProScan domain | Nb. of significant SNPs

maker-agouti_scaf_2061-augustus-gene-0.6-mRNA-1 | charged multivesicular body protein 4 | charged multivesicular body protein 4B, SNF7, GO BP: vacuolar transport | 2
maker-jcf7180000230112-snap-gene-0.1-mRNA-1 | hypothetical protein, twitchin isoform | FN3 domain, cytokine receptor motif, Fibronectin type III, Immunoglobulins, GO MF: protein binding | 3
maker-jcf7180000275741-augustus-gene-0.2-mRNA-1 | putative beta-arrestin, protein FAR1-RELATED sequence 5 | MULE transposase, FHY3-FAR1, GO BP: regulation of transcription | 13
maker-jcf7180000275741-augustus-gene-0.3-mRNA-1 | cdc42-like protein, cdc42 homolog, cell division control protein | small GTP binding domain, P-loop NTPase, P-loop containing nucleoside triphosphate hydrolases, Rho family GTPase, Ras, CDC42L protein, GO BP: small GTPase mediated signal transduction, GO MF: GTP binding, GTPase activity | 11
maker-jcf7180000276130-snap-gene-0.0-mRNA-1 | general transcription factor II-I repeat domain-containing protein 2B | EPM2A_int, no GO terms | 9
maker-jcf7180000276347-augustus-gene-0.4-mRNA-1 | tetraspanin, CD63 antigen | Tetraspanin/peripherin, transmembrane region, non-cytoplasmic domains, cytoplasmic domains, GO CC: integral component of transmembrane | 14

We measured nucleotide diversity for the candidate genes to assess whether the genes harbour more or less polymorphism than expected from background nucleotide diversity (Supplementary Figure 1) and whether the lakes significantly differ from each other in nucleotide diversity across the outlier genes (Figure 5). We observed a significant depletion in nucleotide diversity averaged across all the outlier genes in each lake in comparison to background diversity (30.8 < T-statistic < 182.3, p-value < 0.05) (Supplementary Table 6). However, considering each gene separately, we found only one gene, encoding the charged multivesicular body protein, to exhibit decreased polymorphism across all lakes (328.7 < T-statistic < 513.8, p-value < 0.05) and one gene, encoding a general transcription factor, to harbour significantly more nucleotide diversity in all lakes in comparison to the background diversity (-401.4 < T-statistic < -119.5, p-value < 0.05) (Figure 5, Supplementary Table 6). Analysis of variance showed significant differences between lakes depending on the outlier gene (F22,61 = 13.18, p-value = 1.04E-15) (Supplementary Table 7A); 27% of the variance in nucleotide diversity in outlier genes was explained by the gene by lake interaction, 56% was explained by differences between genes and only 1.4% was explained by the lake factor alone (Supplementary Table 7B). Both Lake Alexandrina and Lake Paringa exhibited elevated polymorphism in 3 out of 6 of the outlier genes, whereas Lake Middleton and Lake Mapourika exhibited decreased polymorphism in 5 out of 6 genes (Figure 5).
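A one-sample t-test of a gene's nucleotide diversity against a background mean can be sketched as below. The per-pool π values are placeholders, the reference value 0.013 is the background average quoted in the text, and the sign convention (sample mean minus reference) is an assumption that may differ from the one behind the reported statistics.

```python
import math

def one_sample_t(values, reference_mean):
    """t statistic for H0: the mean of `values` equals `reference_mean`."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    if variance == 0:
        raise ValueError("zero variance; t statistic undefined")
    return (mean - reference_mean) / math.sqrt(variance / n)

# Placeholder per-pool nucleotide diversity for one hypothetical gene.
gene_pi = [0.004, 0.005, 0.006, 0.005]
t_stat = one_sample_t(gene_pi, 0.013)
print(f"t = {t_stat:.2f} with {len(gene_pi) - 1} degrees of freedom")
```

Under this convention a negative statistic indicates diversity below the background; the p-value would follow from the t distribution with n - 1 degrees of freedom.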

Figure 5. Nucleotide diversity estimates for 6 genes represented by the 5.2% of 1013 SNPs significantly associated with the contrast between parasite populations from lakes on the west and on the east of the Southern Alps. Nucleotide diversity was averaged across all SNPs found within those genes and across all pools per each lake. Red line indicates the average nucleotide diversity, which was equal in each lake, π = 0.013 (see Supplementary Figure 1, Supplementary Table 2).


Discussion

Our population genomic analyses of the trematode parasite Atriophallophorus winterbourni across five lakes on the South Island of New Zealand identified moderate genetic differentiation, with average FST values between the most differentiated lakes reaching 7.6%. Genome wide analyses do, however, confirm greater similarity of lakes located on either side of the Southern Alps, which corresponds to the presence of 2 haplotype groups (CA and W) on the west side of the Alpine Fault and one haplotype (M) on the east of the Alpine Fault. The pattern suggests postglacial recolonization of Lake Mapourika and Lake Paringa from the northern refugia, and Lake Alexandrina and Lake Middleton from the southern refugia, in line with previous findings (Feijen, Zajac, et al., in prep). Divergence of populations strongly correlated with the driving distance between the lakes, suggesting a stepping stone fashion of dispersal during recolonization and gene flow restricted by the Southern Alps (Dybdahl & Lively, 1996; Neiman & Lively, 2004).

Although the differentiation among populations was low, we found evidence for admixture within each side of the Alps and across the Alpine Fault. Analysis of connectedness between populations, based on the fraction of shared alleles and corrected for neutral population structure, supports the cross-alpine admixture. The admixture we observed was strongest between Lake Alexandrina and Lake Middleton, Lake Selfe and Lake Alexandrina, and Lake Middleton and Lake Paringa. The gene flow between Lake Middleton and Lake Paringa was also apparent in the genome-wide FST, as the FST values between the two lakes were lower than for any other west/east comparison (Figure 1A). The analysis showed that correcting for neutral population structure effectively corrects for the shared ancestry and restricted gene flow through the Southern Alps, which dictates the pattern of divergence. On the other hand, the remaining gene flow occurring across the Alps may come from migration of the parasite in the duck definitive host, as the flight distance between the lakes across the divide is short without considering the topological constraints. It is somewhat puzzling to detect gene flow between Lake Alexandrina and Lake Middleton, as the genetic structure of populations of these lakes is similar to a degree that detecting gene flow should be very difficult. This is most likely the case between Lake Paringa and Lake Mapourika. Very little gene flow was detected, which probably signals the weak differentiation between those lakes. One possible explanation for this observation is that the inflow of genetic material into Lake Middleton from the west side of the Southern Alps, or indeed from other lakes not considered in our analysis, renders the lake sufficiently different, facilitating interpretation of the similarity between these lakes as either migration (incorporated into the model as migration branches) or as shared ancestry (incorporated as the covariance matrix).

Migratory movements of waterfowl are, indeed, the most likely source of parasite dispersal between the lakes. However, not much is known about the migratory routes of the waterfowl on the Island. Some report birds flying from north-west to south-east, but the reported dispersal distances vary between a range of 10-20 square miles to 500 miles (Balham, 1952; Moncrieff, 1929). The postglacial history of the final hosts is also unclear, but there is evidence of a large scale avifaunal colonization from Australia during the Plio-Pleistocene and Late Holocene (Rawlence, Scofield, McGlone, & Knapp, 2019). The haplotype structure of the intermediate host, Potamopyrgus antipodarum, in which the parasite is transported to the gut of the waterfowl, also reflects the dispersal between the lakes on both sides of the Southern Alps (Paczesniak, Jokela, Larkin, & Neiman, 2013). Just as observed for the parasite, the snail shows a genetic signature of recolonization from northern and southern refugia (Neiman & Lively, 2004). On the other hand, Wright’s FST values of 0.174, estimated using allozyme data, suggest stronger lake specific population genetic structure among the snail populations than observed for the parasite (Dybdahl & Lively, 1996; Neiman & Lively, 2004).

Our population genomic analyses across all SNPs suggest that Lake Selfe, located on the north east of the South Island, is equally differentiated from the lakes on the western and on the eastern sides of the Southern Alps. We also detected all three haplotype groups in the Lake Selfe population. We inferred that the lake was most likely recolonized from the northern refugia after the Last Glacial Maximum, explaining its similarity to the lakes on the west. We interpret the patterns we see now as evidence for extensive present gene flow from the east. In order to address the question on genomic regions differentiated between the west and the east, we excluded the lake from the analysis.

We used two methods to study the outliers associated with the differentiation between the lakes on the west and the east of the Southern Alps: FST measured across the genome in a sliding window analysis and the C2 statistic. The FST statistic does not incorporate the neutral population structure in outlier analysis, taking into account all loci differentiated for multiple reasons, such as drift, migration or background selection (Meirmans & Hedrick, 2011). Additionally, SNPs found in the outlier tail of the FST distribution show a strong bias towards loci with lower coverage (Günther & Coop, 2013). However, when FST is measured over 2 Kb windows, the likelihood of false positives impacting the results decreases (Fischer et al., 2013). On the other hand, the robustness of the C2 statistic stems from incorporating neutral population structure in outlier detection, but it is known to vary depending on the structuring of genetic diversity of the neutral variants (Olazcuaga et al., 2020). The highest values of C2 were observed for variants for which allele frequencies were rare in one group and high in the other group but still displaying high heterogeneity (Olazcuaga et al., 2020). We considered that if a SNP appeared as an outlier in both analyses, the likelihood of it being a product of the shortcomings of each particular method decreased. Additionally, the absence of differences among the pools in measures of Θ Watterson and π improved the robustness of our FST estimates (Fischer et al., 2017; Jost, 2008). We thus assumed our results on FST outlier regions were not biased by significant differences in nucleotide diversity among pools or possible demographic changes (bottlenecks, population expansions) impacting the lakes differently (Fischer et al., 2017; Jost, 2008).

We found 1013 SNPs strongly associated with differentiation between the west and the east sides of the Southern Alps, clustered into “islands” of differentiation. Of those SNPs, 5.2% (53 SNPs) were of functional relevance. These differentiated genes have been found to be coding for transmembrane, antigenic and signal transduction proteins, transcription factors and protein binding molecules. A literature review of the functions of the proteins revealed a large number of them (50%) to be possibly involved in the extracellular vesicle (EV) biogenesis pathway, widely known to be implicated in parasitism, e.g. through allowing parasite migration through the host tissue, countering the attack of the host immune system and modifying host cell immune responses (Bennett, de la Torre-Escudero, & Robinson, 2020; Cwiklinski et al., 2015). The EV biogenesis pathway has been studied in Fasciola hepatica as well as in other helminths (Bennett et al., 2020; Cwiklinski et al., 2015). The pathway involves charged multivesicular body proteins, which are part of an endosomal complex required for transport (ESCRT) (Cwiklinski et al., 2015). Together with tetraspanins they drive the invagination of the endosome-limiting membrane and lead to formation of intraluminal vesicles. Small GTPases together with SNARE proteins promote the release of the intraluminal vesicles into the extracellular environment as exosomes (Bennett et al., 2020; Cwiklinski et al., 2015). Multivesicular body proteins, tetraspanins and GTPases have all been found as differentiated in our analysis. The system is at the interface of the parasite and host interaction, providing a mechanism for delivery of a wide range of molecules from the parasite to the host. Tetraspanins are also known for their structural role in development, maturation and stability of the tegument, the outer body covering of the parasite (Cai et al., 2008; Piratae et al., 2012; Tran et al., 2010).

The six candidate genes we outlined in our study show strong association with the east/west separation. However, whether they are under selection and involved in local adaptation to the intermediate host population remains to be established. For better understanding we examined the genes’ nucleotide diversity. Two genes, encoding the charged multivesicular body protein and the general transcription factor, especially stood out as they exhibited decreased and elevated nucleotide diversity, respectively, across all the lakes. Low nucleotide diversity is frequently associated with positive selection (Hohenlohe, Phillips, & Cresko, 2010), whereas elevated nucleotide diversity is often interpreted as a signature of balancing selection (Hohenlohe et al., 2010) or a result of relaxed purifying selection (Cui et al., 2019; Wicke, Schäferhoff, Depamphilis, & Müller, 2014). The few studies that have directly addressed natural selection on transcription factors, or other elements of the gene expression machinery, report that transcription factors (TFs) are often under strong selective constraints for optimal activity in a given environment. Nevertheless, TFs have been observed to show elevated mutation rates during adaptation to a new laboratory environment in microbial experiments (Ali & Seshasayee, 2020; Conrad, Lewis, & Palsson, 2011). Such an elevated mutation rate is suggested to be advantageous, and selectively favored, for reaching a new optimum of gene expression (Ali & Seshasayee, 2020; Conrad et al., 2011). Extensive post-transcriptional regulation involving TFs has also been recognized in parasites and is assumed to facilitate adaptation to their hosts, for example in Schistosoma species (Piao et al., 2014; Shoemaker, Ramachandran, Landa, dos Reis, & Stein, 1992) or apicomplexan parasites (Yeoh, Goodman, et al., 2019; Yeoh, Lee, McFadden, & Ralph, 2019).

The functional characterization of the candidate genes suggests their significant role in parasitism and thus their possible evolution through selection imposed by the intermediate host, also differentiated by isolation during the Last Glacial Maximum in the southern and northern refugia. Our observations align with the idea that despite strong gene flow between the lakes, genomic regions under strong selection, maintaining local adaptation, remain differentiated. Gene flow is expected to erode local adaptation through homogenization of populations (Frank, 1991, 1993; Thompson, 1994). However, simulation studies show that the strength of gene flow and genetic drift are important in determining whether selection mosaics and the type of host-parasite genetic specificity lead to local adaptation (Gandon & Nuismer, 2009). Gandon and Nuismer (2009) showed that strong local selection with low genetic drift can counteract considerable migration between populations, promoting local adaptation. Examples supporting this prediction have been found in experimentally manipulated microbial systems and in natural populations. For example, Morgan et al. (2005) showed that experimentally increasing phage φ2 parasite migration between Pseudomonas fluorescens cultures leads to an increase in the strength of parasite specialization on local hosts. The field example comes from the A. winterbourni – P. antipodarum system itself. Despite the high gene flow across parasite populations, found both in our study and in previous studies (Dybdahl & Lively, 1996), experimental evidence shows sympatric parasites to cause higher infection prevalence in sympatric hosts than in allopatric hosts, with the cross infection experiments between lakes on both sides of the Southern Alps exhibiting a stronger difference (Lively, Dybdahl, Jokela, Osnas, & Delph, 2004). Ladle et al. (1993) emphasized, however, that the coevolutionary dynamics between parasites and hosts, relying on local adaptation, can be maintained in the presence of gene flow, but only when the rate of admixture is stronger for the parasite than for the host. Comparing our results with the results obtained for P. antipodarum (Dybdahl & Lively, 1996), this might indeed be the case in our system. Consequently, we believe that in order to fully understand the source of selection on the parasite, genomic regions in P. antipodarum differentiated between the west and the east, to which the parasite might be adapting, need to be discovered.

An additional question raised by our study is what causes the differentiation among the other 960 SNPs if their differentiation is not of direct functional relevance. These regions could potentially be linked to positively selected sites or to genomic regions experiencing purifying selection, a pattern shown to be mistaken for evidence of local adaptation (Cruickshank & Hahn, 2014). Alternatively, they could be located in promoters, enhancers or small RNAs affecting gene expression (Fischer et al., 2013). The SNPs we detect often appear in clusters, which could also be a result of chromosomal inversions causing suppressed recombination and thus increased differentiation between populations (Morales et al., 2019).

With our analysis performed on pooled DNA samples and focused on strongly differentiated regions, we might not have detected the full complexity of divergence between populations. This might be due to a range of reasons associated either with the nature of adaptation itself or with the constraints of the Pool-Seq method. For instance, the genetic basis of an adaptive phenotype can be epistatic or polygenic, and therefore dependent on different functional combinations of genes and alleles (Barton, 2017). Traits influenced by many genes of small effect are known to be weakly differentiated and thus missed with outlier based methods (Pritchard, Pickrell, & Coop, 2010; Shi, Kichaev, & Pasaniuc, 2016; Walsh & Lynch, 2018). The lack of individual data and individual heterozygosity estimates, and the focus on intermediate frequency variants, all associated with the Pool-Seq method, also pose challenges to discovering the full molecular basis of divergence.

In conclusion, our analysis provides a genome-wide perspective on the variation caused by climatic and geographic changes in New Zealand in the trematode parasite A. winterbourni. Our study helps in understanding parasite phylogeography and provides a strong basis for the study of genomic regions under selection due to population structure or adaptation to the intermediate host, P. antipodarum.

Acknowledgments

We thank Julia Vrtilek for help with the collection of samples in New Zealand, Katri Seppälä for developing the parasite hatching protocol and help with the hatching of the parasites. We also thank Hernán Eduardo Morales Villegas, Hélène Boulain, Martin C. Fischer and Benjamin Dauphin for helpful suggestions during the analysis. The research was funded by an ETH grant ETH-36 15-2 obtained by Jukka Jokela and Hanna Hartikainen.


Supplementary Information

Table of contents:

1. Supplementary Figures

a. Supplementary Figure 1. Boxplot of Tajima’s Pi and Watterson’s Theta calculated across all coding sequences for each pool using PoPoolation2 (Kofler, Pandey, et al., 2011).

b. Supplementary Figure 2. Results of the Hierarchical Clustering of Principal Components performed with FactoMineR v2.3 R package (Lê et al., 2008), represented as a factor map. The optimum number of clusters chosen by the hierarchical clustering was 9, based on 17 PCs.

c. Supplementary Figure 3. Visualisation of the standard error of the residuals from the fit of nine models to the data using TreeMix v1.13 software. Blue and green colours indicate positive values, yellow and red indicate negative values. For each model a maximum likelihood tree was built among 18 pools from 5 lakes across the South Island of New Zealand. The models differed only by the value of the migration parameter fit into the model, indicated by the “m” parameter above each plot. The population lists at the bottom and on the left hand side of all the plots indicate the alignment of populations for each matrix. The likelihood of each model is also indicated above each plot. Positive residuals indicate pairs of populations where the model underestimates the observed covariance and the model might benefit from additional migration edges, and negative residuals indicate that the model overestimates the observed covariance (Pritchard et al., 2010).

d. Supplementary Figure 4. Results of admixture inference method TreeMix v1.13. A. An unrooted maximum likelihood tree inferred with TreeMix v1.13 using allele frequencies of 50,000 SNPs randomly sampled across the genome. The tree is visualised using plotting_funcs.R available within the TreeMix v1.13 software. The horizontal branch lengths are proportional to the amount of genetic drift that has occurred on the branch. The drift is assumed to be 0 at the level of the ancestor of all populations. Only 2 pools per lake were used for this analysis, to correct the analysis represented in Figure 3 for differing population sizes between the lakes. B. Summary per lake of the total weight of the incoming and outgoing migration branches indicated on the tree in A.

e. Supplementary Figure 5. Distribution of the p-values computed on the data and derived from the C2 statistics for the contrast between the west and the east lakes across the South Island of New Zealand, assuming a χ2 null distribution (with one degree of freedom). Computation of the C2 statistic was performed with Baypass2.2 (Gautier, 2015).

2. Supplementary Tables

a. Supplementary Table 1. Sequencing and mapping results for Pool-Seq Illumina (Novaseq 6000) sequencing of 18 pools.

b. Supplementary Table 2. Mean and median of Tajima’s Pi and Watterson’s Theta calculated with PoPoolation2 for all coding sequences in each pool.

c. Supplementary Table 3. Pairwise FST matrix calculated with poolfstat v1.1.1 across 18,184,991 SNPs for all 18 pools sampled across 5 lakes in the South Island of New Zealand.

d. Supplementary Table 4. Proportion of variance captured by each principal component in the PCA analysis done as a first step in the HCPC analysis.

e. Supplementary Table 5. Standard deviation, mean and median for FST calculated in 2 Kb windows between any pair of pools. 2 pools per lake were chosen for the analysis. FST was calculated with PoPoolation2 across 18,184,991 SNPs. The last column indicates the threshold value used for outlier detection between any pair of pools from the lakes on the west and on the east of the Southern Alps.

f. Supplementary Table 6. Results of the one sample t-test comparing the nucleotide diversity (π) of each outlier gene separately and the average of all outlier genes together in each lake to the mean of nucleotide diversity in that lake for all genes.

g. Supplementary Table 7. A. Results of the Ordinary Least Squares regression model fit to the data of nucleotide diversity of the outlier genes in each pool of each lake. Fixed factors used in the analysis were the lake, the outlier gene and the gene by lake interaction. B. Summary of the results for each factor in the OLS regression model. Eta-squared reflects the proportion of variance explained by each factor. C. Results of the post-hoc Tukey’s HSD testing for the significant differences in mean nucleotide diversity between pairs of lakes.


Chapter 3

Supplementary Figures

Supplementary Figure 1. Boxplot of Tajima’s Pi and Watterson’s Theta calculated across all coding sequences for each pool using PoPoolation2 (Kofler, Pandey, et al., 2011).


Supplementary Figure 2. Results of the Hierarchical Clustering of Principal Components performed with FactoMineR v2.3 R package (Lê et al., 2008), represented as a factor map. The optimum number of clusters chosen by the hierarchical clustering was 9, based on 17 PCs.


Supplementary Figure 3. Visualisation of the standard error of the residuals from the fit of nine models to the data using TreeMix v1.13 software. Blue and green colours indicate positive values; yellow and red indicate negative values. For each model, a maximum likelihood tree was built among 18 pools from 5 lakes across the South Island of New Zealand. The models differed only in the value of the migration parameter fitted in the model, indicated by the “m” parameter above each plot. The population lists at the bottom and on the left-hand side of each plot indicate the order of populations in each matrix. The likelihood of each model is also indicated above each plot. Positive residuals indicate pairs of populations for which the model underestimates the observed covariance, so the model might benefit from additional migration edges; negative residuals indicate that the model overestimates the observed covariance (Pritchard et al., 2010).


Supplementary Figure 4. Results of the admixture inference method TreeMix v1.13. A. An unrooted maximum likelihood tree inferred with TreeMix v1.13 using allele frequencies of 50,000 SNPs randomly sampled across the genome. The tree is visualised using plotting_funcs.R available within the TreeMix v1.13 software. The horizontal branch lengths are proportional to the amount of genetic drift that has occurred on the branch. The drift is assumed to be 0 at the level of the ancestor of all populations. Only 2 pools per lake were used for this analysis, to correct the analysis represented in Figure 3 for differing population sizes between the lakes. B. Summary per lake of the total weight of the incoming and outgoing migration branches indicated on the tree in A.


Supplementary Figure 5. Distribution of the p-values computed on the data and derived from the C2 statistics for the contrast between the west and the east lakes across the South Island of New Zealand assuming a χ² null distribution (with one degree of freedom). Computation of the C2 statistic was performed with BayPass v2.2 (Gautier, 2015).
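The conversion from a C2 statistic to a p-value under a χ² null with one degree of freedom can be sketched as follows. This is a minimal illustration and not the BayPass implementation; the function name `chi2_sf_1df` and the example C2 values are hypothetical and are not taken from the thesis data.

```python
import math


def chi2_sf_1df(c2: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    if c2 < 0:
        raise ValueError("the statistic must be non-negative")
    # With 1 df, X = Z^2 for a standard normal Z, so
    # P(X > c2) = P(|Z| > sqrt(c2)) = erfc(sqrt(c2 / 2)).
    return math.erfc(math.sqrt(c2 / 2.0))


# Hypothetical C2 values, for illustration only.
example_c2 = [0.0, 1.0, 3.841458820694124, 10.0]
p_values = [chi2_sf_1df(c2) for c2 in example_c2]
```

Under the null hypothesis such p-values are expected to be roughly uniform on [0, 1], which is what a histogram like the one in this figure lets the reader check.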


Supplementary Tables

Supplementary Table 1. Sequencing and mapping results for Pool-Seq Illumina (NovaSeq 6000) sequencing of 18 pools.

Columns, in order: Pool; Number of raw reads obtained from Illumina sequencing; Number of reads mapping to the reference genome; Proportion of reads mapping to the reference genome; Number of mapped reads passing the quality filtering; Proportion of mapped reads passing quality filtering; Read length; Coverage

Alexandrina Pool1 519482864 507988129 0.977872735 247941536 0.488085295 150bp 61.81

Alexandrina Pool2 685694704 669717691 0.976699524 331024798 0.494275129 150bp 82.52

Alexandrina Pool3 493692593 481975597 0.976266616 237048183 0.49182611 150bp 59.09

Alexandrina Pool4 524584368 512241021 0.976470235 254492492 0.496821772 150bp 63.44

Alexandrina Pool5 659911948 644314533 0.9763644 319500759 0.495877002 150bp 79.65

Alexandrina Pool6 632729711 617632424 0.976139437 308697744 0.49980819 150bp 76.95

Mapourika Pool1 547382605 534134091 0.975796611 260220207 0.487181424 150bp 64.87

Mapourika Pool2 554834807 541760032 0.976434833 253628905 0.468157284 150bp 63.23

Middleton Pool1 758038420 739325752 0.975314354 355273577 0.480537268 150bp 88.56

Middleton Pool2 677191214 660679581 0.975617473 319895449 0.484191518 150bp 79.74

Middleton Pool3 718532309 701599399 0.976434031 335461683 0.478138498 150bp 83.62

Paringa Pool1 734166283 714884976 0.973737139 332881237 0.465643073 150bp 82.98

Paringa Pool2 844809031 823791807 0.975121923 389623474 0.472963521 150bp 97.13

Paringa Pool3 637715912 622201597 0.975672059 306475910 0.492566897 150bp 76.40

Paringa Pool4 509025952 493743998 0.969978045 226538029 0.458816775 150bp 56.47

Selfe Pool1 587389712 573094922 0.975663874 271938843 0.474509252 150bp 67.79

Selfe Pool2 651365334 633467715 0.972522918 305771327 0.482694413 150bp 76.22

Selfe Pool3 629536252 614995207 0.976901974 305863676 0.497343187 150bp 76.25
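The two proportion columns of this table follow directly from the read counts. The sketch below recomputes them for the Alexandrina Pool1 row; the helper name `proportion` is hypothetical, and the coverage column is not recomputed because the genome size it depends on is not stated in this table.

```python
def proportion(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, guarding against an empty denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


# Read counts for Alexandrina Pool1, as listed in the table.
raw_reads = 519482864
mapped_reads = 507988129
mapped_passing_filter = 247941536

# Proportion of raw reads that map to the reference genome.
prop_mapped = proportion(mapped_reads, raw_reads)
# Proportion of mapped reads that pass the quality filtering.
prop_passing = proportion(mapped_passing_filter, mapped_reads)
```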


Supplementary Table 2. Mean and median of Tajima’s Pi and Watterson’s Theta calculated with PoPoolation2 for all coding sequences in each pool.

Pool                                   Mean Watterson's Theta   Mean Tajima's Pi   Median Tajima's Pi   Median Watterson's Theta
Alex_Pool1_exonic_variance             0.014                    0.013              0.011                0.012
Alex_Pool2_exonic_variance             0.015                    0.013              0.012                0.013
Alex_Pool3_exonic_variance             0.014                    0.013              0.011                0.012
Alex_Pool4_exonic_variance             0.014                    0.013              0.011                0.012
Alex_Pool5_exonic_variance             0.014                    0.013              0.012                0.013
Alex_Pool6_exonic_variance             0.014                    0.013              0.011                0.013
Mapourika_Mix_Pool2_exonic_variance    0.014                    0.013              0.011                0.012
Mapourika_Otto_Pool3_exonic_variance   0.014                    0.013              0.011                0.012
Middleton_Pool1_exonic_variance        0.015                    0.014              0.012                0.013
Middleton_Pool2_exonic_variance        0.015                    0.013              0.012                0.013
Middleton_Pool3_exonic_variance        0.015                    0.013              0.012                0.013
Paringa_Pool1_exonic_variance          0.015                    0.013              0.012                0.013
Paringa_Pool2_exonic_variance          0.016                    0.014              0.012                0.014
Paringa_Pool3_exonic_variance          0.014                    0.013              0.012                0.013
Paringa_Pool4_exonic_variance          0.014                    0.013              0.011                0.012
Selfe_Pool1_exonic_variance            0.014                    0.013              0.011                0.012
Selfe_Pool2_exonic_variance            0.014                    0.013              0.012                0.013
Selfe_Pool3_exonic_variance            0.015                    0.013              0.012                0.013


Supplementary Table 3. Pairwise FST matrix calculated with poolfstat v1.1.1 across 18,184,991 SNPs for all 18 pools sampled across 5 lakes in the South Island of New Zealand.

Columns follow the same pool order as the rows: Alexandrina Pool1-6 (A1-A6), Mapourika Pool1-2 (Ma1-Ma2), Middleton Pool1-3 (Mi1-Mi3), Paringa Pool1-4 (P1-P4), Selfe Pool1-3 (S1-S3).

Pool                 A1     A2     A3     A4     A5     A6     Ma1    Ma2    Mi1    Mi2    Mi3    P1     P2     P3     P4     S1     S2     S3
Alexandrina Pool1    0.000  0.055  0.063  0.067  0.056  0.056  0.081  0.083  0.064  0.058  0.059  0.073  0.072  0.075  0.085  0.068  0.067  0.067
Alexandrina Pool2    0.055  0.000  0.055  0.059  0.049  0.048  0.072  0.074  0.058  0.051  0.052  0.066  0.065  0.067  0.076  0.060  0.060  0.060
Alexandrina Pool3    0.063  0.055  0.000  0.066  0.056  0.055  0.081  0.083  0.065  0.058  0.059  0.073  0.071  0.075  0.085  0.068  0.067  0.067
Alexandrina Pool4    0.067  0.059  0.066  0.000  0.059  0.059  0.084  0.086  0.068  0.061  0.062  0.077  0.075  0.079  0.089  0.072  0.070  0.071
Alexandrina Pool5    0.056  0.049  0.056  0.059  0.000  0.049  0.073  0.075  0.058  0.052  0.053  0.067  0.066  0.068  0.077  0.061  0.061  0.061
Alexandrina Pool6    0.056  0.048  0.055  0.059  0.049  0.000  0.073  0.075  0.058  0.051  0.052  0.067  0.065  0.068  0.077  0.061  0.060  0.061
Mapourika Pool1      0.081  0.072  0.081  0.084  0.073  0.073  0.000  0.058  0.077  0.071  0.072  0.056  0.055  0.057  0.066  0.071  0.069  0.070
Mapourika Pool2      0.083  0.074  0.083  0.086  0.075  0.075  0.058  0.000  0.079  0.073  0.074  0.057  0.056  0.058  0.068  0.073  0.071  0.072
Middleton Pool1      0.064  0.058  0.065  0.068  0.058  0.058  0.077  0.079  0.000  0.054  0.055  0.071  0.070  0.073  0.081  0.066  0.066  0.065
Middleton Pool2      0.058  0.051  0.058  0.061  0.052  0.051  0.071  0.073  0.054  0.000  0.048  0.064  0.063  0.066  0.075  0.060  0.059  0.059
Middleton Pool3      0.059  0.052  0.059  0.062  0.053  0.052  0.072  0.074  0.055  0.048  0.000  0.066  0.065  0.067  0.076  0.061  0.060  0.060
Paringa Pool1        0.073  0.066  0.073  0.077  0.067  0.067  0.056  0.057  0.071  0.064  0.066  0.000  0.047  0.048  0.057  0.065  0.064  0.064
Paringa Pool2        0.072  0.065  0.071  0.075  0.066  0.065  0.055  0.056  0.070  0.063  0.065  0.047  0.000  0.049  0.056  0.063  0.063  0.063
Paringa Pool3        0.075  0.067  0.075  0.079  0.068  0.068  0.057  0.058  0.073  0.066  0.067  0.048  0.049  0.000  0.059  0.067  0.066  0.066
Paringa Pool4        0.085  0.076  0.085  0.089  0.077  0.077  0.066  0.068  0.081  0.075  0.076  0.057  0.056  0.059  0.000  0.076  0.074  0.075
Selfe Pool1          0.068  0.060  0.068  0.072  0.061  0.061  0.071  0.073  0.066  0.060  0.061  0.065  0.063  0.067  0.076  0.000  0.052  0.053
Selfe Pool2          0.067  0.060  0.067  0.070  0.061  0.060  0.069  0.071  0.066  0.059  0.060  0.064  0.063  0.066  0.074  0.052  0.000  0.052
Selfe Pool3          0.067  0.060  0.067  0.071  0.061  0.061  0.070  0.072  0.065  0.059  0.060  0.064  0.063  0.066  0.075  0.053  0.052  0.000


Supplementary Table 4. Proportion of variance captured by each principal component in the PCA analysis done as a first step in the HCPC analysis.

Principal component   Eigenvalue   % of variance   Cumulative % of variance
1                     5299.581     9.934541        9.934541
2                     4388.435     8.226515        18.16106
3                     3830.38      7.180392        25.34145
4                     3501.814     6.564464        31.90591
5                     3344.335     6.269256        38.17517
6                     3153.682     5.911861        44.08703
7                     3002.603     5.628649        49.71568
8                     2958.998     5.546907        55.26259
9                     2919.589     5.473031        60.73562
10                    2903.052     5.442031        66.17765
11                    2863.859     5.368561        71.54621
12                    2803.767     5.255914        76.80212
13                    2624.733     4.920298        81.72242
14                    2524.059     4.731576        86.454
15                    2499.561     4.685652        91.13965
16                    2429.113     4.553592        95.69324
17                    2297.441     4.306759        100
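The percentage columns in this table can be reproduced from the eigenvalues alone: each component's share is its eigenvalue divided by the sum of all eigenvalues. The sketch below assumes the 17 eigenvalues listed above are the complete set; the variable names are illustrative.

```python
from itertools import accumulate

# Eigenvalues of the 17 principal components, as listed in the table.
eigenvalues = [
    5299.581, 4388.435, 3830.38, 3501.814, 3344.335, 3153.682,
    3002.603, 2958.998, 2919.589, 2903.052, 2863.859, 2803.767,
    2624.733, 2524.059, 2499.561, 2429.113, 2297.441,
]

total = sum(eigenvalues)
# Percentage of variance captured by each component.
percent_variance = [100.0 * value / total for value in eigenvalues]
# Running total of the percentages (cumulative % of variance).
cumulative_percent = list(accumulate(percent_variance))

for index, (pct, cum) in enumerate(zip(percent_variance, cumulative_percent), start=1):
    print(f"PC{index}: {pct:.6f}% (cumulative {cum:.5f}%)")
```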


Supplementary Table 5. Standard deviation, mean and median for FST calculated in 2 Kb windows between any pair of pools. 2 pools per lake were chosen for the analysis. FST was calculated with PoPoolation2 across 18,184,991 SNPs. The last column indicates the threshold FST value used for outlier detection between any pair of pools from the lakes on the west and on the east of the Southern Alps.

comparison                          stdev      mean       median     4xstdev
Alexandrina_Pool2_Pool6             0.035353   0.042267   0.036819   0.141413
Alexandrina_Pool2_Mapourika_Pool1   0.054626   0.058151   0.047444   0.218503
Alexandrina_Pool2_Mapourika_Pool2   0.055671   0.059073   0.048236   0.222683
Alexandrina_Pool2_Middleton_Pool1   0.038917   0.046589   0.040365   0.15567
Alexandrina_Pool2_Middleton_Pool2   0.036885   0.043739   0.03762    0.147538
Alexandrina_Pool2_Paringa_Pool1     0.05065    0.052757   0.042797   0.202598
Alexandrina_Pool2_Paringa_Pool2     0.047738   0.051549   0.042385   0.19095
Alexandrina_Pool2_Selfe_Pool1       0.045687   0.050708   0.042642   0.182747
Alexandrina_Pool2_Selfe_Pool2       0.043624   0.049091   0.041346   0.174494
Alexandrina_Pool6_Mapourika_Pool1   0.056594   0.058911   0.04767    0.226376
Alexandrina_Pool6_Mapourika_Pool2   0.057544   0.060074   0.04861    0.230174
Alexandrina_Pool6_Middleton_Pool1   0.039565   0.047545   0.0411     0.158261
Alexandrina_Pool6_Middleton_Pool2   0.037582   0.04437    0.038101   0.15033
Alexandrina_Pool6_Paringa_Pool1     0.052387   0.054043   0.043686   0.209547
Alexandrina_Pool6_Paringa_Pool2     0.049055   0.052882   0.043371   0.196219
Alexandrina_Pool6_Selfe_Pool1       0.047149   0.051361   0.042882   0.188598
Alexandrina_Pool6_Selfe_Pool2       0.044663   0.049898   0.041951   0.178651
Mapourika_Pool1_Pool2               0.042882   0.049787   0.042613   0.171527
Mapourika_Pool1_Middleton_Pool1     0.05425    0.060643   0.050555   0.217
Mapourika_Pool1_Middleton_Pool2     0.053561   0.057529   0.047046   0.214246
Mapourika_Pool1_Paringa_Pool1       0.04049    0.048189   0.04146    0.16196
Mapourika_Pool1_Paringa_Pool2       0.038752   0.047859   0.041683   0.155008
Mapourika_Pool1_Selfe_Pool1         0.054301   0.05789    0.047251   0.217204
Mapourika_Pool1_Selfe_Pool2         0.051384   0.056454   0.04656    0.205536
Mapourika_Pool2_Middleton_Pool1     0.055022   0.06172    0.051492   0.220087
Mapourika_Pool2_Middleton_Pool2     0.054112   0.058501   0.047971   0.216447
Mapourika_Pool2_Paringa_Pool1       0.04152    0.048955   0.042196   0.16608
Mapourika_Pool2_Paringa_Pool2       0.039521   0.048806   0.042632   0.158084
Mapourika_Pool2_Selfe_Pool1         0.05557    0.058851   0.048061   0.222281
Mapourika_Pool2_Selfe_Pool2         0.053088   0.057677   0.047416   0.212352
Middleton_Pool1_Pool2               0.03601    0.044783   0.039456   0.144042
Middleton_Pool1_Paringa_Pool1       0.050442   0.054996   0.045449   0.201769
Middleton_Pool1_Paringa_Pool2       0.047354   0.053408   0.044511   0.189417
Middleton_Pool1_Selfe_Pool1         0.046253   0.053726   0.045906   0.18501
Middleton_Pool1_Selfe_Pool2         0.044295   0.051961   0.044528   0.177181
Middleton_Pool2_Paringa_Pool1       0.049478   0.0523     0.042682   0.197913
Middleton_Pool2_Paringa_Pool2       0.045851   0.051126   0.04238    0.183406
Middleton_Pool2_Selfe_Pool1         0.044895   0.050399   0.042519   0.17958
Middleton_Pool2_Selfe_Pool2         0.042585   0.04889    0.041367   0.17034
Paringa_Pool1_Pool2                 0.033418   0.041132   0.035983   0.13367
Paringa_Pool1_Selfe_Pool1           0.050381   0.05388    0.044249   0.201525
Paringa_Pool1_Selfe_Pool2           0.048145   0.051981   0.042671   0.192582
Paringa_Pool2_Selfe_Pool1           0.047696   0.052959   0.044031   0.190783
Paringa_Pool2_Selfe_Pool2           0.045419   0.051009   0.042456   0.181678
Selfe_Pool1_Pool2                   0.038861   0.045961   0.039642   0.155443
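The threshold in the last column of this table is four times the standard deviation of the windowed FST values for that comparison. A minimal sketch of how such a threshold could be applied is given below. The function names, the example window values and the use of a strict "greater than" comparison are assumptions made for illustration, not details confirmed by the table.

```python
def outlier_threshold(stdev: float, multiplier: float = 4.0) -> float:
    """Threshold as a multiple of the standard deviation (the 4xstdev column)."""
    return multiplier * stdev


def flag_outliers(window_fst, threshold):
    """Return the indices of windows whose FST exceeds the threshold."""
    return [i for i, value in enumerate(window_fst) if value > threshold]


# Standard deviation for Alexandrina_Pool2_Mapourika_Pool1, as listed in the table.
threshold = outlier_threshold(0.054626)

# Hypothetical FST values for four 2 Kb windows, for illustration only.
windows = [0.03, 0.25, 0.10, 0.22]
outliers = flag_outliers(windows, threshold)
```

The small differences between `4 * stdev` computed here and the tabulated thresholds (for example 0.218504 against 0.218503) are consistent with the standard deviations having been rounded to six decimals in the table.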


Supplementary Table 6. Results of the one sample t-test comparing the nucleotide diversity (π) of each outlier gene separately and the average of all outlier genes together in each lake to the mean of nucleotide diversity in that lake for all genes.

Value of π T-test statistic P-value Lake Outlier gene

0.001018 513.8 0 Alexandrina maker-agouti_scaf_2061-augustus-gene-0.6-mRNA-1

0.001742 483.2 0 Middleton maker-agouti_scaf_2061-augustus-gene-0.6-mRNA-1

0.003972 389.1 0 Mapourika maker-agouti_scaf_2061-augustus-gene-0.6-mRNA-1

0.005402 328.7 0 Paringa maker-agouti_scaf_2061-augustus-gene-0.6-mRNA-1

0.005792 312.2 0 Alexandrina maker-jcf7180000230112-snap-gene-0.1-mRNA-1

0.005907 307.4 0 Middleton maker-jcf7180000230112-snap-gene-0.1-mRNA-1

0.012795 16.6 4.69E-62 Mapourika maker-jcf7180000230112-snap-gene-0.1-mRNA-1

0.015264 -87.6 0 Paringa maker-jcf7180000230112-snap-gene-0.1-mRNA-1

0.013151 1.60 0.109043 Alexandrina maker-jcf7180000275741-augustus-gene-0.2-mRNA-1

0.009119 171.8 0 Middleton maker-jcf7180000275741-augustus-gene-0.2-mRNA-1

0.008524 196.9 0 Mapourika maker-jcf7180000275741-augustus-gene-0.2-mRNA-1

0.00832 205.6 0 Paringa maker-jcf7180000275741-augustus-gene-0.2-mRNA-1

0.015392 -93 0 Alexandrina maker-jcf7180000275741-augustus-gene-0.3-mRNA-1

0.009548 153.7 0 Middleton maker-jcf7180000275741-augustus-gene-0.3-mRNA-1

0.001534 492 0 Mapourika maker-jcf7180000275741-augustus-gene-0.3-mRNA-1

0.002603 446.9 0 Paringa maker-jcf7180000275741-augustus-gene-0.3-mRNA-1

0.022698 -401.4 0 Alexandrina maker-jcf7180000276130-snap-gene-0.0-mRNA-1

0.032045 -796 0 Middleton maker-jcf7180000276130-snap-gene-0.0-mRNA-1

nan nan nan Mapourika maker-jcf7180000276130-snap-gene-0.0-mRNA-1

0.016019 -119.5 0 Paringa maker-jcf7180000276130-snap-gene-0.0-mRNA-1

0.012793 16.7 1.18E-62 Alexandrina maker-jcf7180000276347-augustus-gene-0.4-mRNA-1

0.01004 132.9 0 Middleton maker-jcf7180000276347-augustus-gene-0.4-mRNA-1

0.01149 71.7 0 Alexandrina all outlier genes

0.01140 75.3 0 Middleton all outlier genes

0.00887 182.32 0 Mapourika all outlier genes

0.01246 30.8 2E-207 Paringa all outlier genes


Supplementary Table 7. A. Results of the Ordinary Least Squares regression model fit to the data of nucleotide diversity of the outlier genes in each pool of each lake. Fixed factors used in the analysis were the lake, the outlier gene and the gene by lake interaction. B. Summary of the results for each factor in the OLS regression model. Eta-squared reflects the proportion of variance explained by each factor. C. Results of the post-hoc Tukey’s HSD testing for the significant differences in mean nucleotide diversity between pairs of lakes.

A.

R-squared: 0.826

F-statistic: 13.18

P-value 1.04E-15

No.Observations 84

Df.Residuals 61

Df.Model 22

AIC: -654

BIC: -598.1

coef std err t P>|t| [0.025 0.975]

Intercept 0.001 0.002 0.566 0.573 -0.003 0.005

C(lake)[T.Mapourika] 0.003 0.004 0.822 0.414 -0.004 0.010

C(lake)[T.Middleton] 0.001 0.003 0.233 0.817 -0.005 0.007

C(lake)[T.Paringa] 0.004 0.003 1.543 0.128 -0.001 0.010

C(gene)[T.maker-jcf7180000230112-snap-gene-0.1-mRNA-1] 0.005 0.003 1.879 0.065 0.000 0.010

C(gene)[T.maker-jcf7180000275741-augustus-gene-0.2-mRNA-1] 0.012 0.003 4.775 0.000 0.007 0.017

C(gene)[T.maker-jcf7180000275741-augustus-gene-0.3-mRNA-1] 0.014 0.003 5.657 0.000 0.009 0.019

C(gene)[T.maker-jcf7180000276130-snap-gene-0.0-mRNA-1] 0.022 0.003 8.135 0.000 0.016 0.027

C(gene)[T.maker-jcf7180000276347-augustus-gene-0.4-mRNA-1] 0.012 0.003 4.634 0.000 0.007 0.017

C(gene)[T.Mapourika]:C(scaffold)[T.maker-jcf7180000230112-snap-gene-0.1-mRNA-1] 0.004 0.005 0.797 0.429 -0.006 0.014

C(gene)[T.Middleton]:C(scaffold)[T.maker-jcf7180000230112-snap-gene-0.1-mRNA-1] -0.001 0.004 -0.139 0.890 -0.009 0.008

C(gene)[T.Paringa]:C(scaffold)[T.maker-jcf7180000230112-snap-gene-0.1-mRNA-1] 0.005 0.004 1.266 0.210 -0.003 0.013

C(gene)[T.Mapourika]:C(scaffold)[T.maker-jcf7180000275741-augustus-gene-0.2-mRNA-1] -0.008 0.005 -1.492 0.141 -0.018 0.003

C(gene)[T.Middleton]:C(scaffold)[T.maker-jcf7180000275741-augustus-gene-0.2-mRNA-1] -0.005 0.004 -1.081 0.284 -0.014 0.004

C(gene)[T.Paringa]:C(scaffold)[T.maker-jcf7180000275741-augustus-gene-0.2-mRNA-1] -0.009 0.004 -2.294 0.025 -0.017 -0.001

C(gene)[T.Mapourika]:C(scaffold)[T.maker-jcf7180000275741-augustus-gene-0.3-mRNA-1] -0.017 0.005 -3.308 0.002 -0.027 -0.007

C(gene)[T.Middleton]:C(scaffold)[T.maker-jcf7180000275741-augustus-gene-0.3-mRNA-1] -0.007 0.004 -1.492 0.141 -0.015 0.002

C(gene)[T.Paringa]:C(scaffold)[T.maker-jcf7180000275741-augustus-gene-0.3-mRNA-1] -0.017 0.004 -4.274 0.000 -0.025 -0.009

C(gene)[T.Mapourika]:C(scaffold)[T.maker-jcf7180000276130-snap-gene-0.0-mRNA-1] 0.000 0.000 -0.136 0.893 0.000 0.000

C(gene)[T.Middleton]:C(scaffold)[T.maker-jcf7180000276130-snap-gene-0.0-mRNA-1] 0.009 0.004 1.927 0.059 0.000 0.018


C(gene)[T.Paringa]:C(scaffold)[T.maker-jcf7180000276130-snap-gene-0.0-mRNA-1] -0.011 0.006 -1.977 0.053 -0.022 0.000

C(gene)[T.Mapourika]:C(scaffold)[T.maker-jcf7180000276347-augustus-gene-0.4-mRNA-1] 0.002 0.005 0.351 0.726 -0.008 0.012

C(gene)[T.Middleton]:C(scaffold)[T.maker-jcf7180000276347-augustus-gene-0.4-mRNA-1] -0.004 0.004 -0.790 0.433 -0.012 0.005

C(gene)[T.Paringa]:C(scaffold)[T.maker-jcf7180000276347-augustus-gene-0.4-mRNA-1] 0.013 0.004 3.155 0.002 0.005 0.021

B.

factor            sum_sq     mean_sq    df   F          PR(>F)     eta_sq
C(lake)           0.000113   3.77E-05   3    1.948494   0.131226   0.014426
C(gene)           0.004437   0.000887   5    45.81051   1.00E-17   0.565283
C(gene):C(lake)   0.002117   0.000141   15   7.286804   1.37E-08   0.269748
Residual          0.001182   1.94E-05   61

C.

group1      group2      meandiff   p-adj    lower     upper    reject
Alex        Mapourika   -0.0026    0.8366   -0.0112   0.006    FALSE
Alex        Middleton   -0.0001    0.9      -0.0071   0.0069   FALSE
Alex        Paringa     0.001      0.9      -0.0057   0.0076   FALSE
Mapourika   Middleton   0.0025     0.8918   -0.0069   0.012    FALSE
Mapourika   Paringa     0.0036     0.713    -0.0056   0.0128   FALSE
Middleton   Paringa     0.0011     0.9      -0.0067   0.0088   FALSE
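The derived columns of part B (mean square, F and eta-squared) can be recomputed from the sums of squares and degrees of freedom. The sketch below uses the rounded values printed in the table, so its results agree with the tabulated ones only approximately; the variable names are illustrative.

```python
# (sum of squares, degrees of freedom) for each term, as listed in part B.
anova = {
    "C(lake)": (0.000113, 3),
    "C(gene)": (0.004437, 5),
    "C(gene):C(lake)": (0.002117, 15),
    "Residual": (0.001182, 61),
}

total_ss = sum(ss for ss, _ in anova.values())
residual_ss, residual_df = anova["Residual"]
residual_ms = residual_ss / residual_df

# Eta-squared: share of the total sum of squares attributed to each term.
eta_sq = {name: ss / total_ss for name, (ss, _) in anova.items()}
# F statistic: mean square of the term divided by the residual mean square.
f_stat = {
    name: (ss / df) / residual_ms
    for name, (ss, df) in anova.items()
    if name != "Residual"
}
```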


References

Ali, F., & Seshasayee, Aswin Sai N. (2020). Dynamics of genetic variation in transcription factors and its implications for the evolution of regulatory networks in Bacteria. Nucleic acids research, 48(8), 4100-4114. doi:10.1093/nar/gkaa162
Altenhoff, A. M., Levy, J., Zarowiecki, M., Tomiczek, B., Vesztrocy, A. W., Dalquen, D. A., . . . Dylus, D. (2019). OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome research, 29(7), 1152-1163.
Anand, S., Mangano, E., Barizzone, N., Bordoni, R., Sorosina, M., Clarelli, F., . . . De Bellis, G. (2016). Next Generation Sequencing of Pooled Samples: Guideline for Variants’ Filtering. Scientific Reports, 6(1), 33735. doi:10.1038/srep33735
Balham, R. W. (1952). Grey and Mallard Ducks in the Manawatu District, New Zealand. Emu - Austral Ornithology, 52(3), 163-191. doi:10.1071/MU952163
Barton, N. H. (2017). How does epistasis influence the response to selection? Heredity, 118(1), 96-109. doi:10.1038/hdy.2016.109
Bennett, A. P. S., de la Torre-Escudero, E., & Robinson, M. W. (2020). Helminth genome analysis reveals conservation of extracellular vesicle biogenesis pathways but divergence of RNA loading machinery between phyla. International Journal for Parasitology, 50(9), 655-661.
Blasco-Costa, I., Seppälä, K., Feijen, F., Zajac, N., Klappert, K., & Jokela, J. (2019). A new species of Atriophallophorus Deblock & Rosé, 1964 (Trematoda: Microphallidae) described from in vitro-grown adults and metacercariae from Potamopyrgus antipodarum (Gray, 1843) (Mollusca: Tateidae). Journal of helminthology, 94, e108.
Busing, F. M., Meijer, E., & Van Der Leeden, R. (1999). Delete-m jackknife for unequal m. Statistics and Computing, 9(1), 3-8.
Cai, P., Bu, L., Wang, J., Wang, Z., Zhong, X., & Wang, H. (2008). Molecular characterization of Schistosoma japonicum tegument protein tetraspanin-2: sequence variation and possible implications for immune evasion. Biochemical and biophysical research communications, 372(1), 197-202.
Cantarel, B. L., Korf, I., Robb, S. M., Parra, G., Ross, E., Moore, B., . . . Yandell, M. (2008). MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome research, 18(1), 188-196.
Conrad, T. M., Lewis, N. E., & Palsson, B. Ø. (2011). Microbial laboratory evolution in the era of genome‐scale science. Molecular systems biology, 7(1), 509.
Coop, G., Witonsky, D., Di Rienzo, A., & Pritchard, J. K. (2010). Using environmental correlations to identify loci underlying local adaptation. Genetics, 185(4), 1411-1423.
Cruickshank, T. E., & Hahn, M. W. (2014). Reanalysis suggests that genomic islands of speciation are due to reduced diversity, not reduced gene flow. Molecular Ecology, 23(13), 3133-3157.
Cui, R., Medeiros, T., Willemsen, D., Iasi, L. N., Collier, G. E., Graef, M., . . . Valenzano, D. R. (2019). Relaxed selection limits lifespan by increasing mutation load. Cell, 178(2), 385-399. e320.
Cwiklinski, K., de la Torre-Escudero, E., Trelis, M., Bernal, D., Dufresne, P. J., Brennan, G. P., . . . Robinson, M. W. (2015). The Extracellular Vesicles of the Helminth Pathogen, Fasciola hepatica: Biogenesis Pathways and Cargo Molecules Involved in Parasite Pathogenesis. Molecular & Cellular Proteomics, 14(12), 3258-3273. doi:10.1074/mcp.M115.053934


Dauphin, B. (2020). [Assessment of population structure on Poolseq data].
Dussex, N., Wegmann, D., & Robertson, B. C. (2014). Postglacial expansion and not human influence best explains the population structure in the endangered kea (Nestor notabilis). Molecular Ecology, 23(9), 2193-2209. doi:10.1111/mec.12729
Dybdahl, M. F., & Lively, C. M. (1996). The geography of coevolution: comparative population structures for a snail and its trematode parasite. Evolution, 50(6), 2264-2275. doi:10.1111/j.1558-5646.1996.tb03615.x
Emelianov, I., Marec, F., & Mallet, J. (2004). Genomic evidence for divergence with gene flow in host races of the larch budmoth. Proceedings of the Royal Society of London. Series B: Biological Sciences, 271(1534), 97-105.
Feder, J. L., Egan, S. P., & Nosil, P. (2012). The genomics of speciation-with-gene-flow. Trends in Genetics, 28(7), 342-350.
Feijen, F., Widmer, K. K., Oester, R., Tardent, N., Klappert, K., & Jokela, J. (in prep). Frequency of multiple-genotype infections as indicator of exposure risk in a natural host population. ETH.
Feijen, F., Zajac, N., Vorburger, C., & Jokela, J. (in prep). Contrasting phylogeographic patterns in a cryptic species complex of trematode parasites.
Ferchaud, A.-L., & Hansen, M. M. (2016). The impact of selection, gene flow and demographic history on heterogeneous genomic divergence: three-spine sticklebacks in divergent environments. Molecular Ecology, 25(1), 238-259. doi:10.1111/mec.13399
Fischer, M. C., Rellstab, C., Leuzinger, M., Roumet, M., Gugerli, F., Shimizu, K. K., . . . Widmer, A. (2017). Estimating genomic diversity and population differentiation–an empirical comparison of microsatellite and SNP variation in Arabidopsis halleri. BMC genomics, 18(1), 1-15.
Fischer, M. C., Rellstab, C., Tedder, A., Zoller, S., Gugerli, F., Shimizu, K. K., . . . Widmer, A. (2013). Population genomic footprints of selection and associations with climate in natural populations of Arabidopsis halleri from the Alps. Molecular Ecology, 22(22), 5594-5607. doi:10.1111/mec.12521
Frank, S. A. (1991). Ecological and genetic models of host-pathogen coevolution. Heredity, 67(1), 73-83.
Frank, S. A. (1993). Coevolutionary genetics of plants and pathogens. Evolutionary Ecology, 7(1), 45-75.
Franssen, S. U., Barton, N. H., & Schlötterer, C. (2016). Reconstruction of Haplotype-Blocks Selected during Experimental Evolution. Molecular Biology and Evolution, 34(1), 174-184. doi:10.1093/molbev/msw210
Galaktionov, K. V., & Dobrovolskij, A. (2013). The biology and evolution of trematodes: an essay on the biology, morphology, life cycles, transmissions, and evolution of digenetic trematodes: Springer Science & Business Media.
Gautier, M. (2015). Genome-Wide Scan for Adaptive Divergence and Association with Population-Specific Covariates. Genetics, 201(4), 1555. doi:10.1534/genetics.115.181453
Gibbs, G. W. (2006). Ghosts of Gondwana: the history of life in New Zealand: Craig Potton Publishing.
Gmelin, J. (1789). Caroli a Linné Systema Naturae, vol. 1, part 3. In: Leipzig, Germany, GE Beer Publishing.
GoogleMaps (Cartographer). (2020). New Zealand


Gray, J. (1843). Catalogue of the species of Mollusca and their shells, which have hitherto been recorded as found at New Zealand, with the description of some lately discovered species. In (Vol. 2, pp. 228-265): Murray London, pp.
Grivet, D., Sebastiani, F., Alía, R., Bataillon, T., Torre, S., Zabal-Aguirre, M., . . . González-Martínez, S. C. (2010). Molecular Footprints of Local Adaptation in Two Mediterranean Conifers. Molecular Biology and Evolution, 28(1), 101-116. doi:10.1093/molbev/msq190
Günther, T., & Coop, G. (2013). Robust identification of local adaptation from allele frequencies. Genetics, 195(1), 205-220.
Hartmann, F. E., McDonald, B. A., & Croll, D. (2018). Genome-wide evidence for divergent selection between populations of a major agricultural pathogen. Molecular Ecology, 27(12), 2725-2741. doi:10.1111/mec.14711
Heads, M. (1998). Biogeographic disjunction along the Alpine fault, New Zealand. Biological Journal of the Linnean Society, 63(2), 161-176.
Hivert, V., Leblois, R., Petit, E. J., Gautier, M., & Vitalis, R. (2018). Measuring Genetic Differentiation from Pool-seq Data. Genetics, 210(1), 315-330. doi:10.1534/genetics.118.300900
Hohenlohe, P. A., Phillips, P. C., & Cresko, W. A. (2010). Using population genomics to detect selection in natural populations: key concepts and methodological considerations. International journal of plant sciences, 171(9), 1059-1071. doi:10.1086/656306
Huerta-Cepas, J., Szklarczyk, D., Forslund, K., Cook, H., Heller, D., Walter, M. C., . . . Kuhn, M. (2016). eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic acids research, 44(D1), D286-D293.
Husson, F., Josse, J., & Pagès, J. (2010). Principal Component Methods - Hierarchical Clustering - Partitional Clustering: Why Would We Need to Choose for Visualizing Data? Retrieved from Technical Report of the Applied Mathematics Department (Agrocampus) 3: http://www.sthda.com/english/upload/hcpc_husson_josse.pdf
Jost, L. (2008). GST and its relatives do not measure differentiation. Molecular Ecology, 17(18), 4015-4026.
Keinan, A., Mullikin, J. C., Patterson, N., & Reich, D. (2007). Measurement of the human allele frequency spectrum demonstrates greater genetic drift in East Asians than in Europeans. Nature genetics, 39(10), 1251-1255.
Klopfenstein, D., Zhang, L., Pedersen, B. S., Ramírez, F., Vesztrocy, A. W., Naldi, A., . . . Weigel, M. (2018). GOATOOLS: A Python library for Gene Ontology analyses. Scientific Reports, 8(1), 1-17.
Kofler, R., Orozco-terWengel, P., De Maio, N., Pandey, R. V., Nolte, V., Futschik, A., . . . Schlötterer, C. (2011). PoPoolation: a toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PloS one, 6(1), e15925.
Kofler, R., Pandey, R. V., & Schlötterer, C. (2011). PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics, 27(24), 3435-3436.
Kunsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. The annals of Statistics, 1217-1241.
Ladle, R. J., Johnstone, R. A., & Judson, O. P. (1993). Coevolutionary dynamics of sex in a metapopulation: escaping the Red Queen. Proceedings of the Royal Society of London. Series B: Biological Sciences, 253(1337), 155-160.


Lamichhaney, S., Barrio, A. M., Rafati, N., Sundström, G., Rubin, C.-J., Gilbert, E. R., . . . Andersson, L. (2012). Population-scale sequencing reveals genetic differentiation due to local adaptation in Atlantic herring. Proceedings of the National Academy of Sciences, 109(47), 19345-19350. doi:10.1073/pnas.1216128109
Lê, S., Josse, J., & Husson, F. (2008). FactoMineR: an R package for multivariate analysis. Journal of statistical software, 25(1), 1-18.
Leathwick, J. R., Elith, J., Chadderton, W. L., Rowe, D., & Hastie, T. (2008). Dispersal, disturbance and the contrasting biogeographies of New Zealand’s diadromous and non-diadromous fish species. Journal of Biogeography, 35(8), 1481-1497. doi:10.1111/j.1365-2699.2008.01887.x
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., . . . Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079.
Liggins, L., Chapple, D. G., Daugherty, C. H., & Ritchie, P. A. (2008). A SINE of restricted gene flow across the Alpine Fault: phylogeography of the New Zealand common skink (Oligosoma nigriplantare polychroma). Molecular Ecology, 17(16), 3668-3683. doi:10.1111/j.1365-294X.2008.03864.x
Lively, C. M., Dybdahl, M. F., Jokela, J., Osnas, E. E., & Delph, L. F. (2004). Host sex and local adaptation by parasites in a snail-trematode interaction. The American Naturalist, 164(S5), S6-S18.
Lively, C. M., & McKenzie, J. C. (1991). Experimental infection of a freshwater snail, Potamopyrgus antipodarum, with a digenetic trematode, Microphallus sp.
Marshall, D. C., Hill, K. B. R., Fontaine, K. M., Buckley, T. R., & Simon, C. (2009). Glacial refugia in a maritime temperate climate: Cicada (Kikihia subalpina) mtDNA phylogeography in New Zealand. Molecular Ecology, 18(9), 1995-2009. doi:10.1111/j.1365-294X.2009.04155.x
Marske, K. A., Leschen, R. A. B., Barker, G. M., & Buckley, T. R. (2009). Phylogeography and ecological niche modelling implicate coastal refugia and trans-alpine dispersal of a New Zealand fungus beetle. Molecular Ecology, 18(24), 5126-5142. doi:10.1111/j.1365-294X.2009.04418.x
Meirmans, P. G., & Hedrick, P. W. (2011). Assessing population structure: FST and related measures. Molecular ecology resources, 11(1), 5-18. doi:10.1111/j.1755-0998.2010.02927.x
Moncrieff, P. (1929). Bird migration in New Zealand. Emu-Austral Ornithology, 28(3), 215-225.
Montague, M. J., Li, G., Gandolfi, B., Khan, R., Aken, B. L., Searle, S. M., . . . Davis, B. W. (2014). Comparative analysis of the domestic cat genome reveals genetic signatures underlying feline biology and domestication. Proceedings of the National Academy of Sciences, 111(48), 17230-17235.
Morales, H. E., Faria, R., Johannesson, K., Larsson, T., Panova, M., Westram, A. M., & Butlin, R. K. (2019). Genomic architecture of parallel ecological divergence: Beyond a single environmental contrast. Science Advances, 5(12), eaav9963. doi:10.1126/sciadv.aav9963
Morgan, A. D., Gandon, S., & Buckling, A. (2005). The effect of migration on local adaptation in a coevolving host–parasite system. Nature, 437(7056), 253-256.
Mulder, N., & Apweiler, R. (2007). Interpro and interproscan. In Comparative genomics (pp. 59-70): Springer.


Neiman, M., & Lively, C. M. (2004). Pleistocene glaciation is implicated in the phylogeographical structure of Potamopyrgus antipodarum, a New Zealand snail. Molecular Ecology, 13(10), 3085-3098. doi:10.1111/j.1365-294X.2004.02292.x Nosil, P., Funk, D. J., & Ortiz‐Barrientos, D. (2009). Divergent selection and heterogeneous genomic divergence. Molecular Ecology, 18(3), 375-402. Nunez, J. C. B., Elyanow, R. G., Ferranti, D. A., & Rand, D. M. (2020). Population Genomics and Biogeography of the Northern Acorn Barnacle (Semibalanus balanoides) Using Pooled Sequencing Approaches. In M. F. Oleksiak & O. P. Rajora (Eds.), Population Genomics: Marine Organisms (pp. 139-168). Cham: Springer International Publishing. Olazcuaga, L., Loiseau, A., Parrinello, H., Paris, M., Fraimout, A., Guedot, C., . . . Gautier, M. (2020). A Whole-Genome Scan for Association with Invasion Success in the Fruit Fly Drosophila suzukii Using Contrasts of Allele Frequencies Corrected for Population Structure. Molecular Biology and Evolution, 37(8), 2369-2385. doi:10.1093/molbev/msaa098 Paczesniak, D., Jokela, J., Larkin, K., & Neiman, M. (2013). Discordance between nuclear and mitochondrial genomes in sexual and asexual lineages of the freshwater snail P otamopyrgus antipodarum. Molecular Ecology, 22(18), 4695-4710. Papadopulos, A. S. T., Baker, W. J., Crayn, D., Butlin, R. K., Kynast, R. G., Hutton, I., & Savolainen, V. (2011). Speciation with gene flow on Lord Howe Island. Proceedings of the National Academy of Sciences, 108(32), 13188-13193. doi:10.1073/pnas.1106085108 Piao, X., Hou, N., Cai, P., Liu, S., Wu, C., & Chen, Q. (2014). Genome-wide transcriptome analysis shows extensive alternative RNA splicing in the zoonotic parasite Schistosoma japonicum. BMC genomics, 15(1), 715. Pickrell, J., & Pritchard, J. (2012). Inference of population splits and mixtures from genome- wide allele frequency data. Nature Precedings, 1-1. Pickrell, J. K., & Pritchard, J. K. (2012). User Manual for TreeMix v1. 
Pillans, B., McGlone, M., Palmer, A., Mildenhall, D., Alloway, B., & Berger, G. (1993). The last glacial maximum in central and southern North Island, New Zealand: a paleoenvironmental reconstruction using the Kawakawa Tephra Formation as a chronostratigraphic marker. Palaeogeography, Palaeoclimatology, Palaeoecology, 101(3), 283-304. doi:https://doi.org/10.1016/0031-0182(93)90020-J Piratae, S., Tesana, S., Jones, M. K., Brindley, P. J., Loukas, A., Lovas, E., . . . Laha, T. (2012). Molecular characterization of a tetraspanin from the human liver fluke, Opisthorchis viverrini. PLoS Negl Trop Dis, 6(12), e1939. Pritchard, J. K., Pickrell, J. K., & Coop, G. (2010). The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Current biology, 20(4), R208-R215. Quinlan, A. R., & Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), 841-842. Rawlence, N. J., Scofield, R. P., McGlone, M. S., & Knapp, M. (2019). History Repeats: Large Scale Synchronous Biological Turnover in Avifauna From the Plio-Pleistocene and Late Holocene of New Zealand. Frontiers in Ecology and Evolution, 7(158). doi:10.3389/fevo.2019.00158 Reich, D., Thangaraj, K., Patterson, N., Price, A. L., & Singh, L. (2009). Reconstructing Indian population history. Nature, 461(7263), 489-494. Rellstab, C., Zoller, S., Walthert, L., Lesur, I., Pluess, A. R., Graf, R., . . . Gugerli, F. (2016). Signatures of local adaptation in candidate genes of oaks (Quercus spp.) with


respect to present and future climatic conditions. Molecular Ecology, 25(23), 5907- 5924. doi:10.1111/mec.13889 Shi, H., Kichaev, G., & Pasaniuc, B. (2016). Contrasting the genetic architecture of 30 complex traits from summary association data. The American Journal of Human Genetics, 99(1), 139-153. Shoemaker, C. B., Ramachandran, H., Landa, A., dos Reis, M. G., & Stein, L. D. (1992). Alternative splicing of the Schistosoma mansoni gene encoding a homologue of epidermal growth factor receptor. Molecular and biochemical parasitology, 53(1- 2), 17-32. Shulmeister, J., Thackray, G. D., Rittenour, T. M., Fink, D., & Patton, N. R. (2019). The timing and nature of the last glacial cycle in New Zealand. Quaternary Science Reviews, 206, 1-20. Smit, A., Hubley, R., & Green, P. (2015). RepeatMasker Open-4.0. 2013–2015. In. Smit, A., Hubley, R. R., & Green, P. (2008). Open-1.0. 2008–2015. In. Stölting, K. N., Paris, M., Meier, C., Heinze, B., Castiglione, S., Bartha, D., & Lexer, C. (2015). Genome-wide patterns of differentiation and spatially varying selection between postglacial recolonization lineages of Populus alba (Salicaceae), a widespread forest tree. New Phytologist, 207(3), 723-734. doi:10.1111/nph.13392 Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16), 9440-9445. Sutherland, J. L., Carrivick, J. L., Shulmeister, J., Quincey, D. J., & James, W. H. M. (2019). Ice- contact proglacial lakes associated with the Last Glacial Maximum across the Southern Alps, New Zealand. Quaternary Science Reviews, 213, 67-92. Sylvain Gandon, & Scott L. Nuismer. (2009). Interactions between Genetic Drift, Gene Flow, and Selection Mosaics Drive Parasite Local Adaptation. The American Naturalist, 173(2), 212-224. doi:10.1086/593706 Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J., & Prins, P. (2015). Sambamba: fast processing of NGS alignment formats. Bioinformatics, 31(12), 2032-2034. 
Thompson, J. N. (1994). The coevolutionary process: University of Chicago Press. Törönen, P., Medlar, A., & Holm, L. (2018). PANNZER2: a rapid functional annotation web server. Nucleic acids research, 46(W1), W84-W88. Tran, M. H., Freitas, T. C., Cooper, L., Gaze, S., Gatton, M. L., Jones, M. K., . . . Loukas, A. (2010). Suppression of mRNAs encoding tegument tetraspanins from Schistosoma mansoni results in impaired tegument turnover. PLoS Pathog, 6(4), e1000840. Trewick, S. A., & Wallis, G. P. (2001). BRIDGING THE “BEECH-GAP”: NEW ZEALAND INVERTEBRATE PHYLOGEOGRAPHY IMPLICATES PLEISTOCENE GLACIATION AND PLIOCENE ISOLATION. Evolution, 55(11), 2170-2180. doi:10.1111/j.0014- 3820.2001.tb00733.x Trewick, S. A., Wallis, G. P., & Morgan-Richards, M. (2011). The invertebrate life of New Zealand: a phylogeographic approach. Insects, 2(3), 297-325. von Linné, C., & Lange, J. J. (1760). Caroli Linnaei... systema naturae per regna tria naturae: secundum classes, ordines, genera, species: Io. Iac. Curt. Wagstaff, S. J., Heenan, P. B., & Sanderson, M. J. (1999). Classification, origins, and patterns of diversification in New ZealandCarmichaelinae (Fabaceae). American Journal of Botany, 86(9), 1346-1356. doi:10.2307/2656781 Walden, N., Lucek, K., & Willi, Y. (2020). Lineage-specific adaptation to climate involves flowering time in North American Arabidopsis lyrata. Molecular Ecology, 29(8), 1436-1451. doi:10.1111/mec.15338


Wallis, G. P., & Trewick, S. A. (2009). New Zealand phylogeography: evolution on a small continent. Molecular Ecology, 18(17), 3548-3580. doi:10.1111/j.1365- 294X.2009.04294.x Walsh, B., & Lynch, M. (2018). Evolution and selection of quantitative traits: Oxford University Press. Wardle, P. (1988). Effects of glacial climates on floristic distribution in New Zealand 1. A review of the evidence. New Zealand journal of botany, 26(4), 541-555. Warwick, T. (1952). Strains in the mollusc Potamopyrgus jenkinsi (Smith). Nature, 169(4300), 551-552. Wicke, S., Schäferhoff, B., Depamphilis, C. W., & Müller, K. F. (2014). Disproportional plastome-wide increase of substitution rates and relaxed purifying selection in genes of carnivorous Lentibulariaceae. Molecular Biology and Evolution, 31(3), 529- 545. Winterbourn, M. (1970). THENEWZEALAND SPECIES OF PO TAM O PYRG US (GASTROPODA: HYDROB IID AE). Malacologia, 10(2), 283-321. Wysoker, A., Tibbetts, K., & Fennell, T. (2013). Picard tools. In. Yap, K. W., & Thompson, R. (1987). CTAB precipitation of cestode DNA. Parasitology Today, 3(7), 220-222. Yeoh, L. M., Goodman, C. D., Mollard, V., McHugh, E., Lee, V. V., Sturm, A., . . . Ralph, S. A. (2019). Alternative splicing is required for stage differentiation in malaria parasites. Genome Biology, 20(1), 151. Yeoh, L. M., Lee, V. V., McFadden, G. I., & Ralph, S. A. (2019). Alternative splicing in apicomplexan parasites. MBio, 10(1).


Chapter 4

4: Genomic signature of local adaptation to a host population in a connected parasite population of the trematode, Atriophallophorus winterbourni

Natalia Zajac*,1,2, Frida A. A. Feijen1,2, Niklas Zemp2, Jukka Jokela1,2

1. Eawag, Swiss Federal Institute of Aquatic Science and Technology, CH-8600 Dübendorf, Switzerland
2. ETH Zurich, Department of Environmental Systems Science, Institute of Integrative Biology, CH-8092 Zurich, Switzerland

*Author for Correspondence: Natalia Zajac, ETH Zurich, Department of Environmental Systems Science, Institute of Integrative Biology, Zurich, Switzerland, +41 58 765 1122, [email protected]


Abstract

When selection is strong, local adaptation can counteract gene flow even at a fine geographic scale. Detecting the genomic signature of adaptation is then challenging, as gene flow erodes variation at neutral loci and only loci under selection remain differentiated. Adaptation under gene flow is especially relevant for parasite populations that are under strong selection to match the resistance genes of their local host. In this study, we examined the fine-scale population genetic structure of the trematode parasite Atriophallophorus winterbourni, infecting the Potamopyrgus antipodarum population of Lake Alexandrina on the South Island of New Zealand. Previous studies have suggested that the prevalence of susceptibility in the intermediate host differs when it is exposed to parasites from different sources around the lake. We used whole-genome resequencing with the Pool-Seq method to search for a signature of local divergence in the parasite population among specific locations in the lake. In our detailed analyses based on 10.3 million SNP markers, we found very little evidence for population genetic structure among the six study locations, indicating high gene flow between them. We detected an average of 2040 SNPs differentiated between any pair of sites, with a mean of 55 SNPs falling within 35 coding regions, providing a basis for local genetic divergence. In total, 130 unique genes differentiated any pair of sites, and all of these genes returned a negative Tajima’s D value at a minimum of one site where they were found to be outliers, indicating an excess of rare variants. Although these genes have functions possibly related to adaptation to the host, we believe that our search for genomic regions involved in local adaptation was generally overwhelmed by the excess of rare variants found in the population. Such an excess can result from a parasite population experiencing an ongoing expansion of diversity after a recent selective sweep. We discuss the possible explanations for such an expansion in the context of frequency-dependent coevolutionary and fluctuating selection dynamics.


Introduction

Evolutionary processes promoting local adaptation among populations rely on strong selection that is driven by spatial heterogeneity in important selective factors (Kawecki & Ebert, 2004; Leimu & Fischer, 2008). Limited gene flow among populations and strong differences in the environment or biotic interactions, leading to locally specific selection gradients, are intuitively understood to be selecting for differences in mean phenotype and trait combinations that correlate with local fitness peaks (Brown, Kann, & Rand, 2001; Fischer et al., 2013; Martin & Willis, 2007; Olazcuaga et al., 2020). Such local adaptation reinforces the neutral genetic divergence of populations under genetic drift. It is increasingly recognized that when selection is strong enough, such adaptive divergence can occur also in sympatric populations, counteracting the homogenizing effect of gene flow (Richardson, Urban, Bolnick, & Skelly, 2014). The genetic structure emerging in sympatry can result from environmentally or behaviourally driven reproductive barriers, often observed, for example, in Teleost fish or molluscs, for which high intraspecific diversity has been observed to evolve even at very fine spatial scales (Behrmann-Godel, Gerlach, & Eckmann, 2006; Dennenmoser, Vamosi, Nolte, & Rogers, 2017; Hollander, Lindegarth, & Johannesson, 2005; Lin, Quinn, Hilborn, & Hauser, 2008; Morales et al., 2019; Seehausen, Witte, Van Alphen, & Bouton, 1998). The source of local selective pressure driving divergence can often be an antagonistic adaptation to the host in host-herbivore, host-pathogen or host-parasite systems. 
Host-driven local adaptation at a fine spatial scale has been observed for the digenean trematode Schistosoma mansoni (Sire, Durand, Pointier, & Theron, 2001), the mosquito Anopheles darlingi (Campos et al., 2017), the malaria parasites Plasmodium falciparum (Lo et al., 2018) and Plasmodium mexicanum (Fricke, Vardo-Zalik, & Schall, 2010), the fungal wheat pathogen Zymoseptoria tritici (Hartmann, McDonald, & Croll, 2018) and the fungal bean parasite Colletotrichum lindemuthianum (Capelle & Neema, 2005).

For detecting the geographic scale on which reciprocal selection fuels coevolutionary dynamics between antagonistically interacting species, it is important to examine the macro- and the micro-geographic genetic structure of the interacting populations. The framework for studying how coevolution continually reshapes interactions across different temporal and spatial scales is captured in the concept of the geographic mosaic of coevolution (Thompson, 1994). The geographic mosaic idea acknowledges that multiple factors play into the interaction between the host and the parasite. Tight coevolutionary dynamics on a local scale (“hot-spots”) are expected to alternate with “cold-spots” where selection is weak, creating a selection mosaic (Thompson, 1999). Tiles in the selection mosaic can be small, allowing substantial background gene flow between populations experiencing locally divergent selection pressure (Thompson, 1999). Such gene flow can effectively erode any appreciable variation at the neutral loci. Mathematical models have shown that in such selection mosaics gene flow can even facilitate local adaptation by replenishing lost genetic diversity with rare variants that are transiently favourable and will increase in frequency (Frank, 1991, 1993; Gandon, Capowiez, Dubois, Michalakis, & Olivieri, 1996; Gandon & Nuismer, 2009).

Studying such a selection mosaic can be challenging as the signature of local divergence will not be captured by scarce neutral genomic markers and it will only be detected in experimental studies where parasites are administered to both sympatric and allopatric hosts (Kawecki & Ebert, 2004; Gandon & Nuismer, 2009). Additionally, in most study systems genes or regions in the parasite genome responsible for local adaptation are usually not known a priori. Thus, advances in the Next Generation Sequencing technologies together with population genomics offer novel opportunities to detect genomic regions defining success and failure of parasite genotypes and to pin down the genetic processes that are at the root of rapid divergence. Genome-wide sequencing studies can help in dissecting the proportion of the genome that contributes to a local adaptive process, whether it be located in a coding region (Guggisberg et al., 2018; Hartmann et al., 2018), derived from a transposable element (Casacuberta & González, 2013) or an inversion (Morales et al., 2019).

In this study, we searched for the genomic regions in a parasite population that might be diverging at a fine geographic scale. We examined the genome of a trematode parasite, Atriophallophorus winterbourni (Blasco-Costa et al., 2019), by re-sequencing parasites from specific locations within a larger connected population of a single lake, Lake Alexandrina, which is 7 km long, up to 2 km wide and up to about 27 m deep (Ward & Talbot, 1984). The parasite is known to alternate between two hosts in its life cycle: an intermediate host, Potamopyrgus antipodarum (Gray, 1843), a prosobranch snail common in most freshwaters of New Zealand (Warwick, 1952; Winterbourn, 1970), and waterfowl, the definitive hosts, including Grey Duck (Anas superciliosa (Gmelin, 1789)) and European Mallard (Anas platyrhynchos (von Linné & Lange, 1760)) (Lively & McKenzie, 1991). Previous field and laboratory studies focusing on the snail host suggest that A. winterbourni adaptation to local intermediate host populations is genotype-specific to a degree that the parasite populations adapt to specifically infect single common host genotypes, thus tracking the host genotypes in a negative frequency-dependent manner (Dybdahl & Lively, 1996; Jokela, Dybdahl, & Lively, 2009; Lively, Dybdahl, Jokela, Osnas, & Delph, 2004). The Potamopyrgus antipodarum host populations consist of both sexually and asexually reproducing individuals, a consequence of which is that a resistant asexual host genotype can increase to very high frequency.

Earlier studies with this system suggest strong local adaptation between more distant lake populations (Dybdahl & Lively, 1996; Lively et al., 2004). Within Lake Alexandrina, population genetic studies of the intermediate host suggest a depth-specific population structure but a much weaker differentiation among sites within the same habitat (Dybdahl & Lively, 1996; Fox, Dybdahl, Jokela, & Lively, 1996; Paczesniak et al., 2014). Local adaptation was thus believed to occur across a depth gradient, with gene flow within the same habitat eroding any signature of differentiation. Nevertheless, experimental studies and field surveys demonstrated that the prevalence of infection varies considerably among locations within the lake (Fox, 1995; Jokela & Lively, 1995). The most explicit demonstration of a fine-scale signature of local divergence between populations within the same habitat was the infection experiment by Gibson et al. (2016), which found that the susceptibility of local snail populations differed when snails were exposed to parasites collected from two nearby locations in the lake. The population genetic structure of the parasite has, however, remained elusive, and thus the question of within-habitat selection mosaics has not been addressed.

Here, we investigated the fine-scale parasite population genomic structure within Lake Alexandrina and the occurrence (or lack) of a genomic signature of local divergence among parasites that manage to infect the snail intermediate hosts at close-by locations in the lake. To address these questions, we sampled the parasites from infected snails at six locations around the lake and performed whole-genome resequencing of pools of parasite individuals (Pool-Seq method) for detailed population genomic analyses and for examining the local divergence patterns seen in the genome. In order to better understand the reasons for the patterns observed in the parasite population, we examined the population structure and genotype diversity of the snails from which the parasites were extracted using 20 SNP markers.

Methods

Analysis of the parasite DNA

Parasite collection and DNA extraction

P. antipodarum snails were collected in January 2019 from six shallow localities (<1.5 m) around Lake Alexandrina (Figure 1A). At each site we collected along a 500-700 m stretch of the shore by pushing a kicknet through the vegetation, which mainly consisted of willow roots protruding through the shore bank. Within two weeks of collection, the snails were transported to the Swiss Federal Institute of Aquatic Science and Technology (Eawag, Dübendorf, Switzerland), where they were kept in tanks of 200-500 snails in a flow-through system in which water circulated through tanks and filters for 12 h a day. For the remaining 12 h the water circulation was switched off and the snails were fed spirulina ad libitum (Arthrospira platensis, Spirulina California, Earthrise).

To collect parasites, each snail was dissected individually and all A. winterbourni metacercariae were isolated under 10x-20x magnification. The metacercariae were hatched into adult worms to separate the parasite from the double-walled metacercarial cyst, which contained both parasite and snail DNA (Galaktionov & Dobrovolskij, 2003). To initiate hatching, the metacercariae were incubated at 40 °C for 2-4 h in Tyrode’s salt solution supplemented with pancreatin (Sigma P3292; 0.15 g per 50 ml of Tyrode’s salt solution), 100 mg/mL Penicillin G (Fluka 13752) and 0.1 g/mL Streptomycin (Fluka 85880). For Tyrode’s salt solution we mixed Tyrode’s salts (Sigma T2145-10x1L) with 1 L of Milli-Q water and 1 g of sodium bicarbonate (NaHCO3). After all the worms had hatched to their adult stage, they were washed twice with Tyrode’s salt solution and antibiotics (100 mg/mL Penicillin G, Fluka 13752, and 0.1 g/mL Streptomycin, Fluka 85880) to remove all the remaining cysts shed by the hatched worms. From each infection exactly 200 worms were counted and transferred to a 1.5 mL tube (Eppendorf, safe lock). The worms were immersed in no more than 10 µl of washing solution. The samples were frozen in liquid nitrogen and immediately transferred to a -80 °C freezer, where they remained until further processing.

For extraction of DNA, the worms were lysed in CTAB buffer with Proteinase K (2 mg/ml) and incubated overnight at 55 °C (Yap & Thompson, 1987). DNA was isolated using chloroform:isoamyl alcohol (24:1) and precipitated with sodium acetate (3 M). The resulting pellet was washed twice with 70% ethanol. DNA was stored in RNase/DNase-free water (Sigma-Aldrich, Missouri, United States) at -20 °C until pooling and sequencing library preparation. The quality and quantity of the extracted DNA were assessed with a NanoDrop ND1000 (ThermoFisher, Waltham, Massachusetts, USA) and a Qubit 2.0 Fluorometer (dsDNA, HS, Invitrogen, Carlsbad, California, USA).

Confirming focal species

Atriophallophorus winterbourni was recently found to coexist with a rare and phenotypically very similar undescribed Atriophallophorus species (Feijen, Zajac, Vorburger, & Jokela, in prep). We therefore verified by amplification of a 16S fragment that samples were taken from the correct parasite species. All infections by the rare Atriophallophorus sp. and coinfections of Atriophallophorus sp. and A. winterbourni were removed from further analysis. A fragment of 600 bp was amplified with the Promega GoTaq® G2 DNA Polymerase kit. The primers used, Trem_16S_F1: 5’-GTACCTTTTGCATCATGA-3’ and Trem_16S_R1: 5’-TTACCTAGTTATCCCCGG-3’, were designed based on mitochondrial genomes of trematodes available on GenBank. The PCR protocol involved a 2 min initial denaturation step (95 °C), followed by 30 cycles (0.5 min denaturation at 95 °C, 1 min annealing at 47.3 °C and 1 min extension at 72 °C), and was completed with a 5 min final extension step. All samples were sent to Microsynth AG (Balgach) for sequencing. Consensus sequences were generated and further analysis was performed using Geneious 9.1.8 (Biomatters Limited).

Pool-Seq whole genome resequencing

After selecting samples that were infections by Atriophallophorus winterbourni, we combined equal quantities of DNA from twelve infections (200 worms each) into each pool, using the liquid handling station (BrandTech Scientific) available at the Genetic Diversity Center (ETH Zurich), resulting in 200-600 ng samples. The pooled DNA was sent to the Functional Genomics Center Zurich (University of Zurich, Zurich) for quality assessment with an Agilent Tape Station 4200 (Agilent, California, USA) and paired-end sequencing (PE150) on the Illumina NovaSeq 6000 platform. A single TruSeq library was constructed from the DNA using the TruSeq Nano DNA library prep kit according to Illumina protocols with a 500 bp insert size. After indexing each pool, the library was sequenced on half of an S4 flowcell. We obtained one pool for each sampling site at Lake Alexandrina.

Read mapping and SNP calling

Because the sequenced reads were of high quality (mean Phred quality score >30), no data correction was performed on the Illumina data; adapter sequences were trimmed by the Functional Genomics Center Zurich (Zurich). Paired-end reads for each population were mapped to the Atriophallophorus winterbourni reference genome (data available at: https://www.ncbi.nlm.nih.gov/nuccore/JACCGJ000000000 (Chapter 1)) using BWA MEM v0.7.17 and Sambamba v0.6.8 with default parameters, creating BAM files (Li, 2013; Tarasov, Vilella, Cuppen, Nijman, & Prins, 2015). Low-quality mappings (quality score <20) and PCR duplicates were removed with Sambamba v0.6.8 and Picard tools v2.20.2 (Tarasov et al., 2015; Wysoker, Tibbetts, & Fennell, 2013). BEDTools v2.28.0 was used to calculate per-base coverage statistics. Single nucleotide polymorphisms were called with SAMtools v1.9, creating mpileup files which were then synchronized with the Perl script mpileup2sync.pl available through the PoPoolation2 software (Kofler, Pandey, & Schlötterer, 2011; Li et al., 2009).

Previous experimental research on the Lake Alexandrina population has shown that the number of parasite genotypes per snail (number of coinfections) correlates with the prevalence of infection (Feijen, Widmer, et al., in prep). Multiple-genotype infections were found in 14% of the infected snails in the shallow-water habitat of Lake Alexandrina in 2017, a year with one of the highest infection frequencies in the past two decades (Feijen, Widmer, et al., in prep). In other lakes the rates of coinfection are unknown. Thus, with the available data we concluded that double coinfections were possible while triple coinfections were unlikely. We therefore chose a pool size of 48 (infections from 12 snails × 2 genotypes × diploid worm) with a minimum allele count of 10 (21%). The sync files were then subjected to successive levels of filtering. First, the Perl script snp-freq-diff.pl available through PoPoolation2 was used to filter out SNPs with coverage lower than 25 or higher than 200 across all pools (Kofler, Pandey, et al., 2011). Then an in-house script was used to filter out triallelic SNPs and keep biallelic SNPs with a minimum allele count of 10. The final sync files thus consisted only of biallelic SNPs fulfilling the criteria of a minimum coverage of 25 and a maximum coverage of 200 in all pools and a minimum allele count of 10 per pool. Before further analysis, SNPs from all interspersed repeats and low-complexity DNA were filtered out. The interspersed repeat and low-complexity DNA regions of the whole genome were assessed with RepeatModeler v1.0.11 and RepeatMasker v4.0.7 (Smit, Hubley, & Green, 2015; Smit, Hubley, & Green, 2008). A customized library of repetitive elements produced with RepeatModeler v1.0.11 was verified with blastx v2.3.0 to confirm that no proteins, hypothetical proteins or coding sequences were excluded. The stringent SNP filtering of the Pool-Seq data was used to eliminate false positives, which are the biggest challenge for Pool-Seq data analysis (Anand et al., 2016). Due to genome fragmentation we limited our analysis to scaffolds containing protein-coding genes.
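The filtering criteria described above can be illustrated with a minimal Python sketch. It is not the in-house script used in this study: the function names are hypothetical, coverage is assumed to be the sum of the A, T, C and G counts, and allele counts are assumed to be summed over all pools before comparison with the minimum count. It only shows how one record in the PoPoolation2 sync format (scaffold, position, reference base, then one A:T:C:G:N:del column per pool) could be tested against the thresholds quoted in the text.

```python
from typing import List, Tuple

# Thresholds quoted in the text.
MIN_COVERAGE = 25
MAX_COVERAGE = 200
MIN_ALLELE_COUNT = 10


def parse_sync_line(line: str) -> Tuple[str, int, List[List[int]]]:
    """Return scaffold, position and per-pool [A, T, C, G] counts of one sync record."""
    fields = line.rstrip("\n").split("\t")
    scaffold, position = fields[0], int(fields[1])
    # Columns 4 onwards hold one "A:T:C:G:N:del" string per pool;
    # only the four nucleotide counts are kept.
    pools = [[int(c) for c in column.split(":")[:4]] for column in fields[3:]]
    return scaffold, position, pools


def keep_site(pools: List[List[int]]) -> bool:
    """Apply the coverage and biallelic filters to one site (illustrative only)."""
    # Coverage (assumed here to be A+T+C+G) must lie within bounds in every pool.
    if any(not MIN_COVERAGE <= sum(counts) <= MAX_COVERAGE for counts in pools):
        return False
    # Assumption: an allele is counted if its total over all pools reaches the minimum count.
    totals = [sum(counts[i] for counts in pools) for i in range(4)]
    alleles = [t for t in totals if t >= MIN_ALLELE_COUNT]
    # Keep the site only if exactly two alleles pass (biallelic SNP).
    return len(alleles) == 2


record = "scaffold_1\t100\tA\t30:0:10:0:0:0\t12:0:28:0:0:0"
print(keep_site(parse_sync_line(record)[2]))
```

With a pool of 12 infections, two genotypes per infection and diploid worms, the pool holds 12 × 2 × 2 = 48 chromosomes, so a minimum allele count of 10 corresponds to roughly 21% of the pool.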

Functional Diversity

To obtain pool-specific genome-wide estimates of genetic diversity, nucleotide diversity (Tajima’s Pi, π) and the population mutation rate based on the number of segregating sites (Watterson’s theta, θ Watterson) were calculated per pool per gene with the Variance-at-position.pl script available through PoPoolation v1.2.1 (Kofler, Orozco-terWengel, et al., 2011). The mpileup files created with SAMtools v1.9 were divided into per-pool pileup files using an in-house script. The diversity estimates were calculated setting 10 as the minimum count, 25 as the minimum coverage and 200 as the maximum coverage (Kofler, Orozco-terWengel, et al., 2011). Estimating genetic diversity only in the coding regions eliminated the excess of diversity from other, less conserved regions of the genome, that is, regions not under purifying selection and thus not directly of functional interest (Fischer et al., 2017). Primarily, the analysis allowed us to establish whether any pools contained an excess of nucleotide diversity, which would lead to biased $F_{ST}$ estimates (Jost, 2008).

For inference of demographic processes acting on the whole genome we calculated Tajima’s D for each gene in the genome. The calculation of Tajima’s D was also done on pileup files with the Variance-at-position.pl script (Kofler, Orozco-terWengel, et al., 2011), with the correction method based on the work of Achaz (2008). The correction method required the pool size to be >3x the minimum coverage, which in our study was 15. The minimum count was set to 2 as required.
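To make the interpretation of these statistics concrete, the following Python sketch computes the classical, individual-based versions of nucleotide diversity, Watterson's theta and Tajima's D from allele counts at segregating sites. This is deliberately not the Pool-Seq corrected estimator implemented in PoPoolation (which additionally accounts for pool size and coverage); the function names and the example counts are hypothetical. It only illustrates why an excess of rare variants yields a negative D: rare alleles add segregating sites, and therefore raise Watterson's theta, while contributing little to pairwise diversity.

```python
import math
from typing import List


def nucleotide_diversity(allele_counts: List[int], n: int) -> float:
    """Sum of per-site pairwise differences; each count is one allele's copies among n chromosomes."""
    return sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)


def wattersons_theta(num_segregating: int, n: int) -> float:
    """Watterson's estimator: segregating sites divided by the harmonic number a1."""
    a1 = sum(1.0 / i for i in range(1, n))
    return num_segregating / a1


def tajimas_d(allele_counts: List[int], n: int) -> float:
    """Classical Tajima's D (Tajima, 1989) for n sampled chromosomes."""
    s = len(allele_counts)
    if s == 0 or n < 4:
        raise ValueError("need at least one segregating site and four chromosomes")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    difference = nucleotide_diversity(allele_counts, n) - wattersons_theta(s, n)
    return difference / math.sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical example: five sites in ten chromosomes.
print(tajimas_d([1, 1, 1, 1, 1], 10))  # all variants are singletons
print(tajimas_d([5, 5, 5, 5, 5], 10))  # all variants at intermediate frequency
```

In this toy example the first call returns a negative value and the second a positive one, matching the usual reading of Tajima's D.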


Population structure

Multiple methods were used to characterize the population structure. All analyses using R were run in R 4.0.2. Primarily, population structure was assessed with $F_{ST}$ analysis using the poolfstat v1.1.1 R package, calculating an $F_{ST}$ matrix between each pair of pools averaged across all SNPs (Hivert, Leblois, Petit, Gautier, & Vitalis, 2018). To confirm our findings, a PCA was run with the pcadapt v4.3.3 R package (Luu, Bazin, & Blum, 2017). The PCA was run using allele frequencies for all SNPs calculated with the sync_to_frequencies R function available within the R package haploReconstruct v0.1.2 (Franssen, Barton, & Schlötterer, 2016). However, the small sample sizes associated with Pool-Seq data limit the interpretation of multivariate analyses. Population structure was thus also assessed by calculating the covariance matrix between each pair of pools with the core model implemented in Baypass2.2 (Gautier, 2015). The data in sync format was converted into Baypass2.2 format using the R package poolfstat v6.1.1 (pooldata2genobaypass) (Hivert et al., 2018). The SNPs converted to genobaypass format needed a minimum read count of 10 per pool, a minimum coverage per pool of 25 and a maximum coverage per pool of 200, and could not lie in an insertion or deletion. A scaled covariance matrix Ω was calculated for 206 sets of 50,000 randomly subsampled SNPs, thereby confirming the results across all SNPs in the genome. We confirmed that the FMD distance between all Ω matrices was <1 (as recommended by the Baypass manual) using the R function fmd.dist() included in Baypass2.2 (Förstner & Moonen, 2003; Morales et al., 2019). One representative Ω matrix was then converted into a correlation matrix using the R function cov2cor and visualized with the R package corrplot v6.0.1 (Gautier, 2015). The scaled population covariance matrix reveals the neutral correlation structure between populations without being biased by outliers and is thus highly informative for demographic inference purposes (Gautier, 2015; Pickrell & Pritchard, 2012).
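The conversion of a covariance matrix into a correlation matrix mentioned above is a simple rescaling: each entry is divided by the product of the standard deviations of the two populations involved. The sketch below reproduces that operation in plain Python to show what the conversion does; the function name and the 2 × 2 matrix are hypothetical and are not taken from the Ω matrices estimated in this study.

```python
import math
from typing import List


def cov_to_cor(cov: List[List[float]]) -> List[List[float]]:
    """Rescale a covariance matrix so that its diagonal becomes 1."""
    size = len(cov)
    # Standard deviation of each population is the square root of its variance.
    sd = [math.sqrt(cov[i][i]) for i in range(size)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(size)] for i in range(size)]


# Hypothetical covariance matrix for two pools.
omega = [[4.0, 2.0],
         [2.0, 9.0]]
for row in cov_to_cor(omega):
    print(row)
```

Off-diagonal values close to 1 in the resulting matrix indicate pools whose allele frequencies covary strongly, which is the pattern expected under high gene flow.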

Detection of differentiated regions within Lake Alexandrina

Based on the observed population structure, we decided to identify all regions in the genome that were differentiated between any pair of pools within the lake. Two types of analyses were carried out to identify significant outliers.


$F_{ST}$ outliers and outlier sharing

$F_{ST}$ was measured between each pair of pools in sliding, non-overlapping windows of 2 Kb across the whole genome with a script fst-sliding.pl available through PoPoolation2 (Kofler, Pandey, et al., 2011). As linkage disequilibrium in the parasite genome is unknown, we chose windows of 2 Kb to minimize the impact of any false positive SNPs on the result. $F_{ST}$ in PoPoolation2 is calculated with the following formula:

$$F_{ST} = \frac{Pi_{total} - Pi_{within}}{Pi_{total}}$$

where $Pi = 1 - f_A^2 - f_T^2 - f_C^2 - f_G^2$, $f_N$ is the frequency of each nucleotide, $Pi_{total} = 1 - f_A^2 - f_T^2 - f_C^2 - f_G^2$ calculated across both pools and $Pi_{within} = \frac{Pi_{pool1} + Pi_{pool2}}{2}$.

This method has been shown to provide consistent results for $F_{ST}$ estimation in large populations (Hivert et al., 2018) and Morales et al. (2019) have shown that the slight bias observed by Hivert et al. (2018) in the $F_{ST}$ estimates does not influence outlier detection because it has little effect on the ranking of loci. The data was used to assess the distribution of $F_{ST}$ across the genome and to select the outliers. All pairwise comparisons were considered. For each pairwise comparison we calculated the mean, the median and the standard deviation of all $F_{ST}$ values. Outliers were identified as all windows with $F_{ST}$ values above 4 times the standard deviation in each pairwise comparison (Montague et al., 2014). All the SNPs from outlier windows were inspected with Fisher’s exact test for significance (Fischer et al., 2013; Kofler, Pandey, et al., 2011). The p-values from Fisher’s exact test were then corrected for multiple testing and false discovery rate with R package qvalue v2.20.0 (Storey & Tibshirani, 2003). All SNPs within those windows and q-value < 0.01 were further included in the analysis.
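As an illustration of the formula given above, the sketch below computes the statistic for a single site from the nucleotide counts of two pools. It is a simplified reading rather than the PoPoolation2 implementation: the function names are hypothetical, the total frequencies are assumed to be the mean of the two pools' frequencies, windows are not averaged, and any pool-size or coverage corrections applied by the software are omitted.

```python
from typing import List, Sequence


def allele_frequencies(counts: Sequence[int]) -> List[float]:
    """Convert [A, T, C, G] read counts into frequencies."""
    total = sum(counts)
    return [c / total for c in counts]


def pi_from_frequencies(freqs: Sequence[float]) -> float:
    """Pi = 1 - f_A^2 - f_T^2 - f_C^2 - f_G^2."""
    return 1.0 - sum(f * f for f in freqs)


def pairwise_fst(counts_pool1: Sequence[int], counts_pool2: Sequence[int]) -> float:
    """F_ST = (Pi_total - Pi_within) / Pi_total for one site and two pools."""
    f1 = allele_frequencies(counts_pool1)
    f2 = allele_frequencies(counts_pool2)
    pi_within = (pi_from_frequencies(f1) + pi_from_frequencies(f2)) / 2.0
    # Assumption: "across both pools" is taken as the mean frequency of the two pools.
    f_total = [(a + b) / 2.0 for a, b in zip(f1, f2)]
    pi_total = pi_from_frequencies(f_total)
    if pi_total == 0.0:
        return 0.0  # monomorphic site: no differentiation can be measured
    return (pi_total - pi_within) / pi_total


# Hypothetical counts in the order A, T, C, G.
print(pairwise_fst([30, 0, 10, 0], [10, 0, 30, 0]))
```

Two pools fixed for different nucleotides give a value of 1, and two pools with identical frequencies give 0, which bounds the range within which window-level outliers are sought.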

Baypass

Baypass2.2 is based on the principles of the Bayesian hierarchical model proposed by Coop et al. (2010), also implemented in Bayenv2 (Coop, Witonsky, Di Rienzo, & Pritchard, 2010; Gautier, 2015). Identification of overly differentiated SNPs relied on the XtX differentiation measure proposed by Günther and Coop and calculated in the Baypass2.2 core model (Günther & Coop, 2013). XtX is an $F_{ST}$-related measure based on standardized allele frequencies, corrected for neutral population structure with the previously mentioned scaled covariance matrix between populations that accounts for shared population history and sampling noise. Because Baypass2.2 calculations are very computationally intensive, we divided our allele count tables into 206 datasets of 50,000 randomly shuffled SNPs and the Ω matrix was computed for every set of random SNPs (see Population Structure methods section). The XtX statistic for each SNP is rescaled in the core model using the average posterior means of all standardized allele frequencies, accounting for the prior multivariate Gaussian distribution pulling allele frequencies of SNPs in the same population closer together (Olazcuaga et al., 2020). The rescaled XtX was used in further analysis. To provide a decision criterion discriminating between neutral and outlier markers and to calibrate the XtX statistic, we simulated pseudo-observed data sets with the simulate.baypass() function available with Baypass2.2. Pseudo-observed data are produced by sampling new observations from the core inference model with the parameters aπ and bπ and the Ω matrix fixed to the posterior means estimated with the original data. Parameters aπ and bπ define the β distribution of π, which is the weighted mean reference allele frequency interpreted as the ancestral allele frequency in the core model (Coop et al., 2010; Gautier, 2015; Pritchard, Pickrell, & Coop, 2010). We simulated a 5000 SNP data set for each set of SNPs (so a total of 1.03 million SNPs) using the posterior values of aπ and bπ and the Ω matrix estimated for each set separately. The p-value distribution of the simulated data was used for correcting the original data and removing SNPs for which the XtX statistic could not be reliably computed. Additionally, discrimination between significant outliers and neutral values of XtX was based on the 99.9% quantile of the XtX distribution of the simulated data (Gautier, 2015, 2019).
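The calibration step described above, taking the 99.9% quantile of XtX in the pseudo-observed data as the outlier threshold, can be sketched as follows. This is an illustration only and does not call Baypass: the function names are hypothetical, the inputs are assumed to be XtX values already exported from the core model, and the linear-interpolation quantile is one of several possible quantile definitions.

```python
def empirical_quantile(values, q):
    """Quantile of a sample by linear interpolation between order statistics."""
    if not values:
        raise ValueError("empty sample")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def xtx_outliers(observed_xtx, simulated_xtx, q=0.999):
    """Return the threshold from simulated data and the observed SNPs above it.

    observed_xtx: dict mapping a SNP identifier to its XtX value (hypothetical layout).
    simulated_xtx: list of XtX values from the pseudo-observed data set.
    """
    threshold = empirical_quantile(simulated_xtx, q)
    outliers = {snp for snp, value in observed_xtx.items() if value > threshold}
    return threshold, outliers
```

Because the Ω matrix was estimated separately for each of the 206 SNP sets, a per-set threshold or a single pooled threshold are both conceivable; the sketch leaves that choice to the caller.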

The XtX statistic is a covariate-free differentiation statistic and was therefore used here to detect all differentiated regions within the lake (Günther & Coop, 2013; Olazcuaga et al., 2020). The p-values for XtX are computed bilaterally to allow for identification of SNPs under balancing selection (unexpectedly low XtX values) and positive selection (unexpectedly high XtX values) (Gautier, 2019).

A SNP was considered as significantly differentiated between any sites in Lake Alexandrina if it appeared as an outlier in both outlier tests.


Outlier sharing

Additionally, we calculated the percentage of outlier sharing. SNPs identified as significant outliers in a comparison between any pair of pools were compared to outliers identified in any other pairwise comparison and the number of shared outliers was calculated. The shared proportion was expressed as the percentage of shared outliers among all the outliers identified in both focal comparisons. To compare the number of shared outlier SNPs with a random expectation we used the formula from Morales et al. (2019), modified for our study:

$$\Pr(\text{shared}) = \frac{\text{number of outlier SNPs in a comparison between a pair of pools}}{\text{Total number of SNPs}} \times \frac{\text{number of outlier SNPs in a comparison between a different pair of pools}}{\text{Total number of SNPs}}$$

We calculated the 99% confidence interval for this random expectation using the quantile function of a hypergeometric distribution (“qhyper” from the base library in R) in the following manner:

$$m = \text{number of outlier SNPs in a comparison between a pair of pools}$$

$$n = \text{Total number of SNPs} - m$$

$$k = \text{number of outlier SNPs in a comparison between a different pair of pools}$$

$$\text{Lower Confidence Interval} = qhyper(0.01, m, n, k)$$

$$\text{Higher Confidence Interval} = qhyper(0.99, m, n, k)$$

Characterization of highly differentiated regions

Annotation of significant outlier SNPs as intergenic, intronic or exonic was performed with SNPdat v1.0.5 (Doran & Creevey, 2013). The exonic SNPs were matched with their respective coding regions. Coding sequences were then annotated with NCBI BLASTP (e-value cut-off 1e-10) and were subjected to Gene Ontology enrichment analysis using GOATOOLS (Klopfenstein et al., 2018). GOATOOLS finds statistically over- and under-represented GO terms in the set of genes of interest compared to all the GO terms annotating the genome. Fisher’s exact test was used for computing uncorrected p-values. The p-values were corrected using the Bonferroni method and the data were retained if the corrected p-value was below 0.05. Annotation of the coding regions of the reference genome was performed using the Maker v2.31.9 annotation pipeline (Cantarel et al., 2008). GO annotation of the reference genome was performed using OMA (Altenhoff et al., 2019), Pannzer2 (Törönen, Medlar, & Holm, 2018) and EggNOG (Huerta-Cepas et al., 2016). For further details on the genome annotation see Chapter 1. Additionally, the differentiated genes were inspected for Tajima’s D (for how Tajima’s D was calculated see “Diversity” methods section above). Negative Tajima’s D indicates an excess of rare variants, a sign of a selective sweep and thus directional selection, but may also signal background selection (Braverman, Hudson, Kaplan, Langley, & Stephan, 1995; Dennenmoser et al., 2017; Tajima, 1989). To test whether the Tajima’s D in each pool for each gene was significantly different from zero we used a t-test against a random normal distribution with an average of zero and the standard deviation across all pools as observed in the real data (functions numpy.random.normal and stats.ttest_1samp in python3.6) (Fischer et al., 2017).
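The enrichment logic described above (a Fisher's exact test per GO term followed by Bonferroni correction) can be sketched in a few lines of Python. This is not the GOATOOLS API: the function names are hypothetical, and the sketch computes only the one-sided over-representation p-value, whereas the analysis in the text also considers under-representation.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact p-value P(X >= k).

    k: genes of interest annotated with the GO term
    n: number of genes of interest
    K: genes in the genome annotated with the GO term
    N: number of annotated genes in the genome
    """
    upper = min(n, K)
    numerator = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return numerator / comb(N, n)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return {term: min(1.0, p * n_tests) for term, p in p_values.items()}


def significant_terms(p_values, alpha=0.05):
    """GO terms whose Bonferroni-corrected p-value falls below alpha."""
    corrected = bonferroni(p_values)
    return {term for term, p in corrected.items() if p < alpha}
```

Exact integer binomials are used so that the p-value stays accurate even for a background of several thousand genes.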

Analysis of the host DNA

DNA extraction

Heads of dissected host snails were preserved in 100% ethanol until further processing. After initial evaporation of ethanol, the DNA was extracted with a KingFisher Flex automated extraction instrument (ThermoFisher Scientific, Massachusetts, United States) using a custom-designed extraction kit from LGC Genomics GmbH (Berlin, Germany). First, the dry tissue was immersed in 80 µl LGC lysis buffer PN and 8 µl LGC protease solution and incubated on a thermoshaker (800 rpm) at 60°C for 3 hours. Following incubation, 120 µl of LGC binding buffer SB and 3 µl of LGC magnetic bead suspension containing EDTA were added. The samples were then loaded into the KingFisher Flex. Samples were first thoroughly mixed and then left for 20 min at room temperature for the DNA to bind to the beads. The beads were then washed with 200 µl of LGC buffers BN1, TN1 and TN2. A final elution step involved incubation at 60°C for 10 minutes with 150 µl LGC elution buffer AMP. The DNA extracts were then stored at -80°C until further processing.


Genotyping and analysis

A set of 20 nuclear SNP loci was used for genotyping. The loci were designed by Katri Seppälä and Frida Feijen based on the P. antipodarum transcriptome (Bankers et al., 2017; Wilton, Sloan, Logsdon Jr, Doddapaneni, & Neiman, 2013). The development protocol is outlined in Paczesniak, Jokela, Larkin, and Neiman (2013). The sequences, primers and positions of the assays are compiled in Supplementary Table 2. The genotyping was performed using the Fluidigm 192.24 dynamic array genotyping chips following the SNP Genotyping Analysis User Guide (PN 68000098) (Fluidigm, California, United States). The following modifications per sample were introduced to the protocol for optimization: 1. for the STA reaction we used 3.75 µl of Qiagen 2X Multiplex PCR Master Mix, 0.75 µl of 10X SNPtype STA Primer Pool and 1.125 µl of PCR-certified water instead of 2.5 µl, 0.8 µl and 1.25 µl, respectively; 2. components of the SNPtype Assay Mixes were doubled and components of the 10X Assays were increased by 10%; 3. for preparing sample pre-mixes we used 2.7 µl of Biotium 2X Fast Probe Master Mix, 0.27 µl of 20X SNPtype Sample Loading Reagent, 0.09 µl of 60X SNPtype Reagent, 0.032 µl of ROX and 0.056 µl of PCR-certified water instead of 2.25 µl, 0.225 µl, 0.075 µl, 0.027 µl and 0.048 µl, respectively. Genotyping was performed with the Fluidigm SNP genotyping software v4.5.1 combined with manual assessment of the clusters. PCA was performed in R 4.0.2 using the R package adegenet v2.1.3 (Jombart, 2008). Afterwards the snail genotypes were compared to known repeating, and therefore assumed clonal, genotypes.

Results

Analysis of the parasite DNA

Sequence read quality

In our study we sequenced pooled samples of the trematode Atriophallophorus winterbourni from six sites in Lake Alexandrina (Figure 1A). On average 97.5% of the Illumina reads mapped to the reference genome and 48.4% of the mapped reads passed the quality filtering step. The results were similar across all 6 pools. The mean length of the reads was 150 bp, yielding a depth of coverage ranging from 59x to 82.5x for a genome size of 601.7 Mb.


The number of reads we obtained using Illumina sequencing for each of the pools, the number of reads mapping to the reference genome and the number of mapped reads passing the quality filtering (MAPQ values > 20) are compiled in Supplementary Table 1 in Chapter 3.

Figure 1. A. Location of the sampling sites in Lake Alexandrina indicated in black along the shore. The snails were sampled only from the shallow (up to 1 m) habitat. B. The average $F_{ST}$ matrix calculated with poolfstat v1.1.1 across all SNPs (in blue) and the correlation matrix based on the Ω matrix calculated with Baypass 2.2 (in purple) reflecting the neutral population structure. The darker colour on both matrices indicates greater difference between sites, the lighter colour indicates greater similarity between sites.

Diversity and demographic inference

We calculated Tajima’s Pi (π) and Watterson’s Theta (θWatterson) across all coding regions for each pool to obtain site-specific estimates of genetic diversity. We observed a very uniform result across all pools for both estimates. Over 60% of genes had values of π and θWatterson below 0.035 (Supplementary Figure 1 in Chapter 3). For all 6 pools the mean of π was 0.013 with the median varying between 0.011 and 0.012. The mean of θWatterson varied between 0.014 and 0.015 with the median of 0.012 and 0.013 respectively (Supplementary Table 2 in Chapter 3). The slightly elevated average values of θWatterson in relation to π were also reflected in the genome-wide estimates of Tajima’s D (Figure 4). The average Tajima’s D across the genome for each pool was between -1 (South East) and -0.91 (North West). 95% of the values in each pool lay between -3 and 0 (Figure 4).
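The link between these two diversity estimates and the sign of Tajima’s D can be made concrete with the classical formula of Tajima (1989) for a sample of individually sequenced chromosomes. The sketch below is only an illustration of why θWatterson exceeding π gives a negative D; it is not the pool-sequencing estimator used in this chapter, which includes corrections for pooling and coverage, and the function name is hypothetical.

```python
import math


def tajimas_d(n, seg_sites, pi):
    """Classical Tajima's D.

    n: number of sampled sequences (at least 4, otherwise the variance term is zero)
    seg_sites: number of segregating sites S in the region
    pi: mean number of pairwise differences in the region (not per site)
    """
    if n < 4 or seg_sites < 1:
        raise ValueError("need n >= 4 and at least one segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    theta_w = seg_sites / a1  # Watterson's estimator for the region
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - theta_w) / math.sqrt(variance)
```

An excess of rare variants inflates the number of segregating sites more than the pairwise diversity, so θWatterson exceeds π and D becomes negative, which is the pattern reported above.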

Population structure

We identified 10,275,310 SNPs across 7473 scaffolds with a total length of 340.4 Mb within the lake. We observed weak population structure among the six sites with the three analyses giving somewhat contrasting results. The $F_{ST}$ matrix calculated between each pair of sites across all SNPs implied the North East site to be the most (6% – 7.3%) and the Middle East site to be the least (5.4% – 6%) divergent from the other sites. The correlation matrix reflecting the structure of the Ω matrix, calculated in Baypass2.2 and depicting the neutral population structure, supported the former (0.85 ≤ corr.coef ≤ 0.87) but not the latter observation. According to the Ω matrix, the Middle West was the least differentiated site from the others (0.87 ≤ corr.coef ≤ 0.88). The consistency of the structure of the Ω matrix was confirmed across all 206 Ω matrices with the average FMD distance of 0.08 between any two covariance matrices (Supplementary Table 3) (Förstner & Moonen, 2003). However, both analyses point to a high homogeneity between the pools, also observed in the PCA, with only 1.5% of the SNPs (149,988 SNPs, alpha = 0.05) explaining all the variance in the data. For the visualisation of the PCA please refer to Supplementary Figure 1.

Detection of differentiated regions

Due to a lack of clear population structure within the lake we investigated all differentiated regions between any pair of pools. Our aim was to identify regions possibly under positive selection at any site. We applied two outlier detection methods.

Out of 123,932 windows in our $F_{ST}$ analysis, we identified between 2513 and 2729 windows to be differentiated between any pair of pools, representing the top 2% of the $F_{ST}$ distribution (Figure 2A, Supplementary Table 3). The number of outlier SNPs that passed the q-value threshold ranged between 17,003 and 22,405 and represented between 0.17% and 0.22% of all SNPs (Supplementary Table 4). In total there were 12,753 unique windows differentiated between any pair of pools encompassing 149,811 SNPs.


The analysis in Baypass2.2 was run on 8,883,813 SNPs which passed the filtering criteria applied during conversion of sync files into genobaypass files (see methods). The rescaled XtX statistic ranged from 5 to 62.9; the empirical distribution can be seen in Figure 2B. XtX was reliably computed for 6,131,902 SNPs. The 0.1% threshold for XtX value discriminating between neutrality and selection resulted in a value of 16 and 27,182 SNPs were detected above that threshold.


Figure 2. A. Distribution of the $F_{ST}$ values calculated pairwise between sites in non-overlapping 2 Kb windows over the whole nuclear genome. The maps on the left-hand side visualise the comparisons encompassed in the rightmost and leftmost peaks. The lines indicate, for each pairwise comparison, the threshold of 4x the standard deviation used for outlier selection. B. Histogram of the XtX statistic calculated in Baypass2.2 for 6,131,902 SNPs across the nuclear genome. The solid black line indicates the position beyond which the 0.1% of the outliers fall. The significance threshold was determined with a pseudo-observed dataset of 1.03 million SNPs following the same inference model as the original data.


Combining the two analyses, we found 9355 SNPs to be significantly differentiated in the lake. Specifically, between 1793 and 2197 SNPs were found to be differentiated between each pair of pools (0.017%-0.021% of all SNPs) (Supplementary Table 4). The pairwise comparisons for the outlier SNPs were specified with the $F_{ST}$ analysis. We found significant site specificity in the differentiated regions. On average 18.7% of SNPs were shared when two focal $F_{ST}$ comparisons included the same site but only 9.4% when the $F_{ST}$ comparisons did not have any site in common (Figure 3). Because the percentage of outlier SNPs in relation to all SNPs was between 0.0175% and 0.021%, we expected on average 0.000004% of SNPs to be shared at random (with a CI 0% - 0.00002% of SNPs); we therefore concluded the site specificity to be significant.
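The comparison of observed outlier sharing with the random expectation can be sketched in Python as follows. The function names are hypothetical, `qhyper` is a plain re-implementation that follows the argument order of the R function named in the methods (m, n, k) rather than a call into R, and taking the union of both outlier sets as the denominator of the shared percentage is an assumption about how "all the outliers identified in both focal comparisons" was counted.

```python
import math


def shared_percentage(outliers_a, outliers_b):
    """Percentage of shared outlier SNPs among all outliers of two comparisons."""
    union = outliers_a | outliers_b
    if not union:
        return 0.0
    return 100.0 * len(outliers_a & outliers_b) / len(union)


def expected_shared_fraction(n_a, n_b, total_snps):
    """Random expectation: product of the two outlier proportions."""
    return (n_a / total_snps) * (n_b / total_snps)


def _log_choose(n, k):
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def qhyper(p, m, n, k):
    """Smallest x with P(X <= x) >= p, X hypergeometric (m successes, n failures, k draws)."""
    lowest, highest = max(0, k - n), min(m, k)
    log_denominator = _log_choose(m + n, k)
    cumulative = 0.0
    for x in range(lowest, highest + 1):
        cumulative += math.exp(_log_choose(m, x) + _log_choose(n, k - x) - log_denominator)
        if cumulative >= p:
            return x
    return highest


def random_sharing_interval(n_a, n_b, total_snps):
    """Lower and upper bounds on the number of SNPs shared by chance."""
    m, n, k = n_a, total_snps - n_a, n_b
    return qhyper(0.01, m, n, k), qhyper(0.99, m, n, k)
```

With the smallest and largest outlier counts reported above (1793 and 2197 SNPs out of 10,275,310), `expected_shared_fraction` gives roughly 3.7e-8, that is about 0.000004%, which is several orders of magnitude below the observed sharing of 9.4% to 18.7%.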


Figure 3. Heatmap visualising the percentage of outliers out of all the outliers that are shared between each two $F_{ST}$ comparisons. The outliers are SNPs that i) were detected in the 2 Kb outlier windows in the $F_{ST}$ analysis, ii) passed the q-value threshold of 0.01 and iii) were shared between the $F_{ST}$ and XtX analyses. The circles around the values indicate a shared site in the two focal comparisons. The darker colour indicates more outlier sharing (the value is also printed on the heatmap).

Characterization of highly differentiated regions

Of the 9355 SNPs, 2426 SNPs were intronic, 263 SNPs were exonic and 6666 SNPs were intergenic. The 263 exonic SNPs were distributed across 166 genes with 1-5 outlier SNPs per gene (Supplementary Table 5). Specifically, we found 96 to 119 genes to be differentiated at each site when compared to any other site, but on average 35 coding regions encompassing on average 55 SNPs were differentiated between any pair of pools (Supplementary Table 4). Tajima’s D was calculated for 130 of the 166 genes, as not all genes passed the coverage threshold for Tajima’s D calculation with PoPoolation2, limiting the outlier analysis to between 71 and 90 genes per site. Average Tajima’s D across outlier genes varied between -0.63 (North East) and -0.71 (South East) per site (Figure 4, Supplementary Table 5). Out of 130 genes, 27 genes had missing data for at least one pool. For the remaining 103 genes we found at least one site for which Tajima’s D was significantly negative when tested against a random normal distribution (mean = 0, stdev = 0.58 as in real data). For a graphical representation of this result please refer to Supplementary Figure 2. Out of these 103 genes, 82 had a negative Tajima’s D in all sites and for 21 genes Tajima’s D was significantly negative in some, but not all, sites. We performed GO enrichment analysis for these two categories separately. The first category included genes either under background selection or under positive but divergent selection at all sites. The second category included genes selected for at some sites but not others, rendering the sites differentiated. We found 1 GO term to be significantly enriched among the 82 genes (muscle attachment, GO:0016203) and 1 GO term to be enriched among the 21 genes (fructan beta-fructosidase activity, GO:0051669).


Figure 4. Boxplot of Tajima’s D values calculated with PoPoolation v1.2.1 for all genes in the genome (11,499 genes) (dark grey) and for the outliers detected in the two outlier analyses (light grey). A gene was considered a differentiated coding region if it appeared in any comparison between that pool and any other pool (between 71 and 90 genes).

All genes were annotated with BLASTP. Out of the 130 genes, 67 did not blast to any genes with known functions. The 82 genes for which Tajima’s D was negative across all sites coded for a large number of functions related to RNA and transcription, including ATP dependent RNA helicase, pre-mRNA-splicing helicase, putative serine tRNA ligase, scavenger mRNA decapping enzyme, U3 small nucleolar RNA-associated protein, Sin3 histone deacetylase corepressor complex and E1A/CREB-binding protein, as well as DNA binding proteins such as UV excision repair protein and chromodomain helicase DNA binding protein. The majority of the functions are compiled in Table 1 (after removing unknown and hypothetical proteins and redundant functional protein annotations) and all functions are present in Supplementary Table 5. BLASTP annotations of the 21 genes for which negative Tajima’s D significantly deviated from zero in some sites but not in others indicated that the genes code for the various functions presented in Table 2 (see also Supplementary Table 5). Among those genes we found no signature of selection between neighbouring sites.


Table 1. Gene annotation results from BLASTP for genes for which Tajima’s D was negative across all sites where the genes were found as outliers. Results for which the annotation was “Unknown”, “Unnamed protein” or “Hypothetical protein” were eliminated. For each function a gene ID and the number of significant SNPs are given.

| Gene | Gene annotation with BLASTP | Significant SNPs |
|---|---|---|
| maker-agouti_scaf_945-snap-gene-4.0-mRNA-1 | Actin | 1 |
| maker-agouti_scaf_702-snap-gene-0.3-mRNA-1 | ADP-ribosylation factor GTPase activating protein 3 | 1 |
| maker-agouti_scaf_532-snap-gene-0.2-mRNA-1 | ATP dependent RNA helicase DDX27 | 2 |
| maker-jcf7180000224000-augustus-gene-0.1-mRNA-1 | beta-1 3-galactosyltransferase 1 | 1 |
| maker-agouti_scaf_945-snap-gene-4.0-mRNA-1 | BRG1/brm-associated factor 53A | 1 |
| maker-agouti_scaf_955-augustus-gene-0.0-mRNA-1 | Chromodomain helicase DNA binding protein 1 like | 1 |
| maker-jcf7180000235250-augustus-gene-0.4-mRNA-1 | Cilia and flagella-associated protein | 2 |
| maker-agouti_scaf_791-augustus-gene-0.7-mRNA-1 | CREB-binding protein | 2 |
| maker-agouti_scaf_532-snap-gene-0.2-mRNA-1 | DEAD/DEAH box RNA helicase | 2 |
| augustus_masked-agouti_scaf_314-processed-gene-0.8-mRNA-1 | Dolichyl-diphosphooligosaccharide--protein glycosyltransferase 48 kDa subunit | 1 |
| maker-agouti_scaf_791-augustus-gene-0.7-mRNA-1 | E1A/CREB-binding protein | 2 |
| snap_masked-agouti_scaf_894-processed-gene-0.3-mRNA-1 | Early growth response protein | 2 |
| maker-agouti_scaf_329-snap-gene-0.9-mRNA-1 | Glutamate or tyrosine decarboxylase | 1 |
| maker-agouti_scaf_471-snap-gene-0.4-mRNA-1 | GYF domain protein | 1 |
| augustus_masked-jcf7180000243408-processed-gene-0.1-mRNA-1 | heat shock cognate 71 kDa protein | 1 |
| maker-jcf7180000224000-augustus-gene-0.1-mRNA-1 | Hexosyltransferase | 1 |
| augustus_masked-jcf7180000243408-processed-gene-0.1-mRNA-1 | HSP7C protein | 1 |
| maker-jcf7180000271078-snap-gene-0.1-mRNA-1 | Importin-13 | 2 |
| snap_masked-jcf7180000246809-processed-gene-0.0-mRNA-1 | Kunitz/Bovine pancreatic trypsin inhibitor domain protein | 1 |
| maker-jcf7180000224000-augustus-gene-0.1-mRNA-1 | Lactosylceramide 1 | 1 |
| maker-agouti_scaf_364-snap-gene-0.3-mRNA-1 | Low-density lipoprotein receptor domain class A | 1 |
| maker-agouti_scaf_2004-snap-gene-0.1-mRNA-1 | m7GpppX diphosphatase [Clonorchis sinensis] >dbj\|GAA29110.1\| scavenger mRNA-decapping enzyme DcpS | 2 |
| augustus_masked-agouti_scaf_1109-processed-gene-0.4-mRNA-1 | melanoma receptor tyrosine-protein kinase, partial | 1 |
| maker-jcf7180000220464-augustus-gene-0.4-mRNA-1 | metabotropic glutamate receptor | 1 |
| maker-jcf7180000224000-augustus-gene-0.1-mRNA-1 | N-acetyllactosaminide 3-alpha-galactosyltransferase, partial | 1 |
| maker-jcf7180000231314-snap-gene-0.5-mRNA-1 | Nuclear hormone receptor family member nhr-41 | 1 |
| augustus_masked-agouti_scaf_314-processed-gene-0.8-mRNA-1 | oligosaccharyltransferase complex subunit beta | 1 |
| maker-agouti_scaf_81-augustus-gene-0.1-mRNA-1 | Oxysterol-binding protein | 1 |
| maker-agouti_scaf_471-snap-gene-0.4-mRNA-1 | PERQ amino acid-rich with GYF domain-containing protein 2 | 1 |
| maker-agouti_scaf_81-augustus-gene-0.1-mRNA-1 | PH domain protein | 1 |
| snap_masked-jcf7180000260501-processed-gene-0.2-mRNA-1 | pre-mRNA-splicing helicase BRR2 | 2 |
| augustus_masked-agouti_scaf_989-processed-gene-0.1-mRNA-1 | Protein lin-54 | 1 |
| maker-agouti_scaf_532-snap-gene-0.2-mRNA-1 | putative dead box ATP-dependent RNA helicase | 2 |
| maker-agouti_scaf_955-augustus-gene-0.0-mRNA-1 | putative helicase | 1 |
| maker-jcf7180000220464-augustus-gene-0.4-mRNA-1 | putative metabotropic glutamate receptor 2, 3 (mglur group 2) | 1 |
| augustus_masked-agouti_scaf_1109-processed-gene-0.4-mRNA-1 | putative receptor Tyrosine Kinase | 1 |
| maker-agouti_scaf_1942-snap-gene-0.3-mRNA-1 | putative serine--tRNA ligase | 1 |
| augustus_masked-agouti_scaf_1109-processed-gene-0.4-mRNA-1 | receptor L domain protein, partial | 1 |
| maker-jcf7180000222614-augustus-gene-0.17-mRNA-1 | RNA cap guanine-N2 methyltransferase | 3 |
| maker-agouti_scaf_2004-snap-gene-0.1-mRNA-1 | scavenger mRNA decapping enzyme | 2 |
| snap_masked-jcf7180000260501-processed-gene-0.2-mRNA-1 | Sin3 histone deacetylase corepressor complex component SDS3 | 2 |
| maker-jcf7180000276262-augustus-gene-0.4-mRNA-1 | Sperm-associated antigen 1 | 1 |
| maker-agouti_scaf_329-snap-gene-0.9-mRNA-1 | Sphingosine-1-phosphate lyase | 1 |
| maker-agouti_scaf_945-snap-gene-4.0-mRNA-1 | SWI/SNF nucleosome remodeling complex component | 1 |
| augustus_masked-agouti_scaf_989-processed-gene-0.1-mRNA-1 | Tesmin/TSO1-like CXC domain protein | 1 |
| maker-agouti_scaf_1942-snap-gene-0.3-mRNA-1 | Tetratricopeptide repeat protein 27 | 1 |
| maker-jcf7180000222614-augustus-gene-0.17-mRNA-1 | TGS1 | 3 |
| maker-jcf7180000222614-augustus-gene-0.17-mRNA-1 | Trimethylguanosine synthase | 3 |
| snap_masked-jcf7180000246809-processed-gene-0.0-mRNA-1 | trypsin inhibitor-like | 1 |
| maker-jcf7180000221359-snap-gene-0.2-mRNA-1 | Tubulin/FtsZ family, GTPase domain protein, partial | 2 |
| augustus_masked-agouti_scaf_1109-processed-gene-0.4-mRNA-1 | Tyrosine-protein kinase transforming protein erbB, partial | 1 |
| maker-jcf7180000242579-snap-gene-0.0-mRNA-1 | U3 small nucleolar RNA-associated protein 11 | 1 |
| maker-agouti_scaf_1049-augustus-gene-0.6-mRNA-1 | Ultrabithorax | 1 |
| maker-jcf7180000233421-snap-gene-0.0-mRNA-1 | UV excision repair protein RAD23 | 1 |
| maker-agouti_scaf_32-snap-gene-0.0-mRNA-1 | Vasohibin 1 | 1 |
| maker-jcf7180000271931-augustus-gene-0.7-mRNA-1 | X-box-binding protein 1 | 2 |
| maker-jcf7180000234455-snap-gene-0.1-mRNA-1 | zinc finger BED domain-containing protein 1-like | 3 |
| maker-jcf7180000231314-snap-gene-0.5-mRNA-1 | zinc finger, C4 type, partial | 1 |


Table 2. Gene annotation results from BLASTP and Tajima’s D values for genes for which Tajima’s D significantly deviated from 0 towards negative values in some sites but not others. Significance (p-value < 0.05) is indicated with an asterisk (*). The sites for which a gene was not found to be an outlier are indicated with NA. For each function a gene ID and the number of significant SNPs are given.

| Gene name | Gene annotation with BLASTP | Significant SNPs | South West Tajima's D | Middle East Tajima's D | Middle West Tajima's D | North East Tajima's D | North West Tajima's D | South East Tajima's D |
|---|---|---|---|---|---|---|---|---|
| maker-jcf7180000224133-snap-gene-0.1-mRNA-1 | Unknown | 3 | -0.73* | -0.07 | 0.21 | -1.16* | -0.44* | -1.15* |
| maker-jcf7180000227391-snap-gene-0.1-mRNA-1 | Unknown | 1 | -0.28* | -0.40* | 0.24 | -0.10* | -0.03 | -0.14* |
| augustus_masked-agouti_scaf_375-processed-gene-0.2-mRNA-1 | Unknown | 1 | NA | 0.05* | 0.04 | NA | NA | NA |
| maker-jcf7180000220639-augustus-gene-0.1-mRNA-1 | Dynein molecular motor protein light chain | 2 | -0.05 | -1.47* | -1.55* | 1.56 | -0.41* | -0.89* |
| augustus_masked-jcf7180000219668-processed-gene-0.1-mRNA-1 | Unknown | 4 | NA | 0.19* | NA | -0.39* | -0.08 | -0.37 |
| maker-agouti_scaf_1027-snap-gene-0.7-mRNA-1 | Unknown | 1 | NA | -0.27* | -0.03 | -0.52* | 0.10 | 0.00 |
| maker-jcf7180000252333-snap-gene-0.1-mRNA-1 | Unknown | 1 | -0.62* | 0.11 | 0.36 | NA | -0.11* | -0.02 |
| maker-jcf7180000231822-snap-gene-0.5-mRNA-1 | Unknown | 1 | NA | -0.21* | NA | -0.12* | 0.03 | 0.23 |
| maker-jcf7180000276795-snap-gene-0.1-mRNA-1 | Unknown | 1 | NA | NA | NA | 0.02 | -0.01 | -0.11* |
| maker-jcf7180000273606-snap-gene-0.1-mRNA-1 | Calreticulin | 1 | -0.74* | -0.89* | -0.25* | 0.41 | -0.46* | 0.34 |
| maker-jcf7180000220464-snap-gene-0.3-mRNA-1 | Dehydrogenase/reductase | 1 | 0.25* | 0.89 | NA | NA | NA | NA |
| snap_masked-agouti_scaf_563-processed-gene-0.6-mRNA-1 | Flap endonuclease 1 | 3 | NA | -0.74* | 0.22 | -0.83* | | |
| maker-jcf7180000242179-augustus-gene-0.0-mRNA-1 | Vacuolar protein sorting-associated protein 37B | 1 | -0.53* | -0.26* | -0.27* | -0.14* | -0.05 | 0.06 |
| maker-agouti_scaf_1511-snap-gene-0.12-mRNA-1 | protein phosphatase 1 regulatory subunit | 2 | 0.04 | -0.27* | -0.43* | 0.28 | -0.55* | -0.52* |
| augustus_masked-jcf7180000238822-processed-gene-0.0-mRNA-1 | peptidyl-prolyl cis-trans isomerase A-like isoform X2 | 1 | -1.37* | -1.16* | -0.65* | 0.15 | -1.17* | |
| maker-jcf7180000236889-snap-gene-0.2-mRNA-1 | beta-1,3-galactosyl-O-glycosyl-glycoprotein beta-1,6-N-acetylglucosaminyltransferase | 4 | 0.06 | 0.79 | -0.14* | -0.31* | -0.04 | -0.41* |
| maker-agouti_scaf_1012-snap-gene-0.1-mRNA-1 | Epididymal secretory protein isoform 2 | 1 | 0.03 | -0.52* | -0.63* | -0.59* | -0.79* | -0.52* |
| maker-jcf7180000245945-snap-gene-0.0-mRNA-1 | ATP-citrate synthase | 3 | NA | NA | 0.07 | 0.10 | 0.07 | -0.12* |
| maker-jcf7180000256955-augustus-gene-0.1-mRNA-1 | N-acetyltransferase 9 | 1 | 0.03 | -0.10* | 0.09 | -0.08 | 0.34 | 0.22 |
| maker-jcf7180000233173-augustus-gene-0.1-mRNA-1 | Transposase | 2 | -0.36* | 0.26 | -0.47* | -0.47* | -1.01* | -0.35* |
| maker-jcf7180000226987-snap-gene-0.6-mRNA-1 | Ubiquinone biosynthesis protein | 1 | -0.04 | NA | NA | 0.25 | -0.52* | 0.29 |


Analysis of the host DNA

Population structure

We genotyped 10 to 12 host snails per site, 69 snails in total. Genotyping failed for two snails from the South West site and one from the Middle East site. PCA of the host genotypes showed an absence of structure within the lake, with all the sites clustering together (Figure 5). Moreover, genotype assessment indicated that the snails represented a diverse set of genotypes with no genotype repeating (Supplementary Table 6). Interestingly, the tight clustering of the genotypes from the North East site in the PCA suggests the North East population to be the least diverse.

Figure 5. Principal component analysis of snail genotypes based on 20 SNP markers performed with R package adegenet v2.1.3. PC1 and PC2 explained 41% and 34% of variance in the data respectively. The ellipses show 95% confidence interval.

Discussion

In this study we searched for a genomic signature of local divergence in the population of the trematode parasite Atriophallophorus winterbourni in Lake Alexandrina. We examined the population genomic structure of the parasite at different locations around the lake.


Previous studies have documented a strong genetic structure in the snail P. antipodarum, the intermediate host of A. winterbourni, among three adjacent depth-specific habitats of the lake but a much weaker within-habitat genetic structure among the different locations of the lake (Fox et al., 1996; Paczesniak et al., 2014). Here, we corroborate these earlier results with an analysis of the genetic structure of infected hosts around the lake. We expected to find high connectivity and admixture in the parasite population among all our study sites due to the strong winds that mix the water column of the lake and due to the movement of the final host, the waterfowl, generating ample gene flow within the shallow habitat. Nevertheless, in a recent study Gibson et al. (2016) examined the local differences in susceptibility of snail hosts between sites, motivated by the persistent differences in prevalence of infection by A. winterbourni observed among shallow-water locations within the same habitat. In one of their three infection experiments they used parasites extracted from infected snails representing the very south of South West and Middle East parasite sources examined in our study (see Figure 1A). Their results show a gradient in susceptibility to infection that matched the general pattern of variation in prevalence of infection inferred from a time-series spanning over 10 years, where sites with susceptible host genotypes had high average prevalence of infection (Gibson et al., 2016). The most striking result in their study was that the two parasite sources, although most likely representing a highly connected population, showed differences in infectivity to snails from different shallow-water sites around the lake, suggesting that the two parasite sources were somewhat locally diverged with respect to genes responsible for matching the local snail populations.
This discovery motivated our study to search for a fine-scale local signature of divergence in the parasite population using whole-genome data. We reasoned that, because the earlier studies were based on neutral genetic markers, they may not have been sufficiently powerful to reveal the within-habitat divergence in genomic regions under strong local selection. Advances in whole-genome resequencing technologies, and especially sequencing of pooled DNA samples of a large number of individuals, allowed us to compare population samples at a genomic scale and address questions of selection affecting only certain parts of the genome (Kofler, Pandey, et al., 2011). As we, and Gibson et al. (2016), predicted, our data provide weak evidence for a genome-wide signal of divergence among the pools of parasite individuals originating from different locations around the lake. While we did observe non-negligible levels of differentiation between sites, with the $F_{ST}$ varying between 5.4% and 7.3% and the correlation between sites being between 0.86 and 0.88, the three independent methods we used for assessment of differentiation did not show consistent results and did not confirm any consistent geographic pattern except for minor differences observed for the North East site. We did not discover greater connectivity between neighbouring sites and only 0.83% of the 10.3 million SNPs explained the geographic structure we observed. We expected our allele frequency estimates to be robust because we pooled a large number of genotypes and we used a high sequencing coverage (59x-82.5x), additionally amplified by the number of clonal individuals representing the same genotype. Interestingly, the high total number of SNPs we found in the parasite genome and the negative shift in the Tajima’s D distribution can be taken to suggest an elevated level of low frequency variants in this parasite population. Assuming that our stringent filtering process removed the majority of false positive SNPs, our results suggest that the parasite population in the lake is, in fact, experiencing a population expansion after a recent bottleneck (Dennenmoser et al., 2017; Mobegi et al., 2014). Earlier studies on coevolutionary dynamics, in this and other systems, have not focused on population level consequences of the predicted evolutionary dynamics. Crash-expansion dynamics of a parasite population after the collapse of a previously common host genotype have been beyond the scope of earlier studies. One study, however, reports rapid changes in frequencies of common host genotypes, without inferring the consequences for parasite population size (Jokela et al., 2009). These negative frequency dynamics predict that when a common host genotype is rendered susceptible, the successfully infecting parasite genotype increases in frequency (Jokela et al., 2009).
This in turn creates a signature of a selective sweep in the population and causes a decrease of population diversity, especially within the selected and linked genomic regions (Ignacio-Espinoza, Ahlgren, & Fuhrman, 2020; Papkou et al., 2019). According to this hypothesis, after the emergence of a resistant host genotype, the parasite population starts decreasing in frequency but experiences an expansion in genotype diversity (Jokela et al., 2009; Papkou et al., 2019). If this is true, the previously selected sites are not advantageous anymore. Our results are consistent with the genomic pattern expected when a parasite population expands after a recent bottleneck. One explanation for such a bottleneck would be the adaptation to the new most common host genotype. In 2007 a single snail clone (clone number ‘535’) was reported to cover 5% of the host population within the shallow habitat (Paczesniak et al., 2014). Over the years it has been observed to rise in frequency, becoming the dominant genotype in the lake. Lab experiments have shown the clone to be almost fully resistant to the parasite (Paczesniak et al., 2019). Interestingly, Gibson et al. (2016) made their last discoveries around 4 parasite generations before our study, in 2015. Average parasite prevalence in the lake decreased between the collections of Gibson et al. (2016) (4.5% - 36%, average between 2013 and 2015) or the previously mentioned collections from 2017 (4% - 50% prevalence of infection, Supplementary Table 1) and our collections (0% - 24%). From our analysis we inferred that the population of parasites we have sampled is highly diverse and we speculated it might be infecting the frequently recombining, sexually reproducing snails and the rare clones which do not differ in susceptibility from the sexual snails (Jokela, Lively, Fox, & Dybdahl, 1997). Examination of the snail population confirmed our predictions. All snail genotypes in this study were unique and did not match known clonal genotypes; they thus represented a highly diverse set of genotypes and exhibited no population structure within the lake, which is most often caused by differences in abundance of clonal genotypes between sites. We used two methods, the FST and the XtX statistics, to search for genomic regions differentiated between any pair of sites, potentially subject to selection. A SNP was considered as significantly differentiated between any sites in Lake Alexandrina if it appeared as an outlier in both outlier tests. This approach is conservative and aims to correct technical bias in outlier calling, because SNPs found in the outlier tail of the FST distribution show a strong bias towards loci with lower coverages whereas SNPs in the tails of the XtX statistic are enriched in high coverage positions (Günther & Coop, 2013).
Using both tests thus minimizes the bias towards extremely high or extremely low coverages (Günther & Coop, 2013). Additionally, the similar results obtained for the two nucleotide diversity measures across all pools, Watterson's θ and π, indicated unbiased FST estimates (Fischer et al., 2017; Jost, 2008). We found between 1793 and 2197 SNPs to be significantly differentiated, highly scattered across 810 to 947 windows of 2 Kb. Between 85 and 117 of the significant outlier SNPs were located in 71 to 90 coding regions with only 1 to 5 SNPs per gene. We thus observed no signature of clustered locus-specific reduced gene flow. Additionally, we found the divergence to be significantly site specific with no correlated observations between neighbouring sites. Scattering of differentiated regions can be a signature of several different processes. Low linkage disequilibrium has been shown to result in scattered divergence and lack of islands of differentiation (Yeaman, 2013). However, our previous analyses of divergent genomic regions between different lakes on the South Island of New Zealand found the regions to be clustered (Chapter 3). At low effective population sizes genetic drift might have a strong effect on differences in mutation accumulation between the sites (Tremblay & Ackerman, 2001). However, we would expect purifying selection at functionally important genomic regions to be counteracting this process and gene flow to be eroding variation due to drift at neutral loci. Additionally, the effective population size of the parasite is most likely high due to the very high population size of the intermediate host. A large number of differentiated loci but with a weak signal of differentiation would also be expected in polygenic adaptation (Pritchard et al., 2010). The variants detected could have a cumulative effect on a trait under selection. Gene Ontology enrichment analysis would then show the loci to be significantly over-representing a certain function.
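
The conservative criterion described above, under which a SNP counts as significantly differentiated only if both the FST and the XtX test flag it, amounts to intersecting two outlier sets. The following is a minimal sketch of that step only. The function name, the (contig, position) identifiers and the example values are hypothetical and are not taken from the analysis pipeline used in this chapter.

```python
def shared_outliers(fst_outliers, xtx_outliers):
    """Return the SNPs flagged by both tests, sorted by (contig, position)."""
    return sorted(set(fst_outliers) & set(xtx_outliers))


# Hypothetical SNP identifiers given as (contig, position) tuples.
fst_hits = [("contig_1", 1042), ("contig_1", 5310), ("contig_7", 88)]
xtx_hits = [("contig_1", 5310), ("contig_7", 88), ("contig_9", 12)]

# Only SNPs present in both lists are retained.
both = shared_outliers(fst_hits, xtx_hits)
print(both)
```

Because each test is biased towards a different coverage extreme, keeping only the intersection discards SNPs that are flagged by one test alone.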

In order to distinguish between directional selection and other explanations for divergence, we examined Tajima’s D at the outlier loci and inspected their functions. Due to the varying number of loci differentiated between the sites we expected the differentiation to be a result of different selection pressures. We found two categories of differentiated genes: those with negative Tajima’s D values at all sites where they were identified as outliers and those with negative Tajima’s D values in some sites but not in others. In total there were 82 loci in the former category. We concluded the genes to be either under directional selection at the different sites in opposite directions, resulting in the observed differentiation, or a result of background selection. The latter could be due to the observed genome wide shift in Tajima’s D distribution towards negative values (Dennenmoser et al., 2017). However, because many of the genes are implicated in gene expression and structural functions (muscle attachment), it is possible that the genes are under selection, perhaps imposed by the host. We found all 21 genes from the latter category to be implicated in functions related to energy processes and to have been linked to a parasitic lifestyle in previous studies. Calreticulin, a calcium binding protein, has been implicated in functions such as modulation of gene expression and recognition of extracellular stimuli, for example in trypanosome parasites (Ferreira et al., 2004), or parasite immunity, inhibiting host hemocyte spreading and preventing the hemocyte encapsulation process in Schistosoma mansoni (Nakhasi et al., 1998). Dynein light chains are also implicated in calcium homeostasis in the parasites; they participate in absorption and secretion processes through the tegument and have been shown to act as tegument associated antigens (Githui, Damian, Aman, Ali, & Kamau, 2009; Hoffmann & Strand, 1996; Young, Hall, Jex, Cantacessi, & Gasser, 2010). Others, such as protein phosphatases or ubiquinone biosynthesis proteins, have been extensively characterized in other digenean trematodes for their regulatory functions, such as growth, development and cell division, and have been suggested as potential drug targets (Halton, 1967; Umezurike & Anya, 1980). Most of the functions of the loci detected in our study have been shown to be the functions of gene families that have evolved through large scale duplication in A. winterbourni since the split from the Opisthorchiata suborder (Chapter 1). Thus, it is possible that the source of the selection pressure on these loci is also the intermediate host. What could be causing differences in host populations between the sites, leading to the pattern that we see, would thus be a potential question for future research. However, it must also be taken into account that the high diversity across the genome might have overwhelmed our power to detect locally adaptive gene variants.
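
To make the interpretation of negative Tajima’s D values above concrete, the sketch below computes the classical statistic of Tajima (1989) from the number of sequences, the number of segregating sites and the mean number of pairwise differences. It is an illustration under stated assumptions: the formula applies to individually sequenced samples, whereas pooled data such as ours require estimators corrected for pooling. The function name and the example numbers are hypothetical and are not results from this study.

```python
import math


def tajimas_d(n, seg_sites, pi):
    """Classical Tajima's D (Tajima, 1989).

    n         -- number of sampled sequences
    seg_sites -- number of segregating sites (S)
    pi        -- mean number of pairwise differences
    """
    if n < 4 or seg_sites < 1:
        # For n < 4 the variance term is zero and D is undefined.
        raise ValueError("need at least 4 sequences and 1 segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    # Difference between the two diversity estimates, scaled by its SD.
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Illustrative values only: many segregating sites but low pairwise diversity.
d_example = tajimas_d(n=10, seg_sites=16, pi=3.888889)
print(round(d_example, 3))
```

A negative value arises when the pairwise diversity is lower than expected from the number of segregating sites, that is, when low frequency variants are in excess, as expected after a population expansion or a selective sweep.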

We also found some loci among the outlier loci to have a slightly positive Tajima’s D or Tajima’s D values not significantly deviating from zero at each site. It has to be taken into account that FST analysis does not point only towards loci of positive selection and the XtX analysis finds all differentiated regions in the lake (Gautier, 2015; Olazcuaga et al., 2020). Combining the two analyses to find site specific regions does not fully eliminate loci differentiated for reasons such as genetic drift or balancing selection. The XtX statistic can be used to detect loci under balancing selection. However, we have found no studies that have done so, and therefore we could not reliably establish a threshold for exceptionally low values of XtX. Consequently, we decided against studying loci under balancing selection. However, future studies should address such loci, as it has been previously shown that loci involved in antigenic variation can be evolving under balancing selection (Amambua-Ngwa et al., 2012).

In conclusion, our study emphasizes the importance of considering all levels of population structure in studying species whose population interconnectedness highly depends on mobility of host populations. Our results stress the importance of not only spatial but also temporal population studies when studying parasite populations implicated in Red Queen dynamics with their hosts, as fluctuating selection can obscure loci under selection. In our study we capture only one moment in time and we speculate that, at the particular moment we sampled, the population was expanding after a recent shift in frequency of resistant host genotypes. Future studies on coevolutionary dynamics will be more powerful if they integrate quantitative ecological studies on population dynamics of both the host and the parasite, as those would allow addressing the effective population size of the host and the parasite experiencing crash-expansion dynamics driven by frequency-dependent fluctuating selection on infectivity and resistance genotype frequencies that segregate in the coevolving populations.

Acknowledgments

We thank Julia Vrtilek for help with the collection of samples in New Zealand, Katri Seppälä for developing the parasite hatching protocol and help with the hatching of the parasites. We thank Kaja Widmer for genotyping the snails. We also thank Hernán Eduardo Morales Villegas, Hélène Boulain, Martin C. Fischer and Torsten Günther for helpful suggestions during the analysis. The research was funded by an ETH grant ETH-36 15-2 obtained by Jukka Jokela and Hanna Hartikainen.

Supplementary Information

Table of contents:

1. Supplementary Figures
   a. Supplementary Figure 1. Results of principal component analysis in pcadapt v4.3.3 using allele frequency data of all SNPs detected in the nuclear genome. The top plot shows PC1 plotted against PC2 and the bottom plot shows PC1 plotted against PC3. All three axes together explain 64.5% of the variance in the data.
   b. Supplementary Figure 2. A boxplot for Tajima’s D distribution across all sites for each of the 130 genes.
2. Supplementary Tables
   a. Supplementary Table 1. Infection frequency per site around the lake from 2019 and 2017 obtained from a random sample of Potamopyrgus antipodarum from the shallow habitat.
   b. Supplementary Table 2. A collection of 20 SNP markers that were used to genotype the snail (Potamopyrgus antipodarum) population. The last column indicates the transcriptome used to design the SNPs. The different primers were used at different stages of the genotyping (see methods).
   c. Supplementary Table 3. The FMD distances between any two Ω matrices calculated in Baypass2.2. The Ω matrices were calculated for each set of 50,000 random SNPs across the genome (together comprising all SNPs in the genome). The FMD distance was calculated with the fmd.dist() function in Baypass2.2 and reflects the similarity between the matrices, with values <1 indicating high similarity.
   d. Supplementary Table 4. The details of the FST outlier analysis including the total number of 2kb windows calculated per each pairwise comparison, the total number of SNPs in the analysis, the number of windows in each comparison passing the 4x standard deviation threshold, the number of SNPs in those windows that passed the q-value threshold of 0.01, the number of SNPs that matched the outlier analysis in Baypass2.2, the FST windows encompassing these SNPs, the number of SNPs shared with Baypass2.2 in coding regions and the number of those coding regions.
   e. Supplementary Table 5. Information on all coding regions found to be outliers for each site. The table contains the gene ID, the BLASTP annotation results, the site where the gene was found to be an outlier when compared to any other site, the total number of SNPs for that gene and the number of SNPs significantly differentiated (using the two outlier methods), Tajima's D results and the p-value results of comparison of Tajima's D for that gene and for that site to a random normal distribution. SE - South East, ME - Middle East, NE - North East, SW - South West, MW - Middle West, NW - North West.
   f. Supplementary Table 6. Genotypes of 69 infected P. antipodarum snails used in the study.
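
The window-level step listed for Supplementary Table 4 can be illustrated with a short sketch. It assumes that the "4x standard deviation threshold" means a window is retained when its FST exceeds the mean of all windows by more than four standard deviations; this reading, the function name and the example values are assumptions for illustration, and the subsequent q-value filtering of SNPs is not shown.

```python
from statistics import mean, pstdev


def outlier_windows(window_fst, n_sd=4.0):
    """Return windows whose FST exceeds mean + n_sd * SD of all windows."""
    values = list(window_fst.values())
    cutoff = mean(values) + n_sd * pstdev(values)
    return {name: fst for name, fst in window_fst.items() if fst > cutoff}


# Hypothetical windowed FST values: 100 background windows and one extreme window.
windows = {f"win_{i}": 0.05 for i in range(100)}
windows["win_100"] = 0.90

flagged = outlier_windows(windows)
print(sorted(flagged))
```

With few windows no value can exceed such a strict cutoff, so a threshold of this kind is only meaningful when many windows are compared, as in a genome-wide scan.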

Supplementary Figures

Supplementary Figure 1. Results of principal component analysis in pcadapt v4.3.3 using allele frequency data of all SNPs detected in the nuclear genome. The top plot shows PC1 plotted against PC2 and the bottom plot shows PC1 plotted against PC3. All three axes together explain 64.5% of the variance in the data.

Supplementary Figure 2. A boxplot for Tajima’s D distribution across all sites for each of the 130 genes.

References

Achaz, G. (2008). Testing for neutrality in samples with sequencing errors. Genetics, 179(3), 1409-1424. doi:10.1534/genetics.107.082198
Altenhoff, A. M., Levy, J., Zarowiecki, M., Tomiczek, B., Vesztrocy, A. W., Dalquen, D. A., . . . Dylus, D. (2019). OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome research, 29(7), 1152-1163.
Amambua-Ngwa, A., Tetteh, K. K., Manske, M., Gomez-Escobar, N., Stewart, L. B., Deerhake, M. E., . . . Knuepfer, E. (2012). Population genomic scan for candidate signatures of balancing selection to guide antigen characterization in malaria parasites. PLoS Genet, 8(11), e1002992.
Anand, S., Mangano, E., Barizzone, N., Bordoni, R., Sorosina, M., Clarelli, F., . . . De Bellis, G. (2016). Next Generation Sequencing of Pooled Samples: Guideline for Variants’ Filtering. Scientific Reports, 6(1), 33735. doi:10.1038/srep33735
Bankers, L., Fields, P., McElroy, K. E., Boore, J. L., Logsdon Jr, J. M., & Neiman, M. (2017). Genomic evidence for population-specific responses to co-evolving parasites in a New Zealand freshwater snail. Molecular Ecology, 26(14), 3663-3675.
Behrmann-Godel, J., Gerlach, G., & Eckmann, R. (2006). Kin and population recognition in sympatric Lake Constance perch (Perca fluviatilis L.): can assortative shoaling drive population divergence? Behavioral Ecology and Sociobiology, 59(4), 461-468.
Blasco-Costa, I., Seppälä, K., Feijen, F., Zajac, N., Klappert, K., & Jokela, J. (2019). A new species of Atriophallophorus Deblock & Rosé, 1964 (Trematoda: Microphallidae) described from in vitro-grown adults and metacercariae from Potamopyrgus antipodarum (Gray, 1843)(Mollusca: Tateidae). Journal of helminthology, 94, e108.
Braverman, J. M., Hudson, R. R., Kaplan, N. L., Langley, C. H., & Stephan, W. (1995). The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics, 140(2), 783-796.
Brown, A. F., Kann, L. M., & Rand, D. M. (2001). Gene flow versus local adaptation in the northern acorn barnacle, Semibalanus balanoides: insights from mitochondrial DNA variation. Evolution, 55(10), 1972-1979. doi:10.1111/j.0014-3820.2001.tb01314.x
Campos, M., Conn, J. E., Alonso, D. P., Vinetz, J. M., Emerson, K. J., & Ribolla, P. E. M. (2017). Microgeographical structure in the major Neotropical malaria vector Anopheles darlingi using microsatellites and SNP markers. Parasites & Vectors, 10(1), 1-8.
Cantarel, B. L., Korf, I., Robb, S. M., Parra, G., Ross, E., Moore, B., . . . Yandell, M. (2008). MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome research, 18(1), 188-196.
Capelle, J., & Neema, C. (2005). Local adaptation and population structure at a micro-geographical scale of a fungal parasite on its host plant. Journal of Evolutionary Biology, 18(6), 1445-1454.
Casacuberta, E., & González, J. (2013). The impact of transposable elements in environmental adaptation. Molecular Ecology, 22(6), 1503-1517. doi:10.1111/mec.12170
Coop, G., Witonsky, D., Di Rienzo, A., & Pritchard, J. K. (2010). Using environmental correlations to identify loci underlying local adaptation. Genetics, 185(4), 1411-1423.
Dennenmoser, S., Vamosi, S. M., Nolte, A. W., & Rogers, S. M. (2017). Adaptive genomic divergence under high gene flow between freshwater and brackish-water ecotypes of prickly sculpin (Cottus asper) revealed by Pool-Seq. Molecular Ecology, 26(1), 25-42. doi:10.1111/mec.13805
Doran, A. G., & Creevey, C. J. (2013). Snpdat: Easy and rapid annotation of results from de novo snp discovery projects for model and non-model organisms. BMC Bioinformatics, 14(1), 45. doi:10.1186/1471-2105-14-45
Dybdahl, M. F., & Lively, C. M. (1996). The geography of coevolution: comparative population structures for a snail and its trematode parasite. Evolution, 50(6), 2264-2275. doi:10.1111/j.1558-5646.1996.tb03615.x
Feijen, F., Widmer, K. K., Oester, R., Tardent, N., Klappert, K., & Jokela, J. (in prep). Frequency of multiple-genotype infections as indicator of exposure risk in a natural host population. ETH.
Feijen, F., Zajac, N., Vorburger, C., & Jokela, J. (in prep). Contrasting phylogeographic patterns in a cryptic species complex of trematode parasites.
Ferreira, V., Molina, M. a. C., Valck, C., Rojas, Á., Aguilar, L., Ramírez, G., . . . Ferreira, A. (2004). Role of calreticulin from parasites in its interaction with vertebrate hosts. Molecular immunology, 40(17), 1279-1291.
Fischer, M. C., Rellstab, C., Leuzinger, M., Roumet, M., Gugerli, F., Shimizu, K. K., . . . Widmer, A. (2017). Estimating genomic diversity and population differentiation–an empirical comparison of microsatellite and SNP variation in Arabidopsis halleri. BMC genomics, 18(1), 1-15.
Fischer, M. C., Rellstab, C., Tedder, A., Zoller, S., Gugerli, F., Shimizu, K. K., . . . Widmer, A. (2013). Population genomic footprints of selection and associations with climate in natural populations of Arabidopsis halleri from the Alps. Molecular Ecology, 22(22), 5594-5607. doi:10.1111/mec.12521
Förstner, W., & Moonen, B. (2003). A metric for covariance matrices. In Geodesy-the Challenge of the 3rd Millennium (pp. 299-309): Springer.
Fox, J. A. (1995). The diversity and distribution of clones and sexuals across habitats in a mixed population of a freshwater snail (Potamopyrgus antipodarum). Indiana University.
Fox, J. A., Dybdahl, M. F., Jokela, J., & Lively, C. M. (1996). Genetic structure of coexisting sexual and clonal subpopulations in a freshwater snail (Potamopyrgus antipodarum). Evolution, 50(4), 1541-1548.
Frank, S. A. (1991). Ecological and genetic models of host-pathogen coevolution. Heredity, 67(1), 73-83.
Frank, S. A. (1993). Coevolutionary genetics of plants and pathogens. Evolutionary Ecology, 7(1), 45-75.
Franssen, S. U., Barton, N. H., & Schlötterer, C. (2016). Reconstruction of Haplotype-Blocks Selected during Experimental Evolution. Molecular Biology and Evolution, 34(1), 174-184. doi:10.1093/molbev/msw210
Fricke, J. M., Vardo-Zalik, A. M., & Schall, J. J. (2010). Geographic genetic differentiation of a malaria parasite, Plasmodium mexicanum, and its lizard host, Sceloporus occidentalis. Journal of Parasitology, 96(2), 308-313.
Gandon, S., Capowiez, Y., Dubois, Y., Michalakis, Y., & Olivieri, I. (1996). Local adaptation and gene-for-gene coevolution in a metapopulation model. Proceedings of the Royal Society of London. Series B: Biological Sciences, 263(1373), 1003-1009.
Gautier, M. (2015). Genome-Wide Scan for Adaptive Divergence and Association with Population-Specific Covariates. Genetics, 201(4), 1555. doi:10.1534/genetics.115.181453
Gautier, M. (2019). BayPass version 2.2 user manual.
Gibson, A. K., Jokela, J., & Lively, C. M. (2016). Fine-Scale Spatial Covariation between Infection Prevalence and Susceptibility in a Natural Population. The American Naturalist, 188(1), 1-14. doi:10.1086/686767
Githui, E. K., Damian, R. T., Aman, R. A., Ali, M. A., & Kamau, J. M. (2009). Schistosoma spp.: Isolation of microtubule associated proteins in the tegument and the definition of dynein light chains components. Experimental parasitology, 121(1), 96-104.
Gmelin, J. (1789). Caroli a Linné Systema Naturae, vol. 1, part 3. In: Leipzig, Germany, GE Beer Publishing.
Gray, J. (1843). Catalogue of the species of Mollusca and their shells, which have hitherto been recorded as found at New Zealand, with the description of some lately discovered species. In (Vol. 2, pp. 228-265): Murray London, pp.
Guggisberg, A., Liu, X., Suter, L., Mansion, G., Fischer, M. C., Fior, S., . . . Widmer, A. (2018). The genomic basis of adaptation to calcareous and siliceous soils in Arabidopsis lyrata. Molecular Ecology, 27(24), 5088-5103. doi:10.1111/mec.14930
Günther, T., & Coop, G. (2013). Robust identification of local adaptation from allele frequencies. Genetics, 195(1), 205-220.
Halton, D. (1967). Studies on phosphatase activity in Trematoda. The Journal of Parasitology, 46-54.
Hartmann, F. E., McDonald, B. A., & Croll, D. (2018). Genome-wide evidence for divergent selection between populations of a major agricultural pathogen. Molecular Ecology, 27(12), 2725-2741. doi:10.1111/mec.14711
Hivert, V., Leblois, R., Petit, E. J., Gautier, M., & Vitalis, R. (2018). Measuring Genetic Differentiation from Pool-seq Data. Genetics, 210(1), 315-330. doi:10.1534/genetics.118.300900
Hoffmann, K. F., & Strand, M. (1996). Molecular identification of a Schistosoma mansoni tegumental protein with similarity to cytoplasmic dynein light chains. Journal of Biological Chemistry, 271(42), 26117-26123.
Hollander, J., Lindegarth, M., & Johannesson, K. (2005). Local adaptation but not geographical separation promotes assortative mating in a snail. Animal Behaviour, 70(5), 1209-1219.
Huerta-Cepas, J., Szklarczyk, D., Forslund, K., Cook, H., Heller, D., Walter, M. C., . . . Kuhn, M. (2016). eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic acids research, 44(D1), D286-D293.
Ignacio-Espinoza, J. C., Ahlgren, N. A., & Fuhrman, J. A. (2020). Long-term stability and Red Queen-like strain dynamics in marine viruses. Nature Microbiology, 5(2), 265-271.
Jokela, J., Dybdahl, M. F., & Lively, C. M. (2009). The Maintenance of Sex, Clonal Dynamics, and Host-Parasite Coevolution in a Mixed Population of Sexual and Asexual Snails. The American Naturalist, 174(S1), S43-S53. doi:10.1086/599080
Jokela, J., & Lively, C. M. (1995). Spatial variation in infection by digenetic trematodes in a population of freshwater snails (Potamopyrgus antipodarum). Oecologia, 103(4), 509-517.
Jokela, J., Lively, C. M., Fox, J. A., & Dybdahl, M. F. (1997). Flat reaction norms and “frozen” phenotypic variation in clonal snails (Potamopyrgus antipodarum). Evolution, 51(4), 1120-1129. doi:10.1111/j.1558-5646.1997.tb03959.x
Jombart, T. (2008). adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics, 24(11), 1403-1405.
Jost, L. (2008). GST and its relatives do not measure differentiation. Molecular Ecology, 17(18), 4015-4026.
Kawecki, T. J., & Ebert, D. (2004). Conceptual issues in local adaptation. Ecology letters, 7(12), 1225-1241.
Klopfenstein, D., Zhang, L., Pedersen, B. S., Ramírez, F., Vesztrocy, A. W., Naldi, A., . . . Weigel, M. (2018). GOATOOLS: A Python library for Gene Ontology analyses. Scientific Reports, 8(1), 1-17.
Kofler, R., Orozco-terWengel, P., De Maio, N., Pandey, R. V., Nolte, V., Futschik, A., . . . Schlötterer, C. (2011). PoPoolation: a toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PloS one, 6(1), e15925.
Kofler, R., Pandey, R. V., & Schlötterer, C. (2011). PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics, 27(24), 3435-3436.
Leimu, R., & Fischer, M. (2008). A meta-analysis of local adaptation in plants. PloS one, 3(12), e4010.
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., . . . Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079.
Lin, J., Quinn, T., Hilborn, R., & Hauser, L. (2008). Fine-scale differentiation between sockeye salmon ecotypes and the effect of phenotype on straying. Heredity, 101(4), 341-350.
Lively, C. M., Dybdahl, M. F., Jokela, J., Osnas, E. E., & Delph, L. F. (2004). Host sex and local adaptation by parasites in a snail-trematode interaction. The American Naturalist, 164(S5), S6-S18.
Lively, C. M., & McKenzie, J. C. (1991). Experimental infection of a freshwater snail, Potamopyrgus antipodarum, with a digenetic trematode, Microphallus sp.
Lo, E., Bonizzoni, M., Hemming-Schroeder, E., Ford, A., Janies, D. A., James, A. A., . . . Githeko, A. (2018). Selection and utility of single nucleotide polymorphism markers to reveal fine-scale population structure in human malaria parasite Plasmodium falciparum. Frontiers in Ecology and Evolution, 6, 145.
Luu, K., Bazin, E., & Blum, M. G. (2017). pcadapt: an R package to perform genome scans for selection based on principal component analysis. Molecular ecology resources, 17(1), 67-77.
Martin, N. H., & Willis, J. H. (2007). Ecological divergence associated with mating system causes nearly complete reproductive isolation between sympatric Mimulus species. Evolution, 61(1), 68-82.
Mobegi, V. A., Duffy, C. W., Amambua-Ngwa, A., Loua, K. M., Laman, E., Nwakanma, D. C., . . . Clark, T. G. (2014). Genome-wide analysis of selection on the malaria parasite Plasmodium falciparum in West African populations of differing infection endemicity. Molecular Biology and Evolution, 31(6), 1490-1499.
Montague, M. J., Li, G., Gandolfi, B., Khan, R., Aken, B. L., Searle, S. M., . . . Davis, B. W. (2014). Comparative analysis of the domestic cat genome reveals genetic signatures underlying feline biology and domestication. Proceedings of the National Academy of Sciences, 111(48), 17230-17235.
Morales, H. E., Faria, R., Johannesson, K., Larsson, T., Panova, M., Westram, A. M., & Butlin, R. K. (2019). Genomic architecture of parallel ecological divergence: Beyond a single environmental contrast. Science Advances, 5(12), eaav9963. doi:10.1126/sciadv.aav9963
Nakhasi, H., Pogue, G., Duncan, R., Joshi, M., Atreya, C., Lee, N., & Dwyer, D. (1998). Implications of calreticulin function in parasite biology. Parasitology Today, 14(4), 157-160.
Olazcuaga, L., Loiseau, A., Parrinello, H., Paris, M., Fraimout, A., Guedot, C., . . . Gautier, M. (2020). A Whole-Genome Scan for Association with Invasion Success in the Fruit Fly Drosophila suzukii Using Contrasts of Allele Frequencies Corrected for Population Structure. Molecular Biology and Evolution, 37(8), 2369-2385. doi:10.1093/molbev/msaa098
Paczesniak, D., Adolfsson, S., Liljeroos, K., Klappert, K., Lively, C. M., & Jokela, J. (2014). Faster clonal turnover in high-infection habitats provides evidence for parasite-mediated selection. Journal of Evolutionary Biology, 27(2), 417-428.
Paczesniak, D., Jokela, J., Larkin, K., & Neiman, M. (2013). Discordance between nuclear and mitochondrial genomes in sexual and asexual lineages of the freshwater snail Potamopyrgus antipodarum. Molecular Ecology, 22(18), 4695-4710.
Paczesniak, D., Klappert, K., Kopp, K., Neiman, M., Seppälä, K., Lively, C. M., & Jokela, J. (2019). Parasite resistance predicts fitness better than fecundity in a natural population of the freshwater snail Potamopyrgus antipodarum. Evolution, 73(8), 1634-1646. doi:10.1111/evo.13768
Papkou, A., Guzella, T., Yang, W., Koepper, S., Pees, B., Schalkowski, R., . . . Schulenburg, H. (2019). The genomic basis of Red Queen dynamics during rapid reciprocal host–pathogen coevolution. Proceedings of the National Academy of Sciences, 116(3), 923-928.
Pickrell, J., & Pritchard, J. (2012). Inference of population splits and mixtures from genome-wide allele frequency data. Nature Precedings, 1-1.
Pritchard, J. K., Pickrell, J. K., & Coop, G. (2010). The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Current biology, 20(4), R208-R215.
Richardson, J. L., Urban, M. C., Bolnick, D. I., & Skelly, D. K. (2014). Microgeographic adaptation and the spatial scale of evolution. Trends in ecology & evolution, 29(3), 165-176.
Seehausen, O., Witte, F., Van Alphen, J., & Bouton, N. (1998). Direct mate choice maintains diversity among sympatric cichlids in Lake Victoria. Journal of Fish Biology, 53, 37-55.
Sire, C., Durand, P., Pointier, J., & Theron, A. (2001). Genetic diversity of Schistosoma mansoni within and among individual hosts (Rattus rattus): infrapopulation differentiation at microspatial scale. International Journal for Parasitology, 31(14), 1609-1616.
Smit, A., Hubley, R., & Green, P. (2015). RepeatMasker Open-4.0. 2013–2015. In.
Smit, A., Hubley, R. R., & Green, P. (2008). Open-1.0. 2008–2015. In.
Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16), 9440-9445.
Sylvain Gandon, & Scott L. Nuismer. (2009). Interactions between Genetic Drift, Gene Flow, and Selection Mosaics Drive Parasite Local Adaptation. The American Naturalist, 173(2), 212-224. doi:10.1086/593706
Tajima, F. (1989). Statistical methods for testing the neutral hypothesis by DNA polymorphism. Genetics, 123, 253-262.
Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J., & Prins, P. (2015). Sambamba: fast processing of NGS alignment formats. Bioinformatics, 31(12), 2032-2034.
Thompson, J. N. (1994). The coevolutionary process: University of Chicago Press.
Thompson, J. N. (1999). Specific hypotheses on the geographic mosaic of coevolution. The American Naturalist, 153(S5), S1-S14.
Törönen, P., Medlar, A., & Holm, L. (2018). PANNZER2: a rapid functional annotation web server. Nucleic acids research, 46(W1), W84-W88.
Tremblay, R. L., & Ackerman, J. D. (2001). Gene flow and effective population size in Lepanthes (Orchidaceae): a case for genetic drift. Biological Journal of the Linnean Society, 72(1), 47-62.
Umezurike, G. M., & Anya, A. O. (1980). Carbohydrate energy metabolism in Fasciola gigantica (Trematoda). International Journal for Parasitology, 10(3), 175-180.
von Linné, C., & Lange, J. J. (1760). Caroli Linnaei... systema naturae per regna tria naturae: secundum classes, ordines, genera, species: Io. Iac. Curt.
Ward, J., & Talbot, J. (1984). Distribution of aquatic macrophytes in Lake Alexandrina, New Zealand. New Zealand journal of marine and freshwater research, 18(2), 211-220.
Warwick, T. (1952). Strains in the mollusc Potamopyrgus jenkinsi (Smith). Nature, 169(4300), 551-552.
Wilton, P. R., Sloan, D. B., Logsdon Jr, J. M., Doddapaneni, H., & Neiman, M. (2013). Characterization of transcriptomes from sexual and asexual lineages of a New Zealand snail (Potamopyrgus antipodarum). Molecular ecology resources, 13(2), 289-294.
Winterbourn, M. (1970). The New Zealand species of Potamopyrgus (Gastropoda: Hydrobiidae). Malacologia, 10(2), 283-321.
Wysoker, A., Tibbetts, K., & Fennell, T. (2013). Picard tools. In.
Yeaman, S. (2013). Genomic rearrangements and the evolution of clusters of locally adaptive loci. Proceedings of the National Academy of Sciences, 110(19), E1743-E1751.
Young, N. D., Hall, R. S., Jex, A. R., Cantacessi, C., & Gasser, R. B. (2010). Elucidating the transcriptome of Fasciola hepatica—a key to fundamental and biotechnological discoveries for a neglected parasite. Biotechnology advances, 28(2), 222-231.

Concluding remarks

One of the most striking conclusions of this thesis is how complex and rich the information becomes when studying the genomic architecture of an organism. Each chapter presented its own challenges, which open up a multitude of directions for future research.

In chapters 1 and 2 we addressed the evolutionary relationship between genes in a group of digenean trematodes. We were able to address these questions thanks to high-precision methodologies developed for the inference of homologous genes (Train, Glover, Gonnet, Altenhoff, & Dessimoz, 2017) and to the availability of reference genomes for a wide range of species.

However, the findings of chapter 1 show that the phenotypic consequences of many genes in parasitic trematodes are still untested experimentally; their functions are inferred from homology and structural similarity to genes of other species that are sometimes evolutionarily distant (Gene Ontology Consortium, 2007). Paralogs, however, often develop different functions (Glover et al., 2019). Therefore, evolutionary and medical research would benefit greatly from a better understanding of the functions of paralogs, their expression and their epistatic interactions in creating a phenotype. The challenge in trematodes in particular comes from the need for multiple hosts to complete a life cycle, and thus the need for multiple different environments under laboratory conditions. With that achieved, methods such as RNAi or gene knockout technologies could provide more insight.

In chapters 1 and 2 we also found that all 14 extant trematodes had the highest proportion of genes that were gained or novel since the ancestral trematode, but also since the most recent speciation event of each extant species from the others (Chapter 1). The high proportion was still observed even after filtering out incomplete, possibly misannotated genes (Chapter 2). Among them, we detected many single-exon genes (Chapter 2). Annotation of a gene as gained or novel could mean that the sequence of that gene has diverged so far from all other genes that its homology is untraceable (Train et al., 2017). Alternatively, a novel gene could have arisen from previously non-coding material and thus has no sequence similarity to other genes (Katju & Lynch, 2006). However, the high levels of novelty we observed could also suggest that a broader range of species needs to be considered for comparative analyses, or that the genomes used require an improvement in the quality of their assemblies and annotations. Using chromosome-level assemblies would improve not only the inference of homology, but also the study of the locations and functions of transposable elements and the inference of gene synteny, the conservation of the order of homologous genes between species (Liu, Hunt, & Tsai, 2018).

In chapters 3 and 4 we addressed the population genomics of a parasite in an already extensively studied host-parasite system. The past 30 years of in-depth research on the host-parasite dynamics, the population structure using both nuclear and mitochondrial markers, the mating system, the life-cycle characteristics and the taxonomy have heavily informed our sampling designs and our conclusions. The chapters indicate that the study design, and the choice of environmental variables to be correlated with whole-genome scans, rely heavily on strong a priori hypotheses. These can only be formed with extensive knowledge of a species’ ecology and evolution. However, genome-wide scans are best used in combination with GWAS or linkage maps to understand the level of linkage disequilibrium in the genome and the patterns of recombination (Hoban et al., 2016). Without information on how polymorphism is inherited, we could not fully account for the non-independence of loci in our analyses or make informed decisions about the genomic windows of interest.

For the study of genomic regions under selection, it is also important to understand the relationship between the selected phenotype and its underlying genomic basis. Our research would be greatly improved by methods that combine the analysis of structural variation (insertions/deletions, inversions, copy number variation) with SNP variation (Hoban et al., 2016). Additionally, both chapters provide only a snapshot of a system experiencing negative frequency-dependent dynamics. Repeating our analyses over time, and in conjunction with population genomics of the host, would provide a better picture of how selection acts on our focal species.

Altogether, the thesis places itself in the broad field of ecological and evolutionary functional genomics (EEFG), signifying “the study of genes that affect ecological success and evolutionary fitness in natural environments and populations” (Feder & Mitchell-Olds, 2003). As Feder & Mitchell-Olds (2003) point out, EEFG is a multidisciplinary endeavour which aims at making the most of the so-called post-genomic science, the era following the availability of complete genome sequences. The goals of EEFG require the application of a whole range of scientific disciplines; the methodology and knowledge required are beyond the capabilities of a single researcher and necessitate a high level of collaboration and sustained interactions between research fields (Feder & Mitchell-Olds, 2003).

References

Feder, M. E., & Mitchell-Olds, T. (2003). Evolutionary and ecological functional genomics. Nature Reviews Genetics, 4(8), 649-655.
Gene Ontology Consortium. (2007). Guide to GO Evidence Codes. In.
Glover, N., Dessimoz, C., Ebersberger, I., Forslund, S. K., Gabaldón, T., Huerta-Cepas, J., . . . Pereira, C. (2019). Advances and Applications in the Quest for Orthologs. Molecular Biology and Evolution, 36(10), 2157-2164.
Hoban, S., Kelley, J. L., Lotterhos, K. E., Antolin, M. F., Bradburd, G., Lowry, D. B., . . . Whitlock, M. C. (2016). Finding the genomic basis of local adaptation: pitfalls, practical solutions, and future directions. The American Naturalist, 188(4), 379-397.
Katju, V., & Lynch, M. (2006). On the Formation of Novel Genes by Duplication in the Caenorhabditis elegans Genome. Molecular Biology and Evolution, 23(5), 1056-1067. doi:10.1093/molbev/msj114
Liu, D., Hunt, M., & Tsai, I. J. (2018). Inferring synteny between genome assemblies: a systematic evaluation. BMC bioinformatics, 19(1), 26. doi:10.1186/s12859-018-2026-4
Train, C.-M., Glover, N. M., Gonnet, G. H., Altenhoff, A. M., & Dessimoz, C. (2017). Orthologous Matrix (OMA) algorithm 2.0: more robust to asymmetric evolutionary rates and more scalable hierarchical orthologous group inference. Bioinformatics, 33(14), i75-i82. doi:10.1093/bioinformatics/btx229

Acknowledgments

Firstly, I would like to thank my supervisor, Prof. Jukka Jokela, by whom I have been given this amazing opportunity. I very much appreciate the freedom you have given me to decide my own path and projects, and the support to succeed in the path I have taken. Thanks to you I had the chance to attend the workshops, conferences and courses I was most interested in and also to form collaborations that made me the scientist I am today. At the start of this PhD I would never have imagined what fascinating things I was about to do, but this would not have been possible if not for your trust in my decisions and my work. For this opportunity I also want to thank my second supervisor, Prof. Hanna Hartikainen. Thank you for introducing me to the world of genomics, for being an amazing teacher and for all your support, especially at the initial stages of the PhD.

I would also like to thank Dr. Natasha Glover for a truly fascinating and fruitful collaboration. Working with you has been a great honour and a great pleasure. Your enthusiasm and kindness have made the project so much fun, and your immense knowledge and approach to science will always be something I strive for. I have learnt so much from you about the field and the project we have worked on, but also about being a scientist.

I want to thank my PhD committee, consisting of Prof. Roger Butlin, Dr. Martin Fischer and Dr. Kirsten Klappert. Thank you for the inspiring scientific discussions and your incredibly helpful thoughts on the project during its different stages.

I also very much want to thank Dr. Stefan Zoller, from the Genetic Diversity Center at ETH Zurich. Your immense patience, encouragement and kindness have made me love bioinformatics and see its great power in addressing some fundamental scientific questions.

I am also very grateful for having had the chance to work with the following people and for their inspiration and helpful comments: Dr. Frida Feijen, Prof. Christophe Dessimoz, Dr. David Moi, Dr. Hélène Boulain and Dr. Niklaus Zemp. I also want to thank the following people for all the support with laboratory work or field work and for teaching me all the necessary skills to accomplish this PhD: Katri Seppälä, Dr. Aria Maya Minder Pfyl and Silvia Kobel from GDC, ETH Zurich, Anja Taddei, Tamara Schlegel, Nadine Tardent, Dr. Kirsten Klappert, Dr. Stuart Dennis, Marco Thali, Raffael Stegmayer, Kaja Widmer, Rebecca Oester, Simona Berta and Dr. Claudia Buser. A special thank you also goes to Arianne Maniglia and Gioia Matheson from admineco for their help with a lot of documentation and for answering all my random questions about how to get settled and get things sorted while living in Switzerland.

I would like to thank the wonderful people who have been part of this adventure, for their friendship, support and really fun times: Dr. Lynn Govaert, Teo Cereghetti, Natalie Sieber, Dr. Robert Dünner, Heidi Käch, Linda Haltiner, Nadine Tardent, Cansu Cetin and Dr. Elvira Mächler; but also Dr. Sophie Roper, Helen Long, Dr. Sophie McManus, Jannik Zeiser, Nenad Torbica and Valentin Roesler. You made this time so much better.

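I want to thank my family for their support and encouragement, for being interested in and reading my work. And because it is weird to speak to your own family in English, I want to say: Dziękuję wam wszystkim, mamo, tato i Zuziu, z całego serca za pomoc i wsparcie, zainteresowanie i zrozumienie. I moi kochani rodzice, dziękuję wam bardzo za waszą miłość, za waszą ciężką pracę i wyrzeczenia, które dały mi tyle szans i przeżyć i pozwoliły mi dojść do tego momentu. [Thank you all, Mum, Dad and Zuzia, with all my heart for your help and support, your interest and understanding. And my dear parents, thank you so much for your love, your hard work and the sacrifices that gave me so many opportunities and experiences and allowed me to reach this moment.]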

And finally, my wholehearted thanks go to Sebastian Drosselmeier. Thank you for your amazing support throughout this whole time; thank you for making everything, even the most difficult and challenging times, a wonderful adventure.

Curriculum Vitae

Natalia Halina Zajac

Personal data

Address (office): EAWAG, Swiss Federal Institute of Aquatic Science and Technology, Überlandstrasse 133, 8600 Dübendorf, Switzerland
Email: [email protected], [email protected], [email protected]
Phone: +41 779842564 (personal), +41 58 765 5605 (office)
Date of birth: 01.07.1992
Languages: Polish (native), English, German B2, Spanish

Education

08/2016-12/2020 Ph.D. in Functional Genomics and Evolutionary Biology at ETH Zurich, Institute of Integrative Biology, and EAWAG, Department of Aquatic Ecology, Switzerland

09/2014-06/2016 MSc in Evolutionary Biology at Uppsala University, Sweden

10/2011-06/2014 BA in Biological Sciences at University of Oxford, UK

Research Experience

08/2016-12/2020 Ph.D. in Functional Genomics and Evolutionary Biology at ETH Zurich, Institute of Integrative Biology, and EAWAG, Department of Aquatic Ecology, Switzerland; Supervisors: Prof. Dr. Jukka Jokela, Prof. Dr. Hanna Hartikainen; Topic: The genomics of parasite adaptation to its host

09/2015 – 05/2016 Master thesis project at the University of Bern, Institute of Ecology and Evolution; Supervisor: Prof. Laurent Excoffier; Topic: Characterization of the most common chromosomal inversions in the human genome (their prevalence, geographic distribution across human populations, gene content and genetic diversity).

04/2015 – 07/2015 Research assistant at the University of Uppsala: 2 projects in parallel. 1. Supervisor: Prof. Martin Lascoux; Topic: A new phylogeny for the genus Picea from plastid, mitochondrial, and nuclear sequences. 2. Supervisors: Prof. Jochen Wolf, Dr. Bart Nieuwenhuis, Sergio Tusso (PhD); Topic: Homothallism in Schizosaccharomyces pombe

11/2013 - 06/2014 Part time work as a research assistant at University of Oxford. Supervisor: Dr. Tobias Uller; Topic: Genetics of the common wall lizard, Podarcis muralis

03/2013 – 11/2013 Final Honours Research Project, University of Oxford; Supervisor: Dr. Tobias Uller; Topic: Characterization of a hybrid zone of the common wall lizard, Podarcis muralis, in Italy

Awards

2019 Most active and attentive audience participant, 8th ECO PhD Symposium at EAWAG 2019

2019 Best poster award at Biology19, the Swiss Conference of Organismic Biology, Zürich, Switzerland

2018 Most active and attentive audience participant, 7th ECO PhD Symposium at EAWAG 2018

2017 Best talk at the 6th ECO PhD Symposium at EAWAG 2017

2015 Swiss-European Mobility Programme (SEMP) Scholarship – obtained from the University of Bern for Master thesis project

2014 Diploma for Diana Science Conference at the University of Uppsala as a Speaker, Feedback Provider and Audience, 2014

2014 Southern Field Studies Book Prize, 2014 – obtained for the best fieldwork project of the year for Honours research project at University of Oxford

2013 David Kirby Memorial Fund 2013 - funding awarded for Honours Research Project at University of Oxford

2012-2013 Demyship Scholarship (Magdalen College, Oxford) 2012-2013 – awarded for obtaining the highest grades in end of the year exams

2012 BP-NRE award, 2012 – scholarship from the Oxford University International Internship Program for an internship in Natural Resources and the Environment in Siberia

Publications

Zajac, N., Zoller, S., Seppälä, K., Moi, D., Dessimoz, C., Jokela, J., Hartikainen, H. and Glover, N. Gene duplication and gain in the trematode Atriophallophorus winterbourni contributes to adaptation to parasitism. In press.

Blasco-Costa, I., Seppälä, K., Feijen, F., Zajac, N., Klappert, K. and Jokela, J. (2019). A new species of Atriophallophorus Deblock & Rosé, 1964 (Trematoda: Microphallidae) described from in vitro-grown adults and metacercariae from Potamopyrgus antipodarum (Gray, 1843)(Mollusca: Tateidae). Journal of Helminthology 94: 1-15.

Michaelides, S.N., While, G.M., Zajac, N., Aubret, F., Calsbeek, B., Sacchi, R., Zuffi, M.A.L. and Uller, T. (2016). Loss of genetic diversity and increased embryonic mortality in non‐native lizard populations. Mol Ecol, 25: 4113-4125.

Michaelides, S., Cornish, N., Griffiths, R., Groombridge, J., Zajac, N., Walters, G.J., Aubret, F., While, G. M. and Uller, T. (2015). Phylogeography and conservation genetics of the common wall lizard, Podarcis muralis, on islands at its northern range. PLoS One 10, no. 2:e0117113.

While, G.M., Michaelides, S., Heathcote, R.J.P., MacGregor, H.E.A., Zajac, N., Beninde, J., Carazo, P., Pérez i de Lanuza, G., Sacchi, R., Zuffi, M.A., Horváthová, T., Fresnillo, B., Schulte, U., Veith, M., Hochkirch, A. and Uller, T. (2015). Sexual selection drives asymmetric introgression in wall lizards. Ecol Lett, 18: 1366-1375. doi:10.1111/ele.12531

Michaelides, S.N., While, G.M., Zajac, N. and Uller, T. (2015). Widespread primary, but geographically restricted secondary, human introductions of wall lizards, Podarcis muralis. Mol Ecol, 24: 2702-2714. doi:10.1111/mec.13206

Selected courses and workshops

2019 Workshop on Viral Evolution and Genomics at Jacques Monod Conference “Virus evolution on the mutualist - parasite continuum”

2018 Project Management at ETH Zurich

2018 Winter School: Genome Assembly and Annotation: The long and short of it by ETH Zurich

2018 Evolutionary Medicine for Infectious Diseases at ETH Zurich

2018 Swiss Institute of Bioinformatics workshop at Biology18, the Swiss Conference of Organismic Biology at the University of Neuchatel

2017 Introduction to Variant Analysis using NGS course by Functional Genomics Center Zurich

2017 Swiss Institute of Bioinformatics workshop at Biology 17, the Swiss Conference of Organismic Biology at the University of Bern

2018 Winter School: Bioinformatics for Adaptation Genomics by ETH Zurich

2017 Integrated methods to detect polygenic adaptation from genomic data by WSL

2017 Ecology and Evolution: Interaction Seminar at ETH Zurich

2016 The genomic basis of eco-evolutionary change organized by the Adaptation to a Changing Environment center (ACE) ETH Zurich

Posters and presentations

06/2020 Presentation at the Swiss Institute of Bioinformatics Days; Title: “Gene duplication and gain in the trematode Atriophallophorus winterbourni contributes to adaptation to parasitism”

02/2020 Presentation at Biology2020, the Swiss Conference of Organismic Biology at the University of Fribourg; Title: “Gene duplication and gain in the trematode Atriophallophorus winterbourni contributes to adaptation to parasitism”

11/2019 Presentation at the 8th ECO PhD Symposium at EAWAG; Topic: “Signatures of adaptation to parasitic life style in Atriophallophorus winterbourni genome”

10/2019 Poster at the Jacques Monod Conference “Virus evolution on the mutualist - parasite continuum”; Topic: “Comparative genomics of parasitic trematodes”

08/2019 Poster at the Congress of the European Society for Evolutionary Biology, Turku, Finland; Topic: “Genomic signature of adaptation of a parasite to its host”

02/2019 Poster at Biology19, the Swiss Conference for Organismic Biology at the University of Zurich; Topic: “Genomic signature of parasite adaptation to its host”

11/2018 Presentation at the 7th ECO PhD Symposium at EAWAG; Topic: “Adaptation of parasite to its host: behavioural differences in infected hosts.”

02/2018 Poster at Biology18, the Swiss Conference for Organismic Biology at the University of Neuchatel; Topic: “Genomic signature of adaptation of parasite to its host”

08/2017 Poster at the Congress of the European Society for Evolutionary Biology, Groningen, Netherlands; Topic: “Genomic signature of adaptation of a parasite to its host”

06/2017 Presentation at the 6th ECO PhD Symposium; Topic: “Genomic signature of adaptation of a parasite to its host”

Teaching

2018/2019 Supervision of a course term paper by Michael Zehnder, entitled “Whole genome resequencing: what can it tell us about adaptive evolution?”

2018/2019 Shared Supervision of a Master’s Thesis project by Julia Vrtilek at Eawag/ETH Zurich

2017/2018 Supervision of 3 lab assistants: Rebecca Oester, Simona Berta and Michael Zehnder, coordinating a shared laboratory project
