Probabilistic Modelling of Domain and Gene Evolution


SAYYED AUWN MUHAMMAD

Doctoral Thesis Stockholm, Sweden 2016 TRITA-CSC-A-2016:19 ISSN-1653-5723 KTH School of Computer Science and Communication ISRN-KTH/CSC/A–16/19-SE SE-100 44 Stockholm ISBN 978-91-7729-091-9 SWEDEN

Academic dissertation which, with the permission of KTH Royal Institute of Technology, is submitted for public examination for the degree of Doctor of Technology in Computer Science on Monday 26 September 2016 at 9.00 in conference room Air, SciLifeLab, Solna.

Printed by: Universitetsservice US AB

To my family

Abstract

Phylogenetic inference relies heavily on statistical models that have been extended and refined over the past years into complex hierarchical models to capture the intricacies of evolutionary processes. The wealth of information in the form of fully sequenced genomes has led to the development of methods that reconstruct gene and species evolutionary histories in greater and more accurate detail. However, genes are composed of evolutionarily conserved sequence segments called domains, and domains can also be affected by duplications, losses, and bifurcations implied by gene or species evolution. This thesis proposes an extension of evolutionary models, such as duplication-loss, rate, and substitution models, that have previously been used to model gene evolution, in order to model domain evolution and to reconstruct the domain evolutionary history.

In this thesis, I propose DomainDLRS: a comprehensive, hierarchical Bayesian method, based on the DLRS model by Åkerborg et al., 2009, that models domain evolution as occurring inside the gene and species tree. The method incorporates a birth-death process to model domain duplications and losses, along with a substitution process to model domain sequence evolution under a relaxed molecular clock assumption. The method employs a variant of the Markov Chain Monte Carlo technique called Grouped Independence Metropolis-Hastings to estimate the posterior distribution over domain trees with lengths, gene tree topologies, and other evolutionary parameters. The hallmark of our work is the dating of domain evolutionary events, which is achieved by mapping the domain evolutionary history to the realized gene tree, which in turn gets its dating from the evolutionary time scale determined by the species tree. The reconciliation of a domain tree to the gene and species phylogenies enables us to infer domain level events such as domain duplications, domain losses, and gene induced domain bifurcations.
Using this method, we analysed selected multi-domain families such as Zinc-Finger and PRDM9, which provided interesting insight into domain evolution with respect to gene and species evolution.

Finally, a synteny-aware approach for gene homology inference, called GenFamClust, is proposed that uses similarity and gene neighbourhood conservation to improve homology inference. GenFamClust identifies homologs in a more informed and accurate manner than similarity-based approaches. We evaluated the accuracy of our method against Neighbourhood Correlation and other methods on synthetic data and on two biological datasets consisting of eukaryotic and fungal species. Our results show that combining synteny with similarity significantly improves homology inference and that synteny-aware homology inference is more accurate than similarity-only methods.

Summary

Phylogenetic inference, which aims to reconstruct evolutionary trees, depends heavily on statistical models that in recent years have been extended and refined into complex hierarchical models that can describe evolutionary processes in detail. The availability of fully sequenced genomes from different species has given us the opportunity to study and understand evolutionary processes, and to reconstruct gene and species phylogenies in greater and more accurate detail. However, genes are composed of evolutionarily conserved sequence elements called domains. These domains can be affected by duplications, losses, and bifurcations as a consequence of the evolution of the genes. This thesis is an attempt to extend phylogenetic methods to also reconstruct the evolutionary history of domains, and to adapt existing evolutionary models for duplication-loss, for varying mutation rate during evolution, and for sequence evolution to the domain level.

In this thesis I present DomainDLRS: a comprehensive, hierarchical Bayesian method based on the DLRS model by Åkerborg et al., 2009. This model describes domain evolution as a process that takes place inside the gene and species tree. The method includes a birth-death process, according to which domains are duplicated and lost, together with a relaxed molecular clock model for domain sequences. The method uses a Markov Chain Monte Carlo (MCMC) technique called “Grouped Independence Metropolis-Hastings” to estimate the posterior distribution over domain trees with lengths, gene tree topologies, and other evolutionary parameters. What distinguishes this work from earlier reconciliation-based methods is that the evolutionary events of a domain tree are mapped to the realized gene tree, which has been sampled on the evolutionary time scale determined by the species tree. The reconciliation of the domain tree with the gene and species trees makes it possible to infer domain level events such as domain duplications, domain losses, and gene induced domain bifurcations. With this method I have analysed two selected multi-domain families, zinc-finger and PRDM9. This gave interesting insights into domain evolution with respect to gene and species evolution.

Finally, a synteny-aware strategy for gene homology inference, called GenFamClust, is proposed, which uses similarity and conservation in a gene's neighbourhood on the genome to improve homology inference. GenFamClust makes better use of the information in the data and identifies homologs more accurately than similarity-based methods. We have compared our method against existing methods for gene homology prediction on two test datasets: one with fully sequenced eukaryotic genomes and one with synthetic sequences. The results obtained from both tests confirm that synteny facilitates homology inference and that GenFamClust is an improvement over similarity-based methods.

Contents

List of Publications

Acknowledgements

1 Introduction
1.1 Fundamentals
1.2 Current Work
1.3 Thesis Outline

2 Three Levels of Evolution
2.1 Introduction
2.2 Species Level Evolution
2.3 Gene Level Evolution
2.4 Domain Definitions
2.5 Domain Level Evolution

3 Modelling Domain Evolution
3.1 Introduction
3.2 Current Approaches
3.3 Reconciliation-based Approaches
3.4 Bayesian Inference
3.5 Markov Chain Monte Carlo
3.6 The DLRS Model
3.7 Interesting Multi-domain Families

4 Identifying Gene Families
4.1 Introduction
4.2 Homology Definitions
4.3 Current Approaches
4.4 Homology Inference using a similarity network
4.5 Synteny: An important signal for homology inference
4.6 Quantifying synteny for homologous regions
4.7 Homology inference using synteny

5 Present Investigations

Bibliography

List of Publications

Paper I: Species tree aware simultaneous reconstruction of gene and domain evolution. Sayyed Auwn Muhammad, Bengt Sennblad and Jens Lagergren. Manuscript.

Paper II: Sequence analysis and evolutionary studies of Reelin proteins. Manoharan Malini∗, Sayyed Auwn Muhammad∗ and Ramanathan Sowdhamini. Bioinformatics and Biology Insights, 2015. doi:10.4137/BBI.S26530

Paper III: Quantitative synteny scoring improves homology inference and partitioning of gene families. Raja Hashim Ali, Sayyed Auwn Muhammad, Mehmood Alam Khan and Lars Arvestad. BMC Bioinformatics, 2013. doi:10.1186/1471-2105-14-S15-S12

Paper IV: GenFamClust: An accurate, synteny-aware and reliable homology inference algorithm. Raja Hashim Ali, Sayyed Auwn Muhammad and Lars Arvestad. BMC Evolutionary Biology, 2016. doi:10.1186/s12862-016-0684-2

∗ Contributed equally to this manuscript.

Acknowledgements

First of all, I would like to thank my mother Munawar Batool, who guided me through every difficulty of my life after the death of my father, when I was only 5. She gave me the confidence and encouragement to stand firm in difficult times and taught me how to face difficulties and hardships. I would like to thank my family, especially Sidra Batool and Saqlain Haider, for their support and love. I would like to thank my wife Fouzia Mukhtar for bearing all of my stress, particularly in the last phase of my PhD studies. I would like to thank my kids Kumail Hassan and Ayan Hassan for their love.

I am very grateful to my supervisor Jens Lagergren for his advice and guidance throughout my PhD studies. I feel grateful and privileged to have had him as an advisor, who trained a graduate student with very little knowledge of Evolutionary Biology and Probabilistic Modelling. His acumen in anticipating research questions and constructing algorithms that directly address them amazed me. He is truly a brilliant researcher and one of the fastest thinkers I know. I am sincerely grateful for his constant support and calm attitude despite my ignorance and mistakes.

I would also like to thank my co-supervisors, Lars Arvestad, Bengt Sennblad, and Ramanathan Sowdhamini, for advising me in my research. I am very grateful to Lars Arvestad for providing me with opportunities to participate in teaching as well as for his scientific advice and many insightful discussions. I am grateful to Bengt Sennblad for assisting me in my research. He truly knows the hidden layer between programming and mathematical concepts. I am also very grateful to my collaborator Ramanathan Sowdhamini for her guidance and suggestions. She provided me with valuable advice and support in a collaborative project, even though I was never able to visit her.

I am thankful to all of my SciLifeLab fellows, in no particular order: Owais, Ikram, Hashim, Mehmood, Mattias, Kristofer, Viktor, Lumi, Erik, Joel, Hussein, Pekka, Hanna, Linus, Yrin, Jovana, Banhita, Agata, Henrick, Matthew. I am grateful to Jeanette Hällgren Kotaleski, Harriett Johansson, Birgit Ekberg-Eriksson, and Anna Lena Hållner for their excellent support in all administrative matters at CSC.

Chapter 1

Introduction

1.1 Fundamentals

Life evolves by copying itself, and since copying is an inaccurate process, changes are inevitable. Changes create diversity, and, given that resources within the biosphere are finite, diversity has to rely on the existing set of resources. Genetically, these resources include genomes, genes, and proteins, and, at an even more modular level, domains. The evolutionary process takes advantage of all these levels to produce new copies that can take their own evolutionary path. Thus, evolution occurs differently at different levels. As Walter Fitch stated, “We must recognize that not all parts of a gene have the same history and thus, in such cases, that the gene is not the unit to which the terms orthology, paralogy, etcetera apply” [59]. Domains, too, can be duplicated, lost, and bifurcated as implied by gene or species evolution. Therefore, studying evolution without considering that it may have occurred differently at different levels will leave our understanding of evolutionary processes incomplete.

Modeling evolution is central to all aspects of evolutionary biology and has been studied extensively from almost every conceivable point of view. Despite this vast effort, the subject is sufficiently deep for much to remain unclear. For instance, methods have been proposed to model gene level evolution and to infer the evolutionary events of a gene family; these include gene duplications, gene losses, and speciations. However, these methods rely on a simplifying assumption, which considers a gene to be an atomic unit of evolution. Genes are composed of evolutionarily conserved sequence segments called domains, and domains can also be duplicated, lost, and bifurcated over evolutionary time.
Therefore, designing a method that can take evolutionary reconstruction one step further down to the domain level would be a significant step towards a better understanding of evolutionary processes and their outcomes. This thesis mainly proposes a probabilistic method that models domain level events such as domain duplications, domain losses, and domain bifurcations in the context of gene and species evolution. The proposed method revealed interesting insights into the process of evolution at different levels.

1.2 Current Work

This thesis proposes an extension of evolutionary models, such as duplication-loss, rate, and substitution models, that have previously been used to model gene evolution, in order to model domain evolution and to reconstruct the domain evolutionary history. In the proposed method, the domain content of the extant genes as well as information from the gene and species evolutionary histories is taken into account to get a clearer picture of domain evolution in the context of gene and species evolution. The studies included in this thesis mainly attempt to model domain level events such as domain duplications, domain losses, and domain bifurcations together with gene level events such as gene duplications, gene losses, and speciations. The method proposed in this thesis provides a detailed picture of domain evolution by reconciling the domain evolutionary history to the gene and species evolutionary histories, thereby inferring domain evolutionary events induced by gene and species level evolutionary events. The proposed method is Bayesian and employs a variant of the Markov Chain Monte Carlo technique, namely Grouped Independence Metropolis-Hastings, to estimate the posterior distribution over evolutionary parameters. The hallmark of our work is the dating of domain evolutionary events, which is achieved by mapping the domain evolutionary history to the realized gene tree, which in turn gets its dating from the evolutionary time scale determined by the species tree. We present a probabilistic approach that combines all three levels of evolution, i.e. domain, gene, and species evolution, into one hierarchical probabilistic framework by modelling duplications and losses at the gene level and duplications, losses, rate, and sequence evolution at the domain level with the corresponding gene and species evolution.

Paper I proposes a probabilistic model in which domain evolution takes place inside the gene and species tree.
The probabilistic model comprises a duplication-loss model for domain duplications and losses, a substitution model for domain sequence evolution along the tree with branch lengths, and a rate model that relaxes the molecular clock assumption and allows rate variation among the branches of the tree. The reconciliation of domain evolutionary events to the gene and species phylogenies provides evolutionary scenarios in greater detail. We applied our proposed approach to some interesting domain families, C2H2 Zinc-Fingers and PRDM9; the results provide interesting insight into domain evolution in the context of gene and species evolution.

The evolutionary study in Paper II considered the repeated domain architecture of the Reelin protein and performed a phylogenetic analysis using conventional methods. This was part of an early attempt to investigate and search for datasets for our probabilistic model of domain evolution.

Papers III and IV propose a homology inference approach that combines gene synteny, i.e. the conserved neighbourhood of a gene, with sequence similarity, and is therefore more biologically realistic and useful in the case of diverged gene families with low sequence similarity. We applied our proposed method to metazoan and fungal genome datasets to investigate the role of synteny in determining homology relationships among genes. The results indicate that synteny is effective when combined with similarity to detect homologous gene pairs. The proposed methods develop a better understanding of evolutionary processes by considering all levels of evolution, namely the domain, gene, and species levels.
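As a rough illustration of the duplication-loss component mentioned for Paper I, the following minimal sketch simulates a linear birth-death process along a single branch. It is not the DomainDLRS implementation: the function names, the parameter names (`dup_rate`, `loss_rate`), and the rate values in the usage note are hypothetical, and the sketch ignores the gene tree, rates, and sequences entirely.

```python
import math
import random


def simulate_birth_death(t, dup_rate, loss_rate, rng):
    """Return the number of domain lineages after time t, starting from one.

    Each lineage duplicates at rate dup_rate and is lost at rate loss_rate
    (a linear birth-death process, simulated event by event).
    """
    n = 1
    elapsed = 0.0
    while n > 0:
        total_rate = n * (dup_rate + loss_rate)
        if total_rate == 0.0:
            break  # no events can happen
        # Waiting time to the next event is exponential in the total rate.
        elapsed += rng.expovariate(total_rate)
        if elapsed > t:
            break  # the branch ends before the next event
        # Choose duplication or loss in proportion to the two rates.
        if rng.random() < dup_rate / (dup_rate + loss_rate):
            n += 1
        else:
            n -= 1
    return n


def mean_lineages(t, dup_rate, loss_rate, replicates=20000, seed=1):
    """Monte Carlo estimate of the expected number of lineages at time t."""
    rng = random.Random(seed)
    total = sum(simulate_birth_death(t, dup_rate, loss_rate, rng)
                for _ in range(replicates))
    return total / replicates


if __name__ == "__main__":
    # Hypothetical rates; the analytic mean is exp((dup_rate - loss_rate) * t).
    print(mean_lineages(1.0, 0.5, 0.3), math.exp(0.2))
```

For a linear birth-death process the expected number of lineages after time t is exp((dup_rate - loss_rate) * t), so the simulated mean can be checked against that value.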

1.3 Thesis Outline

The rest of this thesis is organized as follows. Chapter 2 gives a biological perspective on evolution at the domain, gene, and species levels and briefly discusses evolutionary events such as gene duplication, gene loss, and speciation, as well as domain duplication, domain loss, and domain bifurcation. Chapter 3 introduces our main modelling effort on domain evolution with a brief discussion of some of the current methods, followed by an introduction to the Bayesian framework, the Markov Chain Monte Carlo based technique, and the duplication-loss, rate, and substitution models. Section 3.7 of Chapter 3 describes interesting multi-domain families, including C2H2 Zinc-Fingers and the Reelin protein. Chapter 4 provides detail on some of the current approaches to homology inference and discusses the use of synteny for homology inference. Finally, Chapter 5 provides a brief summary of the articles included in this thesis.

Chapter 2

Three Levels of Evolution

2.1 Introduction

About one and a half billion years ago, the earliest form of the modern eukaryotic cell appeared. According to one estimate, the first multicellular animal may have appeared around 640 million years ago [26]. Although cells have refined their DNA replication and repair processes over time, these processes remain highly complex and do not always produce exact copies of DNA during cell division. In fact, such failures serve as the basis for genomic evolution, although transposable DNA elements also contribute. During evolution, the replication process has benefited from the modularity that exists at several levels, e.g. the gene, domain, and genome levels, to create new forms of life. For instance, at the gene level, gene duplications (caused, e.g., by segmental duplications or reverse transcription), lateral gene transfers (the transfer of genetic material from one species to another), gene deletions (segmental deletions during unequal crossing-over), and gene conversion are key events that can change the genomic sequence. At the domain level, translocations, short intragenic duplications (e.g. through slippage of the DNA polymerase [120]), exon duplications, deletions, and shuffling are examples of events that can change the sequence and function of a gene. Similarly, at the genomic level, a whole genome duplication event doubles the number of chromosomes and hence also the number of genes in a species. With the availability of large-scale sequence information and protein domain databases, the information organized in these levels can be systematically exploited to reconstruct the evolutionary history. In this thesis, we propose an evolutionary model that simultaneously deals with evolutionary events at the gene level and at the domain level. These are therefore discussed in more detail in this chapter.
The next section gives an overview of species level evolution, which has brought about genome scale changes through millions of years of evolution. Subsequent sections discuss gene and domain level events and the details of their biological mechanisms.


2.2 Species Level Evolution

Almost 150 years after the publication of Darwin's book On the Origin of Species, defining species and understanding speciation, i.e. the mechanism behind the evolution of species, is still considered a major challenge for evolutionary biology. In his book, Darwin wrote that “I look at the term species, as one arbitrarily given for the sake of convenience to a set of individuals closely resembling each other...” and “The amount of difference is one very important criterion in settling whether two forms should be ranked as species or varieties” [42]. Under this Darwinian perspective, speciation can be seen as a process that accumulates sufficiently many differences between populations to classify them as separate taxonomic species. Although defining a “species” is still a matter of discussion [147], speciation is ultimately considered a consequence of genetic divergence over evolutionary time.

The speciation process is influenced by many different factors, including geographical, ecological, developmental, and environmental factors, that interact in non-linear ways. A straightforward approach to classifying the different mechanisms of speciation is to consider the factors with respect to their type and influence on genetic diversity. In general, we can use any of the factors listed above to classify the speciation mechanisms. Traditionally, however, the primary classification is based on geographic modes of speciation and the level of migration between the diverging populations [67]. Based on the geographic modes of the diverging populations, speciation can be classified into three categories, namely allopatric, parapatric, and sympatric. The word allopatric is derived from the Greek words allos, meaning “other”, and patra, meaning “fatherland”. In the allopatric speciation mode, populations or sub-populations of a species become geographically isolated from each other.
As a result, there is no migration of individuals and the gene flow among the isolated populations ceases. During this process, populations diverge differently in response to the natural selection imposed by their different environments. In contrast, the sympatric speciation mode has the highest possible migration levels. The word sympatric is derived from the Greek word sym, meaning “the same”. In sympatric speciation, populations or sub-populations of a species that live in the same habitat evolve into a new species, which later becomes reproductively isolated from the others [116]. Sympatric speciation occurs more often in plants than in animals [145], because plants can change their chromosome number (polyploidy) and spread by asexual reproduction. Parapatric speciation is considered an intermediate case, where the migration between diverging (sub-)populations is neither zero nor maximal. The word parapatric is derived from the Greek word para, meaning “side-by-side”.

Genetically, evolution is a process that consists of changes in the genetic constitution (gene pool) of a population. The process of evolution may be characterized in two ways: anagenesis and cladogenesis [14]. Change occurring within the population of a given species as time proceeds is known as anagenetic evolution. Anagenetic evolution results in increased adaptation to the environment, and often involves changes that follow the physical or biotic conditions of the environment. In contrast to anagenesis, cladogenesis is an evolutionary event in which a parent species splits into two groups that no longer share the same gene pool. The great diversity of the living world is the consequence of cladogenetic evolution, which results in adaptation to a greater variety of environments and living conditions.

2.3 Gene Level Evolution

During cell division and meiosis, the inaccurate replication process provides an opportunity for the daughter cells to inherit an altered copy of DNA. Some of these replication inaccuracies are simple enough to change only a single nucleotide in the DNA sequence (substitutions), while others are complex enough to change larger parts of the DNA by inserting or deleting stretches of sequence. In more complex scenarios, entire genes, chromosomes, or even genomes can be duplicated or deleted. Susumu Ohno argued that gene duplication was the single most important factor in evolution [161]. In his book Evolution by Gene Duplication, published in 1970, he clearly identified and described the role of gene duplication as a driving force in evolution. There are several theories explaining the fate of a newly duplicated gene. Initially, the new duplicate will start diverging from the original copy by acquiring mutations. Many of these mutations may be tolerated, since the original gene is still present to perform its ancestral function. Consequently, the fate of the duplicated gene will be decided in one of three possible scenarios [111]: 1) one of the gene copies evolves a new function after the gene duplication event (neo-functionalization); 2) both gene copies initially retain their ancestral function for some time, and eventually specialize in restricted forms of their original function (sub-functionalization); or 3) the duplicated gene acquires too many mutations, loses its ability to be properly translated into protein, and thus becomes a pseudogene (non-functionalization). In a similar way, gene loss plays an important role in large scale changes of the genome. During the DNA replication process, part of the DNA may be deleted from the genome. If this deletion causes a gene loss, it can have a large impact on the organism, or no effect if the function of the gene is no longer required.
In the following sections, we briefly review the mechanisms behind gene duplication and gene conversion, and how evolution can use them to shape the genomes of many species.

Gene Duplication

A number of biological mechanisms are involved in gene duplication. These mechanisms include unequal crossing-over, retro-transposition, and whole genome duplication. In the case of unequal crossing-over, the duplicated chromosomes attach to each other in the centromere region during meiosis. Depending on the position of crossing-over during recombination, one chromatid acquires a longer segment while the other acquires a shorter one, thereby creating a tandem repeat on one chromosome and a deletion on the other. Unequal crossing-over can duplicate part of a gene, an entire gene, or several genes. In the latter two cases, the flanking region will also be copied along with the duplicated genes. The flanking region is a region of DNA that is adjacent to a gene and contains the promoter and other regulatory sequences of that gene. Unequal crossing-over mostly places duplicated genes in tandem; that is, they are linked on a chromosome [183].

In retro-transposition, a messenger RNA (mRNA) is retro-transcribed to complementary DNA (cDNA) and then inserted at some random position in the genome. Duplications through retro-transposition can easily be recognized by their features, for example, loss of introns and regulatory sequences, presence of poly-A tracts, and presence of flanking short direct repeats. The resulting duplicated genes are called retro-genes. These retro-genes are rarely expressed as full-length coding sequences because they lack their regulatory elements. However, functional retro-genes have been identified in many genomes [171]. Due to the random insertion of cDNA, retro-genes are often unlinked to their original gene and do not appear in clusters of genes, unless the involved genes are all in an operon (only in the case of prokaryotes). An operon is a cluster of genes under the control of a single promoter, i.e. a functional unit of genomic DNA.
Retro-genes often end up as pseudogenes, since the promoter and regulatory sequences of the gene are not transcribed and hence not duplicated by retro-transposition.

One of the most dramatic cases of genome evolution is whole genome duplication (WGD), in which a duplication event acts on the entire genome and creates additional copies of the genome in a species. Because these events lead to significant genomic instability, it is thought that they are rare in the evolutionary history of modern species. In vertebrates, two rounds of whole genome duplication were first hypothesized by Susumu Ohno [128], after whom WGD duplicated genes are now referred to as “ohnologs”. In plants, whole genome duplications are believed to have resulted in the diversification of regulatory genes important to seed and flower development [92]. Similarly, there is evidence that a third round of whole genome duplication may have happened within the lineage of ray-finned fish [92]. Whole genome duplication may result in a substantial relaxation of selective pressure due to the scale of the duplication event. As a result, most of the duplicated genes go through the process of pseudogenization and are eventually lost from the genome; however, some of the genes are retained and diverge to a new functionality.

Gene Conversion

Earlier, I discussed how homologous recombination during unequal crossing-over between paralogous sequences can result in tandem gene duplications and deletions. In another process, known as gene conversion, homologous recombination between paralogous sequences results in the nonreciprocal transfer of sequence from one paralogue to the other. Gene conversion homogenizes paralogous sequences and consequently decelerates their divergence. This leads to the process of “concerted evolution”, in which higher sequence similarity is found among paralogous genes within one species than with other genes of the same gene family in another species, suggesting that members of a gene family do not evolve independently of each other [108]. The mechanistic basis of gene conversion is very well explained in [30]. Gene conversion is mainly initiated by a form of homologous recombination caused by DNA double-strand breaks (DSBs). During the process, the genetic information from intact homologous sequences is transferred to the region containing the DSB. Gene conversion can occur between homologous chromosomes, sister chromatids, or homologous sequences on either the same chromosome or different chromosomes. The gene conversion process is characterised by the following observations, which are described in [108]. First, gene conversion is a non-reciprocal transfer of genetic information between similar sequences, which does not involve exchange of flanking DNA. Second, gene conversion is very precise in homogenization, i.e. it only acts on specific DNA regions such as coding sequences. Third, the gene conversion process requires flanking-sequence identity between the affected DNA regions for its efficiency, i.e. the genomic context is also important in the process. Recently, Wang et al. [173] highlighted an interesting evolutionary scenario of concerted evolution of the mtrA gene family at the domain level.
They performed a phylogenetic and comparative genomic study of mtrA genes, in which they describe how the concerted evolution of a highly conserved MtrA domain is responsible for the convergent evolution of the mtrA gene family. According to their findings, the paralogs mtrA-1 and mtrA-2 of the same species have distinct transmembrane domains; however, the MtrA domain is significantly more similar between these paralogs than between orthologs in different species, at both the nucleic acid and the amino acid sequence level. Their findings suggest a unique example of convergent concerted evolution within a certain domain of two paralogous genes. They also compared the probable scenarios of recent domain duplication and concerted evolution of the MtrA domain, which is interesting because it compares two completely different but vaguely recognizable levels of evolutionary events, i.e. gene convergence and recent domain duplication. In subsequent sections, I will discuss domain definitions, evolutionary events, and their role in the broad picture of evolution.
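The comparison behind the concerted-evolution argument above, namely that a domain is more similar between paralogs within a species than between orthologs across species, can be illustrated with a small sketch. This is not the analysis of Wang et al.; the function names and the toy aligned sequences are hypothetical and serve only to show the shape of the calculation.

```python
def percent_identity(a, b):
    """Percent identity of two aligned sequences, skipping gapped columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def concerted_signal(domains):
    """Compare within-species paralog identity to between-species ortholog identity.

    domains maps (species, copy) to an aligned domain sequence and must hold
    copies "1" and "2" for exactly two species. Returns (within, between) as
    mean percent identities.
    """
    species = sorted({sp for sp, _ in domains})
    if len(species) != 2:
        raise ValueError("expected exactly two species")
    s1, s2 = species
    within = (percent_identity(domains[(s1, "1")], domains[(s1, "2")])
              + percent_identity(domains[(s2, "1")], domains[(s2, "2")])) / 2
    between = (percent_identity(domains[(s1, "1")], domains[(s2, "1")])
               + percent_identity(domains[(s1, "2")], domains[(s2, "2")])) / 2
    return within, between


# Toy, made-up alignment of one domain in two copies from two species.
toy = {
    ("A", "1"): "ACGTACGTAC",
    ("A", "2"): "ACGTACGTAT",
    ("B", "1"): "ACCTACGGAC",
    ("B", "2"): "ACCTACGGAC",
}

if __name__ == "__main__":
    within, between = concerted_signal(toy)
    print(within, between, within > between)
```

A within-species identity that clearly exceeds the between-species identity is the pattern expected under concerted evolution; a recent domain duplication could produce a similar pattern, which is why the two scenarios need to be compared explicitly.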

2.4 Domain Deﬁnitions

A gene is the basic physical structure made from DNA. Genes encode proteins, which are composed of evolutionary units called domains. There are many definitions of a protein domain; however, the three most established are the structural, evolutionary, and functional definitions [120]. Structurally, a domain is a stable, compact structural unit within the protein that corresponds to an independent folding unit [141, 36, 121]. From the evolutionary point of view, a domain can usually be seen as a conserved segment of a gene sequence. Since a protein performs its function through its domains, the functional perspective defines domains as recurring functional units. The distinction between these definitions is important when considering the possible evolutionary mechanisms by which multi-domain proteins have come into existence. For instance, in many cases, the functionality of a domain is conferred by amino acid residues that are not co-located in the polypeptide chain; therefore, the boundaries of a functional domain are sometimes vaguely defined. Hence, if a functional domain undergoes a duplication process, it will increase the number of functional segments. However, in general, these definitions often coincide; a domain is an evolutionarily conserved entity because it has a specific function that depends on its structural arrangement.

2.5 Domain Level Evolution

Domain level evolution provides a good illustration of how nature believes in reusing instead of reinventing while being more opportunistic in creating functional diversity. Domain level modularity provides a set of reusable components that expedite the evolution needed to perform a variety of functions. The genomic events that steer domain evolution can be classified on various levels, ranging from simple point mutations to large-scale genome changes. For instance, there are selectional forces acting at the sequence level of a domain that alter it or change it into a completely new domain. Although, in general, a domain is conserved over time because it is subject to purifying selection, a domain sequence can diverge by mutations, deletions, and insertions and eventually become a new domain by acquiring a function different from the original one [117]. For example, within the chitin-binding domain in the mammalian protein chitinase, there are six cysteines; mutation of any one of these to serine results in complete loss of its ability to metabolize insoluble chitin, while its ability to hydrolyze the soluble triacetylchitotriose is maintained [165]. At the gene level, during the replication process, slippage of the DNA polymerase can cause short intragenic duplications [170], whereas crossing-over events might result in larger, intergenic repeats [87]. In some genes, the exons correspond more or less exactly to domains [107]. In this case, a duplication of an exon corresponds to a duplication of the corresponding domain. Other fundamental gene level mechanisms, e.g., DNA-strand breakage and repair or transposition [87], can also alter the domain content. At the genome level, there is a strong indication [8] that most species share a set of common domain families. 
This motivates the perspective that there may have existed a finite set of domains at the root of the species tree, which was used to derive the vast functional space of today's protein repertoire. Consequently, the majority of the protein repertoire is composed of multi-domain proteins; around two-thirds of the proteins in prokaryotes and four-fifths of the proteins in eukaryotes have two or more domains [34]. Moreover, genomic complexity corresponds much better to the number of distinct domain combinations [15] than to the number of genes in the organism. Hence, an understanding of evolutionary events at the domain level is required to understand the function and structure of proteins.

Domain Duplication

Protein domains may have been duplicated and evolved through various levels of evolution. For instance, on the genome scale, whole genome duplications, which have been observed in the evolution of vertebrate genomes, resulted in the duplication of entire sets of gene families, thereby increasing the domain content [22, 162, 163]. On the gene scale, domains may have been duplicated through genetic mechanisms like unequal crossing-over, retro-transposition, recombination, and horizontal gene transfer [70, 34]. On the sub-gene or domain level, mechanisms like short intragenic duplications [170] and exon-shuffling played a crucial role in duplicating domains. It has been estimated that at least 70% of the domains are duplicated in prokaryotes, whereas in eukaryotes this estimate is 90% [9]. A recent survey of genes in eukaryotes showed that intragenic duplications may have occurred frequently [107] and are one of the major causes of gene elongation and increase in gene size. Tordai et al. [167] suggested that multiple copies of a single type of structural domain in multi-domain proteins indicate that they evolved through intragenic duplication of a domain. A gene can extend its capability or acquire a new function by domain duplication events. For instance, enzymes such as dehydrogenases have a coenzyme-binding domain, which evolved from a duplication of a primordial mononucleotide-binding domain [140]. Moreover, after a domain duplication, the two copies of the domain may diverge differently from the ancestral domain, enabling functional divergence of the gene. For example, in the immunoglobulin gene, the variable and constant region domains may have evolved from a common ancestral domain; however, they now have distinct functions, e.g., antigen recognition is performed by the variable region domains, while the constant region domains function as effector domains [101]. 
The KRAB-associated Zinc-Finger genes, in which a chromatin-interaction domain called KRAB is associated with tandem arrays of C2H2 Zinc-Finger domains, constitute a large family of transcription factor genes and present a classical example of domain gain and loss by recent domain duplications or losses [76]. The KRAB domain is a transcriptional repressor domain. Most KRAB-ZNF genes reside in large familial clusters, many of which contain substantial numbers of lineage-specific genes and varying numbers of Zinc-Finger domains, indicating recent domain and gene duplication events [158].

Gene-induced Domain Bifurcation

A bifurcation in the domain tree that arises through a bifurcation in the gene tree is called a gene-induced domain bifurcation or co-divergence [153]. The gene tree bifurcation may have arisen due to a gene duplication or a speciation event.

In the next chapter, I am going to introduce a probabilistic reconciliation-based framework for domain evolution that models domain duplications, domain losses, and domain bifurcations within the context of a gene and a species tree. The events in the history of a multi-domain family are inferred by ﬁtting a domain tree to a gene tree and a species tree. The proposed framework will not only provide the reconciliation of domain events to the corresponding gene and species trees, but also their timing relative to gene and species divergences. For this reason, we are deﬁning a set of domain level events to distinguish them in the context of gene and species evolution.

[Figure 2.1, panel a: gene tree TG inside the species tree TS; legend: speciation, gene duplication, gene loss. Panel b: domain tree TD inside the gene tree TG; legend: domain bifurcation, domain duplication, domain loss.]

Figure 2.1: a) The most parsimonious reconciliation between a gene tree TG (in blue) and a species tree TS (without color); b) the most parsimonious reconciliation of a domain tree TD with the corresponding gene tree. Gene level events are gene duplications (green), gene losses (red), and speciations (black). Domain level events are domain duplications (green), domain losses (red), and gene-induced domain bifurcations (black).

To illustrate the events at the domain and gene levels, the reconciliations between a gene tree and a species tree, and between a domain tree and a gene tree, are shown in figure 2.1. In figure 2.1 (a), a species tree TS (fat tree) is shown without color and a gene tree TG (in blue) is shown inside the species tree, thereby mapping gene level events to the species tree. Similarly, in figure 2.1 (b), a domain tree TD is shown inside the blue fat gene tree. A duplication event is marked in green, a loss is marked in red, and bifurcations are shown in black.

Chapter 3

Modelling Domain Evolution

3.1 Introduction

For many gene families, the common assumption that a gene evolves as a single unit, which underlies even most of the sophisticated models, is unrealistic. In analogy with the gene level, events like duplications, losses, and rearrangements can occur at the domain level [120] and, consequently, domains can also be orthologous or paralogous to each other [133]. With the growing availability of genome sequences, many computational methods have been developed for reconstructing gene evolution as taking place inside a species tree, e.g., [2, 137, 151]. The evolution of a gene family is then often represented by a gene tree with a corresponding species tree, and a reconciliation between them. The reconciliation provides a picture explaining the sequence of gene level evolutionary events that have occurred in the context of the species tree. The reconstruction of domain evolutionary history requires a similar strategy for relating domain evolution to gene and species evolution. Unfortunately, there is a lack of methods, in particular probabilistic methods, that explain domain evolution in the context of a gene and a species tree. Reconstruction of domain evolution in a protein domain family can provide valuable insight into the processes of gene and species evolution. By inferring the domain history inside the gene and species trees, we can reveal the context and the order in which domains may have evolved. Furthermore, considering domain evolution in the context of gene evolution makes it possible to correlate structural domain level changes with the functional characteristics of the gene. In this chapter, I will discuss approaches to modelling domain evolution and relating it to the corresponding gene and species trees. A brief introduction to models of evolution and their probabilistic framework for Bayesian analysis will also be a subject of this discussion. 
The first part of this chapter is devoted to traditional, as well as current, approaches to studying domain evolution. It includes the dot plot, phylogenetic methods, the domain architecture model, parsimony-based approaches, and reconciliation-based methods. Then follows an introduction to some interesting domain families that are analysed in the studies of this thesis. In the last section, I will discuss reconciliation-based analysis and probabilistic models for domain evolution.

3.2 Current Approaches

Domain evolution has been studied with a variety of tools and techniques. In this section, we start with the simpler approaches and proceed to the more advanced ones.

The Dot Plot

Among the approaches to studying domain evolution, dot plots are considered effective in terms of visualisation and simplicity. A dot plot, first introduced by Gibbs et al. [68], is basically a two-dimensional matrix that shows a shaded cell if the entries in the corresponding row and column are similar. The intensity of the shade is used to quantify the level of similarity between entries, as shown in figure 3.1. Dot plots are often used to compare two sequences by organizing one sequence on the x-axis, and the other on the y-axis, of a plot.

Figure 3.1: A dot plot [22] showing the pattern of internal domain duplications in a human protein sequence (ENSP00000303696) with C2H2 Zinc-Finger domains. A darker square reflects a higher alignment score between the corresponding domains. The numbers on each axis denote the order of the domains along the protein sequence from the N- to the C-terminus.

Dot plots can capture regions of local similarity or repetitive parts of sequences by showing one or more diagonal lines in addition to the main diagonal line. A dot plot is a useful tool for visualising similarity patterns; however, it can be noisy due to random matches between residues of the sequences. Codons or domains are considered a better choice than residues, since the probability of matching a codon or domain by chance is much lower than that of matching a single residue. Some studies have used dot plots to show the similarity pattern of domains in a protein sequence. For instance, Bjorklund et al. [22] used dot plots based on Smith-Waterman alignment scores to show the similarity pattern of C2H2 Zinc-Finger domains. They identified the duplication pattern of Zinc-Finger domains towards the end of the protein sequence, as shown in figure 3.1.
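The windowed comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the window length, and the match threshold are assumptions made here, and identity of residues stands in for the Smith-Waterman alignment scores used in [22].

```python
def dot_plot(seq_a, seq_b, window=3, min_matches=3):
    """Return a 0/1 matrix; cell (i, j) is 1 when the windows starting at
    seq_a[i] and seq_b[j] agree in at least `min_matches` positions.
    Comparing windows instead of single residues suppresses random matches."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    plot = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                plot[i][j] = 1
    return plot


def render(plot):
    """Draw the matrix as text: '#' for a shaded cell, '.' otherwise."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in plot)


if __name__ == "__main__":
    # Hypothetical sequence with one internal repeat (ACDEF occurs twice).
    seq = "ACDEFACDEF"
    print(render(dot_plot(seq, seq)))
```

Comparing the toy sequence with itself yields the main diagonal plus an off-diagonal line at an offset of five positions, which is the signature of an internal repeat.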

Phylogenetic methods

Typically, the first step in understanding the evolution of a protein domain is to analyse its phylogeny. There is a wealth of phylogenetic methods that have been applied to model domain evolution. Most of these methods have limitations with respect to accuracy and speed. For instance, Neighbour Joining (NJ) [143] is a popular tool in the biological community due to its simplicity and efficiency in the phylogenetic reconstruction of large datasets. NJ greedily [66] builds a tree by joining, in each iteration, the pair of taxa that minimizes a rate-corrected distance criterion, until a fully resolved tree is obtained. Many studies have used NJ with bootstrap analyses to investigate domain evolution; for instance, Romanel et al. [138] used NJ combined with bootstrap analysis to investigate the origin and diversification of the DNA-binding domain B3 in plants. In some cases, NJ is applied to define the subfamilies or clades within larger domain families. For instance, in a recent study [129], Oulion et al. performed a phylogenetic analysis of the Fibroblast Growth Factor (FGF) domains in vertebrates. A tree constructed using the NJ algorithm was used to classify the FGF domain family into eight subfamilies across several vertebrate species. Different variants of NJ have been designed in order to improve speed and accuracy, for instance BIONJ [65], RapidNJ [150], and FNJ [51]. The accuracy of the NJ method depends on the additivity of the observed distances used for the phylogenetic reconstruction. However, in general, the observed distances are not additive, because the actual number of substitutions between two sequences cannot be observed directly. Therefore, a phylogenetic hypothesis based on these distances is often erroneous. Bayesian and likelihood-based approaches have also been applied to reconstruct domain evolutionary history. For instance, Velthuis et al. 
[163] applied MrBayes [84] in their phylogenetic analysis to identify domain duplication and rearrangement events in the evolution of the PDZ/LIM domain family. In another study [146], a phylogenetic tree was constructed for the kinase domain family by combining sequence and structure information in a unified quantitative analysis [126]. Structural information was incorporated as a character matrix containing the 20 distinctive structural characteristics of the kinase domain. Recently, Stolzer et al. [153] performed a Bayesian analysis of the GuK and PDZ domains by applying MrBayes to infer domain level evolutionary events. However, their resulting posterior was flat, because domains are typically short and sequence data alone do not suffice to reconstruct the domain evolutionary history unless gene and species trees are taken into account. Bayesian methods for phylogenetic inference are accurate, but they require extensive computation to converge.
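The remark above that observed distances are generally not additive can be illustrated with a small numerical sketch. The observed proportion of differing sites (the p-distance) undercounts substitutions, because repeated changes at the same site are invisible. The Jukes-Cantor (JC69) correction used below is a standard textbook formula for nucleotide sequences; it is introduced here purely for illustration and is not one of the models discussed in this thesis.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(x != y for x, y in zip(seq_a, seq_b))
    return differences / len(seq_a)


def jc69_distance(p):
    """Jukes-Cantor estimate of substitutions per site from a p-distance."""
    if p >= 0.75:
        raise ValueError("p-distance is saturated under JC69")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # The gap between observed and corrected distance widens with divergence.
    for p in (0.05, 0.30, 0.60):
        print(f"observed {p:.2f} -> corrected {jc69_distance(p):.3f}")
```

For small divergences the two distances nearly coincide, whereas for larger ones the corrected estimate grows much faster than the observed one, which is why a tree built directly from observed distances can be misleading.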

Domain Architecture Model

The term "domain architecture" (DA) is used to denote the "bag of domains" model, in which a protein sequence is treated simply as a collection of domains. In the DA model, a multi-domain protein sequence is annotated as a set of tokens, i.e., domain names or ids representing its domain composition. The annotation is usually determined by scanning a standard domain database [57, 148, 121, 10, 114] consisting of probabilistic or alignment-based domain models. In this model, all instances of the same domain are indistinguishable. DA models have been used due to their computational efficiency, which is advantageous for genome-scale analyses [120, 27, 19, 33]. Several studies have applied the DA model to investigate domain co-occurrence [24], domain distribution across different organisms [182], domain combinations [17], domain substitutions and deletions [176], and domain promiscuity [18], i.e., the property of a specific domain to co-occur with many other domains. The DA model is known for its computational and conceptual tractability. However, it is based on several unrealistic simplifying assumptions; for example, the DA model considers all instances of a domain family to be indistinguishable, and does not use the sequence information, the domain order, or the number of copies of each domain within the domain family. Furthermore, the DA model can only capture change in the form of domain gain or loss, which cannot provide sufficient information about the events that caused the change. There are some approaches in which domain content is represented as a form of character data. In these approaches, the domain content of a protein sequence can be treated as binary character data, where 1 indicates the presence of a domain and 0 indicates its absence. For instance, Buljan et al. [27] used Dollo parsimony to infer ancestral domain architectures. 
In Dollo parsimony, a change of character state from zero to one is allowed only once, but a change from one to zero is allowed to happen multiple times [53]. This is appropriate when modelling domains, since gains are considered to be rare events compared to losses. This character-based representation has been used to estimate the pairwise edit distances [23] between domain architectures and to construct domain trees by applying standard phylogenetic methods to these distances [60].
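The binary character representation of domain architectures described above can be sketched as follows. The gene names and architectures are hypothetical, and a plain Hamming distance is used as a simple stand-in for the edit distances of [23].

```python
def presence_matrix(architectures):
    """Map each protein to a 0/1 vector over the sorted list of all domains.
    `architectures` maps a protein name to its list of domain names."""
    domains = sorted({d for doms in architectures.values() for d in doms})
    matrix = {
        name: [1 if d in set(doms) else 0 for d in domains]
        for name, doms in architectures.items()
    }
    return domains, matrix


def hamming(u, v):
    """Number of characters (domains) on which two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))


if __name__ == "__main__":
    # Hypothetical architectures; order and copy number are listed on purpose.
    architectures = {
        "geneA": ["KRAB", "C2H2", "C2H2", "C2H2"],
        "geneB": ["C2H2", "C2H2"],
        "geneC": ["PDZ", "LIM"],
    }
    domains, matrix = presence_matrix(architectures)
    print(domains)
    for name, row in matrix.items():
        print(name, row)
    print(hamming(matrix["geneA"], matrix["geneB"]))
```

Note that geneA and geneB differ only in the KRAB character, although they carry different numbers of C2H2 copies. This makes concrete the limitation stated above: the representation discards domain order, copy number, and sequence information.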

Parsimony-based approaches

The basic principle of parsimony approaches is Ockham's razor, which can be interpreted as "among competing hypotheses, the one with the fewest assumptions should be selected". Hence, the tree that describes the data with the fewest mutations, or character state changes, is considered the more plausible evolutionary hypothesis [44]. In the beginning, parsimony was only used in the analysis of discrete morphological character data, but the same principle was later extended to other evolutionary events. The events can, for instance, be gene duplications, gene losses, and lateral gene transfers, or in the context of domain evolution, domain duplications, domain losses, or domain transfers [153]. The tree with the minimum number of evolutionary events is then considered to be the most likely evolutionary hypothesis. The parsimony problem can be classified into 1) the small parsimony problem and 2) the large parsimony problem. In the small parsimony problem, given the tree topology and the character states at the leaves, the goal is to assign character states to the internal nodes of the tree such that the number of character state changes is minimized. A solution to this problem can be determined by Fitch's algorithm [58]. Extra constraints can be applied to model an event-based framework. For instance, Camin-Sokal parsimony imposes the constraint of irreversible mutations, like nucleotide deletions: a character may change in one direction only (here from state 1 to 0) and can never revert. Another type of parsimony is Dollo parsimony, which is designed for attributes that are hard to acquire but easy to lose, like protein domains. In the large parsimony problem, given the character data of the extant taxa, the goal is to find a minimum cost tree topology. The large parsimony problem is NP-complete [104]. 
Moreover, the small parsimony problem is a sub-problem of the large parsimony problem in the sense that a solution of the small parsimony problem is used to score each candidate tree of the large parsimony problem. The computational tractability and simplicity of parsimony-based methods compared to probabilistic methods have made them widely applicable. Some widely used tools are MEGA6 [159], PAUP [157] and NOTUNG [31]. Parsimony-based methods can be inconsistent and error prone in some cases, as demonstrated by Felsenstein [55], and their usage may lead to an erroneous explanation of the data.
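The small parsimony problem for a single character can be sketched with the bottom-up pass of Fitch's algorithm, which is enough to obtain the minimum number of state changes. In this sketch the tree encoding (nested tuples of leaf names) and the example states are assumptions chosen for illustration.

```python
def fitch(tree, leaf_states):
    """Bottom-up pass of Fitch's algorithm for one character.

    `tree` is either a leaf name (str) or a pair (left, right) of subtrees.
    Returns (candidate state set at this node, minimum number of changes)."""
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: take the union and count one state change.
    return left_set | right_set, left_cost + right_cost + 1


if __name__ == "__main__":
    # Hypothetical presence (1) / absence (0) of a domain in five extant genes.
    tree = ((("a", "b"), "c"), ("d", "e"))
    states = {"a": 1, "b": 1, "c": 0, "d": 0, "e": 0}
    root_set, changes = fitch(tree, states)
    print(root_set, changes)
```

In the large parsimony problem, a score such as `changes` would be computed for every candidate topology, which is the sense in which the small problem is a sub-problem of the large one.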

3.3 Reconciliation-based Approaches

The notion of reconciliation of the evolutionary history of a gene family, represented by a gene tree, in the context of a species tree, was first introduced by Goodman et al. [71]. A gene tree is often used to represent the evolutionary history of a gene family; its leaves correspond to the extant genes and its internal vertices represent evolutionary events, e.g., duplications and speciations. In contrast, a species tree is used to describe the phylogenetic relationship among species, with its leaves representing the extant species and its internal vertices representing speciation events. In the reconciliation analysis, a mapping of the internal vertices of the gene tree to the vertices or edges of the species tree is inferred, which illustrates the evolution of the gene family with respect to the species tree. In the simplest scenario, if the gene and species trees are congruent, i.e., they share the same topology, then all the genes of the family are orthologs. However, if the two trees differ in topology, then events such as gene duplications, gene losses, or lateral gene transfers can explain the discordance between the trees.
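One standard way to obtain such a mapping under a duplication-based parsimony criterion is the LCA (last common ancestor) mapping: each gene tree vertex is mapped to the most recent species tree vertex that is ancestral to all species below it, and a vertex is labelled a duplication when it maps to the same species vertex as one of its children. The sketch below is a minimal illustration; the tree encoding, the function names, and the example data are assumptions, and losses and transfers are not handled.

```python
def clusters(tree):
    """List the clades of a species tree as frozensets of leaf names.
    The clade of the subtree's own root is always the last entry."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    left, right = tree
    left_clades, right_clades = clusters(left), clusters(right)
    return left_clades + right_clades + [left_clades[-1] | right_clades[-1]]


def lca(species_clades, a, b):
    """Smallest species clade that contains both clades a and b."""
    target = a | b
    return min((c for c in species_clades if target <= c), key=len)


def reconcile(gene_tree, gene_to_species, species_clades, events):
    """Map a gene tree into the species tree and record one event per
    internal gene vertex, in post-order, as (species clade, event kind)."""
    if isinstance(gene_tree, str):
        return frozenset([gene_to_species[gene_tree]])
    left, right = gene_tree
    m_left = reconcile(left, gene_to_species, species_clades, events)
    m_right = reconcile(right, gene_to_species, species_clades, events)
    m_node = lca(species_clades, m_left, m_right)
    kind = "duplication" if m_node in (m_left, m_right) else "speciation"
    events.append((sorted(m_node), kind))
    return m_node


if __name__ == "__main__":
    # Hypothetical family with two copies in human and mouse, one in chicken.
    species_tree = (("human", "mouse"), "chicken")
    gene_tree = ((("h1", "m1"), ("h2", "m2")), "c1")
    gene_to_species = {"h1": "human", "h2": "human",
                       "m1": "mouse", "m2": "mouse", "c1": "chicken"}
    events = []
    reconcile(gene_tree, gene_to_species, clusters(species_tree), events)
    for clade, kind in events:
        print(kind, "at", clade)
```

In this toy example the vertex joining the two human-mouse subtrees is inferred as a duplication in the human-mouse ancestor, while the remaining internal vertices are speciations.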

Traditionally, reconciliation-based methods are designed on the parsimony principle. For instance, Goodman et al. [71] introduced the notion of reconciliation as finding the most parsimonious reconciliation (MPR) between a gene tree and a species tree, that is, a unique mapping of gene tree nodes to species tree nodes or edges that explains the discordance using a minimum number of duplications. Reconciliation methods have used many parsimony-based criteria to infer parsimonious evolutionary scenarios, such as minimizing gene duplications, gene losses, and lateral gene transfers. For instance, Durand et al. [50] proposed a two-phase method for gene tree reconstruction that takes both sequence evolution and duplications and losses into account. In the first phase, their method constructs a gene tree from sequence data using any of the known algorithms for phylogeny construction, while in the second phase, dynamic programming is used to rearrange the tree so as to minimize the duplication-loss cost. Hallett et al. [75, 166] were pioneers in introducing lateral gene transfers into parsimony-based methods. Other methods, including AnGST [43] and RANGER-DTL [16], simultaneously consider duplication, loss, and transfer events to find an optimal reconciliation. NOTUNG [31] is a commonly used parsimony-based reconciliation method that allows users to reconcile a gene tree with a, possibly nonbinary, species tree, with the optimal reconciliation minimizing the duplication-loss cost. Wu et al. [179] proposed the first parsimony-based reconciliation method for sub-gene level evolutionary units (which they defined as "modules"), taking into account module gain, loss, duplication, and rearrangement events. Their phylogenetic method simultaneously constructs the module phylogenies in the context of a common species tree using a standard duplication-loss model. Their algorithm works in three steps to construct the optimal module architecture scenario. 
In the first step, they used a duplication-loss model to construct the module trees with the species tree-aware algorithm SPIMAP [137]. In the second step, they reconciled each module tree to the species tree using the most parsimonious reconciliation to infer the ancestral module counts. In the third step, given the extant architectures present at the leaves of the species tree and the ancestral module counts based on the reconciled module trees, they reconstructed a module architecture scenario for each architecture family by minimizing the total number of events in the reconstruction. They proposed a dynamic programming technique to find an architecture scenario that parsimoniously explains domain level events such as generation, duplication, loss, merge (fusion), and split (fission). Their method is based only on the reconciliation information from the species tree, and completely ignores gene level events in the construction of module architecture scenarios. Since a gene tree is often in disagreement with the species tree, any inference based only on species level information can be misleading. Moreover, their method does not model the domain level events explicitly; rather, minimum cost-based scenarios are used to represent the evolution of domain architecture, which, similarly to other parsimony-based approaches, can be erroneous in some cases [55]. More recently, Stolzer et al. [153] proposed a reconciliation-based framework for multi-domain analysis. They presented it as a framework for the inference of domain level events by reconciling a domain tree with a gene tree that is itself reconciled with the species tree. Their algorithm produces candidate reconciliations in two passes by minimizing a linear cost function over the domain events; this cost function includes co-divergence, domain duplication, insertion, transfer, and loss. 
In the first pass, a dynamic program is used to calculate the cost of mapping a domain vertex to a gene tree vertex. The output of this pass consists of two tables, representing the event costs and the children mappings that are possible given the gene and species tree reconciliation information. In the second pass, an algorithm traverses the domain tree and generates a candidate reconciliation with the optimal cost. Their method is based on the parsimony criterion, i.e., minimizing the cost of domain events with respect to the gene and species trees, which may lead to an inaccurate description of evolutionary events in some cases [55]. In the following sections, I will provide introductory details about Bayesian inference, Markov Chain Monte Carlo-based techniques, and probabilistic models for modelling domain evolution.

3.4 Bayesian Inference

Bayesian inference methods are based on the apparently simple but powerful Bayes' rule, which states that:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$

where D denotes the data and θ denotes the parameters, and P(θ|D) is the posterior distribution over the set of parameters after observing the data. The probability P(θ) is the prior belief about the parameters before observing the data, while P(D|θ) is the likelihood of the data D given the parameters θ. The denominator term P(D) is the probability of data, which is a normalization constant. The probability of observing the data (likelihood of the data) times the prior probability is used to estimate the posterior and is proportional to the posterior distribution, i.e.,

$$P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$$
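As a small worked example of the rule above, consider a parameter θ restricted to three candidate values, so that P(D) is a simple sum rather than an integral. The coin-tossing setting, the candidate values, and the uniform prior are assumptions chosen purely for illustration.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule on a discrete parameter grid.
    `prior` and `likelihood` map each parameter value to P(theta) and P(D|theta)."""
    unnormalized = {theta: likelihood[theta] * prior[theta] for theta in prior}
    evidence = sum(unnormalized.values())  # P(D), the normalization constant
    return {theta: value / evidence for theta, value in unnormalized.items()}


if __name__ == "__main__":
    # Hypothetical data: 3 heads in 4 tosses; theta is the head probability.
    thetas = [0.2, 0.5, 0.8]
    prior = {theta: 1.0 / len(thetas) for theta in thetas}  # uniform prior
    # The binomial coefficient is constant in theta, so it cancels and is omitted.
    likelihood = {theta: theta ** 3 * (1.0 - theta) for theta in thetas}
    for theta, prob in posterior(prior, likelihood).items():
        print(theta, round(prob, 3))
```

With a uniform prior the posterior is simply the normalized likelihood, and most of the posterior mass (about 0.6) falls on θ = 0.8.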

In Bayesian modelling, the denominator term P(D) is often intractable, because it involves high-dimensional integrals over all possible values of the parameters. However, in a Markov Chain Monte Carlo (MCMC) based framework, this term cancels out when we compute the acceptance ratio between successive states. The prior probability P(θ) allows the researcher to incorporate biological knowledge in the form of a distribution over parameters. However, it can often be difficult to specify a prior in terms of a probability distribution. Non-informative priors are often used as an objective choice to represent the lack of prior knowledge about parameters. The simplest and oldest rule for determining a non-informative prior is to assign equal probabilities to all possibilities.

3.5 Markov Chain Monte Carlo

In phylogenetics, Markov Chain Monte Carlo-based techniques are often employed to estimate the posterior over possible evolutionary scenarios. When a high-dimensional state space is involved and the analytical solution is intractable, MCMC is typically considered the obvious choice. In MCMC, a Markov chain is set up in which every state represents a point in the parameter space and the transition into a state depends only on the previous state. The parameter space is sampled by a random walk with transition probabilities adjusted so that the relative frequency of each visited state converges, in the limit, to the posterior distribution. The posterior distribution is then summarized to estimate the marginal distribution of the parameters of interest. Although the random walk process of MCMC eventually converges to the target distribution after a sufficiently large number of samples, the initial samples may follow a different distribution. Therefore, these initial samples, belonging to the start of the Markov chain and known as the burn-in samples, are conventionally discarded. The Metropolis–Hastings (MH) algorithm is an MCMC method for sampling from a probability distribution for which direct sampling is difficult. At each iteration of the MH algorithm, a new state π′ is proposed by a proposal function, which perturbs one or more parameters of the current state π. The new state is then accepted or rejected with a probability that depends on the likelihoods of the current and proposed states, the forward and backward proposal probabilities, and the prior probabilities of the states. According to the MH algorithm, the acceptance probability A(π, π′) of a proposed state π′ is given by:

$$A(\pi, \pi') = \min\left\{1,\ \underbrace{\frac{f(D \mid \pi')}{f(D \mid \pi)}}_{\text{Likelihood Ratio}} \times \underbrace{\frac{f(\pi')}{f(\pi)}}_{\text{Prior Ratio}} \times \underbrace{\frac{f(\pi \mid \pi')}{f(\pi' \mid \pi)}}_{\text{Proposal Ratio}}\right\}$$

where f(D|π′) is the likelihood of the proposed state and f(π|π′) is the backward proposal probability. 
In the MH sampling process, given the current state, the proposal function proposes a new state, which is either accepted or rejected based on the acceptance probability. This process is repeated a sufficiently large number of times to converge to the stationary distribution. In this way, the distribution of sampled states approaches the target distribution as more and more states are visited. This method has frequently been used in phylogenetics to estimate the posterior distribution over evolutionary parameters, including phylogenetic trees, and is guaranteed to converge to its stationary distribution after a sufficiently large number of samples [85]. There are many variants of the MCMC method that have been designed to serve different purposes, e.g., group sampling. For group sampling, a variant known as Grouped Independence Metropolis-Hastings (GIMH) was proposed by Beaumont et al. [20] and later generalized by Andrieu et al. [6]. GIMH is applied in MCMC where the likelihood of a state depends not only on the fixed variables of the state, but also includes a sum or integral over some latent variables. The hallmark of GIMH is that the likelihood of a state is estimated using group sampling over the latent variables and that the estimate is not updated as long as the state remains the current state. Surprisingly, in contrast to the strategy in which the likelihood of the current state is re-estimated whenever it is compared with the likelihood of a newly proposed state, GIMH is guaranteed to converge to the stationary distribution targeted by the Metropolis-Hastings ratio used. MCMC-based methods are computationally expensive; however, with improvements in computational power and parallelization [155, 134], they are becoming increasingly feasible. Some of the popular methods in this field are MrBayes [84], PrIME [2], JPrIME [151], BEAST [49], BAli-Phy [156], and PhyloBayes [100]. 
In the next section, I will discuss the DLRS framework in general and its individual components: duplication-loss, rate, and substitution models.

3.6 The DLRS Model

The first probabilistic model that accounted for gene duplications and losses inside a species tree was proposed by Arvestad et al. in 2003 [12]. They used a birth-death process to model gene family evolution inside a species tree. In 2004, the same group introduced an integrated probabilistic model, known as the DLRS model of gene sequence evolution, that considered gene duplications, gene losses, and sequence evolution under a molecular clock [13]. Later, they relaxed the molecular clock assumption [2]. The DLRS model, proposed by Åkerborg et al. [2], consists of three submodels: a duplication-loss model (DL), a substitution rate model (R), and a sequence evolution model (S). Figure 3.2 illustrates these three submodels; a more detailed description of each submodel is given in the subsequent sections. The proposed method takes as input a multiple sequence alignment of a gene family, a dated species tree, and a mapping between the gene tree leaves and the species tree leaves, and uses a Bayesian MCMC-based framework to estimate a posterior distribution over gene trees, edge lengths, and other evolutionary parameters.

The Duplication-Loss Model The duplication-loss model employs a birth-death process to model the evolution of a gene family. The classical birth-death process was introduced by David G. Kendall [98] and has been used as a prior for phylogenetic trees in many studies [164, 122, 136]. The Duplication-Loss (DL) model, which is a linear birth-death process, has two parameters: a birth rate λ and a death rate µ. In the DL model, a gene lineage initiated at the root of the species tree undergoes births and deaths during the evolutionary time span determined by the species lineage. A birth in a species lineage corresponds to a gene duplication, which bifurcates a

[Figure 3.2 panels: (A) Duplication Loss Model (DL), with legend entries initial lineage, duplication, speciation, loss, and process finished; (B) Substitution Rates (R), showing clock relaxation; (C) Sequence Evolution (S), showing example sequences ACTA...GA and AGTA...GT]

Figure 3.2: The three submodels of the DLRS model [112]. (A) A gene lineage evolves inside a species tree according to the birth-death process. A gene lineage initiated at the root of the species tree may encounter a duplication event (represented by a green vertex) or a speciation event (represented by a black vertex). Whenever a gene lineage passes through a speciation vertex, it bifurcates into two independent gene lineages. A gene lineage may also be lost (represented by a red circle). After pruning all lost lineages, the final gene tree is obtained. (B) The molecular clock assumption is relaxed by drawing iid rates along the branches of the tree according to the rate model. (C) Finally, a sequence evolution model generates sequences over the gene tree with branch lengths.

gene lineage into two gene lineages, while a lineage that is lost before reaching the extant species is considered a gene loss. When a gene lineage passes through a speciation vertex in the species tree, it bifurcates into two new gene lineages that evolve independently according to the birth-death process. When the process reaches the leaves of the species tree, a gene tree has been formed, and lineages having no descendants in extant species are pruned, as shown in figure 3.2 (A). The birth-death process has been broadly applied in genomic analysis [74, 96, 89, 37], phylogenetic analysis of gene families [2, 113, 25], and probabilistic orthology analysis [149, 168]. In 1975, Thompson [164] used the birth-death process in a phylogenetic model to understand the evolution of human populations. Under the generalized birth-death model, Nee et al. [122] derived the geometric distribution for the number of lineages existing at any particular time during the process and the likelihood function for a reconstructed phylogeny, which provided the basis for the estimation of birth and death rates. Rannala and Yang [136] developed a phylogenetic model based on the birth-death process, along with a sequence evolution model, to estimate posterior probabilities of phylogenetic trees from molecular sequence data. The birth-death process was first used by Arvestad et al. [12], and further extended by Åkerborg et al. [2], to evolve the gene tree inside the species tree by modelling birth as gene duplication and death as gene loss. Subsequently, Rasmussen et al. [137] proposed an improved probabilistic technique that considers only the gene trees having a most parsimonious reconciliation (MPR) with the species tree, instead of traversing the entire search space of gene tree topologies. This restriction is advantageous because it shrinks the search space to gene trees with an MPR reconciliation, which greatly reduces the required computation without compromising the accuracy of the method in most cases, as shown in [48].
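As a rough illustration of the linear birth-death process underlying the DL model, the sketch below simulates the number of gene lineages that survive along a single species-tree edge, starting from one lineage. It is a simplified, assumed setting: the function name and parameter values are hypothetical, only one edge is simulated, and speciation and pruning over a full species tree are not modelled.

```python
import random


def simulate_lineage_count(birth_rate, death_rate, time_span, rng):
    """Gillespie-style simulation of a linear birth-death process on one edge.

    Starts from a single gene lineage and returns the number of lineages
    present after time_span. Rates are per lineage and per unit of time.
    """
    n_lineages, elapsed = 1, 0.0
    per_lineage_rate = birth_rate + death_rate
    while n_lineages > 0 and per_lineage_rate > 0.0:
        # Waiting time until the next event among all current lineages.
        elapsed += rng.expovariate(n_lineages * per_lineage_rate)
        if elapsed > time_span:
            break
        if rng.random() < birth_rate / per_lineage_rate:
            n_lineages += 1  # birth: gene duplication
        else:
            n_lineages -= 1  # death: gene loss
    return n_lineages
```

For a linear birth-death process the expected number of lineages after time t is exp((λ − µ)t), so averaging many runs with λ = 0.5, µ = 0.2, and t = 1 should give a value near exp(0.3) ≈ 1.35.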

The Rate Model Rates of molecular evolution can vary substantially among sites, genes, and lineages. Phylogenetic methods have been developed to take these forms of rate heterogeneity into account. For instance, in a rate model, a relaxed molecular clock assumption is used and rates are allowed to vary among lineages in a phylogenetic tree. The rates across lineages are modelled as independent and identically Gamma-distributed variables, parameterized by a mean and a variance [2]. The branch length of a phylogenetic tree is measured in expected number of substitutions per site, which is the product of time and a substitution rate. The time information is constrained by the species phylogeny, whereas the rate captures the divergence in terms of the number of substitutions per site per unit of time. Many studies [103, 109] have suggested relaxing the molecular clock assumption and using independent and identically distributed rates to model the rate variation on different lineages of the species tree.
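The relation between time, rate, and branch length can be sketched as follows. The function name and the parameter values are hypothetical; the sketch only assumes what is stated above, namely iid Gamma-distributed rates given by a mean and a variance, converted here to the shape and scale parameters expected by Python's `random.gammavariate`.

```python
import random


def sample_branch_lengths(edge_times, rate_mean, rate_variance, rng):
    """Draw one iid Gamma rate per edge and return branch length = time * rate."""
    shape = rate_mean ** 2 / rate_variance   # Gamma shape from mean and variance
    scale = rate_variance / rate_mean        # Gamma scale from mean and variance
    return [time * rng.gammavariate(shape, scale) for time in edge_times]
```

For example, `sample_branch_lengths([2.0, 0.5, 1.2], 0.5, 0.1, random.Random(3))` returns three positive branch lengths, in expected substitutions per site, whose expectations are 1.0, 0.25, and 0.6.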

The Substitution Model The molecular composition of today’s living organisms has been passed on from their ancestors through different evolutionary processes. Whatever modifications have occurred through these processes, the traces they left in molecular sequences carry significant information. Sequence evolution is a very complex process, and the models used to simulate it present a very simplified description of the evolutionary history of sequences. A phylogenetic tree is used to represent the evolutionary history, with branch lengths proportional to the expected amount of evolutionary change in the sequences. The process is represented by a substitution model that defines the substitution probabilities, i.e. each nucleotide has some probability of changing into another during an interval of time. Some commonly used nucleotide and amino-acid substitution models are briefly discussed below. The simplest of all substitution models is JC69, proposed by Jukes and Cantor [94], in which all nucleotide substitutions occur at an equal rate. After that, Kimura [99] modelled substitutions with different rates for transitions (from purine to purine or from pyrimidine to pyrimidine) and transversions (from purine to pyrimidine or vice versa). This model is called K80. In 1981, Felsenstein [56] relaxed the constraint of uniform nucleotide frequencies and allowed the substitution rates to be proportional to the equilibrium frequency of the target nucleotide; his model is known as F81. Later on, Hasegawa et al. [79] combined the two relaxations into one model known as HKY85, which allows unequal nucleotide frequencies as well as different rates for transitions and transversions. Models of sequence evolution have also been proposed for protein sequences.
For instance, Dayhoff and colleagues [45] introduced an empirical model of protein evolution, which was based on alignments of homologous protein sequences. Most protein sequences comprise distinct domains, which evolve at different rates and under different selective pressures [1, 61]. Many studies have suggested an a priori partition of the sequence data to model the differences in evolutionary rates. Yang et al. [180] proposed the first sequence evolution model with a priori partitions for DNA sequences evolving on a fixed tree. Furthermore, it is also possible to supply multiple data partitions to MrBayes [139]. Recently, Zoller et al. [184] proposed a maximum likelihood method that uses domain-based partitioning of the sequence data to apply different codon models for phylogenetic inference.
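For the two simplest nucleotide models in this section, the substitution probabilities along a branch have well-known closed forms, sketched below. The function names are hypothetical. The branch length d is assumed to be measured in expected substitutions per site, and `kappa` is assumed to be the ratio of the transition rate to the rate of each of the two transversions, so that K80 with `kappa = 1` reduces to JC69.

```python
import math


def jc69_probabilities(d):
    """Return (P(no change), P(change to one specific other base)) under JC69."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


def k80_probabilities(d, kappa):
    """Return (P(no change), P(transition), P(one specific transversion)) under K80."""
    decay_tv = math.exp(-4.0 * d / (kappa + 2.0))
    decay_ts = math.exp(-2.0 * d * (kappa + 1.0) / (kappa + 2.0))
    p_same = 0.25 + 0.25 * decay_tv + 0.5 * decay_ts
    p_transition = 0.25 + 0.25 * decay_tv - 0.5 * decay_ts
    p_transversion = 0.25 - 0.25 * decay_tv
    return p_same, p_transition, p_transversion
```

Each base has one possible transition and two possible transversions, so the three K80 values satisfy `p_same + p_transition + 2 * p_transversion == 1` up to rounding.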

3.7 Interesting Multi-domain Families

Gene families that encode proteins with two or more domains are called multi-domain families [9]. These multi-domain families contain proteins with different domain combinations [177]. Genome-wide domain annotation of protein sequences has revealed the proportion of multi-domain proteins in different forms of life: for instance, about 27% of proteins in prokaryotes, 40% in metazoans, and more than 70% in eukaryotes are multi-domain proteins [167, 77]. The proportion of multi-domain proteins increases with organismic complexity [167]. Multi-domain proteins have particular evolutionary and functional importance. They have many important functions, including binding, catalytic, and signalling activities. For instance, Yilmaz et al. [35] described the classification of Zinc-Finger proteins in Arabidopsis based on structural, functional, and motif characterization, and performed expression analysis to evaluate their responses to different environmental stimuli. The C2H2 Zinc-Finger is the most common DNA-binding domain found in eukaryotic transcription factor proteins; ZNF proteins typically contain multiple C2H2 domains joined together in tandem arrays. They regulate gene expression by binding to DNA via their so-called zinc-finger domain [86]. The binding specificity is encoded in the ZNF domains. The importance of this multi-domain protein family is evident from the fact that it stands out as the second largest gene family in human after the odorant receptor family and makes up ∼2% of all human genes, ∼3% of all genes in mammals, and ∼0.8% of all genes in Saccharomyces cerevisiae [46, 35]. From an evolutionary perspective, multi-domain protein families have played a vital role in lineage-specific evolution.

For instance, the KRAB-associated Zinc-Finger domain family, which represses the expression of target genes, has a striking number of recent lineage-specific gene and domain duplications [125]. The lineage-specific diversification of KRAB-ZNF genes suggests that the family has an important role in shaping the functional diversity of transcriptional regulation, which resulted in lineage-specific evolution of regulatory networks [76]. A number of important studies have investigated the role of multi-domain protein families in different diseases. For instance, recent studies [154, 11, 72] have provided ample evidence of strong links between multi-domain protein families and mutations associated with cancer. Similarly, the Reelin protein, which is a glycoprotein, has been implicated in cognitive deficits, as seen in schizophrenia, autism, bipolar disorder, and Alzheimer’s disease [54], due to its involvement in signalling pathways, memory formation, synaptic plasticity, and synaptic strength.

The C2H2 Zinc-Finger The C2H2-ZNF transcription factor genes encode proteins that selectively bind to specific DNA sequences through their zinc finger domains and regulate gene expression [46], as mentioned above. A C2H2 Zinc-Finger (ZNF) protein typically contains multiple C2H2 Zinc-Finger motifs placed together in tandem arrays, each of which forms a basic structural unit of 28 amino acids. The functional characteristics of a Zinc-Finger protein are determined by its associated N-terminal effector domain, such as the Kruppel-associated box (KRAB) domain, which represses the expression of target genes [86]. One of the important factors in primate-specific evolution is the recent segmental duplications that have occurred within the past 35 to 40 million years [125]. The duplicates resulting from these primate-specific duplications have shown positive selection, which indicates their divergence, potentially towards a new function. Among these recent primate-specific duplications, one gene family that has expanded considerably is the KRAB-associated Zinc-Finger genes [76, 86]. The expansion of the KRAB-ZNF genes is thought to diversify the regulatory specificity, which is determined by the ZNF domain recognition code used to bind the target sites. Thus, the KRAB-ZNF genes are able to evolve by changing the number and structure of their Zinc-Finger domains, which predictably alters their DNA-binding specificity [125]. More than 70% of all human Zinc-Finger genes are organized in clusters [158]. Significantly more clustered C2H2 Zinc-Finger genes are found in human and mouse than in chimpanzee and rat, respectively, which also signifies lineage-specific evolution after the divergence of the two primates, i.e. within the last 6 to 10 million years [76, 158].
Their clustering pattern provides an interesting insight into the evolutionary history of primates, in which one cluster originated from another, possibly due to segmental or gene duplication during early primate evolution [76]. The biological process behind these duplications can be either unequal crossing over, which forms large clusters, or DNA slippage, which duplicates small segments, e.g. one or a few domains [158, 183]. Moreover, these domains are prone to duplications and deletions since they appear in tandem, and sequence similarity is known to increase the rate of unequal crossover as well as DNA slippage. The tandem nature of the ZNF domains within C2H2-ZNF genes can be described by a process following the birth and death model, as mentioned in [124].

The Reelin domain Reelin is an extracellular glycoprotein that plays an essential role in early brain development and also has a role in regulating the structural plasticity of neural networks [81, 181]. It is essential in regulating neuronal migration, in which it determines the positioning of neurons within cortical structures of the central nervous system [40]. It also controls neuronal growth, maturation, and synaptic activity in the adult brain [181]. Reelin was discovered as a gene product absent in the reeler mouse, which is an autosomal recessive mouse mutant. Reeler mice exhibit brain abnormalities due to the deficit of Reelin protein [41].

[Figure 3.3 labels: N-Terminus; F-Spondin like domain; Antibodies: 142, G10, CR-50; Reelin Repeats 1 to 8; C-Terminus; each Reelin Repeat shown as Reelin Sub-repeat A, EGF, Reelin Sub-repeat B]

Figure 3.3: The domain architecture [38] of the human Reelin protein

The human reelin protein consists of 3461 amino acid residues [181]. The domain architecture of the reelin sequence consists of a signal sequence, an F-Spondin-like region, a unique region, and eight tandem repeats of 350–390 amino acid residues known as reelin repeats [40, 90]. Each reelin repeat is further divided into a central Epidermal Growth Factor (EGF) domain and two flanking sub-repeats, known as subrepeat-A and subrepeat-B, of length 150–190 amino acids, as shown in figure 3.3. Due to its vital role in the modulation of synaptic plasticity, reelin has important implications in several brain disorders, including Schizophrenia, Autism, and Alzheimer’s disease [123, 54]. For instance, many studies [54, 91, 73] have reported reduced expression of reelin in patients with Schizophrenia, Bipolar disorder, and Autism. The unique architecture of the reelin protein and its repeats, with a central EGF domain embedded in each of them, prompted us to perform an evolutionary analysis of the reelin domain family, using human reelin and its homologues in model organisms.

Chapter 4

Identifying Gene Families

4.1 Introduction

The term “homology” was first coined by Richard Owen [131] in 1843, who characterized homology as “the same organ in different animals under every variety of form and function”. In his book [130], he referred to an analog as a “part or organ in one animal which has the same function as another part or organ in a different animal”, which clearly differentiated analogs from homologs. Homology between the genes of a newly sequenced genome and those of existing genomes is used to predict its phylogenetic relationship with other species. Homology is considered an integral component of many genome-scale computational analyses, e.g. annotation of novel genes [178], prediction of gene function [132, 88], comparative genome mapping [29], and gene fusion analysis [115]. Because of their shared ancestry, homologous genes may also show significant similarities in their sequences. Sequence similarity between homologous genes often implies structural and functional similarity, but this is not always true. Traditionally, homology inference methods are based on sequence similarity. However, in the case of diverged gene families, it is often difficult to detect homologous genes due to their weak sequence similarity. Therefore, a method that combines sequence similarity with other information, e.g. synteny (genes with a conserved neighbourhood), can improve the accuracy of homology inference. Synteny is considered an important signal for the detection of homologous genes. For instance, Lemoine et al. [102] identified a set of orthologous genes, which they refer to as “positional orthologs”, based on their conserved local order. They combined a conventional similarity-based approach known as reciprocal smallest distance (RSD) [172] with synteny to detect the positional orthologs. Similarly, Dandekar et al. [39] have indicated gene order conservation in bacterial and archaeal genomes.
These studies suggest that the use of synteny can improve homology detection: by taking the syntenic context [88] into account, homology inference can be more accurate. In papers III and IV of the thesis, we devised an approach that combines synteny information with Neighbourhood Correlation [152]. Neighbourhood Correlation is a similarity-based approach that considers the domain architecture while identifying homology in multi-domain proteins. In this chapter, I will discuss some of the current homology inference methods and their limitations. In the first section, I will describe the notion of homology, mainly its definitions and concepts. I will also discuss approaches that are based on similarity networks, and in the last section, I will briefly describe our proposed technique, which uses synteny scores to improve homology inference.

4.2 Homology Definitions

A gene family is a group of genes that have evolved from a common ancestral gene through events such as speciations, gene duplications, gene losses, and lateral gene transfers, and that therefore have significant similarities in their sequences and functions [127]. The evolution of a gene family often takes the form of a phylogenetic tree, known as a phylogeny. The topology of the tree represents the phylogenetic relations among the genes of a gene family, and the internal nodes of the tree represent the evolutionary events. Thus the phylogeny of a gene family provides a concise summary of its evolution, i.e. how and when genes have diversified during the evolutionary time period. By mapping gene tree vertices to species tree vertices, a phylogenetic relationship may be classified as orthologous or paralogous, as described in the seminal work [58, 59] by Walter Fitch. Two genes are orthologous if they are derived from a single ancestral gene in the last common ancestor of the compared species. Alternatively, two genes in a gene family are paralogous if they belong to the same species or are derived from different genes in the last common ancestral species. Genes that are related through a lateral gene transfer event, involving an interspecies transfer of genetic material, are called xenologous. Paralogous genes can be further categorised into in-paralogs (due to a recent duplication after the speciation event) and out-paralogs (due to an ancient duplication before the speciation event). The reconciliation between the gene and species trees provides a scenario of gene family evolution consisting of evolutionary events such as speciations, gene duplications, and gene losses.
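The classification of gene pairs by mapping gene tree vertices to species tree vertices can be illustrated with the classical parsimony-based mapping, in which every gene tree vertex is mapped to the last common ancestor of the species of its leaves. This is a standard textbook technique used here only as a sketch; it is not the probabilistic reconciliation developed in this thesis, it ignores lateral gene transfer, and the tree encoding, gene names, and species names below are hypothetical.

```python
def ancestors(species, parent):
    """Return the path from a species tree vertex up to the root."""
    path = []
    while species is not None:
        path.append(species)
        species = parent[species]
    return path


def last_common_ancestor(a, b, parent):
    seen = set(ancestors(a, parent))
    for species in ancestors(b, parent):  # walks upwards, so the first hit is the lowest
        if species in seen:
            return species
    raise ValueError("vertices are not in the same species tree")


def reconcile(node, gene_species, parent, events):
    """Map a gene tree vertex to a species vertex and label internal vertices."""
    if isinstance(node, str):  # leaf: a gene name
        return gene_species[node]
    left = reconcile(node[0], gene_species, parent, events)
    right = reconcile(node[1], gene_species, parent, events)
    mapped = last_common_ancestor(left, right, parent)
    # A vertex mapped to the same species vertex as one of its children is a duplication.
    events[node] = "duplication" if mapped in (left, right) else "speciation"
    return mapped


def leaves(node):
    return {node} if isinstance(node, str) else leaves(node[0]) | leaves(node[1])


def relation(gene_tree, gene_a, gene_b, events):
    """Classify two distinct genes by the event at their lowest common gene tree vertex."""
    node = gene_tree
    while True:
        if {gene_a, gene_b} <= leaves(node[0]):
            node = node[0]
        elif {gene_a, gene_b} <= leaves(node[1]):
            node = node[1]
        else:
            return "orthologs" if events[node] == "speciation" else "paralogs"


# Hypothetical example: species tree ((human, chimp), mouse) and a five-gene family.
PARENT = {"human": "primates", "chimp": "primates", "primates": "root", "mouse": "root", "root": None}
GENE_SPECIES = {"h1": "human", "c1": "chimp", "h2": "human", "c2": "chimp", "m1": "mouse"}
GENE_TREE = ((("h1", "c1"), ("h2", "c2")), "m1")
EVENTS = {}
reconcile(GENE_TREE, GENE_SPECIES, PARENT, EVENTS)
```

In this example the vertex joining the two human-chimp pairs is labelled a duplication, so `h1` and `h2` are paralogs, while `h1` and `c1` (and `h1` and `m1`) are orthologs.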

4.3 Current Approaches

As discussed earlier, the members of a gene family have considerable sequence similarity, which can be used to identify homologous genes. The pioneering homology inference methods used sequence similarity as a heuristic, typically by employing BLAST [5] as a subroutine. For instance, BlastClust [47] and SiLiX [119] used single linkage clustering on BLAST scores to cluster homologous genes. However, because SiLiX is based on transitive clustering, it outputs large clusters, largely due to shared domains and large multi-domain families. Transitive clustering is based on the principle of transitivity, i.e. if sequence A is homologous to sequence B, and B is homologous to sequence C, then A, B, and C are clustered together into the same family regardless of the similarity between A and C. To overcome the problem of large clusters, another technique, HiFiX [118], has been proposed to refine these large clusters into smaller clusters by considering a Hidden Markov Model (HMM) based multiple sequence alignment likelihood score of the cluster. This technique is computationally expensive for large-scale analysis due to the computation of multiple sequence alignments and HMM scoring.
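The effect of transitivity can be seen in a small union-find sketch of single linkage clustering. This is a generic illustration, not the implementation of BlastClust or SiLiX; the function name, the score threshold, and the pairwise scores in the usage example are hypothetical.

```python
def single_linkage(sequences, scored_pairs, threshold):
    """Cluster sequences transitively: any pair scoring >= threshold is merged."""
    parent = {name: name for name in sequences}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for first, second, score in scored_pairs:
        if score >= threshold:
            parent[find(first)] = find(second)  # merge the two clusters

    clusters = {}
    for name in sequences:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=sorted)
```

With `single_linkage(["A", "B", "C", "D"], [("A", "B", 200), ("B", "C", 150)], 100)`, sequences A and C end up in the same cluster although no A-C hit exists, for instance because B shares a different domain with each of them; D remains a singleton.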

Many studies have also combined other information with similarity to improve homology inference. For instance, the SOAR and MSOAR [63, 64] algorithms use gene order information to assign orthologs by parsimoniously minimizing the rearrangement events between two genomes. The algorithm starts by identifying gene families using a homology search method, and establishes the correspondences among the gene family members on both genomes. Basically, it treats every member of a gene family on each genome as a copy of the same gene. Because of this annotation scheme, the problem turns into a rearrangement optimization problem, in which a sequence of duplicated genes on one genome is arranged on the other genome with the minimum number of rearrangement events. By applying this method, one can infer the positional orthologous gene pairs with the optimal number of rearrangement events between the genomes. Similarly, Han et al. [78] used local synteny information, i.e. a fixed neighbourhood region around each gene, to identify the parent orthologs among multiple lineage-specific duplicated genes. Jin et al. [95] combined gene order information with sequence similarity to identify mammalian orthologs.

Wapinski et al. proposed an algorithm known as SYNERGY [174] that uses the species tree and synteny information along with similarity to infer groups of orthologous genes (ortho-groups) and their phylogenetic relationships. They defined an ortho-group as a set of genes that are descended from a single common ancestor in their last common ancestral species. Another part of the homology problem is the detection of remote homology, that is, assigning gene families to diverged sequences with low sequence similarity, i.e. less than 25% sequence identity. Many methods target this problem; for example, PSI-BLAST [5] uses iterative BLAST to construct a position-specific scoring matrix (PSSM) profile for the detection of remote homologs. More sophisticated tools such as PHYRN [21] return phylogenies along with gene families by utilizing the Euclidean distance between sequence profiles. Overall, BLAST remains the underlying measure for most homology inference techniques. All-versus-all BLAST is often used as an initial step in homology detection approaches to construct a similarity network. In the next section, I will discuss approaches that exploit the topological features of this network.

4.4 Homology Inference using a similarity network

In a sequence similarity network, sequences are represented by nodes, and sequences whose similarity is above a predefined threshold are connected by edges with weights proportional to the similarity between the protein sequences corresponding to the incident nodes. Sequence similarity networks are very effective in relating multi-domain proteins [142, 135]. In an early use of similarity networks, Koonin and colleagues [160, 75] argued that orthologous groups form cliques in a sequence similarity network. In the context of domain evolution, Przytycka et al. [135] used similarity networks to investigate domain insertions, domain architecture persistence, and domain merging events. The structure of a sequence similarity network provides a basis for distinguishing homologous pairs from non-homologous pairs. For example, the neighbourhood of a node is defined as the set of nodes adjacent to it, i.e. the set of all sequences that have a score greater than the significance threshold. The neighbourhood of a node is an important local structure of a similarity network; for instance, a neighbourhood consisting of distinct nodes may reflect the promiscuous behaviour of a particular domain. Similarly, domain families characterized by a specific domain may form a cluster or community structure in the similarity network. There are methods [152, 52, 93, 105] that exploit the topological structure and edge weights of the sequence similarity network to identify gene families. For instance, hcluster_sg [105] is a similarity network based technique, used in earlier versions of TreeFam [106], that performs hierarchical agglomerative clustering on BLAST scores. The hcluster_sg heuristic is based on edge density, which is measured as the number of edges between a pair of clusters divided by the number of all possible edges between them. It applies a threshold on edge density to decide whether to merge two clusters during the agglomerative clustering steps. Song et al. [152] exploit the neighbourhood feature of the similarity network by observing that homologous genes tend to have a stronger shared neighbourhood than unique neighbourhood. Their formulation is based on correlation, which measures the strength of the shared neighbourhood between a pair of genes. In another clustering algorithm known as Markov clustering (MCL) [169, 52], a mathematical bootstrap procedure is used to find the cluster structure in a similarity network by iteratively applying matrix power operations. More details about these methods are given in the subsequent sections.
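The edge density criterion, as worded above, can be written down directly. The sketch below only illustrates that definition; it is not taken from the hcluster_sg source, and the function name, the edge encoding, and the example network are hypothetical.

```python
def edge_density(cluster_a, cluster_b, edges):
    """Fraction of all possible edges between two disjoint clusters that are present.

    edges is a set of frozensets, each holding the two endpoints of one edge.
    """
    present = sum(
        1
        for node_a in cluster_a
        for node_b in cluster_b
        if frozenset((node_a, node_b)) in edges
    )
    return present / (len(cluster_a) * len(cluster_b))


# Hypothetical network: two of the four possible edges between the clusters exist.
EDGES = {frozenset(("a1", "b1")), frozenset(("a2", "b1")), frozenset(("a1", "a2"))}
DENSITY = edge_density(["a1", "a2"], ["b1", "b2"], EDGES)  # 2 / 4 = 0.5
```

An agglomerative procedure of this kind would merge the two clusters only if this value exceeds a chosen threshold.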

Neighbourhood Correlation Song et al. proposed the Neighbourhood Correlation (NC) method, which computes the correlation between the neighbourhood structures of a pair of genes to identify the homology relationship between them. Their method is based on the observation that, in a sequence similarity network, genes related through duplication events have a neighbourhood pattern distinct from that of genes related via domain shuffling events. They extended the traditional model of homology (i.e. gene duplication and speciation events) to multi-domain family evolution. Homology has traditionally been defined in terms of families that evolve by vertical descent, i.e. by gene duplication and speciation events [58]; however, multi-domain families evolve via both vertical descent and domain insertion events, as shown in figure 4.1. In figure

Figure 4.1: The evolutionary scenario of a hypothetical multi-domain family evolving by gene duplication and domain insertion events, from Song et al. 2008 [152]. Genes in the a and b subfamilies evolved from a common ancestor; however, they have different domain compositions. Gene c shares a domain with genes in subfamily b, but b and c are not related by any common ancestral gene.

4.1, genes a and b are related through duplication events and share a homologous domain (in blue) inherited from their ancestral gene. In contrast, genes b and c, which are not related through vertical descent, are linked via a domain-only match due to the domain insertion event. Thus only a pair of sequences that share similarity due to a homologous domain inherited by descent (i.e. through gene duplication or speciation) will be considered legitimate members of a family. Based on this model, the underlying assumption of Neighbourhood Correlation is that the neighbourhoods of genes related through duplication or speciation will be more similar to each other than the neighbourhoods of genes linked by domain insertion events. Note that the term neighbourhood here should not be confused with the neighbourhood of a gene on a chromosome. In the context of a similarity network, the shared neighbourhood of a pair of genes is defined as the set of vertices that are adjacent to both members of the pair, whereas the unique neighbourhood corresponds to the set of vertices that are adjacent to one member of the pair but not the other. For a homologous pair of genes, the shared neighbourhood tends to be large, because homologous sequences share more vertices in their neighbourhoods than they have in their unique neighbourhoods. In contrast, a pair of genes related via a domain match due to a domain insertion event will contribute only to the unique neighbourhoods of the individual genes. Song et al. formalized the relation between shared and unique neighbourhoods by the Neighbourhood Correlation score of two sequences:

$$NC(x, y) = \frac{\sum_{i \in N} \left(S(x, i) - \bar{S}(x)\right)\left(S(y, i) - \bar{S}(y)\right)}{\sqrt{\sum_{i \in N} \left(S(x, i) - \bar{S}(x)\right)^2 \sum_{i \in N} \left(S(y, i) - \bar{S}(y)\right)^2}}$$

where S(x, i) is the normalized BLAST bit score [5] of query sequence x against database sequence i, the sums run over all sequences i in the database N, and $\bar{S}(x)$ is the mean of S(x, i) over all sequences i.
According to Song et al. [152], the neighbourhoods of adjacent sequences reflect the similarities and differences in their domain architectures. The number of edges in the shared and unique neighbourhoods is influenced by the rates of gene duplication and domain insertion events, while the edge weights represent the degree of sequence divergence. Therefore, a gene duplication in the same family increases the size and strength of the shared neighbourhood, while a domain insertion into a single member of the pair increases the size of the unique neighbourhood, and the Neighbourhood Correlation score varies accordingly. Thus, for homologous pairs, the shared neighbourhood tends to have more vertices and stronger edges than the unique neighbourhoods. The Neighbourhood Correlation score implicitly takes into account both sequence similarity and domain architecture comparison within the similarity network. Song et al. [152] have demonstrated the effectiveness of the Neighbourhood Correlation approach on a benchmark data set. In the next section, we will briefly describe the Markov clustering approach, which, in contrast to Neighbourhood Correlation, exploits the global structure of the similarity network.
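The Neighbourhood Correlation score is a Pearson correlation between the score profiles of two sequences, which the following sketch computes directly from the formula. The sequence names and scores are invented for illustration, and treating a missing hit as a score of 0 is an assumption of this sketch rather than a detail taken from Song et al.

```python
import math


def neighbourhood_correlation(scores, x, y, database):
    """Pearson correlation of the score profiles of x and y over the database.

    scores maps (query, target) pairs to normalized scores; missing pairs
    are treated as 0.0 here (an assumption of this sketch).
    """
    profile_x = [scores.get((x, i), 0.0) for i in database]
    profile_y = [scores.get((y, i), 0.0) for i in database]
    mean_x = sum(profile_x) / len(database)
    mean_y = sum(profile_y) / len(database)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(profile_x, profile_y))
    spread_x = sum((a - mean_x) ** 2 for a in profile_x)
    spread_y = sum((b - mean_y) ** 2 for b in profile_y)
    if spread_x == 0.0 or spread_y == 0.0:
        return 0.0  # a flat profile carries no neighbourhood signal
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical scores: a1 and a2 hit the same sequences, while c1 shares only b1 with them.
DATABASE = ["a1", "a2", "a3", "b1", "c1", "c2"]
SCORES = {
    ("a1", "a1"): 1.0, ("a1", "a2"): 0.9, ("a1", "a3"): 0.8, ("a1", "b1"): 0.5,
    ("a2", "a1"): 0.9, ("a2", "a2"): 1.0, ("a2", "a3"): 0.85, ("a2", "b1"): 0.4,
    ("c1", "c1"): 1.0, ("c1", "c2"): 0.9, ("c1", "b1"): 0.5,
}
```

Here `neighbourhood_correlation(SCORES, "a1", "a2", DATABASE)` is close to 1 because the two profiles share most of their neighbourhood, whereas the score for `a1` and `c1`, which are linked only through `b1`, is negative.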

The Markov Clustering As discussed earlier, multi-domain families characterized by a specific domain may form a cluster or community structure in a similarity network. Enright et al. refer to these community structures as “natural clusters” in similarity networks, which are characterised by the presence of many edges between the members of a cluster. In particular, there will be more edges and paths within the clusters (intra-cluster communication) than edges and paths connecting different clusters (inter-cluster communication) in a network. The Markov Clustering algorithm MCL [52] is designed to exploit this property of natural clusters in a similarity network by simulating flow in the network. The MCL algorithm deterministically computes the probabilities of random walks through the sequence similarity network based on stochastic matrices. A stochastic matrix (also known as a Markov matrix) is a non-negative matrix with the property that each of its columns sums to 1 (i.e. a column stochastic matrix).

Each column j of this matrix corresponds to the jth node in the network, and the entry in row i of column j is the probability of a transition from node j to node i. The MCL algorithm simulates random walks within a network by applying two operations, expansion and inflation, to the Markov matrix of the network. In the expansion operation, a power of the stochastic matrix is taken using ordinary matrix multiplication, whereas in the inflation operation an element-wise (Hadamard) power of the matrix is taken, followed by a rescaling of each column so that the resulting matrix remains stochastic. The expansion operation simulates random walks of many steps, which, in terms of stochastic flow, dissipates flow within the clusters and boosts inter-cluster communication. It computes a probability for each pair of nodes, where one node is taken as the start and the other as the end of a walk. Since the members of a cluster are more reachable from one another, owing to the abundance of paths between them, the probabilities associated with pairs of nodes lying in the same cluster will be relatively higher than those for pairs in which one node lies outside the cluster. The inflation operation has the opposite effect: it boosts the probabilities of intra-cluster walks and demotes those of inter-cluster walks. By applying expansion and inflation iteratively, the network is divided into partitions. Eventually, no path remains between these partitions, and the resulting partitions are taken as clusters. MCL thus treats the similarity network globally, exploiting its flow properties, in contrast to traditional homology inference methods, which infer homology in a pairwise manner from the local structure of the network.
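The alternation of expansion and inflation described above can be illustrated with a minimal Python sketch. It is not the MCL implementation of [52]: the function names, the added self-loops, the fixed number of iterations, the threshold `eps` and the way clusters are read off the final matrix are all assumptions made for illustration.

```python
def normalize_columns(m):
    """Rescale every column of a square matrix so that it sums to 1."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: square the matrix with ordinary matrix multiplication."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: element-wise (Hadamard) power, then column rescaling."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(weights, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster a weighted undirected graph given as a symmetric matrix."""
    n = len(weights)
    # Adding self-loops is a common convention and an assumption of this sketch.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Rows that still carry flow act as attractors; the columns (nodes)
    # sending flow to the same attractor row are reported as one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > eps)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical similarity network: two triangles joined by one weak edge.
example = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
print(mcl(example))
```

In this toy network the weak edge between nodes 2 and 3 carries little flow after the first inflation step, so the two triangles are expected to end up as separate partitions. A larger inflation exponent generally yields finer clusters.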

4.5 Synteny: An important signal for homology inference

Many studies have used synteny, or gene order conservation, to detect homologous and orthologous genes. For instance, Wapinski et al. [175] have shown the importance of synteny for identifying orthologues and paralogues in a fungal dataset. Kellis et al. [97] have shown, using synteny as evidence, that the yeast Saccharomyces cerevisiae arose from an ancient whole-genome duplication. Similarly, Lemoine et al. have investigated homology in prokaryotes by using gene order conservation [102]. Synteny has been used extensively to reconstruct the history of chromosomal rearrangements and to infer the map of conserved segments between different genomes [144]. For example, consider the conserved synteny map [69] of the human, mouse, and rat genomes in figure 4.2. For each chromosome of a species, there are columns for the other two species, coloured according to conserved synteny with their chromosomes. The same colour scheme is used for all species. The orthologous segments between human and rat, and between human and mouse, can easily be observed in the synteny map. The synteny map representation is useful for spatial analyses of genomes that elucidate rearrangements of the gene content. Genomes descended from a speciation or a whole-genome duplication event often share gene clusters. These clusters contain homologous genes, but their order and content are not strictly conserved [82]. This indicates that synteny can be used to identify genomic regions that share a common ancestry. By quantifying the synteny around these clusters, homology inference methods can be better informed and can identify homologous gene families more accurately. There are a number of approaches for extracting the syntenic neighbourhood of a gene, for example R-window [62], max-gap [80], and gene teams [110]. However, homology inference requires further processing of the content and the order of these gene clusters. In the next section, I will discuss techniques to quantify the syntenic neighbourhood of a gene for homology inference.

Figure 4.2: The map of conserved synteny between the human, mouse and rat genomes, taken from the Nature article “Genome sequence of the Brown Norway rat yields insights into mammalian evolution” [69].

4.6 Quantifying synteny for homologous regions

The first step in a comparison of two or more genomic sequences is to align the sequences in order to identify conserved regions. There are two types of alignment algorithms, local and global. Local alignment algorithms produce similarity scores between subregions of sequences. These algorithms first find common hits (matches) between sequences and then extend these matches as long as the quality of the match does not drop below a given threshold. Algorithms that measure synteny conservation within a fixed neighbourhood (window) around an anchor gene often employ local alignment tools to identify homologous hits within that region. For instance, the R-window algorithm [62] uses a window with size parameter R and counts the number of homologous pairs within the window in order to quantify synteny. The R-window algorithm applies BLAST [5] for computing local alignment scores. These algorithms are especially useful when sequences exhibit conservation of gene synteny but with jumbled order. Global alignment tools produce similarity scores where the sequences are expected to share similarity over their full length. Analogous to this, global synteny conservation techniques measure synteny over a complete chromosomal region without any restriction on the number of matches between the compared regions. To measure global synteny, these algorithms often employ a search technique to form a team of closely placed homologous genes on two or more chromosomes, referred to as a gene team [110]. These approaches extend the homologous region on both genetic segments as long as they are able to find a pair of homologs within a predefined distance. For example, the max-gap algorithm outputs gene teams around anchor genes [83]. It extends the search for homologous genes to the left and right of the current anchor gene to find the maximum number of genes in a team, and continues this search iteratively until it fails to find homologous genes within the vicinity determined by a gap size parameter. In our proposed approach, we used BLAST local alignment scores and neighbourhood correlation scores [4] to measure the synteny for a pair of genes within their fixed-size neighbourhoods. In the next section, I briefly describe our proposed method GenFamClust [4], which takes synteny information into account while inferring homology.
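The two families of techniques can be contrasted with a small Python sketch. It is a simplified illustration rather than the published R-window [62] or max-gap [83] algorithms: the function names, the `is_homolog` predicate, measuring distances in gene positions, counting the anchor genes themselves, and applying the gap criterion on one chromosome only are all assumptions.

```python
def window_pair_count(chrom1, pos1, chrom2, pos2, radius, is_homolog):
    """Window-based score: count homologous gene pairs found within
    `radius` genes of the two anchor positions (anchors included)."""
    win1 = chrom1[max(0, pos1 - radius): pos1 + radius + 1]
    win2 = chrom2[max(0, pos2 - radius): pos2 + radius + 1]
    return sum(1 for a in win1 for b in win2 if is_homolog(a, b))


def max_gap_extend(positions, anchor, max_gap):
    """Gap-based extension on a single chromosome.

    `positions` is a sorted list of gene indices that have a homolog on
    the other segment, and `anchor` is one of them. The team grows to the
    left and right while consecutive members are at most `max_gap`
    positions apart, and stops at the first larger gap on each side.
    """
    i = positions.index(anchor)
    lo = hi = i
    while lo > 0 and positions[lo] - positions[lo - 1] <= max_gap:
        lo -= 1
    while hi < len(positions) - 1 and positions[hi + 1] - positions[hi] <= max_gap:
        hi += 1
    return positions[lo:hi + 1]


# Hypothetical gene orders; genes sharing a first letter count as homologs.
chrom_a = ["a1", "b1", "c1", "d1", "e1"]
chrom_b = ["x2", "b2", "a2", "y2", "d2"]
same_family = lambda g, h: g[0] == h[0]

print(window_pair_count(chrom_a, 2, chrom_b, 2, 2, same_family))
print(max_gap_extend([2, 3, 5, 9, 10], anchor=5, max_gap=2))
```

The window-based count tolerates jumbled gene order inside the window but is bounded by the window size, whereas the gap-based extension can grow a team of arbitrary length as long as no gap exceeds the chosen parameter.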

4.7 Homology inference using synteny

Our proposed approach, GenFamClust, takes two datasets of sequences: a reference dataset R, used for collecting evidence of conserved gene neighbourhood (i.e. synteny) and similarity information, and a query dataset Q consisting of the genes for which we are interested in identifying the homology relationship, i.e. homologs or non-homologs. If no reference data are available, the query dataset is also used as the reference dataset. The GenFamClust approach uses neighbourhood correlation as its similarity measure and proposes a synteny correlation score SyC.

The synteny correlation score SyC is derived from a local synteny score SyS. Please refer to papers III and IV of the thesis for more technical and algorithmic details. In the first step, we performed a unidirectional Q versus R BLAST [5] to compute the bit score statistics for each gene in the query dataset Q. The NC scores, based on log-transformed bit scores, are computed in the second step. Given the NC scores NC(g1, g2) for genes g1 and g2, the synteny scores SyS(g1, g2) are calculated in the third step as follows:

SyS(g1, g2) = max{NC(a, b): a ∈ n(g1), b ∈ n(g2)}

where n(g) represents the fixed neighbourhood of a gene g. The fixed neighbourhood of a gene is defined as the set of genes at a distance of at most k upstream or downstream from g on a chromosomal region. We investigated different values of k to assess the influence of synteny on homology inference, based on the benchmark by Song et al. [152]. We found k = 5 appropriate for estimating the local synteny between genes on the Metazoa dataset. Recently, Anselmetti et al. [7] used a local neighbourhood of size two for the reconstruction of ancestral gene synteny in mammalian genomes. However, higher values of k can be used for pairs of species with less gene order conservation.

In the fourth step of our approach, we computed the synteny correlation score SyC(g1, g2) between genes g1 and g2, based on the same formulation as the Neighbourhood Correlation score:

\[
\mathrm{SyC}(g_1, g_2) = \frac{\sum_{i \in H} \bigl(\mathrm{SyS}(g_1, i) - \overline{\mathrm{SyS}}(g_1)\bigr)\bigl(\mathrm{SyS}(g_2, i) - \overline{\mathrm{SyS}}(g_2)\bigr)}{\sqrt{\sum_{i \in H} \bigl(\mathrm{SyS}(g_1, i) - \overline{\mathrm{SyS}}(g_1)\bigr)^2 \, \sum_{i \in H} \bigl(\mathrm{SyS}(g_2, i) - \overline{\mathrm{SyS}}(g_2)\bigr)^2}}
\]

where H = ncHits(g1) ∩ ncHits(g2), ncHits(g) = {i | i ∈ Q ∪ R, NC(g, i) ≥ β}, and $\overline{\mathrm{SyS}}(g)$ denotes the mean synteny score of g. Synteny correlation scores are computed for each gene pair (g1, g2) with NC(g1, g2) > β, where β is a minimum threshold on the NC score. The synteny correlation score is arguably robust in collecting evidence of gene order conservation from a range of homologous regions within a reference dataset R consisting of multiple species.

For instance, TNFRSF1A and CD27 are two homologous genes of the TNFR domain family, found in human and mouse respectively, in the benchmark dataset by Song et al. [152]. The domain architectures of the proteins encoded by these genes are given in figure 4.3, where the CD27 protein has an additional Death_TNFR1 domain that is not present in TNFRSF1A. The Neighbourhood Correlation score for the pair is 0.351, below the threshold of 0.5 suggested for homologous gene pairs by Joseph and Durand [93]. The genes CD27 and TNFRSF1A are found very close to a cluster of identified members of the TNFR domain family on the chromosomal regions of human and mouse, as shown in figure 4.3 (which displays the region along with the NC-hits > 0.9 of genes within the fixed neighbourhoods around CD27 and TNFRSF1A). The synteny correlation score SyC(CD27, TNFRSF1A) in this case is 0.983, which captures the high degree of common syntenic context around CD27 and TNFRSF1A through their common hits in the reference dataset.
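The third and fourth steps described above can be sketched in Python as follows. This is an illustrative reading of the two formulas, not the GenFamClust implementation: the function names are hypothetical, the neighbourhood is assumed to exclude the gene itself, and the mean synteny score is assumed to be taken over the common hits H.

```python
import math


def neighbourhood(chromosome, gene, k):
    """n(g): genes at most k positions upstream or downstream of `gene`.
    Excluding the gene itself is an assumption of this sketch."""
    pos = chromosome.index(gene)
    lo = max(0, pos - k)
    return chromosome[lo:pos] + chromosome[pos + 1: pos + k + 1]


def sys_score(n1, n2, nc):
    """SyS(g1, g2): the maximum NC(a, b) over a in n(g1) and b in n(g2)."""
    return max((nc(a, b) for a in n1 for b in n2), default=0.0)


def syc_score(sys1, sys2, hits):
    """SyC(g1, g2): Pearson-style correlation of the SyS profiles of g1
    and g2 over the common hits H, with sys1[i] = SyS(g1, i) and
    sys2[i] = SyS(g2, i). Means are taken over H (an assumption)."""
    if not hits:
        return 0.0
    m1 = sum(sys1[i] for i in hits) / len(hits)
    m2 = sum(sys2[i] for i in hits) / len(hits)
    num = sum((sys1[i] - m1) * (sys2[i] - m2) for i in hits)
    den = math.sqrt(sum((sys1[i] - m1) ** 2 for i in hits)
                    * sum((sys2[i] - m2) ** 2 for i in hits))
    return num / den if den > 0 else 0.0


# Hypothetical NC scores between the neighbours of two genes.
nc_table = {("a", "x"): 0.2, ("b", "y"): 0.7}
nc = lambda a, b: nc_table.get((a, b), 0.0)
print(sys_score(["a", "b"], ["x", "y"], nc))
```

Because SyC is a correlation, it rewards gene pairs whose synteny scores rise and fall together across the same hits, even when the direct NC score between the two genes is low, as in the CD27 and TNFRSF1A case.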

[Figure 4.3 graphic. Panel (A): domain architectures of TNFRSF1A and CD27 (specific hits: TNFR, Death_TNFR1; superfamilies: TNFR superfamily, DD_superfamily). Panels (B) and (C): neighbouring genes on Homo sapiens chromosome 12, Pongo abelii chromosome 12 and Mus musculus chromosome 6 around TNFRSF1A (Homo) and CD27 (Pongo, Mus), with known TNFR homologs marked, NC(g1, g2) = 0.351 for the gene pair, and links drawn for NC(gi, gj) > 0.9.]

Figure 4.3: Case study [3]: Synteny is adding useful information to the homology inference. The CD27 gene in Mus musculus and TNFRSF1A gene in Homo sapiens are homologous and belong to the TNFR superfamily [152]. (A) The domain architecture of CD27 and TNFRSF1A is shown. (B) and (C) Gene synteny conservation for Homo sapiens chr. 12, Pongo abelii chr. 12 and Mus musculus chr. 6 containing genes CD27 and TNFRSF1A is shown. Common NC-hits between neighbourhood of CD27 and TNFRSF1A genes are connected by (blue) lines and used for calculating synteny correlation score between both the genes. Only hits with NC-scores greater than 0.9 are shown.

Thus, the synteny correlation score, when combined with the NC score, identifies these genes as a homologous gene pair. In conclusion, the synteny correlation score is an effective and practically feasible measure for collecting syntenic support for homology inference.

Homology and gene family inference is an important task and a prerequisite for functional, structural, and phylogenetic analyses. Multi-domain proteins pose a challenge for homology inference in the presence of gene- and domain-level events. Based on the neighbourhood correlation model of homology for multi-domain proteins, we have proposed GenFamClust, a novel approach that quantifies synteny along with similarity across multiple genomes. It formalizes the use of synteny and similarity for homology inference of a gene pair and provides a considerable improvement in accuracy compared to similarity-only algorithms.

Chapter 5

Present Investigations

Paper I: We propose DomainDLRS, a comprehensive, hierarchical Bayesian method based on the DLRS model [2], which models domain evolution as taking place inside the gene and species trees. The method incorporates a birth-death process according to which domains are duplicated and lost, together with a domain sequence evolution model with a relaxed molecular clock. The method employs a variant of the Markov chain Monte Carlo technique, namely Grouped Independence Metropolis-Hastings, for the estimation of the posterior distribution over domain trees with lengths, gene tree topology and other evolutionary parameters. We moreover introduce a novel technique to sample datings of the vertices that the MPR places on edges of a host tree, i.e., either the species tree or the gene tree. This technique allows us to first sample a dating of the gene tree using the dated species tree as host tree and then sample a dating for each of the domain trees using the dated gene tree as host tree. Using this approach, we analyse several interesting domain families, namely ZNF91, ZNF468, ZNF558, ZNF679, ZNF611, ZNF764, and PRDM9. We found domain duplications that are recent with respect to gene and species evolution, which provides an interesting insight into domain evolution inside the gene and species trees. By testing DomainDLRS on synthetic and biological (Zinc-Finger and PRDM9 multi-domain families) datasets, we show that our method provides the most parsimonious explanation of domain evolution in the context of the gene and species trees, surpassing MrBayesMPR, a traditional approach based on MPR analysis of MrBayes-generated domain trees. We also present an analysis of domain duplication patterns, based on recent domain duplication events in the Zinc-Finger and PRDM9 domain families.
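The general idea behind Grouped Independence Metropolis-Hastings, in which an intractable likelihood is replaced by an unbiased Monte Carlo estimate that is stored together with the current state, can be illustrated on a toy problem. The sketch below is not DomainDLRS: the Gaussian latent-variable model, the prior, the data and all function names are assumptions chosen only so that the example is self-contained.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def estimate_log_likelihood(theta, data, n_latent, rng):
    """Log of an unbiased Monte Carlo estimate of p(data | theta).

    Toy model (assumed): z ~ N(theta, 1) and y | z ~ N(z, 1), with one
    independent latent z per observation, integrated out by sampling.
    """
    total = 0.0
    for y in data:
        draws = (normal_pdf(y, rng.gauss(theta, 1.0), 1.0)
                 for _ in range(n_latent))
        total += math.log(sum(draws) / n_latent)
    return total


def log_prior(theta):
    return math.log(normal_pdf(theta, 0.0, 10.0))


def gimh(data, n_iter=3000, n_latent=30, step=0.5, seed=1):
    rng = random.Random(seed)
    theta = 0.0
    log_post = log_prior(theta) + estimate_log_likelihood(
        theta, data, n_latent, rng)
    samples = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        # A fresh estimate is drawn for the proposal only; the estimate
        # attached to the current state is kept and reused as it is.
        log_post_new = log_prior(proposal) + estimate_log_likelihood(
            proposal, data, n_latent, rng)
        if rng.random() < math.exp(min(0.0, log_post_new - log_post)):
            theta, log_post = proposal, log_post_new
        samples.append(theta)
    return samples


if __name__ == "__main__":
    observations = [1.2, 2.9, 0.4, 3.1, 2.2, 1.7, 2.5, 0.9, 3.6, 1.5]
    chain = gimh(observations)
    kept = chain[1000:]
    print(sum(kept) / len(kept))
```

Reusing the stored estimate for the current state, rather than re-estimating it at every iteration, is what keeps the exact posterior as the stationary distribution despite the noise in the likelihood estimates.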

Paper II: The reelin protein, with its unique domain architecture consisting of eight reelin repeats, stands out as an exceptional model for evolutionary studies, as it is well conserved across species, including humans. We construct an identity matrix of individual reelin repeats from 167 species to represent the average percentage identity across and between the reelin repeats. Sequence similarity shows that each repeat has an average of 88% identity to its corresponding repeat in other species and an average of 33% identity with the other repeats within the protein sequence. We perform a phylogenetic analysis using the neighbour joining algorithm to hypothesize the origin of the reelin protein. The phylogenetic analysis shows that the reelin protein and the reelin domains might have originated in Arthropoda and have retained a consistent presence through to mammals. We present a 3D model of the full-length human reelin and suggest key residues at the putative dimer interface region to highlight the functional importance of the reelin repeats.

Paper III: In paper III, we propose GenFamClust, an approach that uses the Neighbourhood Correlation [152] method to quantify synteny by using the network structure of similarity across multiple genomes. GenFamClust combines synteny with similarity in a more informed and accurate way to identify homologous gene pairs. To validate the synteny measure of GenFamClust, we applied GenFamClust to the human and mouse dataset and compared the result with the synteny map of the human and mouse genomes published in [32]. The synteny map reconstructed by GenFamClust is almost perfectly aligned with the original synteny map of the human and mouse genomes. We also compared the accuracy of homology inference of our method with that of Neighbourhood Correlation on a synthetic dataset as well as a biological dataset of eukaryotic species. The results show a significant improvement for the synthetic dataset, particularly for datasets containing homologs with syntenic support (e.g., orthologs, and paralogs related by segmental duplications).

Paper IV: In paper IV, we evaluated the effectiveness of GenFamClust on a semi-manually curated dataset of fungal genomes, consisting of 20 fungal species from the Yeast Gene Order Browser (YGOB) database [28]. Moreover, we tested the accuracy of GenFamClust in comparison with other homology inference approaches such as BLAST [5], hcluster [105], SiLiX [119], HiFiX [118], Neighbourhood Correlation [152], and MCL [52], using single linkage, complete linkage, and average linkage clustering, on the Metazoan benchmark dataset by Song et al. [152]. Our results show that the use of synteny together with similarity provides a significant improvement in homology inference, and that synteny-aware homology inference is more accurate than similarity-only methods. Our proposed measure for synteny quantification is effective in measuring syntenic conservation, as observed in diverse biological datasets as well as in simulated datasets.

Bibliography

[1] Alexei A Adzhubei, Ivan A Adzhubei, Igor A Krasheninnikov, and Stephen Neidle. Non-random usage of ‘degenerate’ codons is related to protein three-dimensional structure. FEBS letters, 399(1):78–82, 1996.

[2] Örjan Åkerborg, Bengt Sennblad, Lars Arvestad, and Jens Lagergren. Simultaneous bayesian gene tree reconstruction and reconciliation analysis. Proceedings of the National Academy of Sciences, 106(14):5714–5719, 2009.

[3] Raja H Ali, Sayyed A Muhammad, and Lars Arvestad. Genfamclust: an accurate, synteny-aware and reliable homology inference algorithm. BMC evolutionary biology, 16(1):1, 2016.

[4] Raja H Ali, Sayyed A Muhammad, Mehmood A Khan, and Lars Arvestad. Quantitative synteny scoring improves homology inference and partitioning of gene families. BMC bioinformatics, 14(Suppl 15):S12, 2013.

[5] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J Lipman. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research, 25(17):3389–3402, 1997.

[6] Christophe Andrieu and Gareth O Roberts. The pseudo-marginal approach for efficient monte carlo computations. The Annals of Statistics, pages 697–725, 2009.

[7] Yoann Anselmetti, Vincent Berry, Cedric Chauve, Annie Chateau, Eric Tannier, and Sèverine Bérard. Ancestral gene synteny reconstruction improves extant species scaffolding. BMC genomics, 16(Suppl 10):S11, 2015.

[8] Gordana Apic, Julian Gough, and Sarah A Teichmann. Domain combinations in archaeal, eubacterial and eukaryotic proteomes. Journal of molecular biology, 310(2):311–325, 2001.

[9] Gordana Apic, Julian Gough, and Sarah A Teichmann. An insight into domain combinations. Bioinformatics, 17(suppl 1):S83–S89, 2001.


[10] Rolf Apweiler, Terri K Attwood, Amos Bairoch, Alex Bateman, Ewan Birney, Margaret Biswas, Philipp Bucher, Lorenzo Cerutti, Florence Corpet, Michael D. R. Croning, et al. The interpro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic acids research, 29(1):37–40, 2001.

[11] S Arena, S Benvenuti, and A Bardelli. Genetic analysis of the kinome and phosphatome in cancer. Cellular and Molecular Life Sciences CMLS, 62(18): 2092–2099, 2005.

[12] Lars Arvestad, Ann-Charlotte Berglund, Jens Lagergren, and Bengt Sennblad. Bayesian gene/species tree reconciliation and orthology analysis using mcmc. Bioinformatics, 19(suppl 1):i7–i15, 2003.

[13] Lars Arvestad, Ann-Charlotte Berglund, Jens Lagergren, and Bengt Sennblad. Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. In Proceedings of the eighth annual international conference on Research in computational molecular biology, pages 326–335. ACM, 2004.

[14] Francisco J Ayala, Martin L Tracey, Dennis Hedgecock, and Rollin C Richmond. Genetic differentiation during the speciation process in drosophila. Evolution, pages 576–592, 1974.

[15] DV Babushok, EM Ostertag, and HH Kazazian Jr. Current topics in genome evolution: molecular mechanisms of new gene formation. Cellular and molecular life sciences, 64(5):542–554, 2007.

[16] Mukul S Bansal, Eric J Alm, and Manolis Kellis. Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics, 28(12):i283–i291, 2012.

[17] Matthew Bashton and Cyrus Chothia. The geometry of domain combination in proteins. Journal of molecular biology, 315(4):927–939, 2002.

[18] Malay Kumar Basu, Liran Carmel, Igor B Rogozin, and Eugene V Koonin. Evolution of protein domain promiscuity in eukaryotes. Genome research, 18 (3):449–461, 2008.

[19] Malay Kumar Basu, Eugenia Poliakov, and Igor B Rogozin. Domain mobility in proteins: functional and evolutionary implications. Briefings in bioinformatics, page bbn057, 2009.

[20] Mark A Beaumont. Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):1139–1160, 2003.

[21] Gaurav Bhardwaj, Kyung Dae Ko, Yoojin Hong, Zhenhai Zhang, Ngai Lam Ho, Sree V Chintapalli, Lindsay A Kline, Matthew Gotlin, David Nicholas Hartranft, Morgen E Patterson, et al. Phyrn: a robust method for phylogenetic analysis of highly divergent sequences. PloS one, 7(4):e34261, 2012.

[22] Åsa K Björklund, Diana Ekman, and Arne Elofsson. Expansion of protein domain repeats. 2006.

[23] Åsa K Björklund, Diana Ekman, Sara Light, Johannes Frey-Skött, and Arne Elofsson. Domain rearrangements in protein evolution. Journal of molecular biology, 353(4):911–923, 2005.

[24] E Bornberg-Bauer, F Beaussart, SK Kummerfeld, SA Teichmann, and J Weiner III. The evolution of domain arrangements in proteins and interaction networks. Cellular and Molecular Life Sciences CMLS, 62(4):435–445, 2005.

[25] Bastien Boussau, Gergely J Szöllősi, Laurent Duret, Manolo Gouy, Eric Tannier, and Vincent Daubin. Genome-scale coestimation of species and gene trees. Genome research, 23(2):323–330, 2013.

[26] Terence A Brown. Genomes. Garland science, 2006.

[27] Marija Buljan, Adam Frankish, and Alex Bateman. Quantifying the mechanisms of domain gain in animal proteins. Genome Biol, 11(7):R74, 2010.

[28] Kevin P Byrne and Kenneth H Wolfe. The yeast gene order browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome research, 15(10):1456–1461, 2005.

[29] Ethan A Carver and Lisa Stubbs. Zooming in on the human–mouse comparative map: Genome conservation re-examined on a high-resolution scale. Genome Research, 7(12):1123–1137, 1997.

[30] Jian-Min Chen, David N Cooper, Nadia Chuzhanova, Claude Férec, and George P Patrinos. Gene conversion: mechanisms, evolution and human disease. Nature Reviews Genetics, 8(10):762–775, 2007.

[31] Kevin Chen, Dannie Durand, and Martin Farach-Colton. Notung: a program for dating gene duplications and optimizing gene family trees. Journal of Computational Biology, 7(3-4):429–447, 2000.

[32] Asif T Chinwalla, Lisa L Cook, Kimberly D Delehaunty, Ginger A Fewell, Lucinda A Fulton, Robert S Fulton, Tina A Graves, LaDeana W Hillier, Elaine R Mardis, John D McPherson, et al. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520–562, 2002.

[33] Cyrus Chothia and Julian Gough. Genomic and structural aspects of protein evolution. Biochem. J, 419:15–28, 2009.

[34] Cyrus Chothia, Julian Gough, Christine Vogel, and Sarah A Teichmann. Evolution of the protein repertoire. Science, 300(5626):1701–1703, 2003.

[35] S Ciftci-Yilmaz and R Mittler. The zinc finger network of plants. Cellular and Molecular Life Sciences, 65(7-8):1150–1160, 2008.

[36] Andrew FW Coulson and John Moult. A unifold, mesofold, and superfold model of protein fold use. Proteins: Structure, Function, and Bioinformatics, 46(1):61–71, 2002.

[37] Miklós Csűrös and István Miklós. Streamlining and large ancestral genomes in archaea inferred with a phylogenetic birth-and-death model. Molecular biology and evolution, 26(9):2087–2095, 2009.

[38] Tom Curran and Gabriella D’Arcangelo. Role of reelin in the control of brain development. Brain Research Reviews, 26(2):285–294, 1998.

[39] Thomas Dandekar, Berend Snel, Martijn Huynen, and Peer Bork. Conservation of gene order: a fingerprint of proteins that physically interact. Trends in biochemical sciences, 23(9):324–328, 1998.

[40] Gabriella D’Arcangelo. Reelin in the years: controlling neuronal migration and maturation in the mammalian brain. Advances in Neuroscience, 2014, 2014.

[41] Gabriella D’Arcangelo, Graham G Miao, Shu-Cheng Chen, Holly D Soares, James I Morgan, and Tom Curran. A protein related to extracellular matrix proteins deleted in the mouse mutant reeler. 1995.

[42] Charles Darwin. On the origin of species by means of natural selection (j. murray, london). JD Stilwell/Palaeogeography. Palaeoclimatology, Palaeoecology 136 (1997) 97, 119(117):257–274, 1859.

[43] Lawrence A David and Eric J Alm. Rapid evolutionary innovation during an archaean genetic expansion. Nature, 469(7328):93–96, 2011.

[44] William HE Day, David S Johnson, and David Sankoff. The computational complexity of inferring rooted phylogenies by parsimony. Mathematical biosciences, 81(1):33–42, 1986.

[45] Margaret O Dayhoff and Robert M Schwartz. A model of evolutionary change in proteins. In Atlas of protein sequence and structure. Citeseer, 1978.

[46] Hamsa Dhwani Tadepally and Muriel Aubry. Evolution of c2h2 zinc-finger gene families in mammals. eLS.

[47] I Dondoshansky and Y Wolf. Blastclust (ncbi software development toolkit). NCBI, Bethesda, Md, 2002.

[48] Jean-Philippe Doyon, Cedric Chauve, and Sylvie Hamel. Space of gene/species trees reconciliations and parsimonious models. Journal of Computational Biology, 16(10):1399–1418, 2009.

[49] Alexei J Drummond and Andrew Rambaut. Beast: Bayesian evolutionary analysis by sampling trees. BMC evolutionary biology, 7(1):214, 2007.

[50] Dannie Durand, Bjarni V Halldórsson, and Benjamin Vernot. A hybrid micro-macroevolutionary approach to gene tree reconstruction. Journal of Computational Biology, 13(2):320–335, 2006.

[51] Isaac Elias and Jens Lagergren. Fast neighbor joining. In Automata, Languages and Programming, pages 1263–1274. Springer, 2005.

[52] Anton J Enright, Stijn Van Dongen, and Christos A Ouzounis. An efficient algorithm for large-scale detection of protein families. Nucleic acids research, 30(7):1575–1584, 2002.

[53] James S Farris. Phylogenetic analysis under dollo’s law. Systematic Zoology, pages 77–88, 1977.

[54] S Hossein Fatemi. Reelin glycoprotein: structure, biology and roles in health and disease. Molecular psychiatry, 10(3):251–257, 2005.

[55] Joseph Felsenstein. Cases in which parsimony or compatibility methods will be positively misleading. Systematic Biology, 27(4):401–410, 1978.

[56] Joseph Felsenstein. Evolutionary trees from dna sequences: a maximum likelihood approach. Journal of molecular evolution, 17(6):368–376, 1981.

[57] Robert D Finn, Alex Bateman, Jody Clements, Penelope Coggill, Ruth Y Eberhardt, Sean R Eddy, Andreas Heger, Kirstie Hetherington, Liisa Holm, Jaina Mistry, et al. Pfam: the protein families database. Nucleic acids research, page gkt1223, 2013.

[58] Walter M Fitch. Distinguishing homologous from analogous proteins. Systematic Biology, 19(2):99–113, 1970.

[59] Walter M Fitch. Homology: a personal view on some of the problems. Trends in genetics, 16(5):227–231, 2000.

[60] Kristoffer Forslund, Anna Henricson, Volker Hollich, and Erik LL Sonnhammer. Domain tree-based analysis of protein architecture evolution. Molecular biology and evolution, 25(2):254–264, 2008.

[61] Hunter B Fraser. Modularity and evolutionary constraint on proteins. Nature genetics, 37(4):351–352, 2005.

[62] Robert Friedman and Austin L Hughes. Gene duplication and the structure of eukaryotic genomes. Genome research, 11(3):373–381, 2001.

[63] Zheng Fu, Xin Chen, Vladimir Vacic, Peng Nan, Yang Zhong, and Tao Jiang. A parsimony approach to genome-wide ortholog assignment. In Research in Computational Molecular Biology, pages 578–594. Springer, 2006.

[64] Zheng Fu, Xin Chen, Vladimir Vacic, Peng Nan, Yang Zhong, and Tao Jiang. Msoar: a high-throughput ortholog assignment system based on genome rearrangement. Journal of Computational Biology, 14(9):1160–1175, 2007.

[65] Olivier Gascuel. Bionj: an improved version of the nj algorithm based on a simple model of sequence data. Molecular biology and evolution, 14(7): 685–695, 1997.

[66] Olivier Gascuel and Mike Steel. Neighbor-joining revealed. Molecular biology and evolution, 23(11):1997–2000, 2006.

[67] Sergey Gavrilets. Perspective: models of speciation: what have we learned in 40 years? Evolution, 57(10):2197–2215, 2003.

[68] Adrian J Gibbs and George A McIntyre. The diagram, a method for comparing sequences. European Journal of Biochemistry, 16(1):1–11, 1970.

[69] Richard A Gibbs, George M Weinstock, Michael L Metzker, Donna M Muzny, Erica J Sodergren, Steven Scherer, Graham Scott, David Steffen, Kim C Worley, Paula E Burch, et al. Genome sequence of the brown norway rat yields insights into mammalian evolution. Nature, 428(6982):493–521, 2004.

[70] J Peter Gogarten, W Ford Doolittle, and Jeffrey G Lawrence. Prokaryotic evolution in light of gene transfer. Molecular biology and evolution, 19(12):2226–2238, 2002.

[71] Morris Goodman, John Czelusniak, G William Moore, AE Romero-Herrera, and Genji Matsuda. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Systematic Zoology, pages 132–163, 1979.

[72] Christopher Greenman, Philip Stephens, Raffaella Smith, Gillian L Dalgliesh, Christopher Hunter, Graham Bignell, Helen Davies, Jon Teague, Adam Butler, Claire Stevens, et al. Patterns of somatic mutation in human cancer genomes. Nature, 446(7132):153–158, 2007.

[73] Alessandro Guidotti, James Auta, John M Davis, Valeria DiGiorgi Gerevini, Yogesh Dwivedi, Dennis R Grayson, Francesco Impagnatiello, Ghanshyam Pandey, Christine Pesold, Rajiv Sharma, et al. Decrease in reelin and glutamic acid decarboxylase67 (gad67) expression in schizophrenia and bipolar disorder: a postmortem brain study. Archives of general psychiatry, 57(11):1061–1069, 2000.

[74] Matthew W Hahn, Jeffery P Demuth, and Sang-Gook Han. Accelerated rate of gene gain and loss in primates. Genetics, 177(3):1941–1949, 2007.

[75] Mike Hallett, Jens Lagergren, and Ali Tofigh. Simultaneous identification of duplications and lateral transfers. In Proceedings of the eighth annual international conference on Research in computational molecular biology, pages 347–356. ACM, 2004.

[76] Aaron T Hamilton, Stuart Huntley, Mary Tran-Gyamfi, Daniel M Baggott, Laurie Gordon, and Lisa Stubbs. Evolutionary expansion and divergence in the znf91 subfamily of primate-specific zinc finger genes. Genome research, 16(5):584–594, 2006.

[77] Jung-Hoon Han, Sarah Batey, Adrian A Nickson, Sarah A Teichmann, and Jane Clarke. The folding and evolution of multidomain proteins. Nature Reviews Molecular Cell Biology, 8(4):319–330, 2007.

[78] Mira V Han and Matthew W Hahn. Identifying parent-daughter relationships among duplicated genes. In Pacific Symposium on Biocomputing, volume 14, pages 114–125. World Scientific, 2009.

[79] Masami Hasegawa, Hirohisa Kishino, and Taka-aki Yano. Dating of the human-ape splitting by a molecular clock of mitochondrial dna. Journal of molecular evolution, 22(2):160–174, 1985.

[80] Steffen Heber and Jens Stoye. Algorithms for finding gene clusters. In Algorithms in Bioinformatics, pages 252–263. Springer, 2001.

[81] Joachim Herz and Ying Chen. Reelin, lipoprotein receptors and synaptic plasticity. Nature Reviews Neuroscience, 7(11):850–859, 2006.

[82] Rose Hoberman and Dannie Durand. The incompatible desiderata of gene cluster properties. In Comparative Genomics, pages 73–87. Springer, 2005.

[83] Rose Hoberman, David Sankoff, and Dannie Durand. The statistical significance of max-gap clusters. In Comparative Genomics, pages 55–71. Springer, 2004.

[84] John P. Huelsenbeck, Fredrik Ronquist, et al. Mrbayes: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8):754–755, 2001.

[85] John P Huelsenbeck, Fredrik Ronquist, Rasmus Nielsen, and Jonathan P Bollback. Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 294(5550):2310–2314, 2001.

[86] Stuart Huntley, Daniel M Baggott, Aaron T Hamilton, Mary Tran-Gyamfi, Shan Yang, Joomyeong Kim, Laurie Gordon, Elbert Branscomb, and Lisa Stubbs. A comprehensive catalog of human krab-associated zinc finger genes: insights into the evolutionary history of a large family of transcriptional repressors. Genome research, 16(5):669–677, 2006.

[87] Matthew Hurles. Gene duplication: the genomic trade in spare parts. PLoS biology, 2:900–904, 2004.

[88] Martijn Huynen, Berend Snel, Warren Lathe, and Peer Bork. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome research, 10(8):1204–1210, 2000.

[89] Martijn A Huynen and Erik Van Nimwegen. The frequency distribution of gene family sizes in complete genomes. Molecular biology and evolution, 15(5):583–589, 1998.

[90] Hisako Ichihara, Hisato Jingami, and Hiroyuki Toh. Three novel repetitive units of reelin. Molecular Brain Research, 97(2):190–193, 2001.

[91] Francesco Impagnatiello, Alessandro R Guidotti, Christine Pesold, Yogesh Dwivedi, Hector Caruncho, Maria G Pisu, Doncho P Uzunov, Neil R Smalheiser, John M Davis, Ghanshyam N Pandey, et al. A decrease of reelin expression as a putative vulnerability factor in schizophrenia. Proceedings of the National Academy of Sciences, 95(26):15718–15723, 1998.

[92] Yuannian Jiao, Norman J Wickett, Saravanaraj Ayyampalayam, André S Chanderbali, Lena Landherr, Paula E Ralph, Lynn P Tomsho, Yi Hu, Haiying Liang, Pamela S Soltis, et al. Ancestral polyploidy in seed plants and angiosperms. Nature, 473(7345):97–100, 2011.

[93] Jacob M Joseph and Dannie Durand. Family classification without domain chaining. Bioinformatics, 25(12):i45–i53, 2009.

[94] Thomas H Jukes and Charles R Cantor. Evolution of protein molecules. Mammalian protein metabolism, 3:21–132, 1969.

[95] Jin Jun, Ion I Mandoiu, and Craig E Nelson. Identification of mammalian orthologs using local synteny. BMC genomics, 10(1):630, 2009.

[96] Georgy P Karev, Yuri I Wolf, Faina S Berezovskaya, and Eugene V Koonin. Gene family evolution: an in-depth theoretical and simulation analysis of non-linear birth-death-innovation models. BMC Evolutionary Biology, 4(1):1, 2004.

[97] Manolis Kellis, Bruce W Birren, and Eric S Lander. Proof and evolutionary analysis of ancient genome duplication in the yeast saccharomyces cerevisiae. Nature, 428(6983):617–624, 2004.

[98] David G Kendall. On the generalized "birth-and-death" process. The annals of mathematical statistics, pages 1–15, 1948.

[99] Motoo Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of molecular evolution, 16(2):111–120, 1980.

[100] Nicolas Lartillot and Hervé Philippe. A bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Molecular biology and evolution, 21(6):1095–1109, 2004.

[101] Philip Leder. The genetics of antibody diversity. Scientific American, 246(5):102–115, 1982.

[102] Frédéric Lemoine, Olivier Lespinet, and Bernard Labedan. Assessing the evolutionary rate of positional orthologous genes in prokaryotes using synteny data. BMC evolutionary biology, 7(1):237, 2007.

[103] Thomas Lepage, David Bryant, Hervé Philippe, and Nicolas Lartillot. A general comparison of relaxed molecular clock models. Molecular biology and evolution, 24(12):2669–2680, 2007.

[104] Fulu Li. Analysis on the complexity of the problem to construct the most parsimonious phylogenetic trees. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1–6. IEEE, 2010.

[105] H Li. Constructing the TreeFam database. PhD thesis, The Institute of Theoretical Physics, Chinese Academy of Science. http://www.sanger.ac.uk/Users/lh3/PhD-thesis-liheng-2006-English.pdf, 2006.

[106] Heng Li, Avril Coghlan, Jue Ruan, Lachlan James Coin, Jean-Karim Heriche, Lara Osmotherly, Ruiqiang Li, Tao Liu, Zhang Zhang, Lars Bolund, et al. Treefam: a curated database of phylogenetic trees of animal gene families. Nucleic acids research, 34(suppl 1):D572–D580, 2006.

[107] Wen-Hsiung Li and Kateryna D Makova. Domain duplication and gene elongation. eLS.

[108] Daiqing Liao. Concerted evolution: molecular mechanism and biological implications. The American Journal of Human Genetics, 64(1):24–30, 1999.

[109] Martin Linder, Tom Britton, and Bengt Sennblad. Evaluation of bayesian models of substitution rate evolution—parental guidance versus mutual independence. Systematic biology, 60(3):329–342, 2011.

[110] Nicolas Luc, Jean-Loup Risler, Anne Bergeron, and Mathieu Raffinot. Gene teams: a new formalization of gene clusters for comparative genomics. Computational biology and chemistry, 27(1):59–67, 2003.

[111] Michael Lynch and John S Conery. The evolutionary fate and consequences of duplicate genes. Science, 290(5494):1151–1155, 2000.

[112] Owais Mahmudi. Probabilistic reconciliation analysis for genes and pseudogenes. 2015.

[113] Owais Mahmudi, Joel Sjöstrand, Bengt Sennblad, and Jens Lagergren. Genome-wide probabilistic reconciliation analysis across vertebrates. BMC bioinformatics, 14(Suppl 15):S10, 2013.

[114] Aron Marchler-Bauer, John B Anderson, Praveen F Cherukuri, Carol DeWeese-Scott, Lewis Y Geer, Marc Gwadz, Siqian He, David I Hurwitz, John D Jackson, Zhaoxi Ke, et al. Cdd: a conserved domain database for protein classification. Nucleic acids research, 33(suppl 1):D192–D196, 2005.

[115] Cynthia J Verjovsky Marcotte and Edward M Marcotte. Predicting functional linkages from gene fusions with confidence. Applied bioinformatics, 1(2):93–100, 2002.

[116] Ernst Mayr. Systematics and the origin of species, from the viewpoint of a zoologist. Harvard University Press, 1942.

[117] Sebastian Meier, Pernille R Jensen, Charles N David, Jarrod Chapman, Thomas W Holstein, Stephan Grzesiek, and Suat Özbek. Continuous molecular evolution of protein-domain structures by single amino acid changes. Current biology, 17(2):173–178, 2007.

[118] Vincent Miele, Simon Penel, Vincent Daubin, Franck Picard, Daniel Kahn, and Laurent Duret. High-quality sequence clustering guided by network topology and multiple alignment likelihood. Bioinformatics, 28(8):1078–1085, 2012.

[119] Vincent Miele, Simon Penel, and Laurent Duret. Ultra-fast sequence clustering from similarity networks with silix. BMC bioinformatics, 12(1):116, 2011.

[120] Andrew D Moore, Åsa K Björklund, Diana Ekman, Erich Bornberg-Bauer, and Arne Elofsson. Arrangements in the modular evolution of proteins. Trends in biochemical sciences, 33(9):444–451, 2008.

[121] Alexey G Murzin, Steven E Brenner, Tim Hubbard, and Cyrus Chothia. Scop: a structural classification of proteins database for the investigation of sequences and structures. Journal of molecular biology, 247(4):536–540, 1995.

[122] Sean Nee, Robert M May, and Paul H Harvey. The reconstructed evolutionary process. Philosophical Transactions of the Royal Society B: Biological Sciences, 344(1309):305–311, 1994.

[123] Terukazu Nogi, Norihisa Yasui, Mitsuharu Hattori, Kenji Iwasaki, and Junichi Takagi. Structure of a signaling-competent reelin fragment revealed by x-ray crystallography and electron tomography. The EMBO journal, 25(15):3675–3683, 2006.

[124] Katja Nowick, Miguel Carneiro, and Rui Faria. A prominent role of krab-znf transcription factors in mammalian speciation? Trends in Genetics, 29(3):130–139, 2013.

[125] Katja Nowick, Aaron T Hamilton, Huimin Zhang, and Lisa Stubbs. Rapid sequence and expression divergence suggest selection for novel function in primate-specific krab-znf genes. Molecular biology and evolution, 27(11):2606–2617, 2010.

[126] Johan AA Nylander, Fredrik Ronquist, John P Huelsenbeck, and José Luis Nieves-Aldrey. Bayesian phylogenetic analysis of combined data. Systematic biology, 53(1):47–67, 2004.

[127] Robert J O’Hara. Population thinking and tree thinking in systematics. Zoologica scripta, 26(4):323–329, 1997.

[128] Susumu Ohno. Evolution by gene duplication. Springer Science & Business Media, 2013.

[129] Silvan Oulion, Stephanie Bertrand, and Hector Escriva. Evolution of the fgf gene family. International journal of evolutionary biology, 2012, 2012.

[130] Richard Owen. On the archetype and homologies of the vertebrate skeleton. van Voorst, 1848.

[131] Richard Owen. Lectures on the comparative anatomy and physiology of the invertebrate animals. Longman, 1855.

[132] Matteo Pellegrini, Edward M Marcotte, Michael J Thompson, David Eisenberg, and Todd O Yeates. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proceedings of the National Academy of Sciences, 96(8):4285–4288, 1999.

[133] Chris P Ponting and Robert R Russell. The natural history of protein domains. Annual review of biophysics and biomolecular structure, 31(1):45–71, 2002.

[134] Tobias Preis, Peter Virnau, Wolfgang Paul, and Johannes J Schneider. Gpu accelerated monte carlo simulation of the 2d and 3d ising model. Journal of Computational Physics, 228(12):4468–4477, 2009.

[135] Teresa Przytycka, George Davis, Nan Song, and Dannie Durand. Graph theoretical insights into evolution of multidomain proteins. Journal of computational biology, 13(2):351–363, 2006.

[136] Bruce Rannala and Ziheng Yang. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. Journal of molecular evolution, 43(3):304–311, 1996.

[137] Matthew D Rasmussen and Manolis Kellis. A bayesian approach for fast and accurate gene tree reconstruction. Molecular Biology and Evolution, 28(1):273–290, 2011.

[138] EA Romanel, Carlos G Schrago, Rafael M Couñago, CA Russo, and Márcio Alves-Ferreira. Evolution of the b3 dna binding superfamily: new insights into rem family gene diversification. PloS one, 4(6):e5791, 2009.

[139] Fredrik Ronquist and John P Huelsenbeck. Mrbayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19(12):1572–1574, 2003.

[140] Michael G Rossmann, Anders Liljas, Carl-Ivar Brändén, and Leonard J Banaszak. 2 evolutionary and structural relationships among dehydrogenases. The enzymes, 11:61–102, 1975.

[141] Michael G Rossmann, Dino Moras, and Kenneth W Olsen. Chemical and biological evolution of a nucleotide-binding protein. Nature, 250:194–199, 1974.

[142] Andrey Rzhetsky and Shawn M Gomez. Birth of scale-free molecular networks and the number of distinct dna and protein domains per genome. Bioinformatics, 17(10):988–996, 2001.

[143] Naruya Saitou and Masatoshi Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution, 4(4):406–425, 1987.

[144] David Sankoff and Joseph H Nadeau. Chromosome rearrangements in evolution: From gene order to genome sequence and back. Proceedings of the National Academy of Sciences, 100(20):11188–11189, 2003.

[145] Vincent Savolainen, Marie-Charlotte Anstett, Christian Lexer, Ian Hutton, James J Clarkson, Maria V Norup, Martyn P Powell, David Springate, Nicolas Salamin, and William J Baker. Sympatric speciation in palms on an oceanic island. Nature, 441(7090):210–213, 2006.

[146] Eric D Scheeff and Philip E Bourne. Structural evolution of the protein kinase-like superfamily. PLoS Comput Biol, 1(5):e49, 2005.

[147] Dolph Schluter. Evidence for ecological speciation and its alternative. Science, 323(5915):737–741, 2009.

[148] Jörg Schultz, Richard R Copley, Tobias Doerks, Chris P Ponting, and Peer Bork. Smart: a web-based tool for the study of genetically mobile domains. Nucleic acids research, 28(1):231–234, 2000.

[149] Bengt Sennblad and Jens Lagergren. Probabilistic orthology analysis. Systematic biology, 58(4):411–424, 2009.

[150] Martin Simonsen, Thomas Mailund, and Christian NS Pedersen. Rapid neighbour-joining. In Algorithms in Bioinformatics, pages 113–122. Springer, 2008.

[151] Joel Sjöstrand, Bengt Sennblad, Lars Arvestad, and Jens Lagergren. Dlrs: gene tree evolution in light of a species tree. Bioinformatics, 28(22):2994–2995, 2012.

[152] Nan Song, Jacob M Joseph, George B Davis, and Dannie Durand. Sequence similarity network reveals common ancestry of multidomain proteins. 2008.

[153] Maureen Stolzer, Katherine Siewert, Han Lai, Minli Xu, and Dannie Durand. Event inference in multidomain families with phylogenetic reconciliation. BMC bioinformatics, 16(Suppl 14):S8, 2015.

[154] Michael R Stratton, Peter J Campbell, and P Andrew Futreal. The cancer genome. Nature, 458(7239):719–724, 2009.

[155] Marc A Suchard and Andrew Rambaut. Many-core algorithms for statistical phylogenetics. Bioinformatics, 25(11):1370–1376, 2009.

[156] Marc A Suchard and Benjamin D Redelings. Bali-phy: simultaneous bayesian inference of alignment and phylogeny. Bioinformatics, 22(16):2047–2048, 2006.

[157] David L Swofford. PAUP*. Phylogenetic analysis using parsimony (* and other methods). Version 4. 2003.

[158] Hamsa D Tadepally, Gertraud Burger, and Muriel Aubry. Evolution of c2h2-zinc finger genes and subfamilies in mammals: species-specific duplication and loss of clusters, genes and effector domains. BMC evolutionary biology, 8(1):176, 2008.

[159] Koichiro Tamura, Glen Stecher, Daniel Peterson, Alan Filipski, and Sudhir Kumar. Mega6: molecular evolutionary genetics analysis version 6.0. Molecular biology and evolution, 30(12):2725–2729, 2013.

[160] Roman L Tatusov, Michael Y Galperin, Darren A Natale, and Eugene V Koonin. The cog database: a tool for genome-scale analysis of protein functions and evolution. Nucleic acids research, 28(1):33–36, 2000.

[161] John S Taylor and Jeroen Raes. Duplication and divergence: the evolution of new genes and old ideas. Annu. Rev. Genet., 38:615–643, 2004.

[162] Aartjan JW te Velthuis, Jeroen F Admiraal, and Christoph P Bagowski. Molecular evolution of the maguk family in metazoan genomes. BMC evolutionary biology, 7(1):129, 2007.

[163] AJ Te Velthuis, Tadamoto Isogai, Lieke Gerrits, and Christoph P Bagowski. Insights into the molecular evolution of the pdz/lim family and identification of a novel conserved protein motif. PloS one, 2(2):e189, 2007.

[164] Elizabeth Alison Thompson. Human evolutionary trees. CUP Archive, 1975.

[165] Larry W Tjoelker, Larry Gosting, Steve Frey, Christie L Hunter, Hai Le Trong, Bart Steiner, Heather Brammer, and Patrick W Gray. Structural and functional definition of the human chitinase chitin-binding domain. Journal of Biological Chemistry, 275(1):514–520, 2000.

[166] Ali Tofigh, Michael Hallett, and Jens Lagergren. Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 8(2):517–535, 2011.

[167] Hedvig Tordai, Alinda Nagy, Krisztina Farkas, Laszlo Banyai, and Laszlo Patthy. Modules, multidomain proteins and organismic complexity. FEBS Journal, 272(19):5064–5078, 2005.

[168] Ikram Ullah, Joel Sjöstrand, Peter Andersson, Bengt Sennblad, and Jens Lagergren. Integrating sequence evolution into probabilistic orthology analysis. Systematic biology, page syv044, 2015.

[169] S Van Dongen. Graph clustering by flow simulation. PhD thesis, University of Utrecht, Utrecht, Netherlands, 2000.

[170] Kevin J Verstrepen, An Jansen, Fran Lewitter, and Gerald R Fink. Intragenic tandem repeats generate functional variability. Nature genetics, 37(9):986–990, 2005.

[171] J-N Volff and J Brosius. Modern genomes with retro-look: retrotransposed elements, retroposition and the origin of new genes. 2007.

[172] DP Wall, HB Fraser, and AE Hirsh. Detecting putative orthologs. Bioinformatics, 19(13):1710–1711, 2003.

[173] Sishuo Wang, Youhua Chen, Qinhong Cao, and Huiqiang Lou. Long-lasting gene conversion shapes the convergent evolution of the critical methanogenesis genes. G3: Genes|Genomes|Genetics, 5(11):2475–2486, 2015.

[174] Ilan Wapinski, Avi Pfeffer, Nir Friedman, and Aviv Regev. Automatic genome-wide reconstruction of phylogenetic gene trees. Bioinformatics, 23(13):i549–i558, 2007.

[175] Ilan Wapinski, Avi Pfeffer, Nir Friedman, and Aviv Regev. Natural history and evolutionary principles of gene duplication in fungi. Nature, 449(7158):54–61, 2007.

[176] January Weiner, Francois Beaussart, and Erich Bornberg-Bauer. Domain deletions and substitutions in the modular protein evolution. FEBS Journal, 273(9):2037–2047, 2006.

[177] January Weiner, Andrew D Moore, and Erich Bornberg-Bauer. Just how versatile are domains? BMC evolutionary biology, 8(1):285, 2008.

[178] Cathy H Wu, Hongzhan Huang, Lai-Su L Yeh, and Winona C Barker. Protein family classification and functional annotation. Computational Biology and Chemistry, 27(1):37–47, 2003.

[179] Yi-Chieh Wu, Matthew D Rasmussen, and Manolis Kellis. Evolution at the subgene level: domain rearrangements in the drosophila phylogeny. Molecular biology and evolution, page msr222, 2011.

[180] Ziheng Yang. Maximum likelihood phylogenetic estimation from dna sequences with variable rates over sites: approximate methods. Journal of Molecular evolution, 39(3):306–314, 1994.

[181] Norihisa Yasui, Terukazu Nogi, and Junichi Takagi. Structural basis for specific recognition of reelin by its receptors. Structure, 18(3):320–331, 2010.

[182] Yuzhen Ye and Adam Godzik. Comparative analysis of protein domain organization. Genome research, 14(3):343–353, 2004.

[183] Jianzhi Zhang. Evolution by gene duplication: an update. Trends in ecology & evolution, 18(6):292–298, 2003.

[184] Stefan Zoller, Veronika Boskova, and Maria Anisimova. Maximum-likelihood tree estimation using codon substitution models with multiple partitions. Molecular biology and evolution, page msv097, 2015.