<<

From genomes to post-processing of Bayesian inference of phylogeny

RAJA HASHIM ALI

Doctoral Stockholm, Sweden 2016 TRITA-CSC-A-2016:01 ISSN-1653-5723 KTH ISRN-KTH/CSC/A-16/01-SE SE-100 44 Stockholm ISBN: 978-91-7595-849-1 SWEDEN

Akademisk avhandling som med tillstånd av Kungl Tekniska högskolan framlägges till offentlig granskning för avläggande av Akademisk avhandling 25 February 2016 i Fire Scilifelab.

© Raja Hashim Ali, January 2016

Tryck: Universitetsservice US AB iii

“Do not let your difficulties fill you with anxiety, after all it is only in the darkest nights that stars shine more brightly.”

— Hazrat Ali Ibn Abu-Talib A.S. (7th century A.D.)

— Dr. Muhammad Iqbal in Bal-e-Jibril (Gabriel’s Wing) 1935

Beyond the stars, other galaxies also exist. As of now, other tests of love also exist. You are an eagle, flight is your vocation: Further skies stretching out before you also exist.

To my parents iv

Abstract

Life is extremely complex and amazingly diverse; it has taken billions of years of to attain the level of complexity we observe in now and ranges from single-celled prokaryotes to multi-cellular human beings. With availability of molecular sequence data, algorithms inferring and gene families have emerged and similarity in gene content between two genes has been the major signal utilized for homology inference. Recently there has been a significant rise in number of species with fully sequenced genome, which provides an opportunity to investigate and infer homologs with greater accuracy and in a more informed way. Phylogeny analysis explains the rela- tionship between member genes of a gene family in a simple, graphical and plausible way using a tree representation. Bayesian phylogenetic inference is a probabilistic method used to infer gene phylogenies and posteriors of other evolutionary parameters. Markov chain Monte Carlo (MCMC) algorithm, in particular using Metropolis-Hastings sampling scheme, is the most commonly employed algorithm to determine evolutionary history of genes. There are many softwares available that process results from each MCMC run, and ex- plore the parameter posterior but there is a need for interactive software that can analyse both discrete and real-valued parameters, and which has conver- gence assessment and burnin estimation diagnostics specifically designed for Bayesian phylogenetic inference.

In this thesis, a synteny-aware approach for gene homology inference, called GenFamClust (GFC), is proposed that uses gene content and gene order con- servation to infer homology. The feature which distinguishes GFC from ear- lier homology inference methods is that local synteny has been combined with gene similarity to infer homologs, without inferring homologous regions. GFC was validated for accuracy on a simulated dataset. Gene families were com- puted by applying clustering algorithms on homologs inferred from GFC, and compared for accuracy, dependence and similarity with gene families inferred from other popular gene family inference methods on a eukaryotic dataset. Gene families in fungi obtained from GFC were evaluated against pillars from Yeast Gene Order Browser. Genome-wide gene families for some eukaryotic species are computed using this approach.

Another topic focused in this thesis is the processing of MCMC traces for Bayesian phylogenetics inference. We introduce a new software VMCMC which simplifies post-processing of MCMC traces. VMCMC can be used both as a GUI-based application and as a convenient command-line tool. VMCMC supports interactive exploration, is suitable for automated pipelines and can handle both real-valued and discrete parameters observed in a MCMC trace. We propose and implement joint burnin estimators that are specifically applicable to Bayesian phylogenetics inference. These methods have been compared for similarity with some other popular convergence diagnostics. We show that Bayesian phylogenetic inference and VMCMC can be applied to infer valuable evolutionary information for a biological case – the evolutionary history of FERM domain. v

Sammanfattning

Livet är extremt komplext och otroligt varierande; det har tagit evolutionen miljarder år att uppnå den nivå av komplexitet som vi ser i naturen idag och varierar från encelliga prokaryoter till flercelliga människor. Med tillgången till molekylär sekvensdata, har utvecklingen av algoritmer för att bestämma homologi och genfamiljer gått snabbt och likheten mellan två gener har varit den främsta signalen som använts för att bestämma homologi. Nyligen har det skett en betydande ökning av antalet arter med fullt sekvense genomet, vilket ger en möjlighet att undersöka och bestämma homologi med större noggrann- het och på ett mer informerat sätt. Fylogenetisk analys beskriver sambandet mellan gener i en genfamilj på ett enkelt, grafiskt och rimligt sätt med ett träd. Bayesiansk fylogenetisk inferens är en sannolikhetsteoretisk metod som används för att bestämma genfylogenier och posteriorfördelningen för evolu- tionära parametrar genom applicering av metoden Markov Chain Monte Carlo (MCMC), särskilt genom Metropolis-Hastings sampling som är den mest an- vända algoritmen för att bestämma den evolutionära historien för en mängd gener. Det finns många program tillgängliga för att bearbeta resultaten ifrån en MCMC-körning och utforska posteriorfördelningen för parametrarna men det finns ett behov av en interaktiv programvara som kan analysera både träd och kontinuerliga parametrar samt erbjuder konvergensbedömning och skatt- ningsdiagnostik för burnin och är särskilt utformad för Bayesiansk inferens av fylogenier.

I denna avhandling introduceras en synteni-medveten ansats för att bestäm- ma gen-homologier, som kallas GenFamClust (GFC). Denna ansats använder geninnehåll och genordning för att bestämma homologi. Det utmärkande för GFC jämfört med tidigare homologi-inferensmetoder är att lokal synteni har kombinerats med genlikhet för att bestämma homologer utan att bestämma homologa regioner. GFC validerades för noggrannhet på simulerad data. Gen- familjer skattades genom att tillämpa klusteralgoritmer på homologer som bestämts av GFC och jämfördes med avseende på noggrannhet, beroende och likhet med genfamiljer som bestämts av andra populära genfamilj slutled- ningsmetoder på data ifrån eukaryoter. Genfamiljer i svampar som bestämts av GFC jämfördes mot det liknande begreppet “pelare” i Yeast Gene Or- der Browser. Hela genfamiljer för eukaryota arter med fullständigt framtagen arvsmassa beräknas med hjälp av denna metod och visar därmed på vikten av att ta hänsyn till konservering av geneordning i homologi-inferens.

Ett annat ämne som denna avhandling behandlar är bearbetningen av MCMC spår för Bayesiansk inferens av fylogenier. Vi introducerar en ny program- vara VMCMC som förenklar bearbetningen av MCMC-spår. VMCMC kan användas både som en GUI-baserad applikation och som ett bekvämt kom- mandoradsverktyg. VMCMC stödjer interaktiv utforskning, är lämplig för automatiserade pipelines och kan hantera både kontinuerliga och diskreta pa- rametrar från ett MCMC-spår. Vi föreslår och implementerar gemensamma vi

burnin-skattningar som är skräddarsydda för Bayesiansk inferens av fyloge- nier. Dessa metoder har jämförts med andra populära metoder för konver- gensdiagnostik. Vi visar att Bayesiansk inferens av fylogenier och VMCMC kan användas för att upptäcka värdefull evolutionär information i en biologisk tillämpning: den evolutionära historien för FERM-domänen. Contents

Contents vii

Acknowledgements 1

List of Publications 3

1 Introduction 5 1.1 Thesis overview ...... 5

2 Genomes: Biology and background to project 7 2.1 Genes, proteins and genomes ...... 7 2.2 Evolution and evolutionary events ...... 11 2.3 Biologically interesting proteins – FERM domain containing proteins 13

3 Homology and gene family inference 15 3.1 Fundamentals ...... 15 3.2 Homology and gene family inference methods ...... 21

4 Phylogenetic Inference 29 4.1 Fundamentals of Phylogeny ...... 29 4.2 Computing Phylogenetic Trees – Traditional Approaches ...... 31 4.3 Bayesian phylogenetic inference ...... 31

5 Post-processing of traces from MCMC runs 33 5.1 MCMC Characteristics ...... 33 5.2 Post-processing MCMC traces ...... 36 5.3 Software packages for post-processing of MCMC traces ...... 39

6 Present Investigations 41

7 Discussion & Conclusion 45 7.1 Discussion ...... 45 7.2 Future perspectives ...... 47

vii viii CONTENTS

Bibliography 49 Acknowledgements

It is my pleasure to express my gratitude to my advisor, Lars Arvestad, for his continuous support, motivation, encouragement, and help during my Doctoral studies and research. Without his help, this thesis would not be possible. With his guidance, I started to learn how to do research and how to grow academically with collaborative spirit. With his motivation and encouragement, I started to explore the strength of innovation. I also appreciate his wife’s support, whom we would involve freely to read through our manuscripts, to Emma Arvestad for delicious apple pies and to Alexander Arvestad for his toys and stories! In short, thank you Arvestad clan for your part in my PhD and in my stay in Sweden. I would like to thank Ammad Aslam Khan, with whom I have collaborated on FERM paper among other works. The collaboration experience showed me the joint strength of theoretical and problem driven research, taught me to believe in myself and gave me the opportunity to do independent research. I badly miss our discussions on Science and Saturday dinners with work. I want to give my grati- tude to my friends and collaborators, Sayyed Auwn Muhammad, Muhammad Mushtaq and Dr. Dara Mohammad. It has been an educational and interest- ing experience to collaborate with you all and to learn from you about your fields. Best of luck to you, Auwn and Mushtaq, for your degree and publications, and to Dara for his goals. I am also grateful to all my other collaborators for their help. I owe my sincere gratitude to Bengt Sennblad and Jens Lagergren, my co-supervisors, who have given me generous help and encouragement. Bengt has always been at an arm’s distance (literally!) from my desk and has been extremely helpful for all my little personal and professional queries. Thank you for your time, help and sincere, useful and important advice on all matters. Jens has been overseeing my study and research, and has given invaluable guidance and support in critical situations. Thank you and wish you all the best. I was fortunate to have colleagues in SciLifeLab with whom I have spent some memorable time: Auwn Muhammad; Thank you for hosting me in one of my tough times, for constant support in projects and for great motivational discussions. Owais Mahmudi and Ikramullah; I fondly remember the discussions, food and time spent with both of you. Kristoffer Sahlin and Mattias Franberg – Statis- ticians stuck in the body of computer scientists; your fights over frequentist versus Bayesian approaches are legendary! Joel Sjöstrand; Thank you for all the advice

1 2 CONTENTS and interest in VMCMC. Erik Sjölund; Thank you for discussions on language, regional interests, music and the quality time spent with you. Matthew The; Best of luck for your PhD and to Roger Federer to win that last elusive slam some day. Hossein Farahani; Wish you all the best in Canada. I should especially mention my other comrades with whom I have spent a great time: Annu, Lumi, Victor, Linus, Yrin and Mehmood Khan. I would like to thank my Pakistani gang (Asrar Mehdi, Faheem Mughal, Qamar Toheed, Adeel Yasin, Nabeel Shehzad, Iram Bilal, Alamdar Hus- sain, Shahid Hussain, Rashid Mehmood, little Rameen Rashid, Muham- mad Mushtaq, Sharif Hasni, Aamer Riaz, Imran Jamali, Naeem Anwar, little Ibraheem, Irfan Khan and Farasat Zaman) for making these five years memorable in Sweden. Without your company, discussions and your parties with delicious baryanis, karahis and other desi food, I would have been bored to death in this grey, cloudy, gloomy weather. Amit Gahoi, I miss your late night discus- sions, mushroom cooking and pav bhaji. Best wishes for your work in Germany! Finally to my room mates Muhammad Irfan and Syed Muhammad Zubair; Thank you for late night discussions, Game of Thrones talks and for wonderful companionship. I want to thank my family, especially my parents, Zafar Abbas and Saria Zafar, for giving me my genome, for nurturing and educating me, for having faith in me, and for their constant care and support throughout my life. My grandfather, Lt. Col. (Retd) Sadaqat Ali, has been a strict mentor to me since I was young and I am extremely grateful to him. My siblings, Usra Hussain and Raja Manzar, have been supportive during this time. I would like to give special thanks to my wife, Umme Rabab Syed, who has been my life support for the past 7 years and has my gratitude and acknowledgement for contributing her part in what I am today. Last but not least, I believe I would have finished my degree a long while ago, had it not been for the mischief, love, naughtiness and roguery of the three musketeers, Saifullah Bhatti, Mustafa Ali and Ibraheem Ali. Life would have been much more boring without them! List of Publications

I: Raja H. Ali, Sayyed A. Muhammad, Mehmood A. Khan, & Lars Arvestad. Quantitative synteny scoring improves homology inference and partitioning of gene families. BMC 2013; 14(Suppl 15):S12. RHA helped in designing the algorithm, implemented the algorithm, prepared the biological datasets, performed comparative analysis and drafted the manuscript. II: Raja H. Ali, Sayyed A. Muhammad, & Lars Arvestad. GenFamClust: An accurate, synteny-aware and reliable homology inference algorithm. Manuscript. RHA conceived and designed the study, prepared the biological datasets, performed comparative analysis

with most software and drafted the manuscript. III: Raja H. Ali, Mikael Bark, Jorge Miró, Sayyed A. Muhammad, Joel Sjös- trand, Syed M. Zubair, Raja M. Abbas, & Lars Arvestad. VMCMC: a graphical and statistical analysis tool for Markov chain Monte Carlo traces. Manuscript. RHA lead and coordinated the implementation, implemented most of the functionality and drafted the manuscript. IV: Raja H. Ali, & Lars Arvestad. Burnin estimation and convergence assessment in Bayesian phylogenetic in- ference. Manuscript. RHA prepared the datasets, performed experiments and drafted the manuscript. V: Raja H. Ali∗, & Ammad A. Khan∗. Tracing the evolution of FERM domain of Kindlins. Molecular Phylogenetics and Evolution 2014; 80:193-204. RHA shared equal responsibility in data extraction, analysis of results and drafting of the manuscript.

∗ Contributed equally to this manuscript.

3

Chapter 1

Introduction

Officially, this thesis is presented in School of Computer Science (CSC) but the reader will note that content is a fusion of biology and computer science, in short computational biology. The work has been carried out at SciLifeLab, Solna campus and at the Department of Computational Biology in CSC. One can view this work as a pipeline, where the input is a set of annotated genomes from different species and is followed by identifying homologous gene pairs and inferring gene families using clustering algorithms. Then these gene families are analysed using probabilistic Bayesian inference of phylogeny which is known to be more accurate and informative than traditional methods of phylogeny inference. Finally, Markov chain Monte Carlo runs obtained from Bayesian phylogenetic inference are post-processed and analysed to extract interesting results about evolutionary parameters underlying the gene data. The focus and emphasis of research in this thesis is on two subtopics; Homology inference and post-processing of Bayesian inference of phylogeny.

1.1 Thesis overview

The thesis has been organized according to topics in the project pipeline. Chapter 2 deals with the input of the pipeline and explains the origin and definition of different biological concepts and terminologies utilized in later chapters. In Chapter 3, homology inference and gene family inference have been discussed in detail and existing gene family inference algorithms have been briefly presented. In Chapter 4, a brief overview of advances in phylogenetic inference is given with focus on methods that employ Bayesian phylogenetics inference. Chapter 5 discusses some characteristics of MCMC runs resulting from Bayesian phylogenetics analysis, sheds light on existing softwares that explore these characteristics and presents a case study, where Bayesian phylogenetic analysis and post-processing have been applied on inferring the phylogenetic tree for FERM domain. Chapter 6 provides a brief summary of papers presented in this thesis and Chapter 7 provides conclusions, pros and cons and future outlook of this work.

5

Chapter 2

Genomes: Biology and background to project

There are countless fine introductions to and genomics (see, e.g., “Learn.Genetics” from The University of Utah [148], “DNA from the Beginning” from Cold Spring Harbor Laboratory [109] and standard reference textbooks for molecular biology like [2] and [119]). However, this section touches only those concepts in molecular biology that are essential minimum knowledge for the next chapters. This chapter starts with a primer on Genomics and continues with discussion of important components of genomes. Further, composition and appearance of genomes and of genes, the building blocks of the genome, are discussed. Because the focus of this work is mainly on algorithms that infer evolution, knowledge of different biological concepts related to gene and protein evolution is essential for understanding the contents of this thesis. We conclude this chapter by a discussion about effects of these evolutionary events on gene order and gene content similarity.

2.1 Genes, proteins and genomes

Genomes and their composition A cell is the smallest structural, functional and biological unit in an organism and each cell has a genome, a collection of chemical molecules that contain the complete set of instructions for all cellular activities. Thus, genomes are essential components of life and the primary genetic material in living beings. Each species has a char- acteristic genome, different from the characteristic genome of other species, which explains differences in morphology, behaviour and other characteristics of different species. Genomic traits (see, e.g., chromosome number, and genome size, etc.) are usually unique to a species [173, 16]. Additionally, the genome as a whole of a particular individual is also unique; genomes of other organisms of the same species

7 8 CHAPTER 2. GENOMES: BIOLOGY AND BACKGROUND TO PROJECT are not exactly the same as this genome. There are small yet subtle differences between these genomes that confer individual characteristics to this individual, e.g., eye color, skin color, height, obesity and other variable morphological features within species [10, 71]. A typical eukaryotic genome can be seen as a group of twisted thread-like structures called chromosomes, which are strands of chemical molecules called nucleotide bases and are composed of Deoxyribonucleic acid (DNA). DNA consists of a sequence of four different chemical nucleotide bases (adenine (A), thymine (T), cytosine (C) and guanine (G)). Friedrich Miescher in 1869 iso- lated and identified DNA as a major chemical in the nucleus [34]. Subse- quent experiments performed by Os- wald Avery [122] and later replicated by Hershey et al. [90] revealed that genomes are composed of DNA and that DNA is the chemical responsi- ble for genetic inheritance. Later, James Watson and Francis Crick dis- covered the three-dimensional struc- ture of DNA (shown in Figure 2.1), which we now know is a twisted-ladder, Figure 2.1: The structure of DNA double he- double stranded helix [206]. They lix [206]. showed that the chromosome is chem- ically composed of two long chains or strands of nucleotide bases (now known as Watson strand and Crick strand) such that for each base on one strand, the opposite strand contains a corresponding paired base, where each species, generally, has the same karyotype and all genetic material is stored in these strands of DNA. Most high-level eukaryotes have a mul- tiploid genome, i.e., two or more copies of each chromosome can be found in the genome, which plays an essential role in inheriting genomic content during cell di- vision. Other eukaryotes and prokaryotes have haploid genomes, i.e., they contain one copy of each chromosome. For eukaryotes, all chromosomes are located in the cell nucleus but in prokaryotes, chromosomes are found in the cell cytoplasm.

Genes – the building blocks of life While the phenomenon of inheritance from parents to offspring was long known, Gregor Mendel, in mid 19th century, was one of the first scientists to streamline this 2.1. GENES, PROTEINS AND GENOMES 9 concept and to introduce the principles of heredity [129, 130]. He introduced factors (known to us now as genes) for hereditary unit and also gave a law of independent assortment, which states that the gene for each trait is inherited by the offspring irrespective of genes for other trait(s). These genes are respon- sible for various cell functions and are the building block of life. Thomas Mor- gan won the Noble prize in 1933 for discovering that genes are physically lo- Figure 2.2: Representation of chromosome as cated on chromosomes [140]. genes and intergenic regions [208]. In retrospect, a chromosome (shown in Figure 2.2) is an ordered collection of genes separated from each other by intergenic regions, a subset of noncoding DNA.

From gene to protein – a central dogma of molecular biology Genes are not the only chemical molecule behind cellular activities. Ribonucleic acid (RNA) and proteins are also functionally important molecules in the cell and are necessary for almost all cellular functions. Jöns Jacob Berzelius, a Swedish chemist, coined the term protein during his work with Gerardus Johannes Mulder in his elemental analyses of organic compounds in 1838 [201]. Urease was the first known enzyme, and probably the first protein with a known function [193]. The first protein known by sequence was Insulin, which was sequenced using Sangar sequencing in 1953 [178, 179, 177, 176]. Friedrich Miescher in 1868 discovered nucleic acids [35] but it was at end of the nineteenth century that experiments on RNA and DNA for sensitivity towards alkaline highlighted chemical differences between them and subsequently RNA were identified as separate chemical molecules from DNA in the cell. The importance and contribution of RNA in synthesizing protein was discovered in 1939 [20]. The relationship between gene, RNA and protein (which is now termed as cen- tral dogma of molecular biology) became clear only in the last seventy years. In 1959, Severo Ochoa discovered the process of RNA synthesis from DNA (called transcription) and was awarded Nobel Prize in Medicine for the discovery [147]. Nirenberg and Matteaei figured out in 1961 that an amino acid in protein is con- stituted by three nucleotides in DNA and that proteins are formed from RNA by a process called translation [144]. Francis Crick is regarded as the pioneer in formu- lating the central dogma of molecular biology in 1956 [31] but the term was formally coined in 1970 in a Nature publication [32]. The summary of Crick’s work is that genetic information is stored in DNA, is transferred to the cytoplasm in form of Ribonucleic acid (RNA) and is finally translated into proteins, the workhorse of the cell. The cellular machinery transcribes instructions encoded in a gene sequence to 10 CHAPTER 2. GENOMES: BIOLOGY AND BACKGROUND TO PROJECT produce RNA of various types [111, 220], out of which messenger RNAs (mRNA) are further translated into proteins (Figure 2.3). Details of both transcription and translation processes can be studied in many textbooks, e.g., in Chapter 4 and 5 of [2].

Transcription

Translation

Figure 2.3: The central dogma of molecular biology – From DNA to RNA to Protein.

Eukaryotic genes can be divided into exons and introns [72]. Exons are DNA sequences that can be found in at least one mature RNA product of the gene. Introns, on the other hand, are DNA sequence not present on final RNA product but which control the presence or absence of a particular exon and which decide order of exons on RNA during RNA splicing [27]. RNA splicing is a cellular process, in which a copy of a gene is made to produce primary RNA, introns are removed from the primary RNA and the remaining sequences are stitched together in the order necessary to produce mature RNA. Hence, one gene can be translated into many diverse protein products (termed gene isoforms) with the presence/absence of certain exons and with shuffling of exons due to RNA splicing. 2.2. EVOLUTION AND EVOLUTIONARY EVENTS 11

2.2 Evolution and evolutionary events

Gene level evolution

Genes are not a static entity; they evolve over time and across generations. This evolution is responsible for creating variance in the gene pool and even developing new gene functions (see [120]). Sequence [26] and indel events (sequence insertions and deletions) are important evolutionary events that change molecular sequence, are known to disturb similarity among genes and play a vital role in sequence divergence among species [143]. Copying errors during DNA duplication, unrepaired DNA damage and indel events of DNA by mobile genetic elements are known to cause permanent sequence visible in the next generation [26]. Since molecular sequence reflects protein structure and function [53], these events play a defining role in functional and structural evolution at the gene level.

Gene duplication and gene loss are significant gene level evolutionary events that are nature’s way of developing new cellular functions, when combined with sequence mutation and indel events. Fisher is credited for introducing the concept of duplications in genes in 1928 [60] and Ohno developed a coherent concept in his work in 1970 [150]. Phenomena like neofunctionalization [150] and subfunctionaliza- tion [191, 66] can take place via gene duplication followed by sequence mutation in one or both genes. Neofunctionalization is a process in which one of two duplicated genes mutates to gain a novel function not attributed with the parent gene [164]. The sialic acid synthase (SAS) gene duplicated and one of the duplicated genes evolved to become the extant antifreeze protein gene in Antarctic zoarcid fish. The functional evolution of antifreeze protein gene is hypothesized to be a result of ne- ofunctionalization after gene duplication of sialic acid synthase (SAS) gene; the ancestral/extant SAS gene has sialic acid synthase and rudimentary ice-binding functionalities, but the antifreeze protein gene specializes in noncolligative freezing point depression [43]. On the other hand, subfunctionalization is an evolutionary process in which each duplicate of original ancestral gene retains a subset of its original function [191, 66]. Hemoglobin protein in Homo sapiens is an example of subfunctionalization, where a duplicate copy of hemoglobin β-chain evolved to form the gene for hemoglobin α-chain but neither of the two chains can function without the other to form a monomeric hemoglobin molecule [28].

Spontaneous formation of genes from DNA (or de novo origination of genes) is another method for new genes to form and has received increasing importance in re- cent times [44, 196]. Though such events are uncommon, yet there is compelling ev- idence for de novo origination of genes in viruses, prokaryotes and eukaryotes, (see, e.g., in Homo sapiens [106], in mammals [132], in Drosophila melanogaster [223], in yeast [19, 52], in Plasmodium vivax [214], in Escherichia coli [41] and in viruses [172]). The function and evolution of de novo originated genes has been discussed in detail by Wu and Zhang [211]. 12 CHAPTER 2. GENOMES: BIOLOGY AND BACKGROUND TO PROJECT

Chromosome level evolution

Evolution occurs at both gene level and at chromosome level on a group of genes. In recent times, the term synteny is defined as conservation of order between two groups of genes/genomic content present in two chromosomes/contigs/genomic re- gions that are being analysed and reflects the gene order conservation between two chromosomal regions from the same or different species. When applied on chromo- somes from two different species, this concept is referred to as shared synteny, e.g., many Homo sapiens genes are syntenic with those of other mammals. Figure 2.4 shows an example of shared syntenic regions between Homo sapi- ens and Mus musculus [24]. Stronger shared synteny be- tween regions than that ex- pected between species can de- pict shared regulatory mech- anisms as well as support for functional relationships be- tween syntenic genes [138]. Synteny can be used to make a rough estimation of evolution- ary divergence between two chromosomal segments [76] be- cause in general, relatively recently diverged organisms share similar blocks of genes in genome and divergence times have been found inversely pro- portional to synteny conserva- tion [36, 209]. Gene translocation and chro- mosome translocations are im- portant evolutionary events Figure 2.4: Mapping of syntenic regions of chromo- somes of Homo sapiens on chromosomes of Mus mus- that rearrange the genome by culus where figure shows chromosomes of Mus mus- separating two loci apart or culus, each chromosome of Homo sapiens is given a join two previously separate unique color defined in the legend and the chromoso- pieces of a chromosome to- mal regions of Mus musculus sharing high similarity gether, thereby changing syn- with chromosomal regions of Homo sapiens are filled tenic conservation between re- with colour assigned to chromosome of Homo sapi- gions [159]. Events like whole ens. [24] genome duplication (WGD) followed by massive gene loss 2.3. BIOLOGICALLY INTERESTING PROTEINS – FERM DOMAIN CONTAINING PROTEINS 13 on one or both chromosomal regions is a special event that has been of great in- terest and is a hot topic in recent times, e.g., in fungi [49], fishes [95], flowering plants [33] and [40]. A common molecular mechanism for translocation is by use of mobile genetic elements, commonly known as transposons, which are stretches of DNA capable of changing their position within the genome either through a cut and paste (termed DNA-only transposons) or though a copy and paste (termed retrotransposons) mechanism [152]. Retrotransposons are common in eukaryotic cells in particular in plant cells and are responsible for substantial parts of genomes, e.g., long interspersed nuclear elements (LINEs) form up to 17% of the human genome and short interspersed nuclear elements (SINEs) make up to 11% of the human genome [30]. Retrotransposons usually use a RNA intermediate that is copied back into genome. Since intronic and UTR regions present in the original gene are not transcribed during transcription, these regions present in original gene are missing in copy gene and play an important role in syntenic and sequence divergence of a eukaryotic genome [152]. In short, evolution occurs at both gene level and at genome level and under- standing of basic biological events such as sequence mutation, sequence indels, gene loss, gene duplication and gene translocation together may explain the differences in content similarity and order similarity between two or more genes on two or more chromosomes.

2.3 Biologically interesting proteins – FERM domain containing proteins

FERM domain containing proteins (FDCPs) are a specific class of proteins, which contain a FERM domain near the N-terminus of the protein. The name FERM has been derived from presence of this domain in band 4.1 (F), ezrin (E), radixin (R) and moesin (M) proteins [25]. FDCPs are characterized as signalling molecules that are used for in-out and out-in signalling between the plasma membrane and cytoskeletal structures. FDCPs can be observed in a diverse group of organisms; in vertebrates, invertebrates and even in plants. Interactions between two proteins or between a protein and a lipid are regulated through FERM domain, which in turn regulate many important protein functions (see, e.g., the activity, sub-cellular localizations and recruitment of proteins and/or lipids into macromolecular complexes). FDCP members with known function are kindlin, myosin, band 4.1, ezrin, radixin, moesin, kinase-like calmodulin-binding protein, talin, Krev interaction trapped (KRIT), Focal adhesion kinase (FAK), Janus kinase (JAK) and Guanine nucleotide exchange factors (GEFs). FDCPs are known to be involved in many diseases (see, e.g., kindlin homologs in Kindler syndrome [102, 85], leukocyte adhesion deficiency III (LAD- III) [194, 203] and abnormal expression in different types of cancers [110, 221, 67]). Embryonic knockout of Kindlin-2 in Mus musculus is lethal for embryo [137]. FERM domain can be divided into three subdomains; F1, F2 and F3, which are observed 14 CHAPTER 2. GENOMES: BIOLOGY AND BACKGROUND TO PROJECT in a clover-like shape [84]. FERM domain is known to interact and bind with many molecules, e.g., Integrin, Ca+2 ions, Actin, IP3, PIP2 and PIP3 [82, 50, 195]. FDCPs are therefore an important biological protein family and evolution of the FERM domain in FDCPs is an interesting subject. Chapter 3

Homology and gene family inference

This chapter contains an overview of gene homology and gene families, and the computational methods to infer homologs and gene families. The chapter starts with a primer on the fundamental concepts and historical background of homology and gene family inference. The chapter concludes with a discussion on current computational methods and important heuristics, in particular sequence similarity and synteny, for inferring homology and gene families.

3.1 Fundamentals

Homology Homology is a widely applied concept in many fields of biology, e.g., in comparative , phylogeny reconstruction [59], and developmental biology [15]. The term homology was introduced in 1843 by Owen [153] and was later adapted in an evolutionary framework [94] after Darwin’s famous concept of evolution by decent under [37]. In short, it was loosely defined as “two parts having a common evolutionary origin”. At present, under the most commonly accepted definition, homologous genes are a group of genes that share a last common ancestor (LCA) either through speciation or through gene duplication [61]. It is important to note that using this definition, homology is a binary concept – two genes are either homologous or not and can not be expressed, for example as percentage homologous [166, 62, 207]. Zuckerkandl and Pauling pioneered in the field of [139, 225]. They highlighted the subtle difference between homologs resulting from speciation and those from duplication in 1960s but the terminology and formal classification of homologs into orthologs (homologs related through speciation at their LCA) and paralogs (homologs related through gene duplication at their LCA) is credited to

15 16 CHAPTER 3. HOMOLOGY AND GENE FAMILY INFERENCE

Walter M. Fitch [61]. This distinction is important mainly because of the argued hypothesis that orthologous genes are functionally more similar than paralogous genes [107, 142, 5, 23, 168]. However, it is universally accepted that homology is positively correlated with common structure and function of genes.

Molecular Sequence Similarity Calculating molecular sequence similarity between all genes is the foundation of homology inference and gene family inference. All homology inference algorithms are primarily based on molecular sequence similarity. Computational techniques for sequence comparison are usually classified into alignment-based and alignment-free methods. Alignment-based algorithms are accurate and robust, but hard to apply for large datasets in comparison with alignment-free algorithms (see, e.g., computa- tion time for performing BLAST and computing homologs using afree for complete genomes of Homo sapiens and Mus musculus [124]). Alignment-free methods (see, e.g., afree [124] and Universal Sequence Mapping [3]) are an efficient alternate based on word count statistics and k-tuple contents for both sequences, but tend to be dominated by single-sequence noise and are not as accurate as alignment-based sequence comparison methods. For further details on alignment-free sequence com- parison methods, see [202]. I will stick to alignment-based methods in this work.

Pairwise sequence alignment Pairwise sequence alignment measures sequence similarity quantitatively, in which two sequences are aligned together such that the common residues are placed in same column and depending on aim of alignment, an objective function represented by a score is maximized. The objective function is measured quantitatively by a scoring scheme, which is based on penalizing indels and scoring matches and mis- matches using substitution scores from a substitution matrix (usually PAM matri- ces, initially developed in 1970 and recalculated in 1978 by Margarett Dayhoff [38]

Figure 3.1: A sample pairwise sequence alignment between two proteins, where conserva- tion is shown in third row under the two sequences, ‘*’ denotes identically aligned amino acid in both proteins, ‘:’ and ‘.’ denote conservation between two non-identical aligned amino acids with highly similar properties and with weakly similar properties respectively. or BLOSUM matrices by Henikoff and Henikoff [89]). A sample pairwise sequence alignment is shown in Figure 3.1, where characters in each row represent the amino 3.1. FUNDAMENTALS 17 acid or nucleotide sequence of a particular protein or gene, ‘-’ represents an indel in one sequence, matching characters in a column represent a match and mismatching characters in a column represent a mismatch. When two sequences are aligned for determining the optimally matching regions or subsequence, such an alignment is termed local alignment. On the other hand, global alignment represents the most optimal alignment for the complete sequence. Variants of these alignment methods are also known (see, e.g., semiglobal also known as ends-free alignment). The exact and exhaustive solutions to determine optimal alignment under a given scoring scheme can be computed by using a dynamic programming algorithm in particular by “Needleman-Wunsch” [141] for global alignment and by “Smith- Waterman” [184] for local alignment of two protein sequences. While these algo- rithms guarantee the optimal alignment under a scoring function, the right scoring function to reflect the alignment goals is usually calculated empirically from data. Applying a minimum score threshold on the optimal alignment score then deter- mines if two sequences are homologs or not. The quadratic computation time required for each pair of sequences makes these methods inapplicable for large datasets. Therefore heuristic algorithms based on k- tuples or word methods became popular and are applicable due to their linear time complexity despite not exploring all solutions. Common software suites in this class are BLAST [6] and FASTA [118]. BLAST and FASTA do not guarantee an optimal solution, and are heuristic methods. It has been shown that heuristic-based simi- larity search programs can be erratic (see, e.g., extension of local alignment beyond actual homologous domain by BLAST [77]). Despite these limitations, BLAST is a famous and widely used similarity measurement tool due to its applicability on large datasets.

Gene family A gene family is defined as a group of homologous genes evolved through a tree-like vertical evolution in which one ancestral gene evolves over time, undergoes several gene duplication and gene loss events, and results in a group of extant genes pooled as a single gene family [149]. Such models of gene family inference follow a strict tree-like evolution, where one gene belongs to exactly one family and all genes in the gene family are homologous to each other. Under this model of evolution, homology is transitive, i.e., if A and B are homologs and B and C are homologs, then A and C are also homologs and A, B and C belong to same gene family. The transitive property of homologs in gene family has been employed in guiding structural studies [167]. The term gene family has many definitions and is often used interchangeably with protein family. In this thesis, gene family refers to all genes that as a whole have evolved from a common ancestral gene. Other definitions of gene family also exist in literature, e.g., arising from structural similarity [29] or from functional grouping based on working in the same metabolic pathway, etc. Domain family reflects sharing between proteins of one or more common domains – a conserved 18 CHAPTER 3. HOMOLOGY AND GENE FAMILY INFERENCE part of a given protein sequence and structure that usually evolves, functions, and exists irrespective of of the evolution in the rest of the protein chain [17]. The term superfamily has been used in the context of domains to reflect structural homology, i.e., common secondary and tertiary structure of two proteins [155]. From this point on, I will use the term gene family only for genes related through vertical inheritance evolving from a common ancestor and all other definitions pertaining to domain, structural and functional grouping will be ignored. There are many advantages and applications of homology inference and gene family classification. Family classification for individual genes helps in describ- ing relationship between genes and in predicting function, structure and expres- sion patterns of newly identified genes due to the shared similarity with known genes [210, 42]. Gene families can also aid in identification of genes that are active in particular diseases [104]. The strict tree-like definition of gene family and homology is a rather simplistic view of homology. Phylogenetic network thinking (PNT) provides an alternative definition of homologs and gene families, where legitimate recombination events along with vertical evolution are allowed [105]. These events turn a phylogenetic tree into a phylogenetic web that relates closely related sequences without affecting homology relationships. The major applications of PNT are in analyzing legiti- mate recombination and understanding contradictions in gene or genome histories. Another mode of visualizing gene family evolution is Goods Thinking (GT) that al- lows evolution by horizontal dissemination of genes (see, e.g., recombination events, fusion, fission, etc.) along with duplications and losses [128, 8]. Refer to [81] for further discussion on non-tree like definitions of homology. There are many known practical difficulties in identifying homologs and defining gene families with the strictly tree-like definition of a gene family [81]. Sometimes more information about a gene is required to classify it as a member of an es- tablished family; Events like convergent evolution (causing similarity between two genes that do not have a common evolutionary origin [154]) are impossible to de- tect on just sequence similarity and more information (e.g., gene order) is needed to identify such gene pairs [61]. In other cases, genes may belong to multiple families; Conflicting homology assignment (e.g., in case of genes related by fission or fusion events) for a gene makes it imperative to assign a gene to two families, which violates the definition of gene family and the transitive property of homology [222]. Strict tree-like evolutionary models lack the machinery to handle these complications and are unable to handle these issues [81]. These complexities can, however, either be handled or ignored in some cases. Sequence convergence remains a rare phenomenon and few examples of genes re- lated by sequence convergence have been found in recent times [45, 21]. Also, convergent evolution of domain architectures is observed rarely [79]. Therefore, sequence similarity arising from convergent evolution can, in general, be ignored in homology inference due to rarity of convergent evolution at the sequence level. In other cases, simplification in gene family inference is preferred over accuracy, e.g., some problems strictly demand assignment of one gene to one gene family. 3.1. FUNDAMENTALS 19

Homology and gene family inference for multi-domain proteins is, especially, challenging (see, e.g., [55]), in particular for gene families containing promiscuous domain(s) [9] and diverse domain architecture(s) [187]. Inserted domain content into two non-homologous proteins (e.g., through convergent evolution) should not make the pair suddenly homologous and should be discounted for in homology inference. However, it is difficult to differentiate between multi-domain homologs (that follow vertical inheritance) and non-homologous proteins with shared domains based on sequence similarity alone. Also, shared domains are common in many species (see, e.g., in Homo sapiens [116]), can link two non-homologous proteins through a strong local similarity, and thus prove to be problematic in homology and gene family inference.

Cluster analysis algorithms Cluster analysis or clustering attempt to group objects more similar to each other than other objects in the same group and other objects in other clusters with respect to an objective evaluation criteria [96]. In gene family inference, given gene pairs with quantified similarity scores, the objective function is to group all homologs (direct or transient depending on the objective evaluation criteria) of a gene together in same cluster and all non-homologs of the gene in other group(s), where set of genes acts as data. So homologous genes are grouped into gene families by applying a clustering algorithm. Clustering is an optimization problem with one or more goals, which depend(s) on underlying data and involve(s) optimization of clustering parameters, usually with trial and failure to get cluster with desired properties. The clustering al- gorithms are notably different from each other in terms of how they define, and efficiently find clusters [212, 213, 136]. The most popular notion of a cluster is a densely populated similarity graph, where each vertex represents a data point and each edge represents a connection (or similarity) between two data points. A cluster is composed of all connected nodes and members not reachable from a node belong to a different cluster. Clustering algorithms try to maximize an objective function, which can be obtaining groups with minimal difference between members of the cluster members or clustering together densely populated areas or defined on intervals in data or particular statistical distributions of data [135]. Data pre- processing and modification of model parameters are usually done until the result achieves desired properties [11]. Clustering algorithms are the backbone for inferring gene families. Hierarchi- cal clustering [205, 48] is the most commonly used technique in Bioinformatics for inferring gene families. Other useful clustering algorithms in data mining include centroid based clustering (e.g., k-mean clustering [123]), distribution-based cluster- ing and density-based clustering [135, 189, 117, 56]. These methods require some prior information about gene families, e.g., the underlying distribution or the num- ber of gene families, which is usually not known before analysis for gene families. 20 CHAPTER 3. HOMOLOGY AND GENE FAMILY INFERENCE

Therefore, methods other than hierarchical clustering can not be used for gene family inference. Hierarchical clustering (connectivity-based clustering) brings related objects closer, increases distance with unrelated objects [205] and provides an extensive hierarchy of clusters, which are joined/divided with each other by the distances between them, instead of a single partitioning of data set. Hence, threshold on distance defines the limit up to which two or more clusters can be combined (ag- glomerative clustering) or a cluster broken into two or more smaller clusters (divisive clustering). Distance computation and linkage criterion play an important role in hierar- chical clustering [96]. Distance functions (see, e.g., Euclidean distance, Hamming distance [83] and Manhattan distance) and the linkage criterion (amount of evi- dence/edges relative to cluster size required for merging two clusters) determines the type of clusters desired from hierarchical clustering. Single linkage (minimum of object distance) [80, 181], complete linkage (maximum of object distance) [39] and average linkage (minimum average distance or UPGMA – Unweighted Pair Group Method with Arithmetic Mean [185]) clustering are popular choices.

Algorithms for measuring synteny for homologous regions Algorithms that measure conservation in synteny of two genomic regions can be broadly divided into two types [91]. Global synteny conservation algorithms measure synteny conservation based on the complete chromosomes without any count restraint on the number of hits in both regions. So, as long as such algorithms are able to find a homolog within a specific predefined distance or even on whole length of the chromosome, they will extend homologous region on both genetic segments and look for next hit using this new pair as anchor genes. Typically these algorithms employ the concept of gene teams [121], where a gene team consists of two chromosomal regions with closely placed homologs. An example of an algorithm based on global synteny conservation is the max-gap algorithm that employs the gene-team concept and outputs regions containing anchors [92]. It uses a maximum length parameter that gives the maximum number of genes on left or right of current anchor gene and iteratively performs this process until no more homolog can be found for any anchor in both chromosomes. Local synteny conservation algorithms measure synteny conservation within a fixed window around an anchor gene and all homologous hits within this region account for measurement of synteny conservation locally for this region. An example algorithm that employs this approach is the r-window algorithm that uses a window size parameter to count the number of homologous pairs within this window. For local synteny, the maximum limit is dependent on size of the window and cannot exceed a certain limit imposed by window size regardless of size of the chromosome while for global synteny, there is no limit on size of the gene team, which can be as large as the size of chromosomes. Algorithms, that measure synteny conservation 3.2. HOMOLOGY AND GENE FAMILY INFERENCE METHODS 21 for differentiation between orthologs and paralogs, employ local synteny (see, e.g., for fungal and mammalian data [100, 180, 18]).

3.2 Homology and gene family inference methods

Homologs and gene families can be inferred from each other. Given a gene fam- ily, all pairs of members of this family are homologs by transitivity. Given all homolog pairs, a specified clustering algorithm can be applied to infer gene fami- lies. Some methods first infer homologs (see, e.g., Neighborhood Correlation [186]) followed by applying clustering algorithm to infer gene families (see, e.g., [99]). Other methods infer gene families directly from similarity data typically employing all-vs-all BLAST scores (see, e.g., GeneRage [54], TribeMCL [55] and SiLiX [134]) and homologs can then be determined from each gene family using the transitiv- ity property. The following sections reviews some methods used for gene family inference.

BlastClust, SiLiX and other similar approaches Some gene family inference applications apply a threshold to either BLAST bitscores, E-values, alignment length, or a combination of these parameters. Gene family in- ference algorithms apply a clustering algorithm on top of inferred homologs from all-versus-all BLAST. Typical examples include BlastClust (single linkage algorithm directly applied on BLAST results with a specific threshold) [65], SiLiX (a mem- ory and time efficient implementation of BlastClust) [134] and ProtoMap (restric- tive single linkage clustering using different levels of thresholds to yield an ordered grouping of all proteins) [219]. The main shortcoming of these approaches is the dif- ficulty in estimating a universal threshold value, e.g., on alignment length, bitscore or E-value for inferring homology, and they suffer from not modelling domains.

GeneRage Multidomain gene families with diverse architecture are problematic to cluster using linkage algorithms directly on BLAST scores. A simple way to resolve this issue is to detect and correct for missing links or incorrect links in the graph-theoretical approach. Enright et al. [54] developed an algorithm, called GeneRage, with the ability to detect and correct such erroneous links. Hence, problems caused by the presence of multidomain proteins are minimized and more precise gene families are obtained. Similarity relationships are stored in a matrix consisting of binary numbers. Smith-Waterman dynamic programming alignment algorithm is performed for sub- sequent symmetrification of matrix to remove false positives. The authors have used the simple homology transitivity criteria explained in Section 3.1 to detect multi domain proteins. Smith-Waterman is then performed in successive rounds, which detects protein families consisting of multiple domains and removes some 22 CHAPTER 3. HOMOLOGY AND GENE FAMILY INFERENCE of the incorrectly recognized similarity relationships within the symmetrical ma- trix. Single-linkage clustering is performed on corrected matrix and initial larger clusters, containing multi-domain families, are split using domain architecture in- formation. Hence, this algorithm clusters large protein datasets into families and is particularly useful to detect and eliminate fusion genes – genes that are a result of fusion of two other genes. Figure 3.2 displays the flowchart of this algorithm.

Figure 3.2: Schematic representation of GeneRage algorithm (adapted from Enright et al. [54]).

However, this algorithm is unable to deal with promiscuous domains1, peptide fragments and proteins consisting of complex domain structure, which are generally not present in smaller prokaryotic datasets but are widely present in eukaryotic datasets. A typical example is that of the ‘response regulator’ domain from two- component systems [190] that causes incorrect grouping of functionally different proteins (e.g., heat shock factors and phytochromes) [22] to the same family [218, 55].

1 A protein domain that is found with many distinct domains in multiple functionally unre- lated proteins [126]. 3.2. HOMOLOGY AND GENE FAMILY INFERENCE METHODS 23

Markov Clustering and TribeMCL Enright et al. [55] have applied another graph theoretic approach in an algorithm called TribeMCL for clustering of protein sequences into families. In order to overcome aforementioned problems with multidomain proteins, TribeMCL uses the same approach as GeneRage but with a more elegant mathematical and probability- based approach instead of Smith Waterman alignment, domain detection and single linkage clustering. A simplistic view of TribeMCL is preprocessing of all-versus- all BLAST results followed by application of the Markov Clustering (MCL) algo- rithm [199]. The source code and executable program TribeMCL is not available any more but the MCL implementation, available from [200], can be used as an alternative to TribeMCL. The flowchart for TribeMCL is given in Figure 3.3.

Parse results and Input Set of All vs All symmetrify similarity Protein Sequences Blast scores

Matrix Inflation

Normalize similarity scores (-log[evalue]) to Markov Matrix Similarity Matrix generate transition probabilities Matrix Squaring (Expansion)

Terminate when no further change is observed in the matrix

Interpret final matrix as a clustering Core MCL Algorithm

Protein Clusters Post-Processing (Families) and Domain Correction

Figure 3.3: A flowchart of Tribe-MCL algorithm (adapted from Enright et al. [55]).

MCL is an unsupervised cluster algorithm for graphs based on simulating the- oretical flow in weighted graphs [199], where the main idea is to further strengthen the stronger links (based on the concept that stronger links are used more during a random walk than the weaker links) and to weaken and finally remove the weaker links. The data (E-values from all-versus-all BLAST) is represented as a Markovian matrix – the sum of all elements of a row is exactly one and each cell can only have non-negative values. The inflation parameter I determines speed and granularity 24 CHAPTER 3. HOMOLOGY AND GENE FAMILY INFERENCE of clustering. The matrix is inflated first, i.e., raised to power I. After every in- flation, the matrix is expanded, i.e., the Markovian matrix property is restored by normalization. The process of inflation and expansion continues until there are no more changes in the matrix or the changes are within a break point and at this point, MCL is said to have converged.

HiFiX An approach to significantly decrease false positives and false negatives is devel- oped by Miele et al. [133] called HiFiX. HiFiX as well as other gene family inference software consider the gene family inference problem as a graph -theoretical problem where nodes are represented by vertices, and similarities by edges. Using precom- piled families with relaxed threshold settings generated from SiLiX, the input to HiFiX is pre-families of sequences with good sensitivity. HiFiX then takes ad- vantage of the community1 structure of this similarity network and maximizes modularity2 of these communities to divide each family into independent smaller families at weak links. HiFiX uses multiple sequence alignment, a community deter- mination algorithm (Louvain [13]), a hierarchical algorithm (that merges sequences iteratively into meta-communities3) and alignment likelihood using profile-HMM models [51] for evaluation of each meta community.

Clustering algorithms applied on Neighborhood Correlation scores While aforementioned gene family inference methods have used all-versus-all BLAST results as similarity measure and apply a clustering algorithm to infer gene families, Joseph et al. [99] infer homologs first and then apply a clustering algorithm to infer gene families. BLAST scores are first transformed into Neighborhood Correlation (NC) scores and the NC score is then used as a similarity measure between a gene pair [186]. A threshold is applied on NC scores to infer homologous gene pairs and a clustering algorithm is applied on these pairs to infer gene families. NC distinguishes between sequence pairs that have evolved from the same last common ancestor, and those with a common inserted domain but are otherwise not related. In some ways, NC is similar to MCL. Other applications use BLAST results directly but MCL and NC transform BLAST results into a standard score within a range. The intended datasets for both these methods are multidomain protein families. Both NC and MCL are empirical and are reliant on sequences in sequence

1 A grouping of vertices into clusters such that vertices of the same cluster have a lot of edges between them and relatively fewer edges with the vertices belonging to other clusters [133]. 2 Modularity determines the ratio between existing edges of clusters and expected number of edges for a random graph with the same degree distribution where the degree of a vertex is number of edges connected to the vertex [133]. 3 A cluster formed by merging distinct communities containing homologous sequences be- longing to the same protein family [133]. 3.2. HOMOLOGY AND GENE FAMILY INFERENCE METHODS 25 database on which all-versus-all BLAST is performed. Also both methods do not use or detect underlying domains explicitly. However, for MCL, transformation is from E-values of an all-versus-all BLAST to probability values between 0 and 1 but NC transforms bitscores of an all-versus-all BLAST to correlation scores also ranged between 0 and 1. Similarity Network Reveals Common Ancestry

Figure 4. Differences in neighborhood structure of the sequence similarity network reflect differences in evolutionary history. Network neighborhoods in which nodes represent sequences. Edges connect pairs with significant sequence similarity. Edge weights reflecting degree of sequence similarity are not shown. (A) The neighborhoods of the homologous pair, PDGFRB and PRKG1B. PDGFRB and PRKG1B share 779 neighbors, mostly Kinases (turquoise nodes). These are strong matches due to a shared kinase domain. PDGFRB has 183 unique neighbors, mostly Figure 3.4:due Figure to weak matches displaying with Ig domains the (green distribution nodes). PRKG1B has 142 of unique unique neighbors and due to common weak matches with hits the incNMP-binding the neighbor- domain (red nodes). Other matching sequences are shown in yellow. (B) PDGFRB and NCAM2, a domain-only match, have 232 matches in common. PDGFRB has hood for two730 unique homologous neighbors and NCAM2 (athas top) 240, mostly and due two to Fn3 non-homologous domains (dark blue nodes). (at bottom) genes. Differences in neighborhooddoi:10.1371/journal.pcbi.1000063.g004 structure (denser at the middle for homologs and at the two edges for a non-homolog)Conversely, in thePDGFRB graphand NCAM2 depictedare related here through points domain toA the Benchmark difference Dataset infor Multidomainthe path Homology taken by the proteins duringinsertion and evolution have significant [186]. sequence similarity due to a shared Evaluation of classification performance requires a trusted set of Ig domain. Their shared neighborhood is relatively small (242 positive examples (known homologous pairs) and negative sequences) and comprised primarily of Ig-based matches. These examples (pairs known not to share common ancestry). Although contribute little to the Neighborhood Correlation score of this pair benchmarks are available for detection of remote homology (e.g., NC calculatesdue to low sequence correlation conservation within score the Ig superfamily. for each In pairSCOP [38], of CATHgenes. [39]), functionalTwo ordered similarity (e.g., lists the Gene of contrast, the unique neighborhood of PDGFRB is large (630 se- Ontology (GO) [59]), orthology (e.g, COGs [40]), and structural BLAST bitscoresquences), with strong are edge computed weights. For these using reasons, PDGFRB commongenomics and ([16,45,60], unique and work BLAST cited therein), hits we are between unaware of and NCAM2 have a Neighborhood Correlation score of 0.29, any gold-standard validation dataset for multidomain homology. both genes.distinctly Each smaller thanlist the consists score for PDGFRB ofand thePRKG1B neighboring. Unlike Our hitsbenchmark of isa designed gene, to where be suitable neighboring for testing two hit is definedsequence as comparison, a gene this with clear difference a BLAST in neighborhood score withclassification this goals: gene. good overall A correlation performance on a large score set of is structure can be used to recognize multidomain homology. sequence pairs and consistent performance on individual families now computed between the pair of genes using these lists as data, which reflects the differencePLoS in Computational density Biology between | www.ploscompbiol.org common and 6 unique hits May of 2008 both | Volume genes 4 | Issue as 5 |shown e1000063 graphically in Figure 3.4. In a graph, a homolog pair ideally shows a dense common 26 CHAPTER 3. HOMOLOGY AND GENE FAMILY INFERENCE neighborhood and a comparatively sparse unique neighborhood for both genes while a non-homolog pair tends to show a sparsely populated common neighborhood and a dense unique neighborhood. A threshold can now be applied on NC scores to infer homologs. The authors [99] recommend a threshold of 0.5. Also, clustering analysis can now be performed using NC scores as input instead of BLAST score and it has been shown to perform better on diverse domain architecture families. It is important to discuss treatment of data in NC. Data is partitioned into two datasets, not necessarily disjoint with each other. The query dataset Q con- tains genes for which we want to infer homology relationships and to classify into gene families. The reference dataset R contains genes providing evidence for (non- )homology of genes in the query dataset but for whom homology inference is not inquired. The reference data R plays an important role in inferring homologs and gene families. If the reference data is, for example, rich in one particular promis- cuous domain present in both non-homologous genes but lacks or have few cases for the second domain present in one of the two non-homologous genes, high NC scores will be observed. On the other hand, if reference data does not contain cases of promiscuous domain, then homology inference will be different.

ProClust, PhyRn, Profile HMM and other distant homology inference methods Sometimes, the goal of a gene family inference algorithm is to infer gene families containing remote homologs – homologs that are members of highly divergent gene families. It is difficult to infer remote homologs with similarity based techniques because the molecular sequences have diverged as far as twilight zone (≤25% amino acid identity), which are known to be problematic for similarity based techniques in general. So, specific algorithms based on, e.g., iterative or transitive search, profile Hidden Markov Models and Position Specific Scoring Matrices (PSSM) [156, 12, 51, 88] have been developed for inferring gene families containing remote homologs.

Homology inference using sequence similarity and synteny Researchers have used additional information along with sequence similarity to aid in homology inference because similarity alone is not completely synonymous with homology; Examples of disagreement are distant homologs and genes related through convergent evolution. As discussed before, traces of evolution are visible at both gene and genome level, which result in divergence in gene content and gene order. Homologous genes that are a result of regional duplications have more chances of retaining their neigh- bourhood conservation [145]. However, gene translocation, tandem duplications, de novo origination and gene loss events are responsible for divergence in gene order conservation but the sequence similarity is not disturbed. It is, therefore, natural to use gene order conservation in conjunction with sequence similarity as a measure of 3.2. HOMOLOGY AND GENE FAMILY INFERENCE METHODS 27 homology. In particular gene order conservation has been used to differentiate be- tween orthologs and paralogs [100], albeit paralogs related by regional duplication events, e.g., whole genome duplication cannot be differentiated by this approach. Divergence time between species, represented by a species tree, is another important heuristic that can aid in inferring homologous genes. It is important to note the difference between homology inference and homolo- gous or syntenic region inference. The main aim of homology inference is to infer pairs of genes that are homologous while the main aim of inferring syntenic regions is to infer two regions in which some pairs of genes between the regions are homol- ogous. For inferring homologous regions, homologs are either provided from the start between the two or more regions, or are inferred as an intermediate step.

SYNERGY Wapinski et al. [204] developed a software (called SYNERGY) that can optionally use synteny information along with sequence similarity in inferring homologs, gene families and phylogenetic trees to determine the origin and evolution of all genes in a collection of species. This results in a better classification of orthologs and paralogs than most similarity-only based methods. The input to SYNERGY is a collection of species, protein-coding genes in each species and a phylogenetic species tree. Synteny information is a bonus that, if available, can also be used. The aim of SYNERGY is to divide the groups of genes into unique sets that do not have a gene in common, and each set consists of exactly those genes that can be traced back to a common hypothetical single gene present at the root of the species tree (known as the last common ancestor of all species). By doing so, SYNERGY also resolves the evolutionary history of a gene family and produces a gene tree for each gene family. SYNERGY traverses the species tree using post-order traversal. Orthogroups1 are determined for the current node using orthogroups and similarity relationships determined in the previous steps for the children of the current node. For leaves (i.e., extant species), similar genes are grouped to form the initial set of orthogroups (in this case, an orthogroup consists of paralogs only). For internal nodes including the root (i.e., for ancestral species), orthogroups present in the two children of the current node are grouped together if they share more similarity than a specified limit. The process ends, when the root of the species tree (or the last common ancestor of all species) is reached. Sequence similarity can be calculated from a similarity-only method as well as combining it with other information available, e.g., with synteny. The sequence similarity and synteny scores are scaled, weighted and then combined to calculate a single rooting score between the two proteins. The sequence similarity is measured by first globally aligning the two proteins and then searching for the most likely distance, which explains the substitutions in each

1 Group of genes related by a duplication or a speciation at or below the selected internal node representing the last common ancestor of both sub-trees [204]. 28 CHAPTER 3. HOMOLOGY AND GENE FAMILY INFERENCE aligned position. The syntenic conservation is quantified by a syntenic similarity score, which is defined as the ratio of orthologous neighbors of both proteins (like the R-window method above). SYNERGY does not require any homology or gene family information as input, unlike many other phylogenetic tree inference algorithms. It computes homology and gene tree simultaneously using species tree and orthogroups. However, the similarity criteria used by SYNERGY is ad-hoc, where weights assigned to each of similarity, synteny and likelihood is impossible to assign for gene families with different divergence rates for genes and genomes [1]. Chapter 4

Phylogenetic Inference

This chapter provides an introduction to Phylogenetic inference methods. The chapter starts with the biological motivation behind phylogenetic trees and briefs about the computational techniques for various phylogenetic tree construction meth- ods with a special focus on the Bayesian phylogenetic inference.

4.1 Fundamentals of Phylogeny

Phylogenetics infers and describes the evolutionary relationships among groups of organisms based on molecular data and/or morphological traits. Molecular phylo- genetics emerged in the early 1960’s and was pioneered by Zuckerkandl, Pauling, Fitch, and Margoliash [226, 63]. Following their lead, Kimura [101], Ohta [151], King and Jukes [103] made landmark contributions to molecular biology, establish- ing the practice of using sequences for investigating evolution.

Gene trees and species tree evolution Evolutionary relationships are often described as a tree. In a typical model of species evolution, a process of speciation and extinction applied to an initial ancestral species determines a species tree describing the evolution of present-day descendant species. A gene tree represents evolution for a gene family, where typically an ancestral gene evolves through mutations, insertions, deletions, duplications and losses. However, a tree is not always applicable for a species tree due to reticulate events such as hybridization of species. In presence of these events, a network- like structure is a better representative for species tree. Since we do not account for these events, we will consider trees and not networks for the remainder of this thesis. The information available for phylogenetic inference based on sequences is typ- ically DNA sequences (or RNA or protein derivatives) of gene families for leaves of the tree. The underlying assumption is that evolutionarily closely related species

29 30 CHAPTER 4. PHYLOGENETIC INFERENCE have more similar molecular sequences than distantly related species, according to some model of evolution. A standard reference that details most phylogeny methodology is Joseph Felsenstein’s “Inferring Phylogenies” [59].

Reconciling gene tree and species tree evolution Assuming that gene family evolution and species evolution are both tree-like, the two trees will typically show a high degree of concordance, and by overlaying them, a gene-species tree reconciliation is obtained [46]. Most parsimonius reconcilia- tion [78] is the most well-known technique to reconcile a gene tree with a species tree under the model of duplication and loss. One example of gene-species tree rec- onciliation can be seen in Figure 4.1. There are many applications of gene tree and

Speciation event or species

Gene duplication event

Gene loss event

Species tree

Gene tree Extant gene G11 G11 G43 G42 G41 G31 G21 G12 Spec4 Spec3 Spec2 Spec1

Figure 4.1: An illustration of a gene family that has evolved inside a species tree. Image also shows different evolutionary events that influence a gene tree. species tree reconciliation but in this thesis, we are more concerned with finding evolutionary history (i.e., events and topology) of a gene tree, given species tree and molecular data of a gene family. Some recent approaches of exactly how gene 4.2. COMPUTING PHYLOGENETIC TREES – TRADITIONAL APPROACHES 31 trees are computed with and without using the species tree are shown in the next section.

4.2 Computing Phylogenetic Trees – Traditional Approaches

The input to a phylogenetic analysis program is, usually, the molecular sequences aligned into a multiple sequence alignment (MSA) [157]. Each column of the MSA represents an evolutionarily homologous site and can have insertions, deletions and gaps. A substitution model states the stationary frequencies of nucleotide/amino acid codons and their interchange rates, and is used to measure evolutionary dis- tances between pairs of sequences. Each site in the MSA is treated independently from the others. Hence, the substitution model along with penalties for gaps is used to score each site (column), which taken together represent the evolution of one sequence to another. Parameters for Substitution models are either estimated during inference (e.g., GTR model [197]), or from empirical data (e.g., PAM [38], BLOSUM [89] and JTT [98, 75, 97] matrices). The branch lengths of a phylogenetic tree represent the distance between se- quences, calculated by applying the substitution model. The topology of a phylo- genetic tree reflects the relative closeness and order of divergence between sequences in the MSA. There are many approaches designed to infer phylogeny, which are, in order from simple to complex (and in order of appearance), parsimony-based methods (e.g., Maximum Parsimony [63, 188]), distance-based methods (e.g., Neighbor Join- ing [175]), maximum likelihood methods (e.g., Felsentein’s pruning algorithm [57] and Quartet Puzzling [192]), and Bayesian methods (e.g., Markov chain Monte Carlo [112]). The former approaches require lower computational and time re- sources and are applicable for very large datasets. The latter approaches are more accurate and enable more realistic modeling but are computationally expensive. More details on these methodologies can be found elsewhere [59, 169, 216].

4.3 Bayesian phylogenetic inference

Bayesian inference relies on Bayes theorem for obtaining posterior probability Pr(θ|D) of underlying parameters θ, e.g., duplication rate, loss rate, etc., given data D. Pa- rameters θ are treated as random variables, and the entire posterior landscape is investigated with respect to θ in Bayesian inference. The posterior probability can be calculated using Bayes formula P r(D|θ)P r(θ) P r(θ|D) = (4.1) P r(D) where Pr(D|θ) represents likelihood, Pr(θ) is prior and Pr(D) is a normalization constant. 32 CHAPTER 4. PHYLOGENETIC INFERENCE

Note that in Bayesian phylogenetic inference, the underlying parameters θ can be classified into real-valued parameters (also called continuous parameters in litera- ture) and discrete parameters. The real-valued parameters consist of all parameters with numeric values. The discrete parameters in Bayesian phylogenetic inference consist of the parameters associated with the topology of the tree parameter. The branch lengths parameter can be real-valued (if numeric, e.g., in MrBayes [170]) or discrete (if discretized). The most popular way of performing Bayesian phylogenetic inference is by ap- plying the Markov chain Monte Carlo (MCMC) algorithm [215, 127, 115], which is a random walk algorithm that estimates posterior distribution by sampling (e.g., by Metropolis-Hastings sampling scheme [86] or Gibbs sampling [69]) from the pa- rameter distribution [7]. The backbone of MCMC is to sample a value for each parameter (including real-valued and discrete parameters) from the parameter dis- tribution using a sampling algorithm (typically Metropolis-Hastings), compute the likelihood of observed data (MSA sequences and optionally other information like species tree) with these parameter settings, accept/reject the new state based on comparison on likelihood with the older state, and continue this process until a specified number of iterations are reached or a certain break condition is achieved. Some characteristics (e.g., mixing, convergence assessment and burnin estimation) of MCMC (only with Metropolis-Hastings sampling scheme) are important, and are discussed in detail in Chapter 5. Bayesian analysis is used to estimate the posterior distribution of real-valued parameters and of discrete parameters. The ability of searching both real-valued parameter space and discrete parameter space simultaneously enables biologically realistic multivariate models needed to solve more complex problems. Indepen- dence between samples is ensured by using thinning (sampling every nth iteration) from the MCMC chain and the sampled iterations are called an MCMC trace. The MCMC trace represents an estimate of the posterior distribution of each parame- ter. For the trace of a real-valued parameter, mean, mode or confidence interval of the marginal distribution can be examined. For the trace of a discrete parame- ter, posterior frequencies of tree topology, tree at state with maximum likelihood, maximum-a-posteriori tree, the majority rule consensus tree, or split frequencies of terminal vertex can represent the result of the whole analysis. MrBayes [170], BEAST [47], PrIME [1], JPrIME [183], BAMM [160] and Phy- loBayes [113] are popular Bayesian software, which are based on MCMC, estimate posterior distribution by using Metropolis-Hastings sampling scheme and are widely applied for phylogenetic inference. The input to these software is typically MSA of molecular sequences of each gene family along with a few software-specific require- ments (e.g., species tree in case of PrIME and JPrIME). The output is the MCMC trace of real-valued parameters and of the discrete parameters. Chapter 5

Post-processing of traces from MCMC runs

The aim of this chapter is to discuss different tasks associated with post-processing of Markov Chain Monte Carlo (MCMC) runs. This chapter introduces important characteristics of MCMC briefly followed by possible uses of MCMC in Bayesian phylogeny inference and concludes with a discussion on the currently available software that analyse MCMC.

5.1 MCMC Characteristics

MCMC trace Basic assumptions for a Bayesian inference using Metropolis-Hastings sampling scheme are that samples can be obtained from MCMC simulation with relative frequencies that agree with sample frequencies obtained from true distribution and all samples are dependent only on the last state. A state in MCMC is a set of parameter values. MCMC initializes from a given state, samples from posterior distribution for all parameters simultaneously in the neighbourhood of the current state, computes the likelihood of the new state, selects or rejects the new state based upon the ratio of likelihood of the new state and the old state, and iterates until a specific condition like number of iterations has been met. Acceptance of a state means that current parameter values are changed with proposed parameter values. Rejection of a state means that the proposed parameter values are discarded and the current state is retained in this iteration. Rapid mixing in the chain is essential for exploration of the parameter poste- rior distribution and it’s ability to converge quickly to the stationary distribution. Acceptance ratios for each parameter can be calculated from accepted proposals and the total proposals made for a particular parameter during a complete run. High acceptance ratios of MCMC proposals are indicative of rapid mixing and

33 34 CHAPTER 5. POST-PROCESSING OF TRACES FROM MCMC RUNS

Iter Post Dens Subs Dens DLR Dens Dup Rate Loss Rate 0 -2213.8539 -2169.0250 -44.8289 1.6578 1.6578 100 -2063.5709 -2014.2883 -49.2826 2.8984 1.1652 200 -1974.9265 -1930.6208 -44.3058 0.9655 1.5388 300 -1969.5395 -1925.7214 -43.8181 2.2248 1.6864 400 -1955.1124 -1910.5606 -44.5517 1.4999 2.3079 500 -1949.8377 -1904.5363 -45.3014 1.2243 1.7212 600 -1943.3235 -1902.6028 -40.7207 1.3488 1.7029 700 -1942.5775 -1903.1583 -39.4192 1.9042 2.7428 ...... 999900 -1895.3154 -1938.9227 43.6073 1.7294 0.7909 1000000 -1901.6213 -1938.0307 36.4094 3.2748 3.1813

Table 5.1: A typical MCMC trace with five real-valued parameters and a thinning factor of 100 with 1000000 iterations. The tree parameter has been omitted due to lack of space. that the posterior space has been well-explored. On the other hand, low acceptance ratios are the first indication of poor mixing and cause of non-convergence. The samples are preserved by using a thinning factor so that samples produced in a run are independent of each other and make the subsequent statis- tics computationally feasible. The sampled iterations are called an MCMC trace. Table 5.1 shows how a typical MCMC trace for five pa- rameters with a thinning factor of 100 looks like and Figure 5.1 rep- resents the graphical interpretation Figure 5.1: Graphical illustration of trace of a of an MCMC trace. MCMC run for three parameters that can be ex- tended to more parameters with further dimen- sions in the graph [182]. The run intends to Convergence assessment find the optimal state (the parameter values) that and burnin estimation maximizes the objective function. Due to lack of prior knowledge, it is common to start MCMC simu- lation at a random point (or based on some heuristic) in parameter space. Usually, the initial choice is far from high- 5.1. MCMC CHARACTERISTICS 35 density regions of the posterior distribution. If the chain is mixing well and has been allowed to run long enough, convergence is guaranteed for MCMC algorithms, where the chain is sampling from a region representing the stationary distribution but convergence is unfortunately not decidable. The initial iterations are the burnin period of the chain and the remaining iterations are the stationary part of chain.

Figure 5.2: A typical MCMC trace shown as a plot, where the sample # is shown on the X-axis and the parameter is plotted on the Y-axis. In this case the parameter is the log-likelihood value of the sample. Green bar separates the burnin period in the chain that was estimated visually and the trace appears to have converged for this parameter after this point.

One is often interested in estimating mean and variance of the posterior distri- bution of some parameter. The chain is assumed to have converged, if mean and variance can be measured as close to specified values for the true distribution. Such a convergence is termed as “convergence of an ergodic estimate”. The burnin period of a trace slows down convergence of an ergodic estimate because it interferes with mean and variance estimates and it is recommended to remove burnin samples. 36 CHAPTER 5. POST-PROCESSING OF TRACES FROM MCMC RUNS

The simplest way to assess if a chain has converged or not is through visual analysis of the trace, typically using one or more of trace plots, density plots and running mean plots alongside basic statistics. Trace plots also display mixing of the chain. If a chain is mixing well and appears to be sampling from the same region for better part of the chain, then it appears to have converged. Such assessments are simple and based on expert opinion [73]. However, they can not be automatized and lack objectivity. A typical MCMC trace is shown in Figure 5.2 and the green line marks the burnin estimate that was decided visually. Another commonly employed heuristic for burnin estimation is to remove a fixed percentage (none or 10% or 25%) of samples from the chain and take the rest as converged or assess for convergence using other convergence diagnostics. Such heuristics are easy to perform and are automated, but do not observe the trace characteristics and miss non-convergent runs altogether. An alternate way to assess convergence and estimate burnin is to perform sta- tistical estimates of convergence on the real-valued parameters, e.g., Geweke [70], autocorrelation plots, Gelman-Rubin [68], Raftery-Lewis [161], Heidelberger and Welch stationarity test [87] and using effective sample size (ESS) [114, 198] as shown by Sahlin [174] and Höhna [93], which are objective, consistent and can be automated. A drawback for convergence diagnostic methods is that they are based only on real-valued parameters, and can not be applied on discrete parameters (e.g., tree parameter). Assessing convergence for Bayesian inference in phylogeny remains a difficult and non-standardized task [93]. Frequencies of splits [171] have been used to assess convergence for tree parameter, and are implemented in MrBayes [170] and Are We There Yet (AWTY) [146]. MrBayes periodically checks the difference in variance for splits of trees on parallel chains. AWTY explores the splits and plots split- based statistics and trace plot of log-likelihood of trees. However, parallel chains are required for split frequency analysis, and only comparing the split frequencies and not the real-valued parameters can be misleading [93].

5.2 Post-processing MCMC traces

Statistical analysis MCMC runs are to be analysed statistically to infer important information like mean parameter values and standard deviation observed during a run. If the true distribution from which Markov chain is sampling is known, then the sufficient statistics of this distribution can be compared with the parameters of the estimated distribution for assessing accuracy of phylogenetic inference.

Trace analysis As discussed before in Section 5.1, visual inspection of a trace is an indicator of assessing convergence and can be used to view trends that might suggest problems 5.2. POST-PROCESSING MCMC TRACES 37 with convergence. Apart from convergence assessment, it is also important in as- sessing mixing of the chain. Ideally if the Markov chain mixes properly, it should avoid getting stuck in local optima.

Joint marginal distribution analysis Another significant feature recommended for analysis of Markov chains is to check dependence between variables. A joint-marginal distribution plot displays the de- pendency trends between two parameters and can be used to check independence of both parameters.

Tree split histogram and analysis Tree parameter is a special parameter, significantly different from real-valued pa- rameters, and requires special attention and methods to analyse. An unrooted phylogenetic tree can be represented by a set of splits – a bipartition of the tree representing the smallest unit of information, where each unique non-trivial bipar- tition of the set of leaves of the phylogenetic tree represents one split. The set of splits obtained from all trees present in the posterior can be plotted as a histogram, where each bin corresponds to a fraction of total samples, a split is encountered in the whole run. This histogram can also indicate mixing of tree parameter.

Maximum-a-Posteriori tree topology The “Maximum-a-Posteriori” tree topology (also called MAP tree topology) de- scribes the tree associated with the sampled state with the highest posterior prob- ability in the MCMC chain or simply, the most frequently sampled tree. The MAP tree topology provides a simplistic and informative measure to select a sin- gle topology from the whole set of trees. However, in some cases with large trees, the topology of every sampled tree may be unique, which makes it impossible to differentiate and choose between trees based on frequency. Also, if the number of converged samples is few, then the true MAP tree is hard to estimate.

Majority rule consensus tree and simple majority tree The majority rule consensus tree is defined as a tree that contains all clades occur- ring in at least half of trees in the posterior distribution [58]. A slightly lenient rule is simple majority tree (also called a fully resolved consensus tree), where remaining clades are ordered and selected according to decreasing posterior probability. There is a constraint, however, regarding the compatibility of each newly selected clade with all previously selected clades. This method gives out a single tree, whose parts are agreed by the majority of trees in the posterior distribution. But a perceived drawback with majority trees is that it can lead to a tree with topology that has not been sampled during the run as well as to a tree topology with relatively low 38 CHAPTER 5. POST-PROCESSING OF TRACES FROM MCMC RUNS probability, despite the fact that many features with very high probability will also be present.

Tree and tree space visualization

Gene trees are usually stored in Newick (also called New Hampshire), or Nexus formats. While it is easy to store and systematic to understand them for the computer, they are very hard to understand for humans in particular due to keeping tracks of brackets in these formats to understand when an underlying subtree has finished. On the other hand, a visual illustration of tree like in Figure 4.1 is very easy to interpret for humans. Tree visualization software (see, e.g., FigTree [162] or Forester [224]) convert the computer readable formats into tree illustrations. It is also interesting to see how the chain progresses in tree space over time. Some algorithms, for example multidimensional scaling technique [108], can draw distances between trees in a two dimensional space with a minimum error. One can convert the tree topologies into a distance matrix using a good tree distance metric (e.g., Robinson-Foulds metric and Nodal distance metric [14]) and use this distance matrix as input to multidimensional scaling to view distribution of trees in tree space. This information can aid in assessing if the tree parameter has converged or not. It can also be employed to gain more knowledge about mixing for tree parameter.

Post-processing of parallel MCMC traces

Some software (e.g., MrBayes) can run two chains in parallel and output both chains. They analyze both chains simultaneously during the run to assess conver- gence or use some statistical measure (e.g., difference in split frequencies) on tree parameter of parallel chains to assess stopping criteria. One can also run two sep- arate runs on the same data using same or different starting points in parameter space to minimize possibility of getting stuck in local maxima. Both these cases are based on the assumption that on convergence, both chains must be sampling from the same distribution. To assess if both chains have converged to the same poste- rior distribution, hypothesis testing can be used, where the null hypothesis is that both samples have been drawn from the same distribution, implying convergence. Chi-square tests for two samples [217] for tree parameter and Mann-Whitney U test [125] can be used for hypothesis testing. Other tricks like Metropolis-coupled MCMC [131] can be used with multiple chains to improve mixing and exploration of tree space, and has been used in several Bayesian phylogenetic software (e.g., in MrBayes [4] and in BAMM [160]). Figure 5.3 displays traces of two parallel runs simultaneously. 5.3. SOFTWARE PACKAGES FOR POST-PROCESSING OF MCMC TRACES 39

Figure 5.3: A trace of two parallel chains superimposed on each other with red line for first run and blue line for second run. The green bar indicates the visually selected burnin value, after which both chains are appearing to sample from the same space.

5.3 Software packages for post-processing of MCMC traces

Following is a brief introduction of major software that analyse the MCMC output and are commonly used to deduce results from MCMC runs.

Coda Coda [158] is a standard MCMC analysis and convergence diagnostic package writ- ten for the R language by Martin Plumer, which is popular mainly due to multiple and diverse convergence diagnostics, automatic estimation of burnin and the ability to run from command line. It includes most standard convergence diagnostics, and can perform effective sample size method, Raftery-Lewis diagnostic [161], Geweke diagnostic [70], Gelman-Rubin diagnostic [68] and Heidelberger-Welch stationarity test [87]. It can remove burnin samples using one of these tests, and draw the trace 40 CHAPTER 5. POST-PROCESSING OF TRACES FROM MCMC RUNS of a run using GNU plot commands. It can also run a MCMC chain with a given statistical distribution, thinning factor and other MCMC parameters. However, Coda does not handle the tree parameter at all, and can only analyse and generate MCMC runs for real-valued parameters.

Are We There Yet Are We There Yet (or AWTY) [146] is a Perl-based program that analyzes MCMC runs and is used mainly to predict if a chain has converged or not by observing and performing convergence tests on the tree parameter. The input to AWTY are trace files containing phylogenetic trees which have been generated as output by other phylogenetic MCMC programs, and is based on tree splits to assess the convergence. It uses Coda [158] for some of the convergence diagnostics. The main contribution of AWTY is to help a user in performing the graphical exploration of tree-specific parameters. The split histogram, the trace plots of log likelihoods, parallel chain analysis, split frequency analysis and plotting the symmetric tree distance are some of the functionality available in AWTY. It is a useful tool to analyse the tree parameter of MCMC traces. The developers and authors of AWTY recommend that it should be used in complement with Coda for both real-valued and tree parameters.

Tracer Tracer [163] is a software program used for analysis of the trace files generated by Bayesian MCMC runs. It can estimate selected statistics (see, e.g., mean, stan- dard deviation and confidence intervals) for the selected parameter. A frequency distribution and a density plot for the Bayesian posterior of the selected parameter can also be drawn. It can generate trace plots and joint marginal plots for two parameters. Tracer can handle real-valued parameters only and is not intended for discrete parameters. The default burnin for Tracer is set at 10% and estimates ESS values for each parameter at the specified burnin. It uses a heuristic for assessing convergence where it flags ESS value less than 100 but cautions that values less than 200 should be further analysed by, e.g., tree shape. However, the thresholds used by Tracer seem to be arbitrarily defined [93]. Chapter 6

Present Investigations

All papers included in this thesis target two key areas of Computational Biology; gene family inference and post-processing of Bayesian inference of phylogeny. Paper I-II involve method development of a synteny-aware homology inference and gene family inference software and comparison of this method with other popular gene family inference methods. Paper III focuses on development of a GUI-based soft- ware that is useful in Bayesian inference of phylogeny and displays the important characteristics of MCMC particularly helpful in assessing behaviour of the MCMC chain. Paper IV uses the software designed in paper III and explores behaviour of tree parameter in chain. It also applies some popular convergence diagnostics and discusses a new convergence assessment approach that takes all parameters of the chain jointly. Paper V traces the evolutionary and phylogenetic history of FERM domain of Kindlin with respect to FERM domain of other FERM domain- containing proteins (FDCPs) and applies Bayesian phylogenetic inference among other Bioinformatics tools to infer interesting observations about evolution of the FERM domain.

Paper I: Quantitative synteny scoring improves homology inference and partitioning of gene families.

In this study, a novel homology inference method is proposed that takes into ac- count local synteny and merges local synteny estimates with similarity scores to obtain evaluation scores that better reflect homology. The algorithm uses neigh- borhood correlation scores applied on all-versus-all BLAST bitscores as similar- ity scores and computes neighborhood correlation scores applied on local synteny scores calculated from neighborhood of two genes to quantify synteny conserva- tion between two genes. Both synteny and similarity scores are merged to form an evaluation score that shows confidence in homologous pair. Using this merged score, gene families are computed, and cluster quality of gene families obtained from a synteny-aware approach is shown to be better than those obtained only

41 42 CHAPTER 6. PRESENT INVESTIGATIONS from Neighborhood Correlation scores on simulated data and on a biological data with known gold standard for gene families. Software implementation is available with the paper and is called GenFamClust.

Paper II: GenFamClust: An accurate, synteny-aware and reliable homology inference algorithm.

This paper discusses GenFamClust as an informed and reliable homology infer- ence algorithm and compares gene families obtained from GenFamClust for ac- curacy, similarity, dependence and other characteristics with those obtained from some other popular gene family inference software (hcluster_sg, Markov Clustering (MCL), HiFiX, SiLiX and clustering approaches) applied on BLAST and Neigh- borhood Correlation scores. Simulated datasets and a dataset consisting of many eukaryotic species are used for this comparison. Gene families obtained from Gen- FamClust for Fungi dataset is then evaluated for accuracy with pillars available from Yeast Gene Order Browser (YGOB), which have been semi-manually curated using gene similarity and synteny. Agreement and disagreement between pillars and clusters obtained from GenFamClust were analysed. A few novel predictions for pillars based on better phylogenetic fit and equally supportive synteny were made to show that GenFamClust is not only automated but can also predict gene families in a more informed and accurate manner than predictions from YGOB.

Paper III: VMCMC: a graphical and statistical analysis tool for Markov chain Monte Carlo traces.

This study focuses on presenting a software (named Visual Markov chain Monte Carlo or VMCMC) that simplifies post-processing of MCMC traces with most com- monly useful tasks, e.g., with maximum-a-posteriori or majority rule consensus tree computation, parameter mixing (for both tree topology and real-valued parameters), computing tree posterior, automated burnin estimation for individual parameters as well as for complete chain and visualizing traces of real-valued and tree topology parameters. VMCMC also computes statistical properties, e.g., mean and variance of each parameter and can analyse the same parameter for two parallel chains using Mann-Whitney U test for real-valued parameters and Chi-Square test for splits in tree topology parameter. The burnin estimation methods employed in VMCMC are Gelman-Rubin (potential scale reduction factor), Geweke and effective sample size (ESS). VMCMC is specifically tailored for Bayesian inference of phylogeny, and can be used to analyse MCMC chains generated from MrBayes, PrIMe, JPrIMe-DLRS, BEAST and any other software generating tab separated chain output. VMCMC can be used both as a GUI-based application, supporting interactive exploration, and as a convenient command-line tool suitable for automated pipelines. Software implementation of VMCMC is available online. 43

Paper IV: Burnin estimation and convergence assessment in Bayesian phylogenetic inference.

In this work, we present novel burnin estimation methods specifically applicable to MCMC traces for Bayesian phylogenetic inference, and make a case for use- fulness of features implemented in VMCMC presented in a previous paper. We analysed MCMC runs generated from JPrIMe using VMCMC. We use the last burnin estimation method for assessing convergence and estimating burnin, and apply it to effective sample size (ESS) and convergence diagnostics like Geweke and Gelman-Rubin. We also propose two new convergence diagnostic methods that are based on estimating burnin using ESS from multiple parameters jointly. We quan- tify the effect of burnin on posterior of tree parameter. We explored correlation between burnin estimates from various convergence diagnostics. Furthermore, we investigated relationship between size of gene families and burnin estimation from different convergence diagnostics. Using parallel chains for ascertaining convergence in MCMC runs and estimating burnin, we concluded that it is always advisable to use convergence diagnostics and that the burnin estimates of last burnin applied on ESS were the closest to those estimated by parallel chain analysis.

Paper V: Tracing the evolution of FERM domain of Kindlins.

In this study, we wanted to infer phylogeny using Bayesian phylogenetic inference on a biological dataset, use VMCMC to post-process the trace and infer evolution- ary history of the FERM domain for a medium-sized biological dataset (consisting of FERM domains from 185 FDCPs sequences from 14 proteins in 15 different species). This work represents the application of these and other Bioinformatics tools to study and explore the evolutionary history of FERM domain of FDCPs. Phylogenetic analysis were performed using MEGA5, TimeTree, DLRS in JPrIMe and VMCMC. Other tools like Evolutionary Trace Analysis (ETA), BLAST, NCBI Conserved Domain Database (CDD), Hidden Markov Model (HMM) and Chimera were also deployed for identifying functionally and structurally important and con- served residues in FERM domain of Kindlins.

Chapter 7

Discussion & Conclusion

In 1995 the genome of Bacterium Haemophilus influenzae was sequenced – the first time complete genome of a free-living organism had become known [64]. This was followed by sequencing of the complete genome of Saccharomyces cerevisiae in 1996 [74]. However, a monumental moment that opened the floodgates for sequenc- ing of complete genomes of organisms was sequencing of the complete genome of Homo sapiens in 2001 [111]. Since then, the number of fully sequenced genomes has been increasing at an amazing rate [165]. The availability of genome scale data has brought in more challenges with itself to the Bioinformatics community, in particular to the Phylogenetics community. One challenge is to use this informa- tion in inferring more informed and accurate gene families as input to a phylogeny inferring algorithm (in general). Second, while methods like Bayesian inference of phylogeny are increasingly being used in literature for inferring tree posterior, the need to have interactive and automated analysis tools for analysing output from these software is growing particularly after anticipating the application of these methods on genome scale. In the papers of this thesis, we tackle both these issues.

7.1 Discussion

Synteny-aware homology inference In the first paper, we introduce a synteny-aware homology inference algorithm that uses local synteny information in conjunction with gene similarity information to infer homologs. Then clustering algorithms like single-linkage, average-linkage and complete linkage algorithms can be applied on these homologs to infer gene families. The idea is novel; at this point, to my knowledge, there do not exist any methods that compute synteny-aware homologs except SYNERGY whose implementation is not publicly available at the time of this write-up. The method is informed; it uses gene order conservation alongside gene content conservation to infer homolo- gous gene pairs. The method is built for genome scale data; it requires complete information about genes and positions of genes on chromosomes. The similarity

45 46 CHAPTER 7. DISCUSSION & CONCLUSION component represented by neighborhood correlation scores is more robust than all- versus-all BLAST bitscores [186]. The synteny scoring scheme is simple yet robust and relevant to biological datasets; we verified it experimentally by choosing differ- ent boundaries and schemes as shown in supplementary data and as applied on a biological dataset. Homology inference and partitioning of gene families is signif- icantly improved by using synteny quantitatively; results suggest strong evidence when compared with results from Neighborhood Correlation – the similarity com- ponent of GenFamClust. The major issue we faced was how to choose values of each parameter for each module. While default values for all-versus-all BLAST and neighborhood correlation scores were already determined by Joseph et al. [99], we used empirical data and tested on all parameter values for determining best values, which is given in supplementary data. A drawback with the GenFamClust algorithm is its empirical or data dependent nature; it requires a good reference dataset, which truly reflects homologous and syntenic relationships between the queried gene pair. Another drawback is the computation time to infer synteny scores and synteny correlation scores, which are a hindrance in the scalability of data. The software is available online and is called GenFamClust. In the second paper, we wanted to assess properties of gene families inferred by applying clustering algorithms on homologs inferred from GenFamClust and to motivate everyone to use GenFamClust for inferring homologs and gene families despite higher computational time and requiring more information than most com- monly available homology inference software. The summary of this work is that gene families inferred from GenFamClust are more informed and accurate than gene families inferred from most other popular software. Gene family inference precedes phylogeny inference and is of utmost importance in determining evolutionary history inferred from a phylogenetic tree. The avail- ability and annotation of complete genomes should help improve the quality of gene families by providing syntenic context. I believe that GenFamClust provides the best opportunity, at this point, to incorporate gene similarity and gene synteny supported by empirical evidence. Annotation and sequencing of more genomes will help improve gene family inference accuracy in future.

Analysis of Markov chain Monte Carlo runs In the third paper, we wanted to specifically analyse the MCMC chains produced during Bayesian phylogenetic analysis, which can be used for post-processing of MCMC traces on genome-wide scale. Given a trace, we wanted to interactively view the effect of burnin on the posterior of both real and tree topology parame- ters without going to multiple software for each task and for each value of burnin. The currently used software for analysing MCMC traces in Bayesian inference of phylogeny (namely Tracer, AWTY and CODA) do these tasks for one type of pa- rameters, are not interactive and can not be applied on genome-wide scale. Besides, the goal of a lot of phylogenetic studies is to find a single representative tree for the posterior after removing burnin. VMCMC is intended to fill these gaps. An- 7.2. FUTURE PERSPECTIVES 47 other feature of VMCMC is to have a command line version as well with complete functionality so that one is able to analyse thousands of runs using a small script. I personally found VMCMC very useful in analysing MCMC traces, calculating the maximum-a-posteriori tree and consensus tree after removing burnin in tracing the evolutionary history of FERM domains. In the fourth paper, we proposed and evaluated convergence diagnostics and burnin estimators that are specifically applicable for MCMC traces with multiple parameters (e.g., those used in Bayesian phylogenetic inference), and quantify the effect of different burnin estimation and convergence diagnostics on tree topology parameter. We noted that burnin estimates from different convergence diagnostics were not dependent on size of gene family, on which the chain ran and there was no correlation between burnin estimates from different convergence diagnostics. Chain convergence was verified by using parallel chains and hypothesis tests like chi-square test and Mann-Whitney U test, and we estimated possible burnin points for converged chains as well. We motivated the use of joint burnin estimators in Bayesian phylogenetic inference and in some other cases, where some parameters may be dependent on one or more parameters and perturbation in this parameter causes other dependent parameters to “reset” causing problems in convergence to stationary state. The differences between tree posteriors obtained after removing burnin estimates from a convergence diagnostic and that obtained after removing burnin estimates from parallel chain analysis were quantitatively summarized using Kullback-Leibler divergence. We concluded that it was better to use a convergence diagnostic than using a fixed 25% burnin estimate and that burnin estimates from last-burnin estimation on ESS is the closest to the estimates from parallel chain analysis among the different convergence diagnostics employed in this study.

7.2 Future perspectives

This thesis is dedicated to the topics of homology inference and analysis of chains from Markov chain Monte Carlo in Bayesian phylogenetics inference. I believe that there is room for improvement in both these topics and in the works presented here. Starting with synteny-aware homology inference, the role of reference or empir- ical data is not discussed and has not been explored completely in GenFamClust. Perhaps this can be explored in future experiments with GenFamClust. Incorporat- ing species tree somehow intelligently in BLAST scores, Neighbourhood Correlation scores, synteny scores and/or synteny correlation scores is another avenue that looks interesting to explore. VMCMC can be improved with a few interesting features. There are some obvious features from other software like Tracer [163] and AWTY [146] that could be duplicated in VMCMC to make it most useful. In particular, the ability to plot marginal distributions and correlations among different parameters could be interesting additions to VMCMC. Direct extraction of samples with a particular tree topology could be another desired feature. Display plots of split frequencies (as 48 CHAPTER 7. DISCUSSION & CONCLUSION is done by AWTY) can also be added to VMCMC. More interactivity can be added to VMCMC particularly when analysing the distances between tree topologies. Looking back at FERM domain analysis now, I would have identified FDCP homologs using GenFamClust and identify homologs with syntenic support, which would allow more than one homolog per protein per species (e.g., two or more representatives for recently duplicated paralogs) and will make the analysis more informed and complete. Then, it would represent a perfect biological use case for applying the whole pipeline described in this thesis. Bibliography

[1] Örjan Åkerborg, Bengt Sennblad, Lars Arvestad, and Jens Lagergren. Simul- taneous Bayesian gene tree reconstruction and reconciliation analysis. Pro- ceedings of the National Academy of Sciences, 106(14):5714–5719, 2009. [2] . Molecular Biology of the Cell: Reference Edition. Garland Science, 2008. ISBN 9780815341116. [3] Jonas S. Almeida and Susana Vinga. Universal sequence map (USM) of arbitrary discrete sequences. BMC Bioinformatics, 3(1):6, 2002. [4] Gautam Altekar, Sandhya Dwarkadas, John P. Huelsenbeck, and Fredrik Ronquist. Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics, 20(3):407–415, 2004. [5] Adrian M. Altenhoff, Romain A. Studer, Marc Robinson-Rechavi, and . Resolving the ortholog conjecture: orthologs tend to be weakly, but significantly, more similar in function than paralogs. PLoS Computational Biology, 8(5):e1002514–e1002514, 2012. [6] Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search pro- grams. Nucleic Acids Research, 25(17):3389–3402, 1997. [7] Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jor- dan. An introduction to MCMC for machine learning. Machine Learning, 50 (1-2):5–43, 2003. [8] Eric Bapteste, Philippe Lopez, Frédéric Bouchard, et al. Evolutionary analy- ses of non-genealogical bonds produced by introgressive descent. Proceedings of the National Academy of Sciences, 109(45):18266–18272, 2012. [9] Malay K. Basu, Liran Carmel, Igor B. Rogozin, and Eugene V. Koonin. Evo- lution of protein domain promiscuity in eukaryotes. , 18(3): 449–461, 2008. [10] Philip N. Benfey, Philip Benfey, and Alex D. Protopapas. Essentials of Ge- nomics. Prentice-Hall, 2005. ISBN 9780130470188.

49 50 BIBLIOGRAPHY

[11] Pavel Berkhin. A survey of clustering data mining techniques. In Grouping Multidimensional Data. Springer, 2006. [12] Gaurav Bhardwaj, Kyung D. Ko, Yoojin Hong, et al. PHYRN: a robust method for phylogenetic analysis of highly divergent sequences. PloS One, 7 (4):e34261, 2012. [13] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Sta- tistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008. [14] John Bluis and Dong-Guk Shin. Nodal distance algorithm: calculating a phylogenetic tree comparison metric. In Bioinformatics and Bioengineering, 2003. Proceedings. Third IEEE Symposium on, pages 87–94. IEEE, 2003. [15] Ingo Brigandt. Homology in comparative, molecular, and evolutionary de- velopmental biology: the radiation of a concept. Journal of Experimental Zoology Part B: Molecular and Developmental Evolution, 299(1):9–17, 2003. [16] Terence A. Brown. Genomes 3. Garland Science, 2007. ISBN 9780815341383. [17] Marija Buljan and . The evolution of protein domain families. Biochemical Society Transactions, 37(4):751, 2009. [18] Kevin P. Byrne and Kenneth H. Wolfe. The Yeast Gene Order Browser: com- bining curated homology and syntenic context reveals gene fate in polyploid species. Genome Research, 15(10):1456–1461, 2005. [19] Jing Cai, Ruoping Zhao, Huifeng Jiang, and Wen Wang. De novo origination of a new protein-coding gene in Saccharomyces cerevisiae. Genetics, 179(1): 487–496, 2008. [20] Torbjorn Caspersson and Jack Schultz. Pentose nucleotides in the cytoplasm of growing tissues. Nature, 143(3623):602–3, 1939. [21] Todd A. Castoe, A.P. Jason de Koning, Hyun-Min Kim, et al. Evidence for an ancient adaptive episode of convergent molecular evolution. Proceedings of the National Academy of Sciences, 106(22):8986–8991, 2009. [22] C. Chang and E. M. Meyerowitz. Eukaryotes have “two-component” signal tranducers. Research in Microbiology, 145(5):481–486, 1994. [23] Xiaoshu Chen and Jianzhi Zhang. Correction: The ortholog conjecture is untestable by the current gene ontology but is supported by RNA sequencing data. PLoS Computational Biology, 9(1), 2013. [24] Asif T. Chinwalla, Lisa L. Cook, Kimberly D. Delehaunty, et al. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915): 520–562, 2002. BIBLIOGRAPHY 51

[25] Athar H. Chishti, A. C. Kim, S. M. Marfatia, et al. The FERM domain: a unique module involved in the linkage of cytoplasmic proteins to the mem- brane. Trends in Biochemical Sciences, 23(8):281–282, Aug 1998.

[26] Suzanne Clancy. Genetic mutation. Nature Education, 1(1):187, 2008.

[27] Suzanne Clancy. RNA splicing: introns, exons and spliceosome. Nature Education, 1(1):31, 2008.

[28] Bernard Conrad and Stylianos E. Antonarakis. Gene duplication: a drive for phenotypic diversity and cause of human disease. Annual Reviews in Genomics and Human Genetics, 8:17–35, 2007.

[29] Loredana Lo Conte, Bart Ailey, Tim J.P. Hubbard, et al. SCOP: a structural classification of proteins database. Nucleic Acids Research, 28(1):257–259, 2000.

[30] Richard Cordaux and Mark A. Batzer. The impact of retrotransposons on human genome evolution. Nature Reviews Genetics, 10(10):691–703, 2009.

[31] Francis H. Crick. On protein synthesis. Symposia of the Society for Experi- mental Biology, 12:139–163, 1956.

[32] Francis H. Crick. Central dogma of molecular biology. Nature, 227(5258): 561–563, 1970.

[33] Liying Cui, P Kerr Wall, James H. Leebens-Mack, et al. Widespread genome duplications throughout the history of flowering plants. Genome Research, 16(6):738–749, 2006.

[34] Ralf Dahm. Friedrich Miescher and the discovery of DNA. Developmental Biology, 278(2):274–288, 2005.

[35] Ralf Dahm. Discovering DNA: Friedrich Miescher and the early years of nucleic acid research. Human Genetics, 122(6):565–581, Jan 2008.

[36] Thomas Dandekar, Berend Snel, Martijn Huynen, and Peer Bork. Conserva- tion of gene order: a fingerprint of proteins that physically interact. Trends in Biochemical Sciences, 23(9):324–328, 1998.

[37] Charles Darwin. On the Origin of Species by Means of Natural Selection Or the Preservation of Favored Races in the Struggle for Life. General Books LLC, 2009. ISBN 9781150016707.

[38] Margaret O. Dayhoff and Robert M. Schwartz. A model of evolutionary change in proteins. In In Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, 1978. 52 BIBLIOGRAPHY

[39] Daniel Defays. An efficient algorithm for a complete link method. The Com- puter Journal, 20(4):364–366, 1977.

[40] Paramvir Dehal and Jeffrey L. Boore. Two rounds of whole genome duplica- tion in the ancestral vertebrate. PLoS Biology, 3(10):1700, 2005.

[41] Luis Delaye, Alexander DeLuna, Antonio Lazcano, and Arturo Becerra. The origin of a novel gene through overprinting in Escherichia coli. BMC Evolu- tionary Biology, 8(1):31, 2008.

[42] Jeffery P. Demuth, Tijl De Bie, Jason E. Stajich, et al. The evolution of mammalian gene families. PloS One, 1(1):e85, 2006.

[43] Cheng Deng, C-H Christina Cheng, Hua Ye, et al. Evolution of an antifreeze protein by neofunctionalization under escape from adaptive conflict. Proceed- ings of the National Academy of Sciences, 107(50):21593–21598, 2010.

[44] Yun Ding, Qi Zhou, and Wen Wang. Origins of new genes and evolution of their novel functions. Annual Review of Ecology, Evolution, and Systematics, 43:345–363, 2012.

[45] Russell F. Doolittle. Convergent evolution: the need to be explicit. Trends in Biochemical Sciences, 19(1):15–18, 1994.

[46] Jean-Philippe Doyon, Vincent Ranwez, Vincent Daubin, and Vincent Berry. Models, algorithms and programs for phylogeny reconciliation. Briefings in Bioinformatics, 12(5):392–400, 2011.

[47] Alexei J. Drummond and Andrew Rambaut. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology, 7(1):214, 2007.

[48] Richard C. Dubes and Anil K. Jain. Clustering methodologies in exploratory data analysis. Advances in Computers, 19(11), 1980.

[49] Bernard Dujon, David Sherman, Gilles Fischer, et al. Genome evolution in yeasts. Nature, 430(6995):35–44, 2004.

[50] Jill M. Dunty, Veronica Gabarra-Niecko, Michelle L. King, et al. FERM domain interaction promotes FAK signaling. Molecular and Cellular Biology, 24(12):5353–5368, 2004.

[51] Sean R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755–763, 1998.

[52] Diana Ekman and Arne Elofsson. Identifying and quantifying orphan protein sequences in fungi. Journal of Molecular Biology, 396(2):396–405, 2010. BIBLIOGRAPHY 53

[53] Sabrina Ellenberger, Stefan Schuster, and Johannes Wöstemeyer. Correlation between sequence, structure and function for trisporoid processing proteins in the model zygomycete Mucor mucedo. Journal of Theoretical Biology, 320: 66–75, 2013.

[54] Anton J. Enright and Christos A. Ouzounis. GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16(5):451–457, 2000.

[55] Anton J. Enright, Stijn van Dongen, and Christos A. Ouzounis. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 30(7):1575–1584, 2002.

[56] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density- based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226–231. AAAI Press, 1996.

[57] Joseph Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17(6):368–376, 1981.

[58] Joseph Felsenstein. Confidence limits on phylogenies: an approach using the bootstrap. Evolution, pages 783–791, 1985.

[59] Joseph Felsenstein. Inferring Phylogenies. Macmillan Education, 2004. ISBN 9780878931774.

[60] Ronald A. Fisher. The possible modification of the response of the wild type to recurrent mutations. American Naturalist, pages 115–126, 1928.

[61] Walter M. Fitch. Distinguishing homologous from analogous proteins. Sys- tematic Biology, 19(2):99–113, 1970.

[62] Walter M. Fitch. Homology a personal view on some of the problems. Trends in Genetics, 16(5):227–231, May 2000.

[63] Walter M. Fitch and Emanuel Margoliash. Construction of phylogenetic trees. Science, 155(3760):279–284, 1967.

[64] Robert D. Fleischmann, Mark D. Adams, Owen White, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269 (5223):496–512, 1995.

[65] National Center for Biotechnology Information. Blastclust, 2015. URL ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html.

[66] Allan Force, Michael Lynch, F. Bryan Pickett, et al. Preservation of duplicate genes by complementary, degenerative mutations. Genetics, 151(4):1531– 1545, 1999. 54 BIBLIOGRAPHY

[67] Jianchao Gao, Ammad A. Khan, Takashi Shimokawa, et al. A feedback regulation between Kindlin-2 and GLI1 in prostate cancer cells. FEBS Letters, 587(6):631–638, Mar 2013. [68] Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, pages 457–472, 1992. [69] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distribu- tions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(6):721–741, 1984. [70] John Geweke. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments, volume 196. Federal Reserve Bank of Min- neapolis, Research Department Minneapolis, MN, USA, 1991. [71] Greg Gibson and Spencer V. Muse. A Primer of Genome Science. Sinauer Associates, 2009. ISBN 9780878932368. [72] Walter Gilbert. Why genes in pieces? Nature, 271(5645):501, Feb 1978. [73] Walter R. Gilks, Sylvia Richardson, and David J. Spiegelhalter. Introduc- ing Markov chain Monte Carlo. In Markov chain Monte Carlo in practice. London: Chapman and Hall, 1996. [74] André Goffeau, Bart G. Barrell, Howard Bussey, et al. Life with 6000 genes. Science, 274(5287):546–567, 1996. [75] Gaston H. Gonnet, Mark A. Cohen, and Steven A. Benner. Exhaustive match- ing of the entire protein sequence database. Science, 256(5062):1443–1445, 1992. [76] Josefa González, José María Ranz, and Alfredo Ruiz. Chromosomal elements evolve at different rates in the Drosophila genome. Genetics, 161(3):1137– 1154, 2002. [77] Mileidy W. Gonzalez and William R. Pearson. Homologous over-extension: a challenge for iterative similarity searches. Nucleic Acids Research, 38(7): 2177–2189, 2010. [78] Morris Goodman, John Czelusniak, G. William Moore, A.E. Romero-Herrera, and Genji Matsuda. Fitting the gene lineage into its species lineage, a parsi- mony strategy illustrated by cladograms constructed from globin sequences. Systematic Zoology, pages 132–163, 1979. [79] Julian Gough. Convergent evolution of domain architectures (is rare). Bioin- formatics, 21(8):1464–1471, 2005. [80] John C. Gower and G.J.S. Ross. Minimum spanning trees and single linkage cluster analysis. Applied Statistics, pages 54–64, 1969. BIBLIOGRAPHY 55

[81] Leanne S. Haggerty, Pierre-Alain Jachiet, William P. Hanage, et al. A plu- ralistic account of homology: adapting the models to the data. Molecular Biology and Evolution, 31(3):501–516, 2013.

[82] Keisuke Hamada, Toshiyuki Shimizu, Takeshi Matsui, et al. Structural basis of the membrane-targeting and unmasking mechanisms of the radixin FERM domain. The EMBO journal, 19(17):4449–4462, 2000.

[83] Richard W. Hamming. Error detecting and error correcting codes. Bell System technical Journal, 29(2):147–160, 1950.

[84] Bong-Gyoon Han, Wataru Nunomura, Yuichi Takakuwa, et al. Protein 4.1 R core domain structure and insights into regulation of cytoskeletal organiza- tion. Nature Structural & Molecular Biology, 7(10):871–875, 2000.

[85] Cristina Has, Daniele Castiglia, Marcela del Rio, et al. Kindler syndrome: extension of FERMT1 mutational spectrum and natural history. Human Mutation, 32(11):1204–1212, 2011.

[86] W. Keith Hastings. Monte Carlo sampling methods using markov chains and their applications. Biometrika, 57(1):97–109, 1970.

[87] Philip Heidelberger and Peter D. Welch. A spectral method for confidence interval generation and run length control in simulations. Communications of the ACM, 24(4):233–245, 1981.

[88] Jorja G. Henikoff and Steven Henikoff. Using substitution probabilities to improve position-specific scoring matrices. Computer Applications in the Bio- sciences, 12(2):135–143, 1996.

[89] Steven Henikoff and Jorja G. Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences, 89 (22):10915–10919, 1992.

[90] Alfred D. Hershey and Martha Chase. Independent functions of viral pro- tein and nucleic acid in growth of bacteriophage. The Journal of General Physiology, 36(1):39–56, 1952.

[91] Rose Hoberman and Dannie Durand. The incompatible desiderata of gene cluster properties. In Comparative Genomics. Springer, 2005.

[92] Rose Hoberman, David Sankoff, and Dannie Durand. The statistical signifi- cance of max-gap clusters. In Comparative Genomics. Springer, 2005. ISBN 9783540322900.

[93] Sebastian Höhna. Burnin estimation and convergence assessment. Chapter in Licentiate thesis Bayesian Phylogenetic Inference, 2011. 56 BIBLIOGRAPHY

[94] Thomas H. Huxley. The origin of species. CreateSpace Independent Publish- ing Platform, 2015. ISBN 9781512324358.

[95] Olivier Jaillon, Jean-Marc Aury, Frédéric Brunet, et al. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto- karyotype. Nature, 431(7011):946–957, 2004.

[96] Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.

[97] D. T. Jones, W. R. Taylor, and J. M. Thornton. A mutation data matrix for transmembrane proteins. FEBS Letters, 339(3):269–275, 1994.

[98] David T. Jones, William R. Taylor, and Janet M. Thornton. The rapid generation of mutation data matrices from protein sequences. Computer Ap- plications in the Biosciences, 8(3):275–282, 1992.

[99] Jacob M. Joseph and Dannie Durand. Family classification without domain chaining. Bioinformatics, 25(12):i45–i53, 2009.

[100] Jin Jun, Ion I. Mandoiu, and Craig E. Nelson. Identification of mammalian orthologs using local synteny. BMC Genomics, 10(1):630, 2009.

[101] Motoo Kimura. Evolutionary rate at the molecular level. Nature, 217(5129): 624–626, 1968.

[102] Theresa Kindler. Congenital poikiloderma with traumatic bulla formation and progressive cutaneous atrophy. British Journal of Dermatology, 66(3): 104–111, 1954.

[103] Jack L. King and Thomas H. Jukes. Non-Darwinian evolution. Science, 164 (3881):788–798, 1969.

[104] Christopher M.O.O. Kleifeld. Validating matrix metalloproteinases as drug targets and anti-targets for cancer therapy. Nature Reviews Cancer, 6:227, 2006.

[105] Thorsten Kloesges, Ovidiu Popa, William Martin, and Tal Dagan. Networks of gene sharing among 329 proteobacterial genomes reveal differences in lat- eral gene transfer frequency at different phylogenetic depths. Molecular Bi- ology and Evolution, 28(2):1057–1074, 2011.

[106] David G. Knowles and Aoife McLysaght. Recent de novo origin of human protein-coding genes. Genome Research, 19(10):1752–1759, 2009.

[107] Eugene V. Koonin. Orthologs, paralogs, and evolutionary genomics 1. Annual Review of Genetics, 39:309–338, 2005. BIBLIOGRAPHY 57

[108] Joseph B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964. [109] Cold Spring Harbor Laboratory. DNA from the beginning, 2015. URL http://www.dnaftb.org/. [110] J. E. Lai-Cheong, M. Parsons, and John A. McGrath. The role of kindlins in cell biology and relevance to human disease. The International Journal of & Cell Biology, 42(5):595–603, 2010. [111] Eric S. Lander, Lauren M. Linton, Bruce Birren, et al. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, 2001. [112] Bret Larget and Donald L. Simon. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Molecular Biology and Evolution, 16:750–759, 1999. [113] Nicolas Lartillot, Thomas Lepage, and Samuel Blanquart. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics, 25(17):2286–2288, 2009. [114] John A. Laurmann and W. Lawrence Gates. Statistical considerations in the evaluation of climatic experiments with atmospheric general circulation models. Journal of the Atmospheric Sciences, 34(8):1187–1199, 1977. [115] Shuying Li, Dennis K. Pearl, and Hani Doss. Phylogenetic tree construc- tion using Markov chain Monte Carlo. Journal of the American Statistical Association, 95(450):493–508, 2000. [116] Wen-Hsiung Li, Zhenglong Gu, Haidong Wang, and Anton Nekrutenko. Evo- lutionary analyses of the human genome. Nature, 409(6822):847–849, 2001. [117] Yoseph Linde, Andres Buzo, and Robert M. Gray. An algorithm for vector quantizer design. IEEE Transactions on Communications, 28(1):84–95, 1980. [118] David J. Lipman and William R. Pearson. Rapid and sensitive protein simi- larity searches. Science, 227(4693):1435–1441, 1985. [119] Harvey Lodish, Arnold Berk, Chris A. Kaiser, et al. Molecular Cell Biology: Seventh edition. W. H. Freeman and Company, New York, 2013. ISBN 9781429234139. [120] Manyuan Long, Esther Betrán, Kevin Thornton, and Wen Wang. The origin of new genes: glimpses from the young and old. Nature Reviews Genetics, 4 (11):865–875, 2003. [121] Nicolas Luc, Jean-Loup Risler, Anne Bergeron, and Mathieu Raffinot. Gene teams: a new formalization of gene clusters for comparative genomics. Com- putational Biology and Chemistry, 27(1):59–67, 2003. 58 BIBLIOGRAPHY

[122] Avery O.T. MacLeod and Maclyn McCarty. Studies of the chemical nature of the substance inducing transformation of pneumococcal types. Induction of transformation by a deoxyribonucleic acid fraction isolated from pneumo- coccus type III. Journal of Experimental Medicine, 79:137–158, 1944.

[123] James MacQueen. Some methods for classification and analysis of multivari- ate observations. In Proceedings of the Fifth Berkeley Symposium on Math- ematical Statistics and Probability, volume 1, pages 281–297. Oakland, CA, USA., 1967.

[124] Khalid Mahmood, Geoffrey I. Webb, Jiangning Song, et al. Efficient large- scale protein sequence comparison and gene matching to identify orthologs and co-orthologs. Nucleic Acids Research, 40(6):e44–e44, 2012.

[125] Henry B. Mann and Donald R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Math- ematical Statistics, pages 50–60, 1947.

[126] Edward M. Marcotte, Matteo Pellegrini, Ho-Leung Ng, et al. Detecting pro- tein function and protein-protein interactions from genome sequences. Sci- ence, 285(5428):751–753, 1999.

[127] Bob Mau, Michael A. Newton, and Bret Larget. Bayesian phylogenetic infer- ence via Markov chain Monte Carlo methods. Biometrics, 55(1):1–12, 1999.

[128] James O. McInerney, Davide Pisani, Eric Bapteste, and Mary J. O’Connell. The public goods hypothesis for the evolution of life on earth. Biology Direct, 6:41, 2011.

[129] Gregor Mendel. Versuche über plflanzenhybriden. Verhandlungen des Natur- forschenden Vereines in Brünn, 4:3–47, 1866.

[130] Gregor Mendel. Experiments in plant hybridisation. Cosimo Classics, New York, 2008. ISBN 9781605202570.

[131] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, et al. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[132] Julien Meunier, Frédéric Lemoine, Magali Soumillon, et al. Birth and ex- pression evolution of mammalian microRNA genes. Genome Research, 23(1): 34–45, 2013.

[133] Vincent Miele, Simon Penel, Vincent Daubin, et al. High-quality sequence clustering guided by network topology and multiple alignment likelihood. Bioinformatics, 28(8):1078–1085, 2012. BIBLIOGRAPHY 59

[134] Vincent Miele, Simon Penel, and Laurent Duret. Ultra-fast sequence clus- tering from similarity networks with SiLiX. BMC Bioinformatics, 12(1):116, 2011.

[135] Glenn W. Milligan and Martha C. Cooper. Methodology review: Clustering methods. Applied Psychological Measurement, 11(4):329–354, 1987.

[136] Boris Mirkin. Mathematical classification and clustering: From how to what and why. Springer, 1998.

[137] Eloi Montanez, Siegfried Ussar, Martina Schifferer, et al. Kindlin-2 controls bidirectional signaling of integrins. Genes Dev, 22(10):1325–1330, May 2008.

[138] Gabriel Moreno-Hagelsieb, Victor Treviño, Ernesto Pérez-Rueda, et al. Tran- scription unit conservation in the three domains of life: a perspective from Escherichia coli. Trends in Genetics, 17(4):175–177, 2001.

[139] Gregory J. Morgan. Emile Zuckerkandl, Linus Pauling, and the molecular evolutionary clock, 1959-1965. Journal of the , 31(2):155– 178, 1998.

[140] Thomas H. Morgan. The Theory of the Gene. BiblioLife, 2015. ISBN 9781297835674.

[141] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[142] Nathan L. Nehrt, Wyatt T. Clark, Predrag Radivojac, and Matthew W. Hahn. Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Computational Biology, 7(6):e1002073–e1002073, 2011.

[143] Masatoshi Nei and Masafumi Nozawa. Roles of mutation and selection in speciation: from Hugo de Vries to the modern genomic era. Genome Biology and Evolution, 3:812–829, 2011.

[144] Marshall W. Nirenberg and J. Heinrich Matthaei. The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonu- cleotides. Proceedings of the National Academy of Sciences, 47(10):1588–1602, 1961.

[145] Richard A. Notebaart, Martijn A. Huynen, Bas Teusink, et al. Correlation be- tween sequence conservation and the genomic context after gene duplication. Nucleic Acids Research, 33(19):6164–6171, 2005. 60 BIBLIOGRAPHY

[146] Johan A.A. Nylander, James C. Wilgenbusch, Dan L. Warren, and David L. Swofford. AWTY (are we there yet?): a system for graphical exploration of MCMC convergence in Bayesian phylogenetics. Bioinformatics, 24(4):581– 583, 2008.

[147] Severo Ochoa. Enzymatic synthesis of Ribonucleic Acid (Nobel lecture). In Nobel Lectures, Physiology or Medicine 1942-1962. Elsevier, Amsterdam, 1959.

[148] The University of Utah Health Sciences. Learn.genetics, 2015. URL http://learn.genetics.utah.edu/.

[149] Robert J. O’Hara. Population thinking and tree thinking in systematics. Zoologica Scripta, 26(4):323–329, 1997.

[150] Susumu Ohno. Evolution by gene duplication. Springer Science & Business Media, 2013. ISBN 9783642866593.

[151] . Slightly deleterious mutant substitutions in evolution. Nature, 246(5428):96–98, 1973.

[152] Eric M. Ostertag and Haig H. Kazazian Jr. Biology of mammalian L1 retro- transposons. Annual Review of Genetics, 35(1):501–538, 2001.

[153] Richard Owen and William W. Cooper. Lectures on the Comparative Anatomy and Physiology of the Invertebrate Animals : delivered at the Royal College of Surgeons. Longman, Brown, Green, and Longmans, London, 1843.

[154] Joe Parker, Georgia Tsagkogeorga, James A. Cotton, et al. Genome-wide signatures of convergent evolution in echolocating mammals. Nature, 502 (7470):228–231, Oct 2013.

[155] Frances Pearl, Annabel Todd, Ian Sillitoe, et al. The CATH Domain Struc- ture Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Research, 33 (suppl 1):D247–D251, 2005.

[156] P Pipenbacher, Alexander Schliep, Sebastian Schneckener, et al. ProClust: improved clustering of protein sequences with an extended graph-based ap- proach. Bioinformatics, 18(suppl 2):S182–S191, 2002.

[157] Walter Pirovano and Jaap Heringa. Multiple sequence alignment. In Bioin- formatics. Humana, 2008. ISBN 9781603271592.

[158] Martyn Plummer, Nicky Best, Kate Cowles, and Karen Vines. CODA: Con- vergence diagnosis and output analysis for MCMC. R News, 6(1):7–11, 2006. BIBLIOGRAPHY 61

[159] Nicholas H. Putnam, Thomas Butts, David E.K. Ferrier, et al. The amphioxus genome and the evolution of the chordate karyotype. Nature, 453(7198):1064– 1071, 2008.

[160] Daniel L. Rabosky. Automatic detection of key innovations, rate shifts, and diversity-dependence on phylogenetic trees. PLoS One, 9(2):e89543, 2014.

[161] Adrian E. Raftery and Steven M. Lewis. Practical Markov chain Monte Carlo - comment: one long run with diagnostics: implementation strategies for Markov chain Monte Carlo. Statistical Science, pages 493–497, 1992.

[162] Andrew Rambaut, Marc A. Suchard, Dong W. Xie, and Alexei J. Drummond. Figtree, 2015. URL http://tree.bio.ed.ac.uk/software/figtree.

[163] Andrew Rambaut, Marc A. Suchard, Dong W. Xie, and Alexei J. Drummond. Tracer v1.6, 2015. URL http://beast.bio.ed.ac.uk/Tracer.

[164] Shruti Rastogi and David A. Liberles. Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evolutionary Biol- ogy, 5(1):28, 2005.

[165] T.B.K. Reddy, Alex D. Thomas, Dimitri Stamatis, et al. The Genomes On- Line Database (GOLD) v. 5: a metadata management system based on a four level (meta) genome project classification. Nucleic Acids Research, 43(D1): 1099–1106, 2014.

[166] Gerald R. Reeck, Christoph de Haën, David C. Teller, et al. “homology” in proteins and nucleic acids: A terminology muddle and a way out of it. Cell, 50(5):667, 1987.

[167] Christian G. Roessler, Branwen M. Hall, William J. Anderson, et al. Tran- sitive homology-guided structural studies lead to discovery of cro proteins with 40% sequence identity but different folds. Proceedings of the National Academy of Sciences, 105(7):2343–2348, 2008.

[168] Igor B. Rogozin, David Managadze, Svetlana A. Shabalina, and Eugene V. Koonin. Gene family level comparative analysis of gene expression in mam- mals validates the ortholog conjecture. Genome Biology and Evolution, 6(4): 754–762, 2014.

[169] Fredrik Ronquist and Andrew R. Deans. Bayesian phylogenetics and its in- fluence on insect systematics. Annual Review of Entomology, 55:189–206, 2010.

[170] Fredrik Ronquist, Maxim Teslenko, Paul van der Mark, et al. MrBayes 3.2: ef- ficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology, 61(3):539–542, 2012. 62 BIBLIOGRAPHY

[171] Fredrik Ronquist, Paul van der Mark, and John P. Huelsenbeck. Bayesian phylogenetic analysis using MRBAYES. In The phylogenetic handbook: a practical approach to phylogenetic analysis and hypothesis testing. Cambridge University Press, 2009.

[172] Niv Sabath, Andreas Wagner, and David Karlin. Evolution of viral proteins originated de novo by overprinting. Molecular Biology and Evolution, page mss179, 2012.

[173] Cecilia Saccone and Graziano Pesole. Handbook of Comparative Genomics: Principles and Methodology. John Wiley & Sons, 2005. ISBN 9780471326410.

[174] Kristoffer Sahlin. Estimating convergence of Markov chain Monte Carlo sim- ulations. Stockholm University, Master Thesis, 2011.

[175] Naruya Saitou and Masatoshi Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolu- tion, 4(4):406–425, 1987.

[176] Frederick H. Sanger. Chemistry of insulin; determination of the structure of insulin opens the way to greater understanding of life processes. Science, 129 (3359):1340–1344, May 1959.

[177] Frederick H. Sanger and E.O.P. Thompson. The amino-acid sequence in the glycyl chain of insulin. I. The identification of lower peptides from partial hydrolysates. Biochemical Journal, 53(3):353–366, Feb 1953.

[178] Frederick H. Sanger and Hans Tuppy. The amino-acid sequence in the pheny- lalanyl chain of insulin. 1. The identification of lower peptides from partial hydrolysates. Biochemical Journal, 49(4):463, 1951.

[179] Frederick H. Sanger and Hans Tuppy. The amino-acid sequence in the pheny- lalanyl chain of insulin. 2. The investigation of peptides from enzymic hy- drolysates. Biochemical Journal, 49(4):481, 1951.

[180] Anasua Sarkar, Hayssam Soueidan, and Macha Nikolski. Identification of conserved gene clusters in multiple genomes based on synteny and homology. BMC Bioinformatics, 12(Suppl 9):S18, 2011.

[181] Robin Sibson. SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30–34, 1973.

[182] Joel Sjöstrand. Reconciling gene family evolution and species evolution, 2013.

[183] Joel Sjöstrand, Bengt Sennblad, Lars Arvestad, and Jens Lagergren. DLRS: gene tree evolution in light of a species tree. Bioinformatics, 28(22):2994– 2995, 2012. BIBLIOGRAPHY 63

[184] Temple F. Smith and Michael S. Waterman. Identification of common molec- ular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[185] Robert R. Sokal. A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin, 28:1409–1438, 1958.

[186] Nan Song, Jacob M. Joseph, George B. Davis, and Dannie Durand. Sequence similarity network reveals common ancestry of multidomain proteins. PLoS Computational Biology, 4(4):e1000063, Apr 2008.

[187] Nan Song, Robert D. Sedgewick, and Dannie Durand. Domain architecture comparison for multidomain homology identification. Journal of Computa- tional Biology, 14(4):496–516, 2007.

[188] Mike Steel. Some statistical aspects of the maximum parsimony method. In Molecular Systematics and Evolution: Theory and Practice. Springer, 2002.

[189] Hugo Steinhaus. Sur la division des corps matériels en parties. Bull. Acad. Pol. Sci., Cl. III, 4:801–804, 1957. ISSN 0001-4095.

[190] Ann M. Stock, Victoria L. Robinson, and Paul N. Goudreau. Two-component signal transduction. Annual Review of Biochemistry, 69(1):183–215, 2000.

[191] Arlin Stoltzfus. On the possibility of constructive neutral evolution. Journal of Molecular Evolution, 49(2):169–181, 1999.

[192] Korbinian Strimmer and Arndt von Haeseler. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Molecular Biology and Evolution, 13(7):964–969, 1996.

[193] James B. Sumner. The chemical nature of enzymes (Nobel lecture). In Nobel Lectures, Chemistry 1942-1962. Elsevier, Amsterdam, 1946.

[194] Lena Svensson, Kimberley Howarth, Alison McDowall, et al. Leukocyte ad- hesion deficiency-III is caused by mutations in KINDLIN3 affecting integrin activation. Nature Medicine, 15(3):306–312, 2009.

[195] Guy Tanentzapf and Nicholas H. Brown. An interaction between integrin and the talin FERM domain mediates integrin activation but not linkage to the cytoskeleton. Nature Cell Biology, 8(6):601–606, 2006.

[196] Diethard Tautz and Tomislav Domazet-Lošo. The evolutionary origin of or- phan genes. Nature Reviews Genetics, 12(10):692–702, 2011.

[197] Simon Tavaré. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences, 17:57–86, 1986. 64 BIBLIOGRAPHY

[198] H. Jean Thiébaux and Francis W. Zwiers. The interpretation and estimation of effective sample size. Journal of Climate and Applied Meteorology, 23(5): 800–811, 1984. [199] Stijn van Dongen. Graph clustering by flow simulation, 2000. [200] Stijn van Dongen. MCL – a cluster algorithm for graphs, 2015. URL http://micans.org/mcl/. [201] Hubert B. Vickery. The origin of the word protein. The Yale Journal of Biology and Medicine, 22(5):387, 1950. [202] Susana Vinga and Jonas Almeida. Alignment-free sequence comparison – a review. Bioinformatics, 19(4):513–523, 2003. [203] Hongyan Wang, Daina Lim, and Christopher E. Rudd. Immunopathologies linked to integrin signalling. In Seminars in Immunopathology, volume 32, pages 173–182. Springer, 2010. [204] Ilan Wapinski, Avi Pfeffer, , and . Automatic genome-wide reconstruction of phylogenetic gene trees. Bioinformatics, 23 (13):i549–i558, 2007. [205] Joe H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963. [206] James D. Watson and Francis H. Crick. Molecular structure of nucleic acids. Nature, 171(4356):737–738, 1953. [207] Caleb Webber and Chris P. Ponting. Genes and homology. Current Biology, 14(9):R332–3, May 2004. [208] Psychology Wikia. Human genome to genes, 2015. URL http://psychology.wikia.com/wiki/File:Human_genome_to_genes.png. [209] Yuri I. Wolf, Igor B. Rogozin, Alexey S. Kondrashov, and Eugene V. Koonin. Genome alignment, evolution of prokaryotic genome organization, and pre- diction of gene function using genomic context. Genome Research, 11(3): 356–372, 2001. [210] Cathy H. Wu, Hongzhan Huang, Lai-Su L. Yeh, and Winona C. Barker. Pro- tein family classification and functional annotation. Computational Biology and Chemistry, 27(1):37–47, 2003. [211] Dong-Dong Wu and Ya-Ping Zhang. Evolution and function of de novo orig- inated genes. Molecular Phylogenetics and Evolution, 67(2):541–545, 2013. [212] Rui Xu and Donald Wunsch. Survey of clustering algorithms. IEEE Trans- actions on Neural Networks, 16(3):645–678, 2005. BIBLIOGRAPHY 65

[213] Rui Xu and Donald C. Wunsch. Clustering algorithms in biomedical research: a review. IEEE Reviews in Biomedical Engineering, 3:120–154, 2010. [214] Zefeng Yang and Jinling Huang. De novo origin of new genes with introns in plasmodium vivax. FEBS Letters, 585(4):641–644, 2011. [215] Ziheng Yang and Bruce Rannala. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Molecular Biology and Evolution, 14(7):717–724, 1997. [216] Ziheng Yang and Bruce Rannala. Molecular phylogenetics: principles and practice. Nature Reviews Genetics, 13(5):303–314, 2012. [217] Frank Yates. Contingency tables involving small numbers and the Chi- squared test. Supplement to the Journal of the Royal Statistical Society, pages 217–235, 1934. [218] Kuo-Chen Yeh, Shu-Hsing Wu, John T. Murphy, and J. Clark Lagarias. A cyanobacterial phytochrome two-component light sensory system. Science, 277(5331):1505–1508, 1997. [219] Golan Yona, Nathan Linial, and . ProtoMap: automatic classi- fication of protein sequences and hierarchy of protein families. Nucleic Acids Research, 28(1):49–55, 2000. [220] Phillip D. Zamore and Benjamin Haley. Ribo-gnome: the big world of small RNAs. Science, 309(5740):1519–1524, 2005. [221] Jun Zhan, Xiang Zhu, Yongqing. Guo, et al. Opposite role of Kindlin-1 and Kindlin-2 in lung cancers. PLoS One, 7(11):e50313, 2012. [222] Jianzhi Zhang. Evolution by gene duplication: an update. Trends in Ecology & Evolution, 18(6):292–298, 2003. [223] Qi Zhou, Guojie Zhang, Yue Zhang, et al. On the origin of new genes in Drosophila. Genome Research, 18(9):1446–1455, 2008. [224] Christian M. Zmasek. Forester, 2015. URL https://github.com/cmzmasek/forester. [225] Emile Zuckerkandl. On the molecular evolutionary clock. Journal of Molec- ular Evolution, 26(1-2):34–46, 1987. [226] Emile Zuckerkandl and Linus Pauling. Molecules as documents of evolution- ary history. Journal of Theoretical Biology, 8(2):357–366, 1965.