<<

Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1723

Evolutionary Approaches to

MARCIN BOGUSZ

ACTA UNIVERSITATIS UPSALIENSIS ISSN 1651-6214 ISBN 978-91-513-0445-8 UPPSALA urn:nbn:se:uu:diva-360871 2018 Dissertation presented at Uppsala University to be publicly examined in Ekmansalen, EBC, Norrbyvägen 14, Uppsala, Friday, 9 November 2018 at 09:00 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor Jeffrey Thorne (North Carolina State University).

Abstract Bogusz, M. 2018. Evolutionary Approaches to Sequence Alignment. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1723. 57 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-513-0445-8.

Molecular evolutionary biology allows us to look into the past by analyzing sequences of amino acids or . These analyses can be very complex, often involving advanced statistical models of sequence evolution to construct phylogenetic trees, study the patterns of natural selection and perform a number of other evolutionary studies. In many cases, these evolutionary studies require a prerequisite of multiple sequence alignment (MSA) - a technique, which aims at grouping the characters that share a common ancestor, or , into columns. This information regarding shared homology is needed by statistical models to describe the process of substitutions in order to perform evolutionary inference. Sequence alignment, however, is difficult and MSAs often contain whole regions of wrongly aligned characters, which impact downstream analyses. In this thesis I use two broad groups of approaches to avoid errors in the alignment. The first group addresses the analysis methods without sequence alignment by explicitly modelling the processes of substitutions, and insertions and deletions (indels) between pairs of sequences using pair hidden Markov models. I describe an accurate tree inference method that uses a clustering approach to construct a tree from a matrix of model-based evolutionary distances. Next, I develop a pairwise method of modelling how natural selection acts on substitutions and indels. I further show the relationship between the constraints acting on these two evolutionary forces to show that natural selection affects them in a similar way. The second group of approaches deals with errors in existing alignments. I use a statistical model-based approach to evaluate the quality of multiple sequence alignments. First, I provide a graph-based tool for removing wrongly aligned pairs of residues by splitting them apart. This approach tends to produce better results when compared to standard column- based filtering. Second, I provide a way to compare MSAs using a probabilistic framework. I propose new ways of scoring of sequence alignments and show that popular methods produce similar results. The overall purpose of this work is to facilitate more accurate evolutionary analyses by addressing the problem of sequence alignment in a statistically rigorous manner.

Keywords: molecular evolution, multiple sequence alignment, pair hidden Markov models

Marcin Bogusz, Department of Ecology and , Evolutionary Biology, Norbyvägen 18D, Uppsala University, SE-75236 Uppsala, Sweden.

© Marcin Bogusz 2018

ISSN 1651-6214 ISBN 978-91-513-0445-8 urn:nbn:se:uu:diva-360871 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-360871)

To my wife and son without whom this work would have been completed a year earlier

List of Papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Bogusz, M., Whelan, S. (2017) Estimation With and Without Alignment: New Distance Methods and Benchmarking. Systematic Biology, 66(2):218–231

II Bogusz, M., Whelan, S. (2018) Selection acting on indels and substitutions in coding sequences. Manuscript

III Ali, RH., Bogusz, M., Whelan, S. (2018) A graph-based ap- proach for improving the accuracy of multiple sequence align- ments for evolutionary analysis. Manuscript

IV Bogusz, M., Ali, RH., Whelan, S. (2018) Examining sequence alignments using a model-based approach. Manuscript

Reprints were made with permission from the respective publishers.

Contents

Introduction ...... 11 Homology and Evolution ...... 12 Homology and Molecular Sequences ...... 13 Assigning homology through sequence alignment ...... 13 Evolutionary inference from molecular data ...... 15 Modeling substitutions and evolutionary distances ...... 16 Introduction to Phylogenetics ...... 19 Modeling natural selection ...... 23 Limitations of alignment-based evolutionary inference ...... 23 Methods ...... 27 Statistical alignment approach ...... 27 Modeling indels ...... 27 Early statistical alignment ...... 28 Pair hidden Markov models ...... 29 Full statistical alignment and tree approaches ...... 32 MSA treatment ...... 33 Column filtering ...... 33 Filtering accuracy ...... 34 Impact of filtering on downstream analyses ...... 36 Research Aims ...... 37 Summary of the papers ...... 38 Paper I: Phylogenetic Tree Estimation With and Without Alignment: New Distance Methods and Benchmarking ...... 38 Paper II: Selection acting on indels and substitutions in protein coding sequences ...... 39 Paper III: A graph-based approach for improving the accuracy of multiple sequence alignments for evolutionary analysis ...... 40 Paper IV: Examining sequence alignments using a model-based approach ...... 41 Conclusions and future prospects ...... 43 Svensk Sammanfattning ...... 46 Acknowledgements ...... 49 References ...... 51

Abbreviations

MSA Multiple sequence alignment MSAM Multiple sequence alignment method HMM Hidden Markov model TP True positive FP False positive FN False negative PP Pairwise posterior probability ROC Receiver operating characteristic AUC Area under the curve

Introduction

Studying evolution is exciting from multiple perspectives. It allows us to ask questions about the origins of life on earth and relationships between species. Understanding evolution can also have an impact on our daily lives as it helps us to stay ahead of pathogens and lets us understand hereditary diseases. Evo- lution forms the core of genetic studies, ecology, neurobiology, biochemistry, physiology, and other branches of biology making it one of the most important scientific concepts. Accurate understanding of evolution is therefore critical to understanding the world that surrounds us. Since the ground breaking studies of Charles Darwin, a number of advances in the evolutionary thought have been made. Ranging from Mendel’s studies of heredity, population genetics, modern synthesis, discovery of amino acid sequences, the discovery of evolution on molecular level, DNA and full ge- nome , the field of evolutionary biology has gained a lot of tools and data to improve our understanding of mechanisms governing evolution. By looking at extant genetic sequences, it is possible to perform various common evolutionary analyses. These include predicting ancestral relation- ships between sequences in a form of phylogenetic trees, looking if natural selection is acting upon genomic sequences, finding motifs and predicting pro- tein function and structure. These inferences are possible thanks to increase in availability of molecular data. This, data, however, is meaningless without the models that let us interpret it. We know that sequences change over time due to various mechanisms, most notably errors during replication. These changes drive and allows for adaptations and . Studying these changes allows the researchers to draw conclusions regarding the evolution. In this thesis I do not address the evolutionary inference in the population genetics context of poly- morphic data and describe methods and approaches from data that is consid- ered to be fixed within populations. Currently, the common approach to modeling sequence evolution is through continuous time Markov chain models of sequence character substi- tutions, which describe how nucleotides and amino acids in sequences change over evolutionary time. These models assume that substitutions i.e. fixed mu- tations in lineages, occur at a certain rate and an ancestral sequence evolves by substitutions to form extant sequences. These methods are even able to account for multiple substitutions of the same characters during the course of evolution.

11 These popular methods require the important prerequisite of multiple se- quence alignment (MSA), which aims to group together the characters that share ancestral relationships. These homologous characters need to be identi- fied in tractable time, which makes multiple sequence alignment difficult to perform due to the huge number of possible combinations in which the resi- dues can be aligned. MSA methods use a host of algorithmic and heuristic approaches in order to perform the task. There exist a number of different se- quence alignment methods, which often provide contradicting answers regard- ing the homologies, especially for distantly related sequences. This means that most of the contemporary evolutionary analyses in this commonly used two- step approach are underpinned by uncertain data, often containing spurious homologies that affect the results. Another aspect of the common approach to evolutionary analyses is that it does the not take into account another important evolutionary force that changes the lengths of sequences over time: indels. These insertions and dele- tions are difficult to model due to potential overlaps over time and conducting the modeling is computationally intensive. Indel modeling, however, is feasi- ble and the methods of statistical evolutionary analyses exist. So-called full statistical alignment is capable of avoiding the alignment error by jointly in- ferring trees and alignments in a Bayesian framework. These approaches, however, require extensive computing resources and advanced knowledge of Markov Chain Monte Carlo methods making them not accessible for most evolutionary biologists. In this thesis I describe the ways of mitigating the errors introduced by MSA by either avoiding the alignment process altogether or by attempting to avoid the impact of these errors on downstream analyses. I present statistically robust methods based on pair hidden Markov models, a probabilistic frame- work capable of describing the evolution of two sequences using substitution and indel processes without the need of an alignment. These methods can model the evolutionary processes and directly extract evolutionary parameters of interest fast, using easy to use software tools. They are also capable of as- sessing multiple sequence alignments by providing probabilistic measures of shared ancestry by assigning posterior probabilities of pairs of residues being aligned together.

Homology and Evolution Homology is one of the key concepts in evolutionary biology. The commonly accepted definition states that it is the relationship of two characters that have descended from a common ancestral character (Fitch 2000). These characters can be any feature of an organism, for instance structural or behavioral (Thiele 1993; Wiens and Brower 2001). The hypothesis of homology that a character in one organism is the same as a character in another is the cornerstone of

12 many studies, and facilitates taxonomy and systematics. The correct assump- tion of homology is critical, and the discussion on the homology as a hypoth- esis has been running since the definition of the term in pre-Darwinian times by Owen (Owen 1846). The concept of homology, however, has been used centuries before in comparative anatomy. These studies used morphological features and ultimately provided evidence of common descent and revolution- ized our understanding of systematics of . Homologous morpho- logical features provide evolutionary information, but the correlation of many morphological characters limits the data at our disposal (Swiderski et al. 2004). The real breakthrough in the amount of independent homologous char- acters came with comparative genomics.

Homology and Molecular Sequences The advent of brought a new class of homologous charac- ters. Starting with a characterization of the primary structure of by (1951), came the realization that molecular sequences evolve over evolutionary time and the nucleotides and amino acids – charac- ters that that build molecular sequences – can share common ancestry (Ingram 1961; Morgan 1998). This new class of homologous characters can have nu- merous states; for example, glycine and phenylalanine amino acids. These states can also change over time and characters can be inserted and removed, shaping sequence lengths over this evolutionary time. Today we have an abun- dance of molecular data, with multiple sequences spanning hundreds or thou- sands of characters. Assigning homologies to these characters differs greatly from morphological analyses and makes the problem of identifying molecular homologies hard.

Assigning homology through sequence alignment In the early days of molecular evolutionary biology, sequencing was slow and not many molecular sequences were available. The problem of assigning ho- mologies in molecular sequences – sequence alignment – was performed by hand (Strasser 2010). Later, increases in sequence availability in the studies of evolution brought a need for automatic alignment procedures with multiple sequence alignment (MSA) for grouping multiple sequence characters sharing a common ancestor into columns. At this stage, it is worth noting that in mo- lecular biology the term “homology” may not necessarily refer to the evolu- tionary relatedness of characters (known as positional homology). Research- ers also use sequence alignments to study functional and structural relation- ships between , which imply similarity in structure (structural homol- ogy) or function (functional homology) of the regions of proteins.

13 Automated sequence alignment The first automatic alignment algorithm by Needleman and Wunsch worked on two sequences at a time and used the principle of by creating a two-dimensional matrix and an algorithm to compute scores for gaps, character matches and mismatches (Needleman and Wunsch 1970). The two-dimensional matrix means that the computational complexity of the algo- rithm is proportional to the product of lengths of two sequences. If a pair of sequences of length 100 takes 10 seconds to compute, then a pair of 1000 characters (ten times the length) will require 100 times longer computational time – over 16 minutes. Assuming that the sequences are of similar length n the computational complexity is O(n2). MSA is known to be a very difficult NP-complete computational problem (Wang and Jiang 1994) and for multiple m sequences (with m being more than 2), under a naïve assumption of com- paring all the combinations of sequence characters the problem gains O(nm) complexity. For a similar problem of 100 character-long sequences in the mul- tiple alignment of 10 sequences taking 10 seconds to compute, the 1000 char- acter long one would take 200 000 years. Currently, researchers align dozens (and in extreme cases thousands) of se- quences and multiple sequence alignment methods (MSAMs) need to be able to align them in a reasonable time. This means that an exhaustive search for optimal combinations of characters is impossible. Furthermore, in evolution- ary approaches the characters are assumed to share a common ancestor, which implies tree-like relationship structure. Adding possible tree combinations to the exhaustive alignment algorithm further increases the complexity of the calculations involved.

Multiple sequence alignment methods Because of this immense computational complexity, multiple sequence align- ment methods need to rely on the whole host of approximate and heuristic algorithms to make the problem tractable. To start with, these algorithms need a scoring function to weigh alternative alignment solutions against each other. This normally means assigning scores to gaps and amino acid and substitutions in a way that maximizes similarity. Various scoring schemes ex- ist, starting from simple match/mismatch scores through scores obtained from PAM and Blosum matrices, through to probability-based log-odd scores (Durbin et al. 1998; Eddy 2004). These algorithms also need a general strategy for how to process the se- quences. The popular progressive method uses the assumption of evolutionary relatedness between sequences and processes them in a pairwise fashion ac- cording to a computed guide tree (Hogeweg and Hesper 1984). The tree is created using a clustering approach from pairwise sequence comparisons and the pairwise alignment algorithm (usually Needleman-Wunsch) progresses form the leaf nodes to the root. ClustalW is an example of such an algorithm

14 (Thompson et al. 1994). It’s also one of the most cited papers in history, show- ing the popularity of sequence alignment methods (Van Noorden et al. 2014). Other notable examples of progressive aligners include MAFFT (Katoh 2002) MUSCLE (Edgar 2004) and Omega (Sievers et al. 2011). These pop- ular tools also feature iterative refinement in an attempt limit the progression of errors in the alignment during the iterative process. Another general strategy for computing MSAs is through consistency alignment. Under this approach, a library of pairwise alignments is used to find the combination that maximizes the agreement (consistency) between the pairs. Common examples are T-Coffe (Notredame et al. 2000) and Probcons (Do et al. 2005). Finally, the realization that the three problems – positional, structural and functional homology – optimized by general sequence alignment methods are not necessarily the same has led to development of phylogeny-aware ap- proaches, which attempt to produce more realistic alignments for evolutionary inference by using explicit evolutionary models. A good example is the Prank aligner (Löytynoja and Goldman 2008).

Multiple sequence alignment has many applications in molecular biology (Kemena and Notredame 2009; Thompson et al. 2011). It also facilitates a number of diverse evolutionary studies ranging, for example, from the recon- struction of the tree of life (Dunn et al. 2008) through to human genetics (Zhang et al. 2009) and to cancer research and epidemiology (Bao et al. 2008). Just like the morphological analysis of homologous characters (e.g. cranial features) in systematics, molecular evolutionary studies are based on the iden- tities and differences detected between homologous characters inferred during the process of sequence alignment.

Evolutionary inference from molecular data Evolutionary inference is one of the major uses of sequence alignment. In the common evolutionary inference pipeline, an MSA serves as a prerequisite for further inference and analysis (Felsenstein 2004). This widely used two-step inference process starts with a set of homologous sequences with the aim to study how this molecular data changes over time. The first step requires run- ning the sequences through a sequence aligner of choice, and results in a point estimate of homology between characters in the sequence. This estimate of homology is believed to be the true representation of evolutionary history. The second step uses the information in the alignment (also known as the data ma- trix). The characters in the columns of the alignment are assumed to share the evolutionary history by originating form a single common ancestor. Different states of the characters are the result of substitutions. An alignment will also

15 often contain gaps, which describe characters that are missing from the se- quence. A gap can be a result of a deletion event where a character in the lineage leading to the sequence with the gap was lost. Alternatively, an inser- tion may have occurred, leading to gaps in the other sequences.

Modeling substitutions and evolutionary distances The fundamental task in evolutionary inference is the estimation of evolution- ary distances. These distances describe the amount of change between se- quences due to evolution. Just like in morphological feature analysis where we can measure for example bone lengths, evolutionary distances let us quan- tify the number of changes through observing the differences between charac- ters in sequences.

Statistical evolutionary distances Most of the alignment-based inference methods use only information from substitutions, mostly treating gaps as missing data in distance calculation (Yang 2006). A simple example of a distance would be the proportion of sites that differ between two sequences. This simplified distance does not account for multiple potential changes at a site, making it unrealistic for anything but closely related sequences. Currently-used modeling tools are able to estimate the expected number of substitutions that occurred between sequences using stochastic models of se- quence evolution, even including multiple unobserved substitutions at a site (multiple hits). This expectation describes the evolutionary distance between sequences, also called as divergence or divergence time. In the case of nucleic acids, the distance describes expected number of substitutions per nucleotide, for amino acids – the expected number of substitutions per amino acid site – and the expected number of substitutions per codon for coding sequences. The stochastic nature of evolution is commonly probabilistically modeled through continuous-time Markov chains (Nielsen 2005). The sites in se- quences are described by their states (four states for nucleotides, 20 states for amino acids, 64 for codons) and are assumed to be evolving independently of each other. The chain can be in one of the states at a time and is said to be memoryless. This means that the probability of changing to the next state only depends on the current state, not how that state was reached in the past. The Markov chain X describes a probabilistic model of substitutions through its Q = { qij } , also called the instantaneous rate or generative matrix, which determines the relative rates of change between states i and j. It is common to scale the Q matrix so the average rate of the process is 1. The Q matrix, when exponentiated then specifies the probabilities over evolutionary time t described by matrix Pt eQt (Grimmett and Stirzaker 2001). This time t is expressed in expected numbers of substitutions per site.

16 At any time t, the chain X(t) can be characterized by its distribution of states π(t). Initially, the distribution is π(0) and after time t π(t) = π(0)Pt . If, after time t, the distributions are equal then the chain will stay in this distribu‐ tion forever. The chain is said to be in its equilibrium distribution. For nucleotides, this distribution specifies the frequencies πA, πC, πG, πT, and makes the assumption of reversibility possible. Reversibility, satisfied by πiqij = πjqji means that the process looks the same in both directions (forward and back- ward), meaning that, for example, given two sequences, the probability of data is the same whether one sequence is ancestral to the other and vice versa. In addition the probability is the same if both sequences are both descendants of an ancestral sequence due to Chapman–Kolmogorov equation, a property also called as the pulley principle (Grimmett and Stirzaker 2001; Felsenstein 2004). This assumption has an important application as it allows us to calcu- late evolutionary distances by treating one extant sequence as an ancestor to another extant sequence. Popular time-reversible nucleotide models range from zero to six parame- ters with Jukes and Cantor (1969); Hasegawa et al. (1985); Tavaré (1986) as notable examples. The amino acid models – due to a large number of potential parameters and poor understanding of factors governing the substitutions – usually have empirically-derived substitution matrices (Jones et al. 1992; Whelan and Goldman 2001; Le and Gascuel 2008). Under these Markov mod- els every possible transition within the evolutionary time t has a probability assigned and the alignment is needed to specify which transitions took place. In a simple pairwise distance case, each of the character pairs in the alignment columns specifies the state of the chain and can be assigned a probability ac- cording to the Pt matrix. The product of these column probabilities specifies the likelihood PrX|θ of the data X given the model θ and can be represented as a function of the unknown parameter θ and represented as Lθ;X fX|θ, called the . The likelihood then specifies how likely it is that the data comes from a given model. This Maximum Likelihood (ML) approach is commonly used throughout statistical modelling and evolutionary inference, and serves as a popular way of finding the best fitting combination of model parameters i.e. the divergence time and other model-specific parameters. This is usually performed through numerical optimization algorithms (Vetterling 2002). These optimization al- gorithms work in a multi-dimensional space of multiple parameters trying to find the optimal combination of parameters (global optimum). Sometimes, due to the complexity of the parameter space (likelihood function surface), only the local optimum can be found (a sub optimal solution). Maximum Likelihood Estimates (MLEs) have a number of properties (Stuart et al. 2009). First, the approach finds a point estimate θ of the param- eters. This estimate is consistent, meaning that θ converges to the true value θ

17 as the sample size (in our case sequence lengths) increases. Second, the esti- mator is unbiased meaning that the expectation Eθ θ. Finally, no other unbiased estimate can have a smaller variance than the MLE. Continuous time Markov models assume that sites in the sequence evolve at the same rate. This assumption can potentially be unrealistic due to varied rates among site and potential selective pressures (Hodgkinson and Eyre-Walker 2011). Rarely do all the sites in sequences evolve under the same rate. One can accommodate this variation by assuming several different rates for a site and drawing them at random when performing likelihood analyses. A common approach is to use a continuous gamma distribution to account for variable rates among sites (Tateno et al. 1994). To make the problem tractable, one usually uses a discrete gamma distribution by partitioning the distribution curve into n equally probable segments with the mean rate in each segment used to represent the rate in that category. Under this model formulation the mean rate across all categories is 1 (Yang 1994). Other advanced mixture models have been used to model the complex features of this evolutionary processes such as variable transition/ rate across sites (Huelsenbeck and Nielsen 1999), using different substitution rate matrices (Koshi and Goldstein 1996) and heterotachy, to which more complex hidden Markov model approaches have also been applied (Whelan et al. 2011).

Alignment-free distances. These alignment-based statistical methods are not the only way of estimating distances between sequences. So-called alignment-free approaches were cre- ated with the speed in mind to solve problems where sequence alignment and/or statistical methods would not be tractable or possible. These very fast methods use various approaches to quantify the difference between pairs of raw sequences, with the most popular approach being common fraction of substrings of various lengths (k-mers) between two sequences (Vinga and Almeida 2003). Other, more sophisticated, approaches include measures like the compres- sion-based Lempel-Ziv (LZ) complexity (Otu and Sayood 2003) and an infor- mation theory-based average common substring (ACS) metric (Ulitsky et al. 2006). These distances are fast to compute, can be applied to very long se- quences – even whole – and can be used in data sets comprised of thousands of sequences. They, however, are not based on commonly used evo- lutionary measures like the expected number of substitutions per site. Further- more, unlike statistical Markov models, they fail to account for multiple hits – multiple substitutions at a site.

18 Introduction to Phylogenetics Evolutionary distances on their own provide information on how different (distant) sequences are, but do not provide the details of the course of the evo- lutionary process that gave rise to these differences. Phylogenetic trees are commonly used to describe evolutionary relationships between the sequences (rows of the alignment) by representing extant sequences as leaves and ances- tral relationships as nodes. Phylogenetic inference – constructing phylogenetic trees – is arguably the most common evolutionary inference study. Trees are of interest in many systematic studies and have multiple applications, for in- stance in studies of pathogens like HIV (Hahn et al. 2000), protein evolution studies (Liberles et al. 2012), evolution (Lynch and Walsh 2007), and in taxonomic studies (Gogarten et al. 2002). Trees can also be required for studying conserved elements (Siepel et al. 2005) or natural selection (Rocha et al. 2006). The main goal of phylogenetic inference is to find the branching pattern (tree topology) that corresponds with the course of evolution that gave rise to the observed sequences. This topology is usually accompanied with the branch lengths. Phylogenetic trees can be rooted – when the common ancestor of all sequences is identified – or unrooted, with the root unspecified. Each of the nodes in the tree has a degree according to the graph theory. The degree describes the number of edges (branches) that connect to that node. Leaf nodes, representing extant sequences, have a degree of 1. For non-root inner nodes, the degree of three describes a bifurcation – an evolutionary idea that an ancestral node gives raise to two descendants. Nodes of degree more than 3 describe multifurcations (polytomies). A tree with no polytomies is called a bifurcating tree, or fully resolved tree.

Distance methods Distances between pairs of sequences are commonly used to infer trees. The so-called distance methods require a matrix of evolutionary distances and an algorithm to construct the tree. The distances need to be calculated by iterating through all combinations of pairs of sequences, although one usually needs to calculate only the lower or upper diagonal part of the matrix since the dis- tances are symmetrical, which means that the distance dxy = dyx for sequences x and y. The number of distance pairs scales quadratically, and the number of sequences and their pairwise evolutionary distances can be obtained using sta- tistical approaches such as MLE or using alignment-free methods. These dis- tances then usually serve as an input for a clustering algorithm that calculates the tree. The most popular algorithm, Neighbor Joining (NJ) (Saitou and Nei 1987), starts from the star topology – a tree with a single internal node – and uses a divisive clustering approach to create nodes by joining pairs of other neigh- boring nodes. The process continues until the tree is fully resolved. NJ and the

19 improved variants BioNJ and WEIGHBOR offer fast ways to create unrooted trees (Gascuel 1997; Bruno et al. 2000). Distance methods are very popular due to their speed, and are capable of producing trees from thousands of se- quences whilst remaining computationally tractable.

Statistical methods Statistical methods are considered the most accurate ways of inferring trees in the two-step process of phylogenetic inference. As with statistical estimation of evolutionary distances, the process starts with multiple sequence alignment. Statistical tree inference methods come in two varieties – Maximum Likeli- hood inference and Bayesian methods. Here I’ll focus mostly on the Maxi- mum Likelihood estimation and only mention the basics of the Bayesian in- ference process for comparison. Both of the these statistical approaches use probabilistic Markov models described in the previous section and use the likelihood function to calculate how likely the data (sequence alignment) is, given the model. In the case of tree inference, the model comprises the tree topology, branch lengths, and substitution model parameters. Both approaches also assume that a tree is bifurcating. Another common key assumption is the lack of directionality of the substitution process: most of the methods assume reversibility, meaning that the evolution looks the same forwards and back- wards. This has important implications, in that the trees used in the process are unrooted and the root can be identified later if needed. Further assumptions are stationarity, described in the previous chapter as meaning that frequencies of nucleotides or amino acids are constant on the tree, and homogeneity, meaning that the substitution process is the same on the tree (Yang 2014).

Maximum Likelihood In Maximum Likelihood tree inference, the goal is to find the tree that fits the data best. The inferential process can be broken down into two components: the search for tree topologies and likelihood calculations on the tree. Put simply, an algorithm proposes a tree, and then another algorithm optimizes branch lengths and substitution parameters on that tree, returning the likeli- hood value. The tree topology with the best likelihood value is then inferred tree. Likelihood calculation on a tree was established decades ago and, given a fixed tree topology one can calculate the likelihood of data given the model (tree, branch lengths and substitution model parameters) in linear time with the number of sequences (Felsenstein 1981). The main difficulty lies in finding the optimal tree topology. The tree space increases exponentially with the number of sequences, and for a small set of 10 sequences there are over 2 million unrooted tree topologies. For 20 se- quences there are more than 2∙1020 topologies. There are multiple strategies to traverse the tree space and find the optimal topology (Felsenstein 2004), but in general the inference starts with a tree proposal, followed by refinement and checking for stop conditions. To avoid local optima, one also re-samples from

20 the tree space and performs refinement again. Due to the vastness of tree space and greedy nature of tree search, there’s no guarantee of finding the optimal topology. Given that the resulting maximum likelihood tree is a point estimate ob- tained from usually relatively short genetic sequences, the probability of ob- taining the true tree is relatively low due to the variance of the estimator. One commonly uses a non-parametric bootstrap technique to obtain confidence scores of branches in the topology. The idea is to re-sample alignment col- umns multiple times and thus obtain a set of tree estimates. Support values are then placed on branches of the original trees, based on frequencies observed in simulated data. Compared to distance-based methods, statistical methods make efficient use of data in the alignment matrix and are commonly in use by tree inference software (Price et al. 2009; Stamatakis 2014). As they only use pairs of se- quences, distance-based methods have much less information at their disposal than their full alignment-based statistical counterparts. The full alignment col- umns provide information regarding topology and contribute to accurate branch length estimation, but this comes at a cost: they are slower than dis- tance methods. However, the variance of the estimates will be higher in case of the distance methods. Furthermore, the pairwise nature of comparisons po- tentially restricts the use of more sophisticated statistical models due to pa- rameter identifiability. A notable example is the use of rate heterogeneity (Wu and Susko 2010).

Bayesian tree inference Bayesian statistical methods in phylogenetics attempt to produce a posterior distribution of trees and model parameters, referred to as θ conditional on the data X. The idea stems from Bayes’ theorem and facilitates calculating pos- terior probabilities by taking the prior probabilities of the tree and the model P(θ) and the likelihood P(X|θ). Under this approach, the posterior probability of the tree conditional on the observed data (alignment) provides the proba- bility of that tree being correct. Monte Carlo sampling approaches forming a Markov chain (MCMC) are normally used to approximate this posterior dis- tribution. The process relies on repetitive sampling of trees and model param- eters (thousands to millions of samples depending on the complexity of the problems) according to a proposal scheme to create the posterior distribution. The process needs to be carefully monitored to make sure that the samples converge to the expected distribution. Compared to ML tree estimation, Bayesian methods require more computational resources due to the repetitive sampling involved. This inference process also requires supervision and knowledge of the method. On the other hand, the notion of the distribution allows to assign probability-based confidence measures to tree topologies and other model parameters without the need for bootstrapping solutions. Further- more, the Bayesian approach can be used to analyze more difficult problems,

21 where ML-based methods might have a problem with getting stuck in local optima.

Maximum parsimony Another way of reconstructing trees is through maximum parsimony. The pro- cess of inferring trees is similar to Maximum Likelihood inference, with the difference that the optimality criterion is not based on maximizing the likeli- hood, but minimizing the total number of character-state changes throughout the tree (Fitch 1971). Parsimony reconstructs the history of changes on a tree, including assigning characters to the internal nodes of the tree. This method is very fast as it counts the number of changes between character states to calcu- late the tree score. The tree with the smallest score is the maximum parsimony tree, the estimate of the true tree. Under certain conditions, parsimony can be statistically inconsistent meaning that with increase of data the method will not converge to the true tree (Felsenstein 1978). A notable example is the Long Branch attraction where, due to similarity in the high number of substitutions, long branches can be grouped together, whereas they are really more distantly related.

Comparisons between trees. Sometimes it is of interest to measure how different two trees are, for example by comparing the results from different inference methods. These compari- sons are also useful when benchmarking trees against the truth. One is usually interested in comparing topologies and a popular measure is the partition dis- tance (Robinson and Foulds 1981), also known as the Robinson Foulds (RF) distance. Partitions or splits in trees are created by removing non-terminal branches from trees. Such a branch removal divides (splits) the tree into two trees (bi-partitions). The partition distance is defined as the total number of bipartitions that exist in one tree but not in the other. An unrooted binary tree of n sequences can be split in n-3 ways (internal branches). This means that the RF distance can range from zero to 2(n-3) for completely different trees. The main limitation of this distance is that it does not take into account the branch lengths. Other less popular topological distances include the Nearest Neighbor Interchange (NNI) distance, Subtree-Prune-and-Regraft (SPR) dis- tance and the Tree Bisection and Reconnection (TBR) distance (Yang 2006). A metric that accounts for both tree topologies and branch lengths is the geodesic distance. This distance is based on the graph theory. It uses the notion of continuous tree space for trees with n leaves. The space is formed from Euclidean regions, one for each topologically different tree. Two regions are connected if their corresponding trees are considered neighbors. The shortest unique path between the pair of compared trees is the geodesic distance.

22 Modeling natural selection Natural selection is one of the driving forces of the evolutionary change and leads to adaptations. Selection acts on phenotypic traits, which are expressed by proteins. Comparing protein sequences allows us to study how selective forces shape the resulting phenotypes. Expressed protein coding sequences produce proteins and it is common to observe the patterns of substitutions between them in order to study selective pressures. Since these sequences are built of codons – triplets of nucleotides, which further code for amino acids – the patterns of substitutions affect the . The describes 64 combinations of nucleotides in codons, however only 20 amino acids build the proteins. Not counting stop codons, 61 triplets code for 20 amino acids, meaning that some codons code for the same amino acid. When a substitution in a codon arises, it is possible that the codon will still code for the same amino acid: a synonymous substi- tution has occurred. Otherwise, a non- will change the encoded amino acid. The ratio of these non-synonymous and synonymous substitutions serves as a proxy for detecting selection (Kimura 1983). Under neutrality, one expects roughly similar rates of dN to dS. Under purifying se- lection, the constraint of non-synonymous changes reduces the rate to below one and positive selection means dN/dS > 1. This selective constraint acting on substitutions has become so popular throughout the decades of molecular studies that it became synonymous with studying natural selection. Substitution processes responsible for selection are commonly modeled through statistical codon models applied to protein coding sequences. The commonly-used model usually describes four kinds of substitutions: synony- mous transitions and , and non-synonymous transitions and transversions (Goldman and Yang 1994). The dN/dS is described by the ω parameter defining the ratio of non-synonymous and synonymous substitution rates. This model formulation allows for various assumptions around how the selection process is acting on the tree. One can assume a variation of ω across the sites (site models), across branches (branch models) or the combination of the two (Yang and Bielawski 2000).

Limitations of alignment-based evolutionary inference An ever-increasing number of evolutionary studies depend on the assembly of accurate multiple sequence alignments. Most alignment-based methods of evolutionary inference make an implicit assumption that the data represents the real evolutionary relationships between characters i.e. that the alignment is correct. This is often not the case, and MSAMs can produce alignments with whole regions of wrongly aligned residues. This uncertainty with respect to the alignment can provide contradicting information around the properties of

23 the evolutionary process being inferred and ultimately bias the results of the analysis.

Sequence alignment accuracy and evolutionary analyses The relationship between sequence alignment and tree accuracy was estab- lished early (Lake 1991). This study showed that the order of taxa (guide tree) can affect the tree topology. This guide tree problem represents a cyclic de- pendency between the progressive process which needs a tree and tree infer- ence that needs an alignment (which structure is based on the tree), and has been studied extensively (Nelesen et al. 2008; Dessimoz and Gil 2010). Other early studies discovered that the choice of aligner impacts the order of taxa on the topology (Morrison and Ellis 1997). This limitation of the two-step pro- cess – being dependent on a single uncertain estimate of MSA – has been shown in further studies of the impact of MSA accuracy onto phylogenies and distance estimates (Rosenberg 2005; Ogden et al. 2006; Wong et al. 2008). The choice of the aligner also impacts the detection of molecular adaptation (Fletcher and Yang 2009; Markova-Raina and Petrov 2011), showing that es- timates of the selection are especially sensitive to the alignment error because misalignment of two codons will more likely result in overestimation of dN. MSA errors are often inevitable due to the NP-hard complexity of the align- ment problem. Alignment algorithms have been actively researched since the 1980s with the aim to maximize speed, accuracy, and handle increasingly dif- ficult problems. This search for accuracy and scale explains why currently there are dozens of various MSAMs to choose from (Edgar and Batzoglou 2006; Kemena and Notredame 2009; Thompson et al. 2011; Chatzou et al. 2016). These methods feature all kinds of algorithms designed to facilitate the alignment process such as simulated annealing, genetic algorithms, bound heuristics, Hidden Markov Models and a number of stepwise, iterative and clustering techniques (Chatzou et al. 2016). These computations and comparisons usually have the clear objective func- tion of optimizing a similarity-based score to produce an alignment. Even if the score has its roots in model-based substitution probabilities and complex processes for modeling gaps, these probabilities depend on the evolutionary distance and it is impossible to tailor a method to handle the full spectrum of sequence divergences. Due to the lack of explicit evolutionary modeling, this host of heuristics and greedy algorithms do not guarantee an evolutionary cor- rect alignment. Furthermore, MSAMs often try to optimize three problems at a time in the attempt to recover similarities in structure and function of the proteins and infer ancestral relationships between the characters in the sequences. There is no well-defined reason why structural and evolutionary alignment should be identical. Also, a functionally correct alignment might group together regions, which similarity results from convergent evolution.

24 This universal approach to sequence alignment often maximizes identity between sequences in the hope that this simplistic criterion is enough to pro- duce alignments useful in many types of biological inference. This might work well when inferring homologies between closely related species, but will likely be inaccurate for diverged sequences with multiple substitutions at a site. The multitude of sequence alignment programs available for use means that researchers are faced with many choices of method. It is possible that the re- sults from different MSAMs will be identical or very similar – making the choice easy – but given sufficient sequence divergence, different sequence alignment programs will return visually different alignments (see Figure 1).

Figure 1. An example of various alignment outputs returned by different methods.

Since all of these MSAs cannot be fully correct at the same time, there are invariably incorrectly aligned regions between various alignments of the same set of sequences. This is hardly surprising, given the complexity of sequence alignment process and the fact that MSAMs optimize different scoring func- tions. Given a choice of several different sequence alignment variants for analy- sis, it is usually impossible to tell which one most resembles the true evolu- tionary history and what the impact of a particular alignment on downstream analyses will be. Nevertheless, researchers are forced to choose a single MSA, and the choice is usually based on a reputation that one method is better than the others.

25 Benchmarking multiple sequence alignments MSAMs can be benchmarked to identify the conditions under which different alignment algorithms perform best. There are a number of benchmarking da- tabases which provide structural reference alignments (Edgar 2004; Thompson et al. 2005; Van Walle et al. 2005) and developers tend to concen- trate on improving the benchmark performance of their methods. For example, the popular BAliBASE benchmark contains reference sets based on various properties. Reference set 1 describes cases containing small numbers of equi- distant sequences, and is subdivided by percentage of sequence identity (BB11 and BB12). Reference set 2 (BB20) alignments contain “orphan” (unrelated) sequences. Reference set 3 (BB30) deals with a pair of divergent subfamilies, with less than 25% identity between the two groups. Reference set 4 (BB40) contains sequences with long terminal extensions, and set 5 (BB50) cases con- tain large insertions and deletions. High performance in these structural or functional benchmarks, however, may not reflect the accuracy in positional homology prediction. Thus, one al- ternative is to use a simulation approach to obtain the reference (Cartwright 2005; Fletcher and Yang 2009). These reference alignments – whether coming from simulations or struc- tural MSAs –provide information used in benchmarking. This benchmarking is rooted in binary classification where it is common to divide the outcomes of an experiment to assess the performance of a method. These experimental results can be grouped based on the predicted and actual outcomes. True pos- itive (TP) outcomes occur where the predicted positive condition matches the actual true condition (true homology recovered). False positive (FP) outcomes occur when the experiment predicts a true outcome not matching the actual condition (spurious homology). Failing to recover the actual condition is called a false negative (FN) and happens when a true alignment contains a column or a residue pair absent in the inferred MSA. A commonly used score in MSA benchmarking is the column score (CS), which describes the propor- tion of correctly inferred (TP) columns to the total number of columns in the reference (TP + FN). The less restrictive sum-of-pairs (SP) score deals with counting pairs of pairwise homologies and describes the proportion of cor- rectly recovered pairs of residues to the total number of residue pairs in the reference.

26 Methods

The state-of-the art substitution models used to estimate evolutionary trees, distances, selection and other aspects of evolution are underpinned by a single, potentially uncertain point estimate of multiple sequence alignment. Since the tree and alignment are both evolutionary hypotheses regarding relationships between and within the molecular sequences, they should ideally both be treated as the same problem. The realization that MSA and evolutionary in- ference are integral to one another has become clear early in the development of evolutionary analyses (Kruskal and Sankoff 1983). This joint estimation, however, requires complex modeling and computational convenience has led to the development of the two-step process and multiple sequence alignment methods. Since then, the development of sequence alignment methods and evolution- ary inference methodology have gone their separate ways. The realization that MSA contains errors that affect downstream phylogenetic analyses and the desire to model not only substitutions, but other aspects of molecular evolu- tion has led to development of two broad categories of methods that deal with the alignment error: statistical alignment and model-based treatment of the alignment error. The work presented in this thesis builds upon extant methods and approaches within these two categories.

Statistical alignment approach The first class of methodological approaches seeks to use explicit evolutionary models to describe the evolution that lead to observable sequences. This framework – called statistical alignment – jointly estimates the alignment and the phylogenetic problem. With statistical alignment there is no prerequisite of sequence alignment and all of the model parameters are estimated from unaligned (raw sequences).

Modeling indels One of the main problems faced by the statistical alignment is in modeling insertions and deletions, jointly called as indels. These indels are responsible for different lengths of extant sequences and can be seen as gaps in the align- ment. Indeed, one of the main reasons why MSA-based inference is popular

27 is due to computational convenience of modeling only substitution events in the second step. Multiple sequence alignment methods usually deal with gaps in a simplistic manner: by using simple scoring schemes to penalize the existence of the gaps. This simple approach, used by the early aligners, uses so-called linear gap penalty where longer gaps are penalized in proportion to their lengths (Needleman and Wunsch 1970). The common approach is to use affine gap penalties, separately penalizing the gap opening and gap extension (Gotoh 1982). Other scoring schemes also exist, for instance such as power law length distribution (Gu and Li 1995). Unlike the substitutions, indels can have various lengths and can potentially overlap in the course of evolution. For example, deletions can remove parts of earlier insertions, which complicates the modelling process. This necessitates a need for models that account for indel length distribution and parameterize the rates of indel occurrence.

Early statistical alignment Indel modeling complexity was one of the reasons why progress in statistical alignment was initially slow. The first approach to statistical alignment used approximate likelihood calculations for a pair of sequences, and was intro- duced by Bishop and Thompson (1986). It was followed by the TKF91 model (Thorne et al. 1991) who introduced an exact method of calculating the like- lihood of two sequences. An important feature of this model is its ability to calculate the joint probability of sequences given the model parameters. This has a very important implication: sequence alignment is no longer needed as a necessary condition for analysis and one can calculate pairwise evolutionary distances and other model parameters such as substitution and indel rates di- rectly from unaligned sequences. Alternatively, if sequence alignment is needed for other purposes, one can obtain the most probable alignment di- rectly from the model. Under this reversible model, insertions and deletions are described as a birth-death process with separate rates describing insertions and deletions. These insertions and deletions are of unit length meaning single base or amino acid indels. Longer indels need to be described as a series of independent events. This unrealistic assumption was addressed in the TKF92 model (Thorne et al. 1992), which allows for longer insertion and deletion events. The lengths of these fragments are geometrically distributed and, for computational reasons, the model makes the assumption that the fragments are indivisible. This potentially rules out some positional homologies, how- ever the model parameter estimation is robust towards the violation of this assumption (Metzler 2003). These two seminal papers by Thorne and colleagues provide the basis for performing evolutionary analyses within a statistical framework by jointly treating the alignment and inference problems. Soon, the TKF91 model for

28 two sequences became generalized to the star-shaped tree topology (Steel and Hein 2001) and binary trees (Hein 2000). Hein also showed that the TKF91 model can be formalized as a pair hidden Markov model – a class of models with well-established algorithms for efficient likelihood computations (Durbin et al. 1998).

Pair hidden Markov models Hidden Markov models (HMMs) were introduced to the field of evolutionary biology in 1994 as an approach to preform sequence alignment (Krogh et al. 1994). HMMs can vary in their structures and applications. One of the first HMM applications was in speech recognition (Jelinek et al. 1975). Since then, HMMs have gained popularity in diverse domains, including finance, crypta- nalysis, handwriting recognition, machine translation and analysis. HMMs are also commonly used in molecular and evolutionary biology in studies of protein folding, metamorphic virus detection, chromatin state dis- covery, DNA motif discovery, and various sequence analysis problems (Yoon 2009).

A brief introduction to HMMs Simply put, a hidden Markov model outputs (emits) a sequence of observa- tions through transitions between its states where the emissions take place. The exact sequence (order) of these transitions is usually unknown (hidden) during inference and what is observable are only the emissions from the states. This means that we only see the end result, not how this result was produced. A simple example of an HMM is a dishonest casino, where there are two dice – a loaded one outputting number 6 with probability 0.5 and other num- bers with the probability of 0.1, and a fair one with equal probabilities of 1/6. The dishonest casino switches between the states of the model (dice) with probabilities specified by the arrows in the model in Figure 2

Figure 2. An occasionally dishonest casino HMM

29 The switch between the fair and loaded dice is a Markov process, where the current state only depends on the previous one. With every visit to the state an emission related to a single dice cast occurs. After every emission a decision whether or not to stay in the state occurs based on specified transition proba- bilities. The casino uses the fair dice most of the time, but occasionally switches between the dice with probabilities 0.05 and 0.1 respectively. A player only sees the emissions – a series of dice casts such as 1,3,5,2,6,3,6,6 etc, – but whether a result originated from the fair or loaded dice – the state sequence - is unknown (hidden). Under this formulation, given the model parameters (transition and emis- sion probabilities) it is possible to calculate the probability of the data x (se- quences of dice casts) originating from this model. A naïve approach to this would be to consider every possible path (sequence of states) π through the model, calculate the probability of this path giving raise to the sequence x through the series of transitions and emissions and finally sum these probabil- ities. Fortunately, the framework of hidden Markov models offers a set of standard robust recursive algorithms to perform likelihood calculations. The forward algorithm calculates the probability of data ∑ , by recursively considering every path through the model. The Viterbi algorithm recursively finds the maximum likelihood (most probable) path. Another ap- proach commonly used within the framework of HMMs is the possibility of calculating probabilities of a particular observation xi originating from state k given the observed sequence P(πi = k | x). This is the posterior probability of state k at time i (i-th symbol in the sequence) and can be conveniently calcu- lated by running two recursive algorithms – forward and backward using two different starting points (beginning and end of the sequence) up until the i-th symbol (see chapter 3 of Durbin et al. 1998 for a detailed description).

Modeling evolution of two sequences through an HMM Pair hidden Markov HMMs are commonly used to describe the statistical alignment problem of two sequences evolving under substitutions, insertions, and deletions (Pachter et al. 2002). Pair HMMs describe the evolution of two sequences through a series of transitions between the states of the model. From the generative perspective, a pair-HMM outputs two sequences through a series of emissions within states and transitions between its states. The standard formulation of the model has been described by (Durbin et al. 1998) and contains three states. In the Match state the model outputs two char- acters according to a substitution model – one in the first sequence and the other in the second sequence. In the Insert state a character is usually drawn from the equilibrium distribution and output only in the first sequence. The Delete state is analogous and the output (emission) happens only in the second sequence. Figure 3 illustrates a simplified pair-HMM with three states.

30

Figure 3. A simple example of a pair HMM with 3 states. The ++ symbol denotes an emission in both sequences, + – and – + an emission in only one of the sequences.

The model emits two unaligned sequences and the hidden path through the states defines the alignment π. Transitions through the states of the pair HMM describe the indel process, whereas the substitution process is described by the match state emissions. The nature of the transitions necessitates geometrically distributed gap lengths. The same recursive algorithms described in the previous paragraph can be used to perform probabilistic calculations. The forward algorithm, which uses a dynamic programming principle similar to the Gotoh algorithm with affine gap penalties (Gotoh 1982), is capable of integrating over every possible pair- wise alignment between two sequences x and y to find the probability of these two sequences originating from the model θ - , | . This model de- scribes transitions and emissions and , | ∑ , , | with the computational complexity proportional to the product of lengths of x and y. Similarly, if required, the highest probability alignment can be recovered us- ing the Viterbi algorithm, but this approach does not necessarily maximize the accuracy of the alignment. The parameters θ can be directly estimated using this likelihood function in the maximum likelihood framework using numerical optimization (Cappé et al. 2009), however a common approach is to derive transition and emission

31 probabilities using Baum-Welch training based on a larger data set (Baum et al. 1970). These parameters are then fixed in the model and used in subsequent analyses. The HMM formalism allows us to obtain reliability measures for pairs of residues being aligned in the form of posterior probabilities. These measures can be easily calculated using the forward-backward approach and can be fur- ther used to build an alignment that maximizes the scores in the form of mar- ginalized posterior decoding. One can easily obtain these posterior probabili- ties for all combinations of pairwise arrangements. Pair HMMs can be parameterized in various ways, including blank states with no emissions and additional states for long and short indels (Miklos et al. 2004). Being generative models, the full HMM parameterization requires the addition of two states – start and stop – along with specification of transition probabilities, which determine the sequence length distribution. This flexibil- ity in model formulation and computational convenience has made pair- HMMs a commonly used tool to model statistical alignment (Knudsen and Miyamoto 2003; Miklos et al. 2004; Redelings and Suchard 2005; Redelings and Suchard 2007) and in progressive multiple sequence alignment (Löytynoja and Milinkovitch 2003; Do et al. 2005; Wang et al. 2006).

Full statistical alignment and tree approaches The extension of the TKF models of pairwise statistical alignment and pair- HMMs to multiple sequences related by a tree occurred in the early 2000’s. The early approaches concentrated either on Bayesian sampling of the align- ments under a fixed tree, or providing algorithms of likelihood computations on an arbitrary tree (Holmes and Bruno 2001; Lunter et al. 2003). The goal of the full joint estimation of alignments and phylogenies as pos- tulated by Kruskal and Sankoff (1983) was reached shortly after. The first approach to jointly treat the alignment and phylogenetic tree reconstruction implemented in BAli-Phy software uses joint sampling from the posterior dis- tribution of alignments, trees and model parameters (Redelings and Suchard 2005). The improved version of this joint approach featured a more realistic indel model with indel rates related to the branch lengths (Redelings and Suchard 2007). BAli-Phy and another joint estimation method called StatAlign (Novák et al. 2008) are currently considered “state of the art” in phylogenetic inference. They allow for the co-estimation of two problems at the same time, avoiding the biases associated with conditioning on a single alignment estimate as with the common two-step approach. The joint estimation approach also increases the power of the data by making use of information in insertions and deletions by using an explicit indel model. Unfortunately, these joint methods are seldom used, mostly due to the high computational demand: even relatively simple analyses require days to run.

32 Due to their Bayesian sampling nature these methods can be also hard to use and understand, as the inference process needs to be monitored to meet the convergence criteria. Also, more complex problems require more samples to cover the tree and alignment space further limiting the applications of these approaches.

The group of approaches based on the principle of statistical alignment pro- vides accurate methods of evolutionary inference by explicitly modelling evo- lutionary processes. The lack of dependency on the single alignment estimate, which normally makes using a multiple sequence alignment method neces- sary, makes it possible to avoid the biases associated with alignment error. The use of explicit indel modelling allows one to make use of the additional information in gaps, which can carry substantial phylogenetic signal (Dessimoz and Gil 2010). All these benefits, however, come at a cost: in- creased computational demands make these methods less popular than the well-established two-step process.

MSA treatment Another group of approaches to dealing with alignment error stem from the belief that problematic regions in the alignment can be removed, thus making the alignment suitable for downstream analysis. These approaches acknowledge that MSAs contain errors and attempt to identify the regions containing errors and remove them. This approach to removal, called filtering, requires accurate measures, which allow to distinguish between the true ho- mologies recovered by a method and the false (spurious) ones that introduce errors.

Column filtering Originally, the idea to trim (filter) alignments stemmed from the desire to re- move the more diverged parts of in order to facilitate phylogenetic anal- ysis. The goal was to pick “good” bits of the alignment and discard regions potentially saturated with multiple substitutions making them difficult to align (Swofford et al. 1996). Early approaches to filtering were thus based on sub- jective manual removal of alignment columns. The realization that MSA con- tain the errors described in this introduction fueled the development of auto- mated methods, mostly relying on gaps and their distributions (Gatesy et al. 1993; Rodrigo et al. 1994). Aligning sequences and then reversing the residue order to align again is one of the simplest approaches to detecting alignment uncertainty. This heads- or-tails (HoT) approach assesses the stability of the alignment algorithm (Landan and Graur 2007). This is because MSAMs are expected to produce

33 repetitive results and MSA algorithms need to make deterministic choices. For example, identity-based aligners need to decide on the course of action if two or more events have the same score. The decision might be between opening a gap and matching two residues. This choice is usually hard coded in the algorithm and reversing the order of residues might not produce a mirrored alignment as one would expect. Thanks to this simple assumption it is possible to speculate that the regions in the alignment showing differences might be uncertain. Currently available methods of automated MSA filtering use various crite- ria to identify alignment uncertainty. The approaches include ad hoc score- based algorithms relying on gap patterns used in Gblocks and TrimAl (Castresana 2000; Talavera and Castresana 2007; Capella-Gutierrez et al. 2009) , assessing the randomness of an MSA implemented in Aliscore (Kück et al. 2010), comparing the degree of homoplastic sites to random columns in Noisy (Dress et al. 2008), high entropy sites in BMGE (Criscuolo and Gribaldo 2010) and sites sensitivity to the guide tree in GUIDANCE (Penn et al. 2010; Sela et al. 2015). These automated methods are capable of producing repetitive results, but are not based on any explicit models of evolution and thus potentially produce unreliable results. This lack of model-based assess- ment of alignment uncertainty usually means arbitrary results and potential removal of phylogenetically informative data. One notable exception is Zorro trimmer (Wu et al. 2012) , which uses a pair hidden Markov model to base its filtering approach on posterior probabilities of residue pairs from a pair HMM.

Filtering accuracy Filtering removes alignment columns believed to be containing regions of un- certainty. For example, Gblocks might produce a filtered alignment similar to the one shown in Figure 4. The criteria used to describe uncertainty, in the alignment are very im- portant from the point of accuracy. Another difficulty with filtering is to detect problematic columns without removing the correctly aligned ones containing informative sites. Even if one uses the most sophisticated evolutionary models to assign probabilistic scores to the columns of the alignment, the decision whether to classify the column as a “good one” – true – or a “bad one” – false – needs to be made. Furthermore it is possible that the same score value can describe a true column and another false column, meaning that the distribu- tions of true and false column scores can overlap.

34

Figure 4. An example of alignment filtering where the original untreated MSA was subject to automated alignment trimming to remove columns believed to be uncer- tain

This makes the discrimination difficult to make without knowing the true alignment. For example, having the columns scored between 0 and 1 it is com- mon to assume an arbitrary threshold (for example 0.9) below which one clas- sifies columns as false and discards them. This does not guarantee that we’ll get rid of bad columns and keep all the good ones as only the availability of the reference (true alignment) can answer this question. Having a set of reference alignments at one’s disposal makes it possible to assess the performance of a filtering method. By varying the discrimination threshold it is possible to observe the relative proportions of true positive and false positive results. The relationship between these two proportions – the true positive rate (TPR) and false positive rate (FPR) – forms a receiver oper- ating characteristic (ROC) curve, which describes the tradeoff between re- moving false positives and retaining true positives (Figure 5). Ideally, one wants the area under the curve (AUC) to be as close to 1 as possible meaning perfect distinction between TP and FP results, but usually trying to discard all the spurious columns will also remove a portion of the true ones.

35

Figure 5. An example of two ROC curves. The dashed curve shows better perfor- mance when discriminating between true positives and false positives. AUC = 0.95 for the dashed curve and AUC = 0.88 for the solid curve. For example in order to re- tain 75% of true columns the worse (solid line) method has to accept 12.5% of false columns.

Impact of filtering on downstream analyses The developers of filtering methods claim the improvement of alignment treat- ment on evolutionary inference (Talavera and Castresana 2007; Wu et al. 2012). Most of the filtering studies, however, are performed on a limited set of cases, potentially highlighting situations where the alignment trimming works well. Recent studies show, however, that automated methods do not fix the bias associated with inferring adaptive evolution through estimates of dN/dS (Jordan and Goldman 2012) and comprehensive study of multiple fil- tering methodologies by (Tan et al. 2015) shows that filtering does not help and, in many instances, worsens the tree estimates under both simulated and empirical studies. One of the reasons is the aggressive column removal to limit the number of false positive homologies. In extreme cases this removal can affect the majority of columns. This severely limits the number of informative sites leading to higher variance in the estimates.

Even though Zorro’s column filtering scheme does not guarantee to improve phylogenetic accuracy, the statistical approach to evaluate pairs of sequences under an explicit model of evolution has the potential to generate insight into the structure of multiple sequence alignments by assigning probabilities to pairwise homologies. I use this approach throughout the second part of this thesis.

36 Research Aims

Alignment uncertainty is a source of errors in evolutionary analyses. The main objective of this thesis was therefore to develop new, robust, and accessible methods for various kinds of evolutionary inference in order to mitigate or detect the uncertainty introduced by the sequence alignment process. In gen- eral, the aims were to develop a principled approach to MSA uncertainty by not using the point estimates of sequence alignments at all (aims I and II) and to treat multiple sequence alignments using a probabilistic approach of as- sessing pairwise homologies (III, IV). Specifically, this thesis aims to:

I Develop an accurate and statistically robust pair hidden Markov model distance-based method of phylogenetic inference without the need to use multiple sequence alignment.

II Develop a method to quantify how natural selection acts and in- teracts between substitutions and insertions/deletions without the biases introduced by sequence alignment by using a pair hidden Markov model approach.

III Develop a method of removing pairs of misaligned residues from multiple sequence alignment without filtering out the whole col- umns, thus retaining informative data.

IV Quantify alignment quality and similarity by studying patterns of posterior probabilities of residue pairs in the alignment.

37 Summary of the papers

Paper I: Phylogenetic Tree Estimation With and Without Alignment: New Distance Methods and Benchmarking This work addresses the principled approach to alignment uncertainty by avoiding dependence of tree construction on point estimates of MSA. The mo- tivation is based on multiple studies showing that sequence alignment quality is known to impact downstream phylogenetic analyses (Wong et al. 2008). This paper presents a fast and robust statistical method of tree inference that does not require conditioning on an estimate of multiple sequence align- ment. The approach models the evolution of pairs of sequences using a pair hidden Markov model to describe the substitution process as well as mecha- nisms governing insertions and deletions. PaHMM-Tree, the novel neighbor joining-based tree reconstruction program developed in this study, estimates pairwise distances without assuming a single alignment and can integrate over alignment uncertainty. It then combines these distances into trees by using the BioNJ neighbor joining algorithm (Gascuel 1997). We further use simulations to benchmark PaHMM-Tree’s performance against a wide-range of other phylogenetic tree inference methods, including the first comparison of alignment-free distance-based methods against more conventional statistical tree estimation methods. We investigate both amino acid and nucleotide data to assess the performance using Robinson-Foulds topological distances in order to measure the quality of resulting tree topolo- gies (Robinson and Foulds 1981). Our new method for calculating pairwise distances based on statistical alignment provides distance estimates that are as accurate as those obtained using standard methods based on the true simulated alignment. For deep di- vergences, we find the alignment step becomes so prone to error that our dis- tance-based PaHMM-Tree outperforms all other methods of tree inference. We further find that the accuracy of alignment-free methods tends to be worse than the standard two-step methods in the presence of alignment uncertainty, with no conditions where alignment-free methods are equal to or more accu- rate than standard phylogenetic methods, even in the presence of substantial alignment error. Our benchmarking study further illustrates the problem of alignment uncertainty and its impact on downstream phylogenetic inference,

38 especially at high divergences. We show that for hard phylogenetic problems MSA-based methods produce inaccurate results due to alignment error. PaHMM-tree could thus provide an alternative to slow and difficult to use full statistical alignment under conditions where MSA-based inference would be too prone to alignment error. Full statistical alignment provides the most accurate results, but also suffers from a number of performance and usability issues, which PaHMM-tree does not. Instead, this pairwise statistical approach is scalable and could be used in bigger inference problems. However, it carries the standard limitations of distance-based methods.

Paper II: Selection acting on indels and substitutions in protein coding sequences In this work, we use the principled approach to alignment uncertainty in order to detect patterns of selection acting on protein coding sequences without the need to perform sequence alignment. Our pair HMM-based approach attempts to avoid the biases associated with alignment error on the estimates of selec- tion (Markova-Raina and Petrov 2011). Furthermore, our approach allows us to look at patterns of selection acting on insertions and deletions, which nor- mally tend to be heavily biased by MSA algorithms. These patterns of selection acting on proteins, which shape phenotypes, are of great interest to researchers studying adaptive evolution, but need to be ac- curately inferred. The dN/dS ratio, commonly used as a proxy for detecting selection, examines only amino acid substitutions and ignores other evolution- ary forces like small-scale insertions and deletions (indels) that may affect protein evolution. Probabilistic methods are considered to be the best ap- proach for estimating dN/dS from multiple sequence alignments of protein coding sequences, but few statistically robust methods capable of inferring patterns of insertions and deletions exist except for the full statistical align- ment (Redelings 2014). Most of the methods in use perform simple counts of gap characters produced by multiple sequence aligners. These patterns, how- ever, are highly dependent on a particular sequence alignment method and cannot be treated with certainty in the face of alignment error, to which protein coding sequences are especially susceptible (Jordan and Goldman 2012). Our approach allows us to integrate over alignment uncertainty by using a pair hidden Markov model. The approach is based on an evolutionary model from PaHMM-Tree and uses a probabilistic codon model to describe the pro- cess of substitutions in the match state and insertions/deletions by changing between the states of the model. We develop a software tool called PaHMM- to provide the estimates of evolutionary parameters governing natural selection for pairs of sequences. The tool is the first statistically robust pro- gram capable of processing thousands of gene pairs. We apply it to look at

39 selection acting on substitutions through the dN/dS ratio, and also explore how it affects small-scale insertions and deletions (indels). By explicitly modeling the evolution between pair of unaligned sequences we obtain high quality estimates of dN/dS and the indel rate from large ge- nomic datasets without biasing ourselves toward any particular alignment. Specifically, we look at thousands of orthologous gene pairs to explore human-mouse and human-chicken gene evolution. These sequences represent medium and high divergences and would normally be prone to a substantial alignment error. We explore the relationship between selective pressures act- ing on substitutions and indels, revealing that the indel rate and selection (dN/dS) tend to be related, demonstrating that purifying selection acting in proteins tends to affect non-synonymous and indels in a similar way. We further investigate how the selective forces acting on substitutions and indels vary along genes, and show that the protein N and the C termini tend to be under different selective pressures compared to the core region of the pro- tein. Finally we use a simulation approach to show that our model can provide robust estimates of selection acting on substitutions and indels, and can be used to detect patterns of selection acting along genes.

Paper III: A graph-based approach for improving the accuracy of multiple sequence alignments for evolutionary analysis In this work we describe a probabilistic approach to dealing with errors in existing MSAs. This approach stems from the popular belief among research- ers that filtering, i.e. removing uncertain portions of the alignment, can help to avoid downstream biases (Talavera and Castresana 2007). There are a num- ber of popular alignment filtering methods, also known as trimming tools, which attempt at removing uncertain columns from the alignment (Tan et al. 2015). Accurate removal of these columns can be problematic because it depends on the method of assessing alignment uncertainty and the ability to distinguish between correctly and incorrectly aligned columns. These trimming methods usually use aggressive column removal algorithms in order to minimize the fraction of misaligned columns. In extreme cases, there is no data left for re- liable analysis to be performed. In this work we propose an alternative approach, which follows the idea that one should only attempt to treat residues that are misaligned, rather than removing the whole columns. This should lend itself to much higher data re- tention. The use of an evolutionary model should provide an objective crite- rion for identifying uncertain residues.

40 Rather than looking at alignment columns, our approach is to assess the quality of pairwise homologies in an alignment. We use a probabilistic pair- HMM to obtain posterior probabilities of pairs of residues being aligned under a model of substitutions, insertions and deletions. We use Zorro software (Wu et al. 2012) to obtain these pairwise posteriors and to transform them into weights further used to cluster the pairs into reliable groups. Our approach can be used to either partially filter a column, by picking the largest cluster of reliable residues and removing everything else, or to “divvy-up” a column into multiple columns, each representing a cluster sharing strong evidence of ho- mology. We implement these approaches in a piece of free software called Divvier, and demonstrate that our new methods are extremely effective relative to ex- isting software at identifying true positive and false positive homology pairs. We use both BAliBASE (Thompson et al. 2005) and simulation studies to show this superior performance. We show that Divvier produces better ROC curves in BAliBASE and across a range of divergences in our simulated phy- logenetic benchmark. Furthermore we show that our method, in contrast to other filtering tools, can alleviate Long Branch attraction artifacts induced by MSA and reduces the variation in tree estimates caused by MSA uncertainty. Divvier serves as an alternative to existing trimming approaches. Unlike the most filtering methods it is based on an explicit evolutionary model to detect potential alignment errors. This makes it possible to remove spurious homologies with high accuracy and retain potentially informative data other- wise lost in column-based filtering approaches.

Paper IV: Examining sequence alignments using a model-based approach This paper describes the use of probabilistic evolutionary models to compare and evaluate the results produced by multiple sequence alignment methods, and to provide comparative tools for assessing MSAs. Due to the difficult NP-complete nature of the multiple sequence alignment procedure, MSAMs are expected to return uncertain results depending on the nature of the problem, and need to be benchmarked. This evaluation through benchmarking is usually performed on structural data sets and MSAMs com- pete to obtain the highest scores in the commonly used sum-of-pairs score (SP) and column scores (CS). These scores reflect the ability to return portions of true positive homolo- gous pairs or columns and summarize MSA accuracy with a single number. Furthermore, these scores do not take into account false positive (FP) results introduced by MSAs thus not providing the full picture as counts of FP can vary due to gap patterns.

41 This competition among MSAMs can potentially lead to MSAs overfitting the data in pursuit for the highest benchmark scores. This maximization of the score can introduce undesirable patterns of FP homologies which are unde- tected using the standard scores. Studying these patterns can helps us identify the differences between MSAMs and identify conditions when MSA provides uncertain results. In order to study the patterns of results produced by MSAMs, we look at patterns of residue pairs in alignments. Studying the patterns of these pairwise homologies requires an objective measure of describing the results. We use pair-hidden Markov models and the full statistical alignment approach to break down alignment columns into pairs and obtain distributions of pairwise posterior scores for these MSAMs. Basing our results on a structural benchmark and a simulation study, we find that MSAMs appear to return a sample from a confidence set defined by high posterior probabilities. Furthermore, we find that the reference align- ments contain low posterior portions of pairwise homologies which cannot be expected to be recovered by any MSAM showing that MSA performance can- not be improved. We also look at several possible alignment metrics, with and without the need for reference alignments, and ultimately suggest using positive predic- tive value (PPV) and mean posterior probability for MSA evaluation. We test the MSAMs based on the phylogenetic accuracy showing similar performance under low divergences. We also find that tree accuracy can be roughly approximated by the mean pairwise posterior probability and show that treating MSA to remove errors does not lead to improved phylogenetic performance. This treatment through removal of false positive pairs shows complex interaction between patterns of TPs and FPs.

42 Conclusions and future prospects

Multiple sequence alignment is a prerequisite in most evolutionary analyses. Researchers, however often do not realize that an MSA can be unreliable and that it is only a point estimate of the homology between the characters in the sequence. Those aware of this uncertainty often refer to ad hoc alignment trimming methods, which do not address the problem and remove informative sites increasing sampling error and potentially reducing accuracy of the down- stream results. In the commonly used two-step approach, one often uses advanced Bayes- ian or Maximum Likelihood statistical models of sequence evolution, but base the analysis on a single uncertain estimate of the sequence alignment. This alignment is a result of algorithms based on similarity measures. This makes the two-step process seem inconsistent. We use relatively crude methods to recover positional homologies and then we switch to the state-of-the art sta- tistical inference. The underlying assumption that MSA is the true representa- tion of positional homologies is therefore often not the case. MSA might work in simpler cases, but for highly diverged sequences it is simply impossible to find a single reliable alignment using similarity-based criteria. One needs to be aware of the condition when MSA can be used and when not. An MSA can be considered a hypothesis with regard to the shared ancestry of the residues in a column, and ideally it should be treated in a statistically rigorous fashion. Rather than using point estimates of MSAs, one could treat the alignment and evolutionary problems as one, intrinsically using a myriad of alignments in a probabilistic manner. This approach of full statistical align- ment was developed over a decade ago, but is unfortunately very complex and difficult to use for a person without a computational or statistical background. It is also time consuming and requires a deep knowledge of the process to make sure that the results are reliable. As a result the state of the art methods are rarely used. In order to enable accurate analyses one needs accessible and reliable evo- lutionary inference methods to exist. In this thesis I attempted to address this problem by using pair hidden Markov models in the Maximum Likelihood framework to create reliable and fast tools for evolutionary inference and to highlight the problems related to sequence alignment and its inherent uncer- tainty. My approach to inferring phylogenies used here uses the concept to statis- tical alignment but, unlike the full statistical alignment joint approaches to

43 inferring trees and MSAs, it does not require the Markov chain Monte Carlo Bayesian approach. The nature of the pair HMM models used in this work makes it possible to infer evolutionary parameters between pairs of sequences in a fraction of the time needed for the Bayesian methods to complete. The execution times rival the ones in the two-step approach making the tools via- ble for use on desktop PCs. The downside of pairwise analyses is that one cannot use the full column information as in the traditional and full statistical approaches. For researchers requiring the highest accuracy, full statistical alignment remains the best choice, but it is not always possible. In our pairwise approaches to detecting selection full statistical alignment can also be applied, but the Bayesian Markov chain Monte Carlo limitations are the same as with tree inference, perhaps even more severe as codon models require more computations to calculate substitution probabilities. Our ap- proach in paHMM-Gene makes it possible to process thousands of pairs of genes in a manner of hours avoiding common biases associated with sequence alignment. The method provides the first insights into selection acting on in- dels and in future, although it suffers from the standard limitations of pairwise methods. The methods developed during the course of my studies were designed with extendibility in mind. The software tools are simple and with little effort one can add graphical user interfaces (GUIs) to make them more approachable for users unwilling or unable to use command line interfaces. The PaHMM-Tree, paHMM-Gene and Divvier tools can be extended to use multi-threading to compute pairwise distances simultaneously. This will improve the speed of analyses by a factor 2-8 on modern desktop machines. The progress of the analyses could be visualized in a GUI. One can also take advantage of the massive parallelism offered by graphical processing units (GPUs) to further speed-up the calculations provided that there is enough memory available and that the sequences are not too long. More advanced heuristics for our pair- HMM tools could also be implemented to limit the number of calculations in dynamic programming matrices used by the software by using more restrictive banding approaches. An alternative approach to the full statistical alignment and pair HMMs could be the use of multiple HMMs such as triple HMMs, which could model evolution of triplets of sequences. These models, in principle, could be used in the Maximum Likelihood framework providing more power than their pair- wise counterparts. For example, estimating discrete gamma substitution rate heterogeneity would more accurate. The complexity of modelling three se- quences at a time increases from quadratic to cubic relative to sequence length – O(n3) vs. O(n2) compared to pair HMMs. The same can be expected for memory requirements. Given relatively short sequences one should, however, be able to perform analyses in a reasonable time with the use of a host in- formed heuristics.

44 When single estimates of MSA need to be used, we show that model-based alignment treatment should be used instead of various ad hoc methods. This model-based approach also allows to compare and benchmark the aligners and can help to identify conditions when MSA-based analyses should not be used. In conclusion, the field of evolutionary biology is crying out for easy-to- use or highly automated reliable statistical methods of evolutionary inference. The current alignment-based pipelines of analyses, although historically mo- tivated due to their computational convenience, don’t always lend themselves to accurate results, especially for divergent sequences. Some of the data are simply not suitable for analyses and such cases should be accounted for in a pipeline. This work attempts to address these issues in a limited scope by of- fering approachable and robust tools and by explicitly describing and quanti- fying the problems with the two-step approach.

45 Svensk Sammanfattning

Multipel linjering eller inpassning (eng. Multiple sequence alignment, MSA) är en vanlig uppgift inom evolutionsbiologi. MSA behandlar molekylära se- kvenser, ofta av olika längder, och grupperar sekvenstecken som delar släkt- skap i kolumner. Denna gruppering gör det möjligt att identifiera evolutionära händelser som substitutioner och insertioner / deletioner, så kallade indels. Den relaterade (homologa) tecken är svåra att identifiera på grund av ett enormt antal kombinationer i vilka elementen kan inpassas. Linjeringsdator- program använder mycket approximativa algoritmiska och heuristiska meto- der för att utföra inpassningen inom rimlig tid. Det finns ett antal olika se- kvenslinjeringsmetoder som ofta ger motsägande resultat avseende homologi- erna, speciellt för avlägset besläktade sekvenser. Detta innebär att de flesta av de samtida evolutionära analyserna i den mest använda tvåstegsmetoden (sekvenslinjering följt av statistisk evolutionär in- ferens) potentiellt är baserade på felaktiga data. Den inverkan som inpass- ningsfelet har på efterföljande analys har studerats tidigare. Fel i inpassningen kan påverka fylogenetiska trädtopologier, grenlängder och uppskattningar av naturligt urval. Ett populärt sätt att försöka fixa inpassningen är att ta bort kolumner som man tror innehåller fel. Det görs vanligtvis genom ad hoc-metoder som inte bygger på några evolutionsmodeller och tar bort användbar information. Full- ständiga statistiska linjeringsmetoder kan undvika inpassningsfel genom att gemensamt härleda träd och linjering i ett Bayesianskt ramverk. Dessa tillvä- gagångssätt kräver omfattande datorresurser och avancerad kunskap om Mar- kov Chain Monte Carlo-metoder (MCMC) som gör dem otillgängliga för de flesta evolutionsbiologer. Analyserna tar ofta dagar att beräkna. I den här avhandlingen beskriver jag statistiska sätt att undvika fel som in- troduceras av MSA. Jag använder probabilistiska evolutionära modeller med två metoder för att hantera fel i linjeringen. Först genom att utveckla robusta metoder som inte kräver multipel linjering alls för att härleda träd och naturligt urval. För det andra, när MSA behöver användas, försöker jag undvika effek- terna av fel i efterföljande analyser genom att identifiera falska homologier med en evolutionsmodell. I artikel I presenterar vi en ny statistisk metod som uppskattar parvisa av- stånd mellan sekvenser utan behov av en linjering. Metoden använder ne- ighbor joining för att bygga fylogenetiska träd. Den nya metoden som imple-

46 menteras i programvaran PaHMM-Tree ger evolutionära avståndsuppskatt- ningar som är lika exakta som de som erhållits med standardmetoder baserat på den sanna (simulerade) linjeringen. PaHMM-Tree ger mer exakta trädupp- skattningar än alla de parvisa avståndsmetoder som bedömts. För sekvensdata med låg till medel divergens tenderar tvåstegsmetoderna med statistisk infe- rens att fungera bättre än PaHMM-Tree. För djupa fylogenier finner vi att av- ståndsbaserade PaHMM-Tree fungerar bättre än andra metoder för trädinfe- rens. Vår studie riktar också uppmärksamheten på linjeringsfria (alignment- free) metoder. Vi visar att de fungerar sämre än andra vanliga metoder för inferens. I artikel II presenterar vi en probabilistisk modell som gemensamt uppskat- tar dN / dS och indelhastigheten. Vår metod tar bort påverkan i båda parame- teruppskattningarna som orsakas av inpassningsfel. Vi använder vår metod på tusentals gener från parvisa analyser av människa-mus och människa-kyck- ling, vilket visar att indelhastigheten och det naturliga urvalet (dN / dS) ten- derar att vara relaterade. Detta visar att negativ selektion som verkar i protei- ner tenderar att påverka icke-synonyma mutationer och indels på ett liknande sätt. Vi undersöker också hur de selektiva krafterna som verkar på substitut- ioner och indels varierar inuti generna. Våra resultat och metoder ger en möj- lighet att börja studera interaktionen mellan substitutioner och indels. Vi er- bjuder de första verktygen för att förstå hur indels påverkar proteinevolut- ionen. I artikel III presenterar vi en metod och ett datorprogram som tar bort fel från linjeringen utan att filtrera hela kolumnerna. Vi använder ett grafbaserat tillvägagångssätt implementerat i programvaran Divvier för att identifiera grupper av tecken som har starka statistiska bevis på delad homologi. Detta baseras på resultatet av en dold Markovmodell (pair-hidden Markov model). Dessa kluster kan sedan användas för att antingen filtrera tecken från linje- ringen genom en process som vi kallar partiell filtrering, eller för att represen- tera vart och ett av klustren i en ny kolumn. Vi validerar vårt tillvägagångssätt genom att applicera det på riktiga och simulerade data. Vi finner att Divvier överträffar all annan filtreringsprogramvara för att behandla MSAs. Divvier behåller mer sanna positiva homologier och avlägsnar mer falska positiva ho- mologier. I artikel IV tittar vi i detalj på resultat som producerats av fyra populära linjeringsprogram. Vi använder BAliBASE-benchmark och en simulerings- studie för att tilldela a posteriori-sannolikheter till homologa par i referenslin- jeringar och uppskattade linjeringar med hjälp av en evolutionär par-HMM- modell. Dessa a posteriori-sannolikheter representerar hur svårt det är att åter- skapa dessa par i linjeringsprocessen. Vi upptäcker att en stor del av paren i BAliBase och mer divergerande simulerade dataset har låga a posteriori-san- nolikhetsresultat och det är osannolikt att de kommer att återskapas av något linjeringsprogram. Vi finner också att linjeringsprogramvara inte producerar

47 signifikant olika resultat och ofta återvänder ett prov från ett konfidensinter- vall för höga a posteriori parvisa homologier. Vi undersöker också effekten av linjeringsfel på underliggande fylogenetisk inferens och finner att den ge- nomsnittliga a posteriori-sannolikheten är en bra prediktion av trädets korrekt- het. Sammanfattningsvis är området för evolutionsbiologi i behov av lättan- vända och tillförlitliga statistiska metoder för evolutionär inferens. Denna av- handling försöker att hantera problemen med MSA-baserad inferens genom att tillhandahålla robusta statistiska inferensverktyg och genom att probabilist- iskt hantera linjeringsfel.

48 Acknowledgements

The manuscripts contained in this thesis would not have been written if it was not for my wife, Claire, who moved to Sweden in to pursue a PhD of her own. As a computer scientist, this motivated me to search for interesting applica- tions of computing concepts in various domains of science. That was how I found the position advertised by Simon, become an evolutionary biologist and the end result is this thesis. Without Claire’s patience and continuous support this thesis might not have been completed. Thank you, my Sweet. I would also like to thank a number of people who contributed to me fin- ishing this work. First of all, I would like to thank Simon Whelan for letting me become an evolutionary biologist in the first place. Thank you Simon for patiently explaining various biological and statistical concepts, sometimes multiple times. Thank you for your involvement and pedagogical approach in leading me through this difficult field of statistical methods in molecular evo- lution. I would also like to thank Hans for allowing a non-biologist to study at his department. The last five years have thought me a lot about evolutionary biol- ogy. I feel like I came a long way from not understanding anything during departmental seminars to being able to follow everyone’s research. I would like to thank pretty much all the members of the Department, both current and past. Over the past few years I have met a lot of great people whose thoughts and work have inspired me. There are a lot of people to thank for discussing various evolutionary concepts, but out of fear of missing someone out, I need to make this general statement. These discussion contributed to the quality of my scientific output. Finally, this work would not have been possible without funding. I would therefore like to thank the Department for their financial support. It is not common in science to have the opportunity do interesting research as a PhD student while being paid a decent salary. I therefore also extend thanks the Swedish state for their generous support of science, and the Unions for peri- odically re-negotiating the collective agreement that covers PhD students. I would also like to thank the Anna Maria Lundin stipendiefond for provid- ing funding for me to attend overseas conferences. Lastly, I would like to thank the now-defunct EBC Graduate School for Genomes and Phenotypes for funding various workshops and conferences.

49 In summary there are many people and entities who contributed to this work both directly and indirectly. Thank you to all those whom I met here at EBC over the past years for great work and scientific atmosphere.

50 References

Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, Tatusova T, Ostell J, Lipman D. 2008. The influenza virus resource at the National Center for Biotechnology Information. J. Virol. 82:596–601. Baum LE, Petrie T, Soules G, Weiss N. 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 41:164–171. Bishop MJ, Thompson EA. 1986. Maximum likelihood alignment of DNA sequences. J. Mol. Biol. 190:159–165. Bruno WJ, Socci ND, Halpern AL. 2000. Weighted neighbor joining: A likelihood- based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. 17:189–197. Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T. 2009. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. 25:1972–1973. Cappé O, Moulines E, Rydén T. 2009. Inference in hidden markov models. In: Proceedings of EUSFLAT Conference. p. 14–16. Cartwright RA. 2005. DNA assembly with gaps (Dawg): simulating sequence evolution. Bioinformatics 21:iii31-iii38. Castresana J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17:540–552. Chatzou M, Magis C, Chang J-M, Kemena C, Bussotti G, Erb I, Notredame C. 2016. Multiple sequence alignment modeling: methods and applications. Brief. Bioinform. 17:1009–1023. Criscuolo A, Gribaldo S. 2010. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol. Biol. 10:210. Dessimoz C, Gil M. 2010. Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biol. 11:R37. Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S. 2005. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 15:330–340. Dress AWM, Flamm C, Fritzsch G, Grünewald S, Kruspe M, Prohaska SJ, Stadler PF. 2008. Noisy: identification of problematic columns in multiple sequence alignments. Algorithms Mol. Biol. 3:7. Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, Smith SA, Seaver E, Rouse GW, Obst M, Edgecombe GD. 2008. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452:745. Durbin R, Eddy SR, Krogh A, Mitchison G. 1998. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press

51 Eddy SR. 2004. Where did the BLOSUM62 alignment score matrix come from? Nat. Biotechnol. 22:1035. Edgar RC. 2004. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797. Edgar RC, Batzoglou S. 2006. Multiple sequence alignment. Curr. Opin. Struct. Biol. 16:368–373. Felsenstein J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401–410. Felsenstein J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17:368–376. Felsenstein J. 2004. Inferring phylogenies. Sinauer associates Sunderland, MA Fitch WM. 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Biol. 20:406–416. Fitch WM. 2000. Homology: a personal view on some of the problems. Trends Genet. 16:227–231. Fletcher W, Yang Z. 2009. INDELible: A flexible simulator of biological sequence evolution. Mol. Biol. Evol. 26:1879–1888. Gascuel O. 1997. BIONJ: An improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 14:685–695. Gatesy J, DeSalle R, Wheeler W. 1993. Alignment-ambiguous nucleotide sites and the exclusion of systematic data. Mol. Phylogenet. Evol. 2:152–157. Gogarten JP, Doolittle WF, Lawrence JG. 2002. Prokaryotic Evolution in Light of Gene Transfer. Mol. Biol. Evol. 19:2226–2238. Goldman N, Yang Z. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725–736. Gotoh O. 1982. An improved algorithm for matching biological sequences. J. Mol. Biol. 162:705–708. Grimmett G, Stirzaker D. 2001. Probability and random processes. Oxford University Press Gu X, Li W-H. 1995. The size distribution of insertions and deletions in human and pseudogenes suggests the logarithmic gap penalty for sequence alignment. J. Mol. Evol. 40:464–473. Hahn BH, Shaw GM, De Cock KM, Sharp PM, Sharp PM. 2000. AIDS as a zoonosis: scientific and public health implications. Science 287:607–614. Hasegawa M, Kishino H, Yano T aki. 1985. Dating of the human-ape splitting by a of mitochondrial DNA. J. Mol. Evol. 22:160–174. Hein J. 2000. An algorithm for statistical alignment of sequences related by a binary tree. In: Biocomputing 2001. World Scientific. p. 179–190. Hodgkinson A, Eyre-Walker A. 2011. Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet. 12:756–766. Hogeweg P, Hesper B. 1984. The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J. Mol. Evol. 20:175–186. Holmes I, Bruno WJ. 2001. Evolutionary HMMs: A Bayesian approach to multiple alignment. Bioinformatics 17:803–820. Huelsenbeck JP, Nielsen R. 1999. Variation in the pattern of nucleotide substitution across sites. J. Mol. Evol. 48:86–93. Ingram VM. 1961. Gene Evolution and the Hæmoglobins. Nature 189:704–708.

52 Jelinek F, Bahl L, Mercer R. 1975. Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Trans. Inf. Theory 21:250–256. Jones DT, Taylor WR, Thornton JM. 1992. The rapid generation of mutation data matrices from protein sequences. Bioinformatics 8:275–282. Jordan G, Goldman N. 2012. The effects of alignment error and alignment filtering on the sitewise detection of positive selection. Mol. Biol. Evol. 29:1125–1139. Jukes T, Cantor C. 1969. Evolution of Protein Molecules. New York Acad. Press 3:21–132. Katoh K. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059–3066. Kemena C, Notredame C. 2009. Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics 25:2455–2465. Kimura M. 1983. The neutral theory of molecular evolution. Cambridge University Press Knudsen B, Miyamoto MM. 2003. Sequence alignments and pair hidden Markov models using evolutionary history. J. Mol. Biol. 333:453–460. Koshi JM, Goldstein RA. 1996. Correlating structure-dependent mutation matrices with physical-chemical properties. In: Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing. p. 488–499. Krogh A, Brown M, Mian IS, Sjölander K, Haussler D. 1994. Hidden Markov Models in Computational Biology: Applications to Protein Modeling. J. Mol. Biol. 235:1501–1531. Kruskal JB, Sankoff D. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley Kück P, Meusemann K, Dambach J, Thormann B, von Reumont BM, Wägele JW, Misof B. 2010. Parametric and non-parametric masking of randomness in sequence alignments can be improved and leads to better resolved trees. Front. Zool. 7:10. Lake JA. 1991. The order of sequence alignment can bias the selection of tree topology. Mol. Biol. Evol. 8:378–385. Landan G, Graur D. 2007. Heads or tails: A simple reliability check for multiple sequence alignments. Mol. Biol. Evol. 24:1380–1383. Le SQ, Gascuel O. 2008. An improved general amino acid replacement matrix. Mol. Biol. Evol. 25:1307–1320. Liberles DA, Teichmann SA, Bahar I, Bastolla U, Bloom J, Bornberg-Bauer E, Colwell LJ, de Koning APJ, Dokholyan N V., Echave J, et al. 2012. The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci. 21:769–785. Löytynoja A, Goldman N. 2008. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science (80-. ). 320:1632– 1635. Löytynoja A, Milinkovitch MC. 2003. A hidden Markov model for progressive multiple alignment. Bioinformatics 19:1505–1513. Lunter GA, Miklós I, Song YS, Hein J. 2003. An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. J. Comput. Biol. 10:869– 889.

53 Lynch M, Walsh B. 2007. The origins of genome architecture. Sinauer Associates Sunderland (MA) Markova-Raina P, Petrov D. 2011. High sensitivity to aligner and high rate of false positives in the estimates of positive selection in the 12 Drosophila genomes. Genome Res. 21:863–874. Metzler D. 2003. Statistical alignment based on fragment insertion and deletion models. Bioinformatics 19:490–499. Miklos I, Lunter GA, Holmes I. 2004. A “Long Indel” Model for Evolutionary Sequence Alignment. Mol. Biol. Evol. 21:529–540. Morgan GJ. 1998. Emile Zuckerkandl, Linus Pauling, and the Molecular Evolutionary Clock, 1959–1965. J. Hist. Biol. 31:155–178. Morrison DA, Ellis JT. 1997. Effects of nucleotide sequence alignment on phylogeny estimation: A case study of 18S rDNAs of Apicomplexa. Mol. Biol. Evol. 14:428–441. Needleman SB, Wunsch CD. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443– 453. Nelesen S, Liu K, Zhao D, Linder CR, Warnow T. 2008. The effect of the guide tree on multiple sequence alignments and subsequent phylogenetic analyses. In: Biocomputing 2008. World Scientific. p. 25–36. Nielsen R. 2005. Statistical Methods in Molecular Evolution. New York, NY: Springer New York Van Noorden R, Maher B, Nuzzo R. 2014. The top 100 papers. Nat. News 514:550. Notredame C, Higgins DG, Heringa J. 2000. T-coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302:205–217. Novák Á, Miklós I, Lyngsø R, Hein J. 2008. StatAlign: An extendable software package for joint Bayesian estimation of alignments and evolutionary trees. Bioinformatics 24:2403–2404. Ogden TH, Rosenberg MS, Page R. 2006. Multiple Sequence Alignment Accuracy and Phylogenetic Inference.Page R, editor. Syst. Biol. 55:314–328. Otu HH, Sayood K. 2003. A new sequence distance measure for phylogenetic tree construction. Bioinformatics 19:2122–2130. Owen R. 1846. Lectures on the Comparative Anatomy and Physiology of the Animals: Delivered at the Royal College of Surgeons of England, in 1844 and 1846. Longman, Brown, Green, and Longmans Pachter L, Alexandersson M, Cawley S. 2002. Applications of Generalized Pair Hidden Markov Models to Alignment and Gene Finding Problems. J. Comput. Biol. 9:389–399. Penn O, Privman E, Landan G, Graur D, Pupko T. 2010. An alignment confidence score capturing robustness to guide tree uncertainty. Mol. Biol. Evol. 27:1759–1767. Price MN, Dehal PS, Arkin AP. 2009. Fasttree: Computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26:1641–1650. Redelings BD. 2014. Erasing Errors due to Alignment Ambiguity When Estimating Positive Selection. Mol. Biol. Evol. 31:1979–1993.

54 Redelings BD, Suchard MA. 2005. Joint bayesian estimation of alignment and phylogeny. Syst. Biol. 54:401–418. Redelings BD, Suchard MA. 2007. Incorporating indel information into phylogeny estimation for rapidly emerging pathogens. BMC Evol. Biol. 7:40. Robinson DF, Foulds LR. 1981. Comparison of phylogenetic trees. Math. Biosci. 53:131–147. Rocha EPC, Smith JM, Hurst LD, Holden MTG, Cooper JE, Smith NH, Feil EJ. 2006. Comparisons of dN/dS are time dependent for closely related bacterial genomes. J. Theor. Biol. 239:226–235. Rodrigo AG, Bergquist PR, Bergquist PL. 1994. Inadequate support for an evolutionary link between the Metazoa and the Fungi. Syst. Biol. 43:578–584. Rosenberg MS. 2005. Multiple sequence alignment accuracy and evolutionary distance estimation. BMC Bioinformatics 6:278. Saitou N, Nei M. 1987. The Neighbor-joining Method : A New Method for Reconstructing Phylogenetic Trees ’. Mol. Biol. Evol. 4:406–425. Sanger F, Tuppy H. 1951. The amino-acid sequence in the phenylalanyl chain of insulin. I. The identification of lower peptides from partial hydrolysates. Biochem. J. 49:463–481. Sela I, Ashkenazy H, Katoh K, Pupko T. 2015. GUIDANCE2: accurate detection of unreliable alignment regions accounting for the uncertainty of multiple parameters. Nucleic Acids Res. 43:W7–W14. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15:1034–1050. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J. 2011. Fast, scalable generation of highquality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7:539. Stamatakis A. 2014. RAxML version 8: A tool for phylogenetic analysis and post- analysis of large phylogenies. Bioinformatics 30:1312–1313. Steel M, Hein J. 2001. Applying the Thorne-Kishino-Felsenstein model sequence evolution on a star-shaped tree. Strasser BJ. 2010. Collecting, Comparing, and Computing Sequences: The Making of Margaret O. Dayhoff’s Atlas of Protein Sequence and Structure, 1954– 1965. J. Hist. Biol. 43:623–660. Stuart A, Ord K, Arnold S, O’Hagan A. 2009. Kendalls Advanced Theory of Statistics, 3 Volume Set. John Wiley & Sons Swiderski DL, Sheets HD, Fink WL. 2004. Geometric morphometrics for biologists. Academic Press Swofford DL, Olsen GJ, Waddell PJ, Hillis DM. 1996. Chapter 11: phylogenetic inference. Mol. Syst. 2. Talavera G, Castresana J. 2007. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst. Biol. 56:564–577.

55 Tan G, Muffato M, Ledergerber C, Herrero J, Goldman N, Gil M, Dessimoz C. 2015. Current methods for automated filtering of multiple sequence alignments frequently worsen single-gene phylogenetic inference. Syst. Biol. 64:778–791. Tateno Y, Takezaki N, Nei M. 1994. Relative efficiencies of the maximum- likelihood, neighbor-joining, and maximum-parsimony methods when substitution rate varies with site. Mol. Biol. Evol. 11:261–277. Tavaré S. 1986. Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences. Am. Math. Soc. Lect. Math. Life Sci. 17:57–86. Thiele K. 1993. The Holy Grail of the Perfect Character: The Cladistic Treatment of Morphometric Data. 9:275–304. Thompson JD, Higgins DG, Gibson TJ. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680. Thompson JD, Koehl P, Ripp R, Poch O. 2005. BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark. Proteins Struct. Funct. Bioinforma. 61:127–136. Thompson JD, Linard B, Lecompte O, Poch O. 2011. A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods: Current Challenges and Future Perspectives.Badger J, editor. PLoS One 6:e18093. Thorne JL, Kishino H, Felsenstein J. 1991. An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol. 33:114–124. Thorne JL, Kishino H, Felsenstein J. 1992. Inching toward reality: An improved likelihood model of sequence evolution. J. Mol. Evol. 34:3–16. Ulitsky I, Burstein D, Tuller T, Chor B. 2006. The Average Common Substring Approach to Phylogenomic Reconstruction. J. Comput. Biol. 13:336–350. Vetterling WT. 2002. Numerical recipes example book (C++). Cambridge University Press Vinga S, Almeida J. 2003. Alignment-free sequence comparison - A review. Bioinformatics 19:513–523. Van Walle I, Lasters I, Wyns L. 2005. SABmark--a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 21:1267– 1268. Wang J, Keightley PD, Johnson T. 2006. MCALIGN2: faster, accurate global pairwise alignment of non-coding DNA sequences based on explicit models of indel evolution. BMC Bioinformatics 7:292. Wang L, Jiang T. 1994. On the Complexity of Multiple Sequence Alignment. J. Comput. Biol. 1:337–348. Whelan S, Blackburne BP, Spencer M. 2011. Phylogenetic Substitution Models for Detecting Heterotachy during Plastid Evolution. Mol. Biol. Evol. 28:449–458. Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18:691–699. Wiens JJ, Brower A. 2001. Character Analysis in Morphological Phylogenetics: Problems and Solutions.Brower A, editor. Syst. Biol. 50:689–699.

56 Wong KM, Suchard MA, Huelsenbeck JP. 2008. Alignment uncertainty and genomic analysis. Science (80-. ). 319:473–476. Wu J, Susko E. 2010. Rate-variation need not defeat phylogenetic inference through pairwise sequence comparisons. J. Theor. Biol. 263:587–589. Wu M, Chatterji S, Eisen JA. 2012. Accounting For Alignment Uncertainty in Phylogenomics.Salemi M, editor. PLoS One 7:e30288. Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39:306– 314. Yang Z. 2006. Computational molecular evolution. Oxford University Press Oxford Yang Z. 2014. Molecular evolution: a statistical approach. Oxford University Press Yang Z, Bielawski JP. 2000. Statistical methods for detecting molecular adaptation. Trends Ecol. Evol. 15:496–503. Yoon B-J. 2009. Hidden Markov Models and their Applications in Biological Sequence Analysis. Curr. Genomics 10:402–415. Zhang F, Gu W, Hurles ME, Lupski JR. 2009. Copy number variation in human health, disease, and evolution. Annu. Rev. Genomics Hum. Genet. 10:451– 481.

57 Acta Universitatis Upsaliensis Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1723 Editor: The Dean of the Faculty of Science and Technology

A doctoral dissertation from the Faculty of Science and Technology, Uppsala University, is usually a summary of a number of papers. A few copies of the complete dissertation are kept at major Swedish research libraries, while the summary alone is distributed internationally through the series Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology. (Prior to January, 2005, the series was published under the title “Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology”.)

ACTA UNIVERSITATIS UPSALIENSIS Distribution: publications.uu.se UPPSALA urn:nbn:se:uu:diva-360871 2018