
Distance Metrics in Variant Graphs

Martin Gjesdal Bjørndal
Master’s Thesis, Spring 2017

This master’s thesis is submitted under the master’s programme Modelling and Data Analysis, with programme option Statistics and Data Analysis, at the Department of Mathematics, University of Oslo. The scope of the thesis is 60 credits.

The front page depicts a section of the root system of the exceptional Lie group E8, projected into the plane. Lie groups were invented by the Norwegian mathematician Sophus Lie (1842–1899) to express symmetries in differential equations and today they play a central role in various parts of mathematics.

Author: Martin Gjesdal Bjørndal
Supervisors: Geir Olve Storvik, Lex Nederbragt

Abstract

The traditional, linear representation of the genetic information in populations causes a loss of data, as it cannot fully represent sites where the sequences differ. Conversely, a graph may incorporate the variation, thereby providing a means of utilizing more information in the analysis of genetic differences between populations. This thesis deals with the quantification of the differences between two population graphs. Three ways of measuring the distance between population graphs are presented. The first counts unique variants and compares them to the total number of variants. The second calculates the graph edit distance. The third specifies two probability models for the genotype distribution at each variant, and then calculates the Bayes factor at each location.

The three distance measures are tested on data describing variations in the human genome. Six populations of humans from distinct geographical areas are represented in the data. The distance measures seem to give similar conclusions about the relative distances among pairs of populations. In order to place the distances in a context, they are evaluated using permutation tests. The permutation tests report significant results for all pairs of populations except one. This is also the pair that is given the shortest distance by all distance measures.

In general, the methods and ideas presented in this thesis allow more genetic information to be included in the study of relationships between groups. As such, the approach has the potential to be more accurate than traditional designs.

Keywords: Bayes factor, graph edit distance, permutation test, variant graph.


Preface

The idea for this thesis originated from the observation that linear representations of DNA from more than one individual are associated with a loss of information. A genetic variation among the individuals of a population means that there is more than one alternative present in the genomic sequence. A linear representation will in this case select one of the alternatives instead of using all the data available. This idea paved the road to graphs, which provide an intuitive extension of the work done with strings in previous literature.

A demanding challenge has been the lack of previous work to support my own writings. The use of graphs for assembling DNA sequences is extensive. However, the representation of multiple sequences is still largely dependent on strings. Despite the challenges, the thesis did at no point cease to be interesting. Quite the contrary. As time has progressed and insight deepened, I find myself profoundly interested in a problem whose applications I hope to witness. Though my contribution to the field is modest, I hope that the ideas presented in this thesis may inspire others to continue the work on population graphs and how they relate to each other. I believe it to be a topic of great potential. The rationale behind this claim is simple: Why not use all the data?

I wish to express my sincere gratitude to my supervisor Geir Storvik for his continuous support and wisdom, and to my co-supervisor Lex Nederbragt for valuable comments and insight into bioinformatics. I would also like to thank my good friends Trygve Danielsen for proofreading and Synnøve Smedbye Botnen for sharing her expertise in molecular biology. Thanks to Tom Andersen for inspiration. Thanks to close family for proofreading. Thanks to friends at “Utfallsrommet” for both academic, and not quite so academic, discussions.

Last but not least, a very special thanks to Marte Rødsvik Kolloen for unconditional support, encouragement and highly appreciated technical discussions.

Martin Gjesdal Bjørndal Blindern, May 2017


Contents

1 Introduction
    1.1 Objectives of the Thesis
    1.2 Genetic Variation
    1.3 Graphs for Sequence Data
        1.3.1 Motivation for Graph-Representations
    1.4 Measuring Distance Between Graphs
    1.5 Related Work
    1.6 Outline of the Thesis

2 Data
    2.1 Genomics
        2.1.1 Alignment and String Edit Distance
        2.1.2 Assembly
        2.1.3 Mapping, Resequencing and Variant Detection
    2.2 Description of the Data
        2.2.1 The Variant Call Format
        2.2.2 Reference Sequence
    2.3 Variant Graphs

3 Methods
    3.1 Distance-Based Approaches
        3.1.1 Bubble Counter
        3.1.2 Graph Edit Distance
    3.2 A Model-Based Approach
        3.2.1 The Bayes Factor
    3.3 Implementation of the Distance Metrics
        3.3.1 Bubble Counter and Bayes Factor
        3.3.2 Graph Building
        3.3.3 Graph Alignment with GEDEVO
    3.4 Are the Measures really Metrics?
        3.4.1 Bubble Counter
        3.4.2 Graph Edit Distance
        3.4.3 Bayes Factor Measure
    3.5 The Permutation Test
        3.5.1 Monte Carlo Sampling
    3.6 Implementation of the Permutation Test
    3.7 Example

4 Results
    4.1 Distances Between Population Graphs
        4.1.1 Bubble Counter
        4.1.2 Graph Edit Distance
        4.1.3 Bayes Factor
    4.2 Visualization by Multidimensional Scaling
    4.3 Permutation Tests
        4.3.1 p-values

5 Discussion
    5.1 Distance Metrics
        5.1.1 Bubble Counter
        5.1.2 Graph Edit Distance
        5.1.3 Bayes Factor
        5.1.4 Other Distances
    5.2 GraphBuilder
    5.3 Other Graph Types
        5.3.1 Sequence Graph
        5.3.2 de Bruijn Graph
    5.4 Biological Features
        5.4.1 Out of Africa

6 Concluding Remarks
    6.1 Future Work

A Graph Theory and some Assembly Graphs
    A.1 Definitions in Graph Theory
    A.2 De Bruijn Graphs for Genome Assembly

B Calculations
    B.1 Conjugate Prior for the Multinomial Distribution
    B.2 Calculations for the Bayes Factor in 3.2.1
    B.3 Proof of Proposition 3.1

C Table from Example 3.3

D Notation

E Glossary

F Code
    F.1 GraphBuilder
    F.2 R-script for Calculating Distances
    F.3 R-code for Permutation Tests

Bibliography

Chapter 1

Introduction

DNA data is extensively utilized for discovering differences between organisms, but traditionally only linear representations of sequences have been exploited for this purpose. The idea of our DNA as linear, represented as a string of characters from the alphabet Σ = {A, C, G, T}, prevails (see e.g. Watson, T. Baker, et al. (2008), page 30). However, the question of how to effectively represent collections of DNA sequences has been, and remains, an active subject of discussion (Paten, Novak, and Haussler, 2014; Iqbal, Caccamo, et al., 2012; Church, Schneider, Graves, et al., 2011; Dilthey et al., 2015). This thesis seeks to improve upon the analysis of DNA sequence data by using graphs made up of collections of sequences, instead of the linear representation provided by strings. The aim of the graph-based approach is to include more data in the model, thereby providing more accurate measures of the relative distances between populations. In the following, the word “population” will refer to a group of breeding individuals of the same species (Hartl, 2000; Falconer, 1981).

Genetic data has become an extremely powerful tool in medicine and the biosciences. Facilitated by the collection of vast amounts of data, the possibilities offered by combining biology, mathematics, statistics and computer science are seemingly endless. For example, a milestone of modern science is the completion of the first draft of the human genome (Venter et al., 2001). Full genome sequencing is becoming increasingly common as software gets more efficient, sequencing technology more accurate, and hardware better.

In the mid-2000s, a new group of sequencing technologies emerged. These methods are referred to as High Throughput Sequencing (HTS) and produce massive amounts of data per unit of time compared to earlier designs. The introduction of HTS created a need for specialized data structures and software to handle the large amounts of data. This brought us various graph-based methods for de novo sequence assembly¹ and mapping² (Limasset and Peterlongo, 2015; Novak, Garrison, and Paten, 2016) (see Subsections 2.1.2 and 2.1.3 and Appendix A.2).

Graphs allow us to store variations in the genetic material by using multiple nodes to represent all the possible variations at a position. Hence, the acquisition of several genome sequences from the same population motivates the use of graphs to store the data. A graph made from DNA sequences of several individuals from one single population will in the following be referred to as a population graph. A key concept of this thesis is the comparison of population graphs. In order to do calculations on graphs, a Java program called GraphBuilder (see Subsection 3.3.2) was made. GraphBuilder constructs population graphs from the data introduced in Chapter 2.

An obstacle in the work with this thesis has been the limited amount of related work. The idea of combining many individuals into a population graph seems obvious, which makes the lack of relevant material peculiar. Moreover, the use of graphs for assembling DNA sequences from single individuals is extensive, but multiple (assembled) sequences are typically represented as strings. This apparent inconsistency led to the work on graph-representations for population data. As the thesis ventured into largely unknown space, a great deal of effort was spent exploring possibilities and obtaining and understanding the data. Initially, I tried to do a sequence assembly with the aim of directly analyzing the graphs made by assembly programs. This proved to be time-consuming and difficult, especially getting to know the software and handling the raw data. The raw data used by assembly programs are an enormous collection of short sequences produced by the HTS methods, and both the amount and the form of the data complicated the work. Various graph-based methods for assembly were investigated, especially de Bruijn graphs³ and operations on them. In addition, the distribution of reads across the assembly was analyzed, but this eventually turned out to be irrelevant for the thesis. After discovering the drawbacks of doing a complete assembly, focus turned towards finding fully assembled genomes to analyze. Data on variants from the human genome was proposed mainly due to its availability.

The main objective of this thesis is to use processed data on genetic variation from multiple individuals to make graph-representations of their DNA, and to analyze the differences between such objects.

¹ De novo assembly is the process of combining many small pieces of sequenced DNA into a longer sequence or complete genome.
² The aligning of reads to either an existing reference structure or a de novo assembly is called mapping.
³ A type of graph used in sequence assembly; de Bruijn graphs are defined and discussed in Subsection 5.3.2 and Appendix A.2.

1.1 Objectives of the Thesis

The objectives of this thesis have been to

• Develop distance metrics for measuring the distance between two populations.

• Employ the metrics on graph-representations of DNA data from real populations.

• Evaluate the distance metrics to determine if the distances found are small or large.

• Discuss the advantages and challenges of the graph-representation and the involved metrics.

1.2 Genetic Variation

The genetic information of all living organisms is conveyed by the nucleic acids DNA and RNA. DNA is the primary source of genetic information in a cell. RNA is synthesized using DNA as a template, before serving as a template itself in the synthesis of proteins. DNA consists of two strands and pairs of nitrogenous bases connecting the strands. The sequence of the bases adenine (A), cytosine (C), guanine (G) and thymine⁴ (T) specifies the amino acids that make up the proteins produced by the cell. The structural and chemical properties of the bases result in a highly specific pairing pattern. A and G are so-called purines, whilst C and T are pyrimidines. Because of the number of possible hydrogen bonds between the two classes of molecules, A will always pair up with T and C will pair up with G. Due to the specific pairing of the bases, DNA can replicate itself. In the process of DNA replication, the two strands are separated, before each individual strand acts as a template for synthesizing the missing partner strand (Watson, T. Baker, et al., 2008, page 209-210). Hence, the information in the DNA represents one of the fundamental criteria for life itself, namely inheritance.

The phenotypic variation in a population or species is partially the result of variation in the DNA sequences. Variation can arise in multiple ways, typically during replication of the genetic material in preparation for cell division. Changes in the genetic material include the following:

• Substitutions: one or several bases are substituted by one or several other bases.

• Insertions and deletions (indels): a sequence is added or removed.

• Inversions: a sequence is substituted with its reverse complement.

• Translocations: so-called transposable elements can move from one location to another with or without duplication (Watson, T. Baker, et al., 2008, chapter 11; Alberts et al., 2008, page 316-319). Transposable elements often leave behind sequence fragments when excised (Futuyma, 2009, page 193-194). The result of a translocation caused by a transposable element is thus either an insertion, a deletion, or both.

⁴ Replaced by uracil (U) in RNA.

While all of the above are sources of variation in a population, some are more difficult to handle in an alignment of two sequences. As will be further described in Subsection 2.1.2 and Appendix A.2, these kinds of problems motivate the use of graph-based methods for assembly.

1.3 Graphs for Sequence Data

Graphs provide an easy and intuitive way of handling variations in a sequence by creating alternative paths through regions where two or more variants occur. A graph is a set of points, called nodes, together with connections, called edges, each of which joins either two nodes or a node and itself. Edges are drawn as lines and nodes as circles. An example can be found in Figure 1.1.

Definition 1.1. A graph is a pair of sets G = (V,E). The elements of V are called vertices or nodes and the elements of E are called edges. The edges are connections between nodes, i.e. the edges are associated with a set containing one or two nodes.

[Figure not reproduced: three nodes n1, n2 and n3 connected by the edges e1, e2 and e3.]

Figure 1.1: A directed graph.

A graph may be directed. If so, every edge has a designated start node and end node, and a walk through the graph must follow the edges in their given direction. The graph in Figure 1.1 is directed. For further material on graph-theoretic concepts used in this thesis, see Subsection 3.1.2 and Appendix A.1.

Typically, the data are DNA sequences, but the theory presented can be extended to other kinds of sequences, e.g. RNA or amino acid sequences.
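The definition of a directed graph given above can be made concrete with a small sketch. The node and edge names follow the labels of Figure 1.1, but the direction assigned to each edge is an assumption made purely for illustration, and the helper functions `successors` and `is_walk` are hypothetical names.

```python
# Illustrative sketch: a directed graph G = (V, E) stored as two sets.
# The edge directions are assumed for this example; they are not taken
# from Figure 1.1.
V = {"n1", "n2", "n3"}
E = {
    ("n1", "n2"),  # e1: n1 -> n2 (assumed direction)
    ("n2", "n3"),  # e2: n2 -> n3 (assumed direction)
    ("n3", "n1"),  # e3: n3 -> n1 (assumed direction)
}


def successors(node, edges):
    """Return the nodes reachable from `node` by following one edge."""
    return {end for start, end in edges if start == node}


def is_walk(nodes, edges):
    """Check that consecutive nodes are joined by an edge in the right direction."""
    return all((a, b) in edges for a, b in zip(nodes, nodes[1:]))
```

With these assumed directions, `["n1", "n2", "n3"]` is a valid walk, whereas `["n2", "n1"]` is not, since it would traverse e1 against its direction.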

In addition to advantages in representing populations (i.e. several sequences from the same population), there are computational conveniences in using graphs for assembly (see Subsection 2.1.2). Most assembly programs utilize some kind of graph structure, e.g. “Velvet”, which is a group of algorithms that work on de Bruijn graphs (Zerbino and Birney, 2008). Another example is “ALLPATHS”, which uses a representation that is referred to as a “sequence graph” in Butler et al. (2008). Interestingly, even though graphs are popular data structures in assembly software, the end product is usually a linear sequence, typically acquired by traversing the graphs as in Zerbino and Birney (2008) and Butler et al. (2008). I will in the following try to explain the motivation for the use of graphs for representing sequence data.

1.3.1 Motivation for Graph-Representations

A reference sequence is a genetic sequence meant to represent a group, e.g. a population or a species. A linear reference sequence contains one base at each position, no matter how many sequenced individuals it represents. The single base at each position may e.g. be the “consensus” base at this index.

Definition 1.2. A consensus sequence is derived from several input sequences and consists, at each index, of the symbol that is observed most often at that index in the input sequences.

A consensus sequence is a natural choice for representing a population, since it displays the most common symbol in the sample at each position. However, there may also be other bases or alternative structures present in the same location. The information that these alternatives represent is simply lost if we choose to reduce all the input into one consensus sequence. A reference graph is potentially a vast improvement, as it can represent the sequence of the entire sample and capture the variation by expanding the sequence into bubbles in regions where variation is encountered (see Example 1.1).
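Definition 1.2 can be illustrated with a minimal sketch, applied to the three aligned sequences that appear in Example 1.1. The function name `consensus` and the tie-breaking rule (prefer a base to a gap, then take the alphabetically first symbol) are assumptions made for this illustration; the definition itself does not say how ties are resolved, which is why a consensus sequence need not be unique.

```python
from collections import Counter


def consensus(sequences):
    """Return one consensus string and the indices where the choice was tied.

    All sequences must be aligned, i.e. have equal length. Gap symbols
    ("-") are counted like any other symbol.
    """
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must have equal length")
    result, tied = [], []
    for i, column in enumerate(zip(*sequences)):
        counts = Counter(column).most_common()
        best = counts[0][1]
        candidates = sorted(sym for sym, n in counts if n == best)
        if len(candidates) > 1:
            tied.append(i)
        # Tie-break (an arbitrary choice for this sketch): prefer a base
        # to a gap, then take the alphabetically first symbol.
        bases = [s for s in candidates if s != "-"]
        result.append(bases[0] if bases else candidates[0])
    return "".join(result), tied


seqs = ["ACGAAGAGGCC", "ACGATGA--CC", "ACGATGAGCCC"]
cons, ties = consensus(seqs)
```

Here `ties` is `[8]` (zero-based), i.e. position 9, where G, C and the gap each occur once. With the tie-break chosen above, `cons` is `ACGATGAGCCC`; choosing G at that position instead gives the equally valid consensus `ACGATGAGGCC`.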

Definition 1.3. A bubble is formed by two separate paths that start at the same source node and end at the same sink node (Paten, Novak, Garrison, et al., 2017; Zerbino and Birney, 2008).

Definition 1.3 implies that the source node has more than one outgoing edge. Each path can then be followed until we reach a node with more than one ingoing edge. If the end node is the same for both paths, then this is the sink node and the two paths constitute a bubble.

Example 1.1. We have obtained the following three sequences from three separate individuals: ACGAAGAGGCC, ACGATGA--CC and ACGATGAGCCC, where the dashes are meant to represent a deletion.

seq 1      ACGAAGAGGCC
seq 2      ACGATGA--CC
seq 3      ACGATGAGCCC

consensus  ACGATGAGGCC

Figure 1.2: Three sequences and a non-unique consensus sequence.

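[Figure not reproduced: the graph starts in the node ACGA, followed by a bubble with the alternative nodes A and T, then the node GA. From GA, either a deletion edge leads directly to the final node CC, or the node G is followed by a second bubble with the alternative nodes G and C, both of which lead to CC.]

Figure 1.3: A graph-representation of the sequences in Figure 1.2.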

If we are to make a linear reference sequence, we may use the consensus sequence (see Figure 1.2). The consensus sequence does not display any of the variations. If we represent the three sequences from Figure 1.2 as a graph, it may look like the one in Figure 1.3. Note that all three sequences that were used as input in making this graph can be reconstructed by traversing it. ♣

Most bubbles in DNA graphs are what we call Single Nucleotide Polymorphisms (SNPs), i.e. a single position with two or more possible variants. Each of the variants is a single base. Bubbles that model SNPs with two different states will look like the first variation in Figure 1.3. Other bubbles may model a deletion, like the lower edge in the right part of the same figure, or they can be insertions or substitutions of longer sequences. More complex structures arise when there are multiple bubbles spanning the same regions. In Figure 1.3, this is the case in positions 8 and 9 of the sequences from Figure 1.2. Two bubbles are found here. The first models the deletion variant starting in position 8 and ending in position 10. The second bubble specifies two different paths through position 9.

The use of graphs instead of strings for representing DNA data from multiple individuals is more accurate, since the graph retains all the variation observed in the sample. In fact, it may contain all the genetic material of every individual. As seen in Example 1.1, we can reconstruct all input sequences by traversing the graph. However, there will typically exist more walks through the graph than there were input sequences available. In Example 1.1, there are six complete walks that give rise to equally many sequences. The sequences generated by doing all possible walks in this graph are reported in Table 1.1. But only three of the sequences in this table were actually used to make the graph. This raises a dilemma: The graph is less specific than the three single sequences because it displays all the variations without making restrictions on how to combine them. This may cause us to include a lot of combinations in our analysis that are not really present in the population. On the other hand, it may be advantageous, as it shows potential configurations of other individuals in the population that are not present in the sample.

Table 1.1: All walks through the graph in Figure 1.3. The input sequences are labelled with an asterisk (∗).

Path no.   Sequence
1          ACGAAGAGCCC
2 (∗)      ACGAAGAGGCC
3          ACGAAGA--CC
4 (∗)      ACGATGAGCCC
5          ACGATGAGGCC
6 (∗)      ACGATGA--CC
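The enumeration behind Table 1.1 can be sketched as a depth-first traversal of the graph in Figure 1.3. The node identifiers (`s0`, `a`, `t`, and so on) and the adjacency-list encoding are hypothetical choices for this sketch and are not taken from GraphBuilder. The walks are spelled without gap characters, so the two walks that use the deletion edge yield sequences that are two bases shorter.

```python
# Each node carries a sequence label; edges point from left to right.
# Node identifiers are hypothetical names chosen for this sketch.
labels = {
    "s0": "ACGA",
    "a": "A", "t": "T",      # first bubble (SNP at position 5)
    "s1": "GA",
    "g": "G",                # position 8
    "g2": "G", "c2": "C",    # second bubble (SNP at position 9)
    "s2": "CC",
}
edges = {
    "s0": ["a", "t"],
    "a": ["s1"], "t": ["s1"],
    "s1": ["g", "s2"],       # s1 -> s2 is the deletion edge
    "g": ["g2", "c2"],
    "g2": ["s2"], "c2": ["s2"],
    "s2": [],
}


def walks(node, edges):
    """Yield every path from `node` to a node without outgoing edges."""
    if not edges[node]:
        yield [node]
        return
    for nxt in edges[node]:
        for rest in walks(nxt, edges):
            yield [node] + rest


def spell(path, labels):
    """Concatenate the labels of the nodes along a path."""
    return "".join(labels[n] for n in path)


sequences = sorted(spell(p, labels) for p in walks("s0", edges))
```

The traversal finds two choices in the first bubble and three ways through the second region, giving the six walks of Table 1.1, of which only three correspond to input sequences.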

1.4 Measuring Distance Between Graphs

As mentioned earlier, a key concept in this thesis is to quantify the difference between two graphs similar to the one in Figure 1.3. There are a number of ways we can approach this problem. Three ideas are further investigated in Chapter 3.

1. We may look at which bubbles are present in which graphs. If a bubble is unique to a population graph, it means that the variant is not present in the other graph. This is clearly a difference between the two, indicating that the number of unique bubbles relative to the number of non-unique bubbles can be used to say something about the distance between them.

2. A graph can be transformed into another by applying a set of changes, e.g. the substitution or deletion of a node or an edge. If a transformation requires few operations, then the two graphs are more similar than if the transformation requires many operations.

3. A different approach to the task of quantifying the difference between two graphs is model-based. We may make assumptions about how a bubble in one graph relates to a corresponding bubble in another graph. Then a hypothesis test can be used to test the difference between the two corresponding bubbles. A measure of the local, pairwise difference of the bubbles can then be aggregated across the entire graph.
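The first idea in the list above can be sketched by treating each population graph as a set of bubbles. The normalisation used here (bubbles found in exactly one graph divided by all bubbles found in either graph) is an assumption made for illustration only; the measure actually used in the thesis is defined in Subsection 3.1.1. The encoding of a bubble as a (position, reference, alternative) tuple and the example values are likewise hypothetical.

```python
def unique_bubble_fraction(bubbles_a, bubbles_b):
    """Fraction of bubbles that occur in exactly one of the two graphs."""
    a, b = set(bubbles_a), set(bubbles_b)
    total = a | b
    if not total:
        return 0.0
    return len(a ^ b) / len(total)


# Hypothetical bubbles, encoded as (position, reference, alternative).
pop_a = {(5, "A", "T"), (8, "AGG", "A"), (9, "G", "C")}
pop_b = {(5, "A", "T"), (9, "G", "C"), (11, "C", "G")}
d = unique_bubble_fraction(pop_a, pop_b)
```

In this toy example two of the four distinct bubbles are unique to one graph, so `d` is 0.5, and two identical graphs give a value of 0.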

In order to correctly interpret a distance measured by one of the aforementioned strategies, we need to place it in a context. This can be done by mixing the individuals of the two populations in question and drawing individuals at random to generate two new, permuted populations. We may then calculate the distance between the newly sampled populations. If we repeat this process, we will obtain a set of distances that can be compared to the distance between the observed populations. This process is called a permutation test, and will be used to measure the significance of the observed distances.
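The procedure described above can be sketched as a Monte Carlo permutation test. This is a minimal illustration under stated assumptions, not the implementation of Section 3.6: the permuted populations keep the original group sizes, the p-value counts the observed labelling as one of the permutations, and `variant_set_distance` is a hypothetical toy distance in which each individual is a set of variant identifiers.

```python
import random


def permutation_test(pop_a, pop_b, distance, n_perm=999, seed=0):
    """Monte Carlo permutation test for a distance between two groups.

    `distance` is any function taking two lists of individuals and
    returning a number. Returns the observed distance and a p-value.
    """
    rng = random.Random(seed)
    observed = distance(pop_a, pop_b)
    pooled = list(pop_a) + list(pop_b)
    n_a = len(pop_a)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # mix the individuals of both populations
        if distance(pooled[:n_a], pooled[n_a:]) >= observed:
            at_least_as_large += 1
    # The observed labelling is counted as one of the permutations.
    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return observed, p_value


def variant_set_distance(group_a, group_b):
    """Toy distance: share of variants seen in only one of the groups."""
    a = set().union(*group_a)
    b = set().union(*group_b)
    union = a | b
    return len(a ^ b) / len(union) if union else 0.0


# Hypothetical individuals, each given as a set of variant identifiers.
pop_a = [{1, 2}, {1, 3}, {2, 3}]
pop_b = [{7, 8}, {7, 9}, {8, 9}]
observed, p_value = permutation_test(pop_a, pop_b, variant_set_distance)
```

A small p-value indicates that few random relabellings produce a distance as large as the observed one. Any of the distance measures discussed in this thesis could in principle be passed in place of the toy distance.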

1.5 Related Work

There are numerous publications on the use of graphs for genome alignment and assembly. Limasset and Peterlongo (2015) showed that the mapping of reads to a de Bruijn graph is NP-complete. Further, they provided a procedure for doing this mapping and showed that more reads are mapped to the graph than to an existing assembly. Iqbal, Caccamo, et al. (2012) made a software suite called Cortex for de novo genome assembly and variant detection. Cortex uses colored de Bruijn graphs⁵ to represent information from multiple individuals. Lee, Grasso, and Sharlow (2002) viewed multiple sequence alignment (MSA) as a graph and aligned this graph to new sequences using dynamic programming. This way, they never had to reduce the alignment to a linear sequence and thus did not lose the variation in the alignment. For reviews on the matter, the reader is referred to Kehr et al. (2014), Simpson and Pop (2015) and Compeau, Pevzner, and Tesler (2011).

Literature on the analysis of population graphs such as the ones I will focus on in this thesis is sparse, but the reader is referred to Paten, Novak, Eizenga, et al. (2017) for relatively novel work on population reference graphs. A population-based human reference graph was proposed by Paten, Novak, and Haussler (2014). They further emphasized the advantages of using a graph structure for aligning new sequences against this reference graph. They did not, however, extend it to the general case of comparing two population graphs. Church, Schneider, Steinberg, et al. (2015) discussed advantages of a human reference graph instead of a reference sequence. Dilthey et al. (2015) used a reference graph for the human genome and a hidden Markov model to add new sequences as paths through the reference graph.

This thesis implements distance metrics on population graphs and evaluates the distances between them using a permutation test. Sandve, Gundersen, Rydbeck, et al. (2010) developed the Genomic Hyperbrowser, which identifies significance levels for a hypothesis test of difference between genomic tracks either exactly or by Monte Carlo testing. In this context, a “track” is a file containing the positions of a particular feature in a genome, e.g. SNPs or genes. The methodology introduced in that paper is expanded in Sandve, Gundersen, Johansen, et al. (2013).

To the best of my knowledge, there are no publications that use permutation tests on distances between population graphs. For an introduction to graph edit distance and the connection to the maximal common subgraph, see Gao et al. (2010), Bunke and Riesen (2009), Bunke (1997) and Brun, Gaüzère, and Fourey (2012).

⁵ A colored de Bruijn graph is a de Bruijn graph where the nodes are colored according to the sample they originate from.

1.6 Outline of the Thesis

It is my goal that this thesis can be read without any formal prerequisites in biology. I have tried to incorporate the results that are necessary for the reader to understand the text, but some background in statistics, mathematics and computer science is advantageous. A glossary with biological terms is provided in Appendix E.

Chapter 2 gives some details on DNA sequencing and assembly, in addition to a description of the data. In Subsection 2.1.2, I try to explain how the data on variants are obtained from the assemblies. Section 2.2 is a more general description of the data that are analyzed in the later chapters. This part also includes some details on the file format.

Chapter 3 contains theory and implementational details. Three distance metrics are presented in this chapter. While studying Subsection 3.1.2, the reader can go to Appendix A.1 for basic definitions. Section 3.3 explains the implementation of computer programs to calculate the distances. Section 3.5 presents the permutation test, while Section 3.6 discusses implementational details of the test and some computational challenges. The final section in this chapter contains a large example made to familiarize the reader with the computations involved in finding the distances by each of the three distance metrics.

Chapter 4 reports the distances calculated by the metrics from Sections 3.1 and 3.2, using the variant data explained in Chapter 2. In addition, this chapter displays the results of the permutation tests from Section 3.5. Section 4.2 contains a visualization of the distances by multidimensional scaling.

In Chapter 5, I discuss challenges and possible extensions of the work done in this thesis. Section 5.1 deals with the distance metrics and Section 5.2 gives a discussion on GraphBuilder. Moreover, Section 5.3 introduces the sequence graph and the de Bruijn graph, while Section 5.4 handles some biological features. Finally, Chapter 6 offers a few concluding remarks on the results and ideas for future work.

Some definitions in graph theory are given in Appendix A.1. The de Bruijn graph is a popular type of graph for doing sequence assembly. Its application to assembly is discussed in Appendix A.2. Appendix B contains calculations that are left out of the main text. A table used in Example 3.3 can be found in Appendix C. Appendix D presents a list of the notation used, while Appendix E contains the aforementioned glossary. In Appendix F, Java code for GraphBuilder and R-scripts for calculating distances and performing the permutation test are given. All other R-code can be found in the GitHub project “Variant-graph-analysis” at https://github.com/martinbjoerndal/Variant-graph-analysis. The programming languages used in this thesis are R (R Core Team, 2016) and Java; they will henceforth not be referenced. All data is downloaded from the website of the 1000 Genomes Project (http://www.internationalgenome.org/data-portal/population).

Chapter 2

Data

The data used in this thesis is based on genome sequences from several individuals sampled from six human populations: two from Northern Europe, two from sub-Saharan Africa and two from China. For the analysis, we will not use the sequences themselves, but rather a collection of sites where they differ. In order to adequately explain the features of the data, we must also take a look at some key concepts of molecular biology and explain how to translate cell samples into genomic sequences. In particular, we will concentrate on DNA sequencing, alignment and assembly. Then, we will take a look at how to find the variable sites we are using for data analysis in the following chapters. Finally, a description of the data is given.

2.1 Genomics

Ever since the discovery of the double helix in 1953 (Watson and Crick, 1953), the methods for analyzing sequence data have developed rapidly. The process of finding the sequence of the nucleotides in a DNA molecule is called DNA sequencing. The first automated technique that solved this problem was developed by Frederick Sanger and colleagues in the mid-1970s (Sanger and Coulson, 1975; Sanger, Nicklen, and Coulson, 1977), and is now called Sanger sequencing. For several decades, this was the leading method for DNA sequencing. It was not superseded until 2001, when the privately funded Celera corporation published the results of its so-called shotgun approach to whole genome sequencing (Venter et al., 2001). This is now the dominant method applied in modern sequencing technologies, collectively referred to as High Throughput Sequencing (HTS) or Next Generation Sequencing (NGS). Some examples of HTS technologies include “pyrosequencing” by 454 Life Sciences¹, “sequencing by synthesis” by Illumina and “sequencing by ligation” (SOLiD) by Life Technologies.

¹ 454 Life Sciences was acquired by Roche in 2007. In 2013 it was closed down.


Most HTS methods rely on the random breaking of genomic sequences into smaller fragments, which are then clonally amplified (Metzker, 2010). Then the fragments are sequenced to produce short sequences called reads. The reads are most often relatively short segments, typically a few hundred base pairs long. In recent years, another class of sequencing techniques has emerged, which manages to produce far longer reads. Examples of such methods include single-molecule real-time (SMRT) sequencing by Pacific Biosciences and the MinION sequencer by Oxford Nanopore.

Assembly software detects overlaps between the reads and uses the overlapping regions as starting points to extend the sequences by aligning multiple reads. A human genome is roughly 3.2 billion base pairs long (Schneider et al., 2017), so the task of aligning the reads is a computationally demanding one. Though there are many HTS platforms, they all share the same basic underlying strategy, namely the parallel sequencing of a huge number of reads (Behjati and Tarpey, 2013).

2.1.1 Alignment and String Edit Distance

Sequence alignment is a comparison of two or more sequences by looking for similar symbols that appear in the same order in both or all sequences (Mount, 2004). A basic aligner generates a similarity score based on three operations: match, mismatch and gap. A score can be assigned to any pair containing one symbol from each of two strings. If the pair represents a match, we will typically reward the alignment with a positive score. Correspondingly, a negative score can be assigned upon mismatch. In addition, one may wish to insert gaps into one sequence to match the next positions of the first sequence with something further into the second. Gaps are typically scored negative. Consider the alignment of two strings, S1 and S2. In evolutionary terms, a gap in S1 is the result of either an insertion into S2 or a deletion in S1.

The idea is that similarities between the sequences increase the score, so that one may be able to find the sequences that are most closely related to each other. These would be the ones with the highest scores, and thus hopefully the ones with the fewest differences, if any. To accomplish this, one needs a scoring regime that represents the way nature works as well as possible. This is a challenging task, but we can certainly do better than guessing. Aided by empirical knowledge about mutation events and the chemical properties of DNA, we may e.g. weight the events such that the penalty for substituting an adenine (A) with a guanine (G) is smaller than the penalty for substituting an A with a cytosine (C) or a thymine (T).

The similarity of two symbols can be specified by a scoring matrix δ (see Figure 2.1), such that δ(x, y) is the score obtained when aligning the symbol x from S1 with y from S2. In this thesis, we have that x, y ∈ Σ ∪ {_}, where _ is a gap, and Σ is the alphabet consisting of the letters A, C, G and T. With this formulation, the problem of finding the best alignment can be phrased as

       A    C    G    T    _
  A    3
  C   -4    3
  G   -4   -4    3
  T   -4   -4   -4    3
  _   -3   -3   -3   -3

Figure 2.1: A scoring matrix for the alphabet Σ ∪ {_}.

AGAGGTGTGTCTACT
AAAGGT----CTACT

Figure 2.2: An alignment of two DNA sequences.

finding the sequence alignment A(S1,S2) that maximizes the expression

\[
\mathrm{score}(A(S_1,S_2)) = \sum_{(x,y) \in A} \delta(x,y) \tag{2.1}
\]

over all possible alignments (Sung, 2009, page 31).

Example 2.1. Consider the alignment in Figure 2.2. Assume a (rather simple) scoring function that assigns 3 points for a match, -4 for a mismatch and -3 for a gap. Note that the negative score of the gap increases linearly in the number of positions it spans. This is called a linear gap penalty². The alignment in Figure 2.2 will have the score

3 − 4 + 3 + 3 + 3 + 3 − 3 − 3 − 3 − 3 + 3 + 3 + 3 + 3 + 3 = 14.
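The sum in Example 2.1 can be reproduced with a few lines of code. The following is a minimal sketch, assuming the alignment is given as two rows of equal length with "-" as the gap symbol; the function name `alignment_score` is chosen for illustration and is not taken from the thesis software.

```python
def alignment_score(row1, row2, match=3, mismatch=-4, gap=-3):
    """Sum the column scores of an alignment given as two equal-length rows.

    The default scores are those of Example 2.1 (linear gap penalty).
    """
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += gap        # a gap in either row
        elif x == y:
            total += match      # identical symbols
        else:
            total += mismatch   # substitution
    return total


# The alignment of Figure 2.2.
print(alignment_score("AGAGGTGTGTCTACT", "AAAGGT----CTACT"))  # 14
```

Each column contributes exactly one term δ(x, y), so the function is a direct evaluation of expression (2.1) for one fixed alignment.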

Alignment is closely related to the classic problem of string edit distance, or Levenshtein distance.

Definition 2.1. The string edit distance between two strings S1 and S2 is the minimal number of edit operations required to transform S1 to S2. The edit operations are character substitution, deletion and insertion.

² The linear gap penalty is somewhat unrealistic, but is used in this example for its simplicity. A better model for scoring gaps would be the affine gap penalty. To open a gap, the affine gap penalty assigns a negative score ξO. To extend the gap, the model assigns a different negative score ξE of smaller magnitude, |ξE| < |ξO|. So the algorithm severely penalizes the opening of a gap, but cares less about the length of it.

Example 2.2. The string edit distance between the words CGCGAA and GGCAA is 2. The edit operations are

• Substitution of C with G at position 1.

• Deletion of G at position 4. ♣
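The distance in Example 2.2 can be checked with the standard dynamic-programming recurrence for unit-cost edit distance. This is a minimal sketch under that assumption (every substitution, insertion and deletion costs 1); the function name is illustrative.

```python
def edit_distance(s1, s2):
    """Unit-cost string edit distance between s1 and s2 (Definition 2.1)."""
    # prev[j] holds the distance between the current prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]  # transforming s1[:i] into the empty string takes i deletions
        for j, b in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s1
                curr[j - 1] + 1,         # insert b into s1
                prev[j - 1] + (a != b),  # substitute a with b, or keep a match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("CGCGAA", "GGCAA"))  # 2
```

Only two rows of the table are kept in memory, which is enough when the distance alone is needed and the edit operations themselves are not traced back.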

We can associate a cost to all edit operations by introducing a score matrix σ. The problem of aligning two sequences is, with some settings on the cost matrix, equivalent to the string edit distance between them, as the following lemma explains.

Lemma 2.1. Let x be a symbol in S1 and y a symbol at the corresponding position in S2. Let the score matrix for the edit distance (σ) and the alignment (δ) be such that for any operation (x, y), we have that σ(x, y) = −δ(x, y). Then the two problems are equivalent. See page 31-32 in Sung (2009) for proof.

Example 2.3. If we use the scoring matrix σ(x, y) = −δ(x, y), then the string edit distance becomes

−4 + 3 + 3 − 3 + 3 + 3 = 5.

The optimal alignment of the two words in Example 2.2 is

CGCGAA
GGC-AA

and we see that this has the same score as the string edit distance. ♣
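To see that no other alignment of the two words scores higher than 5, one can maximize expression (2.1) over all alignments with dynamic programming. The sketch below assumes the simple scores of Example 2.1 (3 for a match, -4 for a mismatch, -3 per gap position); it is the textbook global alignment recurrence, not necessarily the aligner used elsewhere in the thesis.

```python
def global_alignment_score(s1, s2, match=3, mismatch=-4, gap=-3):
    """Highest score over all global alignments of s1 and s2."""
    # prev[j] is the best score for aligning the current prefix of s1 with s2[:j].
    prev = [j * gap for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, start=1):
        curr = [i * gap]  # s1[:i] aligned against gaps only
        for j, b in enumerate(s2, start=1):
            diagonal = prev[j - 1] + (match if a == b else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("CGCGAA", "GGCAA"))  # 5
```

In the setting of Lemma 2.1, where σ(x, y) = −δ(x, y), minimizing the total edit cost and maximizing the alignment score select the same alignment, so the same recurrence serves both problems with `max` replaced by `min` and the signs flipped.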

2.1.2 Assembly

De novo genome assembly handles one of the fundamental themes in biology, namely the construction of the full DNA sequence of the studied organism (Simpson and Pop, 2015; M. Baker, 2012; Narzisi and Mishra, 2011). Recall from the introduction of Section 2.1 that the DNA is fragmented into short, unordered segments, amplified, and sequenced to produce reads. Assemblers are computer programs that take the reads as input. By aligning pairs of reads, we can construct longer, continuous sequences called contigs. However, this requires the assembler to do a lot of alignments. Aligning short reads efficiently is a big challenge, and in recent years there has been much research on the matter (Langmead, Trapnell, et al., 2009; Langmead and Salzberg, 2012; Bradley et al., 2009; H. Li and Durbin, 2009; C.-M. Liu et al., 2012).
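The idea of extending a contig through overlapping reads can be illustrated with a toy example. The sketch below makes strong simplifying assumptions that real assemblers do not: overlaps are exact (no sequencing errors), the reads are already given in the right order, and the read strings are invented for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len symbols exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def extend_contig(reads, min_len=3):
    """Greedily extend a contig with each read in turn until no overlap is found."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap(contig, read, min_len)
        if k == 0:
            break  # the contig cannot be extended further
        contig += read[k:]
    return contig


reads = ["ACGTAC", "TACGGA", "GGATTC"]  # hypothetical reads
print(extend_contig(reads))  # ACGTACGGATTC
```

The `min_len` threshold is a design choice: very short overlaps occur by chance and would join reads that do not belong together.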


Figure 2.3: Mapping reads to an assembly or a reference sequence. The assembly or reference is the lower, thick line. The reads are the short, light blue lines. The number of reads covering a position is called the coverage at this location.

If the contig cannot be extended further, the assembly will stop. This is either due to a lack of reads covering an area, or a repeating sequence that cannot be resolved. Identical sequences that occur multiple times are called repeats, and assembly is sensitive to repeats. If the repeat is longer than the reads, a read from within it may be impossible to map uniquely to the assembly. The reason for this is that it aligns equally well to many locations within the repeat (Treangen and Salzberg, 2012). This situation can be handled by specialized reads called mate pairs³. A mate pair is a fragment of known length where both ends are sequenced, but not the part in between (Fullwood et al., 2009; Geng et al., 2012). The unknown, middle portion of mate pairs may be from a few hundred to several thousand base pairs long. If a repeat is paired with another sequence that maps uniquely, the repeat will also map uniquely because its location relative to the other sequence is known (Fullwood et al., 2009).

The result of the assembly is a set of contigs which need to be combined to arrive at the full sequence. The process of joining the contigs is aided by mate pairs. The collection of several joined, ordered and oriented contigs is called a scaffold (M. Baker, 2012), and most assemblers output a set of scaffolds as the final result (He et al., 2013). With large amounts of short reads, assembly will be both computationally intensive and hard on memory, so there is a need for specialized data structures and algorithms, e.g. graph structures along with path-finding algorithms.

³ Many technologies for preparation of mate-pair information exist, with names such as mate pairs, fragment ends, paired ends or jumping libraries (Simpson and Pop, 2015). In the following, we will refer to mate pairs regardless of the technology used to obtain the reads.

Graph-based methods for assembly build a graph model from the reads (see Appendix A.2).

2.1.3 Mapping, Resequencing and Variant Detection

In many cases, the main focus is not on assembling a new sequence or entire genome, but rather on finding variations in a population or a species, or on sequencing more individuals to gain more robust data. In this case there may already be a previously assembled sequence available, so that the newly sequenced reads can be mapped to this existing structure.

Definition 2.2. A reference sequence is a genetic sequence meant to represent a larger group, e.g. a population or a species.

Mapping to a reference structure is less computer intensive and requires less memory than de novo sequence assembly. The mapping of reads to a reference structure is called resequencing. When studying multiple individuals, we may be interested in the variation among sequences. Resequencing is the standard method for detecting variation (H. Li, Ruan, and Durbin, 2008; H. Li and Durbin, 2009; Lunter and Goodson, 2011). For both mapping to a reference (resequencing) and assembly to be done with adequate accuracy, the sequence data need to have sufficiently high coverage.

Definition 2.3. The number of reads that maps to a given location in the sequence is called the coverage at this position. The coverage at location l in Figure 2.3 is 9.
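Definition 2.3 translates directly into a count over mapped reads. The following is a minimal sketch, assuming each mapped read is represented by the inclusive start and end positions it spans on the sequence; the intervals are invented for illustration and are not the reads of Figure 2.3.

```python
def coverage(mapped_reads, position):
    """Number of reads whose mapped interval contains the given position.

    mapped_reads is a list of (start, end) pairs with inclusive coordinates.
    """
    return sum(start <= position <= end for start, end in mapped_reads)


mapped_reads = [(1, 10), (5, 14), (8, 20), (15, 25)]  # hypothetical mappings
print(coverage(mapped_reads, 9))   # 3
print(coverage(mapped_reads, 15))  # 2
```

For whole-genome data one would not scan all reads per position; the sketch only makes the definition concrete.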

Coverage can be considered as a rough measure of certainty of the base at a given position. Now consider a DNA sequence contained in some imaginary experimental data, but not in the reference (Y. Liu et al., 2014). Given a set of sequence data with high enough coverage, we expect to be able to reconstruct the underlying sequence by finding overlapping regions in the reads. If a read does not align adequately to any region, it is discarded when mapping to a linear reference sequence. This means that the assembly will miss the sequence since there is nowhere to map it to. Conversely, a graph-based approach can simply incorporate it into the reference graph if it is not already there (Limasset and Peterlongo, 2015). For more on assembly graphs, see Section A.2.

A variant arises if, while performing resequencing, reads map to a certain region despite some differences between the reads and the reference. If so, there are several possible subsequences at this location, one represented by the reference and one or more represented by the reads. In a graph representation, a variant will be modeled as a bubble (see Definition 1.3). The mapping (and thereby the variant) may also be given a quality score, i.e. a measure of certainty that the read actually belongs where it is aligned (H. Li, Ruan, and Durbin, 2008). A quality score or filter may e.g. be based on coverage data (Iqbal, Turner, and McVean, 2013, supplementary material). The graph type used in this thesis is essentially just a collection of variants. The variants can be easily presented in a table. The next section describes the appearance and content of such a table.

2.2 Description of the Data

The data used in this thesis is a list of variants from humans. Variant data are preprocessed genetic data, and may be obtained by the methods of Subsection 2.1.3. The data are given as a variant call format (VCF) file, available from the website of the 1000 genomes project⁴. The aim of the 1000 genomes project was to characterize the genetic variation in the human genome (1000 Genomes Project Consortium and others, 2010; 1000 Genomes Project Consortium and others, 2015). Phase 1 consists of 1024 individuals, and phase 3 consists of 2504 individuals (1000 Genomes Project Consortium and others, 2012; Sudmant et al., 2015). The variants chosen for this thesis are located in chromosome 22 of a few populations from phase 1. Chromosome 22 was chosen due to its small size relative to the other chromosomes. In order to reduce computation time, only the first 5 000 lines of the VCF file are used, i.e. we are looking at the first 5 000 possible variants. The complete data set on chromosome 22 is almost 500 000 lines long. This means that we are only taking into account approximately 4% of chromosome 22, which, in the grand scheme of things, translates to a minuscule part of the whole genome. Rather drastic reduction of the data has been critical for analyzing it, due to heavy computations (see Section 3.6).

The VCF file contains descriptions of the variants in the population. The descriptions directly specify the bubbles in the graph by listing the sequences of the variants, in addition to their location along a reference sequence. Large structural variations are harder to model in the VCF format, but this may be done using the techniques in Danecek et al. (2011, section 5.4). In the data used in this thesis, any large structural variants are excluded from the original VCF file. Thus, the large structural variants are not part of the analysed data.

Six populations of 20 individuals each are analyzed: British in England and Scotland (GBR), Finnish in Finland (FIN), Luhya in Webuye, Kenya (LWK), Yoruba in Ibadan, Nigeria (YRI), Han Chinese in Beijing, China (CHB) and Han Chinese South, China (CHS). The variants are used regardless of whether they exist in the populations in question or not. This means that the existence of a variant in the file is not a guarantee that it is present in any particular population, as every sequenced individual was originally contained in the file. So the fact that a variant is listed only means that it occurs in at least one of the individuals initially included. The number of individuals is reduced from several hundred to 120 (20 in each of six populations). This probably means that some of the lines describe variants that are not seen in the sample we have worked with, because such a small portion of the total number of individuals is left after preprocessing the data.

The data used is easily available and can be downloaded by anyone. However, there are some complicating factors in using them:

• The amount of data is colossal. Chromosome 22 is the second smallest chromosome in the human cell, making up 1.6-1.8% of the genome (Morton, 1991). Still, it spans about 51 million base pairs⁵. The VCF file downloaded from the website of the 1000 genomes project is a text file of 494 358 lines and takes up approximately 15GB of space. This makes memory handling and computing power an issue. Even after substantial reduction of the original data, rather heavy computation is required to find the distances between two populations.

• The genomes of two humans are very alike. The number of variations is tiny compared to the total length of the sequences. The set of variants is the only thing accounted for by the VCF file, and only by incorporating the reference sequence can we obtain the entire genomes. The collection of all variants in all individuals combined with the reference will display the full sequences of all individuals included.

A graph built exclusively from the VCF file can be seen in Figure 3.7, and will be more closely examined in the next chapter.

⁴ http://www.internationalgenome.org/data

⁵ http://www.ensembl.org/Homo_sapiens/Location/Chromosome?chr=22;r=22:1-50818468

2.2.1 The Variant Call Format

The variant call format was developed for the 1000 genomes project, mainly to represent genomic variation (Danecek et al., 2011). A VCF file always begins with a variable number of lines containing metadata (starting with “##”), followed by a single header line (starting with “#”) containing column names. The column names always include:

• CHROM: The chromosome on which the variant occurs.

• POS: The position along the reference sequence where the variant occurs.

• ID: A unique id.

• REF: The base(s) contained in the reference sequence at this position.

• ALT: The alternative base(s). This is a list of possible alternatives: at a particular location, a population could theoretically show as many variations as there are individuals.

• QUAL: A quality score for the alternative.

• FILTER: If the variant has passed all filters, this field will read “PASS”. Otherwise, it will show the name(s) of any failed filters. The filters are specified in the metainformation lines and may be e.g. a quality threshold. All entries in the data file used in this thesis have passed all filters.

• INFO: Additional information.
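The layout described above (metadata lines, one header line, then one line per variant) can be read with a few lines of code. The sketch below assumes tab-separated fields, as in the VCF specification; the example lines are a simplified, illustrative excerpt in which the FORMAT column contains only the genotype (GT), and the function name `parse_vcf` is hypothetical.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf(lines):
    """Split VCF text lines into metadata lines, column names and variant records."""
    meta, header, records = [], None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):      # metadata line
            meta.append(line)
        elif line.startswith("#"):     # the single header line
            header = line[1:].split("\t")
        else:                          # one variant per line
            if header is None:
                raise ValueError("variant line found before the header line")
            record = dict(zip(header, line.split("\t")))
            record["POS"] = int(record["POS"])
            records.append(record)
    return meta, header, records


# Simplified example: real files carry more metadata and more sample fields.
example = [
    "##fileformat=VCFv4.1",
    "#" + "\t".join(FIXED_COLUMNS + ["FORMAT", "HG00096", "HG00097"]),
    "\t".join(["22", "16050408", "rs149201999", "T", "C", "100", "PASS", ".",
               "GT", "0|0", "0|1"]),
]
meta, header, records = parse_vcf(example)
print(records[0]["POS"], records[0]["REF"], records[0]["ALT"])
```

Reading the file line by line, rather than loading it whole, matters for files of the size discussed in Section 2.2.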

If there are more than two alternatives at a position, the various alternatives are given in separate lines⁶ that have the same reference type as the other variants, but another alternative type. An example of this is given in Table 2.1 (this is a stripped-down VCF-file where everything not needed for this example is removed), and a graph representing this region is shown in Figure 2.4.

Table 2.1: A stripped-down VCF-file where two bubbles are in the same location. Reference types are colored blue.

Variant no.  POS       ID           REF  ALT
1            16423001  rs183782265  G    A
2            16423023  rs188319963  G    A
3            16423023  rs71254085   G    GTA
4            16424559  rs193050994  G    A

In addition to the eight mandatory columns, all individuals are given a separate column containing the genotype of the individual, among other things. The ploidy is the number of copies of the complete genome in each cell. Humans are what we call diploid organisms, i.e. we have two copies. Thus, a bubble has four possible configurations in an individual because each copy has two possible paths through the bubble. We will call the set of possible paths through the bubbles the genotypes. The genotype of each copy is separated by a vertical line in the VCF-file. The ploidy is important for the interpretation of the data since it means that one single individual can have either two copies of the reference type, one of each, or two copies of the alternative type. In the following, this will be referred to as homozygotic for the reference, heterozygotic and homozygotic for the alternative, respectively. Genotype information about the individuals is given as 0 for the reference type and 1 for the alternative type⁷. There are more values in this field, e.g. coverage and a quality score. In this thesis I will focus on the genotype information.

An example of the file format can be found in Table 2.3, where the column named “INFO” has been stripped to dots, because each cell contained such a long string that it did not fit inside the page. The columns to the right of “FORMAT” are individuals. Each individual is given a separate column, named with the original identifier from the 1000 genomes project, e.g. “HG00096”. Consider the first line of Table 2.3. This is a simple substitution, where the base in the reference sequence is a T, whereas the alternative is a C. Now inspect the columns named “HG00096” and “HG00097”, which are both individuals from a British population (named GBR in the 1000 genomes project). The genotype information is 0|0 and 0|1, respectively. This means that both copies of the genome of individual HG00096 have the reference type (T) at this position. For individual HG00097, we see that one copy has the reference type, but the other has the alternative (C). So the bubble specified in line 1 is clearly contained in the population GBR, although not by all individuals. In this preliminary examination, we have discovered one individual which is homozygotic for the reference type, and one which is heterozygotic.

[Graph drawing: three bubble positions, 16423001, 16423023 and 16424559, separated by intermediary nodes (I). Reference nodes: G, G, G. Alternative nodes: A at 16423001; A and GTA at 16423023; A at 16424559.]

Figure 2.4: The graph-representation of Table 2.1. The nodes marked with an “I” are intermediary nodes of known length (the difference in positions between the previous and the current node). The sequence in this node is not known by the graph, but we may look it up in the reference sequence. The nodes containing reference types are colored blue.

⁶ This is in contrast to the general file specification (Danecek et al., 2011), which says that if there are multiple alternatives at a position, they are given as a comma-separated list in the “ALT”-field.

⁷ For more on this visit documentation at https://github.com/samtools/hts-specs.
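The interpretation of the genotype field can be made concrete as follows. This is a minimal sketch, assuming a diploid, phased genotype written as two allele codes separated by a vertical line, optionally followed by further colon-separated values; the function name is illustrative.

```python
def classify_genotype(sample_field):
    """Classify a diploid genotype such as '0|1' or '0|1:1.000:...'."""
    genotype = sample_field.split(":")[0]  # the genotype is the first subfield
    alleles = genotype.split("|")
    if len(alleles) != 2:
        raise ValueError(f"expected a phased diploid genotype, got {genotype!r}")
    # Count the copies carrying the alternative type (any code other than 0).
    n_alt = sum(allele != "0" for allele in alleles)
    labels = (
        "homozygotic for the reference",
        "heterozygotic",
        "homozygotic for the alternative",
    )
    return labels[n_alt]


print(classify_genotype("0|0"))  # homozygotic for the reference
print(classify_genotype("0|1"))  # heterozygotic
```

Applied to the first line of Table 2.3, the two calls correspond to individuals HG00096 and HG00097, respectively.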

2.2.2 Reference Sequence

The VCF-file displays only the variants in a particular area, sorted by positions. Furthermore, we can find the number of base pairs that separates two variants by looking at the difference in position between them. The positions are indexes in the reference sequence (see Definition 2.2). This is the same sequence that is referred to by the column “REF” in the VCF file. While we do not need this reference sequence to make the graphs used in this thesis, we do need it to be able to reconstruct the full sequences. The reason for this is that the VCF file contains no information about what happens in between each variant. The reference sequence is represented in the traditional, linear way, i.e. as a string consisting of the letters A, C, G and T. The data used in this thesis is from phase 1 of the 1000 genomes project (1000 Genomes Project Consortium and others, 2012), which mapped reads against the reference genome GRCh37 (Church, Schneider, Graves, et al., 2011), released by the Genome Reference Consortium (GRC)⁸ in February 2009.⁹ Since then, an updated reference genome for humans (GRCh38) has been released (Schneider et al., 2017).

2.3 Variant Graphs

It is established that the list of variants provided by the VCF file serves as a template for creating a graph. Intuitively, we may refer to such a structure as a variant graph.

Definition 2.4. A variant graph is a directed, acyclic graph, where all nodes are labeled with a position, in addition to node type. The node type may be “ref”, “alt” or “inter”. The intermediary nodes are of known size, but contain an unknown sequence. The “ref” nodes contain the reference sequence at the position, while the “alt” nodes contain the alternative sequence at the position. The variant graph is built from node pairs placed in a sequence.

Variant graphs are simple in structure, containing mostly SNPs and small indels. So far, two graph figures have been presented, Figure 1.3 and 2.4. Both may be considered variant graphs. They are directed and acyclic, and can be built from a collection of variants, such as demonstrated in Table 2.1 and Figure 2.4. Generally, variant graphs have less variation per distance in the reference sequence than what is illustrated by the previous examples. A typical variant graph is depicted in Figure 2.5. Table 2.2 shows variant number 246-251 in the VCF file. The table directly specifies the graph in Figure 2.5. As previously noted, the nodes marked with an “I” are in the following referred to as “inter”. The internodes contain an unknown nucleotide sequence. In order to find the length of the internode separating variant 246 and 247, we may simply subtract the indexes in the reference genome, which are given as the positions of the variants in Table 2.2. By looking up the positions spanned by the internodes, we can construct a path through the graph that corresponds to a DNA sequence.

⁸ The GRC consists of The Genome Institute at Washington University, The Wellcome Trust Sanger Institute, The European Bioinformatics Institute and The National Center for Biotechnology Information.

⁹ This information was collected from www.ncbi.nlm.nih.gov/grc/human.

Table 2.2: Variant number 246-251 from the master VCF file used in the data analysis.

Variant no.  CHROM  POS       ID           REF   ALT
246          22     16235725  rs139546946  G     A
247          22     16235732  rs144666531  G     T
248          22     16236068  rs200925246  GAAT  G
249          22     16237036  rs180695922  A     G
250          22     16237058  rs186596399  T     C
251          22     16239976  rs141300208  G     A

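[Graph drawing: a start node (S) followed by six bubbles separated by intermediary nodes (I). Upper nodes: G, T, G, G, C, A. Lower nodes: A, G, GAAT, A, T, G.]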

Figure 2.5: A typical variant graph, consisting of SNPs and one small indel.

Example 2.4. We wish to find the length and sequence in the first internode, i.e. the one separating variant number 246 and 247. The length is 16235732 - 16235725 - 1 = 6, where we have subtracted 1 since none of the variants should be included in the internode, only the symbols in between the variants. A look-up in the reference genome¹⁰ reveals the sequence CCCTGT at position 16235726 through 16235731. The possible sequences through the first two variants of Figure 2.5 are thus GCCCTGTT, ACCCTGTT, GCCCTGTG and ACCCTGTG. ♣
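The computations in Example 2.4 can be sketched in code. The internode sequence CCCTGT is taken from the example rather than looked up, and the function names are illustrative. Subtracting the length of the preceding reference allele is an assumption made here to cover reference alleles longer than one base; for SNPs it reduces to the formula used in the example.

```python
from itertools import product


def internode_length(pos_prev, ref_prev, pos_next):
    """Number of reference bases strictly between two consecutive variants."""
    return pos_next - pos_prev - len(ref_prev)


def possible_sequences(variants, internodes):
    """All sequences obtained by choosing one allele per variant.

    variants is a list of (ref, alt) pairs and internodes holds the
    reference sequence between each pair of consecutive variants.
    """
    sequences = []
    for choice in product(*variants):
        sequence = choice[0]
        for internode, allele in zip(internodes, choice[1:]):
            sequence += internode + allele
        sequences.append(sequence)
    return sequences


print(internode_length(16235725, "G", 16235732))  # 6
print(possible_sequences([("G", "A"), ("G", "T")], ["CCCTGT"]))
```

With n biallelic variants the number of paths is 2^n, so enumerating all of them is only feasible for short stretches of the graph.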

The graph from Figure 2.5 can be further explored to find a number of possible paths that represent equally many individual sequences, just like in Example 1.1. Some potential complications encountered in the software are discussed in Section 3.3, while the more theoretical shortcomings of variant graphs are treated in Section 5.3.

¹⁰ The GRCh37 reference genomes for humans can be explored at http://grch37.ensembl.org/index.html. The same website also offers other generations of the human reference genome, in addition to other species.

Table 2.3: A variant call format file example.

[Table body omitted. Columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, HG00096, HG00097. The table lists 20 variant lines from chromosome 22. The first line reads: 22, 16050408, rs149201999, T, C, 100, PASS, ., GT:DS:GL.]

Chapter 3

Methods

If the DNA sequences of a group of individuals are represented as a graph as explained in Section 1.3, a tempting idea is to develop tools to measure the distance between two graphs. We can then apply these measures on e.g. population graphs or species graphs to find relative distances among the groups. There is a considerable amount of research on the use of simple, linear sequences for this problem (see e.g. Patwardhan, Ray, and Roy, 2014; Patwardhan, Ray, and Roy, 2014). A specific, homologous¹ region of interest is usually compared across the groups. It may be that a graph-based approach to this problem will provide more accurate measures of the distance between groups than the sequence-based approach. If this is the case, we can use it to aid biologists in search of phylogenetic relationships.

We will consider the pairwise distance between two variant graphs G1 and G2. G1 is made using several individuals from population 1, and G2 using several individuals from population 2. Recall from Section 2.1 that HTS methods produce large amounts of short sequences. Without loss of information, the graphs can be built from sets of these, where each set corresponds to a single individual. However, this would be extremely heavy on computation and memory, due to the vast amounts of reads. Instead, the graphs in this thesis are represented as a reference sequence along with a list of bubbles (given as a VCF file, see Chapter 2). This list gives a possible alternative base or sequence in each bubble and specifies the genotype for each individual. See Section 2.2 for details on the data.

Three distance metrics to calculate differences between population graphs are discussed over the next two sections. The main idea behind the development of the distance metrics is that they shall report small distances if the populations are similar and large distances if they are different. We may employ two rather separate ideas in the search for a suitable way to measure distance. The first two metrics are described as distance-based and the last as model-based.

¹ Homologous regions or structures have a shared ancestry. For example, the forelimbs of horses and the wings of birds are homologous. A genomic sequence is homologous to a corresponding sequence in another population or species if they originate from the same sequence in a common ancestor.

3.1 Distance-Based Approaches

In this section, the measures are based on distance. As we have no reference for what we are measuring, it is hard to make conclusions about the results. To obtain such a reference, a nonparametric resampling method called permutation testing is used (see Section 3.5 for details on the permutation test).

3.1.1 Bubble Counter

An easy (and naïve) way to estimate a distance is to detect the presence of variants that occur in one population, but not in the other. If we are given a reference sequence and a set of bubbles, we can go through the individuals of the population and check what genotypes they contain for each bubble. If both alternatives in the variant are observed in at least one individual, we may say that the bubble is contained in the population graph (as explained in Subsection 2.2.1). This way we can detect the presence or absence of all bubbles. Consider two population graphs G1 and G2. We can now count the number of bubbles found in G1 but not in G2, i.e. the bubbles that are unique to G1. We will call this number b1. Similarly, we count the bubbles that are unique to G2 and denote this number by b2. Then we count the number of bubbles that are found both in G1 and in G2. This number will be called m. The bubble counter distance is then defined as the number of bubbles b contained in only one population divided by the total number of bubbles present in one or both populations.

Definition 3.1. Denote the number of bubbles uniquely present in either only population 1 or only population 2 by b1 and b2, respectively. Let b = b1 + b2. The bubble counter metric is then

\[
d_{BC}(G_1,G_2) = \frac{b}{b+m}, \tag{3.1}
\]

where m is the number of bubbles present in both populations.
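Definition 3.1 can be sketched in code as follows. The sketch assumes that each population is given as a list with one entry per variant line, each entry being the list of genotype strings (such as "0|1") of the individuals in that population; a bubble counts as present when both the reference and the alternative type are observed. The function names and the toy data are illustrative and are not taken from the thesis software.

```python
def bubble_present(genotypes):
    """True if both the reference (0) and the alternative (1) type are observed."""
    observed = set()
    for genotype in genotypes:
        observed.update(genotype.split("|"))
    return {"0", "1"} <= observed


def bubble_counter_distance(population1, population2):
    """Bubble counter metric b / (b + m) of Definition 3.1."""
    b = m = 0
    for genotypes1, genotypes2 in zip(population1, population2):
        in1 = bubble_present(genotypes1)
        in2 = bubble_present(genotypes2)
        if in1 and in2:
            m += 1   # bubble present in both populations
        elif in1 or in2:
            b += 1   # bubble unique to one population
    if b + m == 0:
        raise ValueError("no bubbles are present in either population")
    return b / (b + m)


# Three variant lines, two individuals per population (toy data).
population1 = [["0|0", "0|1"], ["0|0", "0|0"], ["1|0", "0|0"]]
population2 = [["0|1", "0|0"], ["0|0", "0|0"], ["0|0", "0|0"]]
print(bubble_counter_distance(population1, population2))  # 0.5
```

In the toy data the first bubble is shared (m = 1), the second is absent from both populations and the third is unique to population 1 (b = 1), which gives 1/(1 + 1).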

Figure 3.8 displays a portion of two variant graphs where the upper has a bubble which is not found in the lower graph. This bubble is unique to the upper graph. The probability of a bubble being present in a population will increase as we add more individuals to each population. The result of this is that the term m in the above expression will grow. The number of unique bubbles, b1 and b2, may either increase or decrease depending on the genotypes of the new individuals.

[Graph drawing: A, bubble (G | A), TA, bubble (G | A), CA, bubble (T | C), CA.]

(a) This graph has three bubbles and the middle one is unique, since it is not contained in the graph in (b).

[Graph drawing: A, bubble (G | A), TAGCA, bubble (T | C), CA.]

(b) This graph has two bubbles. Note that the middle bubble from (a) is collapsed. This happens when all individuals in the population have a G at this position, i.e. when there is no variation.

Figure 3.1: Two population graphs. The one in (a) has three bubbles, and one unique bubble. The one in (b) has two bubbles, but both are also contained in (a), so it has no unique bubbles.

Consider the situation where a new individual called A is added to the analysis. Suppose the individual is assigned to population 1. First, remember that humans are diploid organisms, i.e. we have two copies of our genetic material. Now assume that individual A has a variant that is new to population 1. This means that there is a line in the VCF file where all individuals from population 1 except A have genotype 0|0. Individual A may have either 0|1, 1|0 or 1|1. In other words, it may be either heterozygotic or homozygotic for the alternative type. What happens to the numbers b1, b2 and m now depends on the characteristics of population 2.

• If population 2 does not contain the variant, then $b_1$ increases by 1, since $G_1$ now has a new unique bubble, while $m$ is unchanged, because no new bubble is shared by population 1 and population 2. The distance therefore does not decrease: $d_{BC}(G_1, G_2) = \frac{b+1}{b+1+m} \geq \frac{b}{b+m}$, since cross-multiplying reduces the inequality to $m \geq 0$.

• If population 2 contains the variant, the bubble is not unique to $G_1$, so $b_1$ remains unchanged. Moreover, it ceases to be unique to $G_2$, hence $b_2$ decreases by 1, and the number of non-unique bubbles $m$ increases by 1. The distance decreases, because $d_{BC}(G_1, G_2) = \frac{b-1}{(b-1)+(m+1)} = \frac{b-1}{b+m} \leq \frac{b}{b+m}$.

The bubble counter metric can be considered a local metric, as it uses single bubbles to compute the distance. It never explores large structures of the graph.
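The definition and the two update cases above can be illustrated with a small Python sketch. It assumes that each population graph is summarised as a set of bubble identifiers; the function name and the identifiers `bubble_1`, `bubble_2`, ... are hypothetical and not taken from the thesis software.

```python
def bubble_counter_distance(bubbles_1, bubbles_2):
    """Return d_BC = b / (b + m) for two collections of bubble identifiers."""
    set_1, set_2 = set(bubbles_1), set(bubbles_2)
    b = len(set_1 ^ set_2)   # b = b1 + b2: bubbles present in exactly one graph
    m = len(set_1 & set_2)   # m: bubbles present in both graphs
    if b + m == 0:
        raise ValueError("neither graph contains a bubble")
    return b / (b + m)


# Figure 3.1: graph (a) has three bubbles, graph (b) lacks the middle one.
graph_a = {"bubble_1", "bubble_2", "bubble_3"}
graph_b = {"bubble_1", "bubble_3"}
d_fig = bubble_counter_distance(graph_a, graph_b)

# Case 1: a new individual gives (a) a variant that (b) lacks, so b grows.
d_case_1 = bubble_counter_distance(graph_a | {"bubble_4"}, graph_b)

# Case 2: the new variant was already unique to (b), so it becomes shared.
graph_b2 = graph_b | {"bubble_4"}
d_case_2_before = bubble_counter_distance(graph_a, graph_b2)
d_case_2_after = bubble_counter_distance(graph_a | {"bubble_4"}, graph_b2)
```

In this toy example the distance rises from 1/3 to 1/2 in the first case and falls from 1/2 to 1/4 in the second, in line with the two inequalities above.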

3.1.2 Graph Edit Distance

The task of graph matching concerns measuring the similarity (or lack thereof) between graphs. Graph edit distance is a measure for the graph matching problem, and can be considered as an extension of the string edit distance used in alignment. In fact, we may use graph edit distance to align graphs, opening up for a potential extension of sequence alignment to graph alignment for use in assembly. The following definition is slightly modified from Klau (2009).

Definition 3.2. A graph alignment $A(G_1, G_2)$ of two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ is a mapping $g : V_1 \to V_2$ such that for a node $u_{1,i} \in V_1$ and $u_{2,j} \in V_2$,
\[
g(u_{1,i}) =
\begin{cases}
u_{2,j} & \text{if } u_{2,j} \in V_2, \\
\text{\_} & \text{if } u_{2,j} \notin V_2,
\end{cases}
\]
where the symbol “_” is a gap, just like in sequence alignment.

The graphs used in this thesis are what we may call attributed graphs, which means that the edges and nodes have attributes.

Definition 3.3. An attributed graph is denoted by $G_A = (V_A, E_A, \alpha, \beta)$, where $\alpha : V_A \to L_V$ is a node labeling function and $\beta : E_A \to L_E$ is an edge labeling function.

$L_V$ and $L_E$ are node and edge labeling sets, respectively, and can consist of more or less any kind of symbols, e.g. a set containing name strings or a subset of $\mathbb{N}$. In the case of DNA graphs, we will have $L_V = \Sigma$ and $L_E = \Sigma \times \Sigma$, so every node label is a symbol from $\Sigma$ and every edge label is a pair of symbols, both from $\Sigma$.

Example 3.1. Consider the attributed graph in Figure 3.2. The node labeling function α is then

α(v1) = A

α(v2) = G

α(v3) = A

α(v4) = T.

The edge labeling function β is

β(e1) = AG

β(e2) = AA

β(e3) = GT

β(e4) = AT.

[Figure 3.2, panels (a) and (b): graph drawings not reproduced.]

Figure 3.2: The node labeling function α and the edge labeling function β assign nucleotide labels to the nodes and edges of a DNA sequence graph.

The graph edit distance between two attributed graphs $G_1 = (V_1, E_1, \alpha_1, \beta_1)$ and $G_2 = (V_2, E_2, \alpha_2, \beta_2)$ is based on finding the optimal way of transforming $G_1$ into $G_2$ using a set of possible edit operations. As in the problem of string edit distance, the edit operations of graph edit distance are substitutions, insertions and deletions, but of nodes and edges, not symbols.

Definition 3.4. A sequence of edit operations $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_k)$ is called an edit path. If the execution of this sequence of operations transforms $G_1$ into $G_2$, we say that $\lambda$ is a complete edit path, and we denote the set of all complete edit paths between $G_1$ and $G_2$ by $\Lambda(G_1, G_2)$.

We may assign a cost to each edit operation and sum the costs of all the individual operations needed to transform $G_1$ into $G_2$.

Definition 3.5. The graph edit distance is defined as the minimal total cost of the individual edit operations in a complete edit path. Let $c(\lambda_i)$ be the cost of the edit operation $\lambda_i$, and let $\lambda$ be a complete edit path. Then the graph edit distance is
\[
d_{GED} = \min_{\lambda \in \Lambda(G_1, G_2)} \sum_{\lambda_i \in \lambda} c(\lambda_i). \tag{3.2}
\]

For non-attributed graphs, the definition is almost the same, but we then only need isomorphism, rather than equality, between $G_1$ and $G_2$. In general, two mathematical structures are said to be isomorphic if they are the same up to a relabeling of the elements they contain (Schumacher, 2001). So for an unlabeled graph, we have that $G_1 \simeq G_2 \Leftrightarrow G_1 = G_2$, where the relation $\simeq$ denotes isomorphism.

The score of each edit path depends on what penalty we assign to each of the edit operations. Just like in the alignment problem in Subsection 2.1.1, the penalties should, to the best of our knowledge, represent the evolutionary events happening in the real world, e.g. insertions, deletions and inversions. In this thesis we will use very general scores. Both edge insertion and edge deletion are scored +1, while edge substitutions are scored 0. All node operations are scored 0, since they will cause an edge operation anyway. Also note that a node substitution will cause substitution of all edges affiliated with the node. This is because an edge is essentially just an ordered pair of nodes, one start node and one end node. If one of the nodes in the pair is substituted, then the edge must be substituted.

Formally, let the respective costs of inserting, deleting and substituting a node $v$ be $c_{ni}(v)$, $c_{nd}(v)$ and $c_{ns}(v, v')$. The costs of the corresponding edge operations are $c_{ei}(e)$, $c_{ed}(e)$ and $c_{es}(e, e')$, respectively. Then the scoring scheme will, for all $v \in V_1$ and all $e \in E_1$, be
\begin{align}
c_{ni}(v) &= 0 \tag{3.3} \\
c_{nd}(v) &= 0 \tag{3.4} \\
c_{ns}(v, v') &= 0 \tag{3.5} \\
c_{ei}(e) &= 1 \tag{3.6} \\
c_{ed}(e) &= 1 \tag{3.7} \\
c_{es}(e, e') &= 0. \tag{3.8}
\end{align}

Example 3.2. A complete edit path $\lambda(G_1, G_2)$ between two graphs is indicated in Figure 3.3. Using the scoring scheme described above, we get
\[
\mathrm{cost}(\lambda(G_1, G_2)) = 6 + 2 + 0 = 8,
\]
because the edit path includes six edge deletions and two edge insertions. All these operations have score 1. In addition, two nodes were removed (with score 0). ♣

[Figure 3.3, panels (a) to (d): graph drawings not reproduced.]

(a) The source graph G1.

(b) Delete six edges.

(c) Add the two edges that enter the internode.

(d) Delete two unused nodes to arrive at the target graph G2.

Figure 3.3: A complete edit path between the source graph G1 and the target graph G2.
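The arithmetic in Example 3.2 can be reproduced with a short Python sketch. The operation names and the function `edit_path_cost` are hypothetical. The sketch only evaluates the cost of one given edit path under the scoring scheme (3.3)-(3.8); it does not search for the minimum over all complete edit paths required by Definition 3.5.

```python
# Scoring scheme (3.3)-(3.8): only edge insertions and deletions cost 1.
COSTS = {
    "node_insert": 0,
    "node_delete": 0,
    "node_substitute": 0,
    "edge_insert": 1,
    "edge_delete": 1,
    "edge_substitute": 0,
}


def edit_path_cost(edit_path, costs=COSTS):
    """Sum the costs of the operations in a single edit path."""
    return sum(costs[operation] for operation in edit_path)


# The edit path of Figure 3.3: six edge deletions, two edge insertions
# and two node deletions.
example_path = ["edge_delete"] * 6 + ["edge_insert"] * 2 + ["node_delete"] * 2
example_cost = edit_path_cost(example_path)
```

Here `example_cost` evaluates to 8, matching the example.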

The main challenge of graph edit distance is the algorithm complexity, which is exponential in the number of nodes (Bunke and Riesen, 2009; Riesen and Bunke, 2009). We will elaborate further on this while discussing the implementation in Section 3.3.

3.2 A Model-Based Approach

The measure developed in this section is model-based. It relies on assumptions made about the distribution of the data. As will be discussed in Chapter 5, the assumptions pose a weakness since they may not be correct. The following method is based on the assumption that each bubble is independent and that the genotype frequencies are distributed according to the multinomial distribution. The model can be used to find a test statistic, which may be interpreted directly to determine if the two populations are different or not.

3.2.1 The Bayes Factor

The Bayes factor offers a Bayesian approach to hypothesis testing. The idea was introduced in Jeffreys (1935). The Bayes factor evaluates the support in the data for one model relative to the support for the other (Kass and Raftery, 1995). In this thesis, the two models regard the distribution of the genotypes in a sample. Specifically, the null model states that the genotype proportions at each bubble are identical in the two populations. The alternative model states the opposite, i.e. that they are not identical. Figure 3.4 shows a graph and Table 3.1 shows a set of possible genotype frequencies for the variants in a graph.

[Figure 3.4: graph drawing not reproduced.]

Figure 3.4: A graph with three bubbles, B1, B2 and B3. The genotype proportions are given in Table 3.1 below.

Table 3.1: Genotype proportions in the graph in Figure 3.4.

            Genotype proportions
Position    UU     UV     VV
B1          0.3    0.1    0.6
B2          0.6    0.2    0.2
B3          0.1    0.8    0.1

To avoid notational confusion in the following, we will refer to the genotypes as UU, UV, VU and VV. As an example, consider Figure 3.4. The possible sequences in B1 are U = G and V = A. So the genotype UU means that the diploid individual has a G in both of its copies of the genome. Similarly, UV means that one copy has G and one has A, while VV means that both copies have an A at this position. Continuing this way, we can use Table 3.1 to find the most commonly used path through the graph in Figure 3.4. Note that there is no difference between having outcome U in one copy and outcome V in the other or vice versa, so the actual number of genotypes is only three (UU, UV and VV).

If we count all the genotypes at a particular position in the sample, we get an estimate of the genotype proportions in the population. Let $n_1$ and $n_2$ be the number of individuals in population 1 and 2, respectively. Also, let $y_{1,UU} = y_{1,1}$ be the number of individuals with genotype 1 (UU) in population 1, $y_{1,UV} = y_{1,2}$ the number with genotype 2 (UV) in population 1, etc. The numbers of individuals with each genotype can then be considered a realization of a multinomially distributed random variable. The counts from population 1 have parameter $\mathbf{p}_1 = (p_{1,1}, p_{1,2}, p_{1,3})$ and the counts from population 2 have parameter $\mathbf{p}_2 = (p_{2,1}, p_{2,2}, p_{2,3})$, so
\[
\mathbf{y}_1 \sim \mathrm{Multinomial}(\mathbf{p}_1, n_1), \qquad
\mathbf{y}_2 \sim \mathrm{Multinomial}(\mathbf{p}_2, n_2),
\]
meaning that
\begin{align}
f(\mathbf{y}_1 \mid \mathbf{p}_1, n_1) &= \frac{n_1!}{\prod_{i=1}^{3} y_{1,i}!} \prod_{i=1}^{3} p_{1,i}^{y_{1,i}} \tag{3.9} \\
f(\mathbf{y}_2 \mid \mathbf{p}_2, n_2) &= \frac{n_2!}{\prod_{i=1}^{3} y_{2,i}!} \prod_{i=1}^{3} p_{2,i}^{y_{2,i}}. \tag{3.10}
\end{align}

We now have two populations modeled by two multinomial distributions with separate parameters $\mathbf{p}_1$ and $\mathbf{p}_2$. We can then formulate a hypothesis test. Consider the null hypothesis that the populations are equal, so the proportions of each genotype are identical at each position. The alternative hypothesis is that the proportions are not identical. Formally, the null hypothesis is
\[
H_0 : \mathbf{p}_1 = \mathbf{p}_2 = \mathbf{p}_0 = (p_1, p_2, p_3),
\]
while the alternative hypothesis is
\[
H_1 : \mathbf{p}_1 = (p_{1,1}, p_{1,2}, p_{1,3}) \neq (p_{2,1}, p_{2,2}, p_{2,3}) = \mathbf{p}_2.
\]

The idea for the Bayes factor metric originated from failed attempts at employing Pearson’s chi-squared test on the data. Pearson’s chi-squared test is tempting, since it is a simple hypothesis test for the setup presented here. We wish to test if the genotype proportions are the same in each bubble. In our case, the test statistic for Pearson’s chi-squared test is
\[
V = \sum_{i=1}^{2} \sum_{j=1}^{3} \frac{(y_{i,j} - \hat{y}_j)^2}{\hat{y}_j}, \tag{3.11}
\]
where $\hat{y}_j = \bar{y}_{\cdot j} = \frac{y_{1,j} + y_{2,j}}{2}$ is the expected number of individuals with genotype $j$ (a form that presumes $n_1 = n_2$). Then $V$ is asymptotically $\chi^2$-distributed with $(2 - 1)(3 - 1) = 2$ degrees of freedom, as for any test of homogeneity in a $2 \times 3$ table. The problem with this approach is that many of the bubbles have observed counts equal to zero for one or two of the three possible genotypes, in both populations. This means that in many cases, we will have $\bar{y}_{\cdot j} = 0$, so (3.11) will not be defined. A simple solution to this issue is to remove all locations that give $\bar{y}_{\cdot j} = 0$, but such an approach is not satisfying either. The locations are known variants where there may be variation, and the fact that it is not represented in the data could merely be due to chance. By using the Bayes factor as test statistic, we avoid the problem of the zeroes altogether.

For the Bayes factor approach, let $M_0$ be the model with parameters subject to $H_0$, and $M_1$ the model with parameters subject to $H_1$. This means

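that model $M_0$ has parameter $\theta_0 = \mathbf{p}_0$ and $M_1$ has parameters $\theta_1 = \mathbf{p} = (\mathbf{p}_1, \mathbf{p}_2)$. We wish to compare how well the two models account for the data. The famous theorem of Thomas Bayes says that
\[
P(M_s \mid \mathbf{y}) = \frac{P(M_s) P(\mathbf{y} \mid M_s)}{P(\mathbf{y})},
\]
where $P(M_s)$ and $P(M_s \mid \mathbf{y})$, $s = 0, 1$, are the so-called prior and posterior probabilities, respectively, of model $M_s$ (Gelman et al., 2014, page 6-7). The Bayes factor is the ratio of the posterior odds and the prior odds of the two models given the data $\mathbf{y}$. The posterior odds is
\[
\frac{P(M_1 \mid \mathbf{y})}{P(M_0 \mid \mathbf{y})} = \frac{P(M_1) P(\mathbf{y} \mid M_1)}{P(M_0) P(\mathbf{y} \mid M_0)}
\]
and the Bayes factor is then
\begin{align*}
B_{10} &= \frac{P(\mathbf{y} \mid M_1)}{P(\mathbf{y} \mid M_0)} \\
&= \frac{\int_{\theta \in \Omega_1} P(\mathbf{y}, \theta \mid M_1) \, d\theta}{\int_{\theta \in \Omega_0} P(\mathbf{y}, \theta \mid M_0) \, d\theta} \\
&= \frac{\int_{\theta \in \Omega_1} P(\theta \mid M_1) P(\mathbf{y} \mid \theta, M_1) \, d\theta}{\int_{\theta \in \Omega_0} P(\theta \mid M_0) P(\mathbf{y} \mid \theta, M_0) \, d\theta} \\
&= \frac{\int_{\mathbf{p} \in \Omega_1} P(\mathbf{p}) P(\mathbf{y} \mid \mathbf{p}) \, d\mathbf{p}}{\int_{\mathbf{p}_0 \in \Omega_0} P(\mathbf{p}_0) P(\mathbf{y} \mid \mathbf{p}_0) \, d\mathbf{p}_0},
\end{align*}
because we find the marginal probability of $\mathbf{y}$ under each model by integrating over the parameter spaces $\Omega_0 = [0, 1]^3$ and $\Omega_1 = [0, 1]^6$. In addition, let $\mathbf{y} = (\mathbf{y}_1, \mathbf{y}_2)$, and remember that $\mathbf{y}_1$ and $\mathbf{y}_2$ are assumed to be independent. This leads us to
\begin{align}
B_{10} &= \frac{\int_{(\mathbf{p}_1, \mathbf{p}_2) \in \Omega_1} P(\mathbf{p}_1, \mathbf{p}_2) P(\mathbf{y}_1, \mathbf{y}_2 \mid (\mathbf{p}_1, \mathbf{p}_2)) \, d\mathbf{p}_1 d\mathbf{p}_2}{\int_{\mathbf{p}_0 \in \Omega_0} P(\mathbf{p}_0) P(\mathbf{y}_1, \mathbf{y}_2 \mid \mathbf{p}_0) \, d\mathbf{p}_0} \notag \\
&= \frac{\int_{(\mathbf{p}_1, \mathbf{p}_2) \in \Omega_1} P(\mathbf{p}_1) P(\mathbf{p}_2) P(\mathbf{y}_1 \mid \mathbf{p}_1) P(\mathbf{y}_2 \mid \mathbf{p}_2) \, d\mathbf{p}_1 d\mathbf{p}_2}{\int_{\mathbf{p}_0 \in \Omega_0} P(\mathbf{p}_0) P(\mathbf{y}_1 \mid \mathbf{p}_0) P(\mathbf{y}_2 \mid \mathbf{p}_0) \, d\mathbf{p}_0}, \tag{3.12}
\end{align}
where we have assumed independent priors for $\mathbf{p}_1$ and $\mathbf{p}_2$ under $H_1$. In order to calculate this expression, we need to determine a prior distribution. The conjugate prior distribution for multinomially distributed data $\mathbf{y}$ with parameters $n$ and $\mathbf{p} = (p_1, p_2, p_3)$ is $\mathrm{Dirichlet}(\alpha)$, where $\alpha = (\alpha_1, \alpha_2, \alpha_3)$. This is shown in Appendix B.1. In the data analysis, the hyperparameter is set to $\alpha = (1, 1, 1)$, to avoid the introduction of excess prior information. Indeed, it can be shown that the $\mathrm{Dirichlet}(1, 1, \ldots, 1)$ with $k$ parameters is the uniform distribution on the simplex in $\mathbb{R}^{k-1}$. Using the Dirichlet prior, and the fact that the $y$’s are multinomially distributed as stated in (3.9) and (3.10), we show in Appendix B.2 that the Bayes factor is
\[
B_{10} = \frac{\Gamma\left(\sum_{i=1}^{3} \alpha_i\right)}{\prod_{i=1}^{3} \Gamma(\alpha_i)}
\cdot \frac{\Gamma\left(\sum_{i=1}^{3} (y_{1,i} + y_{2,i} + \alpha_i)\right)}{\prod_{i=1}^{3} \Gamma(y_{1,i} + y_{2,i} + \alpha_i)}
\cdot \frac{\prod_{i=1}^{3} \Gamma(y_{1,i} + \alpha_i) \Gamma(y_{2,i} + \alpha_i)}{\Gamma\left(\sum_{i=1}^{3} (y_{1,i} + \alpha_i)\right) \Gamma\left(\sum_{i=1}^{3} (y_{2,i} + \alpha_i)\right)}. \tag{3.13}
\]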

This ratio can be considered a test statistic for the hypotheses stated in the beginning of the subsection. In our case, this is a test for the equality of the parameters in a multinomial distribution at one specific bubble in the graph. However, what we really want is a measure of the distance between $G_1$ and $G_2$ along either a region or the entire genome. This measure should include all bubbles. For simplicity, consider all bubbles to be independent of each other. If so, a test statistic for the difference between the two graphs becomes
\[
\tilde{B}_{10} = \prod_{j=1}^{M} B_{10,j}, \tag{3.14}
\]
where $M$ is the total number of bubbles found in either $G_1$, $G_2$ or both.

Definition 3.6. Under the assumption that the bubbles are independent, the Bayes factor metric is defined as the logarithm of the product of the Bayes factors for each bubble,
\[
d_{BF} = \log(\tilde{B}_{10}) = \sum_{j=1}^{M} \log(B_{10,j}). \tag{3.15}
\]
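Equations (3.13) and (3.15) can be evaluated on the log scale, which avoids overflow in the gamma functions. The following Python sketch is an illustration only and is not the R program used in the thesis; the function names are hypothetical, and genotype counts are assumed to be given as triples (UU, UV, VV).

```python
from math import lgamma


def log_multivariate_beta(a):
    """log of prod_i Gamma(a_i) / Gamma(sum_i a_i)."""
    return sum(lgamma(x) for x in a) - lgamma(sum(a))


def log_bayes_factor(y1, y2, alpha=(1.0, 1.0, 1.0)):
    """log B10 of equation (3.13) for the genotype counts at one bubble."""
    post_1 = [y + a for y, a in zip(y1, alpha)]
    post_2 = [y + a for y, a in zip(y2, alpha)]
    pooled = [u + v + a for u, v, a in zip(y1, y2, alpha)]
    return (log_multivariate_beta(post_1) + log_multivariate_beta(post_2)
            - log_multivariate_beta(alpha) - log_multivariate_beta(pooled))


def bayes_factor_distance(counts_1, counts_2, alpha=(1.0, 1.0, 1.0)):
    """d_BF of equation (3.15): the sum of log B10 over all bubbles."""
    return sum(log_bayes_factor(y1, y2, alpha)
               for y1, y2 in zip(counts_1, counts_2))


# One individual per population with opposite homozygous genotypes.
small_example = log_bayes_factor((1, 0, 0), (0, 0, 1))
# Two populations of 50 individuals fixed for different alleles.
large_example = log_bayes_factor((50, 0, 0), (0, 0, 50))
```

Bubbles with zero counts pose no problem here, since every argument of the gamma function is at least $\alpha_i > 0$. With no observations at all the function returns 0, i.e. $B_{10} = 1$.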

The Bayes factor measure compares bubble pairs, one from population 1 and one from population 2. Due to the organization of the data file, this will cause a (tolerable) error in the special case where there are multiple variants at the same position. Consider two variants with the same position, one with reference type “C” and alternative type “G”, and the next with reference type “C” and alternative type “T”. Because they have the same position, the variants must share the same reference sequence. The graph at this position will look like the graph in Figure 3.5.

[Figure 3.5: graph drawing not reproduced.]

Figure 3.5: A graph portraying two variants at the same position. The sequence in the reference genome at this location is “C”. Each of the two variants suggests one alternative path.

The way the Bayes factor measure is developed in this thesis is problematic, as it treats the two variants separately, even though they are clearly part of the same bubble. This means that there are actually six possible genotypes, UU, UV, VV, UW, WW and VW. Since the position is identical in both bubbles, the number of individuals with the reference type must also be identical. Let the position of the two variants be $l$. Also, let the alternative model for variant 1 be $M_{1,1}$ and the alternative model for variant 2 be $M_{1,2}$. The two Bayes factors are then
\[
B_{10,l,1} = \frac{P(\mathbf{y} \mid M_{1,1})}{P(\mathbf{y} \mid M_0)}, \qquad
B_{10,l,2} = \frac{P(\mathbf{y} \mid M_{1,2})}{P(\mathbf{y} \mid M_0)},
\]
where the null model is identical for the two Bayes factors. So the null model at $l$ is in fact used two times. Ideally, the desired model should bring both variants into the same Bayes factor, thus providing one Bayes factor per variant position, not per variant.

The issue described above has potential for being severe, especially if many variants occur at the same position. In the data used in this thesis, however, the situation is extremely rare. Two variants at one position occur in six out of 5000 lines in total. More than two variants at one position never occur. Consequently, the effect of this drawback is only marginal, and it has been ignored.

The interpretation of the Bayes factor is how much support there is for the alternative hypothesis relative to how much support there is for the null hypothesis (Kass and Raftery, 1995; Masson, 2011). $\tilde{B}_{10} > 1$ means that the likelihood of the data under the alternative hypothesis is larger than under the null hypothesis. This indicates that the populations are different. Similarly, $\tilde{B}_{10} \leq 1$ indicates that they are identical. As an example, assume that we have $B_{10} = 4$ for some test. Then the interpretation will be something like the statement “the data is four times as likely under the alternative hypothesis as under the null hypothesis” (Jarosz and Wiley, 2014). Several attempts at establishing a framework for making conclusions based on the value of the Bayes factor have been made (as discussed in Jarosz and Wiley (2014)). For example, the terminology in Table 3.2 was introduced in Jeffreys (1961).

Table 3.2: Jeffreys’ terminology for conclusions about the Bayes factor.

B10         log(B10)       Support against H0
1 − 3       0 − 1.10       Anecdotal
3 − 10      1.10 − 2.30    Substantial
10 − 30     2.30 − 3.40    Strong
30 − 100    3.40 − 4.61    Very strong
> 100       > 4.61         Decisive
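The terminology can be turned into a small lookup on the natural-log scale, as in the following Python sketch. The cut-offs 3, 10, 30 and 100 correspond to the log values 1.10, 2.30, 3.40 and 4.61 in the table. The function name, the treatment of values exactly on a boundary, and the label returned for $B_{10} \leq 1$ are assumptions made for the illustration; the table itself does not cover them.

```python
from math import log

# (upper bound on B10, label), in increasing order.
JEFFREYS_SCALE = [
    (3, "Anecdotal"),
    (10, "Substantial"),
    (30, "Strong"),
    (100, "Very strong"),
]


def jeffreys_label(log_b10):
    """Map a natural-log Bayes factor to a support category."""
    if log_b10 <= 0:
        # Assumption: B10 <= 1 lies outside the table and gives no support.
        return "No support against H0"
    for upper, label in JEFFREYS_SCALE:
        if log_b10 <= log(upper):
            return label
    return "Decisive"


label_for_four = jeffreys_label(log(4))
```

For the example $B_{10} = 4$ discussed above, the sketch returns "Substantial".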

The Bayes factor metric is, like the bubble counter, a rather local measure. Recall that it does pairwise comparisons of bubbles at corresponding locations in the two graphs without ever looking at larger structures.

3.3 Implementation of the Distance Metrics

First, the distances between the observed populations were calculated using the metrics introduced in Subsections 3.1.1, 3.1.2 and 3.2.1. The most demanding challenge in designing the software for distance calculation is to ease the intensive computation. For the $d_{BC}$- and $d_{BF}$-distances, this is mainly solved by parallel computing in R. For $d_{GED}$, a C++/Java program called GEDEVO is used (Ibragimov, Malek, Guo, et al., 2013; Ibragimov, Martens, et al., 2013; Malek et al., 2016).

All analyses were conducted on a 64-bit Red Hat Enterprise Linux Workstation 7.3 system equipped with an Intel i7-4790 processor, carrying 8 cores with 3.60 GHz and 16GB of local memory. This system takes around 10 seconds to compute both the $d_{BC}$- and $d_{BF}$-distance between two populations. Hence, it takes about 2.5 minutes in total for all 15 combinations of populations. The calculation of $d_{GED}$ takes between 90 and 150 seconds depending on the populations, and approximately 30 minutes in total for all combinations. Computer code for the calculations can be inspected in Appendix F.2.

3.3.1 Bubble Counter and Bayes Factor

The $d_{BC}$- and $d_{BF}$-distances are calculated by the same R program, and both are computed directly from the data in the VCF file without building a graph structure. Both of these measures are fairly straightforward to implement in accordance with the theory in Subsections 3.1.1 and 3.2.1. To find $d_{BC}$-distances, the program simply checks the genotype data to see whether a bubble is unique to one population or not. After all the bubbles have been examined, it counts the occurrences of unique and shared bubbles. $d_{BF}$-distances are also found by going through each bubble, but in this case, the genotype information in the two populations is compared bubble by bubble. Finally, the Bayes factors of the individual bubbles are combined into one single number by summing their logarithms over all positions (bubbles).
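The per-bubble bookkeeping described above can be sketched in Python as follows. This is not the thesis R program. It assumes biallelic variants with no missing calls, genotype strings of the form `0|0`, `0|1`, `1|0` or `1|1`, and it interprets "both alternatives are observed" as the reference and the alternative allele each being seen at least once in the population. The function names are hypothetical.

```python
from collections import Counter


def genotype_counts(genotypes):
    """Count (UU, UV, VV) from VCF genotype strings such as '0|1'."""
    counts = Counter()
    for genotype in genotypes:
        alleles = genotype.replace("/", "|").split("|")
        n_alt = sum(allele != "0" for allele in alleles)
        counts[n_alt] += 1            # 0 -> UU, 1 -> UV, 2 -> VV
    return (counts[0], counts[1], counts[2])


def bubble_present(counts):
    """True if both alleles of the variant are observed in the population."""
    uu, uv, vv = counts
    return uv > 0 or (uu > 0 and vv > 0)


population_1 = ["0|0", "0|1", "1|0", "1|1", "0|0"]
counts_1 = genotype_counts(population_1)
```

The triple returned by `genotype_counts` is the vector $(y_{1,1}, y_{1,2}, y_{1,3})$ of Subsection 3.2.1, while `bubble_present` gives the presence indicator needed for the bubble counter.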

3.3.2 Graph Building

The graph edit distance compares nodes and edges in two different graphs, so in order to compute $d_{GED}$, we need to create one graph object representing each of the populations. A Java program called GraphBuilder was made to build graphs out of the VCF data. It reads the file line by line (or bubble by bubble), and translates the information from the table into node objects and edge objects. An edge is represented as a pair of nodes, where the first is the start node and the second is the destination node. The program stores the nodes and edges in a graph object. For each variant, two nodes are made, one for each alternative in the variant. In addition, an intermediary node is made in between two variants if their positions differ by more than one, i.e. they are not consecutive in the reference genome. Recall from Section 2.3 that we refer to the node types as “ref”, “alt” and “inter”.

The graphs consist mostly of SNPs and small indels, scattered across the genome. There are, however, two situations in particular that give rise to more complex graph structures. The first is when bubbles form on top of each other in the same position. Remember from Section 2.2 that if two bubbles have the same position in the file, they specify two possible paths in addition to the reference type at this location. A graph that illustrates this problem is shown in Figure 2.4. Here, both bubble number two and three are in position number 16423023. Observe that they share the same reference type. Both bubbles diverge from the reference and specify one path each through the region. Similarly, the reference type is a path, and all paths are joined in position 16423024.

The second situation is when there are bubbles in two or more consecutive positions. Then the problem is not to make the program establish the correct nodes, but rather to draw the correct edge structure. Figure 3.7a shows that GraphBuilder solves this situation by drawing all possible edges in the structures. Clearly, all the bubbles in the graph are found in at least one individual in the sample. Each individual represents a traversal through the graph. This means that by drawing all possible edges in Figure 3.7a, we may end up with more paths in the graph than the sample contains. However, the graph is meant to be a representation of a larger group, in which other paths may occur. By including all possible edges we allow the graph to be more general. On the other hand, it becomes less specific.

GraphBuilder models insertions and deletions in the simplest way possible, by assigning a node to each variant regardless of what it contains. This means that GraphBuilder uses two nodes to represent insertions and deletions, one for the reference type and one for the alternative type. A deletion means that the alternative is empty, while an insertion will have an alternative type with more than one base (since the reference is one or more bases long). This feature results in a somewhat unintuitive number of edges and nodes in insertions and deletions. Two nodes with sequences of different lengths are created, so structurally, an insertion is identical to a substitution. Figure 3.6 shows two possible ways of modeling an insertion. Figure 3.6a is perhaps a bit more intuitive. GraphBuilder draws the structure in 3.6b. In this thesis, little attention has been paid to this matter, because when calculating graph edit distance, it boils down to the scoring regime. If we penalize node deletions hard, then 3.6b will be preferred. Conversely, if we penalize edge operations severely, the opposite will be true.

[Figure 3.6, panels (a) and (b): graph drawings not reproduced.]

Figure 3.6: Two ways of modelling the insertion of the sequence TG. GraphBuilder handles insertions in the way depicted in 3.6b.
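The basic construction rules described above (two nodes per variant, an intermediary node between non-consecutive variants, and all possible edges between consecutive variants) can be sketched in Python. This is a simplified illustration and not the Java code of GraphBuilder: the function name and the node naming are assumptions, the positions 100 and 200 in the example are made up, and variants sharing a position, deletions with an empty alternative, and the end of the graph are not handled.

```python
def build_edges(variants):
    """Build an edge list from (identifier, position) pairs sorted by position.

    Returns a list of (from_node, to_node) pairs of node names.
    """
    edges = []
    open_ends = ["Start"]          # nodes still waiting for outgoing edges
    previous_position = None
    for identifier, position in variants:
        if previous_position is not None and position - previous_position > 1:
            # Non-consecutive variants are joined through an "inter" node.
            inter = identifier + "inter"
            edges += [(end, inter) for end in open_ends]
            open_ends = [inter]
        branch = [identifier + "ref", identifier + "alt"]
        # All possible edges from the open ends into both alternatives.
        edges += [(end, node) for end in open_ends for node in branch]
        open_ends = branch
        previous_position = position
    return edges


# Two variants that are not consecutive in the reference genome.
g1_edges = build_edges([("rs78836339", 100), ("rs189697741", 200)])
# Two consecutive variants: every alternative is linked to every alternative.
consecutive_edges = build_edges([("a", 10), ("b", 11)])
```

Under these assumptions, `g1_edges` contains six edges, and `consecutive_edges` contains the two edges from the start node plus all four edges between the alternatives of the two variants.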

GraphBuilder outputs a list of edges to a text file. An example is shown in Table 3.3. Each edge consists of a start node and a destination node, labeled with the unique identifier of the variant, in addition to a string which specifies what kind of node it is (“ref”, “alt” or “inter”). This serves two purposes:

• It can be read by an R program to produce graphics using e.g. the package igraph (Csardi and Nepusz, 2006). Figure 3.7 shows two example graphs drawn by igraph. In addition, a list of the sequences stored at each node is also given, so the nodes can be labeled correctly by igraph.

• It can be read by GEDEVO (Ibragimov, Malek, Guo, et al., 2013; Ibragimov, Martens, et al., 2013; Malek et al., 2016), which calculates graph edit distance.

The graph-drawing may appear a bit uninformative as it is not directly connected to the analysis of graph differences. However, it provides a huge advantage in evaluating the correctness of the output from GraphBuilder.

[Figure 3.7: graph drawings not reproduced.]

(a) The variants A/G and G/C are consecutive.

(b) Two variants at the same position split the graph into three nodes in the same location.

Figure 3.7: Two situations handled by the graph-making program GraphBuilder. The graphs are drawn using the R package igraph.

Table 3.3: Output from GraphBuilder. A list of edges, represented as node pairs. All edges in both graphs are given. The list can be fed directly to GEDEVO, which calculates GED.

Graph   fromNode            toNode
G1      rs189697741inter    rs189697741ref
G1      rs189697741inter    rs189697741alt
G1      rs78836339ref       rs189697741inter
G1      Start               rs78836339ref
G1      Start               rs78836339alt
G1      rs78836339alt       rs189697741inter
G2      rs189697741inter    rs189697741ref
G2      rs189697741inter    rs189697741alt
G2      rs78836339ref       rs201256827inter
G2      Start               rs78836339ref
G2      Start               rs78836339alt
G2      rs201256827inter    rs201256827ref
G2      rs201256827inter    rs189697741inter
G2      rs201256827ref      rs189697741inter
G2      rs78836339alt       rs201256827inter

The primary output (a list of edges) is hard to interpret for large graphs. The visualizations in Figure 3.7 are far easier to work with. Program code for GraphBuilder can be inspected in Appendix F.1.

3.3.3 Graph Alignment with GEDEVO

GEDEVO was originally developed for the problem of aligning biological networks. Alignment of two networks or graphs can be performed by finding mappings between nodes from one group to nodes in the other group. The mappings are chosen in order to optimize some criterion (Heath and Kavraki, 2009). We may score the mappings as we wish, again much like in sequence alignment. The scoring regime can e.g. be that of the graph edit distance in Subsection 3.1.2, but other options are also possible. This will be discussed in Chapter 5. GEDEVO employs graph edit distance as the optimization criterion for graph alignment. In order to do graph alignment with the scoring regime in equations (3.3)-(3.8), GEDEVO makes the following, equivalent definition of the graph edit distance (Ibragimov, Malek, Guo, et al., 2013).

Definition 3.7. Let $uv$ be an edge from $u$ to $v$, where $u, v \in V_1$. Then, let $g(u)g(v)$ be an edge from $g(u)$ to $g(v)$, where $g(u), g(v) \in V_2$. With the scoring scheme (3.3)-(3.8), the graph edit distance from Definition 3.5 between $G_1$ and $G_2$, induced by $g$, is
\begin{align}
d_{GED}(G_1, G_2) = {} & \sum_{uv \in E_1} I\{uv \in E_1\} \, I\{g(u)g(v) \notin E_2\} \notag \\
& + \sum_{u'v' \in E_2} I\{u'v' \in E_2\} \, I\{g^{-1}(u')g^{-1}(v') \notin E_1\}. \tag{3.16}
\end{align}

Proposition 3.1. Given a mapping $g : V_1 \to V_2$, as introduced in Definition 3.2, Definitions 3.5 and 3.7 are equivalent. See Appendix B for proof.
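Equation (3.16) can be evaluated directly for a given mapping, as the following Python sketch shows for the two edge lists of Table 3.3. The function name is hypothetical. The mapping used here sends every node of G1 to the node of G2 with the same name, which imitates prematching by identifier and is an assumption of the example. The result is the distance induced by this one mapping, not the minimum found by GEDEVO's search.

```python
def induced_ged(edges_1, edges_2, g):
    """Equation (3.16): number of edges not preserved by the node mapping g.

    g is a dict from nodes of G1 to nodes of G2 and must be injective.
    Nodes without a partner are treated as gaps.
    """
    e1, e2 = set(edges_1), set(edges_2)
    g_inverse = {v2: v1 for v1, v2 in g.items()}
    deleted = sum((g.get(u), g.get(v)) not in e2 for u, v in e1)
    inserted = sum((g_inverse.get(u), g_inverse.get(v)) not in e1 for u, v in e2)
    return deleted + inserted


A, B, C = "rs78836339", "rs201256827", "rs189697741"
g1 = [(C + "inter", C + "ref"), (C + "inter", C + "alt"),
      (A + "ref", C + "inter"), ("Start", A + "ref"),
      ("Start", A + "alt"), (A + "alt", C + "inter")]
g2 = [(C + "inter", C + "ref"), (C + "inter", C + "alt"),
      (A + "ref", B + "inter"), ("Start", A + "ref"),
      ("Start", A + "alt"), (B + "inter", B + "ref"),
      (B + "inter", C + "inter"), (B + "ref", C + "inter"),
      (A + "alt", B + "inter")]

# Map each node of G1 to the node with the same name in G2.
identity = {node: node for edge in g1 for node in edge}
table_distance = induced_ged(g1, g2, identity)
```

With this mapping, two edges of G1 have no counterpart in G2 and five edges of G2 have no counterpart in G1, so `table_distance` is 7.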

GEDEVO searches for the mapping $g$ that minimizes the graph edit distance $d_{GED}$ (Ibragimov, Malek, Guo, et al., 2013; Ibragimov, Malek, Baumbach, et al., 2014). However, the general problem of graph alignment can be shown to be NP-hard (Klau, 2009). The complexity of graph edit distance is exponential in the number of nodes in the graphs. This is unfortunate, because the graphs handled in this thesis easily contain more than 10 000 nodes. This is the reason why only the first 5000 lines of the original data file are used. The process is simply too time-consuming if we include more data, as the graphs grow too large to handle within a realistic time frame. The algorithmic complexity makes parallel computing a key aspect of the software. GEDEVO uses CPUs rather aggressively. It detects the number of cores in the system and starts one thread per core.

GEDEVO is an evolutionary algorithm (EA). EAs perform tasks with the ability to learn. The procedure is inspired by concepts from evolutionary biology, such as inheritance, mutation, fitness and survival. EAs maintain a population of solutions, referred to as individuals, that optimize some criterion in parallel. The individuals that are considered “best” by some metric (e.g. graph edit distance) are permitted to survive to the next cycle. The surviving individuals are recombined to produce “offspring”, i.e. a new generation of individuals with characteristics from the previous generation. They are also exposed to changes (to mimic mutations) (Eiben and J. E. Smith, 2003, chapter 3; Yu and Gen, 2010, page 6-8). In GEDEVO, each individual in the EA represents a mapping $g$, where $g$ is as explained in Definition 3.2 (Ibragimov, Malek, Guo, et al., 2013). Several components of the algorithm may be stochastic. In GEDEVO, the recombination, mutation and selection steps all have stochastic parts incorporated (Ibragimov, Malek, Guo, et al., 2013). As a result, GEDEVO will often calculate different distances for two runs with identical input. This will be largely disregarded in the subsequent analyses of the population graphs, because it applies to every distance calculated by GEDEVO. Consequently, it may be considered as added noise, and if it is small, it may be ignored.

Considering the structure of the variant graphs, GEDEVO is perhaps not an ideal choice of software. It was originally designed for alignment of protein-protein interaction (PPI) networks. PPI networks typically have more edges than the close to linear graphs that are explored in this thesis. It is possible that GEDEVO therefore is somewhat optimized for graphs with more edges per node. In particular, GEDEVO has trouble with substitutions. Before starting the calculation, GEDEVO goes through a step called “prematching”, where nodes are matched to each other across the two graphs if the identifier is the same. A pair of prematched nodes is never separated during calculation. The user is not allowed to score node substitutions. All other edit operations (node insertion, node deletion, edge insertion, edge deletion and edge substitution) can be scored. Edge substitutions are defined as all mappings $g(e) = e'$ for the edges $e \in E_1$ and $e' \in E_2$, where $g$ is the function from Definition 3.7. It is not interesting to score these substitutions, because the score does not translate into useful information about the differences. Rather, it tells us something about the number of edge mappings, which is not so helpful.

By default, node insertion and node deletion are both scored zero in GEDEVO, because they will lead to an edge insertion or edge deletion. For most situations this is fine, but it becomes a problem under the circumstances depicted in Figure 3.8. A variant in $G_1$ and a variant in $G_2$ are both located between a set of prematched nodes. The dashed red line is a mapping that prematches nodes in $G_1$ with identical nodes in $G_2$. However, because only a node substitution is needed in order to transform $G_1$ into $G_2$, the edit distance between the two graphs is 0. This is clearly incorrect for the data used in this thesis, and a rather annoying feature of the software is that the user is not allowed to change the weight of node substitution. Fortunately, the effect is modest. The issue only arises when two different bubbles are surrounded by the same two prematched variants. The result of the drawback described above is that GEDEVO systematically underestimates the edit distance between variant graphs.
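The general idea of an evolutionary search over node mappings can be illustrated with a toy Python sketch. It is not GEDEVO's algorithm: it uses only selection and a swap mutation (no recombination, no prematching, no parallel threads), assumes both graphs have the same number of nodes labeled 0, ..., n-1, and represents each individual as a permutation. All names and parameter values are hypothetical.

```python
import random


def induced_cost(perm, edges_1, edges_2):
    """Edges of G1 not mapped onto edges of G2, plus edges of G2 not hit."""
    mapped = {(perm[u], perm[v]) for u, v in edges_1}
    return len(mapped - edges_2) + len(edges_2 - mapped)


def evolve_mapping(edges_1, edges_2, n_nodes, pop_size=20, generations=200, seed=0):
    """Toy evolutionary search for a node mapping with a low induced cost."""
    edges_1, edges_2 = set(edges_1), set(edges_2)
    rng = random.Random(seed)

    def cost(perm):
        return induced_cost(perm, edges_1, edges_2)

    # Initial population: the identity mapping plus random permutations.
    population = [list(range(n_nodes))]
    population += [rng.sample(range(n_nodes), n_nodes) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=cost)
        survivors = population[: pop_size // 2]    # selection keeps the best half
        offspring = []
        for parent in survivors:
            child = parent[:]
            i, j = rng.sample(range(n_nodes), 2)   # mutation: swap two images
            child[i], child[j] = child[j], child[i]
            offspring.append(child)
        population = survivors + offspring
    best = min(population, key=cost)
    return best, cost(best)


# A directed path and the same path with its node labels reversed.
path_1 = {(0, 1), (1, 2), (2, 3)}
path_2 = {(3, 2), (2, 1), (1, 0)}
best_mapping, best_cost = evolve_mapping(path_1, path_2, n_nodes=4)
```

Because the best individuals always survive, the returned cost can never exceed the cost of the identity mapping included in the initial population. The random components mean that different seeds may return different mappings, which mirrors the run-to-run variation described above.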

3.4 Are the Measures really Metrics?

In this thesis, the notion of a metric has so far been treated rather loosely. A formal definition is given below.

Definition 3.8. Let X be a nonempty set, let d : X × X → R be a function, and let x, y, z ∈ X. If d is a metric, it fulfills the criteria

(i) Positivity: d(x, y) ≥ 0 with equality if and only if x = y.

(ii) Symmetry: d(x, y) = d(y, x).

(iii) Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).
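The three criteria of Definition 3.8 can be checked numerically on a finite set of points. The following is a minimal Python sketch, not part of the thesis code; the function name `check_metric` and the tolerance `tol` are illustrative assumptions. A passing check on a finite sample does not prove that a function is a metric, but a failing check is a valid counterexample.

```python
from itertools import product

def check_metric(points, d, tol=1e-12):
    """Return the sorted names of the metric criteria that d violates on points."""
    violations = set()
    for x, y in product(points, repeat=2):
        dxy = d(x, y)
        # Positivity: d >= 0, and d == 0 exactly when x == y.
        if dxy < -tol or ((dxy <= tol) != (x == y)):
            violations.add("positivity")
        # Symmetry: d(x, y) == d(y, x).
        if abs(dxy - d(y, x)) > tol:
            violations.add("symmetry")
    for x, y, z in product(points, repeat=3):
        # Triangle inequality: d(x, z) <= d(x, y) + d(y, z).
        if d(x, z) > d(x, y) + d(y, z) + tol:
            violations.add("triangle inequality")
    return sorted(violations)

# The absolute difference on the real line satisfies all three criteria.
abs_result = check_metric([0.0, 1.0, 2.5], lambda a, b: abs(a - b))
# The squared difference fails the triangle inequality (0 -> 1 -> 2).
sq_result = check_metric([0.0, 1.0, 2.0], lambda a, b: (a - b) ** 2)
print(abs_result, sq_result)
```

The same helper could be applied to a table of pairwise population distances by letting `d` look up the entry for a pair of population labels.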

In order to quantify the difference between populations, it is reasonable to assume some features about the measures, in particular the symmetry and triangle inequality of Definition 3.8. We will now check the properties for the measures described in Section 3.1. In the following, let X = (G1, G2, ..., Gk), where each graph is a variant graph. Denote the elements x and y from above by Gx and Gy, respectively.

Figure 3.8: A variant surrounded by prematched nodes. The dashed red lines are a mapping that prematches nodes from G1 (lower) with nodes from G2 (upper). The middle variant (bubble) is different in the two graphs, but cannot be distinguished by GEDEVO.

3.4.1 Bubble Counter

The bubble counter does exactly what the name reveals, and not really much else. Clearly, if Gx = Gy, there are no unique bubbles, because every bubble in Gx is also contained in Gy. We have

(i) True. Note that $G_x = G_y \Rightarrow b = 0 \Rightarrow d_{BC} = \frac{b}{m+b} = 0$.

(ii) True. Both b and m from equation (3.1) are the same no matter what the direction is.

(iii) True. Let $b_{xy,x}$ be the number of unique bubbles in $G_x$ when compared with $G_y$, and let $b_{xy,y}$ be the number of unique bubbles in $G_y$ when compared with $G_x$. Similarly, let $m_{xy}$ be the number of shared bubbles between $G_x$ and $G_y$. The distance from $G_x$ to $G_y$ is then

\[
d(G_x, G_y) = \frac{b_{xy,x} + b_{xy,y}}{b_{xy,x} + b_{xy,y} + m_{xy}} = \frac{b_{xy}}{b_{xy} + m_{xy}}.
\]

Similarly, we have

\[
d(G_y, G_z) = \frac{b_{yz}}{b_{yz} + m_{yz}}
\qquad\text{and}\qquad
d(G_x, G_z) = \frac{b_{xz}}{b_{xz} + m_{xz}}.
\]

We wish to know if

\[
\frac{b_{xz}}{b_{xz} + m_{xz}} \le \frac{b_{xy}}{b_{xy} + m_{xy}} + \frac{b_{yz}}{b_{yz} + m_{yz}}. \tag{3.17}
\]

Consider Figure 3.9, where three sets X, Y and Z give rise to three graphs $G_x$, $G_y$ and $G_z$, respectively. We may now express equation (3.17) in the context of this figure. The number of bubbles that are unique to $G_x$ when compared with $G_y$ is $b_{xy,x} = r_1 + r_6$. Similarly, we get

\[
\begin{aligned}
b_{xy,x} &= r_1 + r_6, & b_{yz,y} &= r_2 + r_4, & b_{xz,x} &= r_1 + r_4, \\
b_{xy,y} &= r_2 + r_5, & b_{yz,z} &= r_3 + r_6, & b_{xz,z} &= r_3 + r_5, \\
m_{xy} &= r_4 + r_7, & m_{yz} &= r_5 + r_7, & m_{xz} &= r_6 + r_7.
\end{aligned}
\]

Equation (3.17) can now be expressed as

\[
\frac{r_1 + r_4 + r_3 + r_5}{r_1 + r_4 + r_3 + r_5 + r_6 + r_7}
\le
\frac{r_1 + r_6 + r_2 + r_5}{r_1 + r_6 + r_2 + r_5 + r_4 + r_7}
+ \frac{r_2 + r_4 + r_3 + r_6}{r_2 + r_4 + r_3 + r_6 + r_5 + r_7}.
\]

We may then manipulate the left side (L.S.) of the above expression in the following way, where all denominators are assumed to be positive:

\[
\begin{aligned}
\text{L.S.} &= \frac{r_1 + r_5}{r_1 + r_3 + r_4 + r_5 + r_6 + r_7} + \frac{r_4 + r_3}{r_1 + r_3 + r_4 + r_5 + r_6 + r_7} \\
&\le \frac{r_1 + r_5}{r_1 + r_6 + r_5 + r_4 + r_7} + \frac{r_4 + r_3}{r_3 + r_6 + r_5 + r_4 + r_7} \\
&\le \frac{r_1 + r_5 + r_6}{r_1 + r_6 + r_5 + r_4 + r_7} + \frac{r_4 + r_3 + r_6}{r_3 + r_6 + r_5 + r_4 + r_7} \\
&\le \frac{r_1 + r_5 + r_6 + r_2}{r_1 + r_6 + r_5 + r_4 + r_7 + r_2} + \frac{r_4 + r_3 + r_6 + r_2}{r_3 + r_6 + r_5 + r_4 + r_7 + r_2} \\
&= \text{R.S.}
\end{aligned}
\]

The first inequality removes a nonnegative term from each denominator ($r_3$ and $r_1$, respectively), and the second adds $r_6$ to each numerator. In order to obtain the last line, we have used that for nonnegative numbers a, b and c with a + b > 0, we have

\[
\frac{a}{a + b} \le \frac{a + c}{a + b + c},
\]

because we then get

\[
a^2 + ab + ac \le a^2 + ac + ab + bc \iff 0 \le bc,
\]

which is true for nonnegative numbers.
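The triangle inequality for the bubble counter can also be spot-checked numerically. The sketch below assumes that each graph is represented simply as a Python set of bubble identifiers, which is an illustrative simplification and not the thesis implementation. Under this representation, b is the size of the symmetric difference and m the size of the intersection. The convention that two empty graphs have distance 0 is also an assumption.

```python
import random

def bubble_counter_distance(bubbles_x, bubbles_y):
    """d_BC = b / (b + m) for two sets of bubble identifiers."""
    b = len(bubbles_x ^ bubbles_y)  # bubbles unique to one of the graphs
    m = len(bubbles_x & bubbles_y)  # bubbles shared by both graphs
    # Assumed convention: two graphs without bubbles have distance 0.
    return b / (b + m) if b + m else 0.0

def triangle_holds(x, y, z, tol=1e-12):
    """Check d(x, z) <= d(x, y) + d(y, z) for three bubble sets."""
    d = bubble_counter_distance
    return d(x, z) <= d(x, y) + d(y, z) + tol

random.seed(1)
universe = range(30)
all_ok = True
for _ in range(2000):
    # Draw three random bubble sets of random size from the same universe.
    x, y, z = (set(random.sample(universe, random.randint(0, 30)))
               for _ in range(3))
    all_ok = all_ok and triangle_holds(x, y, z)
print(all_ok)
```

A random search of this kind cannot replace the proof above, but it is a cheap sanity check on the bookkeeping of the regions r1, ..., r7.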

Figure 3.9: Three sets X, Y and Z, which give rise to the graphs Gx, Gy and Gz, respectively. r1 is the number of bubbles contained in Gx but not in Gy or Gz, i.e. it is the number of unique bubbles in X, bx. So the number of unique bubbles in Gx when compared with Gy is bxy,x = r1 + r6.

3.4.2 Graph Edit Distance

(i) True. All weights are positive. Note that when Gx = Gy, the smallest complete edit path is λ = ∅, so d(Gx, Gy) = 0.

(ii) True. Reversing the order of the edit path does not change the cost of the edit operations.

(iii) True. Consider the complete edit paths λxy that turns Gx into Gy and λyz that turns Gy into Gz. Then λxy ∪ λyz is a complete edit path from Gx to Gz. But the graph edit distance between Gx and Gz is the sum of the costs in the minimal complete edit path between the two graphs. Let $\lambda_{xz}^{\min}$ be the minimal complete edit path. Then

\[
\sum_{\lambda_i \in \lambda_{xz}^{\min}} c(\lambda_i) \le \sum_{\lambda_i \in \lambda_{xy} \cup \lambda_{yz}} c(\lambda_i) \tag{3.18}
\]

by Definition 3.5. If λxy and λyz are chosen as minimal complete edit paths, the right side is at most d(Gx, Gy) + d(Gy, Gz), which gives the triangle inequality.

3.4.3 Bayes Factor Measure

The Bayes factor measure is not intended to be a distance metric. It is not based on finding distances. Rather, it tries to relate two hypotheses about the same data to each other. However, it can be treated in the same way as the other two measures in the permutation test. In addition, we may note that $d(G_x, G_y) = \sum_{i=1}^{M} \log(B_{10,i})$, where $B_{10,i}$ may be smaller than 1, so that individual terms, and hence the sum, can be negative. This means that positivity is not guaranteed, but the condition can possibly be met by applying a transformation.

3.5 The Permutation Test

The distances calculated by the distance-based methods described above are meaningless without a context. At this point they are just numbers. We can evaluate them relative to each other, but there is no way to decide if a distance is large or small. The permutation test offers such a context. Introduced in Fisher (1935), the permutation test is a nonparametric test, i.e. it computes significance without making assumptions on the data distributions (Davison and Hinkley, 2006, page 156). The Bayes factor measure can be interpreted directly using e.g. Jeffreys’ terminology in Table 3.2, but we may employ a permutation test on this measure as well, even though it may not be a real metric. If it behaves in the desired way, i.e. it assigns larger values to population pairs that have a larger genetic difference, it is well suited for evaluation by the permutation test.

The permutation test applied to the distance between two population graphs made from n1 and n2 individuals, respectively, can be expressed as the so-called two-sample problem. Denote population 1 by $x_1 = (x_{1,1}, x_{1,2}, \ldots, x_{1,n_1})$ and population 2 by $x_2 = (x_{2,1}, x_{2,2}, \ldots, x_{2,n_2})$, where $x_{i,j}$ is individual j in population i. Then, let F1 and F2 be probability distributions, and assume that x1 originates from F1 and x2 originates from F2. If we consider the individuals of the two populations in question as the data points in the vectors x1 and x2, we may use a permutation test for the null hypothesis that there is no difference between the two distributions (Efron and Tibshirani, 1993; Chung and Fraser, 1958). The alternative hypothesis then translates to stating that the distance between the two populations is larger than what may be expected purely due to chance. Formally,

H0 : F1 = F2 and H1 : F1 ≠ F2. (3.19)

The test statistic for this hypothesis test is the distance between the two population graphs, measured by either the bubble counter metric, the Bayes factor metric or the graph edit distance. Intuitively, the larger the distance between the two population graphs, the more evidence we have against H0. For the time being, we do not know what is a large and what is a small distance. We can, however, make conclusions about the observed distance by comparing it with other possible distances between other possible pairs of population graphs. Specifically, we may express the p value of the test of the hypotheses from (3.19) as

\[
p = P(d_F \ge d_{\mathrm{obs}}) \tag{3.20}
\]

if the null hypothesis is true (Mielke Jr and Berry, 2001, pages 21-22). Now, we make the reasonable assumption that the data under H0 are exchangeable, i.e. they are invariant to permutations of the indexes. Given H0 and exchangeability of the individuals, the p-value from (3.20) is exact. The reason for this is that when F1 = F2 = F, all individuals are from the same distribution, so $d_F$ is really the test statistic between two population graphs made from samples taken from F. The value of p can be used to evaluate the null hypothesis in the usual way, i.e. p is the probability, under H0, of observing a distance at least as extreme as $d_{\mathrm{obs}}$. Consequently, a small p-value means that the observed distance is unlikely under the null hypothesis.

The fact that the ordering of the individuals is of no importance whatsoever implies that any permutation of the groups is equally likely, and the observed data is just one of these permutations (Manly, 1997, pages 1-3). The exact permutation distribution is given by all $\binom{n_1+n_2}{n_1}$ possible permutations of a pool of two populations of size n1 and n2, respectively. Formally, let $x^p = (x_1^p, x_2^p)$ be a random permutation of the individuals. The probability of obtaining the i'th permutation $x_i^p$ is then

\[
P(x^p = x_i^p) = \frac{1}{\binom{n_1+n_2}{n_1}}. \tag{3.21}
\]

Now, let the distance (by some chosen metric M) between the two observed population graphs be $d_{\mathrm{obs}} = d_M(x_1, x_2)$. Similarly, let the distance between the permuted population graphs be $d_{\mathrm{perm}} = d_M(x_1^p, x_2^p)$. The permutation p-value is

\[
p_{\mathrm{perm}} = P(d_{\mathrm{perm}} \ge d_{\mathrm{obs}}). \tag{3.22}
\]

Further, by combining equations (3.21) and (3.22), we get

\[
p_{\mathrm{perm}} = \frac{1}{\binom{n_1+n_2}{n_1}} \sum_{i=1}^{\binom{n_1+n_2}{n_1}} I(d_{\mathrm{perm}} \ge d_{\mathrm{obs}}) = \frac{k}{N}, \tag{3.23}
\]

where k is the number of permutations whose distance is larger than, or equal to, the observed one, and $N = \binom{n_1+n_2}{n_1}$ is the total number of permutations (Ernst, 2004). The p-value of this test is the proportion of the data permutations that have a distance value larger than, or equal to, the distance between the observed population graphs. We can use equation (3.22) to approximate $p_{\mathrm{perm}}$ by Monte Carlo methods (Efron and Tibshirani, 1993, page 207; Edgington, 1995, pages 360-362).
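For very small groups, the exact permutation p-value in (3.23) can be computed by enumerating every split of the pooled individuals. The Python sketch below is illustrative only: the function names are hypothetical, and a simple difference of means stands in for the graph distances of this thesis so that the example stays self-contained.

```python
from itertools import combinations
from math import comb

def exact_permutation_p_value(x1, x2, distance):
    """Exact p-value k / N over all C(n1 + n2, n1) splits of the pooled data."""
    pooled = list(x1) + list(x2)
    n1 = len(x1)
    d_obs = distance(x1, x2)
    k = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        group1 = [pooled[i] for i in idx]
        group2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # Count splits that are at least as extreme as the observed one.
        if distance(group1, group2) >= d_obs:
            k += 1
    return k / comb(len(pooled), n1)

def mean_difference(a, b):
    """Stand-in test statistic: absolute difference of the group means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))

# Two well separated groups of three: only the observed split and its
# mirror image reach the observed distance, so p = 2 / 20.
p_exact = exact_permutation_p_value([1.0, 2.0, 3.0], [7.0, 8.0, 9.0],
                                    mean_difference)
print(p_exact)
```

The enumeration grows as C(n1 + n2, n1), so this approach is only usable for toy examples.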

3.5.1 Monte Carlo Sampling

In order to find exact p-values, we need to compute all permutations. For the case of n1 = n2 = n = 20, we have $N = \binom{40}{20} \approx 1.38 \cdot 10^{11}$. Computing the distance for this many permutations is infeasible. In order to estimate $p_{\mathrm{perm}}$ we may instead use Monte Carlo sampling. In this method, we randomly draw n individuals $x'_1 = (x'_{1,1}, x'_{1,2}, \ldots, x'_{1,n})$ without replacement and assign these to group number 1. The rest of the individuals $x'_2 = (x'_{2,1}, x'_{2,2}, \ldots, x'_{2,n})$ are then in group number 2. Then, we calculate the distance $d'_{\mathrm{perm}} = d_M(x'_1, x'_2)$. If we repeat this procedure B times, we can estimate the p-value by the ratio of the number of sampled distances that are at least as large as $d_{\mathrm{obs}}$ to the total number of permutations made, B:

\[
\hat{p} = \frac{1}{B} \sum_{i=1}^{B} I(d'_{\mathrm{perm}} \ge d_{\mathrm{obs}}) = \frac{k'}{B}. \tag{3.24}
\]

If now $p = \frac{k}{N}$, we see that $\hat{p} \to p$ as $B \to N$, because then $k' \to k$. So $\hat{p}$ is consistent. Furthermore, $\hat{p}$ is unbiased:

\[
\mathrm{E}[\hat{p}] = \frac{1}{B} \sum_{i=1}^{B} \mathrm{E}\left[ I(d'_{\mathrm{perm}} \ge d_{\mathrm{obs}}) \right] = \frac{1}{B} \sum_{i=1}^{B} p = p,
\]

because $p = P(d'_{\mathrm{perm}} \ge d_{\mathrm{obs}})$. Also, $\hat{p}$ is discrete uniformly distributed on $\{0, \frac{1}{B}, \ldots, \frac{B-1}{B}, 1\}$ (Phipson and Smyth, 2010).

To summarize, we can perform a permutation test using the following Monte Carlo procedure:

1. Identify the null hypothesis and the alternative hypothesis.

2. Choose a test statistic.

3. Compute the test statistic for the observed groups.

4. Make random permutations of the groups and calculate the test statistic for the permuted groups. Repeat this step until the statistic for a sample of adequate size is known.

5. Compare the test statistic from the observed group with the permutation distribution obtained in the previous step.

The permutation distribution is the context we are looking for. If the distance between the observed population graphs is an extreme value in the permutation distribution, we may reject the null hypothesis stating that the ordering of the individuals does not matter. In other words, the distance between the observed population graphs is probably larger than the distance obtained from a random permutation of the groups. We can express this as a hypothesis test which rejects H0 if $P(d'_{\mathrm{perm}} \ge d_{\mathrm{obs}}) < \alpha$, where α is the desired significance level. By definition, the p-value cannot be zero, because the pair of observed groups is one of the possible permutations. So we may use the biased estimate

\[
p = \frac{k + 1}{B + 1}, \tag{3.25}
\]

presented in Davison and Hinkley (2006, page 158) and Phipson and Smyth (2010), among others.
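The five steps above, together with the estimate (3.25), can be sketched in a few lines of Python. This is not the R implementation used in the thesis. It assumes, purely for illustration, that each individual is a set of bubble identifiers, that a population graph contains a bubble whenever at least one individual carries it, and that the bubble counter is the test statistic. All function names are hypothetical.

```python
import random

def bubble_counter_distance(bubbles_x, bubbles_y):
    """d_BC = b / (b + m) for two sets of bubble identifiers."""
    b = len(bubbles_x ^ bubbles_y)
    m = len(bubbles_x & bubbles_y)
    return b / (b + m) if b + m else 0.0

def population_distance(group1, group2):
    """Assumption: a population contains a bubble if any individual has it."""
    return bubble_counter_distance(set().union(*group1), set().union(*group2))

def monte_carlo_permutation_p_value(x1, x2, distance, B=999, seed=0):
    """Estimate the permutation p-value as (k + 1) / (B + 1)."""
    rng = random.Random(seed)
    pooled = list(x1) + list(x2)
    n1 = len(x1)
    d_obs = distance(x1, x2)
    k = 0
    for _ in range(B):
        # Random relabelling: the first n1 shuffled individuals form group 1.
        shuffled = rng.sample(pooled, len(pooled))
        if distance(shuffled[:n1], shuffled[n1:]) >= d_obs:
            k += 1
    return (k + 1) / (B + 1)

# Toy data: five individuals per group, with no bubbles in common.
group_a = [{1, 2, 3} for _ in range(5)]
group_b = [{4, 5, 6} for _ in range(5)]
p_hat = monte_carlo_permutation_p_value(group_a, group_b, population_distance)
print(p_hat)
```

Because of the added one in the numerator, the estimate can never fall below 1/(B + 1), whatever the data look like.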

3.6 Implementation of the Permutation Test

The procedure for calculating the distances between graphs built from randomly permuted populations is described in Algorithm 3.1. After a decent number of iterations, we can make a histogram of the distances between the permuted populations and calculate permutation p-values. This enables us to find out if the distance between the observed populations is a probable value for the random variable, whose probability distribution is approximated by the permutation distribution. The 0.975-quantile is used to determine if the observed distances are extreme values in the approximate distributions. The calculations are very computer-intensive, and the permutation test demands numerous repetitions of the same procedure. The permutation test for any of the three distances was computed by an R-implementation of Algorithm 3.1, which calls the function calculateDistance(). This function then finds the distance between the two populations by the preferred metric.

Algorithm 3.1 Calculation of pairwise dBC- and dBF-distances between many permuted populations.
Input: Two populations x1 and x2, both of size n, and the number of permutations to be done, B.
Output: A vector of distances, distTable.
    function PermutationDistances(x1, x2, B)
        distTable = null
        for k = 1 to B do
            permPops = makePermutations(x1, x2)
            d = calculateDistance(permPops[1], permPops[2])
            distTable = rbind(distTable, d)
        end for
        return distTable
    end function
The permutations are made by drawing n individuals at random, and assigning them to group number 1. The remaining ones are then in group 2. The distance is then calculated by whatever metric one wishes to use. permPops is the variable that contains the two permuted populations. It is a list with two elements.

Recall that the algorithm uses approximately 2.5 minutes for computing both the dBC- and dBF-distance between all combinations. The time grows as we add more individuals to each population and more data from every individual. When calculating dBC and dBF, Algorithm 3.1 runs in parallel, so the total time used is somewhere around $\frac{B}{C} \cdot 2.5$ minutes, where C is the number of CPUs and B is the number of permutations. So with 3000 iterations and 7 CPUs, we end up spending almost 18 hours for this computation.

In the permutation test for dGED-distances, calculateDistance() in Algorithm 3.1 gets more complicated. Automating the calculation of the graph edit distance from permuted populations requires communication with the operating system, which complicates the parallelization. The problem is that each iteration in Algorithm 3.1 requires us to read and write files multiple times. Remember that GraphBuilder takes VCF files as input and makes node and edge objects for each bubble. But we need to remove the bubbles that are not present in the population before feeding the VCF file to GraphBuilder. Recall that every permutation of the groups is only a collection of individuals picked at random. So we have to single out the columns corresponding to the correct individuals from the VCF file each time. Then the two population graphs can be built using whatever bubbles are present in the population. The program will call GraphBuilder from the system. The cleaned input files are then used by GraphBuilder to construct a graph object. Next, the two graphs are handed to GEDEVO as a list of directed edges, but GEDEVO can only read this from a file. We can e.g. not pass it an R-object (which would be ideal). This is the reason why GraphBuilder outputs the graph as a list of edges. After reading the graph, GEDEVO outputs results to a file (which, again, is the only alternative). Finally, we need to parse this file to obtain the dGED-distance between the graphs made by GraphBuilder.

There are many time-consuming aspects in this procedure, but by far the lengthiest is the calculation of the graph edit distance. This is due to the complexity of the algorithm inside GEDEVO and the size of the graphs. Remember that one iteration of the loop in Algorithm 3.1, when calculateDistance() uses dGED, typically spends between 90 and 150 seconds, depending on the populations. Calculating all the dGED-distances takes roughly 30 minutes, so 300 iterations take somewhere between 5 and 8 days. Program code for the permutation test may be inspected in Appendix F.3.

3.7 Example

In this section, the distance metrics are employed on small and simple graphs, to explain how the distances are calculated in practice.

Example 3.3. We wish to compare a specific region in one British (GBR) and one Luhya, Kenya (LWK) population. The data spans from position 16982899 to 17003679 in the human genome. Table 3.4 shows a VCF file which reports all possible variants in the region we are comparing. Genotype data for all individuals are summarized in Table 3.5. The complete list of genotypes can be inspected in Table C.1 in Appendix C. The VCF file may be represented as the two graphs in Figure 3.10. The first graph is the GBR population graph for this region, and the second one is the LWK population graph for the same genomic region.

Table 3.4: A VCF file displaying the three bubbles in this example. The individuals have been excluded. Instead, genotype frequencies are summarized in Table 3.5 below.

CHROM  POS       ID           REF  ALT  QUAL  FILTER
22     16123812  rs78836339   C    T    100   PASS
22     16130350  rs201256827  CT   C    105   PASS
22     16130558  rs189697741  C    A    100   PASS

The VCF information in Table 3.4 states that there are three possible bubbles in the graphs. To determine which of them are present, we need to examine the genotype information in Table 3.5. Note that all 20 individuals from the British population are homozygous for the reference type at position 16130350, i.e. this bubble does not occur in the population graph $G_{GBR}$. Accordingly, the Luhya population has one unique bubble. The bubble counter distance between the two population graphs is then

\[
d_{BC}(G_{GBR}, G_{LWK}) = \frac{1}{3} = 0.333.
\]

Table 3.5: Genotype frequencies in the two populations for each of the variants in the VCF file in Table 3.4.

          GBR           LWK
POS       UU  UV  VV    UU  UV  VV
16123812  15   4   1    11   8   1
16130350  20   0   0    17   3   0
16130558  14   6   0    14   6   0

The graph edit distance between the two graphs is 4. The minimal cost path from $G_{LWK}$ in Figure 3.10b to $G_{GBR}$ in Figure 3.10a is the deletion of the four edges surrounding position 16130350, in addition to substitution of the first intermediate node (marked “I”) with the second. Finally, the last intermediate node is deleted. Each of the four edge deletions has score 1, while the node deletion and substitution have cost 0. The cost is then

\[
\mathrm{cost}(G_{LWK}, G_{GBR}) = 1 + 1 + 1 + 1 + 0 = 4. \tag{3.26}
\]

To find the graph edit distance using GEDEVO, we must first build the two graphs from the VCF file as discussed in Subsection 3.3.2. GraphBuilder outputs a list of all edges in both graphs. The list is passed to igraph, which makes the illustration shown in Figure 3.11. The graph edit distance between the two graphs, calculated by GEDEVO is

\[
d_{GED}(G_{GBR}, G_{LWK}) = 0.25.
\]

Note that this number is not the same as the raw score we assigned to the minimal complete edit path $\lambda(G_{LWK}, G_{GBR})$. GEDEVO calculates $\mathrm{score}(\lambda(G_{LWK}, G_{GBR}))$ the same way as in (3.26), but also relates it to the maximal possible distance between the graphs. The numbers of edges in the two graphs are

\[
\|G_{LWK}\| = 10 \quad\text{and}\quad \|G_{GBR}\| = 6.
\]

So the total number of edges is 10 + 6 = 16. With the scoring regime from equations (3.3) to (3.8), the maximal edit distance is 16. GEDEVO then states that the graph edit distance is $\frac{4}{16} = 0.25$.

Figure 3.10: Two population graphs built from the same input file, but none of the individuals in the GBR population contain the alternative type of the variant at position 16130350. (a) The GBR population graph for a particular region. It has two bubbles, at positions 16123812 and 16130558, both of which are also contained in the graph in (b). (b) The LWK population graph for the same region. It has three bubbles, at positions 16123812, 16130350 and 16130558. The middle one is unique, since it is not contained in the graph in (a).

Finally, we find the Bayes factor distance. We will use the hyperparameter α = (α1, α2, α3) = (1, 1, 1). First, we calculate the Bayes factor in position i = 16123812. The frequencies in the first row of Table 3.5 translate to the data

y = (y1, y2) = (y1,1, y1,2, y1,3, y2,1, y2,2, y2,3) = (15, 4, 1, 11, 8, 1).

Figure 3.11: The two graphs as depicted by GraphBuilder and igraph. (a) Population graph for GBR over a specific region. (b) Population graph for LWK over a specific region. Note the insertion of a T relative to $G_{GBR}$.

We may now insert the genotype frequencies from Table 3.5 into the model at position i = 16123812. By equation (3.13), the distance at this location is

\[
B_{10,i} = \frac{\dfrac{\Gamma(1)\Gamma(1)\Gamma(1)}{\Gamma(1+1+1)} \cdot \dfrac{\Gamma(27)\Gamma(13)\Gamma(3)}{\Gamma(27+13+3)}}{\dfrac{\Gamma(16)\Gamma(5)\Gamma(2)}{\Gamma(16+5+2)} \cdot \dfrac{\Gamma(12)\Gamma(9)\Gamma(2)}{\Gamma(12+9+2)}} = 3.4389,
\]

and $\log(B_{10,i}) = \log(3.4389) = 1.2352$. Repeating this procedure for positions 16130350 and 16130558, we get the individual Bayes factors for all three positions in Table 3.4. Given that the assumption of independence among positions holds, we may now use the product over all positions as a test statistic for the difference between the two population graphs. By Definition 3.6, the Bayes factor metric is the sum of the logarithms of the individual Bayes factors,

\[
d_{BF}(G_{GBR}, G_{LWK}) = \sum_i \log(B_{10,i}) = 1.2352 + 1.9673 + 2.8133 = 6.0158.
\]
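The arithmetic of the Bayes factor distance in Example 3.3 can be reproduced with log-gamma functions, which avoids overflow in the large gamma values. The Python sketch below mirrors the ratio exactly as it is written out in the example. It is not a restatement of equation (3.13) itself, which lies outside this section, and the function names are illustrative assumptions.

```python
from math import exp, lgamma

def log_dirichlet_norm(a):
    """log( prod_j Gamma(a_j) / Gamma(sum_j a_j) )."""
    return sum(lgamma(v) for v in a) - lgamma(sum(a))

def log_bayes_factor(y1, y2, alpha=(1, 1, 1)):
    """log of the ratio written out in Example 3.3 for one position.

    y1 and y2 are the (UU, UV, VV) genotype counts in the two populations.
    """
    pooled = [a + u + v for a, u, v in zip(alpha, y1, y2)]
    first = [a + u for a, u in zip(alpha, y1)]
    second = [a + v for a, v in zip(alpha, y2)]
    return (log_dirichlet_norm(alpha) + log_dirichlet_norm(pooled)
            - log_dirichlet_norm(first) - log_dirichlet_norm(second))

# Genotype counts (GBR, LWK) per position, copied from Table 3.5.
counts = {
    16123812: ((15, 4, 1), (11, 8, 1)),
    16130350: ((20, 0, 0), (17, 3, 0)),
    16130558: ((14, 6, 0), (14, 6, 0)),
}
log_bf = {pos: log_bayes_factor(y1, y2) for pos, (y1, y2) in counts.items()}
d_bf = sum(log_bf.values())
for pos, value in log_bf.items():
    print(pos, round(exp(value), 4), round(value, 4))
print(round(d_bf, 4))
```

Summing on the log scale is also what makes the measure usable when the number of bubbles M is large, as in Chapter 4.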

We have now calculated the distance between two graphs in three rather different ways. The result may now be evaluated by performing a permutation test.

Chapter 4

Results

In the following, the methods developed in Chapter 3 are employed on the data reviewed in Chapter 2. Distances between all 15 pairs of the six populations are reported. In addition, the distances are evaluated using the permutation test of Section 3.5, and the results of the tests are given. The relations among the populations are visualized in three dimensions by multidimensional scaling.

4.1 Distances Between Population Graphs

Computer programs to calculate the distance metrics described in Subsections 3.1.1, 3.1.2 and 3.2.1 were used to quantify the differences between a few selected populations described in the data section. Implementational details are covered in Section 3.3. All methods were given the same input, which was the 5 000 first lines of the VCF file (see Section 2.2).

4.1.1 Bubble Counter

The bubble counter distances dBC from equation (3.1) are reported in Table 4.1. It may be of interest to note that the European (GBR and FIN) and Asian (CHB and CHS) populations appear to be close to each other. The distances from any of the European or Asian populations to the African ones are undoubtedly larger. Furthermore, the distance between the two African populations is small compared to the distances from any of the African populations to any of the others. The same goes for the two Han Chinese populations. The distance between them is small compared to the distance from any of the Han Chinese populations to any of the others. In fact, the bubble counter distance between CHB and CHS is the smallest distance measured by the bubble counter. In general, there seems to be a slight effect of location or race on the bubble counter metric. The populations located close to each other in the real world (by geographical distance) seem to be a little closer to each other by dBC-distance.

Table 4.1: Distances measured by the bubble counter metric.

      GBR    FIN    LWK    YRI    CHB    CHS
GBR   -
FIN   0.313  -
LWK   0.542  0.597  -
YRI   0.602  0.566  0.398  -
CHB   0.291  0.354  0.531  0.579  -
CHS   0.382  0.290  0.601  0.571  0.229  -

4.1.2 Graph Edit Distance

The graph edit distances (dGED) from equation (3.2) are reported in Table 4.2. It looks like the populations relate to each other in a similar way when we look at bubble counter distance and graph edit distance. The European and Asian populations seem to be more closely related to each other than to the African populations, and the smallest distance is the one from CHB to CHS. Moreover, the two African populations are more closely related to each other than to any of the European or Asian populations. Remember that due to the “random generation” step in GEDEVO (see Section 3.3), the distance may vary slightly from one calculation to another. This is treated as random, added noise in the permutation test, because it happens to every distance calculated by GEDEVO. The noise is rather small, and the distances presented in Panel A of Table 4.2 are the means of 10 distances for each combination. The estimated standard deviations for the means are given in Panel B below.

4.1.3 Bayes Factor

The Bayes factor distances (dBF) from equation (3.12) are given in Panel A of Table 4.3. The numbers reported here are

\[
d_{BF}(G_1, G_2) = \log(\tilde{B}_{10}) = \sum_{i=1}^{M} \log B_{10,i},
\]

where M is the total number of bubbles, and the prior distribution is the Dirichlet(1, 1, 1), as noted in Section 3.2.1. This number is the sum of the logarithms of the individual Bayes factors of H1 with respect to H0, i.e. the values reported in the table are the evidence in favor of H1. Recall that each bubble has its own set of genotypes in both populations, and that these are used to calculate the Bayes factor at this position. In Panel B, the Bayes factor distances have been transformed by subtracting the smallest value and adding 1000, in order to improve the readability. This explains why the distance between CHB and CHS is 1000 in Panel B of Table 4.3. Because the Bayes factor distance is interpreted as the support for the alternative hypothesis, a larger value of $\tilde{B}_{10}$ means that there is more evidence in favor of H1. Consequently, the largest values in Table 4.3 represent the largest distances between populations. The results are large in magnitude. The reason is that the flat prior results in extreme B10-values at positions where the populations are very similar. The problem is amplified by a data feature explained in Section 2.2, namely that of bubbles that are not contained in any of the two populations in question, but appear in one of the other four populations.

Panel A in Table 4.3 reports evidence that is extremely decisive according to Jeffreys (1961) (see Table 3.2). Accordingly, the conclusion of the hypothesis test where we use the Bayes factor directly as a test statistic is strongly negative, i.e. it states that there is no difference among any of the populations. Due to features of the data and the prior distribution, it is not really informative to employ Jeffreys’ terminology to this problem. This topic is further discussed in Subsection 5.1.3 in the next chapter. If we only consider the Bayes factor distances relative to each other, the results in Table 4.3 seem to agree with the bubble counter and the graph edit distance. Also by dBF, the European and Asian populations are closer to each other than to the African populations, and the distance between CHB and CHS is the smallest by this metric as well.

Table 4.2: Distances measured by graph edit distance.

Panel A: Means of 10 dGED-distances measured for each combination.

      GBR     FIN     LWK     YRI     CHB     CHS
GBR   -
FIN   0.2349  -
LWK   0.3280  0.3964  -
YRI   0.2988  0.3301  0.2713  -
CHB   0.1821  0.2385  0.3303  0.2996  -
CHS   0.2484  0.1912  0.3703  0.3080  0.1663  -

Panel B: Estimated standard deviations for the dGED-means.

      GBR       FIN       LWK       YRI       CHB       CHS
GBR   -
FIN   0.001040  -
LWK   0.004237  0.004532  -
YRI   0.007584  0.003039  0.002383  -
CHB   0.001806  0.002348  0.003463  0.002626  -
CHS   0.001195  0.001529  0.002598  0.003384  0.001107  -

Table 4.3: Distances measured by the Bayes factor metric.

Panel A.

      GBR       FIN       LWK       YRI       CHB        CHS
GBR   -
FIN   -9510.11  -
LWK   -7050.61  -6463.23  -
YRI   -5574.35  -7473.48  -8202.23  -
CHB   -8574.94  -9133.92  -6366.04  -6133.81  -
CHS   -8214.38  -9295.01  -5374.12  -6046.46  -10271.05  -

Panel B: Transformation by subtracting the smallest value and adding 1000.

      GBR      FIN      LWK      YRI      CHB      CHS
GBR   -
FIN   1760.94  -
LWK   4220.44  4807.82  -
YRI   5696.70  3797.57  3068.82  -
CHB   2696.11  2137.13  4905.01  5137.24  -
CHS   3056.67  1976.04  5896.93  5224.59  1000.00  -

4.2 Visualization by Multidimensional Scaling

The relative distances for the various measures can be further explored in Figure 4.1. Here, multidimensional scaling is used to visualize the populations relative to each other. Let $d_{M,i,j}$ be the distance between population i and j by metric M. Multidimensional scaling is an algorithm to find values $z_1, z_2, \ldots, z_n$ that minimize what is called the stress function

\[
S(z_1, z_2, \ldots, z_n) = \sum_{i \ne i'} \left( d_{M,i,i'} - \|z_i - z_{i'}\| \right)^2
\]

(Friedman, Hastie, and Tibshirani, 2009, page 570). Multidimensional scaling tries to find an r-dimensional representation of the data, such that the distances in the lower-dimensional space match the original data in t-dimensional space as well as possible, where r < t (Izenman, 2008, chapter 13 and Friedman, Hastie, and Tibshirani, 2009, pages 570-572). In Figure 4.1a, we can see an apparent clustering effect, at least in two dimensions. The populations are grouped according to which continent they reside in. Figure 4.1b also shows a tendency towards a geographical effect. The two Asian populations are still close together, a feature that is confirmed by the distance metrics. Figure 4.1c is a little harder to interpret in light of the results from the distance metrics. This figure shows no obvious grouping that agrees with the other results.
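The stress function defined above can be evaluated directly for any candidate configuration. The Python sketch below only computes the stress. It does not perform the minimization that multidimensional scaling carries out, and the dictionary-based data layout is an assumption made for illustration.

```python
from math import dist

def stress(distances, coords):
    """Sum over ordered pairs i != i' of (d_{i,i'} - ||z_i - z_i'||)^2.

    distances maps frozenset({i, j}) to the target distance, and coords
    maps each label to its coordinates in the low-dimensional space.
    """
    total = 0.0
    for i in coords:
        for j in coords:
            if i != j:
                target = distances[frozenset((i, j))]
                total += (target - dist(coords[i], coords[j])) ** 2
    return total

# Target distances forming a 3-4-5 right triangle.
targets = {
    frozenset(("A", "B")): 3.0,
    frozenset(("B", "C")): 4.0,
    frozenset(("A", "C")): 5.0,
}
# A planar configuration reproduces the targets exactly ...
planar = {"A": (0.0, 0.0), "B": (3.0, 0.0), "C": (3.0, 4.0)}
# ... while a configuration on a line must distort one distance.
on_line = {"A": (0.0,), "B": (3.0,), "C": (7.0,)}
stress_planar = stress(targets, planar)
stress_line = stress(targets, on_line)
print(stress_planar, stress_line)
```

Each unordered pair is counted twice, in line with the sum over all i ≠ i' in the definition.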

4.3 Permutation Tests

As mentioned in Section 3.3.3, the algorithm that calculates the graph edit distance under the hood in GEDEVO scales poorly in the number of nodes. The permutation test will therefore be very time-consuming if we increase the amount of data. Due to the computational challenges with graph edit distance, the permutation test for this measure has to be done with fewer samples than for the two others. Consequently, the permutation test for dGED was performed using 300 permutations, in contrast to the ones for dBC and dBF, which were performed using 3000 random samples. In addition, the permutation tests for dBC and dBF were repeated using 300 permutations. This was done in order to compare the results obtained using a varying number of permutations. The approximate permutation distributions of the distances for all pairwise population comparisons by the bubble counter metric are shown in Figures 4.2 and 4.3. The analogous distributions for distance by the Bayes factor metric are shown in Figures 4.5 and 4.6, and those for graph edit distance in Figure 4.4. The histograms made using 300 permutations show roughly the same general patterns as the ones with 3000 permutations. The dashed red lines are the 2.5 and 97.5 percentiles, respectively. Now we can compare Tables 4.1, 4.2 and 4.3 with the corresponding approximate distributions to see if the observed distances are unlikely observations. If this is the case, we have support for the alternative hypothesis, which states that distances between populations are in fact extreme values in the permutation distributions.

4.3.1 p-values In addition to visually inspecting the histograms, we may estimate the p-values of the hypothesis test as described in Section 3.5. The p-values for the tests of equal underlying distributions for the pairwise comparison of populations by dBC -distance can be found in Table 4.4. Similar for dBF and dGED are reported in Table 4.6 and 4.5, respectively. Remember from Subsection 3.5.1 that the p-values are estimated as k + 1 p = , (4.1) B + 1 where k is the number of simulated distances that exceeds the observed dis- tance. B is the total number of permutations. Clearly, the p-value can never be zero, because under the null hypothesis, each permutation is equally likely, and the observed distance is just one of these permutations. If the Monte Carlo sampling does not reveal any ordering of the groups that causes a simulated distance at least as extreme as the observed one, the value will be estimated 1 as B+1 . In a situation like this, we will obtain radically different (but always small) p-values for different numbers of permutations. By this argument, a very low p-value may often be overestimated by (4.1), and the exact p-value 1 is less than B+1 . 62 CHAPTER 4. RESULTS

Table 4.4 shows the p-values for the permutation test of dBC with 3000 permutations (Panel A) and 300 permutations (Panel B). The results are quite similar in the two panels, but the smallest values differ by an order of magnitude because of the different numbers of permutations. In a situation where none of the permutations gives a distance $d_{\mathrm{perm}} > d_{\mathrm{obs}}$, we will have

$$P\big(d_{\mathrm{perm}} \ge d_{BC}(G_{GBR}, G_{LWK})\big) = \frac{1}{3001} = 0.0003$$

in Panel A, while Panel B will state that

$$P\big(d_{\mathrm{perm}} \ge d_{BC}(G_{GBR}, G_{LWK})\big) = \frac{1}{301} = 0.0033,$$

rounded to four decimals. In this situation, the p-values are grossly overestimated by (4.1).

The permutation test yields very small p-values. All are significant at the 0.01 level except the one for the combination of CHB and CHS, which has a p-value of 0.0670 using 3000 permutations. In fact, all the significant p-values in Table 4.4 take the minimum value $\frac{1}{B+1}$.

The conclusions from Table 4.4 are repeated by the permutation tests using dBF and dGED. Tables 4.6 and 4.5 also show that all tests reject the null hypothesis at the 0.01 level except the test for equality of the two Chinese populations. A reduction from 3000 to 300 permutations in the tests for dBC- and dBF-distance does not seem to cause any drastic changes to the conclusions.

Interestingly, the conclusions of the permutation tests are very alike across the measures, and in general extremely significant. One may also note that the permutation test results support the observed distances. When two populations are close to each other by both dBC- and dBF-distance, the p-value of the permutation test of difference between the two populations is large (although the distance between CHB and CHS is the only example of this). Conversely, when the observed distance is large, the permutation p-value is small, indicating that there is, in fact, a difference between the populations.

Table 4.4: Estimated permutation p-values for dBC. Values in Panel A are estimated using n = 3000 permutations. Values in Panel B are estimated using n = 300.

Panel A: n = 3000.

        GBR     FIN     LWK     YRI     CHB     CHS
GBR     -
FIN     0.0003  -
LWK     0.0003  0.0003  -
YRI     0.0003  0.0003  0.0003  -
CHB     0.0003  0.0003  0.0003  0.0003  -
CHS     0.0003  0.0003  0.0003  0.0003  0.0670  -

Panel B: n = 300.

        GBR     FIN     LWK     YRI     CHB     CHS
GBR     -
FIN     0.0033  -
LWK     0.0033  0.0033  -
YRI     0.0033  0.0033  0.0033  -
CHB     0.0033  0.0033  0.0033  0.0033  -
CHS     0.0033  0.0033  0.0033  0.0033  0.0900  -

Table 4.5: Estimated permutation p-values for dGED. The values are estimated using n = 300 permutations.

        GBR     FIN     LWK     YRI     CHB     CHS
GBR     -
FIN     0.0033  -
LWK     0.0033  0.0033  -
YRI     0.0033  0.0033  0.0033  -
CHB     0.0033  0.0033  0.0033  0.0033  -
CHS     0.0033  0.0033  0.0033  0.0033  0.0831  -

Table 4.6: Estimated permutation p-values for dBF. Values in Panel A are estimated using n = 3000 permutations. Values in Panel B are estimated using n = 300.

Panel A: n = 3000.

        GBR     FIN     LWK     YRI     CHB     CHS
GBR     -
FIN     0.0007  -
LWK     0.0003  0.0003  -
YRI     0.0003  0.0003  0.0003  -
CHB     0.0003  0.0003  0.0003  0.0003  -
CHS     0.0003  0.0003  0.0003  0.0003  0.1740  -

Panel B: n = 300.

        GBR     FIN     LWK     YRI     CHB     CHS
GBR     -
FIN     0.0033  -
LWK     0.0033  0.0033  -
YRI     0.0033  0.0033  0.0033  -
CHB     0.0033  0.0033  0.0033  0.0033  -
CHS     0.0033  0.0033  0.0033  0.0033  0.1933  -

[Figure 4.1 panels: three-dimensional scatter plots of the populations GBR, FIN, LWK, YRI, CHB and CHS for (a) Bubble counter distance, (b) Graph edit distance and (c) Bayes factor distance.]

Figure 4.1: Multidimensional scaling for the three measures. Visualized in three dimensions by the R-package rgl (Adler and Murdoch, 2017). Colors indicate in which continent the population is found. Yellow is Europe, maroon is Africa and dark gray is Asia.

[Figure 4.2 panels: one histogram per population pair (GBR−FIN, GBR−LWK, FIN−LWK, GBR−YRI, FIN−YRI, LWK−YRI, GBR−CHB, FIN−CHB, LWK−CHB, YRI−CHB, GBR−CHS, FIN−CHS, LWK−CHS, YRI−CHS, CHB−CHS).]

Figure 4.2: Approximate distributions of the dBC-distance between all combinations of populations using 3000 permutations. The observed distance is shown in blue if it fits inside the window. If not, it is left out.

[Figure 4.3 panels: one histogram per population pair (GBR−FIN, GBR−LWK, FIN−LWK, GBR−YRI, FIN−YRI, LWK−YRI, GBR−CHB, FIN−CHB, LWK−CHB, YRI−CHB, GBR−CHS, FIN−CHS, LWK−CHS, YRI−CHS, CHB−CHS).]

Figure 4.3: Approximate distributions of the dBC-distance between all combinations of populations using 300 permutations. The observed distance is shown in blue if it fits inside the window. If not, it is left out.

[Figure 4.4 panels: one histogram per population pair (GBR−FIN, GBR−LWK, FIN−LWK, GBR−YRI, FIN−YRI, LWK−YRI, GBR−CHB, FIN−CHB, LWK−CHB, YRI−CHB, GBR−CHS, FIN−CHS, LWK−CHS, YRI−CHS, CHB−CHS).]

Figure 4.4: Approximate distributions of the dGED-distance between all combinations of populations using 300 permutations. The observed distance is shown in blue if it fits inside the window. If not, it is left out.

[Figure 4.5 panels: one histogram per population pair (GBR−FIN, GBR−LWK, FIN−LWK, GBR−YRI, FIN−YRI, LWK−YRI, GBR−CHB, FIN−CHB, LWK−CHB, YRI−CHB, GBR−CHS, FIN−CHS, LWK−CHS, YRI−CHS, CHB−CHS).]

Figure 4.5: Approximate distributions of the dBF-distance between all combinations of populations using 3000 permutations. The observed distance is shown in blue if it fits inside the window. If not, it is left out.

[Figure 4.6 panels: one histogram per population pair (GBR−FIN, GBR−LWK, FIN−LWK, GBR−YRI, FIN−YRI, LWK−YRI, GBR−CHB, FIN−CHB, LWK−CHB, YRI−CHB, GBR−CHS, FIN−CHS, LWK−CHS, YRI−CHS, CHB−CHS).]

Figure 4.6: Approximate distributions of the dBF-distance between all combinations of populations using 300 permutations. The observed distance is shown in blue if it fits inside the window. If not, it is left out.

Chapter 5

Discussion

The work in this thesis can be divided into two parts. The first part deals with the theoretical development of distance metrics to quantify the difference between two population graphs. The second concerns testing the distance metrics using data from real human populations, and also includes the implementation of a computer program that builds graph objects from VCF files. I will now discuss the distance metrics, the GraphBuilder and the permutation tests that were developed and carried out in the previous chapters. In addition, I will give a few remarks on the graph type used in this thesis and introduce two other graph types that may improve the analysis. Finally, I will review some biological features that may be of interest when applying the methods to real-world problems.

5.1 Distance Metrics

Three ways of measuring differences between population graphs were developed in Sections 3.1 and 3.2. The rationale behind this was that the distance metrics may focus on different concepts and can therefore complement each other. In general, the three measures seem to agree on the relative sizes of the differences. For instance, they all agree that there seems to be a regional effect and that the two populations closest to each other are the two Chinese populations.

A similarity of the three measures is that they all focus on local differences, which are aggregated to a global score. When testing the local difference at a particular position, the bubble counter and the Bayes factor measure are completely ignorant of all other locations. This approach may cause them to misinterpret the effect of large structural changes on the distance.

The following three subsections concern problems with and improvements of the distance measures. While most of the points made below are tailored to the individual measures, some improvements apply equally well to all three. One example is the use of quality weightings. Each variant in the VCF file is accompanied by a quality score that can be used as a measure of confidence that the variant is real and not the product of a sequencing error. To incorporate it into the distance measures, we can simply weight each individual contribution to the distance by the quality score. The weighting function may, for example, be linear or logarithmic.

In addition to the quality weightings, it is possible to weight by the type of mutation and the bases involved. Intuitively, the problem is simple: are different mutations equally likely? The answer is no, but the task of assigning probabilities to different mutation types is far more complex. If a probability model for mutation types were given, the probabilities could be used to weight the contribution of a specific genetic difference to the distances. For example, we could (very generally) state that an insertion is less likely than a substitution, so an insertion should give a larger contribution to the distance than a substitution. This issue is not explored in this thesis, but could be of great importance. It is a task that requires intricate knowledge of the biological and chemical properties of DNA, and consequently one that needs assistance from biologists.

The relationships between the measures can be viewed in the correlation plots in Figure 5.1. The plots display a remarkably linear relationship between all three measures, meaning that the measures relate the populations to each other in a similar way. It is comforting that the measures agree. However, the linearity is also an indicator of resemblance between the measures. If they operate in similar ways, they may be inclined to give similar results. Ideally, the metrics focus on different aspects, but still agree in results. Such a situation is a basis for making powerful conclusions.
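The quality weighting suggested earlier in this section could look roughly like the sketch below. Nothing here is taken from the thesis implementation: the function names, the cap `max_qual = 100` and the normalisation of the weights to the interval [0, 1] are assumptions chosen only to make the linear and logarithmic alternatives concrete.

```python
import math


def quality_weight(qual, scheme="linear", max_qual=100.0):
    """Map a variant quality score to a weight in [0, 1] (hypothetical scaling)."""
    capped = min(qual, max_qual)  # assumed cap so that very high scores do not dominate
    if scheme == "linear":
        return capped / max_qual
    if scheme == "log":
        return math.log1p(capped) / math.log1p(max_qual)
    raise ValueError("scheme must be 'linear' or 'log'")


def weighted_distance(contributions, quals, scheme="linear"):
    """Sum per-variant distance contributions, each weighted by its quality score."""
    return sum(c * quality_weight(q, scheme) for c, q in zip(contributions, quals))


# Two variants with equal raw contribution but different quality scores.
linear_total = weighted_distance([1.0, 1.0], [100.0, 50.0], scheme="linear")
log_total = weighted_distance([1.0, 1.0], [100.0, 50.0], scheme="log")
```

The logarithmic scheme penalises a moderately low quality score less than the linear one, so the choice between them is a statement about how quickly confidence in a variant should fall off.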

5.1.1 Bubble Counter

The bubble counter has the virtue of being very simple, yet it appears to be quite effective. The permutation tests report the observed dBC-distances to be extremely significant.

As we add more individuals to the populations, the total number of bubbles will increase. A downside of the bubble counter is that in many cases we will also start to see fewer unique bubbles. An important reason for this is that the metric does not incorporate information about the genotype frequencies. This means that a bubble with an uneven distribution of genotypes in one population is treated the same way as any other bubble, as long as at least one individual has a configuration other than homozygous for the reference type.

Say, for example, that in one population 40% of the individuals are homozygous for the reference, 30% are heterozygous and 30% are homozygous for the alternative type, while in the other population the genotype frequencies are 2%, 1% and 97%, respectively. These two rather different situations are not distinguished from each other by the bubble counter distance metric. However, the observation also reveals a potential for improvement: we can weight the impact of each bubble based on the extremity of its genotype frequencies.

[Figure 5.1 panels: scatter plots of dBC against dGED, dBC against dBF, and dGED against dBF.]

Figure 5.1: Correlation plots between distances calculated by different measures.

The bubble counter does not distinguish between different kinds of events. As an example, consider the case where the reference type is “A” and the alternative type is “AT”. This variant models the insertion of a “T”, but the bubble counter treats it the same way as, e.g., a SNP. Both are just considered single bubbles, and in order for the two bubbles to give different contributions to the measure, we need to weight them by what the nodes contain. This modification is essentially what we discussed in the introduction to Section 5.1.

Another application, not related to distance measurement, concentrates on the bubbles that are unique to a population. The idea is that with many individuals included, the number of unique bubbles is small. However, these variants are more defining for a population than the non-unique ones, because they contain genotypes that are not present in other populations. If an adequate number of individuals is included in the population, the unique bubbles may be considered its “signature”, something that makes it distinguishable from other populations. The collection of unique bubbles may thus serve a variety of purposes, e.g. as a marker for the population in future research. Another use of the collection of unique bubbles is as a set of especially variable sites that can be directly analyzed across groups. This could reveal a measure of variation on superpopulations at an arbitrary level. A third option is to consider it a set of rare variants that should be examined in case they could be associated with disease.
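The genotype-frequency example in this subsection can be made concrete with a small sketch. The presence rule below follows the description in the text (a bubble exists as soon as one individual is not homozygous for the reference). The frequency-based weight is an assumption introduced only for illustration: it uses the total variation distance between the two genotype frequency vectors, which is one possible choice and not a method from the thesis.

```python
def has_bubble(genotype_counts):
    """A bubble is present if any individual is not homozygous for the reference."""
    hom_ref, het, hom_alt = genotype_counts
    return het + hom_alt > 0


def genotype_frequencies(genotype_counts):
    """Convert (hom_ref, het, hom_alt) counts to relative frequencies."""
    total = sum(genotype_counts)
    return [count / total for count in genotype_counts]


def frequency_weight(counts_1, counts_2):
    """Hypothetical weight: total variation distance between genotype frequencies."""
    freqs_1 = genotype_frequencies(counts_1)
    freqs_2 = genotype_frequencies(counts_2)
    return 0.5 * sum(abs(a - b) for a, b in zip(freqs_1, freqs_2))


# Counts per 100 individuals, matching the percentages used in the text.
population_1 = (40, 30, 30)
population_2 = (2, 1, 97)

# Both populations contain the bubble, so it is not unique to either of them
# and the bubble counter sees no difference at this site.
unique_to_one = has_bubble(population_1) != has_bubble(population_2)

# A frequency-based weight does register the difference.
weight = frequency_weight(population_1, population_2)
```

Any weight of this kind moves the bubble counter closer to the Bayes factor measure, which is based directly on genotype frequencies.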

5.1.2 Graph Edit Distance

Recall that the graphs in this thesis are simple in structure. They consist mostly of single substitutions and indels, scattered along the reference sequence. In the graph, this translates to structures with two nodes separated by a single intermediate node (see Figure 2.5). Consequently, graph edit distance will in most locations just change single nodes and edges in isolated substructures that may be present in $G_1$ and absent in $G_2$, or vice versa. In this way, dGED behaves rather similarly to dBC. When a single bubble is encountered in $G_1$, the graph edit distance grows if the bubble is not present in $G_2$ as well, i.e. if the bubble is unique to $G_1$. The benefits of graph edit distance over the bubble counter come into play when the graph structures get more complicated.

A possible extension of the graph edit distance is to use different scoring regimes. The use of genotype frequencies to tune the scoring function may provide an increase in accuracy. In addition, it can be shown that computing the edit distance with the following scoring function is equivalent to the maximal common subgraph problem (Bunke, 1997). Given the subgraphs $G_1^*$ of $G_1$ and $G_2^*$ of $G_2$, we have

$$
\begin{aligned}
c_{ni}(v) &= c_{nd}(v') = 1, \\
c_{nd}(v) &= c_{ni}(v') = 1, \\
c_{ns}(v, v') &= \begin{cases} 0 & \text{if } \alpha_1(v) = \alpha_2(v'), \\ \infty & \text{otherwise,} \end{cases} \\
c_{ei}(e) &= c_{ed}(e') = 0, \\
c_{ed}(e) &= c_{ei}(e') = 0, \\
c_{es}(e, e') &= \begin{cases} 0 & \text{if } \beta_1(e) = \beta_2(e'), \\ \infty & \text{otherwise,} \end{cases}
\end{aligned}
$$

for all $v \in V_1^*$, $v' \in V_2^*$, $e \in E_1^*$ and $e' \in E_2^*$, where $G_1^* \subseteq G_1$ and $G_2^* \subseteq G_2$. This scoring regime makes graph edit distance equivalent to the method of maximal common subgraph.

Presumably, the maximal common subgraph is a rather poor way to measure distance between population graphs. It is extremely intolerant in the way that it searches for subgraphs. The infinite penalty for substitutions causes it to completely disregard graphs with an edit distance that could potentially be tiny under another scoring rule. However, there are perhaps other applications suited to the maximal common subgraph. For example, it can be used to identify similar regions. In this context, a similar region is a common subgraph or a collection of common subgraphs over a region in the genome. Paired with the application for finding variable regions with the bubble counter, this may prove a useful tool.

A suggestion for improving graph edit distance is to divide the graph into smaller regions. As previously mentioned, a substantial obstacle in this work has been the NP-hard complexity of computing graph edit distance. To greatly reduce the computation time, we can divide the graphs into subgraphs and calculate graph edit distance on pairs of subgraphs. The idea is then to treat the total graph edit distance as the sum of the individual edit distances between the subgraphs. As we have repeatedly observed in this thesis, variant graphs typically have numerous large regions that are identical in several populations. An identical region is a location in the graph where no alternative paths exist. In other words, there is only one way of traversing such an area, hence it is a suitable location for making a partition.
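The partitioning idea in the last paragraph can be sketched as follows. The representation is an assumption made for this example: each graph is reduced to a list of bubbles given as hypothetical `(start, end)` intervals on the reference sequence, and cuts are placed only at reference positions that no bubble from either graph spans. The per-region distance used here is a simple stand-in (the number of bubbles found in only one of the two graphs), not a real graph edit distance and not the GEDEVO algorithm; any per-region distance function could be passed in its place.

```python
import bisect


def cut_points(bubbles, max_per_chunk):
    """Reference positions where the graph can be split because no bubble spans them."""
    cuts, count, reach = [], 0, None
    for start, end in sorted(bubbles):
        # A cut is allowed only in a linear stretch, i.e. past the end of every earlier bubble.
        if reach is not None and start > reach and count >= max_per_chunk:
            cuts.append(start)
            count = 0
        count += 1
        reach = end if reach is None else max(reach, end)
    return cuts


def split_by_cuts(bubbles, cuts):
    """Assign each bubble to the region delimited by consecutive cut points."""
    regions = [[] for _ in range(len(cuts) + 1)]
    for start, end in sorted(bubbles):
        regions[bisect.bisect_right(cuts, start)].append((start, end))
    return regions


def summed_distance(bubbles_1, bubbles_2, region_distance, max_per_chunk):
    """Total distance as the sum of the distances between matching regions."""
    cuts = cut_points(set(bubbles_1) | set(bubbles_2), max_per_chunk)
    parts_1 = split_by_cuts(bubbles_1, cuts)
    parts_2 = split_by_cuts(bubbles_2, cuts)
    return sum(region_distance(a, b) for a, b in zip(parts_1, parts_2))


def bubble_set_distance(bubbles_1, bubbles_2):
    """Stand-in for a per-region edit distance: bubbles present in only one graph."""
    return len(set(bubbles_1) ^ set(bubbles_2))


graph_1 = [(10, 11), (20, 21), (40, 45)]
graph_2 = [(10, 11), (30, 31), (42, 43)]
total = summed_distance(graph_1, graph_2, bubble_set_distance, max_per_chunk=2)
```

For this additive stand-in the summed distance equals the distance computed on the whole graphs. For a true graph edit distance the sum over regions should be read as the working assumption described in the text, with the benefit that each region is small enough to be handled by an expensive algorithm.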

5.1.3 Bayes Factor

The Bayes factor measure states that we may find the likelihood that the genotype data at each location arise as a consequence of a model, given a prior distribution of the genotypes. Two separate models are then defined: a null model stating that the genotype frequencies are the same in the two populations, and an alternative model stating that they are different. The Bayes factor is the ratio of the data likelihoods under the two models, and each Bayes factor can be interpreted probabilistically to indicate which model is more likely.

Contrary to dBC, dBF incorporates information about genotype frequencies. In fact, it is based on it. dBF is the log-likelihood formed from all the individual Bayes factors. As usual, we can find the likelihood by taking the product of the individual values, given that they are independent. So when using the dBF-distance, we are making the assumption that the distance at the pair of bubbles in question is independent of the distance at any other pair of bubbles. This may not be the case.

Figure 5.2 shows dBF-distance as a function of bubble position for all pairs of populations. The distance at each bubble is plotted against the position of the bubble in the reference sequence. Bubbles where all individuals have the same genotype give extreme contributions to the distance, obscuring the effect of other locations. These locations have been left out of the plots in Figure 5.2, and this issue is discussed in detail below. The scatterplot is hard to interpret because many points overlap. The smoothed line gives us an idea of whether some parts of the sequence tend to show larger distances between the bubbles. If so, we may not be able to treat the locations as independent. An interesting feature of Figure 5.2 is that all plots show some variation in the smoothed curve, meaning that some regions on average give larger distances by the Bayes factor metric.
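A per-bubble Bayes factor of the kind described above, and its aggregation under the independence assumption, can be sketched as follows. This is an illustration under stated assumptions rather than the definition from Section 3.2: genotype counts are taken to be multinomial with a Dirichlet prior (defaulting to Dirichlet(1, 1, 1)), the Bayes factor is oriented as alternative over null, and all function names are hypothetical. The sign and scaling of dBF in the thesis may differ.

```python
import math


def log_multivariate_beta(alpha):
    """Logarithm of the multivariate beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def log_bayes_factor(counts_1, counts_2, prior=(1.0, 1.0, 1.0)):
    """Log Bayes factor (different frequencies vs. shared frequencies) at one bubble.

    counts_1 and counts_2 are (hom_ref, het, hom_alt) genotype counts. The
    multinomial coefficients are the same under both models and cancel.
    """
    post_1 = [a + n for a, n in zip(prior, counts_1)]
    post_2 = [a + n for a, n in zip(prior, counts_2)]
    pooled = [a + n + m for a, n, m in zip(prior, counts_1, counts_2)]
    log_prior = log_multivariate_beta(prior)
    # Alternative model: each population has its own genotype frequencies.
    log_alt = (log_multivariate_beta(post_1) - log_prior
               + log_multivariate_beta(post_2) - log_prior)
    # Null model: both populations share one set of genotype frequencies.
    log_null = log_multivariate_beta(pooled) - log_prior
    return log_alt - log_null


def summed_log_bayes_factor(bubble_counts):
    """Add per-bubble log Bayes factors, treating the bubbles as independent."""
    return sum(log_bayes_factor(c1, c2) for c1, c2 in bubble_counts)


similar = log_bayes_factor((40, 30, 30), (40, 30, 30))   # evidence for the null model
different = log_bayes_factor((40, 30, 30), (2, 1, 97))   # evidence for the alternative
total = summed_log_bayes_factor([((40, 30, 30), (40, 30, 30)),
                                 ((40, 30, 30), (2, 1, 97))])
```

Summing the logarithms corresponds to multiplying the individual Bayes factors, which is only justified if the bubbles are independent.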
The regional variation seen in the smoothed curves of Figure 5.2 suggests an extension of this thesis that concentrates on smaller areas.

Figure 5.3 shows the autocorrelations at lags 1 to 35 for dBF for all population combinations. Autocorrelation is the correlation of a series with itself at another position or time. Figure 5.3 reveals that the autocorrelations are largely positive, but small. Still, quite a lot of the autocorrelations are outside the 0.95 confidence region. A slight decrease of autocorrelation with increasing position lag indicates that the distance is more dependent on variants that are close in the reference genome than on variants that are far away. This is the expected behavior if there is dependence. However, the differences are small in the first 35 lags. To summarize, there may be some dependence in the data. However, the autocorrelations are small, and the issue will not be discussed further here.

Another challenge with the Bayes factor metric is that it is rather sensitive to the choice of prior distribution. Recall from Subsection 3.2.1 that the prior is a Dirichlet(1, 1, 1). This distribution has the advantage of being flat; the hyperparameters are chosen in order not to incorporate excess prior information. The Dirichlet(1, 1, 1) is the uniform distribution on the two-dimensional simplex, thus it places equal weight everywhere. It may very well be that some genotypes are more common than others. If this is the case, then the prior distribution offers a way to include this knowledge in the model. We can e.g. adjust the parameters of the Dirichlet prior to make it place more weight in some areas than others, or choose another prior in the

[Figure panels: scatter plots of Distance (vertical axis) against Position (horizontal axis, 16050408 to 17152836), one per population pair (GBR−FIN, GBR−LWK, FIN−LWK, GBR−YRI, FIN−YRI, LWK−YRI, GBR−CHB, FIN−CHB, LWK−CHB, YRI−CHB, GBR−CHS, FIN−CHS, LWK−CHS, YRI−CHS, CHB−CHS).]
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0 ● ● ●●●●●● ●●●●● ●● ●●●●● ●●●●●●●●●●●●●●●●●●●●●● 16050408 16601622 17152836 16050408 16601622 17152836 16050408 16601622 17152836 Position Position Position

Figure 5.2: A plot of log(B01) per position in the reference sequence. Variants where only one type exists in both popu- lations are left out. 80 CHAPTER 5. DISCUSSION

[Figure 5.3: autocorrelation (ACF, vertical axis) against Lag (horizontal axis, 0 to 30), one panel per population pair: GBR−FIN, GBR−LWK, FIN−LWK, GBR−YRI, FIN−YRI, LWK−YRI, GBR−CHB, FIN−CHB, LWK−CHB, YRI−CHB, GBR−CHS, FIN−CHS, LWK−CHS, YRI−CHS, CHB−CHS.]

Figure 5.3: Autocorrelations for the Bayes factors at each bubble. Variants where only one type exists in both populations are left out.

first place. An important advantage of the Dirichlet prior in this case is that the calculations involved in the Bayes factor are easy to handle. Other priors are not guaranteed to give us an analytic solution at all.

Yet another disadvantage of the Bayes factor metric emerges from the organization of the data, and concerns the treatment of variants that are not present in either of the two populations. If all individuals in both population 1 and population 2 are homozygotic for the reference type at position i, the data at this location will be y1,i = (n1, 0, 0) and y2,i = (n2, 0, 0). In this thesis, we consider the case where n1 = n2. Then, the genotype frequencies are identical, so log(B10) is a large negative number at this location, which is the preferred behavior of the model. In practice, however, log(B10) is so large in magnitude that it obscures the results, because it completely dominates the calculation. The problem is amplified by the data, because the number of such sites is tremendous in comparison to the total number of bubbles.

In the data analysis, all bubbles that are not present in any of the six populations are removed before calculating distances. This ensures that every variant in the data set is present in at least one individual. In this procedure, a total of 1514 variants were removed. Nevertheless, the number of bubbles that are still included in the data, but not present in a given pairwise comparison, may still be large. For example, 554 of the 5000 − 1514 = 3486 remaining bubbles are present in neither the GBR nor the FIN population. Due to the extreme scores of such sites, the logarithm of the product of all the Bayes factors is a very large negative number, which causes trouble for the direct interpretation of the Bayes factor metric. If we were to evaluate it directly, e.g. in accord with Jeffreys’ terminology in Table 3.2, this problem would be crucial. However, the relative sizes of the distances are not influenced by it, so the permutation test is unaffected. The best way to handle this issue may be to adjust the prior distribution.

To shed further light on how unsuitable the prior distribution is, consider the calculation of the Bayes factor for the bubble at position i such as the one described above, i.e. when

y1,i = (n1, 0, 0)

y2,i = (n2, 0, 0).

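Under the assumption that n1 = n2 = n and α = (1, 1, 1), which is the case in the data used to calculate the results of Chapter 4, we can find the Bayes factor for such a site. This may be expressed as a function of the population size n:

\[
\begin{aligned}
B_{10,i}(n) &= \frac{\Gamma\left(\sum_{j=1}^{3} \alpha_j\right)}{\prod_{j=1}^{3} \Gamma(\alpha_j)} \cdot \frac{\Gamma\left(\sum_{j=1}^{3} (y_{1,j} + y_{2,j} + \alpha_j)\right)}{\prod_{j=1}^{3} \Gamma(y_{1,j} + y_{2,j} + \alpha_j)} \cdot \frac{\prod_{j=1}^{3} \Gamma(y_{1,j} + \alpha_j)\,\Gamma(y_{2,j} + \alpha_j)}{\Gamma\left(\sum_{j=1}^{3} (y_{1,j} + \alpha_j)\right) \Gamma\left(\sum_{j=1}^{3} (y_{2,j} + \alpha_j)\right)} \\
&= \frac{\Gamma(3)\,\Gamma(2n + 3)}{\Gamma(2n + 1)} \cdot \frac{\Gamma(n + 1)\,\Gamma(n + 1)}{\Gamma(n + 3)\,\Gamma(n + 3)} \\
&= \frac{2\,(2n + 2)!\,(n!)^2}{(2n)!\,((n + 2)!)^2},
\end{aligned}
\tag{5.1}
\]

where many terms are excluded due to being Γ(1) = 1.

Figure 5.4 displays the logarithm of the Bayes factor as a function of population size n for three different genotype frequencies. The red line corresponds to log(B10,i), i.e. the logarithm of equation (5.1). To make the blue line, let y1,j = y2,j = (n/2, n/2, 0) at position j. This gives the curve

\[
\begin{aligned}
\log\left(B_{10,j}(n)\right) &= \log\left( \Gamma(3)\, \frac{\Gamma(2n + 3)}{[\Gamma(n + 1)]^2} \cdot \frac{\left[\Gamma\left(\frac{n}{2} + 1\right)\right]^4}{[\Gamma(n + 3)]^2} \right) \\
&= \log\left( \frac{2\,(2n + 2)!\,\left(\frac{n}{2}!\right)^4}{(n!)^2\,((n + 2)!)^2} \right).
\end{aligned}
\tag{5.2}
\]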

Finally, let the bubble at position k have the genotype distribution y1,k = y2,k = (n/3, n/3, n/3). Inserting these counts into the general expression for the Bayes factor gives

\[
\begin{aligned}
\log\left(B_{10,k}(n)\right) &= \log\left( \Gamma(3)\, \frac{\Gamma(2n + 3)}{\left[\Gamma\left(\frac{2n}{3} + 1\right)\right]^3} \cdot \frac{\left[\Gamma\left(\frac{n}{3} + 1\right)\right]^6}{[\Gamma(n + 3)]^2} \right) \\
&= \log\left( \frac{2\,(2n + 2)!\,\left(\frac{n}{3}!\right)^6}{\left(\frac{2n}{3}!\right)^3\,((n + 2)!)^2} \right),
\end{aligned}
\tag{5.3}
\]

which translates to the green line in Figure 5.4.

The plot states that the more individuals we include, the smaller B10 we obtain, given that the genotype frequencies are identical. This is a desirable feature, but the figure also illustrates the inconvenience of the uniform prior: the Bayes factor is more extreme in cases where the genotype frequencies are unevenly distributed. Clearly, this is not a desirable feature.
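The behaviour discussed above can be checked numerically. The following is a minimal sketch, assuming the general Dirichlet-multinomial form of the Bayes factor that equation (5.1) starts from; the function name `log_bayes_factor` and the chosen values of n are illustrative assumptions, not part of the thesis code. Working with log-gamma avoids overflow in the factorials.

```python
from math import lgamma, log


def log_bayes_factor(y1, y2, alpha=(1.0, 1.0, 1.0)):
    """log(B10) for genotype counts y1 and y2 under a Dirichlet(alpha) prior."""

    def log_norm(values):
        # log of prod_j Gamma(v_j) / Gamma(sum_j v_j)
        return sum(lgamma(v) for v in values) - lgamma(sum(values))

    separate1 = [y + a for y, a in zip(y1, alpha)]
    separate2 = [y + a for y, a in zip(y2, alpha)]
    pooled = [a + b + c for a, b, c in zip(y1, y2, alpha)]
    # Two separate genotype distributions versus one shared distribution.
    return (log_norm(separate1) + log_norm(separate2)
            - log_norm(pooled) - log_norm(alpha))


# Identical genotype frequencies in both populations, n individuals each.
n = 60  # illustrative value, chosen to be divisible by 2 and 3
red = log_bayes_factor((n, 0, 0), (n, 0, 0))
blue = log_bayes_factor((n / 2, n / 2, 0), (n / 2, n / 2, 0))
green = log_bayes_factor((n / 3, n / 3, n / 3), (n / 3, n / 3, n / 3))

# Closed form for the (n, 0, 0) case, obtained by cancelling factorials.
closed_form = log(4 * (2 * n + 1) / ((n + 1) * (n + 2) ** 2))
```

With the uniform prior, the most uneven distribution (n, 0, 0) gives the most negative value of the three, which is the inconvenience described in the text.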

[Figure 5.4: log(B10) (vertical axis, 0 to −8) against population size n (horizontal axis, 0 to 200), with one curve for each of the genotype distributions (n, 0, 0), (n/2, n/2, 0) and (n/3, n/3, n/3).]

Figure 5.4: A plot of log(B10) as a function of population size n, for a bubble where both populations have identical genotype frequencies. The red line is the function from equation (5.1); the blue and green lines correspond to equations (5.2) and (5.3), respectively.

5.1.4 Other Distances

There are other ways of assessing genetic difference between populations than the measures from Chapter 3. A popular option is the so-called fixation index, FST, introduced in Wright (1921). Recall that if a variant has the alternatives U and V, the genotypes of a diploid organism (such as humans) can be UU, UV and VV. The genotype UV is called heterozygotic, while UU and VV are called homozygotic. Assume two populations P1 and P2, and a superpopulation PT = P1 ∪ P2. Now, let p̄ and q̄ be the average frequencies of type U and V, respectively, for individuals in PT. Then, let p1 and q1 be the frequencies of U and V in population 1. Let p2 and q2 be the same in population 2. The fixation index (Hartl, Clark, and Clark, 1997) is then

\[
F_{ST} = \frac{\sigma_p^2}{\bar{p}\,\bar{q}}, \tag{5.4}
\]

where \(\sigma_p^2 = \left(\frac{p_1 - p_2}{2}\right)^2\) is the variance in the frequency of type U between population 1 and population 2 at this location. Note that with two alternatives we have \(\sigma_p^2 = \sigma_q^2\) since q = 1 − p, so

\[
\left(\frac{p_1 - p_2}{2}\right)^2 = \left(\frac{q_1 - q_2}{2}\right)^2.
\]

For the data considered in this thesis, the fixation index can be calculated for each variant and then aggregated to a score, like the Bayes factor.
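A per-variant calculation of the fixation index could look like the sketch below. It assumes that the two populations are of equal size, so that the average frequency in the superpopulation is the plain mean of p1 and p2, and it returns `None` when only one type exists, since the denominator of equation (5.4) is then zero. The function name and the example frequencies are hypothetical.

```python
def fixation_index(p1, p2):
    """F_ST at one variant, from the frequency of type U in each population."""
    p_bar = (p1 + p2) / 2.0          # assumes equally sized populations
    q_bar = 1.0 - p_bar
    if p_bar * q_bar == 0.0:
        return None                  # undefined when only one type exists
    variance = ((p1 - p2) / 2.0) ** 2
    return variance / (p_bar * q_bar)


# Hypothetical frequencies of type U at three variants in two populations.
frequencies = [(0.2, 0.4), (0.5, 0.5), (1.0, 0.0)]
per_variant = [fixation_index(p1, p2) for p1, p2 in frequencies]
```

How the per-variant values should be aggregated to a single score is left open here, as it is in the text.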

5.2 GraphBuilder

It should be noted that GraphBuilder is made to handle quite simple graphs, and has not been tested on other data than the variants from the 1000 Genomes Project. The first two metrics are developed to do calculations directly on the VCF file, while graph edit distance is calculated on a graph object made by GraphBuilder. It can be argued that none of the analyses done are strictly dependent on the graph made by GraphBuilder. This is of course true, because the graph is made exclusively out of the very same data that dBC and dBF are calculated from. So there is probably a way of implementing graph edit distance directly on the table. The graph objects from GraphBuilder do, however, make it significantly easier to calculate graph edit distance. They also enable the use of existing software, since we can make GraphBuilder represent the graph object in the way we want. The graphs are large, so because of the algorithmic complexity of the graph edit distance, it is necessary to use specialized software. For two large graphs G1 and G2, there is no way that we can search through the entire space of complete edit paths Λ(G1, G2). GEDEVO provides an approximate solution for this problem.

A suboptimal feature of GraphBuilder is that it does not make cycles. The introduction of cycles can potentially save a lot of memory. Consider e.g. the case of a repeat. If we were to allow cycles, we could merely add an edge to the node where the repeat starts, instead of constructing the entire region over again. Hence, we would not need to store new nodes and edges for every repeat. However, the introduction of more advanced features requires more advanced programming. A procedure that allows repeats needs an efficient way of checking if the sequence is, in fact, a repeat. Consequently, the program would have to compare new data with everything already stored in the graph. The variant data used in this thesis is not suited for this. It would be better to do this with sequence reads mapped directly to the assembly graphs (see Subsection 2.1.2 and Appendix A.2).

GraphBuilder has only been tested on the VCF files used in this thesis. No guarantees are given for either the behaviour or the performance on larger data sets, but it is likely that memory usage will scale approximately linearly in the number of nodes. The reason for this is that the introduction of a node in almost every case leads to the introduction of two new edges. In any case, GraphBuilder will not be the limiting factor of this procedure, because the computation time of graph edit distance grows exponentially in the number of nodes.
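To illustrate why the size of such a graph grows linearly with the data, the sketch below builds a simple bubble graph from a reference string and a list of single-base variants. This is not the GraphBuilder implementation; the function name, the input format (0-based position and alternative base, at distinct positions) and the example data are assumptions made for the illustration.

```python
def build_variant_graph(reference, snps):
    """Build a bubble graph from a reference string and (position, alt_base) pairs."""
    nodes = []   # nodes[i] is the sequence stored in node i
    edges = []   # directed edges as (from_node, to_node)

    def add_layer(sequences, previous):
        # Add one node per sequence and connect every previous node to it.
        layer = []
        for sequence in sequences:
            nodes.append(sequence)
            node_id = len(nodes) - 1
            edges.extend((p, node_id) for p in previous)
            layer.append(node_id)
        return layer

    previous = []
    cursor = 0
    for position, alt in sorted(snps):
        if position > cursor:
            # Shared sequence between two bubbles.
            previous = add_layer([reference[cursor:position]], previous)
        # The bubble: one node for the reference base, one for the alternative.
        previous = add_layer([reference[position], alt], previous)
        cursor = position + 1
    if cursor < len(reference):
        add_layer([reference[cursor:]], previous)
    return nodes, edges


# Hypothetical input: a short reference with two single-base variants.
nodes, edges = build_variant_graph("GGATAAGACC", [(3, "A"), (7, "T")])
```

In this sketch, each isolated variant adds three nodes and four edges, so both counts grow linearly with the number of variants.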

5.3 Other Graph Types

The graphs used in this thesis can be called variant graphs because they are made from variants. The data structure was chosen mainly because it is simple to implement, but still has most of the features needed. It is, alas, not optimal for the problem. Other options exist, and some of them may be preferable.

5.3.1 Sequence Graph

Recently, there has been an increased interest in reference graphs, see e.g. Novak, Hickey, et al. (2017, pre-print, not peer-reviewed) and Church, Schneider, Steinberg, et al. (2015). Many graph types have been proposed for this purpose. A notable example is the so-called sequence graph, treated in Paten, Novak, Eizenga, et al. (2017) and Paten, Novak, and Haussler (2014). Contrary to variant graphs, sequence graphs are what we call bidirected. A bidirected graph has vertices with distinct sides, a left side and a right side. This enables the edges to carry labels denoting if the edge is associated with the left or right side of the node (Paten, Novak, Garrison, et al., 2017; Medvedev and Brudno, 2009).

Definition 5.1. A sequence graph is a bidirected graph where each node is labeled with a nucleotide sequence, and each edge connects one side of a node with one side of another node.

[Figure 5.5: a sequence graph with nodes labeled GGA, TAAG, AC, C, AGG, A and AAT.]

Figure 5.5: A sequence graph. Each node contains a DNA sequence and may be concatenated with the sequence in other nodes by walking along the edges of the graph. The direction of the string is determined by what side of the node we enter.

The sequence graph is a more advanced and flexible model for making population graphs. In particular, the bidirectedness is an important property. It saves memory, improves repeat handling and increases the accuracy of graph edit distance relative to the variant graph. It allows edges to point “back” to earlier sequence fragments. Hence, it is not necessary to generate new nodes for each new variation. This way, the sequence graphs can be stored more compactly than variant graphs. Repeats can be handled by letting a node point from its right side back to its left side. Consequently, a traversal may visit the node carrying the repeated sequence multiple times. Another elegant feature of the sequence graph is the treatment of inversions. Remember that an edge can point to either side of a node. Then, an inversion is modeled simply by pointing to the opposite side of the node storing the original sequence.
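The handling of repeats and inversions can be sketched with a small bidirected graph. The sketch assumes that a node entered from its right side is read as the reverse complement of its sequence, which is the usual convention for sequence graphs but is not spelled out in this thesis. The node names, the edges and the function `spell` are hypothetical, and the example graph is not the one in Figure 5.5.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    return sequence.translate(COMPLEMENT)[::-1]


def spell(nodes, edges, walk):
    """Spell the sequence of a walk given as (node, orientation) pairs.

    Orientation "+" enters a node on its left side and reads it forward,
    "-" enters on its right side and reads the reverse complement.
    """
    pieces = []
    previous_exit = None
    for node, orientation in walk:
        enter, leave = ("L", "R") if orientation == "+" else ("R", "L")
        if previous_exit is not None:
            # Each edge joins one side of a node to one side of another node.
            if frozenset([previous_exit, (node, enter)]) not in edges:
                raise ValueError("walk uses an edge that is not in the graph")
        sequence = nodes[node]
        pieces.append(sequence if orientation == "+" else reverse_complement(sequence))
        previous_exit = (node, leave)
    return "".join(pieces)


# Hypothetical example graph with three nodes.
example_nodes = {"a": "GGA", "b": "TAAG", "c": "AC"}
example_edges = {
    frozenset([("a", "R"), ("b", "L")]),   # ordinary edge a -> b
    frozenset([("b", "R"), ("c", "L")]),   # ordinary edge b -> c
    frozenset([("b", "R"), ("b", "L")]),   # repeat: right side of b back to its left side
    frozenset([("a", "R"), ("b", "R")]),   # inversion: enter b from its right side
    frozenset([("b", "L"), ("c", "L")]),   # leave the inverted b towards c
}

forward = spell(example_nodes, example_edges, [("a", "+"), ("b", "+"), ("c", "+")])
repeated = spell(example_nodes, example_edges, [("a", "+"), ("b", "+"), ("b", "+"), ("c", "+")])
inverted = spell(example_nodes, example_edges, [("a", "+"), ("b", "-"), ("c", "+")])
```

The repeat and the inversion reuse node b, so neither requires storing a new node.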

Example 5.1. Consider the sequences GGATAAGACCGGAA, GGAAATACAGGA and GGAACCGGAA. The sequence graph that corresponds to this input is shown in Figure 5.5. A reconstruction of genomic sequences can be found by traversing the graph without consulting a separate reference sequence. An example of a possible traversal of the graph in Figure 5.5 is shown in red, and spells out the sequence GGAAATACCGGAA. ♣

The sequence graph alone can act as a reference structure. Variant graphs may also be modified to incorporate this feature, but such a modification will transform it to a rather cumbersome data structure.

5.3.2 de Bruijn Graph

Another option is to measure distance between de Bruijn graphs. De Bruijn graphs have nodes and edges that relate to the genomic sequence through so-called k-mers.

Definition 5.2. Let S be a string of length n ≥ k. A k-mer is a substring of S of length k.

Definition 5.3. A de Bruijn graph is a directed graph where the nodes are (k − 1)-mers. There is an edge uv from node u to node v if the suffix of length k − 2 from u is equal to the prefix of length k − 2 from v (Compeau, Pevzner, and Tesler, 2011).

Example 5.2. Consider the sequences used to construct the sequence graph in Figure 5.5. In Figure 5.6, the same sequences are used to build a de Bruijn graph with k = 4, so sequences stored in the nodes have length k − 1 = 3. The number of times each node and edge is visited is shown in red. ♣
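The construction in Example 5.2 can be sketched as follows. The counting convention is an assumption: a node is counted once for every occurrence of its (k − 1)-mer in any input sequence, and an edge once for every occurrence of its k-mer. The function name `de_bruijn` is hypothetical.

```python
from collections import Counter


def de_bruijn(sequences, k):
    """Count node ((k - 1)-mer) and edge (k-mer) occurrences over all sequences."""
    node_counts = Counter()
    edge_counts = Counter()
    for sequence in sequences:
        # A sequence of length L contains L - k + 2 substrings of length k - 1.
        for i in range(len(sequence) - k + 2):
            node_counts[sequence[i:i + k - 1]] += 1
        # Each k-mer is an edge from its prefix to its suffix of length k - 1.
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            edge_counts[(kmer[:-1], kmer[1:])] += 1
    return node_counts, edge_counts


sequences = ["GGATAAGACCGGAA", "GGAAATACAGGA", "GGAACCGGAA"]
node_counts, edge_counts = de_bruijn(sequences, k=4)
```

Every edge (u, v) produced this way satisfies the overlap condition of Definition 5.3, since the last k − 2 characters of u equal the first k − 2 characters of v.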

Graph edit distance can be directly applied to measure distance between two de Bruijn graphs, but a modification of the scoring regime should be considered. The Bayes factor measure may also be used, but has to be modified for pairwise comparison of node frequencies between two graphs. In Figure 5.6, the node frequencies are labelled red. The idea is then to construct two separate graphs G1(V1, E1) and G2(V2, E2), using individuals from two separate populations, just like before. The number of times each node is visited will vary between populations, and the differences in node frequency can be exploited by the Bayes factor measure to say something about the distance between the two population graphs. Specifically, we may calculate the individual Bayes factor for each pair of nodes (u, v), where u ∈ V1 and v ∈ V2, and u and v contain the same sequence. In such a setting, a node in G1 that has no corresponding node in G2 (i.e. a node with the same sequence) will be assigned frequency 0 in G2.

The success of the approach described above depends on the k-mer size. If the nodes carry short sequences, the method is largely uninteresting. As an example, consider the case of k = 2, so the length of the node sequences is 1. In this situation, we are just comparing the number of times each base occurs in the sequences contained in the two populations. At the other extreme, when k is very large, no nodes will share the same sequence across the populations. This will result in a conclusion of extreme difference since no nodes are identical, even though they may have large identical regions contained inside the node sequences.

Since de Bruijn graphs are made from k-mers, they will probably introduce heavy dependence into the model, because the k-mers overlap. A read of length L can be divided into L − k + 1 k-mers. This will cause many k-mers to cover the same regions, and therefore the node frequencies are likely to be highly dependent.
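The pairing of node frequencies between two population graphs could be organised as in the sketch below, where a node that is missing from one graph is given frequency 0 there. The function name and the example counts are hypothetical, and the sketch stops at the pairing itself; the modified Bayes factor is not specified in the text and is therefore not implemented.

```python
from collections import Counter


def paired_node_frequencies(counts1, counts2):
    """Pair the frequency of every node sequence in two de Bruijn graphs."""
    all_nodes = set(counts1) | set(counts2)
    # A Counter returns 0 for a sequence it has not seen.
    return {node: (counts1[node], counts2[node]) for node in sorted(all_nodes)}


# Hypothetical node frequencies from two populations.
population1 = Counter({"GGA": 2, "GAT": 1})
population2 = Counter({"GGA": 1, "GAA": 3})
pairs = paired_node_frequencies(population1, population2)
```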

[Figure 5.6: a de Bruijn graph with the nodes GGA, GAT, GAA, ATA, AAT, TAC, AAA, TAA, ACA, AAG, AGG, CAG, AAC, AGA, ACC, GAC, CGG and CCG, each labeled with a visit count.]

Figure 5.6: A de Bruijn graph. The edges (k-mers) are the overlaps between the nodes ((k − 1)-mers). The number of times each node is visited is labeled red.

I believe that the future of population graphs belongs to the more complex graph types such as sequence graphs or de Bruijn graphs, rather than to the variant graphs used in this thesis. To summarize, variant graphs are inferior for a number of reasons:

• They do not recognize repeats.

• Inversions and translocations are poorly modeled. Variant data usually only contains small structural variants, so large inversions and translocations may be left out of the data altogether, as mentioned in Section 2.2.

• Every structure is either a single bubble or a combination of bubbles. The graphs are overly simple, which may cause graph edit distance to be inaccurate.

• Due to the linear form of the graph, graph edit distance ends up being quite similar to the bubble counter.

• A lot of nodes have to be stored.

A more advanced implementation of GraphBuilder will solve some minor problems, but more important issues require a different graph type, such as the sequence graph or de Bruijn graph.

5.4 Biological Features

As explained in the previous section, a transition to sequence graphs may solve several important problems. However, it is still crucial to remember where the data comes from and what we are trying to do with it. For this purpose, I will discuss some properties of biological data in general, and genetic data in particular.

In this work, data analysis has been performed on variants in the human genome. The variant data is derived from sets of sequences. By the technological advances discussed in Section 2.1, sequence data is no longer a scarcity. It is realistic to expect that the ideas from this thesis can be employed on data spanning the entire human genome. But even unrestricted knowledge of the genomic sequences is not the same as a full understanding of the organism’s biology. For that, the intricacies of molecular genetics are too many. A region of the genome does not necessarily behave the same way as its homolog in another organism. For example, epigenetic modification may alter expression by either amplifying or hindering the transcription of one or multiple genes contained in the region (Jaenisch and Bird, 2003).

Another aspect of DNA data concerns the irregular distribution of genetic changes. Some regions are more prone to mutations than others (Duret, 2009, and references therein). The distribution of those regions is crucial to analyses such as the one performed in this thesis, because we use a region to make conclusions about the organism. If distances are to be comparable between projects, we need common ground to base them on, which is possible if all projects use the same regions. But if different regions with different mutation rates are used, it makes no sense to compare two distances, even though they are calculated using the same distance metric. By this logic, the choice of the region to do analysis on is an important issue. In hindsight, this is a matter that may have been a little rushed in this thesis.

5.4.1 Out of Africa

There seems to be a geographical effect on the pairwise population distances reported in Section 4.1. This is interesting in the sense that the European and Asian populations seem to be close to each other, while the distance between the two African populations is relatively large, but generally smaller than from any of the African ones to any of the others. Accordingly, the results seem to play along with the “out-of-Africa” hypothesis of human origins. This states that there has been more than one wave of colonization expanding out of Africa. First, a wave of Homo erectus that gave rise to archaic Homo sapiens (H. sapiens). Then, a wave of H. sapiens that replaced the archaic sapiens, which then became extinct (Templeton, 2002; Vigilant, Stoneking, and Harpending, 1991). A key aspect of this hypothesis is that there was a bottleneck effect on the amount of genetic variation as the modern H. sapiens emigrated from Africa (Relethford, 2001). Hence, there is less genetic diversity in all human populations outside of Africa combined, than there is inside Africa. The rationale behind this idea is that only a fraction of the original African population emigrated, and this fraction gave rise to all human populations on other continents. In addition, there have likely been other emigrations later on that mixed with the now established populations outside Africa.

It is reassuring that the findings of this thesis also state that the distance between the African populations is large, while the distances between non-African populations are small.

Chapter 6

Concluding Remarks

The analysis of DNA sequences is a hallmark of modern science. This thesis has elaborated on one of the many applications of genomic sequences, namely the distinction of groups. I have developed tools for measuring the distance between two variant graphs representing a portion of the DNA of a population. The distance measures were tested on graphs made of variants from real human populations. In order to put the observed distances into perspective, a permutation test was used to find a reference for comparing distances calculated by the same metric.

Graphs are advantageous to the analysis of populations because they are able to represent all the genetic material. When variation is encountered, the graph displays the variants in the population as alternative paths. A linear, sequential representation of the same data is equivalent to collapsing all the bubbles in the variant graph and only displaying one possible path. A process like this is guaranteed to simplify the data when there are many individuals in the sample.

I believe that graph-based methods for analysis of populations will prevail. The rationale remains the observation that more individuals reveal more genetic variation. The graph can represent the variation, but the sequence cannot. This means that a graph-based model will likely contain more information than one based on a linear representation. Accordingly, such a model may give more accurate measures of distance between groups.

Three ways of measuring distance between variant graphs are presented. The bubble counter metric and the Bayes factor measure were developed during the work with this thesis. Graph edit distance is an existing metric. The bubble counter is the simplest of the measures, yet it appears to be quite effective for variant data. The graph edit distance behaves very similarly to the bubble counter when applied to variant graphs. This is because the variant graphs are very simple in structure, with only small structural variants (bubbles). The Bayes factor measure is fundamentally different from the two others as it is based on a probability model. Due to the probable lack of independence and a sub-optimal prior distribution, it is not possible to interpret the product of all the individual Bayes factors directly. Instead, the distances found by the Bayes factor metric are evaluated with a permutation test in the same way as with the bubble counter and the graph edit distance.

All three distance measures are employed on data from six human populations. Furthermore, the distance metrics are able to make conclusions about how large the differences between the populations are. The two Chinese populations are by far the ones with the smallest pairwise distance. This conclusion is supported by all the distance metrics. The two European populations seem to be quite similar to both each other and to the two Asian ones. In general, there seems to be a regional effect, where both the European and the Asian populations are closer to each other than to the African populations. The two African populations are closer to each other than to any population from Europe or Asia. However, the distance between the two African populations is still large compared to the other intra-continental distances, i.e. the distance between GBR and FIN, and CHB and CHS, respectively. This may be interpreted in light of the “out-of-Africa” hypothesis of human evolution.

The permutation tests for all three distance metrics reject the null hypothesis in equation (3.19) of identical underlying distributions at the minimal significance level for all pairs of populations except CHB-CHS. The rejection of this hypothesis says that the distance between the two populations is larger than the distance between two groups of the same size, drawn at random from the pool of both original populations. In other words, it states that there is a significant difference between them.

In general, it is possible to use graphs made from variant data to study population differences. The metrics that have been discussed in this thesis provide a way to quantify this difference, but also reveal the following drawbacks of the graph-based approach for variant data:

• Some metrics, e.g. graph edit distance, quickly become computationally intensive as the size of the graphs grows.

• The variant graph used in this thesis is perhaps not the ideal population graph. Other graph types are more promising.

The latter suggests the use of graphs that can model larger structural variations, e.g. the sequence graph discussed in Section 5.3. This does not see everything as bubbles, so the bubble counter and Bayes factor metric will be harder to implement on such a graph. However, graph edit distance is likely to be a suitable measure of distance in this case as well.

A promising idea is to represent a group or species by a population reference graph. This way, a newly sequenced individual may be added to the population by aligning its sequence to the graph. Then, the distance between populations can be estimated by e.g. graph edit distance with a fine-tuned scoring function.
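The permutation test summarized in this chapter can be sketched generically for any distance function. The sketch below is an illustration under stated assumptions, not the code used for the results: individuals are represented as sets of variant identifiers, the function `unique_variant_fraction` is a simplified stand-in for a distance measure, and the number of permutations and the seed are arbitrary.

```python
import random


def permutation_test(group1, group2, distance, n_permutations=999, seed=0):
    """Compare the observed distance with distances between random regroupings."""
    rng = random.Random(seed)
    observed = distance(group1, group2)
    pooled = list(group1) + list(group2)
    size1 = len(group1)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Draw two groups of the original sizes from the pooled individuals.
        rng.shuffle(pooled)
        if distance(pooled[:size1], pooled[size1:]) >= observed:
            at_least_as_large += 1
    # The smallest attainable p-value is 1 / (n_permutations + 1).
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


def unique_variant_fraction(group1, group2):
    """Stand-in distance: fraction of variants seen in only one of the groups."""
    variants1 = set().union(*group1)
    variants2 = set().union(*group2)
    total = variants1 | variants2
    return len(variants1 ^ variants2) / len(total) if total else 0.0


# Hypothetical individuals, each a set of variant identifiers.
population_a = [{1, 2, 3} for _ in range(10)]
population_b = [{4, 5, 6} for _ in range(10)]
observed, p_value = permutation_test(population_a, population_b, unique_variant_fraction)
```

The null hypothesis of identical underlying distributions is rejected when the p-value falls below the chosen significance level.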

6.1 Future Work

Due to progress in sequencing technology, genome assembly and computing power, the use of graphs for DNA-sequence analysis is a reality. Until now, focus has been on the sequences themselves. In this thesis, the surface of graph-based methods for the analysis of the DNA of populations is merely scratched. The methods used here are quite unsophisticated relative to what may be achieved in an extended work. A few ideas for improvements of the distance metrics are given in Section 5.1. In addition, the results are obtained with limited computer resources. Ideally, a future work will use refined methods on more data, possibly even the complete set of variants in the whole human genome. However, more variants will surely bring regions with more complex graph structures. I once again stress that the GraphBuilder was made for a very tidy graph with no especially large, complex or tricky substructures. Thus, a more robust implementation of the GraphBuilder is needed.

It is likely that there is more to be learned about population differences using variant graphs. However, I consider other graph types, such as the sequence graph and the de Bruijn graph (see Section 5.3), to be more promising. It is my desire that a future work will use these or other equally suited data structures to refine the ideas presented in this thesis. For such a work to fully realize its potential, themes from many disciplines will need to work in concert. Topology, graph theory, statistics and molecular biology are all needed if we are to master the successful development and accurate use of population graph analysis.

Appendix A

Graph Theory and some Assembly Graphs

This appendix contains material about graphs that is referenced in the main text, as well as some supplementary notes.

A.1 Definitions in Graph Theory

The definitions in this section are based mainly on material from Chapter 1 in Bondy and Murty (2008), Chapter 1 in Diestel (2000) and Chapters 1 and 2 in Gross and Yellen (2005), and will henceforth not be referenced. Any material from other sources will be referenced as we go along.

A graph is a set of nodes and edges between either two nodes or a node and itself. The edges are drawn as lines and the nodes as circles. An example can be found in Figure A.1. Recall the definition from Chapter 1:

Definition A.1. A graph is a pair of sets G = (V, E). The elements of V are called vertices or nodes and the elements of E are called edges. The edges are connections between nodes, i.e. the edges are associated with a set containing one or two nodes.

A graph may be directed. If so, all edges have designated start and end nodes. So a walk through the graph will have to move along edges in the correct direction. Starting from n1 in Figure A.1, we can e.g. move along e1 to n2 and then via e2 to n3. We cannot, however, return from n3, because this node has no edge from itself to n2. The only possible extension of this walk is to continue to n5, where we are stuck because n5 has no outgoing edges.

Definition A.2. A directed, acyclic graph (DAG) is a finite directed graph with no cycles.

The graph in Figure A.1 is a DAG.

Definition A.3. The number of outgoing edges is called out-degree. The number of ingoing edges is called in-degree.
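Definitions A.2 and A.3 can be illustrated with a short sketch that computes in-degree and out-degree and checks for cycles by repeatedly removing nodes without ingoing edges (Kahn's algorithm, which is not named in the text). The edges n1 to n2, n2 to n3 and n3 to n5 follow the walk described above; the remaining two edges in the example are hypothetical, since the full edge list of Figure A.1 is not reproduced here.

```python
from collections import defaultdict, deque


def degrees(nodes, edges):
    """Return (in_degree, out_degree) dictionaries for a directed graph."""
    in_degree = {node: 0 for node in nodes}
    out_degree = {node: 0 for node in nodes}
    for start, end in edges:
        out_degree[start] += 1
        in_degree[end] += 1
    return in_degree, out_degree


def is_acyclic(nodes, edges):
    """True if the directed graph has no cycles."""
    in_degree, _ = degrees(nodes, edges)
    successors = defaultdict(list)
    for start, end in edges:
        successors[start].append(end)
    queue = deque(node for node in nodes if in_degree[node] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for successor in successors[node]:
            in_degree[successor] -= 1
            if in_degree[successor] == 0:
                queue.append(successor)
    # Nodes on a cycle never reach in-degree 0 and are never removed.
    return removed == len(nodes)


nodes = ["n1", "n2", "n3", "n4", "n5"]
edges = [("n1", "n2"), ("n2", "n3"), ("n3", "n5"),
         ("n1", "n4"), ("n4", "n5")]   # the last two edges are hypothetical
in_degree, out_degree = degrees(nodes, edges)
```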

[Figure A.1: a directed graph with nodes n1 to n5 and edges e1 to e6.]

Figure A.1: A directed, acyclic graph.

The number of nodes in a graph G is called its order, and is denoted by |G|. Similarly, the number of edges is denoted ||G||. The nodes u and v are called adjacent if there is an edge connecting them. We will denote this edge uv. Note that in a directed graph, the edge uv will denote specifically an edge from u to v, not vice versa. The graphs in this thesis are for the most part acyclic. This is mainly a simplification, and to introduce cycles could very well be a significant im- provement. The core material in this thesis concerns the difference between two DAGs. A key concept in determining the size of this difference is what we call isomorphisms. In general, two mathematical structures are isomorphic if they are the same up to a relabeling of the elements they contain (Schu- macher, 2001). Two DAGs G1 = (V1,E1) and G2 = (V2,E2) are isomorphic if 1 and only if there exists a bijection between the sets of nodes V1 and V2 such that if two nodes are adjacent in G1, the corresponding two are also adjacent in G2. We will express the isomorphism of G1 and G2 as G1 ' G2.

Definition A.4. (Graph isomorphism) Let G1 = (V1,E1) and G2 = (V2,E2) be DAGs. G1 ≃ G2 if and only if there exists a bijection f : V1 → V2 such that

uv ∈ E1 ⇔ f(u)f(v) ∈ E2.    (A.1)

The function f is then called an isomorphism. (Bondy and Murty, 2008, pages 12-14, 34; Diestel, 2000, page 3)
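Condition (A.1) can be checked mechanically once a candidate mapping f is given. The following is a minimal sketch in Java, not part of the GraphBuilder code in Appendix F. The class name IsomorphismCheck, the representation of an edge uv as a two-element list [u, v], and the toy graphs in main are all assumptions made for illustration. The sketch only verifies a given f; it does not search for one.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class IsomorphismCheck {

    /**
     * Checks condition (A.1) for a candidate mapping f.
     * Edges are two-element lists [u, v], read as "from u to v", and are
     * assumed to have both endpoints in the node set of their graph.
     */
    static boolean isIsomorphism(Set<String> v1, Set<List<String>> e1,
                                 Set<String> v2, Set<List<String>> e2,
                                 Map<String, String> f) {
        // f must be defined on exactly V1 ...
        if (!f.keySet().equals(v1)) {
            return false;
        }
        // ... and be a bijection onto V2.
        Set<String> image = new HashSet<>(f.values());
        if (image.size() != v1.size() || !image.equals(v2)) {
            return false;
        }
        // Relabel every edge of G1 through f. Since f is injective, the
        // relabelled edge set equals E2 exactly when (A.1) holds.
        Set<List<String>> relabelled = new HashSet<>();
        for (List<String> edge : e1) {
            relabelled.add(List.of(f.get(edge.get(0)), f.get(edge.get(1))));
        }
        return relabelled.equals(e2);
    }

    public static void main(String[] args) {
        // Hypothetical toy graphs: two directed paths on three nodes.
        Set<String> v1 = Set.of("n1", "n2", "n3");
        Set<List<String>> e1 = Set.of(List.of("n1", "n2"), List.of("n2", "n3"));
        Set<String> v2 = Set.of("a", "b", "c");
        Set<List<String>> e2 = Set.of(List.of("a", "b"), List.of("b", "c"));

        Map<String, String> forward = Map.of("n1", "a", "n2", "b", "n3", "c");
        Map<String, String> reversed = Map.of("n1", "c", "n2", "b", "n3", "a");

        System.out.println(isIsomorphism(v1, e1, v2, e2, forward));  // true
        System.out.println(isIsomorphism(v1, e1, v2, e2, reversed)); // false
    }
}
```

The reversed mapping is a bijection but fails (A.1), because edge direction matters in a directed graph.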

It follows from Definition A.4 that if G1 and G2 are unlabeled, i.e. the nodes and edges do not have attributes, we have that

G1 ≃ G2 ⇔ G1 = G2.

However, if nodes and edges have attributes, this does not hold.

¹ Recall that a bijection is a function which is both surjective (onto) and injective (one-to-one).

A.2 De Bruijn Graphs for Genome Assembly

The concept of de Bruijn graphs was introduced by the Dutch mathematician Nicolaas de Bruijn in his 1946 paper “A combinatorial problem” (De Bruijn, 1946). Today, de Bruijn graphs are widely used in sequence assembly (Compeau, Pevzner, and Tesler, 2011), where reads are mapped as paths through the graph. De Bruijn graphs are defined in Subsection 5.3.2. A de Bruijn graph of order n has m = |Σ| symbols and each node has exactly m outgoing edges. Therefore the graph has m^n edges in total. The size of a graph made from long words with symbols from a large alphabet (so both m and n are large) thus grows exponentially in n. This makes memory the main bottleneck. The choice of k reveals a trade-off: larger k-mers increase the accuracy, but also add more nodes, so more memory is needed to store the graph. De Bruijn graphs reduce the problem of sequence assembly to a classic graph-theoretic one, namely the Eulerian path problem (Compeau, Pevzner, and Tesler, 2011).

Definition A.5. An Eulerian path is a path through a graph that visits each edge exactly once.

There is one edge e ∈ E for each k-mer in the sequence, so an Eulerian path in G will represent the shortest string that contains each k-mer exactly once. Before constructing the graph, the reads are broken down into k-mers, and the graph is built by connecting all the nodes ((k − 1)-mers) by edges (k-mers). If a k-mer is present in several places in the sequence, we simply add one edge to the graph for each occurrence. This means that when we have found an Eulerian path in the graph, we have in fact reconstructed the sequence using overlaps between the reads, because the k-mers are substrings of the reads. An extended example of de Bruijn graph construction will now be given.
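The construction described above can also be sketched in a few lines of Java. This is an illustration under stated assumptions and not the code used in the thesis. The class name DeBruijnSketch, the helper kmersOf and the toy sequence ACGTACG are hypothetical. Each k-mer contributes one edge from its (k − 1)-mer prefix to its (k − 1)-mer suffix, and a repeated k-mer contributes a parallel edge.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DeBruijnSketch {

    /** Splits a sequence into its consecutive, overlapping k-mers. */
    static List<String> kmersOf(String sequence, int k) {
        List<String> kmers = new ArrayList<>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            kmers.add(sequence.substring(i, i + k));
        }
        return kmers;
    }

    /**
     * Builds an adjacency list in which every (k-1)-mer is a node and every
     * k-mer is an edge from its prefix to its suffix. Repeated k-mers give
     * repeated entries, i.e. parallel edges.
     */
    static Map<String, List<String>> build(List<String> kmers) {
        Map<String, List<String>> graph = new TreeMap<>();
        for (String kmer : kmers) {
            String prefix = kmer.substring(0, kmer.length() - 1);
            String suffix = kmer.substring(1);
            graph.computeIfAbsent(prefix, key -> new ArrayList<>()).add(suffix);
            // Make sure nodes without outgoing edges are present as well.
            graph.computeIfAbsent(suffix, key -> new ArrayList<>());
        }
        return graph;
    }

    public static void main(String[] args) {
        // Hypothetical toy input with k = 3: ACG, CGT, GTA, TAC, ACG.
        Map<String, List<String>> graph = build(kmersOf("ACGTACG", 3));
        System.out.println(graph); // {AC=[CG, CG], CG=[GT], GT=[TA], TA=[AC]}
    }
}
```

In this toy graph the 3-mer ACG occurs twice, so there are two parallel edges from AC to CG. An adjacency list keeps the sketch short; finding the Eulerian path itself is left out.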

Example A.1. Assume that a sequencing output consists of the following 4-mers: AATC, ATCG, AAAA, CATC, TCGA and TCTT. The goal is to make a de Bruijn graph G(V,E) of the data. The 4-mers are overlaps of two (k − 1)-mers, i.e. 3-mers. The 3-mers are AAT, ATC, TCG, CGA, CAT, AAA, TCT, and CTT. These are the nodes in G and the overlaps are the edges in G. A drawing of the graph is shown in Figure A.2. Recall from Definition A.5 that an Eulerian path in G is a path that visits each edge exactly once. There are two possible Eulerian paths in G. Node CAT has out-degree 2 and in-degree 1. Node ATC has out-degree 1 and in-degree 2. All other nodes have equally many in- and outgoing edges, so CAT and ATC must be the start and end of both paths, respectively. The two paths are

w1 = CAT, AAA, AAA, AAT, ATC, TCC, CCA, CAT, ATC
w2 = CAT, ATC, TCC, CCA, CAT, AAA, AAA, AAT, ATC

Figure A.2: A de Bruijn graph with nodes AAA, AAT, ATC, CAT, CCA and TCC, and edges AAAA, AAAT, AATC, ATCC, CATA, CATC, CCAT and TCCA. The edges (k-mers) are the overlaps between the nodes ((k − 1)-mers).

which give rise to the sequences

s1 = CATAATCCATC
s2 = CATCCATAATC

So s1 and s2 are the possible results for this assembly.

♣

Appendix B

Calculations

B.1 Conjugate Prior for the Multinomial Distribution

We may choose a prior distribution such that the posterior and the prior have the same algebraic form. Such a choice of prior distribution is called a conjugate prior distribution. Conjugate priors are computationally convenient and enable us to find the posterior distribution analytically (Gelman et al., 2014, pages 35-36). We will now demonstrate that the Dirichlet distribution is the conjugate prior for the multinomial distribution. Let y = (y1, y2, ..., yk) be multinomially distributed with parameters θ = (θ1, θ2, ..., θk) and n trials. Let the prior distribution of the parameter θ be Dirichlet distributed with parameter α = (α1, α2, ..., αk). We thus have

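\[
p(y \mid \theta, n) = \frac{n!}{y_1!\, y_2! \cdots y_k!}\, \theta_1^{y_1} \theta_2^{y_2} \cdots \theta_k^{y_k}
\]

and

\[
p(\theta \mid \alpha) = \frac{\Gamma(\alpha_1 + \alpha_2 + \cdots + \alpha_k)}{\Gamma(\alpha_1)\Gamma(\alpha_2) \cdots \Gamma(\alpha_k)}\, \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \cdots \theta_k^{\alpha_k - 1}.
\]

The posterior distribution is then

\[
p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta) \propto \theta_1^{y_1 + \alpha_1 - 1} \theta_2^{y_2 + \alpha_2 - 1} \cdots \theta_k^{y_k + \alpha_k - 1}.
\]

Proportionality holds because the factors that have been dropped do not depend on θ. In addition, we note that after normalisation

\[
p(\theta \mid y) = \frac{\Gamma(y_1 + \alpha_1 + y_2 + \alpha_2 + \cdots + y_k + \alpha_k)}{\Gamma(y_1 + \alpha_1)\Gamma(y_2 + \alpha_2) \cdots \Gamma(y_k + \alpha_k)}\, \theta_1^{y_1 + \alpha_1 - 1} \theta_2^{y_2 + \alpha_2 - 1} \cdots \theta_k^{y_k + \alpha_k - 1},
\]

which is the Dirichlet distribution with parameter α* = (y1 + α1, y2 + α2, ..., yk + αk). So the Dirichlet distribution is the conjugate prior for the multinomial distribution.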
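In practice the conjugate update amounts to adding the observed counts to the prior parameters: the posterior kernel with exponents y_i + α_i − 1 is that of a Dirichlet distribution with parameters α_i + y_i. The sketch below illustrates this in Java. The class name DirichletUpdate, the flat prior and the counts in main are hypothetical and are not taken from the thesis data.

```java
import java.util.Arrays;

final class DirichletUpdate {

    /** Posterior Dirichlet parameters: prior parameters plus observed counts. */
    static double[] posterior(double[] alpha, int[] y) {
        if (alpha.length != y.length) {
            throw new IllegalArgumentException("alpha and y must have equal length");
        }
        double[] updated = new double[alpha.length];
        for (int i = 0; i < alpha.length; i++) {
            updated[i] = alpha[i] + y[i];
        }
        return updated;
    }

    /** Mean of a Dirichlet distribution: each parameter divided by the total. */
    static double[] mean(double[] params) {
        double total = Arrays.stream(params).sum();
        return Arrays.stream(params).map(p -> p / total).toArray();
    }

    public static void main(String[] args) {
        double[] alpha = {1.0, 1.0, 1.0}; // hypothetical flat prior
        int[] y = {12, 6, 2};             // hypothetical counts in three categories
        double[] post = posterior(alpha, y);
        System.out.println(Arrays.toString(post)); // [13.0, 7.0, 3.0]
        System.out.println(Arrays.toString(mean(post)));
    }
}
```

With three categories this is the setting of the genotype counts in Section B.2, where the categories are the two homozygous genotypes and the heterozygous one.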


B.2 Calculations for the Bayes Factor in 3.2.1

We will here show the calculations needed to arrive at the Bayes factor

\[
B_{10} = \frac{P(y \mid M_1)}{P(y \mid M_0)}
= \frac{\int_{(p_1, p_2) \in \Omega_1} P(p_1) P(p_2) P(y_1 \mid p_1) P(y_2 \mid p_2)\, dp_1\, dp_2}{\int_{p_0 \in \Omega_0} P(p_0) P(y_1 \mid p_0) P(y_2 \mid p_0)\, dp_0},
\]

as displayed in equation (3.12). Remember that under H0, the parameters of the two distributions are equal, so there are actually only three parameters p0 = (p1, p2, p3), where just two of them are free to vary, because Σ_i p_i = 1. We have

\[
P(p_0) = \mathrm{Dirichlet}(p_1, p_2, p_3 \mid \alpha_1, \alpha_2, \alpha_3)
= \frac{\Gamma\left(\sum_{i=1}^{3} \alpha_i\right)}{\prod_{i=1}^{3} \Gamma(\alpha_i)} \prod_{i=1}^{3} p_i^{\alpha_i - 1}, \tag{B.1}
\]

with α = (α1, α2, α3). Recall that y = (y1, y2), where y1 and y2 are assumed to be independent and distributed according to (3.9) and (3.10). Under H0, we have that p1 = p2, so the distributions of the data are now

\[
f(y_1 \mid p_0, n_1) = \frac{n_1!}{\prod_{i=1}^{3} y_{1,i}!} \prod_{i=1}^{3} p_i^{y_{1,i}}, \tag{B.2}
\]

\[
f(y_2 \mid p_0, n_2) = \frac{n_2!}{\prod_{i=1}^{3} y_{2,i}!} \prod_{i=1}^{3} p_i^{y_{2,i}}, \tag{B.3}
\]

for p0 = (p1, p2, p3). Then we insert (B.2), (B.3) and (B.1) into (3.12) to see that

\begin{align*}
P(y \mid M_0) &= \int_{p_0 \in \Omega_0} P(p_0) P(y_1 \mid p_0) P(y_2 \mid p_0)\, dp_0 \\
&= \int_0^1 \frac{\Gamma\left(\sum_{i=1}^{3} \alpha_i\right)}{\prod_{i=1}^{3} \Gamma(\alpha_i)} \prod_{i=1}^{3} p_i^{\alpha_i - 1}
   \cdot \frac{n_1!}{\prod_{i=1}^{3} y_{1,i}!} \prod_{i=1}^{3} p_i^{y_{1,i}}
   \cdot \frac{n_2!}{\prod_{i=1}^{3} y_{2,i}!} \prod_{i=1}^{3} p_i^{y_{2,i}}\, dp_0 \\
&= \frac{\Gamma\left(\sum_{i=1}^{3} \alpha_i\right)}{\prod_{i=1}^{3} \Gamma(\alpha_i)}
   \cdot \frac{n_1!\, n_2!}{\prod_{i=1}^{3} y_{1,i}!\, y_{2,i}!}
   \int_0^1 \prod_{i=1}^{3} p_i^{y_{1,i} + y_{2,i} + \alpha_i - 1}\, dp_0. \tag{B.4}
\end{align*}

The expression inside the integral is the kernel of a Dirichlet distribution with parameters λ = (λ1, λ2, λ3) = (y1,1 + y2,1 + α1, y1,2 + y2,2 + α2, y1,3 + y2,3 + α3). Hence, by manipulating the constants and noting that the Dirichlet distribution integrates to 1, we obtain

\begin{align*}
P(y \mid M_0) &= \frac{\Gamma\left(\sum_{i=1}^{3} \alpha_i\right)}{\prod_{i=1}^{3} \Gamma(\alpha_i)}
   \cdot \frac{\prod_{i=1}^{3} \Gamma(y_{1,i} + y_{2,i} + \alpha_i)}{\Gamma\left(\sum_{i=1}^{3} (y_{1,i} + y_{2,i} + \alpha_i)\right)}
   \cdot \frac{n_1!\, n_2!}{\prod_{i=1}^{3} y_{1,i}!\, y_{2,i}!} \\
&= \frac{\Gamma\left(\sum_{i=1}^{3} \alpha_i\right)}{\prod_{i=1}^{3} \Gamma(\alpha_i)}
   \cdot \frac{\prod_{i=1}^{3} \Gamma(\lambda_i)}{\Gamma\left(\sum_{i=1}^{3} \lambda_i\right)}
   \cdot \frac{n_1!\, n_2!}{\prod_{i=1}^{3} y_{1,i}!\, y_{2,i}!}, \tag{B.5}
\end{align*}

and we have the expression in the denominator of (3.12).

Then we need to solve the integral in the numerator of (3.12). The prior is slightly more complicated in this case. Under the alternative hypothesis, we have that p1 = (p1,1, p1,2, p1,3) ≠ (p2,1, p2,2, p2,3) = p2, so there are six parameters, and four are free to vary because Σ_i p1,i = Σ_i p2,i = 1. But p1 and p2 are assumed to be independent, so the prior distribution for M1 is assumed to be the product of two Dirichlet distributions with the same hyperparameters.

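\begin{align*}
P(p_1, p_2) &= P(p_{1,1}, p_{1,2}, p_{1,3}, p_{2,1}, p_{2,2}, p_{2,3} \mid p_{i,j} \neq p_{k,l} \text{ for some } i, j, k, l) \\
&= \mathrm{Dirichlet}(p_{1,1}, p_{1,2}, p_{1,3} \mid \alpha_1, \alpha_2, \alpha_3) \cdot \mathrm{Dirichlet}(p_{2,1}, p_{2,2}, p_{2,3} \mid \alpha_1, \alpha_2, \alpha_3) \\
&= \left[ \frac{\Gamma\left(\sum_{i=1}^{3} \alpha_i\right)}{\prod_{i=1}^{3} \Gamma(\alpha_i)} \right]^2 \prod_{i=1}^{3} p_{1,i}^{\alpha_i - 1} \prod_{i=1}^{3} p_{2,i}^{\alpha_i - 1}. \tag{B.6}
\end{align*}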

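Remember that p = (p1, p2) = (p1,1, p1,2, p1,3, p2,1, p2,2, p2,3), and insert (3.9), (3.10) and (B.6) to see that

\begin{align*}
P(y \mid M_1) &= \int_{(p_1, p_2) \in \Omega_1} P(p_1) P(p_2) P(y_1 \mid p_1) P(y_2 \mid p_2)\, dp_1\, dp_2 \\
&= \int_0^1 \left[ \frac{\Gamma\left(\sum_{i=1}^{3} \alpha_i\right)}{\prod_{i=1}^{3} \Gamma(\alpha_i)} \right]^2
   \cdot \prod_{i=1}^{3} p_{1,i}^{\alpha_i - 1} \cdot \prod_{i=1}^{3} p_{2,i}^{\alpha_i - 1}
   \cdot \frac{n_1!}{\prod_{i=1}^{3} y_{1,i}!} \cdot \prod_{i=1}^{3} p_{1,i}^{y_{1,i}}
   \cdot \frac{n_2!}{\prod_{i=1}^{3} y_{2,i}!} \cdot \prod_{i=1}^{3} p_{2,i}^{y_{2,i}}\, dp_1\, dp_2 \\
&= \left[ \frac{\Gamma\left(\sum_{i=1}^{3} \alpha_i\right)}{\prod_{i=1}^{3} \Gamma(\alpha_i)} \right]^2
   \frac{n_1!\, n_2!}{\prod_{i=1}^{3} y_{1,i}!\, y_{2,i}!}
   \int_0^1 \prod_{i=1}^{3} p_{1,i}^{y_{1,i} + \alpha_i - 1}\, dp_1
   \cdot \int_0^1 \prod_{i=1}^{3} p_{2,i}^{y_{2,i} + \alpha_i - 1}\, dp_2, \tag{B.7}
\end{align*}

where the two integrands are the kernels of two Dirichlet distributions with parameters γ1 = (y1,1 + α1, y1,2 + α2, y1,3 + α3) and γ2 = (y2,1 + α1, y2,2 + α2, y2,3 + α3), respectively. We then manipulate the constants to get

\begin{align*}
P(y \mid M_1) &= \left[ \frac{\Gamma\left(\sum_{i=1}^{3} \alpha_i\right)}{\prod_{i=1}^{3} \Gamma(\alpha_i)} \right]^2
   \frac{n_1!\, n_2!}{\prod_{i=1}^{3} y_{1,i}! \prod_{i=1}^{3} y_{2,i}!}
   \cdot \frac{\prod_{i=1}^{3} \Gamma(\gamma_{1,i}) \Gamma(\gamma_{2,i})}{\Gamma\left(\sum_{i=1}^{3} \gamma_{1,i}\right) \Gamma\left(\sum_{i=1}^{3} \gamma_{2,i}\right)} \\
&\quad \cdot \int_0^1 \frac{\Gamma\left(\sum_{i=1}^{3} \gamma_{1,i}\right)}{\prod_{i=1}^{3} \Gamma(\gamma_{1,i})} \prod_{i=1}^{3} p_{1,i}^{y_{1,i} + \alpha_i - 1}\, dp_1
   \cdot \int_0^1 \frac{\Gamma\left(\sum_{i=1}^{3} \gamma_{2,i}\right)}{\prod_{i=1}^{3} \Gamma(\gamma_{2,i})} \prod_{i=1}^{3} p_{2,i}^{y_{2,i} + \alpha_i - 1}\, dp_2 \\
&= \left[ \frac{\Gamma\left(\sum_{i=1}^{3} \alpha_i\right)}{\prod_{i=1}^{3} \Gamma(\alpha_i)} \right]^2
   \cdot \frac{\prod_{i=1}^{3} \Gamma(\gamma_{1,i}) \Gamma(\gamma_{2,i})}{\Gamma\left(\sum_{i=1}^{3} \gamma_{1,i}\right) \Gamma\left(\sum_{i=1}^{3} \gamma_{2,i}\right)}
   \cdot \frac{n_1!\, n_2!}{\prod_{i=1}^{3} y_{1,i}!\, y_{2,i}!}, \tag{B.8}
\end{align*}

where the two Dirichlet distributions integrate to 1. We can now find the Bayes factor by plugging equations (B.5) and (B.8) into (3.12):

\begin{align*}
B_{10} &= \frac{\int_{(p_1, p_2) \in \Omega_1} P(p_1) P(p_2) P(y_1 \mid p_1) P(y_2 \mid p_2)\, dp_1\, dp_2}{\int_{p_0 \in \Omega_0} P(p_0) P(y_1 \mid p_0) P(y_2 \mid p_0)\, dp_0} \\
&= \frac{\left[ \dfrac{\Gamma\left(\sum_{i=1}^{3} \alpha_i\right)}{\prod_{i=1}^{3} \Gamma(\alpha_i)} \right]^2
   \cdot \dfrac{\prod_{i=1}^{3} \Gamma(\gamma_{1,i}) \Gamma(\gamma_{2,i})}{\Gamma\left(\sum_{i=1}^{3} \gamma_{1,i}\right) \Gamma\left(\sum_{i=1}^{3} \gamma_{2,i}\right)}
   \cdot \dfrac{n_1!\, n_2!}{\prod_{i=1}^{3} y_{1,i}!\, y_{2,i}!}}
  {\dfrac{\Gamma\left(\sum_{i=1}^{3} \alpha_i\right)}{\prod_{i=1}^{3} \Gamma(\alpha_i)}
   \cdot \dfrac{\prod_{i=1}^{3} \Gamma(\lambda_i)}{\Gamma\left(\sum_{i=1}^{3} \lambda_i\right)}
   \cdot \dfrac{n_1!\, n_2!}{\prod_{i=1}^{3} y_{1,i}!\, y_{2,i}!}} \\
&= \frac{\Gamma\left(\sum_{i=1}^{3} \alpha_i\right)}{\prod_{i=1}^{3} \Gamma(\alpha_i)}
   \cdot \frac{\Gamma\left(\sum_{i=1}^{3} \lambda_i\right)}{\prod_{i=1}^{3} \Gamma(\lambda_i)}
   \cdot \frac{\prod_{i=1}^{3} \Gamma(\gamma_{1,i}) \Gamma(\gamma_{2,i})}{\Gamma\left(\sum_{i=1}^{3} \gamma_{1,i}\right) \Gamma\left(\sum_{i=1}^{3} \gamma_{2,i}\right)} \\
&= \frac{\Gamma\left(\sum_{i=1}^{3} \alpha_i\right)}{\prod_{i=1}^{3} \Gamma(\alpha_i)}
   \cdot \frac{\Gamma\left(\sum_{i=1}^{3} (y_{1,i} + y_{2,i} + \alpha_i)\right)}{\prod_{i=1}^{3} \Gamma(y_{1,i} + y_{2,i} + \alpha_i)}
   \cdot \frac{\prod_{i=1}^{3} \Gamma(y_{1,i} + \alpha_i) \Gamma(y_{2,i} + \alpha_i)}{\Gamma\left(\sum_{i=1}^{3} (y_{1,i} + \alpha_i)\right) \Gamma\left(\sum_{i=1}^{3} (y_{2,i} + \alpha_i)\right)}. \tag{B.9}
\end{align*}

B.3 Proof of Proposition 3.1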

Proposition 3.1. Given a mapping g : V1 → V2, as introduced in Definition 3.2, Definitions 3.5 and 3.7 are equivalent.

Proof. This can be validated by looking at the weights in the scoring regime (3.3) - (3.8). Let u, v ∈ V1 be nodes in G1, and u′, v′ ∈ V2 be nodes in G2. With the specified scoring model, all node operations are scored 0. So if λi is a node operation, then c(λi) = 0. This means that node operations have no effect on the graph edit distance in equation (3.2). In addition, edge substitutions are scored 0, so we only have to worry about edge insertions and edge deletions, which are both scored 1 by (3.3) - (3.8). This means that we only have to count the number of edge insertions and edge deletions, which is exactly what equation (3.16) in Definition 3.7 does. Let uv ∈ E1 be an edge that connects the nodes u and v in G1. Similarly, let u′v′ ∈ E2 be an edge that connects the nodes u′ and v′ in G2. The edge g(u)g(v) is inserted into G2 if uv ∈ E1, but g(u)g(v) ∉ E2. Similarly, u′v′ is deleted from G2 if u′v′ ∈ E2, but g⁻¹(u′)g⁻¹(v′) ∉ E1. Both of these events add 1 to the sum in equation (3.16), while all other events are disregarded. Consequently, equation (3.16) of Definition 3.7 counts the number of edge insertions and deletions, just like equation (3.2) of Definition 3.5, with the scoring regime in (3.3) - (3.8). So the two definitions are equivalent given the mapping and the specific scoring rule used. □

Appendix C

Table from Example 3.3

Table C.1 on the next page is an excerpt from the full VCF table with the same format as Table 2.3, only here we include the genotype information for all the individuals in the population, but not much else. In Table 2.3, only two individuals were included for illustration. The individuals are in separate columns named with the original identifier from the 1000 Genomes Project, which is a string of two letters followed by five digits, e.g. HG00097.

Table C.1: Genotype information from the VCF file for 20 individuals from population GBR.

[Table body not recoverable from the rotated page. One panel covers population GBR, with columns POS, HG00096, HG00097, HG00099, HG00100, HG00101, HG00102, HG00103, HG00104, HG00106, HG00108, HG00109, HG00110, HG00111, HG00112, HG00113, HG00114, HG00116, HG00117, HG00118 and HG00119. A second panel covers population LWK, with columns POS, NA19020, NA19028, NA19035, NA19036, NA19038, NA19041, NA19044, NA19046, NA19307, NA19308, NA19309, NA19310, NA19311, NA19312, NA19313, NA19315, NA19316, NA19317, NA19318 and NA19319. The rows are the positions 16982899, 16982901, 16987070 and 17003679, and each cell holds a phased genotype 0|0, 0|1, 1|0 or 1|1.]

Appendix D

Notation

Graphs

G(V,E)          Graph made up of a set of vertices (nodes) V = {v1, v2, ..., vn} and a set of edges E = {e1, e2, ..., em}.
G(V, E, α, β)   Attributed graph with node labelling function α and edge labelling function β.
|G|             The order of G, i.e. the number of nodes in G.
||G||           The number of edges in G.
≃               Graph isomorphism.
Σ               A set of symbols that will be referred to as an alphabet.
λ(G1,G2)        A complete edit path from G1 to G2.
Λ(G1,G2)        The set of all complete edit paths from G1 to G2.
λ               Edit operation.

Alignment

A(R,T)          Alignment between two objects R and T, which may be either graphs or strings.
δ(x, y)         A scoring matrix which assigns a penalty to the alignment of all symbols in Σ to each other.
σ(x, y)         A scoring matrix which assigns an edit distance for all symbols in Σ to each other.

Other

dM(a, b)        The distance from a to b by metric dM.
ℕ               The natural numbers.
ℝ               The real numbers.
∅               The empty set.
♣               Marks the end of an example.
□               Marks the end of a proof.


Abbreviations

DAG     Directed Acyclic Graph
DNA     DeoxyriboNucleic Acid
EA      Evolutionary Algorithm
HTS     High Throughput Sequencing
NGS     Next Generation Sequencing
NP      Nondeterministic Polynomial time
PPI     Protein-Protein Interaction
RNA     RiboNucleic Acid
SMRT    Single-Molecule Real-Time
SNP     Single Nucleotide Polymorphism
VCF     Variant Call Format

Appendix E

Glossary

alignment: Arrangement of objects, e.g. strings, to quantify their similarities. An alignment is scored in order to be compared to other alignments.

assembly: The process of aligning many reads into larger, contiguous sequences called contigs.

bidirected graph: A directed graph where the nodes have two opposite sides. An edge specifies which side of the node it enters.

consensus sequence: The sequence of the bases that occur most frequently in the analyzed individuals.

contigs: Contiguous sequences produced by aligning many reads. Contigs are the primary output of DNA sequencing, and are ordered into scaffolds and finally into complete genomes.

de Bruijn graph: A graph type where the nodes are (k − 1)-mers and the edges are k-mers. The edges specify which (k − 1)-mers overlap. By overlapping the k-mers we may reconstruct the sequence. This can be done by traversing the de Bruijn graph.

deletion: The removal of one or multiple consecutive bases from a sequence.

diploid: The ploidy is the number of genome copies in a cell. Humans are diploid, i.e. we have two copies of our entire genome. Not to be confused with the double helical spatial structure of the DNA, which consists of two strands.

epigenetics: The study of genetic differences that are not manifested in the sequence. Such differences may also be inheritable. Examples include the silencing of transcription by DNA methylation and other kinds of gene regulation.


fitness: Reproductive success, i.e. the contribution of an allele, gene, gene family or individual to the next generation (Futuyma, 2009, page 283).

genome: The collection of all the genetic material in an organism.

genotype: The genetic sequence in a region or at a single position.

heterozygotic: A diploid organism may be homo- or heterozygotic at a particular position or region. Heterozygotic means that the two gene copies are different at the position or region in question.

homology: Two traits inherited from a common ancestor are said to be homologous.

homozygotic: The two gene copies are identical at the position or region in question.

indel: An insertion or deletion. See insertion and deletion.

insertion: The addition of one or multiple consecutive bases into a sequence.

mate pairs: Fragment where the ends are sequenced and the unknown portion in the middle is of known length. The length of the middle portion varies between different technologies.

mutation: Alteration of the genetic material.

phenotypic variation: The variation in the expression of the genotype, in addition to environmental factors.

phylogenetics: The study of evolutionary relationships among groups of organisms, i.e. which groups share a common ancestor and how recent the ancestor is (Futuyma, 2009, page 18).

ploidy: The number of copies of the genetic material in each cell of an organism. The number can be one (haploid), two (diploid), three (triploid), four (tetraploid), etc.

population: Group of breeding individuals.

read: The sequence of a DNA fragment read by the sequencing machine.

recombination: The formation of new genetic sequences, not identical to either parent sequence, by combining the parent sequences.

reference sequence: A genetic sequence to be considered a representation of a group, e.g. a species. In this thesis we use a human reference sequence as a context for the variant data in the VCF file.

scaffold: Ordered assemblies of contigs. The scaffolds are combined into complete genomes.

sequence graph: A bidirected graph, where all nodes have DNA-strings as labels. The edges are associated with a specific side of a node.

Single Nucleotide Polymorphism (SNP): A single position in the sequence where not all the individuals of a population have the same nucleotide.

string edit distance: The (potentially weighted) number of changes it takes to make one string equal to another.

substitution: One or several bases are substituted by one or several other bases.

transcription: Assembly of an mRNA strand complementary to the DNA it is copied from. The sequence in mRNA is then interpreted into an amino acid sequence making up a protein in a process known as translation.

transposable elements: Genetic sequences that are capable of relocation. Transposable elements may move the entire sequence, or leave behind either a complete or partially complete copy.

variant graph: A directed, acyclic graph, made up of a set of variants. The variants have positions along a linear reference genome.

Appendix F

Code

This appendix reports code for computer programs referenced in the text. Section F.1 contains the GraphBuilder, which is made up of two separate files of Java code. GedevoGraph.java reads a VCF file and constructs a graph object. Graph.java contains the class Graph, which is instantiated by GedevoGraph.java. Section F.2 contains the R-script calculateDist.r, used to calculate the dBC- and dBF-distances. Recall that dGED is calculated externally by GEDEVO. Section F.3 consists of the R-scripts used in the permutation test. The dBC- and dBF-distances between permuted populations are generated by permDistanceBF_BC.r, while permDistance_GED.r does the same for dGED. permTest.r calculates permutation p-values. All other code used in the thesis is available on the GitHub page “Variant-graph-analysis” at https://github.com/martinbjoerndal/Variant-graph-analysis.

F.1 GraphBuilder

/**
 * ANALYSES FOR MASTER'S THESIS
 * ======
 *
 * GedevoGraph.java
 * @author Martin Gjesdal Bjørndal
 *
 * This program builds a graph from a VCF-file with intermediary nodes
 * in-between the variants. Each variant makes a "fork" and two variants at the
 * same position make a three-way fork because all variants at the same
 * position have the same reference, so this is seemingly done only to not have
 * lists in the ALT-column in the VCF-file.
 **/
import java.util.Scanner;
import java.util.ArrayList;
import java.util.HashMap;
import java.io.File;
import java.io.PrintWriter;
import java.io.FileNotFoundException;
import java.io.IOException;

public class GedevoGraph {
    public static void main(String[] args) {
        TestClass t = new TestClass();
        t.test(args[0], args[1]);
    }
}

class TestClass {
    public void test(String filename1, String filename2) {
        File infile1 = new File(filename1);
        File infile2 = new File(filename2);

        HashMap<String, Node> nodeList1;
        HashMap<String, Node> nodeList2;

        FileReader reader1 = new FileReader();
        FileReader reader2 = new FileReader();

        nodeList1 = reader1.readFile(infile1);
        nodeList2 = reader2.readFile(infile2);

        Graph g1 = new Graph(nodeList1);
        Graph g2 = new Graph(nodeList2);

        g1.setNodeNumbers();
        g2.setNodeNumbers();

        try {
            PrintWriter writer = new PrintWriter("gedevoInput.txt", "UTF-8");
            writer.println("graph fromNode toNode");
            g1.gedevoInput("g1", writer);
            g2.gedevoInput("g2", writer);

            writer.close();

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

class FileReader {

    /**
     * A lot of things actually happen here. The R-preprocessing removed the
     * lines that specified nonexistent variations. So all lines now
     * translate to two nodes. Must draw edges to the node at the "fork" point.
     * And we need to do some more complex things where variations happen close
     * by or on top of each other.
     *
     * @param infile the VCF-file to be read
     * @return nodeList HashMap of nodes eventually passed to the constructor
     * in Graph.java
     **/
    public HashMap<String, Node> readFile(File infile) {
        HashMap<String, Node> nodeList = new HashMap<String, Node>();

        try {
            Scanner scan = new Scanner(infile);
            scan.nextLine();

            // The last nodes need to be readily at hand
            Node lastInBetween = null;
            Node previousRef = null;
            Node previousAlt = null;
            ArrayList<Node> previousVariants = new ArrayList<Node>();

            int previousPos = 0;
            int lineCounter = 0;
            while (scan.hasNext()) {
                String line = scan.nextLine();
                // The field separator in this string literal is not
                // recoverable from the printed listing. Splitting on runs
                // of whitespace is an assumption.
                String[] lineList = line.split("\\s+");

                // All lines specify two nodes: One for the reference type and
                // one for the alternative. The reference is column no. 4, the
                // alternative in no. 5.
                int pos = Integer.parseInt(lineList[2]);
                Node alt = new Node(pos, lineList[3].concat("alt"),
                                    lineList[5], "alt");

                // The most interesting thing is when there are multiple
                // variations at the same position! This is really the same as
                // a multiple-way fork. So we can just add on a connection from
                // the last in-between-node.
                if (pos == previousPos) {
                    nodeList.get(lastInBetween.getId()).addEdge(lastInBetween, alt);
                    nodeList.put(alt.getId(), alt);
                    previousVariants.add(alt);

                } else if (pos == previousPos + 1) {
                    Node ref = new Node(pos, lineList[3].concat("ref"),
                                        lineList[4], "ref");

                    // Need to go two steps back to look at how many nodes there
                    // were in the previous step
                    Edge[] edges = new Edge[2];
                    edges[0] = new Edge(lastInBetween, ref);
                    edges[1] = new Edge(lastInBetween, alt);

                    Edge[] edgesToPreviousStep = lastInBetween.getEdges();
                    int noOfNodesPreviousStep = edgesToPreviousStep.length;
                    Node[] nodesPreviousStep = new Node[noOfNodesPreviousStep];

                    for (int i = 0; i < edgesToPreviousStep.length; i++) {
                        nodesPreviousStep[i] = edgesToPreviousStep[i].getToNode();
                    }

                    Edge[] edgesFromPreviousStep = new Edge[noOfNodesPreviousStep];

                    // And now we must set the edges in this step. Overwrite
                    // the variable edges
                    for (int i = 0; i < nodesPreviousStep.length; i++) {
                        edges = new Edge[2];
                        edges[0] = new Edge(nodesPreviousStep[i], ref);
                        edges[1] = new Edge(nodesPreviousStep[i], alt);
                        nodesPreviousStep[i].setEdges(edges);
                    }

                    nodeList.put(ref.getId(), ref);
                    nodeList.put(alt.getId(), alt);
                    previousRef = ref;

                    previousVariants.clear();
                    previousVariants.add(ref);
                    previousVariants.add(alt);

                } else {
                    Node ref = new Node(pos, lineList[3].concat("ref"),
                                        lineList[4], "ref");

                    // If we ain't stacking variations then this is just another
                    // SNP.
                    if (lineCounter == 0) {
                        lastInBetween = new Node(1, "Start", "Start", "start");
                        Edge[] edges = new Edge[2];
                        edges[0] = new Edge(lastInBetween, ref);
                        edges[1] = new Edge(lastInBetween, alt);
                        lastInBetween.setEdges(edges);

                        nodeList.put(lastInBetween.getId(), lastInBetween);
                        nodeList.put(ref.getId(), ref);
                        nodeList.put(alt.getId(), alt);

                    } else {
                        lastInBetween = new Node(pos - 1, lineList[3].concat("inter"),
                                                 "I", "inter");
                        Edge[] edges = new Edge[2];
                        edges[0] = new Edge(lastInBetween, ref);
                        edges[1] = new Edge(lastInBetween, alt);
                        lastInBetween.setEdges(edges);

                        // Set the edges from two steps ago to the in-between node.
                        previousRef = nodeList.get(previousRef.getId());
                        previousAlt = nodeList.get(previousAlt.getId());

                        for (Node n : previousVariants) {
                            Edge[] edgeToInBetween = new Edge[1];
                            edgeToInBetween[0] = new Edge(n, lastInBetween);
                            nodeList.get(n.getId()).setEdges(edgeToInBetween);
                        }

                        nodeList.put(lastInBetween.getId(), lastInBetween);
                    }

                    nodeList.put(ref.getId(), ref);
                    nodeList.put(alt.getId(), alt);
                    previousVariants.clear();
                    previousRef = ref;
                    previousVariants.add(ref);
                }

                previousPos = pos;
                previousAlt = alt;
                previousVariants.add(alt);
                //System.out.println(previousVariants.size());
                lineCounter++;
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }

        System.out.println("\nFile read");
        return nodeList;
    }
}

/**
 * ANALYSES FOR MASTER'S THESIS
 * ======
 *
 * Graph.java
 * @author Martin Gjesdal Bjørndal
 *
 * The actual graph built by GedevoGraph.java. The graph is a HashMap of Nodes.
 *
 **/
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Arrays;
import java.io.PrintWriter;
import java.io.IOException;

public class Graph {
    HashMap<String, Node> nodeList;

    private int NUMBER_OF_NODES;
    private int NUMBER_OF_EDGES;

    // Constructor. The graph is a HashMap of nodes.
    Graph(HashMap<String, Node> nodeList) {
        this.nodeList = nodeList;
    }

    /**
     * A debugging aid. Calls display(), which prints node and edge stats.
     **/
    public void printNodes() {
        Node[] sorted = new Node[nodeList.size()];
        int[] nodePositions = new int[nodeList.size()];

        int counter = 0;
        for (Node n : nodeList.values()) {
            nodePositions[counter] = n.getPos();
            counter++;
        }

        Arrays.sort(nodePositions);
        for (int i = 0; i < nodePositions.length; i++) {
            if (i > 0) {
                // Guard i + 1 so that the last position is not compared
                // with an element outside the array.
                if (i + 1 < nodePositions.length
                        && nodePositions[i] == nodePositions[i + 1]) {
                    Node n = findNodeAtPosition(nodePositions[i]);

                    if (n.getType().equals("alt")) {
                        Node alt = n;

                        // Remove to get access to the next at the same
                        // position. Then add again
                        nodeList.remove(alt.getId());
                        Node ref = findNodeAtPosition(nodePositions[i]);
                        nodeList.put(alt.getId(), alt);

                    } else if (n.getType().equals("ref")) {
                        Node ref = n;
                        nodeList.remove(ref.getId());
                        findNodeAtPosition(nodePositions[i]).display();
                        nodeList.put(ref.getId(), ref);

                    } else {
                        findNodeAtPosition(nodePositions[i]).display();
                    }

                    i++;

                } else {
                    findNodeAtPosition(nodePositions[i]).display();
                }
            } else {
                findNodeAtPosition(nodePositions[i]).display();
            }
        }
    }

    public Node findNodeAtPosition(int pos) {
        for (Node n : nodeList.values()) {
            if (n.getPos() == pos) {
                return n;
            }
        }

        return null;
    }

    public void setNodeNumbers() {
        int nodeNumber = 1;

        for (Node n : nodeList.values()) {
            n.setNumber(nodeNumber);
            nodeNumber++;
        }
    }

    /**
     * This is a bit important, as it will be used as input to igraph in R.
     * Then the igraph object can be visualized by R.
     *
     **/
    public void igraphOutput(String graphName, boolean gedevo) {

        if (!gedevo) {
            try {
                PrintWriter edgeWriter = new PrintWriter(graphName +
                                                         "edgeList.txt", "UTF-8");
                PrintWriter labelWriter = new PrintWriter(graphName +
                                                          "labelList.txt", "UTF-8");
                edgeWriter.println("fromNode toNode");
                labelWriter.println("seq");

                for (Node n : nodeList.values()) {
                    Edge[] edges = n.getEdges();

                    if (edges != null) {
                        for (int i = 0; i < edges.length; i++) {
                            edgeWriter.println(Integer.
                                toString(edges[i].getFromNode().
                                getNodeNumber()) + " " + Integer.
                                toString(edges[i].getToNode().
                                getNodeNumber()));
                        }
                    }

                    labelWriter.println(n.getSeq());
                }

                edgeWriter.close();
                labelWriter.close();

            } catch (IOException e) {
                e.printStackTrace();
            }

        } else {

            try {
                PrintWriter edgeWriter = new PrintWriter(graphName +
                                                         "edgeList.txt", "UTF-8");
                PrintWriter labelWriter = new PrintWriter(graphName +
                                                          "labelList.txt", "UTF-8");
                edgeWriter.println("fromNode toNode");
                labelWriter.println("seq");

                for (Node n : nodeList.values()) {
                    Edge[] edges = n.getEdges();

                    if (edges != null) {
                        for (int i = 0; i < edges.length; i++) {
                            edgeWriter.println(edges[i].getFromNode().getId() +
                                               " " + edges[i].getToNode().getId());
                        }
                    }

                    labelWriter.println(n.getSeq());
                }

                edgeWriter.close();
                labelWriter.close();

            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * This is also important, as it will be used as input to GEDEVO. Because
     * the graphs put different numbers on the nodes (by iterating through
     * the hashmap), we need to use the ids to be able to compare edges with
     * gedevo.
     **/
    public void gedevoInput(String graphName, PrintWriter writer) {

        for (Node n : nodeList.values()) {
            Edge[] edges = n.getEdges();

            if (edges != null) {
                for (int i = 0; i < edges.length; i++) {
                    writer.println(graphName + " " +
                                   edges[i].getFromNode().getId() +
                                   " " + edges[i].getToNode().getId());
                }
            }
        }
    }
}

/**
 * Edges are attributes of the nodes and contain a start node and an end node
 *
 **/
class Edge {

    private Node to;
    private Node from;

    Edge(Node from, Node to) {
        this.from = from;
        this.to = to;
    }

    public Node getToNode() {
        return to;
    }

    public Node getFromNode() {
        return from;
    }
}

/**
 * Nodes contain edges, identifiers and a sequence. The graph is a HashMap of
 * nodes.
 **/
class Node {
    private int pos;
    private String id;
    private String seq;
    private String type;
    private int nodeNumber;

    private Edge[] edges;

    Node(int pos, String id, String seq, String type) {
        this.pos = pos;
        this.id = id;
        this.seq = seq;
        this.type = type;
    }

    // A lot of get/sets.
    public void setEdges(Edge[] edges) {
        this.edges = edges;
    }

    public Edge[] getEdges() {
        return edges;
    }

    public String getId() {
        return id;
    }

    public void setNumber(int nodeNumber) {
        this.nodeNumber = nodeNumber;
    }

    public int getNodeNumber() {
        return nodeNumber;
    }

    public String getSeq() {
        return seq;
    }

    public void setSeq(String seq) {
        this.seq = seq;
    }

    public int getPos() {
        return pos;
    }

    public String getType() {
        return type;
    }

    /**
     * If there's more than two edges out from an in-between node, we'll need
     * to add another if more variants appear at the same position.
     **/
    public void addEdge(Node edgeFrom, Node edgeTo) {
        Edge[] update = new Edge[edges.length + 1];

        for (int i = 0; i < edges.length; i++) {
            update[i] = edges[i];
        }

        update[update.length - 1] = new Edge(edgeFrom, edgeTo);
        setEdges(update);
    }

    /**
     * A debugging aid. Prints node and edge stats.
     **/
    public void display() {
        System.out.println("Node " + id + " of type " + type +
                           " and igraph-numbering " + nodeNumber +
                           " reporting: My seq is " + seq +
                           " at position " + pos +
                           ".\nMy edges are from myself to ");

        if (edges != null) {
            for (int i = 0; i < edges.length; i++) {
                if (edges[i] == null) {
                    System.out.println("Warning: Edge is null");
                } else {
                    System.out.println(edges[i].getToNode().getId());
                }
            }

        } else {
            System.out.println("");
        }
        System.out.println();
    }
}

F.2 R-script for Calculating Distances

###############################################################################
# calculateDist.r
#
# Analyses for master thesis
# ==========================
#
# An extension of naiveDist to include the proportions of the genotypes.
# We still wish to find the distance between two populations.
###############################################################################
calculateDist = function(pop1, pop2) {
  internalStartTime = proc.time()[3]
  samples = cbind(pop1, pop2)

  # Loop through variants. What are the genotypes in the sample? Do they
  # contain the variant? First, investigate whether the sample contains the
  # variant at all, i.e. whether it is homozygous for the variant or
  # heterozygous.
  containedPop1 = logical(length(samples[,1]))
  containedPop2 = logical(length(samples[,1]))

  genotypeCountPop1 = matrix(ncol = 3, nrow = length(pop1[,1]))
  genotypeCountPop2 = matrix(ncol = 3, nrow = length(pop1[,1]))

  for (i in 1:length(pop1[,1])) {
    # Genotype proportions. What are the possible genotypes? Look them up in
    # the VCF file at positions REF and ALT.
    ref = chr22Reduced$REF[i]
    alt = chr22Reduced$ALT[i]
    zygosityPop1 = numeric(3)
    zygosityPop2 = numeric(3)

    for (j in 1:length(pop1[1,])) {
      # Find genotypes in the first population
      genotypePop1 = integer(2)
      genotypePop1[1] = strtoi(substr(pop1[i,j], 1, 1))
      genotypePop1[2] = strtoi(substr(pop1[i,j], 3, 3))

      # Find genotypes in the second population
      genotypePop2 = integer(2)
      genotypePop2[1] = strtoi(substr(pop2[i,j], 1, 1))
      genotypePop2[2] = strtoi(substr(pop2[i,j], 3, 3))

      # Is the variant contained in the populations?
      if (genotypePop1[1] != 0 | genotypePop1[2] != 0) {
        containedPop1[i] = TRUE
      }

      if (genotypePop2[1] != 0 | genotypePop2[2] != 0) {
        containedPop2[i] = TRUE
      }

      # What are the genotypes in the populations? The vector "zygosity"
      # contains the number of individuals that are 1) homozygotes for the
      # reference, 2) heterozygotes or 3) homozygotes for the alternative
      if (genotypePop1[1] == 0 && genotypePop1[2] == 0) {
        zygosityPop1[1] = zygosityPop1[1] + 1
      } else if (genotypePop1[1] == 1 && genotypePop1[2] == 1) {
        zygosityPop1[3] = zygosityPop1[3] + 1
      } else {
        zygosityPop1[2] = zygosityPop1[2] + 1
      }

      if (genotypePop2[1] == 0 && genotypePop2[2] == 0) {
        zygosityPop2[1] = zygosityPop2[1] + 1
      } else if (genotypePop2[1] == 1 && genotypePop2[2] == 1) {
        zygosityPop2[3] = zygosityPop2[3] + 1
      } else {
        zygosityPop2[2] = zygosityPop2[2] + 1
      }
    }

    genotypeCountPop1[i,] = zygosityPop1
    genotypeCountPop2[i,] = zygosityPop2
  }

  proportionsPop1 = genotypeCountPop1 / length(pop1)
  proportionsPop2 = genotypeCountPop2 / length(pop2)

  # If the bubble is contained in both populations, the vector has value "1"
  # at this index. If it is contained in only one of the populations, it has
  # value "0". If it is not contained in either, we will not use it anymore in
  # this preliminary analysis, so it is denoted by a "-1".
  bubbles = integer(length(containedPop1))
  for (i in (1:length(containedPop1))) {
    if (containedPop1[i] && containedPop2[i]) {
      bubbles[i] = 1
    } else if (!containedPop1[i] && !containedPop2[i]) {
      bubbles[i] = -1
    } else {
      bubbles[i] = 0
    }
  }

  # Naive distance
  # ==============
  # The distance from graph G1 to graph G2 is estimated by the number of
  # bubbles contained in only G1 or G2 divided by the total number of bubbles
  # found in either.
  separate = sum(bubbles == 0)
  shared = sum(bubbles == 1)
  dist = separate / (shared + separate)

  # Comparison of the genotype proportions. We should do a hypothesis test to
  # see if the proportions are similar in the two populations.

  # Bayesian approach
  # =================
  # Use a Dirichlet prior on the parameters. Under H0, the Dirichlet prior has
  # parameters that are the genotype proportions of the second observations.
  # Under Ha the parameters give a rather flat prior. First, we must assign
  # some hyperparameters.
  alpha = rep(1, 3)
  index = numeric(length(genotypeCountPop2[,1]))
  d = numeric(length(genotypeCountPop2[,1]))

  for (i in 1:length(pop1[,1])) {
    lambda = genotypeCountPop2[i,] + genotypeCountPop1[i,] + alpha
    gammaOne = genotypeCountPop1[i,] + alpha
    gammaTwo = genotypeCountPop2[i,] + alpha

    prodAlpha = prod(gamma(alpha))
    sumAlpha = gamma(sum(alpha))

    prodLambda = prod(gamma(lambda))
    sumLambda = gamma(sum(lambda))

    prodGammaOne = prod(gamma(gammaOne))
    sumGammaOne = gamma(sum(gammaOne))

    prodGammaTwo = prod(gamma(gammaTwo))
    sumGammaTwo = gamma(sum(gammaTwo))

    d[i] = ((prodAlpha / sumAlpha) * (prodLambda / sumLambda)) /
      ((prodGammaOne / sumGammaOne) * (prodGammaTwo / sumGammaTwo))
    index[i] = data[i,2]
  }

  # Finally, report distance
  #cat("\ndistance between permuted populations")
  #cat("\nNaive dist =", dist, "\tBayes dist =", sum(log(1/d)))

  #internalEndTime = proc.time()[3]
  #cat("\nInternal time:", (internalEndTime - internalStartTime), "sec")

  # Return the Bayes factor of the null hypothesis relative to the alternative
  # hypothesis
  return(c(dist, sum(log(1/d))))
}
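The two quantities returned by calculateDist can be sketched compactly as follows. The sketch works on the log scale, because products of gamma functions overflow double precision for larger samples (Gamma(172) already exceeds the largest representable double), whereas sums of log-gamma values do not. The class and method names are hypothetical, and the sketch assumes integer Dirichlet hyperparameters (alpha = 1, as in the script), so that log Gamma can be computed as a sum of logarithms. It is an illustration under these assumptions, not the code used for the results in the thesis.

```java
// Illustrative sketch only: class and method names are hypothetical, and
// integer Dirichlet hyperparameters (alpha = 1, as in the script) are assumed.
final class DistanceSketch {

    // Share of variants found in exactly one population among the variants
    // found in at least one of them.
    static double naiveDistance(boolean[] containedPop1, boolean[] containedPop2) {
        int shared = 0;
        int separate = 0;
        for (int i = 0; i < containedPop1.length; i++) {
            if (containedPop1[i] && containedPop2[i]) {
                shared++;
            } else if (containedPop1[i] || containedPop2[i]) {
                separate++;
            }
        }
        return (double) separate / (shared + separate);
    }

    // log Gamma(n) for a positive integer n, i.e. log((n - 1)!).
    static double logGammaInt(int n) {
        double sum = 0.0;
        for (int k = 2; k < n; k++) {
            sum += Math.log(k);
        }
        return sum;
    }

    // log B(x), where B(x) = prod_k Gamma(x_k) / Gamma(sum_k x_k).
    static double logMultivariateBeta(int[] x) {
        double logProd = 0.0;
        int total = 0;
        for (int value : x) {
            logProd += logGammaInt(value);
            total += value;
        }
        return logProd - logGammaInt(total);
    }

    // log(1/d) for a single variant, with d as defined in calculateDist:
    // d = B(alpha) * B(lambda) / (B(gammaOne) * B(gammaTwo)).
    static double logInverseD(int[] countsPop1, int[] countsPop2, int[] alpha) {
        int[] gammaOne = new int[alpha.length];
        int[] gammaTwo = new int[alpha.length];
        int[] lambda = new int[alpha.length];
        for (int k = 0; k < alpha.length; k++) {
            gammaOne[k] = countsPop1[k] + alpha[k];
            gammaTwo[k] = countsPop2[k] + alpha[k];
            lambda[k] = countsPop1[k] + countsPop2[k] + alpha[k];
        }
        return logMultivariateBeta(gammaOne) + logMultivariateBeta(gammaTwo)
                - logMultivariateBeta(alpha) - logMultivariateBeta(lambda);
    }

    public static void main(String[] args) {
        int[] alpha = {1, 1, 1};
        // Hypothetical genotype counts (hom. ref, het., hom. alt) at one variant.
        int[] pop1 = {12, 6, 2};
        int[] pop2 = {3, 7, 10};
        System.out.println("log(1/d) = " + logInverseD(pop1, pop2, alpha));
    }
}
```

Summing logInverseD over all variants corresponds to the second value returned by calculateDist, sum(log(1/d)).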

F.3 R-code for Permutation Tests

###############################################################################
# permDistanceBF_BC.r
#
# Analyses for master thesis
# ==========================
#
# Does permutation tests on the distance metrics from distances.r. Pools
# two populations and does permutations. Runs in parallel.
###############################################################################
require(parallel)
require(doParallel)
require(foreach)

# Find distances
source("readData.r")
source("calculateDist.r")
source("popCleaningForGraph.r")

# Function to perform permutations. Takes two populations, pools them, and
# draws permuted populations.
pairwisePermutations <- function(pop1, pop2) {

  pooled = names(c(pop1, pop2))
  permutation = sample(pooled, length(pooled), replace = FALSE)

  permPop1 = subset(data, select = permutation[1:20])
  permPop2 = subset(data, select = permutation[21:40])
  return(list(permPop1, permPop2))
}

# Parallelize
cores = detectCores() - 1
cluster = makeCluster(cores)
registerDoParallel(cores)

K = 3000
popNames = c("GBR", "FIN", "LWK", "YRI", "CHB", "CHS")
pops = list(brit, fin, luhya, yoruba, hanBei, hanSouth)
startTime = proc.time()
# Run loop in parallel: Requires some special syntax
distanceTable = foreach(k = 1:K, .combine = rbind) %dopar% {

  startTime = proc.time()
  distanceList = NULL
  for (i in 1:length(pops)) {
    for (j in 1:length(pops)) {
      if (i > j) {

        # Call function that makes permutation. Calculate distance as always.
        permPops = pairwisePermutations(pops[[i]], pops[[j]])
        distances = calculateDist(permPops[[1]], permPops[[2]])
        distanceList = c(distanceList, distances)
      }
    }
  }

  time = proc.time()[3] - startTime[3]
  cat("\nIteration", k, "of", K, "at time:", (time / 60), "min.\n")
  distanceList
}
stopCluster(cluster)

# Write to file
write.table(distanceTable, file = "./5000Lines/distanceTable.txt")
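The pooling step in pairwisePermutations can be illustrated as below: the sample labels of the two populations are pooled, shuffled without replacement, and split back into two groups. The class and method names are hypothetical. Unlike the script, which hard-codes 20 individuals per population, the sketch splits at the size of the first population; this generalisation is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only: names are hypothetical and not part of the
// thesis code.
final class PermutationSketch {

    // Pools the labels of both populations, shuffles them, and returns two
    // permuted groups with the same sizes as the original populations.
    static List<List<String>> permuteLabels(List<String> pop1, List<String> pop2, Random rng) {
        List<String> pooled = new ArrayList<>(pop1);
        pooled.addAll(pop2);
        Collections.shuffle(pooled, rng);

        List<String> permPop1 = new ArrayList<>(pooled.subList(0, pop1.size()));
        List<String> permPop2 = new ArrayList<>(pooled.subList(pop1.size(), pooled.size()));

        List<List<String>> result = new ArrayList<>();
        result.add(permPop1);
        result.add(permPop2);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample labels.
        List<String> popA = Arrays.asList("A1", "A2", "A3");
        List<String> popB = Arrays.asList("B1", "B2", "B3");
        List<List<String>> permuted = permuteLabels(popA, popB, new Random());
        System.out.println("Permuted group 1: " + permuted.get(0));
        System.out.println("Permuted group 2: " + permuted.get(1));
    }
}
```

Each permuted pair of groups would then be passed to the distance calculation, giving one draw from the permutation distribution of the distance.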

###############################################################################
# permDistanceGED.r
#
# Analyses for master thesis
# ==========================
#
# Does permutation tests on the distance metrics from distances.r. Pools
# two populations and does permutations. GEDEVO runs in parallel, making it
# hard to have the outer loop in parallel.
###############################################################################
source("readData.r")
source("popCleaningForGraph.r")
startTime = proc.time()

# Function to perform permutations. Takes two populations, pools them, and
# draws permuted populations.
pairwisePermutations <- function(pop1, pop2) {

  pooled = names(c(pop1, pop2))
  permutation = sample(pooled, length(pooled), replace = FALSE)

  permPop1 = subset(data, select = permutation[1:20])
  permPop2 = subset(data, select = permutation[21:40])

  return(list(permPop1, permPop2))
}

callToGedevo <- function() {
  cat("\nNow starting GEDEVO...\n")
  system("bash gedevoRun.sh", wait = TRUE)

  # Then read the results file and pass back to caller
  con = file("distance.txt")
  open(con)

  # Read only line number 13
  line = scan(con, what = character(), skip = 12, nlines = 1)
  close(con)
  return(as.numeric(line[4]))
}

K = 25
popNames = c("GBR", "FIN", "LWK", "YRI", "CHB", "CHS")
pops = list(brit, fin, luhya, yoruba, hanBei, hanSouth)
gedTable = matrix(NA, nrow = K, ncol = 15)

# This cannot easily go parallel because we need BuildGraph to write files
# in each iteration and the different threads may pick wrong files if they are
# not completely in sync.
for (k in 1:K) {
  gedList = NULL
  timeList = NULL
  for (i in 1:length(pops)) {
    for (j in 1:length(pops)) {
      if (i > j) {

        # Call function that makes permutation. Calculate distance as always.
        permPops = pairwisePermutations(pops[[i]], pops[[j]])

        # Clean the permuted populations and write to file. The files are
        # used by BuildGraph
        cleaned = clean(permPops[[1]], permPops[[2]])
        write.table(cleaned[[1]], file = "pop1Perm.txt")
        write.table(cleaned[[2]], file = "pop2Perm.txt")

        system("java GedevoGraph pop1Perm.txt pop2Perm.txt")
        ged = callToGedevo()
        gedList = c(gedList, ged)

        # Do some cleaning.
        #
        # IMPORTANT: Remove connectivity matrices to prevent GEDEVO from using
        # old ones.
        #
        # Also remove the input file for GEDEVO so that it crashes instead of
        # potentially using the previous file in case GraphBuilder fails.
        system("rm pop1Perm.txt")
        system("rm pop2Perm.txt")
      }
    }
  }

  gedTable[k,] = gedList
  time = proc.time()[3] - startTime[3]
  cat("\nIteration", k, "of", K, "at time:", (time / 60), "min.\n")
}

# Write to file
write.table(gedTable, file = "GEDresults/gedSubScored_1.txt")

###############################################################################
# permTest.r
#
# Analyses for master thesis
# ==========================
#
# Compute achieved significance level (ASL). Can be used to compare the ASL
# across different permutation sample sizes.
###############################################################################
require(xtable)
d = read.table("./5000Lines/distanceTable.txt", header = TRUE)
d = d[1:300,]
dGED = read.table("GEDresults/gedPermutations.txt", header = TRUE)
colnames = c("GBR-FIN", "GBR-FIN", "GBR-LWK", "GBR-LWK", "FIN-LWK", "FIN-LWK",
             "GBR-YRI", "GBR-YRI", "FIN-YRI", "FIN-YRI", "LWK-YRI", "LWK-YRI",
             "GBR-CHB", "GBR-CHB", "FIN-CHB", "FIN-CHB", "LWK-CHB", "LWK-CHB",
             "YRI-CHB", "YRI-CHB", "GBR-CHN", "GBR-CHN", "FIN-CHN", "FIN-CHN",
             "LWK-CHN", "LWK-CHN", "YRI-CHN", "YRI-CHN", "CHB-CHN", "CHB-CHN")

# Split table into two: One for each metric
names(d) = colnames
dNaive = matrix(NA, ncol = length(colnames)/2, nrow = length(d[,1]))
dBayes = matrix(NA, ncol = length(colnames)/2, nrow = length(d[,1]))
observedNaive = read.table("naiveData.txt", header = TRUE)[-2:-1, -1]
observedBayes = read.table("bayesData.txt", header = TRUE)[-2:-1, -1]
observedGED = read.table("./GEDresults/gedData.txt", header = TRUE)[-2:-1, -1]

# Convert to vectors so that the indexes correspond to the ones from the
# distance table. Convert and remove NAs.
observedNaive = na.omit(as.vector(t(as.matrix(observedNaive))))
observedBayes = na.omit(as.vector(t(as.matrix(observedBayes))))
observedGED = na.omit(as.vector(t(as.matrix(observedGED))))
j = 1
k = 1
for (i in 1:length(colnames)) {
  if ((i %% 2) == 1) {
    dNaive[,j] = d[,i]
    j = j + 1
  } else {
    dBayes[,k] = d[,i]
    k = k + 1
  }
}

calculateASL <- function(i, observed) {
  naiveExtremes = sum(dNaive[,i] > observed[1])
  bayesExtremes = sum(dBayes[,i] < observed[2])
  gedExtremes = sum(dGED[,i] > observed[3])

  # The "+ 1" in the denominator belongs outside length(), as in the GED case.
  naiveASL = (naiveExtremes + 1)/(length(d[,i]) + 1)
  bayesASL = (bayesExtremes + 1)/(length(d[,i]) + 1)
  gedASL = (gedExtremes + 1)/(length(dGED[,i]) + 1)
  print(gedExtremes)
  return(list(naiveASL, bayesASL, gedASL))
}

# Calculate the ASL
ASL = matrix(ncol = 3, nrow = length(colnames)/2)
for (i in 1:length(dNaive[1,])) {
  observed = c(observedNaive[i], observedBayes[i], observedGED[i])
  ASL[i,] = unlist(calculateASL(i, observed))
}

# Fill into matrix for report
popNames = c("GBR", "FIN", "LWK", "YRI", "CHB", "CHS")
naiveASL_table = matrix(NA, ncol = length(popNames), nrow = length(popNames))
bayesASL_table = matrix(NA, ncol = length(popNames), nrow = length(popNames))
gedASL_table = matrix(NA, ncol = length(popNames), nrow = length(popNames))
k = 1
for (i in 1:length(popNames)) {
  for (j in 1:length(popNames)) {
    if (i > j) {
      naiveASL_table[i, j] = ASL[k, 1]
      bayesASL_table[i, j] = ASL[k, 2]
      gedASL_table[i, j] = ASL[k, 3]
      k = k + 1
    }
  }
}

# Clean
naiveASL_table = data.frame(naiveASL_table)
bayesASL_table = data.frame(bayesASL_table)
gedASL_table = data.frame(gedASL_table)
colnames(naiveASL_table) = popNames
colnames(bayesASL_table) = popNames
colnames(gedASL_table) = popNames
rownames(naiveASL_table) = popNames
rownames(bayesASL_table) = popNames
rownames(gedASL_table) = popNames

# And produce TeX output
naiveASL_tableTeX = xtable(naiveASL_table, digits = 4,
                           caption = "ASL for $d_{BC}$",
                           label = "tbl:naiveASL")
bayesASL_tableTeX = xtable(bayesASL_table, digits = 4,
                           caption = "ASL for $d_{BF}$",
                           label = "tbl:bayesASL")
gedASL_tableTeX = xtable(gedASL_table, digits = 4,
                         caption = "ASL for $d_{GED}$",
                         label = "tbl:gedASL")
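The achieved significance level computed in calculateASL can be sketched as follows: the number of permuted distances that are more extreme than the observed distance, plus one, is divided by the number of permutations, plus one. Adding one to both numerator and denominator ensures that the ASL is never exactly zero. The class, method, and parameter names are hypothetical. The direction flag mirrors the script, where larger values count as more extreme for the naive and GED distances and smaller values for the Bayes factor distance.

```java
// Illustrative sketch only: names are hypothetical and not part of the
// thesis code.
final class AslSketch {

    // Achieved significance level with the "+ 1" correction:
    // (number of more extreme permuted values + 1) / (number of permutations + 1).
    static double asl(double[] permuted, double observed, boolean largerIsMoreExtreme) {
        int extremes = 0;
        for (double value : permuted) {
            boolean moreExtreme = largerIsMoreExtreme ? value > observed : value < observed;
            if (moreExtreme) {
                extremes++;
            }
        }
        return (extremes + 1.0) / (permuted.length + 1.0);
    }

    public static void main(String[] args) {
        // Hypothetical permuted distances and observed distance.
        double[] permuted = {0.10, 0.20, 0.30, 0.40};
        double observed = 0.25;
        System.out.println("ASL = " + asl(permuted, observed, true));
    }
}
```

With K permutations, the smallest attainable ASL under this definition is 1/(K + 1), which is worth keeping in mind when comparing results across different permutation sample sizes.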
