Distance Metrics in Variant Graphs

Martin Gjesdal Bjørndal
Master’s Thesis, Spring 2017

This master’s thesis is submitted under the master’s programme Modelling and Data Analysis, with programme option Statistics and Data Analysis, at the Department of Mathematics, University of Oslo. The scope of the thesis is 60 credits.

The front page depicts a section of the root system of the exceptional Lie group E8, projected into the plane. Lie groups were invented by the Norwegian mathematician Sophus Lie (1842–1899) to express symmetries in differential equations and today they play a central role in various parts of mathematics.

Author: Martin Gjesdal Bjørndal
Supervisors: Geir Olve Storvik, Lex Nederbragt

Abstract

The traditional, linear representation of the genetic information in populations causes a loss of data, as it cannot fully represent sites where the sequences differ. Conversely, a graph may incorporate the variation, thereby providing a means of utilizing more information in the analysis of genetic differences between populations. This thesis deals with the quantification of the differences between two population graphs. Three ways of measuring the distance between population graphs are presented. The first counts unique variants and compares them to the total number of variants. The second calculates the graph edit distance. The third specifies two probability models for the genotype distribution at each variant, and then calculates the Bayes factor at each location. The three distance measures are tested on data describing variations in the human genome. Six human populations from distinct geographical areas are represented in the data. The distance measures seem to give similar conclusions about the relative distances among pairs of populations. In order to place the distances in context, they are evaluated using permutation tests.
The permutation tests report significant results for all pairs of populations except one. This is also the pair to which all distance measures assign the shortest distance. In general, the methods and ideas presented in this thesis allow more genetic information to be included in the study of relationships between groups. As such, they may prove more accurate than traditional designs.

Keywords: Bayes factor, graph edit distance, permutation test, variant graph.

Preface

The idea for this thesis originated from the observation that linear representations of DNA from more than one individual are associated with a loss of information. A genetic variation among the individuals of a population means that there is more than one alternative present in the genomic sequence. A linear representation will in this case select one of the alternatives instead of using all the data available. This idea paved the road to graphs, which provide an intuitive extension of the work done with strings in previous literature.

A demanding challenge has been the lack of previous work to support my own writings. The use of graphs for assembling DNA sequences is extensive. However, the representation of multiple sequences is still largely dependent on strings. Despite the challenges, the thesis at no point ceased to be interesting. Quite the contrary. As time has progressed and insight deepened, I find myself profoundly interested in a problem whose applications I hope to witness. Though my contribution to the field is modest, I hope that the ideas presented in this thesis may inspire others to continue the work on population graphs and how they relate to each other. I believe it to be a topic of great potential. The rationale behind this claim is simple: Why not use all the data?
I wish to express my sincere gratitude to my supervisor Geir Storvik for his continuous support and wisdom, and to my co-supervisor Lex Nederbragt for valuable comments and insight into bioinformatics. I would also like to thank my good friends Trygve Danielsen for proofreading and Synnøve Smedbye Botnen for sharing her expertise in molecular biology. Thanks to Tom Andersen for inspiration. Thanks to close family for proofreading. Thanks to friends at “Utfallsrommet” for both academic, and not quite so academic, discussions. Last but not least, a very special thanks to Marte Rødsvik Kolloen for unconditional support, encouragement and highly appreciated technical discussions.

Martin Gjesdal Bjørndal
Blindern, May 2017

Contents

1 Introduction
1.1 Objectives of the Thesis
1.2 Genetic Variation
1.3 Graphs for Sequence Data
1.3.1 Motivation for Graph-Representations
1.4 Measuring Distance Between Graphs
1.5 Related Work
1.6 Outline of the Thesis
2 Data
2.1 Genomics
2.1.1 Alignment and String Edit Distance
2.1.2 Assembly
2.1.3 Mapping, Resequencing and Variant Detection
2.2 Description of the Data
2.2.1 The Variant Call Format
2.2.2 Reference Sequence
2.3 Variant Graphs
3 Methods
3.1 Distance-Based Approaches
3.1.1 Bubble Counter
3.1.2 Graph Edit Distance
3.2 A Model-Based Approach
3.2.1 The Bayes Factor
3.3 Implementation of the Distance Metrics
3.3.1 Bubble Counter and Bayes Factor
3.3.2 Graph Building
3.3.3 Graph Alignment with GEDEVO
3.4 Are the Measures really Metrics?
3.4.1 Bubble Counter
3.4.2 Graph Edit Distance
3.4.3 Bayes Factor Measure
3.5 The Permutation Test
3.5.1 Monte Carlo Sampling
3.6 Implementation of the Permutation Test
3.7 Example
4 Results
4.1 Distances Between Population Graphs
4.1.1 Bubble Counter
4.1.2 Graph Edit Distance
4.1.3 Bayes Factor
4.2 Visualization by Multidimensional Scaling
4.3 Permutation Tests
4.3.1 p-values
5 Discussion
5.1 Distance Metrics
5.1.1 Bubble Counter
5.1.2 Graph Edit Distance
5.1.3 Bayes Factor
5.1.4 Other Distances
5.2 GraphBuilder
5.3 Other Graph Types
5.3.1 Sequence Graph
5.3.2 de Bruijn Graph
5.4 Biological Features
5.4.1 Out of Africa
6 Concluding Remarks
6.1 Future Work
A Graph Theory and some Assembly Graphs
A.1 Definitions in Graph Theory
A.2 De Bruijn Graphs for Genome Assembly
B Calculations
B.1 Conjugate Prior for the Multinomial Distribution
B.2 Calculations for the Bayes Factor in 3.2.1
B.3 Proof of Proposition 3.1
C Table from Example 3.3
D Notation
E Glossary
F Code
F.1 GraphBuilder
F.2 R-script for Calculating Distances
F.3 R-code for Permutation Tests
Bibliography

Chapter 1
Introduction

DNA data is extensively utilized for discovering differences between organisms, but traditionally only linear representations of sequences have been exploited for this purpose. The idea of our DNA as linear, represented as a string of characters from the alphabet Σ = {A, C, G, T}, prevails (see e.g. Watson, T. Baker, et al. (2008), page 30). However, the question of how to effectively represent collections of DNA sequences has been, and remains, an active subject of discussion (Paten, Novak, and Haussler, 2014; Iqbal, Caccamo, et al., 2012; Church, Schneider, Graves, et al., 2011; Dilthey et al., 2015). This thesis seeks to improve upon the analysis of DNA sequence data by using graphs made up of collections of sequences, instead of the linear representation provided by strings. The aim of the graph-based approach is to include more data in the model, thereby providing more accurate measures of the relative distances between populations.
In the following, the word “population” will refer to a group of breeding individuals of the same species (Hartl, 2000; Falconer, 1981).

Genetic data has become an extremely powerful tool in medicine and biosciences. Facilitated by the collection of vast amounts of data, the possibilities offered by the combination of biology, mathematics, statistics and computer science are seemingly endless. For example, a milestone of modern science is the completion of the first draft of the human genome (Venter et al., 2001). Full genome sequencing is becoming increasingly common as software is getting more efficient, sequencing technology more accurate, and hardware better.

In the mid 2000s, a new group of sequencing technologies emerged. These methods are referred to as High Throughput Sequencing (HTS) and produce massive amounts of data per unit of time compared to earlier designs. The introduction of HTS caused a need for specialized data structures and software to handle the large amounts of data. This brought us various graph-based methods for de novo sequence assembly and mapping (Limasset and Peterlongo, 2015; Novak, Garrison, and Paten, 2016) (see Sections 2.1.2 and 2.1.3 and Appendix A.2).

Graphs allow us to store variations in the genetic material by using multiple nodes to represent all the possible variations at a position. Hence, the acquisition of several genome sequences from the same population motivates the use of graphs to store the data. A graph made from DNA sequences of several individuals from a single population will in the following be referred to as a population graph. A key concept of this thesis is the comparison of population graphs. In order to do calculations on graphs, a Java program called GraphBuilder (see Subsection 3.3.2) was made.
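
The idea of using multiple nodes to represent the alternatives at a position can be illustrated with a small Python sketch. It is not the GraphBuilder program and makes no claim about how GraphBuilder works. The function names `build_variant_graph` and `spelled_sequences`, the node encoding as (position, base) pairs, and the restriction to single-base substitutions are all assumptions made for illustration only.

```python
def build_variant_graph(reference, variants):
    """Build a toy variant graph (illustrative only, not GraphBuilder).

    reference: string over the alphabet {A, C, G, T}
    variants:  dict mapping a 0-based position to a set of alternative bases
               (assumption: single-base substitutions only)
    Returns (nodes, edges): nodes is a list of (position, base) pairs and
    edges maps each node to the list of its successor nodes.
    """
    # alleles[i] lists every base seen at position i, reference base first
    alleles = []
    for i, ref_base in enumerate(reference):
        alts = sorted(set(variants.get(i, ())) - {ref_base})
        alleles.append([ref_base] + alts)

    nodes = [(i, base) for i, bases in enumerate(alleles) for base in bases]
    edges = {node: [] for node in nodes}
    # connect every allele at position i to every allele at position i + 1
    for i in range(len(alleles) - 1):
        for base in alleles[i]:
            for next_base in alleles[i + 1]:
                edges[(i, base)].append((i + 1, next_base))
    return nodes, edges


def spelled_sequences(reference, variants):
    """List every sequence spelled by a path from the first to the last position."""
    nodes, edges = build_variant_graph(reference, variants)
    results = []

    def walk(node, prefix):
        prefix += node[1]
        if not edges[node]:  # no successors: the path is complete
            results.append(prefix)
        for successor in edges[node]:
            walk(successor, prefix)

    for start in (n for n in nodes if n[0] == 0):
        walk(start, "")
    return results
```

Under these assumptions, calling `spelled_sequences("ACGT", {1: {"T"}})` yields both `ACGT` and `ATGT`, whereas a linear representation would retain only one of the two. Enumerating paths grows exponentially with the number of variant sites, so this sketch is suitable only for very small examples.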
