Universitat Politècnica de Catalunya

Efficient Approximate String Matching Techniques for

Santiago Marco-Sola

Advisor: Paolo Ribeca
Co-advisor: Roderic Guigó

Thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of the Polytechnic University of Catalonia, December 2016

Abstract

One of the outstanding milestones achieved in recent years in the field of biotechnology research has been the development of high-throughput sequencing (HTS). Due to the fact that at the moment it is technically impossible to decode the genome as a whole, HTS technologies read billions of relatively short chunks of a genome at random locations. Such reads then need to be located within a reference for the species being studied (that is, aligned or mapped to the genome): for each read, one identifies in the reference the regions that share a large sequence similarity with it, thereby indicating what the read's point or points of origin may be. HTS technologies are able to re-sequence a human individual (i.e. to establish the differences between his/her individual genome and the reference genome for the human species) in a very short period of time. They have also paved the way for the development of a number of new protocols and methods, leading to novel insights in genomics and biology in general. However, HTS technologies also pose a challenge to traditional data analysis methods; this is due to the sheer amount of data to be processed and the need for improved alignment algorithms that can generate accurate results quickly.

This thesis tackles the problem of sequence alignment as a step within the analysis of HTS data. Its contributions focus on both the methodological aspects and the algorithmic challenges towards efficient, scalable, and accurate HTS mapping. From a methodological standpoint, this thesis strives to establish a comprehensive framework able to assess the quality of HTS mapping results. In order to be able to do so, one has to understand the source and nature of mapping conflicts, and explore the accuracy limits inherent in how sequence alignment is performed for current HTS technologies. From an algorithmic standpoint, this work introduces state-of-the-art index structures and approximate string matching algorithms. They contribute novel insights that can be used in practical applications towards efficient and accurate read mapping. In more detail, first we present methods able to reduce the storage space taken by indexes for genome-scale references, while still providing fast query access in order to support effective search algorithms. Second, we describe novel filtering techniques that vastly reduce the computational requirements of sequence mapping, but are nonetheless capable of giving strict algorithmic guarantees on the completeness of the results. Finally, this thesis presents new incremental algorithmic techniques able to combine several approximate string matching algorithms; this leads to efficient and flexible search algorithms allowing the user to reach arbitrary search depths.

All algorithms and methodological contributions of this thesis have been implemented as components of a production aligner, the GEM-mapper, which is publicly available, widely used worldwide and cited by a sizeable body of literature. It offers flexible and accurate sequence mapping while outperforming other HTS mappers both in running time and in the quality of the results it produces.

Acknowledgements

In order to succeed in any endeavour in life, we must remember to be thankful for those close to us we can lean upon. No man is an island, and no project like this can be accomplished without the help and support of many others. Here, I want to thank all those who have stood close to me during these years, giving me their support, knowledge, and trust.

First, I want to thank all my friends and colleagues from the CNAG for their support, all those productive discussions, and, most of all, the good times we spent together. At the same time, I want to express my gratitude for all the support I have had over the years from the management of the CNAG; especially from Ivo Gut and Simon Heath, who always believed in this project, strongly supporting it through thick and thin. In the same way, I want to thank Mercè Juan Badia and Maria Serna for all the help they have given me with administrative matters throughout the whole thesis.

I would also like to thank Juan Carlos Moure and Toni Espinosa. I consider myself very lucky to have found people so genuinely kind and valuable, who have helped me so much in my career. It goes without saying that, had it not been for you, I would not have met Alejandro Chacón. No doubt, he is one of the most hard-working and quick-witted people I have ever known. Thank you Alejandro, this project would not have been the same without all your contributions, support, and time.

At the same time, I want to thank Roderic Guigó for his ever-correct advice. Thanks for helping me widen my focus when I needed it the most, and realise what really matters. I am in your debt for the opportunities you have given me and the trust you have always had in the project and in me.

Somebody once told me that beyond hard work, intelligence, and talent, the key to success is having someone to trust you and give you the opportunity to prove yourself. That person and I have now known each other long enough to see past our differences and agree that we have both learnt a lot during this endeavour. For that, Paolo, I have only gratitude, for all your support, the time you have spent, and the wonderful opportunity you have given to me. No doubt, it has been a tough journey, but if I had to, I would do it all again.

To the three women in my life, and the man who stood by my side when I needed him the most. There is no point to this life without the certainty of your love and support.

You want to know how I did it? This is how I did it: I never saved anything for the swim back.

Vincent Freeman

Contents

Abstract

Acknowledgements

1 Introduction
   1.1 Context
   1.2 Motivation
   1.3 Methodological tenets. Algorithm Engineering
   1.4 Thesis outline

2 Bioinformatics and High Throughput Sequencing
   2.1 Biological background
      2.1.1 DNA
      2.1.2 Central dogma of molecular biology
      2.1.3 Replication
      2.1.4 Genetic Variations
   2.2 Sequencing technologies
      2.2.1 High-throughput sequencing
      2.2.2 Synthetic long-read sequencing
      2.2.3 Single molecule sequencing
   2.3 Sequencing protocols
      2.3.1 Single-end, paired-end and mate-pair sequencing
      2.3.2 Whole-genome and targeted resequencing
      2.3.3 Bisulfite sequencing
   2.4 Bioinformatics applications
      2.4.1 Resequencing
      2.4.2 De-novo assembly
      2.4.3 Other applications
   2.5 High-throughput data analysis

3 Sequence alignment
   3.1 Sequence Alignment. A problem in bioinformatics
      3.1.1 Paradigms
      3.1.2 Error model
      3.1.3 Alignment classification
      3.1.4 Alignment conflicts
   3.2 Approximate String Matching. A problem in computer science
      3.2.1 Stringology
      3.2.2 String Matching
      3.2.3 Distance metrics


         I Classical distance metrics
         II General weighted metrics
         III Dynamic programming pair-wise alignment
      3.2.4 Stratifying metric space
   3.3 Genome mapping accuracy
      3.3.1 Benchmarking mappers
         I Quality control metrics
         II Simulation-based evaluation
         III MAPQ score
         IV Rabema. Read Alignment Benchmark
      3.3.2 Conflict resolution
         I Matches equivalence
         II Stratified conflict resolution
         III δ-strata groups
         IV Local alignments
      3.3.3 Genome mappability

4 Approximate string matching I. Index Structures
   4.1 Classic index structures
   4.2 Succinct full-text index structures
      4.2.1 The Burrows Wheeler Transform
      4.2.2 FM-Index
   4.3 Practical FM-Index design
      4.3.1 FM-Index design considerations
      4.3.2 FM-Index layout
      4.3.3 FM-Index operators
      4.3.4 FM-Index lookup table
      4.3.5 FM-Index sampled-SA
      4.3.6 Double-stranded bidirectional FM-Indexes
   4.4 Specialised indexes
      4.4.1 Bisulfite index
      4.4.2 Homopolymer compacted indexes

5 Approximate string matching II. Filtering Algorithms
   5.1 Filtering algorithms
      5.1.1 Classification of filters. Unified filtering paradigm
      5.1.2 Approximate filters. Pattern Partitioning Techniques
   5.2 Exact filters. Candidate Verification Techniques
      5.2.1 Formulations of the problem
      5.2.2 Accelerating computations
         I Block-wise computations
         II Bit-parallel algorithms
         III SIMD algorithms
      5.2.3 Approximated tiled computations
      5.2.4 Embedding distance metrics
   5.3 Approximate filters. Static partitioning techniques
      5.3.1 Counting filters
      5.3.2 Factor filters
      5.3.3 Intermediate partitioning filters
      5.3.4 Limits of static partitioning techniques
   5.4 Approximate filters. Dynamic partitioning techniques
      5.4.1 Maximal Matches
      5.4.2 Optimal pattern partition
      5.4.3 Adaptive pattern partition
   5.5 Chaining filters
      5.5.1 Horizontal Chaining
      5.5.2 Vertical Chaining
      5.5.3 Breadth-first vs depth-first filtering search

6 Approximate string matching III. Neighbourhood search
   6.1 Neighbourhood search
   6.2 Dynamic-programming guided neighbourhood search
      6.2.1 Full, condensed, and super-condensed neighbourhood
      6.2.2 Full neighbourhood search
      6.2.3 Supercondensed neighbourhood search
   6.3 Bidirectional preconditioned searches
      6.3.1 Bidirectional search partitioning. Upper bounds
      6.3.2 Bidirectional region search chaining
      6.3.3 Bidirectional search vertical chaining. Lower bounds
      6.3.4 Pattern-aware region size selection
   6.4 Hybrid techniques
      6.4.1 Combining neighbourhood searches with filtering bounds
      6.4.2 Early filtering of search branches
   6.5 Experimental evaluation

7 High-throughput sequencing mappers. GEM-mapper
   7.1 Mapping tools
   7.2 GEM Mapper
      7.2.1 GEM architecture
      7.2.2 Candidate verification workflow
      7.2.3 Single-end alignment workflow
      7.2.4 Paired-end alignment workflow
   7.3 GEM-GPU Mapper
   7.4 Experimental results and evaluation
      7.4.1 Benchmark on simulated data
      7.4.2 Benchmark on real data
      7.4.3 Benchmark on non-model organism

8 Conclusions and contributions
   8.1 Journal publications

References

List of Tables

1.1 Input real datasets
1.2 Input simulated datasets

2.1 Sequencing technologies and characteristics

3.1 Mapping confusion table
3.2 δ-strata groups of 5x coverage Illumina-like simulated reads (FDR is given per million reads)
3.3 Mappability of eukaryotic genomes

4.1 Suffix array of the text T=GATTACA$ ($ denotes the end of text, and is by definition lexicographically smaller than any other symbol belonging to the alphabet)
4.2 BWT of the text T="GATTACA$" ($ denotes the end of text)
4.3 Principal magnitudes for double-bucketing FM-Index designs varying minor counter size. Note that minor counters are always padded to align at 64 bits (padding 2B per minor block using |ci| = 8 and 4B using |ci| = 16)
4.4 Encoding efficiency improvement by employing alternated counters
4.5 Comparison of encoding efficiency with increasing block length
4.6 FM-Index lookup table size
4.7 Sampled-SA size for the human genome at various sampling rates

5.1 Filtering candidates confusion table
5.2 Q-gram filter efficiency on candidates drawn from the human genome (GRCh37) allowing up to 3% error rate
5.3 Q-gram filter efficiency on candidates drawn from the human genome (GRCh37) allowing up to 6% error rate
5.4 Three equivalent partitions for searching up to 4 errors
5.5 Experimental comparison of APP against OPP using 5x Illumina datasets at 75nt and 100nt. For this experiment APP was using (Threshold, MaxSteps) = (20, 4)
5.6 Mapping benchmark for different (Threshold, MaxSteps) values using Simulated Illumina-like 1M x 100nt reads (GRCh37)
5.7 Experimental running times in ms for each filter in the chain using different pattern lengths (e=10%) (5x Illumina datasets for GRCh37)
5.8 Experimental number of candidates (Millions) produced by each filter in the chain (e=10%) using different pattern lengths (5x Illumina datasets for GRCh37)

6.1 Benchmark results for neighbourhood methods (10K Illumina reads of 100nt, e=4%). Searches report all valid matches (i.e. complete set of results). Results marked with an asterisk have been estimated from computing a smaller sample of 1K reads

6.2 Benchmark results for neighbourhood methods using different read length (10K Illumina reads, e=4%). Searches report all valid matches (i.e. complete set of results)
6.3 Benchmark results for hybrid neighbourhood search methods increasing error rate (10K Illumina reads of 100nt). Searches report all valid matches (i.e. complete set of results). Results marked with an asterisk have been estimated from computing smaller samples

7.1 Running time in seconds for simulated single-end datasets
7.2 Percentage of true positives found for simulated single-end datasets
7.3 Percentage of reported true positives – with respect to all reads – with a certain MAPQ score or greater. Simulated human reads at 100nt and 300nt
7.4 Percentage of reported true positives – with respect to all reads – with a certain MAPQ score or greater. Simulated human reads at 500nt and 1000nt
7.5 Running time in seconds for simulated paired-end datasets
7.6 Percentage of true positives found for simulated paired-end datasets
7.7 Running time in seconds for real datasets
7.8 Percentage of mapped reads for real datasets
7.9 Results from mapping against a non-model organism (Oncorhynchus mykiss)
7.10 Results from mapping against a highly repetitive and large reference (Triticum aestivum)

List of Figures

1.1 AE development cycle [Chimani and Klein, 2010]

2.1 Deoxyribonucleic acid
2.2 Cost per re-sequenced genome (Source NIH)
2.3 Illumina quality profile
2.4 (a) Single-end protocol. (b) Paired-end protocol (Source Illumina genomic sequence datasheet)
2.5 Mate pair protocol (Source Illumina genomic sequence datasheet)
2.6 Bisulfite Protocol (Source atdbio)
2.7 Basic steps of a genome assembler [Baker, 2012]
2.8 FASTQ read example

3.1 Types of sequence alignment
3.2 Alternative types of sequence alignment
3.3 General conflictive alignments
3.4 Biologically inspired conflictive alignments
3.5 Stratified search space for P=GATTACA
3.6 Stratified conflict resolution

4.1 FM-index block design
4.2
4.3 Bit-significance layers encoding
4.4 Basic 1-level bucketing FM-index design
4.5 Double bucketing FM-index design (FM.2b.16c). Figure shows a separate table for major counters (Ci[·]) and minor counters interleaved with text bitmaps in each minor block
4.6 Alternated counters FM-index design (FM.2b.16c + AC). Figure shows two contiguous minor blocks, each with an alternated set of minor counters
4.7 FM-Index design example extending the alternated counter
4.8 Double bucketing using 128-bit bitmaps FM-index
4.9 2-Step FM-index design considering a 4-letter alphabet
4.10 Optimised LFc operation implementation (FM.2b.16c)
4.11 Optimised LFc operation tables
4.12 Same-Block LFc implementation (FM.2b.16c)
4.13 Precomputing LFc-elements (FM.2b.16c)
4.14 Precomputed LFc (FM.2b.16c)
4.15 FM-Index lookup table (memoized LFc calls)
4.16 FM-Index lookup table query
4.17 Double bucketing FM-index design (FM.2b.16c + sSA) with interleaved sampling bitmaps

4.18 Example of mapping using an HC-index

5.1 Filtering paradigm
5.2 BPM Advance Block Implementation
5.3 BPM Implementation
5.4 Tiled dynamic programming table and dimensions
5.5 Q-gram lemma. Counting q-grams in a candidate
5.6 Q-sample lemma. Counting Q-samples in a candidate
5.7 Factor filter up to 2 errors (3 factors; at least 1 exact match)
5.8 Intermediate partitioning filter up to 2 errors (2 factors; at least one with 1 or fewer errors)

6.1 Dynamic programming table for P="GATTACA" and T="ATCAT" (δe("GAGATA", "GATTACA")). Free-end cells are marked in bold and red
6.2 Generation of the BNS-partition. First step splitting four errors into two search regions. The global error bound of each region is specified between parentheses (i.e. max, min)
6.3 Generation of a single BNS-partition (e=4). The resulting search partition (no. 0) is annotated with the order in which the regions must be searched and the direction of these searches
6.4 Full set of BNS-partitions (e=4)
6.5 Bidirectional region search chaining using dynamic programming and depiction of the search space explored by a single search partition (against a full neighbourhood search)
6.6 BNS-partitions imposing lower bounds (i.e. vertical chaining restrictions among BNS-partitions) (e=4)
6.7 Search constraints from filtering 3 exact factors
6.8 BNS-partitions imposing constraints from searching 3 exact factors

7.1 Architecture of the GEM-Mapper (Version 3)
7.2 GEM candidate verification workflow
7.3 GEM single-end alignment workflow
7.4 GEM paired-end alignment workflow
7.5 GEM batch mapping
7.6 Comparative GEM CPU speed-up for different simulated read lengths (SE). All mappers run using defaults
7.7 ROC curves for simulated SE datasets (default parameters)
7.8 Comparative GEM CPU speed-up for different simulated read lengths (PE). All mappers run using defaults
7.9 ROC curves for simulated PE datasets (default parameters)

Chapter 1

Introduction

1.1 Context
1.2 Motivation
1.3 Methodological tenets. Algorithm Engineering
1.4 Thesis outline

One of the most outstanding research milestones in biotechnology achieved in recent years has been the development of the so-called high-throughput sequencing (HTS) technologies. These technologies are able to sequence an individual human's genome in a very short period of time (compared to Illumina sequencers, which can take up to 10 days to finish the process) and produce millions of short reads of the genome in the form of short nucleotide sequences (coming from different parts of the genome). This has established the framework for new analysis protocols where all this information is processed together towards the reconstruction of the individual genome and the generation of further results. As a result of the computational challenge, a great deal of research and development of new methods and algorithms has been carried out in the last decades [Goodwin et al., 2016]. Current research in bioinformatics is strongly driven by the need to catch up with sequencing technologies and develop new protocols to give new insights into the genomic understanding of biology [Eisenstein, 2015].

This thesis tackles one of the fundamental problems in the analysis of HTS data: sequence alignment. Once the data (short reads of DNA, i.e. random chunks of the genome) is output by the sequencers, it must be pieced together so as to reconstruct the original genome sequence of the individual. With the help of a reference genome, every re-sequencing project needs a sequence aligner (a.k.a. mapping tool or mapper) to accurately locate the genomic position each read comes from.

Despite being easy to formulate, this problem leads to many ambiguities and degrees of freedom in practice. On top of that, the performance of sequence aligners is critical to coping with the huge amounts of data generated by modern sequencers. Moreover, the quality of the results becomes vital, as the downstream analysis relies entirely on the accuracy of this step. A poor methodology or a misplaced choice of protocols can lead to erroneous results in the whole analysis.

Many current analysis protocols (e.g. re-sequencing, bisulfite sequencing or de-novo assembly) strongly depend on the results produced by the alignment step. However, each experimental protocol demands different results, and thus requires the aligner to adapt its output to better fit the needs of that protocol. Similarly, sequencing data may exhibit different properties depending not only on the protocol but also on the underlying sequencing technology [Trivedi et al., 2014].


Because of this, modern mappers are compelled to adjust their internal parameters (e.g. error model and tolerance, read length, scoring scheme, etc.) so as to better suit the needs of each case without sacrificing quality or performance. In this context, the versatility and scalability of aligners have become paramount. As evidence of this, the past decade has seen a huge amount of research and development effort put into alignment tools serving different purposes, so as to fulfil the needs of research making use of these new sequencing biotechnologies [Fonseca et al., 2012].

Until the establishment of HTS technologies, theoretical algorithms from the classical literature on sequence alignment (known as approximate string matching, or ASM, in the field of computer science) were designed with very different scenarios in mind, e.g. matching common words against natural-language texts [Navarro, 2001]. Sequencing technologies (long-read sequencing most of all) produce inputs in a markedly different range of parameters (i.e. longer reads, higher error rates, longer reference texts, etc.). For this reason, traditional algorithms had to be rethought and adapted; in addition, the development of new algorithms with a strong emphasis on practical optimisations and heuristics has become critically relevant. As in other experimental fields requiring massive computations, an algorithm with good theoretical bounds can turn out to be impractical if the practical aspects of its implementation and tuning are neglected.
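For illustration only, the following minimal Python sketch (not part of the GEM-mapper; the function name and test strings are mine) computes the classical Levenshtein edit distance between a pattern and a text by dynamic programming, i.e. the textbook formulation of ASM that later chapters try to avoid computing naively at genome scale:

```python
def edit_distance(pattern, text):
    """Classical dynamic-programming (Levenshtein) distance.

    Runs in O(len(pattern) * len(text)) time and space, which is exactly
    why exact DP alone does not scale to billions of reads against a
    3 Gbp reference and needs the indexes and filters discussed later.
    """
    m, n = len(pattern), len(text)
    # dp[i][j] = minimum number of substitutions, insertions and deletions
    # needed to turn pattern[:i] into text[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / mismatch
    return dp[m][n]


print(edit_distance("GATTACA", "GATACA"))  # prints 1 (one deletion)
```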

The impact is not limited to the algorithmic level. The data analysis challenges brought by high-throughput technologies have also influenced the design of new data structures for primary and secondary storage [Hayden, 2015]. Many classical index structures for ASM (e.g. suffix trees or suffix arrays) have proved unsuitable for the scale of the data being managed, and several proposals have therefore been made to overcome the memory wall and save disk storage. Among other succinct indexes, the FM-Index (based on the same Burrows-Wheeler transform used for data compression) has become remarkably popular among sequence aligners [Ferragina and Manzini, 2000]. By providing a full-text index in very small space, this data structure has enabled algorithms to operate in RAM on genomes of human size [Li and Durbin, 2010]. As a result, the development of aligners has largely converged on its use, and on research into its inefficiencies and possible optimisations.
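As a concrete pointer to the Burrows-Wheeler machinery discussed in Chapter 4, the toy sketch below (my own, for illustration only) builds the BWT of a small text by naively sorting all its rotations; production indexes derive the BWT from the suffix array instead, since sorting rotations is far too slow and memory-hungry at genome scale.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    text = text + "$"  # unique terminator, lexicographically smallest symbol
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("GATTACA"))  # prints ACTGA$TA
```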

On a higher and more operations-oriented level, high-throughput technologies have led to the development of large and complex analysis pipelines. It is important to note that the requirements of these pipelines have changed progressively throughout the years. Because of the novelty of the data produced and its constant improvement, pipeline development has been forced to keep up with the pace of the sequencing technologies [Marx, 2013]. For that, many frameworks have been proposed to link together small analysis tools and compose flexible and robust pipelines [Bardet et al., 2012; Guo et al., 2015]. Even though huge advances have been made towards accurate pipelines [Hwang et al., 2015], many stages of common analysis pipelines are still being researched to gain better performance, accuracy and quality-control mechanisms. Thus, standardisation and flexibility have become paramount in the field for every developer and researcher [Gargis et al., 2015].

Such is the scale of the problem and the need for scalable solutions that a progressive disconnection from formal methods has occurred, in favour of ad-hoc heuristics [Slater and Birney, 2005]. This is having a serious impact on the field of sequence mapping: metrics for assessing accuracy are often found to be ill-defined, and therefore many widely used tools do not produce the expected results (i.e. algorithmic artifacts or inaccurate results [Ribeca and Valiente, 2011]). One of the fundamental aims of this thesis is to emphasise that accuracy and quality assurance are paramount. This situation requires comprehensive methods and solid tools that can guarantee the quality of the analysis results.

One common misconception is that an algorithm that performs well with a specific protocol, organism and range of input-data parameters can consequently be used in other cases while retaining its benefits. Even if good results are achieved for a certain dataset and experiment, the smallest change in the input data can yield poor results if the analysis method is not robust enough and built upon solid algorithmic foundations. The ever-present trade-off between performance and accuracy often leads to hazardous algorithmic heuristics that cannot lightly be reused in similar experiments. Hence, it is essential that algorithmic ideas be formally supported by solid analysis that can provide performance and quality guarantees. Good algorithmic ideas often stand by themselves and are confirmed by good experimental results; conversely, taking specific experimental measurements as the reference to build algorithms upon, without proper analysis, constitutes a methodological mistake. For this reason, algorithmic guarantees on quality and performance are one of the main focuses of this thesis.

This thesis does not intend to be an introductory guide to either bioinformatics or sequence alignment algorithms. References are provided throughout the document to support many concepts and help the reader understand what the original contributions of the thesis are. However, the ultimate goal is to explain, analyse, and benchmark novel algorithmic approaches in the field.

1.1 Context

From the start, this thesis project has been strongly coupled to the development of a production-ready mapper, the GEM-mapper (Genomic Multi-tool Mapper). The project, led by Paolo Ribeca, was launched in 2008 at the CRG within Roderic Guigó's group as a research initiative towards designing an accurate and scalable aligner for high-throughput sequencing data. Because of that, all the ideas and algorithms proposed within this thesis have been subsequently implemented within the GEM-mapper to test their performance in a practical scenario with real data.

Since its adoption by the CNAG (Centro Nacional de Análisis Genómico, the Spanish national sequencing center) in 2010 when the center was created, the GEM-mapper has been used daily. As of 2010 the CNAG was equipped with twelve Illumina GA IIx machines, for a peak sequencing capacity of 150 Gbases/day (equivalent to the resequencing of a human genome). In seven years of operation, its sequencing machines have constantly been upgraded to the newest HiSeq 2000-4000 series, leading to an almost 10-fold increase in peak production capacity. Such breathtaking yields have driven the development of the GEM-mapper towards increasing performance, with the goal of keeping up with the amount of sequencing data produced by the centre. Since 2008 three major versions of the GEM-mapper have been developed and released. They all present significant differences in their architectural design, due to the constant incorporation of improvements and new algorithms to keep up with the pace of sequence production.

Over the years, the CNAG has acquired other sequencing platforms such as the Oxford Nanopore MinION. Targeting new protocols and applications, these technologies produce sequencing data with markedly different profiles – longer reads, and different error rates and error models – requiring new analysis tools and, of course, an adapted mapper. In fact, the entire development of the GEM-mapper can be divided into two periods according to the sequencing technology it was focusing on. During the first period, the focus was on relatively short reads (i.e. <100nt), with special emphasis on completeness (that is, on retrieving all possible alignments) and resolving mapping conflicts (mainly due to inherent repetitions present in the genome). During the second period, the focus was placed on obtaining performance scalability with longer reads, tolerance to higher error rates and the generation of meaningful local alignments.

As evidence of its usefulness and suitability for addressing big genomic projects, the GEM-mapper and several analysis pipelines based on it have been adopted by other institutions and projects, such as the Encyclopedia of DNA Elements (ENCODE) consortium, the Chronic Lymphocytic Leukaemia (CLL) Genome Consortium, and the Genetic European Variation in Health and Disease (GEUVADIS) consortium, to name a few.

The development of the GEM-mapper has always been strongly influenced by the requirements of the analysis pipelines it is embedded into. For that reason, many features and tools have been implemented to offer compatibility and support for pipeline development. GEM-tools, in particular, is a software suite which implements several tools for quality control, format conversion, accuracy analysis and alignment manipulation. Underlying GEM-tools, a library is offered for C and Python developers to benefit from its components and use its building blocks to quickly develop pipeline solutions. In this way, the GEM-tools library has become the software glue bringing together many of the tools involved in the mapping stage of production-ready pipelines for genomic data analysis.

Independently, with the boom of massively parallel devices (e.g. GPGPUs), many-integrated-core architectures (e.g. the Intel Xeon Phi) and powerful coprocessor units (e.g. FPGAs) have been developed. Ribeca's research group established two collaborative studies to assess the suitability and benefits of porting the GEM-mapper to these hardware devices. A first attempt was made in collaboration with Maxeler Technologies and their multiscale dataflow technology based on FPGAs. The second, longer-term study was done with the Universidad Autónoma de Barcelona (UAB) and focused on the use of NVIDIA GPUs. These two collaborations highlighted the need to re-design the mapper architecture so as to isolate and group together algorithmic bottlenecks before offloading to these devices (GPU GEM-mapper).

The GEM-mapper continues to be developed, maintained and used by several sequencing centres. While the main lines of research are still focused on accuracy and performance, its development also aims to enlarge the scope of its use by targeting new protocols and applications (like support for bisulfite protocols or its adoption within assembly workflows). At the same time, newly developed algorithmic techniques aim to deal with upcoming sequencing technologies and handle the ever-increasing yields produced.

1.2 Motivation

Although the GEM-mapper project has been strongly conditioned by the constant development of new sequencing technologies and the requirements of increasingly large sequencing analysis projects, its principal investigator – Paolo Ribeca – and this thesis' author have tried their best to remain faithful to its initial formulation and purpose. Described below are the main tenets on which this thesis project and, by extension, the GEM-mapper's design are based.

1. HTS alignment algorithms have to guarantee well-defined, quantifiable and, most importantly, reproducible accuracy (which is paramount). Mapping is a crucial stage in most resequencing pipelines and many other HTS analysis workflows. Good-quality results generated by a mapper will heavily influence downstream analysis, ultimately leading to meaningful and accurate scientific findings in biological research and medical diagnostics. Conversely, incorrect or incomplete alignments may lead to misleading or incorrect scientific findings. For that reason, accuracy in the results produced by sequence alignment algorithms is paramount. This thesis aims to define a robust methodological framework to assess the accuracy of mapping results. In this spirit, it aspires to establish a common ground allowing different mapping tools to be evaluated and compared using quantifiable metrics. Moreover, all the algorithmic techniques presented in this thesis are analysed from the standpoint of their accuracy, leading to a meaningful measurement of the quality of the results.

2. HTS calls for high-performance algorithms that can cope with increasing yields in the production of genomic data. Even if a mapping tool delivers high-quality results, they could be of no use if the time spent computing them is impractical. So sequence alignment algorithms have to keep up with the pace of HTS technologies. This thesis intends to contribute high-performance algorithms for sequence alignment that can be used in real-life sequencing production environments. Nowadays, the price per sequenced base continues to decrease faster than the costs per base related to computational infrastructure (such as expenditure for disk space or storage on the cloud, RAM, or CPU time). Quite soon we may well see an unbalanced scenario where the bulk of the sequence analysis budget goes into computing resources rather than sequencing. We support the idea that mapping should not take the dominant share of the computing resources of a sequencing center. In particular, sufficient computational resources should be left available to subsequent analysis stages. In addition, we would like to debunk the common misconception that genome-scale mapping is still strictly limited to high-performance computing (HPC) facilities. We would like to point out that with suitable algorithms and data structures it is feasible for any scientist to run a mapping tool on their laptop (this ability is limited, however, with regard to storage, as most experiments nowadays produce more data than the average laptop can hold and process).

3. Developments in sequencing technologies require alignment algorithms that are able to scale well with increasing read length, error rate and reference genome size. Constant developments in sequencing technologies have imposed new requirements on mapping tools. As sequenced reads get longer and exhibit higher degrees of error, many tools and algorithms have become obsolete. Such is the difference in the range of parameters among HTS technologies that, in less than a decade, we have gone from using relatively short reads

(<100nt) with high homology rates (1-5% sequencing error) to very long reads (of the order of thousands of nucleotides) with much higher error rates (ranging from 5 to 30% depending on the technology). Similarly, as more complex genomes of new organisms are assembled, we have gone from aligning to references that fit quite easily in memory (for instance Saccharomyces cerevisiae with its 12Mbp) to genomes that challenge the memory capacity of current computers (such as the Paris japonica genome with its 150Gbp); a back-of-envelope estimate of the index sizes involved is sketched right after this list. In consequence, such wide shifts in the range of parameters force researchers to develop new scalable, succinct data structures and algorithmic methods. One of the central focuses of this thesis is to tackle the technically challenging problem of how algorithms scale with longer read lengths, higher error rates and larger genome references. Throughout this project, we seek suitable algorithmic approaches and present comprehensive benchmarks considering a wide variety of input parameter ranges. Our intention is to clearly state the limitations and tradeoffs of each technique with respect to scalability.

4. New protocols and applications in bioinformatics which use HTS data require flexible mapping tools suitable for a wide variety of experiments. One of the purposes of this thesis is to make users aware that different sequencing experiments require different analysis setups. In this spirit, we want to point out that a single mapper configuration is usually unable to suit the different needs of data produced by different technologies, protocols or experiments. A practical mapping tool must be flexible enough to adapt to different scenarios and clearly state its limitations regarding different applications. Likewise, a mapper should be designed considering its place in common analysis pipelines, and favouring interconnectivity among components (for instance following standard I/O formats, data exchange conventions, etc.).
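To put the scalability concern of tenet 3 in perspective, here is a back-of-envelope estimate (my own, not a figure taken from later chapters) of the space occupied by a plain, uncompressed suffix array over references of the sizes mentioned above, assuming 4-byte positions whenever they fit in 32 bits and 8-byte positions otherwise; the text itself and any auxiliary tables are excluded:

```python
def plain_suffix_array_bytes(genome_length):
    """Rough size of an uncompressed suffix array: one stored position per base."""
    bytes_per_position = 4 if genome_length < 2**32 else 8
    return genome_length * bytes_per_position


for name, length in [("S. cerevisiae", 12_000_000),
                     ("H. sapiens", 3_100_000_000),
                     ("P. japonica", 150_000_000_000)]:
    gib = plain_suffix_array_bytes(length) / 2**30
    print(f"{name:>15}: ~{gib:,.1f} GiB of suffix-array positions")
```

Even for the human genome the plain structure already strains the RAM of a commodity machine, which is precisely the kind of pressure that motivates the succinct indexes discussed in Chapter 4.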

1.3 Methodological tenets. Algorithm Engineering

Traditional algorithm theory has always worked with simple models of problems and machines in order to search for solutions that can guarantee performance for all possible inputs [Donald, 1999]. Classical approaches have led to mathematically sound, highly reliable, and even elegant algorithms suitable to be applied to a wide range of applications [Cormen, 2009]. Ultimately, these algorithms are supposed to become part of real applications. Unfortunately, this development process is not always straightforward. Transferring algorithms to real applications can be a slow process; sometimes developments do not produce the desired results, and other times – due to their complexity – they are not even worth trying. Working in an applied area – trying to develop practical solutions to real problems – every developer becomes progressively more aware of the gap between algorithm theory and practical algorithms. Realistic modern hardware architectures grow more and more complex (e.g. out-of-order pipelines, SIMD instructions, convoluted cache hierarchies, etc.), diverging from the simplistic computational models proposed by the traditional literature. At the same time, new applications grow in requirements, which forces research onto more elaborate algorithms. Often, this research may neglect promising ideas due to the impracticality of implementing them, or just because their mathematical analysis would be too difficult. In the same way, formal algorithm analysis can derive bounds that turn out to be unlikely to appear in practice. Or worse, asymptotic analysis might ignore critical constants that strongly affect experimental running times – making some algorithms interesting only from a theoretical standpoint. Not to mention the inherent skewness of real input datasets, which can make bounds meaningless. Hence, some algorithms are not amenable to formal analysis, and in most situations we are unable to suggest useful formal bounds on their performance.

Algorithm engineering (AE) methodology [Sanders, 2009] and its guidelines are adopted for algorithmic research in the context of this thesis. With the aim of providing practical algorithmic solutions for real input datasets, this methodology aspires to bridge the obvious gap between algorithm theory and experimental software engineering.

The rationale behind this methodology holds that isolating the analysis and design of algorithms from implementation and experimentation is a very limited approach. Instead, it proposes to join these activities together into a feedback loop of design, analysis, implementation and experimentation. In this way, AE proposes a development cycle focused on algorithm development, performing multiple iterations over its substeps (Figure 1.1). This cycle is driven by falsifiable hypotheses validated by experimentation. Results from experiments cannot prove a hypothesis true, but they can support it; however, the outcome of the experiments can prove a hypothesis wrong. As a result, hypotheses can come from creatively proposed ideas (as in classical algorithm theory) or from inductive reasoning based on the results of previous experiments – introducing valuable feedback from experimental activities. Because of this, experimental reproducibility is a must: not only does it make the process resumable from any stage, but it also enables other researchers to repeat the experiments and draw the same conclusions – or spot mistakes in previous iterations.

Below, each of the substeps involved in the AE cycle is briefly explained. We would like to put each of them in the context of the thesis by presenting actual examples.

• Application, realistic models, and real inputs. First of all, following AE principles, the application focus of our research must be defined. For this, realistic models need to be specified to meet the precise requirements of our problem. The selected use cases must be representative of the real applications targeted and must be detailed enough so as to guarantee the success of the process (i.e. applicability in real production setups).

Figure 1.1: AE development cycle [Chimani and Klein, 2010]

Accordingly, we have to select representative input datasets to run the experimentation – which in turn will provide fundamental feedback for our design. Input datasets should preferably come from inputs used in real use cases. However, we will also consider simulated datasets whenever this provides better grounds to profile experiments and evaluate results (i.e. adding important information not found in real datasets). In any case, the models behind simulated datasets have to reproduce real inputs as accurately as possible – as any factitious bias can lead to dramatically poor assumptions and poor design decisions. Either way, it is highly recommended to corroborate results from simulated data with experiments using real inputs.

In the case of sequence alignment – and within the context of this thesis – we mainly focus on resequencing experiments against the human genome (mainly using GRCh37; Genome Reference Consortium Human genome build 37). The principal reasons supporting this choice are the relevance and wide applicability of this use case (i.e. human biology research, medical research and diagnosis, etc.). At the same time, mapping to a relatively large mammalian reference genome – whose assembly is considered complete and of good quality – is representative of many other widely used setups. Furthermore, the human genome is large enough to stress many challenging aspects of the problem that go unnoticed when using smaller genomes (e.g. Felis catus, Mus musculus, Saccharomyces cerevisiae, etc.).

In the same vein, we have compiled a representative ensemble of sequence datasets to be used during experimentation. These datasets come from different technologies and aim to represent the most important technologies and their input variants (i.e. protocol, read length, and error rate and model). Table 1.1 below lists the input datasets and their main characteristics.

Independently, we have generated synthetic datasets mimicking the properties of the previous real datasets, plus other parameter ranges aimed at testing scalability. Table 1.2 shows a summary of the simulated datasets and their corresponding technology; a toy simulator sketch illustrating the principle follows the tables.

Accession      Sequencing technology   Total reads (Millions)   Read length (nt)                Coverage
ERR409730      Illumina HiSeq (SE)     150                      100                             5x
ERR409730      Illumina HiSeq (PE)     2 x 75                   2x100                           5x
ERR1122087     Illumina MiSeq (SE)     50                       300                             5x
ERR1122087     Illumina MiSeq (PE)     2 x 25                   2x300                           5x
ERR039480      Ion Torrent             80                       min=4, avg=188, max=2716        5x
1000genomes    Moleculo                3.8                      min=30, avg=3941, max=19553     5x

Table 1.1: Input real datasets

Dataset                 Sequencing technology   Total reads (Millions)   Read length (nt)   Coverage
Illumina.l100.se        Illumina SE             150                      100                5x
Illumina.l100.pe        Illumina PE             2 x 75                   2x100              5x
Sim.Illumina.l300.se    Illumina SE             50                       300                5x
Sim.Illumina.l300.pe    Illumina PE             2 x 25                   2x300              5x
Sim.Illumina.l500.se    Illumina SE             30                       500                5x
Sim.Illumina.l1000.se   Illumina SE             15                       1000               5x

Table 1.2: Input simulated datasets
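As an illustration of how such simulated datasets can be produced in principle, the toy single-end simulator below (my own sketch, not the actual simulator used to generate Table 1.2) samples reads uniformly from a reference and injects substitution errors only; real simulators additionally model indels, base-quality profiles, strand and fragment-size distributions.

```python
import random


def simulate_reads(reference, n_reads, read_length, error_rate, seed=42):
    """Sample uniform single-end reads and inject substitution errors."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        read = list(reference[start:start + read_length])
        for i in range(read_length):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in alphabet if b != read[i]])
        reads.append(("".join(read), start))  # keep the origin as ground truth
    return reads


# Toy usage on a random reference; the known origin of each read is what
# later enables simulation-based accuracy evaluation (Section 3.3.1).
reference = "".join(random.Random(0).choice("ACGT") for _ in range(10_000))
for read, origin in simulate_reads(reference, 3, 100, 0.01):
    print(origin, read[:40] + "...")
```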

• Algorithmic design step. Design in AE aims to deliver practical solutions. Compared to classical algorithm design, constant factors are not overlooked and play a fundamental role in choosing among different algorithms and designs. In the same way, hidden factors and constraints related to the actual computing architecture are taken into account. Also, asymptotic bounds and amortized costs are carefully considered to avoid neglecting their actual cost in a real scenario.

This mainly practical approach might favour the choice of simple algorithms that, despite having theoretically intractable bounds, can lead to sufficiently good performance. In practice, and using real inputs, these algorithms might never run into worst-case scenarios and can perform adequately (as opposed to other convoluted solutions with theoretical guarantees).

At this step we also account for a commonly overlooked factor in algorithm design: implementation complexity. Ideally, we would like to derive simple algorithms that are easy to implement and test. Experience shows that simple algorithms often lead to fewer bugs – which in turn results in shorter development times – and better performance compared to convoluted algorithmic solutions.

• Analysis step. Following the design step, AE prescribes the analysis of the previously designed algorithms and data structures from different resource-usage perspectives. To that end, we must rely on realistic models to properly analyse expected running-time bounds, memory usage (e.g. overall footprint, memory hierarchy impact, data access patterns, etc.), I/O requirements (most importantly the bandwidth required), and so on.

This modelling is not only important to highlight the critical parts of our algorithms, but also to help check the initial analytical bounds against experimental results. Quite often we may find that experimental results with real input datasets lead to better bounds, or to hints that can lead to better designs.

• Implementation step. Implementation represents one of the most important steps of AE. It provides the most precise specification of the algorithm and highlights corner cases and details invisible to the previous steps. During implementation, flaws and omissions that were not obvious during the analysis and design steps are often spotted. While implementing, algorithmic guarantees derived from previous steps can be checked (avoiding bad implementations or regressions in the code). For this reason, it is highly advisable to take advantage of modern development tools (e.g. full-fledged IDEs like Eclipse, unit-testing frameworks, etc.) and rely on solid software engineering practices. One should never underestimate the usefulness of this often time-consuming process and all the insights into the algorithm that can be gained along the way.

• Experimentation step. Experimentation is a fundamental step in AE, where previous hypotheses can be disproved and new ones derived – initiating another iteration of the loop. It is often the case that the theoretical analysis misjudges the complexity of the algorithm, and experimentation shows that the runtime analysis is either wrong, not accurate enough or, in certain cases, counter-intuitive. The results of experimentation can help to improve the current algorithm by suggesting new design directions. At this point, simulated datasets can become very useful, as they can complement results obtained on real inputs. In situations where there is no ground truth available – as in sequence alignment – these synthetic datasets can help to estimate the accuracy of the proposed algorithms. Profilers (e.g. Intel VTune Amplifier, Valgrind, Perf, etc.) also become very useful: they can be used not only to check all the algorithmic guarantees produced previously, but also to help discover new bottlenecks.

• Production of an algorithm library or application. Last but not least, at the end of potentially multiple iterations over the AE cycle, some type of modularised implementation (i.e. a library, an application or similar) is expected to be delivered. This has no other purpose than the dissemination of the algorithm, so that others can make use of it, make corrections, or extend it by applying further iterations.

AE methodology often leads to remarkable improvements in real-case scenarios due to its focus on experimental algorithms and realistic datasets. Ideas and designs that might have been discarded using traditional algorithm theory are considered by AE as long as experimental evidence supports their performance. The fact that experimental results play a fundamental role in the AE cycle enables this methodology to reach practical solutions to real-case problems.

1.4 Thesis outline

This thesis is conceptually structured into three main parts. In the first part (Chapters 2 and 3), we present an introduction to relevant topics such as bioinformatics, high-throughput sequencing and sequence alignment. This first block of chapters aims to put this thesis in context, providing background and insights into the challenges and accuracy limitations of sequence alignment. The second block (Chapters 4 and 5) presents the contributions of this thesis to the problem of approximate string matching (ASM) from the standpoint of algorithmic design. Finally, the third block (Chapters 6 and 7) gives a more comprehensive view of the different contributions of this thesis, integrated together into a practical sequence aligner implementation (GEM-mapper).

In more detail, Chapter 2 (Bioinformatics and high-throughput sequencing) serves as a general introduction to bioinformatics and HTS, focusing in particular on protocols and applications relevant to the topics of the thesis. A brief background on a few essential methods and problems is provided.

Chapter 3 (Sequence alignment) analyses in depth the general problem of mapping sequencing reads to a genome. It presents the problem in its duality – first as a practical tool in everyday bioinformatics, then as a theoretical challenge in computer science. Also, it proposes meaningful methods to assess the quality of the results of the alignment process.

Chapter 4 (Index structures) first presents a brief overview of the major index classes that have been traditionally used in order to perform indexed approximate string matching. Focusing on the FM-Index family, we analyse in depth several key design choices and their impact on index performance. This chapter strongly emphasises practical aspects of FM-index design, and how to implement functions that are key to some of the algorithms proposed later on.

Chapter 5 (Filtering algorithms) aims to give a unified vision of classical filtering algorithms towards search-space reduction. It presents the main filtering algorithms this work is based upon and the principles used to combine them towards more efficient and effective filters.

Chapter 6 (Neighbourhood search) describes neighbourhood search algorithms from the brute-force approaches to the most refined versions. Here we present the bidirectional search algorithms and propose an efficient generalisation to . Additionally, hybrid techniques are presented as an effective method to combine filtering and neighbourhood search algorithms.

Finally, Chapter 7 (High-throughput sequencing mappers) addresses the more applied side of this project, describing in detail the internal design of the GEM-mapper. This chapter aims to describe the practical implementation of a production-ready genomic sequence aligner. We also aim to link the algorithmic sections of this thesis with their practical implementations, and quantify their performance in a real setup using realistic input datasets. Furthermore, this chapter focuses on highlighting the most critical parts of the implementation, bridging the gap between theoretical and practical performance in a real hardware setup.

Chapter 2

Bioinformatics and High Throughput Sequencing

2.1 Biological background
   2.1.1 DNA
   2.1.2 Central dogma of molecular biology
   2.1.3 Replication
   2.1.4 Genetic Variations
2.2 Sequencing technologies
   2.2.1 High-throughput sequencing
   2.2.2 Synthetic long-read sequencing
   2.2.3 Single molecule sequencing
2.3 Sequencing protocols
   2.3.1 Single-end, paired-end and mate-pair sequencing
   2.3.2 Whole-genome and targeted resequencing
   2.3.3 Bisulfite sequencing
2.4 Bioinformatics applications
   2.4.1 Resequencing
   2.4.2 De-novo assembly
   2.4.3 Other applications
2.5 High-throughput data analysis

Having an education in pure computer science, one of the main challenges I was confronted with was to acquire fundamental knowledge of such a multidisciplinary field as bioinformatics. As it is a relatively new field and many methods and approaches are still in their infancy, it can be difficult to keep up with the latest findings and truly grasp their wider biological implications. Not only that, but exposure to all the methodologies and techniques used in sequencing – within such a big sequencing centre as the CNAG – can be overwhelming. Besides the developments in the algorithmic topics of this thesis, I consider that one of the most enriching parts of it was to acquire knowledge in the many fields involved in high-throughput sequencing and bioinformatics in general.

This chapter is intended to give a brief overview of the biological background, technologies and applications required to understand high-throughput sequencing, its results and its applications. It is not intended to be a self-contained dissertation on biological or biochemical foundations, but rather a short introduction, with special emphasis on the points relevant to the

thesis and its main points of discussion (i.e. fast and accurate sequence alignment). References are given throughout the chapter to external supporting material.

Section 2.1 (Biological background) aims to give a brief introduction to the molecular biology principles necessary to understand the purposes and challenges of sequencing. Section 2.2 (Sequencing technologies) gives a description of the most relevant modern sequencing approaches and the current technologies available. This section aims to emphasise the main advantages and disadvantages of these technologies and the actual characteristics of the data produced (rather than the chemistry behind the processes). Similarly, in Section 2.3 (Sequencing protocols), the most relevant sequencing protocols are described so as to put them into context and present their main applications. Additionally, Section 2.4 (Bioinformatics applications) describes the most notable applications of sequencing within the context of modern bioinformatics. This section aims to present and justify the critical role of sequence alignment within these applications and how its performance determines the final quality of the results. Finally, Section 2.5 (High-throughput data analysis) aims to give a glimpse into the complexity of modern sequencing data analysis pipelines and the requirements and impact of the sequence alignment algorithms within.

2.1 Biological background

2.1.1 DNA

Deoxyribonucleic acid (DNA), whose double-helix structure was described by Watson and Crick in 1953, is a molecule that carries the genetic instructions used by all known living organisms and many viruses. Each DNA molecule is made of a chain of nucleotides, also called bases. Each base can be one of four nitrogen-containing nucleobases, either adenine, cytosine, guanine or thymine (denoted A, C, G and T respectively). Nucleotides are chained by covalent bonds between the sugar of one nucleotide and the phosphate of the next, forming an alternating sugar-phosphate backbone.

DNA molecules consist of two strands of nucleotides that wind around each other to form a double helix (the DNA secondary structure). The sequences of nucleotides on the two strands are complementary, with each A on one strand binding to a T on the other one and vice versa, and each C on one strand binding to a G on the other one and vice versa. This constitutes an important redundancy mechanism of DNA. In figure 2.1 we can see a representation of a DNA molecule and its components.

Each DNA strand has an unbound phosphate group attached at the 5’ end and a hydroxyl group attached at the 3’ end, where 5’ and 3’ denote the two ends of the molecule. Due to this chemical asymmetry of the bonds between nucleotides in each strand, both strands have a direction (i.e. forward or reverse). These long sequences of nucleotides form chromosomes within eukaryotic cells. For instance, human chromosome 1 contains 249 million nucleotide base pairs – around 8% of the complete human genome.

2.1.2 Central dogma of molecular biology

Some subsequences within a chromosome, known as genes, encode proteins. Roughly speaking, there are molecular mechanisms that read these genes and interpret them as instructions to build up proteins from amino acids. Ultimately, proteins constitute the basic building blocks of life, performing a vast number of functions within living organisms. For instance, they are required for the structure, function, and regulation of all tissues and organs of the body.

Figure 2.1: Deoxyribonucleic acid

To give more detail, the framework explaining the transfer of sequence information between DNA, RNA and proteins is known as the central dogma of molecular biology. First stated by Francis Crick in 1958 [Crick, 1958], it classifies such flows of information into three types: general, special and unknown. The most common are general transfers, describing the flow of genetic information within a biological system [Crick et al., 1970]. That is, a DNA molecule can be duplicated into another DNA molecule (replication), this molecule can then be transcribed into mRNA (transcription), and finally mRNA can be translated into proteins (translation).

2.1.3 Replication

In order to make biological inheritance possible, the DNA molecule has to be replicated. During DNA replication, the bonds between the nucleotides on opposite strands of the helix are broken and thus both strands of the DNA are separated. In this way, each strand can serve as a template for the generation of its counterpart (semiconservative replication). Each new counterpart is synthesized by the enzyme DNA polymerase, which reads each template in the 3’ to 5’ direction and adds the complementary nucleotides. Additionally, cellular proofreading and error-checking mechanisms take place to ensure almost perfect replication [Berg et al., 2006]. Nevertheless, errors can occur and go undetected (see section 2.1.4).

The polymerase chain reaction (PCR) is a common laboratory technique that replicates DNA molecules in-vitro. It cyclically duplicates all DNA fragments bracketed between two known short sequences (primers), thus amplifying the original number of molecules by several orders of magnitude in a short time. PCR is now a common technique in research and clinical laboratories, used for a wide range of different applications. Many sequencing technologies rely upon PCR to amplify initial samples before actual sequencing.

It is important to note that PCR is prone to error. Nonspecific binding of the primers involved in the process can lead to errors during replication, causing PCR-amplified molecules to be different from the original ones. This strongly affects some sequencing protocols and is the primary reason why several service providers employ PCR-free protocols whenever possible (i.e. when the original concentration of DNA molecules in the sample is sufficient to be detected without amplification). In addition, a high GC content in DNA fragments poses difficulties during amplification and sequencing [Aird et al., 2011]. Specifically, GC-rich DNA sequences (typically those where guanines and cytosines make up more than 60% of the sequence) are more stable than sequences with low GC-content. PCR amplification of these templates is often hampered by the formation of secondary structures, which do not melt at the temperatures usually used when the protocol is executed [Frey et al., 2008]. This leads to a systematic underrepresentation of GC-rich sequences in sequenced data – or, equivalently, to an overrepresentation of sequences with balanced AT vs. GC content.

2.1.4 Genetic Variations

In any species, genomic information is passed from one individual to its offspring in the next generation. In general, the genomes of different individuals within a population accumulate small differences at the sequence level. Differences can come from random mutations due to DNA exposure to chemicals or from small errors in the DNA replication mechanisms. Thus, the reference genome of a particular species denotes a consensus genome drawn from multiple individuals. Variations with respect to the reference genome are what distinguish an individual from the rest of the population.

Most of these variations take place at a small scale. Within these, we define a single nucleotide polymorphism (SNP) as a single base being turned into another one, and a single nucleotide variation (SNV) as a single base being inserted or deleted. Although these happen to be the most frequent types of variation, we can also observe insertions or deletions (indels) of several bases (typically < 1 Kbp). Variations on a larger scale (> 1 Kbp) are called large structural variations. To name just a few of these events, one can observe the translocation of subsequences to different locations, inversions of entire subsequences, fusions/fissions of chromosomes, copy number variations, and more [Feuk et al., 2006]. These variations can happen in both coding and non-coding segments of the genome.

It is important to note that a cell does not need to be haploid, that is to say, it can have more than just one set of chromosomes. In fact, diploid cells like those of humans contain two copies of each chromosome, one inherited from the father and the other from the mother. These copies are known as homologous chromosomes or haplotypes. In eukaryotes, the production of offspring involves the genetic recombination of pairs of haplotypes via the random combination of fragments from both chromosomes. Variants can occur in these haplotypes; we denote as alleles the variant forms of a gene. Variants occurring in just a single haplotype are called heterozygous, and variants found in both haplotypes are called homozygous. Equivalently, if both alleles at a gene (i.e. locus) are the same in both haplotypes, we call it homozygous (with respect to that gene or locus); otherwise, if the alleles are different, we call it heterozygous. The specific set of variants carried by an individual makes up its genotype.

Imperfect replication is not the only source of variation. Another good example is that of transposons and retroviruses, which can copy their genetic material and integrate it into the host genome at multiple locations. The human genome contains several transposable elements, such as those in the Alu family. Alu elements are the most abundant transposable elements, with about a million copies throughout the human genome [Szmulewicz et al., 1998]. Likewise, many endogenous retroviral sequences can also be found in the human genome [Griffiths, 2001].

2.2 Sequencing technologies

Broadly speaking, we understand sequencing to be the process of obtaining the sequence of bases (or formula) of DNA molecules. The ability to sequence genomes is one of the major sources of today’s biological information, enabling scientists to investigate cellular processes and leading to invaluable biological insights. Since the advent of the first experimental techniques allowing it, DNA sequencing has become a key technology not only for general biological research but also for many applied fields such as biotechnology, evolutionary studies, metagenomics, medical diagnostics and personalised medicine.

Starting with the original Sanger sequencing [Sanger et al., 1977], which allowed the decoding of the first bacteriophage genome in 1977, sequencing technologies have progressed exponentially to the point that several thousand genomes from different species are now known. A key landmark in the development of sequencing was the completion of the first Human Genome draft in the year 2001. Published simultaneously by the Human Genome Project (HGP) [Lander et al., 2001] and Celera Genomics [Venter et al., 2001], the first reference genome for the human species is considered a fundamental milestone in sequencing. Not only did these first initiatives require a considerable investment; they also emphasised the need for building computing infrastructures and writing advanced software in order to support new areas of biological research.

The Human Genome Project was funded with three billion US dollars, and took approximately a decade to complete. It is estimated that by 2003 the cost per sequenced base was $1. A decade later, by 2012, so-called high-throughput sequencing technologies had brought the cost down to an estimated 4 cents per million bases. In 2016, Illumina claimed that a set of ten HiSeq X ultra-high-throughput sequencers could re-sequence (i.e. establish the differences between the genome of an individual and the reference) over 18,000 human genomes per year at the price of $1000 per genome. Upcoming improvements in sequencing technologies promise further drops in costs in the near future. This downward trend is super-exponential, and much faster than the one observed for computing hardware under the so-called Moore’s Law (Figure 2.2).

Figure 2.2: Cost per re-sequenced genome (Source NIH)

The future of sequencing technologies is difficult to predict. Nevertheless, it is safe to assume that improvements provided by new technologies will mostly target the production of longer reads with better quality. The need for long reads able to sequence complex regions of the genome (highly repetitive, high GC-content, etc.) is becoming the priority for many companies. We also expect to see a continued increase in production yields as more and more facilities (e.g. hospitals, research centres, pharma companies, etc.) acquire sequencers for standard laboratory procedures. The proliferation of sequencing procedures and the massive production of genomic data for multiple day-to-day purposes is deemed to be the next milestone for sequencing technologies. This will emphasise even more the need for high-quality techniques and standardised procedures that can run efficiently in a short period of time.

At the moment, there is a wide variety of sequencing technologies available. Each technology produces reads with different characteristics, each with its own advantages and limitations. Hence, it is very important to understand these characteristics and their importance in applications. Depending on the targeted application, the budget, and the required properties of the results, one may choose among several different technological options [Mardis, 2011]. In table 2.1, we present a summary of the most relevant technologies in the context of this thesis. In the following sections, these sequencing technologies are briefly presented. This section is not meant to be a profound dissertation on sequencing technologies but a background section with emphasis on the characteristics of each technology (e.g. sources of error, error rates, read length, etc.) that are relevant to sequence alignment.

Technology | Read length (average) | Accuracy | Reads per run | Time per run (approx.) | Estimated cost per million bases
Illumina 2500 | 100-150 bp | 99% | 3-4 B | 6-11 days | < $0.043
Illumina MiSeq | 300 bp | 99% | 25 M | 65 h | $0.93
Illumina X Ten | 150 bp | 99% | 6 G | 3 days | $0.007
IonTorrent | 200-400 bp | 98% | 80 M | 2 h | $1
PacBio | 15 Kbp | 85%-90% | 50,000 | 4 h | $0.6
Nanopore | 6 Kbp | 90% | 4 M | 48 h | $0.5
Moleculo | 1 Kbp | 98% | n/a | 1-7 days | < $0.05

Table 2.1: Sequencing technologies and characteristics

2.2.1 High-throughput sequencing

High-throughput sequencing (HTS) principally denotes those technologies capable of sequencing in the order of millions of reads per run, relatively quickly (from a few days down to several hours) and at a very low price (less than $1 per million bases). Using them, we can deep-sequence a genome (cover every base of a genome with a large number of reads), producing massive amounts of genomic data. These techniques are often called massively parallel DNA sequencing as they rely upon millions of reactions run simultaneously to achieve very high production yields [Reuter et al., 2015; Loman et al., 2012].

These advances in DNA sequencing have enabled individual labs to run sequencing experiments on their own. Thus, HTS and genome-scale sequencing have revealed new challenges, not only at the level of greater infrastructure and equipment but also in the need for new algorithmic methods that can cope efficiently and accurately with the massive amounts of data coming from HTS. They have also led to the development of many new analytical tools for measuring important biological processes (e.g. variant calling, splicing, detecting protein binding, gene annotation and expression, copy number variation, and more).

Illumina (Solexa) sequencing

During sample preparation, a DNA molecule is fragmented and the fragments are selected according to their size (size selection). Afterwards, adaptors are attached to both ends of the fragments, which are then added to a flow cell. The flow cell contains probes complementary to the attached adaptors, thus allowing the fragments to hybridise to the flow cell. The free ends of the fragments attach to their complementary probes on the flow cell, creating a bridge shape. Templates are then amplified using bridge amplification (a PCR-based amplification technique). Ultimately, this creates clusters of identical sequences on the flow cell. Once clusters are formed, reversible terminator chemistry is used to read one nucleotide at a time. For that, DNA chains are extended in each cycle by adding four types of reversible terminator bases (RT-bases). A high-precision camera takes images of the fluorescently labelled nucleotides. Afterwards, the dye is chemically removed and the process is repeated until the sequence is completely read. This process is performed in parallel for millions of clusters and, because of that, this high-throughput technology can easily sequence several million nucleotides per second. As a result, all sequence reads have the same read length (i.e. the total number of cycles the sequencer has performed).

Despite being one of the most accurate HTS technologies available, the resulting reads exhibit occasional sequencing errors, with mismatches being the most common. Errors tend to accumulate due to failures during the removal of the fluorescent tags. Hence, the sequencing quality decreases with each cycle (i.e. an error in an earlier cycle can be propagated and affect subsequent cycles). As figure 2.3 shows, this usually results in errors appearing more frequently towards the end of a read, which ultimately limits the overall DNA fragment size that can be sequenced [Dohm et al., 2008]. On top of that, dephasing or phasing errors can happen when a nucleotide is mistakenly added or removed. Furthermore, many steps taken during sample preparation can introduce bias. Besides PCR amplification biases (2.1.3 Replication), DNA molecule shearing turns out not to be entirely random, leading to biases in the generated fragments [Schwartz and Farman, 2010].

IonTorrent sequencing

Ion Torrent [Rusk, 2011] library preparation starts with the shearing of a DNA sample and the addition of adaptors. Once this is done, the DNA is separated, attached to beads and amplified using emulsion PCR. Afterwards, primers are added to the beads and they are placed in wells on a CMOS chip (Ion Chip). Finally, the template is sequenced as the chip detects changes in pH levels as nucleotides are added.

Ion Torrent’s main source of error is indels, mostly related to homopolymer stretches. Signals generated from a high repeat number are difficult to measure accurately. Hence, a long homopolymer can exceed the sensitivity of the method, which then fails to pick up the actual length. On the other hand, the occurrence of mismatches is very low and, though its throughput is lower than that of other HTS technologies, the average read length produced is higher.

2.2.2 Synthetic long-read sequencing

In order to increase the effective read length produced by Illumina platforms, Moleculo TruSeq synthetic long-read sequencing (SLRS) was developed. This method is based on a modified library preparation protocol in which long fragments of DNA – reaching 10 Kbp in length – are extracted, amplified and later reassembled from regular short reads. These long fragments are placed into wells (around 3,000 fragments per well) and then assigned a unique barcode sequence. This is followed by a regular Illumina sequencing procedure. Later, all the sequenced reads are de-barcoded into pools and de-novo assembled independently.

Figure 2.3: Illumina quality profile

Moleculo SLRS is primarily used in applications where traditional short reads are simply not good enough. This higher sequencing resolution not only helps with genome finishing but also with other applications like de-novo assembly, metagenomics, and whole human genome phasing (2.4 Bioinformatics applications). Nevertheless, this method has a few shortcomings. First, the reads produced are relatively short compared to those of true long-read technologies – like PacBio single molecule sequencing. Furthermore, this technique is known to be prone to biases, due mainly to the de-novo assembly phase of the pools and its sensitivity to the presence of repeats.

2.2.3 Single molecule sequencing

Single molecule sequencing (SMS) aims to provide a solution to some of the most vexing problems arising from HTS technologies. It aims to simplify sample preparation, reduce the amount of sample required for operation and remove the need for template amplification – in turn removing biases and artifacts.

Probably the most prominent advantage of SMS is the ability to produce longer reads compared to those of HTS. More importantly, these longer reads can favour simpler analysis, increasing the sequencing resolution and removing conflicts. For that reason, SMS techniques are preferred for applications such as genome assembly, where conflicts heavily limit the quality of the final results – SMRT-based assembly produces 100 times longer contigs with 100 times fewer gaps compared to SGS-based assemblies [Sakai et al., 2015]. In addition, with longer reads, we can sequence large repetitive regions, detect mutations – many of which are associated with diseases – and characterise long structural variations. SMS can also sequence samples that cannot be amplified because of rich GC content, secondary structure, etc. [Ross et al., 2013]. Furthermore, SMS can resequence the same molecule multiple times for improved accuracy. However, some SMS methods are still in their developmental infancy. As a result, one of their main weaknesses is their relatively high error rate (reaching almost 20%). In addition, many hybrid sequencing strategies have been proposed that employ short reads in conjunction with long reads to increase the overall accuracy of the sequenced data. Hybrid sequencing strategies are much more affordable and scalable than using SMS technologies alone.

The most prominent SMS technologies are PacBio real-time sequencing and nanopore sequencing. In a nutshell, PacBio sequencing is a method for real-time sequencing that does not require a pause between read steps [Rhoads and Au, 2015]. Single molecule real-time sequencing (SMRT) utilises a zero-mode waveguide (ZMW) that creates an illuminated observation volume small enough to observe only a single nucleotide of DNA being incorporated by DNA polymerase [Levene et al., 2003]. Improving steadily, nanopore sequencing is consolidating itself as the main alternative to PacBio. Nanopore technologies offer long-read sequencing at low cost and high speed with minimal sample preparation [Laszlo et al., 2014]. They use a voltage-biased nanoscale pore within a membrane to measure the passage of a single-stranded DNA molecule [Deamer et al., 2016].

2.3 Sequencing protocols

It is well understood that a successful experiment relies on proper sample and library preparation prior to actual sequencing. In actual fact, as of today many different library preparation protocols exist; they allow scientists to interrogate the genome and the inner workings of a range of compartments within the cell in different ways. In general terms, each protocol requires different methods of extracting nucleic acid (DNA or RNA) from the cell, preparing it for sequencing and analysing it after sequencing.

The possibilities and benefits of having different protocols available are manifold; not only do they enable sequencing of different parts of the genome with different content and function, but they also allow the production of sequencing data with different characteristics that can help specific biological analyses. Hence, we have different protocols requiring different analysis methods and algorithms. Here, we introduce the protocols most relevant to this thesis.

2.3.1 Single-end, paired-end and mate-pair sequencing

What we know as single-end (SE) sequencing involves the DNA molecule being sequenced from only one end (Figure 2.4.a). However, the Illumina platform allows for other protocols that take advantage of the process by reading the molecule from both ends, producing different types of sequence reads that contain information very useful to many applications.

Paired-end (PE) sequencing produces pairs of reads with known relative layouts. For this, size selection is performed during sample preparation so that fragments of a specific length are isolated (e.g. 500 bp). The selected fragments are then sequenced from both ends, producing pairs. While we do not know the precise orientation of the pairs and the sequence in between them, we know the relative pair orientation (i.e. end1-forward end2-reverse, or end1-reverse end2-forward) and an approximation of the distance between the two ends (Figure 2.4.b).

Figure 2.4: (a) Single-end protocol. (b) Paired-end protocol (Source Illumina genomic sequence datasheet)

So-called mate-pair (MP) protocols involve different sample preparation techniques that produce different relative pair orientations and allow for greater insert sizes. They are essentially based on the use of larger size-selected DNA fragments (e.g. 4 Kbp) and the circularisation of those fragments. Afterwards, the fragments are randomly sheared, selected and sequenced using a PE protocol. In this case, pairs face away from each other and insert sizes are much longer (Figure 2.5). Both PE and MP protocols turn out to be quite useful for conflict resolution in many applications like mapping or genome assembly. For the latter, MP protocols lead to data that can link together relatively distant parts of the genome. Thus, they can arrange together contigs from different parts of the genome separated by regions that could not be assembled (e.g. highly repetitive regions).

Figure 2.5: Mate pair protocol (Source Illumina genomic sequence datasheet)

2.3.2 Whole-genome and targeted resequencing

Considering the regions of the genome that we aim to sequence, we mainly differentiate between Whole-Genome Sequencing protocols and targeted protocols. Whole-Genome Sequencing (WGS) aims to sequence the whole genome of an organism (i.e. all of an organism’s chromosomal and mitochondrial DNA). It constitutes the most comprehensive method of interrogating the human genome.

As opposed to WGS, targeted sequencing isolates and sequences a subset of genes – or any region of interest. This focus on specific regions of the genome achieves greater coverage of those regions and allows researchers to concentrate efforts, time and expenses on the areas of interest [Ng et al., 2009]. Compared to WGS – which normally achieves coverage levels of 30x-50x – targeted sequencing can easily reach coverage levels of 1000x or higher in the regions of interest. This can lead to the discovery of variants which would be too rare or too expensive to identify using WGS.

It is important to note that certain regions are harder to target due to the difficulty of designing unique probes (e.g. repetitive sequences) or due to lower binding affinity [Zhang et al., 2003]. As a result, this may lead to GC biases. Furthermore, preferential capture of the reference allele is known to introduce biases towards that allele, making it more difficult to call variant alleles [Turner et al., 2009].

Whole Exome Sequencing (WES) is perhaps the most widely used targeted sequencing method. Representing less than 2% of the human genome, the exome contains the vast majority of disease-causing variants and therefore constitutes a cost-effective alternative to WGS. With WES the protein-coding portions of the genome are selectively sequenced, allowing the identification of variants for a wide range of applications (e.g. cancer studies, genetic diseases, etc.).

2.3.3 Bisulfite sequencing

DNA methylation is a biological process by which a methyl group is added to a cytosine or adenine nucleotide. This modifies the DNA molecule and its function, altering the genes expressed and the proteins developed in cells. DNA methylation is an epigenetic process; although the DNA molecule is modified, it is a heritable change that does not alter the underlying DNA sequence. In this way, it is sometimes explained as a pattern overlaying the actual DNA sequence. DNA methylation is a common process within living creatures, playing a part in embryonic development, repression of repetitive elements, ageing and genomic imprinting, to name a few. Aberrant DNA methylation patterns can have harmful consequences like cancer development and other human diseases related to epigenetic inconsistencies.

Figure 2.6: Bisulfite Protocol (Source atdbio)

Bisulfite sequencing involves the treatment of DNA molecules with bisulfite ions with the purpose of analysing methylation patterns. Treatment of DNA with bisulfite converts cytosine nucleotides to uracil, yet methylated cytosines remain unaffected. The methylation state of the original DNA can be inferred by counting the number of cytosines and thymines at the genomic cytosine sites (Figure 2.6). We denote as CpG sites those positions where a cytosine is followed by a guanine and where the cytosine can undergo methylation. CpG regions that experience methylation are referred to as differentially methylated regions (DMR). DNA methylation was the first discovered epigenetic marker, and today still remains the most studied. By means of HTS analysis, whole genome bisulfite sequencing aims to discover the DNA methylation patterns of the entire genome by determining the methylation status at CpG dinucleotides.
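From a read-mapping perspective, bisulfite treatment means that an unmethylated C in the sample is read as T, so treated reads no longer match the reference directly. A strategy commonly used by bisulfite mapping pipelines is to align C-to-T converted reads against C-to-T and G-to-A converted copies of the reference, and only afterwards compare the original read to the original reference at genomic C positions to call methylation. The Python sketch below is a minimal illustration of that idea under those assumptions; the function names are our own and it does not describe the protocol of any specific tool.

```python
def c_to_t(seq):
    """Forward-strand bisulfite conversion: every cytosine becomes thymine."""
    return seq.upper().replace("C", "T")

def g_to_a(seq):
    """Models the conversion seen on the opposite strand: guanine becomes adenine."""
    return seq.upper().replace("G", "A")

def bisulfite_reference_spaces(reference):
    """Build the two converted reference 'spaces' against which converted reads
    are aligned; methylation is called later by inspecting the original bases."""
    return {"CT": c_to_t(reference), "GA": g_to_a(reference)}

# A methylated C is protected and still reads as C; an unmethylated C reads as T.
# After converting both read and reference, neither case produces a mismatch,
# so the alignment step is not biased by the methylation status.
reference = "ACGTCGA"
read_unmethylated = "ATGTTGA"   # both cytosines converted by bisulfite
read_methylated = "ACGTCGA"     # both cytosines protected by methylation
assert c_to_t(read_unmethylated) == c_to_t(reference)
assert c_to_t(read_methylated) == c_to_t(reference)
```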

2.4 Bioinformatics applications

Broadly speaking, bioinformatics is an interdisciplinary field that employs computers to solve problems from molecular biology. It mainly involves the development of algorithms and tools that can analyse and interpret huge amounts of biological data (e.g. nucleotide sequences, protein domains, protein structures, etc.) whose processing would be unfeasible without automated computation. But bioinformatics also entails the creation and curation of databases, the development of statistical techniques, and the advancement of theory to solve formal problems arising from the analysis of genome-scale biological data. Since the Human Genome Project, the field of bioinformatics has attracted great interest. Nowadays, bioinformatics is evolving rapidly alongside the latest technological improvements in sequencing platforms. Many researchers have been joining forces in multidisciplinary teams so as to investigate genome biology. Some scientists draw a distinction between the terms bioinformatics and computational biology. While these areas are indeed broad and diverse, the distinction between the two terms is not consistent or well defined.

Good examples of bioinformatic applications are protein folding, the construction of evolutionary trees, genome mapping and variant detection. Among many others, one of the fundamental applications of bioinformatics is genome mapping. As explained above, due to technical limitations a single chromosome cannot be read all at once. However, as we know, relatively small sequences of DNA can be obtained by sequencing. It is important to note that a read can originate from either of the two strands of a chromosome. This can become even more complicated in the case of polyploid organisms, where the read can come from any of the haplotypes. For the majority of applications, all these small pieces have to be arranged together like a jigsaw to obtain a de-novo genome (de-novo assembly) or aligned to an already existing genome (resequencing). In order to illustrate the role of mapping and its relevance in these instances, we briefly present a few applications where mapping is used and plays a crucial role in their success.

2.4.1 Resequencing

The most common application of HTS is genome resequencing – most of all with the human genome. It aims to reconstruct the whole genome of an individual using a previously assembled genome as a template (reference genome). During the process, all sequenced reads are mapped against the reference genome, looking for the most plausible location the read was sequenced from. Allowing certain degrees of divergence between the reads and the reference (due to sequencing errors and differences between haplotypes), all the sequenced reads are laid down and the actual genotype of the individual is reconstructed. By identifying the differences between the DNA of the sequenced genomes and the reference, a catalogue of mutations specific to the sequenced individuals is obtained (mostly SNPs and SNVs). It is quite common that these variations are associated with specific phenotypes. Ultimately, this can provide extremely valuable insights into the genetic background of individuals.

The most important limiting factor determining the quality of the result is the total coverage of the sequenced genome (Definition 2.4.1). In general, the greater the coverage, the easier it is to reconstruct the individual’s genome and to accurately identify its variants. At the same time, there are other factors that also play a fundamental role in the success of any resequencing application – such as the read length, the homology between the individual and the reference, and the quality of the reference.

Definition 2.4.1 (Coverage. Lander/Waterman equation). Coverage is the average number of times each base of the genome is sequenced [Lander and Waterman, 1988; Roach et al., 1995].

\[ \text{Coverage} = \frac{\text{ReadLength} \times \text{NumberReads}}{\text{GenomeLength}} \]
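As a quick illustration of this definition, the short Python snippet below (written for this text, not part of any mapper) computes the expected coverage for the human resequencing scenario described in this section: roughly 900 million reads of 100 nt against a reference of about 3,000 million bases.

```python
def expected_coverage(read_length, number_reads, genome_length):
    """Lander/Waterman expected coverage: the average number of times
    each base of the genome is sequenced."""
    return read_length * number_reads / genome_length

# ~900 million 100 nt reads over a ~3 Gbp genome give roughly 30x coverage.
print(expected_coverage(100, 900e6, 3e9))  # -> 30.0
```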

Genome resequencing is often preferred because of its relative simplicity, its practical efficiency compared to other genome reconstruction methods (i.e. de-novo assembly), and its reduced cost. However, its greatest shortcoming is the need for a complete and reliable reference. It is important to note that many organisms’ genomes have not been assembled yet – making them unsuitable for resequencing. More genome references are steadily being released, with better quality and larger genomic spans. However, many genome references still have poor quality and many lack some genomic regions – mainly because of their complexity. On top of that, for some resequencing experiments, the sequenced sample can exhibit aberrant variations that might be difficult – if not impossible – to align against the reference (e.g. long insertions or translocations).

Sequence alignment plays a fundamental role in resequencing experiments as it is responsible for mapping each read to its sequenced locus. In this way, mapping tools must pick out the most likely source location in the reference genome while allowing a certain error divergence from it. Both the accuracy and the overall performance of the mapper strongly determine the final outcome. It is important to highlight that for a common resequencing experiment with 30x coverage of the human genome, an HTS Illumina machine will generate around 900 million reads of 100 nt. Mapped against the 3,000 million bases of the genome, this constitutes a challenging performance problem for modern mapping tools.

2.4.2 De-novo assembly

As mentioned before, in case there is no known reference genome or if the reference simply is not accurate enough, HTS data can be arranged together to assemble the individual’s genome from scratch without a template guide (de-novo assembly). Many programs for genome assembly have been developed in recent years using a variety of different algorithms. However, they all follow the same conceptual steps (Figure 2.7). First, the algorithm has to compare every read against every other looking for overlaps. Next, overlapping reads are joined together in a sequence graph that leads to a partial assembly of contigs (i.e. continuous sequences where the fragments overlap). Finally, all the contigs are assembled together into scaffolds that ultimately form the assembled genome.

This process is far from perfect and errors can occur for several reasons (e.g. sequenced reads incorrectly discarded, mistakes joining overlapping reads, misassemblies due to repeats and many others). Highly repetitive sequences, variants, missing regions and mistakes limit the final length of the contigs that assemblers can produce.

Sequence alignment plays a critical role when finding overlaps between reads. Not only is it one of the most computationally intensive steps of the process, but the accuracy of its results will also strongly influence further stages and ultimately the final outcome of the assembler. An insensitive mapper that misses overlaps between reads will also miss links between contigs and thus will inevitably lead to smaller and disconnected contigs. In the opposite case, a mapper that reports overlaps between unrelated reads will lead to unnecessarily complex assembly graphs and possibly to artefacts in the final contigs.

Figure 2.7: Basic steps of a genome assembler [Baker, 2012]

2.4.3 Other applications

Besides generic genome resequencing and de-novo assembly, there are many other applications where mapping plays a fundamental role. This justifies not only the wide applicability of sequence alignment algorithms but also the need for flexibility so as to cover the requirements of different bioinformatic applications.

A good example of the non-conventional use of a mapping tool is the comparison between closely related species. This can be motivated by the lack of a reference genome for a specific species (when the reference of a close relative is known) or by the aim of studying genomic differences between related species. In both cases, reads sequenced from one species are mapped against the reference of a closely related species. The underlying sequence alignment algorithms have to allow for greater divergence between genomes and accurately map reads containing variations between the two species. A good example is the comparison between the bonobo genome and the chimpanzee and human genomes [Prüfer et al., 2012].

In a more general way, high-level genome comparisons also rely internally on sequence alignment algorithms to discover major structural variations between species [Nelson et al., 1999]. The field of comparative genomics exploits the similarities – and differences – between genomes. It holds that common features of different organisms will often be evolutionarily conserved and encoded in their DNA [Bansal and Meyer, 2002]. To that end, it uses specialised alignment algorithms with the purpose of identifying not only small variations (i.e. SNPs, SNVs, ...) but also large rearrangements, reversals, tandem repeats, etc. MUMmer [Delcher et al., 1999] is the most representative bioinformatics tool used for this purpose.

Another application where mapping comes in very useful is the removal of sample contamination after sequencing. Within many protocols, bacteria and viruses are used to aid sample preparation. In some cases, genomic material from these organisms gets mixed up with the host sample. As a result, this contamination is sequenced too and the resulting reads contain sequences belonging to these foreign organisms. In order to decontaminate the sequenced samples, a special mapping process is performed: all the reads are mapped against a composite genome containing the references of all the organisms possibly involved. The mapping tool is then expected to disambiguate the source organism of each sequenced read and discard the ambiguous ones (if any). As a result, all the reads belonging to the host sample are isolated so as not to interfere with downstream applications or analysis.

An additional example of a challenging application for mapping can be found within metagenomics [Thomas et al., 2012]. Here, a sample containing DNA fragments from multiple organisms has to be mapped against a reference containing the genomes of multiple species. The goal is to correctly identify the organism (or the phylogenetic group) and the putative source genomic location, overcoming potential ambiguities between strains and repetitive sequences of the reference [Ribeca and Valiente, 2011].

2.5 High-throughput data analysis

As bioinformatic applications evolve, their scope is increasing and becoming ever more complex. For that reason, applications cannot be fully understood in a vacuum and must be studied in the context of large analysis pipelines involving very different components. Current HTS data analysis pipelines consist of several steps and involve numerous tools. Generally, these analysis pipelines are divided into three stages: primary, secondary, and tertiary analysis.

Primary analysis consists of machine-specific steps to call bases and compute quality scores. Basically, the base calling step converts raw sequencing signals from the sequencer into bases (i.e. A, C, G, T, and N for unknown calls). Additionally, this stage is expected to deliver a quality metric of the calling step. For this, quality values are assigned to each base call to estimate the likelihood of an erroneous call at that base. This stage is tightly coupled to the precise sequencing instrument used, and current-generation machines often perform these computations using bundled hardware (i.e. specific hardware setups). The most common result is a FASTQ file [Cock et al., 2010]. In this simple file format, each sequence is represented using four lines (Figure 2.8). The first contains the tag of the read, the second represents the sequenced bases, the third line is left to optionally add comments – and for padding purposes – and the fourth contains the quality of each base call in Phred format [Ewing and Green, 1998]. Phred qualities are error probabilities encoded into ASCII printable characters. This file constitutes the basic raw sequencing data that secondary analysis will employ.

Figure 2.8: FASTQ read example
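As an illustration of the format just described, the Python sketch below parses four-line FASTQ records and decodes the quality string into per-base error probabilities. It assumes the common Phred+33 (Sanger/Illumina 1.8+) ASCII offset; some older Illumina pipelines used an offset of 64, so the offset is left as a parameter. The function names are illustrative only.

```python
def read_fastq(path):
    """Yield (tag, sequence, quality string) tuples from a FASTQ file,
    assuming the standard four-line record layout."""
    with open(path) as handle:
        while True:
            tag = handle.readline().rstrip()
            if not tag:                    # end of file
                return
            seq = handle.readline().rstrip()
            handle.readline()              # '+' separator / optional comment line
            qual = handle.readline().rstrip()
            yield tag.lstrip("@"), seq, qual

def phred_error_probabilities(qualities, offset=33):
    """Decode an ASCII quality string: Q = ord(char) - offset,
    and the estimated error probability is 10^(-Q/10)."""
    return [10 ** (-(ord(c) - offset) / 10) for c in qualities]

# 'I' encodes Q40 under Phred+33, i.e. an estimated error probability of 0.0001;
# '#' encodes Q2, i.e. a very unreliable call.
print(phred_error_probabilities("II##"))
```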

Secondary analysis pursues the reconstruction of the original genome using the reads within the FASTQ files. This can be achieved by means of resequencing (2.4.1 Resequencing) or de-novo assembly (2.4.2 De-novo assembly) techniques. Either way, there are several pre-processing steps that must be performed before the input reads are ready for any reconstruction operation. In the same way, there can be several post-processing steps afterwards to prepare the results for the next stages. These steps may change depending on the sequencing technology, protocols, quality of the data, etc.

In a typical resequencing analysis scenario, initially the sequenced reads should be inspected to remove any kind of undesirable artifacts. Several techniques exist for this, such as read trimming (e.g. hard-trimming, quality-trimming, etc), adaptor removal (in case the selected sequencing technology has left any kind of adaptor), read error correction [Yang et al., 2013a], contamination removal (2.4.3 Other applications), and others. Afterwards, the resulting sequences are aligned against a reference genome (read mapping). The result of this stage is usually encoded as an SAM or BAM file [Li et al., 2009].

Following the mapping stage, many post-processing steps may be required to produce compliant results for downstream analysis. One should never underestimate the computational cost of these steps as they handle quite large volumes of data. For example, in standard analysis pipelines it is very common to split input FASTQ data so as to parallelize the mapping step across many computation nodes within a cluster. Due to this, resulting SAM/BAM files have to be merged and sorted to produce a compliant and coherent mapping file to feed the following steps of the pipeline [Li et al., 2009]. Not only that, but the results may be stored in secondary storage for future usage. In that case, the resulting SAM/BAM files are usually compressed (e.g. BZIP or GZIP). Once again, there are many parallel tools to overcome this computationally intensive task [Roguski and Ribeca, 2016].

Though it is sometimes considered part of the tertiary analysis, variant calling has become a de-facto standard analysis procedure. For that reason, many scientists argue that it has to be categorised as a step within the secondary analysis stage. In any case, it is common that Variant Calling (VC) follows at this point of the analysis. VC has no other purpose than to identify the variants present in the donor sample but not in the reference genome; that is to say, to derive the individual’s genotype from the mapping files. This process usually focuses on spotting small variants like SNPs or SNVs. Popular variant callers like GATK [McKenna et al., 2010] are themselves composed of several steps, such as base-quality recalibration, indel realignment, etc.

During the secondary analysis, quality control (QC) should be present not only to detect failures in any of the steps, but also to ensure the quality of the final results. These procedures can be introduced at three different points within this stage: Quality control of the raw data, quality control of the alignments, and quality control on variant calling [Guo et al., 2013].

Finally, tertiary analysis – often including data visualisation – diverges depending on the specific study and research questions in hand. Broadly speaking, its main aim is to interpret the results produced by the secondary analysis. Sometimes this analysis integrates data coming from multiple experiments and samples. Examples of this analysis are gene annotation, differential expression studies, alternative splicing studies, detection of rare variants, association studies, etc. Often at this point, many tools for data visualisation are employed to aid further analysis (i.e. exploratory analysis of sequence data). Tertiary analysis is still in its infancy and many methods and tools for it are still being defined and developed.

As shown above, current analysis pipelines follow a series of general guidelines regarding which components perform which functions and how they are to be connected. This delimits the competences of each component and allows great specialisation and improvement. It is important to bear in mind the place that the genomic mapper has in the pipeline and its main purpose. Some mapping tools try to stand out from the rest by incorporating additional functionality such as error correction, read trimming or quality recalibration. This leads to mappers with a more general purpose, less specialisation, and less flexibility regarding integration as modular components in a pipeline. In such a volatile field, where techniques and approaches improve and change tremendously fast, pipeline components must be kept as modular and flexible as possible. Increasing the number of functions of modern mappers should not be the focus of future developments. Improvement efforts should instead focus on better performance and accuracy.

Chapter 3. Sequence alignment

3.1 Sequence Alignment. A problem in bioinformatics
    3.1.1 Paradigms
    3.1.2 Error model
    3.1.3 Alignment classification
    3.1.4 Alignment conflicts
3.2 Approximate String Matching. A problem in computer science
    3.2.1 Stringology
    3.2.2 String Matching
    3.2.3 Distance metrics
        I Classical distance metrics
        II General weighted metrics
        III Dynamic programming pair-wise alignment
    3.2.4 Stratifying metric space
3.3 Genome mapping accuracy
    3.3.1 Benchmarking mappers
        I Quality control metrics
        II Simulation-based evaluation
        III MAPQ score
        IV Rabema. Read Alignment Benchmark
    3.3.2 Conflict resolution
        I Matches equivalence
        II Stratified conflict resolution
        III δ-strata groups
        IV Local alignments
    3.3.3 Genome mappability

Before going any further, it is mandatory to formally present the general problem of mapping sequencing reads to a genome. In this chapter, we give a thorough description of the problem as well as some of its many reformulations in different contexts and applications. Moreover, we emphasise the duality of the problem from both the perspective of bioinformatics – sequence alignment – and that of computer science – approximate string matching. For that, we analyse in detail current sequence alignment paradigms, error models and the potential conflicts that can arise from them. Then, using a formal approach, we introduce the problem of approximate string matching together with basic definitions and concepts of the problem. Additionally, we present and justify the relevance of genome mapping accuracy. In this way, we introduce current methods for benchmarking mappers and the quality of their results. On top

of that, we propose a novel approach towards resolving conflicts during mapping and assessing the quality of the alignments. Finally, we explore the limits of mapping accuracy using the concept of genome mappability.

3.1 Sequence Alignment. A problem in bioinformatics

Generally speaking, sequence alignment denotes a series of bioinformatic algorithms whose purpose is to establish homology between genomic sequences. In order to do so, sequences are aligned – lined up with each other – so that the degree of similarity of suitably matching parts of the different sequences, in the case of local alignment, or the degree of similarity of the full sequences, in the case of global alignment, is maximized according to a given string distance function or score. Sequence alignment is a general term and has multiple embodiments (as in pair-wise sequence comparisons, multiple sequence alignment, construction of evolutionary trees, etc.).

3.1.1 Paradigms

In the context of this thesis we focus mainly on the problem of mapping DNA sequences against a reference genome (a.k.a. mapping to a genome). Broadly speaking, mapping to a genome aims to retrieve all positions from a given reference (a collection of chromosomes, contigs or the like) where a given DNA sequence (relatively small, several orders of magnitude smaller than the reference) aligns (using a given distance function and error tolerance). Ultimately, the goal is to retrieve the actual genomic location which has generated the short sequenced read at hand. This represents the most common use case, and we traditionally refer to it as searching for the best (or “true”) match.

It is important to highlight that in general there will be more than one alignment, i.e. more than one location that might have plausibly originated the read (the case of a multimapping read). So after finding all matching locations as stated before, the best candidate (or candidates), that is, those showing the highest sequence similarity to the original read, must be selected. In cases where no plausible locus of origin is found, or multiple equally likely candidates are found, the read must be flagged as ambiguous and possibly discarded. We often refer to this disambiguation process as multimap resolution.

In fact, certain applications do not require this final step of resolving the true source location – they simply require all the locations in the genome the read may have originated from (i.e. all alignments of the input sequence having an error smaller than a pre-defined tolerance value). If one aims to retrieve all the equally distant matches having minimum distance, we say that one is performing an all-best search. If one simply wants to retrieve all matches within some error threshold, we refer to an all-matches search [Holtgrewe et al., 2011]. Though the use of the latter is less frequent, some real-life applications (for instance the study of copy number variations or CNVs) do require searching for all matches.
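To make the three search modes concrete, the toy Python sketch below selects candidate locations from a list of (position, distance) matches according to each strategy. It is only an illustration of the definitions above, with names chosen for this text; it does not reflect how any particular mapper implements the selection.

```python
def select_matches(candidates, mode, max_error=None):
    """candidates: (position, distance) pairs already found by the search.
    mode: 'all-matches' keeps everything within max_error,
          'all-best' keeps all matches at the minimum distance,
          'best' returns a single true-match, or None if it is ambiguous."""
    if max_error is not None:
        candidates = [c for c in candidates if c[1] <= max_error]
    if not candidates:
        return []
    if mode == "all-matches":
        return candidates
    best_distance = min(distance for _, distance in candidates)
    best = [c for c in candidates if c[1] == best_distance]
    if mode == "all-best":
        return best
    # 'best' (true-match) mode: ambiguous if several co-optimal locations exist.
    return best[0] if len(best) == 1 else None

hits = [(1200, 2), (58000, 0), (910000, 2)]
print(select_matches(hits, "all-matches", max_error=2))  # all three locations
print(select_matches(hits, "all-best"))                  # [(58000, 0)]
print(select_matches(hits, "best"))                      # (58000, 0)
```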

Whether we aim for a “true-match”, an “all-best” or an “all-matches” search, in general sequence alignment algorithms are faced with the challenge of retrieving all requested matching locations below a certain error tolerance. We denote as complete – or exact – those algorithms that return all existing matches without missing solutions. On the contrary, if the algorithm may miss alignments potentially relevant to the application at hand, we denote it as incomplete, inexact or heuristic. It is well known that most sequence alignment algorithms are based on heuristics in order to obtain faster running times at the expense of possibly losing some valid alignments.

Whether these are acceptable trade-offs, or whether they are adequate for a given application, is an open debate. What is really important is to acknowledge that missing plausible solutions might distort the results; for example, mistakenly reporting a duplicated region of the reference as the true match, altering coverage results due to missing matches, or reporting matches with convoluted alignments instead of simpler, less distant ones. These behaviours should not be overlooked as they can affect the downstream analysis. It is known that some widely used tools pick a match at random in case of conflict – some of them openly reporting this behaviour and documenting it properly [Li et al., 2008b; Li and Durbin, 2009; Langmead, 2010].

3.1.2 Error model

All in all, sequence alignment is fundamentally about assessing homology between sequences. It is trivial to determine the homology of two sequences when both are the same (exact match). However, when it comes to similarity allowing errors, it is necessary to understand the very nature of the biological events that can transform a given sequence into another.

Broadly speaking, an error model is defined by what is considered to be an error (error event) and a function to score errors (distance function). Given two strings w and u, an alignment is a combination of error events that can transform w into u (or vice versa). Among all the possible alignments of w into u, those with the least distance are called optimal alignments (i.e. those whose error events score least using the given distance function). In a way, optimal alignments aim to explain how to transform one sequence into the other with the fewest possible errors. Note that an error model induces an optimization problem: transforming w into u – and vice versa – using a combination of error events that minimizes the distance function (alignment function).

Error events are the most basic modifications that can transform one sequence into another. They represent the most common biological events that can naturally explain sequence transformations. In this way, they model chemical changes that can turn one nucleotide into another, e.g. sequencing errors, duplication errors, variations, etc. The most common error events are substitutions (a.k.a. mismatches; one nucleotide turned into another) and single nucleotide indels (insertions and deletions of single nucleotides). However, in certain situations other variations are proposed; for example, the swapping of two adjacent nucleotides (transpositions).

For a given collection of error events, a distance function scores them so as to measure the relative distance between sequences, quantifying homology (e.g. conservation of biological sequences) or divergence (e.g. degree of error). There is a wide variety of distance functions to choose from depending on the specific context of application.

Mismatch distance allows only for substitutions and scores each one with a penalty of 1 – in the simplified definition – irrespective of the nucleotide substituted. The resulting distance is equal to the total number of mismatching nucleotides between the two sequences. This distance function is limited to sequences of the same length.

Episodic distance allows only for insertions, with a penalty of 1. Unlike the rest of the distances presented here, this distance is not symmetric (i.e. it may be possible to convert one sequence into another but not vice versa).

Longest common subsequence distance or Indel distance allows only for insertions and deletions (but not for substitutions). This distance function scores each single nucleotide indel with a unitary penalty. Thus, the indel distance between two sequences is equal to the total number of nucleotides that must be deleted and inserted to transform one sequence into another.

Edit distance, as opposed to the above, also allows for substitutions. In general, this distance allows for any single edit operation (i.e. substitution, insertion, and deletion), penalizing each one unitarily, so that the resulting distance is equal to the total number of edit operations that it takes to transform one sequence into another. This is one of the most commonly used distance functions due to its simplicity, its versatility, and because there are many efficient alignment algorithms based on it. The Damerau–Levenshtein distance is a variation in which transpositions (swaps of two adjacent characters) are also allowed with a penalty of 1. Another variation, the event distance, penalizes indels with a score of 1 irrespective of the length of the indel. This variation aims to model contiguous gaps that might have been generated at the same point and thus should not be penalized more than once.

General scoring matrices. In general, it is possible to define a function assigning a penalty to each nucleotide substitution, to insertions of any given length, and to deletions of any length. These tailored scoring matrices aim to statistically model the frequency of certain errors. For example, in DNA, substitution mutations are of two types: transitions (i.e. interchanges of purines (A ↔ G) or pyrimidines (C ↔ T)) and transversions (i.e. interchanges of purine for pyrimidine bases). It is known that transition mutations are much more frequent in nature. To model this, transversions would be penalized more than transitions in the scoring matrix. As a result, optimal alignments following this scoring scheme would favour transitions, leading to a more plausible transformation between sequences. Another good example is the so-called Smith-Waterman-Gotoh distance, where gaps are scored according to their length.
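These distances are typically computed with dynamic programming, a topic treated in depth later in this chapter (3.2.3). Purely as an illustration of the definitions above, the Python sketch below computes the unit-cost edit distance as well as a weighted variant in which transversions are penalised more heavily than transitions; the specific penalties are arbitrary examples chosen for this text, not values used by the GEM-mapper.

```python
PURINES, PYRIMIDINES = {"A", "G"}, {"C", "T"}

def substitution_cost(a, b, transition=1, transversion=2):
    """Example scoring: matches cost 0, transitions (A<->G, C<->T) cost less
    than transversions, reflecting their higher frequency in nature."""
    if a == b:
        return 0
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return transition if same_class else transversion

def weighted_edit_distance(w, u, indel=1, subst=substitution_cost):
    """Classic O(|w|*|u|) dynamic-programming alignment distance. With
    subst = (0 if equal else 1) and indel = 1 this is the plain edit distance."""
    rows, cols = len(w) + 1, len(u) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dp[i][0] = i * indel              # delete a prefix of w
    for j in range(cols):
        dp[0][j] = j * indel              # insert a prefix of u
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = min(dp[i - 1][j] + indel,                      # deletion
                           dp[i][j - 1] + indel,                      # insertion
                           dp[i - 1][j - 1] + subst(w[i - 1], u[j - 1]))
    return dp[-1][-1]

unit = lambda a, b: 0 if a == b else 1
print(weighted_edit_distance("ACGT", "AGGT", subst=unit))  # 1 (plain edit distance)
print(weighted_edit_distance("ACGT", "AGGT"))              # 2 (C<->G is a transversion)
print(weighted_edit_distance("ACGT", "GCGT"))              # 1 (A<->G is a transition)
```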

A useful insight regarding error models is to understand that the error does not necessarily have to be placed in one of the sequences. In a more general understanding of the problem, the error events are just transformations between two sequences; hence the term transformation operations seems more suitable. In principle, neither of the sequences has to be erroneous, and the alignment between the two simply leads to an understanding of their structural similarities.

3.1.3 Alignment classification

Depending on the distance function, and on how it scores error events with respect to their relative position, the shape of the optimal alignment may differ. There is a widely accepted classification of alignment algorithms depending on how they score deletions at the ends of the sequences – and therefore on the shape of the alignment induced. Figure 3.1 (Types of sequence alignment) introduces the main types of alignment in use.

Global alignment

Most common alignment algorithms target global alignments, which are considered the most natural form of pair-wise alignment. In this case, deletions/insertions at the ends of both sequences score as in any other position in the sequence. We say that the ends are not free and consequently must be aligned; for that reason these are also known as “end-to-end” alignments. As a result, the final shape of the alignment spans both sequences entirely, trying to match them whole.

Local alignment

By contrast, in local alignments we say that the ends are free. That means that deletions at the beginning or end of the sequence do not count towards the distance function, and thus trimming the sequence ends is not penalised. These alignments tend to favour highly homologous alignments between local parts of the sequences, as opposed to distant – and noisy – end-to-end alignments of the whole sequence.

Figure 3.1: Types of sequence alignment

Semi-global alignment

In cases where one of the sequences is much larger than the other, it makes sense to allow the smaller sequence to align end-to-end to a local region of the larger one. This is a very common case when aligning relatively short HTS sequences to a relatively large reference genome. In this case, the ends are free but only on the large sequence. Hence, trims on the large sequence are not taken into account (but they are on the smaller one).

Other types of alignment

Alternatively, there are many other alignment denominations that come up naturally in different applications to better explain relationships between sequences. Figure 3.2 (Alternative types of sequence alignment) depicts some of them. A common example are overlapping alignments, in which a suffix of one sequence aligns against the prefix of the other. Therefore, overlapping alignments allow for free-beginning or free-end gaps in the respective sequences. This type of alignment is commonly used in the initial stages of HTS assembly to detect overlaps between input sequences and reconstruct the original donor genome by jigsawing the relatively small sequences. Also, in the case of paired-end alignment, in some situations both ends are aligned against the reference so that they overlap in the same region (denoting that the sequencer has read the overlapping region of the template twice).

Figure 3.2: Alternative types of sequence alignment

With a more specific biological process in mind, split alignments aim to model RNA splicing events, where certain parts of the genome – introns – are removed and the remaining ones – exons – are joined together (ligated). These alignments allow large gaps in the middle with no penalty. As a result, this favours proper modelling of large sections of the genome being removed. Hence, the resulting alignments can span over distant regions of the genome.

3.1.4 Alignment conflicts

Despite its apparent simplicity, the problem of sequence alignment hides many different obstacles which make the whole problem difficult to define. Firstly, there is no clear notion of “best match”. In the same way, there is no consensus about which distance function best suits the problem (e.g. mismatch distance, edit distance, biological custom score matrices, etc.). Additionally, there are discrepancies in how to determine whether two matches close to each other are different. Figure 3.3 (General conflictive alignments) illustrates some of these sequence alignment conflicts.

As we can see in the rightmost example, two different regions of the genome are plausible sources of the sequenced read. The first one aligns to the read with two mismatches and the other with a deletion of two bases in the read. Depending on the distance function we use to evaluate both alignments, we can draw different and opposite conclusions. If we were to use a biologically inspired error model, we could say that the first match is less plausible than the second one: two mismatches are less likely to happen than a single error event of two deleted bases – which could also be a byproduct of a homopolymer sequencing error. However, an error model based on edit distance would score both matches equally (i.e. two error events in both matches). Moreover, different biologically inspired scoring matrices could lead to completely different results, depending on how much two mismatches score against a deletion of two consecutive bases.

Figure 3.3: General conflictive alignments

Another example of conflict is given by the leftmost example of Figure 3.3 (General conflictive alignments). In this case, the read matches the reference twice at very close locations – so close that it is difficult to tell whether there are two different matches or just one. Some approaches would say that these two alignments are incompatible and the whole sequenced read should be discarded as ambiguous. On the other hand, both alignments lead to the same alignment result, namely a mismatched ‘A’ nucleotide at the same position against the reference. Therefore, other approaches may support one of them being correct – if not both.

There are several other types of conflict; however, most of them are related to assumptions made about biological and biochemical behaviour in sequencing. Closely related to the alignment types, Figure 3.4 (Biologically inspired conflictive alignments) depicts some conflicts that can arise in conventional sequencing applications. The first case depicts a global alignment of the sequenced read, the second a local alignment of the same read, and the third a split alignment.

The global alignment, forced to perform an end-to-end alignment, reports three mismatches and several gaps, whereas the local alignment trims the most divergent part of the read in favour of a much cleaner alignment. Most applications would tend to favour the local alignment over the global one simply because it leads to fewer error events that can more easily be explained. Either way, both alignments reveal that there is a part of the read for which we cannot construct a plausible alignment. It is very important to acknowledge that there is some uncertainty inherent to both alignments.

In turn, the part of the read previously determined as divergent might have been generated from another part of the genome (e.g. from an RNA splicing event or a chimera). The third alignment – the split alignment – leads to a result in which the whole read is mapped with a long gap in the middle but with few mismatches and indels. Whether this third alignment is an artefact depends on the context of the application (e.g. some organisms or eukaryotic genes do not have introns and the alignment thus becomes bogus). This leads to the conclusion that the specifics of a use case cannot be captured by general-purpose alignment algorithms and must be addressed with a precise model in mind. Otherwise, the results may be erroneous.

Figure 3.4: Biologically inspired conflictive alignments

These problems should be tackled from a formal perspective which looks for a standardised and well-defined behaviour. In case of ambiguities, it is always better to report the uncertainty rather than to output the first match at random – which is the default behaviour of some tools. More importantly, aligners should be sensitive enough to discover the plausible alternative alignments for a given read and detect ambiguous cases.

Conflict resolution is one of the most important parts of modern read mappers. It is crucial to understand the impact it has on the downstream analysis and how bad-quality mapping results can lead to wrong high-level analysis results. For instance, a very sensitive mapping without a quality conflict-resolution logic could produce wrong mapping results (i.e. false positives) containing artefacts that could ultimately produce false SNP/SNV calls. Conversely, an insensitive mapper will miss mappings and, in turn, prevent calling potentially relevant variants. In addition, leaving too many sequenced reads unmapped will distort other downstream measures like coverage, CNV calls, etc.

3.2 Approximate String Matching. A problem in computer science

For many decades within the field of computer science, the problem of sequence alignment has been formally studied and is generally known as approximate string matching. Approximate string matching (ASM) refers to finding strings that match a pattern approximately (i.e. allowing for errors). Extensive research has been conducted in this field since the 1960s [Hall and Dowling, 1980; Navarro, 2001; Fonseca et al., 2012; Jokinen et al., 1996], mostly in the context of text retrieval (e.g. database searches, spell checking, file comparisons, etc.). As a result, a great many algorithms and techniques have been proposed to tackle the problem.

In the past decade, however, there has been a progressive movement away from formal methods towards tailored heuristic techniques. Quite often, these approaches justify their suitability with statistical observations derived from experimental benchmarks (very often focused on specific applications and input datasets). As a result, many tools have quickly become obsolete [Langmead et al., 2009; Li et al., 2008b], being unable to scale up and keep pace with newer technologies and applications. In the ever-changing field of bioinformatics, where methods and protocols can change in less than a year, analysis tools cannot rely solely on a handful of positive results drawn from specific biological simulations. Sequence alignment problems have to be tackled from the formal perspective of ASM, and research must aim to derive algorithmic guarantees in order to produce robust mappers suitable for real-life datasets. In fact, one should understand that such a thing as a completely biology-specific mapper does not exist. All alignment tools ultimately rely on techniques borrowed from ASM which are tailored to perform searches in biological sequences. However, they sometimes incorporate biological models into their algorithms so as to better capture the requirements of their use.

Another important point is the fact that, so far, most of the research focus in the field of ASM has been placed on a parameter range which is quite far away from that typically needed by current sequencing technologies. For example, classic ASM literature would consider a regular pattern to be around 20 characters long, while a long pattern could take up to 64 characters (mostly based on textual data and language words). In contrast, modern sequencing technologies produce data with very different characteristics. The most used short-read technologies produce sequences in the range of 75-300 nucleotides, while a long read can be up to several thousand nucleotides long. In addition, the alphabets typically considered by ASM algorithms would comprise 26 symbols (the regular English alphabet) plus numbers, punctuation and symbols, while the DNA alphabet typically has only 5 letters (that is, the four bases A, C, T, G plus an additional symbol, typically N, to model uncalled/unknown bases). Furthermore, ASM usually assumes large error rates (up to 50% and sometimes even more), whereas in short-read mapping one typically considers errors smaller than 5% (although some long-read technologies can get to 30%). Yet the most striking difference is probably the size of the reference text used. In bioinformatics, genome references can be several Gbases long (for instance, the human genome is approximately 3 Gbases). On the other hand, classical problems in ASM usually deal with databases of several megabytes – orders of magnitude smaller, like word databases.

ASM is a very general field in computer science and has many formulations. Depending on the context, ASM problems can be easily addressed with lightweight algorithms or reveal the intractable nature of ASM as an NP-complete problem. This section focuses on definitions and techniques related to mapping sequencing reads. In this spirit, the following sections introduce the problem of mapping from the formal perspective of approximate string matching, together with several formal definitions used later on to introduce and formulate ASM techniques.

3.2.1 Stringology

In the following sections, we use s, p, t, x, v, w, y, z to represent arbitrary strings, and a, b, c, d to represent individual characters. Letters written one after the other represent concatenations.

Definition 3.2.1 (Alphabet). An alphabet Σ = {a0, ..., ai} is a finite ordered set of symbols (characters).
Remark. We will indicate the length of the alphabet as |Σ| = σ.
Remark. Without loss of generality, we only consider finite alphabets. Note that the algorithms we present could be adapted to infinite alphabets, as they operate on finite strings with a limited number of symbols (which can be previously analysed and used as input alphabet).

Definition 3.2.2 (String). A string (or word) w = w0...w|w|−1 over Σ is a finite sequence of symbols wi ∈ Σ.
Remark. Let |w| denote the length of the string w ∈ Σ*.
Remark. Let wi ∈ Σ denote the i-th character of the string w ∈ Σ*, for an integer i ∈ {0..|w| − 1}.
Remark. Let ε denote the empty string.
Definition 3.2.3 (Q-gram/K-mer). In general, we denote as q-gram or k-mer any string of length q or k, respectively.

Definition 3.2.4 (Substring). We denote wi..j = wi, wi+1, .., wj as a substring of w (also referred to as a subsequence).
Remark. If i > j then the substring is equal to ε.

Definition 3.2.5 (Prefix). A substring wi..j = wi, wi+1, .., wj is a prefix of w iff i = 0.

Definition 3.2.6 (Suffix). A substring wi..j = wi, wi+1, .., wj is a suffix of w iff j = |w| − 1.

Definition 3.2.7 (Lexicographical order). The lexicographical order between two strings w = w0...w|w|−1 and v = v0...v|v|−1 is defined as w ≤ v iff either w is a prefix of v, or there exists an index k such that wi = vi for all i < k and wk < vk.

3.2.2 String Matching

In the context of ASM there are certain string denominations that denote the input of many algorithms. Namely, the reference text against which sequenced reads are mapped is denoted as the text; therefore let T ∈ Σ* be a text of length n = |T|. In the same way, sequenced reads are referred to as patterns; so let P ∈ Σ* be a pattern of length m = |P|. Likewise, let e ∈ R be the maximum error allowed and d : Σ* × Σ* → R be a distance function.
Definition 3.2.8 (Occurrence). A string w occurs in a longer string x iff w is a substring of x.
Remark. Let occ(w, x) be the occurrence-count function that returns the total number of occurrences of the string w in x.
Definition 3.2.9 (Exact string matching). Given T, P, e, and d(), exact string matching is defined as the problem of finding all occurrences of P in the text T. Exact string matching is one of the simplest and most studied problems in stringology. There has been extensive research on the problem and many solutions have been proposed [Charras and Lecroq, 2004; Turing, 2006; Lecroq, 2007]. Nevertheless, exact matching is very limited in this context, as it cannot model divergence between strings.
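To make occ(w, x) and the exact-matching problem concrete, the following is a naive online Python sketch (a simple O(|x|·|w|) scan); it is only an illustration and not one of the efficient algorithms cited above.

def occurrences(w: str, x: str) -> list[int]:
    """Return every position of x where the string w occurs (naive scan)."""
    positions = []
    for i in range(len(x) - len(w) + 1):
        if x[i:i + len(w)] == w:
            positions.append(i)
    return positions

def occ(w: str, x: str) -> int:
    """Occurrence count occ(w, x)."""
    return len(occurrences(w, x))

# "ACA" occurs twice in this toy text (the two occurrences overlap).
assert occ("ACA", "GATTACACA") == 2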

Definition 3.2.10 (Match). Given T, P, e, and d(), a string mi = Ti..j is a match (or matches P) iff d(P, Ti..j) ≤ e.

Remark. The position i where P matches T is denominated the matching position.
Definition 3.2.11 (Approximate string matching). Given T, P, e, and d(), approximate string matching is defined as the problem of finding all matching positions M = {m0, .., mn} between the text T and the pattern P.
Remark. Exact string matching can be defined in terms of the approximate string matching problem in the case e = 0.

A common technique to enhance string searches is to preprocess the inputs of the problem so as to enable fast access to specific substrings. This preprocessing is usually applied to the input text (i.e. the reference genome) to produce a text index (or just index). However, there are some approaches that also consider indexing the patterns (i.e. the sequenced reads). Either way, depending on whether the search is performed using an indexed text or a plain text, we refer to the search as indexed or online, respectively. Over the years, many index preprocessing algorithms have been proposed [Ukkonen, 1992; Apostolico, 1985; Gusfield, 1997; Weiner, 1973; McCreight, 1976] (in Chapter 4 we introduce the main data structures and algorithms used for indexing).
Definition 3.2.12 (Online string matching). Online string matching is the problem of string matching – exact or approximate – when the input T is a plain sequence of symbols.
Definition 3.2.13 (Indexed string matching). Indexed string matching (a.k.a. offline string matching) is the problem of string matching – exact or approximate – when the input I is an index (a preprocessed data structure of the text T).

3.2.3 Distance metrics

Definition 3.2.14 (Transformation operation (a.k.a. error operation)). A transformation operation (also operation or rule) is defined as a function δ : Σ* × Σ* → R+, such that δ(x, y) = c denotes the operation that transforms the string x into the string y with the associated cost c.
Remark. Once an operation has been applied to a substring x, turning it into y, no further operations can be applied to y.
Remark. Common operations used in diverse applications are:
• Insertion: δ(ε, a)
• Deletion: δ(a, ε)
• Substitution: δ(a, b) | a ≠ b
• Transposition: δ(ab, ba) | a ≠ b
Definition 3.2.15 (Alignment). An alignment between two strings v and w is defined as a sequence of operations A = {δ0, .., δn} that transforms v into w:

v →A w

Remark. In some contexts, this sequence of operations is also denoted the transcript (Gusfield, 1997).
Definition 3.2.16 (Error rate). Let ε = e/|P| = e/m be the error rate.
Remark. By definition, 0 ≤ ε ≤ 1.
Definition 3.2.17 (Distance function). A distance function d(x, y) between two strings x and y is the cost of the minimal-cost alignment that transforms x into y. That is, d : Σ* × Σ* → R defined as

d(x, y) = min { Σ_{δi ∈ A} δi  |  A is an alignment such that x →A y }

Definition 3.2.18 (Symmetric function). Let d(x, y) be a distance function. If for each operation δ(x, y) there exists a reciprocal operation δ(y, x) with the same cost (i.e. δ(x, y) = δ(y, x)), then the distance d(x, y) is symmetric (i.e. d(x, y) = d(y, x)).
Definition 3.2.19 (Metric space). Let d(x, y) be a distance function for which the following holds:
• d(x, y) ≥ 0 (non-negativity)
• d(x, y) = 0 ⇔ x = y (identity of indiscernibles)
• d(x, y) = d(y, x) (symmetry)
• d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
Then the space of strings forms a metric space.
Definition 3.2.20 (Neighbourhood). Given a distance function d(x, y), we define the neighbourhood of a string s at distance e as

N_d^e(s) = {n0, .., nn | d(ni, s) ≤ e, ∀ni}

I Classical distance metrics

Besides these general distance definitions, some distance functions used in many ASM applications are particularly interesting. In the following sections we present in detail the most relevant ones, so as to help further understand the algorithms presented later on.

Definition 3.2.21 (Hamming distance or mismatch distance). Let δm(v, w) be the Hamming distance [Hamming, 1950] between two strings, defined as:

δm(v, w) = ∞ if |v| ≠ |w|,  and  δm(v, w) = Σ_{i=0}^{|v|−1} (vi ≠ wi) otherwise.

Remark. Approximate string matching according to the Hamming distance is known as the k-mismatches problem [Galil and Giancarlo, 1989].

Definition 3.2.22 (Levenshtein distance or edit distance). Let δe^{v,w}, or simply δe, be the Levenshtein distance [Levenshtein, 1966] between two strings v and w, defined recursively as δe(|v|, |w|) with

δe(i, j) = max(i, j) if i = 0 ∨ j = 0, and otherwise
δe(i, j) = min{ δe(i − 1, j) + 1,  δe(i, j − 1) + 1,  δe(i − 1, j − 1) + (vi ≠ wj) }

Remark. Approximate string matching according to the Levenshtein distance is also known as the k-differences problem.

II General weighted metrics

Besides these two basic distance functions, one can define an infinite number of other distances as variations of the edit distance. Such generalisations allow arbitrary weights – or scores – to be associated with each insertion, deletion or substitution operation. In general, the problem can be formulated in terms of either maximising a score or minimising a cost. Both are usually defined as a function of the error events between a pattern and a text. The score – or cost – is commonly an integer value which models the divergence between the pattern and the text. In general terms, score functions are usually used when both similarities and differences are taken into account, whereas cost functions usually focus only on the differences.

These weighted functions are often generalised to allow the score of a substitution to depend on the exact nucleotide of the alphabet being removed and the exact one being added. Likewise, gaps might be weighted to score differently depending on the nucleotide inserted or deleted. More often, the cost of each gap models the problem at hand better if it is weighted by the exact number of nucleotides being inserted or deleted. In the context of protein alignment, for instance, alphabet-weighted scores are used to model how likely an amino acid is to be replaced by another one. There has been extensive research on how to determine the most suitable of such amino acid transition matrices based on biological evidence; the dominant scoring schemes are the PAM matrices [Henikoff and Henikoff, 1992] and the BLOSUM scores [Mount, 2008; Henikoff and Henikoff, 1993].

Given that so many scores can be formulated, some general plausibility assumptions must be taken into account when configuring these weighted score matrices. For instance, if substitutions are allowed, then it is reasonable to assume that the cost of a deletion together with an insertion is greater than the cost of a single substitution. Also, note that the scores depend on the characters involved in the operation and on its length; however, they rarely depend on where in the string the operation is applied.

Besides the actual weights that a given weighted distance metric can have, there is a much more important factor that heavily determines the final shape of the alignment. The initialisation conditions encode the boundary cases that lead to global, local, or semi-global alignments. Depending on these conditions, the distance computation will allow gaps at the beginning and/or end of the pattern and/or text with no penalty whatsoever.

Definition 3.2.23 (Global weighted distance). Given the substitution weight function τS(a, b), the insertion weight function τI(s, k) and the deletion weight function τD(s, k), let δH^{v,w}, or simply δH, be the global weighted distance between two strings v and w. We denote by H(i, j) the maximum similarity score between v0...i and w0...j. Then the maximum alignment score H(|v| − 1, |w| − 1) is given by:

H(i, 0) = i (0 ≤ i < |v|),  H(0, j) = j (0 ≤ j < |w|),  and otherwise
H(i, j) = max{ H(i − 1, j − 1) + τS(vi, wj)            (substitution),
               max_{0<k≤j} { H(i, j − k) + τI(w, k) }  (insertion),
               max_{0<k≤i} { H(i − k, j) + τD(v, k) }  (deletion) }

Note that in the global weighted alignment the initial conditions (i.e. H(i, j) s.t. i = 0 ∨ j = 0) penalise initial insertions or deletions according to their length. For instance, H(i, 0) is the maximum score to align v0..i against the empty string, which equals the cost of deleting all of its characters. Likewise, H(0, j) is the maximum score to align the empty string against w0..j, which equals the cost of inserting all the characters of w0..j.

In a similar fashion, if we wanted to model semi-global alignment, we would incorporate into the previous definition the fact that any starting or ending deletion in the pattern is not penalised. This is commonly known as ends-free alignment.

Definition 3.2.24 (Semi-global weighted distance). Given the substitution weight function τS(a, b), the insertion weight function τI(s, k) and the deletion weight function τD(s, k), let δH^{v,w}, or simply δH, be the semi-global weighted distance between two strings v and w. We denote by H(i, j) the maximum similarity score between v0...i and w0...j. Then the maximum alignment score H(|v| − 1, |w| − 1) is given by:

H(i, 0) = 0 (0 ≤ i < |v|),  H(0, j) = j (0 ≤ j < |w|),  and otherwise
H(i, j) = max{ H(i − 1, j − 1) + τS(vi, wj)            (substitution),
               max_{0<k≤j} { H(i, j − k) + τI(w, k) }  (insertion),
               max_{0<k≤i} { H(i − k, j) + τD(v, k) }  (deletion) }

Definition 3.2.25 (Local weighted distance). Given the substitution weight function τS(a, b), the insertion weight function τI(s, k) and the deletion weight function τD(s, k), let δH^{v,w}, or simply δH, be the local weighted distance between two strings v and w. We denote by H(i, j) the maximum similarity score between v0...i and w0...j. Then the maximum alignment score H(|v| − 1, |w| − 1) is given by:

H(i, 0) = 0 (0 ≤ i < |v|),  H(0, j) = 0 (0 ≤ j < |w|),  and otherwise
H(i, j) = max{ 0,
               H(i − 1, j − 1) + τS(vi, wj)            (substitution),
               max_{0<k≤j} { H(i, j − k) + τI(w, k) }  (insertion),
               max_{0<k≤i} { H(i − k, j) + τD(v, k) }  (deletion) }

The final step towards accurate modelling and scoring of alignments comes with a more natural model of the gap cost. Many distances proposed in the context of computer science weight a gap proportionally to its length. However, in the context of biology, the cost of a gap may not increase linearly with its length. In some cases, a cost growing only logarithmically with the gap length would model these error events better (i.e. τ(k) = α + β log(k)); however, this approach would lead to impractical algorithms. To overcome this problem, affine gap penalties are often used to mimic this behaviour: intuitively, once a gap is “opened” (i.e. created or established in the alignment), the cost of extending it (i.e. enlarging the gap) should be lower (i.e. τ(k) = α + kβ). Algorithms computing this type of distance are often referred to as Smith-Waterman-Gotoh, after the authors that first proposed the model. Among other distances, this one is often preferred by biologists and is thus widely used and implemented in many popular alignment tools. Such is the popularity of this metric that it is not uncommon to see a mapping tool preferred simply because it employs gap-affine penalties, even if its run-time is several times slower than that of other available tools.
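As an illustration of gap-affine scoring, the following Python sketch computes a Smith-Waterman-Gotoh local alignment score using the classical three-matrix recurrences; the match/mismatch/gap values are illustrative assumptions and the function name is ours.

def sw_gotoh_score(v, w, match=2, mismatch=-2, gap_open=-3, gap_extend=-1):
    """Local alignment score with affine gap penalties (Gotoh's recurrences).

    H[i][j]: best local score with v[i-1] aligned to w[j-1].
    E[i][j]: best score ending with a gap in v; F[i][j]: gap in w.
    Clamping scores at 0 makes the alignment local (free ends).
    """
    NEG = float("-inf")
    n, m = len(v), len(w)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG] * (m + 1) for _ in range(n + 1)]
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open + gap_extend)
            F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open + gap_extend)
            diag = H[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            H[i][j] = max(0, diag, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

# Once a gap is opened, extending it costs only gap_extend, so one long gap
# is preferred over two separate gaps of the same total length.
assert sw_gotoh_score("GATTACA", "GATACA") == 8   # GAT-ACA aligned to GATTACA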

Algorithm 1: Edit distance computation using dynamic programming
input : v, w strings
output: edit distance between v and w
begin
    for i = 0 to |v| do
        δe[i, 0] ← i
    end
    for j = 1 to |w| do
        δe[0, j] ← j
    end
    for i = 1 to |v| do
        for j = 1 to |w| do
            δe[i, j] ← min( δe[i − 1, j] + 1,
                            δe[i, j − 1] + 1,
                            δe[i − 1, j − 1] + (v[i − 1] ≠ w[j − 1] ? 1 : 0) )
        end
    end
    return δe[|v|, |w|]
end

III Dynamic programming pair-wise alignment

For the sake of completeness, and to illustrate one of the most fundamental algorithms for pair-wise alignment, we present the solution based on dynamic programming. This approach has been rediscovered over the years within the context of several different areas. The many studies of this method have produced fundamental theoretical bounds for the problem and suggested several approaches to avoid unnecessary computations. However, in the eyes of this author, its most relevant contribution has been to guide and inspire many practical algorithms based on properties highlighted by these dynamic programming formulations. Despite not being the fastest approach, it is without a shadow of a doubt the most flexible solution and the easiest to adapt to the many different distance metrics.

Following Bellman’s principle of optimality, the minimum edit distance (Definition 3.2.22) between two strings – v and w – can be computed using a dynamic programming algorithm. To avoid the repeated computations implied by the recursive formula, recursive calls are organised in a dynamic programming table of size (|w| + 1) × (|v| + 1). Algorithm 1 shows how to compute this table. For instance, in the following example we show the computation of the full dynamic programming table aligning the pattern ”GAGATA” against the text ”GATTACA”.

δe(”GAGATA”, ”GATTACA”) =

        −  G  A  T  T  A  C  A
    −   0  1  2  3  4  5  6  7
    G   1  0  1  2  3  4  5  6
    A   2  1  0  1  2  3  4  5
    G   3  2  1  1  2  3  4  5
    A   4  3  2  2  2  2  3  4
    T   5  4  3  2  2  3  3  4
    A   6  5  4  3  3  2  3  3

The dynamic programming algorithm fills each cell based on the content of its upper, left, and upper-left neighbours. In this way, it can progress column-wise from the leftmost column (initial conditions) to the last column; likewise, it can perform the computations row-wise from the uppermost row to the last row. Either way, this approach runs in O(|v|·|w|). However, if only the distance is required, computations can be confined to just one column – or row – and progress forward until the last column – or row – is computed and the alignment distance is known. As we can see in the example, the minimum edit distance between the strings is 3 edit operations.

The same table can be annotated with traceback arrows marking, for each cell, which neighbour it was updated from; following these arrows along the optimal path shows how the minimum is reached. In order to recover the sequence of operations that aligns one string into the other, we simply have to follow this path from δe(6, 7) back to δe(0, 0) (in this context usually called the edit transcript). This process is known as traceback and, as we can see, it can lead to multiple possible paths – all of them equally valid.

3.2.4 Stratifying the metric space

As explained above, error models with suitable properties define a metric space. This space of strings can be stratified not only to organise the search space, but also to guide the search towards deeper and more distant solutions. In this way, we divide the search space into strata; in other words, we group together all possible strings at a fixed distance from the pattern P.
Definition 3.2.26 (Search stratum). Given P, e, and d(), a search stratum is the set of all strings from the neighbourhood of P at distance e:

S_e^P = {wi ∈ N_d^e(P) | d(P, wi) = e}

Definition 3.2.27 (Search space). Given P, e, and d(), the search space is defined by the e + 1 search strata:

S_e^P = S_0^P, ..., S_e^P

However, not all strings in the search space necessarily occur in the reference text T. For that reason, we would like to focus only on the actual matches when classifying them; hence, we group all the possible matches at a given distance e from the pattern P into a match stratum.
Definition 3.2.28 (Match stratum). Given T, P, e, and d(), a match stratum is the set of all matches between T and P aligning with distance e:

M_e^{T,P} = {mi = Ti..j | d(P, mi) = e}

Similarly, all the match strata up to a given distance e define the matches space.
Definition 3.2.29 (Matches space). Given T, P, e, and d(), the matches space is defined by the e + 1 match strata:

M_e^{T,P} = M_0^{T,P}, ..., M_e^{T,P}

Figure 3.5 depicts the search space for the word P=GATTACA grouped by strata for the Hamming and Levenshtein distances (i.e. the neighbourhood of P classified by distance). However, as the figure shows, in most cases not all combinations occur in the text and only a few of them happen to be matches. Depicted in green, we can see the actual matches for a hypothetical text T.

Figure 3.5: Stratified search space for P=GATTACA

As we can see, depending on the distance metric employed, the size of each stratum grows differently, allowing various possible matches. Under distances like Hamming or Levenshtein, it is well known that the cardinality of each stratum grows exponentially [Myers, 1994; Navarro, 2001]. Because of this, some mappers will fail to capture all the matches in a certain stratum – sometimes due to heuristic approaches, and sometimes because it is simply unnecessary. In both cases, if a given match stratum M_e^{T,P} contains all the matches of P against T at distance e, we say the stratum is complete (i.e. it contains all possible matches); otherwise, the stratum is incomplete. Consequently, we say that a mapper – or alignment algorithm – is complete up to e if all strata reported up to e are complete; otherwise, the mapper is incomplete.

Strata notation

The matches space is of great help not only in understanding the results from mapping tools, but also in resolving multimap conflicts. For convenience, we introduce the strata notation so as to aid the following explanations. The strata notation depicts the cardinality of each stratum in a search or matches space. Then, for a given S_e^P = S_0^P, ..., S_e^P or M_e^{T,P} = M_0^{T,P}, ..., M_e^{T,P}, we notate a stratified space by putting the cardinalities of each stratum of the space in order, separated by colons:

S_e^P → |S_0^P| : ··· : |S_e^P|
M_e^{T,P} → |M_0^{T,P}| : ··· : |M_e^{T,P}|

To denote incomplete strata, we use a plus sign (+) instead of a colon to separate the complete strata from the incomplete ones. In this way, we clearly establish the point from which a given mapper did not report all possible matches and there exists an uncertainty (i.e. there could be valid matches beyond that stratum). In this way, we can visualise how many matches are contained in each stratum of the space – and which of them are complete. For example, the Hamming matches space in Figure 3.5 can be described as (1 : 2 + 1). We can see that in the first two strata there is 1 exact match – at distance zero – and 2 matches with one mismatch. These two strata are complete, whereas the stratum at distance 2 has not been fully searched. For this reason, we have only found 1 match there and we cannot trust that there is no other match in the text at distance 2. The last example depicts a clear situation for which we would not be able to tell which match is the ”true-match” (i.e. a multimap conflict). Nevertheless, the search depth enables us to assess the conflict and consequently declare the sequenced read ambiguous. However, a narrower search would not have been able to determine this conflict. For example, a search leading to (1 + 0 : 0) could mistakenly conclude that the exact match is unique – whereas we certainly know that it is not. This example should raise awareness of the hazards of not exploring enough strata completely, and of how complete mappers are of the utmost relevance in order to accurately determine conflictive cases.
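As a toy illustration of the strata notation under the Hamming distance, the following sketch enumerates the pattern neighbourhood stratum by stratum and counts how many neighbours actually occur in a made-up text; all names and the example text are assumptions for illustration only.

from itertools import combinations, product

SIGMA = "ACGT"

def hamming_stratum(pattern: str, e: int):
    """All strings at Hamming distance exactly e from pattern (stratum S_e^P)."""
    stratum = set()
    for positions in combinations(range(len(pattern)), e):
        for letters in product(SIGMA, repeat=e):
            candidate = list(pattern)
            for pos, letter in zip(positions, letters):
                candidate[pos] = letter
            candidate = "".join(candidate)
            # keep only candidates at distance exactly e (i.e. no substitution
            # re-used the original nucleotide)
            if sum(a != b for a, b in zip(candidate, pattern)) == e:
                stratum.add(candidate)
    return stratum

def strata_notation(text: str, pattern: str, max_e: int) -> str:
    """Cardinality of each (completely searched) match stratum, as 'a:b:c'.

    text.count() counts non-overlapping occurrences, a simplification that is
    good enough for this toy example.
    """
    counts = []
    for e in range(max_e + 1):
        matches = sum(text.count(s) for s in hamming_stratum(pattern, e))
        counts.append(str(matches))
    return ":".join(counts)

# Hypothetical text: GATTACA occurs exactly once, plus one single-mismatch
# occurrence (GATCACA) and none at distance 2, so this prints "1:1:0".
print(strata_notation("CCGATTACATTTGATCACAGG", "GATTACA", 2))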

3.3 Genome mapping accuracy

In the context of mapping benchmarking, evaluation of accuracy plays a fundamental role. Not only does it help to improve the design of sequence alignment algorithms, it also plays a crucial role in quality control (QC) steps taken as part of HTS analysis pipelines. More generally, it would be hard for downstream tools to do their job properly without being able to obtain and incorporate this information into their inner workings.

However, as explained above, inherent ambiguities in sequence alignment set an upper limit on the number of reads that can be accurately mapped to their source genome loci. Additionally, the lack of a ground truth – and the impossibility of computing one – makes the problem even more challenging. Last but not least, different alignment algorithms and tools use slightly different definitions of what they mean by mapping, which generally makes comparisons among them quite difficult and inaccurate.

In recent years many metrics and frameworks have been proposed for mapping quality evaluation [Guo et al., 2013; Patel and Jain, 2012; Yang et al., 2013b]. Nonetheless, most of these metrics only account for specific features of mapping, and can be misleading if not put together with complementary metrics. And although some attempts to define a common gold standard for evaluation have been made, no consensus has been reached so far. Most algorithms are still being designed with bespoke metrics in mind — often following the underlying principle that accuracy can be sacrificed for the sake of performance.

In this section we present and discuss the most commonly used metrics and frameworks for mapping quality evaluation. Afterwards, we present our improvements on them and an analysis of their impact on assessing accuracy. We aim to define a common ground not only to evaluate read mapping results, but also to guide sequence alignment algorithms towards better accuracy. Ultimately, we pursue an evaluation framework in which we can properly compute the likelihood of a given mapping being the “true-match”, compare results obtained by different sequence alignment algorithms and, in general, evaluate the overall quality of mapping results.

3.3.1 Benchmarking mappers

In this section we present the most commonly used metrics and methodologies for evaluating mapping results. We also conclude, however, that if one wants to obtain robust results one should refrain from using one of these metrics in isolation as a single quality indicator.

I Quality control metrics

Traditionally, evaluation of the quality of read mapping results has been performed by QC tools within HTS analysis pipelines. Within these QC steps, mappings are analysed to collect basic metrics that try to convey an overall idea of the quality of the results. These metrics represent a good starting point in the evaluation of the results. In most cases, however, their usefulness is limited to assessing the quality of the sequencing data itself rather than the capabilities of the alignment tool used. In this way, their suitability for comparing mapping algorithms is very limited and, in some cases, they can produce misleading conclusions.

Reads mapped. Probably the most straightforward metric is to count the total number of reads mapped against the reference. The total number of reads mapped is far more helpful for detecting anomalies than for assessing correctness. A low number of reads mapped is a clear indicator that something along the process has gone wrong. It may be due to errors during sequencing, reflect a lack of sensitivity of the mapping tool, or be the by-product of a bad-quality or incomplete reference genome (among many other causes). However, the opposite should not always be taken as a good sign. An excessive number of reads mapped (e.g. mapping 100% of Illumina-like >100 nt reads) might be concealing an unduly sensitive mapper that allows too much divergence in order to align every single sequenced read. It is very important to understand that mapping more reads is not synonymous with a better mapping tool.

Error distribution. It is important to realise that, allowing for enough error tolerance, every read can align to any locus of the genome. In this way, many mapping tools [Marco-Sola et al., 2012; Zaharia et al., 2011; Langmead and Salzberg, 2012] base their algorithms on progressively allowing more divergence until a match is found. This can potentially lead to mapping artefacts; sequence reads that might have no biological relation with the locus they are aligned to. In order to control this, error distribution plots can help to identify these excessively divergent mappings and in turn filter them out from further downstream analysis.

Total bases mapped. Towards retrieving a plausible mapping location, many mapping tools [Marco-Sola et al., 2012; Li, 2013; Langmead and Salzberg, 2012] switch from a semi-global to a local alignment search. Hence, many widely used tools will output reasonably good local alignments (i.e. local chunks of a read highly homologous to a locus in the reference) as a best mapping effort. Nonetheless, this does not mean that the reported match is likely to be the “true-match”. As a matter of fact, all local alignments will be counted as mapped, hiding the fact that for a certain amount of reads the mapper was unable to find a valid semi-global alignment. For this reason, computing the total number of mapped bases can highlight cases where a high number of mapped reads is achieved by means of local alignments or large indels in the read (e.g. large trims or frequent deletions).

Insert size distribution. In paired-end protocols (2.3 Sequencing protocols), due to the size-selection step during sample preparation, we have a prior estimate of the insert size between the paired fragments. After mapping both ends of each sequenced template, the resulting insert size distribution is expected to follow a normal distribution around the insert size estimated from sample preparation. Any deviation in this curve – often isolated spikes – might indicate errors in sequencing or anomalies produced by the mapping. Moreover, comparing the curves produced by different mapping tools can reveal artefacts and biases produced by such tools.

Coverage. As a standard procedure after mapping, a coverage analysis can not only reveal biases during sequencing (e.g. low coverage on GC-low regions), but might also indicate problems during mapping. In this way, coverage profiles can reveal mapping insensitivity towards certain regions in the genome (e.g. reference regions highly divergent from the donor sample). As a re- sult, variant detection can fail to detect putative variants in those regions during the downstream analysis. On the contrary, comparing different coverage profiles can lead to spotting excessively covered regions. Mapping tools allowing too much divergence can mistakenly align reads to incor- rect regions introducing noise in downstream analysis and potentially leading to incorrect results (e.g. convoluted and artificial variants).

Others. In general, QC tools include a wide variety of basic analyses that can be used to assess the accuracy of mapping tools, provided a proper and careful interpretation of the results and their implications. Among other useful metrics, we can often make use of strand quantification, mismatch frequency analysis, indel frequency analysis, pair orientation analysis, etc.

II Simulation-based evaluation

However, for the sake of benchmarking mapping tools, simulated datasets are often generated so as to provide a ground truth and derive further accuracy results. Nowadays, there are plenty of sequence read simulators [Huang et al., 2012; Holtgrewe, 2010; Pratas et al., 2014] to choose from – depending on the selected technology and the characteristics we aim to model. When evaluating simulated results, we can employ standard statistical measures of performance, treating the mapping algorithm as a binary classification test. For the sake of clarity, we first present what we believe is the most logical classification in the context of read mapping.

                               Predicted condition (Mapper)
                               Reported match    Unmapped
True condition   True-match    TP                FN
(Simulator)      Otherwise     FP                TN = 0

Table 3.1: Mapping confusion table

True positive (TP) is when the mapper reports a match position as putative and it is actually the source position used by the simulator (i.e. true-match).

False positive (FP) (type I error), or “false alarm”, is when the mapper reports an incorrect match position (i.e. one which differs from the “true-match” generated by the simulator).

True negative (TN), under our criteria, is when the mapper reports a read as unmapped and the read is truly unmappable. This case is not really relevant in the context of our application, as the simulator always produces reads originating from some region of the reference. For that reason – and limited to our specific domain – we will consider that the number of true negatives is always zero.

False negative (FN) (type II error) is when the mapper cannot find the putative location of the sequenced read (i.e. it incorrectly reports it as unmapped) while there is a valid match (produced by the simulator).

In this way, we can define some metrics to assess the quality of the results produced by a mapping tool based on the number of correct and incorrect matches reported.

Definition 3.3.1 (True positive rate (TPR)). Also known as sensitivity or recall, it measures the proportion of mapping locations that are correctly reported (i.e. where the read did indeed originate from). Note that sensitivity does not account for false positives (FP): a mapper achieving a high TPR can potentially report a lot of incorrect positions without it being reflected in this measure.

TPR = TP / (TP + FN) = hits / (hits + unmapped)

Definition 3.3.2 (False discovery rate (FDR)). The FDR measures the amount of incorrect mapping locations reported in terms of the total number of matches mapped. Note that the FDR does not account for unmapped reads. This is a conservative magnitude, as a mapper could achieve a very low (i.e. good) FDR by reporting only those matches with a high likelihood of being correct and leaving the rest unmapped.

FDR = FP / (TP + FP) = fails / (hits + fails)

Definition 3.3.3 (Accuracy (ACC)). Accuracy is formally defined as the amount of correctly reported matches in terms of the total number of reads – including incorrect matches and unmapped reads.

ACC = TP / (TP + FP + FN) = hits / (hits + fails + unmapped)
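Under the convention of Table 3.1 (TN = 0), these three metrics reduce to simple ratios over the counts of hits, fails and unmapped reads, as the following minimal sketch shows (the counts used in the example are made up).

def mapping_metrics(hits: int, fails: int, unmapped: int) -> dict:
    """TPR, FDR and ACC as defined above, with TP=hits, FP=fails, FN=unmapped
    and TN fixed to zero (reads always originate from the reference)."""
    tp, fp, fn = hits, fails, unmapped
    return {
        "TPR": tp / (tp + fn) if (tp + fn) else 0.0,   # hits / (hits + unmapped)
        "FDR": fp / (tp + fp) if (tp + fp) else 0.0,   # fails / (hits + fails)
        "ACC": tp / (tp + fp + fn) if (tp + fp + fn) else 0.0,
    }

# E.g. 900 correct matches, 50 wrong matches, 50 unmapped reads out of 1000.
metrics = mapping_metrics(hits=900, fails=50, unmapped=50)
assert round(metrics["TPR"], 4) == 0.9474 and round(metrics["ACC"], 2) == 0.90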

III MAPQ score

The MAPQ score was introduced within the tool MAQ [Li et al., 2008a] and later adopted as a mandatory field within the SAM specification [Li et al., 2009]. This tool’s publication is one of the first to acknowledge the subtleties of accurate mapping and the conflicts that can arise from repetitive positions of the genome (i.e. multimaps). To address this, its authors propose phred-scaled mapping qualities Qs [Ewing and Green, 1998].
Definition 3.3.4 (MAPQ score). MAPQ is the phred-scaled probability that a read alignment is wrong:

Qs = −10 log10 Pr{read is wrongly mapped}

Remark. For Qs = 30, the probability that the read is incorrectly mapped is 1 in 1000.

Based on this definition, the authors propose a method to measure the confidence that a read actually comes from the position it is aligned to (i.e. MAPQ). For that, they simplistically assume that reads are known to come from the reference and that the search is complete up to a given error threshold, which guarantees that no relevant match is missed. Also, they assume that sequencing errors are independent at the different sites of the read. Then, given a reference sequence T and a read sequence P, they define the probability p(P|T, q) of P coming from the position q as the product of the error probabilities of the mismatched bases at the aligned position:

p(P|T, q) = Π_{m ∈ Mismatches} 10^(−Q_phred(m)/10)

For example, in the case of a read with two mismatches with phred base qualities of 20 and 10:

p(P|T, q) = 10^(−20/10) × 10^(−10/10) = 0.001

Let us assume a uniform distribution p(u|T) over the reference. Applying the Bayesian formula gives the probability ps(u|T, P):

ps(u|T, P) = p(P|T, u) / Σ_{v=1}^{n−m+1} p(P|T, v)

Scaled to the phred logarithmic scale, this leads to the mapping quality of the alignment:

Qs(u|T, P) = −10 log10(1 − ps(u|T, P))

However, computing this formula is impractical, as all positions of the genome would have to be aligned against the read and added to the computation. Therefore, the authors of the paper propose an approximation of the formula [Li et al., 2008a]:

Qs = min{ q2 − q1 − 4.343 log(n2) + (3 − k′)(q − 14) − 4.343 log(p1(3 − k′, 28)) }

• q1 is the sum of quality values of mismatches of the best hit

• q2 is the corresponding sum for the second best hit

• n2 is the number of hits having the same number of mismatches as the second best hit

• k′ is the minimum number of mismatches in the 28-bp seed

• q is the average base quality in the 28-bp seed

• 4.343 is 10/ln(10)

• p1(k, 28) is the probability that a perfect hit and a k-mismatch hit coexist given a 28-bp sequence, which can be estimated during alignment.

At the time of publication, sequenced reads were quite short (i.e. 28-36 nt). Furthermore, some of the assumptions previously made are too simplistic. Unsurprisingly, this method has drawn some criticism. First, we cannot assume right away that all the sequenced reads come from the reference. In addition, we cannot assume a uniform distribution, as references – like the human genome – exhibit strong biases towards repetitive regions. At the end of the day, this score is approximated employing those suboptimal alignments that the mapper considers relevant (i.e. expected to have a meaningful contribution to the score). Yet, depending on the search depth, there might be many subdominant matches which can strongly affect this computation; hence, finding them all becomes paramount so as to avoid wrong approximations of the MAPQ score. Secondly, these MAPQ scores are computed using a Bayesian statistical model based mainly on the error probabilities from the raw sequence quality scores. Therefore, this formula fails to incorporate the fact that errors can come from variants between the donor and the reference. In longer reads, this becomes more concerning as the model systematically prefers error events in low base quality positions – irrespective of whether they are sequencing errors or putative variants.

Many other MAPQ computation proposals have been made [Siragusa, 2015; Tennakoon, 2013; Zaharia et al., 2011] – in fact, a new one for every mapping tool published –, yet they all base their computation on general statistical assumptions which overlook the fact that for each reference, sequence biases can debunk these expectations and lead to inaccurate estimations of the MAPQ score.
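To make the Bayesian formulation above concrete, the following sketch computes a phred-scaled mapping quality directly from the likelihoods of all candidate alignments of a read, assuming a uniform prior, independent sequencing errors, and a complete candidate list; it illustrates the exact formula rather than the MAQ approximation, and the cap at 60 is merely a common convention assumed here.

import math

def alignment_likelihood(mismatch_base_qualities):
    """p(P | T, q): product of the error probabilities of the mismatched
    bases, each given as a phred-scaled base quality."""
    p = 1.0
    for q in mismatch_base_qualities:
        p *= 10 ** (-q / 10)
    return p

def mapq(candidate_mismatch_qualities, reported=0):
    """Phred-scaled probability that the reported candidate is wrong, under a
    uniform prior over the candidate positions and assuming the candidate
    list is complete (no relevant match missed)."""
    likelihoods = [alignment_likelihood(qs) for qs in candidate_mismatch_qualities]
    posterior = likelihoods[reported] / sum(likelihoods)
    p_wrong = max(1.0 - posterior, 1e-30)          # avoid log(0) for unique hits
    return min(60, int(round(-10 * math.log10(p_wrong))))   # cap at 60 (assumed)

# Best candidate has two mismatches at base qualities 20 and 10 (p = 0.001);
# the only alternative has three high-quality mismatches, so confidence is high.
print(mapq([[20, 10], [30, 30, 30]]))              # prints 60 (capped)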

ROC Curves

A receiver operating characteristic (ROC) curve is a graphical plot which depicts the performance of any binary classifier. In this context, it is quite informative to break down all alignments by MAPQ score so as to evaluate the accuracy of a mapper according to the quality scores it reports. Without loss of generality, in our analysis we pursue the goal of ROC curves by plotting the ACC (Definition 3.3.3) against the FDR (Definition 3.3.2) for the different MAPQ scores output by the mapper. By plotting the ROC curves produced by different mappers we can assess how well these tools compute MAPQ scores and make accuracy comparisons among them.

Additionally, ROC curves are often used to establish empirical thresholds towards limiting the number of erroneous mappings passed on to the downstream analysis. It is important to note that these thresholds might change between mapping tools – depending on how conservative the computation of MAPQ scores is. As a result, some well-known downstream analysis tools (e.g. GATK [DePristo et al., 2011; McKenna et al., 2010]) use fixed MAPQ thresholds so as to filter out unwanted mappings. Nevertheless, these fixed thresholds seem tailored to specific mapping tools and cannot be directly generalised to other mappers.
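As an illustration of how such a curve can be tabulated from simulated data, the following sketch computes one (FDR, ACC) point per MAPQ threshold from a list of (MAPQ, correct?) pairs; the data layout and names are assumptions.

def roc_points(alignments, total_reads):
    """One (threshold, FDR, ACC) point per MAPQ threshold.

    `alignments` is a list of (mapq, is_correct) pairs for the mapped reads;
    `total_reads` also includes the unmapped ones, so ACC is over all reads.
    """
    points = []
    for threshold in sorted({mapq for mapq, _ in alignments}, reverse=True):
        kept = [ok for mapq, ok in alignments if mapq >= threshold]
        hits = sum(kept)
        fails = len(kept) - hits
        fdr = fails / len(kept) if kept else 0.0
        acc = hits / total_reads
        points.append((threshold, fdr, acc))
    return points

# Toy example: 6 mapped reads (out of 8 simulated) with their MAPQ and
# whether they hit the simulated origin.
alignments = [(60, True), (60, True), (40, True), (40, False), (20, True), (0, False)]
for threshold, fdr, acc in roc_points(alignments, total_reads=8):
    print(f"MAPQ>={threshold}: FDR={fdr:.2f} ACC={acc:.2f}")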

IV Rabema. Read Alignment Benchmark

To the best of our knowledge, Rabema (Read Alignment BEnchMArk) is the first published attempt to formally define a framework for benchmarking mapper accuracy; for this, their contribution well deserves a presentation in the context of this thesis. In their paper, the authors realise that evaluating only in terms of the total number of mapped reads or of unique reads found can be deceiving. For instance, a very sensitive mapper could get very high rates of reads mapped without necessarily reporting the ”true-matches”, while a non-complete mapper could easily miss conflictive matches and report many uniquely mapped reads. In order to disambiguate conflictive cases, they propose a formal definition of a match towards merging similar alignments into equivalence classes. Then, similar to the paradigms presented in section 3.1.1, they define three sequence alignment paradigms to perform their comparisons. Following their description:

• Search for all feasible matches (”all”): that is, searching the whole matches space up to a given error degree.
• Search for all best matches (”all-best”): that is, searching all strata until – and including – the first non-empty (i.e. first matching) stratum.
• Search for any best match (”any-best”): that is, searching for any match in the first matching stratum.

Classifying mapping tools among these paradigms, they compared the mappings reported by each one against a previously computed gold standard. The gold standard is a reference containing all the possible matches up to a certain error rate (i.e. the matches space M_e^{T,P}). The method breaks down how many matches are found in each stratum and, by comparison with the gold standard, how many are missed. The authors also provide a software package implementing the method so as to evaluate the results of different mappers.

This method represents a well-defined approach to comparing mappings in an organised fashion, classified by strata. Furthermore, it introduces the idea of comparing mapping results as sets of mappings. Applying set operations between the results of different mappers and comparing them against a gold standard can highlight potential flaws of these tools. What is more, stratum-wise comparisons enable a focus on the matches at a specific distance, leading to more valuable insight into the performance of mappers.

However, for the purpose of this thesis, we differ in the use of some methods and definitions. Firstly, we reject the idea of the ”any-best” paradigm, as it does not really represent a useful approach. Searching for any match in the first matching stratum at random does not account for any kind of multimap conflict; hence, this paradigm is prone to generate misleading results and should be avoided. Secondly, their match equivalence class can be too relaxed in some cases and merge close-by matches from tandem repeat regions. Alternatively, we believe that multimaps to pathological repetitive regions simply reflect the ambiguous nature of the problem – in turn, limited by the resolution of current sequencing technologies. For that reason, it seems more informative to report these cases ”as they are”, instead of merging them. Finally, and depending on the application targeted, forcing complete searches of strata beyond the point of interest is bound to lead to unnecessary computations. As we explain in the following sections, a deeper analysis of the information given by each stratum can lead to a better assessment of the requirements of the search. Intuitively, missing matches of very distant strata is irrelevant if the conflicts raised by less distant strata are deemed unsolvable. For that reason, comparing against the whole matches space M_e^{T,P} can sometimes be computationally expensive and unnecessary.

3.3.2 Conflict resolution

I Matches equivalence

In the context of sequence alignment, we do not usually wish to retrieve all possible matches as formulated above. There are many situations in which many matches at different distances over the same region create unnecessary conflicts. Furthermore, it is often the case that for a given match mp at position p with distance e we can find many other valid matches at nearby positions (e.g. just a nucleotide before or after, by adding a single gap). These cases are obviously uninformative and irrelevant; thus, they should be discarded. For this reason, for each position p of the reference text T we define its canonical match.
Definition 3.3.5 (Canonical match). Given T, P, and d(), from all possible matches ending at position p, the canonical match mp = Ti..p is the one with the least distance and the largest span (i.e. max{|Ti..p|}). Then, among the |T| canonical matches to the text we define an equivalence class.

Definition 3.3.6 (Equivalent canonical matches). Two canonical matches m_p = T_{i..p} and m_q = T_{j..q} are equivalent if they start at the same position (i.e. i = j).

Remark. From a given equivalence class of matches, we define its representative as the one with largest span.

In this way, from now on we will only be interested in finding those matches representative of their class (i.e. the canonical match with largest span). As a result, we will remove trivial conflicts and irrelevant repetitions. However, in the case of overlapping matches, one might argue that both should be considered the same. For example, this is the case of a sequence read that can match a tandem repeat every few bases. Many approaches – like Rabema – argue that computing all possible matches in these regions could take a lot of time and memory, while acknowledging them could create biases towards long repeated regions (which would weigh more than shorter repeats). Nevertheless, these cases constitute real conflicts, as there is no evidence that can help to correctly place the sequence read at any of the feasible matches. In any case, the goal of the mapper is to report the inherent ambiguity of the read and pass this information on to the downstream analysis. A multimap-aware variant caller could process this information and weigh each plausible match over the total number of possible matches (or just discard it). Here it is important to note that the limiting factor is not the mapper accuracy but the sequencing technology. Had the machine produced a long enough read, these multimaps could possibly be merged into the same equivalence class – resolving the conflict.

Despite its simplicity, the computation of equivalence between canonical matches in local regions of the reference at the time of mapping adds an extra computational difficulty. In practical terms, the most common approach is to pick the least distant match in case two canonical matches overlap. As a matter of fact, experiments show that at lengths of 100 nt less than 5 reads in a million are sensitive to this heuristic (i.e. failure to report the conflict).

II Stratified conflict resolution

As shown before, the concept of stratified matches not only helps to understand approximate string matching searches but also helps to understand their results. For the all-matches paradigm, where all matches up to a certain error level are requested, the problem is reduced purely to searching the matches space completely – without added complexity. It is well understood that completeness is paramount to achieving quality results.

However, in the endeavour of retrieving the "true-match" we are often presented with several feasible mapping locations to choose from. For a long time, the most commonly used criterion to solve this ambiguity was based on maximum identity under a certain distance metric (presumed to accurately model biological events). In other words, out of a set of feasible mapping locations, the one with the least distance is assumed to be the most likely origin of the sequenced read. Progressively, this has led to the common misconception that the least distant mapping is always the correct one. Ultimately, this has become an implicitly accepted criterion for some mapping tools that just look for the least distant matches. However, this is a very simplistic assumption and in many cases can lead to erroneous results.

To better present the problem, Figure 3.6 illustrates very common cases of multimap resolution. In the first case – Figure 3.6 (a) – we can see that two matches are feasible; both at the same distance and with the same alignment (i.e. a C → T mismatch). Without more information there is no evidence that one of the two is the "true-match". Consequently we call it a tie and we report this ambiguity accordingly (i.e. MAPQ=0). Furthermore, this case is what we commonly know as a perfect tie; one where all tied matches have the same alignment.

Figure 3.6: Stratified conflict resolution

The second case – Figure 3.6 (b) – pictures a different scenario where two alignments are possible, with distances 1 and 3 respectively (i.e. 0:1:0:0:1). Despite this, there is still a certain probability that the most divergent one is the "true-match", although reason suggests that the least distant one has greater chances of being the true positive. This intuition gets blurred when we add another match to the conflict with distance 2 – Figure 3.6 (c). In this case, we find three possible matches in contiguous strata (i.e. 0:1:1:1). Here, the assessment of the match with one mismatch becomes less straightforward. In turn, this case cannot be classified as unambiguous and a low MAPQ should be reported.

Once again, it is critical to realise that these assessments can only be made under the condition that the strata involved are complete. Otherwise there could be potential hidden matches that change the scenario and therefore the outcome. Interestingly, this applies differently to each case. For example, in the case of a perfect tie – Figure 3.6 (a) –, once the two matches that determine the tie are found, no more strata – or extra matches – are needed. Even if a mapper keeps searching deeper into more distant strata, the final outcome would remain the same; a perfect tie. That is, once an unsolvable conflict is found, there is no need for further search exploration as more distant mappings are neither going to solve the conflict nor add any meaningful information. Conversely, not only is it important to retrieve the matches themselves, but also to explore enough of the search space so as to assess the absence of conflicts. The goal is to report a measure of confidence guaranteeing that no other match – which could cause a conflict – can be found until a certain stratum.

III δ-strata groups

All in all, our focus is to evaluate whether a unique match found in the least distant stratum is certainly unique and free of conflict (i.e. a "true-match"). In this way, attending to the relation between the two least distant matches, we can define a classification of match spaces.

Definition 3.3.7 (δ-strata group). The delta group δ_d of degree d is the set of all match spaces M_e^{T,P} = {m_0, ..., m_n} whose two least distant matches m_i and m_j lie d strata apart (i.e. |i − j| = d).

Remark. Note that, with enough error tolerance, any sequenced read would eventually map to any position in the reference genome. Without loss of generality, for searches up to d errors, some match spaces can contain only one match (i.e. unique reads). We classify these cases as δ∞.

Intuitively, these δ-strata group classifications capture the different cases of stratified conflicts depicted before. In essence, each group represents a conflict category where there is a match susceptible to being considered the "true-match" – by means of the maximum identity criterion –, and another at distance d from it. Depending on the relative distance between these two matches, the likelihood of the first being the "true-match" is expected to be higher. Note that δ-strata groups are not classified according to the error of the matches, but based on the relative distance between matching strata. Furthermore, they are agnostic to the distance metric chosen, which makes them of practical use for many widely used distances. Also note that δ-strata groups are not defined towards estimating the probability of finding a match at a given distance from the least distant match. The ultimate goal is to assess the likelihood of the least distant match being the source region of the sequenced read (which is a problem of biological nature).
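To make the classification concrete, the following minimal C sketch (an illustration, not the GEM-mapper implementation) assigns a match space to its δ-strata group given the complete strata counts of a read; the names and the DELTA_INF sentinel are assumptions for the example.

#include <stdio.h>

#define DELTA_INF -1  /* sentinel for delta-infinity (unique match spaces) */

/*
 * Classify a match space into its delta-strata group.
 * strata[i] holds the number of matches found at distance i (0..max_error),
 * assuming the search up to max_error is complete.
 * Returns d = distance between the two least distant matches,
 * or DELTA_INF if at most one match was found.
 */
int delta_strata_group(const unsigned *strata, unsigned max_error) {
  int first = -1, second = -1;
  for (unsigned e = 0; e <= max_error; ++e) {
    if (strata[e] == 0) continue;
    if (first < 0) {
      first = (int)e;
      if (strata[e] > 1) return 0;      /* two matches in the same stratum: delta-0 (tie) */
    } else {
      second = (int)e;
      break;
    }
  }
  if (first < 0 || second < 0) return DELTA_INF;  /* zero or one match: unique */
  return second - first;
}

int main(void) {
  const unsigned strata[] = {0, 1, 1, 1};  /* e.g. 0:1:1:1 as in Figure 3.6 (c) */
  printf("delta = %d\n", delta_strata_group(strata, 3));  /* prints delta = 1 */
  return 0;
}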

The correlation between δ-strata groups and "true-match" prediction power is empirically depicted in table 3.2. Here, three Illumina-like simulated datasets – all of them having 5x coverage, at different read lengths – have been mapped against the human genome reference (GRCh37). The mapping search was performed completely up to a 5% error rate. Afterwards, all reads were classified according to their δ-strata group and the FDR was computed using the information from the simulator.

As table 3.2 shows, there is a strong correlation between the δ-strata groups and the FDR. In fact, the higher the δ-strata group, the smaller the FDR. This relation holds for all read lengths. As the read length increases, so does the number of uniquely mapped reads and, in turn, the total number of reads belonging to each of the remaining δ-strata groups decreases.

Note that the marginal increase in the FDR for the unique class is due to the limited scope of the search, fixed at a 5% error rate. In other words, for some sequenced reads the search lacks enough depth and a relatively small number of reads are misclassified as unique. This exemplifies the very usefulness of δ-strata groups. Being able to accurately classify sequenced reads among these groups leads to an estimated FDR when searching for the "true-match". In turn, heuristic searches can be misleading by incorrectly classifying reads as unique when they are not. Furthermore, knowing precisely the δ-strata group of a read enables downstream analysis to accurately discriminate reads. This allows regulating the maximum FDR that downstream tools will introduce in their operation.

                      75 nt             100 nt            150 nt
                      Total     FDR     Total     FDR     Total     FDR
δ0 (perfect tie)      1.72%   44873     1.47%   34032     1.25%   25505
δ0 (tie)              0.03%      88     0.02%      40     0.01%      14
δ1                    2.85%     915     1.84%     574     1.27%     307
δ2                    2.61%      66     1.60%      32     0.96%      12
δ3                    2.28%      34     1.56%      15     0.82%       5
δ4                    1.77%      13     1.49%       4     0.75%       1
δ5                    1.60%       3     1.33%       1     0.71%       0
δ>5                   4.48%       2     6.50%       1     7.91%       0
δ∞ (unique)          82.65%      46    84.19%       9    86.33%       1

Table 3.2: δ-strata groups of 5x coverage Illumina-like simulated reads (FDR is given per million reads).

δ-strata groups not only allow the estimation of mapping quality, but are also used to guide the search algorithm. Once again, it is not enough to search up to the least distant match (i.e. best-match); the search has to progress further so as to guarantee accurate membership to the highest feasible δ-strata group. Similarly, fixed error rate search algorithms are also not suitable to determine the δ-strata group accurately. The key idea towards uniqueness is the relative error between the least distant match and the next. For that reason, an accurate mapping algorithm is expected to (1) find the least distant match and, if the match is unique, (2) search further on until the next match is found or no other match is found up to a given delta (selected so as to satisfy a target FDR). Interestingly, as the read increases in length, the FDR of each δ-strata group decreases. In a counterintuitive manner, this result suggests that forcing mapping tools to search up to a fixed delta d irrespective of the read length is not only unaffordable, but also potentially unnecessary. As the experimental results show, once the least distant match is found, searching up to d = 2 (i.e. two errors more) at lengths of 150 nt has the same expected FDR as searching up to d = 4 at lengths of 75 nt. This behaviour reflects the fact that longer reads are naturally more specific as they encode more information w.r.t. the same reference genome size.

Another relevant observation – quite useful in following sections – is that most of the δ0 members are exact duplicates (i.e. perfect ties), while only quite a small number of repetitive reads match with the same error at different loci with different error operations. This reflects the highly biased nature of genome references and inspires ways to exploit these repetitions in the reference.

δ-strata groups towards benchmarking mappers

Additionally, δ-strata groups are not only useful to assess accuracy and guide search alignment algorithms, but can also be employed to benchmark mapping tools. As opposed to Rabema, we are not overly concerned if a mapper misses a match at a distant stratum for a hypothetical repetitive read. As a matter of fact, benchmarking frameworks based on comparing against the set of all possible matches up to a given distance can be biased towards repetitive reads. For instance, the contribution of a very repetitive read to the bulk of matches is unbalanced with respect to that of a uniquely mapping read. More importantly, missing the right match (e.g. the one that can convert a δ∞ read into a δ1) is much more critical than missing the n-th match of a tandem repeat.

For that reason, comparing the abundance of δ-strata groups between mapping results seems a more balanced and suitable framework on which to base comparisons (when searching for the "true-match"). Ideally, we would compare mapping results from different mappers against a ground truth given by a simulator. In this way, we would compute the abundance of sequenced reads in each δd for the results of the simulator and for the mappings of every mapper in the comparison. Fluctuations in the abundance of sequenced reads for each δd – compared to the ground truth – reveal cases where the mapper at hand lacks accuracy and reports misclassified reads. Note that incorrect results show up as deviations in abundance towards higher δd. It is always easier to assess that a read is δ∞ (i.e. unique) than to search for a more distant match that can prove membership to a lower δi.

As opposed to comparing MAPQ scores, where a given mapper might output a low MAPQ without further indication of the reason for the low score (i.e. just reporting a match and MAPQ=0), there is no room for speculation when comparing δ-strata groups. To assess membership to a specific δ-strata group, a mapper has to prove it by giving at least two matches (i.e. the least distant one and the next one). Even lacking a ground truth to compare against, the abundance in each δi can be compared across mapping results. In this way, even if the true abundance of a given δi is not known, differences between mappers reveal cases where the mapper reporting the higher abundance has mistakenly reported matches at incorrect δ-strata groups. That is, it is missing relevant matches towards potential mapping conflicts.

IV Local alignments

Before finishing this section, it is almost mandatory to make a short comment on evaluating the accuracy of local alignments. Note that almost all metrics proposed so far involve end-to-end alignments. Yet, local alignments are often the fall-back strategy of many mappers when they cannot retrieve a "good" global alignment. Because of that, it is also very important to be able to evaluate the accuracy of local alignments, compare them, and resolve the potential conflicts that can arise.

The first thing to highlight regarding local alignments is that, in a way, we waive an explanation of the true origin of the whole sequence read. Looking for these types of alignments, we give up explaining why a regular end-to-end alignment cannot be found and what the series of biological events – or sequencing artifacts – behind the read are. Even if a "good" local alignment is found, there is always room to question the nature of the unaligned part of the read.

In fact, there are many benign reasons why a sequenced read has to be locally aligned to a reference genome; for example, remaining adaptors in the sequence, abnormal sequencing errors, cross contamination, etc. Leaving part of the sequence unaligned in these cases has no impact whatsoever on the analysis. Nevertheless, there are other cases – like big variations (i.e. large insertions or deletions, inversions, etc.), chimeras, and other genomic rearrangements – where a "good" local alignment can potentially miss the important nature of the genomic region sequenced. For that reason, the experiment specifics must be taken into account and local alignments should be biologically evaluated with these considerations at hand.

Principally, there are two broadly accepted criteria used to evaluate and compare local alignments. The most widely accepted is based purely on scoring the alignment using a weighted scoring scheme during pair-wise alignment (between the sequenced read and the matching region). However, certain situations may be very sensitive to the actual scores used to discriminate between local alignments. In these cases, a maximum span criterion is usually applied to favour the match with the largest span over the reference – which intuitively seems the one explaining most of the sequenced read. In some of those cases, the scoring criterion is relegated to a minimum threshold that every local alignment has to fulfil in order to be considered. The purpose of these criteria is not only to enable local alignment comparisons, but also to regulate the number of possible local alignments deemed relevant enough to be reported. Note, too, that overly flexible constraints could virtually output any locus of the reference genome as locally matching the sequenced read.

In any case, resorting to local alignments should not mean that anything goes. Whenever local alignments are found – in most cases as the byproduct of a seed-and-extend algorithm –, it is crucial to assess their relevance not only in terms of alignment score but also with respect to their uniqueness. Any given local alignment focuses on a substring of the input sequenced read (i.e. the local pattern). Here, it is mandatory to evaluate whether the local pattern can potentially match any other region of the reference. In this way, any specific local alignment can be evaluated with the tools previously presented (e.g. δ-strata groups). Consequently, a local match can have a very high alignment score but fail to be unique; for example, being classified within δ0 (i.e. there is at least another match for the local pattern at the same error level). In this case, the local pattern is not pointing out an unequivocal region of the genome but leading to an ambiguity. For that reason, that local alignment is no better than another one for the same read with a lower alignment score but a higher δi.

3.3.3 Genome mappability

In actual fact, the content of the genome itself is the real limiting factor in conflict resolution. The structure of the reference text determines which sequences can be aligned to a single location – the uniquely mapping reads – and which reads have more than one plausible locus of origin – the multiply mapping reads. In turn, this determines the final outcome of the mapper and the downstream analysis.

A recurrent concept in the context of sequence mapping – and genome mappability – is the proper length of the reference. Roughly speaking, it is the minimum sequencing length at which one would expect to find unique matches were the genome random (in general, the number of unique matches tends to increase together with the length of the sequencing reads, because the longer the sequence the higher the number of mutations it tends to accumulate).

Definition 3.3.8 (Proper length). Given an alphabet Σ, the proper length lp of a string w is the minimum length for which any substring of w is expected to occur in w about once, i.e.

l_p = log_4(|Genome|)

Remark. For example, the proper length for the human genome is 17 (since 4^16 < 6 Gb < 4^17), while that of D. melanogaster is 15 and that of E. coli is only 13.

One would expect that, at sequencing lengths longer than the proper length, almost all sequenced reads would turn out to be unique. However, this is often not the case. One must be aware that genomes are far from being random. Genomes are the outcome of a long evolutionary history including frequent duplications at the level of the whole genome and of specific regions [Ohno, 1993]. As a result, the genome resembles a fractal-like structure with various different types of repetition [Cordaux and Batzer, 2009; Kazazian, 2004] (e.g. long and short interspersed repeats, segmental duplications [Bailey and Eichler, 2006], copy number variations [Freeman et al., 2006], paralogous gene families, pseudogenes, and modular domains appearing within the sequence of functionally diverse genes [Ohno, 1987]). On top of that, HTS technologies eventually introduce errors that must be accounted for and that, in turn, will affect mappability computations, lowering the number of unique k-mers.
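As a quick worked example, the proper length can be obtained as the base-4 logarithm of the genome size rounded up; a minimal C sketch, assuming the double-stranded human genome size of roughly 6 Gnt:

#include <math.h>
#include <stdio.h>

/* Proper length l_p = ceil(log4(|Genome|)); compile with -lm. */
static unsigned proper_length(double genome_length) {
  return (unsigned)ceil(log(genome_length) / log(4.0));
}

int main(void) {
  /* ~6 Gnt for the double-stranded human genome: 4^16 < 6e9 < 4^17 */
  printf("proper length (H. sapiens) = %u\n", proper_length(6.0e9));  /* prints 17 */
  return 0;
}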

Therefore, genome mappability strongly depends on the genome itself and on the characteristics of the sequenced reads (i.e. average read length and expected mapping error). Once these parameters are known, it is possible to compute, a priori, the mappability of the whole reference text (i.e. the inverse of the number of times a read originating from any position in the reference genome maps to the reference itself).

Definition 3.3.9 (Genome mappability). For a given read length l, maximum error e, and position p of a given genome G, let the genome mappability M_l^e(p) be defined as the inverse of the total number of matches of P = G_{p..p+l−1} at distance e or less against the genome (i.e. the cardinality of the matches space up to e):

M_l^e(p) = 1 / ( Σ_{i=0}^{e} |M_i^{G,P}| ),   for P = G_{p..p+l−1}

Remark. Note that M_l^e(p) ∈ (0, 1]. Also, if the region P of the genome is unique, it holds that M_l^e(p) = 1. Otherwise, M_l^e(p) < 1 and the region is repetitive.

This metric allows the identification of truly "mappable" regions (i.e. those regions producing reads which map back unambiguously). In this way, highly mappable regions lead to unique mappings and, conversely, less mappable regions tend to produce ambiguous mappings. Similarly, M_l^e(p) can be understood as the probability that a read sequenced from position p of the genome can be correctly mapped back to its true locus. This mappability information can be used not only to help mapping tools guide their algorithms and quickly discard low mappability regions, but also to fine-tune HTS experiments to increase the number of uniquely mappable reads.
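As an illustration of the definition, a minimal C sketch that computes M_l^e(p) from the cardinality of each matching stratum, together with the average mappability of a region (the array layout and names are assumptions for the example, not a particular tool's interface):

#include <stdio.h>

/*
 * Genome mappability M_l^e(p) from the cardinality of each matching stratum of
 * the read P = G[p..p+l-1], assuming the search up to e was complete.
 * strata[i] = |M_i^{G,P}|, for i = 0..e.
 */
double mappability(const unsigned *strata, unsigned e) {
  unsigned long total = 0;
  for (unsigned i = 0; i <= e; ++i) total += strata[i];
  return total ? 1.0 / (double)total : 0.0;  /* by construction total >= 1 (P maps back to p) */
}

/* Average mappability of the region G[i..j] (e.g. for SNP-calling pileups). */
double region_mappability(const double *M, unsigned long i, unsigned long j) {
  double sum = 0.0;
  for (unsigned long p = i; p <= j; ++p) sum += M[p];
  return sum / (double)(j - i + 1);
}

int main(void) {
  const unsigned strata[] = {1, 0, 3};          /* unique exact match, plus 3 more at distance 2 */
  printf("M = %.3f\n", mappability(strata, 2)); /* prints M = 0.250 */
  return 0;
}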

In [Derrien et al., 2012] we computed the mappability of several eukaryotic genomes (for different read lengths, allowing up to two substitutions). As shown in table 3.3, a large fraction of eukaryotic genomes is not uniquely mappable. That is, even exact matches to such loci cannot be unambiguously assigned to their originating positions – which has obvious and important implications for quantitative estimates.

                             H.sapiens   M.musculus   D.melanogaster   C.elegans
Proportion of repeats (%)       45.25%       42.33%           26.50%      13.08%
Unique at M_36^0 (%)            80.12%       79.92%           71.07%      92.07%
Unique at M_50^0 (%)            84.56%       83.18%           72.14%      93.51%
Unique at M_75^0 (%)            87.84%       86.20%           73.54%      94.96%
Unique at M_36^2 (%)            69.99%       72.07%           68.09%      87.14%
Unique at M_50^2 (%)            76.59%       77.06%           69.44%      89.80%
Unique at M_75^2 (%)            83.09%       81.65%           71.00%      92.11%

Table 3.3: Mappability of eukaryotic genomes

Interestingly, according to table 3.3, approximately 30% of the human genome is not unique. Even at lengths of 75 nt, nearly 17% of the genome presents repetitions. As shown across the different mappability profiles, mappability correlates with sequence length, genome size and number of errors. That is, the longer the reads and the smaller the number of mismatches, the higher the uniqueness of the sequenced reads. Likewise, for the same read length, the smaller the genome, the higher the uniqueness of the sequenced reads.

Furthermore, in [Derrien et al., 2012] it was shown that mappability profiles vary greatly with the transcript class (protein-coding genes, non-coding RNAs, orthologous families, etc.). Consequently, an improved measure of RNA quantification is proposed, taking into account the mappability at the level of the single locus. In this way, using the mappability can lead to a better experiment design when the focus is on a particular element in the genome.

Additionally, and mostly in relation to SNP calling, we can consider a pileup of reads over a given position and compute the associated mappability. Here, the mappability of the individual position can be calculated as the average contribution of all reads spanning that position.

P_l^e(G_{i..j}) = (1 / (j − i + 1)) × Σ_{p=i..j} M_l^e(p)

Mappability scores are computed using the GEM-mapper [Marco-Sola et al., 2012]. Note that for these computations a complete mapper is required so as to perform full searches without missing valid matches. At the same time, the computation of the mappability of a complete genome can be a challenging task for a genome mapper. In this paper, we propose several optimisations to avoid unnecessary computations related to highly repetitive regions of the genome. Finally, the mappability scores produced by GEM can be easily converted to other formats, like those suitable for display in the UCSC genome browser (for which GEM-mappability tracks are currently integrated).

Chapter 4

Approximate string matching I. Index Structures

4.1 Classic index structures  64
4.2 Succinct full-text index structures  67
    4.2.1 The Burrows Wheeler Transform  67
    4.2.2 FM-Index  69
4.3 Practical FM-Index design  72
    4.3.1 FM-Index design considerations  72
    4.3.2 FM-Index layout  76
    4.3.3 FM-Index operators  82
    4.3.4 FM-Index lookup table  86
    4.3.5 FM-Index sampled-SA  89
    4.3.6 Double-stranded bidirectional FM-Indexes  91
4.4 Specialised indexes  93
    4.4.1 Bisulfite index  93
    4.4.2 Homopolymer compacted indexes  93

Despite the advances in online searching during the last decades, sequential text searching is no longer a feasible approach for many applications. As the reference text becomes larger and the number of input patterns increases exponentially, many approximate string matching algorithms have begun to exploit the advantages of preprocessing the text into an index structure. In some cases, even the input patterns are indexed to exploit the repetitiveness among them. In this way, large references are compiled into indexes that provide fast access to specific regions of the text at the cost of some additional space. As a result, indexed search algorithms do not need to explore the whole text to retrieve the matches of a pattern – searches can be restricted to isolated regions of the reference.

The first indexes were designed for structured text (typically natural language) where the concept of a "word" was clearly defined (word-level indexes). For instance, the popular and widely-used inverted list structure [Witten et al., 1999] records all the occurrences of every distinct word in the text, thus enabling fast access to any specific word. However, applications like molecular biology may require unbounded string searches, as there is no such thing as a word in biological sequences. To solve this problem one has to move to full-text indexes, which allow one to search for arbitrary substrings in the reference text. In addition, some of these indexes can reproduce the

reference text – or any part of it – without accessing the original text. The main purpose of such so-called self-indexes is to reduce the memory footprint of searching algorithms. Full-text self-indexes have received a lot of attention in the past few years because of the latest developments in HTS technologies. Essentially, these data structures were exploited to address the challenge of indexing genome-scale references on computers with modest amounts of memory (for instance, one might wish to index the 3 Gbases of the human genome using less than 4GB of RAM). Full-text indexes can even be coupled with compression algorithms in order to generate compressed full-text self-indexes, thus reducing memory requirements even further [Navarro and Mäkinen, 2007; Grossi et al., 2003; Ferragina et al., 2009; Sirén et al., 2008].

Although space consumption has always been a concern for index design, we must acknowledge that any general-use computer is currently equipped with large memory modules. During the last decades, memory production has adhered to Moore's law, integrating exponentially more memory each year [Schaller, 1997; Mack, 2011]. For the time being, any regular laptop offers as much as 8GB of RAM, while high-end computing nodes can easily integrate 64GB and more. Yet, improvements in memory integration have not been reflected in the overall efficiency of the memory hierarchy. So, despite having access to large memories, latency penalties are still a major concern as they usually become the bottleneck of applications accessing large regions of memory. In fact, the memory wall [Wulf and McKee, 1995; McKee, 2004] represents the main reason for designing index structures with a reduced memory footprint. In general, smaller indexes translate into faster running times as they help to mitigate memory penalties. Having said that, there is a subtlety worth mentioning: what really matters towards reducing the impact of the memory wall is not the size of the index itself, but the memory access pattern of the algorithm using it. That is, the size of the index may not be a concern provided that the accesses to it depict locality in time and space. For that reason, not only is it important that the overall size of the index remains small, but also that the accesses exploit locality and depict architecturally friendly patterns.

Modern index designs are tightly coupled with architectural considerations in order to provide fast access times while retaining a low memory footprint. And due to their myriad benefits, these index structures have been widely incorporated into many mapping tools. Note that the process of building some index structures can take a considerable amount of time and resources. However, this cost is amortised as they are usually built only once and used many times. Moreover, most commonly used genomic references are well established and unlikely to change in months or even years.

In this chapter, we will mainly focus on the full-text self-index FM-Index. Firstly, we introduce the concepts behind the classic index structures, the succinct indexes and the basics of the FM-Index. Then, we present the main practical considerations towards FM-Index design and implementation. Finally, we present some specialised index structures that enable more complex operations in the context of biological sequence mapping.

4.1 Classic index structures

Hash-based indexes

Possibly the most straightforward indexing idea is that of recording every k-mer in the original text. In this way, a hash-like table allows direct access to each k-mer and all the positions in the text where it occurs. Despite the simplicity of this idea, hashes have been the focus of researchers for many years and the central index structure of many important mapping tools [Homer et al., 2009; Hercus, 2012; Zaharia et al., 2011; Lee et al., 2014] – some of them still widely used.

Depending on the design of the hash table, the length of the k-mer, and whether it stores all k-mer occurrences or not, the hash can take large amounts of memory. In its most simplistic form, a hash can take up to |Σ|^k entries to address all possible k-mers, plus the pointers to each k-mer position recorded. Hence, the space consumption of these hash tables makes their use quite limiting.
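For illustration, a minimal C sketch of the direct-addressing idea over the 2-bit DNA alphabet: the key of a k-mer indexes a table of |Σ|^k = 4^k entries, each holding the list of positions where that k-mer occurs (the encoding and names are assumptions for the example, not a particular tool's layout).

#include <stdint.h>
#include <stdio.h>

/* 2-bit encoding of strict DNA (A,C,G,T) */
static inline uint64_t encode_base(char c) {
  switch (c) {
    case 'A': return 0; case 'C': return 1;
    case 'G': return 2; default:  return 3;  /* 'T' */
  }
}

/* Key of the k-mer starting at text[i]; a direct-addressing table would hold
 * 4^k entries, each pointing to the positions where that k-mer occurs. */
uint64_t kmer_key(const char *text, uint64_t i, unsigned k) {
  uint64_t key = 0;
  for (unsigned j = 0; j < k; ++j) key = (key << 2) | encode_base(text[i + j]);
  return key;
}

int main(void) {
  printf("%llu\n", (unsigned long long)kmer_key("GATTACA", 0, 4)); /* key of "GATT" */
  return 0;
}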

Suffix tree

The suffix tree is one of the most fundamental theoretical data structures in information retrieval and computer science in general. The suffix tree [Weiner, 1973; McCreight, 1976; Gusfield, 1997] of a string T of length n is a trie [Mehta and Sahni, 2004] built from all the suffixes of T where unitary paths are compressed. It occupies O(n) space.

Definition 4.1.1 (Suffix Tree). A suffix tree for an n-character text T is a rooted, directed tree with the following properties.

• The tree has exactly n leaves numbered 1 to n.
• Each internal node, other than the root, has at least two children.
• Each edge is labelled with a nonempty substring of T.
• No two edges out of a node can have edge-labels beginning with the same character.
• For any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of T that starts at position i (i.e. T[i..n]).

In this way, the suffix tree contains n leaves – each corresponding to a suffix of the text. Moreover, each internal node corresponds to a unique substring of T that occurs more than once in the text. The main advantage of the suffix tree is that it can count the number of occurrences of a pattern P in the text in O(m) time (irrespective of the length of the text and the number of occurrences of the pattern). One searches the pattern P in the suffix tree by descending from the root of the tree according to the successive symbols of P . Once the node corresponding to the pattern has been located, all the occurrences can be listed in optimal time O(occ(P,T )) by traversing the subtree of the node. Despite having optimal theoretical construction and query time, practical implementations of the suffix tree typically require from 12 to 20 times the text size [Kurtz, 1999]. Thus, the suffix tree is mostly limited to theoretical studies and rarely used in real applications.

Suffix array

The suffix array (SA) [Manber and Myers, 1993] is a practical index structure proposed with the intention of reducing the memory footprint of the suffix tree. In a way, the SA is a compact representation of the suffix tree. However it still requires Θ(n log(n)) bits, although in most implementations, assuming strings have a maximum length of 4 GB, the size is 4 times that of the reference text.

Definition 4.1.2 (Suffix Array). The suffix array SA_T[0..n − 1] of a text T is an array containing all the starting positions of the suffixes of T sorted in lexicographical order. That is:

T_{SA_T[0]..n−1} < T_{SA_T[1]..n−1} < ... < T_{SA_T[n−1]..n−1}

Theoretically, the SA can be obtained by traversing the leaves of the suffix tree in pre-order. In practice, the SA can be constructed independently of the suffix tree in O(n) time using one of the multiple algorithms proposed in the literature [Kärkkäinen and Sanders, 2003; Puglisi et al., 2007]. As an example, table 4.1 shows the SA of the text T = GATTACA$.

Text:      G A T T A C A $
Position:  0 1 2 3 4 5 6 7

Suffix Array   Sorted Suffixes
SA[0] = 7      $
SA[1] = 6      A$
SA[2] = 4      ACA$
SA[3] = 1      ATTACA$
SA[4] = 5      CA$
SA[5] = 0      GATTACA$
SA[6] = 3      TACA$
SA[7] = 2      TTACA$

Table 4.1: Suffix array of the text T=GATTACA$ ($ denotes the end of text, and is by definition lexicographically smaller than any other symbol belonging to the alphabet)

Note that any substring P of T is at the same time a prefix of a text suffix. In this way, any pattern P can be binary searched over the suffixes of SA in O(m log(n)) time as each step in the binary search requires comparing up to m symbols between the pattern P and the corresponding text suffix. The search time can be reduced to O(m + log(n)) by using an auxiliary structure described in [Manber and Myers, 1993] or even to O(m + log(|Σ|)) by using the so-called suffix trays [Cole et al., 2006].

Once the interval SA[lo, hi) corresponding to the pattern P is known, the number of occurrences of P in T is occ = hi − lo and the occurrences are located at text positions SA_T[lo], ..., SA_T[hi − 1].
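A minimal C sketch of the O(m log n) binary search over the suffix array described above (assuming the SA has already been built; the half-open interval convention follows the text):

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Compare pattern P (length m) against the text suffix starting at position s. */
static int cmp_suffix(const char *T, uint64_t n, uint64_t s, const char *P, uint64_t m) {
  uint64_t len = (n - s < m) ? n - s : m;
  int c = memcmp(T + s, P, len);
  if (c != 0) return c;
  return (n - s < m) ? -1 : 0;  /* a suffix shorter than P sorts before it */
}

/* Return in [*lo, *hi) the SA interval of suffixes prefixed by P. */
void sa_search(const char *T, uint64_t n, const uint64_t *SA,
               const char *P, uint64_t m, uint64_t *lo, uint64_t *hi) {
  uint64_t l = 0, r = n;
  while (l < r) {                                   /* lower bound: first suffix >= P */
    const uint64_t mid = l + (r - l) / 2;
    if (cmp_suffix(T, n, SA[mid], P, m) < 0) l = mid + 1; else r = mid;
  }
  *lo = l; r = n;
  while (l < r) {                                   /* upper bound: first suffix not prefixed by P */
    const uint64_t mid = l + (r - l) / 2;
    if (cmp_suffix(T, n, SA[mid], P, m) <= 0) l = mid + 1; else r = mid;
  }
  *hi = l;
}

int main(void) {
  const char T[] = "GATTACA$";                      /* Table 4.1 */
  const uint64_t SA[] = {7, 6, 4, 1, 5, 0, 3, 2};
  uint64_t lo, hi;
  sa_search(T, 8, SA, "TA", 2, &lo, &hi);
  printf("occ = %llu\n", (unsigned long long)(hi - lo));  /* prints occ = 1 */
  return 0;
}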

4.2 Succinct full-text index structures

Research in succinct data structures is motivated by the need for data structures which consume less space when handling massive amounts of data. In fact, succinct data structures aim to reduce the space requirements to as close to the theoretically lowest bound possible, while achieving fast string searching operations and a complexity comparable to that of a representation having no space constraints. In other words, succinct indexes aim to perform fast searches over massive amounts of data stored in little space.

4.2.1 The Burrows Wheeler Transform

According to reliable reconstructions, in 1983 David Wheeler discovered a cyclic text transformation which was left unpublished for many years. It was only in 1994 that he presented (with Michael Burrows) the paper entitled A Block-sorting Lossless Data Compression Algorithm. This paper introduces the BWT (Burrows-Wheeler Transform), which is exploited towards improving the efficiency of text compression algorithms. It is remarkable how an idea that would critically impact the future of compression algorithms and data structures was at first seen as impractical, and left in a drawer for so many years. The BWT (also called block-sorting compression) [Burrows and Wheeler, 1994] is a fundamental concept in the field of succinct and compressed full-text indexes. The BWT is a permutation of the original text which has useful properties when one has to implement not only compression, but also string searching. Contrary to popular belief, the BWT does not cluster together "similar" characters. In fact, what the BWT really does is arrange together characters followed by the same context. In most cases, this leads to runs of the same characters, which makes the BWT permutation susceptible to being easily compressed. For instance, using a locally adaptive zeroth-order compressor, the BWT of a text can be compressed down to the k-th order entropy of the text.

In order to obtain the BWT of a text (table 4.2 shows an example using the text T =GATTACA$), we must perform a series of steps as follows:

1. Let MT be a conceptual matrix whose rows are the cyclic shifts of T $ (the terminator character $ is added at the end).

2. Sort the rows of the matrix M lexicographically.

3. Let Tbwt be the BWT of the text defined as the last column of M.

Relation with the SA

It is easy to note that there is a close relationship between the matrix M_T and SA_T. When sorting the rows of the matrix M_T lexicographically, we are essentially sorting the suffixes of T. Note that SA_T[i] points to the suffix of T which prefixes the i-th row of M_T. Another way to construct Tbwt is to concatenate the characters that precede each suffix of T in the order listed by SA_T:

Tbwt[i] = T[n − 1]          if SA_T[i] = 0
Tbwt[i] = T[SA_T[i] − 1]    otherwise

Also note that, given the way the matrix M_T is built, all columns of M_T are permutations of the text T. So the first and last columns of M_T are indeed one a permutation of the other.
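A minimal C sketch of this construction, using the running example of Tables 4.1 and 4.2:

#include <stdint.h>
#include <stdio.h>

/* Build the BWT from the suffix array: Tbwt[i] = T[SA[i]-1], or T[n-1] when SA[i] == 0. */
void bwt_from_sa(const char *T, const uint64_t *SA, uint64_t n, char *Tbwt) {
  for (uint64_t i = 0; i < n; ++i)
    Tbwt[i] = SA[i] ? T[SA[i] - 1] : T[n - 1];
}

int main(void) {
  const char T[] = "GATTACA$";
  const uint64_t SA[] = {7, 6, 4, 1, 5, 0, 3, 2};
  char bwt[9] = {0};
  bwt_from_sa(T, SA, 8, bwt);
  printf("%s\n", bwt);  /* prints ACTGA$TA, as in Table 4.2 */
  return 0;
}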

Text:      G A T T A C A $
Position:  0 1 2 3 4 5 6 7

Tbwt   SA_T   M_T
A      7      $GATTACA
C      6      A$GATTAC
T      4      ACA$GATT
G      1      ATTACA$G
A      5      CA$GATTA
$      0      GATTACA$
T      3      TACA$GAT
A      2      TTACA$GA

Table 4.2: BWT of the text T=”GATTACA$” ($ denotes the end of text)

Properties of the BWT

Beyond its compression features, the BWT leads to a transformed text that depicts useful properties for string searching. To explain those properties we need to define the rank function rank(c, i) and the accumulated counters C[a].

Definition 4.2.1 (Rank function). Let the function rank(c, i) be defined as the occurrence function over the prefix of Tbwt from 0 to i.

rank(c, i) = occ(c, Tbwt[0..i])

Definition 4.2.2 (Accumulated counters). Let C[a] (a ∈ Σ) be an array containing the number of characters in T that are smaller than a:

C[a] = Σ_{c<a} occ(c, T),   ∀a ∈ Σ

Interestingly, the occurrences of equal characters preserve their relative order in the last and the first columns of M. Hence, the i-th occurrence of a character c within Tbwt corresponds to the i-th occurrence of c in the first column.

Lemma 4.2.1. Let T be a text and SA_T its suffix array. Also, let c be the i-th letter of the BWT of T (i.e. c = Tbwt[i], corresponding to the suffix SA_T[i]). Then, among all the suffixes starting with c, there are rank(c, i − 1) suffixes lexicographically smaller than c · T_{SA_T[i]..n−1}.

Proof. Let c = Tbwt[j] for some j < i. By construction of M_T, it holds that T_{SA_T[j]..n−1} < T_{SA_T[i]..n−1}. Consequently, it holds that c · T_{SA_T[j]..n−1} < c · T_{SA_T[i]..n−1}. Therefore, we have exactly rank(c, i − 1) suffixes starting with c and lexicographically smaller than c · T_{SA_T[i]..n−1}.

Lemma 4.2.2. Let T be a text and SA_T its suffix array. Also, let c be the i-th (i > 0) letter of the BWT of T (i.e. c = Tbwt[i], corresponding to the suffix SA_T[i]). Then, there are C[c] + rank(c, i − 1) suffixes smaller than c · T_{SA_T[i]..n−1} (that is, smaller than T_{SA_T[i]−1..n−1}).

Proof. From lemma 4.2.1, we know that there are rank(c, i − 1) suffixes starting with c and lexicographically smaller than c · T_{SA_T[i]..n−1}. By definition, suffixes starting with d < c (d ∈ Σ) are smaller than any suffix starting with c, and hence smaller than c · T_{SA_T[i]..n−1}. Thus, the total number of suffixes smaller than c · T_{SA_T[i]..n−1} is the sum C[c] + rank(c, i − 1).

It is important to note that, by definition of Tbwt, for any given SA_T[i] (i > 0), we know that T_{SA_T[i]−1..n−1} = Tbwt[i] · T_{SA_T[i]..n−1} – as Tbwt[i] is the character preceding the suffix SA_T[i]. Then, using the previous lemmas, given the suffix SA_T[i] we can compute the number of suffixes smaller than the suffix starting at SA_T[i] − 1 (i.e. the previous suffix in the text T) by counting characters within Tbwt. This enables one to map symbols in the last column of M_T back to the first column. This operation is commonly known as LF-mapping.

Definition 4.2.3 (Last-to-First mapping). Let LF(i) be the function that, given the row corresponding to the suffix SA_T[i] (i > 0), computes the row corresponding to the previous text suffix SA_T[i] − 1. From lemma 4.2.2, with c = Tbwt[i]:

LF(i) = C[c] + rank(c, i − 1)

Unwinding the BWT

The LF-function allows one to traverse the text backwards and thus retrieve the original text T from the BWT text Tbwt, by starting at the first row of M_T (i.e. the one beginning with '$') and repeatedly applying LF (algorithm 2).

Algorithm 2: Reconstruct T from TBWT

input : Tbwt and C[·]
output: Original text T
1 begin
2   j ← 0
3   T[n − 1] ← $
4   for i = n − 2 to 0 do
5     T[i] ← Tbwt[j]
6     j ← C[T[i]] + rank(T[i], j − 1)
7   end
8 end

4.2.2 FM-Index

The FM-index [Ferragina and Manzini, 2000, 2001], which stands for Full-text index in Minute space – and not for the names of its inventors (Ferragina and Manzini), as one might think – exploits the BWT to build compressed text indexes. Its authors observed that the BWT implicitly encodes the SA of the text. In the end, they figured out a way of exactly searching patterns using the BWT text and some auxiliary data structures. At the same time, they proposed compressing the BWT text so as to reduce the index space requirements to O(n) bits, which in practice can index the 3GB of the human genome in less than 1GB, depending on the compression algorithms selected – about 10 times less than the SA of the same reference text. Despite these compression properties, most current FM-Index designs do not compress the BWT for the sake of providing fast query accesses, as the overall memory usage without compression is quite reasonable in practice – about the same size as the text encoded in ASCII.

LFc function

Similarly to a search in SA_T, an exact search using the FM-Index uses two sentinels (i.e. pointers) to delimit an interval of the SA – conceptually encoded in Tbwt – containing all the suffixes that share a common prefix (i.e. the part of the pattern searched so far). Like the LF function, the LFc function "pretends" that a given character c is present at the end of a given row of M_T (i.e. that c precedes the suffix of that row). Then, given a sentinel pointing to the i-th row of M_T (i.e. the suffix T_{SA_T[i]..n−1}), LFc can determine how many suffixes are smaller than c · T_{SA_T[i]..n−1} using lemma 4.2.2.

Definition 4.2.4 (LFc function). Let LFc(i, c) be the function that, given the i-th suffix, computes the number of suffixes smaller than c · T_{SA_T[i]..n−1}:

LFc(i, c) = C[c] + rank(c, i − 1)

The main idea behind the FM-Index is to use this LFc function to backward search a pattern by iteratively updating two sentinels – the search interval on the SA – and exploiting the properties of Tbwt (Algorithm 3).

Algorithm 3: FM-Index backward search

input : pattern P, Tbwt, and C[·]
output: SA_T interval [lo, hi) for P
1 begin
2   lo ← 0
3   hi ← n
4   for j = m − 1 to 0 do
5     c ← P[j]
6     lo ← C[c] + rank(c, lo − 1)
7     hi ← C[c] + rank(c, hi − 1)
8   end
9   return (lo, hi)
10 end

In this manner, the backward search delimits the SA_T rows that exactly match a given pattern in O(m) time. Moreover, we can compute the number of matching positions by subtracting the sentinels at any point of the search (i.e. occ(P, T) = hi − lo), without computing the actual matching positions in the text. In case lo ≥ hi, we know that there is no exact match for the given pattern.

Note that once the SA interval for a given pattern is known, all the positions inside its range are encoded in SA-space. That means that, for a given i-th row of the M_T matrix, we lack the corresponding position in the text (i.e. SA_T[i]). In order to decode positions from SA-space to Text-space we need to unwind the BWT until a known corresponding position is reached. In a trivial case, we would traverse the text backwards, applying the LF function s times, until the row corresponding to the beginning of the text T is reached. At that point we would know that the source i-th row corresponds to the suffix starting at position s of the text. Of course, unwinding all the way back to the beginning of the text in order to decode each position is quite expensive. Instead, we can store periodical samples of SA_T to reduce the number of LF calls needed to decode a given position. For that purpose, efficient implementations of the FM-index store a separate sampled SA (Section 4.3.5).
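To illustrate the decoding mechanism, a naive, self-contained C sketch (no rank acceleration, SA-space sampling every 4 rows; the layout and names are assumptions for the example, not the design discussed in Section 4.3.5):

#include <stdint.h>
#include <stdio.h>

#define SAMPLING_RATE 4

static const char Tbwt[] = "ACTGA$TA";            /* BWT of T = "GATTACA$" (Table 4.2) */
static const uint64_t n = 8;
static const uint64_t C[256] = { ['$']=0, ['A']=1, ['C']=4, ['G']=5, ['T']=6 };
static const uint64_t sampled_sa[] = {7, 5};      /* samples at rows 0 and 4: SA[0]=7, SA[4]=5 */

static uint64_t rank_bwt(char c, int64_t i) {     /* rank(c,i) = occ(c, Tbwt[0..i]) */
  uint64_t count = 0;
  for (int64_t k = 0; k <= i; ++k) count += (Tbwt[k] == c);
  return count;
}

static uint64_t LF(uint64_t i) {                  /* row of the previous text suffix */
  const char c = Tbwt[i];
  return C[(unsigned char)c] + rank_bwt(c, (int64_t)i - 1);
}

/* Decode row i of M_T into its text position SA_T[i] */
uint64_t decode_position(uint64_t i) {
  uint64_t steps = 0;
  while (i % SAMPLING_RATE != 0) {                /* walk backwards until a sampled row */
    i = LF(i);
    ++steps;
  }
  return (sampled_sa[i / SAMPLING_RATE] + steps) % n;
}

int main(void) {
  printf("SA[6] = %llu\n", (unsigned long long)decode_position(6)); /* prints 3 (TACA$) */
  return 0;
}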

FM-Index elements

Basically, the FM-Index consists of the BWT of the text and some auxiliary data structures used to accelerate the computation of the LFc function, decode positions from SA-space to Text-space and fetch substrings of the original text.

As mentioned above, so as to accelerate the computation of the LFc function, any FM-Index implementation includes the C[a ∈ Σ] array which stores the accumulated occurrences of each 4.2. Succinct full-text index structures 71

character in the text. In addition, the FM-index stores partial counters interleaved with the BWT text to accelerate the computation of the rank function. In this way, the BWT text is divided into equal blocks of b characters (b is the block length). Each FM-Index block b_i contains b characters of the BWT (T_bwt^i) and the partial counters (c_i[a ∈ Σ]) holding the count of each character up to that position (depicted in figure 4.1). Note that the number of blocks in the FM-Index is equal to |{b_0, ..., b_n}| = ⌈|T|/b⌉ and that, for a given character a, the content of the partial counters is c_i[a] = rank(a, i · b − 1) = occ(a, Tbwt[0..i · b − 1]). In this manner, computing rank() avoids counting occurrences up to that point and leaves the counting to the potentially few characters past the counters. Depending on the FM-Index layout, the regular inclusion of counters in the index will speed up rank queries at the cost of more index space. This leads to a trade-off between space consumption and cost per LF-operation, described in section 4.3.

[Figure 4.1 depicts an FM-Index block b_i: the partial counters c_i[A], c_i[C], c_i[G], c_i[T] followed by the b characters of the BWT text T_bwt^i.]

Figure 4.1: FM-index block design
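As an illustration of this layout, a minimal C sketch of an FM-Index block with interleaved partial counters and the corresponding rank computation (the block length, alphabet size and field names are assumptions for the example, not the layout adopted later in this chapter):

#include <stdint.h>

#define BLOCK_LENGTH 32
#define ALPHABET_SIZE 8   /* e.g. A,C,G,T,N,'|','$', padding */

typedef struct {
  uint64_t counters[ALPHABET_SIZE];   /* c_i[a] = occ(a, Tbwt[0 .. i*b - 1]) */
  uint8_t  bwt[BLOCK_LENGTH];         /* the b characters of the block (already encoded) */
} fm_block_t;

/* rank(c, pos) = occ(c, Tbwt[0..pos]) using the interleaved counters:
 * take the counter of the enclosing block and scan at most b characters. */
uint64_t fm_rank(const fm_block_t *blocks, uint8_t c, uint64_t pos) {
  const fm_block_t *block = &blocks[pos / BLOCK_LENGTH];
  uint64_t count = block->counters[c];
  for (uint64_t k = 0; k <= pos % BLOCK_LENGTH; ++k)
    count += (block->bwt[k] == c);
  return count;
}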

Additionally, so as to accelerate the decoding of positions from SA-space to Text-space, any efficient FM-Index implementation stores samples of the SA. Depending on which sampling scheme is chosen, the average number of LF operations performed to reach a sampled position will vary. Also, different schemes allow for better average performance at the expense of extra index space. Section 4.3.5 offers a comprehensive explanation of the topic.

Finally, for many indexed approximate string matching algorithms it is very important to be able to retrieve any substring of the original reference text efficiently. Note that most alignment algorithms end up pairwise aligning the pattern against several candidate text regions. For that, efficient recovery of text regions is of key relevance. However, as initially formulated, the FM-Index can only retrieve text regions by means of iterating LF operations in O(|P|) time. Not only does this text retrieval mechanism perform several sparse index accesses before retrieving the whole text region, but each LF operation also involves several instructions just to retrieve a single character. Also, in many cases, from a given position p in SA-space we need to jump forward so as to start the LF text retrieval several bases ahead. To do so, we need to decode that position into Text-space, increment the position p, and encode it back to SA-space. As we can see, the full-text built-in operations given by the FM-Index are very space efficient but quite convoluted for simple text queries. For these reasons, it is more convenient to store the full reference text together with the FM-Index. Despite the index space increasing notably, DNA texts are rather suited to being compacted – using 3 bits per character – reducing the space overhead. For instance, for the human genome (~3 Gnt), the plain text could use up to 1GB of extra space (which is affordable using current computers). In this way, once we decode a position p into Text-space, accessing its corresponding text region becomes trivial and more efficient (i.e. the whole region is stored contiguously in memory).

4.3 Practical FM-Index design

4.3.1 FM-Index design considerations

One of the most important aspects of an efficient FM-index is the actual index layout in memory. Though many papers have analysed the algorithmic implications of different FM-index designs from a theoretical standpoint [Hon et al., 2004; Ferragina and Manzini, 2001], very few emphasise enough the relevance of a careful choice of parameters for a practical and efficient implementation. Needless to say, as accessing the index is the key building block of any indexed search algorithm, the cost per access is going to be the major determinant of searching performance.

An efficient FM-index design must provide a convenient trade-off between fast memory access and computation. In more detail, to perform an LF computation we need to access the index to fetch the text, the partial counters, and C[·], and compute the occurrences of a given character. In this way, an efficient design should enable fast memory access and a lightweight rank computation. From the memory standpoint, it is desirable to reduce the number of accesses so as to reduce the total memory bandwidth required per LF operation. But also, to efficiently access regions of memory in the order of gigabytes, it is very important to be aware of the penalties imposed by the memory wall [Wulf and McKee, 1995]. In the end, we want to achieve the maximum memory locality possible and the least possible number of misses at all the levels of the memory hierarchy (TLB misses being the least desirable). For that reason, we want to concentrate all the index data associated with an LF operation in the same memory region and keep the overall index size as small as possible. From a computation standpoint, counting the occurrences of a character in a given FM-Index block of BWT text is the main bottleneck. Several factors, like the encoding of the characters in the text bitmaps or the number of text bitmaps between partial counters, must be carefully taken into account. Furthermore, depending on the encoding, some layouts allow using SIMD instructions to enhance the rank computation.

FM-Index memory access pattern

Towards efficient FM-Index designs, it is of critical importance to understand the memory access pattern of backward searches using the FM-Index. As opposed to SA binary searches – where the search progressively shrinks the range of the SA searched, focusing each iteration on a more specific region of the SA –, experimental results show that the memory accesses of backward searches are far from depicting locality. However, these accesses are not random at all. At each step, the backward search delimits those suffixes from SA_T prefixed by the pattern searched so far, P_{i..m−1}. Each time a character P[i − 1] is added to the search, the two sentinels delimiting the search suffixes jump to another region of SA_T, this time prefixed by the new character and the previous prefix, P_{i−1..m−1}. It is easy to see that the accesses to the FM-index change from one region of the index to another depending on the next character added to the backward search. As presumably most of the characters of the pattern are different, each step of the backward search potentially jumps to a different, remote position of the FM-Index. Hence, the memory access pattern of the FM-Index is clearly not local at all. Furthermore, for genome-scale FM-Indexes, most accesses are bound to hit different cache blocks and – more troublesome – different operating system pages. For this reason, the backward search is probably one of the least memory-friendly algorithms and in practice is bound to generate a lot of cache misses and page faults.

Although there is little we can do to change the memory pattern of the backward search algorithm, it is quite advisable to choose an FM-Index design such that each block b_i fits entirely into a cache line. For instance, for Intel's Core i7 the line sizes in L1, L2 and L3 are the same; that is, 64 bytes. Fitting and aligning whole FM-Index blocks into a cache line guarantees not only that just one cache block has to be fetched to compute the LF operation (potentially incurring just one cache miss), but also minimises the memory bandwidth needed per LF call. Here, the key insight is that, depending on the FM-Index design, it is preferable to add padding and waste index space for the sake of aligned accesses to cache blocks and page boundaries, rather than compacting the FM-Index blocks at the expense of missing twice when accessing blocks that straddle two cache blocks – or even two system pages.

Architecturally friendly FM-Index design

At the same time, an often overlooked feature of efficient FM-Index implementations involves the selection of architecturally friendly dimensions. For instance, note that any block size b for the FM-Index is theoretically possible. However, selecting b in such a way that T_bwt^i fits a computer word enables easier access and faster counting; for instance, using a 2-bit alphabet and allocating b = 32 characters in a 64-bit word. Furthermore, power-of-two block lengths lead to power-of-two divisions to locate the FM-Index block at each LF operation. Since power-of-two divisions and modulos can be implemented with shifts and masks, selecting power-of-two block lengths can avoid frequent and expensive machine instructions such as divisions. Likewise, block counters c_i can be squeezed into the domain of positions of the reference text length. However, dealing with counters whose width is not a power of two (e.g. 17-bit counters) does not seem efficient due to the overhead of logical shifts and the penalties due to misaligned memory accesses. For that reason, it seems quite reasonable to base the FM-Index design on 8-, 16-, 32- and 64-bit counters and power-of-two block lengths.
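For instance, with a 2-bit alphabet and b = 32 characters per 64-bit word, the occurrences of a character inside a block can be counted with a handful of bitwise operations plus a population count; a minimal sketch, assuming an LSB-first 2-bit layout and GCC/Clang's __builtin_popcountll:

#include <stdint.h>

/* Build a word where every 2-bit slot equals 'c' (c in 0..3) */
static inline uint64_t spread2(uint64_t c) {
  return c * 0x5555555555555555ULL;   /* replicate the 2-bit pattern 32 times */
}

/* Number of characters equal to c among the first 'len' characters of the word */
static inline unsigned word_rank(uint64_t word, unsigned c, unsigned len) {
  const uint64_t x = word ^ spread2(c);                           /* matching slots become 00 */
  uint64_t match = ~(x | (x >> 1)) & 0x5555555555555555ULL;       /* one bit per matching slot */
  if (len < 32) match &= (1ULL << (2 * len)) - 1;                 /* mask out characters past 'len' */
  return (unsigned)__builtin_popcountll(match);
}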

String collections and alphabet considerations

Furthermore, in the context of bioinformatics, most applications are required to index several genomic sequences (a string collection) such as chromosomes, contigs, etc. For this reason, the most common approach is to concatenate all sequences into a single reference text. To avoid erroneous alignments at the boundaries, individual sequences are separated from each other using a special character denominated the separator ('|'). Afterwards, the concatenated text is indexed as a whole to produce the FM-Index of the string collection, which is then used as any other FM-Index reference.

It is important to note that, as the text-space is composed of several sequences, we need to decode positions in the text-space into positions in the individual sequences. This is usually achieved by means of a translation table – often named the locator table – which contains the offsets of all the sequences comprised in the reference text. Given a match position p in the reference text, by binary searching the sorted intervals of the sequences contained in the locator, we can compute the source sequence and the position within it.
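A minimal sketch of such a locator table and of the binary search used to translate a text-space position into sequence coordinates (the struct and names are assumptions for the example):

#include <stdint.h>

/* Sequence k occupies the half-open interval [offsets[k], offsets[k+1]) of the
 * concatenated reference text; offsets[num_sequences] = |T|. */
typedef struct {
  const char **names;      /* names[k] = name of sequence k (e.g. "chr1") */
  const uint64_t *offsets; /* sorted; offsets[0] = 0 */
  uint64_t num_sequences;
} locator_t;

/* Translate a text-space position into (sequence index, offset inside the sequence) */
void locate(const locator_t *loc, uint64_t text_pos, uint64_t *seq, uint64_t *seq_pos) {
  uint64_t lo = 0, hi = loc->num_sequences;       /* binary search over the sorted offsets */
  while (hi - lo > 1) {
    const uint64_t mid = lo + (hi - lo) / 2;
    if (loc->offsets[mid] <= text_pos) lo = mid; else hi = mid;
  }
  *seq = lo;
  *seq_pos = text_pos - loc->offsets[lo];
}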

As a result, the FM-Index of a string collection artificially introduces an extra character (i.e. the separator). In addition, the BWT construction adds another extra character (i.e. '$') to mark the end of the text. Furthermore, the DNA alphabet is not solely composed of the four standard bases (i.e. ACGT), but also allows uncalled bases – or unknowns; 'N'. In the end, using the FM-Index in the context of collections of genomic strings requires the addition of three characters to the alphabet. However, this is quite unfortunate as the FM-index layout is very sensitive to the addition of extra characters. Each extra character not only requires an extra partial counter per block of the FM-Index, but also increases the number of bits required to encode the BWT text of the block.

Note that for a four-character alphabet – like strict DNA – we only need 2 bits to encode the text. This is the main reason why some mapping tools employ "tricks" so as to avoid enlarging the alphabet:

• Removing the BWT terminator. It is easy to see that the '$' character only appears once. Thus, it can be removed by storing its position in the BWT explicitly and adding it conditionally within the LF operation. Despite introducing a branch within a critical building block, this removes the need for an extra character. We propose a better and more elegant solution based on the observation that sequence separators never appear together. It is important to understand the main purpose of the BWT '$' character: (a) to prevent circular traversals – and comparisons – from the beginning of the text to the end, and (b) to establish "$T" as the lowest suffix of MT. For that, we could use two separators instead (e.g. "||T") while retaining the same properties. Note that if the '|' character is lower than any other character in the alphabet, "||T" is the lowest suffix of MT.

• Removing sequence separators. Moreover, sequence separators '|' can be removed by substituting them with artificial sequences uninteresting to the application or unlikely to match any input sequence. For instance, many mappers choose to separate sequences using a long stretch of poly-A. The idea is to make the stretch long enough to avoid matches spanning two sequences through the poly-A stretch. Furthermore, the only read that can map to such a region is composed only of A's and lacks interest for the majority of genomic applications. Though this is an arguable technique, it is used in many widely used mapping tools.

• Removing uncalled bases from the reference. An even more controversial technique to reduce the alphabet to four letters is to replace every uncalled base in the reference with another nucleotide (e.g. A). Surprisingly, many production mappers take this approach, arguing that the number of cases where these replacements lead to incorrect mappings is negligible.

In short, many techniques can be used to reduce the alphabet back to four characters in order to save index space. However, even if adverse situations are unlikely, some of these techniques can potentially lead to silent mistakes during mapping. Therefore, it seems preferable to pay the price of a larger alphabet for the sake of flexibility and the certainty that no mistake can arise due to these aggressive alphabet reduction techniques.

Genome-scale and double-stranded FM-indexes

Yet another important design decision involves the maximum size of the counters used; that is, the total number of addressable positions in the text, or simply the maximum text length supported by the FM-Index design. In this respect, many early mappers, heavily constrained by the memory available at the time, assumed that 32 bits would suffice – after all, a single strand of the human genome is barely 3 Gnt. Nevertheless, restricting the maximum size of the genome to 4 Gnt is quite unrealistic – it leaves out widely used genomes and collections of genomes in the scientific community. Hence, current mappers implementing the FM-Index use 64-bit counters.

Moreover, being able to address as much as 64 bits allow (i.e. 2^64 − 1 positions), we can think of indexing both strands of the genome in the same FM-Index. The main advantage is the ability to search both strands of the genome at the same time using a single backward search – which vastly reduces the complexity of stranded searches. In this way, given the forward strand Tf, we concatenate the reverse-complement strand Trc to generate the double-stranded reference Tds = "Tf|Trc".

Tf = "N G A T T A C A" Trc = "T G T A A T C N" Tds = ”Tf |Trc” = "N G A T T A C A | T G T A A T C N"

Figure 4.2

Note that the addition of the reverse-complement strand Trc does not need to be reflected in the full reference text stored together with the FM-Index – saving index space. Whenever a text region of the reverse-complement strand is required (i.e. positions beyond the forward strand), the corresponding position prc can be projected onto the forward strand (pf = |T| − prc + 1), the text region retrieved from the forward text Tf, and reverse-complemented on the fly.
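The sketch below illustrates this projection under assumed conventions (0-based positions, a single '|' separator and a literal complement function); the exact projection formula depends on how Tds is laid out in the actual index.

/* Sketch: fetching a region that falls on the reverse-complement strand by
 * projecting it onto the forward text and complementing on the fly. */
#include <stdint.h>

static char complement(char base) {
  switch (base) {
    case 'A': return 'T';
    case 'C': return 'G';
    case 'G': return 'C';
    case 'T': return 'A';
    default:  return 'N';
  }
}

/* Copies `length` characters of Tds starting at `position` into `out`,
 * assuming `position` lies beyond the forward strand and the separator. */
static void fetch_rc_region(
    const char* forward_text, uint64_t forward_length,
    uint64_t position, uint64_t length, char* out) {
  const uint64_t rc_offset = position - (forward_length + 1); /* skip '|' */
  for (uint64_t i = 0; i < length; ++i) {
    const uint64_t pf = forward_length - 1 - (rc_offset + i);
    out[i] = complement(forward_text[pf]);
  }
}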

4.3.2 FM-Index layout

FM-index text layout

In terms of computation, the most expensive part of each LF call is counting the number of occurrences of a given character in a text block Tbwt. This strongly depends on the bitmap encoding and arrangement of the characters inside the block. For instance, a simple packed encoding would arrange the characters one after another (i.e. Tbwt = c0 · c1 · ... · cb−1). This naive approach not only forces several convoluted masking operations before the occurrences can actually be counted, but can also waste bits of a computer word if the alphabet size is not a power of two (e.g. an alphabet using 3 bits would waste 2 bits per 32-bit machine word; 76 MB for the human genome). To avoid this, we could use a per-character bitmap representation – one bitmap per character – each using a computer word. In this case, counting occurrences is simplified to just shifting out the characters beyond the rank position and counting ones (i.e. popcount). However, this representation grows linearly in space with the number of characters in the alphabet. For that reason, we propose a more practical and efficient design based on storing the bits of each character separated into different layers according to their significance: bit-significance layers encoding. For example, in the case of 3 bits per character, this encoding stores 3 separate words containing all the characters' bits grouped from least to most significant (figure 4.3).

Character   Encoding
A           000
C           001
G           010
T           011
N           100
|           101

Text:    NGATTACA|TGTAATCN
layer0:  00011010110100110
layer1:  01011000011100100
layer2:  10000000100000001

Figure 4.3: Bit-significance layers encoding

In this way, counting characters becomes relatively simple by (1) negating some layers – depending on the LFc character –, (2) combining the result using bitwise AND, (3) shifting to remove the characters beyond the rank position, and (4) counting the ones left in the bitmap. Assuming b = 64 – so that each layer is a 64-bit machine word – and a 3-bit alphabet, algorithm 4 shows this computation.

Bit-significance layers encoding not only simplifies the occurrence counting operation, but also reduces the space occupation to strictly that of the character encoding, avoiding wasted bits when encoding text into an FM-Index block. Note that although algorithm 4 is populated with conditional expressions, these can easily be removed (see section 4.3.3); branch instructions are undesirable in a kernel function bound to be called billions of times during mapping. In general, we would like FM-Index designs that lead to branch-free and orthogonal implementations of the LFc function.

Algorithm 4: LFc using bit-significance layers encoding
input : FM FM-Index, position i, and character c
output: LFc(i, c)
begin
    pb ← i / 64
    bi ← FM.b[pb]
    layer0 ← (c & 1) ? bi.T[0] : ¬bi.T[0]
    layer1 ← (c & 2) ? bi.T[1] : ¬bi.T[1]
    layer2 ← (c & 4) ? bi.T[2] : ¬bi.T[2]
    bitmap ← layer0 & layer1 & layer2
    shifted ← bitmap >> (64 − (i mod 64))
    return FM.C[c] + bi.c[c] + popcount(shifted)
end

Basic FM-Index block layout

Throughout the following sections we present design considerations and improvements over an initial FM-Index block design. This simple scheme (figure 4.4) serves as a first approach to illustrate many practical considerations which lead to more efficient designs. This initial FM-Index design is based on a simple 1-level bucketing scheme – one level of partial counters and text bitmaps – and a 3-bit alphabet encoded using bit-significance layers.

6 x 64 bits                              64 bits    64 bits    64 bits
ci[A] ci[C] ci[G] ci[T] ci[N] ci[|]      layer0     layer1     layer2

Figure 4.4: Basic 1-level bucketing FM-index design

Using this FM-Index design (FM.1b.64c), each block takes 72 bytes (i.e. S(bi) = 6×8 B + 3×8 B). As each block encodes 64 characters, the overall size of the index blocks is S(FM.1b.64c) = ⌈|T|/64⌉ · 72 bytes. This implies that the encoding efficiency – average size per character – is E(FM.1b.64c) = S(FM.1b.64c)/|T| = 1.125 bytes per character (i.e. more or less the same space as the ASCII encoding). However, if we look deeper, we soon realise that the partial counters take S(ci) = 48 B = 2/3 · S(bi); as much as 2/3 of the FM-Index block space is consumed by counters. This clearly shows that the space consumption of the partial counters is the main problem of most FM-Index designs. To mitigate this effect, we will later explain three techniques to reduce the counter size: double-bucketing layout, alternated counters and increasing the block length. Besides overall memory consumption, big block sizes also impact the memory hierarchy. That is, assuming 64 B cache lines, a single block is bigger than a cache line. Then, accesses to the FM-Index have to fetch 2 cache lines to compute a single LFc call. Thus, each access to the FM-Index requires 72 B of data (i.e. bandwidth) but effectively fetches 128 B of data (effective bandwidth).
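A minimal sketch of this basic block as a C structure (field names are illustrative) makes the 72-byte size and the counter overhead explicit:

/* Sketch of the basic 1-level bucketing block (FM.1b.64c): six 64-bit partial
 * counters followed by three 64-bit bit-significance layers encoding 64
 * characters; sizeof(fm_block_1b64c_t) == 72 bytes, of which 48 bytes (2/3)
 * are counters. */
#include <stdint.h>

typedef struct {
  uint64_t counters[6]; /* ci[A], ci[C], ci[G], ci[T], ci[N], ci[|] */
  uint64_t layers[3];   /* layer0, layer1, layer2 (64 characters) */
} fm_block_1b64c_t;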

Double bucketing layout

Based on the previous FM-Index design, note that the difference between partial counters of contiguous blocks cannot be greater than the block length b. In this case, it seems natural to encode partial counters using a double bucketing scheme. Basically, double bucketing stores major partial counters at the beginning of a major FM-Index block. These major counters store the number of occurrences of each character up to that position in the index. At the same time, these major blocks are composed of minor blocks with their own minor partial counters and text layers. However, minor counters only store the occurrences of each character from the beginning of the major block. In this way, the range of minor counters is reduced and smaller counter sizes can be used. This idea is known as the double bucketing FM-Index; its theoretical formulation is included in several publications about FM-Indexes [Ferragina and Manzini, 2001; Siragusa, 2015; Chacón et al., 2015] and it is used in many practical implementations like GEM [Marco-Sola and Ribeca, 2015].

For this purpose, many combinations of major and minor counter sizes are possible. Nevertheless, as mentioned earlier, power-of-two counter sizes are preferred to allow easier access and memory alignment. Besides, major counters need to address the maximum possible range of index positions. For this reason, the number of practical double-bucketing designs is reduced to a few combinations (depicted in table 4.3).

Magnitude                            Formula                              |ci| = 8               |ci| = 16              |ci| = 32
Minor block length                   |bi|                                 64 nt                  64 nt                  64 nt
Minor block size                     S(bi)                                8·|ci| + 3·64 = 32 B   8·|ci| + 3·64 = 40 B   6·|ci| + 3·64 = 48 B
Major block length                   |Bi| = 2^|ci| − 1 + |bi|             319 nt                 65599 nt               4294967359 nt
Max. minor blocks per major block    ⌊|Bi|/|bi|⌋                          4 blocks               1024 blocks            65536 Kblocks
Max. major block size                S(Bi) = S(bi) · ⌊|Bi|/|bi|⌋ + 64 B   192 B                  ~40 KB                 3.5 GB
Efficiency                           E = S(Bi) / (|bi| · ⌊|Bi|/|bi|⌋)     0.75                   0.63                   0.75

Table 4.3: Principal magnitudes of double-bucketing FM-Index designs for varying minor counter sizes. Note that minor counters are always padded to align at 64 bits (2 B of padding per minor block for |ci| = 8 and 4 B for |ci| = 16).

As shown in table 4.3, the most efficient design is found by balancing the counter size (i.e. a minor counter size of 16 bits), using 0.63 bytes per character. This double bucketing design (FM.2b.16c) is depicted in figure 4.5. One of the most striking design decisions of this model is the 2×16 bit padding of the counters so as to leave the layered text 64-bit aligned. However, this benefits the LFc access time and leaves space for useful information required in later sections (section 4.3.5). Note that in this case, the minor counter occupancy is reduced to S(ci) = 16 B = 2/5 · S(bi) (i.e. the minor counters barely take as much memory as the BWT text of the block Tbwt). Also note that major counters take a negligible amount of space. For instance, for the human genome, this FM-Index design generates major counters with an overall size of 3 MB. For that reason, major counters are separated and held together in a separate table susceptible to being cached so as to exploit the locality of recurrent accesses. Furthermore, note that each minor block (40 B) fits in a cache line. However, not all minor blocks are aligned with a 64 B cache line; it is easy to see that, on average, 4 out of every 8 blocks will span two cache lines.
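The minor block of this design can be sketched as a C structure (field names are illustrative; the two unused 16-bit slots are the padding mentioned above):

/* Sketch of the FM.2b.16c minor block: eight 16-bit minor counters (six used,
 * two padding/auxiliary slots) followed by three 64-bit text layers;
 * sizeof(fm_minor_block_2b16c_t) == 40 bytes. Major counters are 64-bit values
 * held in a separate table, one set per major block. */
#include <stdint.h>

typedef struct {
  uint16_t counters[8]; /* ci[A] ... ci[|] + 2 x 16-bit padding (n/u) */
  uint64_t layers[3];   /* layer0, layer1, layer2 (64 characters) */
} fm_minor_block_2b16c_t;

typedef struct {
  uint64_t counters[6]; /* Ci[A] ... Ci[|] */
} fm_major_counters_t;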

Alternated counters layout

Until this point, so as to compute the rank() operation, we have to count characters in a given block and add the result to the corresponding partial counter of the same block. However, another powerful observation is that the same computation can be carried out by using the partial counters of the next block and subtracting the occurrences beyond the rank position in the block. That is:

Major counters (64 bits):   Ci[A]  Ci[C]  Ci[G]  Ci[T]  Ci[N]  Ci[|]

Minor block (8 x 16 bits):  ci[A] ··· ci[|]  n/u  n/u
Text layers (3 x 64 bits):  layer0  layer1  layer2

Figure 4.5: Double bucketing FM-index design (FM.2b.16c). The figure shows a separate table for the major counters (Ci[·]) and, in each minor block, the minor counters interleaved with the text bitmaps.

rank(c, i) = b_pb.c[c] + occ(c, Tbwt^pb[0 .. pm])
           = b_pb+1.c[c] − occ(c, Tbwt^pb[pm+1 .. |b|−1])

where pb = i / 64 and pm = i mod 64.

This method is called alternated counters and was first proposed by A. Chacón in [Chacón et al., 2015] for storing a succinct FM-Index on modern GPUs. Using this index design, and depending on the queried character, the LFc function uses the counters either from its own block or from the next one. Besides, note that LFc operations for different characters are symmetrical: all cases require a single access to the FM-Index, and a single occ() operation is required. And yet, the overall design reduces the occupancy of the partial counters by half. Figure 4.6 depicts the double-bucketing FM-Index using alternated counters (FM.2b.16c + AC).

block i   (4 x 16 bits + 3 x 64 bits):  Ci[A]  Ci[C]  Ci[G]  Ci[T]    | bi.layer0    bi.layer1    bi.layer2
  ...
block i+1 (4 x 16 bits + 3 x 64 bits):  Ci+1[N]  Ci+1[|]  n/u  n/u    | bi+1.layer0  bi+1.layer1  bi+1.layer2

Figure 4.6: Alternated counters FM-index design (FM.2b.16c + AC). The figure shows two contiguous minor blocks, each with an alternated set of minor counters.

In practical terms, this method can be combined with any other design, leading to a better overall encoding efficiency (table 4.4). As we can see, by only applying this method to the basic FM-Index design (FM.1b.64c), we reduce the average size per character by 3 bits and, combined with the double-bucketing scheme, we can reach an efficiency of 0.5 bytes per character (i.e. 4 bits per nucleotide).

              Efficiency   Efficiency using AC
FM.1b.64c     1.13         0.75
FM.2b.16c     0.63         0.50

Table 4.4: Encoding efficiency improvement by employing alternated counters

Generally speaking, this idea can be extended further towards alternating counters across many blocks. Ultimately, we could store a single counter next to each block, alternating the character accounted for from block to block (figure 4.7 shows a schematic example of the idea). Depending on the counter size – and the level of bucketing used – the resulting FM-Index designs can achieve efficiencies from 0.50 down to 0.41. However, these designs require more than a single occ() computation to resolve an LFc operation. In the extreme case of figure 4.7, each LFc would require from 2 to 2.5 occ() computations on average.

block i:    Ci[A]    | bi.layer0    bi.layer1    bi.layer2
   ...
block i+5:  Ci+5[|]  | bi+5.layer0  bi+5.layer1  bi+5.layer2

Figure 4.7: FM-Index design example extending the alternated counters idea

In practice, this design extension turns out to be quite convoluted and complex to implement. Provided that we are willing to perform more than a single occ() computation per LFc operation, the next section presents a technique that can reduce counter occupancy even further at the cost of extra computations.

Increasing FM-Index block length

The next technique to reduce overall counter occupancy is based on increasing the length of each block. In this way, we avoid storing partial counters in favour of encoding more text per block. As a result, the text content of a block can no longer fit into a single computer word. Hence, this design technique often leads to more costly occ() computations – across larger text bitmaps – per LFc operation. Nevertheless, this situation opens the possibility of introducing SIMD instructions to accelerate occ() computations over larger machine words. Similarly, a conditional occ() implementation can further reduce the number of computations.

8 x 16 bits                          128 bits   128 bits   128 bits
Ci[A] ··· Ci[|]  n/u  n/u            layer0     layer1     layer2

Figure 4.8: Double bucketing FM-index design using 128-bit bitmaps.

Figure 4.8 shows an example of this FM-Index design, increasing the block length to b = 128 characters. In this example, occ() computations have to account for 2×64-bit text bitmaps. It is easy to see that, on average, only half of the accesses to a given block have to perform computations on both 64-bit words (i.e. those having an index i such that i mod 128 < 64). Therefore, a conditional implementation of the LFc operation reduces by half the number of cases involving 128-bit word operations. For those cases, SIMD instructions can be used to fit the whole block text into a single register and compute the occ() at once.
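As a hedged sketch of the conditional variant (plain 64-bit popcounts via the GCC/Clang __builtin_popcountll builtin rather than wider SIMD, and assuming the queried character has already been isolated into the two layer-combined bitmap words):

/* Sketch of a conditional occ() over a 128-character block stored as two
 * 64-bit words (low word = characters 0..63, high word = characters 64..127).
 * bitmap_lo/bitmap_hi are assumed to already hold a 1 exactly at the positions
 * matching the queried character (i.e. after the layer negation/AND step). */
#include <stdint.h>

static inline uint64_t occ_128(
    uint64_t bitmap_lo, uint64_t bitmap_hi, uint64_t offset /* 0..127 */) {
  if (offset < 64) {
    /* Only the low word needs to be touched. */
    return (uint64_t)__builtin_popcountll(bitmap_lo & ((1ull << offset) - 1));
  }
  /* Low word counted in full; high word masked to the remaining offset. */
  return (uint64_t)__builtin_popcountll(bitmap_lo) +
         (uint64_t)__builtin_popcountll(bitmap_hi & ((1ull << (offset - 64)) - 1));
}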

In general, we could extend this idea to larger block lengths. Table 4.5 shows the encoding efficiency achieved by each configuration as the block length increases. Note the slow convergence of the efficiency towards the theoretical minimum at E = 0.38. Hence, beyond increasing b to 128 or 256 characters, there is little benefit in encoding efficiency, at the cost of heavily increasing the computations per LFc operation. Besides, the most notable observation is the equivalence between using alternated counters and doubling the block length: applying alternated counters to the FM-index design FM.2b.16c is equivalent to increasing b to 128, both achieving an overall encoding efficiency of E = 0.50. However, alternated counters only require computations over 64-bit words, whereas using 128-bit blocks involves more intensive computations.

N-step FM-Index

                  |bi| = 64   |bi| = 128   |bi| = 256   |bi| = 512   Theoretical min. (3 bits per character)
FM.1b.64c         1.13        0.75         0.56         0.47         0.38
FM.2b.16c         0.63        0.50         0.44         0.41         0.38
FM.1b.64c + AC    0.75        0.56         0.47         0.42         0.38
FM.2b.16c + AC    0.50        0.44         0.41         0.39         0.38

Table 4.5: Comparison of encoding efficiency as the block length increases.

The last FM-Index layout design technique discussed is based on the observation that LFc operations can be extended to cope with more than one character at a time. Described first in [Chacón et al., 2013], the N-step FM-Index enables advancing N characters per LFc step at the cost of increasing the index size. In more detail, a 2-step FM-Index design (figure 4.9) not only has to store the last column of the MT matrix, but also the previous one. With these 2 columns plus the partial counters for each pair of characters, a 2-step FM-Index can perform a backward search by querying two characters at a time [Chacón et al., 2013].

Counters (4^2 x 64 bits):
ci[AA]  ci[AC]  ci[AG]  ci[AT]
ci[CA]  ci[CC]  ci[CG]  ci[CT]
ci[GA]  ci[GC]  ci[GG]  ci[GT]
ci[TA]  ci[TC]  ci[TG]  ci[TT]

Text layers (3 x 64 bits each):
MT^(n-1).layer0   MT^(n-1).layer1   MT^(n-1).layer2
MT^(n-2).layer0   MT^(n-2).layer1   MT^(n-2).layer2

Figure 4.9: 2-step FM-index design for a 4-letter alphabet.

This design not only enables faster searches, but also exhibits more local index accesses that put less pressure on the memory hierarchy. That is, the computations performing the equivalent of two LFc operations are localised in the same region of the index. In practice, a 2-step FM-Index requires fewer instructions to compute the combined LFc operations than a 1-step FM-Index design. In contrast, the main drawback of this technique is the exponential increase in index size as N grows. In fact, the example depicted in figure 4.9 leads to an overall encoding efficiency of E = 2.75; much higher than previous designs.

4.3.3 FM-Index operators

Just as important as the specific index layout, the actual implementation of the FM-Index operators plays a fundamental role in the overall performance of indexed approximate search algorithms. For this reason, careful consideration when implementing these functions can have a great performance impact. Additionally, basic search algorithms often present opportunities to implement tailored operators that exploit particular conditions and perform better than a single generic operator. For this reason, this section presents practical considerations for implementing FM-Index operations, together with efficient operators tailored to particular search scenarios so as to reduce computations and obtain better performance.

LFc operator

Presumably, the LFc operator is the most frequently called function within any approximate string matching algorithm built on top of the FM-Index. Within a real mapping tool, the LFc operator can be called billions of times. For that reason, an efficient implementation of this function is crucial to achieve good overall performance; even tiny implementation improvements are reflected in the overall running time. The following optimisations can easily be implemented within any FM-Index layout design and quantitatively help to reduce the overall number of operations needed to compute LFc (figure 4.10 shows the implementation of these optimisations at the core of the LFc function).

• Incorporated C[c] counters

Recalling the basic LFc definition (LFc(i, c) = C[c] + rank(c, i)), we need to add the content of the accumulated occurrence array (i.e. C[c]) to the result of rank(c, i). Note that this addition is repeated every time we call the LFc function. For that reason, we can add the content of C[c] to each major counter of the FM-Index (Ci[c]) and avoid this addition whenever LFc is called.

• XOR character table

In order to avoid conditional code in the LFc function, we can replace the conditional negation of the text bitmaps (algorithm 4) with small translation tables indexed by the query character. The resulting code is compact and avoids conditional expressions, which could lead to pipeline stalls.

• Bitmap mask table

Note that the last computation of the LFc function involves shifting out unwanted bits before computing the popcount of the remaining ones. In order to reduce computations, we can replace the subtraction and the shift with a mask table holding precomputed masks. The resulting code trades arithmetic operations for an indexed access to a small array that can easily be cached.

• Power-of-two divisions/modulus

Though not explicitly depicted in figures 4.10 and 4.11, divisions and modulus operations by powers of two can easily be replaced by shift and masking instructions. Indeed, a modern compiler is expected to perform this substitution on its own. As a result, some of the most computationally expensive instructions are avoided and much faster code can be generated.

• Specialised popcount hardware instruction

Out of all the instructions executed within the LFc function, the popcount operation can be the most time consuming. In fact, this operation is so computationally intensive and recurrent in computer applications that many hardware architectures implement tailored instructions for it. In the case of Intel architectures, the SSE 4.2 instruction set extension offers a popcount hardware instruction that can greatly accelerate this computation. LFc implementations using it can experience a speedup of almost 30% (compared to other bit-wise solutions available to perform the popcount [Warren, 2013]).

/*
 * LF implementation for the FM.2b.16c FM-Index design
 */
uint64_t bwt_LF(
    bwt_t* const bwt,
    const uint8_t char_enc,
    const uint64_t position) {
  // Locate BWT block
  const uint64_t block_pos = position / 64;                         // DIV
  const uint64_t block_mod = position % 64;                         // MOD
  const uint64_t* const mayor_counters =
      bwt->mayor_counters[position/BWT_MAYOR_BLOCK_LENGTH];         // DIV+LD
  const uint64_t* const minor_block =
      bwt->minor_blocks[block_pos*BWT_MINOR_BLOCK_SIZE];            // MUL+LD
  // Calculate the exclusive rank for the given DNA character
  const uint64_t sum_counters = mayor_counters[char_enc] +
      ((uint16_t*)minor_block)[char_enc];                           // 2LD+ADD
  const uint64_t bitmap = (minor_block[2]^xor_table_3[char_enc]) &
                          (minor_block[3]^xor_table_2[char_enc]) &
                          (minor_block[4]^xor_table_1[char_enc]);   // 6LD+3XOR+2AND
  // Return rank
  return sum_counters +
      POPCOUNT_64(bitmap & lf_mask_ones[block_mod]);                // LD+AND+POP+ADD
}

Figure 4.10: Optimised LFc operation implementation (FM.2b.16c)

This leads to an LFc implementation involving up to 24 operations. Note that these small optimisations are very sensitive to the specific compiler, architecture and other hardware used. Nevertheless, from several experimental benchmarks performed, we estimate that these optimisations can reduce the running time spent on calls to the LFc function by 30% to 50% compared to the naive implementation (algorithm 4).

Same FM-Index block exact search

Exact searching using the FM-Index (algorithm 3) involves the use of two sentinels that progressively reduce the SAT interval corresponding to the pattern searched. In general, these sentinels query different blocks of the FM-Index. Statistically speaking, once the search pattern is longer than the proper length of the reference (section 3.3.3), the resulting SAT interval is bound to delimit a single occurrence. For that reason, searches close to the proper length – and beyond – are bound to query the same FM-Index block for both sentinels. In these cases, the LFc computations for both sentinels can be merged to save work; in fact, most computations are shared except for the final shift and popcount. Figure 4.12 shows the implementation of the combined LFc operation for cases when both sentinels query the same FM-index block (i.e. (lo/64) == (hi/64)).

This combined LFc function requires 29 arithmetic/logical operations (i.e. only 5 operations more than a single LFc call, instead of the 48 needed by two separate calls). Experimentally, this translates into practical exact searches gaining a speed-up factor of 1.8x on average.

/*
 * XOR tables to mask the bitmap depending on the character
 */
const int64_t xor_table_1[] =
    {-1ll, -1ll, -1ll, -1ll, -0ll, -0ll, -0ll, -0ll};
const int64_t xor_table_2[] =
    {-1ll, -1ll, -0ll, -0ll, -1ll, -1ll, -0ll, -0ll};
const int64_t xor_table_3[] =
    {-1ll, -0ll, -1ll, -0ll, -1ll, -0ll, -1ll, -0ll};
/*
 * Bitmap masks replacing the shift + subtraction
 */
const uint64_t lf_mask_ones[] = {
    0x0000000000000000ull,
    0x0000000000000001ull,
    0x0000000000000003ull,
    /* ... */
    0x7FFFFFFFFFFFFFFFull,
    0xFFFFFFFFFFFFFFFFull
};

Figure 4.11: Optimised LFc operation tables

void bwt_LF_interval(
    bwt_t* const bwt,
    const uint8_t char_enc,
    uint64_t* const lo,
    uint64_t* const hi) {
  // Locate BWT block
  const uint64_t block_pos = *hi / 64;                              // DIV
  const uint64_t* const mayor_counters =
      bwt->mayor_counters[*hi/BWT_MAYOR_BLOCK_LENGTH];              // DIV+LD
  const uint64_t* const minor_block =
      bwt->minor_blocks[block_pos*BWT_MINOR_BLOCK_SIZE];            // MUL+LD
  // Calculate the exclusive rank for the given DNA character
  const uint64_t sum_counters = mayor_counters[char_enc] +
      ((uint16_t*)minor_block)[char_enc];                           // 2LD+ADD
  const uint64_t bitmap = (minor_block[2]^xor_table_3[char_enc]) &
                          (minor_block[3]^xor_table_2[char_enc]) &
                          (minor_block[4]^xor_table_1[char_enc]);   // 6LD+3XOR+2AND
  *hi = sum_counters +
      POPCOUNT_64(bitmap & lf_mask_ones[*hi % 64]);                 // MOD+LD+AND+POP+ADD
  *lo = sum_counters +
      POPCOUNT_64(bitmap & lf_mask_ones[*lo % 64]);                 // MOD+LD+AND+POP+ADD
}

Figure 4.12: Same-Block LFc implementation (FM.2b.16c)

Precomputed LFc operation

Another scenario of intensive FM-Index use is searching for approximate strings by generating all possible words (i.e. neighbourhood search) in a suffix-tree-like search. In this case, so as to explore all possible combinations, from a given search node the search has to explore the following nodes by adding every possible character to the current search. This translates into LFc queries at the same index position for all possible characters. Consequently, many computations are shared among all these calls. Figure 4.13 shows how to precompute all the LFc elements needed to solve any such LFc query afterwards (figure 4.14).

void bwt_precompute_block_elms(
    bwt_t* const bwt,
    const uint8_t char_enc,
    const uint64_t position,
    bwt_block_elms_t* const bwt_block_elms) {
  // Locate BWT block
  const uint64_t block_pos = position / 64;                         // DIV
  const uint64_t block_mod = position % 64;                         // MOD
  const uint64_t* const mayor_counters =
      bwt->mayor_counters[position/BWT_MAYOR_BLOCK_LENGTH];         // DIV+LD
  const uint64_t* const minor_block =
      bwt->minor_blocks[block_pos*BWT_MINOR_BLOCK_SIZE];            // MUL+LD
  // Compute block elements
  const uint64_t erase_mask = lf_mask_ones[block_mod];              // LD
  const uint64_t bitmap_1  = minor_block[2] & erase_mask;           // LD+AND
  const uint64_t bitmap_N1 = (minor_block[2]^(-1ull)) & erase_mask; // LD+XOR+AND
  const uint64_t bitmap_2  = minor_block[3];                        // LD
  const uint64_t bitmap_N2 = bitmap_2^(-1ull);                      // XOR
  const uint64_t bitmap_3  = minor_block[4];                        // LD
  bwt_block_elms->mayor_counters = mayor_counters;                  // ST
  bwt_block_elms->minor_block = minor_block;                        // ST
  bwt_block_elms->bitmap_1__2[0] = bitmap_N2 & bitmap_N1;           // ST+AND
  bwt_block_elms->bitmap_1__2[1] = bitmap_N2 & bitmap_1;            // ST+AND
  bwt_block_elms->bitmap_1__2[2] = bitmap_2 & bitmap_N1;            // ST+AND
  bwt_block_elms->bitmap_1__2[3] = bitmap_2 & bitmap_1;             // ST+AND
  bwt_block_elms->bitmap_3[0] = (bitmap_3^(-1ull));                 // ST+XOR
  bwt_block_elms->bitmap_3[1] = bitmap_3;                           // ST
}

Figure 4.13: Precomputing LFc-elements (FM.2b.16c)

Note that the number of computations is proportional to the logarithm of the alphabet size, O(log(|Σ|)). In this particular example, the total number of operations needed to precompute the LFc elements is 27. Once these computations are done, resolving each LFc call takes 9 operations. Assuming only that each precomputation serves LFc calls for 4 characters (i.e. A, C, G, T), the overall number of operations is reduced from 24×4 = 96 down to 27 + 9×4 = 63 (i.e. a reduction to 65% of the operations).

uint64_t bwt_precomputed_LF(
    const bwt_block_elms_t* const block_elms,
    const uint8_t char_enc) {
  const uint64_t sum_counters =
      block_elms->mayor_counters[char_enc] +
      ((const uint16_t*)block_elms->minor_block)[char_enc];         // LD+ADD
  const uint64_t bitmap =
      block_elms->bitmap_1__2[char_enc & 3] &
      block_elms->bitmap_3[char_enc >> 2];                          // SHIFT+2AND+2LD
  return sum_counters + POPCOUNT_64(bitmap);                        // ADD+POP
}

Figure 4.14: Precomputed LFc (FM.2b.16c)

4.3.4 FM-Index lookup table

At a higher level, a very powerful observation tells us that the first steps of every exact search have a high likelihood of hitting the same positions in the FM-Index. In other words, given a small q, the number of possible q-grams is relatively small. Thus, one might think of memoizing the exact search SAT intervals for those q-grams, avoiding repetitive computations and allowing fast direct access to pre-searched small q-grams. This is the key insight behind implementing a lookup table for the FM-Index. For queries of limited length, the lookup table resolves exact searches by direct access to a relatively small table in memory, avoiding LFc-intensive computations and potential memory slowdowns.

Figure 4.15: FM-Index lookup table (memoized LFc calls)

In more detail, the lookup table we propose is arranged in levels, one for each q-gram length or search depth. Each level points to an array of contiguous integers storing the limits of each q-gram search interval. Note that the upper and lower limits of contiguous q-gram intervals denote the same SAT position (i.e. the hi pointer of qi equals the lo pointer of qi+1). Hence, it is possible to merge them and only store non-repeated interval pointers (figure 4.15). Also note that, due to the reduced size of the DNA alphabet, we can store every search interval for relatively large q-grams using little memory space. In practice, we are only interested in searching canonical nucleotides (i.e. A, C, G, T). Thus, we only store search intervals for q-grams drawn from a four-letter alphabet (despite the fact that the FM-Index could handle queries for any other character). As a result, each level li has a total of S(li) pointers stored.

S(l0) = 2
S(li) = 5 · 4^(i−1)

Table 4.6 shows the overall memory size of FM-Index lookup tables of different depths. Interestingly, a lookup table allowing direct access to search intervals of up to 11 characters (i.e. lmax = 11) only takes 53.33 MB of memory. As a result, this lookup table avoids querying the index for the first 11 nucleotides of every exact search. Combined with any filtering technique extracting seeds from the input pattern, this method avoids a remarkable number of queries to the FM-Index. In the case of q-grams, it is easy to see that this table serves as a replacement for any q-gram index up to lmax. Also note that, in the case of the human genome reference, the proper length lp (definition 3.3.8) is roughly 15 nucleotides; that is, on average any exact search is expected to lead to an interval containing a single match once 15 characters have been searched. Hence, any filtering algorithm based on finding exact unique seeds (i.e. seeds leading to a single candidate) is bound to save, on average, 73% of the accesses to the index.

Additionally, computations to access the lookup table are straightforward and computationally lightweight. As shown in figure 4.15, intervals from the same search level are contiguous

Table level (i)   Pointers per level S(li) = 5·4^(i−1)   Total pointers Σ_{j=0..i} S(lj)   Lookup table size
0                 2                                       2                                 0.02 KB
1                 5                                       7                                 0.05 KB
2                 20                                      27                                0.21 KB
3                 80                                      107                               0.84 KB
4                 320                                     427                               3.34 KB
5                 1280                                    1707                              13.34 KB
6                 5120                                    6827                              53.34 KB
7                 20480                                   27307                             213.34 KB
8                 81920                                   109227                            853.34 KB
9                 327680                                  436907                            3.33 MB
10                1310720                                 1747627                           13.33 MB
11                5242880                                 6990507                           53.33 MB
12                20971520                                27962027                          213.33 MB
13                83886080                                111848107                         853.33 MB
14                335544320                               447392427                         3.33 GB
15                1342177280                              1789569707                        13.33 GB
16                5368709120                              7158278827                        53.33 GB

Table 4.6: FM-Index lookup table size

and can be accessed by just computing an offset offseti into the table based on the characters of the searched q-gram. Moreover, once the offset at a given level li has been computed, the offset of the search interval resulting from adding another character (i.e. the offset at the next level li+1) can be derived from the previous offset.

offset_0 = 1
offset_i = offset_{i−1} + P[n−i] · S(l_i)

void rank_mtable_fetch(
    const rank_mtable_t* const rank_mtable,
    const uint64_t level,
    const uint64_t offset,
    uint64_t* const lo,
    uint64_t* const hi) {
  *hi = rank_mtable->sa_ranks_levels[level][offset];
  *lo = rank_mtable->sa_ranks_levels[level][offset-1];
}

Figure 4.16: FM-Index lookup table query

Once the required search depth – or the limit of the table – is reached, retrieving the actual interval pointers is as simple as indexing the corresponding level array (figure 4.16).

Computing thresholds on the lookup table

Up to this point, the FM-index lookup table can be used within any approximate string matching algorithm for the first lmax steps of any search. However, for specific filtering algorithms the lookup table can be enhanced so as to avoid further computations.

Within dynamic filtering techniques, the adaptive pattern partitioning (APP) technique (section 5.4.3) partitions a given pattern into chunks so that each chunk occurs in the FM-index fewer than a given threshold number of times. This algorithm is greedy and proceeds backwards – from the end of the pattern – searching each chunk until the search interval length drops below a given threshold (i.e. hi − lo ≤ th).

In this scenario, a lookup table can be used to avoid the first lmax queries to the index for each chunk – or factor – computed by the algorithm. Experimental results show that the APP algorithm using a lookup table (for the first 11 steps) replaces almost half of the accesses to the FM-Index with queries to the lookup table (i.e. a 49% reduction of accesses to the FM-Index). However, a more subtle approach would avoid querying even the lookup table until there is a chance that the searched interval is smaller than the threshold. Using the lookup table beforehand, we can compute the minimum factor length (fmin|threshold), which is the minimum length for which any q-gram in the index occurs fewer times than the threshold. In the case of the human genome and a threshold of 20 occurrences, fmin|20 = 10, which implies that, with a minimal modification to the APP, the algorithm could avoid any kind of memory access until the 10th character of each seed. In practice, this avoids most of the accesses to the lookup table, as they are known to be uninformative to the APP algorithm.

However, the minimum factor length is quite sensitive to the reference genome content. For instance, a biased reference genome could contain a single occurrence of a given four-mer (q = 4) and thus reduce the effectiveness of using the minimum factor length. Besides, the minimum factor length is a global magnitude which, in some cases, might not be a very accurate estimate. For this reason, we propose a generalisation of this idea to exploit more accurate bounds within the lookup table for any reference text. To this end, we store the actual minimum factor length at the last level of each branch of the lookup table (level lmax). That is, at the end of each path in the lookup table we store the level at which the interval first dropped below the threshold. In this way, the APP would progress until reaching the maximum depth of the lookup table, query the actual minimum factor length of that specific search branch, backtrack to that position – if needed – and then proceed as the regular algorithm would. Results vary depending on the actual reference genome. Nevertheless, the computation of the actual minimum factor length guarantees that the APP algorithm only queries the lookup table when strictly necessary.

4.3.5 FM-Index sampled-SA

As explained in previous sections, once a given SAT search interval has been computed, all the positions inside its range – encoded in SA-space – need to be decoded into text-space. In order to do so, we need to unwind the BWT from each position of the interval until a known SA-position is reached (i.e. an SA-position for which we know the corresponding text-position). The closer one of these "known" SA-positions is to the initial SA-position, the fewer LF computations the algorithm has to perform to decode the position. For that, "known" SA-positions are sampled from the original SA and stored in the so-called sampled-SA. In this way, we say that a position is sampled or marked if it is contained in the sampled-SA. It is easy to see that the more samples the sampled-SA contains, the more memory it consumes – ultimately as much as the entire SA. In turn, the more positions sampled, the fewer LF computations are needed to decode a given position. This leads to a trade-off between space consumption and decoding performance. Additionally, beyond the sampling rate selected, the sampling scheme chosen (i.e. which positions to sample) determines the average number of LF computations per decode operation. Pseudo-code for decoding positions from SA-space to text-space using a sampled-SA is shown in algorithm 5.

Algorithm 5: FM-Index decode positions from SA-space to Text-space

input : pSA position in SA-space, sampled-SA, TBWT, and C[·]
output: corresponding position of pSA in text-space
begin
    s ← 0
    p ← pSA
    while p ∉ sampled-SA do
        c ← TBWT[p]
        p ← C[c] + rank(c, p)
        s ← s + 1
    end
    return (sampled-SA(p) + s) % |T|
end
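A hedged C sketch of this decode loop is given below; is_sampled, sampled_sa_lookup, bwt_char and bwt_rank are hypothetical helpers standing in for the concrete index accessors, and the global names are assumptions for the sake of a self-contained example.

#include <stdint.h>
#include <stdbool.h>

extern bool     is_sampled(uint64_t sa_position);
extern uint64_t sampled_sa_lookup(uint64_t sa_position);
extern uint8_t  bwt_char(uint64_t sa_position);            /* T_BWT[p] */
extern uint64_t bwt_rank(uint8_t c, uint64_t sa_position); /* rank(c, p) */
extern const uint64_t C[];         /* accumulated occurrence counts */
extern const uint64_t text_length; /* |T| */

uint64_t fm_decode_position(uint64_t sa_position) {
  uint64_t steps = 0;
  uint64_t p = sa_position;
  while (!is_sampled(p)) {
    const uint8_t c = bwt_char(p);
    p = C[c] + bwt_rank(c, p); /* one LF step backwards in the text */
    steps++;
  }
  return (sampled_sa_lookup(p) + steps) % text_length;
}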

Compacting SA samples

Note that the space consumption of the sampled-SA can have a great impact on the overall FM-Index size. Irrespective of the sampling rate or scheme selected, an efficient sampled-SA design requires compacting the SA samples to the minimum number of bits required by the reference text length. In the case of the human genome, only 33 bits are strictly needed to address the almost 2 × 3 Gnt of forward and reverse reference. Table 4.7 shows the space consumption for the human genome at various sampling rates.

SA-Space sampling scheme

The first and simplest sampling scheme is to sample positions regularly in SA-space. That is, given a sampling rate s, this scheme samples one out of every s positions in SA-space; usually, an SA-position pSA is sampled if pSA % s = 0. In this way, given any position to decode, it is enough to compute the modulus to check whether the position has been sampled; otherwise, the decode algorithm has to perform an LF step and try again. Once a sampled position is found, it is enough to divide the position by the sampling rate to compute the offset in the sampled-SA array in which the corresponding text-position is stored. Additionally, note that, so as to reduce the impact of the modulus and division operations, sampling rates should be selected to be powers of two.

Sampling rate   Sampled-SA size
4               6336 MB
8               3168 MB
16              1584 MB
32              792 MB
64              396 MB
128             198 MB

Table 4.7: Sampled-SA size for the human genome at various sampling rates

The main benefit of this scheme is its simplicity: it is simple to implement, and checking whether a position is sampled is efficient. However, this scheme offers no guarantee on the maximum number of LF steps a given decoding operation can take. Recall that the positions are sampled in SA-space, whereas LF traverses the positions of the text backwards in text-space. For this reason, the number of LF steps in the worst case can be several times the sampling rate. In spite of this, it is easy to see that, on average, the number of LF steps per decode operation is expected to be s (assuming the initial positions to decode are i.i.d.).
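As a minimal sketch (assuming a power-of-two sampling rate), the sampling test and the slot computation reduce to a mask and a shift:

/* Sketch of the SA-space sampling check: a position p_SA is sampled iff
 * p_SA % s == 0, and its slot in the sampled-SA array is p_SA / s. */
#include <stdint.h>
#include <stdbool.h>

#define SAMPLING_RATE 32u /* must be a power of two */

static inline bool sa_is_sampled(uint64_t sa_position) {
  return (sa_position & (SAMPLING_RATE - 1)) == 0; /* sa_position % s == 0 */
}

static inline uint64_t sa_sample_slot(uint64_t sa_position) {
  return sa_position / SAMPLING_RATE; /* index into the sampled-SA array */
}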

Text-Space sampling scheme

In contrast, a more sophisticated and efficient sampling scheme is based on sampling positions in text-space. In this way, we can derive guarantees on the maximum number of LF steps per decode operation. Given a sampling rate s, this scheme bounds the maximum number of LF computations to s (as one out of every s positions is sampled in text-space). Moreover, in contrast to the previous scheme, the expected number of LF computations is (0 + 1 + 2 + ... + (s−1)) / s = ((s−1)·s/2) / s = (s−1)/2 ≈ s/2; that is, half the number of LF computations for the same sampling rate (i.e. the same sampled-SA space occupation). However, this method has a downside: knowing which positions have been sampled is not as easy as before. In fact, as sampling is decided according to the corresponding text-positions, we need an additional data structure to mark which SA-positions have been sampled. For that purpose, we can add another bitmap to the previously presented layouts and mark the sampled positions in it. Figure 4.17 shows the FM.2b.16c FM-Index design incorporating a sampled bitmap. Also note that this design uses one minor counter to accumulate the number of sampled positions up to that point. Given a sampled SA-position, in order to compute the offset in the sampled-SA array in which the corresponding text-position is stored, we count the number of sampled positions before this position (using the sampled bitmap and the counters). As a result, this scheme increases the FM-Index space per character from E = 0.63 to E = 0.75 (i.e. one extra bit per character encoded).

8 x 16 bits                          4 x 64 bits
ci[A] ··· ci[|]  ci[S]  n/u          layer0  layer1  layer2  SampledBM

Figure 4.17: Double bucketing FM-index design (FM.2b.16c + sSA) with interleaved sampling bitmaps

Exact searching and decoding unique intervals

As mentioned above, once an exact search has reached the proper length of the genome, there is a high likelihood that the resulting search interval has length one (i.e. statistically a unique match).

From that point on, the exact search continues searching for the remaining characters, updating the interval of the only candidate left. In this case, instead of waiting for the search to finish to decode the resulting position, we can keep track of the intermediate search intervals queried (i.e. single FM-Index positions) and record whether any of these positions is sampled. By doing so, and irrespective of the result of the exact search, we can avoid additional LFc operations to decode the result of the exact search. This technique is of special interest for exact searches and for filtering algorithms based on exactly searching potentially long chunks of the input sequence. In the case of MEMs – or SMEMs –, as the algorithm determines maximal exact matches, we can keep track of the sampled positions to reduce the overhead of LFc operations when decoding MEMs for later use.

4.3.6 Double-stranded bidirectional FM-Indexes

As mentioned in section 4.3.1, double-stranded indexes not only allow searching both strands of the genome at the same time, but also exhibit symmetrical properties that can be exploited to perform bidirectional searches. The ability to perform bidirectional searches leads to a new class of efficient approximate string matching algorithms. These bidirectional algorithms exploit this feature to search for patterns starting from the best conserved parts of the sequence, progressing forwards and backwards until the search is complete. This approach better restricts the number of combinatorial possibilities in neighbourhood searches, leading to faster algorithms when the error rate is high or the input sequence is repetitive.

Forward search

Using the FM-Index of Tds, we can perform backward exact searches of any pattern using algorithm 3. However, the FM-index does not allow performing forward exact searches out of the box. Nevertheless, with the aid of a complementary index of the reverse text, we can simulate a forward search by exploiting simple properties of the SA. Given the forward and reverse indexes of a text, let us assume that we know the search interval I0 of the pattern P0 (e.g. P0 = "TAG") and the interval Ī0 of the reverse pattern P̄0 (i.e. "GAT"). Also, note that forward pattern intervals – like I0 – are computed using the regular index, and reverse pattern intervals are computed using the reverse index. Then, we can update the interval I0 with a character c to forward search for P1 (i.e. "GTAG") by noticing that P1 occurs in the forward index the same number of times that P̄1 occurs in the reverse index; for simplicity, let us say that |I1| = |Ī1|. In turn, Ī1 can easily be obtained by means of a backward search (algorithm 3) for P̄1 ("GATG") starting from P̄0 (i.e. "GAT"). Now that we know the length of the interval Ī1, we know the length of the interval I1. At the same time, as the suffixes in a SA are sorted, we know that I1 ⊆ I0. Hence, we only need to compute the offset of I1 within I0; that is, the lengths of the intervals corresponding to lexicographically smaller suffixes: A·P0 and C·P0 (i.e. "ATAG" and "CTAG"). Using the previous idea, we compute the lengths of those intervals from their reverses (i.e. "GATA" and "GATC"), using backward search on P̄0. This completes the computation of an interval in a forward search step. Notice that when we forward search we update the reverse interval, and vice versa. In this way, a bidirectional search keeps both intervals – forward and reverse – updated as it progresses.
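For illustration, the sketch below shows one common formulation of a single bidirectional extension step (extending the pattern at its end); backward_step is a hypothetical wrapper around algorithm 3's single-character update, the interval layout is an assumption, and the exact roles of the two intervals may be arranged differently in the actual GEM implementation. For simplicity, the sketch ignores sentinel/separator characters, which sort below 'A' and would also contribute to the offset.

#include <stdint.h>

typedef struct { uint64_t lo, size; } fm_interval_t;
typedef struct { fm_interval_t fwd, rev; } bidir_interval_t;

/* Hypothetical: single backward-search update of `in` with character c on the
 * given (reverse) index; returns the narrowed interval. */
extern fm_interval_t backward_step(const void* rev_index, fm_interval_t in, uint8_t c);

/* Extends the current pattern P at its end with character c (0=A,1=C,2=G,3=T),
 * updating both the forward interval (of P) and the reverse interval (of the
 * reversed pattern). */
bidir_interval_t bidir_extend_right(
    const void* rev_index, bidir_interval_t in, uint8_t c) {
  bidir_interval_t out;
  uint64_t offset = 0;
  /* Occurrences of P followed by a character smaller than c precede P·c
   * inside the (sorted) interval of P. */
  for (uint8_t d = 0; d < c; ++d) {
    offset += backward_step(rev_index, in.rev, d).size;
  }
  /* The reverse interval of P·c is obtained by a plain backward step with c. */
  out.rev = backward_step(rev_index, in.rev, c);
  /* The forward interval of P·c is nested inside that of P, shifted by offset. */
  out.fwd.lo = in.fwd.lo + offset;
  out.fwd.size = out.rev.size;
  return out;
}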

Double-stranded symmetry

Generally, this method would require two indexes – forward and reverse. Nevertheless, exploiting the symmetrical properties of a double-stranded index (lemma 4.3.1), we can simulate the forward search using the reverse-complement text. The double-stranded index Tds is built from the concatenation of the forward strand Tf and the reverse-complement strand Trc. By construction, it has the property that the number of occurrences in the index of any string s is the same as the number of occurrences of its reverse-complement src.

Lemma 4.3.1. Given the reference text T, its double-stranded index Tds = "Tf|Trc", and a string s, it holds that the number of occurrences of s in the index is the same as the number of occurrences of its reverse-complement src:

occ(s, Tds) = occ(src, Tds)

Corollary 4.3.1.1. The length of a given search interval I is equal to that of the corresponding reverse-complement interval Irc (i.e. the interval of the reverse-complement pattern).

In this way, we can replace reverse queries against a reverse index with reverse-complemented queries against a double-stranded index. As opposed to other proposed approaches [Lam et al., 2009], this method requires no additional index space beyond a regular double-stranded index.

4.4 Specialised indexes

For some specific algorithms and applications, specialised indexes offer tailored operations that can enhance approximate string searches and help model the particular characteristics of the experiment at hand. In this section we sketch some of the most notable index variations that play a fundamental role in certain algorithms and specialised application contexts. These variations are implemented within the GEM-mapper.

4.4.1 Bisulfite index

In the context of bisulfite sequencing, the converted reads obtained fall into two groups with different observed conversions. Using Illumina protocols, in read 1 the majority (i.e. ∼95%) of C's are converted to T's, and in read 2 the majority (i.e. ∼95%) of G's are converted to A's. This bisulfite conversion generates two distinct sources of DNA sequences corresponding to the two strands of the original DNA molecule. Each of the 2 groups of sequences can be read in either direction, leading to 4 distinct types of read.

One of the most common alignment strategies involves fully converting the input sequences so that any remaining C's in read 1 are converted to T's, and any remaining G's in read 2 are converted to A's. Sequenced reads are then aligned against the two references generated by fully converting the standard reference genome. For this, the original reference text T is processed, generating the concatenation of the C→T converted reference (T^(C→T)) and the G→A converted reference (T^(G→A)). In the case of double-stranded indexes:

BSref = Tds^(C→T) | Tds^(G→A)
      = Tf^(C→T) | Trc^(C→T) | Tf^(G→A) | Trc^(G→A)

This strategy is unbiased with respect to methylation status, but it does result in a small loss of information from the remaining C's in read 1 or G's in read 2. Nevertheless, note that the percentage of remaining C's in read 1 is very small (i.e. <2%), so the information loss is not substantial.
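A minimal sketch of the in-silico full conversion applied to the reads before alignment (the same per-strand conversions are applied to the reference when building BSref); function and parameter names are illustrative:

/* Fully converts a read in place: C->T for read 1, G->A for read 2. */
#include <stddef.h>

static void bisulfite_convert_read(char* read, size_t length, int is_read2) {
  const char from = is_read2 ? 'G' : 'C';
  const char to   = is_read2 ? 'A' : 'T';
  for (size_t i = 0; i < length; ++i) {
    if (read[i] == from) read[i] = to;
  }
}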

4.4.2 Homopolymer compacted indexes

It is common in many experiments to preprocess the input reference genome so as to avoid highly repetitive regions. These regions tend to generate biases in many analyses and become a source of expensive computations, yet in many cases the results generated from them turn out to be of little interest or to have little overall impact. For that reason, it is common to remove these regions in a process commonly known as masking the repetitive regions. In order to select these regions accurately, we can use the mappability score to determine highly repetitive regions and mask them out from the reference and subsequent downstream analyses. Masking the input reference is a simple example of how a preprocessing step over the reference can lead to simpler and faster analysis procedures by reducing the complexity of the problem.

In the context of HTS technologies, long-read sequencing technologies – like PacBio or nanopore – stand out from traditional HTS technologies as they are able to produce much longer sequences. As a result, long-read data has unlocked many genome analyses and experiments unfeasible using short-read sequencers – such as better genome assembly or the analysis of low-complexity regions. However, these technologies have a major shortcoming: they produce reads with a much higher error rate (i.e. ∼20%), with most errors being produced when sequencing homopolymers. In fact, the longer the homopolymer, the greater the chances that the sequencer will produce an incorrect read by adding or removing bases from the stretch of nucleotides. This has a great impact on mapping algorithms, not only because higher error rates force deeper approximate string searches, but also because of the nature of indels and their inherent complexity.

Inspired by the idea of masking the reference to reduce the complexity of the problem, we propose the homopolymer compacted index (HC-index). The main idea behind this technique is to compact homopolymers in both the input reference and the input sequences to align. As a result, any discrepancy – or error – in the length of a sequenced homopolymer is transparently avoided, and the mapping process is reduced to matching potential homopolymers of any length. Figure 4.18 shows a scheme of mapping in both spaces: regular text-space and HC-space.

Figure 4.18: Example of mapping using a HC-index

The HC-index is designed to be coupled with an approximate search algorithm based on filtering. The main purpose of the HC-index is then to enhance the "seeding" process of any filtering algorithm. In this way, the candidate generation of filtering algorithms becomes transparently insensitive to homopolymer errors. Afterwards, a candidate verification step against the original text can report the complete alignment in text-space. In this manner, the HC-index can be seen as an implicit and transparent technique for gapped seed generation, tailored to homopolymer errors. Note that, in order to translate positions from HC-space to text-space – and vice versa – we need to store an extra sampled array, in the spirit of the sampled-SA. Yet this extra array, sampling as little as 1 position out of every 100, has little impact on the overall size of the final index (e.g. 150 MB for the human genome).
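The compaction operation itself is straightforward; a minimal sketch (applied to both the reference before indexing and the reads before seeding) could be:

/* Collapses every run of identical characters into a single character, so
 * that homopolymer length errors disappear in HC-space. Returns the length
 * of the compacted text written to `out`. */
#include <stddef.h>

static size_t homopolymer_compact(const char* text, size_t length, char* out) {
  size_t out_length = 0;
  for (size_t i = 0; i < length; ++i) {
    if (out_length == 0 || out[out_length - 1] != text[i]) {
      out[out_length++] = text[i];
    }
  }
  return out_length;
}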

As an additional advantage, generating the neighbourhood of a string using the HC-index becomes computationally less expensive. Note that a single insertion accounts for all possible insertions of the same character, and a mismatch comes to represent not only mismatches but also insertions.

Chapter 5

Approximate string matching II. Filtering Algorithms

5.1 Filtering algorithms
    5.1.1 Classification of filters. Unified filtering paradigm
    5.1.2 Approximate filters. Pattern Partitioning Techniques
5.2 Exact filters. Candidate Verification Techniques
    5.2.1 Formulations of the problem
    5.2.2 Accelerating computations
        I   Block-wise computations
        II  Bit-parallel algorithms
        III SIMD algorithms
    5.2.3 Approximated tiled computations
    5.2.4 Embedding distance metrics
5.3 Approximate filters. Static partitioning techniques
    5.3.1 Counting filters
    5.3.2 Factor filters
    5.3.3 Intermediate partitioning filters
    5.3.4 Limits of static partitioning techniques
5.4 Approximate filters. Dynamic partitioning techniques
    5.4.1 Maximal Matches
    5.4.2 Optimal pattern partition
    5.4.3 Adaptive pattern partition
5.5 Chaining filters
    5.5.1 Horizontal Chaining
    5.5.2 Vertical Chaining
    5.5.3 Breadth-first vs depth-first filtering search

Since 1990 filtering algorithms have been studied as a very effective approximate matching technique. They base their success on the observation that, for a reasonable choice of tolerated error degree, it is easier to look for regions of the text containing pieces of the pattern without errors rather than trying to align the pattern against every position of the text. Since looking for exact chunks of the pattern (or occurrences of the pattern in the text) can be much faster than trying to do approximate matching with the whole pattern, filtering algorithms can quickly discard large regions of the text and perform much faster than other approaches in actual practice.


For instance, while searching for the pattern P = GATTACA up to one error, we could inspect the text and quickly discard the regions that do not contain either of the substrings s_0 = "GATT" or s_1 = "ACA". Indeed, no matter where in the pattern the possible error is located, either s_0 or s_1 must occur exactly in any match of P with up to one error.

In general, filtering techniques aim to filter out as many regions of the text as possible to reduce the search space being explored down to a few candidate regions. This approach can potentially avoid inspecting the whole text, often leading to algorithms that are sublinear in the length of T in most practical cases.

A significant amount of research has been conducted in this area and a number of algorithms have been proposed for different applications and typical parameter ranges [Burkhardt and Kärkkäinen, 2003; Weese et al., 2009; Kehr et al., 2011; Fonseca et al., 2012]. Furthermore, many filtering algorithms have been proposed and adapted for both online and indexed searches. As of today, all practical sequence alignment tools rely on filtering in one way or another.

In this chapter we first introduce the traditional filtering paradigm together with a classification of filters and how to evaluate their practical efficiency. Then we present the most relevant and representative filtering techniques used in sequence alignment algorithms. Subsequently we analyse how some filtering techniques scale with the read length, and how they can be enhanced. After that we present dynamic partitioning techniques and propose an adaptive filtering scheme able to obtain almost optimal pattern partitions – an embodiment of what we call data-aware filtering. We then introduce the idea of chaining specialised filters in order to achieve better performance. We also discuss how different error functions can be used to produce an error model that can be computed much more quickly than biologically-inspired metrics, while obtaining equivalent results. Finally we assess the effectiveness of progressive filtering by ranking several filtering methods, and deriving bounds on the number of candidates for a match when such methods are applied.

5.1 Filtering algorithms

Classic filtering paradigm

Generally speaking, all filtering algorithms are based on partitioning the pattern in order to reduce the approximate string matching problem to searching for its individual parts. This partition of the pattern – known as the filtering scheme – induces a set of conditions that any substring of the text has to satisfy in order to be a valid match (i.e. necessary, but possibly not sufficient, conditions). Afterwards, using an index of the text – indexed search – or the text itself – online search –, all the candidate substrings are gathered and aligned against the pattern to discover the true matches.

Traditionally, filtering algorithms are presented as just a pre-selection step to filter out regions of the text (candidate generation). In this way, filtering algorithms are unable to discover matches by themselves. Instead, a verification algorithm – potentially any regular alignment algorithm – has to be coupled with them so as to compute the actual matches (i.e. align with distance below a given threshold). Figure 5.1 depicts the general filtering paradigm.

Figure 5.1: Filtering paradigm

Filtering algorithm performance strongly depends on the balance between these two stages. Very restrictive filters can produce very few candidates at the expense of intensive computations, while fast and nonspecific filters can produce too many candidates. Quite often, the selection of the algorithm in the verification step is overlooked under the assumption that the previous candidate generation will have led to few candidates. Nevertheless, it is extremely important to understand that both steps have different complexities. For instance, in most practical scenarios, candidate verification grows quadratically with the length of the pattern, whereas candidate generation operates linearly.

Filtering efficiency

Back to filtering algorithms, specifically to the candidate generation step, we can encounter several cases depending on the nature of the candidates reported by the filter (see table 5.1).

True positive (TP); when the filter reports a candidate that aligns with the pattern.

False positive (FP) (type I error) or "false alarm"; when the filter reports a candidate that does not align with the pattern.

True negative (TN); when the filter discards a region of the text that truly doesn’t align against the pattern.

False negative (FN) (type II error); when the filter discards a region that aligns against the pattern.

                               Candidate generation
                               Matching      Discarded
True        Matching           TP            FN
condition   Not matching       FP            TN

Table 5.1: Filtering candidates confusion table

We define the sensitivity and specificity of a filter accordingly. Sensitivity (SEN) – a.k.a. true positive rate (TPR) – measures the proportion of valid matches that are reported as candidates (w.r.t. the number of matches missed). Specificity (SPC) – also called the true negative rate – measures the proportion of non-matching regions that are correctly filtered out (w.r.t. the number of regions incorrectly reported as candidates).

$SEN = \frac{TP}{TP + FN} \qquad SPC = \frac{TN}{TN + FP}$

Depending on the filtering algorithm, the number of discarded regions can differ – and, consequently, so can the overall computational cost. To evaluate the performance of a filter – and compare it against others – we commonly use the filtering efficiency (i.e. precision or positive predictive value, PPV), defined as the ratio between the number of matches found (i.e. true positives) and the number of candidates produced by the algorithm (i.e. true positives plus false positives).

$f_e = \frac{TP}{TP + FP}$

It is important to note that the maximum number of true positives reachable is independent of the algorithm – it depends only on the content of the text and the pattern. Therefore it is the number of FP that we would like to minimise – without increasing the filter complexity –, understood as the capability of the filter to reduce the noise (FP).
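As a toy numerical illustration of these definitions (the counts below are made-up numbers, not measured values), the following snippet computes the three metrics for a hypothetical lossless filter run:

#include <stdio.h>

int main(void) {
  // Hypothetical confusion counts of a lossless filter run
  const double TP = 900.0, FP = 4100.0, TN = 995000.0, FN = 0.0;
  const double SEN = TP / (TP + FN); // sensitivity: 1.000 (no match is missed)
  const double SPC = TN / (TN + FP); // specificity: ~0.996
  const double fe  = TP / (TP + FP); // filtering efficiency (precision): 0.180
  printf("SEN=%.3f SPC=%.3f f_e=%.3f\n", SEN, SPC, fe);
  return 0;
}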

On text searching and filtering

It is widely understood that performance of filtering algorithms is very sensitive to the error rate. Under moderate or low error rates, they usually perform very well. However, under high error rates, filtering algorithms fail to effectively discard text regions and saturate the verification algorithm with candidates.

On the other hand, what is often overlooked is the relationship between the pattern length and the text itself. For instance, looking for all the occurrences of the pattern P0 = "word" up to one mismatch – er = 25% – in the dictionary is deemed to be a challenging task, as it will lead to multiple matches. However, searching for the pattern P1 = "desoxyribonucleic" with the same error rate – five mismatches in total – can be an easier problem. Note that any valid substring aligning P1 has to match at least 12 letters in the same relative order (with mismatches in between). Intuitively, this is much more restrictive than matching 3 letters in the case of P0 = "word". In this case, longer patterns (w.r.t. the size of the text) yield more information that can be used to restrict the search space.

Conversely, a very sharp statistician could highlight the fact that the 5-error neighbourhood of P1 ("desoxyribonucleic") is much larger than the 1-error neighbourhood of P0 ("word"). However, due to its limited size, the dictionary cannot contain all the words from the neighbourhood of "desoxyribonucleic". Hence, it is not the size of the search space that matters, but the number of valid matches that can actually occur in the text. Therefore, in a hypothetical worst case, any sequence alignment algorithm would end up inspecting the whole dictionary in both cases. When daunted by the combinatorial nature of the problem, one should think how large a text would have to be to contain all these possibilities. It is paramount to realise that the length of the text is the real limiting factor of the search. Compared to all the combinatorial possibilities, only a few happen to be matches in practice.

As a matter of fact, the English dictionary – and reference genomes – are far from having randomly and uniformly distributed content. Thus, what really affects the performance of filtering algorithms is the content of the text itself and the number of candidates that the generation step can produce for a specific pattern. To illustrate this concept further, let us think of a search for a numerical pattern like P2 = "3141". Presumably, the English dictionary is biased towards alphabetical content, leaving just a few four-digit occurrences to explore as viable matches.

5.1.1 Classification of filters. Unified filtering paradigm

In this thesis, we would like to propose a different and more general view of filtering algorithms. As we define the problem of approximate string matching, the final goal is to reduce the search space – the whole text – to presumably a few matching regions. In this way, sequence alignment algorithms are mere filters which, given a candidate and a pattern, decide whether the candidate aligns with the pattern. The differences across algorithms lie in their accuracy and complexity. Thus, all these algorithms can be understood as filters whose main purpose is to reduce the search space.

Definition 5.1.1 (Filter). Given P, e, d(), and C = {w_0, ..., w_p | w_i ∈ Σ*}, a generic filter is defined as a binary classifier of the substrings contained in C:

$\varphi_{d,e}(C, P) = \{w_0, ..., w_q\} \subseteq C$

Hence, we consider classical pair-wise alignment approaches (e.g. dynamic programming algorithms, automaton-based approaches, etc.) as fully sensitive and specific filters (FP = 0 and FN = 0). In other words, they never fail to identify a valid match and never report noise. We classify these algorithms as exact filters.

Definition 5.1.2 (Exact filter). Given P, e, d(), and C = {w_0, ..., w_p | w_i ∈ Σ*}, an exact filter is defined as a filter φ_{d,e}(C, P) = R such that:

$d(w, P) \leq e \iff w \in R, \quad \forall w \in C$

As opposed to exact filters, approximate filters are not fully specific and often report false positives in the form of candidates that do not match the pattern (i.e. they generate noise). There are many approximate filtering algorithms to choose from, with different trade-offs between filtering efficiency and computational cost. But in general, these filters tend to be very lightweight and perform very well in practice. In turn, we can classify approximate filters as lossless or lossy filters, depending on whether they retrieve all valid matches (i.e. true positives) or fail to report complete results. Note that lossy filters lead to incomplete algorithms. Because of this, and in the context of this thesis, we will place our focus on lossless filters.

Traditionally, filtering algorithms are presented in either the context of online or indexed searches. However, it is of great interest to understand the duality of filters between these two classes, and that many filters can be conveniently adapted to both search modes. Note that, given a full-text index I of the text, an online filter φ_{d,e} can be converted into an indexed filter by enumerating all substrings of the text – using the index – and filtering them individually. Conversely, given a text T_{0..n}, an indexed filter φ_{d,e} on T can be converted into an online filter by locally indexing each text candidate w_i = T_{i..j}. Obviously, the whole point of specialised filters is to take advantage of particular scenarios towards better performance. Nevertheless, there are many cases of filters (e.g. counting filters) that can be efficiently applied in both scenarios and continue to perform equally well without losing their very essence.

Full-text indexes are data structures that basically pre-process the text to provide fast and organised access to certain regions of it. For this reason, filtering conditions using full-text indexes are suitable for being adapted to both online and indexed search. However, in the case of limited indexes (i.e. indexes from which the original text cannot be fully retrieved), like hashes or q-gram indexes, the structure of the index may prevent applying certain filtering rules. This is because the index has been designed with a specialised purpose in mind and does not contain enough information to reconstruct the original text. For instance, in the case of q-gram indexes, the order relation between the individual q-grams is missing.

Nevertheless, it is not desirable for an algorithm to report matches beyond the specified distance thresholds. Consequently, the results of approximate filters require further filtering – by an exact filter – to remove false positives and report only valid matches. In this way, we depict a filtering algorithm as an approximate filter – the candidate generation step – coupled with an exact filter – the verification step. This unified vision of sequence alignment algorithms as filters is the fundamental idea behind chaining filters.

5.1.2 Approximate filters. Pattern Partitioning Techniques

In this thesis, we mainly focus on filtering techniques based on pattern partitioning. The idea behind these methods is that, no matter how the errors are distributed in a matching string, we can always focus on the highly conserved parts of the read to reduce the combinatorial number of potential matches and reduce the search space. Let us think about the size of the search space of a 100-base-long pattern up to 4 errors (i.e. mismatches and indels). The number of combinations can be overwhelming. However, the problem becomes easier once we realise that any valid match has to contain at least 96 bases matching the pattern – and in the same relative order. What is more, no matter where the errors appear in the match, at least one contiguous chunk of 20 bases has to match the pattern exactly (partitioning the pattern into 5 chunks of 20 bases, the 4 errors can affect at most 4 of them). Surprisingly, by means of focusing on the highly conserved parts, the approximate string search has been reduced to an exact search for parts of the read. This idea of reducing the problem to the search for chunks of the read is commonly known as a seeded search. In the context of sequence alignment, the concept of a "seed" has been overloaded with various definitions, up to the point where a seed can denote any substring of the pattern – exact or with errors. The main purpose of a seed is to limit the search space to those parts of the text that contain it – filtering out the rest of the text. We call exact seeds those that match the pattern exactly. Correspondingly, we call approximate seeds those that are allowed errors when matching the pattern. In essence, approximate seeds describe a set of seeds that match a portion of the read allowing certain errors. This type of seed can better constrain the search space and reach more distant matches by specifying seed shapes containing errors.

Depending on how these techniques partition the pattern, we distinguish between static partitioning techniques and dynamic partitioning techniques. In the following, the basic partitioning techniques are presented in an incremental fashion, from the simplest to the most complex and effective filters. The reader must bear in mind that the development of these techniques has been strongly influenced by research on new index data structures. Consequently, some techniques are currently of no practical use, yet are still quite interesting from an algorithmic standpoint.

5.2 Exact filters. Candidate Verification Techniques

As explained earlier, exact filters are algorithms capable of computing whether a given string candidate w aligns with a given pattern P (within a maximum distance threshold e and for a given distance metric d()). These filters are fully sensitive and specific: they never fail to report a valid match (FN = 0) and never report an invalid candidate as a match (FP = 0).

Most classical pairwise alignment algorithms fall into this category. Extensive research has been done in this area for many years [Needleman and Wunsch, 1970; Sankoff, 1972; Sellers, 1974; Wagner and Fischer, 1974; Navarro, 2001]. In this way, dynamic programming (DP) approaches [Kukich, 1992; Ukkonen, 1985a,b] explore the whole text and output only the matching positions (true positives). These types of algorithms can be classified as exact filters, as they return all matches without any false positives. Despite being the most flexible solutions and delivering the best filtering efficiency, most of them are quite computationally expensive. This is the main reason to explore other, more computationally efficient kinds of filters, though perhaps less accurate ones, as they may report false positives. Since the formulation of the first algorithms to compute the pair-wise edit distance using DP, alternative formulations of the problem and many improved solutions have been proposed.

5.2.1 Formulations of the problem

Notably, some redefinitions of the problem have led to novel theoretical results, which in turn have inspired alternative algorithms.

One of the most notable revisions is the approach given by [Ukkonen, 1985a], where the problem is formulated as a shortest-path problem on a graph built from the text and the pattern. In that paper, Ukkonen presented basic theoretical properties of the DP table that are still exploited in modern algorithms. He formally proves that the diagonals of the DP table are monotonically increasing from the upper-left to the lower-right cells. Moreover, the values of adjacent cells can differ by at most one. As a result, many algorithms exploit these properties to compute only the band of the DP table where alignments of up to e edit operations are confined, drastically reducing the amount of computation needed to calculate the edit distance between two strings.

Furthermore, these properties are exploited so as to compute edit distance e in O(e^2) by algorithms known as diagonal transition algorithms. These algorithms aim to compute in constant time the cells where the diagonal values are incremented. These diagonals – so-called "strokes" – model segments that match exactly between the two strings. Afterwards, strokes are joined to compute the optimal path, which leads to the optimal alignment. In [Landau and Vishkin, 1989] Landau and Vishkin propose one of the first algorithms of this kind. Another remarkable algorithm exploiting these properties is proposed by Myers in [Myers, 1986]. Myers' O(nd) algorithm – unlike many others – is able to report the optimal alignment in O(|P|e) time. Many other algorithms have been proposed in this line of research; the interested reader is referred to Navarro [2001] for more information.

Another formulation of the problem is given in the form of searching with an automaton [Navarro, 2001]. Using a non-deterministic finite automaton (NFA), transformation operations are modelled as transitions over a finite number of states. Each state is associated with a number of errors and an aligned position of the pattern. By introducing the letters of a given string into the NFA, the final state becomes active if the string aligns with the pattern. This computation can be done by means of simulating the NFA or by transforming the NFA into the corresponding deterministic automaton (DFA). This idea was proposed in early publications like [Ukkonen, 1985b] using a deterministic automaton. Later on, the computation of the automaton was further described and improved [Baeza-Yates and G. Navarro, 1999; Wu and Manber, 1992]. Depending on the distance metric employed, there are solutions – like the Levenshtein automaton [Schulz and Mihov, 2002] – that perform fast in real scenarios. However, the challenging aspect of this approach is the exponential number of possible states that arises as the error increases. Hence, these algorithms are only practical for reasonably low error rates.

5.2.2 Accelerating computations

I Block-wise computations

One of the most widely known approaches to improving the computation of the DP table is the so-called Four-Russians technique [Kronrod et al., 1970] – though only one of the authors was actually Russian. This approach pre-computes all possible combinations of small DP tables of size b (i.e. a look-up table of all possible DP blocks of size b). Afterwards, the algorithm proceeds block-wise, computing the full matrix taking advantage of the precomputed blocks and thus gaining a factor of b. Since neighbouring cells differ by at most one, the DP table is encoded using horizontal – or vertical – differences (δ_e(i, j) ∈ {−1, 0, +1}). This makes a total of (3|Σ|)^{2b} possible input combinations for each precomputed block. Choosing b = O(log(|P|)), the complexity is reduced to O(|P|·|T|/log(|P|)). In short, assuming equal text and pattern length n, the complexity drops from O(n^2) to O(n^2/log(n)).

II Bit-parallel algorithms

Another very successful technique employed to accelerate the computation of pair-wise alignments is to exploit the intrinsic parallelism of computer instructions. Nowadays, any computer architecture offers simple arithmetic and logical instructions that can be understood as parallel operators over a set of bits. These operations can be performed using 32-64 bit words and, depending on the architecture, using SIMD instructions from 128 to 512 bits long. Here, the key insight is to encode the problem conveniently so as to benefit from simple bit-wise instructions. Algorithms based on this idea are able to compute several cells of the DP table – or states of the NFA – at once, enabling much faster computation of pair-wise alignments.

Most notably, Myers in [Myers, 1999] proposed computing the DP matrix by encoding columns as bit-words and using bit-wise operations to fill the matrix. The so-called Bit-Parallel Myers (BPM) algorithm takes advantage of the intrinsic parallelism of the bit-wise operations of any computer architecture (i.e. logical, shift, and addition operations). Figures 5.2 and 5.3 show the C code of the BPM algorithm.

In this approach, the columns of the DP matrix are encoded as differences between adjacent cells. As explained before, this reduces the number of possible values of each cell difference to −1, 0, +1. Then, each cell of the matrix is modelled as a logic block. In this way, the function "advance block" takes a column of the DP matrix and produces the next column using bit-wise operations (Figure 5.2). This process iterates so as to compute the whole matrix column by column, while the cells of each column are computed intrinsically in parallel. Low-level details make it one of the fastest bit-parallel algorithms of its kind [Myers, 1999; Hyyrö, 2003].

int8_t T_hout_64[2][2] = {{0,-1},{1,1}};
uint8_t P_hin_64[3] = {0, 0, 1L};
uint8_t N_hin_64[3] = {1L, 0, 0};

int8_t bpm_advance_block(
    uint64_t Eq,
    const uint64_t mask,
    uint64_t Pv,
    uint64_t Mv,
    const int8_t hin,
    uint64_t* const Pv_out,
    uint64_t* const Mv_out) {
  uint64_t Ph, Mh;
  uint64_t Xv, Xh;
  int8_t hout = 0;
  // Compute H-vectors
  Xv = Eq | Mv;
  Eq |= N_hin_64[hin];
  Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq;
  Ph = Mv | ~(Xh | Pv);
  Mh = Pv & Xh;
  // Compute carry
  hout += T_hout_64[(Ph & mask)!=0][(Mh & mask)!=0];
  Ph <<= 1;
  Mh <<= 1;
  Mh |= N_hin_64[hin];
  Ph |= P_hin_64[hin];
  // Compute V-vectors
  Pv = Mh | ~(Xv | Ph);
  Mv = Ph & Xv;
  // Return out-vectors and carry
  *Pv_out = Pv;
  *Mv_out = Mv;
  return hout;
}

Figure 5.2: BPM Advance Block Implementation

The BPM algorithm has a worst-case complexity of O(|P||w|/b) – where b is the length of the computer word used – and O(e|w|/b) on average, as cut-off conditions can be easily implemented within it. What is more, this algorithm performs better than others of its kind on long patterns. For this reason, it is often the choice of many real mappers based on filtering [Marco-Sola et al., 2012; Siragusa et al., 2013] to verify candidates. Additionally, in [Hyyrö and Navarro, 2002] an improved version of the BPM algorithm was proposed so as to reduce the overall number of bit-wise operations performed.
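For illustration, the following is a minimal, self-contained driver for the advance-block kernel of Figure 5.2 – a sketch, not the GEM-mapper implementation. It assumes the pattern fits in a single 64-bit block and computes the best edit distance of the pattern against any substring of a candidate text, as needed for candidate verification. The value 1 passed as hin selects a zero horizontal input difference in the lookup tables of Figure 5.2 (i.e. the first DP row is all zeros, so matches may start at any text position).

#include <stdint.h>
#include <string.h>

// Kernel from Figure 5.2
int8_t bpm_advance_block(uint64_t Eq, const uint64_t mask,
                         uint64_t Pv, uint64_t Mv, const int8_t hin,
                         uint64_t* const Pv_out, uint64_t* const Mv_out);

// Best edit distance of P (|P| <= 64) against any substring of T
int64_t bpm_single_block(const char* P, const uint64_t plen,
                         const char* T, const uint64_t tlen) {
  // Build the match masks (Eq) for every symbol of the pattern
  uint64_t PEq[256];
  memset(PEq, 0, sizeof(PEq));
  for (uint64_t i = 0; i < plen; ++i) PEq[(uint8_t)P[i]] |= 1ull << i;
  const uint64_t mask = 1ull << (plen - 1); // bit of the last pattern row
  uint64_t Pv = ~0ull, Mv = 0;              // initial column = 0, 1, ..., plen
  int64_t score = (int64_t)plen, min_score = (int64_t)plen;
  for (uint64_t j = 0; j < tlen; ++j) {
    // Advance one text character; the returned carry updates the last-row score
    score += bpm_advance_block(PEq[(uint8_t)T[j]], mask, Pv, Mv, 1, &Pv, &Mv);
    if (score < min_score) min_score = score;
  }
  return min_score;
}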

III SIMD algorithms

Using general scores, some techniques have been proposed to exploit SIMD instructions in order to compute several cells of the DP matrix at once. The key idea is to allocate several cells into a computer register and compute the DP matrix by blocks [Alpern et al., 1995]. These algorithms differ in how they arrange the cells into the registers (i.e. by columns [Rognes and Seeberg, 2000], anti-diagonals [Wozniak, 1997] or interleaved [Rognes, 2011]). The main challenge of these methods is to incorporate all the dependencies into the SIMD computation without introducing major slowdowns. In this regard, Farrar's approach [Farrar, 2007] proves to be the most efficient, achieving 2x-8x speedups over previous approaches.

uint64_t bpm_compute_edit_distance() {
  // Initialize search
  bpm_init_search(...);
  // Advance in DP bit-encoded matrix
  uint64_t min_score = DISTANCE_INF; // -1
  uint64_t text_pos;
  for (text_pos = 0; text_pos < text_length; ++text_pos) {
    // For each text character, advance every 64-bit block of the encoded
    // column (bpm_advance_block, Figure 5.2) and update min_score with the
    // score of the last pattern row
    ...
  }
  return min_score;
}

Figure 5.3: BPM Implementation

In practice, this method is widely used and some mappers based on the SWG distance implement it [Langmead and Salzberg, 2012; Li, 2013].

5.2.3 Approximated tiled computations

In a more practical context, as the pattern gets longer, most of these algorithms become truly impractical. Despite many efforts towards reducing their complexity, the final running time gets dominated by their quadratic underlying complexity. Moreover, in order to compute the complete alignment transcript, the full DP matrix has to be stored in memory; or at least the band for a given e threshold. At the same time, note that the maximum size of the band may not be known; in which case many algorithms would make a blind estimation of it.

Because of the tight dependencies between the cells of the DP matrix, no lossless partition of the computations seems feasible. Here, we propose to give up on computing the exact edit distance and instead establish accurate minimum and maximum edit distance bounds using a tiled partition of the problem. In this way, we tile the DP matrix into horizontal tiles of height h. Figure 5.4 shows a diagram depicting the tiles and how they define the different regions of the band to compute.

In this way, we propose independently computing the region of the band overlapping each tile. Given n tiles, we compute the optimum alignment within each one, {a_{t_0}, ..., a_{t_{n-1}}}, starting and ending anywhere in the tile frontiers. Note that nothing guarantees that the tiled optimum alignments a_{t_i} are connected. However, we can easily establish bounds on the global-optimum alignment δ_e(P, T).

Figure 5.4: Tiled dynamic programming table and dimensions

Lemma 5.2.1 (Minimum edit distance). Given P, T, and {a_{t_0}, ..., a_{t_{n-1}}} the optimum alignments for each tile, the global-optimum alignment distance δ_e(P, T) is greater than or equal to the sum of the distances of the tile alignments.

$\sum_{i=0}^{n-1} \delta_e(a_{t_i}) \leq \delta_e(P, T)$

Proof. If the distance of the global-optimum alignment δ_e(P, T) were lower than the sum of the optimum alignments of the individual tiles, then at least one a_{t_i} would not be optimal within its tile (its tile-restricted distance, given by δ_e(P, T) between the boundaries of the tile, would be lower), which contradicts the assumption.

Lemma 5.2.2 (Maximum edit distance). Given P, T, {a_{t_0}, ..., a_{t_{n-1}}} the optimum alignments for each tile, and {(i_{t_0}, f_{t_0}), ..., (i_{t_{n-1}}, f_{t_{n-1}})} the corresponding initial and final positions of each tiled alignment, the global-optimum alignment distance δ_e(P, T) is lower than or equal to the sum of the distances of the tile alignments plus the cost of linking the final and initial positions of consecutive tiles.

$\delta_e(P, T) \leq \sum_{i=0}^{n-1} \delta_e(a_{t_i}) + \sum_{i=1}^{n-1} |f_{t_{i-1}} - i_{t_i}|$

Proof. Linking consecutive tile alignments with as many deletions or insertions as needed yields a valid alignment. Therefore, the global-optimum alignment δ_e(P, T) can only improve on such an alignment.

The advantage of the tiled approach is that it enables independent – and potentially parallel – computation of each alignment tile. Also, the overall memory requirements can be reduced to those of one single tile. But more importantly, note that if the minimum edit distance bound of a given tile is greater than the overall error allowed (e), a filter based on this method can directly filter out the corresponding candidate. Although the approach is no longer an exact filter – it no longer guarantees fully specific results –, in practice it serves as a very accurate bounding tool on the exact edit distance. Moreover, it establishes an accurate estimation of the maximum band that subsequent full-fledged DP algorithms can take into account to derive the actual global-optimum alignment and save computations. Overall, tiled computations not only help to improve filter performance, but also allow one to derive distance bounds on the filtering candidates so as to rank them and decide which candidates are promising and which can be left out (section 5.5.3).
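As a small illustration of how the two lemmas are combined, the sketch below (illustrative names, not the production code) takes the per-tile optimal distances and the text columns where each tile-optimal alignment starts and ends, and derives a lower and an upper bound on the global edit distance:

#include <stdint.h>

void tiled_edit_distance_bounds(const uint64_t* tile_distance,
                                const uint64_t* tile_begin,
                                const uint64_t* tile_end,
                                const uint64_t num_tiles,
                                uint64_t* min_bound, uint64_t* max_bound) {
  uint64_t lower = 0, link_cost = 0;
  for (uint64_t i = 0; i < num_tiles; ++i) {
    lower += tile_distance[i];                  // Lemma 5.2.1: sum of tile optima
    if (i > 0) {                                // Lemma 5.2.2: cost of linking tiles
      const uint64_t f = tile_end[i-1], s = tile_begin[i];
      link_cost += (f > s) ? (f - s) : (s - f); // |f_{t_{i-1}} - i_{t_i}| indels
    }
  }
  *min_bound = lower;             // delta_e(P,T) >= min_bound
  *max_bound = lower + link_cost; // delta_e(P,T) <= max_bound
}

A filter can discard a candidate as soon as min_bound exceeds the allowed error e, and pass max_bound on as an effective band to the final DP computation.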

5.2.4 Embedding distance metrics

As a matter of fact, different distance metrics strongly condition the performance of these algorithms. Most of the optimisation techniques presented so far are only compatible with the edit distance – or close variations of it. As a result, the practical running times of fast edit distance algorithms – like BPM – and of biologically inspired algorithms – like Smith-Waterman-Gotoh (SWG) – differ by orders of magnitude. In spite of this, the latter are widely preferred for their flexibility and accurate modelling of the biological problem.

Contrary to common belief, using biological distances does not rule out the application of other exact filter techniques to enhance approximate matching searches. In this way, we propose embedding biological distance metrics into simpler metrics – like the edit distance – so as to bound distances using faster algorithms. Ultimately, algorithms based on the biological metric will have to be used to derive the exact distance. However, these computations can be greatly constrained using bounds derived from edit-distance-based algorithms.

Upper bound in embedded metric spaces

To illustrate this idea, we will assume a widely used SWG metric as the biologically inspired distance, using gap opening score go, gap extension score ge, matching score ms, and mismatch score xs (irrespective of the mismatched nucleotide). Then, let A_e be the optimal alignment between the input strings v and w, so that δ_e(v, w) = e. Note that the transformation operations (i.e. mismatches and indels) are the same for both metrics. Then, A_e constitutes a valid alignment in both metric spaces (i.e. A_swg). However, A_swg might not be optimal in SWG-space. Nevertheless, A_swg represents an upper bound on the distance – or a lower bound on the score – of the optimum alignment using SWG.

Lemma 5.2.3 (Upper bound on the distance in embedded metric spaces). Given v, w, Ae the optimum alignment in edit-space, and another metric space Φ defined with the same transformation operations as the edit-space. Then the optimum alignment in Φ-space is upper bounded by Ae scored in Φ-space (maxΦ(v, w)).

δΦ(v, w) ≤ δΦ(Ae) = maxΦ(v, w)

Proof. As edit-space and Φ-space are defined using the same transformation operations, Ae is a valid alignment in both spaces and thus establishes an upper bound on the best distance in Φ-space.

Lower bound in embedded metric spaces

Furthermore, let us assume Ae is the optimum edit alignment between v and w (|v| = |w| = 100) containing 2 mismatches. Also let us assume a widely used scoring scheme φ = (ms, xs, go, ge) = (1, −4, −6, −1). In that case, the optimum alignment in SWG-space is bounded by the score of Aswg in SWG-space (i.e. δswg(v, w) ≥ δswg(Aswg) = 90). But also, it is easy to see that in SWG-space there are only two potential alignments that score better.

O0 = 99 × ms + go = 93 (a)

O1 = 98 × ms + go + ge = 91 (b)

First, note that case (a) is not feasible, as it involves fewer edit operations than the optimum A_e. Therefore we can derive a lower bound on the distance using this observation.

Lemma 5.2.4 (Lower bound on the distance in embedded metric spaces). Given v, w, A_e, and another metric space Φ defined with the same transformation operations as the edit-space, the optimum alignment in Φ-space is lower bounded by the best-scoring alignment with at least e transformation operations.

$\min_{\Phi}(v, w) = \min\{\, \delta_{\Phi}(A_i) \;\text{s.t.}\; |A_i| \geq e \,\} \leq \delta_{\Phi}(v, w)$

Remark. In practice, the problem of finding the potential alignment that minimises the distance with at least e operations can be reduced to the knapsack problem, where the items are the individual transformation operations. This problem is broadly known to have a pseudo-polynomial solution using dynamic programming. In cases like the SWG distance with φ = (ms, xs, gs), the computation reduces to maximising the cost function

$\max_{gaps + misms \geq e} \big[\, ms \cdot (|v| - gaps - misms) - xs \cdot misms - gs \cdot gaps \,\big]$

In conclusion, we can bound the optimum alignment distance in a metric space Φ by computing the optimum alignment in the edit distance space.

minΦ(v, w) ≤ δΦ(v, w) ≤ maxΦ(v, w)

Computing the effective band

Traditionally, in the context of convoluted metric spaces – like SWG – the band used to compute the DP table is usually set beforehand, based on a subjective estimate of the expected worst case. Most of the time, it is left as a parameter to be set by the user according to the problem at hand. However, knowing that the optimal alignment can be bounded using edit distance computations, it seems reasonable to compute the maximum possible band for each case. Recalling the previously presented cases, note that only case (b) is plausible and therefore the maximum number of transformation operations is 2. Hence, a banded SWG algorithm using b = 2 is guaranteed to report the optimal SWG alignment.

Lemma 5.2.5 (Maximum number of transformation operations). Given v, w, A_e, and a metric space Φ defined with the same transformation operations as the edit-space, the maximum number of transformation operations b that the optimum alignment O_Φ can have is bounded by

$b = \max\{\, |A_i| \;\; \forall A_i \;\text{s.t.}\; \delta_{\Phi}(A_i) \leq \delta_{\Phi}(A_e) \,\}$

In this way, using fast edit distance algorithms we can estimate the maximum band needed to compute the optimal alignment in other metric spaces. As a result, the band is not a fixed parameter that needs to be determined in advance, but a magnitude that can be effectively computed or derived as a byproduct of edit-distance-based filters.
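The sketch below illustrates the idea under the simplified scoring scheme of the remark above (a linear penalty gs per gap column instead of the affine go/ge model, which is handled analogously; all parameter names are illustrative). It scores the edit-optimal alignment A_e, here summarised by its number of mismatches and gap columns, and then enumerates larger operation counts, keeping the largest one whose best-case score could still reach that of A_e (Lemma 5.2.5).

#include <stdint.h>

// Score under the simplified scheme: ms per match, penalties xs per mismatch
// and gs per gap column (all positive magnitudes)
static int64_t phi_score(int64_t len, int64_t misms, int64_t gaps,
                         int64_t ms, int64_t xs, int64_t gs) {
  return ms * (len - misms - gaps) - xs * misms - gs * gaps;
}

// Effective band b: maximum number of operations of any alignment that could
// still score at least as well as the edit-optimal alignment A_e
int64_t effective_band(int64_t len, int64_t e_misms, int64_t e_gaps,
                       int64_t ms, int64_t xs, int64_t gs) {
  const int64_t opt_score = phi_score(len, e_misms, e_gaps, ms, xs, gs);
  int64_t band = e_misms + e_gaps;
  const int64_t cheapest = (xs < gs) ? xs : gs; // cheapest single operation
  for (int64_t ops = band + 1; ops <= len; ++ops) {
    // Best-case score of any alignment with exactly 'ops' operations
    const int64_t best = ms * (len - ops) - cheapest * ops;
    if (best < opt_score) break; // such alignments can never beat A_e
    band = ops;
  }
  return band; // e.g. len=100, 2 mismatches, (ms,xs,gs)=(1,4,6) yields b=2
}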

Ranking candidates and δ-strata groups

Moreover, embedding distance metrics can serve to discard potentially distant matches that are uninteresting to the search. Given a set of candidates C = {w_0, ..., w_p} and their corresponding edit distances {δ_e(w_0), ..., δ_e(w_p)} – computed using any exact filter –, we can rank all the candidates according to their potential SWG score.

$\{\max_{swg}(w_0), ..., \max_{swg}(w_p)\}$

To illustrate the relevance of these bounds, let us step back for a moment and imagine that we want to determine the δ_swg-strata group under the SWG distance. As mentioned before, we require a sufficiently complete set of candidates to accurately assess this magnitude. However, it suffices to compute the SWG score of the two best matches to resolve the actual δ_swg-strata group. In this scenario we could align all candidates using a fast edit distance algorithm. Then, we could proceed in order, computing the actual SWG distance of each candidate. Before SWG-aligning each candidate, we could test whether its best possible score could improve on the best two matches found so far. If not, we could directly discard that candidate and avoid the costly computation of its SWG distance.
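A minimal sketch of this pruning loop is shown below; the helpers swg_align and max_swg_score are assumed to exist elsewhere and are named for illustration only, and candidates are assumed to be sorted by increasing edit distance:

#include <stdint.h>

typedef struct {
  const char* text;       // candidate region
  uint64_t length;
  uint64_t edit_distance; // computed by a fast exact filter (e.g. BPM)
} candidate_t;

// Assumed helpers: full SWG alignment and the score bound of Lemma 5.2.3
extern int64_t swg_align(const char* pattern, const candidate_t* candidate);
extern int64_t max_swg_score(const char* pattern, uint64_t edit_distance);

void rank_and_verify(const char* pattern,
                     const candidate_t* candidates, const uint64_t num_candidates,
                     int64_t* best, int64_t* second_best) {
  *best = INT64_MIN;
  *second_best = INT64_MIN;
  for (uint64_t i = 0; i < num_candidates; ++i) {
    // Prune: even in the best case this candidate cannot enter the top two
    if (max_swg_score(pattern, candidates[i].edit_distance) <= *second_best) continue;
    const int64_t score = swg_align(pattern, &candidates[i]); // expensive step
    if (score > *best) { *second_best = *best; *best = score; }
    else if (score > *second_best) { *second_best = score; }
  }
}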

Furthermore, ranking leads to the natural idea that the maximum error allowed e does not have to be a fixed parameter, but can be derived as the search progresses. Basically, when two matches have been found, there is no need to explore more distant matches – but potential matches in between. In this way, the maximum e is approximated as a byproduct of the search.

This is the very essence of embedding distance metrics and bounding alignments. As the search progresses, not all candidates need to be aligned. It simply suffices to compute an accurate enough bound on their distance so as to safely discard irrelevant alignments. In case we are forced to compute the actual alignment, the bounds can serve to safely avoid unnecessary computations outside the band.

5.3 Approximate filters. Static partitioning techniques

Static partitioning techniques basically define a fixed partition of the pattern agnostic of the content of the text and the pattern itself. In brief, these techniques are based on counting substrings in both the pattern and the candidate to assess similarity under certain error tolerance. In this way, these techniques are able to derive fixed sets of conditions that the candidates must satisfy in order to match the pattern.

5.3.1 Counting filters

In a nutshell, counting filters work in two basic steps. First, they build a profile based on counting occurrences of q-grams in the pattern. Afterwards, they perform the same counting procedure on the candidate text. If the counting profiles are similar enough – depending on the maximum distance allowed – then the candidate may match the pattern and is retained. These techniques stand out because they are simple, yet very fast, and achieve good filtration rates.

Counting characters

In the context of online filtering, the most basic version of this family is based on counting characters [Jokinen et al., 1996]. This simple filter holds the very essence of all the filters in this family. Basically, we first count the characters in the pattern and we build a histogram of the number of times that each letter of the alphabet appears in it. Intuitively, any region of the text matching the pattern has to contain a similar count of characters. Therefore, for each region of the text we count characters. In general, any text matching the pattern with e errors must have at least |P | − e characters from the pattern. Note that if the pattern and the text match perfectly, the counts must be exactly the same. Hence, we can state that if the count differs in at most e characters the candidate can be a valid match, otherwise it is not possible and can be filtered out. Note that the filter does not consider the relative ordering of the characters. Therefore it is easy to see that it can potentially generate false positives.

Lemma 5.3.1 (Counting lemma). Given P, e, and d(), the counting filter is defined as a binary classifier, modelling a necessary condition that the candidate w must satisfy in order to align P up to error e.

$\varphi_{d,e}(w, P) = \begin{cases} 1 & \text{if } \sum_{c \in P} \max\{occ(c, P) - occ(c, w), 0\} \leq e \\ 0 & \text{otherwise} \end{cases}$

The computation of the pattern profile is O(|P|) and is assumed to be amortised, as we check multiple candidates for each single pattern. In that way, the filter complexity is dominated by the cost of checking the candidates, which operates in linear time w.r.t. the candidate length (O(|w|)). Despite its simplicity, this simple filter can yield high filtration efficiency rates depending on the pattern length and the alphabet size. Note that a 26-character alphabet is much more diverse and restrictive than the DNA alphabet (i.e. Σ = {A, C, G, T, N}). In practice, this filter is rarely used and it mainly serves as an introduction to logical extensions of this idea.
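A straightforward instance of the counting lemma could look as follows (an illustrative sketch over byte characters, not GEM's code): the pattern histogram is built once and, for each candidate, the number of pattern characters missing from the candidate is compared against e.

#include <stdint.h>
#include <string.h>

// Returns 1 if the candidate may match P up to e errors, 0 if it can be discarded
int counting_filter(const char* pattern, const uint64_t plen,
                    const char* candidate, const uint64_t wlen, const uint64_t e) {
  int64_t occ[256];
  memset(occ, 0, sizeof(occ));
  for (uint64_t i = 0; i < plen; ++i) occ[(uint8_t)pattern[i]]++;   // + occ(c,P)
  for (uint64_t i = 0; i < wlen; ++i) occ[(uint8_t)candidate[i]]--; // - occ(c,w)
  uint64_t missing = 0;
  for (int c = 0; c < 256; ++c) {
    if (occ[c] > 0) missing += (uint64_t)occ[c]; // max{occ(c,P)-occ(c,w), 0}
  }
  return missing <= e; // necessary (not sufficient) condition to match
}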

Counting q-grams

At this point, it seems natural to extend the idea of counting characters to counting q-grams. In fact, profiling larger components of the pattern increases the specificity of the filter. Q-gram filters are based on counting q-grams to discard candidates that contain fewer than a given number of q-gram occurrences [Jokinen and Ukkonen, 1991; Ukkonen, 1992]. The basic idea behind q-gram filters is depicted in figure 5.5. As the figure shows, the maximum number of q-grams that each potential error can modify depends on the q-gram length. Therefore, for a given maximum error e there is a minimum number of q-grams shared by the pattern and any potential match.

Figure 5.5: Q-gram lemma. Counting q-grams in a candidate

Lemma 5.3.2 (q-gram lemma). Given P, e, and d(), if the candidate w aligns the pattern P (i.e. d(w, P) ≤ e) then P and w share at least |w| + 1 − q × (e + 1) common q-grams.

In their first formulation, q-gram filters were used in the context of online searches. The linear cost of counting the q-grams of the pattern (i.e. histogram or profile creation) is amortised over the subsequent evaluations of the q-gram lemma on the candidates. For that, the algorithm applies the q-gram lemma on a sliding window that traverses every candidate and determines whether the candidate is accepted or filtered out. Counters are reused between shifts of the sliding window and therefore the algorithm takes O(|w|) per candidate in the worst case.
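For completeness, the sketch below shows a simplified q-gram check over the DNA alphabet (an illustrative version that recomputes the counts for each candidate instead of reusing counters between window shifts, and assumes a small q so the counter table fits in memory): it counts how many q-grams of the candidate also occur in the pattern and applies the threshold of the q-gram lemma.

#include <stdint.h>
#include <stdlib.h>

// 2-bit packing of a DNA q-gram (A,C,G,T; other symbols map to A – sketch only)
static uint64_t qgram_code(const char* s, const uint64_t q) {
  uint64_t code = 0;
  for (uint64_t i = 0; i < q; ++i) {
    uint64_t b = 0;
    switch (s[i]) { case 'C': b = 1; break; case 'G': b = 2; break; case 'T': b = 3; break; default: b = 0; }
    code = (code << 2) | b;
  }
  return code;
}

// Returns 1 if the candidate may match the pattern up to e errors (q-gram lemma)
int qgram_filter(const char* pattern, const uint64_t plen,
                 const char* candidate, const uint64_t wlen,
                 const uint64_t q, const uint64_t e) {
  uint32_t* profile = (uint32_t*)calloc(1ull << (2*q), sizeof(uint32_t));
  for (uint64_t i = 0; i + q <= plen; ++i) profile[qgram_code(pattern + i, q)]++;
  uint64_t shared = 0;
  for (uint64_t i = 0; i + q <= wlen; ++i) {
    const uint64_t code = qgram_code(candidate + i, q);
    if (profile[code] > 0) { profile[code]--; ++shared; } // shared q-gram (with multiplicity)
  }
  free(profile);
  const int64_t threshold = (int64_t)(wlen + 1) - (int64_t)(q * (e + 1));
  return threshold <= 0 || (int64_t)shared >= threshold;
}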

e=3%       |P|=75     |P|=100    |P|=150    |P|=500
|q|=3      85.92%     87.98%     88.84%     70.63%
|q|=4      89.67%     91.76%     92.16%     79.72%
|q|=5      89.45%     92.23%     92.90%     83.70%
|q|=6      87.83%     91.66%     92.88%     85.07%
|q|=7      85.48%     90.60%     92.57%     85.37%

Table 5.2: Q-gram filter efficiency on candidates drawn from the human genome (GRCh37) allowing up to 3% error rate

The filtration efficiency of q-gram filters strongly depends on the distribution of the candidates (i.e. the string content), the length of the pattern, the length of the q-gram, and the maximum error. As with many other filtering techniques, there is little benefit in benchmarking their efficiency on generic random texts. Instead, it is more useful to base experiments on real application data, that is, candidates generated by a real mapping tool (i.e. GEM). For that, in tables 5.2 and 5.3 we present a small benchmark to capture the potential filtration efficiency of this family of filters. In general, with longer pattern lengths and longer q-gram lengths, the filter achieves better filtration efficiency. However, beyond a certain pattern length, the filtration efficiency drops, as q-grams of that length are no longer specific enough. Also, as the error rate increases, the filtration efficiency of q-gram filters decreases, as the filter has to account for more errors.

It is important to note that these filters are greatly limited by the q-gram length, as the memory footprint increases exponentially, requiring 4^|q| counters (e.g. using 32-bit counters, for q = 7 the counters occupy 64 KB). Also note that memory consumption is not the main limitation, but rather the penalties derived from randomly accessing potentially large regions of memory (e.g. cache misses, page faults, etc).

e=6%       |P|=75     |P|=100    |P|=150    |P|=500
|q|=3      66.51%     67.17%     75.07%     81.32%
|q|=4      74.74%     79.45%     87.26%     86.53%
|q|=5      74.53%     80.96%     89.72%     88.17%
|q|=6      70.72%     78.72%     89.57%     88.70%
|q|=7      65.15%     74.71%     88.36%     88.78%

Table 5.3: Q-gram filter efficiency on candidates drawn from the human genome (GRCh37) allowing up to 6% error rate

Later on, this technique was adapted to be used in conjunction with indexes. Many specialised q-gram indexes have been proposed to accelerate computing the q-gram profile of certain regions of the reference text (e.g. overlapping blocks of text [Burkhardt and Kärkkäinen, 2003]). In any case, the basics of the approach remain the same: spotting a region of the text or candidate whose q-gram profile matches that of the pattern, as the q-gram lemma states. However, the indexed formulation of the algorithm allows skipping entire regions of the reference text at the expense of accesses to the index – and of the memory space devoted to the index. Note that in these cases the cost of building the index is also considered amortised. Additionally, there have been some proposals trying to take into account the relative order between q-grams [Sutinen and Tarhio, 2004]. In this way, the resulting filters turn out to be more accurate and report fewer false positives – at the cost of extra computations. It is also worth mentioning a variation of regular q-grams which allows gaps in the q-grams (i.e. approximate seeds) and increases the overall efficiency of q-gram counting approaches [Burkhardt and Kärkkäinen, 2003]. Nevertheless, gapped q-grams have proven to be quite a theoretically challenging approach as opposed to the simplicity of ungapped q-grams [Rasmussen et al., 2006; Lee et al., 2007].

As a matter of fact, q-gram filters are often used in real tools for sequence alignment. Notable examples are SWIFT [Rasmussen et al., 2006], STELLAR [Kehr et al., 2011], RazerS [Weese et al., 2009], and RazerS 3 [Weese et al., 2012]. There have also been implementations based on gapped q-grams like QUASAR [Burkhardt et al., 1999] or SHRiMP 2 [David et al., 2011]. In fact, many practical tools incorporate in different ways the basics of q-gram counting, namely mrFAST [Alkan et al., 2009], mrsFAST [Hach et al., 2010], Hobbes [Ahmadi et al., 2012], SNAP [Zaharia et al., 2011], and many others.

Counting q-samples

In the context of indexed searches, modern index data structures enable fast access to all q-gram occurrences in the text for an arbitrary q. In turn, these indexes allow querying larger q-grams, which are deemed more specific than shorter ones. In this scenario, the main algorithmic bottleneck is computing the intersection of the positions associated with each q-gram – towards retrieving the regions that satisfy the q-gram lemma. This leads to another category of filtering algorithms that aim for larger and more specific q-grams, which produce fewer candidate positions that can be intersected easily.

Figure 5.6: Q-sample lemma. Counting Q-samples in a candidate

In some contexts these filters are referred to as h-sample filters. Basically, they extract contiguous samples of the pattern (q-samples), query them against the index and join the resulting sets of text positions so as to generate all the potentially matching candidates. As the q-samples are contiguous, each potential error in the match can remove the occurrence of at most a single q-sample in the candidate. Figure 5.6 shows the mechanics of q-sample filtration. In this way, we can easily derive a necessary condition for a candidate to match the pattern: the q-sample lemma.

Lemma 5.3.3 (Q-sample lemma). Let e be the maximum distance allowed and d() a distance metric. Given a pattern P = f_0...f_{h-1} as the concatenation of h q-samples (h > e), if the candidate w aligns the pattern P (i.e. d(w, P) ≤ e) then P and w share at least h − e common q-samples.

In general, q-sample filters – and those derived from this idea – require ⌊|P|/q⌋ accesses to the index, plus the time needed to sort the q-sample positions, O(|C| log(|C|)) (C being the set of all q-sample candidate positions), and to filter out those that do not satisfy the q-sample lemma. Note that theoretical complexity bounds do not capture how sensitive these kinds of approaches are to the total number of candidates reported by each q-sample. Depending on the length of the q-sample and the content of the reference text, the total amount of candidates involved could range from a few to large parts of the reference.

Similarly, the q-sample lemma can be formulated for online searching. In this case, an exact online search is used to look for the minimum number of q-samples that the candidate needs to contain in order to possibly match the pattern. In some proposals, tries or suffix-trees of the pattern are employed to accelerate the search for the q-samples. In any case, the running time is proportional to the candidate length, O(|w|).

5.3.2 Factor filters

As q-samples increase in length, the total number of samples decreases, up to the point where we reach the case h = e + 1. In that case, we only need one sample occurrence in a region of the text to consider it a candidate. Filters falling into this particular case are known as factor filters. Provided a full-text index that can query arbitrarily long chunks of the pattern, these filters aim to minimise the total number of candidates produced without losing sensitivity.

For a given e and P, factor filters divide the pattern into e + 1 contiguous chunks (factors). Afterwards the factors are queried against the index to directly generate all candidate positions – which already satisfy the q-sample lemma. Note that in the worst case we can have matches containing e errors. Intuitively, if we partition the pattern into e + 1 factors, no matter how the errors are distributed, every valid match will always contain a chunk matching without errors (i.e. an exact match). These filters are commonly presented as an example of how to reduce approximate searches to exact searches (i.e. without exploring all possible combinations of error events on the pattern). This is the very principle behind factor filters, which guarantees complete results, and it is usually explained in terms of the pigeonhole principle. Instantiated in this particular context, if m pigeons were to be distributed into m + 1 holes, then at least one hole would always remain empty.

Figure 5.7: Factor filter up to 2 errors (3 factors; at least one matching exactly)

Lemma 5.3.4 (Factors lemma). Given P, w, e, and d(), such that d(P, w) ≤ e, if P is partitioned into e + 1 non-overlapping substrings P = f0, ..., fe, then at least one fi occurs as a factor of w [Baeza-Yates and Perleberg, 1992].

Note that this lemma imposes no restriction on the length of the individual factors. Therefore any partition into e + 1 non-overlapping factors will be fully sensitive up to e errors. In this way, different partitions for the same search depth will produce different numbers of candidates depending on the content of the text. In order to get a grasp of the expected number of candidates generated by these filters, let T be a random text over the alphabet Σ following a Bernoulli model. In that case, let F be the random variable modelling the number of occurrences of a given factor f. Then the probability of occurrence of a given character is 1/|Σ| = 1/σ (assuming i.i.d. characters). Thus, the probability of the factor occurring at a given position of the text is given by:

$Pr[F > 0] = \frac{1}{\sigma^{|f|}}$

Consequently, the expected number of occurrences of the factor f in the text T is:

$E[F] = \sum_{i=0}^{|T|-|f|} Pr[F > 0] \leq \frac{|T|}{\sigma^{|f|}}$

In this way, it is easy to see that the number of candidates grows exponentially as the factor shrinks in length. In practice, it is very important to minimise the number of false positives passed on to the next filter (e.g. a computationally expensive exact filter). In order to achieve balanced filtering algorithms, factors have to be selected appropriately so as to avoid unbalanced cases (i.e. unspecific factors that potentially produce impractical amounts of candidates). Thus, if no further information is used, factors are usually configured by dividing the pattern into e + 1 equal parts. In the case of a random text, if all factors have the same length then all factors have the same probability of occurring in the text. In turn, this can be proven to be the expected optimal partition, i.e. the one that minimises the expected number of candidates:

$E[\text{Candidates}] = \min \sum_{f_i \in P} E[F_i]$

However, the reader should note that assuming a random distribution of the reference text is far from being a realistic assumption.

Another powerful observation is that factors have to be non-overlapping but not necessarily contiguous. Leaving gaps between factors might seem odd at first; however, it turns out to be quite handy in the case of sequenced reads containing uncalled regions (e.g. stretches of Ns acting as wildcards). In this case, factors can be picked from the known regions and the factor lemma still holds. Hence, factor filters can "recover" noisy sequenced reads by applying filtering restrictions on the high-quality parts of the reads and filling the uncalled segments during subsequent candidate verification.
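A sketch of the corresponding pattern partition step is shown below (illustrative code; a real factor filter would additionally query each factor against the index to generate candidate positions). It splits the pattern into e + 1 non-overlapping factors of near-equal length, as suggested above for the default case.

#include <stdint.h>

typedef struct { uint64_t offset; uint64_t length; } factor_t;

// Fills 'factors' (caller-allocated, e+1 entries) and returns the number of factors
uint64_t partition_into_factors(const uint64_t pattern_length, const uint64_t e,
                                factor_t* factors) {
  const uint64_t num_factors = e + 1;
  const uint64_t base = pattern_length / num_factors;
  const uint64_t rest = pattern_length % num_factors;
  uint64_t offset = 0;
  for (uint64_t i = 0; i < num_factors; ++i) {
    const uint64_t len = base + (i < rest ? 1 : 0); // distribute the remainder
    factors[i].offset = offset;
    factors[i].length = len;
    offset += len;
  }
  return num_factors;
}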

5.3.3 Intermediate partitioning filters

As the error increases, factor filters produce partitions of smaller factors. For that reason, factor filters are strongly limited by the minimum factor length below which the number of candidates produced becomes impractical. To overcome this limitation and scale up to higher error rates, factor filters are generalised towards using approximate factors. In this way, intermediate partitioning filters [Myers, 1994; Navarro et al., 2001] yield more specific seeds while performing fully sensitive searches. So as to report complete sets of candidates, these filters follow the pigeonhole principle for any number of pigeons and holes.

Lemma 5.3.5 (Pigeonhole principle). If m pigeons are in n holes, then at least one hole contains at least ⌈m/n⌉ pigeons. Likewise, at least one hole contains at most ⌊m/n⌋ pigeons.

Lemma 5.3.6 (Intermediate partitioning lemma). Given P, w, e, and d() (s.t. d(P, w) ≤ e), if P is partitioned into s non-overlapping substrings P = f_0, ..., f_{s-1} s.t. 1 ≤ s ≤ e + 1, then at least one f_i occurs as a factor of w with distance at most ⌊e/s⌋.

Essentially, this lemma holds the simple idea of searching for approximate matches starting from the least distant regions of the match. In other words, searching first for the better conserved parts of the sequenced read (see figure 5.8).

Figure 5.8: Intermediate partitioning filter up to 2 errors (2 factors; at least one with 1 error or fewer)

From a more general perspective, intermediate partitioning not only poses a necessary condition for a given candidate to match the pattern, but also provides a method for performing arbitrarily deep searches by distributing errors across factors. In any case, the approximate search of each individual factor f_i up to a given number of errors e_i is yet another approximate string matching problem (i.e. a reduction to smaller approximate searches). However, these smaller approximate searches are often not amenable to further recursive filtering: the even smaller factors would saturate and lead to an unfeasible number of candidates. Because of this, these approximate factor

searches are tackled by enumerating the search space of each f_i up to e_i errors (i.e. a neighbourhood search). The hope is that the moderate error degree of each factor allows feasible neighbourhood searches while not generating too many candidates.

This approach offers a wide spectrum of choices to achieve a given search depth. An intermediate partitioning algorithm can vary in (a) the number of non-overlapping factors dividing the read, (b) the length of those factors, and (c) the number of errors assigned to each factor. Note that most descriptions of the method overlook the possibility of using variable-length factors and assume, by default, factors equally distributed over the pattern. In this way, the number of factors used and the assignment of errors to each factor determine the error partition of the filter.

Definition 5.3.1 (Error partition). For a given pattern P, an error partition is a vector of non-negative integers P^e = (e_0, ..., e_{s-1}) assigning errors to each factor of the pattern.

Lemma 5.3.7 (Completeness of an error partition). Given the error partition $P^e = (e_0, ..., e_{s-1})$, the match space searched by the filter using $P^e$ will be complete up to e errors, where

$$e = (s - 1) + \sum_{i=0}^{s-1} e_i$$

Proof. Note that every factor i in the partition will be searched up to $e_i$ errors, which means that all matches containing $e_i$ errors or fewer on that factor will be found. Therefore, any match missed by filtering that factor has to contain more than $e_i$ errors on it. Consequently, any match missed by filtering all factors of the partition $P^e$ has to contain at least one extra error per factor, that is, at least

$$\sum_{i=0}^{s-1} (e_i + 1)$$

errors. Equivalently, the match space found using the partition $P^e$ contains all valid matches up to one error less:

$$\left( \sum_{i=0}^{s-1} (e_i + 1) \right) - 1 = (s - 1) + \sum_{i=0}^{s-1} e_i$$
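To make the bookkeeping of lemma 5.3.7 concrete, the following minimal Python sketch (the function name is ours and purely illustrative) computes the search depth guaranteed by an error partition and checks it on the three example partitions discussed next.

def complete_depth(error_partition):
    """Search depth guaranteed by an error partition (lemma 5.3.7):
    e = (s - 1) + sum(e_i) for a partition into s factors."""
    s = len(error_partition)
    return (s - 1) + sum(error_partition)

# The three equivalent partitions for a 4-error search discussed below:
assert complete_depth((0, 0, 0, 0, 0)) == 4
assert complete_depth((1, 0, 1)) == 4
assert complete_depth((3, 0)) == 4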

Depending on the choice of the error partition – and the size of the factors –, the filter will generate more or fewer candidates. As explained before, there are many possible choices of pattern partitions to resolve the same search, all of them equivalent. For example, for a search up to 4 errors, the partitions $P_0^4 = (0, 0, 0, 0, 0)$, $P_1^4 = (1, 0, 1)$ and $P_2^4 = (3, 0)$ are equivalent (graphically depicted in table 5.4).

$P_0^4$:  | 0 | 0 | 0 | 0 | 0 |

$P_1^4$:  | 1 | 0 | 1 |

$P_2^4$:  | 3 | 0 |

Table 5.4: Three equivalent partitions for searching up to 4 errors

However, these three partitions are expected to generate a different number of candidates and to run with different complexities. For instance, $P_0^4$ has more factors than any other partition. Presumably, these factors will be shorter than those of $P_1^4$ and $P_2^4$. The smaller the factors, the more candidates the filter can potentially generate and, in the worst case, $P_0^4$ factors could be so small that the number of generated candidates could be impractical. For that reason, $P_1^4$ and $P_2^4$ merge factors and search each of them approximately. In the extreme case of $P_2^4$, half of the pattern will be searched up to 3 errors – which will presumably constrain the number of candidates more than any other partition. However, the performance of $P_2^4$ will be dominated by the time spent enumerating all the possibilities up to 3 errors (i.e. combinatorial search), making this error partition potentially impractical. Likewise, much experimental work on the subject [Siragusa et al., 2013; Navarro et al., 2001] concludes that the best solutions are found by balancing the number of factors and the error assigned to each. Despite many attempts, up to this point there is no precise method to obtain the optimal combination.

5.3.4 Limits of static partitioning techniques

As a matter of fact, the filtering techniques presented so far represent very successful ideas towards effective approximate string matching. Nevertheless, all these techniques face the same limitation. For relatively high error rates, these setups fail to effectively reduce the search space and, as a result, they produce an impractical number of candidates. This can vary across different patterns and even across different seeds of the same pattern. For that reason, some seeds will produce very few candidates while others will generate overwhelming amounts of candidates.

In fact, this does not really depend on the filtering setup itself but on the content of the reference text. It is important to highlight the close relationship between the mappability of the reference genome at hand and the probability that, given a set of seeds, one of them will produce too many candidates (in section 3.3.3 we presented a comprehensive description of this magnitude). Here, for a given filtering approach that generates exact seeds of length l, the mappability $M'_l$ gives an empirical estimation of the average number of candidates this filtering method will produce. For instance, mapping against the human genome, a factor filter up to 2 errors in a read of 108 nt will generate 3 fixed-length exact seeds of 36 nt per sequenced read. In this scenario, the mappability $M'_{36} = 80.12\%$ reveals that roughly 8 out of 10 seeds are expected to be unique. However, the remaining two can lead to multiple candidates, and a minority (< 1%) will generate more than a thousand candidates. As we decrease the size of the seed, the mappability reveals that this probability of finding pathological repeats increases. Likewise, in the case of intermediate partitioning, the mappability at the error rate assigned to the factor can be used to measure the expected abundance of candidates.

In this way, static filtering approaches become limited by the repetitive content of the reference genome, as reflected by the decreasing mappability scores of the individual seeds. With increasing error rates, these techniques saturate, generating an impractical number of candidates and failing to effectively filter the search space.

5.4 Approximate filters. Dynamic partitioning techniques

In general, each seed of a given pattern will produce a different number of candidates. In the worst case, a seed might belong to a repetitive region of the reference genome and produce an impractical number of candidates. As explained earlier, this is strongly linked to the content of the reference genome and the length of the seed. Hence, it seems logical to select arbitrarily large seeds so as to avoid repetitive seeds.

5.4.1 Maximal Matches

This idea is not novel and it has been presented many times in the literature under different names and for different applications. Basically, a MEM (Maximum Exact Match [Gusfield, 1997]) is a substring of a pattern that matches the reference exactly (i.e. without errors) and is maximal (i.e. cannot be extended further in either direction).

Definition 5.4.1 (Maximum Exact Match). Given the text T and the pattern P, s is a MEM iff

$$(occ(s, T) > 0) \wedge (\nexists\, w \in P \text{ s.t. } s \in w \wedge occ(w, T) > 0)$$

Remark. In the particular case that the number of occurrences of the MEM is one, the seed is denoted a MUM (Maximal Unique Match) [Delcher et al., 2003].

Remark. Another important subset of MEMs is the subset of SMEMs (Super Maximal Exact Matches) [Gusfield, 1997; Li, 2013]: MEMs that are not contained in any other MEM of the pattern. For any position of the pattern there is only one SMEM; the longest exact match covering that position.
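As an illustration of these definitions, the following Python sketch extracts MEMs and SMEMs of a short pattern against a reference text using a naive substring test in place of an index; all names are ours, and the quadratic enumeration is for clarity only (a real mapper would obtain the same intervals through FM-index extensions).

def maximal_exact_matches(pattern, text):
    """Naive MEMs: pattern substrings that occur in the text and cannot be
    extended by one character in either direction while still occurring."""
    mems = set()
    n = len(pattern)
    for i in range(n):
        for j in range(i + 1, n + 1):
            if pattern[i:j] not in text:
                continue
            extendable_left = i > 0 and pattern[i - 1:j] in text
            extendable_right = j < n and pattern[i:j + 1] in text
            if not extendable_left and not extendable_right:
                mems.add((i, j))          # stored as pattern intervals [i, j)
    return mems

def super_maximal_exact_matches(pattern, text):
    """SMEMs: MEMs not contained in any other MEM of the pattern."""
    mems = maximal_exact_matches(pattern, text)
    return {(i, j) for (i, j) in mems
            if not any(a <= i and j <= b and (a, b) != (i, j) for (a, b) in mems)}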

Conceptually, extracting MUMs/MEMs/SMEMs from a given pattern is like delimiting seeds with minimum occurrences and maximal length. These types of seed encapsulate the idea of selecting seeds according to the content of the reference so as to maximise uniqueness and, in turn, minimise the number of candidates. These seeds are often found in the context of aligning very large sequences against each other (e.g. whole-genome alignment). In this computationally challenging scenario, the key insight is to find exact matches long enough that they can confidently be assumed to be part of the pair-wise global alignment. In other words, these unique and exact matches should almost always align against the "true match". Afterwards, the seeds are chained together and the gaps between them aligned to produce the final global alignment.

As always, the devil is in the detail and "almost" is no guarantee of accuracy. It is important to realise that the motivation behind these seeding approaches was to enable practical pair-wise alignment between sequences of the size of a genome. Looking for maximal, highly conserved substrings between genomes was used to guide the alignment and drastically reduce the amount of computation needed. However, the suitability of these approaches in the context of mapping is questionable, as they do not guarantee complete searches by themselves. Note that these definitions do not prevent overlaps between seeds and therefore the pigeonhole principle no longer holds. Nevertheless, these techniques are widely used by many popular mapping tools. This reflects the disconnection between formal approximate string matching techniques and some mapping approaches. These tools tend to favour performance – by means of drastically reducing the amount of candidates – at the expense of the ability to perform complete searches. In practice these tools perform adequately, being very competitive; however, there is no formal guarantee that they will not miss an important match in a given experiment.

5.4.2 Optimal pattern partition

At this point, the idea of selecting seeds according to the number of candidates they produce arises naturally. Many algorithms [Xin et al., 2013; Langmead and Salzberg, 2012] perform this selection by excluding repetitive seeds – according to certain thresholds – from computations. However, in most cases this translates into a loss of sensitivity. As opposed to discarding seeds, we could think of enlarging and shrinking non-overlapping factors to achieve a practical compromise between the number of factors, the search depth, and the number of candidates generated (i.e. dynamic partitioning filters). Note that simply minimising the total number of candidates generated is not enough. Here, the key insight is to minimise the number of candidates while guaranteeing a fixed search depth. Then, for a given pattern P and a text T, the problem can be formulated in terms of finding the optimal n-factor partition.

Definition 5.4.2 (Pattern partition). For a given pattern P, a pattern partition is a vector $P^n$ of n non-overlapping substrings of P

$$P^n = (s_0, ..., s_{n-1}) \text{ s.t. } s_i \in P$$

Definition 5.4.3 (Optimal n-factors partition). Given the pattern P and the text T, the optimal n-partition is a pattern partition $P^n_{opt} = (s_0, ..., s_{n-1})$ of P such that the number of candidates generated is minimum.

$$P^n_{opt} = (s_0, ..., s_{n-1}) \text{ s.t. } \min \left\{ \sum_i occ(s_i, T) \right\}$$

The optimal n-partition problem can be formulated as a recurrence over optimal sub-problems and solved using a dynamic programming algorithm for OP(n, 0).

$$OP(n, t) = \begin{cases} occ(P_{t..|P|-1}, T) & \text{if } n = 1 \\[4pt] \min_{i=t+1}^{|P|-2} \left\{ occ(P_{t..i}, T) + OP(n-1, i+1) \right\} & \text{otherwise} \end{cases}$$

Using a precomputed table with the occurrences of every substring of the pattern P, this algorithm runs in $O(n|P|^2)$ worst-case time. Despite there having been some contributions to reduce the cost of this algorithm in the average case [Xin et al., 2016], its overall complexity makes it impractical. However, it is quite useful in establishing a lower bound on the number of candidates that a dynamic partitioning algorithm can achieve.
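The recurrence translates directly into a memoised implementation. The sketch below uses our own naming, a naive occurrence count in place of the precomputed table, and bounds adjusted so that every factor is non-empty; it is only meant to make the lower-bound computation explicit.

from functools import lru_cache

def optimal_partition_cost(pattern, text, n):
    """Minimum total number of candidates over all n-factor partitions of the
    pattern (the OP(n, 0) value of the recurrence above); assumes len(pattern) >= n."""
    m = len(pattern)

    def occ(i, j):                           # occurrences of pattern[i:j] in the text
        count, start = 0, 0
        while True:
            pos = text.find(pattern[i:j], start)
            if pos < 0:
                return count
            count, start = count + 1, pos + 1

    @lru_cache(maxsize=None)
    def OP(k, t):
        if k == 1:
            return occ(t, m)                 # the last factor spans the rest of the pattern
        return min(occ(t, i + 1) + OP(k - 1, i + 1)
                   for i in range(t, m - k + 1))

    return OP(n, 0)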

5.4.3 Adaptive pattern partition

For the reasons previously stated, here we present the adaptive pattern partition algorithm (APP) behind the GEM-mapper [Marco-Sola et al., 2012]; a greedy algorithm able to approximate the n-optimal partition so that the total number of occurrences of each factor is bounded. The key insight of the method is that, given a text T and a pattern P, we would like to maximise the number of factors – and consequently the search depth – while minimising the total number of candidates. Due to the inherently biased content of reference genomes, different patterns exhibit different content properties; some are almost unique as a whole, while others are composed of repetitive substrings. Hence, there is no fixed n that best suits the search needs of every pattern. As opposed to other partition methods, APP does not fix the total number of factors in advance. Instead, it approximates the optimum number of factors by exploring the content of the pattern (i.e. data-aware filtering). Algorithm 6 describes the APP method.

Algorithm 6: Adaptive pattern partition
   input : P, T strings
   output: S set of factors approximating the |S|-optimum partition
 1 begin
 2   j ← |P| − 1
 3   i ← |P| − 2
 4   while i > 0 do
 5     occ ← occ(P_{i..j}, T)
 6     if occ ≤ Threshold then              // Check occurrences
 7       popt ← i
 8       opt ← occ
 9       i ← i − 1
10       s ← MaxSteps
11       while s > 0 ∧ i > 0 do             // Local optimisation stage
12         occ ← occ(P_{i..j}, T)
13         if occ == 0 then break
14         if occ < opt then
15           popt ← i
16           opt ← occ
17         end
18         s ← s − 1; i ← i − 1
19       end
20       S ← S ∪ {P_{popt..j}}              // Add factor
21       j ← popt − 1
22       i ← popt − 1
23     else
24       i ← i − 1                          // Expand factor
25     end
26   end
27   if occ(P_{0..j}, T) ≤ Threshold then   // Add last factor
28     S ← S ∪ {P_{0..j}}
29   end
30   return S
31 end

Without loss of generality, the APP algorithm is described in terms of the backward search of an FM-index. Taking advantage of the FM-index backward search, the algorithm extends a factor candidate towards the beginning of the read until the number of occurrences drops below a given threshold. As a result, APP progresses backwards delimiting factors. Once a given substring $P_{i..j}$ occurs fewer times than the threshold, a local improvement of the factor is performed. From the first eligible position, the algorithm progresses up to MaxSteps further steps in order to reduce the number of candidates of the factor. Therefore, the algorithm selects the shortest factor $f \in \{P_{i-MaxSteps+1..j}, ..., P_{i..j}\}$ that has the least number of occurrences.

Note that given two substrings $P_{i..j}$ and $P_{i+1..j}$ occurring the same number of times in the reference, it is preferable to take the shortest so as to leave the remaining characters to consolidate other factors. In fact, extending a unique factor (i.e. one occurring once in the reference) by several bases – in the fashion of a MEM – can be counterproductive. This not only fails to generate any new candidate positions, but also limits the discovery of further factors – by reducing the remaining bases of the pattern – thus limiting the final search depth of the filter. Also note that the algorithm can, in rare cases, produce zero-candidate factors. Nevertheless, this is not a limitation at all, as zero-candidate factors do not break the pigeonhole principle. Moreover, a zero-candidate factor indicates that at least one error is needed for that factor to match the reference.
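The following Python sketch captures the essence of algorithm 6; the names are ours, and the occ(i, j) callback is assumed to return the number of occurrences of pattern[i:j] in the reference (which the real implementation obtains incrementally from the FM-index backward search).

def adaptive_pattern_partition(pattern, occ, threshold, max_steps):
    """Greedy APP sketch: grow each factor backwards from the end of the
    pattern until it occurs <= threshold times, then spend up to max_steps
    extra positions looking for a shorter or rarer factor."""
    factors = []
    j = len(pattern)                         # current factor is pattern[i:j]
    i = j - 1
    while i > 0:
        if occ(i, j) <= threshold:           # factor specific enough: optimise locally
            best_i, best_occ, steps = i, occ(i, j), max_steps
            while steps > 0 and i > 0:
                i -= 1
                o = occ(i, j)
                if o == 0:
                    break
                if o < best_occ:             # strict '<' keeps the shortest on ties
                    best_i, best_occ = i, o
                steps -= 1
            factors.append((best_i, j))      # add factor pattern[best_i:j]
            j = best_i
            i = j - 1
        else:
            i -= 1                           # expand the factor towards the start
    if j > 0 and occ(0, j) <= threshold:     # add the last factor if specific enough
        factors.append((0, j))
    return factors

# For a toy reference one could use, e.g.:
#   occ = lambda i, j: reference.count(pattern[i:j])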

As a result, the APP algorithm computes an approximation of the optimal partition of the pattern subject to the maximum-number-of-candidates constraint. Note that, by construction, the APP algorithm guarantees that all the factors generated occur fewer than the threshold number of times in the reference. Overall, the total number of candidates to filter is bounded and proportional to the number of factors generated. Hence, the algorithm allows tuning the maximum expected number of candidates generated. In order to evaluate the results of the APP algorithm against the optimal partition, we ran both algorithms using 5x coverage Illumina datasets against the human genome. Experimental comparisons are shown in table 5.5. Notably, for the same n derived by the APP, the optimal partition reduces the number of candidates per pattern achieved by APP by just an average of 3-4 candidates. More importantly, APP correctly computes the optimal n for which the number of candidates remains practical. Compared with the optimal partition and forcing an extra factor – n+1 factors –, the total amount of candidates to verify per region increases by an order of magnitude – making any subsequent verification steps ten times slower. Also note that the optimal n varies with each read, so a fixed n for all reads could lead to even higher numbers of candidates. The goal of the APP is to adapt to each read and approximate the optimum partition in each case. Despite rarely achieving the optimal partition itself, the APP algorithm obtains an overall number of candidates very close to the optimum. At the same time, APP runs in O(|P|) time, making O(|P|) accesses to the index. Thus, the APP algorithm represents a practical, lightweight approach to the optimum partition (which is unfeasible in practice due to its $O(n|P|^2)$ complexity).

             Candidates per read     Factors per read
             75 nt      100 nt       75 nt     100 nt
 APP(n)      12.00      10.33        3.48      4.53
 OPP(n)       8.80       6.53        3.48      4.53
 OPP(n+1)   147.70      59.07        4.48      5.53

Table 5.5: Experimental comparison of APP against OPP using 5x Illumina datasets at 75nt and 100nt. For this experiment APP was run with (Threshold, MaxSteps) = (20, 4).

It is important to realise that the fact that this algorithm greedily approximates the n-optimal partition does not imply that mapping tools based on it become heuristic. On the contrary: for the same number of factors, this algorithm generates more candidates than the optimum partition does, while not limiting the match space explored; it merely includes extra candidates (i.e. false positives) which the optimum partition would have discarded.

The impatient reader might have noted that the APP algorithm does not guarantee a fixed search depth – it just approximates the optimum partition leading to fewer candidates. In this way, there is little room for tuning the algorithm to explore deeper strata and generate complete searches up to arbitrarily high error rates. In the following sections, complementary methods to extend this algorithm will be proposed within a more general approximate string matching framework of chained filtering algorithms. Moreover, APP will be presented as the first stage of composed algorithms that explore the search space progressively.

Tuning the APP algorithm

The APP algorithm allows the parametrisation of the maximum number of candidates allowed per factor (Threshold) and the maximum number of steps allowed to optimise a factor (MaxSteps). In this way, we can vary these two parameters so as to allow more candidates per factor or to further restrict each factor in length towards extracting more factors. From a more global perspective, we would like to evaluate the impact of this parametrisation not only on the number of candidates generated, but also on the total number of mapped sequences and TPs found. Table 5.6 shows the results of this calibration benchmark. Surprisingly, variations in these two parameters do not lead to radically different results and the algorithm is only moderately sensitive to this parametrisation. Results range from the most restrictive case (1,8), leading to 1.22 candidates per read (97.05% mapped and 96.08% TP), to the most candidate-tolerant case (100,2), leading to 42.18 candidates per read (99.93% mapped and 98.99% TP). In both scenarios, the overall number of candidates per read remains moderately low – and even in the most candidate-tolerant case (100,2) most reads end up with just the one true-positive candidate – while the number of mapped reads and the TP count remain quite high.

 Threshold  MaxSteps  Factors/read  Candidates/read  FM-Index accesses (M)  Mapped   TP
     1         8          4.05           1.22               8.42            97.05%  96.08%
               6          4.05           1.22               8.42            97.05%  96.08%
               4          4.05           1.22               8.42            97.05%  96.08%
               2          4.05           1.22               8.42            97.05%  96.08%
     2         8          3.77           1.53               8.60            98.24%  97.22%
               6          3.91           1.56               8.96            98.24%  97.22%
               4          4.06           1.60               9.38            98.24%  97.23%
               2          4.16           1.77              10.00            98.24%  97.23%
     5         8          3.63           2.12               9.75            99.07%  98.05%
               6          3.84           2.28              10.56            99.07%  98.06%
               4          4.14           2.45              11.61            99.08%  98.07%
               2          4.29           3.19              13.61            99.08%  98.08%
    10         8          3.66           2.85              11.82            99.70%  98.60%
               6          3.92           3.25              13.33            99.70%  98.61%
               4          4.25           3.50              14.70            99.70%  98.62%
               2          4.46           5.45              19.54            99.70%  98.64%
    20         8          3.73           3.96              14.55            99.82%  98.74%
               6          4.10           4.92              17.57            99.82%  98.75%
               4          4.33           5.15              18.72            99.82%  98.77%
               2          4.71          10.06              30.56            99.82%  98.79%
    50         8          3.87           6.93              21.57            99.90%  98.88%
               6          4.29           7.96              24.95            99.90%  98.90%
               4          4.44          10.00              29.96            99.90%  98.91%
               2          5.05          22.45              59.12            99.91%  98.93%
   100         8          4.03          13.07              35.65            99.93%  98.95%
               6          4.37          12.31              34.93            99.93%  98.96%
               4          4.59          19.87              52.37            99.93%  98.98%
               2          5.19          42.18             103.50            99.93%  98.99%

Table 5.6: Mapping benchmark for different (Threshold, MaxSteps) values using simulated Illumina-like 1M x 100nt reads (GRCh37)

5.5 Chaining filters

As presented above, many filters can be adapted to online searches – for a given reference text –, to indexed searches – using a precomputed index structure –, or to filter individual candidates. Also, the performance of these techniques varies from the most lightweight filters – like q-gram filters or factor filters – to the most computationally intensive – like exact filters. Despite being the most expensive, only exact filters guarantee full specificity. Therefore, in order to design efficient algorithms, we must aim to reduce the number of candidates passed on to an exact filter.

Within this unified vision of filtering algorithms, nothing prevents us from chaining several filters together so as to progressively reduce the set of candidates to only those that actually match the pattern (true positives). Thus, every filter operates on a list of candidates – or possibly an index structure – and outputs a refined list of candidates. In order to completely discard false positives from the final list of candidates, an exact filter is always needed at the end of the chain.
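Conceptually, a horizontal chain is nothing more than a left-to-right composition of candidate-refinement functions; a minimal sketch (with purely illustrative names) could look as follows.

def chain_filters(candidates, filters):
    """Apply each filter in turn to the surviving candidates. Filters are
    ordered from the cheapest/least specific to the final exact filter,
    which leaves only true matches."""
    for f in filters:
        candidates = f(candidates)           # each filter returns a refined list
        if not candidates:
            break                            # nothing left to verify
    return candidates

# e.g. chain_filters(seed_candidates, [qgram_filter, tiled_bpm_filter, swg_verify]),
# where the three filter functions are hypothetical stand-ins for the stages below.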

In this section we present a general framework to progressively refine a list of candidates (horizontal chaining, section 5.5.1) and how to combine filters to perform arbitrarily deep searches without performing repeated computations (vertical chaining, section 5.5.2). Furthermore, we reflect on how chained filtering operates and how it can be modified so as to stop the algorithms whenever satisfactory results are obtained (breadth-first vs depth-first filtering search, section 5.5.3).

5.5.1 Horizontal Chaining

Different filtering methods differ in their practical running times and filtration efficiencies. The idea behind horizontal chaining is to compose filters from the fastest and potentially least efficient to the most computationally intensive and fully specific. In this context, the efficiency of each filter is strongly conditioned by that of the previous one; its margin of improvement is the number of false positives left over and passed on to it. Nevertheless, we should think of each intermediate filter as a fast procedure to remove false positives – fast compared with the running time of exact filters. To illustrate the concept of horizontally chained filters, here we present a simple – yet powerful – chain of filters. This filtering chain illustrates the base concepts on which the GEM-mapper filtering algorithm is based.

FM-Index → APP → Intermediate partitioning + Q-gram filtering → Tiled BPM → SWG → Matches

1. Adaptive pattern partitioning (APP) The first step of the chain starts with the APP algorithm. Given an index of the reference text, the APP algorithm employs the logic behind factor filters to search for exact factors of the pattern in the reference. This algorithm produces a set of candidates that will be passed on to the next stage to remove false positives.

2. Intermediate filtering + Q-gram filtering After the first set of candidates is generated using APP, an online q-gram filter is applied to remove false positives. Q-gram filters perform very fast in practice. Nevertheless, they are very sensitive to the error rate, the pattern length and the q-gram length. Unfortunately, the optimal q-gram length grows with the read length and, in turn, for long reads it becomes impractical (i.e. computing simple q-gram profiles takes too much memory). Interestingly, composing the q-gram filter with an intermediate pattern partitioning leaves room for balancing the errors among smaller chunks of the pattern. In this way, selecting an efficient choice of (l, q), subdividing the pattern into s chunks of length l – or less –, and applying the intermediate partitioning lemma, allows one to verify that at least one of the chunks matches the corresponding part of the candidate within distance $\lfloor e/s \rfloor$ (a minimal sketch of this per-chunk check is given after this list). For instance, a q-gram filter with q = 6 and |P| = 500 is known from experimentation to yield quite good filtration efficiency after APP (∼50%). For patterns longer than 500 nt, the algorithm divides them into factors of 500 nt (merging the remainder of the division with the last one). For each factor, the experimentally optimal q-gram length is used to check the q-gram lemma up to the reduced error $\lfloor e/s \rfloor$. As a result, the error in each factor is smaller than the overall error and the q-gram length can be selected so as to maximise the filtering efficiency at each factor. Additionally, there is no dependency between the q-gram filter operations at each factor, so they could potentially be run in parallel. Note that as the pattern increases in length, the number of false positives increases proportionally (i.e. with each additional factor extracted by the APP). Nonetheless, factors do not hold relationships amongst them. As a consequence, longer patterns will generate more candidates that spuriously contain a factor but greatly differ from the overall pattern. In the case of the human genome, the average factor length is 20 nt. Given sequenced reads of 1000 nt, candidates selected by APP are only required to match 2% of contiguous bases on average. In practice, many of the candidates generated will be false positives and consequently filtered out by counting q-grams.

3. Tiled BPM For the candidates passing the q-gram filter, a tiled BPM algorithm is applied so as to compute a lower bound of their edit distance against the pattern. Note that this is the first quadratic step in the chain towards exact filters. For patterns shorter than the tile, the filter behaves as an exact filter. After computing the BPM algorithm, all the candidates are ranked according to the resulting edit distance bounds and those over e errors are discarded.

4. SWG Once all candidates are ranked, a dynamic programming algorithm computing the SWG distance is used for each candidate starting from the least distant. Using the edit distance bound computed in the previous step, the SWG algorithm is banded so as to save computing large parts of the DP matrix.
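The per-chunk q-gram check mentioned in step 2 can be sketched as follows (our own naming; the counting follows the q-gram lemma, by which two strings within k errors share at least (|chunk| − q + 1) − q·k q-grams).

from collections import Counter

def qgram_profile(s, q):
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def passes_qgram_filter(chunk, window, q, k):
    """Necessary condition: if the chunk matches the window within k errors,
    the two must share at least (|chunk| - q + 1) - q*k q-grams."""
    threshold = (len(chunk) - q + 1) - q * k
    if threshold <= 0:
        return True                          # the filter cannot reject anything here
    shared = sum((qgram_profile(chunk, q) & qgram_profile(window, q)).values())
    return shared >= threshold

For a read split into s chunks, the intermediate partitioning lemma only requires one chunk to pass its check with k = ⌊e/s⌋; a candidate region can therefore be discarded only when every chunk fails its corresponding check.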

One of the key ideas behind composing filters is to acknowledge their algorithmic complexity and their behaviour with increasing read lengths. Table 5.7 shows the average running time per candidate for every filter in the chain. The most important observation is that, even in the worst case, the q-gram filter is at least 2x faster than the BPM. Moreover, as the length of the pattern increases, the q-gram filter runs comparatively much faster, as its time grows linearly – as opposed to the BPM, which grows quadratically. Chaining filters pays off provided that the filtration efficiency of the individual filters is high enough to compensate for those cases where the whole chain of filters needs to be applied. Progressive filtering of the candidates not only saves running computationally expensive algorithms like SWG on all the candidates (10x slower than q-gram filters), but also favours filtering algorithms that scale linearly with the pattern length. As shown in table 5.8,

            Q-gram O(|P|)   BPM O(|P|^2)   SWG O(|P|^2)
 75 nt          0.58            1.41           5.79
 100 nt         0.66            1.96           7.08
 150 nt         0.91            3.26           9.35
 300 nt         1.73            8.66          16.63
 500 nt         3.12           17.34          26.61
 1000 nt        3.89           21.30          54.13

Table 5.7: Experimental running times in ms for each filter in the chain using different pattern lengths (e=10%) (5x Illumina datasets for GRCh37)

the number of candidates gets reduced by the intermediate filters, avoiding further computations on candidates that can be discarded using simple and cost-effective filters. As the read length increases, the number of false positives removed by the early stages of the filter chain increases.

            APP      Q-gram     BPM
 75 nt     990.07    721.26    404.46
 100 nt    808.98    593.89    308.79
 150 nt    803.51    478.99    209.06
 300 nt    704.79    370.44    111.62
 500 nt    667.56    341.40     74.35
 1000 nt   641.07    298.91     38.46

Table 5.8: Experimental number of candidates (Millions) produced by each filter in the chain (e=10%) using different pattern lengths (5x Illumina datasets for GRCh37)

5.5.2 Vertical Chaining

Motivation. A note on re-seeding techniques

Whenever the search depth delivered by filtering techniques is not enough, many mapping tools [Li, 2013; Langmead and Salzberg, 2012] resort to a complementary filtering stage where new seeds are generated with the aim of exploring more of the search space. Many re-seeding approaches have been proposed in the literature. Most of them propose generating more static seeds with a different stride, smaller in length, or simply at random positions of the pattern. The hope is to hit the seed that leads to the "true match". However, under the same conditions – same number of contiguous seeds and, therefore, same search depth – iteratively selecting different seed configurations in the hope of catching the expected match should be avoided. Fishing for the correct match in such a way is like looking for a black cat in a dark garage when the cat is not necessarily inside. Partitioning filters should always guarantee a given search depth. Experimental support for re-seeding techniques that offer no formal guarantees on the exploration of the solution space cannot be extrapolated beyond the limits of those experiments. Ultimately, this leads to re-seeding techniques tied to particular reference organisms, read characteristics and setups.

Combining candidate generation methods

As opposed to re-seeding techniques, here we propose combining filters incrementally so as to enlarge the search space. At the same time, we combine restrictions from different filters, avoiding repeated computations. The key idea behind vertical chaining is to be able to combine different filters, capturing deeper and deeper strata while avoiding re-exploring strata covered by previous filters. Once a certain search depth is achieved, all the candidates generated can be passed on to any horizontal chain of filtering algorithms to report the true-positive matches. To illustrate the idea, let us assume that the APP algorithm has returned a partition $P_0^3 = (0, 0, 0, 0)$ for a given pattern (i.e. up to 3 errors).

| 0 | 0 | 0 | 0 |

However, it might be the case that the requested search needs to be complete up to 5 errors. In that case, the current APP partition is unsuitable for this purpose and other filtering approaches ought to be considered. As explained above, re-partitioning to accommodate 2 extra exact factors can potentially lead to an impractical number of candidates for verification. For that reason, the most effective approach would be to perform an intermediate partition. Out of many possible combinations, we depict three alternatives: $P_0^5 = (1, 0, 0, 1)$, $P_1^5 = (2, 2)$ and $P_2^5 = (5)$.

| 1 | 0 | 0 | 1 |

| 2 | 2 |

| 5 |

Note that $P_0^5$ is bound to generate many candidates, as it overloads two factors with an extra error while leaving the other two factors unused. Likewise, $P_2^5$ is not a filtering partition at all, as it amounts to generating the neighbourhood of the pattern up to 5 errors – without filtering any candidate. In order to balance factors and errors, the partition $P_1^5 = (2, 2)$ seems the most suitable. However, this partition will generate candidates previously generated by the APP partition $P_0^3 = (0, 0, 0, 0)$. In that case, the use of APP is pointless, as its search space is already covered by $P_1^5 = (2, 2)$. However, we could think of combining the restrictions yielded by $P_0^3$ so as to reduce the amount of computations needed for $P_1^5$. For that, we extend the notation of pattern partitions to consider the maximum and the minimum number of errors at each factor.

(Notation: each factor is annotated with its maximum number of errors on top and its minimum number of errors below.)

Induced by $P_0^3$, each factor of the pattern has to contain at least 1 error in order to lead to a potential match different from all the matches considered so far. Therefore, we can restrict $P_1^5$ so as to capture the minimum number of errors required at each factor, annotating it with maximum errors (2, 2) and minimum errors (1, 1):

Maximum errors: | 2 | 2 |
Minimum errors: | 1 | 1 |

Recall that the previous scheme denotes a filtering partition of the sequence into 2 chunks. Those chunks are supposed to be searched approximately, and independently, up to 2 errors and at least 1. That means, searching for all those regions of the reference (i.e. candidates) that match any of those 2 chunks of the sequence up to 2 errors, discarding exact matches. And again, once all the candidates are found, horizontal chained techniques can be used to verify each candidate and report valid matches.

And yet, $P_1^5$ describes combinations that are not possible, because at least 1 error – out of a maximum of 5 – has to be allocated in each factor of $P_0^3 = (0, 0, 0, 0)$. Thus, the maximum number of errors in each quarter of the sequence can never reach 2. This introduces an additional restriction when searching the individual factors of the partition: every candidate has to match a factor (a) with at least 1 error, (b) with up to 2 errors at most, and (c) with those 2 errors not occurring in the same half of the chunk. The next figure illustrates these restrictions.

(Scheme: 2 factors, each split into two bracketed halves, each half restricted to at most 1 error.)

In the previous figure, note that factors are denoted using contiguous lines (i.e. 2 factors in the example), and search restrictions inside a factor are indicated using square brackets over the region of the factor to which they apply. As a result, the search of each factor has been reduced to a search of up to 1 error in each half; and the overall search (allowing as many as 5 errors) to a filtering search of factors with at most 1 error in each half. In other words, we have reduced the intermediate filter of 2 factors up to 2 errors to the vertical chaining of 2 searches up to 1 error and 4 exact factors.

| 0 | 0 | 0 | 0 |     | [1 1] | [1 1] |
| 0 | 0 | 0 | 0 |  ∪  | [1 1] | [1 1] |

This example shows that the resulting searches do not need to generate the full neighbourhood of any factor up to 2 errors. This drastically reduces the complexity of the individual factor searches and potentially leads to fewer candidates. In order to stress this fact and illustrate the potential of the method, we could think of combining $P_0^3 = (0, 0, 0, 0)$ with the third proposed intermediate partition, $P_2^5 = (5)$ (which is not really a partition, but a full neighbourhood search).

| 0 | 0 | 0 | 0 |

(Scheme: the single-factor search $P_2^5 = (5)$ stacked on top of the exact partition $P_0^3 = (0, 0, 0, 0)$.)

In this case, $P_2^5$ constrained by $P_0^3$ would be reduced from searching up to 5 errors to just 2 errors. As we can see in the next scheme, imposing constraints from lightweight filtering partitions can dramatically reduce the overall complexity of very deep searches. This is true even if they are combined on top of a full neighbourhood search – based on the combinatorial exploration of all possibilities. Nonetheless, the combination of all searches leads to the same results as directly searching for just one intermediate partition.

Maximum errors: | 2 | 2 | 2 | 2 |
Minimum errors: | 1 | 1 | 1 | 1 |

As with the first formulation of intermediate partitioning techniques, many possibilities arise from this method to reduce the search space. To the best of our knowledge, there is no method to determine the combination of partitions that guarantees the least number of combinations explored. Nevertheless, vertical chaining introduces a key idea towards deep approximate string searches. The key contribution is to prove that filtering searches can be combined on top of a combinatorial neighbourhood search to prune the search tree. Furthermore, note that in the previous scheme all chunks of the sequence are forced to contain at least one error, but not all of them can have the maximum of 2 – that would exceed the global limit of 5 errors at most. These ideas lead to faster search strategies in cases where the error is very high and filtering techniques generate an unfeasible number of candidates.

5.5.3 Breadth-first vs depth-first filtering search

Up to this point, we’ve presented several filtering techniques towards efficient and arbitrarily deep searches. In this way, we have presented how horizontal chaining comes to prove that it’s preferable to progressively filter candidates using more restrictive and computationally expensive filters each time. But also, how vertical chaining explores progressively the search space using incrementally deeper partitions. In this last section of the chapter, we want to emphasise further more on the relevance of progressive algorithms that can explore incrementally the search space allowing effective pruning techniques.

Classical literature on approximate string matching using filtering techniques presents the problem as a two-step process: generation and verification of candidates (one following the other). This can be understood as a breadth-first search where all the generated candidates are brought together and only once they have all been gathered – from searching the individual seeds – can the verification begin. In most cases, the arrangement between these two steps is neglected. As a result, many actual algorithms waste computational resources searching and gathering more candidates than strictly necessary to resolve the mapping search.

This two-step approach can be reformulated as a depth-first search by iterating the procedures of generating the candidates from a given seed and verifying them. Using partitioning methods like APP, we can guarantee an incrementally complete search depth each time the candidates of a seed are verified. By incrementally exploring strata, we regulate the number of candidates explored at each iteration. In the context of determining the δ-strata group of a given sequenced read, we would iterate the process of generation and verification of candidates until a certain search depth is reached. In practice, most cases reach a match tie before exploring up to the maximum feasible stratum e. Hence, a depth-first search can implement effective cut-off techniques to prevent the search from exploring unnecessary strata, that is, from exploring matches that would not change the δ-strata group membership of a given read.
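A minimal sketch of this generate-and-verify iteration follows; all three callbacks are illustrative placeholders, not part of any mapper's actual interface.

def depth_first_filtering(seeds, generate_candidates, verify, satisfied):
    """Depth-first reformulation: candidates of each seed are verified as soon
    as they are generated, and the search stops once the accumulated matches
    are good enough (e.g. the delta-strata group of the read is decided)."""
    matches = []
    for seed in seeds:                        # seeds explored in increasing strata order
        for candidate in generate_candidates(seed):
            match = verify(candidate)
            if match is not None:
                matches.append(match)
        if satisfied(matches):                # cut-off: skip deeper, unnecessary strata
            return matches
    return matches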

Chapter 6

Approximate string matching III. Neighbourhood search

6.1 Neighbourhood search  133
6.2 Dynamic-programming guided neighbourhood search  136
  6.2.1 Full, condensed, and super-condensed neighbourhood  136
  6.2.2 Full neighbourhood search  137
  6.2.3 Supercondensed neighbourhood search  139
6.3 Bidirectional preconditioned searches  141
  6.3.1 Bidirectional search partitioning. Upper bounds  142
  6.3.2 Bidirectional region search chaining  146
  6.3.3 Bidirectional search vertical chaining. Lower bounds  147
  6.3.4 Pattern-aware region size selection  148
6.4 Hybrid techniques  149
  6.4.1 Combining neighbourhood searches with filtering bounds  149
  6.4.2 Early filtering of search branches  151
6.5 Experimental evaluation  152

Filtering techniques are widely known to behave very well in practice with relatively long patterns and reduced error rates. More precisely, filtering approaches perform well as long as the number of candidates generated is relatively low. However, there are many other scenarios where filtering is just not good enough. In these cases, the resulting number of candidates to verify is too large. Hence, filtering algorithms become saturated and take large amounts of computational resources to verify all potential matching positions.

In contrast, neighbourhood search techniques explore all possible combinations by enumerating all sequences that can match the pattern within a given distance. In the case of indexed searches, these combinations are searched against the index, and all those occurring in the index are reported as matches. At first glance, neighbourhood techniques could be seen as crude combinatorial algorithms, deemed to behave poorly in practice. Surprisingly however, they exhibit certain properties that cannot be mimicked by filtering techniques. Thus, in certain cases, neighbourhood techniques do outperform filtering approaches in terms of computational efficiency. Note that in practice, the more efficient an algorithm is, the deeper the searches it can perform within a reasonable time. Thus, neighbourhood algorithms can perform some searches up to a very high error rate in a short period of time, whereas no filtering approach could ever explore the same search space in a reasonable time.


In this chapter, we present the third block of approximate string matching, focused on neighbourhood search techniques. We will introduce the basics of neighbourhood searches, from the most basic combinatorial algorithms to the more elaborate versions that avoid unnecessary computations. From here, we introduce bidirectional search techniques to precondition the search and reduce the number of combinations explored. Furthermore, we introduce novel pruning techniques to further enhance the search. Finally, we introduce hybrid techniques that use filtering partitions to precondition neighbourhood searches and efficiently prune the search tree. This allows us to combine filtering and neighbourhood algorithms progressively depending on the requirements of each individual search. As a result, the final hybrid algorithms enable adaptive searches using different algorithmic approaches depending on the sequence at hand and its complexity.

6.1 Neighbourhood search

Given a distance function d(x, y), the neighbourhood of a string s at distance e is defined as $N_d^e(s) = \{n_0, .., n_n \mid d(n_i, s) \le e,\ \forall n_i\}$ (definition 3.2.20). In other words, the e-neighbourhood of a string s is composed of all the strings produced by applying up to e error operations to s in all possible ways. For example, under the Hamming distance, the 1-neighbourhood of w = "ACGT" is $N_h^1(ACGT)$.

$N_h^1(ACGT) = \{ACGA, ACGC, ACGG, ACAT, ACCT, AAGT, ACGT, CCGT, GCGT, TCGT, AGGT, ATGT, ACTT\}$   (6.1)

Similarly, under the Levenshtein distance, the 1-neighbourhood of w = "ACGT" is $N_{lev}^1(ACGT)$.

$N_{lev}^1(ACGT) = \{ACGA, ACGTA, ACGC, ACGTC, ACG, ACGG, ACGTG, ACAT, ACGAT, ACT, ACCT, ACGCT, AAGT, ACAGT, AGT, AACGT, CACGT, GACGT, TACGT, ACGT, ACCGT, CCGT, AGCGT, GCGT, ATCGT, TCGT, CGT, AGGT, ACGGT, ATGT, ACTGT, ACTT, ACGTT\}$   (6.2)
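A direct Python sketch of this enumeration over the DNA alphabet (our own helper, for illustration) reproduces the 13 strings of equation 6.1.

SIGMA = "ACGT"

def hamming_neighbourhood(s, e):
    """All strings over SIGMA at Hamming distance <= e from s."""
    out = {s}
    if e == 0:
        return out
    for i, original in enumerate(s):
        for c in SIGMA:
            if c != original:
                # substitute position i, then allow up to e-1 further substitutions
                out |= hamming_neighbourhood(s[:i] + c + s[i + 1:], e - 1)
    return out

assert len(hamming_neighbourhood("ACGT", 1)) == 13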

In the context of indexed searches, a neighbourhood search algorithm would generate all these $n_i$ strings and check them against the index to produce the final set of matches. Intuitively, any algorithm based on generating all these possible combinations of strings at a given distance from a pattern quickly becomes intractable, as the number of combinations grows exponentially (we will later formalise this idea). Although it might appear an intractable algorithm at first, it exhibits some quite advantageous properties when searching for very repetitive sequences – or sequences containing very repetitive regions – in comparison to filtering methods.

Neighbourhood searches implicitly build a search tree which mimics the traversal of a trie. By exhaustively searching all combinations, they traverse all the paths of the tree, reporting complete sets of results. Besides this completeness, note that each node of a neighbourhood search contains all positions of the reference matching the searched path, conveniently compacted in an SA-interval. Hence, no matter how many times a specific pattern – or a neighbourhood string of it – occurs in the reference, it will be encoded as a single interval. In cases where the sequenced read hits a highly duplicated region of the genome, filtering methods will have to check each of these matches individually, whereas a neighbourhood search could just look up the repetitive pattern and return a single interval encoding all occurrences of the duplicated sequence.

In addition, neighbourhood searches can be advantageous when the sequence depicts high divergence from the reference. Recalling filtering methods, these algorithms require an anchor seed with a low error rate – or no error at all – to spot the true locus of the sequence. Whenever the input sequence depicts a high error rate (i.e. sequencing errors), contains large variants, or the reference is of poor quality, filtering algorithms find it difficult to find such anchor seeds. Even using approximate seeds – one of the most sensitive forms of seed –, these methods would potentially spend too much time verifying all plausible candidates until the true locus is found. As opposed to filtering, neighbourhood searches can explore all the error/variant combinations without having to verify each candidate they find along the way. For that reason, neighbourhood searches are tailored to search for those matches virtually unreachable by filtering algorithms.

Additionally, in cases where searching beyond the first matching result is required (e.g. to accurately determine the δ-strata group of a sequence, CNV studies, etc), filtering methods become profoundly limited. In fact, they can only increase the search depth by increasing the number – or the error rate – of the filtering factors. This quickly leads to unfeasible amounts of candidates – ultimately forcing verification with large parts of the reference. Yet, neighbourhood searches can deal with higher error rates by exploring more nodes of the search tree. Although this may seem to lead to a combinatorial explosion of cases, we must bear in mind that not all possible string combinations occur in the reference. Indeed, neighbourhood search trees are heavily pruned by the substrings contained in the index (i.e. search paths that actually appear in the reference). Hence, neighbourhood search trees can refine the search space until only the actual matches are found, thereby avoiding the inspection of all the reference as filtering methods would do when saturated.

Another interesting case, where neighbourhood searches can potentially outperform filtering methods, occurs when the input sequence contains hazardous uncalled bases (i.e. Ns). In this case, uncalled bases can prevent filtering algorithms from extracting factors, thus limiting their length, and ultimately generating short factors with too many potential candidates to verify. Conversely, neighbourhood searches can search through uncalled bases by trial-and-error. This allows the gap of uncalled bases to be filled transparently during the search, without incurring any extra cost. What is more, uncalled bases force the allocation of errors during the search, thus reducing the overall error degree left to consider in the rest of the sequence.

On the whole, neighbourhood searches can be very powerful whenever the error rate is high, when the sequence – or any of its substrings – is very repetitive, when searching beyond the first matching strata, or in the case of sequences containing obtrusive uncalled bases. Nevertheless, it is vital to understand that under normal circumstances these algorithms are generally more expensive than filtering techniques. For this reason, these algorithms should not be used standalone. It is important to detect those cases unsuitable for filtering, and complement them using neighbourhood searches.

Size of the neighbourhood

As stated above, the main drawback of neighbourhood searches is the combinatorial explosion of cases to consider. In order to illustrate the exponential nature of the approach, we can compute the size of the neighbourhood of a pattern.

Lemma 6.1.1 (Size of the neighbourhood of a string (under Hamming distance)). Given the alphabet Σ, under the Hamming distance function, the size of the neighbourhood of a string s at distance e is given by the following expression.

$$|N_h^e(s)| = 1 + \sum_{i=1}^{e} (|\Sigma| - 1)^i \binom{|s|}{i}$$

Proof. We can think of the number of strings with exactly e mismatches as the number of combinations of e positions of the string at which to substitute the character with one of the other $|\Sigma| - 1$ characters of the alphabet, i.e. $(|\Sigma| - 1)^e \binom{|s|}{e}$. In that way, we compute the total number of strings in the neighbourhood $N_h^e(s)$ by adding up all the strings that have from 0 mismatches (i.e. the exact string) to e mismatches.
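The closed form is easy to check numerically; a small sketch (math.comb is the binomial coefficient):

from math import comb

def hamming_neighbourhood_size(length, e, sigma_size=4):
    """|N_h^e(s)| = 1 + sum_{i=1..e} (|Sigma|-1)^i * C(|s|, i)."""
    return 1 + sum((sigma_size - 1) ** i * comb(length, i) for i in range(1, e + 1))

# Matches the 13 strings of equation 6.1 (|s| = 4, e = 1, |Sigma| = 4):
assert hamming_neighbourhood_size(4, 1) == 13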

In the case of the Levenshtein distance, there is no known closed-form expression for the size of a neighbourhood. However, there are known bounds – and fairly tight recurrence formulas [Myers, 2013] – that reinforce the idea that the neighbourhood of a pattern quickly grows with increasing error rates.

Lemma 6.1.2 (Size of the neighbourhood of a string (under Levenshtein distance)). Given the alphabet Σ, under the Levenshtein distance function, the size of the neighbourhood of a string s at distance e is bounded by the following expression.

$$|N_{lev}^e(s)| \le \binom{|s|}{e} (2|\Sigma|)^e$$

Proof. An extension of the idea shown in lemma 6.1.1, considering not only mismatches but also insertions and deletions of any character, up to e errors, can be found in [Myers, 2013]. In [Covington, 2004; Myers, 2013], the reader can also find detailed methods to further bound $|N_{lev}^e(s)|$.

Brute-force neighbourhood search

The first and most straightforward algorithm to produce the neighbourhood of a string is based on using brute force to generate all string combinations. As the search progresses, the alignment error of the search path must be checked against the pattern. Algorithm 7 shows how to generate the Hamming neighbourhood of a string. The algorithm keeps track of and regulates the number of mismatches introduced. Note that each search path implicitly describes a traversal path over the trie of the reference, and spells a string occurring in it. Once a tree leaf is reached (i.e. within Hamming distance, a search path of the length of the pattern), the resulting string is added to the neighbourhood set. Note that the search is done backwards – in the spirit of the FM-index. Additionally, search paths leading to empty SA-intervals (i.e. searches of substrings not contained in the reference) force the search path to be abandoned.

Algorithm 7: Hamming Neighbourhood Index-Search

 1 NS-Hamming(p, ep, SAp)
   input : I FM-Index, P string, e maximum error,
           p current position, ep current error, SAp current SA-interval
   output: N Hamming neighbourhood of P at distance e (i.e. N_h^e(P))
 2 begin
 3   foreach c ∈ Σ do
 4     ep+1 ← ep + (c ≠ P[|P| − 1 − p] ? 1 : 0)     // Compute error
 5     if ep+1 ≤ e then                              // Check error
 6       SAp+1 ← FM-Index-query(I, SAp, c)           // Query index
 7       if SAp+1 = ∅ then continue
 8       if p < |P| − 1 then
 9         NS-Hamming(p + 1, ep+1, SAp+1)            // Next step
10       else
11         N ← N ∪ SAp+1                             // Add result SA-interval
12       end
13     end
14   end
15 end
16 // Initial call
17 N ← ∅
18 NS-Hamming(0, 0, SAT)                             // SAT = [0, |T|) full text SA-interval

6.2 Dynamic-programming guided neighbourhood search

Despite the simplicity of algorithm 7, neighbourhood searches under other distances can become tricky. Unsurprisingly, there are still a number of recent publications [Salavert et al., 2015; Tennakoon et al., 2012] that struggle to capture all the combinations of mismatches, insertions and deletions while trying to avoid repeated cases (e.g. an insertion followed by a deletion). However, there is a much simpler solution to the problem, first glimpsed many years ago [Myers, 1994]. Instead of focusing on tracking the edit operations, it is easier to track the edit distance of every tree path as the search progresses. By incrementally adding characters to each path, we can compute the edit distance of the path and prune those branches that exceed the maximum error allowed. For that reason, we could potentially benefit from any exact filter (section 5.2) explained before. Without loss of generality, we base both our explanations and implementations – and therefore benchmarks – on the dynamic programming solution. In doing so, we take advantage of the most flexible method known to date for implementing several cut-off techniques. Although it is presumably one of the most expensive exact filters, its simplicity enables us to derive comprehensible bounds and avoid further complexities of implementation. At the same time, this enables the algorithm to benefit from all the proposed exact filter enhancements (e.g. banded computations, BPM processing, SIMD processing, etc). Furthermore, this approach is easily adapted to many other distance metrics. Before going into further detail, there are some basic concepts that need to be addressed to fully understand the methods and optimisations presented here. The following sections are presented incrementally, adding refinements and optimisations with the sole purpose of pruning the search tree to avoid unnecessary computations and enhance practical neighbourhood searches.

6.2.1 Full, condensed, and super-condensed neighbourhood

In spite of its exponential nature, the full set of neighbourhood strings of a pattern contains a certain degree of repetitiveness. In fact, many neighbourhood strings have as a prefix, suffix or substring other strings also contained in the neighbourhood. To find all approximate matches in a reference, it suffices to look only for those strings of the neighbourhood that are not extensions of other neighbourhood strings. Otherwise, we would just report the same locus of the reference using different search paths. Recalling the neighbourhood set of the string "ACGT" up to one error (equation 6.2), we call this the full neighbourhood. Below (equation 6.3), we can see marked in bold all those substrings that are also members of the neighbourhood.

$N_{lev}^1(ACGT) = \{ACGA, ACGTA, ACGC, ACGTC, ACG, ACGG, ACGTG, ACAT, ACGAT, ACT, ACCT, ACGCT, AAGT, ACAGT, AGT, AACGT, CACGT, GACGT, TACGT, ACGT, ACCGT, CCGT, AGCGT, GCGT, ATCGT, TCGT, CGT, AGGT, ACGGT, ATGT, ACTGT, ACTT, ACGTT\}$   (6.3)

In this way, we remove all neighbourhood strings that have a suffix in the same neighbourhood to obtain the condensed neighbourhood (equation 6.4).

$CN_{lev}^1(ACGT) = \{ACG, ACAT, ACT, ACCT, AAGT, ACAGT, AGT, AACGT, CACGT, GACGT, TACGT, ACCGT, CCGT, AGCGT, GCGT, ATCGT, TCGT, CGT, AGGT, ATGT\}$   (6.4)

Furthermore, we remove from the condensed neighbourhood all those strings that have a prefix in the same neighbourhood to obtain the supercondensed neighbourhood (equation 6.5).

$SCN_{lev}^1(ACGT) = \{ACG, ACAT, ACT, ACCT, AGT, CGT, AGGT, ATGT\}$   (6.5)

As we can see, from the initial 33 strings of the full neighbourhood, the condensed neighbourhood reduces it to 20 strings, and the supercondensed neighbourhood to just 8. At the same time, all the neighbourhood sets lead to the same solutions in a given reference, as all solutions are just extensions of the supercondensed neighbourhood strings. The formal definitions of these reduced sets are given below.

Definition 6.2.1 (Condensed Neighbourhood). Given a distance function d(x, y), we define the condensed neighbourhood of a string s at distance e as

$$CN_d^e(s) = \{n_0, .., n_n \mid (d(n_i, s) \le e,\ \forall n_i) \wedge (\nexists\, n_j \text{ suffix of } n_i,\ i \ne j)\}$$

Definition 6.2.2 (Supercondensed Neighbourhood). Given a distance function d(x, y), we define the supercondensed neighbourhood of a string s at distance e as

$$SCN_d^e(s) = \{n_0, .., n_n \mid (d(n_i, s) \le e,\ \forall n_i) \wedge (\nexists\, n_j \text{ substring of } n_i,\ i \ne j)\}$$
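The two reductions can be transcribed literally over an explicit set of strings; a naive sketch follows (an indexed search obtains them implicitly by not extending finished search paths, as described next).

def condensed(neighbourhood):
    """Definition 6.2.1: drop strings that have another neighbourhood string
    as a proper suffix."""
    return {s for s in neighbourhood
            if not any(t != s and s.endswith(t) for t in neighbourhood)}

def supercondensed(neighbourhood):
    """Definition 6.2.2: drop strings that contain another neighbourhood string
    as a proper substring."""
    return {s for s in neighbourhood
            if not any(t != s and t in s for t in neighbourhood)}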

6.2.2 Full neighbourhood search

With these definitions in mind, we now proceed to describe the base algorithm to search the full neighbourhood of a pattern. NS-Levenshtein (algorithm 8) is a procedure that recursively expands each search node by appending all possible characters of the alphabet. This procedure continues while the search SA-interval is not empty and the search path still allows for solutions with edit distance e at most. Note that line 12 regulates the inclusion of the current solution (i.e. the SA-interval described by the search path) based on its alignment distance. Nonetheless, the search does not have to stop at that point, and line 15 allows the continuation of the search, provided that there are more feasible solutions within reach. This depends on the values of the current DP-column cells and on whether these cells are active or not. Active cells are those that can lead to a valid alignment (i.e. e errors at most) assuming all matches from their diagonal onwards – the best-case scenario. In this way, if the current DP-column contains any active cell, NS-Levenshtein will be called recursively to search for further solutions. Overall, the search mimics a trie traversal and each step expands a given search node by adding all characters of the alphabet. By induction, the search will explore all possible string combinations that, regulated by the DP-table, are contained in the full neighbourhood. In order to keep track of the alignment distance of the search path – and prune paths –, DP-column(P, Cp, c) (algorithm 9) is used to update the dynamic programming table. This procedure computes the next column of the DP-table, using the previous one Cp and the last character used in the search path c. Note that C[−1] = p so as to forbid free-end alignments and force any search path to trace back to position (0, 0) of the DP-table.

Algorithm 8: Dynamic-Programming Neighbourhood Search

1  NS-Levenshtein(p, ep, SAp, Cp)
   input : I FM-Index, P string, e maximum error,
           p current position, ep current error, SAp current SA-interval, Cp current DP-column
   output: N, the Levenshtein neighbourhood of P at distance e (i.e. N^e_lev(P))
2  begin
3    foreach c ∈ Σ do
4      // Compute DP column
5      Cp+1 ← DP-column(P, Cp, c)
6      if min{Cp+1} > e then continue
7      // Query FM-Index
8      SAp+1 ← FM-Index-query(I, SAp, c)
9      if SAp+1 = ∅ then continue
10     // Keep searching (Full Neighbourhood)
11     ep+1 ← Cp+1[|P| − 1]
12     if ep+1 ≤ e then
13       N ← N ∪ SAp+1   // Add result SA-interval
14     end
15     if min{Cp+1} ≤ e then
16       NS-Levenshtein(p + 1, ep+1, SAp+1, Cp+1)
17     end
18   end
19 end
20 // Initial call
21 N ← ∅
22 C0[−1..|P| − 1] ← 0..|P|
23 NS-Levenshtein(0, 0, SAT, C0)   // SAT = [0, |T|) full-text SA-interval

Algorithm 9: Dynamic Programming Compute Column

1  DP-column(P, C, c)
   input : P pattern, C previous DP-column, c character
   output: R, the next DP-column after adding c
2  begin
3    C[−1] ← p
4    for i in 0..|P| − 1 do
5      del   ← C[i − 1] + 1
6      ins   ← C[i] + 1
7      match ← C[i − 1] + ((c = P[i]) ? 0 : 1)
8      R[i] ← min{del, ins, match}
9    end
10   return R
11 end
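As an illustration only (a sketch under assumptions, not the GEM implementation), the column update of Algorithm 9 and the two tests used by Algorithm 8 can be written as follows; the boundary cell grows with the depth of the search path, which is what forbids free-end alignments.

```python
def dp_next_column(P, prev, c):
    # prev[0] is the boundary cell C[-1]; prev[i + 1] is the edit distance between
    # the current search path and the pattern prefix P[0..i].
    new = [prev[0] + 1]                       # boundary = path length: no free ends
    for i, pc in enumerate(P):
        new.append(min(new[i] + 1,            # skip pattern character P[i]
                       prev[i + 1] + 1,       # skip the appended character c
                       prev[i] + (c != pc)))  # match / mismatch
    return new

def path_distance(column):
    return column[-1]                         # e_{p+1} = C_{p+1}[|P| - 1] (line 11)

def has_active_cell(column, e):
    return min(column) <= e                   # line 15: keep searching?

# Initial column of Algorithm 8, C0[-1..|P|-1] = 0..|P|:
#   column = list(range(len(P) + 1))
```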

Note that queries to the FM-Index are assumed to be forward searches for the sake of simplicity. While this is straightforward using a bidirectional index, the procedure can easily be adapted to the regular backward search by simply reversing the input pattern.

6.2.3 Supercondensed neighbourhood search

However, as shown earlier, searching for the full neighbourhood produces far more search paths than strictly necessary. Our goal is to restrict the search to the supercondensed neighbourhood. For that, we need to modify the algorithm that guides the search (i.e. the dynamic programming table) to discard search paths describing strings that contain a substring which is a member of the same supercondensed neighbourhood. To this end, we must forbid free-end alignments, that is, alignments that can trace back to any cell other than (0, 0). We denote those traceback paths as free-end paths. Similarly, we denote all cells of the DP-table that have a free-end path (i.e. lead to a free-end alignment) as free-end cells.

        −  A  T  G  A  T
   −    0  0  0  0  0  0
   G    1  1  1  0  1  1
   A    2  1  2  1  0  1
   T    3  2  1  2  1  0
   T    4  3  2  2  2  1
   A    5  4  3  3  2  2
   C    6  5  4  4  3  3
   A    7  6  5  5  4  4

Figure 6.1: Dynamic programming table for P = "GATTACA" and T = "ATGAT". Free-end cells are marked in bold and red.

Figure 6.1 shows an example of a DP-table with all free-end cells marked. All these cells can lead to free-end alignments that do not require any starting characters from the text. For example, note that the value zero on the last column simply denotes that the suffix "GAT" of the text searched so far (which corresponds to the beginning of the pattern) can lead to an exact free-end match by adding the remaining "TACA". But in such a case, another search is bound to start with "GAT" at the beginning. Hence, there is no need for the first "AT" characters, and any path from this cell should be discarded. Overall, in this example, as all cells from the last column of figure 6.1 are free-end cells, there is no need to continue the neighbourhood search from this node any further. All solutions from this point will be explored from shallower nodes of the search tree.

In order to implement this free-end-cell concept, we modify algorithm 9 to keep track of free-end paths. The resulting procedure, DP-supercondensed-column (algorithm 10), embeds a flag into each cell to keep track of whether the minimum over all edit operations leads to a free-end cell or not. Additionally, the initial conditions of any new column allow for free-end alignments (i.e. line 3, C[−1] ← (0, true)), properly marked as free-end cells. Below, algorithm 11 describes the procedure NS-Supercondensed (based on algorithm 8), which finds all strings in the supercondensed neighbourhood of a pattern. Note that line 6 discards all cases where the last column of the DP-table does not contain any active cells, or where all active cells are free-end cells. However, DP-supercondensed-column can only detect cases where the search path describes a string with a suffix contained in the neighbourhood. We also need to prevent the search from extending beyond search paths that already align to the pattern (lines 12 and 14). Then, if the search path aligns to the pattern, the string described is added to the results (line 13) and the search stops. Otherwise, as in line 6, after checking that there is at least one active cell, the search continues by recursively calling NS-Supercondensed. As a result, valid search paths are not extended and the search does not produce any search path that contains another string of the neighbourhood as a prefix. As we can see, algorithm 11 emulates a trie traversal over the neighbourhood of the pattern, avoiding strings that have a prefix or a suffix in the same neighbourhood, and thus produces the supercondensed neighbourhood of the pattern.

Algorithm 10: Dynamic Programming Compute Column (Supercondensed)

1  DP-supercondensed-column(P, C, c)
   input : P pattern, C previous DP-column with free-end flags, c character
   output: R, the next DP-column after adding c
2  begin
3    C[−1] ← (0, true)
4    for i in 0..|P| − 1 do
5      del   ← (C[i − 1].error + 1, C[i − 1].freeEnd)
6      ins   ← (C[i].error + 1, C[i].freeEnd)
7      match ← (C[i − 1].error + ((c = P[i]) ? 0 : 1), C[i − 1].freeEnd)
8      R[i] ← min_error{del, ins, match}
9    end
10   return R
11 end

Algorithm 11: Supercondensed Neighbourhood Search

1  NS-Supercondensed(p, ep, SAp, Cp)
   input : I FM-Index, P string, e maximum error,
           p current position, ep current error, SAp current SA-interval, Cp current DP-column
   output: N, the supercondensed neighbourhood of P at distance e
2  begin
3    foreach c ∈ Σ do
4      // Compute DP column (Supercondensed)
5      Cp+1 ← DP-supercondensed-column(P, Cp, c)
6      if (ci.error > e ∨ ci.freeEnd, ∀ci ∈ Cp+1) then continue
7      // Query FM-Index
8      SAp+1 ← FM-Index-query(I, SAp, c)
9      if SAp+1 = ∅ then continue
10     // Keep searching (Supercondensed Neighbourhood)
11     ep+1 ← Cp+1[|P| − 1].error
12     if (ep+1 ≤ e) ∧ (¬Cp+1[|P| − 1].freeEnd) then
13       N ← N ∪ SAp+1   // Add result SA-interval
14     else
15       NS-Supercondensed(p + 1, ep+1, SAp+1, Cp+1)
16     end
17   end
18 end
19 // Initial call
20 N ← ∅
21 C0[−1] ← (0, false)
22 C0[0..|P| − 1] ← (1..|P|, true)
23 SAT ← [0, |T|)
24 NS-Supercondensed(0, 0, SAT, C0)
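For illustration, a hedged sketch of the flag-carrying column update (in the spirit of Algorithm 10) and of the pruning test of Algorithm 11, line 6, is shown below. The use of the standard banded recurrence and the tie-breaking choice (preferring a non-free-end predecessor on equal error) are assumptions of this sketch, not details fixed by the pseudocode.

```python
def dp_supercondensed_column(P, prev, c):
    # Each cell is an (error, free_end) pair. Re-opening the boundary as a
    # zero-cost start allows free-end alignments, but flags them so that the
    # search can later recognise paths containing a shorter neighbourhood string.
    prev = [(0, True)] + prev[1:]
    new = [(prev[0][0] + 1, prev[0][1])]
    for i, pc in enumerate(P):
        new.append(min((new[i][0] + 1, new[i][1]),            # skip P[i]
                       (prev[i + 1][0] + 1, prev[i + 1][1]),  # skip c
                       (prev[i][0] + (c != pc), prev[i][1]))) # match / mismatch
    return new        # tuple min prefers non-free-end cells on equal error

def column_is_dead(column, e):
    # Algorithm 11, line 6: prune when every cell either exceeds the error
    # budget or can only be reached through a free-end path.
    return all(err > e or free for err, free in column)
```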

6.3 Bidirectional preconditioned searches

So far, we have presented the basics of any neighbourhood search, together with some refinements to generate a succinct representation of the set (i.e. the supercondensed neighbourhood) for the purpose of ASM. Nevertheless, theoretical bounds still indicate that the neighbourhood of a string grows exponentially. Moreover, the number of possible paths to traverse in order to perform searches with high error rates or long read lengths quickly becomes intractable. In practice, the only fact that prevents this worst-case scenario from happening is that not all possible string combinations occur in a reference of limited length. That is, SA-intervals are bound to shrink as the search progresses and the search space is reduced. Ultimately, deep enough search paths are expected to lead to a handful of matches in the reference, as they become highly specific.

Combinatorial explosion and decay of search paths

In order to strengthen this idea, we formally present a concept that will help us bound this worst-case scenario. In short, for a given alphabet Σ, a deBruijn sequence B(n, Σ) is a cyclic sequence in which every possible string of length n occurs exactly once [Higgins, 2012]. That means that a deBruijn sequence is the minimum-length string that contains all possible substrings of length n. In this case, a deBruijn sequence used as a reference represents the minimum-length text for which the neighbourhood search falls into its worst-case scenario. A search against such a reference would require exploring all feasible paths up to n steps, all of them being contained in the reference. However, at step n + 1, we would expect the search to abandon many paths, as they cannot occur in the reference (by construction of the deBruijn sequence). At the same time, a deBruijn sequence B(n, Σ) is known to have length |Σ|^n [Higgins, 2012]. Therefore, given any reference text T and assuming the worst-case scenario, in which this reference is also a deBruijn sequence, we can expect the search to start closing paths past log_|Σ|(|T|) steps. Interestingly, the same result can be obtained by means of the proper length pl (definition 3.3.8). The definition of proper length implies that for search paths longer than pl, the probability of having a non-empty SA-interval is almost zero. Thus, for patterns longer than pl, neighbourhood searches beyond pl steps are theoretically bounded by the content of the reference and cannot expand to all possible combinatorial paths. However, by the same principles that prevent the combinatorial explosion beyond a certain search depth, the first steps in any of the presented neighbourhood searches are bound to explore all combinations. Broadly speaking, for a relatively small k, the probability that a given reference contains almost every possible k-mer is very high. Hence, searches in their first steps will grow exponentially, leading to a strong penalty because proportionally only a few paths will lead to valid matches, as shown earlier.
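As a rough illustration (a back-of-the-envelope figure added here, not taken from the original text), for a human-sized reference of about $3\times 10^{9}$ bases over the DNA alphabet the bound gives

\[
\log_{|\Sigma|}(|T|) \;=\; \log_{4}\!\left(3\times 10^{9}\right) \;=\; \frac{\ln(3\times 10^{9})}{\ln 4} \;\approx\; 15.7,
\]

so, in this worst-case view, search paths are only expected to start dying off after roughly 16 characters, regardless of how the errors are distributed.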

Distributing errors

Nevertheless, this early explosion of cases is the direct result of allowing an unrestricted number of errors (up to e) from the very beginning of the search. A more thoughtful approach would divide the search into cases by distributing errors in the spirit of the pigeonhole principle. Recalling figure 5.7 in chapter 5, we could divide a search into several cases, each of which has a region that has to match exactly. Within that region, a neighbourhood search is restricted to follow a single search path, avoiding any node expansion. If the region is large enough, the search space is expected to be significantly reduced. Thus, extending the search from that point towards the two remaining regions is also expected to end up in a moderate number of search nodes, conditioned by the properties explained previously.

This is the essence of preconditioned neighbourhood searches. These methods exploit the fact that not all possible combinations occur in the reference, arranging searches in such a way that combinations not contained in the reference are quickly discarded in favour of those search paths that lead to valid matches. The previous strategy implies that the search can be performed in both directions, forward and backward; hence the name of the method. For that reason, a bidirectional index is required (e.g. a bidirectional FM-Index), or two indexes must be used (e.g. the SA of the forward and reverse reference text). In this section, bidirectional neighbourhood searches (BNS) are presented as an effective method to precondition searches and mitigate combinatorial explosions throughout neighbourhood searches. Here we present the basics of the method and, taking advantage of its flexibility, we show novel bounds that further constrain the cost of neighbourhood searches.

6.3.1 Bidirectional search partitioning. Upper bounds

To illustrate the idea, we will use a search up to four errors as an example. As explained above, instead of doing a neighbourhood search up to four errors from the beginning, we divide the problem into two searches. Then, we divide the pattern into two search regions and, for each search operation, we only allow for two errors on the first region searched. In this way, we aim to control the number of paths explored in the first half of each neighbourhood search. In more detail, figure 6.2 shows a scheme of both searches; note the global limits on the error imposed on each search region. In this way, the first search S00 starts from the left, searching forward for the first half of the pattern. During this region's search, the neighbourhood search has to follow the specified error limits and consider as valid only those paths whose distance goes from zero to two errors. Once a search path jumps to search the other half of the pattern (i.e. S01), the algorithm is allowed to introduce two more errors, so that the search paths can align using from zero to four errors (note that a search path can proceed to simply add four errors in the second half). Similarly, the second search (S10 followed by S11) performs the symmetric search starting from the right (i.e. backwards). By simple enumeration of cases, it is straightforward to see that these two searches cover all possible distributions of four errors into two regions. Thus, the results retrieved by these two searches are equivalent to a single unrestricted search up to four errors. The overall effect of distributing the errors is to regulate the number of nodes explored in each search. Moreover, as the search increases in depth, SA-intervals are expected to shrink, limiting the search space and preventing an explosion of cases to explore when the maximum error is allowed.

Searches #0, #1, #2:   [ S00: 0–2 ][ S01: 0–4 ]
Searches #3, #4:       [ S11: 0–4 ][ S10: 0–2 ]

Figure 6.2: Generation of the BNS-partition. First step splitting four errors into two search regions. The global error bounds of each region are given as min–max ranges.

Now, we proceed to apply the same idea recursively to the first searched region of each search partition. We repeat this until the starting search region is reduced to an exact search, or to a region shorter than the number of errors assigned to it. Figure 6.3 shows this recursive process until search #0 is derived (out of a total of five searches). Note that search #0 is annotated with the direction and order in which each region search has to be performed. Moreover, each region is bounded by the maximum and minimum errors allowed (i.e. the global error of the search path) when performing that region's search. As a result, the first region search performs an exact search that can only produce one search node. As the search progresses, the error allowed is doubled, relying on the specificity gained with the pattern length to limit the number of paths explored.

Searches #0, #1, #2:   [ 0–2 ][ 0–4 ]
Searches #0, #1:       [ 0–1 ][ 0–2 ][ 0–4 ]
Search #0:             [ 0 ][ 0–1 ][ 0–2 ][ 0–4 ]     order/direction:  #0→  #1→  #2→  #3→

Figure 6.3: Generation of a single BNS-partition (e=4). The resulting search partition (no. 0) is annotated with the order in which the regions must be searched and the direction of these searches. Error bounds are given as min–max ranges.

In the same manner, figure 6.4 shows the full set of BNS-partitions for searching up to four errors. Counter-intuitively, searches are allowed to start from the middle of the pattern and progress in both directions while exploring the neighbourhood.

Lemma 6.3.1 (Equivalence and completeness of search partitions). Given a search S and a maximum allowed error e, the set of partitions produced by the bidirectional search is equivalent to the search S and returns all possible matches up to e errors.

Proof. According to the pigeonhole principle, distributing e errors into two search partitions ((S00, S01) and (S10, S11), as shown before) is equivalent to a single search S up to e errors: in all cases, one of the two regions has to have at most ⌈e/2⌉ errors. As the second half of each search allows up to e errors, both search partitions together cover all possible arrangements of e errors into two regions. Furthermore, each S_i0 can be divided recursively and, by induction, we can prove that the resulting search partitions are equivalent to a single search of S_i0 up to ⌈e/2⌉ errors. In the base case, an exact search of a region is performed, reporting all occurrences of the region. In this way, the method is complete (it does not miss any matches) and the partitions in figure 6.4 are equivalent to a neighbourhood search up to four errors.
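The first partitioning step can also be checked exhaustively with a trivial script (an illustration added here, not part of the proof): any way of spreading up to four errors over the two halves leaves at most ⌈4/2⌉ = 2 errors in one of them, and is therefore caught by one of the two searches.

```python
e = 4
for a in range(e + 1):              # errors falling in the first half
    for b in range(e + 1 - a):      # errors falling in the second half
        assert a <= e // 2 or b <= e // 2, (a, b)
print("all distributions of up to", e, "errors are covered")
```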

Search #0:   [ 0 ][ 0–1 ][ 0–2 ][ 0–4 ]        #0→  #1→  #2→  #3→
Search #1:   [ 0–1 ][ 0 ][ 0–2 ][ 0–4 ]        #1→  #0←  #2→  #3→
Search #2:   [ 0–2 ][ 0 ][ 0–4 ]               #1←  #0→  #2→
Search #3:   [ 0–4 ][ 0 ][ 0–1 ]               #2←  #0→  #1→
Search #4:   [ 0–4 ][ 0–1 ][ 0 ]               #2←  #1←  #0→

Figure 6.4: Full set of BNS-partitions (e=4). Each region is annotated with its min–max global error bounds; the search order and direction of each region are given on the right.

Bidirectional search algorithm

Having proven that the method is correct and yields a complete set of matches, we now proceed to present the algorithm that generates BNS-partitions. Algorithm 12 shows the recursive procedure that derives all search partitions SP equivalent to a search up to e errors. The recurrence stops whenever the maximum error allowed is zero or the region to be searched is smaller than the error (line 3). Otherwise, the algorithm splits the region into two equal parts (line 10), sets the error limits of each part (line 11), and composes each partition by recursively calling BiSearch-Partitions (lines 14 and 17). For each partition, the region searched with the highest maximum error limit is enqueued to be searched in the proper direction (lines 13 and 16).

For the sake of completeness, and because of their relevance to future sections, we present below the Partition-Split-Region (algorithm 13) and Partition-Split-Error (algorithm 14) procedures. The former divides the input region into two equal parts, and the latter distributes the errors across the resulting region partitions.

Algorithm 12: Bidirectional search partition generation algorithm

1  BiSearch-Partitions(SP, R^p, R^l, emin, emax)
   input : SP search partition, R^p region starting position, R^l region length,
           emin minimum global error, emax maximum global error
   output: computes all search partitions equivalent to a neighbourhood search from emin up
           to emax errors of the pattern in [R^p, R^p + R^l)
2  begin
3    if (emax = 0 ∨ R^l ≤ emax) then
4      // Stop partitioning
5      S ← (R^p, R^l, emin, emax)
6      SP ← SP · (S, →)
7      NS-Neighbourhood-Search(SP)
8    else
9      // Compute partition and error bounds
10     p ← Partition-Split-Region(R^p, R^l)
11     Partition-Split-Error(p, emin, emax)
12     // First Search Partition
13     SP ← SP · (S01, →)
14     BiSearch-Partitions(SP, S00^p, S00^l, S00.emin, S00.emax)
15     // Second Search Partition
16     SP ← SP · (S11, ←)
17     BiSearch-Partitions(SP, S10^p, S10^l, S10.emin, S10.emax)
18   end
19 end

Algorithm 13: Algorithm to split search into two equal regions for the partition

1  Partition-Split-Region(R^p, R^l)
   input : R^p region starting position, R^l region length
   output: p region partition
2  begin
3    (R0^p, R0^l) ← (R^p, R^l/2)
4    (R1^p, R1^l) ← (R^p + R^l/2, R^l/2)
5    S00 ← R0
6    S01 ← R1
7    S10 ← R1
8    S11 ← R0
9    p ← {S00, S01, S10, S11}
10   return p
11 end

Algorithm 14: Algorithm to compute the error bounds assigned to each region in each search partition

1  Partition-Split-Error(p, emin, emax)
   input : p = {S00, S01, S10, S11} region partition, emin minimum global error, emax maximum global error
   output: assignment of error bounds
2  begin
3    // Split error
4    esearch ← emax/2
5    eextend ← emax
6    // First partition
7    S00.emin ← 0
8    S00.emax ← min{esearch, S00^l}
9    S01.emin ← 0
10   S01.emax ← min{eextend, S01^l}
11   // Second partition
12   S10.emin ← 0
13   S10.emax ← min{esearch, S10^l}
14   S11.emin ← 0
15   S11.emax ← min{eextend, S11^l}
16 end

6.3.2 Bidirectional region search chaining

Once all bidirectional search partitions have been generated, one of the trickiest parts of the algorithm is to connect neighbourhood searches across regions. Figure 6.5 shows a schematic diagram of the DP-table and the paths explored during a BNS-partition search. In this case, three searches are performed: two forward and one backward. Beyond the inherent complexities involved in an actual implementation of this method, there are some details that need to be taken into account.

First, the whole DP-table should not be considered from the beginning. Only the square delimited by the pattern regions searched so far should be computed. Note in figure 6.5 how the areas coloured in yellow get bigger and paler as the search considers more search regions. In this way, the effective size of the DP-table increases each time a new region is added to the search. It is key to note that invalid paths at previously searched regions – possibly due to exceeding the number of errors – may lead to valid ones when another region search is added to the whole search.

Also, we should not forget that we aim to search for the supercondensed neighbourhood of the pattern. For this reason, not all squares of the DP-table should be computed identically. While it is clear that the last search (computing the whole DP-table) should prevent free-end paths (depicted in red in figure 6.5), using the supercondensed algorithm to compute intermediate squares of the table can lead to missing solutions. Due to the fact that new valid paths may appear as the effective DP-table gets larger, internal squares must be computed using the full neighbourhood search. This guarantees that no valid path is ruled out prematurely.

Equally important is to detect when to switch to the next region and expand the search (i.e. chaining). Premature chaining would defeat the purpose of the bidirectional search (allowing errors too soon), and late chaining could lead to missed valid search paths or to the generation of redundant computations. For that, we consider the search of a region to be satisfied once there is a valid alignment path (i.e. the current search path aligns to the region of the pattern within the error boundaries). Computations past this point would generate paths that either are computed afterwards or lead to free-end paths at the end of the table. As a consequence, the ε (empty) character should be considered in the main loop of the neighbourhood search, to allow chaining to more than one search region without adding any character to the search. Additionally, note that the search may eventually change direction; in this case, the DP-table must be reversed to consider search paths in the opposite direction.

Figure 6.5: Bidirectional region search chaining using dynamic programming, and depiction of the search space explored by a single search partition (against a full neighbourhood search).

Figure 6.5 also shows a depiction of the search space explored by a single search partition. In the spirit of [Navarro et al., 2001], this figure tries to capture the effect of regulating the maximum number of errors allowed on each search region. In this manner, the search steadily allows for more solutions while constraining the paths explored and discarding string combinations not appearing in the reference. While this is a powerful technique to enhance the effectiveness of neighbourhood searches, in the next section we argue how forcing a minimum number of errors on these regions also has a very powerful effect on pruning the search tree.

6.3.3 Bidirectional search vertical chaining. Lower bounds

So far, we have emphasised the relevance of regulating the maximum number of errors throughout a neighbourhood search, and the effect this has on the number of paths explored. However, searches based on BNS-partitions can be further constrained by imposing conditions on the minimum number of errors at each region. Note that the first and last search partitions from figure 6.4 both allow for an exact search of the whole pattern. This implies that some search paths are explored more than once, and we could reduce the computations of the search. Here, we propose constraining search partitions using the error bounds of other partitions. In this way, we apply a vertical chaining technique over the search partitions (section 5.5.2) to reduce the overall number of paths explored, and ultimately enhance neighbourhood searches based on this technique.

In the example, searches #0 and #1 were generated as the last step partitioning the search. Specifically, the region searched S = (0–1) was partitioned into S0 = (0, 0–1) and S1 = (0–1, 0). However, S0 already covers all searches up to one error that match the first region exactly. Therefore, there is no need to repeat those search paths, and we can force S1 to add an error in its first search region (i.e. S1 = (1, 0)). In this way, each recursive partition imposes constraints on the next searches, increasing the lower error bound. Figure 6.6 completes the BNS-partition of figure 6.4 with these raised minimum error bounds.

Search #0:   [ 0 ][ 0–1 ][ 0–2 ][ 0–4 ]        #0→  #1→  #2→  #3→
Search #1:   [ 1 ][ 0 ][ 0–2 ][ 0–4 ]          #1→  #0←  #2→  #3→
Search #2:   [ 2 ][ 0 ][ 0–4 ]                 #1←  #0→  #2→
Search #3:   [ 3–4 ][ 0 ][ 0–1 ]               #2←  #0→  #1→
Search #4:   [ 3–4 ][ 1 ][ 0 ]                 #2←  #1←  #0→

Figure 6.6: BNS-partitions imposing lower bounds (i.e. vertical chaining restrictions among BNS-partitions) (e=4). Each region is annotated with its min–max global error bounds; raised minimum bounds result from the vertical chaining restrictions.

6.3.4 Pattern-aware region size selection

Note that there is no major reason why regions have to be divided into equal chunks when producing recursive partitions. Moreover, practical experiments clearly show that such partitions are not guaranteed to explore fewer search nodes than unevenly distributed ones. Furthermore, there is no evidence suggesting that equal splitting is the best choice, even when applied to many randomly generated patterns against a random text. The only reference on this topic [Kucherov et al., 2016] supports this observation; however, it is limited to Hamming distance, with results at a very early stage.

As happens with filtering partitions, previous work on the topic has a fundamental flaw. As stated throughout this manuscript, not all patterns have the same structure and properties. For this reason, there is no reason to believe that a single partition scheme, oblivious to the content of the pattern and the reference, can lead to the optimal neighbourhood search in every case (i.e. spanning the least number of search paths). For that reason, we propose to select regions with a similar number of occurrences in the reference. In this way, we guarantee that the first region searched in each search partition is equally preconditioned. We heuristically approximate this equally preconditioned partition by counting k-mer frequencies and dividing them across the regions required by the bidirectional search. In the experimental evaluation (section 6.5) we show the effect of this approach on the overall number of search nodes explored.
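One possible heuristic along these lines is sketched below (an illustration under assumptions; `kmer_count` stands for a precomputed k-mer frequency table of the reference and is hypothetical). It splits the pattern so that each region accumulates roughly the same total k-mer frequency.

```python
def equal_frequency_regions(pattern, kmer_count, k, n_regions):
    # Frequency of every k-mer of the pattern in the reference.
    freqs = [kmer_count(pattern[i:i + k]) for i in range(len(pattern) - k + 1)]
    total = sum(freqs) or 1
    target = total / n_regions            # frequency mass wanted per region
    cuts, acc = [0], 0.0
    for i, f in enumerate(freqs):
        acc += f
        if len(cuts) < n_regions and acc >= target * len(cuts):
            cuts.append(i + k)            # close the current region here
    if cuts[-1] != len(pattern):
        cuts.append(len(pattern))
    return list(zip(cuts[:-1], cuts[1:]))  # [(begin, end), ...] per region
```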

6.4 Hybrid techniques

The first appearance of the term hybrid search came in the paper [Navarro et al., 2001] to sketch the idea behind intermediate partition filters (section 5.3.3). That paper already acknowledged the difficulty in selecting the optimal partition of approximate factors and explained the limitation of the method in cases of searches with high degrees of error. As with many other contributions in this thesis, we try to select terms close to those accepted in the field for the sake of simplicity. In this case, the ideas presented below can be thought of as an extension of the hybrid model.

In short, the essence of a hybrid search is to combine filtering and neighbourhood search algorithms. As with most filtering methods, the goal is to balance the work between these two stages (thoroughly explained in section 5.1). Going beyond this approach, we propose a more elaborate, and more efficient, way of combining both types of algorithm.

The hybrid techniques presented in this section allow neighbourhood searches to benefit from filtering techniques in two ways: firstly, by constraining the number of search paths through bounds derived from previous filtering searches (section 6.4.1); and secondly, by preventing searches from reaching deep search nodes by filtering SA-intervals (i.e. the candidates of the search so far) whenever the number of candidates is low (section 6.4.2).

6.4.1 Combining neighbourhood searches with filtering bounds

As mentioned several times previously, filtering techniques largely outperform other ASM techniques in most scenarios. However, for that minority of cases where filtering approaches do not attain the required results (e.g. deep searches), neighbourhood searches can be a suitable replacement, or may even be the only alternative. Nevertheless, the two methods are not incompatible, and solutions based on a first filtering stage can be complemented with a neighbourhood search. In actual fact, the space of solutions explored by a filtering partition imposes error boundaries on the search paths of a neighbourhood search.

Filtering factors (searched exactly):    [ 0 ][ 0 ][ 0 ]
Filtering constraints (local bounds):    [ 1–2 ][ 1–2 ][ 1–2 ]

Figure 6.7: Search constraints from filtering 3 exact factors

To illustrate this idea, figure 6.7 shows a hypothetical scenario where the APP algorithm (section 5.4.3) has produced three factors, not necessarily of equal length. These three factors have been filtered exactly, without introducing errors in the seeds. As explained earlier, the search depth reached by the APP algorithm in this case is guaranteed to be complete up to two errors (section 5.3.2). Assuming we want to obtain complete results up to four errors, we complement the results found so far with a bidirectional neighbourhood search. However, any match in the reference that matches any of the factors from the APP partition exactly has already been found. Therefore, any complementary neighbourhood search should avoid generating paths leading to these matches. In other words, the neighbourhood search is forced to introduce errors in those three regions. Figure 6.7 shows the local search restrictions imposed on each region of the pattern. Recall that these restrictions on the error are local to the region (section 5.5.2); global restrictions within neighbourhood searches, in contrast, apply to the whole search at the time the region is added to it.

Once the error constraints produced by the APP algorithm are known, and using the same partition dimensions given by the APP factors, we produce the BNS-partitions equivalent to a search up to four errors. Then, we override the lower bounds of the search regions taking into account the local error constraints derived from the first filtering search. As a result, some search regions get their minimum number of errors raised. In some cases this can simply lead to discarding a whole search partition, as the resulting error bounds become inconsistent. Figure 6.8 shows the resulting BNS-partitions and error bounds. Note that search #2 has been forced to introduce an error in its first searched region, even though that region is not allowed to contain any error at all. In fact, the matches retrieved by APP already cover this search partition and, therefore, the whole of search #2 is discarded.

Filtering constraints (local bounds):    [ 1–2 ][ 1–2 ][ 1–2 ]

Search #0:   [ 0 ][ 1 ][ 1–2 ][ 1–4 ]          #0→  #1→  #2→  #3→
Search #1:   [ 1 ][ 0 ][ 1–2 ][ 1–4 ]          #1→  #0←  #2→  #3→
Search #2:   [ 2 ][ min 1, max 0 ][ 1–4 ]      #1←  #0→  #2→   (inconsistent: discarded)
Search #3:   [ 3–4 ][ 0 ][ 1 ]                 #2←  #0→  #1→
Search #4:   [ 3–4 ][ 1 ][ 0 ]                 #2←  #1←  #0→

Figure 6.8: BNS-partitions imposing constraints from searching 3 exact factors. Bounds are given as min–max global error ranges.

Note that figure 6.8 depicts both local and global error bounds on the search partitions of the neighbourhood search. That is, the actual algorithms that perform these region searches take into account both sets of constraints: from BNS-partitions (global), and from APP factors (local).
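The bookkeeping involved can be pictured with the following sketch (a hypothetical data layout, not the GEM code; each region search is a dict carrying its BNS bounds and the id of the APP factor it covers):

```python
def apply_filtering_constraints(partition, local_min_error):
    # Raise each region's minimum error bound with the local constraint inherited
    # from the exact APP filtering pass; drop the whole search partition when any
    # bound becomes inconsistent (minimum above maximum), since those matches
    # have already been retrieved by the filtering stage.
    constrained = []
    for region in partition:
        e_min = max(region["e_min"], local_min_error[region["factor"]])
        if e_min > region["e_max"]:
            return None
        constrained.append({**region, "e_min": e_min})
    return constrained
```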

In short, this hybrid approach employs previously generated filtering partitions to precondition bidirectional searches. In this way, we apply the ideas behind vertical chaining (section 5.5.2) to further constrain BNS-partitions. The main benefit of this method is the possibility of combining filtering and neighbourhood searches while avoiding repeated computations. Furthermore, this enables us to perform deeper searches, preventing filtering methods from becoming saturated by the number of candidates.

6.4.2 Early filtering of search branches

Another powerful observation is the fact that any valid search path – leading to a valid match – has to be extended from the beginning of the search until the end. Yet, whenever a search path corresponds to a handful of matches in the reference (i.e. the SA-interval of the search is small enough), we could directly verify those candidates one by one – using any online filter (section 5.2) or a chain of online filters (section 5.5.1). This is what we denote as early filtering of search branches and it exploits the decay property of search branches once the search path is large – and specific – enough (section 6.3). This method allows us to quit neighbourhood searches before reaching a leaf of the tree. In this way, it skips potentially expensive computations at the end of search partitions – which contain the search regions that allow most errors to happen. Furthermore, this method favours fast and simple verification procedures (i.e. online filters).

In practical terms, a heuristic algorithm very similar to the one used by the APP (algorithm 6) is used to determine the best point in the search path at which to stop and verify candidates. Specifically, the so-called "local optimisation stage" (line 11) is applied as the search progresses. In this way, a search path is only nominated for early filtering if its SA-interval has fewer candidates than a given hard threshold. If that happens, the algorithm lets the search carry on for a fixed number of steps before stopping it (together with all the paths it could have spawned in the meantime) and leaving the rest of the search for verification. Despite sharing similarities with other hybrid methods, the approach presented adapts to the needs of each search and to the structure of the pattern. As opposed to classic hybrid methods, our approach has no hard boundaries for switching from neighbourhood search to filtering. Hence, it self-adapts to each scenario and applies each method whenever it fits best.
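The switch itself reduces to a small decision rule, sketched below; the concrete threshold and the number of extra steps are placeholders, not the values used by the GEM mapper.

```python
CANDIDATE_THRESHOLD = 20   # assumed cut-off on the SA-interval size
EXTRA_STEPS = 4            # assumed number of steps granted after nomination

def should_switch_to_filtering(sa_interval, steps_since_nomination):
    # A branch is nominated for early filtering once its SA-interval is small;
    # it is then allowed to shrink for a few more steps before its remaining
    # candidates are handed over to an online verification filter.
    lo, hi = sa_interval
    if hi - lo > CANDIDATE_THRESHOLD:
        return False
    return steps_since_nomination >= EXTRA_STEPS
```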

6.5 Experimental evaluation

In this section, experimental results for the previous methods are presented. In order to justify and quantify the benefits of the proposed techniques, we have benchmarked several algorithms and compared them against each other. Actual implementations of these algorithms have been included in the GEM Mapper version 3 (chapter 7). This allows us to extract practical conclusions from a fully functional tool which generates production-ready results. Throughout the section we focus on measuring the total wall-clock time spent by the tool as a whole, the number of search nodes spawned during the neighbourhood search, and the average depth of the search paths over all reads mapped. All benchmarks have been run on an Intel i7-6500U (2.5GHz) equipped with 16GB of RAM, using Linux Ubuntu 16.04 (kernel 4.8.0). All tests have been run on a single thread and core in order to avoid biases from multi-thread interactions. In order to avoid memory latencies that could bias performance results, chromosome 1 of the human genome was used as a reference. Using a smaller index mitigates memory penalties and allows a focus on times strictly due to the neighbourhood search. Likewise, Illumina-like simulated reads for chromosome 1, of different lengths, have been used as input in order to keep the tests as realistic as possible.

We have selected seven algorithms, each representative of a neighbourhood search technique. In short, Brute-force-NS is a brute-force, full-neighbourhood search as presented in section 6.1. Supercondensed-NS is a refinement of the former that searches only the supercondensed neighbourhood (section 6.2.3). Next, BNS stands for bidirectional neighbourhood search and implements the basic bidirectional algorithm presented in section 6.3 (using equal-length search regions and imposing both upper and lower bounds). Based on BNS, we benchmark BNS+PA, which selects pattern-aware region sizes (section 6.3.4), and BNS+EF, which early-filters search branches (section 6.4.2). Finally, we include in the benchmark the hybrid algorithm Hybrid-BNS(APP), which combines the APP algorithm with bidirectional neighbourhood searches (section 6.4), and the hybrid algorithm Hybrid-BNS(APP)+EF, which builds on the previous one while allowing early filtering of search branches.

Method                    Time     Search Nodes   Average Search Depth
Brute-force-NS            7.5h*    20.3G*         99.00*
Supercondensed-NS         154m*    8.08G          96.54
BNS                       5.23s    12.74M         97.77
BNS + PA                  3.11s    7.60M          98.48
Hybrid-BNS(APP)           3.00s    7.31M          98.56
BNS + EF                  2.54s    6.20M          52.80
Hybrid-BNS(APP) + EF      1.58s    3.85M          55.13

Table 6.1: Benchmark results for neighbourhood methods (10K Illumina reads of 100nt, e=4%). Searches report all valid matches (i.e. complete set of results). Results marked with an asterisk have been estimated from computing a smaller sample of 1K reads.

Table 6.1 shows the results from searching 10K reads of 100nt up to four errors. Note that the time accounts for the total running time including APP filtering and verification of candidates if the method employs them. Also note that the results for Brute-force-NS and Supercondensed-NS algorithms have been estimated from a smaller sample of 1K (a full run would have taken an impractical amount of time).

As expected, the supercondensed search spawned considerably fewer search nodes than the full neighbourhood search (2.5 times fewer), effectively reducing the total running time. However, compared to the bidirectional search (BNS), the supercondensed search still spawned >600x more nodes. The preconditioning strategy is strikingly effective at reducing the overall number of search paths explored. Ultimately, this enables mapping datasets that are orders of magnitude larger in a reasonable amount of time.

Interestingly, the BNS+PA algorithm demonstrates the difference a good choice of region size can make to the total number of search nodes explored. In the results, the number of nodes spawned is reduced to 60% of the number spawned by BNS, which uses equal-length search regions. As we can see, proper search preconditioning has a high impact on the search, leading to large improvements in the running time.

Similarly, the hybrid algorithm Hybrid-BNS(APP) reduces both the number of searched nodes and the running time compared to the BNS algorithm. In actual fact, the strategy of Hybrid-BNS(APP) implicitly exploits the benefits of selecting an adequate region size by using the factors produced by the APP algorithm. In this way, the effect of preconditioning the search path is very similar to that of BNS+PA. Additionally, it further reduces the number of search paths by discarding combinations already covered by the first stage of APP filtering.

In contrast to the previous methods, approaches based on early filtering avoid exploring paths until the end of the pattern (i.e. paths that are on average 98nt long). In this way, both methods which use this technique, BNS+EF and Hybrid-BNS(APP)+EF, search just half of the pattern before switching to BPM verification of the candidates. This implies that the previous algorithms spend half of the search extending search paths that lead to only a few matches. In these cases, there is no point in exploring all the neighbourhood combinations by trial and error; candidate verification proves to be a much more efficient approach. Additionally, we observe that the number of search nodes explored by BNS+EF is reduced by almost half in Hybrid-BNS(APP)+EF. Again, this is the result of selecting a more adequate region-size partition and imposing filtering conditions derived from APP.

Scaling with the read length

Traditionally, neighbourhood methods have been relegated to searching for very short sequences, in favour of filtering methods. In this way, these types of searches have always been deemed impractical for mapping long reads. The methods proposed in this thesis show that this is not entirely true. Table 6.2 complements the results shown before with tests for increasing read lengths.

The first result to notice is that the basic BNS algorithm does not scale with the read length. Compared to the case of 100nt reads, the running time increases by a factor of more than 27x when searching 1000nt long reads. In contrast, the Hybrid-BNS(APP)+EF algorithm only increases its running time by 5.6x, while the sequences used are 10x longer. This is quite an impressive result taking into account that these searches are done up to e=4% on 1000nt long reads; that is, complete searches up to 40 errors. This result relies heavily on the fact that the Hybrid-BNS(APP)+EF algorithm rapidly shrinks the search space of each search partition and quickly switches from neighbourhood searching to filtering verification. Interestingly, the Hybrid-BNS(APP)+EF algorithm searches on average just 34nt of the pattern (i.e. less than 4% of the read) by means of neighbourhood search. In contrast, all other methods that do not use the early filtering technique have to continue the search up to almost 1000nt. In the same way, the BNS+EF method avoids inspecting the majority of the sequence while still finding all matches up to 40 errors. However, improving the preconditioning and constraining the search with complementary filtering methods (i.e. Hybrid-BNS(APP)+EF) reduces the average number of bases inspected by half.

Method                    Time     Search Nodes   Average Search Depth
50nt
  BNS                     2.23s    5.25M          49.27
  BNS + PA                1.52s    3.58M          49.56
  Hybrid-BNS(APP)         1.47s    3.48M          49.59
  BNS + EF                1.76s    4.15M          39.51
  Hybrid-BNS(APP) + EF    1.23s    2.90M          42.1
100nt
  BNS                     5.23s    12.74M         97.77
  BNS + PA                3.11s    7.60M          98.48
  Hybrid-BNS(APP)         3.00s    7.31M          98.56
  BNS + EF                2.54s    6.20M          52.8
  Hybrid-BNS(APP) + EF    1.58s    3.85M          55.13
200nt
  BNS                     12.07s   26.65M         195.2
  BNS + PA                7.16s    15.81M         196.83
  Hybrid-BNS(APP)         6.85s    15.13M         197.01
  BNS + EF                3.30s    7.29M          63.02
  Hybrid-BNS(APP) + EF    2.24s    4.95M          64
1000nt
  BNS                     2.23m    153.59M        970.62
  BNS + PA                53.66s   56.46M         983.46
  Hybrid-BNS(APP)         51.85s   54.55M         984.82
  BNS + EF                11.42s   12.02M         67.91
  Hybrid-BNS(APP) + EF    8.89s    10.03M         34.9

Table 6.2: Benchmark results for neighbourhood methods using different read lengths (10K Illumina reads, e=4%). Searches report all valid matches (i.e. the complete set of results).

It is also important to notice that these methods do not have a large impact on very short reads (e.g. 50nt). The reason why we do not see a real improvement in performance is that most reads of 50nt searched up to e=4% have a huge number of matches in the reference. Hence, the number of valid search paths in each search is very high and most lead to a valid solution. As a result, little effect can be expected from these pruning techniques as they generally have a small margin to reduce the number of cases considered.

Scaling with the error

In the same manner, we have benchmarked how the hybrid algorithms, Hybrid-BNS(APP) and Hybrid-BNS(APP)+EF, behave with increasing error rates; both perform better than the rest at high error rates. In order to compare against a baseline and emphasise the exponential nature of the problem, we have also included estimated times for the Brute-force-NS and Supercondensed-NS algorithms using smaller inputs. Table 6.3 shows the results of this benchmark. As we can see, the estimated times for the basic neighbourhood algorithms grow exponentially with the error rate, up to unaffordable costs. In contrast, both hybrid algorithms grow much more slowly. In the case of the Hybrid-BNS(APP) algorithm, the number of nodes increases 6.8x from e=4% to e=6%, and 6.4x from e=6% to e=8%. In the case of Hybrid-BNS(APP)+EF, these increments are 9.6x and 6.2x. Overall, in both cases the time spent and the nodes explored at e=8% remain reasonable enough to consider their use in a real tool.

Considerations towards practical use of neighbourhood searches

It is extremely important to bear in mind that, despite all the optimisations presented in this chapter, filtering methods greatly outperform neighbourhood searches in the majority of cases.

Method                    Time     Search Nodes   Average Search Depth
e=4%
  Brute-force-NS          7.5h*    –              –
  Supercondensed-NS       154m*    –              –
  Hybrid-BNS(APP)         3.00s    7.31M          98.56
  Hybrid-BNS(APP) + EF    1.58s    3.85M          55.13
e=6%
  Brute-force-NS          52.5h*   –              –
  Supercondensed-NS       31h*     –              –
  Hybrid-BNS(APP)         20.66s   45.67M         97.53
  Hybrid-BNS(APP) + EF    15.18s   24.49M         50.71
e=8%
  Brute-force-NS          639d*    –              –
  Supercondensed-NS       137.5h*  –              –
  Hybrid-BNS(APP)         1.74m    220.51M        96.95
  Hybrid-BNS(APP) + EF    1.35m    127.11M        43.79

Table 6.3: Benchmark results for hybrid neighbourhood search methods increasing error rate (10K Illumina reads of 100nt). Searches report all valid matches (i.e. complete set of results). Results marked with an asterisk have been estimated from computing smaller samples

Moreover, neighbourhood searches have a very high computational cost. Their use should be limited to those cases where filtering methods cannot fulfil the requirements of the search, in most cases with regard to search depth. In practice, neighbourhood searches should not be used standalone, but as a complement to a filtering algorithm (hybrid algorithms). Additionally, one must recall that neighbourhood searches prove their effectiveness in cases where it is not feasible to effectively determine seeds leading to valid matches, or where the intrinsic repetitiveness of the pattern, and of its substrings, saturates filtering approaches with an infeasible number of candidates to verify.

In the same way, we must highlight that throughout this whole chapter we have focused our attention and tests on cases where all the feasible matches within a given distance are to be retrieved (i.e. all-matches). This case represents the most demanding and computationally expensive scenario; hence our interest in evaluating algorithms for it. Nevertheless, neighbourhood searches can be used efficiently in other scenarios (e.g. "true-match") by adapting the finishing conditions of the search. Furthermore, they can be complemented with the notion of δ-strata groups (section III) to effectively stop the search for the true match whenever a tie is found, or whenever the mapping accuracy derived from the classification of the matches found in a δ-strata group is high enough. In actual fact, this idea is implemented within the GEM Mapper version 3, when the mapping algorithm falls back to neighbourhood searches. Ultimately, this avoids excessive and unproductive computations when mapping "hard" sequences that require this kind of search to obtain accurate results.

Chapter 7

High-throughput sequencing mappers. GEM-mapper

7.1 Mapping tools
7.2 GEM Mapper
    7.2.1 GEM architecture
    7.2.2 Candidate verification workflow
    7.2.3 Single-end alignment workflow
    7.2.4 Paired-end alignment workflow
7.3 GEM-GPU Mapper
7.4 Experimental results and evaluation
    7.4.1 Benchmark on simulated data
    7.4.2 Benchmark on real data
    7.4.3 Benchmark on non-model organism

This chapter is devoted to explaining the practical implementation of a production-ready mapping tool: the GEM mapper. This aligner implements the techniques and algorithms presented in previous chapters. In this chapter, we want to emphasise the practical aspects that allowed us to successfully turn theoretical insights into a practical tool. In particular, we present a detailed description of how the GEM mapper is structured and how its components embody the algorithmic techniques previously presented. Then, we examine some engineering issues affecting performance in a real implementation. Additionally, we present a GPU implementation of the GEM Mapper which exploits the massive parallelism of modern graphics processing units to enhance the performance of the most critical mapping routines.

In the last section of this chapter we present a benchmark comparing the GEM mapper with other cutting-edge mappers currently available. We aim to give experimental results and a practical comparison between the GEM mapper and other widely used mappers. We mainly focus on comparing performance and the quality of the results. In accordance with published literature, experiments show that GEM consistently outperforms other mappers in both speed and accuracy.

7.1 Mapping tools

Since the advent of the first mapping tools in the context of genomics and bioinformatics, there has been constant work towards improving these tools and producing better results for the available sequencing technologies and experimental protocols. Since the scale of sequencing has been

increasing more than exponentially in recent years, the need for better tools that can cope efficiently with the analysis of this data has become paramount. In the past decade, there has been massive interest in mapping tools, mainly focused on performance (often at the cost of accuracy). Here we present a brief introduction to the mapping tools considered for benchmarking purposes. The interested reader is referred to [Fonseca et al., 2012] for a full review of the mapping tools available and their main characteristics.

CPU Mappers

From the multiple mapping tools available that run using general purpose processors (i.e. CPU), we have selected those that stand out from the crowd because of their performance, the quality of their results, or both.

First, BWA-MEM is the latest mapping tool released from the BWA family. Based on the BWT (FM-Index) implementation of the tool BWT-SW [Lam et al., 2008], this tool has a small memory footprint and performs very fast in practice. Its alignment algorithm is based on extracting maximal exact matches (MEMs) and verifying them using a SIMD SWG algorithm. In case the algorithm does not find any matches with high MAPQ, it resorts to a re-seeding process considering smaller seeds and more candidates. In practice, it is a very accurate mapper which scales adequately with read length up to relatively long sequences. As a result, it is widely used by many researchers in the field.

Another very popular mapper is Bowtie2. It is a mapping tool designed to quickly report a first good mapping location. Nevertheless, it can retrieve multimaps at the cost of expensive running times. Bowtie2 is based on a custom FM-Index implementation, designed to have a small memory footprint. As for its mapping algorithm, Bowtie2 is based on a seed-and-extend technique; it extracts fixed-length, partially overlapping seeds and verifies them using SIMD striped dynamic programming [Farrar, 2007]. In case the first round of generated seeds is not sensitive enough, it resorts to a second round of re-seeding, with different seeds, as a best effort to map the read. Despite being a widely used mapper, its effectiveness is limited to relatively short reads. In this way, experiments show that its performance severely deteriorates for reads beyond 300nt.

An equally relevant mapping tool is Novoalign. Although not particularly fast, this mapping tool produces very accurate results. In fact, it is often selected when performance is not an issue and high quality results are required. Novoalign is a commercial software package and therefore, little is known about its internal implementation details.

Yet another interesting mapper is SNAP. This mapping tool relies heavily upon a big hash table to exploit larger seeds. In fact, SNAP takes advantage of ever-increasing sizes of RAM to justify the high memory footprint of its index. Nevertheless, this tool delivers very good performance in practice.

The last CPU mapper selected for the benchmarks is HPG-aligner. Unlike the others, the HPG-aligner employs a suffix array to support a seed-and-extend algorithm. In practice, this mapper performs relatively fast and produces results of a quality similar to BWA-MEM and others.

GPU Mappers

In addition to the mappers mentioned, some others have been designed to exploit modern GPU architectures and implement faster mapping tools. Though some of these tools are still prototypes, great advances have been made to prove the suitability of GPUs for accelerating sequence mapping.

nvBowtie is a GPU implementation of the widely used mapper Bowtie2. This GPU mapper has been completely rewritten from scratch to adapt to the GPU architecture while retaining the same spirit as the original Bowtie2. In this way, it reproduces many of the features of the original tool while delivering remarkable speed-ups compared to it. In addition to this, we have included Soap3-dp and CUSHAW2, both GPU-based mappers, in the benchmarks, as they are widely used and good representatives of the performance achieved using modern GPUs.

7.2 GEM Mapper

Since the first release of the GEM-mapper, its main purpose has been to deliver high-quality results with predictable accuracy. In fact, the GEM project was born to deliver a mapping tool that could produce alignments with a guaranteed level of search depth and accuracy. Because of that, it has always offered the user a controllable way of specifying the required search depth. Historically, the design has always relied on some form of filtering stage, with a fallback neighbourhood search algorithm in case the requested search-depth could not be achieved by means of filtering only. Its reliance on these two complementary string matching approaches to reach a predictable level of accuracy represents one of the most important characteristics of the GEM mapper.

The following subsections describe in detail the structure and algorithmic workflows of the GEM-Mapper. Many experimental figures are presented throughout these sections, focusing on the reference use case of resequencing human reads, both real and simulated, of different lengths. The examples we present aim to showcase design decisions and illustrate how they impact the internal workings of the GEM-Mapper.

7.2.1 GEM architecture

In this section we present the overall architecture of the GEM-mapper (a schematic illustration can be found in Figure 7.1), with its main modules and their purpose. The goal is to serve as an introduction to the following sections, where each part is presented in detail.

Figure 7.1: Architecture of the GEM-Mapper (Version 3)

The GEM architecture is structured as several abstraction layers, with each layer playing a precise role. Firstly, the search modules represent the highest-level routines of the GEM-Mapper. Besides building and handling GEM reference files, these modules are in charge of performing the different algorithmic workflows used when mapping.

Second come the approximate string matching modules, which constitute the main part of the mapper. They are organised according to the classical steps involved in filtering algorithms, i.e. candidate generation, verification and realignment. In turn, those modules make use of the alignment modules, that is, the algorithms and routines in charge of performing pairwise alignment and implementing individual filters. In the end, the intermediate results of the mapping process are gathered and handled by the matches modules, which rank, classify, and score all the matches produced by a given search. As a backbone to all the routines of the mapper, the GEM Archive and FM-Index modules are encapsulated in the archive layer. This is the place where GEM implements the encoded reference, the BWT, the FM-Index and other related data structures and functions. Last but not least, GEM implements its own I/O modules to enable faster parallel parsing of FASTQ/FASTA files, and its own memory management subsystem to avoid the overheads derived from excessive use of the library malloc/free functions.

In general, all the GEM algorithmic workflows are designed in such a way that they can be reused by higher level workflows. Note that the candidate verification module constitutes the basic workflow of alignment and hence other modules can call it several times during their operation. For instance, the single-end workflow uses candidate verification to resolve SE alignments, and in turn is used by the paired-end workflow.

7.2.2 Candidate verification workflow

Figure 7.2: GEM candidate verification workflow

Candidate decoding and composition

Candidate verification starts with a set of positions to be verified against a sequenced read (i.e. aligned). The first difficulty is translating the SA-space encoded positions to text-space. For that, and depending on the number of candidate positions to decode, we use a batch decoding procedure to prefetch both accesses to the FM-Index and the CSA – and therefore alleviate slowdowns due to page-faults accessing the GEM Archive.

The next step is to put all the decoded candidate positions together so as to derive the candidate regions of the genome likely to align to the sequenced read (compose candidates). It is important to note that more than one candidate position can lead to the same genomic region, because multiple seeds can map to the same locus on the genome. To put it in numbers, using the simulated human dataset of 100nt, the number of candidates shrinks by 20% when all the positions are arranged together. At a length of 1000nt the effect is even stronger, reaching a reduction of 60%.
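To make the composition step concrete, here is a minimal C sketch (an illustration under assumed data layouts, not the actual GEM code) that merges decoded text-space positions into unique candidate regions: positions are sorted and any position falling inside the window opened by a previous one is absorbed into the same region. The window size passed by the caller would typically be the read length plus the maximum allowed error.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Collapse sorted text-space candidate positions into candidate regions.
 * Positions closer than 'window' are assumed to point to the same genomic
 * locus and are merged into one region. */
static int cmp_u64(const void *a, const void *b) {
  uint64_t x = *(const uint64_t*)a, y = *(const uint64_t*)b;
  return (x > y) - (x < y);
}

size_t compose_candidates(uint64_t *pos, size_t num_pos, uint64_t window,
                          uint64_t *region_begin, uint64_t *region_end) {
  if (num_pos == 0) return 0;
  qsort(pos, num_pos, sizeof(uint64_t), cmp_u64);
  size_t num_regions = 0;
  region_begin[0] = pos[0];
  region_end[0] = pos[0] + window;
  for (size_t i = 1; i < num_pos; ++i) {
    if (pos[i] <= region_end[num_regions]) {
      region_end[num_regions] = pos[i] + window;   /* extend current region */
    } else {
      ++num_regions;                               /* open a new region */
      region_begin[num_regions] = pos[i];
      region_end[num_regions] = pos[i] + window;
    }
  }
  return num_regions + 1;
}

int main(void) {
  uint64_t pos[] = {1200, 1000, 1030, 50000, 50010};
  uint64_t begin[5], end[5];
  size_t n = compose_candidates(pos, 5, 250, begin, end);
  for (size_t i = 0; i < n; ++i)
    printf("region %zu: [%llu, %llu)\n", i,
           (unsigned long long)begin[i], (unsigned long long)end[i]);
  return 0; /* the five positions collapse into two candidate regions */
}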

Note that other mapping algorithms would perform the same operation, but with the purpose of chaining the seeds onto those nearby locations and deriving an alignment via extension (i.e. SWG alignment). Seed information can be a handy hint at the time of verification. Nevertheless, nothing guarantees that the placement of the seeds in the region at hand is optimal. In some cases, a region of the read can generate multiple seeds that match the candidate region several times. In case of conflict, greedy seed-chaining policies do not guarantee optimality and thus some candidate regions might be mistakenly discarded.

In our case, the GEM mapper aims to analyse the resulting candidate region without the information given by the seeds to check if it aligns against the read within a given error. Moreover, we must extend the scope of the candidate region by the maximum error allowed to consider worst-case alignments.

Before any verification can be performed, the candidate region must be retrieved from the reference. In the case of the GEM-mapper, the reference is stored as plain ASCII text in the GEM archive. However, only the forward strand is stored, as the reverse complement can easily be derived from it. Other approaches employ the BWT of the index so as to decode the candidate region text character-wise using LF-mappings. However, these approaches are much slower and generate sparser – and therefore less efficient – memory accesses than simply keeping the plain reference explicitly in memory. The only downside is the extra memory consumption; nevertheless, modern computer architectures allow the integration of large memory banks, and this extra memory cost should not represent a limitation nowadays. Additionally, we should consider cases where the candidate region in the reference is not large enough to cover the read and dangling ends appear. In these cases, the reference module must generate a padded candidate region so as to enable regular verification procedures that are transparent to the reference trims.
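The padding step can be illustrated with the following C sketch, which assumes a hypothetical flat text layout rather than the real GEM archive format: the requested window is clipped against the reference boundaries, dangling ends are filled with 'N', and the reverse complement is derived on demand from the forward strand.

#include <string.h>
#include <stdio.h>

/* Extract text[begin, begin+length) from the forward strand, padding with 'N'
 * wherever the requested window falls outside the reference boundaries. */
void retrieve_padded_region(const char *text, long text_len,
                            long begin, long length, char *out) {
  for (long i = 0; i < length; ++i) {
    long p = begin + i;
    out[i] = (p >= 0 && p < text_len) ? text[p] : 'N';  /* pad dangling ends */
  }
  out[length] = '\0';
}

/* Derive the reverse complement of a region on demand (only the forward
 * strand is stored; anything unexpected is mapped to 'N'). */
void reverse_complement(const char *in, long length, char *out) {
  for (long i = 0; i < length; ++i) {
    char c = in[length - 1 - i];
    switch (c) {
      case 'A': out[i] = 'T'; break;
      case 'C': out[i] = 'G'; break;
      case 'G': out[i] = 'C'; break;
      case 'T': out[i] = 'A'; break;
      default:  out[i] = 'N'; break;
    }
  }
  out[length] = '\0';
}

int main(void) {
  const char *ref = "ACGTACGTAC";
  char window[16], rc[16];
  retrieve_padded_region(ref, (long)strlen(ref), -2, 8, window); /* window starts before the text */
  reverse_complement(window, 8, rc);
  printf("window: %s\nrevcomp: %s\n", window, rc); /* window: NNACGTAC, revcomp: GTACGTNN */
  return 0;
}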

Chained candidate region verification

At this point, the candidate region is ready to be verified. This is done by means of a chain of verification algorithms. First, a q-gram filter (also denoted k-mer counting) algorithm quickly discards the candidates most divergent from the read. Experimental results show that on read lengths of 100nt, the k-mer counting filter discards almost 30% of the candidates. Second, an implementation of Myers' bit-parallel algorithm (BPM) is used to compute the edit distance between the remaining candidates and the read. Up to this point, all the initial candidates are verified so as to rank them together before a re-alignment step. During re-alignment, from the least distant candidate to the most divergent, candidate regions are fully aligned against the read using a banded Smith-Waterman-Gotoh (SWG) algorithm. This process can stop prematurely if the maximum number of matches to report – specified by the user – is reached and no subdominant match still to realign can achieve a better SWG-score than the currently realigned matches.
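The k-mer counting step relies on the classical q-gram lemma: a candidate that matches a read of length n within e errors must share at least n + 1 - q(e + 1) of the read's q-grams with it. The following C sketch applies this necessary condition with q = 3 over a 2-bit DNA encoding; the actual q-gram length and thresholds used by GEM are not implied here.

#include <stdio.h>
#include <string.h>

#define Q 3                      /* q-gram (k-mer) length */
#define NUM_QGRAMS (1 << (2*Q))  /* 4^Q entries for a 2-bit DNA encoding */

static int base2bits(char c) {   /* non-ACGT characters mapped to A for simplicity */
  switch (c) { case 'A': return 0; case 'C': return 1;
               case 'G': return 2; case 'T': return 3; default: return 0; }
}

/* Count how many q-grams (with multiplicity) the read and the candidate share. */
static int shared_qgrams(const char *read, const char *cand) {
  int read_cnt[NUM_QGRAMS] = {0}, cand_cnt[NUM_QGRAMS] = {0};
  int n = (int)strlen(read), m = (int)strlen(cand), shared = 0;
  for (int i = 0; i + Q <= n; ++i) {
    int code = 0;
    for (int j = 0; j < Q; ++j) code = (code << 2) | base2bits(read[i+j]);
    read_cnt[code]++;
  }
  for (int i = 0; i + Q <= m; ++i) {
    int code = 0;
    for (int j = 0; j < Q; ++j) code = (code << 2) | base2bits(cand[i+j]);
    cand_cnt[code]++;
  }
  for (int c = 0; c < NUM_QGRAMS; ++c)
    shared += (read_cnt[c] < cand_cnt[c]) ? read_cnt[c] : cand_cnt[c];
  return shared;
}

/* q-gram lemma: a match within max_error errors must share at least
 * n+1-Q*(max_error+1) q-grams with the read; fewer shared q-grams means the
 * candidate can be discarded without computing any alignment. */
int qgram_filter_accepts(const char *read, const char *cand, int max_error) {
  int n = (int)strlen(read);
  int threshold = n + 1 - Q * (max_error + 1);
  if (threshold <= 0) return 1;              /* filter has no discriminative power */
  return shared_qgrams(read, cand) >= threshold;
}

int main(void) {
  const char *read = "ACGTACGTACGT";
  printf("%d\n", qgram_filter_accepts(read, "ACGTACGAACGT", 1)); /* 1: kept    */
  printf("%d\n", qgram_filter_accepts(read, "TTTTTTTTTTTT", 1)); /* 0: dropped */
  return 0;
}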

Interestingly, at this point it is easy to derive a minimum and a maximum edit distance bound for each candidate region. The minimum bound helps determine whether a candidate region can obtain a better SWG-score than previously re-aligned candidates. Meanwhile, the maximum bound helps reduce the effective band used within the SWG alignment without losing optimality.

Candidate realignment is one of the most important bottlenecks of modern read mappers. That is not only because of its quadratic complexity, but also due to the need to store the whole alignment matrix – or just the band – in memory and back-trace it afterwards. As the read length increases, so does the stress on the realignment step. Therefore it is important to limit this stage to the smallest possible number of candidate regions.

Re-alignment: A note on SWG based mappers

In recent years, there has been a marked trend for mappers to use SWG as an alignment metric. At the end of the day, this forces mappers to adopt SWG metrics without a clear justification. One should not forget the main purpose of a mapper and its place in the analysis pipelines. A genomic mapper is ultimately designed to recover the true position a read was generated from at sequencing. That is, a mapper is expected to maximise the number of true-positive (TP) mappings reported while keeping false positives (FP, i.e. noise) as low as possible. Progressively, computing the true biological shape of the alignment (i.e. CIGAR) has become a second requirement for mappers. For this reason, there has been a marked trend towards mappers implementing SW-based algorithms with some predefined scoring weights. Ultimately, this has silently become a requirement for compliant downstream analysis in many genomic pipelines. As a result, many mappers have been forced to add this computationally intensive stage to their internal workflow, which in many cases becomes the main bottleneck of the mapper. The task of computing the most likely biological CIGAR might be misplaced, and the true aim of performing a SWG alignment may have been misunderstood.

A genomic mapper usually operates at the level of single reads. This severely restricts the overall vision a mapper can have of the whole genome and of how all the reads map against it. It is quite common for many analysis pipelines to have a re-alignment stage after mapping. Reads mapping to the same locus are considered as a whole (pileup) and then aligned together to derive from the consensus the most likely shape of the real biological alignment (multiple re-alignment). This approach benefits from the aggregated information coming from all sequenced reads at a given locus of the genome. Not only that, but it is also robust to spurious, falsely mapped reads, as the consensus of the pileup can easily reveal any discordance.

Therefore, we should re-think the real purpose of performing an SWG alignment at the mapping stage. It might not be to determine the real biological CIGAR, but rather to better rank all the mappings found for a certain read and to determine the most plausible locus of the read; in other words, to improve the accuracy of the mapper. As a matter of fact, many aligners and other tools do not rely on the SWG metric and, interestingly, this does not affect the quality of their results.

7.2.3 Single-end alignment workflow

The single-end alignment workflow follows the principle that not all reads are equal and that some of them require more computational effort to be properly aligned. Accordingly, the workflow is structured in a modular and incremental fashion (figure 7.3). Depending on the structure of the read, the mapper performs different searches until the read is considered properly aligned or the mapper can assess that no plausible result can be found. In this way, the workflow is composed of three major search stages: Adaptive Filtering Searches (AFS), Neighbourhood Searches (NS) and Local Alignment Searches (LAS). Each search stage employs the information derived from the previous search stages. The goal is to further explore the search space while avoiding repeated and unnecessary computations.

Figure 7.3: GEM single-end alignment workflow

Classifying reads according to their difficulty in being aligned

Interestingly, input reads can be classified with respect to the computational difficulty of aligning them. In section 3.3, Genome mapping accuracy, we explained that for a given read there is a certain search depth needed to resolve the search accurately. In most cases, this search space can be explored easily by means of simple filtering techniques. However, in other cases more aggressive search algorithms are needed, as filtering approaches are simply not good enough. Experimental numbers show that at a length of 100nt, using AFS, for more than 80% of the reads the search explores more solutions than strictly necessary. The remaining 20% corresponds to reads that cannot be mapped, or for which there is no certainty that there are no more matches close to the ones found (i.e. the possibility of finding ties).

Accordingly, the “easy” reads only go through the AFS, while the “hard” reads have to continue through NS. If this last stage is not able to find any global alignment within the given error limits, all the discarded candidates generated so far are fed to the LAS so that it can look for local alignments. Therefore, local alignments are output only if no good global alignment can be found. This approach aims to exhaust the search for global alignments – first with fast algorithms and then with more computationally intensive and accurate ones – and only then resort to local alignment.

Adaptive Filtering Search

The first stage for all reads is based on an adaptive pattern partitioning which splits the read into variable-length regions with a minimum number of occurrences in the reference. Afterwards, all the encoded candidates generated from these regions are verified (i.e. the candidate verification workflow). As stated above, for the majority of reads this step reveals unambiguously the nature of the read (i.e. unique, tie, etc.). For the remaining cases, the search has to go further so as to better resolve the mapping. Classically, when the initial filtering search could not deliver accurate enough results, a re-seeding stage would take place. During this stage, filtering thresholds would be increased so as to tolerate more candidates during verification, at the cost of more computation. The aim would be to extract more seeds/regions to explore the space of solutions further. Nevertheless, filtering has its limitations as to the number of candidates it is feasible to verify just to go one error deeper in the search.
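The essence of the adaptive partitioning can be sketched as follows (an illustration, not the GEM implementation): each region is grown character by character and closed as soon as its number of occurrences in the reference drops to a given threshold, so frequent regions become long and rare regions stay short. The count_fn callback stands in for an FM-index backward-search count; the toy_count function below is purely illustrative.

#include <stdio.h>
#include <stdint.h>

/* Callback abstracting an FM-index count query: number of occurrences of
 * read[begin, end) in the reference (a hypothetical interface). */
typedef uint64_t (*count_fn)(const char *read, int begin, int end, void *index);

typedef struct { int begin, end; } region_t;

/* Adaptive partitioning sketch: extend the current region one character at a
 * time and close it once its occurrence count falls to 'max_occ' or fewer.
 * Returns the number of regions produced (at most max_regions). */
int adaptive_partition(const char *read, int read_len, uint64_t max_occ,
                       count_fn count, void *index,
                       region_t *regions, int max_regions) {
  int num_regions = 0, begin = 0;
  while (begin < read_len && num_regions < max_regions) {
    int end = begin + 1;
    while (end < read_len && count(read, begin, end, index) > max_occ) ++end;
    regions[num_regions].begin = begin;
    regions[num_regions].end = end;
    ++num_regions;
    begin = end;          /* the next region starts where this one stopped */
  }
  return num_regions;
}

/* Toy "index": pretends that longer substrings are rarer (purely illustrative). */
static uint64_t toy_count(const char *read, int begin, int end, void *index) {
  (void)read; (void)index;
  int len = end - begin;
  return (len >= 12) ? 5 : (uint64_t)1 << (20 - len);
}

int main(void) {
  const char *read = "ACGTACGTACGTACGTACGTACGTACGTACGT"; /* 32nt toy read */
  region_t regions[8];
  int n = adaptive_partition(read, 32, 16, toy_count, NULL, regions, 8);
  for (int i = 0; i < n; ++i)
    printf("region %d: [%d, %d)\n", i, regions[i].begin, regions[i].end);
  return 0;
}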

Preconditioned Neighbourhood Search. Further tree pruning

As opposed to re-seeding, the GEM-Mapper employs a different approach based on a bidirectional neighbourhood search preconditioned with the search bounds derived from the AFS (chapter 6). This algorithm explores the search space further until it finds a valid match for the sequence (always exploring complete strata).

Local Alignment Search

Were the previous methods to fail when searching for a valid match, a local alignment step is performed so as to report high-quality local matches. For this, all seeds that failed to align are extended using a SWG algorithm in order to delimit the part of the alignment that scores best locally.

Matches re-alignment, MAPQ scoring and output

Afterwards, all found matches are sorted and classified. Depending on the parameter options, the GEM-mapper (re)aligns the matches most relevant to the search and produces a complete alignment (i.e. CIGAR). The distance function can be configured via parameters; nonetheless, by default a SWG alignment is performed. In practice, due to the repetitive nature of many genomes, several of the reported matches share the exact same CIGAR. That is, many reported matches belong to repetitive copies of the same locus (section 3.3.3). In order to exploit this observation, the GEM-mapper implements an alignment cache keyed on the reference sequence of every match. In this way, we avoid duplicated re-alignment calls to expensive SWG routines. Finally, alignments are MAPQ-scored according to the δ-strata group (section III) they belong to, and reported to the user.
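A minimal sketch of such a cache, with data layout and sizes chosen arbitrarily for the example (they are not the GEM internals): the key is the extracted candidate reference sequence and the value is the alignment previously computed for it, so identical repetitive copies trigger a single SWG call.

#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 1024
#define MAX_SEQ 256
#define MAX_CIGAR 64

/* One cache slot: candidate reference sequence -> previously computed CIGAR. */
typedef struct { char seq[MAX_SEQ]; char cigar[MAX_CIGAR]; int used; } slot_t;
static slot_t cache[CACHE_SLOTS];

static unsigned long djb2(const char *s) {
  unsigned long h = 5381;
  while (*s) h = h * 33 + (unsigned char)*s++;
  return h;
}

/* Return a cached CIGAR for this candidate sequence, or NULL on a miss
 * (linear probing; a full table simply behaves as a permanent miss). */
const char *cache_lookup(const char *seq) {
  unsigned long i = djb2(seq) % CACHE_SLOTS;
  for (int probe = 0; probe < CACHE_SLOTS; ++probe) {
    slot_t *s = &cache[(i + probe) % CACHE_SLOTS];
    if (!s->used) return NULL;
    if (strcmp(s->seq, seq) == 0) return s->cigar;
  }
  return NULL;
}

void cache_insert(const char *seq, const char *cigar) {
  unsigned long i = djb2(seq) % CACHE_SLOTS;
  for (int probe = 0; probe < CACHE_SLOTS; ++probe) {
    slot_t *s = &cache[(i + probe) % CACHE_SLOTS];
    if (!s->used || strcmp(s->seq, seq) == 0) {
      strncpy(s->seq, seq, MAX_SEQ - 1);
      strncpy(s->cigar, cigar, MAX_CIGAR - 1);
      s->used = 1;
      return;
    }
  }
}

int main(void) {
  const char *repeat_copy = "ACGTACGTTTACGT";        /* same text at two loci */
  if (!cache_lookup(repeat_copy)) {
    /* an expensive SWG realignment would run here; its (hypothetical) result is cached */
    cache_insert(repeat_copy, "7M1I6M");
  }
  printf("reused CIGAR: %s\n", cache_lookup(repeat_copy)); /* hit: no second SWG call */
  return 0;
}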

7.2.4 Paired-end alignment workflow

The paired-end mapping workflow of GEM is heavily based on the single-end mapping workflow. Both ends are mapped independently and the resulting matches for both ends are cross-paired, taking into account feasible insert sizes (according to the distribution of pairs observed so far, and the parameters provided by the user). Figure 7.4 shows a scheme of the workflow; a simplified sketch of the cross-pairing step is given below.
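The sketch below illustrates the cross-pairing step under simple assumptions (the array layout, an FR orientation model and a fixed admissible insert-size interval are choices made for the example): every match of one end is tested against every match of the other, and a pair is kept when both ends fall on the same contig, in opposite orientations, and at an insert size within the admissible range.

#include <stdio.h>

typedef struct {
  int contig;        /* contig/chromosome identifier */
  long position;     /* leftmost mapping position */
  int forward;       /* 1 if mapped on the forward strand */
  long length;       /* aligned length on the reference */
} match_t;

/* Template length spanned by the two ends (0 when on different contigs). */
static long insert_size(const match_t *a, const match_t *b) {
  if (a->contig != b->contig) return 0;
  long left  = (a->position < b->position) ? a->position : b->position;
  long right = (a->position + a->length > b->position + b->length)
                 ? a->position + a->length : b->position + b->length;
  return right - left;
}

/* Cross-pair all SE matches of both ends, keeping concordant pairs whose
 * insert size lies within [min_insert, max_insert] and whose ends map in
 * opposite orientations (a typical FR Illumina layout is assumed here). */
int cross_pair(const match_t *end1, int n1, const match_t *end2, int n2,
               long min_insert, long max_insert) {
  int num_pairs = 0;
  for (int i = 0; i < n1; ++i) {
    for (int j = 0; j < n2; ++j) {
      if (end1[i].contig != end2[j].contig) continue;
      if (end1[i].forward == end2[j].forward) continue;   /* same strand: discard */
      long tlen = insert_size(&end1[i], &end2[j]);
      if (tlen >= min_insert && tlen <= max_insert) {
        printf("pair: %ld / %ld (insert %ld)\n",
               end1[i].position, end2[j].position, tlen);
        ++num_pairs;
      }
    }
  }
  return num_pairs;
}

int main(void) {
  match_t end1[] = { {1, 10000, 1, 100}, {2, 500, 1, 100} };
  match_t end2[] = { {1, 10350, 0, 100}, {1, 90000, 0, 100} };
  int pairs = cross_pair(end1, 2, end2, 2, 200, 600);
  printf("concordant pairs: %d\n", pairs); /* 1: positions 10000/10350, insert 450 */
  return 0;
}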

Figure 7.4: GEM paired-end alignment workflow

7.3 GEM-GPU Mapper

In addition to the standard CPU mapper, GEM has also been ported to the GPU. To that end, the most critical stages of mapping have been implemented on the GPU so as to offload these computations and relieve the CPU. Figure 7.5 shows a scheme of the modular interaction of the CPU mapper with the GPU kernels. In this way, the adaptive pattern partitioning – or adaptive region filtering module –, the decoding of positions from SA-space to text-space, and the BPM verification algorithms are computed using one or many GPUs. As a result, the outputs produced by both versions of the mapper are "diff"-identical.

Figure 7.5: GEM batch mapping

Using high-performance implementations of these modules (published in [Chacón et al., 2013, 2014a,b, 2015]), the CPU mapper gathers batches of searches and sends them to the GPU. Once computed, these batches are returned to the CPU, where the regular workflow for each search is re-established. Note that, to ensure that the process of sending and retrieving searches to the GPU pays off, the batches have to be large enough to guarantee that the speedup achieved on the GPU compensates for the transfers and overheads involved in the communication. Furthermore, transfers to the GPU are done asynchronously to enable multiple threads to use the GPU and avoid waiting and synchronisation overheads.
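The batching scheme can be illustrated with the simplified C sketch below. The gpu_bpm_submit/gpu_bpm_retrieve functions are hypothetical placeholders standing in for the GPU kernels of [Chacón et al., 2015] (here implemented as CPU stubs so the sketch is self-contained), and the batch capacity is arbitrary; the point is only that verification tasks are accumulated, shipped in large batches, and their results re-injected into the regular per-read workflow.

#include <stdio.h>
#include <string.h>

#define BATCH_CAPACITY 4   /* tiny on purpose; a real batch holds thousands of tasks */

typedef struct { int read_id; long candidate_position; } verification_task_t;
typedef struct { int read_id; int edit_distance; } verification_result_t;

typedef struct {
  verification_task_t tasks[BATCH_CAPACITY];
  verification_result_t results[BATCH_CAPACITY];
  int num_tasks;
} gpu_batch_t;

/* Stand-ins for the asynchronous GPU interface (hypothetical): in a real
 * mapper these would enqueue transfers and launch the BPM kernel. Here they
 * just fabricate a result so the sketch runs on its own. */
static void gpu_bpm_submit(gpu_batch_t *batch) {
  for (int i = 0; i < batch->num_tasks; ++i) {
    batch->results[i].read_id = batch->tasks[i].read_id;
    batch->results[i].edit_distance = (int)(batch->tasks[i].candidate_position % 5);
  }
}
static void gpu_bpm_retrieve(gpu_batch_t *batch) { (void)batch; /* would synchronise here */ }

/* CPU-side driver: accumulate verification tasks and flush full batches, so the
 * transfer and launch overheads are amortised over many candidates. */
static void flush_batch(gpu_batch_t *batch) {
  if (batch->num_tasks == 0) return;
  gpu_bpm_submit(batch);
  gpu_bpm_retrieve(batch);
  for (int i = 0; i < batch->num_tasks; ++i)   /* resume the per-read workflow */
    printf("read %d verified, edit distance %d\n",
           batch->results[i].read_id, batch->results[i].edit_distance);
  batch->num_tasks = 0;
}

static void buffer_task(gpu_batch_t *batch, int read_id, long candidate_position) {
  batch->tasks[batch->num_tasks].read_id = read_id;
  batch->tasks[batch->num_tasks].candidate_position = candidate_position;
  if (++batch->num_tasks == BATCH_CAPACITY) flush_batch(batch);
}

int main(void) {
  gpu_batch_t batch;
  memset(&batch, 0, sizeof(batch));
  for (int r = 0; r < 6; ++r) buffer_task(&batch, r, 1000L * r + 3);
  flush_batch(&batch);   /* drain the partially filled last batch */
  return 0;
}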

7.4 Experimental results and evaluation

In order to evaluate the GEM mapper and compare it with other state-of-the-art mappers, this section presents several benchmarks performed on both real and simulated data. The purpose of the benchmarks is to present the practical performance of the GEM mapper and all the optimisation techniques implemented within it. We aim to demonstrate the performance benefits of the GEM optimisations in real-case scenarios. In addition, we analyse the quality of the results delivered by different mappers and for different datasets. We want to put special emphasis on comparing the different trade-offs between speed and accuracy that current mapping tools can offer in practice.

For this purpose, as mentioned in section 1.3, we have selected real datasets that accurately represent current sequencing technologies (table 1.2), and simulated corresponding datasets to perform further analysis on the quality of the results (table 1.1). We aim to give a complete description of the behaviour of the different tools with different real datasets and technologies. Besides this, we want to point out how each of these mappers scales with the read length – as long-read technologies are evolving and becoming ever more present in regular experiments.

For the benchmark, we have selected a series of mappers relevant because of their competitive performance, quality, or because they are widely used and relevant in the field. From all the available CPU mappers, we have selected: Bowtie2 [Langmead and Salzberg, 2012] because it is a widely used mapper, BWA-MEM [Li, 2013] because of its popularity, performance and quality, SNAP [Zaharia et al., 2011] and HPG [Tárraga et al., 2014] due to their great performance, and Novoalign [Hercus, 2012] because of its high-quality results. On the GPU side, we have selected those practical mappers which could be run successfully with different real datasets without crashing and which have high enough performance to match the rest of the mappers. In summary, we selected CUSHAW2 [Liu and Schmidt, 2014], SOAP3DP-GPU [Luo et al., 2013], and NVBowtie2 (unpublished). Additionally, for each mapper different execution parameters have been used in an effort to show the different compromises between quality and performance that each of these mappers can offer in practice.

All the benchmarks have been executed on the same computing node, equipped with an Intel(R) Xeon(R) CPU E5-2640 v3 at 2.60GHz (8 cores, 16 threads). This node has 128 GB of RAM available and runs Red Hat Enterprise Linux Server release 6.7 (kernel 2.6.32). Note that all executions have been performed on the computing node in exclusive mode to reduce any noise from other processes and jobs. In order to measure the computing scalability of the mapping tools and analyse their behaviour in a high-performance computing node, all mappers have been executed using 16 threads (i.e. using the command line options provided by each mapper). Furthermore, many measurements have been repeated several times to ensure stability and avoid major running-time variations between executions.

7.4.1 Benchmark on simulated data

Simulated single-end

For simulated datasets, figure 7.6 shows the speedups achieved by GEM compared to other mappers for different read lengths (table 7.1 shows all times).

Figure 7.6: Comparative GEM CPU speed-up for different simulated read lengths (SE). All mappers run using defaults

In general, GEM consistently outperforms the rest of the mappers, achieving better running times. For instance, at 100nt GEM obtains a speedup of 7.2x compared to BWA-MEM, 8.6x compared to Bowtie2, 15.8x compared to NovoAlign, and 5.3x compared to SNAP – selected because of its remarkable performance. As the reads increase in length, some mappers cannot scale and take a very long time to finish (e.g. Bowtie2, Novoalign and SOAP3DP). In the case of GEM, times increase only moderately with the read length: at 1000nt, GEM is about 2 times slower than when mapping 100nt reads. In this way, GEM – like BWA-MEM – takes advantage of long reads, being more efficient per aligned base with longer reads. In contrast, other mappers like HPG improve their running time with longer reads. However, this comes at the expense of accuracy, as the total number of true positives found decreases with respect to other mappers (table 7.2). From the GPU standpoint, GEM-GPU achieves a speedup of 2x – with respect to the CPU version – consistently across all read lengths, making it the fastest among all mappers, both CPU and GPU. In particular, GEM obtains a speedup of 15x compared to BWA-MEM, 9x compared to the fastest mode of Bowtie2, and 4x compared to SNAP. Note that it produces the same results as the CPU version, so no loss of quality is incurred compared to the CPU mode. It is worth mentioning that other mappers achieve quite good performance results. This is the case for SOAP3DP-GPU, HPG, and SNAP, which show remarkable speed-ups compared to widely used mappers like BWA-MEM. Though GEM outperforms all of them, they go to show that "de facto" tools in the field can be largely improved in terms of performance.

Nevertheless, performance should not be evaluated alone: it is crucial to take accuracy results into consideration. In table 7.2 we show the total number of true positives reported by each mapper for the simulated single-end datasets. The reader must keep in mind the relevance of small margins in the total number of true positives found and the effects these can have on the downstream analysis [Hwang et al., 2015; Li, 2014].

Wall-clock time           100nt   300nt   500nt   1000nt
GEM.fast                    393     566     812      898
GEM.sensitive               671     620     885      920
GEM.fast.GPU                188     302     342      439
GEM.sensitive.GPU           466     430     405      456
Bowtie2.veryFast           1753    1866    8830    25405
Bowtie2.default            3393    4061   20535    80832
Bowtie2.verySensitive      7411    9750   40311   127756
BWA.MEM.default            2836    3005    3207     3885
BWA.MEM.c500               2843    2995    3224     3868
BWA.MEM.c1000              3104    3124    3366     4004
BWA.MEM.c16000             7057    5125    4332     4189
BWA.MEM.c20000            12728    9138    7271     6269
SNAP.h50                    811     540     523      811
SNAP.default               2065     736    1015     2237
SNAP.h1000                 2162    3200    1843     3564
SNAP.h2000                 4662    5722    2924     3849
Novoalign                  6221    4770    6102     9611
HPG.fast                   2030     827     568      399
HPG.default                2098    1065     782      474
HPG.sensitive              2092    1433    1154      690
CUSHAW2.default            1697    3687    2787     1437
CUSHAW2.sensitive          1693    3650    2843     1421
SOAP3DP-GPU                 854     326    1718    18640
NVBowtie2.veryFast          987     816     770      n/a
NVBowtie2.default          1312    1151    1156      n/a
NVBowtie2.verySensitive    2045    1995    1907      n/a

Table 7.1: Running time in seconds for simulated single-end datasets

In this case, mapping high-quality Illumina sequences (i.e. with a low average error rate) against a well known reference genome allows reporting more than 90% of the true positives for all mappers. However, the devil is in the details, and a margin of >2% can lead to the discovery of potentially interesting variants [Laurie et al., 2016]. In this scenario, GEM consistently outperforms all the other mappers. Nonetheless, we must point out that HPG at 100nt achieves a better percentage of true positives than GEM by 0.23% – though running 5x slower. As the table shows, the number of true positives found increases with the read length for the majority of mappers. This is an expected result, as long reads are more specific and mapping tools can extract more seeds – increasing the probability of finding the true locus of the sequence. HPG and SNAP are the exception to this, probably due to their inability to handle reads of 1000nt and larger. It is also important to highlight that GEM spends a non-negligible amount of time in the sensitive mode to report few extra true positives. Ultimately, at lengths of 500nt and 1000nt, this search can be deemed unproductive, as the total number of true positives remains unchanged. In most cases, this is due to the fact that the true positives are found in a deep stratum, and the search is abandoned because less distant matches have been found before. Note that, even if we force the mapper to explore these strata, no sensible MAPQ scoring function is likely to score these matches better than the ones found before.

True positives            100nt   300nt   500nt   1000nt
GEM.fast                  98.76   99.77   99.87    99.94
GEM.sensitive             99.01   99.79   99.87    99.94
Bowtie2.veryFast          94.93   96.82   97.62    98.37
Bowtie2.default           96.06   98.07   98.47    98.90
Bowtie2.verySensitive     96.25   98.15   98.52    98.92
BWA.MEM.default           98.07   99.19   99.33    99.51
BWA.MEM.c500              98.07   99.19   99.33    99.51
BWA.MEM.c1000             98.07   99.19   99.33    99.51
BWA.MEM.c16000            98.07   99.19   99.33    99.51
BWA.MEM.c20000            98.06   99.18   99.32    99.51
SNAP.h50                  95.83   98.15   98.26    90.30
SNAP.default              96.41   98.21   98.28    90.30
SNAP.h1000                96.57   98.22   98.29    90.31
SNAP.h2000                96.62   98.23   98.29    90.31
Novoalign                 95.23   95.23   95.23    95.23
HPG.fast                  98.99   99.05   98.58    97.72
HPG.default               98.99   99.17   98.67    97.87
HPG.sensitive             98.99   98.97   98.31    97.41
CUSHAW2.default           96.14   97.85   97.73    97.71
CUSHAW2.sensitive.K40     96.36   98.24   98.12    98.28
SOAP3DP-GPU               98.44   99.00   99.07    99.39
NVBowtie2.veryFast        95.23   97.97   98.43     n/a
NVBowtie2.default         95.77   98.13   98.52     n/a
NVBowtie2.verySensitive   95.86   98.14   98.53     n/a

Table 7.2: Percentage of true positives found for simulated single-end datasets

In regard to quality, GEM clearly shows one of the best ACC to FPR ratios, certainly as good as those of BWA-MEM and Novoalign. Meanwhile, mappers like CUSHAW, nvBowtie, or Bowtie2 introduce more FPs than the others studied (Figure 7.7 shows the ROC curves for 100nt). We observe that BWA-MEM surpasses GEM at high FPR (i.e. between 10^-3 and 10^-2, corresponding to MAPQ values below 20). In those cases, it becomes very difficult to accurately solve mapping conflicts and score matches. In case of conflict, GEM usually chooses to score those matches as ambiguous, setting MAPQ=0. In contrast, BWA-MEM's scoring formula behaves well in practice for those cases and obtains a small advantage over GEM.

Figure 7.7: ROC curves for simulated SE datasets (default parameters)

It is usually the case that downstream analysis tools select only a subset of the reported reads according to their MAPQ score. Accordingly, tables 7.3 and 7.4 show the percentage of mappings reported – with respect to all reads in the dataset – with a certain MAPQ score or greater. In this way, we benchmark how much usable data is passed on to any tool using these mapping results. As the tables show, compared to BWA-MEM, the GEM mapper consistently reports more usable mappings at the different MAPQ levels. Beyond these results, it is more important to highlight that, at the end of the day, not all reported mappings are used: only those scored as highly confident are. Thus, it is crucial for a mapper to accurately evaluate all reported mappings and assess which can be used with confidence.

                           100nt                        300nt
MAPQ ≥                40       30       20         40       30       20
GEM.fast           92.48%   92.66%   93.64%     96.51%   96.52%   97.21%
GEM.sensitive      93.28%   93.31%   94.40%     96.47%   96.48%   97.22%
BWA.MEM.default    90.60%   91.85%   93.21%     95.01%   95.48%   95.98%
BWA.MEM.c500       90.60%   91.85%   93.21%     95.01%   95.48%   95.98%
BWA.MEM.c1000      90.62%   91.86%   93.22%     95.03%   95.50%   95.99%
BWA.MEM.c16000     90.65%   91.87%   93.23%     95.05%   95.52%   96.00%
BWA.MEM.c20000     90.64%   91.87%   93.23%     95.05%   95.51%   96.00%

Table 7.3: Percentage of reported true-positives – with respect to all reads – with a certain MAPQ score or greater. Simulated human reads at 100nt and 300nt.

                           500nt                        1000nt
MAPQ ≥                40       30       20         40       30       20
GEM.fast           97.11%   97.11%   97.68%     97.72%   97.72%   98.20%
GEM.sensitive      97.10%   97.10%   97.68%     97.72%   97.72%   98.20%
BWA.MEM.default    95.84%   96.16%   96.64%     96.55%   96.82%   97.16%
BWA.MEM.c500       95.84%   96.16%   96.64%     96.55%   96.82%   97.16%
BWA.MEM.c1000      95.86%   96.17%   96.65%     96.55%   96.82%   97.17%
BWA.MEM.c16000     95.87%   96.17%   96.66%     96.56%   96.82%   97.17%
BWA.MEM.c20000     95.87%   96.17%   96.66%     96.56%   96.82%   97.17%

Table 7.4: Percentage of reported true-positives – with respect to all reads – with a certain MAPQ score or greater. Simulated human reads at 500nt and 1000nt.

Simulated paired-end

Figure 7.8: Comparative GEM CPU speed-up for different simulated read lengths (PE). All mappers run using defaults

For paired-end data, table 7.5 shows the timing results. Once again, GEM outperforms the rest of the mappers with the exception of SNAP in its fastest mode (i.e. h50 mode at 2x300nt). For most of the mappers, the paired-end mode adds a small overhead compared to single-end mapping.

If we look at the number of true positives found by each mapper, we find that most mappers improve their results using paired-end data. Although GEM obtains very high results, mappers like HPG or SOAP3DP-GPU find additional true positives (i.e. about 1% more). In contrast, SNAP and CUSHAW2 fail to find a comparable number of true positives in paired-end mode. As the ROC curves show (figure 7.9), GEM achieves one of the best ACC to FDR ratios, comparable to those of Novoalign and BWA-MEM. Other mappers like NvBowtie, SNAP, HPG, and SOAP3DP-GPU report significantly more false positives. In this case, Novoalign, followed by BWA-MEM, produces the best ROC curve.

Wall-clock time           2x100nt   2x300nt
GEM.fast                      505       621
GEM.sensitive                1221      1312
GEM.fast.GPU                  351       423
GEM.sensitive.GPU            1072      1102
Bowtie2.veryFast             2796      6903
Bowtie2.default              3882      7584
Bowtie2.verySensitive        6950     11801
BWA.MEM.default              3131      3040
BWA.MEM.c500                 3132      3040
BWA.MEM.c1000                3439      5274
BWA.MEM.c16000               7360      5165
BWA.MEM.c20000              13015      9124
SNAP.h50                      630       472
SNAP.default                  589       384
SNAP.h1000                   1929      3180
SNAP.h2000                   2029      3576
Novoalign                   12149    178403
HPG.fast                     2062       854
HPG.default                  2143      1102
HPG.sensitive                2142      1478
CUSHAW2.default              1700      3675
CUSHAW2.sensitive            1683      3708
SOAP3DP-GPU                  1190      1872
NVBowtie2.default            3768      4501
NVBowtie2.veryFast           2764      2981
NVBowtie2.verySensitive      5619      8023

Table 7.5: Running time in seconds for simulated paired-end datasets

True positives            2x100nt   2x300nt
GEM.fast                    98.38     99.64
GEM.sensitive               98.97     99.69
Bowtie2.veryFast            97.48     96.95
Bowtie2.default             97.72     97.29
Bowtie2.verySensitive       97.78     97.35
BWA.MEM.default             98.41     99.08
BWA.MEM.c500                98.41     99.08
BWA.MEM.c1000               98.41     99.08
BWA.MEM.c16000              98.41     99.08
BWA.MEM.c20000              98.41     99.08
SNAP.h50                    96.47     97.73
SNAP.default                96.92     97.88
SNAP.h1000                  97.85     98.14
SNAP.h2000                  97.85     98.14
Novoalign                   96.80     98.06
HPG.fast                    99.61     99.06
HPG.default                 99.61     99.06
HPG.sensitive               99.61     99.06
CUSHAW2.default             96.20     97.80
CUSHAW2.sensitive.K40       96.31     97.88
SOAP3DP-GPU                 99.39     99.40
NVBowtie2.default           97.70     98.24
NVBowtie2.veryFast          97.72     98.27
NVBowtie2.verySensitive     97.71     98.26

Table 7.6: Percentage of true positives found for simulated paired-end datasets

Figure 7.9: ROC curves for simulated PE datasets (default parameters)

7.4.2 Benchmark on real data

Next, a benchmark on real data was performed. Table 7.7 shows the timing results. In short, GEM achieves better times than the rest of the mappers. As before, GEM-GPU increases the performance of the CPU mapper by a speed-up of 2x. Compared to Novoalign – one of the most accurate mappers available – GEM obtains a speedup of 150x. Only SNAP in its fastest mode and SOAP3DP-GPU obtain similar performance results.

In terms of scalability with increasing read lengths, note that Bowtie2 and Novoalign have difficulties scaling with read length when mapping datasets like Moleculo. In contrast, SNAP strongly depends on the mapping mode and seems to perform better with longer reads. However, as table 7.8 shows, SNAP fails to map the majority of these long reads. As for GPU mappers, nvBowtie2 and SOAP3dp have problems scaling with longer reads, being unable to map very long reads like those of Moleculo. However, they perform very well with shorter reads (e.g. nvBowtie2 is almost 2x faster than its CPU counterpart).

Wall-clock time          HiSeq.se  MiSeq.se   Ion.se  Moleculo  HiSeq.pe  MiSeq.pe
GEM.fast                      423       562      467       711       620       626
GEM.sensitive                 774      2613     2793       744      1552      2747
GEM.fast.GPU                  211       260      499       487       305       285
GEM.sensitive.GPU             501       590     2861       505       410      1221
Bowtie2.veryFast             1771      1466     1311    134251      2972      3919
Bowtie2.default              3469      3162     3667    380706      4203      5854
Bowtie2.verySensitive        7310      6829     7146    794802      7263      7937
BWA.MEM.default              2889      3766     3461      2304      3389      4006
BWA.MEM.c500                 2876      3766     3451      2295      3394      4007
BWA.MEM.c1000                3012      3867     3541      2308      3552      4088
BWA.MEM.c16000               3415      3921     3733      2327      3962      4170
BWA.MEM.c20000               4877      6897    10432      3120      5542      7318
SNAP.h50                      811      1126     1103       547      1126       923
SNAP.default                 3610      1617     4762       975      1617      3470
SNAP.h1000                   6485      5553     4098      1436      5553     14770
SNAP.h2000                  10268      5798     6100      2509      5798     14870
Novoalign                   65072     45296  1464887     51932    190661    220322
HPG.fast                     2198      1058      886       366       n/a       974
HPG.default                  2278      1230      999       325       n/a      1172
HPG.sensitive                2257      1471     1019       353       n/a      1448
CUSHAW2.default              1582      1645      194       310      1645      4180
CUSHAW2.sensitive            1550      1603      193       318      1603     52786
SOAP3DP-GPU                   503      1401      n/a       n/a       857      2408
NVBowtie2.default            1360      1156      n/a       n/a      3917      2287
NVBowtie2.veryFast           1034       716      n/a       n/a      2858      1622
NVBowtie2.verySensitive      2165      2279      n/a       n/a      5721      3752

Table 7.7: Running time in seconds for real datasets

Note that the majority of mappers take slightly more time on real data than on simulated data – for the same number of reads. As a matter of fact, a small percentage of the real sequences do not map to the genome – due to potential errors during sequencing, or perhaps contamination. In those cases, a mapper may choose to explore matches with a high error rate or to switch to local alignment algorithms in order to map chunks of those sequences. The latter option is significantly more computationally expensive, and not always guaranteed to lead to better results. For that reason, some mappers can spend extra time compared to the results for simulated data.

Mapped reads             HiSeq.se  MiSeq.se   Ion.se  Moleculo  HiSeq.pe  MiSeq.pe
GEM.fast                    96.80     98.70    57.80     99.66     95.89     97.73
GEM.sensitive               97.60     98.70    57.80     99.72     96.78     97.48
Bowtie2.veryFast            96.03     60.79    28.32     99.11     96.03     98.16
Bowtie2.default             97.00     62.23    29.49     99.25     97.00     98.89
Bowtie2.verySensitive       97.28     62.54    29.71     98.14     97.28     98.05
BWA.MEM.default             99.40     98.87    57.42    100.00     98.32     98.41
BWA.MEM.c500                99.40     98.87    57.42    100.00     98.32     98.41
BWA.MEM.c1000               99.40     98.87    57.42    100.00     98.32     98.41
BWA.MEM.c16000              99.40     98.87    57.42    100.00     98.32     98.41
BWA.MEM.c20000              99.44     98.93    57.70    100.00     98.34     98.52
SNAP.h50                    99.64     33.46    24.20     81.81     98.50     20.82
SNAP.default                99.64     33.48    24.20     81.86     99.99     20.85
SNAP.h1000                  99.64     33.48    24.20     81.89     99.98     20.91
SNAP.h2000                  99.64     33.48    24.20     81.90     99.00     20.91
Novoalign                   99.02     96.03   100.00     99.02     89.15     97.02
HPG.fast                    99.14     94.32    39.51     99.42      n/a      94.12
HPG.default                 99.14     94.54    39.60     99.44      n/a      95.04
HPG.sensitive               99.14     94.23    39.61     99.43      n/a      94.70
CUSHAW2.default             96.63     52.84    11.76     99.29     96.58     54.97
CUSHAW2.sensitive           98.83     97.92    51.31     99.98     98.61     99.06
SOAP3DP-GPU                 97.16     85.69     n/a       n/a      96.52    100.00
NVBowtie2.default           96.99     53.69     n/a       n/a      96.99     71.26
NVBowtie2.veryFast          97.22     53.97     n/a       n/a      97.22     71.34
NVBowtie2.verySensitive     97.26     54.02     n/a       n/a      97.26     71.35

Table 7.8: Percentage of mapped reads for real datasets

In terms of mapped reads (table 7.8), most of the mappers achieve similar percentages. However, for the MiSeq dataset, some mappers fail to map most of their reads (i.e. Bowtie2, nvBowtie2, and SNAP). In the case of paired-end data, it seems that SNAP, CUSHAW2, and nvBowtie2 have difficulties mapping pairs. In contrast, BWA-MEM stands out from the rest, as it achieves almost 100% mapped reads on many datasets by outputting local alignments.

7.4.3 Benchmark on non-model organisms

It is widely accepted that many mapping tools base their algorithms on the results of experimental tests. Moreover, some mappers focus specifically on the human reference genome. Ultimately, there are cases in which developers strongly couple their tool to this genome, and even generate statistical models based on a specific version of it [Zaharia et al., 2011]. Nevertheless, mapping against the human genome is by far not the only use case. Mapping tools are expected to deliver the same performance and accuracy when working with non-model organisms too. For that reason, here we give a more thorough overview of the capabilities of the GEM-mapper in other use cases. We aim to evaluate the tool when mapping against non-model organisms, and to assess whether the methods behind the GEM-mapper still deliver good results in these scenarios.

                              SE                              PE
Rainbow trout        Time   % Mapped   Matches      Time   % Mapped   Matches
Bwa-mem (default)     465     100.00    24.92M       529      99.99    14.58M
GEM (fast)             98      99.98    32.78M       130      99.97    15.94M
GEM (sensitive)       169     100.00    34.42M       212      99.97    15.63M

Table 7.9: Results from mapping against a non-model organism (Oncorhynchus mykiss)

First, we present the results from mapping against the genome of the rainbow trout (Oncorhynchus mykiss). In this case, we have limited the comparison to GEM and BWA-MEM – the most competitive mapper to date. Table 7.9 shows the results of this benchmark. In short, we can see that both tools reach a high percentage of mapped reads – with BWA-MEM slightly superior. However, GEM clearly delivers better performance (i.e. a 2.7x–4.7x speedup). At the same time, GEM reports more matches than BWA-MEM.

                              SE                              PE
Wheat                Time   % Mapped   Matches      Time   % Mapped   Matches
Bwa-mem (default)    1221     100.00    32.99M      1613      99.99    22.28M
GEM (fast)            153      99.91    72.47M       280      99.84    25.90M
GEM (sensitive)       415     100.00    80.13M       546      99.93    23.12M

Table 7.10: Results from mapping against a highly repetitive and large reference (Triticum aestivum)

Additionally, we present the results from mapping against the genome of wheat (Triticum aestivum). This hexaploid genome is highly repetitive. Although its total size is estimated to be 17 Gbp, current assemblies only contain about 10 Gbp of scaffolds – like the version used for these tests. Nevertheless, this is considerably larger than the human genome and can potentially affect the performance of mapping tools (e.g. by increasing memory penalties). Table 7.10 summarises the results of the benchmark. As before, both tools reach a high percentage of mapped reads. However, both take longer than in the previous benchmark to map the same number of reads of the same length. Note that the times of BWA-MEM have increased by 3x compared to those obtained when mapping against the trout (2x in the case of GEM). As we can see, increasing the index size has less impact on the performance of GEM. Of equal importance is the number of matches reported. Note that GEM reports far more matches than BWA-MEM in single-end mode. This highlights the repetitive nature of the genome and the sensitivity of mapping tools to it. Moreover, we can see that in paired-end mode, BWA-MEM and GEM report a similar number of matches, as the paired reads better constrain the number of feasible matches and the potential number of conflicts.

Chapter 8

Conclusions and contributions

In summary, this thesis has presented both (1) methodological contributions towards alignment quality assessment, and (2) effective techniques to perform practical sequence alignments in the context of bioinformatics and HTS data analysis. The contributions described here could ultimately be applied to most parts and modules of any modern sequence mapper.

From a methodological standpoint, this thesis offers a comprehensive analysis of mapping quality assessment and a practical tool – the GEM-tools – for quality analysis of mapping results (section 3.3). In particular, it analyses the potential sources of mapping conflicts and the properties of the reference genomes that can result in low-quality alignments, offering a well defined model to classify and compare mapping results (section 3.3.2). The proposed methodology is based on δ-strata groups (section III). It allows sequence mappers to distinguish potentially ambiguous reads, thus empirically estimating the expected accuracy error for each group. Additionally, this thesis presents a detailed analysis of the practical limits encountered while aligning sequencing reads, which is based on genome mappability – originally introduced in [Derrien et al., 2012] (section 3.3.3). Mappability turns out to be a fundamental and recurrent concept when analysing filtering algorithms in the context of approximate string matching.

In regard to data structures used for index generation, this thesis offers a contribution centred on how to practically implement FM-indexes, and how to improve the performance of algorithms based on such indexes. To that end, our work introduces a thorough analysis of a variety of practical FM-Index design techniques and space-efficient layouts (section 4.3). In the same spirit, it details many of the trade-offs between different approaches and how they impact the performance of downstream mapping tools. In addition, this work presents novel techniques to improve the performance of approximate search algorithms based on the FM-Index. These include efficient FM-Index operators (section 4.3.3) tailored for specific search cases and a practical FM-Index lookup table (section 4.3.4) which allows algorithms to skip a sizeable number of search steps. Finally, this work proposes modifications to the input reference in order to allow faster and easier alignment in application-specific situations – such as the homopolymer-compacted index (section 4.4.2) for technologies prone to homopolymer errors during sequencing.

From an algorithmic standpoint, this thesis offers an important contribution to filtering algorithms for approximate string matching. Firstly, this work contributes – alongside a novel dynamic partition filtering algorithm (published in [Marco-Sola et al., 2012]) – the adaptive pattern partitioning (APP) (section 5.4.3). This greedy technique approximates the theoretically optimal partition while running in time proportional to the length of the pattern. As opposed to widely used dynamic partition algorithms – like MEMs or periodical seeding techniques – APP can formally guarantee a search depth that allows complete searches. Second, we offer a new unified vision of filtering algorithms (section 5.1.1) that allows chaining filters (section 5.5) and exploiting their properties towards better filtration ratios, efficiency and scalability. Moreover, we present a framework to embed distance metrics (section 5.2.4) and apply these chaining techniques to metrics other than the edit distance. This result allows mapping tools based on biological metrics to benefit from the performance of highly optimised algorithms tailored to simpler distance metrics. In this way, we reinforce the idea of progressive filtering by ranking and deriving bounds on the candidates as a cascade of filtering algorithms is applied (section 5.5.3). Finally, we present a technique to compose restrictions from different filtering partitions and prune incrementally deeper searches (i.e. vertical chaining of filters, section 5.5.2). In this manner, we allow progressive searches that increase the error rate while avoiding the recomputation of previous steps.

Additionally, this thesis presents neighbourhood search algorithms as complementary methods to filtering when further accuracy is required. Specifically, we extend bidirectional neighbourhood searches to the Levenshtein distance, and propose efficient pruning techniques to reduce the search space of these algorithms (section 6.3). Furthermore, we propose novel hybrid approaches (section 6.4) in order to combine filtering and neighbourhood search algorithms. We experimentally quantify the benefits of the proposed methods and how they successfully scale with long sequence inputs (section 6.5).

In the last chapter, this thesis presents its most notable and practical contribution in the form of a full-fledged, production-ready sequence mapper: the GEM-mapper in its third iteration (section 7.2). Our mapping tool consistently outperforms other mappers over a wide range of parameters. At the same time, it achieves a quality comparable to – if not better than – that of other leading mappers (section ). Furthermore, the GPU mapper implementation (section 7.3) achieves a further speedup of 2x compared to the CPU mapper. This enables even faster searches while generating exactly the same results as the regular mapper – and therefore achieving the same quality.

Besides all of its theoretical advances, perhaps one of the most relevant contributions of this thesis is to offer an efficient tool that can be employed by other members of the research community in order to generate high-quality results in their own research. In this spirit, these two mappers – GEM CPU and GPU – have been successfully deployed within several data analysis pipelines and used within many real genome studies [Lappalainen et al., 2013; Liao et al., 2014; Suez et al., 2014; Thaiss et al., 2014; Balbás-Martínez et al., 2013; Letourneau et al., 2014]. However, it would be naive to consider this work completed. In this ever-changing field, we must confront the challenges arising from new developments in biotechnology. We are well aware that before long we will probably have to process orders of magnitude more data with different characteristics. More specifically, we wonder whether our algorithmic approaches will remain suitable when sequence lengths surpass a few kilobases. In the same way, we are well aware that a single linear reference will not suffice in the future. We strongly believe that mappers will naturally evolve to map sequences against populations of genomes and their multiple variants. In the future we hope to continue maintaining and improving the GEM-mapper so as to enlarge its community of users. Ultimately, we look forward to the development of new features that can tackle existing and new problems in the field of sequencing technologies and bioinformatics research.

8.1 Journal publications

S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca. The gem mapper: fast, accurate and versatile alignment by filtration. Nature methods, 9(12):1185–1188, 2012

S. Marco-Sola and P. Ribeca. Efficient alignment of illumina-like high-throughput sequencing reads with the genomic multi-tool (gem) mapper. Current Protocols in Bioinformatics, pages 11–13, 2015

T. Derrien, J. Estellé, S. M. Sola, D. G. Knowles, E. Raineri, R. Guigó, and P. Ribeca. Fast computation and applications of genome mappability. PloS one, 7(1):e30377, 2012

A. Chacón, S. Marco-Sola, A. Espinosa, P. Ribeca, and J. C. Moure. Boosting the fm-index on the gpu: effective techniques to mitigate random memory access. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 12(5):1048–1059, 2015

A. Chacón, S. M. Sola, A. Espinosa, P. Ribeca, and J. C. Moure. Fm-index on gpu: a cooperative scheme to reduce memory footprint. In Parallel and Distributed Processing with Applications (ISPA), 2014 IEEE International Symposium on, pages 1–9. IEEE, 2014b

A. Chacón, S. Marco-Sola, A. Espinosa, P. Ribeca, and J. C. Moure. Thread-cooperative, bit-parallel computation of levenshtein distance on gpu. In Proceedings of the 28th ACM international conference on Supercomputing, pages 103–112. ACM, 2014a

S. Laurie, M. Fernandez-Callejo, S. Marco-Sola, J.-R. Trotta, J. Camps, A. Chacón, A. Espinosa, M. Gut, I. Gut, S. Heath, et al. From wet-lab to variations: Concordance and speed of bioinformatics pipelines for whole genome and whole exome sequencing. Human Mutation, 37(12):1263–1271, 2016

B. Rodríguez-Martín, E. Palumbo, S. Marco-Sola, T. Griebel, P. Ribeca, G. Alonso, A. Rastrojo, B. Aguado, R. Guigó, and S. Djebali. Chimpipe: accurate detection of fusion genes and transcription-induced chimeras from rna-seq data. BMC Genomics, 18(1):7, 2017

Bibliography

A. Ahmadi, A. Behm, N. Honnalli, C. Li, L. Weng, and X. Xie. Hobbes: optimized gram-based methods for efficient read alignment. Nucleic acids research, 40(6):e41–e41, 2012.

D. Aird, M. G. Ross, W.-S. Chen, M. Danielsson, T. Fennell, C. Russ, D. B. Jaffe, C. Nusbaum, and A. Gnirke. Analyzing and minimizing pcr amplification bias in illumina sequencing libraries. Genome biology, 12(2):1, 2011.

C. Alkan, J. M. Kidd, T. Marques-Bonet, G. Aksay, F. Antonacci, F. Hormozdiari, J. O. Kitzman, C. Baker, M. Malig, O. Mutlu, et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nature genetics, 41(10):1061–1067, 2009.

B. Alpern, L. Carter, and K. Su Gatlin. Microparallelism and high-performance protein matching. In Proceedings of the 1995 ACM/IEEE conference on Supercomputing, page 24. ACM, 1995.

A. Apostolico. The myriad virtues of subword trees. In Combinatorial algorithms on words, pages 85–96. Springer, 1985.

R. Baeza-Yates and G. Navarro. Faster approximate string matching. Algorithmica, 23(2):127–158, 1999.

J. A. Bailey and E. E. Eichler. Primate segmental duplications: crucibles of evolution, diversity and disease. Nature Reviews Genetics, 7(7):552–564, 2006.

M. Baker. De novo genome assembly: what every biologist should know. Nature methods, 9(4): 333, 2012.

C. Balbás-Martínez, A. Sagrera, E. Carrillo-de Santa-Pau, J. Earl, M. Márquez, M. Vazquez, E. Lapi, F. Castro-Giner, S. Beltran, M. Bayés, et al. Recurrent inactivation of stag2 in bladder cancer is not associated with aneuploidy. Nature genetics, 45(12):1464–1469, 2013.

A. K. Bansal and T. E. Meyer. Evolutionary analysis by whole-genome comparisons. Journal of Bacteriology, 184(8):2260–2272, 2002.

A. F. Bardet, Q. He, J. Zeitlinger, and A. Stark. A computational pipeline for comparative chip-seq analyses. Nature protocols, 7(1):45–61, 2012.

J. M. Berg, J. L. Tymoczko, and L. Stryer. Biochemistry. The Citric Acid Cycle, 5, 2006.

C. Berthelot, F. Brunet, D. Chalopin, A. Juanchich, M. Bernard, B. Noël, P. Bento, C. Da Silva, K. Labadie, A. Alberti, et al. The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates. Nature communications, 5, 2014.

S. Burkhardt and J. Kärkkäinen. Better filtering with gapped q-grams. Fundamenta informaticae, 56(1-2):51–70, 2003.


S. Burkhardt, A. Crauser, P. Ferragina, H.-P. Lenhof, E. Rivals, and M. Vingron. q-gram based database searching using a suffix array (quasar). In Proceedings of the third annual international conference on Computational molecular biology, pages 77–83. ACM, 1999.

M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. 1994.

A. Chacón, J. C. Moure, A. Espinosa, and P. Hernández. n-step fm-index for faster pattern matching. Procedia Computer Science, 18:70–79, 2013.

A. Chacón, S. Marco-Sola, A. Espinosa, P. Ribeca, and J. C. Moure. Thread-cooperative, bit-parallel computation of levenshtein distance on gpu. In Proceedings of the 28th ACM international conference on Supercomputing, pages 103–112. ACM, 2014a.

A. Chacón, S. M. Sola, A. Espinosa, P. Ribeca, and J. C. Moure. Fm-index on gpu: a cooperative scheme to reduce memory footprint. In Parallel and Distributed Processing with Applications (ISPA), 2014 IEEE International Symposium on, pages 1–9. IEEE, 2014b.

A. Chacón, S. Marco-Sola, A. Espinosa, P. Ribeca, and J. C. Moure. Boosting the fm-index on the gpu: effective techniques to mitigate random memory access. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 12(5):1048–1059, 2015.

C. Charras and T. Lecroq. Handbook of exact string matching algorithms. Citeseer, 2004.

M. Chimani and K. Klein. Algorithm engineering: Concepts and practice. In Experimental methods for the analysis of optimization algorithms, pages 131–158. Springer, 2010.

P. J. Cock, C. J. Fields, N. Goto, M. L. Heuer, and P. M. Rice. The sanger fastq file format for sequences with quality scores, and the solexa/illumina fastq variants. Nucleic acids research, 38 (6):1767–1771, 2010.

R. Cole, T. Kopelowitz, and M. Lewenstein. Suffix trays and suffix trists: structures for faster text indexing. In International Colloquium on Automata, Languages, and Programming, pages 358–369. Springer, 2006.

R. Cordaux and M. A. Batzer. The impact of retrotransposons on human genome evolution. Nature Reviews Genetics, 10(10):691–703, 2009.

T. H. Cormen. Introduction to algorithms. MIT press, 2009.

M. A. Covington. The number of distinct alignments of two strings. Journal of Quantitative Linguistics, 11(3):173–182, 2004.

F. Crick et al. Central dogma of molecular biology. Nature, 227(5258):561–563, 1970.

F. H. Crick. On protein synthesis. In Symp Soc Exp Biol, volume 12, page 8, 1958.

M. David, M. Dzamba, D. Lister, L. Ilie, and M. Brudno. Shrimp2: sensitive yet practical short read mapping. Bioinformatics, 27(7):1011–1012, 2011.

D. Deamer, M. Akeson, and D. Branton. Three decades of nanopore sequencing. Nature biotechnology, 34(5):518–524, 2016.

A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg. Alignment of whole genomes. Nucleic acids research, 27(11):2369–2376, 1999.

A. L. Delcher, S. L. Salzberg, and A. M. Phillippy. Using mummer to identify similar regions in large sequence sets. Current Protocols in Bioinformatics, pages 10–3, 2003.

M. A. DePristo, E. Banks, R. Poplin, K. V. Garimella, J. R. Maguire, C. Hartl, A. A. Philippakis, G. Del Angel, M. A. Rivas, M. Hanna, et al. A framework for variation discovery and genotyping using next-generation dna sequencing data. Nature genetics, 43(5):491–498, 2011.

T. Derrien, J. Estellé, S. M. Sola, D. G. Knowles, E. Raineri, R. Guigó, and P. Ribeca. Fast computation and applications of genome mappability. PloS one, 7(1):e30377, 2012.

J. C. Dohm, C. Lottaz, T. Borodina, and H. Himmelbauer. Substantial biases in ultra-short read data sets from high-throughput dna sequencing. Nucleic acids research, 36(16):e105–e105, 2008.

D. E. Knuth. The art of computer programming. Sorting and searching, 3:426–458, 1999.

M. Eisenstein. Big data: The power of petabytes. Nature, 527(7576):S2–S4, 2015.

B. Ewing and P. Green. Base-calling of automated sequencer traces using phred. ii. error probabilities. Genome research, 8(3):186–194, 1998.

M. Farrar. Striped smith–waterman speeds database searches six times over other simd implementations. Bioinformatics, 23(2):156–161, 2007.

P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 390–398. IEEE, 2000.

P. Ferragina and G. Manzini. An experimental study of an opportunistic index. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 269–278. Society for Industrial and Applied Mathematics, 2001.

P. Ferragina, R. González, G. Navarro, and R. Venturini. Compressed text indexes: From theory to practice. Journal of Experimental Algorithmics (JEA), 13:12, 2009.

L. Feuk, A. R. Carson, and S. W. Scherer. Structural variation in the human genome. Nature Reviews Genetics, 7(2):85–97, 2006.

N. A. Fonseca, J. Rung, A. Brazma, and J. C. Marioni. Tools for mapping high-throughput sequencing data. Bioinformatics, page bts605, 2012.

J. L. Freeman, G. H. Perry, L. Feuk, R. Redon, S. A. McCarroll, D. M. Altshuler, H. Aburatani, K. W. Jones, C. Tyler-Smith, M. E. Hurles, et al. Copy number variation: new insights in genome diversity. Genome research, 16(8):949–961, 2006.

U. H. Frey, H. S. Bachmann, J. Peters, and W. Siffert. Pcr-amplification of gc-rich regions: 'slowdown pcr'. Nature protocols, 3(8):1312–1317, 2008.

Z. Galil and R. Giancarlo. Speeding up dynamic programming with applications to molecular biology. Theoretical Computer Science, 64(1):107–118, 1989.

A. S. Gargis, L. Kalman, D. P. Bick, C. Da Silva, D. P. Dimmock, B. H. Funke, S. Gowrisankar, M. R. Hegde, S. Kulkarni, C. E. Mason, et al. Good laboratory practice for clinical next-generation sequencing informatics pipelines. Nature biotechnology, 33(7):689–693, 2015.

S. Goodwin, J. D. McPherson, and W. R. McCombie. Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics, 17(6):333–351, 2016.

D. J. Griffiths. Endogenous retroviruses in the human genome sequence. Genome biology, 2(6):1, 2001.

R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pages 841–850. Society for Industrial and Applied Mathematics, 2003.

Y. Guo, F. Ye, Q. Sheng, T. Clark, and D. C. Samuels. Three-stage quality control strategies for dna re-sequencing data. Briefings in bioinformatics, page bbt069, 2013.

Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang. Seqmule: automated pipeline for analysis of human exome/genome sequencing data. Scientific reports, 5, 2015.

D. Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge university press, 1997.

F. Hach, F. Hormozdiari, C. Alkan, F. Hormozdiari, I. Birol, E. E. Eichler, and S. C. Sahinalp. mrsfast: a cache-oblivious algorithm for short-read mapping. Nature methods, 7(8):576–577, 2010.

P. A. Hall and G. R. Dowling. Approximate string matching. ACM computing surveys (CSUR), 12(4):381–402, 1980.

R. W. Hamming. Error detecting and error correcting codes. Bell Labs Technical Journal, 29(2): 147–160, 1950.

E. C. Hayden. Genome researchers raise alarm over big data. Nature, 2015.

S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences, 89(22):10915–10919, 1992.

S. Henikoff and J. G. Henikoff. Performance evaluation of amino acid substitution matrices. Proteins: Structure, Function, and Bioinformatics, 17(1):49–61, 1993.

C. Hercus. Novoalign. Selangor: Novocraft Technologies, 2012.

P. M. Higgins. Burrows–wheeler transformations and de bruijn words. Theoretical Computer Science, 457:128–136, 2012.

M. Holtgrewe. Mason - a read simulator for second generation sequencing data. Technical report, FU Berlin, 2010.

M. Holtgrewe, A.-K. Emde, D. Weese, and K. Reinert. A novel and well-defined benchmarking method for second generation read mapping. BMC bioinformatics, 12(1):210, 2011.

N. Homer, B. Merriman, and S. F. Nelson. Bfast: an alignment tool for large scale genome resequencing. PloS one, 4(11):e7767, 2009.

W.-K. Hon, T. W. Lam, W.-K. Sung, W.-L. Tse, C.-K. Wong, and S.-M. Yiu. Practical aspects of compressed suffix arrays and fm-index in searching dna sequences. In ALENEX/ANALC, pages 31–38, 2004.

W. Huang, L. Li, J. R. Myers, and G. T. Marth. Art: a next-generation sequencing read simulator. Bioinformatics, 28(4):593–594, 2012.

S. Hwang, E. Kim, I. Lee, and E. M. Marcotte. Systematic comparison of variant calling pipelines using gold standard personal exome variants. Scientific reports, 5, 2015.

H. Hyyrö. A bit-vector algorithm for computing levenshtein and damerau edit distances. Nord. J. Comput., 10(1):29–39, 2003.

H. Hyyrö and G. Navarro. Faster bit-parallel approximate string matching. In Annual Symposium on Combinatorial Pattern Matching, pages 203–224. Springer, 2002.

P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In International Symposium on Mathematical Foundations of Computer Science, pages 240–248. Springer, 1991.

P. Jokinen, J. Tarhio, and E. Ukkonen. A comparison of approximate string matching algorithms. Software: Practice and Experience, 26(12):1439–1458, 1996.

J. Kärkkäinen and P. Sanders. Simple linear work suffix array construction. In International Colloquium on Automata, Languages, and Programming, pages 943–955. Springer, 2003.

H. H. Kazazian. Mobile elements: drivers of genome evolution. Science, 303(5664):1626–1632, 2004.

B. Kehr, D. Weese, and K. Reinert. Stellar: fast and exact local alignments. BMC bioinformatics, 12(9):S15, 2011.

M. Kronrod, V. Arlazarov, E. Dinic, and I. Faradzev. On economic construction of the transitive closure of a directed graph. In Sov. Math (Doklady), volume 11, pages 1209–1210, 1970.

G. Kucherov, K. Salikhov, and D. Tsur. Approximate string matching using a bidirectional index. Theoretical Computer Science, 638:145–158, 2016.

K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR), 24(4):377–439, 1992.

S. Kurtz. Reducing the space requirement of suffix trees. Software: Practice and Experience, 29(13):1149–1171, 1999.

T. W. Lam, W.-K. Sung, S.-L. Tam, C.-K. Wong, and S.-M. Yiu. Compressed indexing and local alignment of dna. Bioinformatics, 24(6):791–797, 2008.

T. W. Lam, R. Li, A. Tam, S. Wong, E. Wu, and S.-M. Yiu. High throughput short read alignment via bi-directional bwt. In Bioinformatics and Biomedicine, 2009. BIBM’09. IEEE International Conference on, pages 31–36. IEEE, 2009.

G. M. Landau and U. Vishkin. Fast parallel and serial approximate string matching. Journal of algorithms, 10(2):157–169, 1989.

E. S. Lander and M. S. Waterman. Genomic mapping by fingerprinting random clones: a mathe- matical analysis. Genomics, 2(3):231–239, 1988.

E. S. Lander, L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, et al. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, 2001.

B. Langmead. Aligning short sequencing reads with bowtie. Current protocols in bioinformatics, pages 11–7, 2010.

B. Langmead and S. L. Salzberg. Fast gapped-read alignment with bowtie 2. Nature methods, 9(4):357–359, 2012.

B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome biology, 10(3):R25, 2009.

T. Lappalainen, M. Sammeth, M. R. Friedländer, P. A. C. ’t Hoen, J. Monlong, M. A. Rivas, M. Gonzalez-Porta, N. Kurbatova, T. Griebel, P. G. Ferreira, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature, 501(7468):506–511, 2013.

A. H. Laszlo, I. M. Derrington, B. C. Ross, H. Brinkerhoff, A. Adey, I. C. Nova, J. M. Craig, K. W. Langford, J. M. Samson, R. Daza, et al. Decoding long nanopore sequencing reads of natural dna. Nature biotechnology, 32(8):829–833, 2014.

S. Laurie, M. Fernandez-Callejo, S. Marco-Sola, J.-R. Trotta, J. Camps, A. Chacón, A. Espinosa, M. Gut, I. Gut, S. Heath, et al. From wet-lab to variations: Concordance and speed of bioinformatics pipelines for whole genome and whole exome sequencing. Human Mutation, 37(12):1263–1271, 2016.

T. Lecroq. Fast exact string matching algorithms. Information Processing Letters, 102(6):229–235, 2007.

H. Lee, R. T. Ng, and K. Shim. Extending q-grams to estimate selectivity of string matching with low edit distance. In Proceedings of the 33rd international conference on Very large data bases, pages 195–206. VLDB Endowment, 2007.

W.-P. Lee, M. P. Stromberg, A. Ward, C. Stewart, E. P. Garrison, and G. T. Marth. Mosaik: a hash-based algorithm for accurate next-generation sequencing short-read mapping. PloS one, 9(3):e90581, 2014.

A. Letourneau, F. A. Santoni, X. Bonilla, M. R. Sailani, D. Gonzalez, J. Kind, C. Chevalier, R. Thurman, R. S. Sandstrom, Y. Hibaoui, et al. Domains of genome-wide gene expression dysregulation in Down’s syndrome. Nature, 508(7496):345–350, 2014.

M. J. Levene, J. Korlach, S. W. Turner, M. Foquet, H. G. Craighead, and W. W. Webb. Zero-mode waveguides for single-molecule analysis at high concentrations. Science, 299(5607):682–686, 2003.

V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pages 707–710, 1966.

H. Li. Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. arXiv preprint arXiv:1303.3997, 2013.

H. Li. Towards better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics, page btu356, 2014.

H. Li and R. Durbin. Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.

H. Li and R. Durbin. Fast and accurate long-read alignment with burrows–wheeler transform. Bioinformatics, 26(5):589–595, 2010.

H. Li, J. Ruan, and R. Durbin. Mapping short dna sequencing reads and calling variants using mapping quality scores. Genome research, 18(11):1851–1858, 2008a.

H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, et al. The sequence alignment/map format and samtools. Bioinformatics, 25(16):2078–2079, 2009.

R. Li, Y. Li, K. Kristiansen, and J. Wang. Soap: short oligonucleotide alignment program. Bioinformatics, 24(5):713–714, 2008b.

Y. Liao, G. K. Smyth, and W. Shi. featurecounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923–930, 2014.

Y. Liu and B. Schmidt. Cushaw2-gpu: empowering faster gapped short-read alignment using gpu computing. IEEE Design & Test, 31(1):31–39, 2014.

N. J. Loman, R. V. Misra, T. J. Dallman, C. Constantinidou, S. E. Gharbia, J. Wain, and M. J. Pallen. Performance comparison of benchtop high-throughput sequencing platforms. Nature biotechnology, 30(5):434–439, 2012.

R. Luo, T. Wong, J. Zhu, C.-M. Liu, X. Zhu, E. Wu, L.-K. Lee, H. Lin, W. Zhu, D. W. Cheung, et al. Soap3-dp: fast, accurate and sensitive gpu-based short read aligner. PloS one, 8(5):e65632, 2013.

C. A. Mack. Fifty years of moore’s law. IEEE Transactions on semiconductor manufacturing, 24(2):202–207, 2011.

U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.

S. Marco-Sola and P. Ribeca. Efficient alignment of illumina-like high-throughput sequencing reads with the genomic multi-tool (gem) mapper. Current Protocols in Bioinformatics, pages 11–13, 2015.

S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca. The gem mapper: fast, accurate and versatile alignment by filtration. Nature methods, 9(12):1185–1188, 2012.

E. R. Mardis. A decade’s perspective on dna sequencing technology. Nature, 470(7333):198–203, 2011.

V. Marx. Biology: The big challenges of big data. Nature, 498(7453):255–260, 2013.

E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM (JACM), 23(2):262–272, 1976.

S. A. McKee. Reflections on the memory wall. In Proceedings of the 1st conference on Computing frontiers, page 162. ACM, 2004.

A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation dna sequencing data. Genome research, 20(9):1297–1303, 2010.

D. P. Mehta and S. Sahni. Handbook of data structures and applications. CRC Press, 2004.

D. W. Mount. Comparison of the pam and blosum amino acid substitution matrices. Cold Spring Harbor Protocols, 2008(6):pdb.ip59, 2008.

E. W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1(1-4):251–266, 1986.

E. W. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica, 12(4):345–374, 1994.

G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM (JACM), 46(3):395–415, 1999.

G. Myers. What’s behind blast. In Models and Algorithms for Genome Evolution, pages 3–15. Springer, 2013.

G. Navarro. A guided tour to approximate string matching. ACM computing surveys (CSUR), 33(1):31–88, 2001.

G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys (CSUR), 39(1):2, 2007.

G. Navarro, R. A. Baeza-Yates, E. Sutinen, and J. Tarhio. Indexing methods for approximate string matching. IEEE Data Eng. Bull., 24(4):19–27, 2001.

S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 48(3):443–453, 1970.

K. E. Nelson, R. A. Clayton, S. R. Gill, M. L. Gwinn, R. J. Dodson, D. H. Haft, E. K. Hickey, J. D. Peterson, W. C. Nelson, K. A. Ketchum, et al. Evidence for lateral gene transfer between archaea and bacteria from genome sequence of thermotoga maritima. Nature, 399(6734):323–329, 1999.

S. B. Ng, E. H. Turner, P. D. Robertson, S. D. Flygare, A. W. Bigham, C. Lee, T. Shaffer, M. Wong, A. Bhattacharjee, E. E. Eichler, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 461(7261):272–276, 2009.

S. Ohno. Repetition as the essence of life on this earth: music and genes. In Modern Trends in Human Leukemia VII, pages 511–519. Springer, 1987.

S. Ohno. Patterns in genome evolution. Current opinion in genetics & development, 3(6):911–914, 1993.

R. K. Patel and M. Jain. Ngs qc toolkit: a toolkit for quality control of next generation sequencing data. PloS one, 7(2):e30619, 2012.

D. Pratas, A. J. Pinho, and J. M. Rodrigues. Xs: a fastq read simulator. BMC research notes, 7(1):40, 2014.

K. Prüfer, K. Munch, I. Hellmann, K. Akagi, J. R. Miller, B. Walenz, S. Koren, G. Sutton, C. Kodira, R. Winer, et al. The bonobo genome compared with the chimpanzee and human genomes. Nature, 486(7404):527–531, 2012.

S. J. Puglisi, W. F. Smyth, and A. H. Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys (CSUR), 39(2):4, 2007.

K. R. Rasmussen, J. Stoye, and E. W. Myers. Efficient q-gram filters for finding all ε-matches over a given length. Journal of Computational Biology, 13(2):296–308, 2006.

J. A. Reuter, D. V. Spacek, and M. P. Snyder. High-throughput sequencing technologies. Molecular cell, 58(4):586–597, 2015.

A. Rhoads and K. F. Au. Pacbio sequencing and its applications. Genomics, proteomics & bioinformatics, 13(5):278–289, 2015.

P. Ribeca and G. Valiente. Computational challenges of sequence classification in microbiomic data. Briefings in bioinformatics, page bbr019, 2011.

J. C. Roach, C. Boysen, K. Wang, and L. Hood. Pairwise end sequencing: a unified approach to genomic mapping and sequencing. Genomics, 26(2):345–353, 1995.

B. Rodríguez-Martín, E. Palumbo, S. Marco-Sola, T. Griebel, P. Ribeca, G. Alonso, A. Rastrojo, B. Aguado, R. Guigó, and S. Djebali. Chimpipe: accurate detection of fusion genes and transcription-induced chimeras from rna-seq data. BMC Genomics, 18(1):7, 2017.

T. Rognes. Faster smith-waterman database searches with inter-sequence simd parallelisation. BMC bioinformatics, 12(1):221, 2011.

T. Rognes and E. Seeberg. Six-fold speed-up of smith–waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics, 16(8):699–706, 2000.

L. Roguski and P. Ribeca. Cargo: effective format-free compressed storage of genomic information. Nucleic acids research, page gkw318, 2016.

M. G. Ross, C. Russ, M. Costello, A. Hollinger, N. J. Lennon, R. Hegarty, C. Nusbaum, and D. B. Jaffe. Characterizing and measuring bias in sequence data. Genome biology, 14(5):1, 2013.

N. Rusk. Torrents of sequence. Nature Methods, 8(1):44–44, 2011.

H. Sakai, K. Naito, E. Ogiso-Tanaka, Y. Takahashi, K. Iseki, C. Muto, K. Satou, K. Teruya, A. Shiroma, M. Shimoji, et al. The power of single molecule real-time sequencing technology in the de novo assembly of a eukaryotic genome. Scientific reports, 5, 2015.

J. Salavert, A. Tomás, J. Tárraga, I. Medina, J. Dopazo, and I. Blanquer. Fast inexact mapping using advanced tree exploration on backward search methods. BMC bioinformatics, 16(1):18, 2015.

P. Sanders. Algorithm engineering–an attempt at a definition. In Efficient Algorithms, pages 321–340. Springer, 2009.

F. Sanger, S. Nicklen, and A. R. Coulson. Dna sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74(12):5463–5467, 1977.

D. Sankoff. Matching sequences under deletion/insertion constraints. Proceedings of the National Academy of Sciences, 69(1):4–6, 1972.

R. R. Schaller. Moore’s law: past, present and future. IEEE spectrum, 34(6):52–59, 1997.

K. U. Schulz and S. Mihov. Fast string correction with levenshtein automata. International Journal on Document Analysis and Recognition, 5(1):67–85, 2002.

S. L. Schwartz and M. L. Farman. Systematic overrepresentation of dna termini and underrepresentation of subterminal regions among sequencing templates prepared from hydrodynamically sheared linear dna molecules. BMC genomics, 11(1):1, 2010.

P. H. Sellers. On the theory and computation of evolutionary distances. SIAM Journal on Applied Mathematics, 26(4):787–793, 1974.

E. Siragusa. Approximate string matching for high-throughput sequencing. PhD thesis, Freie Universität Berlin, 2015.

E. Siragusa, D. Weese, and K. Reinert. Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic acids research, 41(7):e78–e78, 2013.

J. Sirén, N. Välimäki, V. Mäkinen, and G. Navarro. Run-length compressed indexes are superior for highly repetitive sequence collections. In International Symposium on String Processing and Information Retrieval, pages 164–175. Springer, 2008.

G. S. Slater and E. Birney. Automated generation of heuristics for biological sequence comparison. BMC bioinformatics, 6(1):31, 2005.

J. Suez, T. Korem, D. Zeevi, G. Zilberman-Schapira, C. A. Thaiss, O. Maza, D. Israeli, N. Zmora, S. Gilad, A. Weinberger, et al. Artificial sweeteners induce glucose intolerance by altering the gut microbiota. Nature, 514(7521):181–186, 2014.

E. Sutinen and J. Tarhio. Approximate string matching with ordered q-grams. Nord. J. Comput., 11(4):321–343, 2004.

M. N. Szmulewicz, G. E. Novick, and R. J. Herrera. Effects of alu insertions on gene function. Electrophoresis, 19(8-9):1260–1264, 1998.

J. Tárraga, V. Arnau, H. Martínez, R. Moreno, D. Cazorla, J. Salavert-Torres, I. Blanquer-Espert, J. Dopazo, and I. Medina. Acceleration of short and long dna read mapping without loss of accuracy using suffix array. Bioinformatics, 30(23):3396–3398, 2014.

C. Tennakoon, R. W. Purbojati, and W.-K. Sung. Batmis: a fast algorithm for k-mismatch mapping. Bioinformatics, 28(16):2122–2128, 2012.

C. T. B. Tennakoon. Fast and Accurate Mapping of Next Generation Sequencing Data. PhD thesis, National University of Singapore, 2013.

C. A. Thaiss, D. Zeevi, M. Levy, G. Zilberman-Schapira, J. Suez, A. C. Tengeler, L. Abramson, M. N. Katz, T. Korem, N. Zmora, et al. Transkingdom control of microbiota diurnal oscillations promotes metabolic homeostasis. Cell, 159(3):514–529, 2014.

T. Thomas, J. Gilbert, and F. Meyer. Metagenomics - a guide from sampling to data analysis. Microbial informatics and experimentation, 2(1):1, 2012.

U. H. Trivedi, T. Cézard, S. Bridgett, A. Montazam, J. Nichols, M. Blaxter, and K. Gharbi. Quality control of next-generation sequencing data without a reference. Frontiers in genetics, 5:111, 2014.

A. M. Turing. Biological sequences and the exact string matching problem. Introduction to Computational Biology. Springer, 2006.

E. H. Turner, C. Lee, S. B. Ng, D. A. Nickerson, and J. Shendure. Massively parallel exon capture and library-free resequencing across 16 genomes. Nature methods, 6(5):315–316, 2009.

E. Ukkonen. Algorithms for approximate string matching. Information and control, 64(1-3):100–118, 1985a.

E. Ukkonen. Finding approximate patterns in strings. Journal of algorithms, 6(1):132–137, 1985b.

E. Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical computer science, 92(1):191–211, 1992.

J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith, M. Yandell, C. A. Evans, R. A. Holt, et al. The sequence of the human genome. Science, 291(5507):1304–1351, 2001.

R. A. Wagner and M. J. Fischer. The string-to-string correction problem. Journal of the ACM (JACM), 21(1):168–173, 1974.

H. S. Warren. Hacker’s delight. Pearson Education, 2013.

D. Weese, A.-K. Emde, T. Rausch, A. Döring, and K. Reinert. Razers: fast read mapping with sensitivity control. Genome research, 19(9):1646–1654, 2009.

D. Weese, M. Holtgrewe, and K. Reinert. Razers 3: faster, fully sensitive read mapping. Bioinformatics, 28(20):2592–2599, 2012.

P. Weiner. Linear pattern matching algorithms. In Switching and Automata Theory, 1973. SWAT 1973. IEEE Conference Record of 14th Annual Symposium on, pages 1–11. IEEE, 1973.

I. H. Witten, A. Moffat, and T. C. Bell. Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, 1999.

A. Wozniak. Using video-oriented instructions to speed up sequence comparison. Computer applications in the biosciences: CABIOS, 13(2):145–150, 1997.

S. Wu and U. Manber. Fast text searching: allowing errors. Communications of the ACM, 35(10):83–91, 1992.

W. A. Wulf and S. A. McKee. Hitting the memory wall: implications of the obvious. ACM SIGARCH computer architecture news, 23(1):20–24, 1995.

H. Xin, D. Lee, F. Hormozdiari, S. Yedkar, O. Mutlu, and C. Alkan. Accelerating read mapping with fasthash. BMC genomics, 14(1):S13, 2013.

H. Xin, S. Nahar, R. Zhu, J. Emmons, G. Pekhimenko, C. Kingsford, C. Alkan, and O. Mutlu. Optimal seed solver: optimizing seed selection in read mapping. Bioinformatics, 32(11):1632–1642, 2016.

X. Yang, S. P. Chockalingam, and S. Aluru. A survey of error-correction methods for next-generation sequencing. Briefings in bioinformatics, 14(1):56–66, 2013a.

X. Yang, D. Liu, F. Liu, J. Wu, J. Zou, X. Xiao, F. Zhao, and B. Zhu. Htqc: a fast quality control toolkit for illumina sequencing data. BMC bioinformatics, 14(1):33, 2013b.

M. Zaharia, W. J. Bolosky, K. Curtis, A. Fox, D. Patterson, S. Shenker, I. Stoica, R. M. Karp, and T. Sittler. Faster and more accurate sequence alignment with snap. arXiv preprint arXiv:1111.5572, 2011.

L. Zhang, M. F. Miles, and K. D. Aldape. A model of molecular interactions on short oligonucleotide microarrays. Nature biotechnology, 21(7):818–821, 2003.