Satisfiability-Based Algorithms for Haplotype Inference
Universidade Técnica de Lisboa
Instituto Superior Técnico

Satisfiability-based Algorithms for Haplotype Inference

Ana Sofia Soeiro da Graça (Licenciada)

Dissertation for the degree of Doctor in Information Systems and Computer Engineering

Supervisor: Dr. Maria Inês Camarate de Campos Lynce de Faria
Co-supervisors: Dr. João Paulo Marques da Silva, Dr. Arlindo Manuel Limede de Oliveira

Jury
President: President of the Scientific Council of IST
Members: Dr. Arlindo Manuel Limede de Oliveira, Dr. João Paulo Marques da Silva, Dr. Eran Halperin, Dr. Ludwig Krippahl, Dr. Luís Jorge Brás Monteiro Guerra e Silva, Dr. Maria Inês Camarate de Campos Lynce de Faria

January 2011

Abstract

Identifying genetic variations plays an important role in human genetics. Single Nucleotide Polymorphisms (SNPs) are the most common genetic variations, and haplotypes encode the SNPs on a single chromosome. Unfortunately, current technology cannot obtain haplotypes directly. Instead, genotypes, which represent the combined data of two haplotypes on homologous chromosomes, are acquired. The haplotype inference problem consists of obtaining the haplotypes from the genotype data. This problem is a key issue in genetics and a computational challenge.

This dissertation proposes two Boolean optimization methods for haplotype inference, one population-based and the other pedigree-based, which are competitive in terms of both accuracy and efficiency. The proposed method for solving the haplotype inference by pure parsimony (HIPP) problem is shown to be very robust and represents the state-of-the-art HIPP method. The pedigree-based method, which takes both family and population information into consideration, is shown to be more accurate than the existing methods for haplotype inference in pedigrees. Furthermore, this dissertation contributes to a better understanding of the HIPP problem by providing an extensive comparison between HIPP methods.
Keywords: haplotype inference, Boolean satisfiability (SAT), pseudo-Boolean optimization (PBO), maximum satisfiability (MaxSAT), pure parsimony (HIPP), minimum recombinant (MRHC).

Summary (Sumário)

The identification of genetic variations plays a fundamental role in the study of human genetics. Single nucleotide polymorphisms (SNPs) are the most common genetic variations among human beings, and haplotypes encode the SNPs present on a single chromosome. At present, owing to technological limitations, obtaining haplotypes directly is not feasible. Instead, genotypes are obtained, which correspond to the combined information of the pair of homologous chromosomes. The haplotype inference problem consists of obtaining haplotypes from genotypes. This problem is a fundamental issue in genetics and also an interesting computational challenge.

This dissertation proposes two Boolean optimization methods for haplotype inference, one for populations and the other for pedigrees, whose performance is noteworthy in both efficiency and accuracy. The method for haplotype inference by pure parsimony (HIPP) in populations proves to be very robust and represents the state of the art among HIPP models. The proposed model for haplotype inference in pedigrees, which combines family and population information, is more accurate than its counterparts. Moreover, this dissertation makes a relevant contribution to the understanding of the HIPP problem by providing an extensive comparison between methods.

Keywords: haplotype inference, Boolean satisfiability (SAT), pseudo-Boolean optimization (PBO), maximum satisfiability (MaxSAT), pure parsimony (HIPP), minimum recombination (MRHC).
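The relationship between haplotypes and genotypes described in the abstract, and the pure parsimony criterion it mentions, can be illustrated with a small sketch. This is only an illustration under stated assumptions: the 0/1/2 string encoding (0 and 1 for the two alleles at a homozygous site, 2 for a heterozygous site) and the function names `genotype`, `explanations` and `pure_parsimony` are hypothetical choices made here, and the exhaustive search below is not the SAT-based or PBO-based method proposed in the dissertation. It only shows what "the smallest set of haplotypes that explains the genotypes" means on a toy input.

```python
from itertools import product

def genotype(h1, h2):
    """Combine two haplotypes (strings over '0'/'1') into one genotype.

    Assumed encoding: the shared allele where both haplotypes agree,
    '2' where they differ (a heterozygous site).
    """
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))

def explanations(g):
    """List every unordered haplotype pair that yields genotype g."""
    het = [i for i, c in enumerate(g) if c == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i] = b                          # first haplotype takes b
            h2[i] = "1" if b == "0" else "0"   # second takes the complement
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)

def pure_parsimony(genotypes):
    """Exhaustive search for a smallest haplotype set explaining all genotypes.

    Exponential in the number of heterozygous sites; toy inputs only.
    """
    best = None
    for choice in product(*(explanations(g) for g in genotypes)):
        used = {h for pair in choice for h in pair}
        if best is None or len(used) < len(best):
            best = used
    return sorted(best)
```

For example, `genotype("0101", "0011")` gives `"0221"`, and that genotype has two candidate explanations, which is exactly the ambiguity haplotype inference has to resolve. The exhaustive search grows exponentially with the number of heterozygous sites, which is one way to see why the problem is treated as a computational challenge.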
Acknowledgments

“Bear in mind that the wonderful things that you learn in your schools are the work of many generations, produced by enthusiastic effort and infinite labor in every country of the world. All this is put into your hands as your inheritance in order that you may receive it, honor it, and add to it, and one day faithfully hand it on to your children.”
Albert Einstein

I have always been fascinated by science, its mysteries and curiosities. Understanding the world where we live and, in particular, human beings requires the effort and hard work of many generations. This thesis represents a modest contribution to the development of science, my little drop in the ocean.

Studying such a multidisciplinary topic was a big challenge, and achieving this goal represents a very important step for me. However, I could not have reached this point without the help of several people.

First of all, I owe special thanks to Professor Inês Lynce for being such a dedicated supervisor and for her contributions to this project. She taught me how to do research. She is an example of an excellent researcher and also a very good friend. She motivated me from the very beginning and always found the right words.

I also thank Professor João Marques da Silva for his enthusiasm and motivation at work, for his interesting contributions and for sharing with me his vast experience in the SAT area.

I am also grateful to Professor Arlindo Oliveira for suggesting this PhD topic to me, for his interesting suggestions and for his help with the biological issues, the hardest for me as a mathematician.

I thank all my colleagues and professors at Instituto Superior Técnico, INESC-ID, the University of Southampton and University College Dublin.

I am eternally grateful to my parents, Otília and António, who always help me follow my dreams and support my decisions, sometimes even without understanding them. I dedicate my thesis to them.
I owe a word of gratitude to my brother and to my sister-in-law. Paulo and Cristina gave me the strength to pursue this project, and they gave me the best advice when I needed it. I admire their perseverance in life. Most of all, I thank them for giving me three lovely nephews who bring so much happiness into my life. Special thanks to my nephews for renewing in me the imagination and curiosity that are typical of children and so important for anyone who works in research.

I am very grateful to Renato. He is able to solve my problems even before I realize I have them. He supported me in the hardest moments and always believed I could do good work on this project.

Finally, I also thank all my friends for being present.

“We ourselves feel that what we are doing is just a drop in the ocean. But the ocean would be less because of that missing drop.”
Mother Teresa

Contents

1 Introduction
  1.1 Contributions
  1.2 Thesis Outline
2 Discrete Constraint Solving
  2.1 Integer Linear Programming (ILP)
    2.1.1 Preliminaries
    2.1.2 ILP Algorithms
  2.2 Boolean Satisfiability (SAT)
    2.2.1 Preliminaries
    2.2.2 CNF Encodings
    2.2.3 SAT Algorithms
  2.3 Pseudo-Boolean Optimization (PBO)
    2.3.1 Preliminaries
    2.3.2 PBO Algorithms
  2.4 Maximum Satisfiability (MaxSAT)
    2.4.1 Preliminaries
    2.4.2 MaxSAT Algorithms
  2.5 Conclusions
3 Haplotype Inference
  3.1 Preliminaries
    3.1.1 SNPs, Haplotypes and Genotypes
    3.1.2 Haplotyping
    3.1.3 Pedigrees
  3.2 Population-based Haplotype Inference
    3.2.1 Combinatorial Approaches
    3.2.2 Statistical Approaches
  3.3 Pedigree-based Haplotype Inference
    3.3.1 Combinatorial Approaches
    3.3.2 Statistical Approaches
  3.4 Conclusions
4 Haplotype Inference by Pure Parsimony (HIPP)
  4.1 Preprocessing Techniques
    4.1.1 Structural Simplifications
    4.1.2 Lower Bounds
    4.1.3 Upper Bounds
  4.2 Integer Linear Programming Models
    4.2.1 RTIP
    4.2.2 PolyIP
    4.2.3 HybridIP
    4.2.4 HaploPPH
    4.2.5 $P_{\max}^{UB}$
  4.3 Branch-and-Bound Models
    4.3.1 HAPAR
    4.3.2 Set Covering
  4.4 Boolean Satisfiability Models
    4.4.1 SHIPs
    4.4.2 SATlotyper
  4.5 Answer Set Programming Models
    4.5.1 HAPLO-ASP
  4.6 Complete HIPP
  4.7 Complexity
  4.8 Heuristic Methods
  4.9 Conclusions
5 A New Approach for HIPP
  5.1 Solving ILP HIPP Models with PBO
  5.2 RPoly Model
    5.2.1 Eliminating Key Symmetries
    5.2.2 Reducing the Model
  5.3