Master of Science in Information Systems and Computer Engineering
Total Page:16
File Type:pdf, Size:1020Kb
Error Detection in Pedigrees Using Satisfiability-Based Approaches David Manuel Nunes Martins Dissertation for the degree of Master of Science in Information Systems and Computer Engineering Jury President: Prof. Ana Maria Severino de Almeida e Paiva Research Advisor: Prof. Maria InˆesCamarate de Campos Lynce de Faria Committee: Prof. Sara Alexandra Cordeiro Madeira Prof. Vasco Miguel Gomes Nunes Manquinho November 2009 Resumo No campo da biologia computacional, um pedigree ´euma estrutura que permite a moni- toriza¸c˜aode individuos e correspondentes rela¸c˜oesfamiliares. Com pedigrees, torna-se poss´ıvel seguir o fluxo de informa¸c˜aogen´eticadesde os progenitores at´eaos seus descendentes. Contudo, com o elevado desempenho das t´ecnicasactuais de genotipagem, a ocorrˆenciade erros em pedigrees torna-se cada vez mais um problema da maior importˆanciae a consistˆencia vai sendo cada vez mais algo com que os geneticistas tˆemde lidar. O problema da verifica¸c˜aoda consistˆenciade pedigrees consiste na problem´aticade verificar se um pedigree (uma ´arvore geneol´ogicacom informa¸c˜aode genes associada) ´eou n˜aoconsistente com as leis de herditariedade de Mendel. Esta verfifica¸c˜aode consistˆencia´eum problema NP-completo bastante conhecido e existem dois tipos principais de abordagens que o tentam resolver: abordagens estat´ısticase abordagens combinatoriais. O objectivo deste trabalho consiste na tentativa de desenvolvimento de dois tipos de aborda- gens combinatoriais baseadas em satisfa¸c˜ao(suporte m´ınimoe codifica¸c˜ao k-AC) e tentar medir a sua efic´aciacom distintos tipos de instˆancias. Palavras Chave: Pedigrees, NP-completo, Satisfa¸c˜ao,Suporte M´ınimo,Codifica¸c˜ao k-AC. iii iv Abstract In bioinformatics, a pedigree is a structure that allows the monitoring of individuals and their family relations. With pedigrees, it is possible to track the flow of genetic information from parents to offspring. However, with the high throughput of genotyping techniques, the occurrence of errors in pedigrees is becoming a problem of increasingly greater importance and consistency becomes something that geneticists must deal with. The pedigree consistency checking problem consists in the problem of verifying if a pedigree (a genealogical tree with associated genotypes) is consistent under the Mendelian laws of inheritance. This pedigree consistency check is a well-known NP-complete problem and there are two main types of approaches that attempt to tackle it: statistical and combinatorial ones. The purpose of this work is to try to develop two types of combinatorial approaches based on satisfiability (minimal support encoding and k-AC encoding) and attempt to measure their efficiency against distinct types of pedigree instances. Key Words: Pedigrees, NP-complete, Satisfiability, Minimal Support Encoding, k-AC En- coding. v vi Contents 1 Introduction 1 1.1 Biological Background . 2 1.2 Mendelian Laws of Inheritance . 2 1.3 Linkage Analysis and Pedigree Consistency Checking . 3 1.4 Solving Pedigree Consistency Checking . 4 1.5 Organization . 4 2 Pedigrees 5 2.1 Pedigrees in detail . 5 2.2 Errors occurring in pedigrees . 9 2.3 Complexity of pedigree consistency checking . 10 2.3.1 Classes of problems . 10 2.3.2 Consistency checking is NP-complete . 11 3 Pedigree Consistency Checking Tools 13 3.1 Pedcheck . 13 3.1.1 Nuclear-Family Algorithm . 13 3.1.2 Genotype-Elimination Algorithm . 15 3.1.3 Critical Genotype Algorithm . 16 3.1.4 Odds-Ratio Algorithm . 17 3.2 Allelic Path Explorer (APE) . 18 3.3 Checkfam . 20 vii 3.4 PCS . 22 3.5 Mendelsoft . 23 3.6 Performance Comparison between Mendelsoft and Pedcheck . 25 3.6.1 Synthetic Pedigree Instances . 26 3.6.2 Real Pedigree Instances . 29 4 Maximum Satisfiability and Related Encodings 31 4.1 SAT, Max-SAT and Partial Max-SAT . 31 4.1.1 Inference Rules . 32 4.2 Max-CSP . 33 4.3 Translating CSP into SAT . 34 4.3.1 Direct Encoding . 34 4.3.2 Support Encoding . 35 4.3.3 Minimal Support Encoding . 35 4.4 Encoding Max-CSP into Partial Max-SAT . 36 4.4.1 Direct Encoding for Partial Max-SAT . 36 4.4.2 Minimal Support Encoding for Partial Max-SAT . 36 4.4.3 Support Encoding . 37 5 Pedigrees and Maximum Satisfiability 38 5.1 Mapping pedigree structures . 38 5.1.1 Mapping into Minimal Support Encoding . 40 5.1.2 Mapping into k-AC Encoding . 42 5.2 Comparative tests between encodings . 45 5.2.1 Random pedigree instances . 45 5.2.2 Real pedigree instances . 50 6 Conclusions and Future Work 52 6.1 Future Work . 54 viii ix List of Figures 2.1 A simple pedigree . 6 2.2 A looping pedigree . 7 2.3 An inconsistent and incomplete pedigree . 8 2.4 A complete and consistent pedigree . 10 3.1 Pedigree with level 1 errors . 14 3.2 Pedigree with a level 2 error . 16 3.3 Pedigree with critical genotypes . 17 3.4 Example of input and output data of Checkfam . 22 3.5 Pedigree and corresponding CSP . 24 3.6 Mendelsoft vs. Pedcheck in collection A of instances . 27 3.7 Mendelsoft vs. Pedcheck in collection B of instances . 27 3.8 Mendelsoft vs. Pedcheck in collection C of instances . 28 3.9 Mendelsoft vs. Pedcheck in collection D of instances . 29 5.1 Parental Relations and Genotype information in a Pedigree . 40 5.2 Example of a simple pedigree . 41 5.3 Example of four possible k-AC encodings for a ternary constraint . 43 5.4 Minimal support vs. direct support encoding in collection A of random instances 46 5.5 Minimal support vs. direct support encoding in collection B of random instances 46 5.6 Minimal support vs. direct support encoding in collection C of random instances 47 5.7 Minimal support vs. direct support encoding in collection D of random instances 47 x 5.8 Minimal support vs. 2-AC support encoding in collection A of random instances 48 5.9 Minimal support vs. 2-AC support encoding in collection B of random instances 48 5.10 Minimal support vs. 2-AC support encoding in collection C of random instances 49 5.11 Minimal support vs. 2-AC support encoding in collection D of random instances 49 xi xii List of Tables 3.1 Description of the random classes of pedigree instances . 26 3.2 Mendelsoft vs. Pedcheck in real pedigree instances . 30 5.1 Minimal support encoding vs direct encoding in real instances . 50 5.2 Minimal support encoding vs 2-AC encoding in real instances . 51 xiii xiv 1 Introduction One of today’s most challenging problems in the field of genetics consists in the correct relation between the genes in the human genome and some biological trait an individual may possess. Example of these traits range from blood type and eye color, to those that can predispose an individual for a particular disease. The method used to tackle this problem is named linkage analysis [1, 2, 3] which is a well established statistical-based method that has already shown significant results: genes causing major diseases (e.g., Parkinson’s disease, obesity and anxiety) have already been discovered using this technique. In simpler terms, linkage analysis works by creating maps that correlate specific genes and/or genetic markers of a determined experimental population. And the more information contained on those maps, that is, the denser these maps become, the higher is the probability that the information may contain errors. The problem of verifying the correctness of this information (if it contains errors or not) is named the pedigree consistency checking problem. To describe linkage analysis, consistency checking, and the errors that can be generated in the data in more detail, it is first necessary to introduce some definitions regarding biology and inheritance. 1 1.1 Biological Background DNA (deoxyribonucleic acid) is a molecule that contains the genetic instructions used in the development and functioning of all living organisms, with the role of long-term storage of genetic information. A chromosome is a long chain of DNA and associated proteins that carry portions of the hereditary information in an organism. Inside a chromosome, a locus is a fixed position that carries some specific information (which typically identifies the position of one or more genes). This specific information is named allele. Diploid organisms (e.g. humans) carry chromosomes in pairs, which means that a single locus carries a pair of alleles. This pair of alleles defines the genotype of the individual at this locus. If both alleles are the same (that is, if the information described by the two alleles is equal) the individual is said to be homozygous at that locus, and if the alleles are different the individual is said to be heterozygous. However, we should note that genotypes are not always completely observable and the indirect observation of a genotype (its expression) is named the phenotype. 1.2 Mendelian Laws of Inheritance The notion of the inheritance of traits between generations was first introduced by Gregor Mendel (1865). And his main contribution, the Mendelian laws of inheritance, are the base principles that rule today’s modern genetics [4]. The Mendel’s laws, which specify the concept of Mendelian inheritance are presented next: • Unit factors in pairs: Genetic characters are controlled by unit factors that exist in pairs in individual organisms. The unit factor, is today known as gene. This law implies that the genotype of an individual should always be considered as a pair. • Dominance/Recessiveness: When two unlike unit factors responsible for a single char- acter are present in a single individual, one is dominant over the other, which is said to be recessive. Dominance and recessiveness refers to the phenotype, which can be considered an abstraction of the genotype.