Master of Science in Information Systems and Computer Engineering

Error Detection in Pedigrees Using Satisfiability-Based Approaches David Manuel Nunes Martins Dissertation for the degree of Master of Science in Information Systems and Computer Engineering Jury President: Prof. Ana Maria Severino de Almeida e Paiva Research Advisor: Prof. Maria InêsCamarate de Campos Lynce de Faria Committee: Prof. Sara Alexandra Cordeiro Madeira Prof. Vasco Miguel Gomes Nunes Manquinho November 2009 Resumo No campo da biologia computacional, um pedigree éuma estrutura que permite a moni- toriza¸cãode individuos e correspondentes rela¸cõesfamiliares. Com pedigrees, torna-se poss´ıvel seguir o fluxo de informa¸cãogenéticadesde os progenitores atéaos seus descendentes. Contudo, com o elevado desempenho das técnicasactuais de genotipagem, a ocorrênciade erros em pedigrees torna-se cada vez mais um problema da maior importânciae a consistência vai sendo cada vez mais algo com que os geneticistas têmde lidar. O problema da verifica¸cãoda consistênciade pedigrees consiste na problemáticade verificar se um pedigree (uma árvore geneológicacom informa¸cãode genes associada) éou nãoconsistente com as leis de herditariedade de Mendel. Esta verfifica¸cãode consistênciaéum problema NP-completo bastante conhecido e existem dois tipos principais de abordagens que o tentam resolver: abordagens estat´ısticase abordagens combinatoriais. O objectivo deste trabalho consiste na tentativa de desenvolvimento de dois tipos de abordagens combinatoriais baseadas em satisfa¸cão(suporte m´ınimoe codifica¸cão k-AC) e tentar medir a sua eficáciacom distintos tipos de instâncias. Palavras Chave: Pedigrees, NP-completo, Satisfa¸cão,Suporte M´ınimo,Codifica¸cão k-AC. iii iv Abstract In bioinformatics, a pedigree is a structure that allows the monitoring of individuals and their family relations. With pedigrees, it is possible to track the flow of genetic information from parents to offspring. However, with the high throughput of genotyping techniques, the occurrence of errors in pedigrees is becoming a problem of increasingly greater importance and consistency becomes something that geneticists must deal with. The pedigree consistency checking problem consists in the problem of verifying if a pedigree (a genealogical tree with associated genotypes) is consistent under the Mendelian laws of inheritance. This pedigree consistency check is a well-known NP-complete problem and there are two main types of approaches that attempt to tackle it: statistical and combinatorial ones. The purpose of this work is to try to develop two types of combinatorial approaches based on satisfiability (minimal support encoding and k-AC encoding) and attempt to measure their efficiency against distinct types of pedigree instances. Key Words: Pedigrees, NP-complete, Satisfiability, Minimal Support Encoding, k-AC En- coding. v vi Contents 1 Introduction 1 1.1 Biological Background . 2 1.2 Mendelian Laws of Inheritance . 2 1.3 Linkage Analysis and Pedigree Consistency Checking . 3 1.4 Solving Pedigree Consistency Checking . 4 1.5 Organization . 4 2 Pedigrees 5 2.1 Pedigrees in detail . 5 2.2 Errors occurring in pedigrees . 9 2.3 Complexity of pedigree consistency checking . 10 2.3.1 Classes of problems . 10 2.3.2 Consistency checking is NP-complete . 11 3 Pedigree Consistency Checking Tools 13 3.1 Pedcheck . 13 3.1.1 Nuclear-Family Algorithm . 13 3.1.2 Genotype-Elimination Algorithm . 15 3.1.3 Critical Genotype Algorithm . 16 3.1.4 Odds-Ratio Algorithm . 17 3.2 Allelic Path Explorer (APE) . 18 3.3 Checkfam . 20 vii 3.4 PCS . 22 3.5 Mendelsoft . 23 3.6 Performance Comparison between Mendelsoft and Pedcheck . 25 3.6.1 Synthetic Pedigree Instances . 26 3.6.2 Real Pedigree Instances . 29 4 Maximum Satisfiability and Related Encodings 31 4.1 SAT, Max-SAT and Partial Max-SAT . 31 4.1.1 Inference Rules . 32 4.2 Max-CSP . 33 4.3 Translating CSP into SAT . 34 4.3.1 Direct Encoding . 34 4.3.2 Support Encoding . 35 4.3.3 Minimal Support Encoding . 35 4.4 Encoding Max-CSP into Partial Max-SAT . 36 4.4.1 Direct Encoding for Partial Max-SAT . 36 4.4.2 Minimal Support Encoding for Partial Max-SAT . 36 4.4.3 Support Encoding . 37 5 Pedigrees and Maximum Satisfiability 38 5.1 Mapping pedigree structures . 38 5.1.1 Mapping into Minimal Support Encoding . 40 5.1.2 Mapping into k-AC Encoding . 42 5.2 Comparative tests between encodings . 45 5.2.1 Random pedigree instances . 45 5.2.2 Real pedigree instances . 50 6 Conclusions and Future Work 52 6.1 Future Work . 54 viii ix List of Figures 2.1 A simple pedigree . 6 2.2 A looping pedigree . 7 2.3 An inconsistent and incomplete pedigree . 8 2.4 A complete and consistent pedigree . 10 3.1 Pedigree with level 1 errors . 14 3.2 Pedigree with a level 2 error . 16 3.3 Pedigree with critical genotypes . 17 3.4 Example of input and output data of Checkfam . 22 3.5 Pedigree and corresponding CSP . 24 3.6 Mendelsoft vs. Pedcheck in collection A of instances . 27 3.7 Mendelsoft vs. Pedcheck in collection B of instances . 27 3.8 Mendelsoft vs. Pedcheck in collection C of instances . 28 3.9 Mendelsoft vs. Pedcheck in collection D of instances . 29 5.1 Parental Relations and Genotype information in a Pedigree . 40 5.2 Example of a simple pedigree . 41 5.3 Example of four possible k-AC encodings for a ternary constraint . 43 5.4 Minimal support vs. direct support encoding in collection A of random instances 46 5.5 Minimal support vs. direct support encoding in collection B of random instances 46 5.6 Minimal support vs. direct support encoding in collection C of random instances 47 5.7 Minimal support vs. direct support encoding in collection D of random instances 47 x 5.8 Minimal support vs. 2-AC support encoding in collection A of random instances 48 5.9 Minimal support vs. 2-AC support encoding in collection B of random instances 48 5.10 Minimal support vs. 2-AC support encoding in collection C of random instances 49 5.11 Minimal support vs. 2-AC support encoding in collection D of random instances 49 xi xii List of Tables 3.1 Description of the random classes of pedigree instances . 26 3.2 Mendelsoft vs. Pedcheck in real pedigree instances . 30 5.1 Minimal support encoding vs direct encoding in real instances . 50 5.2 Minimal support encoding vs 2-AC encoding in real instances . 51 xiii xiv 1 Introduction One of today’s most challenging problems in the field of genetics consists in the correct relation between the genes in the human genome and some biological trait an individual may possess. Example of these traits range from blood type and eye color, to those that can predispose an individual for a particular disease. The method used to tackle this problem is named linkage analysis [1, 2, 3] which is a well established statistical-based method that has already shown significant results: genes causing major diseases (e.g., Parkinson’s disease, obesity and anxiety) have already been discovered using this technique. In simpler terms, linkage analysis works by creating maps that correlate specific genes and/or genetic markers of a determined experimental population. And the more information contained on those maps, that is, the denser these maps become, the higher is the probability that the information may contain errors. The problem of verifying the correctness of this information (if it contains errors or not) is named the pedigree consistency checking problem. To describe linkage analysis, consistency checking, and the errors that can be generated in the data in more detail, it is first necessary to introduce some definitions regarding biology and inheritance. 1 1.1 Biological Background DNA (deoxyribonucleic acid) is a molecule that contains the genetic instructions used in the development and functioning of all living organisms, with the role of long-term storage of genetic information. A chromosome is a long chain of DNA and associated proteins that carry portions of the hereditary information in an organism. Inside a chromosome, a locus is a fixed position that carries some specific information (which typically identifies the position of one or more genes). This specific information is named allele. Diploid organisms (e.g. humans) carry chromosomes in pairs, which means that a single locus carries a pair of alleles. This pair of alleles defines the genotype of the individual at this locus. If both alleles are the same (that is, if the information described by the two alleles is equal) the individual is said to be homozygous at that locus, and if the alleles are different the individual is said to be heterozygous. However, we should note that genotypes are not always completely observable and the indirect observation of a genotype (its expression) is named the phenotype. 1.2 Mendelian Laws of Inheritance The notion of the inheritance of traits between generations was first introduced by Gregor Mendel (1865). And his main contribution, the Mendelian laws of inheritance, are the base principles that rule today’s modern genetics [4]. The Mendel’s laws, which specify the concept of Mendelian inheritance are presented next: • Unit factors in pairs: Genetic characters are controlled by unit factors that exist in pairs in individual organisms. The unit factor, is today known as gene. This law implies that the genotype of an individual should always be considered as a pair. • Dominance/Recessiveness: When two unlike unit factors responsible for a single char- acter are present in a single individual, one is dominant over the other, which is said to be recessive. Dominance and recessiveness refers to the phenotype, which can be considered an abstraction of the genotype.

Master of Science in Information Systems and Computer Engineering

Identifying Novel Disease-Associated Variants and Understanding The

The Struggle to Find Reliable Results in Exome Sequencing Data

BIOINFORMATICS Pages 1–9

The Clark Phase-Able Sample Size Problem: Long-Range Phasing and Loss of Heterozygosity in GWAS

Integration and Missing Data Handling in Multiple Omics Studies

The NCBI Dbgap Database of Genotypes and Phenotypes

Erciyes Medical Genetics Days 2019

Joint Variant and De Novo Mutation Identification on Pedigrees from High-Throughput Sequencing Data

Genetic Investigation of Kidney Disease

Exploration of Interaction Between Common and Rare Variants in Genetic Susceptibility to Holoprosencephaly Artem Kim

Estimation of Genotype Error Rate Using Samples with Pedigree Information—An Application on the Genechip Mapping 10K Array

Mendelian Error Detection in Complex Pedigrees Using Weighted Constraint Satisfaction Techniques Marti Sanchez, Simon De Givry, Thomas Schiex