Polynomial and APX-Hard Cases of the Individual Haplotyping Problem
Total Page:16
File Type:pdf, Size:1020Kb
Theoretical Computer Science 335 (2005) 109–125 www.elsevier.com/locate/tcs Polynomial and APX-hard cases of the individual haplotyping problem Vineet Bafnaa,∗, Sorin Istrailb, Giuseppe Lanciac,1, Romeo Rizzid aDepartment of Computer Science and Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0114, USA bInformatics Research, Celera Genomics, 45 West Gude, Rockville, MD, USA cDipartimento di Matematica e Informatica, Università di Udine, via delle Scienze 206, 33100 Udine, Italy dDipartimento di Matematica, Università di Trento, via Sommarive 14, 38050 Povo, Italy Abstract SNP haplotyping problems have been the subject of extensive research in the last few years, and are one of the hottest areas of Computational Biologytoday.In this paper we report on our work of the last two years, whose preliminary results were presented at the European Symposium on Algorithms (Proceedings of the Annual European Symposium on Algorithms (ESA), Vol. 2161. Lecture Notes in Computer Science, Springer, 2001, pp. 182–193.) and Workshop on Algorithms in Bioinformatics (Proceedings of the Annual Workshop on Algorithms in Bioinformatics (WABI), Vol. 2452. Lecture Notes in Computer Science, Springer, 2002, pp. 29–43.). We address the problem of reconstructing two haplotypes for an individual from fragment assembly data. This problem will be called the Single Individual Haplotyping Problem. On the positive side, we prove that the problem can be solved effectivelyfor gapless data, and give practical, dynamicprogramming algorithms for its solution. On the negative side, we show that it is unlikelythat polynomialalgorithms exist, even to approximate the solution arbitrarilywell, when the data contain gaps. We remark that both the gapless and gapped data arise in different real-life applications. © 2005 Elsevier B.V. All rights reserved. Keywords: Haplotyping; Fixed-parameter tractability; APX-hardness ∗ Corresponding author. E-mail addresses: [email protected] (V. Bafna), [email protected] (S. Istrail), [email protected] (G. Lancia), [email protected] (R. Rizzi). 1 Also for correspondence. 0304-3975/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.tcs.2004.12.017 110 V. Bafna et al. / Theoretical Computer Science 335 (2005) 109–125 1. Introduction One of the biggest scientific conquests of the last decades has certainlybeen the sequenc- ing of the human genome [18,13]. To achieve this result, thousands of people from all over the world, with different scientific backgrounds, have been involved at various levels. This milestone result is one more step, from Mendel’s experiments with cross-breeding some varieties of peas, to Watson and Crick’s discoveryof DNAs double helix structure, which brings us closer to a real understanding of the secret of life. Being able to look at the genetic makeup of humans (as well as other species), a striking discoverywas made: we are all almost identical at DNA level (99% and more identity). Hence, genomic differences located in relativelysmall regions must be responsible for the observed diversities in our phenotypes (e.g., for the way we look). The smallest possible variation is at the level of a single nucleotide and is called Single Nucleotide Polymorphism or SNP (pronounced “snip”). It is believed that SNPs are the predominant form of human genetic variation, so that their importance cannot be overestimated for medical, drug-design, diagnostic and forensic applications. Broadly speaking, a polymorphism is a trait, common to everybody, whose value can be different but drawn in a limited range of possibilities, called alleles (for a simple example, think of the blood group). A SNP is a specific nucleotide, placed in the middle of a DNA region which is otherwise identical for all of us, whose value varies, in a statisticallysig- nificant way, within a population. In particular, for some reasons that are still unclear, each SNP usuallyshows a variabilityof onlytwo possible alleles (i.e., two possible bases out of four). These alleles can be different for different SNPs. Since DNA of diploid organisms is organized in pairs of chromosomes, for each SNP one can either be homozygous (same allele on both chromosomes) or heterozygous (different alleles). The values of a set of SNPs on a particular chromosome copydefine a haplotype. Haplotyping an individual consists of determining a pair of haplotypes, one for each copy of a given chromosome. The pair provides full information of the SNP fingerprint for that individual at the specific chromosome. There exist different combinatorial versions of haplotyping problems. For instance, the problem of haplotyping a population (i.e., a set of individuals) has been extensively studied, under manyobjective functions, in recent years [1,4,6,10,11]. The problem of haplotyping a single individual has been first formalized as an optimization problem bythese authors [14,17]. A statistical approach for this problem has been recentlyproposed byLi et al. [15]. Given the full DNA sequence of an individual, haplotyping would consist of the trivial check of the value of some nucleotides. Unfortunately, in reality the situation is not so simple. In fact, although in principle the problem of sequencing a genome is considered to have been solved, at a more practical level there are still manyissues, which force the use of computer algorithms and proper optimization models to come up with the “best” possible sequence from the data. The same holds for haplotyping: given genomic data, there is a computational haplotyping problem, consisting of determining the “best” (i.e., most likely to be correct) pair of haplotypes, which can be inferred from the data. In this paper we formallydefine this problem, deriving conditions under which it is solvable in polynomial time and others for which it is NP-hard. We remark that both situations are likelyto occur in real-life contexts, depending on the type of data available and the methodology used for V. Bafna et al. / Theoretical Computer Science 335 (2005) 109–125 111 sequencing. For the polynomial version of the problem we also describe practical effective algorithms based on dynamic programming, while, for the NP-hard versions, we show that theyare in fact difficult even to approximate (i.e., APX-hard). Despite the biological origin of the problem, the model turns out to be purelycombinatorial, and has manynice mathematical properties. We will introduce a few needed biological concepts in Section 2. The mathematical model of the problems, together with the notation and two useful reductions, are described in Section 3. In Section 4 we describe dynamic programming-based polynomial algorithms for the gapless versions of the problem, while in Section 5 we extend these results to bounded- length gaps. Section 6 is devoted to discussing APX-hardness results for the general versions of the problem. Some conclusions are drawn in Section 7. 2. SNPs and haploids 2.1. Sequencing The process of passing from the sequence of nucleotides in a DNA molecule to a string over the DNA alphabet is called sequencing. Given their verysmall size (in the order of nanometers), DNA molecules are not easilymanageable, and sequencing is a verycritical process, which cannot be directlyperformed within a single lab experiment. A sequencer is a machine that is fed some DNA and whose output is a string of As, Ts, Cs and Gs. To each letter, the sequencer attaches a value (confidence level) which represents, essentially, the probabilitythat the letter has been correctlyread. The main problem with sequencing is that the technologyis not powerful enough to sequence a long DNA molecule, which, in the shotgun sequencing approach, must therefore first be replicated (cloned) into manycopies, and then be broken, at random, into several pieces (called fragments), of about 1000 nucleotides each, that are individuallyfed to a sequencer. The cloning phase is necessaryso that the fragments can have nonemptyoverlap. From the overlap of two fragments one mayinfer a longer fragment, and so on, until the original DNA sequence has been reconstructed. This is, in essence, the principle of Shotgun Sequencing [19], in which the fragments are assembled back into the original sequence byusing sophisticated algorithms and powerful computers. Shotgun sequencing allowed an earlycompletion of the sequencing of the human genome [18]. The assembly (i.e., overlap and merge) phase is complicated bythe fact that in a genome there exist many regions with identical content (repeats) scattered all around and due to replicating events during evolution. The repeats mayfool the assembler into thinking that theyare all copies of a same, unique, region. The situation is complicated further from the fact that diploid genomes are organized into pairs of chromosomes (a paternal and a maternal copy) which mayhave identical or nearlyidentical content, a situation that makes the assemblyprocess even harder. To partlyovercome these difficulties, the fragments used in shotgun sequencing some- times have some extra information attached. In fact, theyare obtained via a process that generates pairs (called mate pairs) of fragments instead of individual ones, with a fairly precise estimate of the distance between them. These pairs are guaranteed to come from 112 V. Bafna et al. / Theoretical Computer Science 335 (2005) 109–125 3 SNP 1 2 1 2 3 4 5 6 4 1 A B − Α Α Β 3 5 4 2 Β Α − − Β − 1 3 − Α Β Α Β Α fragment 4 − Α Β − Β Α 2 5 6 5 Β − Α Β Α − (a) (b) (c) Fig. 1. Conflict Graphs for SNP matrix M. (a) M, (b) GF (M), (c) GS (M). Table 1 A chromosome and the two haplotypes Chrom. c, paternal: ataggtccCtatttccaggcgcCgtatacttcgacgggActata Chrom. c, maternal: ataggtccGtatttccaggcgcCgtatacttcgacgggTctata Haplotype 1 → CC A Haplotype 2 → GC T the same copyof a chromosome, and mayhelp whenever one of them comes from a repeat region while the other does not (and can be used as an anchor to place its troublesome mate).