On the Complexity of the Single Individual SNP Haplotyping Problem

Rudi Cilibrasi, Leo van Iersel, Steven Kelk and John Tromp

arXiv:q-bio/0508012v2 [q-bio.GN] 6 Sep 2005

Abstract - We present several new results pertaining to haplotyping. These results concern the combinatorial problem of reconstructing haplotypes from incomplete and/or imperfectly sequenced haplotype fragments. We consider the complexity of the problems Minimum Error Correction (MEC) and Longest Haplotype Reconstruction (LHR) for different restrictions on the input data. Specifically, we look at the gapless case, where every row of the input corresponds to a gapless haplotype-fragment, and the 1-gap case, where at most one gap per fragment is allowed. We prove that MEC is APX-hard in the 1-gap case and still NP-hard in the gapless case. In addition, we question earlier claims that MEC is NP-hard even when the input matrix is restricted to being completely binary. Concerning LHR, we show that this problem is NP-hard and APX-hard in the 1-gap case (and thus also in the general case), but is polynomial time solvable in the gapless case.

Index Terms - Combinatorial algorithms, Biology and genetics, Complexity hierarchies

Part of this research has been funded by the Dutch BSIK/BRICKS project. Rudi Cilibrasi is supported in part by NWO project 612.55.002, and by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors’ views.

I. INTRODUCTION

If we abstractly consider the human genome as a string over the nucleotide alphabet {A, C, G, T}, it is widely known that the genomes of any two humans have at more than 99% of the sites the same nucleotide. The sites at which variability is observed across the human population are called Single Nucleotide Polymorphisms (SNPs), which are formally defined as the sites on the human genome where, across the human population, two or more nucleotides are observed and each such nucleotide occurs in at least 5% of the population. These sites, which occur (on average) approximately once per thousand bases, capture the bulk of human genetic variability; the string of nucleotides found at the SNP sites of a human - the haplotype of that individual - can thus be thought of as a “fingerprint” for that individual.

It has been observed that, for most SNP sites, only two nucleotides are seen; sites where three or four nucleotides are found are comparatively rare. Thus, from a combinatorial perspective, a haplotype can be abstractly expressed as a string over the alphabet {0, 1}. Indeed, the biologically-motivated field of SNP and haplotype analysis has spawned a rich variety of combinatorial problems, which are well described in surveys such as [6] and [9].

We focus on two such combinatorial problems, both variants of the Single Individual Haplotyping Problem (SIH), introduced in [15]. SIH amounts to determining the haplotype of an individual using (potentially) incomplete and/or imperfect fragments of sequencing data. The situation is further complicated by the fact that, being a diploid organism, a human has two versions of each chromosome; one each from the individual’s mother and father. Hence, for a given interval of the genome, a human has two haplotypes. Thus, SIH can be more accurately described as finding the two haplotypes of an individual given fragments of sequencing data where the fragments potentially have read errors and, crucially, where it is not known which of the two chromosomes each fragment was read from. We consider two well-known variants of the problem: Minimum Error Correction (MEC), and Longest Haplotype Reconstruction (LHR).

The input to these problems is a matrix M of SNP fragments. Each column of M represents an SNP site and thus each entry of the matrix denotes the (binary) choice of nucleotide seen at that SNP location on that fragment. An entry of the matrix can thus either be ‘0’, ‘1’ or a hole, represented by ‘-’, which denotes lack of knowledge or uncertainty about the nucleotide at that site. We use M[i, j] to refer to the value found at row i, column j of M, and use M[i] to refer to the ith row. Two rows r1, r2 of the matrix conflict if there exists a column j such that M[r1, j] ≠ M[r2, j] and M[r1, j], M[r2, j] ∈ {0, 1}.

A matrix is feasible iff the rows of the matrix can be partitioned into two sets such that all rows within each set are pairwise non-conflicting.

The objective in MEC is to “correct” (or “flip”) as few entries of the input matrix as possible (i.e. convert 0 to 1 or vice-versa) to arrive at a feasible matrix. The motivation behind this is that all rows of the input matrix were sequenced from one haplotype or the other, and that any deviation from that haplotype occurred because of read-errors during sequencing.

The problem LHR has the same input as MEC but a different objective. Recall that the rows of a feasible matrix M can be partitioned into two sets such that all rows within each set are pairwise non-conflicting. Having obtained such a partition, we can reconstruct a haplotype from each set by merging all the rows in that set together. (We define this formally later in Section III.) With LHR the objective is to remove rows such that the resulting matrix is feasible and such that the sum of the lengths of the two resulting haplotypes is maximised.

In the context of haplotyping, MEC and LHR have been discussed - sometimes under different names - in papers such as [6], [19], [8] and (implicitly) [15]. One question arising from this discussion is how the distribution of holes in the input data affects computational complexity. To explain, let us first define a gap (in a string over the alphabet {0, 1, −}) as a maximal contiguous block of holes that is flanked on both sides by non-hole values. For example, the string ---0010--- has no gaps, -0--10-111 has two gaps, and -0-----1-- has one gap. Two special cases of MEC and LHR that are considered to be practically relevant are the ungapped case and the 1-gap case. The ungapped variant is where every row of the input matrix is ungapped, i.e. all holes appear at the start or end. In the 1-gap case every row has at most one gap.

In Section II-A we offer what we believe is the first proof that Ungapped-MEC (and hence 1-gap MEC and also the general MEC) is NP-hard. We do so by reduction from MAX-CUT. (As far as we are aware, other claims of this result are based explicitly or implicitly on results found in [12]; as we discuss in Section II-C, we conclude that the results in [12] cannot be used for this purpose.) The NP-hardness of 1-gap MEC (and general MEC) follows immediately from the proof that Ungapped-MEC is NP-hard. However, our NP-hardness proof for Ungapped-MEC is not approximation-preserving, and consequently tells us little about the (in)approximability of Ungapped-MEC, 1-gap MEC and general MEC. In light of this we provide (in Section II-B) a proof that 1-gap MEC is APX-hard, thus excluding (unless P=NP) the existence of a Polynomial Time Approximation Scheme (PTAS) for 1-gap MEC (and general MEC.)

We define (in Section II-C) the problem Binary-MEC, where the input matrix contains no holes; as far as we know the complexity of this problem is still - intriguingly - open. We also consider a parameterised version of binary-MEC, where the number of haplotypes is not fixed as two, but is part of the input. We prove that this problem is NP-hard in Section II-D. (In the Appendix we also prove an “auxiliary” lemma which, besides being interesting in its own right, takes on a new significance in light of the open complexity of Binary-MEC.)

In Section III-A we show that Ungapped-LHR is polynomial-time solvable and give a dynamic programming algorithm for this which runs in time O(n^2 m + n^3) for an n × m input matrix. This improves upon the result of [15] which also showed a polynomial-time algorithm for Ungapped-LHR but under the restricting assumption of non-nested input rows.

We also prove, in Section III-B, that LHR is APX-hard (and thus also NP-hard) in the general case, by proving the much stronger result that 1-gap LHR is APX-hard. This is the first proof of hardness (for both 1-gap LHR and general LHR) appearing in the literature.

II. MINIMUM ERROR CORRECTION (MEC)

For a length-m string X ∈ {0, 1, −}^m, and a length-m string Y ∈ {0, 1}^m, we define d(X, Y) as the number of mismatches between the strings i.e. positions where X is 0 and Y is 1, or vice-versa; holes do not contribute to the mismatch count. Recall the definition of feasible from earlier; an alternative, and equivalent, definition (which we use in the following proofs) is as follows. An n × m SNP matrix M is feasible iff there exist two strings (haplotypes) H1, H2 ∈ {0, 1}^m, such that for all rows r of M, d(r, H1) = 0 or d(r, H2) = 0.

[Fig. II.1: Example input to MAX-CUT (see Lemma 1)]

    0 0 − − − − − −
    − − 0 0 − − − −
    − − − − 0 0 − −
    − − − − − − 0 0    (32 copies)
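The definitions of gap, conflict, d(X, Y) and feasibility given so far can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: rows are assumed to be Python strings over the characters '0', '1' and an ASCII '-' for holes, and all function names (count_gaps, conflict, d, two_colour, is_feasible) are hypothetical. Feasibility is tested through the first definition, by checking whether the graph whose edges join conflicting rows can be 2-coloured.

```python
import re
from collections import deque

HOLE = "-"  # assumption: holes are written as an ASCII hyphen


def count_gaps(row):
    """Number of maximal hole blocks flanked on both sides by non-holes."""
    # Holes at the start or end are not gaps, so strip them first.
    return len(re.findall(r"-+", row.strip(HOLE)))


def conflict(r1, r2):
    """True iff some column holds differing non-hole values in r1 and r2."""
    return any(a != b and a != HOLE and b != HOLE for a, b in zip(r1, r2))


def d(x, y):
    """Mismatches between x (may contain holes) and a binary string y."""
    return sum(1 for a, b in zip(x, y) if a != HOLE and a != b)


def two_colour(rows):
    """Split rows into two conflict-free sets; return colours or None."""
    n = len(rows)
    colour = [None] * n
    for start in range(n):
        if colour[start] is not None:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if j != i and conflict(rows[i], rows[j]):
                    if colour[j] is None:
                        colour[j] = 1 - colour[i]
                        queue.append(j)
                    elif colour[j] == colour[i]:
                        return None  # odd cycle of conflicts: infeasible
    return colour


def is_feasible(rows):
    """A matrix is feasible iff its rows admit such a two-way partition."""
    return two_colour(rows) is not None
```

For instance, is_feasible(["0011", "00--", "--00", "1100"]) holds, with the first two rows merging to the haplotype 0011 and the last two to 1100, whereas the rows 00, 01 and 11 conflict pairwise and so cannot be split into two conflict-free sets. The quadratic scan over row pairs is chosen for brevity rather than efficiency.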