arXiv:q-bio/0508012v2 [q-bio.GN] 6 Sep 2005 n yteITPormeo h uoenCmuiy ne the views. authors’ under Community, the reflects European publi This only the IST-2002-506778. Excellence, of of Programme Network PASCAL IST the by and project. uloie on tteSPstso ua the - human a of of sites string SNP haplotype the the the capture variability; at of found genetic bases, nucleotides human 5% of thousand bulk average) per least the (on once occur at each which approximately and in sites, population, observed These occurs population. human are human nucleotide the nucleotides the such on more or sites across two the where, as genome defined called formally are Polymorphisms are the is population sites Nucleotide variability human the which Single the two of at across any sites 99% observed of The than genomes nucleotide. more the same at that have known humans widely alphabet is nucleotide it the over string a eeis opeiyhierarchies Complexity but genetics, case), case. general gapless the the in in also solvable thus time polynomial (and is case APX-hard 1-gap and NP-hard the is in problem input this that the Concernin show binary. we when completely LHR, even being to question NP-hard and restricted we is is case addition, matrix MEC 1-gap In that case. the claims gapless in earlier the APX-hard in is NP-hard still MEC that prove the We and haplotype-fragment, gapless gap we a Specifically, to data. corresponds the input at the look on restrictions different SNLSO H OPEIYO H IGEIDVDA N HAP SNP INDIVIDUAL SINGLE THE OF COMPLEXITY THE USENGLISHON MC and problems (MEC) the consider of We complexity fragments. the and/or haplotype incomplete sequenced from imperfectly haplotypes reconstructing of prob lem combinatorial the concern results These haplotyping. uiClbaii upre npr yNOpoet612.55.00 project NWO by part in supported is Cilibrasi Rudi ato hsrsac a enfne yteDthBSIK/BRICK Dutch the by funded been has research this of Part fw btatycnie h ua eoeas genome human the consider abstractly we If ne Terms Index Abstract nteCmlxt fteSnl niiulSNP Individual Single the of Complexity the On ae hr tms n a e rgeti allowed. is fragment per gap one most at where case, epeetsvrlnwrslsprann to pertaining results new several present We — fta niiul-cntu ethought be thus can - individual that of ogs altp Reconstruction Haplotype Longest gapless obntra loihs ilg and Biology algorithms, Combinatorial — .I I. uiClbai e a esl tvnKl n onTromp John and Kelk Steven Iersel, van Leo Cilibrasi, Rudi ae hr vr o fteinput the of row every where case, NTRODUCTION iiu ro Correction Error Minimum altpn Problem Haplotyping SP) which (SNPs), { ,C ,T G, C, A, LR for (LHR) cation } 1- 2, g S - , elkonvrat fteproblem: two the consider Correction We of Error from. variants read well-known was fragment each easrcl xrse sasrn vrtealphabet the over string can { a haplotype as expressed a abstractly perspective, be combinatorial Thus, four rare. a or comparatively from three are only where found sites, are sites SNP nucleotides seen; most are for nucleotides that, two observed been has It individual. that for “fingerprint” a as of oetal aera rosad rcal,where crucially, and, is errors fragments it read the given as have where individual haplotypes. data potentially described an sequencing two of accurately of haplotypes has fragments more interval two be human the given finding can a a SIH genome, for Thus, Hence, the a individual’s father. of being the and from that, each mother one fact is chromosome; each the situation The by diploid individual data. complicated imperfect sequencing an further and/or of of incomplete fragments haplotype (potentially) the using determining problems, to combinatorial such the Problem of two variants both on focus [9]. We and [6] rich as well such are a surveys which spawned in problems, described has combinatorial analysis of variety haplotype and SNP ftemti a hsete e‘’ 1 ra nucleotide or the ‘1’ about ‘0’, uncertainty be either or entry thus An knowledge fragment. can seen matrix that matrix hole nucleotide the on the of of location of entry SNP choice matrix each that (binary) at thus a the and is denotes site of problems column SNP Each an these fragments. SNP to of input The Reconstruction 0 , OYIGPROBLEM LOTYPING 1 ersne y‘’ hc eoe akof lack denotes which ‘-’, by represented , } not ned h ilgclymtvtdfil of field biologically-motivated the Indeed, . raim ua a w esosof versions two has human a organism, SH,itoue n[5.SHamounts SIH [15]. in introduced (SIH), nw hc ftetochromosomes two the of which known (LHR). MC,and (MEC), igeIdvda Haplotyping Individual Single ogs Haplotype Longest M represents Minimum M 1 USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 2 at that site. We use M[i, j] to refer to the value In Section II-A we offer what we believe is the found at row i, column j of M, and use M[i] first proof that Ungapped-MEC (and hence 1-gap to refer to the ith row. Two rows r1,r2 of the MEC and also the general MEC) is NP-hard. We matrix conflict if there exists a column j such that do so by reduction from MAX-CUT. (As far as M[r1, j] =6 M[r2, j] and M[r1, j], M[r2, j] ∈{0, 1}. we are aware, other claims of this result are based explicitly or implicitly on results found in [12]; as A matrix is feasible iff the rows of the matrix we discuss in Section II-C, we conclude that the can be partitioned into two sets such that all rows results in [12] cannot be used for this purpose.) within each set are pairwise non-conflicting. The NP-hardness of 1-gap MEC (and general The objective in MEC is to “correct” (or “flip”) MEC) follows immediately from the proof as few entries of the input matrix as possible (i.e. that Ungapped-MEC is NP-hard. However, our convert 0 to 1 or vice-versa) to arrive at a feasible NP-hardness proof for Ungapped-MEC is not matrix. The motivation behind this is that all rows approximation-preserving, and consequently tells of the input matrix were sequenced from one us little about the (in)approximability of Ungapped- haplotype or the other, and that any deviation from MEC, 1-gap MEC and general MEC. In light of that haplotype occurred because of read-errors this we provide (in Section II-B) a proof that 1-gap during sequencing. MEC is APX-hard, thus excluding (unless P=NP) the existence of a Polynomial Time Approximation The problem LHR has the same input as MEC Scheme (PTAS) for 1-gap MEC (and general MEC.) but a different objective. Recall that the rows of a feasible matrix M can be partitioned into two We define (in Section II-C) the problem Binary- sets such that all rows within each set are pairwise MEC, where the input matrix contains no holes; non-conflicting. Having obtained such a partition, as far as we know the complexity of this problem we can reconstruct a haplotype from each set by is still - intriguingly - open. We also consider a merging all the rows in that set together. (We parameterised version of binary-MEC, where the define this formally later in Section III.) With number of haplotypes is not fixed as two, but is LHR the objective is to remove rows such that the part of the input. We prove that this problem is resulting matrix is feasible and such that the sum NP-hard in Section II-D. (In the Appendix we of the lengths of the two resulting haplotypes is also prove an “auxiliary” lemma which, besides maximised. being interesting in its own right, takes on a new significance in light of the open complexity of In the context of haplotyping, MEC and LHR Binary-MEC.) have been discussed - sometimes under different names - in papers such as [6], [19], [8] and In Section III-A we show that Ungapped-LHR (implicitly) [15]. One question arising from this is polynomial-time solvable and give a dynamic discussion is how the distribution of holes in the programming algorithm for this which runs in time input data affects computational complexity. To O(n2m + n3) for an n × m input matrix. This explain, let us first define a gap (in a string over the improves upon the result of [15] which also showed alphabet {0, 1, −}) as a maximal contiguous block a polynomial-time algorithm for Ungapped-LHR of holes that is flanked on both sides by non-hole but under the restricting assumption of non-nested values. For example, the string ---0010--- input rows. has no gaps, -0--10-111 has two gaps, and -0-----1-- has one gap. Two special cases of We also prove, in Section III-B, that LHR is MEC and LHR that are considered to be practically APX-hard (and thus also NP-hard) in the general relevant are the ungapped case and the 1-gap case. case, by proving the much stronger result that The ungapped variant is where every row of the 1-gap LHR is APX-hard. This is the first proof of input matrix is ungapped, i.e. all holes appear at hardness (for both 1-gap LHR and general LHR) the start or end. In the 1-gap case every row has at most one gap. USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 3 appearing in the literature. 1

II.MINIMUM ERROR CORRECTION (MEC) For a length-m string X ∈ {0, 1, −}m, and a length-m string Y ∈{0, 1}m, we define d(X,Y ) as Fig. II.1 the number of mismatches between the strings i.e. EXAMPLE INPUT TO MAX-CUT (SEE LEMMA 1) positions where X is 0 and Y is 1, or vice-versa; holes do not contribute to the mismatch count. Recall the definition of feasible from earlier; an alternative, and equivalent, definition (which we 0 0 −−−−−− use in the following proofs) is as follows. An  − − 0 0 −−−−   n × m SNP matrix M is feasible iff there exist two −−−− 0 0 − −  m    strings (haplotypes) H1,H2 ∈ {0, 1} , such that  −−−−−− 0 0      32 copies for all rows r of M, d(r, H1)=0 or d(r, H2)=0.  1 1 −−−−−−      − − 1 1 −−−−     Finally, a flip is where a 0 entry is converted  −−−− 1 1 − −      to a 1, or vice-versa. Flipping to or from holes is  −−−−−− 1 1   not allowed and the haplotypes H and H may    1 2  00110101   not contain holes.  00011101     M  00010111   G     01010011  A. Ungapped-MEC     Problem: Ungapped-MEC Fig. II.2 Input: An ungapped SNP matrix M CONSTRUCTION OF MATRIX M (FROM LEMMA 1) FOR GRAPH IN Output: Ungapped-MEC(M), which we define as FIGURE II.1 the smallest number of flips needed to make M feasible.2 and holes at all other columns. M1 is defined similar Lemma 1: Ungapped-MEC is NP-hard. to M0 with 1-entries instead of 0-entries. Each row of MG encodes an edge from E: for edge {i, j} Proof: We give a polynomial-time reduction (with i < j) we specify that columns 2i − 1 and from MAX-CUT, which is the problem of com- 2i contain 0s, columns 2j − 1 and 2j contain 1s, puting the size of a maximum cardinality cut in a and for all h =6 i, j, column 2h − 1 contains 0 and graph.3 Let G =(V, E) be the input to MAX-CUT, column 2h contains 1. (See Figures II.1 and II.2 for where E is undirected. (We identify, wlog, V with an example of how M is constructed.) {1, 2, ..., |V |}.) We construct an input matrix M for Ungapped-MEC with 2k|V | + |E| rows and 2|V | Suppose t is the largest cut possible in G and s is columns where k =2|E||V |. We use M0 to refer to the minimum number of flips needed to make M the first k|V | rows of M, M1 to refer to the second feasible. We claim that the following holds: k|V | rows of M, and M to refer to the remaining G s = |E|(|V | − 2)+2(|E| − t). (II.1) |E| rows. M0 consists of |V | consecutive blocks of k identical rows. Each row in the i-th block (for From this t, the optimal solution of MAX-CUT, 1 ≤ i ≤|V |) contains a 0 at columns 2i − 1 and 2i can easily be computed. First, note that the solution 1In [15] there is a claim, made very briefly, that LHR is NP-hard to Ungapped-MEC(M) is trivially upperbounded in general, but it is not substantiated. by |V ||E|. This follows because we could simply 2 In subsequent problem definitions we regard it as implicit that flip every 1 entry in MG to 0; the resulting overall P(I) represents the optimal output of a problem P on input I. matrix would be feasible because we could just take 3The reduction given here can easily be converted into a Karp reduction from the decision version of MAX-CUT to the decision H1 as the all-0 string and H2 as the all-1 string. version of Ungapped-MEC. Now, we say a haplotype H has the double-entry USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 4 property if, for all odd-indexed positions (i.e. reduction4 from CUBIC-MIN-UNCUT, which is columns) j in H, the entry at position j of H is the problem of finding the minimum number of the same as the entry at position j +1. We argue edges that have to be removed from a 3-regular that a minimal number of feasibility-inducing flips graph in order to make it bipartite. Our first goal will always lead to two haplotypes H1,H2 such is thus to prove the APX-hardness of CUBIC- that both haplotypes have the double-entry property MIN-UNCUT, which itself will be proven using an and, further, H1 is the bitwise complement of L-reduction from the APX-hard problem CUBIC- H2. (We describe such a pair of haplotypes as MAX-CUT. partition-encoding.) This is because, if H1,H2 are not partition-encoding, then at least k > |V ||E| (in To aid the reader, we reproduce here the definition contrast with zero) entries in M0 and/or M1 will of an L-reduction. have to be flipped, meaning this strategy is doomed to begin with. Definition 1: (Papadimitriou and Yannakakis [21]) Let A and B be two optimisation problems. Now, for a given partition-encoding pair of An L-reduction from A to B is a pair of functions haplotypes, it follows that - for each row in MG - R and S, both computable in polynomial time, such we will have to flip either |V | − 2 or |V | entries that for any instance I of A with optimum cost to reach its nearest haplotype. This is because, Opt(I), R(I) is an instance of B with optimum cost irrespective of which haplotype we move a row Opt(R(I)) and for every feasible5 solution s of R(I), to, the |V | − 2 pairs of columns not encoding S(s) is a feasible solution of I such that: end-points (for a given row) will always cost 1 flip Opt(R(I)) ≤ αOpt(I), (II.2) each to fix. Then either 2 or 0 of the 4 “endpoint- encoding” entries will also need to be flipped; 4 for some positive constant α and: flips will never be necessary because then the row |Opt(I) − c(S(s))| ≤ β|Opt(R(I)) − c(s)|, (II.3) could move to the other haplotype, requiring no extra flips. Ungapped-MEC thus maximises the for some positive constant β, where c(S(s)) and number of rows which require |V | − 2 rather than c(s) represent the costs of S(s) and s, respectively. |V | flips. If we think of H1 and H2 as encoding a partition of the vertices of V (i.e. a vertex i is on Observation 1: CUBIC-MIN-UNCUT is APX- one side of the partition if H1 has 1s in columns hard. 2i − 1 and 2i, and on the other side if H2 has 1s in those columns), it follows that each row requiring Proof: We give an L-reduction from CUBIC- |V | − 2 flips corresponds to a cut-edge in the vertex MAX-CUT, the problem of finding the maximum partition defined by H1 and H2. The expression cardinality of a cut in a 3-regular graph. (This (II.1) follows. problem is shown to be APX-hard in [1]; see also [5].) Let G =(V, E) be the input to CUBIC-MAX- CUT.

Note that CUBIC-MIN-UNCUT is the “complement” of CUBIC-MAX-CUT, as expressed by the following relationship: B. 1-gap MEC CUBIC-MAX-CUT(G) (II.4) = |E| − CUBIC-MIN-UNCUT(G). Problem: 1-gap MEC 4 Input: SNP matrix with at most 1 gap per row An L-reduction is a specific type of approximation-preserving M reduction, first introduced in [21]. If there exists an L-reduction from Output: The smallest number of flips needed to a problem X to a problem Y, then a PTAS for Y can be used to build make M feasible. a PTAS for X. Conversely, if there exists an L-reduction from X to Y, and X is APX-hard, so is Y. See (for example) [10] for a succinct discussion of this. To prove that 1-gap MEC is APX-hard (and 5Note that feasible in this context has a different meaning to therefore also NP-hard) we will give an L- feasible in the context of SNP matrices. USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 5

To see why this holds, note that for every cut C, number of vertices, because every graph must have the removal of the edges E \ C will lead to a an even number of odd-degree vertices. We add an bipartite graph. On the other hand, given a set of arbitrary perfect matching to the graph, which may edges E′ whose removal makes G bipartite, the create multiple edges. The graph is now 4-regular complement is not necessarily a cut. However, and therefore has an Euler tour. We direct the edges given a bipartition induced by the removal of E′, following the Euler-tour; every vertex is now in-in- the edges from the original graph that cross this out-out. If we remove the perfect matching edges bipartition form a cut C′, such that |C′|≥|E \ E′|. we added, we are left with an oriented version of G This proves (II.4), and the mapping (just described) where every vertex is in-in-out or out-out-in. This from E′ to C′ is the mapping we use in the can all be done in polynomial time. L-reduction. Lemma 2: 1-gap MEC is APX-hard Now, note that property (II.2) of the L-reduction is easily satisfied (taking α = 1) because the optimal Proof: We give a reduction from CUBIC- value of CUBIC-MIN-UNCUT is always less than MIN-UNCUT. Consider an arbitrary 3-regular or equal to the optimal value of CUBIC-MAX- graph G =(V, E) and orient the edges as described in Observation 2 to obtain an oriented version CUT. This follows from the combination of (II.4) −→ −→ with the fact that a in a 3-regular of G, G = (V, E ), where each vertex is either graph always contains at least 2/3 of the edges: if in-in-out or out-out-in. We construct an |E|×|V | a vertex has less than two incident edges in the cut input matrix M for 1-gap MEC as follows. The −→ then we can get a larger cut by moving this vertex columns of M correspond to the vertices of G and −→ to the other side of the partition. every row of M encodes an oriented edge of G; it has a 0 in the column corresponding to the tail To see that property (II.3) of the L-reduction of the edge (i.e. the vertex from which the edge is easily satisfied (taking β = 1), let E′ be any leaves), a 1 in the column corresponding to the set of edges whose removal makes G bipartite. head of the edge, and the rest holes. Property (II.3) is satisfied because E′ gets mapped to a cut C′, as defined above, and combined with We prove the following: (II.4) this gives: CUBIC-MIN-UNCUT(G) = 1-gap MEC(M). CUBIC-MAX-CUT(G) −|C′| (II.6) ≤ CUBIC-MAX-CUT(G) −|E \ E′| (II.5) We first prove that: ′ = |E | − CUBIC-MIN-UNCUT(G). 1-gap MEC(M) ≤ CUBIC-MIN-UNCUT(G). This completes the L-reduction from CUBIC-MAX- (II.7) ′ CUT to CUBIC-MIN-UNCUT, proving the APX- To see this, let E be a minimal set of edges whose ′ hardness of CUBIC-MIN-UNCUT. removal makes G bipartite, and let |E | = k. Let B = (L ∪ R, E \ E′) be the bipartite graph (with We also need the following observation. bipartition L ∪ R) obtained from G by removing ′ the edges E . Let H1 (respectively, H2) be the Observation 2: Let G =(V, E) be an undirected, haplotype that has 1s in the columns representing 3-regular graph. Then we can find, in polynomial vertices of L (respectively, R) and 0s elsewhere. time, an orientation of the edges of G so that It is possible to make M feasible with k flips, by ′ each vertex has either in-degree 2 and out-degree the following process: for each edge in E , flip 1 (“in-in-out”) or out-degree 2 and in-degree 1 the 0 bit in the corresponding row of M to 1. For (“out-out-in”). each row r of M it is now true that d(r, H1)=0 or d(r, H2)=0, proving the feasibility of M. Proof: (We assume that G is connected; if G is not connected, we can apply the following argument The proof that, to each component of G in turn, and the overall CUBIC-MIN-UNCUT(G) ≤ 1-gap MEC(M), result still holds.) Every cubic graph has an even (II.8) USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 6 is more subtle. Suppose we can render M feasible hard when the input matrix is further restricted. Let using j flips, and let H1 and H2 be any two us therefore define the following problem. haplotypes such that, after the j flips, each row of M is distance 0 from either H1 or H2. If H1 Problem: Binary-MEC and H2 are bitwise complementary then we can Input: An SNP matrix M that does not contain make G bipartite by removing an edge whenever any holes we had to flip a bit in the corresponding row. The Output: As for Ungapped-MEC idea is, namely, that the 1s in H1 (respectively, H2) represent the vertices L (respectively, R) in the Like all optimisation problems, the problem resulting bipartition L ∪ R. Binary-MEC has different variants, depending on how the problem is defined. The above definition However, suppose the two haplotypes H1 and is technically speaking the evaluation variant of the 6 H2 are not bitwise complementary. In this case it Binary-MEC problem . Consider the closely-related is sufficient to demonstrate that there also exists constructive version: ′ ′ bitwise complementary haplotypes H1 and H2 such that, after j (or fewer) flips, every row of M is Problem: Binary-Constructive-MEC ′ ′ distance 0 from either H1 or H2. Consider thus a Input: An SNP matrix M that does not contain column of H1 and H2 where the two haplotypes any holes are not complementary. Crucially, the orientation Output: For an input matrix M of size n × m, two −→ m of G ensures that every column of M contains haplotypes H1,H2 ∈{0, 1} minimizing: either one 1 and two 0s or two 1s and one 0 (and the rest holes). A simple case analysis shows DM (H1,H2)= min(d(r, H1),d(r, H2)). that, because of this, we can always change the rowsX r of M (II.10) value of one of the haplotypes in that column, In the appendix, we prove that Binary-Constructive- without increasing the number of flips. (The MEC is polynomial-time Turing interreducible with number of flips might decrease.) Repeating this its evaluation counterpart, Binary-MEC. This process for all columns of H and H where the 1 2 proves that Binary-Constructive-MEC is solvable same value is observed thus creates complementary in polynomial-time iff Binary-MEC is solvable in haplotypes H′ and H′ , and - as described in 1 2 polynomial-time. We mention this correspondence the previous paragraph - these haplotypes then because, when expressed as a constructive problem, determine which edges of G should be removed to it can be seen that MEC is in fact a specific make G bipartite. This completes the proof of (II.6). type of clustering problem, a topic of intensive study in the literature. More specifically, we are The above reduction can be computed in polynomial trying to find two representative “median” (or time and is an L-reduction. From (II.6) it follows “consensus”) strings such that the sum, over all directly that property (II.2) of an L-reduction is input strings, of the distance between each input satisfied with α = 1. Property (II.3), with β = 1, string and its nearest median, is minimised. This follows from the proof of (II.8), combined with interreducibility is potentially useful because we (II.6). Namely, whenever we use (say) t flips to now argue, in contrast to claims in the existing make M feasible, we can find s ≤ t edges of G literature, that the complexity of Binary-MEC / that can be removed to make G bipartite. Combined Binary-Constructive-MEC is actually still open. with (II.6) this gives: |CUBIC-MIN-UNCUT(G) − s| (II.9) To elaborate, it is claimed in several papers ≤|1-gap MEC(M) − t|. (e.g. [2]) that a problem equivalent to Binary- Constructive-MEC is NP-hard. Such claims inevitably refer to the seminal paper Segmentation C. Binary-MEC Problems by Kleinberg, Papadimitriou, and

From a mathematical point of view it is 6See [3] for a more detailed explanation of terminology in this interesting to determine whether MEC stays NP- area. USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 7

Raghavan (KPR), which has appeared in multiple and k ∈ N \{0} different forms since 1998 (e.g. [12], [13] and Output: The smallest number of flips needed to [14].) However, the KPR papers actually discuss make M feasible under k haplotypes. two superficially similar, but essentially different, The notion of feasible generalises easily to k ≥ 1 problems: one problem is essentially equivalent to haplotypes: an SNP matrix M is feasible under k Binary-Constructive-MEC, and the other is a more haplotypes if M can be partitioned into k segments general (and thus, potentially, a more difficult) such that all the rows within each segment are 7 problem. Communication with the authors [20] pairwise non-conflicting. The definition of DM has confirmed that they have no proof of hardness also generalises easily to k haplotypes; we define for the former problem i.e. the problem that is DM,k(H1,H2, ..., Hk) as: essentially equivalent to Binary-Constructive-MEC. min(d(r, H1),d(r, H2), ..., d(r, Hk)). rowsX r of M Thus we conclude that the complexity of Binary- (II.11) Constructive-MEC / Binary-MEC is still open. We define OptTuples(M,k) as the set of unordered From an approximation viewpoint the problem optimal k-tuples of haplotypes for M i.e. those k- has been quite well-studied; the problem has a tuples of haplotypes which have a D score equal Polynomial Time Approximation Scheme (PTAS) M,k to PBMEC(M,k). because it is a special form of the Hamming 2-Median Clustering Problem, for which a PTAS Lemma 3: PBMEC is NP-hard is demonstrated in [11]. Other approximation results appear in [12], [2], [14], [18] and a Proof: We reduce from the NP-hard problem heuristic for a similar (but not identical) problem MINIMUM-VERTEX-COVER. Let G =(V, E) be appears in [19]. We also know that, if the number an undirected graph. A subset V ′ ⊆ V is said to of haplotypes to be found is specified as part cover an edge (u, v) ∈ E iff u ∈ V ′ or v ∈ V ′. A of the input (and not fixed as 2), the problem vertex cover of an undirected graph G = (V, E) is becomes NP-hard; we prove this in the following a subset U of the vertices such that every edge in section. Finally, it may also be relevant that the E is covered by U. MINIMUM-VERTEX-COVER “geometric” version of the problem (where rows is the problem of, given a graph G, computing the of the input matrix are not drawn from m {0, 1} size of a minimum cardinality vertex cover U of but from Rm, and Euclidean distance is used G. instead of Hamming distance) is also open from a complexity viewpoint [18]. (However, the version Let G = (V, E) be the input to MINIMUM- using Euclidean-distance-squared is known to be VERTEX-COVER. We construct an SNP matrix M NP-hard [7].) as follows. M has |V | columns and 3|E||V | + |E| rows. We name the first 3|E||V | rows M0 and D. Parameterised Binary-MEC the remaining |E| rows MG. M0 is the matrix obtained by taking the |V |×|V | identity matrix Let us now consider a generalisation of the (i.e. 1s on the diagonal, 0s everywhere else) and problem Binary-MEC, where the number of making 3|E| copies of each row. Each row in MG haplotypes is not fixed as two, but part of the input. encodes an edge of G: the row has 1-entries at the endpoints of the edge, and the rest of the row is 0. Problem: Parameterised-Binary-MEC (PBMEC) We argue shortly that, to compute the size of the Input: An SNP matrix M that contains no holes, smallest vertex cover in G, we call PBMEC(M,k) 7In this more general problem, rows and haplotypes are viewed for increasing values of k (starting with k = 2) as vectors and the distance between a row and a haplotype is their until we first encounter a k such that: dot product. Further, unlike Binary-Constructive-MEC, this problem allows entries of the input matrix to be drawn arbitrarily from R. This PBMEC(M,k)=3|E|(|V | − (k − 1)) + |E|. extra degree of freedom - particularly the ability to simultaneously (II.12) use positive, negative and zero values in the input matrix - is what (when coupled with a dot product distance measure) provides the Once the smallest such k has been found, we ability to encode NP-hard problems. can output that the size of the smallest vertex USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 8

1000  0100    12 copies 0010     0001      1100   Fig. II.3    1010   EXAMPLE INPUT GRAPH TO MINIMUM-VERTEX-COVER (SEE    MG  1001   LEMMA 3)  0011      Fig. II.4 cover in G is k − 1. (Actually, if we haven’t CONSTRUCTION OF MATRIX M FOR GRAPH FROM FIGURE II.3 yet found a value k < |V | − 2 satisfying the above equation, we can check by brute force in polynomial-time whether G has a vertex cover of of I contribute 0 to the D measure (i.e. every size |V |−3, |V |−2, |V |−1, or |V |. The reason for |V | haplotype has precisely one entry set to 1, and the wanting to ensure that PBMEC(M,k) is not called haplotypes are all distinct) then there are |V | − k with k ≥|V |−2 is explained later in the analysis.8) rows which each contribute 2 to the D measure; such a k-tuple cannot be optimal because it has It remains only to prove that (for k < |V | − 2) a D measure of 2(|V | − k) > |V | − (k − 1). So (II.12) holds iff G has a vertex cover of size k − 1. we reason that at most k − 1 rows contribute 0 to the D measure. In fact, precisely k − 1 rows To prove this we need to first analyze must contribute 0 to the D measure because, OptTuples(M ,k). Recall that M was obtained 0 0 otherwise, there would be at least |V | − (k − 2) by duplicating the rows of the |V |×|V | identity rows contributing at least 1, and this is not possible matrix. Let I|V | be shorthand for the |V |×|V | because PBMEC(I|V |,k) ≤ |V | − (k − 1). So identity matrix. Given that M0 is simply a “scaled k − 1 of the haplotypes correspond to rows of I|V |, up” version of I|V |, it follows that: and the remaining |V | − (k − 1) rows of I|V | must each contribute 1 to the D measure. But the only OptTuples(M0,k)= OptTuples(I|V |,k). (II.13) way to do this (given that |V | − (k − 1) > 2) is to Now, we argue that all the k-tuples in make the kth haplotype the haplotype where every OptTuples(I|V |,k) (for k < |V | − 2) have entry is 0. Hence: the following form: one haplotype from the tuple PBMEC(I ,k)= |V | − (k − 1) (II.14) contains only 0s, and the remaining k−1 haplotypes |V | from the tuple each have precisely one entry set to and: 1. Let us name such a k-tuple a candidate tuple. PBMEC(M0,k)=3|E|(|V | − (k − 1)). (II.15) First, note that PBMEC(I ,k) ≤|V | − (k − 1), |V | OptTuples(I|V |,k) (= OptTuples(M0,k)) is, by because |V | − (k − 1) is the value of the D extension, precisely the set of candidate k-tuples. measure - defined in (II.11) - under any candidate tuple. Secondly, under an arbitrary k-tuple there The next step is to observe that OptTuples(M,k) ⊆ can be at most rows of which contribute 0 k I|V | OptTuples(M0,k). To see this, suppose (by way to the D measure. However, if precisely k rows of contradiction) that it is not true, and there exists ∗ 8 a k-tuple H ∈ OptTuples(M,k) that is not in Note that, should we wish to build a Karp reduction from the ∗ decision version of MINIMUM-VERTEX-COVER to the decision OptTuples(M0,k). But then replacing H by any version of PBMEC, it is not a problem to make this brute force k-tuple out of OptTuples(M0,k) would reduce checking fit into the framework of a Karp reduction. The Karp the number of flips needed in M0 by at least reduction can do the brute force checking itself and use trivial inputs to the decision version of PBMEC to communicate its “yes” or “no” 3|E|, in contrast to an increase in the number of answer. flips needed in MG of at most 2|E|, thus leading USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 9 to an overall reduction in the number of flips; of a haplotype H, denoted as |H|, as the number contradiction! (The 2|E| figure is the number of of positions where it does not contain a hole; the flips required to make all rows in MG equal to the haplotype 10-1 thus has length 3, for example. all-0 haplotype.) Now, the objective with LHR is to remove rows from M to make it feasible but also such that the Because OptTuples(M,k) ⊆ OptTuples(M0,k), sum of the lengths of the two resulting haplotypes we can restrict our attention to the k-tuples in is maximised. We define the function LHR(M) OptTuples(M0,k). Observe that there is a natural (which gives a natural number as output) as the 1-1 correspondence between the elements of largest value this sum-of-lengths value can take, OptTuples(M0,k) and all size k − 1 subsets of V : ranging over all feasibility-inducing row-removals a vertex v ∈ V is in the subset corresponding to and subsequent partitions. ∗ H ∈ OptTuples(M0,k) iff one of the haplotypes in H∗ has a 1 in the column corresponding to In Section III-A we provide a polynomial-time vertex v. dynamic programming algorithm for the ungapped variant of LHR, Ungapped-LHR. In Section III-B ∗ Now, for a k-tuple H ∈ OptTuples(M0,k) we show that LHR becomes APX-hard and NP-hard we let Cov(G,H∗) be the set of edges in G which when at most one gap per input row is allowed, are covered by the subset of V corresponding to automatically also proving the hardness of LHR in H∗. (Thus, |Cov(G,H∗)| = |E| iff H∗ represents the general case. a vertex cover of G.) It is easy to check that, for H∗ ∈ OptTuples(M ,k): 0 A. A polynomial-time algorithm for Ungapped-LHR ∗ DM,k(H ) = 3|E|(|V | − (k − 1)) Problem: Ungapped-LHR ∗ + |Cov(G,H )| Input: An ungapped SNP matrix M ∗ + 2(|E|−|Cov(G,H |) Output: The value LHR(M), as defined above = 3|E|(|V | − (k − 1)) ∗ +2|E|−|Cov(G,H )|. The LHR problem for ungapped matrices was proved to be polynomial-time solvable by Lancia Hence, for H∗ ∈ OptTuples(M ,k), D (H∗) 0 M,k et. al in [15], but only with the genuine restriction equals 3|E|(|V | − (k − 1)) + |E| iff H∗ represents that no fragments are included in other fragments. a size k − 1 vertex cover of G. Our algorithm improves this in the sense that it works for all ungapped input matrices; our algorithm is similar in style to the algorithm that III. LONGEST HAPLOTYPE RECONSTRUCTION solves MFR9 in the ungapped case by Bafna et. (LHR) al. in [4]. Note that our dynamic-programming Suppose an SNP matrix M is feasible. Then algorithm computes Ungapped-LHR(M) but it can we can partition the rows of M into two sets, Ml easily be adapted to generate the rows that must be and Mr, such that the rows within each set are removed (and subsequently, the partition that must pairwise non-conflicting. (The partition might not be made) to achieve this value. be unique.) From Mi (i ∈{l,r}) we can then build a haplotype Hi by combining the rows of Mi as Lemma 4: Ungapped-LHR can be solved in time 2 3 follows: The jth column of Hi is set to 1 if at least O(n m + n ) one row from Mi has a 1 in column j, is set to 0 if at least one row from Mi has a 0 in column j, Proof: Let M be the input to Ungapped-LHR, and is set to a hole if all rows in Mi have a hole in and assume the matrix has size n × m. For row column j. Note that, in contrast to MEC, this leads i define l(i) as the leftmost column that is not a to haplotypes that potentially contain holes. For hole and define r(i) as the rightmost column that example, suppose one side of the partition contains 9Minimum Fragment Removal: in this problem the objective is not rows 10--, -0-- and ---1; then the haplotype to maximise the length of the haplotypes, but to minimise the number we get from this is 10-1. We define the length of rows removed USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 10 is not a hole. The rows of M are ordered such Where f(i, j) = r(i) − max{r(j),l(i) − 1} is that l(i) ≤ l(j) if i < j. Define the matrix Mi as the increase of the haplotype’s length. Equation the matrix consisting of the first i rows of M and (III.3) results from the following. The definition two extra rows at the top: row 0 and row −1, both of D[i, k; i] says that row i has to be used for the consisting of all holes. Define W (i) as the set of haplotype for which k is not used and amongst rows j < i that are not in conflict with row i. such rows maximises r(·). Therefore, the optimal solution is achieved by adding row i to some For h, k ≤ i and h, k ≥ −1 and r(h) ≤ r(k) solution that has a row j as the most-right-ending define D[h, k; i] as the maximum sum of lengths row, for some j that agrees with i, is not equal of two haplotypes such that: to k and ends before i. Adding row i to the • each haplotype is built up as a combination of haplotype leads to an increase of its length of rows from Mi (in the sense explained above); f(i, j) = r(i) − max{r(j),l(i) − 1}. This term • each row from Mi can be used to build at most is fixed, for fixed i and j and therefore we only one haplotype (i.e. it cannot be used for both have to consider extensions of solutions that were haplotypes); already optimal. Note that this reasoning does not • row k is one of the rows used to build a hold for more general, “gapped”, data. haplotype and among such rows maximises r(·); The last case is when k = i; D[h, i; i] is • row h is one of the rows used to build the equal to: haplotype for which k is not used and among D[j, h; i − 1] + f(i, j) if r(h) ≥ r(j), such rows maximises r(·). max j∈W (i), j6=h  D[h, j; i − 1] + f(i, j) if r(h) r(k): row i cannot be used for the haplotype that row k is used for, because row k has maximal r(·) among all rows that are B. 1-gap LHR is NP-hard and APX-hard used for a haplotype; Problem: 1-gap LHR • if r(i) ≤ r(k): row i cannot increase the length Input: SNP matrix M with at most one gap per of the haplotype that row k is used for (because row also l(i) ≥ l(k)); Output: The value LHR(M), as defined earlier • the same arguments hold for h. In this section we prove that 1-gap LHR is The second case is when h = i; D[i, k; i] is equal APX-hard (and thus also NP-hard.) We prove this to: by demonstrating (indirectly) an L-reduction from the problem CUBIC-MAX-INDEPENDENT-SET - max D[j, k; i − 1] + f(i, j). (III.3) the problem of computing the maximum cardinality j∈W (i), j6=k r(j)≤r(i) of an independent set in a cubic graph - which is USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 11 itself proven APX-hard in [1].

′ We do this in several steps. We first show an L- H ∈ SOL(M) ⇒ (H,H) ∈ SOL(M ). (III.5) reduction from Single Haplotype LHR (SH-LHR), To satisfy the L-reduction we need to show how the version of LHR where only one haplotype is elements from SOL(M ′) are mapped back to el- 10 used , to LHR, such that the number of gaps per ements of SOL(M) in polynomial time. So, let rows is unchanged. We then show an L-reduction ′ (H1,H2) be any pair from SOL(M ). If |H1| ≥ from CUBIC-MAX-INDEPENDENT-SET to 2-gap |H2| map the pair (H1,H2) to H1, otherwise to H2. SH-LHR. Then, using an observation pertaining to This completes the L-reduction, and we now prove the structure of cubic graphs, we show how this its correctness. Central to this is the proof of the reduction can be adapted to give an L-reduction following: from CUBIC-MAX-INDEPENDENT-SET to 1-gap 1 SH-LHR. This proves the APX-hardness of 1-gap SH-LHR(M)= LHR(M ′). (III.6) SH-LHR and thus (by transitivity of L-reductions) 2 also 1-gap LHR. 1 ′ The fact that SH-LHR(M) ≥ 2 LHR(M ) follows immediately from (III.4) and the mapping described Lemma 5: SH-LHR is L-reducible to LHR, such above. (This lets us fulfil condition II.2) of the that the number of gaps per row is unchanged. L-reduction definition, taking α =2.) The fact that 1 ′ SH-LHR(M) ≤ 2 LHR(M ) follows because, by Proof: Let M be the n × m input to SH-LHR. (III.5), every element in SOL(M) is guaranteed to We may assume that M contains no duplicate have a counterpart in SOL(M ′) which has a total rows, because duplicate rows are entirely redundant length twice as large. when working with only one haplotype. We map the SH-LHR input, M, to the 2n × m LHR input, We can fulfil condition (II.3) of the L-reduction M ′, by taking each row of M and making a copy 1 by taking β = 2 . To see this, let (H1,H2) be of it. Informally, the idea is that the influence any pair from SOL(M ′), and (wlog) assume that of the second haplotype can be neutralised by ′ |H1|≥|H2|. Let r = LHR(M ), the distance of doubling the rows of the input matrix. Note that (H1,H2) from optimal is then: this construction clearly preserves the maximum number of gaps per row. r − (|H1| + |H2|) ≥ r − 2|H1|. (III.7) Let l = SH-LHR(M), then: Now, let SOL(M ′) be the set that contains all r pairs of haplotypes (H1,H2) that can be induced l −|H1| = −|H1| ′ 2 by removing some rows of M , partitioning the 1 ′ = r − 2|H | remaining rows of M into two mutually non- 2  1  (III.8) conflicting sets, and then reading off the two ≤ 1 r − (|H | + |H |) . induced haplotypes. Similarly, let SOL(M) be 2  1 2  the set that contains all haplotypes H that can be 1 induced by removing some rows of M (such that Thus, taking β = 2 satisfies condition (II.3) of the the remaining rows are mutually non-conflicting) L-reduction. and then reading off the single, induced haplotype. Note the following pair of observations, which both Lemma 6: 2-gap SH-LHR is APX-hard follow directly from the construction of M ′: ′ Proof: We reduce from CUBIC-MAX- (H1,H2) ∈ SOL(M ) ⇒ H1,H2 ∈ SOL(M), INDEPENDENT-SET. Let G = (V, E) be (III.4) the undirected, cubic input to CUBIC-MAX- 10More formally:- rows of the input matrix M must be removed INDEPENDENT-SET. We direct the edges of G until the remaining rows are mutually non-conflicting. The length of in the manner described by Observation 2, to give the resulting single haplotype, which we seek to maximise, is the −→ −→ −→ number of columns (amongst the remaining rows) that have at least G = (V, E ). Thus, every vertex of G is now one non-hole entry. out-out-in or in-in-out. A vertex w is a child of USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 12 a vertex v if there is an edge leaving v in the −→ direction of w i.e. (v,w) ∈ E , and in this case v is said to be the parent of w. −→ Let vin be the number of vertices in G that are in-in-out, and vout be the number of vertices that are out-out-in. We build a matrix M, to be used as input to 2-gap SH-LHR, which has |V | rows and 2vin + vout columns. The construction of M is as follows. (Each row of M will represent a vertex from V , so we henceforth index the rows Fig. III.1 of M using vertices of V .) Now, to each in-in-out EXAMPLE INPUT GRAPH TO −→ vertex of G, we allocate two adjacent columns of CUBIC-MAX-INDEPENDENT-SET (SEE LEMMAS 6 AND 7) M, and for each out-out-in vertex, we allocate one AFTER AN APPROPRIATE EDGE ORIENTATION HAS BEEN APPLIED. column of M. (A column may not be allocated to more than one vertex.)11 For simplicity, we also impose an arbitrary total order P on the vertices of v3 v1 v2 v5 v5 v7 v8 v8 v4 v4 v6 v6 V . v1 − 1 0 −−−−− 0 − − − v2  − − 1 0 −−−−−− 0 −  Now, for each vertex v ∈ V , we build row v v3 1 0 −−−−−−− 0 − −   as follows. Firstly, we put 1(s) in the column(s) v4  −−−−− 0 − − 1 1 − −    representing v. Secondly, consider each child w v5  − − − 1 1 − − 0 −−−−    of v. If w is an out-out-in vertex, we put a 0 in v6  −−−− 0 −−−−− 1 1    the column representing w. Alternatively, w is v7  0 −−−− 1 0 −−−−−  v  −−−−−− 1 1 − − − 0  an in-in-out vertex, so w is represented by two 8   columns; in this case we put a 0 in the left such Fig. III.2 column (if v comes before the other parent of w CONSTRUCTION OF MATRIX M (FROM LEMMA 6 AND 7) FOR in the total order P ) or, alternatively, in the right GRAPH IN FIGURE III.1 column (if v comes after the other parent of w in the total order P ). The rest of the row is holes. follows from the aforementioned facts that every This completes the construction of M. Note that rows encoding in-in-out vertices contain two column of M contains exactly one 1 and one 0. adjacent 1s and one 0, with at most one gap in the and that every row has exactly 3 non-hole elements. row, and rows encoding out-out-in vertices contain one 1 and two 0s, with at most two gaps in the We now prove that the rows of K are in conflict row. In either case there are precisely 3 non-hole iff V [K] is not an independent set. First, suppose V [K] is not an independent set. Then there exist elements per row. It is also crucial to note that, −→ reading down any one column of M, one sees u, v ∈ V [K] such that (u, v) ∈ E . In row v of K exactly one 1 and exactly one 0. there are thus 1(s) in the column(s) representing vertex v. However, there is also (in row u)a0in Let K be any submatrix of M obtained by the column (or one of the columns) representing removing rows from M, and let V [K] ⊆ V be vertex v, causing a conflict. Hence, if V [K] is not the set of vertices whose rows appear in K. an independent set, K is in conflict. Now consider If the rows of K are mutually non-conflicting, the other direction. Suppose K is in conflict. Then then the haplotype induced by K has length in some column of K there is a 0 and a 1. Let u 3r where r is the number of rows in K. This be the row where the 0 is seen, and v be the row where the 1 is seen. So both u and v are in V [K].

11 Further, we know that there is an out-edge (u, v) Note that, for this lemma, it is not important how the columns −→ are allocated; in the proof of Lemma 9, the ordering is crucial. in E , and thus an edge between u and v in E, USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 13 proving that V [K] is not an independent set. This used to allocate the columns of M to the vertices completes the proof of the iff relationship. of V , such that every row of M has at most one gap. To ensure this property, it is necessary to It follows that: stipulate that, where favourite(v) is an in-in-out CUBIC-MAX-INDEPENDENT-SET(G) vertex, the 0 encoding the edge (v, favourite(v)) 1 (III.9) is placed in the left of the two columns encoding = 3 SH-LHR(M). favourite(v). This is not a problem because every The conditions of the L-reduction definition are now vertex is the favourite of at most one other vertex. easily satisfied, because of the 1-1 correspondence between haplotypes induced (after row-removals) It remains to prove that the function favourite and independent sets in G, and the fact that a size- exists and that it can be constructed in polynomial r independent set of G corresponds to a length- time. This is equivalent to finding vertex disjoint 3r haplotype (or, equivalently, to r mutually non- −→ directed paths in G such that every out-out-in vertex conflicting rows of M.) The L-reduction is formally is on such a path and all paths end in an in-in-out satisfied by taking α = 3 and β = 1 . The two 3 vertex. Lemma 8 tells us how to find such paths. functions that comprise the L-reduction are both We thank Bert Gerards for invaluable help with this. polynomial time computable.

Lemma 7: 1-gap SH-LHR is APX-hard. This completes the proof that 1-gap SH-LHR is APX-hard. (See Figures III.1 and III.2 for an Proof: This proof is almost identical to the example of the whole reduction in action.) proof of Lemma 6; the difference is the manner in −→ which columns of M are assigned to vertices of G. Lemma 8: Let G be a directed, cubic graph The informal motivation is follows. In the previous with a partition (Vout,Vin) of the vertices such that allocation of columns to vertices, it was possible the vertices in Vout are out-out-in and the vertices for a row corresponding to an out-out-in vertex to in Vin are in-in-out. Then Vout can be covered, in have 2 gaps. Suppose, for each out-out-in vertex, polynomial time, by vertex-disjoint directed paths we could ensure that one of the 0s in its row was ending in Vin. adjacent to the 1 in the row, with no holes in between. Then every row of the matrix would have Proof: Observe that any two directed circuits (at most) 1 gap, and we would be finished. We now contained entirely within Vout are pairwise vertex show that, by exploiting a rather subtle property ′ disjoint. Let Vout be obtained from Vout by shrinking of cubic graphs, it is indeed possible to allocate each directed circuit in V to a single vertex, and −→ out columns to vertices such that this is possible. let G′ be the resulting new graph. (Note that each vertex in V ′ has outdegree at least 2 and indegree Assume, that we have ordered the edges of out −→ at most 1 and that the indegree of each node in G as before to obtain G. Let V ⊆ V be those out Vin is still 2, because we do not delete multiple vertices in V that are out-out-in. Now, suppose we edges) We now argue that it is possible to find a could compute (in polynomial time) an injective −→ set of edges F ′ in G′, with |F ′| = |V ′ |, such that function with the following out favourite : Vout → V - for each v ∈ V ′ - precisely one edge from F ′ properties: out −→ begins at v, and such that no two edges in F ′ have • for every v ∈ V , (v, favourite(v)) ∈ E ; out−→ the same endpoint. We prove this by construction. ′ ′ • the subgraph of G induced by edges of the For each vertex u ∈ Vout that has a child v in Vout, form (v, favourite(v)), henceforth called the we can add the edge (u, v) to F ′, because v has favourite-induced subgraph, is acyclic. indegree 1 and therefore no other edges can end at v. (In case u has two such children, we can Given such a function it is easy to create a total choose one of the edges to add to F ′). Thus we ′ enumeration of the vertices of V such that every are left to deal with a subset of vertices L ⊆ Vout out-out-in vertex is immediately followed by its where every vertex in L has all its children in Vin. favourite vertex. This enumeration can then be Now consider the bipartite graph B with bipartition USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 14

−→′ (L, Vin) and an edge for every directed edge of G make it feasible. going from L to Vin. If we can find a matching in B of size |L|, we can complete the construction of F ′ by adding the edges from the perfect matching. Binary (i.e. no holes) ? (Section II-C) PTAS known [11] Hall’s Theorem states that a bipartite graph with MEC Ungapped NP-hard (Section II-A) bipartition (X,Y ) has a matching of size |X| iff, 1-Gap NP-hard (Section II-B), for all X′ ⊆ X, |N(X′)|≥|X′|, where N(X′) is APX-hard (Section II-B) ′ Ungapped P (Section III-A) the set of all neighbours of X . Now, note that each LHR 1-Gap NP-hard (Section III-B) vertex in L sends at least two edges across the APX-hard (Section III-B) Ungapped P [4] partition of B, and each vertex in Vin can accept ′ MFR 1-Gap NP-hard [15] at most two such edges, so for each L ⊆ L it is APX-hard [4] ′ ′ clear that |N(L )|≥|L |. Hence, the graph (L, Vin) Ungapped P [15] does indeed have a matching of size |L| and the MSR 1-Gap NP-hard [4] construction of F ′ can be completed. APX-hard [4] TABLE I Now, given that the graph induced by V ′ is THE NEW STATE OF KNOWLEDGE FOLLOWING OUR WORK out −→ acyclic, so is F ′. Let F be the set of edges in G corresponding to those in F ′. F is acyclic and each Indeed, from a complexity perspective, the most directed circuit C in Vout has exactly one vertex intriguing open problem is to ascertain the vC that is a tail of an edge of F and no vertex that complexity of the “re-opened” problem Binary- is a head of an edge in F . Let PC be the longest MEC. It would also be interesting to study the directed path in C that ends in vC . Then the union approximability of Ungapped-MEC. of F and all PC over all directed circuits C in Vout is a collection of paths ending in Vin and covering From a more practical perspective, the next Vout. logical step is to study the complexity of these problems under more restricted classes of input, Finding cycles in a graph and finding a maximum ideally under classes of input that have direct matching in a bipartite graph are both polynomial- biological relevance. It would also be of interest time computable, so the whole process described to study some of these problems in a “weighted” above is polynomial-time computable. context i.e. where the cost of the operation in question (row removal, column removal, error Lemma 9: 1-gap LHR is APX-hard. correction) is some function of (for example) an a priori specified confidence in the correctness of the Proof: Follows from Lemma 7 and Lemma 5. data being changed.

V. ACKNOWLEDGEMENTS IV. CONCLUSION We thank Leen Stougie and Judith Keijsper for This paper involves the complexity (under various many useful conversations during the writing of this different input restrictions) of the haplotyping paper. problems Minimum Error Correction (MEC) and Longest Haplotype Reconstruction (LHR). The REFERENCES state of knowledge about MEC and LHR after this paper is demonstrated in Table I. We also [1] Paola Alimonti, Vigo Kann, Hardness of approximating prob- include Minimum Fragment Removal (MFR) lems on cubic graphs, Proceedings of the Third Italian Confer- ence on Algorithms and Complexity, 288-298 (1997) and Minimum SNP Removal (MSR) in the table [2] Noga Alon, Benny Sudakov, On Two Segmentation Problems, because they are two other well-known Single Journal of Algorithms 33, 173-184 (1999) Individual Haplotyping problems. MSR (MFR) is [3] G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti- Spaccamela, M. Protasi, Complexity and Approximation - the problem of removing the minimum number of Combinatorial optimization problems and their approximability columns (rows) from an SNP-matrix in order to properties, Springer Verlag (1999) USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 15

[4] Vineet Bafna, Sorin Istrail, Giuseppe Lancia, Romeo Rizzi, [22] Romeo Rizzi, Vineet Bafna, Sorin Istrail, Giuseppe Lancia: Polynomial and APX-hard cases of the individual haplotyp- Practical Algorithms and Fixed-Parameter Tractability for the ing problem, Theoretical Computer Science, 335(1), 109-125 Single Individual SNP Haplotyping Problem, 2nd Workshop on (2005) Algorithms in Bioinformatics (WABI 2002) 29-43 [5] Piotr Berman, Marek Karpinski, On Some Tighter Inapprox- imability Results (Extended Abstract), Proceedings of the 26th International Colloquium on Automata, Languages and Pro- gramming, 200-209 (1999) [6] Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, Jing Li, The Haplotyping Problem: An Overview of Computational Models and Solutions, Journal of Computer Science and Tech- nology 18(6), 675-688 (November 2003) [7] P. Drineas, A. Frieze, R. Kannan, S. Vempala, V. Vinay, Clustering in large graphs via Singular Value Decomposition, Journal of Machine Learning 56, 9-33 (2004) [8] Harvey J. Greenberg, William E. Hart, Giuseppe Lancia, Op- portunities for Combinatorial Optimisation in Computational Rudi Cilibrasi received his bachelor’s degree Biology, INFORMS Journal on Computing, 16(3), 211-231 at Caltech in 1996. He spent several years in (2004) industry doing network programming, Linux [9] Bjarni V. Halldorsson, Vineet Bafna, Nathan Edwards, Ross PLACE kernel programming, and a variety of software Lippert, Shibu Yooseph, and Sorin Istrail, A Survey of Com- PHOTO development work until returning to academia putational Methods for Determining Haplotypes, Proceedings HERE with CWI in 2001. He is now nearing comple- of the First RECOMB Satellite on Computational Methods tion of his doctoral work that has been largely for SNPs and Haplotype Inference, Springer Lecture Notes in concerned with robust methods of approximat- Bioinformatics, LNBI 2983, pp. 26-47 (2003) ing bioinformatics and related clustering prob- [10] Hoogeveen, J.A., Schuurman, P., and Woeginger, G.J., Non- lems. He currently maintains CompLearn ( http://complearn.org/ ), an approximability results for scheduling problems with minsum open-source data-mining package that can be used for phylogenetic criteria, INFORMS Journal on Computing, 13(2), 157-168 tree construction. (Spring 2001) [11] Yishan Jiao, Jingyi Xu, Ming Li, On the k-Closest Substring and k-Consensus Pattern Problems, Combinatorial Pattern Match- ing: 15th Annual Symposium (CPM 2004) 130-144 [12] Jon Kleinberg, Christos Papadimitriou, Prabhakar Raghavan, Segmentation Problems, Proceedings of STOC 1998, 473-482 (1998) [13] Jon Kleinberg, Christos Papadimitriou, Prabhakar Raghavan, A Microeconomic View of Data Mining, Data Mining and Knowledge Discovery 2, 311-324 (1998) [14] Jon Kleinberg, Christos Papadimitriou, Prabhakar Raghavan, Segmentation Problems, Journal of the ACM 51(2), 263-280 Leo van Iersel received in 2004 his Master of (March 2004) Note: this paper is somewhat different to the Science degree in Applied Mathematics from 1998 version. the Universiteit Twente in the Netherlands. [15] Giuseppe Lancia, Vineet Bafna, Sorin Istrail, Ross Lippert, and PLACE He is now working as a PhD student at the Russel Schwartz, SNPs Problems, Complexity and Algorithms, PHOTO Technische Universiteit Eindhoven, also in the Proceedings of the 9th Annual European Symposium on Algo- HERE Netherlands. His research is mainly concerned rithms, 182-193 (2001) with the search for combinatorial algorithms for biological problems. [16] Giuseppe Lancia, Maria Christina Pinotti, Romeo Rizzi, Hap- lotyping Populations by Pure Parsimony: Complexity of Exact and Approximation Algorithms, INFORMS Journal on Com- puting, Vol. 16, No.4, 348-359 (Fall 2004) [17] Giuseppe Lancia, Romeo Rizzi, A polynomial solution to a special case of the parsimony haplotyping problem, to appear in Operations Research Letters [18] Rafail Ostrovsky and Yuval Rabani, Polynomial-Time Approx- imation Schemes for Geometric Min-Sum Median Clustering, Journal of the ACM 49(2), 139-156 (March 2002) [19] Alessandro Panconesi and Mauro Sozio, Fast Hare: A Fast Heuristic for Single Individual SNP Haplotype Reconstruction, Steven Kelk received his PhD in Computer Proceedings of 4th Workshop on Algorithms in Bioinformatics Science in 2004 from the University of War- (WABI 2004), LNCS Springer-Verlag, 266-277 wick, in England. He is now working as a post- [20] Personal communication with Christos H. Papadimitriou, June PLACE doc at the Centrum voor Wiskunde en Infor- 2005 PHOTO matica (CWI) in Amsterdam, the Netherlands, [21] C.H. Papadimitriou and M. Yannakakis, Optimization, approx- HERE where he is focussing on the combinatorial imation, and complexity classes, Journal of Computer and aspects of computational biology. System Sciences 43, 425-440 (1991) USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 16

John Tromp received the bachelor’s and PhD degrees in Computer Science from the Univer- sity of Amsterdam in 1989 and 1993 respec- PLACE tively, where he studied with Paul Vit´anyi. He PHOTO then spent two years as a postdoctoral fellow HERE with Ming Li at the University of Waterloo in Canada. In 1996 he returned as a postdoc to the Centre for Mathematics and Computer Science (CWI) in Amsterdam. He spent 2001 working as software developer at Bioinformatics Solutions Inc. in Waterloo, to return once more to CWI, where he currently holds a permanent position. He is the recipient of a Canada International Fellowship. See http://www.cwi.nl/∼tromp/ for more information. USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 17

APPENDIX: INTERREDUCIBILITY OF MEC AND r ∈{0, 1}m. CONSTRUCTIVE-MEC Output: The value ddfn which we define as follows: Lemma 10: MEC and Constructive-MEC are polynomial-time Turing interreducible. (Also: ddfn = min g(r, H1,H2). Binary-MEC and Binary-Constructive-MEC are (H1,H2)∈OptPairs(M) polynomial-time Turing interreducible.) Subroutine: ANCHORED-DFN (“Anchored Dis- tance From Nearest Optimal Haplotype Pair”) Proof: We show interreducibility of MEC Input: An n × m SNP matrix M, a vector r ∈ and Constructive-MEC in such a way that the {0, 1}m, and a haplotype H′ such that (H′,H ) ∈ interreducibility of Binary-MEC with Binary- 2 OptP airs(M) for some H . Constructive-MEC also follows immediately from 2 Output: The value d , defined as: the reduction. This makes the reduction from adfn

Constructive-MEC to MEC quite complicated dadfn = min ′ g(r, H1,H2). because we must thus avoid the use of holes. (H1,H2)∈OptPairs(M,H ) We assume the existence of implementations 1. Reducing MEC to Constructive-MEC is trivial of DFN and ANCHORED-DFN which run because, given an optimal haplotype pair (H ,H ), 1 2 in polynomial-time whenever MEC runs in D (H ,H ) can easily be computed in polynomial- M 1 2 polynomial-time. We use these two subroutines time by summing min(d(H ,r),d(H ,r)) over all 1 2 to reduce Constructive-MEC to MEC and then, rows r of the input matrix M. to complete the proof, demonstrate and prove correcteness of implementations for DFN and 2. Reducing Constructive-MEC to MEC is ANCHORED-DFN. more involved. To prevent a particular special case which could complicate our reduction, we first The general idea of the reduction from check whether every row of M (i.e. the input to Constructive-MEC to MEC is to find some Constructive-MEC) is identical. If this is so, we pair (H ,H ) ∈ OptP airs(M) by first finding H can complete the reduction by simply returning 1 2 1 (using repeated calls to DFN) and then finding (H ,H ) where H is the first row of M. Hence, 1 1 1 H (by using repeated calls to ANCHORED- from this point onwards, we assume that M has at 2 DFN with H specified as the “anchoring” least two distinct rows. 1 haplotype.) Throughout the reduction, the following two observations are important. Both follow Let OptP airs(M) be the set of all unordered immediately from the definition of D - i.e. (II.10). optimal haplotype pairs for M i.e. the set of all (H ,H ) such that D (H ,H ) = MEC(M). 1 2 M 1 2 Observation 3: Let M ∪ M be a partition Given that all rows in M are not identical, 1 2 of rows of the matrix M into two sets. we observe that there are no pairs of the 12 Then, for all H1 and H2, DM (H1,H2) = form (H1,H1) in OptP airs(M). Let D 1 (H ,H )+ D 2 (H ,H ). OptP airs(M,H′) ⊆ OptP airs(M) be those M 1 2 M 1 2 elements (H1,H2) ∈ OptP airs(M) such that ′ ′ Observation 4: Suppose an SNP matrix M1 H1 = H or H2 = H . Let g(r, H1,H2) be defined can be obtained from an SNP matrix M2 by as min(d(r, H1),d(r, H2)). removing 0 or more rows from M2. Then MEC(M ) ≤ MEC(M ). Consider the following two subroutines: 1 2 To begin the reduction, note that, for an Subroutine: DFN (“Distance From Nearest arbitrary haplotype X, DFN(M,X) = 0 iff Optimal Haplotype Pair”) (X,H ) ∈ OptP airs(M) for some haplotype H . Input: An n × m SNP matrix M and a vector 2 2 Our idea is thus that we initialise X to be all-0 and 12 This is because DM (H1,H1) is always larger than DM (H1,r) flip one entry of X at a time (i.e. change a 0 to a 1 for any row r in M that is not equal to H1. or vice-versa) until DFN(M,X)=0; at that point USENGLISHON THE COMPLEXITY OF THE SINGLE INDIVIDUAL SNP HAPLOTYPING PROBLEM 18

X = H1 (for some (H1,H2) ∈ OptP airs(M).) Given that we can guarantee that ANCHORED- More specifically, suppose DFN(M,X)= d where DFN(M,X,H1) can be reduced by at least 1 at 0

13 By the above observation we know that It is not possible that DFN(M, X)= m, because all (H1,H2) ∈ ′ ′ OptPairs(M) are of the form H1 =6 H2, and if H1 =6 H2 we know MEC(M ) = (m + 1)d and OptP airs(M ) = that g(X,H1,H2)

OptP airs(M ′ ∪ {r}) ⊆ OptP airs(M). To OptP airs(M ∪ {H′}). To prove the other direc- see why this is, suppose there existed (H3,H4) tion, suppose there existed some pair (H1,H2) ∈ ′ ′ ′ such that (H3,H4) ∈ OptP airs(M ∪ {r}) but OptP airs(M ∪ {H }) such that H1 =6 H and ′ (H3,H4) 6∈ OptP airs(M). This would mean H2 =6 H . But then, from Observation 3, we would DM (H3,H4) >d where d =MEC(M). Now: have: ′ DM ′∪{r}(H3,H4) ≥ DM ′ (H3,H4) DM∪{H′}(H1,H2)= DM (H1,H2)+ g(H ,H1,H2)

=(m + 1)DM (H3,H4) ≥ DM (H1,H2)+1 ≥ (m + 1)(d + 1). > d.

However, if we take any (H1,H2) ∈ OptP airs(M), Thus, (H1,H2) could not have been we see that: in OptP airs(M ∪ {H′}) in the first place, giving us a contradiction. Thus DM ′∪{r}(H1,H2) ≤ (m + 1)d + g(r, H1,H2) OptP airs(M ∪ {H′}) ⊆ OptP airs(M,H′) and ≤ (m + 1)d + m. hence OptP airs(M ∪{H′})= OptP airs(M,H′), proving the correctness of subroutine ANCHORED- Now, (m +1)d + m< (m + 1)(d +1) so (H ,H ) 3 4 DFN. could not possibly be in OptP airs(M ′ ∪ {r}) - contradiction! The relationship OptP airs(M ′ ∪ {r}) ⊆ OptP airs(M) thus follows. It further follows, from Observation 3, that the members of OptP airs(M ′ ∪{r}) are precisely those pairs (H1,H2) ∈ OptP airs(M) that minimise the expression g(r, H1,H2). The minimal value of g(r, H1,H2) has already been defined as ddfn, so we have:

′ MEC(M ∪{r})=(m + 1)d + ddfn. This proves the correctness of Step 3 of the subroutine.

Subroutine: ANCHORED-DFN (“Anchored Distance From Nearest Optimal Haplotype Pair”) Input: An n × m SNP matrix M, a vector r ∈ {0, 1}m, and a haplotype H′ such that ′ (H ,H2) ∈ OptP airs(M) for some H2. Output: The value dadfn, defined as:

dadfn = min ′ g(r, H1,H2). (H1,H2)∈OptPairs(M,H ) Given that H′ is one half of some optimal haplotype pair for M, it can be shown that ANCHORED- DFN(M,r,H′) = DFN(M ∪{H′},r), thus demon- strating how ANCHORED-DFN can be easily reduced to DFN in polynomial-time. To prove the equation it is sufficient to demonstrate that OptP airs(M ∪{H′})= OptP airs(M,H′), which we do now. Let d =MEC(M). It follows that MEC(M ∪{H′}) ≥ d. In fact, MEC(M ∪{H′})= d ′ ′ because DM∪{H′}(H ,H2) = d for all (H ,H2) ∈ OptP airs(M,H′). Hence OptP airs(M,H′) ⊆