The Computational Complexity of Some Rounding and Survey Overlap Problems

Kirk Pruhs, Computer Science Department University of Pittsburgh, Pittsburgh, PA 15260 KEY WORDS: NP-complete, Zero-restricted Rounding, Unbiased Rounding, Controlled Rounding

1. Introduction tempts to design efficient algorithms for this prob- In this paper we examine the computational lem have been unsuccessful [Cox, HS, Wat]. In §3, complexity of two classes of problems. The first we explain why these attempts were unsuccessful class of problems involves rounding entries in multi- by showing that both the problem of determining way tables subject to certain restrictions. The sec- whether a 3-way table has a zero-restricted round- ond class of problems involves maximizing the over- ing, and the problem of determining whether a 3- lap between several surveys, with stratified design, way table has an unbiased rounding, are N P-hard. on a common population. In particular, we inves- We discuss, in §2, the implications of a problem tigate whether efficient algorithms exist for these being NP-hard. problems. For some of these problems we give ef- An instance of a survey overlap problem con- ficient algorithms. For the other problems, we use sists of several survey sampling problems, with the theory of NP-completeness to show that it is stratified design, on a common population. (We de- highly unlikely that efficient algorithms exist for fine stratification in §2.) It is possible that both the these problems. stratification and the selection probabilities of the The rounding problems that we consider are sampling units are different in each survey. In sur- the zero-restricted rounding problem and the unbi- vey overlap problems we make the assumption that ased rounding problem. Given a table with ratio- the cost of sampling is roughly proportional to the nal entries, the goal in the zero-restricted round- total number of units sampled in of the surveys, ing problem is to replace the entries in the ta- i.e., it is cheaper to sample the same unit twice ble with adjacent integers in such a way that the than it is to sample two distinct units. Minimizing marginals are maintained (this is called a zero- the number of distinct units chosen in the different restricted rounding). Given a table with rational surveys would then minimize the cost. Hence, the entries, the goal in the unbiased rounding problem goal in the survey overlap problem is to solve each is randomly generate a zero-restricted rounding in survey in such a way as to minimize the maximum such a way that the expected value of the rounding number of units in the union of the samples. of each entry is equal to the value of that entry. In section 4, we give an O(n 2) time algorithm For applications of these rounding problems see for instances of the survey overlap problem that [CFGH, Cox, HRF, IK]. Cox [Cox], and Causey, consist of two singularly stratified survey sampling Cox, and Ernst [CCE] have given efficient algo- problems. This algorithm is optimal, in terms of rithms for generating unbiased roundings of 2-way minimizing the total number of elements sampled, tables. Their algorithms run in time O(n 3) on n both in the expected case and in the worst case. by n tables. An algorithm A runs in time O(f(n)), We then show that if we generalize this problem by for some function f, if for every sufficiently large allowing three singularly stratified surveys, or two input, A finishes after at most c. f(n) steps, where doubly stratified surveys, then the survey overlap c is some constant and n is the size of the input. problem becomes N P-hard. We also show that the An unbiased rounding of a 1-way table of size n can survey overlap problem is N P-hard for instances be generated in time O(n). consisting of an arbitrary number of unstratified Causey, Cox, and Ernst [CCE], Hess and surveys. Srikantan [HS], Waterton [Wat], and Cox [Cox] 2. Definitions posed the problem of how to generalize these - We assume familiarity with standard defini- sults to 3-way tables. The first difficulty encoun- tions and concepts from graph theory; see for ex- tered when generalizing these results is that not all ample. An edge 3-coloring of a multigraph is an 3-way tables have zero-restricted roundings [CCE, assignment of one of three colors to each edge that HS]. Given a 3-way table T, one might still hope has the property that each pair of edges incident for an efficient algorithm that determines whether on a common vertex are assigned distinct colors. T has a zero-restricted rounding (or an unbiased A vertex 3-coloring of a multigraph G is an assign- rounding), and if so, generates such a rounding. At- ment of one of three colors to each vertex of G,

747 with the property that each pair of adjacent ver- nomial p(n). The class P is generally regarded as tices are assigned different colors. For notational the class of problems that can be solved by compu- convenience we will represent the three colors by tationally feasible algorithms [GJ]. The class NP the integers 1, 2, and 3. is defined as the the class of problems for which it An m-way array, A- A1 × As × ... Am, is the is possible to, in polynomial time, guess at a po- Cartesian product of m finite sets. Each A/is called tential solution and then verify the correctness of a stratum. An m-way table T is an assignment of a that potential solution. As an example of a prob- positive number, T(al, a2,..., am), to each element lem in NP, consider the edge 3-coloring problem. (al,as,...,am) of A. Each ai is called an index, Given a trivalent multigraph G, the edge 3-coloring and T(al,a2,...,am) is called an entry of T. problem is the problem of determining whether G We denote the floor and ceiling of a number has an edge 3-coloring. It easy to, in polynomial z by, [zJ and Ix], respectively. An integer k is time, guess a color for each edge, and then verify a rounding of rational number z with respect to a that each vertex has exactly one edge incident to rounding base B if ~- [~J, or ~r- [~]" A table it of each color. R is a rounding of T with respect to a rounding There are many problems, such as the edge base B, if R and T have the same underlying array, 3-coloring problem, that are known to be in N P, and each entry of R is a rounding with respect to but are not known to be P. Garey and Johnson's B of the corresponding entry in T. From now on book [GJ] contains a 100 page list of such prob- we assume, without loss of generality, that B - 1, lems arising from such diverse areas as graph the- and that the range of T is [0, 1) [CCE]. A k-way ory, network design, algebra, number theory, math hyperplane, 0 < k < m, of A is obtained fixing programming, and logic. The P =?NP problem is m-k of the indices, and is denoted in the following the problem of determining whether all problems manner. A(al, ., ., a4, .) is the 3-way hyperplane in NP are solvable by polynomial-time algorithms. derived from A by considering only those elements The P =?NP problem is the most important open of A whose the first index is al and whose fourth problem in theoretical computer science, and has index is a4. We call a 1-way hyperplane a line, and been cited as one of the ten most famous open prob- a 2-way hyperplane a plane. For a table T and a lems in mathematics. hyperplane H, we define #T(H) to be ~aeH T(a). An important concept in the study of the A rounding R of T preserves a hyperplane H if P =?NP problem is NP-completeness (or NP- =g=R(g) is a rounding of #T(H). equivalence). A problem 7~ is N P-hard if a R is a zero-restricted rounding of T if R polynomial-time algorithm for 7~ implies that every preserves all hyperplanes. A probabilistic proce- problem in N P has a polynomial-time algorithm. dure generates a fair rounding R of a table T A problem is NP-equivalent if it is in NP and it is if the expected value of each R(al, as,..., am) is NP-hard. Most natural problems, that are known T(al, a2,..., am). A probabilistic procedure gen- to be in NP, but are not known to be in P, are erates an unbiased rounding R of a table T if the NP-equivalent. procedure fairly generates a zero-restricted round- One way to show that a problem 7~ is NP- ing T. hard is to exhibit a polynomial-time reduction from In the m-way stratified sampling problem the some known NP-hard problem C to 7~. A func- population is stratified by m variables that are tion r from instances of C to instances of T' is a hopefully correlated with the measure variable. To polynomial-lime reduclion if: reduce the variance of the measure variable a sam- 1. r(z) can be computed in time polynomial in ple whose distribution among the strata mirrors the the size of x. populations distribution as closely as possible. We 2. Each instance z of C has a solution if and only can formally define the m-way stratified sampling if the instance r(x) of 7~ has a solution. problem in a manner similar to the way that we de- If C is polynomial-time reducible to P, and there is fined unbiased rounding, the only difference being a polynomial-time algorithm for deciding whether that each strata may have more than one sampling an instance of 7~ has a solution, then there is a unit. polynomial-time algorithm for deciding whether an We now explain the tools that we use to pro- instance of C has a solution. Holyer [Hol] showed vide evidence that some problems are computation- that the edge 3-coloring problem is NP-hard. ally difficult. The class P is defined as the class of It is (almost) unanimously believed that P problems solvable in time O(p(n)), for some poly- N P. If this is true, then by definition no N P-hard

748 problem can have a polynomial-time algorithm. At We next show that various relaxations of these the very least the fact that a problem P is N P-hard rounding problems remain NP-hard. One obvious implies that one should expect the task of finding relaxation, that would still be useful in some ap- a polynomial-time algorithm for 7) to be extremely plications, is to require that only planes, and not difficult. lines, be preserved by the rounding procedure. Un- The following variant of edge coloring is N P- fortunately, lemma 3.5 implies that, in the worst hard [Pru]: case, this relaxation does not make the rounding Problem 2.1: problems any easier. INSTANCE: A trivalent multigraph G and a ver- Lemma 3.5" Let T be a table constructed by re- tex 3-coloring f of G. duction 3.1. If a rounding R of T preserves planes, QUESTION: Does G have an edge 3-coloring? then R preserves lines. 3. Rounding To show that the unbiased rounding problem Corollary 3.6" Given an 3-way table T, the prob- and the zero-restricted rounding problems are N P- lem of determining whether T has a rounding that hard for 3-way tables we reduce from problem 2.1. preserves planes is NP-hard. Reduction 3.1" Given a trivalent multigraph G - Corollary 3.7: Given an 3-way table T, the prob- (V, E), and a vertex 3-coloring f of G, we construct lem of determining whether T has a fair rounding a 3-way table T from G as follows. Let stratum Aj, T that preserves planes is N P-hard. j = 1, 2, 3, be defined by: We now show that both zero-restricted round- ing and unbiased rounding remain NP-hard for 3- Aj - {vilv E V,f(v) - j, and i- 1,2,3} U way tables that have the property that one stratum {(v, w) E Elf(v) 5~ J and f(w) ¢ j} is of size at most six. In other words, these round- ing problems remain hard for "flat" tables. We Stratum Aj consists of three copies of each vertex reduce from the edge coloring problem. colored j, and one copy of each edge that has no Reduction 3.8: Given a trivalent multigraph vertex incident to it colored j. We define [vi, wi, el, G - (V, E), we construct a 3-way table T from G as for e = (v,w) and i = 1, 2, 3, to be the unique ele- follows. The strata are A1 - {ejle E E, j- 1,2}, ment of A with indices vi, wi and e. More precisely: A2 - {vilv E V,i - 1,2,3), and A3 - {1,...,6}. For each e - (v,w) E E, and each i - 1,2,3, let Vi, Wi, e) if f(v)= 1 & f(w)= 2; T(el, vi, 1) - 7,1 and let T(e2, wi, 1)- -} (the pair- (Wi, Vi, e) if f(v) = 2 & f(w) = 1; ing of v with el was arbitrary, we could have just (e, Vi, Wi) iff(v) = 2 & f (w) = 3; ?)i, Wi, e] -- as easily paired w with el). Rounding an entry of (e, Wi, Vi) iff(v) = 3 & f (w) = 2; the form T(el, vi, 1) or T(e2, wi, 1), i E {1, 2, 3}, (vi, e, wi) iff(v) = 1 & f (w) = 3; to 1 is equivalent to coloring the edge e the color (Wi , e, Vi) iff(v) = 3 & f(w) = 1. i. We then use the following construction to guar- For each edge e - (v, w), and each i - 1, 2,3, let antee that entries of this form are rounded in the T[vi, wi, e]- -} (rounding this entry to 1 is equiv- same direction. alent to coloring the edge e the color i). Let all Sequentially consider the edges. Assume we other entries in T be 0. | are considering the edge e - (v,w). For each i - 1, 2, 3, find a plane of the form A(.., ., ki), Lemma 3.2: Let T be a table constructed by 2 < ki < 6, for which none of the entries in the rows reduction 3.1 from a multigraph G and a vertex 3- A(., vi, ki) and A(., wi, ki) have been defined yet. coloring f. Then T has a zero-restricted rounding if It is always possible to find such a ki because each and only if G has an edge 3-coloring. Furthermore, of the vertices v and w are adjacent to at most two T has a zero-restricted rounding if and only if T other vertices. Let T(el, vi, ki) - T(e2, wi, ki) - 5,2 has a unbiased rounding. let - k/) - After this construction has been completed for Theorem 3.3" Given a 3-way table T, the prob- each edge, let the unassigned table entries be as- lem of determining whether T has a zero-restricted signed 0. | rounding is NP-hard. Theorem 3.4: Given a 3-way table T, the prob- Lemma 3.9" Let T be a table constructed by lem of determining whether T has an unbiased reduction 3.8 from a multigraph G. Then T has a rounding is NP-hard. zero-restricted rounding if and only if G has an edge

749 3-coloring. Furthermore, T has a zero-restricted We define the overlap between two samples rounding if and only if T has a unbiased rounding. to be the number of units common to both sam- ples. In the 2-overlap problem, minimizing the to- Theorem 3.10" Given a 3-way table T, with the tal number of units sampled is equivalent maxi- size of one stratum being at most six, the prob- mizing the overlap. Each x i E X can appear in lem of determining whether T has a zero-restricted both S and T with probability at most min(pi, qi). rounding is NP-hard. Hence, an upper bound for the expected overlap is d - ~i=1n min(pi, qi). The expected overlap of our Theorem 3.11" Given a 3-way table T, with the algorithm will be d, and the worst case overlap will size of one stratum being at most six, the problem be LdJ. Hence, our algorithm guarantees optimal of determining whether T has a unbiased rounding overlap, both in the average case, and in the worst is NP-hard. case. We end this section by noting that all of the Samples S and T, which satisfy the structural NP-hardness results in this section can easily be constraints, can be viewed as 0/1 solutions to the generalized to tables of dimension greater than following equations: (A unit x i is included in S if three. ~bi- 1, and in T if qi- 1.) 4. Survey Overlap (1) LuJ _< P, < ru-i. The 2-overlap problem is a special case of the (2) ',,J < ,7, < r,v]. survey overlap problem in which the instance con- (3) vc c L#Cl < P, < r#c]. sists of two 1-way stratified sampling problem in- (4) VD e 7) [#DJ s s [#D]. stances. We begin this section with an O(n 2) algo- (5) ~in_m_lmin(pi, qi ) ~_ d. rithm for the 2-overlap problem. Our algorithm is We begin the algorithm by letting each [~i - pi, motivated by Cox's [Cox] algorithm for generating and letting each gti - qi. The initial values of the unbiased roundings of 2-way tables. /hi's and the ~i's then satisfy the above equations. We now define the 2-overlap problem more The first phase of the algorithm transforms some formally. Assume that the population is X - of the/hi's and ~i's to 0/1 probabilities, in such a {Xl,..., Xn}. Let the first 1-way stratified sam- way that (1)-(5) remain valid. This transforma- pling problem instance consist of X, associated se- tion uses what we call a balanced adjustment. Let lection probabilities {Pl,...,Pn}, and a partition P and M be two disjoint subsets of the variables C of X. Similarly, let the second 1-way stratified Pl,..., Pn and ql,..., qn that satisfy the following sampling problem instance consist of X, associated two conditions: selection probabilities {ql,..., qu}, and a partition 1) Each variable in P L9 M is non 0/1. 7) of X. The goal in the 2-overlap problem is to 2) Adding any amount to the value of each vari- probabilistically generate two samples, S and T, able in P (M), and subtracting that same from X in way such a way that the size of the amount from the value of each variable in M maximum possible size of S t.J T is minimized and (P), will not affect the veracity of (1)-(5). the following criteria are satisfied" Let a + - min(1- max(P),min(M)), and a- - Probability Constraints: rain(rain(P), 1 - max(M)). Given such sets, a bal- 1. Each unit xi E X is included in S with prob- anced adjustment consists of randomly executing ability p~. one of the following two assignments. With proba- 2. Each unit xi E X is included in T with prob- bility a++a-a- ' add a + to the value of each variable ability qi. in P, and subtract a + from the value of each vari- Structural Constraints: able in M • With probability ~i++~a "1" , subtract a- 1. The number of units in S is either the floor of, from the value of each variable in P, and add a- or the ceiling of, u - ~inl Pi. to the value of each variable in M. A balanced 2. The number of units in T is either the floor of, adjustment causes at least one i0i, or one qi, to be- or the ceiling of, v- ~inl qi. come 0/1. The fairness constraints are not violated 3. The number of units in S fl'om each C E C because the expected change of the value of each is either the floor of, or the ceiling of, #C = variable is zero. ~xi6C Pi. Throughout the algorithm we maintain a ver- 4. The number of units in T from each D 6 79 tex labeled bipartite multigraph G. The vertices of is either the floor of, or the ceiling of, #D - G are the classes in g, and the classes in 7:). The ~xi6D qi. value of the ihi's and the ~i's determine the edges

750 and the labels. A unit xi is free if exactly one of Let P (M) contain the variables /3i and qi, such /5i and qi is 0/1. A unit zi is bound if neither of that zi that is an odd (even) edge in H. In case 1, /3i or qi is 0/1. For zi E X, with zi E C E C and also add the mate a, and the mate b, if it exists, to zi E D E 79, edges and labels are added to G in M. In case 2, add the mate a, if it exists, to M, the following manner: and add the mate b, if it exists, to P. A good path 1. If zi is bound, then there is an undirected edge step consists of performing a balanced adjustment (C,D). on these sets. 2. If z i is free, with/5i - 0 (1), then a label of 0 The cycle step is applied repeatedly, updating (1) is added to D. In this case, the variable qi G after each step, until G becomes acyclic. We now is called a O-mate (1-mate) of D. assume that G is acyclic. If y and z are vertices 3. If zi is free, with qi - 0 (1), then a label of 0 in the same connected component of G, we denote (1) is added to C. In this case, the variable/3i the unique path between them as P(y, z). A simple is called a O-mate (1-mate) of C. path H is maximal if no other simple path properly Each vertex may be labeled with any number of O's contains H. In the following two theorems we give and l's, or may have no label. We will freely switch sufficient conditions for G to contain a good path. interpretations between edges and units, and be- These theorems can be proved by exhaustively enu- tween vertices and equivalence classes. merating the possibilities. The goal of the first phase of the algorithm Lemma 4.1: If H is a tree in G, with at least three is to simplify G. Each simplification step con- leaves, z, y, and z, then one of P(z, y), P(z, z), or sists of a balanced adjustment. We now define the P(y, z) is a good path. three types of balanced adjustments used in the Lemma 4.2: Let H be a maximal simple path first phase. between vertices x and z in G. If H contains an in- Cycle Step" Let H be a cycle in G, with the ternal labeled vertex y then one of P(z, y), P(y, z), edges in H alternately designated odd and even. or P(z, z) is a good path. Let P (M) be the set consisting of the variables/5/ While G contains trees with at least three and qi, such that xi is an odd (even) edge in H. A leaves, or maximal simple paths with labeled inter- cycle step then consists of performing a balanced nal vertices, the algorithm repeatedly finds a good adjustment on these sets. path, performs a good path step, and updates G. Pair Step" Let a and b be two 0-mates, or two Next, the algorithm repeatedly finds vertices in G 1-mates, of some vertex in G. Let P - {a} and let that have two 0-mates, or two 1-mates, performs a M- {b}. A pair step then consists of performing pair step, and updates G. a balanced adjustment on these sets. G is now of the following simple form. It is the Good Path Step: Let H be a simple path be- disjoint union of simple paths and isolated vertices. tween two vertices y and z in G. Designate the Each vertex has at most one label. No internal edges alternately as odd and even, with an odd vertex on a path can have a label. One consequence edge incident to y. H is then called a good palh if of this is that each equivalence class C E C (D E 7)) it satisfies one following conditions: contains at most two units zi and xj with/5i and/3j 1. H is of odd length, and satisfies one of the (qi and ~j) being non 0/1. At this point no further following conditions: balanced adjustments may be possible, and phase a. y has a 1-mate a and z has a 0-mate b. 1 is finished. Phase 1 requires at most O(n 2) time b. y has a 1-mate a and z is a vertex of degree because each simplification step can be performed 1 with no label. in linear time, and each simplification step makes 2. H is of even length, and satisfies one of the at least one/5i, or one qi, 0/1. following conditions: The second phase of the algorithm is divided a. y has a O-mate a and z has a O-mate b. into two parts. To begin the first part of phase b. y has a l-mate a and z has a l-mate b. two associate with each unit xi E z a probability, c. y has a O-mate a and z is a vertex of degree hi = min(~i, 7ti). To guarantee maximum overlap 1 with no label. we need to include each xi in both samples with d. Both y and z are vertices of degree 1 with probability hi. Note that if either/3i or qi is 0, then no label. hi = 0 and xi will not be included in both samples. Let H be a good path as defined above. If H satis- Similarly, if both/3i and qi are 1, then hi - 1 and fies more than one condition in the above definition, xi will be included in both samples. We call units then pick a condition arbitrarily for the next steps. xi and xj in H a C-pair (79-pair) if they share a

751 common equivalence class in g (79), and both ibi We are currently working on finding algo- and/bj (qi and ~j) are non 0/1. A unit x i E g is rithms for zero-restricted rounding that are efficient a singleton if it does not occur in a pair. We order on "average", ie. efficient on almost all tables. The H so that all pairs x i and xj are consecutive. This computational complexity of the problem of gen- ordering is possible because of the simple form of erating a controlled rounding [Cox] of tables with G. We now use systematic sampling to select the dimension greater than two remains an intriguing units to be included in both samples. open problem. Systematic Sampling: Let gi -- ~j

752