Efficiently Decodable Non-adaptive Group Testing

Piotr Indyk∗ Hung Q. Ngo† Atri Rudra‡

∗CSAIL, MIT. Email: [email protected]. Supported in part by David and Lucille Packard Fellowship, MADALGO (Center for Massive Data Algorithmics, funded by the Danish National Research Association) and NSF grant CCF-0728645.
†Department of Computer Science and Engineering, University at Buffalo, SUNY. Email: [email protected]. Supported in part by NSF grant CCF-0347565.
‡Department of Computer Science and Engineering, University at Buffalo, SUNY. Email: [email protected]. Supported by NSF CAREER Award CCF-0844796.

Abstract

We consider the following "efficiently decodable" non-adaptive group testing problem. There is an unknown string x ∈ {0,1}^n with at most d ones in it. We are allowed to test any subset S ⊆ [n] of the indices. The answer to the test tells whether x_i = 0 for all i ∈ S or not. The objective is to design as few tests as possible (say, t tests) such that x can be identified as fast as possible (say, in poly(t)-time). Efficiently decodable non-adaptive group testing has applications in many areas, including data streams and data forensics.

A non-adaptive group testing strategy can be represented by a t × n matrix, which is the stacking of all the characteristic vectors of the tests. It is well-known that if this matrix is d-disjunct, then any test outcome corresponds uniquely to an unknown input string. Furthermore, we know how to construct d-disjunct matrices with t = O(d² log n) efficiently. However, these matrices so far only allow for a "decoding" time of O(nt), which can be exponentially larger than poly(t) for relatively small values of d.

This paper presents a randomness efficient construction of d-disjunct matrices with t = O(d² log n) that can be decoded in time poly(d) · t log² t + O(t²). To the best of our knowledge, this is the first result that achieves an efficient decoding time and matches the best known O(d² log n) bound on the number of tests. We also derandomize the construction, which results in a polynomial time deterministic construction of such matrices when d = O(log n / log log n).

A crucial building block in our construction is the notion of (d, ℓ)-list disjunct matrices, which represent the more general "list group testing" problem whose goal is to output fewer than d + ℓ positions in x, including all the (at most d) positions that have a one in them. List disjunct matrices turn out to be interesting objects in their own right and were also considered independently by [Cheraghchi, FCT 2009]. We present connections between list disjunct matrices, expanders, dispersers and disjunct matrices. List disjunct matrices have applications in constructing (d, ℓ)-sparsity separator structures [Ganguly, ISAAC 2008] and in constructing tolerant testers for Reed-Solomon codes in the data stream model.

1 Introduction

The basic group testing problem is to identify the set of "positives" from a large population of "items" using as few "tests" as possible. A test is a subset of items, which returns positive if there is a positive in the subset. The semantics of "positives," "items," and "tests" depend on the application. For example, the topic of group testing started in 1943 when Dorfman studied the problem of testing for syphilis in WWII draftees' blood samples [8]. In this case, items are blood samples, which are positive if they are infected, and a test is a pool of samples. Since then, group testing has found numerous applications (see, e.g., the book [9]). Many applications require the non-adaptive variant of group testing, in which all tests are to be performed at once: the outcome of one test cannot be used to adaptively design another test. Non-adaptive group testing (NAGT) has found applications in DNA library screening [22], multiple access control protocols [3, 31], data pattern mining [21], data forensics [15] and data streams [7], among others.

The main research focus thus far has been on designing NAGT strategies minimizing the number of tests. In some applications, however, the speed of the "decoding" procedure that identifies the positive items is just as important, as we will elaborate later. In this paper, we consider the following "efficiently decodable" NAGT problem. Given integers n > d ≥ 1 and an unknown string x ∈ {0,1}^n with at most d ones in it (x is the characteristic vector of the positives), we are allowed to test any subset S ⊆ [n] of the indices. The answer to a test S tells whether x_i = 0 for all i ∈ S. The objective is to design as few tests as possible (say t tests) such that we can identify the input string x as efficiently as possible (say, in poly(t)-time).

A NAGT strategy can be represented as a t × n matrix M, where each row is the characteristic vector of the subset of [n] to be tested. (The answers to the tests can be thought of as M being multiplied by x, where addition is logical OR and multiplication is logical AND.) A well known sufficient condition for such matrices to represent uniquely decodable NAGT algorithms is disjunctiveness. In particular, a matrix M is said to be d-disjunct if and only if the union of at most d columns cannot contain another column. Here, each column is viewed as a characteristic vector on the set of rows. It is known that t × n d-disjunct matrices can be constructed with t = O(d² log n) [25, 2, 9]. A lower bound of t = Ω((d²/log d) · log n) has also been known for a long time [10, 11, 13].
In terms of decoding disjunct matrices, not much is known beyond the "naive" decoding algorithm: keep removing items belonging to the tests with negative outcomes. The recent survey by Chen and Hwang [5] lists various naive decoding algorithms under different group testing models (with errors, with inhibitors, and other variations). The main reason for the lack of "smart" decoding algorithms is that current decoding algorithms are designed for generic disjunct matrices. Without imposing some structure on the disjunct matrices, fast decoding seems hopeless.

The naive decoding algorithm can easily be implemented in time O(nt). For most traditional applications of group testing, this decoding time is fine. However, in other applications, this running time is prohibitive. This raises a natural question (see e.g., [7, 16]) of whether one can perform the decoding in time that is sub-linear (ideally, polylogarithmic) in n, while keeping the number of tests at the best known O(d² log n).

Our Main Result: In this paper we show that such a decoding algorithm indeed exists. Specifically, we present a randomness efficient construction of d-disjunct matrices with t = O(d² log n) tests that can be decoded in time poly(d) · t log² t + O(t²). In particular, we only need R = O(log t · max(log n, d log t)) many random bits to construct such matrices. Further, given these R bits, any entry in the matrix can be constructed in time poly(t) and space O(log n + log t). We also derandomize the construction, which results in a poly(n) time and poly(t) space deterministic construction of such matrices when d = O(log n / log log n).

To the best of our knowledge, this is the first result that achieves an efficient decoding time and matches the best known O(d² log n) bound on the number of tests. An earlier result due to Guruswami and Indyk gives efficient decoding time but with O(d⁴ log n) tests [16]. A similar result but with O(d² log² n) tests is also implicit in that paper.

1.1 Applications  In this section we outline two scenarios where efficiently decodable disjunct matrices play an important role, and how our results improve on the earlier works.

Computing Heavy Hitters. Cormode and Muthukrishnan [7] consider the following problem of determining "hot items" or "heavy hitters." Given a sequence of m items from [n], an item is considered "hot" if it occurs more than m/(d+1) times. Note that there can be at most d hot items. Cormode and Muthukrishnan proposed an algorithm based on NAGT that computes all the hot items as long as the input satisfies the following "small tail" property: all of the non-hot items occur at most m/(d+1) times.

The algorithm works as follows: let M be a d-disjunct t × n matrix. For each test i ∈ [t], maintain a counter c_i. When an item j ∈ [n] arrives (respectively, leaves), increment (respectively, decrement) all the counters c_i such that M_ij = 1 (i.e., all counters c_i for which test i contains item j). The algorithm also maintains the total number of items m seen so far. At any point in time, the hot items can be computed as follows. Think of the test i corresponding to counter c_i as having a positive outcome if and only if c_i > m/(d+1). Due to the small tail property, a test's outcome is positive if and only if it contains a hot item. Thus, computing the hot items reduces to decoding the result vector.

When Cormode and Muthukrishnan published their result, the only decoder known for d-disjunct matrices was the O(nt)-time naive decoder mentioned earlier. This meant that their algorithm above could not be efficiently implemented. The authors then provided alternate algorithms, inspired by the group testing idea above, and left the task of designing an efficiently decodable group testing matrix as an open problem. Our main result answers this open question. This application also requires that the matrix M be strongly explicit, i.e. any entry in M can be computed in time (and hence, space) poly(t). Our result satisfies this requirement as well.

We would like to point out that the solution to the hot items problem using our result is not as good as the best known results for that problem. For example, the paper [7] gives a solution which has a lower space complexity than what one can achieve with efficiently decodable NAGT. Nevertheless, the above application to finding heavy hitters is illustrative of many other applications of NAGT to data stream algorithms, and we expect further results along these lines.

Digital forensics. Data forensics refers to the following problem in data structures and security. Assume that one needs to store a data structure in a semi-trusted place. An adversary can change up to d values (out of a total of n values) in the data structure (but cannot change the "layout" of the data structure). The goal is to "store" extra information into the layout of the data structure so that if up to d values in the data structure are changed then an auditor can quickly pinpoint the locations where data was changed. Goodrich, Atallah and Tamassia [15] introduced the problem and used group testing to solve it. In particular, using a randomness efficient construction, they present data forensics schemes on balanced binary trees, skip lists and arrays (and linked lists) that can handle O((n/log² n)^{1/3}), O((n/log² n)^{1/3}) and O((n/log n)^{1/4}) many changes respectively. Using our randomness efficient construction, we can improve these bounds to O((n/log n)^{1/2}), O((n/log n)^{1/2}) and O(n^{1/3}) many changes respectively. These latter bounds are the best possible with the techniques of [15]. (For more details see Appendix A.)
1.2 Techniques

Connections to coding theory. Our construction involves concatenated codes. A concatenated binary code has an outer code C_out : [q]^{k1} → [q]^{n1} over a large alphabet of size q = 2^{k2} and a binary inner code C_in : {0,1}^{k2} → {0,1}^{n2}. The encoding of a message in ({0,1}^{k2})^{k1} is natural. First, it is encoded with C_out and then C_in is applied to each of the outer codeword symbols. The concatenated code is denoted by C_out ∘ C_in. (In fact, one can, and we will, use a different inner code C_in^i for every 1 ≤ i ≤ n1.)

Concatenated codes have been used to construct d-disjunct matrices at least since the seminal work of Kautz and Singleton [20]. In particular, they picked C_out to be a maximum distance separable code of rate 1/(d+1) (e.g., a Reed-Solomon code with n1 = q) and the inner code to be the identity code I_q that maps an element i ∈ [q] to the ith unit vector in {0,1}^q. The disjunct matrix M corresponding to C_out ∘ C_in is obtained by simply putting all the n = q^{k1} codewords as columns of the matrix. The resulting matrix M is d-disjunct and has t = O(d² log² n) many rows. Recently, Porat and Rothschild presented a deterministic polynomial (in fact, O(nt)) time construction of d-disjunct matrices with O(d² log n) rows [25]. In their construction, C_out is a code over an alphabet of size Θ(d) that lies on the Gilbert-Varshamov bound. Their inner code is also the identity code as used by Kautz and Singleton. Note that since all these constructions result in d-disjunct matrices, they can be decoded in O(nt) time.

Next, we outline how, if C_out is efficiently list recoverable (and C_in is the identity code), then M can be decoded in poly(t) time. In particular, consider the following natural decoding procedure for an outcome vector r ∈ {0,1}^t. First, think of r = (r_1, ..., r_{n1}) as a vector in ({0,1}^{n2})^{n1}, where each symbol in {0,1}^{n2} lies naturally in one of the n1 positions in an outer codeword. Now consider the following algorithm. For each i ∈ [n1], let L_i ⊆ [q] be the set of positions where r_i has a one. Since there are at most d positives, |L_i| ≤ d for every i ∈ [n1]. Furthermore, if the jth item is positive then the jth codeword (c_1, ..., c_{n1}) ∈ C_out satisfies the following property: for every 1 ≤ i ≤ n1, c_i ∈ L_i. In the Kautz-Singleton construction, it can be checked that the construction of the L_i can be done in poly(t) time. Thus, we would be in business if we can solve the following problem in poly(n1) time: given L_i ⊆ [q] for every 1 ≤ i ≤ n1 with |L_i| ≤ d, output all codewords (c_1, ..., c_{n1}) ∈ C_out such that c_i ∈ L_i for every i ∈ [n1]. This is precisely the ("error-free") list recovery problem. It is known that for Reed-Solomon codes, this problem can be solved in poly(n1) time (cf. [26, Chap. 6]). This observation was made by Guruswami and Indyk [16]. They also presented an efficiently decodable (in fact, O(t)-time decodable) d-disjunct matrix with t = O(d⁴ log n) using an expander based code as C_out. However, neither of the constructions in [16] matches the best known bound of t = O(d² log n).

Overview of our construction. Our randomized construction is based on an ensemble of codes that was considered by Thommesen to construct randomized binary codes that lie on the Gilbert-Varshamov bound [30]. C_out is the Reed-Solomon code but the inner codes are chosen to be independent random binary codes. Unlike [30], where the inner codes are linear, in our case the (non-linear) inner codes are picked as follows: every codeword is chosen to be a random vector in {0,1}^{n2}, where each entry is chosen to be 1 with probability Θ(1/d). We show that this concatenated code with high probability gives rise to a d-disjunct matrix.

The use of random, independently generated codes is crucial for our purpose. This is because the correctness of the aforementioned decoding algorithms crucially uses the fact that C_in is d-disjunct. (In fact, for the identity code, it is n-disjunct.) However, this idea hits a barrier. In particular, as mentioned earlier, d-disjunct matrices need to have Ω((d²/log d) · log n) many rows. This implies that if we are using a d-disjunct matrix as an inner code and a O(1/d)-rate outer Reed-Solomon code then the resulting matrix will have Ω((d³/log d) · log n) rows. Thus, this approach seems to fall short of obtaining the best known bound of O(d² log n).

However, consider the following idea that is motivated by list decoding algorithms for binary concatenated codes (cf. [17]). Let M_{C_in} be the n2 × q matrix obtained by arranging the codewords in C_in as columns. Think of M_{C_in} as representing a NAGT strategy for at most d unknown positives. Suppose M_{C_in} has the following weaker property than d-disjunctiveness. Let y ∈ {0,1}^q be a vector with at most d ones in it, indicating the positives, and let r ∈ {0,1}^{n2} be the test outcomes when applying M_{C_in} to y. Suppose it is the case that any z ∈ {0,1}^q which results in the same test outcome r as y has to satisfy (i) y_i = 1 implies z_i = 1 and (ii) |z| ≤ 2d, where |z| denotes the number of 1's in a binary vector z. (Note that in the d-disjunct case, we need the stronger property that z = y.) Then, if we use such an inner code instead of a d-disjunct matrix, in the two step algorithm mentioned a couple of paragraphs above we will have |L_i| ≤ 2d (instead of |L_i| ≤ d). It turns out that, with these weaker inner codes, one can still perform list recovery on such inputs for Reed-Solomon codes. The only catch is that the two-step algorithm can return up to O(d²) items including all the (at most d) positives. We can thus run the naive decoding algorithm (which will now take O(d²t) time) to recover x exactly. Further, this latter step can be done without making another pass over the input.

Perhaps most importantly, the relaxed notion of disjunct matrices mentioned above can be shown to exist with O(d log n) rows, which paves the way for the final matrix to have O(d² log n) rows as desired. It turns out that these new objects are interesting in their own right, and next we discuss them in more detail.

List Disjunct Matrices. We define (d, ℓ)-list disjunct matrices as follows. A t × n matrix M is called (d, ℓ)-list disjunct if the following holds: for any disjoint subsets S ⊆ [n] and T ⊆ [n] such that |S| ≤ d and |T| ≥ ℓ, there exists a row in which there is at least one 1 among the columns in T while all the columns in S have a 0. It can be shown that, given an outcome vector r ∈ {0,1}^t obtained by using M on an input x ∈ {0,1}^n with |x| ≤ d, running the naive decoder returns a vector y ∈ {0,1}^n such that y contains x and |y| ≤ |x| + ℓ − 1. (Hence the name list disjunct matrix.) In the proof of our main result, we use the fact that the random inner codes with high probability are (d, d)-list disjunct.

These objects were independently considered by Cheraghchi [6]. (In fact, he considered more general versions that can handle errors in the test results.) We study these objects in some detail and obtain the following results. First, we show that (d, ℓ)-list disjunct matrices contain d-disjunct column sub-matrices, which allows us to prove a lower bound of Ω((d/log d) · log(n/ℓ)) on the number of rows for such objects as long as d ≤ O(log n / log log n). (This gives a better lower bound for large ℓ than the bound of d log(n/d) − d − ℓ in [6].) We also show that lossless expanders and dispersers with appropriate parameters imply list disjunct matrices. Using known constructions of expanders and dispersers, we construct (d, d)-list disjunct matrices with a near optimal (d log n)^{1+o(1)} number of rows and (d, ℓ)-list disjunct matrices with O(d log n) rows when ℓ and d are polynomially large in n.

We believe that list disjunct matrices are natural objects and merit study in their own right. To substantiate our belief, we point out three applications of list disjunct matrices. First, as pointed out by Cheraghchi [6], (d, d)-list disjunct matrices can be used to design optimal two-stage group testing algorithms. In the two-stage group testing problem, one is allowed to make tests in two stages (the tests in the second stage can depend on the answers obtained in the first stage). The algorithm works as follows: in the first round, one uses the (d, d)-list disjunct matrix to obtain a subset S ⊆ [n] of size at most 2d that contains all of the defective positions. In the second stage, one can probe each of the columns in S to compute the defective locations. Note that this algorithm takes O(d log n) many tests, which is optimal as there are (n/d)^{O(d)} many possible answers. Second, we show that (d, ℓ)-list disjunct matrices immediately imply (d, d+ℓ)-sparsity separator structures. The latter were used by Ganguly to design a deterministic data stream algorithm for the d-sparsity problem [14]. In fact, Ganguly's construction of a (d, 2d)-sparsity separator uses lossless expanders in pretty much the same way we do. (See Section 6 for more details.) Finally, we point out that the recent work of Rudra and Uurtamo [27] uses (d, d)-list disjunct matrices to design data stream algorithms for the following "tolerant testing" of Reed-Solomon codes. Given the received word, the algorithm needs to decide if it is within Hamming distance d of some codeword or is at distance at least 2d from every codeword. Their algorithm also works with d-disjunct matrices, but then one only gets a sub-linear space data stream algorithm for d = o(√n), whereas using list disjunct matrices allows them to design a sub-linear space data stream algorithm for d = o(n). (See Section 6 for more details.)

1.3 Paper Organization  We begin with some preliminaries in Section 2. In Section 3, we define list disjunct matrices and present upper and lower bounds. Section 4 formally presents the connection between list recoverable codes and efficiently decodable disjunct matrices. We present our main construction of efficiently decodable disjunct matrices equaling the best known number of tests in Section 5. Finally, we present connections between list disjunct matrices and pseudo-random objects and their applications in Section 6.

2 Preliminaries

For an integer ℓ ≥ 1, we use [ℓ] to denote the set {1, ..., ℓ}.
For any t × N matrix M, we will use M_j to refer to its j'th column and M_ij to refer to the i'th entry in M_j. Let F_q denote the finite field of q elements.

The basic problem of non-adaptive combinatorial group testing can be described as follows. Given a population of N "items" which contains at most d "positive" items, we want to identify the positives as quickly as possible using t simultaneous "tests." Each test is a subset of items, which returns "positive" if there is at least one positive item in the subset. We want to "decode" uniquely the set of positives given the results of the t simultaneous tests.

It is natural to model the tests as a t × N binary matrix M where each row represents a test and each column is an item. Set M_ij = 1 if and only if item j belongs to test i. The test outcome vector is a t-dimensional binary vector r, where a 1 indicates a positive outcome for the corresponding test.

For the decoding to be unique, it is sufficient for the test matrix M to satisfy a property called d-disjunctiveness. We say a t × N binary matrix M is d-disjunct if, for every column M_j in M and every subset of d other columns M_{j1}, ..., M_{jd} (i.e. j ∉ {j1, ..., jd}), M_j ⊄ M_{j1} ∪ ··· ∪ M_{jd}. Here, we interpret the columns as (characteristic vectors of) subsets of [t]. The following lemma gives a sufficient condition for disjunct matrices, and can be shown by a simple counting argument [9].

Lemma 2.1. Let M be a binary matrix such that w is the minimum Hamming weight of columns in M and let λ be the size of the largest intersection between any two columns in M. Then M is d-disjunct as long as λd + 1 ≤ w. In particular, M is ⌊(w−1)/λ⌋-disjunct.

When M is d-disjunct, it is well-known that the following naive decoder works (cf. [9]). Let the outcome vector be r. For every i ∈ [t] and j ∈ [N] such that r_i = 0 and M_ij = 1, "eliminate" item j. Once all such indices from [N] are eliminated, we will be left precisely with the set of positives. The naive decoder takes O(tN) time.

A (t, k, d)_q-code C is a subset C ⊆ [q]^t of size q^k such that any two codewords differ in at least d positions. The parameters t, k, d and q are known as the block length, dimension, distance and alphabet size of C. Sometimes we will drop the distance parameter and refer to C as a (t, k)_q code.

Given a (t, k, d)_q code C, let M_C denote the t × q^k matrix whose columns are the codewords in C. We shall construct disjunct matrices by specifying some binary code C and showing that M_C is disjunct. The codes C that we construct will mostly be obtained by "concatenating" an outer code (based on Reed-Solomon codes) with a suitably chosen inner code.

Given an (n1, k1)_{2^{k2}} code C_out and an (n2, k2)_2 code C_in, their concatenation, denoted by C_out ∘ C_in, is defined as follows. Consider a message m ∈ ({0,1}^{k2})^{k1}. Let C_out(m) = (x_1, ..., x_{n1}). Then, C_out ∘ C_in(m) = (C_in(x_1), ..., C_in(x_{n1})). C_out ∘ C_in is an (n2·n1, k1·k2)_2-code.

In general, we can concatenate C_out with n1 different inner codes C_in^1, ..., C_in^{n1}, one per position. Denote this general concatenated code as C_out ∘ (C_in^1, ..., C_in^{n1}). It is defined in the natural way. Given any message m ∈ ({0,1}^{k2})^{k1}, let C_out(m) = (x_1, ..., x_{n1}). Then, C_out ∘ (C_in^1, ..., C_in^{n1})(m) = (C_in^1(x_1), ..., C_in^{n1}(x_{n1})).

Let ℓ, L ≥ 1 be integers and let 0 ≤ α ≤ 1. A q-ary code C of block length n is called (α, ℓ, L)-list recoverable if, for every sequence of subsets S_1, ..., S_n such that |S_i| ≤ ℓ for all i ∈ [n], there are at most L codewords c = (c_1, ..., c_n) for which c_i ∈ S_i for at least αn positions i. A (1, ℓ, L)-list recoverable code will be henceforth referred to as an (ℓ, L)-zero error list recoverable code. We will need the following powerful result due to Parvaresh and Vardy¹:

Theorem 2.1. ([24]) For any given integer s ≥ 1, prime power r, power q of r, and every pair of integers 1 < k ≤ n ≤ q, there exists an explicit F_r-linear map E : F_q^k → F_{q^s}^n satisfying the following:

1. The image C ⊆ F_{q^s}^n of E is a code of minimum distance at least n − k + 1.

2. Provided

(2.1)  α > (s + 1) · (k/n)^{s/(s+1)} · ℓ^{1/(s+1)},

C is an (α, ℓ, O((rs)^s · nℓ/k))-list recoverable code. Further, a list recovery algorithm exists that runs in poly((rs)^s, q, ℓ) time. For s = 1, the (s + 1) factor on the right-hand side of (2.1) can be removed.

In the above, the s'th "order" Parvaresh-Vardy code will be referred to as the PV_s code. PV_1 is the well-known Reed-Solomon code and will be referred to as the RS code.

The simplest possible binary code is probably the "identity code" I_q : [q] → {0,1}^q defined as follows. For any i ∈ [q], I_q(i) is the vector in {0,1}^q that has a 1 in the i'th position and zeros everywhere else. It is easy to see that

Lemma 2.2. M_{I_q} is q-disjunct.
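As an illustration (ours, not from the paper), the Kautz-Singleton style construction and the naive decoder can be sketched as follows, with a toy Reed-Solomon outer code over Z_q (q prime, a stand-in for F_q) concatenated with the identity inner code I_q:

```python
from itertools import product

def kautz_singleton(q, k):
    """Columns indexed by the q**k polynomials of degree < k over Z_q.
    Outer code: evaluate the polynomial at all q points; inner code:
    identity I_q (unit vectors). Returns a (q*q) x (q**k) binary matrix."""
    cols = []
    for coeffs in product(range(q), repeat=k):
        col = [0] * (q * q)
        for i in range(q):                       # outer position i
            sym = sum(c * i**e for e, c in enumerate(coeffs)) % q
            col[i * q + sym] = 1                 # I_q: one 1 per block
        cols.append(col)
    return [[cols[j][i] for j in range(len(cols))] for i in range(q * q)]

def naive_decode(M, r):
    """Eliminate every item that occurs in a negative test."""
    n = len(M[0])
    alive = set(range(n))
    for i, ri in enumerate(r):
        if ri == 0:
            alive -= {j for j in range(n) if M[i][j]}
    return sorted(alive)
```

By Lemma 2.1, with q = 5 and k = 2 the columns have weight w = 5 and pairwise intersections λ ≤ k − 1 = 1, so the matrix is 4-disjunct and the naive decoder recovers up to 4 positives exactly.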

The following result was shown in [25] by concatenating a q-ary code on the GV bound (with q = Θ(d) and relative distance 1 − 1/d) with the identity code I_q.

Theorem 2.2. ([25]) Given n and d ≥ 1, there exists a t × n d-disjunct matrix M_PR such that t = O(d² log n). Further, the matrix can be constructed in time (and space) O(tn), and all the columns of M_PR have the same Hamming weight.

¹This statement of the theorem appears in [18].

3 List Disjunct Matrices and Basic Bounds

We now define a generalization of group testing that is inspired by list decoding of codes and will be useful in our final construction. Let ℓ, d ≥ 1 be integers. We call a t × n boolean matrix M (d, ℓ)-list disjunct if for any two disjoint subsets S and T of [n] where |S| ≤ d and |T| ≥ ℓ, we have ∪_{j∈T} M_j ⊄ ∪_{j∈S} M_j. Note that a d-disjunct matrix is a (d, 1)-list disjunct matrix and vice versa. Without loss of generality, we assume that ℓ ≤ n − 1. When d + ℓ > n, we can replace d by d = n − ℓ. If d + ℓ ≤ n, then it is sufficient to replace the condition in the definition by |S| = d, |T| = ℓ. Thus, we will assume d + ℓ ≤ n henceforth, and only consider |S| = d, |T| = ℓ as the condition in the definition of (d, ℓ)-list disjunct matrices.

We next prove upper and lower bounds for the optimal "rate" of list disjunct matrices. To prove an upper bound for the optimal number of rows of a (d, ℓ)-list disjunct matrix, and to devise an n^{O(d)} time construction for it, the idea is to cast the problem of constructing a (d, ℓ)-list disjunct matrix as a special case of the so-called k-restriction problem defined and solved in Alon, Moshkovitz, and Safra [2]. (The proof appears in Appendix B.)

Lemma 3.1. Given positive integers n ≥ d + 1, let ε_0 = (ℓ/(d+ℓ))^ℓ · (d/(d+ℓ))^d. Then, a t × n (d, ℓ)-list disjunct matrix can be constructed so that t = (2(d+ℓ) ln n + ln C(n, d+ℓ)) / ε_0, in time poly(C(n, d+ℓ), n^{d+ℓ}, 2^{d+ℓ}, 1/ε_0), where C(n, m) denotes the binomial coefficient. In particular, if ℓ = Θ(d), then a t × n (d, ℓ)-list disjunct matrix can be constructed in time n^{O(d)} where t = O(d log n).

Later on, we will see constructions of (d, Θ(d))-list disjunct matrices that can be constructed more time-efficiently than in Lemma 3.1. However, those constructions need more tests (up to logarithmic factors).

Next we show that any (d, ℓ)-list disjunct matrix M contains a large enough d-disjunct sub-matrix. The proof follows by building a hypergraph from M and then observing that an independent set in the hypergraph corresponds to a d-disjunct matrix. Then a result due to Caro and Tuza [4] completes the proof.

Lemma 3.2. Let d, ℓ ≥ 1 and t, n ≥ 1 be integers. Let M be a t × n (d, ℓ)-list disjunct matrix. Then there is a t × m sub-matrix of M that is d-disjunct with m ≥ Ω((n/(dℓ))^{1/d}). In particular, any t × n matrix that is (d, ℓ)-list disjunct (with d ≤ O(log n / log log n) and 1 ≤ ℓ ≤ n^{1−γ} for some constant γ > 0) needs to satisfy t = Ω((d/log d) · log n).

Proof. Consider the following hypergraph H: each of the n columns in M forms a vertex and there is a (d+1)-hyperedge among d + 1 vertices in S ⊂ [n] if there exists a j ∈ S such that the jth column of M is contained in the union of the columns in S \ {j}. Note that an independent set in H of size m corresponds to a column sub-matrix of M that is d-disjunct. For notational convenience define deg(i) to be the degree of i in H.

Thus, we need to show that there exists a large enough independent set in H, for which we use a result of Caro and Tuza [4]. In particular, H has an independent set of size at least (cf. [28, Pg. 2]):

(3.2)  Σ_{i∈[n]} 1/f(i), where f(i) = Θ((1 + deg(i))^{1/d}).

We claim that for at least n/2 vertices i,

(3.3)  deg(i) ≤ 2(d+1) · C(n, d−1) · (ℓ−1).

Since C(n, d−1) ≤ n^{d−1}, (3.3) and (3.2) imply that H has an independent set of size at least Ω((n/(dℓ))^{1/d}), as desired.

Next, we prove (3.3). First, we claim that the number of edges in H is upper bounded by n · C(n, d−1) · (ℓ−1). This implies that the average degree of H is at most (d+1) · C(n, d−1) · (ℓ−1). An averaging argument proves (3.3). Now we upper bound the number of edges in H. To this end, "assign" a vertex i to all its incident edges S ⊆ [n] such that there exists a j ≠ i where the jth column is contained in the union of the columns in S \ {j}. For every T ⊆ [n] with |T| = d and i ∈ T, there exist at most ℓ − 1 many j such that T ∪ {j} is an edge. (This follows from the fact that M is (d, ℓ)-list disjunct.) Thus, any vertex i is assigned to at most C(n, d−1) · (ℓ−1) edges. Since every edge is assigned at least one vertex, the total number of edges is bounded by n · C(n, d−1) · (ℓ−1), as desired.

The lower bound on the number of rows in a (d, ℓ)-list disjunct matrix then follows from the lower bound on the number of rows in a d-disjunct matrix. The bound d ≤ O(log n / log log n) is needed to make sure that the lower bound for list disjunct matrices holds. (The latter, for a d-disjunct matrix with N columns, needs d ≤ Õ(√N).)

Cheraghchi proves a lower bound of d log(n/d) − ℓ for (d, ℓ)-list disjunct matrices [6]. This is better than the bound in Lemma 3.2 for moderately large ℓ (e.g. when ℓ = Θ(d)).
Given k2 k1 Theorem 4.1. Let d, `, L ≥ 1 be integers and 0 < i = (i1, i2) ∈ [n1] × [n2] and j ∈ ({0, 1} ) , we can

α ≤ 1 be a real number. Let Cout be an (n1, k1)2k2 compute Mij in the following natural way. First com- code which can be (α, d + ` − 1,L)-list recovered in pute (Cout(j))i1 in space S1(n1, d, `, L, k1, k2) and then 1 n1 i1  time T1(n1, d, `, L, k1, k2). Let C ,...,C be (n2, k2)2 compute C ((Cout(j))i ) in space S2(n2, d, `, k2). in in in 1 i2 codes such that for at least αn1 values of 1 ≤ i ≤ n1, We might need the extra O(log t + log N) space to store M i is (d, `)-list disjunct and can be (“list”) decoded Cin the indices i and j.  def in time T2(n2, d, `, k2). Suppose the matrix M = By instantiating the outer and inner codes in The- M 1 n1 is d-disjunct. Note that M is a t × N Cout◦(Cin,...,Cin ) orem 4.1 with different specific codes, we obtain sev- k1k2 matrix where t = n1n2 and N = 2 . Further, suppose eral constructions balancing some tradeoffs: there ex- that any arbitrary position in any codeword in Cout and ists disjunct matrices that can be decoded in poly(t) i 2 2 3 Cin can be computed in space S1(n1, d, `, L, k1, k2) and time with O(d log N), O(d log N(log d + log log N)) 3 S2(n2, d, `, k2), respectively. Then, and O(d log N) many tests. Further, any entry in these matrices can be constructed in space poly(log t, log N), (a) given any outcome vector produced by at most d poly(log t, log N) and poly(t, log N) respectively. Last positives, the positive positions can be recovered but not least, we can also design efficiently decodable in time n1T2(n2, d, `, k2) + T1(n1, d, `, L, k1, k2) + list group testing by using the PVs code with s = 1/ O(Lt); and as the outer code. All the proofs are in Appendix C. (b) any entry in M can be com- Lemma 4.1. Let N > d, s ≥ 1 be integers. Then puted in O (log t + log N) + there exists a constant c ≥ 1 and a t × N matrix M cs 1+1/s 1+s  cs 1+1/s O (max{S1(n1, d, `, L, k1, k2),S2(n2, d, `, k2)}) with t = O s d log N that is d, s d - space. list disjunct and can be list decoded in poly(t) time. 
Further, any entry of the matrix can be computed in Proof. The decoding algorithm for M is the natural poly(log t, log N, s) space. one. Let E ⊆ [N] with |E| ≤ d be the set of positive n The result above achieves better bounds than the items. Further let r = (r , . . . , r ) ∈ ({0, 1}n2 ) 1 be 1 n1 efficient decodable constructions of list disjunct matrices the outcome vector. in [6] (though the results in [6] can also handle errors). In the first step of the algorithm, for each i ∈ [n1], run the list decoding algorithm for MCi on outcome in 5 On Constructing Efficiently Decodable (sub-) vector ri. For each i where MCi is (d, `)- in Disjunct Matrices n2 list disjunct we will obtain a set Si ⊆ {0, 1} with This section contains the main result of the paper. We |Si| ≤ d + ` − 1. There are at least αn1 such i. For the first show probabilistically that, given d and N, there indices i where MCi is not (d, `)-list disjunct, let Si be in exists a t × N efficiently decodable d-disjunct matrix d + ` − 1 arbitrary members of {0, 1}n2 . This step takes with t = O(d2 log N), which matches the best known time n T (n , d, `, k ). 1 2 2 2 constructions of disjunct matrices (whether probabilis- In the second step, we run the list recovery algo- tic or deterministic). We then show that our probabilis- rithm for C on S ,...,S . This step will output a out 1 n1 tic construction can be derandomized with low space or subset T ⊆ [N] such that |T | ≤ L. This step takes time low time complexity. T1(n1, d, `, L, k1, k2). Now if we can ensure that E ⊆ T , then we can run the naive algorithm on M restricted to 5.1 Probabilistic Existence of Efficiently De- the columns in T to recover E in time O(Lt). Thus, the codable Disjunct Matrices total running time of the algorithm is as claimed. To complete the argument above, we need to show Theorem 5.1. Let d, k1, k2 be any given positive in- k2 that E ⊆ T . Without loss of generality, assume tegers such that 10dk1 ≤ 2 . 
Define n_1 = 10dk_1, n_2 = 480dk_2, t = n_1 n_2, and N = 2^{k_1 k_2}. Note that n_1 ≤ 2^{k_2} and t = O(d^2 log N). Let C_out be any (n_1, k_1 := n_1/(10d), n_1(1 − 1/(10d)))_{2^{k_2}} code. Then, there exist inner codes C_in^1, ..., C_in^{n_1}, each of which is an (n_2, k_2)_2 code, such that the following hold:

(a) Let C* = C_out ∘ (C_in^1, ..., C_in^{n_1}); then M_{C*} is a t × N matrix that is d-disjunct.

(b) For every i ∈ [n_1], M_{C_in^i} is a (d, d)-list disjunct matrix.

Proof. We will show the existence of the required inner codes by the probabilistic method. In particular, we will refer to each inner code C_in^i by its corresponding matrix M_{C_in^i}. For every 1 ≤ i ≤ n_1, pick M_{C_in^i} to be a random n_2 × 2^{k_2} binary matrix where each entry is chosen independently at random to be 1 with probability 1/(10d). We stress that the random choices for each of the inner codes are independent of each other.

We first bound the probability that condition (a) holds, i.e. that M_{C*} is d-disjunct. For notational convenience, define q = 2^{k_2} and M* = M_{C*}. By Lemma 2.1, M* is d-disjunct if the following two events hold:

(i) Every column has Hamming weight at least t/(20d) + 1.

(ii) Every two distinct columns agree in at most t/(20d^2) positions.

Consider event (i). Any arbitrary column M*_j of M* is simply a random vector in {0,1}^t where every bit is 1 independently with probability 1/(10d). Thus, by the Chernoff bound, the probability that M*_j has Hamming weight at most t/(20d) is at most e^{−t/(120d)}. Taking a union bound over the N choices of j, we conclude that event (i) does not hold with probability at most N e^{−t/(120d)} ≤ N^{−39d}, where the inequality follows from the fact that t = 4800 d^2 log_2 N.

We now turn to event (ii). Pick two distinct columns i ≠ j ∈ [N]. By the fact that C_out has relative distance at least 1 − 1/(10d), the ith and jth codewords in C_out disagree in some set of positions S ⊆ [n_1] with |S| = (1 − 1/(10d)) n_1. Let A_1 ∈ {0,1}^{t(1 − 1/(10d))} and A_2 ∈ {0,1}^{t/(10d)} denote M*_i projected down to S and [n_1] \ S, respectively. Let B_1 and B_2 denote the analogous projections for M*_j. Note that we need to bound the random variable |A_1 ∩ B_1| + |A_2 ∩ B_2|.

We start with |A_1 ∩ B_1|. Note that by the definition of A_1, B_1 and the inner codes, A_1 ∩ B_1 ∈ {0,1}^{t(1 − 1/(10d))}, where each bit is 1 independently with probability 1/(100d^2). Thus, by the Chernoff bound,

Pr[ |A_1 ∩ B_1| ≥ 2t/(100d^2) ] ≤ Pr[ |A_1 ∩ B_1| ≥ (2/(100d^2)) · t(1 − 1/(10d)) ] ≤ e^{−t(1 − 1/(10d))/(300d^2)} ≤ e^{−t/(600d^2)}.

Next, we turn to |A_2 ∩ B_2|. Note that |A_2 ∩ B_2| ≤ |A_2| and that A_2 is a random vector in {0,1}^{t/(10d)} where each bit is 1 independently with probability 1/(10d). Again, by the Chernoff bound, the probability that |A_2| ≥ 2t/(100d^2) is at most e^{−t/(300d^2)} ≤ e^{−t/(600d^2)}.

Thus, by the union bound, with probability at least 1 − 2e^{−t/(600d^2)} = 1 − 2e^{−8k_1k_2} ≥ 1 − 2N^{−11} we have |A_1 ∩ B_1| + |A_2 ∩ B_2| < 4t/(100d^2) < t/(20d^2). Taking the union bound over all C(N, 2) choices of i and j, we conclude that (ii) is violated with probability at most N^{−9}. Overall, condition (a) does not hold with probability at most 1/N^{39d} + 1/N^9 ≤ 2/N^9.

We now bound the probability that condition (b) holds. To this end, we will show that for any 1 ≤ i ≤ n_1, C_in^i is not (d, d)-list disjunct with probability at most q^{−58d}. Then, by the union bound and the fact that n_1 ≤ 2^{k_2} = q, condition (b) does not hold with probability at most q^{−57d}. For the rest of the proof, fix a 1 ≤ i ≤ n_1. For notational convenience, we will denote the matrix M_{C_in^i} by M^i. We note that the n_2 × q matrix M^i is not (d, d)-list disjunct if there exist two disjoint subsets S_1, S_2 ⊆ [q] of size d such that for every j ∈ S_1, M^i_j ⊆ ∪_{ℓ∈S_2} M^i_ℓ. We will upper bound the probability of such an event happening. Note that all the entries in M^i are independent random variables. This implies (along with the union bound) that the probability that M^i is not (d, d)-list disjunct is at most

C(q, d) C(q−d, d) ( 1 − (1 − (1 − 1/(10d))^d)(1 − 1/(10d))^d )^{n_2} ≤ q^{2d} ( 1 − (1 − 1/e^{1/10}) · (9/10) )^{n_2} ≤ q^{2d} 2^{−n_2/8} ≤ q^{−58d},

where the last inequality follows from our choice of n_2 = 480dk_2.

Thus, with probability at least 1 − 2/N^9 − 1/2^{57dk_2}, both of the required properties are satisfied.

We use the theorem above to prove the existence of an efficiently decodable disjunct matrix matching the best upper bound known to date.

Corollary 5.1. Let N > d ≥ 1 be any given integers. There exists a t × N d-disjunct matrix with t = O(d^2 log N) that can be decoded in time poly(d) · t log^2 t + O(t^2).
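For concreteness, the "naive" decoder invoked throughout (on the inner codes, and on M restricted to the columns in T in the proof of Theorem 4.1) simply keeps every column whose support is contained in the set of positive tests; this is the O(Nt)-time baseline that the results above improve upon. A minimal sketch, where the encoding of columns as Python sets of test indices is our own choice and not the paper's:

```python
def naive_decode(columns, outcome):
    """Naive O(N*t)-time decoder for a (list) disjunct test matrix.

    columns[j] is the set of tests (rows) in which item j participates;
    outcome is the set of tests that returned positive.  An item j can
    be positive only if every test containing j is positive, so we keep
    exactly the columns whose support is contained in the outcome.
    """
    return [j for j, col in enumerate(columns) if col <= outcome]

# Toy example: 3 tests, 3 items, M = identity (which is 1-disjunct).
columns = [{0}, {1}, {2}]
positives = {1}                         # item 1 is the only positive
outcome = set().union(*(columns[j] for j in positives))
print(naive_decode(columns, outcome))   # -> [1]
```

For a d-disjunct matrix and at most d positives, the output is exactly the set of positives; for a (d, ℓ)-list disjunct matrix it is a superset of the positives with fewer than d + ℓ items.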

Proof. We ignore the issue of integrality for the sake of clarity. First, suppose N ≥ 100d^2. Set k_2 = log_2(10d log_2 N) ≥ 1 and k_1 = (log_2 N)/k_2 ≥ 1. It follows that 10dk_1 ≤ 2^{k_2}. Theorems 5.1, 4.1 and 2.1 complete the proof. The outer code in Theorem 5.1 is RS with rate 1/(10d), which is known to have relative distance 1 − 1/(10d). The decoding of the outer RS code is done via the algorithm in [1], which runs in poly(d) · t log^2 t time. The naive decoding algorithm is used to decode the inner codes, which takes O(t^2) time.

Next, when d < N < 100d^2, we can arbitrarily remove 100d^2 − N columns from a t × 100d^2 efficiently decodable d-disjunct matrix to obtain a t × N efficiently decodable d-disjunct matrix, which still satisfies t = O(d^2 log N).

5.2 Derandomizing the Proof of Corollary 5.1

Using Nisan's PRG for space-bounded computation [23], we can reduce the amount of randomness in Corollary 5.1.

Corollary 5.2. Let N > d ≥ 1 be any given integers. Then, using O(log t · max(log N, d log t)) random bits, with probability 1 − o(1), one can construct a t × N d-disjunct matrix with t = O(d^2 log N) that can be decoded in time poly(d) · t log^2 t + O(t^2).

Proof. The idea is to apply Nisan's PRG G for space-bounded computations. In particular, Nisan's PRG can fool computations on R input bits with space S ≥ Ω(log R) by using only O(S log R) pure random bits. (This PRG has low space requirements and is strongly explicit, which is useful for data stream algorithms.) The idea is to use G on the computation that checks conditions (a) and (b) in the proof of Theorem 5.1.

Let us start with the checks needed for condition (a). This has two parts. First, we need to check that all the N columns in M* have Hamming weight at least t/(20d) + 1. To do this we need O(log N) bits to keep track of the column index. Then for each column we have to:

• Compute the element in the corresponding RS codeword for each of the n_1 outer positions (we will need O(log n_1) = O(k_2) space to keep an index for these positions). Each such symbol from F_{2^{k_2}} can be computed in space O(k_2) (by the fact that the generator matrix of RS codes is strongly explicit);

• For each symbol in the RS codeword, count the number of ones in the encoding in the corresponding inner code. For this we will need O(log t) space to access an element of the O(t) × O(t) inner code matrix and another O(log t) space to maintain the counter.

Thus, overall we will need O(log N) space. The next part of checking condition (a) requires us to compute the Hamming distance between columns. Using the accounting above, we conclude that this step can also be performed with O(log N) space.

We now account for the amount of space needed to check condition (b). To do this, we need to check that all of the n_1 inner codes are (d, d)-list disjunct. For each of the inner codes, we need to consider all pairs of disjoint subsets of [2^{k_2}] of size d. (One can keep track of all such subsets in space O(dk_2) = O(n_2) = O(d log t). For the latter equality we will need to choose n_1 = Θ(2^{k_2}).) Then, for each pair of such subsets of columns, we need to check whether for some row, all the columns in one subset have a 0 while at least one column in the other subset has a 1. Since there are n_2 rows in each inner code matrix, this check can be done in space O(log n_2 + log d) = O(log n_2). Thus, overall, checking condition (b) needs O(d log t) space.

Thus, overall we are dealing with an S = O(max(log N, d log t))-space computation on R = O(t^2) inputs. However, there is a catch: Nisan's PRG works with R unbiased random bits, while in our proof we are dealing with R (1/(10d))-biased bits. However, it is easy to convert R' = O(R log d) unbiased random bits into R (1/(10d))-biased bits (e.g., by grouping the unbiased bits into chunks of O(log d) and declaring a 1 if and only if all the bits in a chunk are 1). Further, this conversion only needs an additional O(log log d + log R) space. Given this conversion, we can assume that our proof needs unbiased bits. Thus, we can apply G with S' = S + O(log log d + log t) = O(S) and R' = O(t^2 log d) = O(t^3). This implies that we can get away with O(log t · max(log N, d log t)) unbiased random bits.

Corollary 5.2 implies that we can get a deterministic construction algorithm with time complexity max(N^{O(log d + log log N)}, N^{O(d log d / log N)}), by cycling through all possible random bits.
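The unbiased-to-biased conversion used in the proof above (group the unbiased bits into chunks and emit a 1 only when a whole chunk is ones) can be sketched as follows; taking chunk length k = ⌈log_2(10d)⌉ to approximate bias 1/(10d) is our own concrete choice:

```python
def biased_bits(unbiased, k):
    """Convert unbiased bits into bits that are 1 with probability 2**(-k).

    Each chunk of k unbiased bits yields one output bit, which is 1 iff
    all k bits in the chunk are 1 -- the conversion described in the
    proof of Corollary 5.2, using O(log d) unbiased bits per biased bit.
    """
    return [int(all(unbiased[i:i + k]))
            for i in range(0, len(unbiased) - k + 1, k)]

# With k = 2, only the all-ones chunk [1, 1] maps to an output 1.
print(biased_bits([1, 1, 0, 1, 1, 0, 0, 0], 2))  # -> [1, 0, 0, 0]
```

Since each output bit reads its chunk once, the conversion needs only a chunk counter, matching the O(log log d + log R) additional space claimed in the proof.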
Next, we use the method of conditional expectations to get a faster deterministic construction.

Corollary 5.3. Let N > d ≥ 1 be any given integers. There exists a deterministic O(t^5 N^2) + N^{O(d log d / log N)}-time and poly(t)-space algorithm to construct a t × N d-disjunct matrix with t = O(d^2 log N) that can be decoded in time poly(d) · t log^2 t + O(t^2).

Proof. Let us define three sets of random variables. For every j ∈ [N], let X_j be the indicator variable of the event that the jth column of M* has Hamming weight at most t/(20d). Further, for every pair i ≠ j ∈ [N], define Y_{i,j} to be the indicator variable of the event that the ith and the jth columns of M* agree in strictly more than t/(20d^2) positions. Finally, for any i ∈ [n_1] and disjoint subsets S, T ⊆ [q] with |S| = |T| = d, let Z_{i,S,T} denote the indicator variable for the event that, for the ith inner code (whose matrix is denoted by M^i), ∪_{j∈S} M^i_j ⊆ ∪_{j∈T} M^i_j. Define V = Σ X_j + Σ Y_{i,j} + Σ Z_{i,S,T}. Note that in the proof of Theorem 5.1 we have shown that, over the random choices for the inner codes,

E[V] = Σ_{j∈[N]} E[X_j] + Σ_{i≠j∈[N]} E[Y_{i,j}] + Σ_{i=1}^{n_1} Σ_{S⊆[q], |S|=d} Σ_{T⊆[q], |T|=d, T∩S=∅} E[Z_{i,S,T}] < 2/N^9 + 1/2^{57dk_2}.

And, since k_2 was chosen to be log_2(10d log_2 N) in the proof of Corollary 5.1, we have E[V] < 2/N^9 + 1/(10d log_2 N)^{57d}.

To derandomize the proof, we will use the standard method of conditional expectations: we fix each of the random bits one at a time (in an arbitrary order) and assign the bit 0 or 1, whichever minimizes the conditional expectation of V given the corresponding assignment. Recall that the random bits define the n_1 inner n_2 × q matrices, and that each bit is chosen to be one with probability 1/(10d).

We first show how to compute the three kinds of conditional probabilities that this entails; throughout, A denotes a partial assignment to the bits. We begin with the X_(·) indicator variables. Fix j ∈ [N]. First we need to compute the codeword C_out(j). If C_out is a strongly explicit linear code, this can be accomplished with O(t^2) multiplications and additions over F_{2^{k_2}}. Since 2^{k_2} ≤ t, each of these operations can certainly be performed in time O(t). Thus, this entire step takes O(t^3) time. Now, to compute Pr[X_j = 1 | A], we need to substitute the ith (1 ≤ i ≤ n_1) symbol in C_out(j) (over F_{2^{k_2}}) with the corresponding column in M_{C_in^i}. Note that the resulting vector will have some bits fixed according to A, while the rest are independent random bits that take the value 1 with probability 1/(10d). Thus, we are left with computing the "tail" of a binomial distribution on O(t) trials, each with success probability 1/(10d). This can be trivially computed in time O(t^2). Computing Pr[Y_{i,j} = 1 | A] is similar to the computation above (the only difference being that we have to deal with a binomial distribution whose success probability is 1/(100d^2)) and thus also takes O(t^3) time overall.

We finally turn to the Z_(·,·,·) indicator variables. Fix i ∈ [n_1] and disjoint subsets S, T ⊆ [q] with |S| = |T| = d. Note that we want to compute Π_{j∈[n_2]} p_j, where p_j denotes the probability that, if the jth row of M_{C_in^i} has a one in the columns from S, then it also has a one in the columns spanned by T. Note that, given the partial assignment A, p_j is either 0, 1 or 1 − (1 − (1 − 1/(10d))^a)(1 − 1/(10d))^b, where a and b are the numbers of columns in S and T (respectively) where the values in the jth row are not yet fixed by A. It is easy to check that each such p_j can be computed in time O(d^2). Thus, E[Z_{i,S,T} | A] can certainly be computed in time O(t^2). We also note that the algorithm can be implemented in poly(t) space, and that the construction time is polynomial in N as long as d is O(log N / log log N).

In summary, we can apply the method of conditional expectations because we can compute the following conditional probabilities (where A denotes an assignment to an arbitrary subset of the bits in the inner codes):

1. For every j ∈ [N], Pr[X_j = 1 | A];

2. For every i ≠ j ∈ [N], Pr[Y_{i,j} = 1 | A];

3. For every i ∈ [n_1] and disjoint subsets S, T ⊆ [q] with |S| = |T| = d, Pr[Z_{i,S,T} = 1 | A].

As argued above, each of these probabilities can be computed in time O(t^3), assuming C_out is a strongly explicit linear code, e.g. an RS code. (We do not try to optimize the exact polynomial dependence on t here.) This implies that each of the n_1 n_2 q ≤ t^2 conditional expectation values of V can be computed in time O(Nt^3) + O(N^2 t^3) + t^{O(d)}. Thus, the overall running time of the algorithm is O(t^5 N^2) + 2^{O(d(log d + log log N))} = O(t^5 N^2) + N^{O(d log d / log N)}.

6 More on List Disjunct Matrices and their Applications

We begin with connections between list disjunct matrices and dispersers and expanders. (We also present a connection between expanders and disjunct matrices in Appendix D.) Then we present two applications of list disjunct matrices.

6.1 Connection to Dispersers. We now state a simple connection between list disjunct matrices and dispersers. A D-left-regular bipartite graph [N] × [D] → [T] is an (N, L, T, D, ε)-disperser if every subset S ⊂ [N] of size at least L has a neighborhood of size at least ε|T|. Given such a bipartite graph G, consider the T × N incidence matrix M_G of G. We make the following simple observation.

Proposition 6.1. Let G be an (n, ℓ, t, t/(d+1), ε)-disperser. Then M_G is a (d, ℓ)-list disjunct matrix.

Proof. Consider two disjoint subsets of columns S_1 and S_2 with |S_1| ≥ ℓ and |S_2| = d. To prove the claim, we need to show that ∪_{i∈S_1} M_i is not contained in ∪_{i∈S_2} M_i. We prove the latter by a simple counting argument. By the value of the degree of G, we get that

|∪_{i∈S_2} M_i| ≤ dt/(d+1) < t.

On the other hand, as G is a disperser, |∪_{i∈S_1} M_i| ≥ εt, which along with the above inequality implies that ∪_{i∈S_1} M_i is not contained in ∪_{i∈S_2} M_i, as desired.

The best known explicit construction of dispersers is due to Zuckerman [32], who presents explicit (N, N^δ, N^{(1−α)δ}, O(log N / log γ^{−1}), γ)-dispersers. By Proposition 6.1, this implies the following:

Theorem 6.1. Let γ, ε > 0 be any constants. Then there exists an explicit t × N (d, ℓ)-list disjunct matrix with t = O(d log N), d = N^γ and ℓ = d^{1+ε}.

6.2 Connection to Expanders. We now state a simple connection between list disjunct matrices and expanders. A W-left-regular bipartite graph [N] × [W] → [T] is an (N, W, T, D, (1−ε)W)-expander if every subset S ⊂ [N] of size at most D has a neighborhood (denoted by Γ(S)) of size at least (1−ε)|S|W. Given such a bipartite expander G, consider the T × N incidence matrix M_G of G. We have the following simple observation.

Proposition 6.2. Let G be an (n, w, t, 2d, w/2 + 1)-expander. Then M_G is a (d, d)-list disjunct matrix.

Proof. Recall that, by definition, a matrix M is (d, d)-list disjunct if the following is true: for every two disjoint subsets S_1 and S_2 of columns of size exactly d, neither ∪_{i∈S_1} M_i ⊆ ∪_{j∈S_2} M_j nor ∪_{j∈S_2} M_j ⊆ ∪_{i∈S_1} M_i holds. Note that this property for M_G translates to the following for G: Γ_G(S_1) is not contained in Γ_G(S_2), and vice versa. We now argue that the latter is true if G has an expansion of w/2 + 1. Indeed, this follows from the facts that |Γ_G(S_1 ∪ S_2)| ≥ wd + 2d and |Γ_G(S_1)|, |Γ_G(S_2)| ≤ wd.

We remark that in our application we are not really concerned about the value of W and are just interested in minimizing T. (Generally, in expanders one is interested in minimizing both simultaneously.) The probabilistic method shows that there exist (N, W, T, D, W/2 + 1)-expanders with T = O(D log N) (and W = O(log N)). By Proposition 6.2, this implies:

Corollary 6.1. There exist t × N matrices that are (d, d)-list disjunct with t = O(d log N).

Later on in this section, we show that there exists an explicit (N, W, T, D, W/2 + 1)-expander with T = (D log N)^{1+o(1)}. (We thank Chris Umans for the proof.) This, with Proposition 6.2, implies the following:

Corollary 6.2. There is an explicit t × N matrix that is (d, d)-list disjunct with t = (d log N)^{1+o(1)}.

Cheraghchi achieves a slightly better construction with d^{1+o(1)} log n tests.

6.2.1 An Explicit Expander Construction. To construct an explicit expander for our purposes, we will use the following two constructions.

Theorem 6.2. ([19]) Let ε > 0. There exists an explicit (N_1, W_1, T_1, D_1, W_1(1−ε))-expander with T_1 ≤ (4D_1)^{log W_1} and W_1 ≤ 2 log N_1 log D_1 / ε.

Theorem 6.3. ([29]) Let ε > 0 be a constant. Then there exists an explicit (N_2, W_2, T_2, D_2, W_2(1−ε))-expander with T_2 = O(D_2 W_2) and W_2 = 2^{O(log log N_2 + (log log D_2)^3)}.

We will combine the above two expanders using the following well-known technique.

Proposition 6.3. Let G_1 be an (N, W_1, T_1, D, W_1(1−ε))-expander and G_2 be a (T_1, W_2, T_2, DW_1, W_2(1−ε))-expander. Then there exists an (N, W_1W_2, T_2, D, W_1W_2(1−2ε))-expander G. Further, if G_1 and G_2 are explicit then so is G.

Proof. The graph G is constructed by "concatenating" G_1 and G_2. In particular, construct the following intermediate tripartite graph G' (on the vertex sets [N], [T_1] and [T_2], respectively), where one identifies [T_1] once as the right vertex set for G_1 and once as the left vertex set of G_2. The final graph G is the bipartite graph on ([N], [T_2]) where there is an edge if and only if there is a corresponding path of length 2 in G'. It is easy to check that G is an (N, W_1W_2, T_2, D, W_1W_2(1−2ε))-expander.

Next, we prove the following result by combining all the ingredients above.

Theorem 6.4. There exists an explicit (N, W, T, D, W/2 + 1)-expander with T = O(D log N · f(D, N)), where f(D, N) = 2^{O((log log D)^3 + (log log log N)^3)}. Note that f(D, N) = (D log N)^{o(1)}.

Proof. By Theorem 6.2, there exists an explicit (N, W_1, T_1, D, 3W_1/4 + 1)-expander, where W_1 ≤ 16 log N log D and T_1 ≤ (4D)^{4 + log log N + log log D}.

By Theorem 6.3, there exists an explicit (T_1, W_2, T_2, D_2, 3W_2/4 + 1)-expander, where D_2 = DW_1 ≤ 16D log N log D, W_2 is 2^{O(log log T_1 + (log log D_2)^3)}, which in turn is at most 2^{O((log log D)^3 + (log log log N)^3)}, and T_2 is

W_2 D_2 ≤ 16D log N log D · 2^{O((log log D)^3 + (log log log N)^3)} ≤ D log N · f(D, N),

as desired.

Next, we move on to applications of list disjunct matrices.

6.3 Sparsity Separator Structure. Ganguly in [14] presents a deterministic streaming algorithm for d-sparsity testing, which is defined as follows. Given a stream of m items from the domain [n], let f be the vector of frequencies, i.e., for every i ∈ [n], f_i denotes the number of occurrences of i in the stream. The d-sparsity problem is to determine whether f has at most d non-zero entries. [14] presents an algorithm for the special case when f_i ≥ 0 (otherwise the problem is known to require linear space for deterministic algorithms).

A crucial building block of Ganguly's algorithm is what he calls a (d, ℓ)-sparsity separator structure, which is a data structure that can determine whether the frequency vector corresponding to a stream has at most d non-zero entries or at least ℓ non-zero entries. We now show how any (d, ℓ−d)-list disjunct t × n matrix M can be used to build a (d, ℓ)-sparsity separator structure. The idea is almost the same as the use of disjunct matrices in determining hot items [7]: we maintain t counters, one for each test (i.e. {c_j}_{j∈[t]}). Whenever an item i arrives/leaves, we increment/decrement all counters c_j such that the jth test contains i. At the end, we convert the counters to a result vector r ∈ {0,1}^t in the straightforward manner: r_j = 1 if and only if c_j > 0. We decode the result vector r according to M; if the number of ones in the output is at most ℓ − 1, then we declare the sparsity to be at most d, and otherwise we declare the sparsity to be at least ℓ.

We now briefly argue the correctness of the algorithm above. First, define a binary vector x ∈ {0,1}^n such that x_i = 1 if and only if f_i > 0. Since all the frequencies are non-negative, it is easy to see that the result vector r computed above is exactly the same as the one that would result from M acting on x. Now consider the two cases. (a) x has Hamming weight at most d. In this case, as M is (d, ℓ−d)-list disjunct, the decoder for M will output a vector with at most ℓ − 1 ones in it. (b) Now consider the case that x has Hamming weight at least ℓ. In this case, note that each x_i = 1 will contribute a one to all the r_j such that i is contained in test j. Thus, the decoder for M will output a vector with at least ℓ ones in it. This completes the proof of correctness of the algorithm above.

We would like to point out that in the above we assumed the following property of the decoder for M: when provided with an input x of Hamming weight more than d, the decoder will output a vector with Hamming weight at least that of x. It is easy to check that the naive decoding algorithm has this property. However, the decoding time is then no longer sub-linear.

We also remark that Ganguly uses lossless expanders to construct (d, 2d)-sparsity separator structures in exactly the same way we use them to construct (d, d)-list disjunct matrices in Proposition 6.2. However, [14] uses properties of expanders to come up with a more efficient "decoder" than what we have for (d, d)-list disjunct matrices.

6.4 Tolerant Testing of Reed-Solomon Codes. Rudra and Uurtamo in [27] consider the classical codeword testing problem in the data stream domain. In particular, they consider one-pass data stream algorithms for tolerant testing and error detection of Reed-Solomon (RS) codes. Informally, in the former problem, given a received word, the tester has to decide whether it is close to some RS codeword or far from every RS codeword. In the latter problem, one has to decide whether the received word is a codeword or not.

Using a slight modification of the well-known fingerprinting technique, they give a poly-log space, one-pass data stream algorithm to solve the error detection problem for RS codes. Then they reduce the tolerant testing problem to error detection via (list) disjunct matrices. In particular, assume that the tolerant tester wants to distinguish between the case when the received word is at Hamming distance at most d from some RS codeword and the case when it is at Hamming distance at least ℓ from every RS codeword. The reduction is not trivial, so we just sketch the main idea here. (They crucially use the fact that any RS code projected onto a (large enough) subset of positions is also an RS code.)
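The counter-based sparsity separator of Section 6.3 can be sketched as follows; containment decoding stands in for whatever (faster) decoder the list disjunct matrix supports, and the class and method names are ours, not Ganguly's:

```python
class SparsitySeparator:
    """(d, l)-sparsity separator built from a (d, l-d)-list disjunct matrix.

    columns[i] is the set of tests containing item i.  One counter per
    test is maintained; the updates mirror the hot-items use of disjunct
    matrices [7].
    """
    def __init__(self, columns, t, d, l):
        self.columns, self.d, self.l = columns, d, l
        self.counters = [0] * t

    def update(self, item, delta):
        # item arrives (delta = +1) or leaves (delta = -1)
        for j in self.columns[item]:
            self.counters[j] += delta

    def query(self):
        outcome = {j for j, c in enumerate(self.counters) if c > 0}
        # containment decoder: item i survives iff all its tests are positive
        decoded = [i for i, col in enumerate(self.columns) if col <= outcome]
        return "<= d" if len(decoded) <= self.l - 1 else ">= l"

# Toy instance: n = 3 items, t = 3 tests, identity matrix
# (which is (1, 1)-list disjunct), d = 1, l = 2.
sep = SparsitySeparator([{0}, {1}, {2}], t=3, d=1, l=2)
sep.update(1, +1); sep.update(1, +1)    # frequency vector (0, 2, 0)
print(sep.query())                      # -> "<= d"
```

Note that, as in the correctness argument of Section 6.3, the counters only see non-negative frequencies, so the outcome vector equals the one produced by M acting on the support of f.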
Here is a simple idea that does not We now briefly argue the correctness of the algo- quite work. Fix a (d, `−d)-list disjunct matrix M. Then rithm above. First define a binary vector x ∈ {0, 1}n for each test, check if the corresponding projected down [12] G. Even, O. Goldreich, M. Luby, N. Nisan, and received word belongs to the corresponding RS code B. Veli˘ckovi´c. Efficient approximation of product (using the error detection algorithm). Then given the distributions. Random Structures Algorithms, 13(1):1– outcome vector use the decoding algorithm to determine 16, 1998. whether at most d or at least ` errors have occurred [13] Z. F¨uredi.On r-cover-free families. J. Comb. Theory, as in the previous application. The catch is that the Ser. A, 73(1):172–173, 1996. [14] S. Ganguly. Data stream algorithms via expander correspondence is not necessarily one to one in the sense graphs. In 19th International Symposium on Algo- that some of the test answers might have false negatives rithms and Computation (ISAAC), pages 52–63, 2008. (as a projected down error pattern might be a codeword [15] M. T. Goodrich, M. J. Atallah, and R. Tamassia. in the corresponding projected down RS code). In [27], Indexing information for data forensics. In Third this issue is resolved by appealing to the list decoding International Conference on Applied Cryptography and properties of RS codes. Network Security (ANCS), pages 206–221, 2005. [16] V. Guruswami and P. Indyk. Linear-time list decoding Acknowledgments We thank Venkat Guruswami and in error-free settings: (extended abstract). In Proceed- Chris Umans for helpful discussions. Thanks again to ings of of the 31st International Colloquium on Au- Chris Umans for kindly allowing us to use his proof of tomata, Languages and Programming (ICALP), pages Theorem 6.4. 695–707, 2004. [17] V. Guruswami and A. Rudra. Concatenated codes can achieve list-decoding capacity. 
References

[1] M. Alekhnovich. Linear diophantine equations over polynomials and soft decoding of Reed-Solomon codes. IEEE Transactions on Information Theory, 51(7):2257–2265, 2005.
[2] N. Alon, D. Moshkovitz, and S. Safra. Algorithmic construction of sets for k-restrictions. ACM Trans. Algorithms, 2(2):153–177, 2006.
[3] T. Berger, N. Mehravari, D. Towsley, and J. Wolf. Random multiple-access communications and group testing. IEEE Trans. Commun., 32(7):769–779, 1984.
[4] Y. Caro and Z. Tuza. Improved lower bounds on k-independence. J. Graph Theory, 15(1):99–107, 1991.
[5] H.-B. Chen and F. K. Hwang. A survey on nonadaptive group testing algorithms through the angle of decoding. J. Comb. Optim., 15(1):49–59, 2008.
[6] M. Cheraghchi. Noise-resilient group testing: Limitations and constructions. In Proceedings of the 17th International Symposium on Fundamentals of Computation Theory (FCT), 2009. To appear.
[7] G. Cormode and S. Muthukrishnan. What's hot and what's not: tracking most frequent items dynamically. ACM Trans. Database Syst., 30(1):249–278, 2005.
[8] R. Dorfman. The detection of defective members of large populations. The Annals of Mathematical Statistics, 14(4):436–440, 1943.
[9] D.-Z. Du and F. K. Hwang. Combinatorial group testing and its applications, volume 12 of Series on Applied Mathematics. World Scientific Publishing Co. Inc., River Edge, NJ, second edition, 2000.
[10] A. G. D'yachkov and V. V. Rykov. Bounds on the length of disjunctive codes. Problemy Peredachi Informatsii, 18(3):7–13, 1982.
[11] A. G. D'yachkov, V. V. Rykov, and A. M. Rashad. Superimposed distance codes. Problems Control Inform. Theory/Problemy Upravlen. Teor. Inform., 18(4):237–250, 1989.
[…] In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 258–267, 2008.
[18] V. Guruswami and A. Rudra. Soft decoding, dual BCH codes, and better list-decodable ε-biased codes. In Proceedings of the 23rd Annual IEEE Conference on Computational Complexity (CCC), pages 163–174, 2008.
[19] V. Guruswami, C. Umans, and S. P. Vadhan. Unbalanced expanders and randomness extractors from Parvaresh-Vardy codes. In Proceedings of the 22nd Annual IEEE Conference on Computational Complexity (CCC), pages 96–108, 2007.
[20] W. H. Kautz and R. C. Singleton. Nonrandom binary superimposed codes. IEEE Transactions on Information Theory, 10(4):363–377, 1964.
[21] A. J. Macula and L. J. Popyack. A group testing method for finding patterns in data. Discrete Appl. Math., 144(1-2):149–157, 2004.
[22] H. Q. Ngo and D.-Z. Du. A survey on combinatorial group testing algorithms with applications to DNA library screening. In Discrete mathematical problems with medical applications (New Brunswick, NJ, 1999), volume 55 of DIMACS Ser. Discrete Math. Theoret. Comput. Sci., pages 171–182. Amer. Math. Soc., Providence, RI, 2000.
[23] N. Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12(4):449–461, 1992.
[24] F. Parvaresh and A. Vardy. Correcting errors beyond the Guruswami-Sudan radius in polynomial time. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 285–294, 2005.
[25] E. Porat and A. Rothschild. Explicit non-adaptive combinatorial group testing schemes. In Proceedings of the 35th International Colloquium on Automata, Languages and Programming (ICALP), pages 748–759, 2008.
[26] A. Rudra. List Decoding and Property Testing of Error Correcting Codes. PhD thesis, University of Washington, 2007.
[27] A. Rudra and S. Uurtamo. Data stream algorithms for codeword testing, July 2009. Manuscript.
[28] H. Shachnai and A. Srinivasan. Finding large independent sets in graphs and hypergraphs. SIAM Journal on Discrete Mathematics, 18(3):488–500, 2004.
[29] A. Ta-Shma, C. Umans, and D. Zuckerman. Lossless condensers, unbalanced expanders, and extractors. Combinatorica, 27(2):213–240, 2007.
[30] C. Thommesen. The existence of binary linear concatenated codes with Reed-Solomon outer codes which asymptotically meet the Gilbert-Varshamov bound. IEEE Transactions on Information Theory, 29(6):850–853, November 1983.
[31] J. K. Wolf. Born again group testing: multiaccess communications. IEEE Transactions on Information Theory, IT-31(2):185–191, 1985.
[32] D. Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. Theory of Computing, 3(1):103–128, 2007.

A Data Forensics and Group Testing

Goodrich, Atallah and Tamassia [15] present the following application of group testing to data forensics. Data forensics refers to the following problem in data structures and security. Assume that one needs to store a data structure in a semi-trusted place. An adversary can change up to d values (out of a total of n values) in the data structure, but cannot change the "layout" of the data structure. The goal is to "store" extra information in the data structure so that, if up to d values in the data structure are changed, an auditor can quickly pinpoint the locations where data was changed. The authors point out that an auditor generally has very few resources, and thus one needs to store all the book-keeping information in the data structure itself. The authors consider certain data structures and show that extra information can be encoded into the "layout" of the data structure. For example, given n numbers, there are n! possible layouts, and thus one can encode O(n log n) bits of information into the layout. (They also show how to do this information "hiding" in other data structures such as balanced binary search trees and skip lists.)

The question then is what information should be stored in the layout of the data structure so that the auditor can detect the changes. The authors propose the use of a d-disjunct matrix (whose columns correspond to the n items) as follows. For each test in the matrix, concatenate all the values of the items present in the test and store its hash value. This requires O(t) bits of storage. Further, the matrix itself has to be stored. Towards this end, the authors present a randomness efficient construction of a d-disjunct matrix with t = O(d^2 log n) rows that needs O(d^3 log^2 n) random bits. (That is, any entry of the matrix can be computed given these bits and, with high probability over the random choices, the resulting matrix is d-disjunct.) Thus, the overall number of bits to be encoded into the layout of the data structure is O(d^3 log^2 n). Given a possibly altered data structure, the auditor uses the information stored in the layout to recompute, for each test in the matrix, the hash of the current values. If a computed hash value differs from the stored one, the corresponding test is positive. Computing the locations where values were changed now just amounts to decoding the disjunct matrix. Using this framework, the authors prove that for balanced binary trees, skip lists and arrays (as well as linked lists) that store n items, the scheme above allows the auditor to pinpoint up to O((n/log^2 n)^{1/3}), O((n/log^2 n)^{1/3}) and O((n/log n)^{1/4}) changes, respectively. (The authors show that the layouts of these data structures can store Θ(n), Θ(n) and Θ(n log n) bits, respectively. The bounds on d are obtained by equating the space requirements of the algorithms to the space available in the data structures.³)

³For the array/linked list data structures, due to technical reasons, the space requirement of the algorithms has to be increased by a factor of d + 1.

We remark that in [15], the representation of the matrix takes up more space than the space required to store the hash values. Our randomness efficient construction requires O(max(log n(log d + log log n), d(log d + log log n)^2)) = O(d log^2 n) random bits, which is better than the bound in [15]. Using our result in the framework of Goodrich et al., we can improve their results to show that for balanced binary trees, skip lists and arrays (as well as linked lists) that store n items, the auditor can pinpoint up to O(√(n/log n)), O(√(n/log n)) and O(n^{1/3}) changes, respectively. Note that these bounds are the best possible with the techniques of [15]. In addition, since our disjunct matrices are efficiently decodable, the auditor can pinpoint the changes faster than the algorithm in [15]. We remark that the result of Porat and Rothschild [25] would also provide similar improvements in the number of changes that can be tolerated, though the decoding algorithms would be slower than ours.

B Omitted proofs from Section 3

B.1 Proof of Lemma 3.1. We will only need the binary version of the k-restriction problem, which is defined as follows. The input to the problem is an alphabet Σ = {0, 1}, a length n, and a set of m possible demands fi : Σ^k → {0, 1}, 1 ≤ i ≤ m. For every demand fi, there is at least one vector a = (a1, ..., ak) ∈ Σ^k such that fi(a) = 1 (in words, fi "demands" vector a). Thus, every demand fi "demands" a non-empty subset of vectors from Σ^k. A feasible solution to the problem is a subset S ⊆ Σ^n
such that, for any choice of k indices 1 ≤ j1 < j2 < ··· < jk ≤ n and any demand fi, there is some vector v = (v1, ..., vn) ∈ S whose projection onto those k indices satisfies demand fi; namely, fi(vj1, vj2, ..., vjk) = 1. The objective is to find a feasible solution S with minimum cardinality. Alon, Moshkovitz, and Safra gave a couple of algorithmic solutions to the k-restriction problem. In order to describe their results, we need a few more concepts.

Given a distribution D : Σ^n → [0, 1], the density of a k-restriction problem with respect to D is

    ε := min over 1 ≤ j1 < ··· < jk ≤ n and 1 ≤ i ≤ m of Pr_{v←D}[fi(vj1, vj2, ..., vjk) = 1].

Moreover, the support of a distribution P which is k-wise ε-close to D can be enumerated in time poly(m, 1/ε, 2^k). The reader is referred to [2] for the definitions of k-wise efficiently approximable distributions. One of the main results from [2] is the following.

Theorem B.1. (Alon-Moshkovitz-Safra [2]) Fix some k-wise efficiently approximable distribution D on Σ^n. For any k-restriction problem with density ε with respect to D, there is an algorithm that, given an instance of the problem and a constant parameter 0 < δ < 1, obtains a solution S of size at most ⌈(k ln n + ln m)/((1 − δ)ε)⌉ in time poly(m, n^k, 2^k, 1/ε, 1/δ).

A distribution D on Σ^n = {0, 1}^n is called a product distribution if all coordinates are independent Bernoulli variables. (Coordinate i is 1 with probability pi for some fixed pi.)

Theorem B.2. ([12]) Any product distribution on Σ^n is k-wise efficiently approximable.

We are now ready to prove Lemma 3.1. First, we show that constructing (d, ℓ)-list disjunct matrices is a special case of the k-restriction problem. Consider an arbitrary t × n (d, ℓ)-list disjunct matrix M. The matrix satisfies the property that, for any set of d columns C1, ..., Cd and any disjoint set C′1, ..., C′ℓ of ℓ other columns, there exists a row i of M in which C1(i) = ··· = Cd(i) = 0 and C′1(i) + ··· + C′ℓ(i) > 0. Let k = d + ℓ. Define m = (d+ℓ choose d) "demands" fI : Σ^k → {0, 1} as follows: for every d-subset I ⊆ [k], fI(a) = 1 if and only if aj = 0 for all j ∈ I and Σ_{j∉I} aj > 0. We have just set up an instance of the k-restriction problem; a solution S to this instance forms the rows of our t × n matrix.

Next, let D be the product distribution on {0, 1}^n defined by setting each coordinate to 1 with probability p, to be determined. Then D is k-wise efficiently approximable by Theorem B.2, and applying Theorem B.1 with a suitable choice of p yields the claimed bound on the number of rows.

C Omitted proofs from Section 4

Corollary C.1. Given any integers N > d ≥ 1, there exists a t × N d-disjunct matrix M with t = O(d^2 log^2 N) that can be decoded in time poly(t). Furthermore, each entry of M can be computed in space poly(log t, log N).

Proof. The construction is classic [20]: set M = M_{RS∘Iq}, where q = 2^{k2} is some power of 2, Iq is the identity code of order q, and RS is the (q − 1, k1)_q RS code. The numbers k1 and k2 are to be chosen based on N and d. We then apply Theorem 4.1 with Cout = RS and Cin^i = Iq for all i, which shows that M is an efficiently decodable d-disjunct matrix. In the following, we ignore the issue of integrality for the sake of clarity.

Set k2 = log(d log N) and k1 = (log N)/k2. It follows that d(k1 − 1) + 1 ≤ q − 1. Note that every column of M has weight exactly q − 1. Further, as any two codewords in RS agree in at most k1 − 1 positions, it follows from Lemma 2.1 that M is d-disjunct. Each M_{Ci} is (d, 1)-list disjunct (i.e., d-disjunct) and can be decoded in O(dq) time. For the outer code, we apply Theorem 2.1 with α = 1, ℓ = d, r = 2 and s = 1 to show that RS is (1, d, L)-list recoverable. Note that (2.1) is satisfied with this set of parameters because d < (q − 1)/k1. Thus, by Theorem 2.1, Cout is (d, O(qd/k1))-zero error list recoverable in time poly(d, q). Note that N = 2^{k1 k2} = q^{k1}, t = q(q − 1) = O(d^2 log^2 N), and qd/k1 = d^2 log(d log N); thus, the decoding time is as claimed.

The claim on the space constructibility of the matrix follows from the fact that any position in an RS codeword can be computed in space poly(k1, k2) and any bit of any codeword in Iq can be computed in O(k2) space.
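As an illustration of the classic Kautz-Singleton construction used in the proof of Corollary C.1, the following sketch concatenates a Reed-Solomon code with the identity code Iq (each RS symbol s becomes the unit vector e_s) and brute-force checks disjunctness on a toy instance. For simplicity it takes q prime rather than a power of 2 as in the proof; the function names and parameters are ours, chosen only for illustration.

```python
# Toy Kautz-Singleton sketch: M = M_{RS o I_q} with q = 7 (prime, for
# easy field arithmetic) and k = 2, giving N = q^k = 49 columns of
# length t = q(q-1) = 42.
from itertools import combinations, product

q, k = 7, 2              # RS over GF(7); messages = polynomials of degree < k
points = range(1, q)     # q - 1 evaluation points

def rs_codeword(msg):
    # evaluate the message polynomial at each point of GF(q)\{0}
    return [sum(c * pow(x, i, q) for i, c in enumerate(msg)) % q
            for x in points]

def column(msg):
    # identity-code concatenation: replace each symbol s by unit vector e_s
    col = []
    for s in rs_codeword(msg):
        col.extend(1 if j == s else 0 for j in range(q))
    return col

columns = [column(m) for m in product(range(q), repeat=k)]

def is_d_disjunct(cols, d):
    # brute force: no column may be covered by the union of d other columns
    for j, c in enumerate(cols):
        rest = [x for i, x in enumerate(cols) if i != j]
        for others in combinations(rest, d):
            union = [max(bits) for bits in zip(*others)]
            if all(u >= b for u, b in zip(union, c)):
                return False
    return True

# Two RS codewords agree in at most k - 1 = 1 positions, so columns share
# at most one 1; since d(k-1) + 1 <= q - 1 holds for d = 2, M is 2-disjunct.
print(is_d_disjunct(columns, 2))  # -> True
```

The check mirrors the weight/agreement counting of Lemma 2.1: each column has weight q − 1 = 6 while any two columns share at most one 1, so a union of two other columns can cover at most 2 of a column's 6 ones.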

C.1 Proof of Lemma 4.1. Pick q and k such that d is the largest integer smaller than (q/k) · s^s/(s+1)^{s+1}. Let Cout be the PV^s code with block length q and dimension k, and pick the inner code to be C_{q^s}. The required matrix corresponds to the concatenated code Cout ∘ Cin. Note that N = q^k and t = q^{s+1}. This, along with the choice of d, implies the claimed bound on t.

The list decoding of M_{PV^s ∘ C_{q^s}} proceeds along obvious lines: think of the outcome vector as r = (r1, ..., rq) ∈ ({0, 1}^{q^s})^q and decode each ri for C_{q^s}, resulting in a set Si ⊆ [q^s] of size |Si| ≤ d (the latter is because C_{q^s} is d-disjunct). We will use Theorem 2.1 with α = 1, ℓ = d, r = 2 and n = q. By the choice of d, (2.1) is satisfied. Thus, Theorem 2.1 implies that M_{PV^s ∘ C_{q^s}} is (d, L)-list disjunct for L = O(s^{O(s)} qd/k) = O(s^{O(s)} d^{1+1/s}), where the latter equality follows from the choice of parameters. Decoding the inner codes takes time O(q^{s+1}), while decoding the outer code takes time poly(s^{O(s)}, q, d). Thus, by the choice of the parameters, the total decoding time is poly(t) as required.

Finally, the claim on the low space constructibility of M_{PV^s ∘ C_{q^s}} follows from the fact that any position in an arbitrary codeword in PV^s and C_{q^s} can be computed in space poly(s, log q, k) and O(s log q), respectively.

Similarly, by using the (q − 1, k)_q RS code as the outer code and the code from Corollary C.1 as the inner code, we obtain efficient group testing with O(d^3 log N(log d + log log N)) tests.

Corollary C.2. Given any integers N > d ≥ 1, there exists a t × N d-disjunct matrix M with t = O(d^3 log N(log d + log log N)) that can be decoded in time poly(t). Furthermore, each entry of M can be computed in space poly(log t, log N).

Proof. Fix d ≥ 1. Pick q, k and Cout as in the proof of Corollary C.1. Then pick Cin to be the concatenated code C* corresponding to Corollary C.1 with N = Θ(q). Note that every codeword in C* has the same weight (say w) and, for every two codewords in C*, there are at most w/d positions where both of them have the value 1. In other words, every column of M_{RS∘C*} has Hamming weight wq and every two columns agree in at most 2wq/d positions, which by Lemma 2.1 implies that M_{RS∘C*} is d/2-disjunct.

Further, by Corollary C.1, C* is (d, d)-list disjunct and can be "list decoded" in time poly(d, log q). Note that for M_{RS∘C*}, we have t = O(d^3 k log^2 q) and N = q^k.

The following corollary presents efficient group testing with O(d^3 log N) tests; it is better than Corollary C.2 in terms of the number of tests, but worse in construction space complexity for non-constant d.

Corollary C.3. Let d ≥ 1. There exists a t × N d-disjunct matrix M with t = O(d^3 log N) that can be decoded in time poly(t). Further, each entry of M can be computed in space poly(t, log N).

Proof. The idea is to use the RS code as the outer code and the code from Theorem 2.2 as the inner code. Fix d ≥ 1. Pick q and k so that d is the largest integer with d < (q − 1)/k. Let Cout be a (q − 1, k)_q RS code. Pick Cin to be the concatenated code C* corresponding to Theorem 2.2 with N = Θ(q). Every codeword in C* has the same weight (say w) and, for any two codewords in C*, there are at most w/d positions where both have a 1. Then, by the same argument as in the proof of Corollary C.2 and by the choice of q and k, M_{RS∘C*} is d/2-disjunct.

Further, by Theorem 2.2, C* is (d, d)-list disjunct and can be decoded in time O(qd^2 log q). Note that for M_{RS∘C*}, we have t = O(kd^3 log q) and N = q^k.

The claim on the space complexity follows from Theorem 2.2 and the fact that any position in Cout can be computed in space poly(log q, k).

D Disjunct Matrices and Expanders

Here is a connection between expanders and disjunct matrices.

Proposition D.1. Let G be an (n, w, t, 2, w(1 − 1/(3d)))-expander. Then M_G is a d-disjunct matrix.

Proof. First we show that for any two columns in M_G, there are at most 2w/(3d) positions where both columns have a 1. This condition is equivalent to saying that, for any i ≠ j ∈ [n], |Γ_G({i}) ∩ Γ_G({j})| ≤ 2εw, where ε = 1/(3d). This condition is in turn implied by the facts that |Γ_G({i, j})| ≥ 2w(1 − ε) and |Γ_G({j})| = w. Finally, note that every column in M_G has weight w. Lemma 2.1 completes the proof.

The probabilistic method gives the following:

Theorem D.1. There exist (n, w, t, K, w(1 − ε))-expanders with t = O(K log n/ε^2).

Theorem D.1 implies that there exists a d-disjunct matrix with O(d^2 log n) rows. It is known that any d-disjunct matrix needs to have Ω((d^2/log d) · log n) rows [10, 11, 13]. Thus, Proposition D.1 implies the following lower bound:
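To make the neighborhood-intersection counting behind Proposition D.1 concrete, the following sketch builds the column supports of M_G from the neighbor lists of a small left-regular bipartite graph and computes the largest d for which the Lemma 2.1-style argument guarantees d-disjunctness. The toy graph and the function name are ours; this is not an expander produced by Theorem D.1.

```python
# Columns of M_G are the neighborhoods Γ(i) of the left vertices; if every
# column has weight w and any two columns share at most lam ones, then a
# union of d other columns covers at most d*lam ones, so d*lam < w
# guarantees d-disjunctness.
from itertools import combinations

def guaranteed_disjunctness(neighbors):
    """Largest d with d*lam < w, i.e. d = (w - 1) // lam."""
    w = len(neighbors[0])
    assert all(len(nb) == w for nb in neighbors), "graph must be left-regular"
    lam = max(len(set(a) & set(b)) for a, b in combinations(neighbors, 2))
    if lam == 0:                 # disjoint supports: disjunct for every d
        return len(neighbors) - 1
    return (w - 1) // lam

# Γ(i) for each left vertex; right vertices are [6], degree w = 3, and any
# two neighborhoods intersect in at most one right vertex (lam = 1).
neighbors = [(0, 1, 2), (0, 3, 4), (1, 3, 5), (2, 4, 5)]
print(guaranteed_disjunctness(neighbors))  # w = 3, lam = 1 -> d = 2
```

In the setting of Proposition D.1, the expansion bound forces lam ≤ 2w/(3d), so the same counting gives d-disjunctness.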

Corollary D.1. Let G be an (n, w, t, 2, w(1 − ε))-expander. Then t = Ω(log n/(ε^2 log(1/ε))).
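To close, here is a minimal sketch of the auditing framework from Appendix A: a hash is stored per test, recomputed hashes mark the positive tests, and naive decoding of the disjunct matrix pinpoints the changed positions. The matrix, data values, and hash choice are toy assumptions of ours, not the actual construction of [15].

```python
# Audit sketch: store one hash per test; a test is "positive" if its
# recomputed hash differs; an item is reported iff every test containing
# it is positive (the naive O(n*t) decoder used throughout the paper).
import hashlib

def h(values):
    return hashlib.sha256("|".join(map(str, values)).encode()).hexdigest()

M = [[1, 1, 0, 0],   # t x n test matrix; a small 1-disjunct example
     [1, 0, 1, 0],   # (columns have weight 2, pairwise intersection <= 1)
     [0, 1, 0, 1],
     [0, 0, 1, 1]]

def store(data):
    # per-test hashes that the owner would hide in the layout
    return [h([v for v, b in zip(data, row) if b]) for row in M]

def audit(data, stored):
    positive = [h([v for v, b in zip(data, row) if b]) != s
                for row, s in zip(M, stored)]
    return [i for i in range(len(data))
            if all(p for p, row in zip(positive, M) if row[i])]

original = [10, 20, 30, 40]
stored = store(original)
tampered = [10, 20, 99, 40]     # adversary changed one value
print(audit(tampered, stored))  # -> [2]
```

Since the matrix here is only 1-disjunct, exact identification is guaranteed for a single change; the constructions in the paper supply d-disjunct matrices for larger d with far fewer tests.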