Non-Adaptive Randomized Algorithm for Group Testing

Proceedings of Machine Learning Research 76:1{20, 2017 Algorithmic Learning Theory 2017 Non-Adaptive Randomized Algorithm for Group Testing Nader H. Bshouty [email protected] Dept. of Computer Science Technion, Haifa, 32000 Nuha Diab [email protected] Sisters of Nazareth High School 12th Grade P.O.B. 9422, Haifa, 35661 Shada R. Kawar [email protected] Nazareth Baptist High School 11th Grade P.O.B. 20, Nazareth, 16000 Robert J. Shahla [email protected] Sisters of Nazareth High School 11th Grade P.O.B. 9422, Haifa, 35661 Editors: Steve Hanneke and Lev Reyzin Abstract We study the problem of group testing with a non-adaptive randomized algorithm in the random incidence design (RID) model where each entry in the test is chosen randomly independently from f0; 1g with a fixed probability p. The property that is sufficient and necessary for a unique decoding is the separability of the tests, but unfortunately no linear time algorithm is known for such tests. In order to achieve linear-time decodable tests, the algorithms in the literature use the disjunction property that gives almost optimal number of tests. We define a new property for the tests which we call semi-disjunction property. We show that there is a linear time decoding for such test and for d ! 1 the number of tests converges to the number of tests with the separability property. Our analysis shows that, in the RID model, the number of tests in our algorithm is better than the one with the disjunction property even for small d. Keywords: Group Testing, Non-adaptive Algorithm, Randomized Algorithm, RID Algo- rithm. 1. Introduction As quoted in Wikipedia., \Robert Dorfman's paper in 1943, Dorfman(1943), introduced the field of Group Testing. The motivation arose during the Second World War when the United States Public Health Service and the Selective service embarked upon a large scale c 2017 N.H. Bshouty, N. Diab, S.R. Kawar & R.J. Shahla. Non-Adaptive Randomized Algorithm for Group Testing project. The objective was to weed out all syphilitic men called up for induction. However, syphilis testing back then was expensive and testing every soldier individually would have been very cost heavy and inefficient. A basic breakdown of a test is: Draw sample from a given individual, perform required tests and determine the presence or absence of syphilis. Suppose we have n soldiers. Then this method of testing leads to n tests. Our goal is to achieve effective testing in a scenario where it does not make sense to test 100; 000 people to get (say) 10 positives. The feasibility of a more effective testing scheme hinges on the following property. We can combine blood samples and test a combined sample together to check if at least one soldier has syphilis". Group testing was originally introduced as a potential approach to the above econom- ical mass blood testing, Dorfman(1943). However it has been proven to be applicable in a variety of problems, including quality control in product testing, Sobel and Groll(1959), searching files in storage systems, Kautz and Singleton(1964), sequential screening of exper- imental variables, Li(1962), efficient contention resolution algorithms for multiple-access communication, Kautz and Singleton(1964); Wolf(1985), data compression, Hong and Ladner(2002), and computation in the data stream model, Cormode and Muthukrish- nan(2005). See a brief history and other applications in Cicalese(2013); Du and Hwang (2000, 2006); Hwang(1972); Macula and Popyack(2004); Ngo and Du(1999) and references therein. We now formalize the problem. Let S be the set of the n items (soldiers, clones) and let I ⊆ S be the set of defective items (the sick soldiers, positive clone). Suppose we know that the number of defective items, jIj, is bounded by some integer d.A test or a query (pool or probing) is a set J ⊂ S. The answer to the test is T (I;J) = 1 if I \ J 6= Ø and 0 otherwise. Here J is the set of soldiers for which their blood samples is combined. The problem is to find the defective items (sick soldiers) with a minimum number of tests. We will assume that S = [n] := f1; 2; : : : ; ng and will identify the test J ⊆ S with an assignment J n J J a 2 f0; 1g where ai = 1 if and only if i 2 J, i.e., ai = 1 signifies that item i is in test J. J J J Then the answer to the test a is T (I; a ) = 1 (positive) if there is i 2 I such that ai = 1 and 0 (negative) otherwise. In the adaptive algorithm, the tests can depend on the answers to the previous ones. In the non-adaptive algorithm they are independent of the previous one and; therefore, one can do all the tests in one parallel step. In all the above applications, non-adaptive algorithms are most desirable since minimizing time is of utmost importance. The set of tests in any non-adaptive deterministic algorithm (resp. randomized) can be identified with a (resp. random) m × n test matrix M (pool design) such that its rows are all the assignments aJ of all the tests J in the algorithm. The above problem is also equivalent to the problem of non-adaptive deterministic (resp. randomized) learning the class of boolean disjunction of up to d variables from membership queries Bshouty(2013, 2017). It is known that any non-adaptive deterministic algorithm must do at least Ω(d2 log n= log d) tests D'yachkov and Rykov(1982); Füredi(1996); Porat and Rothschild(2011); Ruszinkó(1994). This is O(d= log d) times more than the number of tests of the known non-adaptive randomized algorithms that do only O(d log n) tests. In this paper we study the case where the number of items n is very large and the cost of the test is \very expensive" and in that case, deterministic algorithms are impractical. Also, algorithms that run in super-linear time are not practical for group testing. So we need our algorithms to do a 2 Non-Adaptive Randomized Algorithm for Group Testing minimum number of tests and run in quasi-linear time, preferably, linear time in the size of all the tests, that is, O(dn log n). Non-adaptive randomized algorithms have been proposed in Balding et al.(1995); Bruno et al.(1995); Du and Hwang(2006); Erdösand Rényi(1963); Hwang(2000); Hwang and Liu(2001). The first folklore non-adaptive randomized algorithm known from the literature simply chooses a random m × n test matrix M, for sufficiently large m, where each entry of M is chosen randomly independently to be 0 with probability p and 1 with probability 1 − p. This type of algorithm is called a random incidence design (RID) and is the simplest type of randomized algorithm. In this paper we will study this type of algorithms. If I ⊆ S = [n] is the set of defective items then the vector of answers to the tests (rows (i) (i) of M) is T (I;M) := _i2I M where _ is bitwise \or" and M is the ith column of M. The above parameters, m and p, are chosen so that the probability of success of recovering the defective items (decoding) is maximal. If the number of defective items is bounded by jIj ≤ d then in order to find the defective items with probability at least 1 − δ, we need that with probability at least 1 − δ, M satisfies the following property: For every J ⊆ [n], jJj ≤ d and J 6= I we have T (J; M) 6= T (I;M). This basically says that the only set J (i) of up to d items that is consistent with the answers of the tests î2I M is I. We call such a property (I; d)-separable property and M is called an (I; d)-separable test matrix. Unfortunately, although this property is what we need to solve the problem, there is no known linear time (or even polynomial time) algorithm that finds the defective items from the answers of an (I; d)-separable test matrix. The only known algorithm is the trivial one that exhaustively goes over all possible up to d sets J and verifies whether T (J; M) is equal to the answers of the tests. This is one of the reasons why the separability property has not been much studied in the literature. Sebö, Sebö(1985), studies the easy case when the number of defective items is exactly d. He shows that the best probability for the random test matrix (i.e., that gives a minimum number of tests) is p = 1−1=21=d. In this paper, we study the general case when the number of defective items is at most d. We give a new analysis for the general case and show that the probability that gives minimum number of tests is p = 1 − 1=d. In particular, we show that mRID 1 e + 1 1 lim ≤ γd−1 + O := ed − + O (1) n!1 ln n d 2 d where mRID is the minimum number of tests in the RID algorithms, e = 2:71828 ··· is the Euler's number and γd is defined in (2). But again, no polynomial time algorithm is known for such matrix even for the special case when the number of the defective items is exactly d. To be able to find the defective items in linear time in the size of all the tests, that is, O(dn log n), and asymptotically optimal O(d log n) tests, algorithms in the literature use a relaxed property for M.

Load more