Group testing for connected communities

Pavlos Nikolopoulos†, Sundara Rajan Srinivasavaradhan‡, Tao Guo‡, Christina Fragouli‡, Suhas Diggavi‡
†EPFL, Switzerland; ‡University of California Los Angeles, USA

Abstract

In this paper, we propose algorithms that leverage a known community structure to make group testing more efficient. We consider a population organized in disjoint communities: each individual participates in a community, and their infection probability depends on that community. Use cases include families, students who participate in several classes, and workers who share common spaces. Group testing reduces the number of tests needed to identify the infected individuals by pooling diagnostic samples and testing them together. We show that if we design the testing strategy taking into account the community structure, we can significantly reduce the number of tests needed for adaptive and non-adaptive group testing, and can improve the reliability in cases where tests are noisy.

This work was accepted at the Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA.

arXiv:2007.08111v5 [cs.IT] 17 Mar 2021

1 Introduction

Group testing pools together diagnostic samples to reduce the number of tests needed to identify infected members in a population. In particular, if in a population of $n$ members we have a small fraction infected (say $k \ll n$ members), we can identify the infected members using as few as $O(k \log(n/k))$ group tests, as opposed to $n$ individual tests [Du and Hwang, 1993, Aldridge et al., 2019, Kucirka et al., 2020]. Triggered by the need for widespread testing, such techniques are already being explored in the context of Covid-19 [Gollier and Gossner, 2020, Broadfoot, 2020, Ellenberg, 2020, Verdun et al., 2020, Ghosh et al., 2020, Kucirka et al., 2020]. Group testing has a rich history of several decades dating back to R. Dorfman in 1943, and a number of variations and setups have been examined in the literature [Dorfman, 1943, Du and Hwang, 1993, Aldridge et al., 2019, Yaakov Malinovsky, 2016].

The observation we make in this paper is that we can leverage a known community structure to make group testing more efficient. The group-testing work we know of assumes “independent” infections and ignores that an infection may be governed by community spread; we argue that taking into account the community structure can lead to significant savings. As a use case, consider an apartment building consisting of $F$ families that have practiced social distancing; clearly there is a strong correlation on whether members of the same family are infected or not. Assume that the building management would like to test all members to enable access to common facilities. We ask what the most test-efficient way to do so is.

Our approach enlarges the regime where group testing can offer benefits over individual testing. Indeed, a limitation of group testing is that it offers fewer or no benefits when $k$ grows linearly with $n$ [Riccio and Colbourn, 2000, Hu et al., 1981, Ungar, 1960, Aldridge, 2019, Aldridge et al., 2019]. Taking into account the community structure allows us to identify and remove from the population large groups of infected members, thus reducing their proportion and converting a linear-regime identification problem into a sparse-regime one. Essentially, the community structure can guide us on when to use individual testing and when to use group testing.

Our main results are as follows. Assume that $n$ population members are partitioned into $F$ groups that we call families, out of which $k_f$ families have at least one infected member.
• We derive a lower bound on the number of tests, which for some regimes increases (almost) linearly with $k_f$ (the number of infected families) as opposed to $k$ (the number of infected members).
• We propose an adaptive algorithm that achieves the lower bound in some parameter regimes.
• We propose a nonadaptive algorithm that accounts

for the community structure to reduce the number of tests when some false positive errors can be tolerated.
• We propose a new decoder based on loopy belief propagation that is generic enough to accommodate any community structure and can be combined with any test matrix (encoder) to achieve low error rates.
• We numerically show that leveraging the community structure can offer benefits both when the tests used have perfect accuracy and when they are noisy.

We present our models in Section 2, the lower bound in Section 3, our algorithms for the noiseless case in Section 4, and loopy belief propagation (LBP) decoding in Section 5. Numerical results are in Section 6.

Note: The proofs of our theoretical results (in Sections 3–4) are in the Appendix, along with an extended explanation of the rationale behind our algorithms.

2 Background and notation

2.1 Traditional group testing

Our work extends traditional group testing to infection models that are based on community spread. For this reason, we review here known results from prior work.

Traditional group testing typically assumes a population of $n$ members out of which some are infected. Two infection models are considered: (i) in the combinatorial model, a fixed number $k$ of infected members is selected uniformly at random among all sets of size $k$; (ii) in the probabilistic model, each item is infected independently of all others with probability $p$, so that the expected number of infected members is $\bar{k} = np$.

A group test $\tau$ takes as input samples from $n_\tau$ individuals, pools them together and outputs a single value: positive if any one of the samples is infected, and negative if none is infected. More precisely, let $U_i = 1$ when individual $i$ is infected and $0$ otherwise. Then the traditional group testing output $Y_\tau$ takes a binary value calculated as $Y_\tau = \bigvee_{i \in \delta_\tau} U_i$, where $\bigvee$ stands for the OR operator (disjunction) and $\delta_\tau$ is the group of people participating in the test.

The performance of a group testing algorithm is measured by the number of group tests $T = T(n)$ needed to identify the infected members (for the probabilistic model, the expected number of tests needed). Setups that have been explored in the literature include:
• Adaptive vs. non-adaptive testing: In adaptive testing, we use the outcome of previous tests to decide what tests to perform next. An example of adaptive testing is binary splitting, which implements a form of binary search. Non-adaptive testing constructs, in advance, a test matrix $G \in \{0, 1\}^{T \times n}$ where each row corresponds to one test, each column to one member, and the non-zero elements determine the set $\delta_\tau$. Although adaptive testing uses fewer tests than non-adaptive testing, non-adaptive testing is more practical as all tests can be executed in parallel.
• Scaling regimes of operation: assuming $k = \Theta(n^{\alpha})$, we say we operate in the linear regime if $\alpha = 1$; in the sparse regime if $0 \le \alpha < 1$; in the very sparse regime if $k$ is constant.

Known results. The following are well established results (see [Johnson, 2017, Du and Hwang, 1993, Aldridge et al., 2019] and references therein):
• In the combinatorial model, since $T$ tests allow us to distinguish among $2^T$ combinations of test outputs, to identify all $k$ infected members without error we need: $2^T \ge \binom{n}{k} \Leftrightarrow T \ge \log_2 \binom{n}{k}$. This is known as the counting bound [Johnson, 2017, Du and Hwang, 1993, Aldridge et al., 2019] and implies that we cannot use fewer than $T = O(k \log \frac{n}{k})$ tests. In the probabilistic model, a similar bound has been derived for the number of tests needed on average: $T \ge n h_2(p)$, where $h_2$ is the binary entropy function.
• Noiseless adaptive testing can achieve the counting bound for $k = \Theta(n^{\alpha})$ and $\alpha \in [0, 1)$; for non-adaptive testing, this is also true for $\alpha \in [0, 0.409]$, if we allow a vanishing (with $n$) error [Aldridge et al., 2019, Coja-Oghlan et al., 2020, Coja-Oghlan et al., 2020].
• In the linear regime ($\alpha = 1$), group testing offers little benefit over individual testing. In particular, if the infection rate $k/n$ is more than 0.38, group testing does not use fewer tests than 1-by-1 (individual) testing unless high identification-error rates are acceptable [Riccio and Colbourn, 2000, Hu et al., 1981, Ungar, 1960].

2.2 Community and infection models

In this paper, we additionally assume a known community structure: the population can be decomposed into $F$ disjoint groups of individuals that we call families. Each family $j$ has $M_j$ members, so that $n = \sum_{j=1}^{F} M_j$. In the symmetric case, $M_j = M$ for all $j$ and $n = FM$. Note that the term “families” is not limited to real families; we use the same term for any group of people that happen to interact, so that they get infected according to some common infection principle.

We consider the following infection models, which parallel the ones in the traditional setup:
• Combinatorial Model (I). $k_f$ of the families are infected, namely they have at least one infected member. The rest of the families have no infected members. In each infected family $j$, there exist $k_m^j$ infected members, with $0 \le k_m^j \le M_j$. The infected families (resp. infected family members) are chosen uniformly at random out of all families (resp. members of the same family). For our analysis, we sometimes consider only the symmetric case, where $k_m^j = k_m$ for each family $j$.
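As an illustration of the combinatorial model (I) and of the noiseless group test $Y_\tau = \bigvee_{i \in \delta_\tau} U_i$ from Section 2.1, the following is a minimal Python sketch for the symmetric case. The function names `sample_combinatorial_model` and `group_test`, and the representation of a pool as a list of (family, member) pairs, are assumptions made for this example and do not come from the paper.

```python
import random


def sample_combinatorial_model(family_sizes, k_f, k_m, seed=None):
    """Draw infection states under combinatorial model (I), symmetric k_m.

    family_sizes: list with the size M_j of each family.
    k_f: number of infected families (chosen uniformly at random).
    k_m: number of infected members inside every infected family.
    Returns a list of lists, states[j][i] in {0, 1}.
    """
    rng = random.Random(seed)
    num_families = len(family_sizes)
    infected_families = set(rng.sample(range(num_families), k_f))
    states = []
    for j, size in enumerate(family_sizes):
        family_state = [0] * size
        if j in infected_families:
            # Infected members are chosen uniformly within the family.
            for i in rng.sample(range(size), k_m):
                family_state[i] = 1
        states.append(family_state)
    return states


def group_test(states, pool):
    """Noiseless group test: logical OR over the pooled members.

    pool: list of (family index, member index) pairs (hypothetical format).
    """
    return int(any(states[j][i] for j, i in pool))
```

For instance, `sample_combinatorial_model([5] * 200, k_f=10, k_m=4)` draws a population of 200 families of 5 members with 10 infected families, and pooling all members of one family with `group_test` returns 1 exactly when that family is infected.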

• Probabilistic Model (II). A family is infected with probability $q$, i.i.d. across the families. A member of an infected family $j$ is infected, independently of the other members (and other families), with probability $p_j > 0$. If a family $j$ is not infected, then $p_j = 0$. When $k_m^j = p_j M_j$ the two models behave similarly.

Our goal is two-fold: (a) provide new lower bounds for the number of tests $T$ needed to identify all infected members without error; and (b) design community-aware testing algorithms that are more efficient than traditional group-testing ones, in the sense that they can achieve the same identification accuracy using significantly fewer tests and can also perform close to the lower bounds in some cases.

2.3 Noisy testing and error probability

In this work we assume that there is no dilution noise, that is, the performance of a test does not depend on the number of samples pooled together. This is a reasonable assumption with genetic RT-PCR tests where even small amounts of viral nucleotides can be amplified to be detectable [Saiki et al., 1985, Kucirka et al., 2020]. However, we do consider noisy tests in our numerical evaluation (Section 6) using a Z-channel noise model (footnote 1). We remark that this is simply a model one may use; our algorithms are agnostic to this and can be used with any other model.

Footnote 1: In a Z-channel noise model, a test output that should be positive flips and appears as negative with probability $z$, while a test output that is negative cannot flip. Thus: $P(Y_\tau = 1 \mid U_{\delta_\tau}) = \left(\bigvee_{i \in \delta_\tau} U_i\right)(1 - z)$.

Additionally, some of our identification algorithms may return with errors. For this, we use the following terminology: Let $\hat{U}_i$ denote the estimate of the state of $U_i$ after group testing. Zero error captures the requirement that $\hat{U}_i = U_i$ for all $i \in N$. Vanishing error requires that all error probabilities go to zero with $n$. Sometimes we also distinguish between False Negative (FN) and False Positive (FP) errors: FN errors occur when infected members are identified as non-infected (and vice-versa for FP).

2.4 Other related work

The idea of community-aware group testing is explored to some extent in our preprint [Nikolopoulos et al., 2020]. Also, a similar idea of using side-information from contact tracing in decoding is proposed by [Zhu et al., 2020, Goenka et al., 2020], independently of our work. That work is complementary to ours; we focus more on test designs rather than decoding, for which we use well-known algorithms such as COMP and LBP. Finally, test designs, lower bounds and decoding algorithms for independent but not identical priors are investigated by [Li et al., 2014].

The line of work on graph-constrained group testing (see for example [Cheraghchi et al., 2012, Karbasi and Zadimoghaddam, 2012, Luo et al., 2019]) solves the problem of how to design group tests when there are constraints on which samples can be pooled together, provided in the form of a graph; in our case, individuals can be pooled together into tests freely.

3 Lower bound on the number of tests

We compute the minimum number of tests needed to identify all infected members under the zero-error criterion in both community models (I) and (II).

Theorem 1 (Combinatorial community bound). Consider the combinatorial model (I) (of Section 2.2). Any algorithm that identifies all $k$ infected members without error requires a number of tests $T$ satisfying:
$$T \ge \log_2 \binom{F}{k_f} + \sum_{j=1}^{k_f} \log_2 \binom{M_j}{k_m^j}. \tag{1}$$
For the symmetric case: $T \ge \log_2 \binom{F}{k_f} + k_f \log_2 \binom{M}{k_m}$.

Observations: We make two observations regarding the combinatorial community bound, in the case where the number of infected family members follows a “strongly” linear regime ($k_m^j \approx M_j$) and the number of infected families $k_f$ follows a sparse regime (i.e., $k_f = \Theta(F^{\alpha_f})$ for $\alpha_f \in [0, 1)$):

(a) The bound increases almost linearly with $k_f$ (the number of infected families), as opposed to $k$ (the overall number of infected members). This is because, if the infection regime about families is sparse, the following asymptotic equivalence holds: $\log_2 \binom{F}{k_f} \sim k_f \log_2 \frac{F}{k_f} \sim (1 - \alpha_f) k_f \log_2 F$.

(b) If, in addition to the sparse regime about families, an overall sparse regime ($k = \Theta(n^{\alpha})$ for $\alpha \in [0, 1)$) holds, then the community bound may be significantly lower than the counting bound that does not take into account the community structure. Consider, for example, the symmetric case. The asymptotic behavior of the counting bound in the sparse regime is: $\log_2 \binom{n}{k} \sim k \log_2 \frac{n}{k} \sim k_f k_m \log_2 \frac{F}{k_f}$, where the latter is because $k_m \approx M$. So, the ratio of the counting bound to the combinatorial bound scales (as $F$ gets large) as:
$$\frac{\log_2 \binom{n}{k}}{\log_2 \binom{F}{k_f} + k_f \log_2 \binom{M}{k_m}} \sim \frac{k_f k_m \log_2 \frac{F}{k_f}}{k_f \log_2 \frac{F}{k_f}} = k_m. \tag{2}$$
Although simplistic, observation (b) is important for practical reasons. Many times, the population is composed of a large number of families with members that have close contacts (e.g. relatives, work colleagues, students who attend the same classes, etc.). In such cases, we do expect that almost all members of infected families are infected (i.e. $k_m^j \approx M_j$), even though the overall infection regime may still be sparse. Eq. (2) shows the benefits of taking the community structure into account in the test design, in such a case.

Theorem 2 (Probabilistic Community bound). Consider the probabilistic model (II) (of Section 2.2). Any algorithm that identifies all $k$ infected members without error requires a number of tests $T$ satisfying:
$$T \ge F h_2(q) + \sum_{j=1}^{F} \left[ q M_j h_2(p_j) - w_j h_2\!\left(\frac{1 - q}{w_j}\right) \right] \tag{3}$$
where $w_j = 1 - q + q(1 - p_j)^{M_j}$.

Two observations: (a) If for each family $j$, $p_j$ and $M_j$ are such that $q(1 - p_j)^{M_j} \to 0$ (i.e. the probability of the peculiar event, where a family is labeled “infected” and yet has no infected members, is negligible), the combinatorial and probabilistic bounds are asymptotically equivalent. In particular, using the standard estimates of the binomial coefficient [Ash, 1990, Sec. 4.7], the combinatorial bound in (1) is asymptotically equivalent to $F h_2(k_f/F) + \sum_{j=1}^{k_f} M_j h_2(k_m^j/M_j)$, which matches the probabilistic bound in (3): $F h_2(q) + q \sum_{j=1}^{F} M_j h_2(p_j) = F h_2(\bar{k}_f/F) + \sum_{j=1}^{\bar{k}_f} M_j h_2(\bar{k}_m^j/M_j)$, with $k_f = \bar{k}_f + o(1)$ and $k_m^j = \bar{k}_m^j + o(1)$ in place of their expected values $\bar{k}_f = Fq$ and $\bar{k}_m^j$.

(b) Theorem 2 extends from zero-error recovery to constant-probability recovery by applying Fano’s inequality (similarly to Thm 1 of [Li et al., 2014]), and in doing so, the right-hand side of (3) gets multiplied by the desired probability of success $P(\text{suc})$.

4 Algorithms

4.1 Adaptive algorithm

Alg. 1 describes our algorithm for the fully adaptive case, which consists of two parts (the interested reader may find the detailed rationale for our algorithm in Appendix B). In both parts, we make use of a classic adaptive-group-testing algorithm AdaptiveTest(), which is an abstraction for any existing (or future) adaptive group-testing algorithm. We distinguish between 2 different kinds of input for AdaptiveTest(): (a) a set of selected members, which is the typical input of group-testing algorithms; (b) a set of selected mixed samples. A mixed sample is created by pooling together samples from multiple members that usually have some common characteristic. For example, mixed sample $x(r_j)$ denotes an aggregate sample of a set of representative members $r_j$ from family $j$. A mixed sample is “positive” if at least one of the members that compose it is infected, and “negative” otherwise. Because in some cases we only care about mixed samples, we can treat them in the same way as individual samples, and hence use group testing to identify the infection state of mixed samples as we do for individuals.

Algorithm 1 Adaptive Community Testing
$\hat{U}_i$ is the estimated infection status of member $i$.
$\hat{U}_x$ is the estimated infection status of a mixed sample $x$.
SelectRepresentatives() is a function that selects a representative subset from a set of members.
AdaptiveTest() is an adaptive algorithm that tests a set of items (mixed samples or members).
 1: for $j = 1, \ldots, F$ do
 2:     $r_j$ = SelectRepresentatives($\{i : i \in j\}$)
 3: end for
 4: $[\hat{U}_{x(r_1)}, \ldots, \hat{U}_{x(r_F)}]$ = AdaptiveTest($x(r_1), \ldots, x(r_F)$)
 5: Set $A := \emptyset$
 6: for $j = 1, \ldots, F$ do
 7:     if $\hat{U}_{x(r_j)}$ = “positive” then
 8:         Use a noiseless, individual test for each family member: $\hat{U}_i = U_i, \forall i \in j$.
 9:     else
10:         $A := A \cup \{i : i \in j\}$
11:     end if
12: end for
13: $\{\hat{U}_i : i \in A\}$ = AdaptiveTest($A$)
14: return $[\hat{U}_1, \ldots, \hat{U}_n]$

Part 1 (lines 1-4): The goal of this part is to detect the infection regime inside each family $j$, so that the family is tested accordingly in the next part: using group testing if $j$ is “lightly” infected, or individual testing otherwise. Our idea is motivated by the result presented in Section 2.1 that group testing is preferable to individual testing only if the infection rate is low (i.e. $p_j \le 0.38$). Therefore, the challenge is to accurately detect the infection regime spending only a limited number of tests. In this paper, we limited our exploration to using only one mixed sample in this regard, but more sophisticated techniques are also possible, some of which are discussed in Appendix B.2.

First, a representative subset $r_j$ of family-$j$ members is selected using a sampling function SelectRepresentatives() (lines 1-3). Then, a mixed sample $x(r_j)$ is produced for each subset $r_j$, and an adaptive group-testing algorithm is performed on top of all representative mixed samples (line 4). If our choice of AdaptiveTest() offers exact reconstruction

(which is usually the case), then: $\hat{U}_{x(r_j)} = U_{x(r_j)}$.

Part 2 (lines 5-13): We treat $\hat{U}_{x(r_j)}$ as an estimate of the infection regime inside family $j$: if $\hat{U}_{x(r_j)}$ is positive, then we consider the family to be heavily infected (i.e. $k_m^j/M_j$ or $p_j \ge 0.38$), otherwise lightly infected (i.e. $k_m^j/M_j$ or $p_j < 0.38$). Since group testing performs better than individual testing only in the latter case (Section 2.1), we use individual testing for each heavily-infected family (lines 7-8), and adaptive group testing for all lightly-infected ones (line 13).

Analysis for the number of tests. We now compute the maximum expected number of tests needed by our algorithm to detect the infection status of all members without error. For simplicity of notation, we present our results through the symmetric case, where $M_j = M$, $k_m^j = k_m$ (combinatorial case) or $p_j = p$ (probabilistic case), and $|r_j| = R$ for all families.

Let SelectRepresentatives() be a simple function that performs uniform (random) sampling without replacement, and consider 2 choices for the AdaptiveTest() algorithm: (i) Hwang’s generalized binary splitting algorithm (HGBSA) [Hwang, 1972], which is optimal if the number of infected members of the tested group is known in advance; and (ii) the traditional binary-splitting algorithm (BSA) [Sobel and Groll, 1959], which performs well even if little is known about the number of infected members.

Lemma 1 (Expected number of tests - Symmetric combinatorial model). Consider the choices (i) and (ii) for the AdaptiveTest() defined above. Alg. 1 succeeds using a maximum expected number of tests:
$$\bar{T}_{(i)} \le k_f \phi_c \left( \log_2 \frac{F}{k_f \phi_c} + 1 + M \right) + k (1 - \phi_c) \left( \log_2 \frac{n - k_f M \phi_c}{k (1 - \phi_c)} + 1 \right) \tag{4}$$
$$\bar{T}_{(ii)} \le k_f \phi_c \left( \log_2 F + 1 + M \right) + k (1 - \phi_c) \left( \log_2 (n - k_f M \phi_c) + 1 \right), \tag{5}$$
where the inequalities are because of the worst-case performance of HGBSA and BSA, and $\phi_c$ is the expected fraction of infected families whose mixed sample is positive:
$$\phi_c = \begin{cases} 0, & \text{if } R = 0 \\ 1 - \binom{M - k_m}{R} \Big/ \binom{M}{R}, & \text{if } 1 \le R \le M - k_m \\ 1, & \text{if } M - k_m < R \le M. \end{cases}$$

Lemma 2 (Expected number of tests - Symmetric probabilistic model). If Alg. 1 uses BSA in place of AdaptiveTest(), then it succeeds using a maximum expected number of tests:
$$\bar{T} \le F q \phi_p \left( \log_2 F + 1 + M \right) \tag{6}$$
$$\qquad + n q p (1 - \phi_p) \left( \log_2 \left( n (1 - q \phi_p) \right) + 1 \right), \tag{7}$$
where the inequality is due to the worst-case performance of BSA, and $\phi_p = 1 - (1 - p)^R$ is the expected fraction of infected families whose mixed sample is positive.

Lemmas 1 and 2 are derived (in Appendix B) as a repeated application of the performance bounds of HGBSA and BSA: if out of $n$ members, $k$ are infected uniformly at random, then HGBSA (resp. BSA) achieves exact identification using at most $\log_2 \binom{n}{k} + k$ (resp. $k \log_2 n + k$) tests [Aldridge et al., 2019, Baldassini et al., 2013].

Observations: (a) If heavily/lightly infected families are detected without errors in Part 1, our algorithm can asymptotically achieve (up to a constant) the lower combinatorial bound of Theorem 1 in particular community structures. We show this via 2 examples:

First, consider a sparse regime for families (i.e. $k_f = \Theta(F^{\alpha_f})$ for $\alpha_f \in [0, 1)$) and a moderately linear regime within each family (i.e. $k_m/M \approx 0.5$). In this case: $\log_2 \binom{F}{k_f} \sim k_f \log_2 (F/k_f)$, $\log_2 \binom{M}{k_m} \sim M h_2(k_m/M) \sim M$, and the bound in (1) becomes: $k_f (\log_2 (F/k_f) + M)$. If $R$ is chosen such that all infected families (which are also heavily infected as $k_m/M > 0.38$) are detected without errors (e.g. if $R > M - k_m$), then $\phi_c = 1$; thus, the RHS of (4) becomes almost equal (up to constant $k_f$) to the lower bound (1).

Second, consider the opposite example, where the infection rate across families is very high, while each separate family is lightly infected. In this case, $k = k_f k_m \approx k_f$; therefore, the lower bound becomes: $T \sim k_f \log_2 (F/k_f) + k_f k_m \log_2 (M/k_m) \approx k \log_2 (n/k)$. If $R$ is chosen such that none of the (lightly infected) families is marked as heavily infected in Part 1 (e.g. if $R = 0$, which reduces to using traditional community-agnostic group testing), then $\phi_c = 0$, and the RHS of (4) is almost equal (up to $k$) to the bound in (1).

(b) The upper bound in (5) shows that our algorithm achieves significant benefits compared to classic BSA when the infected families are heavily infected and $R$ is chosen such that $\phi_c = 1$ (e.g. $R > M - k_m$); this is because $\bar{T}_{(ii)} \le k_f (\log_2 F + 1 + M) \ll k \log_2 n + k$. Also, it achieves the same performance as BSA when families are lightly infected and $R$ is chosen such that $\phi_c = 0$ (e.g. $R = 0$); this is because $\bar{T}_{(ii)} \le k \log_2 n + k$. Since the former case (heavy infection) is more realistic, our algorithm is expected to perform a lot better than classic group testing in practice.

The examples in observation (a) and the above analysis indicate two things: First, the knowledge of the community structure is more beneficial when families are heavily infected; traditional group testing performs equally well at low infection rates. Our experiments showed that the community structure helps whenever $p > 0.15$ and the benefits increase with $p$. Second, a rough estimate of the families’ infection rate has to be known a priori in order to optimally choose $R$. In Appendix B, we demonstrate that this is unavoidable in the symmetric scenario we examine and when only one mixed sample per family is used to identify which families are heavily/lightly infected.

(c) In the most favorable regime for our community-aware group testing, where very few families have almost all their members infected (i.e. $k_f = \Theta(F^{\alpha_f})$ for $\alpha_f \in [0, 1)$ and $k_m \approx M$), even if $R$ is chosen optimally such that $\phi_c = 1$, the ratio of the expected number of tests needed by Algorithm 1 (see (4)) and HGBSA cannot be less than $1/\log(n/k)$, which upper bounds the benefits one may get. In Appendix B.2, we detail this observation and provide an optimized version of our algorithm that improves upon the gain of $1/\log(n/k)$.

4.2 Two stage algorithm

The adaptive algorithm can be easily implemented as a two-stage algorithm, where we first perform one round of tests, see the outcomes, and then design and perform a second round of tests. The first round of tests implements part 1, checking whether a family is highly infected or not; the second round of tests implements part 2, performing individual tests for the members of the highly infected families, and in parallel, group testing for the members of the remaining families.

As we did before for the adaptive case, we here make use of a classic non-adaptive group-testing algorithm, which we call NonAdaptiveTest() and which abstracts any existing (or future) non-adaptive algorithm in the group-testing literature. Thus, to translate Alg. 1 to a two-stage algorithm, lines 4 and 13 simply become:
 4: $[\hat{U}_{x(r_1)}, \ldots, \hat{U}_{x(r_F)}]$ = NonAdaptiveTest($x(r_1), \ldots, x(r_F)$)
13: $\{\hat{U}_i : i \in A\}$ = NonAdaptiveTest($A$). (8)

Number of tests: In some regimes, the two-stage algorithm can operate with the same (order) number of tests as the adaptive algorithm, at a cost of a vanishing error probability: for example, for the tests in line 4, if $k_f = \Theta(F^{\alpha_f})$ with $\alpha_f < 0.409$, we can use approximately $(1 - \alpha_f) F^{\alpha_f} \log_2 F$ tests and achieve vanishing error probability leveraging non-adaptive algorithms from the literature [Aldridge et al., 2019, Scarlett and Cevher, 2016, Johnson et al., 2019, Coja-Oghlan et al., 2020, Coja-Oghlan et al., 2020].

4.3 Non-adaptive algorithm

For simplicity of notation, we describe our non-adaptive algorithm using again the symmetric case.

Test Matrix Structure. Our test matrix $G$ is divided into two sub-matrices: $G = \begin{bmatrix} G_1 \\ G_2 \end{bmatrix}$.
• The sub-matrix $G_1$ of size $T_1 \times n$ identifies the infected families using one mixed sample from each family, similar to line 4 of Alg. 1. We want $G_1$ to identify all (non-)infected families with small error probability. If the number of tests available is high, we set $T_1 = F$, i.e., we use one row for each family test. Otherwise, in sparse $k_f$ regimes, we set $T_1$ closer to $O(k_f \log \frac{F}{k_f})$.
• The sub-matrix $G_2$ of size $T_2 \times n$ has a block matrix structure and contains $F$ identity matrices $I_M$, one for each family. $G_2$ is designed as follows: (i) each block column contains only one identity matrix $I_M$, i.e., each member is tested only once; (ii) each block row $i$ ($i \in \{1, 2, \cdots, b\}$) contains $c_i$ identity matrices $I_M$, i.e., there are $c_i$ members included in the corresponding tests. As a result: $T_2 = bM$. An example with $F = 6$, $b = 3$, $c_1 = 2$, $c_2 = 1$, $c_3 = 3$ is:
$$G_2 = \begin{bmatrix} I_M & 0_{M \times M} & 0_{M \times M} & I_M & 0_{M \times M} & 0_{M \times M} \\ 0_{M \times M} & I_M & 0_{M \times M} & 0_{M \times M} & 0_{M \times M} & 0_{M \times M} \\ 0_{M \times M} & 0_{M \times M} & I_M & 0_{M \times M} & I_M & I_M \end{bmatrix}.$$

Decoding. From the outcome of the tests in $G_1$ we identify the $F - k_f$ non-infected families, and proceed to remove the corresponding columns (non-infected members) from $G_2$. We use the remaining columns of $G_2$ to identify infected members according to the following rules (which follow the logic of combinatorial orthogonal matching pursuit (COMP) decoding [Chan et al., 2014, Cai et al., 2017]):
(i) A member is identified as non-infected if it is included in at least one negative test in $G_2$.
(ii) All other members, which are only included in positive tests in $G_2$, are identified as infected.

Error Probability. It is perhaps not hard to see that, after the removal of the columns, the block structure of $G_2$ helps us obtain a test matrix that is close to an identity matrix, and hence perform “almost” individual testing (footnote 2). Also, note that our decoding strategy for $G_2$ leads to zero FN errors. Building on these ideas, the following lemmas guide us through a design of $G_2$ that minimizes the (FP) error probability.

Footnote 2: An extended analysis about $G_2$ is in Appendix C.2.

• Requiring zero-error decoding is too rigid: the optimal solution is the trivial solution that tests each member individually, but this would require $T_2 \ge n$.
• The symmetric choice $c_i = c$ minimizes the error probability. As said, we design $G_2$ such that FP errors are minimized. A FP may happen if identity matrices $I_M$ corresponding to two or more infected families

appear in the same block row of $G_2$. In this case, some non-infected members may be included in the same test with infected members from other families and identified as infected by mistake.

Lemma 3. Under models (I) and (II), the probability that there is some block row containing two or more infected families is:
$$P^{I}_{\text{joint}} = 1 - \frac{\sum_{B \subseteq \{1, 2, \cdots, b\} : |B| = k_f} \prod_{i \in B} c_i}{\binom{F}{k_f}}, \tag{9}$$
$$P^{II}_{\text{joint}} = 1 - \prod_{i=1}^{b} \left[ (1 - q)^{c_i} + c_i q (1 - q)^{c_i - 1} \right]. \tag{10}$$

The following lemma offers a test-matrix design that minimizes the system FP probability, defined as:
$$P(\text{any-FP}) \triangleq P(\exists i : \hat{u}_i = 1 \text{ and } u_i = 0). \tag{11}$$

Lemma 4. $P(\text{any-FP})$ is minimized for both models (I) and (II) if $c_i = c$ for all $i \in \{1, \cdots, b\}$.

Lemma 5. For $G_2$ as in Lemma 4, the system FP probability for models (I) and (II) equals:
$$P^{I}(\text{any-FP}) = \left[ 1 - \frac{1}{\binom{M}{k_m}} \right] \left[ 1 - \frac{\binom{T_2/M}{k_f} (FM/T_2)^{k_f}}{\binom{F}{k_f}} \right],$$
$$P^{II}(\text{any-FP}) = \left[ 1 - \sum_{i=1}^{M} \left( p^i (1 - p)^{M - i} \right)^2 \frac{1}{\binom{M}{i}} \right] \cdot \left[ 1 - \left( (1 - q)^{\frac{FM}{T_2} - 1} \left( 1 - q + \frac{FMq}{T_2} \right) \right)^{T_2/M} \right].$$

$P(\text{any-FP})$ can be pessimistic; a more practical metric is the average fraction of members that are misidentified (error rate): $R(\text{error}) \triangleq |\{i : \hat{u}_i \ne u_i\}|/n$.

Lemma 6. For $G_2$ as in Lemma 4, the error rate is calculated for models (I) and (II) as:
$$R_{I}(\text{error}) < \frac{k_f (M - k_m)}{FM} \cdot P^{I}_{\text{joint}}, \tag{12}$$
$$R_{II}(\text{error}) < (1 - p) q \left( 1 - (1 - q)^{c - 1} \right). \tag{13}$$

5 Loopy belief propagation decoder

We now describe our new algorithm for decoding the infection status of the individuals (and families). This is accomplished by estimating the posterior probability of the corresponding individual (or family) being infected via loopy belief propagation (LBP). LBP computes the posterior marginals exactly when the underlying factor graph describing the joint distribution is a tree (which is rarely the case) [Kschischang et al., 2001]. Nevertheless, it is an algorithm of practical importance and has achieved success on a variety of applications. Also, LBP offers soft information (posterior distributions), which can prove more useful than hard decisions in the context of disease-spread management.

We use LBP for our probabilistic model, because it is fast and can be easily configured to take into account the community structure, leading to more reliable identification. Many inference algorithms exist that estimate the posterior marginals, some of which have also been employed for group testing. For example, GAMP [Zhu et al., 2020] and Monte-Carlo sampling [Cuturi et al., 2020] yield more accurate decoders. However, taking into account the statistical information provided by the community structure proved non-trivial with such decoders. Moreover, the focus of this work is to examine whether benefits from accounting for the community structure (both at the test design and the decoder) exist; hence we think that considering a simple (possibly sub-optimal) decoder based on LBP is a good first step; we defer more complex designs to future work.

We next describe the factor graph and the belief propagation update rules for our probabilistic model (II). Let the infection status of each family $j$ be $V_j \sim \text{Ber}(q)$. Moreover, let $V(U_i)$ denote the family that $U_i$ belongs to. Then:
$$P(V_1, \ldots, V_F, U_1, \ldots, U_n, Y_1, \ldots, Y_T) = \prod_{j=1}^{F} P(V_j) \prod_{i=1}^{n} P(U_i \mid V(U_i)) \prod_{\tau=1}^{T} P(Y_\tau \mid U_{\delta_\tau}), \tag{14}$$
where $\delta_\tau$ is the group of people participating in the test. Equation (14) can be represented by a factor graph, where variable nodes correspond to each random variable $V_j$, $U_i$, $Y_\tau$ and factor nodes correspond to $P(V_j)$, $P(U_i \mid V(U_i))$, $P(Y_\tau \mid U_{\delta_\tau})$.

Given that the result of each test is $y_\tau$, i.e., $Y_\tau = y_\tau$, LBP computes the marginals $P(V_j = v \mid Y_1 = y_1, \ldots, Y_T = y_T)$ and $P(U_i = u \mid Y_1 = y_1, \ldots, Y_T = y_T)$ by iteratively exchanging messages across the variable and factor nodes. The messages are viewed as beliefs about that variable or distributions (a local estimate of $P(\text{variable} \mid \text{observations})$). Since all random variables are binary, each message is a 2-dimensional vector.

We use the factor graph framework from [Kschischang et al., 2001] to compute the messages: Each variable node $Y_\tau$ continually transmits the message $[0, 1]$ if $Y_\tau = 1$ and $[1, 0]$ if $Y_\tau = 0$ on its incident edge, at every iteration. Each other variable node ($V_j$ and $U_i$) uses the following rule: for each incident edge $e$, the node computes the elementwise product of the messages from every other incident edge $e'$ and transmits this along $e$. For the factor node messages, we derive closed-form

expressions for the sum-product update rules (akin to equation (6) in [Kschischang et al., 2001]). The exact messages are described in Appendix D.

6 Numerical evaluation

In this section, we evaluate the benefits (in terms of number of tests and error rate) from taking the community structure into account in practical scenarios, where noiseless or noisy tests are used.

Experimental setup I: Symmetric. In our simulations, we consider 2 different use cases about the community structure: (Community 1) a neighborhood with F = 200 families of M = 5 members each, and (Community 2) a university department with F = 20 classes of M = 50 students each. In each use case, we also examine 2 different infection regimes: (a) a linear regime, where $\bar{k}/n = 0.1$; and (b) a sparse regime, where $\bar{k} = \sqrt{n} \approx 32$. Finally, we consider both noiseless tests that have perfect accuracy and noisy tests that follow the Z-channel model from Section 2.3. For each scenario, we average over 500 randomly generated community structures, in which the members/students are infected according to the symmetric probabilistic model (II): first a family/class is chosen at random w.p. q to be infected and then each of its members/students gets randomly infected w.p. p.

Results. Our results were similar in all scenarios; for brevity, we show here only the sparse regime. Further results can be found in the Appendix of the supplementary submitted document.

(i) Noiseless testing – Average number of tests: In this experiment, we measure the average number of tests needed by 3 algorithms that achieve zero-error reconstruction (Alg. 1 with R = 1, Alg. 1 with R = M, and classic BSA), and a nonadaptive algorithm (Section 4.3) that uses $T_1 = F$ tests for $G_1$ and has an FP rate around 0.5%. Alg. 1 assumes no prior knowledge of the number of infected families/classes or members/students, hence uses BSA for AdaptiveTest().

Figure 1: Noiseless case: Average number of tests. (Axes: average # of tests versus p ∈ [0.4, 0.8]. Curves: Alg. 1 with R = 1, Alg. 1 with R = M, binary splitting, non-adaptive, lower bound, counting bound.)

Fig. 1 depicts our results for Community 2 and for p ∈ [0.4, 0.8]. Both versions of Alg. 1 need significantly fewer tests compared to classic BSA, while staying below the counting bound. This indicates the potential benefits from the community structure, even when the number of infected members is unknown. More interestingly, when R = M, Alg. 1 performs close to the lower bound in most realistic scenarios p ∈ [0.5, 0.8] (as also shown in Section 4.1). The corresponding result in the linear regime was slightly worse: 50-70 tests above the lower bound. Last, the grey line shows the number of tests needed by our nonadaptive algorithm; we observe that even that algorithm can perform better than BSA when p > 0.55 and small FP rates are tolerated.

(ii) Noiseless testing – Average error rate: Here we quantify the additional cost in terms of error rate when one goes from a two-stage adaptive algorithm that achieves zero-error identification to much faster single-stage nonadaptive algorithms. In each run, we first run our two-stage algorithm (Section 4.2), which uses a classic constant-column-weight test design at each stage, and measure the number of tests it requires to achieve zero errors. Then, we use the same number of tests to infer the members' infection status through 2 nonadaptive algorithms that account for the community structure either at the test matrix (encoding) part or at the decoding, and a traditional one that does not consider it at all: “COMP with C-encoder” is our nonadaptive algorithm that uses a COMP decoder as described in Section 4.3; “C-LBP with NC-encoder” is an algorithm that uses a classic constant-column-weight test design combined with our LBP decoder from Section 5; and “COMP with NC-encoder” is a traditional nonadaptive algorithm, which we use as a benchmark and which uses a constant-column-weight test matrix with a COMP decoder. “C” denotes that the community is taken into account, while “NC” denotes that it is ignored. It is important to note that the number of tests needed by the two-stage algorithm (and therefore by all other algorithms) gets lower as p gets large, something that affects the results (as discussed further below).

Figure 2: Noiseless case: Average error rate with few tests.

Figure 3: Noiseless case: Average error rate (p = 0.6).

Figure 4: Noisy case: Average error rate (p = 0.8).

Fig. 2 depicts the FP and FN error rates (see footnote 3), averaged over 500 runs, as a function of p ∈ [0.3, 0.9] for Community 1. We observe that any community-aware nonadaptive algorithm performs better than traditional nonadaptive group testing (red line) when p > 0.4; the absolute performance gap ranges from 0.4% (when p = 0.3) to 5.5% (when p = 0.9). “COMP with C-encoder” has a stable FP rate across all p values that was close to 1%, and a zero FN rate by construction. Our LBP decoder may yield both FN and FP errors. Also, being an approximate inference algorithm, it may produce worse results than COMP when p ∈ [0.42, 0.67], but it performs better when the infection rate is higher.

Footnote 3: FN rate is the percentage of infected individuals identified as negative, and vice versa for FP.

Fig. 3 examines the effect of the number of tests. Starting from the average number of tests used by the two-stage algorithm when p = 0.6, we compute the FP and FN rates for larger numbers of tests. Our experiment shows a transition around T = 240, after which point “C-LBP with NC-encoder” performs better than “COMP with C-encoder”. In fact, “COMP with C-encoder” seems to converge to zero FP errors much more slowly. This result was common for other p values; the transition just occurred at a different T. We therefore conclude that one may use our “COMP with C-encoder” when the number of tests available is limited or when a simple decoder is preferred; otherwise, if the testing budget is larger, “C-LBP with NC-encoder” is the better choice.

(iii) Noisy testing: Assuming the Z-channel noise of Section 2.3 with parameter z = 0.15, we evaluate the performance of our community-based LBP decoder of Section 5 against an LBP decoder that does not account for the community, namely one whose factor graph has no $V_j$ nodes. Fig. 4 depicts our results for Community 1, for a selected p = 0.8 and a number of tests as given by the two-stage algorithm of the previous experiments. We observe that the knowledge of the community structure (in C-LBP) reduces both the FP and FN rates achieved by the community-unaware NC-LBP. In particular, FN error rates drop significantly (up to 80% when tests are few), which is important in our context since FN errors lead to further infections. Our results were similar for other p values as well.

Experimental setup II: Asymmetric. In our asymmetric setup, infections again follow the probabilistic model (II), but this time for each family j, $M_j$ and $p_j$ are selected uniformly at random from the intervals [5, 50] and [0.4, 0.8], respectively.

Figure 5: Asymmetric case: Ratio of the number of tests needed to the lower bound (2). (Box plot for Alg. 1 with R = 1, Alg. 1 with R = M, and BSA.)

Fig. 5 is a box plot depicting our results for the sparse regime (q = 3%) over 500 randomly generated instances, as described above. The middle line in the box represents the mean and the ends of the box represent the lower and upper quartiles, respectively. The crosses represent outlier points. BSA needs on average 5.23× (and up to 13×) more tests compared to the probabilistic bound, while the two versions of Algorithm 1 with R = 1 and R = M need only 2.4× and 1.11× (and up to 9.85× and 1.8×) more tests, respectively. Also, the significantly smaller range between the 25th and 75th percentiles of the box plots related to Algorithm 1 indicates a more predictable performance w.r.t. BSA.

7 Conclusions

The new observation we make in this paper is that taking into account infection correlations, as dictated by a known community structure, makes it possible to reduce the number of group tests required to identify the infected members of a population and can improve the identification accuracy when the number of tests is fixed. In this paper we make this point assuming a non-overlapping community structure, a specific noise model and binary group testing. We considered a combinatorial and a probabilistic model, derived lower bounds on the number of tests needed, and explored adaptive, two-stage and non-adaptive algorithms for the noiseless case, as well as algorithms for the noisy case. Our algorithms are not always optimal w.r.t. the lower bounds, but they perform significantly better than community-agnostic group testing; per our experiments, they need up to 55-75% fewer tests (on average) to achieve the same identification accuracy.

We posit that such benefits are possible in a number of other community, noise or group test models; as an example, the follow-up work in [Nikolopoulos et al., 2021] illustrates benefits when the families overlap. Understanding what the benefits are in more sophisticated models remains an open question.
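As a supplementary illustration of the symmetric setup of Section 6, the two-level infection process of model (II) and the FP/FN rates of footnote 3 can be sketched in a few lines of Python. This is a minimal sketch and not the simulation code behind the reported results; the function names `sample_symmetric_model_ii` and `fp_fn_rates` are hypothetical, and the choice to report a rate of 0 when its denominator is empty is an assumption made here for convenience.

```python
import random


def sample_symmetric_model_ii(F, M, q, p, rng):
    """Draw one infection pattern from the symmetric probabilistic model (II).

    Each of the F families is infected w.p. q; inside an infected family,
    each of the M members is infected independently w.p. p.
    Returns a list of F lists, each holding M values in {0, 1}.
    """
    population = []
    for _ in range(F):
        family_infected = rng.random() < q
        members = [
            1 if (family_infected and rng.random() < p) else 0
            for _ in range(M)
        ]
        population.append(members)
    return population


def fp_fn_rates(true_status, estimated_status):
    """Return (FP rate, FN rate) for two equal-length 0/1 sequences.

    FN rate: fraction of infected individuals identified as negative.
    FP rate: fraction of non-infected individuals identified as positive.
    Assumption of this sketch: a rate is reported as 0.0 when its
    denominator (number of infected / non-infected individuals) is zero.
    """
    est_of_infected = [e for t, e in zip(true_status, estimated_status) if t == 1]
    est_of_healthy = [e for t, e in zip(true_status, estimated_status) if t == 0]
    fn = est_of_infected.count(0) / len(est_of_infected) if est_of_infected else 0.0
    fp = est_of_healthy.count(1) / len(est_of_healthy) if est_of_healthy else 0.0
    return fp, fn
```

For instance, `sample_symmetric_model_ii(200, 5, q, p, random.Random(0))` draws one instance with the dimensions of Community 1; the values of q and p are left to the reader, since the section specifies them only through the regime ($\bar{k}$) and the swept range of p.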

Acknowledgments

This work was supported in part by NSF grants #2007714, #1705077 and UC-NL grant LFR-18-548554. We would also like to thank Katerina Argyraki for her ongoing support and the valuable discussions we have had about this project, as well as the anonymous reviewers (especially Reviewer 1) for their constructive comments that helped improve our paper.

References

[Aldridge, 2019] Aldridge, M. (2019). Individual testing is optimal for nonadaptive group testing in the linear regime. IEEE Trans. Inf. Theory, 65(4).

[Aldridge et al., 2019] Aldridge, M., Johnson, O., and Scarlett, J. (2019). Group testing: an information theory perspective. CoRR, abs/1902.06002.

[Ash, 1990] Ash, R. (1990). Dover Publications Inc., New York, NY.

[Baldassini et al., 2013] Baldassini, L., Johnson, O., and Aldridge, M. (2013). The capacity of adaptive group testing. In 2013 IEEE International Symposium on Information Theory, pages 2676–2680.

[Broadfoot, 2020] Broadfoot, M. (2020). Coronavirus test shortages trigger a new strategy: Group screening. See https://www.scientificamerican.com/article/coronavirus-test-shortages-trigger-a-new-strategy-group-screening2/.

[Cai et al., 2017] Cai, S., Jahangoshahi, M., Bakshi, M., and Jaggi, S. (2017). Efficient algorithms for noisy group testing. IEEE Trans. Inf. Theory, 63(4):2113–2136.

[Chan et al., 2014] Chan, C. L., Jaggi, S., Saligrama, V., and Agnihotri, S. (2014). Non-adaptive group testing: Explicit bounds and novel algorithms. IEEE Trans. Inf. Theory, 60(5):3019–3035.

[Cheraghchi et al., 2012] Cheraghchi, M., Karbasi, A., Mohajer, S., and Saligrama, V. (2012). Graph-constrained group testing. IEEE Transactions on Information Theory, 58(1):248–262.

[Coja-Oghlan et al., 2020] Coja-Oghlan, A., Gebhard, O., Hahn-Klimroth, M., and Loick, P. (2020). Information-theoretic and algorithmic thresholds for group testing. IEEE Trans. Inf. Theory.

[Coja-Oghlan et al., 2020] Coja-Oghlan, A., Gebhard, O., Hahn-Klimroth, M., and Loick, P. (2020). Optimal group testing. volume 125 of Proceedings of Research, pages 1374–1388.

[Cuturi et al., 2020] Cuturi, M., Teboul, O., and Vert, J.-P. (2020). Noisy adaptive group testing using bayesian sequential experimental design. arXiv preprint arXiv:2004.12508.

[Dorfman, 1943] Dorfman, R. (1943). The detection of defective members of large population. The Annals of Mathematical Statistics, 14:436–440.

[Du and Hwang, 1993] Du, D.-Z. and Hwang, F. (1993). Combinatorial Group Testing and Its Applications. Series on Applied Mathematics.

[Ellenberg, 2020] Ellenberg, J. (2020). Five people. one test. this is how you get there. NYtimes.

[Ghosh et al., 2020] Ghosh, S. et al. (2020). Tapestry: A single-round smart pooling technique for covid-19 testing. medRxiv.

[Goenka et al., 2020] Goenka, R., Cao, S.-J., Wong, C.-W., Rajwade, A., and Baron, D. (2020). Contact tracing enhances the efficiency of covid-19 group testing. arXiv preprint arXiv:2011.14186.

[Gollier and Gossner, 2020] Gollier, C. and Gossner, O. (2020). Group testing against covid-19. See https://www.tse-fr.eu/articles/group-testing-against-covid-19.

[Hu et al., 1981] Hu, M. C., Hwang, F. K., and Wang, J. K. (1981). A boundary problem for group testing. SIAM Jour. on Algebraic Discrete Methods.

[Hwang, 1972] Hwang, F. K. (1972). A method for detecting all defective members in a population by group testing. Journal of the American Statistical Association, 67(339):605–608.

[Johnson et al., 2019] Johnson, O., Aldridge, M., and Scarlett, J. (2019). Performance of group testing algorithms with near-constant tests per item. IEEE Trans. Inf. Theory, 65(2):707–723.

[Johnson, 2017] Johnson, O. T. (2017). Strong converses for group testing from finite blocklength results. IEEE Trans. Inf. Theory, 63(9).

[Karbasi and Zadimoghaddam, 2012] Karbasi, A. and Zadimoghaddam, M. (2012). Sequential group testing with graph constraints. In 2012 IEEE Information Theory Workshop, pages 292–296. IEEE.

[Kschischang et al., 2001] Kschischang, F. R., Frey, B. J., and Loeliger, H.-A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519.

[Kucirka et al., 2020] Kucirka, L. M., Lauer, S. A., Laeyendecker, O., Boon, D., and Lessler, J. (2020). Variation in false-negative rate of reverse transcriptase polymerase chain reaction–based sars-cov-2 tests by time since exposure. Annals of Internal Medicine, 173:262–267.

[Li et al., 2014] Li, T., Chan, C. L., Huang, W., Kaced, T., and Jaggi, S. (2014). Group testing with prior statistics. In 2014 IEEE International Symposium on Information Theory, pages 2346–2350.

[Luo et al., 2019] Luo, S., Matsuura, Y., Miao, Y., and Shigeno, M. (2019). Non-adaptive group testing on graphs with connectivity. Journal of Combinatorial Optimization, 38(1):278–291.

[Nikolopoulos et al., 2020] Nikolopoulos, P., Srinivasavaradhan, S. R., Guo, T., Fragouli, C., and Diggavi, S. (2020). Community aware group testing. arXiv preprint arXiv:2007.08111.

[Nikolopoulos et al., 2021] Nikolopoulos, P., Srinivasavaradhan, S. R., Guo, T., Fragouli, C., and Diggavi, S. (2021). Group testing for overlapping communities. In Proc. of the IEEE International Conference on Communications, ICC 2021.

[Riccio and Colbourn, 2000] Riccio, L. and Colbourn, C. J. (2000). Sharper bounds in adaptive group testing. Taiwanese Journal of Mathematics, page 669–673.

[Saiki et al., 1985] Saiki, R. et al. (1985). Enzymatic amplification of beta-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. Science, 230(4732):1350–1354.

[Scarlett and Cevher, 2016] Scarlett, J. and Cevher, V. (2016). Phase transitions in group testing. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, pages 40–53. SIAM.

[Sobel and Elashoff, 1975] Sobel, M. and Elashoff, R. (1975). Group testing with a new goal, estimation. Biometrika, 62(1):181–193.

[Sobel and Groll, 1959] Sobel, M. and Groll, P. A. (1959). Group testing to eliminate efficiently all defectives in a binomial sample. The Bell System Technical Journal, 38(5):1179–1252.

[Ungar, 1960] Ungar, P. (1960). Cutoff points in group testing. Comm. Pure Appl. Math, 13:49–54.

[Verdun et al., 2020] Verdun, C. et al. (2020). Group testing for sars-cov-2 allows up to 10-fold efficiency increase across realistic scenarios and testing strategies. medRxiv.

[Walter et al., 1980] Walter, S. D., Hildreth, S. W., and Beaty, B. J. (1980). Estimation of infection rates in populations of organisms using pools of variable size. American Journal of Epidemiology, 112(1):124–128.

[Yaakov Malinovsky, 2016] Yaakov Malinovsky, P. S. A. (2016). Revisiting nested group testing procedures: new results, comparisons, and robustness. American Statistician. See also https://arxiv.org/abs/1608.06330.

[Zhu et al., 2020] Zhu, J., Rivera, K., and Baron, D. (2020). Noisy pooled pcr for virus testing. arXiv preprint arXiv:2004.02689.

Appendix

A Appendix for Section 3: The lower bounds

A.1 Proof of Theorem 1

Proof. Ineq. (1) follows from a counting argument: there are only $2^T$ combinations of test results. But, because of the community model I, there are $\binom{F}{k_f}\cdot\prod_{j=1}^{k_f}\binom{M_j}{k_m^j}$ possible sets of infected members, each of which must give a different set of results. Thus,

$$2^T \ge \binom{F}{k_f}\cdot\prod_{j=1}^{k_f}\binom{M_j}{k_m^j},$$

which reveals the result. The RHS of the latter inequality holds because there are $\binom{F}{k_f}$ combinations of infected families, and for each infected family $j$ there are $\binom{M_j}{k_m^j}$ possible combinations of infected family members; hence, for each combination of $k_f$ infected families, there are $\prod_{j=1}^{k_f}\binom{M_j}{k_m^j}$ possible combinations of infected family members. The symmetric bound is obtained as a corollary by taking $M_j = M$ and $k_m^j = k_m$ for each infected family $j$.

A.2 Proof of Theorem 2

Proof. Let $V$ be the indicator random vector for the infection status of all families. By rephrasing [Li et al., 2014, Theorem 1], any probabilistic group testing algorithm using $T$ noiseless tests can achieve a zero-error reconstruction of $U$ only if:

$$T \ge H(U) = H(V) + H(U \mid V) - H(V \mid U). \qquad (A.1)$$

The first term is: $H(V) = \sum_{j=1}^{F} H(V_j) = F\,h_2(q)$.

The second term is calculated as:

$$\begin{aligned}
H(U \mid V) &= \sum_{v=1}^{n} H(U_v \mid V_{E_v}) \\
&= \sum_{v=1}^{n}\ \sum_{x \in \{0,1\}} P(V_{E_v} = x)\, H(U_v \mid V_{E_v} = x) \\
&= \sum_{v=1}^{n} \left(q\, H(U_v \mid V_{E_v} = 1) + (1-q)\, H(U_v \mid V_{E_v} = 0)\right) \\
&= \sum_{v=1}^{n} q\, h_2(p_{E_v}) = q \sum_{j=1}^{F} M_j\, h_2(p_j),
\end{aligned}$$

where $E_v$ is the family containing vertex $v$.

Finally, we compute the third term as:

$$\begin{aligned}
H(V \mid U) &= \sum_{j=1}^{F} H(V_j \mid U) = \sum_{j=1}^{F} H(V_j \mid U_{S_j}) \\
&= \sum_{j=1}^{F} P(U_{S_j} = 0)\, h_2\!\left(P(V_j = 0 \mid U_{S_j} = 0)\right) \\
&= \sum_{j=1}^{F} \left(1 - q + q(1-p_j)^{|S_j|}\right) h_2\!\left(\frac{1-q}{1 - q + q(1-p_j)^{|S_j|}}\right),
\end{aligned}$$

where $S_j$ is the set of members who belong to family $j$ and $|S_j| = M_j$. Combining all 3 terms concludes the proof.

B Appendix for Section 4.1: The noiseless adaptive case

B.1 Rationale for Alg. 1

Group testing already has a rich body of literature with near-optimal test designs in the case of independent infections, and we do not try to improve upon them. Instead, we adapt these ideas to incorporate the correlations arising from the community structure. All test designs described in this section are conceptually divided into two parts. This split is guided by the community structure and attempts to identify the different infection regimes inside the community, so that the best testing method (individual or classic group testing) is used. We show that such a two-part design is enough to significantly reduce the cost of group testing and also to achieve the lower bound in some cases.

Two-part design: The two parts of Algorithm 1 serve complementary goals:

The goal of Part 1 is to detect the infection regime inside each family $j$: i.e., to accurately estimate which of the $F$ families have a high infection rate (“heavily” infected) and which have a low or zero infection rate (“lightly” infected). Our interest in detecting the infection regime is motivated by prior work [Riccio and Colbourn, 2000, Hu et al., 1981], which has shown that group testing offers benefits over individual testing only if the infection rate is low ($p_j \le 0.38$). This allows us to define the two regimes as follows: in the combinatorial model I (resp. probabilistic model II), a family $j$ is considered heavily infected when $k_m^j/M_j \ge 0.38$ (resp. $p_j \ge 0.38$); conversely, it is considered lightly infected when $k_m^j/M_j < 0.38$ (resp. $p_j < 0.38$).

For each family $j$, we regard $\hat{U}_{x(r_j)}$ as an estimate of the family's infection regime. If $\hat{U}_{x(r_j)}$ is positive, we consider the family to be heavily infected and therefore perform individual testing for all of its members. Otherwise, if $\hat{U}_{x(r_j)}$ is negative, we consider the family to be lightly infected and group test its members with all other lightly infected families. The challenge is therefore to produce accurate enough regime estimates, such that the overall number of tests needed by Alg. 1 to achieve exact infection-status reconstruction for all members $i = 1, \dots, n$ is minimal. We discuss this challenge further below.

Given all estimates $\hat{U}_{x(r_j)}$ from Part 1, the goal of Part 2 is then to identify all infected members, by using the appropriate testing method (group or individual testing) according to the infection regime of each family (light or heavy). In this way, at the end of Part 2, the algorithm returns an estimate $\hat{U}_i$ of the true infection status $U_i$ of each individual member $i$.

Selection of family representatives: Function SelectRepresentatives() at line 2 refers to any sampling function on a set of family members, as long as it returns a fixed number of members from family $j$. That is, one may use their own sampling function, as long as the accuracy of Part 1 is well defined. In this paper, we consider only random-sampling functions without replacement (i.e. $|r_j|$ members are randomly chosen from the family members and each subset of that size has the same probability of being selected as the representative subset). More elaborate sampling functions may be considered in other contexts. For example, if the internal structure of family $j$ can be represented through a contact graph, in which only specific family nodes have external contacts with other families, it may make sense to include (some of) these nodes in the representative group with certainty.

When only one mixed sample per family is used to identify the heavily/lightly infected families, the cardinality of the representative subset $|r_j|$ is essential, but its optimal choice is not trivial. $|r_j|$ affects the accuracy of the regime estimate, and hence the performance of our algorithm in terms of the expected number of tests that it uses. Unfortunately, choosing the number of representatives optimally is not easy even in the symmetric case that is examined in Section 4.1. Ideally, in the symmetric case, we would like to choose $|r_j| = R$ such that the bounds in Lemmas 1 and 2 are minimized. However, this requires solving equations of the form $y e^{y} = x$, which is generally possible through Lambert functions for $x \ge -\frac{1}{e}$, but the latter does not hold in our case. Fig. 6 demonstrates, through an example of F = 50 families with M = 60 members each, that there exists no unique $R$ that is optimal for every infection probability $p$ in (0, 1). The figure plots the bound of Lemma 2 as a function of $p$ and $R$. As we can see, there is no single minimizer $R^\star$: if $p < 0.15$, then $R$ must be picked equal to 0 (which yields traditional group testing); otherwise, if $p > 0.15$, then $R$ must be selected equal to $M$.

Figure 6: Expected number of tests from (7) as a function of size of representative set and probability of infection inside a family.

Therefore, in order to optimally choose $R$, a rough estimate of $p$ has to be known a priori. If the latter is not possible, then one may use a few more tests at the first stage of our algorithm to better detect whether a family is heavily infected. We provide such an optimization in the next section.

Function AdaptiveTest(): In both parts of our algorithm, we make use of a classic adaptive-group-testing algorithm, which we call AdaptiveTest(). This may be regarded as an abstraction for any existing (or future) adaptive algorithm in the group-testing literature. In our analysis, however, we mostly focus on the classic binary splitting algorithm because of its good performance in realistic cases, where the numbers of infected families and/or members ($k_f$, $k_m^j$) are unknown [Sobel and Groll, 1959].

In this section, we consider only adaptive algorithms that offer noiseless (zero-error) reconstruction. Note, however, that the fact that AdaptiveTest() offers exact reconstruction is not enough to guarantee an accurate detection of any family's infection regime in Part 1. For example, consider the case where the true infection rate within a family $j$ is not very low (say $p_j = 0.6$), yet none of the family representatives in set $r_j$ happened to be infected. Intuitively, the error probability of detection in Part 1 should depend on the number of selected representatives $|r_j|$ from each family $j$ and the infection rate among its members $p_j$. In our analysis, we examine different scenarios w.r.t. these parameters and discuss which parametrization (i.e. value of $|r_j|$) optimizes the expected number of tests required by our algorithm.

B.2 Modified/Optimized versions of Alg. 1

• One modification of our algorithm is the following: In Part 1, instead of selecting only one representative group for each family, we select $m_s$ representative subgroups, each of size $s$, and we treat each of these subgroups as a single “(super)-member”. That is, we identify whether each subgroup is positive (has at least one positive member) or not, and based on this information, using for example majority vote, we can classify the family as heavily or lightly infected; essentially we can solve an estimation problem as in [Aldridge et al., 2019] (see Chapter 5.3), [Walter et al., 1980, Sobel and Elashoff, 1975]. In this regard, Alg. 1 is just a special case of this approach, with $m_s = 1$ and $s = |r_j|$.

Intuitively, we expect that such a modification would increase the estimation accuracy of $\hat{p}_j$ and reduce the error of the related hypothesis test, at the cost of a few more tests. As a result, it could need fewer tests in expectation than Alg. 1, hence perform better in some cases. However, the potential improvement would depend on parameters such as the family size; for instance, for small families it is not expected to be large. To keep things simple, we prefer not to analyze this algorithm in this paper and defer it to future work.

• Another modification could be the following: instead of leveraging the community structure to perform individual tests where needed, we could use it to improve the traditional binary splitting algorithm by running it on multiple testing groups that are related to the community structure. For example, consider a symmetric case (see footnote 3 below) where we split all $n = FM$ members into $M$ groups of $F$ people (one from each family), then run binary splitting on each of these groups.

This modification is also related to Hwang's binary splitting algorithm, but achieves only logarithmic benefits compared to binary splitting, as opposed to our algorithm that may perform much better in real cases (see Section 4.1). In fact, the expected number of tests needed by this modified algorithm would be at most $k \log_2(n/M) + O(k)$: each group $g$ has $k_g$ infected members and binary splitting needs $k_g \log_2(n/M) + O(k_g)$ tests to identify all of them. By adding together the number of tests for each group $g$, we deduce the result.

Footnote 3: The symmetric example is used here only to better illustrate the advantages of the modification proposed. The idea is similar for the asymmetric case.

• A last modification occurred to us after a related comment of one of our reviewers, whom we thank. As discussed in Section 4.1, when a sparse regime holds for families (i.e. $k_f = \Theta(F^{\alpha_f})$ for $\alpha_f \in [0, 1)$) and a heavily linear regime holds within each family (i.e. $k_m \approx M$), the benefits of Lemma 1 with regard to Hwang's binary splitting (HBSA) cannot be more than $1/\log(n/k)$. This is because, in Eq. (4), we get the additive term $k_f M > k_f k_m = k$, which comes from the second stage of Algorithm 1.

Nevertheless, if $k_m > M - k_m$ (i.e., the infection rate inside each family is more than 0.5), then at the second stage of our algorithm it makes more sense to look for not-infected members and stop testing once we find them. In that case, we need at least $k_f \cdot (M - k_m)$ tests, which can be less than $k$, and therefore could lead to more benefits on average.

For example, consider the case where $k_m = M - 1$. Then the expected number of individual tests needed to find the 1 not-infected member inside each infected family can be computed as follows: without loss of generality, suppose that we test the members in some fixed ordering without replacement and the not-infected member has a uniformly random position in that ordering. Then, the probability of the not-infected item being at a given position $i$ in the ordering is equal to $1/M$ and we need $i$ tests to find it. As a result, the expected number of tests is $\sum_{i=1}^{M} i \cdot \frac{1}{M} = (M+1)/2$. From linearity of expectation, the expected number of tests for all infected families at the second stage of our algorithm (if we further assume that all infected families are identified without error at the first stage, i.e., $\phi_c = 1$) will be: $k_f \cdot (M+1)/2 < k_f \cdot (M-1) = k_f \cdot k_m = k$, where the inequality holds for $M > 3$. Hence, in this particular regime, the modification of our algorithm can achieve benefits of more than $1/\log(n/k)$.

In the more general case, where $M - k_m > 1$, the relevant probabilities for the computation of the expected number of tests can be obtained from the negative hypergeometric distribution (since sampling is without replacement).

In the extreme case, where for each infected family $k_m$ is known and equal to $M$, all we need to do is identify the infected families and label all their members as infected. In that case the benefit would be $k_f/k$. Note that to achieve the higher benefits described above, knowledge of the number of infected members per family is required, but this is also the case for HBSA.

B.3 Proof of Lemma 1

Proof. Let $\phi_c$ be the expected fraction of infected families whose mixed sample is positive. Since SelectRepresentatives() is uniform random sampling without replacement, we can compute $\phi_c$ when $1 \le R \le M - k_m$ using the hypergeometric distribution $\mathrm{Hyper}(M, k_m, R)$, as follows: the probability of a random mixed sample $x(r_j)$ being negative (i.e. all members of $r_j$ are negative) is given by the PMF of $\mathrm{Hyper}(M, k_m, R)$ evaluated at 0, and it is therefore equal to $\binom{M-k_m}{R} / \binom{M}{R}$, which yields $\phi_c = 1 - \binom{M-k_m}{R} / \binom{M}{R}$. We also define the following for completeness: $\phi_c = 0$ when $R = 0$ and $\phi_c = 1$ when $M - k_m < R \le M$.

Fixing the number of positive mixed samples in Part 1 of Alg. 1 to its expected value $k_f \phi_c$, we now compute the maximum number of tests needed by the algorithm to succeed.

Alg. 1 performs testing at lines 4, 8, 13.

• At line 4, it identifies the positive mixed samples to mark the corresponding families as heavily infected and all others as lightly infected. If HGBSA is used for AdaptiveTest(), then Alg. 1 is expected to succeed at this step using $k_f \phi_c \log_2 \frac{F}{k_f \phi_c} + k_f \phi_c$ tests. Similarly, if BSA is used for AdaptiveTest(), then Alg. 1 is guaranteed to succeed at this step using at most $k_f \phi_c \log_2 F + k_f \phi_c$ tests [Aldridge et al., 2019, Baldassini et al., 2013].

• At line 8, the expected number of individual tests is equal to $M k_f \phi_c$. This is the same irrespective of whether AdaptiveTest() is binary splitting or Hwang's algorithm, as it only depends on $\phi_c$.

• At line 13, the expected number of items that are tested is $n - k_f \phi_c M$, and the expected number of infected members is $k - k_f \phi_c k_m = k(1 - \phi_c)$. So, if HGBSA is used for AdaptiveTest(), then Alg. 1 is guaranteed to succeed at this step using $k(1 - \phi_c) \log_2 \frac{n - k_f \phi_c M}{k(1 - \phi_c)} + k(1 - \phi_c)$ tests. Similarly, if BSA is used, then Alg. 1 is expected to succeed in at most $k(1 - \phi_c) \log_2(n - k_f \phi_c M) + k(1 - \phi_c)$ tests [Aldridge et al., 2019, Baldassini et al., 2013].

We add together all the above terms that are related to HGBSA or BSA, and the result follows.

B.4 Proof of Lemma 2

Proof. Let $\phi_p$ be the expected fraction of infected families whose mixed sample is positive. Then, because of the probabilistic setting, $\phi_p = 1 - (1 - p)^R$.

Alg. 1 performs testing at lines 4, 8, 13.

• At line 4, the expected number of mixed samples that are positive is $F q \phi_p$. So, if BSA is used in the place of AdaptiveTest(), then the maximum number of tests needed to identify all mixed samples is in expectation $F q \phi_p \log_2 F + F q \phi_p$ [Aldridge et al., 2019, Baldassini et al., 2013].

• At line 8, the expected number of individual tests is equal to $F q \phi_p M$.

• At line 13, the expected number of items that are tested is $n - F q \phi_p M$, and the expected number of infected members is equal to the expected number of all infected members minus the expected number of the ones that are identified through individual testing at line 8, i.e., $F q M p - F q \phi_p M p = F q M p (1 - \phi_p) = n q p (1 - \phi_p)$. So, if BSA is used in the place of AdaptiveTest(), it is expected to succeed using at most $n q p (1 - \phi_p) \log_2(n - F q \phi_p M) + n q p (1 - \phi_p)$ tests [Aldridge et al., 2019, Baldassini et al., 2013].

We add together all the above terms and the result follows.

C Appendix for Section 4.3: The Noiseless Non-adaptive case

C.1 Zero error requirements

For our design of $G_2$, we have the following lemma and observation.

Lemma 7. To achieve zero-error w.r.t. $G_2$, we need $T_2 \ge n$.

Proof. A trivial implementation for $G_2$ is to use an identity matrix of size $n$; since each member is tested individually, we can identify all the infected members correctly. We next argue that $T_2 \ge n$ for the zero-error case. We prove this by contradiction. Assume that $T_2 < n$. Then, from the pigeonhole principle, there exists one member, say $m_1$, that does not participate in any test alone: it always participates together with one or more members from a set $S_1$. Assume that all members in $S_1$ are infected, while $m_1$ belongs to an infected family but is not infected. Then our decoding will result in a FP.

Observation: $G_2$ leads to zero error if and only if it has the following property:

Zero Error Property: Any subset of $\{1, 2, \cdots, n\}$ of size $(F - k_f)M + k_f(M - k_m)$ equals the union of some testing rows of $G_2$. Namely, the members of the not-infected families, together with the not-infected members of the infected families, need to be the only participants in some rows of $G_2$, for all possible not-infected families and not-infected members. This requirement can lead to an alternative proof of Lemma 7.
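Returning to the test counts in the proofs of Lemma 1 and Lemma 2 (Sections B.3 and B.4), the following is a minimal numerical sketch of the quantities $\phi_c$, $\phi_p$ and the sum of the three BSA terms for the combinatorial model. The function names are hypothetical and chosen only for illustration; the formulas are the ones stated in the proofs, with BSA assumed for AdaptiveTest().

```python
from math import comb, log2


def phi_c(M, k_m, R):
    """Expected fraction of infected families with a positive mixed sample
    (combinatorial model), via the Hyper(M, k_m, R) PMF evaluated at 0."""
    if R == 0:
        return 0.0
    if R > M - k_m:
        return 1.0
    return 1.0 - comb(M - k_m, R) / comb(M, R)


def phi_p(p, R):
    """Same quantity for the probabilistic model: 1 - (1 - p)^R."""
    return 1.0 - (1.0 - p) ** R


def bsa_tests_combinatorial(F, M, k_f, k_m, R):
    """Sum of the BSA terms for lines 4, 8 and 13 of Alg. 1 (proof of Lemma 1)."""
    n, k = F * M, k_f * k_m
    phi = phi_c(M, k_m, R)
    line4 = k_f * phi * log2(F) + k_f * phi        # find positive mixed samples
    line8 = M * k_f * phi                          # individual tests
    infected_left = k * (1.0 - phi)                # infected members not yet found
    items_left = n - k_f * phi * M                 # items tested at line 13
    line13 = 0.0
    if infected_left > 0:
        line13 = infected_left * log2(items_left) + infected_left
    return line4 + line8 + line13
```

For instance, with F = 64, M = 5, k_f = 6, k_m = 4 and R = M, every infected family is detected at line 4, so only the first two terms contribute.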

C.2 Rationale for the structure of G2

Our goal is to design a non-trivial matrix $G_2$ that can identify almost all the infected members with high probability and a small number of tests. We next discuss two intuitive properties we would like our designs to have to minimize the error probability.

Desirable Property 1: Use identity matrices as building blocks.
Intuition: ideally, after removing the $(F - k_f)M$ columns corresponding to the members in non-infected families, we would like the remaining columns to form an identity matrix so that we can identify all the infected members correctly. To reduce the number of tests, there should be more than one member included in each test. Thus we use overlapping identity matrices, one corresponding to each family. We assume the index for the $n$ members is family-by-family, i.e., the indexes for the members in the same family are consecutive. Then each family corresponds to an identity sub-matrix $I_M$ in $G_2$. Now the problem becomes how to arrange the identity sub-matrices.

Desirable Property 2: The identity matrices corresponding to different families either appear in the same set of $M$ rows in $G_2$ or they do not appear in any shared rows.
Intuition: otherwise, a family would share tests with more other families. Then the probability that this family shares tests with infected families becomes larger. This would increase the probability that two infected families share tests after removing all the non-infected family columns, which in turn would increase the FP probability.

C.3 Proof of Lemma 3

Proof. The probabilities can be explained as follows:

(i) For $P^{I}_{joint}$ in (9), the numerator gives the number of possibilities that each block row contains at most one infected family, which is obtained by choosing $k_f$ block rows (the summation) and then from each chosen block row choosing one family to be infected ($c_i$ possible choices for the $i$-th block row). The denominator is the total number of infection possibilities, and then the fraction denotes the probability that each block row contains at most one infected family. Thus, $P^{I}_{joint}$ is obtained as the probability that there is some block row that contains two or more infected families.

(ii) For $P^{II}_{joint}$ in (10), $(1 - q)^{c_i}$ is the probability that there is no infected family in the $i$-th block row, and $c_i q (1 - q)^{c_i - 1}$ is the probability that there is only one infected family in the $i$-th block row. The product $\prod$ denotes the probability that every block row contains at most one infected family. Thus, $P^{II}_{joint}$ is obtained as the probability that there is some block row that contains two or more infected families.

C.4 Proof of Lemma 4

Proof. Consider $c_i > c_j + 1$, and let $c_i' = c_i - 1$ and $c_j' = c_j + 1$ (with $c_\ell' = c_\ell$ for all other $\ell$). For the combinatorial model, we can verify the difference of the probability for $c_i$ and $c_i'$ by

$$\sum_{\substack{|B| = k_f: \\ B \subseteq \{1, 2, \cdots, b\}}} \prod_{\ell \in B} c_\ell' \; - \sum_{\substack{|B| = k_f: \\ B \subseteq \{1, 2, \cdots, b\}}} \prod_{\ell \in B} c_\ell = (c_i' c_j' - c_i c_j) \cdot X = (c_i - c_j - 1) \cdot X > 0,$$

where $X$ is a positive value independent of $c_i$ and $c_j$. This implies that the probability $P^{I}_{joint}$ in (9) achieves its minimum roughly at the symmetric case where all $c_i$'s are equal, i.e., $c_i = c$ for all $i \in \{1, 2, \cdots, b\}$.

Similarly, for the probabilistic model, consider the probability in (10). We can calculate that

$$\prod_{\ell=1}^{b} \left[ (1 - q)^{c_\ell'} + c_\ell' q (1 - q)^{c_\ell' - 1} \right] - \prod_{\ell=1}^{b} \left[ (1 - q)^{c_\ell} + c_\ell q (1 - q)^{c_\ell - 1} \right] \tag{C.1}$$
$$= (c_i - c_j - 1)\, q^2 (1 - q)^{c_i + c_j - 2} \cdot Y > 0, \tag{C.2}$$

where $Y = \prod_{\ell \ne i, j} \left[ (1 - q)^{c_\ell} + c_\ell q (1 - q)^{c_\ell - 1} \right] > 0$ is independent of $c_i$ and $c_j$. This implies that the probability in (10) achieves its minimum roughly at the symmetric case where all $c_i$'s are equal, i.e., $c_i = c$ for all $i \in \{1, 2, \cdots, b\}$.

C.5 Proof of Lemma 5

The lemma is obtained under the assumption that the number of families $F$ is a multiple of $b$ and $c$. If $F$ cannot be factorized, the error probabilities in Lemma 5 can be viewed as an upper bound for the corresponding error probabilities. This can be seen by simply adding $F'$ auxiliary families so that $F + F' = bc$.
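The balancing argument in the proof of Lemma 4 can be checked numerically. The sketch below evaluates the two joint probabilities as they are described in the proof of Lemma 3 (a sum over $k_f$-subsets of block rows for model (I), a product over block rows for model (II)). The function names and the example block sizes are assumptions made only for illustration.

```python
from itertools import combinations
from math import comb, prod


def p_joint_combinatorial(c, k_f):
    """P^I_joint: probability that some block row holds two or more of the
    k_f infected families, where c[l] families sit in block row l."""
    F = sum(c)
    one_per_row = sum(
        prod(c[l] for l in B) for B in combinations(range(len(c)), k_f)
    )
    return 1.0 - one_per_row / comb(F, k_f)


def p_joint_probabilistic(c, q):
    """P^II_joint: same event when each family is infected independently w.p. q."""
    at_most_one = prod(
        (1 - q) ** cl + cl * q * (1 - q) ** (cl - 1) for cl in c
    )
    return 1.0 - at_most_one


# Moving one family from a block row of size 5 to one of size 3 balances the rows.
unbalanced, balanced = [5, 3, 4, 4], [4, 4, 4, 4]
print(p_joint_combinatorial(unbalanced, 2), p_joint_combinatorial(balanced, 2))
print(p_joint_probabilistic(unbalanced, 0.125), p_joint_probabilistic(balanced, 0.125))
```

In both models the balanced arrangement gives the smaller joint probability, in line with the sign of the differences computed in the proof.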

Proof. In the symmetric case, i.e., $c_i = c$ for all $i \in \{1, 2, \cdots, b\}$, the probabilities in (9) and (10) become

$$P^{I}_{joint} = 1 - \frac{\binom{b}{k_f} c^{k_f}}{\binom{F}{k_f}}, \tag{C.3}$$
$$P^{II}_{joint} = 1 - \left[ (1 - q)^{c-1} (1 - q + cq) \right]^{b}. \tag{C.4}$$

For the symmetric combinatorial model, the number of infected members in an infected family is $k_m^j = k_m$ for all infected families $j$. If two families appear in the same set of $M$ tests, the probability that all infected members in one family share the same $k_m$ tests as the other family is simply

$$P(\text{no FP} \mid \text{joint}) = \frac{1}{\binom{M}{k_m}}. \tag{C.5}$$

Thus the probability that FPs happen is

$$P^{e} = P(\text{FP} \mid \text{joint}) \cdot P^{I}_{joint} = \left[ 1 - \frac{1}{\binom{M}{k_m}} \right] \left[ 1 - \frac{\binom{b}{k_f} c^{k_f}}{\binom{F}{k_f}} \right]. \tag{C.6}$$

For the symmetric probabilistic model, the infection probability in an infected family is $p_j = p$ for all infected families $j$. If two families appear in the same set of $M$ tests, then there are no false positives only when the two families have the same number of infected members and the infected (non-infected) members in one family appear in the same set of tests as the infected (non-infected) members of the other family. The probability that two families both have $i$ infected members is $\left[ \binom{M}{i} p^i (1 - p)^{M-i} \right]^2$, and the probability that all infected members in one family share tests with only infected members in the other family is simply $1 / \binom{M}{i}$. Thus, the probability that there are no false positives is given as follows,

$$P(\text{no FP} \mid \text{joint}) = \sum_{i=1}^{M} \left[ \binom{M}{i} p^i (1 - p)^{M-i} \right]^2 \frac{1}{\binom{M}{i}}. \tag{C.7}$$

Thus the probability that a false positive happens can be obtained as

$$P^{e} = P(\text{FP} \mid \text{joint}) \cdot P^{II}_{joint} = \left[ 1 - \sum_{i=1}^{M} \left[ \binom{M}{i} p^i (1 - p)^{M-i} \right]^2 \frac{1}{\binom{M}{i}} \right] \cdot \left[ 1 - \left[ (1 - q)^{c-1} (1 - q + cq) \right]^{b} \right]. \tag{C.8}$$

Replacing $b$ by $T_2/M$ and $c$ by $FM/T_2$ completes the result.

C.6 Proof of Lemma 6 and Discussions

Proof. For the combinatorial model (I), it is hard to explicitly calculate the expected error rate. The upper bound in (12) is obtained by assuming that if there exist errors (FPs), then all non-infected members in infected families are misidentified as infected in the decoding of $G_2$. (Note that all non-infected members in non-infected families are correctly identified by the decoding of $G_2$.)

For the probabilistic model (II), the upper bound for the expected error rate in (13) is obtained by

$$R_{II}(\text{error}) = \frac{1}{n} \cdot b \cdot \left[ \sum_{j=2}^{c} \binom{c}{j} q^j (1 - q)^{c-j} \cdot \left( \sum_{i=1}^{j} \binom{j}{i} p^i (1 - p)^{j-i} (j - i) \right) \cdot M \right] \tag{C.9}$$
$$= \frac{bM}{n} \left[ \sum_{j=2}^{c} \binom{c}{j} q^j (1 - q)^{c-j} \cdot \left( j(1 - p) - j(1 - p)^j \right) \right] \tag{C.10}$$
$$< \frac{(1 - p) T_2}{n} \left[ \sum_{j=2}^{c} \binom{c}{j} q^j (1 - q)^{c-j} \cdot j \right]$$
$$= \frac{(1 - p) T_2}{n} \cdot \left[ cq - cq(1 - q)^{c-1} \right]$$
$$= (1 - p)\, q \left[ 1 - (1 - q)^{c-1} \right], \tag{C.11}$$

where the expression in the bracket in (C.9) for each $j$ denotes the expected number of FPs in one block row if there are $j$ families infected in this block row, (C.10) is obtained from the expected value of the binomial distribution, and (C.11) follows by substituting $c = n/T_2$.

We here make the following observation about the system FP probability $P(\text{any-FP})$: As we explore further in Section 6, non-adaptive group testing requires more tests than adaptive. Assume that $k_f = \Theta(F^{\alpha_f})$ for $\alpha_f \in [0, 1)$ and choose $R = M - 1$ in Algorithm 1. Adaptive testing allows to achieve zero error with $k_f \log_2 F + k_f M$ tests; if we use the same (order) number of tests with a non-adaptive strategy, i.e., $T_1 = k_f \log_2 \frac{F}{k_f}$ and $T_2 = k_f (\log_2 k_f + M)$, we get $P(\text{any-FP})$ in Lemma 5 approximately equal to $\left[ 1 - \frac{1}{\binom{M}{k_m}} \right] \left[ 1 - \frac{\binom{T_2/M}{k_f} (F/k_f)^{k_f}}{\binom{F}{k_f}} \right]$, which is bounded away from 0. The latter can be seen as follows: i) $T_2/M \approx k_f \ll F$; ii) $\binom{n}{k} / \binom{n+m}{k} = \prod_{i=1}^{m} \frac{n + i - k}{n + i}$ is decreasing with $m$ and can be very small when $m \gg n$.

Fig. 7 depicts $P(\text{any-FP})$ and $R(\text{error})$ for parameters $F = 64$, $k_f = 6$, $k_m = 4$, $M = 5$, $q = 1/8$, and $p = 0.8$.
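The chain (C.9) to (C.11) in the proof of Lemma 6 can be sanity-checked numerically. The sketch below evaluates the double sum in (C.9), its closed form in (C.10), and the final bound in (C.11). The function names are hypothetical, and the choice b = c = 8 for F = 64 families is an assumption made only for the example.

```python
from math import comb


def r_error_c9(n, b, c, M, q, p):
    """Double sum of (C.9): expected FP rate bound for the probabilistic model."""
    total = 0.0
    for j in range(2, c + 1):
        pr_j_infected = comb(c, j) * q**j * (1 - q) ** (c - j)
        # Sum over i = 1..j of the binomial term weighted by (j - i), as in (C.9).
        inner = sum(
            comb(j, i) * p**i * (1 - p) ** (j - i) * (j - i)
            for i in range(1, j + 1)
        )
        total += pr_j_infected * inner * M
    return b * total / n


def r_error_c10(n, b, c, M, q, p):
    """Closed form of the inner sum: j(1 - p) - j(1 - p)^j."""
    total = sum(
        comb(c, j) * q**j * (1 - q) ** (c - j) * (j * (1 - p) - j * (1 - p) ** j)
        for j in range(2, c + 1)
    )
    return b * M * total / n


def r_error_c11(c, q, p):
    """Final upper bound (C.11), valid when c = n / T_2 and T_2 = b * M."""
    return (1 - p) * q * (1 - (1 - q) ** (c - 1))


F, M, q, p = 64, 5, 1 / 8, 0.8
b = c = 8                      # assumed split with F = b * c
n = F * M
print(r_error_c9(n, b, c, M, q, p), r_error_c10(n, b, c, M, q, p), r_error_c11(c, q, p))
```

The first two values agree, and both stay below the third, which is the strict inequality step between (C.10) and (C.11).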

Figure 7: System FP probability and FP error rate.

D Appendix for Section 5: Loopy Belief Propagation algorithm

We here describe our loopy belief propagation algorithm (LBP) and update rules for our probabilistic model (II). We use the factor graph framework of [Kschischang et al., 2001] and derive closed-form expressions for the sum-product update rules (see equations (5) and (6) in [Kschischang et al., 2001]).

The LBP algorithm on a factor graph iteratively exchanges messages across the variable and factor nodes. The messages to and from a variable node $V_j$ or $U_i$ are beliefs about the variable or distributions (a local estimate of $P(V_j \mid \text{observations})$ or $P(U_i \mid \text{observations})$). Since all the random variables are binary, in our case each message would be a 2-dimensional vector $[a, b]$ where $a, b \ge 0$. Suppose the result of each test is $y_\tau$, i.e., $Y_\tau = y_\tau$, and we wish to compute the marginals $P(V_j = v \mid Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T)$ and $P(U_i = u \mid Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T)$ for $v, u \in \{0, 1\}$. The LBP algorithm proceeds as follows:

1. Initialization: The variable nodes $V_j$ and $U_i$ transmit the message $[0.5, 0.5]$ on each of their incident edges. Each variable node $Y_\tau$ transmits the message $[1 - y_\tau, y_\tau]$, where $y_\tau$ is the observed test result, on its incident edge.

2. Factor node messages: Each factor node receives the messages from the neighboring variable nodes and computes a new set of messages to send on each incident edge. The rules on how to compute these messages are described next.

3. Iteration and completion: The algorithm alternates between the factor node updates of step 2 and the variable node updates described below for a fixed number of times (in practice 10 or 20 times works well) and computes an estimate of the posterior marginals as follows: for each variable node $V_j$ and $U_i$, we take the coordinatewise product of the incoming factor messages and normalize to obtain an estimate of $P(V_j = v \mid y_1 \ldots y_T)$ and $P(U_i = u \mid y_1 \ldots y_T)$ for $v, u \in \{0, 1\}$.

Next we describe the simplified variable and factor node message update rules. We use equations (5) and (6) of [Kschischang et al., 2001] to compute the messages.

Leaf node messages: At every iteration, the variable node $Y_\tau$ continually transmits the message $[0, 1]$ if $Y_\tau = 1$ and $[1, 0]$ if $Y_\tau = 0$ on its incident edge. The factor node $P(V_j)$ continually transmits $[1 - q, q]$ on its incident edge; see Fig. 8 (a) and (b).

Variable node messages: The other variable nodes $V_j$ and $U_i$ use the following rule to transmit messages along the incident edges: for each incident edge $e$, a variable node takes the elementwise product of the messages from every other incident edge $e'$ and transmits this along $e$; see Fig. 8 (c).

Factor node messages: For the factor node messages, we calculate closed form expressions for the sum-product update rule (equation (6) in [Kschischang et al., 2001]). The simplified expressions are summarized in Fig. 8 (d) and (e). Next we briefly describe these calculations.

Figure 8: The update rules for the factor and variable node messages. Panels: messages from $P(V_j)$ factor nodes; messages from $Y_\tau$ variable nodes; messages from $V_j$ and $U_i$ variable nodes; messages from $P(U_i \mid V_j)$ factor nodes; messages from $P(Y_\tau \mid U_{\delta_\tau})$ factor nodes.

Firstly, we note that each message represents a probability distribution. One could, without loss of generality, normalize each message before transmission. Therefore, we assume that each message $\mu = [a, b]$ is such that $a + b = 1$. Now, the leaf nodes labeled $P(V_j)$ perennially transmit the prior distribution corresponding to $V_j$.

Next, consider the factor node $P(U_i \mid V_j)$ as shown in Fig. 8 (d). The message sent to $U_i$ is calculated as

$$\mu_u = \sum_{v \in \{0,1\}} P(U_i = u \mid V_j = v)\, w_v = w_0 (1 - u) + w_1 p_j^{u} (1 - p_j)^{1-u}.$$

Similarly, the message sent to $V_j$ is

$$\nu_v = \sum_{u \in \{0,1\}} P(U_i = u \mid V_j = v)\, s_u = s_0 \left( v(1 - p_j) + 1 - v \right) + s_1 v p_j.$$

Finally, for the factor nodes $P(Y_\tau \mid U_{\delta_\tau})$ as shown in Fig. 8 (e), note that the messages to $Y_\tau$ play no role since they are never used to recompute the variable messages. The messages to $U_i$ nodes are expressed as

$$\mu_u = \sum_{y \in \{0,1\}} \; \sum_{\{u_{i'} \in \{0,1\}:\, i' \in \delta_\tau \setminus \{i\}\}} P(Y_\tau = y \mid U_{\delta_\tau} = u_{\delta_\tau})\, (1 - y_\tau)^{1-y} y_\tau^{y} \prod_{i' \in \delta_\tau \setminus \{i\}} s^{(i')}_{u_{i'}}$$
$$= (1 - y_\tau) \sum_{\{u_{i'} \in \{0,1\}:\, i' \in \delta_\tau \setminus \{i\}\}} P(Y_\tau = 0 \mid U_{\delta_\tau} = u_{\delta_\tau}) \prod_{i' \in \delta_\tau \setminus \{i\}} s^{(i')}_{u_{i'}} + y_\tau \sum_{\{u_{i'} \in \{0,1\}:\, i' \in \delta_\tau \setminus \{i\}\}} P(Y_\tau = 1 \mid U_{\delta_\tau} = u_{\delta_\tau}) \prod_{i' \in \delta_\tau \setminus \{i\}} s^{(i')}_{u_{i'}}.$$

From our Z-channel model, recall that $P(Y_\tau = 0 \mid U_{\delta_\tau} = u_{\delta_\tau}) = 1$ if $u_i = 0\ \forall\, i \in \delta_\tau$ and $P(Y_\tau = 0 \mid U_{\delta_\tau} = u_{\delta_\tau}) = z$ otherwise. Thus we split the summation terms into 2 cases: one where $u_{i'} = 0$ for all $i'$ and the other its complement. Also combining this with the assumption that the messages are normalized, i.e., $s^{(i)}_0 + s^{(i)}_1 = 1$, we get

$$\sum_{\{u_{i'} \in \{0,1\}:\, i' \in \delta_\tau \setminus \{i\}\}} P(Y_\tau = 0 \mid U_{\delta_\tau} = u_{\delta_\tau}) \prod_{i' \in \delta_\tau \setminus \{i\}} s^{(i')}_{u_{i'}} = \mathbb{1}_{u=1}\, z + \mathbb{1}_{u=0} \left\{ 1 - (1 - z) \left( 1 - \prod_{i' \in \delta_\tau,\, i' \ne i} s^{(i')}_0 \right) \right\},$$

and

$$\sum_{\{u_{i'} \in \{0,1\}:\, i' \in \delta_\tau \setminus \{i\}\}} P(Y_\tau = 1 \mid U_{\delta_\tau} = u_{\delta_\tau}) \prod_{i' \in \delta_\tau \setminus \{i\}} s^{(i')}_{u_{i'}} = \mathbb{1}_{u=1} (1 - z) + \mathbb{1}_{u=0}\, (1 - z) \left( 1 - \prod_{i' \in \delta_\tau,\, i' \ne i} s^{(i')}_0 \right).$$

Substituting $u = 0$, we obtain the message

$$\mu_0 = (1 - y_\tau) \left\{ 1 - (1 - z) \left( 1 - \prod_{i' \in \delta_\tau,\, i' \ne i} s^{(i')}_0 \right) \right\} + y_\tau (1 - z) \left( 1 - \prod_{i' \in \delta_\tau,\, i' \ne i} s^{(i')}_0 \right).$$

Substituting $u = 1$, we obtain $\mu_1 = (1 - y_\tau) z + y_\tau (1 - z)$.

For our probabilistic model, the complexity of computing the factor node messages increases only linearly with the factor node degree.

E Appendix for Section 6: Other Results

We next provide additional experimental results to the ones provided in Section 6.

(i) Noiseless testing: Average number of tests. In Figure 9, we reproduce additional numerics akin to the ones in Section 6 for the number of tests in the noiseless-testing case. As earlier, we measure the average number of tests needed by 3 algorithms that achieve zero-error reconstruction (Alg. 1 with R = 1, Alg. 1 with R = M, and classic BSA), and a version of our non-adaptive algorithm (Section 4.3) that uses $T_1 = F$ tests for submatrix $G_1$ and has an overall FP rate around 0.5%. Alg. 1 assumes no prior knowledge of the number of infected families/classes or members/students, hence uses BSA as the group-testing algorithm for the AdaptiveTest() function.

Fig. 9 depicts our results: We observe that both versions of Alg. 1 (black and magenta lines) need significantly fewer tests compared to classic BSA (green line), while staying below the counting bound. This indicates the potential benefits from the community structure, even when the number of infected members is unknown. More interestingly, when R = M, Alg. 1 performs close to the lower bound in most realistic scenarios p ∈ [0.5, 0.8] (as also shown in Section 4.1). The grey line shows the number of tests needed by our nonadaptive algorithm; we observe that even that algorithm needs fewer tests than BSA when p gets larger than 0.5, of course at the cost of a (FP) error rate of 0.5%.

Figure 9: Experiment (i): Noiseless case, average number of tests. Panels: (a) Community 1, sparse regime; (b) Community 1, linear regime; (c) Community 2, sparse regime; (d) Community 2, linear regime. Each panel plots the average number of tests against p ∈ [0.4, 0.8] for Alg. 1 with R = 1, Alg. 1 with R = M, binary splitting, the non-adaptive algorithm, the lower bound, and the counting bound.

(ii) Noiseless testing: Average error rate. In Fig. 10, we reproduce additional numerics akin to the ones in Section 6 for average error rates in the noiseless-testing case. As earlier, we quantify the additional cost in terms of error rate, when one goes from a two-stage adaptive algorithm that achieves zero-error identification to much faster single-stage nonadaptive algorithms.

Fig. 10a is a reproduction of Fig. 3 for p = 0.8, and as can be seen its behavior is very similar to Fig. 3.

Fig. 10b depicts the FP and FN error rates (averaged over 500 runs) as a function of p ∈ [0.3, 0.8] for Community 1 for the linear regime. We observe that any community-aware nonadaptive algorithm performs better than traditional nonadaptive group testing (red line) when p > 0.5: the absolute performance gap ranges from 0.2% (when p = 0.5) to 8.5% (when p = 0.8). "COMP with C-encoder" has a stable FP rate across all p values that was close to 2%, and a zero FN rate by construction. Unlike the

sparse regime, the LBP consistently produces better error rates compared to the COMP decoder. However, for low values of p, LBP produces more FN errors. For p > 0.6, both the FN and FP error rates are close to 0 for LBP.

Fig. 10c and Fig. 10d examine the effect of the number of tests in the linear regime. For p = 0.6, "C-LBP with NC-encoder" performs better than "COMP with C-encoder" only for T > 450; below that, both have high error rates. On the other hand, for p = 0.8, "C-LBP with NC-encoder" performs better than "COMP with C-encoder" for all values of T. More importantly, "COMP with C-encoder" seems to saturate to a non-zero FP error rate, while "C-LBP with NC-encoder" is able to attain close to zero FP and FN error rates. These results contrast with the results for the sparse regime.

Figure 10: Experiment (ii): Noiseless case, average error rate. Panels: (a) p = 0.8 for sparse regime (error rate in % against number of tests T); (b) few tests for linear regime (error rate in % against probability of infection p); (c) p = 0.6 for linear regime (against T); (d) p = 0.8 for linear regime (against T). Curves: FP for COMP with C-encoder, FP for COMP with NC-encoder, FP for C-LBP with NC-encoder, FN for C-LBP with NC-encoder.

(iii) Noisy testing: In Figure 11, we reproduce additional numerics akin to the ones in Section 6 for average error rates in the noisy-testing case. As earlier, we assume the Z-channel noise of Section 2.3 with parameter z = 0.15, and we evaluate the performance of our community-based LBP decoder of Section 5 against an LBP that does not account for community, namely one whose factor graph has no $V_j$ nodes.

Fig. 11a is a reproduction of Fig. 4 for p = 0.6, and as can be seen its behavior is very similar to it.

Fig. 11b and Fig. 11c depict our results for Community 1 and for p = 0.6 and p = 0.8 in the linear infection regime. We observe that the knowledge of the community structure reduces the FN rates achieved by LBP. The FP error rates are always close to 0, while the FN error rates drop significantly (up to 60% when tests are few), which is important in our context since FN errors lead to further infections.

(iv) Asymmetric case, linear regime: Here we offer the results about an asymmetric setup that parallels the one of Section 6. Infections follow again the probabilistic model (II), the size of each family is randomly selected from the interval [5, 50], and the infection rate of each infected family is randomly selected from the range [0.4, 0.8]. But this time q = 5%.

Figure 12 depicts our results. BSA needs on average 6.19× (that can reach up to 13.87×) more tests compared to the probabilistic bound, while the two versions of Algorithm 1 with R = 1 and R = M need only 2.74× and 1.19× (that can reach up to 9.7× and 2.03×) more tests, respectively. Also, similarly to the sparse regime, there is a significantly smaller range between the 25-th and 75-th percentiles of the box-plots related to Algorithm 1, which indicates its more predictable performance compared to BSA.

Figure 11: Experiment (iii): Noisy case, average error rate. Panels: (a) p = 0.6 for sparse regime; (b) p = 0.6 for linear regime; (c) p = 0.8 for linear regime. Each panel plots the error rate (%) against the number of tests T. Curves: FP for NC-LBP with NC-encoder, FP for C-LBP with NC-encoder, FN for NC-LBP with NC-encoder, FN for C-LBP with NC-encoder.

Figure 12: Asymmetric case, linear regime: Cost efficiency for number of tests (box plots for Alg. 1 with R = 1, Alg. 1 with R = M, and BSA).