2017 Workshop on Computing, Networking and Communications (CNC)

Set Reconciliation With an Inaccurate Oracle

Mark Bilinski, Ryan Gabrys
SPAWAR Systems Center San Diego
{mark.bilinski, ryan.gabrys}@navy.mil

Abstract—In this work, we consider a variant of the set reconciliation problem where the estimate for the size of the symmetric difference may be inaccurate. Given this setup, we propose a new method for reconciling sets of data, and we then compare our method to the Invertible Bloom Filter approach proposed by Eppstein et al. [2].

I. INTRODUCTION

For the next ten years, one of the greatest challenges faced by the defense community will be developing new capabilities to handle big data. A driving force behind the proliferation of data in the Navy community is the growing demand for intelligence, surveillance, and reconnaissance (ISR) data for enhanced situational awareness. As recently as 2014, it was estimated that less than 5% of the data collected from ISR platforms reaches Navy analysts [9]. Oftentimes, this problem is the result of low-bandwidth (or otherwise poor) communication links. Another common issue is that duplicate data is transferred between many hosts [11]. In an effort to improve the flow of information across these links and prevent the transfer of data shared between collections of hosts, in this work we consider a variant of the set reconciliation problem.

The set reconciliation problem has the following setup. Suppose two hosts, A and B, each have a set of binary strings of length b. Let SA denote the set of strings on Host A and let SB denote the set of strings on Host B. The set reconciliation problem is to determine, using the minimum amount of information exchange, what to send from Host A to Host B with a single round of communication so that Host B can compute their symmetric difference SA △ SB = (SA \ SB) ∪ (SB \ SA), where d = |SA △ SB| and M is an estimate for an upper bound on d. Under the traditional setup, the estimate on d is accurate so that d ≤ M always holds.

The set reconciliation problem has received considerable attention in the past. The works in [6] and [8] propose the use of algebraic error correction codes. The approach in [8], which we refer to as polynomial interpolation (PI), was shown to be nearly optimal with respect to the amount of information exchange. However, the primary drawback of PI is its high O(M^3) computational complexity. Later works, such as [2] and [7], propose the use of structures similar to Bloom filters [4]. These schemes also achieve nearly optimal information exchange. Furthermore, the approaches in [2] and [4] exhibit lower computational complexity. In particular, the Invertible Bloom Filter (IBF) approach in [2] has computational complexity O(M).

Many variations of the set reconciliation problem have been studied in the literature. In [7], the authors allow multiple rounds of communication. [1] considers the case where small errors between elements of the data sets are handled differently than large discrepancies. [3] considers the case where elements in the symmetric difference are related.

In this work, we consider a new problem which, to the best of the authors' knowledge, has not been studied before. Classically, it is assumed that the estimate for the upper bound on d, which we denote as M, is accurate so that d ≤ M always holds. The aim of this work is to consider the case where the oracle (the estimate for d) is inaccurate, so that it may be that d > M. Our contribution is to propose a new algorithm, which we refer to as the Layered PI approach (or "our approach" when the context is clear), that is robust against an inaccurate oracle. We then compare our approach with IBF and demonstrate an improvement when the oracle is inaccurate.

The paper is organized as follows. Section II describes the Layered PI approach. We then evaluate our approach in Section IV using the setup described in Section III. Finally, Section V concludes the paper.

II. LAYERED PI ALGORITHM

In this section, we describe the Layered PI algorithm. The idea is first to sub-divide the space {0, 1}^b containing the data strings into random bins and run PI on each of these smaller sets. Note that a similar approach was also used in [7]. In [7], the authors randomly partition the space {0, 1}^b into subsets and run PI on each to reduce the high computational complexity of PI, but they then allow multiple rounds of communication. We adopt a similar approach to enable partial recovery of the symmetric difference when d > M, but maintain only a single round of communication.

The Layered PI approach calls the encode and decode methods of the PI algorithm of [8]. In executing PI, one encodes the data sets with a threshold parameter M. The value of M will be fixed at each layer of our algorithm. If M ≥ d then decoding will recover the entire symmetric difference. However, if M < d, then decoding may fail to recover anything (assuming that failure is detected). It is this value of M that determines the complexity of both the encode and decode of the PI algorithm; M's relationship to d determines whether PI succeeds or fails. Given an oracle that provides the value of d, one would set M = d and achieve the best possible complexity results. In our approach we will set the value of M and strategically run many instances of PI, expecting only some of them to succeed.

Next we informally describe the encoding procedure performed on the sets SA and SB. We partition the space

U.S. Government work not protected by U.S. copyright

F_2^b into ℓ = d/log(d) bins, each containing n/ℓ strings. Then, we perform PI on each bin, meaning we compute ℓ characteristic functions (one for each bin), and evaluate each function at the same threshold of M points. In addition, we compute a hash consisting of H bits for each bin. Next, we similarly re-partition the space into another set of ℓ bins, each of size n/ℓ. We again compute the characteristic function for each bin, except that now each characteristic function is evaluated at fM points and we compute a hash of fH bits for each bin, where 0 < f < 1. This process is iterated, each time with a new partitioning scheme, with the number of evaluation points and the length of the hashes reduced by a factor of f each iteration, until the number of evaluation points is less than or equal to log(log(d)). We transmit the outcome of ALL the evaluations along with ALL the hashes to Host B, in a single transmission.

Fig. 1: Example of the algorithm, 2 layers partitioning into 3 bins. In layer 1, one bin fails to decode since it has 3 "balls," exceeding the encode threshold M = 2. In layer 2, one bin would fail if not for the modify procedure, which removed a ball decoded in an earlier layer.

We now more formally describe the encoding process, followed by the decoding procedure. The input to the procedure below is the set S ⊆ F_2^b. Let χ_S(z) = ∏_{x ∈ S} (z − x) ∈ F_q[z] be a univariate polynomial with the indeterminate z, where there is an injective mapping of length-b bitstrings onto elements of the field F_q and q > 2^b; χ_S(z) is also known as the characteristic polynomial of the set S. Let h_k be a hash which takes as input a subset of F_2^b and outputs ⌊k⌋ bits. Let K = ⌈log log(d)⌉ and suppose we have the subsets U_{1,1}, U_{1,2}, ..., U_{1,ℓ}, U_{2,1}, U_{2,2}, ..., U_{2,ℓ}, ..., U_{K+1,1}, U_{K+1,2}, ..., U_{K+1,ℓ}, where for any i ∈ [K+1], U_{i,1}, U_{i,2}, ..., U_{i,ℓ} is a partition of the space F_2^b. For shorthand, let S(i, j) denote the elements of S ⊆ F_2^b that are in U_{i,j}; that is, S(i, j) refers to bin j at layer i. Let c_1, c_2 (discussed later) be such that 0 < c_1 < 1 < c_2. For i ∈ [K], let E_i = {e_{i,1}, e_{i,2}, ..., e_{i,|E_i|}} ⊆ F_q, with |E_i| = ⌊f^{i−1} · c_2 log(d)⌋, denote a set of evaluation points for the characteristic polynomials with index i. Further, ensure the evaluation points in E_i do not include any elements of F_q mapped to by length-b bitstrings. Similarly define E_{K+1} such that |E_{K+1}| = ⌊log(d)⌋. Let H = ⌊c_3 log d⌋ · b for some c_3 > 0.

We first perform the encode operation, shown below, on both Host A and Host B.

Algorithm 1 Encode
Input: S ⊆ F_2^b, E_i, H
1: for i ∈ [K] and j ∈ [ℓ] do
2:   Compute χ_{S(i,j)}(z), h^{i,j} = h_{⌊f^{i−1}H⌋}(S(i, j)).
3:   Evaluate {χ_{S(i,j)}(e) | e ∈ E_i} = χ^{i,j}.
4: end for
5: for j ∈ [ℓ] do
6:   Compute χ_{S(K+1,j)}(z).
7:   Evaluate {χ_{S(K+1,j)}(e) | e ∈ E_{K+1}} = χ^{K+1,j}.
8: end for
Output: χ^{i,j}, h^{i,j}, χ^{K+1,j} for i ∈ [K] and j ∈ [ℓ]

Let χ_A^{i,j}, h_A^{i,j}, χ_A^{K+1,j} and χ_B^{i,j}, h_B^{i,j}, χ_B^{K+1,j} be the results of performing Encode on Host A and B, respectively. First, Host A sends χ_A^{i,j}, h_A^{i,j}, χ_A^{K+1,j} for i ∈ [K] and j ∈ [ℓ]. Since χ^{i,j} requires f^{i−1} · c_2 log(d) · b bits of storage, h^{i,j} requires f^{i−1} · c_3 log(d) · b bits of storage, and PI requires db bits of information exchange, notice that this requires at most

  Σ_{i=0}^{K} (f)^i · (c_2 log(d) + c_3 log(d)) · b · (d/log(d)) + (d/log(d)) · log(d) · b ≤ (1/(1−f)) · (c_2 + c_3) · db + db = O(db)

bits of information exchange. Similarly, as PI has encode complexity O(d), our encode complexity is O(d).

The ability to modify an already encoded χ_S^{i,j} to add or remove elements is possible in linear time [8]; it is shown below as Algorithm 2. D_1 and D_2 will designate the sets of elements we want to add and remove, respectively.

Algorithm 2 Modify
Input: D_1, D_2 ⊆ S(i, j) such that D_1 ∩ D_2 = ∅, χ_S^{i,j}, E_i
1:   Compute χ*_{D_1,D_2}(z) = χ_{D_1}(z) / χ_{D_2}(z)
2:   Evaluate {χ_{S(i,j)}(e) · χ*_{D_1,D_2}(e) | e ∈ E_i} = χ_{S△D}^{i,j}
Output: χ_{S△D}^{i,j}

Let PID(χ_{S_1}, χ_{S_2}, E), E ⊆ F_q, denote the standard PI decode algorithm, where E is the evaluation set and |E| = M; recall that it outputs S_1 △ S_2 if |S_1 △ S_2| ≤ M, the threshold at which the two χ's were originally encoded [8]. Using Algorithm 2, we can potentially decrease the symmetric difference between the encoded sets and thus improve the likelihood for PID to successfully decode, given knowledge of some of the elements in the symmetric difference. We now present in Algorithm 3 the decode algorithm, which is performed on Host B and which calls the modify algorithm discussed earlier. Recall that PI has decode complexity O(M^3). Thus

our algorithm has decode complexity O(d · log(d)²):

  Σ_{i=0}^{K} ( (f)^i · [c_2 log(d)]³ + c_3 log(d) ) · (d/log(d)) + (d/log(d)) · log(d)³ = O(d · log(d)²).

Algorithm 3 Decode
Input: χ_A^{i,j}, h_A^{i,j}, χ_A^{K+1,j}, χ_B^{i,j}, χ_B^{K+1,j}, S_B, |E_i|
1: Initialize D = ∅.
2: for i ← 1 to K + 1 do
3:   Compute D_1 = D ∩ S_B, D_2 = D \ S_B
4:   for j ← 1 to ℓ do
5:     Compute χ_{A△D}^{i,j} = Modify(D_1(i, j), D_2(i, j), χ_A^{i,j}, E_i)
6:     Compute D′ = PID(χ_{A△D}^{i,j}, χ_B^{i,j}, E_i)
7:     if i = K + 1 or h_{⌊f^{i−1}H⌋}(S_B(i, j) △ (D ∪ D′)) = h_A^{i,j} then
8:       Update D = D ∪ D′.
9:     end if
10:   end for
11: end for
Output: D

In the following, we discuss the intuition behind our approach. Since we partitioned the universe (of size n = 2^b) into ℓ random bins, if there are d elements in the symmetric difference, then each bin is expected to contain N·p = log(d) elements from the symmetric difference, as the per-bin count follows a binomial distribution B(N, p) where N = n·log(d)/d and p = d/n. The variance is (n·log(d)/d) · (d/n) · (1 − d/n) < log(d), and thus σ < √log(d). Thus for any fixed 0 < c_1 < 1 < c_2, for large enough d, at least half of the total bins will contain between c_1·M and c_2·M elements from the set difference. Using PI, we can recover at least (1/2) · ℓ · (c_4·M) > 0 elements for some c_4 > 0, that is, a fixed positive fraction of the total symmetric difference. Iterating, we can recover at least that same fraction of the remaining symmetric difference at each layer until all the elements in the symmetric difference have been recovered. Under certain conditions, the (K+1)-th layer then has a sufficiently high threshold to recover the rest.

The following theorem shows that for large d (and correspondingly large n = 2^b, as trivially n ≥ 2d) the probability of our algorithm successfully recovering the symmetric difference tends to 1. The proof is a straightforward application of Chernoff bounds. While our earlier treatment of our algorithm dealt with general parameters, for the theorem below we chose the following fixed values to simplify our proof: c_1 = 1/3, c_2 = 7/4, c_3 = 1/4, f = 1/6.

Theorem 1. Suppose n = 2^b ≥ 2d and d ≥ 1024. Let S_A, S_B ⊆ F_2^b where for any x ∈ F_2^b, Pr(x ∈ S_A △ S_B) = d/n. Then the probability that the algorithm fails to recover the symmetric difference, denoted Pr(F), is at most

  Pr(F) ≤ (log d / log log d) · (d / log d) · ( 2·(0.999)^((3/4)·log²(d)) + 2^(−(3/4)·log(d)·b) )

so that Pr(F) → 0 as d → ∞.

The above discussion forms the theoretical basis for our algorithm, but a number of parameters are up for selection in implementation. Next we motivate our selection of these parameters and define our implementation.

The full algorithm has multiple layers. In a given layer, we want M to be small and l to be large, since each call of PID has complexity O(M³). In choosing l to be roughly d/log(d), we expect roughly log(d) elements of the symmetric difference per bin at the first layer. While the distribution of this balls-and-bins problem means that some bins will have more balls than log(d), by setting M slightly above that value, we expect a reasonable number of bins to successfully decode. Moreover, we expect a reasonable number of elements of the symmetric difference to be in successfully decoded bins. If we adjust M appropriately from layer to layer, we can maintain a set fractional decoding in every layer; our goal is half.

Initial values of M, l, K are determined by a heuristic, as is the value of M at each layer. Pseudocode for our implementation and both heuristics is below.

Algorithm 4 Simulation Implementation
Input: S_A, S_B, d
1: M_1, l, K ← Heuristic1(d)
2: for i ← 1 to K do
3:   for all x ∈ F_2^b do
4:     Assign x to U_{i,j} uniformly at random, j ∈ [1, l]
5:   end for
6:   Pick E_i such that |E_i| = M_i
7:   M_{i+1} ← Heuristic2(M_i)
8: end for
9: E_{K+1} ← ∅
10: χ_A^{i,j}, h_A^{i,j}, χ_A^{K+1,j} ← Encode(S_A, E_i, H)
11: χ_B^{i,j}, h_B^{i,j}, χ_B^{K+1,j} ← Encode(S_B, E_i, H)
12: D ← Decode(χ_A^{i,j}, h_A^{i,j}, ∅, χ_B^{i,j}, ∅, S_B, |E_i|)
Output: D

Algorithm 5 Heuristic1
Input: d
1: if d ≤ 3 then
2:   d ← 3 + 1
3: end if
4: M ← ⌈log_3(d)⌉
5: l ← ⌈d / log_3(d)⌉
6: K ← ⌈log_2(M)⌉
7: if K = 1 then
8:   M ← M + 1
9: end if
Output: M, l, K
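The PID primitive that each layer invokes rests on a cancellation property of characteristic polynomials: elements common to both sets cancel in the ratio χ_{S_A}(e)/χ_{S_B}(e), leaving only the symmetric difference. The following toy sketch illustrates that property; the prime q = 10007, the sets, and the evaluation points are illustrative choices of ours, not values from the paper.

```python
Q = 10007  # a small prime standing in for q > 2^b (illustrative choice)

def char_poly_evals(S, E):
    """Evaluate chi_S(z) = prod_{x in S} (z - x) over F_q at each point of E."""
    out = []
    for e in E:
        v = 1
        for x in S:
            v = v * (e - x) % Q
        out.append(v)
    return out

def inv(a):
    """Multiplicative inverse in F_q via Fermat's little theorem."""
    return pow(a, Q - 2, Q)

# Two hosts' sets; evaluation points chosen outside the element range,
# mirroring the rule that E avoids points mapped to by bitstrings.
SA, SB = {3, 5, 9, 14}, {3, 5, 11, 14}
E = [101, 102, 103]

ratio = [a * inv(b) % Q for a, b in zip(char_poly_evals(SA, E),
                                        char_poly_evals(SB, E))]
# The shared elements {3, 5, 14} cancel, so pointwise the ratio equals
# chi_{SA \ SB}(e) / chi_{SB \ SA}(e):
expected = [a * inv(b) % Q for a, b in zip(char_poly_evals(SA - SB, E),
                                           char_poly_evals(SB - SA, E))]
assert ratio == expected
```

PID goes further than this sketch: given enough evaluation points of the ratio, it interpolates the rational function χ_{S_A\S_B}(z)/χ_{S_B\S_A}(z) and factors out the symmetric difference [8]; the sketch only demonstrates the cancellation that makes this possible.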

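The balls-and-bins concentration that the layered analysis relies on is easy to check empirically. The small simulation below (the value of d, the random seed, and the 1.1 factor on M are illustrative choices of ours, not parameters from the paper) throws d symmetric-difference elements into l ≈ d/log(d) bins and measures the fraction of bins whose load lands in [c_1·M, c_2·M].

```python
# Empirical check of the balls-and-bins intuition: with l ~ d/log(d) bins,
# per-bin symmetric-difference counts concentrate around log(d).
import math
import random

random.seed(1)                    # fixed seed so the run is reproducible
d = 1024                          # illustrative size of the symmetric difference
l = round(d / math.log(d))        # number of bins per layer
M = math.ceil(1.1 * math.log(d))  # threshold slightly above the mean bin load
c1, c2 = 1 / 3, 7 / 4             # the constants fixed in Theorem 1

counts = [0] * l
for _ in range(d):                # drop each difference element into a random bin
    counts[random.randrange(l)] += 1

frac = sum(c1 * M <= c <= c2 * M for c in counts) / l
print(f"fraction of bins within [c1*M, c2*M]: {frac:.2f}")
assert frac >= 0.5                # the fraction the layered analysis relies on
```

On typical runs well over half of the bins fall in range, consistent with the claim that each layer can decode a fixed positive fraction of the remaining symmetric difference.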
Algorithm 6 Heuristic2
Input: M
1:   M ← ⌈M/2⌉ + 1
Output: M

Our goal was to decode at least half of the elements in each layer, and our simulation results show we met (and usually far exceeded) that goal in almost every run. Thus, we decreased the value of M by roughly half each layer until it hit 1, which means K ~ log log d. In practice K will be very small, ≤ 3 in all cases we considered. Thus we developed our two heuristics to compensate for the small number of layers. Heuristic1 targets especially small values of d, which would only have a single layer. Heuristic2 considers cases where there are multiple layers. In both cases we slightly boost M in order to provide faster convergence, since K is small. In simulation, the algorithm was already fairly successful at decoding, so we chose to skip the (K+1)-th layer. In Algorithm 4 we denote this by letting |E_{K+1}| = 0, so the encoding in layer K+1 would be empty, and we explicitly enter ∅ into the (K+1)-th layer's inputs for Decode.

III. SIMULATION

To test our algorithm, we generated synthetic data and had our algorithm attempt to find the symmetric difference while recording certain metrics to gauge performance. We tracked the fraction of the symmetric difference that successfully decoded at the end of each layer, as well as the cost in space to transmit and in decode computation. We then ran IBF against the exact same synthetic data sets, with the same amount of space.

A. Synthetic Data

The universe Ω of possible set elements is of size |Ω| = 10000. The set S_A is of size |S_A| = 3000 and was chosen uniformly at random from Ω. The size of the symmetric difference between S_A and S_B, d, is the parameter we varied, from 1...1000. S_B was constructed by modifying S_A to achieve d. For each value of d, we ran the simulation R = 10000 times.

B. PI Implementation and Performance

For PI, successful decoding means the entire symmetric difference is recovered. Failure means potentially nothing is recovered. Recall decoding has O(M³) complexity; to save processing time, we implemented a blackbox that reports total success or failure based on the symmetric difference versus the threshold.

The nuance is that when PI fails, you need to detect the failure. Within the paper [8], they include additional evaluation points specifically to detect failure. This increase of the threshold above M impacts the complexity of encode and decode as well as the space requirements. The underlying assumption is that this increase is small compared to M. In our approach, we try to keep M as small as possible, so this method would have a dramatic effect. In contrast to [8], we utilize an independent hash function of size H. This changes the space required to M·b + H, and the probability of not detecting failure to 2^(−H). We use H = 10 for all layers, which impacts the space required by our algorithm. We still assume our blackbox has a perfect ability to detect failure.

C. IBF Parameters

For the IBF algorithm, we used the implementation of the Invertible Bloom Lookup Table (IBLT) from [12]. IBLT is more general than IBF in that it encodes a key-value pair [4]. By encoding just the key, you get an IBF.

An IBF consists of m cells, each tracking three values: a count, a sum of IDs, and a sum of hashes of IDs. As parameters for the IBF, one needs to choose m, how much space to allocate to the count and each of the two sums, and k, the number of different cells each element gets hashed to. In practice k = 3 or 4 usually works well [2], so we chose k = 3. The IDs are the binary representation of elements in Ω and hence require b bits of space, and the hash permutes elements of Ω. Binary XOR sums of such values likewise require b bits of space.

If N is the number of elements, Nk/m is the expected number of elements that get hashed to a given cell. We assign the count c = ⌈log_2(Nk/m)⌉ + 1 bits of space. Thus each cell uses 2b + c bits of space, making the total space used by the IBF algorithm m(2b + c). Let P be the number of bits of space used by the PI-based Algorithm 4. We choose m = ⌈P / (2b + c)⌉. Since c and m depend on each other, we iteratively update them based on these formulas until they reach a stable point. Further, we force c ≥ 2. Thus the amount of space used by IBF will be ≥ P.

D. Second Simulation

We ran a second simulation where we fixed the amount of space allowed (as if d = 50) but changed the true value of d. We used |Ω| = 1000, |S_A| = |S_B| = 300, R = 1000, and we ran the simulation across d = 40...130. The purpose was to investigate how resilient both algorithms were when the estimate for the symmetric difference was incorrect.

IV. RESULTS

Figure 2a shows that the space requirement of Algorithm 4 increases roughly linearly, with jumps due to the ceiling functions. Figure 2b shows that compute time for decode increases faster than linearly, but notably is less than cubic. Figures 2c, 2d show the distribution of results for two example values of d, 18 and 675 respectively. More specifically, they show the fraction of the symmetric difference that was successfully recovered at the end of a run, tallying over 10,000 runs, and displaying that distribution for both algorithms. These values of d were chosen as they showcased examples where Algorithm 4 had relatively more runs fail to fully decode and thus resulted in more interesting distributions.

Figures 2e, 2f, 2g compare Algorithm 4's performance to IBF's across d = 1...1000. In Figure 2e, we tally the

number of runs that successfully decoded the entire symmetric difference. Note the large spikes correspond to the small jumps in space used. In Figures 2f, 2g, we count the fraction of the symmetric difference that successfully decoded, averaged across certain runs: respectively, over all runs, and excluding runs that were 100% successful (if all runs are excluded, the average is set to 1).

Figures 2h, 2i, 2j show the results of the second simulation, where the space used by both algorithms was fixed and we varied the value of d = 40...130. Recall this simulation had 1000 runs per value of d. Figure 2h tallies the number of runs that successfully decode the entire symmetric difference. Figures 2i, 2j show the fraction of the symmetric difference decoded averaged over all runs and excluding 100% successful runs, respectively.

Fig. 2: (a)-(j) Different performance metrics between Algorithm 4 (Alg) and IBF. The specifics are detailed in the text of the Results section.

V. CONCLUSIONS

Layered PI has several desirable characteristics, particularly from the perspective of Disconnected Intermittent Limited (DIL) environments. Given the same space, our algorithm fully decodes the symmetric difference on average better than IBF, as seen in Figure 2e. However, we feel the more compelling story comes from examining what happens when the symmetric difference is not fully decoded. In that case, our algorithm successfully decodes a very high fraction of the symmetric difference, while IBF performance is notably lower. Figures 2f, 2g make the comparison across different values of d; note that our algorithm's performance is usually in the high 90%'s. Figures 2c, 2d showcase the distributions in more detail for d = 18 and 675, respectively. And while both algorithms operate with a single round of communication, if the decode is not fully successful, our algorithm still manages to recover most of the symmetric difference. This has significant implications if a subsequent round of communication is not possible, and even if it is, it lightens the load for that next round.

In practice, one does not know the exact size of the symmetric difference a priori and must use another technique to estimate its value. PI is extremely sensitive to that estimate. IBF is of course more resilient than PI, but our results show that the Layered PI approach outperforms IBF in this regard. In our second simulation we fixed the space allotted for both algorithms as if d = 50, but then let the actual d vary from 40...130. Performance of the algorithms is captured in Figures 2h, 2i, 2j, which notably show our algorithm to the right of IBF by about 10-20 values of d.

The price our algorithm pays for these advantages comes from the space required and the decode complexity. Recall that PI and IBF require db and O(db) space respectively, while our algorithm requires O(db). All share an encode complexity of O(d). Decode complexity for PI and IBF is O(d³) and O(d) respectively, while for our algorithm it is O(d·log(d)²). However, decoding for our algorithm can be performed in parallel across ℓ = d/log d nodes, and reaggregating the results gives an effective decode complexity of O(d) after parallelization, thus reducing the gap between IBF and our approach.

REFERENCES

[1] D. Chen, C. Konrad, K. Yi, W. Yu, Q. Zhang, "Robust set reconciliation," SIGMOD, 2014.
[2] D. Eppstein, M. Goodrich, et al., "What's the difference? Efficient set reconciliation without prior context," SIGCOMM, 2011.
[3] R. Gabrys and F. Farnoud, "Reconciling similar sets of data," ISIT, 2015.
[4] M. T. Goodrich and M. Mitzenmacher, "Invertible Bloom Lookup Tables," ArXiv e-prints, 2011.
[5] D. Guo and M. Li, "Set reconciliation via counting Bloom filters," IEEE Trans. Knowledge and Data Eng., 2013.
[6] M. Karpovsky, L. Levitin, and A. Trachtenberg, "Data verification and reconciliation with generalized error-control codes," IEEE Trans. Info. Theory, July 2003.
[7] Y. Minsky, A. Trachtenberg, "Practical set reconciliation," Tech. Rep., Dept. of Electrical and Computer Eng., Boston Univ., 2002.
[8] Y. Minsky, A. Trachtenberg, R. Zippel, "Set reconciliation with nearly optimal communication complexity," IEEE Trans. Inform. Theory, vol. 49, no. 9, pp. 2213-2218, Sept. 2003.
[9] I. R. Porche III, B. Wilson, E. E. Johnson, S. Tierney, E. Saltzman, Data Flood, RAND Corporation, 2014.
[10] V. Skachek and M. G. Rabbat, "Subspace synchronization: a network-coding approach to object reconciliation," ISIT, 2014.
[11] P. B. Symon and A. Tarapore, "Defense intelligence analysis in the age of big data," Joint Force Quarterly, #79, pp. 4-11, 2015.
[12] https://github.com/jesperborgstrup/Py-IBLT