
Fast Set Intersection in Memory


Bolin Ding
University of Illinois at Urbana-Champaign
201 N. Goodwin Avenue, Urbana, IL 61801, USA
[email protected]

Arnd Christian König
Microsoft Research
One Microsoft Way, Redmond, WA 98052, USA
[email protected]

ABSTRACT

Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets with n elements in total, we show how to compute their intersection in expected time O(n/√w + kr), where r is the intersection size and w is the number of bits in a machine-word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state-of-the-art techniques for both synthetic and real data sets and workloads.

1. INTRODUCTION

Fast processing of set intersections is a key operation in many query processing tasks in the context of databases and information retrieval. For example, in the context of databases, set intersections are used in various forms of data mining, text analytics, and the evaluation of conjunctive predicates. They are also the key operations in enterprise and web search.

Many of these applications are interactive, meaning that the latency with which query results are displayed is a key concern. It has been shown in the context of search that query latency is critical to user satisfaction, with increases in latency directly leading to fewer search queries being issued and higher rates of query abandonment [10, 17]. As a consequence, significant portions of the sets to be intersected are often cached in main memory.

This paper studies the performance of set intersection algorithms for main-memory resident data. Note that these techniques are also relevant in the context of large disk-based (inverted) indexes, when large fractions of these reside in a main-memory cache. There has been considerable study of set intersection algorithms in information retrieval (e.g., [12, 4, 11]). Most of these papers assume that the underlying data structure is an inverted index [23]. Much of this work (e.g., [12, 4]) focuses on adaptive algorithms which use the number of comparisons as a measure of overhead. For in-memory data, additional structures which encode additional skipping-steps [18], tree-based structures [7], or hash-based algorithms become possible, which often outperform inverted indexes; e.g., using hash-based dictionaries, intersecting two sets L1, L2 requires expected time O(min(|L1|, |L2|)), which is a factor of Θ(log(1 + max(|L1|/|L2|, |L2|/|L1|))) better than the best possible worst-case performance of comparison-based algorithms [6].

In this work, we propose new set intersection algorithms aimed at fast performance. These outperform the competing techniques for most inputs and are also robust in that – for inputs where they are not optimal – they are close to the best-performing algorithm. The tradeoff for this gain is a slight increase in the size of the data structures, when compared to an inverted index; however, in user-facing scenarios where latency is crucial, this tradeoff is often acceptable.

1.1 Contributions

Our approach leverages two key observations: (a) If w is the size (in bits) of a machine-word, we can encode a set from a universe of w elements in a single machine word, allowing for very fast intersections. (b) For the data distributions seen in many real-life examples (in particular search applications), the size of intersections is typically much smaller than the smallest set being intersected.

To illustrate the second observation, we analyzed the 10K most frequent queries issued against the Bing Shopping portal. For 94% of all queries, the size of the full intersection was at least one order of magnitude smaller than the document frequency of the least frequent keyword; for 76% of the queries, the difference was two orders of magnitude. By exploiting these two observations, we make the following contributions.

(i) We introduce linear-space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. Given k sets with n elements in total, these data structures allow us to compute their intersection in expected time O(n/√w + kr), where r is the size of the intersection and w is the number of bits in a machine-word; when the size of the intersection is an order of magnitude (or more) smaller than the size of the smallest set being intersected, our approach yields significant improvements in execution time over previous approaches.

To the best of our knowledge, the best asymptotic bound for fast set intersection is achieved by the O(n(log w)²/w + kr) algorithm of [6]. However, note that this bound relies on a large value of w; in practice, w is small (and constant), and w < 2^16 = 65536 bits implies 1/√w < (log w)²/w. More importantly, [6] requires complex bit-manipulation, making it slow in practice, which we will demonstrate empirically in Section 4.

(ii) We describe a much simpler algorithm that computes the intersection in expected O(n/α^m + mn/√w + kr√w) time, where α is a constant determined by w, and m is a parameter. This algorithm has weaker guarantees in theory, but performs better in practice, and gives significant improvements over the various data structures typically used, while being very simple to implement.
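To make observation (a) concrete, here is a minimal C sketch (ours, not code from the paper) of the single-word encoding it refers to: a set drawn from a universe of w = 64 elements occupies one 64-bit word, and intersection is a single bitwise-AND. The popcount builtin is a GCC/Clang assumption.

```c
#include <stdint.h>
#include <stdio.h>

/* A set over the universe {0, ..., 63} in one 64-bit word:
   bit y is set iff element y is in the set. */
typedef uint64_t WordSet;

static WordSet ws_add(WordSet s, unsigned y) { return s | (1ULL << y); }

int main(void) {
    WordSet a = 0, b = 0;
    a = ws_add(ws_add(ws_add(a, 3), 17), 42);   /* A = {3, 17, 42} */
    b = ws_add(ws_add(b, 17), 42);              /* B = {17, 42}    */
    WordSet c = a & b;                          /* A ∩ B in one AND */
    printf("|A intersect B| = %d\n", __builtin_popcountll(c)); /* prints 2 */
    return 0;
}
```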

2. BACKGROUND AND RELATED WORK

Algorithms based on Ordered Lists: Most work on set intersection focuses on ordered lists as the underlying data structure, in particular algorithms using inverted indexes, which have become the standard data structure in information retrieval. Here, documents are identified via a document ID, and for each term t, the inverted index stores a sorted list of all document IDs containing t.

Using this representation, two sets L1, L2 of similar sizes (i.e., |L1| ≈ |L2|) can be intersected efficiently using a linear merge by scanning both lists in parallel, requiring O(|L1| + |L2|) operations (the "merge step" in merge sort). This approach is wasteful when set sizes differ significantly or only small fractions of the sets intersect. For very different set sizes, algorithms have been proposed that exploit this asymmetry, requiring at most log( (|L1|+|L2|) choose |L1| ) + |L1| comparisons (for |L1| < |L2|) [16].

To improve the performance further, there has recently been significant work on so-called adaptive set-intersection algorithms [12, 4, 13, 1, 2, 5]. These algorithms use the total number of comparisons as the measure of the algorithm's complexity and aim to use a number of comparisons as close as possible to the minimum number ideally required to establish the intersection. However, the resulting reduction in the number of comparisons does not necessarily translate into performance improvements in practice: for example, in [2], binary-search-based algorithms outperform a parallel scan only when |L2| > 20|L1|, even though several times fewer comparisons are needed.

Hierarchical Representations: There are various algorithms for set intersections based on variants of balanced trees (e.g. [9], treaps [7], and skip-lists [18]), computing the intersection of (preprocessed) sets L1, L2 in O(|L1| log(|L2|/|L1|)) operations (for |L1| < |L2|). However, while some form of "skipping" is commonly used as part of algorithms based on inverted indexes, skip-lists (or trees) are typically not used in the scenarios outlined above (with static set data) due to the required space-overhead. A novel and compact two-level representation of posting lists aimed at fast intersections in main memory was proposed in [19].

Algorithms based on Hashing: Using a hash-based representation of sets can speed up the intersection of sets L1, L2 with |L1| ≪ |L2| significantly (expected time O(|L1|), by looking up all elements of L1 in the hash-table of L2); however, because of the added indirection, this approach performs poorly for less skewed set sizes. A new hashing-based approach is proposed in [6]: here, the elements in sets L1, L2 are mapped using a hash-function h to smaller (approximate) representations h(L1), h(L2). These representations are then intersected to compute H = h(L1) ∩ h(L2). Finally, the set of all elements in the original sets that map to H via h is computed and any "false positives" are removed. As the hashed images h(L1), h(L2) to be intersected are smaller than the original sets (using fewer bits), they can be intersected more quickly. Given k sets of total size n, their intersection can be computed in expected time O(n(log w)²/w + kr), where r = |∩_i Li|.

Score-based pruning: In many IR engines it is possible to avoid computing full intersections by leveraging scoring functions that are monotonic in the individual term-wise scores; this makes it possible to terminate the intersection processing early using approaches such as TA [15] or document-at-a-time (DAAT) processing (e.g., [8]). However, in practice, this is often not possible, either because of the complexity of the scoring function (e.g., non-monotonic machine-learning based ranking functions) or because the full intersection results are required. Our approach is based on partitioning the elements in each set into very small (≈8 elements) groups, for which we have fast intersection schemes. Hence, DAAT-style approaches can be combined with our work by using these small groups in place of individual documents.

Set intersections using multiple cores: Techniques that exploit multi-core architectures to speed up set intersections are described in [20, 22]. The use of multiple cores is orthogonal to our approach in the sense that our algorithms can be parallelized for these architectures as well; however, this is beyond the scope of our paper.

3. OUR APPROACH

Notation: We are given a collection of N sets S = {L1, ..., LN}, where Li ⊆ Σ and Σ is the universe of elements in the sets; let ni = |Li| denote the size of set Li. Suppose the elements in a set are ordered, and for a set L, let inf(L) and sup(L) be the minimum and maximum elements of L, respectively. We use w to denote the size (number of bits) of a word on the target processor. Throughout the paper we use log to denote log2. Finally, we use [w] to denote the set {1, ..., w}. Our approach can be extended to bag semantics by additionally storing element frequencies.

Framework: Our task is to design data structures such that the intersection of multiple sets can be computed efficiently. We differentiate between a pre-processing stage, during which we reorganize each set and attach additional index structures, and an online processing stage, which uses the pre-processed data structures to compute intersections. An intersection query is specified via a collection of k sets L1, L2, ..., Lk (to simplify notation, we use the offsets 1, 2, ..., k to refer to the sets in a query throughout this section); our goal is to compute L1 ∩ L2 ∩ ... ∩ Lk efficiently. Note that pre-processing is typical of most non-trivial data structures used for computing set intersections; even building simple non-compressed inverted indexes requires sorting the posting lists as a pre-processing step. We require the pre-processing stage to be time/space-efficient in that it does not require more than O(ni log ni) time (necessary for sorting) and linear space O(ni).

The size of the intersection, |L1 ∩ L2|, is a lower bound on the time needed to compute the intersection. Our method leverages two key ideas to approach this lower bound: (i) The intersection of two sets in a small universe can be computed very efficiently; in particular, if the two sets are subsets of {1, 2, ..., w}, we can encode them as single machine-words and compute their intersection using a bitwise-AND. (ii) A small number of elements in a large universe can be mapped into a small universe.

[Figure 1: Algorithmic Framework — each set Li is partitioned via sorting/hashing into small groups L_i^1, L_i^2, ..., L_i^p, ...; a hash function h : Σ → [w] maps each group to an image h(L_i^p) stored as a machine word; L_1^p ∩ L_2^q is computed with the help of h(L_1^p) ∩ h(L_2^q), where |h(L_1^p) ∩ h(L_2^q)| ≈ |L_1^p ∩ L_2^q|.]

We leverage these two ideas by first partitioning each set Li into smaller groups L_i^j, which are intersected separately. In the pre-processing stage, we map each small group into a small universe [w] = {1, 2, ..., w} using a universal hash function h and encode the image h(L_i^j) with a machine-word. Then, in the online processing stage, to compute the intersection of two small groups L_1^p and L_2^q, we first use a bitwise-AND operation to compute H = h(L_1^p) ∩ h(L_2^q), and then try to "recover" L_1^p ∩ L_2^q from H using the inverse mapping h^{-1}. The union of the L_1^p ∩ L_2^q's forms L1 ∩ L2. Moreover, if the intersection L1 ∩ L2 is small compared to |L1| and |L2| (as seen in practice), a large fraction of the small groups with overlapping ranges has an empty intersection; thus, by using the word-representations of H to detect these groups quickly, we can skip much unnecessary computation, resulting in significant speed-up. The resulting algorithmic framework is illustrated in Figure 1. Given this overall approach, the key questions become how to form the groups, what structures to use to represent them, and how to process intersections of these small groups. We discuss these details in the following sections. All formal proofs of analytical results are deferred to the appendix.
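The following C fragment is a small illustrative sketch of the filtering step just described (our own, with an arbitrary stand-in mixing function rather than the paper's universal hash family): each group stores the word representation of its hash image, and a zero bitwise-AND proves two groups are disjoint, while a nonzero AND still has to be verified via the inverse mappings.

```c
#include <stdint.h>
#include <stddef.h>

enum { W = 64 };  /* word width; images live in {0, ..., 63} */

/* Stand-in for a universal hash function h : Σ -> [w]. */
static unsigned h(uint64_t x) {
    return (unsigned)((x * 0x9E3779B97F4A7C15ULL) >> 58);  /* top 6 bits */
}

/* Word representation of a group's image under h. */
static uint64_t group_image(const uint64_t *group, size_t len) {
    uint64_t img = 0;
    for (size_t i = 0; i < len; i++)
        img |= 1ULL << h(group[i]);
    return img;
}

/* Zero AND => the groups are certainly disjoint and can be skipped;
   a nonzero AND may be a false positive and must still be verified. */
static int groups_certainly_disjoint(uint64_t img1, uint64_t img2) {
    return (img1 & img2) == 0;
}
```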

3.1 Intersection via Fixed-Width Partitions

We first consider the case where there are only two sets L1 and L2 in the intersection query. We present a pair of pre-processing and online processing algorithms, which we use to illustrate the basic ideas of our algorithms. We subsequently refine and extend our techniques to k sets in Section 3.2.

In the pre-processing stage, L1 and L2 are sorted and partitioned into groups (recall that w is the word width)

L_1^1, L_1^2, ..., L_1^⌈n1/√w⌉, and L_2^1, L_2^2, ..., L_2^⌈n2/√w⌉

of equal size √w (except the last ones).

In the online processing stage (Algorithm 1), the small groups are scanned in order. If the ranges of L_1^p and L_2^q overlap, we may have L_1^p ∩ L_2^q ≠ ∅. The intersection L_1^p ∩ L_2^q of each pair of overlapping groups is computed (line 8) in some iteration, and finally the union of all these intersections is L1 ∩ L2. Since each group is scanned once, lines 2-10 repeat for O((n1 + n2)/√w) iterations.

The major remaining question is how to compute L_1^p ∩ L_2^q efficiently with proper pre-processing. For this purpose, we map each group L_1^p or L_2^q into a small universe for fast intersection, and we leverage single-word representations to store and manipulate sets from a small universe.

Single-Word Representation of Sets: We represent a set A ⊆ [w] = {1, 2, ..., w} using a single machine-word of width w by setting the y-th bit to 1 iff y ∈ A. We refer to this as the word representation w(A) of A. For two sets A and B, the bitwise-AND w(A) ∧ w(B) (computed in O(1) time) is the word representation of A ∩ B. Given a word representation w(A), all the elements of A can be retrieved in linear time O(|A|), using the following well-known technique (⊕ is bitwise-XOR): (i) lowbit = ((w(A) − 1) ⊕ w(A)) ∧ w(A) is the lowest 1-bit of w(A); for the smallest element y in A we have 2^y = lowbit, and y = log(lowbit) can be computed using a machine instruction such as NLZ (number of leading zeros) or pre-computed lookup tables. (ii) Set w(A) to w(A) ⊕ lowbit and repeat (i) to scan the next smallest element, until w(A) becomes 0. In the rest of this paper, if A ⊆ [w], we use A to denote both a set and its word representation.
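In C, the retrieval technique above can be written as follows (a sketch; we substitute GCC/Clang's __builtin_ctzll for the NLZ instruction or lookup table mentioned in the text):

```c
#include <stdint.h>
#include <stdio.h>

/* Enumerate the elements of a word representation w(A) in increasing
   order, clearing the lowest 1-bit in each step. */
static void enumerate_wordset(uint64_t wa) {
    while (wa != 0) {
        uint64_t lowbit = ((wa - 1) ^ wa) & wa;         /* lowest 1-bit */
        unsigned y = (unsigned)__builtin_ctzll(lowbit); /* y with 2^y == lowbit */
        printf("%u\n", y);
        wa ^= lowbit;                                   /* remove y, continue */
    }
}
```

This runs in O(|A|) time in total, one iteration per element.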
Pre-processing Stage: The elements of a set Li are sorted as {x_i^1, x_i^2, ..., x_i^{ni}} (i.e., x_i^k < x_i^{k+1}) and Li is partitioned as follows:

L_i^1 = {x_i^1, ..., x_i^{√w}}, L_i^2 = {x_i^{√w+1}, ..., x_i^{2√w}}, ...   (1)
L_i^j = {x_i^{(j−1)√w+1}, x_i^{(j−1)√w+2}, ..., x_i^{j√w}}, ...   (2)

For each small group L_i^j, we compute the word-representation of its image under a universal hash function h : Σ → [w], i.e., h(L_i^j) = {h(x) | x ∈ L_i^j}. In addition, for each position y ∈ [w] and each small group L_i^j, we also maintain the inverted mapping h^{-1}(y, L_i^j) = {x | x ∈ L_i^j and h(x) = y}; i.e., for each y ∈ [w] we store the elements in L_i^j with hash value y in a short list which supports ordered access. We ensure that the order of these elements is identical across the different h^{-1}(y, L_i^j)'s and Li's; in this way, we can intersect these short lists using a linear merge.

EXAMPLE 3.1. (PRE-PROCESSING AND DATA STRUCTURES) Suppose we have two sets L1 = {1001, 1002, 1004, 1009, 1016, 1027, 1043} and L2 = {1001, 1003, 1005, 1009, 1011, 1016, 1022, 1032, 1034, 1049}, and let w = 16 (√w = 4). For simplicity, h is selected to be h(x) = (x − 1000) mod 16. L1 is partitioned into 2 groups: L_1^1 = {1001, 1002, 1004, 1009}, L_1^2 = {1016, 1027, 1043}, and L2 is partitioned into 3 groups: L_2^1 = {1001, 1003, 1005, 1009}, L_2^2 = {1011, 1016, 1022, 1032}, L_2^3 = {1034, 1049}. We pre-compute: h(L_1^1) = {1, 2, 4, 9}, h(L_1^2) = {0, 11}, h(L_2^1) = {1, 3, 5, 9}, h(L_2^2) = {0, 6, 11}, h(L_2^3) = {1, 2}. We also pre-process the h^{-1}(y, L_i^p)'s: for example, h^{-1}(0, L_1^2) = {1016}, h^{-1}(0, L_2^2) = {1016, 1032}, h^{-1}(11, L_1^2) = {1027, 1043}, and h^{-1}(11, L_2^2) = {1011}.

1: p ← 1, q ← 1, ∆ ← ∅
2: while p ≤ ⌈n1/√w⌉ and q ≤ ⌈n2/√w⌉ do
3:   if inf(L_2^q) > sup(L_1^p) then
4:     p ← p + 1
5:   else if inf(L_1^p) > sup(L_2^q) then
6:     q ← q + 1
7:   else
8:     compute (L_1^p ∩ L_2^q) using IntersectSmall
9:     ∆ ← ∆ ∪ (L_1^p ∩ L_2^q)
10:    if sup(L_1^p) < sup(L_2^q) then p ← p + 1 else q ← q + 1
11: ∆ is the result of L1 ∩ L2

Algorithm 1: Intersection via fixed-width partitioning

Online Processing Stage: The algorithm used to intersect the two sets is shown in Algorithm 1. Since the elements in Li are sorted, Algorithm 1 ensures that if the ranges of any two small groups L_1^p, L_2^q overlap, their intersection is computed (line 8). After scanning all such pairs, ∆ must then contain the intersection of the whole sets.

Now the question is: how do we compute the intersection L_1^p ∩ L_2^q of two small groups efficiently? For this purpose, we introduce the algorithm IntersectSmall (Algorithm 2), which:
(i) first computes H = h(L_1^p) ∩ h(L_2^q) using a bitwise-AND;
(ii) for each (1-bit) y ∈ H, intersects the corresponding inverted mappings using the linear merge algorithm.

IntersectSmall(L_1^p, L_2^q): computing L_1^p ∩ L_2^q
1: Compute H ← h(L_1^p) ∩ h(L_2^q)
2: for each y ∈ H do
3:   Γ ← Γ ∪ (h^{-1}(y, L_1^p) ∩ h^{-1}(y, L_2^q))
4: Γ is the result of L_1^p ∩ L_2^q

Algorithm 2: Computing the intersection of small groups

EXAMPLE 3.2. (ONLINE PROCESSING) Following Example 3.1, to compute L1 ∩ L2, we need to compute L_1^1 ∩ L_2^1, L_1^2 ∩ L_2^2, and L_1^2 ∩ L_2^3 (the pairs with overlapping ranges). For example, to compute L_1^2 ∩ L_2^2, we first compute h(L_1^2) ∩ h(L_2^2) = {0, 11}; then L_1^2 ∩ L_2^2 = ∪_{y=0,11} (h^{-1}(y, L_1^2) ∩ h^{-1}(y, L_2^2)) = {1016}. Similarly, we can compute L_1^1 ∩ L_2^1 = {1001, 1009}. Finally, we find h(L_1^2) ∩ h(L_2^3) = ∅, and thus L_1^2 ∩ L_2^3 = ∅. So we have L1 ∩ L2 = {1001, 1009} ∪ {1016} ∪ ∅.

Note that the word representations and inverted mappings for Li are pre-computed, and word-representations can be intersected using one operation. So the running time of IntersectSmall is bounded by the number of pairs of elements, one from L_1^p and one from L_2^q, that are mapped to the same hash-value. This number can be shown to be equal (in expectation) to the intersection size plus O(1) for each group L_i^j. Using this, we obtain Algorithm 1's running time:

257 p q z1 z2 THEOREM 3.3. Algorithm 1 computes L1 ∩ L2 in expected of L and L overlap; in Algorithm 3, we compute L ∩L (also   1 2 1 2 n1+n2 when z is a t -prefix of z O √ + r time, where r = |L1 ∩ L2|. using Algorithm 2) 1 1 2 (this is a necessary w z1 z2 condition for L1 ∩ L2 6= ∅; so Algorithm 3 is correct). This To achieve a better bound, we optimize the group sizes: with L1 significantly reduces the number of pairs to be intersected. ∗ p ∗ and L2 partitioned into groups of sizes s1 = wn1/n2 and s2 = p t2 wn2/n1, respectively, L1 ∩ L2 can be computed in expected 1: for each z2 0, 1 do ∈ { t} p 2: Let z1 0, 1 1 be the t1-prefix of z2 O( n1n2/w + r) time. A detailed analysis of the effect of group ∈ { z1 } z2 z1 z2 3: Compute L1 L2 using IntersectSmall(L1 ,L2 ) ∩ z z size on running times can be found in Section A.1.1. 4: Let ∆ ∆ (L 1 L 2 ) ← ∪ 1 ∩ 2 Overhead of Pre-processing: If only the bound in Theorem 3.3 is 5: ∆ is the result of L1 L2 required, then to pre-process a set L of size n , it is obvious that ∩ i i Algorithm 3: 2-list Intersection via Randomized Partitioning O(n log n ) time and O(n ) space suffice: we only need to parti- i i i √ Based on the choices of parameters t and t , we can either parti- tion a sorted list into small groups of size w, and for each small 1 2 tion L and L into the same number of small groups (yielding the group, construct the word representation and inverted mapping in 1 2 bound of Theorem 3.5), or into small groups of the (approximately) linear time using the hash function h. p identical sizes (yielding Theorem 3.6). To achieve the better bound O( n1n2/w+r), we need multiple “resolutions” of the partitioning of a set L . This is because, as THEOREM 3.5. Algorithm 3 computes L1 ∩ L2 in expected i  √  ∗ p n1n2 p n1n2 discussed above, the optimal group size s = wn1/n2 of the √ 1 O w + r time (r = |L1∩L2|), with t1 = t2 = dlog w e. set L1 also depends on the size n2 of the set L2 to be intersected with it. For this purpose, we partition a set Li into small groups THEOREM 3.6. Algorithm 3 computes L1 ∩ L2 in expected j   √ of size 2, 4,..., 2 , etc. To compute L1 ∩ L2 for the given two O n1√+n2 + r time (r = |L ∩ L |), using t = dlog(n / w)e ∗ w 1 2 1 1 sets, suppose si is the optimal group size of Li; we then select the √ ∗∗ t ∗ ∗∗ ∗ and t2 = dlog(n2/ w)e. actual group size si = 2 s.t. si ≤ si ≤ 2si , obtaining the same bound. A carefully-designed multi-resolution data structure Note that when n1 6= n2, Theorem 3.5 has a better bound than enabling access to these groups consumes only O(ni) space for Li. Theorem 3.6. But we can extend Theorem 3.6 to k-set intersection. We will describe and analyze this structure in Section 3.2.1. Extension to More Than Two Sets: Suppose we want to compute

Extension to More Than Two Sets: Suppose we want to compute the intersection of k sets L1, ..., Lk, where ni = |Li| and n1 ≤ n2 ≤ ... ≤ nk. Each Li is partitioned into groups L_i^z using g_{ti} : Σ → {0,1}^{ti}. Note that the g_{ti}'s are generated from the same hash function g. We use ti = ⌈log(ni/√w)⌉ and proceed as in Algorithm 4.

Algorithm 4 is almost identical to Algorithm 3, but is generalized to k sets: for each zk ∈ {0,1}^{tk}, we pick the group identifiers zi to be the ti-prefix of zk, and we only intersect groups L_1^{z1}, L_2^{z2}, ..., L_k^{zk} where z1, z2, ..., zk share a prefix of size t1. Also, we extend IntersectSmall (Algorithm 2) to k groups: we first compute the intersection (bitwise-AND) of the hash images (their word-representations) of the k groups L_i^{zi}; and, if the result H = ∩_{i=1}^k h(L_i^{zi}) is not zero, for each (1-bit) y ∈ H, we intersect the k corresponding inverted mappings h^{-1}(y, L_i^{zi}). Details and analysis are deferred to the appendix.

THEOREM 3.7. Using ti = ⌈log(ni/√w)⌉, Algorithm 4 computes the intersection ∩_{i=1}^k Li of k sets in expected O(n/√w + kr) time, where r = |∩_{i=1}^k Li| and n = Σ_{i=1}^k ni = Σ_{i=1}^k |Li|.

1: for each zk ∈ {0,1}^{tk} (ti = ⌈log(ni/√w)⌉) do
2:   Let zi be the ti-prefix of zk, for i = 1, ..., k−1
3:   Compute ∩_{i=1}^k L_i^{zi} using the extended IntersectSmall
4:   Let ∆ ← ∆ ∪ (∩_{i=1}^k L_i^{zi})
5: ∆ is the result of ∩_{i=1}^k Li

Algorithm 4: k-list Intersection via Randomized Partitioning

3.2.1 A Multi-resolution Data Structure

Recall that in some algorithms (e.g., Theorem 3.5), the selection of the number of small groups used for a set Li depends on the (sizes of) the other sets being intersected with Li. So naively pre-computing the required structures for each possible group size would incur excessive space requirements. In this section, we describe a data structure that supports access to partitions of Li into 2^t groups for any possible t, using only O(ni) space. It is illustrated in Figure 2. To support the algorithms introduced so far, this structure must also allow us:
(i) for each L_i^z, to retrieve the word-representation h(L_i^z); and
(ii) for each y ∈ [w], to access all elements in h^{-1}(y, L_i^z) = {x | x ∈ L_i^z and h(x) = y} in time linear in its size |h^{-1}(y, L_i^z)|.

[Figure 2: Multi-Resolution Partition of Li — the elements of Li ordered by g(x), partitioned by g1 : Σ → {0,1}^1, g2 : Σ → {0,1}^2, g3 : Σ → {0,1}^3, ..., gt : Σ → {0,1}^t into nested groups L_i^z; for each y ∈ [w], a pointer to the first element x in L_i^z with h(x) = y; next(x) = the smallest x' ∈ Li (in g-order) such that g(x') > g(x) and h(x') = h(x).]

Multi-resolution Partitioning: For ease of explanation, suppose Σ = {0,1}^w and choose g to be a random permutation of Σ. To pre-process Li, we first order all the elements x ∈ Li according to g(x). Then any small group L_i^z = {x | x ∈ Li and g_t(x) = z} forms a consecutive interval in Li (partitions of different resolutions are formed for t = 1, 2, ...).

Note: in all of our algorithms, universal hash functions and random permutations are almost interchangeable (when used as g) – the differences being that (i) a permutation induces a total ordering of the elements (in this data structure, this property is required), whereas hashing may result in collisions (which we can overcome by using the pre-image to break ties), and (ii) there is a slight difference in the resulting probability of, e.g., elements being grouped together (hashing results in (limited) independence, whereas permutations result in negative dependence – we account for this by using the weaker condition in our proofs).

Word Representations of Hash Mappings: For each small group L_i^z, we need to pre-compute and store the word representation h(L_i^z). Note that the total number of small groups is ni/2 + ni/4 + ... + ni/2^t + ... ≤ ni, so this requires O(ni) space.

Inverted Mappings: We need to access all elements in h^{-1}(y, L_i^z) in order, for each y ∈ [w]. If we were to store these mappings for each L_i^z explicitly, this would require O(ni log ni) space. However, by storing the inverted mappings h^{-1}(y, L_i^z) implicitly, we can do better, as follows. For each group L_i^z, since it corresponds to an interval in Li, we can store its starting and ending positions in Li, denoted by left(L_i^z) and right(L_i^z). These allow us to determine whether an element x belongs to L_i^z. Now, to enable ordered access to the inverted mappings, we define, for each x ∈ Li, next(x) to be the "next" element x' after x with h(x') = h(x) (i.e., the element with minimum g(x') > g(x) such that h(x') = h(x)). Then, for each L_i^z and each y ∈ [w], we store the position first(y, L_i^z) of the first element x'' in L_i^z with h(x'') = y. Now, to access all elements in h^{-1}(y, L_i^z) in order, we start from the element at first(y, L_i^z) and follow the next(x) pointers until passing the right boundary right(L_i^z). In this way, all elements in the inverted mapping are retrieved in the same order as g(x) – which we require for IntersectSmall.

Space Requirements: Over all groups of different sizes, the total space for storing the h(L_i^z)'s, left(L_i^z)'s, right(L_i^z)'s, first(y, L_i^z)'s and next(x)'s is O(ni). So the whole multi-resolution data structure requires O(ni) space. A detailed analysis is in the appendix. When the group size ti depends only on ni (e.g., in Algorithm 4), a "single resolution" suffices in pre-processing, and the above multi-resolution scheme (for selecting t online) is not necessary.

THEOREM 3.8. To pre-process a set Li of size ni for Algorithms 3-4, we need O(ni log ni) time and O(ni) space (in words).

3.3 From Theory to Practice

In this section, we describe a more practical version of our methods. This algorithm is simpler, uses significantly less memory and straight-forward data structures, and, while it has worse theoretical guarantees, is faster in practice. The main difference is that, for each small group L_i^z, we only store the elements in L_i^z and their images under m ≥ 1 hash functions (i.e., we do not maintain inverted mappings, trading off a complex O(1)-access for a simple scan over a short block of data). Also, we use only a single partition for each set Li. Having multiple word representations of hash images (under different hash functions) for each small group allows us to detect empty intersections of small groups with higher probability.

Pre-processing Stage: As before, each set Li is partitioned into groups L_i^z using a hash function g_{ti} : Σ → {0,1}^{ti}. We will show that a good selection of ti is ⌈log(ni/√w)⌉, which depends only on the size of Li. Thus, for each set Li, pre-processing with a single partitioning suffices, saving significant memory. For each group, we compute the word representations of its images under m (independent) universal hash functions h1, ..., hm : Σ → [w]. Note that we only require a small value of m in practice (e.g., m = 2).

Online Processing Stage: The algorithm for computing ∩_i Li used here (Algorithm 5) is identical to Algorithm 4, with two exceptions: (1) When needed, ∩_i L_i^{zi} is computed directly by a linear merge of the L_i^{zi}'s (line 4), using O(Σ_i |L_i^{zi}|) time. (2) We can skip the computation of ∩_i L_i^{zi} if, for some hj, the bitwise-AND of the corresponding word representations hj(L_i^{zi}) is zero (line 3).

1: for each zk ∈ {0,1}^{tk} (ti = ⌈log(ni/√w)⌉) do
2:   Let zi be the ti-prefix of zk, for i = 1, ..., k−1
3:   if ∩_{i=1}^k hj(L_i^{zi}) ≠ ∅ for all j = 1, ..., m then
4:     Compute ∩_{i=1}^k L_i^{zi} by a linear merge of L_1^{z1}, ..., L_k^{zk}
5:     Let ∆ ← ∆ ∪ (∩_{i=1}^k L_i^{zi})
6: ∆ is the result of ∩_{i=1}^k Li

Algorithm 5: "Simple" Intersection via Randomized Partitioning

Analysis: To see why Algorithm 5 is efficient, we observe that if L_1^{z1} ∩ L_2^{z2} = ∅, then with high probability hj(L_1^{z1}) ∩ hj(L_2^{z2}) = ∅ for some j = 1, ..., m. So most empty intersections can be skipped using the test in line 3. With the probability of a "successful filtering" (i.e., given ∩_i L_i^{zi} = ∅, that ∩_i hj(L_i^{zi}) = ∅ for some hash function hj, j = 1, ..., m) bounded by Lemmas A.1 and A.3, we can derive Theorem 3.9. A detailed analysis of this probability (both theoretical and experimental) and of the overall complexity is deferred to Appendix A.5.

THEOREM 3.9. Using ti = ⌈log(ni/√w)⌉, Algorithm 5 computes ∩_{i=1}^k Li in expected O(max(n, k·nk)/α(w)^m + mn/√w + kr√w) time (r = |∩_{i=1}^k Li|, n = Σ_{i=1}^k ni, and α(w) = 1/(1 − β(w)) for the β(w) used in Lemma A.3).
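As an illustration, here is a hedged C sketch (ours, for k = 2) of one iteration of Algorithm 5, using the per-group block of Section 3.3.1 below: the m word-image ANDs act as the filter of line 3, and the fallback is the linear merge of line 4. For simplicity this sketch keeps each group's elements sorted by value; the paper orders them by g(x), and any order works for the merge as long as it is the same in every set.

```c
#include <stdint.h>
#include <stddef.h>

#define M 2  /* number of hash images per group (m in the paper) */

typedef struct {
    uint64_t img[M];       /* word representations h_1(L_i^z), ..., h_m(L_i^z) */
    const uint64_t *elem;  /* the group's elements, in a fixed common order   */
    size_t len;
} Group;

/* Intersect two aligned groups; out must have room for the result.
   Returns the number of common elements written to out. */
static size_t intersect_groups(const Group *a, const Group *b, uint64_t *out) {
    for (int j = 0; j < M; j++)
        if ((a->img[j] & b->img[j]) == 0)
            return 0;                    /* line 3: successful filtering */
    size_t i = 0, k = 0, n = 0;          /* line 4: linear merge         */
    while (i < a->len && k < b->len) {
        if      (a->elem[i] < b->elem[k]) i++;
        else if (a->elem[i] > b->elem[k]) k++;
        else { out[n++] = a->elem[i]; i++; k++; }
    }
    return n;
}
```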

3.3.1 Data Structure for Storing Li

In this section, we describe the simple and space-efficient data structure that we use in Algorithm 5. As stated earlier, we only need to partition Li using one hash function g_{ti}; hence we can represent each Li as an array of small groups L_i^z, ordered by z. For each small group, we store the information associated with it in the structure shown in Figure 3. The first word in this structure stores z = g_{ti}(L_i^z). The second word stores the structure's length len. The following m words store the hash images h1(L_i^z), ..., hm(L_i^z) of L_i^z. Finally, we store the elements of L_i^z as an array in the remaining part. We need ni/√w such blocks for Li in total.

[Figure 3: The Structure for a Pre-processed Small Group L_i^z — one block holding z, len, the m hash-image words h1(L_i^z), ..., hm(L_i^z), and the len elements of L_i^z.]

The first word z can also be computed on-the-fly, as these small groups are accessed sequentially in Algorithm 5. So, if we store len using one word, and one word for each element of L_i^z, then we need m + 1 + |L_i^z| words in total for each group L_i^z, and thus ni(1 + (m + 1)/√w) words to store the pre-processed Li. The overhead of the pre-processing is dominated by the cost of sorting Li (the remaining operations are trivial).

THEOREM 3.10. To pre-process a set Li of size ni for Algorithm 5, we need O(ni(m + log ni)) time and O(ni(1 + m/√w)) (words of) space.

We describe methods for compressing this structure in Appendix B.

3.4 Intersecting Small and Large Sets

An important special case of set intersection are asymmetric intersections, where the sizes n1 and n2 of the sets being intersected vary significantly (w.l.o.g., assume n1 ≪ n2). In this subsection, using the same multi-resolution data structure as in Section 3.2.1, we present an algorithm HashBin that computes L1 ∩ L2 in O(n1 log(n2/n1)) time. This bound is also achieved by previous work, e.g., SmallAdaptive [5], but our algorithm is even simpler in its online processing. It is also known that algorithms based on hash-tables require only O(n1) time for this scenario; however, unlike HashBin, they are ill-suited for less asymmetric cases.

Algorithm HashBin: When intersecting two sets L1 and L2 with sizes n1 ≪ n2, we focus on the partitioning induced by g_t : Σ → {0,1}^t, where t = ⌈log n1⌉ for both sets, and g is a random permutation of Σ. To compute L1 ∩ L2, we compute L_1^z ∩ L_2^z for all z ∈ {0,1}^t and take the union. To compute L_1^z ∩ L_2^z, we iterate over each x ∈ L_1^z and perform a binary search to check whether x ∈ L_2^z, using O(log |L_2^z|) time. This scheme can be extended to multiple sets by searching for x in L_i^z only if it was found in L_1^z, ..., L_{i−1}^z.

THEOREM 3.11. The algorithm HashBin computes L1 ∩ L2 in expected O(n1 log(n2/n1)) time. The pre-processing of a list Li requires O(ni log ni) time and O(ni) space.

The proof of Theorem 3.11 and the way HashBin uses the multi-resolution data structure are deferred to Section A.6 in the appendix. The advantage of HashBin is that, since it is based on the same structure as the algorithm introduced in Section 3.2, we can make the choice between the algorithms online, based on n1/n2.
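A minimal C sketch of HashBin's per-bucket step (ours, not the paper's implementation): all elements with the same t-bit prefix of g(x) form one bucket per set, and each element of the smaller bucket is located in the larger one by binary search. Both buckets are assumed sorted by the same key used in the search.

```c
#include <stdint.h>
#include <stddef.h>

/* Binary search: is x present in the sorted bucket b of size n? */
static int bucket_contains(const uint64_t *b, size_t n, uint64_t x) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (b[mid] < x) lo = mid + 1; else hi = mid;
    }
    return lo < n && b[lo] == x;
}

/* Intersect one pair of aligned buckets (same prefix z); O(n1 log n2). */
static size_t hashbin_bucket(const uint64_t *b1, size_t n1,
                             const uint64_t *b2, size_t n2, uint64_t *out) {
    size_t r = 0;
    for (size_t i = 0; i < n1; i++)
        if (bucket_contains(b2, n2, b1[i])) out[r++] = b1[i];
    return r;
}
```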
4. EXPERIMENTAL EVALUATION

We evaluate the performance and space requirements of four of the techniques described in this paper: (a) the fixed-width partition algorithm described in Section 3.1 (which we refer to as IntGroup); (b) the randomized partition algorithm of Section 3.2 (RanGroup); (c) the simple algorithm based on randomized partitions described in Section 3.3 (RanGroupScan); and (d) the algorithm for intersecting sets of skewed sizes from Section 3.4 (HashBin).

Setup: All algorithms are implemented in C and evaluated on a 4GB 64-bit 2.4GHz PC. We employ a random permutation of the document IDs for the hash function g and 2-universal hash functions for h (or the hj's). For RanGroup, we use m = 4 (the number of hash functions hj), unless noted otherwise.

We compare our techniques to the following competitors: (i) set intersection based on a simple parallel scan of inverted indexes: Merge; (ii) set intersection based on skip lists: SkipList [18]; (iii) set intersection based on hash tables: Hash (i.e., we iterate over the smallest set L1, looking up every element x ∈ L1 in hash-table representations of L2, ..., Lk); (iv) the algorithm of [6]: BPP; (v) the algorithm proposed for fast intersection in inverted indices in main memory [19, 21]: Lookup (using B = 32 as the bucket-size, which is the best value in our and the authors' experience); and (vi) various adaptive intersection algorithms based on binary search/galloping search: SvS, Adaptive [12, 13, 3], BaezaYates [1, 2], and SmallAdaptive [5]. Note that BaezaYates is generalized to handle more than two sets as in [5].

Implementation: For each competitor, we tried our best to optimize its performance. For example, for Merge we tried to minimize the number of branches in the inner loop; we also store postings in consecutive memory addresses to speed up parallel scans and reduce page walks after TLB misses. Our implementation of skip lists follows [18], with simplifications since we focus on static data and do not need fast insertion/deletion. We also simplified the bit-manipulation in BPP [6] so that it works faster in practice for small w. For the algorithms using inverted indexes, we initially do not consider compression of the posting lists, as we do not want the decompression step to impact the performance reported. In Section 4.1 we study variants of the algorithms incorporating compression. With regard to skip-operations in the index, note that since we use uncompressed posting lists, algorithms such as Adaptive can perform arbitrary skips into the index directly.

Datasets: To evaluate these algorithms, we use both synthetic and real data. For the experiments with synthetic datasets, sets are generated randomly (and uniformly) from a universe Σ. The real dataset is a collection of more than 8M Wikipedia pages. In each experiment on the synthetic datasets, 20 combinations of sets are randomly generated, and the average time is reported.

[Figure 4: Varying the Set Size — intersection time (ms) vs. set size (1M-10M) for Merge, SkipList, Hash, IntGroup, BPP, Adaptive, Lookup, and RanGroupScan.]

Varying the Set Size: First, we measure the performance when intersecting only 2 sets; we use synthetic data, the lists are of equal size, and the size of the intersection is fixed at 1% of the list size; the results are shown in Figure 4. We can see that the performance of the different techniques relative to each other does not change with varying list size. Hash performs worst, as the (relatively) expensive lookup operation needs to be performed many times. SkipList performs poorly for the same reason. The BPP algorithm is also slow, but this is because of a number of complex operations that need to be performed, which are hidden as a constant in the O()-notation. The same trend held for the remaining experiments as well; hence, for readability, we did not include BPP in the subsequent graphs. For the same reason we only show the best-performing among the adaptive algorithms in the evaluation; if one adaptive algorithm dominates another on all parameter settings in an experiment, we do not plot the worse one.

Among the remaining algorithms, RanGroupScan (40%-50% faster than Merge) and IntGroup perform best (RanGroup performs similarly to IntGroup and is not plotted). Interestingly, the simple Merge algorithm is next, outperforming the more sophisticated algorithms, followed by Lookup and the best-performing adaptive algorithm.

[Figure 5: Varying the Intersection Size — intersection time (ms) vs. intersection size for Merge, SkipList, Hash, Adaptive, SvS, Lookup, IntGroup, RanGroup, and RanGroupScan; Merge performs best for |L1 ∩ L2| > 0.7|L1|.]

[Figure 6: Varying the Number of Keywords — average intersection time (ms) for k = 2, 3, 4 keywords.]

Varying the Intersection Size: The size of the intersection r is an important factor in the performance of the algorithms: larger intersections mean fewer opportunities to eliminate small groups early for our algorithms, or to skip parts of the set for the adaptive and skiplist-based approaches. Here, we use synthetic data, intersecting two sets with 10M elements, and vary r = |L1 ∩ L2| between 500 and 10M. The results are reported in Figure 5. For r < 7M (70% of the set size), RanGroupScan and IntGroup perform best. Otherwise, Merge becomes the fastest and RanGroupScan the 2nd-fastest alternative; here, the performance of RanGroupScan remains very similar to Merge, all the way to r = 10M. Among the remaining algorithms, RanGroup slightly outperforms Merge for r < 5M, Lookup is the next-best algorithm, and SvS and Adaptive perform best among the adaptive algorithms.

Varying the Set Size Ratios: As we illustrated in the introduction, the skew in set sizes is also an important factor in performance. When sets are very different in size, algorithms that iterate through the smaller set and are able to locate the corresponding values in the larger set quickly, such as HashBin and Hash, perform well. In this experiment we use synthetic data and vary the ratio of set sizes, setting |L2| = 10M and varying |L1| between 16K and 10M. The size of the intersection is set to 1% of |L1|, and we define the ratio between the list sizes as sr = |L2|/|L1|. Here, the differences between the algorithms become small with growing sr (for this reason, we also do not report them in a graph, as too many lines overlap). For sr < 32, RanGroupScan performs best; for larger sr, Lookup and Hash perform best, until a ratio of sr ≥ 100 – for this and larger ratios, Hash outperforms the remaining algorithms, followed by Lookup and HashBin. Generally, both HashBin and RanGroupScan perform close to the best-performing algorithm. The adaptive algorithms require more time than RanGroupScan for sr ≤ 200, and more time than HashBin for all values of sr; SkipList and BPP perform worst across all values of sr.

Varying the Number of Keywords: In this experiment, we varied the number of sets k = 2, 3, 4, fixing |Li| = 10M for i = 1, ..., k, with the IDs in the sets randomly generated using a uniform distribution over [0, 2 × 10^8]; the results are reported in Figure 6. In this experiment, we use m = 2 hash images for RanGroupScan. For multiple sets, RanGroupScan is the fastest, with the difference becoming more pronounced for 3 and 4 keywords, since, with additional sets, intersecting the hash-images (word-representations) yields more empty results, allowing us to skip the corresponding groups. RanGroup is the next-best performing algorithm; we do not include results for IntGroup here, as it is designed for intersections of two sets (see Section 3.1). Interestingly, the simple Merge algorithm again performs very well when compared to the more sophisticated techniques; the Lookup algorithm is next, followed by the various adaptive techniques.

Size of the Data Structure: The improvements in speed come at the cost of an increase in space: our data structures (without compression) require more space than an uncompressed posting list – the increase is 37% (RanGroupScan for m = 2), 63% (RanGroupScan for m = 4), 75% (IntGroup) or 87% (RanGroup).

Experiment on Real Data: In this experiment, we used a "workload" of the 10^4 most frequent queries (measured over a week in 2009) issued against the Bing.com search engine. As the text corpus, we used a set of 8 million Wikipedia documents.

Query characteristics: 68% of the queries contain 2 keywords, 23% 3 keywords, and 6% 4 keywords. As we have illustrated before, a key factor for performance is the ratio of set sizes – among the 2-word queries, the average ratio |L1|/|L2| is 0.21; for 3-word queries, the average ratio |L1|/|L2| is 0.31 and the average ratio |L1|/|L3| is 0.09; and for 4-word queries, the |L1|/|L2| ratio is 0.36 and the |L1|/|L4| ratio is 0.06 – note that |L1| ≤ |L2| ≤ |L3| ≤ |L4|. The average ratio of intersection size to |L1| is 0.19.

[Figure 7: Normalized Execution Time on a Real Workload — results over all query sizes, with the running time of Merge normalized to 1.]

To illustrate the relative performance of the algorithms over all queries, we plotted their average running times in Figure 7; here, the running time of Merge is normalized to 1. Both RanGroup and RanGroupScan significantly outperform Merge, with the latter performing best overall; interestingly, when used for all queries (as opposed to only the large-skew cases it was designed for), HashBin still performed better than Merge. The remaining algorithms performed in a similar order to the earlier experiments, the one exception being SvS, which outperformed both Merge and Lookup on this more realistic data. Overall, RanGroupScan was the best-performing algorithm for 61.6% of the queries, followed by RanGroup (16%) and HashBin (7.7%) – among the remaining algorithms not proposed in this paper, Lookup performed best in 6.4% of the queries and SvS in 3.6% of the queries. All of the other techniques were best for 2.1% of the queries or fewer. We present additional experiments for this data set in Appendix C.2.

4.1 Experiments on Compressed Structures

To illustrate the impact of compression on performance, we repeated the first experiment above, intersecting two sets of identical size, with the size of the intersection fixed to 1% of the set size. Varying the set size, we report the execution times and storage requirements for the three algorithms that performed best overall in the earlier experiments – Merge, Lookup and RanGroupScan (since we are interested in small structures here, we only use m = 1 hash images in RanGroupScan) – when compressed with different techniques: we used the standard techniques based on γ- and δ-coding (see [23], p. 116) to compress the parts of the posting data stored and accessed sequentially for the three algorithms, and the compression technique described in Appendix B for RanGroupScan (which we refer to as RanGroupScan Lowbits). The results are shown in Figure 8; here, we omitted the results for γ-encoding as they were essentially indistinguishable from the ones for δ-coding. RanGroupScan outperforms – in terms of speed – the other two algorithms using the same compression scheme; the other two algorithms perform similarly to each other, as decompression now dominates their run-time. Using our encoding scheme of Appendix B improves the performance significantly.

[Figure 8: Running Time and Space Requirement — intersection time (ms) and structure size (in words) vs. number of postings (128K-8M) for Merge_Delta, Lookup_Delta, RanGroupScan_Lowbits, and RanGroupScan_Delta.]

Looking at the graph, we can see that the storage requirement for RanGroupScan (using our own encoding) is between 1.3-1.9x the size of the compressed inverted index and between 1.2-1.6x the size of the compressed Lookup structure. At the same time, the performance improvements are between 7.6-15x (vs. Merge) and 7.4-13x (vs. Lookup). Furthermore, by increasing the number of hash images to m = 2, we obtain an algorithm that significantly outperforms the uncompressed Merge, while requiring less memory.

Experiment on Real Data: We repeated this experiment using the real-life data/workload described earlier and the compressed variants of RanGroupScan, Lookup and Merge. Again, RanGroupScan Lowbits performed best, improving run-times by a factor of 8.4x (vs. Merge + δ-coding), 9.1x (Merge + γ-coding), 5.7x (Lookup + δ-coding), and 6.2x (Lookup + γ-coding), respectively. However, our approach required the most space (66% of the uncompressed data), whereas Merge (26% / 28% for γ-/δ-coding) and Lookup (35% / 37%) required significantly less.

Finally, to illustrate the robustness of our techniques, we also measured the worst-case latency for any single query: here, the worst-case latency using Merge + δ-coding was 5.2x the worst-case latency of RanGroupScan Lowbits. We saw similar results for Merge + γ-coding (5.6x higher), Lookup + δ-coding (4.4x higher), and Lookup + γ-coding (4.9x higher).

5. CONCLUSION

In this paper we introduced algorithms for set intersection processing on memory-resident data. Our approach provides both novel theoretical worst-case guarantees and very fast performance in practice, at the cost of increased storage space. Our techniques outperform a wide range of existing techniques and are robust in that – for inputs for which they are not the best-performing approach – they perform close to the best one. Our techniques have applications in information retrieval and query processing scenarios where performance is of greater concern than space.

6. REFERENCES
[1] R. A. Baeza-Yates. A fast set intersection algorithm for sorted sequences. In CPM, pages 400-408, 2004.
[2] R. A. Baeza-Yates and A. Salinger. Experimental Analysis of a Fast Intersection Algorithm for Sorted Sequences. In SPIRE, pages 13-24, 2005.
[3] J. Barbay and C. Kenyon. Adaptive intersection and t-threshold problems. In SODA, pages 390-399, 2002.
[4] J. Barbay, A. López-Ortiz, and T. Lu. Faster Adaptive Set Intersections for Text Searching. In 5th WEA, pages 146-157, 2006.
[5] J. Barbay, A. López-Ortiz, T. Lu, and A. Salinger. An experimental investigation of set intersection algorithms for text searching. ACM Journal of Experimental Algorithmics, 14:7-24, 2010.
[6] P. Bille, A. Pagh, and R. Pagh. Fast Evaluation of Union-Intersection Expressions. In ISAAC, pages 739-750, 2007.
[7] G. E. Blelloch and M. Reid-Miller. Fast Set Operations using Treaps. In ACM SPAA, pages 16-26, 1998.
[8] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In CIKM, pages 426-434, 2003.
[9] M. R. Brown and R. E. Tarjan. A Fast Merging Algorithm. Journal of the ACM, 26(2):211-226, 1979.
[10] J. Brutlag. Speed Matters for Google Web Search. http://code.google.com/speed/files/delayexp.pdf, 2009.
[11] E. Chiniforooshan, A. Farzan, and M. Mirzazadeh. Worst case optimal union-intersection expression evaluation. In ALENEX, pages 179-190, 2001.
[12] E. Demaine, A. López-Ortiz, and J. Munro. Adaptive Set Intersections, Unions, and Differences. In SODA, pages 743-752, 2000.
[13] E. Demaine, A. López-Ortiz, and J. Munro. Experiments on Adaptive Set Intersections for Text Retrieval Systems. In ALENEX, pages 91-104, 2001.
[14] D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge, 2009.
[15] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In PODS, pages 102-113, 2001.
[16] F. K. Hwang and S. Lin. A Simple Algorithm for Merging Two Disjoint Linearly Ordered Sets. SIAM Journal on Computing, 1(1):31-39, 1972.
[17] G. Linden. http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html.
[18] W. Pugh. A skip list cookbook. Technical Report UMIACS-TR-89-72.1, University of Maryland, 1990.
[19] P. Sanders and F. Transier. Intersection in Integer Inverted Indices. In ALENEX, pages 71-83, 2007.
[20] S. Tatikonda, F. Junqueira, B. B. Cambazoglu, and V. Plachouras. On Efficient Posting List Intersection with Multicore Processors. In ACM SIGIR, pages 738-739, 2009.
[21] F. Transier and P. Sanders. Compressed inverted indexes for in-memory search engines. In ALENEX, pages 3-12, 2008.
[22] D. Tsirogiannis, S. Guha, and N. Koudas. Improving the performance of list intersection. PVLDB, 2(1):838-849, 2009.
[23] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, 1999.

APPENDIX

Acknowledgments
We thank the anonymous reviewers for their numerous insights and suggestions that immensely improved the paper.

A. PROOFS OF THEOREMS

A.1 Analysis of Algorithm 1 (Proof of Theorem 3.3)
There are a total of $O((n_1+n_2)/\sqrt{w})$ pairs of $L_1^p$ and $L_2^q$ to be checked in Algorithm 1. For each pair, since $H = h(L_1^p) \cap h(L_2^q)$ can be computed in $O(1)$ time and the elements in $H$ can be enumerated in linear time, the cost of computing $L_1^p \cap L_2^q$ is dominated by computing $h^{-1}(y, L_1^p) \cap h^{-1}(y, L_2^q)$ for every $y \in H$; this cost is in turn determined by the number of pairs of elements which are mapped to the same location by $h$. We denote this set as $I = \{(x_1, x_2) \mid x_1 \in L_1^p,\ x_2 \in L_2^q,\ \text{and } h(x_1) = h(x_2)\}$.

Let $I_{=} = \{(x_1,x_2) \mid x_1 = x_2\} \cap I$ denote the pairs of identical elements (i.e., elements in the intersection) in $I$, and $I_{\neq} = \{(x_1,x_2) \mid x_1 \neq x_2\} \cap I$ the remaining pairs of elements that are hashed to the same value by $h$ but are not identical. Obviously, $|I_{=}| = |L_1^p \cap L_2^q|$. If we can show $E[|I_{\neq}|] = O(1)$, the proof is completed: this is because, for a total of $O((n_1+n_2)/\sqrt{w})$ pairs of $L_1^p$ and $L_2^q$ to be checked, the total running time is
$$\sum_{p,q} O(E[|I|]+1) = \Big(\sum_{p,q} |I_{=}| + \sum_{p,q} \big(E[|I_{\neq}|]+1\big)\Big)\cdot O(1) = O(r) + (n_1+n_2)/\sqrt{w}\cdot O(1). \quad (3)$$
Indeed, we can show for each pair of $L_1^p$ and $L_2^q$ that
$$E[|I_{\neq}|] = \sum_{x_1\in L_1^p,\ x_2\in L_2^q,\ x_1\neq x_2} \Pr[h(x_1)=h(x_2)] \leq \sqrt{w}\cdot\sqrt{w}\cdot\frac{1}{w} = O(1) \quad (4)$$
for a universal hash function $h$, which completes the proof. □

A.1.1 Group Size and Optimizing Running Time
In Algorithm 1, the group size is selected as the "magical number" $\sqrt{w}$ (i.e., $|L_1^p| = |L_2^q| = \sqrt{w}$). To explain this choice, we now explore the effect of group size on the running time of Algorithm 1. Suppose, in general, $L_i$ is partitioned into groups of size $s_i$. Extending Equation (4) a bit, we have $E[|I_{\neq}|] = O(1)$ as long as $s_1 s_2 \leq w$. Then, following the same argument as in (3), a total of $O(n_1/s_1 + n_2/s_2)$ pairs are to be checked, and the expected running time of Algorithm 1 is $O(T(s_1,s_2))$, where $T(s_1,s_2) = n_1/s_1 + n_2/s_2 + r$. Minimizing $T(s_1,s_2)$ under the constraint $s_1 s_2 \leq w$ yields optimal group sizes of $s_1^* = \sqrt{wn_1/n_2}$ and $s_2^* = \sqrt{wn_2/n_1}$, and the optimal running time is $O(T(s_1^*, s_2^*)) = O(\sqrt{n_1n_2/w} + r)$. If we instead use the group sizes $s_1' = s_2' = \sqrt{w}$, as in the proof of Theorem 3.3, we obtain a running time of $O(T(s_1', s_2')) = O((n_1+n_2)/\sqrt{w} + r)$. $O(\sqrt{n_1n_2/w}+r)$ is never worse than $O((n_1+n_2)/\sqrt{w}+r)$, and is strictly better when the set sizes are skewed (e.g., $n_1 \ll n_2$); the two bounds coincide when $n_1 = n_2$. To achieve the better bound, we leverage the fact that the optimal group size $s_1^* = \sqrt{wn_1/n_2}$ of the set $L_1$ depends on the size $n_2$ of the set $L_2$ to be intersected with it, and use a multi-resolution structure which keeps different partitions of a set, as discussed at the end of Section 3.1.

A.2 Analysis of Algorithm 3 (Proof of Theorem 3.5)
Similar to the proof of Theorem 3.3, the cost of computing $L_1^{z_1} \cap L_2^{z_2}$ using IntersectSmall for each pair of small groups $L_1^{z_1}$ and $L_2^{z_2}$ is determined by the size of $I = \{(x_1,x_2) \mid x_1 \in L_1^{z_1},\ x_2 \in L_2^{z_2},\ \text{and } h(x_1)=h(x_2)\}$. As in A.1, let $I_{=} = \{(x_1,x_2)\mid x_1 = x_2\}\cap I$ and $I_{\neq} = \{(x_1,x_2)\mid x_1\neq x_2\}\cap I$. Obviously, $|I_{=}| = |L_1^{z_1}\cap L_2^{z_2}|$, and $I_{\neq}$ is the set of element-pairs that result in a hash-collision. If we can show $E[|I_{\neq}|] \leq O(1)$, the proof is complete: since $t_1 = t_2 = \lceil\log\sqrt{n_1n_2/w}\rceil$, there are $O(\sqrt{n_1n_2/w})$ pairs of $z_1$ and $z_2$ to be considered (we have $z_1 = z_2 = z$ in every iteration), and thus the total running time is
$$\sum_{z\in\{0,1\}^{t_2}} O(E[|I|]+1) = \Big(\sum_{z\in\{0,1\}^{t_2}}|I_{=}| + \sum_{z\in\{0,1\}^{t_2}}\big(E[|I_{\neq}|]+1\big)\Big)\cdot O(1) \leq O(r) + \sqrt{n_1n_2/w}\cdot O(1).$$
Now we prove that, for each pair $(z_1, z_2)$, $E[|I_{\neq}|] = O(1)$. Letting $S_1^{z_1} = |L_1^{z_1} - L_2^{z_2}|$ and $S_2^{z_2} = |L_2^{z_2} - L_1^{z_1}|$, if $L_1^{z_1}$ and $L_2^{z_2}$ are fixed, we have (similar to (4) in the proof of Theorem 3.3):
$$E_h\big[|I_{\neq}| \mid S_1^{z_1}, S_2^{z_2}\big] = S_1^{z_1}\cdot S_2^{z_2}/w.$$
$S_1^{z_1}$ and $S_2^{z_2}$ are random variables determined by the hash function $g$. From their definition and the property of 2-universal (2-independent) hashing, we can prove $E_g[S_1^{z_1}\cdot S_2^{z_2}] \leq E_g[S_1^{z_1}]\cdot E_g[S_2^{z_2}]$ (using a random permutation $g$ yields the same result). Also, $E_g[S_1^{z_1}] \leq E_g[|L_1^{z_1}|] = O(n_1/2^{t_1}) = O(\sqrt{n_1w/n_2})$, and similarly $E_g[S_2^{z_2}] \leq O(\sqrt{n_2w/n_1})$. Therefore, $E_g[S_1^{z_1}\cdot S_2^{z_2}] \leq E_g[S_1^{z_1}]\cdot E_g[S_2^{z_2}] = O(w)$, and thus
$$E[|I_{\neq}|] = E_g\Big[E_h\big[|I_{\neq}| \mid S_1^{z_1}, S_2^{z_2}\big]\Big] = E_g\left[\frac{S_1^{z_1}\cdot S_2^{z_2}}{w}\right] \leq \frac{O(w)}{w} = O(1), \quad (5)$$
which completes the argument. □
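To make the primitive analyzed in A.1 and A.2 concrete, the following is a minimal C sketch of intersecting two small groups via their single-word hash images, assuming $w = 64$; the group layout, the multiplicative stand-in for the universal hash function $h$, and all identifiers are our own illustration (and `__builtin_ctzll` assumes GCC/Clang). The paper's actual structure resolves a shared bucket $y$ through the precomputed $h^{-1}(y,\cdot)$ lists rather than the linear scans used here.

```c
#include <stdint.h>
#include <stddef.h>

/* Small group of ~sqrt(w) elements plus its one-word hash image:
 * bit h(x) of `image` is set for every element x of the group.
 * Layout and names are illustrative, not the paper's exact structure. */
typedef struct {
    const uint64_t *elems;
    size_t          len;
    uint64_t        image;
} SmallGroup;

/* Stand-in universal hash into [w]: top 6 bits of a multiplicative hash. */
static unsigned h(uint64_t x) {
    return (unsigned)((x * 0x9E3779B97F4A7C15ull) >> 58);
}

/* IntersectSmall sketch: writes common elements to out, returns their
 * count.  One AND computes H = h(L1) ∩ h(L2); only shared buckets are
 * resolved, and the analysis above gives O(1) expected collision pairs. */
static size_t intersect_small(const SmallGroup *a, const SmallGroup *b,
                              uint64_t *out) {
    uint64_t H = a->image & b->image;
    size_t cnt = 0;
    while (H) {
        unsigned y = (unsigned)__builtin_ctzll(H);  /* next shared bucket */
        H &= H - 1;                                 /* clear that bucket  */
        for (size_t i = 0; i < a->len; i++) {
            if (h(a->elems[i]) != y) continue;
            for (size_t j = 0; j < b->len; j++)
                if (h(b->elems[j]) == y && a->elems[i] == b->elems[j])
                    out[cnt++] = a->elems[i];
        }
    }
    return cnt;
}
```

With groups of roughly $\sqrt{w}$ elements, each call touches only the expected-$O(1)$ colliding pairs, which is what drives the bounds in Theorems 3.3 and 3.5.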

A.3 Analysis of Algorithm 4 (Proof of Theorem 3.6 and Theorem 3.7)
Theorem 3.6 is a special case of Theorem 3.7 for two-set intersection, so we only present the proof of Theorem 3.7 below.

Consider any element $x \in L_i^{z_i}$, for each set $L_i$ involved in the intersection computation (i.e., extended IntersectSmall in line 3 of Algorithm 4), where we compute
$$H = \bigcap_{i=1}^{k} h(L_i^{z_i}), \qquad \bigcap_{i=1}^{k} L_i^{z_i} = \bigcup_{y\in H}\Big(\bigcap_{i=1}^{k} h^{-1}(y, L_i^{z_i})\Big).$$
Denote the set of all such elements (with $h(x) = y \in H$) by $\Gamma$. The number of such elements, $|\Gamma|$, dominates the cost of Algorithm 4. We first differentiate two cases of elements in $\Gamma$:
(i) $x \in \bigcap_{i=1}^{k} L_i$: these $r$ elements are scanned $k$ times, and thus contribute a term of $O(kr)$ to the overall time complexity.
(ii) $x \notin \bigcap_{i=1}^{k} L_i$: we group all these elements into $k-1$ sets $D_2, \ldots, D_k$ (an element $x$ may belong to multiple $D_i$'s):
$$D_i = \{x\in\Gamma \mid x\in L_i\cap L_{i+1}\cap\ldots\cap L_k \ \wedge\ x\notin L_j \text{ for some } j<i\}.$$
Now focus on $D_i \cap L_i^{z_i}$ for each $z_i \in \{0,1\}^{t_i}$. For any $x \in L_i$ with $x \notin L_j$ for some $j<i$, letting $z_j$ be the $t_j$-prefix of $z_i$, $x \in D_i\cap L_i^{z_i}$ implies that $h(x)\in H$, and thus there exists $x' (\neq x) \in L_j^{z_j}$ such that $h(x)=h(x')$; so, for such an $x$,
$$\Pr[x\in D_i\cap L_i^{z_i} \mid x\in L_i^{z_i}] = \Pr[h(x)\in H \mid x\in L_i^{z_i}] \leq \sum_{x'\in L_j^{z_j}} \Pr[h(x)=h(x')] \leq |L_j^{z_j}|/w.$$
Generalizing Equation (5) in the proof of Theorem 3.5, we have
$$E[|D_i\cap L_i^{z_i}|] = E_g\Big[E_h\big[|D_i\cap L_i^{z_i}| \mid |L_j^{z_j}| \text{ for all } z_j\text{'s}\big]\Big] \leq \sum_{x\in L_i} \frac{E_g[|L_j^{z_j}|]}{w}\cdot\Pr[x\in L_i^{z_i}] \leq n_i\cdot\frac{\sqrt{w}}{w}\cdot\frac{\sqrt{w}}{n_i} = O(1)$$
(as $E_g[|L_j^{z_j}|] = \sqrt{w}$ for any $j$, and $\Pr[x\in L_i^{z_i}] = \sqrt{w}/n_i$). So $E[|D_i|] \leq O(n_i/\sqrt{w})$, as $L_i$ is partitioned into $2^{t_i} = n_i/\sqrt{w}$ groups $L_i^{z_i}$ over all iterations of Algorithm 4. Then we have
$$E\Big[\sum_{i=2}^{k}|D_i|\Big] \leq \sum_{i=2}^{k} O(n_i/\sqrt{w}) = O(n/\sqrt{w}). \quad (6)$$

Running Time: As the $|D_i|$'s are bounded as above, a naive implementation of Algorithm 4 requires $O(kn_k/\sqrt{w}+kr)$ time in expectation: the iteration of lines 1-4 in Algorithm 4 repeats $n_k/\sqrt{w}$ times (suppose $n_1\leq n_2\leq\ldots\leq n_k$), and in each iteration we compute $H$ in $O(k)$ time, while each element in $D$ needs $O(k)$ comparisons to be eliminated. $n/\sqrt{w}$ is potentially smaller than $kn_k/\sqrt{w}$, especially for sets with skewed sizes. With careful memoization of the partial results $\bigcap_{i=1}^{j} h(L_i^{z_i})$ and $\bigcap_{i=j}^{k} h^{-1}(y, L_i^{z_i})$ in Algorithm 4, we now prove, from (i) and (ii), the promised running time $O(n/\sqrt{w}+kr)$.

The major cost of Algorithm 4 comes from the computation of (a) $H = \bigcap_{i=1}^{k} h(L_i^{z_i})$ and (b) $\bigcap_{i=1}^{k} h^{-1}(y, L_i^{z_i})$ for each $y\in H$. Assume $n_1\leq n_2\leq\ldots\leq n_k$.
For (a), as $z_i$ is the $t_i$-prefix of $z_j$ if $i\leq j$, we can memoize $\bigcap_{i=1}^{j} h(L_i^{z_i})$ for each $z_j$. Then, for example, we reuse $h(L_1^{1})\cap h(L_2^{10})$ when computing $h(L_1^{1})\cap h(L_2^{10})\cap h(L_3^{100})$ and $h(L_1^{1})\cap h(L_2^{10})\cap h(L_3^{101})$. In this way, the computation of $H$ for the different combinations of $z_1,\ldots,z_k$ requires $\sum_i O(n_i/\sqrt{w}) = O(n/\sqrt{w})$ time.
For (b), for each combination of $z_1,\ldots,z_k$, we compute the result in the inverse order (from $i=k$ to $i=1$): the partial results $\bigcap_{i=j}^{k} h^{-1}(y, L_i^{z_i})$ (for all $y\in H$, all $z_k$'s, and some $j$) have their total size bounded by $|D_j|+r$. Using the hash-table-based approach to compute the intersection, the total running time is bounded by the total size of the partial results. So, from (6), the total running time is $O(n/\sqrt{w}+kr)$ in expectation. □

A.4 Analysis of the Multi-resolution Structure (Proof of Theorem 3.8)
The time bound is trivial, because we only require sorting and scanning of each set. The total space for storing the $h(L_i^z)$'s, $\mathrm{left}(L_i^z)$'s, and $\mathrm{right}(L_i^z)$'s is $O(n_i)$, as there are $O(n_i + n_i/2 + n_i/4 + \ldots) = O(n_i)$ groups of different sizes. For the $\mathrm{next}(x)$'s we also only need $O(n_i)$ space, as there are $n_i$ elements in the set. We now analyze the space needed for the $\mathrm{first}(y, L_i^z)$'s to complete the proof.
To store $\mathrm{first}(y, L_i^z)$ for each $y\in[w]$ and each $z$, storing the difference between $\mathrm{first}(y, L_i^z)$ and $\mathrm{left}(L_i^z)$ suffices; so we need $O(\log|L_i^z|)$ bits. To store the $\mathrm{first}(y, L_i^z)$'s for all $y\in[w]$ in a group $L_i^z$, we need $O(w\cdot\log|L_i^z|/w) = O(\log|L_i^z|)$ words. Consider the partitioning induced by $g_t:\Sigma\to\{0,1\}^t$ for some $t$; letting $\bar{t} = \lceil\log n_i\rceil - t$, there are $O(n_i/2^{\bar{t}})$ groups $L_i^z$ generated by $g_t$, so the space we need for all these groups is (as $\log(\cdot)$ is concave):
$$O\Big(\sum_{z\in\{0,1\}^t}\log|L_i^z|\Big) \leq O\big(2^t\cdot\log(n_i/2^t)\big) = O\big((n_i/2^{\bar{t}})\cdot\bar{t}\big).$$
Therefore, over all resolutions $t = 1, 2, \ldots, \lceil\log n_i\rceil$, the total space needed for the $\mathrm{first}(y, L_i^z)$'s is $O(\sum_{\bar{t}} \bar{t}\cdot n_i/2^{\bar{t}}) \leq O(n_i)$. □

A.5 Analysis of Algorithm 5 (Proof of Theorem 3.9)

A.5.1 Probability of Successful Filtering
Recall that in Algorithm 5, sets are partitioned into small groups by the hash function $g$, and $m$ universal hash functions $h_1,\ldots,h_m$ are used to test whether the intersection of small groups is empty. The algorithm is efficient because of the following observation: if $L_1^{z_1}\cap L_2^{z_2} = \emptyset$, then, with high probability, $h_j(L_1^{z_1})\cap h_j(L_2^{z_2}) = \emptyset$ for some $j = 1,\ldots,m$ (so-called "successful filtering"). But once a "false positive" happens (i.e., $\bigcap_i L_i^{z_i} = \emptyset$ but $\bigcap_i h_j(L_i^{z_i}) \neq \emptyset$ for every hash function $h_1,\ldots,h_m$), we have to scan the two or $k$ small groups for the intersection. So, to analyze Algorithm 5, the key fact we need to establish is that successful filtering happens with a constant probability for two or $k$ small groups. We first verify the above intuition, assuming that $|L_i^{z_i}|$ is $\sqrt{w}$:

LEMMA A.1. For two small groups $L_1^{z_1}$ and $L_2^{z_2}$ with $|L_1^{z_1}| = |L_2^{z_2}| = \sqrt{w}$, given a universal hash function $h:\Sigma\to[w]$, if $L_1^{z_1}\cap L_2^{z_2} = \emptyset$, then $h(L_1^{z_1})\cap h(L_2^{z_2}) = \emptyset$ with probability at least $\left(1-\frac{1}{\sqrt{w}}\right)^{\sqrt{w}}$ ($\approx 0.3436$ for $w = 64$).

PROOF. Since $L_1^{z_1}\cap L_2^{z_2} = \emptyset$, for each $x_2\in L_2^{z_2}$, $h(x_2)\notin h(L_1^{z_1})$ holds with probability $1 - |h(L_1^{z_1})|/w$. So,
$$\Pr[h(L_1^{z_1})\cap h(L_2^{z_2}) = \emptyset] \geq \left(1-\frac{|h(L_1^{z_1})|}{w}\right)^{|L_2^{z_2}|} \geq \left(1-\frac{|L_1^{z_1}|}{w}\right)^{|L_2^{z_2}|},$$
as $|h(L_1^{z_1})| \leq |L_1^{z_1}|$. So, when $|L_1^{z_1}| = |L_2^{z_2}| = \sqrt{w}$, we have $\Pr[h(L_1^{z_1})\cap h(L_2^{z_2}) = \emptyset] \geq \left(1-\frac{1}{\sqrt{w}}\right)^{\sqrt{w}}$. □

In general, although the sizes of the small groups $|L_i^z|$ are random variables, they are unlikely to deviate from $\sqrt{w}$ by much. This is important since groups of larger sizes result in poorer filtering performance of the word representations $h_j(L_i^{z_i})$ (incurring more false positives). Using Chernoff bounds, we can show that:

PROPOSITION A.2. For any group $L_i^{z_i}$ defined in Algorithm 5 (i.e., partition $L_i$ by $g_{t_i}:\Sigma\to\{0,1\}^{t_i}$ with $t_i = \lceil\log(n_i/\sqrt{w})\rceil$), we have:
(i) $\sqrt{w}/2 \leq E[|L_i^{z_i}|] \leq \sqrt{w}$;
(ii) $\Pr[|L_i^{z_i}| \leq (1+\epsilon)\sqrt{w}] \geq 1 - \exp\left(-\frac{\sqrt{w}\,\epsilon^2}{3}\right)$, for $0 < \epsilon < 1$;
(iii) $\Pr[|L_i^{z_i}| \leq \delta(w)\sqrt{w}] \geq 1 - \frac{1}{4\sqrt{w}}$, where the constant $\delta(w) = 1 + \left(\frac{6\ln(4\sqrt{w})}{\sqrt{w}}\right)^{1/2}$ ($\approx 2.6122$ for $w = 64$).

PROOF. In this proof we use a random permutation as $g:\Sigma\to\Sigma$, and define $g_t(x)$ to be the $t$ most significant bits of $g(x)$. However, note that we can use a hash function here as well, if we use the pre-image to break any ties resulting from hash collisions (thereby resulting in a total ordering).
For the group $L_i^{z_i}$, define $Y_x = 1$ if $x\in L_i^{z_i}$ (i.e., $g_{t_i}(x) = z_i$), and $Y_x = 0$ otherwise. So $|L_i^{z_i}| = \sum_x Y_x$. Then (i) follows from the fact that $\Pr[Y_x = 1] = 1/2^{t_i}$ and the linearity of expectation.
For a random permutation $g$, we can prove that the $\{Y_x \mid x\in L_i\}$ are negatively associated [14], so the Chernoff bounds can still be applied. As in (i), we have $\mu_L = \sqrt{w}/2 \leq \mu = E[|L_i^{z_i}|] \leq \sqrt{w} = \mu_H$. To prove (iii), we use the Chernoff bound $\Pr[|L_i^{z_i}| > (1+\epsilon)\mu] < \exp(-\mu\epsilon^2/3) \leq \exp(-\mu_L\epsilon^2/3)$ [14]. To prove (ii), we can use a tighter bound: for $0 < \epsilon < 1$, $\Pr[|L_i^{z_i}| > (1+\epsilon)\mu_H] < \exp(-\mu_H\epsilon^2/3)$ [14]. □
Note that the same bounds hold when a hash function is used as $g$.
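To make the filtering test itself concrete, here is a minimal C sketch (identifiers are ours; $w = 64$, and $m = 4$ is just an example): each small group stores one machine word per hash function $h_j$, and a single AND per word suffices to certify an empty intersection.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

enum { M = 4 };  /* number of hash functions m; 4 is only an example */

/* Per-group filter: image[j] has bit h_j(x) set for every element x
 * of the group (hypothetical layout). */
typedef struct {
    uint64_t image[M];
} GroupFilter;

/* Returns true iff the word images certify that the two groups are
 * disjoint: a shared element would set the same bit in both words for
 * every j, so one zero AND is a proof of empty intersection. */
static bool filtered_out(const GroupFilter *g1, const GroupFilter *g2) {
    for (size_t j = 0; j < M; j++)
        if ((g1->image[j] & g2->image[j]) == 0)
            return true;   /* successful filtering: skip this pair */
    return false;          /* possible false positive: scan the groups */
}
```

Under Lemma A.1's idealized group sizes, each $h_j$ filters an empty pair with probability at least about $0.34$ for $w = 64$, so $m = 4$ independent images leave a false positive with probability at most roughly $0.66^4 \approx 0.19$, consistent with the $1-(1-\beta(w))^m$ magnification noted after the proof of Lemma A.3 below.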

Lemma A.3 extends Lemma A.1 to $k$ groups, whose sizes are random variables determined by the hash function $g$.

LEMMA A.3. For $k$ groups $L_i^{z_i}$ (for $i = 1,\ldots,k$; partition $L_i$ by $g_{t_i}:\Sigma\to\{0,1\}^{t_i}$ where $t_i = \lceil\log(n_i/\sqrt{w})\rceil$), if $\bigcap_i L_i^{z_i} = \emptyset$, then $\bigcap_i h(L_i^{z_i}) = \emptyset$ with at least constant probability
$$\beta_1(w) = \left(1-\frac{1+\delta(w)\sqrt{w}}{4\sqrt{w}}\right)\cdot\left(1-\frac{\delta(w)}{\sqrt{w}}\right)^{\delta(w)\sqrt{w}}$$
(or $\beta_2(w)$ below), where $\delta(w)$ is a constant determined by $w$, as in Proposition A.2.

PROOF. Since $\bigcap_i L_i^{z_i} = \emptyset$, for any $x_k\in L_k^{z_k}$ there exists some $L_j$ s.t. $x_k\notin L_j^{z_j}$; now, for this small group $L_j^{z_j}$, if $h(x_k)\neq h(x_j)$ for every $x_j\in L_j^{z_j}$, i.e., $h(x_k)\notin h(L_j^{z_j})$, we say that $x_k$ is collision-free. If $|L_j^{z_j}| \leq \delta(w)\sqrt{w}$, from the union bound,
$$\Pr[x_k \text{ is collision-free}] \geq 1 - \frac{|L_j^{z_j}|}{w} \geq 1 - \frac{\delta(w)}{\sqrt{w}}, \quad (7)$$
where $\delta(w)$ is defined in Proposition A.2. Note that if every $x_k$ in $L_k^{z_k}$ is collision-free, then $\bigcap_i h(L_i^{z_i}) = \emptyset$. So, if furthermore $|L_k^{z_k}| \leq \delta(w)\sqrt{w}$, we have
$$\Pr\Big[\bigcap_i h(L_i^{z_i}) = \emptyset\Big] \geq \left(1-\frac{\delta(w)}{\sqrt{w}}\right)^{\delta(w)\sqrt{w}}.$$
The derivation of (7) assumes independence of the randomized hash function $h$. If $h$ is generated from a random permutation $p$, i.e., taking the prefix of $p(x)$ as $h(x)$, then, by considering negative dependence [14], a similar (but a bit weaker) bound can be derived.
From Proposition A.2(iii), with probability at least $1-\frac{1}{4\sqrt{w}}$, we have $|L_i^{z_i}| \leq \delta(w)\sqrt{w}$ for a group $L_i^{z_i}$. Given $|L_k^{z_k}| \leq \delta(w)\sqrt{w}$, there are at most $\min\{k, \delta(w)\sqrt{w}\}$ groups $L_j^{z_j}$ involved in the analysis of (7). From the union bound, with probability at least $1 - \frac{1+\delta(w)\sqrt{w}}{4\sqrt{w}}$, we have $|L_j^{z_j}| \leq \delta(w)\sqrt{w}$ for all of these groups. So, with probability at least
$$\beta_1(w) = \left(1-\frac{1+\delta(w)\sqrt{w}}{4\sqrt{w}}\right)\cdot\left(1-\frac{\delta(w)}{\sqrt{w}}\right)^{\delta(w)\sqrt{w}}$$
(notice the independence between $g$ and $h$), we have $\bigcap_i h(L_i^{z_i}) = \emptyset$. If we use Proposition A.2(ii) to bound the probability of $|L_k^{z_k}| \leq 3\sqrt{w}/2$ (then there are at most $\min\{k, 3\sqrt{w}/2\}$ groups $L_j^{z_j}$ involved in the analysis of (7)), we can derive a tighter bound in a similar way:
$$\beta_2(w) = \left(1-\exp\left(-\frac{\sqrt{w}}{12}\right)-\frac{3}{8}\right)\cdot\left(1-\frac{\delta(w)}{\sqrt{w}}\right)^{3\sqrt{w}/2}. \;\square$$
Thus, we have shown that the probability of successful filtering ($\beta_1(w)$, or $\beta_2(w)$ as a conservative lower bound) is at least a constant that depends only on the machine-word width $w$ (but is independent of the number and the sizes of the sets), and increases with $w$. It can be magnified to $1-(1-\beta(w))^m$ by using $m > 1$ word images of independent hash functions for filtering.

A.5.2 Filtering Performance in Practice
In this section we evaluate the efficiency of the word images for filtering. In Figure 9, we have plotted the probability that, for different numbers $m$ of hash functions, a given pair of small groups with an empty intersection is filtered; as before, we use $w = 64$. As the datasets, we use the synthetic data from the first experiment in Section 4 (with an intersection size of 1% of the set size) and the 2-word queries described in the experiments on real data derived from Bing/Wikipedia. As we can see, the probabilities are very similar for both datasets, with slightly better filtering performance for the asymmetric real data. Moreover, the real-life successful-filtering probabilities are significantly better than the theoretical bounds derived in Lemma A.1 and Lemma A.3 (where $m = 1$).
[Figure 9: Filtering Performance in Experiments — measured probability of successful filtering for $m$ = 1, 2, 4, 6, 8 hash functions, on synthetic and real-life data.]

A.5.3 Proof of Theorem 3.9
For any $z_k\in\{0,1\}^{t_k}$, let $z_i$ be its $t_i$-prefix (as in Algorithm 5). For computing $\bigcap_i L_i^{z_i}$, there are two cases:
(i) If $\bigcap_i L_i^{z_i} = \emptyset$, from Lemma A.3 we have $\bigcap_i h_j(L_i^{z_i}) \neq \emptyset$ with probability at most $1-\beta(w)$, and thus $\bigcap_i h_j(L_i^{z_i}) \neq \emptyset$ for all $j = 1,\ldots,m$ with probability at most $(1-\beta(w))^m$. So we wastefully compute $\bigcap_i L_i^{z_i}$ with probability at most $(1-\beta(w))^m$.
(ii) If $\bigcap_i L_i^{z_i} \neq \emptyset$, we must have $\bigcap_i h_j(L_i^{z_i}) \neq \emptyset$ for all $j$, and we compute $\bigcap_i L_i^{z_i}$. Case (ii) happens at most $r = |\bigcap_i L_i|$ times.
We compute $\bigcap_i L_i^{z_i}$ using the linear merge algorithm in linear time $O(\sum_i |L_i^{z_i}|)$, or $O(k\sqrt{w})$ time in expectation. In case (i), since there are $n_k/\sqrt{w}$ groups in $L_k$, over all groups this contributes a term of $O(\max(n, kn_k)(1-\beta(w))^m)$; and in case (ii), this contributes a term of $O(kr\sqrt{w})$ (since (ii) happens at most $r$ times).
We also need to test whether $\bigcap_{i=1}^{k} h_j(L_i^{z_i}) \neq \emptyset$ for all $j = 1,\ldots,m$. Since there are $n/\sqrt{w}$ groups $L_i^{z_i}$, with careful memoization of partial results (e.g., reusing $h_j(L_1^{1})\cap h_j(L_2^{10})$ when computing $h_j(L_1^{1})\cap h_j(L_2^{10})\cap h_j(L_3^{100})$ and $h_j(L_1^{1})\cap h_j(L_2^{10})\cap h_j(L_3^{101})$), this contributes a term of $O(mn/\sqrt{w})$ in total.
So, from the above analysis, Algorithm 5 needs a total of
$$O\left(m\,\frac{n}{\sqrt{w}} + \frac{\max(n, kn_k)}{\alpha(w)^m} + kr\sqrt{w}\right) \quad (8)$$
time in expectation, where $\alpha(w) = 1/(1-\beta(w))$. □

A.6 Analysis of Algorithm HashBin (Proof of Theorem 3.11)
For HashBin, the intuition is that, in the resulting partitioning, we have $O(1)$ elements in each group $L_1^z$, and $O(n_2/n_1)$ elements in each group $L_2^z$. The expected running time is:
$$E\Big[\sum_{z\in\{0,1\}^t}|L_1^z|\log|L_2^z|\Big] = \sum_{z\in\{0,1\}^t}E[|L_1^z|]\log|L_2^z| \quad\text{(suppose the } |L_2^z|\text{'s are fixed)}$$
$$= \sum_{z\in\{0,1\}^t}\log|L_2^z| \quad\text{(because } E[|L_1^z|] = 1\text{)}$$
$$\leq n_1\log(n_2/n_1) \quad\text{(since } \textstyle\sum_z|L_2^z| = n_2 \text{ and } \log(\cdot) \text{ is concave).} \;\square$$

A.6.1 HashBin using the Multi-resolution Structure
Algorithm HashBin works on a simplified version of the multi-resolution data structure (Figure 2) introduced in Section 3.2.1. Here, we use a random permutation $g:\Sigma\to\Sigma$ to partition sets into small groups. To pre-process $L_i$, we first order all the elements in $L_i$ according to $g(x)$. Then any small group $L_i^z = \{x \mid x\in L_i \text{ and } g_t(x) = z\}$ (for any $t$) corresponds to a consecutive interval in $L_i$. For each small group $L_i^z$, we only need to store its starting position $\mathrm{left}(L_i^z)$ and ending position $\mathrm{right}(L_i^z)$.
For each $x\in L_1^z$, we need to check whether $x\in L_2^z$. Suppose $L_2^z = \{x^1, x^2, \ldots, x^s\}$. Although the elements in $L_2^z$ are not sorted in their own order, they are ordered as $g(x^1) \leq g(x^2) \leq \ldots \leq g(x^s)$ by the preprocessing. So, to check whether $x\in L_2^z$, we can binary-search for $g(x)$ in $\{g(x^1), g(x^2), \ldots, g(x^s)\}$, since the random permutation $g$ is a one-to-one mapping from $\Sigma$ to $\Sigma$.
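A minimal C sketch of this membership test (field names are our own): the permuted values $g(x)$ are stored in sorted order, each small group occupies an interval $[\mathrm{left}, \mathrm{right})$, and one binary search decides whether $x \in L_2^z$.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Elements are stored ordered by their permuted value g(x), so each
 * small group L^z is the consecutive interval [left, right).
 * Names are illustrative. */
typedef struct {
    const uint64_t *gvals;   /* g(x) for all elements, in sorted order */
    size_t left, right;      /* interval of the group, [left, right) */
} GroupInterval;

/* HashBin membership test: x ∈ L^z iff g(x) occurs in the group's
 * interval, because g is a bijection on Σ. */
static bool hashbin_member(uint64_t gx, const GroupInterval *grp) {
    size_t lo = grp->left, hi = grp->right;
    while (lo < hi) {                       /* standard binary search */
        size_t mid = lo + (hi - lo) / 2;
        if (grp->gvals[mid] < gx)      lo = mid + 1;
        else if (grp->gvals[mid] > gx) hi = mid;
        else return true;
    }
    return false;
}
```

Since each group $L_2^z$ holds $O(n_2/n_1)$ elements in expectation, each such lookup costs $O(\log(n_2/n_1))$, matching the bound derived in the proof above.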

B. COMPRESSION FOR ALGORITHM 5
For each small group $L_i^z$, we can use the standard techniques based on $\gamma$- and $\delta$-coding (see [23], p. 116) to compress the elements stored sequentially at the end of the block associated with $L_i^z$. However, the decoding of $\gamma$- and $\delta$-coding is expensive. As an alternative, we describe a simple but effective (i.e., efficient in decoding) compression technique for Algorithm 5 in the following:
(i) Instead of storing the length len of each structure, we can store the size $|L_i^z|$, since the structure length len can be derived from $|L_i^z|$. As proved in Proposition A.2, $|L_i^z|$ is usually very small, so we store it using unary code (e.g., 011 = 2).
(ii) Only if $|L_i^z| > 0$, we store $h_1(L_i^z), h_2(L_i^z), \ldots, h_m(L_i^z)$ in the following $m$ words.
(iii) To store the elements of $L_i^z$ in the remaining part of this block, we can use the standard techniques based on $\gamma$- or $\delta$-coding. We present another compression technique here, which is specifically designed for our algorithm; its decoding is much more efficient than that of $\gamma$- or $\delta$-coding. First, for the purpose of partitioning sets into small groups, we use a random permutation as $g$. Then, assuming $|\Sigma| = 2^w$ (the worst case), instead of storing each $x\in L_i^z$, we store $\mathrm{lowbits}_{t_i}(x) = g(x) \bmod 2^{w-t_i}$, i.e., the lowest $w-t_i$ bits of $g(x)$; the remaining highest $t_i$ bits of $g(x)$ correspond to $z = g_{t_i}(x)$. In online processing, decoding in this compression scheme can be done efficiently: to get $g(x)$ for an element $x\in L_i^z$, we concatenate $z = g_{t_i}(x)$ with $\mathrm{lowbits}_{t_i}(x)$ (see the sketch below). Since $g$ is a one-to-one mapping from $\Sigma$ to $\Sigma$, the intersection of $L_1$ and $L_2$ is equivalent to the intersection of $g(L_1)$ and $g(L_2)$.
The following basic analysis establishes an upper bound on the space consumed by our compression technique. Recall that $t_i = \lceil\log(n_i/\sqrt{w})\rceil$; there are $n_i/\sqrt{w}$ small groups in $L_i$ in total. Storing all of them requires:
(i) $n_i + n_i/\sqrt{w}$ bits for the $|L_i^z|$'s (since $\sum_z |L_i^z| = n_i$);
(ii) at most $mw\cdot n_i/\sqrt{w}$ bits for the $h_j(L_i^z)$'s; and
(iii) $(w-t_i)\cdot n_i$ bits for all elements (we store $g(x) \bmod 2^{w-t_i}$).
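The following is a minimal C sketch of the lowbits scheme, assuming $w = 64$ and $0 < t_i < w$ (function names are ours): encoding keeps the low $w - t_i$ bits of $g(x)$, and decoding is a shift and an OR.

```c
#include <stdint.h>

enum { W = 64 };  /* machine-word width w; assumes |Σ| = 2^w */

/* Keep only the low w - t_i bits of the permuted value g(x); the
 * discarded high t_i bits equal the group label z = g_ti(x). */
uint64_t lowbits_encode(uint64_t gx, unsigned ti) {
    return gx & (~0ull >> ti);          /* requires 0 < ti < W */
}

/* Recover g(x) by concatenating the group label z with the stored bits. */
uint64_t lowbits_decode(uint64_t stored, uint64_t z, unsigned ti) {
    return (z << (W - ti)) | stored;    /* requires 0 < ti < W */
}
```

Compared to $\gamma$-/$\delta$-decoding, which must parse variable-length codes bit by bit, this recovers each element with two register operations; correctness relies only on $g$ being a bijection, as noted above.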

C. ADDITIONAL EXPERIMENTS

C.1 Preprocessing Overhead
In this section, we evaluate the time taken to construct the novel structures when given a set $L_i$ as input. Our approach is similar to inverted indexes (and nearly all of the competing algorithms) in that the elements have to be sorted during pre-processing; thus, to put the construction overhead in perspective, we also measure and plot the overhead of sorting using an in-memory quicksort (averaging the time over 10 random instances). Figure 10 shows the construction times for the data structures without compression, for different set sizes $|L_i|$. Note that we use a log-scale on the y-axis to better separate the different graphs. As we can see, the additional construction overhead is generally a small fraction of the sorting overhead.
[Figure 10: Preprocessing Overhead — construction time (ms, log-scale) vs. set size (1M–10M) for HashBin, IntGroup, RanGroup, RanGroupScan4, and Sorting.]
Figure 11 shows the overhead for constructing the different compressed structures. We also plot the overhead for compressing the sets without additional hash images (resulting in the structures used in the compressed Merge, i.e., Merge_Gamma and Merge_Delta). Again, the required overhead is only a small fraction of the sorting overhead; moreover, the preprocessing time for the Lowbits compression scheme, which yields the best intersection performance in Section 4.1, is significantly lower than that of the alternatives.
[Figure 11: Preprocessing Overhead (with compression) — construction time (ms, log-scale) vs. set size (64K–8M) for Sorting, RanGroupScan_Lowbits, RanGroupScan_Gamma, RanGroupScan_Delta, Merge_Gamma, and Merge_Delta.]

C.2 More Experiments on Real Data
In this section, we present a breakdown of the experiments on real data in Section 4; to understand how the number of keywords in a query affects the relative performance in this scenario, we plotted the distribution of average intersection times for 2-, 3- and 4-keyword queries separately in Figure 12. As we can see, the relative performances are similar to those seen earlier, with three exceptions: (a) the Merge algorithm performs worse with an increasing number of keywords (as it cannot leverage the asymmetry in any way); (b) in contrast, Hash performs increasingly better, but still remains (close to) the worst performer; and (c) for 4-keyword queries, RanGroup slightly outperforms RanGroupScan.
[Figure 12: Normalized Execution Time on a Real Workload — normalized intersection times for 2-, 3-, and 4-keyword queries.]
