Fast Set Intersection in Memory

Bolin Ding
University of Illinois at Urbana-Champaign
201 N. Goodwin Avenue, Urbana, IL 61801, USA

Arnd Christian König
Microsoft Research
One Microsoft Way, Redmond, WA 98052, USA

ABSTRACT

Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets, with n elements in total, we will show how to compute their intersection in expected time $O(n/\sqrt{w} + kr)$, where r is the intersection size and w is the number of bits in a machine-word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state-of-the-art techniques for both synthetic and real data sets and workloads.

1. INTRODUCTION

Fast processing of set intersections is a key operation in many query processing tasks in the context of databases and information retrieval. For example, in databases, set intersections are used in various forms of data mining, text analytics, and evaluation of conjunctive predicates. They are also the key operations in enterprise and web search.

Many of these applications are interactive, meaning that the latency with which query results are displayed is a key concern. It has been shown in the context of search that query latency is critical to user satisfaction, with increases in latency directly leading to fewer search queries being issued and higher rates of query abandonment [10, 17]. As a consequence, significant portions of the sets to be intersected are often cached in main memory.

This paper will study the performance of set intersection algorithms for main-memory resident data. Note that these techniques are also relevant in the context of large disk-based (inverted) indexes, when large fractions of these reside in a main memory cache.

There has been considerable study of set intersection algorithms in information retrieval (e.g., [12, 4, 11]). Most of these papers assume that the underlying data structure is an inverted index [23]. Much of this work (e.g., [12, 4]) focuses on adaptive algorithms which use the number of comparisons as a measure of overhead. For in-memory data, additional structures which encode additional skipping-steps [18], tree-based structures [7], or hash-based algorithms become possible, which often outperform inverted indexes; e.g., using hash-based dictionaries, intersecting two sets $L_1$, $L_2$ requires expected time $O(\min(|L_1|, |L_2|))$, which is a factor of $\Theta(\log(1 + \max(|L_1|/|L_2|, |L_2|/|L_1|)))$ better than the best possible worst-case performance of comparison-based algorithms [6].

In this work, we propose new set intersection algorithms aimed at fast performance. These outperform the competing techniques for most inputs and are also robust in that, for inputs where they are not optimal, they are close to the best-performing algorithm. The tradeoff for this gain is a slight increase in the size of the data structures, when compared to an inverted index; however, in user-facing scenarios where latency is crucial, this tradeoff is often acceptable.

1.1 Contributions

Our approach leverages two key observations: (a) If w is the size (in bits) of a machine-word, we can encode a set from a universe of w elements in a single machine word, allowing for very fast intersections. (b) For the data distributions seen in many real-life examples (in particular search applications), the size of intersections is typically much smaller than the smallest set being intersected.

To illustrate the second observation, we analyzed the 10K most frequent queries issued against the Bing Shopping portal. For 94% of all queries it held that the size of the full intersection was at least one order of magnitude smaller than the document frequency of the least frequent keyword; for 76% of the queries the difference was two orders of magnitude. By exploiting these two observations, we make the following contributions.

(i) We introduce linear-space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. Given k sets, with n elements in total, these data structures allow us to compute their intersection in expected time $O(n/\sqrt{w} + kr)$, where r is the size of the intersection and w is the number of bits in a machine-word; when the size of the intersection is an order of magnitude (or more) smaller than the size of the smallest set being intersected, our approach yields significant improvements in execution time over previous approaches.

To the best of our knowledge, the best asymptotic bound for fast set intersection is achieved by the $O((n \log^2 w)/w + kr)$ algorithm of [6]. However, note that the bound relies on a large value of w; in practice, w is small (and constant), and $w < 2^{16} = 65536$ bits implies $1/\sqrt{w} < (\log w)^2/w$. More importantly, [6] requires complex bit-manipulation, making it slow in practice, which we will demonstrate empirically in Section 4.

(ii) We describe a much simpler algorithm that computes the intersection in expected $O(n/\alpha^m + mn/\sqrt{w} + kr\sqrt{w})$ time, where $\alpha$ is a constant determined by w, and m is a parameter. This algorithm has weaker guarantees in theory, but performs better in practice, and gives significant improvements over the various data structures typically used, while being very simple to implement.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 37th International Conference on Very Large Data Bases, August 29th - September 3rd 2011, Seattle, Washington. Proceedings of the VLDB Endowment, Vol. 4, No. 4. Copyright 2011 VLDB Endowment 2150-8097/11/01... $ 10.00.

2. BACKGROUND AND RELATED WORK

Algorithms based on Ordered Lists: Most work on set intersection focuses on ordered lists as the underlying data structure, in particular algorithms using inverted indexes, which have become the standard data structure in information retrieval. Here, documents are identified via a document ID, and for each term t, the inverted index stores a sorted list of all document IDs containing t.

Using this representation, two sets $L_1$, $L_2$ of similar sizes (i.e., $|L_1| \approx |L_2|$) can be intersected efficiently using a linear merge by scanning both lists in parallel, requiring $O(|L_1| + |L_2|)$ operations (the "merge step" in merge sort). This approach is wasteful when set sizes differ significantly or only small fractions of the sets intersect. For very different set sizes, algorithms have been proposed that exploit this asymmetry, requiring at most $\log \binom{|L_1| + |L_2|}{|L_1|} + |L_1|$ comparisons (for $|L_1| < |L_2|$) [16].

To improve the performance further, there has recently been significant work on so-called adaptive set-intersection algorithms [12, 4, 13, 1, 2, 5]. These algorithms use the total number of comparisons as a measure of the algorithm's complexity and aim to use a number of comparisons as close as possible to the minimum number of comparisons ideally required to establish the intersection. However, the resulting reduction in the number of comparisons does not necessarily result in performance improvements in practice: for example, in [2], binary search based algorithms outperform a parallel scan only when $|L_2| < 20|L_1|$, even though several times fewer comparisons are needed.

Hierarchical Representations: There are various algorithms for set intersections based on variants of balanced trees (e.g.

approaches can be combined with our work by using these small groups in place of individual documents.

Set intersections using multiple cores: Techniques that exploit multi-core architectures to speed up set intersections are described in [20, 22]. The use of multiple cores is orthogonal to our approach in the sense that our algorithms can be parallelized for these architectures as well; however, this is beyond the scope of our paper.

3. OUR APPROACH

Notation: We are given a collection of N sets $\mathcal{S} = \{L_1, \ldots, L_N\}$, where $L_i \subseteq \Sigma$ and $\Sigma$ is the universe of elements in the sets; let $n_i = |L_i|$ be the size of set $L_i$. Suppose elements in a set are ordered, and for a set L, let $\inf(L)$ and $\sup(L)$ be the minimum and maximum elements of L, respectively. We use w to denote the size (number of bits) of a word on the target processor. Throughout the paper we will use $\log$ to denote $\log_2$. Finally, we use $[w]$ to denote the set $\{1, \ldots, w\}$. Our approach can be extended to bag semantics by additionally storing element frequency.

Framework: Our task is to design data structures such that the intersection of multiple sets can be computed efficiently. We differentiate between a pre-processing stage, during which we reorganize each set and attach additional index structures, and an online processing stage, which uses the pre-processed data structures to compute intersections. An intersection query is specified via a collection of k sets $L_1, L_2, \ldots, L_k$ (to simplify notation, we use the offsets $1, 2, \ldots, k$ to refer to the sets in a query throughout this section); our goal is to compute $L_1 \cap L_2 \cap \cdots \cap L_k$ efficiently.
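As an illustration of the baseline ideas discussed in Sections 1 and 2, the following is a minimal Python sketch, not the algorithms proposed in this paper. It shows a linear merge of two sorted lists, a hash-based intersection that probes the larger set with the elements of the smaller one, and observation (a): a subset of a universe of w elements packed into one word, so that intersection becomes a single bitwise AND. The function names and the choice W = 64 are assumptions made for this example only.

```python
W = 64  # assumed machine-word size in bits (the paper's w)


def merge_intersect(a, b):
    """Linear merge of two sorted lists of distinct IDs.

    Uses O(|a| + |b|) comparisons, like the merge step in merge sort.
    """
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def hash_intersect(a, b):
    """Hash-based intersection of two Python sets.

    Iterates over the smaller set and probes the larger one, so the
    expected number of probes is min(|a|, |b|).
    """
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    return sorted(x for x in small if x in large)


def encode_word(elements):
    """Pack a subset of {1, ..., W} into one integer used as a W-bit word.

    Bit (x - 1) is set exactly when x is in the subset.
    """
    word = 0
    for x in elements:
        if not 1 <= x <= W:
            raise ValueError(f"element {x} is outside the universe [1, {W}]")
        word |= 1 << (x - 1)
    return word


def decode_word(word):
    """Recover the sorted list of elements encoded in a word."""
    return [i + 1 for i in range(W) if (word >> i) & 1]


def word_intersect(word_a, word_b):
    """Intersect two encoded subsets with a single bitwise AND."""
    return word_a & word_b
```

For example, `decode_word(word_intersect(encode_word([1, 17, 64]), encode_word([2, 17, 64])))` yields `[17, 64]`. Python integers are arbitrary-precision, so the W-bit limit is enforced here only by the range check in `encode_word`; on a real processor the AND would be one instruction on a native word.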
