Dynamic Elias-Fano Representation
Giulio Ermanno Pibiri¹ and Rossano Venturini¹
¹ Computer Science Department, University of Pisa, Italy
[email protected], [email protected]

Abstract

We show that it is possible to store a dynamic ordered set S(n, u) of n integers drawn from a bounded universe of size u in space close to the information-theoretic lower bound and yet preserve the asymptotic time optimality of the operations. Our results leverage the Elias-Fano representation of S(n, u), which takes EF(S(n, u)) = n⌈log(u/n)⌉ + 2n bits of space and can be shown to be less than half a bit per element away from the information-theoretic minimum. Considering a RAM model with memory words of Θ(log u) bits, we focus on the case in which the integers of S are drawn from a polynomial universe of size u = n^γ, for any γ = Θ(1). We represent S(n, u) with EF(S(n, u)) + o(n) bits of space and:
1. support static predecessor/successor queries in O(min{1 + log(u/n), log log n});
2. make S grow in an append-only fashion by spending O(1) per inserted element;
3. support random access in O(log n / log log n) worst-case, insertions/deletions in O(log n / log log n) amortized and predecessor/successor queries in O(min{1 + log(u/n), log log n}) worst-case time.
These time bounds are optimal.

1998 ACM Subject Classification E.1 Data Structures, E.4 Coding and Information Theory, F.2.2 Nonnumerical Algorithms and Problems
Keywords and phrases Succinct Data Structures, Integer Sets, Predecessor Problem, Elias-Fano
Digital Object Identifier 10.4230/LIPIcs.CPM.2017.48

1 Introduction

The problem we consider is that of representing in compressed space a dynamic ordered set S of n integer keys, which is a fundamental textbook problem (see the introduction to parts III and V of [8]).
In general, any self-balancing search tree data structure, e.g., AVL or Red-Black tree, solves the problem optimally in the comparison model, by implementing all operations in O(log n) worst-case time and using linear space [8]. However, by exploiting the fact that the stored keys are integers drawn from a bounded universe of size u, the problem is known to admit more efficient solutions in terms of asymptotic time complexity while still retaining linear space [8, 23, 26, 30, 13, 14]. Classical examples include the van Emde Boas tree [26, 27, 28], the x/y-fast trie [30] and the fusion tree [14], which was the first data structure able to surpass the information-theoretic lower bound, by exhibiting an optimal [13] amount of time per operation within a number of memory words proportional to the size of the input. Some effort has been spent in trying to reduce the space requirements of the representation [16, 18, 25], but known compressed solutions do not closely match the information-theoretic lower bound of the underlying integer set.

© Giulio Ermanno Pibiri and Rossano Venturini; licensed under Creative Commons License CC-BY. 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Editors: Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter; Article No. 48; pp. 48:1–48:14. Leibniz International Proceedings in Informatics. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.

In this paper we show that it is possible to preserve the optimal bounds for the operations under almost optimal space requirements. The key ingredient of our data structures is the Elias-Fano representation of monotone integer sequences [10, 11]. In particular, Elias-Fano encodes a monotone integer sequence S(n, u) in EF(S(n, u)) = n⌈log(u/n)⌉ + 2n bits, which can be shown to be less than half a bit per element away from optimality [10], while maintaining the capability of randomly accessing an integer in O(1) worst-case time. The query Predecessor, which, given an integer x, returns max{y ∈ S : y < x}, is possible as well over the compressed sequence in O(1 + log(u/n)) worst-case time.

These properties make Elias-Fano extremely efficient in crucial practical applications, e.g., inverted index compression, just to mention the most notable one. Since inverted indexes can indeed be regarded as collections of sorted integer sequences, recent works [29, 20] have shown that Elias-Fano exhibits the best time/space trade-off thanks to its efficient search capabilities and strong theoretical guarantees. For this specific application, the operation that has to be supported efficiently is Successor(x) = min{y ∈ S : y ≥ x}, which is commonly called NextGEQ (Next Greater or EQual) [29, 20]. Throughout the paper we adopt the classical nomenclature and discuss Predecessor(x), as it is well known that the twin query Successor(x) is solved in a similar way.

The natural question is whether it is possible to extend the static Elias-Fano representation to dynamic scenarios, in which integers can also be inserted into and deleted from S. To this end, we consider the case in which the n integers of S are drawn from a polynomial universe of size u = n^γ, for any γ = Θ(1). This is the classical operational setting considered by Fredman and Saks [13] (list representation problem) and lets us concentrate on the typical case of practical interest.

In order to characterize the asymptotic complexity of the data structures described in the paper and review the literature, we use a RAM model with word size w = Θ(log u) bits. We also adopt the usual trans-dichotomous assumption [14], making w grow with n as needed.
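To make the Elias-Fano space bound and the Predecessor/Successor queries discussed above concrete, here is a minimal Python sketch of one common formulation of the static encoding. It is an illustration under stated assumptions, not the data structure developed in this paper: the class name `EliasFano` and its methods are hypothetical, the bit arrays are plain Python lists, `access` uses a precomputed table of one-positions instead of a succinct select structure, and the queries use a simple binary search in O(log n) time rather than the bounds quoted above.

```python
class EliasFano:
    """Illustrative static Elias-Fano encoding (hypothetical helper class)."""

    def __init__(self, values, u):
        # Assumption: values is a non-empty, non-decreasing list of integers in [0, u).
        assert values and all(0 <= v < u for v in values)
        assert all(a <= b for a, b in zip(values, values[1:]))
        self.n = len(values)
        # l = ceil(log2(u / n)), computed with integers to avoid rounding issues.
        self.l = 0
        while (self.n << self.l) < u:
            self.l += 1
        mask = (1 << self.l) - 1
        # Low parts: the l least significant bits of each value, stored verbatim.
        self.low = [v & mask for v in values]
        # High parts: element i sets the bit at position (v >> l) + i.
        self.high = [0] * ((values[-1] >> self.l) + self.n)
        for i, v in enumerate(values):
            self.high[(v >> self.l) + i] = 1
        # Stand-in for a select structure: positions of the set bits.
        self.ones = [p for p, bit in enumerate(self.high) if bit]

    def bits(self):
        # n * l bits for the low parts plus the length of the high bit array.
        return self.n * self.l + len(self.high)

    def access(self, i):
        # Recover the i-th value from the position of its set bit and its low part.
        return ((self.ones[i] - i) << self.l) | self.low[i]

    def _rank(self, x):
        # Number of stored values strictly smaller than x (binary search).
        lo, hi = 0, self.n
        while lo < hi:
            mid = (lo + hi) // 2
            if self.access(mid) < x:
                lo = mid + 1
            else:
                hi = mid
        return lo

    def predecessor(self, x):
        # max{y in S : y < x}, or None if no such y exists.
        r = self._rank(x)
        return self.access(r - 1) if r > 0 else None

    def successor(self, x):
        # min{y in S : y >= x} (NextGEQ), or None if no such y exists.
        r = self._rank(x)
        return self.access(r) if r < self.n else None


ef = EliasFano([3, 4, 7, 13, 14, 15, 21, 43], 64)
print(ef.l, ef.bits(), ef.n * ef.l + 2 * ef.n)  # low bits per element, bits used, bound
print(ef.predecessor(13), ef.successor(16))
```

In this example n = 8 and u = 64, so each element keeps 3 low bits and the total stays within n⌈log(u/n)⌉ + 2n = 40 bits.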
We maintain S(n, u) using EF(S(n, u)) + o(n) bits of space, hence introducing a sublinear space overhead with respect to its static Elias-Fano representation, and show how:
1. static predecessor/successor queries can be supported in O(min{1 + log(u/n), log log n}) worst-case time (note that the first term of the bound, i.e., O(1 + log(u/n)), is optimal only for polynomial universes of size u = n^γ with 1 ≤ γ ≤ 1 + log log n / log n);
2. to extend S in an append-only fashion, i.e., by assuming that integers are inserted in the data structure in sorted order, using a constant amount of work per integer;
3. to maintain S in a fully dynamic way, supporting random access in O(log n / log log n) worst-case, insertions/deletions in O(log n / log log n) amortized and predecessor/successor queries in O(min{1 + log(u/n), log log n}) worst-case time.

2 Related Work

We organize the discussion of the related work in three parts. The first part reviews the results on the static predecessor problem. The second explains in detail the (static) Elias-Fano representation of monotone integer sequences, because it forms the backbone of our solutions. The last part describes the results closest to our work for the maintenance of a dynamic integer set.

2.1 Static Predecessor Problem

We could solve the static predecessor problem in O(1) worst-case time by storing the results of every possible query using perfect hashing [12] in O(u) words of space. In order not to trivialize the problem, assume we have a polynomial space budget, e.g., we deal with a data structure occupying O(n^{O(1)}) words.

Ajtai [1] proved the first ω(1) lower bound, claiming that ∀w, ∃n that gives Ω(√(log w)) query time. Only ten years later Beame and Fich [3, 4] proved two strong bounds for any cell-probe data structure¹. They proved that ∀w, ∃n that gives Ω(log w / log log w) query time and that ∀n, ∃w that gives Ω(√(log n / log log n)) query time.
They also gave a static data structure achieving O(min{log w / log log w, √(log n / log log n)}) which is, therefore, optimal.

Building on a long line of research, Pătraşcu and Thorup [21, 22] finally proved the following optimal space-time trade-off for a static data structure taking m = n 2^a w bits of space, with a = log(m/n) − log w:

\[
\Theta\left(\min\left\{ \log_w n,\;\; \log\frac{w - \log n}{a},\;\; \frac{\log\frac{w}{a}}{\log\left(\frac{a}{\log n}\,\log\frac{w}{a}\right)},\;\; \frac{\log\frac{w}{a}}{\log\left(\log\frac{w}{a} \,\big/\, \log\frac{\log n}{a}\right)} \right\}\right) \tag{1}
\]

This lower bound holds for cell-probe, RAM, trans-dichotomous RAM, external memory and communication game models. The first branch of the trade-off indicates that, whenever we are in RAM or external memory with one integer fitting in one memory word, fusion trees are optimal, as these require O(log_w n) = O(log n / log w) query time. The second branch holds for polynomial universes, i.e., whenever u = n^γ, for any γ = Θ(1). In such a case we have that w = Θ(log u) = γ log n, therefore y-fast tries [30] and van Emde Boas trees [26, 27, 28] are optimal with query time O(log log u) = O(log log n). Finally, the last two branches of the trade-off treat the case of super-polynomial universes. In particular, the third branch matches the lower bound by Beame and Fich [3, 4] that requires n^{O(1)} words of space; the fourth branch improves this space occupancy, showing that n^{1+1/exp(log^{1−ε} log u)} words are sufficient, for any ε > 0.

2.2 Static Elias-Fano Representation

The integer encoding we describe in this section was independently proposed by Peter Elias [10] and Robert Mario Fano [11], hence its name.