Forty-Seventh Annual Allerton Conference, Allerton House, UIUC, Illinois, USA, September 30 - October 2, 2009

A Locally Encodable and Decodable Compressed Data Structure

Venkat Chandar, Devavrat Shah, Gregory W. Wornell

Abstract—In a variety of applications, ranging from high-speed networks to massive databases, there is a need to maintain histograms and other statistics in a streaming manner. Motivated by such applications, we establish the existence of efficient source codes that are both locally encodable and locally decodable. Our solution is an explicit construction in the form of a (randomized) data structure for storing N integers. The construction uses multi-layered sparse graph codes based on Ramanujan graphs, and has the following properties: (a) the structure utilizes the minimal possible space, and (b) the value of any of the integers can be read or updated in near constant time (on average and with high probability).¹ By contrast, data structures proposed in the context of streaming algorithms and compressed sensing in recent years (e.g., various sketches) support local encodability, but not local decodability; and those known as succinct data structures are locally decodable, but not locally encodable.

I. INTRODUCTION

In this paper, we are concerned with the design of space efficient and dynamic data structures for maintaining a collection of integers, motivated by many applications that arise in networks, databases, signal processing, etc. For example, in a high-speed network router one wishes to count the number of packets per flow in order to police the network or identify a malicious flow. As another example, in a very large database, relatively limited main memory requires "streaming" computation of queries. And for "how many?" type queries, this leads to the requirement of maintaining dynamically changing integers.

In these applications, there are two key requirements. First, we need to maintain such statistics in a dynamic manner, so that they can be updated (increment, decrement) and evaluated very quickly. Second, we need the data to be represented as compactly as possible to minimize storage, which is often highly constrained in these scenarios. Therefore, we are interested in data structures that store N integers in a compressed manner and can read and write/update any of these N integers in near constant time.²

A. Prior work

Efficient storage of a sequence of integers in a compressed manner is a classical problem of source coding, with a long history starting with the pioneering work of Shannon [13]. The work of Lempel and Ziv [14] provided an elegant universal compression scheme, and since that time there has continued to be important progress on compression. However, most well-known schemes suffer from two drawbacks. First, these schemes do not support efficient updating of the compressed data: in typical systems, a small change in one value of the input sequence leads to a huge change in the compressed output. Second, such schemes require the whole sequence to be decoded from its compressed representation in order to recover even a single element of the input sequence.

A long line of research has addressed the second drawback. For example, Bloom filters [35], [36], [37] are a popular data structure for storing a set in compressed form while allowing membership queries to be answered in constant time. The rank/select problem [27], [28], [29], [30], [31], [32], [33], [34] and the dictionary problem [38], [39], [32] in the field of succinct data structures are also examples of problems involving both compression and the ability to efficiently recover a single element of the input, and [26] gives a succinct data structure for arithmetic coding that supports efficient, local decoding but not local encoding. In summary, this line of research successfully addresses the second drawback but not the first: changing a single value in the input sequence can still lead to huge changes in the compressed output.

Similarly, several recent papers have examined the first drawback but not the second. In order to be able to update an individual integer efficiently, one must consider compression schemes that possess some kind of continuity property, whereby a change in a single value of the input sequence leads to a small change in the compressed representation. In recent work, Varshney et al. [40] analyzed "continuous" source codes from an information theoretic point of view, but the notion of continuity considered there is much stronger than the notion we are interested in, and [40] does not take computational complexity into account. Also, Mossel and Montanari [1] have constructed space efficient "continuous" source codes based on nonlinear, sparse graph codes. However, because of the nonlinearity, it is unclear that the continuity property of the code is sufficient to give a computationally efficient algorithm for updating a single value in the input sequence.

In contrast, linear, sparse graph codes have a natural, computationally efficient algorithm for updates. A large body of work has focused on such codes in the context of streaming algorithms and compressed sensing. Most of this work essentially involves the design of sketches based on linear codes [4], [5], [6], [3], [2], [7], [8], [9], [10], [11]. Among these, [3], [2], [7], [8], [9], [10], [11] consider sparse linear codes.

This work was supported, in part, by NSF under Grant No. CCF-0515109 and by Microsoft Research. The authors are with the Dept. of EECS, MIT, Cambridge, MA 02139. Email: {vchandar, devavrat, gww}@mit.edu.

¹See Section I-B for our sharp statements and formal results.

²Or, at worst, in poly-logarithmic time (in N). Moreover, we will insist that the space complexity of the algorithms utilized for read and write/update be of smaller order than the space utilized by the data structure.

Existing solutions from this body of work are ill-suited for our purposes for two reasons. First, the decoding algorithms (LP based or combinatorial) require decoding all integers to read even one integer. Second, they are sub-optimal in terms of space utilization when the input is very sparse.

Perhaps the work most closely related to this paper is that of Lu et al. [21], which develops a space efficient linear code. They introduced a multi-layered linear graph code (which they refer to as a Counter Braid). In terms of graph structure, the codes considered in [21] are essentially identical to Tornado codes [41], so Counter Braids can be viewed as one possible generalization of Tornado codes designed to operate over the integers instead of a finite field. [21] establishes the space efficiency of Counter Braids when (a) the graph structure and layers are carefully chosen depending upon the distributional information of the inputs, and (b) the decoding algorithm is based on maximum likelihood (ML) decoding, which is in general computationally very expensive. The authors propose a message-passing algorithm to overcome this difficulty and provide an asymptotic analysis to quantify the performance of this algorithm in terms of space utilization. However, the message-passing algorithm may not provide optimal performance in general. In summary, the solution of [21] does not solve our problem of interest because: (a) it is not locally decodable, i.e., it does not support efficient evaluation of a single input³; (b) the space efficiency requires distributional knowledge of the input; and (c) there is no explicit guarantee that the space requirements for the data structure itself (separate from its content) are not excessive; the latter is only assumed.

B. Our contribution

As the main result of this paper, we provide a space efficient data structure that is locally encodable and decodable. Specifically, we consider the following setup.

Setup. Let there be N integers x_1, ..., x_N, with notation x = [x_i]_{1≤i≤N}. Define

    S(B_1, B_2, N) = { x ∈ ℕ^N : ‖x‖_1 ≤ B_1 N, ‖x‖_∞ ≤ B_2 }.

Let B_2 ≥ B_1 without loss of generality, where B_1, B_2 can be arbitrary functions of N with B_2 ≤ N B_1. Of particular interest is the regime in which both N and B_1 (and thus B_2) are large. Given this, we wish to design a data structure that can store any x ∈ S(B_1, B_2, N) and that has the following desirable properties:

Read. For any i, 1 ≤ i ≤ N, x_i should be read in near constant time. That is, the data structure should be locally decodable.

Write/Update. For any i, 1 ≤ i ≤ N, x_i should be updated (increment, decrement or write) in near constant time. That is, the data structure should be locally encodable.

Space. The amount of space utilized should be the minimal possible. The space utilized for storing auxiliary information about the data structure (e.g. pointers or random bits), as well as the space utilized by the read and write/update algorithms, should be accounted for.

Naive Solutions Don't Work. We briefly discuss two naive solutions, each with some of these properties but not all, that will help explain the non-triviality of this seemingly simple question. First, if each integer is allocated log B_2 bits, then one solution is the standard Random Access Memory setup: it has O(1) read and write/update complexity. However, the space required is N log B_2, which can be much larger than the minimal space required (as we will establish, it is essentially N log B_1). To overcome this poor space utilization, consider the following "prefix free" coding solution. Here, each x_i is stored in a serial manner: each of them utilizes (roughly) log x_i bits to store its value, and these values are separated by (roughly) log log x_i 1s followed by a 0. Such a coding will be essentially optimal, as it will utilize (roughly) N(log B_1 + log log B_1) bits. However, read and write/update will require Ω(N) operations; a sketch of this encoding follows.
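As an illustration of the prefix-free solution and its read cost, here is a minimal Python sketch (ours, not from the paper). For brevity it spends roughly log x_i bits on a unary length prefix rather than the log log x_i separator described above (an Elias-delta style length-of-length prefix would recover the stated bound); the point being illustrated is the Ω(N) sequential read.

    def encode(values):
        # Unary-length-prefixed code: for each x, emit the bit-length of x
        # in unary (ones, then a terminating zero), then the bits of x.
        out = []
        for x in values:
            bits = bin(x)[2:]
            out.append("1" * len(bits) + "0" + bits)
        return "".join(out)

    def read(stream, i):
        # Recovering the i-th value requires scanning past all earlier
        # codewords, hence Omega(N) operations in the worst case.
        pos = 0
        for _ in range(i + 1):
            length = 0
            while stream[pos] == "1":   # unary length prefix
                length += 1
                pos += 1
            pos += 1                    # skip the terminating zero
            word = stream[pos:pos + length]
            pos += length
        return int(word, 2)

    s = encode([5, 0, 12, 3])
    assert read(s, 2) == 12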
As noted earlier, the streaming data structures and succinct data structures have either good encoding or good decoding performance along with small space utilization, but not both. In this paper, we desire a solution with both good encoding and good decoding performance, with small space.

Our Result. We establish the following result as the main contribution of this paper.

Theorem I.1 The data structure described in Section II achieves the following:

Space efficiency. The total space utilization is (1 + o(1)) N log B_1.⁴ This accounts for space utilized by auxiliary information as well.

Local encoding. The data structure is based on a layered, sparse, linear graphical code with an encoding algorithm that takes poly(max(log(B_2/B_1^2), 1)) time to write/update.

Local decoding. Our local decoding algorithm takes

    exp( O( max(1, log log(B_2/B_1^2) · log log log(B_2/B_1^2)) ) ) = quasipoly( max(log(B_2/B_1^2), 1) )

time on average and w.h.p.⁵ to read x_i for any i. The algorithm produces correct output w.h.p. Note that the randomness is in the construction of the data structure.

In summary, our data structure is optimal in terms of space utilization, as a simple counting argument for the size of S(B_1, B_2, N) suggests that the space required is at least N log B_1; encoding/decoding can take poly(log N) time at worst, e.g. when B_1 = O(1) and B_2 = N^α for some α ∈ (0, 1), but in more balanced situations, e.g. when B_2 = N^α and B_1 = N^{α/2}, the encoding/decoding time can be O(1).

³For example, see the remark in Section 1.2 of [21].

⁴The o(1) here is with respect to both B_1 and N, i.e., the o(1) → 0 as min(N, B_1) → ∞.

⁵In this paper, unless stated otherwise, by with high probability (w.h.p.) we mean probability 1 − 1/poly(N).
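For concreteness, the counting argument invoked above can be written out explicitly; this short computation is ours, with deliberately crude constants:

    \[
      \{\, x \in \mathbb{N}^N : 0 \le x_i < B_1 \ \text{for all } i \,\} \subseteq \mathcal{S}(B_1, B_2, N)
      \;\Longrightarrow\; |\mathcal{S}(B_1, B_2, N)| \ge B_1^{N},
    \]
    so any lossless representation of $\mathcal{S}(B_1, B_2, N)$ must use at least
    $\log|\mathcal{S}(B_1, B_2, N)| \ge N \log B_1$ bits, matching the leading term of the
    $(1+o(1))\,N\log B_1$ space bound of Theorem I.1.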

In what follows, we describe a construction with the properties stated in Theorem I.1, but, due to space constraints, we omit all proofs. Before detailing our construction, we provide some perspectives. The starting point for our construction is the set of layered, sparse, linear graphical codes introduced in [21]. In order to obtain the desired result (as explained above), we have to overcome two non-trivial challenges. The first challenge concerns controlling the space utilized by the auxiliary information in order to achieve overall space efficiency. In the context of linear graphical codes, this includes information about the adjacency matrix of the graph. It may require too much space to store the whole adjacency matrix explicitly (at least Ω(N log N) space for an N × Θ(N) bipartite graph). Therefore, it is necessary to use graphs that can be stored succinctly. The second challenge concerns the design of local encoding and decoding algorithms. Sparse linear codes readily have local encoding, but the design of a local decoding algorithm is not clear: all the known decoding algorithms in the literature are designed for decoding all the information simultaneously. Therefore, we need to come up with a novel decoding algorithm.

To address the first challenge, we use pseudorandom graphs based on appropriate Ramanujan graphs. These graphs have efficient 'look up', a succinct description (i.e., O(√N log N) space), and a 'locally tree-like' property (i.e., girth at least Ω(log N)). To address the second challenge, using this locally tree-like property and the classical message-passing algorithm, we design a local decoding algorithm that is essentially a successively refined maximum likelihood estimation based on the local graph structure.

II. CONSTRUCTION

This section describes the construction and corresponding encoding/decoding algorithms that lead to the result stated in Theorem I.1. A caricature of the data structure is portrayed in Figure 1.

[Fig. 1. An example of the data structure, showing layers 0, 1, 2, ..., L connected by bipartite graphs G_1, G_2, ..., G_L.]

Overall Structure. Let L denote the number of layers. Let V_0 = {1, ..., N}, with i ∈ V_0 corresponding to x_i. Each index or node i ∈ V_0 has a counter of b_0 bits allocated to it. Because of this association, in what follows we use the terms node (or vertex) and counter interchangeably; e.g., sometimes we call the ith node (or vertex) in V_0 the ith counter in V_0, and vice versa. Intuitively, the b_0 bits corresponding to the ith node or index of V_0 are used to store the least significant b_0 bits of x_i. In order to store additional information, i.e., the more significant bits of the x_i's collectively, layers 1 to L are utilized. To this end, let V_ℓ denote a collection of indices or nodes corresponding to layer ℓ, 1 ≤ ℓ ≤ L, with |V_ℓ| ≤ |V_{ℓ−1}| for 1 ≤ ℓ ≤ L. Each node i ∈ V_ℓ is allocated b_ℓ bits. The nodes of layers ℓ−1 and ℓ, i.e., V_{ℓ−1} and V_ℓ, are associated by a bipartite graph G_ℓ = (V_{ℓ−1}, V_ℓ, E_ℓ) for 1 ≤ ℓ ≤ L. We denote the ith counter (stored at the ith node) in V_ℓ by (i, ℓ), and use c(i, ℓ) to denote the value stored by counter (i, ℓ).

Setting Parameters. Now we define the parameters L, b_ℓ and |V_ℓ| for 0 ≤ ℓ ≤ L; later, we shall describe the graphs G_ℓ, 1 ≤ ℓ ≤ L. Let C and K ≥ 3 be positive integers, and let ε = 3/K. The parameters C and ε control the excess space used by the data structure compared to the information theoretic limit. Set L as

    L = log log( B_2 / (2 K^2 B_1^2) ) − log log(B_1) + C.

Also set b_0 = log(K^2 B_1). Let |V_1| = εN, and let b_1 = log(12 K B_1). For 2 ≤ ℓ ≤ L − 1, |V_ℓ| = 2^{−(ℓ−1)} εN and b_ℓ = 3. Finally, |V_L| = 2^{−(L−1)} εN, and b_L = log( B_2 / (2 K^2 B_1^2) ).
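To make the bookkeeping concrete, the following Python sketch (ours, not from the paper) tabulates the layer sizes and counter widths defined above; the ceiling rounding and the max(1, ·) guard are our assumptions, since the paper's rounding conventions are omitted here.

    import math

    def layer_parameters(N, B1, B2, K=8, C=2):
        # Tabulate (|V_l|, b_l) for the layered structure of Section II.
        # Ceil rounding and the max(1, .) guard are our assumptions;
        # the case L == 1 degenerates (layer 1 and layer L coincide).
        eps = 3.0 / K
        L = max(1, round(math.log2(math.log2(B2 / (2 * K**2 * B1**2)))
                         - math.log2(math.log2(B1)) + C))
        sizes = {0: N, 1: math.ceil(eps * N)}
        widths = {0: math.ceil(math.log2(K**2 * B1)),
                  1: math.ceil(math.log2(12 * K * B1))}
        for l in range(2, L):
            sizes[l] = math.ceil(2.0 ** -(l - 1) * eps * N)
            widths[l] = 3
        sizes[L] = math.ceil(2.0 ** -(L - 1) * eps * N)
        widths[L] = math.ceil(math.log2(B2 / (2 * K**2 * B1**2)))
        total_bits = sum(sizes[l] * widths[l] for l in sizes)
        return L, sizes, widths, total_bits

    # Example: N = 10**6, B1 = 2**10, B2 = 2**300 gives a handful of layers.
    L, sizes, widths, bits = layer_parameters(10**6, 2**10, 2**300)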
Graph Description. Here we describe the graphs G_ℓ, 1 ≤ ℓ ≤ L. Recall that we want these graphs to have a succinct description, efficient look up, large girth and 'enough' randomness. With these properties in mind, we start by describing the construction of G_L. We begin with a bipartite graph H with the properties: (a) H has √|V_{L−1}| vertices on the left and √|V_{L−1}|/2 vertices on the right; and (b) H is (3, 6)-regular. The construction of such an H will be described later, but first we use it to define G_L. Let G̃ be a graph consisting of √|V_{L−1}| disjoint copies of H. Formally, G̃ has left vertices v_0, v_1, ..., v_{|V_{L−1}|−1} and right vertices w_0, w_1, ..., w_{|V_L|−1}. Connect v_0, v_1, ..., v_{√|V_{L−1}|−1} to w_0, w_1, ..., w_{0.5√|V_{L−1}|−1} using H. Then, connect v_{√|V_{L−1}|}, v_{√|V_{L−1}|+1}, ..., v_{2√|V_{L−1}|−1} to w_{0.5√|V_{L−1}|}, w_{0.5√|V_{L−1}|+1}, ..., w_{√|V_{L−1}|−1} using H, and so on. G_L is constructed to be isomorphic to G̃. Specifically, every left node i ∈ {0, ..., |V_{L−1}| − 1} of G_L inherits the neighborhood of left node Q(i) ∈ {0, ..., |V_{L−1}| − 1} of G̃, where Q(i) is defined as follows:

1. Let M : {0, ..., |V_{L−1}| − 1} → √|V_{L−1}| × √|V_{L−1}| be such that M(i) = (r(i), c(i)) with

    r(i) = i mod √|V_{L−1}|,  and
    c(i) = ( i + ⌊ i / √|V_{L−1}| ⌋ ) mod √|V_{L−1}|.

2. Let the random map R : √|V_{L−1}| × √|V_{L−1}| → √|V_{L−1}| × √|V_{L−1}| be defined as R(x, y) = (π_1(x), π_2(y)), where π_1, π_2 are two permutations of length √|V_{L−1}| chosen independently, uniformly at random.

3. Then, Q = M^{−1} ∘ R ∘ M.

The construction of G_ℓ, 2 ≤ ℓ ≤ L − 1 follows a very similar procedure, with a different number of copies of H used to construct the associated G̃ to account for the varying size of V_ℓ (and, of course, new randomness to define the corresponding Q). We skip these details due to space constraints.
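The look-up just described is cheap to store and evaluate: only the two permutations (O(√|V_{L−1}|) words) and the small base graph H are kept. A Python sketch (ours; the base graph is left abstract as an adjacency list) follows.

    import math, random

    def make_Q(n):
        # Pseudorandom relabeling Q = M^{-1} o R o M on {0, ..., n-1},
        # where n = |V_{L-1}| is assumed to be a perfect square.
        s = math.isqrt(n)
        assert s * s == n
        pi1 = list(range(s)); random.shuffle(pi1)
        pi2 = list(range(s)); random.shuffle(pi2)

        def Q(i):
            r, c = i % s, (i + i // s) % s     # M(i) = (r(i), c(i))
            r, c = pi1[r], pi2[c]              # R(r, c)
            return r + s * ((c - r) % s)       # M^{-1}(r, c)

        return Q

    def neighbors_GL(i, Q, H_left_adj, s):
        # Left node i of G_L inherits the neighborhood of Q(i) in the
        # disjoint-copies graph G~; H_left_adj[j] lists the right-neighbors
        # of left node j in one copy of H (s left, s/2 right vertices).
        copy, offset = divmod(Q(i), s)
        return [copy * (s // 2) + w for w in H_left_adj[offset]]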

Finally, G_1 is constructed in essentially the same way as G_L, but with a different, (3, K)-regular base graph H_1; the rest of the construction is identical. The base graphs H and H_1 are based on Lubotzky et al.'s construction of Ramanujan graphs [16]. Again, due to space constraints we skip the details of the precise choice of parameters. We note that the earlier work of Rosenthal and Vontobel [17] uses such graphs in the context of LDPC codes; however, we require an important modification of this construction to suit our needs.

In summary, the graphs G_ℓ, 1 ≤ ℓ ≤ L thus constructed have the following properties.

Lemma II.1 For 1 ≤ ℓ ≤ L, G_ℓ has (a) girth Ω(log(N)); (b) storage space O(√N log N); and (c) for any node v (in the left or right partition), its neighbors can be computed in O(1) time.

An Example: Encoding/Decoding. Here we explain how the data structure thus described can be used to store integers. Specifically, we describe the encoding and decoding algorithms. To this end, consider the situation where N = 4 and L = 2. That is, V_0 = {1, 2, 3, 4}. Let V_1, V_2 be such that |V_1| = 2 and |V_2| = 1. Let b_0 = 2, b_1 = 2 and b_2 = 2, and initially all counters are set to 0. Now we explain the encoding algorithm. We wish to write x_3 = 2, or equivalently we wish to increment x_3 by 2. For this, we add 2 to c(3, 0). Since there are two bits in layer 0 at that position, its capacity is 4. Therefore, we add 2 to the current value modulo 4. That is, the value in those two bits is updated to 0 + 2 mod 4 = 2. The resulting 'overflow' or 'carry' is ⌊(0 + 2)/4⌋ = 0. Since it is 0, no further change is performed. This is shown in Figure 2(a).

Now, suppose x_3 is increased further by 3. Then, c(3, 0) is changed to 2 + 3 mod 4 = 1, and the 'carry' ⌊(2 + 3)/4⌋ = 1 is added to the counters in layer 1 that are connected to (3, 0) via graph G_1, i.e., counter (1, 1). Repeating the same process, c(1, 1) is changed to 0 + 1 mod 4 = 1, the resulting carry is 0, and hence no further change is performed. This is shown in Figure 2(b). Using a similar approach, writing x_2 = 12 will lead to Figure 2(c).

[Fig. 2. (a) Add 2 to x_3; (b) add 3 to x_3; (c) add 12 to x_2; (d) initial configuration for decoding. Each counter is 2 bits wide.]

The algorithm to decrement any of the x_i is exactly the same as that for increment, but with a negative value; a code sketch of this update rule follows.
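The carry-propagating update just illustrated, as a Python sketch (ours). The adjacency interface graphs[l][v], mapping counter (v, l) to its neighbors in G_{l+1}, is a hypothetical stand-in for the succinct graph look-up of Lemma II.1.

    def update(c, b, graphs, i, delta):
        # Add delta (possibly negative, for decrements) to x_i.
        # c[l][v] is the value of counter (v, l); b[l] is its bit width.
        carries = {i: delta}                   # pending additions in layer 0
        for l in range(len(c)):
            next_carries = {}
            for v, add in carries.items():
                total = c[l][v] + add
                c[l][v] = total % (1 << b[l])  # keep the low b_l bits
                carry = total >> b[l]          # overflow passed upward
                if carry != 0:
                    # by construction the last layer never overflows
                    # for any x in S(B1, B2, N)
                    for w in graphs[l][v]:
                        next_carries[w] = next_carries.get(w, 0) + carry
            carries = next_carries
            if not carries:
                break

    # Toy run matching Figure 2: N = 4, L = 2, every counter 2 bits wide;
    # 0-indexed, so x_3 is index 2 and counter (1, 1) is index 0 in layer 1.
    c = [[0, 0, 0, 0], [0, 0], [0]]
    b = [2, 2, 2]
    graphs = [{0: [0], 1: [0], 2: [0], 3: [1]},   # hypothetical G_1
              {0: [0], 1: [0]}]                   # hypothetical G_2
    update(c, b, graphs, 2, 2)   # add 2 to x_3: c[0][2] = 2, no carry
    update(c, b, graphs, 2, 3)   # add 3: c[0][2] = 1, carry 1 into (1, 1)
    assert c[0][2] == 1 and c[1][0] == 1

Note that an update touches only the counters along the carry chain, a constant number per layer, which is the local encodability property.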

Finally, we give a simplified example illustrating the intuition behind the decoding algorithm. Figure 2(d) shows the initial configuration. The decoding algorithm uses the values stored in the counters to compute upper and lower bounds on the counters. To see how this can be done for the simple example above, let us try to compute upper and lower bounds on x_3. First, observe that the counter (1, 2) stores 0. Therefore, neither counter in V_1 overflowed, and thus each counter in V_1 must already be storing the sum of the overflows of its neighbors in V_0.

We know c(3, 0), so the problem of determining upper and lower bounds on x_3 is equivalent to determining upper and lower bounds on the overflow of counter (3, 0). To determine a bound, consider the tree obtained by doing a breadth-first search of depth 2 in G_1 starting from (3, 0), as shown in Figure 3(a). A trivial lower bound for the overflow of the bottom two counters is 0. Therefore, since counter (1, 1) stores 1, and this value is the sum of the overflow of (3, 0) and the overflows of the bottom two counters, 1 − 0 − 0 = 1 must be an upper bound on the overflow of (3, 0).

Next, consider the graph obtained by doing a breadth-first search of depth 4 in G_1 starting from (3, 0). This graph is not a tree, but we can pretend that it is a tree by making copies of nodes that are reached by more than one path, as shown in Figure 3(b). As before, 0 is a trivial lower bound for the overflow of the counters at the bottom of the tree. Therefore, 2 − 0 − 0 = 2 is an upper bound for both counters at depth 2 in the tree. This means that 1 − 2 − 2 = −3 must be a lower bound on the overflow of (3, 0). Of course, in this case the trivial lower bound of 0 is better than the bound of −3 obtained by this reasoning. Since our upper bound on the overflow is 1 and our lower bound is 0, we have not recovered the value of counter (3, 0)'s overflow.

[Fig. 3. (a) Breadth-first search tree of depth 2 rooted at the 3rd counter in V_0; (b) breadth-first search tree of depth 4 rooted at (3, 0), in which two copies of the same counter appear.]
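The alternating bound computation above can be expressed compactly in Python (our sketch, with a deliberately simplified tree representation: each V_0 node is represented by the value of the V_1 counter below it together with that counter's other V_0 children, and None marks the unexplored frontier).

    INF = float("inf")

    def overflow_bounds(node, depth_left):
        # (lower, upper) bounds on the overflow of a V_0 counter.
        # node = (v1_value, grandchildren) or None at the frontier.
        if node is None or depth_left <= 0:
            return (0, INF)                    # trivial bounds
        v1_value, grandchildren = node
        sub = [overflow_bounds(g, depth_left - 2) for g in grandchildren]
        upper = v1_value - sum(lb for lb, _ in sub)
        lower = 0 if any(ub == INF for _, ub in sub) \
                  else max(0, v1_value - sum(ub for _, ub in sub))
        return (lower, upper)

    # Figure 3 toy: counter (1, 1) stores 1; exploring two levels deeper
    # reaches V_1 counters storing 2 each. Bounds do not yet match.
    assert overflow_bounds((1, [None, None]), 2) == (0, 1)
    assert overflow_bounds((1, [(2, [None, None]), (2, [None, None])]), 4) == (0, 1)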

However, in general, if this type of reasoning leads to matching upper and lower bounds, then the value of (3, 0)'s overflow clearly must be equal to the value given by the matching bounds. One can view our analysis later in the paper as showing that if we choose the graphs and counter sizes properly, then in fact it is extremely likely that the reasoning above does give matching upper and lower bounds, and thus we can use this kind of reasoning to construct an efficient decoding algorithm.

III. ENCODING AND DECODING ALGORITHMS

The encoding or update algorithm is the same as that of [21], and the example of Section II provides enough intuition to reconstruct its formal description; therefore, we skip the details due to space constraints. In what follows, we provide a detailed description of the decoding algorithm, which decodes each element w.h.p. It is based on the message passing algorithm proposed by [21]; however, to achieve the local decoding property, we carefully design an incremental version of that algorithm. Our analysis (unfortunately skipped here due to space constraints) establishes this property. We call this algorithm WHP.

The idea behind WHP is that to decode a single input we only need to use local information: instead of looking at all counters, we only use the values of counters that are close (in terms of graph distance) to the input we care about. As illustrated in Section II, the use of local information provides upper and lower bounds for a given input. Following this insight, the algorithm can provide better upper and lower bounds if more and more local information (with respect to the graphical code) is utilized. Therefore, the algorithm incrementally uses more information until it manages to decode (i.e., obtains matching upper and lower bounds). This is precisely the insight behind our algorithm.

Formally, the WHP algorithm is described as follows. First, we state a subroutine that can recover the overflow of a single counter in an arbitrary layer of the data structure, given additional information not directly stored in our structure: the values in the next layer, assuming the next layer has infinite sized counters. This subroutine is used by the 'basic algorithm', which utilizes more and more local information incrementally as discussed above. Now, the assumption utilized by the subroutine is not true for any layer ℓ < L, but it holds for the last layer L under our construction (i.e., the last layer never overflows as long as x ∈ S(B_1, B_2, N)). Therefore, using this subroutine, one can recursively obtain values in layers L − 1, ..., 0. The 'final algorithm' achieves this by selecting appropriate parameters in the 'basic algorithm'.

WHP: The Subroutine. For any counter (u, ℓ), the goal is to recover the value that it would have stored if b_ℓ = ∞. To this end, suppose the values of the counters in V_{ℓ+1} are known with b_{ℓ+1} = ∞. As explained before, this will be satisfied recursively, starting with ℓ + 1 = L. The subroutine starts by computing the breadth first search (BFS) neighborhood of (u, ℓ) in G_{ℓ+1} of depth t, for an even t. Hence the counters at depth t of this BFS or computation tree belong to V_ℓ (the root is at depth 0). This neighborhood is indeed a tree provided that G_{ℓ+1} does not contain a cycle of length ≤ 2t; we will always select t so that this holds. This computation tree contains all the counters that are used by the subroutine to determine the value that (u, ℓ) would have had if b_ℓ = ∞. Given this tree, the overflow of the root (u, ℓ) can be determined using the upper bound/lower bound method described by example in Section II. We skip a detailed description of this method due to space constraints, but some intuition follows.

Imagine that the edges of the computation tree are oriented to point towards the root, so that it makes sense to talk about incoming and outgoing edges. Let C((v, ℓ)) denote the set of all children of counter (v, ℓ), and let P((v, ℓ)) denote the parent of counter (v, ℓ), with similar notation for counters in V_{ℓ+1}. Also, for a counter (v, ℓ + 1) ∈ V_{ℓ+1}, let c_v denote the value that this counter would have had if b_{ℓ+1} = ∞, which we recall was assumed to be available.

The algorithm passes a message (a number) m_{(v,ℓ)→(w,ℓ+1)} along the directed edges pointing from counter (v, ℓ) to counter (w, ℓ + 1), and similarly for edges pointing from a counter in V_{ℓ+1} to a counter in V_ℓ in this directed computation tree. The messages start from the counters at depth t, which belong to V_ℓ; these messages are equal to 0. Recursively, messages are propagated upwards in the computation tree, as explained earlier in the example of Section II. Eventually, the root receives upper and lower bound values from the trees of depth t and t + 2.
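A sketch (ours) of building the computation tree used by the subroutine. It relies on the O(1) neighbor look-up of Lemma II.1 through two hypothetical accessors; nodes reached along two different paths simply appear twice, which is exactly the 'copying' trick of Section II, and no copies actually occur when the girth of G_{ℓ+1} exceeds 2t.

    def computation_tree(root, depth, nbrs_down, nbrs_up):
        # BFS computation tree of even depth rooted at a V_l counter.
        # nbrs_down(v): neighbors in V_{l+1} of a V_l counter;
        # nbrs_up(w):   neighbors in V_l of a V_{l+1} counter.
        def expand(v, parent, d, lower):
            if d == depth:
                return (v, [])
            nbrs = nbrs_down(v) if lower else nbrs_up(v)
            return (v, [expand(u, v, d + 1, not lower)
                        for u in nbrs if u != parent])
        return expand(root, None, 0, True)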
WHP: The Basic Algorithm. Using repeated calls to the subroutine described above, the basic algorithm determines the value of x_i. The basic algorithm has a parameter for each layer, which we denote by t_ℓ^(1), and proceeds as follows. Assume that we want to recover x_i. We start by building the computation tree of depth t_1^(1) + 2 in the first layer, i.e., the computation tree formed by a breadth-first search of the graph G_1 rooted at (i, 0). Next, for each counter in this computation tree that also belongs to V_1, i.e., each counter at odd depth in the computation tree, we form a new computation tree of depth t_2^(1) + 2 in the second layer. This gives us a set of computation trees in the second layer. For each such computation tree, the counters at odd depth belong to V_2, so we repeat the process and build a computation tree of depth t_3^(1) + 2 in the third layer for each of these counters. We repeat this process until we reach the last layer, layer L. Figure 4 gives an example of this process.

Now, we work backwards from layer L. Each computation tree in layer L has depth t_L^(1) + 2. We run the subroutine described above on each of these computation trees. When we run the subroutine on a particular computation tree, we set the input c_v for each counter (v, L) in V_L that also belongs to the computation tree to be c(v, L). Intuitively, this means that we are assuming that no counter in V_L overflows. As mentioned earlier, under our construction this holds for any x ∈ S(B_1, B_2, N); details are skipped due to space constraints.
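The forward pass of the basic algorithm, as a Python sketch (ours); trees_for_layer is a hypothetical helper wrapping the computation_tree construction above for a given layer, and all failure handling is elided.

    def forward_trees(i, t, trees_for_layer):
        # t[l-1] is the depth parameter t_l for layer l = 1, ..., L.
        # Returns trees[l]: the list of computation trees built in layer l.
        L = len(t)
        trees, roots = {}, [i]
        for l in range(1, L + 1):
            trees[l] = [trees_for_layer(l, r, t[l - 1] + 2) for r in roots]
            # counters of V_l sit at odd depth; they seed the next layer
            roots = [v for tree in trees[l] for v in odd_depth_nodes(tree)]
        return trees

    def odd_depth_nodes(tree, d=0):
        # Labels at odd depth of a (label, children) tree.
        label, children = tree
        out = [label] if d % 2 == 1 else []
        for ch in children:
            out += odd_depth_nodes(ch, d + 1)
        return out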

[Fig. 4. Example of forming multiple computation trees: starting from the original data structure (counters 1 through 12), step 1 builds the computation tree of depth 2 rooted at counter 3; step 2 builds computation trees of depth 2 rooted at counters 6 and 9.]

Assuming that none of the subroutine calls fail, we have computed the overflow for all the counters that appear in computation trees for layer L − 1. Let (v, L − 1) be a counter in V_{L−1} that also belongs to a computation tree in layer L − 1. We have computed (v, L − 1)'s overflow, which we denote by overflow(v, L − 1). To compute the value that (v, L − 1) would have stored if b_{L−1} = ∞, we simply compute

    c_v = c(v, L − 1) + 2^{b_{L−1}} · overflow(v, L − 1).

Once we have computed these values, we can run the subroutine on the computation trees for layer L − 1. Then, we compute the appropriate values for counters in V_{L−2} using the formula

    c_v = c(v, L − 2) + 2^{b_{L−2}} · overflow(v, L − 2),

and so on, until either the subroutine fails or we successfully run the subroutine on all the computation trees. Assuming that the subroutine finishes successfully on all of the computation trees, the final subroutine call gives us the overflow of (i, 0). Then,

    x_i = c(i, 0) + 2^{b_0} · overflow(i, 0).

Thus, if none of the subroutine calls fail, then we successfully recover x_i.
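The backward pass in code form (our sketch; interfaces are illustrative rather than mutually consistent down to types, and run_subroutine stands for the upper/lower-bound subroutine, assumed to return the root's overflow or None when the bounds do not match).

    def backward_pass(trees, c, b, run_subroutine):
        # trees[l]: dict mapping a root index v in layer l-1 to its
        # layer-l computation tree; c and b as in the update sketch.
        L = max(trees)
        # Layer L never overflows for x in S(B1, B2, N), so its stored
        # values already equal the 'infinite width' values.
        full = {L: dict(enumerate(c[L]))}
        for l in range(L, 0, -1):
            full[l - 1] = {}
            for v, tree in trees[l].items():
                ov = run_subroutine(tree, full[l])
                if ov is None:
                    return None               # this attempt failed
                full[l - 1][v] = c[l - 1][v] + (1 << b[l - 1]) * ov
        return full                           # full[0][i] equals x_i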

WHP: The Final Algorithm. The final algorithm repeats the basic algorithm several times. Specifically, it starts by running the basic algorithm. If none of the subroutine calls fail, then we recover x_i and we are done. Otherwise, we run the basic algorithm again, but with a new set of parameters t_1^(2), t_2^(2), ..., t_L^(2). If some subroutine call fails again, we run the basic algorithm with a new set of parameters t_1^(3), t_2^(3), ..., t_L^(3), and so on, up to some maximum number M of repetitions. If after M repetitions we still fail to recover x_i, then we declare failure.

For our purposes, we set the parameters as follows. To specify t_ℓ^(p), we introduce the auxiliary numbers δ^(p) and n_ℓ^(p). For 1 ≤ p ≤ 2^{log(N)/log(L)} and 1 ≤ ℓ ≤ L, we define δ^(p), n_ℓ^(p), and t_ℓ^(p) as follows:

    δ^(p) = e^{−L^{p+1}},

    n_ℓ^(p) = e^{c_1 + ℓ · log(c)/log(d)} · log log( L / δ^(p) ),

    t_ℓ^(p) = ⌈ t* + (1/log d) · log log( n_ℓ^(p) L / (a δ^(p)) ) ⌉,

where a, c, c_1, d, and t* are fixed constants. Finally, we set M = 2^{log(N)/log(L)}. This completes the specification of all the parameters.
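The outer retry loop of the final algorithm is then straightforward (our sketch; params(p) is assumed to return the schedule (t_1^(p), ..., t_L^(p)) defined above, and run_basic to return x_i or None):

    def whp_decode(i, M, params, run_basic):
        # Retry the basic algorithm with progressively larger depth
        # parameters t^(p), p = 1, ..., M; declare failure after M tries.
        for p in range(1, M + 1):
            result = run_basic(i, params(p))
            if result is not None:
                return result
        raise RuntimeError("decoding failed after M repetitions")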

REFERENCES

[1] A. Montanari and E. Mossel, "Smooth compression, Gallager bound and nonlinear sparse-graph codes," in Proceedings of the IEEE International Symposium on Information Theory, pp. 2474-2478, 2008.
[2] S. Muthukrishnan, "Data streams: Algorithms and applications," available at http://athos.rutgers.edu/~muthu/stream-1-1.ps, 2003.
[3] G. Cormode and S. Muthukrishnan, "Improved data stream summaries: The count-min sketch and its applications," FSTTCS, 2004.
[4] E. Candes, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Trans. Inf. Theory, 52(2):489-509, 2006.
[5] E. J. Candes, J. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Comm. Pure Appl. Math., 59(8):1208-1223, 2006.
[6] D. L. Donoho, "Compressed sensing," IEEE Trans. Inf. Theory, 52(4):1289-1306, Apr. 2006.
[7] A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin, "One sketch for all: fast algorithms for compressed sensing," in ACM STOC 2007, pages 237-246, 2007.
[8] P. Indyk, "Sketching, streaming and sublinear-space algorithms," graduate course notes, available at http://stellar.mit.edu/S/course/6/fa07/6.895/, 2007.
[9] P. Indyk, "Explicit constructions for compressed sensing of sparse signals," SODA, 2008.
[10] P. Indyk and M. Ruzic, "Practical near-optimal sparse recovery in the l1 norm," Allerton, 2008.
[11] J. A. Tropp, "Greed is good: Algorithmic results for sparse approximation," IEEE Trans. Inform. Theory, 50(10):2231-2242, Oct. 2004.
[12] W. Xu and B. Hassibi, "Efficient compressive sensing with deterministic guarantees using expander graphs," IEEE Information Theory Workshop, 2007.
[13] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, July and October, 1948.
[14] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, 23(3), pp. 337-343, May 1977.
[15] M. Capalbo, O. Reingold, S. Vadhan, and A. Wigderson, "Randomness conductors and constant-degree lossless expanders," in Proceedings of the 34th Annual ACM Symposium on Theory of Computing, Montreal, Canada, May 19-21, 2002.
[16] A. Lubotzky, R. Phillips, and P. Sarnak, "Ramanujan graphs," Combinatorica, 8(3):261-277, 1988.

[17] J. Rosenthal and P. O. Vontobel, "Constructions of LDPC codes using Ramanujan graphs and ideas from Margulis," in Proc. of the 38th Allerton Conference on Communication, Control, and Computing, Monticello, Illinois, USA, pp. 248-257, Oct. 4-6, 2000.
[18] M. Sipser and D. A. Spielman, "Expander codes," IEEE Trans. Inform. Theory, 42(6, part 1):1710-1722, 1996.
[19] R. G. Gallager, Low-Density Parity-Check Codes. Cambridge, MA: MIT Press, 1963.
[20] M. Lentmaier, D. V. Truhachev, K. Sh. Zigangirov, and D. J. Costello, Jr., "An analysis of the block error probability performance of iterative decoding," IEEE Trans. Inf. Theory, vol. 51, no. 11, pp. 3834-3855, Nov. 2005.
[21] Y. Lu, A. Montanari, B. Prabhakar, S. Dharmapurikar, and A. Kabbani, "Counter Braids: A novel counter architecture for per-flow measurement," Proc. of ACM SIGMETRICS/Performance, June 2008.
[22] S. Ramabhadran and G. Varghese, "Efficient implementation of a statistics counter architecture," Proc. of ACM SIGMETRICS/Performance, pages 261-271, 2003.
[23] D. Shah, S. Iyer, B. Prabhakar, and N. McKeown, "Analysis of a statistics counter architecture," Proc. IEEE HotI 9.
[24] Q. G. Zhao, J. J. Xu, and Z. Liu, "Design of a novel statistics counter architecture with optimal space and time efficiency," Proc. of ACM SIGMETRICS/Performance, June 2006.
[25] T. J. Richardson and R. Urbanke, "The capacity of low-density parity-check codes under message-passing decoding," IEEE Trans. Inform. Theory, vol. 47, pp. 599-618, February 2001.
[26] M. Patrascu, "Succincter," in Proc. 49th IEEE Symposium on Foundations of Computer Science (FOCS), pp. 305-313, 2008.
[27] D. R. Clark and J. I. Munro, "Efficient suffix trees on secondary storage," in Proc. 7th ACM/SIAM Symposium on Discrete Algorithms (SODA), pages 383-391, 1996.
[28] G. Jacobson, "Space-efficient static trees and graphs," in Proc. 30th IEEE Symposium on Foundations of Computer Science (FOCS), pages 549-554, 1989.
[29] J. I. Munro, "Tables," in Proc. 16th Conference on the Foundations of Software Technology and Theoretical Computer Science (FSTTCS), pages 37-40, 1996.
[30] J. I. Munro, V. Raman, and S. S. Rao, "Space efficient suffix trees," Journal of Algorithms, 39(2):205-222, 2001. See also FSTTCS 1998.
[31] R. Raman, V. Raman, and S. S. Rao, "Succinct indexable dictionaries with applications to encoding k-ary trees and multisets," in Proc. 13th ACM/SIAM Symposium on Discrete Algorithms (SODA), pages 233-242, 2002.
[32] R. Pagh, "Low redundancy in static dictionaries with constant query time," SIAM Journal on Computing, 31(2):353-363, 2001. See also ICALP 1999.
[33] A. Golynski, R. Grossi, A. Gupta, R. Raman, and S. S. Rao, "On the size of succinct indices," in Proc. 15th European Symposium on Algorithms (ESA), pages 371-382, 2007.
[34] A. Golynski, R. Raman, and S. S. Rao, "On the redundancy of succinct data structures," in Proc. 11th Scandinavian Workshop on Algorithm Theory (SWAT), 2008.
[35] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, 13(7):422-426, 1970.
[36] L. Carter, R. Floyd, J. Gill, G. Markowsky, and M. Wegman, "Exact and approximate membership testers," in Proc. 10th ACM Symposium on Theory of Computing (STOC), pages 59-65, 1978.
[37] A. Pagh, R. Pagh, and S. S. Rao, "An optimal Bloom filter replacement," in Proc. 16th ACM/SIAM Symposium on Discrete Algorithms (SODA), pages 823-829, 2005.
[38] M. L. Fredman, J. Komlos, and E. Szemeredi, "Storing a sparse table with O(1) worst case access time," Journal of the ACM, 31(3):538-544, 1984. See also FOCS 1982.
[39] R. Pagh and F. F. Rodler, "Cuckoo hashing," J. Algorithms, 51(2):122-144, 2004. See also ESA 2001.
[40] L. Varshney, J. Kusuma, and V. K. Goyal, "Malleable coding: Compressed palimpsests," preprint available at http://arxiv.org/abs/0806.4722.
[41] M. G. Luby, M. Mitzenmacher, M. A. Shokrollahi, and D. A. Spielman, "Efficient erasure correcting codes," IEEE Trans. Inform. Theory, vol. 47, no. 2, pp. 569-584, Feb. 2001.
