Automata and Graph Compression

Mehryar Mohri (Courant Institute and Google Research), Michael Riley (Google Research), Ananda Theertha Suresh (UCSD)
[email protected] [email protected] [email protected]

Abstract—We present a theoretical framework for the compression of automata, which are widely used representations in speech processing, natural language processing and many other tasks. As a corollary, our framework further covers graph compression. We introduce a probabilistic process of graph and automata generation that is similar to stationary ergodic processes and that covers real-world phenomena. We also introduce a universal compression scheme LZA for this probabilistic model and show that LZA significantly outperforms other compression techniques such as gzip and the UNIX compress command on several synthetic and real data sets.

Index Terms—Lempel-Ziv, universal compression, graphs

I. INTRODUCTION

Massive amounts of data are generated and processed by modern search engines and popular online sites, which have been reported to be in the order of hundreds of petabytes. At a different level, sophisticated models are commonly stored on mobile devices for tasks such as speech-to-text conversion. Thus, efficient storage and compression algorithms are needed both at the data warehouse level (petabytes of data) and at a device level (megabytes of data). Memory and storage constraints for the latter, as well as the communication costs of downloading complex models, further motivate the need for effective compression algorithms.

Most existing compression techniques were designed for sequential data. They include Huffman and arithmetic coding, which are optimal compression schemes for sequences of symbols distributed i.i.d. according to some unknown distribution [10]; the Lempel-Ziv schemes, which are asymptotically optimal for stationary ergodic processes [15], [20]; and various combinations of these schemes, such as the UNIX command compress, an efficient implementation of Lempel-Ziv-Welch (LZW), and gzip, a combination of Lempel-Ziv-77 (LZ77) and Huffman coding.

However, much of the modern data processed at the data warehouse or mobile device level admits further structure: it may represent segments of a very large graph such as the web graph or a social network, or include very large finite automata or finite-state transducers such as those widely used in speech recognition or in a variety of other language processing tasks such as machine translation, information extraction, and tagging [17]. These data sets are typically very large: web graphs contain tens of billions of nodes (web pages), and in speech processing a large-alphabet language model may have billions of word edges. These examples and applications strongly motivate the need for structured data compression in practice.

But how can we exploit this structure to devise better compression algorithms? Can we improve upon the straightforward serialization of the data followed by the application of an existing sequence compression algorithm? Surprisingly, the problem of compressing such structured data has received little attention. Here, we study precisely this problem of structured data compression. Motivated by the examples just mentioned, we focus on the problem of automata compression and, as a corollary, graph compression.

An empirical study of web graph compression was given by [5], [6], [14]. A theoretical study of the problem was first presented by [1], who proposed a scheme using a minimum spanning tree to find similar nodes. However, the authors showed that many generalizations of their problem are NP-hard. Motivated by probabilistic models, [8], [9] later showed that arithmetic coding can be used to nearly optimally compress (the structure of) graphs generated by the Erdős-Rényi model. An empirical study of automata compression has been presented by several authors in the past, including [11], [12]. However, we are not aware of any theoretical work focused on automata compression. Our objective is three-fold: (i) propose a probabilistic model for automata that captures real-world phenomena; (ii) provide a provably universal compression algorithm; and (iii) show experimentally that the algorithm fares well compared to techniques such as gzip and compress. Note that the probabilistic model we introduce can be viewed as a generalization of Erdős-Rényi graphs [4].

The rest of the paper is organized as follows. In Section II, we briefly describe finite automata and some of their key properties. In Section III, we describe our probabilistic model and show that it covers many real-world applications. In Section IV, we describe our proposed algorithm, LZA, and prove its optimality. In Section V, we further demonstrate our algorithm's effectiveness in terms of its degree of compression. Due to space constraints, all the proofs are deferred to the appendix.

II. DIRECTED GRAPHS AND FINITE AUTOMATA

A finite automaton A is a 5-tuple (Q, Σ, δ, q_I, F) where Q = {1, 2, . . . , n} is a finite set of states, Σ = {1, 2, . . . , m} a finite alphabet, δ : Q × Σ → Q* the transition function, q_I ∈ Q an initial state, and F ⊆ Q the set of final states. Thus, δ(q, a) denotes the set of states reachable from state q via transitions labeled with a ∈ Σ. If there is no transition with label a, then δ(q, a) = ∅. We denote by E ⊆ Q × Σ × Q the set of all transitions (q, a, q′) and by E[q] the set of all transitions from state q. With the automata notation just introduced, a directed graph can be defined as a pair (Q, δ) where Q = {1, 2, 3, . . . , n} is a finite set of nodes and δ : Q → Q* a finite set of edges where, for any node q, δ(q) denotes the set of nodes connected to q.

An example of an automaton is given in Figure 1(a). State 0 in this simple example is the initial state (depicted with the bold circle) and state 1 is the final state (depicted with the double circle). The strings 12 and 222 are among those accepted by this automaton.

Fig. 1: (a) An example automaton; (b) a subway turnstile automaton; (c) an example weighted transducer.
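To make this representation concrete, here is a minimal C++ sketch of the data structures just described (our own illustration, not code from the paper; all type and field names are assumptions). A directed graph is the special case with a single label (m = 1):

#include <cstdint>
#include <set>
#include <vector>

// A transition (q, a, q') is stored in trans[q] as the pair (a, q').
struct Transition {
  int32_t label;  // a in Sigma = {1, ..., m}
  int32_t dest;   // q' in Q = {1, ..., n}
  bool operator<(const Transition& o) const {
    // Ordering used later by LZA: primarily by destination, then by label.
    return dest != o.dest ? dest < o.dest : label < o.label;
  }
};

struct Automaton {
  int32_t num_states;                          // n
  int32_t num_labels;                          // m
  int32_t initial;                             // q_I
  std::set<int32_t> finals;                    // F
  std::vector<std::vector<Transition>> trans;  // trans[q] = E[q]
};

// A directed graph is an automaton with a single label (num_labels = 1).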

By using symbolic labels on this automaton in place of the usual integers, as depicted in Figure 1(b), we can interpret this automaton as the operation of a subway turnstile. It has two states, locked and unlocked, and actions (alphabet) coin and push. If the turnstile is in the locked state and you push, it remains locked, and if you insert a coin, it becomes unlocked. If it is unlocked and you insert a coin it remains unlocked, but if you push once it becomes locked.

Note that directed graphs form a subset of automata with Σ = {1}. Thus, in what follows, we can focus on automata compression. Furthermore, to be consistent with the existing automata literature, we use states to refer to nodes and transitions to refer to edges in both graphs and automata going forward.

Some key applications motivating the study of automata are speech and natural language processing. In these applications, it is often useful to augment transitions with both an output label and some real-valued weight. The resulting devices, weighted finite-state transducers (FSTs), are extensively used in these fields [2], [17], [18]. An example of an FST is given in Figure 1(c). The string 12 is among those accepted by this transducer. For this input, the transducer outputs the string 23 and has weight .046875 (transition weights 0.75 times 0.25 times final weight 0.25).

We propose an algorithm for unweighted automata compression. For FSTs, we use the same algorithm by treating the input-output label pair as a single label. If the automaton is weighted, we simply append the weights at the end of the compressed file using some standard representation.

III. RANDOM AUTOMATA COMPRESSION

A. Probabilistic model

Our goal is to propose a probabilistic model for automata generation that captures real-world phenomena. To this end, we first review probabilistic models for sequences and draw connections to probabilistic models for automata.

1) Probabilistic processes on sequences: We now define i.i.d. sampling of sequences. Let x_1^n denote an n-length sequence x_1, x_2, . . . , x_n. If x_1^n are n independent samples from a distribution p over X, then p(x_1^n) = ∏_{i=1}^n p(x_i). Note that under i.i.d. sampling, the index of the sample has no importance, i.e.,

p(X_i = x) = p(X_j = x), ∀ 1 ≤ i, j ≤ n, x ∈ X.

Stationary ergodic processes generalize i.i.d. sampling. For a stationary ergodic process p over sequences,

p(X_i^m = x_i^m) = p(X_{i+j}^{m+j} = x_i^m), ∀ i, j, m, x_i^m.

Informally, stationary ergodic processes are those for which only the relative position of the indices matters and not the actual positions.

2) Probabilistic processes on automata: Before deriving models for automata generation, we first discuss an invariance property of automata that is useful in practice. The set of strings accepted by an automaton, and the time and space of its use, are not affected by the state numbering. Two automata are isomorphic if they coincide modulo a renumbering of the states. Thus, automata (Q, Σ, δ, q_I, F) and (Q′, Σ, δ′, q_I′, F′) are isomorphic if there is a one-to-one mapping f : Q → Q′ such that f(δ(q, a)) = δ′(f(q), a) for all q ∈ Q and a ∈ Σ, f(q_I) = q_I′, and f(F) = F′, where f(F) = {f(q) : q ∈ F}.

Under stationary ergodic processes, two sequences with the same order of observed symbols have the same probabilities. Similarly, we wish to construct a probabilistic model of automata such that any two isomorphic automata have the same probabilities. For example, the probabilities of the automata in Figure 2 are the same.

Fig. 2: An example of isomorphic automata. The two automata are the same under the permutation 0 → 0, 1 → 2, and 2 → 1.

There are several probabilistic models of automata and graphs that satisfy this property. Perhaps the most studied random model is the Erdős-Rényi model G(n, p), where each state is connected to every other state independently with probability p [4]. Note that if two automata are isomorphic, then the Erdős-Rényi model assigns them the same probability. The Erdős-Rényi model is analogous to i.i.d. sampling on sequences. We wish to generalize the Erdős-Rényi model to more realistic models of automata.

Since the automata model should not depend on the state numbering, it can only depend on the set of incoming or outgoing paths; there is no other possible semantics we could assign to states.
Between incoming and outgoing paths, choosing the dependency on the incoming paths is natural given the sequentiality of language models. For example, in an n-gram model, a state might have an outgoing transition with label Francisco or Diego only if it has an incoming transition with label San. This is an example where we have restrictions on paths of length 2. In general, we may have restrictions on paths of any length ℓ.

We define an ℓ-memory model for automata as follows. Let h_q^ℓ be the set of paths of length at most ℓ leading to the state q. The probability distribution of transitions from a state depends on the paths leading to it. Let δ(q, ∗) = δ(q, 1), δ(q, 2), . . . , δ(q, m). Then

p(A) = p(δ(1, ∗), δ(2, ∗), . . . , δ(n, ∗)) ∝ ∏_{q=1}^n p(δ(q, ∗) | h_q^ℓ).

Similarly, the transitions leaving a state q dissociate into marginals conditioned on the history h_q^ℓ, and the probability that q′ ∈ δ(q, a) also dissociates into marginals:

p(δ(q, ∗) | h_q^ℓ) = ∏_{a∈Σ} p(δ(q, a) | h_q^ℓ) = ∏_{a∈Σ} ∏_{q′=1}^n p(1(q′ ∈ δ(q, a)) | h_q^ℓ),

where 1(q′ ∈ δ(q, a)) is the indicator of the event q′ ∈ δ(q, a). Note that the probabilities are defined up to proportionality, since they may not add up to one. We therefore introduce a constant Z to ensure that p is a probability distribution:

p(A) = p(δ(1, ∗), δ(2, ∗), . . . , δ(n, ∗)) = (1/Z) ∏_{q=1}^n p(δ(q, ∗) | h_q^ℓ) = (1/Z) ∏_{q=1}^n ∏_{a∈Σ} p(δ(q, a) | h_q^ℓ).

Note that ℓ-memory models assign the same probability to automata that are isomorphic. In our calculations, we restrict ℓ to make the model tractable.

Note that sequences form a subset of automata as follows. For a sequence x^n over alphabet Σ, consider the automaton representation with states Q = {1, 2, . . . , n}, initial state q_I = 1, final state set F = {n}, alphabet Σ, and transition function δ(i, x_i) = i + 1 and δ(i, x) = ∅ for all x ≠ x_i. Informally, every sequence can be represented as an automaton with a line as the underlying structure. Note that the ℓ-memory model is similar to an ℓth order Markov chain for sequences.
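For illustration, in the memoryless case ℓ = 0 with every indicator 1(q′ ∈ δ(q, a)) an independent Bernoulli(p), the model reduces to an Erdős-Rényi-style distribution over labeled transition structures and the normalization constant is Z = 1. The C++ sketch below (our own; the function name and signature are assumptions, not from the paper) computes the corresponding log-probability:

#include <cmath>

// log p(A) for the memoryless special case in which every indicator
// 1(q' in delta(q, a)) is an independent Bernoulli(prob): each of the
// n*n*m possible (q, a, q') slots contributes log(prob) if the transition
// is present and log(1 - prob) otherwise.  Here Z = 1.
double LogProbMemoryless(long long n, long long m, long long num_transitions,
                         double prob) {
  const double present = static_cast<double>(num_transitions);
  const double absent = static_cast<double>(n) * n * m - present;
  return present * std::log(prob) + absent * std::log(1.0 - prob);
}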
B. Entropy and coding schemes

A compression scheme is a mapping from X to {0, 1}* such that the resulting code is prefix-free and can be uniquely recovered. For a coding scheme c, let l_c(x) denote the length of the code for x ∈ X. It is well known that the expected number of bits used by any coding scheme is at least the entropy of the distribution, defined as H(p) = ∑_{x∈X} p(x) log(1/p(x)). The well-known Huffman coding scheme achieves this entropy up to one additional bit. For n-length sequences, arithmetic coding is used, which achieves compression up to the entropy with a few additional bits of error.

The above-mentioned coding methods such as Huffman coding and arithmetic coding require knowledge of the underlying distribution. In many practical scenarios, the underlying distribution may be unknown and only the broader class to which the distribution belongs may be known. For example, we might know that the given n-length sequence is generated by i.i.d. sampling of some unknown distribution p over {1, 2, . . . , k}. The objective of a universal compression scheme is to asymptotically achieve H(p) bits per symbol even if the distribution is unknown. A coding scheme c for sequences over a class of distributions P is called universal if

lim sup_{n→∞} max_{p∈P} (E[l_c(X^n)] − H(X^n)) / n = 0.

The normalization factor in the above definition is n, as the number of sequences of length n increases exponentially with n. For automata and graphs with n states (denoted by A_n), we choose a scaling factor of n², as the number of automata scales as exp(n²). We call a coding scheme c for automata over a class of distributions P universal if

lim sup_{n→∞} max_{p∈P} (E[l_c(A_n)] − H(A_n)) / n² = 0.

We now describe the algorithm LZA. Note that the algorithm does not require knowledge of the underlying parameters or of the probabilistic model.

IV. ALGORITHM FOR AUTOMATON COMPRESSION

Our algorithm recursively finds substructures over states and uses a Lempel-Ziv subroutine. Our coding method is based on two auxiliary techniques to improve the compression rate: Elias-delta coding and coding the differences. We briefly discuss these techniques and their properties before describing our algorithm.

A. Elias-delta coding and coding the differences

Elias-delta coding is a universal compression scheme for integers [13]. To represent a positive integer x, Elias-delta codes use ⌊log x⌋ + 2⌊log(⌊log x⌋ + 1)⌋ + 1 bits. To obtain a code over ℕ ∪ {0}, we replace x by x + 1 and use Elias-delta codes.

We now use Elias-delta codes to code sets of integers. Let x_1, x_2, . . . , x_d be integers such that 0 ≤ x_1 ≤ x_2 ≤ · · · ≤ x_d ≤ n. We use the following algorithm to code x_1, x_2, . . . , x_d. The decoding algorithm follows from ELIAS-DECODE [13].

Algorithm DIFFERENCE-ENCODE
Input: Integers 0 ≤ x_1 ≤ x_2 ≤ · · · ≤ x_d ≤ n.
1) Use ELIAS-ENCODE to code x_1 − 0, x_2 − x_1, . . . , x_d − x_{d−1}.

Lemma 1 (Appendix A). For integers such that 0 ≤ x_1 ≤ x_2 ≤ · · · ≤ x_d ≤ n, DIFFERENCE-ENCODE uses at most

d log((n + d)/d) + 2d log(log((n + d)/d) + 1) + d

bits.

We first give an example to illustrate DIFFERENCE-ENCODE's usefulness. Consider a graph representation using adjacency lists. For every source state, the order in which the destination states are stored does not matter. For example, if state 1 is connected to states 2, 4, and 3, it suffices to represent the unordered set {2, 3, 4}. In general, if a state is connected to d out of n states, then it suffices to encode the ordered set of states y_1, y_2, . . . , y_d where 1 ≤ y_1 ≤ y_2 ≤ · · · ≤ y_d ≤ n. The number of such possible sets is (n choose d). If the state sets are all equally likely, then the entropy of the state sets is log (n choose d) ≈ d log(n/d). If each state is represented using log n bits, then d log n bits are used, and d log n > d log(n/d), which is not optimal. However, by Lemma 1, DIFFERENCE-ENCODE uses d log((n + d)/d)(1 + o(1)) ≈ d log(n/d) bits, and hence is asymptotically optimal. Furthermore, the bounds in Lemma 1 are for the worst-case scenario, and in practice DIFFERENCE-ENCODE yields much higher savings. A similar scenario arises in LZA as discussed later.
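A C++ sketch of these two primitives (our own illustration, not the paper's implementation): Elias-delta codes are written into a bit string, and DIFFERENCE-ENCODE codes the gaps of a sorted list, shifting each gap by +1 as described above so that zero is representable:

#include <cstdint>
#include <string>
#include <vector>

// Appends the Elias-gamma code of y >= 1: floor(log2 y) zeros, then y in binary.
static void EliasGamma(uint64_t y, std::string* out) {
  int nbits = 0;
  for (uint64_t t = y; t > 1; t >>= 1) ++nbits;  // nbits = floor(log2 y)
  out->append(nbits, '0');
  for (int i = nbits; i >= 0; --i) out->push_back(((y >> i) & 1) ? '1' : '0');
}

// Appends the Elias-delta code of x >= 1: gamma(floor(log2 x) + 1) followed by
// the floor(log2 x) low-order bits of x.
static void EliasDelta(uint64_t x, std::string* out) {
  int nbits = 0;
  for (uint64_t t = x; t > 1; t >>= 1) ++nbits;  // nbits = floor(log2 x)
  EliasGamma(static_cast<uint64_t>(nbits) + 1, out);
  for (int i = nbits - 1; i >= 0; --i) out->push_back(((x >> i) & 1) ? '1' : '0');
}

// DIFFERENCE-ENCODE: Elias-delta codes of x1 - 0, x2 - x1, ..., xd - x(d-1),
// each difference shifted by +1 so that zero gaps are representable.
std::string DifferenceEncode(const std::vector<uint64_t>& sorted_x) {
  std::string bits;
  uint64_t prev = 0;
  for (uint64_t x : sorted_x) {
    EliasDelta(x - prev + 1, &bits);
    prev = x;
  }
  return bits;
}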
B. LZA

We now have at our disposal the tools needed to design a variant of the Lempel-Ziv algorithm for compressing automata, which we denote by LZA. Let d_q = |E[q]| be the number of transitions from state q and let the transitions in E[q] = {(q, a_1, q_1), (q, a_2, q_2), . . . , (q, a_{d_q}, q_{d_q})} be ordered as follows: for all i, q_i ≤ q_{i+1}, and if q_i = q_{i+1} then a_i < a_{i+1}. The algorithm is based on the observation that the ordering of the transitions leaving a state does not affect the definition of an automaton, and works as follows. The states of the automaton are visited in BFS order. For each state visited, the set of outgoing transitions is sorted based on their destination state. Next, the algorithm repeatedly finds the largest set of remaining transitions that matches some dictionary element and encodes the pair (matched dictionary element number, next transition): it adds the dictionary element number to T_d, the label of the next transition to T_Σ, and the destination state of the next transition to T_δ. It also updates the dictionary by adding the new dictionary element (matched dictionary element number, next transition). Finally, it encodes T_d and T_δ using DIFFERENCE-ENCODE and encodes each element in T_Σ using ⌈log m⌉ bits.

Algorithm LZA
Input: The transition label function δ of the automaton.
Output: Encoded sequence S.
1) Set dictionary D = ∅.
2) Visit all states q in BFS order. For every state q:
   a) Code d_q using ⌈log nm⌉ bits.
   b) Set T_d = ∅, T_Σ = ∅, and T_δ = ∅.
   c) Start with j = 1 in E[q] = {(q, a_1, q_1), . . . , (q, a_{d_q}, q_{d_q})} and continue until j reaches d_q:
      i) Find the largest l such that (a_j, q_j), . . . , (a_{j+l}, q_{j+l}) ∈ D. Let this dictionary element be d_r.
      ii) Add (a_j, q_j), . . . , (a_{j+l+1}, q_{j+l+1}) to D.
      iii) Add d_r to T_d, q_{j+l+1} to T_δ, and a_{j+l+1} to T_Σ.
   d) Use DIFFERENCE-ENCODE to encode T_d and T_δ, and encode each element in T_Σ using ⌈log m⌉ bits. Append these sequences to S.
3) Discard the dictionary and output S.
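The following condensed C++ sketch (our own simplification, not the paper's code) implements the per-state matching loop of step 2c, keeping the dictionary as an LZ78-style trie keyed by (parent phrase, next transition); the final serialization of T_d, T_Σ, and T_δ with DIFFERENCE-ENCODE and ⌈log m⌉-bit labels is omitted:

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// One (label, destination) pair, i.e. a transition with its source implicit.
using Arc = std::pair<int32_t, int32_t>;
// A phrase is identified by (parent phrase id, extending arc); id 0 = empty phrase.
using PhraseKey = std::pair<int32_t, Arc>;

struct LzaTriple {
  int32_t dict_id;  // matched dictionary element, appended to T_d
  int32_t label;    // next transition label, appended to T_Sigma
  int32_t dest;     // next transition destination, appended to T_delta
};

// Simplified sketch of step 2c for a single state whose outgoing arcs are
// already sorted by (destination, label).  The dictionary persists across
// states; serialization of the resulting triples is not shown.
std::vector<LzaTriple> EncodeState(const std::vector<Arc>& sorted_arcs,
                                   std::map<PhraseKey, int32_t>* dict) {
  std::vector<LzaTriple> out;
  size_t j = 0;
  while (j < sorted_arcs.size()) {
    // Greedily follow the longest phrase already in the dictionary.
    int32_t cur = 0;  // empty phrase
    while (j + 1 < sorted_arcs.size()) {
      auto it = dict->find({cur, sorted_arcs[j]});
      if (it == dict->end()) break;
      cur = it->second;
      ++j;
    }
    const Arc& next = sorted_arcs[j++];
    // New dictionary element = (matched element, next transition).
    int32_t new_id = static_cast<int32_t>(dict->size()) + 1;
    dict->emplace(PhraseKey{cur, next}, new_id);
    out.push_back({cur, next.first, next.second});
  }
  return out;
}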

We note that simply compressing the unordered sets T_d and T_δ suffices for unique reconstruction, and thus DIFFERENCE-ENCODE is the natural choice. Observe that DIFFERENCE-ENCODE is a succinct representation of the dictionary and does not affect the way the Lempel-Ziv dictionary is built. Thus the decoding algorithm follows immediately from retracing the steps of LZA and the LZ78 decoding algorithm. The run time of LZA is similar to that of other Lempel-Ziv algorithms and is O(n + |E|). If DIFFERENCE-ENCODE were not used, the number of bits used would be approximately |D| log |D| + |D| log n, which is strictly greater than the number of bits in Lemma 2. Furthermore, we did consider several other natural variants of this algorithm where we difference-encode the states first and then serialize the data using a standard Lempel-Ziv algorithm. However, we could not prove asymptotic optimality for those variants. Proving their non-optimality would require constructing distributions for which the algorithm is non-optimal, which is not the focus of this paper.

We first bound the number of bits used by LZA in terms of the size of the dictionary |D|, the number of states n, and the alphabet size m. This bound is independent of the underlying probabilistic model. Next, we proceed to derive probabilistic bounds.

Lemma 2 (Appendix B). The total number of bits used by LZA is at most

|D| [log(n + 1) + log(ν + 1) + 2 log(log(n + 1) + 1)] + |D| [2 log(log(ν + 1) + 1) + 2 + ⌈log m⌉] + n⌈log nm⌉,

where ν = n²/|D|.

C. Proof of optimality

In this section, we prove that LZA is asymptotically optimal for the random automata model introduced in Section III. Lemma 2 gives an upper bound on the number of bits used in terms of the size of the dictionary |D|. We now present a lower bound on the entropy in terms of |D| which will help us prove this result. The proof is given in Appendix C.

Lemma 3. LZA satisfies

H(p) ≥ E[|D|] ( log(E[|D|]/n) − m^ℓ − log(n²m/E[|D|] + 1) − 1 ).

The above result together with Lemma 2 implies:

Theorem 4 (Appendix D). If 2^{m^ℓ} = o(log n / log log n), then LZA is a universal compression algorithm.
V. EXPERIMENTS

A. Automaton structure compression

LZA compresses automata, but for most applications it is sufficient to compress the automaton structure. We convert LZA into LZA_S, an algorithm for automaton structure compression, as follows. We first perform a breadth-first search (BFS) with the initial state as the root and relabel the states in their BFS visitation order. We then run LZA with the following modification. In step 2, for every state q we divide the transitions from q into two groups: T_old^q, the transitions whose destination states have been traversed before in LZA, and T_new^q, the transitions whose destinations have not been traversed. Note that since the state numbers are ordered based on a BFS visit, the destination state numbers in T_new^q are 1, 2, . . . , n and can be recovered easily while decoding, and thus need not be stored. Hence, we run step 2c of LZA only on the transitions in T_old^q. For T_new^q, we just compress the transition labels using a standard implementation of LZ78.

Since each destination state can appear in T_new^q only once, the number of transitions in ∪_q T_new^q is at most n. Since this number is much smaller than n², the normalization factor in the definition of a universal compression algorithm for automata, the proof of Theorem 4 extends to LZA_S. Since for most applications it is sufficient to compress the automaton structure, we implemented LZA_S in C++ and added it to the OpenFst open-source library [3] (available for download at www.openfst.org).
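A minimal C++ sketch of the BFS relabeling step used by LZA_S (our own illustration; how unreachable states are handled here is an assumption, they are simply appended after the reachable ones):

#include <cstdint>
#include <queue>
#include <vector>

// dests[q] lists the destination states of q's outgoing transitions.
// Returns relabel[q] = BFS visitation rank of q (0-based), starting from
// the initial state.
std::vector<int32_t> BfsRelabel(const std::vector<std::vector<int32_t>>& dests,
                                int32_t initial) {
  const int32_t n = static_cast<int32_t>(dests.size());
  std::vector<int32_t> relabel(n, -1);
  std::queue<int32_t> frontier;
  int32_t next_label = 0;
  relabel[initial] = next_label++;
  frontier.push(initial);
  while (!frontier.empty()) {
    int32_t q = frontier.front();
    frontier.pop();
    for (int32_t d : dests[q]) {
      if (relabel[d] == -1) {  // first time this destination is reached
        relabel[d] = next_label++;
        frontier.push(d);
      }
    }
  }
  // States unreachable from the initial state receive trailing labels.
  for (int32_t q = 0; q < n; ++q)
    if (relabel[q] == -1) relabel[q] = next_label++;
  return relabel;
}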

B. Comparison

The best known convergence rate of all Lempel-Ziv algorithms for sequences is O(log log n / log n), and LZA has the same convergence rate under the ℓ-memory probabilistic model. However, in practice data sets have finitely many states and the underlying automata may not be generated from an ℓ-memory probabilistic model. To demonstrate the practicality of the algorithm, we compare LZA_S with the Unix compress command (LZW) and gzip (LZ77 and Huffman coding) on various synthetic and real data sets.

TABLE I: Synthetic data compression examples (in bytes).
Class  Uncompressed  LZA_S  compress  gzip   LZA+gzip
G1     172208        18260  22681     23752  17320
A1     172076        21745  33478     31682  21108
G2     34102         2536   4994      4564   2443
A2     34171         3027   6707      5546   2940

1) Synthetic Data: While the ℓ-memory probabilistic model describes a broad class of probabilistic models on which LZA is universal, generating samples from an ℓ-memory model is difficult, similarly to graphical models, as the normalization factor Z is hard to compute. We therefore test our algorithm on a few simpler synthetic data sets. In all our experiments the number of states is 1000 and the results are averaged over 1000 runs.

Table I summarizes our results for a few synthetic data sets, specified in bytes. Note that one of the main advantages of LZA_S over existing algorithms is that LZA_S compresses just the structure, which is sufficient for applications in speech processing and language modeling. Furthermore, note that to obtain the actual automaton from the structure we need the original state numbering, which can be specified in n log n bits, i.e., about (1000 log₂ 1000)/8 ≤ 1250 bytes in our experiments. Even if we add 1250 bytes to our results in Table I, LZA_S still performs better than gzip and compress.

We run the algorithm on four different synthetic data sets G1, G2, A1, A2. G1 and A1 are models with a uniform out-degree distribution over the states, and G2 and A2 are models with a non-uniform out-degree distribution:

G1: directed Erdős-Rényi graphs, where we randomly generate a transition between every source-destination pair with probability 1/100.

A1: the automata version of Erdős-Rényi graphs, where there is a transition between every two states with probability 1/100 and the transition labels are chosen independently from an alphabet of size 10 for each transition.

G2: we assign each state a class c ∈ {1, 2, . . . , 1000} randomly. We connect every two states s and d with probability 1/(c_s + c_d). The resulting graph has degrees varying from 2 to log 1000.

A2: we generate the transitions as above and label each transition with a deterministic function of the destination state. This is similar to n-gram models, where the destination state determines the label. Here again we chose |Σ| = 10.
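For concreteness, the following C++ sketch shows how an A1-style sample could be drawn (our own illustration; the function name, RNG choice, and seeding are assumptions, not taken from the paper). Setting m = 1 yields a G1-style directed Erdős-Rényi graph:

#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Draws an A1-style random automaton on n states: each ordered state pair
// (q, q') carries a transition with probability prob, and each generated
// transition gets a label drawn uniformly from {1, ..., m}.
// Returned structure: trans[q] = list of (label, destination); index 0 unused.
std::vector<std::vector<std::pair<int32_t, int32_t>>> SampleRandomAutomaton(
    int32_t n, int32_t m, double prob, uint64_t seed) {
  std::mt19937_64 rng(seed);
  std::bernoulli_distribution edge(prob);
  std::uniform_int_distribution<int32_t> label(1, m);
  std::vector<std::vector<std::pair<int32_t, int32_t>>> trans(n + 1);
  for (int32_t q = 1; q <= n; ++q)
    for (int32_t dest = 1; dest <= n; ++dest)
      if (edge(rng)) trans[q].push_back({label(rng), dest});
  return trans;
}

// Example matching the paper's A1 setup: SampleRandomAutomaton(1000, 10, 0.01, 1).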
Note that LZA_S always performs better than the standard Lempel-Ziv-based algorithms gzip and compress. Algorithms designed with specific knowledge of the underlying model can, of course, achieve better performance. For example, for G1, arithmetic coding can be used to obtain a compressed file size of n² h(0.01)/8 ≈ 10000 bytes. However, the same algorithm would not perform well for G2 or A2.

2) Real-World Data: We also tested our compression algorithm on a variety of 'real-world' automata drawn from various speech and natural language applications. These include large speech recognition language models and decoder graphs [17], text normalization grammars for text-to-speech [19], speech recognition and machine translation lattices [16], and pair n-gram grapheme-to-phoneme models [7]. We selected approximately eighty such automata from these tasks and removed their weights and output labels (if any), since we focus here on unweighted automata. Figure 3 shows the compressed sizes of these automata, ordered by their uncompressed (adjacency-list) size rank, with the same set of compression algorithms presented in the synthetic case. At the smallest sizes, gzip outperforms LZA, but after about 100 kbytes in compressed size, LZA is better. Overall, the combination of LZA and gzip performs best.

Fig. 3: Real-world compression examples (compressed size in bytes versus uncompressed size rank, for lza, compress, gzip, and lza+gzip).

VI. ACKNOWLEDGEMENTS

We thank J. Acharya and A. Orlitsky for helpful discussions.

REFERENCES

[1] M. Adler and M. Mitzenmacher. Towards compressing web graphs. In DCC, 2001.
[2] V. Alabau, F. Casacuberta, E. Vidal, and A. Juan. Inference of stochastic finite-state transducers using N-gram mixtures. In IbPRIA, 2007.
[3] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri. OpenFst: A general and efficient weighted finite-state transducer library. In CIAA, 2007.
[4] N. Alon and J. Spencer. The Probabilistic Method. Wiley, 1992.
[5] V. N. Anh and A. Moffat. Local modeling for webgraph compression. In DCC, 2010.
[6] A. Apostolico and G. Drovandi. Graph compression by BFS. Algorithms, 2(3):1031–1044, 2009.
[7] M. Bisani and H. Ney. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434–451, 2008.
[8] Y. Choi and W. Szpankowski. Compression of graphical structures. In ISIT, 2009.
[9] Y. Choi and W. Szpankowski. Compression of graphical structures: Fundamental limits, algorithms, and experiments. IEEE Trans. on Info. Theory, 58(2):620–638, 2012.
[10] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 2006.
[11] J. Daciuk. Experiments with automata compression. In CIAA, 2000.
[12] J. Daciuk and J. Piskorski. Gazetteer compression technique based on substructure recognition. In IIPWM. Springer, 2006.
[13] P. Elias. Universal codeword sets and representations of the integers. IEEE Trans. on Info. Theory, 21(2):194–203, 1975.
[14] S. Grabowski and W. Bieniecki. Tight and simple web graph compression. In Proceedings of the Prague Stringology Conference, 2010.
[15] G. Hansel, D. Perrin, and I. Simon. Compression and entropy. In STACS, 1992.
[16] G. Iglesias, C. Allauzen, W. Byrne, A. de Gispert, and M. Riley. Hierarchical phrase-based translation representations. In Proc. of Conf. on Empirical Methods in Natural Language Processing, 2011.
[17] M. Mohri, F. Pereira, and M. Riley. Weighted finite-state transducers in speech recognition. Computer Speech & Language, 16(1), 2002.
[18] E. Roche and Y. Schabes. Deterministic part-of-speech tagging with finite-state transducers. Computational Linguistics, 1995.
[19] T. Tai, W. Skut, and R. Sproat. Thrax: An open source grammar compiler built on OpenFst. In ASRU, 2011.
[20] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. on Info. Theory, 23(3):337–343, 1977.

APPENDIX

A. Proof of Lemma 1

Since 0 is included in the set, the number of bits used to represent x is upper bounded by θ(x) = log(x + 1) + 2 log(log(x + 1) + 1) + 1. Observe that θ is a concave function since both log and x ↦ log(log x) are concave. Let x_0 = 0. Then, by the concavity of θ, the total number of bits B used can be bounded as follows:

B ≤ ∑_{i=1}^d θ(x_i − x_{i−1}) = d · (1/d) ∑_{i=1}^d θ(x_i − x_{i−1}) ≤ d θ( (1/d) ∑_{i=1}^d (x_i − x_{i−1}) ) = d θ(x_d/d) ≤ d θ(n/d),

where the last inequality uses x_d ≤ n and the fact that θ is an increasing function. This completes the proof of the lemma.

B. Proof of Lemma 2

Let k_q be the number of elements added to the dictionary when state q is visited by LZA. The maximum value of a destination state is n. Thus, by Lemma 1, the number of bits used to code T_δ is at most

∑_{q=1}^n ( k_q θ(n/k_q) + k_q ⌈log m⌉ ),

where θ(·) is the function introduced in the proof of Lemma 1. Similarly, since the maximum value of any dictionary element is |D|, by Lemma 1, the number of bits used to code T_d is at most

∑_{q=1}^n k_q θ(|D|/k_q).

By concavity, these summations are maximized when k_q = |D|/n for all q. Plugging that value into the sums above yields the following upper bound on the maximum number of bits used:

|D| θ(ν) + |D| ⌈log m⌉ + |D| θ(n).

Additionally, this number must be augmented by n⌈log nm⌉ since ⌈log nm⌉ bits are used to encode each d_q, which completes the proof.

C. Proof of Lemma 3

One of the main technical tools we use is Ziv's inequality, stated below.

Lemma 5 (Variation of Ziv's inequality). For a probability distribution p over non-negative integers with mean µ, H(p) ≤ log(µ + 1) + 1.

The next lemma bounds the probability of disjoint events under different distributions.

Lemma 6. Let A_1, A_2, . . . , A_k be a set of disjoint events. Then, for a set of distributions p_1, p_2, . . . , p_r,

∑_{i=1}^k ∑_{j=1}^r p_j(A_i) ≤ ∑_{j=1}^r 1 = r.

We now lower bound H(p) in terms of the number of dictionary elements. Let d_q = |E[q]| be the number of transitions from state q and let the transitions in E[q] = {(q, a_1, q_1), . . . , (q, a_{d_q}, q_{d_q})} be ordered as follows: for all i, q_i ≤ q_{i+1}, and if q_i = q_{i+1} then a_i < a_{i+1}. To simplify the discussion, we use the shorthand e_{q,i} = (q, a_i, q_i). Then, by the definition of our probabilistic model,

log p(A) = ∑_{q=1}^n log p(e_{q,1}, e_{q,2}, . . . , e_{q,d_q} | h_q^ℓ) − log Z.

We group the transitions the way LZA constructed the dictionary. Let {D_{q,i}} be the set of dictionary elements added when state q is visited during the execution of the algorithm. For a dictionary element D_{q,i}, let s_{q,i} denote the index of its starting transition e_{q,·} and t_{q,i} that of its terminal one. Then, by the independence of the transition labels and the fact that Z ≥ 1,

log p(A) ≤ ∑_{q=1}^n ∑_{D_{q,i}} log p(e_{q,s_{q,i}}, e_{q,s_{q,i}+1}, . . . , e_{q,t_{q,i}} | h_q^ℓ).

Let g_{q,i} = t_{q,i} − s_{q,i}. We now group the dictionary elements by s_{q,i} and g_{q,i}. Let D(s, g) be the set of dictionary elements D_{q,i} with s_{q,i} = s and g_{q,i} = g, let c_{s,g} = |D(s, g)| be the cardinality of that set, and write e_{q,s}^t = e_{q,s}, e_{q,s+1}, . . . , e_{q,t}. Then, by Jensen's inequality, we can write

∑_{q=1}^n ∑_{D_{q,i}} log p(e_{q,s_{q,i}}^{t_{q,i}} | h_q^ℓ)
  = ∑_{s=1}^n ∑_{g=1}^n ∑_{q=1}^n ∑_{D_{q,i} ∈ D(s,g)} log p(e_{q,s}^{s+g} | h_q^ℓ)
  = ∑_{s=1}^n ∑_{g=1}^n c_{s,g} · (1/c_{s,g}) ∑_{q=1}^n ∑_{D_{q,i} ∈ D(s,g)} log p(e_{q,s}^{s+g} | h_q^ℓ)
  ≤ ∑_{s=1}^n ∑_{g=1}^n c_{s,g} log( (1/c_{s,g}) ∑_{q=1}^n ∑_{D_{q,i} ∈ D(s,g)} p(e_{q,s}^{s+g} | h_q^ℓ) )
  ≤ ∑_{s=1}^n ∑_{g=1}^n c_{s,g} log( 2^{m^ℓ} / c_{s,g} ),

where the last inequality follows from Lemma 6, the fact that the events in each summation are disjoint and mutually exclusive, and the fact that the number of possible histories h_q^ℓ is at most 2^{m^ℓ}. Since ∑_{s,g} c_{s,g} = |D|,

∑_{s=1}^n ∑_{g=1}^n c_{s,g} log( 2^{m^ℓ} / c_{s,g} )
  = |D| m^ℓ + ∑_{s=1}^n ∑_{g=1}^n c_{s,g} log(1/c_{s,g})
  = |D| m^ℓ − |D| log |D| + |D| ∑_{s=1}^n ∑_{g=1}^n (c_{s,g}/|D|) log(|D|/c_{s,g})
  = |D| m^ℓ − |D| log |D| + |D| H(c_{s,g}),

where H(c_{s,g}) denotes the entropy of the distribution {c_{s,g}/|D|}. Let c_s and c_g be the projections of c_{s,g} onto the first and second coordinates. Then, we can write

H(c_{s,g}) ≤ H(c_s) + H(c_g) ≤ log n + H(c_g).

Using ∑_{s,g} c_{s,g} g ≤ n²m and Ziv's inequality, H(c_g) ≤ log(n²m/|D| + 1) + 1. Combining this with the previous inequalities gives

log p(A) ≤ |D| ( m^ℓ + log(n²m/|D| + 1) + 1 − log(|D|/n) ).

Taking the expectation of both sides, then using the concavity of |D| ↦ −|D| log |D| and Jensen's inequality, yields

H(p) ≥ E[|D|] ( log(E[|D|]/n) − m^ℓ − log(n²m/E[|D|] + 1) − 1 ).

D. Proof of Theorem 4

We first upper bound E[|D|] using Lemma 3.

Lemma 7. For the dictionary D generated by LZA,

E[|D|] ≤ 10 n²m log(m + 1) 2^{m^ℓ} / log n.

Proof: An automaton is a random variable over n² transition label sets, each taking at most (m + 1)^m values, hence H(p) ≤ n²m log(m + 1). Combining this inequality with Lemma 3 yields

n²m log(m + 1) ≥ E[|D|] ( log(E[|D|]/n) − m^ℓ − log(n²m/E[|D|] + 1) − 1 ).

Now, let U = 10 n²m log(m + 1) 2^{m^ℓ} / log n and assume that E[|D|] > U. Then, the following inequalities hold:

E[|D|] ( log(E[|D|]/n) − m^ℓ − log(n²m/E[|D|] + 1) − 1 )
  > U ( log( 10 m n log(m + 1) 2^{m^ℓ} / log n ) − m^ℓ ) + U ( − log( log n / (10 · 2^{m^ℓ} log(m + 1)) + 1 ) − 1 )
  ≥ U ( log n − log( log n / (10 · 2^{m^ℓ} log(m + 1)) + 1 ) ) − U log log n
  > n²m log(m + 1),

which leads to a contradiction. This completes the proof of the lemma.

We now have all the tools to prove Theorem 4. Let W(|D|) be the upper bound of Lemma 2. Since we have a probabilistic model and W is concave in |D|, the expected number of bits satisfies

E[l_LZA(A_n)] ≤ W(E[|D|]).

Substituting the lower bound on H(p) from Lemma 3 and rearranging terms, we have

max_{p∈P} ( E[l_LZA(A_n)] − H(p) ) / n²
  = max_{p∈P} ( W(E[|D|]) − H(p) ) / n²
  ≤ (E[|D|]/n²) log( ((n + 1) n / E[|D|]) · (n²m/E[|D|] + 1)² )
    + (E[|D|]/n²) ( 2 log(log(n²/E[|D|] + 1) + 1) + m^ℓ + 4 + log m )
    + (2 E[|D|]/n²) log(log(n + 1) + 1) + ⌈log nm⌉ / n
  = O( 2^{m^ℓ} (log log n + m^ℓ) / log n ).

The last equality follows from Lemma 7. As n → ∞, the bound goes to 0 and hence LZA is a universal compression algorithm.