Automata and Graph Compression

Mehryar Mohri (Courant Institute and Google Research), Michael Riley (Google Research), Ananda Theertha Suresh (UCSD)
[email protected] [email protected] [email protected]

Abstract—We present a theoretical framework for the compression of automata, which are widely used representations in speech processing, natural language processing and many other tasks. As a corollary, our framework further covers graph compression. We introduce a probabilistic process of graph and automata generation that is similar to stationary ergodic processes and that covers real-world phenomena. We also introduce a universal compression scheme LZA for this probabilistic model and show that LZA significantly outperforms other compression techniques such as gzip and the UNIX compress command on several synthetic and real data sets.

Index Terms—Lempel-Ziv, universal compression, graphs

I. INTRODUCTION

Massive amounts of data are generated and processed by modern search engines and popular online sites, which have been reported to be in the order of hundreds of petabytes. At a different level, sophisticated models are commonly stored on mobile devices for tasks such as speech-to-text conversion. Thus, efficient storage and compression algorithms are needed both at the data warehouse level (petabytes of data) and at a device level (megabytes of data). Memory and storage constraints for the latter, as well as the communication costs of downloading complex models, further motivate the need for effective compression algorithms.

Most existing compression techniques were designed for sequential data. They include Huffman and arithmetic coding, which are optimal compression schemes for sequences of symbols distributed i.i.d. according to some unknown distribution [10]; the Lempel-Ziv schemes, which are asymptotically optimal for stationary ergodic processes [15], [20]; and various combinations of these schemes, such as the UNIX command compress, an efficient implementation of Lempel-Ziv-Welch (LZW), and gzip, a combination of Lempel-Ziv-77 (LZ77) and Huffman coding.

However, much of the modern data processed at the data warehouse or mobile device level admits further structure: it may represent segments of a very large graph such as the web graph or a social network, or include very large finite automata or finite-state transducers such as those widely used in speech recognition or in a variety of other language processing tasks such as machine translation, information extraction, and tagging [17]. These data sets are typically very large: web graphs contain tens of billions of nodes (web pages), and in speech processing a large-alphabet language model may have billions of word edges. These examples and applications strongly motivate the need for structured data compression in practice.

But how can we exploit this structure to devise better compression algorithms? Can we improve upon the straightforward serialization of the data followed by the application of an existing sequence compression algorithm? Surprisingly, the problem of compressing such structured data has received little attention. Here, we study precisely this problem of structured data compression. Motivated by the examples just mentioned, we focus on the problem of automata compression and, as a corollary, graph compression.

An empirical study of web graph compression was given by [5], [6], [14]. A theoretical study of the problem was first presented by [1], who proposed a scheme using a minimum spanning tree to find similar nodes. However, the authors showed that many generalizations of their problem are NP-hard. Motivated by probabilistic models, [8], [9] later showed that arithmetic coding can be used to nearly optimally compress (the structure of) graphs generated by the Erdős-Rényi model. An empirical study of automata compression has been presented by several authors in the past, including [11], [12]. However, we are not aware of any theoretical work focused on automata compression. Our objective is three-fold: (i) propose a probabilistic model for automata that captures real-world phenomena; (ii) provide a provably universal compression algorithm; and (iii) show experimentally that the algorithm fares well compared to techniques such as gzip and compress. Note that the probabilistic model we introduce can be viewed as a generalization of Erdős-Rényi graphs [4].

The rest of the paper is organized as follows. In Section II, we briefly describe finite automata and some of their key properties. In Section III, we describe our probabilistic model and show that it covers many real-world applications. In Section IV, we describe our proposed algorithm, LZA, and prove its optimality. In Section V, we further demonstrate our algorithm's effectiveness in terms of its degree of compression. Due to space constraints, all the proofs are deferred to the appendix.

II. DIRECTED GRAPHS AND FINITE AUTOMATA

A finite automaton A is a 5-tuple (Q, Σ, δ, q_I, F) where Q = {1, 2, . . . , n} is a finite set of states, Σ = {1, 2, . . . , m} a finite alphabet, δ : Q × Σ → Q* the transition function, q_I ∈ Q an initial state, and F ⊆ Q the set of final states. Thus, δ(q, a) denotes the set of states reachable from state q via transitions labeled with a ∈ Σ. If there is no transition with label a, then δ(q, a) = ∅. We denote by E ⊆ Q × Σ × Q the set of all transitions (q, a, q′) and by E[q] the set of all transitions from state q. With the automata notation just introduced, a directed graph can be defined as a pair (Q, δ) where Q = {1, 2, 3, . . . , n} is a finite set of nodes and δ : Q → Q* a finite set of edges where, for any node q, δ(q) denotes the set of nodes connected to q.

An example of an automaton is given in Figure 1(a). State 0 in this simple example is the initial state (depicted with the bold circle) and state 1 is the final state (depicted with the double circle). The strings 12 and 222 are among those accepted by this automaton.

Fig. 1: (a) An example automaton; (b) a subway turnstile automaton; (c) an example weighted transducer.
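To make this representation concrete, here is a minimal C++ sketch of the data structures just described (our own illustration, not code from the paper; all type and field names are assumptions). A directed graph is the special case with a single label (m = 1):

#include <cstdint>
#include <set>
#include <vector>

// A transition (q, a, q') is stored in trans[q] as the pair (a, q').
struct Transition {
  int32_t label;  // a in Sigma = {1, ..., m}
  int32_t dest;   // q' in Q = {1, ..., n}
  bool operator<(const Transition& o) const {
    // Ordering used later by LZA: primarily by destination, then by label.
    return dest != o.dest ? dest < o.dest : label < o.label;
  }
};

struct Automaton {
  int32_t num_states;                          // n
  int32_t num_labels;                          // m
  int32_t initial;                             // q_I
  std::set<int32_t> finals;                    // F
  std::vector<std::vector<Transition>> trans;  // trans[q] = E[q]
};

// A directed graph is an automaton with a single label (num_labels = 1).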

By using symbolic labels on this automaton in place of the usual integers, as depicted in Figure 1(b), we can interpret this automaton as the operation of a subway turnstile. It has two states, locked and unlocked, and actions (alphabet) coin and push. If the turnstile is in the locked state and you push, it remains locked, and if you insert a coin, it becomes unlocked. If it is unlocked and you insert a coin it remains unlocked, but if you push once it becomes locked.

Note that directed graphs form a subset of automata with Σ = {1}. Thus, in what follows, we can focus on automata compression. Furthermore, to be consistent with the existing automata literature, we use states to refer to nodes and transitions to refer to edges in both graphs and automata going forward.

Some key applications motivating the study of automata are speech and natural language processing. In these applications, it is often useful to augment transitions with both an output label and some real-valued weight. The resulting devices, weighted finite-state transducers (FSTs), are extensively used in these fields [2], [17], [18]. An example of an FST is given in Figure 1(c). The string 12 is among those accepted by this transducer. For this input, the transducer outputs the string 23 and has weight .046875 (transition weights 0.75 times 0.25 times final weight 0.25).

We propose an algorithm for unweighted automata compression. For FSTs, we use the same algorithm by treating the input-output label pair as a single label. If the automaton is weighted, we simply append the weights at the end of the compressed file using some standard representation.

III. RANDOM AUTOMATA COMPRESSION

A. Probabilistic model

Our goal is to propose a probabilistic model for automata generation that captures real-world phenomena. To this end, we first review probabilistic models for sequences and draw connections to probabilistic models for automata.

1) Probabilistic processes on sequences: We now define i.i.d. sampling of sequences. Let x_1^n denote an n-length sequence x_1, x_2, . . . , x_n. If x_1^n are n independent samples from a distribution p over X, then p(x_1^n) = ∏_{i=1}^n p(x_i). Note that under i.i.d. sampling, the index of the sample has no importance, i.e.,

p(X_i = x) = p(X_j = x), ∀ 1 ≤ i, j ≤ n, x ∈ X.

Stationary ergodic processes generalize i.i.d. sampling. For a stationary ergodic process p over sequences,

p(X_i^m = x_i^m) = p(X_{i+j}^{m+j} = x_i^m), ∀ i, j, m, x_i^m.

Informally, stationary ergodic processes are those for which only the relative position of the indices matters and not the actual positions.

2) Probabilistic processes on automata: Before deriving models for automata generation, we first discuss an invariance property of automata that is useful in practice. The set of strings accepted by an automaton, and the time and space of its use, are not affected by the state numbering. Two automata are isomorphic if they coincide modulo a renumbering of the states. Thus, automata (Q, Σ, δ, q_I, F) and (Q′, Σ, δ′, q_I′, F′) are isomorphic if there is a one-to-one mapping f : Q → Q′ such that f(δ(q, a)) = δ′(f(q), a) for all q ∈ Q and a ∈ Σ, f(q_I) = q_I′, and f(F) = F′, where f(F) = {f(q) : q ∈ F}.

Under stationary ergodic processes, two sequences with the same order of observed symbols have the same probabilities. Similarly, we wish to construct a probabilistic model of automata such that any two isomorphic automata have the same probabilities. For example, the probabilities of the automata in Figure 2 are the same.

Fig. 2: An example of isomorphic automata. The two automata are the same under the permutation 0 → 0, 1 → 2, and 2 → 1.

There are several probabilistic models of automata and graphs that satisfy this property. Perhaps the most studied random model is the Erdős-Rényi model G(n, p), where each state is connected to every other state independently with probability p [4]. Note that if two automata are isomorphic, then the Erdős-Rényi model assigns them the same probability. The Erdős-Rényi model is analogous to i.i.d. sampling on sequences. We wish to generalize the Erdős-Rényi model to more realistic models of automata.

Since the automata model should not depend on the state numbering, it can only depend on the set of incoming or outgoing paths; there is no other possible semantics we could assign to states.
Between incoming and outgoing paths, choosing the dependency on the incoming paths is natural given the sequentiality of language models. For example, in an n-gram model, a state might have an outgoing transition with label Francisco or Diego only if it has an incoming transition with label San. This is an example where we have restrictions on paths of length 2. In general, we may have restrictions on paths of any length ℓ.

We define an ℓ-memory model for automata as follows. Let h_q^ℓ be the set of paths of length at most ℓ leading to the state q. The probability distribution of transitions from a state depends on the paths leading to it. Let δ(q, ∗) = δ(q, 1), δ(q, 2), . . . , δ(q, m). Then

p(A) = p(δ(1, ∗), δ(2, ∗), . . . , δ(n, ∗)) ∝ ∏_{q=1}^n p(δ(q, ∗) | h_q^ℓ).

Similarly, the transitions leaving a state q dissociate into marginals conditioned on the history h_q^ℓ, and the probability that q′ ∈ δ(q, a) also dissociates into marginals:

p(δ(q, ∗) | h_q^ℓ) = ∏_{a∈Σ} p(δ(q, a) | h_q^ℓ) = ∏_{a∈Σ} ∏_{q′=1}^n p(1(q′ ∈ δ(q, a)) | h_q^ℓ),

where 1(q′ ∈ δ(q, a)) is the indicator of the event q′ ∈ δ(q, a). Note that the probabilities are defined up to proportionality, since they may not add up to one. We therefore introduce a constant Z to ensure that p is a probability distribution:

p(A) = p(δ(1, ∗), δ(2, ∗), . . . , δ(n, ∗)) = (1/Z) ∏_{q=1}^n p(δ(q, ∗) | h_q^ℓ) = (1/Z) ∏_{q=1}^n ∏_{a∈Σ} p(δ(q, a) | h_q^ℓ).

Note that ℓ-memory models assign the same probability to automata that are isomorphic. In our calculations, we restrict ℓ to make the model tractable.

Note that sequences form a subset of automata as follows. For a sequence x^n over alphabet Σ, consider the automaton representation with states Q = {1, 2, . . . , n}, initial state q_I = 1, final state set F = {n}, alphabet Σ, and transition function δ(i, x_i) = i + 1 and δ(i, x) = ∅ for all x ≠ x_i. Informally, every sequence can be represented as an automaton with a line as the underlying structure. Note that the ℓ-memory model is similar to an ℓth order Markov chain for sequences.
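For illustration, in the memoryless case ℓ = 0 with every indicator 1(q′ ∈ δ(q, a)) an independent Bernoulli(p), the model reduces to an Erdős-Rényi-style distribution over labeled transition structures and the normalization constant is Z = 1. The C++ sketch below (our own; the function name and signature are assumptions, not from the paper) computes the corresponding log-probability:

#include <cmath>

// log p(A) for the memoryless special case in which every indicator
// 1(q' in delta(q, a)) is an independent Bernoulli(prob): each of the
// n*n*m possible (q, a, q') slots contributes log(prob) if the transition
// is present and log(1 - prob) otherwise.  Here Z = 1.
double LogProbMemoryless(long long n, long long m, long long num_transitions,
                         double prob) {
  const double present = static_cast<double>(num_transitions);
  const double absent = static_cast<double>(n) * n * m - present;
  return present * std::log(prob) + absent * std::log(1.0 - prob);
}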
B. Entropy and coding schemes

A compression scheme is a mapping from X to {0, 1}* such that the resulting code is prefix-free and can be uniquely recovered. For a coding scheme c, let l_c(x) denote the length of the code for x ∈ X. It is well known that the expected number of bits used by any coding scheme is at least the entropy of the distribution, defined as H(p) = ∑_{x∈X} p(x) log(1/p(x)). The well-known Huffman coding scheme achieves this entropy up to one additional bit. For n-length sequences, arithmetic coding is used, which achieves compression up to the entropy with a few additional bits of error.

The above-mentioned coding methods such as Huffman coding and arithmetic coding require knowledge of the underlying distribution. In many practical scenarios, the underlying distribution may be unknown and only the broader class to which the distribution belongs may be known. For example, we might know that the given n-length sequence is generated by i.i.d. sampling of some unknown distribution p over {1, 2, . . . , k}. The objective of a universal compression scheme is to asymptotically achieve H(p) bits per symbol even if the distribution is unknown. A coding scheme c for sequences over a class of distributions P is called universal if

lim sup_{n→∞} max_{p∈P} (E[l_c(X^n)] − H(X^n)) / n = 0.

The normalization factor in the above definition is n, as the number of sequences of length n increases exponentially with n. For automata and graphs with n states (denoted by A_n), we choose a scaling factor of n², as the number of automata scales as exp(n²). We call a coding scheme c for automata over a class of distributions P universal if

lim sup_{n→∞} max_{p∈P} (E[l_c(A_n)] − H(A_n)) / n² = 0.

We now describe the algorithm LZA. Note that the algorithm does not require knowledge of the underlying parameters or of the probabilistic model.

IV. ALGORITHM FOR AUTOMATON COMPRESSION

Our algorithm recursively finds substructures over states and uses a Lempel-Ziv subroutine. Our coding method is based on two auxiliary techniques to improve the compression rate: Elias-delta coding and coding the differences. We briefly discuss these techniques and their properties before describing our algorithm.

A. Elias-delta coding and coding the differences

Elias-delta coding is a universal compression scheme for integers [13]. To represent a positive integer x, Elias-delta codes use ⌊log x⌋ + 2⌊log(⌊log x⌋ + 1)⌋ + 1 bits. To obtain a code over ℕ ∪ {0}, we replace x by x + 1 and use Elias-delta codes.

We now use Elias-delta codes to code sets of integers. Let x_1, x_2, . . . , x_d be integers such that 0 ≤ x_1 ≤ x_2 ≤ · · · ≤ x_d ≤ n. We use the following algorithm to code x_1, x_2, . . . , x_d. The decoding algorithm follows from ELIAS-DECODE [13].

Algorithm DIFFERENCE-ENCODE
Input: Integers 0 ≤ x_1 ≤ x_2 ≤ · · · ≤ x_d ≤ n.
1) Use ELIAS-ENCODE to code x_1 − 0, x_2 − x_1, . . . , x_d − x_{d−1}.

Lemma 1 (Appendix A). For integers such that 0 ≤ x_1 ≤ x_2 ≤ · · · ≤ x_d ≤ n, DIFFERENCE-ENCODE uses at most

d log((n + d)/d) + 2d log(log((n + d)/d) + 1) + d

bits.

We first give an example to illustrate DIFFERENCE-ENCODE's usefulness. Consider a graph representation using adjacency lists. For every source state, the order in which the destination states are stored does not matter. For example, if state 1 is connected to states 2, 4, and 3, it suffices to represent the unordered set {2, 3, 4}. In general, if a state is connected to d out of n states, then it suffices to encode the ordered set of states y_1, y_2, . . . , y_d where 1 ≤ y_1 ≤ y_2 ≤ · · · ≤ y_d ≤ n. The number of such possible sets is (n choose d). If the state sets are all equally likely, then the entropy of the state sets is log (n choose d) ≈ d log(n/d). If each state is represented using log n bits, then d log n bits are used, and d log n > d log(n/d), which is not optimal. However, by Lemma 1, DIFFERENCE-ENCODE uses d log((n + d)/d)(1 + o(1)) ≈ d log(n/d) bits, and hence is asymptotically optimal. Furthermore, the bounds in Lemma 1 are for the worst-case scenario, and in practice DIFFERENCE-ENCODE yields much higher savings. A similar scenario arises in LZA as discussed later.
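A C++ sketch of these two primitives (our own illustration, not the paper's implementation): Elias-delta codes are written into a bit string, and DIFFERENCE-ENCODE codes the gaps of a sorted list, shifting each gap by +1 as described above so that zero is representable:

#include <cstdint>
#include <string>
#include <vector>

// Appends the Elias-gamma code of y >= 1: floor(log2 y) zeros, then y in binary.
static void EliasGamma(uint64_t y, std::string* out) {
  int nbits = 0;
  for (uint64_t t = y; t > 1; t >>= 1) ++nbits;  // nbits = floor(log2 y)
  out->append(nbits, '0');
  for (int i = nbits; i >= 0; --i) out->push_back(((y >> i) & 1) ? '1' : '0');
}

// Appends the Elias-delta code of x >= 1: gamma(floor(log2 x) + 1) followed by
// the floor(log2 x) low-order bits of x.
static void EliasDelta(uint64_t x, std::string* out) {
  int nbits = 0;
  for (uint64_t t = x; t > 1; t >>= 1) ++nbits;  // nbits = floor(log2 x)
  EliasGamma(static_cast<uint64_t>(nbits) + 1, out);
  for (int i = nbits - 1; i >= 0; --i) out->push_back(((x >> i) & 1) ? '1' : '0');
}

// DIFFERENCE-ENCODE: Elias-delta codes of x1 - 0, x2 - x1, ..., xd - x(d-1),
// each difference shifted by +1 so that zero gaps are representable.
std::string DifferenceEncode(const std::vector<uint64_t>& sorted_x) {
  std::string bits;
  uint64_t prev = 0;
  for (uint64_t x : sorted_x) {
    EliasDelta(x - prev + 1, &bits);
    prev = x;
  }
  return bits;
}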
B. LZA

We now have at our disposal the tools needed to design a variant of the Lempel-Ziv algorithm for compressing automata, which we denote by LZA. Let d_q = |E[q]| be the number of transitions from state q and let the transitions in E[q] = {(q, a_1, q_1), (q, a_2, q_2), . . . , (q, a_{d_q}, q_{d_q})} be ordered as follows: for all i, q_i ≤ q_{i+1}, and if q_i = q_{i+1} then a_i < a_{i+1}. The algorithm is based on the observation that the ordering of the transitions leaving a state does not affect the definition of an automaton, and works as follows. The states of the automaton are visited in BFS order. For each state visited, the set of outgoing transitions is sorted based on their destination state. Next, the algorithm repeatedly finds the largest set of remaining transitions that matches some dictionary element and encodes the pair (matched dictionary element number, next transition): it adds the dictionary element number to T_d, the label of the next transition to T_Σ, and the destination state of the next transition to T_δ. It also updates the dictionary by adding the new dictionary element (matched dictionary element number, next transition). Finally, it encodes T_d and T_δ using DIFFERENCE-ENCODE and encodes each element in T_Σ using ⌈log m⌉ bits.

Algorithm LZA
Input: The transition label function δ of the automaton.
Output: Encoded sequence S.
1) Set dictionary D = ∅.
2) Visit all states q in BFS order. For every state q:
   a) Code d_q using ⌈log nm⌉ bits.
   b) Set T_d = ∅, T_Σ = ∅, and T_δ = ∅.
   c) Start with j = 1 in E[q] = {(q, a_1, q_1), . . . , (q, a_{d_q}, q_{d_q})} and continue until j reaches d_q:
      i) Find the largest l such that (a_j, q_j), . . . , (a_{j+l}, q_{j+l}) ∈ D. Let this dictionary element be d_r.
      ii) Add (a_j, q_j), . . . , (a_{j+l+1}, q_{j+l+1}) to D.
      iii) Add d_r to T_d, q_{j+l+1} to T_δ, and a_{j+l+1} to T_Σ.
   d) Use DIFFERENCE-ENCODE to encode T_d and T_δ, and encode each element in T_Σ using ⌈log m⌉ bits. Append these sequences to S.
3) Discard the dictionary and output S.
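The following condensed C++ sketch (our own simplification, not the paper's code) implements the per-state matching loop of step 2c, keeping the dictionary as an LZ78-style trie keyed by (parent phrase, next transition); the final serialization of T_d, T_Σ, and T_δ with DIFFERENCE-ENCODE and ⌈log m⌉-bit labels is omitted:

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// One (label, destination) pair, i.e. a transition with its source implicit.
using Arc = std::pair<int32_t, int32_t>;
// A phrase is identified by (parent phrase id, extending arc); id 0 = empty phrase.
using PhraseKey = std::pair<int32_t, Arc>;

struct LzaTriple {
  int32_t dict_id;  // matched dictionary element, appended to T_d
  int32_t label;    // next transition label, appended to T_Sigma
  int32_t dest;     // next transition destination, appended to T_delta
};

// Simplified sketch of step 2c for a single state whose outgoing arcs are
// already sorted by (destination, label).  The dictionary persists across
// states; serialization of the resulting triples is not shown.
std::vector<LzaTriple> EncodeState(const std::vector<Arc>& sorted_arcs,
                                   std::map<PhraseKey, int32_t>* dict) {
  std::vector<LzaTriple> out;
  size_t j = 0;
  while (j < sorted_arcs.size()) {
    // Greedily follow the longest phrase already in the dictionary.
    int32_t cur = 0;  // empty phrase
    while (j + 1 < sorted_arcs.size()) {
      auto it = dict->find({cur, sorted_arcs[j]});
      if (it == dict->end()) break;
      cur = it->second;
      ++j;
    }
    const Arc& next = sorted_arcs[j++];
    // New dictionary element = (matched element, next transition).
    int32_t new_id = static_cast<int32_t>(dict->size()) + 1;
    dict->emplace(PhraseKey{cur, next}, new_id);
    out.push_back({cur, next.first, next.second});
  }
  return out;
}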

We note that simply compressing the unordered sets T_d and T_δ suffices for unique reconstruction, and thus DIFFERENCE-ENCODE is the natural choice. Observe that DIFFERENCE-ENCODE is a succinct representation of the dictionary and does not affect the way the Lempel-Ziv dictionary is built. Thus the decoding algorithm follows immediately from retracing the steps of LZA and the LZ78 decoding algorithm. The run time of LZA is similar to that of other Lempel-Ziv algorithms and is O(n + |E|). If DIFFERENCE-ENCODE were not used, the number of bits used would be approximately |D| log |D| + |D| log n, which is strictly greater than the number of bits in Lemma 2. Furthermore, we did consider several other natural variants of this algorithm where we difference-encode the states first and then serialize the data using a standard Lempel-Ziv algorithm. However, we could not prove asymptotic optimality for those variants. Proving their non-optimality would require constructing distributions for which the algorithm is non-optimal, which is not the focus of this paper.

We first bound the number of bits used by LZA in terms of the size of the dictionary |D|, the number of states n, and the alphabet size m. This bound is independent of the underlying probabilistic model. Next, we proceed to derive probabilistic bounds.

Lemma 2 (Appendix B). The total number of bits used by LZA is at most

|D| [log(n + 1) + log(ν + 1) + 2 log(log(n + 1) + 1)] + |D| [2 log(log(ν + 1) + 1) + 2 + ⌈log m⌉] + n⌈log nm⌉,

where ν = n²/|D|.

C. Proof of optimality

In this section, we prove that LZA is asymptotically optimal for the random automata model introduced in Section III. Lemma 2 gives an upper bound on the number of bits used in terms of the size of the dictionary |D|. We now present a lower bound on the entropy in terms of |D| which will help us prove this result. The proof is given in Appendix C.

Lemma 3. LZA satisfies

H(p) ≥ E[|D|] ( log(E[|D|]/n) − m^ℓ − log(n²m/E[|D|] + 1) − 1 ).

The above result together with Lemma 2 implies:

Theorem 4 (Appendix D). If 2^{m^ℓ} = o(log n / log log n), then LZA is a universal compression algorithm.
V. EXPERIMENTS

A. Automaton structure compression

LZA compresses automata, but for most applications it is sufficient to compress the automaton structure. We convert LZA into LZA_S, an algorithm for automaton structure compression, as follows. We first perform a breadth-first search (BFS) with the initial state as the root and relabel the states in their BFS visitation order. We then run LZA with the following modification. In step 2, for every state q we divide the transitions from q into two groups: T_old^q, the transitions whose destination states have been traversed before in LZA, and T_new^q, the transitions whose destinations have not been traversed. Note that since the state numbers are ordered based on a BFS visit, the destination state numbers in T_new^q are 1, 2, . . . , n and can be recovered easily while decoding, and thus need not be stored. Hence, we run step 2c of LZA only on the transitions in T_old^q. For T_new^q, we just compress the transition labels using a standard implementation of LZ78.

Since each destination state can appear in T_new^q only once, the number of transitions in ∪_q T_new^q is at most n. Since this number is much smaller than n², the normalization factor in the definition of a universal compression algorithm for automata, the proof of Theorem 4 extends to LZA_S. Since for most applications it is sufficient to compress the automaton structure, we implemented LZA_S in C++ and added it to the OpenFst open-source library [3] (available for download at www.openfst.org).
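A minimal C++ sketch of the BFS relabeling step used by LZA_S (our own illustration; how unreachable states are handled here is an assumption, they are simply appended after the reachable ones):

#include <cstdint>
#include <queue>
#include <vector>

// dests[q] lists the destination states of q's outgoing transitions.
// Returns relabel[q] = BFS visitation rank of q (0-based), starting from
// the initial state.
std::vector<int32_t> BfsRelabel(const std::vector<std::vector<int32_t>>& dests,
                                int32_t initial) {
  const int32_t n = static_cast<int32_t>(dests.size());
  std::vector<int32_t> relabel(n, -1);
  std::queue<int32_t> frontier;
  int32_t next_label = 0;
  relabel[initial] = next_label++;
  frontier.push(initial);
  while (!frontier.empty()) {
    int32_t q = frontier.front();
    frontier.pop();
    for (int32_t d : dests[q]) {
      if (relabel[d] == -1) {  // first time this destination is reached
        relabel[d] = next_label++;
        frontier.push(d);
      }
    }
  }
  // States unreachable from the initial state receive trailing labels.
  for (int32_t q = 0; q < n; ++q)
    if (relabel[q] == -1) relabel[q] = next_label++;
  return relabel;
}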

B. Comparison

The best known convergence rate of all Lempel-Ziv algorithms for sequences is O(log log n / log n), and LZA has the same convergence rate under the ℓ-memory probabilistic model. However, in practice data sets have finitely many states and the underlying automata may not be generated from an ℓ-memory probabilistic model. To demonstrate the practicality of the algorithm, we compare LZA_S with the Unix compress command (LZW) and gzip (LZ77 and Huffman coding) on various synthetic and real data sets.

TABLE I: Synthetic data compression examples (in bytes).
Class  Uncompressed  LZA_S  compress  gzip   LZA+gzip
G1     172208        18260  22681     23752  17320
A1     172076        21745  33478     31682  21108
G2     34102         2536   4994      4564   2443
A2     34171         3027   6707      5546   2940

1) Synthetic Data: While the ℓ-memory probabilistic model describes a broad class of probabilistic models on which LZA is universal, generating samples from an ℓ-memory model is difficult, similarly to graphical models, as the normalization factor Z is hard to compute. We therefore test our algorithm on a few simpler synthetic data sets. In all our experiments the number of states is 1000 and the results are averaged over 1000 runs.

Table I summarizes our results for a few synthetic data sets, specified in bytes. Note that one of the main advantages of LZA_S over existing algorithms is that LZA_S compresses just the structure, which is sufficient for applications in speech processing and language modeling. Furthermore, note that to obtain the actual automaton from the structure we need the original state numbering, which can be specified in n log n bits, i.e., about (1000 log₂ 1000)/8 ≤ 1250 bytes in our experiments. Even if we add 1250 bytes to our results in Table I, LZA_S still performs better than gzip and compress.

We run the algorithm on four different synthetic data sets G1, G2, A1, A2. G1 and A1 are models with a uniform out-degree distribution over the states, and G2 and A2 are models with a non-uniform out-degree distribution:

G1: directed Erdős-Rényi graphs, where we randomly generate a transition between every source-destination pair with probability 1/100.

A1: the automata version of Erdős-Rényi graphs, where there is a transition between every two states with probability 1/100 and the transition labels are chosen independently from an alphabet of size 10 for each transition.

G2: we assign each state a class c ∈ {1, 2, . . . , 1000} randomly. We connect every two states s and d with probability 1/(c_s + c_d). The resulting graph has degrees varying from 2 to log 1000.

A2: we generate the transitions as above and label each transition with a deterministic function of the destination state. This is similar to n-gram models, where the destination state determines the label. Here again we chose |Σ| = 10.
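For concreteness, the following C++ sketch shows how an A1-style sample could be drawn (our own illustration; the function name, RNG choice, and seeding are assumptions, not taken from the paper). Setting m = 1 yields a G1-style directed Erdős-Rényi graph:

#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Draws an A1-style random automaton on n states: each ordered state pair
// (q, q') carries a transition with probability prob, and each generated
// transition gets a label drawn uniformly from {1, ..., m}.
// Returned structure: trans[q] = list of (label, destination); index 0 unused.
std::vector<std::vector<std::pair<int32_t, int32_t>>> SampleRandomAutomaton(
    int32_t n, int32_t m, double prob, uint64_t seed) {
  std::mt19937_64 rng(seed);
  std::bernoulli_distribution edge(prob);
  std::uniform_int_distribution<int32_t> label(1, m);
  std::vector<std::vector<std::pair<int32_t, int32_t>>> trans(n + 1);
  for (int32_t q = 1; q <= n; ++q)
    for (int32_t dest = 1; dest <= n; ++dest)
      if (edge(rng)) trans[q].push_back({label(rng), dest});
  return trans;
}

// Example matching the paper's A1 setup: SampleRandomAutomaton(1000, 10, 0.01, 1).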
Note that LZA_S always performs better than the standard Lempel-Ziv-based algorithms gzip and compress. Algorithms designed with specific knowledge of the underlying model can, of course, achieve better performance. For example, for G1, arithmetic coding can be used to obtain a compressed file size of n² h(0.01)/8 ≈ 10000 bytes. However, the same algorithm would not perform well for G2 or A2.

2) Real-World Data: We also tested our compression algorithm on a variety of 'real-world' automata drawn from various speech and natural language applications. These include large speech recognition language models and decoder graphs [17], text normalization grammars for text-to-speech [19], speech recognition and machine translation lattices [16], and pair n-gram grapheme-to-phoneme models [7]. We selected approximately eighty such automata from these tasks and removed their weights and output labels (if any), since we focus here on unweighted automata. Figure 3 shows the compressed sizes of these automata, ordered by their uncompressed (adjacency-list) size rank, with the same set of compression algorithms presented in the synthetic case. At the smallest sizes, gzip outperforms LZA, but after about 100 kbytes in compressed size, LZA is better. Overall, the combination of LZA and gzip performs best.

Fig. 3: Real-world compression examples (compressed size in bytes versus uncompressed size rank, for lza, compress, gzip, and lza+gzip).

VI. ACKNOWLEDGEMENTS

We thank J. Acharya and A. Orlitsky for helpful discussions.

REFERENCES

[1] M. Adler and M. Mitzenmacher. Towards compressing web graphs. In DCC, 2001.
[2] V. Alabau, F. Casacuberta, E. Vidal, and A. Juan. Inference of stochastic finite-state transducers using N-gram mixtures. In IbPRIA, 2007.
[3] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri. OpenFst: A general and efficient weighted finite-state transducer library. In CIAA, 2007.
[4] N. Alon and J. Spencer. The Probabilistic Method. Wiley, 1992.
[5] V. N. Anh and A. Moffat. Local modeling for webgraph compression. In DCC, 2010.
[6] A. Apostolico and G. Drovandi. Graph compression by BFS. Algorithms, 2(3):1031–1044, 2009.
[7] M. Bisani and H. Ney. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434–451, 2008.
[8] Y. Choi and W. Szpankowski. Compression of graphical structures. In ISIT, 2009.
[9] Y. Choi and W. Szpankowski. Compression of graphical structures: Fundamental limits, algorithms, and experiments. IEEE Trans. on Info. Theory, 58(2):620–638, 2012.
[10] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 2006.
[11] J. Daciuk. Experiments with automata compression. In CIAA, 2000.
[12] J. Daciuk and J. Piskorski. Gazetteer compression technique based on substructure recognition. In IIPWM. Springer, 2006.
[13] P. Elias. Universal codeword sets and representations of the integers. IEEE Trans. on Info. Theory, 21(2):194–203, 1975.
[14] S. Grabowski and W. Bieniecki. Tight and simple web graph compression. In Proceedings of the Prague Stringology Conference, 2010.
[15] G. Hansel, D. Perrin, and I. Simon. Compression and entropy. In STACS, 1992.
[16] G. Iglesias, C. Allauzen, W. Byrne, A. de Gispert, and M. Riley. Hierarchical phrase-based translation representations. In Proc. of Conf. on Empirical Methods in Natural Language Processing, 2011.
[17] M. Mohri, F. Pereira, and M. Riley. Weighted finite-state transducers in speech recognition. Computer Speech & Language, 16(1), 2002.
[18] E. Roche and Y. Schabes. Deterministic part-of-speech tagging with finite-state transducers. Computational Linguistics, 1995.
[19] T. Tai, W. Skut, and R. Sproat. Thrax: An open source grammar compiler built on OpenFst. In ASRU, 2011.
[20] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. on Info. Theory, 23(3):337–343, 1977.

APPENDIX

A. Proof of Lemma 1

Since 0 is included in the set, the number of bits used to represent x is upper bounded by θ(x) = log(x + 1) + 2 log(log(x + 1) + 1) + 1. Observe that θ is a concave function since both log and x ↦ log(log x) are concave. Let x_0 = 0. Then, by the concavity of θ, the total number of bits B used can be bounded as follows:

B ≤ ∑_{i=1}^d θ(x_i − x_{i−1}) = d · (1/d) ∑_{i=1}^d θ(x_i − x_{i−1}) ≤ d θ( (1/d) ∑_{i=1}^d (x_i − x_{i−1}) ) = d θ(x_d/d) ≤ d θ(n/d),

where the last inequality uses x_d ≤ n and the fact that θ is an increasing function. This completes the proof of the lemma.

B. Proof of Lemma 2

Let k_q be the number of elements added to the dictionary when state q is visited by LZA. The maximum value of a destination state is n. Thus, by Lemma 1, the number of bits used to code T_δ is at most

∑_{q=1}^n ( k_q θ(n/k_q) + k_q ⌈log m⌉ ),

where θ(·) is the function introduced in the proof of Lemma 1. Similarly, since the maximum value of any dictionary element is |D|, by Lemma 1, the number of bits used to code T_d is at most

∑_{q=1}^n k_q θ(|D|/k_q).

By concavity, these summations are maximized when k_q = |D|/n for all q. Plugging that value into the sums above yields the following upper bound on the maximum number of bits used:

|D| θ(ν) + |D| ⌈log m⌉ + |D| θ(n).

Additionally, this number must be augmented by n⌈log nm⌉ since ⌈log nm⌉ bits are used to encode each d_q, which completes the proof.

C. Proof of Lemma 3

One of the main technical tools we use is Ziv's inequality, stated below.

Lemma 5 (Variation of Ziv's inequality). For a probability distribution p over non-negative integers with mean µ, H(p) ≤ log(µ + 1) + 1.

The next lemma bounds the probability of disjoint events under different distributions.

Lemma 6. Let A_1, A_2, . . . , A_k be a set of disjoint events. Then, for a set of distributions p_1, p_2, . . . , p_r,

∑_{i=1}^k ∑_{j=1}^r p_j(A_i) ≤ ∑_{j=1}^r 1 = r.

We now lower bound H(p) in terms of the number of dictionary elements. Let d_q = |E[q]| be the number of transitions from state q and let the transitions in E[q] = {(q, a_1, q_1), . . . , (q, a_{d_q}, q_{d_q})} be ordered as follows: for all i, q_i ≤ q_{i+1}, and if q_i = q_{i+1} then a_i < a_{i+1}. To simplify the discussion, we use the shorthand e_{q,i} = (q, a_i, q_i). Then, by the definition of our probabilistic model,

log p(A) = ∑_{q=1}^n log p(e_{q,1}, e_{q,2}, . . . , e_{q,d_q} | h_q^ℓ) − log Z.

We group the transitions the way LZA constructed the dictionary. Let {D_{q,i}} be the set of dictionary elements added when state q is visited during the execution of the algorithm. For a dictionary element D_{q,i}, let s_{q,i} denote the index of its starting transition e_{q,·} and t_{q,i} that of its terminal one. Then, by the independence of the transition labels and the fact that Z ≥ 1,

log p(A) ≤ ∑_{q=1}^n ∑_{D_{q,i}} log p(e_{q,s_{q,i}}, e_{q,s_{q,i}+1}, . . . , e_{q,t_{q,i}} | h_q^ℓ).

Let g_{q,i} = t_{q,i} − s_{q,i}. We now group the dictionary elements by s_{q,i} and g_{q,i}. Let D(s, g) be the set of dictionary elements D_{q,i} with s_{q,i} = s and g_{q,i} = g, let c_{s,g} = |D(s, g)| be the cardinality of that set, and write e_{q,s}^t = e_{q,s}, e_{q,s+1}, . . . , e_{q,t}. Then, by Jensen's inequality, we can write

∑_{q=1}^n ∑_{D_{q,i}} log p(e_{q,s_{q,i}}^{t_{q,i}} | h_q^ℓ)
  = ∑_{s=1}^n ∑_{g=1}^n ∑_{q=1}^n ∑_{D_{q,i} ∈ D(s,g)} log p(e_{q,s}^{s+g} | h_q^ℓ)
  = ∑_{s=1}^n ∑_{g=1}^n c_{s,g} · (1/c_{s,g}) ∑_{q=1}^n ∑_{D_{q,i} ∈ D(s,g)} log p(e_{q,s}^{s+g} | h_q^ℓ)
  ≤ ∑_{s=1}^n ∑_{g=1}^n c_{s,g} log( (1/c_{s,g}) ∑_{q=1}^n ∑_{D_{q,i} ∈ D(s,g)} p(e_{q,s}^{s+g} | h_q^ℓ) )
  ≤ ∑_{s=1}^n ∑_{g=1}^n c_{s,g} log( 2^{m^ℓ} / c_{s,g} ),

where the last inequality follows from Lemma 6, the fact that the events in each summation are disjoint and mutually exclusive, and the fact that the number of possible histories h_q^ℓ is at most 2^{m^ℓ}. Since ∑_{s,g} c_{s,g} = |D|,

∑_{s=1}^n ∑_{g=1}^n c_{s,g} log( 2^{m^ℓ} / c_{s,g} )
  = |D| m^ℓ + ∑_{s=1}^n ∑_{g=1}^n c_{s,g} log(1/c_{s,g})
  = |D| m^ℓ − |D| log |D| + |D| ∑_{s=1}^n ∑_{g=1}^n (c_{s,g}/|D|) log(|D|/c_{s,g})
  = |D| m^ℓ − |D| log |D| + |D| H(c_{s,g}),

where H(c_{s,g}) denotes the entropy of the distribution {c_{s,g}/|D|}. Let c_s and c_g be the projections of c_{s,g} onto the first and second coordinates. Then, we can write

H(c_{s,g}) ≤ H(c_s) + H(c_g) ≤ log n + H(c_g).

Using ∑_{s,g} c_{s,g} g ≤ n²m and Ziv's inequality, H(c_g) ≤ log(n²m/|D| + 1) + 1. Combining this with the previous inequalities gives

log p(A) ≤ |D| ( m^ℓ + log(n²m/|D| + 1) + 1 − log(|D|/n) ).

Taking the expectation of both sides, then using the concavity of |D| ↦ −|D| log |D| and Jensen's inequality, yields

H(p) ≥ E[|D|] ( log(E[|D|]/n) − m^ℓ − log(n²m/E[|D|] + 1) − 1 ).

D. Proof of Theorem 4

We first upper bound E[|D|] using Lemma 3.

Lemma 7. For the dictionary D generated by LZA,

E[|D|] ≤ 10 n²m log(m + 1) 2^{m^ℓ} / log n.

Proof: An automaton is a random variable over n² transition label sets, each taking at most (m + 1)^m values, hence H(p) ≤ n²m log(m + 1). Combining this inequality with Lemma 3 yields

n²m log(m + 1) ≥ E[|D|] ( log(E[|D|]/n) − m^ℓ − log(n²m/E[|D|] + 1) − 1 ).

Now, let U = 10 n²m log(m + 1) 2^{m^ℓ} / log n and assume that E[|D|] > U. Then, the following inequalities hold:

E[|D|] ( log(E[|D|]/n) − m^ℓ − log(n²m/E[|D|] + 1) − 1 )
  > U ( log( 10 m n log(m + 1) 2^{m^ℓ} / log n ) − m^ℓ ) + U ( − log( log n / (10 · 2^{m^ℓ} log(m + 1)) + 1 ) − 1 )
  ≥ U ( log n − log( log n / (10 · 2^{m^ℓ} log(m + 1)) + 1 ) ) − U log log n
  > n²m log(m + 1),

which leads to a contradiction. This completes the proof of the lemma.

We now have all the tools to prove Theorem 4. Let W(|D|) be the upper bound of Lemma 2. Since we have a probabilistic model and W is concave in |D|, the expected number of bits satisfies

E[l_LZA(A_n)] ≤ W(E[|D|]).

Substituting the lower bound on H(p) from Lemma 3 and rearranging terms, we have

max_{p∈P} ( E[l_LZA(A_n)] − H(p) ) / n²
  = max_{p∈P} ( W(E[|D|]) − H(p) ) / n²
  ≤ (E[|D|]/n²) log( ((n + 1) n / E[|D|]) · (n²m/E[|D|] + 1)² )
    + (E[|D|]/n²) ( 2 log(log(n²/E[|D|] + 1) + 1) + m^ℓ + 4 + log m )
    + (2 E[|D|]/n²) log(log(n + 1) + 1) + ⌈log nm⌉ / n
  = O( 2^{m^ℓ} (log log n + m^ℓ) / log n ).

The last equality follows from Lemma 7. As n → ∞, the bound goes to 0 and hence LZA is a universal compression algorithm.