arXiv:0809.0949v3 [cs.IT] 8 May 2009

Efficient Implementation of the Generalized Tunstall Code Generation Algorithm

Michael B. Baer
VMware, Inc.
71 Stevenson St., San Francisco, CA 94105-0901, USA
Email: [email protected]

Abstract—A method is presented for constructing a Tunstall code that is linear time in the number of output items. This is an improvement on the state of the art for non-Bernoulli sources, including Markov sources, which require a (suboptimal) generalization of Tunstall's algorithm proposed by Savari and analytically examined by Tabus and Rissanen. In general, if n is the total number of output leaves across all Tunstall trees, s is the number of trees (states), and D is the number of children of each internal node, then this method takes O((1 + (log s)/D)n) time and O(n) space.

I. INTRODUCTION

Although not as well known as Huffman's optimal fixed-to-variable-length coding method, the optimal variable-to-fixed-length coding technique proposed by Tunstall [1] offers an alternative method of block coding. In this case, the input blocks are variable in size and the output size is fixed, rather than vice versa. Consider a variable-to-fixed-length code for an independent, identically distributed (i.i.d.) sequence of random variables {X_k}, where, without loss of generality, Pr[X_k = i] = p_i for i ∈ X = {0, 1, ..., D − 1}, so that the input is D-ary. The outputs are indices in {0, 1, ..., n − 1}, most commonly with n = 2^ν for an integer ν, so that the output alphabet can be considered as ν-bit codewords. Codewords thus correspond to variable-length strings of input symbols, suggesting the use of a coding tree. Unlike the Huffman code, a prefix code, in this case the inputs are the codewords and the outputs are the indices, rather than the reverse; the tree thus parses the input and is known as a parsing tree. For example, consider the first two symbols of a ternary data stream to be parsed and coded using the tree in Fig. 1. If the first symbol is a 0, that symbol alone is parsed and the first index (bits 000) is used; otherwise the first two ternary symbols are represented as a single index.

[Fig. 1. Ternary-input, 3-bit-output Tunstall tree: the parse strings 0, 10, 11, 12, 20, 21, and 22 correspond to the codewords 000 through 110, respectively.]

The number of input symbols per parse thus varies, while the number of output bits per parse is always the same. An optimal tree will be one which maximizes compression, the expected ratio of the number of input bits to the number of output bits. The probability of a parse is the product of the probabilities of its symbols; i.e., internal nodes, too, have probability equal to the product of their corresponding events. Note that, because the codewords are of fixed length, the output probabilities should be as uniform as possible, which Tunstall's algorithm accomplishes by always splitting the most probable item into leaves, one of which will be the least probable item in the subsequent tree. This greedy, inductive splitting is the basis of Tunstall's optimal algorithm. For sources with memory, due to the fixed-length nature of the output, Markov sources can be parsed using a parameterized generalization of the approach, where the parameter is determined from the Markov process, independent of code size, prior to building the tree [2], [3].
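For concreteness, parsing with a dictionary like that of Fig. 1 can be sketched in Python. This is an illustration only, under the assumption (consistent with the example in this section) that the seven leaves 0, 10, ..., 22 are assigned the three-bit indices 000 through 110 in lexicographic order, with 111 unused; the names `leaves`, `codeword`, and `encode` are ours:

```python
# Hypothetical rendering of Fig. 1's dictionary: parse strings in
# lexicographic order, each assigned a 3-bit index (111 is unused).
leaves = ["0", "10", "11", "12", "20", "21", "22"]
codeword = {w: format(i, "03b") for i, w in enumerate(leaves)}

def encode(source):
    """Greedily parse a ternary string against the dictionary."""
    out, pos = [], 0
    while pos < len(source):
        for w in (source[pos:pos + 2], source[pos:pos + 1]):
            if w in codeword:                # longest match first
                out.append(codeword[w])
                pos += len(w)
                break
        else:
            raise ValueError("input ends inside a parse word")
    return "".join(out)

print(encode("12100"))  # parses as 12, 10, 0
```

For the input 12100, this produces the bitstream 011001000, three fixed-length output blocks for a variable number of input symbols per block.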
Because probabilities are fully known ahead of time, if we start with an optimal tree (at first a null, one-item tree), we can split one of its leaves into D output leaves, leading to a new tree. Since the value of any split, the increase to expected parse length (the quantity we wish to maximize), is fully determined by the probability of the leaf node to be split, and does not affect the benefit of any other potential split, we should split the most probable leaf at each step until we get an n-item tree. An input of l(j) symbols ((log2 D) · l(j) bits) is parsed and coded into the log2 n bits of the jth codeword. Thus the expected ratio to maximize is:

    (log2 D / log2 n) · Σ_{j=0}^{n−1} l(j) · Π_{k=1}^{l(j)} p_{c_k(j)}    (1)

where l(j) is the number of input symbols (the depth) of the jth leaf and c_k(j) is the kth symbol of the corresponding parse word, so that the product is the probability of the jth parse; the constant to the left of the summation can be ignored, leaving the right term, the expected input parse length, as the value to maximize. So, for example, if Fig. 1 is the (ternary) coding tree, then an input of 12100 would first have 12 parsed, coded into binary as 011; then have 10 parsed, coded into binary as 001; then have 0 parsed, coded into binary as 000, for an encoded output bitstream of 011001000. Because each split adds D − 1 leaves, Tunstall's technique stops just before n leaves would be exceeded; the final tree has 1 + (D − 1)⌊(n − 1)/(D − 1)⌋ leaves, which might be less than n, as in Fig. 1 for n = 8, so the optimal algorithm necessarily has unused codewords in such cases.
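The greedy rule just described, repeatedly splitting the most probable leaf, can be sketched in Python. This is an illustrative naive version, not the paper's linear-time method; the function name and the binary example probabilities (chosen to match Fig. 3) are our assumptions:

```python
def tunstall_naive(probs, n):
    """Leaf probabilities of a D-ary Tunstall tree, built by repeatedly
    splitting the most probable leaf and stopping just before n leaves
    would be exceeded."""
    leaves = [1.0]                        # the one-item (root-only) tree
    while len(leaves) + len(probs) - 1 <= n:
        p = max(leaves)                   # most probable leaf...
        leaves.remove(p)                  # ...becomes an internal node
        leaves.extend(p * q for q in probs)
    return sorted(leaves, reverse=True)

# Binary source with P(0) = 0.7, P(1) = 0.3 and n = 4 output codewords:
print(tunstall_naive([0.7, 0.3], 4))
```

For this binary source, the four-leaf tree has leaf probabilities 0.343, 0.3, 0.21, and 0.147 (up to floating-point rounding), the values that appear in Fig. 3. Each `max` scan makes this quadratic overall, which is what the queue-based methods below avoid.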
 1) Initialize two empty regular queues: ←Q for left children and →Q for right children. These queues will need to hold at most n items altogether.
 2) Split the root (probability 1) node, with the left child going into ←Q and the right child going into →Q. Assign tree size z ← 2.
 3) Move the item with the highest (overall) probability out of its queue. The node this represents is split into its two children, and these children are enqueued into their respective queues.
 4) Increment tree size by 1, i.e., z ← z + 1; if z < n, go to Step 3; otherwise, end.

Fig. 2. Linear-time binary Tunstall code generation

Analyses of the performance of Tunstall's technique are prevalent in the literature [4], [5], but perhaps the most obvious advantage to Tunstall coding is that of being randomly accessible [5]: Each output block can be decoded without having to decode any prior block. This not only aids in randomly accessing portions of the compressed sequence, but also in synchronization: Because the size of output blocks is fixed, simple symbol errors do not propagate beyond the set of input symbols a given output block represents. Huffman codes and variable-to-variable-length codes (e.g., those in [6]) do not share this property.
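The steps of Fig. 2 can be sketched in Python as follows. This is an illustration, not the paper's code; it assumes (to match Fig. 3) a binary i.i.d. source in which the symbol of probability p0 produces the right child, and the function name and return convention are ours. The key invariant is that each FIFO queue stays in decreasing probability order, so the most probable leaf is always at one of the two heads:

```python
from collections import deque

def tunstall_two_queues(p0, n):
    """Binary Tunstall tree in linear time via two FIFO queues."""
    left, right = deque([1.0 - p0]), deque([p0])  # step 2: split the root
    z = 2                                          # current tree size
    while z < n:                                   # steps 3 and 4
        q = left if left[0] >= right[0] else right
        p = q.popleft()                  # split the most probable leaf;
        left.append(p * (1.0 - p0))      # its two children join the
        right.append(p * p0)             # tails of their queues
        z += 1
    return list(left), list(right)

# P(0) = 0.7 (right children), P(1) = 0.3 (left children), n = 4:
print(tunstall_two_queues(0.7, 4))
```

For p0 = 0.7 and n = 4 this leaves (0.3, 0.21, 0.147) in the left queue and (0.343) in the right queue, matching the queue contents shown in panel (c) of Fig. 3.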

[Fig. 3 diagram: left and right node queue contents for the (a) 2-item (left: 0.3; right: 0.7), (b) 3-item (left: 0.3, 0.21; right: 0.49), and (c) 4-item (left: 0.3, 0.21, 0.147; right: 0.343) Tunstall trees.]

Fig. 3. Example of binary Tunstall coding using two queues
different) output alphabet size not exceeding n. Any method that achieves a smaller optimal tree more quickly can therefore achieve an optimal n-leaf tree more quickly using the method introduced here in postprocessing.

III. FAST GENERALIZED ALGORITHM

This method of executing Tunstall's algorithm is structured in such a way that it easily generalizes to sources that are not binary, are not i.i.d., or are neither. If a source is not i.i.d., however, there is state due to, for example, the nature or the quantity of prior input. Thus each possible state needs its own parsing tree. Since the size of the output set of trees is proportional to the total number of leaves, in this case n denotes the total number of leaves.

In the case of sources with memory, a straightforward extension of Tunstall coding might not be optimal [12]. Indeed, the optimal parsing for any given point should depend on its state, resulting in multiple parsing trees.

Instead of splitting the node with maximum probability, a generalized Tunstall policy splits the node minimizing some (constant-time-computable) fj,k(p_i^(j)) across all parsing trees, where j indexes the beginning state (the parse tree) and k indexes the state corresponding to the node; p_i^(j) is the probability of the ith leaf of the jth tree, conditional on this tree. Every fj,k(p) is decreasing in p = p_i^(j), the probability corresponding to the node in tree j to be split. This generalization generally gives suboptimal but useful codes; in the case of i.i.d. sources, it achieves an optimal code using fj,k(p) = − ln p, where ln is the natural logarithm. The functions fj,k are yielded by preprocessing which we will not count as part of the algorithm's time cost, being independent of n. In this case n, the size of the output, is actually the number of total leaves in the output set of trees, not the number in any given tree. These functions are chosen for coding that is, in some sense, asymptotically optimal [3].

Consider D-ary coding with n outputs and g equivalent output results in terms of states and probabilities — e.g., g = 3 for input probability mass function (0.5, 0.2, 0.2, 0.1), since events all have probability in the g-member set {0.5, 0.2, 0.1}. If the source is memoryless, we always have g ≤ D. A more complex example might have D different output values with different probabilities with s input states and s output states, leading to g = Ds². Then a straightforward extension of the approach, using g queues, would split the minimum-f node among the nodes at the heads of the g queues. This would take O(n) space and O((s + g/D)n) time per tree, since there are ⌈(n−1)/(D−1)⌉ steps with g looks (for minimum-fj,k nodes, only one of which is dequeued) and D enqueues as a result of the split of a node into D children. (Probabilities in each of the multiple parsing trees are conditioned on the state at the time the root is encountered.)

However, g could be large, especially if s is not small. We can instead use an O(log g)-time priority queue structure — e.g., a heap — to keep track of the leaves with the smallest values of f. Such a priority queue contains up to g pointers to queues; these pointers are reordered after each node split from smallest to largest according to priority fj,k(p∗^(j)), the value of the function for the item at the head of the corresponding regular queue. (Priority queue insertions occur anywhere within the queue that keeps items in the queue sorted by priorities set upon insertion. Removal of the smallest f and inserts of arbitrary f generally take O(log g) amortized time in common implementations [13, section 5.2.3], although some have constant-time inserts [14].) The algorithm — taking O((1 + (log s)/D)n) time and O(n) space per tree, as explained below — is thus as described in Fig. 4.

 1) Initialize empty regular queues {Qt} indexed by all g possible combinations of conditional probability, tree state, and node state; denote a given triplet of these as t = (p′, j, k). These queues, which are not priority queues, will need to hold at most n items (nodes) altogether. Initialize an additional empty priority queue P which can hold up to g pointers to these regular queues.
 2) Split the s root (probability 1) nodes among regular queues according to (p′, j, k). Similarly, initialize the priority queue to point to those regular queues which are not empty, in an order according to the corresponding fj,k. Assign solution size z ← Ds.
 3) Move the item at the head of QP0 — the queue pointed to by the head P0 of the priority queue P — out of its queue; it has the lowest f and is thus the node to split. Its D children are distributed to their respective queues according to t. Then P0 is removed from the priority queue, and, if any of the aforementioned children were inserted into previously empty queues, pointers to these queues are inserted into the priority queue. P0, if QP0 remains nonempty, is also reinserted into the priority queue according to f for the item now at the head of its associated queue.
 4) Increment solution size by D − 1, i.e., z ← z + D − 1. If z ≤ n − D + 1, go to Step 3; otherwise, end.

Fig. 4. Steps for efficient coding using a generalized Tunstall policy

As with the binary method, this splits the most preferred node during each iteration of the loop, thus implementing the generalized Tunstall algorithm. The number of splits is ⌈(n−1)/(D−1)⌉ and each split takes O(D + log g) time amortized. The D factor comes from the D insertions into (along with one removal from) regular queues, while the log g factor comes from one amortized priority queue insertion and one removal per split node. While each split takes an item out of the priority queue, as in the example below, it does not necessarily return it to the priority queue in the same iteration. Nevertheless, every priority queue insert must be one of either a pointer to a queue that had been previously removed from the priority queue (which we amortize to the removal step) or a pointer to a queue that had previously never been in the priority queue (which can be considered an initialization). The latter steps — the only ones that we have left unaccounted — number no more than g, each taking no more than log g time, so, under the reasonable assumption that g log g ≤ O(n), these initialization steps do not dominate. (If we use a priority queue implementation with constant amortized insert time, such as a Fibonacci heap [14], this sufficient condition becomes g ≤ O(n).)

We thus have an O((1 + (log g)/D)n)-time method (O((1 + (log s)/D)n) in terms of n, s, and D, since g ≤ Ds²) using only O(n) space to store the tree and queue data. The significant space users are the output trees (O(n) space); the queues (g queues which never have more items in them total than there are tree nodes, resulting in O(n) space); and the priority queue (O(g) space).

Example 2: Consider an example with three inputs — 0, 1, and 2 — and two states — 1 and 2 — according to the Markov chain shown in Fig. 5. State 1 always goes to state 2, with input symbols of probability p0^(1) = 0.4, p1^(1) = 0.3, and p2^(1) = 0.3. For state 2, the most probable output, p0^(2) = 0.5, results in no state change, while the others, p1^(2) = 0.25 and p2^(2) = 0.25, result in a change back to state 1. Because there are 2 trees and each of 2 states has 2 distinct output probability/transition pairs, we need g = 2 × 2 × 2 = 8 queues, as well as a priority queue that can point to that many queues. Let f1,1(p) = f2,2(p) = − ln p, f1,2(p) = −0.0462158 − ln p, and f2,1(p) = 0.0462158 − ln p.

The fifth split in using this method to build an optimal coding tree is illustrated by the change from the left-hand side to the right-hand side of Fig. 5. The first two splitting steps split the two respective root nodes, the third splits the probability 0.5 node, and the fourth splits the probability 0.4 node. At this point, the priority queue contains pointers to five queues. (The order of equiprobable sibling items with the same output state does not matter for optimality, but can affect the output; for the purposes of this example, they are inserted into each queue from left to right.) In this example we denote these node queues by the conditional probability of the nodes and the tree the node is in. For example, the first queue, Q(0.5,1,2), is that associated with any node that is in the first tree and represents a transition from state 2 to state 2 (that of probability 0.5).

Before the step under examination, the queue that is pointed to by the head of the priority queue is the first-tree queue of items with conditional probability 0.3 (i.e., QP0 = Q(0.3,1,2)) and tree probability p = 0.3. Thus the node to split is that at the head of this queue, which has lowest f value f1,2(p) = 1.1578.... This item is removed from the priority queue, the head of the queue it points to is also dequeued, and the corresponding node in the first tree is given its three children. These children are queued into the appropriate queues: The most probable item — probability 0.15, conditional probability 0.5 — is queued into Q(0.5,1,2), while the two items both having probability 0.075 and conditional probability 0.25 are queued into Q(0.25,1,1). Finally, because the removed queue was not empty, it is reinserted into the priority queue according to the priority value of its head, still f1,2(0.3) = 1.1578.... No other queue needs to be reinserted since none of the new nodes entered a queue that was empty before the step. In this
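The procedure of Fig. 4 can be sketched in Python as follows, exercised on a reconstruction of Example 2's Markov chain. This is an illustration under stated assumptions: the `in_heap` bookkeeping set (used in place of the paper's explicit pointer removal and reinsertion), the function names, and the choice n = 12, which stops the loop after the fifth split, are ours:

```python
import heapq
from collections import deque
from math import log

# Example 2's chain: transitions[state] = [(symbol probability, next state)].
transitions = {1: [(0.4, 2), (0.3, 2), (0.3, 2)],
               2: [(0.5, 2), (0.25, 1), (0.25, 1)]}
# Priority functions f_{j,k} (j = tree index, k = node state).
f = {(1, 1): lambda p: -log(p), (2, 2): lambda p: -log(p),
     (1, 2): lambda p: -0.0462158 - log(p),
     (2, 1): lambda p: 0.0462158 - log(p)}

def generalized_tunstall(n, D=3, s=2):
    """Repeatedly split the minimum-f leaf across all parse trees;
    returns the splits after the roots as (tree j, state k, probability)."""
    queues = {}   # FIFO queue per t = (p', j, k); each stays decreasing
    heap, in_heap = [], set()   # pointers into `queues`, keyed by f(head)

    def push_ptr(t):
        if queues.get(t) and t not in in_heap:
            in_heap.add(t)
            heapq.heappush(heap, (f[t[1], t[2]](queues[t][0]), t))

    def enqueue(t, p):
        queues.setdefault(t, deque()).append(p)
        push_ptr(t)

    for j in (1, 2):                        # step 2: split the s roots
        for p_cond, k in transitions[j]:
            enqueue((p_cond, j, k), p_cond)
    z, splits = D * s, []
    while z <= n - D + 1:                   # steps 3 and 4
        _, t = heapq.heappop(heap)          # queue with the lowest-f head
        in_heap.discard(t)
        p_cond, j, k = t
        p = queues[t].popleft()             # the node to split
        splits.append((j, k, p))
        for p_child, k2 in transitions[k]:  # distribute its D children
            enqueue((p_child, j, k2), p * p_child)
        push_ptr(t)                         # reinsert if still nonempty
        z += D - 1
    return splits

print(generalized_tunstall(12))
```

Its output lists splits three through five: the probability-0.5 node of the second tree, the probability-0.4 node of the first tree, and then the 0.3 node at the head of Q(0.3,1,2), matching the sequence described in the text and illustrated in Fig. 5.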

[Fig. 5 diagram: the g node queues with their head priorities (f1,2(0.3) = 1.15..., f2,2(0.25) = 1.38..., f2,1(0.25) = 1.43..., f1,2(0.2) = 1.56..., f1,1(0.1) = 2.30...), the Markov chain (state 1 emitting probabilities {0.4, 0.3, 0.3} into state 2; state 2 emitting 0.5 with a self-loop and {0.25, 0.25} back to state 1), and the state 1 and state 2 output parse trees, before and after the fifth split.]

Fig. 5. Example of efficient generalized Tunstall coding for Markov chain (top-center) shown before (left) and after (right) the fifth split node. Right arrow overscore denotes right-most leaf and underscore denotes center subtree (to distinguish items); fj,k denotes priority function.

case, then, the priority queue is unchanged, and the queues and trees have the states given in the right-hand side.

REFERENCES

[1] B. Tunstall, "Synthesis of noiseless compression codes," Ph.D. dissertation, Georgia Institute of Technology, 1967.
[2] S. A. Savari, "Variable-to-fixed length codes and the conservation of entropy," IEEE Trans. Inf. Theory, vol. IT-45, no. 5, pp. 1612–1620, July 1999.
[3] I. Tabus and J. Rissanen, "Asymptotics of greedy algorithms for variable-to-fixed length coding of Markov sources," IEEE Trans. Inf. Theory, vol. IT-48, no. 7, pp. 2022–2035, July 2002.
[4] J. Abrahams, "Code and parse trees for lossless source encoding," Communications in Information and Systems, vol. 1, no. 2, pp. 113–146, Apr. 2001.
[5] I. Tabus, G. Korody, and J. Rissanen, "Text compression based on variable-to-fixed codes for Markov sources," in Proc., IEEE Data Compression Conf., Mar. 28–30, 2000, pp. 133–142.
[6] Y. Bugeaud, M. Drmota, and W. Szpankowski, "On the construction of (explicit) Khodak's code and its analysis," IEEE Trans. Inf. Theory, vol. IT-54, no. 11, pp. 5073–5086, Nov. 2008.
[7] J. Kieffer, "Fast generation of Tunstall codes," in Proc., 2007 IEEE Int. Symp. on Information Theory, June 24–29, 2007, pp. 76–80.
[8] Y. A. Reznik and A. V. Anisimov, "Enumerative encoding/decoding of variable-to-fixed-length codes for memoryless source," in Proc., Ninth Int. Symp. on Communication Theory and Applications, 2007.
[9] G. L. Khodak, "Redundancy estimates for word-based encoding of sequences produced by a Bernoulli source," in All-Union Conference on Problems of Theoretical Cybernetics, 1969, in Russian. English translation available from http://arxiv.org/abs/0712.0097.
[10] J. van Leeuwen, "On the construction of Huffman trees," in Proc. 3rd Int. Colloquium on Automata, Languages, and Programming, July 1976, pp. 382–410.
[11] M. Drmota, Y. Reznik, S. A. Savari, and W. Szpankowski, "Precise asymptotic analysis of the Tunstall code," in Proc., 2006 IEEE Int. Symp. on Information Theory, July 9–14, 2006, pp. 2334–2337.
[12] S. A. Savari and R. G. Gallager, "Generalized Tunstall codes for sources with memory," IEEE Trans. Inf. Theory, vol. IT-43, no. 2, pp. 658–668, Mar. 1997.
[13] D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, 2nd ed. Reading, MA: Addison-Wesley, 1998.
[14] M. L. Fredman and R. E. Tarjan, "Fibonacci heaps and their uses in improved network optimization algorithms," J. ACM, vol. 34, no. 3, pp. 596–615, July 1987.