Asymmetric numeral systems.
Jarek Duda
Jagiellonian University, Poland, email: [email protected]
Abstract In this paper will be presented new approach to entropy coding: family of generalizations of standard numeral systems which are optimal for encoding sequence of equiprobable symbols, into asymmetric numeral systems - optimal for freely chosen probability distributions of symbols. It has some similarities to Range Coding but instead of encoding symbol in choosing a range, we spread these ranges uniformly over the whole interval. This leads to simpler encoder - instead of using two states to define range, we need only one. This approach is very universal - we can obtain from extremely precise encoding (ABS) to extremely fast with possibility to additionally encrypt the data (ANS). This encryption uses the key to initialize random number generator, which is used to calculate the coding tables. Such preinitialized encryption has additional advantage: is resistant to brute force attack - to check a key we have to make whole initialization. There will be also presented application for new approach to error correction: after an error in each step we have chosen probability to observe that something was wrong. We can get near Shannon’s limit for any noise level this way with expected linear time of correction. arXiv:0902.0271v5 [cs.IT] 21 May 2009 1 Introduction
In practice there are used two approaches for entropy coding nowadays: building binary tree (Huffman coding [1]) and arithmetic/range coding ([2],[3]). The first one approximates probabilities of symbols with powers of 2 - isn’t precise. Arithmetic coding is precise. It encodes symbol in choosing one of large ranges of length proportional to assumed probability distribution (q). Intuitively, by analogue to standard numeral systems - the symbol is encoded on the most important position. To define the current range, we need to use two numbers (states).
We will construct precise encoding that uses only one state. It will be done by distributing symbols uniformly instead of in ranges - intuitively: place information on the least important position. Standard numeral systems are optimal for encoding streams of equiprobable digits. Asymmetric numeral systems ([4]) is
1 1 INTRODUCTION 2 natural generalization into other, freely chosen probability distributions. If we choose uniform probability, with proper initialization we get standard numeral system.
For the binary case: Asymmetric Binary System (ABS) there are found practi- cal formulas, which gives extremely precise entropy encoder for which probability distribution of symbols can freely change. It was show ([5]) that it can be practical alternative for arithmetic coding. For the general case: Asymmetric Numeral Systems (ANS) instead of using formulas, we initially use pseudorandom number generator to distribute symbols with assumed statistics. The precision can be still very high, but disadvantage is that when the probability distribution changes, we have to reinitialize. The advantage is that we encode/decode a few bits in one use of the table - we get compression rates like in arithmetic coding and transfers like in Huffman coding. On [6] is available demonstration. Another advantage is that we can use a key as the initialization of the random number generator, additionally encrypting the data. Such encryption is extremely unpredictable - uses random coding tables and hidden random variable to choose behavior and so the current length of block. This approach is faster than standard block ciphers and is much more resistant against brute force attacks.
In the last section will be presented new approach to error correction, which is able to get near Shannon’s limit for any noise level and is still practical - has expected linear (or N lg(N)) correction time. It can be imagined as path tracking - we know starting and ending position and we want to walk between them using the proper path. When we use this path everything is fine, but when we lost it, in each step we have selected probability of becoming conscious of this fact. Now we can go back and try to make some correction. If this probability is chosen higher than some threshold corresponding to Shannon’s limit, the number of corrections we should try doesn’t longer grow exponentially and so we can easily verify that it was the proper correction. Intuitively we use short blocks, but we connect their redundancy. This connec- tion practically allows to ’transfer’ surpluses of redundancy to help with large local error concentrations.
1.0.1 Very brief introduction to entropy coding In the possibility of choosing one of 2n choices is stored n bits of information. Assume now that we can store information in choosing a sequence of bits of length n, but such that the probability of ’1’ is given (p). We can evaluate the number of such 2 GENERAL CONCEPT 3
n! sequences using Stirling’s formula limn→∞ √ n n = 1 : 2πn( e )
n n! nn+1/2en = ≈ (2π)−1/2 = pn (pn)!(˜pn)! (pn)pn+1/2(˜pn)pn˜ +1/2en = (2πnpp˜)−1/2p−pnp˜−pn˜ = (2πnpp˜)−1/22−n(p lg p+˜p lgp ˜) wherep ˜ = 1 − p. So while encoding in such sequences, we can store at average
h(p) := −p lg(p) − (1 − p) lg(1 − p) bits of information/symbol (1)
That’s well known formula for Shannon’s entropy. In practice we usually don’t know the probability distribution, but we are approximating it using some statistical analysis. The nearer it is to the real probability distribution, the better compression rates we get. The final step is the entropy coder, which uses found statistics to encode the message.
Even if we would know the probability distribution perfectly, the expected com- pression rate would be usually a bit larger than Shannon’s entropy. One of the reason is that encoded message usually contains some additional correlations. The second source of such entropy increase is that entropy coders are constructed for some discrete set of probability distributions, so they have to approximate the orig- inal one. In an event of probability 1/n is stored lg(n) bits, so generally in event of prob- ability q, should be stored lg(1/q) bits. This can be seen in Shannon’s formula: it’s average of stored bits with probabilities of events as weights. So if we use a coder which encodes perfectly (qs) symbol distribution to encode P (ps) symbol sequence, we would get at average s ps lg(1/qs) bits per symbol. The difference between this value and the optimal one is called Kullback - Leiber distance:
2! 2 X ps X −ps qs 1 qs X (ps − qs) ∆H = p lg ≈ 1 − − 1 − ≈ 0.72 s q ln(2) p 2 p p s s s s s s s (2) We have used second order Taylor’s expansion of logarithm around 1. The first term vanishes and the second allows to quickly estimate how important is that entropy coder is precise.
2 General concept
We would like to encode an uncorrelated sequence of symbols of known probability distribution into as short as possible sequence of bits. For simplicity we will assume that the probability distribution doesn’t change in time, but it can be naturally 2 GENERAL CONCEPT 4 generalized to varying distributions. The encoder will receive succeeding symbols and transform them into succeeding bits.
An symbol(event) of probability p contains lg(1/p) bits of information - it doesn’t have to be a natural number. If we just assign to each symbol a sequence of bits like in Huffman coding, we approximate probabilities with powers of 2. If we want to get closer to the optimal compression rates, we have to be more precise - the encoder have to be more complicated - use not only the current symbol, but also relate for example to the previous ones. The encoder should have some state in which is stored unnatural number of bits of information. This state in arithmetic coder are two numbers describing the current range.
The state of presented encoder will be one natural number: x ∈ N. For this sub- section we will forget about sending bits to output and focus on encoding symbols. So the state x in given moment is a large natural number which encodes all already processed symbols. We could just encode it as a binary number after processing the whole sequence, but because of its size it’s completely impractical. In section 4 it will be shown that we can transfer the youngest bits of x to assure that it stays in some fixed range during the whole process. For now we are looking for a rule of changing the state while processing a symbol s:
encoding −→ (s, x) x0 (3) ←− decoding
So our encoder starts with for example x = 0 and uses above rule on succeeding symbols. These rules are bijective, so that we can uniquely reverse whole process - decode the final state back into initial sequence of symbols in reversed order. In given moment in x is stored some unnatural number of bits of information. While writing it in binary system, we would round this value up. To avoid such approximations, we will use convention that x is the possibility of choosing one of {0, 1, .., x − 1} numbers, so x contains exactly lg(x) bits of information.
For assumed probability distribution of n symbols, we will somehow split the set {0, 1, .., x − 1} into n separate subsets - of sizes x0, .., xn−1 ∈ N, such that Pn−1 s=0 xs = x. We can treat the possibility of choosing one of x numbers as the possibility of choosing the number of subset(s) and then choosing one of xs numbers. xs So with probability qs = x we would choose s-th subset. We can enumerate elements of s-th subset from 0 to xs − 1 in the same order as in the original enumeration of {0, 1, .., x − 1}. Summarizing: we’ve exchanged the possibility of choosing one of x numbers (lg(x) bits) into the possibility of choosing a pair: a symbol s (lg(1/qs) bits) with 2 GENERAL CONCEPT 5
known probability (qs) and the possibility of choosing one of xs numbers (lg(xs) = lg(x) − lg(qs) bits). This (x (s, xs)) will be the bijective coding we are looking for. We will now describe how to split the range. In arithmetic coding approach (Range Coding), we would divide {0, .., x−1} into ranges. In ANS we will distribute these subsets uniformly.
We can describe this split using distributing function D1 : N → {0, .., n − 1}: n−1 [ {0, .., x − 1} = {y ∈ {0, .., x − 1} : D1(y) = s} s=0 We can now enumerate numbers in these subsets by counting how many elements from the same subset was there before:
xs := #{y ∈ {0, 1, .., x − 1},D1(y) = s} D2(x) := xD1(x) (4) getting bijective decoding function(D) and it’s inverse coding function (C):
D(x) := (D1(x),D2(x)) = (s, xs) C(s, xs) := x. Assume that our sequence consists of n ∈ N symbols with given probability distribution (qs)s=0,..,n−1 (∀s=0,..,n−1 qs > 0). We have to construct a distributing function and coding/decoding function for this distribution: such that
∀s,x xs is approximately x · qs (5) We will now show informally how essential above condition is. In section 3 and 5 will be shown two ways of making such construction.
P Statistically in a symbol is encoded H(q) := − s qs lg qs bits. ANS uses lg(x) − lg(xs) = lg(x/xs) bits of information to encode a symbol s from xs state. Analogously to (2) using second Taylor’s expansion of logarithm (around qs), we can estimate that our encoder needs at average: 2 X xs X xs/x − qs (xs/x − qs) − q lg ≈ − q lg(q ) + − = s x s s q ln(2) 2q2 ln(2) s s s s 2 1 − 1 X (xs/x − qs) = H(q) + + bits/symbol. q ln(2) 2q ln(2) s s s We could average
1 X qs 1 X qs 2 (x /q − x)2 = (x /q − C(s, x )) (6) 2 ln(2) x2 s s ln(4) x2 s s s s s over all possible xs to estimate how many bits/symbols we are wasting. We will do it in section 6. 3 ASYMMETRIC BINARY SYSTEM (ABS) 6
3 Asymmetric Binary System (ABS)
It occurs that in the binary case we can find simple explicit formula for cod- ing/decoding functions.
We have now two symbols: ”0” and ”1”. Denote q := q1, soq ˜ := 1 − q = q0. To get xs ≈ x · qs, we can for example take
x1 := dxqe (or alternatively x1 := bxqc) (7)
x0 = x − x1 = x − dxqe (or x0 = x − bxqc) (8)
Now using (4): D1(x) = 1 ⇔ there is a jump of dxqe after it: s := d(x + 1)qe − dxqe (or s := b(x + 1)qc − bxqc) (9)
We’ve just defined decoding function: D(x) = (s, xs).
For example for q = 0.3:
x 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
x0 0 1 2 3 4 5 6 7 8 9 10 11 12 x1 0 1 2 3 4 5
We will find coding function now: we have s and xs and want to find x. Denote r := dxqe − xq ∈ [0, 1) s := d(x + 1)qe − dxqe = d(x + 1)q − dxqee = d(x + 1)q − r − xqe = dq − re
s = 1 ⇔ r < q
• s = 1: x1 = dxqe = xq + r j k x1−r x1 x = q = q because it’s natural number and 0 ≤ r < q. • s = 0: q ≤ r < 1 soq ˜ ≥ 1 − r > 0 x0 = x − dxqe = x − xq − r = xq˜ − r x + r x + 1 1 − r lx + 1m x = 0 = 0 − = 0 − 1 q˜ q˜ q˜ q˜
Finally coding: l m j k x+1 − 1 if s = 0 x if s = 0 C(s, x) = 1−q or = 1−q (10) j x k l x+1 m q if s = 1 q − 1 if s = 1
For q = 1/2 it’s usual binary system (with switched digits). 4 STREAM CODING/DECODING 7
4 Stream coding/decoding
We can encode now into a large natural numbers (x). We would like to use ABS/ANS to encode data stream - into potentially infinite sequence of digits(bits) with expected uniform distribution. To do it we can sometimes transfer a part of information from x into a digit from a standard numeral system to enforce x to stay in some fixed range (I).
4.1 Algorithm Let us choose that the data stream will be encoded as {0, .., b − 1} digits - in standard numeral system of base b ≥ 2. In practice we should mainly use the binary system (b = 2), but thanks of this general approach, we can for example use 8 b = 2 to transfer whole byte at once. Symbols contain correspondingly lg(1/qs) bits of information. When they cumulate into lg b bits, we will transfer full digit to/from output, moving x back to I (bit transfer).
Observe that taking interval in form (l ∈ N): I := {l, l + 1, .., bl − 1} (11) for any x ∈ N we have exactly one of three cases: • x ∈ I or
k • x > bl − 1, then ∃!k∈N bx/b c ∈ I or k k−1 • x < l, then ∀(di)∈{0,..,b−1}N ∃!k∈N xb + d1b + .. + dk ∈ I. We will call such intervals b-unique: starting from any natural number x, after eventual a few reductions (x → bx/bc) or placing a few youngest digits in x (x → xb + dt) we would finally get into I in unique way.
For some interval(I), define [ Is = {x : C(s, x) ∈ I}, so that I = C(s, Is). (12) s Define: Stream decoding: Stream coding(s): {(s,x)=D(x); {while(x∈/ Is) use s; (e.g. to generate symbol) {put mod(x,b) to output; x=bx/bc} while(x∈/ I) x=C(s,x) x=xb+’digit from input’ } } 4 STREAM CODING/DECODING 8
Figure 1: Stream coding/decoding
We need that above functions are ambiguous reverses of each other. Observe that we would have it iff Is for s = 0, .., n − 1 and I are b-unique:
I = {l, .., lb − 1} Is = {ls, .., lsb − 1} (13) for some l, ls ∈ N. P P We have: s ls(b − 1) = s #Is = #I = l(b − 1). Remembering that C(s, x) ≈ x/qs, we finally have: X ls ≈ lqs ls = l. (14) s We will look at the behavior of lg x while stream coding s now:
lg x → ≈ lg x + lg(1/qs) (modulo lg(b)) (15)
We have three possible sources of random behavior of x:
• we choose one of symbol (behavior) in statistical(random) way,
lg qs • usually lg b are irrational,
• C(s, x) is near but not exactly x/qs. It suggests that lg x should cover uniformly possible space, what agrees with statistical simulations. That means that the probability of visiting given state x should be approximately proportional to 1/x. We will focus on it in section 6. 4 STREAM CODING/DECODING 9
4.2 Analysis of a single step Let’s concentrate on a single stream coding step. Choose some s ∈ {0, .., n − 1}. Among l(b − 1) states of I = {l, .., lb − 1} we have ls(b − 1) appearances of symbol s. While choosing ls we are approximating probabilities. So to simplify further analysis, let us assume for the rest of the paper: l q = s (16) s l
Let us introduce for qs fractional and integer part:
+ cs := {logb(qs)} ∈ [0, 1) ks := −blogb(qs)c ∈ N (17)
ks cs logb(1/qs) = ks − cs 1 ≤ qsb = b < b (18) where {z} = z − bzc is the fractional part. Now if we introduce new variable: x y := log (19) b l
≈ we will have that one coding step is approximately y → {y − cs}.
Figure 2: Example of stream coding/decoding step for b = 2, k = 3, ls = 13, ks−1 l = 9 · 4 + 3 · 8 + 6 = 66, qs = 13/66, x = 19, b x + 3 = 79 = 66 + 2 + 2 · 4 + 3.
The situation looks like in fig. 2:
• The bit transfer makes that states denoted by circles will behave just like the state denoted by ellipse on their left. The difference between them is in transferred digits. 4 STREAM CODING/DECODING 10
• The number of transferred digits has maximally two possibilities differentiating by 1: ks − 1 and ks
ks ks is the only number such that b(lb − 1)/b c ∈ Is
−ks When qs is near some integer power of b (qs ≈ b ), we can have a situation that we always transfer ks digits, but it can be treated as a special case of the first one (Xs = l).
• The states denoted by ellipses are multiplicities of correspondingly bks−1 or bks . So if l is not a natural power of b, there can be some states before the first multiplicity of bks−1. They correspond to the last multiplicity of bks . Let’s assume for simplicity, that
L := logb(l) ∈ N (20) so the first state in the picture is ellipse. With this assumption we can have special case from the previous point, that −ks always k digits are transferred, if and only if qs = b .
We assume also that we have some appearances of each symbol, so L ≥ ks. • The states before the step (the top of the picture) can be divided into two ranges - on the left or right of some boundary value
ks−1 Xs := max{x : C(s, bx/b c) < lb} = min{D2(x): D1(x) = s, x ≥ l}
On the left of this value we transfer k − 1 digits (can be degenerated), on the right we transfer k digits.
Xs−l lb−Xs From ls(b − 1) ellipses, bks−1 are on the left, bks are on the right of Xs:
ks ls(b − 1)b = (Xs − l)b + lb − Xs = (b − 1)Xs
We got exact formula:
ks ks cs l ≤ Xs = lsb = lqsb = lb < lb (21)
Intuitively the position of this boundary corresponds to the inequality
−ks −ks+1 b ≤ qs < b .
• While the situation on the top of the picture (before coding step) was fully determined by l and qs, the distribution on the bottom (after) has full freedom: is made by choosing the distributing function. l l This time we have the boundary value: C(d k −1 e) ≈ k −1 . b s b s qs 5 ASYMMETRIC NUMERAL SYSTEMS(ANS) 11
Finally the change of state after one step of stream coding Cs : I → I is: x x C s, b bks−1 c for x < Xs ks−[x 5 Asymmetric Numeral Systems(ANS) In the general case: encoding a sequence of symbols with probability distribution 0 < q0, .., qn−1 < 1 for some n > 2, we could divide the selection of symbol into a few binary choices and just use ABS. In this section we will see that we can also encode such symbols straightforward. Unfortunately I couldn’t find practical explicit formulas for n > 2, but we can calculate coding/decoding functions while the initialization, making processing of the data stream extremely fast. The problem is that we rather cannot table all possible probability distri- butions - we have to initialize for a few of them and eventually reinitialize sometimes. This time we fix the range we are working on (I = {l, .., bl − 1}), so in fact we are interested at stream coding/decoding functions only on this set. They are determined by distribution of symbols: (b − 1)ls appearances of symbol s. This way we are approximating the probabilities. As it was already said - we will assume: ls 0 0 1 qs = l . The exact probability will be denoted from now qs. So we have |qs −qs| ≈ 2l . 5.1 Precise coder We will now construct precise coder in similar way as for the binary case. Denote N := { i : i ∈ +}. s qs N They looks to be a good approximation of positions of symbols in the distributing function. We have only to move them into some positions of natural numbers. Intuition suggests that to choose symbols, we should take succeedingly the smallest element which hasn’t been chosen yet from these sets. P Observe that #(Ns ∩ [0, x]) = bxqsc, but sbxqsc ≤ x. So if we would use just proposed algorithm, while choosing a symbol for given x, at least bxqsc appearances of each symbol have already appeared: ∀s xs ≥ bxqsc. For x being a natural multiplicity of l we get equalities instead. To generally bound xs from above, observe that because the fractional parts of xqs sums to a P natural number, we have bxqsc ≥ x − n + 1. P s Finally, because s xs = x, we get: bxqsc ≤ xs ≤ bxqsc + n − 1 ⇒ xs − xqs ∈ (−1, n − 1] (23) 5 ASYMMETRIC NUMERAL SYSTEMS(ANS) 12 Numerical simulations suggest that there are probability distributions for which we cannot improve this pessimistic evaluation, but in practice |xs − xqs| is usually smaller than 1. To implement this algorithm, in each step we have to find the smallest of n numbers. Assume we have implemented some priority queue, for example using a heap. Besides initialization it has two instructions: put((y, s)) inserts (y, s) pair into the queue, getmin removes and returns pair which is the smallest with (y, s) ≤ (y0, s0) ⇔ y ≤ y0 relation. Precise initialization: For s = 0 to n − 1 do {put( (1/qs, s) ); xs = ls}; For x= l to bl − 1 do {(y, s)=getmin; put((y + 1/qs, s)); D[x]=(s,xs) or C[s,xs]=x xs++} 5.2 Selfcorrecting diffusion (ScD) We will focus now on a bit less precise, but faster statistical initialization method: fill the table of size (b − 1)l with proper number of appearances of symbols and for succeeding x take symbol of random number from this table, reducing the table. So on the beginning it will behave like a diffusion, but it will correct itself while approaching the end. Another advantage of this approach is that after fixing (ls), we still have huge (exponential in #I) number of possible coding functions - we can choose one using some key, additionally encrypting the data. Initialization: (b−1)l0 (b−1)l1 (b−1)ln−1 z }| { z }| { z }| { m=(b-1)l; symbols =(0, 0, .., 0, 1, 1, .., 1, .., n − 1, .., n − 1); For s = 0 to n − 1 do xs = ls; For x= l to bl − 1 do {i=random natural number from 1 to m; s=symbols[i]; symbols[i]=symbols[m]; m--; D[x]=(s,xs) or C[s,xs]=x xs++} Where we can use practically any deterministic pseudorandom number genera- tor, like Mersenne Twister([7]) and use eventual key for its initialization. It will be precise on the beginning and the end but generally impreciseness will 5 ASYMMETRIC NUMERAL SYSTEMS(ANS) 13 be larger. We will analyze it now. While selecting some symbol s, we can divide symbols into two groups: this symbol and the rest. So we can restrict to simplified model: Model: We have N distinguishable numbers: L copies of ’1’ and N − L copies of ’0’. What is the probability that if we choose M of them, there will be K of ’1’?