Asymmetric numeral systems.

Jarek Duda

Jagiellonian University, Poland, email: [email protected]

Abstract In this paper we present a new approach to entropy coding: a family of generalizations of standard numeral systems, which are optimal for encoding sequences of equiprobable symbols, into asymmetric numeral systems - optimal for freely chosen probability distributions of symbols. It has some similarities to Range Coding, but instead of encoding a symbol by choosing a range, we spread these ranges uniformly over the whole interval. This leads to a simpler encoder - instead of using two states to define a range, we need only one. This approach is very universal - we can obtain anything from extremely precise encoding (ABS) to extremely fast encoding with the possibility of additionally encrypting the data (ANS). This encryption uses the key to initialize a random number generator, which is used to calculate the coding tables. Such preinitialized encryption has an additional advantage: it is resistant to brute force attacks - to check a key we have to perform the whole initialization. We also present an application of the new approach to error correction: after an error, in each step we have a chosen probability of observing that something was wrong. This way we can get near Shannon’s limit for any noise level, with expected linear correction time.

arXiv:0902.0271v5 [cs.IT] 21 May 2009

1 Introduction

Two approaches to entropy coding are used in practice nowadays: building a binary tree (Huffman coding [1]) and arithmetic/range coding ([2],[3]). The first one approximates the probabilities of symbols with powers of 2 - it isn’t precise. The second one is precise. It encodes a symbol by choosing one of large ranges of length proportional to the assumed probability distribution (q). Intuitively, by analogy to standard numeral systems - the symbol is encoded on the most significant position. To define the current range, we need to use two numbers (states).

We will construct a precise encoding that uses only one state. It will be done by distributing symbols uniformly instead of in ranges - intuitively: placing information on the least significant position. Standard numeral systems are optimal for encoding streams of equiprobable digits. Asymmetric numeral systems ([4]) are a natural generalization to other, freely chosen probability distributions. If we choose a uniform probability distribution, with proper initialization we get a standard numeral system.

For the binary case, the Asymmetric Binary System (ABS), practical formulas have been found, which give an extremely precise entropy encoder for which the probability distribution of symbols can change freely. It was shown ([5]) that it can be a practical alternative to arithmetic coding. For the general case, Asymmetric Numeral Systems (ANS), instead of using formulas we initially use a pseudorandom number generator to distribute symbols with the assumed statistics. The precision can still be very high, but the disadvantage is that when the probability distribution changes, we have to reinitialize. The advantage is that we encode/decode a few bits in one use of the table - we get compression rates like in arithmetic coding and throughput like in Huffman coding. A demonstration is available at [6]. Another advantage is that we can use a key to initialize the random number generator, additionally encrypting the data. Such encryption is extremely unpredictable - it uses random coding tables and a hidden random variable to choose the behavior and so the current length of the block. This approach is faster than standard block ciphers and is much more resistant to brute force attacks.

The last section presents a new approach to error correction, which is able to get near Shannon’s limit for any noise level and is still practical - it has expected linear (or N lg(N)) correction time. It can be imagined as path tracking - we know the starting and ending positions and we want to walk between them using the proper path. While we stay on this path everything is fine, but when we lose it, in each step we have a selected probability of becoming aware of this fact. Then we can go back and try to make some correction. If this probability is chosen higher than some threshold corresponding to Shannon’s limit, the number of corrections we should try no longer grows exponentially, and so we can easily verify that it was the proper correction. Intuitively, we use short blocks, but we connect their redundancy. This connection practically allows us to ’transfer’ surpluses of redundancy to help with large local error concentrations.

1.0.1 Very brief introduction to entropy coding

In the possibility of choosing one of $2^n$ choices, n bits of information are stored. Assume now that we can store information by choosing a sequence of bits of length n, but such that the probability of ’1’ is given (p). We can evaluate the number of such sequences using Stirling’s formula $\lim_{n\to\infty} \frac{n!}{\sqrt{2\pi n}\,(n/e)^n} = 1$:

$$\binom{n}{pn} = \frac{n!}{(pn)!\,(\tilde{p}n)!} \approx (2\pi)^{-1/2}\, \frac{n^{n+1/2}\, e^{-n}}{(pn)^{pn+1/2}\,(\tilde{p}n)^{\tilde{p}n+1/2}\, e^{-n}} = (2\pi n p \tilde{p})^{-1/2}\, p^{-pn}\, \tilde{p}^{-\tilde{p}n} = (2\pi n p \tilde{p})^{-1/2}\, 2^{-n(p \lg p + \tilde{p} \lg \tilde{p})}$$

where $\tilde{p} = 1 - p$. So while encoding in such sequences, we can store on average

$$h(p) := -p \lg(p) - (1-p)\lg(1-p) \quad \text{bits of information/symbol} \qquad (1)$$

That’s well known formula for Shannon’s entropy. In practice we usually don’t know the probability distribution, but we are approximating it using some statistical analysis. The nearer it is to the real probability distribution, the better compression rates we get. The final step is the entropy coder, which uses found statistics to encode the message.
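As a quick numerical check of formula (1), the sketch below compares lg of the binomial coefficient, divided by n, with h(p). The helper names `h` and `bits_per_symbol` are illustrative and not taken from the paper, and pn is assumed to be rounded to the nearest integer.

```python
import math

def h(p: float) -> float:
    """Binary Shannon entropy in bits per symbol, formula (1)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bits_per_symbol(n: int, p: float) -> float:
    """lg(binomial(n, pn)) / n: bits stored per position when we choose
    one length-n bit sequence containing exactly round(p * n) ones."""
    k = round(p * n)
    return math.log2(math.comb(n, k)) / n

# The per-symbol capacity approaches h(p) from below as n grows.
for n in (10, 100, 1000, 10000):
    print(n, bits_per_symbol(n, 0.3), h(0.3))
```

The remaining gap is the $(2\pi n p \tilde{p})^{-1/2}$ factor from the Stirling estimate, which contributes only O(lg(n)/n) bits per symbol.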

Even if we knew the probability distribution perfectly, the expected compression rate would usually be a bit larger than Shannon’s entropy. One reason is that the encoded message usually contains some additional correlations. The second source of such entropy increase is that entropy coders are constructed for some discrete set of probability distributions, so they have to approximate the original one. An event of probability 1/n stores lg(n) bits, so generally an event of probability q should store lg(1/q) bits. This can be seen in Shannon’s formula: it is the average of stored bits with the probabilities of events as weights. So if we use a coder which perfectly encodes the $(q_s)$ symbol distribution to encode a $(p_s)$ symbol sequence, we would get on average $\sum_s p_s \lg(1/q_s)$ bits per symbol. The difference between this value and the optimal one is called the Kullback-Leibler distance:

$$\Delta H = \sum_s p_s \lg\frac{p_s}{q_s} \approx \frac{1}{\ln(2)} \sum_s p_s \left( \left(1 - \frac{q_s}{p_s}\right) + \frac{1}{2}\left(1 - \frac{q_s}{p_s}\right)^2 \right) \approx 0.72 \sum_s \frac{(p_s - q_s)^2}{p_s} \qquad (2)$$

We have used the second order Taylor expansion of the logarithm around 1. The first term vanishes and the second allows us to quickly estimate how important the precision of the entropy coder is.
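To get a feeling for estimate (2), the following sketch compares the exact Kullback-Leibler distance with its second order approximation. The function names and the two example distributions are assumptions chosen only for illustration.

```python
import math

def kl_bits(p, q):
    """Exact Kullback-Leibler distance sum_s p_s lg(p_s / q_s), in bits."""
    return sum(ps * math.log2(ps / qs) for ps, qs in zip(p, q) if ps > 0)

def kl_second_order(p, q):
    """Second order estimate from (2): (1 / (2 ln 2)) sum_s (p_s - q_s)^2 / p_s."""
    return sum((ps - qs) ** 2 / ps for ps, qs in zip(p, q)) / (2 * math.log(2))

# Hypothetical example: true distribution p, coder tuned for a nearby q.
p = [0.5, 0.3, 0.2]
q = [0.52, 0.29, 0.19]
print(kl_bits(p, q), kl_second_order(p, q))
```

The constant 0.72 in (2) is 1/(2 ln 2); the two values agree closely as long as q stays near p.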

2 General concept

We would like to encode an uncorrelated sequence of symbols with a known probability distribution into as short a sequence of bits as possible. For simplicity we will assume that the probability distribution doesn’t change in time, but it can be naturally generalized to varying distributions. The encoder will receive succeeding symbols and transform them into succeeding bits.

A symbol (event) of probability p contains lg(1/p) bits of information - it doesn’t have to be a natural number. If we just assign to each symbol a sequence of bits like in Huffman coding, we approximate probabilities with powers of 2. If we want to get closer to the optimal compression rates, we have to be more precise - the encoder has to be more complicated - it has to use not only the current symbol, but also relate, for example, to the previous ones. The encoder should have some state which stores a non-integer number of bits of information. In an arithmetic coder this state consists of two numbers describing the current range.

The state of the presented encoder will be one natural number: $x \in \mathbb{N}$. For this subsection we will forget about sending bits to the output and focus on encoding symbols. So the state x at a given moment is a large natural number which encodes all already processed symbols. We could just encode it as a binary number after processing the whole sequence, but because of its size this is completely impractical. In section 4 it will be shown that we can transfer the least significant bits of x to ensure that it stays in some fixed range during the whole process. For now we are looking for a rule for changing the state while processing a symbol s:

$$(s, x) \;\underset{\text{decoding}}{\overset{\text{encoding}}{\rightleftarrows}}\; x' \qquad (3)$$

So our encoder starts with, for example, x = 0 and uses the above rule on succeeding symbols. These rules are bijective, so we can uniquely reverse the whole process - decode the final state back into the initial sequence of symbols, in reversed order. At a given moment x stores some non-integer number of bits of information. While writing it in the binary system, we would round this value up. To avoid such approximations, we will use the convention that x is the possibility of choosing one of the numbers {0, 1, .., x − 1}, so x contains exactly lg(x) bits of information.

For the assumed probability distribution of n symbols, we will somehow split the set {0, 1, .., x − 1} into n separate subsets of sizes $x_0, .., x_{n-1} \in \mathbb{N}$, such that $\sum_{s=0}^{n-1} x_s = x$. We can treat the possibility of choosing one of x numbers as the possibility of choosing the number of a subset (s) and then choosing one of $x_s$ numbers. So with probability $q_s = x_s/x$ we would choose the s-th subset. We can enumerate the elements of the s-th subset from 0 to $x_s - 1$ in the same order as in the original enumeration of {0, 1, .., x − 1}. Summarizing: we have exchanged the possibility of choosing one of x numbers (lg(x) bits) for the possibility of choosing a pair: a symbol s ($\lg(1/q_s)$ bits) with known probability ($q_s$) and the possibility of choosing one of $x_s$ numbers ($\lg(x_s) = \lg(x) - \lg(1/q_s)$ bits). This ($x \leftrightarrow (s, x_s)$) will be the bijective coding we are looking for.

We will now describe how to split the range. In the arithmetic coding approach (Range Coding), we would divide {0, .., x − 1} into ranges. In ANS we will distribute these subsets uniformly.

We can describe this split using a distributing function $D_1 : \mathbb{N} \to \{0, .., n-1\}$:

$$\{0, .., x-1\} = \bigcup_{s=0}^{n-1} \{y \in \{0, .., x-1\} : D_1(y) = s\}$$

We can now enumerate the numbers in these subsets by counting how many elements of the same subset came before:

$$x_s := \#\{y \in \{0, 1, .., x-1\} : D_1(y) = s\} \qquad D_2(x) := x_{D_1(x)} \qquad (4)$$

getting a bijective decoding function (D) and its inverse, the coding function (C):

$$D(x) := (D_1(x), D_2(x)) = (s, x_s) \qquad C(s, x_s) := x.$$

Assume that our sequence consists of $n \in \mathbb{N}$ symbols with a given probability distribution $(q_s)_{s=0,..,n-1}$ ($\forall_{s=0,..,n-1}\; q_s > 0$). We have to construct a distributing function and coding/decoding functions for this distribution such that

$$\forall_{s,x}\quad x_s \text{ is approximately } x \cdot q_s \qquad (5)$$

We will now show informally how essential the above condition is. Sections 3 and 5 show two ways of making such a construction.
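Definition (4) can be made concrete with a small sketch: given any distributing function D1, the tables for D and C follow by counting. The name `build_tables` and the choice D1(y) = y mod 10 are assumptions made only for illustration; this uniform choice reproduces the ordinary decimal system.

```python
def build_tables(d1, limit):
    """Enumerate D(x) = (s, x_s) for x < limit, where x_s counts how many
    earlier positions carry the same symbol (definition (4))."""
    counts = {}
    decode = []          # decode[x] = (s, x_s)
    encode = {}          # encode[(s, x_s)] = x, the inverse of decode
    for x in range(limit):
        s = d1(x)
        xs = counts.get(s, 0)
        decode.append((s, xs))
        encode[(s, xs)] = x
        counts[s] = xs + 1
    return decode, encode

# Uniform distributing function over 10 symbols.
decode, encode = build_tables(lambda y: y % 10, 2000)

# Encode the symbols 3, 1, 4 starting from state x = 1.
x = 1
for s in (3, 1, 4):
    x = encode[(s, x)]

# Decoding returns the symbols in reversed order.
recovered = []
state = x
while state != 1:
    s, state = decode[state]
    recovered.append(s)
recovered.reverse()
print(x, recovered)
```

Starting from x = 1 rather than x = 0 is a choice of this sketch: it keeps leading zero symbols distinguishable.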

Statistically, a symbol carries $H(q) := -\sum_s q_s \lg q_s$ bits. ANS uses $\lg(x) - \lg(x_s) = \lg(x/x_s)$ bits of information to encode a symbol s from state $x_s$. Analogously to (2), using the second order Taylor expansion of the logarithm (around $q_s$), we can estimate that our encoder needs on average:

$$-\sum_s q_s \lg\left(\frac{x_s}{x}\right) \approx -\sum_s q_s \left( \lg(q_s) + \frac{x_s/x - q_s}{q_s \ln(2)} - \frac{(x_s/x - q_s)^2}{2 q_s^2 \ln(2)} \right) = H(q) - \frac{1-1}{\ln(2)} + \sum_s \frac{(x_s/x - q_s)^2}{2 q_s \ln(2)} \quad \text{bits/symbol.}$$

We could average

$$\frac{1}{2\ln(2)} \sum_s \frac{q_s}{x^2} (x_s/q_s - x)^2 = \frac{1}{\ln(4)} \sum_s \frac{q_s}{x^2} (x_s/q_s - C(s, x_s))^2 \qquad (6)$$

over all possible $x_s$ to estimate how many bits/symbol we are wasting. We will do it in section 6.

3 Asymmetric Binary System (ABS)

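It turns out that in the binary case we can find a simple explicit formula for the coding/decoding functions.

We now have two symbols: ”0” and ”1”. Denote $q := q_1$, so $\tilde{q} := 1 - q = q_0$. To get $x_s \approx x \cdot q_s$, we can for example take

$$x_1 := \lceil xq \rceil \quad (\text{or alternatively } x_1 := \lfloor xq \rfloor) \qquad (7)$$

$$x_0 = x - x_1 = x - \lceil xq \rceil \quad (\text{or } x_0 = x - \lfloor xq \rfloor) \qquad (8)$$

Now using (4): $D_1(x) = 1 \Leftrightarrow$ there is a jump of $\lceil xq \rceil$ after it:

$$s := \lceil (x+1)q \rceil - \lceil xq \rceil \quad (\text{or } s := \lfloor (x+1)q \rfloor - \lfloor xq \rfloor) \qquad (9)$$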

We’ve just defined decoding function: D(x) = (s, xs).

For example for q = 0.3:

x    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
x0   -  0  1  -  2  3  -  4  5  6  -  7  8  -  9 10  - 11 12
x1   0  -  -  1  -  -  2  -  -  -  3  -  -  4  -  -  5  -  -

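We will now find the coding function: we have s and $x_s$ and want to find x. Denote $r := \lceil xq \rceil - xq \in [0, 1)$:

$$s := \lceil (x+1)q \rceil - \lceil xq \rceil = \lceil (x+1)q - \lceil xq \rceil \rceil = \lceil (x+1)q - r - xq \rceil = \lceil q - r \rceil$$

$$s = 1 \Leftrightarrow r < q$$

• s = 1: $x_1 = \lceil xq \rceil = xq + r$, so $x = \frac{x_1 - r}{q} = \left\lfloor \frac{x_1}{q} \right\rfloor$, because it is a natural number and $0 \le r < q$.

• s = 0: $q \le r < 1$, so $\tilde{q} \ge 1 - r > 0$ and $x_0 = x - \lceil xq \rceil = x - xq - r = x\tilde{q} - r$, so $x = \frac{x_0 + r}{\tilde{q}} = \frac{x_0 + 1}{\tilde{q}} - \frac{1 - r}{\tilde{q}} = \left\lceil \frac{x_0 + 1}{\tilde{q}} \right\rceil - 1$.

Finally, coding:

$$C(s, x) = \begin{cases} \left\lceil \frac{x+1}{1-q} \right\rceil - 1 & \text{if } s = 0 \\ \left\lfloor \frac{x}{q} \right\rfloor & \text{if } s = 1 \end{cases} \qquad \text{or} \qquad C(s, x) = \begin{cases} \left\lfloor \frac{x}{1-q} \right\rfloor & \text{if } s = 0 \\ \left\lceil \frac{x+1}{q} \right\rceil - 1 & \text{if } s = 1 \end{cases} \qquad (10)$$

For q = 1/2 it is the usual binary system (with switched digits).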
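The ceiling variant of formulas (7)-(10) can be sketched directly. Exact rational arithmetic (`fractions.Fraction`) is an assumption of this sketch, used to avoid floating point rounding in the ceilings and floors; the function names and the start state x = 1 are likewise illustrative.

```python
from fractions import Fraction
from math import ceil, floor

def abs_decode(x, q):
    """D(x) = (s, x_s), using x_1 = ceil(x q) from (7) and s from (9)."""
    x1 = ceil(x * q)
    s = ceil((x + 1) * q) - x1
    return (1, x1) if s == 1 else (0, x - x1)

def abs_encode(s, xs, q):
    """C(s, x_s), the first variant of (10)."""
    if s == 1:
        return floor(xs / q)
    return ceil((xs + 1) / (1 - q)) - 1

q = Fraction(3, 10)          # probability of symbol 1
bits = [1, 0, 0, 1, 0, 0, 0, 1]

x = 1                        # arbitrary start state chosen for this sketch
for s in bits:
    x = abs_encode(s, x, q)

# Decoding walks back through the states and yields bits in reverse.
decoded = []
state = x
for _ in bits:
    s, state = abs_decode(state, q)
    decoded.append(s)
decoded.reverse()
print(x, decoded)
```

With q = 3/10 the positions x < 19 decoded as symbol 1 are 0, 3, 6, 10, 13, 16, matching the example table for q = 0.3.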

4 Stream coding/decoding

We can now encode into large natural numbers (x). We would like to use ABS/ANS to encode a data stream - into a potentially infinite sequence of digits (bits) with expected uniform distribution. To do it, we can from time to time transfer a part of the information from x into a digit of a standard numeral system, to force x to stay in some fixed range (I).

4.1 Algorithm

Let us choose that the data stream will be encoded as {0, .., b − 1} digits - in the standard numeral system of base b ≥ 2. In practice we should mainly use the binary system (b = 2), but thanks to this general approach, we can for example use $b = 2^8$ to transfer a whole byte at once. Symbols contain correspondingly $\lg(1/q_s)$ bits of information. When they accumulate into lg b bits, we will transfer a full digit to/from the output, moving x back to I (bit transfer).

Observe that taking an interval of the form ($l \in \mathbb{N}$):

$$I := \{l, l+1, .., bl-1\} \qquad (11)$$

for any $x \in \mathbb{N}$ we have exactly one of three cases:

• $x \in I$, or

• $x > bl - 1$, then $\exists!_{k \in \mathbb{N}}\ \lfloor x/b^k \rfloor \in I$, or

• $x < l$, then $\forall_{(d_i) \in \{0,..,b-1\}^{\mathbb{N}}}\ \exists!_{k \in \mathbb{N}}\ x b^k + d_1 b^{k-1} + .. + d_k \in I$.

We will call such intervals b-unique: starting from any natural number x, after possibly a few reductions ($x \to \lfloor x/b \rfloor$) or appending a few least significant digits to x ($x \to xb + d_t$), we finally get into I in a unique way.

For some interval (I), define

$$I_s = \{x : C(s, x) \in I\}, \quad \text{so that} \quad I = \bigcup_s C(s, I_s). \qquad (12)$$

Define:

Stream decoding:
{(s,x)=D(x);
 use s; (e.g. to generate symbol)
 while(x∉I) x=xb+’digit from input’
}

Stream coding(s):
{while(x∉I_s)
  {put mod(x,b) to output; x=⌊x/b⌋}
 x=C(s,x)
}

Figure 1: Stream coding/decoding

We need the above functions to be unambiguous inverses of each other. Observe that this holds iff $I_s$ for s = 0, .., n − 1 and I are b-unique:

$$I = \{l, .., lb-1\} \qquad I_s = \{l_s, .., l_s b - 1\} \qquad (13)$$

for some $l, l_s \in \mathbb{N}$. We have: $\sum_s l_s(b-1) = \sum_s \#I_s = \#I = l(b-1)$. Remembering that $C(s, x) \approx x/q_s$, we finally have:

$$l_s \approx l q_s \qquad \sum_s l_s = l. \qquad (14)$$

We will now look at the behavior of lg x while stream coding s:

$$\lg x \to\ \approx \lg x + \lg(1/q_s) \pmod{\lg(b)} \qquad (15)$$

We have three possible sources of random behavior of x:

• we choose one of the symbols (behaviors) in a statistical (random) way,

• usually $\frac{\lg q_s}{\lg b}$ is irrational,

• C(s, x) is near but not exactly $x/q_s$.

This suggests that lg x should cover the possible space uniformly, which agrees with statistical simulations. That means that the probability of visiting a given state x should be approximately proportional to 1/x. We will focus on it in section 6.
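The stream coding and decoding procedures of section 4.1 can be sketched as follows. The table builder here is only a stand-in: it spreads the symbols with a seeded shuffle so that the block is self-contained. The names `build_tables`, `encode`, `decode`, the start state x = l, and passing the symbol count explicitly to the decoder are all assumptions of this sketch.

```python
import random

def build_tables(ls, b=2, seed=0):
    """Place (b-1)*ls[s] copies of each symbol s on I = {l, .., b*l - 1}
    and number the copies of s by x_s = l_s, l_s + 1, .., b*l_s - 1."""
    l = sum(ls)
    symbols = [s for s, c in enumerate(ls) for _ in range((b - 1) * c)]
    random.Random(seed).shuffle(symbols)   # stand-in distributing function
    nxt = list(ls)
    D, C = {}, {}
    for x, s in zip(range(l, b * l), symbols):
        D[x] = (s, nxt[s])
        C[(s, nxt[s])] = x
        nxt[s] += 1
    return D, C, l

def encode(message, ls, b=2, seed=0):
    D, C, l = build_tables(ls, b, seed)
    x, digits = l, []
    for s in message:
        while not (ls[s] <= x < b * ls[s]):   # x not in I_s
            digits.append(x % b)              # put mod(x, b) to output
            x //= b
        x = C[(s, x)]
    return x, digits

def decode(x, digits, count, ls, b=2, seed=0):
    D, C, l = build_tables(ls, b, seed)
    digits = list(digits)
    out = []
    for _ in range(count):
        s, x = D[x]
        out.append(s)
        while x < l:                          # x not in I: read digits back
            x = x * b + digits.pop()
    out.reverse()                             # symbols come out last-first
    return out, x

ls = [3, 5, 8]                                # l = 16, q_s = l_s / l
message = [2, 0, 1, 2, 2, 1, 0, 2, 1, 2] * 20
final_state, digits = encode(message, ls)
decoded, start_state = decode(final_state, digits, len(message), ls)
print(len(digits), decoded == message)
```

Because the decoder undoes the encoder step by step, it consumes the digits in last-in-first-out order and ends in the encoder's start state.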

4.2 Analysis of a single step

Let’s concentrate on a single stream coding step. Choose some s ∈ {0, .., n − 1}. Among the l(b − 1) states of I = {l, .., lb − 1} we have $l_s(b-1)$ appearances of symbol s. While choosing $l_s$ we are approximating probabilities. So to simplify further analysis, let us assume for the rest of the paper:

$$q_s = \frac{l_s}{l} \qquad (16)$$

Let us introduce for $q_s$ the fractional and integer part:

$$c_s := \{\log_b(q_s)\} \in [0, 1) \qquad k_s := -\lfloor \log_b(q_s) \rfloor \in \mathbb{N}^+ \qquad (17)$$

$$\log_b(1/q_s) = k_s - c_s \qquad 1 \le q_s b^{k_s} = b^{c_s} < b \qquad (18)$$

where $\{z\} = z - \lfloor z \rfloor$ is the fractional part. Now if we introduce a new variable:

$$y := \log_b\left(\frac{x}{l}\right) \qquad (19)$$

we will have that one coding step is approximately $y \to \{y - c_s\}$.

Figure 2: Example of a stream coding/decoding step for b = 2, $k_s = 3$, $l_s = 13$, l = 9 · 4 + 3 · 8 + 6 = 66, $q_s = 13/66$, x = 19, $b^{k_s-1} x + 3 = 79 = 66 + 2 + 2 \cdot 4 + 3$.

The situation looks like in fig. 2:

• The bit transfer makes the states denoted by circles behave just like the state denoted by the ellipse on their left. The difference between them is in the transferred digits.

• The number of transferred digits has at most two possibilities, differing by 1: $k_s - 1$ and $k_s$.

$k_s$ is the only number such that $\lfloor (lb - 1)/b^{k_s} \rfloor \in I_s$.

When $q_s$ is near some integer power of b ($q_s \approx b^{-k_s}$), we can have a situation in which we always transfer $k_s$ digits, but it can be treated as a special case of the first one ($X_s = l$).

• The states denoted by ellipses are multiples of $b^{k_s-1}$ or $b^{k_s}$ correspondingly. So if l is not a natural power of b, there can be some states before the first multiple of $b^{k_s-1}$. They correspond to the last multiple of $b^{k_s}$. Let’s assume for simplicity that

$$L := \log_b(l) \in \mathbb{N} \qquad (20)$$

so the first state in the picture is an ellipse. With this assumption we have the special case from the previous point, in which $k_s$ digits are always transferred, if and only if $q_s = b^{-k_s}$.

We also assume that we have some appearances of each symbol, so $L \ge k_s$.

• The states before the step (the top of the picture) can be divided into two ranges - on the left or right of some boundary value

$$X_s := \max\{x : C(s, \lfloor x/b^{k_s-1} \rfloor) < lb\} = \min\{D_2(x) : D_1(x) = s,\ x \ge l\}$$

On the left of this value we transfer $k_s - 1$ digits (this range can be degenerate), on the right we transfer $k_s$ digits.

Of the $l_s(b-1)$ ellipses, $\frac{X_s - l}{b^{k_s-1}}$ are on the left and $\frac{lb - X_s}{b^{k_s}}$ are on the right of $X_s$:

$$l_s(b-1)b^{k_s} = (X_s - l)b + lb - X_s = (b-1)X_s$$

We got the exact formula:

$$l \le X_s = l_s b^{k_s} = l q_s b^{k_s} = l b^{c_s} < lb \qquad (21)$$

Intuitively, the position of this boundary corresponds to the inequality

$$b^{-k_s} \le q_s < b^{-k_s+1}.$$

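• While the situation on the top of the picture (before the coding step) was fully determined by l and $q_s$, the distribution on the bottom (after) has full freedom: it is made by choosing the distributing function. This time we have the boundary value: $C\left(s, \left\lceil \frac{l}{b^{k_s-1}} \right\rceil\right) \approx \frac{l}{b^{k_s-1} q_s}$.

Finally, the change of state after one step of stream coding $C_s : I \to I$ is:

$$C_s(x) = \begin{cases} C\left(s, \lfloor x/b^{k_s-1} \rfloor\right) & \text{for } x < X_s \\ C\left(s, \lfloor x/b^{k_s} \rfloor\right) & \text{for } x \ge X_s \end{cases} \;=\; C\left(s, \left\lfloor x/b^{k_s-[x<X_s]} \right\rfloor\right) \qquad (22)$$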
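The quantities $k_s$, $c_s$ and $X_s$ from (17) and (21) can be checked numerically on the parameters of figure 2. The sketch below uses the closed form $X_s = l_s b^{k_s}$ and integer arithmetic; the function names are assumptions made for illustration.

```python
from math import log

def step_parameters(l, ls, b):
    """Return (k_s, X_s): k_s is the smallest k with l_s * b**k >= l,
    which equals -floor(log_b(l_s / l)); X_s = l_s * b**k_s as in (21)."""
    ks = 0
    while ls * b ** ks < l:
        ks += 1
    return ks, ls * b ** ks

def digits_transferred(x, ls, b):
    """Number of digits the stream coder outputs before x lands in I_s."""
    k = 0
    while x >= b * ls:
        x //= b
        k += 1
    return k

b, l, ls = 2, 66, 13                 # parameters of figure 2
ks, Xs = step_parameters(l, ls, b)
cs = log(Xs / l, b)                  # c_s = log_b(X_s / l), from (21)
counts = [digits_transferred(x, ls, b) for x in range(l, b * l)]
print(ks, Xs, cs, sorted(set(counts)))
```

For these parameters every state below $X_s$ transfers $k_s - 1$ digits and every state from $X_s$ upward transfers $k_s$ digits.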

5 Asymmetric Numeral Systems(ANS)

In the general case - encoding a sequence of symbols with probability distribution $0 < q_0, .., q_{n-1} < 1$ for some n > 2 - we could divide the selection of a symbol into a few binary choices and just use ABS. In this section we will see that we can also encode such symbols directly. Unfortunately I couldn’t find practical explicit formulas for n > 2, but we can calculate the coding/decoding functions during initialization, making the processing of the data stream extremely fast. The problem is that we cannot really tabulate all possible probability distributions - we have to initialize for a few of them and possibly reinitialize from time to time.

This time we fix the range we are working on (I = {l, .., bl − 1}), so in fact we are interested in the stream coding/decoding functions only on this set. They are determined by the distribution of symbols: $(b-1)l_s$ appearances of symbol s. This way we are approximating the probabilities. As was already said, we will assume $q_s = l_s/l$. The exact probability will from now on be denoted $q'_s$. So we have $|q_s - q'_s| \approx \frac{1}{2l}$.

5.1 Precise coder

We will now construct a precise coder in a similar way as for the binary case.

Denote $N_s := \{ i/q_s : i \in \mathbb{N}^+ \}$. These sets look like a good approximation of the positions of symbols in the distributing function. We only have to move them to some positions at natural numbers. Intuition suggests that to choose symbols, we should successively take the smallest element of these sets which hasn’t been chosen yet.

Observe that $\#(N_s \cap [0, x]) = \lfloor x q_s \rfloor$, but $\sum_s \lfloor x q_s \rfloor \le x$. So if we used just the proposed algorithm, while choosing a symbol for a given x, at least $\lfloor x q_s \rfloor$ appearances of each symbol have already occurred: $\forall_s\ x_s \ge \lfloor x q_s \rfloor$. For x being a natural multiple of l we get equalities instead. To bound $x_s$ from above in general, observe that because the fractional parts of $x q_s$ sum to a natural number, we have $\sum_s \lfloor x q_s \rfloor \ge x - n + 1$. Finally, because $\sum_s x_s = x$, we get:

$$\lfloor x q_s \rfloor \le x_s \le \lfloor x q_s \rfloor + n - 1 \ \Rightarrow\ x_s - x q_s \in (-1, n-1] \qquad (23)$$

Numerical simulations suggest that there are probability distributions for which we cannot improve this pessimistic evaluation, but in practice |xs − xqs| is usually smaller than 1.

To implement this algorithm, in each step we have to find the smallest of n numbers. Assume we have implemented some priority queue, for example using a heap. Besides initialization it has two instructions: put((y, s)) inserts the pair (y, s) into the queue, getmin removes and returns the pair which is the smallest with respect to the relation $(y, s) \le (y', s') \Leftrightarrow y \le y'$.

Precise initialization:
For s = 0 to n − 1 do {put( (1/q_s, s) ); x_s = l_s};
For x = l to bl − 1 do
 {(y, s)=getmin; put((y + 1/q_s, s));
  D[x]=(s, x_s) or C[s, x_s]=x
  x_s++}
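The precise initialization above can be sketched with Python's `heapq` module as the priority queue. Exact fractions for 1/q_s and breaking ties between equal y by the symbol number are assumptions of this sketch (the paper's relation compares y only); the function name `precise_init` is illustrative.

```python
import heapq
from fractions import Fraction

def precise_init(ls, b=2):
    """Build D[x] = (s, x_s) and C[(s, x_s)] = x on I = {l, .., b*l - 1}
    by repeatedly taking the smallest unused element of the sets N_s."""
    l = sum(ls)
    inv_q = [Fraction(l, c) for c in ls]       # 1/q_s with q_s = l_s / l
    heap = [(inv_q[s], s) for s in range(len(ls))]
    heapq.heapify(heap)
    nxt = list(ls)                             # x_s starts at l_s
    D, C = {}, {}
    for x in range(l, b * l):
        y, s = heapq.heappop(heap)             # getmin
        heapq.heappush(heap, (y + inv_q[s], s))
        D[x] = (s, nxt[s])
        C[(s, nxt[s])] = x
        nxt[s] += 1
    return D, C

ls = [3, 5, 8]
D, C = precise_init(ls)
for x in sorted(D):
    print(x, D[x])
```

Each symbol s ends up with exactly $(b-1)l_s$ positions, and the assigned $x_s$ stay within the bounds of (23).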

5.2 Selfcorrecting diffusion (ScD)

We will now focus on a slightly less precise, but faster statistical initialization method: fill a table of size (b − 1)l with the proper number of appearances of each symbol, and for succeeding x take a symbol at a random position of this table, reducing the table. So at the beginning it will behave like diffusion, but it will correct itself while approaching the end. Another advantage of this approach is that after fixing $(l_s)$, we still have a huge (exponential in #I) number of possible coding functions - we can choose one using some key, additionally encrypting the data.

Initialization:
m=(b-1)l;
symbols=(0, 0, .., 0, 1, 1, .., 1, .., n − 1, .., n − 1); (with (b−1)l_0 copies of 0, (b−1)l_1 copies of 1, .., (b−1)l_{n−1} copies of n − 1)
For s = 0 to n − 1 do x_s = l_s;
For x = l to bl − 1 do
 {i=random natural number from 1 to m;
  s=symbols[i]; symbols[i]=symbols[m]; m--;
  D[x]=(s, x_s) or C[s, x_s]=x
  x_s++}

Here we can use practically any deterministic pseudorandom number generator, like the Mersenne Twister ([7]), and use a key, if there is one, for its initialization.
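The ScD initialization can be sketched with Python's `random.Random`, which is a Mersenne Twister generator seeded here by the key. The function name `scd_init`, the integer key, and 0-based indexing into the symbol table are assumptions of this sketch.

```python
import random

def scd_init(ls, b=2, key=0):
    """Selfcorrecting diffusion: draw symbols without replacement from a
    table holding (b-1)*ls[s] copies of each symbol s."""
    rng = random.Random(key)                   # key initializes the generator
    l = sum(ls)
    symbols = [s for s, c in enumerate(ls) for _ in range((b - 1) * c)]
    m = len(symbols)
    nxt = list(ls)                             # x_s starts at l_s
    D, C = {}, {}
    for x in range(l, b * l):
        i = rng.randrange(m)                   # random position in the table
        s = symbols[i]
        symbols[i] = symbols[m - 1]            # reduce the table
        m -= 1
        D[x] = (s, nxt[s])
        C[(s, nxt[s])] = x
        nxt[s] += 1
    return D, C

ls = [3, 5, 8]
D, C = scd_init(ls, key=12345)
for x in sorted(D):
    print(x, D[x])
```

The same key always reproduces the same tables, which is what lets the decoder rebuild them.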

It will be precise at the beginning and at the end, but in general the imprecision will be larger. We will analyze it now. While selecting some symbol s, we can divide the symbols into two groups: this symbol and the rest. So we can restrict ourselves to a simplified model:

Model: We have N distinguishable numbers: L copies of ’1’ and N − L copies of ’0’. What is the probability that if we choose M of them, there will be K of ’1’?

K copies in M symbols can be distributed in $\binom{M}{K}$ ways. After choosing one of them, the copies of ’1’ can be placed in L(L − 1)..(L − K + 1) ways, and those of ’0’ in (N − L)..(N − L − (M − K) + 1) ways. The number of all such sequences is N(N − 1)..(N − M + 1), so the probability we are looking for is:

$$P_{N,M,L}(K) = \binom{M}{K} \frac{L!}{(L-K)!} \frac{(N-L)!}{(N-L-M+K)!} \frac{(N-M)!}{N!} = \binom{M}{K}\binom{N-M}{L-K} \Big/ \binom{N}{L}$$

For this derivation denote the expected value $q := \frac{L}{N}$. This probability distribution should be Gaussian-like with its maximum at $\frac{K}{M} \approx q$. To approximate its width, we can use the approximation of the binomial coefficient from the introduction:

$$\log_2\left(P_{N,M,L}(K)\right) \approx M h\left(\frac{K}{M}\right) + (N-M)\, h\left(\frac{L-K}{N-M}\right) - N h(q)$$

Because we are interested only in an approximation of the width of the Gaussian, we have omitted the terms with square roots - they correspond mainly to the normalization of probability. This formula has its only maximum at K = Mq, as expected. Expanding around this point up to the second Taylor term, we get

$$P_{N,M,L}(K) \approx \exp\left( -\frac{1}{2Mq\tilde{q}} \frac{N}{N-M} (K - Mq)^2 \right) \qquad (24)$$

We get the standard deviation $\sigma = \sqrt{Mq\tilde{q}(1 - M/N)}$. This result agrees well with exact numerical calculations. Observe that without the (1 − M/N) term, it would be just the formula from the central limit theorem for the binomial distribution (P(’1’) = q). So, as expected, for small M we have diffusion-like behavior, but this term ensures that with M → N we approach the expected value.

Returning to the algorithm, N = (b − 1)l, M = x − l, $L = (b-1)l_s$, $K = x_s - l_s$:

$$x_s - l_s \approx (x-l)q_s \pm \sqrt{(x-l)\, q_s \tilde{q}_s \left(1 - \frac{x-l}{(b-1)l}\right)}$$

$$x - \frac{x_s}{q_s} \approx \pm \sqrt{\frac{\tilde{q}_s}{(b-1)l_s}(x-l)(bl-x)} \qquad (25)$$

The standard deviation is the square root of a parabola with zeros at l and bl, as expected. Its maximum is $\sqrt{\frac{\tilde{q}_s (b-1)}{4 q_s}\, l}$, reached for x = l(b + 1)/2. This is the largest expected imprecision - it grows with the square root of l.
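The width estimate can be compared with the exact distribution of the simplified model. The sketch below sums the exact probabilities $P_{N,M,L}(K)$ and compares their standard deviation with $\sqrt{Mq\tilde{q}(1 - M/N)}$; the function names and the sample values N = 1000, M = 400, L = 300 are assumptions chosen for illustration.

```python
from math import comb, sqrt

def model_pmf(N, M, L, K):
    """Exact probability of K copies of '1' among M chosen symbols."""
    return comb(M, K) * comb(N - M, L - K) / comb(N, L)

def exact_std(N, M, L):
    """Standard deviation of K computed from the exact probabilities."""
    ks = range(max(0, M + L - N), min(M, L) + 1)
    mean = sum(K * model_pmf(N, M, L, K) for K in ks)
    var = sum((K - mean) ** 2 * model_pmf(N, M, L, K) for K in ks)
    return sqrt(var)

def approx_std(N, M, L):
    """Gaussian width estimate: sqrt(M q (1 - q) (1 - M/N)) with q = L/N."""
    q = L / N
    return sqrt(M * q * (1 - q) * (1 - M / N))

N, M, L = 1000, 400, 300
print(exact_std(N, M, L), approx_std(N, M, L))
```

Both values shrink to zero as M approaches N, which is the self-correcting effect the section describes.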

Modern pseudorandom number generators can be practically unpredictable, and so would be the ANS initialization. It chooses a different random local behavior for each x ∈ I, making the state a practically unpredictable hidden random variable. Encryption based on ANS, instead of making calculations while taking succeeding blocks as standard ciphers do, makes all calculations during initialization - the processing of the data is much faster: it just uses the tables. Another advantage of such a preinitialized cryptosystem is that it is more resistant to brute force attacks - when trying a new key we cannot just start decoding as usual, but first have to perform the whole initialization, which can take as much time as the user wanted. We will focus on such cryptosystems in section 8.

6 Statistical analysis

In this section we will try to understand the behavior and calculate some properties of the presented coders. By construction they have more or less random behavior and they process more or less random data, so we can usually make only rough estimates, which turn out to agree well with numerical simulations.

For a given coder, let us define a function which measures its imprecision:

$$\epsilon_s(x) = C(s, x) - x/q_s \qquad (26)$$

For precise coders usually $|\epsilon_s(x)| < 1$; for ScD it can be estimated by (25). We have to connect it with the stream version (22): introduce $\varepsilon_s(x)$, such that

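$$C_s(x) = \frac{x}{q_s b^{k_s-[x<X_s]}} + \varepsilon_s(x) \qquad (27)$$

$$\varepsilon_s(x) := \begin{cases} C_s(x) - \frac{x}{q_s b^{k_s-1}} = \epsilon_s\left(\left\lfloor \frac{x}{b^{k_s-1}} \right\rfloor\right) - \frac{1}{q_s}\left(\frac{x}{b^{k_s-1}} - \left\lfloor \frac{x}{b^{k_s-1}} \right\rfloor\right) & \text{for } x < X_s \\ C_s(x) - \frac{x}{q_s b^{k_s}} = \epsilon_s\left(\left\lfloor \frac{x}{b^{k_s}} \right\rfloor\right) - \frac{1}{q_s}\left(\frac{x}{b^{k_s}} - \left\lfloor \frac{x}{b^{k_s}} \right\rfloor\right) & \text{for } x \ge X_s \end{cases}$$

$$\varepsilon_s(x) = \epsilon_s\left(\left\lfloor \frac{x}{b^{k_s-[x<X_s]}} \right\rfloor\right) - \frac{1}{q_s}\left\{ \frac{x}{b^{k_s-[x<X_s]}} \right\} \qquad (28)$$

In logarithmic form:

$$y := \log_b(x) - \log_b(l) \in [0, 1], \qquad \tilde{I} := \log_b(I) - \log_b(l) \subset [0, 1], \qquad x = l b^y \qquad (29)$$

Now our stream coding function will be $\tilde{C}_s : \tilde{I} \to \tilde{I}$, with $Y_s := \log_b(X_s) - \log_b(l)$. Observe that this approximate equation can be thought of as $\tilde{C}_s(y) \approx \{y - c_s\}$.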

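Introduce $\tilde{\varepsilon}_s(y)$ analogously as before:

$$\tilde{C}_s(y) =: \begin{cases} y - c_s + 1 + \tilde{\varepsilon}_s(y) & \text{for } y < Y_s \\ y - c_s + \tilde{\varepsilon}_s(y) & \text{for } y \ge Y_s \end{cases} \qquad (30)$$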

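Let us connect $\tilde{\varepsilon}_s(y)$ with $\varepsilon_s(x)$ and $\epsilon_s(x)$. For $y \ge Y_s$:

$$\tilde{\varepsilon}_s(y) = \tilde{C}_s(y) - y + c_s = \log_b\left(C_s(l b^y)\right) - \log_b l - y + c_s = \log_b\left(\frac{l b^y}{q_s b^{k_s}} + \varepsilon_s(l b^y)\right) - \log_b l - y + c_s \approx$$

$$\approx \log_b\left(\frac{l b^y}{q_s b^{k_s}}\right) + \frac{q_s b^{k_s}}{l b^y \ln(b)}\, \varepsilon_s(l b^y) - \log_b l - y + c_s = \frac{b^{c_s}}{l \ln(b)\, b^y}\, \varepsilon_s(l b^y)$$

where we have used the first order Taylor expansion of the logarithm.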

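Making a similar calculation for the $y < Y_s$ case, we finally get ($l b^y = x$):

$$\tilde{\varepsilon}_s(y) \approx \begin{cases} \frac{b^{c_s-1}}{\ln(b)\, l b^y} \left( \epsilon_s\left(\left\lfloor \frac{l b^y}{b^{k_s-1}} \right\rfloor\right) - \frac{1}{q_s}\left\{ \frac{l b^y}{b^{k_s-1}} \right\} \right) & \text{for } y < Y_s \\ \frac{b^{c_s}}{\ln(b)\, l b^y} \left( \epsilon_s\left(\left\lfloor \frac{l b^y}{b^{k_s}} \right\rfloor\right) - \frac{1}{q_s}\left\{ \frac{l b^y}{b^{k_s}} \right\} \right) & \text{for } y \ge Y_s \end{cases} \qquad (31)$$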

6.1 Probability distribution of the states

We can now consider the probability distribution among states that our stream coder/decoder should asymptotically approach while processing a long stream of symbols/digits.

While processing some data, the state changes in a very complicated and random-looking way. Let us recall its three sources:

• Asymmetry (the strongest) - different symbols usually have different probabilities and so change the state in completely different ways. This choice of symbol/behavior depends on the local symbol distribution, which also looks random. Analogously while decoding: starting from a different state, the transferred bits denote completely different behavior,

• Uniform covering - usually $c_s = \{\log_b(q_s)\}$ are irrational, so by making $y \to\ \approx \{y - c_s\}$ steps, intuitively we should cover the [0, 1) range uniformly,

• Diffusion - C(s, x) is near, but not exactly, $x/q_s$ ($\epsilon \ne 0$), so we have some additional, random-looking motion around the expected state from the two previous points.

These points strongly suggest that the state practically behaves as a random variable. So, for example, starting from any state, we should be able to reach any other.

Unfortunately, some pathological examples can be found, in which all $\log_b(q_s)$ are rational numbers and we use precise initialization, so that we stay in some proper subset of I:

$$I = \{4, 5, 6, 7\},\quad n = 2,\quad l_0 = l_1 = 2,\quad C_0(4) = C_0(5) = 4,\quad C_1(4) = C_1(5) = 5$$

I couldn’t find qualitatively more complicated examples, but if this accidentally happens, the coder will still work as an entropy coder, only with a slightly different expected probability distribution of symbols - a worse compression rate.

We can make a natural assumption (*) for the rest of the paper:

For each two states $x, x' \in I$, there is a sequence of symbols $(s_1, .., s_m)$ which takes us from x to x': $C_{s_m}(...(C_{s_1}(x))) = x'$.

Assume now that we want to use the coder with a sequence of symbols with a given probability distribution $(p_s)_{s=0,..,n-1}$ such that $\forall_s\ 1 > p_s > 0$. So if at a given moment the coder is in state x, after one step it will be in state $C_s(x)$ with probability $p_s$. It can be imagined as a Markov process. Now the assumption (*) means that its stochastic matrix is irreducible - from the Frobenius-Perron theorem we know that there is a unique limit probability distribution among states:

$$P : I \to (0, 1),\quad \sum_x P(x) = 1 : \quad \forall_{x \in I}\ P(x) = \sum_{s,\, y \in I} \{P(y)\, p_s : C_s(y) = x\} \qquad (32)$$

To obtain a good understanding of the coding process, we should find a good general approximation of this probability distribution. The details of such a process are extremely complicated, so to cope with this problem we should find equations that are as simple as possible - use the logarithmic form $y = \log_b(x/l)$. $\tilde{I}$ is a subset of [0, 1] that is difficult to handle, so to work with probability on this set, we should use a cumulative distribution function: a nondecreasing function D : [0, 1] → [0, 1], fulfilling D(0) = 0, D(1) = 1:

$$D(y) := \text{probability of being in a state less than or equal to } y = \sum_{x=l}^{l b^y} P(x) \qquad (33)$$

It describes a stationary distribution of the coding process iff

$$D(y) = \sum_s p_s \begin{cases} D(\tilde{C}_s(y)) - D(\tilde{C}_s(0)) & \text{for } y < Y_s \\ \left(D(1) - D(\tilde{C}_s(0))\right) + \left(D(\tilde{C}_s(y)) - D(0)\right) & \text{for } y \ge Y_s \end{cases}$$

$$D(y) = \sum_s p_s \begin{cases} D(y - c_s + 1 + \tilde{\varepsilon}_s(y)) - D(0 - c_s + 1 + \tilde{\varepsilon}_s(0)) & \text{for } y < Y_s \\ D(y - c_s + \tilde{\varepsilon}_s(y)) - D(0 - c_s + 1 + \tilde{\varepsilon}_s(0)) + 1 & \text{for } y \ge Y_s \end{cases}$$

We see that for $\tilde{\varepsilon} = 0$, the unique solution is D(y) = y for y ∈ [0, 1]. This is an idealized solution - in practice we have a discrete set of states, so D cannot even be continuous. $\tilde{\varepsilon}$ is a very small, randomly behaving function taking different signs, and it is somehow averaged in the above equations, so intuitively D should be near this idealized solution. Unfortunately I wasn’t able to prove it, but numerical simulations show that this correction is in practice much smaller than $\tilde{\varepsilon}$. If we return to the original states, this approximation says that

$$P(x \le x') \approx \log_b(x'/l).$$

Differentiating it, we get that P(x) is approximately proportional to 1/x. We will use this in further calculations. To work with 1/x sequences we can use the well-known harmonic numbers:

$$H(n) := \sum_{i=1}^{n} \frac{1}{i} = \gamma + \ln(n) + \frac{1}{2} n^{-1} - \frac{1}{12} n^{-2} + \frac{1}{120} n^{-4} + O(n^{-6}) \qquad (34)$$

where γ = 0.5772156649... Using this formula we can easily find the normalization coefficient $\mathcal{N}$:

$$\frac{1}{\mathcal{N}} = \sum_{x=l}^{bl-1} \frac{1}{x} = H(bl-1) - H(l-1) \approx \ln(b)$$

For the rest of the paper we will use the approximation

$$P(x) \approx \frac{\mathcal{N}}{x} \qquad (35)$$

Now we can, for example, calculate the probability that while encoding symbol s we will transfer $k_s - 1$ digits:

$$P(x < X_s) \approx \mathcal{N}\left(H(X_s - 1) - H(l-1)\right) \approx \frac{1}{\ln(b)} \ln(b^{k_s} q_s) = c_s \qquad (36)$$

We can also define the expected value of a function during the coding/decoding process:

$$\langle f(x) \rangle = \sum_{x \in I} P(x) f(x) \approx \frac{1}{\ln(b)} \sum_{x \in I} \frac{f(x)}{x} \qquad (37)$$

Numerical simulations show that these are usually very good approximations.
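The approximations (35) and (36) can be checked numerically if we take P(x) proportional to 1/x as given. The sketch below does this for the parameters of figure 2 (b = 2, l = 66, $l_s$ = 13, hence $X_s$ = 104); it does not simulate the coder itself, and the helper name `harmonic` is an assumption.

```python
from math import log

def harmonic(n):
    """Harmonic number H(n) computed by direct summation."""
    return sum(1.0 / i for i in range(1, n + 1))

b, l, ls = 2, 66, 13
Xs = 104                                   # l_s * b**k_s with k_s = 3

# Normalization coefficient: 1/N = sum over x in I of 1/x.
norm = 1.0 / (harmonic(b * l - 1) - harmonic(l - 1))

# Probability of transferring k_s - 1 digits under the 1/x model.
p_left = norm * (harmonic(Xs - 1) - harmonic(l - 1))

cs = log(Xs / l, b)                        # c_s from (21)
print(norm, 1 / log(b), p_left, cs)
```

Under the 1/x model, the normalization is close to 1/ln(b) and the probability of the shorter transfer is close to $c_s$, as (36) states.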

6.2 Evaluation of the compression rate

Using the constructed coders we can get as near to Shannon’s entropy as we need. In this subsection we will evaluate this distance. It is very sensitive to the parameters, so the estimates will be very rough - only to find the general dependence on the main parameters.

Having the probability distribution of the states, we can now use formula (6):

\[ \Delta H \approx \frac{1}{\ln(4)} \left\langle \sum_s \frac{q_s}{x^2} (\epsilon_s(x))^2 \right\rangle \tag{38} \]

The impreciseness of our encoder is more or less random and we can only estimate its expected values, so for this estimation we can treat $q_s/x^2$ and $(\epsilon_s(x))^2$ as independent variables. This also allows us to separate the compression rate losses into those which come from the l, b parameters only and those caused by the impreciseness of the coder.

\[ \frac{1}{\ln(4)} \left\langle \sum_s \frac{q_s}{x^2} \right\rangle \approx \frac{N}{\ln(4)} \sum_{s,x} \frac{1}{x} \frac{q_s}{x^2} \approx \frac{N}{\ln(4)} \sum_s q_s \int_l^{lb} x^{-3} dx \approx \frac{1}{l^2} \, \frac{b^2 - 1}{b^2 \ln(b) \ln(4)} \tag{39} \]

For the precise initialization, $\langle \sum_s q_s (\epsilon_s(x))^2 \rangle$ intuitively shouldn’t depend strongly on the l, b parameters, but rather on n and the probability distribution. Pessimistically, using (23), we can bound it from above by $n^2$, but in practice it’s usually smaller than n.

Let’s focus on ScD initialization now. The term with fractional part of  is much smaller than the main source of imperfection, so we can omit it. D    E P 2 P q˜s x x h qs(s(x)) i ≈ k −[x

 2    P q˜s 2 b −1 b+1 b = N s (b−1)l l 2 + (b − 1) ln(l) − b ln(b) ≈ l(n − 1) 2 ln(b) + logb(l) − b−1

Usually the largest is the term with $\log_b(l)$, so finally

\[ \Delta H \approx \frac{\log_b(l)}{l} \, \frac{b^2 - 1}{b^2 \ln(4) \ln(b)} \, (n - 1) \tag{40} \]

Comparing to numerical simulations these estimations are very pessimistic: we get many times (like 10-100) smaller value, but general behavior log(l)/l looks to be fulfilled.

To summarize: in practice we rarely require the coder to be closer to optimal than e.g. 1/1000, which can be obtained with l/n usually below 100 for ScD initialization. If needed, we can divide I into subranges initialized separately to improve preciseness.

6.3 Probability distribution of digits and symbols

The fact that states with smaller numbers are more probable unfortunately means that the produced sequences aren’t exactly uniform, uncorrelated sequences, which would be expected for example if we would like to use ANS in cryptography. We will analyze it briefly now, and the next section will show how to correct it.

First of all let us assume that we are coding some sequence of symbols to produce a sequence of digits. Look at fig. 2. The last transferred digit says in which subrange of states, indistinguishable after bit transfer, we are. So the fact that P(x) is generally decreasing makes it a bit more probable that this last transferred digit is 0. Let’s estimate this probability to see how it depends on the parameters.

Using D(x) := D(logb(x/l)) ≈ logb(x) − logb(l), we get the probability that this last (while coding)/first (while decoding) digit is 0:

\[
\begin{aligned}
&\sum_{i=0}^{(X_s-l)/b^{k_s-1}-1} \left[ D(l + i b^{k_s-1} + b^{k_s-2} - 1) - D(l + i b^{k_s-1} - 1) \right] \approx \\
&\approx \sum_{i=0}^{(X_s-l)/b^{k_s-1}-1} \frac{b^{k_s-2}}{(l + i b^{k_s-1}) \ln(b)} = \frac{1}{b \ln(b)} \sum_{i=0}^{(X_s-l)/b^{k_s-1}-1} \frac{1}{i + l/b^{k_s-1}} = \\
&= \frac{1}{b \ln(b)} \left( H(X_s/b^{k_s-1} - 1) - H(l/b^{k_s-1} - 1) \right) \approx \frac{1}{b} \log_b \left( \frac{X_s/b^{k_s-1} - 1}{l/b^{k_s-1} - 1} \right) = \\
&= \frac{1}{b} \log_b \left( \frac{q_s b^{k_s} - b^{k_s-1}/l}{1 - b^{k_s-1}/l} \right) \approx \frac{1}{b} \log_b \left( q_s b^{k_s} + \frac{b^{k_s-1}}{l} (q_s b^{k_s} - 1) \right) \approx \frac{c_s}{b} + \frac{1}{l q_s b^2 \ln(b)} (q_s b^{k_s} - 1)
\end{aligned}
\]

where we’ve used $D(x + h) - D(x) \approx h D'(x)$ and the simplest approximation for harmonic numbers. We could get a constant a few times smaller if we took a better approximation of harmonic numbers and the derivative in the middle of the range. If we are interested only in the general dependence on parameters, the presented approximation is good enough.

In the second range the probability distribution of states decreases more slowly, but the ranges are larger. An analogous calculation gives

\[ \frac{1 - c_s}{b} + \frac{1}{l q_s b^2 \ln(b)} (b - q_s b^{k_s}). \]

If we sum these values, we get that while encoding symbol s, the probability that the last digit during bit transfer will be 0 is

\[ \frac{1}{b} + \frac{b - 1}{l q_s b^2 \ln(b)}. \]

If we average the obtained correction over all possible symbols, we get that this probability is larger than for the uniform digit distribution by approximately

\[ \frac{b - 1}{b^2 \ln b} \, \frac{n}{l} \tag{41} \]

In fact this value is a few times smaller, and in practice we can use large l, like $10^5 - 10^6$ (so that the tables still fit in cache memory), so this effect can be extremely weak. While estimating, probability uncertainty decreases with the square root of the number of events, so even observing this effect would require analysis of gigabytes of output. Retrieving some useful information like the probability distribution of block lengths would require much more data. For succeeding digits and correlations this effect will be accordingly smaller. We will see in the next section how to reduce it, if needed, by as many orders of magnitude as we want.
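To get a feeling for the size of correction (41), it can be evaluated directly. This is a rough sketch only: the values b = 2, n = 256, l = 10^6 are assumed examples, and the sample-count estimate uses a plain binomial model with a three-sigma threshold, which is an assumption of this illustration rather than a statement from the analysis above.

```python
import math

def digit_zero_excess(b, n, l):
    """Estimate (41): excess of P(last digit = 0) over the uniform value 1/b."""
    return (b - 1) * n / (b**2 * math.log(b) * l)

def samples_to_detect(b, excess, sigmas=3.0):
    """Rough number of observed digits needed for the excess to stand
    `sigmas` standard deviations above sampling noise (binomial model)."""
    p = 1.0 / b
    std_one = math.sqrt(p * (1.0 - p))  # std of a single 'digit is 0' indicator
    return (sigmas * std_one / excess) ** 2

b, n, l = 2, 256, 10**6  # example values (assumed)
excess = digit_zero_excess(b, n, l)
needed = samples_to_detect(b, excess)
print("excess probability of digit 0:", excess)
print("digits needed to notice it   :", needed)
```

Since the true effect is stated to be a few times smaller than (41), and the required number of samples grows with the inverse square of the excess, the real amount of data needed would be correspondingly larger than this estimate.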

Now let us focus on the opposite situation - we have some sequence of digits and we want to encode them into symbols of given probability distribution. This time states are not gathered into subranges as previously, but distributed randomly and more or less uniformly, so the differences should be much smaller. But if we need to more precisely evaluate their probability distribution than ls/l, we can for example use our approximation of state probability distribution, so the probability that we will produce symbol s is approximately:

\[ N \sum \left\{ \frac{1}{x} : x \in I,\ D_1(x) = s \right\} \tag{42} \]

This formula also says more precisely what probability distribution of symbols is encoded closest to the Shannon’s entropy. Using it we could also modify coding/decoding functions to make a better approximation of the expected probability distribution of symbols using the same l. Shifting some appearances of a symbol left (right) increases (decreases) its probability a bit.
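Formula (42) is straightforward to evaluate for a concrete assignment of symbols to states. The sketch below assumes b = 2, l = 8 and a hypothetical alternating assignment "abababab"; the function name and the list representation of D_1 are illustrative choices, not part of the original construction.

```python
def effective_symbol_probs(symbol_of_state, l, b):
    """Evaluate (42): for each symbol s, sum N/x over states x with D_1(x) = s.

    symbol_of_state[i] is the symbol assigned to state x = l + i,
    for the states x = l, ..., b*l - 1.
    """
    states = range(l, b * l)
    N = 1.0 / sum(1.0 / x for x in states)  # normalization of P(x) ~ N/x
    probs = {}
    for x in states:
        s = symbol_of_state[x - l]
        probs[s] = probs.get(s, 0.0) + N / x
    return probs

# Hypothetical example: both symbols own half of the states, but 'a' always
# sits one position to the left of 'b'.
probs = effective_symbol_probs(list("abababab"), l=8, b=2)
print(probs)
```

Both symbols have l_s/l = 1/2, yet the symbol placed further left gets a probability slightly above 1/2, which illustrates the remark that shifting appearances left increases the encoded probability.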

7 Practical remarks and modifications

This section contains practical remarks for implementation of the presented coders and some additional modifications which can improve some of their properties for cryptography and error correction purposes.

7.1 Data compression programs are generally constructed in two ways:

• We use constant probability distribution of symbols. It could be generally known for given type of data or estimated by statistical analysis of the file. In the second case it has to be stored in the compressed file, or

• The used probability distribution is dynamically estimated while encoding the file, so that while decoding we can restore these estimations using already decoded symbols. This approach is a bit slower, but we don’t need to store probability distribution tables, we process the file only once and we can get good compression rates with files in which probability distribution of symbols varies locally.

ANS is perfect for the first case: using a table smaller than 100kB we can get a very precise coder which encodes about 8 bits for each use of the table. It has two problems:

• For each probability distribution we have to make a separate initialization. We could also store some number of such tables. Observe that while changing the coder, if b and l are the same, we can just use the same state.

• Decoding and encoding are made in opposite directions - we get different sequences for estimations. To solve this problem we should process the file twice: first make the whole prediction process from the beginning to the end, then encode it in backward order. Now decompression can be made straightforwardly. In Matt Mahoney’s implementations (fpaqa, fpaqc in [5]) the data is divided into separately compressed segments, for which we store q from the prediction process.

For ABS the situation is a bit different - we have mathematical formulas which are relatively quick to calculate and a much smaller space of probability distributions, but we can encode only one binary choice per step. We have generally two options:

• Calculate formulas for every symbol while processing data - it is much more precise and because of it we can use large b to transfer a few bits at once, but it can be a bit slower (fpaqc), or

• Store the tables for many possible q in memory - it has smaller precision, needs memory and time for initialization, but should be faster and we have large freedom of choosing coding/decoding functions (fpaqa).

7.2 Bit transfer and storing the tables

For ABS using the formulas we can use large b, but in other cases we should rather use b = 2. For ANS it means doing bit transfer many times in each step - this quick operation may become essential for the transfer rate of the coder. Intuition suggests that we should be able to join them into one operation per step: for example use AND with a proper mask to get the bits and make the corresponding right bit shift of the state. It looks like the first problem is the order of these bits - coding and decoding use them in reverse directions. But in fact in each step we know how many bits we should transfer, so we can just use the same direction for coding and decoding. The larger problem is to determine this number of digits to transfer: $k_s - [x < X_s]$. It requires a comparison and the use of small tables in which $k_s$, $X_s$ and maybe the mask are encoded on different bits. We could also store this information in the coding/decoding tables.
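The one-operation bit transfer described above can be sketched for b = 2 as follows. The function names are illustrative, and the way k_s and X_s are derived from l and l_s (smallest k_s with l_s 2^{k_s} ≥ l, and X_s = l_s 2^{k_s}) is an assumption consistent with X_s = l b^{k_s} q_s for q_s = l_s/l; a bit-by-bit loop is included only as a reference for comparison.

```python
def transfer_params(l, l_s):
    """Return (k_s, X_s) for b = 2: states x >= X_s transfer k_s bits,
    states x < X_s transfer k_s - 1 bits."""
    k_s = 0
    while (l_s << k_s) < l:
        k_s += 1
    return k_s, l_s << k_s

def transfer_bits(x, k_s, X_s):
    """One-step bit transfer: one comparison, one AND with a mask, one shift.
    Returns (reduced state, transferred bits, number of bits)."""
    nbits = k_s - (x < X_s)
    mask = (1 << nbits) - 1
    return x >> nbits, x & mask, nbits

def transfer_bits_loop(x, l_s):
    """Reference version: move out one bit at a time until x is in I_s."""
    bits, nbits = 0, 0
    while x >= 2 * l_s:
        bits |= (x & 1) << nbits
        x >>= 1
        nbits += 1
    return x, bits, nbits

# Example with assumed values l = 16, l_s = 3.
k_s, X_s = transfer_params(16, 3)
for x in range(16, 32):
    print(x, transfer_bits(x, k_s, X_s), transfer_bits_loop(x, 3))
```

In a real coder k_s, X_s and possibly the mask would be precomputed per symbol, so the per-step cost is the comparison, the AND and the shift.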

Let’s think about how to store the tables to find a compromise between memory needs and speed. Coding tables require, for a given symbol, $(b - 1) l_s$ values from $(b - 1) l$ possibilities. Usually $l_s$ isn’t constant, so to optimize for memory requirements we can encode them in one table of length (b−1)l: store C(s, x) as C[beginning[s]+x]+l where P beginning[s]:=(b − 1) s0

All these ideas require additional memory or time for using small tables. The best would be if during initialization we generated separate low level code for each symbol - with specific $X_s$, transfers and bit masks. They can be stored such that choosing the behavior for s is just a jump by some multiple of s positions.

7.3 The initial state

Stream coding/decoding requires choosing the initial state. The final state of one process has to be stored in the file to be able to reverse it. As it was previously mentioned - while changing coding tables, if l, b remain the same, we don’t have to change the state. The initial state can be freely chosen - as a fixed number or randomly. We don’t have to store intermediate states when we change the coding tables, but we have to store the final state. This state will be the initial one while decoding. The problem could be that we are wasting a few bits in this way. Usually it should be insignificant, but for example when we want to encode separately a huge number of small files, such bits could be essential. We can improve it by encoding some information in this initial state of the coder. We can do it for example by using a few steps of coding without bit transfer, starting from the x = 0 state. We can always do it using binary choices (ABS). Alternatively we could use ANS, but it would require creating tables for additional ranges.

7.4 Removing correlations

In the previous section we have seen that the probability distribution of produced bits (digits) isn’t perfectly uniform. It’s a very small effect and for correlations it would be even much smaller, but it could be significant if we would like for example to use it as a pseudorandom number generator. We could use some additional layer of encryption to remove correlations, but we can also do it in a simpler and faster way.

The first idea to equilibrate the probability distribution of digits is to negate (NOT) the transferred digits for every second processed symbol - e.g. in steps of even number. In this way we would make 0 and 1 equally probable, but some correlations would remain - ’00’, ’11’ would be a bit more probable than ’01’, ’10’.

If in one block of transferred bits we have ’0’, it’s a bit more probable that a few bits further (in the next block) we will have ’1’. This idea can be thought of as making XOR with ’00000...’ and ’11111...’ cyclically. We can improve it by using some longer, randomly looking sequence of numbers in the $\{0, .., \max_s b^{k_s}\}$ range. They can be generated using some pseudorandom number generator or even chosen somehow optimally and fixed in the coder as its internal parameters. We have to be able to recreate this sequence for decoding and store the number of the last position in the file. Now in each step of coding we take successive numbers from this cyclical list and, before transferring the digits, make XOR with the element from this list. While decoding we have to use the same list, and make XOR before using the obtained bits. In this way we can reduce correlations by as many orders of magnitude as we need. The block length varies practically randomly, so knowing this list wouldn’t allow one to remove this transformation.
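The cyclic XOR mask idea can be sketched as below. This is a simplified illustration under stated assumptions: blocks are represented as explicit (value, number of bits) pairs, whereas a real decoder learns the block length only from its tables, and the mask list comes from Python's `random.Random` with a hypothetical seed standing in for any reproducible generator.

```python
import random

def make_masks(seed, count, max_bits):
    """Cyclic list of pseudorandom masks, reproducible from the seed."""
    rng = random.Random(seed)
    return [rng.getrandbits(max_bits) for _ in range(count)]

def mask_blocks(blocks, masks, start=0):
    """Coder side: XOR each transferred block (value, nbits) with the next
    element of the cyclic list. Returns masked blocks and the last position."""
    out, pos = [], start
    for value, nbits in blocks:
        m = masks[pos % len(masks)] & ((1 << nbits) - 1)
        out.append((value ^ m, nbits))
        pos += 1
    return out, pos

def unmask_blocks_reversed(masked, masks, end_pos):
    """Decoder side: blocks are visited last-to-first, so the position
    stored in the file counts down."""
    out, pos = [], end_pos
    for value, nbits in reversed(masked):
        pos -= 1
        m = masks[pos % len(masks)] & ((1 << nbits) - 1)
        out.append((value ^ m, nbits))
    out.reverse()
    return out

masks = make_masks(seed=7, count=3, max_bits=4)
blocks = [(0b101, 3), (0, 0), (0b1, 1), (0b1111, 4), (0b10, 2)]
masked, end_pos = mask_blocks(blocks, masks)
print(masked, end_pos)
print(unmask_blocks_reversed(masked, masks, end_pos))
```

Because XOR is its own inverse, the decoder only needs the same list and the stored final position to undo the masking.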

7.5 Artificially increasing the number of states

Usually the number of states is (b − 1)l, but we will see in the next chapter that sometimes it’s not enough. There are generally two ways to artificially increase it exponentially:

• Intermediate step(s) - the base of security of an ANS based cryptosystem is that the length of blocks and the state vary practically randomly. These effects are strongly weakened if we want for example to encrypt, without compression, standard data with uniform probability distribution. To cope with this problem we can for example introduce an intermediate step with an even randomly chosen probability distribution of symbols. A stream coder/decoder in one step changes a block of bits into a symbol or oppositely. We can combine such steps: the decoder changes a block of bits into a symbol of given probability distribution and immediately the encoder changes it into a new block of bits. Encoder and decoder have their own, completely separate states and modify them in opposite directions ($(y, y') \to \approx (\{y - c_s\}, \{y' + c_s\})$). It looks like we move on a straight line on this two-dimensional torus, but because of impreciseness this line diffuses in the second direction and asymptotically should cover this ’torus’ uniformly - the total number of states is practically the square of the original one. Surprisingly, because they use separate states, encoder and decoder can be reverses of each other. This approach is slower, but can be useful for cryptographic applications.

• Additional sequence of bits - while using ANS as an error correction method, the internal state of the coder contains something like a hash value of the already processed message. So if it has a small number of possibilities, we can accidentally get the correct value with a wrong correction. The search for the proper correction requires a lot of steps, so they should be as fast as possible. In the previous point, in each step we’ve changed the whole internal state of the coder - each use of a table changes one part of it, so it is relatively slow. To make it faster, we should use the table only once per step - change only part of the internal state of the coder. We could do it sequentially, but it would just separate the data into subsequences processed separately. An example of a practical way is to expand the state of the stream coder by some cyclical table (t) of bits (or possibly short bit sequences). Now coding is to make bit transfer, then switch the youngest bit of this reduced state with the bit in the given position in this table, increase this position cyclically and finally use the coding table. The decoding step: use the decoding table, decrease the position, switch the youngest bit and make bit transfer.

    Stream decoding:
    {(s,x)=D(x);
     use s;
     i--; switch (x AND 1)↔t[i];
     while(x∉I) x=xb+’digit from input’
    }

    Stream coding(s):
    {while(x∉Is)
       {put mod(x,b) to output; x=⌊x/b⌋}
     switch (x AND 1)↔t[i]; i++;
     x=C(s,x)
    }

To ensure that this bit switch doesn’t get us out of $I_s$, we have to enforce that all $l_s$ are even. These switches increase the impreciseness of the coder a bit, but if we switch only one bit, it is practically insignificant. This table has to be stored somehow in the output file. It’s cyclical, so instead of storing the position we can rotate it so that decoding starts with the first position. We see that we can in a fast and simple way increase the number of internal states as much as we want. To make it faster we can represent this table of bits as one or a few large numbers.
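The coding and decoding steps with the additional cyclical table of bits can be tried out on a toy coder. The following is a minimal sketch for b = 2 under several assumptions: the symbol appearances are spread with a simple sorted uniform spread (not the ScD initialization), the counts l_s, the initial state and the table t are hypothetical example values, and all function names are illustrative.

```python
def build_tables(counts):
    """Toy tables for b = 2: states l..2l-1, symbol s appears counts[s] times.
    Appearances are spread roughly uniformly (a stand-in initialization)."""
    l = sum(counts.values())
    slots = sorted(((i + 0.5) / c, s, i) for s, c in counts.items() for i in range(c))
    enc, dec = {}, {}
    for pos, (_, s, i) in enumerate(slots):
        x = l + pos
        dec[x] = (s, counts[s] + i)      # D(x) = (s, reduced state in I_s)
        enc[(s, counts[s] + i)] = x      # C(s, reduced state) = x
    return l, enc, dec

def swap_low_bit(x, t, i):
    """Exchange the youngest bit of x with t[i]; returns the new x."""
    low = x & 1
    x = (x & ~1) | t[i]
    t[i] = low
    return x

def encode(symbols, counts, x, t):
    _, enc, _ = build_tables(counts)
    t, i, out = list(t), 0, []
    for s in symbols:
        while x >= 2 * counts[s]:        # bit transfer until x is in I_s
            out.append(x & 1)
            x >>= 1
        x = swap_low_bit(x, t, i)        # stays in I_s because l_s is even
        i = (i + 1) % len(t)
        x = enc[(s, x)]
    return x, t, i, out

def decode(n_symbols, counts, x, t, i, bits):
    l, _, dec = build_tables(counts)
    t, bits, symbols = list(t), list(bits), []
    for _ in range(n_symbols):
        s, x = dec[x]
        symbols.append(s)
        i = (i - 1) % len(t)
        x = swap_low_bit(x, t, i)
        while x < l:                     # read digits back in reverse order
            x = 2 * x + bits.pop()
    symbols.reverse()                    # decoding runs backwards
    return symbols, x, t

counts = {"a": 6, "b": 2}                # even l_s, l = 8 (assumed example)
message = list("abaabaaabbaaaaba")
x_end, t_end, i_end, bits = encode(message, counts, 8, [0, 1, 1, 0])
decoded, x_start, t_start = decode(len(message), counts, x_end, t_end, i_end, bits)
print("".join(decoded), x_start, t_start)
```

Decoding needs the final state, the final contents of t and the final position, which matches the remark that the enlarged state has to be stored in the output file.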

The problem is that this large state will be required to start decoding, so we have to store it in the file. If it is used for error correction, it has to be well protected. For this purpose the initial state of the coder should be some constant of the coder, which allows making the final verification. Optionally we could also encode some information in this initial state of the coder as previously.

8 Cryptographic applications

Asymmetric numeral systems were created for data compression purposes, but this simple and seemingly new idea of coding has some properties which make it very promising also for cryptography and error correction purposes. It can even fulfill all these purposes simultaneously.

8.1 Pseudorandom number generator and hashing function

We have seen that we can think about the state of the coder as a hidden random variable which chooses the current behavior - the state change and the produced bits. As we would expect from an entropy coder, the output bit sequence is nearly uniform and practically uncorrelated. Unfortunately it’s not perfect, but we can use not the whole state but only some of its youngest bits, which would reduce correlations greatly. Additionally we could for example use some set of masks as in the previous section.

Pseudorandom number generators (PRNG) are initialized by a so-called seed state: they generate randomly looking sequences, but if we use the same seed, the obtained sequence is also the same. To use a PRNG in cryptography, it has to meet additional requirements: having a sequence generated by it, we cannot get any information about the seed or further/previous bits. In the next subsection we will see that with properly chosen parameters, we shouldn’t even be able to reveal the sequence of symbols used to generate the random bit sequence.

So to use ANS as a pseudorandom number generator we have to choose some coding function, for example initialized using the seed state. Now we have to feed it with some sequence of symbols. If this sequence is periodic, after some multiple of this period the state of the coder would be the same - the bit sequence would also be periodic. But this period is much longer than the period of the symbol sequence: longer by a factor of about the number of internal states of the coder. In the previous section we have seen that this number can be easily increased as much as we want, so in practice the sequence of symbols can be taken from some very weak pseudorandom generator, or even taken as some fixed periodic sequence.

Hashing functions change files into a short, randomly looking sequence of given length. We shouldn’t be able to get any information about the file from it. Additionally, it should be practically impossible to find two files which give the same value. To fulfill these requirements, we can for example increase the number of states of the coder by using an additional table of bits as previously, process the message and for example return this table as the hash value. If we didn’t increase the number of states, someone could find two prefixes giving the same state and switch them. We could also prevent finding two messages with the same hash value by encoding the message twice - forward and backward. For example we can decode the file into a sequence of symbols of some fixed/generated probability distribution, then change the state and encode it back into a sequence of digits. Without changing the state we would just get the same file, but any change would make us produce a practically random sequence - we can now for example combine some youngest bits of the last used states to get the hash function. For this purpose extremely small correlations should be completely insignificant. If needed, we could easily reduce them.

8.2 Initialization for cryptosystem

For given parameters we still have a huge number of coding functions with practically the same statistical properties, but producing completely different encoded sequences. If we make the selfcorrecting diffusion initialization using some PRNG initialized with a given key, we would get a practically unique coding function for this key. If we used it to encode some information, it looks practically impossible to decode the message without knowing the key. We will now take a closer look at such an approach to data encryption.

First of all, let us focus on the ScD initialization. It consists of a large number of picks of a random symbol from some large table. The coding table is approximately given by the symbol probability distribution, but it looks practically impossible to find its precise values without knowing the key. The initialization process strongly depends on its history, which creates a specific symbol distribution in the symbols table - even while knowing the key, it looks practically impossible to find C(s, x) without making the whole previous initialization (for smaller x). So to start decoding we practically have to make the whole initialization. Observe that we can force the PRNG to require as much time to be calculated as we want, for example:

    for i=1 to N do {k=random; read k random values}

makes that we statistically know approximately in which position of the PRNG we will be, but to find the exact position we just have to make all the calculations. We see that this way we can enforce some time required for initialization. Connecting it with the unpredictability of ScD initialization, we see that such a cryptosystem would be extremely resistant to brute force attacks. The standard approach makes all computations while processing the file, so to check if a given key is correct we can just start decrypting the file and observe whether the output, for example, isn’t a completely random sequence. In the presented approach, most of the computation is made during initialization: to check if a given key is correct we have to spend a given time to make the initialization, for example enforced to take about 0.1s - it’s a few orders of magnitude longer than in the standard approach. After initialization the processing of the data uses already calculated tables - it is much faster than in the standard approach.
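The time-enforcing loop quoted above can be sketched as follows. This is an illustration only: Python's `random.Random` (a Mersenne Twister, not a cryptographically secure generator) stands in for the key-seeded PRNG, and the names `advance_prng`, `prepared_prng`, `rounds` and `max_skip` are hypothetical.

```python
import random

def advance_prng(rng, rounds, max_skip):
    """for i = 1..rounds: draw k, then read and discard k random values.
    The final position depends on every earlier draw, so it can only be
    found by actually performing all the calculations."""
    for _ in range(rounds):
        k = rng.randrange(max_skip)
        for _ in range(k):
            rng.random()
    return rng

def prepared_prng(key, rounds=1000, max_skip=64):
    """Seed a generator with the key and advance it by a key-dependent amount.
    The returned generator would then drive the table initialization."""
    return advance_prng(random.Random(key), rounds, max_skip)

# The same key reproduces the same stream; another key gives a different one.
print(prepared_prng(42).random(), prepared_prng(42).random(), prepared_prng(43).random())
```

The expected cost grows roughly like rounds × max_skip / 2 generator calls, so these two parameters set the enforced initialization time per tested key.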

Now assume that someone would get the coding function - does it mean that he can retrieve the key? This function says which symbols were chosen in each step, but each symbol could be chosen in many ways, so in fact he wouldn’t have the sequence of used random variables, but only some sets of its possible values - even using a weak PRNG it looks practically impossible to deduce the key. Alternatively we could use some secure PRNG, for which it is ensured that knowing the exact sequence, we couldn’t find its seed state and so the key. This property suggests an extremely powerful additional protection - use not only the key as the seed state, but also some number which can even be stored in the file. Now after every encrypted fixed number of bits (like a gigabyte), we change this number, store it and use it to generate new coding tables. The size of these blocks should be chosen so that it wouldn’t be possible to retrieve any essential information from them. The behavior of each one is practically unrelated, so their information couldn’t be connected for finding the key.

Sometimes we would like to make encryption and entropy coding at the same time. The question is what to do with the probability distribution of symbols. There wouldn’t be a problem if we used some adaptive prediction method, but it would also require using many different coding tables. We will see that these tables should be rather large, so sometimes it might be better to use a fixed probability distribution of symbols. It has to be stored in the header and so is easily accessible. We will see that such knowledge shouldn’t make breaking the code easier, but it gives some knowledge of the file content, which can be unwanted. To prevent it, this header can be encrypted separately using the same key, but probably in some different way.

8.3 Processing the data

The coder uses the state, which is a hidden, practically random variable. The also hidden, randomly generated local behavior of the coding function defines the current behavior - how many digits to produce and to which state to jump. Blocks created this way are relatively short, but they have various, practically randomly chosen lengths. This picture looks perfect, but unfortunately there are some weaknesses which could give some information about the statistics of symbols or even the coding function. They would vanish if we used some additional layer of standard encryption, but I will try to convince the reader that using only ANS with proper parameters and some quick and simple modifications, we can make really safe and fast encryption.

• First of all, as it was mentioned in the previous section - the base of the randomness of the state is that we don’t use a uniform distribution of symbols (asymmetry) and that some symbols have probability not being an integer power of b. This assumption is in practice automatically fulfilled when we make encryption and entropy coding at the same time, but sometimes we would like just to encrypt some more or less uniform byte sequence. The best way to cope with this problem is to use the intermediate step from the previous section - using the same PRNG choose some probability distribution of symbols and then in one step encode a byte into a symbol and immediately use it to produce the output bit sequence. Alternatively, if we want to make it quicker - use only one step: we can use the same PRNG to randomly modify the uniform distribution among bytes a bit and treat the input sequence as a sequence of such symbols. The cost is that the state doesn’t change as fast as previously and that the output file is a bit larger than the input. We also have a smaller number of possible states of the coder this way. Alternatively we could use so called homophonic substitution - to each symbol assign a few new ones and choose among them using some separate (hardware) random number generator, but it would increase the size of the message.

• If we encoded the same sequence starting from the same state of the coder, we would get the same output. To prevent attacks based on such situations, we should increase the number of its internal states. In the previous section some ways to do it were shown - use some correlation removing method, an intermediate step or an additional bit sequence.

• As it was previously mentioned, because the probability of being in a given state (x) is not uniform, but is decreasing (∝ 1/x), some produced blocks of digits are a bit more probable (with smaller digits). These differences are extremely small and because of the various block lengths I don’t see a way to use it to find a given block structure or some precise information about the coding function. But analyzing statistically a huge amount of data, one could evaluate the probability distribution of block lengths, which gives some information about the probability distribution of symbols. To prevent it we can use some of the presented methods of removing correlations. We could also sometimes generate a new coding function, for example with the same key but with some new additional number, as presented above.

• Transferred digits are the youngest digits of the state.
If one would have both ciphertext and corresponding plaintext, would make a correct assumption about the internal state and blocking in given moment and knew precisely used probability distribution of symbols, he could track the history of the processing, which would reveal used coding table. Let’s focus on such scenario. ≈ x Knowing probability distribution of symbol, we know that x → k −[x

\[ l > q_s^{-2} \tag{43} \]

in the presented scenario the number of possibilities the person would have to consider would grow exponentially, making such an attack completely impractical. Observe that $q_s$ is on average 1/n, so the above condition also says that $l > n^2$. In practice any presented method for increasing the number of internal states should also prevent such scenarios.

• Having a lot of ciphertext and corresponding plaintext, one could try to make some statistical analysis to connect symbols with blocks. Because of the various lengths of blocks it doesn’t look practical, but to prevent such eventualities it would be expected that every symbol could produce practically all possible youngest digits of the state. A given symbol can produce $(b - 1) l_s = (b - 1) l q_s$ different states, and what matters (is shown in the encrypted file) are, let’s say, the $k_s$ youngest digits ($1/q_s$ values on average), so we again get the $\sqrt{l} > 1/q_s$ condition. Again any other modification would also give good protection.

• If one can use initialized coder (adaptive scenario) and has some message en- crypted with it, he could try to use different inputs, slowly exposing succeeding bits of the plaintext. This is unavoidable weakness of using short block length cryptosystems. Fortunately there is simple universal protection against such rare scenarios: add a few random bytes at the beginning of the file before starting encryption or choose the initial state randomly (this time not using PRNG used for initialization). In this way while encrypting the same data, we will most probably get different output, which still can be decrypted into the same input data.

I cannot assure that this list is complete, but for the moment I cannot think of more weaknesses which could be used to break ANS based encryption. We can easily protect against all of them. To summarize, while designing a cryptosystem based on ANS, we should:

• Ensure the asymmetry - that the probability distribution of symbols is not uniform,

• Use b = 2 for which state probability distribution is nearest uniform,

• Use large $l > n^2$ or even $\forall_s \; l > q_s^{-2}$. So to make the coder faster (larger n), we should use correspondingly large tables,

• Use some correlation removing modification and possibly additionally increase the number of internal states of the coder,

• Eventually choose randomly the initial state of coder.
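The numerical conditions from this list are easy to check for a concrete parameter set. The sketch below is only illustrative: the function name, the returned dictionary keys and the example distributions are assumptions, and passing these checks is of course not a proof of security.

```python
def check_ans_crypto_parameters(l, b, probs):
    """Check the design rules listed above for given l, b and symbol
    probabilities q_s (probs). Returns one boolean per rule."""
    n = len(probs)
    return {
        "asymmetric": max(probs) != min(probs),          # distribution is not uniform
        "b_is_2": b == 2,
        "l_above_n_squared": l > n**2,
        "l_above_q_inverse_squared": all(l > q**-2 for q in probs),
    }

# Hypothetical examples.
print(check_ans_crypto_parameters(l=64, b=2, probs=[0.5, 0.3, 0.2]))
print(check_ans_crypto_parameters(l=16, b=2, probs=[0.25, 0.25, 0.25, 0.25]))
```

The last condition is the strictest one: a single rare symbol with small q_s forces a large table, which is why a faster coder (larger n) needs correspondingly larger l.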

9 Near Shannon’s limit error correction method

While compressing a file we remove some redundancy caused by statistical properties. Using forward error correction methods, we are adding some easily recognizable redundancy, which can be used to correct some errors. In the standard approach we usually divide the message into short independent blocks. The problem is that it’s vulnerable to pessimistic cases (large local error concentrations) - if the number of errors exceeds some boundary, we lose the whole block. This section will show how to connect their redundancy to treat the whole message as one block. I will focus on using ANS for this purpose, but the presented approach is more general - fig. 3 shows how to use it for any block code and a hashing function, or even only a hashing function. For simplicity let us assume the simplest channel for this paper: memoryless, symmetric. That means that there is some fixed probability ($p_b \in [0, 0.5)$) that a transmitted bit will be changed (0 ↔ 1). So while transmitting N bits, about $N p_b$ of them will be damaged.

For a channel of given statistics of errors (noise), we can speak of Shannon’s limit - the theoretical maximal information transfer rate. Constructions used to show that this limit is achievable are completely impractical. Near this limit are Low-Density Parity-Check Codes (LDPC) ([8],[9]), but their exact correction still requires solving an NP-hard problem. So in practice codes are used which divide the message into independent blocks, which makes them vulnerable to pessimistic cases. For example for $p_b = 0.01$, we should be able to construct a method which adds asymptotically a bit more than 0.088 bits of redundancy/transmitted bit and is able to fully repair the message. Compare it with the commonly used (7,4) Hamming code - it adds 3 bits of redundancy per 4 transmitted bits to be able to correct 1 damaged bit per such 7 bit block. It uses much more redundancy: 0.75 bits/transmitted bit, but because sometimes we have more than one error in a block, we lose about 16 bits/transmitted kilobyte and we don’t even know about it.
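The numbers quoted for p_b = 0.01 can be reproduced with a few lines. This is a sketch under stated assumptions: redundancy is counted per payload bit, a block is counted as lost whenever it contains two or more flipped bits, and a kilobyte is taken to mean 1024 bytes of payload.

```python
import math

def binary_entropy(p):
    """h(p) in bits for a binary symmetric channel with flip probability p."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def min_redundancy_per_payload_bit(p_b):
    """Redundancy bits per payload bit needed at Shannon's limit."""
    h = binary_entropy(p_b)
    return h / (1 - h)

def hamming74_block_failure(p_b):
    """Probability that a 7-bit block contains 2 or more flipped bits."""
    ok = (1 - p_b) ** 7 + 7 * p_b * (1 - p_b) ** 6
    return 1 - ok

p_b = 0.01
red = min_redundancy_per_payload_bit(p_b)
fail = hamming74_block_failure(p_b)
payload_bits = 8 * 1024                 # one kilobyte of payload (assumed 1024 bytes)
lost_bits = fail * payload_bits         # (payload_bits / 4) blocks, 4 payload bits each

print("redundancy at Shannon's limit:", red)
print("Hamming (7,4) redundancy     :", 3 / 4)
print("P(block failure)             :", fail)
print("lost payload bits per kB     :", lost_bits)
```

With these conventions the limit comes out just under 0.088 bits of redundancy per payload bit, and the Hamming code loses roughly 16 to 17 payload bits per kilobyte.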

Imagine we have some channel with a known statistical model of error distribution. To transmit an undamaged message through it, we have to add some redundancy 'above' the given error density. We know only the statistics of errors, not when exactly they will appear - so this density of redundancy should be chosen practically constant. But the density of errors fluctuates - sometimes it is locally high, sometimes low. We see that while dividing the data into independent blocks, we have to choose the density of redundancy according to some pessimistic error density in such a block. But in fact there usually isn't any pessimistic level - we only know that the worse the case, the more rarely it occurs. So in this way for most blocks much more redundancy is used than required, but for some of them this amount is still not sufficient. We see that to obtain a really good correction method, we should treat the message as a whole. In LDPC this is done by uniformly distributing some large number of parity checks. The presented approach divides the message into blocks, but their redundancy is connected by the internal state of the coder, which contains something like a checksum of the already processed message and is used to choose local behavior. Using these redundancy connections we can intuitively 'transfer' surpluses of unused redundancy to cope with pessimistic cases. We will see that we are able to get near Shannon's limit this way with practically linear expected time of correction.

9.0.1 Advantage of connecting redundancy of blocks

Two approaches to connecting the redundancy of blocks, as in fig. 3, will now be shown. In fact we will do it later in a more flexible and usually faster way, but these approaches show the advantages of this new correction mechanism. The analysis and methodology from further subsections apply also to these approaches.

Figure 3: Two simple schemes of block codes with connected redundancy. On the top there is an example of usage of the standard triple modular redundancy block code - we send three copies of each bit and decode as the value with more appearances. In the middle it is shown how to modify it to connect the redundancy of blocks - we use a hash value of the already processed encoded message, stored in the state of the decoder, and modify the original block by XORing it with some 3 bits of this state. On the bottom there is a simpler approach, which doesn't need a standard block code - in one block we place some 2 bits of the state of the decoder (a kind of checksum) and the bit to encode in the third one.

The basic tool to connect the redundancy of blocks is some hash function, which allows us to deterministically assign some shorter, practically random bit sequence to the already processed encoded message. In practice it is usually done by some automaton which has a state containing such a hash value up to the given position and changes this state while processing succeeding portions of the message. Look at the middle of the figure - after using the original block code (like 1 → 111), we XOR it with some bits of this state, containing practically random bits determined by the already processed message (like XOR(111, 110) = 001). Now while decoding - if the given block wasn't damaged and the state is correct, the XOR of the state and the block should be a codeword (000 or 111). If not - we know that there was a damage. Now as long as in each block there is at most one damaged bit, we immediately know which bit we should repair - the behavior is similar to that for independent blocks. The advantage starts when more errors occur in one block - from this moment the state will most probably be different than expected, and so for each block with some probability ($p_d = 1 - 2/2^3 = 3/4$) we should observe that it is damaged - this is much more often than expected for the proper correction and so suggests where to search for additional corrections. The bottom example shows that we don't really need to rely on a block code - in some positions of the block we place some bits of the state (checksum) and the bit(s) we want to encode in the rest of them. Now we immediately observe if positions with the checksum were damaged. If the essential bits were damaged, we will conclude it later thanks to this new correction mechanism ($p_d$ is still 3/4). We will see that, as this suggests, using only this new correction mechanism we can already get near Shannon's limit. The last subsection shows how to combine it with the block code mechanism, as in the middle picture, to correct simple damages immediately, additionally speeding up the correction process.
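The middle scheme of fig. 3 can be sketched in a few lines. This is a toy illustration under stated assumptions, not the construction used later: the state is a SHA-256 hash chained over the repaired blocks (the functions `next_state` and `mask` are hypothetical choices), and the decoder here only flags suspicious blocks instead of searching for corrections.

```python
import hashlib

def next_state(state: bytes, block: tuple) -> bytes:
    # Assumed state update: hash of the previous state and the encoded block.
    return hashlib.sha256(state + bytes(block)).digest()

def mask(state: bytes) -> tuple:
    # Three "practically random" bits taken from the state.
    return tuple((state[0] >> k) & 1 for k in range(3))

def encode(bits, init: bytes = b"init"):
    """Triple modular redundancy, each block XORed with 3 bits of the state."""
    state, blocks = init, []
    for b in bits:
        block = tuple(b ^ mk for mk in mask(state))
        blocks.append(block)
        state = next_state(state, block)
    return blocks

def decode(blocks, init: bytes = b"init"):
    """Majority decoding; returns the bits and the indices of blocks seen as damaged."""
    state, bits, flagged = init, [], []
    for i, block in enumerate(blocks):
        m = mask(state)
        plain = tuple(x ^ mk for x, mk in zip(block, m))
        bit = 1 if sum(plain) >= 2 else 0
        if plain not in ((0, 0, 0), (1, 1, 1)):
            flagged.append(i)  # not a codeword: something was wrong
        bits.append(bit)
        # The state follows the repaired block, so a wrong repair derails it.
        state = next_state(state, tuple(bit ^ mk for mk in m))
    return bits, flagged
```

With a single damaged bit in a block the decoder behaves like ordinary triple redundancy. With two damaged bits the repair is wrong, the state diverges, and roughly three quarters of the following blocks are flagged, which is the signal that an earlier correction has to be revisited.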

9.1 Very short introduction to error correction

Forward error correction can be imagined as choosing, among all sequences, some allowed ones - so-called codewords. They have to be 'far' enough from each other, so that when we receive a damaged sequence, we are able to uniquely determine the 'nearest' allowed one. Of course we would also want the probability that it is really the correct sequence to be large enough. In other words we divide the space of all sequences into separate subsets - a kind of balls around codewords. The 'thicker' these balls are, the larger the probability that we make the correction properly. In the standard approach we divide the message into blocks of fixed length, which are encoded independently. In this case we can use the Hamming distance - the number of positions on which two given sequences of bits differ. For example the triple modular redundancy code uses a 3 bit sequence to encode 1 bit - in the space of $2^3$ possible 3 bit sequences, 2 codewords (000, 111) are chosen, which are centers of balls of Hamming radius 1. So if while transmitting a given 3 bit block at most one bit was changed, we can correct it properly. If the number of changed bits is larger than one, we get into a different ball - it is corrected in the wrong way and we don't even know it.

Let us focus on a memoryless symmetric channel: if we received '1', with probability $1 - p_b$ it was really '1' and with probability $p_b$ it had to be '0'. If we knew which of these cases we are in, we would get exactly one bit of information. To distinguish them, $h(p_b) = -p_b \lg(p_b) - (1 - p_b) \lg(1 - p_b)$ bits of information are needed, so such an 'uncertain bit' contains $1 - h(p_b)$ bits of information - to transmit $N$ real bits, we have to transmit at least

$$\frac{N}{1 - h(p_b)} \qquad (44)$$

such 'uncertain bits' - this is the so-called Shannon's limit, and the channel coding theorem says that theoretically we can get as near as we want to this capacity. It means that for a channel with given statistics of errors, we should be able to construct an error correction method which uses a bit more than $\frac{1}{1 - h(p_b)} - 1 = \frac{h(p_b)}{1 - h(p_b)}$ bits of redundancy per bit of message and is able to completely repair the message.

While working on such potentially infinite blocks, the number of damaged bits tends to infinity, so we can no longer work with the Hamming distance. Now while transmitting a given codeword $T$ of length $N$ bits, a message $R$ will be received with damaged bits on more or less $N p_b$ positions. The positions of these errors can be stored as a length $N$ bit sequence $E$, such that

$$R = T \oplus E \qquad (45)$$

where $\oplus$ denotes addition modulo 2 of two bit vectors (XOR). This $E$ vector statistically should be chosen as one of $\binom{N}{N p_b} \approx 2^{N h(p_b)}$ possibilities. From (45) we see that it is also the number of possible received sequences corresponding to one codeword. If we divide the number of all possible received sequences by this number, we get Shannon's limit again: $2^N / 2^{N h(p_b)} = 2^{N(1 - h(p_b))}$. In fact the number of damaged bits is close but rather not exactly equal to $N p_b$. But if we have some large number ($N$) of independent identically distributed random variables of entropy $H$, their outcome is almost certain to be in some set of size $2^{NH}$, all members of which have probability 'close to' $2^{-NH}$ - this is the so-called 'asymptotic equipartition' property ([9]). This set is called the typical set, for example:

$$\left\{ x \in \{0,1\}^N : \left| \frac{1}{N} \lg \frac{1}{p^{\#\{i : x_i = 1\}} (1 - p)^{\#\{i : x_i = 0\}}} - h(p) \right| < \beta \right\}. \qquad (46)$$

For all $\beta > 0$ and correspondingly large $N$, such a set contains almost the whole probability. A subrange of the typical set is asymptotically also typical, so these practically $Np$ copies of '1' should be spread more or less uniformly.
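The approximation $\binom{N}{N p_b} \approx 2^{N h(p_b)}$ and the membership condition (46) can be explored numerically. The sketch below is illustrative only; the helper names are assumptions, and the binomial coefficient is evaluated through the log-gamma function to avoid huge integers.

```python
import math

def h(p: float) -> float:
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def log2_binomial(n: int, k: int) -> float:
    """lg of the binomial coefficient C(n, k)."""
    ln = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return ln / math.log(2)

def in_typical_set(x, p: float, beta: float) -> bool:
    """Membership test for the set (46) for a 0/1 sequence x."""
    ones = sum(x)
    zeros = len(x) - ones
    surprise = -(ones * math.log2(p) + zeros * math.log2(1 - p)) / len(x)
    return abs(surprise - h(p)) < beta

if __name__ == "__main__":
    pb = 0.01
    for n in (100, 10_000, 1_000_000):
        # The per-bit exponent approaches h(pb) as n grows.
        print(n, log2_binomial(n, round(n * pb)) / n, h(pb))
```

The per-bit difference between the two exponents shrinks like $\lg(N)/N$, which is why it can be neglected in the capacity calculation.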

Shannon's coding theorem says that we can get as close to the theoretical limit as we want and that we should be able to correct practically all possible typical errors. So we should look for the proper correction among typical ones with $p_b$ probability of '1'. The standard proof generates the set of codewords randomly, modifies this set (removes some codewords) and shows that for large $N$, with probability asymptotically going to 1, we can properly determine the transmitted codeword. Unfortunately it would require checking an exponentially large set of possible corrections - it rather cannot be done in practice.

9.2 Path tracing approach

First of all, let us focus on a sketch of a different but still impractical proof: using a hashing function. Such a function allows us to assign to each message some shorter, practically random bit sequence. Assume now that we transmit the original message of length $N$ bits through the channel, and its hash value, 'a bit longer' than $N h(p_b)$ bits, through some different noiseless channel. Now the receiver can check 'all typical corrections' ($2^{N h(p_b)}$) of the received message, and almost certainly only one (the proper one) will give the expected hash value. If we would like to transfer the hash value through the same noisy channel, we can analogously send additionally its hash value and so on (until it is smaller than some limit value which can be encoded in some different way). So finally we would asymptotically need at least

$$N \left( 1 + h(p_b) + h^2(p_b) + \dots \right) = \frac{N}{1 - h(p_b)} \ \text{bits}$$

Observe that 'a bit longer' can mean that the number of hash values is larger only polynomially in $N$ than the number of typical corrections - there still almost certainly will be only one proper typical correction and we will get asymptotically exactly Shannon's capacity. It remains to make precise what 'all typical corrections' means. For a theoretically infinite data stream we should be able to take the $\beta \to 0$ limit in (46). It can be achieved by intersecting sets of corrections passing verification for some sequence $\beta_i \to 0$. For a data stream of finite length, we could take the proper correction which is nearest to typicality (smallest $\beta$), but we will see that the best one is the correction which changes the smallest number of bits.

We will now modify this method to make it practical - instead of making one huge verification, intuitively we will spread it uniformly over the whole message. Thanks to this we will be able to detect errors not only at the end of the process, but also shortly after they appear: after an error, in each step we have some fixed probability ($p_d$) to detect that something was wrong. We will pay for this parameter in capacity, but when it exceeds some critical point, the number of corrections not detected by this mechanism will no longer grow exponentially. So we will no longer require that the number of possible hash values (states of the coder) grows exponentially - the relative cost of storing it will vanish asymptotically. This threshold corresponds to Shannon's capacity, but in practical (nearly linear) correction methods an additional problem appears with large error concentrations and we should use a bit larger capacity. We will see it in the next subsection. The situation looks like in fig. 4: the transmitted codeword (correct path) is denoted by the thick line. We start with the fixed initial state and try to process succeeding bits. While we are on the correct path, the state changes in a random-looking way among all allowed states. After an error (we have lost the path), the state also changes in a random-looking way, but this time among all states - in each step there is some probability ($p_d$) that we will get to a forbidden state (observe that something was wrong). The problem is that after an error, before it is detected, the coder can accidentally get into the correct state - we would go back to the correct path without a possibility to detect that we have made a wrong correction. We will see later that with proper selection of parameters, errors will appear more slowly than we can correct them - the probability of such a situation will drop asymptotically to zero.

Figure 4: Schematic picture of the path tracing correction algorithm. If the number of states is large enough, corrections to consider should no longer create cycles as in the figure, and so they form a tree.

We can use an entropy coder with internal state for such a path tracing purpose: add a forbidden symbol of probability $p_d$, marking its appearances as forbidden states, and rescale correspondingly the probabilities of the rest of the symbols (allowed ones). We could alternatively use an arithmetic coder, but ANS is faster, has useful modification capabilities and is generally simpler, so I will concentrate on it. If we want to encode a symbol sequence with $(q_s)_s$ probability distribution, we have to use correspondingly the $((1 - p_d) q_s)_s$ probability distribution instead. Now while encoding we use only these allowed symbols. If there were no errors, while decoding we would also use only allowed symbols, but after an error we would produce a practically random sequence of symbols, so in each step we have probability $p_d$ of trying to produce the forbidden symbol and so detecting that there was an error. To use ANS for this purpose we would rather need some method to increase the number of internal states of the coder, to reduce the probability of wrong correction. The initial state can generally be fixed for the encoding process, but the final one has to be stored, probably in the header of the file. So some separate strong error correction method has to be used for it, so that we can be sure that this initial decoder state is proper.

What is the cost of adding such a forbidden symbol? The data sequence contains on average $H = -\sum_s q_s \lg(q_s)$ bits per symbol. After the rescaling, we will use on average

$$H' := -\sum_s q_s \lg((1 - p_d) q_s) = H - \lg(1 - p_d) \ \text{bits/symbol.} \qquad (47)$$
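The cost (47) can be verified directly for a small distribution. This is a minimal sketch; the function names and the example distribution are illustrative assumptions, and no entropy coder is involved, only the probabilities.

```python
import math

def entropy(q) -> float:
    """H = -sum_s q_s lg(q_s) in bits per symbol."""
    return -sum(x * math.log2(x) for x in q if x > 0)

def with_forbidden_symbol(q, pd: float):
    """Rescale allowed symbols to (1 - pd) q_s and append the forbidden symbol."""
    return [(1 - pd) * x for x in q] + [pd]

def cost_per_symbol(q, pd: float) -> float:
    """Average bits per symbol when q-distributed data is coded with the rescaled model."""
    return -sum(x * math.log2((1 - pd) * x) for x in q if x > 0)

if __name__ == "__main__":
    q, pd = [0.5, 0.25, 0.25], 0.25
    print(with_forbidden_symbol(q, pd))
    print(entropy(q), cost_per_symbol(q, pd), entropy(q) - math.log2(1 - pd))
```

The extra cost $-\lg(1 - p_d)$ is paid once per decoding step, regardless of the data distribution.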

An intuitive argument will now be shown that, choosing $p_d$ as in Shannon's limit:

$$-\lg(1 - p_d) \ge H \frac{h(p_b)}{1 - h(p_b)} \qquad (48)$$

the possible space of hash values (states of the coder) would no longer have to grow exponentially, as it did for $p_d = 0$ at the beginning of this subsection. In this case the cost of storing this (protected) hash value would vanish asymptotically. Denote

$$p_d^0 = 1 - 2^{-\frac{H}{1 - h(p_b)} h(p_b)} = 1 - 2^{-H' h(p_b)} \qquad (49)$$

this threshold value. For simplicity let us assume that $b = 2$. Unfortunately there is a very subtle problem with this argument, which will be explained and precisely analyzed in the next subsection.
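The threshold (49) and the equivalence of its two forms can be checked with a short computation. The names below are illustrative assumptions; `bits_per_symbol` is simply formula (47).

```python
import math

def h(p: float) -> float:
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def pd_shannon(H: float, pb: float) -> float:
    """Threshold (49): the smallest forbidden-symbol probability allowed by (48)."""
    return 1 - 2 ** (-H * h(pb) / (1 - h(pb)))

def bits_per_symbol(H: float, pd: float) -> float:
    """H' from (47)."""
    return H - math.log2(1 - pd)

if __name__ == "__main__":
    H, pb = 1.0, 0.01
    pd0 = pd_shannon(H, pb)
    Hp = bits_per_symbol(H, pd0)
    # Both expressions in (49) should agree, and H'/H should equal 1/(1 - h(pb)).
    print(pd0, 1 - 2 ** (-Hp * h(pb)), Hp / H, 1 / (1 - h(pb)))
```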

Assume we have received some message of length $N$ bits. We will use the simplest method at this moment: as before, try to correct it using 'all typical corrections' and check if they pass the verification: while decoding we use only allowed states and the final state is correct. As in the picture, such a message agrees with the correct one before the first error. Then it can vary according to the noise until it reaches a forbidden state or the correct state for the given point. Let us assume that in $j > 0$ steps after an error it still didn't reach a forbidden state. Provided we do not accidentally get to the allowed state, the probability of such a situation is $(1 - p_d)^j$. One step corresponds to on average $H$ encoded bits, which correspond to on average $H - \lg(\tilde{p}_d)$ transmitted bits (with $\tilde{p}_d = 1 - p_d$), so in $j$ steps we processed on average $j(H - \lg(\tilde{p}_d))$ bits. They can freely change according to the noise, so we should check about $2^{j(H - \lg(\tilde{p}_d)) h(p_b)}$ of their corrections. If we choose $p_d$ such that

$$1 \ge (1 - p_d)^j \, 2^{j(H - \lg(\tilde{p}_d)) h(p_b)} = \left( 2^{\lg(\tilde{p}_d) + (H - \lg(\tilde{p}_d)) h(p_b)} \right)^j \qquad (50)$$

the expected number of such corrections not rejected by this mechanism will no longer grow exponentially. This threshold is exactly Shannon's limit (48).

We can now intuitively estimate the probability of wrong correction scenarios as in the picture - that we start with an error from a correct state at some point of time and after some typical noise accidentally get back to some correct state. There are almost $N$ possible starting points for such a scenario. If $p_d \ge p_d^0$, the expected number of corrections whose errors won't be detected by the forbidden states mechanism no longer grows - it usually even drops to zero with the width of such a subrange ($j$). So the expected number of such scenarios can be bounded from above by $N^2$. If the number of states of the coder behaves for example like $N^3$, almost certainly only the proper correction will pass the verification. To store one of $N^3$ values in a protected way we need a bit more than $3 \lg(N)$ bits - while calculating the channel's capacity this cost vanishes asymptotically.

9.3 Practical correction algorithms

Methods constructed from proofs that we can approach Shannon's limit require clearly exponential correction time, like checking all typical corrections. In this subsection I will try to convince the reader that practical (nearly linear) correction methods have to use a redundancy level above some higher limit, found below. Then a general approach to correction will be presented - building correction trees. For the basic choice of weights it requires a bit more redundancy than this new limit.

9.3.1 Practical correction limit

Generally if we want to find the correction in practically linear time, we rather cannot work on corrections of the whole message (an exponential number), but should slowly elongate them. So practical correction algorithms should use enough redundancy to ensure that the expected number of corrections to be considered up to a given point is finite. The problem is that in fact we don't know the number of damaged bits, only that they appear with $p_b$ probability - asymptotically the probability distribution of the number of damaged bits is Gaussian with $\sqrt{N p_b (1 - p_b)}$ standard deviation. This uncertainty vanishes asymptotically, but surprisingly has an essential influence on the expected number of corrections to be considered at a given moment (the width of the correction tree) - rarely there are very large local error concentrations which result in infinite expected width - it essentially influences (50).

We will see it formally later, but intuitively among corrections which survived to a given moment, the fewer bits they changed, the more probable they are. It suggests the simplest correction algorithm - moving the 'front' of the tree - for a given moment remember some number ($M$) of corrections which passed verification with the smallest number of corrected bits. Now in each step try to expand all of them and take only the best $M$ of those which gave some allowed state.

The question is: how many of them ($M$) should we remember? In other words - which in this order is the proper one? The larger $M$ is, the slower the algorithm, but also the larger the probability that the proper correction is among the considered ones. If at a given moment this number is not enough, we lose the proper correction. Fortunately we can observe it - from this moment the number of damages will have to be larger than expected (about $h^{-1}(-\lg(\tilde{p}_d)/H')$ per bit) - it can suggest to go back and use a locally larger $M$. Later we will do it in a smarter way, but for this moment assume that we always use a large enough, varying $M$. To summarize: we have to ensure that the expected number of corrections which passed the verification up to a given moment ($J$ bits) and change a smaller number of bits than the proper correction is finite. The probability distribution of the number of damaged bits is Gaussian with center in $p_b J$ and standard deviation $\sqrt{J p_b (1 - p_b)}$. If there were in fact $pJ$ damaged bits, the number of wrong corrections with at most this number of damaged bits will be asymptotically dominated by $\binom{J}{pJ} \approx (2\pi J p \tilde{p})^{-1/2} \, 2^{J h(p)}$, and about $\tilde{p}_d^{J/H'}$ of them are expected to survive these about $J/H'$ steps. So the expected number of those which survived is asymptotically approximately

$$\int_0^\infty (1 - p_d)^{J/H'} \cdot \frac{1}{\sqrt{2\pi J p \tilde{p}}} \, 2^{J h(p)} \cdot \frac{1}{\sqrt{2\pi J p_b \tilde{p}_b}} \, e^{-\frac{(pJ - p_b J)^2}{2 J p_b (1 - p_b)}} \, J \, dp$$

It is finite for $J \to \infty$ if

$$\frac{1}{H'} \lg(1 - p_d) + h(p) - \frac{\lg(e)}{2 p_b (1 - p_b)} (p - p_b)^2 < 0$$

This function of $p$ has only one maximum - a bit above $p_b$, as expected. This $p$ corresponds to the cases which influence the expected number of corrections the most. Finally the integral is finite if we choose $p_d > p_d^1$:

$$p_d^1 := 1 - 2^{-H' \max_{p \in [0,1]} \left( h(p) - \frac{\lg(e)}{2 p_b (1 - p_b)} (p - p_b)^2 \right)} \quad \left( \ge p_d^0 \right) \qquad (51)$$

Using $H' = H - \lg \tilde{p}_d$, we get

$$\frac{H'}{H} = \frac{1}{1 - \max_{p \in [0,1]} \left( h(p) - \frac{\lg(e)}{2 p_b (1 - p_b)} (p - p_b)^2 \right)} \qquad (52)$$

Compared to Shannon's limit, it is a bit larger for small $p_b$, up to twice larger for $p_b \to 0.5$. This formula can be used to choose $p_d = 1 - 2^{H(1 - H'/H)}$. To summarize: if we used $p_d \in [p_d^0, p_d^1]$, then while trying to consider some best corrections up to a given moment, we would asymptotically need an exponential number of steps. Because we rather use a polynomially large number of states of the coder, the correction eventually found would probably be wrong. If we can tolerate exponential correction time, we can use an exponential number of states and a smaller $p_d$, getting a better limit and finally Shannon's limit for $p_d = 0$, as at the beginning of the previous subsection. In practice we use finite blocks, so choosing some large $M$, with large probability it will find the proper correction (in linear time). We can verify it: it is most probably the proper correction if the density of damages didn't drastically grow from some point. We could focus on this point and try to correct this correction. It suggests how to choose $M$: use a relatively small $M$ first, until the error density grows much faster than it should. Then locate this point and try to use a larger $M$ around it until the error density returns to the expected one.

Figure 5: Comparison of Shannon's limit ($p_d^0$), the limit for practical correction algorithms ($p_d^1$) and the limit reached by the basic version of the correction tree algorithm ($p_d^2$).
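The limit (51)-(52) has no closed form, but it is easy to evaluate numerically. The sketch below uses a plain grid search for the maximum and a bisection for $h^{-1}$; the function names, the grid resolution and the example values are assumptions made for illustration. The inverse entropy gives the damage density one would expect to see after the proper correction has been lost.

```python
import math

def h(p: float) -> float:
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def g(p: float, pb: float) -> float:
    """h(p) - lg(e) (p - pb)^2 / (2 pb (1 - pb)), the term maximized in (51)."""
    return h(p) - math.log2(math.e) * (p - pb) ** 2 / (2 * pb * (1 - pb))

def max_g(pb: float, steps: int = 100_000):
    """Grid search over [0, 1]; returns (argmax, maximum)."""
    best = max((i / steps for i in range(steps + 1)), key=lambda p: g(p, pb))
    return best, g(best, pb)

def practical_ratio(pb: float) -> float:
    """H'/H from (52), meaningful for 0 < pb < 1/2."""
    return 1 / (1 - max_g(pb)[1])

def h_inverse(y: float, iters: int = 60) -> float:
    """Solve h(p) = y on [0, 1/2] by bisection (h is increasing there)."""
    lo, hi = 0.0, 0.5
    for _ in range(iters):
        mid = (lo + hi) / 2
        if h(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

if __name__ == "__main__":
    pb = 0.01
    print("Shannon H'/H:", 1 / (1 - h(pb)))
    print("practical H'/H:", practical_ratio(pb))
    # Example: pd = 3/4 and H' = 3 bits per step, as in the bottom scheme of fig. 3.
    print("damage density after losing the path:", h_inverse(-math.log2(0.25) / 3))
```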

9.3.2 Correction tree algorithms

The previously suggested algorithm considered some number of best corrections up to a given point - we were expanding the tree of possible corrections by moving its whole 'front'. We will now look for more sophisticated algorithms, which at a given moment select some best node to expand. We should get some (pseudo)random tree with many subtrees of wrong corrections growing from the core made of the proper correction. These subtrees generally have a larger error concentration. Each dot in fig. 6 corresponds to some allowed state of the decoder. When we get to a forbidden state, we have to expand somewhere else. Edges of such a tree correspond to some corrections of bits used in a given step - intuitively the fewer bits a correction changes, the more probable it is - we should try it earlier. To create the logical structure of the tree, we have generally three possibilities:

1. make a node for each bit - a split denotes changing one bit - it is rather impractical, or

2. make a node for each step - a split denotes a correction of the bit block used during the last step - good for large $p_b$, or

3. make a node every time a forbidden state occurs (as in the figure) - we connect consecutive bit blocks as long as we can decode further without correction - good for small $p_b$.

In each node we have to store somehow the pointer to its father, the last used correction and the position in the message. To quickly choose the best node for a given moment, we will also store there some weight and always choose the node with the largest one. For each node we can easily find its most probable, not yet considered child - they are denoted as 'triangles' in the figure. At a given moment we can focus on them and mark further ones after using some. To summarize, the main loop of the algorithm is

• Find the most probable correction not considered yet (one of the 'triangles'),
• Try to expand it - decode one step or until we get to a forbidden state,
• Modify the tree - create a new node and at most two 'triangles' - the first for the new node and the next one for the 'triangle' used,

until we get to the fixed final state at the end of the message.

Figure 6: Correction algorithm. It will create such (pseudo)random trees. If $p_d < p_d^0$ this tree would immediately grow exponentially. If $p_d = p_d^0$ it would look to grow linearly. If $p_d^0 < p_d \le p_d^2$ and we use the basic weight function, its width will be generally small, but rare high error concentrations will sum to infinite expected width. If $p_d > p_d^2$ its expected width will be finite - it can be used for potentially infinite messages. This limit can probably be improved up to $p_d^1$.

9.3.3 The weights of nodes of the correction tree

We will work on two 'time scales' - $j$ will denote the number of states of the decoder, which corresponds to on average $J \approx H' j$ bits of the encoded message. In each step we have the standard situation for error correction: we make some observation ($O$) and we need to evaluate the probabilities of its possible explanations ($E$). To cope with such problems we use Bayesian analysis, which says that the probability of a given situation is the probability that this explanation causes the given symptoms, multiplied by the probability of this explanation and normalized by the sum over all possible explanations:

$$Pr(E|O) = \frac{Pr(O|E) Pr(E)}{Pr(O)} = \frac{Pr(O|E) Pr(E)}{\sum_{E'} Pr(O|E') Pr(E')} \qquad (53)$$

In our case we observe some tree ($O$) and we want to find the error distribution ($E$) which caused it - the most probable node to expand at this moment. We will treat $E$ as in (45): it is a $\{0,1\}^J$ vector in which on each position '1' denotes that we should change the given bit. Finding $Pr(E)$ is easy:

$$Pr(E) = p_b^{\#\{1 \le i \le J : E_i = 1\}} (1 - p_b)^{\#\{1 \le i \le J : E_i = 0\}} \qquad (54)$$

or accordingly some more complicated function if we don't assume a symmetric, memoryless channel.

The problem is to calculate $Pr(O|E)$. We should use the real structure of the tree to find it, but it would require assuming the algorithm which created it - it becomes extremely complicated. We will focus now on some basic method: using only the number of allowed/forbidden states outside $E$. We will see that it already gives an algorithm very close to the theoretical limit found for practical correction algorithms. Some ways to improve it will also be shown later. If we assume that a given correction ($E$) is proper, the states outside the corresponding path in our tree should fulfill the statistics: $p_d$ of them are forbidden, $1 - p_d$ are allowed:

$$Pr(O|E) = p_d^{\#\text{forbidden}} (1 - p_d)^{\#\text{allowed}}$$

Finally we can assign $Pr(O|E) Pr(E)$ to each 'triangle' - it is some product of powers of $p_b$, $1 - p_b$, $p_d$, $1 - p_d$. The first pair corresponds to the given correction, the second to the number of forbidden/allowed states in the rest of the tree. Observe that the number of forbidden/allowed states outside some path is the number of all forbidden/allowed states in the tree minus those in the considered path (only allowed ones). So dividing $Pr(O|E) Pr(E)$ by the term for the whole tree, we see that we have to maximize

$$p_b^{\#\{i : E_i = 1\}} \, \tilde{p}_b^{\#\{i : E_i = 0\}} \, \tilde{p}_d^{-\#\text{states in this correction}}$$

among all corrections ($E$) worth considering at this moment ('triangles'). We want to find only the node with maximal weight, so we can work on the logarithm of this value. Finally while building the tree, to calculate the weight for a given step of the decoder, we should add

$$\#\{i : E_i = 1\} \cdot \lg(p_b) + \#\{i : E_i = 0\} \cdot \lg(\tilde{p}_b) - \lg(\tilde{p}_d) \qquad (55)$$

to the weight of the previous step, where this time $E$ denotes the bits used in the last decoding step. We also see that using the previous algorithm - moving the 'front' of the tree - the most probable nodes really are those with the smallest number of damaged bits. The term with $\lg(\tilde{p}_d)$ allows us to handle corrections having different lengths - additionally emphasizing longer corrections. Formally, because of the search for the node with the largest weight, this algorithm has $N \lg(N)$ time complexity. In practice we usually need to work on a relatively small number of nodes with the largest weight - the required priority queue could have some fixed size. Very rarely it will run out and we will have to look through 'triangles' stored outside it.
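The main loop and the weight (55) can be tried out on a toy coder. What follows is only a sketch under several assumptions, not the ANS-based construction: each 3-bit block carries 2 checksum bits of the state and 1 data bit (the bottom scheme of fig. 3, so $p_d = 3/4$), the state is a SHA-256 hash chain (`step_state` and `checksum` are hypothetical helpers), one node is made per decoding step (possibility 2 above), and the final state is assumed to be delivered reliably.

```python
import hashlib
import heapq
import math

def step_state(state: bytes, bit: int) -> bytes:
    # Assumed state update: hash of the previous state and the decoded bit.
    return hashlib.sha256(state + bytes([bit])).digest()

def checksum(state: bytes) -> tuple:
    # Two "practically random" bits of the state.
    return (state[0] & 1, (state[0] >> 1) & 1)

def encode(bits, init: bytes = b"init"):
    """Return the 3-bit blocks and the final state (assumed to be sent reliably)."""
    state, blocks = init, []
    for b in bits:
        blocks.append(checksum(state) + (b,))
        state = step_state(state, b)
    return blocks, state

def correct(received, final_state: bytes, pb: float, init: bytes = b"init"):
    """Best-first search: always expand the node with the largest weight (55)."""
    lg_pb, lg_qb = math.log2(pb), math.log2(1 - pb)
    lg_qd = math.log2(1 - 0.75)  # only 2 of the 8 block values are allowed
    counter = 0                  # tie-breaker, so heap entries never compare states
    heap = [(0.0, counter, 0, init, ())]  # (-weight, tie, position, state, bits)
    while heap:
        neg_w, _, pos, state, bits = heapq.heappop(heap)
        if pos == len(received):
            if state == final_state:
                return list(bits)
            continue  # full-length correction with a wrong final state
        c = checksum(state)
        for d in (0, 1):  # the two allowed children; the other 6 are forbidden
            flips = sum(x != y for x, y in zip(c + (d,), received[pos]))
            w = -neg_w + flips * lg_pb + (3 - flips) * lg_qb - lg_qd
            counter += 1
            heapq.heappush(heap, (-w, counter, pos + 1, step_state(state, d), bits + (d,)))
    return None

if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
    blocks, final = encode(message)
    blocks[3] = (blocks[3][0], blocks[3][1], blocks[3][2] ^ 1)  # damage a data bit
    print(correct(blocks, final, pb=0.02) == message)
```

Because the toy state is a collision-resistant hash of the whole decoded prefix, a wrong path cannot return to the correct state here. A real coder with a polynomial number of states has to rely on the probabilistic argument of the previous subsection instead.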

9.3.4 Analysis of correction tree algorithm with basic weights

Let us assume that we use this algorithm to correct some message: there is some unknown vector of errors ($E$) with $p_b$ probability of 1. Observe that asymptotically the average weight per node (55) is

$$H' p_b \lg(p_b) + H' \tilde{p}_b \lg(\tilde{p}_b) - \lg(\tilde{p}_d) = -H' h(p_b) - \lg(\tilde{p}_d)$$

so the condition $p_d > p_d^0$ is equivalent to the weight of the correct path statistically growing. But the error distribution is not uniform - the weight of the correct path can locally decrease. In such a situation, before we continue expanding the correct path, we have to expand some subtrees of wrong corrections. The problem is that when the local concentration of errors is very large, the weight of the correct path drops dramatically and so we have to expand very large subtrees of wrong corrections. The probability of such scenarios decreases exponentially with the size of such a weight drop ($w$), but the size of such subtrees grows exponentially with it.

Such a weight drop can generally have any length: observe that to make the whole correction, from position $J$ we have to expand subtrees of wrong corrections up to weight about:

$$\min_{J' \ge 0} \left( \#\{i \in [J, J + J') : E_i = 1\} \lg(p_b) + \#\{i \in [J, J + J') : E_i = 0\} \lg(\tilde{p}_b) - \frac{J'}{H'} \lg(\tilde{p}_d) \right)$$

We need to find the expected probability distribution of such drops. It doesn't depend on the position:

$V(w)$ := probability that the weight will drop by at most $w$

For $w < 0$, $V(w) = 0$, but $V(0)$ corresponds to the situation in which the weight increases - it should be positive. Before going to the general case, let us focus for a moment on a simplified one - that all blocks have exactly 1 bit ($H' = 1$). Observe that we can write an equation for $V$ for two succeeding positions:

$$V(w) = \begin{cases} p_b V(w + \lg(p_b) - \lg(\tilde{p}_d)) + \tilde{p}_b V(w + \lg(\tilde{p}_b) - \lg(\tilde{p}_d)) & \text{for } w \ge 0 \\ 0 & \text{for } w < 0 \end{cases} \qquad (56)$$

If there were no such boundary of behaviors at $w = 0$, it would be a simple functional equation - with some combination of exponents as the solution. Fortunately we can use it to find the asymptotic behavior. We know that $\lim_{w \to \infty} V(w) = 1$, so let us assume that asymptotically

$$1 - V(w) \propto 2^{vw} \qquad (57)$$

for some $v < 0$. Substituting it into (56), we get:

$$2^{vw} = p_b 2^{v(w + \lg(p_b) - \lg(\tilde{p}_d))} + \tilde{p}_b 2^{v(w + \lg(\tilde{p}_b) - \lg(\tilde{p}_d))}$$

$$\tilde{p}_d^{\,v} = p_b^{v+1} + \tilde{p}_b^{v+1} \qquad (58)$$

This equation always has the $v = 0$ solution, but for $p_d > p_d^0$ there emerges a second, negative solution - the one we are interested in. It can be easily found numerically and simulations show that we asymptotically reach this behavior.
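The negative solution of (58) can be found by bisection. The sketch below compares logarithms of both sides to avoid overflow for strongly negative $v$; the function names, the bracket search and the optional `block_bits` exponent (which turns (58) into the constant-block-length form with $H'$ bits per step) are assumptions for illustration.

```python
import math

def log_residual(v: float, pb: float, pd: float, block_bits: float = 1.0) -> float:
    """lg(left side) - lg(right side) of (58), with the right side raised to block_bits."""
    a = (v + 1) * math.log2(pb)
    b = (v + 1) * math.log2(1 - pb)
    m = max(a, b)
    lg_sum = m + math.log2(2 ** (a - m) + 2 ** (b - m))  # stable lg(pb^(v+1) + (1-pb)^(v+1))
    return v * math.log2(1 - pd) - block_bits * lg_sum

def negative_root(pb: float, pd: float, block_bits: float = 1.0):
    """Return the v < 0 solution, or None if no sign change is found (pd too small)."""
    hi = -1e-6
    if log_residual(hi, pb, pd, block_bits) <= 0:
        return None
    lo = -1.0
    while log_residual(lo, pb, pd, block_bits) > 0:
        lo *= 2
        if lo < -1e6:
            return None
    for _ in range(200):
        mid = (lo + hi) / 2
        if log_residual(mid, pb, pd, block_bits) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

if __name__ == "__main__":
    for pd in (0.01, 0.1, 0.3):
        print(pd, negative_root(0.01, pd))
```

For $p_b = 0.01$ and one-bit blocks the search returns nothing for $p_d = 0.01$, which lies below the threshold $p_d^0$, and a negative $v$ for the larger values.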

We can now go to the general case - we use some probability distribution of block lengths:

$P_a$ := probability that a decoding step will use $a$ bits

We have $\sum_a P_a = 1$ and $H' = \sum_a a P_a$. Writing (56) analogously, this time for a block of length $a$ we would get $2^a$ terms. After the substitution (57), we can collapse them:

$$2^{vw} = 2^{vw} \sum_a P_a \left( p_b 2^{v(\lg(p_b) - \frac{1}{a} \lg(\tilde{p}_d))} + \tilde{p}_b 2^{v(\lg(\tilde{p}_b) - \frac{1}{a} \lg(\tilde{p}_d))} \right)^a$$

$$\tilde{p}_d^{\,v} = \sum_a P_a \left( p_b^{v+1} + \tilde{p}_b^{v+1} \right)^a \approx \left( p_b^{v+1} + \tilde{p}_b^{v+1} \right)^{H'} \qquad (59)$$

Again we are interested in the $v < 0$ solution. The approximation on the right is exact if blocks have constant length as before, but we should also be able to use it when only two block lengths differing by 1 are used. Generally we should be careful about it.

Now we have to estimate the asymptotic behavior of the size of subtrees of wrong corrections for large weight drops. Each node of such a subtree can be thought of as the root of a new subtree. If we expand it for a given allowed weight drop, it should give a fixed expected number of nodes:

$U(t)$ := expected number of processed nodes for a weight drop of at most $t$

As before, $U(t) = 0$ for $t < 0$. For $t = 0$ we process this node: $U(0) \ge 1$. Let us focus on one-bit blocks ($H' = 1$) as previously. Connecting a node with its children, we get:

$$U(w) = \begin{cases} 1 + \tilde p_d \left( U(w + \lg(p_b) - \lg(\tilde p_d)) + U(w + \lg(\tilde p_b) - \lg(\tilde p_d)) \right) & \text{for } w \ge 0 \\ 0 & \text{for } w < 0 \end{cases} \tag{60}$$

This time we would expect that for some $u > 0$ asymptotically

$$U(w) \propto 2^{uw} \tag{61}$$

Substituting it into (60) as previously (the constant term 1 is negligible asymptotically), we get

$$2^{uw} = \tilde p_d \left( 2^{u(w+\lg(p_b)-\lg(\tilde p_d))} + 2^{u(w+\lg(\tilde p_b)-\lg(\tilde p_d))} \right)$$

$$\tilde p_d^{\,u-1} = p_b^{u} + \tilde p_b^{u} \tag{62}$$

This equation is very similar to (58); for $p_d > p_d^0$ we again get two solutions. As previously, because of the strong boundary conditions at $w = 0$, we asymptotically reach the smaller one, which is confirmed by numerical simulations. Comparing these two equations, we get a surprisingly simple correspondence between the two critical coefficients:

$$u = v + 1 \tag{63}$$

which is also fulfilled in the general case with blocks of different lengths.
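The correspondence (63) can be verified directly in the one-bit case. The short derivation below is only a sketch, using nothing beyond (58) and (62): substituting $u = v + 1$ into (62) gives

$$\tilde p_d^{\,(v+1)-1} = p_b^{\,v+1} + \tilde p_b^{\,v+1} \quad\Longleftrightarrow\quad \tilde p_d^{\,v} = p_b^{\,v+1} + \tilde p_b^{\,v+1},$$

which is exactly (58). Every solution $v$ of (58) therefore corresponds to a solution $u = v + 1$ of (62): the trivial solution $v = 0$ maps to $u = 1$, and the negative solution $v < 0$ maps to the smaller root $u < 1$.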

Having the functions $U(w)$ and $V(w)$, we can finally find the expected number of nodes in subtrees of wrong corrections per node of the proper correction:

$$\int_0^\infty \left[ p_b\,U(w - \lg(p_b) + \lg(\tilde p_b)) + \tilde p_b\,U(w + \lg(p_b) - \lg(\tilde p_b)) \right] dV(w) \tag{64}$$

Because $V$ is not continuous, this is formally a Stieltjes integral, but to estimate the asymptotic behavior we can use

$$dV(w) = \frac{dV(w)}{dw}\,dw \propto 2^{vw}\,dw$$

So the expected number of processed nodes per corrected bit is finite if and only if

$$1 > 2^{uw}\,2^{vw} = 2^{w(u+v)} = 2^{w(2v+1)} \iff v < -\frac{1}{2}$$

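The critical $p_d$ is

$$\tilde p_d^{-1/2} = p_b^{1/2} + \tilde p_b^{1/2} \quad \text{or generally:} \quad \left(\sqrt{\tilde p_d}\right)^{-1} = \sum_a P_a \left(\sqrt{p_b} + \sqrt{\tilde p_b}\right)^a \tag{65}$$

Let us denote this value as $p_d^2$ (the superscript is an index, not a power):

$$p_d^2 := 1 - \frac{1}{\left(\sum_a P_a \left(\sqrt{p_b} + \sqrt{\tilde p_b}\right)^a\right)^2} \approx 1 - \frac{1}{\left(\sqrt{p_b} + \sqrt{\tilde p_b}\right)^{2H'}} \tag{66}$$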

In general $P_a$ depends on $p_d$, so we should solve this equation numerically. We will use the approximation on the right to estimate the critical channel capacity for this boundary:

$$H' = H - \lg(\tilde p_d^2) \approx H + 2H' \lg\left(\sqrt{p_b} + \sqrt{\tilde p_b}\right)$$

$$\frac{H'}{H} \approx \frac{1}{1 - 2\lg\left(\sqrt{p_b} + \sqrt{\tilde p_b}\right)}$$

It is at most 13.1% larger (for $p_b \approx 0.03$) than for the limit for practical correction algorithms (fig. 5). Using $p_d > p_d^2$ we can be sure that the expected width of the tree is finite: using a polynomially large number of decoder states, this algorithm will asymptotically almost certainly give the proper correction in practically linear time.
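The boundary value from (66) and the resulting rate estimate are simple to tabulate. The sketch below does so under the assumption that $P_a$ is given as a dictionary mapping block length to probability; the function names `critical_p_d` and `rate_ratio` are illustrative and not taken from the paper.

```python
import math


def critical_p_d(p_b, P=None):
    """Boundary value p_d^2 from (66).

    P maps block length a -> P_a; the default is one-bit blocks (H' = 1).
    """
    if P is None:
        P = {1: 1.0}
    s = math.sqrt(p_b) + math.sqrt(1.0 - p_b)
    total = sum(prob * s ** a for a, prob in P.items())
    return 1.0 - 1.0 / total ** 2


def rate_ratio(p_b):
    """Approximate H'/H at the boundary: 1 / (1 - 2 lg(sqrt(p_b) + sqrt(1 - p_b)))."""
    s = math.sqrt(p_b) + math.sqrt(1.0 - p_b)
    return 1.0 / (1.0 - 2.0 * math.log2(s))


if __name__ == "__main__":
    for p_b in (0.001, 0.01, 0.03, 0.1):
        print(p_b, critical_p_d(p_b), rate_ratio(p_b))
```

Note that `rate_ratio` diverges as $p_b \to 1/2$, where $\sqrt{p_b} + \sqrt{\tilde p_b} = \sqrt{2}$ and the denominator vanishes.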

This algorithm needs slightly more minimal redundancy than the ’moving the front’ of the tree approach, because it sometimes builds huge subtrees of wrong corrections which go much further than the proper node. This is caused by the $-\lg(\tilde p_d)$ term in the weight function, which amortizes large error density by the length. We see that we could improve the correction tree algorithm by sometimes switching to the ’moving the front’ algorithm, sometimes enforcing expansion of shorter paths. For example: if the width of the tree or the local error density exceeds some value, make some number of steps using only ’triangles’ whose position is below some boundary, such as the position of largest width. Unfortunately I could not find optimal parameters analytically, but they can be found experimentally.

9.4 Generalized block codes

In the previous two subsections we used two correction mechanisms: the final hash value (state of the coder) has to agree, and after an error, in each step with probability $p_d$ we will see that something was wrong. In this subsection we will see how to use the huge freedom in choosing ANS coding tables to include additional error correction mechanisms known from standard block codes, as in fig. 3. This mechanism allows simple damage to be repaired immediately and so reduces the number of uses of the decoding table, speeding up the process. Finally, it can be imagined as block codes in which blocks are no longer independent, but have connected redundancy. This connection is made by the internal state of the coder. This time $p_d$ is rather large (at least 1/2), so we should use larger $H$, for example by treating a few bits as a symbol to be encoded in one step.

The idea is to ensure that the Hamming distances between different allowed states are at least some fixed value. For distance 2 this can easily be done by inserting an additional parity check bit, for example just before the oldest bit (which is always 1). In this case we can just use the ScD initialization for the original symbol probability distribution, then insert the parity bit, use $2l$ instead of $l$, and mark the rest (half) of the states as forbidden. In this case $p_d = 1/2$. The advantage of such a distance 2 code is that if there is only a single error among the bits of one block, it is detected immediately. So a forbidden state denotes that one of the bits used in the last step was damaged, or that there were at least two errors in some previous block; the set of possible corrections is smaller than previously.
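The distance 2 construction can be sketched in a few lines. The code below is only one possible reading of the description above: it assumes $l = 2^L$, places the parity bit directly below the leading 1, and chooses even parity; the names `insert_parity`, `allowed_states` and `hamming` are hypothetical, and the ScD initialization itself is not reproduced.

```python
def insert_parity(x, L):
    """Map a state x in [2^L, 2^(L+1)) to a state in [2^(L+1), 2^(L+2)).

    The parity bit goes directly below the leading 1 (an assumption of
    this sketch) and makes the number of ones in the lower L + 1 bits even.
    """
    low = x & ((1 << L) - 1)              # the L bits below the leading 1
    parity = bin(low).count("1") & 1
    return (1 << (L + 1)) | (parity << L) | low


def allowed_states(L):
    """Allowed states in the doubled range; every other state is forbidden."""
    return {insert_parity(x, L) for x in range(1 << L, 1 << (L + 1))}


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    L = 4
    allowed = allowed_states(L)
    total = 1 << (L + 1)                  # number of states in [2l, 4l)
    print(len(allowed), "allowed out of", total)
    print(min(hamming(a, b) for a in allowed for b in allowed if a != b))
```

Exactly half of the states in the doubled range are allowed, matching $p_d = 1/2$, and flipping any single bit below the leading 1 of an allowed state always lands on a forbidden one.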

We could also enforce a larger Hamming distance. Observe that e.g. triple modular redundancy codes can be imagined as obtained this way: $l = 2^3$, $b = 2$, the allowed states are ’1000’ and ’1111’, so both symbols (0 and 1) have exactly one appearance. The rest of the states are forbidden. In a decoding step, before the bit transfer, the state is always 1, so this example is degenerate: blocks are independent.

Generally, let us take $K$ as the maximum of $k_s$ over all allowed symbols (after rescaling), so that the bit transfer uses at most $K$ bits. We should enforce that for each allowed state, all states differing in at most a given number of bits among these youngest $K$ positions are forbidden. In other words, if $l = 2^L$, for each value of the oldest $L + 1 - K$ bits we should ensure that any two allowed states have Hamming distance of at least the given value, which creates some block code on the $K$ youngest bits.

To make the connection of redundancy work, many essentially different block codes have to be used. We can generate them from a single one: by XORing with some $K$-bit mask as in fig. 3 and by permuting these bits; these operations preserve the minimal Hamming distance of the codewords. So finally, to mark the allowed states, for each value of the oldest $L + 1 - K$ bits we should choose some $K$-bit mask to XOR with and possibly some permutation of the $K$ bits of the original block code. Then we can distribute the allowed symbols among them. To make these choices, especially when we want to encrypt simultaneously, we can use a PRNG as before (initialized with the key).
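The claim that XOR masks and bit permutations preserve the minimal Hamming distance is easy to check empirically. The sketch below derives variants of a base code this way; Python's `random.Random` stands in for the key-initialized PRNG, and the names `derive_code`, `permute_bits` and `min_distance` are assumptions made for illustration.

```python
import random
from itertools import combinations


def min_distance(code):
    """Minimal Hamming distance between distinct codewords."""
    return min(bin(a ^ b).count("1") for a, b in combinations(code, 2))


def permute_bits(word, perm):
    """Move bit i of word to position perm[i]."""
    out = 0
    for i, j in enumerate(perm):
        out |= ((word >> i) & 1) << j
    return out


def derive_code(base, K, rng):
    """Derive a new K-bit code: permute bit positions, then XOR with a mask.

    rng is any random.Random instance; seeding it with the key is a
    stand-in for the PRNG mentioned in the text.
    """
    mask = rng.getrandbits(K)
    perm = list(range(K))
    rng.shuffle(perm)
    return sorted(permute_bits(w, perm) ^ mask for w in base)


if __name__ == "__main__":
    base = [0b000, 0b111]                 # repetition code, distance 3
    rng = random.Random(12345)            # hypothetical key
    for _ in range(3):
        code = derive_code(base, 3, rng)
        print([format(w, "03b") for w in code], min_distance(code))
```

Both operations act identically on every codeword, so the XOR of any two codewords is only permuted and the mask cancels, which is why the distance is unchanged.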

References

[1] D.A. Huffman, A Method for the Construction of Minimum-Redundancy Codes, Proceedings of the I.R.E., September 1952, pp 1098-1102.

[2] J.J. Rissanen, Generalized Kraft inequality and arithmetic coding, IBM J. Res. Develop., vol. 20, no. 3, pp. 198-203, May 1976.

[3] G.N.N. Martin, Range encoding: an algorithm for removing redundancy from a digitized message, Video and Data Recording Conf., Southampton, England, July 1979.

[4] J. Duda, Optimal encoding on discrete lattice with translational invariant constrains using statistical algorithms, arXiv:0710.3861

[5] M. Mahoney, Data Compression Programs web site, http://cs.fit.edu/~mmahoney/compression/

[6] J. Duda, Data Compression Using Asymmetric Numeral Systems, Wolfram Demonstration Project, http://demonstrations.wolfram.com/author.html?author=Jarek+Duda

[7] M. Matsumoto, Y. Kurita, Twisted GFSR generators, ACM Transactions on Modeling and Computer Simulation, vol. 2, pp 179-194, July 1992.

[8] R.G. Gallager, Low Density Parity Check Codes, Monograph, M.I.T. Press, 1963.

[9] D. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press (2003)