Asymmetric Numeral Systems

Old and new coding techniques for data compression, error correction and nonstandard situations

Lecture I: data compression … data encoding

Efficient information encoding to:
- reduce storage requirements,
- reduce transmission requirements,
- often reduce access time for storage and transmission (decompression is usually faster than HDD, e.g. 500 vs 100 MB/s)

Jarosław Duda Nokia, Kraków 25. IV. 2016

Symbol frequencies from a version of the British National Corpus containing 67 million words (http://www.yorku.ca/mack/uist2011.html#f1):

Brute force: lg(27) ≈ 4.75 bits/symbol (x → 27x + s)
Huffman uses: H' = ∑_i p_i r_i ≈ 4.12 bits/symbol
Shannon: H = ∑_i p_i lg(1/p_i) ≈ 4.08 bits/symbol
We can theoretically improve by ≈ 1% here: ΔH ≈ 0.04 bits/symbol

Order 1 Markov: ~3.3 bits/symbol (~), order 2: ~3.1 bps, word: ~2.1 bps (~ZSTD)
Currently best compression: cmix-v9 (PAQ) (http://mattmahoney.net/dc/text.html)
10^9 bytes of text from Wikipedia (enwik9) compressed into 123874398 bytes: ≈ 1 bit/symbol
10^8 bytes → 15627636, 10^6 bytes → 181476
Hilberg conjecture: H(n) ~ n^β, β < 0.9

… lossy compression: > 100× reduction
Lossless – fully reversible, e.g. gzip, png, JPEG-LS, FLAC, lossless mode in h.264, h.265 (~50%)
Lossy – better compression at the cost of distortion
Stronger distortion – better compression (rate-distortion theory, psychoacoustics)

General purpose (lossless, e.g. gzip, lzma, bzip, ZSTD) vs specialized compressors (e.g. audiovisual, text, DNA, numerical, mesh, …)

For example audio (http://www.bobulous.org.uk/misc/audioFormats.html ): Wave: 100%, FLAC: 67%, Monkey’s: 63%, MP3: 18%

720p video - uncompressed: ~1.5 GB/min, lossy h.265: ~ 10MB/min

Lossless compression is very data dependent, with a speed–ratio tradeoff:
PDF: ~80%, RAW image: 30-80%, binary: ~20-40%, txt: ~12-30%, …: ~20%, 10000 URLs: ~20%, XML: ~6%, xls: ~2%, mesh: <1%

General scheme for Data Compression … data encoding:

1) Transformations, predictions
To decorrelate data, make it more predictable, encode only new information
Delta coding, YCrCb, Fourier, Lempel-Ziv, Burrows-Wheeler, MTF, RLC …

2) If lossy, quantize coefficients (e.g. c → round(c/Q))

3) Statistical modeling of final events/symbols → sequence of (context_ID, symbol)
Static, parametrized, stored in headers, context-dependent, adaptive

4) (Entropy) coding: use lg(1/p_s) bits per symbol, H = ∑_s p_s lg(1/p_s) bits total
Prefix/Huffman: fast/cheap, but inaccurate (p ~ 2^−r)
Range/arithmetic: costly (multiplication), but accurate
Asymmetric Numeral Systems – fast and accurate:
general (Apple LZFSE, Fb ZSTD), DNA (CRAM), games (LZNA, BitKnit), Google VP10, WebP

Part 1: Transformations, predictions, quantization

Delta coding: write relative values (differences) instead of absolute

e.g. time (1, 5, 21, 28, 34, 44, 60) → (1, 4, 16, 7, 6, 10, 16)
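A minimal delta-coding sketch of this idea (helper names are mine, not from the slides): encode differences of consecutive values in place, decode by running prefix sums.

#include <stddef.h>

void delta_encode(int *v, size_t n) {
    if (n == 0) return;
    for (size_t i = n - 1; i > 0; --i)   /* go backwards so earlier originals stay intact */
        v[i] -= v[i - 1];
}

void delta_decode(int *v, size_t n) {
    for (size_t i = 1; i < n; ++i)       /* prefix sums restore the original values */
        v[i] += v[i - 1];
}

/* (1, 5, 21, 28, 34, 44, 60) -> (1, 4, 16, 7, 6, 10, 16) and back */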

the differences often have some parametric probability distribution:
Laplace distribution (geometric): Pr(x) = (1/(2b)) exp(−|x−μ|/b)
Gaussian distribution (normal): Pr(x) = (1/(σ√(2π))) exp(−(x−μ)²/(2σ²))
Pareto distribution: Pr(x) ∝ 1/x^(α+1)

JPEG-LS – simple prediction (https://en.wikipedia.org/wiki/Lossless_JPEG)

- scan pixels line-by-line,
- predict X (edge detection):

- find the context determining the coding parameters: (D − B, B − C, C − A), each difference quantized to 9 regions ((9³ + 1)/2 = 365 contexts),
- (entropy) coding of differences using Golomb codes (good for a Laplace distribution) with parameter m chosen according to the context

Color transformation to better exploit spatial correlations

YCbCr: Y – luminance/brightness; Cb, Cr – blue and red difference

Chroma subsampling (usually 4:2:0)

Color depth: 3×8 – true color, 3×10, 12, 16 – deep color

Karhunen–Loève transform to decorrelate into independent variables
Often close to Fourier (DCT): energy preserving, frequency domain (stored quantized in JPEG)

Principal Component Analysis

Video: predict values (in blocks), encode only the difference (residue)

intra-prediction (in a frame, choose and encode mode), inter-prediction (between frames, use motion of objects)

DCT – fixed support, wavelets – varying support, e.g. JPEG2000, Haar wavelets:

Progressive decoding: send low frequencies first, then further ones to improve quality

Lempel-Ziv (large family): replace repeats with (position, length) references (decoding is just parsing)

https://github.com/inikep/lzbench, i5-4300U, 100MB total (from W8.1):

method                            type          Compr. [MB/s]   Decomp. [MB/s]   Ratio
lz4fast r131 level 17             LZ            994             3172             74%
lz4 r131                          LZ            497             2492             62%
lz5hc v1.4.1 level 15             LZ            2.29            724              44%
Zlib (gzip) level 1               LZ + Huf      45              197              49%
Zlib (gzip) level 9               LZ + Huf      7.5             216              45%
ZStd v0.6.0 level 1               LZ + tANS     231             595              49%
ZStd v0.6.0 level 22              LZ + tANS     2.25            446              37%
Google Brotli level 0             LZ + o1 Huf   210             195              50%
Google Brotli level 11            LZ + o1 Huf   0.25            170              36%

Burrows–Wheeler Transform: sort lexicographically all cyclic shifts, take the last column

Usually improves statistical properties for compression (e.g. text, DNA)
https://sites.google.com/site/powturbo/entropy-coder benchmark:

                    enwik9 (1GB)   enwik9 + BWT   Speed (BWT, MB/s)
O1 rANS             55.7%          21.8%          193/403
TurboANX (tANS)     63.5%          24.3%          492/874
Zlibh (Huffman)     63.8%          28.5%          307/336

Run-Length Encoding (RLE) – encode repeat lengths
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW → 12W1B12W3B24W1B14W
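A toy RLE sketch producing the textual "12W1B…" form above (illustrative only; out is assumed large enough):

#include <stdio.h>
#include <string.h>

void rle_encode(const char *in, char *out) {
    size_t n = strlen(in), pos = 0;
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i]) run++;      /* length of the current run */
        pos += (size_t)sprintf(out + pos, "%zu%c", run, in[i]); /* append "<count><symbol>" */
        i += run;
    }
    out[pos] = '\0';
}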

Move-to-front (MTF, Bentley 86) – move the last used symbol to the front of the alphabet
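A byte-alphabet MTF sketch (a rough illustration, not the exact variant used by any particular compressor): output the current index of each symbol, then move that symbol to the front.

#include <stddef.h>
#include <string.h>

void mtf_encode(const unsigned char *in, unsigned char *out, size_t n) {
    unsigned char table[256];
    for (int i = 0; i < 256; ++i) table[i] = (unsigned char)i;   /* initial alphabet order */
    for (size_t i = 0; i < n; ++i) {
        unsigned char c = in[i], j = 0;
        while (table[j] != c) ++j;            /* position of c in the current order */
        out[i] = j;
        memmove(table + 1, table, j);         /* shift the prefix right ... */
        table[0] = c;                         /* ... and move c to the front */
    }
}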

Often used with BWT: mississipi → (BWT) → pssmipissii → (MTF) → 23033313010
“To be or not to be…”: 7807 bits of entropy (symbol-wise), 7033 bits after MTF, 6187 bits after BWT+MTF (7807 after BWT only)

Binning – the most significant bits of floats often have a characteristic distribution
Quantization, e.g. c → round(c/Q): store only the bin number (rate distortion theory)

Vector quantization (many coordinates at once):
- (r, angle) for energy conservation,
- for more convenient representation

Part 2: (Entropy) coding – the heart of data compression

Transformations, predictions, quantization etc. have led to a sequence of symbols/events we finally need to encode

Optimal storage of a symbol of probability p uses lg(1/p) bits. Using approximated probabilities means a suboptimal compression ratio. We need to model these probabilities, then use an (entropy) coder with them

For example unary coding: 0 → 0, 1 → 10, 2 → 110, 3 → 1110, 4 → 11110, …
It is optimal if Pr(0) = 0.5, Pr(1) = 0.25, …, Pr(k) = 2^−(k+1) (geometric)

Encoding a (p_i) distribution with an entropy coder optimal for a (q_i) distribution costs

ΔH = ∑_i p_i lg(1/q_i) − ∑_i p_i lg(1/p_i) = ∑_i p_i lg(p_i/q_i) ≈ (1/ln 4) ∑_i (p_i − q_i)²/p_i
more bits/symbol – the so-called Kullback-Leibler “distance” (not symmetric)

Imagine we want to store n values (e.g. hashes) of m bits: directly m·n bits
Their order contains lg(n!) ≈ lg((n/e)^n) = n lg(n) − n lg(e) bits of information
If the order is not important, we could use only ≈ n(m − lg(n) + 1.443) bits
It means halving storage for 178k of 32-bit hashes, or 11G of 64-bit hashes

How to cheaply approximate it?

Split the range (2^m positions) into n equal size buckets:
- use the unary system to write the number of hashes in each bucket: 2n bits,
- write suffixes inside buckets: lg(2^m/n) bits for each hash:
2n + n·lg(2^m/n) = n(m − lg(n) + 2) bits total

The overhead can be cheaply reduced to lg(3) ≈ 1.585 = 1.443 + 0.142 bits by using 1-trit coding instead of unary (James Dow Allen): “0”, “1”, “2 or more”, then store prefixes: a_1 < a_2 < … < a_{k−2} < a_k > a_{k−1}

We need n bits of information to choose one of 2^n possibilities.

For length-n 0/1 sequences with pn ones, how many bits do we need to choose one?

(n choose pn) = n! / ((pn)! ((1−p)n)!) ≈ (n/e)^n / ((pn/e)^(pn) ((1−p)n/e)^((1−p)n)) = 2^(n lg(n) − pn lg(pn) − (1−p)n lg((1−p)n)) = 2^(n(−p lg(p) − (1−p) lg(1−p)))

Entropy coder: encodes a sequence with probability distribution (p_s)_{s=0..m−1} using asymptotically at least H = ∑_s p_s lg(1/p_s) bits/symbol (H ≤ lg(m))

Seen as a weighted average: a symbol of probability p contains lg(1/p) bits.

Statistical modelling – where the probabilities come from

- Frequency counts – e.g. for each block (~140B/30kB), written in the header,
- Assume a parametric distribution, estimate its parameters – for example
  Pr(x) = (1/(2b)) exp(−|x−μ|/b), Pr(x) = (1/(σ√(2π))) exp(−(x−μ)²/(2σ²)), Pr(x) ∝ 1/x^(α+1),
- Context dependence, like neighboring pixels; similar contexts can be grouped,

e.g. order 1 Markov (the context is the previous symbol): use Pr(s_i | s_{i−1}),
- Prediction by partial matching (PPM) – a varying number of previous symbols as context (build and update a suffix tree),
- Context mixing (CM, ultracompressors like PAQ) – combine predictions from multiple contexts (e.g. PPM) using some machine learning.

Static or adaptive – modify probabilities (dependent on decoded symbols), modify CDF toward CDF of recently processed data block

Enwik9: CM 12.4%, PPM 15.7%, BWT 16%, dict 16.5%, LZ+bin 19.3%, LZ+byt: 21.5%, LZ: 32%
Decoding: 100h cmix, 1h mons, 400s M03, 15s glza, 55s LZMA, 2.2s ZSTD, 3.7s lz5

Adaptiveness – modify probabilities (dependent on decoded symbols), e.g.
for (int i = 1; i < m; i++) CDF[i] -= (CDF[i] - mixCDF[i]) >> rate;

where CDF[i] = ∑_{j<i} Pr(j) and mixCDF[] is the CDF of the recent data block
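A fleshed-out sketch of this adaptive step (scaling choices and helper names are mine): mixCDF[] is rebuilt from the symbol counts of the recently processed block, then the working CDF is pulled a fraction of the way toward it.

#define M 5               /* alphabet size, as in the {0,1,2,3,4} example below */
#define SCALE (1 << 16)   /* CDF[M] is kept equal to SCALE */

void adapt_cdf(int CDF[M + 1], const int count[M], int blockLen, int rate) {
    int mixCDF[M + 1];
    mixCDF[0] = 0;
    for (int s = 0; s < M; ++s)        /* CDF of the recent block, rescaled to SCALE */
        mixCDF[s + 1] = mixCDF[s] + (int)((long long)count[s] * SCALE / blockLen);
    mixCDF[M] = SCALE;                 /* a real coder would also enforce nonzero symbol ranges */
    for (int i = 1; i < M; ++i)        /* the update rule quoted above */
        CDF[i] -= (CDF[i] - mixCDF[i]) >> rate;
}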

example: {0014044010001001404441000010010000000000000000000000100000000000000000000000000000000000000000000000000}
{0,1,2,3,4} alphabet, length = 104, direct encoding: 104·lg(5) ≈ 241 bits

Static probability (stored?): {89, 8, 0, 0, 7}/104 → 104·∑_i p_i lg(1/p_i) ≈ 77 bits
Symbol-wise adaptation starting from a flat (uniform) or averaged initial distribution:

flat, rate = 3: −∑ lg(Pr) ≈ 70;  flat, rate = 4: −∑ lg(Pr) ≈ 78;  avg, rate = 3: −∑ lg(Pr) ≈ 67

Some universal codings – the upper bound doesn’t need to be known
unary coding: 0 → 0, 1 → 10, 2 → 110, 3 → 1110, 4 → 11110, …
is optimal if Pr(0) = 0.5, Pr(1) = 0.25, …, Pr(x) = 2^−(x+1) (geometric)

Golomb coding: choose parameter M (preferably a power of 2).
Write ⌊x/M⌋ with unary coding, then directly x mod M.
Needs ⌊x/M⌋ + 1 + lg(M) bits,
optimal for a ~geometric distribution: Pr(x) ≈ (1/M)·2^−⌊x/M⌋ ≈ (1/M)·(2^(1/M))^−x
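A Golomb-Rice sketch for M = 2^k, using an assumed bit writer put_bit() that is not defined on the slides: quotient in unary (ones terminated by a zero, matching the unary code above), remainder in k raw bits.

void golomb_encode(unsigned x, unsigned k, void (*put_bit)(int)) {
    unsigned q = x >> k;                          /* quotient floor(x/M), M = 2^k */
    for (unsigned i = 0; i < q; ++i) put_bit(1);  /* unary part */
    put_bit(0);
    for (int b = (int)k - 1; b >= 0; --b)         /* remainder x mod M, MSB first */
        put_bit((int)((x >> b) & 1));
}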

Elias gamma coding: write N = ⌊lg(x)⌋ zeroes, then the N + 1 bits of x.
Uses 2⌊lg(x)⌋ + 1 bits – optimal for Pareto: Pr(x) ≈ 2^−(2⌊lg(x)⌋+1) ≈ x^−2
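An Elias gamma sketch for x ≥ 1, with the same assumed put_bit() writer:

void elias_gamma_encode(unsigned x, void (*put_bit)(int)) {
    int N = 0;
    while ((x >> (N + 1)) != 0) ++N;              /* N = floor(lg x) */
    for (int i = 0; i < N; ++i) put_bit(0);       /* N zeroes */
    for (int b = N; b >= 0; --b)                  /* then the N+1 bits of x, leading 1 first */
        put_bit((int)((x >> b) & 1));             /* 2N+1 bits total */
}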

Directly writing x costs ⌊lg(x)⌋ + 1 bits → Pr(x) ≈ 1/x, but the length itself also costs
Elias delta coding – additionally writes lg(lg(x)) with unary, optimal for Pr(x) ≈ 1/(x lg²(x))
Elias omega coding – recursive, optimal for Pr(x) ≈ 1/(x·lg(x)·lg(lg(x))·…)

A symbol of probability p contains lg(1/p) bits of information
symbol → length-r bit sequence assumes: Pr(symbol) ≈ 2^−r

Huffman coding: perfect prefix code
- sort symbols by frequencies,
- replace the two least frequent with a tree of them, and so on

generally: prefix codes are defined by a tree; unique decoding: no codeword is a prefix of another

http://en.wikipedia.org/wiki/Huffman_coding
A → 0, B → 100, C → 101, D → 110, E → 111
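A compact sketch of the construction just described: repeatedly merge the two least frequent nodes; only code lengths are derived here (a toy O(n²) version, not any codec's production table builder).

#define NSYM 5

void huffman_lengths(const double freq[NSYM], int len[NSYM]) {
    double w[2 * NSYM];
    int parent[2 * NSYM], alive[2 * NSYM];
    int nodes = NSYM;
    for (int i = 0; i < NSYM; ++i) { w[i] = freq[i]; alive[i] = 1; parent[i] = -1; }
    for (int merges = 0; merges < NSYM - 1; ++merges) {
        int a = -1, b = -1;                        /* two smallest live weights */
        for (int i = 0; i < nodes; ++i) {
            if (!alive[i]) continue;
            if (a < 0 || w[i] < w[a]) { b = a; a = i; }
            else if (b < 0 || w[i] < w[b]) { b = i; }
        }
        w[nodes] = w[a] + w[b];                    /* merge into a new internal node */
        alive[nodes] = 1; parent[nodes] = -1;
        alive[a] = alive[b] = 0;
        parent[a] = parent[b] = nodes;
        ++nodes;
    }
    for (int i = 0; i < NSYM; ++i) {               /* code length = depth of leaf i */
        int d = 0;
        for (int j = i; parent[j] >= 0; j = parent[j]) ++d;
        len[i] = d;
    }
}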

Encoding a (p_i) distribution with an entropy coder optimal for a (q_i) distribution costs

ΔH = ∑_i p_i lg(1/q_i) − ∑_i p_i lg(1/p_i) = ∑_i p_i lg(p_i/q_i) ≈ (1/ln 4) ∑_i (p_i − q_i)²/p_i
more bits/symbol – the so-called Kullback-Leibler “distance” (not symmetric).

Huffman is inaccurate, terrible for skewed distributions (Pr(a) ≫ 0.5), e.g. Pr(a) = 0.99, Pr(b) = 0.01: H ≈ 0.08 bits/symbol, but Huffman must use a → 0, b → 1 (1 bit/symbol)
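A quick numeric check of this skewed example: the Shannon entropy of p = (0.99, 0.01) versus the 1 bit/symbol a Huffman code has to spend.

#include <stdio.h>
#include <math.h>

int main(void) {
    const double p[2] = {0.99, 0.01};
    double H = 0;
    for (int i = 0; i < 2; ++i)
        H += p[i] * log2(1.0 / p[i]);       /* H = sum_s p_s lg(1/p_s) */
    printf("H = %.4f bits/symbol vs Huffman's 1 bit/symbol\n", H);   /* ~0.0808 */
    return 0;
}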

We can reduce ΔH by grouping m symbols together (alphabet size 2^m or 3^m):

ΔH ∝ ~1/m, but the cost grows like 2^m

Huffman vs ANS in compressors (LZ + entropy coder): from Matt Mahoney's benchmarks http://mattmahoney.net/dc/text.html

enwik8 (100,000,000 B)               compressed size   encode time [ns/byte]   decode time [ns/byte]
ZSTD 0.6.0 –22 --ultra               25,405,601        701                     2.2
Brotli (Google Feb 2016) –q11 w24    25,764,698        3400                    5.9
LZA 0.82b –mx9 –b7 –h7               26,396,613        449                     9.7
lzturbo 1.2 –39 –b24                 26,915,461        582                     2.8
WinRAR 5.00 –ma5 –m5                 27,835,431        1004                    31
WinRAR 5.00 –ma5 –m2                 29,758,785        228                     30
lzturbo 1.2 –32                      30,979,376        19                      2.7
zhuff 0.97 –c2                       34,907,478        24                      3.5
gzip 1.3.5 –9                        36,445,248        101                     17
… 2.0.4 –ex                          36,556,552        171                     50
WinRAR 5.00 –ma5 –m1                 40,565,268        54                      31
ZSTD 0.4.2 –1                        40,799,603        7.1                     3.6

zhuff, ZSTD (Yann Collet, Facebook): LZ4 + tANS (switched from Huffman)
lzturbo (Hamid Bouzidi): LZ + tANS (switched from Huffman)

LZA (Nania Francesco): LZ + rANS (switched from range coding) – saving time and energy in an extremely frequent task

Apple LZFSE = Lempel-Ziv + Finite State Entropy
Default in iOS 9 and OS X 10.11

“matching the compression ratio of ZLIB level 5, but with much higher energy efficiency and speed (between 2x and 3x) for both encode and decode operation”

Finite State Entropy is Yann Collet’s (known from e.g. LZ4) implementation of tANS

Default DNA compression: CRAM 3.0 of European Bioinformatics Institute

Format                   Size             Encoding (s)   Decoding (s)
SAM                      5 579 036 306    46             -
CRAM v2 (LZ77+Huff)      1 053 744 556    183            27
CRAM v3 (order 1 rANS)     869 500 447    75             31

gzip is used everywhere … but it’s ancient … what’s next?
Google – first … (gzip compatible), now Brotli (incompatible, Huffman) – support, all news:
“Google’s new Brotli compression is 26 percent better than existing solutions” (22.09.2015)
… now ZSTD (tANS, Yann Collet, Facebook):

general (Apple LZFSE, Fb ZSTD), DNA (CRAM), games (LZNA, BitKnit), Google VP10, WebP

Huffman coding (HC), prefix codes: most of everyday compression,
e.g. gzip, cab, gif, png, pdf, … (can’t use less than 1 bit/symbol)
Zlibh (the fastest generally available implementation):
Encoding ~320 MB/s (/core, 3GHz), Decoding ~300-350 MB/s

Range coding (RC): large alphabet arithmetic coding, needs multiplication,
e.g. 7-Zip, Google VP video (e.g. YouTube, Hangouts).
Encoding ~100-150 MB/s, Decoding ~80 MB/s

(binary) Arithmetic Coding (AC): H.264, H.265 video, ultracompressors e.g. PAQ
Encoding/decoding ~20-30 MB/s

Example of the Huffman penalty for a truncated ρ(1−ρ)^n distribution (compression ratio):
ρ        8/H      zlibh    FSE
0.5      4.001    3.99     4.00
0.6      4.935    4.78     4.93
0.7      6.344    5.58     6.33
0.8      8.851    6.38     8.84
0.9      15.31    7.18     15.29
0.95     26.41    7.58     26.38
0.99     96.95    7.90     96.43
Huffman: 1 byte → at least 1 bit, so ratio ≤ 8 here

Asymmetric Numeral Systems (ANS), tabled variant (tANS) – without multiplication
FSE implementation of tANS: Encoding ~350 MB/s, Decoding ~500 MB/s

RC → ANS: ~7× decoding speedup, no multiplication (switched e.g. in the LZA compressor)
HC → ANS means better compression and ~1.5× decoding speedup (e.g. zhuff, lzturbo)

Operating on a fractional number of bits

restricting a range of size N to a length-L subrange contains lg(N/L) bits; adding a symbol of probability p (containing lg(1/p) bits): lg(N/L) + lg(1/p) ≈ lg(N/L′) for L′ ≈ p·L
a number x contains lg(x) bits; adding a symbol of probability p: lg(x) + lg(1/p) = lg(x′) for x′ ≈ x/p

Asymmetric numeral systems – redefine even/odd numbers:
symbol distribution s: ℕ → 𝒜 (𝒜 – alphabet, e.g. {0,1})
(s(x) = mod(x, b) for base b: C(s, x) = bx + s)

it should still be uniformly distributed – but with density p_s: #{0 ≤ y < x: s(y) = s} ≈ x·p_s
then x becomes the x-th appearance of the given symbol:
C(s, x) = x′ such that s(x′) = s and |{0 ≤ y < x′: s(y) = s(x′)}| = x
D(x′) = (s(x′), |{0 ≤ y < x′: s(y) = s(x′)}|)
C(D(x′)) = x′, D(C(s, x)) = (s, x), x′ ≈ x/p_s

s(x) = 0 if mod(x, 4) = 0, else s(x) = 1

example: range asymmetric binary systems (rABS), Pr(0) = 1/4, Pr(1) = 3/4
– take the base-4 system and merge 3 digits: the cyclic (0123) symbol distribution s becomes cyclic (0111):

s(x) = 0 if mod(x, 4) = 0, else s(x) = 1
to decode or encode 1, localize the quadruple (⌊x/4⌋ or ⌊x/3⌋)
if s(x) = 0: D(x) = (0, ⌊x/4⌋), else D(x) = (1, 3⌊x/4⌋ + mod(x, 4) − 1)
C(0, x) = 4x
C(1, x) = 4⌊x/3⌋ + 1 + mod(x, 3)
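A direct transcription of this rABS example (Pr(0) = 1/4, Pr(1) = 3/4), written as a round-trip sanity check rather than a streaming coder:

#include <assert.h>

typedef struct { int s; unsigned x; } SX;

SX rabs_D(unsigned x) {                  /* decoding function D(x) from above */
    SX r;
    if (x % 4 == 0) { r.s = 0; r.x = x / 4; }
    else            { r.s = 1; r.x = 3 * (x / 4) + x % 4 - 1; }
    return r;
}

unsigned rabs_C(int s, unsigned x) {     /* coding functions C(0,x), C(1,x) from above */
    return s == 0 ? 4 * x
                  : 4 * (x / 3) + 1 + x % 3;
}

int main(void) {
    for (unsigned x = 0; x < 1000; ++x) {   /* C(D(x)) == x for every state */
        SX d = rabs_D(x);
        assert(rabs_C(d.s, d.x) == x);
    }
    return 0;
}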

rANS – range variant for a large alphabet 𝒜 = {0, …, m − 1}
assume Pr(s) = f_s/2^n, c_s := f_0 + f_1 + ⋯ + f_{s−1}
start with the base-2^n numeral system and merge length-f_s ranges
for x ∈ {0, 1, …, 2^n − 1}: s(x) = max{s: c_s ≤ x}, mask = 2^n − 1

encoding: C(s, x) = (⌊x/f_s⌋ ≪ n) + mod(x, f_s) + c_s
decoding: s = s(x & mask) (e.g. tabled, alias method)

D(x) = (s, f_s·(x ≫ n) + (x & mask) − c_s)

Plus renormalization to keep, for example, x ∈ {2^16, …, 2^32 − 1}, n = 12:

Decoding step (mask = 2^n − 1):
  s = symbol[x & mask];
  writeSymbol(s);
  x = f[s] * (x >> n) + (x & mask) - c[s];
  if (x < (1 << 16)) x = (x << 16) + read16bits();

Encoding step (msk = 2^16 − 1):
  s = readSymbol();
  if (x > bound[s]) { write16bits(x & msk); x >>= 16; }
  x = ((x / f[s]) << n) + (x % f[s]) + c[s];

CRAM v3 (2015): order 1 rANS: 256 frequency tables depending on the previous symbol

Plus renormalization to keep, for example, x ∈ {2^16, …, 2^32 − 1}, n = 12:

Decoding step (mask = 2^n − 1):
  s = symbol(x & mask);
  writeSymbol(s);
  x = f[s] * (x >> n) + (x & mask) - CDF[s];
  if (x < (1 << 16)) x = (x << 16) + read16bits();

Encoding step (msk = 2^16 − 1):
  s = readSymbol();
  if (x > bound[s]) { write16bits(x & msk); x >>= 16; }
  x = ((x / f[s]) << n) + (x % f[s]) + CDF[s];

Similar to Range Coding, but decoding has 1 multiplication (instead of 2), the range for determining the symbol is fixed, and the state is 1 number (instead of 2), making it convenient for SIMD vectorization (https://github.com/rygorous/ryg_rans).

Various ways to handle symbol(y) = s, i.e. CDF[s] ≤ y < CDF[s + 1] for 0 ≤ y < 2^n: tabled (symbol[y]), search (binary or SIMD – CDF only!), or the alias method:

‘Alias’ method: rearrange the probability distribution into m buckets, each containing its primary symbol and possibly a single ‘alias’ symbol
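Of the symbol(y) strategies just listed, here is a sketch of the binary-search option: find s with CDF[s] ≤ y < CDF[s+1], where CDF has m+1 entries with CDF[0] = 0 and CDF[m] = 2^n.

int symbol_from_cdf(const unsigned CDF[], int m, unsigned y) {
    int lo = 0, hi = m;                  /* invariant: CDF[lo] <= y < CDF[hi] */
    while (hi - lo > 1) {
        int mid = (lo + hi) / 2;
        if (CDF[mid] <= y) lo = mid; else hi = mid;
    }
    return lo;
}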

uABS – uniform binary variant (𝒜 = {0,1}) – extremely accurate

Assume a binary alphabet, p := Pr(1), denote x_s = #{y < x: s(y) = s} ≈ x·p_s
For a uniform symbol distribution we can choose: x_1 = ⌈x·p⌉, x_0 = x − x_1 = x − ⌈x·p⌉
s(x) = 1 if there is a jump on the next position: s = s(x) = ⌈(x + 1)·p⌉ − ⌈x·p⌉
decoding function: D(x) = (s, x_s)

its inverse – the coding function:
C(0, x) = ⌈(x + 1)/(1 − p)⌉ − 1
C(1, x) = ⌊x/p⌋

For 푝 = Pr(1) = 1 − Pr(0) = 0.3:
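A minimal integer-arithmetic check of these uABS formulas for p = 3/10 (the P, Q constants and helper names are mine):

#include <assert.h>

#define P 3    /* Pr(1) = P/Q = 0.3 */
#define Q 10

static unsigned ceil_div(unsigned a, unsigned b) { return (a + b - 1) / b; }

void uabs_D(unsigned x, int *s, unsigned *xs) {
    unsigned x1 = ceil_div(x * P, Q);              /* x_1 = ceil(x*p) */
    *s = (int)(ceil_div((x + 1) * P, Q) - x1);     /* 1 iff there is a jump at the next position */
    *xs = *s ? x1 : x - x1;
}

unsigned uabs_C(int s, unsigned x) {
    return s ? (x * Q) / P                         /* C(1,x) = floor(x/p)            */
             : ceil_div((x + 1) * Q, Q - P) - 1;   /* C(0,x) = ceil((x+1)/(1-p)) - 1 */
}

int main(void) {
    for (unsigned x = 0; x < 10000; ++x) {         /* C and D are mutually inverse */
        int s; unsigned xs;
        uabs_D(x, &s, &xs);
        assert(uabs_C(s, xs) == x);
    }
    return 0;
}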

Stream version – renormalization

Up to now: we encode using succeeding C functions into a huge number x, then decode (in the opposite direction!) using succeeding D. Like in arithmetic coding, we need renormalization to limit the working precision – enforce x ∈ I = {L, …, bL − 1} by transferring the base-b youngest digits:

ANS decoding step from state x:
  (s, x) = D(x);
  useSymbol(s);
  while x < L: x = b·x + readDigit();

encoding step for symbol s from state x:
  while x ≥ maxX[s]:   // maxX[s] = b·L_s
    { writeDigit(mod(x, b)); x = ⌊x/b⌋ };
  x = C(s, x);

For unique decoding, we need to ensure that there is a single way to perform the above loops:

I = {L, …, bL − 1}, I_s = {L_s, …, bL_s − 1}, where I_s = {x: C(s, x) ∈ I}

Fulfilled e.g. for:
- rABS/rANS when p_s = f_s/2^n has 1/L accuracy: 2^n divides L,
- uABS when p has 1/L accuracy: b⌈Lp⌉ = ⌈bLp⌉,
- in tabled variants (tABS/tANS) it will be fulfilled by construction.

Decoding steps of the binary variants (32-bit x, 16-bit renormalization, mask = 0xffff, p0 = scaled Pr(0)):

uABS:
  xp = (uint64_t)x * p0;
  out[i] = ((xp & mask) >= p0);
  xp >>= 16;
  x = out[i] ? xp : x - xp;

rABS:
  xf = x & mask;
  xn = p0 * (x >> 16);
  if (xf < p0) { x = xn + xf; out[i] = 0; }
  else         { x -= xn + p0; out[i] = 1; }

arithmetic coding:
  split = low + ((uint64_t)(hi - low) * p0 >> 16);
  out[i] = (x > split);
  if (out[i]) { low = split + 1; } else { hi = split; }

renormalization (uABS/rABS): if (x < 0x10000) { x = (x << 16) | *ptr++; }
renormalization (AC): if ((low ^ hi) < 0x10000) { x = (x << 16) | *ptr++; low <<= 16; hi = (hi << 16) | mask; }

RENORMALIZATION to prevent x → ∞
Enforce x ∈ I = {L, …, 2L − 1} by transmitting the lowest bits to the bitstream
“buffer” x contains lg(x) bits of information – it produces bits when they accumulate
A symbol of Pr = p contains lg(1/p) bits: lg(x) → lg(x) + lg(1/p) modulo 1

Tabled variant (tABS/tANS) – put everything into a table:


Single step of the stream version: to get from x ∈ I to I_s = {L_s, …, bL_s − 1}, we need to transfer k digits:
x → (C(s, ⌊x/b^k⌋), mod(x, b^k)) where k = ⌊log_b(x/L_s)⌋
k = k_s or k = k_s − 1 for k_s = −⌊log_b(p_s)⌋ = −⌊log_b(L_s/L)⌋

e.g.: p_s = 13/66, b = 2, k_s = 3, L_s = 13, L = 66, b^(k_s)·x + 3 = 115 for x = 14:

Huffman: k = −lg(p_s) = k_s, the above lines are vertical

General picture: the encoder prepares before consuming the succeeding symbol,
the decoder produces a symbol, then consumes succeeding digits

Decoding is in the opposite direction: we have a stack of symbols (LIFO)
- encoding should be made in the backward direction (a buffer is required),
- the final state has to be stored, but we can write information in the initial state,
- alternatively: fixing this state will make it a checksum – random after an error.

In a single step (I = {L, …, bL − 1}): lg(x) → ≈ lg(x) + lg(1/p) modulo lg(b)

Three sources of unpredictability/chaoticity:
1) Asymmetry: behavior strongly depends on the chosen symbol – a small difference changes the decoded symbol and so the entire behavior.
2) Ergodicity: usually log_b(1/p) is irrational – succeeding iterations cover the entire range.
3) Diffusivity: C(s, x) is close to but not exactly x/p_s – there is additional ‘diffusion’ around the expected value

So lg(x) ∈ [lg(L), lg(L) + lg(b)) has a nearly uniform distribution – x has approximately a

Pr(x) ∝ 1/x

probability distribution – it contains lg(1/Pr(x)) ≈ lg(x) + const bits of information.

What symbol spread should we choose? (link)
(using a PRNG seeded with a cryptographic key for encryption)

for example: rate loss for p = (0.04, 0.16, 0.16, 0.64), L = 16 states,
q = (1, 3, 2, 10), q/L = (0.0625, 0.1875, 0.125, 0.625)

method              symbol spread       dH/H rate loss   note
(quantization)      -                   ~0.011           penalty of the quantizer itself
Huffman             0011222233333333    ~0.080           what a Huffman decoder would give
spread_range_i()    0111223333333333    ~0.059           increasing order
spread_range_d()    3333333333221110    ~0.022           decreasing order
spread_fast()       0233233133133133    ~0.020           fast
spread_prec()       3313233103332133    ~0.015           close to the quantization dH/H
spread_tuned()      3233321333313310    ~0.0046          better than quantization dH/H due to using also p
spread_tuned_s()    2333312333313310    ~0.0040          L log L complexity (sort)
spread_tuned_p()    2331233330133331    ~0.0058          testing the 1/(p ln(1+1/i)) ~ i/p approximation

Precise initialization (heuristic)
N_s = {(0.5 + i)/p_s : i = 0, …, L_s − 1} are uniform – we need to shift them to natural numbers.
(priority queue with put, getmin)
for s = 0 to n − 1 do put((0.5/p_s, s));
for X = 0 to L − 1 do {(v, s) = getmin; put((v + 1/p_s, s)); symbol[X] = s; }
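A small C sketch of this heuristic, with a plain O(L·n) arg-min scan standing in for the priority queue (tie-breaking details may differ from the toolkit's spread_prec()):

#define NS 4   /* alphabet size of the running example */

void spread_prec(const double p[NS], int L, int symbol[]) {
    double next[NS];                          /* next preferred position (0.5+i)/p_s */
    for (int s = 0; s < NS; ++s) next[s] = 0.5 / p[s];
    for (int X = 0; X < L; ++X) {
        int best = 0;
        for (int s = 1; s < NS; ++s)          /* symbol with the smallest pending position */
            if (next[s] < next[best]) best = s;
        symbol[X] = best;
        next[best] += 1.0 / p[best];
    }
}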

approximately: ΔH ∝ (alphabet size / L)²

Tuning: p_s ≈ q_s/L. Can we “tune” the spread of the q_s symbol appearances according to the actual p_s?

Shift symbols right when p_s < q_s/L, left otherwise

Assume Pr(x) ≈ 1/(x ln(2)), then ∑_{x=a..b} Pr(x) ≈ lg(b/a)
s appears q_s times: i ∈ I_s = {q_s, …, 2q_s − 1}
Pr(i-th interval) ≈ lg((i + 1)/i)
To fulfill the Pr(x) assumption, x for this interval should fulfill:
lg((i + 1)/i)·p_s ≈ 1/(x ln(2))
x ≈ 1/(p_s ln(1 + 1/i)) is the preferred position for the i-th (i ∈ I_s = {q_s, …, 2q_s − 1}) appearance of the symbol with p_s ≈ q_s/L
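The derived preferred position as a tiny helper (illustrative only; the full spread functions live in the toolkit linked below):

#include <math.h>

double tuned_position(double ps, int i) {   /* i-th appearance, i in {q_s, ..., 2*q_s - 1} */
    return 1.0 / (ps * log(1.0 + 1.0 / i)); /* x ~ 1/(p_s ln(1 + 1/i)) */
}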

https://github.com/JarekDuda/AsymmetricNumeralSystemsToolkit

tABS: test all possible symbol distributions for the binary alphabet

store tables for quantized probabilities (p), e.g. 16·16 = 256 bytes ← for 16 states, ΔH ≈ 0.001

H.264 “M decoder” (arithmetic coding) vs tABS:
  t = decodingTable[p][X];
  X = t.newX + readBits(t.nbBits);
  useSymbol(t.symbol);

no branches, no bit-by-bit renormalization; the state is a single number (e.g. for SIMD)

Additional tANS advantage – simultaneous encryption
we can use the huge freedom during initialization: when choosing the symbol distribution, slightly disturb s(x) using a PRNG initialized with a cryptographic key

ADVANTAGES compared to standard (symmetric) cryptography:

standard, e.g. DES, AES                       ANS based cryptography
based on XOR, permutations                    highly nonlinear operations (chosen at initialization)
fixed-length bit blocks                       pseudorandomly varying lengths
“brute force” or QC attacks just start        need to perform initialization first for every new
decoding to test a cryptokey                  cryptokey, which can be fixed to need e.g. 0.1s
speed: online calculation                     most calculations done while initializing
entropy: operates on bits                     operates on any input distribution

tANS (2007) – fully tabled: Apple LZFSE, Facebook ZSTD, lzturbo
fast: no multiplication (FPGA!), less memory efficient (~8kB for 2048 states),
static in ~32kB blocks, costly to update (rather needs rebuilding),
allows for simultaneous encryption (PRNG to perturb the symbol spread)

tANS decoding step:
  t = decodingTable[x];
  writeSymbol(t.symbol);
  x = t.newX + readBits(t.nbBits);
tANS encoding step (for symbol s):
  nbBits = (x + nb[s]) >> r;
  writeBits(x, nbBits);
  x = encodingTable[start[s] + (x >> nbBits)];

rANS (2013) – needs multiplication – CRAM (DNA), VP10 (video), LZNA, BitKnit
more memory effective – especially for large alphabet and precision (CDF only),
better for adaptation (CDF[s] ≤ y < CDF[s+1] – tabled, alias, binary search/SIMD)

rANS decoding step (mask = 2^n − 1):
  s = symbol(x & mask);
  writeSymbol(s);
  x = f[s] * (x >> n) + (x & mask) - CDF[s];
  if (x < (1 << 16)) x = (x << 16) + read16bits();
rANS encoding step (s) (msk = 2^16 − 1):
  s = readSymbol();
  if (x > bound[s]) { write16bits(x & msk); x >>= 16; }
  x = ((x / f[s]) << n) + (x % f[s]) + CDF[s];

i5-4300U: FSE/tANS: 295/467, rANS: 221/342, zlibh: 225/210 MB/s

General scheme for Data Compression … data encoding:

1) Transformations, predictions
To decorrelate data, make it more predictable, encode only new information
Delta coding, YCrCb, Fourier, Lempel-Ziv, Burrows-Wheeler, MTF, RLC …

2) If lossy, quantize coefficients (e.g. c → round(c/Q))

3) Statistical modeling of final events/symbols → sequence of (context_ID, symbol)
Static, parametrized, stored in headers, context-dependent, adaptive

4) (Entropy) coding: use lg(1/p_s) bits per symbol, H = ∑_s p_s lg(1/p_s) bits total
Prefix/Huffman: fast/cheap, but inaccurate (p ~ 2^−r)
Range/arithmetic: costly (multiplication), but accurate
Asymmetric Numeral Systems – fast and accurate:
general (Apple LZFSE, Fb ZSTD), DNA (CRAM), games (LZNA, BitKnit), Google VP10, WebP