BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY Combining the Burrows Wheeler Transform and the ENUMERATIVE CODING Context-Tree Weighting Algorithm

CONTEXT-TREE WEIGHTING BURROWS WHEELER Frans M.J. Willems CODING BW-SEQUENCES

CODING B-COUNTS Department of Electrical Engineering Eindhoven University of Technology FIND BEST TREE MODEL

BINARY DECOMPOSITION 1st IEEE Seminar on Future Directions in CONCLUSION & 2nd African Winter School on FUTURE DIRECTIONS and Communications, August 16-21, 2015, Protea Kruger Gate, South Africa Motivation

BWT and CTW European School on Information Theory, April 20-24, 2015, Zandvoort,

Frans Willems The Netherlands:

INTRODUCTION Motivation Universal Source Coding Algorithms Problem, Outline IID SOURCES, PREFIX CODES, REDUNDANCY

ENUMERATIVE CODING

ARITHMETIC CODING

CONTEXT-TREE WEIGHTING

BURROWS WHEELER

CODING BW-SEQUENCES

CODING B-COUNTS

FIND BEST TREE MODEL

BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS

Tutorial lectures were given by Young-Han Kim, Stephanie Wehner, Imre Csiszar, and Stephan ten Brink and Motivation

BWT and CTW

Frans Willems

INTRODUCTION Motivation Richard Durbin (Genome Informatics Universal Source Coding Group, Sanger Institute, Cambridge, Algorithms Problem, Outline UK) IID SOURCES, PREFIX CODES, REDUNDANCY Human Genome Project, now 99598

ENUMERATIVE CODING citations, h-index = 102, now co-leader

ARITHMETIC CODING of 1000 Genomes Project.

CONTEXT-TREE WEIGHTING “Storage and Search of Genome

BURROWS WHEELER Sequence Information”. CODING BW-SEQUENCES Based on the Burrows-Wheeler CODING B-COUNTS transform: “A Block-sorting Lossless FIND BEST TREE MODEL Algorithm”, Digital BINARY DECOMPOSITION Systems Research Center Report 124,

CONCLUSION 1994, by M. Burrows, & D.J. Wheeler.

FUTURE DIRECTIONS Universal Source Coding Algorithms

BWT and CTW

Frans Willems

INTRODUCTION CLASSES of UNIVERSAL SOURCE CODING ALGORITHMS: Motivation String Matching methods Universal Source Coding Algorithms Ziv-Lempel (1977, Sliding-Window, IT-Soc Best Paper Award). Problem, Outline Ziv-Lempel (1978, Dictionary). IID SOURCES, PREFIX Elias (1987, Recency rank, Move-to-Front analysis). W. (1989, CODES, REDUNDANCY Kac-connection). Wyner-Ziv (1994, LZ77 - proof). ENUMERATIVE CODING Source-Modeling and Arithmetic Coding methods ARITHMETIC CODING Arithm. Cod.: Pasco (1976), Rissanen (1976), Rissanen-Langdon CONTEXT-TREE WEIGHTING (1979), Rissanen-Langdon (1981)

BURROWS WHEELER Source Modeling: Langdon-Rissanen (1983), Rissanen (1983, Context-algorithm), Weinberger-Rissanen-Feder (1995), CODING BW-SEQUENCES W.-Shtarkov-Tjalkens (1995, IT-Soc Best Paper Award): CODING B-COUNTS Context-Tree Weighting (CTW) Method FIND BEST TREE MODEL Burrows-Wheeler (BW) transform, followed by Move-to-Front and BINARY DECOMPOSITION Huffman Coding Methods CONCLUSION BW: 1994 FUTURE DIRECTIONS MTF: Bentley-Sleator-Tarjan-Wei (1986), Elias (1987) Problem Description

BWT and CTW

Frans Willems focusses on minimizing Memory and Computational INTRODUCTION Complexity. Motivation Information Theory focusses on minimizing Redundancy. Universal Source Coding Algorithms Problem, Outline BW is very efficient, allows for searching in the compressed domain, but is IID SOURCES, PREFIX SUBOPTIMAL in terms of redundancy for tree sources. CODES, REDUNDANCY BW does NOT achieve the Rissanen lower bound (1/2 log2 N bits per ENUMERATIVE CODING parameter, 1984, IT-Soc Best Paper Award). ARITHMETIC CODING Effros-Visweswariah-Kulkarni-Verdu (2002): in the binary case log2 N bits CONTEXT-TREE per parameter extra redundancy per parameter. WEIGHTING BURROWS WHEELER Fixed-depth CTW is linear in N (memory and computations) but is CODING BW-SEQUENCES REDUNDANCY OPTIMAL in Rissanen sense (1/2 log2 N bits per CODING B-COUNTS parameter) for tree sources. CTW does not allow for searching in the

FIND BEST TREE MODEL compressed domain however.

BINARY DECOMPOSITION CONCLUSION PROBLEM: FUTURE DIRECTIONS Improve the redundancy of BW data-compression. Problem Description

BWT and CTW

Frans Willems Computer Science focusses on minimizing Memory and Computational INTRODUCTION Complexity. Motivation Information Theory focusses on minimizing Redundancy. Universal Source Coding Algorithms Problem, Outline BW is very efficient, allows for searching in the compressed domain, but is IID SOURCES, PREFIX SUBOPTIMAL in terms of redundancy for tree sources. CODES, REDUNDANCY BW does NOT achieve the Rissanen lower bound (1/2 log2 N bits per ENUMERATIVE CODING parameter, 1984, IT-Soc Best Paper Award). ARITHMETIC CODING Effros-Visweswariah-Kulkarni-Verdu (2002): in the binary case log2 N bits CONTEXT-TREE per parameter extra redundancy per parameter. WEIGHTING BURROWS WHEELER Fixed-depth CTW is linear in N (memory and computations) but is CODING BW-SEQUENCES REDUNDANCY OPTIMAL in Rissanen sense (1/2 log2 N bits per CODING B-COUNTS parameter) for tree sources. CTW does not allow for searching in the

FIND BEST TREE MODEL compressed domain however.

BINARY DECOMPOSITION CONCLUSION PROBLEM: FUTURE DIRECTIONS Improve the redundancy of BW data-compression. Outline Lecture

BWT and CTW

Frans Willems

INTRODUCTION Motivation INTRODUCTION Universal Source Coding Algorithms IID SOURCES, PREFIX CODES, REDUNDANCY Problem, Outline IID SOURCES, PREFIX ENUMERATIVE CODING CODES, REDUNDANCY ARITHMETIC CODING ENUMERATIVE CODING CONTEXT-TREE WEIGHTING ARITHMETIC CODING

CONTEXT-TREE BURROWS WHEELER WEIGHTING CODING BW-SEQUENCES BURROWS WHEELER CODING B-COUNTS CODING BW-SEQUENCES FINDING BEST TREE MODEL CODING B-COUNTS

FIND BEST TREE MODEL BINARY DECOMPOSITION

BINARY DECOMPOSITION CONCLUSION CONCLUSION FUTURE DIRECTIONS FUTURE DIRECTIONS Binary Sources, Sequences, IID

BWT and CTW x1x2 ··· xN Frans Willems Binary Source

INTRODUCTION

IID SOURCES, PREFIX N The binary source produces a sequence x1 = x1x2 ··· xN with CODES, REDUNDANCY N Binary IID Sources components ∈ {0, 1} with probability P(x1 ). Prefix Codes Redundancy Definition (Binary IID Source) ENUMERATIVE CODING For an independent identically distributed (i.i.d.) source with parameter ARITHMETIC CODING θ, for 0 ≤ θ ≤ 1, CONTEXT-TREE N WEIGHTING N Y P(x ) = P(xn), BURROWS WHEELER 1 n=1 CODING BW-SEQUENCES where CODING B-COUNTS P(1) = θ, and P(0) = 1 − θ. FIND BEST TREE MODEL

BINARY DECOMPOSITION N A sequence x1 containing w ones (i.e., having weight w) and N − w zeros CONCLUSION has probability N N−w w FUTURE DIRECTIONS P(x1 ) = (1 − θ) θ .

Definition (Entropy Binary IID Source)

∆ 1 1 The ENTROPY of this source is h(θ) = (1 − θ) log2 1−θ + θ log2 θ (bits). Prefix Codes

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX Definition (Prefix Code) CODES, REDUNDANCY Binary IID Sources In a prefix code no codeword is the prefix of any other codeword. Prefix Codes Redundancy We restrict ourselves to prefix codes. Codewords in a prefix code can be ENUMERATIVE CODING regarded as leaves in a rooted tree. Prefix codes lead to instantaneous ARITHMETIC CODING decodability. CONTEXT-TREE WEIGHTING Example BURROWS WHEELER CODING BW-SEQUENCES N N N 111 x1 c(x1 ) L(x1 ) CODING B-COUNTS 00 0 1 11 FIND BEST TREE MODEL 01 10 2 1 110 BINARY DECOMPOSITION 10 110 3 λ 10 11 111 3 CONCLUSION 0 FUTURE DIRECTIONS Prefix-Codes (cont.)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY Binary IID Sources Prefix Codes Redundancy Theorem (Kraft, 1949) ENUMERATIVE CODING (a) The lengths of the codewords in a prefix code satisfy Kraft’s inequality ARITHMETIC CODING X −L(xN ) CONTEXT-TREE 2 1 ≤ 1. WEIGHTING N N x1 ∈X BURROWS WHEELER CODING BW-SEQUENCES (b) For codeword lengths satisfying Kraft’s inequality there exists a prefix CODING B-COUNTS code with these lengths. FIND BEST TREE MODEL

BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Prefix-Codes (cont.)

BWT and CTW

Frans Willems This leads to: INTRODUCTION

IID SOURCES, PREFIX Theorem (Fano, 1961) CODES, REDUNDANCY Binary IID Sources (a) Any prefix code satisfies Prefix Codes Redundancy N N E[L(X1 )] ≥ H(X1 ) = Nh(θ). ENUMERATIVE CODING

ARITHMETIC CODING N 1 The minimum is achieved if and only if L(x1 ) = log2 N (ideal P(x1 ) CONTEXT-TREE N N N WEIGHTING codeword length) for all x ∈ X with nonzero P(x ). 1  1  BURROWS WHEELER (b) There exist prefix codes with L(xN ) = log 1 , hence 1 2 P(xN ) CODING BW-SEQUENCES 1 CODING B-COUNTS 1 L(xN ) < log + 1 FIND BEST TREE MODEL 1 2 N P(x1 ) BINARY DECOMPOSITION CONCLUSION (ideal codeword length plus 1 bit). These codes achieve FUTURE DIRECTIONS N N E[L(X1 )] < H(X1 ) + 1 = Nh(θ) + 1 bit. Redundancy

BWT and CTW

Frans Willems

INTRODUCTION Definition IID SOURCES, PREFIX The individual redundancy ρ(xN ) of a sequence xN is defined as CODES, REDUNDANCY 1 1 Binary IID Sources 1 Prefix Codes ρ(xN ) = L(xN ) − log , Redundancy 1 1 2 N P(x1 ) ENUMERATIVE CODING

ARITHMETIC CODING i.e., codeword-length minus ideal codeword-length.

CONTEXT-TREE WEIGHTING

BURROWS WHEELER Definition

CODING BW-SEQUENCES N The expected redundancy ρ(x1 ) is defined as CODING B-COUNTS 1 FIND BEST TREE MODEL ρ = E[L(X N )] − E[log ] = E[L(X N )] − H(X N ), 1 2 N 1 1 BINARY DECOMPOSITION P(x1 ) CONCLUSION i.e., expected codeword-length minus entropy. FUTURE DIRECTIONS Note that there exist codes with individual redundancies ≤ 1 and consequently with expected redundancies ≤ 1. Indexing Sequences of Fixed Weight

BWT and CTW IDEA: Frans Willems Sequences having the same number of ones (weight) have the same INTRODUCTION probability and only need to be INDEXED. IID SOURCES, PREFIX CODES, REDUNDANCY

ENUMERATIVE CODING Definition (Lexicographical Ordering) Indexing N N Pascal-∆ Method In a lexicographical ordering (0 < 1) we say that x1 < y1 if xn < yn for Weight Coding the smallest index n such that xn 6= yn. ARITHMETIC CODING N N Consider a subset S of the set {0, 1} . Let iS (x1 ) be the lexicographical CONTEXT-TREE index of xN ∈ S, i.e., the number of sequences y N < xN for y N ∈ S. WEIGHTING 1 1 1 1

BURROWS WHEELER

CODING BW-SEQUENCES Example CODING B-COUNTS N N N N Let N = 5 and S = {x1 : w(x1 ) = 3} where w(x1 ) is the weight of x1 . FIND BEST TREE MODEL 5 Then |S| = 3 = 10 and: BINARY DECOMPOSITION

CONCLUSION iS (11100) = 9 iS (10011) = 4

FUTURE DIRECTIONS iS (11010) = 8 iS (01110) = 3

iS (11001) = 7 iS (01101) = 2

iS (10110) = 6 iS (01011) = 1

iS (10101) = 5 iS (00111) = 0 Pascal-∆ Method (Lynch (1966), Davisson (1966), Schalkwijk (1972))

BWT and CTW

Frans Willems Example (From Sequence to Index) 5 INTRODUCTION N P  Let N = 5 and S = {x1 : xn = 3}. Then |S| = 3 = 10. IID SOURCES, PREFIX CODES, REDUNDANCY 1 ENUMERATIVE CODING Indexing 0 1 Pascal-∆ Method Weight Coding 1 1 ARITHMETIC CODING

CONTEXT-TREE 0 1 0 1 WEIGHTING 1 2 1 BURROWS WHEELER

CODING BW-SEQUENCES 0 1 0 1 0 1

CODING B-COUNTS 1 3 3 1

FIND BEST TREE MODEL 0 1 0 1 0 1 0 1 BINARY DECOMPOSITION 1 4 6 4 1 CONCLUSION FUTURE DIRECTIONS 0 1 0 1 0 1 0 1 0 1 1 5 10 10 5 1

i(11010) = 4 + 3 + 1 = 8. Pascal-∆ Method

BWT and CTW Example (From Index to Sequence) Frans Willems Again N = 5 and S = {xN : P x = 3}. INTRODUCTION 1 n

IID SOURCES, PREFIX CODES, REDUNDANCY 1

ENUMERATIVE CODING Indexing 0 1 Pascal-∆ Method 1 1 Weight Coding ARITHMETIC CODING 0 1 0 1 CONTEXT-TREE WEIGHTING 1 2 1

BURROWS WHEELER 0 1 0 1 0 1 CODING BW-SEQUENCES 1 3 3 1 CODING B-COUNTS FIND BEST TREE MODEL 0 1 0 1 0 1 0 1 BINARY DECOMPOSITION 1 4 6 4 1

CONCLUSION 0 1 0 1 0 1 0 1 0 1 FUTURE DIRECTIONS 1 5 10 10 5 1

Index i = 8, now (a) 8 ≥ 4 hence x1 = 1, (b) 8 ≥ 4 + 3 hence x2 = 1, (c) 8 < 4 + 3 + 2 hence x3 = 0, (d) 8 ≥ 4 + 3 + 1 hence x4 = 1, (e) x5 = 0. How to Code the Index?

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY N N ENUMERATIVE CODING If sequence x1 has weight w, the index i(x1 ) is encoded using a Indexing fixed-length code of length Pascal-∆ Method Weight Coding  N L (xN ) = log . ARITHMETIC CODING I 1 2 w CONTEXT-TREE WEIGHTING BURROWS WHEELER Example CODING BW-SEQUENCES Index iS (11010) = 8. Since |S| = 10 the length of the fixed-length code CODING B-COUNTS N for i(x1 ) is 4, and the corresponding codeword could be 1000. FIND BEST TREE MODEL

BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS How to Code the Weight? θ Coding

BWT and CTW

Frans Willems The weights that can occur are w ∈ {0, 1, ··· , N}. Knowing source parameter θ we can easily compute the probability that weight w occurs INTRODUCTION N IID SOURCES, PREFIX P (w) = (1 − θ)N−w θw . CODES, REDUNDANCY θ w ENUMERATIVE CODING Indexing N The weight w of sequence x1 can therefore be encoded with a Pascal-∆ Method Weight Coding variable-length code of length ARITHMETIC CODING & ' 1 CONTEXT-TREE Lθ(w) = log2 . WEIGHTING N N−w w w (1 − θ) θ . BURROWS WHEELER N CODING BW-SEQUENCES Now we can write for the total codeword length for sequence x1 having CODING B-COUNTS weight w

FIND BEST TREE MODEL N N L(x1 ) = Lθ(w) + LI (x1 ) BINARY DECOMPOSITION & ' 1  N CONCLUSION = log + log 2 N N−w w 2 w FUTURE DIRECTIONS w (1 − θ) θ 1 ≤ log + 2. 2 (1 − θ)N−w θw θ Coding: Individual and Expected Redundancy

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX N N CODES, REDUNDANCY Assume that our sequence x1 has weight w = w(x1 ), then for all ENUMERATIVE CODING 0 ≤ θ ≤ 1, we get for the individual redundancy Indexing Pascal-∆ Method N N 1 Weight Coding ρ(x ) = L(x ) − log 1 1 2 (1 − θ)N−w θw ARITHMETIC CODING 1 1 CONTEXT-TREE ≤ log2 + 2 − log2 WEIGHTING (1 − θ)N−w θw (1 − θ)N−w θw BURROWS WHEELER = 2 bits. CODING BW-SEQUENCES

CODING B-COUNTS For the expected redundancy, when 0 ≤ θ ≤ 1, we get

FIND BEST TREE MODEL N ρ = E[L(X1 )] − Nh(θ) ≤ 2 bits. BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS How to Code the Weight? Uniform Coding

BWT and CTW

Frans Willems

INTRODUCTION IID SOURCES, PREFIX The weights that can occur are w ∈ {0, 1, ··· , N + 1}, hence there are CODES, REDUNDANCY N + 1 alternatives. Therefore we can use a uniform code of length ENUMERATIVE CODING Indexing Pascal-∆ Method LU (w) = dlog2(N + 1)e . Weight Coding N ARITHMETIC CODING Now we can write for the total codeword length for sequence x1 having CONTEXT-TREE weight w WEIGHTING N N BURROWS WHEELER L(x1 ) = LU (w) + LI (x1 )

CODING BW-SEQUENCES  N = dlog2(N + 1)e + log2 CODING B-COUNTS w FIND BEST TREE MODEL N ≤ log2(N + 1) + 2. BINARY DECOMPOSITION w CONCLUSION

FUTURE DIRECTIONS Uniform Coding: Individual Redundancy

BWT and CTW N Assume that our sequence x1 has weight w, then for all 0 ≤ θ ≤ 1, then Frans Willems we get for the individual redundancy

INTRODUCTION N N 1 ρ(x1 ) = L(x1 ) − log2 IID SOURCES, PREFIX (1 − θ)N−w θw CODES, REDUNDANCY N 1 ENUMERATIVE CODING ≤ log (N + 1) + 2 − log . 2 2 N−w w Indexing w (1 − θ) θ Pascal-∆ Method Weight Coding If both N − w → ∞ and w → ∞ the Stirling approximation yields ARITHMETIC CODING √ CONTEXT-TREE N 2πN N N−w N w WEIGHTING (N + 1) ≈ (N + 1) √ ( ) ( ) . w p2π(N − w) 2πw N − w w BURROWS WHEELER CODING BW-SEQUENCES Moreover N − w w CODING B-COUNTS (1 − θ)N−w θw ≤ ( )N−w ( )w . FIND BEST TREE MODEL N N

BINARY DECOMPOSITION Combining this yields for the individual redundancy for N − w → ∞ and

CONCLUSION w → ∞ that FUTURE DIRECTIONS 1 ρ(xN ) = L(xN ) − log 1 1 2 (1 − θ)N−w θw (≈) r N r N − w w ≤ log − log ( )( ) + 2 bits. 2 2π 2 N N Uniform Coding: Expected Redundancy

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY For the expected redundancy, when 0 < θ < 1, for large N we obtain ENUMERATIVE CODING Indexing (≈) r N N p Pascal-∆ Method ρ = E[L(X1 )] − Nh(θ) ≤ log2 − log2 (1 − θ)θ + 2 bits. Weight Coding 2π ARITHMETIC CODING

CONTEXT-TREE WEIGHTING

BURROWS WHEELER UNIVERSAL! CODING BW-SEQUENCES

CODING B-COUNTS

FIND BEST TREE MODEL BINARY DECOMPOSITION NOT VERY GOOD AT THE BORDERS ... CONCLUSION

FUTURE DIRECTIONS How to Code the Weight? Krichevsky-Trofimov

BWT and CTW

Frans Willems Definition (KT Probability (1981))

INTRODUCTION For all integer a ≥ 0 and b ≥ 0, define the KT-probability as

IID SOURCES, PREFIX ∆ (2a)! (2b)! 1 CODES, REDUNDANCY P (a, b) = . kt a b a+b ENUMERATIVE CODING 2 a! 2 b! 2 (a + b)! Indexing Pascal-∆ Method Since P NP (N − w, w) = 1, the weight w of sequence xN can Weight Coding w=0,N w kt 1 be encoded with a variable-length KT-code of length ARITHMETIC CODING

CONTEXT-TREE & ' 1 WEIGHTING L (w) = log . KT 2 N BURROWS WHEELER  w Pkt (N − w, w) CODING BW-SEQUENCES N CODING B-COUNTS Now we can write for the total codeword length for sequence x1 having weight w FIND BEST TREE MODEL

BINARY DECOMPOSITION N N L(x1 ) = LKT (w) + LI (x1 ) CONCLUSION & ' 1  N FUTURE DIRECTIONS = log + log 2 N 2 w w Pkt (N − w, w) 1 ≤ log2 + 2. Pkt (N − w, w) KT Coding: Individual Redundancy

BWT and CTW N N Frans Willems Assume that our sequence x1 has weight w = w(x1 ), then for all 0 ≤ θ ≤ 1, we get INTRODUCTION 1 IID SOURCES, PREFIX ρ(xN ) = L(xN ) − log CODES, REDUNDANCY 1 1 2 (1 − θ)N−w θw ENUMERATIVE CODING 1 1 Indexing ≤ log + 2 − log 2 2 N−w w Pascal-∆ Method Pkt (N − w, w) (1 − θ) θ Weight Coding ARITHMETIC CODING If both N − w → ∞ and w → ∞ the Stirling approximation yields CONTEXT-TREE WEIGHTING 1 r πN N N ≈ ( )N−w ( )w . BURROWS WHEELER Pkt (N − w, w) 2 N − w w CODING BW-SEQUENCES

CODING B-COUNTS Again N−w w N − w N−w w w FIND BEST TREE MODEL (1 − θ) θ ≤ ( ) ( ) . N N BINARY DECOMPOSITION Combining this yields for the individual redundancy for N − w → ∞ and CONCLUSION w → ∞ that FUTURE DIRECTIONS 1 (≈) 1 1 π ρ(xN ) = L(xN ) − log ≤ log N + log + 2 bits. 1 1 2 (1 − θ)N−w θw 2 2 2 2 2 KT Coding: Expected Redundancy

BWT and CTW

Frans Willems

INTRODUCTION For the expected redundancy, when 0 < θ < 1, for large N we get IID SOURCES, PREFIX CODES, REDUNDANCY (≈) N 1 1 π ENUMERATIVE CODING ρ = E[L(X1 )] − Nh(θ) ≤ log2 N + log2 + 2 bits. Indexing 2 2 2 Pascal-∆ Method Weight Coding ARITHMETIC CODING

CONTEXT-TREE WEIGHTING

BURROWS WHEELER NOTE CODING BW-SEQUENCES It can be shown for KT coding, for all 0 ≤ θ ≤ 1, that CODING B-COUNTS 1 FIND BEST TREE MODEL ρ(xN ) ≤ log N + 3 bits for all xN , and 1 2 2 1 BINARY DECOMPOSITION 1 CONCLUSION ρ ≤ log N + 3 bits. 2 2 FUTURE DIRECTIONS Weight Codewordlength

BWT and CTW N Example (N = 255, codewordlength as function of weight w of x1 ) Frans Willems No d· · · e-s. INTRODUCTION

IID SOURCES, PREFIX 9 CODES, REDUNDANCY 8 ENUMERATIVE CODING Indexing Pascal-∆ Method 7 Weight Coding ARITHMETIC CODING 6

CONTEXT-TREE WEIGHTING 5

BURROWS WHEELER 4 CODING BW-SEQUENCES codewordlength (bit) CODING B-COUNTS 3

FIND BEST TREE MODEL 2 BINARY DECOMPOSITION

CONCLUSION 1 Krichevsky-Trofimov FUTURE DIRECTIONS Uniform 0 0 50 100 150 200 250 w

Codewordlengths are roughly log2 N. Note that KT -codewordlength is 1 smaller (≈ 2 log2 N) for w = 0 and w = N. BOUND Individual Redundancy

BWT and CTW Example (N = 255, individual redundancy as a function of the weight w) Frans Willems 9 INTRODUCTION

IID SOURCES, PREFIX 8 CODES, REDUNDANCY

ENUMERATIVE CODING 7 Indexing Pascal-∆ Method Weight Coding 6 ARITHMETIC CODING 5 CONTEXT-TREE WEIGHTING 4 BURROWS WHEELER

CODING BW-SEQUENCES 3 individual redundancy (bit) CODING B-COUNTS

FIND BEST TREE MODEL 2

BINARY DECOMPOSITION 1 CONCLUSION Krichevsky-Trofimov Uniform FUTURE DIRECTIONS 0 0 50 100 150 200 250 w

1 Individual redundancies are roughly 2 log2 N. Note that uniform redundancy is larger (= log2(N + 1)) for θ = 0 and θ = 1. Expected Redundancy

BWT and CTW Example (N = 255, expected redundancy as function of θ.) Frans Willems 9 INTRODUCTION

IID SOURCES, PREFIX 8 CODES, REDUNDANCY

ENUMERATIVE CODING 7 Indexing Pascal-∆ Method Weight Coding 6 ARITHMETIC CODING 5 CONTEXT-TREE WEIGHTING 4 BURROWS WHEELER

CODING BW-SEQUENCES

expected redundancy (bit) 3 CODING B-COUNTS

FIND BEST TREE MODEL 2

BINARY DECOMPOSITION 1 CONCLUSION Krichevsky-Trofimov Uniform FUTURE DIRECTIONS 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 θ

1 Expected redundancies are roughly 2 log2 N. Again the uniform redundancy is larger (= log2(N + 1)) for w = 0 and w = N. Conclusion Enumerative Source Coding

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY

ENUMERATIVE CODING Indexing Pascal-∆ Method NON-UNIVERSAL: θ-Coding yields individual redundancy not Weight Coding exceeding 2 bits. ARITHMETIC CODING UNIVERSAL: CONTEXT-TREE WEIGHTING Both Uniform Coding and KT Coding achieve individual redundancy 1 roughly 2 log2 N if both N − w → ∞ and w → ∞. BURROWS WHEELER KT Coding has similar behaviour at borders (w = 0 and w = N). CODING BW-SEQUENCES Uniform Coding is simpler.

CODING B-COUNTS

FIND BEST TREE MODEL

BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Arihmetic Coding: Idea Elias

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX Elias: CODES, REDUNDANCY Codewords can be COMPUTED SEQUENTIALLY from the source ENUMERATIVE CODING sequence using conditional PROBABILITIES of next symbol given the ARITHMETIC CODING previous ones, and vice versa. CONTEXT-TREE WEIGHTING

BURROWS WHEELER

CODING BW-SEQUENCES N N N CODING B-COUNTS x1 c(x1 ) x1 Encoder Decoder FIND BEST TREE MODEL

BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS P(xn|x1 ··· xn−1) P(xn|x1 ··· xn−1) for n = 1, N for n = 1, N Arithmetic Coding: Universal Codes

BWT and CTW

Frans Willems Coding Probabilities N INTRODUCTION If the actual probabilities P(x1 ) are not known arithmetic coding is still N N IID SOURCES, PREFIX possible if instead of P(x1 ) we use coding probabilities Pc (x1 ) satisfying CODES, REDUNDANCY N N ENUMERATIVE CODING Pc (x1 ) > 0 for all x1 , and ARITHMETIC CODING X N Pc (x1 ) = 1. CONTEXT-TREE N WEIGHTING x1

BURROWS WHEELER Then CODING BW-SEQUENCES N 1 L(x1 ) < log2 + 2. CODING B-COUNTS N Pc (x1 ) FIND BEST TREE MODEL

BINARY DECOMPOSITION N N N x1 c(x1 ) x1 CONCLUSION Encoder Decoder FUTURE DIRECTIONS

Pc (xn|x1 ··· xn−1) Pc (xn|x1 ··· xn−1) for n = 1, N for n = 1, N

N PROBLEM: How do we choose the coding probabilities Pc (x1 )? Arithmetic Coding for a Binary IID Source, Unknown θ

BWT and CTW

Frans Willems Arithmetic Coding based on KT Probability INTRODUCTION N N IID SOURCES, PREFIX A good coding probability Pc (x1 ) for a sequence x1 that contains a CODES, REDUNDANCY zeroes and b = N − a ones is ENUMERATIVE CODING ∆ (2a)! (2b)! 1 ARITHMETIC CODING P (a, b) = . kt a b a+b CONTEXT-TREE 2 a! 2 b! 2 (a + b)! WEIGHTING

BURROWS WHEELER Probability of a sequence with a zeroes and b ones followed by a one CODING BW-SEQUENCES b + 1/2 P (a, b + 1) = · P (a, b), CODING B-COUNTS kt a + b + 1 kt FIND BEST TREE MODEL

BINARY DECOMPOSITION hence SEQUENTIAL COMPUTATION is possible!

CONCLUSION Using arithmetic coding, the total individual redundancy FUTURE DIRECTIONS a b N θ (1 − θ) 1 ρ(x1 ) < log2 + 2 ≤ log2 N + 3 bits. Pkt (a, b) 2

N for all θ and x1 with a zeroes and b ones. Context Tree Weighting: Universal Codes

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY

ENUMERATIVE CODING

ARITHMETIC CODING

CONTEXT-TREE WEIGHTING OBJECTIVE: Binary Tree-Sources Context Trees Design good code probabities for sources with UNKNOWN Coding Probabilities PARAMETERS and STRUCTURE. Remarks BURROWS WHEELER

CODING BW-SEQUENCES

CODING B-COUNTS

FIND BEST TREE MODEL

BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Context Tree Weighting: Binary Tree-Sources

BWT and CTW Definition Frans Willems ··· xn−2 xn−1 xn INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY θ1 = 0.1 1 ENUMERATIVE CODING

ARITHMETIC CODING

CONTEXT-TREE θ10 = 0.3 10 λ WEIGHTING Binary Tree-Sources Context Trees Coding Probabilities 0 Remarks BURROWS WHEELER

CODING BW-SEQUENCES θ00 = 0.5 00 CODING B-COUNTS

FIND BEST TREE MODEL

BINARY DECOMPOSITION parameters (tree-) model M = {00, 10, 1}

CONCLUSION

FUTURE DIRECTIONS P(Xn = 1| · · · , Xn−1 = 1) = 0.1

P(Xn = 1| · · · , Xn−2 = 1, Xn−1 = 0) = 0.3

P(Xn = 1| · · · , Xn−2 = 0, Xn−1 = 0) = 0.5 Context Trees

BWT and CTW

Frans Willems Definition (Context Tree)

INTRODUCTION

IID SOURCES, PREFIX 111 CODES, REDUNDANCY 11

ENUMERATIVE CODING 011

ARITHMETIC CODING 1

CONTEXT-TREE 101 WEIGHTING 01 Binary Tree-Sources Context Trees 001 Coding Probabilities λ Remarks 110 BURROWS WHEELER 10 CODING BW-SEQUENCES 010 CODING B-COUNTS 0 FIND BEST TREE MODEL 100 BINARY DECOMPOSITION 00

CONCLUSION 000

FUTURE DIRECTIONS Node s contains the sequence of source symbols that have occurred following context s. Depth is D. Context-Tree Splits Up Sequences in Subsequences

BWT and CTW

Frans Willems 1 2 3 4 5 6 7 8 9 10 11 12 n →

INTRODUCTION 1 1 0 1 0 0 1 0 1 0 0 0 1 1 0 ··· IID SOURCES, PREFIX N CODES, REDUNDANCY past x1 ENUMERATIVE CODING ARITHMETIC CODING - 1 CONTEXT-TREE 12 WEIGHTING 1 Binary Tree-Sources 12 0 2, 5, 7 Context Trees 11, 12 Coding Probabilities 1 Remarks 2,7 1 2,5,7,11 0 BURROWS WHEELER 1, 2, 3, 4 CODING BW-SEQUENCES 5,11 0 5, 6, 7, 8, 9 CODING B-COUNTS 1 1 10, 11, 12 FIND BEST TREE MODEL 1,3,6,8 1 BINARY DECOMPOSITION 0 3,6,8 0 CONCLUSION 1, 3, 4, 6 8, 9, 10 FUTURE DIRECTIONS 4,9 1 4,9,10 0 10 0 Context-Tree Nodes Containing IID Sequences

BWT and CTW

Frans Willems

INTRODUCTION - 1 IID SOURCES, PREFIX 12 1 CODES, REDUNDANCY 12 0 2, 5, 7 ENUMERATIVE CODING 11, 12 ARITHMETIC CODING 2,7 1 1 0 CONTEXT-TREE 2,5,7,11 WEIGHTING 5,11 1, 2, 3, 4 Binary Tree-Sources 0 Context Trees 5, 6, 7, 8, 9 Coding Probabilities 1 10, 11, 12 Remarks 1 1,3,6,8 1 BURROWS WHEELER 0 3,6,8 0 CODING BW-SEQUENCES 1, 3, 4, 6 8, 9, 10 CODING B-COUNTS 4,9 1 FIND BEST TREE MODEL 4,9,10 0 BINARY DECOMPOSITION 10 0 CONCLUSION tree-model M = {00, 10, 1} FUTURE DIRECTIONS Context-Tree Weighting Coding Probabilities

BWT and CTW

Frans Willems

INTRODUCTION Recursive Definition: From Leaves via Internal Nodes to Root IID SOURCES, PREFIX Let a and b be the number of zeros resp. ones in the subsequence CODES, REDUNDANCY s s corresponding to leaf, node, or root s. ENUMERATIVE CODING ARITHMETIC CODING The subsequence corresponding to a leaf s of the context tree is IID. A CONTEXT-TREE good coding probability for this subsequence is therefore WEIGHTING Binary Tree-Sources ∆ Context Trees Pw (s) = Pkt (as , bs ). Coding Probabilities Remarks Weighting the coding probabilities corresponding to both alternatives BURROWS WHEELER (node is iid or needs further splitting) yields the coding probability CODING BW-SEQUENCES

CODING B-COUNTS ∆ Pkt (as , bs ) + Pw (0s) · Pw (1s) Pw (s) = FIND BEST TREE MODEL 2 BINARY DECOMPOSITION for the subsequence that corresponds to node or root s. CONCLUSION Recursively we find in the root λ of the context-tree the coding probability FUTURE DIRECTIONS N Pw (λ) for the entire source sequence x1 . Individual Redundancy CTW

BWT and CTW

Frans Willems

INTRODUCTION Theorem (W., Shtarkov, and Tjalkens (1995)) IID SOURCES, PREFIX CODES, REDUNDANCY In general for a tree source with |M| leaves (parameters):

ENUMERATIVE CODING X 1 ρ(xN ) < (2|M| − 1) + log (a + b ) + 2 bits. ARITHMETIC CODING 1 2 2 l l CONTEXT-TREE l∈M WEIGHTING Binary Tree-Sources (model, parameter, and coding redundancies) Context Trees Coding Probabilities Remarks BURROWS WHEELER

CODING BW-SEQUENCES CODING B-COUNTS About Model Redundancies FIND BEST TREE MODEL A binary tree model can be described recursively: BINARY DECOMPOSITION Code(tree)=0 if tree is only root, CONCLUSION else Code(tree)=(1,Code(subtree-0),Code(subtree-1)). FUTURE DIRECTIONS This leads to 2|M| − 1 bits. Remarks Context-Tree Weighting

BWT and CTW

Frans Willems CTW implements a “weighting” (Bayes mixture) over all tree-models INTRODUCTION with depth not exceeding D, i.e.,

IID SOURCES, PREFIX CODES, REDUNDANCY X N Pw (λ) = P(M)Pkt (x1 |M), ENUMERATIVE CODING M≤D ARITHMETIC CODING N Q −(2|M|−1) CONTEXT-TREE with Pkt (x1 |M) = s∈M Pkt (as , bs ) and P(M) = 2 . WEIGHTING There is one tree-model of depth 0 (i.e. the IID model). If there are Binary Tree-Sources 2 Context Trees #d models of depth not exceeding d then there are #d + 1 models of Coding Probabilities depth not exceeding d + 1. Therefore #2 = 5, #3 = 26, #4 = 677, Remarks 22 #5 = 458330, #6 = 210066388901, #7 = 4.4128 · 10 , BURROWS WHEELER 45 #8 = 1.9473 · 10 , etc. CODING BW-SEQUENCES Straightforward analysis. No model-estimation that only gives CODING B-COUNTS asymptotic results as in e.g. Rissanen [1983, 1986], Weinberger, FIND BEST TREE MODEL Rissanen, and Feder [1995]). BINARY DECOMPOSITION N Number of computations needed to process the source sequence x1 CONCLUSION is linear in N. Same holds for the storage complexity. FUTURE DIRECTIONS Optimal parameter redundancy behavior in Rissanen [1984] sense 1 (i.e., 2 log2 N bits/parameter). Remarks Context-Tree Weighting

BWT and CTW A modified version achieves entropy not only for tree sources but for Frans Willems all stationary ergodic sources. INTRODUCTION A two-pass version (context-tree maximizing) exists that finds the IID SOURCES, PREFIX minimum description length (MDL) model, matching to the source CODES, REDUNDANCY sequence. Now ENUMERATIVE CODING

ARITHMETIC CODING ∆ max[Pkt (as , bs ), Pm(0s) · Pm(1s)] Pm(s) = . CONTEXT-TREE 2 WEIGHTING Binary Tree-Sources If a tree source with model generates the sequence xN the Context Trees 1 Coding Probabilities maximizing method produces a model estimate which is correct with Remarks probability one as N → ∞. The two-pass version achieves again BURROWS WHEELER X 1 CODING BW-SEQUENCES ρ(xN ) < (2|M| − 1) + log (a + b ) + 2 bits. 1 2 2 l l CODING B-COUNTS l∈M FIND BEST TREE MODEL If instead of P we would use “uniform weight coding” BINARY DECOMPOSITION kt

CONCLUSION N 1 1 Pu(x1 ) = FUTURE DIRECTIONS N + 1 N w as coding probability we get similar redundancy results with the problems mentioned before at the borders. Burrows Wheeler Transform (1)

BWT and CTW

Frans Willems The Burrows Wheeler Transform is a TRANSFORM! The transformed INTRODUCTION sequence will be coded. IID SOURCES, PREFIX CODES, REDUNDANCY Let xN = 100101000110 again. Consider xN and all its cyclic left ENUMERATIVE CODING 1 1 rotations. ARITHMETIC CODING

CONTEXT-TREE 1 100101000110 WEIGHTING 2 001010001101 BURROWS WHEELER 3 010100011010 BW Transform BW Reconstruction 4 101000110100 Coding the BW-transformed 5 010001101001 sequence 6 100011010010 CODING BW-SEQUENCES 7 000110100101 CODING B-COUNTS 8 001101001010 FIND BEST TREE MODEL 9 011010010100 BINARY DECOMPOSITION 10 1 1 0 1 0 0 1 0 1 0 0 0

CONCLUSION 11 1 0 1 0 0 1 0 1 0 0 0 1

FUTURE DIRECTIONS 12 0 1 0 0 1 0 1 0 0 0 1 1 Burrows Wheeler Transform (2)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX Sort lexicographically based on the suffixes. CODES, REDUNDANCY

ENUMERATIVE CODING 12 0 10010100011

ARITHMETIC CODING 2 0 01010001101

CONTEXT-TREE 7 0 00110100101 WEIGHTING 5 0 10001101001 BURROWS WHEELER 11 1 01001010001 BW Transform 1 1 00101000110 BW Reconstruction Coding the BW-transformed 3 0 10100011010 sequence 8 0 01101001010 CODING BW-SEQUENCES 6 1 00011010010 CODING B-COUNTS 4 1 01000110100 FIND BEST TREE MODEL 9 0 11010010100

BINARY DECOMPOSITION 10 1 10100101000

CONCLUSION

FUTURE DIRECTIONS Burrows Wheeler Transform (3)

BWT and CTW

Frans Willems

INTRODUCTION N Now the left-most column is the BW-transform of x1 . Hence IID SOURCES, PREFIX BW (100101000110) = (101100110000). CODES, REDUNDANCY

ENUMERATIVE CODING 12 0 10010100011

ARITHMETIC CODING 2 0 01010001101 7 0 00110100101 CONTEXT-TREE WEIGHTING 5 0 10001101001 BURROWS WHEELER 11 1 01001010001 BW Transform 1 1 00101000110 BW Reconstruction Coding the BW-transformed 3 0 10100011010 sequence 8 0 01101001010 CODING BW-SEQUENCES 6 1 00011010010 CODING B-COUNTS 4 1 01000110100 FIND BEST TREE MODEL 9 0 11010010100

BINARY DECOMPOSITION 10 1 10100101000

CONCLUSION N N The index ibw (x1 ) = 6 of x1 in the ordering is later needed for FUTURE DIRECTIONS reconstruction. Burrows Wheeler Reconstruction (1)

BWT and CTW

Frans Willems

INTRODUCTION IID SOURCES, PREFIX 0 ...... CODES, REDUNDANCY 0 ...... ENUMERATIVE CODING 0 ...... ARITHMETIC CODING 0 ...... CONTEXT-TREE 1 ...... WEIGHTING 1 ...... BURROWS WHEELER 0 ...... BW Transform BW Reconstruction 0 ...... Coding the BW-transformed 1 ...... sequence 1 ...... CODING BW-SEQUENCES 0 ...... CODING B-COUNTS 1 ...... FIND BEST TREE MODEL

BINARY DECOMPOSITION We see 7 times 0, and 5 times 1 in the BW-transform, sorting yields the CONCLUSION rightmost column. FUTURE DIRECTIONS Burrows Wheeler Reconstruction (2)

BWT and CTW

Frans Willems

INTRODUCTION 0 ...... 1 IID SOURCES, PREFIX CODES, REDUNDANCY 0 ...... 1 0 ...... 1 ENUMERATIVE CODING 0 ...... 1 ARITHMETIC CODING 1 ...... 1 CONTEXT-TREE WEIGHTING 1 ...... 0 0 ...... 0 BURROWS WHEELER BW Transform 0 ...... 0 BW Reconstruction 1 ...... 0 Coding the BW-transformed sequence 1 ...... 0 CODING BW-SEQUENCES 0 ...... 0 1 ...... 0 CODING B-COUNTS

FIND BEST TREE MODEL

BINARY DECOMPOSITION Observe that, in all rows, a symbol in the leftmost column (BW-transform) follows the symbol in the rightmost column. CONCLUSION We then see 3 times 00, 4 times 10, 4 times 01, and 1 times 11. Sorting FUTURE DIRECTIONS yields the two rightmost columns. Burrows Wheeler Reconstruction (3)

BWT and CTW

Frans Willems

INTRODUCTION 0 ...... 1 1

IID SOURCES, PREFIX 0 ...... 0 1 CODES, REDUNDANCY 0 ...... 0 1 ENUMERATIVE CODING 0 ...... 0 1 ARITHMETIC CODING 1 ...... 0 1

CONTEXT-TREE 1 ...... 1 0 WEIGHTING 0 ...... 1 0 BURROWS WHEELER 0 ...... 1 0 BW Transform 1 ...... 1 0 BW Reconstruction Coding the BW-transformed 1 ...... 0 0 sequence 0 ...... 0 0 CODING BW-SEQUENCES 1 ...... 0 0 CODING B-COUNTS

FIND BEST TREE MODEL Again, in all rows, a symbol in the leftmost column (BW-transform) BINARY DECOMPOSITION follows the corresponding symbols in the two rightmost columns. CONCLUSION We then see 1 times 000, 2 times 100, 3 times 010, 1 times 110, 2 times FUTURE DIRECTIONS 001, 2 times 101, 1 times 011, and 0 times 000. Sorting yields the three rightmost columns. Burrows Wheeler Reconstruction (4)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY 0 ...... 0 1 1 ENUMERATIVE CODING 0 ...... 1 0 1 ARITHMETIC CODING 0 ...... 1 0 1 CONTEXT-TREE 0 ...... 0 0 1 WEIGHTING 1 ...... 0 0 1 BURROWS WHEELER 1 ...... 1 1 0 BW Transform BW Reconstruction 0 ...... 0 1 0 Coding the BW-transformed sequence 0 ...... 0 1 0 CODING BW-SEQUENCES 1 ...... 0 1 0 1 ...... 1 0 0 CODING B-COUNTS 0 ...... 1 0 0 FIND BEST TREE MODEL 1 ...... 0 0 0 BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Burrows Wheeler Reconstruction (5)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY 0 ...... 0 0 1 1 ENUMERATIVE CODING 0 ...... 1 1 0 1 ARITHMETIC CODING 0 ...... 0 1 0 1 CONTEXT-TREE 0 ...... 1 0 0 1 WEIGHTING 1 ...... 0 0 0 1 BURROWS WHEELER 1 ...... 0 1 1 0 BW Transform BW Reconstruction 0 ...... 1 0 1 0 Coding the BW-transformed sequence 0 ...... 1 0 1 0 CODING BW-SEQUENCES 1 ...... 0 0 1 0 1 ...... 0 1 0 0 CODING B-COUNTS 0 ...... 0 1 0 0 FIND BEST TREE MODEL 1 ...... 1 0 0 0 BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Burrows Wheeler Reconstruction (6)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY 0 ...... 0 0 0 1 1 ENUMERATIVE CODING 0 ...... 0 1 1 0 1 ARITHMETIC CODING 0 ...... 0 0 1 0 1 CONTEXT-TREE 0 ...... 0 1 0 0 1 WEIGHTING 1 ...... 1 0 0 0 1 BURROWS WHEELER 1 ...... 0 0 1 1 0 BW Transform BW Reconstruction 0 ...... 1 1 0 1 0 Coding the BW-transformed sequence 0 ...... 0 1 0 1 0 CODING BW-SEQUENCES 1 ...... 1 0 0 1 0 1 ...... 1 0 1 0 0 CODING B-COUNTS 0 ...... 1 0 1 0 0 FIND BEST TREE MODEL 1 ...... 0 1 0 0 0 BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Burrows Wheeler Reconstruction (7)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY 0 ..... 1 0 0 0 1 1 ENUMERATIVE CODING 0 ..... 0 0 1 1 0 1 ARITHMETIC CODING 0 ..... 1 0 0 1 0 1 CONTEXT-TREE 0 ..... 1 0 1 0 0 1 WEIGHTING 1 ..... 0 1 0 0 0 1 BURROWS WHEELER 1 ..... 0 0 0 1 1 0 BW Transform BW Reconstruction 0 ..... 0 1 1 0 1 0 Coding the BW-transformed sequence 0 ..... 0 0 1 0 1 0 CODING BW-SEQUENCES 1 ..... 0 1 0 0 1 0 1 ..... 1 1 0 1 0 0 CODING B-COUNTS 0 ..... 0 1 0 1 0 0 FIND BEST TREE MODEL 1 ..... 1 0 1 0 0 0 BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Burrows Wheeler Reconstruction (8)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY 0 .... 0 1 0 0 0 1 1 ENUMERATIVE CODING 0 .... 0 0 0 1 1 0 1 ARITHMETIC CODING 0 .... 0 1 0 0 1 0 1 CONTEXT-TREE 0 .... 1 1 0 1 0 0 1 WEIGHTING 1 .... 1 0 1 0 0 0 1 BURROWS WHEELER 1 .... 1 0 0 0 1 1 0 BW Transform BW Reconstruction 0 .... 0 0 1 1 0 1 0 Coding the BW-transformed sequence 0 .... 1 0 0 1 0 1 0 CODING BW-SEQUENCES 1 .... 1 0 1 0 0 1 0 1 .... 0 1 1 0 1 0 0 CODING B-COUNTS 0 .... 0 0 1 0 1 0 0 FIND BEST TREE MODEL 1 .... 0 1 0 1 0 0 0 BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Burrows Wheeler Reconstruction (9)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY 0 ... 1 0 1 0 0 0 1 1 ENUMERATIVE CODING 0 ... 1 0 0 0 1 1 0 1 ARITHMETIC CODING 0 ... 1 0 1 0 0 1 0 1 CONTEXT-TREE 0 ... 0 1 1 0 1 0 0 1 WEIGHTING 1 ... 0 1 0 1 0 0 0 1 BURROWS WHEELER 1 ... 0 1 0 0 0 1 1 0 BW Transform BW Reconstruction 0 ... 0 0 0 1 1 0 1 0 Coding the BW-transformed sequence 0 ... 0 1 0 0 1 0 1 0 CODING BW-SEQUENCES 1 ... 1 1 0 1 0 0 1 0 1 ... 0 0 1 1 0 1 0 0 CODING B-COUNTS 0 ... 1 0 0 1 0 1 0 0 FIND BEST TREE MODEL 1 ... 0 0 1 0 1 0 0 0 BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Burrows Wheeler Reconstruction (10)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY 0 .. 0 1 0 1 0 0 0 1 1 ENUMERATIVE CODING 0 .. 0 1 0 0 0 1 1 0 1 ARITHMETIC CODING 0 .. 1 1 0 1 0 0 1 0 1 CONTEXT-TREE 0 .. 0 0 1 1 0 1 0 0 1 WEIGHTING 1 .. 0 0 1 0 1 0 0 0 1 BURROWS WHEELER 1 .. 1 0 1 0 0 0 1 1 0 BW Transform BW Reconstruction 0 .. 1 0 0 0 1 1 0 1 0 Coding the BW-transformed sequence 0 .. 1 0 1 0 0 1 0 1 0 CODING BW-SEQUENCES 1 .. 0 1 1 0 1 0 0 1 0 1 .. 0 0 0 1 1 0 1 0 0 CODING B-COUNTS 0 .. 0 1 0 0 1 0 1 0 0 FIND BEST TREE MODEL 1 .. 1 0 0 1 0 1 0 0 0 BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Burrows Wheeler Reconstruction (11)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY 0 . 0 0 1 0 1 0 0 0 1 1 ENUMERATIVE CODING 0 . 1 0 1 0 0 0 1 1 0 1 ARITHMETIC CODING 0 . 0 1 1 0 1 0 0 1 0 1 CONTEXT-TREE 0 . 0 0 0 1 1 0 1 0 0 1 WEIGHTING 1 . 1 0 0 1 0 1 0 0 0 1 BURROWS WHEELER 1 . 0 1 0 1 0 0 0 1 1 0 BW Transform BW Reconstruction 0 . 0 1 0 0 0 1 1 0 1 0 Coding the BW-transformed sequence 0 . 1 1 0 1 0 0 1 0 1 0 CODING BW-SEQUENCES 1 . 0 0 1 1 0 1 0 0 1 0 1 . 1 0 0 0 1 1 0 1 0 0 CODING B-COUNTS 0 . 1 0 1 0 0 1 0 1 0 0 FIND BEST TREE MODEL 1 . 0 1 0 0 1 0 1 0 0 0 BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Burrows Wheeler Reconstruction (12)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY 0 10010100011 ENUMERATIVE CODING 0 01010001101 ARITHMETIC CODING 0 00110100101 CONTEXT-TREE 0 10001101001 WEIGHTING 1 01001010001 BURROWS WHEELER 1 00101000110 BW Transform BW Reconstruction 0 10100011010 Coding the BW-transformed sequence 0 01101001010 CODING BW-SEQUENCES 1 00011010010 1 01000110100 CODING B-COUNTS 0 11010010100 FIND BEST TREE MODEL 1 10100101000 BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Burrows Wheeler Reconstruction (13)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY 11 0 10010100011

ENUMERATIVE CODING 10 0 01010001101 9 0 00110100101 ARITHMETIC CODING 8 0 10001101001 CONTEXT-TREE WEIGHTING 7 1 01001010001 6 1 00101000110 BURROWS WHEELER BW Transform 5 0 10100011010 BW Reconstruction 4 0 01101001010 Coding the BW-transformed sequence 3 1 00011010010 CODING BW-SEQUENCES 2 1 01000110100

CODING B-COUNTS 1 0 11010010100 0 1 10100101000 FIND BEST TREE MODEL

BINARY DECOMPOSITION N N CONCLUSION From the index ibw (x1 ) = 6 of x1 we get the original sequence back. FUTURE DIRECTIONS Coding the BW-transformed sequence

BWT and CTW Frans Willems IDEA:

INTRODUCTION N For tree sources, the transformed sequence BW (x1 ) is a concatenation of IID SOURCES, PREFIX IID subsequences, as many as there are leaves in the tree model.a CODES, REDUNDANCY a ENUMERATIVE CODING Circularity effects are ignored.

ARITHMETIC CODING

CONTEXT-TREE N WEIGHTING Example (x1 = 100101000110 and tree-model M = {00, 10, 1}) BURROWS WHEELER BW (100101000110) = 101, 1001, 10000. See BW Transform BW Reconstruction 12 0 10010100011 Coding the BW-transformed sequence 2 0 01010001101 7 0 00110100101 CODING BW-SEQUENCES 5 0 10001101001 CODING B-COUNTS 11 1 01001010001 FIND BEST TREE MODEL 1 1 00101000110 3 0 10100011010 BINARY DECOMPOSITION 8 0 01101001010 CONCLUSION 6 1 00011010010 FUTURE DIRECTIONS 4 1 01000110100 9 0 11010010100 10 1 10100101000 Coding the BW-transformed sequence

BWT and CTW Frans Willems IDEA:

INTRODUCTION N For tree sources, the transformed sequence BW (x1 ) is a concatenation of IID SOURCES, PREFIX IID subsequences, as many as there are leaves in the tree model.a CODES, REDUNDANCY a ENUMERATIVE CODING Circularity effects are ignored.

ARITHMETIC CODING

CONTEXT-TREE N WEIGHTING Example (x1 = 100101000110 and tree-model M = {00, 10, 1}) BURROWS WHEELER BW (100101000110) = 101, 1001, 10000. See BW Transform BW Reconstruction 12 0 10010100011 Coding the BW-transformed sequence 2 0 01010001101 7 0 00110100101 CODING BW-SEQUENCES 5 0 10001101001 CODING B-COUNTS 11 1 01001010001 FIND BEST TREE MODEL 1 1 00101000110 3 0 10100011010 BINARY DECOMPOSITION 8 0 01101001010 CONCLUSION 6 1 00011010010 FUTURE DIRECTIONS 4 1 01000110100 9 0 11010010100 10 1 10100101000 Coding the BW-transformed sequence

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY Move-to-front techniques (Bentley et al. (1986), Elias (1987)) ENUMERATIVE CODING combined with Huffman codes can be used. ARITHMETIC CODING Piecewise IID coding (Merhav (1993), W. (1996)) can be used. CONTEXT-TREE These methods demonstrate that we need to specify the transition WEIGHTING points, which requires roughly log2 N bits per transition (Effros et al. BURROWS WHEELER (2002)). BW Transform BW Reconstruction CTW only needs 2|M| − 1 bits, which is roughly 2 bits per transition. Coding the BW-transformed sequence CODING BW-SEQUENCES OBJECTIVE: CODING B-COUNTS Design a procedure for coding the BW transformed sequence that needs FIND BEST TREE MODEL only 2|M| − 1 bits. BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS FSM-Closed Tree Models

BWT and CTW Definition Frans Willems The generator of leaf ud ud−1 ··· u1 at depth d is ud ud−1 ··· u2 at depth INTRODUCTION d − 1. A tree model is FSM-closed if all its leaves have a generator that is IID SOURCES, PREFIX either a leaf or internal node of the tree model. CODES, REDUNDANCY

ENUMERATIVE CODING Example (Tree model M = {00, 010, 110, 1}) ARITHMETIC CODING CONTEXT-TREE 1 WEIGHTING

BURROWS WHEELER

CODING BW-SEQUENCES FSM-Closedness Problem λ FSM-Closed Tree Model Not FSM-Closed Model 110 10 CODING B-COUNTS 010 FIND BEST TREE MODEL 0 BINARY DECOMPOSITION CONCLUSION 00 FUTURE DIRECTIONS

Leaf 00 has generator 0 which is an internal node of M. But note that leaves 110 and 010 do not have a generator in tree model M. Therefore M is not FSM-closed and we cannot describe the source as a FSM. FSM-Closed Tree Models

BWT and CTW

Frans Willems Example (Tree model M0 = {00, 010, 110, 01, 11})

INTRODUCTION

IID SOURCES, PREFIX 11 CODES, REDUNDANCY ENUMERATIVE CODING 1 ARITHMETIC CODING

CONTEXT-TREE 01 WEIGHTING BURROWS WHEELER λ CODING BW-SEQUENCES 110 FSM-Closedness 10 Problem FSM-Closed Tree Model 010 Not FSM-Closed Model 0 CODING B-COUNTS FIND BEST TREE MODEL 00 BINARY DECOMPOSITION CONCLUSION Tree model M0 is FSM-closed. FUTURE DIRECTIONS

If a tree model is FSM-closed then each leaf and each node in the tree model has a generator that is a leaf or node of the tree model. FSM-Closed Tree Models

BWT and CTW Frans Willems Definition

INTRODUCTION A tree model can be made FSM-closed by adding the generators of all IID SOURCES, PREFIX leaves. The resulting model is the FSM-closure of the tree model. CODES, REDUNDANCY

ENUMERATIVE CODING

ARITHMETIC CODING Example

CONTEXT-TREE WEIGHTING 11

BURROWS WHEELER

CODING BW-SEQUENCES 1 FSM-Closedness Problem FSM-Closed Tree Model 01 Not FSM-Closed Model CODING B-COUNTS λ FIND BEST TREE MODEL 110 BINARY DECOMPOSITION 10

CONCLUSION 010 0 FUTURE DIRECTIONS

00 FSM-Closed Tree Models

BWT and CTW Frans Willems Definition

INTRODUCTION A tree model can be made FSM-closed by adding the generators of all IID SOURCES, PREFIX leaves. The resulting model is the FSM-closure of the tree model. CODES, REDUNDANCY

ENUMERATIVE CODING

ARITHMETIC CODING Example

CONTEXT-TREE WEIGHTING 11

BURROWS WHEELER

CODING BW-SEQUENCES 1 FSM-Closedness Problem FSM-Closed Tree Model 01 Not FSM-Closed Model CODING B-COUNTS λ FIND BEST TREE MODEL 110 BINARY DECOMPOSITION 10

CONCLUSION 010 0 FUTURE DIRECTIONS

00 Coding BW-sequences: PROBLEM

BWT and CTW Suppose that Mc is the tree-model that matches best to the Frans Willems BW-sequence. More about how to find Mc later. INTRODUCTION This tree model Mc is included in the description (2|M|c − 1 bits). IID SOURCES, PREFIX CODES, REDUNDANCY Apart from that the b-counts (weights of subsequences) of all leaves

ENUMERATIVE CODING of Mc are added to the description.

ARITHMETIC CODING Finally the lexicographical indices of all the subsequences

CONTEXT-TREE corresponding to the leaves of Mc are included. WEIGHTING

BURROWS WHEELER

CODING BW-SEQUENCES Question: FSM-Closedness Problem Can we reconstruct the entire BW-sequence from the description using FSM-Closed Tree Model ENUMERATIVE techniques? Not FSM-Closed Model CODING B-COUNTS FIND BEST TREE MODEL Definition BINARY DECOMPOSITION Let as the number of zeros and bs the number of ones in the subsequence CONCLUSION that corresponds to s, then FUTURE DIRECTIONS cs = as + bs

is length of this subsequence. FSM-Closed Tree Model (1)

BWT and CTW Suppose that the b-counts in the leaves of FSM-closed tree model

Frans Willems Mc = {00, 010, 110, 01, 11} are given to the decoder, hence the decoder knows b00, b010, b110, b01, b11, and N. INTRODUCTION The decoder first computes the b-counts in all the nodes of the tree IID SOURCES, PREFIX CODES, REDUNDANCY model, hence bλ, b0, b1, and b01. ENUMERATIVE CODING The decoder now processes layer by layer, starting in the root (layer

ARITHMETIC CODING 0). First in the root the a-count is computed.

CONTEXT-TREE WEIGHTING aλ = N − bλ. BURROWS WHEELER Layer 0 is processed now. CODING BW-SEQUENCES FSM-Closedness In layer 1 there are two nodes, 0 and 1 that fit into Mc. Since Mc is Problem FSM-closed, their generator λ exists, is at level 0, and is thus FSM-Closed Tree Model Not FSM-Closed Model processed before. Therefore we can compute the c’s of these nodes. CODING B-COUNTS c0 = aλ FIND BEST TREE MODEL c1 = bλ. BINARY DECOMPOSITION CONCLUSION With the available b-counts also the a-counts can be computed FUTURE DIRECTIONS a0 = c0 − b0

a1 = c1 − b1. Layer 1 is processed now. FSM-Closed Tree Model (2)

BWT and CTW

Frans Willems − INTRODUCTION − IID SOURCES, PREFIX b11 CODES, REDUNDANCY ENUMERATIVE CODING − ARITHMETIC CODING − − CONTEXT-TREE − WEIGHTING − b01 BURROWS WHEELER − N CODING BW-SEQUENCES − − FSM-Closedness Problem b110 − FSM-Closed Tree Model − Not FSM-Closed Model − − CODING B-COUNTS − − − FIND BEST TREE MODEL − − BINARY DECOMPOSITION b010 CONCLUSION −

FUTURE DIRECTIONS − b00 FSM-Closed Tree Model (3)

BWT and CTW

Frans Willems − INTRODUCTION − IID SOURCES, PREFIX b11 CODES, REDUNDANCY ENUMERATIVE CODING − ARITHMETIC CODING − − CONTEXT-TREE b1 WEIGHTING − b01 BURROWS WHEELER − N CODING BW-SEQUENCES − − FSM-Closedness Problem b110 bλ FSM-Closed Tree Model − Not FSM-Closed Model − − CODING B-COUNTS b10 − − FIND BEST TREE MODEL − b0 BINARY DECOMPOSITION b010 CONCLUSION −

FUTURE DIRECTIONS − b00 FSM-Closed Tree Model (4)

BWT and CTW

Frans Willems − INTRODUCTION − IID SOURCES, PREFIX b11 CODES, REDUNDANCY ENUMERATIVE CODING − ARITHMETIC CODING − − CONTEXT-TREE b1 WEIGHTING − b01 BURROWS WHEELER − N CODING BW-SEQUENCES − aλ FSM-Closedness Problem b110 bλ FSM-Closed Tree Model − Not FSM-Closed Model − − CODING B-COUNTS b10 − − FIND BEST TREE MODEL − b0 BINARY DECOMPOSITION b010 CONCLUSION −

FUTURE DIRECTIONS − b00 FSM-Closed Tree Model (5)

BWT and CTW

Frans Willems − INTRODUCTION − IID SOURCES, PREFIX b11 CODES, REDUNDANCY

ENUMERATIVE CODING c1 ARITHMETIC CODING − − CONTEXT-TREE b1 WEIGHTING − b01 BURROWS WHEELER − N CODING BW-SEQUENCES − aλ FSM-Closedness Problem b110 bλ FSM-Closed Tree Model − Not FSM-Closed Model − c0 CODING B-COUNTS b10 − − FIND BEST TREE MODEL − b0 BINARY DECOMPOSITION b010 CONCLUSION −

FUTURE DIRECTIONS − b00 FSM-Closed Tree Model (6)

BWT and CTW

Frans Willems − INTRODUCTION − IID SOURCES, PREFIX b11 CODES, REDUNDANCY

ENUMERATIVE CODING c1 ARITHMETIC CODING − a1 CONTEXT-TREE b1 WEIGHTING − b01 BURROWS WHEELER − N CODING BW-SEQUENCES − aλ FSM-Closedness Problem b110 bλ FSM-Closed Tree Model − Not FSM-Closed Model − c0 CODING B-COUNTS b10 − a0 FIND BEST TREE MODEL − b0 BINARY DECOMPOSITION b010 CONCLUSION −

FUTURE DIRECTIONS − b00 FSM-Closed Tree Model (7)

BWT and CTW

Frans Willems In layer 2 there are three leaves, 00, 01, and 11, and a node 10, that INTRODUCTION fit into M. Since M is FSM-closed, again their generators 0 and 1 IID SOURCES, PREFIX c c CODES, REDUNDANCY exist, are at level 1, and are thus processed before. Now we can

ENUMERATIVE CODING compute the c’s of these leaves and node.

ARITHMETIC CODING c00 = a0 CONTEXT-TREE WEIGHTING c10 = a1

BURROWS WHEELER c01 = b0 CODING BW-SEQUENCES c11 = b1. FSM-Closedness Problem FSM-Closed Tree Model With the available b-counts also the a-counts can be computed for Not FSM-Closed Model these leaves and node CODING B-COUNTS FIND BEST TREE MODEL a00 = c00 − b00

BINARY DECOMPOSITION a10 = c10 − b10 CONCLUSION a01 = c01 − b01 FUTURE DIRECTIONS a11 = c11 − b11. FSM-Closed Tree Model (8)

BWT and CTW

Frans Willems c11 INTRODUCTION − IID SOURCES, PREFIX b11 CODES, REDUNDANCY

ENUMERATIVE CODING c1 ARITHMETIC CODING a1 c01 CONTEXT-TREE b1 WEIGHTING − b01 BURROWS WHEELER − N CODING BW-SEQUENCES − aλ FSM-Closedness Problem b110 bλ FSM-Closed Tree Model c10 Not FSM-Closed Model − c0 CODING B-COUNTS b10 − a0 FIND BEST TREE MODEL − b0 BINARY DECOMPOSITION b010 CONCLUSION c00 FUTURE DIRECTIONS − b00 FSM-Closed Tree Model (9)

BWT and CTW

Frans Willems c11 INTRODUCTION a11 IID SOURCES, PREFIX b11 CODES, REDUNDANCY

ENUMERATIVE CODING c1 ARITHMETIC CODING a1 c01 CONTEXT-TREE b1 WEIGHTING a01 b01 BURROWS WHEELER − N CODING BW-SEQUENCES − aλ FSM-Closedness Problem b110 bλ FSM-Closed Tree Model c10 Not FSM-Closed Model a10 c0 CODING B-COUNTS b10 − a0 FIND BEST TREE MODEL − b0 BINARY DECOMPOSITION b010 CONCLUSION c00 FUTURE DIRECTIONS a00 b00 FSM-Closed Tree Model (10)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY In layer 3 there are two leaves, 010, and 110, that fit into Mc. Since ENUMERATIVE CODING Mc is FSM-closed, again their generators 01 and 11 exist, are at level ARITHMETIC CODING 2, and are thus processed before. Now we can compute the c’s of the subsequences in these two leaves. CONTEXT-TREE WEIGHTING

BURROWS WHEELER c010 = a01

CODING BW-SEQUENCES c110 = a11. FSM-Closedness Problem With the available b-counts also the a-counts can be computed for FSM-Closed Tree Model Not FSM-Closed Model these leaves CODING B-COUNTS a010 = c010 − b010 FIND BEST TREE MODEL a = c − b . BINARY DECOMPOSITION 110 110 110

CONCLUSION

FUTURE DIRECTIONS FSM-Closed Tree Model (11)

BWT and CTW

Frans Willems c11 INTRODUCTION a11 IID SOURCES, PREFIX b11 CODES, REDUNDANCY

ENUMERATIVE CODING c1 ARITHMETIC CODING a1 c01 CONTEXT-TREE b1 WEIGHTING a01

BURROWS WHEELER b01 c110 N CODING BW-SEQUENCES − aλ FSM-Closedness Problem b110 bλ FSM-Closed Tree Model c10 Not FSM-Closed Model a10 c0 CODING B-COUNTS b10 a0 c010 FIND BEST TREE MODEL − b0 BINARY DECOMPOSITION b010 CONCLUSION c00 FUTURE DIRECTIONS a00 b00 FSM-Closed Tree Model (12)

BWT and CTW

Frans Willems c11 INTRODUCTION a11 IID SOURCES, PREFIX b11 CODES, REDUNDANCY

ENUMERATIVE CODING c1 ARITHMETIC CODING a1 c01 CONTEXT-TREE b1 WEIGHTING a01

BURROWS WHEELER b01 c110 N CODING BW-SEQUENCES a110 aλ FSM-Closedness Problem b110 bλ FSM-Closed Tree Model c10 Not FSM-Closed Model a10 c0 CODING B-COUNTS b10 a0 c010 FIND BEST TREE MODEL b0 a010 BINARY DECOMPOSITION b010 CONCLUSION c00 FUTURE DIRECTIONS a00 b00 FSM-Closed Tree Model (13)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY For all leaves s in Mc the subsequence length cs and b-count bs is ENUMERATIVE CODING known. Note that the b-count is the weight of the subsequence. ARITHMETIC CODING Now with the indices i00, i010, i110, i01, and i11, the corresponding CONTEXT-TREE WEIGHTING subsequences bw 00, bw 010, bw 110, bw 01, and bw 11, can be BURROWS WHEELER reconstructed.

CODING BW-SEQUENCES The BW-transformed sequence is now the concatenation of the five FSM-Closedness subsequences, hence Problem FSM-Closed Tree Model Not FSM-Closed Model BW = bw 00, bw 010, bw 110, bw 01, bw 11. CODING B-COUNTS NOTE that this approach is outlined in Martin, Seroussi, & Weinberger FIND BEST TREE MODEL (2004), with emphasis on the FSM-case, not on the BWT-case. BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Tree Model, not FSM-Closed (1)

BWT and CTW Suppose that the b-counts in the leaves of tree model

Frans Willems Mc = {00, 010, 110, 1}, which is not FSM-closed, are given to the decoder, hence the decoder knows b00, b010, b110, b1, and N. INTRODUCTION The decoder first computes the b-counts in all the nodes of the tree IID SOURCES, PREFIX CODES, REDUNDANCY model, hence bλ, b0, and b10.

ENUMERATIVE CODING The decoder now processes layer by layer, starting in the root (layer

ARITHMETIC CODING 0). First in the root the a-count is computed.

CONTEXT-TREE aλ = N − bλ. WEIGHTING BURROWS WHEELER Now all the nodes in layer 0 are processed. CODING BW-SEQUENCES In layer 1 there are two nodes, 0 and 1, that fit into Mc. Since “all FSM-Closedness Problem nodes in layer 0” are complete, we can compute the c’s of the nodes FSM-Closed Tree Model 0 and 1. Not FSM-Closed Model CODING B-COUNTS c0 = aλ

FIND BEST TREE MODEL c1 = bλ. BINARY DECOMPOSITION With the available b-counts also the a-counts can be computed for CONCLUSION these nodes FUTURE DIRECTIONS a0 = c0 − b0

a1 = c1 − b1. Now all the nodes in layer 1 are processed. Tree Model not FSM-Closed (2)

BWT and CTW

Frans Willems − INTRODUCTION − IID SOURCES, PREFIX − CODES, REDUNDANCY ENUMERATIVE CODING − ARITHMETIC CODING − − CONTEXT-TREE b1 WEIGHTING − − BURROWS WHEELER − N CODING BW-SEQUENCES − − FSM-Closedness Problem b110 − FSM-Closed Tree Model − Not FSM-Closed Model − − CODING B-COUNTS − − − FIND BEST TREE MODEL − − BINARY DECOMPOSITION b010 CONCLUSION −

FUTURE DIRECTIONS − b00 Tree Model not FSM-Closed (3)

BWT and CTW

Frans Willems − INTRODUCTION − IID SOURCES, PREFIX − CODES, REDUNDANCY ENUMERATIVE CODING − ARITHMETIC CODING − − CONTEXT-TREE b1 WEIGHTING − − BURROWS WHEELER − N CODING BW-SEQUENCES − − FSM-Closedness Problem b110 bλ FSM-Closed Tree Model − Not FSM-Closed Model − − CODING B-COUNTS b10 − − FIND BEST TREE MODEL − b0 BINARY DECOMPOSITION b010 CONCLUSION −

FUTURE DIRECTIONS − b00 Tree Model, not FSM-Closed (4)

BWT and CTW

Frans Willems − INTRODUCTION − IID SOURCES, PREFIX − CODES, REDUNDANCY ENUMERATIVE CODING − ARITHMETIC CODING − − CONTEXT-TREE b1 WEIGHTING − − BURROWS WHEELER − N CODING BW-SEQUENCES − aλ FSM-Closedness Problem b110 bλ FSM-Closed Tree Model − Not FSM-Closed Model − − CODING B-COUNTS b10 − − FIND BEST TREE MODEL − b0 BINARY DECOMPOSITION b010 CONCLUSION −

FUTURE DIRECTIONS − b00 Tree Model, not FSM-Closed (5)

BWT and CTW

Frans Willems − INTRODUCTION − IID SOURCES, PREFIX − CODES, REDUNDANCY

ENUMERATIVE CODING c1 ARITHMETIC CODING − − CONTEXT-TREE b1 WEIGHTING − − BURROWS WHEELER − N CODING BW-SEQUENCES − aλ FSM-Closedness Problem b110 bλ FSM-Closed Tree Model − Not FSM-Closed Model − c0 CODING B-COUNTS b10 − − FIND BEST TREE MODEL − b0 BINARY DECOMPOSITION b010 CONCLUSION −

FUTURE DIRECTIONS − b00 Tree Model, not FSM-Closed (6)

BWT and CTW

Frans Willems − INTRODUCTION − IID SOURCES, PREFIX − CODES, REDUNDANCY

ENUMERATIVE CODING c1 ARITHMETIC CODING − a1 CONTEXT-TREE b1 WEIGHTING − − BURROWS WHEELER − N CODING BW-SEQUENCES − aλ FSM-Closedness Problem b110 bλ FSM-Closed Tree Model − Not FSM-Closed Model − c0 CODING B-COUNTS b10 − a0 FIND BEST TREE MODEL − b0 BINARY DECOMPOSITION b010 CONCLUSION −

FUTURE DIRECTIONS − b00 Tree Model, not FSM-Closed (7)

BWT and CTW

Frans Willems In layer 2 there is a leaf, 00, and a node 10, that fit into Mc. Since all INTRODUCTION nodes at layer 1 are complete, we can compute the c’s of the subsequences in this leaf and node. IID SOURCES, PREFIX CODES, REDUNDANCY

ENUMERATIVE CODING c00 = a0

ARITHMETIC CODING c10 = a1.

CONTEXT-TREE WEIGHTING With the available b-counts for the node and leaf that fit into Mc the BURROWS WHEELER a-counts can be computed CODING BW-SEQUENCES a = c − b FSM-Closedness 00 00 00 Problem a10 = c10 − b10. FSM-Closed Tree Model Not FSM-Closed Model In layer 2 there are two nodes, 01 and 11, that do not fit into M. CODING B-COUNTS Since all nodes at layer 1 are processed, we can also compute the c’s FIND BEST TREE MODEL of these two nodes. BINARY DECOMPOSITION

CONCLUSION c01 = b0 FUTURE DIRECTIONS c11 = b1.

We need b01 and b11 now ... Tree Model, not FSM-Closed (8)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY

ENUMERATIVE CODING New Approach

ARITHMETIC CODING Leaf 1 at layer 1 is complete (we know c1 and weight b1), and CONTEXT-TREE therefore the corresponding subsequence bw 1 can be reconstructed WEIGHTING from the lexicographical index i1. BURROWS WHEELER The first c01 digits of bw 1 are digits corresponding to node 01, call this CODING BW-SEQUENCES sequence bw 01. We can now simply count the number a01 of 0-digits and FSM-Closedness the number b01 of 1-digits in bw 01. Problem The last c digits of bw are digits corresponding to node 11, call this FSM-Closed Tree Model 11 1 Not FSM-Closed Model sequence bw 11. We can now simply count the number a11 of 0-digits and the number b11 of 1-digits in bw . CODING B-COUNTS 11 Now all the nodes in layer 2 are processed. FIND BEST TREE MODEL

BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Tree Model, not FSM-Closed (9)

BWT and CTW

Frans Willems c11 INTRODUCTION − IID SOURCES, PREFIX − CODES, REDUNDANCY

ENUMERATIVE CODING c1 ARITHMETIC CODING a1 c01 CONTEXT-TREE b1 WEIGHTING − − BURROWS WHEELER − N CODING BW-SEQUENCES − aλ FSM-Closedness Problem b110 bλ FSM-Closed Tree Model c10 Not FSM-Closed Model − c0 CODING B-COUNTS b10 − a0 FIND BEST TREE MODEL − b0 BINARY DECOMPOSITION b010 CONCLUSION c00 FUTURE DIRECTIONS − b00 Tree Model, not FSM-Closed (10)

BWT and CTW

Frans Willems c11 INTRODUCTION − IID SOURCES, PREFIX − CODES, REDUNDANCY

ENUMERATIVE CODING c1 ARITHMETIC CODING a1 c01 CONTEXT-TREE b1 WEIGHTING − − BURROWS WHEELER − N CODING BW-SEQUENCES − aλ FSM-Closedness Problem b110 bλ FSM-Closed Tree Model c10 Not FSM-Closed Model a10 c0 CODING B-COUNTS b10 − a0 FIND BEST TREE MODEL − b0 BINARY DECOMPOSITION b010 CONCLUSION c00 FUTURE DIRECTIONS a00 b00 Tree Model, not FSM-Closed (11)

BWT and CTW

Frans Willems c11 INTRODUCTION a11 IID SOURCES, PREFIX b11 CODES, REDUNDANCY

ENUMERATIVE CODING c1 ARITHMETIC CODING a1 c01 CONTEXT-TREE b1 WEIGHTING a01 b01 BURROWS WHEELER − N CODING BW-SEQUENCES − aλ FSM-Closedness Problem b110 bλ FSM-Closed Tree Model c10 Not FSM-Closed Model a10 c0 CODING B-COUNTS b10 − a0 FIND BEST TREE MODEL − b0 BINARY DECOMPOSITION b010 CONCLUSION c00 FUTURE DIRECTIONS a00 b00 Tree Model, not FSM-Closed (12)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY In layer 3 there are two leaves, 010, and 110, that fit into Mc. Since ENUMERATIVE CODING all nodes at level 2 are complete, we can compute the c’s of these ARITHMETIC CODING leaves. CONTEXT-TREE WEIGHTING c010 = a01 BURROWS WHEELER c110 = a11. CODING BW-SEQUENCES FSM-Closedness With the available b-counts also the a-counts can be computed for Problem FSM-Closed Tree Model these leaves and node Not FSM-Closed Model CODING B-COUNTS a010 = c010 − b010

FIND BEST TREE MODEL a110 = c110 − b110. BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Tree Model, not FSM-Closed (13)

BWT and CTW

Frans Willems c00 INTRODUCTION a00 IID SOURCES, PREFIX b11 CODES, REDUNDANCY

ENUMERATIVE CODING c1 ARITHMETIC CODING a1 c01 CONTEXT-TREE b1 WEIGHTING a01

BURROWS WHEELER b01 c110 N CODING BW-SEQUENCES − aλ FSM-Closedness Problem b110 bλ FSM-Closed Tree Model c10 Not FSM-Closed Model a10 c0 CODING B-COUNTS b10 a0 c010 FIND BEST TREE MODEL − b0 BINARY DECOMPOSITION b010 CONCLUSION c11 FUTURE DIRECTIONS a00 b00 Tree Model, not FSM-Closed (14)

BWT and CTW

Frans Willems c00 INTRODUCTION a00 IID SOURCES, PREFIX b11 CODES, REDUNDANCY

ENUMERATIVE CODING c1 ARITHMETIC CODING a1 c01 CONTEXT-TREE b1 WEIGHTING a01

BURROWS WHEELER b01 c110 N CODING BW-SEQUENCES a110 aλ FSM-Closedness Problem b110 bλ FSM-Closed Tree Model c10 Not FSM-Closed Model a10 c0 CODING B-COUNTS b10 a0 c010 FIND BEST TREE MODEL b0 a010 BINARY DECOMPOSITION b010 CONCLUSION c11 FUTURE DIRECTIONS a00 b00 Tree Model, not FSM-Closed (15)

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY For all remaining leaves s in Mc the subsequence length cs and ENUMERATIVE CODING b-count bs (weight) is known. ARITHMETIC CODING Now with the indices i00, i010, and i110, the corresponding CONTEXT-TREE subsequences bw , bw , and bw , can be reconstructed. WEIGHTING 00 010 110 Since bw was reconstructed before, the BW-transformed sequence is BURROWS WHEELER 1 now the concatenation of the four subsequences, hence CODING BW-SEQUENCES FSM-Closedness Problem BW = bw 00, bw 010, bw 110, bw 1. FSM-Closed Tree Model Not FSM-Closed Model CODING B-COUNTS NOTE that the new approach works in the non-FSM closed case, without FIND BEST TREE MODEL having to form the FSM-closure of the tree model first. BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Coding b-Counts: Objective

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY ENUMERATIVE CODING Objective ARITHMETIC CODING

CONTEXT-TREE How do we code the b-counts in the leaves L(Mc)? Ideally we would like WEIGHTING to achieve total codewordlength BURROWS WHEELER X CODING BW-SEQUENCES log2(al + bl + 1). CODING B-COUNTS l∈L(Mc) Objective Procedure This corresponds to uniform weight coding. Note that the c’s apart from Loss cλ = N, are still unknown. FIND BEST TREE MODEL

BINARY DECOMPOSITION

CONCLUSION

FUTURE DIRECTIONS Coding b-Counts: Procedure

BWT and CTW

Frans Willems We start in the root node λ. There we describe bλ. To this we need INTRODUCTION dlog2(aλ + bλ + 1)e bits. IID SOURCES, PREFIX CODES, REDUNDANCY Then in all internal nodes s, starting in the root node λ, and ENUMERATIVE CODING processing level by level, we first find ARITHMETIC CODING

CONTEXT-TREE i = arg min log2(ais + bis + 1), WEIGHTING i=0,1

BURROWS WHEELER and we use one bit to describe i. Then we describe bis using CODING BW-SEQUENCES

CODING B-COUNTS dlog2(ais + bis + 1)e bits. Objective Procedure Loss Note that when bis needs to be described, the corresponding FIND BEST TREE MODEL c0s = a0s + b0s and c1s = a1s + b1s have already been reconstructed. BINARY DECOMPOSITION This procedure leads to CONCLUSION X FUTURE DIRECTIONS dlog2(aλ + bλ + 1)e + dlog2 min(a0n + b0n + 1, a1n + b1n + 1)e bits, n∈N (Mc)

where N (Mc) are the internal nodes of Mc. Coding b-Counts: Loss per Node, Total Loss

BWT and CTW

Frans Willems

INTRODUCTION Consider a node n ∈ N (Mc). IID SOURCES, PREFIX CODES, REDUNDANCY Observe that we have described bn using dlog2(cn + 1)e bits, and

ENUMERATIVE CODING next we describe both b0n and b1n using 1 + dlog min(c0n + 1, c1n + 1)e bits. ARITHMETIC CODING 2

CONTEXT-TREE Directly describing b0n and b1n would require dlog2(c0n + 1)e+ WEIGHTING dlog2(c1n + 1)e bits. BURROWS WHEELER The loss in this node is now CODING BW-SEQUENCES

CODING B-COUNTS dlog2(cn + 1)e + 1 + dlog2 min(c0n + 1, c1n + 1)e Objective −dlog (c + 1)e − dlog (c + 1)e Procedure 2 0n 2 1n Loss = dlog2(cn + 1)e + 1 − dlog2 max(c0n + 1, c1n + 1)e FIND BEST TREE MODEL ≤ 2 bit. BINARY DECOMPOSITION CONCLUSION Total loss is therefore FUTURE DIRECTIONS 2(|M|c − 1) bits. Finding Best Tree Model: Maximizing

BWT and CTW

Frans Willems A two-pass version (context-tree maximizing) exists that finds the best INTRODUCTION model (MDL) matching to the source sequence, if coding is done as IID SOURCES, PREFIX described before. CODES, REDUNDANCY ENUMERATIVE CODING Context-Tree Maximizing ARITHMETIC CODING Define for nodes n CONTEXT-TREE WEIGHTING   ∆ cn  BURROWS WHEELER µ(n) = 1+min[ log2 , 1+dlog2 min(c0n + 1, c1n + 1)e+µ(0n)+µ(1n)], bn CODING BW-SEQUENCES while for leaves l CODING B-COUNTS   ∆ cl  FIND BEST TREE MODEL µ(l) = log2 . bl BINARY DECOMPOSITION CONCLUSION Tracking this procedure yield Mc. Code-length is FUTURE DIRECTIONS L = log2(N + 1) + µ(λ).

This procedure can be carried out very efficiently during the BW transform phase. Finding Best Tree Model: Integration with BW Transform

BWT and CTW

Frans Willems Fix a depth, e.g. 3, then the subsequences of the nodes at depth 3 can be INTRODUCTION easily found, see BW table below: IID SOURCES, PREFIX CODES, REDUNDANCY 12 0 1 0 0 1 0 1 0 0 0 1 1 ENUMERATIVE CODING 2 0 0 1 0 1 0 0 0 1 1 0 1 ARITHMETIC CODING 7 0 0 0 1 1 0 1 0 0 1 0 1

CONTEXT-TREE 5 0 1 0 0 0 1 1 0 1 0 0 1 WEIGHTING 11 1 0 1 0 0 1 0 1 0 0 0 1 BURROWS WHEELER 1 1 0 0 1 0 1 0 0 0 1 1 0

CODING BW-SEQUENCES 3 0 1 0 1 0 0 0 1 1 0 1 0

CODING B-COUNTS 8 0 0 1 1 0 1 0 0 1 0 1 0 6 1 0 0 0 1 1 0 1 0 0 1 0 FIND BEST TREE MODEL 4 1 0 1 0 0 0 1 1 0 1 0 0 BINARY DECOMPOSITION 9 0 1 1 0 1 0 0 1 0 1 0 0 CONCLUSION 10 1 1 0 1 0 0 1 0 1 0 0 0 FUTURE DIRECTIONS Now node 000 → {10} → 1, node 100 → {9, 4} → 01, node 010 → {6, 8, 3} → 100, node 110 → {1} → 1, node 001 → {5, 11} → 10, node 101 → {7, 2} → 00, node 011 → {12} → 0, and finally node 111 → ∅ → ∅. Binary Decomposition

BWT and CTW

Frans Willems CTW: Binary Decomposition, 1.8 bit/ASCII.

INTRODUCTION Bytes: 8 bits,

IID SOURCES, PREFIX ASCII-symbols: 7 bits, CODES, REDUNDANCY DNA nucleotide (A,T, G and C): 2 bits, etc. ENUMERATIVE CODING The first bit of a symbol depends on the context (a number of past ARITHMETIC CODING symbols). CONTEXT-TREE The second bit of a symbol depends on the first bit of that symbol WEIGHTING and the context. BURROWS WHEELER The third bit of a symbol depends on the first and second bit of that CODING BW-SEQUENCES symbol and the context, etc. CODING B-COUNTS Therefore there is a tree model for the first bit. FIND BEST TREE MODEL There are two tree models for the second bit, one for first bit being BINARY DECOMPOSITION 0 and a second one when the first bit is 1. Problem There are four tree models for the third bit etc. Example Coding b-Counts, Maximizing Question: CONCLUSION Can we do a binary decomposition also in combination with a BW FUTURE DIRECTIONS transform on the symbols? Does there exist a way of finding the c’s based on only the b-counts? Binary Decomposition: Example (1)

BWT and CTW SYMBOL-SIZE: 2 bits.

Frans Willems Suppose that the b-counts in the leaves of the three tree models are specified. The decoder can now compute the b-counts of all inner nodes. INTRODUCTION The decoder first computes in tree-∅ (corresponding to the first digit IID SOURCES, PREFIX ∅ CODES, REDUNDANCY of the symbol) count aλ ENUMERATIVE CODING ∅ ∅ aλ = N − bλ. ARITHMETIC CODING

CONTEXT-TREE Now in tree-0 and in tree-1 (corresponding to the second digit of the WEIGHTING symbol) the decoder can compute the c’s and then the a’s in the root BURROWS WHEELER nodes CODING BW-SEQUENCES 0 ∅ 0 0 0 cλ = a aλ = cλ − bλ CODING B-COUNTS λ 1 ∅ 1 1 1 FIND BEST TREE MODEL cλ = bλ aλ = cλ − bλ. BINARY DECOMPOSITION Now all the nodes in layer 0 of the three trees are processed. Problem Example The decoder now starts working on layer 1 (one symbol contexts). Coding b-Counts, Maximizing First in tree-∅ the decoder computes the c’s and then the a’s CONCLUSION ∅ 0 ∅ ∅ ∅ c00 = aλ a00 = c00 − b00 FUTURE DIRECTIONS ∅ 0 ∅ ∅ ∅ c01 = bλ a01 = c01 − b01 ∅ 1 ∅ ∅ ∅ c10 = aλ a10 = c10 − b10 ∅ 1 ∅ ∅ ∅ c11 = bλ a11 = c11 − b11 Binary Decomposition: Example (2)

BWT and CTW

Frans Willems Then in tree-0 the decoder computes the c’s and then the a’s

INTRODUCTION 0 ∅ 0 0 0 c00 = a00 a00 = c00 − b00 IID SOURCES, PREFIX 0 ∅ 0 0 0 CODES, REDUNDANCY c01 = a01 a01 = c01 − b01 ENUMERATIVE CODING 0 ∅ 0 0 0 c10 = a10 a10 = c10 − b10 ARITHMETIC CODING 0 ∅ 0 0 0 c11 = a a11 = c11 − b11, CONTEXT-TREE 11 WEIGHTING and in tree-1 the decoder computes the c’s and then the a’s BURROWS WHEELER

CODING BW-SEQUENCES 1 ∅ 1 1 1 c00 = b00 a00 = c00 − b00 CODING B-COUNTS 1 ∅ 1 1 1 c01 = b01 a01 = c01 − b01 FIND BEST TREE MODEL 1 ∅ 1 1 1 BINARY DECOMPOSITION c10 = b10 a10 = c10 − b10 Problem 1 ∅ 1 1 1 Example c11 = b11 a11 = c11 − b11. Coding b-Counts, Maximizing Now all the nodes in layer 1 of the three trees are processed. CONCLUSION Etc. FUTURE DIRECTIONS If a leaf is encountered the corresponding subsequence is reconstructed using the index. From then on a-counts and b-counts can be found by inspection of the reconstructed subsequence. Binary Decomposition: Example (3)

BWT and CTW

Frans Willems 1 − − b11 INTRODUCTION IID SOURCES, PREFIX − − b1 CODES, REDUNDANCY 10 − − b1 ENUMERATIVE CODING λ − − b1 ARITHMETIC CODING 01 ∅ ··· CONTEXT-TREE − − b11 WEIGHTING 1 − − b00 BURROWS WHEELER ∅ − − b10 ··· CODING BW-SEQUENCES ∅ N − bλ CODING B-COUNTS ∅ − − b01 FIND BEST TREE MODEL 0 ··· − − b11 BINARY DECOMPOSITION ∅ Problem − − b00 Example 0 ··· − − b10 Coding b-Counts, Maximizing 0 − − bλ CONCLUSION 0 − − b01 FUTURE DIRECTIONS ··· 0 − − b00 ··· Binary Decomposition: Example (4)

BWT and CTW

Frans Willems 1 − − b11 INTRODUCTION IID SOURCES, PREFIX − − b1 CODES, REDUNDANCY 10 − − b1 ENUMERATIVE CODING λ − − b1 ARITHMETIC CODING 01 ∅ ··· CONTEXT-TREE − − b11 WEIGHTING 1 − − b00 BURROWS WHEELER ∅ − − b10 ··· CODING BW-SEQUENCES ∅ ∅ Naλbλ CODING B-COUNTS ∅ − − b01 FIND BEST TREE MODEL 0 ··· − − b11 BINARY DECOMPOSITION ∅ Problem − − b00 Example 0 ··· − − b10 Coding b-Counts, Maximizing 0 − − bλ CONCLUSION 0 − − b01 FUTURE DIRECTIONS ··· 0 − − b00 ··· Binary Decomposition: Example (5)

BWT and CTW

Frans Willems 1 − − b11 INTRODUCTION IID SOURCES, PREFIX − − b1 CODES, REDUNDANCY 10 c1 − b1 ENUMERATIVE CODING λ λ − − b1 ARITHMETIC CODING 01 ∅ ··· CONTEXT-TREE − − b11 WEIGHTING 1 − − b00 BURROWS WHEELER ∅ − − b10 ··· CODING BW-SEQUENCES ∅ ∅ Naλbλ CODING B-COUNTS ∅ − − b01 FIND BEST TREE MODEL 0 ··· − − b11 BINARY DECOMPOSITION ∅ Problem − − b00 Example 0 ··· − − b10 Coding b-Counts, Maximizing 0 0 cλ − bλ CONCLUSION 0 − − b01 FUTURE DIRECTIONS ··· 0 − − b00 ··· Binary Decomposition: Example (6)

BWT and CTW

Frans Willems 1 − − b11 INTRODUCTION IID SOURCES, PREFIX − − b1 CODES, REDUNDANCY 10 c1 a1 b1 ENUMERATIVE CODING λ λ λ − − b1 ARITHMETIC CODING 01 ∅ ··· CONTEXT-TREE − − b11 WEIGHTING 1 − − b00 BURROWS WHEELER ∅ − − b10 ··· CODING BW-SEQUENCES ∅ ∅ Naλbλ CODING B-COUNTS ∅ − − b01 FIND BEST TREE MODEL 0 ··· − − b11 BINARY DECOMPOSITION ∅ Problem − − b00 Example 0 ··· − − b10 Coding b-Counts, Maximizing 0 0 0 cλaλbλ CONCLUSION 0 − − b01 FUTURE DIRECTIONS ··· 0 − − b00 ··· Binary Decomposition: Example (7)

BWT and CTW

Frans Willems 1 − − b11 INTRODUCTION IID SOURCES, PREFIX − − b1 CODES, REDUNDANCY 10 c1 a1 b1 ENUMERATIVE CODING λ λ λ − − b1 ARITHMETIC CODING 01 ∅ ∅ ··· CONTEXT-TREE c11 − b11 WEIGHTING 1 − − b00 BURROWS WHEELER ∅ ∅ c10 − b10 ··· CODING BW-SEQUENCES ∅ ∅ Naλbλ CODING B-COUNTS ∅ ∅ c01 − b01 FIND BEST TREE MODEL 0 ··· − − b11 BINARY DECOMPOSITION ∅ ∅ Problem c00 − b00 Example 0 ··· − − b10 Coding b-Counts, Maximizing 0 0 0 cλaλbλ CONCLUSION 0 − − b01 FUTURE DIRECTIONS ··· 0 − − b00 ··· Binary Decomposition: Example (8)

BWT and CTW

Frans Willems 1 − − b11 INTRODUCTION IID SOURCES, PREFIX − − b1 CODES, REDUNDANCY 10 c1 a1 b1 ENUMERATIVE CODING λ λ λ − − b1 ARITHMETIC CODING 01 ∅ ∅ ∅ ··· CONTEXT-TREE c11a11b11 WEIGHTING 1 − − b00 BURROWS WHEELER ∅ ∅ ∅ c10a10b10 ··· CODING BW-SEQUENCES ∅ ∅ Naλbλ CODING B-COUNTS ∅ ∅ ∅ c01a01b01 FIND BEST TREE MODEL 0 ··· − − b11 BINARY DECOMPOSITION ∅ ∅ ∅ Problem c00a00b00 Example 0 ··· − − b10 Coding b-Counts, Maximizing 0 0 0 cλaλbλ CONCLUSION 0 − − b01 FUTURE DIRECTIONS ··· 0 − − b00 ··· Binary Decomposition: Example (9)

BWT and CTW

Frans Willems 1 1 c11 − b11 INTRODUCTION IID SOURCES, PREFIX c1 − b1 CODES, REDUNDANCY 10 10 c1 a1 b1 ENUMERATIVE CODING λ λ λ 1 1 ARITHMETIC CODING c01 − b01 ∅ ∅ ∅ CONTEXT-TREE c11a11b11 ··· WEIGHTING 1 1 c00 − b00 BURROWS WHEELER ∅ ∅ ∅ c10a10b10 ··· CODING BW-SEQUENCES ∅ ∅ Naλbλ CODING B-COUNTS ∅ ∅ ∅ c01a01b01 FIND BEST TREE MODEL 0 0 ··· c11 − b11 BINARY DECOMPOSITION ∅ ∅ ∅ Problem c00a00b00 Example ··· c0 − b0 Coding b-Counts, 10 10 Maximizing 0 0 0 cλaλbλ CONCLUSION 0 0 c01 − b01 FUTURE DIRECTIONS ··· 0 0 c00 − b00 ··· Binary Decomposition: Example (10)

BWT and CTW

Frans Willems 1 1 1 c11a11b11 INTRODUCTION IID SOURCES, PREFIX c1 a1 b1 CODES, REDUNDANCY 10 10 10 c1 a1 b1 ENUMERATIVE CODING λ λ λ 1 1 1 ARITHMETIC CODING c01a01b01 ∅ ∅ ∅ CONTEXT-TREE c11a11b11 ··· WEIGHTING 1 1 1 c00a00b00 BURROWS WHEELER ∅ ∅ ∅ c10a10b10 ··· CODING BW-SEQUENCES ∅ ∅ Naλbλ CODING B-COUNTS ∅ ∅ ∅ c01a01b01 FIND BEST TREE MODEL 0 0 0 ··· c11a11b11 BINARY DECOMPOSITION ∅ ∅ ∅ Problem c00a00b00 Example ··· c0 a0 b0 Coding b-Counts, 10 10 10 Maximizing 0 0 0 cλaλbλ CONCLUSION 0 0 0 c01a01b01 FUTURE DIRECTIONS ··· 0 0 0 c00a00b00 ··· Binary Decomposition: Coding b-Counts, Maximizing

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY

ENUMERATIVE CODING

ARITHMETIC CODING

CONTEXT-TREE Coding the b-counts costs more bits now. WEIGHTING It can be shown that we loose at most 6 bit per internal node BURROWS WHEELER (quaternary splitting can be accomplished with three binary splits).

CODING BW-SEQUENCES Fortunately this is at most 2 bit per parameter again. CODING B-COUNTS Maximizing formula exists again. FIND BEST TREE MODEL

BINARY DECOMPOSITION Problem Example Coding b-Counts, Maximizing CONCLUSION

FUTURE DIRECTIONS Conclusion

BWT and CTW

Frans Willems

INTRODUCTION

IID SOURCES, PREFIX CODES, REDUNDANCY The BW transform method does not need log2(N) bits to specify a ENUMERATIVE CODING transition. ARITHMETIC CODING The FSM closure is not needed. Specifying the maximizing tree CONTEXT-TREE model M and the b-counts in the leaves is enough. WEIGHTING c

BURROWS WHEELER A loss comes from the fact that these b-counts are specified recursively. This loss is upper-bounded by 2 bit per leaf. CODING BW-SEQUENCES

CODING B-COUNTS Maximizing methods exist that match perfectly with the BW

FIND BEST TREE MODEL transform.

BINARY DECOMPOSITION There exist also binary decomposition techniques that combine with

CONCLUSION the BW transform.

FUTURE DIRECTIONS Also KT-coding of weights can be analysed. Future Directions

BWT and CTW

Frans Willems

INTRODUCTION Software? IID SOURCES, PREFIX CODES, REDUNDANCY Suppose that the data have left-right symmetry hence

ENUMERATIVE CODING P(a, b) = P(b, a), P(a, b, c) = P(c, b, a),

ARITHMETIC CODING P(a, b, c, d) = P(d, c, b, a), etc. This reduces the number of parameters. Algorithm? Important for image-compression. CONTEXT-TREE WEIGHTING CTW can handle side-information by considering it as context (e.g. BURROWS WHEELER Cai, Kulkarni, and Verdu, 2005). BW-version? CODING BW-SEQUENCES Can BW-techniques be used to achieve CT-weighting performance. CODING B-COUNTS Here BW-techniques are described that achieve CT-maximizing FIND BEST TREE MODEL performance. BINARY DECOMPOSITION What if the side-information is not-properly aligned? LZ is more CONCLUSION robust now. Important for reference-based genome compression

FUTURE DIRECTIONS (Chern et al., 2012). Can also Lempel-Ziv be boosted? Context connection via Herschkovitz-Ziv (1998).