Skip Context Tree Switching

Marc G. Bellemare [email protected] Joel Veness [email protected] Google DeepMind

Erik Talvitie [email protected] Franklin and Marshall College

Abstract

Context Tree Weighting is a powerful probabilistic sequence prediction technique that efficiently performs Bayesian model averaging over the class of all prediction suffix trees of bounded depth. In this paper we show how to generalize this technique to the class of K-skip prediction suffix trees. Contrary to regular prediction suffix trees, K-skip prediction suffix trees are permitted to ignore up to K contiguous portions of the context. This allows for significant improvements in predictive accuracy when irrelevant variables are present, a case which often occurs within record-aligned data and images. We provide a regret-based analysis of our approach, and empirically evaluate it on the Calgary Corpus and a set of Atari 2600 screen prediction tasks.

1. Introduction

The sequential prediction setting, in which an unknown environment generates a stream of observations which an algorithm must probabilistically predict, is highly relevant to a number of machine learning problems such as statistical language modelling, data compression, and model-based reinforcement learning. A powerful algorithm for this setting is Context Tree Weighting (CTW, Willems et al., 1995), which efficiently performs Bayesian model averaging over a class of prediction suffix trees (Ron et al., 1996). In a compression setting, Context Tree Weighting is known to be an asymptotically optimal coding distribution for D-Markov sources.

A significant practical limitation of CTW stems from the fact that model averaging is only performed over prediction suffix trees whose ordering of context variables is fixed in advance. As we discuss in Section 3, reordering these variables can lead to significant performance improvements given limited data. This idea was leveraged by the class III algorithm of Willems et al. (1996), which performs Bayesian model averaging over the collection of prediction suffix trees defined over all possible fixed variable orderings. Unfortunately, the O(2^D) computational requirements of the class III algorithm prohibit its use in most practical applications.

Our main contribution is the Skip Context Tree Switching (SkipCTS) algorithm, a polynomial-time compromise between the linear-time CTW and the exponential-time class III algorithm. We introduce a family of nested model classes, the Skip Context Tree classes, which form the basis of our approach. The Kth order member of this family corresponds to prediction suffix trees which may skip up to K runs of contiguous variables. The usual model class associated with CTW is a special case, and corresponds to K = 0. In many cases of interest, SkipCTS's O(D^{2K+1}) running time is practical and provides significant performance gains compared to Context Tree Weighting.

SkipCTS is best suited to sequential prediction problems where a good fixed variable ordering is unknown a priori. As a simple example, consider the record-aligned data depicted by Figure 1. SkipCTS with K = 1 can improve on the CTW ordering by skipping the five most recent symbols and directly learning the lexicographical relation.

While Context Tree Weighting has traditionally been used as a data compression algorithm, it has proven useful in a diverse range of sequential prediction settings. For example, Veness et al. (2011) proposed an extension (FAC-CTW) for Bayesian, model-based reinforcement learning in structured, partially observable domains. Bellemare et al. (2013b) used FAC-CTW as a base model in their Quad-Tree Factorization algorithm, which they applied to the problem of predicting high-dimensional game screen images. Our empirical results on the same video game domains (Section 4.2) suggest that SkipCTS is particularly beneficial in this more complex prediction setting.
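To see why skipping helps on record-aligned data such as that in Figure 1, consider predicting the first character of each record: the immediately preceding symbols belong to the previous record's tail and are nearly uninformative, whereas the symbol one full record back (the previous record's first character) is highly predictive. The following toy sketch is our own illustration, not part of the paper's experiments; the synthetic word list, record length, and add-one estimator are chosen purely for demonstration.

```python
import itertools
import math
from collections import defaultdict

# Synthetic record-aligned data: every length-3 word over a small alphabet,
# in lexicographic order (a toy stand-in for the sorted word list of Figure 1).
alphabet = "ABCD"
L = 3
records = ["".join(w) for w in itertools.product(alphabet, repeat=L)]
stream = "".join(records)

def online_log_loss(offset):
    """Sequential log-loss (bits/symbol) for predicting the first symbol of
    each record from the single symbol `offset` positions back, using a
    Laplace (add-one) estimator learned online for each context value."""
    counts = defaultdict(lambda: defaultdict(int))
    loss = 0.0
    for t in range(L, len(stream), L):            # record boundaries
        ctx, target = stream[t - offset], stream[t]
        total = sum(counts[ctx].values())
        p = (counts[ctx][target] + 1) / (total + len(alphabet))
        loss -= math.log2(p)
        counts[ctx][target] += 1
    return loss / (len(records) - 1)

print("context = previous symbol:  %.3f bits/symbol" % online_log_loss(1))
print("context = symbol %d back:    %.3f bits/symbol" % (L, online_log_loss(L)))
```

On this toy data the record-aligned context attains a markedly lower per-symbol loss than the most-recent-symbol context, mirroring the improvement SkipCTS obtains by skipping the most recent symbols of the context.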

Figure 1. A sequence of lexicographically sorted fixed-length strings (AFRAID, AGAIN!, ALWAYS, AMAZED, BECOME, BEHOLD, BETTER, ...), which is particularly well-modelled by SkipCTS.

2. Background

We consider the problem of probabilistically predicting the output of an unknown sequential data generating source. Given a finite alphabet X, we write x_{1:n} := x_1 x_2 ... x_n ∈ X^n to denote a string of length n, xy to denote the concatenation of two strings x and y, and x^i to denote the concatenation of i copies of x. We further denote by x_{<t} the string x_{1:t-1}. A model ρ assigns a probability ρ(x_{1:n}) > 0 to each string, from which the chain rule ρ(x_{1:n}) = ∏_{i=1}^n ρ(x_i | x_{<i}) follows.

2.1. Bayesian Mixture Models

One way to construct a model with guaranteed low regret with respect to some model class M is to use a Bayesian mixture model

    ξ_MIX(x_{1:n}) := Σ_{ρ∈M} w_ρ ρ(x_{1:n}),

where the w_ρ > 0 are prior weights satisfying Σ_{ρ∈M} w_ρ = 1. It can readily be shown that, for any ρ ∈ M, we have

    R_n(ξ_MIX, {ρ}) ≤ −log w_ρ,

which implies that the regret of ξ_MIX(x_{1:n}) with respect to M is bounded uniformly by a constant that depends only on the prior weight assigned to the best model in M. For example, the Context Tree Weighting approach of Willems et al. (1995) applies this principle recursively to efficiently construct a mixture model over a doubly-exponential class of tree models.

A more refined nonparametric Bayesian approach to mixing is also possible. Given a model class M, the switching method (Koolen & de Rooij, 2013) efficiently maintains a mixture model ξ_SWITCH over all sequences of models in M; for a class of k models its predictions can be computed in O(k) time per step. We review here a restricted application of this method.
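To make the mixture construction above concrete, here is a minimal sketch (our own illustrative Python over a toy Bernoulli model class, not the paper's implementation) of ξ_MIX together with a numerical check of the regret bound R_n(ξ_MIX, {ρ}) ≤ −log w_ρ.

```python
import math

class Bernoulli:
    """A fixed i.i.d. Bernoulli model over the binary alphabet {0, 1}."""
    def __init__(self, p):
        self.p = p
    def prob(self, x):
        return self.p if x == 1 else 1.0 - self.p

def log_prob(model, xs):
    return sum(math.log(model.prob(x)) for x in xs)

def log_mix(models, weights, xs):
    """log xi_MIX(x_{1:n}) = log sum_rho w_rho * rho(x_{1:n})."""
    logs = [math.log(w) + log_prob(m, xs) for m, w in zip(models, weights)]
    a = max(logs)                                  # log-sum-exp for stability
    return a + math.log(sum(math.exp(l - a) for l in logs))

models = [Bernoulli(p) for p in (0.1, 0.5, 0.9)]
weights = [1.0 / len(models)] * len(models)        # uniform prior
xs = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

for m, w in zip(models, weights):
    # R_n(xi_MIX, {rho}) = log rho(x_{1:n}) - log xi_MIX(x_{1:n}) <= -log w_rho
    regret = log_prob(m, xs) - log_mix(models, weights, xs)
    assert regret <= -math.log(w) + 1e-9
    print("rho = Bernoulli(%.1f): regret %.3f <= %.3f" % (m.p, regret, -math.log(w)))
```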

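The switching idea can be sketched in the same spirit. The code below uses a fixed-share style weight update with a decaying switching rate; this conveys the flavour of ξ_SWITCH (a mixture that can track a sequence of models rather than a single best one), but it is not the exact switch distribution of Koolen & de Rooij (2013), and the models and data are again purely illustrative.

```python
import math

def switch_log_prob(models, xs):
    """Sequential log-probability of a fixed-share style switching mixture.

    Each model is a function mapping a symbol to its probability. Weights are
    updated multiplicatively, then partially reset toward uniform so that the
    mixture can follow a sequence of models over time."""
    k = len(models)
    w = [1.0 / k] * k
    total = 0.0
    for t, x in enumerate(xs, start=1):
        preds = [m(x) for m in models]
        p = sum(wi * pi for wi, pi in zip(w, preds))    # predictive probability
        total += math.log(p)
        w = [wi * pi / p for wi, pi in zip(w, preds)]   # Bayesian posterior step
        alpha = 1.0 / (t + 1)                           # decaying switching rate
        w = [(1 - alpha) * wi + alpha / k for wi in w]  # fixed-share reset
    return total

def biased_to_ones(x):
    return 0.9 if x == 1 else 0.1

def biased_to_zeros(x):
    return 0.9 if x == 0 else 0.1

# Data whose best model changes halfway through: the switching mixture adapts,
# while a plain Bayesian mixture keeps paying for its commitment to the first half.
xs = [1] * 50 + [0] * 50
print(switch_log_prob([biased_to_ones, biased_to_zeros], xs))
```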
Figure 2. A prediction suffix tree.

Figure 3. The Context Tree Switching recursive operation. For every context c we construct a model which switches between a base estimator ρ and a recursively defined split estimator.

When a single model performs best throughout, ξ_SWITCH only incurs an additional log n cost compared to a Bayesian mixture.

A prediction suffix tree predicts each symbol x_t as θ_{c_t}(x_t | x_{<t}), where c_t denotes the context in which x_t occurs.
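The recursive operation of Figure 3 is easiest to see in its original Context Tree Weighting form, where every context node mixes its own KT estimator with the product of its two children using a fixed weight of 1/2; CTS replaces this fixed mixture with the switching rule above, and SkipCTS adds further children corresponding to skipped runs of variables. The batch sketch below (our own illustrative Python, binary alphabet only) implements that plain CTW recursion, not the paper's sequential algorithm.

```python
import math
from collections import defaultdict

def kt_log_prob(zeros, ones):
    """Log Krichevsky-Trofimov (KT) block probability for the given counts."""
    logp, a, b = 0.0, 0, 0
    for _ in range(zeros):
        logp += math.log((a + 0.5) / (a + b + 1.0))
        a += 1
    for _ in range(ones):
        logp += math.log((b + 0.5) / (a + b + 1.0))
        b += 1
    return logp

def ctw_log_prob(bits, depth):
    """Batch CTW log-probability of bits[depth:] given the first `depth` bits.

    Contexts are tuples ordered from the most recent bit to the oldest. This
    naive version visits every node of the depth-`depth` binary context tree,
    which is fine for a demonstration."""
    counts = defaultdict(lambda: [0, 0])       # context -> [#zeros, #ones]
    for t in range(depth, len(bits)):
        recent = tuple(reversed(bits[t - depth:t]))
        for k in range(depth + 1):             # every suffix of the context
            counts[recent[:k]][bits[t]] += 1

    def weighted(ctx):
        """log P_w at node `ctx`: mix the KT estimate with the split estimate."""
        z, o = counts[ctx]
        log_kt = kt_log_prob(z, o)
        if len(ctx) == depth:                  # leaf node: KT estimator only
            return log_kt
        log_split = weighted(ctx + (0,)) + weighted(ctx + (1,))
        m = max(log_kt, log_split)             # log(0.5 e^a + 0.5 e^b), stably
        return m + math.log(0.5 * math.exp(log_kt - m)
                            + 0.5 * math.exp(log_split - m))

    return weighted(())

# An alternating pattern; a short context suffices to predict it well.
bits = [0, 1] * 32
print("CTW log2-loss:", -ctw_log_prob(bits, depth=2) / math.log(2))
```

In this naive form the recursion touches every node of the tree; the sequential implementations discussed in the paper only update the nodes along the current context path when a new symbol arrives.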

We now define the various kinds of decision points (e.g. split only; split or skip) within the context tree.

{T_c(x_{1:n})}_{c∈S} is a partition of {1, . . . , n}.

Recall that our aim is to bound −log ξ(x_{1:n}). Having bounded the product of terms at the leaves, ∏_{c∈S} α_c(x_{1:n}), we now derive a similar bound for the internal nodes.

Lemma 2. Let S be a proper suffix set over Ȳ_K, and let S̃_d := {(c, l) ∈ S̃ : ℓ(c) = d}. For any d ∈ {0, . . . , D−1} and any x_{1:n} ∈ X^n, we have that

    ∏_{(c,l)∈S̃_d} β_{c,l}(x_{1:n}) ≥ (1/(n+1)) ∏_{(c,l)∈S̃_d} β_{c,l}() ∏_{x∈X} ξ_{x ?^l c}(x_{1:n}).

Proof. For any (c, l) ∈ S̃, and any t ∈ {1, . . . , n}, let c′ = ?^l c and x′ = x_{t−|c′|−1}. Now for any x ∈ X \ {x′}, we have ξ_{xc′}(x_{1:t}) = ξ_{xc′}(x_{<t}).

Taking the appropriate side whenever c ∈ S, we obtain

    ξ(x_{1:n}) ≥ (1/n) ∏_{d=0}^{ℓ(S)−1} ∏_{(c,l)∈S̃_d} β_{c,l}() ∏_{c∈S} α_c(x_{1:n}) ρ_c(x_{1:n} | x^c).

Proof. (Sketch) Beginning with R_n(ξ_SKIPCTS, {ψ_S}) we apply Lemma 3, then Lemma 1, and finally simplify using ψ_S(x_{1:n}) := ∏_{c∈S} ρ_c(x_{1:n}).

If the data is generated by some unknown prediction suffix tree and the base estimators are KT estimators, the above regret bound leads to a result that is similar to the regret bound for CTS given by Veness et al. (2012), save for two main differences. First, recall that C_D, the collection of models considered by CTS, is a subset of C_{D,K}, the collection of models considered by SkipCTS (with equality only when K = 0). Our bound thus covers a broader collection of models. Second, for a proper suffix set S defined over X, i.e. a no-skip prediction suffix tree, the description length Γ^K_D(S) under SkipCTS with K > 0 is necessarily larger than its CTS counterpart. While these differences negatively affect our regret bound, we have seen in Section 3 that we should expect significant savings whenever the data can be well-modelled by a small K-skip prediction suffix tree. We explore these issues further in the next section.

4. Experiments

We tested the Skip Context Tree Switching algorithm on a series of prediction problems. The first set of experiments uses a popular data compression benchmark, while the second set of experiments investigates performance on a diverse set of structured image prediction problems taken from an open source Reinforcement Learning test framework. A reference implementation of SkipCTS is provided at: http://github.com/mgbellemare/SkipCTS.

4.1. The Calgary Corpus

Our first experiment evaluated SkipCTS in a pure compression setting. Recall that any algorithm which sequentially assigns probabilities to symbols can be used for compression by means of arithmetic coding (Witten et al., 1987). In particular, given a model ξ assigning a probability ξ(x_{1:n}) to x_{1:n} ∈ X^n, arithmetic coding is guaranteed to produce a compressed file size of essentially −log₂ ξ(x_{1:n}).

We ran SkipCTS (with D = 48, K = 1) and CTS (with D = 48) on the Calgary Corpus (Bell et al., 1989), an established compression benchmark composed of 14 different files. The results, provided in Table 1, show that SkipCTS performs significantly better than CTS on certain files, and never suffers by more than a negligible amount. Of interest, the files best improved by SkipCTS are those which contain highly-structured binary data: GEO, OBJ1, and OBJ2. For reference, we also included some CTW experiments, indicated by the CTW and SkipCTW rows, that measured the performance of skipping using the original recursive CTW weighting scheme; here we see that the addition of skipping also helps. Table 1 also provides results for CTW*, an enhanced version of CTW for byte-based data (Willems, 2013). Here both CTS and SkipCTS outperform CTW, with SkipCTS providing the best results overall. Finally, it is worth noting that averaged over the Calgary Corpus, the bits per byte performance of SkipCTS is superior (2.10 vs 2.12) to DEPLUMP (Gasthaus et al., 2010), a state-of-the-art n-gram model. While SkipCTS is consistently slightly worse for text data, it is significantly better on binary data. It is also worth pointing out that no regret guarantees are yet known for DEPLUMP.

4.2. Atari 2600 Frame Prediction

Figure 6. The game PONG, in which the player controls the vertical position of a paddle in order to return a ball and score points.

We also tested our algorithm on the task of video game screen prediction. We used the Arcade Learning Environment (Bellemare et al., 2013a), an interface that allows agents to interact with Atari 2600 games. Figure 6 depicts the well-known PONG, one of the Atari 2600's flagship games. In the Atari 2600 prediction setting, the alphabet X is the set of all possible Atari 2600 screens. Because each screen contains 160 × 210 7-bit pixels, it is both impractical and undesirable to learn a model which predicts each x_t ∈ X atomically. Instead, we take a similar approach to that of Bellemare et al. (2013b): we divide the screen into 16 × 16 blocks and predict each block atomically using SkipCTS or CTS combined with the SAD estimator.

Each block prediction is made using a context composed of the symbol value of neighbouring blocks at previous timesteps, as well as the last action taken, for a total of 11 variables. In this setting, skipping irrelevant variables is particularly important because of the high branching factor at each level. For example, when predicting the motion of the opponent's paddle in PONG, SkipCTS can disregard horizontally neighbouring blocks and the player's action.

We trained SkipCTS with K = 0 and 1 on 55 Atari 2600 games. Each experiment consisted of 10 trials, each lasting 100,000 time steps, where one time step corresponds to 4 emulated frames. Each trial was assigned a specific random seed which was used for all values of K. We report the average log-loss per frame over the last 4500 time steps, corresponding to 5 minutes of real-time Atari 2600 play. Throughout our trials actions were selected uniformly at random from each game's set of legal actions.
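As an illustration of the block-context construction just described, the sketch below assembles an 11-variable context for a single 16×16 block. The paper does not spell out the exact neighbourhood or time offsets, so the layout used here (the block and its four neighbours at each of the two previous time steps, plus the last action: 5 + 5 + 1 = 11 variables) is only one plausible reading, and every name in the snippet is hypothetical.

```python
def block_context(frames, action, i, j, t):
    """Assemble an 11-variable context for predicting block (i, j) at time t.

    `frames[t]` is a 2-D grid holding one symbol per 16x16 block. The chosen
    layout (block plus 4-neighbours at times t-1 and t-2, plus the last
    action) is an illustrative guess, not the published configuration."""
    def sym(grid, r, c):
        rows, cols = len(grid), len(grid[0])
        if 0 <= r < rows and 0 <= c < cols:
            return grid[r][c]
        return None                        # out-of-screen neighbours

    neighbours = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    context = []
    for dt in (1, 2):                      # the two previous time steps
        grid = frames[t - dt]
        context.extend(sym(grid, i + di, j + dj) for di, dj in neighbours)
    context.append(action)                 # the last action taken
    return tuple(context)                  # 5 + 5 + 1 = 11 context variables

# Example: two 3x2 grids of block symbols and a last action of 3.
frames = {0: [[1, 2], [3, 4], [5, 6]], 1: [[1, 2], [3, 4], [5, 6]]}
print(block_context(frames, action=3, i=1, j=0, t=2))
```

A context tuple like this is what the block model conditions on when predicting the block's next symbol; skipping then amounts to ignoring some of these eleven positions.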

Table 1. Compression results on the Calgary Corpus, in average bits needed to encode each byte. Highlights indicate improvements greater than 3% from CTW to SkipCTW and from CTS to SkipCTS, respectively. CTW* results are taken from Willems (2013).

File     bib    book1  book2  geo    news   obj1   obj2   paper1 paper2 pic    progc  progl  progp  trans
CTW*     1.83   2.18   1.89   4.53   2.35   3.72   2.40   2.29   2.23   0.80   2.33   1.65   1.68   1.44
CTW      2.25   2.31   2.12   5.01   2.78   4.63   3.19   2.84   2.59   0.90   3.00   2.11   2.24   2.09
SkipCTW  2.15   2.32   2.10   3.91   2.77   4.57   2.96   2.75   2.54   0.90   2.91   2.00   2.08   1.83
Diff.    4.4%   -0.4%  0.9%   22.0%  0.4%   1.3%   7.2%   3.2%   1.9%   0.0%   3.0%   5.2%   7.1%   12.4%
CTS      1.81   2.20   1.90   4.18   2.34   3.66   2.36   2.28   2.23   0.79   2.33   1.61   1.64   1.39
SkipCTS  1.75   2.20   1.89   3.60   2.34   3.40   2.19   2.26   2.22   0.76   2.30   1.59   1.61   1.35
Diff.    3.3%   0.0%   0.5%   13.9%  0.0%   7.1%   7.2%   0.9%   0.4%   3.8%   1.3%   1.2%   1.8%   2.9%

The full table of results is provided as supplementary material. For each game we computed the improvement in log-loss per frame and determined whether the difference in loss was statistically significant using the Wilcoxon signed rank test. As a whole, SkipCTS achieved lower log-loss than CTS in 54 out of 55 games; all of these differences are significant. While SkipCTS performed slightly worse in ELEVATOR ACTION, the difference was not statistically significant. The average overall log-loss improvement was 9.0% and the median, 8.25%; improvements ranged from -2% (ELEVATOR ACTION) to 36% (FREEWAY). SkipCTS with K = 1 processed on average 34 time steps (136 frames) per second, corresponding to just over twice the real-time speed of the Atari 2600. We further ran our algorithm with K = 2 and observed an additional, significant increase in predictive performance on 18 games (up to 21.7% over K = 1 for TIME PILOT). On games where K = 2 is unnecessary, however, the performance of SkipCTS degraded somewhat. As discussed above, this behaviour is an expected consequence of the larger Γ^K_D(S).

5. Discussion

We have seen that by allowing context trees to skip over variables, SkipCTS can achieve substantially better performance than CTS in problems where a good variable ordering may not be known a priori. Theoretically we have seen that SkipCTS can, in the extreme case, have exponentially lower regret. Empirically we observe substantial benefits in practice over state of the art algorithms in problems involving highly structured data (e.g. the GEO problem in the Calgary Corpus). The dramatic and consistent improvement seen across over 50 Atari prediction problems indicates that SkipCTS is especially important in multi-dimensional prediction problems where issues of variable ordering are naturally exacerbated.

The main drawback of SkipCTS is the increased computational complexity of inference as a result of the more expressive model class. However, our experiments have demonstrated that small values of K can make a substantial difference. Furthermore, the computational and memory costs of SkipCTS can be alleviated in practice. The tree structure induced by the recursive SkipCTS update (Equations 3–5) can naturally be parallelized, while the SkipCTS memory requirements can easily be bounded through hashing. Finally, note that sampling from the model remains an O(D) operation, so, for instance, planning with a SkipCTS-based reinforcement learning model is nearly as efficient as planning with a CTS-based model.

Tree-based models have a long history in sequence prediction, and the persistent issue of variable ordering has been confronted in many ways. The main strengths of SkipCTS are inherited from CTW: efficient, incremental, and exact Bayesian inference, and strong theoretical guarantees on asymptotic regret. Other approaches with more representational flexibility lack these strengths. In the model-based reinforcement learning setting, some methods (e.g. McCallum, 1996; Holmes & Isbell, 2006; Talvitie, 2012) extend the traditional predictive suffix tree by allowing variables from different time steps to be added in any order, or by allowing the tree to excise portions of history, but these methods are not incremental and do not provide regret guarantees. Bayesian decision tree learning methods (e.g. Chipman et al., 1998; Lakshminarayanan et al., 2013) could in principle be applied in the sequential prediction setting. These typically allow arbitrary variable ordering, but require approximate inference to remain tractable.

6. Conclusion

In this paper we presented Skip Context Tree Switching, a polynomial-time algorithm which efficiently mixes over sequences of prediction suffix trees that may skip over K contiguous runs of variables. Our results show that SkipCTS is practical for small K and can produce significant empirical improvements compared to members of the Context Tree Weighting family (even with K = 1) in problems where irrelevant variables are naturally present.

Acknowledgments

The authors would like to thank Alex Graves, Andriy Mnih and Michael Bowling for some helpful discussions.

References

Bell, Timothy, Witten, Ian H., and Cleary, John G. Modeling for text compression. ACM Computing Surveys (CSUR), 21(4):557–591, 1989.

Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, June 2013a.

Bellemare, Marc G., Veness, Joel, and Bowling, Michael. Bayesian learning of recursively factored environments. In Proceedings of the Thirtieth International Conference on Machine Learning, 2013b.

Chipman, Hugh A., George, Edward I., and McCulloch, Robert E. Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–948, 1998.

Erven, Tim Van, Grunwald, Peter, and de Rooij, Steven. Catching up faster in Bayesian model selection and model averaging. In NIPS, 2007.

Gasthaus, Jan, Wood, Frank, and Teh, Yee Whye. Lossless compression based on the sequence memoizer. In Data Compression Conference (DCC), 2010.

Herbster, Mark and Warmuth, Manfred K. Tracking the best expert. Machine Learning, 32(2):151–178, 1998.

Holmes, Michael P. and Isbell, Jr., Charles Lee. Looping suffix tree-based inference of partially observable hidden state. In Proceedings of the 23rd International Conference on Machine Learning, pp. 409–416, 2006.

Hutter, Marcus. Sparse adaptive Dirichlet-multinomial-like processes. In Proceedings of the Conference on Learning Theory (COLT), 2013.

Koolen, Wouter M. and de Rooij, Steven. Universal codes from switching strategies. IEEE Transactions on Information Theory, 59(11):7168–7185, 2013.

Krichevsky, R. and Trofimov, V. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199–207, 1981.

Lakshminarayanan, Balaji, Roy, Daniel M., and Teh, Yee Whye. Top-down particle filtering for Bayesian decision trees. In Proceedings of the 30th International Conference on Machine Learning, 2013.

McCallum, Andrew K. Reinforcement learning with selective perception and hidden state. PhD thesis, University of Rochester, 1996.

Ron, Dana, Singer, Yoram, and Tishby, Naftali. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25(2):117–149, 1996.

Talvitie, Erik. Learning partially observable models using temporally abstract decision trees. In Advances in Neural Information Processing Systems (25), 2012.

Tjalkens, Tj. J., Shtarkov, Y. M., and Willems, F. M. J. Context tree weighting: Multi-alphabet sources. In 14th Symposium on Information Theory in the Benelux, pp. 128–135, 1993.

Veness, Joel, Ng, Kee Siong, Hutter, Marcus, Uther, William T. B., and Silver, David. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95–142, 2011.

Veness, Joel, Ng, Kee Siong, Hutter, Marcus, and Bowling, Michael H. Context tree switching. In Data Compression Conference (DCC), pp. 327–336, 2012.

Volf, P. Weighting techniques in data compression: Theory and algorithms. PhD thesis, Eindhoven University of Technology, 2002.

Willems, Frans M., Shtarkov, Yuri M., and Tjalkens, Tjalling J. The context tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41:653–664, 1995.

Willems, Frans M., Shtarkov, Yuri M., and Tjalkens, Tjalling J. Context weighting for general finite-context sources. IEEE Transactions on Information Theory, 42(5):1514–1520, 1996.

Willems, Frans M. J. CTW website. http://www.ele.tue.nl/ctw/, 2013.

Witten, Ian H., Neal, Radford M., and Cleary, John G. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540, 1987.