Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

Count-Based Frequency Estimation With Bounded Memory

Marc G. Bellemare
Google DeepMind, London, United Kingdom
[email protected]

Abstract

Count-based estimators are a fundamental building block of a number of powerful sequential prediction algorithms, including Context Tree Weighting and Prediction by Partial Matching. Keeping exact counts, however, typically results in a high memory overhead. In particular, when dealing with large alphabets the memory requirements of count-based estimators often become prohibitive. In this paper we propose three novel ideas for approximating count-based estimators using bounded memory. Our first contribution, of independent interest, is an extension of reservoir sampling for sampling distinct symbols from a stream of unknown length, which we call K-distinct reservoir sampling. We combine this sampling scheme with a state-of-the-art count-based estimator for memoryless sources, the Sparse Adaptive Dirichlet (SAD) estimator. The resulting algorithm, the Budget SAD, naturally guarantees a limit on its memory usage. We finally demonstrate the broader use of K-distinct reservoir sampling in nonparametric estimation by using it to restrict the branching factor of the Context Tree Weighting algorithm. We demonstrate the usefulness of our algorithms with empirical results on two sequential, large-alphabet prediction problems.

1 Introduction

Undoubtedly, counting is the simplest way of estimating the relative frequency of symbols in a data sequence. In the sequential prediction setting, these estimates are used to assign to each symbol a probability proportionate to its frequency. When dealing with binary alphabets the counting approach gives rise to the Krichevsky-Trofimov estimator, known to be minimax optimal [Krichevsky and Trofimov, 1981]. Count-based approaches are also useful in dealing with large alphabets, and a number of methods coexist [Katz, 1987; Tjalkens et al., 1993; Friedman and Singer, 1999; Hutter, 2013], each with its own statistical assumptions and regret properties. While these simple estimators deal only with memoryless sources – whose output is context invariant – they play a critical role as building blocks for more complicated estimators [Cleary and Witten, 1984; Willems et al., 1995; Tziortziotis et al., 2014].

We are interested in the large alphabet setting as it naturally lends itself to the compression of language data, where symbols consist of letters, phonemes, words or even sentence fragments. This setting is also of practical interest in reinforcement learning, for example in learning dynamical models for planning [Farias et al., 2010; Bellemare et al., 2013]. In both contexts, the statistical efficiency of count-based estimators makes them excellent modelling candidates. However, the memory demands of such estimators typically increase linearly with the effective alphabet size M, which is problematic when M grows with the sequence length. Perhaps surprisingly, this issue is more than just a theoretical concern and often occurs in language data [Gale and Church, 1990].

Our work is also motivated by the desire to improve the Context Tree Weighting algorithm [CTW; Willems et al., 1995] to model k-Markov large alphabet sources. While CTW is a powerful modelling technique, its practical memory requirements (linear in the size of the sequence) often preclude its use in large problems. Here, even when the alphabet size is fixed and small, the number of observed k-order contexts typically grows with the sequence length. We are particularly interested in reducing the long term memory usage of such an estimator, for example when estimating attributes of internet packets going through a router or natural language models trained on nonstationary data.

Our solution takes its inspiration from the reservoir sampling algorithm [Knuth, 1981] and the work of Dekel et al. [2008] on budget perceptrons. We propose an online, randomized algorithm for sampling distinct symbols from a data stream; by choosing a sublinear reservoir size, we effectively force our statistical estimator to forget infrequent symbols. We show that this algorithm can easily be combined with a typical count-based estimator, the Sparse Adaptive Dirichlet estimator, to guarantee good statistical estimation while avoiding unbounded memory usage. As our results demonstrate, the resulting estimator is best suited to sources that emit a few frequent symbols while also producing a large pool of low-frequency "noise". We further describe a simple application of our sampling algorithm to branch pruning within the CTW algorithm.

The idea of sampling distinct symbols from a stream is well-studied in the literature. Perhaps most related to ours is the work of Gibbons and Matias [1998], who proposed the notion of a concise sample to more efficiently store a "hot list" of frequent elements. Their algorithm is also based on reservoir sampling, but relies on a user-defined threshold and probabilistic counting. By contrast, our algorithm maintains a reservoir based on a random permutation of the data stream, which avoids the need for a threshold. Gibbons [2001] proposed an algorithm called Distinct Sampling to estimate the number of distinct values within a data stream. Similar to the hot list algorithm, Distinct Sampling employs a growing threshold, which is used to determine when infrequent symbols should be discarded. Metwally et al. [2005] proposed to track frequent symbols using a linked list and a hash table. Their Space Saving algorithm aims to find all frequently occurring items (heavy hitters), rather than provide a frequency-biased sample; a similar scheme was also proposed by Demaine et al. [2002]. Manku and Motwani [2002] proposed a probabilistic counting and halving approach to find all items whose frequency is above a user-defined threshold. Charikar et al. [2004] showed how to use a probabilistic count sketch in order to find frequent items in a data stream. While count sketches may seem the natural fit for count-based estimators, their memory requirements grow quadratically with the accuracy parameter; by contrast, our algorithm has only a linear dependency on the implied K.

2 Background

In the sequential prediction setting we consider, an algorithm must make probabilistic predictions over a sequence of symbols observed one at a time. These symbols belong to a finite alphabet X, with x_{1:n} := x_1 ... x_n ∈ X^n denoting a string drawn from this alphabet. A probabilistic prediction consists in assigning a probability to a new symbol x_{n+1} given that x_{1:n} has been observed.

Let [n] := {1, ..., n}. We denote the set of finite strings by X* := ∪_{i=0}^∞ X^i, denote the concatenation of strings s and r by sr, and use ε to represent the empty string. For a set B ⊆ X, we write x_{1:n} \ B := (x_i : i ∈ [n], x_i ∉ B) to denote the substring produced by excising from x_{1:n} the symbols in B. We denote by N_n(x) := N(x, x_{1:n}) the number of occurrences of x ∈ X within x_{1:n}, and write τ_n(x) := τ(x, x_{1:n}) for the time of first occurrence of x in x_{1:n}, with τ(x, x_{1:n}) := n + 1 whenever x ∉ x_{1:n}. We denote the set of permutations of x_{1:n} by P(x_{1:n}) and the discrete uniform distribution over X by U_X(·). Finally, unless otherwise specified, we write log to mean the natural logarithm.

A coding distribution ρ is a sequence (ρ_t : t ∈ N) of probability distributions ρ_t : X^t → [0, 1], each respecting

1. Σ_{x∈X} ρ_{t+1}(x_{1:t} x) = ρ_t(x_{1:t}), and
2. ρ_0(ε) = 1.

In what follows the subscript to ρ_t is always implied from its argument; we therefore omit it. A coding distribution assigns conditional probabilities according to

ρ(x | x_{1:t}) := ρ(x_{1:t} x) / ρ(x_{1:t}),

which after rearrangement yields the familiar chain rule ρ(x_{1:n}) := ∏_{t=1}^n ρ(x_t | x_{1:t−1}).

A source is a distinguished type of coding distribution which we assume generates sequences. When a coding distribution is not a source, we call it a model. A source (µ_t : t ∈ N) is said to be memoryless when µ_t(x | x_{1:t}) = µ_t(x | ε) =: µ_t(x) for all t, all x ∈ X, and all sequences x_{1:t} ∈ X^t; for B ⊆ X, we then write µ_t(B) := Σ_{x∈B} µ_t(x). When µ_t(x) := µ_1(x) for all t, we say it is stationary and omit its subscript.

Let x_{1:n} ∈ X^n and let µ be an unknown coding distribution, for example the source that generates x_{1:n}. We define the redundancy of a model ρ with respect to µ as

F_n(ρ, µ) := F(ρ, µ, x_{1:n}) := −log [ρ(x_{1:n}) / µ(x_{1:n})].

The redundancy F_n(ρ, µ) can be interpreted as the excess bits (or nats) required to encode x_{1:n} using ρ rather than µ. The expectation of this redundancy with respect to a random sequence x_{1:n} is the Kullback-Leibler divergence KL_n(µ ∥ ρ):

KL_n(µ ∥ ρ) := E_{x_{1:n} ∼ µ} [F_n(ρ, µ)].

Typically, we are interested in how well ρ compares to a class of coding distributions M. The redundancy of ρ with respect to this class is defined as

F_n(ρ, M) := max_{ρ' ∈ M} F_n(ρ, ρ').

Intuitively, F_n(ρ, M) measures the cost of encoding x_{1:n} with ρ rather than the best model in M.

2.1 The Sparse Adaptive Dirichlet Estimator

The Sparse Adaptive Dirichlet estimator [SAD; Hutter, 2013] is a count-based frequency estimator designed for large-alphabet sources with sparse support. Informally, the SAD predicts according to empirical frequencies in x_{1:n}, while reserving some escape probability γ for unseen symbols. Given A_t := {x_i : i ∈ [t]}, the set of symbols seen up to time t, and a probability distribution w_t over X \ A_t, the SAD estimator predicts according to

ρ^SAD(x | x_{1:t}) := N_t(x) / (t + γ_t) if x ∈ A_t, and γ_t w_t(x) / (t + γ_t) otherwise,

γ_t := γ(t, A_t) := 1 if t = 0; 0 if A_t = X; |A_t| / log((t + 1) / |A_t|) otherwise.

Let M^SAD be the class of memoryless sources generating symbols from a subalphabet A ⊆ X of size M. The redundancy of the SAD estimator with respect to this class is bounded as

F_n(ρ^SAD, M^SAD) ≤ ((M − 1) / 2) log n + O(M log log(n / M)),

making it particularly apt at dealing with sparse-support sources.
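To make the prediction rule concrete, here is a minimal sketch of an SAD-style predictor with w_t taken uniform over the unseen symbols; the class name and interface are illustrative assumptions of this sketch, not the reference implementation of Hutter [2013].

```python
import math
from collections import Counter

class SADPredictor:
    """Minimal sketch of a Sparse Adaptive Dirichlet (SAD) style predictor.

    Assumes a finite alphabet of known size and takes w_t uniform over the
    unseen symbols; names and interface are illustrative only.
    """

    def __init__(self, alphabet_size):
        self.alphabet_size = alphabet_size
        self.counts = Counter()   # N_t(x): occurrences of each observed symbol
        self.t = 0                # number of symbols observed so far

    def _gamma(self):
        # Escape mass gamma_t = |A_t| / log((t+1)/|A_t|), with the edge cases
        # gamma_0 = 1 and gamma_t = 0 once every alphabet symbol has been seen.
        seen = len(self.counts)
        if self.t == 0:
            return 1.0
        if seen == self.alphabet_size:
            return 0.0
        return seen / math.log((self.t + 1) / seen)

    def prob(self, x):
        gamma = self._gamma()
        denom = self.t + gamma
        if x in self.counts:
            return self.counts[x] / denom
        unseen = self.alphabet_size - len(self.counts)
        return (gamma / denom) * (1.0 / unseen)   # gamma_t * w_t(x) / (t + gamma_t)

    def update(self, x):
        self.counts[x] += 1
        self.t += 1
```

In sequential use, one calls prob(x) to score the next symbol and update(x) once it has been observed; the predicted probabilities sum to one at every step.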

3 Budget Sequential Estimation

The SAD estimator requires counting the occurrence of all encountered symbols. This is usually undesirable when the observed alphabet grows with the sequence length. Here we propose a sampling solution to this problem; this solution, K-distinct reservoir sampling, lets us impose a memory constraint on our estimator. More broadly, our sampling scheme can be combined with any algorithm with a non-parametric component; as an illustration, in Section 5.2 we apply our solution to restrict the branching factor of the Context Tree Weighting algorithm.

Let µ be an unknown source. We are given a parameter K ∈ N and seek a count-based estimator that predicts x_{1:n} ∼ µ sequentially without using more than O(K) memory. We call this problem the budget, sequential, count-based prediction setting. While similar settings have been studied extensively in the context of linear classifiers [Dekel et al., 2008; Cavallanti et al., 2007; Orabona et al., 2008] and in relation to the 0–1 loss [Lu and Lu, 2011], to the best of our knowledge the notion of a budget count-based frequency estimator is novel. From a practical perspective, we are interested in prediction problems for which the alphabet is large but the source skewed towards a few high-frequency symbols; it is for these problems that we expect meaningful prediction to be achievable.

Our competitor class M is the set of budget multinomial distributions. More specifically, we consider models of the form ρ(x ; Z, θ, θ_0), where Z := {x^{i_1}, ..., x^{i_K}} ⊆ X is a subalphabet of cardinality K, θ ∈ R^K, and θ_0 ∈ R. Each such model predicts according to

ρ(x ; Z, θ, θ_0) := θ_j if x = x^{i_j} ∈ Z, and θ_0 / |X \ Z| otherwise,   (1)

with the usual requirement that Σ_{x∈X} ρ(x ; Z, θ, θ_0) = 1.

Let µ be a memoryless stationary source and for B ⊆ X let

µ_B(x) := 0 if x ∉ B, and µ(x) / µ(B) otherwise.

Finally, arrange X := {x^1, x^2, ..., x^m} in order of decreasing probability, i.e. such that for all i, µ(x^i) ≥ µ(x^{i+1}).

Lemma 1. The model ρ*_µ ∈ M which minimizes KL_1(µ ∥ ·) is given by

Z* = argmin_{Z : |Z| = K} µ(X \ Z) KL_1(µ_{X\Z} ∥ U_{X\Z}),
θ*(x) = µ(x) for all x ∈ Z*,   θ*_0 = µ(X \ Z*).

Proof (sketch). We first show that for a fixed Z, the optimal parameters correspond to those given above. We then use the convexity property of the KL divergence, namely that for λ ∈ [0, 1] and pairs of probability distributions (p_1, q_1), (p_2, q_2),

KL_1(λ p_1 + (1−λ) p_2 ∥ λ q_1 + (1−λ) q_2) ≤ λ KL_1(p_1 ∥ q_1) + (1−λ) KL_1(p_2 ∥ q_2),

with equality when (p_1, q_1) and (p_2, q_2) have disjoint support. The result follows by setting p_1 = q_1 = µ_Z, and taking p_2, q_2 to be respectively µ_{X\Z} and ρ*_{µ, X\Z} = U_{X\Z}.

The full proof is provided in the supplemental. The next result shows that the optimal Z* is composed of a combination of the most and least frequent symbols according to µ.

Lemma 2. Let Z* be the subalphabet associated with the model ρ*_µ. Then Z* = H ∪ L, where H, L ⊆ X are disjoint sets such that

• for all x ∈ H, y ∈ X \ H, µ(x) ≥ µ(y), and
• for all x ∈ L, y ∈ X \ L, µ(x) ≤ µ(y).

Proof (sketch). The proof follows by using the identity

KL(X ∥ U_X) = log |X| − H(X)

for a discrete random variable X distributed over X, with H(X) its entropy [Cover and Thomas, 1991]. We then show that, for Z = {x^i}, the function µ(X \ Z) KL_1(µ_{X\Z} ∥ U_{X\Z}) is concave in i. The result is then extended to the general case Z = {x^{i_1}, ..., x^{i_K}}.

When µ is known, ρ*_µ can be constructed in time O(K) by considering the different choices of H and L. Furthermore, the set Ẑ containing the K most frequent symbols achieves an expected redundancy at most 1 + (K / (e|X|)) log K greater than the expected redundancy of ρ*_µ (Lemma 3, supplemental); in our experiments and for large alphabets this redundancy is negligible. In the sequel we therefore assume that Z* = Ẑ.

However, we are interested in the setting where µ is unknown. In this case, our budget requirement poses a circular problem: we need good estimates of µ to find Z*, and simultaneously need Z* to estimate µ (with bounded memory). Our solution is to maintain a time-varying set which contains (with high probability) the K most frequent symbols. We call our algorithm K-distinct reservoir sampling, in relation to the reservoir sampling algorithm [Knuth, 1981]. As we shall see in Section 3.2, we can leverage K-distinct reservoir sampling to approximate the SAD estimator by estimating the frequency of symbols in this set. We call the resulting approximation the Budget Sparse Adaptive Dirichlet estimator, or Budget SAD for short.

3.1 K-Distinct Reservoir Sampling

Let x_{1:n} ∈ X^n be a string of a priori unknown length and let K ∈ N. Reservoir sampling [Algorithm R; Knuth, 1981] is an algorithm for sampling y := y_1 ... y_K ∈ X^K from x_{1:n}, without replacement and using O(K) memory. It proceeds as follows:

1. Randomly assign x_1 ... x_K to y_1 ... y_K.
2. For t = K + 1, ..., n, draw r ∼ U({1, ..., t}). If
   (a) r ≤ K, then assign x_t to y_r;
   (b) otherwise do nothing.
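For reference, here is a minimal sketch of Algorithm R written as a streaming function; the function name and the stream interface are conveniences of this sketch rather than part of Knuth's presentation.

```python
import random

def reservoir_sample(stream, K, rng=random):
    """Algorithm R (sketch): keep a uniform sample of K stream positions'
    values, possibly with duplicate symbols, in O(K) memory."""
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= K:
            reservoir.append(x)        # step 1: fill the reservoir with x_1 ... x_K
        else:
            r = rng.randint(1, t)      # step 2: r ~ U({1, ..., t})
            if r <= K:
                reservoir[r - 1] = x   # step 2(a): assign x_t to y_r
            # step 2(b): otherwise do nothing
    return reservoir
```

Because the reservoir stores the values at K sampled positions, the returned list may contain duplicate symbols, which is precisely the limitation addressed next.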

We would like to use reservoir sampling to sample an approximation to Z*. However, Algorithm R is inadequate for our purposes: since y may contain duplicates, sampling K distinct items may require significant memory. Our solution is to modify Algorithm R to compactly represent duplicates. We begin with the following observation:

Observation 1. Let x̃_{1:n} be a random permutation of x_{1:n}, that is x̃_{1:n} ∼ U(P(x_{1:n})). Then y := x̃_1 ... x̃_K is a sample of size K sampled without replacement from x_{1:n}.

We can therefore interpret Algorithm R as maintaining the first K elements of a random permutation of x_{1:n}. By analogy, we define the K-distinct operator, which describes the first K distinct elements of a string:

Definition 1. Let w_{1:m} ∈ X^m. The K-distinct operator φ_K : X^m → ∪_{i=0}^K X^i is recursively defined as

φ_K(w_{1:m}) := ε if K = 0 or m = 0, and w_1 φ_{K−1}(w_{1:m} \ {w_1}) otherwise.

The next step is to use this operator to define a K-distinct sample of x_{1:n}. From here onwards, we assume without loss of generality that sequences contain at least K distinct symbols.

Definition 2. Let Y_n := Y(x_{1:n}) be a random variable taking values in X^K, and let x̃_{1:n} ∼ U(P(x_{1:n})). Then Y_n is a K-distinct sample of x_{1:n} if it is distributed as φ_K(x̃_{1:n}).

Definition 3. Let w_{1:m} ∈ X^m with y_1 ... y_{K+1} := φ_{K+1}(w_{1:m}). The K-concise summary of w_{1:m} is a vector of K pairs S := ⟨(y_i, d_i) ∈ X × N⟩ such that, for all i ≤ K,

1. d_i = τ(y_{i+1}, w_{1:m}) − τ(y_i, w_{1:m}), or equivalently
2. D_i := Σ_{j=1}^{i−1} d_j = τ(y_i, w_{1:m}) − 1 for i ∈ [K + 1].

Each pair (y_i, d_i) in the K-concise summary of a string thus describes this string's i-th distinct symbol and the gap d_i between this distinct symbol and the next (Figure 1, bottom row).

Figure 1: Insertion of 'C' at r = 3 into a 3-concise summary. The permutation x̃_{1:n} = ABBACBADAB, with summary ⟨(A,1), (B,3), (C,3)⟩, becomes x̃_{1:n+1} = ABBCACBADAB, with summary ⟨(A,1), (B,2), (C,5)⟩.

For x ∈ X, r ∈ [D_{K+1}], and S := ⟨(y_i, d_i) : i ∈ [K]⟩, define the vector and set of stored symbols

y(S) := ⟨y_i : i ∈ [K]⟩,   Y(S) := {y_i : i ∈ [K]},

as well as the indices I : Y(S) → [K] and J : [D_{K+1}] → [K + 1]:

I(x) := i ∈ [K] such that y_i = x,
J(r) := min{i ∈ [K + 1] : r ≤ D_i}.

In K-distinct reservoir sampling (Algorithm 1), I(x) and J(r) are used to indicate, respectively, the index of x in S and the index in S at which a new symbol should be inserted. For example, in Figure 1 (left), I('B') = 2 and J(3) = 3.

Algorithm 1 K-Distinct Reservoir Sampling.
Initially: S = ⟨⟩

SAMPLE(K, x_{1:n})
  for t = 1 ... n do
    Observe x_t
    Draw r ∼ U({0, ..., t − 1})
    if r > D_{K+1} then do nothing
    else if x_t ∉ Y(S) then INSERT(S, x_t, r)
    else if I(x_t) ≥ J(r) then
      d_{J(r)−1} ← r − D_{J(r)−1}
      if |S| > K then remove item K + 1 from S
      MERGE(S, x_t)

MERGE(S, x_t)
  d_{I(x_t)−1} ← d_{I(x_t)−1} + d_{I(x_t)}
  Remove item I(x_t) from S

Theorem 1. When Algorithm 1 terminates, its summary S is the K-concise summary of a permutation x̃_{1:n} sampled uniformly at random from P(x_{1:n}).

Proof (sketch). Let r_{1:n} be the string of random integers generated by Algorithm 1. These induce a sequence (x̃_{1:t} : t ∈ [n]) of permutations of the prefixes x_{1:t}. To show the correctness of Algorithm 1, we simply show that its operations mirror this sequence of permutations and incrementally generate the sequence of K-concise summaries of x̃_{1:t}. The different cases handled by Algorithm 1 then arise from different kinds of insertions.

Corollary 1. Algorithm 1 outputs a K-distinct sample of x_{1:n}.

Note here the significance of Theorem 1: we can simulate, in a fully online fashion and in O(K) memory and running time, the sampling of a permutation x̃_{1:n} ∼ U(P(x_{1:n})) and the subsequent selection of its first K distinct symbols. Interestingly enough, the use of random insertions, rather than swaps (as is done in Algorithm R), seems necessary to guarantee the correct distribution on y(S).
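The following is a minimal sketch of the sampler just described. Rather than transcribing Algorithm 1, it maintains the same information as the K-concise summary by tracking the first occurrences of the first K + 1 distinct symbols of the simulated permutation; the class name and methods are illustrative assumptions.

```python
import random

class KDistinctReservoir:
    """Sketch of K-distinct reservoir sampling: simulate a uniformly random
    permutation of the stream by inserting each new element at a random
    position, and track the first occurrences of its first K + 1 distinct
    symbols (the information held by a K-concise summary)."""

    def __init__(self, K, rng=random):
        self.K = K
        self.rng = rng
        self.t = 0
        self.entries = []   # [position, symbol] pairs: tracked first occurrences

    def observe(self, x):
        self.t += 1
        r = self.rng.randint(1, self.t)   # x becomes the r-th element of the permutation
        # Tracked first occurrences at or after r are pushed back by one slot.
        for entry in self.entries:
            if entry[0] >= r:
                entry[0] += 1
        idx = next((i for i, e in enumerate(self.entries) if e[1] == x), None)
        if idx is not None:
            # x is already tracked; it may now first occur earlier than before.
            if r < self.entries[idx][0]:
                self.entries[idx][0] = r
        else:
            max_pos = max(e[0] for e in self.entries) if self.entries else 0
            if len(self.entries) <= self.K or r < max_pos:
                # x is now among the first K + 1 distinct symbols of the permutation.
                self.entries.append([r, x])
                if len(self.entries) > self.K + 1:
                    self.entries.remove(max(self.entries))   # drop the displaced symbol
        self.entries.sort()

    def sample(self):
        """Current K-distinct sample: the first K distinct symbols of the permutation."""
        return [symbol for _, symbol in self.entries[:self.K]]

    def gaps(self):
        """Gaps between consecutive tracked first occurrences (cf. Definition 3)."""
        positions = [p for p, _ in self.entries]
        return [b - a for a, b in zip(positions, positions[1:])]
```

Each observe() call costs O(K) time, the state never exceeds K + 1 entries, and sample() returns the current K-distinct sample at any point in the stream.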

3.2 The Budget SAD Estimator

Naturally, we wish to use K-distinct reservoir sampling to approximate the set Z* of frequently occurring symbols in x_{1:n}. However, the d_i variables do not estimate the frequencies of their respective symbols. To estimate these frequencies, our Budget Sparse Adaptive Dirichlet estimator augments a K-concise summary with count information. The new summary is a vector of tuples

U := ⟨(y_i, d_i, c_i) ∈ X × N × N : i ∈ [K]⟩,

where c_i represents the number of occurrences of y_i since its last entering the summary.

Algorithm 2 Budget SAD.
Initially: U = ⟨⟩, z = 0

for t = 1 ... n do
  Predict x_t according to Equation 2
  Observe x_t
  Update U as per Algorithm 1
  if x_t is new in U then c_{I(x_t)} ← 1
  else if x_t ∈ Y(U) then c_{I(x_t)} ← c_{I(x_t)} + 1
  else z ← z + 1   // discard x_t

New symbols are added to U with c = 1; then, whenever a symbol already present in U is observed, its count is incremented. Recall that the SAD prediction depends on N_t(x), the number of occurrences of x in x_{1:t}, as well as the subalphabet A_t (Section 2.1). Our process yields Y_t := Y(U_t) as an approximation to A_t, as well as an approximate count function N̂_t : X → R, with N̂_t(x) = 0 for x ∉ Y_t. The algorithm also tracks z_t, the number of discarded symbols, and approximates the SAD prediction rule as

N̂_t(x) := (t − z_t) c_{t,I_t(x)} / Σ_{x'∈Y_t} c_{t,I_t(x')}   for x ∈ Y_t,
γ̂_t := γ(t, Y_t) + z_t,
Ñ_t(x) := max{N̂_t(x), γ̂_t w_t(x)},
ρ^B(x | x_{1:t}) ∝ Ñ_t(x) if x ∈ Y_t, and γ̂_t w_t(x) otherwise,   (2)

where c_{t,·} denote the stored counts at time t, and Σ_{x∈X} ρ^B(x | ·) = 1. Our Budget SAD is designed to overestimate the probability of frequent symbols (via the (t − z_t) renormalization); the definition of Ñ_t(x) ensures that symbols in Y_t are never deemed less likely than unseen symbols. The process is summarized in Algorithm 2.

The z term keeps track of discarded counts; this is to correct for two types of counting errors. First, the counter c_{t,I(x)} for a symbol x ∈ Y_t only equals N_t(x) if 1) it was never discarded and 2) never removed from the reservoir; in general, c_{t,I(x)} is an underestimate for N_t(x). We also need to compensate for our lack of knowledge of |A_t| in computing γ̂_t, the probability mass associated with unseen symbols. Effectively, z_t mitigates both issues by rectifying the stored counts so as to approximate the SAD denominator. In particular, when Ñ(x) = N̂(x) for all x ∈ Y_t, we have

Σ_{x∈Y_t} Ñ_t(x) + γ̂_t = t + γ(t, Y_t).

One may also increment z_t to keep track of counts that are discarded when a symbol leaves the reservoir; however, this does not affect our redundancy bound.

In addition to these counting errors, the Budget SAD suffers a redundancy cost from forgetting symbols. Clearly, this problem arises only when symbols are removed from the reservoir. As we now show, the Budget SAD achieves low redundancy with respect to the optimal memory-bounded model (Section 3) whenever the source's tail distribution is close to uniform.

4 Analysis

Let µ be a memoryless stationary source and let x_{1:n} ∼ µ. We compare our Budget SAD estimator ρ^B to the optimal memory-bounded model ρ*_µ ∈ M described in Section 3. In our analysis we consider the associated optimal set Z* of size K*, and derive a redundancy bound which depends on the probability mass µ(Z*) as well as the reservoir size K > K*. For clarity, we take the SAD probability distribution w_t(·) to be uniform over the alphabet X, noting that our result extends to the more general case.

We take a conservative approach and bound the expected redundancy E[F_n(ρ^B, ρ*_µ)] according to the probability of a best-case scenario. In this best-case scenario, the symbols in Z* are never ejected from the reservoir. We define a good event G_n(x) := {c_{n,I_n(x)} = N_n(x)} which occurs when the stored count for x matches the total number of occurrences of x in x_{1:n}; this implies that c_{t,I_t(x)} = N_t(x) for all t ∈ [n]. The instantaneous redundancy of our algorithm at time t then depends on the probability of G_n(x_t). Critically, we set K so that G_n(x) must occur with probability 1 − δ for all x ∈ Z*.

For B ⊆ X and a memoryless stationary source µ, define

υ(B, µ) := min_{x∈B} µ(x).

We first provide a lemma stating the conditions under which G_n(B) := ∩_{x∈B} G_n(x) occurs with high probability.

Lemma 3. Let B ⊆ X, δ ∈ (0, 1), and K ≥ υ(B, µ)^{−1} log(|B| n δ^{−1}). Then

Pr{G_n(B)} ≥ 1 − δ.

Proof (sketch). The event G_n(x) occurs if x never leaves the K-concise summary after entering it. Because K-distinct reservoir sampling induces a permutation of x_{1:n}, we can view the corresponding K-distinct sample as generated from independent draws without replacement from µ. Computing the probability that a symbol x ∈ B is not drawn in K tries and taking union bounds yields the result.

Our main theoretical result follows from this lemma:

Theorem 2. Let δ = o((log |X| + n log n)^{−1}) and K ≥ υ(Z*, µ)^{−1} log(|Z*| n δ^{−1}). Let Z_AUG ⊆ X of size K maximize µ(·), and let κ := K (1 − µ(Z_AUG))^{−1}. The expected redundancy of ρ^B with respect to ρ*_µ is bounded as

E[F_n(ρ^B, ρ*_µ)] ≤ µ(Z*) ((|Z*| − 1) / 2) log n + (1 − µ(Z*)) κ log n + O(κ log(κ n) log log n),

where the expectation is with respect to both x_{1:n} ∼ µ and r_{1:n}, the random integers chosen by Algorithm 1.

Proof (sketch). We consider a partition of the event space at each time step:

1. x_t ∈ Z* and G_n(Z*),
2. x_t ∈ Z* and ¬G_n(Z*), or
3. x_t ∉ Z*.

In the first case, the Budget SAD correctly estimates the probability of x_t, and exhibits similar regret guarantees to the SAD. The second case contributes negligible expected redundancy, since Pr{¬G_n(Z*)} is small. The final case gives rise to most of the expected redundancy, which we bound by providing bounds on z_t (which lets us estimate the probability of symbols not in the K-concise summary), which itself depends on a bound on D_{t,K+1}, the expected gap between the first and (K+1)-th symbols in our summary. Let r_{1:n} be the random draws of r within K-distinct reservoir sampling. We show that

E_{x_{1:n}, r_{1:n}}[D_{t,K+1}] ≤ κ − 1,

which provides the last two terms of our bound.
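To make the estimator analyzed above concrete, here is a minimal sketch of the Budget SAD predict-observe-update cycle. It builds on the KDistinctReservoir sketch given earlier, takes w_t uniform over the alphabet as in Section 4, and its names and bookkeeping are illustrative rather than a transcription of Algorithm 2.

```python
import math

class BudgetSAD:
    """Sketch of the Budget SAD: a K-distinct reservoir decides which symbols
    keep counts; everything else is lumped into the escape mass.
    `reservoir` is any object exposing observe(x) and sample() -> list of
    symbols, e.g. the KDistinctReservoir sketch above."""

    def __init__(self, alphabet_size, reservoir):
        self.alphabet_size = alphabet_size
        self.reservoir = reservoir
        self.counts = {}   # c_i: occurrences since the symbol last entered the reservoir
        self.z = 0         # z_t: number of discarded observations
        self.t = 0

    def _gamma_hat(self):
        kept = len(self.counts)
        if self.t == 0:
            gamma = 1.0
        elif kept == 0 or kept >= self.alphabet_size:
            gamma = 0.0
        else:
            gamma = kept / math.log((self.t + 1) / kept)
        return gamma + self.z                     # gamma_hat_t = gamma(t, Y_t) + z_t

    def prob(self, x):
        gamma_hat = self._gamma_hat()
        w = 1.0 / self.alphabet_size              # w_t taken uniform over X
        total_c = sum(self.counts.values())
        scores = {}
        for y, c in self.counts.items():
            n_hat = (self.t - self.z) * c / total_c       # N_hat_t(y)
            scores[y] = max(n_hat, gamma_hat * w)         # N_tilde_t(y)
        norm = sum(scores.values()) + gamma_hat * w * (self.alphabet_size - len(scores))
        return scores[x] / norm if x in scores else gamma_hat * w / norm

    def update(self, x):
        self.t += 1
        self.reservoir.observe(x)
        kept = set(self.reservoir.sample())
        if x in kept:
            self.counts[x] = self.counts.get(x, 0) + 1
        else:
            self.z += 1                           # discard x_t
        # Counts of symbols that left the reservoir are dropped; optionally,
        # their mass could also be added to z (this does not affect the bound).
        self.counts = {y: c for y, c in self.counts.items() if y in kept}
```

In use, one alternates prob(x_t) for the prediction loss and update(x_t) once x_t is revealed, mirroring Algorithm 2's predict-observe-update loop.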

Remark 1. The expected redundancy of ρ^B with respect to ρ*_µ is also bounded as

E[F_n(ρ^B, ρ*_µ)] ≤ O(n log n),

and in particular the bound of Theorem 2 is meaningful whenever κ log n = o(n).

The three terms in Theorem 2 correspond respectively to 1) the cost of learning µ over Z*, 2) estimating the quantity 1 − µ(Z*), and 3) the redundancy for symbols which enter, but do not remain in, our reservoir. In the presence of a favourable distribution (i.e., when (1 − µ(Z*)) / (1 − µ(Z_AUG)) is small), our algorithm is asymptotically optimal, in the sense that its per-step redundancy goes to 0 as n → ∞. This is the case, for example, when the source is in the class M of memory-bounded models (Equation 1) and |X| ≪ n.

As a final note, Lu and Lu [2011] recently provided an Ω(n) lower bound on the regret suffered by a bounded-memory, sequential prediction algorithm in the 0–1 loss setting. We point out that our analysis does not contradict their bound since, in the worst case, the second term may grow superlinearly (see Remark 1 and Section 6 below). In this case, one may (trivially) construct a Bayesian mixture between ρ^B and a uniform estimator to achieve O(n log |X|) regret.

5 Empirical Evaluation

The Budget SAD is especially motivated by the desire to minimize memory usage in the presence of infrequent symbols. Here we present two experiments showing the practical value of our approach.

5.1 Synthetic Image Domain

We first consider a synthetic image domain to demonstrate the effectiveness of our algorithm in the presence of symbol noise. In this domain, the source first selects one of m images uniformly at random, then corrupts this image with probability p and otherwise leaves it unchanged.
Here we consider 50 × 60, 8-bit images and a uniform noise corruption model (Figure 2; images from [Wikipedia, 2015]). Thus, while the source emits symbols from an alphabet of size 256^{50×60}, only m of these symbols occur with non-negligible frequency.

Figure 2: Sample from the synthetic image source with m = 4 and p = 0.9. Symbols are corrupted using uniform pixel noise.

We implemented the Budget SAD with a time-dependent reservoir size K_t to match the regret bound of Theorem 2: from a parameter R ∈ R^+, we set K_t = ⌈R log(t + 1)⌉. We trained both SAD and Budget SAD on one million symbols generated with m = 4 and p ∈ {0.1, 0.9}. We performed experiments with R ∈ [1.0, 2.0] (when p = 0.1) and R ∈ [0.5, 10.0] (when p = 0.9), with the results of each Budget SAD experiment averaged across 100 trials.

To study the behaviour of our algorithm, we map each value of R to its corresponding end-of-sequence reservoir size and report the algorithm's redundancy as a function of this value. Results for both values of p are shown in Figure 3. The Budget SAD achieves equal or lower redundancy using a small fraction of the SAD's memory budget (0.01% for both p = 0.1 and p = 0.9). Note that the Budget SAD even performs slightly better than the SAD in this instance, as it assigns a higher probability mass to unseen symbols. We conclude that the Budget SAD has the ability to significantly reduce memory usage.

Figure 3: Redundancy (bits per symbol) for the Budget SAD as a function of maximum reservoir size (memory budget K), for p = 0.1 and p = 0.9, compared against the full SAD (which stores K = 99,895 and K = 899,828 distinct symbols, respectively) and the source entropy. Images are encoded as 32-bit integers (i.e., log_2 |X| = 32).
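For concreteness, here is a minimal sketch of the synthetic source and of the reservoir-size schedule K_t = ⌈R log(t + 1)⌉ described above. The encoding of images as tuples, the reading of the noise model as whole-image replacement, and all names are assumptions of this sketch rather than details taken from the original experiment.

```python
import math
import random

def make_image_source(base_images, p, rng=random):
    """Synthetic source (sketch): pick one of the m base images uniformly at
    random, then with probability p replace it with uniform 8-bit pixel noise.
    Images are encoded as tuples so that symbols are hashable."""
    n_pixels = len(base_images[0])
    def sample_symbol():
        if rng.random() < p:
            return tuple(rng.randrange(256) for _ in range(n_pixels))
        return rng.choice(base_images)
    return sample_symbol

def reservoir_size(t, R):
    """Time-dependent budget K_t = ceil(R * log(t + 1)) used by the Budget SAD."""
    return math.ceil(R * math.log(t + 1))

# Example: m = 4 base "images" of 50 x 60 8-bit pixels, corruption probability p = 0.9.
base = [tuple(random.randrange(256) for _ in range(50 * 60)) for _ in range(4)]
source = make_image_source(base, p=0.9)
x = source()                        # one symbol emitted by the source
K_t = reservoir_size(t=10**6, R=2.0)
```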

5.2 Branch Pruning in Context Tree Weighting

Our analysis so far has focused on memoryless sources and their models. A common method of augmenting memoryless models is to embed them into a tree structure in order to guarantee low redundancy with respect to the class of k-Markov sources. One such method, Context Tree Weighting [Willems et al., 1995], uses the generalized distributive law [Aji and McEliece, 2000] to efficiently perform Bayesian model averaging over tree structures.

At the core of CTW is a growing |X|-ary context tree; each new symbol creates up to D new nodes, where D is a depth parameter. Each node corresponds to a context (a subsequence of x_{1:n}) and maintains a statistical estimator (here, the SAD). Here we consider a modified CTW in which each node uses a K-distinct reservoir to store its children: removing an item from the reservoir prunes a whole subtree.

To understand the need for branch pruning, consider the Wikipedia Clean Text data set [Mahoney, 2011], which we use as our sequential prediction task here. This data set is composed of 713M lowercase characters (a–z and whitespace; |X| = 27). Within are 105M unique contexts of length at most D = 9. This number grows linearly over time, making a memory-bounded solution highly desirable. However, each context sees, on average, two or fewer symbols; as a result, using the Budget SAD at each node cannot significantly reduce memory usage. Branch pruning, on the other hand, is relatively effective.

For our experiment we used the more efficient CTW relative, Context Tree Switching [CTS; Veness et al., 2012], and derived our parameters from its authors' open-source implementation. As with the synthetic experiment, we considered values in the range R ∈ [1.0, 3.0]. For each node u, its reservoir size was increased as R log(t_u + 1), with t_u the number of visits to this node. We set the depth parameter to D = 9 as larger values of D did not affect redundancy, and averaged each experiment across 10 trials.

As illustrated in Figure 4, our algorithm, through the R parameter, can smoothly trade off memory usage and redundancy (with respect to an unpruned CTS). For example, for a small redundancy (≤ 5%) we observed a 44.6% average reduction in memory usage (R ≈ 1.76, not illustrated). Furthermore, there was minimal variance between trials. While memory usage still remains linear in n, we believe this result is particularly significant given the Zipfian behaviour of natural language data, which in our analysis corresponds to a large κ term.

Figure 4: Left. Redundancy for the Budget SAD and SAD estimators (bits per character versus characters processed, in millions, for R = 1.0, 1.5, 2.0 and for no pruning). Right. Memory usage for the Budget SAD (as a percentage of CTS, for the same values of R).

6 Discussion

In our setting, the optimal model ρ*_µ assigns a uniform probability over the set X \ Z*. Interestingly enough, the main source of redundancy in the Budget SAD is the κ log n cost for approximating this uniform probability. In particular, for some distributions (e.g. µ(·) ∝ 2^{−i} for i ∈ N) our analysis suggests an unpleasant O(n log n) redundancy, i.e., possibly worse-than-random performance. This cost stems from our overestimation of symbols in Z*; it seems probable that an alternative approach, which instead overestimates 1 − µ(Z*), exists.

On the other hand, with well-behaved data the Budget SAD provides significant memory savings. This is the case, for example, when the tail of µ is approximately uniform. Most count-based frequency estimators [Friedman and Singer, 1999; Teh, 2006; Veness and Hutter, 2012; Hutter, 2013] model this behaviour by setting aside a lump probability mass ε for all unobserved symbols; our z_t term serves a similar purpose.

Surprisingly, K-distinct reservoir sampling is unnecessary if the source µ is truly memoryless and stationary. In fact, in this case the first K distinct symbols in x_{1:n} are just as good an approximation to Z*! The benefit of our approach is to (partially) inoculate the budget estimator against adversarial empirical sources; a similar role is played by the Dirichlet-multinomial prior in estimating counts.

Orlitsky et al. [2003] proposed two methods for achieving o(n) per-symbol redundancy on open alphabets, where new symbols appear at a O(n) rate. As an alternative to the z_t term, in this setting it may be sufficient to accurately estimate |A_t|, the size of the alphabet at time t; work on distinct value queries suggests this is possible [Gibbons, 2001]. However, Orlitsky et al.'s method requires that we instead estimate the prevalence, or frequency distribution, of symbols. Whether this can be done with bounded memory is not known to us.

7 Conclusion

In this paper we presented a novel algorithm, K-distinct reservoir sampling, which incrementally maintains a sample of frequently occurring symbols within a data stream. We described how to combine K-distinct reservoir sampling with a simple count-based estimator and derived a corresponding redundancy bound. As our empirical results show, our algorithm can significantly reduce the memory requirements of both memoryless and tree-based estimators when dealing with long data sequences.

8 Acknowledgements

The author wishes to thank Joel Veness for uncountably many discussions on the topic of count-based estimation, Laurent Orseau for providing thorough editorial comments, Alvin Chua and Michael Bowling for helpful discussion, and finally the anonymous reviewers for their superb feedback.

References

[Aji and McEliece, 2000] Srinivas M. Aji and Robert J. McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 46(2):325–343, 2000.

[Bellemare et al., 2013] Marc G. Bellemare, Joel Veness, and Michael Bowling. Bayesian learning of recursively factored environments. In Proceedings of the Thirtieth International Conference on Machine Learning, 2013.

[Cavallanti et al., 2007] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Tracking the best hyperplane with a simple budget perceptron. Machine Learning, 69(2-3):143–167, 2007.

[Charikar et al., 2004] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theoretical Computer Science, 312(1):3–15, 2004.

[Cleary and Witten, 1984] John G. Cleary and Ian Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396–402, 1984.

[Cover and Thomas, 1991] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.

[Dekel et al., 2008] Ofer Dekel, Shai Shalev-Shwartz, and Yoram Singer. The Forgetron: A kernel-based perceptron on a budget. SIAM Journal on Computing, 37(5):1342–1372, 2008.

[Demaine et al., 2002] Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. Frequency estimation of internet packet streams with limited space. In Proceedings of the Tenth Annual European Symposium on Algorithms, pages 348–360, 2002.

[Farias et al., 2010] Vivek F. Farias, Ciamac C. Moallemi, Benjamin Van Roy, and Tsachy Weissman. Universal reinforcement learning. IEEE Transactions on Information Theory, 56(5):2441–2454, 2010.

[Friedman and Singer, 1999] Nir Friedman and Yoram Singer. Efficient Bayesian parameter estimation in large discrete domains. Advances in Neural Information Processing Systems 11, 11:417, 1999.

[Gale and Church, 1990] William Gale and Kenneth W. Church. Poor estimates of context are worse than none. In Proceedings of the Workshop on Speech and Natural Language, pages 283–287, 1990.

[Gibbons and Matias, 1998] Phillip B. Gibbons and Yossi Matias. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the ACM SIGMOD, volume 27, pages 331–342, 1998.

[Gibbons, 2001] Phillip B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proceedings of the International Conference on Very Large Databases, 2001.

[Hutter, 2013] Marcus Hutter. Sparse adaptive Dirichlet-multinomial-like processes. In COLT, 2013.

[Katz, 1987] Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401, 1987.

[Knuth, 1981] Donald E. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms, 2nd Edition. Addison-Wesley, 1981.

[Krichevsky and Trofimov, 1981] R. Krichevsky and V. Trofimov. The performance of universal coding. IEEE Transactions on Information Theory, 27:199–207, 1981.

[Lu and Lu, 2011] Chi-Jen Lu and Wei-Fu Lu. Making online decisions with bounded memory. In Algorithmic Learning Theory, pages 249–261, 2011.

[Mahoney, 2011] Matt Mahoney. Large text compression benchmark, 2011. [Online; accessed 01-February-2015].

[Manku and Motwani, 2002] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Databases, pages 346–357, 2002.

[Metwally et al., 2005] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the International Conference on Database Theory, pages 398–412, 2005.

[Orabona et al., 2008] Francesco Orabona, Joseph Keshet, and Barbara Caputo. The Projectron: a bounded kernel-based perceptron. In Proceedings of the 25th International Conference on Machine Learning, pages 720–727. ACM, 2008.

[Orlitsky et al., 2003] Alon Orlitsky, Narayana P. Santhanam, and Junan Zhang. Always Good Turing: Asymptotically optimal probability estimation. In Proceedings of the 44th Annual Symposium on Foundations of Computer Science, 2003.

[Teh, 2006] Yee Whye Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics, pages 985–992, 2006.

[Tjalkens et al., 1993] Tj. J. Tjalkens, Y. M. Shtarkov, and F. M. J. Willems. Context tree weighting: Multi-alphabet sources. In 14th Symposium on Information Theory in the Benelux, pages 128–135, 1993.

[Tziortziotis et al., 2014] Nikolaos Tziortziotis, Christos Dimitrakakis, and Konstantinos Blekas. Cover tree Bayesian reinforcement learning. The Journal of Machine Learning Research, 15(1):2313–2335, 2014.

[Veness and Hutter, 2012] Joel Veness and Marcus Hutter. Sparse sequential Dirichlet coding. ArXiv e-prints, 2012.

[Veness et al., 2012] Joel Veness, Kee Siong Ng, Marcus Hutter, and Michael H. Bowling. Context tree switching. In Data Compression Conference (DCC), pages 327–336, 2012.

[Wikipedia, 2015] Wikipedia. Campbell's soup cans, 2015. [Online; accessed 01-February-2015].

[Willems et al., 1995] Frans M. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. The context tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41:653–664, 1995.
