Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

Count-Based Frequency Estimation With Bounded Memory

Marc G. Bellemare
Google DeepMind, London, United Kingdom
[email protected]

Abstract

Count-based estimators are a fundamental building block of a number of powerful sequential prediction algorithms, including Context Tree Weighting and Prediction by Partial Matching. Keeping exact counts, however, typically results in a high memory overhead. In particular, when dealing with large alphabets the memory requirements of count-based estimators often become prohibitive. In this paper we propose three novel ideas for approximating count-based estimators using bounded memory. Our first contribution, of independent interest, is an extension of reservoir sampling for sampling distinct symbols from a stream of unknown length, which we call K-distinct reservoir sampling. We combine this sampling scheme with a state-of-the-art count-based estimator for memoryless sources, the Sparse Adaptive Dirichlet (SAD) estimator. The resulting algorithm, the Budget SAD, naturally guarantees a limit on its memory usage. We finally demonstrate the broader use of K-distinct reservoir sampling in nonparametric estimation by using it to restrict the branching factor of the Context Tree Weighting algorithm. We demonstrate the usefulness of our algorithms with empirical results on two sequential, large-alphabet prediction problems.

1 Introduction

Undoubtedly, counting is the simplest way of estimating the relative frequency of symbols in a data sequence. In the sequential prediction setting, these estimates are used to assign to each symbol a probability proportionate to its frequency. When dealing with binary alphabets the counting approach gives rise to the Krichevsky-Trofimov estimator, known to be minimax optimal [Krichevsky and Trofimov, 1981]. Count-based approaches are also useful in dealing with large alphabets, and a number of methods coexist [Katz, 1987; Tjalkens et al., 1993; Friedman and Singer, 1999; Hutter, 2013], each with its own statistical assumptions and regret properties. While these simple estimators deal only with memoryless sources – whose output is context invariant – they play a critical role as building blocks for more complicated estimators [Cleary and Witten, 1984; Willems et al., 1995; Tziortziotis et al., 2014].

We are interested in the large alphabet setting as it naturally lends itself to the compression of language data, where symbols consist of letters, phonemes, words or even sentence fragments. This setting is also of practical interest in reinforcement learning, for example in learning dynamical models for planning [Farias et al., 2010; Bellemare et al., 2013]. In both contexts, the statistical efficiency of count-based estimators makes them excellent modelling candidates. However, the memory demands of such estimators typically increase linearly with the effective alphabet size M, which is problematic when M grows with the sequence length. Perhaps surprisingly, this issue is more than just a theoretical concern and often occurs in language data [Gale and Church, 1990].

Our work is also motivated by the desire to improve the Context Tree Weighting algorithm [CTW; Willems et al., 1995] to model k-Markov large alphabet sources. While CTW is a powerful modelling technique, its practical memory requirements (linear in the size of the sequence) often preclude its use in large problems. Here, even when the alphabet size is fixed and small, the number of observed k-order contexts typically grows with the sequence length. We are particularly interested in reducing the long term memory usage of such an estimator, for example when estimating attributes of internet packets going through a router or natural language models trained on nonstationary data.

Our solution takes its inspiration from the reservoir sampling algorithm [Knuth, 1981] and the work of Dekel et al. [2008] on budget perceptrons. We propose an online, randomized algorithm for sampling distinct symbols from a data stream; by choosing a sublinear reservoir size, we effectively force our statistical estimator to forget infrequent symbols. We show that this algorithm can easily be combined with a typical count-based estimator, the Sparse Adaptive Dirichlet estimator, to guarantee good statistical estimation while avoiding unbounded memory usage. As our results demonstrate, the resulting estimator is best suited to sources that emit a few frequent symbols while also producing a large pool of low-frequency "noise". We further describe a simple application of our sampling algorithm to branch pruning within the CTW algorithm.

The idea of sampling distinct symbols from a stream is well-studied in the literature. Perhaps most related to ours is the work of Gibbons and Matias [1998], who proposed the notion of a concise sample to more efficiently store a "hot list" of frequent elements. Their algorithm is also based on reservoir sampling, but relies on a user-defined threshold and probabilistic counting. By contrast, our algorithm maintains a reservoir based on a random permutation of the data stream, which avoids the need for a threshold. Gibbons [2001] proposed an algorithm called Distinct Sampling to estimate the number of distinct values within a data stream. Similar to the hot list algorithm, Distinct Sampling employs a growing threshold, which is used to determine when infrequent symbols should be discarded. Metwally et al. [2005] proposed to track frequent symbols using a linked list and a hash table. Their Space Saving algorithm aims to find all frequently occurring items (heavy hitters), rather than provide a frequency-biased sample; a similar scheme was also proposed by Demaine et al. [2002]. Manku and Motwani [2002] proposed a probabilistic counting and halving approach to find all items whose frequency is above a user-defined threshold. Charikar et al. [2004] showed how to use a probabilistic count sketch in order to find frequent items in a data stream. While count sketches may seem the natural fit for count-based estimators, their memory requirements grow quadratically with the accuracy parameter; by contrast, our algorithm has only a linear dependency on the implied K.

2 Background

In the sequential prediction setting we consider, an algorithm must make probabilistic predictions over a sequence of symbols observed one at a time. These symbols belong to a finite alphabet X, with x_{1:n} := x_1 ... x_n ∈ X^n denoting a string drawn from this alphabet. A probabilistic prediction consists in assigning a probability to a new symbol x_{n+1} given that x_{1:n} has been observed.

Let [n] := {1, ..., n}. We denote the set of finite strings by X* := ∪_{i=0}^∞ X^i, denote the concatenation of strings s and r by sr, and use ε to represent the empty string. For a set B ⊆ X, we write x_{1:n} \ B := (x_i : i ∈ [n], x_i ∉ B) to denote the substring produced by excising from x_{1:n} the symbols in B. We denote by N_n(x) := N(x, x_{1:n}) the number of occurrences of x ∈ X within x_{1:n}, and write τ_n(x) := τ(x, x_{1:n}) for the time of first occurrence of x in x_{1:n}, with τ(x, x_{1:n}) := n + 1 whenever x ∉ x_{1:n}. We denote the set of permutations of x_{1:n} by P(x_{1:n}) and the discrete uniform distribution over X by U_X(·). Finally, unless otherwise specified, we write log to mean the natural logarithm.

A coding distribution ρ is a sequence (ρ_t : t ∈ N) of probability distributions ρ_t : X^t → [0, 1], each respecting

1. Σ_{x∈X} ρ_{t+1}(x_{1:t} x) = ρ_t(x_{1:t}), and
2. ρ_0(ε) = 1.

In what follows the subscript to ρ_t is always implied from its argument; we therefore omit it. A coding distribution assigns conditional probabilities according to

ρ(x | x_{1:t}) := ρ(x_{1:t} x) / ρ(x_{1:t}),

which after rearrangement yields the familiar chain rule ρ(x_{1:n}) := ∏_{t=1}^n ρ(x_t | x_{1:t−1}).

A source is a distinguished type of coding distribution which we assume generates sequences. When a coding distribution is not a source, we call it a model. A source (µ_t : t ∈ N) is said to be memoryless when µ_t(x | x_{1:t}) = µ_t(x | ε) =: µ_t(x) for all t, all x ∈ X, and all sequences x_{1:t} ∈ X^t; for B ⊆ X, we then write µ_t(B) := Σ_{x∈B} µ_t(x). When µ_t(x) := µ_1(x) for all t, we say it is stationary and omit its subscript.

Let x_{1:n} ∈ X^n and let µ be an unknown coding distribution, for example the source that generates x_{1:n}. We define the redundancy of a model ρ with respect to µ as

F_n(ρ, µ) := F(ρ, µ, x_{1:n}) := −log [ρ(x_{1:n}) / µ(x_{1:n})].

The redundancy F_n(ρ, µ) can be interpreted as the excess bits (or nats) required to encode x_{1:n} using ρ rather than µ. The expectation of this redundancy with respect to a random sequence x_{1:n} is the Kullback-Leibler divergence KL_n(µ ∥ ρ):

KL_n(µ ∥ ρ) := E_{x_{1:n} ∼ µ} [F_n(ρ, µ)].

Typically, we are interested in how well ρ compares to a class of coding distributions M. The redundancy of ρ with respect to this class is defined as

F_n(ρ, M) := max_{ρ' ∈ M} F_n(ρ, ρ').

Intuitively, F_n(ρ, M) measures the cost of encoding x_{1:n} with ρ rather than the best model in M.

2.1 The Sparse Adaptive Dirichlet Estimator

The Sparse Adaptive Dirichlet estimator [SAD; Hutter, 2013] is a count-based frequency estimator designed for large-alphabet sources with sparse support. Informally, the SAD predicts according to empirical frequencies in x_{1:n}, while reserving some escape probability γ for unseen symbols. Given A_t := {x_i : i ∈ [t]}, the set of symbols seen up to time t, and a probability distribution w_t over X \ A_t, the SAD estimator predicts according to

ρ^SAD(x | x_{1:t}) := N_t(x) / (t + γ_t) if x ∈ A_t, and γ_t w_t(x) / (t + γ_t) otherwise,

γ_t := γ(t, A_t) := 1 if t = 0; 0 if A_t = X; |A_t| / log((t + 1) / |A_t|) otherwise.

Let M^SAD be the class of memoryless sources generating symbols from a subalphabet A ⊆ X of size M. The redundancy of the SAD estimator with respect to this class is bounded as

F_n(ρ^SAD, M^SAD) ≤ ((M − 1) / 2) log n + O(M log log(n / M)),

making it particularly apt at dealing with sparse-support sources.
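To make the prediction rule concrete, here is a minimal sketch of an SAD-style predictor with w_t taken uniform over the unseen symbols; the class name and interface are illustrative assumptions of this sketch, not the reference implementation of Hutter [2013].

```python
import math
from collections import Counter

class SADPredictor:
    """Minimal sketch of a Sparse Adaptive Dirichlet (SAD) style predictor.

    Assumes a finite alphabet of known size and takes w_t uniform over the
    unseen symbols; names and interface are illustrative only.
    """

    def __init__(self, alphabet_size):
        self.alphabet_size = alphabet_size
        self.counts = Counter()   # N_t(x): occurrences of each observed symbol
        self.t = 0                # number of symbols observed so far

    def _gamma(self):
        # Escape mass gamma_t = |A_t| / log((t+1)/|A_t|), with the edge cases
        # gamma_0 = 1 and gamma_t = 0 once every alphabet symbol has been seen.
        seen = len(self.counts)
        if self.t == 0:
            return 1.0
        if seen == self.alphabet_size:
            return 0.0
        return seen / math.log((self.t + 1) / seen)

    def prob(self, x):
        gamma = self._gamma()
        denom = self.t + gamma
        if x in self.counts:
            return self.counts[x] / denom
        unseen = self.alphabet_size - len(self.counts)
        return (gamma / denom) * (1.0 / unseen)   # gamma_t * w_t(x) / (t + gamma_t)

    def update(self, x):
        self.counts[x] += 1
        self.t += 1
```

In sequential use, one calls prob(x) to score the next symbol and update(x) once it has been observed; the predicted probabilities sum to one at every step.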

3 Budget Sequential Estimation

The SAD estimator requires counting the occurrence of all encountered symbols. This is usually undesirable when the observed alphabet grows with the sequence length. Here we propose a sampling solution to this problem; this solution, K-distinct reservoir sampling, lets us impose a memory constraint on our estimator. More broadly, our sampling scheme can be combined with any algorithm with a non-parametric component; as an illustration, in Section 5.2 we apply our solution to restrict the branching factor of the Context Tree Weighting algorithm.

Let µ be an unknown source. We are given a parameter K ∈ N and seek a count-based estimator that predicts x_{1:n} ∼ µ sequentially without using more than O(K) memory. We call this problem the budget, sequential, count-based prediction setting. While similar settings have been studied extensively in the context of linear classifiers [Dekel et al., 2008; Cavallanti et al., 2007; Orabona et al., 2008] and in relation to the 0–1 loss [Lu and Lu, 2011], to the best of our knowledge the notion of a budget count-based frequency estimator is novel. From a practical perspective, we are interested in prediction problems for which the alphabet is large but the source skewed towards a few high-frequency symbols; it is for these problems that we expect meaningful prediction to be achievable.

Our competitor class M is the set of budget multinomial distributions. More specifically, we consider models of the form ρ(x ; Z, θ, θ_0), where Z := {x^{i_1}, ..., x^{i_K}} ⊆ X is a subalphabet of cardinality K, θ ∈ R^K, and θ_0 ∈ R. Each such model predicts according to

ρ(x ; Z, θ, θ_0) := θ_j if x = x^{i_j} ∈ Z, and θ_0 / |X \ Z| otherwise,   (1)

with the usual requirement that Σ_{x∈X} ρ(x ; Z, θ, θ_0) = 1.

Let µ be a memoryless stationary source and for B ⊆ X let

µ_B(x) := 0 if x ∉ B, and µ(x) / µ(B) otherwise.

Finally, arrange X := {x^1, x^2, ..., x^m} in order of decreasing probability, i.e. such that for all i, µ(x^i) ≥ µ(x^{i+1}).

Lemma 1. The model ρ*_µ ∈ M which minimizes KL_1(µ ∥ ·) is given by

Z* = argmin_{Z : |Z| = K} µ(X \ Z) KL_1(µ_{X\Z} ∥ U_{X\Z}),
θ*(x) = µ(x) for all x ∈ Z*,   θ*_0 = µ(X \ Z*).

Proof (sketch). We first show that for a fixed Z, the optimal parameters correspond to those given above. We then use the convexity property of the KL divergence, namely that for λ ∈ [0, 1] and pairs of probability distributions (p_1, q_1), (p_2, q_2),

KL_1(λ p_1 + (1−λ) p_2 ∥ λ q_1 + (1−λ) q_2) ≤ λ KL_1(p_1 ∥ q_1) + (1−λ) KL_1(p_2 ∥ q_2),

with equality when (p_1, q_1) and (p_2, q_2) have disjoint support. The result follows by setting p_1 = q_1 = µ_Z, and taking p_2, q_2 to be respectively µ_{X\Z} and ρ*_{µ, X\Z} = U_{X\Z}.

The full proof is provided in the supplemental. The next result shows that the optimal Z* is composed of a combination of the most and least frequent symbols according to µ.

Lemma 2. Let Z* be the subalphabet associated with the model ρ*_µ. Then Z* = H ∪ L, where H, L ⊆ X are disjoint sets such that

• for all x ∈ H, y ∈ X \ H, µ(x) ≥ µ(y), and
• for all x ∈ L, y ∈ X \ L, µ(x) ≤ µ(y).

Proof (sketch). The proof follows by using the identity

KL(X ∥ U_X) = log |X| − H(X)

for a discrete random variable X distributed over X, with H(X) its entropy [Cover and Thomas, 1991]. We then show that, for Z = {x^i}, the function µ(X \ Z) KL_1(µ_{X\Z} ∥ U_{X\Z}) is concave in i. The result is then extended to the general case Z = {x^{i_1}, ..., x^{i_K}}.

When µ is known, ρ*_µ can be constructed in time O(K) by considering the different choices of H and L. Furthermore, the set Ẑ containing the K most frequent symbols achieves an expected redundancy at most 1 + (K / (e|X|)) log K greater than the expected redundancy of ρ*_µ (Lemma 3, supplemental); in our experiments and for large alphabets this redundancy is negligible. In the sequel we therefore assume that Z* = Ẑ.

However, we are interested in the setting where µ is unknown. In this case, our budget requirement poses a circular problem: we need good estimates of µ to find Z*, and simultaneously need Z* to estimate µ (with bounded memory). Our solution is to maintain a time-varying set which contains (with high probability) the K most frequent symbols. We call our algorithm K-distinct reservoir sampling, in relation to the reservoir sampling algorithm [Knuth, 1981]. As we shall see in Section 3.2, we can leverage K-distinct reservoir sampling to approximate the SAD estimator by estimating the frequency of symbols in this set. We call the resulting approximation the Budget Sparse Adaptive Dirichlet estimator, or Budget SAD for short.

3.1 K-Distinct Reservoir Sampling

Let x_{1:n} ∈ X^n be a string of a priori unknown length and let K ∈ N. Reservoir sampling [Algorithm R; Knuth, 1981] is an algorithm for sampling y := y_1 ... y_K ∈ X^K from x_{1:n}, without replacement and using O(K) memory. It proceeds as follows:

1. Randomly assign x_1 ... x_K to y_1 ... y_K.
2. For t = K + 1, ..., n, draw r ∼ U({1, ..., t}). If
   (a) r ≤ K, then assign x_t to y_r;
   (b) otherwise do nothing.
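For reference, here is a minimal sketch of Algorithm R written as a streaming function; the function name and the stream interface are conveniences of this sketch rather than part of Knuth's presentation.

```python
import random

def reservoir_sample(stream, K, rng=random):
    """Algorithm R (sketch): keep a uniform sample of K stream positions'
    values, possibly with duplicate symbols, in O(K) memory."""
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= K:
            reservoir.append(x)        # step 1: fill the reservoir with x_1 ... x_K
        else:
            r = rng.randint(1, t)      # step 2: r ~ U({1, ..., t})
            if r <= K:
                reservoir[r - 1] = x   # step 2(a): assign x_t to y_r
            # step 2(b): otherwise do nothing
    return reservoir
```

Because the reservoir stores the values at K sampled positions, the returned list may contain duplicate symbols, which is precisely the limitation addressed next.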

We would like to use reservoir sampling to sample an approximation to Z*. However, Algorithm R is inadequate for our purposes: since y may contain duplicates, sampling K distinct items may require significant memory. Our solution is to modify Algorithm R to compactly represent duplicates. We begin with the following observation:

Observation 1. Let x̃_{1:n} be a random permutation of x_{1:n}, that is x̃_{1:n} ∼ U(P(x_{1:n})). Then y := x̃_1 ... x̃_K is a sample of size K sampled without replacement from x_{1:n}.

We can therefore interpret Algorithm R as maintaining the first K elements of a random permutation of x_{1:n}. By analogy, we define the K-distinct operator, which describes the first K distinct elements of a string:

Definition 1. Let w_{1:m} ∈ X^m. The K-distinct operator φ_K : X^m → ∪_{i=0}^K X^i is recursively defined as

φ_K(w_{1:m}) := ε if K = 0 or m = 0, and w_1 φ_{K−1}(w_{1:m} \ {w_1}) otherwise.

The next step is to use this operator to define a K-distinct sample of x_{1:n}. From here onwards, we assume without loss of generality that sequences contain at least K distinct symbols.

Definition 2. Let Y_n := Y(x_{1:n}) be a random variable taking values in X^K, and let x̃_{1:n} ∼ U(P(x_{1:n})). Then Y_n is a K-distinct sample of x_{1:n} if it is distributed as φ_K(x̃_{1:n}).

Definition 3. Let w_{1:m} ∈ X^m with y_1 ... y_{K+1} := φ_{K+1}(w_{1:m}). The K-concise summary of w_{1:m} is a vector of K pairs S := ⟨(y_i, d_i) ∈ X × N⟩ such that, for all i ≤ K,

1. d_i = τ(y_{i+1}, w_{1:m}) − τ(y_i, w_{1:m}), or equivalently
2. D_i := Σ_{j=1}^{i−1} d_j = τ(y_i, w_{1:m}) − 1 for i ∈ [K + 1].

Each pair (y_i, d_i) in the K-concise summary of a string thus describes this string's i-th distinct symbol and the gap d_i between this distinct symbol and the next (Figure 1, bottom row).

Figure 1: Insertion of 'C' at r = 3 into a 3-concise summary. The permutation x̃_{1:n} = ABBACBADAB, with summary ⟨(A,1), (B,3), (C,3)⟩, becomes x̃_{1:n+1} = ABBCACBADAB, with summary ⟨(A,1), (B,2), (C,5)⟩.

For x ∈ X, r ∈ [D_{K+1}], and S := ⟨(y_i, d_i) : i ∈ [K]⟩, define the vector and set of stored symbols

y(S) := ⟨y_i : i ∈ [K]⟩,   Y(S) := {y_i : i ∈ [K]},

as well as the indices I : Y(S) → [K] and J : [D_{K+1}] → [K + 1]:

I(x) := i ∈ [K] such that y_i = x,
J(r) := min{i ∈ [K + 1] : r ≤ D_i}.

In K-distinct reservoir sampling (Algorithm 1), I(x) and J(r) are used to indicate, respectively, the index of x in S and the index in S at which a new symbol should be inserted. For example, in Figure 1 (left), I('B') = 2 and J(3) = 3.

Algorithm 1 K-Distinct Reservoir Sampling.
Initially: S = ⟨⟩

SAMPLE(K, x_{1:n})
  for t = 1 ... n do
    Observe x_t
    Draw r ∼ U({0, ..., t − 1})
    if r > D_{K+1} then do nothing
    else if x_t ∉ Y(S) then INSERT(S, x_t, r)
    else if I(x_t) ≥ J(r) then
      d_{J(r)−1} ← r − D_{J(r)−1}
      if |S| > K then remove item K + 1 from S
      MERGE(S, x_t)

MERGE(S, x_t)
  d_{I(x_t)−1} ← d_{I(x_t)−1} + d_{I(x_t)}
  Remove item I(x_t) from S

Theorem 1. When Algorithm 1 terminates, its summary S is the K-concise summary of a permutation x̃_{1:n} sampled uniformly at random from P(x_{1:n}).

Proof (sketch). Let r_{1:n} be the string of random integers generated by Algorithm 1. These induce a sequence (x̃_{1:t} : t ∈ [n]) of permutations of the prefixes x_{1:t}. To show the correctness of Algorithm 1, we simply show that its operations mirror this sequence of permutations and incrementally generate the sequence of K-concise summaries of x̃_{1:t}. The different cases handled by Algorithm 1 then arise from different kinds of insertions.

Corollary 1. Algorithm 1 outputs a K-distinct sample of x_{1:n}.

Note here the significance of Theorem 1: we can simulate, in a fully online fashion and in O(K) memory and running time, the sampling of a permutation x̃_{1:n} ∼ U(P(x_{1:n})) and the subsequent selection of its first K distinct symbols. Interestingly enough, the use of random insertions, rather than swaps (as is done in Algorithm R), seems necessary to guarantee the correct distribution on y(S).
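The following is a minimal sketch of the sampler just described. Rather than transcribing Algorithm 1, it maintains the same information as the K-concise summary by tracking the first occurrences of the first K + 1 distinct symbols of the simulated permutation; the class name and methods are illustrative assumptions.

```python
import random

class KDistinctReservoir:
    """Sketch of K-distinct reservoir sampling: simulate a uniformly random
    permutation of the stream by inserting each new element at a random
    position, and track the first occurrences of its first K + 1 distinct
    symbols (the information held by a K-concise summary)."""

    def __init__(self, K, rng=random):
        self.K = K
        self.rng = rng
        self.t = 0
        self.entries = []   # [position, symbol] pairs: tracked first occurrences

    def observe(self, x):
        self.t += 1
        r = self.rng.randint(1, self.t)   # x becomes the r-th element of the permutation
        # Tracked first occurrences at or after r are pushed back by one slot.
        for entry in self.entries:
            if entry[0] >= r:
                entry[0] += 1
        idx = next((i for i, e in enumerate(self.entries) if e[1] == x), None)
        if idx is not None:
            # x is already tracked; it may now first occur earlier than before.
            if r < self.entries[idx][0]:
                self.entries[idx][0] = r
        else:
            max_pos = max(e[0] for e in self.entries) if self.entries else 0
            if len(self.entries) <= self.K or r < max_pos:
                # x is now among the first K + 1 distinct symbols of the permutation.
                self.entries.append([r, x])
                if len(self.entries) > self.K + 1:
                    self.entries.remove(max(self.entries))   # drop the displaced symbol
        self.entries.sort()

    def sample(self):
        """Current K-distinct sample: the first K distinct symbols of the permutation."""
        return [symbol for _, symbol in self.entries[:self.K]]

    def gaps(self):
        """Gaps between consecutive tracked first occurrences (cf. Definition 3)."""
        positions = [p for p, _ in self.entries]
        return [b - a for a, b in zip(positions, positions[1:])]
```

Each observe() call costs O(K) time, the state never exceeds K + 1 entries, and sample() returns the current K-distinct sample at any point in the stream.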

3.2 The Budget SAD Estimator

Naturally, we wish to use K-distinct reservoir sampling to approximate the set Z* of frequently occurring symbols in x_{1:n}. However, the d_i variables do not estimate the frequencies of their respective symbols. To estimate these frequencies, our Budget Sparse Adaptive Dirichlet estimator augments a K-concise summary with count information. The new summary is a vector of tuples

U := ⟨(y_i, d_i, c_i) ∈ X × N × N : i ∈ [K]⟩,

where c_i represents the number of occurrences of y_i since its last entering the summary.

Algorithm 2 Budget SAD.
Initially: U = ⟨⟩, z = 0

for t = 1 ... n do
  Predict x_t according to Equation 2
  Observe x_t
  Update U as per Algorithm 1
  if x_t is new in U then c_{I(x_t)} ← 1
  else if x_t ∈ Y(U) then c_{I(x_t)} ← c_{I(x_t)} + 1
  else z ← z + 1   // discard x_t

New symbols are added to U with c = 1; then, whenever a symbol already present in U is observed, its count is incremented. Recall that the SAD prediction depends on N_t(x), the number of occurrences of x in x_{1:t}, as well as the subalphabet A_t (Section 2.1). Our process yields Y_t := Y(U_t) as an approximation to A_t, as well as an approximate count function N̂_t : X → R, with N̂_t(x) = 0 for x ∉ Y_t. The algorithm also tracks z_t, the number of discarded symbols, and approximates the SAD prediction rule as

N̂_t(x) := (t − z_t) c_{t,I_t(x)} / Σ_{x'∈Y_t} c_{t,I_t(x')}   for x ∈ Y_t,
γ̂_t := γ(t, Y_t) + z_t,
Ñ_t(x) := max{N̂_t(x), γ̂_t w_t(x)},
ρ^B(x | x_{1:t}) ∝ Ñ_t(x) if x ∈ Y_t, and γ̂_t w_t(x) otherwise,   (2)

where c_{t,·} denote the stored counts at time t, and Σ_{x∈X} ρ^B(x | ·) = 1. Our Budget SAD is designed to overestimate the probability of frequent symbols (via the (t − z_t) renormalization); the definition of Ñ_t(x) ensures that symbols in Y_t are never deemed less likely than unseen symbols. The process is summarized in Algorithm 2.

The z term keeps track of discarded counts; this is to correct for two types of counting errors. First, the counter c_{t,I(x)} for a symbol x ∈ Y_t only equals N_t(x) if 1) it was never discarded and 2) never removed from the reservoir; in general, c_{t,I(x)} is an underestimate for N_t(x). We also need to compensate for our lack of knowledge of |A_t| in computing γ̂_t, the probability mass associated with unseen symbols. Effectively, z_t mitigates both issues by rectifying the stored counts so as to approximate the SAD denominator. In particular, when Ñ(x) = N̂(x) for all x ∈ Y_t, we have

Σ_{x∈Y_t} Ñ_t(x) + γ̂_t = t + γ(t, Y_t).

One may also increment z_t to keep track of counts that are discarded when a symbol leaves the reservoir; however, this does not affect our redundancy bound.

In addition to these counting errors, the Budget SAD suffers a redundancy cost from forgetting symbols. Clearly, this problem arises only when symbols are removed from the reservoir. As we now show, the Budget SAD achieves low redundancy with respect to the optimal memory-bounded model (Section 3) whenever the source's tail distribution is close to uniform.

4 Analysis

Let µ be a memoryless stationary source and let x_{1:n} ∼ µ. We compare our Budget SAD estimator ρ^B to the optimal memory-bounded model ρ*_µ ∈ M described in Section 3. In our analysis we consider the associated optimal set Z* of size K*, and derive a redundancy bound which depends on the probability mass µ(Z*) as well as the reservoir size K > K*. For clarity, we take the SAD probability distribution w_t(·) to be uniform over the alphabet X, noting that our result extends to the more general case.

We take a conservative approach and bound the expected redundancy E[F_n(ρ^B, ρ*_µ)] according to the probability of a best-case scenario. In this best-case scenario, the symbols in Z* are never ejected from the reservoir. We define a good event G_n(x) := {c_{n,I_n(x)} = N_n(x)} which occurs when the stored count for x matches the total number of occurrences of x in x_{1:n}; this implies that c_{t,I_t(x)} = N_t(x) for all t ∈ [n]. The instantaneous redundancy of our algorithm at time t then depends on the probability of G_n(x_t). Critically, we set K so that G_n(x) must occur with probability 1 − δ for all x ∈ Z*.

For B ⊆ X and a memoryless stationary source µ, define

υ(B, µ) := min_{x∈B} µ(x).

We first provide a lemma stating the conditions under which G_n(B) := ∩_{x∈B} G_n(x) occurs with high probability.

Lemma 3. Let B ⊆ X, δ ∈ (0, 1), and K ≥ υ(B, µ)^{−1} log(|B| n δ^{−1}). Then

Pr{G_n(B)} ≥ 1 − δ.

Proof (sketch). The event G_n(x) occurs if x never leaves the K-concise summary after entering it. Because K-distinct reservoir sampling induces a permutation of x_{1:n}, we can view the corresponding K-distinct sample as generated from independent draws without replacement from µ. Computing the probability that a symbol x ∈ B is not drawn in K tries and taking union bounds yields the result.

Our main theoretical result follows from this lemma:

Theorem 2. Let δ = o((log |X| + n log n)^{−1}) and K ≥ υ(Z*, µ)^{−1} log(|Z*| n δ^{−1}). Let Z_AUG ⊆ X of size K maximize µ(·), and let κ := K (1 − µ(Z_AUG))^{−1}. The expected redundancy of ρ^B with respect to ρ*_µ is bounded as

E[F_n(ρ^B, ρ*_µ)] ≤ µ(Z*) ((|Z*| − 1) / 2) log n + (1 − µ(Z*)) κ log n + O(κ log(κ n) log log n),

where the expectation is with respect to both x_{1:n} ∼ µ and r_{1:n}, the random integers chosen by Algorithm 1.

Proof (sketch). We consider a partition of the event space at each time step:

1. x_t ∈ Z* and G_n(Z*),
2. x_t ∈ Z* and ¬G_n(Z*), or
3. x_t ∉ Z*.

In the first case, the Budget SAD correctly estimates the probability of x_t, and exhibits similar regret guarantees to the SAD. The second case contributes negligible expected redundancy, since Pr{¬G_n(Z*)} is small. The final case gives rise to most of the expected redundancy, which we bound by providing bounds on z_t (which lets us estimate the probability of symbols not in the K-concise summary), which itself depends on a bound on D_{t,K+1}, the expected gap between the first and (K+1)-th symbols in our summary. Let r_{1:n} be the random draws of r within K-distinct reservoir sampling. We show that

E_{x_{1:n}, r_{1:n}}[D_{t,K+1}] ≤ κ − 1,

which provides the last two terms of our bound.
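To make the estimator analyzed above concrete, here is a minimal sketch of the Budget SAD predict-observe-update cycle. It builds on the KDistinctReservoir sketch given earlier, takes w_t uniform over the alphabet as in Section 4, and its names and bookkeeping are illustrative rather than a transcription of Algorithm 2.

```python
import math

class BudgetSAD:
    """Sketch of the Budget SAD: a K-distinct reservoir decides which symbols
    keep counts; everything else is lumped into the escape mass.
    `reservoir` is any object exposing observe(x) and sample() -> list of
    symbols, e.g. the KDistinctReservoir sketch above."""

    def __init__(self, alphabet_size, reservoir):
        self.alphabet_size = alphabet_size
        self.reservoir = reservoir
        self.counts = {}   # c_i: occurrences since the symbol last entered the reservoir
        self.z = 0         # z_t: number of discarded observations
        self.t = 0

    def _gamma_hat(self):
        kept = len(self.counts)
        if self.t == 0:
            gamma = 1.0
        elif kept == 0 or kept >= self.alphabet_size:
            gamma = 0.0
        else:
            gamma = kept / math.log((self.t + 1) / kept)
        return gamma + self.z                     # gamma_hat_t = gamma(t, Y_t) + z_t

    def prob(self, x):
        gamma_hat = self._gamma_hat()
        w = 1.0 / self.alphabet_size              # w_t taken uniform over X
        total_c = sum(self.counts.values())
        scores = {}
        for y, c in self.counts.items():
            n_hat = (self.t - self.z) * c / total_c       # N_hat_t(y)
            scores[y] = max(n_hat, gamma_hat * w)         # N_tilde_t(y)
        norm = sum(scores.values()) + gamma_hat * w * (self.alphabet_size - len(scores))
        return scores[x] / norm if x in scores else gamma_hat * w / norm

    def update(self, x):
        self.t += 1
        self.reservoir.observe(x)
        kept = set(self.reservoir.sample())
        if x in kept:
            self.counts[x] = self.counts.get(x, 0) + 1
        else:
            self.z += 1                           # discard x_t
        # Counts of symbols that left the reservoir are dropped; optionally,
        # their mass could also be added to z (this does not affect the bound).
        self.counts = {y: c for y, c in self.counts.items() if y in kept}
```

In use, one alternates prob(x_t) for the prediction loss and update(x_t) once x_t is revealed, mirroring Algorithm 2's predict-observe-update loop.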

Remark 1. The expected redundancy of ρ^B with respect to ρ*_µ is also bounded as

E[F_n(ρ^B, ρ*_µ)] ≤ O(n log n),

and in particular the bound of Theorem 2 is meaningful whenever κ log n = o(n).

The three terms in Theorem 2 correspond respectively to 1) the cost of learning µ over Z*, 2) estimating the quantity 1 − µ(Z*), and 3) the redundancy for symbols which enter, but do not remain in, our reservoir. In the presence of a favourable distribution (i.e., when (1 − µ(Z*)) / (1 − µ(Z_AUG)) is small), our algorithm is asymptotically optimal, in the sense that its per-step redundancy goes to 0 as n → ∞. This is the case, for example, when the source is in the class M of memory-bounded models (Equation 1) and |X| ≪ n.

As a final note, Lu and Lu [2011] recently provided an Ω(n) lower bound on the regret suffered by a bounded-memory, sequential prediction algorithm in the 0–1 loss setting. We point out that our analysis does not contradict their bound since, in the worst case, the second term may grow superlinearly (see Remark 1 and Section 6 below). In this case, one may (trivially) construct a Bayesian mixture between ρ^B and a uniform estimator to achieve O(n log |X|) regret.

5 Empirical Evaluation

The Budget SAD is especially motivated by the desire to minimize memory usage in the presence of infrequent symbols. Here we present two experiments showing the practical value of our approach.

5.1 Synthetic Image Domain

We first consider a synthetic image domain to demonstrate the effectiveness of our algorithm in the presence of symbol noise. In this domain, the source first selects one of m images uniformly at random, then corrupts this image with probability p and otherwise leaves it unchanged.
Here we consider 50 × 60, 8-bit images and a uniform noise corruption model (Figure 2; images from [Wikipedia, 2015]). Thus, while the source emits symbols from an alphabet of size 256^{50×60}, only m of these symbols occur with non-negligible frequency.

Figure 2: Sample from the synthetic image source with m = 4 and p = 0.9. Symbols are corrupted using uniform pixel noise.

We implemented the Budget SAD with a time-dependent reservoir size K_t to match the regret bound of Theorem 2: from a parameter R ∈ R^+, we set K_t = ⌈R log(t + 1)⌉. We trained both SAD and Budget SAD on one million symbols generated with m = 4 and p ∈ {0.1, 0.9}. We performed experiments with R ∈ [1.0, 2.0] (when p = 0.1) and R ∈ [0.5, 10.0] (when p = 0.9), with the results of each Budget SAD experiment averaged across 100 trials.

To study the behaviour of our algorithm, we map each value of R to its corresponding end-of-sequence reservoir size and report the algorithm's redundancy as a function of this value. Results for both values of p are shown in Figure 3. The Budget SAD achieves equal or lower redundancy using a small fraction of the SAD's memory budget (0.01% for both p = 0.1 and p = 0.9). Note that the Budget SAD even performs slightly better than the SAD in this instance, as it assigns a higher probability mass to unseen symbols. We conclude that the Budget SAD has the ability to significantly reduce memory usage.

Figure 3: Redundancy (bits per symbol) for the Budget SAD as a function of maximum reservoir size (memory budget K), for p = 0.1 and p = 0.9, compared against the full SAD (which stores K = 99,895 and K = 899,828 distinct symbols, respectively) and the source entropy. Images are encoded as 32-bit integers (i.e., log_2 |X| = 32).
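For concreteness, here is a minimal sketch of the synthetic source and of the reservoir-size schedule K_t = ⌈R log(t + 1)⌉ described above. The encoding of images as tuples, the reading of the noise model as whole-image replacement, and all names are assumptions of this sketch rather than details taken from the original experiment.

```python
import math
import random

def make_image_source(base_images, p, rng=random):
    """Synthetic source (sketch): pick one of the m base images uniformly at
    random, then with probability p replace it with uniform 8-bit pixel noise.
    Images are encoded as tuples so that symbols are hashable."""
    n_pixels = len(base_images[0])
    def sample_symbol():
        if rng.random() < p:
            return tuple(rng.randrange(256) for _ in range(n_pixels))
        return rng.choice(base_images)
    return sample_symbol

def reservoir_size(t, R):
    """Time-dependent budget K_t = ceil(R * log(t + 1)) used by the Budget SAD."""
    return math.ceil(R * math.log(t + 1))

# Example: m = 4 base "images" of 50 x 60 8-bit pixels, corruption probability p = 0.9.
base = [tuple(random.randrange(256) for _ in range(50 * 60)) for _ in range(4)]
source = make_image_source(base, p=0.9)
x = source()                        # one symbol emitted by the source
K_t = reservoir_size(t=10**6, R=2.0)
```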

5.2 Branch Pruning in Context Tree Weighting

Our analysis so far has focused on memoryless sources and their models. A common method of augmenting memoryless models is to embed them into a tree structure in order to guarantee low redundancy with respect to the class of k-Markov sources. One such method, Context Tree Weighting [Willems et al., 1995], uses the generalized distributive law [Aji and McEliece, 2000] to efficiently perform Bayesian model averaging over tree structures.

At the core of CTW is a growing |X|-ary context tree; each new symbol creates up to D new nodes, where D is a depth parameter. Each node corresponds to a context (a subsequence of x_{1:n}) and maintains a statistical estimator (here, the SAD). Here we consider a modified CTW in which each node uses a K-distinct reservoir to store its children: removing an item from the reservoir prunes a whole subtree.

To understand the need for branch pruning, consider the Wikipedia Clean Text data set [Mahoney, 2011], which we use as our sequential prediction task here. This data set is composed of 713M lowercase characters (a–z and whitespace; |X| = 27). Within are 105M unique contexts of length at most D = 9. This number grows linearly over time, making a memory-bounded solution highly desirable. However, each context sees, on average, two or fewer symbols; as a result, using the Budget SAD at each node cannot significantly reduce memory usage. Branch pruning, on the other hand, is relatively effective.

For our experiment we used the more efficient CTW relative, Context Tree Switching [CTS; Veness et al., 2012], and derived our parameters from its authors' open-source implementation. As with the synthetic experiment, we considered values in the range R ∈ [1.0, 3.0]. For each node u, its reservoir size was increased as R log(t_u + 1), with t_u the number of visits to this node. We set the depth parameter to D = 9 as larger values of D did not affect redundancy, and averaged each experiment across 10 trials.

As illustrated in Figure 4, our algorithm, through the R parameter, can smoothly trade off memory usage and redundancy (with respect to an unpruned CTS). For example, for a small redundancy (≤ 5%) we observed a 44.6% average reduction in memory usage (R ≈ 1.76, not illustrated). Furthermore, there was minimal variance between trials. While memory usage still remains linear in n, we believe this result is particularly significant given the Zipfian behaviour of natural language data, which in our analysis corresponds to a large κ term.

Figure 4: Left. Redundancy for the Budget SAD and SAD estimators (bits per character versus characters processed, in millions, for R = 1.0, 1.5, 2.0 and for no pruning). Right. Memory usage for the Budget SAD (as a percentage of CTS, for the same values of R).

6 Discussion

In our setting, the optimal model ρ*_µ assigns a uniform probability over the set X \ Z*. Interestingly enough, the main source of redundancy in the Budget SAD is the κ log n cost for approximating this uniform probability. In particular, for some distributions (e.g. µ(·) ∝ 2^{−i} for i ∈ N) our analysis suggests an unpleasant O(n log n) redundancy, i.e., possibly worse-than-random performance. This cost stems from our overestimation of symbols in Z*; it seems probable that an alternative approach, which instead overestimates 1 − µ(Z*), exists.

On the other hand, with well-behaved data the Budget SAD provides significant memory savings. This is the case, for example, when the tail of µ is approximately uniform. Most count-based frequency estimators [Friedman and Singer, 1999; Teh, 2006; Veness and Hutter, 2012; Hutter, 2013] model this behaviour by setting aside a lump probability mass ε for all unobserved symbols; our z_t term serves a similar purpose.

Surprisingly, K-distinct reservoir sampling is unnecessary if the source µ is truly memoryless and stationary. In fact, in this case the first K distinct symbols in x_{1:n} are just as good an approximation to Z*! The benefit of our approach is to (partially) inoculate the budget estimator against adversarial empirical sources; a similar role is played by the Dirichlet-multinomial prior in estimating counts.

Orlitsky et al. [2003] proposed two methods for achieving o(n) per-symbol redundancy on open alphabets, where new symbols appear at a O(n) rate. As an alternative to the z_t term, in this setting it may be sufficient to accurately estimate |A_t|, the size of the alphabet at time t; work on distinct value queries suggests this is possible [Gibbons, 2001]. However, Orlitsky et al.'s method requires that we instead estimate the prevalence, or frequency distribution, of symbols. Whether this can be done with bounded memory is not known to us.

7 Conclusion

In this paper we presented a novel algorithm, K-distinct reservoir sampling, which incrementally maintains a sample of frequently occurring symbols within a data stream. We described how to combine K-distinct reservoir sampling with a simple count-based estimator and derived a corresponding redundancy bound. As our empirical results show, our algorithm can significantly reduce the memory requirements of both memoryless and tree-based estimators when dealing with long data sequences.

8 Acknowledgements

The author wishes to thank Joel Veness for uncountably many discussions on the topic of count-based estimation, Laurent Orseau for providing thorough editorial comments, Alvin Chua and Michael Bowling for helpful discussion, and finally the anonymous reviewers for their superb feedback.

References

[Aji and McEliece, 2000] Srinivas M. Aji and Robert J. McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 46(2):325–343, 2000.

[Bellemare et al., 2013] Marc G. Bellemare, Joel Veness, and Michael Bowling. Bayesian learning of recursively factored environments. In Proceedings of the Thirtieth International Conference on Machine Learning, 2013.

[Cavallanti et al., 2007] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Tracking the best hyperplane with a simple budget perceptron. Machine Learning, 69(2-3):143–167, 2007.

[Charikar et al., 2004] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theoretical Computer Science, 312(1):3–15, 2004.

[Cleary and Witten, 1984] John G. Cleary and Ian Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396–402, 1984.

[Cover and Thomas, 1991] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.

[Dekel et al., 2008] Ofer Dekel, Shai Shalev-Shwartz, and Yoram Singer. The Forgetron: A kernel-based perceptron on a budget. SIAM Journal on Computing, 37(5):1342–1372, 2008.

[Demaine et al., 2002] Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. Frequency estimation of internet packet streams with limited space. In Proceedings of the Tenth Annual European Symposium on Algorithms, pages 348–360, 2002.

[Farias et al., 2010] Vivek F. Farias, Ciamac C. Moallemi, Benjamin Van Roy, and Tsachy Weissman. Universal reinforcement learning. IEEE Transactions on Information Theory, 56(5):2441–2454, 2010.

[Friedman and Singer, 1999] Nir Friedman and Yoram Singer. Efficient Bayesian parameter estimation in large discrete domains. Advances in Neural Information Processing Systems 11, 11:417, 1999.

[Gale and Church, 1990] William Gale and Kenneth W. Church. Poor estimates of context are worse than none. In Proceedings of the Workshop on Speech and Natural Language, pages 283–287, 1990.

[Gibbons and Matias, 1998] Phillip B. Gibbons and Yossi Matias. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the ACM SIGMOD, volume 27, pages 331–342, 1998.

[Gibbons, 2001] Phillip B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proceedings of the International Conference on Very Large Databases, 2001.

[Hutter, 2013] Marcus Hutter. Sparse adaptive Dirichlet-multinomial-like processes. In COLT, 2013.

[Katz, 1987] Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401, 1987.

[Knuth, 1981] Donald E. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms, 2nd Edition. Addison-Wesley, 1981.

[Krichevsky and Trofimov, 1981] R. Krichevsky and V. Trofimov. The performance of universal coding. IEEE Transactions on Information Theory, 27:199–207, 1981.

[Lu and Lu, 2011] Chi-Jen Lu and Wei-Fu Lu. Making online decisions with bounded memory. In Algorithmic Learning Theory, pages 249–261, 2011.

[Mahoney, 2011] Matt Mahoney. Large text compression benchmark, 2011. [Online; accessed 01-February-2015].

[Manku and Motwani, 2002] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Databases, pages 346–357, 2002.

[Metwally et al., 2005] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the International Conference on Database Theory, pages 398–412, 2005.

[Orabona et al., 2008] Francesco Orabona, Joseph Keshet, and Barbara Caputo. The Projectron: a bounded kernel-based perceptron. In Proceedings of the 25th International Conference on Machine Learning, pages 720–727. ACM, 2008.

[Orlitsky et al., 2003] Alon Orlitsky, Narayana P. Santhanam, and Junan Zhang. Always Good Turing: Asymptotically optimal probability estimation. In Proceedings of the 44th Annual Symposium on Foundations of Computer Science, 2003.

[Teh, 2006] Yee Whye Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics, pages 985–992, 2006.

[Tjalkens et al., 1993] Tj. J. Tjalkens, Y. M. Shtarkov, and F. M. J. Willems. Context tree weighting: Multi-alphabet sources. In 14th Symposium on Information Theory in the Benelux, pages 128–135, 1993.

[Tziortziotis et al., 2014] Nikolaos Tziortziotis, Christos Dimitrakakis, and Konstantinos Blekas. Cover tree Bayesian reinforcement learning. The Journal of Machine Learning Research, 15(1):2313–2335, 2014.

[Veness and Hutter, 2012] Joel Veness and Marcus Hutter. Sparse sequential Dirichlet coding. ArXiv e-prints, 2012.

[Veness et al., 2012] Joel Veness, Kee Siong Ng, Marcus Hutter, and Michael H. Bowling. Context tree switching. In Data Compression Conference (DCC), pages 327–336, 2012.

[Wikipedia, 2015] Wikipedia. Campbell's soup cans, 2015. [Online; accessed 01-February-2015].

[Willems et al., 1995] Frans M. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. The context tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41:653–664, 1995.
