Smoothed Bloom filter language models: Tera-Scale LMs on the Cheap

David Talbot and Miles Osborne
School of Informatics, University of Edinburgh
2 Buccleuch Place, Edinburgh, EH8 9LW, UK
[email protected], [email protected]

Abstract

A Bloom filter (BF) is a randomised data structure for membership queries. Its space requirements fall significantly below lossless information-theoretic lower bounds but it produces false positives with some quantifiable probability. Here we present a general framework for deriving smoothed language model probabilities from BFs. We investigate how a BF containing n-gram statistics can be used as a direct replacement for a conventional n-gram model. Recent work has demonstrated that corpus statistics can be stored efficiently within a BF; here we consider how smoothed language model probabilities can be derived efficiently from this randomised representation. Our proposal takes advantage of the one-sided error guarantees of the BF and simple inequalities that hold between related n-gram statistics in order to further reduce the BF storage requirements and the error rate of the derived probabilities. We use these models as replacements for a conventional language model in machine translation experiments.

1 Introduction

Language modelling (LM) is a crucial component in statistical machine translation (SMT). Standard n-gram language models assign probabilities to translation hypotheses in the target language, typically as smoothed trigram models (Chiang, 2005). Although it is well known that higher-order language models and models trained on additional monolingual corpora can significantly improve translation performance, deploying such language models is not trivial. Increasing the order of an n-gram model can result in an exponential increase in the number of parameters; for the English Gigaword corpus, for instance, there are 300 million distinct trigrams and over 1.2 billion distinct five-grams. Since a language model is potentially queried millions of times per sentence, it should ideally reside locally in memory to avoid time-consuming remote or disk-based look-ups.

Against this background, we consider a radically different approach to language modelling. Instead of explicitly storing all distinct n-grams from our corpus, we create an implicit randomised representation of these statistics. This allows us to drastically reduce the space requirements of our models. In this paper, we build on recent work (Talbot and Osborne, 2007) that demonstrated how the Bloom filter (Bloom (1970); BF), a space-efficient randomised data structure for representing sets, could be used to store corpus statistics efficiently. Here, we propose a framework for deriving smoothed n-gram models from such structures and show via machine translation experiments that these smoothed Bloom filter language models may be used as direct replacements for standard n-gram models in SMT.

The space requirements of a Bloom filter are quite spectacular, falling significantly below information-theoretic error-free lower bounds. This efficiency, however, comes at the price of false positives: the filter may erroneously report that an item not in the set is a member. False negatives, on the other hand, will never occur: the error is said to be one-sided.

Our framework makes use of the log-frequency Bloom filter presented in (Talbot and Osborne, 2007), described briefly below, to compute smoothed conditional n-gram probabilities on the fly. It takes advantage of the one-sided error guarantees of the Bloom filter and certain inequalities that hold between related n-gram statistics drawn from the same corpus to reduce both the error rate and the computation required in deriving these probabilities.

2 The Bloom filter

In this section, we give a brief overview of the Bloom filter (BF); refer to Broder and Mitzenmacher (2005) for a more detailed presentation. A BF represents a set S = {x_1, x_2, ..., x_n} with n elements drawn from a universe U of size N. The structure is attractive when N ≫ n. The only significant storage used by a BF consists of a bit array of size m. This is initially set to hold zeroes. To train the filter we hash each item in the set k times using distinct hash functions h_1, h_2, ..., h_k. The functions are assumed to be independent of each other and to map items in the universe to the range 1 to m uniformly at random. The k bits indexed by the hash values for each item are set to 1; the item is then discarded. Once a bit has been set to 1 it remains set for the lifetime of the filter. Distinct items may not be hashed to k distinct locations in the filter; we ignore collisions. Bits in the filter can, therefore, be shared by distinct items, allowing significant space savings but introducing a non-zero probability of false positives at test time. There is no way of directly retrieving or enumerating the items stored in a BF.

At test time we wish to discover whether a given item was a member of the original set. The filter is queried by hashing the test item using the same k hash functions. If all bits referenced by the k hash values are 1 then we assume that the item was a member; if any of them are 0 then we know it was not. True members are always correctly identified, but a false positive will occur if all k corresponding bits were set by other items during training and the item was not a member of the training set.

The probability of a false positive, f, is clearly the probability that none of the k randomly selected bits in the filter are still 0 after training. Letting p be the proportion of bits that are still zero after these n elements have been inserted, this gives,

f = (1 − p)^k.

As n items have been entered in the filter by hashing each k times, the probability that a given bit is still zero is,

p′ = (1 − 1/m)^{kn} ≈ e^{−kn/m},

which is the expected value of p. Hence the false positive rate can be approximated as,

f = (1 − p)^k ≈ (1 − p′)^k ≈ (1 − e^{−kn/m})^k.

By taking the derivative we find that the number of hash functions k* that minimizes f is,

k* = ln 2 · (m/n),

which leads to the intuitive result that exactly half the bits in the filter will be set to 1 when the optimal number of hash functions is chosen.

The fundamental difference between a Bloom filter's space requirements and that of any lossless representation of a set is that the former does not depend on the size of the (exponential) universe N from which the set is drawn. A lossless representation scheme (for example, a hash map, etc.) must depend on N since it assigns a distinct representation to each possible set drawn from the universe.
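To make these mechanics concrete, the following is a minimal Python sketch of a Bloom filter along the lines described above. It is an illustrative sketch, not the implementation used in this paper; the class and function names are ours, and deriving the k hash functions from a salted MD5 digest is simply one convenient choice.

import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter: a bit array of size m probed by k salted hash functions."""

    def __init__(self, m, k):
        self.m = m                      # number of bits
        self.k = k                      # number of hash functions
        self.bits = bytearray(m)        # one byte per bit, for simplicity

    def _indexes(self, item):
        # Derive k pseudo-independent hash values by salting a single digest.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1          # once set, a bit is never cleared

    def __contains__(self, item):
        # All k bits set -> "probably a member"; any zero bit -> definitely not a member.
        return all(self.bits[idx] for idx in self._indexes(item))

def false_positive_rate(m, n, k):
    """Approximate f = (1 - e^{-kn/m})^k for n items stored in an m-bit filter."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    """k* = ln 2 * (m / n), rounded to a whole number of hash functions."""
    return max(1, round(math.log(2) * m / n))

if __name__ == "__main__":
    n_items, m_bits = 1000, 10000        # 10 bits per stored item
    k = optimal_k(m_bits, n_items)       # ~7 hash functions
    bf = BloomFilter(m_bits, k)
    for i in range(n_items):
        bf.add(f"ngram_{i}")
    print("ngram_42" in bf)                              # True: no false negatives
    print(false_positive_rate(m_bits, n_items, k))       # ~0.008

With the optimal k, roughly half the bits end up set after training, matching the derivation above.
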

3 Language modelling with Bloom filters

Recent work (Talbot and Osborne, 2007) presented a scheme for associating static frequency information with a set of n-grams in a BF efficiently.[1]

[1] Note that, as described, the Bloom filter is not an associative data structure and provides only a Boolean function characterising the set that has been stored in it.

3.1 Log-frequency Bloom filter

The efficiency of the scheme for storing n-gram statistics within a BF presented in Talbot and Osborne (2007) relies on the Zipf-like distribution of n-gram frequencies: most events occur an extremely small number of times, while a small number are very frequent. We assume that raw counts are quantised and employ a logarithmic codebook that maps counts, c(x), to quantised counts, qc(x), as follows,

qc(x) = 1 + ⌊log_b c(x)⌋.   (1)

The precision of this codebook decays exponentially with the raw counts and the scale is determined by the base of the logarithm b; we examine the effect of this parameter on our language models in experiments below.

Given the quantised count qc(x) for an n-gram x, the filter is trained by entering composite events consisting of the n-gram appended by an integer counter j that is incremented from 1 to qc(x) into the filter. To retrieve an n-gram's frequency, the n-gram is first appended with a counter set to 1 and hashed under the k functions; if this tests positive, the counter is incremented and the process repeated. The procedure terminates as soon as any of the k hash functions hits a 0 and the previous value of the counter is reported. The one-sided error of the BF and the training scheme ensure that the actual quantised count cannot be larger than this value. As the counts are quantised logarithmically, the counter is usually incremented only a small number of times. We can then approximate the original frequency of the n-gram by taking its expected value given the quantised count retrieved,

E[c(x) | qc(x) = j] = (b^{j−1} + b^j − 1) / 2.   (2)

These training and testing routines are repeated here as Algorithms 1 and 2 respectively.

Algorithm 1 Train frequency BF
  Input: S_train, {h_1, ..., h_k} and BF = ∅
  Output: BF
  for all x ∈ S_train do
      c(x) ← frequency of n-gram x in S_train
      qc(x) ← quantisation of c(x) (Eq. 1)
      for j = 1 to qc(x) do
          for i = 1 to k do
              h_i(x) ← hash of event {x, j} under h_i
              BF[h_i(x)] ← 1
          end for
      end for
  end for
  return BF

Algorithm 2 Test frequency BF
  Input: x, MAXQCOUNT, {h_1, ..., h_k} and BF
  Output: Upper bound on c(x) ∈ S_train
  for j = 1 to MAXQCOUNT do
      for i = 1 to k do
          h_i(x) ← hash of event {x, j} under h_i
          if BF[h_i(x)] = 0 then
              return E[c(x) | qc(x) = j − 1] (Eq. 2)
          end if
      end for
  end for

As noted in Talbot and Osborne (2007), errors for this log-frequency BF scheme are one-sided: frequencies will never be underestimated. The probability of overestimating an item's frequency decays exponentially with the size of the overestimation error d (i.e. as f^d for d > 0) since each erroneous increment corresponds to a single false positive and d such independent events must occur together.

The efficiency of the log-frequency BF scheme can be understood from an entropy encoding perspective under the distribution over frequencies of n-gram types: the most common frequency (the singleton count) is assigned the shortest code (length k) while rarer frequencies (those for more common n-grams) are assigned increasingly longer codes (k × qc(x)).
3.2 Smoothed BF language models

A standard n-gram language model assigns conditional probabilities to target words given a certain context. In practice, most standard n-gram language models employ some form of interpolation whereby probabilities conditioned on the most specific context, consisting usually of the n − 1 preceding tokens, are combined with more robust estimates based on less specific conditioning events. To compute smoothed language model probabilities, we generally require access to the frequencies of n-grams of length 1 to n in our training corpus. Depending on the smoothing scheme, we may also need auxiliary statistics regarding the number of distinct suffixes for each n-gram (e.g., Witten-Bell and Kneser-Ney smoothing) and the number of distinct prefixes or contexts in which they appear (e.g., Kneser-Ney). We can use a single BF to store these statistics but need to distinguish each type of event (e.g., raw counts, suffix counts, etc.). Here we use a distinct set of k hash functions for each such category.
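One simple way to realise a distinct hash-function set per category (a sketch of ours, not necessarily the paper's implementation) is to salt every hash with a category label, so that the same n-gram queried as a raw-count event and as a suffix-count event is mapped to different positions in the shared bit array, exactly as if each category had its own hash family:

class MultiStatisticBF:
    """One Bloom filter holding several statistic types, distinguished by a category salt."""

    def __init__(self, m, k):
        self.bf = BloomFilter(m, k)

    def add(self, category, event):
        # Prefixing the category acts like using a distinct hash-function set per statistic.
        self.bf.add((category, event))

    def contains(self, category, event):
        return (category, event) in self.bf

stats = MultiStatisticBF(m=100000, k=5)
stats.add("count", (("the", "cat"), 1))     # raw-count event with counter 1
stats.add("suffix", (("the", "cat"), 1))    # distinct-suffix event with counter 1
print(stats.contains("count", (("the", "cat"), 1)))   # True
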

Our motivation for storing the corpus statistics directly rather than precomputed probabilities is twofold: (i) the efficiency of the scheme described above for storing frequency information together with items in a BF relies on the frequencies having a Zipf-like distribution; while this is definitely true for corpus statistics, it may well not hold for probabilities estimated from them; (ii) as will become apparent below, by using the corpus statistics directly, we will be able to make additional savings in terms of both space and error rate by using simple inequalities that hold for related information drawn consistently from the same corpus; it is not clear whether such bounds can be established for probabilities computed from these statistics.

3.2.1 Proxy items

There is a potential risk of redundancy if we represent related statistics using the log-frequency BF scheme presented in Talbot and Osborne (2007). In particular, we do not need to store explicitly information that is necessarily implied by the presence of another item in the training set, if that item can be identified efficiently at query time when needed. We use the term proxy item to refer to items whose presence in the filter implies the existence of another item and that can be efficiently queried given the implied item. In using a BF to store corpus statistics for language modelling, for example, we can use the event corresponding to an n-gram and the counter set to 1 as a proxy item for a distinct prefix, suffix or context count of 1 for the same n-gram, since (ignoring sentence boundaries) it must have been preceded and followed by at least one distinct type, i.e.,

qc(w_1, ..., w_n) ≥ 1 ∈ BF ⇒ s(w_1, ..., w_n) ≥ 1,

where s(·) is the number of distinct types following this n-gram in the training corpus. We show below that such lower bounds allow us to significantly reduce the memory requirements for a BF language model.

3.2.2 Monotonicity of n-gram event space

The error analysis in Section 2 focused on the false positive rate of a BF; if we deploy a BF within an SMT decoder, however, the actual error rate will also depend on the a priori membership probability of items presented to it. The error rate Err is,

Err = Pr(x ∉ S_train | Decoder) · f.

This implies that, unlike a conventional lossless data structure, the model's accuracy depends on other components in the system and how it is queried.

Assuming that statistics are entered consistently from the same corpus, we can take advantage of the monotonicity of the n-gram event space to place upper bounds on the frequencies of events to be retrieved from the filter prior to querying it, thereby reducing the a priori probability of a negative and consequently the error rate. Specifically, since the log-frequency BF scheme will never underestimate an item's frequency, we can apply the following inequality recursively and bound the frequency of an n-gram by that of its least frequent subsequence,

c(w_1, ..., w_n) ≤ min{c(w_1, ..., w_{n−1}), c(w_2, ..., w_n)}.

We use this to reduce the error rate of an interpolated BF language model described below.
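The two observations above translate directly into code. The following sketch (ours, with hypothetical helper names and an arbitrary default cap) uses the count-1 event as a proxy for a singleton suffix count, and bounds the quantised count of an n-gram by that of its two (n−1)-gram sub-sequences before probing the filter; the resulting cap plays the role of MAXQCOUNT in Algorithm 2.

def has_count_at_least_one(lm_counts, ngram):
    # Proxy item: the composite event (ngram, 1) also stands in for
    # "this n-gram has at least one distinct suffix/prefix", so singleton
    # suffix counts never need to be stored explicitly.
    return (ngram, 1) in lm_counts.bf

def bounded_qcount(lm_counts, ngram, qcount_cache, default_cap=32):
    """Retrieve a quantised count without probing past the bound implied by
    the n-gram's sub-sequences: c(w1..wn) <= min(c(w1..wn-1), c(w2..wn))."""
    if len(ngram) > 1:
        left, right = ngram[:-1], ngram[1:]
        cap = min(qcount_cache.get(left, default_cap),
                  qcount_cache.get(right, default_cap))
    else:
        cap = default_cap
    qcount = 0
    for j in range(1, cap + 1):
        if (ngram, j) not in lm_counts.bf:
            break            # one-sided error: the true quantised count is <= j - 1
        qcount = j
    qcount_cache[ngram] = qcount
    return qcount

If a sub-sequence was already found to have count zero, the cap is zero and the filter is never probed for the longer n-gram at all, which is exactly the saving the bound is meant to deliver.
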
3.3 Witten-Bell smoothed BF LM

As an example application of our framework, we now describe a scheme for creating and querying a log-frequency BF to estimate n-gram language model probabilities using Witten-Bell smoothing (Bell et al., 1990). Other smoothing schemes, notably Kneser-Ney, could be described within this framework using additional proxy relations for infix and prefix counts.

In Witten-Bell smoothing, an n-gram's probability is discounted by a factor proportional to the number of times that the (n − 1)-gram preceding the current word was observed preceding a novel type in the training corpus. It is defined recursively as,

P_wb(w_i | w_{i−n+1}^{i−1}) = λ_{w_{i−n+1}^{i−1}} P_ml(w_i | w_{i−n+1}^{i−1}) + (1 − λ_{w_{i−n+1}^{i−1}}) P_wb(w_i | w_{i−n+2}^{i−1}),   (3)

where λ_x is defined via,

1 − λ_x = c(x) / (s(x) + c(x)),

and P_ml(·) is the maximum likelihood estimator calculated from relative frequencies.

The statistics required to compute the Witten-Bell estimator for the conditional probability of an n-gram consist of the counts of all n-grams of length 1 to n, as well as the counts of the number of distinct types following all n-grams of length 1 to n − 1. In practice we use the c(w_1, ..., w_i) = 1 event as a proxy for s(w_1, ..., w_i) = 1 and thereby need not store singleton suffix counts in the filter.

Distinct suffix counts of 2 and above are stored by subtracting this proxy count and converting to the log quantisation scheme described above, i.e.,

qs(x) = 1 + ⌊log_b(s(x) − 1)⌋.

In testing for a suffix count, we first query the item c(w_1, ..., w_n) = 1 as a proxy for s(w_1, ..., w_n) = 1 and, if found, query the filter for incrementally larger suffix counts, taking the reconstructed suffix count of an n-gram with a non-zero n-gram count to be the expected value, i.e.,

E[s(x) | qs(x) = j ∩ j > 0] = 1 + (b^{j−1} + b^j − 1) / 2.

Having created a BF containing these events, the algorithm we use to compute the interpolated WB estimate makes use of the inequalities described above to reduce the a priori probability of querying for a negative. In particular, we bound the count of each numerator in the maximum likelihood term by the count of the corresponding denominator, and the count of distinct suffixes of an n-gram by its respective token frequency.

Unlike more traditional LM formulations that back off from the highest-order to lower-order models, our algorithm works up from the lowest-order model. Since the conditioning context increases in specificity at each level, each statistic is bounded from above by its corresponding value at the previous, less specific level. The bounds are applied by passing them as the parameter MAXQCOUNT to the frequency test routine shown as Algorithm 2. We analyze the effect of applying such bounds on the performance of the model within an SMT decoder in the experiments below. Working upwards from the lower-order models also allows us to truncate the computation before the highest level if the denominator in the maximum likelihood term is found with a zero count at any stage (no higher-order terms can be non-zero given this).
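The bottom-up computation can be sketched as follows. This is our own illustrative code, not the authors' implementation: the lm.count, lm.suffix_count and lm.total_tokens accessors are hypothetical (lm.count corresponds to Algorithm 2, with the bound expressed here over raw counts rather than quantised counts for readability), and for the interpolation weight we use the textbook Witten-Bell convention in which the maximum-likelihood term receives weight c/(c + s).

def witten_bell_bf(lm, ngram, vocab_size):
    """Interpolated Witten-Bell estimate computed bottom-up from a frequency BF.

    `lm` is assumed (hypothetically) to expose:
        lm.count(ngram, bound)        -> raw count retrieved a la Algorithm 2, with
                                         `bound` playing the role of MAXQCOUNT
        lm.suffix_count(ngram, bound) -> distinct-suffix count via its proxy item
        lm.total_tokens               -> number of tokens in the training corpus
    """
    p = 1.0 / vocab_size                  # base distribution below the unigram level
    n = len(ngram)
    prev_context_count = None
    for k in range(1, n + 1):             # work up from the unigram to the full n-gram
        target = ngram[n - k:]            # the k-gram ending in the predicted word
        context = target[:-1]             # its conditioning context (empty when k == 1)

        if context:
            # The previous level's context is a sub-sequence of this one, so its
            # already-retrieved count is a valid upper bound (monotonicity).
            c_context = lm.count(context, prev_context_count)
        else:
            c_context = lm.total_tokens   # unigram level: the denominator is the corpus size
        if c_context == 0:
            break                         # no higher-order term can be non-zero
        # Numerator bounded by its denominator; suffix count bounded by token frequency.
        c_target = lm.count(target, c_context)
        s_context = lm.suffix_count(context, c_context) if context else vocab_size

        lam = c_context / (c_context + s_context)   # textbook WB weight on the ML term
        p = lam * (c_target / c_context) + (1.0 - lam) * p
        prev_context_count = c_context
    return p
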
4 Experiments

We conducted a range of experiments to explore the error-space trade-off of using a BF-based model as a replacement for a conventional n-gram model within an SMT system, and to assess the benefits of specific features of our framework for deriving language model probabilities from a BF.

4.1 Experimental set-up

All of our experiments use publicly available resources. Our main experiments use the French-English section of the Europarl (EP) corpus for parallel data and language modelling (Koehn, 2003). Decoding is carried out using the Moses decoder (Koehn and Hoang, 2007). We hold out 1,000 test sentences and 500 development sentences from the parallel text for evaluation purposes. The parameters for the feature functions used in this log-linear decoder are optimised using minimum error rate (MER) training on our development set unless otherwise stated. All evaluation is in terms of the BLEU score on our test set (Papineni et al., 2002).

Our baseline language models were created using the SRILM toolkit (Stolcke, 2002). We built 3, 4 and 5-gram models from the Europarl corpus using interpolated Witten-Bell smoothing (WB); no n-grams are dropped from these models or any of the BF-LMs. The number of distinct n-gram types in these baseline models, as well as their sizes on disk and as compressed by gzip, are given in Table 1; the gzip figures are given as an approximate (and optimistic) lower bound on lossless representations of these models.[2]

The BF-LM models used in these experiments were all created from the same corpora following the scheme outlined above for storing n-gram statistics. Proxy relations were used to reduce the number of items that must be stored in the BF; in addition, unless specified otherwise, we take advantage of the bounds described above that hold between related statistics to avoid presenting known negatives to the filter. The base of the logarithm used in quantisation is specified on all figures.

The SRILM and BF-based models are both queried via the same interface in the Moses decoder.

[2] Note, in particular, that gzip-compressed files do not support the direct random access required in language modelling.

n    Types    Mem.     Gzip'd    BLEU
3    5.9M     174Mb    51Mb      28.54
4    14.1M    477Mb    129Mb     28.99
5    24.2M    924Mb    238Mb     29.07

Table 1: WB-smoothed SRILM baseline models.

Figure 1: WB-smoothed 3-gram model (Europarl); BLEU score vs. memory in GB for quantisation bases 1.1, 1.5 and 3, with the SRILM Witten-Bell 3-gram baseline (174MB).

We assign a small cache to the BF-LM models (between 1 and 2MB depending on the order of the model) to store recently retrieved statistics and derived probabilities. Translation takes between 2 and 5 times longer using the BF-LMs as compared to the corresponding SRILM models.

4.2 Machine translation experiments

Our first set of experiments examines the relationship between memory allocated to the BF-LM and translation performance for a 3-gram and a 5-gram WB-smoothed BF-LM. In these experiments we use the log-linear weights of the baseline model to avoid variation in translation performance due to differences in the solutions found by MER training: this allows us to focus solely on the quality of each BF-LM's approximation of the baseline. These experiments consider various settings of the base for the logarithm used during quantisation (b in Eq. (1)). We also analyse these results in terms of the relationships between the BLEU score and the underlying error rate of the BF-LM, and the number of bits assigned per n-gram in the baseline model. MER-optimised BLEU scores on the test set are then given for a range of BF-LMs.

4.3 Mean squared error experiments

Our second set of experiments focuses on the accuracy with which the BF-LM can reproduce the baseline model's distribution. Unfortunately, perplexity or related information-theoretic quantities are not applicable in this case since the BF-LM is not guaranteed to produce a properly normalised distribution. Instead we evaluate the mean squared error (MSE) between the log-probabilities assigned by the baseline model and by BF-LMs to n-grams in the English portion of our development set; we also consider the relation between MSE and the BLEU score from the experiments above.
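Concretely, the MSE criterion can be computed as in the short sketch below (ours; log_prob is a hypothetical accessor returning each model's log-probability for an n-gram in context):

def mean_squared_error(baseline_lm, bf_lm, ngrams):
    """MSE between the log-probabilities two models assign to the same n-grams."""
    errors = [(baseline_lm.log_prob(x) - bf_lm.log_prob(x)) ** 2 for x in ngrams]
    return sum(errors) / len(errors)
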

4.4 Analysis of BF-LM framework

Our third set of experiments evaluates the impact of the use of upper bounds between related statistics on translation performance. Here the standard model that makes use of these bounds to reduce the a priori negative probability is compared to a model that queries the filter in a memoryless fashion.[3] We then present details of the memory savings obtained by the use of proxy relations for the models used here.

[3] In both cases we apply 'sanity check' bounds to ensure that none of the ratios in the WB formula (Eq. 3) are greater than 1.

5 Results

5.1 Machine translation experiments

Figures 1 and 2 show the relationship between translation performance as measured by BLEU and the memory assigned to the BF for WB-smoothed 3-gram and 5-gram BF-LMs respectively. There is a clear degradation in translation performance as the memory assigned to the filter is reduced. Models using a higher quantisation base approach their optimal performance faster; this is because these more coarse-grained quantisation schemes store fewer items in the filter and therefore have lower underlying false positive rates for a given amount of memory.

Figure 3 presents these results in terms of the relationship between translation performance and the false positive rate of the underlying BF.

Figure 2: WB-smoothed 5-gram model (Europarl); BLEU score vs. memory in GB for quantisation bases 1.1, 1.5 and 3, with the SRILM Witten-Bell 5-gram baseline (924MB).

Figure 3: False positive rate vs. BLEU; BLEU score vs. the false positive rate (probability) of the filter for quantisation bases 1.1, 1.5 and 3.

We can see that for a given false positive rate, the more coarse-grained quantisation schemes (e.g., base 3) perform worse than the more fine-grained schemes.[4]

[4] Note that in this case the base 3 scheme will use approximately two-thirds the amount of memory required by the base 1.5 scheme.

Figure 4 presents the relationship in terms of the number of bits per n-gram in the baseline model. This suggests that between 10 and 15 bits is sufficient for the BF-LM to approximate the baseline model. This is a reduction by a factor of between 16 and 24 on the plain model and of between 4 and 7 on the gzip-compressed model.

Figure 4: Bits per n-gram vs. BLEU; BLEU score vs. bits per n-gram for quantisation bases 1.1, 1.5 and 3.

The results of a selection of BF-LM models with decoder weights optimised using MER training are given in Table 2; these show that the models perform consistently close to the baseline models that they approximate.

n    Memory    Bits / n-gram    base    BLEU
3    10MB      14 bits          1.5     28.33
3    10MB      14 bits          2.0     28.47
4    20MB      12 bits          1.5     28.63
4    20MB      12 bits          2.0     28.63
5    40MB      14 bits          1.5     28.53
5    40MB      14 bits          2.0     28.72
5    50MB      17 bits          1.5     29.31
5    50MB      17 bits          2.0     28.67

Table 2: MERT-optimised WB-smoothed BF-LMs.

5.2 Mean squared error experiments

Figure 5 shows the relationship between memory assigned to the BF-LMs and the mean squared error (MSE) of the log-probabilities that these models assign to the development set compared to those assigned by the baseline model. This shows clearly that the more fine-grained quantisation schemes (e.g., base 1.1) can reach a lower MSE, but also that the more coarse-grained schemes (e.g., base 3) approach their minimum error faster.

Figure 5: MSE between SRILM and BF-LMs; mean squared error of log-probabilities vs. memory in GB for WB 3-gram models with quantisation bases 1.1, 1.5 and 3.

Figure 6 shows the relationship between the MSE between the BF-LM and the baseline model and the BLEU score. The MSE appears to be a good predictor of BLEU score across all quantisation schemes. This suggests that it may be a useful tool for optimising BF-LM parameters without the need to run the decoder, assuming a target (lossless) LM can be built and queried for a small test set on disk. An MSE of below 0.05 appears necessary to achieve translation performance matching the baseline model here.

Figure 6: MSE vs. BLEU for WB 3-gram BF-LMs; BLEU score vs. mean squared error for quantisation bases 1.1, 1.5 and 3.

5.3 Analysis of BF-LM framework

We refer to (Talbot and Osborne, 2007) for empirical results establishing the performance of the log-frequency BF-LM: overestimation errors occur with a probability that decays exponentially in the size of the overestimation error.

Figure 7 shows the effect of applying upper bounds to reduce the a priori probability of presenting a negative event to the filter in our interpolation algorithm for computing WB-smoothed probabilities. The application of upper bounds improves translation performance, particularly when the amount of memory assigned to the filter is limited. Since both filters have the same underlying false positive rate (they are identical), we can conclude that this improvement in performance is due to a reduction in the number of negatives that are presented to the filter and hence errors.

Figure 7: Effect of upper bounds on BLEU; BLEU score vs. memory in GB for a base-2 BF-LM queried with and without bounds.

Table 3 shows the amount of memory saved by the use of proxy items to avoid storing singleton suffix counts for the Witten-Bell smoothing scheme. The savings are given as ratios over the amount of memory needed to store the statistics without proxy items. These models have the same underlying false positive rate (0.05) and quantisation base (2). Similar savings may be anticipated when applying this framework to infix and prefix counts for Kneser-Ney smoothing.

n-gram order    Proxy space saving
3               0.885
4               0.783
5               0.708

Table 3: Space savings via proxy items.

6 Related Work

Previous work aimed at reducing the size of n-gram language models has focused primarily on quantisation schemes (Whitaker and Raj, 2001) and pruning (Stolcke, 1998). The impact of the former seems limited given that storage for the n-gram types themselves will generally be far greater than that needed for the actual probabilities of the model. Pruning, on the other hand, could be used in conjunction with the framework proposed here. This holds also for compression schemes based on clustering such as (Goodman and Gao, 2000). Our approach, however, avoids the significant computational costs involved in the creation of such models.

Other schemes for dealing with large language models include per-sentence filtering of the model or its distribution over a cluster. The former requires time-consuming adaptation of the model for each sentence in the test set, while the latter incurs significant overheads for remote calls during decoding. Our framework could, however, be used to complement either of these approaches.

7 Conclusions and Future Work

We have proposed a framework for computing smoothed language model probabilities efficiently from a randomised representation of corpus statistics provided by a Bloom filter. We have demonstrated that models derived within this framework can be used as direct replacements for equivalent conventional language models with significant reductions in memory requirements. Our empirical analysis has also demonstrated that by taking advantage of the one-sided error guarantees of the BF and simple inequalities that hold between related n-gram statistics, we are able to further reduce the BF storage requirements and the effective error rate of the derived probabilities.

We are currently implementing Kneser-Ney smoothing within the proposed framework. We hope the present work will, together with Talbot and Osborne (2007), establish the Bloom filter as a practical alternative to conventional associative data structures used in computational linguistics. The framework presented here shows that, with some consideration for its workings, the randomised nature of the Bloom filter need not be a significant impediment to its use in applications.

Acknowledgements

References

T.C. Bell, J.G. Cleary, and I.H. Witten. 1990. Text Compression. Prentice Hall, Englewood Cliffs, NJ.

B. Bloom. 1970. Space/time tradeoffs in hash coding with allowable errors. CACM, 13:422–426.

A. Broder and M. Mitzenmacher. 2005. Network applications of Bloom filters: A survey. Internet Mathematics, 1(4):485–509.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 263–270, Ann Arbor, Michigan, June. Association for Computational Linguistics.

J. Goodman and J. Gao. 2000. Language model size reduction by pruning and clustering. In ICSLP’00, Beijing, China.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proc. of the 2007 Conference on Empirical Methods in Natural Language Processing (EMNLP-CoNLL).

P. Koehn. 2003. Europarl: A multilingual corpus for evaluation of machine translation, draft. Available at: http://people.csail.mit.edu/koehn/publications/europarl.ps.

K. Papineni, S. Roukos, T. Ward, and W.J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL-2002: 40th Annual Meeting of the Association for Computational Linguistics.

A. Stolcke. 1998. Entropy-based pruning of back-off language models. In Proc. DARPA Broadcast News Transcription and Understanding Workshop, pages 270–274.

A. Stolcke. 2002. SRILM – an extensible language modeling toolkit. In Proc. Intl. Conf. on Spoken Language Processing.

D. Talbot and M. Osborne. 2007. Randomised language modelling for statistical machine translation. In 45th Annual Meeting of the Association for Computational Linguistics (to appear).

E. Whitaker and B. Raj. 2001. Quantization-based language model compression (TR-2001-41). Technical report, Mitsubishi Electric Research Laboratories.
