A Comparison of Smoothing Techniques for Bilingual Lexicon Extraction from Comparable Corpora
Amir Hazem and Emmanuel Morin
Laboratoire d'Informatique de Nantes-Atlantique (LINA)
Université de Nantes, 44322 Nantes Cedex 3, France
{Amir.Hazem, Emmanuel.Morin}@univ-nantes.fr

Abstract

Smoothing is a central issue in language modeling and a prior step in different natural language processing (NLP) tasks. However, less attention has been given to it for bilingual lexicon extraction from comparable corpora. Although a first attempt to improve the extraction of low-frequency words showed significant gains using distance-based averaging (Pekar et al., 2006), no systematic investigation of the many smoothing techniques has been carried out so far. In this paper, we present a study of some widely-used smoothing algorithms for language n-gram modeling (Laplace, Good-Turing, Kneser-Ney...). Our main contribution is to investigate how the different smoothing techniques affect the performance of the standard approach (Fung, 1998) traditionally used for bilingual lexicon extraction. We show that using smoothing as a pre-processing step of the standard approach increases its performance significantly.

1 Introduction

Cooccurrences play an important role in many corpus-based approaches in the field of natural language processing (Dagan et al., 1993). They represent the observable evidence that can be distilled from a corpus and are employed for a variety of applications such as machine translation (Brown et al., 1992), information retrieval (Maarek and Smadja, 1989), word sense disambiguation (Brown et al., 1991), etc. In bilingual lexicon extraction from comparable corpora, frequency counts for word pairs often serve as a basis for distributional methods, such as the standard approach (Fung, 1998), which compares the cooccurrence profile of a given source word, a vector of association scores for its translated cooccurrences (Fano, 1961; Dunning, 1993), with the profiles of all words of the target language. The distance between two such vectors is interpreted as an indicator of their semantic similarity and of their translational relation. While using association measures to extract word translation equivalents has shown better performance than using a raw cooccurrence model, the latter remains the core of any statistical generalisation (Evert, 2005).
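To make the comparison step of the standard approach concrete, the sketch below ranks target words by the similarity of context vectors. It is only an illustration under conventions of our own, not the authors' implementation: context vectors are dictionaries mapping words to association scores, a seed bilingual dictionary projects the source vector into the target language, and cosine similarity stands in for the distance measure, which is not specified at this point of the paper.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(score * v[w] for w, score in u.items() if w in v)
    norm_u = sqrt(sum(s * s for s in u.values()))
    norm_v = sqrt(sum(s * s for s in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_translations(src_vector, seed_dictionary, tgt_vectors):
    """Project a source context vector through a seed dictionary, then
    rank every target word by the similarity of its own context vector."""
    projected = {}
    for word, score in src_vector.items():
        for translation in seed_dictionary.get(word, []):
            projected[translation] = projected.get(translation, 0.0) + score
    scored = ((t, cosine(projected, vec)) for t, vec in tgt_vectors.items())
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The top-ranked candidates returned by such a procedure are the hypothesised translation equivalents of the source word.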
As has been known, words and other type-rich linguistic populations do not contain instances of all types in the population, even in the largest samples (Zipf, 1949; Evert and Baroni, 2007). Therefore, the number and distribution of types in the available sample are not reliable estimators (Evert and Baroni, 2007), especially for small comparable corpora. The literature suggests two major approaches to the data sparseness problem: smoothing and class-based methods. Smoothing techniques (Good, 1953) are often used to better estimate probabilities when there is insufficient data to estimate them accurately. They tend to make distributions more uniform, by adjusting low probabilities such as zero probabilities upward, and high probabilities downward. Generally, smoothing methods not only prevent zero probabilities, but they also attempt to improve the accuracy of the model as a whole (Chen and Goodman, 1999). Class-based models (Pereira et al., 1993) use classes of similar words to distinguish between unseen cooccurrences. The relationship between given words is modeled by analogy with other words that are in some sense similar to the given ones. Hence, class-based models provide an alternative to the independence assumption on the cooccurrence of given words w1 and w2: the more frequent w2 is, the higher the estimate of P(w2|w1) will be, regardless of w1.

Starting from the observation that smoothing estimates ignore the expected degree of association between words (they assign the same estimate to all unseen cooccurrences) and that class-based models may not structure and generalize word cooccurrences into class cooccurrence patterns without losing too much information, (Dagan et al., 1993) proposed an alternative to these two approaches for estimating the probabilities of unseen cooccurrences. They presented a method that makes analogies between each specific unseen cooccurrence and other cooccurrences that contain similar words. The analogies are based on the assumption that similar word cooccurrences have similar values of mutual information. Their method showed significant improvements on both word sense disambiguation in machine translation and data recovery tasks. (Pekar et al., 2006) employed the nearest-neighbor variant of the previous approach to extract translation equivalents for low-frequency words from comparable corpora. They used a distance-based averaging technique for smoothing (Dagan et al., 1999). Their method yielded a significant improvement for low-frequency words.

Starting from the assumption that smoothing improves the accuracy of the model as a whole (Chen and Goodman, 1999), we believe that smoothed context vectors should lead to better performance for bilingual terminology extraction from comparable corpora. In this work we carry out an empirical comparison of the most widely-used smoothing techniques, including additive smoothing (Lidstone, 1920), the Good-Turing estimate (Good, 1953), Jelinek-Mercer (Mercer, 1980), Katz (Katz, 1987) and Kneser-Ney smoothing (Kneser and Ney, 1995). Unlike (Pekar et al., 2006), the present work does not investigate unseen words; we concentrate only on observed cooccurrences. We believe this constitutes the most systematic comparison made so far of different smoothing techniques for aligning translation equivalents from comparable corpora. We show that using smoothing as a pre-processing step of the standard approach leads to significant improvement, even without considering unseen cooccurrences.

In the remainder of this paper, we present the different smoothing techniques in Section 2. The steps of the standard approach and our extended method are then described in Section 3. Section 4 describes the experimental setup and our resources. Section 5 presents the experiments and comments on several results. We finally discuss the results in Section 6 and conclude in Section 7.

2 Smoothing Techniques

Smoothing describes techniques for adjusting the maximum likelihood estimate of probabilities in order to produce more accurate probabilities. Smoothing techniques tend to make distributions more uniform. In this section we present the most widely used techniques.

2.1 Additive Smoothing

The Laplace estimator, or additive smoothing (Lidstone, 1920; Johnson, 1932; Jeffreys, 1948), is one of the simplest types of smoothing. Its principle is to estimate probabilities P assuming that each unseen word type actually occurred once. Then, if we have N events and V possible words, instead of:

P(w) = \frac{occ(w)}{N}  (1)

we estimate:

P_{add-one}(w) = \frac{occ(w) + 1}{N + V}  (2)

Applying Laplace estimation to word cooccurrences amounts to assuming that if two words cooccur together n times in a corpus, they could cooccur together (n + 1) times. According to the maximum likelihood estimation (MLE):

P(w_{i+1} | w_i) = \frac{C(w_i, w_{i+1})}{C(w_i)}  (3)

Laplace smoothing gives:

P^{*}(w_{i+1} | w_i) = \frac{C(w_i, w_{i+1}) + 1}{C(w_i) + V}  (4)

Several disadvantages emanate from this method:

1. The probability of frequent n-grams is underestimated.
2. The probability of rare or unseen n-grams is overestimated.
3. All the unseen n-grams are smoothed in the same way.
4. Too much probability mass is shifted towards unseen n-grams.

One improvement is to use a smaller added count δ, following the equation below:

P^{*}(w_{i+1} | w_i) = \frac{\delta + C(w_i, w_{i+1})}{\delta |V| + C(w_i)}  (5)

with δ ∈ ]0, 1].
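As an illustration of the add-δ estimator in equations (2), (4) and (5), the short sketch below smooths bigram cooccurrence counts. It is a minimal example under our own conventions (counts kept in plain dictionaries, a fixed vocabulary size V supplied by the caller), not the authors' implementation; setting delta = 1.0 recovers the Laplace (add-one) case.

```python
def additive_bigram_prob(bigram_counts, unigram_counts, vocab_size, delta=1.0):
    """Return a function computing P*(w2 | w1) with add-delta smoothing,
    cf. equations (4) and (5); delta = 1.0 gives Laplace smoothing.

    bigram_counts:  dict mapping (w1, w2) -> C(w1, w2)
    unigram_counts: dict mapping w1 -> C(w1)
    vocab_size:     V, the number of word types
    """
    def prob(w1, w2):
        numerator = delta + bigram_counts.get((w1, w2), 0)
        denominator = delta * vocab_size + unigram_counts.get(w1, 0)
        return numerator / denominator
    return prob

# Toy usage on a handful of counts.
bigrams = {("corpus", "word"): 2, ("corpus", "type"): 1}
unigrams = {"corpus": 3}
p = additive_bigram_prob(bigrams, unigrams, vocab_size=1000, delta=0.5)
print(p("corpus", "word"))    # seen bigram: heavily discounted relative to the MLE 2/3
print(p("corpus", "unseen"))  # unseen bigram: small but non-zero probability mass
```

Note that the same δ is given to every unseen pair, which is exactly the weakness listed as points 3 and 4 above.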
2.2 Good-Turing Estimator

The Good-Turing estimator (Good, 1953) provides another way to smooth probabilities. It states that for any n-gram that occurs r times, we should pretend that it occurs r^{*} times. The Good-Turing estimator uses the count of things you have seen once to help estimate the count of things you have never seen. In order to compute the frequency of words, we need to compute N_c, the number of events that occur c times (assuming that all items are binomially distributed). Let N_r be the number of items that occur r times. N_r can be used to provide a better estimate of r, given the binomial distribution. The adjusted frequency r^{*} is then:

r^{*} = (r + 1) \frac{N_{r+1}}{N_r}  (6)

2.3 Jelinek-Mercer Smoothing

As one alternative for missing n-grams, useful information can be provided by interpolating the higher-order model with a lower-order model. The discounted count c^{*} of the Good-Turing estimator is combined in the linear interpolation as follows:

c^{*}_{int}(w_{i+1} | w_i) = \lambda c^{*}(w_{i+1} | w_i) + (1 - \lambda) P(w_i)  (8)

2.4 Katz Smoothing

(Katz, 1987) extends the intuitions of the Good-Turing estimate by adding the combination of higher-order models with lower-order models. For a bigram w_{i-1}^{i} with count r = c(w_{i-1}^{i}), its corrected count is given by the equation:

c_{katz}(w_{i-1}^{i}) = \begin{cases} r & \text{if } r > 0 \\ \alpha(w_{i-1}) P_{ML}(w_i) & \text{if } r = 0 \end{cases}  (9)

and:

\alpha(w_{i-1}) = \frac{1 - \sum_{w_i : c(w_{i-1}^{i}) > 0} P_{katz}(w_{i-1}^{i})}{1 - \sum_{w_i : c(w_{i-1}^{i}) > 0} P_{ML}(w_{i-1}^{i})}  (10)

According to (Katz, 1987), the general discounted estimate c^{*} of Good-Turing is not used for all counts c. Large counts, where c > k for some threshold k, are assumed to be reliable; (Katz, 1987) suggests k = 5. Thus, we define c^{*} = c for c > k, and:

c^{*} = \frac{(c + 1) \frac{N_{c+1}}{N_c} - c \frac{(k+1) N_{k+1}}{N_1}}{1 - \frac{(k+1) N_{k+1}}{N_1}}  (11)

2.5 Kneser-Ney Smoothing

Kneser and Ney introduced an extension of absolute discounting (Kneser and Ney, 1995). The estimate of the higher-order distribution is created by subtracting a fixed discount D from each non-zero count.
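As a rough illustration of absolute discounting with a Kneser-Ney lower-order distribution, the sketch below implements the interpolated Kneser-Ney bigram estimate in its usual textbook formulation (Kneser and Ney, 1995; Chen and Goodman, 1999): a fixed discount D is subtracted from every non-zero bigram count, and the freed probability mass is redistributed through a "continuation" distribution based on how many distinct contexts a word appears in. This is our own sketch of the standard form, not necessarily the exact variant evaluated in this paper.

```python
from collections import defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Interpolated Kneser-Ney bigram estimate (standard formulation, sketch).

    bigram_counts: dict mapping (w1, w2) -> c(w1, w2), with c > 0
    discount:      the fixed discount D subtracted from each non-zero count
    """
    left_totals = defaultdict(int)    # c(w1, .): total count of bigrams starting with w1
    followers = defaultdict(set)      # distinct words seen after w1
    contexts = defaultdict(set)       # distinct words seen before w2
    for (w1, w2), count in bigram_counts.items():
        left_totals[w1] += count
        followers[w1].add(w2)
        contexts[w2].add(w1)
    total_bigram_types = len(bigram_counts)  # number of distinct bigram types

    def prob(w1, w2):
        # Lower-order "continuation" probability: in how many distinct contexts w2 occurs.
        p_continuation = len(contexts[w2]) / total_bigram_types
        history_count = left_totals[w1]
        if history_count == 0:
            return p_continuation  # unseen history: back off entirely
        # Discounted higher-order term.
        higher = max(bigram_counts.get((w1, w2), 0) - discount, 0.0) / history_count
        # Interpolation weight: the mass freed by discounting the seen bigrams of w1.
        lam = discount * len(followers[w1]) / history_count
        return higher + lam * p_continuation

    return prob

# Toy usage: "york" follows two distinct contexts, so unseen pairs ending in
# "york" receive more of the redistributed mass than those ending in "francisco".
p = kneser_ney_bigram({("san", "francisco"): 5, ("new", "york"): 8, ("the", "york"): 2})
print(p("san", "francisco"))
print(p("san", "york"))
```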