N-gram language models

Statistical Natural Language Processing
Çağrı Çöltekin
University of Tübingen, Seminar für Sprachwissenschaft
Summer Semester 2018

Language models

• A language model answers the question: how likely is a given sequence (of words, letters, …) in a given language?
• Language models assign scores, typically probabilities, to sequences
• N-gram language models are the 'classical' approach to language modeling
• The main idea is to estimate the probability of a sequence using the probabilities of words given a limited history
• As a bonus, we also get an answer to: what is the most likely word given the previous words?


N-grams in practice: spelling correction

• How would a spell checker know that there is a spelling error in the following sentence?
    I like pizza wit spinach
• Or this one?
    Zoo animals on the lose

We want:
    P(I like pizza with spinach) > P(I like pizza wit spinach)
    P(Zoo animals on the loose) > P(Zoo animals on the lose)

N-grams in practice: speech recognition

[Word lattice over the phoneme sequence 'r e k @ n ai s b ii ch', with competing word hypotheses such as recognize, reckon, wreck on, speech, a, nice, an eye, ice, beach, …*]

We want:
    P(recognize speech) > P(wreck a nice beach)

* Reproduced from Shillcock (1995)


Speech recognition gone wrong

[Comic illustrating speech recognition gone wrong (the Herman comic; see the credits at the end).]




What went wrong? Recap: the noisy channel model

[Noisy channel diagram: an encoder sends the intended message through a noisy channel, and a decoder tries to recover it; e.g., 'tell the truth' comes out as 'smell the soup'.]

• We want P(u | A), the probability of the utterance given the acoustic signal
• The model of the noisy channel gives us P(A | u)
• We can use Bayes' formula:

    P(u | A) = P(A | u) P(u) / P(A)

• P(u), the probability of utterances, comes from a language model

N-grams in practice: machine translation

German to English translation:

• Correct word choice
    German: Der grosse Mann tanzt gerne    English: The big man likes to dance
    German: Der grosse Mann weiß alles     English: The great man knows all
• Correct ordering / word choice among alternatives
    German: Er tanzt gerne                 English: He dances with pleasure
                                                    He likes to dance

We want: P(He likes to dance) > P(He dances with pleasure)


N-grams in practice: predictive text

[Screenshot of predictive text suggestions from google.com]

• How many language models are there in the example above?
• The screenshot is from google.com, but predictive text is used everywhere
• If you want examples of predictive text gone wrong, look for 'auto-correct mistakes' on the Web


More applications for language models

• Spelling correction
• Speech recognition
• Machine translation
• Predictive text
• Text recognition (OCR, handwritten text)
• Information retrieval
• Text classification
• …

Overview of the overview

• Why do we need n-gram language models?
• What are they?
• How do we build and use them?
• What alternatives are out there?


Overview

• Why do we need n-gram language models?
• How to assign probabilities to sequences?
• N-grams: what are they, how do we count them?
• MLE: how to assign probabilities to n-grams?
• Evaluation: how do we know our n-gram model works well?
• Smoothing: how to handle unknown words?
• Some practical issues with implementing n-grams
• Extensions, alternative approaches

Our aim in a bit more detail

We want to solve two related problems:

• Given a sequence of words w = (w_1 w_2 … w_m), what is the probability of the sequence, P(w)?
    (machine translation, automatic speech recognition, spelling correction)
• Given a sequence of words w_1 w_2 … w_{m-1}, what is the probability of the next word, P(w_m | w_1 … w_{m-1})?
    (predictive text)

Assigning probabilities to sentences: count and divide?

How do we calculate the probability of a sentence, like P(I like pizza with spinach) = ?

• Can we count the occurrences of the sentence, and divide by the total number of sentences (in a large corpus)?
• Short answer: no.
  – Many sentences are not observed even in very large corpora
  – For the ones observed in a corpus, the probabilities will not reflect our intuition, or will not be useful in most applications

Assigning probabilities to sentences: applying the chain rule

• The solution is to decompose: we use probabilities of parts of the sentence (words) to calculate the probability of the whole sentence
• Using the chain rule of probability (without loss of generality), we can write

    P(w_1, w_2, …, w_m) = P(w_1)
                         × P(w_2 | w_1)
                         × P(w_3 | w_1, w_2)
                         × …
                         × P(w_m | w_1, w_2, …, w_{m-1})


Example: applying the chain rule

    P(I like pizza with spinach) = P(I)
                                  × P(like | I)
                                  × P(pizza | I like)
                                  × P(with | I like pizza)
                                  × P(spinach | I like pizza with)

• Did we solve the problem?
• Not really: the last term is equally difficult to estimate

Assigning probabilities to sentences: the Markov assumption

We make a conditional independence assumption: the probability of a word is independent of the earlier history, given the n − 1 previous words:

    P(w_i | w_1, …, w_{i-1}) = P(w_i | w_{i-n+1}, …, w_{i-1})

and therefore

    P(w_1, …, w_m) = ∏_{i=1}^{m} P(w_i | w_{i-n+1}, …, w_{i-1})

For example, with n = 2 (a bigram, i.e., first-order Markov model):

    P(w_1, …, w_m) = ∏_{i=1}^{m} P(w_i | w_{i-1})
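To make the factorization concrete, here is a minimal Python sketch (not from the original slides) that multiplies bigram conditional probabilities over a sentence; the probability table and the stand-in value for P(w_1) are hypothetical.

```python
# Sketch: sentence probability under a first-order (bigram) Markov assumption.
# The probability table below is hypothetical, only to make the example run.
bigram_prob = {
    ("I", "like"): 0.2,
    ("like", "pizza"): 0.1,
    ("pizza", "with"): 0.05,
    ("with", "spinach"): 0.01,
}

def bigram_sentence_prob(words, table, p_first=0.1):
    """Multiply P(w_i | w_{i-1}) over the sentence; p_first stands in for P(w_1)."""
    prob = p_first
    for prev, cur in zip(words, words[1:]):
        prob *= table.get((prev, cur), 0.0)   # unseen bigrams get 0 without smoothing
    return prob

print(bigram_sentence_prob("I like pizza with spinach".split(), bigram_prob))
# 0.1 * 0.2 * 0.1 * 0.05 * 0.01 = 1e-06
```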


Example: bigram probabilities of a sentence

    P(I like pizza with spinach) = P(I)
                                  × P(like | I)
                                  × P(pizza | like)
                                  × P(with | pizza)
                                  × P(spinach | with)

• Now, hopefully, we can count these in a corpus

Maximum-likelihood estimation (MLE)

• Maximum-likelihood estimation of n-gram probabilities is based on their frequencies in a corpus
• We are interested in conditional probabilities of the form P(w_i | w_{i-n+1}, …, w_{i-1}), which we estimate as

    P(w_i | w_{i-n+1}, …, w_{i-1}) = C(w_{i-n+1} … w_i) / C(w_{i-n+1} … w_{i-1})

  where C(·) is the frequency (count) of the sequence in the corpus
• For example, the probability P(like | I) would be

    P(like | I) = C(I like) / C(I)
                = (number of times I like occurs in the corpus) / (number of times I occurs in the corpus)
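As a minimal sketch (not from the slides), this count-and-divide estimate can be computed directly from a tokenized corpus; the tiny corpus below is made up.

```python
from collections import Counter

tokens = "I like pizza . I like spinach .".split()       # hypothetical toy corpus
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_mle(w2, w1):
    """MLE estimate P(w2 | w1) = C(w1 w2) / C(w1), with raw unigram counts as denominator."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_mle("like", "I"))    # C(I like) / C(I) = 2 / 2 = 1.0
```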


MLE estimation of an n-gram language model

An n-gram model is conditioned on the n − 1 previous words.

• In a 1-gram (unigram) model,
    P(w_i) = C(w_i) / N
• In a 2-gram (bigram) model,
    P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})
• In a 3-gram (trigram) model,
    P(w_i | w_{i-2} w_{i-1}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1})

Training an n-gram model involves estimating these parameters (conditional probabilities).

Unigrams

Unigrams are simply the single words (or tokens).

A small corpus:
    I 'm sorry , Dave .
    I 'm afraid I can 't do that .
When tokenized, we have 15 tokens and 11 types.*

Unigram counts

    ngram   freq    ngram   freq    ngram    freq    ngram   freq
    I       3       ,       1       afraid   1       do      1
    'm      2       Dave    1       can      1       that    1
    sorry   1       .       2       't       1

* Traditionally, can't is tokenized as ca n't (similar to have n't, is n't, etc.), but for our purposes can 't is more readable.

Unigram probability of a sentence

Unigram MLE probabilities:

    word     prob      word     prob
    I        0.200     afraid   0.067
    'm       0.133     can      0.067
    .        0.133     do       0.067
    't       0.067     sorry    0.067
    ,        0.067     that     0.067
    Dave     0.067     total    1.000

    P(I 'm sorry , Dave .) = P(I) × P('m) × P(sorry) × P(,) × P(Dave) × P(.)
                           = 3/15 × 2/15 × 1/15 × 1/15 × 1/15 × 2/15
                           = 0.000 001 05

• P(, 'm I . sorry Dave) = ?
• What is the most likely sentence according to this model?

N-gram models define probability distributions

• An n-gram model defines a probability distribution over words:
    ∑_{w ∈ V} P(w) = 1
• It also defines probability distributions over word sequences of equal length. For example (length 2),
    ∑_{w ∈ V} ∑_{v ∈ V} P(w) P(v) = 1
• What about sentences?
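A short Python sketch (not part of the slides) reproduces the unigram computation above from the toy corpus:

```python
from collections import Counter
from math import prod

corpus = "I 'm sorry , Dave . I 'm afraid I can 't do that .".split()
counts = Counter(corpus)
N = len(corpus)                                # 15 tokens

def p_unigram(w):
    return counts[w] / N                       # MLE: C(w) / N

sentence = "I 'm sorry , Dave .".split()
print(prod(p_unigram(w) for w in sentence))    # ≈ 1.05e-06, as on the slide
```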


Unigram probabilities

[Bar chart of the unigram counts in the toy corpus: I (3), 'm (2), . (2), and the remaining words with count 1.]

Unigram probabilities in a (slightly) larger corpus
MLE probabilities in the Universal Declaration of Human Rights

[Plot of MLE unigram probabilities by rank: a few high-probability words, then a long tail follows (ranks shown up to about 250, continuing to 536).]


Zipf's law – a short divergence

The frequency of a word is inversely proportional to its rank:

    rank × frequency = k    or    frequency ∝ 1 / rank

• This is a recurring theme in (computational) linguistics: most linguistic units follow a more-or-less similar distribution
• Important consequence for us (in this lecture):
  – even very large corpora will not contain some of the words (or n-grams)
  – there will be many low-probability events (words/n-grams)

Bigrams

Bigrams are overlapping sequences of two tokens.

    I 'm sorry , Dave .
    I 'm afraid I can 't do that .

Bigram counts

    ngram      freq    ngram       freq    ngram      freq    ngram      freq
    I 'm       2       , Dave      1       afraid I   1       't do      1
    'm sorry   1       Dave .      1       I can      1       do that    1
    sorry ,    1       'm afraid   1       can 't     1       that .     1

• What about the bigram '. I'?


Sentence boundary markers

If we want sentence probabilities, we need to mark sentence boundaries:

    ⟨s⟩ I 'm sorry , Dave . ⟨/s⟩
    ⟨s⟩ I 'm afraid I can 't do that . ⟨/s⟩

• The bigram '⟨s⟩ I' is not the same as the unigram 'I'
• Including ⟨s⟩ allows us to predict likely words at the beginning of a sentence
• Including ⟨/s⟩ allows us to assign a proper probability distribution to sentences

Calculating bigram probabilities: recap with some more detail

We want to calculate P(w_2 | w_1). From the chain rule:

    P(w_2 | w_1) = P(w_1, w_2) / P(w_1)

and the MLE is

    P(w_2 | w_1) = (C(w_1 w_2) / N) / (C(w_1) / N) = C(w_1 w_2) / C(w_1)

• P(w_2 | w_1) is the probability of w_2 given that the previous word is w_1
• P(w_1, w_2) is the probability of the sequence w_1 w_2
• P(w_1) is the probability of w_1 occurring as the first item in a bigram, not its unigram probability

Bigram probabilities

    w1 w2       C(w1w2)  C(w1)  P(w1w2)  P(w1)  P(w2|w1)  P(w2)
    ⟨s⟩ I       2        2      0.12     0.12   1.00      0.18
    I 'm        2        3      0.12     0.18   0.67      0.12
    'm sorry    1        2      0.06     0.12   0.50      0.06
    sorry ,     1        1      0.06     0.06   1.00      0.06
    , Dave      1        1      0.06     0.06   1.00      0.06
    Dave .      1        1      0.06     0.06   1.00      0.12
    'm afraid   1        2      0.06     0.12   0.50      0.06
    afraid I    1        1      0.06     0.06   1.00      0.18
    I can       1        3      0.06     0.18   0.33      0.06
    can 't      1        1      0.06     0.06   1.00      0.06
    't do       1        1      0.06     0.06   1.00      0.06
    do that     1        1      0.06     0.06   1.00      0.06
    that .      1        1      0.06     0.06   1.00      0.12
    . ⟨/s⟩      2        2      0.12     0.12   1.00      0.12

(The last column, P(w2), is the unigram probability.)

Sentence probability: bigram vs. unigram

[Bar chart comparing the per-word unigram and bigram probabilities in ⟨s⟩ I 'm sorry , Dave . ⟨/s⟩]

    P_uni(⟨s⟩ I 'm sorry , Dave . ⟨/s⟩) = 2.83 × 10^-9
    P_bi(⟨s⟩ I 'm sorry , Dave . ⟨/s⟩)  = 0.33
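The bigram computation can be sketched the same way in Python (not from the slides; ⟨s⟩ and ⟨/s⟩ are written as <s> and </s> for plain ASCII):

```python
from collections import Counter

sentences = [
    "<s> I 'm sorry , Dave . </s>".split(),
    "<s> I 'm afraid I can 't do that . </s>".split(),
]
bigram_counts = Counter(b for s in sentences for b in zip(s, s[1:]))
context_counts = Counter()                     # C(w1) counted as a bigram context
for (w1, _w2), c in bigram_counts.items():
    context_counts[w1] += c

def p_bigram(w2, w1):
    """MLE bigram probability P(w2 | w1); 0 for unseen bigrams."""
    return bigram_counts[(w1, w2)] / context_counts[w1] if context_counts[w1] else 0.0

def sentence_prob(tokens):
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        prob *= p_bigram(w2, w1)
    return prob

print(sentence_prob("<s> I 'm sorry , Dave . </s>".split()))   # ≈ 0.33
```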


Unigram vs. bigram probabilities: in sentences and non-sentences

    w = I 'm sorry , Dave .      P_uni = 2.83 × 10^-9    P_bi = 0.33
    w = , 'm I . sorry Dave      P_uni = 2.83 × 10^-9    P_bi = 0.00
    w = I 'm afraid , Dave .     P_uni = 2.83 × 10^-9    P_bi = 0.00

• The unigram model cannot distinguish the grammatical sentence from a scrambled or odd one built from equally frequent words
• The bigram model assigns zero probability to any sequence containing an unseen bigram

Bigram model as a finite-state automaton

[State diagram of the bigram model over the toy corpus, with arcs labelled by bigram probabilities (e.g., ⟨s⟩ → I with 1.0, I → 'm with 0.67, I → can with 0.33, 'm → sorry and 'm → afraid with 0.5 each).]


Trigram probabilities of a sentence

    ⟨s⟩ ⟨s⟩ I 'm sorry , Dave . ⟨/s⟩
    ⟨s⟩ ⟨s⟩ I 'm afraid I can 't do that . ⟨/s⟩

Trigram counts

    ngram          freq    ngram           freq    ngram          freq
    ⟨s⟩ ⟨s⟩ I      2       , Dave .        1       I can 't       1
    ⟨s⟩ I 'm       2       Dave . ⟨/s⟩     1       can 't do      1
    I 'm sorry     1       I 'm afraid     1       't do that     1
    'm sorry ,     1       'm afraid I     1       do that .      1
    sorry , Dave   1       afraid I can    1       that . ⟨/s⟩    1

[Bar chart comparing per-word unigram, bigram and trigram probabilities for I 'm sorry , Dave . ⟨/s⟩]

    P_uni(I 'm sorry , Dave . ⟨/s⟩) = 2.83 × 10^-9
    P_bi(I 'm sorry , Dave . ⟨/s⟩)  = 0.33
    P_tri(I 'm sorry , Dave . ⟨/s⟩) = 0.50

• How many n-grams are there in a sentence of length m?


Short detour: colorless green ideas

    But it must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term. — Chomsky (1968)

• The following 'sentences' are categorically different:
  – Furiously sleep ideas green colorless
  – Colorless green ideas sleep furiously
• Can n-gram models model the difference?
• Should n-gram models model the difference?

What do n-gram models model?

• Some morphosyntax: the bigram 'ideas are' is (much more) likely than 'ideas is'
• Some semantics: 'bright ideas' is more likely than 'green ideas'
• Some cultural aspects of everyday language: 'Chinese food' is more likely than 'British food'
• … and more aspects of the 'usage' of language

N-gram models are practical tools, and they have been useful for many tasks.

N-grams, so far …

• N-gram language models are one of the basic tools in NLP
• They capture some linguistic (and non-linguistic) regularities that are useful in many applications
• The idea is to estimate the probability of a sentence based on its parts (sequences of words)
• N-grams are n consecutive units in a sequence
• Typically, we use sequences of words to estimate sentence probabilities, but other units are also possible: characters, phonemes, phrases, …
• For most applications, we introduce sentence boundary markers

How to test n-gram models?

Extrinsic: improvement in the target application due to the language model, e.g.,
  • speech recognition accuracy
  • BLEU score for machine translation
  • keystroke savings in predictive text applications

Intrinsic: the higher the probability assigned to a test set, the better the model. A few measures:
  • likelihood
  • (cross) entropy
  • perplexity


Training and test set division

• We (almost) never evaluate a statistical (language) model on the data it was trained on
• Testing a model on the training set is misleading: the model may overfit the training set
• Always test your models on a separate test set

Intrinsic evaluation metrics: likelihood

• The likelihood of a model M is the probability of the (test) set w given the model:

    L(M | w) = P(w | M) = ∏_{s ∈ w} P(s)

• The higher the likelihood (for a given test set), the better the model
• Likelihood is sensitive to the test set size
• Practical note: (minus) log likelihood is used more commonly, because it is easier to manipulate numerically


Intrinsic evaluation metrics: cross entropy

• The cross entropy of a language model on a test set w is

    H(w) = − (1/N) ∑_i log₂ P̂(w_i)

• The lower the cross entropy, the better the model
• Cross entropy is not sensitive to the test-set size

Reminder: cross entropy measures the number of bits required to encode data coming from a distribution P using an approximate distribution P̂:

    H(P, P̂) = − ∑_x P(x) log₂ P̂(x)

Intrinsic evaluation metrics: perplexity

• Perplexity is a more common measure for evaluating language models:

    PP(w) = 2^{H(w)} = P(w)^{−1/N} = (1 / P(w))^{1/N}

• Perplexity is the average branching factor
• Similar to cross entropy:
  – lower is better
  – not sensitive to test set size
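A small sketch (not from the slides) of both metrics, given per-sentence probabilities from some model and the total number of test tokens:

```python
from math import log2

def cross_entropy(sentence_probs, n_tokens):
    """H = -(1/N) * sum_i log2 P(s_i), with N the number of tokens in the test set."""
    return -sum(log2(p) for p in sentence_probs) / n_tokens

def perplexity(sentence_probs, n_tokens):
    return 2 ** cross_entropy(sentence_probs, n_tokens)

# hypothetical test-set probabilities for two sentences, 14 tokens in total
probs = [0.333, 0.02]
print(cross_entropy(probs, 14), perplexity(probs, 14))
```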


What do we do with unseen n-grams?
…and other issues with MLE estimates

• Words (and word sequences) are distributed according to Zipf's law: many words are rare
• MLE will assign zero probability to unseen words, and to sequences containing unseen words
• Even with non-zero probabilities, MLE overfits the training data
• One solution is smoothing: take some probability mass from known words, and assign it to unknown words

[Diagram: some probability mass is moved from the seen events to the unseen events.]

Smoothing: what is in the name?

[Histograms of 5, 10, 30 and 1000 samples drawn from N(0, 1): the empirical distributions of small samples are jagged and only approach the smooth underlying density as the sample size grows.]

Laplace smoothing (add-one smoothing)

• The idea (from 1790): add one to all counts
• The probability of a word is estimated by

    P_+1(w) = (C(w) + 1) / (N + V)

  where N is the number of word tokens and V is the number of word types (the size of the vocabulary)
• Then the probability of an unknown word is

    (0 + 1) / (N + V)

Laplace smoothing for n-grams

• The probability of a bigram becomes

    P_+1(w_{i-1} w_i) = (C(w_{i-1} w_i) + 1) / (N + V²)

  and the conditional probability

    P_+1(w_i | w_{i-1}) = (C(w_{i-1} w_i) + 1) / (C(w_{i-1}) + V)

• In general,

    P_+1(w_{i-n+1}^{i}) = (C(w_{i-n+1}^{i}) + 1) / (N + Vⁿ)

    P_+1(w_i | w_{i-n+1}^{i-1}) = (C(w_{i-n+1}^{i}) + 1) / (C(w_{i-n+1}^{i-1}) + V)
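A minimal sketch of the add-one estimate for bigrams (not from the slides); joining the two toy sentences into one token stream also creates a harmless '</s> <s>' bigram:

```python
from collections import Counter

tokens = "<s> I 'm sorry , Dave . </s> <s> I 'm afraid I can 't do that . </s>".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)
V = len(unigram_counts)                        # vocabulary size (number of word types)

def p_laplace(w2, w1):
    """Add-one smoothed P_+1(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V)."""
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(p_laplace("'m", "I"))          # seen bigram: (2 + 1) / (3 + 13)
print(p_laplace("Dave", "sorry"))    # unseen bigram, now non-zero
```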


Bigram probabilities: non-smoothed vs. Laplace smoothing

    w1 w2       C+1   P_MLE(w1w2)  P_+1(w1w2)  P_MLE(w2|w1)  P_+1(w2|w1)
    ⟨s⟩ I       3     0.118        0.019       1.000         0.188
    I 'm        3     0.118        0.019       0.667         0.176
    'm sorry    2     0.059        0.012       0.500         0.125
    sorry ,     2     0.059        0.012       1.000         0.133
    , Dave      2     0.059        0.012       1.000         0.133
    Dave .      2     0.059        0.012       1.000         0.133
    'm afraid   2     0.059        0.012       0.500         0.125
    afraid I    2     0.059        0.012       1.000         0.133
    I can       2     0.059        0.012       0.333         0.118
    can 't      2     0.059        0.012       1.000         0.133
    't do       2     0.059        0.012       1.000         0.133
    do that     2     0.059        0.012       1.000         0.133
    that .      2     0.059        0.012       1.000         0.133
    . ⟨/s⟩      3     0.118        0.019       1.000         0.188
    ∑                 1.000        0.193

MLE vs. Laplace probabilities: bigram probabilities in sentences and non-sentences

    w        I      'm     sorry   ,      Dave   .      ⟨/s⟩
    P_MLE    1.00   0.67   0.50    1.00   1.00   1.00   1.00    → 0.33
    P_+1     0.25   0.23   0.17    0.18   0.18   0.18   0.25    → 1.44 × 10^-5

    w        ,      'm     I       .      sorry  Dave   ⟨/s⟩
    P_MLE    0.00   0.00   0.00    0.00   0.00   0.00   0.00    → 0.00
    P_+1     0.08   0.09   0.08    0.08   0.08   0.09   0.09    → 3.34 × 10^-8

    w        I      'm     afraid  ,      Dave   .      ⟨/s⟩
    P_MLE    1.00   0.67   0.50    0.00   1.00   1.00   1.00    → 0.00
    P_+1     0.25   0.23   0.17    0.09   0.18   0.18   0.25    → 7.22 × 10^-6


How much mass does +1 smoothing steal?

[Pie charts of the probability mass reserved for unseen events by add-one smoothing: unigrams 3.33 % unseen, bigrams 83.33 % unseen, trigrams 98.55 % unseen.]

• Laplace smoothing reserves probability mass proportional to the size of the vocabulary
• This is just too much for large vocabularies and higher-order n-grams
• Note that only very few of the higher-order n-grams (e.g., trigrams) are possible

Lidstone correction (add-α smoothing)

• A simple improvement over Laplace smoothing is to add α, with 0 < α (and typically α < 1), instead of 1:

    P_+α(w_i | w_{i-n+1}^{i-1}) = (C(w_{i-n+1}^{i}) + α) / (C(w_{i-n+1}^{i-1}) + αV)

• With smaller α values the model behaves more like MLE and overfits: it has high variance
• Larger α values reduce overfitting/variance, but result in large bias


How do we pick a good α value?
setting smoothing parameters

• We want the α value that works best outside the training data
• Peeking at your test data during training/development is wrong
• This calls for another division of the available data: set aside a development set for tuning hyperparameters such as α
• Alternatively, we can use k-fold cross validation and take the α with the best average score

Absolute discounting

• An alternative to additive smoothing is to reserve an explicit amount of probability mass, ϵ, for the unseen events
• The probabilities of the known events have to be re-normalized
• How do we decide what ϵ value to use?
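A sketch of the development-set idea (not from the slides): evaluate a few candidate α values by dev-set perplexity and keep the best one; the train/dev data here are hypothetical.

```python
from collections import Counter
from math import log2

train = "<s> I 'm sorry , Dave . </s>".split()
dev = "<s> I 'm afraid . </s>".split()

bigrams = Counter(zip(train, train[1:]))
unigrams = Counter(train)
V = len(unigrams)

def p_add_alpha(w2, w1, alpha):
    return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * V)

def dev_perplexity(alpha):
    logp = sum(log2(p_add_alpha(w2, w1, alpha)) for w1, w2 in zip(dev, dev[1:]))
    return 2 ** (-logp / (len(dev) - 1))        # per-bigram perplexity on the dev set

best = min([0.01, 0.1, 0.5, 1.0], key=dev_perplexity)
print(best, dev_perplexity(best))
```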

Good-Turing smoothing: the 'discounting' view

• Estimate the probability mass to be reserved for novel n-grams using the observed n-grams
• The stand-in for novel events is the set of n-grams that occur only once in the training set:

    p_0 = n_1 / N

  where n_1 is the number of distinct n-grams with frequency 1 in the training data
• Now we need to discount this mass from the higher counts
• The probability of an n-gram that occurred r times in the corpus becomes

    P_GT = (r + 1) n_{r+1} / (n_r N)

  Note: N = ∑_r r × n_r

Some terminology: frequencies of frequencies and equivalence classes

[Bar chart of the unigram counts in the toy corpus, annotated with the frequencies of frequencies: n_1 = 8, n_2 = 2, n_3 = 1.]

• We often put n-grams into equivalence classes
• Good-Turing forms the equivalence classes based on frequency


Good-Turing estimation: leave-one-out justification

• Leave each n-gram (occurrence) out
• Count how often the left-out n-gram has frequency r in the remaining data:
  – novel n-grams
  – n-grams with frequency 1 (singletons)
  – n-grams with frequency 2 (doubletons*)
  – …
• The resulting estimates:
  – novel items: n_1 / N
  – singletons: 2 n_2 / (n_1 N)
  – doubletons: 3 n_3 / (n_2 N)
  – …

Adjusted counts

Sometimes it is instructive to see the 'effective count' of an n-gram under the smoothing method. For Good-Turing smoothing, the updated count r* is

    r* = (r + 1) n_{r+1} / n_r

* Yes, this seems to be a word.


Good-Turing example

[Bar chart of the toy unigram counts, annotated with n_1 = 8, n_2 = 2, n_3 = 1 (N = 15).]

    total mass for unseen words (the, a, …):       n_1 / N = 8 / 15
    total mass for words seen once (that, do, …):  2 × n_2 / N = (2 × 2) / 15
    total mass for words seen twice ('m, .):       3 × n_3 / N = (3 × 1) / 15

Issues with Good-Turing discounting
with some solutions

• Zero counts: we cannot assign probabilities if n_{r+1} = 0
• The estimates of some of the frequencies of frequencies are unreliable
• A solution is to replace n_r with smoothed counts z_r
• A well-known technique ('simple Good-Turing') for smoothing n_r is to use linear interpolation:

    log z_r = a + b log r
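A sketch of the Good-Turing quantities for the toy unigram counts (not from the slides):

```python
from collections import Counter

counts = Counter("I 'm sorry , Dave . I 'm afraid I can 't do that .".split())
N = sum(counts.values())                 # 15 tokens
n = Counter(counts.values())             # frequencies of frequencies: n[1]=8, n[2]=2, n[3]=1

def adjusted_count(r):
    """Good-Turing adjusted count r* = (r + 1) * n_{r+1} / n_r."""
    if n[r] == 0 or n[r + 1] == 0:       # undefined when n_{r+1} = 0;
        return None                      # simple Good-Turing would smooth n_r first
    return (r + 1) * n[r + 1] / n[r]

p0 = n[1] / N                            # total mass reserved for unseen events
print(p0, adjusted_count(1), adjusted_count(2))   # 0.533..., 0.5, 1.5
```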


N-grams, so far …

• Two different ways of evaluating n-gram models:
    Extrinsic: success in an external application
    Intrinsic: likelihood, (cross) entropy, perplexity
• Intrinsic evaluation metrics often correlate well with the extrinsic metrics
• Test your n-gram models on an 'unseen' test set

N-grams, so far …

• Smoothing methods solve the zero-count problem (and also reduce the variance)
• Smoothing takes away some probability mass from the observed n-grams and assigns it to unobserved ones
  – Additive smoothing adds a constant α to all counts
      α = 1 (Laplace smoothing) simply adds one to all counts – simple, but often not very useful
      a simple correction is to add a smaller α, which requires tuning on a development set
  – Discounting removes a fixed amount of probability mass, ϵ, from the observed n-grams
      we need to re-normalize the probability estimates
      again, we need a development set to tune ϵ
  – Good-Turing discounting reserves probability mass for unobserved events based on the n-grams seen only once: p_0 = n_1 / N

Not all (unknown) n-grams are equal

• Let's assume that black squirrel is an unknown bigram
• How do we calculate its smoothed probability?

    P_+1(squirrel | black) = (0 + 1) / (C(black) + V)

• How about black wug?

    P_+1(wug | black) = (0 + 1) / (C(black) + V)

• Would it make a difference if we used a better smoothing method (e.g., Good-Turing)?

Back-off and interpolation

The general idea is to fall back to lower-order n-grams when the higher-order estimate is unreliable.

• Even if

    C(black squirrel) = C(black wug) = 0

  it is unlikely that

    C(squirrel) = C(wug)

  in a reasonably sized corpus


Back-off

Back-off uses the higher-order estimate if it is available, and 'backs off' to the lower-order n-gram(s) otherwise:

    P(w_i | w_{i-1}) = { P*(w_i | w_{i-1})   if C(w_{i-1} w_i) > 0
                       { α P(w_i)            otherwise

• P*(·) is the discounted probability
• α makes sure that the backed-off probabilities sum to the discounted amount
• P(w_i) is, typically, a smoothed unigram probability

Interpolation

Interpolation uses a linear combination:

    P_int(w_i | w_{i-1}) = λ P(w_i | w_{i-1}) + (1 − λ) P(w_i)

In general (recursive definition),

    P_int(w_i | w_{i-n+1}^{i-1}) = λ P(w_i | w_{i-n+1}^{i-1}) + (1 − λ) P_int(w_i | w_{i-n+2}^{i-1})

• ∑ λ_i = 1
• The recursion terminates with either smoothed unigram counts or the uniform distribution 1/V
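A minimal sketch of (non-recursive) bigram–unigram interpolation (not from the slides); the λ value is a hypothetical constant rather than a tuned parameter:

```python
from collections import Counter

tokens = "<s> I 'm sorry , Dave . </s>".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
N = len(tokens)

def p_interpolated(w2, w1, lam=0.7):
    """P_int(w2 | w1) = lam * P_MLE(w2 | w1) + (1 - lam) * P_MLE(w2)."""
    p_bi = bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    p_uni = unigrams[w2] / N
    return lam * p_bi + (1 - lam) * p_uni

print(p_interpolated("'m", "I"))     # mixes the bigram and unigram estimates
print(p_interpolated("I", "Dave"))   # unseen bigram still gets some unigram mass
```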


Not all contexts are equal

• Back to our example: given that both bigrams
  – black squirrel
  – wuggy squirrel
  are unknown, the above formulations assign the same probability to both bigrams
• To solve this, the back-off or interpolation parameters (α or λ) are often conditioned on the context
• For example,

    P_int(w_i | w_{i-n+1}^{i-1}) = λ_{w_{i-n+1}^{i-1}} P(w_i | w_{i-n+1}^{i-1}) + (1 − λ_{w_{i-n+1}^{i-1}}) P_int(w_i | w_{i-n+2}^{i-1})

Katz back-off

A popular back-off method is Katz back-off:

    P_Katz(w_i | w_{i-n+1}^{i-1}) = { P*(w_i | w_{i-n+1}^{i-1})                          if C(w_{i-n+1}^{i}) > 0
                                    { α_{w_{i-n+1}^{i-1}} P_Katz(w_i | w_{i-n+2}^{i-1})  otherwise

• P*(·) is the Good-Turing discounted probability estimate (discounting only n-grams with small counts)
• α_{w_{i-n+1}^{i-1}} makes sure that the back-off probabilities sum to the discounted amount
• α is high for frequent contexts, so, hopefully,

    α_black P(squirrel) > α_wuggy P(squirrel)
    P(squirrel | black) > P(squirrel | wuggy)


A quick summary: Markov assumption

Problem: our aim is to assign probabilities to sentences,
    P(I 'm sorry , Dave .) = ?
  but we cannot just count & divide:
  – most sentences are rare: there is no (reliable) way to count their occurrences
  – sentence-internal structure tells a lot about a sentence's probability
Solution: divide up, and simplify with a Markov assumption:
    P(I 'm sorry , Dave .) =
      P(I | ⟨s⟩) P('m | I) P(sorry | 'm) P(, | sorry) P(Dave | ,) P(. | Dave) P(⟨/s⟩ | .)
  Now we can count the parts (n-grams) and estimate their probabilities with MLE.

A quick summary: Smoothing

Problem: the MLE assigns 0 probability to unobserved n-grams, and to any sentence containing unobserved n-grams. In general, it overfits.
Solution: reserve some probability mass for unobserved n-grams.
  Additive smoothing: add α to every count,
    P_+α(w_i | w_{i-n+1}^{i-1}) = (C(w_{i-n+1}^{i}) + α) / (C(w_{i-n+1}^{i-1}) + αV)
  Discounting:
    – reserve a fixed amount of probability mass for unobserved n-grams
    – re-normalize the probabilities of observed n-grams (e.g., Good-Turing smoothing)

A quick summary: Back-off & interpolation

Problem: if unseen, we assign the same probability to
  – black squirrel
  – black wug
Solution: fall back to lower-order n-grams when you cannot estimate the higher-order n-gram.
  Back-off:
    P(w_i | w_{i-1}) = { P*(w_i | w_{i-1})   if C(w_{i-1} w_i) > 0
                       { α P(w_i)            otherwise
  Interpolation:
    P_int(w_i | w_{i-1}) = λ P(w_i | w_{i-1}) + (1 − λ) P(w_i)
Now P(squirrel) contributes to P(squirrel | black), which should therefore be higher than P(wug | black).

A quick summary: Problems with simple back-off / interpolation

Problem: if unseen, we assign the same probability to
  – black squirrel
  – wuggy squirrel
Solution: make the normalizing constants (α, λ) context dependent, higher for context n-grams that are more frequent.
  Back-off:
    P(w_i | w_{i-1}) = { P*(w_i | w_{i-1})    if C(w_{i-1} w_i) > 0
                       { α_{w_{i-1}} P(w_i)   otherwise
  Interpolation:
    P_int(w_i | w_{i-1}) = P*(w_i | w_{i-1}) + λ_{w_{i-1}} P(w_i)
Now P(black) contributes to P(squirrel | black), which should therefore be higher than P(squirrel | wuggy).


Kneser-Ney interpolation: intuition

• Example: I can't see without my reading ___ .
• It turns out that the word Francisco is more frequent than glasses (in a typical English corpus, the PTB)
• But Francisco occurs only in the context San Francisco
• Assigning probabilities to unigrams based on the number of unique contexts they appear in makes glasses more likely

Kneser-Ney interpolation for bigrams

• Use absolute discounting for the higher-order n-gram
• Estimate the lower-order n-gram probabilities based on the probability of the target word occurring in a new context:

    P_KN(w_i | w_{i-1}) = (C(w_{i-1} w_i) − D) / C(w_{i-1})  +  λ_{w_{i-1}} × |{v : C(v w_i) > 0}| / ∑_w |{v : C(v w) > 0}|

  (the first term applies an absolute discount D; the second term is the continuation probability: the number of unique contexts w_i appears in, normalized by the number of all unique contexts)

• The λs make sure that the probabilities sum to 1
• The same idea can be applied to back-off as well (interpolation seems to work better)
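A sketch of interpolated Kneser-Ney for bigrams along the lines of the formula above (not from the slides); the discount D = 0.75 is a hypothetical choice, and max(·, 0) is used to keep discounted counts non-negative:

```python
from collections import Counter

tokens = "<s> I 'm sorry , Dave . </s> <s> I 'm afraid I can 't do that . </s>".split()
bigrams = Counter(zip(tokens, tokens[1:]))
ctx = Counter(w1 for w1, _ in bigrams.elements())      # C(w1) as a bigram context
continuation = Counter(w2 for _, w2 in bigrams)        # |{v : C(v w2) > 0}|
total_continuations = sum(continuation.values())       # sum over all unique contexts
D = 0.75

def p_kn(w2, w1):
    discounted = max(bigrams[(w1, w2)] - D, 0) / ctx[w1]
    n_follow = sum(1 for (a, _b) in bigrams if a == w1) # distinct continuations of w1
    lam = D * n_follow / ctx[w1]                        # leftover mass to redistribute
    p_cont = continuation[w2] / total_continuations     # continuation probability
    return discounted + lam * p_cont

print(p_kn("'m", "I"), p_kn("afraid", "I"))
```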


Some shortcomings of the n-gram language models

• The n-gram language models are simple and successful, but …
• They are highly sensitive to the training data: you do not want to use an n-gram model trained on business news for medical texts
• They cannot handle long-distance dependencies:
    In the last race, the horse he bought last year finally ___ .
• The success often drops in morphologically complex languages
• The smoothing methods are often 'a bag of tricks'

Cluster-based n-grams

• The idea is to cluster the words, and fall back (back off or interpolate) to the cluster
• For example,
  – a clustering algorithm is likely to form a cluster containing words for food, e.g., {apple, pear, broccoli, spinach}
  – if you have never seen eat your broccoli, estimate
      P(broccoli | eat your) = P(FOOD | eat your) × P(broccoli | FOOD)
• Clustering can be
    hard: a word belongs to only one cluster (simplifies the model)
    soft: words can be assigned to clusters probabilistically (more flexible)


Skipping

• The contexts in
    P(boring | the lecture was)
    P(boring | (the) lecture yesterday was)
  are completely different for an n-gram model
• A potential solution is to consider contexts with gaps, 'skipping' one or more words
• We would, for example, model P(e | abcd) with a combination (e.g., interpolation) of
  – P(e | abc_)
  – P(e | ab_d)
  – P(e | a_cd)
  – …

Modeling sentence types

• Another way to improve a language model is to condition on the sentence type
• The idea is that different types of sentences (e.g., ones related to different topics) behave differently
• Sentence types are typically based on clustering
• We create multiple language models, one for each sentence type
• Often a 'general' language model is used as a fall-back

Caching

• If a word is used in a document, its probability of being used again is high
• Caching models condition the probability of a word on a larger context (besides the immediate history), such as
  – the words in the document (if document boundaries are marked)
  – a fixed window around the word

Structured language models

• Another possibility is using a generative parser
• Parsers try to explicitly model (good) sentences
• Parsers naturally capture long-distance dependencies
• Parsers require much more computational resources than the n-gram models
• The improvements are often small (if any)


Maximum entropy models

• We can fit a logistic regression ('max-ent') model predicting P(w | context)
• The main advantage is being able to condition on arbitrary features

Neural language models

• A neural network can be trained to predict a word from its context
• Then we can use the network for estimating P(w | context)
• In the process, the hidden layer(s) of the network learn internal representations for the words
• These representations, known as embeddings, are continuous representations that place similar words in the same neighborhood in a high-dimensional space
• We will return to embeddings later in this course


Some notes on implementation

• The typical use of n-gram models is on (very) large corpora
• We often need to take care of numeric instability issues:
  – for example, it is often more convenient to work with 'log probabilities'
  – sometimes (log) probabilities are 'binned' into integers and stored with a small number of bits in memory
• Memory or storage may become a problem too
  – treating words below a frequency threshold as 'unknown' often helps
  – the choice of the right data structure becomes important
  – a common data structure is a trie or a suffix tree

Summary

• We want to assign probabilities to sentences
• N-gram language models do this by
  – estimating probabilities of parts of the sentence (n-grams)
  – using the n-gram probabilities and a conditional independence assumption to estimate the probability of the sentence
• MLE estimates for n-grams overfit
• Smoothing is a way to fight overfitting
• Back-off and interpolation yield better 'smoothing'
• There are other ways to improve n-gram models, and there are language models without (explicit) use of n-grams

Next: Today: POS tagging; Mon/Fri: Statistical …
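A tiny illustration of the log-probability advice (not from the slides):

```python
from math import log

word_probs = [0.01] * 200                    # a hypothetical long sentence

naive = 1.0
for p in word_probs:
    naive *= p                               # underflows to 0.0 in double precision

log_prob = sum(log(p) for p in word_probs)   # stays comfortably representable
print(naive, log_prob)                       # 0.0  vs.  about -921.03
```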


Additional reading, references, credits

• Textbook reference: Jurafsky and Martin (2009, chapter 4); a draft chapter for the 3rd edition is also available. Some of the examples in these slides come from this book.
• Chen and Goodman (1998) and Chen and Goodman (1999) include a detailed comparison of smoothing methods. The former (the technical report) also includes a tutorial introduction.
• Goodman (2001) studies a number of improvements to (n-gram) language models discussed here. This technical report also includes some introductory material.
• Gale and Sampson (1995) introduce the 'simple' Good-Turing estimation noted in the Good-Turing slides. The article also includes an introduction to the basic method.
• The quote from 2001: A Space Odyssey, 'I'm sorry Dave. I'm afraid I can't do it.', is probably one of the most frequent quotes in the CL literature. It was also quoted, among many others, by Jurafsky and Martin (2009).
• The HAL 9000 camera image (on the 'Unigrams' slide) is from Wikipedia, (re)drawn by Wikipedia user Cryteria.
• The Herman comic used in the 'Speech recognition gone wrong' slide is also a popular example in quite a few lecture slides posted online; it is difficult to find out who used it first.
• The smoothing visualization (the samples from N(0, 1)) was inspired by Julia Hockenmaier's slides.

Chen, Stanley F. and Joshua Goodman (1998). An empirical study of smoothing techniques for language modeling. Tech. rep. TR-10-98. Harvard University, Computer Science Group. URL: https://dash.harvard.edu/handle/1/25104739.
Chen, Stanley F. and Joshua Goodman (1999). "An empirical study of smoothing techniques for language modeling". In: Computer Speech & Language 13.4, pp. 359–394.
Chomsky, Noam (1968). "Quine's empirical assumptions". In: Synthese 19.1, pp. 53–68. DOI: 10.1007/BF00568049.
Gale, William A. and Geoffrey Sampson (1995). "Good-Turing frequency estimation without tears". In: Journal of Quantitative Linguistics 2.3, pp. 217–237.
Goodman, Joshua T. (2001). A bit of progress in language modeling, extended version. Tech. rep. MSR-TR-2001-72. Microsoft Research.
Jurafsky, Daniel and James H. Martin (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Second edition. Pearson Prentice Hall. ISBN: 978-0-13-504196-3.
Shillcock, Richard (1995). "Lexical Hypotheses in Continuous Speech". In: Cognitive Models of Speech Processing. Ed. by Gerry T. M. Altmann. MIT Press.
