
CS 533: Natural Language Processing
Language Modeling

Karl Stratos

Rutgers University

Motivation

How likely are the following sentences?


- the dog barked: "probability 0.1"
- the cat barked: "probability 0.03"
- dog the barked: "probability 0.00005"
- oqc shgwqw#w 1g0: "probability 10^{-13}"

Language Model: Definition

A language model is a function that defines a probability distribution p(x1 . . . xm) over all sentences x1 . . . xm.

Goal: Design a good language model, in particular

p(the dog barked) > p(the cat barked) > p(dog the barked) > p(oqc shgwqw#w 1g0)

Language Models Are Everywhere

Text Generation with Modern Language Models

Try it yourself: https://talktotransformer.com/

Overview

Probability of a Sentence

n-Gram Language Models
  Unigram, Bigram, Trigram Models
  Estimation from Data
  Evaluation
  Smoothing

Log-Linear Language Models

Problem Statement

- We'll assume a finite vocabulary V (i.e., the set of all possible types).

- Sample space: Ω = {x1 . . . xm ∈ V^m : m ≥ 1}

- Task: Design a function p over Ω such that

  p(x1 . . . xm) ≥ 0   ∀ x1 . . . xm ∈ Ω

  ∑_{x1...xm ∈ Ω} p(x1 . . . xm) = 1

- What are some challenges?

Challenge 1: Infinitely Many Sentences

- Can we "break up" the probability of a sentence into probabilities of individual words?

- Yes: Assume a generative process.

- We may assume that each sentence x1 . . . xm is generated as:
  (1) x1 is drawn from p(·)
  (2) x2 is drawn from p(·|x1)
  (3) x3 is drawn from p(·|x1, x2)
  ...
  (m) xm is drawn from p(·|x1, . . . , xm−1)
  (m+1) xm+1 is drawn from p(·|x1, . . . , xm)
  where xm+1 = STOP is a special token at the end of every sentence.
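
To make this concrete, here is a minimal ancestral-sampling sketch in Python. It assumes a hypothetical function cond_prob(history) that returns the conditional distribution p(·|x1, . . . , xi) as a dict over V plus the STOP token; the token string "<STOP>" and the length cap are illustrative choices.

```python
import random

STOP = "<STOP>"

def sample_sentence(cond_prob, max_len=50):
    """Ancestral sampling: draw words left to right until STOP is drawn.

    `cond_prob(history)` is assumed to return a dict mapping each word in V
    (plus STOP) to its conditional probability given the history so far.
    """
    history, sentence = [], []
    for _ in range(max_len):
        dist = cond_prob(history)
        words, probs = zip(*dist.items())
        word = random.choices(words, weights=probs, k=1)[0]
        if word == STOP:
            break
        sentence.append(word)
        history.append(word)
    return sentence
```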

Justification of the Generative Assumption

By the chain rule,

p(x1 . . . xm STOP) = p(x1) × p(x2|x1) × p(x3|x1, x2) × · · ·

· · · × p(xm|x1, . . . , xm−1) × p(STOP|x1, . . . , xm)

Thus we have solved the first challenge.

- Sample space = finite V

- The model still defines a proper distribution over all sentences.

(Does the generative process need to be left-to-right?)

STOP Symbol

Ensures that there is probability mass left for longer sentences

Probability mass of sentences with length ≥ 1:

  ∑_{x∈V} p(x) = 1 − p(STOP) = 1    (since P(X1 = STOP) = 0)

Probability mass of sentences with length ≥ 2:

  1 − ∑_{x∈V} p(x STOP) > 0    (the sum is P(X2 = STOP))

Probability mass of sentences with length ≥ 3:

  1 − ∑_{x∈V} p(x STOP) − ∑_{x,x′∈V} p(x x′ STOP) > 0    (the sums are P(X2 = STOP) and P(X3 = STOP))

Challenge 2: Infinitely Many Distributions

Under the generative process, we need infinitely many conditional word distributions:

p(x1) ∀x1 ∈ V

p(x2|x1) ∀x1, x2 ∈ V

p(x3|x1, x2) ∀x1, x2, x3 ∈ V

p(x4|x1, x2, x3) ∀x1, x2, x3, x4 ∈ V
. . .

Now our goal is to redesign the model to have only a finite, compact set of associated values.


Independence Assumptions

X is independent of Y if P(X = x|Y = y) = P(X = x)

X is conditionally independent of Y given Z if P(X = x|Y = y, Z = z) = P(X = x|Z = z)

Can you think of such X, Y, Z?

Unigram Language Model

Assumption. A word is independent of all previous words:

p(xi|x1 . . . xi−1) = p(xi)

That is,

  p(x1 . . . xm) = ∏_{i=1}^{m} p(xi)

Number of parameters: O(|V|)

Not a very good language model:

p(the dog barked) = p(dog the barked)

Bigram Language Model

Assumption. A word is independent of all previous words conditioned on the preceding word:

p(xi|x1 . . . xi−1) = p(xi|xi−1)

That is,

  p(x1 . . . xm) = ∏_{i=1}^{m} p(xi|xi−1)

where x0 = * is a special token at the start of every sentence.

Number of parameters: O(|V|^2)
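
As a quick illustration of this factorization, here is a small sketch that scores a sentence under a bigram model; it assumes the model is given as a dict q mapping (previous word, word) pairs to conditional probabilities, with illustrative token strings for * and STOP.

```python
START, STOP = "*", "<STOP>"

def bigram_sentence_prob(q, sentence):
    """p(x1 ... xm) = q(x1|*) * q(x2|x1) * ... * q(STOP|xm) under a bigram model.

    `q` is a dict mapping (previous_word, word) pairs to q(word | previous_word).
    """
    tokens = [START] + sentence + [STOP]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= q.get((prev, curr), 0.0)  # unseen bigrams get probability 0 (before smoothing)
    return prob

# e.g. bigram_sentence_prob(q, ["the", "cat", "barked"])
#   = q(the|*) * q(cat|the) * q(barked|cat) * q(STOP|barked)
```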

Trigram Language Model

Assumption. A word is independent of all previous words conditioned on the two preceding words:

p(xi|x1 . . . xi−1) = p(xi|xi−2, xi−1)

That is,

  p(x1 . . . xm) = ∏_{i=1}^{m} p(xi|xi−2, xi−1)

where x−1, x0 = * are special tokens at the start of every sentence.

Number of parameters: O(|V|^3)

n-Gram Language Model

Assumption. A word is independent of all previous words conditioned on the n − 1 preceding words:

p(xi|x1 . . . xi−1) = p(xi|xi−n+1, . . . , xi−1)

Number of parameters: O(|V|^n)

This kind of conditional independence assumption (“depends only on the last n − 1 states. . . ”) is called a Markov assumption.

- Is this a reasonable assumption for language modeling?


A Practical Question

- Summary so far: We have designed probabilistic language models parametrized by finitely many values.

- Bigram model: Stores a table of O(|V|^2) values

  q(x′|x)   ∀ x, x′ ∈ V

  (plus q(x|*) and q(STOP|x)) representing transition probabilities and computes

  p(the cat barked) = q(the|*) × q(cat|the) × q(barked|cat) × q(STOP|barked)

- Q. But where do we get these values?

Estimation from Data

- Our data is a corpus of N sentences x^(1) . . . x^(N).

- Define count(x, x′) to be the number of times x is immediately followed by x′ (called "bigram counts"):

  count(x, x′) = ∑_{i=1}^{N} ∑_{j=1}^{li+1} 1[ xj−1^(i) = x, xj^(i) = x′ ]

  (li = length of x^(i) and x_{li+1}^(i) = STOP)

- Define count(x) := ∑_{x′} count(x, x′) (called "unigram counts").

Example Counts

Corpus:

- the dog chased the cat

- the cat chased the mouse

- the mouse chased the dog

Example bigram/unigram counts:

  count(*, the) = 3          count(the) = 6
  count(chased, the) = 3     count(chased) = 3
  count(the, dog) = 2        count(*) = 3
  count(cat, STOP) = 1       count(cat) = 2

Parameter Estimates

- For all x, x′ with count(x, x′) > 0, set

  q(x′|x) = count(x, x′) / count(x)

  Otherwise q(x′|x) = 0.

- In the previous example:

  q(the|*) = 3/3 = 1
  q(dog|the) = 2/6 = 1/3
  q(chased|dog) = 1/2 = 0.5
  q(STOP|cat) = 1/2 = 0.5
  q(dog|cat) = 0

- Called maximum likelihood estimation (MLE).
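
The counts and estimates above can be reproduced with a short sketch; the token strings used for * and STOP and the whitespace tokenization are illustrative choices.

```python
from collections import Counter

START, STOP = "*", "<STOP>"

corpus = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "the mouse chased the dog",
]

# Bigram counts count(x, x') and unigram counts count(x) = sum_{x'} count(x, x'),
# with * prepended and STOP appended to every sentence.
bigram, unigram = Counter(), Counter()
for line in corpus:
    tokens = [START] + line.split() + [STOP]
    for prev, curr in zip(tokens, tokens[1:]):
        bigram[(prev, curr)] += 1
        unigram[prev] += 1

def q_mle(curr, prev):
    """MLE bigram probability q(curr | prev) = count(prev, curr) / count(prev)."""
    return bigram[(prev, curr)] / unigram[prev] if unigram[prev] else 0.0

print(q_mle("the", START))     # 1.0
print(q_mle("dog", "the"))     # 0.333...
print(q_mle("chased", "dog"))  # 0.5
print(q_mle("dog", "cat"))     # 0.0
```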

Justification of MLE

Claim. The solution of the constrained optimization problem

  q* = arg max_q ∑_{i=1}^{N} ∑_{j=1}^{li+1} log q(xj|xj−1)

  subject to  q(x′|x) ≥ 0  ∀ x, x′   and   ∑_{x′∈V} q(x′|x) = 1  ∀ x

is given by

  q*(x′|x) = count(x, x′) / count(x)

(Proof?)

MLE: Other n-Gram Models

Unigram:

  q(x) = count(x) / N    (where N is the total number of tokens)

Bigram:

  q(x′|x) = count(x, x′) / count(x)

Trigram:

  q(x″|x, x′) = count(x, x′, x″) / count(x, x′)


Evaluation of a Language Model

“How good is the model at predicting unseen sentences?”

Held-out corpus: Used for evaluation purposes only

Do not use held-out data for training the model!

Popular evaluation metric: perplexity

What We Are Doing: Conditional Density Estimation

- True context-word pairs distributed as (c, w) ∼ p_{CW}

- We define some language model q_{W|C}

- Learning: minimize cross entropy between p_{W|C} and q_{W|C}

  H(p_{W|C}, q_{W|C}) = E_{(c,w)∼p_{CW}} [ −ln q_{W|C}(w|c) ]

  Number of nats to encode the behavior of p_{W|C} using q_{W|C}

  H(p_{W|C}) ≤ H(p_{W|C}, q_{W|C}) ≤ ln |V|

- Evaluation of q_{W|C}: check how small H(p_{W|C}, q_{W|C}) is!

- If the model class of q_{W|C} is universally expressive, an optimal model q*_{W|C} will satisfy H(p_{W|C}, q*_{W|C}) = H(p_{W|C}) with q*_{W|C} = p_{W|C}.

Perplexity

- Exponentiated cross entropy:

  PP(p_{W|C}, q_{W|C}) = e^{H(p_{W|C}, q_{W|C})}

- Interpretation: effective vocabulary size

  e^{H(p_{W|C})} ≤ PP(p_{W|C}, q_{W|C}) ≤ |V|

- Empirical estimation: given (c_1, w_1) . . . (c_N, w_N) ∼ p_{CW},

  \hat{PP}(p_{W|C}, q_{W|C}) = e^{ (1/N) ∑_{i=1}^{N} −ln q_{W|C}(w_i|c_i) }

  What is the empirical perplexity when q_{W|C}(w_i|c_i) = 1 for all i? When q_{W|C}(w_i|c_i) = 1/|V| for all i?
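
A minimal sketch of the empirical estimate, given the model probabilities q_{W|C}(w_i|c_i) on held-out context-word pairs. The two example calls answer the question above: a model that always assigns probability 1 has perplexity 1, and a uniform model over |V| = 50,000 words has perplexity 50,000.

```python
import math

def empirical_perplexity(probs):
    """exp((1/N) * sum_i -ln q(w_i | c_i)) over held-out pairs.

    `probs` is a list of model probabilities q(w_i | c_i), one per pair.
    """
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_nll)

print(empirical_perplexity([1.0, 1.0, 1.0]))  # 1.0 (perfect model)
print(empirical_perplexity([1 / 50000] * 3))  # ≈ 50000 (uniform over |V| = 50,000)
```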

Example Perplexity Values for n-Gram Models

- Using vocabulary size |V| = 50,000 (Goodman, 2001)

- Unigram: 955, Bigram: 137, Trigram: 74

- Modern neural language models: probably ≪ 20

- The big question: what is the minimum perplexity achievable with machines?


Smoothing: Additive

In practice, it’s important to smooth estimation to avoid zero probabilities for unseen words:

  q_α(x′|x) = (#(x, x′) + α) / (#(x) + α|V|)

Also called Laplace smoothing: https://en.wikipedia.org/wiki/Additive_smoothing
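
A minimal sketch of additive smoothing on top of bigram counts; the Counter-style count tables are assumed to be built as in the earlier counting sketch, and the function name is illustrative.

```python
def q_additive(curr, prev, bigram, unigram, vocab_size, alpha=1.0):
    """Additive (Laplace) smoothing of the bigram MLE:
    q_alpha(curr | prev) = (#(prev, curr) + alpha) / (#(prev) + alpha * |V|).

    `bigram` and `unigram` are Counter-style count tables; `vocab_size` is |V|.
    """
    return (bigram[(prev, curr)] + alpha) / (unigram[prev] + alpha * vocab_size)

# Unseen bigrams now receive a small but nonzero probability, e.g.
# q_additive("dog", "cat", bigram, unigram, vocab_size=5, alpha=0.1) > 0
```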

Smoothing: Interpolation

With limited data, enforcing generalization by using less context also helps:

  q^{smoothed}(x″|x, x′) = λ1 q_α(x″|x, x′) + λ2 q_α(x″|x′) + λ3 q_α(x″)

where λ1 + λ2 + λ3 = 1 and λi ≥ 0. Called linear interpolation.
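
A sketch of linear interpolation, assuming q_tri, q_bi, q_uni are callables returning (possibly additively smoothed) trigram, bigram, and unigram estimates; the interpolation weights λ are typically tuned on held-out data.

```python
def q_interpolated(curr, prev2, prev1, q_tri, q_bi, q_uni, lambdas=(0.5, 0.3, 0.2)):
    """Linear interpolation of trigram, bigram, and unigram estimates:
    q(curr | prev2, prev1) = l1*q_tri + l2*q_bi + l3*q_uni  with  l1 + l2 + l3 = 1.
    """
    l1, l2, l3 = lambdas
    return (l1 * q_tri(curr, prev2, prev1)
            + l2 * q_bi(curr, prev1)
            + l3 * q_uni(curr))
```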

Many Other Smoothing Techniques

- Kneser-Ney smoothing: Section 3.5 of https://web.stanford.edu/~jurafsky/slp3/3.pdf

- Good-Turing estimator: the "missing mass" problem

- On the Convergence Rate of Good-Turing Estimators (McAllester and Schapire, 2001)

Final Aside: Tokenization

- Text: initially a single string "Call me Ishmael."

- Naive tokenization: by space: [Call, me, Ishmael.]

- English-specific tokenization using rules or a statistical model: [Call, me, Ishmael, .] (see the sketch after this list)

- Language-independent tokenization learned from data
  - Wordpiece: https://arxiv.org/pdf/1609.08144.pdf
  - Byte-Pair Encoding: https://arxiv.org/pdf/1508.07909.pdf
  - Sentencepiece: https://www.aclweb.org/anthology/D18-2012.pdf
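
A tiny illustration of the first two bullets; the regex rule below is only a toy stand-in for a real rule-based tokenizer.

```python
import re

text = "Call me Ishmael."

# Naive tokenization: split on whitespace; punctuation sticks to words.
print(text.split())                      # ['Call', 'me', 'Ishmael.']

# Minimal rule-based tokenization: peel punctuation off into its own token.
# (Real tokenizers handle many more cases: contractions, abbreviations, numbers, ...)
print(re.findall(r"\w+|[^\w\s]", text))  # ['Call', 'me', 'Ishmael', '.']
```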


Bashing n-Gram Models

- Model parameters: probabilities q(x′|x)

- Training requires constrained optimization

  q* = arg max_q ∑_{i=1}^{N} ∑_{j=1}^{li+1} log q(xj|xj−1)

  subject to  q(x′|x) ≥ 0  ∀ x, x′   and   ∑_{x′∈V} q(x′|x) = 1  ∀ x

  Though easy to solve, it is not clear how to develop more complex functions

- Brittle: function of raw n-gram identities

- Generalizing to unseen n-grams requires explicit smoothing (cumbersome)

Feature Function

- Design a feature representation φ(x1 . . . xn) ∈ R^d of any n-gram x1 . . . xn ∈ V^n

- For example,

  φ(dog saw) = (0, 0, 0, 1, 0, . . . , 0, 1, 0, . . . , 0, 1, 0, . . . , 0, 0)

  might be a vector in {0, 1}^{|V|+|V|^2} indicating the presence of the unigrams "dog" and "saw" and also the bigram "dog saw"
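
A minimal dense sketch of such an indicator feature function; in practice one would use sparse vectors or feature hashing, and the helper names here are illustrative.

```python
import numpy as np

def make_feature_fn(vocab):
    """Indicator features for a bigram (x1, x2): one slot per unigram plus one
    slot per possible bigram, i.e. a {0,1} vector of dimension |V| + |V|^2."""
    index = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    def phi(x1, x2):
        vec = np.zeros(V + V * V)
        vec[index[x1]] = 1.0                      # unigram x1
        vec[index[x2]] = 1.0                      # unigram x2
        vec[V + index[x1] * V + index[x2]] = 1.0  # bigram (x1, x2)
        return vec

    return phi

phi = make_feature_fn(["dog", "saw", "the", "cat"])
print(phi("dog", "saw").nonzero())  # three active features: "dog", "saw", "dog saw"
```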

The Softmax Function

- Given any v ∈ R^D, we define softmax(v) ∈ [0, 1]^D to be a vector such that

  softmax_i(v) = e^{v_i} / ∑_{j=1}^{D} e^{v_j}    ∀ i = 1 . . . D

- Check nonnegativity and normalization

- Softmax transforms any length-D vector into a distribution over D items
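
A standard implementation sketch; subtracting max(v) before exponentiating leaves the output unchanged (the shift cancels in the ratio) but avoids numerical overflow.

```python
import numpy as np

def softmax(v):
    """softmax_i(v) = exp(v_i) / sum_j exp(v_j), computed stably."""
    z = np.exp(v - np.max(v))
    return z / z.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p, p.sum())  # nonnegative entries that sum to 1
```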

Log-Linear (n-Gram) Language Models

- Model parameter: w ∈ R^d

- Given x1 . . . xn−1, defines a conditional distribution over V by

  q(x|x1 . . . xn−1; w) = softmax_x( [w^⊤ φ(x1 . . . xn−1, x)]_{x∈V} )

- Reason it's called log-linear:

  ln q(x|x1 . . . xn−1; w) = w^⊤ φ(x1 . . . xn−1, x) − ln ∑_{x′∈V} e^{w^⊤ φ(x1...xn−1, x′)}
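
A minimal sketch of this conditional distribution, assuming phi(context, x) returns a NumPy feature vector for the (n − 1)-word context paired with candidate word x, and w is a vector of the same dimension.

```python
import numpy as np

def loglinear_dist(w, phi, context, vocab):
    """q(x | context; w) = softmax over scores w . phi(context, x) for x in V."""
    scores = np.array([w @ phi(context, x) for x in vocab])
    z = np.exp(scores - scores.max())  # stable softmax
    return z / z.sum()

# Usage sketch: probs = loglinear_dist(w, phi, ("the", "dog"), vocab)
# gives one probability per word in vocab, in the same order as vocab.
```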

Training: Unconstrained Optimization

- Assume N samples (x1^(i) . . . xn−1^(i), x^(i)), find

  w* = arg max_{w ∈ R^d} J(w),   where   J(w) = ∑_{i=1}^{N} ln q(x^(i) | x1^(i) . . . xn−1^(i); w)

- Unlike n-gram models, there is no closed-form solution for max_w J(w)

- But this optimization problem is actually more "standard" because we can just do gradient ascent

- It can be checked that J(w) is concave, so gradient ascent will get us w*
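
A minimal batch gradient ascent sketch under the same assumptions (phi returns NumPy feature vectors of dimension dim, data is a list of (context, target) pairs); in practice one would use stochastic or mini-batch updates and regularization. It uses the fact that ∇_w ln q(x|c; w) = φ(c, x) − E_{x′∼q(·|c;w)}[φ(c, x′)].

```python
import numpy as np

def log_likelihood_grad(w, phi, context, target, vocab):
    """Gradient of ln q(target | context; w) for the log-linear model:
    observed features minus expected features under q(. | context; w)."""
    feats = np.stack([phi(context, x) for x in vocab])  # one row per candidate word
    scores = feats @ w
    q = np.exp(scores - scores.max())
    q /= q.sum()                                        # q(x | context; w) for each x
    return phi(context, target) - q @ feats

def gradient_ascent(phi, data, vocab, dim, lr=0.1, epochs=20):
    """Batch gradient ascent on J(w) = sum_i ln q(x^(i) | context^(i); w).
    J is concave, so a small enough step size approaches a maximizer w*."""
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = sum(log_likelihood_grad(w, phi, c, x, vocab) for c, x in data)
        w = w + lr * grad
    return w
```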
