Language Modeling

CS 533: Natural Language Processing
Karl Stratos, Rutgers University

Motivation

How likely are the following sentences?
- the dog barked ("probability 0.1")
- the cat barked ("probability 0.03")
- dog the barked ("probability 0.00005")
- oqc shgwqw#w 1g0 ("probability 10^-13")

Language Model: Definition

A language model is a function that defines a probability distribution p(x_1 ... x_m) over all sentences x_1 ... x_m.

Goal: Design a good language model, in particular

  p(the dog barked) > p(the cat barked) > p(dog the barked) > p(oqc shgwqw#w 1g0)

Language Models Are Everywhere

Text Generation with Modern Language Models

Try it yourself: https://talktotransformer.com/

Overview

- Probability of a Sentence
- n-Gram Language Models
  - Unigram, Bigram, Trigram Models
  - Estimation from Data
  - Evaluation
  - Smoothing
- Log-Linear Language Models

Problem Statement

- We'll assume a finite vocabulary V (i.e., the set of all possible word types).
- Sample space: Omega = {x_1 ... x_m in V^m : m >= 1}
- Task: Design a function p over Omega such that

    p(x_1 ... x_m) >= 0   for all x_1 ... x_m in Omega
    sum_{x_1 ... x_m in Omega} p(x_1 ... x_m) = 1

- What are some challenges?

Challenge 1: Infinitely Many Sentences

- Can we "break up" the probability of a sentence into probabilities of individual words?
- Yes: assume a generative process.
- We may assume that each sentence x_1 ... x_m is generated as
  (1) x_1 is drawn from p(.),
  (2) x_2 is drawn from p(.|x_1),
  (3) x_3 is drawn from p(.|x_1, x_2),
  ...
  (m) x_m is drawn from p(.|x_1, ..., x_{m-1}),
  (m+1) x_{m+1} is drawn from p(.|x_1, ..., x_m),
  where x_{m+1} = STOP is a special token at the end of every sentence.

Justification of the Generative Assumption

By the chain rule,

  p(x_1 ... x_m STOP) = p(x_1) * p(x_2|x_1) * p(x_3|x_1, x_2) * ... * p(x_m|x_1, ..., x_{m-1}) * p(STOP|x_1, ..., x_m)

Thus we have solved the first challenge.
- The sample space at each step is the finite vocabulary V (plus STOP).
- The model still defines a proper distribution over all (infinitely many) sentences.
(Does the generative process need to be left-to-right?)

STOP Symbol

Ensures that there is probability mass left for longer sentences.

Probability mass of sentences with length >= 1:

  1 - p(STOP) = 1,   since P(X_1 = STOP) = 0

Probability mass of sentences with length >= 2:

  1 - sum_{x in V} p(x STOP) > 0,   where the sum equals P(X_2 = STOP)

Probability mass of sentences with length >= 3:

  1 - sum_{x in V} p(x STOP) - sum_{x, x' in V} p(x x' STOP) > 0,   where the two sums equal P(X_2 = STOP) and P(X_3 = STOP)

Challenge 2: Infinitely Many Distributions

Under the generative process, we need infinitely many conditional word distributions:

  p(x_1)                 for all x_1 in V
  p(x_2|x_1)             for all x_1, x_2 in V
  p(x_3|x_1, x_2)        for all x_1, x_2, x_3 in V
  p(x_4|x_1, x_2, x_3)   for all x_1, x_2, x_3, x_4 in V
  ...

Now our goal is to redesign the model to have only a finite, compact set of associated values.
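Before turning to that redesign, here is a minimal Python sketch (not from the slides) of the chain-rule factorization and the role of STOP; the tiny vocabulary and the history-ignoring conditional p_next are made-up stand-ins for a real model.

```python
from itertools import product

# Sketch: score a sentence with the chain-rule factorization
# p(x_1 ... x_m STOP) = prod_i p(x_i | x_1 ... x_{i-1}) * p(STOP | x_1 ... x_m).
V = ["the", "dog", "cat", "barked"]
STOP = "<stop>"

def p_next(word, history):
    # Toy conditional: the first word cannot be STOP (so the empty sentence has
    # probability 0, as on the slides); otherwise uniform over V plus STOP,
    # regardless of the history.
    if not history:
        return 0.0 if word == STOP else 1.0 / len(V)
    return 1.0 / (len(V) + 1)

def sentence_prob(words):
    prob, history = 1.0, []
    for w in list(words) + [STOP]:
        prob *= p_next(w, history)
        history.append(w)
    return prob

print(sentence_prob(["the", "dog", "barked"]))  # (1/4) * (1/5)**3 = 0.002

# Because every sentence ends with STOP, the total mass over sentences of
# length <= L approaches 1 as L grows (here it is a geometric series).
mass = sum(sentence_prob(s) for L in range(1, 9) for s in product(V, repeat=L))
print(mass)  # 1 - (4/5)**8, about 0.83; the rest sits on longer sentences
```

Replacing p_next with a model that looks at only a bounded amount of history is exactly the redesign discussed next.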
Independence Assumptions

X is independent of Y if

  P(X = x | Y = y) = P(X = x)

X is conditionally independent of Y given Z if

  P(X = x | Y = y, Z = z) = P(X = x | Z = z)

Can you think of such X, Y, Z?

Unigram Language Model

Assumption. A word is independent of all previous words:

  p(x_i | x_1 ... x_{i-1}) = p(x_i)

That is,

  p(x_1 ... x_m) = prod_{i=1}^{m} p(x_i)

Number of parameters: O(|V|)

Not a very good language model: p(the dog barked) = p(dog the barked)

Bigram Language Model

Assumption. A word is independent of all previous words conditioned on the preceding word:

  p(x_i | x_1 ... x_{i-1}) = p(x_i | x_{i-1})

That is,

  p(x_1 ... x_m) = prod_{i=1}^{m} p(x_i | x_{i-1})

where x_0 = * is a special token at the start of every sentence.

Number of parameters: O(|V|^2)

Trigram Language Model

Assumption. A word is independent of all previous words conditioned on the two preceding words:

  p(x_i | x_1 ... x_{i-1}) = p(x_i | x_{i-2}, x_{i-1})

That is,

  p(x_1 ... x_m) = prod_{i=1}^{m} p(x_i | x_{i-2}, x_{i-1})

where x_{-1} = x_0 = * are special tokens at the start of every sentence.

Number of parameters: O(|V|^3)

n-Gram Language Model

Assumption. A word is independent of all previous words conditioned on the n-1 preceding words:

  p(x_i | x_1 ... x_{i-1}) = p(x_i | x_{i-n+1}, ..., x_{i-1})

Number of parameters: O(|V|^n)

This kind of conditional independence assumption ("depends only on the last n-1 states...") is called a Markov assumption.
- Is this a reasonable assumption for language modeling?

A Practical Question

- Summary so far: we have designed probabilistic language models parametrized by finitely many values.
- Bigram model: stores a table of O(|V|^2) values

    q(x'|x)   for all x, x' in V

  (plus q(x|*) and q(STOP|x)) representing transition probabilities, and computes

    p(the cat barked) = q(the|*) * q(cat|the) * q(barked|cat) * q(STOP|barked)

- Q. But where do we get these values?

Estimation from Data

- Our data is a corpus of N sentences x^(1), ..., x^(N).
- Define count(x, x') to be the number of times x, x' appear together (called "bigram counts"):

    count(x, x') = sum_{i=1}^{N} sum_{j=1}^{l_i + 1} 1[ x^(i)_{j-1} = x and x^(i)_j = x' ]

  (l_i = length of x^(i), and x^(i)_{l_i + 1} = STOP)
- Define count(x) := sum_{x'} count(x, x') (called "unigram counts").

Example Counts

Corpus:
- the dog chased the cat
- the cat chased the mouse
- the mouse chased the dog

Example bigram/unigram counts:

  count(*, the) = 3        count(the) = 6
  count(chased, the) = 3   count(chased) = 3
  count(the, dog) = 2      count(*) = 3
  count(cat, STOP) = 1     count(cat) = 2

Parameter Estimates

For all x, x' with count(x, x') > 0, set

  q(x'|x) = count(x, x') / count(x)

Otherwise q(x'|x) = 0.
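These counts and estimates take only a few lines to compute. A minimal sketch, not from the slides: START and the "<stop>" string are stand-ins for the * and STOP symbols, and the printed values match the hand-computed estimates listed next.

```python
from collections import Counter

# Sketch: bigram/unigram counts and MLE estimates for the toy corpus above.
START, STOP = "*", "<stop>"

corpus = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "the mouse chased the dog",
]

bigram = Counter()   # count(x, x')
unigram = Counter()  # count(x) = sum_{x'} count(x, x')

for sentence in corpus:
    words = [START] + sentence.split() + [STOP]
    for x, x_next in zip(words, words[1:]):
        bigram[(x, x_next)] += 1
        unigram[x] += 1

def q(x_next, x):
    # MLE: q(x'|x) = count(x, x') / count(x); zero if the bigram was never seen.
    return bigram[(x, x_next)] / unigram[x] if bigram[(x, x_next)] else 0.0

print(bigram[(START, "the")], unigram["the"])  # 3 6
print(q("the", START), q("dog", "the"))        # 1.0 0.333...
print(q("chased", "dog"), q(STOP, "cat"))      # 0.5 0.5
print(q("dog", "cat"))                         # 0.0
```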
In the previous example:

  q(the|*) = 3/3 = 1
  q(chased|dog) = 1/2 = 0.5
  q(dog|the) = 2/6 = 0.33...
  q(STOP|cat) = 1/2 = 0.5
  q(dog|cat) = 0

This is called maximum likelihood estimation (MLE).

Justification of MLE

Claim. The solution of the constrained optimization problem

  q* = argmax_{q}  sum_{i=1}^{N} sum_{j=1}^{l_i + 1} log q(x^(i)_j | x^(i)_{j-1})
       subject to  q(x'|x) >= 0 for all x, x'   and   sum_{x' in V} q(x'|x) = 1 for all x

is given by

  q*(x'|x) = count(x, x') / count(x)

(Proof?)

MLE: Other n-Gram Models

  Unigram:  q(x) = count(x) / N   (where N here denotes the total number of tokens)
  Bigram:   q(x'|x) = count(x, x') / count(x)
  Trigram:  q(x''|x, x') = count(x, x', x'') / count(x, x')

Evaluation of a Language Model

"How good is the model at predicting unseen sentences?"

Held-out corpus: used for evaluation purposes only.
Do not use held-out data for training the model!

Popular evaluation metric: perplexity

What We Are Doing: Conditional Density Estimation

- True context-word pairs are distributed as (c, w) ~ p_CW.
- We define some language model q_{W|C}.
- Learning: minimize the cross entropy between p_{W|C} and q_{W|C},

    H(p_{W|C}, q_{W|C}) = E_{(c,w) ~ p_CW} [ -ln q_{W|C}(w|c) ]

  the number of nats needed to encode the behavior of p_{W|C} using q_{W|C}, with

    H(p_{W|C}) <= H(p_{W|C}, q_{W|C}) <= ln |V|

- Evaluation of q_{W|C}: check how small H(p_{W|C}, q_{W|C}) is!
- If the model class of q_{W|C} is universally expressive, an optimal model q*_{W|C} will satisfy H(p_{W|C}, q*_{W|C}) = H(p_{W|C}) with q*_{W|C} = p_{W|C}.

Perplexity

- Exponentiated cross entropy:

    PP(p_{W|C}, q_{W|C}) = e^{H(p_{W|C}, q_{W|C})}

- Interpretation: effective vocabulary size,

    e^{H(p_{W|C})} <= PP(p_{W|C}, q_{W|C}) <= |V|

- Empirical estimation: given (c_1, w_1), ..., (c_N, w_N) ~ p_CW,

    PP_hat(p_{W|C}, q_{W|C}) = e^{ (1/N) sum_{i=1}^{N} -ln q_{W|C}(w_i|c_i) }

- What is the empirical perplexity when q_{W|C}(w_i|c_i) = 1 for all i? When q_{W|C}(w_i|c_i) = 1/|V| for all i?

Example Perplexity Values for n-Gram Models

- Using vocabulary size |V| = 50,000 (Goodman, 2001):
  Unigram: 955, Bigram: 137, Trigram: 74
- Modern neural language models: probably around 20
- The big question: what is the minimum perplexity achievable with machines?

Smoothing: Additive

In practice, it's important to smooth the estimates to avoid zero probabilities for unseen words:

  q^alpha(x'|x) = (count(x, x') + alpha) / (count(x) + alpha |V|)

Also called Laplace smoothing: https://en.wikipedia.org/wiki/Additive_smoothing

Smoothing: Interpolation

With limited data, enforcing generalization by also using less context helps:

  q^smoothed(x''|x, x') = lambda_1 q^alpha(x''|x, x') + lambda_2 q^alpha(x''|x') + lambda_3 q^alpha(x'')

where lambda_1 + lambda_2 + lambda_3 = 1 and lambda_i >= 0.
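As a rough illustration of these two smoothing ideas, here is a sketch (not from the slides) that builds add-alpha-smoothed unigram, bigram, and trigram estimates from the earlier toy corpus and interpolates them; the particular alpha and lambda values, and names like q_alpha_bi and q_smoothed, are arbitrary illustrative choices.

```python
from collections import Counter

# Sketch: add-alpha (Laplace) smoothing plus linear interpolation.
START, STOP = "*", "<stop>"
corpus = ["the dog chased the cat",
          "the cat chased the mouse",
          "the mouse chased the dog"]

uni, bi, tri = Counter(), Counter(), Counter()  # n-gram counts
bi_ctx, tri_ctx = Counter(), Counter()          # context counts count(x), count(x, x')
vocab = set()
for sentence in corpus:
    words = [START, START] + sentence.split() + [STOP]
    vocab.update(words[2:])
    for i in range(2, len(words)):
        uni[words[i]] += 1
        bi[(words[i - 1], words[i])] += 1
        tri[(words[i - 2], words[i - 1], words[i])] += 1
        bi_ctx[words[i - 1]] += 1
        tri_ctx[(words[i - 2], words[i - 1])] += 1

V = len(vocab)         # vocabulary size (including STOP) used in smoothing
N = sum(uni.values())  # total number of tokens

def q_alpha_uni(w, a=0.1):
    return (uni[w] + a) / (N + a * V)

def q_alpha_bi(w, prev1, a=0.1):
    # q_alpha(x'|x) = (count(x, x') + alpha) / (count(x) + alpha |V|)
    return (bi[(prev1, w)] + a) / (bi_ctx[prev1] + a * V)

def q_alpha_tri(w, prev2, prev1, a=0.1):
    return (tri[(prev2, prev1, w)] + a) / (tri_ctx[(prev2, prev1)] + a * V)

def q_smoothed(w, prev2, prev1, lambdas=(0.5, 0.3, 0.2)):
    # q(x''|x, x') = l1 q_alpha(x''|x, x') + l2 q_alpha(x''|x') + l3 q_alpha(x'')
    l1, l2, l3 = lambdas  # must sum to 1, each >= 0
    return (l1 * q_alpha_tri(w, prev2, prev1)
            + l2 * q_alpha_bi(w, prev1)
            + l3 * q_alpha_uni(w))

print(q_alpha_bi("cat", "dog"))  # > 0 even though (dog, cat) was never seen
print(q_smoothed("mouse", "chased", "the"))
```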

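Finally, the empirical perplexity formula from the evaluation section also fits in a few lines and applies to any conditional model q(w|c), including the smoothed estimates above. A sketch with made-up held-out pairs; it also answers the question from the perplexity slide: a model that always assigns probability 1 gets perplexity 1, and the uniform model over |V| words gets perplexity |V|.

```python
import math

# Sketch: empirical perplexity exp( (1/N) * sum_i -ln q(w_i | c_i) ).
def perplexity(model, held_out_pairs):
    total_nll = sum(-math.log(model(w, c)) for c, w in held_out_pairs)
    return math.exp(total_nll / len(held_out_pairs))

V_size = 50000  # vocabulary size from the Goodman (2001) figures above
print(perplexity(lambda w, c: 1.0, [("ctx", "w")] * 10))           # 1.0
print(perplexity(lambda w, c: 1.0 / V_size, [("ctx", "w")] * 10))  # about 50000
```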