SFU NatLangLab

Natural Language Processing

Anoop Sarkar anoopsarkar.github.io/nlp-class

Simon Fraser University

October 21, 2019

Natural Language Processing

Anoop Sarkar anoopsarkar.github.io/nlp-class

Simon Fraser University

Part 1: Probability and Language

Probability and Language

Quick guide to probability theory

Entropy and Information Theory

Probability and Language

Assign a probability to an input sequence

Given a URL: choosespain.com. What is this website about?

Input           Scoring function
choose spain    -8.35
chooses pain    -9.88
...

The Goal

Find a good scoring function for input sequences.

Scoring Hypotheses in Speech Recognition

From acoustic signal to candidate transcriptions

Hypothesis                                        Score
the station signs are in deep in english          -14732
the stations signs are in deep in english         -14735
the station signs are in deep into english        -14739
the station 's signs are in deep in english       -14740
the station signs are in deep in the english      -14741
the station signs are indeed in english           -14757
the station 's signs are indeed in english        -14760
the station signs are indians in english          -14790
the station signs are indian in english           -14799
the stations signs are indians in english         -14807
the stations signs are indians and english        -14815

Scoring Hypotheses in Machine Translation

From source language to target language candidates

Hypothesis                              Score
we must also discuss a vision .         -29.63
we must also discuss on a vision .      -31.58
it is also discuss a vision .           -31.96
we must discuss on greater vision .     -36.09
...

Scoring Hypotheses in Decryption

Character substitutions on ciphertext to plaintext candidates

Hypothesis                              Score
Heopaj, zk ukq swjp pk gjks w oaynap?   -93
Urbcnw, mx hxd fjwc cx twxf j bnlanc?   -92
Wtdepy, oz jzf hlye ez vyzh l dpncpe?   -91
Mjtufo, ep zpv xbou up lopx b tfdsfu?   -89
Nkuvgp, fq aqw ycpv vq mpqy c ugetgv?   -87
Gdnozi, yj tjp rvio oj fijr v nzxmzo?   -86
Czjkve, uf pfl nrek kf befn r jvtivk?   -85
Yvfgra, qb lbh jnag gb xabj n frperg?   -84
Zwghsb, rc mci kobh hc ybck o gsqfsh?   -83
Byijud, te oek mqdj je adem q iushuj?   -77
Jgqrcl, bm wms uylr rm ilmu y qcapcr?   -76
Listen, do you want to know a secret?   -25

The Goal

- Write down a model over sequences of words or letters.
- Learn the parameters of the model from data.
- Use the model to predict the probability of new sequences.

Natural Language Processing

Anoop Sarkar anoopsarkar.github.io/nlp-class

Simon Fraser University

Part 2: Quick guide to probability theory

Probability and Language

Quick guide to probability theory

Entropy and Information Theory

Probability: The Basics

- Sample space
- Event space
- Random variable

Probability distributions

- P(X): the probability of random variable X having a certain value.
- P(X = killer) = 1.05e-05
- P(X = app) = 1.19e-05

Joint probability

- P(X, Y): the probability that X and Y each have a certain value.
- Let Y stand for the choice of a word.
- Let X stand for the choice of the word that occurs before Y.
- P(X = killer, Y = app) = 1.24e-10

Joint Probability: P(X=value AND Y=value)

- Since X = value AND Y = value, the order does not matter:
- P(X = killer, Y = app) ⇔ P(Y = app, X = killer)
- In both cases it is P(X, Y) = P(Y, X) = P('killer app')
- In NLP, we often use numerical indices to express this: P(Wi−1 = killer, Wi = app)

Joint probability

Joint probability table

Wi−1      Wi        P(Wi−1, Wi)
<s>       app       1.16e-19
an        app       1.76e-08
killer    app       1.24e-10
the       app       2.68e-07
this      app       3.74e-08
your      app       2.39e-08

There will be a similar table for each choice of Wi .

Get P(Wi) from P(Wi−1, Wi):

P(Wi = app) = Σ_x P(Wi−1 = x, Wi = app) = 1.19e-05
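As a small illustration of this marginalization, here is a sketch in Python that sums the joint probabilities over the previous word. Only the rows shown in the table above are included, so the result is a partial sum rather than the full marginal P(Wi = app).

```python
# Minimal sketch: marginalize a joint distribution P(W_{i-1}, W_i)
# over the previous word to approximate P(W_i = "app").
# Only the rows listed in the table above are included, so this is
# a partial sum; the full table would give 1.19e-05.
joint = {
    ("<s>", "app"): 1.16e-19,
    ("an", "app"): 1.76e-08,
    ("killer", "app"): 1.24e-10,
    ("the", "app"): 2.68e-07,
    ("this", "app"): 3.74e-08,
    ("your", "app"): 2.39e-08,
}

p_app_partial = sum(p for (prev, w), p in joint.items() if w == "app")
print(p_app_partial)  # about 3.5e-07: a partial sum over the rows shown
```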

Conditional probability

- P(Wi | Wi−1): the probability that Wi has a certain value after fixing the value of Wi−1.
- P(Wi = app | Wi−1 = killer)
- P(Wi = app | Wi−1 = the)

Conditional probability from joint probability:

P(Wi | Wi−1) = P(Wi−1, Wi) / P(Wi−1)

- P(killer) = 1.05e-05
- P(killer, app) = 1.24e-10
- P(app | killer) = 1.24e-10 / 1.05e-05 ≈ 1.18e-05
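A quick check of this computation in Python, using the numbers quoted above:

```python
# Conditional probability from the joint and the marginal,
# using the numbers quoted on this slide.
p_killer = 1.05e-05        # P(W_{i-1} = killer)
p_killer_app = 1.24e-10    # P(W_{i-1} = killer, W_i = app)

print(p_killer_app / p_killer)  # ~1.18e-05, i.e. P(app | killer)
```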

Basic Terms

- P(e): a priori probability, or just the prior
- P(f | e): conditional probability; the chance of f given e
- P(e, f): joint probability; the chance of e and f both happening
- If e and f are independent then we can write P(e, f) = P(e) × P(f)
- If e and f are not independent then we can write P(e, f) = P(e) × P(f | e), or equivalently P(e, f) = P(f) × ?

Basic Terms

- Addition of integers: Σ_{i=1..n} i = 1 + 2 + 3 + ... + n
- Product of integers: Π_{i=1..n} i = 1 × 2 × 3 × ... × n
- Factoring out a constant: Σ_{i=1..n} (i × k) = k + 2k + 3k + ... + nk = k × Σ_{i=1..n} i
- Product with a constant: Π_{i=1..n} (i × k) = 1k × 2k × ... × nk = k^n × Π_{i=1..n} i

Probability: Axioms

- P measures the total probability of a set of events
- P(∅) = 0
- P(all events) = 1
- P(X) ≤ P(Y) for any X ⊆ Y
- P(X) + P(Y) = P(X ∪ Y) provided that X ∩ Y = ∅

Probability Axioms

- All events sum to one: Σ_e P(e) = 1

- Marginal probability P(f): P(f) = Σ_e P(e, f)
- Conditional probability: Σ_e P(e | f) = Σ_e P(e, f) / P(f) = (1 / P(f)) Σ_e P(e, f) = 1

- Computing P(f) from the axioms: P(f) = Σ_e P(e) × P(f | e)
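To make these axioms concrete, here is a small sketch that checks them numerically on a made-up joint distribution over two variables (the probability values are purely illustrative):

```python
# Minimal sketch: checking the axioms on a small made-up joint
# distribution P(e, f) over two binary variables.
joint = {("e1", "f1"): 0.2, ("e1", "f2"): 0.3,
         ("e2", "f1"): 0.4, ("e2", "f2"): 0.1}

# All events sum to one.
assert abs(sum(joint.values()) - 1.0) < 1e-12

# Marginal: P(f) = sum_e P(e, f).
def p_f(f):
    return sum(p for (e, g), p in joint.items() if g == f)

# Conditional: P(e | f) = P(e, f) / P(f); these sum to one over e.
def p_e_given_f(e, f):
    return joint[(e, f)] / p_f(f)

total = sum(p_e_given_f(e, "f1") for e in ("e1", "e2"))
print(p_f("f1"), total)   # 0.6 and 1.0
```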

Probability: The Chain Rule

- Consider P(a, b, c, d | e).
- We can simplify this using the chain rule:

  P(a, b, c, d | e) = P(d | e) · P(c | d, e) · P(b | c, d, e) · P(a | b, c, d, e)

- To see why this is possible, recall that P(X | Y) = P(X, Y) / P(Y)
- So: P(a, b, c, d, e) / P(e) = [P(d, e) / P(e)] · [P(c, d, e) / P(d, e)] · [P(b, c, d, e) / P(c, d, e)] · [P(a, b, c, d, e) / P(b, c, d, e)]
- We can approximate the probability by removing some random variables from the context. For example, keeping at most two variables in each term gives:

P(a, b, c, d | e) ≈ P(d | e) · P(c | d, e) · P(b | c, e) · P(a | b, e)

Probability: The Chain Rule

- P(e1, e2, ..., en) = P(e1) × P(e2 | e1) × P(e3 | e1, e2) × ...

  P(e1, e2, ..., en) = Π_{i=1..n} P(ei | ei−1, ei−2, ..., e1)

- In NLP, we call:
  - P(ei): the unigram probability
  - P(ei | ei−1): the bigram probability
  - P(ei | ei−1, ei−2): the trigram probability
  - P(ei | ei−1, ei−2, ..., ei−(n−1)): the n-gram probability
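As an illustration of bigram probabilities, here is a minimal sketch that estimates P(ei | ei−1) by relative frequency from a tiny toy corpus. The corpus and the <s>/</s> boundary markers are assumptions made for the example, and no smoothing is applied.

```python
# Minimal sketch: maximum-likelihood bigram probabilities from a tiny
# toy corpus (no smoothing). The corpus is made up for illustration.
from collections import Counter

corpus = [["<s>", "a", "killer", "app", "</s>"],
          ["<s>", "the", "app", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i - 1], sent[i])
                  for sent in corpus for i in range(1, len(sent)))

def p_bigram(prev, w):
    """P(w | prev) estimated as count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("killer", "app"))  # 1.0 in this toy corpus
print(p_bigram("<s>", "the"))     # 0.5
```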

Probability: Random Variables and Events

- What is y in P(y)?
- It is shorthand for the value assigned to a random variable Y, e.g. Y = y.
- y is an element of some implicit event space E.

Probability: Random Variables and Events

- The marginal probability P(y) can be computed from P(x, y) as follows: P(y) = Σ_{x∈E} P(x, y)
- Finding the value that maximizes the probability:

  x̂ = argmax_{x∈E} P(x)
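A minimal sketch of taking this argmax over a small distribution stored as a Python dict; the events and values are made up for illustration:

```python
# Minimal sketch: argmax over a distribution stored as a dict
# (the values are made up for illustration).
dist = {"spain": 0.7, "pain": 0.2, "main": 0.1}

best = max(dist, key=dist.get)   # the x that maximizes P(x)
print(best)                      # "spain"
```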

Log Probability Arithmetic

- Practical problem with tiny P(e) values: numerical underflow
- One solution is to use logs:

log(P(e)) = log(p1 × p2 × ... × pn)

= log(p1) + log(p2) + ... + log(pn)

- Note that: x = exp(log(x))
- Also more efficient: addition instead of multiplication
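The underflow problem and the log-space fix can be seen directly in a short sketch; the probability value and sequence length below are arbitrary choices for the demonstration:

```python
# Minimal sketch: multiplying many small probabilities underflows to
# 0.0 in floating point, while summing log probabilities stays usable.
import math

probs = [1e-5] * 100          # 100 tokens, each with probability 1e-5

product = 1.0
for p in probs:
    product *= p
print(product)                 # 0.0 -- underflow

log_prob = sum(math.log(p) for p in probs)
print(log_prob)                # about -1151.3, perfectly representable
```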

Log Probability Arithmetic

p      log2(p)
0.0    −∞
0.1    −3.32
0.2    −2.32
0.3    −1.74
0.4    −1.32
0.5    −1.00
0.6    −0.74
0.7    −0.51
0.8    −0.32
0.9    −0.15
1.0    0.00

Log Probability Arithmetic

- So: 0.5 × 0.5 × ... × 0.5 = (0.5)^n might get too small, but the corresponding sum of logs, −1 − 1 − ... − 1 = −n, is manageable.
- Another useful fact when writing code (log2 is log to the base 2):

log2(x) = log10(x) / log10(2)
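A one-line check of this change-of-base formula in Python:

```python
# Change of base: compute log2(x) from log10, and compare with the built-in.
import math

x = 0.3
print(math.log10(x) / math.log10(2))  # -1.7369...
print(math.log2(x))                   # same value via the built-in
```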

Natural Language Processing

Anoop Sarkar anoopsarkar.github.io/nlp-class

Simon Fraser University

Part 3: Entropy and Information Theory

Probability and Language

Quick guide to probability theory

Entropy and Information Theory

Information Theory

- Information theory is the use of probability theory to quantify and measure “information”.
- Consider the task of efficiently sending a message. Sender Alice wants to send several messages to Receiver Bob. Alice wants to do this as efficiently as possible.
- Let's say that Alice is sending a message where the entire message is just one character a, e.g. aaaa.... In this case we can save space by simply sending the length of the message and the single character.

Information Theory

- Now let's say that Alice is sending a completely random signal to Bob. If it is random then we cannot exploit anything in the message to compress it any further.
- The expected number of bits it takes to transmit messages drawn from some (possibly infinite) set of messages is what is called entropy.
- This formulation of entropy by Claude Shannon was adapted from thermodynamics, converting information into a quantity that can be measured.
- Information theory is built around this notion of message compression as a way to evaluate the amount of information.

Expectation

- For a probability distribution p
- The expectation with respect to p is a weighted average:

  E_p[x] = (x1 · p(x1) + x2 · p(x2) + ... + xn · p(xn)) / (p(x1) + p(x2) + ... + p(xn))
         = x1 · p(x1) + x2 · p(x2) + ... + xn · p(xn)    (the denominator sums to 1)
         = Σ_{x∈E} x · p(x)

- Example: for a six-sided die the expectation is:

  E_p[x] = 1 · (1/6) + 2 · (1/6) + ... + 6 · (1/6) = 3.5
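The same die expectation as a short computation:

```python
# Expectation of a fair six-sided die as a weighted sum over outcomes.
faces = range(1, 7)
p = 1 / 6
expectation = sum(x * p for x in faces)
print(expectation)  # 3.5 (up to floating-point rounding)
```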

Entropy

- For a probability distribution p
- The entropy of p is:

  H(p) = − Σ_{x∈E} p(x) · log2 p(x)

- Any base can be used for the log, but base 2 means that entropy is measured in bits.
- What is the expected number of bits with respect to p?

−Ep[log2 p(x)] = H(p)

- Entropy answers the question: What is the expected number of bits needed to transmit messages from event space E, where p(x) defines the probability of observing x?

Perplexity

- The value 2^H(p) is called the perplexity of a distribution p.
- Perplexity is the weighted average number of choices a random variable has to make.
- Choosing between 8 equally likely options has entropy H = 3 bits and perplexity 2^3 = 8.
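A small sketch computing entropy in bits and the corresponding perplexity, using the 8-way uniform distribution from the example above:

```python
# Entropy (in bits) and perplexity of a distribution.
import math

def entropy(p):
    """H(p) = -sum_x p(x) * log2 p(x), skipping zero-probability events."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

uniform8 = {x: 1 / 8 for x in range(8)}   # 8 equally likely options
h = entropy(uniform8)
print(h, 2 ** h)   # 3.0 bits, perplexity 8.0
```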

Relative Entropy

- We often wish to determine the divergence of a distribution q from another distribution p.
- Let's say q is the estimate and p is the true probability distribution.
- We define the divergence of q from p as the relative entropy, written D(p‖q):

  D(p‖q) = − Σ_{x∈E} p(x) · log2 (q(x) / p(x))

- Note that D(p‖q) = −E_p[ log2 (q(x) / p(x)) ]
- The relative entropy is also called the Kullback-Leibler divergence.

Cross Entropy and Relative Entropy

- The relative entropy can be written as the sum of two terms:

  D(p‖q) = − Σ_{x∈E} p(x) · log2 (q(x) / p(x))
         = − Σ_x p(x) · log2 q(x) + Σ_x p(x) · log2 p(x)

- We know that H(p) = − Σ_x p(x) · log2 p(x)
- Similarly, define H(p, q) = − Σ_x p(x) · log2 q(x)

  D(p‖q) = H(p, q) − H(p)
  relative entropy(p, q) = cross entropy(p, q) − entropy(p)

- The term H(p, q) is called the cross entropy.
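A sketch that computes D(p‖q) both directly and as H(p, q) − H(p), on a pair of made-up distributions, to confirm the two expressions agree:

```python
# KL divergence as cross entropy minus entropy, on two small
# made-up distributions over the same events.
import math

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # assumed "true" distribution
q = {"a": 0.25, "b": 0.5, "c": 0.25}   # assumed estimated distribution

def entropy(p):
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    return -sum(p[x] * math.log2(q[x]) for x in p if p[x] > 0)

kl = cross_entropy(p, q) - entropy(p)
direct = sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)
print(kl, direct)   # both 0.25: D(p||q)
```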

Cross Entropy and Relative Entropy

- H(p, q) ≥ H(p) always.
- D(p‖q) ≥ 0 always, and D(p‖q) = 0 iff p = q.
- D(p‖q) is not a true distance:
  - It is asymmetric: D(p‖q) ≠ D(q‖p)
  - It does not obey the triangle inequality: D(p‖q) is not always ≤ D(p‖r) + D(r‖q)

Conditional Entropy and Mutual Information

- Entropy of a random variable X:

  H(X) = − Σ_{x∈E} p(x) · log2 p(x)

- Conditional entropy between two random variables X and Y:

  H(X | Y) = − Σ_{x,y∈E} p(x, y) · log2 p(x | y)

- Mutual information between two random variables X and Y:

  I(X; Y) = D(p(x, y) ‖ p(x)p(y)) = Σ_x Σ_y p(x, y) · log2 ( p(x, y) / (p(x) p(y)) )
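A sketch computing mutual information from a small made-up joint distribution by marginalizing and then applying the formula above:

```python
# Mutual information I(X; Y) from a small made-up joint distribution,
# using I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) ).
import math
from collections import defaultdict

joint = {                       # assumed toy joint distribution p(x, y)
    ("rain", "umbrella"): 0.3,
    ("rain", "none"): 0.1,
    ("sun", "umbrella"): 0.1,
    ("sun", "none"): 0.5,
}

# Marginals p(x) and p(y).
px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

mi = sum(p * math.log2(p / (px[x] * py[y]))
         for (x, y), p in joint.items() if p > 0)
print(mi)   # mutual information in bits
```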

Acknowledgements

Many slides borrowed or inspired from lecture notes by Michael Collins, Chris Dyer, Kevin Knight, Chris Manning, Philipp Koehn, Adam Lopez, Graham Neubig, Richard Socher and Luke Zettlemoyer from their NLP course materials.

All mistakes are my own.

A big thank you to all the students who read through these notes and helped me improve them.
