
Foundations
Prof. Mari Ostendorf

Outline

• Why statistical models of language?
• Mathematical fundamentals
  – Basic probability theory
  – Basic information theory
  – Different classifier design approaches
• Bag-of-words models
  – Naive Bayes vs. TF-IDF
  – Feature selection

Why statistical models? Three related issues:

• Ambiguity in language (one text can have different meanings)
• Variability in language (one meaning can be expressed with different words)
• Ignorance modeling

Ambiguity

(Example from Jurafsky & Martin): I made her duck.

Some possible interpretations of the text out of context:

I cooked waterfowl for her.
I cooked the waterfowl that is on the plate in front of her.
I created a toy (or decorative) waterfowl for her.
I caused her to quickly lower her head.
I waved my magic wand and turned her into a ......

These illustrate syntactic ambiguity (e.g. noun vs. verb for “duck”, pronoun vs. possessive for “her”) and semantic ambiguity (e.g. cooked vs. created vs. turned into for “made”). If the words actually correspond to speech, there are additional types of variability associated with discourse characteristics and speech recognition errors:

I made HER duck. vs. I made her DUCK.
I made her duck? (doubt, disbelief) vs. statement form
Ai made her duck. (where “Ai” is the name of a person)
A maid heard “uck”.

Ambiguity often occurs when the model covers only a subset of the different aspects of language, such as covering syntax but not topical context and not accounting for prosodic cues (e.g. emphasis, phrasing). Classic speech examples:

write vs. right vs. Wright (homophones)
Wreck a nice beach vs. Recognize speech

Variability

Wording differences can be syntactic or lexical, due to presentation mode and the author/audience relation.

The chicken crossed the road.
The road was crossed by the chicken.
The chicken has traversed the road.
Across the road went the chicken.
The daughter of the rooster made it to the other side of the street.
A chick- uh I mean the chicken you know like crossed the the road.

Variability due to source can have a big effect on language processing systems. (Consider Seattle PI vs. NY Times.) In a recent study by Kirchhoff et al. (unpublished), text source was more strongly correlated with machine translation error rate than any other factor.

Ignorance Modeling

The basic idea: acknowledge the fact that you don’t yet have rules that account for all sources of variability/ambiguity.

Example from speech recognition: Gaussian mixture models for observation distributions can represent a range of pronunciations for a given word.

Example from language modeling: Deterministic grammars of disfluencies generally haven’t worked well because people do not have good intuitions for where disfluencies happen. We filter them out in everyday conversation. Learning the characteristics automatically has been more effective, e.g. “um” is more likely at main clause boundaries and “uh” is more likely at self-corrections.

Mathematical Fundamentals

Main topics:

• Key concepts from probability theory
• Essential information theory
• Classifier design

Random Variables, Random Vectors and Random Processes

A random vector is a fixed-length ordered collection of random variables. A random process is a sequence of random variables.

Notation: Capital letters indicate random variables; lower case letters indicate the values that random variables take on; boldface indicates a vector.

random variable: X, x ∈ A
random vector: X = [X_1 ··· X_k], x ∈ ∏_k A_k
random process: {X(n)} = X(1), X(2), . . ., x(n) ∈ A ∀n
vector random process: {X(n)} = X(1), X(2), . . ., x(n) ∈ ∏_k A_k ∀n

In language processing, n typically takes on integer values (n ∈ Z), but random processes more generally can have continuous-valued time arguments. We use A to denote the sample space, or set of values that the variable can take on, which is usually (but not always) discrete for language processing.

Language Examples:

Consider a document d = {w(1), w(2), . . . , w(T)} with an unknown and variable length T.

• Text classification: identify the topic of the document from a fixed set.
  – index of the topic of the document (class label): discrete random variable
  – vector of word counts for a fixed set of words (observation): discrete random vector (or continuous if using weighted counts)
• Language classification: what language is the text written in?
  – index of the language of the document (class label): discrete random variable
  – sequence of letters (observation): discrete-valued scalar random process
• Name detection, e.g. identify whether a sub-sequence of words in a sentence is the name of a person.
  – word IDs can be start of name, continuation of name, non-name (class label sequence): discrete random process
  – vector of word features such as flags for “in name dictionary”, “is capitalized”, “is preceded by punctuation”, “is an acronym”, etc. (observation): vector random process
• Author verification: was this document written by Shakespeare?
  – answer is yes or no (class label): binary random variable
  – ideas from students.... (observation): mixed random vector
• What is the attitude/mood of the document?
  – rating of the mood of the document (value in [-1,1]): continuous random variable
  – flags for presence of certain words and word combinations (observation): discrete random vector

Random Vectors

Random vectors are simply a collection of random variables. The simplest possible probability distribution for a vector is to treat all elements as independent. The independence assumption in the context of a class-conditional distribution

p(x_1, . . . , x_K | c) = ∏_{i=1}^{K} p(x_i | c)

is often referred to as the “Naive Bayes” (NB) model. If each variable X_i can take on V values, then there are KV parameters in the distribution (for the non-parametric case). If you represent the dependence between all variables, then there are V^K parameters.

Consider two examples of a two-dimensional p(x, y). In which case do you have NB (p(x, y) = p(x)p(y))?

        y = 0   y = 1   p(x)
x = 0   1/4     1/2     3/4
x = 1   1/12    1/6     1/4
p(y)    1/3     2/3

        y = 0   y = 1   p(x)
x = 0   .5      .2      .7
x = 1   .3      0       .3
p(y)    .8      .2
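As a quick check (not part of the original notes), here is a minimal Python sketch that tests whether p(x, y) = p(x)p(y) holds for the two tables above; the table encoding and tolerance are my own choices.

```python
# A minimal sketch: check whether p(x, y) = p(x) * p(y) for the two example tables.

def is_independent(joint, tol=1e-9):
    """joint[x][y] = p(x, y); True if p(x, y) == p(x) p(y) for every cell."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    return all(abs(joint[x][y] - px[x] * py[y]) < tol
               for x in range(len(joint)) for y in range(len(joint[0])))

table_1 = [[1/4, 1/2], [1/12, 1/6]]
table_2 = [[0.5, 0.2], [0.3, 0.0]]

print(is_independent(table_1))   # True: every cell factors, so NB holds exactly
print(is_independent(table_2))   # False: p(x=0)p(y=0) = 0.7*0.8 = 0.56 != 0.5
```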

Important random processes

Unlike random vectors, you can’t in general write down a single joint distribution for a random process, because the sequence has unbounded length and you may be interested in a subsequence of arbitrary length. Instead, we have a “recipe” for specifying the joint probability distribution for any k random variables pulled from the sequence:

p(x(n_1), x(n_2), . . . , x(n_k))

This recipe corresponds to distribution assumptions associated with the random variables and the dependence between them. Some examples:

• Independent and identically distributed (i.i.d.) process:

  p(x(n_1), x(n_2), . . . , x(n_k)) = ∏_{i=1}^{k} p(x(n_i))

  This is the simplest possible distribution assumption. Note that time order doesn’t matter.

• Markov process (the next simplest model):

  p(x(1), x(2), . . . , x(k)) = p(x(1)) ∏_{i=2}^{k} p(x(i) | x(i−1))

  The model is specified in terms of sequential times, but you can get the joint distribution for any set of times by marginalizing over the sequence that contains them. (A small numerical sketch follows this list.)

• n-gram: extend the history in the Markov model to length n − 1 (see 6/25 lecture)
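Below is a minimal Python sketch (not from the original notes) of the i.i.d. and Markov factorizations above; the two-symbol alphabet, initial probabilities, and transition probabilities are made up for illustration.

```python
# A minimal sketch of the i.i.d. and Markov (bigram) factorizations, with
# made-up probabilities over the alphabet {'a', 'b'}.

p_init = {'a': 0.6, 'b': 0.4}                          # p(x(1))
p_trans = {('a', 'a'): 0.7, ('a', 'b'): 0.3,           # p(x(i) | x(i-1))
           ('b', 'a'): 0.5, ('b', 'b'): 0.5}

def markov_prob(seq):
    """p(x(1)) * prod_{i>=2} p(x(i) | x(i-1))."""
    prob = p_init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= p_trans[(prev, cur)]
    return prob

def iid_prob(seq, p=p_init):
    """i.i.d. case: the product of per-symbol probabilities; order is irrelevant."""
    prob = 1.0
    for s in seq:
        prob *= p[s]
    return prob

print(markov_prob(['a', 'a', 'b']))   # 0.6 * 0.7 * 0.3 = 0.126
print(iid_prob(['a', 'a', 'b']))      # 0.6 * 0.6 * 0.4 = 0.144
```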

Essential Information Theory

Entropy

For a discrete random variable X that can take on V values,

H(X) = − Σ_x p(x) log p(x)

which gives a quantity measured in bits if the log is base 2. Properties of entropy:

• H(X) ≥ 0, where H(X) = 0 for the deterministic case
• H(X) ≤ log V, where H(X) = log V for the uniform distribution

Entropy is a measure of the uncertainty associated with a distribution; there’s more uncertainty with a flatter distribution.

The Twenty-Questions Interpretation: H(X) is the number of clever yes/no questions you need to ask (on average) to guess the value of X.
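A minimal Python sketch (not from the original notes) of the entropy definition and the boundary properties above; the example distributions are made up.

```python
# A minimal sketch: entropy in bits for a discrete distribution, illustrating
# 0 for the deterministic case and log2(V) for the uniform case.
import math

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute 0."""
    return sum(-px * math.log2(px) for px in p if px > 0)

print(entropy([1.0, 0.0]))    # 0.0  (deterministic case)
print(entropy([0.5, 0.5]))    # 1.0  (uniform over V = 2 values: log2(2))
print(entropy([0.9, 0.1]))    # ~0.47 (skewed distribution: less uncertainty than uniform)
```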

Entropy of a Sequence

In theory:

H_rate = lim_{n→∞} (1/n) H(X_1, . . . , X_n)

In practice, we often work with the empirical distribution and finite n:

H̃_r = − (1/n) log p(x_1, . . . , x_n)

Implications for Language Processing: Entropy (or perplexity PP = 2^{H̃_r}) indicates how easy (or hard) a task is.

Measuring the Entropy of English

Goal: estimate the per-symbol entropy of a sequence

H_r = lim_{n→∞} (1/n) H(X_1, . . . , X_n)

using V = 27 for the 26 letters plus space (ignoring capitalization and punctuation, which is not entirely reasonable, but...)

• Automatic model-based estimates
  – Shannon: letter n-gram models; a bigram model gave 2.8 bits/symbol (bps)
  – Brown et al.: word n-gram models, computing the next-letter probability by summing over all words with a matching prefix; results are close to the human estimates
• Human-based estimates
  – Shannon: humans guess the next letter, giving 1.3 bps
  – Cover & King ’78: humans bet on the next letter, giving 1.34 bps

Example: guess the next letters o- ohio state’s pretty big isn’t it yeah yeah i mean uh it’s you know we’re about to do like the uh fiesta bowl there oh yeah

When you have two random variables, you can compute joint and conditional entropy:

H(X, Y) = − Σ_x Σ_y p(x, y) log p(x, y)

H(Y | X) = − Σ_x Σ_y p(x, y) log p(y | x) = H(X, Y) − H(X)

Properties: 0 ≤ H(Y|X) ≤ H(Y)

Mutual Information

Amount of information in common between X and Y:

I(X; Y) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x) p(y)) ]

which you can show is related to the above entropies as:

I(X; Y) = H(X) + H(Y) − H(X, Y) = H(Y) − H(Y|X) = H(X) − H(X|Y)

Properties: I(X; Y) = I(Y; X), I(X; Y) ≥ 0

Implications for Language Processing: I(X; Y) indicates how much feature Y reduces the uncertainty of X, so it is sometimes used in feature selection.
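A minimal Python sketch (not from the original notes) computing I(X; Y) from a joint table via the identity I(X; Y) = H(X) + H(Y) − H(X, Y); the joint tables reused here are the two Naive Bayes examples from earlier.

```python
# A minimal sketch: mutual information from a joint table, using the entropy identities above.
import math

def H(p):
    """Entropy in bits of a list of probabilities."""
    return -sum(v * math.log2(v) for v in p if v > 0)

def mutual_information(joint):
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat = [v for row in joint for v in row]
    return H(px) + H(py) - H(flat)        # I(X;Y) = H(X) + H(Y) - H(X,Y)

print(mutual_information([[1/4, 1/2], [1/12, 1/6]]))   # ~0.0 (independent table)
print(mutual_information([[0.5, 0.2], [0.3, 0.0]]))    # ~0.12 > 0 (dependent table)
```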

Entropy vs. Cross Entropy

H(X) = − Σ_x p(x) log p(x)   vs.   H_c(X) = − Σ_x p(x) log q(x)

Properties: H_c(X) ≥ H(X)

Implications for Language Processing: Cross entropy is computed in practice a lot, since we rarely know the true p(x). One frequently sees:

H_e(X) = − Σ_x p_e(x) log q(x) = − (1/n) Σ_i log q(x_i)

where q(x) is a model and p_e(x) is the empirical distribution associated with a collection of samples {x_i}, i.e. the relative frequency of x in {x_i}.
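A minimal Python sketch (not from the original notes) of the empirical cross-entropy above and the corresponding perplexity 2^H; the sample and the two candidate models q are made up.

```python
# A minimal sketch: empirical cross-entropy of a model q on a sample, and perplexity.
import math

def empirical_cross_entropy(samples, q):
    """-(1/n) * sum_i log2 q(x_i)."""
    return -sum(math.log2(q[x]) for x in samples) / len(samples)

samples = ['the', 'cat', 'sat', 'the', 'cat']
q_good = {'the': 0.4, 'cat': 0.4, 'sat': 0.2}    # matches the empirical frequencies
q_flat = {'the': 1/3, 'cat': 1/3, 'sat': 1/3}    # uniform model

for q in (q_good, q_flat):
    h = empirical_cross_entropy(samples, q)
    print(h, 2 ** h)   # the better-matched model gives lower cross-entropy and perplexity
```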

Relative Entropy (Kullback-Leibler distance)

D(p||q) = Σ_x p(x) log [ p(x) / q(x) ]

Relative entropy is cross-entropy minus entropy, and D(p||q) ≥ 0. Relative entropy is often used to compute a “distance” between distributions, but it is typically NOT symmetric, i.e. D(p||q) ≠ D(q||p). Mutual information can be expressed in terms of relative entropy: I(X; Y) = D(p(x, y) || p(x)p(y))

Implications for Language Processing: Relative entropy can be used as a “distance” between distributions, e.g. for clustering.
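A minimal Python sketch (not from the original notes) of D(p||q) on two made-up distributions over the same support, showing non-negativity and the lack of symmetry.

```python
# A minimal sketch: KL "distance" between two made-up distributions.
import math

def kl(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)), assuming q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

print(kl(p, q))   # ~0.12
print(kl(q, p))   # ~0.13 (a different value: the "distance" is not symmetric)
```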

Designing Classifiers

Three main approaches:

• Estimate class-conditional distributions and use them with minimum-error decision theory rules (generative model)
  Examples: naive Bayes, Markov model, hidden Markov model, ...
• Find a decision rule directly to minimize error (discriminative model)
  Examples: support vector machine, neural network, ...
• Decide on the class of a test sample based on the labels of the training samples it is most similar to
  Examples: k-nearest neighbors, memory-based learning, ...

Generative Modeling Approach

1. Get class-labeled data (called the training data or the learning set).
2. Assume a model for the class-conditional distributions (i.e. the form of the generating distributions).
3. Learn the parameters of the model from the training data – this is an estimation problem.
4. Design a classifier to minimize error given these distributions – this is a decision theory problem.
5. Given new data, use the classifier for making a decision about the class.
6. If the new data is labeled (a heldout subset of your original data set), then you can estimate the performance of your decision rule by counting errors.

Step 3: Estimate the class-conditional model

For estimating the parameters of a model, a popular technique is maximum likelihood (ML) estimation. In this case, you assume the form of the model p(y|θ) and are given Y = Y_1, Y_2, . . . , Y_T presumed to be generated by the model. Then the optimal solution is

θ̂_ML = argmax_θ p(Y|θ) = argmax_θ log p(Y|θ),

with the necessary conditions for the ML estimate being

∇_θ log p(Y|θ) = 0.

Some good things about the ML estimate are that

• it is simple,
• it is consistent (gives the right answer with infinite data), and
• it is theoretically optimal for use in classification *IF* the distribution assumption is correct and there is no prior knowledge.

However, a problem is that there is no guarantee of a good estimate if the distribution assumption is wrong. For practical purposes, it can sometimes be better to use a wrong but simple distribution assumption (leading to a simple classifier) with a discriminative (not ML) parameter estimation criterion. That said, we’ll mainly use ML estimation for simplicity.

Examples:

• For a multinomial (e.g. roll of a die), the ML estimate is the relative frequency estimate (see the sketch below).
• For a Gaussian, the ML estimate is the sample mean and variance.
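A minimal Python sketch (not from the original notes) of the multinomial example: the ML estimate is just the relative frequency of each outcome. The die rolls are made up.

```python
# A minimal sketch: ML estimation for a multinomial is relative frequency.
from collections import Counter

rolls = [1, 3, 3, 6, 2, 3, 1, 6, 6, 6]   # made-up die rolls
counts = Counter(rolls)
n = len(rolls)

theta_ml = {face: counts[face] / n for face in range(1, 7)}
print(theta_ml)   # e.g. face 6 -> 4/10 = 0.4; faces 4 and 5 -> 0.0 (unseen in the sample)
```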

Step 4: Bayes decision theory

Assume that we are given:

p(c_i), p(x|c_i) for c_i ∈ C = {c_1, c_2, . . . , c_m}

The optimal decision rule for the minimum error criterion is:

ĉ = argmax_c p(c|x)

This is called the MAP rule (maximum a posteriori probability). The rule is often expressed in terms of the above distributions by using Bayes Rule:

ĉ = argmax_c p(c|x) = argmax_c p(x|c)p(c)/p(x) = argmax_c p(x|c)p(c)

since p(x) > 0 does not impact the “argmax”.

The first form is often called the “direct model” and the latter is called the “noisy channel model” or “source-channel model” based on the communications theory view. The distribution p(c) describes the source and p(x|c) describes the channel which gives us a noisy view of the variable that was sent.
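A minimal Python sketch (not from the original notes) of the MAP rule in the source-channel form argmax_c p(x|c)p(c); the classes, priors, and likelihood tables are made up.

```python
# A minimal sketch of the MAP rule with made-up priors p(c) and likelihoods p(x|c)
# over a small discrete feature (a single observed word).
p_c = {'sports': 0.3, 'politics': 0.7}                       # source: p(c)
p_x_given_c = {'sports':   {'score': 0.6, 'vote': 0.1, 'game': 0.3},
               'politics': {'score': 0.1, 'vote': 0.7, 'game': 0.2}}

def map_decision(x):
    """argmax_c p(x|c) p(c); p(x) is constant and can be dropped."""
    return max(p_c, key=lambda c: p_x_given_c[c][x] * p_c[c])

print(map_decision('score'))   # 'sports':   0.6*0.3 = 0.18 > 0.1*0.7 = 0.07
print(map_decision('game'))    # 'politics': 0.2*0.7 = 0.14 > 0.3*0.3 = 0.09 (the prior matters)
```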

Aside: The decision rule changes if the objective changes. For example, if different types of errors have different costs, then it becomes a minimum risk rule. Or, one could minimize the maximum possible error when the posterior is not known (minimax rule). We won’t worry about alternative decision rules in this class.

The expected error associated with this decision rule is:

P_e = Σ_i Σ_{j≠i} P(D_j, c_i)   or   P_e = 1 − P_c = 1 − Σ_{i=1}^{m} P(D_i, c_i)

where D_i is the region of x where the rule decides class i, so P(D_i, c_i) is the probability of deciding class i and being correct. Note that even when you make the optimal decision, there will still be some error associated with overlap of the p(x|c) distributions for different c. So there are multiple reasons for errors:

• Overlap of the feature distributions p(x|c_i) (related to the choice of x)
• Modeling error, including making the wrong model assumption and parameter estimation error associated with having finite training data
• Search errors associated with approximations in using the model (not covered at this point)

When there are only two classes (e.g. detecting presence vs. absence of something), then there are only two ways to make an error: missing the target when it’s there and detecting it when it’s not. The different error types can be described using:

• probabilities of false detection P(D_1|c_0) vs. missed detection P(D_0|c_1) (the EE view), or

• probability of correct detection P(D_1|c_1) (recall) vs. probability that something you detected is correct P(c_1|D_1) (precision) (the CS view)

Both are informative; the choice often depends on the audience. Note that you cannot translate between these two forms without information about the priors.
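A minimal Python sketch (not from the original notes) of that conversion: going from the EE view (false detection and miss rates) to the CS view (precision and recall) via Bayes’ rule, using made-up rates and a made-up prior.

```python
# A minimal sketch: converting error rates to precision/recall requires the class prior.
p_fa   = 0.05    # P(D1 | c0): false detection rate
p_miss = 0.20    # P(D0 | c1): missed detection rate
p_c1   = 0.10    # prior P(c1); P(c0) = 1 - p_c1

recall = 1 - p_miss                                   # P(D1 | c1)
p_detect = recall * p_c1 + p_fa * (1 - p_c1)          # P(D1)
precision = recall * p_c1 / p_detect                  # P(c1 | D1), by Bayes' rule

print(recall, precision)   # 0.8, 0.64; with a rarer target (smaller p_c1), precision drops
```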

Bag-of-Words Model

Ignoring word order and assuming independence of each word ⇒ a document is a bag of words (only word counts matter). (This is a fairly reasonable assumption for topic and genre modeling, but not for reading level or author identification.)

The bag of words is a random vector, where each element x_i of the vector corresponds to a word in the vocabulary. Two variations on this theme:

• TF-IDF (distance based): x_i is a weighted count, where the weight depends on the relative importance of the word
• Naive Bayes (generative classifier): x_i is a count, and relative importance is addressed through the probability distribution

To understand why both work, consider the problem of classification where the class-conditional distributions are Gaussians with the same covariance. You can explicitly model the covariance in the probability distribution (effectively deweighting directions that have more variance), or you can multiply the vector by a linear transformation that does the weighting and use a minimum Euclidean distance decision rule.

TF-IDF

term frequency: tf_i = relative frequency of word i in the document
inverse document frequency: idf_i = (# of documents) / (# of documents where word i appears)

Idea:

• Use the idf to give a lower weight to words that appear in many documents and a higher weight to those that appear in just a few, e.g.

x_i = (1 + log(tf_i)) log(idf_i)

• Use the cosine distance to determine similarity of two documents:

d(x, y) = x′y / (|x||y|) = ( Σ_i x_i y_i ) / ( sqrt(Σ_j x_j²) sqrt(Σ_j y_j²) )

Distance can be used in clustering or in a nearest neighbor classifier.
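A minimal Python sketch (not from the original notes) of the TF-IDF weighting and cosine similarity above, on a made-up three-document collection; tf is taken here as the raw count (one common variant of the slide’s definition).

```python
# A minimal sketch: x_i = (1 + log(tf_i)) * log(idf_i) for words present in a document,
# followed by cosine similarity between weighted vectors. The documents are made up.
import math
from collections import Counter

docs = [['the', 'chicken', 'crossed', 'the', 'road'],
        ['the', 'road', 'was', 'crossed', 'by', 'the', 'chicken'],
        ['entropy', 'measures', 'uncertainty']]

n_docs = len(docs)
df = Counter(w for d in docs for w in set(d))    # document frequency of each word

def tfidf(doc):
    tf = Counter(doc)                            # raw term counts
    return {w: (1 + math.log(tf[w])) * math.log(n_docs / df[w]) for w in tf}

def cosine(x, y):
    dot = sum(x[w] * y[w] for w in x if w in y)
    nx = math.sqrt(sum(v ** 2 for v in x.values()))
    ny = math.sqrt(sum(v ** 2 for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

print(cosine(tfidf(docs[0]), tfidf(docs[1])))   # relatively high: shared content words
print(cosine(tfidf(docs[0]), tfidf(docs[2])))   # 0.0: no overlapping vocabulary
```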

Naive Bayes

Let x_i be the count of word i in the document; X = [x_1 ··· x_n]; and q_i^c is the probability of observing word i in document type c; then:

P(X|c) = ∏_i (q_i^c)^{x_i}

or equivalently

log P(X|c) = Σ_i x_i log q_i^c

Problem: if you don’t see word i for documents of type c in your training data, does that mean you will never see it??? NO! But the maximum likelihood estimate for q_i^c is the relative frequency, which would give probability zero for words unseen in training.

Solution: Smoothing! You will learn much more about this next Monday. For now, know that there are several options. In the RAINBOW toolkit (for example), some options include the following (a small sketch using Laplace smoothing appears after the list):

• general: Laplace, Good-Turing, and Witten-Bell
• for cases where you have a hierarchical structure to your classes: shrinking (i.e. interpolating the probabilities of finer grained classes with the associated coarser grained classes – we’ll see the idea again with decision trees)
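A minimal Python sketch (not from the original notes) of the Naive Bayes log score with Laplace (add-one) smoothing; the vocabulary, training counts, and priors are made up.

```python
# A minimal sketch: multinomial Naive Bayes scoring with Laplace smoothing,
# so that words unseen in a class do not get probability zero.
import math

vocab = ['goal', 'vote', 'game', 'election']
train_counts = {'sports':   {'goal': 30, 'game': 20},        # word counts per class (made up)
                'politics': {'vote': 25, 'election': 15}}
prior = {'sports': 0.5, 'politics': 0.5}

def q(word, c):
    """Laplace-smoothed estimate of q_i^c = P(word | class c)."""
    total = sum(train_counts[c].values())
    return (train_counts[c].get(word, 0) + 1) / (total + len(vocab))

def log_score(doc_counts, c):
    """log p(c) + sum_i x_i log q_i^c."""
    return math.log(prior[c]) + sum(x * math.log(q(w, c)) for w, x in doc_counts.items())

doc = {'goal': 2, 'vote': 1}              # bag-of-words counts for a test document
best = max(prior, key=lambda c: log_score(doc, c))
print(best)   # 'sports': 'vote' is unseen for sports, but smoothing keeps it possible
```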

Feature Selection

Features = words in these examples. If you use all the words in the vocabulary, then the vector dimension can be VERY big! In addition, many of the words may have little to offer in solving the problem (e.g. “a” and “the” for topic classification). Approaches to reducing the vector size (removing words) are referred to as feature selection, and may include one or more of:

• Eliminate all words on a stopword list
• Omit words that occur fewer than N times in the full collection
• Omit words that occur in fewer than M documents
• Omit words according to some information theoretic criterion (information gain, KL distance)

More on information gain: Let w mean that word w is present in the document, and w̄ mean that it is not present. Then the information gain for word w is defined as:

IG_w = H(C) − p(w)H(C|w) − p(w̄)H(C|w̄)

but we can rewrite this using I_w as the indicator random variable (a flag that indicates whether w is present) as

IG_w = H(C) − H(C|I_w) = I(C; I_w)

In other words, the information gain of word w is the mutual information between the class random variable C and the presence vs. absence indicator of word w. When the mutual information is high, the word is useful; when it is low, it is not.
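A minimal Python sketch (not from the original notes) of IG_w = H(C) − H(C|I_w) computed from per-document presence flags; the tiny labeled collection is made up.

```python
# A minimal sketch: information gain of a word's presence/absence flag for the class variable.
import math
from collections import Counter

def H(counts):
    """Entropy in bits of a distribution given by a dict of counts."""
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

# (class label, word w present in the document?) -- made-up labeled collection
docs = [('sports', True), ('sports', True), ('sports', False),
        ('politics', False), ('politics', False), ('politics', True)]

def info_gain(docs):
    ig = H(Counter(c for c, _ in docs))                   # H(C)
    for present in (True, False):
        subset = [c for c, flag in docs if flag == present]
        ig -= (len(subset) / len(docs)) * H(Counter(subset))   # - p(I_w) H(C | I_w)
    return ig

print(info_gain(docs))   # > 0 but well below H(C) = 1: the flag is informative, not decisive
```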

Information gain works pretty well and can automatically find a stopword list for you (especially helpful if you are working with something other than newswire text). However, there is a potential problem with sparse data. Note that to find H(C|I_w) we need p(c|w), which might be estimated using:

p(c|w) = (# documents of type c with word w) / (# documents with word w)

Consider the case where there is one document with word w. Then p(c|w) = 1 for the class of that document and H(C|I_w) = 0. This word gets a high information gain, but it is not reliable and probably should be dropped! The solution is to use smoothing and/or to combine IG with count cut-offs.
