
Foundations
Prof. Mari Ostendorf

Outline

• Why statistical models of language?
• Mathematical fundamentals
  – Basic probability theory
  – Basic information theory
  – Different classifier design approaches
• Bag-of-words models
  – Naive Bayes vs. TF-IDF
  – Feature selection

Why statistical models? Three related issues:

• Ambiguity in language (one text can have different meanings)
• Variability in language (one meaning can be expressed with different words)
• Ignorance modeling

Ambiguity

(Example from Jurafsky & Martin): I made her duck.

Some possible interpretations of the text out of context:

I cooked waterfowl for her.
I cooked the waterfowl that is on the plate in front of her.
I created a toy (or decorative) waterfowl for her.
I caused her to quickly lower her head.
I waved my magic wand and turned her into a ......

These illustrate syntactic ambiguity (e.g. noun vs. verb for “duck”, pronoun vs. possessive for “her”) and semantic ambiguity (e.g. cooked vs. created vs. turned into for “made”). If the words actually correspond to speech, there are additional types of variability associated with discourse characteristics and speech recognition errors:

I made HER duck. vs. I made her DUCK.
I made her duck? (doubt, disbelief) vs. statement form
Ai made her duck. (where “Ai” is the name of a person)
A maid heard “uck”.

Ambiguity often occurs when the model covers only a subset of the different aspects of language, such as covering syntax but not topical context and not accounting for prosodic cues (e.g. emphasis, phrasing). Classic speech examples:

write vs. right vs. Wright (homophones)
Wreck a nice beach vs. Recognize speech

Variability

Wording differences can be syntactic or lexical, due to presentation mode and the author/audience relation.

The chicken crossed the road.
The road was crossed by the chicken.
The chicken has traversed the road.
Across the road went the chicken.
The daughter of the rooster made it to the other side of the street.
A chick- uh I mean the chicken you know like crossed the the road.

Variability due to source can have a big effect on language processing systems. (Consider Seattle PI vs. NY Times.) In a recent study by Kirchhoff et al. (unpublished), text source was more strongly correlated with machine translation error rate than any other factor.

Ignorance Modeling

The basic idea: acknowledge the fact that you don’t yet have rules that account for all sources of variability/ambiguity.

Example from speech recognition: Gaussian mixture models for observation distributions can represent a range of pronunciations for a given word.

Example from language modeling: Deterministic grammars of disfluencies generally haven’t worked well because people do not have good intuitions for where disfluencies happen. We filter them out in everyday conversation. Learning the characteristics automatically has been more effective, e.g. “um” is more likely at main clause boundaries and “uh” is more likely at self-corrections.

Mathematical Fundamentals

Main topics:

• Key concepts from probability theory
• Essential information theory
• Classifier design

Random Variables, Random Vectors and Random Processes

A random vector is a fixed-length ordered collection of random variables. A random process is a sequence of random variables.

Notation: Capital letters indicate random variables; lower case letters indicate the values that random variables take on; boldface indicates a vector.

random variable: X, x ∈ A
random vector: X = [X_1 ··· X_k], x ∈ ∏_k A_k
random process: {X(n)} = X(1), X(2), . . ., x(n) ∈ A ∀n
vector random process: {X(n)} = X(1), X(2), . . ., x(n) ∈ ∏_k A_k ∀n

In language processing, n typically takes on integer values (n ∈ Z), but random processes more generally can have continuous-valued time arguments. We use A to denote the sample space, or set of values that the variable can take on, which is usually (but not always) discrete for language processing.

Language Examples:

Consider a document d = {w(1), w(2), . . . , w(T)} with an unknown and variable length T.

• Text classification: identify the topic of the document from a fixed set.
  – index of the topic of the document (class label): discrete random variable
  – vector of word counts for a fixed set of words (observation): discrete random vector (or continuous if using weighted counts)
• Language classification: what language is the text written in?
  – index of the language of the document (class label): discrete random variable
  – sequence of letters (observation): discrete-valued scalar random process
• Name detection, e.g. identify whether a sub-sequence of words in a sentence is the name of a person.
  – word IDs can be start of name, continuation of name, non-name (class label sequence): discrete random process
  – vector of word features such as flags for “in name dictionary”, “is capitalized”, “is preceded by punctuation”, “is an acronym”, etc. (observation): vector random process
• Author verification: was this document written by Shakespeare?
  – answer is yes or no (class label): binary random variable
  – ideas from students.... (observation): mixed random vector
• What is the attitude/mood of the document?
  – rating of the mood of the document (value in [-1,1]): continuous random variable
  – flags for presence of certain words and word combinations (observation): discrete random vector

Random Vectors

Random vectors are simply a collection of random variables. The simplest possible probability distribution for a vector is to treat all elements as independent. The independence assumption in the context of a class-conditional distribution

p(x_1, . . . , x_K | c) = ∏_{i=1}^{K} p(x_i | c)

is often referred to as the “Naive Bayes” (NB) model. If each variable X_i can take on V values, then there are KV parameters in the distribution (for the non-parametric case). If you represent the dependence between all variables, then there are V^K parameters.

Consider two examples of a two-dimensional p(x, y). In which case do you have NB (p(x, y) = p(x)p(y))?

        y = 0   y = 1   p(x)
x = 0   1/4     1/2     3/4
x = 1   1/12    1/6     1/4
p(y)    1/3     2/3

        y = 0   y = 1   p(x)
x = 0   .5      .2      .7
x = 1   .3      0       .3
p(y)    .8      .2
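As a quick check (not part of the original notes), here is a minimal Python sketch that tests whether p(x, y) = p(x)p(y) holds for the two tables above; the table encoding and tolerance are my own choices.

```python
# A minimal sketch: check whether p(x, y) = p(x) * p(y) for the two example tables.

def is_independent(joint, tol=1e-9):
    """joint[x][y] = p(x, y); True if p(x, y) == p(x) p(y) for every cell."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    return all(abs(joint[x][y] - px[x] * py[y]) < tol
               for x in range(len(joint)) for y in range(len(joint[0])))

table_1 = [[1/4, 1/2], [1/12, 1/6]]
table_2 = [[0.5, 0.2], [0.3, 0.0]]

print(is_independent(table_1))   # True: every cell factors, so NB holds exactly
print(is_independent(table_2))   # False: p(x=0)p(y=0) = 0.7*0.8 = 0.56 != 0.5
```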

Important random processes

Unlike random vectors, you can’t in general write down a single joint distribution for a random process, because the sequence has unbounded length and you may be interested in a subsequence of arbitrary length. Instead, we have a “recipe” for specifying the joint probability distribution for any k random variables pulled from the sequence:

p(x(n_1), x(n_2), . . . , x(n_k))

This recipe corresponds to distribution assumptions associated with the random variables and the dependence between them. Some examples:

• Independent and identically distributed (i.i.d.) process:

  p(x(n_1), x(n_2), . . . , x(n_k)) = ∏_{i=1}^{k} p(x(n_i))

  This is the simplest possible distribution assumption. Note that time order doesn’t matter.

• Markov process (the next simplest model):

  p(x(1), x(2), . . . , x(k)) = p(x(1)) ∏_{i=2}^{k} p(x(i) | x(i−1))

  The model is specified in terms of sequential times, but you can get the joint distribution for any set of times by marginalizing over the sequence that contains them. (A small numerical sketch follows this list.)

• n-gram: extend the history in the Markov model to length n − 1 (see 6/25 lecture)
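Below is a minimal Python sketch (not from the original notes) of the i.i.d. and Markov factorizations above; the two-symbol alphabet, initial probabilities, and transition probabilities are made up for illustration.

```python
# A minimal sketch of the i.i.d. and Markov (bigram) factorizations, with
# made-up probabilities over the alphabet {'a', 'b'}.

p_init = {'a': 0.6, 'b': 0.4}                          # p(x(1))
p_trans = {('a', 'a'): 0.7, ('a', 'b'): 0.3,           # p(x(i) | x(i-1))
           ('b', 'a'): 0.5, ('b', 'b'): 0.5}

def markov_prob(seq):
    """p(x(1)) * prod_{i>=2} p(x(i) | x(i-1))."""
    prob = p_init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= p_trans[(prev, cur)]
    return prob

def iid_prob(seq, p=p_init):
    """i.i.d. case: the product of per-symbol probabilities; order is irrelevant."""
    prob = 1.0
    for s in seq:
        prob *= p[s]
    return prob

print(markov_prob(['a', 'a', 'b']))   # 0.6 * 0.7 * 0.3 = 0.126
print(iid_prob(['a', 'a', 'b']))      # 0.6 * 0.6 * 0.4 = 0.144
```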

Essential Information Theory

Entropy

For a discrete random variable X that can take on V values,

H(X) = − Σ_x p(x) log p(x)

which gives a quantity measured in bits if the log is base 2. Properties of entropy:

• H(X) ≥ 0, where H(X) = 0 for the deterministic case
• H(X) ≤ log V, where H(X) = log V for the uniform distribution

Entropy is a measure of the uncertainty associated with a distribution; there’s more uncertainty with a flatter distribution.

The Twenty-Questions Interpretation: H(X) is the number of clever yes/no questions you need to ask (on average) to guess the value of X.
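A minimal Python sketch (not from the original notes) of the entropy definition and the boundary properties above; the example distributions are made up.

```python
# A minimal sketch: entropy in bits for a discrete distribution, illustrating
# 0 for the deterministic case and log2(V) for the uniform case.
import math

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute 0."""
    return sum(-px * math.log2(px) for px in p if px > 0)

print(entropy([1.0, 0.0]))    # 0.0  (deterministic case)
print(entropy([0.5, 0.5]))    # 1.0  (uniform over V = 2 values: log2(2))
print(entropy([0.9, 0.1]))    # ~0.47 (skewed distribution: less uncertainty than uniform)
```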

Entropy of a Sequence

In theory:

H_rate = lim_{n→∞} (1/n) H(X_1, . . . , X_n)

In practice, we often work with the empirical distribution and finite n:

H̃_r = − (1/n) log p(x_1, . . . , x_n)

Implications for Language Processing: Entropy (or perplexity PP = 2^{H̃_r}) indicates how easy (or hard) a task is.

Measuring the Entropy of English

Goal: estimate the per-symbol entropy of a sequence

H_r = lim_{n→∞} (1/n) H(X_1, . . . , X_n)

using V = 27 for the 26 letters plus space (ignoring capitalization and punctuation, which is not entirely reasonable, but...)

• Automatic model-based estimates
  – Shannon: letter n-gram models; a bigram model gave 2.8 bits/symbol (bps)
  – Brown et al.: word n-gram models, computing the next-letter probability by summing over all words with a matching prefix; results are close to the human estimates
• Human-based estimates
  – Shannon: humans guess the next letter, giving 1.3 bps
  – Cover & King ’78: humans bet on the next letter, giving 1.34 bps

Example: guess the next letters o- ohio state’s pretty big isn’t it yeah yeah i mean uh it’s you know we’re about to do like the uh fiesta bowl there oh yeah

When you have two random variables, you can compute joint and conditional entropy:

H(X, Y) = − Σ_x Σ_y p(x, y) log p(x, y)

H(Y | X) = − Σ_x Σ_y p(x, y) log p(y | x) = H(X, Y) − H(X)

Properties: 0 ≤ H(Y|X) ≤ H(Y)

Mutual Information

Amount of information in common between X and Y:

I(X; Y) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x) p(y)) ]

which you can show is related to the above entropies as:

I(X; Y) = H(X) + H(Y) − H(X, Y) = H(Y) − H(Y|X) = H(X) − H(X|Y)

Properties: I(X; Y) = I(Y; X), I(X; Y) ≥ 0

Implications for Language Processing: I(X; Y) indicates how much feature Y reduces the uncertainty of X, so it is sometimes used in feature selection.
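A minimal Python sketch (not from the original notes) computing I(X; Y) from a joint table via the identity I(X; Y) = H(X) + H(Y) − H(X, Y); the joint tables reused here are the two Naive Bayes examples from earlier.

```python
# A minimal sketch: mutual information from a joint table, using the entropy identities above.
import math

def H(p):
    """Entropy in bits of a list of probabilities."""
    return -sum(v * math.log2(v) for v in p if v > 0)

def mutual_information(joint):
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat = [v for row in joint for v in row]
    return H(px) + H(py) - H(flat)        # I(X;Y) = H(X) + H(Y) - H(X,Y)

print(mutual_information([[1/4, 1/2], [1/12, 1/6]]))   # ~0.0 (independent table)
print(mutual_information([[0.5, 0.2], [0.3, 0.0]]))    # ~0.12 > 0 (dependent table)
```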

Entropy vs. Cross Entropy

H(X) = − Σ_x p(x) log p(x)   vs.   H_c(X) = − Σ_x p(x) log q(x)

Properties: H_c(X) ≥ H(X)

Implications for Language Processing: Cross entropy is computed in practice a lot, since we rarely know the true p(x). One frequently sees:

H_e(X) = − Σ_x p_e(x) log q(x) = − (1/n) Σ_i log q(x_i)

where q(x) is a model and p_e(x) is the empirical distribution associated with a collection of samples {x_i}, i.e. the relative frequency of x in {x_i}.
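A minimal Python sketch (not from the original notes) of the empirical cross-entropy above and the corresponding perplexity 2^H; the sample and the two candidate models q are made up.

```python
# A minimal sketch: empirical cross-entropy of a model q on a sample, and perplexity.
import math

def empirical_cross_entropy(samples, q):
    """-(1/n) * sum_i log2 q(x_i)."""
    return -sum(math.log2(q[x]) for x in samples) / len(samples)

samples = ['the', 'cat', 'sat', 'the', 'cat']
q_good = {'the': 0.4, 'cat': 0.4, 'sat': 0.2}    # matches the empirical frequencies
q_flat = {'the': 1/3, 'cat': 1/3, 'sat': 1/3}    # uniform model

for q in (q_good, q_flat):
    h = empirical_cross_entropy(samples, q)
    print(h, 2 ** h)   # the better-matched model gives lower cross-entropy and perplexity
```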

Relative Entropy (Kullback-Leibler distance)

D(p||q) = Σ_x p(x) log [ p(x) / q(x) ]

Relative entropy is cross-entropy minus entropy, and D(p||q) ≥ 0. Relative entropy is often used to compute a “distance” between distributions, but it is typically NOT symmetric, i.e. D(p||q) ≠ D(q||p). Mutual information can be expressed in terms of relative entropy: I(X; Y) = D(p(x, y) || p(x)p(y))

Implications for Language Processing: Relative entropy can be used as a “distance” between distributions, e.g. for clustering.
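A minimal Python sketch (not from the original notes) of D(p||q) on two made-up distributions over the same support, showing non-negativity and the lack of symmetry.

```python
# A minimal sketch: KL "distance" between two made-up distributions.
import math

def kl(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)), assuming q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

print(kl(p, q))   # ~0.12
print(kl(q, p))   # ~0.13 (a different value: the "distance" is not symmetric)
```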

Designing Classifiers

Three main approaches:

• Estimate class-conditional distributions and use them with minimum-error decision theory rules (generative model)
  Examples: naive Bayes, Markov model, hidden Markov model, ...
• Find a decision rule directly to minimize error (discriminative model)
  Examples: support vector machine, neural network, ...
• Decide on the class of a test sample based on the labels of the training samples it is most similar to
  Examples: k-nearest neighbors, memory-based learning, ...

Generative Modeling Approach

1. Get class-labeled data (called the training data or the learning set).
2. Assume a model for the class-conditional distributions (i.e. the form of the generating distributions).
3. Learn the parameters of the model from the training data – this is an estimation problem.
4. Design a classifier to minimize error given these distributions – this is a decision theory problem.
5. Given new data, use the classifier for making a decision about the class.
6. If the new data is labeled (a heldout subset of your original data set), then you can estimate the performance of your decision rule by counting errors.

Step 3: Estimate the class-conditional model

For estimating the parameters of a model, a popular technique is maximum likelihood (ML) estimation. In this case, you assume the form of the model p(y|θ) and are given Y = Y_1, Y_2, . . . , Y_T presumed to be generated by the model. Then the optimal solution is

θ̂_ML = argmax_θ p(Y|θ) = argmax_θ log p(Y|θ),

with the necessary conditions for the ML estimate being

∇_θ log p(Y|θ) = 0.

Some good things about the ML estimate are that

• it is simple,
• it is consistent (gives the right answer with infinite data), and
• it is theoretically optimal for use in classification *IF* the distribution assumption is correct and there is no prior knowledge.

However, a problem is that there is no guarantee of a good estimate if the distribution assumption is wrong. For practical purposes, it can sometimes be better to use a wrong but simple distribution assumption (leading to a simple classifier) with a discriminative (not ML) parameter estimation criterion. That said, we’ll mainly use ML estimation for simplicity.

Examples:

• For a multinomial (e.g. roll of a die), the ML estimate is the relative frequency estimate (see the sketch below).
• For a Gaussian, the ML estimate is the sample mean and variance.
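A minimal Python sketch (not from the original notes) of the multinomial example: the ML estimate is just the relative frequency of each outcome. The die rolls are made up.

```python
# A minimal sketch: ML estimation for a multinomial is relative frequency.
from collections import Counter

rolls = [1, 3, 3, 6, 2, 3, 1, 6, 6, 6]   # made-up die rolls
counts = Counter(rolls)
n = len(rolls)

theta_ml = {face: counts[face] / n for face in range(1, 7)}
print(theta_ml)   # e.g. face 6 -> 4/10 = 0.4; faces 4 and 5 -> 0.0 (unseen in the sample)
```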

Step 4: Bayes decision theory

Assume that we are given:

p(c_i), p(x|c_i) for c_i ∈ C = {c_1, c_2, . . . , c_m}

The optimal decision rule for the minimum error criterion is:

ĉ = argmax_c p(c|x)

This is called the MAP rule (maximum a posteriori probability). The rule is often expressed in terms of the above distributions by using Bayes Rule:

ĉ = argmax_c p(c|x) = argmax_c p(x|c)p(c)/p(x) = argmax_c p(x|c)p(c)

since p(x) > 0 does not impact the “argmax”.

The first form is often called the “direct model” and the latter is called the “noisy channel model” or “source-channel model” based on the communications theory view. The distribution p(c) describes the source and p(x|c) describes the channel which gives us a noisy view of the variable that was sent.
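A minimal Python sketch (not from the original notes) of the MAP rule in the source-channel form argmax_c p(x|c)p(c); the classes, priors, and likelihood tables are made up.

```python
# A minimal sketch of the MAP rule with made-up priors p(c) and likelihoods p(x|c)
# over a small discrete feature (a single observed word).
p_c = {'sports': 0.3, 'politics': 0.7}                       # source: p(c)
p_x_given_c = {'sports':   {'score': 0.6, 'vote': 0.1, 'game': 0.3},
               'politics': {'score': 0.1, 'vote': 0.7, 'game': 0.2}}

def map_decision(x):
    """argmax_c p(x|c) p(c); p(x) is constant and can be dropped."""
    return max(p_c, key=lambda c: p_x_given_c[c][x] * p_c[c])

print(map_decision('score'))   # 'sports':   0.6*0.3 = 0.18 > 0.1*0.7 = 0.07
print(map_decision('game'))    # 'politics': 0.2*0.7 = 0.14 > 0.3*0.3 = 0.09 (the prior matters)
```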

Aside: The decision rule changes if the objective changes. For example, if different types of errors have different costs, then it becomes a minimum risk rule. Or, one could minimize the maximum possible error when the posterior is not known (minimax rule). We won’t worry about alternative decision rules in this class.

The expected error associated with this decision rule is:

P_e = Σ_i Σ_{j≠i} P(D_j, c_i)   or   P_e = 1 − P_c = 1 − Σ_{i=1}^{m} P(D_i, c_i)

where D_i is the region of x where the rule decides class i, so P(D_i, c_i) is the probability of deciding class i and being correct. Note that even when you make the optimal decision, there will still be some error associated with overlap of the p(x|c) distributions for different c. So there are multiple reasons for errors:

• Overlap of the feature distributions p(x|c_i) (related to the choice of x)
• Modeling error, including making the wrong model assumption and parameter estimation error associated with having finite training data
• Search errors associated with approximations in using the model (not covered at this point)

When there are only two classes (e.g. detecting presence vs. absence of something), then there are only two ways to make an error: missing the target when it’s there and detecting it when it’s not. The different error types can be described using:

• probabilities of false detection P(D_1|c_0) vs. missed detection P(D_0|c_1) (the EE view), or

• probability of correct detection P(D_1|c_1) (recall) vs. probability that something you detected is correct P(c_1|D_1) (precision) (the CS view)

Both are informative; the choice often depends on the audience. Note that you cannot translate between these two forms without information about the priors.
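A minimal Python sketch (not from the original notes) of that conversion: going from the EE view (false detection and miss rates) to the CS view (precision and recall) via Bayes’ rule, using made-up rates and a made-up prior.

```python
# A minimal sketch: converting error rates to precision/recall requires the class prior.
p_fa   = 0.05    # P(D1 | c0): false detection rate
p_miss = 0.20    # P(D0 | c1): missed detection rate
p_c1   = 0.10    # prior P(c1); P(c0) = 1 - p_c1

recall = 1 - p_miss                                   # P(D1 | c1)
p_detect = recall * p_c1 + p_fa * (1 - p_c1)          # P(D1)
precision = recall * p_c1 / p_detect                  # P(c1 | D1), by Bayes' rule

print(recall, precision)   # 0.8, 0.64; with a rarer target (smaller p_c1), precision drops
```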

Bag-of-Words Model

Ignoring word order and assuming independence of each word ⇒ a document is a bag of words (only word counts matter). (This is a fairly reasonable assumption for topic and genre modeling, but not for reading level or author identification.)

The bag of words is a random vector, where each element x_i of the vector corresponds to a word in the vocabulary. Two variations on this theme:

• TF-IDF (distance based): x_i is a weighted count, where the weight depends on the relative importance of the word
• Naive Bayes (generative classifier): x_i is a count, and relative importance is addressed through the probability distribution

To understand why both work, consider the problem of classification where the class-conditional distributions are Gaussians with the same covariance. You can explicitly model the covariance in the probability distribution (effectively deweighting directions that have more variance), or you can multiply the vector by a linear transformation that does the weighting and use a minimum Euclidean distance decision rule.

TF-IDF

term frequency: tf_i = relative frequency of word i in the document
inverse document frequency: idf_i = (# of documents) / (# of documents where word i appears)

Idea:

• Use the idf to give a lower weight to words that appear in many documents and a higher weight to those that appear in just a few, e.g.

x_i = (1 + log(tf_i)) log(idf_i)

• Use the cosine distance to determine similarity of two documents:

d(x, y) = x′y / (|x||y|) = ( Σ_i x_i y_i ) / ( sqrt(Σ_j x_j²) sqrt(Σ_j y_j²) )

Distance can be used in clustering or in a nearest neighbor classifier.
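A minimal Python sketch (not from the original notes) of the TF-IDF weighting and cosine similarity above, on a made-up three-document collection; tf is taken here as the raw count (one common variant of the slide’s definition).

```python
# A minimal sketch: x_i = (1 + log(tf_i)) * log(idf_i) for words present in a document,
# followed by cosine similarity between weighted vectors. The documents are made up.
import math
from collections import Counter

docs = [['the', 'chicken', 'crossed', 'the', 'road'],
        ['the', 'road', 'was', 'crossed', 'by', 'the', 'chicken'],
        ['entropy', 'measures', 'uncertainty']]

n_docs = len(docs)
df = Counter(w for d in docs for w in set(d))    # document frequency of each word

def tfidf(doc):
    tf = Counter(doc)                            # raw term counts
    return {w: (1 + math.log(tf[w])) * math.log(n_docs / df[w]) for w in tf}

def cosine(x, y):
    dot = sum(x[w] * y[w] for w in x if w in y)
    nx = math.sqrt(sum(v ** 2 for v in x.values()))
    ny = math.sqrt(sum(v ** 2 for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

print(cosine(tfidf(docs[0]), tfidf(docs[1])))   # relatively high: shared content words
print(cosine(tfidf(docs[0]), tfidf(docs[2])))   # 0.0: no overlapping vocabulary
```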

Naive Bayes

Let x_i be the count of word i in the document; X = [x_1 ··· x_n]; and q_i^c is the probability of observing word i in document type c; then:

P(X|c) = ∏_i (q_i^c)^{x_i}

or equivalently

log P(X|c) = Σ_i x_i log q_i^c

Problem: if you don’t see word i for documents of type c in your training data, does that mean you will never see it??? NO! But the maximum likelihood estimate for q_i^c is the relative frequency, which would give probability zero for words unseen in training.

Solution: Smoothing! You will learn much more about this next Monday. For now, know that there are several options. In the RAINBOW toolkit (for example), some options include the following (a small sketch using Laplace smoothing appears after the list):

• general: Laplace, Good-Turing, and Witten-Bell
• for cases where you have a hierarchical structure to your classes: shrinking (i.e. interpolating the probabilities of finer grained classes with the associated coarser grained classes – we’ll see the idea again with decision trees)
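A minimal Python sketch (not from the original notes) of the Naive Bayes log score with Laplace (add-one) smoothing; the vocabulary, training counts, and priors are made up.

```python
# A minimal sketch: multinomial Naive Bayes scoring with Laplace smoothing,
# so that words unseen in a class do not get probability zero.
import math

vocab = ['goal', 'vote', 'game', 'election']
train_counts = {'sports':   {'goal': 30, 'game': 20},        # word counts per class (made up)
                'politics': {'vote': 25, 'election': 15}}
prior = {'sports': 0.5, 'politics': 0.5}

def q(word, c):
    """Laplace-smoothed estimate of q_i^c = P(word | class c)."""
    total = sum(train_counts[c].values())
    return (train_counts[c].get(word, 0) + 1) / (total + len(vocab))

def log_score(doc_counts, c):
    """log p(c) + sum_i x_i log q_i^c."""
    return math.log(prior[c]) + sum(x * math.log(q(w, c)) for w, x in doc_counts.items())

doc = {'goal': 2, 'vote': 1}              # bag-of-words counts for a test document
best = max(prior, key=lambda c: log_score(doc, c))
print(best)   # 'sports': 'vote' is unseen for sports, but smoothing keeps it possible
```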

Feature Selection

Features = words in these examples. If you use all the words in the vocabulary, then the vector dimension can be VERY big! In addition, many of the words may have little to offer in solving the problem (e.g. “a” and “the” for topic classification). Approaches to reducing the vector size (removing words) are referred to as feature selection, and may include one or more of:

• Eliminate all words on a stopword list
• Omit words that occur fewer than N times in the full collection
• Omit words that occur in fewer than M documents
• Omit words according to some information theoretic criterion (information gain, KL distance)

More on information gain: Let w mean that word w is present in the document, and w̄ mean that it is not present. Then the information gain for word w is defined as:

IG_w = H(C) − p(w)H(C|w) − p(w̄)H(C|w̄)

but we can rewrite this using I_w as the indicator random variable (a flag that indicates whether w is present) as

IG_w = H(C) − H(C|I_w) = I(C; I_w)

In other words, the information gain of word w is the mutual information between the class random variable C and the presence vs. absence indicator of word w. When the mutual information is high, the word is useful; when it is low, it is not.
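A minimal Python sketch (not from the original notes) of IG_w = H(C) − H(C|I_w) computed from per-document presence flags; the tiny labeled collection is made up.

```python
# A minimal sketch: information gain of a word's presence/absence flag for the class variable.
import math
from collections import Counter

def H(counts):
    """Entropy in bits of a distribution given by a dict of counts."""
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

# (class label, word w present in the document?) -- made-up labeled collection
docs = [('sports', True), ('sports', True), ('sports', False),
        ('politics', False), ('politics', False), ('politics', True)]

def info_gain(docs):
    ig = H(Counter(c for c, _ in docs))                   # H(C)
    for present in (True, False):
        subset = [c for c, flag in docs if flag == present]
        ig -= (len(subset) / len(docs)) * H(Counter(subset))   # - p(I_w) H(C | I_w)
    return ig

print(info_gain(docs))   # > 0 but well below H(C) = 1: the flag is informative, not decisive
```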

Information gain works pretty well and can automatically find a stopword list for you (especially helpful if you are working with something other than newswire text). However, there is a potential problem with sparse data. Note that to find H(C|I_w) we need p(c|w), which might be estimated using:

p(c|w) = (# documents of type c with word w) / (# documents with word w)

Consider the case where there is one document with word w. Then p(c|w) = 1 for the class of that document and H(C|I_w) = 0. This word gets a high information gain, but it is not reliable and probably should be dropped! The solution is to use smoothing and/or to combine IG with count cut-offs.
