
Distributional Class-based N-gram Language Models and Embeddings

Chase Geigle 2017-09-28

What is a word?

“dog”   “canine”

[Figure: each word is represented as a one-hot vector over a vocabulary of roughly 400,000 types — a single 1 in that word’s position (different positions for “dog” and “canine”) and 0 everywhere else. Sparsity!]

The Scourge of Sparsity

In the real world, sparsity hurts us:

• Low vocabulary coverage in training → high OOV rate in applications (poor generalization)
• “one-hot” representations → huge parameter counts
• information about one word is completely unutilized for another!

Thus, a good representation must:

• reduce parameter space,
• improve generalization, and
• somehow “transfer” or “share” knowledge between words


The Distributional Hypothesis

You shall know a word by the company it keeps. (J. R. Firth, 1957)

Words with high similarity occur in the same contexts as one another.

A bit of foreshadowing:

• A word ought to be able to predict its context (word2vec Skip-Gram)
• A context ought to be able to predict its missing word (word2vec CBOW)

Brown Clustering

Language Models

A language model is a probability distribution over word sequences.

• Unigram language model $p(w \mid \theta)$:

  $p(W \mid \theta) = \prod_{i=1}^{N} p(w_i \mid \theta)$

• Bigram language model $p(w_i \mid w_{i-1}, \theta)$:

  $p(W \mid \theta) = \prod_{i=1}^{N} p(w_i \mid w_{i-1}, \theta)$

• n-gram language model $p(w_n \mid w_1^{n-1}, \theta)$:

  $p(W \mid \theta) = \prod_{i=1}^{N} p(w_i \mid w_{i-n+1}^{i-1}, \theta)$
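To make the product-of-conditionals definitions concrete, here is a minimal sketch (not from the slides) of a maximum-likelihood bigram model; smoothing and unknown-word handling are deliberately omitted, so unseen bigrams would raise a math domain error.

    import math
    from collections import Counter

    def train_bigram_lm(corpus):
        """MLE counts for a bigram language model from a list of token lists."""
        unigrams, bigrams = Counter(), Counter()
        for sentence in corpus:
            tokens = ["<s>"] + sentence
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
        return unigrams, bigrams

    def bigram_log_prob(sentence, unigrams, bigrams):
        """log p(W) = sum_i log p(w_i | w_{i-1}) under the unsmoothed MLE bigram model."""
        tokens = ["<s>"] + sentence
        logp = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            logp += math.log(bigrams[(prev, cur)] / unigrams[prev])
        return logp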

A Class-Based Language Model [1]

Main Idea: cluster words into a fixed number of clusters C and use their cluster assignments as their identity instead (reducing sparsity).

If π : V → C is a mapping function from a word type to a cluster (or class), we want to find

$\pi^* = \arg\max_\pi p(W \mid \pi)$

where

$p(W \mid \pi) = \prod_{i=1}^{N} p(c_i \mid c_{i-1}) \, p(w_i \mid c_i)$

with $c_i = \pi(w_i)$.

[1] Peter F. Brown et al. “Class-based N-gram Models of Natural Language”. In: Comput. Linguist. 18.4 (Dec. 1992), pp. 467–479.
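A small illustrative sketch of how this likelihood could be scored for a fixed assignment π; the count-table names (`class_bigram`, `class_count`, `word_count`) are hypothetical, and unsmoothed MLE estimates are assumed.

    import math

    def class_log_likelihood(words, pi, class_bigram, class_count, word_count):
        """log p(W | pi) = sum_i log [ p(c_i | c_{i-1}) * p(w_i | c_i) ], with c_i = pi(w_i).

        pi           : dict word -> class id (assumed to also map a start symbol "<s>")
        class_bigram : dict (c_prev, c) -> count of adjacent class pairs
        class_count  : dict c -> total count of tokens in the class
        word_count   : dict word -> count, so p(w | c) = n_w / n_c under MLE
        """
        logp = 0.0
        prev_c = pi["<s>"]
        for w in words:
            c = pi[w]
            logp += math.log(class_bigram[(prev_c, c)] / class_count[prev_c])  # p(c_i | c_{i-1})
            logp += math.log(word_count[w] / class_count[c])                   # p(w_i | c_i)
            prev_c = c
        return logp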

Finding the best partition

$\pi^* = \arg\max_\pi P(W \mid \pi) = \arg\max_\pi \log P(W \mid \pi)$

One can derive [2] L(π) to be

$L(\pi) = \underbrace{\sum_{w} n_w \log n_w}_{\text{(nearly) unigram entropy, fixed w.r.t. } \pi} + \underbrace{\sum_{c_i, c_j} n_{c_i, c_j} \log \frac{n_{c_i, c_j}}{n_{c_i} \cdot n_{c_j}}}_{\text{(nearly) mutual information, varies with } \pi}$

[2] Sven Martin, Jörg Liermann, and Hermann Ney. “Algorithms for Bigram and Trigram Word Clustering”. In: Speech Commun. 24.1 (Apr. 1998), pp. 19–37.

Finding the best partition

What does maximizing the mutual-information term of L(π) mean? Recall

$MI(c, c') = \sum_{c_i, c_j} p(c_i, c_j) \log \frac{p(c_i, c_j)}{p(c_i)\,p(c_j)}$

which can be shown to be

$MI(c, c') = \underbrace{\frac{1}{N}}_{\text{(constant)}} \sum_{c_i, c_j} n_{c_i, c_j} \log \frac{n_{c_i, c_j}}{n_{c_i} \cdot n_{c_j}} + \underbrace{\log N}_{\text{(constant)}}$

and thus maximizing the mutual information of adjacent classes selects the best π.

Finding the best partition (cont’d)

$\pi^* = \arg\max_\pi \left[ \underbrace{\sum_{w} n_w \log n_w}_{\text{(nearly) unigram entropy, fixed w.r.t. } \pi} + \underbrace{\sum_{c_i, c_j} n_{c_i, c_j} \log \frac{n_{c_i, c_j}}{n_{c_i} \cdot n_{c_j}}}_{\text{(nearly) mutual information, varies with } \pi} \right]$

Direct maximization is intractable! Thus, agglomerative (bottom-up) clustering is used as a greedy heuristic. The best merge is determined by the lowest loss in average mutual information.
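The greedy loop can be sketched directly from this objective. The following is a brute-force illustration over a class-bigram count matrix, not the efficient incremental bookkeeping of Brown et al.; the function names are mine.

    import numpy as np

    def average_mutual_information(counts):
        """AMI of adjacent classes; counts[i, j] = # of times class j follows class i."""
        N = counts.sum()
        p = counts / N
        p_left = p.sum(axis=1, keepdims=True)
        p_right = p.sum(axis=0, keepdims=True)
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(p > 0, p * np.log(p / (p_left * p_right)), 0.0)
        return terms.sum()

    def merge_counts(counts, a, b):
        """Collapse class b into class a in the class-bigram count matrix."""
        m = counts.astype(float)
        m[a, :] += m[b, :]
        m[:, a] += m[:, b]
        return np.delete(np.delete(m, b, axis=0), b, axis=1)

    def best_merge(counts):
        """Greedy step: find the merge that loses the least average mutual information."""
        current = average_mutual_information(counts)
        best = None
        for a in range(counts.shape[0]):
            for b in range(a + 1, counts.shape[0]):
                loss = current - average_mutual_information(merge_counts(counts, a, b))
                if best is None or loss < best[0]:
                    best = (loss, a, b)
        return best  # (loss in AMI, class a, class b)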

Agglomerative Clustering

[Figure: the words w1 … w11 are merged bottom-up, one pair of clusters at a time, building a binary tree over the vocabulary.]

Do they work? (Yes.)

Named entity recognition:

• Scott Miller, Jethran Guinness, and Alex Zamanian. “Name Tagging with Word Clusters and Discriminative Training”. In: HLT-NAACL. 2004, pp. 337–342
• Lev Ratinov and Dan Roth. “Design Challenges and Misconceptions in Named Entity Recognition”. In: Proc. CoNLL. 2009, pp. 147–155

Dependency parsing:

• Terry Koo, Xavier Carreras, and Michael Collins. “Simple Semi-supervised Dependency Parsing”. In: Proc. ACL: HLT. June 2008, pp. 595–603
• Jun Suzuki and Hideki Isozaki. “Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data”. In: Proc. ACL: HLT. June 2008, pp. 665–673

Constituency parsing:

• Marie Candito and Benoît Crabbé. “Improving Generative Statistical Parsing with Semi-supervised Word Clustering”. In: IWPT. 2009, pp. 138–141
• Muhua Zhu et al. “Fast and Accurate Shift-Reduce Constituent Parsing”. In: ACL. Aug. 2013, pp. 434–443

Vector Spaces for Word Representation

Toward Vector Spaces

Brown clusters are nice, but limiting:

• Arbitrary cut-off point → two words may be assigned different classes arbitrarily because of our chosen cutoff

• Definition of “similarity” is limited to shared class membership [3]

• Completely misses lots of regularity that’s present in language

[3] There are adaptations of Brown clustering that move past this, but what we’ll see today is still even better.


13 What do I mean by “regularity”?

woman is to sister as man is to brother
summer is to rain as winter is to snow
man is to king as woman is to queen
fell is to fallen as ate is to eaten
running is to ran as crying is to cried

The differences between each pair of words are similar.

Can our word representations capture this? (demo)

Neural Word Embeddings

Associate a low-dimensional, dense vector $\vec{w}$ with each word w ∈ V so that similar words (in a distributional sense) share a similar vector representation.

If I only had a vector: The Analogy Problem

To solve analogy problems of the form “$w_a$ is to $w_b$ as $w_c$ is to what?”, we can simply compute a query vector

$q = \vec{w}_b - \vec{w}_a + \vec{w}_c$

and find the most similar word vector v ∈ W to q. If we normalize q to unit length,

$\hat{q} = \frac{q}{\lVert q \rVert},$

and assume each vector in W is also unit length, this reduces to computing

$\arg\max_{v \in W} v \cdot \hat{q}$

and returning the associated word v.
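A minimal sketch of this nearest-neighbor lookup, assuming `vectors` is a row-normalized (|V| × d) float matrix and `vocab` / `inv_vocab` are hypothetical word-to-index and index-to-word lookups:

    import numpy as np

    def solve_analogy(a, b, c, vectors, vocab, inv_vocab, topn=1):
        """Return the word(s) w such that 'a is to b as c is to w'."""
        q = vectors[vocab[b]] - vectors[vocab[a]] + vectors[vocab[c]]
        q /= np.linalg.norm(q)                 # unit-length query
        scores = vectors @ q                   # cosine similarity to every word
        for w in (a, b, c):                    # exclude the query words themselves
            scores[vocab[w]] = -np.inf
        best = np.argsort(-scores)[:topn]
        return [inv_vocab[i] for i in best]

    # e.g. solve_analogy("man", "king", "woman", ...) should ideally return ["queen"]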

The CBOW and Skip-Gram Models [4] (word2vec)

[Figure: model architectures. CBOW predicts the center word w(t) from the sum of the surrounding context words w(t−2), w(t−1), w(t+1), w(t+2); Skip-gram predicts each surrounding context word from the center word w(t).]

[4] Tomas Mikolov et al. “Efficient estimation of word representations in vector space”. In: ICLR Workshop (2013).

Skip-Gram, Mathematically

Training corpus $w_1, w_2, \ldots, w_N$ (with N typically in the billions) from a fixed vocabulary V. The goal is to maximize the average log probability

$\frac{1}{N} \sum_{i=1}^{N} \sum_{-L \le k \le L,\ k \ne 0} \log p(w_{i+k} \mid w_i)$

Associate with each w ∈ V an “input vector” $w \in \mathbb{R}^d$ and an “output vector” $\tilde{w} \in \mathbb{R}^d$. Model context probabilities as

$p(c \mid w) = \frac{\exp(w \cdot \tilde{c})}{\sum_{c' \in V} \exp(w \cdot \tilde{c}')}$

The problem? V is huge! $\nabla \log p(c \mid w)$ takes time O(|V|) to compute!
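To see where the O(|V|) cost comes from, here is a hedged sketch of the full-softmax probability: the normalizer needs a dot product with every output vector in the vocabulary. `output_vectors` is an assumed (|V| × d) matrix holding the $\tilde{c}$ vectors.

    import numpy as np

    def full_softmax_context_prob(w_vec, c_idx, output_vectors):
        """p(c | w) under the full softmax: the denominator touches every word in V."""
        scores = output_vectors @ w_vec        # one dot product per vocabulary word
        scores -= scores.max()                 # shift for numerical stability
        probs = np.exp(scores)
        probs /= probs.sum()
        return probs[c_idx]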

Negative Sampling [5]

Given a pair (w, c), can we determine whether it came from our corpus or not? Model this probabilistically as

$p(D = 1 \mid w, c) = \sigma(w \cdot \tilde{c}) = \frac{1}{1 + \exp(-w \cdot \tilde{c})}$

Goal: maximize $p(D = 1 \mid w, c)$ for pairs (w, c) that occur in the data.

Also maximize $p(D = 0 \mid w, c_N)$ for $(w, c_N)$ pairs where $c_N$ is sampled randomly from the empirical unigram distribution.

[5] Tomas Mikolov et al. “Distributed Representations of Words and Phrases and their Compositionality”. In: Advances in Neural Information Processing Systems 26. 2013, pp. 3111–3119.

Skip-Gram Negative Sampling Objective

Locally,

$\ell = \log \sigma(w \cdot \tilde{c}) + k \cdot \mathbb{E}_{c_N \sim P_n(c_N)} \left[ \log \sigma(-w \cdot \tilde{c}_N) \right]$

and thus globally

$\mathcal{L} = \sum_{w \in V} \sum_{c \in V} n_{w,c} \left( \log \sigma(w \cdot \tilde{c}) + k \cdot \mathbb{E}_{c_N \sim P_n(c_N)} \left[ \log \sigma(-w \cdot \tilde{c}_N) \right] \right)$

• k: the number of negative samples to take (hyperparameter)
• $n_{w,c}$: the number of times (w, c) was seen in the data
• $\mathbb{E}_{c_N \sim P_n(c_N)}$ indicates an expectation taken with respect to the noise distribution $P_n(c_N)$.
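A sketch of the local objective for a single (w, c) pair, approximating the k-scaled expectation by summing over k concretely drawn noise vectors; the array names are illustrative.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_local_objective(w_vec, c_vec, neg_vecs):
        """Local SGNS objective for one observed (w, c) pair.

        w_vec    : input vector of the target word w
        c_vec    : output vector of the observed context word c
        neg_vecs : (k, d) array of output vectors for k sampled noise words
        """
        positive = np.log(sigmoid(w_vec @ c_vec))               # push sigma(w·c~) toward 1
        negative = np.log(sigmoid(-(neg_vecs @ w_vec))).sum()   # push sigma(w·c~_N) toward 0
        return positive + negative                              # maximized by gradient ascent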

Implementation Details (Read: Useful Heuristics)

1. What is the noise distribution?

$P_n(c_N) = \frac{(n_{c_N} / N)^{3/4}}{Z}$

(which is the empirical unigram distribution raised to the 3/4 power).

2. Frequent words can dominate the loss. Throw away word $w_i$ in the training data according to

$P(w_i) = 1 - \sqrt{\frac{t}{n_{w_i}}}$

where t is some threshold (like $10^{-5}$).

Both are essentially unexplained by Mikolov et al.
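Both heuristics are easy to state in code. A sketch, assuming the subsampling threshold is compared against a word’s relative frequency $n_w / N$ (the common reading of the formula above):

    import numpy as np

    def noise_distribution(unigram_counts):
        """P_n(c) proportional to (n_c / N)^(3/4): unigram counts raised to 3/4, renormalized."""
        weights = np.asarray(unigram_counts, dtype=float) ** 0.75
        return weights / weights.sum()

    def discard_probability(word_frequency, t=1e-5):
        """Probability of throwing away a token of a word with relative frequency n_w / N."""
        return np.maximum(0.0, 1.0 - np.sqrt(t / word_frequency))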

What does this actually do?

It turns out this is nothing that new! Levy and Goldberg [6] show that SGNS is implicitly factorizing the matrix

$M_{i,j} = w_i \cdot \tilde{c}_j = \mathrm{PMI}(w_i, c_j) - \log k = \log \frac{p(w_i, c_j)}{p(w_i)\,p(c_j)} - \log k$

using an objective that weighs deviations in more frequent (w, c) pairs more strongly than less frequent ones. Thus…

[6] Omer Levy and Yoav Goldberg. “Neural Word Embedding as Implicit Matrix Factorization”. In: Advances in Neural Information Processing Systems 27. 2014, pp. 2177–2185.

Matrix Factorization Methods for Word Embeddings

Why not just SVD?

Since SGNS is just factorizing a shifted version of the PMI matrix, why not just perform a rank-k SVD on the PMI matrix directly?

MP2!

With a little TLC, this method can actually work and does surprisingly well.
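A sketch of the SVD route, where the “TLC” amounts to choices like clipping negative PMI values (PPMI) and how the singular values are split between the two factors; this assumes one common recipe and a dense co-occurrence matrix small enough to decompose directly.

    import numpy as np

    def ppmi_matrix(cooc):
        """Positive PMI from a (|V| x |V|) word-context co-occurrence count matrix."""
        total = cooc.sum()
        p_wc = cooc / total
        p_w = p_wc.sum(axis=1, keepdims=True)
        p_c = p_wc.sum(axis=0, keepdims=True)
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(p_wc / (p_w * p_c))
        pmi[~np.isfinite(pmi)] = 0.0
        return np.maximum(pmi, 0.0)            # clip negatives: "positive" PMI

    def svd_embeddings(cooc, dim=100):
        """Rank-`dim` SVD of the PPMI matrix; rows of the result are word embeddings."""
        M = ppmi_matrix(cooc)
        U, S, Vt = np.linalg.svd(M, full_matrices=False)
        return U[:, :dim] * np.sqrt(S[:dim])   # one common way to share the singular values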

GloVe: Global Vectors for Word Representation [7]

How can we capture words related to ice, but not to steam?

Prob. or Ratio                   w_k = solid    w_k = gas      w_k = water    w_k = fashion
P(w_k | ice)                     1.9 × 10^−4    6.6 × 10^−5    3.0 × 10^−3    1.7 × 10^−5
P(w_k | steam)                   2.2 × 10^−5    7.8 × 10^−4    2.2 × 10^−3    1.8 × 10^−5
P(w_k | ice) / P(w_k | steam)    8.9            8.5 × 10^−2    1.36           0.96

Probability ratios are most informative:

• solid is related to ice but not steam
• gas is related to steam but not ice
• water and fashion do not discriminate between ice and steam (ratios close to 1)

[7] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. “GloVe: Global Vectors for Word Representation”. In: EMNLP. 2014, pp. 1532–1543.

Building a model: intuition

We would like the vectors $w_i$, $w_j$, and $w_k$ to be able to capture the information present in the probability ratio. If we set

$w_i^\top w_k = \log P(w_k \mid w_i)$

then we can easily see that

$(w_i - w_j)^\top w_k = \log P(w_k \mid w_i) - \log P(w_k \mid w_j) = \log \frac{P(w_k \mid w_i)}{P(w_k \mid w_j)},$

ensuring that the information present in the co-occurrence probability ratio is expressed in the vector space.

GloVe Objective

Again, two sets of vectors: “target” vectors w and “context” vectors w̃. $X_{ij} = n_{w_i, w_j}$ is the number of times $w_j$ appears in the context of $w_i$. Start from

$w_i^\top \tilde{w}_j + b_i + \tilde{b}_j = \log X_{ij}$

Problem: all co-occurrences are weighted equally. The final objective is just weighted least squares:

$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$

$f(X_{ij})$ serves as a “dampener”, lessening the weight of rare co-occurrences:

$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$

where α’s default is (you guessed it) 3/4, and $x_{\max}$’s default is 100.
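The weighting function and the per-pair loss are straightforward to write down; a sketch using the defaults stated above (x_max = 100, α = 3/4), with illustrative argument names:

    import numpy as np

    def glove_weight(x, x_max=100.0, alpha=0.75):
        """The GloVe dampening function f(X_ij)."""
        return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

    def glove_pair_loss(w_i, w_j_tilde, b_i, b_j_tilde, x_ij):
        """Weighted least-squares loss for a single nonzero co-occurrence X_ij."""
        err = w_i @ w_j_tilde + b_i + b_j_tilde - np.log(x_ij)
        return glove_weight(x_ij) * err ** 2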

Relation to word2vec

GloVe Objective:

$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$

word2vec (Skip-Gram) Objective (after rewriting):

$J = -\sum_{i=1}^{V} \sum_{j=1}^{V} X_i \, P(w_j \mid w_i) \log Q(w_j \mid w_i)$

where $X_i = \sum_k X_{ik}$ and P and Q are the empirical co-occurrence and model co-occurrence distributions, respectively. Weighted cross-entropy error!

Authors show that replacing cross-entropy with least-squares nearly re-derives GloVe:

$\hat{J} = \sum_{i,j} X_i \left( w_i^\top \tilde{w}_j - \log X_{ij} \right)^2$

Model Complexity

word2vec: O(|C|), linear in corpus size

GloVe naïve estimate: $O(|V|^2)$, square of vocab size

But it actually only depends on the number of non-zero entries in X. If co-occurrences are modeled via a power law, then we have

$X_{ij} = \frac{k}{(r_{ij})^{\alpha}}$

Modeling |C| under this assumption, the authors eventually arrive at

$|X| = \begin{cases} O(|C|) & \text{if } \alpha < 1 \\ O(|C|^{1/\alpha}) & \text{otherwise} \end{cases}$

where the corpora studied in the paper were well modeled with α = 1.25, leading to $O(|C|^{0.8})$. In practice, it’s faster than word2vec.

Accuracy vs Vector Size

[Figure: analogy accuracy (%) vs. vector dimension (up to 600), with separate curves for semantic, syntactic, and overall accuracy.]

Accuracy vs Window Size

[Figure: analogy accuracy (%) vs. window size (2–10), with separate curves for semantic, syntactic, and overall accuracy. Left: symmetric window. Right: asymmetric window.]

Accuracy vs Corpus Choice

[Figure: semantic, syntactic, and overall analogy accuracy (%) across training corpora: Wiki2010 (1B tokens), Wiki2014 (1.6B tokens), Gigaword5 (4.3B tokens), Gigaword5 + Wiki2014 (6B tokens), and Common Crawl (42B tokens).]

Training Time

[Figure: analogy accuracy (%) vs. training time (hours) for GloVe (varying the number of iterations) and Skip-Gram (varying the number of negative samples).]

This evaluation is not quite fair!


But wait…

(Referring to word2vec)

…the code is designed only for a single training epoch… …specifies a learning schedule specific to one pass through the data, making a modification for multiple passes a non-trivial task.

This is false, and a bad excuse. How could you modify word2vec to support multiple “epochs” with zero effort?


But wait…

…we choose to use the sum W + W̃ as our word vectors.

word2vec also learns W̃! Why not do the same for both?

34 Embedding performance is very much a function of your hyperparameters!

Investigating Hyperparameters [8]

Levy, Goldberg, and Dagan perform a systematic evaluation of word2vec, SVD, and GloVe where hyperparameters are controlled for and optimized across different methods for a variety of tasks. Interesting results:

MSR’s analogy dataset is the only case where SGNS and GloVe substantially outperform PPMI and SVD.

And: …SGNS outperforms GloVe in every task.

[8] Omer Levy, Yoav Goldberg, and Ido Dagan. “Improving Distributional Similarity with Lessons Learned from Word Embeddings”. In: Transactions of the Association for Computational Linguistics 3 (2015), pp. 211–225.

But wait… (mark II)

Both GloVe and word2vec use context window weighting:

• word2vec: samples a window size ℓ ∈ [1, L] for each token before extracting counts
• GloVe: weighs contexts by their distance d from the target word using the harmonic function 1/d

What is the impact of using the different context window strategies between methods? Levy, Goldberg, and Dagan did not investigate this, instead using word2vec’s method across all of their evaluations.
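For concreteness, a sketch of the two schemes as the (expected) contribution of a context token at each distance 1…L from the target; the function names are mine.

    import numpy as np

    def word2vec_expected_weights(L):
        """word2vec samples l ~ Uniform{1..L} per token; a context at distance d is kept
        whenever d <= l, so in expectation it receives weight (L - d + 1) / L."""
        return np.array([(L - d + 1) / L for d in range(1, L + 1)])

    def glove_weights(L):
        """GloVe adds 1/d to X_ij for a context word at distance d from the target."""
        return np.array([1.0 / d for d in range(1, L + 1)])

    # Example: for L = 5, word2vec's expected weights are [1.0, 0.8, 0.6, 0.4, 0.2],
    # while GloVe's are [1.0, 0.5, 0.33, 0.25, 0.2] -- similar in spirit, not identical.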

Evaluating things properly is non-obvious and often very difficult!

What You Should Know

• What is the sparsity problem in NLP?
• What is the distributional hypothesis?
• How does Brown clustering work? What is it trying to maximize?
• What is agglomerative clustering?
• What is a word embedding?
• What is the difference between the CBOW and Skip-Gram models?
• What is negative sampling and why is it useful?
• What is the learning objective for SGNS?
• What is the matrix that SGNS is implicitly factorizing?
• What is the learning objective for GloVe?
• What are some examples of hyperparameters for word embedding methods?

word2vec Implementations (non-exhaustive)

1. The original implementation: https://code.google.com/archive/p/word2vec/

2. Yoav Goldberg’s word2vecf modification (multiple epochs, arbitrary context features): https://bitbucket.org/yoavgo/word2vecf

3. MeTA (develop branch): https://meta-toolkit.org

4. Python implementation in gensim: https://radimrehurek.com/gensim/models/word2vec.html

GloVe Implementations (non-exhaustive)

1. The original implementation: http://nlp.stanford.edu/projects/glove/

2. MeTA: https://meta-toolkit.org

3. R implementation text2vec: http://text2vec.org/

39 Thanks!
