
Distributional Semantics and Word Embeddings
Chase Geigle
2016-10-14

What is a word?

[Figure: “dog” and “canine” shown as one-hot vectors over a ~400,000-word vocabulary; each vector is a single 1 surrounded by 399,999 zeros. Sparsity!]

The Scourge of Sparsity

In the real world, sparsity hurts us:
• Low vocabulary coverage in training → high OOV rate in applications (poor generalization)
• “one-hot” representations → huge parameter counts
• information about one word completely unutilized for another!

Thus, a good representation must:
• reduce parameter space,
• improve generalization, and
• somehow “transfer” or “share” knowledge between words.

The Distributional Hypothesis

You shall know a word by the company it keeps. (J. R. Firth, 1957)

Words with high similarity occur in the same contexts as one another.

A bit of foreshadowing:
• A word ought to be able to predict its context (word2vec Skip-Gram)
• A context ought to be able to predict its missing word (word2vec CBOW)

Brown Clustering

Language Models

A language model is a probability distribution over word sequences.

• Unigram language model p(w \mid \theta):
  p(W \mid \theta) = \prod_{i=1}^{N} p(w_i \mid \theta)

• Bigram language model p(w_i \mid w_{i-1}; \theta):
  p(W \mid \theta) = \prod_{i=1}^{N} p(w_i \mid w_{i-1}; \theta)

• n-gram language model p(w_n \mid w_1^{n-1}; \theta):
  p(W \mid \theta) = \prod_{i=1}^{N} p(w_i \mid w_{i-n+1}^{i-1}; \theta)
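To make the product forms above concrete, here is a minimal sketch (not from the slides) that estimates unigram and bigram models with maximum-likelihood counts over a tiny made-up corpus and scores a sentence. The corpus, the <s>/</s> boundary markers, and the function names are illustrative assumptions, and there is no smoothing, so unseen words or bigrams would break it.

```python
from collections import Counter
from math import exp, log

# Tiny illustrative corpus with <s>/</s> boundary markers (an assumption, not from the slides).
corpus = [["<s>", "the", "dog", "barks", "</s>"],
          ["<s>", "the", "canine", "barks", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((u, v) for sent in corpus for u, v in zip(sent, sent[1:]))
N = sum(unigrams.values())

def unigram_logprob(sentence):
    # log p(W | theta) = sum_i log p(w_i | theta), with MLE estimate n_w / N
    return sum(log(unigrams[w] / N) for w in sentence)

def bigram_logprob(sentence):
    # log p(W | theta) = sum_i log p(w_i | w_{i-1}; theta), with MLE estimate n_{uv} / n_u
    return sum(log(bigrams[(u, v)] / unigrams[u]) for u, v in zip(sentence, sentence[1:]))

sentence = ["<s>", "the", "dog", "barks", "</s>"]
print(exp(unigram_logprob(sentence)), exp(bigram_logprob(sentence)))
```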
A Class-Based Language Model [1]

Main Idea: cluster words into a fixed number of clusters C and use their cluster assignments as their identity instead (reducing sparsity).

If \pi : V \to C is a mapping function from a word type to a cluster (or class), we want to find \pi^* = \arg\max_\pi p(W \mid \pi), where

p(W \mid \pi) = \prod_{i=1}^{N} p(c_i \mid c_{i-1}) \, p(w_i \mid c_i).

[1] Peter F. Brown et al. “Class-based N-gram Models of Natural Language”. In: Comput. Linguist. 18.4 (Dec. 1992), pp. 467–479.

Finding the best partition

\pi^* = \arg\max_\pi P(W \mid \pi) = \arg\max_\pi \log P(W \mid \pi)

One can derive [2] L(\pi) to be

L(\pi) = \sum_w n_w \log n_w + \sum_{c_i, c_j} n_{c_i, c_j} \log \frac{n_{c_i, c_j}}{n_{c_i} \cdot n_{c_j}}

where the first term is (nearly) the unigram entropy (fixed w.r.t. \pi) and the second is (nearly) the mutual information of adjacent classes (varies with \pi).

What does maximizing this mean? Recall

MI(c, c') = \sum_{c_i, c_j} p(c_i, c_j) \log \frac{p(c_i, c_j)}{p(c_i) \, p(c_j)}

which can be shown to be

MI(c, c') = \frac{1}{N} \sum_{c_i, c_j} n_{c_i, c_j} \log \frac{n_{c_i, c_j}}{n_{c_i} \cdot n_{c_j}} + \log N

where the factor 1/N and the term \log N are constants, and thus maximizing the mutual information of adjacent classes selects the best \pi.

[2] Sven Martin, Jörg Liermann, and Hermann Ney. “Algorithms for Bigram and Trigram Word Clustering”. In: Speech Commun. 24.1 (Apr. 1998), pp. 19–37.

Finding the best partition (cont’d)

\pi^* = \arg\max_\pi \sum_w n_w \log n_w + \sum_{c_i, c_j} n_{c_i, c_j} \log \frac{n_{c_i, c_j}}{n_{c_i} \cdot n_{c_j}}

Direct maximization is intractable! Thus, agglomerative (bottom-up) clustering is used as a greedy heuristic. The best merge is determined by the lowest loss in average mutual information.

Agglomerative Clustering
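As an illustration only (not the slides’ implementation), here is a brute-force sketch of the greedy procedure: start with one class per word type and repeatedly make the merge that keeps the objective’s mutual-information term as high as possible, i.e., loses the least. The toy corpus and helper names are assumptions, and real Brown clustering adds efficiency tricks (restricting candidate merges to frequent words, incremental count updates) that this quadratic recomputation skips.

```python
from collections import Counter
from itertools import combinations
from math import log

# Toy corpus (an illustrative assumption).
corpus = "the dog barks the canine barks the cat meows the feline meows".split()

def mi_term(assign):
    # The pi-dependent term of L(pi): sum over adjacent class pairs of
    # n_{c,c'} * log(n_{c,c'} / (n_c * n_{c'})), using raw counts.
    pair_counts = Counter((assign[u], assign[v]) for u, v in zip(corpus, corpus[1:]))
    class_counts = Counter(assign[w] for w in corpus)
    return sum(n * log(n / (class_counts[c1] * class_counts[c2]))
               for (c1, c2), n in pair_counts.items())

def greedy_brown_clusters(k):
    # Start with one class per word type (class id = a representative word),
    # then greedily apply the merge that scores best, until k classes remain.
    assign = {w: w for w in set(corpus)}
    while len(set(assign.values())) > k:
        best_score, best_merge = None, None
        for a, b in combinations(sorted(set(assign.values())), 2):
            trial = {w: (a if c == b else c) for w, c in assign.items()}
            score = mi_term(trial)
            if best_score is None or score > best_score:
                best_score, best_merge = score, (a, b)
        a, b = best_merge
        assign = {w: (a if c == b else c) for w, c in assign.items()}
    return assign

print(greedy_brown_clusters(3))
```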
Do they work? (Yes.)

Named entity recognition:
• Scott Miller, Jethran Guinness, and Alex Zamanian. “Name Tagging with Word Clusters and Discriminative Training”. In: HLT-NAACL. 2004, pp. 337–342
• Lev Ratinov and Dan Roth. “Design Challenges and Misconceptions in Named Entity Recognition”. In: Proc. CoNLL. 2009, pp. 147–155

Dependency parsing:
• Terry Koo, Xavier Carreras, and Michael Collins. “Simple Semi-supervised Dependency Parsing”. In: Proc. ACL: HLT. June 2008, pp. 595–603
• Jun Suzuki and Hideki Isozaki. “Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data”. In: Proc. ACL: HLT. June 2008, pp. 665–673

Constituency parsing:
• Marie Candito and Benoît Crabbé. “Improving Generative Statistical Parsing with Semi-supervised Word Clustering”. In: IWPT. 2009, pp. 138–141
• Muhua Zhu et al. “Fast and Accurate Shift-Reduce Constituent Parsing”. In: ACL. Aug. 2013, pp. 434–443

Vector Spaces for Word Representation

Toward Vector Spaces

Brown clusters are nice, but limiting:
• Arbitrary cut-off point → two words may be assigned different classes arbitrarily because of our chosen cutoff
• Definition of “similarity” is limited to only bigrams of classes [3]
• Completely misses lots of regularity that’s present in language

[3] There are adaptations of Brown clustering that move past this, but what we’ll see today is still even better.
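One way to read the “arbitrary cut-off point” bullet: the agglomerative merges form a binary tree, so each word gets a bit string, and downstream systems (such as the NER and parsing work cited above) typically use bit-string prefixes of some chosen length as features. The bit strings below are made up for illustration; the sketch just shows how the chosen prefix length decides which words land in the same class.

```python
# Hypothetical Brown-cluster bit strings (paths through the merge tree);
# real strings depend on the corpus and clustering run.
bit_strings = {
    "dog":    "0110",
    "canine": "0111",
    "cat":    "0100",
    "monday": "1010",
    "friday": "1011",
}

def classes_at(prefix_len):
    # Group words by a fixed-length prefix of their bit string (the "cut-off point").
    groups = {}
    for word, bits in bit_strings.items():
        groups.setdefault(bits[:prefix_len], []).append(word)
    return groups

# A short prefix gives coarse classes; the full string separates everything.
for k in (1, 3, 4):
    print(k, classes_at(k))
```

With prefix length 3, “dog” and “canine” share a class; at length 4 they do not, which is exactly the arbitrariness the bullet points at.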