
Distributional Semantics and Word Embeddings
Chase Geigle
2016-10-14

What is a word?

[Figure: “dog” and “canine” shown as one-hot vectors over a ~400,000-word vocabulary; each vector is a single 1 surrounded by 399,999 zeros. Sparsity!]

The Scourge of Sparsity

In the real world, sparsity hurts us:
• Low vocabulary coverage in training → high OOV rate in applications (poor generalization)
• “one-hot” representations → huge parameter counts
• information about one word completely unutilized for another!

Thus, a good representation must:
• reduce parameter space,
• improve generalization, and
• somehow “transfer” or “share” knowledge between words.

The Distributional Hypothesis

You shall know a word by the company it keeps. (J. R. Firth, 1957)

Words with high similarity occur in the same contexts as one another.

A bit of foreshadowing:
• A word ought to be able to predict its context (word2vec Skip-Gram)
• A context ought to be able to predict its missing word (word2vec CBOW)

Brown Clustering

Language Models

A language model is a probability distribution over word sequences.

• Unigram language model p(w \mid \theta):
  p(W \mid \theta) = \prod_{i=1}^{N} p(w_i \mid \theta)

• Bigram language model p(w_i \mid w_{i-1}; \theta):
  p(W \mid \theta) = \prod_{i=1}^{N} p(w_i \mid w_{i-1}; \theta)

• n-gram language model p(w_n \mid w_1^{n-1}; \theta):
  p(W \mid \theta) = \prod_{i=1}^{N} p(w_i \mid w_{i-n+1}^{i-1}; \theta)
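To make the product forms above concrete, here is a minimal sketch (not from the slides) that estimates unigram and bigram models with maximum-likelihood counts over a tiny made-up corpus and scores a sentence. The corpus, the <s>/</s> boundary markers, and the function names are illustrative assumptions, and there is no smoothing, so unseen words or bigrams would break it.

```python
from collections import Counter
from math import exp, log

# Tiny illustrative corpus with <s>/</s> boundary markers (an assumption, not from the slides).
corpus = [["<s>", "the", "dog", "barks", "</s>"],
          ["<s>", "the", "canine", "barks", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((u, v) for sent in corpus for u, v in zip(sent, sent[1:]))
N = sum(unigrams.values())

def unigram_logprob(sentence):
    # log p(W | theta) = sum_i log p(w_i | theta), with MLE estimate n_w / N
    return sum(log(unigrams[w] / N) for w in sentence)

def bigram_logprob(sentence):
    # log p(W | theta) = sum_i log p(w_i | w_{i-1}; theta), with MLE estimate n_{uv} / n_u
    return sum(log(bigrams[(u, v)] / unigrams[u]) for u, v in zip(sentence, sentence[1:]))

sentence = ["<s>", "the", "dog", "barks", "</s>"]
print(exp(unigram_logprob(sentence)), exp(bigram_logprob(sentence)))
```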
A Class-Based Language Model [1]

Main Idea: cluster words into a fixed number of clusters C and use their cluster assignments as their identity instead (reducing sparsity).

If \pi : V \to C is a mapping function from a word type to a cluster (or class), we want to find \pi^* = \arg\max_\pi p(W \mid \pi), where

p(W \mid \pi) = \prod_{i=1}^{N} p(c_i \mid c_{i-1}) \, p(w_i \mid c_i).

[1] Peter F. Brown et al. “Class-based N-gram Models of Natural Language”. In: Comput. Linguist. 18.4 (Dec. 1992), pp. 467–479.

Finding the best partition

\pi^* = \arg\max_\pi P(W \mid \pi) = \arg\max_\pi \log P(W \mid \pi)

One can derive [2] L(\pi) to be

L(\pi) = \sum_w n_w \log n_w + \sum_{c_i, c_j} n_{c_i, c_j} \log \frac{n_{c_i, c_j}}{n_{c_i} \cdot n_{c_j}}

where the first term is (nearly) the unigram entropy (fixed w.r.t. \pi) and the second is (nearly) the mutual information of adjacent classes (varies with \pi).

What does maximizing this mean? Recall

MI(c, c') = \sum_{c_i, c_j} p(c_i, c_j) \log \frac{p(c_i, c_j)}{p(c_i) \, p(c_j)}

which can be shown to be

MI(c, c') = \frac{1}{N} \sum_{c_i, c_j} n_{c_i, c_j} \log \frac{n_{c_i, c_j}}{n_{c_i} \cdot n_{c_j}} + \log N

where the factor 1/N and the term \log N are constants, and thus maximizing the mutual information of adjacent classes selects the best \pi.

[2] Sven Martin, Jörg Liermann, and Hermann Ney. “Algorithms for Bigram and Trigram Word Clustering”. In: Speech Commun. 24.1 (Apr. 1998), pp. 19–37.

Finding the best partition (cont’d)

\pi^* = \arg\max_\pi \sum_w n_w \log n_w + \sum_{c_i, c_j} n_{c_i, c_j} \log \frac{n_{c_i, c_j}}{n_{c_i} \cdot n_{c_j}}

Direct maximization is intractable! Thus, agglomerative (bottom-up) clustering is used as a greedy heuristic. The best merge is determined by the lowest loss in average mutual information.

Agglomerative Clustering
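As an illustration only (not the slides’ implementation), here is a brute-force sketch of the greedy procedure: start with one class per word type and repeatedly make the merge that keeps the objective’s mutual-information term as high as possible, i.e., loses the least. The toy corpus and helper names are assumptions, and real Brown clustering adds efficiency tricks (restricting candidate merges to frequent words, incremental count updates) that this quadratic recomputation skips.

```python
from collections import Counter
from itertools import combinations
from math import log

# Toy corpus (an illustrative assumption).
corpus = "the dog barks the canine barks the cat meows the feline meows".split()

def mi_term(assign):
    # The pi-dependent term of L(pi): sum over adjacent class pairs of
    # n_{c,c'} * log(n_{c,c'} / (n_c * n_{c'})), using raw counts.
    pair_counts = Counter((assign[u], assign[v]) for u, v in zip(corpus, corpus[1:]))
    class_counts = Counter(assign[w] for w in corpus)
    return sum(n * log(n / (class_counts[c1] * class_counts[c2]))
               for (c1, c2), n in pair_counts.items())

def greedy_brown_clusters(k):
    # Start with one class per word type (class id = a representative word),
    # then greedily apply the merge that scores best, until k classes remain.
    assign = {w: w for w in set(corpus)}
    while len(set(assign.values())) > k:
        best_score, best_merge = None, None
        for a, b in combinations(sorted(set(assign.values())), 2):
            trial = {w: (a if c == b else c) for w, c in assign.items()}
            score = mi_term(trial)
            if best_score is None or score > best_score:
                best_score, best_merge = score, (a, b)
        a, b = best_merge
        assign = {w: (a if c == b else c) for w, c in assign.items()}
    return assign

print(greedy_brown_clusters(3))
```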
Do they work? (Yes.)

Named entity recognition:
• Scott Miller, Jethran Guinness, and Alex Zamanian. “Name Tagging with Word Clusters and Discriminative Training”. In: HLT-NAACL. 2004, pp. 337–342
• Lev Ratinov and Dan Roth. “Design Challenges and Misconceptions in Named Entity Recognition”. In: Proc. CoNLL. 2009, pp. 147–155

Dependency parsing:
• Terry Koo, Xavier Carreras, and Michael Collins. “Simple Semi-supervised Dependency Parsing”. In: Proc. ACL: HLT. June 2008, pp. 595–603
• Jun Suzuki and Hideki Isozaki. “Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data”. In: Proc. ACL: HLT. June 2008, pp. 665–673

Constituency parsing:
• Marie Candito and Benoît Crabbé. “Improving Generative Statistical Parsing with Semi-supervised Word Clustering”. In: IWPT. 2009, pp. 138–141
• Muhua Zhu et al. “Fast and Accurate Shift-Reduce Constituent Parsing”. In: ACL. Aug. 2013, pp. 434–443

Vector Spaces for Word Representation

Toward Vector Spaces

Brown clusters are nice, but limiting:
• Arbitrary cut-off point → two words may be assigned different classes arbitrarily because of our chosen cutoff
• Definition of “similarity” is limited to only bigrams of classes [3]
• Completely misses lots of regularity that’s present in language

[3] There are adaptations of Brown clustering that move past this, but what we’ll see today is still even better.
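One way to read the “arbitrary cut-off point” bullet: the agglomerative merges form a binary tree, so each word gets a bit string, and downstream systems (such as the NER and parsing work cited above) typically use bit-string prefixes of some chosen length as features. The bit strings below are made up for illustration; the sketch just shows how the chosen prefix length decides which words land in the same class.

```python
# Hypothetical Brown-cluster bit strings (paths through the merge tree);
# real strings depend on the corpus and clustering run.
bit_strings = {
    "dog":    "0110",
    "canine": "0111",
    "cat":    "0100",
    "monday": "1010",
    "friday": "1011",
}

def classes_at(prefix_len):
    # Group words by a fixed-length prefix of their bit string (the "cut-off point").
    groups = {}
    for word, bits in bit_strings.items():
        groups.setdefault(bits[:prefix_len], []).append(word)
    return groups

# A short prefix gives coarse classes; the full string separates everything.
for k in (1, 3, 4):
    print(k, classes_at(k))
```

With prefix length 3, “dog” and “canine” share a class; at length 4 they do not, which is exactly the arbitrariness the bullet points at.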