word2vec (Tomas Mikolov)
Slides by Omer Levy and Guy Rapaport

Word Similarity & Relatedness

Approaches for Representing Words
• Distributional Semantics (Count)
  • Used since the 90's
  • Sparse word-context PMI/PPMI matrix
  • Decomposed with SVD
• Word Embeddings (Predict)
  • Inspired by deep learning
  • word2vec (Mikolov et al., 2013)
• Underlying theory: the Distributional Hypothesis (Harris, '54; Firth, '57): "Similar words occur in similar contexts."

Approaches for Representing Words
Both approaches:
• Rely on the same linguistic theory
• Use the same data
• Are mathematically related
  • "Neural Word Embedding as Implicit Matrix Factorization" (NIPS 2014)
• How come word embeddings are so much better?
  • "Don't Count, Predict!" (Baroni et al., ACL 2014)

Background

Distributional Semantics
Marco saw a furry little wampimuk hiding in the tree.
words        contexts
wampimuk     furry
wampimuk     little
wampimuk     hiding
wampimuk     in
…            …

What is word2vec? How is it related to PMI?

What is word2vec?
• word2vec is not a single algorithm
• It is a software package for representing words as vectors, containing:
  • Two distinct models
    • CBoW
    • Skip-Gram (SG)
  • Various training methods
    • Negative Sampling (NS)
    • Hierarchical Softmax
  • A rich preprocessing pipeline
    • Dynamic Context Windows
    • Subsampling
    • Deleting Rare Words

Skip-Grams with Negative Sampling (SGNS)
Marco saw a furry little wampimuk hiding in the tree.
words        contexts
wampimuk     furry
wampimuk     little
wampimuk     hiding
wampimuk     in
…            …
"word2vec Explained…" (Goldberg & Levy, arXiv 2014)

Examples Time!!!

Some Results - Machine Translation
• Word vectors should have similar structure when trained on comparable corpora.
• This should hold even for corpora in different languages.

Some Results - Machine Translation
• For translation from one vector space to another, we need to learn a linear projection (which performs rotation and scaling).
• A small starting dictionary can be used to train the linear projection.
• Then, we can translate any word that was seen in the monolingual data.
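A minimal sketch of this projection step, assuming source- and target-language vectors have already been trained and a small seed dictionary of word pairs is available (the matrices and word lists below are hypothetical placeholders); the mapping W is fit with ordinary least squares, one common way to train such a linear projection:

```python
# Learn a linear projection between two monolingual embedding spaces and use it
# to translate by nearest-neighbour lookup in the target space.
# src_vecs, tgt_vecs, tgt_matrix, tgt_words are placeholders for real data.
import numpy as np

def learn_projection(src_vecs, tgt_vecs):
    """Least-squares fit of W such that src_vecs @ W ~ tgt_vecs.

    src_vecs: (n, d_src) source-language vectors for the seed dictionary
    tgt_vecs: (n, d_tgt) corresponding target-language vectors
    """
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W  # shape (d_src, d_tgt)

def translate(word_vec, W, tgt_matrix, tgt_words, k=5):
    """Project a source vector and return the k nearest target words by cosine."""
    query = word_vec @ W
    sims = tgt_matrix @ query / (
        np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(query) + 1e-9
    )
    return [tgt_words[i] for i in np.argsort(-sims)[:k]]
```

Because the projection is trained only on the small seed dictionary, any other word seen in the monolingual data can then be translated through its vector.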
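Before the analogy and context-type results below, a brief sketch of how the models, training methods, and preprocessing steps listed in the "What is word2vec?" slide map onto hyperparameters of one common implementation; this uses the gensim library (assuming its 4.x parameter names), and the toy corpus is a placeholder:

```python
# SGNS training with gensim (4.x API assumed); the corpus is a toy placeholder.
from gensim.models import Word2Vec

corpus = [
    ["marco", "saw", "a", "furry", "little", "wampimuk", "hiding", "in", "the", "tree"],
    # ... many more tokenized sentences in practice
]

model = Word2Vec(
    sentences=corpus,
    sg=1,             # Skip-Gram (sg=0 selects CBoW)
    negative=5,       # Negative Sampling with 5 noise words per pair
    hs=0,             # hs=1 would switch to Hierarchical Softmax
    window=5,         # maximum window; gensim samples a dynamic window per token
    sample=1e-3,      # subsampling threshold for frequent words
    min_count=1,      # raise (e.g. to 5) on real corpora to delete rare words
    vector_size=100,
    epochs=5,
)

# vec = model.wv["wampimuk"]           # the learned word vector
# model.wv.most_similar("wampimuk")    # nearest neighbours by cosine similarity
```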
Some Results - Word Analogies

Some Results - Phrase Analogies

Visualization of Regularities in Word Vector Space

Context Types
Example: Australian scientist discovers star with telescope
Target word: discovers

Bag of Words (BoW) Context
Australian scientist discovers star with telescope

Syntactic Dependency Context
Australian scientist discovers star with telescope
Contexts of "discovers": scientist/nsubj, star/dobj, telescope/prep_with

What is the effect of different context types?

Embedding Similarity with Different Contexts
Target Word: Hogwarts (Harry Potter's school)
Bag of Words (k=5): Dumbledore, hallows, half-blood, Malfoy, Snape — related to Harry Potter
Dependencies: Sunnydale, Collinwood, Calarts, Greendale, Millfield — related to schools

Embedding Similarity with Different Contexts
Target Word: Turing (computer scientist)
Bag of Words (k=5): nondeterministic, non-deterministic, computability, deterministic, finite-state — related to computability
Dependencies: Pauling, Hotelling, Heting, Lessing, Hamming — related to scientists

Embedding Similarity with Different Contexts
Target Word: dancing (dance gerund)
Bag of Words (k=5): singing, dance, dances, dancers, tap-dancing — related to dance
Dependencies: singing, rapping, breakdancing, miming, busking — related to gerunds

Embedding Similarity with Different Contexts
• Dependency-based embeddings have more functional similarities
• This phenomenon goes beyond these examples
• Quantitative Analysis (in the paper)

Thank You
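For reference, a minimal sketch of the vector-arithmetic analogy test behind the "Word Analogies" and "Regularities" slides above; the embedding matrix E and vocabulary list are hypothetical placeholders standing in for trained word2vec vectors:

```python
# Answer "a is to b as c is to ?" with the word nearest to (b - a + c),
# i.e. the 3CosAdd analogy method. E and vocab are placeholders.
import numpy as np

def analogy(a, b, c, E, vocab, k=1):
    """Return the k best answers to 'a : b :: c : ?'."""
    idx = {w: i for i, w in enumerate(vocab)}
    # Normalise rows so dot products are cosine similarities.
    norm = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-9)
    query = norm[idx[b]] - norm[idx[a]] + norm[idx[c]]
    sims = norm @ query
    for w in (a, b, c):           # exclude the question words themselves
        sims[idx[w]] = -np.inf
    return [vocab[i] for i in np.argsort(-sims)[:k]]

# Typical behaviour on real vectors: analogy("man", "king", "woman", E, vocab) -> ["queen"]
```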
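And a hedged sketch of extracting the syntactic-dependency (word, context) pairs compared above, using spaCy; it simplifies the original preprocessing (for example, prepositions are not collapsed into relations like prep_with) and assumes the en_core_web_sm model is installed:

```python
# Dependency-based contexts instead of bag-of-words contexts: each token is
# paired with its parse-tree neighbours tagged by the dependency relation
# (inverse relation for the head), in the spirit of the slides above.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model has been downloaded

def dependency_contexts(sentence):
    """Yield (word, context) pairs from the dependency parse of a sentence."""
    doc = nlp(sentence)
    for token in doc:
        for child in token.children:
            yield token.text, f"{child.text}/{child.dep_}"
        if token.head is not token:            # skip the root's self-loop
            yield token.text, f"{token.head.text}/{token.dep_}-1"

for pair in dependency_contexts("Australian scientist discovers star with telescope"):
    print(pair)
# e.g. ('discovers', 'scientist/nsubj'), ('scientist', 'discovers/nsubj-1'), ...
```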