
Natural Language Processing (CSEP 517): Distributional Semantics

Roy Schwartz © 2017

University of Washington [email protected]

May 15, 2017

To-Do List

• Read: (Jurafsky and Martin, 2016a,b)

Distributional Semantics Models
Aka Vector Space Models, Word Embeddings

v_mountain = (-0.23, -0.21, -0.15, -0.61, ..., -0.02, -0.12)
v_lion = (-0.72, -0.20, -0.71, -0.13, ..., -0.10, -0.11)

[Figure: v_mountain and v_lion plotted as two nearby vectors, separated by an angle θ]

Distributional Semantics Models: Applications

[Figure: applications of distributional semantic models, including deep learning models and linguistic study (multilingual studies, evolution of language, syntactic studies, ...)]

Outline

Vector Space Models

Lexical Semantic Applications

Word Embeddings

Compositionality

Current Research Problems

Vector Space Models

Distributional Semantics Hypothesis
Harris (1954)

Words that have similar contexts are likely to have similar meaning.

Vector Space Models

• Representation of words by vectors of real numbers
• ∀w ∈ V, v_w is a function of the contexts in which w occurs
• Vectors are computed using a large corpus
• No requirement for any sort of annotation in the general case

V1.0: Count Models
Salton (1971)

• Each element v_w[i] ∈ v_w represents the co-occurrence of w with another word i (see the toy sketch below)
  • v_dog = (cat: 10, leash: 15, loyal: 27, bone: 8, piano: 0, cloud: 0, ...)
• Vector dimension is typically very large (vocabulary size)
• Main motivation: lexical semantics
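As a concrete illustration, here is a minimal Python sketch (not from the slides) of building such co-occurrence vectors from a toy corpus with a symmetric context window; the corpus, window size, and function name are illustrative only.

```python
from collections import Counter, defaultdict

def count_vectors(sentences, window=2):
    """Map each word to a Counter of co-occurring words within the window."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

vecs = count_vectors([["the", "dog", "chased", "the", "cat"],
                      ["the", "dog", "chewed", "a", "bone"]])
# vecs["dog"] -> Counter({'the': 3, 'chased': 1, 'chewed': 1, 'a': 1})
```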

Count Models: Example

v_dog = (0, 0, 15, 17, ..., 0, 102)
v_cat = (0, 2, 11, 13, ..., 20, 11)

[Figure: v_dog and v_cat plotted as two nearby vectors, separated by an angle θ]


Variants of Count Models

• Reduce the effect of high-frequency words by applying a weighting scheme
  • Pointwise mutual information (PMI), TF-IDF
• Smoothing by dimensionality reduction (see the sketch below)
  • Singular value decomposition (SVD), principal component analysis (PCA), matrix factorization methods
• What is a context?
  • Bag-of-words context, document context (Latent Semantic Analysis (LSA)), dependency contexts, pattern contexts
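A rough sketch of two of these variants, assuming a dense numpy count matrix M (rows = words, columns = contexts) like the one built in the earlier sketch: positive PMI weighting followed by truncated SVD. The function names and the dimensionality k are illustrative, not a specific toolkit's API.

```python
import numpy as np

def ppmi(M):
    """Positive pointwise mutual information weighting of a count matrix."""
    total = M.sum()
    p_word = M.sum(axis=1, keepdims=True) / total   # row (word) marginals
    p_ctx = M.sum(axis=0, keepdims=True) / total    # column (context) marginals
    with np.errstate(divide="ignore"):
        pmi = np.log((M / total) / (p_word * p_ctx))
    return np.maximum(pmi, 0.0)                     # clip negative PMI to zero

def reduce_dim(M, k=50):
    """Smooth by truncated SVD, keeping k latent dimensions per word."""
    U, S, _ = np.linalg.svd(ppmi(M), full_matrices=False)
    return U[:, :k] * S[:k]
```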

Lexical Semantic Applications


Vector Space Models: Evaluation

• Vector space models as features
  • Synonym detection
    • TOEFL (Landauer and Dumais, 1997)
  • Word clustering
    • CLUTO (Karypis, 2002)
• Vector operations
  • Semantic similarity
    • RG-65 (Rubenstein and Goodenough, 1965), wordsim353 (Finkelstein et al., 2001), MEN (Bruni et al., 2014), SimLex-999 (Hill et al., 2015)
  • Word analogies
    • Mikolov et al. (2013)

Semantic Similarity

w1            w2        human score   model score
tiger         cat       7.35          0.80
computer      keyboard  7.62          0.54
...           ...       ...           ...
architecture  century   3.78          0.03
book          paper     7.46          0.66
king          cabbage   0.23          -0.42

Table: Human scores taken from wordsim353 (Finkelstein et al., 2001)

• Model scores are similarity scores between the corresponding word vectors
• The model's performance is the Spearman/Pearson correlation between the human ranking and the model ranking (see the sketch below)
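The evaluation protocol can be sketched in a few lines of Python: score each pair by cosine similarity and report the Spearman correlation with the human scores. The random vectors below are placeholders for trained embeddings; the word pairs and human scores are copied from the table above.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
words = ["tiger", "cat", "computer", "keyboard", "king", "cabbage"]
vectors = {w: rng.standard_normal(50) for w in words}   # placeholder embeddings

pairs = [("tiger", "cat", 7.35), ("computer", "keyboard", 7.62), ("king", "cabbage", 0.23)]
human = [score for _, _, score in pairs]
model = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in pairs]

rho, _ = spearmanr(human, model)   # the number reported as the model's performance
print(rho)
```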

Word Analogy
Mikolov et al. (2013)

[Figure: word analogies as parallel vector offsets: France is to Paris as Italy is to Rome, and man is to king as woman is to queen]
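The analogy task is typically answered by vector arithmetic plus a nearest-neighbour search, as in this illustrative sketch (the function name and the assumption that `vectors` maps words to numpy arrays are mine; with untrained vectors the answer is of course meaningless).

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest (by cosine) to v_b - v_a + v_c."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):                 # exclude the query words themselves
            continue
        sim = vec @ target / np.linalg.norm(vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# With good embeddings: analogy("man", "king", "woman", vectors)     -> "queen"
#                       analogy("France", "Paris", "Italy", vectors) -> "Rome"
```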

Word Embeddings

V2.0: Predict Models (Aka Word Embeddings)

• A new generation of vector space models
• Instead of representing vectors as co-occurrence counts, train a supervised algorithm to predict p(word | context)
• Models learn a latent vector representation of each word
  • These representations turn out to be quite effective vector space representations
• Word embeddings

Word Embeddings

• Vector size is typically a few dozen to a few hundred
• Vector elements are generally uninterpretable
• Developed to initialize feature vectors in deep learning models
  • Initially language models, nowadays virtually every sequence-level NLP task
  • Bengio et al. (2003); Collobert and Weston (2008); Collobert et al. (2011); word2vec (Mikolov et al., 2013); GloVe (Pennington et al., 2014)


word2vec
Mikolov et al. (2013)

• A software toolkit for running various word embedding algorithms
• Continuous bag-of-words: argmax_θ ∏_{w ∈ corpus} p(w | C(w); θ)
• Skip-gram: argmax_θ ∏_{(w,c) ∈ corpus} p(c | w; θ)
• Negative sampling: randomly sample negative (word, context) pairs c′, then (a numerical sketch follows below):
  argmax_θ ∏_{(w,c) ∈ corpus} p(c | w; θ) · ∏_{(w,c′)} (1 − p(c′ | w; θ))

Based on Goldberg and Levy (2014)
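The negative-sampling objective can be made concrete with a small numpy sketch of a single stochastic update, assuming the usual parameterization p(c | w; θ) = σ(v_c · v_w). This illustrates the objective above; it is not the word2vec implementation itself, and in practice one would use an existing toolkit such as the original word2vec code or gensim.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(v_w, c_pos, c_negs, lr=0.025):
    """One gradient-ascent step on log σ(v_w·c_pos) + Σ_neg log(1 − σ(v_w·c_neg))."""
    g = 1.0 - sigmoid(v_w @ c_pos)     # gradient scale for the observed (w, c) pair
    grad_w = g * c_pos
    c_pos += lr * g * v_w              # pull the true context vector towards v_w
    for c_neg in c_negs:               # push sampled negative contexts away
        g = -sigmoid(v_w @ c_neg)
        grad_w += g * c_neg
        c_neg += lr * g * v_w
    v_w += lr * grad_w                 # update the word vector in place
```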

Skip-gram with Negative Sampling (SGNS)

• Obtained significant improvements on a range of lexical semantic tasks
• Is very fast to train, even on large corpora
• Nowadays, by far the most popular word embedding approach (along with GloVe (Pennington et al., 2014))

Embeddings in ACL
Number of Papers in ACL Containing the Word "Embedding"

[Figure: bar chart of the number of ACL papers containing the word "embedding" per year, 2011-2017]


Count vs. Predict

• Don't count, predict! (Baroni et al., 2014)
• But... neural embeddings are implicitly matrix factorization tools (Levy and Goldberg, 2014)
• So?... It's all about the hyper-parameters (Levy et al., 2015)
• The bottom line: word2vec and GloVe are very good implementations

Compositionality

• Basic approach: average / weighted average (see the sketch below)
  • v_good + v_day = v_"good day"
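A tiny sketch of this basic approach, assuming `vectors` maps words to equal-length numpy arrays (the helper name and the optional weighting are illustrative):

```python
import numpy as np

def compose(words, vectors, weights=None):
    """Represent a phrase as the (weighted) average of its word vectors."""
    vecs = np.array([vectors[w] for w in words])
    if weights is None:
        return vecs.mean(axis=0)                         # plain average
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * vecs).sum(axis=0) / w.sum()     # weighted average

# v_good_day = compose(["good", "day"], vectors)
```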

Compositionality: Recursive Neural Networks (Goller and Kuchler, 1996)

[Figure: a recursive neural network composing word vectors over a parse tree; picture taken from Socher et al. (2013)]

Compositionality: Recurrent Neural Networks (Elman, 1990)

[Figure: a recurrent neural network unrolled over the sequence "or maybe yes ?", reading the word vectors v_or, v_maybe, v_yes, v_? and producing hidden states h1, h2, h3, h4]

Recurrent Neural Networks

• In recent years, the most common method to represent sequences of text is using RNNs (see the sketch below)
  • In particular, long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) and gated recurrent units (GRU; Cho et al., 2014)
• Very recently, state-of-the-art models on tasks such as semantic role labeling and coreference resolution started to rely solely on deep networks with word embeddings and LSTM layers (He et al., 2017)
  • These tasks traditionally relied on syntactic information
  • Many of these results come from the UW NLP group
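A minimal sketch of this recipe in PyTorch (the framework is an assumption; the slides do not prescribe one): an embedding layer followed by an LSTM, with the final hidden state fed to a classifier. All sizes and the example input are illustrative placeholders.

```python
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        vectors = self.embed(token_ids)      # (batch, seq_len, embed_dim)
        states, _ = self.lstm(vectors)       # h_1 ... h_n, as in the figure above
        return self.out(states[:, -1])       # classify from the final hidden state

logits = LSTMEncoder()(torch.tensor([[4, 17, 9, 2]]))  # e.g., "or maybe yes ?"
```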

Word Embeddings in RNNs

• Pre-trained embeddings (fixed or tuned)
• Random initialization
• A concatenation of both types (see the sketch below)
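The three options can be sketched as follows, again assuming PyTorch; `pretrained` is a stand-in for a real matrix of pre-trained vectors such as word2vec or GloVe.

```python
import torch
import torch.nn as nn

vocab_size, dim = 10000, 100
pretrained = torch.randn(vocab_size, dim)   # placeholder for real pre-trained vectors

fixed = nn.Embedding.from_pretrained(pretrained, freeze=True)    # pre-trained, fixed
tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)   # pre-trained, fine-tuned
random = nn.Embedding(vocab_size, dim)                           # random initialization

# A concatenation of both types: look each word up in two tables and concatenate.
def concat_lookup(token_ids):
    return torch.cat([fixed(token_ids), random(token_ids)], dim=-1)
```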

Alternatives to Word Embeddings

• Character embeddings
  • Machine translation (Ling et al., 2015)
  • Syntactic parsing (Ballesteros et al., 2015)
• Character n-grams (Neubig et al., 2013; Schütze, 2017)
• POS tag embeddings (Dyer et al., 2015)

Current Research Problems

50 Shades of Similarity

• What is similarity?
  • Synonymy: high — tall
  • Co-hyponymy: dog — cat
  • Association: coffee — cup
  • Dissimilarity: good — bad
  • Attributional similarity: banana — the sun (both are yellow)
  • Morphological similarity: going — crying (same verb tense)
  • Schwartz et al. (2015); Rubinstein et al. (2015); Cotterell et al. (2016)
• The definition is application dependent

What is a context?

• Most word embeddings rely on bag-of-words contexts
  • Which capture general word association
• Other options exist
  • Dependency links (Padó and Lapata, 2007)
  • Symmetric patterns (e.g., "X and Y"; Schwartz et al., 2015, 2016)
  • Substitute vectors (Yatbaz et al., 2012)
  • Morphemes (Cotterell et al., 2016)
• Different context types translate to different relations between similar vectors

External Resources

• Guide vectors towards a desired flavor of similarity
• Use dictionaries and/or thesauri
  • As part of the model (Yu and Dredze, 2014; Kiela et al., 2015)
  • As post-processing (Faruqui et al., 2015; Mrkšić et al., 2016) (see the retrofitting-style sketch below)
• Multimodal embeddings
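To make the post-processing option concrete, here is a rough sketch in the spirit of retrofitting (Faruqui et al., 2015): each vector is repeatedly nudged towards its lexicon neighbours (e.g., synonyms from a thesaurus) while staying close to its original position. The update rule and weights here are simplified assumptions, not the paper's exact formulation.

```python
import numpy as np

def retrofit(vectors, lexicon, iterations=10, alpha=1.0):
    """vectors: {word: np.array}; lexicon: {word: [neighbour words]}."""
    new = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new]
            if word not in new or not nbrs:
                continue
            beta = 1.0 / len(nbrs)   # equal weight per lexicon neighbour
            # weighted combination of the original vector and its neighbours
            new[word] = (alpha * vectors[word] + beta * sum(new[n] for n in nbrs)) \
                        / (alpha + beta * len(nbrs))
    return new
```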

Multimodal Embeddings

• A combination of a textual representation and a perceptual representation
  • Most prominently visual
• Most approaches combine the two types of vectors using methods such as canonical correlation analysis (CCA; e.g., Gella et al., 2016) (see the sketch below)
• The resulting embeddings often improve performance compared to text-only embeddings
• They are also able to capture visual attributes such as size and color, which are often not captured by text-only methods (Rubinstein et al., 2015)
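A minimal sketch of the CCA-based combination; the matrices below are random placeholders for real text embeddings and visual features for the same vocabulary, and scikit-learn's CCA is just one of several possible tools.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

n_words = 500
text_vecs = np.random.randn(n_words, 100)    # stand-in for word embeddings
image_vecs = np.random.randn(n_words, 300)   # stand-in for visual features

cca = CCA(n_components=50)
text_proj, image_proj = cca.fit_transform(text_vecs, image_vecs)

# One common choice: concatenate the two projected views into a multimodal vector.
multimodal = np.concatenate([text_proj, image_proj], axis=1)
```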

Multilingual Embeddings

• Mapping embeddings in different languages into the same space
  • v_dog ∼ v_perro
• Useful for multilingual tasks, as well as low-resource scenarios
• Most approaches use bilingual dictionaries or parallel corpora (see the sketch below)
• Recent approaches use more creative knowledge sources, such as geospatial contexts (Cocos and Callison-Burch, 2017) and sentence ids in a parallel corpus (Levy et al., 2017)
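One common dictionary-based approach learns a linear map between the two spaces by least squares over seed translation pairs. This numpy sketch is illustrative: the dictionaries and variable names are placeholders, and real systems add refinements such as orthogonality constraints.

```python
import numpy as np

def learn_mapping(src_vecs, tgt_vecs, seed_pairs):
    """Learn W minimizing ||XW - Y||^2 over a seed bilingual dictionary."""
    X = np.vstack([src_vecs[s] for s, t in seed_pairs])
    Y = np.vstack([tgt_vecs[t] for s, t in seed_pairs])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# After learning W from e.g. [("perro", "dog"), ("gato", "cat"), ...]:
#   src_vecs["perro"] @ W should land near tgt_vecs["dog"], i.e. v_perro ~ v_dog.
```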

Summary

• Distributional semantic models (aka vector space models, word embeddings) represent words using vectors of real numbers
• These methods are able to capture lexical semantics such as similarity and association
• They also serve as a fundamental building block in virtually all deep learning models in NLP
• Despite decades of research, many questions remain open


Thank you!

Roy Schwartz
homes.cs.washington.edu/~roysch/
[email protected]

References I

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proc. of EMNLP, 2015.
Marco Baroni, Georgiana Dinu, and Germán Kruszewski. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proc. of ACL, 2014.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. JMLR, 3:1137–1155, 2003.
Elia Bruni, Nam-Khanh Tran, and Marco Baroni. Multimodal distributional semantics. JAIR, 49:1–47, 2014.
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In Proc. of SSST, 2014.
Anne Cocos and Chris Callison-Burch. The language of place: Semantic value from geospatial context. In Proc. of EACL, 2017.
Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proc. of ICML, 2008.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. JMLR, 12:2493–2537, 2011.
Ryan Cotterell, Hinrich Schütze, and Jason Eisner. Morphological smoothing and extrapolation of word embeddings. In Proc. of ACL, 2016.

References II

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. Transition-based dependency parsing with stack long short-term memory. In Proc. of ACL, 2015.
Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. Retrofitting word vectors to semantic lexicons. In Proc. of NAACL, 2015.
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: The concept revisited. In Proc. of WWW, 2001.
Spandana Gella, Mirella Lapata, and Frank Keller. Unsupervised visual sense disambiguation for verbs using multimodal embeddings. In Proc. of NAACL, 2016.
Yoav Goldberg and Omer Levy. word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method, 2014. arXiv:1402.3722.
Christoph Goller and Andreas Kuchler. Learning task-dependent distributed representations by backpropagation through structure. In Proc. of ICNN, 1996.
Zellig Harris. Distributional structure. Word, 10(23):146–162, 1954.
Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. Deep semantic role labeling: What works and what's next. In Proc. of ACL, 2017.
Felix Hill, Roi Reichart, and Anna Korhonen. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 2015.

References III

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Dan Jurafsky and James H. Martin. Vector semantics (draft chapter). Chapter 15, 2016a. URL https://web.stanford.edu/~jurafsky/slp3/15.pdf.
Dan Jurafsky and James H. Martin. Semantics with dense vectors (draft chapter). Chapter 16, 2016b. URL https://web.stanford.edu/~jurafsky/slp3/16.pdf.
George Karypis. CLUTO: A clustering toolkit. Technical report, DTIC Document, 2002.
Douwe Kiela, Felix Hill, and Stephen Clark. Specializing word embeddings for similarity or relatedness. In Proc. of EMNLP, 2015.
Thomas K. Landauer and Susan T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211, 1997.
Omer Levy and Yoav Goldberg. Neural word embeddings as implicit matrix factorization. In Proc. of NIPS, 2014.
Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity with lessons learned from word embeddings. TACL, 3:211–225, 2015.
Omer Levy, Anders Søgaard, and Yoav Goldberg. A strong baseline for learning cross-lingual word embeddings from sentence alignments. In Proc. of EACL, 2017.
Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W. Black. Character-based neural machine translation, 2015. arXiv:1511.04586.

References IV

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013. arXiv:1301.3781.
Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. Counter-fitting word vectors to linguistic constraints. In Proc. of NAACL, 2016.
Graham Neubig, Taro Watanabe, Shinsuke Mori, and Tatsuya Kawahara. Substring-based machine translation. Machine Translation, 27(2):139–166, 2013.
Sebastian Padó and Mirella Lapata. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199, 2007.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proc. of EMNLP, 2014.
Herbert Rubenstein and John B. Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, 1965.
Dana Rubinstein, Effi Levi, Roy Schwartz, and Ari Rappoport. How well do distributional models capture different types of semantic knowledge? In Proc. of ACL, 2015.
Gerard Salton. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1971.
Hinrich Schütze. Nonsymbolic text representation. In Proc. of EACL, 2017.

References V

Roy Schwartz, Roi Reichart, and Ari Rappoport. Symmetric pattern based word embeddings for improved word similarity prediction. In Proc. of CoNLL, 2015.
Roy Schwartz, Roi Reichart, and Ari Rappoport. Symmetric patterns and coordinations: Fast and enhanced representations of verbs and adjectives. In Proc. of NAACL, 2016.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. of EMNLP, 2013.
Mehmet Ali Yatbaz, Enis Sert, and Deniz Yuret. Learning syntactic categories using paradigmatic representations of word context. In Proc. of EMNLP, 2012.
Mo Yu and Mark Dredze. Improving lexical embeddings with semantic knowledge. In Proc. of ACL, 2014.
