
Distributional Semantics

COMP90042 Lecture 10

COPYRIGHT 2019, THE UNIVERSITY OF MELBOURNE

Lexical databases - Problems

• Manually constructed
  * Expensive
  * Human annotation can be biased and noisy
• Language is dynamic
  * New words: slang, terminology, etc.
  * New senses
• The Internet provides us with massive amounts of text. Can we use that to obtain word meanings?

Distributional semantics

• “You shall know a word by the company it keeps” (Firth)
• Document co-occurrence often indicative of topic (document as context)
  * E.g. voting and politics
• Local context reflects a word’s semantic class (word window as context)
  * E.g. eat a pizza, eat a burger

Guessing meaning from context

• Learn unknown word from its usage
  * E.g., tezgüino

• Look at other words in same (or similar) contexts

From E18, Ch14, originally Lin

Distributed and distributional semantics

Distributed = represented as a numerical vector (in contrast to symbolic)

Covers both count-based and neural prediction-based methods

The model

• Fundamental idea: represent meaning as a vector
• Consider documents as context (e.g., tweets)
• One matrix, two viewpoints
  * Documents represented by their words (web search)
  * Words represented by their documents (text analysis)

Example term–document count matrix (documents as rows):

        …   state   fun   heaven   …
  …
  425         0      1       0
  426         3      0       0
  427         0      0       0
  …

Manipulating the VSM

• Weighting the values
• Creating low-dimensional dense vectors
• Comparing vectors

Tf-idf

• Standard weighting scheme for information retrieval
• Also discounts common words

  $\mathrm{idf}_w = \log_2 \frac{|D|}{\mathrm{df}_w}$,   $\text{tf-idf}_{d,w} = \mathrm{tf}_{d,w} \times \mathrm{idf}_w$

tf matrix (with document frequencies):

        …   the   country   hell   …
  …
  425        43      5        1
  426        24      1        0
  427        37      0        3
  …
  df        500     14        7

tf-idf matrix:

        …   the   country   hell   …
  …
  425         0    25.8      6.2
  426         0     5.2      0
  427         0     0       18.5
  …
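To make the arithmetic concrete, here is a minimal Python sketch reproducing the tf-idf values above from the tf matrix and document frequencies (the base-2 log is inferred from the worked numbers; the numpy layout is an assumption, not part of the slides):

```python
import numpy as np

# tf matrix from the slide: rows = documents 425-427, columns = (the, country, hell)
tf = np.array([[43, 5, 1],
               [24, 1, 0],
               [37, 0, 3]], dtype=float)

D = 500                      # total number of documents (df("the") = 500 on the slide)
df = np.array([500, 14, 7])  # document frequencies for (the, country, hell)

idf = np.log2(D / df)        # idf_w = log2(|D| / df_w)
tfidf = tf * idf             # broadcast idf across the document rows

print(tfidf.round(1))
# [[ 0.  25.8  6.2]
#  [ 0.   5.2  0. ]
#  [ 0.   0.  18.5]]
```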

Dimensionality reduction

• Term-document matrices are very sparse
• Dimensionality reduction: create shorter, denser vectors
• More practical (fewer features)
• Removes noise (less overfitting)

Singular Value Decomposition

  $A = U \Sigma V^T$

  * A is the |V| × |D| term–document matrix, m = Rank(A)
  * U is the new |V| × m term matrix
  * Σ is the m × m diagonal matrix of singular values
  * V^T is the new m × |D| document matrix

Truncating

• Truncating U, Σ, and V^T to k dimensions produces the best possible rank-k approximation of the original matrix

• So truncated, U_k (or V_k^T) is a new low-dimensional representation of the word (or document)
• Typical values for k are 100–5000

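A minimal sketch of truncated SVD on a toy term–document matrix (the matrix values are invented for illustration; numpy's `linalg.svd` is one way to compute it, not something the slides prescribe):

```python
import numpy as np

# Toy term-document count matrix A: rows = terms (|V| = 4), columns = documents (|D| = 5)
A = np.array([[3, 0, 1, 0, 2],
              [0, 2, 0, 1, 0],
              [1, 0, 0, 3, 0],
              [0, 1, 2, 0, 1]], dtype=float)

# Full SVD: A = U @ diag(s) @ Vt, singular values s in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate to k dimensions: best rank-k approximation of A
k = 2
U_k = U[:, :k]            # |V| x k  -> dense k-dimensional word representations
S_k = np.diag(s[:k])      # k x k
Vt_k = Vt[:k, :]          # k x |D|  -> dense k-dimensional document representations

A_k = U_k @ S_k @ Vt_k    # rank-k reconstruction of A
print(np.round(A_k, 2))
```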

Words as context

• Lists how often words appear with other words
  * In some predefined context (usually a window)
• The obvious problem with raw frequency: dominated by common words

        …   the   country   hell   …
  …
  state     1973     10       1
  fun         54      2       0
  heaven      55      1       3
  …

Pointwise mutual information

For two events x and y, pointwise mutual information (PMI) compares the actual joint probability of the two events (as seen in the data) with the probability expected under the assumption of independence:

  $\mathrm{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\, p(y)}$

Calculating PMI

        …   the   country   hell   …        Σ
  …
  state     1973     10       1          12786
  fun         54      2       0            633
  heaven      55      1       3            627
  …
  Σ       1047519   3617     780       15871304

With x = state, y = country:

  p(x, y) = count(x, y) / Σ = 10 / 15871304 = 6.3 × 10^-7
  p(x) = Σ_x / Σ = 12786 / 15871304 = 8.0 × 10^-4
  p(y) = Σ_y / Σ = 3617 / 15871304 = 2.3 × 10^-4

  PMI(x, y) = log2( (6.3 × 10^-7) / ((8.0 × 10^-4) × (2.3 × 10^-4)) ) = 1.78
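The same calculation as a short Python check, using the counts from the table above:

```python
from math import log2

total = 15871304          # grand total Σ
count_xy = 10             # count(state, country)
count_x = 12786           # row marginal Σ_x for "state"
count_y = 3617            # column marginal Σ_y for "country"

p_xy = count_xy / total   # joint probability
p_x = count_x / total     # marginal probability of x
p_y = count_y / total     # marginal probability of y

pmi = log2(p_xy / (p_x * p_y))
print(round(pmi, 2))      # ≈ 1.78
```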

PMI matrix

• PMI does a better job of capturing interesting semantics
  * E.g. heaven and hell
• But it is obviously biased towards rare words
• And doesn’t handle zeros well

        …   the   country   hell   …
  …
  state     1.22    1.78    0.63
  fun       0.37    3.79    -inf
  heaven    0.41    2.80    6.60
  …

PMI tricks

• Zero all negative values (PPMI)
  * Avoid –inf and unreliable negative values
• Counter bias towards rare events
  * Smooth probabilities
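A minimal sketch combining both tricks, clamping negative values to zero and smoothing the context distribution; the α = 0.75 exponent follows JM3 Ch 6 and is an assumption here, since the slide only says “smooth probabilities”:

```python
import numpy as np

def ppmi_matrix(counts, alpha=0.75):
    """Positive PMI from a word-context count matrix.

    counts: 2-D array, rows = target words, columns = context words.
    alpha:  context-distribution smoothing exponent (0.75 as in JM3 Ch 6);
            alpha = 1.0 means no smoothing.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()

    p_wc = counts / total                             # joint probabilities
    p_w = counts.sum(axis=1, keepdims=True) / total   # target marginals
    ctx = counts.sum(axis=0) ** alpha
    p_c = (ctx / ctx.sum())[np.newaxis, :]            # smoothed context marginals

    with np.errstate(divide="ignore"):                # log2(0) -> -inf, clipped below
        pmi = np.log2(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0.0)                       # PPMI: zero out negatives and -inf
```

Applied to a count matrix like the one above, entries such as PMI(fun, hell) = −inf simply become 0.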

Similarity

• Regardless of the vector representation, the classic use of a vector is comparison with other vectors
• For IR: find documents most similar to a query
• For text analysis: find synonyms, based on proximity in vector space
  * automatic construction of lexical resources
  * more generally, knowledge base population
• Use vectors as features in a classifier — more robust to different inputs (movie vs film)
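The slides do not name a specific comparison function; cosine similarity is the usual choice for measuring proximity in vector space, e.g.:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (1 = same direction)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 3-d vectors, for illustration only
print(cosine_similarity([1.0, 0.5, 0.0], [0.9, 0.6, 0.1]))  # close to 1: similar
```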


Neural Word Embeddings

Learning distributional & distributed representations using “deep” classifiers

Skip-gram: Factored Prediction

• Neural network inspired approaches seek to learn vector representations of words and their contexts
• Key idea
  * Word embeddings should be similar to embeddings of neighbouring words
  * And dissimilar to other words that don’t occur nearby
• Using vector dot product for vector ‘comparison’

  * u · v = Σ_j u_j v_j
• As part of a ‘classifier’ over a word and its immediate context

Skip-gram: Factored Prediction

• Framed as learning a classifier (a weird language model)…
  * Skip-gram: predict words in local context surrounding given word

  P(he | rests)   P(in | rests)   P(life | rests)   P(peace | rests)

… Bereft of life he rests in peace! If you hadn't nailed him …

P(rests | {he, in, life, peace})

  * CBOW: predict word in centre, given words in the local surrounding context
• Local context means words within L positions, L=2 above
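A minimal sketch of extracting skip-gram (centre, context) training pairs with a window of L = 2 from the example sentence (tokenisation is simplified; the function name is invented for illustration):

```python
def skipgram_pairs(tokens, L=2):
    """Yield (centre word, context word) pairs for a window of +/- L positions."""
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - L), min(len(tokens), i + L + 1)):
            if j != i:
                yield centre, tokens[j]

tokens = "bereft of life he rests in peace".split()
print([c for t, c in skipgram_pairs(tokens) if t == "rests"])
# ['life', 'he', 'in', 'peace']
```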

Skip-gram model

• Generates each word in context given centre word

P(he | rests) P(in | rests) P(life | rests) P(peace | rests)

… Bereft of life he rests in peace! If you hadn't nailed him …

• Total probability defined as

  $\prod_{l \in \{-L, \dots, -1, 1, \dots, L\}} P(w_{t+l} \mid w_t)$

  * where subscript l denotes position in running text

• Using a logistic regression model

  $P(w_k \mid w_j) = \frac{\exp(c_{w_k} \cdot v_{w_j})}{\sum_{w' \in V} \exp(c_{w'} \cdot v_{w_j})}$
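A minimal sketch of this probability, assuming small random target and context embedding matrices (V_emb and C_emb are hypothetical names, each |V| × d):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 5, 4
V_emb = rng.normal(size=(vocab_size, d))  # target-word embeddings v_w (rows)
C_emb = rng.normal(size=(vocab_size, d))  # context-word embeddings c_w (rows)

def p_context_given_target(k, j):
    """P(w_k | w_j): softmax over dot products of every c_w with v_{w_j}."""
    scores = C_emb @ V_emb[j]             # c_w . v_{w_j} for every word w
    scores -= scores.max()                # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[k]

print(p_context_given_target(k=2, j=0))
```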

Embedding parameterisation

• Two parameter matrices, with a d-dimensional embedding for all words

[Fig 19.17, JM: the word matrix W (|Vw| ⨉ d) and context matrix C (|Vc| ⨉ d), with embeddings shown as row vectors; row i of W is the target-word embedding for word i, and row j of C is the context embedding for word j.]

• Words are numbered, e.g., by sorting vocabulary and using word location as its index

Skip-gram model

[Fig 19.18, JM: the skip-gram model (Mikolov et al. 2013, Mikolov et al. 2013a). A 1-hot input vector x (1 ⨉ |V|) for the centre word w_t selects the corresponding row of the word matrix W (|V| ⨉ d) as the projection layer (the embedding for w_t); multiplying that embedding by the context matrix C (d ⨉ |V|) gives output-layer probabilities over context words such as w_{t−1} and w_{t+1}.]

Training the skip-gram model

• Train to maximise likelihood of raw text
• Too slow in practice, due to normalisation over |V|
• Reduce problem to binary classification, distinguishing real context words from “negative samples”
  * words drawn randomly from V

Source: JM3 Ch 6

Negative Sampling

• Word2vec starts with an initial set of embedding vectors and iteratively shifts the embedding of each word to be more like the embeddings of words that occur nearby in text, and less like the embeddings of words that don’t occur nearby
• A real (target, context) pair is scored with a sigmoid rather than a full softmax:

  $P(+ \mid w_k, w_j) = \frac{1}{1 + \exp(-c_{w_k} \cdot v_{w_j})}$

• Example from JM3, with target t = apricot and an L = ±2 window:

  ... lemon, a [tablespoon of apricot jam, a] pinch ...

  The 4 context words give 4 positive training instances (t, c); for each, skip-gram creates k negative instances pairing t with a ‘noise word’ drawn at random from the lexicon (constrained not to be t). With k = 2:

  positive examples +        negative examples -
  t        c                 t        c            t        c
  apricot  tablespoon        apricot  aardvark     apricot  where
  apricot  of                apricot  twelve       apricot  dear
  apricot  jam               apricot  puddle       apricot  coaxial
  apricot  a                 apricot  hello        apricot  forever

• Noise words are chosen according to their weighted unigram frequency

  $P_\alpha(w) = \frac{\mathrm{count}(w)^\alpha}{\sum_{w'} \mathrm{count}(w')^\alpha}$,  with α = 0.75 in practice

  giving rare noise words slightly higher probability than their raw frequency: for P(a) = .99 and P(b) = .01, $P_\alpha(a) = \frac{.99^{.75}}{.99^{.75} + .01^{.75}} = .97$ and $P_\alpha(b) = .03$

• Given the positive and negative training instances and an initial set of embeddings, learning adjusts the embeddings to maximise the similarity of the (t, c) pairs drawn from the positive examples

Source: JM3 Ch 6; note mistake from reference, corrected above
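A minimal sketch of the negative-sampling objective for a single positive pair, under assumed toy embeddings and unigram counts (all names and numbers here are illustrative, not taken from the slides or JM3):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["apricot", "tablespoon", "jam", "aardvark", "the"]
counts = np.array([20, 10, 15, 1, 500], dtype=float)   # hypothetical unigram counts
d = 8
V_emb = rng.normal(scale=0.1, size=(len(vocab), d))    # target embeddings v_w
C_emb = rng.normal(scale=0.1, size=(len(vocab), d))    # context embeddings c_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Noise distribution: unigram counts raised to alpha = 0.75, renormalised
alpha = 0.75
noise_p = counts ** alpha / (counts ** alpha).sum()

def neg_sampling_loss(t, c_pos, k=2):
    """Negative log-likelihood of one positive (target, context) pair plus k noise words."""
    neg = rng.choice(len(vocab), size=k, p=noise_p)    # noise words (ideally excluding t)
    pos_score = sigmoid(C_emb[c_pos] @ V_emb[t])       # P(+ | c_pos, t)
    neg_scores = sigmoid(-C_emb[neg] @ V_emb[t])       # P(- | c_neg, t)
    return -(np.log(pos_score) + np.log(neg_scores).sum())

print(neg_sampling_loss(t=vocab.index("apricot"), c_pos=vocab.index("jam")))
```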

Training illustration

• Iterative process (stochastic gradient descent)
  * each step moves embeddings closer for context words
  * and moves embeddings apart for noise samples

Source: JM3 Ch 6

Evaluating word vectors

• Lexicon style tasks
  * WordSim-353 contains pairs of nouns with judged relatedness
  * SimLex-999 also covers verbs and adjectives
  * TOEFL asks for the closest synonym as multiple choice
  * …
• Test compatibility of word pairs using similarity in vector space

Embeddings exhibit meaningful geometry

• Word analogy task
  * Man is to King as Woman is to ???
  * France is to Paris as Italy is to ???
  * Evaluate where in the ranked predictions the correct answer is, given tables of known relations
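The slides do not spell out how the ranked predictions are produced; a common approach is the vector-offset method, sketched here against a hypothetical dict mapping words to vectors:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, vectors, exclude_inputs=True):
    """Answer 'a is to b as c is to ?' by ranking words by cosine to (b - a + c)."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if not (exclude_inputs and w in {a, b, c})]
    return sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)

# 'vectors' would map each word to its embedding, e.g. loaded from word2vec/GloVe output
# ranked = analogy("man", "king", "woman", vectors)   # hope: "queen" ranks first
```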

Evaluating word vectors

• Best evaluation is in other downstream tasks
  * Use bag-of-word embeddings as a feature representation in a classifier (e.g., sentiment, QA, tagging, etc.)
  * First layer of most deep learning models is to embed input text; use pre-trained word vectors as embeddings, possibly with further training (“fine-tuning”) for the specific task
• Recently “contextual word vectors” have been shown to work even better: ELMo (AI2), BERT (Google AI), …

Pointers to software

• Word2Vec
  * C implementation of Skip-gram and CBOW
    https://code.google.com/archive/p/word2vec/
• Gensim
  * Python library with many methods including LSI, topic models and Skip-gram/CBOW
    https://radimrehurek.com/gensim/index.html
• GloVe
  * http://nlp.stanford.edu/projects/glove/
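A minimal usage sketch for Gensim's Word2Vec in skip-gram mode; the toy corpus is a placeholder, and some parameter names differ between Gensim versions (e.g. older releases use `size`/`iter` instead of `vector_size`/`epochs`):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences (replace with real tokenised text)
sentences = [
    ["bereft", "of", "life", "he", "rests", "in", "peace"],
    ["eat", "a", "pizza"],
    ["eat", "a", "burger"],
]

# sg=1 selects skip-gram (sg=0 would be CBOW); window is the context size L
model = Word2Vec(sentences, sg=1, window=2, min_count=1, vector_size=50, epochs=50)

vec = model.wv["pizza"]                  # the learned 50-dimensional vector
print(model.wv.most_similar("pizza"))    # nearest neighbours by cosine similarity
```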

Further reading

• Either one of:
  * E18, 14–14.6 (skipping 14.4)
  * JM3, Ch 6
