
CS546: Machine Learning in NLP (Spring 2020) http://courses.engr.illinois.edu/cs546/

Lecture 4: Static embeddings

Julia Hockenmaier, [email protected], 3324 Siebel Center. Office hours: Monday, 11am–12:30pm

(Static) Word Embeddings

A (static) word embedding is a function that maps each word type to a single vector

— these vectors are typically dense and have much lower dimensionality than the size of the vocabulary

— this mapping function typically ignores that the same string of letters may have different senses (dining table vs. a table of contents) or parts of speech (to table a motion vs. a table)

— this mapping function typically assumes a fixed-size vocabulary (so an UNK token is still required)

Embeddings

Main idea: Use a classifier to predict which words appear in the context of (i.e. near) a target word (or vice versa). This classifier induces a dense vector representation of words (an embedding).

Words that appear in similar contexts (that have high distributional similarity) will have very similar vector representations.

These models can be trained on large amounts of raw text (and pre-trained embeddings can be downloaded)

Word2Vec (Mikolov et al. 2013)

The first really influential dense word embeddings

Two ways to think about Word2Vec:
— a simplification of neural language models
— a binary classifier

Variants of Word2Vec:
— Two different context representations: CBOW or Skip-Gram
— Two different optimization objectives: Negative sampling (NS) or hierarchical softmax

Word2Vec architectures

[Figure 1 from Mikolov et al. (2013), showing the two Word2Vec architectures: the CBOW architecture predicts the current word w(t) from the surrounding context words w(t-2), w(t-1), w(t+1), w(t+2), which are summed at the projection layer; the Skip-gram architecture predicts the surrounding context words given the current word w(t).]

CBOW: predict target from context

(CBOW = Continuous Bag of Words)
Training sentence: ... lemon, a tablespoon of apricot jam a pinch ...
(context words: c1 = tablespoon, c2 = of, c3 = jam, c4 = a; target: t = apricot)

Given the surrounding context words (tablespoon, of, jam, a), predict the target word (apricot).

Input: each context word is a one-hot vector
Projection layer: map each one-hot vector down to a dense D-dimensional vector, and average these vectors
Output: predict the target word with a softmax
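A minimal numpy sketch of this CBOW forward pass (the vocabulary size, embedding dimension, and randomly initialized matrices are placeholders, not values from the slides):

```python
import numpy as np

V, D = 10000, 300                      # vocabulary size, embedding dimension (placeholders)
E = np.random.randn(V, D) * 0.01       # projection embeddings, one row per word
W = np.random.randn(D, V) * 0.01       # output weights for the softmax layer

def cbow_forward(context_ids):
    """Predict a distribution over the target word from the ids of the context words."""
    h = E[context_ids].mean(axis=0)     # project each one-hot context word and average
    scores = h @ W                      # one score per vocabulary word
    scores -= scores.max()              # numerical stability
    return np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

# e.g. the ids for (tablespoon, of, jam, a) would come from a word-to-id table
p = cbow_forward(np.array([17, 42, 205, 3]))
print(p.shape, p.sum())                 # (10000,) 1.0
```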

Skipgram: predict context from target

Training sentence: ... lemon, a tablespoon of apricot jam a pinch ...
(context words: c1 = tablespoon, c2 = of, c3 = jam, c4 = a; target: t = apricot)

Given the target word (apricot), predict the surrounding context words (tablespoon, of, jam, a).
Input: the target word is a one-hot vector
Projection layer: map the one-hot vector down to a dense D-dimensional vector
Output: predict each context word with a softmax

Skipgram

[Figure: the Skipgram network. The one-hot encoding of the target word selects a row of the V×H input-to-hidden weight matrix; multiplying the resulting hidden vector by the H×V hidden-to-output weight matrix gives a score for each candidate context word wj.]

p(wc | wt) ∝ exp(wc · wt)

The rows in the weight matrix for the hidden layer correspond to the weights for each hidden unit. The columns in the weight matrix from the input to the hidden layer correspond to the input vectors for each (target) word [typically, these are used as the word2vec vectors]. The rows in the weight matrix from the hidden to the output layer correspond to the output vectors for each (context) word [typically, these are ignored].
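As a sketch, the two weight matrices viewed as embedding tables, and the softmax p(wc | wt) ∝ exp(wc · wt) computed from them (sizes, random initialization, and the example word id are placeholders):

```python
import numpy as np

V, D = 10000, 300                        # vocabulary size and embedding dimension (placeholders)
W_in = np.random.randn(V, D) * 0.01      # input-to-hidden weights: input (target) vectors
W_out = np.random.randn(V, D) * 0.01     # hidden-to-output weights: output (context) vectors

def context_distribution(target_id):
    """p(wc | wt) ∝ exp(wc · wt): softmax over dot products with every context vector."""
    t = W_in[target_id]                  # dense D-dimensional vector for the target word
    scores = W_out @ t                   # one score per candidate context word
    scores -= scores.max()               # numerical stability
    return np.exp(scores) / np.exp(scores).sum()

probs = context_distribution(205)        # 205: a hypothetical word id
word2vec_vectors = W_in                  # typically, these rows are kept as the word embeddings
```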

Negative sampling

Skipgram aims to optimize the average log-likelihood of the data:

(1/T) ∑_{t=1..T} ∑_{−c≤j≤c, j≠0} log p(wt+j | wt) = (1/T) ∑_{t=1..T} ∑_{−c≤j≤c, j≠0} log [ exp(wt+j · wt) / ∑_{k=1..V} exp(wk · wt) ]

But the partition function ∑_{k=1..V} exp(wk · wt) is very expensive to compute.
— This can be mitigated by hierarchical softmax (represent each wt+j by its Huffman encoding, and predict the sequence of nodes in the resulting binary tree via softmax).
— Noise Contrastive Estimation is an alternative to (hierarchical) softmax that aims to distinguish actual data points wt+j from noise via logistic regression.
— But we just want good word representations, so we do something simpler:

Negative Sampling instead aims to optimize

log σ(wt · wc) + ∑_{i=1..k} E_{wi∼P(w)} [ log σ(−wt · wi) ]    with σ(x) = 1 / (1 + exp(−x))

Skip-Gram Training data

Training sentence: ... lemon, a tablespoon of apricot jam a pinch ...
(context words: c1 = tablespoon, c2 = of, c3 = jam, c4 = a; target: t = apricot)

Training data: input/output pairs centering on apricot
Assume a ±2 word window (in reality: use ±10 words)
Positive examples: (apricot, tablespoon), (apricot, of), (apricot, jam), (apricot, a)
For each positive example, sample k negative examples using noise words (drawn according to the [adjusted] unigram probability): (apricot, aardvark), (apricot, puddle), …
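A rough sketch (not the original word2vec code) of this pair-generation step, with noise words drawn from a smoothed unigram distribution (freq^0.75, discussed on a later slide); window size, k, and the toy sentence are illustrative:

```python
import random
from collections import Counter

def skipgram_pairs(tokens, window=2, k=2, alpha=0.75, rng=random.Random(0)):
    """Yield (target, context, label) triples: 1 for observed pairs, 0 for sampled noise words."""
    counts = Counter(tokens)
    words = list(counts)
    weights = [counts[w] ** alpha for w in words]        # smoothed unigram noise distribution
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            yield target, tokens[j], 1                   # positive example from the corpus
            for noise in rng.choices(words, weights=weights, k=k):
                yield target, noise, 0                   # k sampled negative examples

sentence = "lemon a tablespoon of apricot jam a pinch".split()
for t, c, label in skipgram_pairs(sentence):
    if t == "apricot":
        print((t, c), label)
```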

P(Y | X) with Logistic Regression

The sigmoid lies between 0 and 1 and is used in (binary) logistic regression

σ(x) = 1 / (1 + exp(−x))

Logistic regression for binary classification (y ∈ {0, 1}):

P(Y = 1 | x) = σ(w·x + b) = 1 / (1 + exp(−(w·x + b)))

Parameters to learn: one feature weight vector w and one bias term b

Back to word2vec

Skipgram with negative sampling also uses the sigmoid, but requires two sets of parameters that are multiplied together (for target and context vectors)

log σ(wt · wc) + ∑_{i=1..k} E_{wi∼P(w)} [ log σ(−wt · wi) ]

We can view word2vec as training a binary classifier for the decision whether c is an actual context word for t.

The probability that c is a positive (real) context word for t: P(D = + | t, c)
The probability that c is a negative (sampled) context word for t: P(D = − | t, c) = 1 − P(D = + | t, c)

Negative Sampling

log σ(wt · wc) + ∑_{i=1..k} E_{wi∼P(w)} [ log σ(−wt · wi) ]

= log [ 1 / (1 + exp(−wt · wc)) ] + ∑_{i=1..k} E_{wi∼P(w)} [ log ( 1 / (1 + exp(wt · wi)) ) ]

= log [ 1 / (1 + exp(−wt · wc)) ] + ∑_{i=1..k} E_{wi∼P(w)} [ log ( 1 − 1 / (1 + exp(−wt · wi)) ) ]

= log P(D = + | wc, wt) + ∑_i E_{wi∼P(w)} [ log (1 − P(D = + | wi, wt)) ]

P(D = + | wc, wt) should be high for actual context words; P(D = + | wi, wt) should be low for sampled context words.

Negative Sampling

Basic idea:
— For each actual (positive) target-context word pair, sample k negative examples consisting of the target word and a randomly sampled word.
— Train a model to predict a high conditional probability for the actual (positive) context words, and a low conditional probability for the sampled (negative) context words.

This can be reformulated as (approximated by) predicting whether a word-context pair comes from the actual (positive) data, or from the sampled (negative) data:

log σ(wt · wc) + ∑_{i=1..k} E_{wi∼P(w)} [ log σ(−wt · wi) ]

Word2Vec: Negative Sampling

Distinguish “good” (correct) word-context pairs (D=1) from “bad” ones (D=0)

Probabilistic objective: P(D = 1 | t, c) defined by the sigmoid:

P(D = 1 | t, c) = 1 / (1 + exp(−s(t, c)))

P(D = 0 | t, c) = 1 − P(D = 1 | t, c)

P(D = 1 | t, c) should be high when (t, c) ∈ D+, and low when (t, c) ∈ D−

Word2Vec: Negative Sampling

Training data: D+ ∪ D-

D+ = actual examples from training data

Where do we get D- from? Word2Vec: for each good pair (w, c), sample k words and add each wi as a negative example (wi, c) to D- (D- is k times as large as D+)

Words can be sampled according to corpus frequency, or according to a smoothed variant where freq′(w) = freq(w)^0.75 (this gives more weight to rare words and performs better)
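To illustrate the effect of the 0.75 exponent, a toy comparison of the raw and smoothed sampling distributions (the word counts here are made up):

```python
from collections import Counter

counts = Counter("the the the the cat sat on the mat aardvark".split())
total = sum(counts.values())

raw = {w: c / total for w, c in counts.items()}
norm = sum(c ** 0.75 for c in counts.values())
smoothed = {w: c ** 0.75 / norm for w, c in counts.items()}

for w in ("the", "aardvark"):
    # rare words gain probability mass relative to very frequent words
    print(w, round(raw[w], 3), round(smoothed[w], 3))
```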

Word2Vec: Negative Sampling

Training objective: Maximize log-likelihood of training data D+ ∪ D-:

L(Θ, D+, D−) = ∑_{(w,c) ∈ D+} log P(D = 1 | w, c) + ∑_{(w,c) ∈ D−} log P(D = 0 | w, c)
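A minimal numpy sketch of this objective for a single positive pair and its k sampled negatives (the vectors below are random placeholders; in training they come from the embedding matrices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w_t, w_c, negatives):
    """Negative log-likelihood of one positive (target, context) pair plus its sampled negatives."""
    pos = np.log(sigmoid(w_t @ w_c))                 # log P(D=1 | t, c): should be high
    neg = np.log(sigmoid(-negatives @ w_t)).sum()    # sum of log P(D=0 | t, w_i): should be high
    return -(pos + neg)                              # minimize the negative of the objective

rng = np.random.default_rng(0)
D = 50
print(sgns_loss(rng.normal(size=D), rng.normal(size=D), rng.normal(size=(5, D))))
```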

Skip-Gram with negative sampling

Train a binary classifier that decides whether a target word t appears in the context of other words c1..k
— Context: the set of k words near (surrounding) t
— Treat the target word t and any word that actually appears in its context in a real corpus as positive examples
— Treat the target word t and randomly sampled words that don’t appear in its context as negative examples
— Train a (variant of a) binary logistic regression classifier with two sets of weights (target and context embeddings) to distinguish these cases
— The weights of this classifier depend on the similarity of t and the words in c1..k
Use the target embeddings to represent t

The Skip-Gram classifier

Use logistic regression to predict whether the pair (t, c), consisting of a target word t and a candidate context word c, is a positive or negative example. Assume that t and c are represented as vectors, so that their dot product t·c captures their similarity.

The probability that c is a real context word for t:
P(+ | t, c) = σ(t·c) = 1 / (1 + exp(−t·c))

The probability that c is not a real context word for t:
P(− | t, c) = 1 − P(+ | t, c) = exp(−t·c) / (1 + exp(−t·c))

To capture the entire context window c1..k, assume the context words are independent (multiply their probabilities) and take the log:
P(+ | t, c1:k) = ∏_{i=1..k} 1 / (1 + exp(−t·ci))
log P(+ | t, c1:k) = ∑_{i=1..k} log [ 1 / (1 + exp(−t·ci)) ]

Where do we get vectors t, c from?

Iterative approach (stochastic gradient descent): Assume an initial set of vectors, and then adjust them during training to maximize the probability of the training examples.

Summary: How to learn word2vec (skip-gram) embeddings

For a vocabulary of size V: Start with two sets of V random 300-dimensional vectors as initial embeddings

Train a logistic regression classifier to distinguish words that co-occur in the corpus from those that don’t:
— Pairs of words that co-occur are positive examples
— Pairs of words that don’t co-occur are negative examples
— Train the classifier to distinguish these by slowly adjusting all the embeddings to improve classifier performance

Throw away the classifier code and keep the embeddings.
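Putting the pieces together, a compact and deliberately unoptimized sketch of this training procedure on a toy corpus (dimensions, learning rate, epochs, and the corpus itself are placeholders):

```python
import random
from collections import Counter
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sgns(tokens, dim=50, window=2, k=2, lr=0.025, epochs=10, seed=0):
    """Skip-gram with negative sampling; returns (target embeddings, word-to-id map)."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    counts = Counter(tokens)
    noise_weights = [counts[w] ** 0.75 for w in vocab]      # smoothed unigram noise distribution
    npr = np.random.default_rng(seed)
    W_in = (npr.random((len(vocab), dim)) - 0.5) / dim      # target embeddings (kept at the end)
    W_out = np.zeros((len(vocab), dim))                     # context embeddings (thrown away)

    for _ in range(epochs):
        for i, target in enumerate(tokens):
            t = idx[target]
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                # one positive context word, plus k sampled negative (noise) words
                examples = [(idx[tokens[j]], 1.0)]
                examples += [(idx[w], 0.0) for w in rng.choices(vocab, noise_weights, k=k)]
                for c, label in examples:
                    grad = sigmoid(W_in[t] @ W_out[c]) - label   # derivative of -log likelihood
                    g_in = grad * W_out[c]
                    W_out[c] -= lr * grad * W_in[t]
                    W_in[t] -= lr * g_in
    return W_in, idx

tokens = "the cat sat on the mat and the dog sat on the mat".split()
E, idx = train_sgns(tokens)
print(E[idx["cat"]] @ E[idx["dog"]])      # dot product between two learned target vectors
```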

Evaluating embeddings

Compare to human scores on word similarity-type tasks:
— WordSim-353 (Finkelstein et al., 2002)
— SimLex-999 (Hill et al., 2015)
— Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012)
— TOEFL dataset: Levied is closest in meaning to: imposed, believed, requested, correlated
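A sketch of the word-similarity evaluation: compute cosine similarities with the embeddings and correlate them with the human scores (this assumes scipy for the Spearman correlation and a list of (word1, word2, human score) triples; the vectors and scores below are made up):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate_similarity(embeddings, pairs_with_scores):
    """Spearman correlation between embedding cosine similarities and human similarity judgments."""
    model_scores, human_scores = [], []
    for w1, w2, human in pairs_with_scores:
        if w1 in embeddings and w2 in embeddings:       # skip out-of-vocabulary pairs
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(human)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# toy illustration with made-up vectors and made-up human scores
emb = {w: np.random.default_rng(i).normal(size=50) for i, w in enumerate(["tiger", "cat", "car"])}
print(evaluate_similarity(emb, [("tiger", "cat", 7.35), ("cat", "car", 1.0)]))
```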

Properties of embeddings

Similarity depends on window size C

C = ±2: The nearest words to Hogwarts: Sunnydale, Evernight

C = ±5: The nearest words to Hogwarts: Dumbledore, Malfoy, halfblood

Vectors “capture concepts”

Analogy pairs

Analogy: Embeddings capture relational meaning!

vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)
vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
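A sketch of this analogy computation with cosine similarity, excluding the query words from the candidate set (as in Mikolov et al.); `embeddings` is assumed to be a word-to-vector dictionary:

```python
import numpy as np

def analogy(embeddings, a, b, c, topn=1):
    """Return the word(s) closest to vector(b) - vector(a) + vector(c), excluding the query words."""
    query = embeddings[b] - embeddings[a] + embeddings[c]
    query /= np.linalg.norm(query)
    scores = []
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue                                  # discard the input question words
        scores.append((query @ vec / np.linalg.norm(vec), word))
    return [w for _, w in sorted(scores, reverse=True)[:topn]]

# with well-trained embeddings, analogy(emb, 'man', 'king', 'woman') should return ['queen']
```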

Word2vec results

Word2Vec and distributional similarities

Why does the word2vec objective yield sensible results?

Levy and Goldberg (NIPS 2014): Skipgram with negative sampling can be seen as a weighted factorization of a word-context PMI matrix. => It is actually very similar to traditional distributional approaches!

Levy, Goldberg and Dagan (TACL 2015) suggest tricks that can be applied to traditional approaches that yield similar results on these lexical tests.

Using Word Embeddings

Using pre-trained embeddings

Assume you have pre-trained embeddings E. How do you use them in your model?

-Option 1: Adapt E during training. Disadvantage: only words in the training data will be affected.
-Option 2: Keep E fixed, but add another hidden layer that is learned for your task.
-Option 3: Learn a matrix T ∈ R^(dim(emb)×dim(emb)) and use the rows of E′ = ET (adapts all embeddings, not specific words).
-Option 4: Keep E fixed, but learn a matrix Δ ∈ R^(|V|×dim(emb)) and use E′ = E + Δ or E′ = ET + Δ (this learns to adapt specific words).
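A sketch of the four options in PyTorch (PyTorch is an assumption here, not something the slides prescribe; E below is a random stand-in for a pre-trained embedding matrix):

```python
import torch
import torch.nn as nn

V, d = 10000, 300
E = torch.randn(V, d)                  # stand-in for a pre-trained embedding matrix

# Option 1: adapt E during training (fine-tune the embedding table)
emb_finetune = nn.Embedding.from_pretrained(E, freeze=False)

# Option 2: keep E fixed, add a task-specific layer on top
emb_fixed = nn.Embedding.from_pretrained(E, freeze=True)
task_layer = nn.Linear(d, d)

# Option 3: keep E fixed, learn a global transformation T (adapts all embeddings)
T = nn.Parameter(torch.eye(d))

# Option 4: keep E fixed, learn a per-word correction Δ (adapts specific words)
delta = nn.Parameter(torch.zeros(V, d))

word_ids = torch.tensor([3, 17, 42])
e1 = emb_finetune(word_ids)                   # Option 1
e2 = task_layer(emb_fixed(word_ids))          # Option 2
e3 = emb_fixed(word_ids) @ T                  # Option 3: E' = ET
e4 = emb_fixed(word_ids) + delta[word_ids]    # Option 4: E' = E + Δ
```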

More on embeddings

Embeddings aren’t just for words! You can take any discrete input feature (with a fixed number K of outcomes, e.g. POS tags, etc.) and learn an embedding matrix for that feature.

Where do we get the input embeddings from?
We can learn the embedding matrix during training. Initialization matters: use random weights, but in a special range (e.g. [−1/(2d), +1/(2d)] for d-dimensional embeddings), or use Xavier initialization.
We can also use pre-trained embeddings. LM-based embeddings are useful for many NLP tasks.
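For example, initializing an embedding table for a discrete feature (a hypothetical POS-tag feature with K = 45 outcomes), again sketched in PyTorch:

```python
import torch.nn as nn

K, d = 45, 32                               # e.g. 45 POS tags, 32-dimensional feature embeddings
pos_emb = nn.Embedding(K, d)

# random initialization in a small range, e.g. [-1/(2d), +1/(2d)] ...
nn.init.uniform_(pos_emb.weight, -1.0 / (2 * d), 1.0 / (2 * d))

# ... or Xavier (Glorot) initialization
nn.init.xavier_uniform_(pos_emb.weight)
```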

Dense embeddings you can download!

Word2vec (Mikolov et al.): https://code.google.com/archive/p/word2vec/
Fasttext: http://www.fasttext.cc/

GloVe (Pennington, Socher, Manning): http://nlp.stanford.edu/projects/glove/

Traditional Distributional similarities and PMI

Distributional similarities

Distributional similarities use the set of contexts in which words appear to measure their similarity.

They represent each word w as a vector w = (w1, …, wN) ∈ R^N in an N-dimensional vector space.

-Each dimension corresponds to a particular context cn

-Each element wn of w captures the degree to which the word w is associated with the context cn.

- wn depends on the co-occurrence counts of w and cn

The similarity of words w and u is given by the similarity of their vectors w and u

Using nearby words as contexts

-Decide on a fixed vocabulary of N context words c1..cN
Context words should occur frequently enough in your corpus that you get reliable co-occurrence counts, but you should ignore words that are too common (‘stop words’: a, the, on, in, and, or, is, have, etc.)

-Define what ‘nearby’ means. For example: w appears near c if c appears within ±5 words of w.

-Get co-occurrence counts of words w and contexts c

-Define how to transform co-occurrence counts of words w and contexts c into vector elements wn. For example: compute the (positive) PMI of words and contexts.

-Define how to compute the similarity of word vectors. For example: use the cosine of their angles (see the sketch below).
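A compact sketch of this pipeline on a toy corpus, using raw co-occurrence counts as vector elements and cosine similarity (the PPMI transformation is shown further below); the context vocabulary and corpus here are made up:

```python
import numpy as np
from collections import Counter

def context_counts(tokens, context_vocab, window=5):
    """Count how often each word co-occurs with each context word within ±window tokens."""
    counts = {}
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i and tokens[j] in context_vocab:
                counts.setdefault(w, Counter())[tokens[j]] += 1
    return counts

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

contexts = ["boil", "data", "large", "sugar", "water"]       # toy fixed context vocabulary
tokens = "apricot jam with sugar and water boil the apricot in water data in a large table".split()
counts = context_counts(tokens, set(contexts))
vec = lambda w: np.array([counts.get(w, Counter())[c] for c in contexts], dtype=float)
print(cosine(vec("apricot"), vec("data")))
```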

Defining and counting co-occurrence

Defining co-occurrences:
-Within a fixed window: vi occurs within ±n words of w
-Within the same sentence: requires sentence boundaries
-By grammatical relations: vi occurs as a subject/object/modifier/… of verb w (requires parsing, and separate features for each relation)

Counting co-occurrences:
-fi as binary features (1, 0): w does/does not occur with vi

-fi as frequencies: w occurs n times with vi
-fi as probabilities: e.g. fi is the probability that vi is the subject of w.

Getting co-occurrence counts

Co-occurrence as a binary feature: Does word w ever appear in the context c? (1 = yes / 0 = no)

              arts  boil  data  function  large  sugar  water
apricot         0     1     0      0        1      1      1
pineapple       0     1     0      0        1      1      1
digital         0     0     1      1        1      0      0
information     0     0     1      1        1      0      0

Co-occurrence as a frequency count: How often does word w appear in the context c? (0…n times)

              arts  boil  data  function  large  sugar  water
apricot         0     1     0      0        5      2      7
pineapple       0     2     0      0       10      8      5
digital         0     0    31      8       20      0      0
information     0     0    35     23        5      0      0

Typically: 10K-100K dimensions (contexts), very sparse vectors

Counts vs PMI

Sometimes, low co-occurrence counts are very informative, and high co-occurrence counts are not:
-Any word is going to have relatively high co-occurrence counts with very common contexts (e.g. “it”, “anything”, “is”, etc.), but this won’t tell us much about what that word means.
-We need to identify when co-occurrences are more frequent than we would expect by chance.

We therefore want to use PMI values instead of raw frequency counts:

PMI(w, c) = log [ p(w, c) / (p(w) p(c)) ]

But this requires us to define p(w, c), p(w) and p(c).

Pointwise Mutual Information (PMI)

Recall that two events x, y are independent if their joint probability is equal to the product of their individual probabilities:
x, y are independent iff p(x, y) = p(x)p(y)
x, y are independent iff p(x, y) / (p(x)p(y)) = 1

In NLP, we often use the pointwise mutual information (PMI) of two outcomes/events (e.g. words):

PMI(x, y) = log [ p(X = x, Y = y) / (p(X = x) p(Y = y)) ]

Positive Pointwise Mutual Information

PMI is negative when words co-occur less than expected by chance. This is unreliable without huge corpora:

With P(w1) ≈ P(w2) ≈ 10^−6, we can’t estimate whether P(w1, w2) is significantly different from 10^−12

We often just use positive PMI values, and replace all PMI values < 0 with 0:

Positive Pointwise Mutual Information (PPMI):
PPMI(w, c) = PMI(w, c)  if PMI(w, c) > 0
           = 0          if PMI(w, c) ≤ 0
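A sketch of the PPMI computation from a word-by-context count matrix, estimating p(w, c), p(w) and p(c) from the counts themselves (one common choice; the slides only require that these probabilities be defined somehow):

```python
import numpy as np

def ppmi(counts):
    """PPMI from a word-by-context co-occurrence count matrix.

    p(w, c) = count(w, c) / total; p(w) and p(c) are the row and column marginals.
    """
    total = counts.sum()
    p_wc = counts / total
    p_w = p_wc.sum(axis=1, keepdims=True)
    p_c = p_wc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
        return np.where(pmi > 0, pmi, 0.0)        # replace negative (or undefined) PMI with 0

# frequency-count table from the "Getting co-occurrence counts" slide
# rows: apricot, pineapple, digital, information; columns: arts, boil, data, function, large, sugar, water
counts = np.array([[0, 1, 0, 0, 5, 2, 7],
                   [0, 2, 0, 0, 10, 8, 5],
                   [0, 0, 31, 8, 20, 0, 0],
                   [0, 0, 35, 23, 5, 0, 0]], dtype=float)
print(np.round(ppmi(counts), 2))
```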

Frequencies vs. PMI

Objects of ‘drink’ (Lin, 1998)

Object        Count   PMI
bunch beer      2    12.34
tea             2    11.75
liquid          2    10.53
champagne       4    11.75
anything        3     5.15
it              3     1.25
