Structuralism as the Origin of Self-Supervised Learning and Word Embeddings
Piero Molino

Motivation
Word Embeddings: a hot trend in NLP (Post-word2vec era, 2013-2018)
Self-Supervised Learning: hot trend in NLP and ML (Post-ELMo era, 2018+)
Many researchers and practitioners are unaware of previous work in computer science, cognitive science and computational linguistics (Pre-word2vec era: up to 2013)
This delays progress due to reinventing the wheel; there are many lessons to be learned

Goal
Overview* of the history and cultural background to start building on existing knowledge
Propose a unifying interpretation of many algorithms
*Incomplete personal overview, a useful starting point for exploration

Outline
1. Linguistic background: Structuralism
2. Distributional Semantics
3. Embedding Methods overview
4. Self-Supervised Learning

Terminology
Word Embeddings, Distributed Representations, Word Vectors, Distributional Semantic Models, Distributional Representations, Semantic Vector Space, Word Space, Semantic Space, Geometrical Model of Meaning, Context-theoretic Models, Corpus-based Semantics, Statistical Semantics
They all mean (almost) the same thing
Distributional Semantic Models → Computational Linguistics literature
Word Embeddings → Neural Networks literature

Structuralism
“The belief that phenomena of human life are not intelligible except through their interrelations. These relations constitute a structure, and behind local variations in the surface phenomena there are constant laws of abstract structure”
- Simon Blackburn, Oxford Dictionary of Philosophy, 2008

Origins of Structuralism
Ferdinand de Saussure, Cours de linguistique générale, 1916
Published posthumously from his students' notes
Previous ideas close to structuralism:
• Wilhelm von Humboldt, Über den Dualis, 1828
• Wilhelm von Humboldt, Über die Verschiedenheit des menschlichen Sprachbaues, 1836
• Ferdinand de Saussure, Mémoire sur le système primitif des voyelles dans les langues indo-européennes, 1879

Structuralism and Semiotics

Sign
A sign pairs a signifier (the expressive element: a word, a color, an image, a scent) with a signified (the referent / meaning: a denoted object, an idea, a cultural symbol, an ideology)
Signs are grounded in a cultural system (langue vs parole)
Different languages use different signifiers for the same signified → the choice of signifiers is arbitrary
The meaning of signs is defined by their relationships and contrasts with other signs

Linguistic relationships
Paradigmatic (in absentia): relationship between words that can be substituted for each other in the same position within a given sentence (the man/boy plays/jumps/talks)

Syntagmatic (in praesentia): relationship a word has with the other words that surround it

Originally de Saussure used the term "associative"; the term "paradigmatic" was introduced by Louis Hjelmslev, Principes de grammaire générale, 1928

Paradigmatic
Synonymy: bubbling / effervescent / sparkling
Hyponymy: feline is the hypernym of the hyponyms tiger and lion
Antonymy: hot / cold

Syntagmatic

Collocation: against the law, law enforcement, become law, law is passed
Colligation: time colligates with past-tense VERBs (time saved / spent / wasted) and with ADJECTIVEs (normal / half / extra time)

Distributionalism
American structuralist branch
Leonard Bloomfield, Language, 1933
Zellig Harris, Methods in Structural Linguistics, 1951
Zellig Harris, Distributional Structure, 1954
Zellig Harris, Mathematical Structures of Language, 1968

Philosophy of Language
"The meaning of a word is its use in the language"
- Ludwig Wittgenstein, Philosophical Investigations, 1953

Corpus Linguistics
"You shall know a word by the company it keeps"
- J.R. Firth, Papers in Linguistics, 1957

Other relevant work
Willard Van Orman Quine, Word and Object, 1960
Margaret Masterman, The Nature of a Paradigm, 1965

Distributional Semantics

Distributional Hypothesis
The degree of semantic similarity between two linguistic expressions A and B is a function of the similarity of the linguistic contexts in which A and B can appear
First formulation by Harris, Charles, Miller, Firth or Wittgenstein?

"He filled the wampimuk, passed it around and we all drank some"
"We found a little, hairy wampimuk sleeping behind the tree"
– McDonald and Ramscar, 2001

Distributional Semantic Model
1. Represent words through vectors recording their co-occurrence counts with context elements in a corpus
2. (Optionally) Apply a re-weighting scheme to the resulting co-occurrence matrix
3. (Optionally) Apply dimensionality reduction techniques to the co-occurrence matrix
4. Measure geometric distance of word vectors as proxy to semantic similarity / relatedness

Example
Target: a specific word
Context: nouns and verbs in the same sentence

"The dog barked in the park. The owner of the dog put him on the leash since he barked."

Vector for dog (word: count): bark 2, park 1, leash 1, owner 1

Example
Toy target × context matrix:

         leash  walk  run  owner  leg  bark
dog        3     5     1     5     4     2
cat        0     3     3     1     5     0
lion       0     3     2     0     1     0
light      0     0     1     0     0     0
dark       1     0     0     2     1     0
car        0     0     4     3     0     0

Example
Use cosine similarity as a measure of relatedness:

cos θ = (x · y) / (||x|| ||y||) = Σ_i x_i y_i / (sqrt(Σ_i x_i²) sqrt(Σ_i y_i²))

[2D plot of the word vectors for dog, cat, car, bark, park, leash]
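The count table and cosine measure above can be sketched in a few lines (a minimal illustration using the toy counts from the example; in practice the matrix would come from a large corpus):

```python
import math

# Toy co-occurrence counts from the example (rows: targets, columns: contexts)
contexts = ["leash", "walk", "run", "owner", "leg", "bark"]
counts = {
    "dog":   [3, 5, 1, 5, 4, 2],
    "cat":   [0, 3, 3, 1, 5, 0],
    "lion":  [0, 3, 2, 0, 1, 0],
    "light": [0, 0, 1, 0, 0, 0],
    "dark":  [1, 0, 0, 2, 1, 0],
    "car":   [0, 0, 4, 3, 0, 0],
}

def cosine(x, y):
    # cos(theta) = (x . y) / (||x|| ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

# In this toy space dog is more related to cat than to car
print(cosine(counts["dog"], counts["cat"]))  # ~0.72
print(cosine(counts["dog"], counts["car"]))  # ~0.42
```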
Similarity and Relatedness
Semantic similarity: words sharing salient attributes / features
• synonymy (car / automobile)
• hypernymy (car / vehicle)
• co-hyponymy (car / van / truck)

Semantic relatedness: words semantically associated without being necessarily similar
• function (car / drive)
• meronymy (car / tyre)
• location (car / road)
• attribute (car / fast)

(Budanitsky and Hirst, 2006)

Context
The meaning of a word can be defined in terms of its context (properties, features)
• Other words in the same document / paragraph / sentence
• Words in the immediate neighborhood
• Words along dependency paths
• Linguistic patterns
• Predicate-Argument structures
• Frames
• Hand-crafted features

First attempts in the 1960s with Charles Osgood's semantic differentials, also used in early connectionist AI approaches in the 1980s

Any process that builds a structure on sentences can be used as a source for properties

Context Examples: Document
DOC1: "The silhouette of the sun beyond a wide-open bay on the lake; the sun still glitters although evening has arrived in Kuhmo. It's midsummer; the living room has its instruments and other objects in each of its corners."

For the target sun, different context definitions select different elements from the same passage:
• Document: the whole of DOC1
• Wide window: all words within a large span around the target
• Wide window (content words): only content words within the span
• Small window (content words): only the closest content words
• PoS-coded content lemmas: silhouette/N, wide-open/A, bay/N, lake/N, glitters/V, evening/N, arrive/V
• PoS-coded content lemmas filtered by syntactic path: silhouette/N (PPDEP), glitters/V (SUBJ)
• Syntactic-path-coded lemmas: silhouette/N_PPDEP, glitters/V_SUBJ

Effect of Context: Neighbors of dog in BNC Corpus
2-word window (more paradigmatic): cat, horse, fox, pet, rabbit, pig, animal, mongrel, sheep, pigeon
30-word window (more syntagmatic): kennel, puppy, pet, bitch, terrier, rottweiler, canine, cat, bark, alsatian

Effect of Context: Neighbors of Turing in Wikipedia
Syntactic dependencies (co-hyponyms, paradigmatic): Pauling, Hotelling, Heting, Lessing, Hamming
5-word window (topically related, syntagmatic): nondeterministic, non-deterministic, computability, deterministic, finite-state
Weighting Schemes

So far we used raw counts; several other options for populating the target × context matrix are available: TF-IDF, TF-ICF, Okapi BM25, ATC, LTU, MI, Positive MI, T-Test, Lin98a, Lin98b, Gref94, ...

In most cases Positive Pointwise Mutual Information (PPMI) is the best choice

Kiela and Clark, A Systematic Study of Semantic Vector Space Model Parameters, 2014, is a good review; see also (Curran, 2004)

[tables of weighting scheme definitions and experimental details from Kiela and Clark (2014) omitted]
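A minimal sketch of the PPMI re-weighting recommended above, following the standard definition PMI = log P(t,c) / (P(t) P(c)) on a toy count matrix (not from a real corpus):

```python
import numpy as np

# Toy target x context count matrix (rows: targets, columns: contexts)
counts = np.array([[3., 5., 1.],
                   [0., 3., 3.],
                   [1., 0., 2.]])

def ppmi(counts):
    total = counts.sum()
    p_tc = counts / total                  # joint probabilities P(t, c)
    p_t = p_tc.sum(axis=1, keepdims=True)  # marginal P(t)
    p_c = p_tc.sum(axis=0, keepdims=True)  # marginal P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_tc / (p_t * p_c))   # PMI = log P(t,c) / (P(t) P(c))
    pmi[~np.isfinite(pmi)] = 0.0           # zero counts -> undefined PMI -> 0
    return np.maximum(pmi, 0.0)            # Positive PMI keeps only positive associations

weights = ppmi(counts)
```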
Similarity Measures

So far we used cosine; several other options for computing similarity are available: Euclidean, Cityblock, Chebyshev, Correlation, Dice, Jaccard, Jaccard2, Lin, Tanimoto, Jensen-Shannon divergence, α-skew, ...

In most cases Correlation is the best choice (cosine similarity of vectors normalized by their mean)

Kiela and Clark, A Systematic Study of Semantic Vector Space Model Parameters, 2014, is a good review; see also (Curran, 2004)

[table of similarity measure definitions from Kiela and Clark (2014) omitted]

Evaluation
Intrinsic:
• evaluate word pair similarities → compare with similarity judgments given by humans (WordSim, MEN, Mechanical Turk, SimLex)
• evaluate on analogy tasks: "Paris is to France as Tokyo is to x" (MSR analogy, Google analogy)

Extrinsic:
• use the vectors in a downstream task (classification, translation, ...) and evaluate the final performance on the task

Best parameter configuration? (context, similarity measure, weighting, ...) Depends on the task!

Methods overview

Methods
Semantic Differential (Osgood et al. 1957)
Semantic features (Smith et al. 1974)
Mechanisms of sentence processing assigning roles to constituents (McClelland and Kawamoto 1986)
Learning Distributed Representations of Concepts (Hinton et al. 1986)
Forming Global Representations with Extended Back-Propagation [FGREP] (Miikkulainen and Dyer 1987)
Sparse Distributed Memory [SDM] (Kanerva 1988)
Latent Semantic Analysis [LSA] (Deerwester et al. 1988-1990)
Hyperspace Analogue to Language [HAL] (Lund and Burgess 1995)
Probabilistic Latent Semantic Analysis [pLSA] (Hofmann 1999)
Random Indexing (Kanerva et al. 2000)
Latent Dirichlet Allocation [LDA] (Blei et al. 2003)
A neural probabilistic language model (Bengio et al. 2003)
Infomap (Widdows et al. 2004)
Correlated Occurrence Analogue to Lexical Semantics [COALS] (Rohde et al. 2006)
Dependency Vectors (Padó and Lapata 2007)
Explicit Semantic Analysis (Gabrilovich and Markovitch 2007)
Distributional Memory (Baroni and Lenci 2009)
Non-Negative Matrix Factorization [NNMF] (Van de Cruys et al. 2010), originally (Paatero and Tapper 1994)
JoBimText (Biemann and Riedl 2013)
word2vec [SGNS and CBOW] (Mikolov et al. 2013)
vLBL and ivLBL (Mnih and Kavukcuoglu 2013)
Hellinger PCA [HPCA] (Lebret and Collobert 2014)
Global Vectors [GloVe] (Pennington et al. 2014)
Infinite Dimensional Word Embeddings (Nalisnick and Ravi 2015)
Gaussian Embeddings (Vilnis and McCallum 2015)
Diachronic Word Embeddings (Hamilton et al. 2016)
WordRank (Ji et al. 2016)
Exponential Family Embeddings (Rudolph et al. 2016)
Multimodal Word Distributions (Athiwaratkun and Wilson 2017)
...

Explicit vs Implicit
Explicit vectors: big, sparse vectors with interpretable dimensions
Implicit vectors: small, dense vectors with latent dimensions

Count vs Prediction

[Figure 2 from Alessandro Lenci, Distributional models of word meaning, 2017: a classification of DSMs based on (left) context types — word models over lexemes, with window-based or syntactic collocates, vs region models over text regions — and (right) methods to build distributional vectors — count models (matrix models or random encoding models) vs prediction models]

Most common matrix DSMs (Lenci 2017, Table 2):
• Latent Semantic Analysis (LSA): word-by-region matrix, weighted with entropy and reduced with SVD (Landauer & Dumais 1997)
• Hyperspace Analogue of Language (HAL): window-based model with directed collocates (Burgess 1998)
• Dependency Vectors (DV): syntactic model with dependency-filtered collocates (Padó & Lapata 2007)
• Latent Relational Analysis (LRA): pair-by-pattern matrix reduced with SVD to measure relational similarity (Turney 2006)
• Distributional Memory (DM): target–link–context tuples formalized with a high-order tensor (Baroni & Lenci 2010)
• Topic Models: word-by-region matrix reduced with Bayesian inference (Griffiths et al. 2007)
• High Dimensional Explorer (HiDEx): generalization of HAL with a larger range of parameter settings (Shaoul & Westbury 2010)
• Global Vectors (GloVe): word-by-word matrix reduced with weighted least squares regression (Pennington et al. 2014)
Narrow context windows and syntactic collocates best capture lexemes related by paradigmatic semantic relations (synonyms, antonyms, members of the same taxonomic category), because they share very close collocates; collocates extracted with larger context windows are biased towards more associative, syntagmatic relations, like region models (Lenci 2017)

Hyperspace Analogue to Language [HAL]

Target: a specific word
Context: window of ten words
Weighting: (10 - distance from target) for each occurrence
  Example: in "the dog barked at the cat", weight(dog, barked) = 10 (no gap), weight(dog, cat) = 7 (3 words gap)
Similarity: euclidean distance
Dimensionality reduction: sort the contexts (columns of the matrix) by variance, keep the top 200 columns and discard the rest

Hyperspace Analogue to Language
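The HAL weighting scheme above can be sketched as follows (a minimal illustration that only records forward co-occurrences; the full model records left and right contexts separately):

```python
from collections import defaultdict

def hal_counts(tokens, window=10):
    # HAL-style weighting: each co-occurrence within the window adds
    # (window - gap), i.e. 10 for adjacent words down to 1 at the window edge.
    weights = defaultdict(float)
    for i, target in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            gap = j - i - 1  # number of words between target and context
            weights[(target, tokens[j])] += window - gap
    return weights

w = hal_counts("the dog barked at the cat".split())
print(w[("dog", "barked")])  # 10.0 (no gap)
print(w[("dog", "cat")])     # 7.0  (3 words gap)
```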
Advantages:
• Simple
• Fast O(n)

Disadvantages:
• No higher order interactions (only direct co-occurrence)

Latent Semantic Analysis [LSA]
Target: a specific word
Context: document id
Weighting: tf-idf (term frequency - inverse document frequency), but others can be used:

weight_ij = log(f_ij) · log(N / n_j)

where f_ij is the frequency of word j in document i, N the total number of documents, and n_j the number of documents containing word j
Intuition: the more frequent in the document, the better; the less frequent in the corpus, the better
Similarity: cosine
Dimensionality reduction: Singular Value Decomposition (SVD)

SVD in a nutshell
W = U Σ V^T, where W is the m × n term × document matrix, U is m × m, Σ is the diagonal matrix of singular values, and V is n × n

Keeping only the top k < r singular values gives a rank-k approximation: W ≈ U_k Σ_k V_k^T

Intuition: the top k singular values contain most of the variance; k can be interpreted as the number of topics

Target matrix: T = U_k Σ_k (m × k)
Context matrix: C = V_k (n × k)

Trick from (Levy et al. 2015): throw away Σ_k for better performance

Latent Semantic Analysis
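The truncated SVD above is a few lines of numpy (a minimal sketch on a toy term × document matrix; in practice W would be a large, tf-idf-weighted matrix):

```python
import numpy as np

# Toy term x document matrix W (e.g. tf-idf weighted in a real LSA pipeline)
W = np.array([[1., 0., 0., 2.],
              [3., 1., 0., 0.],
              [0., 2., 4., 1.],
              [0., 1., 2., 0.],
              [2., 0., 0., 3.]])

U, s, Vt = np.linalg.svd(W, full_matrices=False)

k = 2                             # number of latent "topics"
term_vectors = U[:, :k] * s[:k]   # T = U_k Sigma_k  (m x k)
doc_vectors = Vt[:k, :].T         # C = V_k          (n x k)

# Levy et al. (2015) trick: drop Sigma_k and use U_k directly
term_vectors_sym = U[:, :k]

# Rank-k reconstruction keeps most of the variance
W_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
```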
Advantages:
• Reduced dimension k can be interpreted as topics
• Reducing the number of columns unveils higher order interactions

Disadvantages:
• Static → can't easily add new documents, words and topics
• SVD is a one-time operation, without intermediate results
• Expensive in terms of memory and computation O(k²m)

Random Indexing [RI]
B_{n×k} ≈ A_{n×m} R_{m×k}, with k ≪ m

Locality-sensitive hashing method that approximates the distance between points: generate a random matrix R and project the original matrix A onto it to obtain a reduced matrix B

The reduced space B preserves the euclidean distance between points in the original space A (Johnson-Lindenstrauss lemma):

(1 − ε) d(v, u) ≤ d_r(v, u) ≤ (1 + ε) d(v, u)

Random Indexing [RI]

Dataset: "I drink beer. You drink a glass of beer."

Algorithm:
• For every word in the corpus create a sparse random context vector with values in {-1, 0, 1}
• Target vectors are the sum of the context vectors of the words they co-occur with, multiplied by the frequency of the co-occurrence

Context vectors:
I      1  0  0  0  0 -1  0
drink  0  0  1  0  0  0  0
beer   0  1  0  0  0  0  0
you    0 -1  0  0  0  0  1
glass -1  0  0  0  1  0  0

Target vectors: tv_beer = 1·cv_I + 2·cv_drink + 1·cv_you + 1·cv_glass
beer   0 -1  2  0  1 -1  1

Random Indexing
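The algorithm above can be sketched in a few lines (a minimal illustration; the dimensionality, window size and number of non-zero entries are toy values, and real implementations use thousands of dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "i drink beer you drink a glass of beer".split()
vocab = sorted(set(corpus))
dim = 7        # reduced dimensionality (tiny here; thousands in practice)
nonzeros = 2   # number of non-zero entries per random context vector

# Sparse ternary random context vector per word, values in {-1, 0, 1}
context_vectors = {}
for word in vocab:
    v = np.zeros(dim)
    idx = rng.choice(dim, size=nonzeros, replace=False)
    v[idx] = rng.choice([-1, 1], size=nonzeros)
    context_vectors[word] = v

# Target vector = sum of context vectors of co-occurring words (window of 2)
target_vectors = {w: np.zeros(dim) for w in vocab}
window = 2
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            target_vectors[word] += context_vectors[corpus[j]]
```

Adding a new word later only requires creating one more random context vector, which is what makes the method incremental.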
Advantages:
• Fast O(n)
• Incremental → can add new words any time, just create a new context vector

Disadvantages:
• In many intrinsic tasks doesn't perform as well as other methods
• Stochasticity in the process → random distortion
• Negative similarity scores

Explicit Semantic Analysis [ESA]

Target: a specific word
Context: Wikipedia article
Assumption: Wikipedia articles are explicit topics
Weighting: tf-idf
Similarity: cosine
Dimensionality reduction: discard too-short articles and articles with only few other articles linking to them

              Mickey  Mouse     Mouse        Button  Janson  Drag and
              Mouse   [Rodent]  [computing]          Button  Drop
mouse          0,95    0,89      0,81         0,50    0,01    0,60
button         0,10    0,81      0,20         0,95    0,89    0,70
mouse button   0,50    0,85      0,50         0,72    0,45    0,65

Average of the two vectors → the disambiguated meaning emerges

          cat   leopard  jaguar  car  animal  button
Panther   0,83   0,72     0,65   0,3   0,92    0,01

Explicit Semantic Analysis
Advantages:
• Simple
• Fast O(n)
• Interpretable

Disadvantages:
• The assumption doesn't always hold
• Doesn't perform as well as other methods
• Vectors are really high dimensional, although quite sparse

JoBimText

Generic holing operation @: apply it to any tuple to obtain targets (jos) and contexts (bims)

Dependency tuples ("I gave a book to the girl"):
Input tuple              target   context
(nsubj, gave, I)         I        (nsubj, gave, @)
(det, book, a)           gave     (nsubj, @, I)
(dobj, gave, book)       a        (det, book, @)
(det, girl, the)         book     (det, @, a)
(prep_to, gave, girl)    girl     (prep_to, gave, @)
...                      gave     (prep_to, @, girl)

N-gram tuples:
Input tuple              target   context
(I, gave, a, book)       I        (@, gave, a, book)
(gave, a, book, to)      gave     (I, @, a, book)
(a, book, to, the)       a        (I, gave, @, book)
(book, to, the, girl)    book     (I, gave, a, @)
...                      the      (book, to, @, girl)
                         girl     (book, to, the, @)

Weighting: custom measure similar to Lin
Similarity: Lexicographer Mutual Information (PMI × Frequency) (Kilgarriff et al. 2004)

JoBimText
Advantages:
• Generic preprocessing operation → deals with many representations and types of data
• Deals with complex contexts (example: several steps in a tree)
• MapReduce implementation

Disadvantages:
• No dimensionality reduction → context vectors are high dimensional
• No uncovering of higher order relations
• Only effective on clusters

word2vec
Skip Gram with Negative Sampling (SGNS)

Target: a specific word
Context: window of n words

input → Target vectors T → hidden → Context vectors C → softmax → output

P(C_j | T_i) = exp(T_i · C_j) / Σ_j exp(T_i · C_j)

Vectors (e.g., 300 dimensions) are obtained by training the model to predict the context given a target; the prediction error is back-propagated and the vectors updated

Example: the dot product of the target vector for "ants" and the context vector for "car", after the softmax, is the probability that if you randomly pick a word nearby "ants" you will get "car"

Example
The quick brown fox jumps over the lazy dog

one-hot input for fox → Target vectors → Context vectors → softmax → prediction vs. ground truth over the vocabulary (the, quick, brown, jumps, over, dog, lazy, ...)

The prediction and the ground truth should be the same; when they are different, back-propagate the error and update the vectors to improve the prediction

Negative Sampling
Calculating the full softmax is expensive because of the large vocabulary

The quick brown fox jumps over the lazy dog

1. Create pairs of target and context words and predict the probability of them co-occurring to be 1:
(fox, quick) → 1, (fox, brown) → 1, (fox, jumps) → 1, (fox, over) → 1

2. Sample false context words from their unigram distribution and predict the probability of them co-occurring with the true target word to be 0:
(fox, the) → 0, (fox, lazy) → 0, (fox, dog) → 0, (fox, the) → 0

Negative Sampling Loss (NCE)
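The two steps above can be sketched as follows (a minimal illustration; window size, number of negatives and the 0.75 smoothing exponent follow common word2vec practice but are assumptions here, not tuned values):

```python
import random
from collections import Counter

random.seed(0)
corpus = "the quick brown fox jumps over the lazy dog".split()
window = 2
num_negatives = 2

# Unigram distribution for negative sampling (word2vec raises it to the 0.75 power)
freqs = Counter(corpus)
words = list(freqs)
probs = [freqs[w] ** 0.75 for w in words]

pairs = []
for i, target in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j == i:
            continue
        pairs.append((target, corpus[j], 1))      # true pair -> label 1
        for neg in random.choices(words, weights=probs, k=num_negatives):
            pairs.append((target, neg, 0))        # sampled pair -> label 0
```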
L = log σ(T_i · C_j) + Σ_{k=1..n} E_{k∼P(w)} [log σ(−T_k · C_j)]

where T_i and C_j are the target and context vectors, n is the number of negative samples, and T_k is the vector of a negative sample drawn from the unigram distribution of words P(w)

SGNS as matrix factorization
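The negative-sampling objective above, written as a loss to minimize (a minimal numpy sketch on random vectors, not a trained model):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(target, context, negatives):
    # Objective to maximize: log sigma(T_i . C_j) + sum_k log sigma(-T_k . C_j)
    # Returned negated, as a loss to minimize.
    positive = np.log(sigmoid(target @ context))
    negative = sum(np.log(sigmoid(-neg @ context)) for neg in negatives)
    return -(positive + negative)

rng = np.random.default_rng(0)
t = rng.normal(size=50)                          # target vector T_i
c = rng.normal(size=50)                          # context vector C_j
negs = [rng.normal(size=50) for _ in range(5)]   # vectors of sampled negatives
loss = sgns_loss(t, c, negs)                     # always > 0
```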
Target vectors T (Words × Features) × Context vectors C^T (Features × Contexts) = ? (Words × Contexts)

SGNS implicitly factorizes the Words × Contexts matrix whose cells are PMI(w, c) − log(k), with k the number of negative samples

Neural Word Embedding as Implicit Matrix Factorization (Levy and Goldberg 2014)

word2vec
Advantages:
• Iterative way of factorizing a matrix
• Fast O(nm), great implementations
• Several parameters to improve performance (negative samples, subsampling of frequent words, ...)
• Default parameters can go a long way

Disadvantages:
• Inflexible definition of context
• Doesn't use dataset statistics in a smart way
• Columns are hard to interpret as topics

Are neural word embeddings better than classic DSMs?
Yes, with vanilla parameters: Baroni et al., Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors, 2014

No, with optimal parameters: Levy et al., Improving Distributional Similarity with Lessons Learned from Word Embeddings, 2015

Maybe, when trained on 1 billion+ words: Sahlgren and Lenci, The Effects of Data Size and Frequency Range on Distributional Semantic Models, 2016

GloVe

Explicit factorization of the target × context matrix W (Words × Contexts) into target vectors (Words × Features) and context vectors (Features × Contexts)

Precomputes the matrix (unlike SGNS) and directly uses the statistics of the dataset (frequencies of co-occurrences):

J = Σ_{ij} f(W_ij) (w_i^T w̃_j − log W_ij)²

where W_ij is the frequency of target word i in context j

GloVe
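The GloVe objective above can be sketched as follows (a minimal illustration on random data; the bias terms of the full model are omitted for brevity, and x_max = 100 with exponent 0.75 are the defaults from the GloVe paper):

```python
import numpy as np

def weight_fn(x, x_max=100.0, alpha=0.75):
    # GloVe weighting: down-weights rare pairs, caps frequent ones at 1
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde):
    # J = sum_ij f(X_ij) (w_i . w~_j - log X_ij)^2, over non-zero cells only
    i, j = np.nonzero(X)
    diff = np.sum(W[i] * W_tilde[j], axis=1) - np.log(X[i, j])
    return np.sum(weight_fn(X[i, j]) * diff ** 2)

rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(6, 6)).astype(float)  # toy co-occurrence counts
W = rng.normal(scale=0.1, size=(6, 4))              # target vectors
W_tilde = rng.normal(scale=0.1, size=(6, 4))        # context vectors
loss = glove_loss(X, W, W_tilde)                    # non-negative
```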
Advantages:
• Better use of dataset statistics
• Converges to good solutions with less data
• Simple to apply on different contexts

Disadvantages:
• Recent comparisons show that on many tasks it doesn't perform as well as LSA or SGNS

Self-Supervised Learning

Compositionality

So far we represented words as vectors; how do we represent sentences?

Can't use the co-occurrences of sentences in their contexts, as sentences are sparse: most of them occur only once

We should represent their meaning by combining word representations

"The meaning of an utterance is a function of the meaning of its parts and their composition rules" - Gottlob Frege, Über Sinn und Bedeutung, 1892

Composition operators
Simple solution: just sum the vectors of the words in a sentence! ("I drive a car" → I + drive + a + car)

Other operators: product, weighted sum, convolution, ... (Mitchell and Lapata, 2008)

It's hard to perform better than the simple sum

Sum can't be the real answer though, as it's commutative → it doesn't consider word order
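The additive composition above, and its word-order blindness, in a few lines (the word vectors are made-up toy values, not trained embeddings):

```python
import numpy as np

# Toy word vectors (in practice, embeddings from any of the methods above)
vectors = {
    "i":     np.array([0.1, 0.9, 0.0]),
    "drive": np.array([0.7, 0.2, 0.5]),
    "a":     np.array([0.1, 0.1, 0.1]),
    "car":   np.array([0.8, 0.1, 0.6]),
}

def compose_sum(sentence):
    # Additive composition: the sentence vector is the sum of its word vectors
    return np.sum([vectors[w] for w in sentence.split()], axis=0)

s1 = compose_sum("i drive a car")
s2 = compose_sum("car a drive i")
# Sum is commutative, so both orderings yield the same vector: word order is lost
```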
Learn to compose

[diagrams use vectors, matrices and tensors, composed with sum (+), product (X) and concatenation (C) operations]

• Recursive Matrix Vector Network (Socher et al. 2012): recursively composes word pairs [w1, w2] bottom-up into a sentence representation
• Recursive Neural Tensor Network (Socher et al. 2013): recursively composes with tensor products
• Recurrent Neural Network (Elman 1990) and others: composes word by word, combining the previous output o_{i-1} with the current word w_i into a representation of [w1, ..., w_i]

Language Modeling
The quick brown fox jumps over the lazy dog
P(s) = Π_{i=1}^{|s|} P(s_i | s_1, ..., s_{i-1})
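The factorization above with a first-order Markov approximation (conditioning only on the previous word) can be estimated from counts (a toy illustration with a hypothetical three-sentence corpus and no smoothing):

```python
from collections import Counter

corpus = ["the quick brown fox", "the lazy dog", "the quick dog"]
tokens = [s.split() for s in corpus]

bigrams = Counter((w1, w2) for s in tokens for w1, w2 in zip(s, s[1:]))
unigrams = Counter(w for s in tokens for w in s)

def sentence_prob(sentence):
    # P(s) ~= P(s_1) * prod_i P(s_i | s_{i-1})  (first-order Markov approximation)
    words = sentence.split()
    total = sum(unigrams.values())
    p = unigrams[words[0]] / total
    for w1, w2 in zip(words, words[1:]):
        p *= bigrams[(w1, w2)] / unigrams[w1]
    return p

p = sentence_prob("the quick dog")  # 0.3 * (2/3) * (1/2) = 0.1
```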
The quick brown fox jumps over the lazy dog
P(s) = Π_{i=1}^{|s|} P(s_i | s_1, ..., s_{i-1}) · P(s_i | s_{i+1}, ..., s_{|s|})

Bidirectional Language Modeling (ELMo)
Target s1 ... si ... s|S|
LSTM2→ LSTM2←
LSTM1→ LSTM1←
Context s1 ... si ... s|S|
Deep Contextualized Word Representations (Peters et al. 2018)

Masked Language Modeling (BERT)
Sample 1: The quick brown fox jumps over the lazy dog
Sample 2: The quick brown fox jumps over the lazy dog
Sample 3: The quick brown fox jumps over the lazy dog
Sample 4: The quick brown fox jumps over the lazy dog
(each sample masks a different random subset of the words)

Masked Language Modeling (BERT)
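The sampling above can be sketched as follows (a minimal illustration of random masking; the full BERT recipe also keeps or randomizes some of the selected tokens, which is omitted here):

```python
import random

random.seed(1)

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    # Replace each token with [MASK] with probability mask_prob;
    # the model is trained to predict the original tokens at masked positions.
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)   # position not predicted
    return masked, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
samples = [mask_tokens(sentence) for _ in range(4)]  # 4 differently-masked samples
```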
Target MASK ... MASK ... s|S|
TransformerN
Transformer1
Context s1 MASK si MASK MASK
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al. 2018)

Embeddings for Graphs
[graph figure: a target node and its context neighborhood]

Graph Representation Learning
DeepWalk (Perozzi et al. 2014) node2vec (Grover and Leskovec 2016)
GraphSage (Hamilton et al. 2017) for supervised tasks, but same principle
Many papers on Knowledge Graphs (Nickel et al. 2015 for a review)

Computer Vision: Inpainting
NN
Context Target
"Enzo" by ElinaLt is licensed under CC BY-ND 2.0
Context Encoders: Feature Learning by Inpainting (Pathak et al. 2016)

Computer Vision: Denoising Autoencoders
NN
Context Target
Noise is a kind of "soft mask" "Enzo" by ElinaLt is licensed under CC BY-ND 2.0
Extracting and Composing Robust Features with Denoising Autoencoders (Vincent et al. 2008)

Reinforcement Learning
Target: how the world will look in the future (z_2, ..., z_{i+1}) if I take action a_i
Context: how the world looked in the past until now (z_1, a_1, ..., z_i, a_i)

[architecture: an MDN head on top of an RNN, unrolled over time steps]

World Models (Ha and Schmidhuber 2018)
Making the World Differentiable: On Using Self-Supervised Fully Recurrent Neural Networks for Dynamic Reinforcement Learning and Planning in Non-Stationary Environments (Schmidhuber 1990)

Video
Tracking moving objects (Wang and Gupta 2015)
Frame order validation (Misra et al. 2016)
Video colorization (Vondrick et al. 2018)

Conclusions
Structuralism can be a unifying interpretation for why many learning algorithms work
Many aspects of reality can be modeled in terms of targets and contexts
Context is king
Self-Supervised Learning is "just" applied structuralism

Contacts
[email protected]
http://w4nderlu.st

Thanks
Influenced this talk: Yoav Goldberg
Magnus Sahlgren Andre Freitas
Alfio Gliozzo Pierpaolo Basile
Marco Baroni Aurélie Herbelot
Alessandro Lenci Arianna Betti