What’s in your embedding, and how it predicts task performance

Anna Rogers September 27, 2018

Text Machine Lab, University of Massachusetts Lowell

Outline

1. Word embeddings: the basic framework and evaluation challenges
2. “Linguistic diagnostics”: methodology and implementation
3. Experimental results

Distributional meaning representations: the basic framework & challenges

Word embeddings are everywhere!

A cool and intuitively appealing way to represent meaning

(fig. by Iyyer et al. (2014))

Word embeddings are everywhere!

• …;
• inference;
• …;
• sequence labeling;
• you name it!

Word embeddings are everywhere!

Some success with compositionality:

(fig. by Cho et al. (2014))

Word embeddings are everywhere!

• pre-trained from large collections of texts;
• quick to train on your corpus;
• portable & fixed-size => “plug-and-play”

The basic idea: count-based models

         cat   dog   mouse  cheese
cat        –    12     15       1
dog       12     –      0       0
mouse     15     0      –      15
cheese     1     0      5       –
                 ⇓

cat 0.571 0.912 0.126 0.412 dog 0.259 0.4512 0.521 0.623   mouse 0.115 0.523 0.674 0.571 cheese 0.921 0.412 0.836 0.591

Ex: SVD (Schütze, 1993), HAL (Lund and Burgess, 1996), PCA (Lebret and Collobert, 2014)
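A minimal sketch of this pipeline (an illustration added here, not taken from the cited papers: a toy corpus, raw counts, and plain truncated SVD; real count-based models add weighting such as PPMI and far larger corpora):

    import numpy as np

    corpus = [
        "the cat chased the mouse".split(),
        "the dog chased the cat".split(),
        "the mouse ate the cheese".split(),
    ]
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}

    # Symmetric word-word co-occurrence counts within a +/-2-word window.
    window = 2
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    counts[idx[w], idx[sent[j]]] += 1

    # Truncated SVD: keep the top-k dimensions as dense word vectors.
    k = 2
    U, S, Vt = np.linalg.svd(counts)
    dense_vectors = U[:, :k] * S[:k]
    print(dict(zip(vocab, np.round(dense_vectors, 3))))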

Prediction-based models, a.k.a. word embeddings

cat 0.571 0.912 0.126 0.412 dog 0.259 0.4512 0.521 0.623   mouse 0.115 0.523 0.674 0.571 cheese 0.921 0.412 0.836 0.591

Ex: SG, CBOW (Mikolov et al., 2013a), GloVe (Pennington et al., 2014), FastText (Bojanowski et al., 2017), etc.
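As an illustration (not part of the original slides), a Skip-Gram model can be trained in a few lines with gensim; the parameter names follow the gensim 4.x API, and the toy corpus is the same one used in the count-based sketch above:

    from gensim.models import Word2Vec

    sentences = [
        "the cat chased the mouse".split(),
        "the dog chased the cat".split(),
        "the mouse ate the cheese".split(),
    ]
    # sg=1 selects Skip-Gram, sg=0 CBOW; vector_size, window and negative sampling
    # are exactly the kinds of hyperparameters discussed below.
    model = Word2Vec(sentences, vector_size=50, window=2, sg=1,
                     negative=5, min_count=1, epochs=50, seed=1)
    print(model.wv.most_similar("cat", topn=3))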

“Don’t count, predict!” (Baroni et al., 2014)

Mikolov et al., presentation at COLING (2014):

Neural networks: motivation

• The main motivation is to simply come up with more precise way how to represent and model words, documents and language than the basic approaches

• There is nothing that neural networks can do in NLP that the basic techniques completely fail at

• But: the victory in competitions goes to the best, thus few percent gain in accuracy counts!

Tomas Mikolov, COLING 2014

However...



So which one should I choose?

• Models: Word2Vec, GloVe, FastText, LexVec...
• Parameters: number of epochs, negative sampling, dimensionality...
• Input: domain, corpus, pre-processing, context type, window size...
• Particular model: much variability between the runs of the same model with the same parameters on the same input!!! (Wendlandt et al., 2018)

Combinatorial explosion of choices

Which embedding is better?

Top 7 neighbors of color in dependency-based and FastText embeddings.

Rank  Deps                FastText
1     colour       0.93   $color      0.75
2     colors       0.72   color...    0.69
3     coloration   0.69   colour      0.69
4     colouration  0.68   color#ff    0.69
5     colours      0.68   color#d     0.68
6     hue          0.66   @color      0.67
7     hues         0.65   barcolor    0.67
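Neighbor lists like these can be reproduced for any pre-trained embedding, e.g. with gensim's KeyedVectors (a sketch; the file name is a placeholder for any model in word2vec text format):

    from gensim.models import KeyedVectors

    # Placeholder path: any embedding in word2vec text format will do.
    vectors = KeyedVectors.load_word2vec_format("embeddings.vec", binary=False)
    for word, cosine in vectors.most_similar("color", topn=7):
        print(f"{word:15s} {cosine:.2f}")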



Evaluation of word embeddings: extrinsic benchmarks

• Performance on your own task of interest;
• Performance on a "representative suite of downstream tasks" (Nayak et al., 2016).



Evaluation of word embeddings: intrinsic tasks

• behavioral evaluation: correlation with similarity judgments (Bruni et al., 2014; Hill et al., 2015), intrusion (Chang et al., 2009; Schnabel et al., 2015), N400 effect (Van Petten, 2014; Ettinger et al., 2016), fMRI scans (Devereux et al., 2010; Søgaard, 2016), eye-tracking (Klerke et al., 2015; Søgaard, 2016), and semantic priming data (Lund and Burgess, 1996; Jones et al., 2006; Lapesa and Evert, 2013; Ettinger and Linzen, 2016; Auguste et al., 2017);
• “linguistic regularities” (Mikolov et al., 2013b; Levy and Goldberg, 2014);
• correlation with linguistic resources (Baroni and Lenci, 2011; Tsvetkov et al., 2015, 2016).

14 Standard benchmarks: word similarity/relatedness

Datasets:
• WordSim353 (Finkelstein, Gabrilovich et al., 2002)
• MEN (Bruni, Tran, and Baroni, 2013)
• RareWords (Luong, Socher and Manning, 2013)
• Radinsky Mturk data (Radinsky, Agichtein et al., 2011)
• SimLex-999 (Hill, Reichart, and Korhonen, 2014)

WordSim353 examples (pair, human rating):
tiger      cat            7.35
book       paper          7.46
computer   keyboard       7.62
plane      car            5.77
train      car            6.31
telephone  communication  7.50
television radio          6.77
media      radio          7.42
drug       abuse          6.85
cucumber   potato         5.92
bread      butter         6.19
doctor     nurse          7.00
smart      student        4.62
smart      stupid         5.81
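These datasets are scored by correlating the human ratings with the model's cosine similarities, usually via Spearman's rho. A sketch (an illustration only: the pairs reuse the WordSim353 examples above, and the embedding file name is a placeholder):

    from gensim.models import KeyedVectors
    from scipy.stats import spearmanr

    vectors = KeyedVectors.load_word2vec_format("embeddings.vec", binary=False)  # placeholder path
    pairs = [("tiger", "cat", 7.35), ("book", "paper", 7.46),
             ("computer", "keyboard", 7.62), ("plane", "car", 5.77),
             ("doctor", "nurse", 7.00), ("smart", "stupid", 5.81)]

    human, model = [], []
    for w1, w2, rating in pairs:
        if w1 in vectors and w2 in vectors:      # skip out-of-vocabulary pairs
            human.append(rating)
            model.append(vectors.similarity(w1, w2))
    rho, p_value = spearmanr(human, model)
    print(f"Spearman rho = {rho:.3f}")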

Standard benchmarks: word analogy

• Task: given one word, identify its pair with a given relation.
• Berlin − Germany + Japan = Tokyo (Mikolov et al., 2013)
• Google test set: 19,544 questions (8,869 semantic and 10,675 syntactic, i.e. morphological)
• 14 types of relations (9 morphological and 5 semantic)
• GloVe: 80% accuracy!
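In practice each question is answered by vector arithmetic plus a nearest-neighbor search over the vocabulary, with the premise words excluded. A numpy sketch under those assumptions (vocab and a row-normalized matrix of word vectors are assumed to exist; this is an editor's illustration, not the benchmark's reference code):

    import numpy as np

    def solve_analogy(a, a_prime, b, vocab, matrix):
        """Answer 'a : a_prime :: b : ?' as the nearest neighbor of a_prime - a + b."""
        idx = {w: i for i, w in enumerate(vocab)}
        target = matrix[idx[a_prime]] - matrix[idx[a]] + matrix[idx[b]]
        target /= np.linalg.norm(target)
        scores = matrix @ target                  # cosine similarities (rows are unit length)
        for w in (a, a_prime, b):                 # exclude the premise words
            scores[idx[w]] = -np.inf
        return vocab[int(np.argmax(scores))]

    # solve_analogy("germany", "berlin", "japan", vocab, matrix) should return "tokyo"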

Linear relations between countries and capitals in GloVe

[2D projection of GloVe vectors: the country→capital offsets (norway→oslo, poland→warsaw, germany→berlin, japan→tokyo, england→london, france→paris, italy→rome) are roughly parallel.]

Analogy only works for some relations!

[Bar chart: per-category accuracy on the Bigger Analogy Test Set for GloVe w10 d300 (average accuracy 0.285) and SVD w3 d1000 (average accuracy 0.221); accuracy varies widely across the 40 categories, with many near zero.]

Bigger Analogy Test Set (Gladkova et al., 2016)

Analogy evaluation doesn’t show how much information is actually encoded (Drozd et al., 2016)

Method    Encyclopedia      Lexicography      Inflections       Derivation
          GloVe    SG       GloVe    SG       GloVe    SG       GloVe    SG
3CosAdd   31.5%    26.5%    10.9%    9.1%     59.9%    61.0%    10.2%    11.2%
3CosAvg   44.8%    34.6%    13.0%    9.6%     68.8%    69.8%    11.2%    15.2%
LRCos     40.6%    43.6%    16.8%    15.4%    74.6%    87.2%    17.0%    45.6%

Given an analogy “a is to a′ as b is to b′”:

• 3CosAdd: b′ = argmax_{d∈V} cos(d, b − a + a′), where cos(u, v) = (u · v) / (‖u‖ ‖v‖) (Mikolov et al., 2013b)
• 3CosAvg: b′ = argmax_{d∈V} cos(d, b + avg_offset), where avg_offset = (Σ_{i=0..m} aᵢ)/m − (Σ_{i=0..n} bᵢ)/n (Drozd et al., 2016)
• LRCos: b′ = argmax_{d∈V} P(d ∈ target_class) · cos(d, b) (Drozd et al., 2016)
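A rough rendering of the two alternative methods on top of the same normalized vector matrix as in the offset sketch earlier (my own reading of the formulas, not the authors' code; scikit-learn's LogisticRegression stands in for the classifier that estimates P(d ∈ target_class)):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def three_cos_avg(pairs, b, vocab, matrix):
        """pairs: known (source, target) word pairs of the relation; returns the predicted b'."""
        idx = {w: i for i, w in enumerate(vocab)}
        sources = np.array([matrix[idx[a]] for a, _ in pairs])
        targets = np.array([matrix[idx[ap]] for _, ap in pairs])
        avg_offset = targets.mean(axis=0) - sources.mean(axis=0)   # average offset of the category
        scores = matrix @ (matrix[idx[b]] + avg_offset)
        scores[idx[b]] = -np.inf
        return vocab[int(np.argmax(scores))]

    def lr_cos(pairs, b, vocab, matrix, neg_per_pos=10, seed=0):
        """Score every word d by P(d is in the target class) * cos(d, b)."""
        idx = {w: i for i, w in enumerate(vocab)}
        pos = [idx[ap] for _, ap in pairs]                          # known target-class words
        rng = np.random.RandomState(seed)
        neg = rng.choice(len(vocab), size=len(pos) * neg_per_pos, replace=False)
        X = np.vstack([matrix[pos], matrix[neg]])
        y = np.array([1] * len(pos) + [0] * len(neg))
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        scores = clf.predict_proba(matrix)[:, 1] * (matrix @ matrix[idx[b]])
        scores[idx[b]] = -np.inf
        return vocab[int(np.argmax(scores))]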

Vector offset method is cheating! (Rogers et al., 2017)

[Heatmap over answer types (a, a′, b, b′, other) × category groups (Derivation, Inflections, Lexicography, Encyclopedia); color scale 0.00–0.90.]

The “honest” solution to a′ − a + b

Analogies are biased by cosine similarity (Rogers et al., 2017)

[Six panels plotting the share of all questions and accuracy (top 1/3/5) against: (a) similarity between vectors a and a′; (b) similarity between a′ and b′; (c) similarity between b and b′; (d) similarity between b′ and the predicted vector; (e) similarity between b′ and a; (f) rank of b in the neighborhood of b′.]

Vector offset method accuracy for different similarity/rank bins (GloVe). X-axis labels indicate the lower boundary of the corresponding cosine similarity/rank bins; the numerical values for all data can be found in the Appendix.

Not to mention fundamental problems...



All intrinsic approaches to date:

• rely on a pre-defined test dataset;
• assume the existence of a single intrinsic metric that would show how “good” an embedding model is.

Which embedding is better?

Top 7 neighbors of color in dependency-based and FastText embeddings.

Rank  Deps                FastText
1     colour       0.93   $color      0.75
2     colors       0.72   color...    0.69
3     coloration   0.69   colour      0.69
4     colouration  0.68   color#ff    0.69
5     colours      0.68   color#d     0.68
6     hue          0.66   @color      0.67
7     hues         0.65   barcolor    0.67

Linguistic Diagnostics for Word Embeddings

A. Rogers, Sh. H. Ananthakrishna, A. Rumshisky, COLING 2018


Linguistic diagnostics

• LD is NOT a dataset;
• LD does NOT assume a single metric that would work for all cases.

Linguistic diagnostics toolkit: http://ldtoolkit.space

• large-scale automatic annotation of linguistic, psychological and distributional relations between word vectors and their neighbors;
• the largest freely available lexicographic resources: WordNet (Fellbaum, 1998), Wiktionary, BabelNet (Navigli and Ponzetto, 2010);
• custom rules and resources for morphological analysis, spelling correction, detection of names, foreign words etc.;
• extensible to many languages.

LDT analysis pipeline


LDT factors

• Morphology:
  • SharedForm: the two words share morphological form. E.g.: cat and dog are singular nouns in the nominative case.
  • SharedDerivation: the two words share affixes or stems, according to Wiktionary etymologies and custom LDT tools. E.g.: kindness and happiness share the suffix -ness.
  • SharedPOS: the two words have the same part of speech (any overlap counts). E.g.: cat and dog are both nouns (and also verbs).

LDT factors

• Lexicography:
  • Synonyms, Antonyms, Hypernyms, Hyponyms, Meronyms: the corresponding dictionary relations. E.g.: quick:fast (synonymy), white:black (antonymy), tiger:feline (hypernymy), feline:tiger (hyponymy), cat:whiskers (meronymy).
  • Other encompasses the other semantic relations in Wiktionary. E.g.: corgi:poodle (co-hyponymy), whiskers:cat (holonymy), stroll:walk (troponymy).
  • MinPath: the minimum path between the synsets of the two words in the WordNet ontology (if they are both present in WordNet). E.g.: 4 for cat:dog, 10 for cat:umbrella.

LDT factors

• Psychological data:
  • Associations: whether the two words constitute an associative pair (in either direction), according to EAT (Wilson et al.) and USF-FAN (Nelson et al., 2004). E.g.: falcon:eagle, dog:bone.
• Other:
  • ProperNouns: the neighboring word is a proper noun. E.g.: jane, avignon, eyjafjallajökull.
  • ForeignWords: the word is not found in English resources, but found in German, French or Spanish. E.g.: extraordinaire, extraño.
  • Misspellings: the word contains a frequent misspelling pattern or noisy symbols. E.g.: mispelling, colorff, @color.

LDT factors

• Distribution:
  • NeighborsInGDeps: whether the two words co-occur in the Google dependency ngrams. E.g.: billions co-occurs with banks, but not with academia.
  • LowFreqNeighbors, HighFreqNeighbors: the frequency of the neighbor in the source corpus is under/above 10,000. E.g.: typical (74,009) case vs intriguing (2,965) case.
  • CloseNeighbors, FarNeighbors: the number of neighbors with cosine similarity to the target word over 0.8 / under 0.7.
  • NonCooccurring: the target and the neighbor do not co-occur in the source Wikipedia corpus (bag-of-words, window size 2). E.g.: Heifetz co-occurs with Menuhin (both violinists), but not with Prokofiev.
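Outside LDT itself, factors of this kind can be approximated in a few lines of numpy; a rough sketch for a single target-neighbor pair (an editor's illustration, assuming precomputed corpus frequencies, a set of co-occurring word pairs, and a dict of word vectors; thresholds follow the definitions above):

    import numpy as np

    def distributional_factors(target, neighbor, vectors, freq, cooc_pairs):
        """vectors: word -> np.array; freq: word -> corpus count; cooc_pairs: set of co-occurring pairs."""
        v1, v2 = vectors[target], vectors[neighbor]
        cos = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        return {
            "LowFreqNeighbor": freq.get(neighbor, 0) < 10_000,
            "HighFreqNeighbor": freq.get(neighbor, 0) >= 10_000,
            "CloseNeighbor": cos > 0.8,
            "FarNeighbor": cos < 0.7,
            "NonCooccurring": (target, neighbor) not in cooc_pairs
                              and (neighbor, target) not in cooc_pairs,
        }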

LDT highlights: dictionaries

LDT highlights: normalization

LDT highlights: all information about a word

LDT highlights: relations in a pair of words

Experiment set-up

• 60 GloVe, Skip-Gram and CBOW embeddings trained on English Wikipedia and published by Li et al. (2017);
• 4 vector sizes: 50, 100, 250, 500 dimensions;
• 4 types of context: bag-of-words vs dependency-based, bound vs unbound;
• all embeddings filtered to the same vocabulary size (269,860).

Experiment set-up

• balanced random sample of 908 monosemous and polysemous general-vocabulary words (nouns, verbs, adjectives and adverbs);
• up to 30 words of each category drawn from 4 logarithmic frequency bins in the source corpus: 100, 1,000, 10,000, 100,000;
• 1,000 nearest neighbors drawn for each word from each of the 60 embeddings; all pairs of words annotated by LDT (908,000 pairs per embedding).
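A sketch of this sampling step (my reading of the setup, not the authors' script: freq maps words to corpus frequency, pos maps words to their part of speech, and the bin boundaries are taken from the list above):

    import random

    def sample_vocabulary(freq, pos, bins=(100, 1_000, 10_000, 100_000), per_bin=30, seed=1):
        rng = random.Random(seed)
        sample = []
        for p in ("noun", "verb", "adjective", "adverb"):
            for low, high in zip(bins, bins[1:] + (float("inf"),)):
                candidates = [w for w in freq if low <= freq[w] < high and pos.get(w) == p]
                rng.shuffle(candidates)
                sample.extend(candidates[:per_bin])   # up to 30 words per POS per bin
        return sample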

Task selection

• Intrinsic tasks: WordSim353 (Finkelstein et al., 2002; Agirre et al., 2009), RareWords (Luong et al., 2013), MTurk (Radinsky et al., 2011), MEN (Bruni et al., 2014), SimLex999 (Hill et al., 2015), BATS (Gladkova et al., 2016);
• Extrinsic tasks similar to the “representative suite” by Nayak et al. (2016):
  • POS-tagging (POS), chunking, NER with the CoNLL 2003 shared task dataset (Tjong Kim Sang and De Meulder, 2003);
  • polarity classification at sentence level (short movie review dataset (Pang and Lee, 2005)) and document level (Stanford IMDB movie review dataset (Maas et al., 2011));
  • subjectivity/objectivity classification with the Rotten Tomatoes dataset (Pang and Lee, 2004);
  • natural language inference task with the SNLI dataset (Bowman et al., 2015).

Results

LD in action: why do models perform differently?

                     CBOW     GloVe    SG
POS                  87.660   83.800   87.860
Chunking             77.530   66.100   78.230
NER                  75.210   69.620   75.720
Relation class.      74.780   71.050   74.800
Subjectivity class.  89.800   89.160   89.920
Sentiment (sent.)    75.900   74.600   76.860
Sentiment (text)     82.220   82.240   82.730
SNLI                 69.290   69.510   69.740

LD in action: why do models perform differently?

                    CBOW     GloVe    SG
SharedForm          51.819   52.061   52.9
SharedPOS           30.061   35.507   31.706
SharedDerivation     4.468    3.938    5.084
Synonyms             0.413    0.443    0.447
Antonyms             0.128    0.133    0.144
Hyponyms             0.035    0.035    0.038
Other                0.013    0.013    0.013
Misspellings        13.546    9.914   12.809
Numbers              4.313    3.147    3.64
LowFreqNeighbors    94.778   66.51    96.109
NonCooccurring      88.97    67.904   90.252
CloseNeighbors       3.102    0.16     2.278

LD in action: parameter tuning

[Chart: amount of non-co-occurring neighbors (y-axis, roughly 70–90) by vector size (25, 50, 100, 250, 500) for CBOW, GloVe and SkipGram.]

Effect of vector size on detection of relations between words that did not co-occur in the corpus

LD in action: hypothesis testing

[Chart: amount of synonym neighbors (y-axis, 0.0–0.4) for CBOW, GloVe and SkipGram, by context type (bound/unbound, DEPS/linear).]

Effect of context type on specialization in synonymy

Tasks vs. information types (http://ldtoolkit.space/analysis/correlation)

[Heatmap of pairwise correlations (color scale from −0.8 to 1) between morphological factors (SharedMorphForm, SharedDerivation, SharedPOS), misc. factors (ProperNouns, Numbers, ForeignWords, Misspellings), semantic factors (Associations, Synonyms, Antonyms, Meronyms, Hypernyms, Hyponyms, OtherRelations, ShortestPath), distributional factors (LowFreqNeighbors, HighFreqNeighbors, NeighborsInGDeps, NonCooccurring, CloseNeighbors, FarNeighbors), intrinsic tasks (MEN, Mturk, RareWords, WS353, WS353_rel, WS353_sim, Sim999, BATS subsets and average) and extrinsic tasks (POS tagging, chunking, NER, relation classification, subjectivity classification, sentiment at sentence and text level, SNLI).]


45 Takeaways

• different tasks do need different types of information in the semantic representations;
• no perfect correlation between “semantic” and “morphological” tasks and variables: e.g. minimum distance in the WordNet ontology correlates with POS-tagging more than with inference or subjectivity classification;
• tasks correlate with each other more strongly than with individual LD variables, showing that it is their ensembles that matter;
• the weak contribution from psychological associations adds to the suspicions about similarity/relatedness tests.



Future work

• reproducing and testing on more embeddings;
• comparison of word embedding algorithms (e.g. different modifications of Word2Vec);
• identification of hyperparameter effects on the encoding of target linguistic relations in specialized embeddings;
• using the above information for a more informed, hypothesis-driven design of new distributional representations;
• identification of factors that enable an informed choice of word embeddings for various downstream tasks.

Thank you!

Slides: http://www.cs.uml.edu/~arogers/ (“Talks” tab)
Project website: http://ldtoolkit.space

References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL ’09, pages 19–27, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jeremy Auguste, Arnaud Rey, and Benoit Favre. 2017. Evaluation of word embeddings against cognitive processes: Primed reaction times in lexical decision and naming tasks. In Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, pages 21–26.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 238–247.

Marco Baroni and Alessandro Lenci. 2011. How We BLESSed

Distributional Semantic Evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, GEMS ’11, pages 1–10, Stroudsburg, PA, USA. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5(0):135–146.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. JAIR, 49:1–47.

Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L. Boyd-Graber, and David M. Blei. 2009. Reading tea leaves: How humans interpret

47 topic models. In Advances in Neural Information Processing Systems, pages 288–296.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.

Barry Devereux, Colin Kelly, and Anna Korhonen. 2010. Using fMRI activation to conceptual stimuli to evaluate methods for extracting conceptual representations from corpora. In Proceedings of the NAACL HLT 2010 First Workshop on Computational Neurolinguistics, pages 70–78. Association for Computational Linguistics.

Aleksandr Drozd, Anna Gladkova, and Satoshi Matsuoka. 2016. Word embeddings, analogies, and machine learning: Beyond king - man + woman = queen. In Proceedings of COLING 2016, the 26th

47 International Conference on Computational Linguistics: Technical Papers, pages 3519–3530, Osaka, Japan, December 11-17. Allyson Ettinger, Naomi H. Feldman, Philip Resnik, and Colin Phillips. 2016. Modeling N400 amplitude using vector space models of word representation. In Proceedings of the 38th Annual Conference of the Cognitive Science Society, pages 1445–1450. Allyson Ettinger and Tal Linzen. 2016. Evaluating vector space models using human semantic priming results. In Proceedings of The 1st Workshop on Evaluating Vector Space Representations for NLP, pages 72–77, Berlin, Germany. Association for Computational Linguistics. Christiane Fellbaum, editor. 1998. Wordnet: An Electronic Lexical Database. Language, speech, and communication series. MIT Press Cambridge. Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. In ACM Transactions on Information Systems, volume 20(1), pages 116–131. ACM.

47 Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: What works and what doesn’t. In Proceedings of the NAACL-HLT SRW, pages 47–54, San Diego, California, June 12-17, 2016. ACL. Felix Hill, Roi Reichart, and Anna Korhonen. 2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695. Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014. A Neural Network for Factoid Question Answering over Paragraphs. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 633–644, Doha, Qatar. Association for Computational Linguistics. Michael N. Jones, Walter Kintsch, and Douglas J.K. Mewhort. 2006. High-dimensional accounts of priming. Journal of Memory and Language, 55(4):534–552. Sigrid Klerke, Héctor Martínez Alonso, and Anders Søgaard. 2015.

47 Looking hard: Eye tracking for detecting grammaticality of automatically compressed sentences. Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 97–105. Gabriella Lapesa and Stefan Evert. 2013. Evaluating neighbor rank and distance measures as predictors of semantic priming. In Proceedings of the ACL Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2013), pages 66–74. Rémi Lebret and Ronan Collobert. 2014. Word emdeddings through Hellinger PCA. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 482–490, Gothenburg, Sweden, April 26-30 2014. Association for Computational Linguistics. Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. CoNLL-2014, pages 171–180. Bofang Li, Tao Liu, Zhe Zhao, Buzhou Tang, Aleksandr Drozd, Anna Rogers, and Xiaoyong Du. 2017. Investigating different syntactic

47 context types and context representations for learning word embeddings. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2411–2421. Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2):203–208. Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. CoNLL-2013, 104. Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In

47 Proceedings of International Conference on Learning Representations (ICLR).

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic Regularities in Continuous Space Word Representations. In HLT-NAACL, pages 746–751.

Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 216–225. Association for Computational Linguistics.

Neha Nayak, Gabor Angeli, and Christopher D. Manning. 2016. Evaluating Word Embeddings Using a Representative Suite of Practical Tasks. In Proceedings of the 1st Workshop on Evaluating Vector Space Representations for NLP, pages 19–23, Berlin, Germany, August 12, 2016. Association for Computational Linguistics.

Douglas L. Nelson, Cathy L. McEvoy, and Thomas A. Schreiber. 2004. The University of South Florida free association, rhyme, and word

47 fragment norms. Behavior Research Methods, Instruments, & Computers, 36(3):402–407. Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the ACL. Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the ACL. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 12, pages 1532–1543. Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. 2011. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pages 337–346, New York, NY, USA. ACM.

47 Anna Rogers, Aleksandr Drozd, and Bofang Li. 2017. The (Too Many) Problems of Analogical Reasoning with Word Vectors. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (* SEM 2017), pages 135–148. Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, Portugal, pages 298–307. Association for Computational Linguistics. Hinrich Schütze. 1993. Word Space. In Advances in Neural Information Processing Systems 5, pages 895–902. Morgan Kaufmann. Anders Søgaard. 2016. Evaluating word embeddings with fMRI and eye-tracking. In Proceedings of the 1st Workshop on Evaluating Vector Space Representations for NL, pages 116–121, Berlin, Germany, August 12, 2016. Association for Computational Linguistics. Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity

47 recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003-Volume 4, pages 142–147. Association for Computational Linguistics. Yulia Tsvetkov, Manaal Faruqui, and Chris Dyer. 2016. Correlation-based Intrinsic Evaluation of Word Vector Representations. Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 111–115. Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. 2015. Evaluation of word vector representations by subspace alignment. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2049–2054, Lisbon, Portugal, 17-21 September 2015. Association for Computational Linguistics. Cyma Van Petten. 2014. Examining the N400 semantic context effect item-by-item: Relationship to corpus-based measures of word co-occurrence. International Journal of Psychophysiology, 94(3):407–419.

47 Laura Wendlandt, Jonathan K. Kummerfeld, and Rada Mihalcea. 2018. Factors Influencing the Surprising Instability of Word Embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2092–2102, New Orleans, Louisiana. Association for Computational Linguistics. Michael Wilson, Georg Kiss, and Christine Armstrong. EAT : The Edinburgh associative corpus [Electronic resource].
