PRELIMINARY VERSION: DO NOT CITE
The AAAI Digital Library will contain the published version some time after the conference.

Circles are like Ellipses, or Ellipses are like Circles? Measuring the Degree of Asymmetry of Static and Contextual Word Embeddings and the Implications to Representation Learning

Wei Zhang¹, Murray Campbell¹, Yang Yu²*, Sadhana Kumaravel¹
¹IBM Research AI; ²Google
[email protected], [email protected], [email protected], [email protected]

*Work done while with IBM.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Human judgments of word similarity have been a popular method of evaluating the quality of word embeddings, but they fail to measure geometric properties such as asymmetry. For example, it is more natural to say "Ellipses are like Circles" than "Circles are like Ellipses". Such asymmetry has been observed in a psychoanalysis test called the word evocation experiment, in which one word is used to recall another. Although useful, such experimental data have been significantly understudied as a means of measuring embedding quality. In this paper, we use three well-known evocation datasets to gain insights into how embeddings encode asymmetry. We study both static embeddings and contextual embeddings such as BERT. Evaluating asymmetry for BERT is generally hard due to the dynamic nature of the embedding. Thus, we probe BERT's conditional probabilities (as a language model) using a large number of Wikipedia contexts to derive a theoretically justifiable Bayesian asymmetry score. The results show that contextual embedding behaves more randomly than static embedding on similarity judgments while performing well on asymmetry judgments, which aligns with its strong performance on "extrinsic evaluations" such as text classification. The asymmetry judgment and the Bayesian approach provide a new perspective for evaluating contextual embeddings on intrinsic evaluation, and their comparison to similarity evaluation concludes our work with a discussion of the current state and future of representation learning.

1 Introduction

Popular static word representations such as word2vec (Mikolov et al. 2013) lie in Euclidean space and are evaluated against symmetric judgments. Such a measure does not expose the geometry of word relations, e.g., asymmetry. For example, "ellipses are like circles" is much more natural than "circles are like ellipses". An acceptable representation may exhibit such a property.

Tversky (1977) proposed a similarity measure that encodes asymmetry. It assumes each word is a feature set, and asymmetry manifests when the common features of two words take different proportions in their respective feature sets, i.e., a difference between the likelihoods P(a|b) and P(b|a) for a word pair (a, b). In this regard, the degree of correlation between the asymmetry obtained from humans and that obtained from a word embedding may indicate how well the embedding encodes features.

The word evocation experiment, devised by neurologist Sigmund Freud around the 1910s, was designed to obtain such directional word relationships: a word called the cue is shown to a participant, who is asked to freely "evoke" another word called the target.¹ The experiment is usually conducted with many participants over many cue words, and the data produced by the group exhibit a collective notion of word relatedness. P(a|b) and P(b|a) can be estimated from such data to obtain an asymmetry ratio (Griffiths, Steyvers, and Tenenbaum 2007) that resonates with the theories of Tversky (1977) and Resnik (1995). Large-scale evocation datasets have been created to study the psychological aspects of language. We are interested in three of them: the Edinburgh Association Thesaurus (Kiss et al. 1973), the Florida Association Norms (Nelson, McEvoy, and Schreiber 2004), and Small World of Words (De Deyne et al. 2019). These three datasets have thousands of cue words each and are all publicly available. We use them to derive human asymmetry judgments and see how well embedding-derived asymmetry measures align with this data.

¹A clip from A Dangerous Method describing Sigmund Freud's experiment: https://www.youtube.com/watch?v=lblzHkoNn3Q
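Concretely, the conditional probabilities behind the asymmetry ratio can be read off association counts. The sketch below is illustrative only: the (cue, target, count) triple format and the toy numbers are assumptions, since each of the three datasets ships in its own format with its own preprocessing.

```python
from collections import defaultdict

def conditional_probs(triples):
    """triples: iterable of (cue, target, count) rows from an association norm."""
    totals = defaultdict(int)     # total responses collected for each cue
    counts = defaultdict(int)     # responses for each (cue, target) pair
    for cue, target, n in triples:
        totals[cue] += n
        counts[(cue, target)] += n
    # P(target | cue) = count(cue -> target) / total responses to cue
    return {(c, t): n / totals[c] for (c, t), n in counts.items()}

# Toy counts (made up): participants evoke "circle" from "ellipse"
# far more often than the reverse.
triples = [("circle", "ellipse", 2), ("circle", "round", 40),
           ("ellipse", "circle", 12), ("ellipse", "oval", 20)]
P = conditional_probs(triples)

# Asymmetry ratio of Griffiths, Steyvers, and Tenenbaum (2007):
print(P[("ellipse", "circle")] / P[("circle", "ellipse")])  # approx. 7.9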
Evocation data has rarely been explored in the computational linguistics community, except that Griffiths, Steyvers, and Tenenbaum (2007) derived from the Florida Association Norms an asymmetry ratio for a pair of words to measure the directionality of word relations in topic models, and Nematzadeh, Meylan, and Griffiths (2017) used it for word embeddings. In this paper, we conduct a larger-scale study using three datasets, on both static embeddings (word2vec (Mikolov et al. 2013), GloVe (Pennington, Socher, and Manning 2014), and fasttext (Mikolov et al. 2018)) and contextual embeddings such as BERT (Devlin et al. 2018). We hope the study helps us better understand the geometry of word representations and inspires improvements to text representation learning.

To obtain P(a|b) for static embeddings, we leverage vector space geometry with projection and softmax, similar to (Nematzadeh, Meylan, and Griffiths 2017; Levy and Goldberg 2014; Arora et al. 2016); a minimal sketch follows the contribution list below. For contextual embeddings such as BERT we cannot use this method because the embedding varies by context. Thus, we use a Bayesian method to estimate the word conditional distribution from thousands of contexts using BERT as a language model. In so doing, we can probe word relatedness in the dynamic embedding space in a principled way.

Comparing an asymmetry measure to the popular cosine measure, we observe that similarity judgment fails to correctly measure BERT's lexical-semantic space, while asymmetry judgment shows an intuitive correlation with human data. In the final part of this paper, we briefly discuss this result and what it means for representation learning. This paper makes the following contributions:

1. An analysis of embedding asymmetry with evocation datasets, and an asymmetry dataset to facilitate research;

2. An unbiased Bayesian estimation of word pair relatedness for contextual embeddings, and a justifiable comparison of static and contextual embeddings on lexical semantics using asymmetry.
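As a way to make the static-embedding estimate concrete, here is a minimal sketch of a softmax conditional over dot products, in the spirit of Levy and Goldberg (2014). The paper's exact projection-based formulation appears in Section 4, so treat this as an assumed, simplified variant; the random matrix stands in for trained word2vec/GloVe vectors.

```python
import numpy as np

def softmax_conditional(E, vocab, a, b):
    """P(a | b): softmax over dot products of every vocabulary vector with b."""
    logits = E @ E[vocab[b]]        # similarity of each word to b
    logits -= logits.max()          # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[vocab[a]]

rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate(["circle", "ellipse", "round", "oval"])}
E = rng.normal(size=(len(vocab), 50))   # stand-in for trained static vectors
print(softmax_conditional(E, vocab, "circle", "ellipse"))
```

Note that this conditional is not symmetric in general, which is what makes P(a|b) and P(b|a) usable for an asymmetry measure in the first place.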
2 Related Work

Word embeddings can be evaluated by either the symmetric "intrinsic" evaluations, such as word pair similarity/relatedness (Agirre et al. 2009; Hill, Reichart, and Korhonen 2015) and analogy (Mikolov et al. 2013), or the "extrinsic" one of observing performance on tasks such as text classification (Joulin et al. 2016), machine reading comprehension (Rajpurkar et al. 2016), or language understanding benchmarks (Wang et al. 2018, 2019).

On the utterance level, probing of contextual embeddings has been conducted mainly on BERT (Devlin et al. 2018), suggesting its strength in encoding syntactic information rather than semantics (Hewitt and Manning 2019; Reif et al. 2019; Tenney et al. 2019; Tenney, Das, and Pavlick 2019; Mickus et al. 2019), which is counter-intuitive given contextual representations' superior performance on extrinsic evaluation.

On the lexical level, it is yet unknown whether the previous observation also holds. Moreover, lexical-semantic evaluation is non-trivial for contextual embeddings due to their dynamic nature. Nevertheless, some recent approaches still try to either extract a static embedding from BERT using PCA (Ethayarajh 2019; Coenen et al. 2019) or use context embeddings as is (Mickus et al. 2019) for lexical-semantic evaluation, but a principled approach is still lacking. Given the current status of research, it is natural to ask the following research questions:

RQ 1. Which evocation dataset is the best candidate to obtain asymmetry ground truth? And RQ 1.1: Does the asymmetry score from the data align with intuition? RQ 1.2: Are the evocation datasets correlated in the first place?

RQ 2. How do static and contextual embeddings compare on asymmetry judgment? Furthermore, how do different contextual embeddings compare? Do larger models perform better?

RQ 3. What are the main factors in estimating P(b|a) with contextual embeddings, and what are their effects?

RQ 4. Do asymmetry judgment and similarity judgment agree on embedding quality? What does that imply?

We first establish the asymmetry measure in Section 3 and the methods to estimate it from an evocation dataset or an embedding in Section 4, followed by empirical results answering these questions in Section 5.

3 The Asymmetry Measure

Log Asymmetry Ratio (LAR) of a word pair. To measure the asymmetry of a word pair (a; b) in some set of word pairs S, we define two conditional likelihoods, P_E(b|a) and P_E(a|b), which can be obtained from a resource E, either an evocation dataset or an embedding. Under Tversky's (1977) assumption, if b relates to more concepts than a does, P(b|a) will be greater than P(a|b), resulting in asymmetry. The ratio P_E(b|a)/P_E(a|b) (Griffiths, Steyvers, and Tenenbaum 2007) can be used to quantify such asymmetry. We further take the logarithm to obtain a log asymmetry ratio (LAR) so that the degree of asymmetry naturally aligns with the sign of LAR (a value close to 0 suggests symmetry, with negative/positive values indicating the direction of asymmetry). Formally, the LAR of (a; b) from resource E is

    LAR_E(a; b) = log P_E(b|a) − log P_E(a|b)    (1)

LAR is key to all the following metrics; a minimal sketch of LAR and ALAR appears at the end of this section.

Aggregated LAR (ALAR) of a pair set. We care about the aggregated LAR on a word pair set S, the expectation

    ALAR_E(S) = E_{(a;b) ∈ S}[LAR_E(a; b)],

to quantify the overall asymmetry on S for E. However, it is not sensible to evaluate ALAR_E on an arbitrary set S, for two reasons: 1) because LAR(a; b) = −LAR(b; a), randomly ordered word pairs will produce random ALAR values; 2) pairs of different rela-
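Assuming the conditional probabilities have already been estimated from a resource (the values below are made up for illustration), Equation (1) and the ALAR expectation reduce to a few lines:

```python
import math

# Conditional probabilities keyed as (cue, target) -> P(target | cue);
# the values are illustrative, not from any of the three datasets.
P = {("circle", "ellipse"): 0.05, ("ellipse", "circle"): 0.38}

def lar(P, a, b):
    """LAR_E(a; b) = log P_E(b | a) - log P_E(a | b), i.e., Equation (1)."""
    return math.log(P[(a, b)]) - math.log(P[(b, a)])

def alar(P, pairs):
    """ALAR_E(S): mean LAR over an ordered pair set S. Because
    LAR(a; b) = -LAR(b; a), the ordering within each pair must be
    fixed for the aggregate to be meaningful (reason 1 in the text)."""
    return sum(lar(P, a, b) for a, b in pairs) / len(pairs)

print(lar(P, "circle", "ellipse"))    # < 0: P(ellipse|circle) < P(circle|ellipse)
print(alar(P, [("circle", "ellipse")]))
```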