
Speakers Fill Lexical Semantic Gaps with Context

Tiago Pimentel (University of Cambridge), Rowan Hall Maudslay (University of Cambridge), Damián Blasi (Harvard University, MPI SHH, HSE University), Ryan Cotterell (University of Cambridge, ETH Zürich)
[email protected], [email protected], [email protected], [email protected]

Abstract

Lexical ambiguity is widespread in language, allowing for the reuse of economical word forms and therefore making language more efficient. If ambiguous words cannot be disambiguated from context, however, this gain in efficiency might make language less clear—resulting in frequent miscommunication. For a language to be clear and efficiently encoded, we posit that the lexical ambiguity of a word type should correlate with how much information context provides about it, on average. To investigate whether this is the case, we operationalise the lexical ambiguity of a word as the entropy of meanings it can take, and provide two ways to estimate this—one which requires human annotation (using WordNet), and one which does not (using BERT), making it readily applicable to a large number of languages. We validate these measures by showing that, on six high-resource languages, there are significant Pearson correlations between our BERT-based estimate of ambiguity and the number of synonyms a word has in WordNet (e.g. ρ = 0.40 in English). We then test our main hypothesis—that a word's lexical ambiguity should negatively correlate with its contextual uncertainty—and find significant correlations on all 18 typologically diverse languages we analyse. This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.

[Figure 1: The relationship between contextual uncertainty—how uncertain a word is given its context—and lexical ambiguity, across a diverse set of languages. Axes: Contextual Uncertainty (bits) against Lexical Ambiguity (bits); one regression line per language (ISO 639-1 codes af, ar, bn, en, et, fa, fi, he, id, is, kn, ml, mr, pt, tl, tr, tt, yo).]

1 Introduction

Linguistic structure and meaning are often underdetermined in the linguistic signal. In an extreme case this can lead to ambiguity: sentences might allow more than one valid syntactic structure, and pronouns could corefer to various antecedents. Complementarily, linguistic signals can also overdetermine some aspect of the intended message—for instance, agreement patterns may require redundant marking, and word forms might occupy sparsely populated parts of the phonological space (Harley and Bown, 1998).

In a tradition that goes back at least to Zipf, it has been hypothesised that individuals maintain an efficient balance between over- and under-specifying an intended message. Such a balance is mediated by conflicting pressures for both clarity (the quality that allows the reconstruction of the intended message) and economy of expression (which allows for inexpensive and rapid encoding of the message in a linguistic signal).

A recent instantiation of this idea is that, in an efficient language, one expects economical words (which are short or phonotactically simple) to be associated with multiple unrelated meanings, so they can be more widely used (Piantadosi et al., 2012). At first blush, this may appear to sacrifice clarity, increasing ambiguity and making it more difficult for a listener to resolve the linguistic signal. The emerging picture from psycholinguistics and cognitive science, however, is that individuals can fill in these ambiguous gaps by tapping into additional linguistic or extra-linguistic cues (Tanenhaus et al., 1995; Federmeier and Kutas, 1999; Dautriche et al., 2018). An obvious example is given by the role of contextual information in reducing the ambiguity associated with the meaning of a word form. For instance, the contexts which surround the word ruler in the sentences 'Alice borrowed a ruler from her friends at school' and 'Bob rose to power and became a ruthless ruler' each play a crucial role in disambiguating its intended underlying meaning.

To remain robust in the presence of noise, we may expect the linguistic signal to be on average somewhat overdetermined by the speaker, leading to redundancy in how words and their contexts determine the intended meaning.¹ By analysing this redundancy information-theoretically, under the assumption that languages strike a balance between economy of expression and clarity, we derive that the 'amount' of lexical ambiguity in a given word type should negatively correlate with how uncertain the word is, on average, given its context (see §4).

¹ We refer to overdetermination in relation to redundancies in the signal itself, rather than a precise intended meaning.

As communication unfolds, the efficiency of a particular word can only be modestly modified (e.g. by choosing clipped forms when available; Mahowald et al., 2013). However, contexts can be enriched or demoted dynamically, so as to complement a word with the evidence needed for disambiguation.

To investigate whether it is the case that the contexts in which a word appears are systematically adapted to enable disambiguation, we first provide an operationalisation of lexical ambiguity, grounded in information theory. We then provide two methods for estimating it, one using WordNet (Miller, 1995), and the other using multilingual BERT's contextualised embeddings (Devlin et al., 2019), which allows us to explore a large set of languages. We validate our lexical ambiguity measurements by comparing one to the other in six high-resource languages from four language families (Afro-Asiatic: Arabic; Austronesian: Indonesian; Indo-European: English, Persian and Portuguese; Uralic: Finnish), and find significant correlations between the number of synsets in WordNet and our BERT estimate (e.g. ρ = 0.40 in English), indicating that our annotation-free method for measuring lexical ambiguity is useful.

We then test our main hypothesis—that the contextual uncertainty about a word should negatively correlate with its degree of lexical ambiguity. First, we test this on the same set of six high-resource languages for which we have WordNet annotation, and find significant negative correlations on five of them. We then extend our evaluation, using our BERT-based measure, to cover a much more representative set of 18 typologically diverse languages: Afrikaans, Arabic, Bengali, English, Estonian, Finnish, Hebrew, Indonesian, Icelandic, Kannada, Malayalam, Marathi, Persian, Portuguese, Tagalog, Turkish, Tatar, and Yoruba.² In this set, we find significant negative correlations for all languages (see Figure 1).

² We refer to these using ISO 639-1 codes.

2 Ambiguity in Language

While the pervasiveness of ambiguity in language encumbers the algorithmic processing of natural language (Church and Patil, 1982; Manning and Schütze, 1999), people seamlessly overcome ambiguity through both linguistic and non-linguistic means. World knowledge, pragmatic inferences, and expectations about coherence all contribute to rapidly extracting the intended message out of potentially ambiguous signals (Wasow, 2015). While sometimes ambiguity might indeed result in an observed processing burden (Frazier, 1985), which could lead communication astray, individuals can in response retrace and reanalyse their inferences (as has been famously shown with garden-path sentences like "The horse raced past the barn fell"; Bever, 1970).

This outstanding capacity to navigate ambiguous linguistic signals calls for a reexamination of the presence of ambiguity found in language. If the linguistic signal were deterministically and uniquely decodable—as, for instance, in the universal language proposed by Wilkins (Borges, 1964)—then all of the para-linguistic evidence would be redundant, and the code underlying the signal would be substantially more cumbersome. On the other hand, if linguistic signals presented individuals with too many compatible inferences, communication would break down. An extreme case is represented by Louis Victor Leborgne, an aphasia patient described by Paul Broca (Mohammed et al., 2018). Louis, in spite of immaculate comprehension and mental functions, was unable to utter anything other than the syllable "tan" in his attempts to communicate.

The most influential explanation offered for why natural languages are seemingly far from both extremes derives from the seminal work of Zipf (1949). In that work, Zipf proposed that several aspects of human cognition and behaviour could be derived from the principle of least effort. Languages should aim to minimise the complexity and cost of linguistic signals as much as possible, under the sole constraint that the signal can be decoded efficiently.

2.1 Lexical Ambiguity

We are concerned exclusively with lexical ambiguity. A classic example is the English word bank, which can refer to either an establishment where money is kept, or the patch of land alongside a river. A significant source of lexical ambiguity is word types which exhibit multiple senses, which are said to be polysemous or homonymous.³ Dautriche (2015) estimates that about 4% of word forms are homophones: such ambiguity "is the rule rather than the exception" (Cruse, 1986).

³ We make no distinction between polysemy, homonymy, and other sources of lexical ambiguity a word may exhibit.

Lexical ambiguity is, in general, a fuzzy concept. Not only can it be unclear what it means for two senses to be distinct, but different linguistic annotators will also have different opinions on what constitutes a distinct sense versus a productive use of metaphor. Often the 2nd or 3rd definitions of a word in a dictionary blur this line (Lakoff and Johnson, 1980)—in WordNet (Miller, 1995), for instance, the third sense of attack (intense adverse criticism, e.g. "the government has come under attack") could be viewed as a metaphorical usage of the first (a military offensive against an enemy, e.g. "the attack began at dawn"), projected from one domain to another. Indeed, this fuzziness has led some researchers to prefer unsupervised word sense induction methods, as they obviate the potentially problematic annotation altogether (e.g. Panchenko et al., 2017). Such unsupervised methods are not without problems, though, with one example being their overreliance on topical words (Amrami and Goldberg, 2019). These difficulties motivate us to opt for using two distinct representations of a word's lexical ambiguity: one hand-annotated and discrete, the other unsupervised and continuous.

2.2 Accounts of Lexical Ambiguity

When investigating the relationship between ambiguity and word frequency, Zipf argued that ambiguity results as a trade-off from opposing forces between speaker and listener, together optimising the communication channel via a principle of least effort: the listener wants to easily disambiguate, and the speaker wants to choose words which require little effort to utter, and to avoid excessively searching their lexicon.

Building on Zipf's (1949) theories, Piantadosi et al. (2012) posit that, when viewed information-theoretically, ambiguity is in fact a requirement for a communication system to be efficient. Focusing on economy of expression, Piantadosi et al. suggest that lexical ambiguity serves a purpose when the context allows for disambiguation—it allows the re-use of simpler word forms.⁴ They support their hypothesis by demonstrating a correlation between the number of senses for a word listed in WordNet (Miller, 1995) and a number of measures of speaker effort—phonotactic well-formedness, word length, and the word's log unigram probability (based on a maximum-likelihood estimate from a large corpus).

⁴ Recent work, though, has shed some doubt on the interpretation behind these results, showing they might arise solely due to a language's phonotactic distribution (Trott and Bergen, 2020; Caplan et al., 2020).

More recently, Dautriche et al. (2018) showed that languages' homophones are more likely to appear across distinct syntactic and semantic categories, and will therefore be naturally easier to disambiguate. In this work, we show that speakers compensate for lexical ambiguity by making contexts themselves more informative in its presence.

We note an important detail in one of Piantadosi et al.'s experiments. In their work, they employ unigram surprisal (i.e. -\log p_{\text{unigram}}(\cdot), where p_{\text{unigram}}(\cdot) is the unigram distribution) as a proxy for ease of production, correlating this with polysemy. They justify this approximation based on the fact that more frequent words are, in general, processed more quickly (Reder et al., 1974). However, this measure has a confounder with our hypothesis: a word's frequency correlates with its contextual uncertainty. We believe our proposed measure to be more directly connected with lexical ambiguity.

3 Ambiguity and Uncertainty

We formulate both lexical ambiguity and contextual uncertainty information-theoretically. Let M be a space of all lexical meaning representations, W be the space of all words, and C be the space of all contexts. We denote the M-, W-, and C-valued random variables as M, W and C, respectively, and name elements of those sets m, w and c.

We take M to be an either discrete or continuous meaning space, W to be the set of words in a language (excluding the beginning-of-sequence and end-of-sequence symbols, BOS and EOS), and

    \mathcal{C} = \{ \langle \text{BOS} \circ p,\; s \circ \text{EOS} \rangle \mid p \circ w \circ s \in \mathcal{W}^* \}    (1)

where \circ denotes string concatenation, and p and s are the prefix and suffix context strings, respectively. This set contains every possible context that could surround a word, padded with beginning-of-sequence and end-of-sequence symbols. We additionally define \tilde{p} = \text{BOS} \circ p and \tilde{s} = s \circ \text{EOS}.

3.1 Lexical Ambiguity

We start with a formalisation of lexical ambiguity. Specifically, we formalise the lexical ambiguity of an entire language as

    H(M \mid W) = -\sum_{w \in \mathcal{W}} p(w) \int p(m \mid w) \log_2 p(m \mid w) \, \mathrm{d}m    (2)

Interpreting entropy as uncertainty, this definition implies that the harder it is to predict the meaning of a word from its form alone, the more lexically ambiguous that word must be.

We will generally be interested in the half-pointwise entropy, rather than the entropy itself. In the case of lexical ambiguity, we consider the following half-pointwise entropy

    H(M \mid W = w) = -\int p(m \mid w) \log_2 p(m \mid w) \, \mathrm{d}m    (3)

This half-pointwise entropy tells us how difficult it is to predict the meaning when you know the specific word, without considering its context. We will not generally have access to the true distribution p(m | w), so we will need to approximate this entropy; this is discussed in §5.1. A unique feature of this operationalisation of lexical ambiguity is that it is language independent.⁵ However, the quality of a possible approximation will vary from language to language, depending on the models and the data available in that language.

⁵ We acknowledge the abuse of this bigram in the NLP literature (Bender, 2009), and use it in the following specific sense: the operationalisation may be applied to any language independent of its typological profile.

A final note is that the mutual information between M and W, as a function of w, is equivalent, up to an additive constant, to the conditional entropy:

    I(M; W = w) = H(M) - H(M \mid W = w)    (4)

where H(M) is constant with respect to w. This equation asserts something rather trivial: that lexical ambiguity is inversely correlated with how informative a word is about its meaning.

3.2 Contextual Uncertainty

The predictability of a word in context is also naturally operationalised information-theoretically. We take the contextual uncertainty, once again defined for an entire language, as

    H(W \mid C) = -\sum_{w \in \mathcal{W}} p(w) \sum_{c \in \mathcal{C}} p(c \mid w) \log_2 p(w \mid c)    (5)

Again, we are mostly interested in the half-pointwise entropy, which tells us how predictable a given word is, averaged over all contexts:

    H(W = w \mid C) = -\sum_{c \in \mathcal{C}} p(c \mid w) \log_2 p(w \mid c)    (6)

We take this as our operationalisation of contextual uncertainty. We note that this definition is different to typical uses of surprisal in computational psycholinguistics (Hale, 2001; Levy, 2008; Seyfarth, 2014; Piantadosi et al., 2011; Pimentel et al., 2020). Most work in this vein attempts to maintain cognitive plausibility, usually calculating surprisal based on only the unidirectional left piece of the context, as -\log p(w \mid c_{\leftarrow}).

Although surprisal is the operationalisation we are interested in here, we note that a word may have low surprisal if it is frequent across many contexts, and not just in a specific one under consideration. Sticking with our notion of half-pointwiseness, we define contextual informativeness as

    I(W = w; C) = H(W = w) - H(W = w \mid C)    (7)

where we define a word's pointwise entropy (also known as surprisal) as

    H(W = w) = -\log_2 p(w)    (8)

Eq. (7) again asserts something trivial: low contextual uncertainty implies an informative context. This informativeness itself is upper-bounded by the word's absolute negative log-probability (i.e. the unigram surprisal). The mutual information between a word and its context was studied before by Bicknell and Levy (2011), Futrell and Levy (2017) and Futrell et al. (2020)—although only using the unidirectional left piece of the context.
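As a concrete illustration, the following is a minimal sketch of how the half-pointwise entropy in eq. (6) could be estimated from samples (the empirical estimator we actually use appears later, as eq. (15)). The function prob_w_given_c is a stand-in for any model of p(w | c) and is an assumption of this sketch:

```python
import math

def half_pointwise_entropy(word, contexts, prob_w_given_c):
    """Monte Carlo estimate of H(W = w | C), eq. (6): average the surprisal
    -log2 p(w | c) over contexts sampled from p(c | w), i.e. over attested
    occurrences of `word` in a corpus."""
    surprisals = [-math.log2(prob_w_given_c(word, c)) for c in contexts]
    return sum(surprisals) / len(surprisals)

# Hypothetical usage, e.g. with occurrences of 'ruler' and a cloze model:
# h = half_pointwise_entropy("ruler", ruler_contexts, cloze_model)
```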

4 Hypothesis: Why Should Ambiguity Correlate with Uncertainty?

As discussed in §1, we expect the linguistic signal to be on average somewhat overdetermined or redundant—such redundancy leads to robustness in noisy situations, when part of the signal may be lost during its implementation. A natural measure of robustness is the three-way mutual information between the context of a word, the word itself, and the meaning—I(M; C; W)—which represents how much information about the meaning is redundantly encoded in both the context and the word. The half-pointwise tripartite mutual information can be decomposed as

    I(M; C; W = w) = I(M; W = w) - I(M; W = w \mid C)
                   = I(M; W = w) - H(W = w \mid C) + H(W = w \mid M, C)
                   \approx \underbrace{I(M; W = w)}_{\text{term 1}} - \underbrace{H(W = w \mid C)}_{\text{term 2}}    (9)

In this equation, we assume there are no true synonyms under a specific context—i.e. given a meaning and a context, there is no uncertainty about the word choice: H(W = w | M, C) ≈ 0, which is why the last term is dropped. Term 1 is the information a word shares with its meaning (which is inversely correlated with lexical ambiguity; see eq. (4)) and term 2 is the predictability of a word in context, or the contextual uncertainty (which is itself inversely correlated with contextual informativeness; see eq. (7)).

For a language to be efficient, it may reuse its optimal word forms (as defined by their utterance effort), increasing lexical ambiguity (Piantadosi et al., 2012) and reducing the amount of information a word contains about its meaning (term 1). This reduces redundancy though, increasing the chance of miscommunication in the presence of noise. Speakers can compensate for this by making contexts more informative for these words (making term 2 smaller). A negative correlation between contextual uncertainty and lexical ambiguity then arises from the trade-off between clarity and economy.

5 Computation and Approximation

Our information-theoretic operationalisation requires approximation. First, we do not know the true distributions over words, their meanings and their contexts. Second, even if we did, eq. (3) and eq. (6) would likely be hard to compute.

5.1 Lexical Ambiguity

In this section, we provide two approximations for lexical ambiguity. One assumes discrete word senses and requires data annotation (WordNet), while the other considers continuous meaning spaces (BERT) and allows us to extend our analysis to languages with fewer of these resources.

Discrete senses  WordNet (Miller, 1995) is a valuable resource available in high-resource languages, which provides a list of synsets for word types. By taking these synsets to be the possible meanings of a word, and assuming a uniform distribution over them, we approximate the entropy as

    H(M \mid W = w) \approx \log_2(\#\text{senses}[w])    (10)
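A minimal sketch of this estimate, using NLTK's WordNet interface (an implementation choice made for illustration; the exact tooling is not specified above):

```python
import math
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet data

def wordnet_ambiguity(word, lang="eng"):
    """Approximate H(M | W = w) as in eq. (10): log2 of the synset count,
    i.e. a uniform distribution over the word's listed senses."""
    n_senses = len(wn.synsets(word, lang=lang))
    return math.log2(n_senses) if n_senses > 0 else None  # unattested word

# 'bank' has many synsets in the English WordNet, so it scores several
# bits of ambiguity; a monosemous word scores exactly 0 bits.
print(wordnet_ambiguity("bank"))
```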

We exploit the fact that a Gaussian distribution N(µ, Σ) has an entropy greater than or equal to that of any other distribution with the same finite and known (co)variance (Cover and Thomas, 2012, Chapter 8):⁸

    H(M \mid W = w) \leq H(\mathcal{N}(\mu_w, \Sigma_w)) = \tfrac{1}{2} \log_2 \det(2 \pi e \Sigma_w)    (13)

⁸ We note that, unlike its discrete counterpart, differential entropy values can be negative.

We estimate this covariance based on a corpus of N word–context pairs \{\langle w, c_i \rangle\}_{i=1}^{N}, which we assume to be sampled according to the true distribution p (our corpora come from Wikipedia dumps and are described in §6).⁹

⁹ We explain how to approximate the covariance matrix Σ_w per word type in App. A.

The tightness of this upper bound on the entropy depends on both the accuracy of the covariance matrix estimation and the nature of the true distribution p(m | w). If p(m | w) is concentrated in a small region of the meaning space (corresponding to a word with nuanced implementations of the same sense), the bound in eq. (13) could be relatively tight. In contrast, a word with several unrelated homophones would correspond to a highly structured p(m | w) (e.g. one with multiple modes in far distant regions of the space), for which this normal approximation would result in a very loose upper bound.

5.2 Contextual Uncertainty

How uncertain the context is about a specific word is formalised in the half-pointwise entropy presented in eq. (6). We may get an upper bound on this entropy from its cross-entropy:

    H(W = w \mid C) \leq H_{q_\theta}(W = w \mid C) = -\sum_{c \in \mathcal{C}} p(c \mid w) \log_2 q_\theta(w \mid c)    (14)

where q_θ is a cloze language model that we train to approximate p (as we explain later in this section). This equation, though, still requires an infinite sum over C. We avoid that by using an empirical estimate of the cross-entropy:

    H_{q_\theta}(W = w \mid C) \approx -\frac{1}{N_w} \sum_{i=1}^{N_w} \log_2 q_\theta(w_i \mid c_i)    (15)

where N_w is the number of samples we have for a specific word type w.

To choose an appropriate distribution q_θ(w | c), we train a model on a masked language modelling task. Defining MASK as a special type in the vocabulary V, we take a masked hidden state as

    h_c = \mathrm{BERT}(\tilde{p} \circ \text{MASK} \circ \tilde{s})    (16)

We then use this masked hidden state to estimate the distribution

    q_\theta(w \mid c) = \mathrm{softmax}\big(W^{(2)} \sigma(W^{(1)} h_c)\big)    (17)

where the W^{(\cdot)} are linear transformations, and bias terms are omitted for brevity. We fix BERT's parameters and train this model with Adam (Kingma and Ba, 2015), using its default learning rate in PyTorch (Paszke et al., 2019). We use a ReLU as our non-linear function σ and 200 as our hidden size, training for only one epoch. By minimising the cross-entropy loss we achieve an estimate for p.

We do not use BERT directly as our model q_θ because its multilingual version was trained on multiple languages, and thus was not optimised on each individually; we found this resulted in poor approximations on the lowest-resource languages. Furthermore, we note that BERT gives probability estimates for word pieces (as opposed to the words themselves), and combining these piece-level probabilities into word-level ones is non-trivial. Indeed, doing so would require running BERT several times per word, increasing the already high computational requirements of this study. To compute the probability of a word composed of two word pieces, for example, we would need to run the model with two masks, i.e. BERT(p̃ ∘ MASK ∘ MASK ∘ s̃), and combine the pieces' probabilities. To correctly estimate the probability distribution over the entire vocabulary (i.e. q_θ(w | c)), we would need to replace each position with an arbitrary number of MASKs and normalise these probability values.
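A minimal PyTorch sketch of this cloze model, written against the HuggingFace transformers API: BERT is frozen and only the two-layer head of eq. (17) is trained. The word-level vocabulary size, the example word id, and the single-sentence 'batch' are placeholders; actual training would loop for one epoch over the held-out sentences described in §6.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = BertModel.from_pretrained("bert-base-multilingual-cased")
for p in bert.parameters():
    p.requires_grad = False  # BERT's parameters stay fixed; only the head trains

n_words, hidden = 50_000, 200  # word-level vocabulary size is a placeholder
head = nn.Sequential(
    nn.Linear(bert.config.hidden_size, hidden),  # W^(1) of eq. (17)
    nn.ReLU(),                                   # the non-linearity sigma
    nn.Linear(hidden, n_words),                  # W^(2); softmax lives in the loss
)
optimizer = torch.optim.Adam(head.parameters())  # PyTorch's default learning rate

def masked_logits(prefix: str, suffix: str) -> torch.Tensor:
    """Compute h_c of eq. (16) by masking the target position, then map the
    [MASK] hidden state to word logits."""
    enc = tokenizer(prefix + " [MASK] " + suffix, return_tensors="pt")
    mask_pos = (enc.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    h_c = bert(**enc).last_hidden_state[0, mask_pos]
    return head(h_c)

# One (toy) training step: cross-entropy against the index of the held-out
# word in the hypothetical word-level vocabulary.
logits = masked_logits("Bob rose to power and became a ruthless", ".")
loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([42]))
loss.backward()
optimizer.step()
```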

6 Data

We used Wikipedia as the main data source for all our experiments. Multilingual BERT¹⁰ was trained on the 104 languages with the largest Wikipedias¹¹—of these, we subsampled a diverse set of 18 for our experiments: Afrikaans, Arabic, Bengali, English, Estonian, Finnish, Hebrew, Indonesian, Icelandic, Kannada, Malayalam, Marathi, Persian, Portuguese, Tagalog, Turkish, Tatar, and Yoruba.

¹⁰ Information about multilingual BERT can be found at: https://github.com/google-research/bert/blob/master/multilingual.md
¹¹ A list of Wikipedias can be found at: https://meta.wikimedia.org/wiki/List_of_Wikipedias

For each of these languages, we first downloaded their entire Wikipedia, which we sentencized and tokenized using language-specific models in spaCy (Honnibal and Montani, 2017)—our definition of a word here is, thus, a token as given by the spaCy tokenizer. We then subsampled 1 million random sentences per language for our analysis, and another 100,000 random sentences to train the model q_θ. We run multilingual BERT on the 1 million analysis sentences to acquire both h_{w,c} and h_c (eq. (11) and eq. (16)) for each word in these corpora—discarding any word for which we do not have at least 100 contexts in which the word occurs. For the purpose of our analysis, we also discarded any word containing characters not in the individual scripts of the analysed language. The final number of word types used in our analysis can be found in Tables 1 and 3.

7 Discussion: WordNet vs. BERT-based Approximations

The novel continuous (BERT-based) approximation of lexical ambiguity has two important virtues over the alternative WordNet-based measure. On the practical side, it can be readily computed for many languages. Since we are using multilingual BERT for our continuous approximation, as discussed in §5, this quantity is easily obtainable for the 104 languages on which it was trained. Second, on more theoretical grounds, the continuous representation of the space of meanings might better capture the gradient that goes from subtle but distinct senses of the same word to completely unrelated homophones (Cruse, 1986, p. 51). Alternatively, the WordNet-based measure of lexical ambiguity is supported by expert human annotation and extensive research on its linguistic and psycholinguistic correlates, e.g. Sigman and Cecchi (2002) and Budanitsky and Hirst (2006).

These differences notwithstanding, we expect both measures to correlate to a certain degree. To evaluate this, we run an experiment comparing both estimates in six languages from four different families for which WordNet is available: Arabic, English, Finnish, Indonesian, Persian, and Portuguese.

Language     # Types   Pearson   Spearman
Arabic          836     0.25**    0.30**
English        6995     0.40**    0.40**
Finnish        1247     0.06*     0.07*
Indonesian     3308     0.12**    0.13**
Persian        2648     0.14**    0.13**
Portuguese     3285     0.13**    0.13**
(** p < 0.01; * p < 0.1)

Table 1: Correlations between a word's lexical ambiguity as estimated with BERT or WordNet.

[Figure 2: Correlating our BERT-based estimate of lexical ambiguity with the number of senses in WordNet. y-axis: Lexical Ambiguity (bits); x-axis: # Senses in WordNet (log scale); languages: ar, en, fa, fi, id, pt.]

Figure 2 and Table 1 show that indeed both measures are positively correlated, although the association may be modest in some languages. The Pearson correlation between our estimates is ρ = 0.40 for English, but only ρ = 0.06 for Finnish—the other languages lie in the range between the two.¹² This correlation seems to increase with the quality of the BERT model for the language under consideration—English has the largest Wikipedia, so multilingual BERT should naturally be better at modelling it, while Finnish has the smallest Wikipedia among these six languages. A complementary explanation is that WordNet itself might be better for English than for other languages—while English's WordNet contains synsets for 147,306 words, Persian's only has them for 17,560. This suggests that the modest associations found should be taken as pessimistic lower bounds.

¹² For all tests of significance in this paper, we apply Benjamini and Hochberg's (1995) correction.
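As a sketch of how the comparisons in Table 1 can be computed—assuming, for each language, two aligned arrays over the same word types, bert_ambiguity (the Gaussian-entropy estimates of §5.1, in bits) and n_senses (WordNet synset counts); the array and function names here are ours:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from statsmodels.stats.multitest import multipletests

def measure_agreement(bert_ambiguity, n_senses):
    """Correlate the BERT-based ambiguity estimate with the WordNet-based
    one (log2 of the synset count, matching the scale of eq. (10))."""
    log_senses = np.log2(n_senses)
    return (pearsonr(bert_ambiguity, log_senses),
            spearmanr(bert_ambiguity, log_senses))

# With one p-value per language and test, the Benjamini-Hochberg correction
# of footnote 12 can be applied as:
# reject, p_adjusted, _, _ = multipletests(p_values, method="fdr_bh")
```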

A potential underlying problem in the above study is that the number of senses a word has in WordNet might rely on word frequency (beyond any true underlying relationship with it)—e.g. annotating senses for frequent words may be easier than for infrequent ones. Furthermore, the number of samples a word has in our corpus will affect its sample density in the embedding space, and thus its estimated BERT entropy. As a second evaluation, we therefore train a multivariate linear regressor predicting our BERT-based measure not only from the log of the number of senses a word has in WordNet, but also from the word's frequency (i.e. its number of occurrences in the corpus). This analysis is presented in Table 2, where we can see that both our estimates of lexical ambiguity still correlate when controlling for frequency. This table also shows that our BERT-based estimate still correlates with the word's frequency when controlling for the number of senses the word has in WordNet. Future work could delve further into what this correlation implies, with the potential to improve our proposed annotation-free estimate of lexical ambiguity. A sketch of this regression is given after Table 3 below.

Language     # Types   WordNet   Frequency
Arabic          836     0.28**    0.30**
English        6995     0.38**    0.21**
Finnish        1247     0.07*     0.35**
Indonesian     3308     0.09**    0.37**
Persian        2648     0.13**    0.14**
Portuguese     3285     0.13**    0.29**
(** p < 0.01; * p < 0.1)

Table 2: Parameters (and their significance) of a multivariate linear regression predicting our BERT-based measure of ambiguity from both our WordNet estimate and the word's frequency. All analysed variables were normalised to have zero mean and unit variance.

Language           # Types   Pearson   Spearman
Lexical ambiguity as WordNet
Arabic (ar)           836    -0.14**   -0.15**
English (en)         6995    -0.07**   -0.11**
Finnish (fi)         1247     0.01     -0.00
Indonesian (id)      3308    -0.09**   -0.14**
Persian (fa)         2648    -0.11**   -0.12**
Portuguese (pt)      3285    -0.10**   -0.11**
Lexical ambiguity as BERT
Afrikaans (af)       4505    -0.41**   -0.52**
Arabic (ar)         10181    -0.33**   -0.41**
Bengali (bn)         8128    -0.43**   -0.44**
English (en)         7097    -0.33**   -0.35**
Estonian (et)        4482    -0.40**   -0.44**
Finnish (fi)         3928    -0.38**   -0.45**
Hebrew (he)         13819    -0.34**   -0.37**
Indonesian (id)      4524    -0.45**   -0.57**
Icelandic (is)       3578    -0.44**   -0.46**
Kannada (kn)         9695    -0.42**   -0.41**
Malayalam (ml)       6203    -0.47**   -0.46**
Marathi (mr)         5821    -0.39**   -0.40**
Persian (fa)         6788    -0.39**   -0.49**
Portuguese (pt)      5685    -0.31**   -0.45**
Tagalog (tl)         3332    -0.45**   -0.50**
Turkish (tr)         4386    -0.40**   -0.46**
Tatar (tt)           2997    -0.34**   -0.39**
Yoruba (yo)           417    -0.55**   -0.64**
(** p < 0.01)

Table 3: Correlation between lexical ambiguity and contextual uncertainty.
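A statsmodels sketch of the shape of this regression; variable names are ours, and whether frequency enters raw or log-transformed is our assumption rather than a detail stated above:

```python
import numpy as np
import statsmodels.api as sm

def regress_ambiguity(bert_ambiguity, n_senses, frequency):
    """Predict the BERT-based ambiguity from the log WordNet sense count and
    the corpus frequency, with all variables z-normalised as in Table 2."""
    z = lambda x: (x - x.mean()) / x.std()
    X = sm.add_constant(np.column_stack([z(np.log2(n_senses)), z(frequency)]))
    fit = sm.OLS(z(bert_ambiguity), X).fit()
    return fit.params, fit.pvalues  # coefficients and their significance
```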

8 Lexical Ambiguity Correlates With Contextual Uncertainty

We now test whether lexical ambiguity negatively correlates with contextual uncertainty, the main hypothesis of our paper. We first evaluate this on a set of six high-resource languages, using our WordNet estimate for the lexical ambiguity of a word. The top half of Table 3 shows the results: for five of the six languages, there is a negative correlation between the number of senses of a word and contextual uncertainty (p < 0.01). The top half of Figure 3 further presents these results. In these figures we see that, especially for highly ambiguous words, contextual uncertainty tends to be very small. This supports our hypothesis, but only on a restricted set of languages for which WordNet is available.

With that in mind, we now consider a larger and more diverse set of 18 languages, analysed using our BERT-based estimator of lexical ambiguity. Figures 1 and 3 show the relationship between contextual uncertainty and lexical ambiguity—in all 18 analysed languages, we find negative correlations, further supporting our hypothesis. These correlations are presented in the bottom half of Table 3, and range from Pearson ρ = −0.31 in Portuguese to ρ = −0.55 in Yoruba (p < 0.01).

[Figure 3: Contextual uncertainty versus lexical ambiguity in a selection of languages. Each plot contains the scatter points (representing each word type), a robust linear regression and kernel density estimate regions. (From left to right; Top) WordNet: Arabic, English, Indonesian; (Bottom) BERT: Arabic, English, Malayalam, Tagalog.]

Comparing the top and bottom half of Table 3, we see that the correlations are larger when using our BERT estimate rather than the WordNet one. We believe this may result from one or all of the following: (i) there is a confounding effect caused by the use of the same model (BERT) to estimate both ambiguity and surprisal, (ii) the assumption that the senses in WordNet are uniformly distributed may be simplistic, and (iii) our BERT-based ambiguity estimate may capture a more subtle sense of ambiguity than WordNet, which may result in a stronger correlation with contextual uncertainty.¹³ Nonetheless, even if there is a confounding effect in this second batch of experiments (using BERT to estimate lexical ambiguity), the first batch (with WordNet) has no such confounding factor—providing strong support for our main hypothesis.

¹³ Cruse (1986, p. 51) argues there are two ways in which context affects a word's meaning—selection between units of distinct senses, or contextual modification of a single sense.

A quick visual inspection of Figure 3 indicates this data might be heteroscedastic—it might have unequal variance across distinct ambiguity levels. To investigate this, we run White's (1980) test on the uncertainty–ambiguity pairs. This verifies the intuition that the distribution is heteroscedastic for both our WordNet and BERT measures (p < 0.01). Future work should investigate the impact of this heteroscedasticity on lexical ambiguity.

Limitations  This work focuses on proposing new information-theoretic approximations for both lexical ambiguity and bidirectional contextual uncertainty, and on positing that these two measures should negatively correlate. In our experimental section, we tested this hypothesis on a set of typologically diverse languages. Nonetheless, our experiments are restricted to Wikipedia corpora. This data is naturally limited. For instance, while dialogue utterances may rely on extra-linguistic clues, sentences in Wikipedia cannot. Furthermore, due to its ample target audience, the text in Wikipedia may be over-descriptive. Future work should investigate whether similar results apply to other corpora.

9 Conclusion

In this paper we hypothesised that, were a language economical in its expressions and clear, the contextual uncertainty of a word should negatively correlate with its lexical ambiguity—suggesting speakers compensate for lexical ambiguity by making contexts more informative. To investigate this, we proposed an information-theoretic operationalisation of lexical ambiguity, together with two methods of approximating it, one using WordNet and one using BERT. We discussed the relative advantages of each, and provided experiments using both. With our WordNet approximation, we found significant negative correlations between lexical ambiguity and contextual uncertainty in five out of the six high-resource languages analysed, supporting our hypothesis in this restricted setting. With our BERT approximation, we then expanded our analysis to a larger set of 18 typologically diverse languages and found significant negative correlations between lexical ambiguity and contextual uncertainty in all of them, further supporting our hypothesis that contextual uncertainty negatively correlates with lexical ambiguity.

Acknowledgments

Damián Blasi acknowledges funding from the framework of the HSE University Basic Research Program and is funded by the Russian Academic Excellence Project '5-100'.

References

Asaf Amrami and Yoav Goldberg. 2019. Towards better substitution-based word sense induction. arXiv preprint arXiv:1905.12598.

Emily M. Bender. 2009. Linguistically naïve != language independent: Why NLP needs linguistic typology. In Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?, pages 26–32, Athens, Greece. Association for Computational Linguistics.

Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300.

Thomas G. Bever. 1970. The cognitive basis for linguistic structures. In John R. Hayes, editor, Cognition and the Development of Language, pages 279–362. Wiley & Sons, Inc, New York.

Klinton Bicknell and Roger Levy. 2011. Why readers regress to previous words: A statistical analysis. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33.

Jorge Luis Borges. 1964. The analytical language of John Wilkins. Other Inquisitions, 1937–1952:101–105.

Alexander Budanitsky and Graeme Hirst. 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47.

Spencer Caplan, Jordan Kodner, and Charles Yang. 2020. Miller's monkey updated: Communicative efficiency and the statistics of words in natural language. Cognition, 205:104466.

Kenneth Church and Ramesh Patil. 1982. Coping with syntactic ambiguity or how to put the block in the box on the table. Computational Linguistics, 8(3-4):139–149.

Thomas M. Cover and Joy A. Thomas. 2012. Elements of Information Theory. John Wiley & Sons.

David A. Cruse. 1986. Lexical Semantics. Cambridge University Press.

Isabelle Dautriche. 2015. Weaving an ambiguous lexicon. Ph.D. thesis, Sorbonne Paris Cité.

Isabelle Dautriche, Laia Fibla, Anne-Caroline Fievet, and Anne Christophe. 2018. Learning homophones in context: Easy cases are favored in the lexicon of natural languages. Cognitive Psychology, 104:83–105.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Kara D. Federmeier and Marta Kutas. 1999. A rose by any other name: Long-term memory structure and sentence processing. Journal of Memory and Language, 41(4):469–495.

Lyn Frazier. 1985. Syntactic Complexity. Studies in Natural Language Processing, pages 129–189. Cambridge University Press.

Richard Futrell, Edward Gibson, and Roger P. Levy. 2020. Lossy-context surprisal: An information-theoretic model of memory effects in sentence processing. Cognitive Science, 44(3).

Richard Futrell and Roger Levy. 2017. Noisy-context surprisal as a human sentence processing cost model. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 688–698, Valencia, Spain. Association for Computational Linguistics.

John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Second Meeting of the North American Chapter of the Association for Computational Linguistics.

Trevor A. Harley and Helen E. Bown. 1998. What causes a tip-of-the-tongue state? Evidence for lexical neighbourhood effects in speech production. British Journal of Psychology, 89(1):151–174.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference for Learning Representations.

George Lakoff and Mark Johnson. 1980. Metaphors We Live By. University of Chicago Press, Chicago.

Roger Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106(3):1126–1177.

Kyle Mahowald, Evelina Fedorenko, Steven T. Piantadosi, and Edward Gibson. 2013. Info/information theory: Speakers choose shorter words in predictive contexts. Cognition, 126(2):313–318.

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Nasser Mohammed, Vinayak Narayan, Devi Prasad Patra, and Anil Nanda. 2018. Louis Victor Leborgne ("Tan"). World Neurosurgery, 114:121–125.

Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, and Chris Biemann. 2017. Unsupervised does not mean uninterpretable: The case for word sense induction and disambiguation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 86–98, Valencia, Spain. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035. Curran Associates, Inc.

Steven T. Piantadosi, Harry Tily, and Edward Gibson. 2011. Word lengths are optimized for efficient communication. Proceedings of the National Academy of Sciences, 108(9):3526–3529.

Steven T. Piantadosi, Harry Tily, and Edward Gibson. 2012. The communicative function of ambiguity in language. Cognition, 122(3):280–291.

Tiago Pimentel, Brian Roark, and Ryan Cotterell. 2020. Phonotactic complexity and its trade-offs. Transactions of the Association for Computational Linguistics, 8:1–18.

Lynne M. Reder, John R. Anderson, and Robert A. Bjork. 1974. A semantic interpretation of encoding specificity. Journal of Experimental Psychology, 102:648–656.

Scott Seyfarth. 2014. Word informativity influences acoustic duration: Effects of contextual predictability on lexical representation. Cognition, 133(1):140–155.

Mariano Sigman and Guillermo A. Cecchi. 2002. Global organization of the WordNet lexicon. Proceedings of the National Academy of Sciences, 99(3):1742–1747.

Michael K. Tanenhaus, Michael J. Spivey-Knowlton, Kathleen M. Eberhard, and Julie C. Sedivy. 1995. Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217):1632–1634.

Sean Trott and Benjamin Bergen. 2020. Why do human languages have homophones? Cognition, 205:104449.

Thomas Wasow. 2015. Ambiguity avoidance is overrated. In Ambiguity: Language and Communication, pages 29–48. De Gruyter.

Halbert White. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica: Journal of the Econometric Society, pages 817–838.

John Wilkins. 1668. An Essay Towards a Real Character, and a Philosophical Language. The Royal Society, London.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

George K. Zipf. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley Press.

Appendices

A Gaussian Approximation for a Word's Meanings

Given our samples \{\langle w, c_i \rangle\}_{i=1}^{N} of word–context pairs (assumed to be drawn from the true distribution p), we get the subset of N_w instances of word type w. We then use an unbiased estimator of the covariance matrix:

    \Sigma_w \approx \frac{1}{N_w - 1} \sum_{i=1}^{N_w} \big(h_{w,c_i} - \tilde{\mu}_w\big)\big(h_{w,c_i} - \tilde{\mu}_w\big)^{\top}    (18)

where the sample mean is defined as

    \tilde{\mu}_w \approx \frac{1}{N_w} \sum_{i=1}^{N_w} h_{w,c_i}    (19)

We note that these approximations become exact as N_w → ∞ due to the law of large numbers.

Since h_{w,c} (i.e. BERT's hidden state) is a 768-dimensional vector, we might not have enough samples to fully estimate Σ_w. We therefore approximate this entropy by using only its variance, diag(Σ_w). This still yields an upper bound on the true entropy:

    H(\mathcal{N}(\mu_w, \Sigma_w)) \leq H(\mathcal{N}(\mu_w, \operatorname{diag}(\Sigma_w)))    (20)

The right side of this equation is, then, used as our actual lexical ambiguity estimate; a numpy sketch of the full procedure is given after Table 4.

B ISO 639-1 Codes

In this section, we present the set of ISO 639-1 language codes we use throughout this paper, in Table 4.

ISO Code   Language
af         Afrikaans
ar         Arabic
bn         Bengali
en         English
et         Estonian
fi         Finnish
he         Hebrew
id         Indonesian
is         Icelandic
kn         Kannada
ml         Malayalam
mr         Marathi
fa         Persian
pt         Portuguese
tl         Tagalog
tr         Turkish
tt         Tatar
yo         Yoruba

Table 4: ISO codes and their languages
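A numpy sketch of this appendix, combining eqs. (18)–(20) with the entropy bound of eq. (13); `states` is assumed to be an (N_w × 768) array of BERT hidden states collected for a single word type:

```python
import numpy as np

def gaussian_ambiguity(states: np.ndarray) -> float:
    """Upper bound on H(M | W = w) in bits: the differential entropy of a
    Gaussian whose diagonal holds the per-dimension sample variances."""
    var = states.var(axis=0, ddof=1)  # diagonal of the unbiased Sigma_w, eq. (18)
    # H(N(mu, diag(var))) = 0.5 * log2 det(2*pi*e*diag(var))
    #                     = 0.5 * sum_d log2(2*pi*e*var_d)
    return 0.5 * float(np.sum(np.log2(2 * np.pi * np.e * var)))
```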
