
DISTRIBUTED SEMANTICS, JUDGMENT, AND DECISION MAKING 1

Knowledge, , and everyday judgment:

An introduction to the distributed semantics approach

Russell Richie* & Sudeep Bhatia

University of Pennsylvania, USA

September 23, 2019

Author Note

* Correspondence regarding this article should be addressed to Russell Richie, 3720 Walnut St.,

Philadelphia, PA 19104, [email protected]

Introduction

Every day, we make thousands of judgments and decisions (Gilovich et al., 2002; Oppenheimer

& Kelso, 2015; Weber & Johnson, 2009). We may judge how tasty and nutritious a fruit is, which may influence whether we buy or eat it. We might consider two items, say ‘nurse’ and ‘journalist’, and judge how similar they are, potentially generalizing from one to the other on the basis of their similarity – or we may consider ‘brilliant’ and ‘smart’, and decide whether one is sufficiently synonymous with the other to replace it in a particular sentence. We may ponder questions like “Will there be an earthquake in California?” or “Is Donald Trump honest?” or “Should I spend money on a vacation or save up for a new television?”, consider various semantic aspects of these questions, and gradually decide our answers to them. We may perceive that the relationship between Tony and Maria in West Side Story is similar to that between Romeo and Juliet, and generalize what we know of the latter relationship to the former. To complicate it all, different people from different cultures and/or times may respond very differently to all these judgment problems.

In each of these examples, the judgments and decisions must be made on the basis of what the decision maker knows about the vast number of things in the world. For example, the taste judgment about a fruit likely depends on, e.g., what the decision maker thinks the sugar content of the fruit is.

The judgment of the similarity of ‘nurse’ and ‘journalist’ depends on the extent to which what is known of ‘nurse’ is identical to what is known of ‘journalist’. Thus, for psychologists to understand how people make the above kinds of judgments and decisions, we must understand both:

(a) What (different) people know about potential judgment targets, and

(b) How that knowledge is used to make a judgment (again, in potentially different ways for

different individuals)

In some studies of human judgment and decision making, and other areas of high-level cognition, artificial stimuli may be constructed such that experimenters can assume (to a first approximation) what participants know about the judgment stimuli. But of course, theories of judgment and decision making ought to apply not just in the lab and to artificial stimuli, but in the natural world, to the decision problems people face every day in the normal course of their lives. Thus psychologists need to know what people know and believe about natural entities in the real world. Indeed, it is only by modeling what people know and believe about entities in the world that psychological theories of judgment can be applied to areas of practical relevance, such as health policy, consumer behavior, or political psychology.

Techniques for uncovering what people know about such naturally occurring judgment targets have a long history, constituting a large part of the field of psychometrics. For example, participants may provide data on the similarity between items (words, images, etc.), usually through direct ratings of similarity on Likert scales. A matrix of similarities between all pairs of items can then be submitted to techniques like multidimensional scaling or additive clustering (Shepard, 1974; Shepard & Arabie,

1979), which obtain for each item a low dimensional spatial or featural representation, respectively

(representations which may have intuitive interpretations, e.g., scaling of emotion words may have a dimension corresponding to valence (Russell, 1980), or additive clustering of numbers may uncover features for evenness, primeness, etc.; Navarro & Lee, 2003). Or, participants may be simply asked to list the features they think are important for each item, producing so-called ‘feature norms’ (McRae,

Cree, Seidenberg, & McNorgan, 2005; Devereux, Tyler, Geertzen, & Randall, 2014; Buchanan, Valentine,

& Maxwell, 2019). For example, responding to ‘cat’, a participant may say ‘furry’, ‘chases mice’, and

‘aloof’. Unfortunately, despite the long success of these and related classical techniques for obtaining knowledge representations, they face a crucial shortcoming: they simply cannot scale to the number – effectively infinite – of natural entities in the real world that are potential judgment targets. And even for the relative handful of items for which representations could be obtained through classical techniques, the representations are typically relatively impoverished, possessing only a few dimensions or features

(whereas, as we will see, newer techniques can uncover much richer representations).

Fortunately, the last few years have seen the development of techniques that can deliver cheap, rich, and accurate representations for millions – and sometimes even an unbounded number – of entities that may be involved in everyday judgment. These techniques, and their underlying theoretical assumptions, are referred to by the term distributed semantics (DS), as they propose that semantic representations are reflected in, and can be recovered from, the statistical distribution of words in language. While these ideas have a long theoretical and applied history in psychology (Harris, 1954; Firth, 1957; Landauer and Dumais, 1997), three key advances have increased interest in distributed semantics: the availability of large-scale natural language corpora, increased computational power, and new algorithms for efficiently deriving distributed semantic representations. These advances now enable especially rich and comprehensive distributed semantic representations – commonly known as word vectors or word embeddings – for millions of words and common phrases (without the need for any explicit participant ratings data), and even, very recently, for longer, novel phrases, sentences, and documents (Turney &

Pantel, 2010; Mikolov, Sutskever, Chen, Corrado, & Dean, 2013; Devlin, Chang, Lee, & Toutanova,

2018). Distributed semantic representations thus provide a solution to (a) above, quantifying what people know about potential judgment targets.

Additionally, as most distributed semantic methods yield (high-dimensional) spatial or featural representations for judgment targets, these methods can be used in much the same way as the outputs of standard psychometric methods. Thus existing psychological solutions to (b) above (how people use knowledge to make a decision – see Oppenheimer & Kelso, 2015) can be combined with representations obtained through distributed semantics, opening up new avenues for studying naturalistic human judgment, and making it feasible to build computational models that represent knowledge, make evaluations and attributions, and give responses, in a human-like manner.

This chapter will provide a summary of distributed semantic models – primarily models of word representations, but briefly also novel phrase and sentence representations – and their applications to human judgment and psychological science, and is organized as follows. First, we describe distributed semantic models at a high level (again, focusing on DS models of words and frozen phrases).

Other recent reviews provide a more comprehensive overview of the statistical and computational underpinnings of DS models, the differences between different algorithms for building DS models, the technical steps necessary to apply these algorithms on natural language data, and applications of DS models to natural language processing and related tasks (e.g. Lenci, 2018; Turney &

Pantel, 2010). Second, we review applications of DS to judgment and decision making, proceeding from simpler to more complex applications. In each case, we will make explicit (b), the precise way in which DS representations are used, mathematically. A key point of this section and the chapter more broadly is that DS representations are not themselves a model of psychological process; mistaking them for one has, on occasion, invited certain criticisms against DS representations (and spatial models more generally). Finally, we suggest future work with distributed semantics. Here we will discuss (among other things) very recent models in natural language processing that have transformed that field, models like the Universal Sentence Encoder (Cer et al., 2018), BERT (Devlin et al., 2018), and ELMo

(Peters et al., 2018), which can deliver representations for phrases and sentences. These models may likewise transform the psychology of judgment and decision making.

Distributed Semantics

The main idea behind distributed semantics is that patterns of co-occurrence among words reveal word meaning. We illustrate with a toy ‘co-occurrence matrix’ (Table 1). Consider the words

‘spinach’, ‘banana’, ‘cake’, ‘apple’, and ‘computer’, and imagine that they occur in a large corpus

(collection of documents), and each co-occurs with the words ‘vitamin’, ‘sugar’, ‘fat’, ‘bake’, and

‘information’ with some particular frequency. We refer to the first set of words as target words and the second set of words as context words1. Each cell of Table 1 indicates how many times a target word occurred near (within a few words, or in the same document) a context word. For example, ‘spinach’ and ‘vitamin’ co-occur near each other 5 times, while ‘spinach’ and ‘sugar’ never co-occur. Examining these patterns of co-occurrence between target words and context words reveals that ‘banana’ and

‘apple’ are rather similar in meaning (right now, this similarity is only impressionistic, but in the following sections we will discuss how to quantify it), but both are also somewhat similar to ‘spinach’ and ‘cake’, in different ways (fruits being high in vitamins like ‘spinach’, but also having some sugar content like ‘cake’). Further, the four food-related words have a completely different profile of co- occurrence than does ‘computer’, reflecting the fact that ‘computer’ has little to do, semantically, with food.
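The counting procedure just described can be sketched in a few lines of Python. The toy corpus and word sets below are invented purely for illustration; real applications count over corpora of millions or billions of tokens:

```python
from collections import Counter, defaultdict

def cooccurrence_counts(corpus, targets, contexts, window=4):
    """Count how often each target word occurs within `window`
    words of each context word, across all sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for i, word in enumerate(sentence):
            if word not in targets:
                continue
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i and sentence[j] in contexts:
                    counts[word][sentence[j]] += 1
    return counts

# A tiny invented corpus, purely for illustration.
corpus = [
    "the banana has sugar and vitamin c".split(),
    "bake the cake with sugar and fat".split(),
    "spinach is rich in vitamin k".split(),
]
counts = cooccurrence_counts(
    corpus,
    targets={"spinach", "banana", "cake"},
    contexts={"vitamin", "sugar", "fat", "bake"},
)
# e.g. counts["cake"]["sugar"] == 1, counts["spinach"]["vitamin"] == 1
```

Accumulating such counts over a large corpus yields a matrix like Table 1.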

Although raw co-occurrence matrices, like Table 1, can provide useful semantic representations, even better representations can be obtained from such matrices reduced in dimensionality via procedures like singular value decomposition. Dimensionality reduction, usually to between 50 and 300 dimensions, is performed both to make the resulting representations more compact (for practical concerns with computer memory and processing capacity), but also because thousands or millions of column dimensions (as there would be if every word type in a corpus had its own column) are simply not necessary to represent word meanings accurately for many applications. In some cases, the dimension reduction actually improves the quality of the vector representations, as it allows for the representations to encode higher-order statistical relationships between words. For example, ‘cake’ and

‘cookie’ may never directly co-occur, but dimensionality reduction on raw co-occurrence matrices would reveal a semantic relationship between these words as long as they co-occur systematically with a common set of words (such as ‘sugar’ and ‘bake’). A downside of this, unfortunately, is that the

resulting dimensions are no longer so easily interpreted in terms of particular co-occurrence frequencies. In any case, the vectors of (dimension-reduced) co-occurrence information derived from some corpus constitute a distributed semantic representation of a word, or the knowledge a person has about a word and may use when making judgments involving the word.2

1 Note that in many practical applications, context in such matrices consists not of individual words but rather the set of words that make up the sentence or document in which the target word occurs. We are simplifying here for expository convenience.

            vitamin  sugar  fat  bake  information  ...
spinach        5       0     0    1        0        ...
banana         3       2     0    2        0        ...
cake           0       4     3    4        0        ...
apple          3       1     0    3        0        ...
computer       0       0     0    0        5        ...

Table 1. A toy example of co-occurrence frequencies between target words (rows) and context words (columns). The patterns of co-occurrence with the context words reveal that ‘banana’ and ‘apple’ are highly similar in meaning, but both are also similar to ‘spinach’ and ‘cake’. The four food-related words have a completely different profile of co-occurrence than does ‘computer’, reflecting the fact that ‘computer’ has little to do, semantically, with food. The additional column with ellipses reflects the fact that typical raw co-occurrence matrices will have thousands (or more) contexts and hence columns.
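A minimal sketch of the dimensionality reduction step, applying a truncated singular value decomposition to the Table 1 counts with numpy (real models factorize far larger, reweighted matrices and keep 50–300 dimensions rather than 2):

```python
import numpy as np

# Rows: spinach, banana, cake, apple, computer (Table 1).
X = np.array([
    [5, 0, 0, 1, 0],
    [3, 2, 0, 2, 0],
    [0, 4, 3, 4, 0],
    [3, 1, 0, 3, 0],
    [0, 0, 0, 0, 5],
], dtype=float)

# Truncated SVD: keep only the k largest singular values/vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k] * s[:k]   # one k-dimensional vector per target word

# The rank-k product approximately reconstructs the original counts.
X_hat = word_vectors @ Vt[:k, :]
```

Each row of `word_vectors` is a compact distributed representation of the corresponding target word.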

These basic ideas underlie the many different distributed models dating back decades (at least to

LSA; Landauer & Dumais, 1997). The details of these models, however, can differ somewhat (see e.g.

Bullinaria & Levy, 2007 for an early discussion). First, all DS implementations can vary in the size of the context window, i.e. a threshold level of distance in text that determines whether or not two words are seen as co-occurring with each other. Different methods also often implement different text preprocessing choices (e.g., whether to lemmatize words, converting, say, ‘cats’ to just ‘cat’), different numbers of dimensions to retain in the ultimate word representation, and other parameters. Beyond this, DS

2 A related class of models that learn lexical representations from text are known as topic models, the most prominent technique being latent Dirichlet allocation (Griffiths et al., 2007; see Blei, 2012 for review). Under this class of models, documents are distributions over topics, which are themselves distributions over words. We will not cover such models here, as they are no longer applied as much as other spatial DS models, likely owing to their difficulty with scaling to the massive corpora on which other co-occurrence models are trained. models generally fall into two classes. First are count models, in which a co-occurrence matrix like the one above is built and factorized explicitly (e.g., LSA, Landauer & Dumais, 1997; BEAGLE, Jones &

Mewhort, 2007; GloVe, Pennington, Socher, & Manning, 2014). Within this class, the cells of the co-occurrence matrix are often re-weighted, via term frequency-inverse document frequency, positive pointwise mutual information, or log-entropy (Mandera et al., 2017), before undergoing matrix factorization/dimensionality reduction.
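A sketch of the positive pointwise mutual information (PPMI) reweighting step, applied to the Table 1 counts (a minimal implementation for illustration; practical systems compute this over sparse matrices with millions of cells):

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information reweighting of a raw
    co-occurrence count matrix (rows: targets, columns: contexts)."""
    total = counts.sum()
    p_xy = counts / total
    p_x = counts.sum(axis=1, keepdims=True) / total
    p_y = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_xy / (p_x * p_y))
    pmi[~np.isfinite(pmi)] = 0.0   # zero co-occurrence counts -> 0
    return np.maximum(pmi, 0.0)    # clip negatives: "positive" PMI

X = np.array([[5, 0, 0, 1, 0],
              [3, 2, 0, 2, 0],
              [0, 4, 3, 4, 0],
              [3, 1, 0, 3, 0],
              [0, 0, 0, 0, 5]], dtype=float)
W = ppmi(X)
```

The reweighted matrix `W` (rather than the raw counts) is then factorized as described above.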

The other class is known as ‘predict’ models (word2vec, Mikolov et al., 2013; fastText,

Bojanowski, Grave, Joulin, & Mikolov, 2017). Word2vec is the most well-known of these models, and actually comprises two models, CBOW and skip-gram. In both models, a multilayer neural network slides over contiguous windows of text, and either attempts to predict the word in the center of the window from the words in the periphery of the window (CBOW), or predict the peripheral words from the word in the center (skip-gram). Via error-driven learning to predict (peripheral or central) words in this way, the weights of the network gradually encode information about the semantic relationships between words. In the case of skip-gram, the input layer has a single node for each target word, and the trained weights from a target word’s node to the hidden layer of the network correspond to the rows of the co-occurrence matrix above. In fact, it can be shown that skip-gram is equivalent to factorizing a word-context matrix whose cells are the pointwise mutual information between target and context words, shifted by a constant (Levy & Goldberg, 2014). Even in the case of CBOW, the result of the learning process is a set of neural-network weights assigned to each word, which, as with count methods, represent the word as a vector in a high-dimensional space. While this demonstrates a degree of equivalence between count and predict models, it is arguably the predict models in particular that have increased interest in text-based DS models in the last few years. There may be many reasons for this, but chief among them are likely that predict models scale to large corpora more efficiently than previous count-based models (Mikolov et al., 2013), predict models potentially reflect more psychologically realistic learning mechanisms3 (see, e.g., Mandera, Keuleers, & Brysbaert, 2017 for a comparison between word2vec and Rescorla-Wagner associative learning, and Jones, Willits, &

Dennis, 2015 for discussion of how text-based distributed models relate to models of semantic memory), and predict models can be trained incrementally, whereas a count model must process a single corpus at once and then cannot incorporate information from additional documents in a straightforward way. Understanding the differences (or lack thereof) between DS models, and when one particular implementation and combination of parameters leads to more effective representations than others, is an ongoing effort in natural language processing and cognitive science (e.g., Lenci,

2018; Mandera et al., 2017). However, when discussing applications of DS models to judgment and decision making, we will not – with a few exceptions – concern ourselves terribly with the particular variant of text-based distributed semantic representations, either the algorithm or the corpus, that different researchers have used. These issues are clearly important, but for the sake of brevity, we will mostly abstract away from them.
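To make the predict-model idea concrete, here is a minimal sketch of skip-gram with negative sampling on an invented toy corpus. Everything below – the corpus, vocabulary, and hyperparameters – is purely illustrative; real word2vec implementations add frequency-based subsampling, a unigram-table negative-sampling distribution, and train on billions of tokens:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus as token ids over an invented 6-word vocabulary.
vocab = ["spinach", "banana", "cake", "sugar", "vitamin", "computer"]
corpus = [[0, 4, 1, 4, 3], [1, 3, 2, 3], [2, 3, 2, 3], [5, 5, 5, 5]]

V, dim, window, k_neg, lr = len(vocab), 8, 2, 2, 0.05
W_in = rng.normal(scale=0.1, size=(V, dim))    # target ("input") vectors
W_out = rng.normal(scale=0.1, size=(V, dim))   # context ("output") vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(100):                           # a few passes over the corpus
    for sent in corpus:
        for i, center in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j == i:
                    continue
                # One positive (true context) pair plus k_neg random negatives.
                pairs = [(sent[j], 1.0)] + [
                    (int(rng.integers(V)), 0.0) for _ in range(k_neg)
                ]
                for w, label in pairs:
                    grad = sigmoid(W_in[center] @ W_out[w]) - label
                    d_in = lr * grad * W_out[w]
                    d_out = lr * grad * W_in[center]
                    W_in[center] -= d_in
                    W_out[w] -= d_out
```

After training, each row of `W_in` serves as the word's distributed semantic representation.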

Before moving to our review of DS applications to judgment and decision making, we briefly mention some available software packages and pre-trained distributed semantic representations that make applications of DS models to judgment and decision making research (and cognitive science more generally) accessible to psychologists. For a quick, hands-on experience with distributed semantic representations, see online tools like lsa.colorado.edu for exploration of latent semantic analysis, or http://bionlp-www.utu.fi/wv_demo/ for exploration of word2vec. For using DS models in a programmatic pipeline, we recommend two Python packages. The package

‘gensim’ (Řehůřek & Sojka, 2010) allows for training, or using pre-trained, vectors for word2vec, fastText, doc2vec, LSA, latent Dirichlet allocation, and more, while the newer package ‘magnitude’

3 However, the use of vectors from predict models to model judgment and decision making does not require predict models to be psychologically plausible models of learning. In fact, the typical size of corpora predict models are trained on (billions of word tokens) is much larger than most adults have been exposed to.

(Patel, Sands, Callison-Burch, & Apidianaki, 2018), also for Python, provides an even faster, lightweight tool for using pre-trained representations derived via GloVe, word2vec, and fastText (and some techniques for getting representations beyond words, which we will briefly discuss towards the end of the chapter). See Pereira, Gershman, Ritter, and Botvinick (2016) for a comparison of various off-the-shelf distributed semantic representations for modeling psychological data.

Applications

Dot-product and cosine similarity

The simplest and perhaps most common way DS models are used in psychological applications is computing the dot product or cosine similarity between two vectors (Lenci, 2018), and using this similarity as a predictor of some measure of behavior in a task. The dot product between two vectors u and v, denoted by u · v, is defined as the sum of the elementwise products of the two vectors: Σi ui*vi. See Table 2 for all pairwise dot products between the target words of Table 1.

          spinach  banana  cake  apple  computer
spinach      26      17      4     18       0
banana       17      17     16     17       0
cake          4      16     41     16       0
apple        18      17     16     19       0
computer      0       0      0      0      25

Table 2. Pairwise dot products between all target word vectors of Table 1.
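The dot products in Table 2 can be reproduced directly from the Table 1 count vectors:

```python
import numpy as np

# Target-word count vectors from Table 1.
vectors = {
    "spinach":  np.array([5, 0, 0, 1, 0]),
    "banana":   np.array([3, 2, 0, 2, 0]),
    "cake":     np.array([0, 4, 3, 4, 0]),
    "apple":    np.array([3, 1, 0, 3, 0]),
    "computer": np.array([0, 0, 0, 0, 5]),
}

def dot(u, v):
    # Sum of elementwise products, as in the definition above.
    return int((u * v).sum())

print(dot(vectors["banana"], vectors["apple"]))  # 17
print(dot(vectors["banana"], vectors["cake"]))   # 16
```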

For example, the dot product between the vectors for ‘banana’ and ‘apple’ in Table 1 is 3*3 + 2*1 + 0*0 + 2*3 + 0*0 = 17, whereas the dot product between ‘banana’ and ‘cake’ is 3*0 + 2*4 + 0*3 + 2*4 + 0*0 = 16. Thus, the difference in these two dot products roughly corresponds to the greater intuitive similarity between ‘banana’ and ‘apple’, compared to ‘banana’ and ‘cake’. However, the dot product between ‘spinach’ and ‘banana’ is 17, which is identical to the dot product between ‘banana’ and ‘apple’, likely contra intuitions about the relative similarity of ‘spinach’ and ‘banana’ compared to

‘apple’ and ‘banana’. This is simply because the vector for ‘spinach’ is longer than the vectors for

‘apple’ and ‘banana’ – longer not in its number of entries, but in terms of the Euclidean length of its vector, defined by:

||u|| = √(Σi ui²)

In particular, the Euclidean length (alternately, the L2-norm) of ‘spinach’ is 5.10, whereas the lengths of ‘banana’ and ‘apple’ are 4.12 and 4.36, respectively. One reason for such length differences between target words (although not exactly the culprit here) is that oftentimes the raw frequency of target words will vary, and, all else equal, more frequent words will have longer vectors, which will drive up the dot product with all other vectors. For these reasons, in general, the dot product can be a misleading measure of similarity between two vectors of co-occurrence frequencies. However, if we normalize the dot product by the product of each vector’s length, (u · v) / (||u||*||v||), we obtain more sensible similarity relations, as displayed in Table 3.

          spinach  banana  cake  apple  computer
spinach      1      0.81   0.12   0.81      0
banana      0.81     1     0.61   0.95      0
cake        0.12    0.61    1     0.57      0
apple       0.81    0.95   0.57    1        0
computer     0       0      0      0        1

Table 3. Pairwise cosine similarities between all target word vectors of Table 1.

This ‘length-normalized dot product’ is known as the cosine similarity, which, under a geometric interpretation, measures the cosine of the angle between two vectors. Cosine similarity ranges from 1, indicating perfect similarity, to 0, when two vectors are orthogonal, to -1, when they point in opposite directions (which can happen if vectors have negative values). As Table 3 illustrates, the similarity relations between our target words are now much more intuitive, ranked as follows, from most similar to least similar:

1. ‘banana’ vs ‘apple’: .95

2. ‘spinach’ vs ‘banana’, and ‘spinach’ vs ‘apple’: .81

3. ‘cake’ vs ‘banana’: .61

4. ‘cake’ vs ‘apple’: .57

5. ‘spinach’ vs ‘cake’: .12

6. ‘computer’ vs each of the four food words: 0
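The cosine similarities in Table 3, and the ranking above, can be reproduced from the Table 1 count vectors:

```python
import numpy as np

vectors = {
    "spinach":  np.array([5, 0, 0, 1, 0], dtype=float),
    "banana":   np.array([3, 2, 0, 2, 0], dtype=float),
    "cake":     np.array([0, 4, 3, 4, 0], dtype=float),
    "apple":    np.array([3, 1, 0, 3, 0], dtype=float),
    "computer": np.array([0, 0, 0, 0, 5], dtype=float),
}

def cosine(u, v):
    # Dot product normalized by the product of the vectors' Euclidean lengths.
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(round(cosine(vectors["banana"], vectors["apple"]), 2))   # 0.95
print(round(cosine(vectors["spinach"], vectors["cake"]), 2))   # 0.12
```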

Because it appropriately controls for differences in target word frequency and/or vector lengths, cosine similarity has been much more popular than simple dot product in applications of text-based distributed semantic representation4. Cosine similarity between text-based DS representations predicts

(1) correct choice of a synonym to a probe word in a multiple choice setting (Landauer & Dumais,

1997), (2) Likert scale judgments of a pair of words’ semantic relatedness and similarity (Hill, Reichart,

4 Although dot product and cosine similarity are the most common ways of measuring similarity between text-based distributed semantic representations, other measures are feasible. For example, Euclidean distance can be used to measure dissimilarity, and has been used in classic spatial models of cognition, e.g., Nosofsky’s Generalized Context Model of categorization. However, it is believed that Euclidean distance is inappropriate in high-dimensional spaces (which text-based DS models usually are), owing to the ‘curse of dimensionality’, where every item can be very far from every other item (Aggarwal, Hinneburg, & Keim, 2001). However, Nosofsky, Sanders, and McDaniel (2018a) and Nosofsky, Sanders, Meagher, & Douglas (2018b) successfully model similarity and categorization, respectively, with Euclidean distance in an 8-dimensional representational space derived via multidimensional scaling, and Peterson, Abbott, & Griffiths (2018) successfully modeled visual category learning based on k-means clustering, which uses Euclidean distance, of very high dimensional image vectors derived from deep neural networks. These three works suggest that Euclidean distance might be tested with text-based DS representations of at least moderate dimensionality. See Nosofsky et al. (2018a) for more discussion about the ‘curse of dimensionality’ with respect to high-dimensional spatial models of concepts.

& Korhonen, 2015), and (3) strength of semantic priming in, for example, lexical decision tasks5

(Jones, Kintsch, & Mewhort, 2006; Mandera et al., 2017). Many associative behaviors are also well explained with cosine similarity between word vectors. Association-based judgments in probability judgment, event forecasting, and factual judgment are well predicted with cosine similarities (Bhatia, 2017b;

Bhatia & Walasek, 2019), and interestingly, associations derived from DS models predict both when participants are likely to give correct judgments and when they are likely to make mistakes. Strength of social biases as measured in the Implicit Association Test, for a range of social variables, including age, gender, and race, can be predicted with simple measures based on cosine similarity between vectors (Bhatia, 2017a; Caliskan, Bryson, & Narayanan, 2017)6. Simple cosine similarity also has some success in predicting behavior in free word association tasks (e.g., Pereira et al., 2016) – but see the next section for challenges with this approach. Continuous ratings of semantic properties like the size of different animals, or the danger of different sports, can also be predicted by computing the dot product between a vector for a judgment target (say, ‘tiger’), and a vector representing the judgment dimension

(approximately, the vector for ‘large’ minus the vector for ‘small’; Grand et al., 2018).
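A sketch of this projection approach. The 4-dimensional embeddings below are invented purely for illustration (real applications project onto differences of, e.g., 300-dimensional word2vec or GloVe vectors):

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, invented for illustration only.
emb = {
    "large": np.array([ 1.0, 0.2, 0.0, 0.1]),
    "small": np.array([-1.0, 0.1, 0.0, 0.2]),
    "tiger": np.array([ 0.8, 0.5, 0.3, 0.0]),
    "mouse": np.array([-0.7, 0.4, 0.2, 0.1]),
}

# The judgment dimension is the difference between the two pole vectors.
size_dim = emb["large"] - emb["small"]
size_dim = size_dim / np.linalg.norm(size_dim)

def project(word):
    """Position of a target word along the size dimension."""
    return float(emb[word] @ size_dim)
```

Under these toy vectors, `project("tiger")` exceeds `project("mouse")`, mirroring the intuition that tigers are judged larger than mice.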

Of course, as briefly alluded in the introduction, people in different places, times, and cultures may make different judgments in the same contexts, possibly because they have different (associative) knowledge of the judgment targets. We can model these knowledge differences and consequent judgments by training different DS models on corpora that are representative of different populations.

For example, to model liberals’ knowledge and associations, we might train a model on the New York

Times and MSNBC, and to model conservatives’, we might train on The New York Post and Fox

5 In the lexical decision task, participants are presented with a prime word and then a target word in rapid succession, and must indicate as quickly and accurately as possible whether the target is a real word. Participants are faster to respond to the target when the prime is semantically related, as measured by cosine similarity between word vectors. 6 An active area of research involves removing such undesirable social biases from distributed semantic models, since many of these same models are used in NLP and AI applications (Bolukbasi, Chang, Zou, Saligrama, & Kalai, 2016). To the extent that people do have such biases, the presence of such biases in text-based distributed representations actually makes these models more psychologically realistic. In other words, the DS models that psychologists use to describe human behavior perhaps ought not to be debiased.

News. Several recent studies have taken this approach to study differences in social, political, and moral associations pertaining to media bias and social media structure (Bhatia, Goodwin, & Walasek,

2018; Holtzman, Schott, Jones, Balota, & Yarkoni, 2011; Hopkins, 2018). Similarly, it is possible to train a DS model on different time slices of a corpus that extends over long periods of time, and examine dot-product or cosine-driven associations of words within a time period, to study changing gender, class, and ethnic associations over time (Garg, Schiebinger, Jurafsky, & Zou, 2018; Hamilton,

Leskovec, & Jurafsky, 2016).

Process models based on distributed semantic representations

“A cosine is not what people do in a task”

-Jones, Gruenenfelder, & Recchia (2018, pg. 2)

Despite the important successes of using cosine similarity (or just dot product) between vectors to predict human behavior in judgment and decision making settings and related psychological applications, as Jones et al. (2018) put it, “a cosine is not what people do in a task” (pg. 2). That is, cosine similarity – by itself – is not a model of the process by which people use distributed representations to perform judgment, decision making, and other behaviors. To the extent that researchers often simply correlate cosine similarity with some behavioral measure (e.g., Griffiths et al., 2007; for discussion, see Jones, Willits, & Dennis, 2015 and Jones, Hills, & Todd, 2016), certain criticisms have been leveled against DS models that are really criticisms about cosine similarity, dot product, or related similarity measures, and how these are applied to DS representations. For example, cosine similarity and dot product are symmetric measures. Yet, there is evidence of asymmetries in some of the experimental tasks discussed above. In free association tasks, for example, the probability of generating ‘baby’ as a response to ‘stork’ is much greater than the reverse (Griffiths et al., 2007; Jones et al., 2018).

Similar results appear in classic studies on similarity: Tversky (1977) found that North Korea was judged to be more similar to China than was China to North Korea (but see ManyLabs 2 for a failure to replicate these effects with a high-powered sample, Klein et al., 2018). Second, cosine similarity and dot product weight all dimensions equally. This may be problematic for, e.g., judgments of similarity, where some dimensions may matter more in some contexts or when judging the similarity of items in one domain vs. another (Medin & Schaffer, 1978; Medin & Smith, 1981; Nosofsky, 1984, 1986). A third, closely related point is that computing cosine similarity between a judgment target and an experimenter-selected word (or set of words) for a judgment dimension (e.g., tastiness), as Grand et al.,

(2018) do, may actually underestimate the utility of the information in DS representations for modeling semantic judgments. Instead of stipulating the vector to represent a judgment dimension, and then projecting a target item vector on that dimension, we may obtain a more accurate model of judgment by using human ratings on a judgment dimension to supervise learning of the vector representing the judgment dimension onto which a judgment target is projected. With these issues in mind, we review in this section some applications of DS models that build DS representations into more accurate mathematical, statistical, or computational models of judgment and decision making. We note that this issue with cosine and other symmetric, unweighted distance measures harkens back to old debates about spatial models of concepts (Tversky, 1977; Tversky & Gati, 1982). In response to these concerns, proponents of spatial models (Krumhansl, 1978; Holman, 1979; Nosofsky, 1991) proposed particular processes operating on spatial representations, to handle, for example, asymmetries in similarity judgments (and other seeming violations of ‘metric axioms’, like the triangle inequality).

In recent work that, if not explicitly inspired by the above classic work, is certainly its spiritual successor, Jones et al. (2018) demonstrated how text-based DS models could model asymmetries in free association data. Rather than simply using cosine as a predictor of the probability with which one word cued another, they proposed combining cosine similarity with a parameter-free version of the Luce Choice rule (Luce, 1959), such that (writing S(·, ·) for the cosine similarity between two words' vectors):

P(target | cue) = S(cue, target) / Σ_{w : S(cue, w) > τ} S(cue, w)

where τ is a minimum similarity threshold parameter, such that the denominator only considers other words sufficiently similar to the cue. Intuitively, this means that the probability of responding to a cue with a particular target word takes into account not just the associative strength between the cue and the target, but also the associative strength of competitor targets. Because the neighborhood of the target word will often not be the same as that of the cue word, asymmetries in associative strength are expected (as are violations of the triangle inequality), and in fact, this approach can predict asymmetries in free association norms about as accurately as competitor models like latent Dirichlet allocation (Griffiths et al., 2007). A similar application of the Luce choice rule to text-based DS representations also seems to describe well the semantic-clustering behavior observed in verbal fluency tasks, where, for example, participants must list as many examples of animals as they can in one or two minutes (Hills, Jones, & Todd, 2012; Johns et al., 2018). The success of this approach suggests that new applications of DS models could look to older work on spatial models of representation, and perhaps to simple, classic process models like the Luce Choice rule, to accommodate asymmetries and other putative challenges to text-based DS representations and spatial models more generally.
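A minimal sketch of this thresholded Luce rule, using toy two-dimensional vectors (hypothetical values chosen only so that the two words' neighborhoods differ):

```python
import numpy as np

# Toy vectors; values are hypothetical and chosen only for illustration.
vectors = {
    "china":       np.array([1.0, 0.0]),
    "north_korea": np.array([0.8, 0.6]),
    "japan":       np.array([0.95, 0.3]),
    "usa":         np.array([0.9, -0.4]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def luce_choice(cue, target, vectors, tau=0.0):
    """P(target | cue): cue-target cosine similarity, normalized over all
    non-cue words whose similarity to the cue exceeds the threshold tau."""
    sims = {w: cosine(vectors[cue], vectors[w]) for w in vectors if w != cue}
    denom = sum(s for s in sims.values() if s > tau)
    return sims[target] / denom
```

Although cosine itself is symmetric, `luce_choice("north_korea", "china", vectors)` and `luce_choice("china", "north_korea", vectors)` differ here, because the denominators reflect each cue's own neighborhood.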

Another group of recent applications of DS representations to judgment that goes beyond correlating cosine similarity with behavioral measures involves training supervised models to predict a judgment from either a word vector itself, or from a vector of features derived from a pair of word vectors. For example, if we treat Table 1 (or rather a dimension-reduced version of it) as a design matrix X, and collect continuous taste judgments for the target words in Table 1 into a vector y, we can fit a supervised model f(X) = y. This is the exact approach many researchers have taken to model judgments, from the taste and nutrition of foods to the masculinity and femininity of personality traits (Richie, Zou, & Bhatia, 2019), the concreteness, valence, arousal and dominance of a broad selection of words (Hollis, Westbury, & Lefsrud, 2017)7, risk perceptions of potential risk sources (Bhatia, 2019), maternal mortality rate predictions for countries and calorie predictions for foods (Zou & Bhatia, 2019), judgments of primitive semantic features of objects (Utsumi, 2020; Li & Summers-Stay, 2019), and judgments of the desirability of foods and movies (Bhatia & Stewart, 2018). In many cases, purely linear models make remarkably accurate predictions, often explaining over half the variance in judgments (Hollis et al., 2017; Richie et al., 2019; Bhatia, 2019; Zou & Bhatia, 2019; Bhatia & Stewart, 2018; Utsumi, 2020), and often beat out nonlinear models like k-nearest neighbors and support vector machines (Richie et al., 2019; Bhatia, 2019), suggesting that rather simple, purely linear processes operating on distributed representations may be a large, if not the largest, component of a theory of judgment and decision making in these settings. At the same time, Utsumi (2020) does show that a multilayer perceptron with a single hidden layer generally makes better predictions of objects' attributes from DS representations, beating a purely linear model by a few points in predicted-vs-actual correlations under leave-one-out cross-validation, suggesting that at least for some judgments, non-linear processes may also play some role.
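A minimal sketch of this supervised mapping, using synthetic stand-ins for word vectors and ratings (real studies fit regularized regressions to actual embeddings and human judgments, typically with cross-validation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for real data: 50 "word" vectors in 10 dimensions (X),
# with ratings (y) generated by a linear rule, unknown to the model, plus noise.
X = rng.normal(size=(50, 10))               # design matrix of word vectors
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=50)  # continuous judgment ratings

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X'X + lam*I)^(-1) X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_hat = fit_ridge(X, y)
r = np.corrcoef(X @ w_hat, y)[0, 1]  # in-sample fit; real studies cross-validate
```

Because the synthetic ratings really are a linear function of the vectors, the fitted model here correlates almost perfectly with the ratings; with human data, linear fits of this kind often explain over half the variance, as noted above.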

Further, in the case of Richie et al. (2019), the supervised approach was found to make more accurate predictions of continuous judgment ratings than the purely association- or similarity-based approach of Grand et al. (2018). Recall that Grand et al. found that judgment ratings correlated with the dot product between a vector for a judgment target (say, 'salmon') and a vector representing a judgment dimension like taste (which would be derived by subtracting vectors for words low in the dimension, like 'disgusting', from vectors for words high in the dimension, like 'tasty'). As mentioned above, Richie et al. (2019) found that the best performing supervised models that mapped word vectors to judgments were linear models, which simply entail a dot product between a design matrix of judgment target vectors and the vector of best-fitting coefficients, i.e., f(X) = X · w. Thus, the Grand et al. (2018) approach differs from supervised linear models only in that, for the former, researchers stipulate a (possibly suboptimal) vector, while for the latter, judgment ratings supervise learning of the optimal vector onto which a judgment target projects via dot product. The supervised mapping approach can thus avoid the third set of concerns with applications that naively use cosine or dot product: that they do not allow for the flexible, optimal use of the information in word vectors to predict semantic judgments.

7 Some of these authors do not necessarily interpret their models predicting judgments from DS representations as cognitive models of judgment and decision making (e.g., Hollis et al., 2017; Utsumi, 2020). They may instead, for example, simply be interested in using their models to extrapolate large-scale lexical norms for other psychological applications (Hollis et al., 2017). Nevertheless, we believe their statistical models can be considered and evaluated as cognitive models.

Finally, because the weight vector of a linear model is of the same dimensionality as word vectors, the weight vector can be interpreted as a vector in the same semantic space. We can then ask what other words, which may not be plausible judgment targets for a particular judgment dimension, are highly associated with the judgment dimension, by computing the dot product between the weight vector and these other words (i.e., pass these other words’ vectors through the learned linear model).

Richie et al. (2019) and Zou and Bhatia (2019) both took this approach, finding, for example, that words concerning work were most associated with masculine traits, whereas words concerning home and family were most associated with feminine traits. These results confirmed prior empirical work (Heilman, 2012), despite not explicitly eliciting judgments about psychological constructs such as work or home. Of course, it is also possible to use this approach to generate novel behavioral hypotheses about the psychological substrates of judgment, which can then be tested empirically.
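For instance, with a hypothetical learned weight vector w and toy vectors for words that were never rated, the ranking is just a dot product per word:

```python
import numpy as np

# Hypothetical learned weight vector for a judgment dimension (e.g., one fit to
# masculinity ratings) and toy vectors for unrated words; all values invented.
w = np.array([1.0, -0.5, 0.2])
vocab = {
    "office": np.array([ 0.9, -0.4, 0.1]),
    "home":   np.array([-0.8,  0.6, 0.0]),
    "career": np.array([ 0.7, -0.3, 0.3]),
    "family": np.array([-0.6,  0.5, 0.1]),
}

# Pass each word through the learned linear model (a dot product) and rank.
scores = {word: float(vec @ w) for word, vec in vocab.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

With these invented values, work-related words land at the top of the ranking and home-related words at the bottom, the pattern the studies above report for masculinity.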

While there are, by now, several studies that successfully train models that map from DS representations to judgments about single words, there is relatively less work that successfully learns mappings from DS representations of pairs of words to judgments about the relations that hold between words, like similarity ('dog'-'hound'), part-whole relations ('car'-'wheel'), or class inclusion ('hammer'-'tool'). Initial attempts at modeling such relations with DS representations used simple vector arithmetic. For example, it was found that v_queen - v_king + v_man results in a vector very close, in terms of cosine distance, to v_woman, suggesting that relations may be encoded by particular directions in vector space (Mikolov et al., 2013). However, this approach seems not to be generally applicable across a wide range of relation types (Chen, Peterson, & Griffiths, 2017; Rogers, Drozd, & Li, 2017), leading to alternative DS-based models like BART, the Bayesian Analogy with Relational Transformations, which learns a mapping from two words' vectors to a judgment about whether a particular relation holds between them (Lu, Chen, & Holyoak, 2012; Lu, Wu, & Holyoak, 2019). BART has multiple components, but its key idea is to learn a linear mapping to relation judgments from the concatenation of (a) the raw difference vector between two words' vectors (e.g., v_car - v_wheel), and (b) the sorted difference vector (sort(v_car - v_wheel)). Lu et al. (2019) explain that the function of sorting is to "highlight semantic features that tend to be aligned with respect to functionally relevant differences between the two words in a pair" (p. 4177). Predictions from this model, trained on positive and negative examples of relations like similarity, class inclusion, part-whole, and many more, achieve out-of-sample correlations with human typicality judgments for relations that approach inter-rater reliability. Further, by comparing (1) the predicted distribution over relations between a word pair A:B and (2) the predicted distribution over relations between another word pair C:D, BART can solve 2-AFC analogy problems of the form A:B::C:D vs. D', where the foil D' is closely semantically related to C, but does not bear the same relation to C as B does to A (e.g., "insect:bee::fish:halibut" vs. "fish:water").

Importantly, BART’s predictions vastly outperform the predictions of cosine similarity between two word’s vectors, again illustrating the power of DS models when they are combined with an appropriate process model. DISTRIBUTED SEMANTICS, JUDGMENT, AND DECISION MAKING 20

Despite the power and flexibility of the vector mapping approach, we do note that it is not always as effective as certain alternatives. For example, Derby, Miller, and Devereux (2019) show that feature norm representations (e.g., ducks possess the feature ‘can fly’) are not predicted especially well by partial least squares regression from a word vector to a feature norm vector. Slightly better predictions of feature norms can be made by learning vectors for features themselves, and taking features with a high cosine similarity to the target concept word (i.e., the vector for ‘duck’ and the vector for ‘can fly’ ought to have high cosine similarity). Derby et al. show that useful feature vectors can be learned by passing pre-trained GloVe representations for concepts and existing feature norm concept-by-feature matrices through the skip-gram architecture; the only difference between their application and the typical application of skip-gram is that instead of embeddings being learned for context words, embeddings are learned for features, and the target word embeddings are pre-specified and fixed.

Future work

Overall, it is our view that the promise of (text-based) distributed representations has yet to be fully realized. Reaching this potential will involve many avenues of research. The first is, as the previous section illustrated, continuing to combine DS representations with cognitively plausible process models. Some of the phenomena reviewed earlier, like Likert-scale similarity judgments or priming in a lexical decision task, are ripe for such modeling. For example, the BART model makes accurate judgments of the typicality of candidate word pairs for the 'similar' relation, but the relative handful of tested word pairs tend to be sampled more or less randomly from the entire lexicon, which does not require distinguishing, for example, the subtle difference in similarity between 'pear'-'apple' and 'pear'-'banana'. Thus, a more stringent test of BART would involve predicting similarity judgments among all pairs of items in a narrow domain, say, animals or tools, i.e., the kind of similarity judgments that might be collected for multidimensional scaling or additive clustering. As in Lu et al. (2019), BART could also be quantitatively compared with simpler models of similarity operating on text-based distributed representations, like dot product or cosine similarity, Euclidean distance, and other distance metrics. We have conducted work in this direction (Richie & Bhatia, under review), and find that not only cosine similarity and other unweighted distance functions, but also BART, do not predict continuous ratings of fine-grained, within-domain similarity as well as a linear model on the Hadamard (elementwise) product of two words' vectors8 (see Peterson et al. (2018) for a similar successful application of this model to similarity judgments of naturalistic images, using vector representations derived from deep neural networks9). We also find that a model trained to predict similarity in one set of domains (e.g., clothing, vegetables, and vehicles) does not generalize to new domains (e.g., tools), suggesting context-sensitive or domain-specific weighting of dimensions, and again demonstrating the shortcomings of treating cosine similarity as a measure of word-word similarity, as it weighs all dimensions equally in all domains.
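A sketch of this Hadamard-product model on synthetic data. As footnote 8 notes, a linear model on elementwise products is just a weighted dot product, and here ordinary least squares recovers the (hypothetical) generating weights:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic word pairs: U[i] and V[i] are the two words' 5-dimensional vectors.
U = rng.normal(size=(40, 5))
V = rng.normal(size=(40, 5))

# Hypothetical 'true' dimension weights: some dimensions matter more than
# others for this (synthetic) similarity judgment.
w_true = np.array([2.0, 1.0, 0.5, 0.0, 0.0])
ratings = (U * V) @ w_true    # weighted dot product: sum_i w_i * u_i * v_i

# The model: ordinary least squares on the Hadamard (elementwise) product.
F = U * V                     # one row of features per word pair
w_hat, *_ = np.linalg.lstsq(F, ratings, rcond=None)
```

The learned weights capture which dimensions matter for similarity in the fitted domain; the finding above that such weights do not transfer to new domains suggests the weighting itself is domain-specific.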

Other tests of BART’s power may be advisable. For one, it could be tested on large benchmark datasets of graded hypernymy judgments like HyperLex (Vulić, Gerz, Kiela, Hill, & Korhonen, 2017).

Future work could also attempt to directly integrate older cognitive process models of analogy and reasoning, such as the Structure Mapping Engine10 (Falkenhainer, Forbus, & Gentner, 1989), with representations for words, relations, and propositions, captured by BART and related techniques.

8 This is actually equivalent to a weighted dot product of two word vectors, Σi wi·ui·vi, or to a weighted cosine similarity measurement if the vectors are L2-normalized first.

9 There is, in fact, a similar but perhaps smaller literature in psychology modeling judgment and decision making about images from deep neural network representations. Besides Peterson et al. (2018), see, for example, Guest and Love (2019), Sanders and Nosofsky (2018), and Lake, Zaremba, Fergus, and Gureckis (2015).

10 BART complements older models such as the Structure Mapping Engine (SME) in that BART shows how it is possible to learn to do some relational reasoning with distributed representations whose acquisition is formally specified. SME and older models like it do not specify how the representations they operate over are acquired, and usually use hand-crafted representations. One criticism of SME is that its power comes from these representations, not its processes (Chalmers, French, & Hofstadter, 1992, but see Forbus, Gentner, Markman, & Ferguson, 1998).

The lexical decision task, on the other hand, has previously been modeled with the drift diffusion model (DDM; Ratcliff, Gomez, & McKoon, 2004). In a semantic priming setting, it ought to be possible to use DS models, and a (possibly asymmetric) measure of semantic priming based on DS models, to derive the DDM's drift rate parameter v, which encodes the strength of evidence accumulation. Alternatively, text-based distributed representations could be combined with attractor neural network models of semantic priming that have previously only been tested with hand-crafted, toy representations (Lerner, Bentin, & Shriki, 2012). This would allow quantitative evaluation on real semantic priming data (e.g., Hutchison et al., 2013), and possibly comparison to models of spreading activation on semantic networks built from free association data (Siew, 2019; Nelson, McEvoy, & Schreiber, 2004). Although it may be challenging to scale up such models, it will be critical to develop dynamical models that use DS representations to make predictions not just about the eventual judgment or decision, but also about its time course (Bhatia & Pleskac, 2019). Modeling judgments by learning mappings from word vectors directly to judgments could also be improved. In our own work, for example, we have noticed that purely linear models, while achieving the best overall performance, tend to make systematic errors, overestimating the low ends of judgment dimensions and underestimating the high ends, suggesting that these models need additional, non-linear transformations of their predictions.
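The drift-rate suggestion above (deriving the DDM's drift rate from a DS-based priming measure) can be sketched as follows; the linear linkage v = v0 + beta * cos(prime, target), and all numeric values, are our own hypothetical assumptions, not an established model:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def simulate_ddm(drift, boundary=1.0, noise=0.3, dt=0.01, seed=0, max_steps=100000):
    """One drift-diffusion trial: noisy evidence accumulates until a boundary
    is crossed; returns (hit_upper_boundary, reaction_time)."""
    rng = np.random.default_rng(seed)
    x, t = 0.0, 0.0
    for _ in range(max_steps):
        x += drift * dt + noise * np.sqrt(dt) * rng.normal()
        t += dt
        if abs(x) >= boundary:
            break
    return x >= boundary, t

# Hypothetical linkage: drift rate increases linearly with prime-target
# cosine similarity (toy two-dimensional vectors, invented values).
prime = np.array([1.0, 0.2])
related = np.array([0.9, 0.3])      # e.g., a semantically related target
unrelated = np.array([-0.1, 1.0])   # e.g., an unrelated target
v0, beta = 0.5, 1.0
drift_related = v0 + beta * cosine(prime, related)
drift_unrelated = v0 + beta * cosine(prime, unrelated)
```

Because the related target receives a higher drift rate, simulated trials for it will tend to reach the boundary faster, yielding the priming-style speedup this proposal is meant to capture.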

Second, several very recent developments in DS algorithms have yet to be taken up by judgment and decision making scholars. For one, computer scientists have recently developed 'dynamic word embeddings', which take corpora from different time periods, and learn a word vector space for each time period, such that the movement of word vectors from period to period reflects evolution in meaning11 (Yao, Sun, Ding, Rao, & Xiong, 2018; Bamler & Mandt, 2017). In principle, it should be possible to use such techniques to model the judgments people would have made in the past, by fitting models from current-day embeddings to current-day judgments, and then deploying such models on historical embeddings to predict historical judgments. Given that this mapping approach (Richie et al., 2019) predicts contemporary judgments more accurately than does relative similarity between judgment targets and words representing the judgment dimensions (Grand et al., 2018), this approach should make more accurate historical predictions than the similarity-based approaches of Garg et al. (2018) and Hamilton et al. (2016)12.

For another, computer scientists have also developed DS models such that only similar, but not merely related, words are near each other in vector space13 (Wieting, Bansal, Gimpel, & Livescu, 2015; Ponti, Vulić, Glavaš, Mrkšić, & Korhonen, 2018). Thus, word pairs like 'doctor' and 'nurse' would be nearby, but not pairs like 'doctor' and 'hospital'. These new techniques perform more accurately on benchmarks of similarity judgments that control for mere relatedness or association (Hill et al., 2015), and aid performance on certain downstream NLP tasks like lexical simplification. However, it is unclear whether these newer techniques generally perform as well on other cognitive tasks as more traditional DS algorithms do. For example, such representations may not account for priming behavior as well, since related but not similar word pairs ('doctor'-'hospital') can prime each other. Nor is it known whether such newer techniques can model the kind of more stringent, within-domain similarity data we mentioned in the previous paragraph. More generally, while the development of different kinds of DS representations for different tasks may be acceptable for engineering purposes (the typical NLP aim), it is unclear if we want to posit that people have different representations for a single word depending on the task. For the sake of parsimony, at least, it may be preferable to have a single representation for a word that can then be used in different ways depending on the task at hand (e.g., computing similarity but not relatedness, vs. semantic relatedness of any kind). Perhaps there ought to be a 'meta-model' that would select a particular set of task-appropriate processes, or alter word representations in task-appropriate ways (perhaps by biasing memory, or differentially activating certain features or dimensions).

11 This was not straightforward with earlier techniques, because the cost functions for training most DS models are invariant to rotation. As a result, if one trains a separate DS model for each time period, the learned vectors may not lie in the same latent space. It is possible to try alignment techniques like Procrustes analysis (Hamilton et al., 2016), but it is difficult to distinguish imperfections of the approximate rotation from true semantic drift (Bamler & Mandt, 2017).

12 Unfortunately, in our preliminary work in this direction, we have found that many word representations are not stable enough over time for this approach to work. For example, the representation for 'bed' changes dramatically over time, yielding very unstable risk perceptions, even though risk perceptions for 'bed' have almost certainly always been low. This may be due to the smaller amount of historical natural language data available. We expect this will be less of a problem for future researchers, as large amounts of historical language data are being digitized almost daily.

13 This approach typically uses instances of synonymy, antonymy, or paraphrases to retrofit pre-trained DS models or to jointly train DS models.

Probably the most notable recent advancements in text-based distributed representations, however, lie in so-called 'compositional distributed representations'. These are deep neural networks (USE, Cer et al., 2018; ELMo, Peters et al., 2018; BERT, Devlin et al., 2018; see Smith, 2019 for a review) that derive fixed-length vector representations for individual words contextualized by the rest of the sentence, and/or a vector for an entire sentence itself. These newer techniques have achieved unprecedented success on a range of natural language processing tasks, including question answering (given a question, selecting a span of text in a passage that answers it), textual entailment (judging whether a hypothesis follows from a premise), sentence acceptability (judging whether a string of words is an acceptable sentence), and sentiment analysis (labeling a sentence from very negative to very positive). These new techniques could deliver representations for cognitive models of judgments about words in context (e.g., otherwise ambiguous words like 'bank' that are disambiguated by context; Jamieson et al., 2018; Scott, Keital, Becirspahic, Yao, & Sereno, 2018), or judgments about phrases and sentences (e.g., the taste of 'curry chicken' vs. 'barbecue chicken'; see Shwartz and Dagan, 2019 for evaluations of these methods for predicting linguistic judgments about phrases and sentences). As a hypothetical example of the latter, it may be possible to use these models to obtain vector representations for survey prompts commonly used in judgment and decision making research, e.g., "I am likely to disagree with an authority figure" or "I am likely to sunbathe without sunscreen" (see Weber et al., 2002), and model participants' agree/disagree responses to such prompts. Inspired by both the verbal fluency literature described above and dynamical decision making models like the drift diffusion model, we have also been using these newer DS models to derive vector representations for sentences listed as thoughts in response to naturalistic decision prompts like "Is nuclear power safe?", and subsequently model the dynamics of memory search and decision making (Zhao, Richie, & Bhatia, revise/resubmit).

Conclusion

We began this chapter by stating that a formal understanding of judgment and decision making requires two components: (a) a theory of our knowledge of the things we make judgments and decisions about, and (b) a theory of how that knowledge is used to make judgments and decisions. We believe text-based distributed representations constitute the first half (a) of such an understanding.

While it is encouraging that simple measures that compare two item representations, like cosine similarity, roughly predict human behaviors like priming or association-based judgment, we, along with others (Jones et al., 2018), believe that this does not constitute a theory of (b), how our knowledge is used to make judgments and decisions, and that it actually underestimates the potential of DS representations for modeling judgment and decision making, and cognition more generally. Work that combines text-based DS representations with appropriate process models of judgment and decision making is still relatively new, and we believe this will be an active area of research, along with applications of compositional distributed semantics to the study of judgment and decision making about propositions and other complex representations.

Acknowledgments

Funding was received from the National Science Foundation grant SES-1847794 and the Alfred P. Sloan Foundation.

References

Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. In International Conference on Database Theory (pp. 420-434). Springer, Berlin, Heidelberg.

Bamler, R., & Mandt, S. (2017). Dynamic word embeddings. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (pp. 380-389). JMLR.org.

Bhatia, S. (2017a). The semantic representation of prejudice and stereotypes. Cognition, 164, 46-60.

Bhatia, S. (2017b). Associative judgment and vector space semantics. Psychological Review, 124, 1-20.

Bhatia, S. (2019). Predicting risk perception: New insights from data science. Management Science, 65(8), 3449-3947.

Bhatia, S., Goodwin, G., & Walasek, L. (2018). Trait associations for Hillary Clinton and Donald Trump in news media: A computational analysis. Social Psychological and Personality Science, 9, 123-130.

Bhatia, S., & Pleskac, T. J. (2019). Preference accumulation as a process model of desirability ratings. Cognitive Psychology, 109, 47-67.

Bhatia, S., & Stewart, N. (2018). Naturalistic multiattribute choice. Cognition, 179, 71-88.

Bhatia, S., & Walasek, L. (2019). Association and response accuracy in the wild. Memory & Cognition, 47(2), 292-298.

Blei, D. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.

Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29, 4349-4357.

Bruni, E., Boleda, G., Baroni, M., & Tran, N. K. (2012, July). Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1 (pp. 136-145). Association for Computational Linguistics.

Buchanan, E. M., Valentine, K. D., & Maxwell, N. P. (2019). English semantic feature production norms: An extended database of 4436 concepts. Behavior Research Methods, 1-15. https://doi.org/10.3758/s13428-019-01243-z

Bullinaria, J. A., & Levy, J. P. (2007). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39, 510-526. http://dx.doi.org/10.3758/BF03193020

Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356, 183-186.

Cer, D., Yang, Y., Kong, S. Y., Hua, N., Limtiaco, N., John, R. S., ... & Sung, Y. H. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Chalmers, D. J., French, R. M., & Hofstadter, D. R. (1992). High-level perception, representation, and analogy: A critique of artificial intelligence methodology. Journal of Experimental & Theoretical Artificial Intelligence, 4(3), 185-211.

Chen, D., Peterson, J. C., & Griffiths, T. L. (2017). Evaluating vector-space models of analogy. arXiv preprint arXiv:1705.04416.

Devereux, B. J., Tyler, L. K., Geertzen, J., & Randall, B. (2014). The Centre for Speech, Language and the Brain (CSLB) concept property norms. Behavior Research Methods, 46(4), 1119-1127.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Falkenhainer, B., Forbus, K., & Gentner, D. (1989). The structure-mapping engine: Algorithm and examples. Artificial Intelligence, 41(1), 1-63.

Firth, J. R. (1957). A synopsis of linguistic theory, 1930-1955. In Studies in Linguistic Analysis (pp. 1-32). Oxford, UK: Blackwell.

Forbus, K. D., Gentner, D., Markman, A. B., & Ferguson, R. W. (1998). Analogy just looks like high level perception: Why a domain-general approach to analogical mapping is right. Journal of Experimental and Theoretical Artificial Intelligence, 10(2), 231-257.

Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115, E3635-E3644.

Gilovich, T., Griffin, D., & Kahneman, D. (Eds.). (2002). Heuristics and biases: The psychology of intuitive judgment. Cambridge University Press.

Grand, G., Blank, I. A., Pereira, F., & Fedorenko, E. (2018). Semantic projection: Recovering human knowledge of multiple, distinct object features from word embeddings. arXiv preprint arXiv:1802.01241.

Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114, 211-244.

Guest, O., & Love, B. C. (2019). Levels of representation in a deep learning model of categorization. bioRxiv. https://doi.org/10.1101/626374

Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (pp. 1489-1501).

Harris, Z. S. (1954). Distributional structure. Word, 10, 146-162.

Hill, F., Reichart, R., & Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41, 665-695.

Hills, T. T., Jones, M. N., & Todd, P. M. (2012). Optimal foraging in semantic memory. Psychological Review, 119, 431-440.

Heilman, M. E. (2012). Gender stereotypes and workplace bias. Research in Organizational Behavior, 32, 113-135.

Hollis, G., Westbury, C., & Lefsrud, L. (2017). Extrapolating human judgments from skip-gram vector representations of word meaning. Quarterly Journal of Experimental Psychology, 70, 1603-1619.

Holman, E. W. (1979). Monotonic models for asymmetric proximities. Journal of Mathematical Psychology, 20, 1-15.

Holtzman, N. S., Schott, J. P., Jones, M. N., Balota, D. A., & Yarkoni, T. (2011). Exploring media bias with semantic analysis tools: Validation of the Contrast Analysis of Semantic Similarity (CASS). Behavior Research Methods, 43, 193-200.

Hopkins, D. J. (2018). The exaggerated life of death panels? The limited but real influence of elite rhetoric in the 2009-2010 health care debate. Political Behavior, 40, 681-709.

Hutchison, K. A., Balota, D. A., Neely, J. H., Cortese, M. J., Cohen-Shikora, E. R., Tse, C. S., ... & Buchanan, E. (2013). The semantic priming project. Behavior Research Methods, 45(4), 1099-1114.

Johns, B. T., Taler, V., Pisoni, D. B., Farlow, M. R., Hake, A. M., Kareken, D. A., ... Jones, M. N. (2018). Cognitive modeling as an interface between brain and behavior: Measuring the semantic decline in mild cognitive impairment. Canadian Journal of Experimental Psychology, 72, 117-126.

Jones, M. N., Gruenenfelder, T. M., & Recchia, G. (2018). In defense of spatial models of semantic representation. New Ideas in Psychology, 50, 54-60.

Jones, M. N., Hills, T. T., & Todd, P. M. (2015). Hidden processes in structural representations: A reply to Abbott, Austerweil, and Griffiths. Psychological Review, 122, 570-574.

Jones, M.N., Kintsch. W., & Mewhort, D.J.K. (2006). High-dimensional semantic space accounts of

priming. Journal of Memory and Language, 55, 534-552.

Jones, M.N., & Mewhort, D.J. (2007). Representing word meaning and order information in a

composite holographic lexicon. Psychological Review, 114, 1-37.

Jones, M.N., Willits, J.A., & Dennis, S. (2015). Models of semantic memory. In J. R. Busemeyer, Z.

Wang, J. T. Townsend, & A. Eidels (Eds.), Oxford Handbook of Computational and

Mathematical Psychology (pp. 232-254). Oxford University Press.

Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams Jr, R. B., Alper, S., ... & Batra, R.

(2018). Many Labs 2: Investigating variation in replicability across samples and settings.

Advances in Methods and Practices in Psychological Science, 1(4), 443-490.

Krumhansl, C. L. (1978). Concerning the applicability of geometric models to similarity data: The

interrelationship between similarity and spatial density. Psychological Review, 85, 445-463.

Lake, B. M., Zaremba, W., Fergus, R., & Gureckis, T. M. (2015). Deep neural networks predict

category typicality ratings for images. In Proceedings of the 37th Annual Conference of the

Cognitive Science Society.

Landauer, T.K., & Dumais, S. (1997). A solution to Plato’s problem: the latent semantic analysis theory

of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.

Lenci, A. (2018). Distributional models of word meaning. Annual Review of Linguistics, 4, 151-171.

Lerner, I., Bentin, S., & Shriki, O. (2012). Spreading activation in an attractor network with latching

dynamics: automatic semantic priming revisited. Cognitive Science, 36(8), 1339-1382.

Levy, O., & Goldberg, Y. (2014). Dependency-based word embeddings. In Proceedings of the 52nd

Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

Vol. 2, pp. 302–308.

Li, D., & Summers-Stay, D. (2019). Mapping Distributional Semantics to Property Norms with Deep

Neural Networks. Big Data and Cognitive Computing, 3(2), 30.

Lu, H., Chen, D., & Holyoak, K. J. (2012). Bayesian analogy with relational transformations.

Psychological Review, 119(3), 617.

Lu, H., Wu, Y. N., & Holyoak, K. J. (2019). Emergence of analogy from relation learning. Proceedings

of the National Academy of Sciences, 116(10), 4176-4181.

Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. New York: Wiley.

Mandera, P., Keuleers, E., & Brysbaert, M. (2017). Explaining human performance in psycholinguistic

tasks with models of semantic similarity based on prediction and counting: a review and

empirical validation. Journal of Memory and Language, 92, 57-78.

McRae, K., Cree, G.S., Seidenberg, M.S., & McNorgan, C. (2005). Semantic feature production norms

for a large set of living and nonliving things. Behavior Research Methods, 37, 547-559.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological

Review, 85, 207–238.

Medin, D. L., & Smith, E. E. (1981). Strategies and classification learning. Journal of Experimental

Psychology: Human Learning and Memory, 7, 241–253. http://dx.doi.org/10.1037/0278-

7393.7.4.241

Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed representations of

words and phrases and their compositionality. Advances in Neural Information Processing

Systems, 3111-3119.

Navarro, D. J. & Lee, M. D. (2003). Combining dimensions and features in similarity-based

representations. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in Neural

Information Processing Systems (pp. 67-74).

Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida free

association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, &

Computers, 36, 402–407. https://doi.org/10.3758/BF03195588.

Nosofsky, R. M. (1984). Choice, similarity, and the context theory of classification. Journal of

Experimental Psychology: Learning, Memory, and Cognition, 10, 104 –114.

http://dx.doi.org/10.1037/0278-7393.10.1.104

Nosofsky, R.M. (1986). Attention, similarity, and the identification-categorization relationship. Journal

of Experimental Psychology: General, 115, 39–57. http://dx.doi.org/10.1037/0096-

3445.115.1.39

Nosofsky, R.M. (1991). Stimulus bias, asymmetric similarity, and classification. Cognitive Psychology,

23(1), 94-140.

Nosofsky, R.M. (2011). The generalized context model: An exemplar model of classification. In E. M.

Pothos & A. J. Wills (Eds.), Formal approaches in categorization (pp. 18–39). New York, NY:

Cambridge University Press.

Nosofsky, R. M., Sanders, C. A., & McDaniel, M. A. (2018). Tests of an exemplar-memory model of

classification learning in a high-dimensional natural-science category domain. Journal of

Experimental Psychology: General, 147, 328-353.

Nosofsky, R.M., Sanders, C., Meagher, B.J., & Douglas, B.J. (2018). Toward the development of a

feature-space representation for a complex natural category domain. Behavior Research

Methods, 50, 530-556. doi:10.3758/s13428-017-0884-8.

Oppenheimer, D. M., & Kelso, E. (2015). Information processing as a paradigm for decision making.

Annual Review of Psychology, 66, 277-294.

Patel, A., Sands, A., Callison-Burch, C., & Apidianaki, M. (2018). Magnitude: A fast, efficient

universal vector embedding utility package. In Proceedings of the 2018 Conference on

Empirical Methods in Natural Language Processing: System Demonstrations (pp. 120-126).

Pennington, J., Socher, R., & Manning, C.D. (2014). GloVe: global vectors for word representation. In

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing,

1532-1543.

Pereira, F., Gershman, S., Ritter, S., & Botvinick, M. (2016). A comparative evaluation of off-the-shelf

distributed semantic representations for modelling behavioural data. Cognitive

Neuropsychology, 33, 175-190.

Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep

Contextualized Word Representations. arXiv preprint arXiv:1802.05365.

Peterson, J. C., Abbott, J. T., & Griffiths, T. L. (2018). Evaluating (and improving) the correspondence

between deep neural networks and human representations. Cognitive Science, 42(8), 2648-2669.

Ponti, E. M., Vulić, I., Glavaš, G., Mrkšić, N., & Korhonen, A. (2018). Adversarial Propagation and

Zero-Shot Cross-Lingual Transfer of Word Vector Specialization. In Proceedings of the 2018

Conference on Empirical Methods in Natural Language Processing (pp. 282-293).

Ratcliff, R., Gomez, P., & McKoon, G. (2004). A diffusion model account of the lexical decision task.

Psychological Review, 111(1), 159.

Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In

Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.

Richie, R., & Bhatia, S. (under review). Similarity judgment within and across categories: A

comprehensive model comparison.

Richie, R., Zou, W., & Bhatia, S. (2019). Semantic representations extracted from large language corpora

predict high-level human judgment in seven diverse behavioral domains. Collabra: Psychology,

5(1), 50.

Rogers, A., Drozd, A., & Li, B. (2017). The (too many) problems of analogical reasoning with word

vectors. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*

SEM 2017) (pp. 135-148).

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology,

39(6), 1161.

Sanders, C. A., & Nosofsky, R. M. (2018). Using deep-learning representations of complex natural

stimuli as input to psychological models of classification. Proceedings of the 40th Annual

Conference of the Cognitive Science Society.

Scott, G. G., Keitel, A., Becirspahic, M., Yao, B., & Sereno, S. C. (2019). The Glasgow Norms: Ratings

of 5,500 words on nine scales. Behavior Research Methods, 51(3), 1258-1270.

Shepard, R.N. (1974). Representation of structure in similarity data: problems and prospects.

Psychometrika, 39, 373-421.

Shepard, R. N., & Arabie, P. (1979). Additive clustering: Representation of similarities as combinations

of discrete overlapping properties. Psychological Review, 86(2), 87.

Shwartz, V., & Dagan, I. (2019). Still a Pain in the Neck: Evaluating Text Representations on Lexical

Composition. arXiv preprint arXiv:1902.10618.

Siew, C. S. (2019). spreadr: An R package to simulate spreading activation in a network. Behavior

Research Methods, 51(2), 910-929.

Smith, N. A. (2019). Contextual Word Representations: A Contextual Introduction. arXiv preprint

arXiv:1902.06006.

Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics.

Journal of Artificial Intelligence Research, 37, 141-188.

Tversky, A. (1977). Features of similarity. Psychological Review, 84, 327-352.

Tversky, A., & Gati, I. (1982). Similarity, separability and the triangle inequality. Psychological

Review, 89, 123-154.

Utsumi, A. (2020). Exploring what is encoded in distributional word vectors: A neurobiologically

motivated analysis. Cognitive Science, 44(6), e12844.

Vulić, I., Gerz, D., Kiela, D., Hill, F., & Korhonen, A. (2017). Hyperlex: A large-scale evaluation of

graded lexical entailment. Computational Linguistics, 43(4), 781-835.

Weber, E. U., Blais, A.-R., & Betz, N. (2002). A domain-specific risk-attitude scale: Measuring risk

perceptions and risk behaviors. Journal of Behavioral Decision Making, 15, 263-290.

Weber, E. U., & Johnson, E. J. (2009). Mindful judgment and decision making. Annual Review of

Psychology, 60, 53-85.

Wieting, J., Bansal, M., Gimpel, K., & Livescu, K. (2015). From paraphrase database to compositional

paraphrase model and back. Transactions of the Association for Computational Linguistics, 3,

345-358.

Yao, Z., Sun, Y., Ding, W., Rao, N., & Xiong, H. (2018). Dynamic word embeddings for evolving

semantic discovery. In Proceedings of the Eleventh ACM International Conference on Web

Search and Data Mining (pp. 673-681). ACM.

Zhao, W.J., Richie, R., & Bhatia, S. (revise/resubmit). Context in decisions from memory.

Zou, W., & Bhatia, S. (2019). Modeling judgment errors in naturalistic numerical estimation.

Proceedings of the 41st Annual Conference of the Cognitive Science Society, 3227-3233.