
COGNITIVE NEUROPSYCHOLOGY, 2016 VOL. 33, NOS. 3–4, 175–190 http://dx.doi.org/10.1080/02643294.2016.1176907

A comparative evaluation of off-the-shelf distributed semantic representations for modelling behavioural data

Francisco Pereira (a), Samuel Gershman (b), Samuel Ritter (c) and Matthew Botvinick (d)

(a) Medical Imaging Technologies, Siemens Healthcare, USA; (b) Department of Psychology and Center for Brain Science, Harvard University, Cambridge, USA; (c) Princeton Neuroscience Institute, Princeton University, Princeton, USA; (d) Google DeepMind, London, UK

ABSTRACT
In this paper we carry out an extensive comparison of many off-the-shelf distributed semantic vector representations of words, for the purpose of making predictions about behavioural results or human annotations of data. In doing this comparison we also provide a guide for how vector similarity computations can be used to make such predictions, and introduce many resources available both in terms of datasets and of vector representations. Finally, we discuss the shortcomings of this approach and future research directions that might address them.

ARTICLE HISTORY
Received 15 May 2015; Revised 31 March 2016; Accepted 5 April 2016

KEYWORDS
Distributed semantic representation; evaluation; semantic space; semantic vector

1. Introduction

1.1. Distributed semantic representations

We are interested in one particular aspect of conceptual representation – the meaning of a word – insofar as it is used in the performance of semantic tasks. The study of concepts in general has a long and complex history, and we will not attempt to do it justice here (see Margolis & Laurence, 1999, and Murphy, 2002). Researchers have approached the problem of modelling meaning in diverse ways. One approach is to build representations of a concept – a word used in one specific sense – by hand, using some combination of linguistic, ontological, and featural knowledge. Examples of this approach include WordNet (Miller, Beckwith, Fellbaum, Gross, & Miller, 1990), Cyc (Lenat, 1995), and semantic feature norms collected by various research groups (e.g., McRae, Cree, Seidenberg, and McNorgan, 2005, and Vinson & Vigliocco, 2008). An alternative approach, known as distributional semantics, starts from the idea that words occurring in similar linguistic contexts – sentences, paragraphs, documents – are semantically similar (see Sahlgren, 2008, for a review). A major practical advantage of distributional semantics is that it enables automatic extraction of semantic representations by analysing large corpora of text. Since the computational tasks we are trying to solve (and the more general problem of concept representation in the brain) require models that are general enough to encompass the entire English vocabulary as well as arbitrary linguistic combinations, our focus will be on distributional semantic models. Existing hand-engineered systems cannot yet be used to address all the tasks that we consider.

Common to many distributional semantic models is the idea that semantic representations can be conceived as vectors in a metric space, such that proximity in vector space captures a geometric notion of semantic similarity (Turney & Pantel, 2010). This idea has been important both for psychological theorizing (Howard, Shankar, & Jagadisan, 2011; Landauer & Dumais, 1997; Lund & Burgess, 1996; McNamara, 2011; Steyvers, Shiffrin, & Nelson, 2004) as well as for building practical natural language processing systems (Collobert & Weston, 2008; Mnih & Hinton, 2007; Turian, Ratinov, & Bengio, 2010). However, vector space models are known to have a number of weaknesses. The psychological structure of similarity appears to disagree with some aspects of the geometry implied by vector space models, as evidenced by asymmetry of similarity judgments and violations of the triangle inequality (Griffiths, Steyvers, & Tenenbaum, 2007; Tversky, 1977). Furthermore, many

vector space models do not deal gracefully with polysemy or word ambiguity (but see Jones, Gruenenfelder, & Recchia, 2011, and Turney & Pantel, 2010). Recently, a number of different researchers have started focusing on producing vector representations for specific meanings of words (Huang, Socher, Manning, & Ng, 2012; Neelakantan, Shankar, Passos, & McCallum, 2015; Reisinger & Mooney, 2010; Yao & Van Durme, 2011), but these are still of limited use without some degree of manual intervention to pick which meanings to use in generating predictions. We discuss these together with other available approaches in Section 2.1. In the work reported here, we do not attempt to address these issues directly; our goal is to compare the effectiveness of different vector representations of words, rather than comparing them with other kinds of models.

1.2. Modelling human data

Ever since Landauer and Dumais (1997) demonstrated that distributed semantic representations could be used to make predictions about human performance in semantic tasks, numerous researchers have used measures of (dis)similarity between word vectors – cosine similarity, euclidean distance, correlation – for that purpose. There are now much larger test datasets than the TOEFL synonym test used in Landauer and Dumais (1997), containing hundreds to thousands of judgments on tasks such as word association, analogy, and semantic relatedness and similarity, as described in Section 2.3. The availability of LSA as a web service¹ for calculating similarity between words or documents has also allowed researchers to use it as a means of obtaining a kind of "ground truth" for purposes such as generating stimuli (e.g., Green, Kraemer, Fugelsang, Gray, & Dunbar, 2010). In parallel with all this work, researchers within the machine learning community have developed many other distributed semantic representations, mostly used as components of systems carrying out a variety of natural language processing tasks, ranging from information retrieval to sentiment classification (Wang & Manning, 2012).

Beyond behavioural data, distributed semantic representations have been used in cognitive neuroscience, in the study of how semantic information is represented in the brain. More specifically, they have been used as components of forward models of brain activation, as measured with functional magnetic resonance imaging (fMRI), in response to semantic stimuli (e.g., a picture of an object together with the word naming it, or the word alone). Such models learn a mapping between the degree to which a dimension in a distributed semantic representation vector is present and its effect on the overall spatial pattern of brain activation. These models can be inverted to decode semantic vectors from patterns of brain activation, allowing validation of the mappings by classifying the mental state in new data; this can be done by comparing the decoded vectors with "true" vectors extracted from a text corpus.
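As an illustration of this forward-modelling scheme, consider the following minimal sketch (ours, not the code from any of the studies discussed; the synthetic data, array sizes, and the choice of ridge regression are assumptions made for the example):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-ins for real data: one semantic vector and one fMRI
# activation pattern per concept (all sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
n_concepts, n_dims, n_voxels = 60, 50, 500
S = rng.normal(size=(n_concepts, n_dims))                  # semantic vectors
X = S @ rng.normal(size=(n_dims, n_voxels)) + \
    0.1 * rng.normal(size=(n_concepts, n_voxels))          # activation patterns

train, test = np.arange(40), np.arange(40, 60)

# Decoding direction: predict the semantic vector from the activation pattern.
decoder = Ridge(alpha=1.0).fit(X[train], S[train])
decoded = decoder.predict(X[test])

# Validate by classification: each decoded vector should be closest (by
# cosine similarity) to the "true" vector of the concept it came from.
Sn = S[test] / np.linalg.norm(S[test], axis=1, keepdims=True)
Dn = decoded / np.linalg.norm(decoded, axis=1, keepdims=True)
accuracy = np.mean(np.argmax(Dn @ Sn.T, axis=1) == np.arange(len(test)))
print(f"decoding accuracy: {accuracy:.2f}")
```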
Reviewing this literature is beyond the scope of this paper, but we will highlight particularly relevant work. The seminal publication in this area is Mitchell et al. (2008), which showed that it was possible to build such forward models, and use them to make predictions about new imaging data. The authors represented concepts by semantic vectors whose dimensions corresponded to different verbs; the vector for a particular concept was derived from co-occurrence counts of the word naming the concept and each of those verbs (e.g., the verb "run" co-occurs more often with animate beings than inanimate objects). Subsequently, Just, Cherkassky, Aryal, and Mitchell (2010) produced more elaborate vectors from human judgments, with each dimension corresponding to one of tens of possible semantic features. In both cases, this allowed retrospective interpretation of patterns of activation corresponding to each semantic dimension (e.g., ability to manipulate corresponded to activation in the motor cortex). Other groups re-analysing the data from Mitchell et al. (2008) showed that superior decoding performance could be obtained by using distributed semantic representations rather than human-postulated features (e.g., Pereira, Detre, & Botvinick, 2011, and Liu, Palatucci, & Zhang, 2009). In particular, Pereira et al. (2011) used a topic model of a small corpus of Wikipedia articles to learn a semantic representation where each dimension corresponded to an interpretable dimension shared by a number of related semantic categories. Furthermore, the semantic vectors decoded from brain images for related concepts exhibited similarity structure that echoed the similarity structure present in word association data, and could also be used to generate words pertaining to the mental contents at the time the images were acquired. A systematic comparison of the effectiveness of various kinds of distributed semantic representations in decoding can be found in Murphy, Talukdar, and Mitchell (2012a). This work has led researchers to consider distributed semantic representations as a core component of forward models of brain activation in semantic tasks, or even to try to incorporate brain activation in the process of learning a representation (Fyshe, Talukdar, Murphy, & Mitchell, 2014). The pressing question, from that perspective, is whether representations contain enough information about the various aspects of meaning that might be elicited by thinking about a concept. This question was tackled in Bullinaria and Levy (2013), and the authors concluded that the representations currently in use are already very good for decoding purposes, and that the quality of the fMRI data is the main limit of what can be achieved with current approaches. As this conclusion was drawn from datasets containing activation images in response to a few tens of concrete concepts, we believe that we should not look at fMRI to try to gauge the relative information content of different representations; rather, we should use behavioural data to the extent possible, over words naming all kinds of concepts that might be stimuli in experiments. This was the original motivation for this paper.

Our first goal is thus to evaluate how suitable different distributed semantic representations are for reproducing human performance on behavioural experiments, or for predicting human annotations of data from such tasks. We restrict the comparison to available off-the-shelf representations, because we believe many researchers cannot, or would rather not, go through the trouble of producing their own from a corpus of their choice. As we will see later, the size of the corpus used in producing a representation is a major factor in the quality of the predictions made, and this makes such production logistically complicated, at the very least (because of preparation effort, running time, memory required, etc.). In the same spirit, we would like to have our comparison also act as a tutorial showing that such predictions can be made and contrasted across representations.

Our second, related, goal is to determine how appropriate vector similarity is for modelling such data, as tasks become more varied and complex. To that effect, we carried out comparative experiments across a range of tasks – word association, relatedness and similarity ratings, synonym and analogy problems – for all the most commonly used off-the-shelf representations.

To do this, we had to assume that the information derived from text corpora suffices to make behavioural predictions; existing literature, and our own experience, tell us that this is the case. But is this the same semantic information that would be contained in semantic features elicited from human subjects, for instance? Does it bear any resemblance to the actual representations of stimuli created in the brain while semantic tasks are performed? Can we even say that there is a single representation, or could it be mostly task dependent? Carrying out practical tasks such as sentiment detection using a distributed semantic representation does not require the answer to any of these questions, and neither does decoding from fMRI. The collection of feature norms such as those of McRae, Cree, Seidenberg, and McNorgan (2005) has sometimes led to semantic feature representations being viewed as more "real" or "interpretable" than distributed semantic representations. It is possible to constrain the problem of estimating a distributed semantic representation so that the resulting dimensions look more like semantic features (e.g., values are positive or lie in the range [0, 1], the semantic vectors are sparse), as shown by Murphy, Talukdar, and Mitchell (2012b). Another issue stems from the fact that most representations are derived from word co-occurrence counts, as we shall see later. Co-occurrence of two words in similar contexts does not mean they are equivalent, even though their semantic vectors might be similar (e.g., "happy" and "sad", which would both appear in sentences about emotions or mental states). Hence, some behavioural predictions may not be feasible at all. The question of what information can be captured by semantic features but not distributed semantic representations is discussed at great length in Riordan and Jones (2011), where the authors conclude that the amount of information contained in the latter is underestimated. Given that our objective is to compare the ability to generate reasonable predictions from off-the-shelf representations, we will sidestep these questions.

1.3. Related work

The most closely related work is that of Baroni, Dinu, and Kruszewski (2014), an extremely thorough evaluation focusing on many of the same tasks and using many of the same representations, carried out independently from ours.
Whereas their main goal was to compare context-counting with context-prediction methods for deriving semantic vectors, our focus is more on helping readers choose from existing representations for use in predictions of behavioural data, as well as showing them how this can be done in practice. To that effect, we have included additional datasets of psychological interest and more vector representations in our comparison. We do recommend that the reader interested in the technical details of the relationships between the different types of method refer to this paper, and also to Pennington, Socher, and Manning (2014), Goldberg and Levy (2014), and Levy and Goldberg (2014). Griffiths et al. (2007) compare LSA distributed semantic representations with those from a different approach where each word is represented as a vector of topic probabilities (within a topic model of a corpus), over word association, the TOEFL synonym test from Landauer and Dumais (1997), and a semantic priming dataset. This paper is perhaps the most comprehensive in terms of discussing the suitability of distributed semantic representations for making predictions in psychological tasks, but does not consider most modern off-the-shelf representations or recently available datasets. Finally, Turney and Pantel (2010) provides a survey of the uses of distributed semantic representations to carry out tasks requiring semantic information. Both this paper and those of Bullinaria and Levy (2007) and Rubin, Kievit-Kylar, Willits, and Jones (2014) cover crucial aspects of processing text corpora that affect the quality of the representations learned.

2. Methods

2.1. Word representations

The vector space representations we consider are word representations, i.e., they assign a vector to a word which might name one or more concepts (e.g., "bank"). They have been chosen both because they were public and easily available, and also because they have been used to make predictions about behavioural data or human-generated annotations, or as the input for other procedures such as sentiment classification. We have not included some classic methods, such as HAL (Lund & Burgess, 1996), COALS (Rohde, Gonnerman, & Plaut, 2006), BEAGLE (Jones, Kintsch, & Mewhort, 2006) or PMI (Recchia & Jones, 2009), primarily because the semantic vectors produced are not publicly available (although the software for producing BEAGLE² and PMI³ models from a given corpus is). These and other methods used specifically to study human performance in semantic tasks are reviewed in detail in McNamara (2011).

One observation that is often made is that word representations are inherently flawed, in that each vector reflects the use of the corresponding word in multiple contexts with possibly different meanings (see for instance Kintsch, 2007, for a discussion). It is still possible to use the words in the light of this as, for instance, the vector similarity measure can be driven primarily by values present in two vectors due to related meanings. That said, there have been multiple attempts to solve this problem by producing representations that comprise multiple vectors for each word, corresponding to the different meanings of the word; e.g., Neelakantan et al. (2015) provide vectors and a description of existing approaches, such as those of Reisinger and Mooney (2010), Yao and Van Durme (2011) or Huang et al. (2012). We have not included these because they would require tagging each stimulus word in the evaluation tasks we consider with the specific meaning present, if available, and this would need to be done separately for each representation type. This is straightforward, though time-consuming, for representations that develop as many vectors as there are senses for a word in WordNet, say. It becomes more complicated when the number of senses is discovered from data, in alternation or simultaneously with the process of generating semantic vectors. In the latter situation, each word-meaning vector must be interpreted in the light of the other word-meaning vectors that it is most similar to (e.g., "apple#1" might be most similar to "computer#2", whereas "apple#2" might be most similar to "orange#1").

We would like to stress that this is a comparison between the semantic vector representations produced by each method operating on a particular corpus, with given pre-processing and method-specific options. The latter range from the dimensionality of the vectors produced to how information about words and their contexts in the corpus documents is collected and transformed. The choices in all of these factors will affect the performance of the representations in a comparison. Ideally, we would be comparing the representations produced with multiple methods operating on the same corpus, and optimizing the various factors for each method. This, however, is a far more demanding endeavour than the current comparison, in terms of both computational resources and time. As we will see later, the best performing representations have all been trained on very large corpora, beyond the scope of what is practical with a single desktop computer. In Section 3, we discuss when it might make sense to learn a representation from scratch, and provide pointers to useful resources covering pre-processing, design choices and trade-offs, toolkits to make the process simpler, as well as the most commonly used corpora.

Across methods, a key distinction is often made between local and global contexts. When we say that "similar words occur in similar contexts", this typically means one of two things: words are semantically similar if they occur nearby in the same sentence (local context) or in the same document (global context). There are many variations on this distinction that blur these lines (e.g., contexts that can extend across sentence boundaries, as in N-gram models, or where the document considered is a section or a paragraph of a larger ensemble). The models we consider are, for the most part, local context methods, although they use local information in different ways. For more on this distinction, possible variations, and impact on the performance of representations, please refer to Bullinaria and Levy (2007), Turney and Pantel (2010) and Rubin et al. (2014). A second distinction is between context-counting and context-prediction methods. The former consider co-occurrence counts between words – in whatever context – when producing vectors; the latter use some criterion that reflects the extent to which one can use one word in a context to predict others (or vice versa). This distinction is discussed at great length in Baroni et al. (2014), as well as in Goldberg and Levy (2014).
Latent semantic analysis (LSA). Latent semantic analysis (Landauer & Dumais, 1997) is a global context method that models co-occurrence of words within a document. The representation of each word is learned from a #words × #documents word count matrix, containing the number of times a word appeared in each available document context. This matrix is then transformed by replacing each count by its log over the entropy of the word across all contexts. Finally, the matrix undergoes a singular value decomposition; the left singular vectors are word representations in a low-dimensional space, and the right singular vectors are document representations. The original model was trained on a relatively small corpus of approximately 30,000 articles from an encyclopedia. Since we could not obtain that model, we use a random sample of 10% of the articles from Wikipedia (approximately 300,000 articles) to generate vector representations with the same number of dimensions (300).
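As an illustration, a minimal sketch of the recipe just described (ours, not the code used in this paper; the weighting is one common variant of log-entropy weighting, and at realistic scale one would use a sparse matrix and a truncated SVD):

```python
import numpy as np

def lsa_word_vectors(counts, k=300):
    """Sketch of LSA: counts is a #words x #documents matrix of raw
    word-document counts; returns k-dimensional word vectors."""
    # Log transform of the raw counts.
    logc = np.log(counts + 1.0)
    # Entropy of each word across document contexts: words that occur
    # indiscriminately everywhere get high entropy and are downweighted.
    p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    entropy = -np.where(p > 0, p * np.log(p), 0.0).sum(axis=1)
    weighted = logc / np.maximum(entropy, 1e-12)[:, None]
    # SVD: left singular vectors give word representations,
    # right singular vectors give document representations.
    U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
    return U[:, :k] * s[:k]

# Example with a tiny random count matrix standing in for a real corpus.
rng = np.random.default_rng(0)
vecs = lsa_word_vectors(rng.poisson(1.0, size=(500, 200)).astype(float), k=50)
print(vecs.shape)  # (500, 50)
```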
Multi-task neural network embedding (CW). In the article introducing this method (Collobert & Weston, 2008) the authors introduce a convolutional neural network architecture to carry out a series of language processing tasks – e.g., part-of-speech tagging, named entity tagging, semantic role labelling – from a vector representation of the words in a sentence. The vector representation was learned using a local context approach, by training a model that used it to assign higher probability to the right word in the middle of a window than to a random one, making this an early instance of a "predict" model, using a window of five words in each direction. The model was trained on a large subset of Wikipedia containing approximately 631 million words, and distributed as 25-, 50-, 100- and 200-dimensional representations. The vectors were obtained from a third-party website.⁴

Hierarchical log-bilinear model (HLBL). This and other related local context methods were introduced in Mnih and Hinton (2007) and further developed in Mnih and Hinton (2008). The commonality between them is learning vector representations for use in a statistical model for predicting the conditional distribution of a word given a window of five preceding ones. The model was trained on a subset of the Associated Press dataset containing approximately 16 million words. The representations available are 100- and 200-dimensional. The vectors were obtained from the same site as those for representation CW.

Non-negative sparse embedding (NNSE). This approach was introduced in Murphy et al. (2012b) and aims at learning a word representation that is sparse (vector values are positive, and zero for most dimensions) and disjoint (positive vector values tend to be disjoint between word types such as abstract nouns, verbs and function words). NNSE is a global context model, in that it models co-occurrence of words within a document (similarly to Landauer & Dumais, 1997), but it combines these with word-dependency co-occurrence counts (similarly to Lund & Burgess, 1996). The counts were normalized by a transformation into positive pointwise mutual information (positive PMI) scores (Bullinaria & Levy, 2007; Turney & Pantel, 2010). The process of generating the vectors with the desired properties is similar to that of Landauer and Dumais (1997), albeit with a more complex process of factorization of word-by-score matrices, which is beyond the scope of this paper. It was learned from a subset of the Clueweb dataset (approximately 10 million documents and 15 billion words). The representations available are 50-, 300-, 1000- and 2500-dimensional, and the vectors are provided by the authors.⁵

Word2vec (W2VN). The representations in this class are produced with local context methods, trained on the Google News dataset (approximately 100 billion words) and distributed by Google.⁶ The continuous bag-of-words model (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013a) (a.k.a. CBOW or "negative sampling") learns to predict a word based on its context (predict the word in the middle of a context window based on an average of the vector representations of the other words in the window). The representation derived with this model is 300-dimensional. The continuous skipgram model (Mikolov, Chen, Corrado, & Dean, 2013b) learns to predict words in the context window from a word in the middle of it. In the publicly available distribution produced with the latter method, the vectors distributed do not correspond to individual words, but rather to identifiers for entities from the Freebase ontology for a particular meaning of each word. Hence, we excluded this variant of the word2vec method from the comparison. A good explanation of the two methods is provided in Goldberg and Levy (2014), and Levy and Goldberg (2014) derive a connection between the skipgram approach and the factorization of PMI count matrices.
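For readers who want to try this themselves, vectors of this kind can be loaded and queried with the gensim library (a usage sketch only, assuming the publicly distributed GoogleNews binary file has already been downloaded; the API is as in recent gensim versions):

```python
from gensim.models import KeyedVectors

# Load the 300-dimensional word2vec vectors distributed by Google.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors.similarity("happy", "sad"))        # high despite opposite meanings
print(vectors.most_similar("psychology", topn=5))
```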
Global vectors (GV300, GV42B, GV840B). This approach is described in Pennington et al. (2014). It is a global context method, in that it models co-occurrence of words across a corpus, in a manner similar to Burgess (1998) and Rohde, Gonnerman, and Plaut (2006), but it operates by factorizing a transformed version of the term–term co-occurrence matrix. The factorization is similar to that used in Landauer and Dumais (1997), but the transformation is beyond the scope of this paper; performance asymptotes with co-occurrence windows of 8–10 words. The model is trained on a combination of Wikipedia 2014 and Gigaword 5 (6 billion tokens) or Common Crawl (42 billion and 840 billion tokens). All three versions are 300-dimensional and made available by the authors.⁷

Context-counting (BDKC) and context-predicting (BDKP) vectors. This approach is described in Baroni et al. (2014). The vector representations were extracted from a corpus of 2.8 billion tokens, constructed by concatenating ukWaC, the English Wikipedia and the British National Corpus, using two different methods ("count" and "predict"). Both methods use local context windows, although in different ways. The "count" models were obtained by decomposing matrices of word co-occurrence counts within a window of size 5 using SVD, after transforming word count scores to PMI. The "predict" models were obtained by using the word2vec software on the same corpus, using the CBOW approach with a window of size 5. In both cases the authors used their own toolbox, DISSECT,⁸ to produce the vector representations. For our comparison we use the best "predict" and "reduced count" representations made available by the authors,⁹ which are 400- and 500-dimensional, respectively.

2.2. Vector similarity measures

Our underlying hypothesis is that vector similarity in the space used to represent concepts reflects semantic relatedness. We consider three kinds of measures – euclidean distance, correlation and cosine similarity – and we will use the term "similarity" to mean either high similarity proper or low distance, depending on the measure used.

The euclidean distance between n-element vectors u and v is

$\|u - v\|_2 = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}.$

The correlation similarity between vectors u and v is

$\mathrm{correlation}(u, v) = \frac{\sum_{i=1}^{n} (u_i - m_u)(v_i - m_v)}{s_u s_v} = u^{*\top} v^{*},$

where $m_u$ and $s_u$ are the mean and standard deviation of vector u, respectively (and analogously for vector v). If we consider the normalization where each vector is z-scored, i.e., $u^{*} = (u - m_u)/s_u$, the correlation can be viewed as a product of normalized vectors.

The cosine similarity between vectors u and v is

$\mathrm{cosine}(u, v) = \frac{\sum_{i=1}^{n} u_i v_i}{\|u\|_2 \|v\|_2} = u'^{\top} v',$

where $\|u\|_2$ is the length of vector u (and analogously for vector v). If we consider the normalization where each vector is made length 1, i.e., $u' = u/\|u\|_2$, the cosine similarity can be viewed as a product of normalized vectors.

Given that correlation and cosine similarity are, in essence, vector products operating on implicitly normalized versions of the original vectors, they are invariant – to a degree – to the magnitude of the vector entries or the length of the vector. This is not the case for euclidean distance, and one of our goals is to determine whether this is a relevant factor for the applications we have in mind. Furthermore, values of euclidean distance between vectors are not directly comparable across different vector representations. The use of evaluation tasks based on rankings, where the score is the position of a "correct" answer, is meant to allow a single approach that works regardless of the similarity measure used.

The use of rank measures allows us to avoid having to predict directly the raw scores obtained from judgements or annotations. However, it could also mask large differences in the relationship between scores for different items across different models (e.g., a representation where vectors for small sets of words are very similar to each other and dissimilar to all else, versus one with more graded similarity values between the same words). An alternative approach would be to use a technique such as the Luce choice rule, which can be used to convert both the scores and the distances/similarities produced from any representation into the same normalized scale. This approach and potential pitfalls of using vector distance/similarity are discussed in Jones et al. (2011) (and, as described earlier, in Griffiths et al., 2007).
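The three measures are straightforward to compute; a minimal sketch (ours, for illustration; the 1/n convention makes the correlation exactly the product of the z-scored vectors):

```python
import numpy as np

def euclidean(u, v):
    # Euclidean distance between two vectors.
    return np.sqrt(np.sum((u - v) ** 2))

def correlation(u, v):
    # Pearson correlation, written as a product of z-scored vectors.
    uz = (u - u.mean()) / u.std()
    vz = (v - v.mean()) / v.std()
    return np.dot(uz, vz) / len(u)

def cosine(u, v):
    # Cosine similarity, written as a product of unit-length vectors.
    return np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v))

u, v = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 5.0])
print(euclidean(u, v), correlation(u, v), cosine(u, v))
```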
2.3. Datasets used in evaluation tasks

The data are available online, and pointers to the original paper and a brief description are provided in each section.

2.3.1. Word association
Nelson, McEvoy, and Schreiber (2004) collected free association norms for 5000 words.¹⁰ Over 6000 participants were asked to write the first word that came to mind that was meaningfully related or strongly associated to the presented word. The word association data were then aggregated into matrix form, where S_ij represents the probability that word j is the first associate of word i. The dataset is distributed in a reduced dimensionality version containing 400-dimensional vectors for each word, from which we re-assembled the entire association matrix. Our hypothesis, following the work of Steyvers et al. (2004), is that word association can be predicted by vector similarity. Prediction accuracy is measured as the proportion of the top 1% associates for a particular word that are also in the top 1% of words ranked closest in vector space to that word, averaged over all words. This criterion is different from but related to the one in Griffiths et al. (2007), where the authors considered the probability of the first associate being present when considering the top m words, for varying values of m; hence the results are not directly comparable for LSA300, and neither would they be on the grounds of our having used a different corpus (and settings) to learn the representation. We chose our criterion to allow comparison across representations of multiple dimensionalities, and because the top 1% of associates contains most of the probability mass for almost all concepts.

2.3.2. Similarity and relatedness judgments
MEN. The MEN dataset¹¹ consists of human similarity judgments for 3000 word pairs, randomly selected from words that occur at least 700 times in a large corpus of English text, and at least 50 times (as tags) in a subset of the ESP game dataset. It was collected and made public by the University of Trento for testing algorithms implementing semantic similarity and relatedness measures, and introduced in Bruni, Tran, and Baroni (2014). Each pair was randomly matched with a comparison pair, and participants were asked to rate whether the target pair was more or less related than the comparison pair. Each pair was rated against 50 comparison pairs, producing a final score on a 50-point scale. Participants were requested to be native English speakers.

SimLex-999. The SimLex-999 dataset¹² was collected to measure the similarity of words specifically, rather than just their relatedness or association (e.g., "coast" and "shore" are similar, whereas "clothes" and "closet" are not; both pairs are related). It contains a selection of adjective, verb and noun word pairs, with varying degrees of concreteness. The similarity ratings were produced by 500 native English speakers, recruited via Amazon Mechanical Turk, on a scale going from 0 to 6. The methodology and motivation are further described in Hill, Reichart, and Korhonen (2014).

WordSim-353. The WordSimilarity-353 dataset¹³ was introduced in Finkelstein et al. (2001). It contains a set of 353 noun pairs representing various degrees of similarity. Sixteen near-native English speaking subjects were instructed to estimate the relatedness of the words in each pair on a scale from 0 (totally unrelated words) to 10 (very much related or having identical meaning).

2.3.3. Synonyms and analogy problems
TOEFL. This dataset was originally described in Landauer and Dumais (1997), and consists of 80 retired items from the synonym portion of the Test of English as a Foreign Language (TOEFL), produced by the Educational Testing Service. Each item consists of a probe word and four candidate synonym words; the subject then picks the candidate whose meaning is most similar to that of the probe. The test items are available upon request from the authors.¹⁴

Google analogy. This dataset was originally introduced in Mikolov et al. (2013b) and described in more detail in Mnih and Kavukcuoglu (2013). It contains several sets of analogy problems, of the form "A is to B as C is to ?". They are divided into semantic problems (where A and B are semantically related) and syntactic problems (where A and B are in a grammatical relation). The five semantic analogies come from various topical domains, e.g., cities and countries, currencies, and family; the nine syntactic analogies use adjective-to-adverb formation, opposites, comparatives, superlatives, tense, and pluralization.

2.4. Prediction of behavioural data or human annotations

Each of the evaluation tasks consists of generating a prediction from words or pairs of words, and their corresponding vectors, which is matched to behavioural or human-annotated data in a task-dependent way described in the rest of this section. The datasets used are as described in Section 2.3. The vector representations used are those introduced in Section 2.1: latent semantic analysis (LSA), multi-task neural network embedding (CW), hierarchical log-bilinear model (HLBL), non-negative sparse embedding (NNSE), global vectors (GV), word2vec (W2VN), and context-counting (BDKC) and context-predicting (BDKP) vectors; in each graph, the method abbreviations are followed by a number indicating dimensionality. If a particular word is not in the vocabulary of a representation, the corresponding fragments of data are ignored in tests (this happens for a tiny fraction of each representation, if at all). The results on all tasks are shown in Figure 1, with all performance scores in the range [0, 1] (1 best); the specific performance measure is task-dependent, as described below. Given that results are very similar for cosine and correlation vector similarities, we omit the latter.

2.4.1. Word association
For each word in the dataset, we rank all others by the similarity of their vectors to its vector. The score is the overlap between the top 50 associates of the word and the top 50 most similar vectors. This number was chosen because it is approximately 1% of the total number of words for which data is available; the exact number depends on which words are in the vocabulary associated with a particular vector representation. We chose to consider overlap fraction over the top words, as opposed to a measure like rank correlation, because most of the ranking is of little interest to us (the vast majority of words are barely used as associates, if at all).
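A minimal sketch of this scoring procedure (illustrative only; `W` and `S` are hypothetical arrays holding, respectively, word vectors and the re-assembled association matrix, with rows aligned on the same word list):

```python
import numpy as np

def association_overlap(W, S, k=50):
    """Mean overlap between each word's top-k associates (rows of S)
    and its top-k nearest neighbours in vector space (rows of W)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    sim = Wn @ Wn.T
    np.fill_diagonal(sim, -np.inf)  # a word does not count as its own neighbour
    scores = []
    for i in range(W.shape[0]):
        top_assoc = set(np.argsort(-S[i])[:k])
        top_neigh = set(np.argsort(-sim[i])[:k])
        scores.append(len(top_assoc & top_neigh) / k)
    return float(np.mean(scores))

# Tiny random example, standing in for real vectors and norms.
rng = np.random.default_rng(0)
print(association_overlap(rng.normal(size=(200, 25)),
                          rng.random(size=(200, 200)), k=2))
```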

Figure 1. Performance of predictions generated from various vector representations across all tasks available, using cosine similarity (left) or euclidean distance (right). The performance measure is task dependent, but always in the range [0, 1] (1 best). For Google Analogy only cosine results were obtained, because of computational constraints. [To view this figure in colour, please see the online version of this journal.]

2.4.2. Similarity and relatedness judgments
These tasks rely on similarity and relatedness judgments between pairs of words. All tasks share the same evaluation procedure:

(1) rank all test word pairs from most to least similar on judgement value (the "true ranking");
(2) for each test word pair, compute the similarity between their respective vectors, and produce a second ranking from most to least similar (the "predicted ranking");
(3) the prediction "accuracy" is measured by the Spearman rank correlation between the true and predicted ranking positions of all word pairs.

The use of Spearman rank correlation allows the results to be comparable across representations and tasks, irrespective of dimensionality or vector similarity measure.

2.4.3. Synonyms and analogy problems
TOEFL. For each of the 80 test items, we calculated the similarity between the vector for the probe word and the vectors for the four candidate synonyms, picking the most similar. The prediction accuracy is the fraction of test items for which the correct word was selected.

Google analogy. The task is carried out by creating a composite of the vectors for the words in the problem (A − B + C) and then finding the vocabulary word with the most similar vector (using cosine similarity, excluding the vectors for A, B, and C). The prediction accuracy is the fraction of test items for which the correct word was selected.
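Both procedures reduce to a few lines; a sketch under the assumption that `vec` is a dictionary from words to vectors (the names and the toy vocabulary are ours, for illustration):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def rating_score(pairs, ratings, vec):
    """Spearman rank correlation between human ratings and vector
    similarities for a list of (word1, word2) pairs."""
    sims = [cosine(vec[a], vec[b]) for a, b in pairs]
    return spearmanr(ratings, sims).correlation

def analogy_answer(a, b, c, vec):
    """Answer 'A is to B as C is to ?' with the vocabulary word whose
    vector is closest (by cosine) to the composite, excluding A, B, C."""
    target = vec[a] - vec[b] + vec[c]   # composite as described above
    candidates = [w for w in vec if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vec[w]))

# Toy vocabulary standing in for a real representation.
rng = np.random.default_rng(0)
vec = {w: rng.normal(size=20) for w in ["king", "man", "woman", "queen", "apple"]}
print(rating_score([("king", "queen"), ("man", "apple")], [9.0, 1.0], vec))
print(analogy_answer("king", "man", "woman", vec))
```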

2.5. Experiments aggregating results on individual tasks

Given all the experiments described above, are there broad trends across the results? The first area we considered was the performance of representations across tasks. We quantify the similarity in performance by computing, for each pair of tasks, the correlation between the scores for all representations in each of them. These results are shown in Table 1, and suggest that the relative performance of the models is very similar across all the tasks.

Table 1. Correlation between the scores for all representations in each pair of tasks.

Task         TOEFL  WAS   MEN   wordsim  simlex  GAsemantic  GAsyntactic
TOEFL        –      0.98  0.97  0.96     0.94    0.91        0.90
WAS          –      –     0.99  0.96     0.96    0.88        0.88
MEN          –      –     –     0.97     0.95    0.87        0.88
wordsim      –      –     –     –        0.94    0.84        0.86
simlex       –      –     –     –        –       0.75        0.76
GAsemantic   –      –     –     –        –       –           0.99
GAsyntactic  –      –     –     –        –       –           –

As discussed earlier, we are comparing representations obtained by applying a given method to a given corpus. Our intent is not to compare the methods in isolation, which would require applying them to a benchmark corpus and using the resulting representations. It is, however, still possible to ask whether certain properties of the method or the corpus affect performance in general, across representations. To that effect, we carried out two separate experiments. In the first, we correlated log10 of the corpus size for each representation with the result of using it in each task. In the second, we did the same for the dimensionality of the representation. The results are shown in the first and second rows of Table 2.

Table 2. Correlation between representation performance and characteristics (log10 of the corpus size – row 1 – and dimensionality – row 2), across all tasks considered.

Task               TOEFL  WAS   MEN   wordsim  simlex  GAsemantic  GAsyntactic
log10 corpus size  0.68   0.75  0.68  0.61     0.66    0.67        0.61
Dimensionality     0.42   0.41  0.41  0.46     0.55    0.06        0.03

Increases in corpus size do appear to lead to consistently better performance. Our conjecture is that this is due to exposure both to more instances of each context where two words could conceivably co-occur, and also to a greater variation of types of context; these large corpora combine many types of text found on the web – from blogs to news or transcripts – as well as books and encyclopaedia articles. It is less clear that increases in dimensionality are advantageous beyond a certain point. NNSE and "count" vectors do not perform better than other representations with 300 dimensions, and there are underperforming representations with that dimensionality as well.

2.6. Relation between representations

As seen earlier, the performance of representations is strongly related to the size of the corpus from which they were learned. With that in mind, the best performing representations have comparable performance across tasks, so the question arises of whether they are, in fact, redundant and contain similar information. In practice, this would mean that it would be possible to express each dimension of one representation in terms of the dimensions of another one. We have implemented a simple version of this approach by training ridge regression models to predict each dimension in one representation (target) as a linear combination of the dimensions of another (source). The predictability of the target representation from the source representation can then be summarized as the median R² across all its dimensions.

The predictions are obtained with an even–odd word split-half cross-validation, in order to remove the confound of having a better result by having more dimensions in the source representation. Furthermore, in order to ensure that the vocabularies used are comparable across representations, and that the results pertain primarily to the basic vocabulary used in our evaluation tasks, we restricted the words considered in two ways. The first is that they come from a list of 40 K lemmas published by Brysbaert, Warriner, and Kuperman (2014); around 30 K of these are words or two-word compounds that correspond to a lemma in WordNet. Beyond that, we used the words in common across the vocabularies of each source and target representation (this ranged from 15 K for the smallest representations to close to 30 K for those with the largest vocabularies). The results shown in Figure 2 use λ = 1 in the ridge regression models, but results are relatively similar for other values of λ a few orders of magnitude above or below.

Figure 2. The median R² across all dimensions for a regression model predicting each representation (target) from each of the others (source). The R² is obtained in split-half cross-validation. [To view this figure in colour, please see the online version of this journal.]

As expected, median R² is high within each representation type, and higher when using the representations with more dimensions to predict those with fewer. It is harder to predict from NNSE than from other representations, because the positivity and sparsity constraints are not enforced in our regression model; therefore, results should not be interpreted as meaning that NNSE has information that other methods cannot predict. Across the representations trained on very large corpora, the median R² appears to converge around 0.5. This suggests that there is a residual inability to predict, and possibly nonlinear relationships between dimensions.
3. Discussion

In light of the results presented in the previous section, we conclude that there are representations which perform better across all of our evaluation tasks, namely GloVe and word2vec (and context-predicting vectors, which use the same method as word2vec, on a different corpus). Given our reliance on measures of vector similarity to generate predictions in each task, it appears correlation and cosine similarity are somewhat better than euclidean distance for this purpose. NNSE at higher dimensionality also performs well across word association and similarity/relatedness tasks, but less so in analogy tasks. Given that this is the only task that requires identifying a correct analogy answer out of a range of 10–100 K possibilities, and the fact that the vectors are positive and sparse, it is possible that the vector similarity measures we use are not the most appropriate to allow fine distinctions between closely related words.

From the practical standpoint of a researcher in need of an off-the-shelf word representation, we would thus recommend using word2vec (in its context-predicting vector version) or GloVe. This is because their performance is roughly equivalent and it is straightforward to obtain vectors in a convenient format (word2vec itself requires extracting them from a binary distribution). Both options come with a reasonably sized vocabulary; this matters because too large a vocabulary will lead to intense use of memory for no particular gain, as the words of interest in psychological studies tend to be relatively frequent. Both representations are good for nearest neighbour computations, and simple vector differences capture the meaning of combinations of two words (as suggested by the analogy results). GloVe, in addition, has mostly uncorrelated dimensions. This makes it especially suitable for building other prediction models that work with vectors as inputs, such as regression models of subject ratings. Both representations have very dense vectors. If the target application requires some degree of interpretability, e.g., by identifying which dimensions have the most impact in the prediction, or treating each dimension as a "magnitude" score, it may make more sense to use a representation like NNSE. The vectors are sparse, and dimensions are positive and shared by a relatively small set of words.

These conclusions are largely consistent with those of Baroni et al. (2014), who found that prediction-based, local context methods outperformed co-occurrence count-based methods. The authors compared both classes of approach over the same corpus, systematically varying parameters such as the transformation of the count data or the width of context windows. Although GloVe was not part of that comparison, we do include both the best performing count and predict models from that study (BDKC500 and BDKP400) and a variant of GloVe (GV300) obtained from a corpus of comparable size. From this particular contrast, and given that BDKP400 was the best model in that study, we believe that word2vec-related models may have a slight edge in performance relative to GloVe. This is compensated by increasing the GloVe corpus size; since the large corpus versions are now available off the shelf, it's not clear that there is any advantage in choosing one over the other. Interestingly, both CW and HLBL are local context, prediction-based methods, and perform rather poorly despite being in widespread use. This suggests that the specific local context prediction method and optimization objective do matter, as does the size of the corpus used in training. Given the robust correlation between this and the performance of a representation across tasks, it is probably a key factor if considering a representation for "general purpose" use. Increases in dimensionality do not appear to have the same effect on performance. Ultimately, all the methods considered use word co-occurrence statistics, in what may be a more or less direct fashion. Pennington et al. (2014) helpfully provide a perspective that connects their model to others, in particular the skipgram version of word2vec introduced in Mnih and Kavukcuoglu (2013); a detailed explanation of the two word2vec versions is given in Goldberg and Levy (2014).

As mentioned earlier, this is a comparison of off-the-shelf representations, obtained as the result of applying a specific method to a specific corpus. Even though certain representations are superior across all the tasks considered, this does not mean that the methods used to produce them are necessarily superior. However, the methods that performed best are easily deployable, with well-documented code that allows tuning of parameters such as the context considered for co-occurrence. On that count alone, they are likely to be the best options available.

A separate question is when a researcher would want to depart from using an off-the-shelf representation. One situation would be the need for a specialized corpus, from a given technical (e.g., biomedical texts) or otherwise restricted domain (e.g., educational texts for particular grades). Across methods, it would still be the case that words appearing in similar contexts would have similar representations. Given the restricted corpus size, and the increased influence of every word co-occurrence, it would be even more important to define context appropriately (e.g., co-occurrence within the same document might make sense for information retrieval applications, but within the same passage or sentence would likely make more sense for applications where one wants to represent an individual word). In the absence of a rationale for picking a specific corpus, the ones most commonly used are Wikipedia,¹⁵ ClueWeb12,¹⁶ Common Crawl,¹⁷ and Gigaword¹⁸ (not freely available, unlike the others). We would recommend starting with the Wikipedia corpus, as there are many tools specifically designed to pre-process or subset it, in a variety of programming languages; furthermore, the case may be made that Wikipedia is especially able to provide contexts covering a "cognitive space" shared by all humans (Olney, Dale, & D'Mello, 2012). In this situation, it would be worthwhile to consider using gensim,¹⁹ as it provides support for some of the corpus processing required and implementations of not just word2vec but also its extensions for representing small text passages (Le & Mikolov, 2014); the DISSECT²⁰ toolkit provides much overlapping functionality, and may be preferable to operations relying directly on (transformed) co-occurrence counts (such as SVD or non-negative matrix factorization). The reader interested in doing this should also consult Bullinaria and Levy (2007), Turney and Pantel (2010), Bullinaria and Levy (2012), and Rubin et al. (2014) for a discussion of the types of context, transformations of counts and several other pre-processing steps and factors that can affect the performance of representations.

Overall, the results confirm that vector similarity allows us to make reasonable predictions, and that certain representations are better for this purpose across all tasks considered. More specifically, though, results are good for semantic relatedness, less good for controlled similarity judgments, and even less so for word association. The question is, then, whether this progression reflects a corresponding psychological process increasingly different from something like a nearest neighbour operation in semantic space; many other examples of such issues may be found in Griffiths et al. (2007) and Turney and Pantel (2010). More recently, Recchia and Jones (2009) showed that the use of an appropriate choice rule, when combined with vector space operations, could make it possible for such models to overcome at least some of those issues.

Our goal, however, is not to provide a state-of-the-art approach to making each type of prediction. Our main intention was to see whether certain off-the-shelf representations were preferable, across a wide range of tasks and using a simple, robust prediction approach that yielded results comparable across them. In doing so, we realized this work could also serve as an introduction to this area, and provide a guide to the publicly available resources used for our evaluation tasks. We can, however, still consider the question of what can or should be predicted. For the purposes discussed in the introduction, such as modelling "generic" semantic activation in brain imaging experiments, these representations provide a clear advantage over human-generated semantic features (Pereira, Detre, & Botvinick, 2011). Further improvement in this area will probably come from moving to concept rather than word vectors, and from representing sentences or passages using methods such as paragraph vector (Le & Mikolov, 2014) or skip-thought vectors (Kiros et al., 2015). For modelling of human performance in behavioural tasks, or annotations, the picture is more complicated. First, it is not even clear that there would be a single representation at work across all tasks. Even if this were the case, it is still possible that the task or the context of use would modulate the use (e.g., by attending to different semantic features of stimuli, one might correspondingly use different dimensions of a semantic space, or a different similarity function). In light of this, we hypothesize that further progress will come from modelling what happens in the process of making a judgment. Semantic vectors can still be at the root of this, e.g., as inputs to a model that predicts the probability of choosing an associate for a probe word, or which sense of the probe word to use. Other practical issues that are generally ignored – such as instructions given to subjects changing how they produce a judgment – may still allow for the use of vector similarity. One possible approach here would be to use metric learning (Yang & Jin, 2006; Kulis, 2012). More precisely, this would entail weighting each dimension differently when computing a vector similarity, rather than weighting all dimensions equally, essentially changing how vector similarity operates. The weights for the metric would be learned on data from a number of "training" subjects given each of the possible sets of instructions. The learned metrics could then be used to generate predictions on left-out "test" subjects and, if successful, their respective dimension weights analysed to understand which semantic dimensions played a role in each prediction (and which words, in turn, had loadings in those dimensions).
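As a concrete version of that idea, consider a diagonal metric (a sketch only; learning the weights from subject data is left out, and all names are ours):

```python
import numpy as np

def weighted_cosine(u, v, w):
    """Cosine similarity under a diagonal metric: dimension i is scaled
    by a learned non-negative weight w[i] before normalization."""
    uw = u * np.sqrt(w)
    vw = v * np.sqrt(w)
    return np.dot(uw, vw) / (np.linalg.norm(uw) * np.linalg.norm(vw))

rng = np.random.default_rng(0)
u, v = rng.normal(size=10), rng.normal(size=10)
w = rng.random(10)  # stand-in for weights learned per instruction set
print(weighted_cosine(u, v, w), weighted_cosine(u, v, np.ones(10)))
```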
Acknowledgements

We are grateful to the editor and reviewers for their feedback. In particular, we would like to thank Jon Willits for his very insightful comments regarding almost every part of this article, and his helpful suggestions for additional experiments we included after revision. We are also grateful to Walid Bendris for help in implementing an early version of this evaluation.

Disclosure statement

The authors have no financial interest or benefit from the direct application of this research, and are not aware of any conflict of interest.

Funding

This work was supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the Air Force Research Laboratory (AFRL) [contract FA8650-14-C-7358]. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, AFRL, or the US Government. The US Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

ORCiD

Samuel Gershman http://orcid.org/0000-0002-6546-3298

Notes

1. http://lsa.colorado.edu
2. https://github.com/mike-lawrence/wikiBEAGLE
3. http://www.twonewthings.com/lmoss.html
4. http://web.archive.org/web/20160326092344/http://metaoptimize.com/projects/wordreprs/
5. http://www.cs.cmu.edu/bmurphy/NNSE/
6. https://code.google.com/p/word2vec/
7. http://nlp.stanford.edu/projects/glove
8. http://clic.cimec.unitn.it/composes/toolkit
9. http://clic.cimec.unitn.it/composes/semantic-vectors.html
10. http://psiexp.ss.uci.edu/research/software.htm
11. http://clic.cimec.unitn.it/elia.bruni/MEN
12. http://www.cl.cam.ac.uk/fh295/simlex.html
13. http://www.cs.technion.ac.il/gabr/resources/data/wordsim353
14. http://lsa.colorado.edu
15. https://dumps.wikimedia.org/enwiki
16. http://www.lemurproject.org/clueweb12.php
17. https://commoncrawl.org
18. http://catalog.ldc.upenn.edu/LDC2003T05
19. https://radimrehurek.com/gensim
20. http://clic.cimec.unitn.it/composes/toolkit/index.html

References

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd annual meeting of the Association for Computational Linguistics, 23–25 June 2014, Baltimore, MD, Vol. 1 (pp. 238–247). Stroudsburg, PA: Association for Computational Linguistics. Retrieved from http://acl2014.org/acl2014/P14-1/pdf/P14-1023.pdf

Bruni, E., Tran, N. K., & Baroni, M. (2014). Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49, 1–47. doi:10.1613/jair.4135
Bullinaria, J. A., & Levy, J. P. (2007). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3), 510–526. doi:10.3758/BF03193020

Bullinaria, J. A., & Levy, J. P. (2012). Extracting semantic representations from word co-occurrence statistics: Stop-lists, stemming, and SVD. Behavior Research Methods, 44(3), 890–907. doi:10.3758/s13428-011-0183-8

Bullinaria, J. A., & Levy, J. P. (2013). Limiting factors for mapping corpus-based semantic representations to brain activity. PLoS ONE, 8(3), Article ID e57191. doi:10.1371/journal.pone.0057191

Burgess, C. (1998). From simple associations to the building blocks of language: Modeling meaning in memory with the HAL model. Behavior Research Methods, Instruments, & Computers, 30(2), 188–198. doi:10.3758/BF03200643

Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on machine learning (ICML '08), 5–9 July 2008, Helsinki (pp. 160–167). New York, NY: ACM. doi:10.1145/1390156.1390177

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2001). Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, 1–5 May 2001, Hong Kong (pp. 406–414). New York, NY: ACM. doi:10.1145/371920.372094

Fyshe, A., Talukdar, P. P., Murphy, B., & Mitchell, T. M. (2014). Interpretable semantic vectors from a joint model of brain- and text-based meaning. In Proceedings of the 52nd annual meeting of the Association for Computational Linguistics, 23–25 June 2014, Baltimore, MD, Vol. 1 (pp. 489–499). Stroudsburg, PA: Association for Computational Linguistics. Retrieved from http://www.cs.cmu.edu/~afyshe/papers/acl2014/jnnse_acl2014.pdf

Goldberg, Y., & Levy, O. (2014). word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722. Retrieved from http://arxiv.org/abs/1402.3722v1

Green, A. E., Kraemer, D. J., Fugelsang, J. A., Gray, J. R., & Dunbar, K. N. (2010). Connecting long distance: Semantic distance in analogical reasoning modulates frontopolar cortex activity. Cerebral Cortex, 20(1), 70–76. doi:10.1093/cercor/bhp081

Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114(2), 211–244. doi:10.1037/0033-295X.114.2.211

Hill, F., Reichart, R., & Korhonen, A. (2014). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. arXiv preprint arXiv:1408.3456. Retrieved from http://arxiv.org/abs/1408.3456

Howard, M. W., Shankar, K. H., & Jagadisan, U. K. (2011). Constructing semantic representations from a gradually changing representation of temporal context. Topics in Cognitive Science, 3, 48–73. doi:10.1111/j.1756-8765.2010.01112.x

Huang, E. H., Socher, R., Manning, C. D., & Ng, A. Y. (2012). Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th annual meeting of the Association for Computational Linguistics, Vol. 1 (pp. 873–882). Stroudsburg, PA: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/P12-1092

Jones, M. N., Gruenenfelder, T. M., & Recchia, G. (2011). In defense of spatial models of lexical semantics. In L. Carlson, C. Hölscher, & T. Shipley (Eds.), Proceedings of the 33rd annual conference of the Cognitive Science Society (CogSci 2011), 20–23 July 2011, Boston, MA (pp. 3444–3449). Austin, TX: Cognitive Science Society. Retrieved from http://www.indiana.edu/~clcl/Papers/JGR_CogSci_2011.pdf

Jones, M. N., Kintsch, W., & Mewhort, D. J. (2006). High-dimensional semantic space accounts of priming. Journal of Memory and Language, 55(4), 534–552. doi:10.1016/j.jml.2006.07.003

Just, M. A., Cherkassky, V. L., Aryal, S., & Mitchell, T. M. (2010). A neurosemantic theory of concrete noun representation based on the underlying brain codes. PLoS ONE, 5(1), Article ID e8622. doi:10.1371/journal.pone.0008622

Kintsch, W. (2007). Meaning in context. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 89–105). Mahwah, NJ: Lawrence Erlbaum. doi:10.4324/9780203936399

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Skip-thought vectors. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems 28 (NIPS 2015), 7–12 December 2015, Montréal (pp. 3276–3284). Neural Information Processing Systems Foundation. Retrieved from http://papers.nips.cc/paper/5950-skip-thought-vectors.pdf

Kulis, B. (2012). Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4), 287–364. doi:10.1561/2200000019

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240. doi:10.1037/0033-295X.104.2.211

Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053. Retrieved from http://arxiv.org/abs/1405.4053

Lenat, D. B. (1995). CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38, 33–38. doi:10.1145/219717.219745

Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In M. Welling, C. Cortes, N. D. Lawrence, & K. Weinberger (Eds.), Advances in neural information processing systems 27 (NIPS 2014), 8–13 December 2014, Montréal (pp. 2177–2185). Neural Information Processing Systems Foundation. Retrieved from http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf

Liu, H., Palatucci, M., & Zhang, J. (2009). Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In L. Bottou & M. Littman (Eds.), Proceedings of the 26th annual international conference on machine learning (ICML 2009), 14–18 June 2009, Montréal (pp. 649–656). New York, NY: ACM. Retrieved from http://www.cs.cmu.edu/~fmri/papers/168-Blockwise-Coord-Descent.pdf
Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203–208. doi:10.3758/BF03204766

Margolis, E., & Laurence, S. (Eds.). (1999). Concepts: Core readings. Cambridge, MA: MIT Press.

McNamara, D. S. (2011). Computational methods to extract meaning from text and advance theories of human cognition. Topics in Cognitive Science, 3(1), 3–17. doi:10.1111/j.1756-8765.2010.01117.x

McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37, 547–559. doi:10.3758/BF03192726

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013b). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Retrieved from http://arxiv.org/abs/1301.3781

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013a). Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems 26 (NIPS 2013), 5–10 December 2013, Lake Tahoe, NV (pp. 3111–3119). Neural Information Processing Systems Foundation. Retrieved from https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3, 235–244. doi:10.1093/ijl/3.4.235

Mitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K., Malave, V. L., Mason, R. A., & Just, M. A. (2008). Predicting human brain activity associated with the meanings of nouns. Science, 320(5880), 1191–1195. doi:10.1126/science.1152876

Mnih, A., & Hinton, G. (2007). Three new graphical models for statistical language modelling. In Proceedings of the 24th international conference on machine learning (ICML '07), 20–24 June 2007, Corvallis, OR (pp. 641–648). New York, NY: ACM. doi:10.1145/1273496.1273577

Mnih, A., & Hinton, G. E. (2008). A scalable hierarchical distributed language model. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems 21 (NIPS 2008), 8–10 December 2008, Vancouver (pp. 1081–1088). Neural Information Processing Systems Foundation. Retrieved from http://papers.nips.cc/paper/3583-a-scalable-hierarchical-distributed-language-model.pdf

Mnih, A., & Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems 26 (NIPS 2013), 5–10 December 2013, Lake Tahoe, NV (pp. 2265–2273). Neural Information Processing Systems Foundation. Retrieved from http://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf

Murphy, G. L. (2002). The big book of concepts. Cambridge, MA: MIT Press.

Murphy, B., Talukdar, P., & Mitchell, T. (2012a). Selecting corpus-semantic models for neurolinguistic decoding. In Proceedings of the first joint conference on lexical and computational semantics, 7–8 June 2012, Montréal (pp. 114–123). Stroudsburg, PA: Association for Computational Linguistics. Retrieved from http://delivery.acm.org/10.1145/2390000/2387658/p114-murphy.pdf

Murphy, B., Talukdar, P. P., & Mitchell, T. (2012b). Learning effective and interpretable semantic models using non-negative sparse embedding. In Proceedings of the international conference on computational linguistics (COLING 2012), December 2012, Mumbai (pp. 1933–1950). Retrieved from http://pure.qub.ac.uk/portal/files/5488402/C12_1118.pdf

Neelakantan, A., Shankar, J., Passos, A., & McCallum, A. (2015). Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv preprint arXiv:1504.06654. Retrieved from http://arxiv.org/abs/1504.06654

Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers, 36, 402–407. doi:10.3758/BF03195588

Olney, A. M., Dale, R., & D'Mello, S. K. (2012). The world within Wikipedia: An ecology of mind. Information, 3(2), 229–255. doi:10.3390/info3020229

Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 25–29 October 2014, Doha, Qatar (pp. 1532–1543). Stroudsburg, PA: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/D14-1162

Pereira, F., Detre, G., & Botvinick, M. (2011). Generating text from functional brain images. Frontiers in Human Neuroscience, 5, Article ID 72. doi:10.3389/fnhum.2011.00072

Recchia, G., & Jones, M. N. (2009). More data trumps smarter algorithms: Comparing pointwise mutual information with latent semantic analysis. Behavior Research Methods, 41(3), 647–656. doi:10.3758/BRM.41.3.647

Reisinger, J., & Mooney, R. J. (2010). Multi-prototype vector-space models of word meaning. In Proceedings of human language technologies: The 2010 annual conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), 2–4 June 2010, Los Angeles, CA (pp. 109–117). Retrieved from https://utexas.influuent.utsystem.edu/en/publications/multi-prototype-vector-space-models-of-word-meaning

Riordan, B., & Jones, M. N. (2011). Redundancy in perceptual and linguistic experience: Comparing feature-based and distributional models of semantic representation. Topics in Cognitive Science, 3(2), 303–345. doi:10.1111/j.1756-8765.2010.01111.x

Rohde, D. L., Gonnerman, L. M., & Plaut, D. C. (2006). An improved model of semantic similarity based on lexical co-occurrence. Communications of the ACM, 8, 627–633. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.131.9401&rep=rep1&type=pdf
Rubin, T. N., Kievit-Kylar, B., Willits, J. A., & Jones, M. N. (2014). Organizing the space and behavior of semantic models. In Proceedings of the 36th annual conference of the Cognitive Science Society (CogSci 2014), 23–26 July 2014, Quebec City (pp. 1329–1334). Austin, TX: Cognitive Science Society. Retrieved from https://mindmodeling.org/cogsci2014/papers/234/paper234.pdf

Sahlgren, M. (2008). The distributional hypothesis. Italian Journal of Linguistics, 20, 33–54. Retrieved from http://linguistica.sns.it/RdL/20.1/Sahlgren.pdf

Steyvers, M., Shiffrin, R. M., & Nelson, D. L. (2004). Word association spaces for predicting semantic similarity effects in episodic memory. In Experimental cognitive psychology and its applications: Festschrift in honor of Lyle Bourne, Walter Kintsch, and Thomas Landauer (pp. 237–249). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.66.5334&rep=rep1&type=pdf

Turian, J., Ratinov, L., & Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the Association for Computational Linguistics, 11–16 July 2010, Uppsala, Sweden (pp. 384–394). Retrieved from http://dl.acm.org/citation.cfm?id=1858681.1858721

Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37, 141–188. doi:10.1613/jair.2934

Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327–352. doi:10.1037/0033-295X.84.4.327

Vinson, D. P., & Vigliocco, G. (2008). Semantic feature production norms for a large set of objects and events. Behavior Research Methods, 40, 183–190. doi:10.3758/BRM.40.1.183

Wang, S., & Manning, C. D. (2012). Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th annual meeting of the Association for Computational Linguistics, 8–14 July 2012, Jeju, South Korea (pp. 90–94). Stroudsburg, PA: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/P12-2018

Yang, L., & Jin, R. (2006). Distance metric learning: A comprehensive survey. East Lansing: Department of Computer Science and Engineering, Michigan State University. Retrieved from https://www.cs.cmu.edu/~liuy/frame_survey_v2.pdf

Yao, X., & Van Durme, B. (2011). Nonparametric Bayesian word sense induction. In Proceedings of TextGraphs-6: Graph-based methods for natural language processing, 19–24 June 2011, Portland, OR (pp. 10–14). Stroudsburg, PA: Association for Computational Linguistics. Retrieved from http://www.cs.jhu.edu/~vandurme/papers/YaoVanDurmeACL2011.pdf