Feature2Vec: Distributional semantic modelling of human property knowledge

Steven Derby Paul Miller Barry Devereux

Queen’s University Belfast, Belfast, United Kingdom {sderby02, p.miller, b.devereux}@qub.ac.uk

Abstract

Feature norm datasets of human conceptual knowledge, collected in surveys of human volunteers, yield highly interpretable models of word meaning and play an important role in neurolinguistic research on semantic cognition. However, these datasets are limited in size due to practical obstacles associated with exhaustively listing properties for a large number of words. In contrast, the development of distributional modelling techniques and the availability of vast text corpora have allowed researchers to construct effective vector space models of word meaning over large lexicons. However, this comes at the cost of interpretable, human-like information about word meaning. We propose a method for mapping human property knowledge onto a distributional semantic space, which adapts the Word2Vec architecture to the task of modelling concept features. Our approach gives a measure of concept and feature affinity in a single semantic space, which makes for easy and efficient ranking of candidate human-derived semantic properties for arbitrary words. We compare our model with a previous approach, and show that it performs better on several evaluation tasks. Finally, we discuss how our method could be used to develop efficient sampling techniques to extend existing feature norm datasets in a reliable way.

1 Introduction

Distributional semantic modelling of word meaning has become a popular method for building pretrained lexical representations for downstream Natural Language Processing (NLP) tasks (Baroni and Lenci, 2010; Mikolov et al., 2013; Pennington et al., 2014; Levy and Goldberg, 2014; Bojanowski et al., 2017). In this approach, meaning is encoded in a dense vector space model, such that words (or concepts) with spatially close vector representations are similar in meaning. A criticism of these real-valued embedding vectors is the opaqueness of their representational dimensions and their lack of cognitive plausibility and interpretability (Murphy et al., 2012; Şenel et al., 2018). In contrast, human conceptual property knowledge is often modelled in terms of relatively sparse and interpretable vectors, based on verbalizable, human-elicited features collected in property knowledge surveys (McRae et al., 2005; Devereux et al., 2014). However, gathering and collating human-elicited property knowledge for concepts is very labour intensive, limiting both the number of words for which a rich feature set can be gathered and the completeness of the feature listings for each word. Neural embedding models, on the other hand, learn from large corpora of text in an unsupervised fashion, allowing very detailed, high-dimensional semantic models to be constructed for a very large number of words. In this paper, we propose Feature2Vec, a computational framework that combines information from human-elicited property knowledge with information from distributional word embeddings, allowing us to exploit the strengths and advantages of both approaches. Feature2Vec maps human property norms onto a pretrained vector space model of word meaning. Embedding feature-based information in the pretrained embedding space makes it possible to rank the relevance of features using cosine similarity, and we demonstrate how simple composition of features can be used to approximate concept vectors.

2 Related Work

Several property-listing studies have been conducted with human participants in order to build property norms – datasets of normalized, human-verbalizable feature listings for lexical concepts (McRae et al., 2005; Devereux et al., 2014).

One use of feature norms is to critically examine distributional semantic models on their ability to encode grounded, human-elicited semantic knowledge. For example, Rubinstein et al. (2015) demonstrated that state-of-the-art distributional semantic models fail to predict attributive properties of concept words (e.g. the properties is-red and is-round for the word apple) as accurately as taxonomic properties (e.g. is-a-fruit). Similarly, Sommerauer and Fokkens (2018) investigated the types of semantic knowledge encoded within pretrained word embeddings, concluding that some properties cannot be learned by supervised classifiers. Collell and Moens (2016) compared linguistic and visual representations of object concepts on their ability to represent different types of property knowledge. Research has shown that state-of-the-art distributional semantic models built from text corpora fail to capture important aspects of meaning related to grounded perceptual information, as this kind of information is not adequately represented in the statistical regularities of text data (Li and Gauthier, 2017; Kelly et al., 2014). Motivated by these issues, Silberer (2017) constructed multimodal semantic models from text and image data, with the goal of grounding word meaning using visual attributes. More recently, Derby et al. (2018) built similar models with the added constraint of sparsity, demonstrating that sparse multimodal vectors provide a more faithful representation of human semantic representations. Finally, the work that most resembles ours is that of Fagarasan et al. (2015), who use Partial Least Squares Regression (PLSR) to learn a mapping from a word embedding model onto specific conceptual properties. Concurrent work recently undertaken by Li and Summers-Stay (2019) replaces the PLSR model with a feedforward neural network. In our work, we instead map property knowledge directly into vector space models of word meaning, rather than learning a supervised predictive function from concept embedding dimensions to feature terms.

3 Method

We make primary comparison with the work of Fagarasan et al. (2015), although their approach differs from ours in that they map from an embedding space onto the feature space, while we learn a mapping from the feature domain onto the embedding space. We outline both methods below.

3.1 Distributional Semantic Models

For our experiments, we make use of the pretrained GloVe embeddings (Pennington et al., 2014) provided in the spaCy package (https://spacy.io/models/en#en_core_web_lg), trained on the Common Crawl (http://commoncrawl.org/). The GloVe model includes 685,000 tokens with embedding vectors of dimension 300, providing excellent lexical coverage with a rich set of semantic representations.

In our analyses we use both the McRae property norms (McRae et al., 2005), which contain 541 concepts and 2526 features, and the CSLB norms (Devereux et al., 2014), which have 638 concepts and 2725 features. For both sets of norms, a feature is listed for a concept if it has been elicited by five or more participants in the property norming study. The number of participants listing a given feature for a given concept name is termed the production frequency for that concept×feature pair. This gives sparse production frequency vectors for each concept over all the features in the norms.

3.2 Partial Least Squares Regression (PLSR)

Fagarasan et al. (2015) used partial least squares regression (PLSR) to map between the GloVe embedding space and property norm vectors. Suppose we have two real-valued matrices G ∈ R^(n×m) and F ∈ R^(n×k). In this context, G and F represent GloVe embedding vectors and property norm feature vectors, respectively. For n available concept words, G is a matrix consisting of stacked pretrained GloVe embeddings and F is the (sparse) matrix of production frequencies for each concept×feature pair. G and F share the same row indexing for concept words. For a new dimension size p ∈ N, a partial least squares regression learns two new subspaces of dimension n×p which have maximal covariance with one another. The algorithm solves this problem by learning a mapping from the matrix G onto F, similar to a regression model. The fitted regression model thus provides a framework for predicting vectors in the feature space from vectors in the embedding space.

In this work, we use the PLSR approach as a baseline for our model. In implementing PLSR, we set the intermediate dimension size to 50, following Fagarasan et al. (2015). We also build a PLSR model using 120 dimensions, which in preliminary experimentation we found gave the best performance from a range of values tested.
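As a concrete illustration of this baseline, the following is a minimal sketch of fitting a PLSR mapping from the GloVe matrix G onto the feature matrix F with scikit-learn; it is not the implementation used in the experiments, and the variable names glove_matrix, feature_matrix and n_components are assumptions made for the example.

    # Minimal PLSR baseline sketch (illustrative; not the paper's experimental code).
    # Assumes glove_matrix (n_concepts x 300) and feature_matrix (n_concepts x n_features)
    # share the same row order over concept words, as described in Section 3.2.
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    def fit_plsr(glove_matrix, feature_matrix, n_components=50):
        # Learn latent subspaces of dimension n_components with maximal covariance
        # between the embedding matrix G and the feature matrix F.
        plsr = PLSRegression(n_components=n_components)
        plsr.fit(glove_matrix, feature_matrix)
        return plsr

    def predict_features(plsr, glove_vector):
        # Predict a dense, production-frequency-like feature vector for one word.
        return plsr.predict(np.asarray(glove_vector).reshape(1, -1))[0]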

Figure 1: Diagram illustrating the relationship of Feature2Vec to the standard Skip-Gram Word2Vec architecture. (a) The standard Skip-Gram Word2Vec architecture with negative sampling. Word and context embeddings are randomly initialized, and both sets of embeddings are trainable. (b) Our Feature2Vec architecture, which uses pretrained word embeddings that are fixed during training. The network learns semantic feature embeddings for feature labels in property norm datasets instead of context embeddings.

3.3 Skip-Gram Word2Vec

Mikolov et al. (2013) proposed learning word embeddings using a predictive neural-network approach. In particular, the skip-gram implementation with negative sampling mines word co-occurrences within a text window, and the network must learn to predict the surrounding context from the target word. More specifically, for a vocabulary V, two sets of embeddings are learned through gradient descent, one for target embeddings and one for context embeddings. Given a target word w ∈ V and a context word c ∈ V in its window, the network calculates the dot product of the embeddings for w and c, and a sigmoid activation is applied to the output (Fig. 1(a)). Negative samples are also generated for training, where the context is not in the target word's window. Let (w, c) ∈ D be the positive word and context pairs and (w, c) ∈ D′ be the negative word and context pairs. Then, using binary cross entropy loss, we learn a parameterization θ of the neural network that maximizes the function

J(θ) = Σ_{(w,c)∈D} log σ(v_c · v_w) + Σ_{(w,c)∈D′} log σ(−v_c · v_w)

where σ is the sigmoid function and v_w and v_c are the corresponding real-valued embeddings for the target words and context words (Goldberg and Levy, 2014).

In this work, we adapt this skip-gram approach to the task of constructing semantic representations of human property norms by mapping properties into an embedding space (Figure 1). We achieve this by using a neural network to predict the properties from the input word, using the skip-gram architecture with negative sampling on the properties. We replace the context embeddings and windowed co-occurrence counts of the conventional skip-gram architecture with property embeddings and concept-feature production frequencies. The loss function for training remains the same; however, there are two modifications to the learning process. The first is that the target embeddings for the concept words are pre-trained (i.e. the GloVe embeddings), and gradients are not applied to this embedding matrix. The layer for the property norms is randomly initialized, and gradients are applied to these vectors to learn a semantic representation for properties aligned to the pre-trained distributional semantic space for the words. Secondly, the negative samples are generated from randomly sampled properties. We downweight negative samples by multiplying their associated loss by one over the negative sampling rate, so that the system pays more attention to real cases and less to the incorrect negative examples. Due to the sparsity of word-feature production frequencies, we generate all positive instances and randomly sample negative examples after each epoch to create a new set of training samples. We name this approach Feature2Vec (code available at https://github.com/stevend94/Feature2Vec). We use a learning rate of 0.001 and Adam optimization to train for 120 epochs, with a negative sampling rate of 20 for both the McRae and CSLB datasets. In contrast to the PLSR approach, Feature2Vec learns a mapping from the feature norms onto the GloVe embedding space.
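The following training sketch, written in PyTorch for concreteness, illustrates the modifications described above (frozen pretrained concept embeddings, trainable feature embeddings, negatives resampled each epoch and downweighted by the sampling rate). It is an illustrative reconstruction under these stated assumptions, not the released implementation linked above, and the data-preparation variables glove_matrix and positive_pairs are assumed to be built elsewhere.

    # Illustrative Feature2Vec training sketch (PyTorch); not the authors' released code.
    # Assumes: glove_matrix is an (n_concepts x 300) numpy array of pretrained GloVe
    # vectors, and positive_pairs is a list of (concept_index, feature_index) pairs
    # taken from the property norms (production frequency >= 5).
    import random
    import torch
    import torch.nn as nn

    class Feature2Vec(nn.Module):
        def __init__(self, glove_matrix, n_features):
            super().__init__()
            dim = glove_matrix.shape[1]
            # Concept (target) embeddings: pretrained GloVe vectors, frozen during training.
            self.concepts = nn.Embedding.from_pretrained(
                torch.tensor(glove_matrix, dtype=torch.float32), freeze=True)
            # Feature embeddings: randomly initialized and trainable, playing the role
            # of context embeddings in skip-gram with negative sampling.
            self.features = nn.Embedding(n_features, dim)

        def forward(self, concept_idx, feature_idx):
            # Dot product between concept vector and feature vector (pre-sigmoid score).
            return (self.concepts(concept_idx) * self.features(feature_idx)).sum(dim=-1)

    def train(model, positive_pairs, n_features, neg_rate=20, epochs=120, lr=0.001):
        optimiser = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            # Resample negatives after each epoch: random (concept, feature) pairs.
            negatives = [(c, random.randrange(n_features))
                         for c, _ in positive_pairs for _ in range(neg_rate)]
            pairs = positive_pairs + negatives
            concepts = torch.tensor([c for c, _ in pairs])
            feats = torch.tensor([f for _, f in pairs])
            labels = torch.cat([torch.ones(len(positive_pairs)),
                                torch.zeros(len(negatives))])
            # Downweight negatives by 1 / negative sampling rate, as described above.
            weights = torch.cat([torch.ones(len(positive_pairs)),
                                 torch.full((len(negatives),), 1.0 / neg_rate)])
            optimiser.zero_grad()
            loss = nn.functional.binary_cross_entropy_with_logits(
                model(concepts, feats), labels, weight=weights)
            loss.backward()
            optimiser.step()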

    Model          McRae                              CSLB
                   Top 1   Top 5   Top 10   Top 20    Top 1   Top 5   Top 10   Top 20
    PLSR 50         3.55   14.18    29.79    47.52     2.90   23.19    44.20    60.87
    PLSR 120        3.55   25.53    43.26    53.90     7.25   34.78    55.80    71.74
    Feature2Vec     4.96   34.75    45.39    60.99     4.35   36.23    53.62    76.09

Table 1: Accuracy scores (percent) for retrieving the correct concept from amongst the top N most similar nearest neighbours. For PLSR, predicted feature vectors for test concepts are compared to actual feature vectors. For Feature2Vec, predicted concept word embeddings are compared to the actual GloVe embedding vectors.

    Model          McRae               CSLB
                   Train    Test       Train    Test
    PLSR 50        49.52   31.67       50.58   40.25
    PLSR 120       68.66   32.97       65.42   40.71
    Feature2Vec    90.70   35.33       90.31   44.30

Table 2: The average percentage of features that each method can predict for a given concept vector.

4 Experiments

We train with the McRae and CSLB property norms separately and report evaluations for each dataset. For the McRae dataset we use 400 randomly selected concepts for training and the remaining 141 for testing; for the CSLB dataset we use 500 randomly selected concepts for training and the remaining 138 for testing. (We use the Python pickle package to store the numpy state for reproducible results in our code.)

4.1 Predicting Feature Vectors

We first evaluate how well the baseline PLSR model performs on the feature vector reconstruction task used by Fagarasan et al. (2015). In this evaluation, the feature vector for a test concept is predicted and we test whether the real concept vector is within the top N most similar neighbours of the predicted vector. We report results for both 50 dimensions (as in Fagarasan et al. (2015)) and 120 dimensions, over a range of values of N (Table 1).

4.2 Constructing Concept Representations from Feature Vectors

For Feature2Vec, we embed property norm features into the GloVe semantic space, giving a representation of properties in terms of GloVe dimensions. To predict a held-out concept embedding, we build a representation of the concept word by averaging the learned feature embedding vectors for that word, using the ground truth information from the property norm dataset. This gives a method to construct embeddings for new words using property knowledge and associated production frequencies (for example, for a held-out word unicorn, its GloVe embedding vector might be predicted from all features of horse, along with the features is-white, has-a-horn, and is-fantastical). We compare these predicted embeddings to the held-out GloVe embeddings (Table 1). However, we note that this approach is different to the PLSR models, so we do not make a direct comparison between PLSR and Feature2Vec nearest neighbour results. Nevertheless, the results show that the word embeddings composed from the learned Feature2Vec feature embeddings appear relatively frequently amongst the most similar neighbour words in the pretrained GloVe space, indicating that feature embedding composition approximates the original word embeddings reasonably well.
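As an illustration of this composition step (a sketch under assumed variable names, not the experimental code), a held-out concept vector can be built by averaging its feature embeddings and compared with the concept's GloVe vector by cosine similarity:

    # Illustrative sketch of concept construction by feature composition (Section 4.2).
    # Assumes feature_vectors is an (n_features x 300) numpy array of learned
    # Feature2Vec embeddings, concept_features maps a concept to the indices of its
    # listed features, and glove maps words to their 300-d GloVe vectors.
    import numpy as np

    def compose_concept(concept, concept_features, feature_vectors):
        # Average the feature embeddings listed for the concept in the property norms.
        idx = concept_features[concept]
        return feature_vectors[idx].mean(axis=0)

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Example: how close is the composed vector to the held-out GloVe vector?
    # predicted = compose_concept("unicorn", concept_features, feature_vectors)
    # print(cosine(predicted, glove["unicorn"]))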
4.3 Predicting Property Knowledge

The evaluation task that we are most interested in is how well the models can predict feature knowledge for concepts, given the distributional semantic vectors. More specifically, for a given concept with K features, we wish to take the top K predicted features according to each method and record the overlap with the true property norm listing. In this evaluation, we make direct comparisons between all three models (PLSR 50, PLSR 120, and Feature2Vec). For the PLSR models, we predict the feature vector for a given target word using the embedding vector as input and take the top K weighted features. For Feature2Vec, we rank all feature embeddings by their cosine similarity to the embedding for the target word and take the top K most similar features (Table 2). The results demonstrate that Feature2Vec outperforms the PLSR models on property knowledge prediction, for both training and testing datasets.
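For completeness, a small sketch of this ranking-and-overlap evaluation; again this is illustrative, with assumed variables (feature_vectors, feature_names, gold, glove) rather than the paper's actual code.

    # Illustrative sketch of the property-prediction evaluation (Section 4.3).
    # Assumes feature_vectors (n_features x 300), a parallel list feature_names, a dict
    # gold mapping each concept to the set of its true feature names, and a dict glove
    # mapping words to their 300-d GloVe vectors.
    import numpy as np

    def top_k_features(word_vector, feature_vectors, feature_names, k):
        # Rank all features by cosine similarity to the word's GloVe embedding.
        fv = feature_vectors / np.linalg.norm(feature_vectors, axis=1, keepdims=True)
        wv = word_vector / np.linalg.norm(word_vector)
        order = np.argsort(-(fv @ wv))[:k]
        return [feature_names[i] for i in order]

    def feature_overlap(concept, glove, feature_vectors, feature_names, gold):
        # Percentage of a concept's true features recovered in the top K predictions,
        # where K is the number of features listed for the concept in the norms.
        true_feats = gold[concept]
        predicted = top_k_features(glove[concept], feature_vectors, feature_names,
                                   k=len(true_feats))
        return 100.0 * len(set(predicted) & true_feats) / len(true_feats)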

    Concept       Top 10 predicted properties
    Kingfisher    has wings, does fly, has a beak, has feathers, is a bird,
                  does eat, has a tail, does swim, has legs, does lay eggs
    Avocado       is eaten edible, is tasty, does grow, is green, is healthy,
                  is used in cooking, has skin peel, is red, is food, is a vegetable
    Door          made of metal, has a door doors, is useful, has a handle handles,
                  made of wood, made of plastic, is heavy, is furniture,
                  does contain hold, is found in kitchens
    Dragon        is big large, is an animal, has a tail, does eat, is dangerous,
                  has legs, has claws, is grey, is small, does fly

Table 3: Top 10 predictions from the CSLB-trained Feature2Vec model. The properties in bold and underlined are those that agree with the available ground truth features from the CSLB dataset. The first two concepts are from the CSLB test set, whilst the final two words were randomly sampled from the word embedding lexicon.
4.4 Analysis

Following previous work, we provide the top 10 feature predictions for a few sample concepts, displayed in Table 3. Properties underlined and in bold represent features that match the available ground truth data (i.e., the concept×feature pair occurs in the norms). The first two words in Table 3 were sampled from the CSLB norms test set, whilst the last two words were randomly sampled from the word embedding lexicon and are not concept words appearing in the CSLB norms. We find that the predicted features that are not contained within the ground truth property set still tend to be quite reasonable, even for the two concepts not in the test dataset. As property norms do not represent an exhaustive listing of property knowledge, this is not surprising, and predicted properties not in the norms are not necessarily errors (Devereux et al., 2009; Fagarasan et al., 2015). Moreover, the set of features used within the norms is dependent on the concepts that were presented to the human participants. It is therefore notable that the conceptual representations predicted by our model for the two out-of-norms concept words are particularly plausible, even though the attributes were never intended to conceptually represent these words. Our analysis supports the view that such supervised models could be utilised as an assistive tool for surveying much larger vocabularies of words.

5 Conclusion

We proposed a method for constructing distributional semantic vectors for human property norms from a pretrained vector space model of word meaning, which outperforms previous methods for predicting concept features on two property norm datasets. As discussed by Fagarasan et al. (2015) and others, it is clear that property norm datasets provide only a semi-complete picture of human conceptual knowledge, and more extensive surveys may provide additional useful property knowledge. By predicting plausible semantic features for concepts by leveraging corpus-derived word embedding data, our method offers a useful tool for guiding the expensive and laborious process of collecting property norm listings. For example, existing property norm datasets can be extended through human verification of features predicted with high confidence by Feature2Vec, with these features being added to the norms and subsequently incorporated into Feature2Vec in an iterative, semi-supervised manner (Kelly et al., 2012). Thus, Feature2Vec provides a useful heuristic for adding interpretable, feature-based information to these datasets for new words in a practical and efficient way.

References

Marco Baroni and Alessandro Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Guillem Collell and Marie-Francine Moens. 2016. Is an image worth more than a thousand words? On the fine-grain semantic differences between visual and linguistic representations. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2807–2817.

Steven Derby, Paul Miller, Brian Murphy, and Barry Devereux. 2018. Using sparse semantic embeddings learned from multimodal text and image data to model human conceptual knowledge. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 260–270, Brussels, Belgium. Association for Computational Linguistics.

Barry Devereux, Nicholas Pilkington, Thierry Poibeau, and Anna Korhonen. 2009. Towards unrestricted, large-scale acquisition of feature-based conceptual representations from corpus data. Research on Language and Computation, 7(2):137–170.

Barry J. Devereux, Lorraine K. Tyler, Jeroen Geertzen, and Billi Randall. 2014. The Centre for Speech, Language and the Brain (CSLB) concept property norms. Behavior Research Methods, 46(4):1119–1127.

Luana Fagarasan, Eva Maria Vecchi, and Stephen Clark. 2015. From distributional semantics to feature norms: grounding semantic models in human perceptual data. In Proceedings of the 11th International Conference on Computational Semantics, pages 52–57.

Yoav Goldberg and Omer Levy. 2014. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

Colin Kelly, Barry Devereux, and Anna Korhonen. 2012. Semi-supervised learning for automatic conceptual property extraction. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2012), pages 11–20, Montréal, Canada. Association for Computational Linguistics.

Colin Kelly, Barry Devereux, and Anna Korhonen. 2014. Automatic extraction of property norm-like data from large text corpora. Cognitive Science, 38(4):638–682.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302–308.

Dandan Li and Douglas Summers-Stay. 2019. Mapping distributional semantics to property norms with deep neural networks. Big Data and Cognitive Computing, 3(2):30.

Lucy Li and Jon Gauthier. 2017. Are distributional representations ready for the real world? Evaluating word vectors for grounded perceptual meaning. In Proceedings of the First Workshop on Language Grounding for Robotics, pages 76–85, Vancouver, Canada. Association for Computational Linguistics.

Ken McRae, George S. Cree, Mark S. Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37(4):547–559.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Brian Murphy, Partha Talukdar, and Tom Mitchell. 2012. Learning effective and interpretable semantic models using non-negative sparse embedding. In Proceedings of COLING 2012, pages 1933–1950, Mumbai, India. The COLING 2012 Organizing Committee.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Dana Rubinstein, Effi Levi, Roy Schwartz, and Ari Rappoport. 2015. How well do distributional models capture different types of semantic knowledge? In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 726–730.

L. K. Şenel, I. Utlu, V. Yücesoy, A. Koç, and T. Çukur. 2018. Semantic structure and interpretability of word embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10):1769–1779.

Carina Silberer. 2017. Grounding the meaning of words with visual attributes. In Visual Attributes, pages 331–362. Springer.

Pia Sommerauer and Antske Fokkens. 2018. Firearms and tigers are dangerous, kitchen knives and zebras are not: Testing whether word embeddings can tell. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Brussels, Belgium. Association for Computational Linguistics.
