A Systematic Study of Semantic Vector Space Model Parameters
Total Page:16
File Type:pdf, Size:1020Kb
A Systematic Study of Semantic Vector Space Model Parameters Douwe Kiela Stephen Clark University of Cambridge University of Cambridge Computer Laboratory Computer Laboratory [email protected] [email protected] Abstract we use a selection of source corpora. Hence there are two additional “superparameters”: We present a systematic study of parame- dataset for evaluation; ters used in the construction of semantic • source corpus. vector space models. Evaluation is car- • ried out on a variety of similarity tasks, in- Previous studies have been limited to investigat- cluding a compositionality dataset, using ing only a small number of parameters, and us- several source corpora. In addition to rec- ing a limited set of source corpora and tasks for ommendations for optimal parameters, we evaluation (Curran and Moens, 2002a; Curran and present some novel findings, including a Moens, 2002b; Curran, 2004; Grefenstette, 1994; similarity metric that outperforms the al- Pado and Lapata, 2007; Sahlgren, 2006; Turney ternatives on all tasks considered. and Pantel, 2010; Schulte im Walde et al., 2013). 1 Introduction Rohde et al. (2006) considered several weighting schemes for a large variety of tasks, while Weeds Vector space models (VSMs) represent the mean- et al. (2004) did the same for similarity metrics. ings of lexical items as vectors in a “semantic Stone et al. (2008) investigated the effectiveness space”. The benefit of VSMs is that they can eas- of sub-spacing corpora, where a larger corpus is ily be manipulated using linear algebra, allowing queried in order to construct a smaller sub-spaced a degree of similarity between vectors to be com- corpus (Zelikovitz and Kogan, 2006). Blacoe and puted. They rely on the distributional hypothesis Lapata (2012) compare several types of vector rep- (Harris, 1954): the idea that “words that occur in resentations for semantic composition tasks. The similar contexts tend to have similar meanings” most comprehensive existing studies of VSM pa- (Turney and Pantel, 2010; Erk, 2012). The con- rameters — encompassing window sizes, feature struction of a suitable VSM for a particular task is granularity, stopwords and dimensionality reduc- highly parameterised, and there appears to be little tion — are by Bullinaria and Levy (2007; 2012) consensus over which parameter settings to use. and Lapesa and Evert (2013). This paper presents a systematic study of the Section 2 introduces the various parameters of following parameters: vector space model construction. We then attempt, vector size; in Section 3, to answer some of the fundamen- • window size; tal questions for building VSMs through a number • of experiments that consider each of the selected window-based or dependency-based context; • parameters. In Section 4 we examine how these feature granularity; • findings relate to the recent development of dis- similarity metric; • tributional compositional semantics (Baroni et al., weighting scheme; • 2013; Clark, 2014), where vectors for words are stopwords and high frequency cut-off. combined into vectors for phrases. • A representative set of semantic similarity 2 Data and Parameters datasets has been selected from the literature, in- cluding a phrasal similarity dataset for evaluating Two datasets have dominated the literature with compositionality. The choice of source corpus is respect to VSM parameters: WordSim353 (Finkel- likely to influence the quality of the VSM, and so stein et al., 2002) and the TOEFL synonym dataset 21 Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC) @ EACL 2014, pages 21–30, Gothenburg, Sweden, April 26-30 2014. c 2014 Association for Computational Linguistics Dataset Pairings Words Vector size Each component of a vector repre- RG 65 48 sents a context (or perhaps more accurately a “con- MC 30 39 W353 353 437 textual element”, such as second word to the left MEN 3000 751 of the target word).2 The number of components TOEFL 80 400 M&L10 324 314 varies hugely in the literature, but a typical value is in the low thousands. Here we consider vec- Table 1: Datasets for evaluation tor sizes ranging from 50,000 to 500,000, to see whether larger vectors lead to better performance. (Landauer and Dumais, 1997). There is a risk Context There are two main approaches to mod- that semantic similarity studies have been overfit- elling context: window-based and dependency- ting to their idiosyncracies, so in this study we based. For window-based methods, contexts are evaluate on a variety of datasets: in addition to determined by word co-occurrences within a win- WordSim353 (W353) and TOEFL, we also use dow of a given size, where the window simply the Rubenstein & Goodenough (RG) (1965) and spans a number of words occurring around in- Miller & Charles (MC) (1991) data, as well as stances of a target word. For dependency-based a much larger set of similarity ratings: the MEN methods, the contexts are determined by word dataset (Bruni et al., 2012). All these datasets con- co-occurrences in a particular syntactic relation sist of human similarity ratings for word pairings, with a target word (e.g. target word dog is the except TOEFL, which consists of multiple choice subject of run, where run subj is the context). questions where the task is to select the correct We consider different window sizes and compare synonym for a target word. In Section 4 we ex- window-based and dependency-based methods. amine our parameters in the context of distribu- tional compositional semantics, using the evalua- Feature granularity Context words, or “fea- tion dataset from Mitchell and Lapata (2010). Ta- tures”, are often stemmed or lemmatised. We in- ble 1 gives statistics for the number of words and vestigate the effect of stemming and lemmatisa- word pairings in each of the datasets. tion, in particular to see whether the effect varies As well as using a variety of datasets, we also with corpus size. We also consider more fine- consider three different corpora from which to grained features in which each context word is build the vectors, varying in size and domain. paired with a POS tag or a lexical category from These include the BNC (Burnard, 2007) (106 CCG (Steedman, 2000). 8 word types, 10 tokens) and the larger ukWaC Similarity metric A variety of metrics can be 7 9 (Baroni et al., 2009) (10 types, 10 tokens). used to calculate the similarity between two vec- We also include a sub-spaced Wikipedia corpus tors. We consider the similarity metrics in Table 2. (Stone et al., 2008): for all words in the eval- uation datasets, we build a subcorpus by query- Weighting Weighting schemes increase the im- ing the top 10-ranked Wikipedia documents using portance of contexts that are more indicative of the the words as search terms, resulting in a corpus meaning of the target word: the fact that cat co- with 106 word types and 107 tokens. For examin- occurs with purr is much more informative than ing the dependency-based contexts, we include the its co-occurrence with the. Table 3 gives defini- Google Syntactic N-gram corpus (Goldberg and tions of the weighting schemes considered. Orwant, 2013), with 107 types and 1011 tokens. Stopwords, high frequency cut-off Function words and stopwords are often considered too un- 2.1 Parameters informative to be suitable context words. Ignor- We selected the following set of parameters for in- ing them not only leads to a reduction in model vestigation, all of which are fundamental to vector size and computational effort, but also to a more space model construction1. informative distributional vector. Hence we fol- lowed standard practice and did not use stopwords 1Another obvious parameter would be dimensionality re- as context words (using the stoplist in NLTK (Bird duction, which we chose not to include because it does not et al., 2009)). The question we investigated is represent a fundamental aspect of VSM construction: di- mensionality reduction relies on some original non-reduced 2We will use the term “feature” or “context” or “context model, and directly depends on its quality. word” to refer to contextual elements. 22 Measure Definition Scheme Definition Euclidean 1 w = f 1+ Pn (u v )2 None ij ij √ i=1 i− i 1 N Pn TF-IDF wij = log(fij ) log( ) Cityblock 1+ u v nj i=1 | i− i| × 1 N TF-ICF wij = log(fij ) log( ) Chebyshev 1+max u v fj i | i− i| × u v fij N nj +0.5 · Okapi BM25 wij = f log − Cosine u v j fij +0.5 0.5+1.5 +fij | || | × fj (u µ ) (v µ ) j − u · − v f Correlation u v (0.5+0.5 ij ) log( N ) | || | × maxf nj Pn ATC wij = r f 2 i=0 min(ui,vi) PN ij N 2 Dice Pn i=1[(0.5+0.5 max ) log( n )] ui+vi × f j i=0 N (log(fij )+1.0) log( ) u v nj Jaccard Pn · LTU w = i=0 ui+vi ij j 0.8+0.2 fj f Pn × × j i=0 min(ui,vi) Jaccard2 Pn P (tij cj ) i=0 max(ui,vi) | MI wij = log P (t )P (c ) Pn ij j i=0 ui+vi Lin u + v | | | | PosMI max(0, MI) u v Tanimoto u + v· u v P (tij cj ) P (tij )P (cj ) | | | |− · T-Test wij = | − P (t )P (c ) 1 (D(u u+v )+D(v u+v )) √ ij j 2 || 2 || 2 Jensen-Shannon Div 1 √ − 2 log 2 χ2 see (Curran, 2004, p. 83) D(u αv+(1 α)u) || − α-skew 1 √2 log 2 f f − ij × Lin98a wij = f f i× j nj Lin98b wij = 1 log Table 2: Similarity measures between vectors v − × N and u, where vi is the ith component of v log fij +1 Gref94 wij = log nj +1 whether removing more context words, based on a frequency cut-off, can improve performance.