word2set: WordNet-Based Word Representation Rivaling Neural Word Embedding for Lexical Similarity and Sentiment Analysis

Sergio Jimenez, Instituto Caro y Cuervo, Bogotá D.C., COLOMBIA
Fabio A. González, MindLab Research Group, Universidad Nacional de Colombia, Bogotá D.C., COLOMBIA
Alexander Gelbukh, CIC, Instituto Politécnico Nacional, Mexico City, MEXICO
George Dueñas, CIC, Instituto Politécnico Nacional, Mexico City, MEXICO

IEEE Computational Intelligence Magazine, May 2019. Digital Object Identifier 10.1109/MCI.2019.2901085. Date of publication: 10 April 2019.
Corresponding author: A. Gelbukh (Email: gelbukh@cic.ipn.mx).

Abstract—Measuring lexical similarity using WordNet has a long tradition. In the last decade, it has been challenged by distributional methods, and more recently by neural word embedding. In recent years, several larger lexical similarity benchmarks have been introduced, on which word embedding has achieved state-of-the-art results. The success of such methods has eclipsed the use of WordNet for predicting human judgments of lexical similarity. We propose a new set cardinality-based method for measuring lexical similarity, which exploits the WordNet graph, obtaining a word representation, which we call word2set, based on related neighboring words. We show that the features extracted from set cardinalities computed using this word representation, when fed into a support vector regression classifier trained on a dataset of common synonyms and antonyms, produce results competitive with those of word-embedding approaches. On the task of predicting the lexical sentiment polarity, our WordNet set-based representation significantly outperforms the classical measures and achieves the performance of neural embeddings. Although word embedding is still the best approach for these tasks, our method significantly reduces the gap between the results shown by knowledge-based approaches and by distributional representations, without requiring a large training corpus. It is also more effective for less-frequent words.

I. INTRODUCTION

Automatic understanding of human language is the main goal of the natural language processing field. Given the intrinsic compositionality of human language, the relationships between lexical units (i.e., words) play an important role in this process. In particular, recognizing lexical similarity and lexical relatedness is a key component that endows automatic systems with the ability to relate pairs of sentences that use different words but are close in their meaning. Traditionally, apart from edit distance [1], [2], computational linguists have used two main resources to tackle this task: linguistic knowledge manually coded by lexicographers, such as WordNet, and large corpora. Recently proposed corpus-based methods known as word embeddings have outperformed knowledge-based methods by using neural networks trained on very large corpora. However, the availability and quality of manually coded knowledge or very large corpora vary for different languages and domains. With this, word embedding is not a clear choice in all scenarios. In this paper, we show that knowledge-based methods exploiting all lexical-semantic relationships encoded in WordNet can be competitive with word embedding.

Sentiment analysis is closely related to measuring lexical similarity, since semantically similar words tend to have similar polarity. Fig. 1 shows how pairs of words that both have either positive or negative polarity have, on average, a greater similarity than any other combination. In addition, the cross-like pattern in those graphs shows that neutral words have very low similarity with words of any polarity. With this, the representations used for lexical and textual similarity can also be useful for sentiment analysis [3]. In this context, practically all systems of sentiment analysis rely on a mechanism to determine the sentiment polarity of words. In this paper, we show that there is a large performance gap between predictors of lexical sentiment polarity based on word embeddings [4], [5] and those based on the classical WordNet-based measures [6], [7], [8], [9], [10]. Recently, Li et al. [11] widened that gap by proposing word embeddings optimized for sentiment analysis. Despite this, we demonstrate that WordNet can be used to rival the performance of neural embeddings on the tasks of both lexical similarity and lexical sentiment polarity classification. This brings the WordNet-based methods back into the game.

WordNet [14] is a lexical database that links words in a graph connected by relationships of synonymy, hyperonymy, hyponymy, etc. It has been used for more than 20 years for addressing many NLP tasks, particularly lexical similarity. Lexical similarity functions based on WordNet use graph measures to provide a numerical score of the similarity between two so-called synsets (sets of synonyms in WordNet) or two words [6], [10]. The functions proposed almost two decades ago mainly rely on the is-a hierarchy, the path length between concepts, and the depth of the concepts in the hierarchy; Fig. 2 illustrates these components. Another important concept added to this approach is information content [7], which represents the amount of information conveyed by a concept, computed by combining counts of lexical units in corpora aware of the WordNet is-a hierarchy [7], [8], [9].

Another common approach to address lexical similarity uses the so-called distributional hypothesis of meaning: "words with similar meaning will occur with similar neighbors if enough material is available" [15]. This approach involves the construction of a matrix whose entries contain the number of times a word (rows) occurs in a particular context (columns) across corpora. The context of a word can be a fixed-size window, a sentence, a paragraph, etc. The goal in this approach is to obtain a vectorial representation of the words in a metric feature space and to combine it with the cosine similarity (or other metrics) to provide a similarity score between pairs of words. The resulting matrix is large and sparse, which requires reducing the dimensionality by limiting the size of the word vocabulary or using techniques such as latent semantic analysis (LSA) [16], non-negative matrix factorization (NMF) [17], or random indexing [18], among others. Agirre et al. [19] compared some distributional and WordNet-based approaches, concluding that the former consistently outperformed the latter for lexical similarity. They also differentiated the similarity and relatedness tasks, and since that time, most benchmarks for lexical semantics clearly differentiate these categories.
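To make the count-based pipeline above concrete, the following minimal sketch (ours, not from the original paper) builds a small word-by-word co-occurrence matrix, reduces its dimensionality with truncated SVD in the spirit of LSA [16], and compares two words with the cosine similarity; the toy corpus, window size, and number of dimensions are illustrative assumptions.

```python
# Minimal sketch of a count-based distributional model: co-occurrence
# counts + truncated SVD (LSA-style) + cosine similarity.
# The corpus, window size, and dimensionality are illustrative only.
from collections import Counter
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog are pets",
]
window = 2  # fixed-size context window

tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Word-by-word co-occurrence counts within the window.
counts = Counter()
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[(index[w], index[sent[j]])] += 1

M = np.zeros((len(vocab), len(vocab)))
for (r, c), v in counts.items():
    M[r, c] = v

# Reduce the dimensionality (here to 5) and compare two words by cosine.
vectors = TruncatedSVD(n_components=5, random_state=0).fit_transform(M)
sim = cosine_similarity(vectors[index["cat"]].reshape(1, -1),
                        vectors[index["dog"]].reshape(1, -1))[0, 0]
print(f"cosine(cat, dog) = {sim:.3f}")
```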


Fig. 1. Average lexical similarity of words according to their polarity in two sentiment lexicons [12], [13]. Words were compared using [4].

Fig. 2. Elements involved in the classical approach for lexical similarity using WordNet: the shortest path between two words, their depth in the is-a hierarchy, their least common subsumer, and the information content of the concepts.

Among distributional approaches, neural word embedding [4], [5], [20] has received great attention. In this approach, instead of first obtaining word contexts and then reducing the dimensionality, the dimension of the space is fixed and a single iterative procedure attempts to learn a language model from corpora, obtaining an optimal word representation. This language model aims to build either a prediction model for each word given a large number of contexts (continuous bag-of-words) or a prediction model for contexts given the words (skip-gram) [4]. Baroni et al. [21] compared traditional distributional methods against word embedding, concluding that word embedding was superior in performance on several tasks, including lexical similarity and relatedness.

Both WordNet-based and distributional methods have their advantages and disadvantages in practical applications. For example, while WordNet-based approaches are aware of the different senses of a word, distributional methods merge senses into a single representation. In contrast, when a language other than English is used, a WordNet is a resource difficult and costly to obtain, whereas the text corpora required by distributional approaches are generally available for major languages. Recently, Aletras and Stevenson [22] proposed a hybrid approach combining word embedding and WordNet, obtaining very competitive results but observing only a marginal contribution of the WordNet component to the overall performance.

There is an important gap in performance between WordNet-based and word-embedding approaches for lexical similarity. In this paper, we significantly reduce this difference. We present a new method that uses WordNet to build lexical similarity and relatedness functions. Our method exploits the WordNet graph in a novel way by representing words by their neighboring words in the graph; see Fig. 3. In addition, we use supervised learning and a set of cardinality-based features extracted from this representation. From a relatively large set of features, an optimal subset is selected in a supervised way. For training such models, we have developed a new dataset, W1500, based on a list of common synonyms and antonyms. We compare our method against the studies by Pennington et al. [5], Aletras and Stevenson [22], and Baroni et al. [21] using identical experimental setups.
We also use other publicly available benchmarks for testing, making a total of 14 comparisons. We survey the state-of-the-art results for these benchmarks and compare them with the results obtained in this paper.

We also compare our representation based on words neighboring in WordNet with pre-trained word2vec [4] and GloVe [5] representations on the task of predicting the sentiment polarity of words in a lexicon. For this, we use as benchmarks the Affective Norms for English Words (ANEW) [12] and SenticNet 1 to 4 [13], [23], [24], [25]. In this configuration, our representation obtained results similar to word embeddings and considerably better than the classical WordNet approaches.

The rest of the paper is organized as follows. In Section II, we describe our method. In Sections III and IV, we evaluate our method on the tasks of lexical similarity and lexical sentiment classification. In Section V, we discuss the results. Finally, concluding remarks are given in Section VI.

Online resource for word2set: http://www.gelbukh.com/resources/word2set. Supplementary materials and the source code implementation of the word2set representation generator have been provided. The data include the baseline W1500 lexical relatedness dataset and the word2set representation of all words included in WordNet.

II. OUR METHOD: CARDINALITY-BASED LEXICAL SIMILARITY

In practice, most of the methods for lexical similarity and relatedness based on WordNet rely on the taxonomy formed by hypernym-hyponym relationships [6], [7], [8], [9], [10]. Recall that the nodes in such a taxonomy are synsets, i.e., sets of synonym words (lemmas) interchangeable in some contexts [14]; for example, the synset labeled car.n.01 contains the lemmas car, auto, automobile, and motorcar. Apart from hypernyms and hyponyms, other types of relationships also play an important role in lexical similarity and relatedness. For instance, the synsets bird.n.01 and angel.n.01 can be considered somehow related because both have the same part-holonym wing.n.01. However, they are separated by 17 steps in the is-a taxonomy formed by hypernym-hyponym relationships in WordNet. Therefore, current measures based on that hierarchy return a very low similarity score for that pair. To address this issue, our word representation consists of sets containing related words from the neighborhood of a word in the WordNet graph. Lexical similarity functions based on WordNet usually take synsets as arguments, which requires word sense disambiguation [26] to be applied to the words being compared. Our representation is used to model and compare words rather than synsets, allowing a direct comparison against distributional methods. Clearly, if the words are disambiguated in advance, their word representation should include only words related to the correct senses, producing a less noisy representation; however, in this paper, we do not consider this scenario. In addition, unlike classical WordNet-based approaches focusing on a particular relationship such as hypernymy, we use all types of relationships available in WordNet.

Fig. 3. Outline of our method for exploiting the WordNet graph for lexical similarity: find the words related to word1 and word2 in WordNet, build the two sets R1 and R2 of their neighboring words, find the intersection of the sets, |R1 ∩ R2|, and measure the similarity of the sets.

For each synset, WordNet provides a textual definition (gloss) describing the meaning of the synset. Each synset also has a set of lexical representations (lemmas), which in turn can be linked to other lemmas by relationships at the lemma level, such as antonymy, pertainymy, and related forms, among others. Lemmas may also be related indirectly through synsets, thereby inheriting all the synset-level relationships. For our representation, the neighbor words (lemmas) of a word (lemma) are obtained by following all possible synset- and lemma-level relationships. The obtained set of words is further enriched by extracting keywords from the textual definitions of the synsets directly related to the given word.

First, our algorithm produces a representation from WordNet for each word in an unsupervised way by constructing the set of its related words in the graph. At this point, any resemblance coefficient based on cardinality, such as Jaccard, Dice, cosine, or soft cosine [27], among others, can be used to provide a similarity score for a pair of words. However, the choice of such a coefficient is arbitrary. Alternatively, following Jimenez et al. [28], such a coefficient can be learned from training data using as features the cardinalities of both sets and of their intersection, as well as various algebraic combinations of these three factors. The process, described below in detail, is summarized as follows: (i) take a set of word pairs for training, each labeled with a gold-standard lexical similarity or relatedness; (ii) represent each word as a set of related words extracted from WordNet; (iii) extract 17 cardinality-based factors from each pair of representing sets; (iv) recombine these factors into 272 rational features; (v) determine a reduced set of features in a supervised way; and (vi) fit a regression model using the reduced set of features to a gold standard of similarity or relatedness; see Fig. 4.

A. Word Representation by Neighboring Words

Let w be a word. The set R_w of its neighboring words can be obtained from WordNet using the following procedure,¹ which we call word2set.

¹A Python script implementing this procedure is available at https://github.com/sgjimenezv/neighboring_words.

The set RelatedSynsets_w of synsets related to w is the union of RelatedSynsets_s for s in Synsets_w, where Synsets_w is the set of synsets that contain the word w and RelatedSynsets_s is the set of synsets related to s, i.e., the union of the sets of synsets connected with s in WordNet by one of the following relations: Hypernyms, Instance Hypernyms, Hyponyms, Instance Hyponyms, Member Holonyms, Member Meronyms, Substance Holonyms, Part Holonyms, Substance Meronyms, Part Meronyms, Attributes, Entailments, Causes, Also See, Verb Groups, and Similar To.

The set Lemmas_w of lemmas associated with the word w is the union of Lemmas_s for s in RelatedSynsets_w, where Lemmas_s is the set of lemmas associated with the synset s.

The set AllRelatedLemmas_w of all lemmas related to w is obtained by expanding the set Lemmas_w with their related lemmas, i.e., by adding to it the union of all RelatedLemmas_l for l in Lemmas_w, where RelatedLemmas_l is the set of lemmas related to the lemma l by one of the following relations: Antonyms, Pertainyms, Topic Domains, Region Domains, Usage Domains, Derivationally Related Forms, Hypernyms, Instance Hypernyms, Hyponyms, Instance Hyponyms, Member Holonyms, Member Meronyms, Substance Holonyms, Part Holonyms, Substance Meronyms, Part Meronyms, Attributes, Entailments, Causes, Also Sees, Verb Groups, and Similar To.

Now, the set R_w of the words related to w is obtained as AllRelatedLemmas_w plus the union of all DefinitionKeywords_s for s in RelatedSynsets_w, where DefinitionKeywords_s is the set of words obtained from the text of the definition of the synset s after removing stopwords. In our experiments, we used the list of English stopwords from NLTK [29], as well as the following words, which are frequent and mostly uninformative in WordNet's definitions: act, action, another, become, body, capable, cause, change, coming, consisting, containing, especially, etc, form, giving, group, lacking, made, make, move, one, order, part, particular, people, person, persons, place, position, property, quality, relating, resulting, small, somebody, someone, something, state, time, two, used, using, usually, whose.

With this, the set R_w of words related to w is composed of all its neighboring lemmas and the keywords in the definitions of its related synsets. Table I shows examples of such a representation for words from the rare-word (RW) dataset [30]. One can see that the majority of the representing words are meaningfully related to the represented word. Some poorly related words could arise from less-common senses of the related synsets or from unrelated words extracted from glosses. The average number of representing words was 115 in RW and 211 in the Stanford Contextual Word Similarities (SCWS) dataset [31], which contains fewer rare words. Hence, common words seem to have better connectivity in WordNet's graph compared to less common words [32].
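The following is a minimal sketch of this neighbor-collection step using NLTK's WordNet interface; it follows only a subset of the synset- and lemma-level relations listed above and approximates the keyword extraction from glosses, so it is an illustration of the idea rather than the authors' published script (see footnote 1).

```python
# Illustrative sketch of word2set neighbor collection with NLTK's WordNet API.
# Requires the 'wordnet' and 'stopwords' NLTK data packages; only a subset of
# the relations listed in the text is followed here.
from nltk.corpus import wordnet as wn, stopwords

SYNSET_RELATIONS = ("hypernyms", "hyponyms", "member_holonyms",
                    "member_meronyms", "part_holonyms", "part_meronyms",
                    "attributes", "entailments", "causes", "also_sees",
                    "verb_groups", "similar_tos")
LEMMA_RELATIONS = ("antonyms", "pertainyms", "derivationally_related_forms")
STOP = set(stopwords.words("english"))  # plus the extra gloss stopwords above

def word2set(word):
    """Return the set R_w of words related to `word` in WordNet."""
    related_synsets = set()
    for s in wn.synsets(word):
        related_synsets.add(s)
        for rel in SYNSET_RELATIONS:
            related_synsets.update(getattr(s, rel)())
    related_words = set()
    for s in related_synsets:
        # Lemmas of the related synsets plus their lemma-level neighbors.
        for lemma in s.lemmas():
            related_words.add(lemma.name().replace("_", " "))
            for rel in LEMMA_RELATIONS:
                related_words.update(l.name().replace("_", " ")
                                     for l in getattr(lemma, rel)())
        # Keywords from the gloss of each related synset.
        related_words.update(w for w in s.definition().split()
                             if w.isalpha() and w not in STOP)
    return related_words

# Example: print(sorted(word2set("wealthy"))[:10])
```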
B. Features

Once two words, a and b, are represented by their sets of related words, R_a and R_b, respectively, these sets are to be compared to provide a similarity score that reflects the degree of similarity or relatedness between them. The first option is to use an off-the-shelf resemblance coefficient based on cardinality, such as Jaccard [33] or Dice [34]. These are rational expressions that combine the three cardinalities |R_a|, |R_b| and |R_a ∩ R_b| (or alternatively |R_a ∪ R_b|) to produce a similarity score between 0 and 1. Most of these coefficients produce metrics with desirable properties such as transitivity. However, generally, they fail to adapt to particular tasks, yielding suboptimal performance. Alternatively, parameterized coefficients [35], [36], [37] provide some degree of adaptability, allowing the adjustment of parameters using training data.

When training data is available, regression can be used for training a similarity function adapted to the particular task. This approach has been shown to be effective for several NLP tasks in the recent SemEval campaigns [28], [38], [39].

The cardinality-based features extracted from a pair of sets usually comprise the cardinalities of all possible areas of their Venn diagram. Some additional features were derived by combining the basic features into rational coefficients in an attempt to capture non-linear relationships. Although some of these feature sets produced high-quality prediction models, the feature sets were somewhat arbitrary and their selection was guided by intuition. For our lexical similarity task, a relatively large set of features was extracted using a set of 17 factors to be combined in rational terms. Table II shows this set of factors, f_i, i = 0, ..., 16, which are combined in rational terms f_i/f_j, i ≠ j, to produce the features for our model. These factors are the building blocks of many resemblance coefficients; for example, f_4 is the matching coefficient and f_4/f_11 is the cosine coefficient.

The rationale for the selection of the 17 factors is as follows. Factor f_0 allows the factors and their multiplicative inverses to be features. Factors f_1 to f_7 are the 7 possible areas delimited in the Venn diagram of two intersected sets. Factors f_8 to f_14 are different commonly used instances of the generalized mean between |R_a| and |R_b|, corresponding to the minimum, maximum, arithmetic mean, geometric mean, quadratic mean, cubic mean, and harmonic mean. Finally, factors f_15 and f_16 are variants of f_8 and f_9 used by the Symmetrical Tversky's Ratio Model [40] and by the family of cardinality-based similarity measures proposed by De Baets et al. [41].

This methodology produces a relatively large feature set of 272 features by recombining only three basic cardinalities. The goal of using this relatively large feature set is to be able to use the training data not only to learn a suitable similarity function but also to obtain a good representation for the words.
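A minimal sketch of these features is shown below: the 17 factors of Table II computed from a pair of sets and recombined into the 272 rational features f_i/f_j with i ≠ j. The small epsilon added to the denominators is our own assumption to avoid division by zero; the paper does not specify how such cases are handled.

```python
# Sketch of the cardinality-based features: the 17 factors of Table II and
# their 272 pairwise ratios f_i/f_j (i != j).
from math import sqrt

def cardinality_factors(ra, rb):
    a, b = len(ra), len(rb)
    return [
        1.0,                      # f0
        a,                        # f1
        b,                        # f2
        len(ra | rb),             # f3
        len(ra & rb),             # f4
        len(ra - rb),             # f5
        len(rb - ra),             # f6
        len(ra ^ rb),             # f7  symmetric difference
        min(a, b),                # f8
        max(a, b),                # f9
        (a + b) / 2,              # f10 arithmetic mean
        sqrt(a * b),              # f11 geometric mean
        sqrt((a**2 + b**2) / 2),  # f12 quadratic mean
        ((a**3 + b**3) / 2) ** (1 / 3),         # f13 cubic mean
        2 * a * b / (a + b) if a + b else 0.0,  # f14 harmonic mean
        min(len(ra - rb), len(rb - ra)),        # f15
        max(len(ra - rb), len(rb - ra)),        # f16
    ]

def rational_features(ra, rb, eps=1e-9):
    f = cardinality_factors(ra, rb)
    return [f[i] / (f[j] + eps)
            for i in range(len(f)) for j in range(len(f)) if i != j]

# 17 * 16 = 272 features per word pair, e.g.:
# rational_features({"rich", "wealth"}, {"rich", "money"})
```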

C. Feature Selection

Our method for feature selection is supervised. Suppose we have a training dataset composed of n pairs of words,


Fig. 4. The architecture of our lexical-similarity measuring algorithms: unsupervised COSINE and supervised SVR.

TABLE I. EXAMPLES OF OUR WORD REPRESENTATION BASED ON NEIGHBORING WORDS IN WORDNET

w: R_w
cognizance: apprehension, aware, awareness, certain, clear, cognisance, cognisant, cognise, cognizance, cognizant, cognize, conscious, consciousness, feel, followed, gained, general, incognizance, incognizant, individuality, intuitive, ken, know, knowing, knowingness, knowledge, mental, often, perceived, perceiving, perception, range, realization, scope, self-aware, self-awareness, sense, sensify, showing, sometimes, unaware, unawareness, understand, understanding, ...
incubate: animal, arise, breed, brood, brooder, conditions, conducive, copulate, cover, develop, development, differentiation, eggs, emerge, environment, evolution, evolve, female, given, grow, growth, hatch, hatchery, hatching, horses, incubate, incubation, incubator, individuals, multiplication, multiply, natural, offspring, plant, process, procreate, procreation, procreative, produce, progress, promote, reproduce, reproduction, reproductive, seat, sit, sit down, take, unfold, ...
subdivide: apart, carve up, come, dissever, divide, divisible, part, partitive, parts, pieces, portions, separate, separation, smaller, split, split up, subdivide, subdivider, subdivision, subdivisions, ...
wealthy: abundant, affluence, affluent, flush, loaded, material, money, moneyed, poor, possessing, possessions, rich, richness, supply, value, wealth, wealthiness, wealthy, ...

TABLE II. FACTORS FOR RATIONAL FEATURES

f0 = 1
f1 = |Ra|
f2 = |Rb|
f3 = |Ra ∪ Rb|
f4 = |Ra ∩ Rb|
f5 = |Ra \ Rb|
f6 = |Rb \ Ra|
f7 = |Ra △ Rb|
f8 = min(|Ra|, |Rb|)
f9 = max(|Ra|, |Rb|)
f10 = (|Ra| + |Rb|) / 2
f11 = sqrt(|Ra| × |Rb|)
f12 = sqrt((|Ra|² + |Rb|²) / 2)
f13 = ((|Ra|³ + |Rb|³) / 2)^(1/3)
f14 = 2 × |Ra| × |Rb| / (|Ra| + |Rb|)
f15 = min(|Ra \ Rb|, |Rb \ Ra|)
f16 = max(|Ra \ Rb|, |Rb \ Ra|)

where each pair is annotated with a gold-standard similarity or relatedness, and the 272 features described above are extracted for each training pair. This dataset, D, is a matrix of size n × 272, with a target vector T, of size n × 1, containing the gold standard. Let D^k be a matrix of size n × k containing a selection of the k best features obtained from a linear regression model that fits D to T. In our experiments, we used the KBest feature selection method implemented in Scikit-learn [42]. Basically, the KBest method selects features by ranking each feature individually by the correlation of a simple regression model built using the feature against the target. To avoid overfitting, we performed the feature selection process on each one of the ten partitions of a tenfold cross-validation split and randomly shuffled the samples 30 times, for a total of 300 partitions of D and T; we denote such partitions by D_{i,j}, T_{i,j}, where i indexes the shuffles and j the folds. We obtained the k best features, D^k_{i,j}, for each partition, and included in the final set of features the ones selected in at least 80% of the 300 partitions (using thresholds from 65% to 95% did not significantly change the selected feature set).

D. Regression

Upon extracting and selecting the features, we combined them using support vector regression [43]. The support vector parameters were set to C = 100 and γ = 0.001, and the most appropriate kernel for the task was RBF. All features were standardized before training by subtracting their mean and dividing by their standard deviation. The means and standard deviations for each feature were saved to apply the same transformation to the test data.
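A compact sketch of this selection-plus-regression stage with scikit-learn is shown below. The voting over 300 partitions and the SVR parameters follow the description above, while the exact voting implementation and the helper signatures are our own simplification; D is assumed to be the n × 272 feature matrix and T the gold-standard scores.

```python
# Sketch of Sections II-C and II-D: KBest feature selection voted over
# repeated tenfold splits, followed by an RBF support vector regressor.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import RepeatedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def select_stable_features(D, T, k=89, threshold=0.80):
    # Count, over 10 folds x 30 shuffles = 300 partitions, how often each
    # feature is among the k best, and keep those above the voting threshold.
    votes = np.zeros(D.shape[1])
    splitter = RepeatedKFold(n_splits=10, n_repeats=30, random_state=0)
    n_partitions = 0
    for train_idx, _ in splitter.split(D):
        kbest = SelectKBest(f_regression, k=k).fit(D[train_idx], T[train_idx])
        votes[kbest.get_support()] += 1
        n_partitions += 1
    return np.where(votes / n_partitions >= threshold)[0]

def fit_similarity_model(D, T, selected):
    scaler = StandardScaler().fit(D[:, selected])   # standardize the features
    model = SVR(kernel="rbf", C=100, gamma=0.001)   # parameters from the text
    model.fit(scaler.transform(D[:, selected]), T)
    return scaler, model
```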
E. An Inexpensive Training Dataset W1500

Existing benchmarks for lexical similarity and relatedness were built by selecting a set of pairs of words and aggregating ten or more human judgments on a fixed numerical scale to obtain a gold standard [15], [30], [31], [44], [45]. Recent benchmarks, such as MEN [46], were built by aggregating 50 binary judgments for each pair in an attempt to reduce the noise and cognitive load of using a numerical scale. This methodology requires a large amount of costly manual work and limits the convenience of supervised approaches, such as the one presented in this work. There is also the inconvenience that the word pairs used in benchmarks are selected with particular criteria, such as common nouns, making the models trained with such data less applicable to other types of words.

As an affordable alternative, we built a dataset using publicly available lists of common English synonyms² and antonyms.³ We collected 500 pairs of synonyms and 500 pairs of antonyms. The synonym pairs were labeled with a lexical similarity score of 1, and the antonym pairs were labeled with a constant c. We experimentally determined c = 0.2 to be meaningful for the lexical similarity task. A third subset of 500 pairs was obtained from random combinations of words from the synonyms and antonyms subsets, with manual verification that the two words are unrelated. The pairs in this third subset were labeled with a lexical similarity score of zero.

We refer to this dataset as W1500.⁴ Its purpose was to show that supervised models for predicting lexical similarity, trained with such a simple resource, can perform competitively.

²http://www.englishleap.com/vocabulary/synonyms
³http://www.englisch-hilfen.de/en/words/synonyms.htm
⁴https://sites.google.com/site/sergiojimenezvargas/W1500.txt

III. RESULTS ON LEXICAL SIMILARITY

In our experiments, we compare the performance of our method with that of classical measures based solely on WordNet and measures based on word embedding. We show that our methods drastically reduce the performance gap between WordNet and word-embedding approaches.

A. Datasets

We used the benchmarks previously used by Pennington et al. [5] for word similarity: WS353 (Finkelstein et al. [45]), MC (Miller and Charles [44]), RG (Rubenstein and Goodenough [15]), SCWS (Huang et al. [31]), and RW (Luong et al. [30]). In addition, we also compared our approach with that of Aletras and Stevenson [22]. The comparison used the same six datasets as they employed: MC, RG, WS353, the semantic (WSS) and relatedness (WSR) partitions of WS353 introduced by Agirre et al. [19], and the MEN dataset (relatedness) introduced by Bruni et al. [46].

B. Performance Measure

The usual measure for assessing the performance of a lexical similarity method is Spearman's correlation coefficient r. To provide a single measure across different sets of datasets, we use the simple average of r obtained on each dataset.

C. Classical WordNet-Based Measures

The group of baselines comprises classical lexical similarity measures based on WordNet implemented in NLTK [29]: path (the inverse of the number of edges between two synsets), lch (Leacock and Chodorow [6]), wup (Wu and Palmer [10]), lin.b/lin.s (Lin's measure [8] using the Brown or SemCor corpora for information content calculation), res.b/res.s (Resnik [7]), and jcn.b/jcn.s (Jiang and Conrath [9]). In our lexical similarity experiments, to compare two words, the set of synsets associated with each word is obtained and the maximum value of the similarity function over their Cartesian product is returned.

D. Number of Selected Features

The feature selection method described in Section II-C was applied to the W1500 dataset, obtaining an optimum value of k = 89 features.
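For reference, a minimal sketch of such a classical baseline computed with NLTK is given below, taking the maximum synset-to-synset score over the Cartesian product of the two words' synsets, as described above. Only path and wup are shown; the information-content measures (res, jcn, lin) plug in analogously but also require an information-content dictionary.

```python
# Sketch of a classical WordNet baseline: word-to-word similarity as the
# maximum synset-to-synset score over the Cartesian product of synsets.
from itertools import product
from nltk.corpus import wordnet as wn

def classical_similarity(word1, word2, measure="path"):
    best = 0.0
    for s1, s2 in product(wn.synsets(word1), wn.synsets(word2)):
        if measure == "path":
            score = s1.path_similarity(s2)
        elif measure == "wup":
            score = s1.wup_similarity(s2)
        else:
            raise ValueError("unsupported measure in this sketch")
        if score is not None:
            best = max(best, score)
    return best

# Example: classical_similarity("car", "automobile") -> 1.0 (shared synset)
```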
E. Lexical Similarity Results

We generated results for the classical WordNet measures (Section III-C) and our methods. These results are compared with three studies that published results for distributional methods using the same performance measure (Spearman's r) and datasets. The supervised method presented in Section II is labeled "SVR (this paper)". A second method is reported using the word representation presented in Section II-A combined with the cosine coefficient, "COSINE (this paper)":

sim(R_w1, R_w2) = |R_w1 ∩ R_w2| / sqrt(|R_w1| × |R_w2|).

The first study corresponds to the work by Pennington et al. [5], who introduced GloVe, a word-embedding method that combines evidence from the local context and the global counts. GloVe vectors are compared against word2vec (CBOW) [4] and an SVD baseline, all of them in a 300-dimensional space and trained on the same corpora. The name of each method includes the size of the training corpora in billions of tokens. The best-performing method in this group is GloVe 42B, which produces r̄ = 0.6996; see Fig. 5.

The second study corresponds to the work by Aletras and Stevenson [22]. They proposed several hybrid models that combined word embedding and WordNet (H models) and compared them with word embeddings trained on a 2.8B-token corpus (D model). H* is the best method of this group, with r̄ = 0.72; see Fig. 6.

Finally, Baroni et al. [21] compared word2vec embedding ("predict" models) against models based on token counts in corpora, i.e., distributional semantic models (DSM), combined with techniques such as SVD or NNMF. They considered 48 configurations of the former and 36 of the latter, varying parameters such as the final dimensionality and the context window size. The "each" results correspond to the best configuration for each benchmark and "all" to the single best configuration across all benchmarks. Models in this group were trained using the same 2.8B-token corpus used by Aletras and Stevenson [22]. Methods D and "count.each" are equivalent and, as expected, obtain very similar results; see Fig. 7. The best-performing method in the latter two studies was "predict.each", with r̄ = 0.78. Model H* was outperformed by "predict.all"; this is a more comparable pair since both have a single configuration across all benchmarks. Unfortunately, the results of Pennington et al. are not comparable with those of Baroni et al. and Aletras and Stevenson due to the use of different datasets.

Fig. 5. Performance comparison of our systems (red) against the results published by Pennington et al. [5] (blue) and the classical WordNet measures (green) for the lexical similarity task, in terms of Spearman's r average for MC, RG, WS353, SCWS and RW.

Fig. 6. Performance comparison of our systems (red) against the results published by Aletras and Stevenson [22] (blue) and the classical WordNet measures (green) for the lexical similarity task, in terms of Spearman's r average for MC, RG, WS353, WSS, WSR, and MEN.

Fig. 7. Performance comparison of our systems (red) against the results published by Baroni et al. [21] (blue) and the classical WordNet measures (green) for the lexical similarity task, in terms of Spearman's r average for RG, WS353, WSS, WSR and MEN.

F. Updated State of the Art on Lexical Similarity

Baroni et al. [21] compiled state-of-the-art results for several benchmarks and tasks, including lexical similarity. We extend this compilation by including the datasets MC, SCWS, and RW, and update it with the results of Pennington et al. [5], this work, and others. The additional lexical similarity and relatedness datasets are: YP-130 (similarity) [47], MTURK287 (relatedness) [48], MTURK771 (relatedness) [49], Rel-122 (relatedness) [50], Verb-143 [51], and the recently introduced SimLex-999 (similarity) [52]. Table III shows the results for 14 datasets for both the supervised method presented in this paper (SVR) and the state of the art.

One can easily see that our method is a competitive alternative for the task of lexical similarity, but less competitive for the lexical relatedness task. The largest gaps are on the benchmarks that include, partially or totally, word pairs associated by relatedness, i.e., WS353, WSR, MEN, MTurk287, MTurk771, and Rel-122. These gaps range from 49.35% (WSR) to 11.78% (Rel-122), which is significant. On the benchmarks characterized by similarity relationships in word pairs, we obtained a gap of 10%–20% on RG and RW, below 10% on the other datasets, and state-of-the-art results on SimLex-999 (results for Verb-143 are not comparable).
G. Ablation Study

In order to determine the importance of the different types of semantic relations of WordNet in the SVR method, an ablation study was carried out; see Table IV. The performance measure is based on the average of the correlation r obtained with all the possible combinations of training and testing of the datasets from Table III plus our W1500 dataset. The average also includes the mean of ten rounds of tenfold cross-validation for each dataset. This generates 15 × 15 = 225 runs for each ablation configuration, whose standard deviations are reported in the Std. column.

TABLE III. UPDATED STATE OF THE ART (SOA) FOR LEXICAL SIMILARITY AND RELATEDNESS, PAIRED WITH RESULTS FROM THIS PAPER (SVR)

Dataset | Task | n† | SVR r | SOA r | Reference
MC | sim. | 30 | 0.8507 | 0.91 | Patwardhan and Pedersen (2006) [53]
YP-130 | sim. | 130 | 0.7385 | 0.747 | Taieb et al. (2013) [54]
RG | sim. | 65 | 0.7273 | 0.90 | Patwardhan and Pedersen (2006) [53]
WSS | sim. | 203 | 0.7233 | 0.80 | Baroni et al. (2014) [21]
MEN | rel. | 3,000 | 0.6140 | 0.80 | Baroni et al. (2014) [21]
SCWS | sim. | 1,997 | 0.5703 | 0.6104 | Li et al. (2014) [55]
WS353 | both | 353 | 0.5688 | 0.81 | Halawi et al. (2012) [49]
MTurk771 | rel. | 771 | 0.5447 | 0.727 | Halawi et al. (2012) [49]
SimLex-999 | sim. | 999 | 0.5327 | 0.52 | SVR (this paper); previously Hill et al. (2014) [56]
Rel-122 | rel. | 122 | 0.4711 | 0.534 | Szumlanski et al. (2013) [50]
MTurk287 | rel. | 287 | 0.4524 | 0.737 | Halawi et al. (2012) [49]
RW | sim. | 2,034 | 0.4175 | 0.478 | Pennington et al. (2014) [5]
WSR | rel. | 252 | 0.3647 | 0.72 | Agirre et al. (2009) [19]
Verb-143 | sim. | 143 | 0.3280 | 0.642†† | Baker et al. (2014) [51]
†n is the number of word pairs in the dataset. ††This result corresponds to Pearson's correlation.
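Each SVR entry in the table above is a Spearman correlation between the scores predicted for a benchmark's word pairs and its gold standard; the following minimal sketch (with made-up numbers) shows how that figure is computed.

```python
# Sketch of the performance measure: Spearman's r between predicted and
# gold-standard similarity scores for one benchmark (toy numbers only).
from scipy.stats import spearmanr

gold = [9.0, 7.5, 3.2, 1.0, 0.4]           # human judgments for five pairs
predicted = [0.91, 0.80, 0.35, 0.20, 0.10]  # scores from the trained model
r, _ = spearmanr(gold, predicted)
print(f"Spearman r = {r:.3f}")              # 1.0 here: identical ranking
```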

The results show that the most significant semantic relations for our method are hypernymy and hyponymy (i.e., the is-a hierarchy), followed by the aggregation of 10 minority relations in WordNet. The contribution of the holonymy-meronymy relation is small. Withdrawal of the antonymy relation even gave a marginal improvement in both performance and variance.

H. Best Features

Table V shows the 25 features that were selected the greatest number of times using the same 15 datasets from the ablation study. The % columns show the percentage of datasets in which the feature was selected for the model. Note that the most-used factor was f4 (i.e., |Ra ∩ Rb|) and that all the factors were used at least once among the 25 most-selected features, with the exception of f0 = 1.

IV. RESULTS ON LEXICAL SENTIMENT CLASSIFICATION

In this section, we evaluate our word representation based on neighbors in WordNet on the task of predicting the sentiment polarity score of a word. As before, the COSINE method is compared with the classical WordNet measures and the word embeddings.

A. Sentiment Data

We used for our experiments sentiment lexicons widely known to the sentiment analysis community: ANEW [12] and SenticNet versions 1 to 4 [13], [23], [24], [25] (recently, SenticNet 5 has become available [57]). Each dataset is a list of entries composed of a word (or multiword expression) and a numerical score that indicates its sentiment polarity. From all datasets, we selected single-word entries. Table VI shows the number of resulting words, two extreme examples with their polarity scores, and the number of coincidences between the dataset and the resources used to obtain the word representations. The column SentiWN corresponds to the number of coincidences with SentiWordNet [58], which is used as a baseline for comparison.

B. Prediction Method

To predict the sentiment score of a word and to provide a testbed for comparison, we use a regression method that can use word embedding, our representation of neighboring words, or the classical WordNet-based measures. Such a method is (again) a Support Vector Regression (SVR), which is based on a kernel that can be constructed either from a vectorial representation (a linear kernel of vector dot products) or from a pair-wise similarity matrix (a Gram matrix) of the words to classify. By setting the parameters of the SVR and varying the representation of the data, we can evaluate the performance of each representation for the task in question. The parameters used for the SVR were ε = 0.1 and γ = 1/n, where n is the number of examples.

C. Experimental Setup

The pre-trained word embedding representations used are GloVe 42B⁵ and CBOW 100B,⁶ which correspond to the same representations as used by Pennington et al. [5]. Therefore, we use the same names for these methods as used in Fig. 5.

To build a pseudo-Gram matrix M from the matrix S of pair-wise word similarity scores obtained from the classical WordNet-based measures and our methods, S needs to be transformed to fulfill symmetry and positiveness. The transformation used is M = (S + min(S)) · (S + min(S))ᵀ, where the function min returns a scalar with the minimum entry of the matrix and + adds the scalar to all elements of the matrix; thus S + min(S) is a non-negative spatial translation of S. The operator (·)ᵀ is the transpose of a matrix. Thus, the final matrix multiplication gives a symmetric matrix. Even using this transformation, the measures res and jcn did not produce a Gram matrix suitable for the optimization of the SVR.
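A small sketch of this transformation and its use as a precomputed kernel for an SVR is shown below; the similarity matrix S and the polarity scores y are toy stand-ins, and the shift is implemented so that the smallest entry of S becomes zero, which is the non-negative translation described above.

```python
# Sketch of the pseudo-Gram construction and its use as a precomputed SVR kernel.
import numpy as np
from sklearn.svm import SVR

def pseudo_gram(S):
    # Non-negative translation of S (equivalent to adding |min(S)| when the
    # minimum is negative), then multiplication by its transpose to obtain a
    # symmetric, positive semidefinite matrix usable as a kernel.
    shifted = S - S.min()
    return shifted @ shifted.T

S = np.array([[1.0, 0.6, 0.1],
              [0.5, 1.0, 0.2],
              [0.1, 0.3, 1.0]])     # toy word-to-word similarity scores
y = np.array([0.9, 0.7, -0.8])      # toy gold sentiment polarities

model = SVR(kernel="precomputed", epsilon=0.1)  # epsilon from the text;
model.fit(pseudo_gram(S), y)        # gamma does not apply to a precomputed kernel
print(model.predict(pseudo_gram(S)))
```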
The aggregation method used to obtain the word-to-word similarity from synset-to-synset similarities for the classical WordNet measures changed from the previous lexical similarity experiments: instead of the max operator, we selected the first synset for each word in the lexicographical order of WordNet.

⁵http://nlp.stanford.edu/data/glove.42B.300d.zip
⁶At https://code.google.com/archive/p/word2vec/ see "Pre-trained word and phrase vectors".

TABLE IV. ABLATION STUDY

Ablation setting | # removed | Performance | Std. | diff.
All 19 relationships | 0 | 0.5404 | 0.1783 | —
HYPERNYMS/HYPONYMS removed | 4 | 0.4641 | 0.1492 | 14.12%
Other relationships removed | 10 | 0.5134 | 0.1945 | 5.01%
HOLONYMS/MERONYMS removed | 4 | 0.5304 | 0.1827 | 1.85%
ANTONYMS removed | 1 | 0.5414 | 0.1773 | –0.18%

TABLE V. THE TOP-25 SELECTED FEATURES

feature % feature % feature % feature % feature % f4/f14 100% f4/f8 93% f16/f8 87% f6/f2 73% f4/f3 67% f4/f11 93% f4/f12 93% f15/f9 87% f10/f3 73% f7/f13 53% f4/f10 93% f3/f10 93% f4/f2 80% f5/f1 67% f16/f14 47% f7/f10 93% f15/f8 93% f4/f13 80% f7/f12 67% f15/f4 47% f4/f9 93% f16/f9 87% f7/f3 73% f4/f1 67% f15/f14 47%

TABLE VI. SIZES OF THE DATASETS USED IN THE EXPERIMENTS, TOGETHER WITH EXAMPLES AND WITH THE NUMBER OF MATCHES WITH OTHER RESOURCES USED

Dataset | # words | positive | negative | WordNet | Word2Vec | GloVe | SentiWN
ANEW | 1033 | paradise: 8.720 | suicide: −1.25 | 1030 | 1031 | 1033 | 1017
SenticNet1 | 3036 | esteem: 0.976 | offend: −0.99 | 2975 | 2809 | 2965 | 2932
SenticNet2 | 6422 | heavenly: 0.941 | acne: −0.969 | 6192 | 6172 | 6349 | 6176
SenticNet3 | 14893 | radiancy: 0.964 | unhinge: −0.975 | 14741 | 13239 | 14220 | 14679
SenticNet4 | 23497 | indorse: 0.964 | vitiate: −0.980 | 21141 | 19908 | 22324 | 20847

The max operator is convenient for the lexical similarity task because when humans judge the similarity of pairs of polysemous words, they unconsciously select the two closest senses [15], [44]. However, in the lexical sentiment classification task, the words are out of context; thus, the choice of the first sense selected for a word by a lexicographer seems a better option. In fact, the choices of the maximum and the average performed significantly poorer in our experiments. The choice of the performance measure is (again) Spearman's r correlation coefficient, which measures the ability of the regressor to produce a list of words ranked by their sentiment polarity correlated with the gold-standard scores. For evaluation, we used a tenfold cross-validation setting, and the reported results are the average of ten random shuffles of the dataset.

The baseline for comparing the performance of the predictors was built using the polarity scores from SentiWordNet 3.0. The score for a word was obtained by aggregating the scores of the synsets where the word occurs, using the available code.⁷ The correlations r between these scores and the scores from the ANEW and SenticNet datasets are the external benchmark of comparison for the results obtained by our predictors.

D. Results

Fig. 8 shows the results of the polarity lexical sentiment classification task carried out using all datasets. In this figure, the error bars indicate 2 standard deviations obtained from the ten random shuffles carried out for each measure. As before, the blue bars correspond to neural embeddings, green bars to the classical WordNet-based measures, and the red bar to our method. The gray bar corresponds to the baseline from SentiWordNet [58].

Given that SenticNet versions 2 to 4 provide sentic values for other affective dimensions (pleasantness, attention, sensitivity, and aptitude), we carried out additional experiments for these dimensions; see Fig. 9. For these experiments, we used PATH as representative of the classical measures based on WordNet, which obtained the best and most consistent results in the prediction of the affective dimension polarity.

V. DISCUSSION

Let us first analyze the results for the lexical similarity task. Figs. 5 to 7 show that there is a considerable performance gap between the methods based on word embedding (blue bars) and those based on the classical WordNet measures (green bars). Our method, labeled "SVR (this paper)", outperformed the classical WordNet measures in all evaluation settings. The minimum performance gap of 35% was observed in Fig. 5 against the jcn.b measure. Clearly, our measure is a better alternative to the classical WordNet measures. A possible explanation is that the representation based on related words exploits a greater number of relationships in WordNet compared to the classical approaches. With 20 being the maximum depth of the WordNet taxonomy, theoretically, the largest number of relationships taken into account in a single comparison is 40. In contrast, the size of the sets of related words is in the order of hundreds, or thousands in some cases, implying that an even greater number of relationships were considered in obtaining them. Since our representation is large and considerably more informative, it is reasonable to expect better robustness against noise and thus better performance, as was indeed the case.
⁷http://sentiwordnet.isti.cnr.it/code/SentiWordNetDemoCode.java

In addition, since the late 1990s, when the classical WordNet measures were proposed, until now, no alternative


Fig. 8. Results (r) for the Lexical Polarity Classification task using ten shuffles of tenfold cross-validation rounds. Error bars show 2 standard deviations over the 100 resulting train-test runs.


Fig. 9. Results (r) for the Lexical Classification task of sentic values for another four dimensions of affective valence (i.e., pleasantness, attention, sensitivity, and aptitude). Scores are the average over 10 random shuffles of 10-fold cross-validation rounds and error bars depict 2 standard deviations. has been proposed to significantly improve the results using In comparison with the state of the art, our method failed only the WordNet graph. to exceed the performance of the best methods based on word embedding. Comparison with the Pennington et al.’s study (Fig. 5) shows that our method performs better than word In contrast, the other measure, “COSINE (this paper)”, embeddings trained with a relatively small corpus (6 billion SenticNet2 SenticNet3 SenticNet4 obtained the worst performance in all evaluation settings. This tokens) and worse than embeddings trained on a large corpus means that the formulation of the combination method is not (42 and 100 billion tokens). a simple function for this task, justifying the use of supervised learning to obtain a suitable similarity function. Although Regarding the lexical sentiment classification task, results supervised learning is generally criticized for its reliance on show the same performance gap between neural embeddings costly labeled data, the use of the W1500 dataset provides an and classical WordNet approaches on the ANEW and SentiNet affordable alternative easy to replicate in any language. Thus, 1 and 2 datasets. On these three datasets, the classical measures the similarity function learned by support vector regression is gave performance relatively close to that of our baseline. obtained by constraints imposed by synonyms, antonyms, and In contrast, the regressors based on neural embeddings and random pairs of unrelated words. This approach is comparable our representation performed significantly better. On these to that of Halawi et al. [49], who used similar constraints to datasets, similar to the lexical similarity task, our method improve a distributional word representation. narrowed that performance gap. However, on the SenticNet 3 and 4 datasets, the neural [3] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and embeddings (blue bars) do not overperform the baseline, while C. Potts, “Learning word vectors for sentiment analysis,” in Proc. 49th Annual Meeting of the Association for Computational : our method (red bar) obtained the best results. A possible Human Language Technologies - Volume 1 (ACL HTL 2011), (Portland, explanation for this is that SenticNet 3 and 4 are considerably Oregon, USA), pp. 142–150, ACL, June 2011. larger than the other datasets and therefore include words [4] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in that are less frequent. Since the quality of the representation Advances in Neural Information Processing Systems 26 (NIPS 2013), of a word when using neural embeddings depends on the (Lake Tahoe, Nevada, USA), pp. 3111–3119, Dec. 2013. number of times it occurs in the corpus, the representation [5] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proc. 2014 Conf. on Empiricial Methods in of rare words is noisy. In WordNet, which was compiled Natural Language Processing (EMNLP 2014), vol. 12, (Doha, Qatar), manually by lexicographers, the quality of the representation pp. 1532–1543, Oct. 2014. 
of a word in principle is independent of the frequency of [6] C. Leacock and M. Chodorow, “Combining local context and WordNet similarity for word sense identification,” in WordNet: An Electronic its use. In this way, our method obtains consistent results Lexical Database (C. Fellbaum, ed.), pp. 265–283, MIT Press, 1998. in a wider spectrum of frequencies, as compared with the [7] P. Resnik, “ in a taxonomy: An information-based neural embeddings. Moreover, our method is the only one that measure and its application to problems of ambiguity in natural lan- exceeded the baseline on all 5 data sets considered. guage,” Journal of Artificial Intelligence Research, vol. 11, pp. 95–130, July 1999. Additionally, Fig. 9 shows universality of our representation, [8] D. Lin, “An information-theoretic definition of similarity,” in Proc. 15th since consistency of the results on the polarity dimension pre- Conf. on (ICML 1998), (Madison, Wisconsin, USA), diction was preserved for the other four affective dimensions. pp. 296–304, Morgan Kaufmann Publishers Inc., July 1998. [9] J. J. Jiang and D. W. Conrath, “Semantic similarity based on corpus statistics and lexical taxonomy,” in Proceedings of Conf. Research on Computational Linguistics (ROCLING/IJCLCLP), (Taipei, Taiwan), VI.CONCLUSIONS pp. 19–33, The Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Aug. 1997. We have presented a novel approach to exploit the WordNet [10] Z. Wu and M. Palmer, “Verbs semantics and lexical selection,” in Proc. graph to build lexical similarity functions and lexical sentiment 32nd Annual Meeting of the Association for Computational Linguistics classifiers. It leverages a novel word representation, which we (ACL 1994), (Las Cruces, New Mexico, USA), pp. 133–138, ACL, June called word2set, based on the sets of words neighboring the 1994. [11] Y. Li, Q. Pan, T. Yang, S. Wang, J. Tang, and E. Cambria, “Learning given word in WordNet. word representations for sentiment analysis,” Cognitive Computation, Our method produces results considerably better than clas- vol. 9, pp. 843–851, Dec. 2017. sical WordNet-based approaches and competitive with those [12] M. M. Bradley and P. J. Lang, “Affective norms for English words (ANEW): Instruction manual and affective ratings,” Tech. Rep. 1, of neural embeddings. It uses supervised learning from a University of Florida, 1999. dataset easy to construct for any language. We have tested [13] E. Cambria, R. Speer, C. Havasi, and A. Hussain, “SenticNet: A publicly our approach on all lexical similarity and relatedness bench- available semantic resource for opinion mining,” in Proc. 2010 AAAI Fall Symposium: Commonsense Knowledge, vol. 10, (Arlington, Virginia, marks available to date, obtaining state-of-the-art results on USA), pp. 14–18, Nov. 2010. predicting human judgments on similarity. Our method is less [14] C. Fellbaum, ed., WordNet: An Electronic Lexical Database. MIT Press, effective for relatedness, but still superior to other classical 1998. [15] H. Rubinstein and J. B. Goodenough, “Contextual correlates of syn- WordNet-based measures. For the task of predicting affective onymy,” Communications of the ACM, vol. 8, pp. 627–633, Oct. 1965. valences of words, our representation is the best alternative [16] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and when considering a wide spectrum of both frequent and rare R. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol. 41, pp. 
ACKNOWLEDGMENTS

The work was done with partial support from the Mexican Government via SNI, CONACYT, and Instituto Politécnico Nacional, grants SIP 20196437 and 20196021, to A. Gelbukh. The work was done while A. Gelbukh was on a sabbatical stay at the Research Institute for Information and Language Processing, University of Wolverhampton, with a grant provided by the Sabbatical program of CONACYT, Mexico.

REFERENCES

[1] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," Soviet Physics Doklady, vol. 10, pp. 707–710, Aug. 1966.
[2] H. Gómez-Adorno, I. Markov, J. Baptista, G. Sidorov, and D. Pinto, "Discriminating between similar languages using a combination of typed and untyped character n-grams and words," in Proc. 4th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2017), (Valencia, Spain), pp. 137–145, ACL, Apr. 2017.
[6] C. Leacock and M. Chodorow, "Combining local context and WordNet similarity for word sense identification," in WordNet: An Electronic Lexical Database (C. Fellbaum, ed.), pp. 265–283, MIT Press, 1998.
[7] P. Resnik, "Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language," Journal of Artificial Intelligence Research, vol. 11, pp. 95–130, July 1999.
[8] D. Lin, "An information-theoretic definition of similarity," in Proc. 15th Int. Conf. on Machine Learning (ICML 1998), (Madison, Wisconsin, USA), pp. 296–304, Morgan Kaufmann Publishers Inc., July 1998.
[9] J. J. Jiang and D. W. Conrath, "Semantic similarity based on corpus statistics and lexical taxonomy," in Proc. Conf. on Research on Computational Linguistics (ROCLING/IJCLCLP), (Taipei, Taiwan), pp. 19–33, The Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Aug. 1997.
[10] Z. Wu and M. Palmer, "Verbs semantics and lexical selection," in Proc. 32nd Annual Meeting of the Association for Computational Linguistics (ACL 1994), (Las Cruces, New Mexico, USA), pp. 133–138, ACL, June 1994.
[11] Y. Li, Q. Pan, T. Yang, S. Wang, J. Tang, and E. Cambria, "Learning word representations for sentiment analysis," Cognitive Computation, vol. 9, pp. 843–851, Dec. 2017.
[12] M. M. Bradley and P. J. Lang, "Affective norms for English words (ANEW): Instruction manual and affective ratings," Tech. Rep. 1, University of Florida, 1999.
[13] E. Cambria, R. Speer, C. Havasi, and A. Hussain, "SenticNet: A publicly available semantic resource for opinion mining," in Proc. 2010 AAAI Fall Symposium: Commonsense Knowledge, vol. 10, (Arlington, Virginia, USA), pp. 14–18, Nov. 2010.
[14] C. Fellbaum, ed., WordNet: An Electronic Lexical Database. MIT Press, 1998.
[15] H. Rubenstein and J. B. Goodenough, "Contextual correlates of synonymy," Communications of the ACM, vol. 8, pp. 627–633, Oct. 1965.
[16] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, pp. 391–407, Sept. 1990.
[17] D. Lee and S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems (NIPS 2001), (Vancouver, British Columbia, Canada), pp. 556–562, Dec. 2001.
[18] M. Sahlgren, "An introduction to random indexing," in Methods and Applications of Semantic Indexing Workshop at the 7th Conf. on Terminology and Knowledge Engineering (TKE 2005), (Copenhagen, Denmark), July 2005.
[19] E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Pasca, and A. Soroa, "A study on similarity and relatedness using distributional and WordNet-based approaches," in Proc. Human Language Technologies: The 2009 Annual Conf. of the North American Chapter of the Association for Computational Linguistics (NAACL 2009), pp. 19–27, ACL, June 2009.
[20] Y. Bengio, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137–1155, Feb. 2003.
[21] M. Baroni, G. Dinu, and G. Kruszewski, "Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors," in Proc. 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), (Baltimore, Maryland, USA), pp. 238–247, ACL, June 2014.
[22] N. Aletras and M. Stevenson, "A hybrid distributional and knowledge-based model of lexical semantics," in Proc. 4th Joint Conf. on Lexical and Computational Semantics (*SEM 2015), (Denver, Colorado, USA), pp. 20–29, ACL, June 2015.
[23] E. Cambria, C. Havasi, and A. Hussain, "SenticNet 2: A semantic and affective resource for opinion mining and sentiment analysis," in Proc. 25th Florida Artificial Intelligence Research Society Conf. (FLAIRS 2012), (Marco Island, Florida, USA), pp. 202–207, May 2012.
[24] E. Cambria, D. Olsher, and D. Rajagopal, "SenticNet 3: A common and common-sense knowledge base for cognition-driven sentiment analysis," in Proc. 28th AAAI Conf. on Artificial Intelligence (AAAI 2014), (Quebec, Canada), pp. 1515–1521, July 2014.
[25] E. Cambria, S. Poria, R. Bajpai, and B. Schuller, "SenticNet 4: A semantic resource for sentiment analysis based on conceptual primitives," in Proc. 26th Conf. on Computational Linguistics: Technical Papers (COLING 2016), (Osaka, Japan), pp. 2666–2677, Dec. 2016.
[26] G. Sidorov and F. Viveros-Jiménez, "One sense per discourse heuristic for improving precision of WSD methods based on lexical intersections with the context," POLIBITS, vol. 57, pp. 45–50, June 2018.
[27] G. Sidorov, A. Gelbukh, H. Gómez-Adorno, and D. Pinto, "Soft similarity and soft cosine measure: Similarity of features in vector space model," Computación y Sistemas, vol. 18, pp. 491–504, Oct. 2014.
[28] S. Jimenez, C. Becerra, and A. Gelbukh, "SOFTCARDINALITY: Hierarchical text overlap for student response analysis," in Proc. 7th Workshop on Semantic Evaluation (SemEval 2013), (Atlanta, Georgia, USA), pp. 280–284, ACL, June 2013.
[29] S. Bird and E. Loper, "NLTK: The natural language toolkit," in Proc. ACL 2004: Interactive Poster and Demonstration Sessions, (Barcelona, Spain), pp. 31–34, ACL, July 2004.
[30] M.-T. Luong, R. Socher, and C. D. Manning, "Better word representations with recursive neural networks for morphology," in Proc. 17th Conf. on Computational Natural Language Learning (CoNLL 2013), (Sofia, Bulgaria), pp. 104–113, ACL, Aug. 2013.
[31] E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng, "Improving word representations via global context and multiple word prototypes," in Proc. 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), vol. 1, (Jeju Island, Korea), pp. 873–882, ACL, July 2012.
[32] H. Calvo and A. Gelbukh, "Is the most frequent sense of a word better connected in a semantic network?," in Proc. ICIC 2015: Advanced Intelligent Computing Theories and Applications (D.-S. Huang and K. Han, eds.), no. 9227 in Lecture Notes in Computer Science, pp. 491–499, Springer, 2015.
[33] P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et des Jura," Bulletin de la Société Vaudoise des Sciences Naturelles, pp. 547–579, Dec. 1901.
[34] L. R. Dice, "Measures of the amount of ecologic association between species," Ecology, vol. 26, pp. 297–302, July 1945.
[35] A. Tversky, "Features of similarity," Psychological Review, vol. 84, pp. 327–352, July 1977.
[36] B. De-Baets, S. Janssens, and H. De-Meyer, "On the transitivity of a parametric family of cardinality-based similarity measures," International Journal of Approximate Reasoning, vol. 50, pp. 104–116, Jan. 2009.
[37] S. Jimenez, C. Becerra, and A. Gelbukh, "Soft cardinality: A parameterized similarity function for text comparison," in Proc. 1st Joint Conf. on Lexical and Computational Semantics (*SEM 2012), (Montreal, Canada), pp. 449–453, ACL, June 2012.
[38] S. Jimenez, C. Becerra, and A. Gelbukh, "Soft cardinality + ML: Learning adaptive similarity functions for cross-lingual textual entailment," in Proc. 1st Joint Conf. on Lexical and Computational Semantics (*SEM 2012), (Montreal, Canada), pp. 684–688, ACL, June 2012.
[39] S. Jimenez, G. Dueñas, J. Baquero, and A. Gelbukh, "UNAL-NLP: Combining soft cardinality features for semantic textual similarity, relatedness and entailment," in Proc. 8th Workshop on Semantic Evaluation (SemEval 2014), (Dublin, Ireland), pp. 732–742, ACL, Aug. 2014.
[40] S. Jimenez, C. Becerra, and A. Gelbukh, "SOFTCARDINALITY-CORE: Improving text overlap with distributional measures for semantic textual similarity," in Second Joint Conf. on Lexical and Computational Semantics, Volume 1: Proc. Main Conf. and the Shared Task: Semantic Textual Similarity (*SEM 2013), (Atlanta, Georgia, USA), pp. 194–201, June 2013.
[41] B. De-Baets, H. De-Meyer, and H. Naessens, "A class of rational cardinality-based similarity measures," Journal of Computational and Applied Mathematics, vol. 132, pp. 51–69, July 2001.
[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, Oct. 2011.
[43] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, and V. Vapnik, "Support vector regression machines," in Advances in Neural Information Processing Systems 10 (NIPS 1997), vol. 9, (Denver, Colorado, USA), pp. 155–161, Dec. 1997.
[44] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, pp. 1–28, Jan. 1991.
[45] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin, "Placing search in context: The concept revisited," ACM Transactions on Information Systems, vol. 20, pp. 116–131, Jan. 2002.
[46] E. Bruni, J. Uijlings, M. Baroni, and N. Sebe, "Distributional semantics with eyes: Using image analysis to improve computational representations of word meaning," in Proc. 20th ACM Conf. on Multimedia (MM 2012), (Nara, Japan), pp. 1219–1228, Nov. 2012.
[47] D. Yang and D. M. W. Powers, "Verb similarity on the taxonomy of WordNet," in Proc. 3rd Global WordNet Conf. (GWC 2006), (Jeju Island, Korea), pp. 121–128, Jan. 2006.
[48] K. Radinsky, E. Agichtein, E. Gabrilovich, and S. Markovitch, "A word at a time: Computing word relatedness using temporal semantic analysis," in Proc. 20th Conf. on World Wide Web (WWW 2011), (Hyderabad, India), pp. 337–346, ACM, Apr. 2011.
[49] G. Halawi, G. Dror, E. Gabrilovich, and Y. Koren, "Large-scale learning of word relatedness with constraints," in Proc. 18th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD 2012), (Beijing, China), pp. 1406–1414, Aug. 2012.
[50] S. Szumlanski, F. Gomez, and V. K. Sims, "A new set of norms for semantic relatedness measures," in Proc. 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), (Sofia, Bulgaria), pp. 890–895, Aug. 2013.
[51] S. Baker, R. Reichart, and A. Korhonen, "An unsupervised model for instance level subcategorization acquisition," in Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP 2014), (Doha, Qatar), pp. 278–289, Oct. 2014.
[52] F. Hill, R. Reichart, and A. Korhonen, "SimLex-999: Evaluating semantic models with (genuine) similarity estimation," Computational Linguistics, vol. 41, pp. 665–695, Dec. 2015.
[53] S. Patwardhan and T. Pedersen, "Using WordNet-based context vectors to estimate the semantic relatedness of concepts," in Proc. EACL 2006 Workshop on Making Sense of Sense: Bringing Computational Linguistics and Psycholinguistics Together, (Trento, Italy), pp. 1–8, Apr. 2006.
[54] M. A. H. Taieb, M. B. Aouicha, and A. B. Hamadou, "Computing semantic relatedness using Wikipedia features," Knowledge-Based Systems, vol. 50, pp. 260–278, Sept. 2013.
[55] C. Li, B. Xu, G. Wu, T. Zhuang, X. Wang, and W. Ge, "Improving word embeddings via combining with complementary languages," in Proc. Canadian AI, pp. 313–318, Springer, LNAI 8436, May 2014.
[56] F. Hill, K. Cho, S. Jean, C. Devin, and Y. Bengio, "Not all neural embeddings are born equal," arXiv preprint arXiv:1410.0718, Oct. 2014.
[57] E. Cambria, S. Poria, D. Hazarika, and K. Kwok, "SenticNet 5: Discovering conceptual primitives for sentiment analysis by means of context embeddings," in Proc. 32nd AAAI Conf. on Artificial Intelligence (AAAI 2018), pp. 1795–1802, Feb. 2018.
[58] S. Baccianella, A. Esuli, and F. Sebastiani, "SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining," in Proc. 7th Conf. on Language Resources and Evaluation (LREC 2010), vol. 10, (Malta), pp. 2200–2204, May 2010.