arXiv:1704.03560v2

arXiv:1704.03560v2 [cs.CL] 11 Dec 2018

ConceptNet at SemEval-2017 Task 2: Extending Word Embeddings with Multilingual Relational Knowledge

Robyn Speer
Luminoso Technologies, Inc.
675 Massachusetts Avenue
Cambridge, MA 02139
[email protected]

Joanna Lowry-Duda
Luminoso Technologies, Inc.
675 Massachusetts Avenue
Cambridge, MA 02139
[email protected]

Abstract

This paper describes Luminoso’s participation in SemEval 2017 Task 2, “Multilingual and Cross-lingual Semantic Word Similarity”, with a system based on ConceptNet. ConceptNet is an open, multilingual knowledge graph that focuses on general knowledge that relates the meanings of words and phrases. Our submission to SemEval was an update of previous work that builds high-quality, multilingual word embeddings from a combination of ConceptNet and distributional semantics. Our system took first place in both subtasks. It ranked first in 4 out of 5 of the separate languages, and also ranked first in all 10 of the cross-lingual language pairs.

1 Introduction

ConceptNet 5 (Speer and Havasi, 2013) is a multilingual, domain-general knowledge graph that connects words and phrases of natural language (terms) with labeled, weighted edges. Compared to other knowledge graphs, it avoids trying to be a large gazetteer of named entities. It aims most of all to cover the frequently-used words and phrases of every language, and to represent generally-known relationships between the meanings of these terms.

The paper describing ConceptNet 5.5 (Speer et al., 2017) showed that it could be used in combination with sources of distributional semantics, particularly the word2vec Google News skip-gram embeddings (Mikolov et al., 2013) and GloVe 1.2 (Pennington et al., 2014), to produce new embeddings with state-of-the-art performance across many word-relatedness evaluations. The three data sources are combined using an extension of the technique known as retrofitting (Faruqui et al., 2015). The result is a system of pre-computed word embeddings we call “ConceptNet Numberbatch”.

The system we submitted to SemEval-2017 Task 2, “Multilingual and Cross-lingual Semantic Word Similarity”, is an update of that system, coinciding with the release of version 5.5.3 of ConceptNet¹. We added multiple fallback methods for assigning vectors to out-of-vocabulary words. We also experimented with, but did not submit, systems that used additional sources of word embeddings in the five languages of this SemEval task.

¹ Data and code are available at http://conceptnet.io.

This task (Camacho-Collados et al., 2017) evaluated systems at their ability to rank pairs of words by their semantic similarity or relatedness. The words are in five languages: English, German, Italian, Spanish, and Farsi. Subtask 1 compares pairs of words within each of the five languages; subtask 2 compares pairs of words that are in different languages, for each of the ten pairs of distinct languages.

Our system took first place in both subtasks. Detailed results for our system appear in Section 3.4.

2 Implementation

The way we built our embeddings is based on retrofitting (Faruqui et al., 2015), and in particular, the elaboration of it we call “expanded retrofitting” (Speer et al., 2017). Retrofitting, as originally described, adjusts the values of existing word embeddings based on a new objective function that also takes a knowledge graph into account. Its output has the same vocabulary as its input. In expanded retrofitting, on the other hand, terms that are only present in the knowledge graph are added to the vocabulary and are also assigned vectors.
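The difference between retrofitting and expanded retrofitting described in Section 2 can be illustrated with a small sketch. This is a simplified toy version under stated assumptions, not the implementation used for ConceptNet Numberbatch: the function name `expanded_retrofit`, the uniform anchoring weight of 1 for terms that have an input vector, the use of edge weights as neighbor weights, and the fixed iteration count are all choices made for illustration. Terms that appear only in the graph start at zero and are anchored to nothing, so their vectors come entirely from their neighbors.

```python
def expanded_retrofit(embeddings, edges, iterations=10):
    """Toy expanded retrofitting (illustrative assumptions, see lead-in).

    embeddings: dict mapping term -> list of floats (input vectors)
    edges: iterable of (term_a, term_b, weight) from a knowledge graph
    Returns a dict with a vector for every term seen in either input.
    """
    dim = len(next(iter(embeddings.values())))

    # Build an undirected, weighted adjacency list from the graph edges.
    neighbors = {}
    for a, b, weight in edges:
        neighbors.setdefault(a, []).append((b, weight))
        neighbors.setdefault(b, []).append((a, weight))

    # Expanded vocabulary: input terms plus graph-only terms (zero vectors).
    vocab = set(embeddings) | set(neighbors)
    vectors = {t: list(embeddings.get(t, [0.0] * dim)) for t in vocab}

    for _ in range(iterations):
        updated = {}
        for term in vocab:
            # alpha ties a term to its original vector. Graph-only terms
            # have no original vector, so alpha is 0 for them.
            alpha = 1.0 if term in embeddings else 0.0
            total = [alpha * x for x in embeddings.get(term, [0.0] * dim)]
            norm = alpha
            for other, weight in neighbors.get(term, []):
                total = [t + weight * x for t, x in zip(total, vectors[other])]
                norm += weight
            # Weighted average of the original vector and the neighbors.
            updated[term] = [t / norm for t in total] if norm > 0 else vectors[term]
        vectors = updated
    return vectors


if __name__ == "__main__":
    toy_embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
    toy_edges = [("cat", "gato", 1.0), ("cat", "dog", 1.0)]
    # "gato" has no input vector but receives one through its link to "cat".
    print(expanded_retrofit(toy_embeddings, toy_edges)["gato"])
```

In this sketch the output vocabulary is the union of the input vocabulary and the graph vocabulary, which is the property that lets English-only input embeddings propagate to terms in other languages.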
2.1 Combining Multiple Sources of Vectors

As described in the ConceptNet 5.5 paper (Speer et al., 2017), we apply expanded retrofitting separately to multiple sources of embeddings (such as pre-trained word2vec and GloVe), then align the results on a unified vocabulary and reduce its dimensionality.

First, we make a unified matrix of embeddings, $M_1$, as follows:

• Take the subgraph of ConceptNet consisting of nodes whose degree is at least 3. Remove edges corresponding to negative relations (such as NotUsedFor and Antonym). Remove phrases with 4 or more words.

• Standardize the sources of embeddings by case-folding their terms and L1-normalizing their columns.

• For each source of embeddings, apply expanded retrofitting over that source with the subgraph of ConceptNet. In each case, this provides vectors for a vocabulary of terms that includes the ConceptNet vocabulary.

• Choose a unified vocabulary (described below), and look up the vectors for each term in this vocabulary in the expanded retrofitting outputs. If a vector is missing from the vocabulary of a retrofitted output, fill in zeroes for those components.

• Concatenate the outputs of expanded retrofitting over this unified vocabulary to give $M_1$.

2.2 Vocabulary Selection

Expanded retrofitting produces vectors for all the terms in its knowledge graph and all the terms in the input embeddings. Some terms from outside the ConceptNet graph have useful embeddings, representing knowledge we would like to keep, but using all such terms would be noisy and wasteful.

To select the vocabulary of our term vectors, we used a heuristic that takes advantage of the fact that the pre-computed word2vec and GloVe embeddings we used have their rows (representing terms) sorted by term frequency.

To find appropriate terms, we take all the terms that appear in the first 500,000 rows of both the word2vec and GloVe inputs, and appear in the first 200,000 rows of at least one of them. We take the union of these with the terms in the ConceptNet subgraph described above. The resulting vocabulary, of 1,884,688 ConceptNet terms plus 99,869 additional terms, is the vocabulary we use in the system we submitted and its variants.

2.3 Dimensionality Reduction

The concatenated matrix $M_1$ has $k$ columns representing features that may be redundant with each other. Our next step is to reduce its dimensionality to a smaller number $k'$, which we set to 300, the dimensionality of the largest input matrix. Our goal is to learn a projection from $k$ dimensions to $k'$ dimensions that removes the redundancy that comes from concatenating multiple sources of embeddings.

We sample 5% of the rows of $M_1$ to get $M_2$, which we will use to find the projection more efficiently, assuming that its vectors represent approximately the same distribution as $M_1$.

$M_2$ can be approximated with a truncated SVD:

$$M_2 \approx U \Sigma V^T,$$

where $\Sigma$ is truncated to a $k' \times k'$ diagonal matrix of the $k'$ largest singular values, and $U$ and $V$ are correspondingly truncated to have only these $k'$ columns.

$U$ is a matrix mapping the same vocabulary to a smaller set of features. Because $V$ is orthonormal, $U \Sigma$ is a rotation and truncation of the original data, where each feature contributes the same amount of variance as it did in the original data. $U \Sigma^{1/2}$ is a version that removes some of the variance that came from redundant features, and also is analogous to the decomposition used by Levy et al. (2015) in their SVD process.

We can solve for the operator that projects $M_2$ into $U \Sigma^{1/2}$. Multiplying the SVD on the right by $V \Sigma^{-1/2}$ and using $V^T V = I$ gives:

$$U \Sigma^{1/2} \approx M_2 V \Sigma^{-1/2}$$

$V \Sigma^{-1/2}$ is therefore a $k \times k'$ operator that, when applied on the right, projects vectors from our larger space of features to our smaller space of features. It can be applied to any vector in the space of $M_1$, not just the ones we sampled. $M_3 = M_1 V \Sigma^{-1/2}$ is the projection of the selected vocabulary into $k'$ dimensions, which is the matrix of term vectors that we output and evaluate.

2.4 Don’t Take “OOV” for an Answer

Published evaluations of word embeddings can be inconsistent about what to do with out-of-vocabulary (OOV) words, those words that the system has learned no representation for. Some evaluators, such as Bojanowski et al. (2016), discard all pairs containing an OOV word. This makes different systems with different vocabularies difficult to compare. It enables gaming the evaluation by limiting the system’s vocabulary, and gives no incentive to expand the vocabulary.

This SemEval task took a more objective position: no word pairs may be discarded. Every system must submit a similarity value for every word pair, and “OOV” is no excuse. The organizers recommended using the midpoint of the similarity scale as a default.

In our previous work with ConceptNet, we eliminated one possible cause of OOV terms. A term that is outside of the selected vocabulary, perhaps because its degree in ConceptNet is too low, can still be assigned a vector. When we encounter a word with no computed vector, we look it up in ConceptNet, find its neighbors, and take the average of whatever vectors those neighboring terms have. This approximates the vector the term would have been assigned if it had participated in retrofitting. If the term has no neighbors with vectors, it remains OOV.

For this SemEval task, we recognized the importance of minimizing OOV terms, and implemented two additional fallback strategies for the terms that are still OOV.

It is unavoidable that training data in non-English languages will be harder to come by and sparser than data in English. It is also true that some words in non-English languages are borrowed directly from English, and are therefore exact cognates for English words.

As such, we used a simple strategy to further increase the coverage of our non-English vocabularies: if a term is not associated with a vector in matrix $M_3$, we first look up the vector for the term that is spelled identically in English. If that vector is present, we use it.

This method is in theory vulnerable to false cognates, such as the German word Gift (meaning “poison”). However, false cognates tend to appear among common words, not rare ones, so they are unlikely to use this fallback strategy. Our German embeddings do contain a vector for “Gift”, and it is similar to English “poison”, not English “gift”.

As a second fallback strategy, when a term cannot be found in its given language or in English, we look for terms in the vocabulary that have the given term as a prefix. If we find none of those, we drop a letter from the end of the unknown term, and look for that as a prefix. We continue dropping letters from the end until a result is found. When a prefix yields results, we use the mean of all the resulting vectors as the word’s vector.

3 Results

In this task, systems were scored by the harmonic mean of their Pearson and Spearman correlation with the test set for each language (or language pair in Subtask 2). Systems were assigned aggregate scores, averaging their top 4 languages on Subtask 1 and their top 6 pairs on Subtask 2.

3.1 The Submitted System: ConceptNet + word2vec + GloVe

The system we submitted applied the retrofitting-and-merging process described above, with ConceptNet 5.5.3 as the knowledge graph and two well-regarded sources of English word embeddings. The first source is the word2vec Google News embeddings², and the second is the GloVe 1.2 embeddings that were trained on 840 billion tokens of the Common Crawl³.

² https://code.google.com/archive/p/word2vec/
³ http://nlp.stanford.edu/projects/glove/

Because the input embeddings are only in English, the vectors in other languages depended entirely on propagating these English embeddings via the multilingual links in ConceptNet.

This system appears in the results as “Luminoso-run2”. Run 1 was similar, but it was looking up neighbors in an unreleased version of the ConceptNet graph with fewer edges from DBPedia in it.

This system’s aggregate score on subtask 1 was 0.743. Its combined score on subtask 2 (averaged over its six best language pairs) was 0.754.

3.2 Variant A: Adding Polyglot Embeddings

Instead of relying entirely on English knowledge propagated through ConceptNet, it seemed reasonable to also include pre-calculated word embeddings in other languages as inputs. In Variant A, we added inputs from the Polyglot embeddings (Al-Rfou et al., 2013) in German, Spanish, Italian, and Farsi as four additional inputs to the retrofitting-and-merging process.

The results of this variant on the trial data were noticeably lower, and when we evaluate it on the test data in retrospect, its test results are lower as well. Its aggregate scores are .720 on subtask 1 and .736 on subtask 2.

3.3 Variant B: Adding Parallel Text from OpenSubtitles

In Variant B, we calculated our own multilingual distributional embeddings from word co-occurrences in the OpenSubtitles2016 parallel corpus (Lison and Tiedemann, 2016), and used this as a third input alongside word2vec and GloVe.

For each pair of aligned subtitles among the five languages, we combined the language-tagged words into a single set of $n$ words, then added $1/n$ to the co-occurrence frequency of each pair of words, yielding a sparse matrix of word co-occurrences within and across languages. We then used the SVD-of-PPMI process described by Levy et al. (2015) to convert these sparse co-occurrences into 300-dimensional vectors.

On the trial data, this variant compared inconclusively to Run 2. We submitted Run 2 instead of Variant B because Run 2 was simpler and seemed to perform slightly better on average.

However, when we run Variant B on the released test data, we note that it would have scored better than the system we submitted. Its aggregate scores are .759 on subtask 1 and .767 on subtask 2.

3.4 Comparison of Results

The released results⁴ show that our system, listed as Luminoso-Run2, got the highest aggregate score on both subtasks, and the highest score on each test set except the monolingual Farsi set.

⁴ http://alt.qcri.org/semeval2017/task2/index.php?id=results

Table 1 compares the results per language of the system we submitted, the same system without our OOV-handling strategies, variants A and B, and the baseline Nasari (Camacho-Collados et al., 2016) system.

Eval.     Base   Ours   −OOV   Var. A   Var. B
en        .683   .789   .747   .778     .796
de        .513   .700   .599   .673     .722
es        .602   .743   .611   .716     .761
it        .597   .741   .606   .711     .756
fa        .412   .503   .363   .506     .541
Score 1   .598   .743   .641   .720     .759
en-de     .603   .763   .696   .749     .767
en-es     .636   .761   .675   .752     .778
en-it     .650   .776   .677   .759     .786
en-fa     .519   .598   .502   .590     .634
de-es     .550   .728   .620   .704     .747
de-it     .565   .741   .612   .722     .757
de-fa     .464   .587   .501   .586     .610
es-it     .598   .753   .613   .732     .765
es-fa     .493   .627   .482   .623     .646
it-fa     .497   .604   .474   .599     .635
Score 2   .598   .754   .649   .736     .767

Table 1: Evaluation scores by language. “Score 1” and “Score 2” are the combined subtask scores. “Base” is the Nasari baseline, “Ours” is Luminoso-Run2 as submitted, “−OOV” removes our OOV strategy, and “Var. A” and “Var. B” are the variants we describe in this paper.

Variant B performed the best in the end, so we will incorporate parallel text from OpenSubtitles in the next release of the ConceptNet Numberbatch system.

4 Discussion

The idea of producing word embeddings from a combination of distributional and relational knowledge has been implemented by many others, including Iacobacci et al. (2015) and various implementations of retrofitting (Faruqui et al., 2015). ConceptNet is distinguished by the large improvement in evaluation scores that occurs when it is used as the source of relational knowledge. This indicates that ConceptNet’s particular blend of crowd-sourced, gamified, and expert knowledge is providing valuable information that is not learned from distributional semantics alone.

The results transfer well to other languages, showing ConceptNet’s usefulness as “multilingual glue” that can combine knowledge in multiple languages into a single representation.

Our submitted system relies heavily on inter-language links in ConceptNet that represent direct translations, as well as exact cognates. We suspect that this makes it perform particularly well at directly-translated English. It would have more difficulty determining the similarity of words that lack direct translations into English that are known or accurate. This is a weak point of many current word-similarity evaluations: words that are vague when translated, or that have language-specific connotations, tend not to appear.

On a task with harder-to-translate words, we may have to rely more on observing the distributional semantics of corpus text in each language, as we did in the unsubmitted variants.

References

Robyn Speer and Catherine Havasi. 2013. ConceptNet 5: A large semantic network for relational knowledge. In The People’s Web Meets NLP, Springer, pages 161–176.

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, Sofia, Bulgaria, pages 183–192. http://www.aclweb.org/anthology/W13-3520.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606. https://arxiv.org/pdf/1607.04606.

Jose Camacho-Collados, Mohammad Taher Pilehvar, Nigel Collier, and Roberto Navigli. 2017. SemEval-2017 Task 2: Multilingual and cross-lingual semantic word similarity. In Proceedings of SemEval. Vancouver, Canada.

José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence 240:36–64.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL. http://arxiv.org/abs/1411.4166.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. SensEmbed: Learning sense embeddings for word and relational similarity. In ACL (1). pages 95–105.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3:211–225. http://www.aclweb.org/anthology/Q15-1016.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR abs/1301.3781. http://arxiv.org/abs/1301.3781.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014) 12:1532–1543. http://www-nlp.stanford.edu/pubs/glove.pdf.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. San Francisco. http://arxiv.org/abs/1612.03975.