SENSE 2017

EACL 2017 Workshop on Sense, Concept and Entity Representations and their Applications

Proceedings of the Workshop

April 4, 2017
Valencia, Spain

© 2017 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360 USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]

ISBN 978-1-945626-50-0

Preface

Welcome to the 1st Workshop on Sense, Concept and Entity Representations and their Applications (SENSE 2017). SENSE 2017 addresses one of the most important limitations of word-based techniques: they conflate the different meanings of a word into a single representation. The workshop brings together researchers in lexical semantics, and in NLP in general, to investigate and propose sense-based techniques, as well as to discuss effective ways of integrating sense, concept and entity representations into downstream applications.

The workshop is targeted at covering the following topics:

• Utilizing sense/concept/entity representations in applications such as Machine Translation, Information Extraction or Retrieval, Word Sense Disambiguation, Entity Linking, Text Classification, Semantic Parsing, Knowledge Base Construction or Completion, etc.

• Exploration of the advantages/disadvantages of using sense representations over word representations.

• Proposing new evaluation benchmarks or comparison studies for sense vector representations.

• Development of new sense representation techniques (unsupervised, knowledge-based or hybrid).

• Compositionality of senses: learning representations for phrases and sentences.

• Construction and use of sense representations for languages other than English as well as multilingual representations.

We received 21 submissions, accepting 15 of them (acceptance rate: 71%).

We would like to thank the Program Committee members who reviewed the papers and helped to improve the overall quality of the workshop. We also thank Aylien for their support in funding the best paper award. Last, a word of thanks also goes to our invited speakers, Roberto Navigli (Sapienza University of Rome) and Hinrich Schütze (University of Munich).

Jose Camacho-Collados and Mohammad Taher Pilehvar
Co-Organizers of SENSE 2017


Organizers:
Jose Camacho-Collados, Sapienza University of Rome
Mohammad Taher Pilehvar, University of Cambridge

Program Committee:
Eneko Agirre, University of the Basque Country
Claudio Delli Bovi, Sapienza University of Rome
Luis Espinosa-Anke, Pompeu Fabra University
Lucie Flekova, Darmstadt University of Technology
Graeme Hirst, University of Toronto
Eduard Hovy, Carnegie Mellon University
Ignacio Iacobacci, Sapienza University of Rome
Richard Johansson, University of Gothenburg
David Jurgens, Stanford University
Omer Levy, University of Washington
Andrea Moro, Microsoft
Roberto Navigli, Sapienza University of Rome
Arvind Neelakantan, University of Massachusetts Amherst
Luis Nieto Piña, University of Gothenburg
Siva Reddy, University of Edinburgh
Horacio Saggion, Pompeu Fabra University
Hinrich Schütze, University of Munich
Piek Vossen, University of Amsterdam
Ivan Vulić, University of Cambridge
Torsten Zesch, University of Duisburg-Essen
Jianwen Zhang, Microsoft Research

Invited Speakers:
Roberto Navigli, Sapienza University of Rome
Hinrich Schütze, University of Munich


Table of Contents

Compositional Semantics using Feature-Based Models from WordNet Pablo Gamallo and Martín Pereira-Fariña ...... 1

Automated WordNet Construction Using Word Embeddings Mikhail Khodak, Andrej Risteski, Christiane Fellbaum and Sanjeev Arora...... 12

Improving Verb Metaphor Detection by Propagating Abstractness to Words, Phrases and Individual Senses Maximilian Köper and Sabine Schulte im Walde ...... 24

Improving Clinical Diagnosis Inference through Integration of Structured and Unstructured Knowledge Yuan Ling, Yuan An and Sadid Hasan ...... 31

Classifying Lexical-semantic Relationships by Exploiting Sense/Concept Representations Kentaro Kanada, Tetsunori Kobayashi and Yoshihiko Hayashi ...... 37

Supervised and unsupervised approaches to measuring usage similarity Milton King and Paul Cook ...... 47

Lexical Disambiguation of Igbo using Diacritic Restoration Ignatius Ezeani, Mark Hepple and Ikechukwu Onyenwe ...... 53

Creating and Validating Multilingual Semantic Representations for Six Languages: Expert versus Non-Expert Crowds Mahmoud El-Haj, Paul Rayson, Scott Piao and Stephen Wattam ...... 61

Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation Alexander Panchenko, Stefano Faralli, Simone Paolo Ponzetto and Chris Biemann ...... 72

One Representation per Word - Does it make Sense for Composition? Thomas Kober, Julie Weeds, John Wilkie, Jeremy Reffin and David Weir ...... 79

Elucidating Conceptual Properties from Word Embeddings Kyoung-Rok Jang and Sung-Hyon Myaeng...... 91

TTCSˆe: a Vectorial Resource for Computing Conceptual Similarity Enrico Mensa, Daniele P. Radicioni and Antonio Lieto ...... 96

Measuring the Italian-English lexical gap for action verbs and its impact on translation Lorenzo Gregori and Alessandro Panunzi ...... 102

Word Sense Filtering Improves Embedding-Based Lexical Substitution Anne Cocos, Marianna Apidianaki and Chris Callison-Burch ...... 110

Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of Unambigous Synonyms Aleksander Wawer and Agnieszka Mykowiecka ...... 120


Workshop Program

8:30 - 9:30 Registration

9:30 - 11:00 Session 1

• 9:30-9:40 - Opening Remarks
• 9:40-10:00 - Short paper presentations

Improving Verb Metaphor Detection by Propagating Abstractness to Words, Phrases and Individual Senses
Maximilian Köper and Sabine Schulte im Walde

Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Alexander Panchenko, Stefano Faralli, Simone Paolo Ponzetto and Chris Biemann

• 10:00-11:00 - Invited talk by Roberto Navigli (Sapienza University)

11:00 - 11:30 Coffee break

11:30 - 13:00 Session 2

• 11:30-11:45 - Lightning talks (posters)

Compositional Semantics using Feature-Based Models from WordNet
Pablo Gamallo and Martín Pereira-Fariña

Automated WordNet Construction Using Word Embeddings Mikhail Khodak, Andrej Risteski, Christiane Fellbaum and Sanjeev Arora

Classifying Lexical-semantic Relationships by Exploiting Sense/Concept Representations
Kentaro Kanada, Tetsunori Kobayashi and Yoshihiko Hayashi

Supervised and unsupervised approaches to measuring usage similarity Milton King and Paul Cook

Lexical Disambiguation of Igbo using Diacritic Restoration Ignatius Ezeani, Mark Hepple and Ikechukwu Onyenwe

Creating and Validating Multilingual Semantic Representations for Six Languages: Expert versus Non-Expert Crowds
Mahmoud El-Haj, Paul Rayson, Scott Piao and Stephen Wattam

Elucidating Conceptual Properties from Word Embeddings Kyoung-Rok Jang and Sung-Hyon Myaeng

TTCSˆe: a Vectorial Resource for Computing Conceptual Similarity Enrico Mensa, Daniele P. Radicioni and Antonio Lieto

Measuring the Italian-English lexical gap for action verbs and its impact on translation
Lorenzo Gregori and Alessandro Panunzi

Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of Unambigous Synonyms Aleksander Wawer and Agnieszka Mykowiecka

• 11:45-13:00 - Poster session

13:00 - 14:30 Lunch

14:30 - 16:00 Session 3

• 14:30-15:00 - Invited talk by Hinrich Schütze (University of Munich)
• 15:00-16:00 - Presentations of the best paper award candidates

Improving Clinical Diagnosis Inference through Integration of Structured and Unstructured Knowledge
Yuan Ling, Yuan An and Sadid Hasan

One Representation per Word - Does it make Sense for Composition? Thomas Kober, Julie Weeds, John Wilkie, Jeremy Reffin and David Weir

Word Sense Filtering Improves Embedding-Based Lexical Substitution Anne Cocos, Marianna Apidianaki and Chris Callison-Burch

16:00 - 16:30 Coffee break

16:30 - 17:15 Session 4

• 16:30-17:15 - Open discussion, best paper award and closing remarks

Compositional Semantics using Feature-Based Models from WordNet

Pablo Gamallo
Centro Singular de Investigación en Tecnoloxías da Información (CiTIUS)
Universidade de Santiago de Compostela
Galiza, Spain
[email protected]

Martín Pereira-Fariña
Centre for Argument Technology (ARG-tech)
University of Dundee
Dundee, DD1 4HN, Scotland (UK)
[email protected]
Departamento de Filosofía e Antropoloxía
Universidade de Santiago de Compostela
Pz. de Mazarelos, 15782, Galiza, Spain
[email protected]

Abstract

This article describes a method to build semantic representations of composite expressions in a compositional way by using WordNet relations to represent the meaning of words. The meaning of a target word is modelled as a vector in which its semantically related words are assigned weights according to both the type of the relationship and the distance to the target word. Word vectors are compositionally combined by syntactic dependencies. Each syntactic dependency triggers two complementary compositional functions, named the head function and the dependent function. The experiments show that the proposed compositional method performs as well as the state of the art for subject-verb expressions, and clearly outperforms the best system for transitive subject-verb-object constructions.

1 Introduction

The principle of compositionality (Partee, 1984) states that the meaning of a complex expression is a function of the meaning of its constituent parts and of the mode of their combination. In recent years, different distributional semantic models endowed with a compositional component have been proposed. Most of them define words as high-dimensional vectors where dimensions represent co-occurring context words. This distributional semantic representation makes it possible to combine vectors using simple arithmetic operations such as addition and multiplication, or more advanced compositional methods such as learning functional words as tensors and composing constituents through inner product operations.

Notwithstanding, these models are usually qualified as black box systems because they are usually not interpretable by humans. Currently, the field of interpretable computational models is gaining relevance[1] and, therefore, the development of more explainable and understandable models in compositional semantics is also an open challenge in this field. On the other hand, distributional semantic models, given the size of the vectors, need significant resources, and they are dependent on a particular corpus, which can generate some biases in their application to different languages.

Thus, in this paper, we pay attention to compositional approaches which employ other kinds of word semantic models, such as those based on the WordNet relationships, i.e., synsets, hypernyms, hyponyms, etc. Only in (Faruqui and Dyer, 2015) can we find a proposal for word vector representation using hand-crafted linguistic resources (WordNet, FrameNet, etc.), although a compositional frame is not explicitly adopted. Therefore, to the best of our knowledge, this is the first work using WordNet to build compositional semantic interpretations. Thus, in this article, we propose a method to compositionally build the semantic representation of composite expressions using a feature-based approach (Hadj Taieb et al., 2014): constituent elements are induced by WordNet relationships.

However, this proposal raises a serious problem: the semantic representation of two syntactically related words (e.g. the verb run and the noun computer in "the computer runs") encodes incompatible information and there is no direct way of combining the features used to represent the meaning of the two words. On the one hand, the verb

[1] http://www.darpa.mil/program/explainable-artificial-intelligence

1 Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pages 1–11, Valencia, Spain, April 4 2017. c 2017 Association for Computational Linguistics run is related by synonymy, hypernym, hyponym again given team at the direct object position. This and entailment to other verbs and, on the other, the dependency also yields the contextualized sense of noun computer is put in relation with other nouns the object given the verb and its nominal subject by synonymy, hypernym, hyponym, and so on. (coach+run). At the end of the interpretation pro- In order to solve this drawback, on the basis cess, we obtain three fully contextualized senses. of previous work on dependency-based distribu- In the second case, from right-to-left, the semantic tional compositionality (Thater et al., 2010; Erk process process is applied in a similar way, being and Pado,´ 2008), we distinguish between direct contextualized (and disambiguated) using the re- denotation and selectional preferences within a strictions imposed by the verb and its nominal ob- dependency relation. More precisely, when two ject (run+team). As in the first case, three slightly words are syntactically related, for instance com- different word senses are also obtained. puter and the verb run by the subject relation, we Lastly, word sense disambiguation is out of the build two contextualized senses: the contextual- aim of this paper. Here, we only use WordNet for ized sense of computer given the requirements of extracting semantic information from words, but run and the contextualized sense of run given com- not to identify word senses. puter. The article is organized as follow: In the next section (2), different approaches on ontological The sense of computer is built by combining feature-based representations and compositional the semantic features of the noun (its direct de- semantics are introduced and discussed. Then, notation) with the selectional preferences imposed sections 3 and 4 respectively describe our feature- by the verb. The features of the noun are built based semantic representation and compositional from the set of words linked to computer in Word- strategy. In Section 5, some experiments are per- Net, while the selectional preferences of run in formed to evaluate the quality of the word models the subject position are obtained by combining the and compositional word vectors. Finally, relevant features of all the nouns that can be the nominal conclusions are reported in Section 6. subject of the verb (i.e. the features of runners). Then, the two sets of features are combined and 2 Related Work the resulting new set represents the specific sense of the noun computer as nominal subject of run. Our approach relies on two tasks: to build feature- The sense of the verb given the noun is built in a based representations using WordNet relations, analogous way: the semantic features of the verb and to build compositional vectors using the are combined with the (inverse) selectional pref- WordNet representations. In this section, we will erences imposed by the noun, resulting in a new examine work related to these two tasks. compositional representation of the verb run when it is combined with computer at the subject po- 2.1 Feature-Based Approaches sition. The two new compositional feature sets Tversky (1977), in order to define a similarity represent the contextualized senses of the two re- measure, assumes that any object can be repre- lated words. 
During the contextualization process, sented as a collection (set) of features or proper- ambiguous or polysemous words may be disam- ties. Therefore, a similarity metric is a feature- biguated in order to obtain the right representation. matching process between two objects. This con- For dealing with any sequence with N (lexi- sists of a linear combination of the measures of cal) words (e.g., “the coach runs the team”), the their common and distinctive features. It is worth semantic process can be applied in two different noting that this is a non-symmetric measure. ways: from left-to-right and from right-to-left. In In the particular case of semantic similarity met- the first case, it is applied N 1 times dependency- rics, each word or concept is featured by means of − by-dependency in order to obtain N contextual- a set of words (Hadj Taieb et al., 2014). Framed ized senses, one per lexical word. Thus, firstly, into an ontology such as WordNet, these sets the subject dependency builds two contextualized of words are obtained from taxonomic (hyper- senses: that of run given the noun coach and that nym, hyponym, etc.) and non-taxonomic (synsets, of the noun given the verb. Then, the direct object glosses, meronyms, etc.) properties (Meng et al., dependency is applied on the already contextual- 2013), although these last ones are classified as ized sense of the verb in order to contextualize it secondary in many cases (Slimani, 2013). The

2 main objective of this approach is to capture the Mitchell, 2013; Baroni and Zamparelli, 2010; Ba- semantic knowledge induced by ontological rela- roni, 2013; Baroni et al., 2014). In our approach, tionships. by contrast, compositional functions, which are Our model is partly inspired by that defined driven by dependencies and not by functional in (Rodr´ıguez and Egenhofer, 2003). It proposes words, are just basic arithmetic operations on vec- that the set of properties that characterizes a word tors as in (Mitchell and Lapata, 2008). Arithmetic may be stratified into three groups: i) synsets; approaches are easy to implement and produce ii) features (e.g., meronyms, attributes, hyponym, high-quality compositional vectors, which makes etc.), and, iii) neighbor concepts (those linked via them a good choice for practical applications (Ba- semantic pointers). Each one of these strata is roni et al., 2014). weighted according to its contribution to the rep- Other compositional approaches based on Cat- resentation of the concept. The measure analyzes egorial Grammar use tensor products for com- the overlapping among the three strata between the position (Grefenstette et al., 2011; Coecke et two terms under comparison. al., 2010). A neural network-based method with tensor factorization for learning the embed- 2.2 Compositional Strategies dings of transitive clauses has been introduced in (Hashimoto and Tsuruoka, 2015). Two prob- Several models for compositionality in vector lems arise with tensor products. First, they re- spaces have been proposed in recent years, and sult in an information scalability problem, since most of them use bag-of-words as basic distribu- tensor representations grow exponentially as the tional representations of word contexts. The ba- phrases grow longer (Turney, 2013). And sec- sic approach to composition, explored by Mitchell ond, tensor products did not perform as well as and Lapata (2008; 2009; 2010), is to combine vec- component-wise multiplication in Mitchell and tors of two syntactically related words with arith- Lapata’s (2010) experiments. metic operations: addition and component-wise multiplication. The additive model produces a sort There are also works focused on the notion of union of word contexts, whereas multiplication of sense contextualization, e.g., Dinu and Lapata has an intersective effect. According to Mitchell (2010) work on context-sensitive representations and Lapata (2008), component-wise multiplica- for lexical substitution. Reddy et al. (2011) work tion performs better than the additive model. How- on dynamic prototypes for composing the seman- ever, in (Mitchell and Lapata, 2009; Mitchell and tics of noun-noun compounds and evaluate their Lapata, 2010), these authors explore weighted ad- approach on a compositionality-based similarity ditive models giving more weight to some con- task. stituents in specific word combinations. For in- So far, all the cited works are based on bag- stance, in a noun-subject-verb combination, the of-words to represent vector contexts and, then, verb is provided with higher weight because the word senses. However, there are a few works us- whole construction is closer to the verb than to ing vector spaces structured with syntactic infor- the noun. Other weighted additive models are de- mation. Thater et al. (2010) distinguish between scribed in (Guevara, 2010) and (Zanzotto et al., first-order and second-order vectors in order to al- 2010). 
All these models have in common the fact of defining composition operations for just word pairs. Their main drawback is that they do not propose a more systematic model accounting for all types of semantic composition. They do not focus on the logical aspects of the functional approach underlying compositionality.

Other distributional approaches develop sound compositional models of meaning inspired by Montagovian semantics, which induce the compositional meaning of the functional words from examples, adopting regression techniques commonly used in machine learning (Krishnamurthy and

low two syntactically incompatible vectors to be combined. This work is inspired by that described in (Erk and Padó, 2008). Erk and Padó (2008) propose a method in which the combination of two words, a and b, returns two vectors: a vector a' representing the sense of a given the selectional preferences imposed by b, and a vector b' standing for the sense of b given the (inverse) selectional preferences imposed by a. A similar strategy is reported in Gamallo (2017). Our approach is an attempt to join the main ideas of these syntax-based models (namely, second-order vectors, selectional preferences and two returning words per combination) in order to apply them to WordNet-based word representations.

3 Semantic Features from WordNet

A word meaning is described as a feature-value structure. The features are the words to which the target word is related in the ontology (e.g., in WordNet, hypernym, hyponym, etc.) and the values correspond to weights computed taking into account two parameters: the relation type and the edge-counting distance between the target word and each word feature (i.e. the number of relations required to reach the feature from the target word) (Rada et al., 1989).

The algorithm to set the feature values is the following. Given a target word w_1 and the feature set F, where w_i \in F if w_i is a word semantically related to w_1 in WordNet, the weight for the relation between w_1 and w_i is computed by equation 1:

weight(w_1, w_i) = \sum_{j=1}^{R} \frac{1}{length(w_1, w_i, r_j)}    (1)

where R is the number of different semantic relations (e.g. synonymy/synset, hyperonymy, hyponymy, etc.) that WordNet defines for the part-of-speech of the target word. For instance, nouns have five different relations, verbs four and adjectives just two. length(w_1, w_i, r_j) is the length of the path from the target word w_1 to its feature w_i in relation r_j. length(w_1, w_i, r_j) = 1 when r_j stands for the synonymy relationship, i.e. when w_1 and w_i belong to the same synset; length(w_1, w_i, r_j) = 2 if w_i is at the first level within the hierarchy associated to relation r_j. For instance, the length value of a direct hypernym is 2 because there is a distance of two arcs with regard to the target word: the first arc goes from the target word to a synset and the second one is the hyperonymy relation between the direct hypernym and the synset. The length value increases by one unit as the hierarchy level goes up, so at level 4, the length score is 5 and the partial weight is 1/5 = 0.2. For some non-taxonomic relations, namely meronymy, holonymy and coordinates, there is only one level in WordNet, but the distance is 3 since the target word and the word feature (part, whole or coordinate term) are separated by a synset and a hypernym.

As a feature word w_i may be related to the target w_1 via different semantic relations (without distinguishing between different word senses), the final weight is the addition of all partial weights. For instance, take the noun car. It is related to automobile through two different relationships: they belong to the same synset and the latter is a direct hypernym of the former, so weight(car, automobile) = 1/1 + 1/2 = 1.5.

To compute compositional operations on words, the feature-value structure associated to each word is modeled as a vector, where features are dimensions, words are objects, and weights are the values for each object/dimension position.

4 Compositional Semantics

4.1 Syntactic Dependencies As Compositional Functions

Our approach is also inspired by (Erk and Padó, 2008). Here, semantic composition is modeled in terms of function application driven by binary dependencies. A dependency is associated in the semantic space with two compositional functions on word vectors: the head and the dependent functions. To explain how they work, let us take the direct object relation (dobj) between the verb run and the noun team in the expression "run a team". The head function, dobj_\uparrow, combines the vector of the head verb \vec{run} with the selectional preferences imposed by the noun, which is also a vector of WordNet features, noted \vec{team}^{\circ}. This combination is performed by component-wise multiplication and results in a new vector \vec{run}_{dobj\uparrow}, which represents the contextualized sense of run given team in the dobj relation:

dobj_\uparrow(\vec{run}, \vec{team}^{\circ}) = \vec{run} \odot \vec{team}^{\circ} = \vec{run}_{dobj\uparrow}

To build the (inverse) selectional preferences imposed by the dependent word team as direct object on the verb, we require a reference corpus to extract all those verbs of which team is the direct object. The selectional preferences of team as direct object of a verb, noted \vec{team}^{\circ}, form a new vector obtained by component-wise addition of the vectors of all those verbs (e.g. create, support, help, etc.) that are in a dobj relation with the noun team:

\vec{team}^{\circ} = \sum_{\vec{v} \in T} \vec{v}

where T is the vector set of verbs having team as direct object (except run). T is thus included in the

4 subspace of verb vectors. Component-wise addi- and electric would denote functions while team tion has an union effect. and coach would be their arguments. Similarly, the dependent function, dobj , com- By contrast, we deny the rigid “predicate- ↓ bines the noun vector team~ with the selectional argument” structure. In our compositional ap- preferences imposed by the verb, noted run~ ◦, by proach, dependencies are the active functions that component-wise multiplication. Such a combina- control and rule the selectional requirements im- tions builds the new vector of team~ dobj , which posed by the two related words. Thus, each con- ↓ stands for the contextualized sense of team given stituent word imposes its selectional preferences run in the dobj relation: on the other within a dependency-based construc- tion. This is in accordance with non-standard lin- guistic research which assumes that the words in- dobj (run~ ◦, team~ ) = team~ run~ ◦ = team~ dobj volved in a composite expression impose seman- ↓ ↓ tic restrictions on each other (Pustejovsky, 1995; The selectional preferences imposed by the Gamallo et al., 2005; Gamallo, 2008). head word run to its direct object are represented 4.2 Recursive Compositional Application by the vector run~ ◦, which is obtained by adding the vectors of all those nouns (e.g. company, In our approach, the consecutive application of the project, marathon, etc) which are in relation dobj syntactic dependencies found in a sentence is ac- with the verb run: tually the process of building the contextualized sense of all the lexical words which constitute it. run~ ◦ = ~v Thus, the whole sentence is not assigned to an ~v R unique meaning (which could be the contextual- X∈ ized sense of the root word), but one sense per where R is the vector set of nouns playing the di- lemma, being the sense of the root just one of rect object role of run (except team). R is included them. in the subspace of nominal vectors. This incremental process may have two di- Each multiplicative operation results in a rections: from left-to-right and vice versa (i.e., compositional vector of a contextualized word. from right-to-left). Figure 1 illustrates the Component-wise multiplication has an intersec- incremental process of building the sense of tive effect. The vector standing for the selectional words dependency-by-dependency from left-to- preferences restricts the vector of the target word right. Thus, given the composite expression “the by assigning weight 0 to those WordNet features coach runs the team” and its dependency analysis that are not shared by both vectors. The new com- depicted in the first row of the figure, two com- positional vector as well as the two constituents all positional processes are driven by the two depen- belong to the same vector subspace (the subspace dencies involved in the analysis (nsubj and dobj). of nouns, verbs, or adjectives). 
Each dependency is decomposed into two func- Notice that, in approaches to computational tions: head (nsubj and dobj ) and dependent ↑ ↑ semantics inspired by Combinatory Categorial (nsubj and dobj ) functions.2 The first compo- ↓ ↓ Grammar (Steedman, 1996) and Montagovian se- sitional process applies, on the one hand, the head mantics (Montague, 1970), the interpretation pro- function nsubj on the denotation of the head verb ↑ cess for composite expressions such as “run a ( ~run) and on the selectional preferences required ~ team” or “electric coach” relies on rigid function- by coach (coach◦), in order to build a contex- argument structures: relational expressions, like tualized sense of the verb: ~runnsubj . On the ↑ verbs and adjectives, are used as predicates while other hand, the dependent function nsubj builds ↓ nouns and nominals are their arguments. In the the sense of coach as nominal subject of run: composition process, each word is supposed to coach~ nsubj . Then, the contextualized head vec- ↓ play a rigid and fixed role: the relational word tor is involved in the compositional process driven is semantically represented as a selective func- by dobj. At this level of semantic composition, the tion imposing constraints on the denotations of selectional preferences imposed on the noun team the words it combines with, while non-relational 2We do not consider the meaning of determiners, auxiliary words are in turn seen as arguments filling the con- verbs, or tense affixes. Quantificational issues associated to straints imposed by the function. For instance, run them are also beyond the scope of this work.

5 stand for the semantic features of all those nouns 5.1.1 Rating by Similarity which may be the direct object of coach+run. At In the first experiment, we use the WordSim353 the end of the process, we have not obtained one dataset (Finkelstein et al., 2002), which was con- single sense for the whole expression, but one con- structed by asking humans to rate the degree of se- textualized sense per lexical word: coach~ nsubj , ↓ mantic similarity between two words on a numer- ~runnsubj +dobj and team~ dobj . ↑ ↑ ↓ ical scale. This is a small dataset with 353 word In other case, from right-to-left, the verb run is pairs. The performance of a computational system first restricted by team at the direct object position, is measured in terms of correlation (Spearman) be- and then by its subject coach. In addition, this tween the scores assigned by humans to the word noun is now restricted by the selectional prefer- pairs and the similarity Dice coefficient assigned ences imposed by run and team, that is, it is com- by our system (WN) built with the WordNet-based bined with the semantic features of all those nouns model space. that may be the nominal subject of run+team. Table 1 compares the Spearman correlation ob- tained by our model, WN, with that obtained by 5 Experiments the corpus-based system described in (Halawi et We have performed several similarity-based ex- al., 2012), which is the highest score reached so periments using the semantic word model defined far on that dataset. Even if our results were clearly in Section 3 and the compositional algorithm de- outperformed by that corpus-based method, WN scribed in 4.3 First, in Subsection 5.1, we evaluate seems to behave well if compared with the state- just word similarity without composition. Then, of-the-art knowledge-based (unsupervised) strat- in Subsection 5.2, we evaluate the simple compo- egy reported in (Agirre et al., 2009). sitional approach by making use of a dataset with Systems ρ Method similar noun-verb pairs (NV constructions). Fi- WN 0.69 knowledge nally, the recursive application of compositional (Hassan and Mihalcea, 2011) 0.62 knowledge (Agirre et al., 2009) 0.66 knowledge functions is evaluated in Subsection 5.3, by mak- (Halawi et al., 2012) 0.81 corpus ing use of a dataset with similar noun-verb-noun pairs (NVN constructions). Table 1: Spearman correlation between the Word- In all experiments, we made use of datasets Sim353 dataset and the rating obtained by our suited for the task at hand, and compare our results knowledge-based system WN and the state-of-the- with those obtained by the best systems for the art for both knowledge and corpus-based strate- corresponding dataset. Moreover, in order to build gies. the selectional preferences of the syntactically re- lated words, we used the (BNC). Syntactic analysis on BNC was performed 5.1.2 Synonym Detection with with the dependency parser DepPattern (Gamallo Multiple-Choice Questions and Gonzalez,´ 2011; Gamallo, 2015), previously In this evaluation task, a target word is presented PoS tagged with Tree-Tagger (Schmid, 1994). with four synonym candidates, one of them being the correct synonym of the target. For instance, 5.1 Word Similarity for the target deserve, the system must choose be- Recently, the use of word similarity methods has tween merit (the correct one), need, want, and been criticised as a reliable technique for evalu- expect. 
Accuracy is the number of correct an- ating distributional semantic models (Batchkarov swers divided by the total number of words in the et al., 2016), given the small size of the datasets and the limitation of context information as well. Systems Noun Adj Verb All However, given this procedure still is widely ac- WN 0.85 0.85 0.75 0.80 (Freitag et al., 2005) 0.76 0.76 0.64 0.72 cepted, we have performed two different kinds of (Zhu, 2015) 0.71 0.71 0.63 0.69 experiments: rating by similarity and synonym de- (Kiela et al., 2015) - - - 0.88 tection with multiple-choice questions. Table 2: Accuracy of three systems on the WBST 3Both the software and the semantic word model test (synonym detection on nouns, adjectives, and are freely available at http://fegalaz.usc.es/ verbs) ˜gamallo/resources/CompWordNet.tar.gz.

[Figure 1 appears here in the original layout; its caption follows below. The figure shows the dependency analysis coach <-nsubj- run -dobj-> team and the left-to-right construction of the word senses: nsubj_\uparrow(\vec{run}, \vec{coach}^{\circ}) and nsubj_\downarrow(\vec{run}^{\circ}, \vec{coach}) yield \vec{coach}_{nsubj\downarrow}, \vec{run}_{nsubj\uparrow}, \vec{team}; then dobj_\uparrow(\vec{run}_{nsubj\uparrow}, \vec{team}^{\circ}) and dobj_\downarrow(\vec{(coach+run)}^{\circ}, \vec{team}) yield \vec{coach}_{nsubj\downarrow}, \vec{run}_{nsubj\uparrow+dobj\uparrow}, \vec{team}_{dobj\downarrow}.]

Figure 1: Syntactic analysis of the expression “the coach runs the team” and left-to-right construction of the word senses. dataset. In this experiment, we compute the similarity The dataset is an extended TOEFL test, called between the contextualized heads of two NV com- the WordNet-based Synonymy Test (WBST) pro- posites and between their contextualized depen- posed in (Freitag et al., 2005). WBST was pro- dent expressions. For instance, we compute the duced by generating automatically a large set similarity between “eye flare” vs “eye flame” by of TOEFL-like questions from the synonyms in comparing first the verbs flare and flame when WordNet. In total, this procedure yields 9,887 combined with eye in the subject position (head noun, 7,398 verb, and 5,824 adjective questions, function), and by comparing how (dis)similar is a total of 23,509 questions, which is a very large the noun eye when combined with both the verbs dataset. Table 2 shows the results. In this case, the flare and flame (dependent function). In addition, accuracy obtained by WN for the three syntactic as we are provided with two similarities (head and categories is close to state-of-the-art corpus-based dep) for each pair of compared expressions, it is method for this task (Kiela et al., 2015), which is a possible to compute a new similarity score by av- neural network trained with a huge corpus contain- eraging the results of head and dependent func- ing 8 billion words from English Wikipedia and tions (head+dep). newswire texts. Table 3 shows the Spearman’s correlation val- ues (ρ) obtained by the three versions of WN: 5.2 Noun-Verb Composition only head function (head), only dependent func- The first experiment aimed at evaluating our com- tion (dep) and average of both (head+dep). The positional strategy uses the test dataset by Mitchell latter score value is comparable to the state-of-the- and Lapata (2008), which comprises a total of art system for this dataset, reported in (Erk and 3,600 human similarity judgments. Each item Pado,´ 2008). It is also very similar to the most re- consists of an intransitive verb and a subject noun, cent results described in (Dinu et al., 2013), where which are compared to another noun-verb pair the authors made use of the compositional strategy (NV) combining the same noun with a synonym defined in (Baroni and Zamparelli, 2010). of the verb that is chosen to be either similar o dissimilar to the verb in the context of the given Systems ρ subject. For instance, “child stray” is related to WN (head+dep) 0.29 WN (head) 0.26 “child roam”, being roam a synonym of stray. WN (dep) 0.14 The dataset was constructed by extracting NV (Erk and Pado,´ 2008) 0.27 composite expressions from the British National (Dinu et al., 2013) 0.26 Corpus (BNC) and verb synonyms from WordNet. Table 3: Spearman correlation for intransitive ex- In order to evaluate the results of the tested sys- pressions using the benchmark by Mitchell and tems, Spearman correlation is computed between Lapata (2008) individual human similarity scores and the sys- tems’ predictions.
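The WN system rated in Section 5.1 scores word pairs with a Dice coefficient over the WordNet-based feature vectors of Section 3. As a rough, self-contained illustration of that pipeline, the sketch below builds such vectors with NLTK's WordNet interface and compares two of them; the function names, the restricted relation inventory (synonyms and direct hypernyms/hyponyms only) and the particular weighted Dice variant are assumptions made for this example, not the authors' implementation (which is distributed at the URL given in footnote 3).

```python
# Minimal sketch (not the authors' released code) of the Section 3 feature vectors
# and a Dice-style similarity as used for the WN system in Section 5.1. The relation
# inventory and path lengths below (synonyms at length 1, direct hypernyms/hyponyms
# at length 2) and the weighted Dice variant are illustrative assumptions.
# Requires: nltk plus the WordNet data (nltk.download("wordnet")).
from collections import defaultdict
from nltk.corpus import wordnet as wn


def feature_vector(word, pos=wn.NOUN):
    """Weight each related lemma by summing 1/length over the relations reaching it."""
    weights = defaultdict(float)
    for synset in wn.synsets(word, pos=pos):
        # Synonymy: same synset, path length 1.
        for lemma in synset.lemma_names():
            if lemma != word:
                weights[lemma] += 1.0
        # Direct hypernyms and hyponyms: path length 2 (word -> synset -> neighbour synset).
        for neighbour in synset.hypernyms() + synset.hyponyms():
            for lemma in neighbour.lemma_names():
                weights[lemma] += 1.0 / 2
    return weights


def dice_similarity(vec_a, vec_b):
    """Weighted Dice coefficient between two sparse feature vectors."""
    shared = sum(min(vec_a[f], vec_b[f]) for f in vec_a if f in vec_b)
    total = sum(vec_a.values()) + sum(vec_b.values())
    return 2.0 * shared / total if total else 0.0


car, truck = feature_vector("car"), feature_vector("truck")
print(car["automobile"])             # weight of "automobile" as a feature of "car"
print(dice_similarity(car, truck))   # similarity score in [0, 1]
```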

7 5.3 Noun-Verb-Noun Composition Systems ρ WN (nv-n head+dep) 0.35 The last experiment consists in evaluating the WN (nv-n head) 0.34 WN (nv-n dep) 0.13 quality of compositional vectors built by means WN (n-vn head+dep) 0.50 of the consecutive application of head and de- WN (n-vn head) 0.35 pendency functions associated with nominal sub- WN (n-vn dep) 0.44 WN (n-vn+nv-n) 0.47 ject and direct object. The experiment is per- (Grefenstette and Sadrzadeh, 2011b)* 0.28 formed on the dataset developed in (Grefenstette (Van De Cruys et al., 2013)* 0.37 and Sadrzadeh, 2011a). The dataset was built us- (Tsubaki et al., 2013)* 0.44 (Milajevs et al., 2014) 0.46 ing the same guidelines as Mitchell and Lapata (Polajnar et al., 2015) 0.35 (2008), using transitive verbs paired with subjects (Hashimoto et al., 2014) 0.48 and direct objects: NVN composites. (Hashimoto and Tsuruoka, 2015) 0.48 Human agreement 0.75 Given our compositional strategy, we are able to compositional build several vectors that somehow Table 4: Spearman correlation for transitive ex- represent the meaning of the whole NVN com- pressions using the benchmark by Grefenstette posite expression. In order to known which is and Sadrzadeh (2011) the best compositional strategy and be exhaustive and complete, we evaluate all of them; i.e., both left-to-right and right-to-left strategies. Thus, take Table 4 shows the Spearman’s correlation val- again the expression “the coach runs the team”. ues (ρ) obtained by all the different versions built If we follow the left-to-right strategy (noted nv-n), from our model WN. The best score was achieved at the end of the compositional process, we obtain by averaging the head and dependent similarity two fully contextualized senses: values derived from the n-vn (right-to-left) strat- egy. Let us note that, for NVN composite ex- nv-n head The sense of the head run, as a result pressions, the left-to-right strategy seems to build of being contextualized first by the prefer- less reliable compositional vectors than the right- ences imposed by the subject and then by the to-left counterpart. Besides, the combination of preferences required by the direct object. We the two strategies (n-vn+nv-n) does not improve note nv-n head) the final sense of the head in the results of the best one (n-vn).4. The score val- a NVN composite expression following the ues obtained by the different versions of the right- left-to-right strategy. to-left strategy outperform other systems for this dataset (see results reported below in the table). nv-n dep The sense of the object team, as a re- Our best strategy (ρ = 0.50) also outperforms the sult of being contextualized by the prefer- neural network strategy described in (Hashimoto ences imposed by run previously combined and Tsuruoka, 2015), which achieved 0.48 with- with the subject coach. We note nv-n dep the out considering extra linguistic information not in- final sense of the direct object in a NVN com- cluded in the dataset. The (ρ) scores for this task posite expression following the left-to-right are reported for averaged human ratings. This is strategy. due to a disagreement in previous work regard- ing which metric to use when reporting results. If we follow the right-to-left strategy (noted n- We mark with asterisk those systems reporting (ρ) vn), at the end of the compositional process, we scores based on non-averaged human ratings. 
obtain two fully contextualized senses: 6 Conclusions n-nv head The sense of the head run as a result of being contextualized first by the preferences In this paper, we described a compositional model imposed by the object and then by the sub- based on WordNet features and dependency-based ject. functions on those features. It is a recursive pro- posal since it can be repeated from left-to-right or n-nv dep The sense of the subject coach, as a from right-to-left and the sense of each constituent result of being contextualized by the prefer- word is performed in a recursive way. ences imposed by run previously combined 4n-vn+nv-n is computed by averaging the similarities of with the object team. both n-vn head+dep and nv-n head+dep
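To make the head and dependent functions of Section 4.1 and the left-to-right (nv-n) application of Section 4.2 easier to trace, the sketch below runs the two dependencies of "the coach runs the team" in sequence. The dense toy vectors, the variable names and the way the selectional-preference vectors are fabricated are all assumptions made for illustration; in the paper those vectors are the sparse WordNet feature vectors of Section 3 and the preferences are summed over words extracted from the BNC.

```python
# Illustrative sketch (not the released implementation) of the compositional
# functions: head = component-wise multiplication of the head vector with the
# dependent's selectional preferences; dependent = the symmetric operation;
# selectional preferences = component-wise addition of co-occurring words' vectors.
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Direct denotations (sparse WordNet feature vectors in the paper; random toys here).
coach, run, team = rng.random(dim), rng.random(dim), rng.random(dim)

# Toy sets of co-occurring words, each represented by a random vector.
verbs_with_coach_as_subject = [rng.random(dim) for _ in range(3)]  # e.g. train, lead, ...
verbs_with_team_as_object = [rng.random(dim) for _ in range(3)]    # e.g. create, support, ...
nouns_as_subject_of_run = [rng.random(dim) for _ in range(3)]      # e.g. manager, company, ...
nouns_as_object_of_run = [rng.random(dim) for _ in range(3)]       # e.g. company, marathon, ...

def preferences(vectors):
    """Selectional preferences: component-wise addition of the co-occurring vectors."""
    return np.sum(vectors, axis=0)

def head(head_vec, dependent_prefs):
    """Contextualize the head word by the preferences its dependent imposes on it."""
    return head_vec * dependent_prefs    # component-wise multiplication

def dependent(dep_vec, head_prefs):
    """Contextualize the dependent word by the preferences its head imposes on it."""
    return dep_vec * head_prefs

# nsubj: contextualize run by coach, and coach by run.
run_nsubj = head(run, preferences(verbs_with_coach_as_subject))
coach_nsubj = dependent(coach, preferences(nouns_as_subject_of_run))

# dobj, applied to the already contextualized verb (left-to-right, nv-n order); note
# that the paper builds team's restriction from the preferences of coach+run combined.
run_nsubj_dobj = head(run_nsubj, preferences(verbs_with_team_as_object))
team_dobj = dependent(team, preferences(nouns_as_object_of_run))

print(coach_nsubj, run_nsubj_dobj, team_dobj, sep="\n")
```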

8 Our compositional model tackles the problem In Proceedings of the 2010 Conference on Em- of information scalability. This problem states pirical Methods in Natural Language Processing, that the size of semantic representations should not EMNLP’10, pages 1183–1193, Stroudsburg, PA, USA. grow exponentially, but proportionally; and, infor- mation must not be loss using fixed size of compo- Marco Baroni, Raffaella Bernardi, and Roberto Zam- sitional vectors. In our approach, however, even if parelli. 2014. Frege in space: A program for com- positional distributional semantics. LiLT, 9:241– the size of the compositional vectors is fixed, there 346. is no information loss since each word of the com- posite expression is associated to a compositional Marco Baroni. 2013. Composition in distributional vector representing its context-sensitive sense. In semantics. Language and Linguistics Compass, 7:511–522. addition, the compositional vectors do not grow exponentially since their size is fixed by the vector Miroslav Batchkarov, Thomas Kober, Jeremy Reffin, space: they are all first-order (or direct) vectors. Julie Weeds, and David Weir. 2016. A critique of Finally, the number of vectors increases in propor- word similarity as a method for evaluating distribu- tional semantic models. In Proceedings of the ACL tion to the number of constituent words found in Workshop on Evaluating Vector Space Representa- the composite expression. So, both points are suc- tions for NLP, Berlin, Germany. cessfully solved. In future work, we will try to design a compo- B. Coecke, M. Sadrzadeh, and S. Clark. 2010. Math- ematical foundations for a compositional distribu- sitional model based on word semantic represen- tional model of meaning. Linguistic Analysis, 36(1- tations combining WordNet-based features with 4):345–384. syntactic-based distributional contexts as well as extend our model to full sentences instead of the Georgiana Dinu and Mirella Lapata. 2010. Measur- ing distributional similarity in context. In Proceed- simple ones described in this paper. ings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1162–1172, Acknowledgements Cambridge, MA, October. Association for Compu- tational Linguistics. We would like to thank the anonymous review- ers for helpful comments and suggestions. This G. Dinu, N. Pham, and M. Baroni. 2013. General work was supported by a 2016 BBVA Founda- estimation and evaluation of compositional distribu- tional semantic models. In ACL 2013 Workshop on tion Grant for Researchers and Cultural Creators, Continuous Vector Space Models and their Compo- and by project TELPARES (FFI2014-51978-C2- sitionality (CVSC 2013), pages 50–58, East Strouds- 1-R) and grant TIN2014-56633-C3-1-R, Ministry burg PA. of Economy and Competitiveness. It has received Katrin Erk and Sebastian Pado.´ 2008. A structured financial support from the Conseller´ıa de Cultura, vector space model for word meaning in context. In Educacion´ e Ordenacion´ Universitaria (Postdoc- Proceedings of EMNLP, Honolulu, HI. toral Training Grants 2016 and Centro Singular de Investigacion´ de Galicia accreditation 2016-2019, Manaal Faruqui and Chris Dyer. 2015. Non- distributional word vector representations. In Pro- ED431G/08) and the European Regional Develop- ceedings of ACL. ment Fund (ERDF). Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Ey- References tan Ruppin. 2002. Placing search in context: the concept revisited. ACM Trans. Inf. 
Syst., 20(1):116– Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana 131. Kravalova, Marius Pas¸ca, and Aitor Soroa. 2009. A study on similarity and relatedness using distribu- Dayne Freitag, Matthias Blume, John Byrnes, Ed- tional and wordnet-based approaches. In Proceed- mond Chow, Sadik Kapadia, Richard Rohwer, and ings of Human Language Technologies: The 2009 Zhiqiang Wang. 2005. New experiments in distribu- Annual Conference of the North American Chap- tional representations of synonymy. In Proceedings ter of the Association for Computational Linguistics, of the Ninth Conference on Computational Natural NAACL ’09, pages 19–27, Stroudsburg, PA, USA. Language Learning, pages 25–32. Association for Computational Linguistics. Pablo Gamallo and Isaac Gonzalez.´ 2011. A grammat- Marco Baroni and Roberto Zamparelli. 2010. Nouns ical formalism based on patterns of part-of-speech are vectors, adjectives are matrices: Represent- tags. International Journal of , ing adjective-noun constructions in semantic space. 16(1):45–71.

9 Pablo Gamallo, Alexandre Agustini, and Gabriel 1544–1555, Doha, Qatar, October. Association for Lopes. 2005. Clustering Syntactic Positions with Computational Linguistics. Similar Semantic Requirements. Computational Linguistics, 31(1):107–146. S. Hassan and R. Mihalcea. 2011. Semantic related- ness using salient semantic analysis. In Proceedings Pablo Gamallo. 2008. The meaning of syntactic de- of AAAI Conference on Artificial Intelligence. pendencies. Linguistik OnLine, 35(3):33–53. Douwe Kiela, Felix Hill, and Stephen Clark. 2015. Pablo Gamallo. 2015. Dependency parsing with com- Specializing word embeddings for similarity or re- pression rules. In International Workshop on Pars- latedness. In Llu´ıs Marquez,` Chris Callison-Burch, ing Technology (IWPT 2015), Bilbao, Spain. Jian Su, Daniele Pighin, and Yuval Marton, edi- Pablo Gamallo. 2017. The role of syntactic dependen- tors, EMNLP, pages 2044–2048. The Association cies in compositional distributional semantics. Cor- for Computational Linguistics. pus Linguistics and Linguistic Theory. Jayant Krishnamurthy and Tom Mitchell, 2013. Pro- Edward Grefenstette and Mehrnoosh Sadrzadeh. ceedings of the Workshop on Continuous Vector 2011a. Experimental support for a categorical com- Space Models and their Compositionality, chapter positional distributional model of meaning. In Con- Vector Space Semantic Parsing: A Framework for ference on Empirical Methods in Natural Language Compositional Vector Space Models, pages 1–10. Processing. Association for Computational Linguistics. Edward Grefenstette and Mehrnoosh Sadrzadeh. Lingling Meng, Runquing Huang, and Junzhong Gu. 2011b. Experimenting with transitive verbs in a dis- 2013. A review of semantic similarity measures in cocat. In Workshop on Geometrical Models of Nat- wordnet. International Journal of Hybrid Informa- ural Language Semantics (EMNLP-2011). tion Technology, 6(1).

Edward Grefenstette, Mehrnoosh Sadrzadeh, Stephen Dmitrijs Milajevs, Dimitri Kartsaklis, Mehrnoosh Clark, Bob Coecke, and Stephen Pulman. 2011. Sadrzadeh, and Matthew Purver. 2014. Evaluating Concrete sentence spaces for compositional distri- neural word representations in tensor-based compo- butional models of meaning. In Proceedings of the sitional settings. In Alessandro Moschitti, Bo Pang, Ninth International Conference on Computational and Walter Daelemans, editors, EMNLP, pages 708– Semantics, IWCS ’11, pages 125–134. 719. ACL.

Emiliano Guevara. 2010. A regression model of Jeff Mitchell and Mirella Lapata. 2008. Vector-based adjective-noun compositionality in distributional se- models of semantic composition. In Proceedings of mantics. In Proceedings of the 2010 Workshop on ACL-08: HLT, pages 236–244. GEometrical Models of Natural Language Seman- tics, GEMS ’10. Jeff Mitchell and Mirella Lapata. 2009. Language models based on semantic composition. In Proceed- M. A. Hadj Taieb, M. Ben Aouicha, and ings of EMNLP, pages 430–439. A. Ben Hamadou. 2014. Ontology-based approach for measuring semantic similarity. Engineering Jeff Mitchell and Mirella Lapata. 2010. Composition Applications of Artificial Intelligence, 36:238–261. in distributional models of semantics. Cognitive Sci- Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and ence, 34(8):1388–1439. Yehuda Koren. 2012. Large-scale learning of word relatedness with constraints. In Proceedings of the Richard Montague. 1970. Universal grammar. theoria. 18th ACM SIGKDD International Conference on Theoria, 36:373–398. Knowledge Discovery and Data Mining, KDD ’12, pages 1406–1414. Barbara Partee. 1984. Compositionality. In Frank Landman and Frank Veltman, editors, Varieties Kazuma Hashimoto and Yoshimasa Tsuruoka. 2015. of Formal Semantics, pages 281–312. Dordrecht: Learning embeddings for transitive verb disam- Foris. biguation by implicit tensor factorization. In Pro- ceedings of the 3rd Workshop on Continuous Vector Tamara Polajnar, Laura Rimell, and Stephen Clark. Space Models and their Compositionality, pages 1– 2015. An exploration of discourse-based sentence 11, Beijing, China, July. Association for Computa- spaces for compositional distributional semantics. tional Linguistics. In Proceedings of the First Workshop on Linking Computational Models of Lexical, Sentential and Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa, Discourse-level Semantics, pages 1–11, Lisbon, Por- and Yoshimasa Tsuruoka. 2014. Jointly learn- tugal, September. Association for Computational ing word representations and composition functions Linguistics. using predicate-argument structures. In Proceed- ings of the 2014 Conference on Empirical Methods James Pustejovsky. 1995. The Generative Lexicon. in Natural Language Processing (EMNLP), pages MIT Press, Cambridge.

10 R. Rada, H. Mili, E. Bicknell, and M. Blettner. 1989. Development and application of a metric on se- mantic nets. Systems, Man and Cybernetics, IEEE Transactions on, 19(1):17–30. Siva Reddy, Ioannis P. Klapaftis, Diana McCarthy, and Suresh Manandhar. 2011. Dynamic and static pro- totype vectors for semantic composition. In Fifth In- ternational Joint Conference on Natural Language Processing, IJCNLP 2011, Chiang Mai, Thailand, November 8-13, 2011, pages 705–713. M. Andrea Rodr´ıguez and Max J. Egenhofer. 2003. Determining semantic similarity among entity classes from different ontologies. IEEE Trans. Knowl. Data Eng., 15(2):442–456. H. Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing. Thabet Slimani. 2013. Description and evaluation of semantic similarity measures approaches. Interna- tional Journal of Computer Applications, 80(1):25– 33. Mark Steedman. 1996. Surface Structure and Inter- pretation. The MIT Press. Stefan Thater, Hagen Furstenau,¨ and Manfred Pinkal. 2010. Contextualizing semantic representations us- ing syntactically enriched vector models. In Pro- ceedings of the 48th Annual Meeting of the Associa- tion for Computational Linguistics, pages 948–957, Stroudsburg, PA, USA. Masashi Tsubaki, Kevin Duh, Masashi Shimbo, and Yuji Matsumoto. 2013. Modeling and learning se- mantic co-compositionality through prototype pro- jections and neural networks. In EMNLP, pages 130–140. ACL. Peter D. Turney. 2013. Domain and function: A dual- space model of semantic relations and compositions. Journal of Artificial Intelligence Research (JAIR), 44:533–585. A. Tversky. 1977. Features of similarity. Psichologi- cal Review, 84(4). Tim Van De Cruys, Thierry Poibeau, and Anna Ko- rhonen. 2013. A Tensor-based Factorization Model of Semantic Compositionality. In Conference of the North American Chapter of the Association of Com- putational Linguistics (HTL-NAACL), pages 1142– 1151, Atlanta, United States, June. Fabio Massimo Zanzotto, Ioannis Korkontzelos, Francesca Fallucchi, and Suresh Manandhar. 2010. Estimating linear models for compositional distribu- tional semantics. In Proceedings of the 23rd Inter- national Conference on Computational Linguistics, COLING ’10, pages 1263–1271. Peng Zhu. 2015. N-grams based linguistic search en- gine. International Journal of Computational Lin- guistics Research, 6(1):1–7.

Automated WordNet Construction Using Word Embeddings

Mikhail Khodak, Andrej Risteski, Christiane Fellbaum and Sanjeev Arora

Computer Science Department, Princeton University, 35 Olden St., Princeton, New Jersey 08540

{mkhodak,risteski,fellbaum,arora}@cs.princeton.edu

Abstract target language and machine translation (MT) be- tween that language and English. We present a fully unsupervised method A standard minimal WordNet design is to have for automated construction of WordNets synsets connected by hyponym-hypernym rela- based upon recent advances in distribu- tions and linked back to PWN (Global WordNet tional representations of sentences and Association, 2017). This allows for applications word-senses combined with readily avail- to cross-lingual tasks and rests on the assumption able machine translation tools. The ap- that synsets and their relations are invariant across proach requires very few linguistic re- languages (Sagot and Fiser,ˇ 2008). For example, sources and is thus extensible to multiple while all senses of English word tie may not line target languages. To evaluate our method up with all senses of French word cravate, the we construct two 600-word test sets for sense “necktie” will exist in both languages and word-to-synset matching in French and be represented by the same synset. Russian using native speakers and evalu- Thus MT is often used for automated Word- ate the performance of our method along Nets to generate a set of candidate synsets for each with several other recent approaches. Our word w in the target language by getting a set of method exceeds the best language-specific English translations of w and using them to query and multi-lingual automated WordNets in PWN (we will refer to this as MT+PWN). The F-score for both languages. The databases number of candidate synsets produced may not be we construct for French and Russian, both small, even as large as a hundred for some polyse- languages without large publicly available mous verbs. Thus one needs a way to select from manually constructed WordNets, will be the candidates of w those synsets that are its true publicly released along with the test sets. senses. The main contributions of this paper is a 1 Introduction new word embedding-based method for matching words to synsets and the release of two large word- A WordNet is a lexical database for languages synset matching test sets for French and Russian.1 based upon a structure introduced by the Princeton Though there has been some work using word- WordNet (PWN) for English in which sets of cog- vectors for WordNets (see Section 2), the result- nitive synonyms, or synsets, are interconnected ing databases have been small, containing less with arcs standing for semantic and lexical rela- than 1000 words. Using embeddings for this task tions between them (Fellbaum, 1972). WordNets is challenging due to the need for good ways to are widely used in computational linguistics, in- use PWN synset information and account for the formation retrieval, and machine translation. Con- breakdown of cosine-similarity for polysemous structing one by hand is time-consuming and dif- words. We approach the first issue by representing ficult, motivating a search for automated or semi- synset information using recent work on sentence- automated methods. We present an unsupervised embeddings by Arora et al. (2017). To handle pol- method based on word embeddings and word- ysemy we devise a sense clustering scheme based sense induction and build and evaluate WordNets on Word Sense Induction (WSI) via linear alge- for French and Russian. Our approach needs only a large unannotated corpus like Wikipedia in the 1 https://github.com/mkhodak/pawn

12 Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pages 12–23, Valencia, Spain, April 4 2017. c 2017 Association for Computational Linguistics

We demonstrate how this sense purification procedure effectively combines clustering with embeddings, thus being applicable to many word-sense disambiguation (WSD) and induction-related tasks. Using both techniques yields a WordNet method that outperforms other language-independent methods as well as language-specific approaches such as WOLF, the French WordNet used by the Natural Language ToolKit (Sagot and Fišer, 2008; Bond and Foster, 2013; Bird et al., 2009).

Our second contribution is the creation of two new 600-word test sets in French and Russian that are larger and more comprehensive than any currently available, containing 200 each of nouns, verbs, and adjectives. They are constructed by presenting native speakers with all candidate synsets produced as above by MT+PWN and treating the senses they pick out as "ground truth" for measuring precision and recall. The motivation behind separating by part-of-speech (POS) is that nouns are often easier than adjectives and verbs, so reporting one number — as done by some past work — allows high noun performance to mask low performance on adjectives and verbs.

Using these test sets, we can begin addressing the difficulties of evaluation for non-English automated WordNets due to the use of different and unreported test data, incompatible metrics (e.g. matching synsets to words vs. retrieving words for synsets), and differing cross-lingual dictionaries. In this paper we use the test sets to evaluate our method and several other automated WordNets.

2 Related Work

There have been many language-specific approaches for building automated WordNets, notably for Korean (Lee et al., 2000), French (Sagot and Fišer, 2008; Pradet et al., 2013), and Persian (Montazery and Faili, 2010). These approaches also use MT+PWN to get candidate word-synset pairs, but often use further resources — such as bilingual corpora, expert knowledge, or WordNets in related languages — to select correct senses. The Korean construction depends on a classifier trained on 3260 word-sense matchings that yields 93.6% precision and 77.1% recall, albeit only on nouns. The Persian WordNet uses a scoring function based on related words between languages (requiring expert knowledge and parallel corpora) and achieves 82.6% precision, though without reporting recall and POS-separated statistics.

The most comparable results to ours are from the Wordnet Libre du Français (WOLF) of Sagot and Fišer (2008), which leverages multiple European WordNet projects. Our best method exceeds this approach on our test set and benefits from having far fewer resource requirements. The Wordnet du Français (WoNeF) of Pradet et al. (2013) depends on combining linguistic models by a voting scheme. Their performance is found to be generally below WOLF's, so we compare to the latter.

There has also been work on multi-language WordNets, specifically the Extended Open Multilingual Wordnet (OMW) (Bond and Foster, 2013), which scraped Wiktionary, and the Universal Multilingual Wordnet (UWN) (de Melo and Weikum, 2009), which used multiple translations to rate word-sense matches. In our evaluation both produce high-precision/low-coverage WordNets.

Finally, there have been recent vector approaches for an Arabic WordNet (Tarouti and Kalita, 2016) and a Bengali WordNet (Nasiruddin et al., 2014). The Arabic effort uses a cosine-similarity threshold for correcting direct translation and reports a precision of 78.4% on synonym matching, although its small size (943 synsets) indicates a poor precision/recall trade-off. The Bengali WordNet paper examines WSI on word vectors, evaluating clustering methods on seven words and achieving F1-scores of at best 52.1%. It is likely that standard clustering techniques are insufficient when one needs many thousands of clusters, an issue we address via sparse coding.

Our use of distributional word embeddings to construct WordNets is the latest in a long line of their applications, e.g. approximating word similarity and solving word-analogies (Mikolov et al., 2013). The latter discovery was cited as the inspiration for the theoretical model in Arora et al. (2016b), whose Squared-Norm (SN) vectors we use; the computation is similar in form and performance to GloVe (Pennington et al., 2014).

3 Methods for WordNet Construction

The basic WordNet method is as follows. Given a target word w, we use a bilingual dictionary to get its translations in English and let its set of candidate synsets be all PWN senses of the translations (MT+PWN). We then assign a score to each synset and accept as correct all synsets with score above a threshold α. If no synset is above the cutoff we accept only the highest-scoring synset. This score-threshold method is illustrated in Figure 1.
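To make the MT+PWN and score-threshold steps concrete, here is a minimal sketch in Python. It queries PWN through NLTK and assumes two hypothetical helpers that are not part of the released implementation: translate_to_english (the bilingual dictionary or MT lookup) and score (any of the scoring functions of Sections 3.1-3.3).

```python
from nltk.corpus import wordnet as wn  # Princeton WordNet via NLTK

def candidate_synsets(word, translate_to_english):
    """MT+PWN: all PWN synsets of any English translation of `word`."""
    candidates = set()
    for english in translate_to_english(word):
        candidates.update(wn.synsets(english))
    return candidates

def score_threshold_match(word, translate_to_english, score, alpha=0.4):
    """Keep every candidate scoring above alpha; if none does,
    fall back to the single highest-scoring candidate."""
    scored = [(score(word, s), s) for s in candidate_synsets(word, translate_to_english)]
    if not scored:
        return []
    accepted = [s for value, s in scored if value > alpha]
    return accepted if accepted else [max(scored, key=lambda x: x[0])[1]]
```

For the running example of Figure 1, translate_to_english('dalle') would return English words such as 'flag', 'flagstone' and 'slab', and the scores would then decide which of their PWN senses are kept.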

[Figure 1 diagram: the target French word 'dalle' is translated with a French-English dictionary or MT (flag, flagstone, slab), the translations are used to query Princeton WordNet for candidate synsets, each candidate is scored (e.g. flag.n.06 = 0.360, slab.n.01 = 0.521), and synsets above the threshold 0.41 are matched; the correct synsets are flag.n.06 and slab.n.01, but only slab.n.01 clears the cutoff.]

Figure 1: The score-threshold procedure for French word w = dalle (flagstone, slab). Candidate synsets generated by MT+PWN are given a score and matched to w if the score is above a threshold α.

Thus we need methods to assign high scores to correct candidate synsets of w and low scores to incorrect ones. We use unsupervised word vectors in the target language computed from a large text corpus (e.g. Wikipedia). Section 3.1 presents a simple baseline that improves upon MT+PWN via a cosine-similarity metric between w and each synset's lemmas. A more sophisticated synset representation method using sentence-embeddings is described in Section 3.2. Finally, in Section 3.3 we discuss WSI-based procedures for improving the score-threshold method for words with fine sense distinctions or poorly-annotated synsets.

In this section we assume a vocabulary V of target language words with associated d-dimensional unit word-vectors v_w in R^d for d much smaller than |V| (e.g. d = 300 for vocabulary size 50000) trained on a large text corpus. Each word w in V also has a set of candidate synsets found by MT+PWN. We call synsets S, S' in PWN related, denoted S ~ S', if one is a hyponym, meronym, antonym, or attribute of the other, if they share a verb group, or if S = S'.

3.1 Baseline: Average Similarity Method

This method for scoring synsets can be seen as a simple baseline. Given a candidate synset S, we define T_S as the subset of V containing the translations of its lemmas from English to the target language. The score of S is then

  $\frac{1}{|T_S|} \sum_{w' \in T_S} v_w \cdot v_{w'}$,

the average cosine similarity between w and the translated lemmas of S. Although straightforward, this scoring method is quite noisy, as averaging word-vectors dilutes similarity performance, and it does not use all synset information provided by PWN.

3.2 Method 1: Synset Representation

To improve upon this baseline we need a better vector representation of S to score S via cosine similarity with v_w. Previous efforts in synset and sense embeddings (Iacobacci et al., 2015; Rothe and Schütze, 2015) often use extra resources such as WordNet or BabelNet for the target language (Navigli and Ponzetto, 2012). As such databases are not always available, we propose a synset representation u_S that is unsupervised, needing no extra resources beyond MT and PWN, and leverages recent work on sentence embeddings.

This new representation combines embeddings of synset information given by PWN, e.g. synset relations, definitions, and example sentences. To create these embeddings we first consider the question of how to represent a list of words L as a vector in R^d. One way is to simply take the normalized sum $\hat{v}_L^{(SUM)}$ of their word-vectors, where

  $v_L^{(SUM)} = \sum_{w' \in L} v_{w'}$

Potentially more useful is to compute a vector $\hat{v}_L^{(SIF)}$ via the sentence embedding formula of Arora et al. (2017), based on smooth inverse frequency (SIF) weighting, which (for a = 10^-4 and before normalization) is expressed as

  $v_L^{(SIF)} = \sum_{w' \in L} \frac{a}{a + p(w')}\, v_{w'}$

where p(w') is the unigram probability of w'. SIF is similar in spirit to TF-IDF (Salton and Buckley, 1988) and builds on work of Wieting et al. (2016); it has been found to perform well on other similarity tasks (Arora et al., 2017).
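As an illustration, both list-embedding formulas can be written in a few lines of NumPy. The word-vector lookup vec and the unigram-probability table p are assumed to be given (e.g. estimated from the same Wikipedia corpus); this is a sketch of the SUM and SIF weightings only, not the authors' code, and it omits the principal-component removal step of the full SIF method of Arora et al. (2017).

```python
import numpy as np

def embed_sum(words, vec):
    """Unweighted (SUM) embedding: normalized sum of word-vectors."""
    v = np.sum([vec[w] for w in words], axis=0)
    return v / np.linalg.norm(v)

def embed_sif(words, vec, p, a=1e-4):
    """SIF embedding: down-weight frequent words by a / (a + p(w))."""
    v = np.sum([(a / (a + p[w])) * vec[w] for w in words], axis=0)
    return v / np.linalg.norm(v)
```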

We find that SIF improves performance on sentences but not on translated lemma lists (Figure 2), likely because sentences contain many distractor words that SIF will weight lower, while the presence of distractors among lemmas is independent of word frequency. Thus to compute the synset score u_S · v_w we make the vector representation u_S of S the element-wise average of:

• $\hat{v}_{T_S}^{(SUM)}$, where T_S is the set of translations of the lemmas of S (as in Section 3.1).

• $\hat{v}_{R_S}^{(SUM)}$, where $R_S = \bigcup_{S' \sim S} T_{S'} \setminus T_S$ is the set of lemma translations of synsets S' related to S.

• $\hat{v}_{D_S}^{(SIF)}$, where D_S is the list of tokens in the translated definition of S.

• $\frac{1}{|\mathcal{E}_S|} \sum_{E \in \mathcal{E}_S} \hat{v}_E^{(SIF)}$, where $\mathcal{E}_S$ contains the lists of tokens in the translated example sentences of S (excluded if S has no example sentences).

[Figure 2 bar chart: rows for Average Lemma Similarity (baseline), Synset Lemmas, Synset Lemmas + Related Lemmas, Synset Definition, Synset Definition + Example Sentences, and All Synset Information (the Method 1 synset representation), with F-scores between roughly 0.55 and 0.65.]

Figure 2: F-score comparison between using unweighted summation and SIF-weighted summation for embedding PWN synset information.

3.3 Method 2: Better Matching Using WSI

We found through examination that the score-threshold method we have used so far performs poorly in two main cases:

(a) the word w has no candidate synset with score high enough to clear the threshold;

(b) w has multiple closely related synsets that are all correct matches but some of which have a much lower score than others.

Here we address both issues by using sense information found by applying a word-sense induction method first introduced in Arora et al. (2016a). We summarize their WSI model — referred to henceforth as Linear-WSI — in Section 3.3.1. Then in Section 3.3.2 we devise a sense purification procedure for constructing a word-cluster for each induced sense of a word. Applying this procedure to construct word-cluster representations of candidate synsets provides an additional metric for the correctness of word-synset matches that can be used to devise a w-specific threshold α_w to ameliorate problem (a). Meanwhile, using Linear-WSI to associate similar candidate synsets of w to each other provides a way to address problem (b). We explain these methodologies in Section 3.3.3.

3.3.1 Summary of Linear-WSI Model

In Arora et al. (2016a) the authors posit that the vector of a polysemous word can be linearly decomposed into vectors associated to its senses. Thus for w = tie — which can be an article of clothing, a drawn match, and so on — we would have $v_w \approx a\,v_{w(\mathrm{clothing})} + b\,v_{w(\mathrm{match})} + \dots$ for a, b in R. It is unclear how to find such sense-vectors, but one expects different words to have closely related sense-vectors, e.g. for w' = bow the vector v_{w'(clothing)} would be close to the vector v_{w(clothing)} of tie. Thus the Linear-WSI model proposes using sparse coding, namely finding a set of unit basis vectors a_1, ..., a_k in R^d such that for all w in V,

  $v_w = \sum_{i=1}^{k} R_{wi} a_i + \eta_w$,   (1)

for k > d, η_w a noise vector, and at most s coefficients R_{wi} nonzero. The hope is that the sense-vector v_{w(clothing)} of tie is in the neighborhood of a vector a_i such that R_{wi} > 0. Indeed, for k = 2000 and s = 5, Arora et al. (2016a) report that solving (1) represents English word-senses as well as a competent non-native speaker and significantly better than older clustering methods for WSI.

3.3.2 Sense Purification

While (1) is a good WSI method, its use for building WordNets is hampered by its inability to produce more than a few thousand senses a_i, as setting a large k yields repetitive rather than different senses. As this is far fewer than the number of synsets in PWN, we use sense purification to address this by extracting a cluster of words related to a_i as well as w to represent each sense. In addition to w and a_i, the procedure takes as input a search-space V' contained in V and a set size n.
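One rough way to obtain a sparse decomposition of this form is dictionary learning with an OMP sparsity constraint, for example via scikit-learn. Arora et al. (2016a) use their own solver, so the snippet below is only an approximate stand-in that shows the shape of the computation; the word-vector matrix X (one row per vocabulary word) is assumed to be given.

```python
from sklearn.decomposition import MiniBatchDictionaryLearning

def linear_wsi(X, k=2000, s=4):
    """Approximate the decomposition of Eq. (1): learn k 'sense' directions
    a_i and at most s nonzero coefficients R_wi per word vector."""
    learner = MiniBatchDictionaryLearning(n_components=k,
                                          transform_algorithm='omp',
                                          transform_n_nonzero_coefs=s)
    R = learner.fit_transform(X)   # |V| x k sparse coefficient matrix
    A = learner.components_        # k x d matrix of sense directions
    return R, A
```

The induced senses of a word are then simply the directions a_i whose coefficient in that word's row of R is nonzero.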

[Figure 3 plots: cluster-word glosses such as "species", "family", "kind", "plant", "to shoot", "crossbow", "pistol", "shooting", "cilantro", "coriander", "celery", and "garlic" mark the word clusters recovered for the different senses of brilliant, fox, and лук.]
Figure 3: Isometric mapping of sense-cluster vectors for w = brilliant, fox, and лук (bow, onion). w is marked by a star and each sense ai of w, shown by a large marker, has an associated cluster of words with the same marker shape. Contours are densities of vectors close to w and at least one sense ai.

Then we find C, a subset of V' of size n containing w, that maximizes

  $f = \min_{x = a_i \text{ or } x = v_{w'},\, w' \in C} \mathrm{Median}\{x \cdot v_{w'} : w' \in C\}$   (2)

A cluster C maximizing this must ensure that neither a_i nor any word w' in C (including w' = w) has low average cosine similarity with cluster-words, resulting in a dense cluster close to both w and a_i. We explain this further in Appendix A.

We illustrate this method in Figure 3 by purifying the senses of the English words brilliant and fox and the Russian word лук, which has the senses "bow" (weapon), "onion" (vegetable), and "onion" (plant). Note how correct senses are recovered across POS and language and for proper and common noun senses.

3.3.3 Applying WSI to Synset Matching

The problem addressed by sense purification is that senses a_i induced by Linear-WSI have too many related words; purification solves this by extracting a cluster of words related to w from the words close to a_i. When translating WordNet synsets, we have a similar problem in that translations of a synset's lemmas may not be relevant to the synset itself. Thus we can try to create a purified representation of each candidate synset S of w by extracting a cluster of translated lemmas close to w and one of its induced senses. We run purification on every sense a_i in the sparse representation (1) of w, using as a search-space V' = V_S the set of translations of all lemmas of synsets S' related to S in PWN (i.e. S' ~ S as defined in Section 3). To each synset S we associate the sense a_S and corresponding cluster C_S that are optimal in the objective (2) among all senses a_i of w.

Although we find a sense a_S and purified representation C_S for each candidate synset of w, we note that an incorrect synset is likely to have a lower objective value (2) than a correct synset, as it likely has fewer words related to w in its search-space V_S. However, using f_S = f(w, a_S, C_S) as a synset score is difficult, as some synsets have very small search-spaces, leading to inconsistent scoring even for correct synsets.

Instead we use f_S as part of a w-specific threshold α_w = min{α, u_{S*} · v_w}, where u_S is the vector representation of S from Section 3.2 and S* = argmax_S (f_S + u_S · v_w). This attempts to address problem (a) of the score-threshold method — that some words have no synsets above the cutoff α, and returning only the highest-scoring synset in such cases will not retrieve multiple senses for polysemous words. By the construction of α_w, if w is polysemous, a synset other than the one with the highest score u_S · v_w may have a better value of f_S and thus be found to be S*; then all synsets with higher scores will be matched to w, allowing for multiple matches even if no synset clears the cutoff α. Otherwise, if w is monosemous, we expect the correct synset S to have the highest value for both u_S · v_w and f_S, making S* = S. Then if no synset has score greater than α the threshold will be set to u_S · v_w, the highest synset-score, so only the highest-scoring synset will be matched to w.

To address problem (b) of the threshold method — that w might have multiple correct and closely related candidate synsets of which only some clear the cutoff — we observe that closely related synsets of w will have similar search-spaces V_S and so are likely to be associated to the same sense a_i. For example, in Figure 4 the candidates of dalle related to its correct meanings as a flagstone or slab are associated to the same sense a_892, while distractor synsets related to the incorrect sense as a flag are mostly matched to other senses.
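The greedy construction of such a cluster (detailed in Appendix A) can be sketched as follows: starting from C = {w}, repeatedly add whichever candidate word keeps the objective (2) largest. Word vectors are assumed to be unit-length and stored in a dict vec keyed by word, with the sense direction a_i given as a vector; this is an illustrative re-implementation under those assumptions, not the released code.

```python
import numpy as np

def objective(cluster, a_i, vec):
    """f(w, a_i, C) of Eq. (2): the smallest median similarity between a_i
    (or any cluster word) and the other words of the cluster."""
    vals = [np.median([np.dot(a_i, vec[c]) for c in cluster])]
    for x in cluster:
        others = [c for c in cluster if c != x]
        if others:
            vals.append(np.median([np.dot(vec[x], vec[c]) for c in others]))
    return min(vals)

def purify(w, a_i, search_space, vec, n=5):
    """Greedily grow a cluster of n words around w and sense direction a_i."""
    cluster, candidates = [w], set(search_space) - {w}
    while len(cluster) < n and candidates:
        best = max(candidates, key=lambda c: objective(cluster + [c], a_i, vec))
        cluster.append(best)
        candidates.remove(best)
    return cluster, objective(cluster, a_i, vec)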

[Figure 4 diagram: for the target word 'dalle', candidate synsets from MT+PWN are scored and matched against the word-specific threshold, each candidate is associated to an induced sense by purification, and unmatched candidates that share a sense with a matched synset are compared to the lower threshold β; both correct synsets flag.n.06 and slab.n.01 are retrieved.]

Figure 4: The score-threshold and sense-agglomeration procedure for French word w = dalle (flagstone, slab). Candidate synsets are given a score and matched to w if they clear a high threshold αw (as in Section 3.2). If an unmatched synset shares a sense ai with a matched synset, it is compared to a low threshold β (the sense-agglomeration procedure in Section 3.3.3).

This motivates the sense agglomeration procedure, which uses threshold-clearing synsets to match synsets with the same sense below the score-threshold. For β < α, the procedure is roughly as follows:

1. Run the score-threshold method with cutoff α_w.

2. For each synset S with score u_S · v_w below the threshold, check for a synset S' with score u_{S'} · v_w above the threshold and a_{S'} = a_S.

3. If u_S · v_w ≥ β and the clusters C_S, C_{S'} satisfy a cluster similarity condition, match S to w.

4 Evaluation of Methods

We evaluate our method's accuracy and coverage by constructing and testing WordNets for French and Russian. For both we train 300-dimensional SN word embeddings (Arora et al., 2016b) on co-occurrences of words occurring at least 1000 times, or having candidate PWN synsets and occurring at least 100 times, in the lemmatized Wikipedia corpus. This yields |V| of roughly 50000. For Linear-WSI we run sparse coding with sparsity s = 4 and basis-size k = 2000 and use set-size n = 5 for purification. To get candidate synsets we use the Google and Microsoft Translators and the dictionary of the translation company ECTACO, while for sentence-length MT we use Microsoft.
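Putting the pieces of Section 3.3.3 together, one possible rendering of the full matching step is sketched below. Here score(S) stands for u_S · v_w, purified(S) returns the (a_S, C_S, f_S) triple from sense purification, and clusters_similar implements the cluster similarity condition of Appendix B.2; all three names are placeholders for the components described above rather than a fixed API, and the loop simplifies the procedure given in Appendix B.2.

```python
def match_with_wsi(candidates, score, purified, clusters_similar, alpha, beta):
    """Score-threshold matching with a word-specific cutoff, followed by
    sense agglomeration of sub-threshold synsets. `candidates` is a set."""
    # Word-specific threshold alpha_w = min(alpha, score of the synset
    # maximizing f_S + score(S)).
    s_star = max(candidates, key=lambda S: purified(S)[2] + score(S))
    alpha_w = min(alpha, score(s_star))
    matched = {S for S in candidates if score(S) >= alpha_w}

    # Sense agglomeration: recover unmatched synsets that share an induced
    # sense with a matched synset, clear the low threshold beta, and have
    # a purified cluster similar to those of the matched synsets.
    for S in sorted(candidates - matched, key=score, reverse=True):
        sense, cluster, _ = purified(S)
        peers = [purified(M) for M in matched if purified(M)[0] == sense]
        if peers and score(S) >= beta and all(
                clusters_similar(cluster, c) for _, c, _ in peers):
            matched.add(S)
    return matched
```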

We include the lower cutoff β because even similar synsets may not both be senses of the same word. The cluster similarity condition, available in Appendix B.2, ensures the relatedness of synsets sharing a sense a_i, as an erroneous synset S' may be associated to the same sense as a correct one.

In summary, the improved synset matching approach has two steps: (1) conduct score-threshold matching using the modified threshold α_w; (2) run the sense-agglomeration procedure on all senses a_i of w having at least one threshold-clearing synset S. Although for fixed α both steps focus on improving the recall of the threshold method, in practice they allow α to be higher, so that both precision and recall are improved. For example, note the recovery of a correct synset left unmatched by the score-threshold method in the simplified depiction of sense-agglomeration shown in Figure 4.

4.1 Testsets

A common way to evaluate the accuracy of an automated WordNet is to compare its synsets or word matchings to a manually-constructed one. However, the existing ELRA French Wordnet2 is not public and half the size of ours, while Russian WordNets are either even smaller and not linked to PWN3 or obtained via direct translation4.

Instead we construct test sets for each language that allow for evaluation of our methods and others. We randomly chose 200 each of adjectives, nouns, and verbs from the set of target language words whose English translations appear in the synsets of the Core WordNet. Their "ground truth"

2 http://catalog.elra.info/product_info.php?products_id=550
3 http://project.phil.spbu.ru/RussNet/
4 http://wordnet.ru/

17 Method POS F.5-Score∗ F1-Score∗ Precision∗ Recall∗ Coverage Synsets α β Adj. 50.3 59.3 46.1 100.0 99.9 11271 Direct Translation Noun 56.2 64.6 52.2 100.0 100.0 74477 (MT + PWN) Verb 41.4 51.3 37.0 100.0 100.0 13017 Total 49.3 58.4 45.1 100.0 100.0 100174† Wordnet Libre Adj. 66.3 58.6 78.1 53.4 84.8 6865 du Franc¸ais Noun 68.6 58.7 83.2 51.5 95.0 36667 (WOLF) Verb 60.8 48.4 81.0 39.6 88.2 7671 (Sagot and Fiser,ˇ 2008) Total 65.2 55.2 80.8 48.2 92.2 52757† Adj. 64.5 51.5 88.3 42.3 69.2 7407 Universal Wordnet Noun 67.5 52.2 94.1 40.8 75.9 24670 (de Melo and Weikum, 2009) Verb 55.4 39.5 88.0 28.5 76.2 5624 Total 62.5 47.7 90.1 37.2 75.0 39497† Adj. 58.4 40.8 90.9 28.4 54.7 2689 Extended Open Noun 61.3 43.8 96.5 31.7 66.6 14936 Multilingual Wordnet Verb 47.8 29.4 95.9 18.6 57.7 2331 (Bond and Foster, 2013) Total 55.9 38.0 94.5 26.2 63.2 20449† Adj. 62.8 0.0 62.6 0.0 65.3 0.0 68.5 0.0 88.7 9687 0.31 Baseline: ± ± ± ± Noun 67.3 0.0 65.4 0.1 71.6 0.1 69.0 0.1 92.2 37970 0.27 Average Similarity ± ± ± ± Verb 51.8 0.0 50.8 0.1 55.9 0.1 57.0 0.1 83.5 10037 0.26 (Section 3.1) ± ± ± ± Total 60.6 0.0 59.6 0.0 64.3 0.0 64.9 0.0 90.0 58962† ± ± ± ± Adj. 65.9 0.0 60.4 0.0 75.9 0.1 59.5 0.1 85.1 8512 0.47 Method 1: ± ± ± ± Noun 71.0 0.0 67.3 0.1 78.7 0.1 69.1 0.1 96.7 35663 0.41 Synset Representation ± ± ± ± Verb 61.6 0.0 53.0 0.0 78.7 0.1 49.8 0.1 89.9 8619 0.45 (Section 3.2) ± ± ± ± Total 66.2 0.0 60.2 0.0 77.8 0.0 59.5 0.1 93.7 53852† ± ± ± ± Method 2: Adj. 67.7 0.0 62.5 0.1 76.9 0.1 62.6 0.1 91.2 8912 0.56 0.42 ± ± ± ± Synset Representation Noun 73.0 0.0 66.0 0.1 83.7 0.1 62.0 0.2 90.9 34001 0.50 0.25 ± ± ± ± + Linear-WSI Verb 64.4 0.0 55.9 0.0 79.3 0.0 51.5 0.1 93.6 9262 0.46 0.28 ± ± ± ± (Section 3.3) Total 68.4 0.0 61.5 0.0 80.0 0.0 58.7 0.1 91.5 53208† ± ± ± ± ∗ Micro-averages over a randomly held-out half of the data; parameters tuned on the other half. 95% asymptotic confidence intervals found with 10000 randomized trials. † Includes adverb synsets. For the last three methods they are matched with the same parameter values (α and β) as for adjectives. Table 1: French WordNet Results word senses are picked by native speakers, who with French, Korean, and Persian WordNets cited were asked to perform the same matching task de- in Section 2 being evaluated on 183 pairs, 3260 scribed in Section 3, i.e. select correct synsets pairs, and 500 words, respectively. Accuracy mea- for a word given a set of candidates generated by sured with respect to this ground truth estimates MT + PWN. For example, the French word foie how well an algorithm does compared to humans. has one translation, liver with four PWN synsets: 1-“glandular organ”; 2-“liver used as meat”; 3- A significant characteristic of this test set is “person with a special life style”; 4-“someone liv- its dependence on the machine translation sys- ing in a place.” The first two align with senses tem used to get candidate synsets. While this can of foie while the others do not, so the expert leave out correct synset matches that the system marks the first two as good and the others as neg- did not propose, by providing both correct and in- ative. Two native speakers for each language were correct candidate synsets we allow future authors trained by an author with knowledge of WordNet to focus on the semantic challenge of selecting and at least 10 years of experience in each lan- correct senses without worrying about finding the guage. Inconsistencies in the matchings of the two best bilingual dictionary. 
This allows dictionary- speakers were resolved by the same author. independent evaluation of automated WordNets, an important feature in an area where the specific We get 600 words and about 12000 candidate translation systems used are rarely provided in word-synset pairs for each language, with adjec- full. When comparing the performance of our con- tives and nouns having on average about 15 can- struction to that of previous efforts on this test set, didates and verbs having about 30. These num- we do not penalize word-synset matches in which bers makes the test set larger than many others, the synset is not among the candidate synsets we

18 generate for that word, negating the loss of pre- sian nouns and much worse on Russian verbs. cision incurred by other methods due to the use The discrepancy in verbs can be explained by a of different dictionaries. We also do not penalize difference in treating the reflexive case and aspec- other WordNets for test words they do not contain. tual variants due to the grammatical complexity of In addition to precision and recall, we report the Russian verbs. In French, making a verb reflexive Coverage statistic as the proportion of the Core set requires adding a word while in Russian the verb of most-used synsets, a semi-automatically con- itself changes, e.g. to wash to wash oneself is → structed set of about 5000 PWN frequent senses, laver se laver in French but мыть мыться in → → that are matched to at least one word (Fellbaum, Russian. Thus we do not distinguish the reflex- 1972). While an imperfect metric given different ive case for French as the token found is the same sense usage by language, the synsets are universal- but for Russian we do, so both мыть and мыть- enough for it to be a good indicator of usability. ся may appear and have distinct synset matches. Thus matching Russian verbs is challenging as the 4.2 Evaluation Results reflexive usage of a verb is often contextually sim- ilar to the non-reflexive usage. Another compli- We report the evaluation of methods in Section 3 cation for Russian verbs is due to aspectual verb in Tables 1 & 2 alongside evaluations of UWN pairs; thus to do has aspects (делать, сделать) (de Melo and Weikum, 2009), OWM (Bond and in Russian that are treated as distinct verbs while Foster, 2013), and WOLF (Sagot and Fiser,ˇ 2008). in French these are just different tenses of the verb Parameters α and β were tuned to maximize the 1.25 Precision Recall faire. Both factors pose challenges for differentiat- micro-averaged F.5-score .25 Precision· +·Recall , used · ing Russian verb senses by a distributional model. instead of F1 to prioritize precision, which is of- Overall however the method is shown to be ro- ten more important for application purposes. bust to how close the target language is to English, Our synset representation method (Section 3.2) with nouns and adjectives performing well in both exceeds the similarity baseline by 6% in F -score .5 languages and the difference for verbs stemming for French and 10% for Russian. For French it is from an intrinsic quality rather than dissimilarity competitive with the best other WordNet (WOLF) with English. This can be further examined by and in both languages exceeds both multi-lingual testing the method on a non-European language. WordNets. Improving this method via Linear-WSI (Section 3.3) leads to 2% improvement in F - .5 5 Conclusion and Future Work score for French and 1% for Russian. Our methods also perform best in F1-score and Core coverage. We have shown how to leverage recent advances As expected from a Wiktionary-scraping in word embeddings for fully-automated WordNet method, OMW achieves the best precision construction. Our best approach combining sen- across languages, although it and UWN have tence embeddings, and recent methods for WSI low recall and Core coverage. The performance obtains performance 5-16% above the naive base- of our best method for French exceeds that of line in F.5-score as well as outperforming previ- WOLF in F.5-score across POS while achieving ous language-specific and multi-lingual methods. similar coverage. 
WOLF’s recall performance A notable feature of our work is that we require is markedly lower than the evaluation in Sagot only a large corpus in the target language and au- and Fiserˇ (2008, Table 4); we believe this stems tomated translation into/from English, both avail- from our use of words matched to Core synsets, able for many languages lacking good WordNets. not random words, leading to a more difficult We further contribute new 600-word human- test set as common words are more-polysemous annotated test sets split by POS for French and and have more synsets to retrieve. There is no Russian that can be used to evaluate future au- comparable automated Russian-only WordNet, tomated WordNets. These larger test sets give a with only semi-automated and incomplete efforts more accurate picture of a construction’s strengths (Yablonsky and Sukhonogov, 2006). and weaknesses, revealing some limitations of Comparing across POS, we do best on nouns past methods. With WordNets in French and Rus- and worst on verbs, likely due to the greater pol- sian largely automated or incomplete, the Word- ysemy of verbs. Between languages, performance Nets we build also add an important tool for multi- is similar for adjectives but slightly worse on Rus- lingual natural language processing.

19 Method POS F.5-Score∗ F1-Score∗ Precision∗ Recall∗ Coverage Synsets α β Adj. 50.2 59.6 45.9 100.0 99.6 11412 Direct Translation Noun 41.2 50.2 37.1 100.0 100.0 73328 (MT + PWN) Verb 32.5 41.7 28.6 100.0 100.0 13185 Total 41.3 50.5 37.2 100.0 99.9 99470† Adj. 52.4 38.8 80.3 29.6 51.0 11412 Universal Wordnet Noun 65.0 53.0 87.5 45.1 71.1 19564 (de Melo and Weikum, 2009) Verb 48.1 34.8 74.8 25.7 65.0 3981 Total 55.1 42.2 80.8 33.4 67.1 30015† Adj. 58.7 41.3 91.7 29.2 55.3 2419 Extended Open Noun 67.8 53.1 93.5 42.5 68.4 14968 Multilingual Wordnet Verb 51.1 34.8 84.5 23.9 56.6 2218 (Bond and Foster, 2013) Total 59.2 43.1 89.9 31.9 64.2 19983† Adj. 61.4 0.0 64.6 0.1 60.9 0.0 77.3 0.1 92.1 10293 0.24 Baseline: ± ± ± ± Noun 55.9 0.0 54.8 0.1 59.9 0.1 59.9 0.1 77.0 32919 0.29 Average Similarity ± ± ± ± Verb 46.3 0.0 46.5 0.1 49.0 0.1 55.1 0.1 84.1 9749 0.21 (Section 3.1) ± ± ± ± Total 54.5 0.0 55.3 0.0 56.6 0.0 64.1 0.1 80.5 54372† ± ± ± ± Adj. 69.5 0.0 64.1 0.0 78.1 0.0 61.7 0.1 84.2 8393 0.43 Method 1: ± ± ± ± Noun 69.8 0.0 65.5 0.0 77.6 0.1 66.0 0.1 85.2 29076 0.46 Synset Representation ± ± ± ± Verb 54.2 0.0 51.1 0.1 63.3 0.1 57.4 0.1 91.2 8303 0.39 (Section 3.2) ± ± ± ± Total 64.5 0.0 60.2 0.0 73.0 0.0 61.7 0.1 86.3 46911† ± ± ± ± Method 2: Adj. 69.7 0.0 64.9 0.1 77.3 0.0 63.6 0.1 93.3 9359 0.43 0.35 ± ± ± ± Synset Representation Noun 71.6 0.0 67.6 0.0 78.1 0.0 68.0 0.1 91.0 31699 0.46 0.33 ± ± ± ± + Linear-WSI Verb 54.4 0.0 49.7 0.1 64.9 0.1 52.6 0.2 91.9 8582 0.44 0.33 ± ± ± ± (Section 3.3) Total 65.2 0.0 60.7 0.0 73.4 0.0 61.4 0.1 91.5 50850† ± ± ± ± ∗ Micro-averages over a randomly held-out half of the data; parameters tuned on the other half. 95% asymptotic confidence intervals found with 10000 randomized trials. † Includes adverb synsets. For the last three methods they are matched with the same parameter values (α and β) as for adjectives. Table 2: Russian WordNet Results

Further improvement to our work may come References from other methods in word-embeddings, such Michal Aharon, Michael Elad, and Alfred Bruckstein. as multi-lingual word-vectors (Faruqui and Dyer, 2006. K-svd: An algorithm for designing overcom- 2014). Our techniques can also be combined with plete dictionaries for sparse representation. IEEE others, both language-specific and multi-lingual, Transactions on Signal Processing, 54(11). for automated WordNet construction. In addition, our method for associating multiple synsets to the Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016a. Linear algebraic struc- same sense can contribute to efforts to improve ture of word sense, with applications to polysemy. PWN through sense clustering (Snow et al., 2007). Finally, our sense purification procedure, which Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, uses word-vectors to extract clusters representing and Andrej Risteski. 2016b. Rand-walk: A la- word-senses, likely has further WSI and WSD ap- tent variable model approach to word embeddings. Transactions of the Association for Computational plications; such exploration is left to future work. Linguistics, 4:385–399.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence em- Acknowledgments beddings. In International Conference on Learning Representations. To Appear.

We thank Angel Chang and Yingyu Liang for Steven Bird, Edward Loper, and Ewan Klein. helpful discussion at various stages of this pa- 2009. Natural Language Processing with Python. per. This work was supported in part by NSF O’Reilly Media Inc. grants CCF-1302518, CCF-1527371, Simons In- Francis Bond and Ryan Foster. 2013. Linking and ex- vestigator Award, Simons Collaboration Grant, tending an open multilingual wordnet. In Proceed- and ONR-N00014-16-1-2329. ings of the 51st Annual Meeting of the Association for Computational Linguistics.

20 Gerard de Melo and Gerhard Weikum. 2009. Towards Benoˆıt Sagot and Darja Fiser.ˇ 2008. Building a free a universal wordnet by learning from combined evi- french wordnet from multilingual resources. In Pro- dence. In Proceedings of the 18th ACM Conference ceedings of the Sixth International Language Re- on Information and Knowledge Management. sources and Evaluation Conference.

Manaal Faruqui and Chris Dyer. 2014. Improving Gerald Salton and Christopher Buckley. 1988. Term- vector space word representations using multilingual weighting approaches in automatic text retrieval. In- correlation. In Proceedings of the 14th Conference formation Processing & Management, 24(5). of the European Chapter of the Association for Com- putational Linguistics. Rion Snow, Sushant Prakash, Daniel Jurafsky, and An- drew Y. Ng. 2007. Learning to merge word senses. Christiane Fellbaum. 1972. WordNet: An Electronic In Proceedings of the 2007 Joint Conference on Em- Lexical Database. MIT Press, Cambridge, MA. pirical Methods in Natural Language Processing and Computational Natural Language Learning. Global WordNet Association. 2017. Wordnets in the Feras Al Tarouti and Jugal Kalita. 2016. Enhancing world. automatic wordnet construction using word embed- dings. In Proceedings of the Workshop on Multilin- Ignaci Iacobacci, Mohammad Taher Pilehvar, and gual and Cross-lingual Methods in NLP. Roberto Navigli. 2015. Sensembed: Learning sense embeddings for word and relational similarity. In John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Proceedings of the 53rd Annual Meeting of the As- Livescu. 2016. Towards universal paraphrastic sen- sociation for Computational Linguistics. tence embeddings. In International Conference on Learning Representations. Changki Lee, Geunbae Lee, and Seo JungYun. 2000. Automated wordnet mapping using word sense dis- Sergey Yablonsky and Andrej Sukhonogov. 2006. ambiguation. In Proceedings of the 2000 Joint SIG- Semi-automated english-russian wordnet construc- DAT Conference on Empirical Methods in Natural tion: Initial resources, software and methods of Language Processing and Very Large Corpora. translation. In Proceedings of the Global WordNet Conference. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word represen- A Purification Procedure tations in vector space. As discussed in Section 3.3.1, the Linear-WSI Mortaza Montazery and Heshaam Faili. 2010. Auto- model (Arora et al., 2016a) posits that there ex- matic persian wordnet construction. In Proceedings d ists an overcomplete basis a1, . . . , ak R of unit of the 23rd International Conference on Computa- ∈ tional Linguistics. vectors such that each word w V can be approx- ∈ imately represented by a linear combination of at Mohammad Nasiruddin, Didier Schwab, and Andon most s basis vectors (see Equation 1). Finding the Tchechmedjiev. 2014. Induction de sens pour en- richir des ressources lexicales. In 21eme´ Traitement basis vectors ai and their coefficients Rwi requires Automatique des Langues Naturelles. solving the optimization problem

Roberto Navigli and Simone Paolo Ponzetto. 2012. minimize Y RA 2 Babelnet: The automatic construction, evaluation k − k subject to R s w V and application of a wide-coverage multilingual se- k wk0 ≤ ∀ ∈ mantic network. Artificial Intelligence, 193. V n where Y R| |× has word-vectors as its rows, V ∈k Jeffrey Pennington, Richard Socher, and Christo- R R| |× is the matrix of coefficients Rwi, and pher D. Manning. 2014. Glove: Global vectors for ∈ k d A R × has the overcomplete basis a1, . . . , ak word representation. In Proceedings of Empirical ∈ Methods in Natural Language Processing. as its rows. This problem is non-convex and can be solved approximately via the K-SVD algorithm Quentin Pradet, Gael¨ de Chalendar, and Jeanne Bague- (Aharon et al., 2006). nier Desormeaux. 2013. Wonef, an improved, ex- Given a word w, the purification procedure is panded and evaluated automatic french translation of wordnet. In Proceedings of the Seventh Global a method for purifying the senses induced via Linear-WSI (the vectors a s.t. R = 0) by rep- Wordnet Conference. i wi 6 resenting them as word-clusters related to both the Sascha Rothe and Hinrich Schutze.¨ 2015. Autoex- sense itself and the word. This is done so as to tend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the 53rd create a more fine-grained collection of senses, as Annual Meeting of the Association for Computa- Linear-WSI does not perform well for more than tional Linguistics. a few thousand basis vectors, far fewer than the

21 number of word-senses (Arora et al., 2016a). The its candidate synsets S, and a fixed set size n, we procedure is inspired by the hope that words re- run the following procedure: lated to each sense of w will form clusters near 1. For each sense ai in the sparse representation both vw and one of its senses ai. Given a fixed set- (1) of w let Ci be the output cluster of the size n and a search-space V 0 V , we realize this ⊂ purification procedure run on sense a with hope via the optimization problem i search-space V 0 = VS. maximize γ C V ⊂ 0 2. Return the (sense, sense-cluster) pair subject to (aS,CS) with the highest purification a γ Median v v : w0 C x x C procedure objective value among senses i. ≤ { x · w0 ∈ \{ }} ∀ ∈ γ Median ai vw : w0 C In the following methods we will assume that each ≤ { · 0 ∈ } w C, C = n candidate synset S of w has a sense aS and sense- ∈ | | cluster CS associated to it in this way. Examples This problem is equivalent to maximizing (2) with of such clusters for the French word dalle (flag- constraints C = n and w C V . The ob- | | ∈ ⊂ 0 stone, slab) are provided in Table 3. jective value is constrained to be the lowest me- dian cosine similarity between any word x C B.1 Word-Specific Threshold ∈ or the sense ai and the rest of C, so optimizing it One application of Linear-WSI is the creation of ensures that the words in C are closely related to a word-specific threshold αw to use in the score- each other and to ai. Forcing the cluster to contain threshold method instead of a global cutoff α. We w leads to the words in C being close to w as well. do this by using the quality of the sense cluster CS For computational purposes we solve this prob- of each candidate synset S of w as an indicator of lem approximately using a greedy algorithm that the correctness of that synset. Recalling that uS starts with C = w and repeatedly adds the word S { } is the synset representation of as a vector (see in V 0 C that results in the best objective value un- f = f(w, a ,C ) \ Section 3.2) and letting S S S be the til C = n. Speed of computation is also a reason α | | objective value (2), we find w as follows: for using a search-space V V rather than the 0 ⊂ entire vocabulary as a source of words for the clus- 1. Find S∗ = arg max fS + uS vw. S is a candidate of w · ter; we found that restricting V 0 to be all words w0 s.t. min v v , v a .2 dramatically re- 2. Let αw = min α, uS∗ vw . { w0 · w w0 · i} ≥ { · } duces processing time with little performance loss. As this modified threshold may be lower than The purification procedure represents only one more than one candidate synset of w it allows for sense of w, so to perform WSI we generate clus- multiple synset matches even when no synset has ters for all senses a s.t. R > 0. If two senses i wi score high enough to clear the threshold α. have clusters that share a word other than w, only the cluster with the higher objective value is re- B.2 Sense-Agglomeration Procedure turned. To find the clusters displayed in Figure 3 We use Linear-WSI more explicitly through the we use this procedure with cluster size n = 5 on sense-agglomeration procedure, which attempts to English and Russian SN vectors decomposed with recover unmatched synsets using matched synsets basis size k = 2000 and sparsity s = 4. sharing the same sense ai of w. 
We define the clus- ter similarity ρ between word-clusters C ,C B Applying WSI to Synset Matching 1 2 ⊂ V as the median of all cosine-similarities of pairs We use the purification procedure to represent the of words in their set product, i.e. candidate synsets of a word w by clusters of words related to w and one if its senses a . First, for each ρ(C1,C2) = Median vx vy : x C1, y C2 . i { · ∈ ∈ } synset S we define the search-space Then we say that two clusters C1,C2 are similar if VS = TS 0 ρ(C ,C ) min ρ(C ,C ), ρ(C ,C ) , S S 1 2 1 1 2 2 [0∼ ≥ { } where TS0 is the set of translations of lemmas of i.e. if their cluster similarity with each other ex- S0 as in Section 3.1. Then given a word w, one of ceeds at least one’s cluster similarity with itself.

Synset | Associated Sense a_S | Purified Cluster Representation C_S | Objective Value f(w, a_S, C_S)
flag.n.01 | a_789 | poteau (goalpost), flèche (arrow), haut (high, top), mât (matte) | 0.23
flag.n.04 | a_892 | flamme (flame), fanion (pennant), guidon, signal | 0.06
flag.n.06 | a_892 | dallage (paving), carrelage (tiling), pavement, pavage (paving) | 0.36
flag.n.07 | a_1556 | pan (section), empennage, queue, tail | 0.14
iris.n.01 | a_1556 | bœuf (beef), usine (factory), plante (plant), puant (smelly) | 0.07
masthead.n.01 | a_1556 | inscription, lettre (letter), catalogue (catalog), cotation (quotation) | 0.10
pin.n.08 | a_1556 | trou (hole), tertre (mound), marais (marsh), pavillon (house, pavillion) | 0.17
slab.n.01 | a_892 | carrelage (tiling), carreau (tile), tuile (tile), bâtiment (building) | 0.27

Table 3: Purified Synset Representations of dalle (flagstone, slab). Note how the correct candidate synsets (bolded) have clusters of words closely related to the correct meaning while the other clusters have many unrelated words, leading to lower objective values.

Then given a global low threshold β ≤ α, for each sense a_i in the sparse representation (1), sense-agglomeration consists of the following algorithm:

1. Let M_i be the set of candidate synsets S of w that have a_S = a_i and score above the threshold α_w. Stop if M_i is empty.

2. Let U_i be the set of candidate synsets S of w that have a_S = a_i and score below the threshold α_w. Stop if U_i is empty.

3. For each synset S in U_i, ordered by synset-score, check that C_S is similar (in the above sense) to all clusters C_{S'} for S' in M_i and has score higher than β. If both are true, add S to M_i and remove S from U_i. Otherwise stop.

The sense-agglomeration procedure allows an unmatched synset S to be returned as a correct synset of w provided it shares a sense with a different matched synset S' and satisfies the cluster similarity and score-threshold constraints.
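For completeness, the cluster-similarity test used in step 3 follows directly from the definition of ρ in this appendix; a small sketch, assuming unit-length word vectors stored in a dict vec:

```python
import numpy as np

def rho(c1, c2, vec):
    """Median cosine similarity over all word pairs from the two clusters."""
    return np.median([np.dot(vec[x], vec[y]) for x in c1 for y in c2])

def clusters_similar(c1, c2, vec):
    """C1 and C2 are similar if their cross-cluster similarity is at least
    one of the clusters' similarity with itself."""
    return rho(c1, c2, vec) >= min(rho(c1, c1, vec), rho(c2, c2, vec))
```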

23 Improving Verb Metaphor Detection by Propagating Abstractness to Words, Phrases and Individual Senses

Maximilian Köper and Sabine Schulte im Walde Institut für Maschinelle Sprachverarbeitung Universität Stuttgart, Germany {maximilian.koeper,schulte}@ims.uni-stuttgart.de

Abstract

Abstract words refer to things that can not be seen, heard, felt, smelled, or tasted, as opposed to concrete words. Among other applications, the degree of abstractness has been shown to be useful information for metaphor detection. Our contributions to this topic are as follows: i) we compare supervised techniques to learn and extend abstractness ratings for huge vocabularies, ii) we learn and investigate norms for multi-word units by propagating abstractness to verb-noun pairs, which leads to better metaphor detection, iii) we overcome the limitation of learning a single rating per word and show that multi-sense abstractness ratings are potentially useful for metaphor detection. Finally, with this paper we publish automatically created abstractness norms for 3 million English words and multi-words as well as automatically created sense-specific abstractness ratings.

1 Introduction

The standard approach to studying abstractness is to place words on a scale ranging between abstractness and concreteness. Alternately, abstractness can also be given a taxonomic definition in which the abstractness of a word is determined by the number of subordinate words (Kammann and Streeter, 1971; Dunn, 2015).

In psycholinguistics abstractness is commonly used for concept classification (Barsalou and Wiemer-Hastings, 2005; Hill et al., 2014; Vigliocco et al., 2014). In computational work, abstractness has become established information for the task of automatic detection of metaphorical language. So far metaphor detection has been carried out using a variety of features including selectional preferences (Martin, 1996; Shutova and Teufel, 2010; Shutova et al., 2010; Haagsma and Bjerva, 2016), word-level semantic similarity (Li and Sporleder, 2009; Li and Sporleder, 2010), topic models (Heintz et al., 2013), word embeddings (Dinh and Gurevych, 2016) and visual information (Shutova et al., 2016).

The underlying motivation of using abstractness in metaphor detection goes back to Lakoff and Johnson (1980), who argue that metaphor is a method for transferring knowledge from a concrete domain to an abstract domain. Abstractness was already applied successfully for the detection of metaphors across a variety of languages (Turney et al., 2011; Dunn, 2013; Tsvetkov et al., 2014; Beigman Klebanov et al., 2015; Köper and Schulte im Walde, 2016b).

The abstractness information itself is typically taken from a dictionary, created either by manual annotation or by extending manually collected ratings with the help of supervised learning techniques that rely on word representations. While potentially less reliable, automatically created norm-based abstractness ratings can easily cover huge dictionaries. Although some methods have been used to learn abstractness, the literature lacks a comparison of these learning techniques.

We compare and evaluate different learning techniques. In addition we show and investigate the usefulness of extending abstractness ratings to phrases as well as individual word senses. We extrinsically evaluate these techniques on two verb metaphor detection tasks: (i) a type-based setting that makes use of phrase ratings, (ii) a token-based classification using multi-sense abstractness norms. Both settings benefit from our approach.

24 Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pages 24–30, Valencia, Spain, April 4 2017. c 2017 Association for Computational Linguistics

2 Experiments plicitly encode attributes such as abstractness and directly feed the vector representation of a word 2.1 Propagating Abstractness: A into a classifier, either by using linear regression Comparison of Approaches & Ressources (L-Reg), a regression forest (Reg-F) or a fully 2.1.1 Comparison of Approaches connected feed forward neural network with up to Turney et al. (2011) first aproached to automati- two hidden layers (NN).2 cally create abstractness norms for 114 501 words, relying on manual ratings based on the MRC Psy- T&L 03 L-Reg. Reg-F. NN cholinguistic Database (Coltheart, 1981). The un- Glove50 .76 .76 .78 .79 derlying algorithm (Turney and Littman, 2003) re- Glove100 .80 .79 .79 .85 quires vector representation and annotated training Glove200 .78 .78 .76 .84 samples of words. The algorithm itself performs a Glove300 .76 .78 .74 .85 greedy forward search over the vocabulary to learn W2V300 .83 .84 .79 .90 so-called paradigm words. Once paradigm words for both classes (abstract & concrete) are learned, Table 1: Spearman’s ρ for the test ratings. Com- a rating can be assigned to every word by com- paring representations and regression methods. paring its vector representation against the vector representations of the paradigm words. Table 1 shows clearly that we can learn ab- Köper and Schulte im Walde (2016a) used the stractness ratings with a very high correlation on same algorithm for a large collection of German the test data using the word representations from lemmas, and in the same way additional cre- Google (W2V300) together with a neural net- ated ratings for multiple norms including valency, work for regression (ρ=.90). The NN method sig- arousal and imageability. nificantly outperforms all other methods, using A different method that has been used to ex- Steiger (1980)’s test (p < 0.001). tend abstractness norms based on low-dimensional 2.1.2 Comparison of Ressources word embeddings and a Linear Regression classi- fier (Tsvetkov et al., 2013; Tsvetkov et al., 2014). Based on the comparison of methods in the pre- vious section we propagated abstractness ratings We compare approaches across different pub- to the entire vocabulary of the W2V300 dataset licly available vector representations1, to study po- (3million words) and compare the correlation with tential differences across vector dimensionality we other existing norms of abstractness. For this com- compare vectors between 50 and 300 dimensions. parison we use the common subset of two manu- The Glove vectors (Pennington et al., 2014) have ally and one automatically created resource: MRC been trained on 6billion tokens of Wikipedia plus Psycholinguistic Database, ratings from Brysbaert Gigaword (V=400K), while the word2vec cbow et al. (2014) and the automatically created ratings model (Mikolov et al., 2013) was trained on a from Turney et al. (2011). We map all existing Google internal news corpus with 100billion to- ratings, as well as our newly created ratings, to kens (V=3million). For training and testing we the same interval using the method from Köper relied on the ratings from Brysbaert et al. (2014), and Schulte im Walde (2016a). The mapping is Dividing the ratings into 20% test (7 990) and 80% performed using a continuous function, that maps training (31 964) for tuning hyper parameters we the ratings to an interval ranging from very ab- took 1 000 ratings from the training data. We stract (0) to very concrete (10). The common sub- kept the ratio between word classes. 
Evaluation is set contains 3 665 ratings. Figure 1 shows the re- done by comparing the new created ratings against sulting pairwise correlation between all four re- the test (gold) ratings using Spearman’s rank-order sources. Despite being created automatically, we correlation. We first reimplemented the algorithm see that the newly created ratings provide a high from Turney and Littman (2003) (T&L 03). In- correlation with both manually created collections spired by recent findings of Gupta et al. (2015) we (ρ for MRS=.91, Brysbaert=.93). In addition, the apply the hypothesis that distributional vectors im- vocabulary of our ratings is much larger than any 1http://nlp.stanford.edu/projects/ existing database. Thus this new collection might glove/ https://code.google.com/archive/p/ 2NN Implementation based on https://github. word2vec/ com/amten/NeuralNetwork
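As an illustration of the regression setup compared above, the following sketch fits a simple regressor from pre-trained word vectors to concreteness ratings and reports Spearman's ρ on held-out words. The embedding lookup vec and the rating dictionary ratings (e.g. from Brysbaert et al., 2014) are assumed to be available; the neural-network variant evaluated in the paper relies on a different implementation.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def eval_propagation(vec, ratings, test_size=0.2, seed=0):
    """Train a linear regressor word-vector -> abstractness rating and
    evaluate it by rank correlation on held-out words."""
    words = [w for w in ratings if w in vec]
    X = np.array([vec[w] for w in words])
    y = np.array([ratings[w] for w in words])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                              random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    rho, _ = spearmanr(model.predict(X_te), y_te)
    return model, rho
```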

25 method (NN). The technique allows us to propa- MRC BRYS TURNEY NEW 1 gate abstractness to every vector, thus we learn 0.98 MRC 1 0.92 0.85 0.91 abstractness ratings for all three constituents: verb, noun and the entire phrase. 0.96

0.94 For the metaphor classification experiment we BRYS 1 0.84 0.93 use the rating score and apply the Area Under 0.92 Curve (AUC) metric. AUC is a metric for bi- 0.9 nary classification. We assume that literal in-

0.88 stances gain higher scores (= more concrete) than TURNEY 1 0.88 metaphorical word pairs. AUC considers all pos- 0.86 sible thresholds to divide the data into literal and 0.84 metaphorical. In addition to the rating score we NEW 1 also show results based on cosine similarity and 0.82 feature combinations (Table 2). 0.8 Figure 1: Pairwise Spearman’s ρ on commonly covered subset. Red = high correlation. Feat. Name Type AUC - Random baseline .50 be useful, especially for further research which re- 1 V-NN cosine .75 3 quires large vocabulary coverage. 2 V-Phrase cosine .70 3 NN-Phrase cosine .68 2.2 Abstractness for Phrases 4 V rating .53 A potential advantage of our method is that ab- 5 NN rating .78 stractness can be learned for multi-word units as 6 Phrase rating .71 long as the representation of these units live in the same distributional vector space as the words re- Comb 1+2+3 cosine .75 quired for the supervised training. Comb 4+5+6 rating .74 In this section we explore if ratings propa- Comb all(1-6) mixed .80 gated to verb-noun phrases provide useful infor- Comb 1+5+6 best .84 mation for metaphor detection. As dataset we re- lied on the collection from Saif M. Mohammad Table 2: AUC Score single features and com- and Turney (2016), who annotated different senses binations. Classifying literal and metaphorical of WordNet verbs for metaphoricity (Fellbaum, phrases based on the Saif M. Mohammad and Tur- 1998). ney (2016) dataset. We used the same subset of verb–direct object and verb–subject relations as used in Shutova et al. (2016). As preprocessing step we concatenated As shown in Table 2, the rating of the verb alone verb-noun phrases by relying on dependency in- (AUC=.53) provides almost no useful information. formation based on a web corpus, the ENCOW14 The best performance based on a single feature is corpus (Schäfer and Bildhauer, 2012; Schäfer, the abstractness value of the noun (.78) followed 2015). We removed words and phrases that ap- by the cosine between verb and noun vector repre- peared less than 50 times in our corpus, thus our sentation (.75). The phrase rating alone performs selection covers 535 pairs, 238 of which were moderate (.71). However when combining fea- metaphorical and 297 literal. tures we found that the best combinations are ob- Given a verb-noun phrase, such as tained by integrating the phrase rating. In more de- stamp person, we obtained vector representations tail, combining noun and phrase rating (5+6) ob- using word2vec and the same hyper-parameters tains a AUC of (.80). When adding the cosine (1) that were used for the W2V300 embeddings we obtain the best score of (.84). For comparison, (Section 2.1.1) together with the best learning the verb plus noun ratings (4+5) obtains a lower 3Ratings available at http://www.ims. score (.72), this shows that the phrase rating pro- uni-stuttgart.de/data/en_abst_norms.html vides complementary and useful information.
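The AUC evaluation described above can be reproduced with a few lines of scikit-learn, treating the propagated concreteness score as a ranking signal for literalness (the assumption being that more concrete phrases are more likely literal). Here phrase_ratings is a hypothetical dict mapping verb-noun phrases to ratings and labels is a dict marking literal (1) versus metaphorical (0) instances.

```python
from sklearn.metrics import roc_auc_score

def auc_from_ratings(phrase_ratings, labels):
    """Threshold-free evaluation: higher (more concrete) ratings should rank
    literal phrases above metaphorical ones."""
    phrases = sorted(labels)                        # fixed iteration order
    y_true = [labels[p] for p in phrases]           # 1 = literal, 0 = metaphorical
    y_score = [phrase_ratings[p] for p in phrases]  # concreteness score
    return roc_auc_score(y_true, y_score)
```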

2.3 Sense-specific Abstractness Ratings

In this section we investigate whether automatically learned multi-sense abstractness ratings, that is, having different ratings per word sense, are potentially useful for the task of metaphor detection. Recent advances in word representation learning led to the development of algorithms for non-parametric and unsupervised multi-sense representation learning (Neelakantan et al., 2014; Liu et al., 2015; Li and Jurafsky, 2015; Bartunov et al., 2016). Using these techniques one can learn a different vector representation per word sense. Such representations can be combined with our abstractness learning method from Section 2.1.1. While in theory any multi-sense learning technique can be applied, we decided for the one introduced by Pelevina et al. (2016), as it performs sense learning after single senses have been learned. Starting from the public W2V300 representations we apply the multi-sense learning technique using the default settings and learn sense-specific word representations. Finally we propagate abstractness to every newly created sense representation by using the exact same model and training data as in Section 1. For a given word in a sentence we can now disambiguate the word sense by comparing its sense-specific vector representations to all context words. The context words are represented using the (single-sense) global representation. We always pick the sense representation that obtains the largest similarity, measured by cosine. The potential advantage of this method is that in a metaphor detection system we are now able to look up word-sense-specific abstractness ratings instead of globally obtained ratings.

For this experiment we use the VU Amsterdam Metaphor Corpus (VUA; Steen, 2010), focusing on verb metaphors. The collection contains 23,113 verb tokens in running text, annotated as being used literally or metaphorically. In addition we present results for the TroFi metaphor dataset (Birke and Sarkar, 2006) containing 50 verbs and 3,737 labeled sentences. We pre-processed both resources using Stanford CoreNLP (Manning et al., 2014) for lemmatization, part-of-speech tagging and dependency parsing.

We present results by applying ten-fold cross-validation over the entire data. For the VUA we additionally present results for the test data using the same training/test split as in Beigman Klebanov et al. (2016). Abstractness norms are implemented using the same five feature dimensions as used by Turney et al. (2011) plus dimensions for the subject and the object respectively; thus we rely on seven features, namely:

1. Rating of the verb's subject
2. Rating of the verb's object
3. Average rating of all nouns (excluding proper names)
4. Average rating of all proper names
5. Average rating of all verbs, excluding the target verb
6. Average rating of all adjectives
7. Average rating of all adverbs

For classification we used a balanced Logistic Regression classifier following the findings from Beigman Klebanov et al. (2015). While this default setup tries to generalize over unseen verbs by only looking at a verb's context, we further present results for a second setup that uses an eighth feature, namely the lemma of the target verb itself (+L). The purpose of the second system is to describe performance with respect to the state of the art (Beigman Klebanov et al., 2016), which among other features also uses the verb lemma.

    Feat.     TroFi (10F)   VUA (10F)   VUA (Test)
    1S        .72           .42         .44
    MS        .74           .44*        .46
    1S (+L)   .74           .61         .62
    MS (+L)   .75           .61         .62

Table 3: F-score (Metaphor), classifying literal and metaphorical verbs on the VUA and TroFi datasets. MS = multi-sense, 1S = single-sense.

As shown in Table 3, the multi-sense ratings constantly outperform the single-sense ratings in a direct comparison on all three sets. The difference in performance of single- and multi-sense ratings is statistically significant on the full VUA dataset, using the χ2 test and p < 0.05 (*). However, we also notice that the effect vanishes as soon as we combine the ratings with the lemma of the verb, which is especially the case for the VUA dataset, where the lemma increases the performance by a large margin.
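As an illustration of the sense-lookup step described above, the following is a minimal sketch (assumptions, not the authors' implementation): it picks the sense vector closest to the averaged context representation and returns that sense's abstractness rating. All vectors, sense identifiers and ratings are hypothetical toy values.

    # Minimal sketch: cosine-based sense disambiguation + rating lookup.
    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def sense_specific_rating(sense_vecs, sense_ratings, context_vecs):
        """sense_vecs: {sense_id: vector}, sense_ratings: {sense_id: float},
        context_vecs: (single-sense) global vectors of the context words."""
        context = np.mean(context_vecs, axis=0)   # global context representation
        best = max(sense_vecs, key=lambda s: cosine(sense_vecs[s], context))
        return best, sense_ratings[best]

    # Hypothetical 3-dimensional toy vectors for two senses of one verb.
    senses = {"grasp#1": np.array([0.9, 0.1, 0.0]), "grasp#2": np.array([0.0, 0.2, 0.9])}
    ratings = {"grasp#1": 0.81, "grasp#2": 0.35}
    context = [np.array([0.8, 0.2, 0.1]), np.array([0.7, 0.0, 0.2])]
    print(sense_specific_rating(senses, ratings, context))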

In contrast to related work, the system with the verb lemma feature (+L) can be considered state-of-the-art. When applying the same evaluation as Beigman Klebanov et al. (2016), namely a macro-average over the four genres of VUA, we obtain an average f-score of .60 by using only eight feature dimensions and abstractness ratings as external resource.4

4 SOA results from Klebanov also obtain a score of .60.

3 Conclusion

In this paper we compared supervised methods to propagate abstractness norms to words. We showed that a neural network outperforms other methods. In addition we showed that norms for multi-word phrases can be beneficial for type-based metaphor detection. Finally we showed how norms can be learned for sense representations and that sense-specific norms show a clear tendency to improve token-based verb metaphor detection.

Acknowledgments

The research was supported by the DFG Collaborative Research Centre SFB 732 (Maximilian Köper) and the DFG Heisenberg Fellowship SCHU-2580/1 (Sabine Schulte im Walde).

References

Lawrence W. Barsalou and Katja Wiemer-Hastings. 2005. Situating Abstract Concepts. In Grounding Cognition: The Role of Perception and Action in Memory, Language, and Thought, pages 129–163.
Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, and Dmitry Vetrov. 2016. Breaking Sticks and Ambiguities with Adaptive Skip-gram. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 130–138, Cadiz, Spain.
Beata Beigman Klebanov, Chee Wee Leong, and Michael Flor. 2015. Supervised Word-level Metaphor Detection: Experiments with Concreteness and Reweighting of Examples. In Proceedings of the Third Workshop on Metaphor in NLP, pages 11–20, Denver, Colorado.
Beata Beigman Klebanov, Chee Wee Leong, E. Dario Gutierrez, Ekaterina Shutova, and Michael Flor. 2016. Semantic Classifications for Detection of Verb Metaphors. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 101–106, Berlin, Germany.
Julia Birke and Anoop Sarkar. 2006. A Clustering Approach for the Nearly Unsupervised Recognition of Nonliteral Language. In Proceedings of the 11th Conference of the European Chapter of the ACL, pages 329–336, Trento, Italy.
Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness Ratings for 40 Thousand Generally Known English Word Lemmas. Behavior Research Methods, pages 904–911.
Max Coltheart. 1981. The MRC Psycholinguistic Database. The Quarterly Journal of Experimental Psychology Section A, 33(4):497–505.
Erik-Lân Do Dinh and Iryna Gurevych. 2016. Token-Level Metaphor Detection using Neural Networks. In Proceedings of the Fourth Workshop on Metaphor in NLP, pages 28–33, San Diego, CA, USA.
Jonathan Dunn. 2013. What Metaphor Identification Systems Can Tell Us About Metaphor-in-language. In Proceedings of the First Workshop on Metaphor in NLP, pages 1–10, Atlanta, Georgia.
Jonathan Dunn. 2015. Modeling Abstractness and Metaphoricity. Metaphor and Symbol, pages 259–289.
Christiane Fellbaum. 1998. A Semantic Network of English Verbs. In Christiane Fellbaum, editor, WordNet – An Electronic Lexical Database, Language, Speech, and Communication. MIT Press, Cambridge, MA.
Abhijeet Gupta, Gemma Boleda, Marco Baroni, and Sebastian Padó. 2015. Distributional Vectors Encode Referential Attributes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 12–21, Lisbon, Portugal.
Hessel Haagsma and Johannes Bjerva. 2016. Detecting Novel Metaphor using Selectional Preference Information. In Proceedings of the Fourth Workshop on Metaphor in NLP, pages 10–17, San Diego, CA, USA.
Ilana Heintz, Ryan Gabbard, Mahesh Srivastava, Dave Barner, Donald Black, Majorie Friedman, and Ralph Weischedel. 2013. Automatic Extraction of Linguistic Metaphors with LDA Topic Modeling. In Proceedings of the First Workshop on Metaphor in NLP, pages 58–66, Atlanta, Georgia.
Felix Hill, Anna Korhonen, and Christian Bentz. 2014. A Quantitative Empirical Analysis of the Abstract/Concrete Distinction. Cognitive Science, 38:162–177.
Richard Kammann and Lynn Streeter. 1971. Two Meanings of Word Abstractness. Journal of Verbal Learning and Verbal Behavior, 10(3):303–306.
Maximilian Köper and Sabine Schulte im Walde. 2016a. Automatically Generated Affective Norms of Abstractness, Arousal, Imageability and Valence for 350,000 German Lemmas. In Proceedings of the 10th International Conference on Language Resources and Evaluation, pages 2595–2598, Portoroz, Slovenia.
Maximilian Köper and Sabine Schulte im Walde. 2016b. Distinguishing Literal and Non-Literal Usage of German Particle Verbs. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 353–362, San Diego, California.
George Lakoff and Mark Johnson. 1980. Metaphors We Live By. University of Chicago Press.
Jiwei Li and Dan Jurafsky. 2015. Do Multi-Sense Embeddings Improve Natural Language Understanding? In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1722–1732, Lisbon, Portugal.
Linlin Li and Caroline Sporleder. 2009. Classifier Combination for Contextual Idiom Detection Without Labelled Data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 315–323.
Linlin Li and Caroline Sporleder. 2010. Using Gaussian Mixture Models to Detect Figurative Language in Context. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 297–300.
Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2015. Learning Context-Sensitive Word Embeddings with Neural Tensor Skip-Gram Model. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, pages 1284–1290, Buenos Aires, Argentina.
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.
James H. Martin. 1996. Computational Approaches to Figurative Language. Metaphor and Symbolic Activity.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119.
Saif M. Mohammad, Ekaterina Shutova, and Peter D. Turney. 2016. Metaphor as a Medium for Emotion: An Empirical Study. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics (*SEM), pages 23–33, Berlin, Germany.
Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1059–1069, Doha, Qatar.
Maria Pelevina, Nikolay Arefiev, Chris Biemann, and Alexander Panchenko. 2016. Making Sense of Word Embeddings. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 174–183, Berlin, Germany.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, Doha, Qatar.
Roland Schäfer and Felix Bildhauer. 2012. Building Large Corpora from the Web Using a New Efficient Tool Chain. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 486–493, Istanbul, Turkey.
Roland Schäfer. 2015. Processing and Querying Large Web Corpora with the COW14 Architecture. In Piotr Bański, Hanno Biber, Evelyn Breiteneder, Marc Kupietz, Harald Lüngen, and Andreas Witt, editors, Proceedings of the 3rd Workshop on Challenges in the Management of Large Corpora, pages 28–24, Lancaster.
Ekaterina Shutova and Simone Teufel. 2010. Metaphor Corpus Annotated for Source - Target Domain Mappings. In Proceedings of the International Conference on Language Resources and Evaluation, pages 3225–3261, Valletta, Malta.
Ekaterina Shutova, Lin Sun, and Anna Korhonen. 2010. Metaphor Identification Using Verb and Noun Clustering. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 1002–1010.
Ekaterina Shutova, Douwe Kiela, and Jean Maillard. 2016. Black Holes and White Rabbits: Metaphor Identification with Visual Features. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 160–170.
G. Steen. 2010. A Method for Linguistic Metaphor Identification: From MIP to MIPVU. Converging Evidence in Language and Communication Research. John Benjamins Publishing Company.
James H. Steiger. 1980. Tests for Comparing Elements of a Correlation Matrix. Psychological Bulletin.
Yulia Tsvetkov, Elena Mukomel, and Anatole Gershman. 2013. Cross-lingual Metaphor Detection Using Common Semantic Features. In Proceedings of the Workshop on Metaphor in NLP, pages 45–51, Atlanta, USA.
Yulia Tsvetkov, Leonid Boytsov, Anatole Gershman, Eric Nyberg, and Chris Dyer. 2014. Metaphor Detection with Cross-Lingual Model Transfer. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 248–258.
Peter D. Turney and Michael L. Littman. 2003. Measuring Praise and Criticism: Inference of Semantic Orientation from Association. ACM Transactions on Information Systems, pages 315–346.
Peter Turney, Yair Neuman, Dan Assaf, and Yohai Cohen. 2011. Literal and Metaphorical Sense Identification through Concrete and Abstract Context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 680–690, Edinburgh, UK.
Gabriella Vigliocco, Stavroula-Thaleia Kousta, Pasquale Anthony Della Rosa, David P. Vinson, Marco Tettamanti, Joseph T. Devlin, and Stefano F. Cappa. 2014. The Neural Representation of Abstract Words: The Role of Emotion. Cerebral Cortex, pages 1767–1777.

Improving Clinical Diagnosis Inference through Integration of Structured and Unstructured Knowledge

Yuan Ling, Yuan An
College of Computing & Informatics, Drexel University, Philadelphia, PA, USA
{yl638,ya45}@drexel.edu

Sadid A. Hasan
Artificial Intelligence Laboratory, Philips Research North America, Cambridge, MA, USA
[email protected]

Abstract

This paper presents a novel approach to the task of automatically inferring the most probable diagnosis from a given clinical narrative. Structured Knowledge Bases (KBs) can be useful for such complex tasks but are not sufficient. Hence, we leverage a vast amount of unstructured free text to integrate with structured KBs. The key innovative ideas include building a concept graph from both structured and unstructured knowledge sources and ranking the diagnosis concepts using the enhanced word embedding vectors learned from integrated sources. Experiments on the TREC CDS and HumanDx datasets showed that our methods improved the results of clinical diagnosis inference.

1 Introduction and Related Work

Clinical diagnosis inference is the problem of automatically inferring the most probable diagnosis from a given clinical narrative. Many health-related information retrieval tasks can greatly benefit from the accurate results of clinical diagnosis inference. For example, in the recent Text REtrieval Conference (TREC) Clinical Decision Support track (CDS1), diagnosis inference from medical narratives has improved the accuracy of retrieving relevant biomedical articles (Roberts et al., 2015; Hasan et al., 2015; Goodwin and Harabagiu, 2016).

Solutions to the clinical diagnostic inferencing problem require a significant amount of input from domain experts and a variety of sources (Ferrucci et al., 2013; Lally et al., 2014). To address such complex inference tasks, researchers (Yao and Van Durme, 2014; Bao et al., 2014; Dong et al., 2015) have utilized structured KBs that store relevant information about various entity types and relation triples. Many large-scale KBs have been constructed over the years, such as WordNet (Miller, 1995), Yago (Suchanek et al., 2007), Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007), NELL (Carlson et al., 2010), the UMLS Metathesaurus (Bodenreider, 2004), etc. However, using KBs alone for inference tasks (Bordes et al., 2014) has certain limitations such as incompleteness of knowledge, sparsity, and a fixed schema (Socher et al., 2013; West et al., 2014).

On the other hand, unstructured textual resources such as free texts from Wikipedia generally contain more information than structured KBs. As supplementary knowledge to mitigate the limitations of structured KBs, unstructured text combined with structured KBs provides improved results for related tasks, for example, clinical question answering (Miller et al., 2016). For processing text, word embedding models (e.g. the skip-gram model (Mikolov et al., 2013b; Mikolov et al., 2013a)) can efficiently discover and represent the underlying patterns of unstructured text. Word embedding models represent words and their relationships as continuous vectors. To improve word embedding models, previous works have also successfully leveraged structured KBs (Bordes et al., 2011; Weston et al., 2013; Wang et al., 2014; Zhou et al., 2015; Liu et al., 2015).

Motivated by the superior power of the integration of structured KBs and unstructured free text, we propose a novel approach to clinical diagnosis inference. The novelty lies in the ways of integrating structured KBs with unstructured text. Experiments showed that our methods improved clinical diagnosis inference from different aspects (Section 5.4). Previous work on diagnosis inference from clinical narratives either formulates the problem as a medical literature retrieval task (Zheng and Wan, 2016; Balaneshin-kordan and Kotov, 2016) or as a multiclass multilabel classification problem in a supervised setting (Hasan et al., 2016; Prakash et al., 2016). To the best of our knowledge, there is no work on diagnosis inference from clinical narratives conducted in an unsupervised way. Thus, we build such baselines for this task.

1 http://www.trec-cds.org/


2 Overview of the Approach

Our approach includes four steps in general: 1) extracting source concepts, q, from clinical narratives, 2) iteratively identifying corresponding evidence concepts, a, from KBs and unstructured text, 3) representing both source and evidence concepts in a weighted graph via a regularizer-enhanced skip-gram model, and 4) ranking the relevant evidence concepts (i.e. diagnoses) based on their association with the source concepts, S(q, a) (computed by a weighted dot product of two vectors), to generate the final output. Figure 1 shows the overview using an illustrative example.

Given source concepts as input, we build an edge-weighted graph representing the connections among all the concepts by iteratively retrieving evidence concepts from both KBs and unstructured text. The weights of the edges represent the strengths of the relationships between concepts. Each concept is represented as a word embedding vector. We combine all the source concept vectors into a single vector representing a clinical scenario. Source concepts are differentiated according to the weighting scheme in Section 4.2. Evidence concepts are also represented as vectors and ranked according to their relevance to the source concepts. For each clinical case, we find the most probable diagnoses from the top-ranked evidence concepts.

3 Knowledge Sources of Evidence Concepts

In this study, we use the UMLS Metathesaurus (Bodenreider, 2004) and Freebase (Bollacker et al., 2008) as the structured KBs. Both KBs provide semantic relation triples in the format <concept, relation, concept>. We select UMLS relation types that are relevant to the problem of clinical diagnosis inference. These types include disease-treatment, disease-prevention, disease-finding, sign or symptom, causes, etc. Freebase contains a large number of triples from multiple domains. We select 61,243 triples from Freebase that are classified as medicine relation types. There are 19 such relation types in total. Most of them fall under the "medicine.disease" category.

For unstructured text, we use articles from the Wikipedia and MayoClinic corpora as the supplementary knowledge source. Important clinical concepts mentioned in a Wikipedia/MayoClinic page can serve as a critical clue to a clinical diagnosis. For example, in Figure 1, we see that "dyspnea", "shortness of breath", "tachypnea", etc. are the related signs and symptoms of the "Pulmonary Embolism" diagnosis. We select 37,245 Wikipedia pages under the clinical diseases and medicine category in this study. Most of the page titles represent disease names. In addition, the MayoClinic2 disease corpus contains 1,117 pages, which include sections on Symptoms, Causes, Risk Factors, Treatments and Drugs, Prevention, etc.

2 http://www.mayoclinic.org/diseases-conditions

4 Methodology

4.1 Building Weighted Concept Graph

Both the source and the evidence concepts are represented as nodes in a graph. A clinical case is represented as a set of source concept nodes: q = {q1, q2, ...}. We build a weighted concept graph from the source concepts using Algorithm 1.

    Algorithm 1: Build Concept Graph
    Input:  source concept nodes q
    Output: graph G = (V, E)
    1   S = q and V = q;
    2   while S ≠ ∅ do
    3     for each qi in S do
    4       if distance(qi, q) > 2 then
    5         continue;
    6       end
    7       if triple <qi, r, aj> in KBs then
    8         wij = 1;
    9         e = (qi, aj) and e.value = wij;
    10        insert aj to V and S;
    11        insert e to E;
    12      end
    13      use qi as query, search in the unstructured text corpora, get result R;
    14      for each page-similarity pair (p, sij) in R do
    15        e = (qi, title(p)) and e.value = sij;
    16        insert title(p) to V and S;
    17        insert e to E;
    18      end
    19      remove qi from S;
    20    end
    21  end

Two kinds of evidence concept nodes are added to the graph: 1) the entities from the KBs (UMLS and Freebase) (steps 7-12 in Algorithm 1), and 2) the entities from unstructured text pages (steps 13-18).
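The following is a minimal sketch of Algorithm 1 in Python, with simplified stand-ins for the KB lookup and the unstructured-text retrieval; kb_triples and search_pages are hypothetical helpers, not part of the authors' system.

    # Minimal sketch of the concept-graph construction (Algorithm 1).
    import networkx as nx

    def build_concept_graph(source_concepts, kb_triples, search_pages, max_dist=2):
        g = nx.Graph()
        g.add_nodes_from(source_concepts)
        frontier = list(source_concepts)
        dist = {c: 0 for c in source_concepts}
        while frontier:
            qi = frontier.pop()
            if dist[qi] >= max_dist:                 # only expand nodes within distance 2
                continue
            # 1) evidence concepts from KB triples <qi, r, aj>: weight 1
            for aj in kb_triples.get(qi, []):
                g.add_edge(qi, aj, weight=1.0)
                if aj not in dist:
                    dist[aj] = dist[qi] + 1
                    frontier.append(aj)
            # 2) evidence concepts from text pages: weight = query-document similarity
            for title, sim in search_pages(qi):
                g.add_edge(qi, title, weight=sim)
                if title not in dist:
                    dist[title] = dist[qi] + 1
                    frontier.append(title)
        return g

    # Hypothetical toy inputs.
    triples = {"dyspnea": ["pulmonary embolism"]}
    pages = lambda q: [("Pulmonary embolism", 0.42)] if q == "dyspnea" else []
    print(build_concept_graph({"dyspnea", "tachypnea"}, triples, pages).edges(data=True))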

Figure 1: Overview of our system. (The figure itself is not reproduced; only the caption is kept.)

If there exists a triple <qi, r, aj> in the KBs, where r refers to a relation, an edge is used to connect node qi and node aj. wij represents the weight for that edge, and we let wij = 1 if the corresponding triple occurs at least once. Due to the incompleteness of the KBs, there may exist multiple missing connections between a potential evidence concept aj and a source concept qi. Unstructured knowledge from Wikipedia and MayoClinic can replenish these missing connections. For each page p, the page title represents an evidence concept aj. We use each source concept qi as a query and page p as a document, and then calculate a query-document similarity to measure the edge weight wij between node aj and node qi. We only take as evidence concepts all nodes connected to source concepts in a distance of at most 2 (steps 4-6).

4.2 Representing a Clinical Case

We combine the source concepts q and get a single vector v_q to represent the clinical case narrative. The source concepts from narratives for clinical diagnosis inference should be differentiated. Some source concepts are major symptoms for a diagnosis, while others are less critical. These major source concepts should be identified and given higher weight values. We develop two kinds of weighting schemes for the differential expression of the source concepts. The clinical case is represented as

    v_q = \frac{1}{N} \sum_{q_i \in q} \gamma_i v_{q_i},

where N is the total number of source concepts and v_{q_i} is the vector representation of one source concept q_i.

(1) A longer concept usually conveys more information (e.g. malar rash vs. rash), so it should be given more weight. We define this weight value as \gamma_1 = \#Words\ in\ Concept.

(2) For some commonly seen concepts (e.g. fever), there are usually more edges connected to them. Sometimes a common concept is less important for diagnosis inference, while some unique concepts are critical to infer a specific diagnosis. We define this weight value for each concept as \gamma_2 = 1 / \#Connected\ Edges. A higher weight value means the source concept is more unique.

4.3 Inferring Concepts for Diagnosis

Extracting Potential Evidence Concepts: From the source concept nodes q, we find their connected concepts in the graph as evidence concepts. Traversing all edges in a graph is computationally expensive and often unnecessary for finding potential diagnoses. The solution is to use a subgraph. We follow the idea proposed in Bordes et al. (2014). The evidence concepts are defined as all nodes connected to source concepts in a distance of at most 2.

Ranking Evidence Concepts: We rank each evidence concept a' according to its matching score S(q, a') to the source concepts. The matching score S(q, a') is a dot product of the embedding representations of the evidence concept a' and the source concepts q, taking the edge weights w_{ij} into consideration: S(q, a') = w_{ij} \, v_{a'} \cdot v_q, where v_{a'} and v_q are the embedding representations for a' and q. The embedding matrix E \in \mathbb{R}^{k \times N} for concepts is trained using embedding models (Section 4.4). N is the total number of concepts and k is the predefined dimension of the embedding vectors. Each concept in the graph can find a k-dimensional vector representation in E. For a set of source concepts q and evidence concepts A(q), the top-ranked evidence concept can be computed as:

    a' = \arg\max_{a' \in A(q)} S(q, a')    (1)
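The following is a minimal sketch (assumed shapes and toy values, not the authors' code) of the weighted clinical-case vector v_q and the ranking score S(q, a') described above.

    # Minimal sketch: weighted case vector and evidence-concept ranking.
    import numpy as np

    def case_vector(source_vecs, gammas):
        """v_q = (1/N) * sum_i gamma_i * v_{q_i}; gammas follow either weighting scheme."""
        vecs = np.array(source_vecs)
        g = np.array(gammas, dtype=float).reshape(-1, 1)
        return (g * vecs).sum(axis=0) / len(vecs)

    def rank_evidence(v_q, evidence, edge_weights):
        """evidence: {concept: vector}; edge_weights: {concept: w_ij from the graph}.
        S(q, a') = w_ij * (v_a' . v_q), highest score first."""
        scores = {a: edge_weights[a] * float(np.dot(v, v_q)) for a, v in evidence.items()}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # Hypothetical toy example with 3-dimensional embeddings.
    sources = [np.array([0.2, 0.5, 0.1]), np.array([0.4, 0.1, 0.3])]
    gamma1 = [2, 1]                          # e.g. gamma_1 = number of words in the concept
    vq = case_vector(sources, gamma1)
    cands = {"pulmonary embolism": np.array([0.3, 0.4, 0.2]), "asthma": np.array([0.1, 0.1, 0.0])}
    print(rank_evidence(vq, cands, {"pulmonary embolism": 0.42, "asthma": 0.10}))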

4.4 Word Embedding Models

We use the skip-gram model as the basic model. The skip-gram model predicts the surrounding words w_{t-c}, ..., w_{t-1}, w_{t+1}, ..., w_{t+c} given the current center word w_t. We further enhance the skip-gram model by adding a graph regularizer. Given a sequence of training words w_1, w_2, ..., w_T, the objective function is:

    J = \max \frac{1}{T} \sum_{t=1}^{T} \Big[ (1-\lambda) \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t) \;-\; \lambda \sum_{r=1}^{R} D(v_t, v_r) \Big]    (2)

where v_t and v_r are the representation vectors for word w_t and word w_r, and λ is a parameter to balance the graph regularizer and the original objective. Suppose word w_t is mentioned as having relations with a set of other words w_r, r ∈ {1, ..., R}, in the KBs. The graph regularizer λ Σ_{r=1}^{R} D(v_t, v_r) integrates extra knowledge about semantic relationships among words within the graph structure. D(v_t, v_r) represents the distance between v_t and v_r. In our experiments, the distance between two concepts is measured using the KL-Divergence; D(v_t, v_r) can be calculated using any other type of distance metric. By minimizing D(v_t, v_r), we expect that if two concepts have a close relation in the KBs, their vector representations will also be close to each other.

5 Experiments

5.1 Datasets for Clinical Diagnosis Inference

Our first dataset is from the 2015 TREC CDS track (Roberts et al., 2015). It contains 30 topics, where each topic is a medical case narrative that describes a patient scenario. Each case is associated with the ground truth diagnosis. We use MetaMap3 to extract the source concepts from a narrative and then manually refine them to remove redundancy.

Our second dataset is curated from HumanDx4, a project to foster integrating efforts to map health problems to their possible diagnoses. We curate diagnosis-findings relationships from HumanDx and create a dataset with 459 diagnosis-findings entries. Note that the findings from this dataset are used as the given source concepts for a clinical scenario.

3 https://metamap.nlm.nih.gov/
4 https://www.humandx.org/

5.2 Training Data for Word Embeddings

We curate a biomedical corpus of around 5M sentences from two data sources: PubMed Central5 from the 2015 TREC CDS snapshot6 and Wikipedia articles under the "Clinical Medicine" category7. After sentence splitting, word tokenization, and stop word removal, we train our word embedding models on this corpus. The UMLS Metathesaurus and Freebase are used as KBs to train the graph regularizer. We use stochastic gradient descent (SGD) to maximize the objective function and set the parameters empirically.

5 https://www.ncbi.nlm.nih.gov/pmc/
6 http://www.trec-cds.org/2015.html#documents
7 https://en.wikipedia.org/wiki/Category:Clinical_medicine

5.3 Results

We use Mean Reciprocal Rank (MRR) and Average Precision at 5 (P@5) to evaluate our models. MRR is a statistical measure to evaluate a process that generates a list of possible responses to a sample of queries, ordered by probability of correctness. Average P@5 is calculated as the precision at the top 5 predicted results divided by the total number of topics. Since our dataset only has one correct diagnosis for each topic, all results have low Average P@5 scores.

Table 1 presents the results for our experiments. We report two baselines: Skip-gram refers to the basic word embedding model, and Skip-gram* refers to the graph-regularized model using KBs. We also show the results for using different unstructured knowledge sources and different weighting schemes. We can see that the best scores are obtained by the graph-regularized models with both unstructured knowledge sources and the variable weighting schemes (Section 4.2).
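The following is an illustrative sketch only (not the authors' training code) of how the two terms of the objective in Eq. (2) can be combined for a single center word; softmax-normalizing the vectors before the KL distance is an assumption made here to keep them valid distributions, and the two-entry logits vector is a toy stand-in for a full vocabulary.

    # Illustrative sketch: one evaluation of the graph-regularized objective (Eq. 2).
    import numpy as np

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    def kl(p, q):
        return float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))

    def regularized_loss(v_center, v_context, v_related, lam=0.3):
        # skip-gram term: negative log-probability of one context word (toy 2-entry logits)
        logits = np.array([np.dot(v_center, v_context), 0.0])
        skipgram_term = -np.log(softmax(logits)[0])
        # graph regularizer: summed KL distances to KB-related word vectors
        reg_term = sum(kl(softmax(v_center), softmax(vr)) for vr in v_related)
        return (1 - lam) * skipgram_term + lam * reg_term

    v_t = np.array([0.5, 0.1, 0.2])
    print(regularized_loss(v_t, np.array([0.4, 0.2, 0.1]), [np.array([0.45, 0.1, 0.25])]))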

                                               TREC CDS               HumanDx
    Method                                   MRR     Avg. P@5     MRR     Avg. P@5
    Baselines
      Skip-gram                              21.66     8.88       18.56     5.08
      Skip-gram*                             22.60     8.88       18.63     5.15
    Skip-gram* + different unstructured text datasets
      Wikipedia                              26.01     8.96       19.42     5.76
      MayoClinic                             32.64     9.52       19.46     5.80
      Both                                   32.29     9.60       19.12     5.76
    Skip-gram* + both text datasets + different weights
      γ1                                     32.22    10.40       21.09     5.88
      γ2                                     32.77    12.00       20.86     5.93

Table 1: Evaluation results.

5.4 Discussion

Unstructured text is a critical supplement: We analyze the source concepts and the corresponding evidence concepts for the CDS topics, and investigate the origin of the correct diagnoses. 70% of the correct diagnoses can be inferred from Wikipedia, 60% of the correct diagnoses from MayoClinic, 56% of the correct diagnoses from Freebase, and only 7% from UMLS. Hence, Wikipedia and MayoClinic are very important sources for finding the correct diagnoses.

Source concepts should be differentiated: In clinical narratives, some concepts are more critical than others for clinical diagnosis inference. We developed two weighting schemes to assign higher weight values to more important concepts. The results in Table 1 show that differentiating the source concepts with different weight values has a large impact on the model performance.

The enhanced skip-gram is better: We propose the enhanced skip-gram model by using a graph regularizer to integrate the semantic relationships among concepts from the KBs. Experimental results show that diagnosis inference is improved by using word embedding representations from the enhanced skip-gram model.

6 Conclusion

We proposed a novel approach to the task of clinical diagnosis inference from clinical narratives. Our method overcomes the limitations of structured KBs by making use of integrated structured and unstructured knowledge. Experimental results showed that the enhanced skip-gram model with differential expression of source concepts improved the performance on two benchmark datasets.

References

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer.
Saeid Balaneshin-kordan and Alexander Kotov. 2016. Optimization method for weighting explicit and latent concepts in clinical decision support queries. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, pages 241–250. ACM.
Junwei Bao, Nan Duan, Ming Zhou, and Tiejun Zhao. 2014. Knowledge-based question answering as machine translation. Cell, 2(6).
Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.
Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM.
Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. 2011. Learning structured embeddings of knowledge bases. In Conference on Artificial Intelligence.
Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. arXiv preprint arXiv:1406.3676.
Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr, and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In AAAI, volume 5, page 3.
Li Dong, Furu Wei, Ming Zhou, and Ke Xu. 2015. Question answering over Freebase with multi-column convolutional neural networks. In Proceedings of the Association for Computational Linguistics, pages 260–269.
David Ferrucci, Anthony Levas, Sugato Bagchi, David Gondek, and Erik T. Mueller. 2013. Watson: beyond Jeopardy! Artificial Intelligence, 199:93–105.
Travis R. Goodwin and Sanda M. Harabagiu. 2016. Medical question answering for clinical decision support. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 297–306. ACM.
Sadid A. Hasan, Yuan Ling, Joey Liu, and Oladimeji Farri. 2015. Using neural embeddings for diagnostic inferencing in clinical question answering. In TREC.
Sadid A. Hasan, Siyuan Zhao, Vivek Datla, Joey Liu, Kathy Lee, Ashequl Qadir, Aaditya Prakash, and Oladimeji Farri. 2016. Clinical question answering using key-value memory networks and knowledge graph. In TREC.
Adam Lally, Sugato Bachi, Michael A. Barborak, David W. Buchanan, Jennifer Chu-Carroll, David A. Ferrucci, Michael R. Glass, Aditya Kalyanpur, Erik T. Mueller, J. William Murdock, et al. 2014. WatsonPaths: scenario-based question answering and inference over unstructured information. Yorktown Heights: IBM Research.
Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Yu Hu. 2015. Learning semantic word embeddings based on ordinal knowledge constraints. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 1501–1511.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. CoRR, abs/1606.03126.
George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.
Aaditya Prakash, Siyuan Zhao, Sadid A. Hasan, Vivek Datla, Kathy Lee, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2016. Condensed memory networks for clinical diagnostic inferencing. arXiv preprint arXiv:1612.01848.
Kirk Roberts, Matthew S. Simpson, Ellen Voorhees, and William R. Hersh. 2015. Overview of the TREC 2015 clinical decision support track. In TREC.
Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926–934.
Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706. ACM.
Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph and text jointly embedding. In EMNLP, pages 1591–1601. Citeseer.
Robert West, Evgeniy Gabrilovich, Kevin Murphy, Shaohua Sun, Rahul Gupta, and Dekang Lin. 2014. Knowledge base completion via search-based question answering. In Proceedings of the 23rd International Conference on World Wide Web, pages 515–526. ACM.
Jason Weston, Antoine Bordes, Oksana Yakhnenko, and Nicolas Usunier. 2013. Connecting language and knowledge bases with embedding models for relation extraction. arXiv preprint arXiv:1307.7973.
Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In ACL (1), pages 956–966. Citeseer.
Ziwei Zheng and Xiaojun Wan. 2016. Graph-based multi-modality learning for clinical decision support. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 1945–1948. ACM.
Guangyou Zhou, Tingting He, Jun Zhao, and Po Hu. 2015. Learning continuous word embedding with metadata for question retrieval in community question answering. In Proceedings of ACL, pages 250–259.

Classifying Lexical-semantic Relationships by Exploiting Sense/Concept Representations

Kentaro Kanada, Tetsunori Kobayashi and Yoshihiko Hayashi
Waseda University, Japan
[email protected]  [email protected]  [email protected]

Abstract

This paper proposes a method for classifying the type of lexical-semantic relation between a given pair of words. Given an inventory of target relationships, this task can be seen as a multi-class classification problem. We train a supervised classifier by assuming that a specific type of lexical-semantic relation between a pair of words would be signaled by a carefully designed set of relation-specific similarities between the words. These similarities are computed by exploiting "sense representations" (sense/concept embeddings). The experimental results show that the proposed method clearly outperforms an existing state-of-the-art method that does not utilize sense/concept embeddings, thereby demonstrating the effectiveness of the sense representations.

1 Introduction

Given a pair of words, classifying the type of lexical-semantic relation that could hold between them may have a range of applications. In particular, discovering typed lexical-semantic relation instances is vital in building a new lexical-semantic resource, as well as for populating an existing lexical-semantic resource. As argued in (Boyd-Graber et al., 2006), even Princeton WordNet (henceforth PWN) (Miller, 1995) is noted for its sparsity of useful internal lexical-semantic relations. A distributional thesaurus (Weeds et al., 2014), usually built with an automatic method such as that described in (Rychlý and Kilgarriff, 2007), often comprises a disorganized semantic network internally, where a variety of lexical-semantic relations are incorporated without proper relation labels attached. These issues could be addressed if an accurate method for classifying the type of lexical-semantic relation is available.

A number of research studies on the classification of lexical-semantic relationships have been conducted. Among them, Necşulescu et al. (2015) recently presented two classification methods that utilize word-level feature representations including word embedding vectors. Although the reported results are superior to the compared systems, neither of the proposed methods exploited "sense representations," which are described as the fine-grained representations of word senses, concepts, and entities in the description of this workshop1.

Motivated by the above-described issues and previous work, this paper proposes a supervised classification method that exploits sense representations, and discusses their utilities in the lexical relation classification task. The major rationales behind the proposed method are: (1) a specific type of lexical-semantic relation between a pair of words would be indicated by a carefully designed set of relation-specific similarities associated with the words; and (2) the similarities could be effectively computed by exploiting sense representations.

More specifically, for each word in the pair, we first collect relevant sets of sense/concept nodes (node sets) from an existing lexical-semantic resource (PWN), and then compute similarities for some designated pairs of node sets, where each node is represented by an embedding vector depending on its type (sense/concept). In terms of its design, each node set pair is constructed such that it is associated with a specific type of lexical-semantic relation. The resulting array of similarities, along with the underlying word/sense/concept embedding vectors, is finally fed into the classifier as features.

1 https://sites.google.com/site/senseworkshop2017/background

The empirical results that use the BLESS dataset (Baroni and Lenci, 2011) demonstrate that our method clearly outperformed existing state-of-the-art methods (Necşulescu et al., 2015) that did not employ sense/concept embeddings, confirming that properly combining the similarity features with the underlying semantic/conceptual-level embeddings is indeed effective. These results in turn highlight the utility of "the sense representations" (the sense/concept embeddings) created by the existing system referred to as AutoExtend (Rothe and Schütze, 2015).

The remainder of the paper first reviews related work (section 2), and then presents our approach (section 3). As our experiments (section 4) utilize the BLESS dataset, the experimental results are directly compared with those of (Necşulescu et al., 2015) (section 5). Although our methods were proved to be superior through the experiments, our operational requirement (sense/concept embeddings should be created from the underlying lexical-semantic resource) could be problematic especially when having to process unknown words. We conclude the present paper by discussing future work to address this issue (section 6).

2 Related work

A lexical-semantic relationship is a fundamental relationship that plays an important role in many NLP applications. A number of research efforts have been devoted to developing an automated and accurate method to type the relationship between an arbitrary pair of words. Most of these studies (Fu et al., 2014; Kiela et al., 2015; Shwartz et al., 2016), however, concentrated on the hypernymy relation, since it is the most fundamental relationship that forms the core taxonomic structure in a lexical-semantic resource. In comparison, fewer studies considered a broader range of lexical-semantic relations, e.g., (Necşulescu et al., 2015) and our present work.

Lenci and Benotto (2012), among the hypernymy-centered researches, compared existing directional similarity measures (Kotlerman et al., 2010) to identify hypernyms, and proposed a new measure that slightly modified an existing measure. The rationale behind their work is: as hypernymy is a prominent asymmetric semantic relation, it might be detected by the higher similarity score yielded by an asymmetric similarity measure. Their idea of exploiting a specific type of similarity to detect a specific type of lexical-semantic relationship is highly feasible.

Recently, distributional and distributed word representations (word embeddings) have been widely utilized, partly because the offset vector simply brought about by vector subtraction over word embeddings can capture some relational aspects including a lexical-semantic relationship. Given these useful resources, Weeds et al. (2014) presented a supervised classification approach that employs a pair of distributional vectors for a given word pair as the feature, arguing that concatenation and subtraction were almost equally effective vector operations. Similar lines of work were presented by (Necşulescu et al., 2015) and (Vylomova et al., 2016): the former suggested concatenation might be slightly superior to subtraction, whereas the latter especially highlighted the subtraction. Here it should be noted that Necşulescu et al. (2015) employed two kinds of vectors: one is a CBOW-based vector (Mikolov et al., 2013b), and the other involves word embeddings with a dependency-based skip-gram model (Levy and Goldberg, 2014).

The present work exploits semantic/conceptual-level embeddings, which were actually derived by applying the AutoExtend (Rothe and Schütze, 2015) system. Among the recent proposals for deriving semantic/conceptual-level embeddings (Huang et al., 2012; Pennington et al., 2014; Neelakantan et al., 2014; Iacobacci et al., 2015), we adopt the AutoExtend system, since it elegantly exploits the network structure provided by an underlying semantic resource, and naturally consumes existing word embeddings. More importantly, the underlying word embeddings are directly comparable with the derived sense representations. In the present work, we applied the AutoExtend system to the Word2Vec CBOW embeddings (Mikolov et al., 2013b) by referring to PWN version 3.0 as the underlying lexical-semantic resource. As far as the authors know, AutoExtend-derived embeddings have so far been evaluated in the tasks of similarity measurement and word sense disambiguation: they are yet to be applied to a semantic relation classification task.

There are a few datasets (Baroni and Lenci, 2011; Santus et al., 2015) available that were prepared for the evaluation of lexical-semantic relation classification tasks. We utilized the BLESS dataset (Baroni and Lenci, 2011) in order to directly compare our proposed method with the related methods given in (Necşulescu et al., 2015).

3 Proposed method

We adopt a supervised learning approach for classifying the type of a semantic relationship between a pair of words (w1, w2), expecting that the plausibility of a specific lexical-semantic relationship can be measured by the similarities between the senses/concepts associated with each of the words.

Figure 1 exemplifies the fundamental rationale behind the proposed method. We assume that the plausibility of the hypernymy relation between "alligator" (w1) and "reptile" (w2) can be mainly measured by the similarity between the set of hypernym concepts of "alligator" and the set of concepts of "reptile". Based on this assumption, we calculate the similarities by the following steps. Recall that these similarities are assumed to measure the plausibilities of relationships that could hold between a given word pair.

1. Collect pre-defined types of node sets for each word (five types; detailed in section 3.1).
2. Build some useful pairs of node sets by considering the possible relationships assumed to be held between the words (7 pair types; detailed in section 3.2).
3. Calculate the similarities for each node set pair by three types of calculation methods (detailed in section 3.3).

In total we calculate 21 (7 pairs × 3 methods) similarities per word pair along with the cosine similarity between the word embeddings. We use these similarities and the vector pairs that yielded the similarities as features.

Figure 1: Creating node sets and node set pairs (in the hypernymy relation). (The figure, which shows the sense and concept nodes of "alligator" and "reptile" such as alligator.n.01, alligator.n.02, crocodilian_reptile.n.01 and reptile.n.01, is not reproduced here.)

3.1 Collecting node sets for each word

By consulting PWN, we collect the following five types of node sets for each word w. These node set types are selected so as to characterize the relationships in the target inventory, which is detailed in section 4.1.

- L_w: a set of senses that a word w has
- S_w: a set of concepts each denoted by a member of L_w
- S_w^hyper: a set of concepts whose member is directly linked from a member of S_w by the hypernymy relation
- S_w^attri: a set of concepts whose member is directly linked from a member of S_w by the attribute relation
- S_w^mero: a set of concepts whose member is directly linked from a member of S_w by the meronymy relation

3.2 Building node set pairs

Given a pair of words (w1, w2), we build seven types of node set pairs as shown in Table 1. Each row in the table defines the combination of node sets and presents the associated mnemonic.

    Node set pair                  Mnemonic
    L_w1 - L_w2                    sense
    S_w1 - S_w2                    concept
    S_w1^hyper - S_w2              hyper
    S_w1^hyper - S_w2^hyper        coord
    S_w1 - S_w2^attri              attri_1
    S_w1^attri - S_w2              attri_2
    S_w1^mero - S_w2               mero

Table 1: Node set pairs built for (w1, w2).

These types of node set pairs are defined expecting that:

- sense, concept: capture semantic similarity/relatedness between the words;
- hyper: captures the hypernymy relation between the words;
- coord: dictates if the words share a common hypernym;
- attri_1, attri_2: dictate if w1 describes some aspect of w2 (attri_1) or vice versa (attri_2);
- mero: captures the meronymy relation between the words.

Note that the italicized words indicate lexical relationships often used in the linguistic literature.

3.3 Similarity calculation

In a pair of node sets, each node set could have a different number of elements, meaning that we cannot apply element-wise computation (e.g., cosine) for measuring the similarity between the node sets. We thus propose the following three similarity calculation methods and compare them in the experiments. In the following formulations: c indicates a certain node set pair type defined in Table 1; (X_w1, X_w2) is the node set pair for (w1, w2) specified by c; and sim(x1, x2) is the cosine similarity between x1 and x2.

sim_max method:

    sim_max^c(w1, w2) = \max_{x_1 \in X_{w_1},\, x_2 \in X_{w_2}} sim(\vec{x_1}, \vec{x_2})    (1)

As the formula defines, this method selects a combination of the node sets that yields the maximum similarity, implying that it achieves a disambiguation functionality. The vector pair from the most similar node sets (\vec{x_1}, \vec{x_2}) is also used as a feature. The actual usage of this pair in the experiments is detailed in section 4.2.

sim_sum method:

    sim_sum^c(w1, w2) = sim\Big(\sum_{x_1 \in X_{w_1}} \vec{x_1},\ \sum_{x_2 \in X_{w_2}} \vec{x_2}\Big)    (2)

As defined by the formula, this method first makes a holistic meaning representation by summing all embeddings of the nodes contained in each node set. We devised this method with the expectation that it could dictate semantic relatedness rather than semantic similarity (Budanitsky and Hirst, 2006). The pair of the summed embeddings is also used as a feature.

sim_med method:

    sim_med^c(w1, w2) = \mathrm{median}_{x_1 \in X_{w_1},\, x_2 \in X_{w_2}} sim(\vec{x_1}, \vec{x_2})    (3)

The method is expected to express the similarity between mediated representations of each node set. Instead of the arithmetic average, we employ the median to select a representative node in each node set, allowing us to use the associated vector pair as a feature.

4 Experiments

We evaluated the effectiveness of the proposed supervised approach by conducting a series of classification experiments using the BLESS dataset (Baroni and Lenci, 2011). Among the possible learning algorithms, we adopted the Random Forest algorithm as it maintains a balance between performance and efficiency. The results are assessed by using standard measures such as Precision (P), Recall (R), and F1. We employed the pre-trained Word2Vec embeddings2. We trained sense/concept embeddings by applying the AutoExtend system3 (Rothe and Schütze, 2015) while using the Word2Vec embeddings as the input and consulting PWN 3.0 as the underlying lexical-semantic resource.

2 300-dimensional vectors by CBoW, available at https://code.google.com/archive/p/word2vec/
3 The default hyperparameters were used.

4.1 Dataset

We utilized the BLESS dataset, which was developed for the evaluation of distributional semantic models. It provides 14,400 tetrads of (w1, w2, lexical-semantic relation type, topical domain type), where the topical domain type designates a semantic class from a coarse semantic classification system consisting of 17 English concrete noun categories (e.g., tools, clothing, vehicles, and animals). The lexical-semantic relation types defined in BLESS and their counts are described as follows:

- COORD (3565 word pairs): the words are co-hyponyms (e.g., alligator-lizard).
- HYPER (1337 word pairs): w2 is a hypernym of w1 (e.g., alligator-animal).
- MERO (2943 word pairs): w2 is a component/organ/member of w1 (e.g., alligator-mouth).
- ATTRI (2731 word pairs): w2 is an adjective expressing an attribute of w1 (e.g., alligator-aquatic).
- EVENT (3824 word pairs): w2 is a verb referring to an action/activity/happening/event associated with w1 (e.g., alligator-swim).

Note here that these lexical-semantic relation types do not completely concord with the PWN relations described in section 3.1.

Data division: In order to compare the performance on the present task we divided the data in three ways: In-domain, Out-of-domain (as employed in (Necşulescu et al., 2015)), and Collapsed-domain. For the In-domain setting, the data in the same domain were used both for training and testing; we thus conducted a five-fold cross validation for each domain. For the Out-of-domain setting, one domain is used for testing and the remaining data is used for training. In addition, we prepared the Collapsed-domain setting, where we conducted a 10-fold cross validation over the entire dataset irrespective of the domain.

4.2 Comparing methods

A supervised relation classification system referred to as WECE (Word Embeddings Classification systEm) in (Necşulescu et al., 2015) was especially chosen for comparisons, since this method combines and uses the word embeddings of a given word pair (w1, w2) as features. They compare two types of approaches described as WECE_offset and WECE_concat: WECE_offset uses the offset of the word embeddings (w2 - w1) as the feature vector, whereas WECE_concat uses the concatenation of the word embeddings. Moreover, they use two types of word embeddings: a bag-of-words model (BoW) (Mikolov et al., 2013a) and a dependency-based skip-gram model (Dep) (Levy and Goldberg, 2014). In summary, the WECE system has the following variations: WECE_BoW^offset, WECE_Dep^offset, WECE_BoW^concat and WECE_Dep^concat.

As described in Section 3, we utilize 22 kinds of similarities and the underlying vectors as features. In order to make reasonable comparisons, we compare two vector composition methods. In addition to the already described array of similarities, the Proposal_concat method uses the concatenated vectors of the underlying vector pairs, whereas the Proposal_offset method employs the difference vectors. As a result, the dimensionalities of the resulting feature vectors employed in these methods are 13,222 (22 similarities + 22 × 600 dimensions for concatenated vectors) and 6,622 (22 similarities + 22 × 300 dimensions for difference vectors), respectively.

Baseline: As detailed in section 3, our methods utilize PWN neighboring concepts linked by particular lexical-semantic relationships, such as hypernymy, attribute, and meronymy. We thus set the baseline as follows, respecting the direct relational links defined in PWN:

- Given a word pair (w1, w2), if any concept in S_w1 and any concept in S_w2 are directly linked by a certain relationship in PWN, let w1 and w2 be in that relation.

Note that the baseline method cannot find any word pair that is annotated to have the EVENT relation in the BLESS dataset, because there are no links in PWN that share the same or a similar definition. Likewise, it is not capable of finding any word pair with the ATTRI relation in BLESS, because the definition of the attribute relationship in PWN differs from the definition of the ATTRI relation in the BLESS dataset. Thus, we can only compare the results for the relationships COORD, HYPER, and MERO with the common measures P, R, and F1.

5 Results

5.1 Major results

Table 2 compares our results with those of the WECE systems in the three data set divisions (In/Out/Collapsed domains).

                         In-domain            Out-of-domain        Collapsed-domain
    Method               P     R     F1       P     R     F1       P     R     F1
    WECE_BoW^offset      0.900 0.909 0.904    0.680 0.669 0.675    -     -     -
    WECE_Dep^offset      0.853 0.865 0.859    0.687 0.623 0.654    -     -     -
    Proposal_offset      0.913 0.907 0.906    0.766 0.762 0.753    0.867 0.867 0.865
    WECE_BoW^concat      0.899 0.910 0.904    0.838 0.570 0.678    -     -     -
    WECE_Dep^concat      0.859 0.870 0.865    0.782 0.638 0.703    -     -     -
    Proposal_concat      0.973 0.971 0.971    0.839 0.819 0.812    0.970 0.970 0.970

Table 2: Comparison of the overall classification results.

The results show that Proposal_concat performed best in all measures of each division. We observe two common trends across the approaches including WECE: (1) every score in the Out-of-domain setting was lower than that in the In-domain setting; and (2) the methods using vector concatenation achieved higher scores than those using vector offsets. The former trend is reasonable, since information that is more relevant to the test data is contained in the training data in the In-domain settings. The latter trend suggests that concatenated vectors may be more informative than offset vectors, supporting the conclusion presented in (Necşulescu et al., 2015).

Nevertheless, the results in the table clearly show that Proposal_offset outperformed both WECE_BoW^concat and WECE_Dep^concat not only in Precision but also in Recall in the Out-of-domain setting. This may confirm that sense representations, acquired by exploiting the richer structure encoded in PWN, are richer in semantic content than word embeddings learned from textual corpora, and hence even the offset vectors are capable of effectively abstracting some characteristics of potential lexical-semantic relations between a word pair.

    Relationship      P      R      F1
    COORD             0.761  0.559  0.645
      by Baseline     0.550  0.108  0.180
    HYPER             0.767  0.654  0.706
      by Baseline     0.746  0.199  0.314
    MERO              0.625  0.809  0.705
      by Baseline     0.934  0.034  0.065
    ATTRI             0.913  0.995  0.952
    EVENT             0.974  0.983  0.979

Table 3: Breakdown of the results obtained by Proposal_concat^OoD, with the Baseline results shown in the indented rows.

Table 3 breaks down the results obtained by Proposal_concat in the Out-of-domain setting (Proposal_concat^OoD), and compares them with the Baseline results, showing that Proposal_concat clearly outperformed the Baseline in Recall and F1. This clearly confirms that the direct relational links defined in PWN are insufficient for classifying the BLESS relationships. With respect to the internal comparison of the Proposal_concat^OoD results, a prominent fact is the high-performance classification of the ATTRI and EVENT relationships. By definition, these relationships link a noun to an adjective (ATTRI) or to a verb (EVENT), whereas the COORD, HYPER, and MERO relationships connect a noun to another noun. This may suggest that the information carried by part of speech plays a role in this classification task.
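To make the similarity features of Section 3.3 concrete, the following is a minimal sketch (not the authors' code) of the three calculation methods over one node set pair (X_w1, X_w2); the embeddings are hypothetical toy vectors.

    # Minimal sketch of sim_max, sim_sum and sim_med over one node set pair.
    import numpy as np

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def sim_max(X1, X2):      # most similar node combination (disambiguation effect)
        return max(cos(x1, x2) for x1 in X1 for x2 in X2)

    def sim_sum(X1, X2):      # similarity of the summed ("holistic") representations
        return cos(np.sum(X1, axis=0), np.sum(X2, axis=0))

    def sim_med(X1, X2):      # median over all pairwise similarities
        return float(np.median([cos(x1, x2) for x1 in X1 for x2 in X2]))

    # Hypothetical toy node sets, e.g. S_w1^hyper for "alligator" and S_w2 for "reptile".
    X1 = [np.array([0.8, 0.1, 0.0]), np.array([0.2, 0.7, 0.1])]
    X2 = [np.array([0.7, 0.2, 0.1])]
    print(sim_max(X1, X2), sim_sum(X1, X2), sim_med(X1, X2))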

42 HYPER COORD ATTRI MERO EVENT HYPER 875 157 10 282 13 COORD 189 1994 220 1118 44 ATTRI 1 13 2716 1 0 MERO 74 423 24 2380 42 EVENT 2 32 5 27 3758

OoD Table 4: Confusion matrix for the results by P roposalconcat in the Out-of-domain setting.
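The offset and concatenation schemes contrasted above are simple operations over a pair of embedding vectors. The following minimal sketch illustrates only those two pair representations and is not the authors' pipeline: the toy data, dimensionality, and logistic-regression classifier are placeholder assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def offset_features(v1, v2):
        # Vector-offset representation of the pair (w1, w2): the difference v1 - v2.
        return v1 - v2

    def concat_features(v1, v2):
        # Concatenation keeps both vectors, doubling the dimensionality.
        return np.concatenate([v1, v2])

    # Toy data: 100 word pairs with 300-dimensional vectors and binary relation labels.
    rng = np.random.default_rng(0)
    pairs = rng.normal(size=(100, 2, 300))
    labels = rng.integers(0, 2, size=100)

    X_offset = np.array([offset_features(v1, v2) for v1, v2 in pairs])
    X_concat = np.array([concat_features(v1, v2) for v1, v2 in pairs])

    clf = LogisticRegression(max_iter=1000)
    print(clf.fit(X_offset, labels).score(X_offset, labels))   # offset features
    print(clf.fit(X_concat, labels).score(X_concat, labels))   # concatenation features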

5.2 Ablation tests

This section considers the results in the Out-of-domain setting more closely. Table 5 shows the results of the ablation tests for the Proposal_concat^OoD setting, comparing the effectiveness of the sources of the similarities. Each row other than Proposal_concat^OoD displays the results when the designated feature is ablated. The -word row shows the result when ablating the 601-dimensional features created from the pair of word embeddings, and the other rows show the results when ablating the corresponding 1803-dimensional features (three similarities and the three vector pairs that yielded the similarities) generated from each node set pair.

                        P      R      F1     F1 diff
Proposal_concat^OoD     0.839  0.819  0.812    -
-word                   0.845  0.827  0.819   0.008
-sense                  0.833  0.815  0.806  -0.006
-concept                0.826  0.809  0.802  -0.010
-coord                  0.834  0.811  0.803  -0.009
-hyper                  0.826  0.803  0.800  -0.012
-attri1                 0.826  0.806  0.798  -0.014
-attri2                 0.842  0.820  0.814   0.002
-mero                   0.835  0.813  0.806  -0.006

Table 5: Ablation tests comparing the effectiveness of each node set.

This table suggests that the concept, hyper, and attri1 node set pairs are effective, as indicated by the relatively large decreases in F1. Surprisingly, however, ablating the features generated from the word embeddings slightly improved the performance. This implies that the abstract-level semantics encoded in sense/concept embeddings are more robust in the classification of the target lexical-semantic relationships. However, the utility of the sense embeddings was modest. This may result from the learning method of AutoExtend: it tries to split a word embedding into the senses' embeddings without considering the distribution of senses in the Word2Vec training corpus. Addressing this issue is left as potential future work.

Table 6 compares the effectiveness of the two types of features. The "only similarities" row shows the results when ablating the vectorial features and using only the 22-dimensional similarity features (21 semantic/conceptual-level similarities along with a word-level similarity). The "only vector pairs" row shows the results from the opposite setting, using only the 22 vector pairs (13,200-dimensional features).

                        P      R      F1     F1 diff
Proposal_concat^OoD     0.839  0.819  0.812    -
only similarities       0.704  0.687  0.683  -0.129
only vector pairs       0.834  0.812  0.804  -0.007

Table 6: Effectiveness comparison of the types of features.

These results show that the vectorial features produce more accurate results than the similarity features alone, confirming the general assumption that more features yield more accurate results. However, we emphasize that, even with only the similarity features, our approach outperformed the comparable method in Recall (shown in the Out-of-domain columns of Table 2).

Table 7 shows the results of further ablation tests, comparing the effectiveness of the similarity calculation methods. Each row in the table displays the result when ablating the corresponding 4207-dimensional features (seven similarities plus the seven vector pairs that yielded these similarities).

As the results in the table show, the F1 scores did not change significantly in any ablated condition, indicating that the effect provided by the ablated method is compensated for by the remaining methods; there is some redundancy among the three calculation methods.
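An ablation of the kind reported in Tables 5-7 amounts to dropping one group of feature columns at a time and re-running the evaluation. The sketch below uses placeholder data, placeholder column groups, and a generic classifier; the real feature groups and learner are those described in the paper, not the ones named here.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression

    def ablate(X, y, groups):
        # groups maps a feature-group name to the column indices it contributes.
        full = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
        print(f"full model: {full:.3f}")
        for name, cols in groups.items():
            keep = np.setdiff1d(np.arange(X.shape[1]), cols)
            score = cross_val_score(LogisticRegression(max_iter=1000),
                                    X[:, keep], y, cv=5).mean()
            print(f"-{name}: {score:.3f} (diff {score - full:+.3f})")

    # Toy example: 200 instances, 20 features split into two groups.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 20))
    y = rng.integers(0, 2, size=200)
    ablate(X, y, {"word": np.arange(0, 10), "sense": np.arange(10, 20)})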

                        P      R      F1     F1 diff
Proposal_concat^OoD     0.839  0.819  0.812    -
-sim^c_max              0.835  0.812  0.805  -0.007
-sim^c_sum              0.843  0.822  0.816   0.004
-sim^c_med              0.838  0.811  0.805  -0.007

Table 7: Additional ablation tests comparing the similarity calculation methods.

6 Discussion

This section discusses two issues: the first is associated with the usage of PWN in the experiments using BLESS, and the other is concerned with the "lexical memorization" problem.

Usage of PWN: As detailed in Section 3, our methods utilize neighboring concepts linked by the particular lexical-semantic relations hypernymy, attribute, and meronymy defined in PWN. Some may consider that the HYPER, ATTRI, and MERO relationships can therefore be estimated simply by consulting these PWN relationships. However, this is definitely not the case, since almost all of the semantic relation instances in the BLESS dataset are not directly defined in PWN: among the 14,400 BLESS instances, only 951 are defined in PWN. For an obvious example, there are no links in PWN labeled event, which is a type of semantic relation defined in BLESS. The low Recall results of the Baseline presented in Table 3 endorse this fact, and clearly show the sparsity of useful semantic links in PWN. However, for some of the lexical-semantic relation types that exhibit transitivity, such as hypernymy, consulting the indirect links in PWN could be effectively utilized to improve the results.

Lexical memorization: Levy et al. (2015) recently argued that the results achieved by many supervised methods are inflated because of the "lexical memorization" effect, by which a classifier only learns "prototypical hypernymy" relations. For example, if a classifier encounters many positive examples such as (X, animal), where X is a hyponym of animal (e.g., dog, cat, ...), it only learns to classify any word pair as a hypernym as long as the second word is "animal." We argue that our method can be expected to be relatively free from this issue. The similarity features are not affected by this effect, since any similarity calculation is a symmetric operation, independent of word order. Moreover, the sim^c_max and sim^c_med methods select a pair of sense/concept embeddings, where the selected combination usually differs depending on the combination of node sets. On the other hand, the sim^c_sum method could be affected by the memorization effect, since the vectorial feature for the prototypical hypernym is invariable.

7 Conclusion

This paper proposed a method for classifying the type of lexical-semantic relation that could hold between a given pair of words. The empirical results clearly outperformed previously reported state-of-the-art results, demonstrating that the rationales behind the proposal are valid. In particular, it would be reasonable to assume that the plausibility of a specific type of lexical-semantic relation between the words can be chiefly recovered by a carefully designed set of relation-specific similarities. These results also highlight the utility of the sense representations, since our similarity calculation methods rely on the sense/concept embeddings created by the AutoExtend system.

Future work could follow two directions. First, we need to improve the classification of inter-noun semantic relations. We may particularly need to distinguish the hypernym relationship, which is asymmetric, from the symmetric coordinate relationship. In this regard, we would need to improve the creation of node sets and their combinations to capture the innate differences between the relationships. Second, we need to address a potential drawback of our proposal, which comes from its operational requirement: a lack of sense/concept embeddings is crucial, as we cannot collect relevant node sets in this case. Therefore, we need to develop a method to assign some of the existing concepts to an unknown word that is not contained in PWN, by seeking the nearest concept in the resource. A possible method would first seek the nearest concept in the underlying lexical-semantic resource for an unknown word, and then induce a revised set of sense/concept embeddings by iteratively applying the AutoExtend system.

Acknowledgments

The present research was supported by JSPS KAKENHI Grant Numbers JP26540144 and JP25280117.

References

Marco Baroni and Alessandro Lenci. 2011. How we blessed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 1-10. Association for Computational Linguistics.

Jordan Boyd-Graber, Christiane Fellbaum, Daniel Osherson, and Robert Schapire. 2006. Adding dense, weighted connections to WordNet. In Proceedings of the Third International WordNet Conference, pages 29-36.

Alexander Budanitsky and Graeme Hirst. 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13-47.

Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning semantic hierarchies via word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1199-1209, Baltimore, Maryland, June. Association for Computational Linguistics.

Eric Huang, Richard Socher, Christopher Manning, and Andrew Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873-882, Jeju Island, Korea, July. Association for Computational Linguistics.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. SensEmbed: Learning sense embeddings for word and relational similarity. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 95-105, Beijing, China, July. Association for Computational Linguistics.

Douwe Kiela, Felix Hill, and Stephen Clark. 2015. Specializing word embeddings for similarity or relatedness. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2044-2048, Lisbon, Portugal, September. Association for Computational Linguistics.

Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(04):359-389.

Alessandro Lenci and Giulia Benotto. 2012. Identifying hypernyms in distributional semantic spaces. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics, pages 75-79, Montreal, Canada, 7-8 June. Association for Computational Linguistics.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302-308, Baltimore, Maryland, June. Association for Computational Linguistics.

Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distributional methods really learn lexical inference relations? In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 970-976, Denver, Colorado, May-June. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39-41.

Silvia Necsulescu, Sara Mendes, David Jurgens, Nuria Bel, and Roberto Navigli. 2015. Reading between the lines: Overcoming data sparsity for accurate classification of lexical relationships. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, pages 182-192, Denver, Colorado, June. Association for Computational Linguistics.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1059-1069, Doha, Qatar, October. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, Doha, Qatar, October. Association for Computational Linguistics.

Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1793-1803, Beijing, China, July. Association for Computational Linguistics.

Pavel Rychlý and Adam Kilgarriff. 2007. An efficient algorithm for building a distributional thesaurus (and other developments). In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume: Proceedings of the Demo and Poster Sessions, pages 41-44, Prague, Czech Republic, June. Association for Computational Linguistics.

Enrico Santus, Frances Yung, Alessandro Lenci, and Chu-Ren Huang. 2015. EVALution 1.0: an evolving semantic dataset for training and evaluation of distributional semantic models. In Proceedings of the 4th Workshop on Linked Data in Linguistics (LDL-2015), pages 64-69.

Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2389-2398, Berlin, Germany, August. Association for Computational Linguistics.

Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Timothy Baldwin. 2016. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1671-1682, Berlin, Germany, August. Association for Computational Linguistics.

Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2249-2259, Dublin, Ireland, August. Dublin City University and Association for Computational Linguistics.

Supervised and unsupervised approaches to measuring usage similarity

Milton King and Paul Cook
Faculty of Computer Science, University of New Brunswick
Fredericton, NB E3B 5A3, Canada
[email protected], [email protected]

Abstract

Usage similarity (USim) is an approach to determining word meaning in context that does not rely on a sense inventory. Instead, pairs of usages of a target lemma are rated on a scale. In this paper we propose unsupervised approaches to USim based on embeddings for words, contexts, and sentences, and achieve state-of-the-art results over two USim datasets. We further consider supervised approaches to USim, and find that although they outperform unsupervised approaches, they are unable to generalize to lemmas that are unseen in the training data.

1 Usage similarity

Word senses are not discrete. In many cases, for a given instance of a word, multiple senses from a sense inventory are applicable, and to varying degrees (Erk et al., 2009). For example, consider the usage of wait in the following sentence taken from Jurgens and Klapaftis (2013):

1. And is now the time to say I can hardly wait for your impending new novel about the Alamo?

Annotators judged the WordNet (Fellbaum, 1998) senses glossed as 'stay in one place and anticipate or expect something' and 'look forward to the probable occurrence of' to have applicability ratings of 4 out of 5 and 2 out of 5, respectively, for this usage of wait. Moreover, Erk et al. (2009) also showed that this issue cannot be addressed simply by choosing a coarser-grained sense inventory. That a clear line cannot be drawn between the various senses of a word has been observed as far back as Johnson (1755). Some have gone so far as to doubt the existence of word senses (Kilgarriff, 1997).

Sense inventories also suffer from a lack of coverage. New words regularly come into usage, as do new senses for established words. Furthermore, domain-specific senses are often not included in general-purpose sense inventories. This issue of coverage is particularly relevant for social media text, which contains a higher rate of out-of-vocabulary words than more-conventional text types (Baldwin et al., 2013).

These issues pose problems for natural language processing tasks such as word sense disambiguation and induction, which rely on, and seek to induce, respectively, sense inventories, and have traditionally assumed that each instance of a word can be assigned one sense.[1] In response to this, alternative approaches to word meaning have been proposed that do not rely on sense inventories. Erk et al. (2009) carried out an annotation task on "usage similarity" (USim), in which the similarity of the meanings of two usages of a given word is rated on a five-point scale.

[1] Recent word sense induction systems and evaluations have, however, considered graded senses and multi-sense applicability (Jurgens and Klapaftis, 2013).

Lui et al. (2012) proposed the first computational approach to USim. They considered approaches based on topic modelling (Blei et al., 2003), under a wide range of parameter settings, and found that a single topic model for all target lemmas (as opposed to one topic model per target lemma) performed best on the dataset of Erk et al. (2009). Gella et al. (2013) considered USim on Twitter text, noting that this model of word meaning seems particularly well suited to this text type because of the prevalence of out-of-vocabulary words. Gella et al. (2013) also considered topic modelling-based approaches, achieving their best results using one topic model per

target word, and a document expansion strategy based on medium-frequency hashtags to combat the data sparsity of tweets due to their relatively short length. The methods of Lui et al. (2012) and Gella et al. (2013) are unsupervised; they do not rely on any gold standard USim annotations.

In this paper we propose unsupervised approaches to USim based on embeddings for words (Mikolov et al., 2013; Pennington et al., 2014), contexts (Melamud et al., 2016), and sentences (Kiros et al., 2015), and achieve state-of-the-art results over the USim datasets of both Erk et al. (2009) and Gella et al. (2013). We then consider supervised approaches to USim based on these same methods for forming embeddings, which outperform the unsupervised approaches but perform poorly on lemmas that are unseen in the training data.

2 USim models

In this section we describe how we represent a target word usage in context, and then how we use these representations in unsupervised and supervised approaches to USim.

2.1 Usage representation

We consider four ways of representing an instance of a target word based on embeddings for words, contexts, and sentences. For word embeddings, we consider word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). In each case we represent a token instance of the target word in a sentence as the average of the word embeddings for the other words occurring in the sentence, excluding stopwords.

Context2vec (Melamud et al., 2016) can be viewed as an extension of word2vec's continuous bag-of-words (CBOW) model. In CBOW, the context of a target word token is represented as the average of the embeddings for words within a fixed window. In contrast, context2vec uses a richer representation based on a bidirectional LSTM capturing the full sentential context of a target word token. During training, context2vec embeds the context of word token instances in the same vector space as word types. As this model explicitly embeds word contexts, it seems particularly well suited to USim.

Kiros et al. (2015) proposed skip-thoughts, a sentence encoder that can be viewed as a sentence-level version of word2vec's skipgram model, i.e., during training, the encoding of a sentence is used to predict surrounding sentences. Kiros et al. (2015) showed that skip-thoughts outperforms previous approaches to measuring sentence-level relatedness. Although our goal is to determine the meaning of a word in context, the meaning of a sentence could be a useful proxy for this.[2]

[2] Inference requires only a single sentence, so the model can infer skip-thought vectors for sentences taken out of context, as in the USim datasets.

2.2 Unsupervised approach

In the unsupervised setup, we measure the similarity between two usages of a target word as the cosine similarity between their vector representations, obtained by one of the methods described in Section 2.1. This method does not require gold standard training data.

2.3 Supervised approach

We also consider a supervised approach. For a given pair of token instances of a target word, t1 and t2, we first form vectors v1 and v2 representing each of the two instances of the target, using one of the approaches in Section 2.1. To represent each pair of instances, we follow the approach of Kiros et al. (2015): we compute the componentwise product, and the absolute difference, of v1 and v2, and concatenate them. This gives a vector of length 2d, where d is the dimensionality of the embeddings used, representing each pair of instances. We then train ridge regression to learn a model to predict the similarity of unseen usage pairs.
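A minimal sketch of the word-embedding variants of these two setups, assuming pre-trained vectors are available in a dict keyed by word; the whitespace tokenisation and the tiny stopword list are simplifying assumptions, while the averaging, cosine scoring, and pair-feature construction follow Sections 2.1-2.3.

    import numpy as np
    from sklearn.linear_model import Ridge

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "is"}   # placeholder list

    def usage_vector(sentence, target, vectors, dim=300):
        # Average the embeddings of the other words in the sentence (Section 2.1).
        words = [w for w in sentence.lower().split()
                 if w != target and w not in STOPWORDS and w in vectors]
        return np.mean([vectors[w] for w in words], axis=0) if words else np.zeros(dim)

    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom else 0.0

    def pair_features(v1, v2):
        # Componentwise product and absolute difference, concatenated (Section 2.3).
        return np.concatenate([v1 * v2, np.abs(v1 - v2)])

    # Unsupervised USim score for one sentence pair:
    #   cosine(usage_vector(s1, lemma, vecs), usage_vector(s2, lemma, vecs))
    # Supervised USim: fit ridge regression on pair features against gold ratings:
    #   X = np.array([pair_features(v1, v2) for v1, v2 in instance_vector_pairs])
    #   model = Ridge().fit(X, gold_ratings)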

3 Materials and methods

3.1 USim Datasets

We evaluate our methods on two USim datasets representing two different text types: ORIGINAL, the USim dataset of Erk et al. (2009), and TWITTER, from Gella et al. (2013). Both USim datasets contain pairs of sentences; each sentence in each pair includes a usage of a particular target lemma. Each sentence pair is rated on a scale of 1-5 for how similar in meaning the usages of the target words are in the two sentences.

ORIGINAL consists of sentences from McCarthy and Navigli (2007), which were drawn from a web corpus (Sharoff, 2006). This dataset contains 34 lemmas, including nouns, verbs, adjectives, and adverbs. Each lemma is the target word in 10 sentences. For each lemma, sentence pairs (SPairs) are formed based on all pairwise comparisons, giving 45 SPairs per lemma. Annotations were provided by three native English speakers, with the average taken as the final gold standard similarity. In a small number of cases the annotators were unable to judge similarity. Erk et al. (2009) removed these SPairs from the dataset, resulting in a total of 1512 SPairs.

TWITTER contains SPairs for ten nouns from ORIGINAL. In this case the "sentences" are in fact tweets. 55 SPairs are provided for each noun. Unlike ORIGINAL, the SPairs are not formed on the basis of all pairwise comparisons amongst a smaller set of sentences. This dataset was annotated via crowdsourcing and carefully cleaned to remove outlier annotations.

D    W   ORIGINAL  TWITTER
50   2   0.251     0.246
50   5   0.262     0.272
50   8   0.286     0.282
100  2   0.267     0.248
100  5   0.273     0.253
100  8   0.273     0.298
300  2   0.275     0.266
300  5   0.279     0.295
300  8   0.281     0.300

Table 1: Spearman's ρ on each dataset using the unsupervised approach with word2vec embeddings trained using several settings for the number of dimensions (D) and window size (W). The best ρ for each dataset is shown in boldface.
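As a quick sanity check of the dataset construction just described, forming all pairwise comparisons over the 10 sentences of a lemma indeed yields 45 SPairs; the snippet below is purely illustrative and the sentence placeholders are not real data.

    from itertools import combinations

    sentences = [f"sentence_{k}" for k in range(10)]   # 10 usages of one lemma
    spairs = list(combinations(sentences, 2))          # all pairwise comparisons
    print(len(spairs))                                 # 45 SPairs per lemma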

3.2 Evaluation

Following Lui et al. (2012) and Gella et al. (2013) we evaluate our systems by calculating Spearman's rank correlation coefficient between the gold standard similarities and the predicted similarities. This enables direct comparison of our results with those reported in these previous studies.

We evaluate our supervised approaches using two cross-validation methodologies. In the first case we apply 10-fold cross-validation, randomly partitioning all SPairs for all lemmas in a given dataset. Using this approach, the test data for a given fold consists of SPairs for target lemmas that were seen in the training data. To determine how well our methods generalize to unseen lemmas, we consider a second cross-validation setup in which we partition the SPairs in a given dataset by lemma. Here the test data for a given fold consists of SPairs for one lemma, and the training data consists of SPairs for all other lemmas.

3.3 Embeddings

We train word2vec's skipgram model on two corpora:[3] (1) a corpus of English tweets collected from the Twitter Streaming APIs[4] from November 2014 to March 2015 containing 1.3 billion tokens; and (2) an English Wikipedia dump from 1 September 2015 containing 2.6 billion tokens. Because of the relatively low cost of training word2vec, we consider several settings of window size (W = 2, 5, 8) and number of dimensions (D = 50, 100, 300). Embeddings trained on Wikipedia and Twitter are used for experiments on ORIGINAL and TWITTER, respectively.

For the other embeddings we use pre-trained models. We use GloVe vectors from Wikipedia and Twitter, with 300 and 200 dimensions, for experiments on ORIGINAL and TWITTER, respectively.[5] For context2vec we use a 600-dimensional model trained on the ukWaC (Ferraresi et al., 2008), a web corpus of approximately 2 billion tokens.[6] We use a skip-thoughts model with 4800 dimensions, trained on a corpus of books.[7] We use these context2vec and skip-thoughts models for experiments on both ORIGINAL and TWITTER.

[3] In preliminary experiments the alternative word2vec CBOW model achieved substantially lower correlations than skipgram, and so CBOW was not considered further.
[4] https://dev.twitter.com/
[5] http://nlp.stanford.edu/projects/glove/
[6] https://github.com/orenmel/context2vec
[7] https://github.com/ryankiros/skip-thoughts
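The two cross-validation protocols of Section 3.2 can be sketched as follows; the ridge regressor and the scikit-learn splitters are assumptions for illustration (the paper specifies ridge regression but not a particular toolkit), and X, gold, and lemmas stand for the pair features, gold ratings, and target lemmas of one dataset.

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.model_selection import KFold, LeaveOneGroupOut
    from sklearn.linear_model import Ridge

    def evaluate(X, gold, lemmas):
        # "All" setting: random 10-fold cross-validation over all SPairs.
        preds = np.zeros(len(gold))
        for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
            preds[test] = Ridge().fit(X[train], gold[train]).predict(X[test])
        rho_all = spearmanr(gold, preds).correlation

        # "Lemma" setting: each fold holds out every SPair of one lemma.
        preds = np.zeros(len(gold))
        for train, test in LeaveOneGroupOut().split(X, gold, groups=lemmas):
            preds[test] = Ridge().fit(X[train], gold[train]).predict(X[test])
        rho_lemma = spearmanr(gold, preds).correlation
        return rho_all, rho_lemma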

4 Experimental results

We first consider the unsupervised approach using word2vec for a variety of window sizes and numbers of dimensions. Results are shown in Table 1. All correlations are significant (p < 0.05). On both ORIGINAL and TWITTER, for a given number of dimensions, as the window size is increased, ρ increases. Embeddings for larger window sizes tend to better capture semantics, whereas embeddings for smaller window sizes tend to better reflect syntax (Levy and Goldberg, 2014); the more-semantic embeddings given by larger window sizes appear to be better suited to the task of predicting USim. For a given window size, a higher number of dimensions also tends to achieve higher ρ. For example, for a given window size, D = 300 gives a higher ρ than D = 50 in each case, except for W = 8 on ORIGINAL.

The best correlations reported by Lui et al. (2012) on ORIGINAL, and Gella et al. (2013) on TWITTER, were 0.202 and 0.29, respectively. The best parameter settings for our unsupervised approach using word2vec embeddings achieve higher correlations, 0.286 and 0.300, on ORIGINAL and TWITTER, respectively. Lui et al. (2012) and Gella et al. (2013) both report drastic variation in performance for different settings of the number of topics in their models. We also observe some variation with respect to parameter settings; however, any of the parameter settings considered achieves a higher correlation than Lui et al. (2012) on ORIGINAL. For TWITTER, parameter settings with W >= 5 and D >= 100 achieve a correlation comparable to, or greater than, the best reported by Gella et al. (2013).

                                            Supervised
Dataset    Embeddings      Unsupervised    All       Lemma
ORIGINAL   Word2vec        0.281*          0.435*    0.220*
           GloVe           0.218*          0.410*    0.230*
           Skip-thoughts   0.177*          0.436*    0.099*
           Context2vec     0.302*          0.417*    0.172*
TWITTER    Word2vec        0.300*          0.384*    0.196*
           GloVe           0.122*          0.314*    0.134*
           Skip-thoughts   0.095*          0.360*    0.058
           Context2vec     0.122*          0.193*    0.067

Table 2: Spearman's ρ on each dataset using the unsupervised method, and supervised methods with cross-validation folds based on random sampling across all lemmas (All) and holding out individual lemmas (Lemma), for each embedding approach. The best ρ for each experimental setup, on each dataset, is shown in boldface. Significant correlations (p < 0.05) are indicated with *.

We now consider the unsupervised approach using the other embeddings. Based on the previous findings for word2vec, we only consider this model with W = 8 and D = 300 here. Results are shown in Table 2 in the column labeled "Unsupervised". For ORIGINAL, context2vec performs best (and indeed outperforms word2vec for all parameter settings considered). This result demonstrates that approaches to predicting USim that explicitly embed the context of a target word can outperform approaches based on averaging word embeddings (i.e., word2vec and GloVe) or embedding sentences (skip-thoughts). This result is particularly strong because we consider a range of parameter settings for word2vec, but only used the default settings for context2vec.[8] Word2vec does, however, perform best on TWITTER. The relatively poor performance of context2vec and skip-thoughts here could be due to differences between the text types these embedding models were trained on and the evaluation data. GloVe performs poorly, even though it was trained on tweets for these experiments, but that it performs less well than word2vec is consistent with the findings for ORIGINAL.

Turning to the supervised approach, we first consider results for cross-validation based on randomly partitioning all SPairs in a dataset (column "All" in Table 2). The best correlation on TWITTER (0.384) is again achieved using word2vec, while the best correlation on ORIGINAL (0.436) is obtained with skip-thoughts. The difference in performance amongst the various embedding approaches is, however, somewhat smaller here than in the unsupervised setting. For each embedding approach, and each dataset, the correlation in the supervised setting is better than that in the unsupervised setting, suggesting that if labeled training data is available, supervised approaches can give substantial improvements over unsupervised approaches to predicting USim.[9] However, this experimental setup does not show the extent to which the supervised approach is able to generalize to previously unseen lemmas.

The column labeled "Lemma" in Table 2 shows results for the supervised approach for cross-validation using lemma-based partitioning. In these experiments, the test data consists of usages of a target lemma that was not seen as a target lemma during training. For each dataset, the correlations achieved here for each type of embedding are lower than those of the corresponding unsupervised method, with the exception of GloVe.

[8] The context2vec model has 600 dimensions, and was trained on the ukWaC, whereas our word2vec model for ORIGINAL is trained on Wikipedia. To further compare these approaches we also trained word2vec on the ukWaC with 600 dimensions and a window size of 8. These word2vec settings also did not outperform context2vec.
[9] These results on ORIGINAL must be interpreted cautiously, however. The same sentences, albeit in different SPairs, occur in both the training and testing data for a given fold. This issue does not affect TWITTER.

In the case of ORIGINAL, the higher correlation for GloVe relative to the unsupervised setup appears to be largely due to improved performance on adverbs. Nevertheless, for each dataset, the correlations achieved by GloVe are still lower than those of the best unsupervised method on that dataset. These results demonstrate that the supervised approach generalizes poorly to new lemmas. This negative result indicates an important direction for future work: identifying strategies for training supervised approaches to predicting USim that generalize to unseen lemmas.

5 Conclusions

Word senses are not discrete, and multiple senses are often applicable for a given usage of a word. Moreover, for text types that have a relatively high rate of out-of-vocabulary words, such as social media text, many words will be missing from sense inventories. USim is an approach to determining word meaning in context that does not rely on a sense inventory, addressing these concerns. We proposed unsupervised approaches to USim based on embeddings for words, contexts, and sentences. We achieved state-of-the-art results over USim datasets based on Twitter text and more-conventional texts. We further considered supervised approaches to USim based on these same methods for forming embeddings, and found that although these methods outperformed the unsupervised approaches, they performed poorly on lemmas that were unseen in the training data.

The approaches to learning word embeddings that we considered (word2vec and GloVe) both learn a single vector representing each word type. There are, however, approaches that learn multiple embeddings for each type that have been applied to predict word similarity in context (Huang et al., 2012; Neelakantan et al., 2014, for example). In future work, we intend to also evaluate such approaches for the task of predicting usage similarity. We also intend to consider alternative strategies for training supervised approaches to USim in an effort to achieve better performance on unseen lemmas.

Acknowledgments

This research was financially supported by the Natural Sciences and Engineering Research Council of Canada, the New Brunswick Innovation Foundation, ACENET, and the University of New Brunswick.

References

Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How noisy social media text, how diffrnt social media sources? In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013), pages 356-364, Nagoya, Japan.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022.

Katrin Erk, Diana McCarthy, and Nicholas Gaylord. 2009. Investigations on word senses and word usages. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 10-18, Singapore.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. 2008. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop: Can we beat Google, pages 47-54, Marrakech, Morocco.

Spandana Gella, Paul Cook, and Bo Han. 2013. Unsupervised word usage similarity in social media texts. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 248-253, Atlanta, USA.

Eric Huang, Richard Socher, Christopher Manning, and Andrew Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873-882, Jeju Island, Korea.

Samuel Johnson. 1755. A Dictionary of the English Language. London.

David Jurgens and Ioannis Klapaftis. 2013. SemEval-2013 Task 13: Word sense induction for graded and non-graded senses. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013), pages 290-299, Atlanta, USA.

Adam Kilgarriff. 1997. "I Don't Believe in Word Senses". Computers and the Humanities, 31(2):91-113.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3276-3284. Curran Associates, Inc.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 302-308, Maryland, USA.

Marco Lui, Timothy Baldwin, and Diana McCarthy. 2012. Unsupervised estimation of word usage similarity. In Proceedings of the Australasian Language Technology Association Workshop 2012, pages 33-41, Dunedin, New Zealand.

Diana McCarthy and Roberto Navigli. 2007. SemEval-2007 Task 10: English lexical substitution task. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 48-53, Prague, Czech Republic.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 51-61.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of Workshop at the International Conference on Learning Representations, Scottsdale, USA.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1059-1069, Doha, Qatar.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Serge Sharoff. 2006. Open-source corpora: Using the net to fish for linguistic data. Corpus Linguistics, 11(4):435-462.

Lexical Disambiguation of Igbo through Diacritic Restoration

Ignatius Ezeani, Mark Hepple, Ikechukwu Onyenwe
NLP Research Group, Department of Computer Science, University of Sheffield, UK
{ignatius.ezeani, m.r.hepple, i.onyenwe}@sheffield.ac.uk

Abstract

Properly written texts in Igbo, a low resource African language, are rich in both orthographic and tonal diacritics. Diacritics are essential in capturing the distinctions in pronunciation and meaning of words, as well as in lexical disambiguation. Unfortunately, most electronic texts in diacritic languages are written without diacritics. This makes diacritic restoration a necessary step in corpus building and language processing tasks for languages with diacritics. In our previous work, we built some n-gram models with simple smoothing techniques based on a closed-world assumption. However, as a classification task, diacritic restoration is well suited for, and will be more generalisable with, machine learning. This paper, therefore, presents a more standard approach to the task, which involves the application of machine learning algorithms.

1 Introduction

Diacritics are marks placed over, under, or through a letter in some languages to indicate a sound value different from that of the unmarked letter. English does not have diacritics (apart from a few borrowed words), but many of the world's languages use a wide range of diacritized letters in their orthography. Automatic Diacritic Restoration Systems (ADRS) enable the restoration of missing diacritics in texts. Many forms of such tools have been proposed, designed and developed, but work on Igbo is still in its early stages.

1.1 Diacritics and Igbo language

Igbo, a major Nigerian language and the native language of the people of south-eastern Nigeria, is spoken by over 30 million people worldwide. It uses the Latin script and has many dialects. Most written works, however, use the official orthography produced by the Ọnwụ Committee (see http://www.columbia.edu/itc/mealac/pritchett/00fwp/igbo/txt onwu 1961.pdf).

The orthography has 8 vowels (a, e, i, o, u, ị, ọ, ụ) and 28 consonants (b, gb, ch, d, f, g, gw, gh, h, j, k, kw, kp, l, m, n, nw, ny, ṅ, p, r, s, sh, t, v, w, y, z).

Table 1 shows Igbo characters with their orthographic or tonal (or both) diacritics and the possible changes in meanings of the words they appear in (m and n, nasal consonants, are sometimes treated as tone-marked vowels).

Char   Orthographic   Tonal
a      -              à, á, ā
e      -              è, é, ē
i      ị              ì, í, ī, ị̀, ị́, ị̄
o      ọ              ò, ó, ō, ọ̀, ọ́, ọ̄
u      ụ              ù, ú, ū, ụ̀, ụ́, ụ̄
m      -              m̀, ḿ, m̄
n      ṅ              ǹ, ń, n̄

Table 1: Igbo diacritic complexity

Most Igbo electronic texts collected from social media platforms are riddled with flaws ranging from dialectal variations and spelling errors to lack of diacritics. For instance, consider this raw excerpt from a chat on a popular Nigerian online chat forum, www.nairaland.com (source: http://www.nairaland.com/189374/igbo-love-messages):

    otu ubochi ka'm no na amaghi ihe mu na uwa ga-eje. kam noo n'eche ihe a,otu mmadu wee kpoturum,m lee anya o buru nwoke mara mma puru iche,mma ya turu m n'obi.o gwam si nne kedu


k’idi.onu ya dika onu ndi m’ozi, ihu ya such tasks as word sense disambiguation with re- dika anyanwu ututu,ahu ya n’achakwa gards to resolving both syntactic and semantic am- bara bara ka mmiri si n’okwute. ka ihe biguities. Indeed it was referred to as an instance niile no n’agbam n’obi,o sim na ohuru of a closely related class of problems which in- m n’anya.na ochoro k’anyi buru enyi,a cludes word choice selection in machine transla- hukwuru m ya n’anya.anyi wee kweko- tion, homograph and homophone disambiguation rita wee buru enyi onye’a m n’ekwu and capitalisation restoration (Yarowsky, 1994b). maka ya bu odinobi m,onye ihe ya Diacritic restoration, like sense disambiguation, n’amasi m is not an end in itself but an “intermediate task” (Wilks and Stevenson, 1996) which supports bet- In the above example, you can observe that ter understanding and representation of meanings there is zero presence of diacritics - tonal or or- in human-machine interactions. In most non- thographic - in the entire text. As pointed out diacritic languages, sense disambiguation systems above, although there are other issues with regards can directly support such tasks as machine transla- to standard in the text, lack of diacritics seems to tion, information retrieval, text processing, speech be harder to control or avoid than the others. This processing etc. (Ide and Veronis,´ 1998). But it is partly because diacritics or lack of it does af- takes more for diacritic languages, where possible, fect human understanding a great deal; and also to produce standard texts. So for those languages, the rigours a writer will go through to insert them to achieve good results with such systems as listed may not worth the effort. The challenge, however, above, diacritic restoration is required as a boost is that NLP systems built and trained with such for the sense disambiguation task. poor quality non standard data will most likely be We note however, that although diacritic unreliable. restoration is related to word sense disambigua- 1.2 Diacritic restoration and other NLP tion (WSD), it does not eliminate the need for systems sense disambiguation. For example, if the word- key akwa is successfully restored to akw` a`, it could Diacritic restoration is important for other NLP still be referring to either bed or bridge. Another systems such as speech recognition, text gener- good example is the behaviour of Google Trans- ation and machine translations systems. For ex- late as the context around the word akw` a` changes. ample, although most translation systems are now very impressive, not a lot of them support Igbo Statement Google Translate Comment Akwa ya di n’elu akwa It was on the high confused language. However, for the few that do (e.g. Akwa ya di n’elu akwa ya It was on the bed in his room fair Google Translate), diacritic restoration still plays Akw´ a` ya di n’elu akw` a` His clothing was on the bridge okay Akw´ a` ya di n’elu akw` a` ya His clothing on his bed good a huge role in how well they perform. The exam- ple below shows the effect of diacritic marks on Table 3: Disambiguation challenge for Google the output of Google Translate’s Igbo-to-English Translate translation. The last two statements, with proper diacrit- Statement Google Translate Comment O ji egbe ya gbuo egbe He used his gun to kill gun wrong ics on the ambiguous wordkey akwa seem both O ji egb´ e` ya gbuo egb´ e´ He used his gun to kill kite correct correct. 
Some disambiguation system in Google Akwa ya di n’elu akwa ya It was on the bed in his room fair Akw´ a` ya di n’elu akw` a` ya his clothes on his bed correct Translate must have been used to select the right Oke riri oke ya Her addiction confused form. However, it highlights the fact that such a Ok` e´ riri ok` e` ya Mouse ate his share correct disambiguation system may perform better when O jiri ugbo ya bia He came with his farm wrong O jiri u. gbo. ya bia He came with his car correct diacritics are restored.

Table 2: Diacritic disambiguation for Google 2 Problem Definition Translate As explained above, lack of diacritics can often lead to some lexical ambiguities in written Igbo 1.3 Diacritic restoration and WSD sentences. Although a human reader can, in most Yarowsky (1994a) observed that, although dia- cases, infer the intended meaning from context, critic restoration is not a hugely popular task in the machine may not. Consider the sentences in NLP research, it shares similar properties with sections 2.1 and 2.2 and their literal translations:

Figure 1: Illustrative View of the Diacritic Restoration Process (Ezeani et al., 2016)

2.1 Missing orthographic diacritics

1. Nwanyi ahu banyere n'ugbo ya. (The woman entered her [farm | boat/craft])
2. O kwuru banyere olu ya. (He/she talked about his/her [neck/voice | work/job])

2.2 Missing tonal diacritics

1. Nwoke ahu nwere egbe n'ulo ya. (That man has a [gun | kite] in his house)
2. O dina n'elu akwa. (He/she is lying on the [cloth | bed, bridge | egg | cry])
3. Egwu ji ya aka. (He/she is held/gripped by [fear | song/dance/music])

Ambiguities arise when diacritics, orthographic or tonal, are omitted in Igbo texts. In the first set of examples, we can see that ugbo (farm) and ụgbọ (boat/craft), as well as olu (neck/voice) and ọlụ (work/job), were candidates in their sentences. The second set of examples shows that égbé (kite) and égbè (gun); ákwà (cloth), àkwà (bed or bridge), àkwá (egg), or even ákwá (cry) in a philosophical or artistic sense; as well as égwù (fear) and égwú (music) are all qualified to replace the ambiguous word in their respective sentences.

3 Related Literature

Diacritic restoration techniques for low resource languages adopt two main approaches: word based and character based.

3.1 Word level diacritic restoration

Different schemes of the word-based approach have been described. They generally involve preprocessing, candidate generation and disambiguation. Simard (1998) applied POS-tags and HMM language models for French. On the other hand, Šantić et al. (2009) used substitution schemes, a dictionary and language models in implementing a similar architecture. For Spanish, Yarowsky (1999) used dictionaries with decision lists, Bayesian classification and Viterbi decoding of the surrounding context. Crandall (2005), using a Bayesian approach, HMM and a hybrid of both, as well as a different evaluation method, attempted to improve on Yarowsky's work. Cocks and Keegan (2011) worked on Māori using naïve Bayes and word-based n-grams relating to the target word as instance features. Tufiş and Chiţu (1999) used POS tagging to restore Romanian texts but backed off to a character-based approach to deal with "unknown words". Generally, there seems to be a consensus on the superiority of the word-based approach for well resourced languages.

3.2 Grapheme or letter level diacritic restoration

For low-resource languages, there is often a lack of adequate data and resources (large corpora, dictionaries, POS-taggers, etc.). Mihalcea (2002) as well as Mihalcea and Nastase (2002) argued that a letter-based approach would help to resolve the issue of lack of resources. They implemented instance-based and decision tree classifiers which gave a high letter-level accuracy. However, their evaluation method implied a possibly much lower word-level accuracy.

Versions of Mihalcea's approach with improved evaluation methods have been implemented for other low resourced languages (Wagacha et al., 2006; De Pauw et al., 2011; Scannell, 2011). Wagacha et al. (2006), for example, reviewed the evaluation method in Mihalcea's work and introduced a word-level method for Gĩkũyũ. De Pauw et al. (2011) extended Wagacha's work by applying the method to multiple languages.

Our earlier work on Igbo diacritic restoration (Ezeani et al., 2016) was more of a proof of concept aimed at extending the initial work done by Scannell (2011). We built a number of n-gram models, basically unigrams, bigrams and trigrams, along with simple smoothing techniques. Although we got relatively high results, our evaluation method was based on a closed-world assumption where we trained and tested on the same set of data.

Obviously, that assumption does not model the real world, and so it is addressed in this paper.

3.3 Igbo Diacritic Restoration

Igbo is low-resourced and is generally neglected in NLP research. However, an attempt at restoring Igbo diacritics was reported by Scannell (2011), in which a combination of word- and character-level models was applied. Two lexicon lookup methods were used: LL, which replaces ambiguous words with the most frequent word, and LL2, which uses a bigram model to determine the right replacement.

They reported word-level accuracies of 88.6% and 89.5% for the two models respectively. But the size of the training corpus (31k tokens with 4.3k word types) was too small to be representative, and there was no language speaker in the team to validate the data used and the results produced. Therefore, we implemented a range of more complex n-gram models, using similar evaluation techniques, on a comparatively larger corpus (1.07m tokens with 14.4k unique tokens), and improved on their results (Ezeani et al., 2016).

In this work, we introduce machine learning approaches to further generalise the process and to better learn the intricate patterns in the data that support better restoration.

4 Experimental Setup

4.1 Experimental Data

The corpus used in these experiments was collected from the Igbo version of the Bible available from the Jehovah's Witnesses website (jw.org). The basic corpus statistics are presented in Table 4.

Item                     Number
Total tokens             1070429
Total words              902150
Numbers/punctuation      168279
Unique words             563383
Ambiguous words          348509
Wordkeys                 15696
Unique wordkeys          15167
Ambiguous wordkeys       529
  2 variants             502
  3 variants             15
  4 variants             10
  5 variants             2
  >5 variants            0
Approx. ambiguity        38.22%

Table 4: Corpus statistics

In Table 4, we refer to the "latinized" form of a word as its wordkey (expectedly, many Igbo words are identical to their wordkeys). Less than 10% (529/15696) of the wordkeys are ambiguous. However, these ambiguous wordkeys form 529 ambiguous sets that yield 348,509 of the corpus words (i.e. words that share the same wordkey with at least one other word). These ambiguous words constitute approximately 38.22% (348,509/911,892) of the entire corpus. Some of the most frequent, as well as the least frequent, ambiguous sets are shown in Table 5.

Top            Variants (count)
na (29272)     ná (1332), na (27940)
o (22418)      o (4757), ò (64), ó (5), ọ (17592)

Bottom         Variants (count)
Giteyim (2)    Giteyịm (1), Giteyim (1)
Galim (2)      Galim (1), Galịm (1)

Table 5: Most and least frequent wordkeys

4.2 Preprocessing

The preprocessing task relied substantially on the approaches used by Onyenwe et al. (2014). Observed language-based patterns were preserved. For example, ga–, na– and n' are retained as they are, due to the specific lexical functions the special characters "–" or " ' " confer on them. For instance, while na implies conjunction (e.g. ji na ede: yam and cocoa-yam), na– is a verb auxiliary (e.g. Obi na–agba ọsọ: Obi is running), and n' is a shortened form of the preposition na (e.g. Ọ dị n'elu àkwà: It is on the bed).

Also, for consistency, diacritic formats are normalized using Unicode's Normalization Form Canonical Composition (NFC). For example, the character é formed from the combined Unicode characters e (U+0065) and the combining acute accent (U+0301) will be decomposed and recombined as the single canonically equivalent character é (U+00E9). Also, the character ṅ, which is often wrongly replaced with ñ or n̄ in some texts, is generally restored to its standard form.
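The wordkey mapping and the NFC normalisation step can be illustrated with Python's unicodedata module; this is a sketch of the idea, not the authors' preprocessing code, and the helper names are made up.

    import unicodedata

    def nfc(text):
        # Recombine base characters and combining marks into canonical composed form,
        # e.g. 'e' (U+0065) + combining acute (U+0301) becomes 'é' (U+00E9).
        return unicodedata.normalize("NFC", text)

    def wordkey(word):
        # Strip all diacritics (tone marks and dot-belows) to obtain the "latinized"
        # form: decompose, drop combining marks, and lowercase.
        decomposed = unicodedata.normalize("NFD", word)
        stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        return stripped.lower()

    print(wordkey(nfc("akwà")))   # -> "akwa"
    print(wordkey(nfc("ụgbọ")))   # -> "ugbo" (dot-below vowels map to plain vowels)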

The diacritic marking of the corpus used in this research is sufficient but not full or perfect. The orthographic diacritics (mostly dot-belows) have been included throughout. However, the tonal diacritics are fairly sparse, having been included only where they were needed for disambiguation, i.e. where the reader might not be able to decide the correct form from context.

Therefore, through manual inspection, some observed errors and anomalies were corrected by language speakers. For example, 3153 out of 3154 occurrences of the key mmadu were of the class mmadụ. The only one that was mmadu was corrected to mmadụ after inspection. By repeating this process, a lot of the generated ambiguous sets were resolved and removed from the list to reduce the noise. Examples are shown in the table below:

wordkey   Freq    var1 (Freq)    var2 (Freq)
akpu      106     akpụ́ (1)       akpụ (105)
agbu      112     agbu (111)     agbú (1)
aka       3690    aka (3689)     ákà (1)
iri       2036    iri (2035)     ịrị (1)

Table 6: Some examples of corrected and removed ambiguous sets

4.3 Feature extraction for training instances

The feature sets for the classification models were based on the work of Scannell (2011) on character-based restoration, which was extended by Cocks and Keegan (2011) to deal with word-based restoration for Māori. These features consist of a combination of n-grams, represented in the form (x, y), where x is the relative position to the target key and y is the token length, at different positions within the left and right context of the target word. The datasets are built as described below for each of the ambiguous keys:

FS1  [(-1,1), (1,1)]: unigrams on each side of the target key
FS2  [(-2,2), (2,2)]: bigrams on each side
FS3  [(-3,3), (3,3)]: trigrams on each side
FS4  [(-4,4), (4,4)]: 4-grams on each side
FS5  [(-5,5), (5,5)]: 5-grams on each side
FS6  [(-2,1), (-1,1), (1,1), (2,1)]: 2 unigrams on each side
FS7  [(-3,1), (-2,1), (-1,1), (1,1), (2,1), (3,1)]: 3 unigrams on each side
FS8  [(-4,1), (-3,1), (-2,1), (-1,1), (1,1), (2,1), (3,1), (4,1)]: 4 unigrams on each side
FS9  [(-5,1), (-4,1), (-3,1), (-2,1), (-1,1), (1,1), (2,1), (3,1), (4,1), (5,1)]: 5 unigrams on each side
FS10 [(-2,2), (-1,1), (1,1), (2,2)]: 1 unigram and 1 bigram on each side
FS11 [(-3,3), (-2,2), (2,2), (3,3)]: 1 bigram and 1 trigram on each side
FS12 [(-3,3), (-2,2), (-1,1), (1,1), (2,2), (3,3)]: 1 unigram, 1 bigram and 1 trigram on each side
FS13 [(-4,4), (-3,3), (-2,2), (-1,1), (1,1), (2,2), (3,3), (4,4)]: 1 unigram, 1 bigram, 1 trigram and a 4-gram on each side

4.3.1 Appearance threshold and stratification

We removed low-frequency wordkeys from our data by defining an appearance threshold as a percentage of the total tokens in our data. This is given by

    appThreshold = (C(wordkey) / C(tokens)) * 100

and wordkeys with an appThreshold below the stated value were removed from the experiment (in this work, we used an appThreshold of 0.005%).

As part of our data preparation for a standard cross-validation, we also passed each of our datasets through a simple stratification process. Instances of each label (labels are basically diacritic variants), where possible, are evenly distributed so that each appears at least once in every fold, or they are removed from the dataset. Our stratification algorithm picks only the labels from each dataset whose population p satisfies p >= nfolds, where nfolds is the number of folds, which in our case has a default value of 10.

In order to make the task a little more challenging, this process was augmented by the removal of some high-frequency but low-entropy datasets, where using the most common class (MCC) produces very high accuracies (datasets with more than 95% accuracy on the most common class, i.e. with entropy lower than 0.05, were removed). Entropy is loosely used here to refer to the degree of dominance of a particular class across the dataset, and it is simply defined as

    entropy = 1 - max_i[Count(label_i)] / len(dataset)

where i = 1..n and n is the number of distinct labels in the dataset.

57 Performance of all Algorithms on all Models where i = 1..n and n =number of distinct labels 100.00% in the dataset. Table 7 shows 10 of the 30 lowest 95.00% entropy datasets that were removed by this pro- 90.00% 85.00% cess. 80.00% LDA KNN 75.00% DTC SVM wordkey Counts MCCscore Label(count) Accuracies MNB e(2) 5476 99.78% e(12);` e(5464) 70.00% anyi(2) 5390 99.63% anyi(5370); anyi` (20) 65.00% . . Baseline ma(2) 6713 99.61% ma(6687); ma(26)` 60.00% ike(2) 3244 99.54% ike(15);` ike(3229) 55.00% unu(2) 8662 99.53% unu(41);` unu(8621) 50.00% FS1 FS2 FS3 FS4 FS5 FS6 FS7 FS8 FS9 FS10 FS11 FS12 FS13

Ha(2) 2266 99.29% Ha(16);` Ha(2250) Models a(2) 12275 99.10% a(12165); a(110)` onye(2) 8937 98.87% onye(8836); onye(101)` Figure 2: Evaluation of algorithm performance on ohu(2) 790 98.73% ohu(780); o.hu.(10) each feature set model eze(2) 2633 98.14% eze(2584); eze(49)´

Table 7: Low entropy datasets dataset on the overall performance depends on the weight of the dataset. Each dataset is assigned a At end of these pruning processes, our remain- weight corresponding to the number of instances ing datasets came to 110 with the distribution as it generates from the corpus which is determined follows: by its frequency of occurrence. So the actual performance of each learning al- datasets with only 2 variants: = 93 • gorithm, on a particular feature set model, is the overall weighted average of the its performances datasets with 3 variants: = 7 • across all the 110 datasets. The bottom line ac- datasets with 4 variants: = 8 curacy is the result of replacing each word with • its wordkey which gave an accuracy of 30.46%. datasets with 5 variants: = 2 • However, the actual baseline to beat is 52.79% which is achieved by always predicting the most Some datasets that originally had multiple vari- common class. ants lost some of their variants. For example, the dataset from akwa which originally had five vari- 4.5 Results and Discussions ants and 1067 instances comprising of akw´ a´ (355), The results of our experiments are as shown in Ta- akw´ a`(485), akwa(216), akw` a`(1) and akw` a´(10) re- ble 8 and Figure 2. tained only four variants (after dropping akw` a`) and 1066 instances. Models LDA KNN DTC SVM MNB Baseline: 52.79% 4.4 Classification algorithms FS1 77.65% 91.47% 94.49% 74.64% 74.64% FS2 77.65% 91.47% 94.49% 74.64% 74.64% This work applied versions of five of the com- FS3 74.48% 73.70% 84.60% 74.92% 74.64% monly used machine learning algorithms in NLP FS4 73.71% 67.18% 81.00% 74.64% 74.64% classification tasks namely: FS5 74.68% 62.48% 76.70% 74.64% 74.64% FS6 76.21% 85.98% 91.54% 71.39% 71.39% FS7 72.74% 79.20% 90.94% 71.39% 71.39% - Linear Discriminant Analysis(LDA) FS8 72.74% 79.20% 90.94% 71.39% 71.39% FS9 76.99% 73.88% 89.50% 75.46% 74.67% - K Nearest Neighbors(KNN) FS10 76.18% 85.41% 92.89% 75.11% 74.64% FS11 73.94% 74.83% 86.23% 75.29% 74.64% - Decision Trees(DTC) FS12 76.99% 73.88% 89.50% 75.46% 74.67% FS13 76.99% 73.88% 89.50% 75.46% 74.67% - Support Vector Machines(SVC) Table 8: Summary of results - Na¨ıve Bayes(MNB) The experiments indicate that on the average all Their default parameters on Scikit-learn toolkit the algorithms were able to beat the baseline on all were used with 10-fold cross-validation and the models. The decision tree algorithm (DTC) per- evaluation metrics used is mainly the accuracy of formed best across all models with an average ac- prediction of the correct diacritic form in the test curacy of 88.64% (Figure 3), and the highest accu- data. The effect of the accuracy obtained for a racy of 94.49% (Table 8) on both the FS1 and FS2

58 Standard Deviation of Algorithm Performance models. However, with an average standard devi- 0.08 ation of 0.076 (Figure 4) for its results, it appears 0.07 to be the least reliable. 0.06 As the next best performing algorithm, KNN 0.05 falls below DTC in average accuracy (91.47%) 0.04 but seems slightly more reliable. It did, however, Standard Deviation 0.03 struggle more than others as the dimension of fea- 0.02 ture n-grams increased (see its performance on 0.01

0 FS3, FS4 and FS5). This may be due to the in- LDA KNN DTC SVM MNB Algorithm crease in sparsity of features and the difficulty to find similar neighbours. The other algorithms – Figure 4: Average standard deviation for algo- LDA, SVM and MNB – just trailed behind and rithms although their results are a lot more reliable espe- Comparison of Average Performances of Models cially SVM and MNB (Figure 4). But this may 84.0% be an indication that their strategies are not explo- 82.0% rative enough. However, it could be observed that 80.0% they traced a similar path in the graph and also had 78.0% their highest results with the same set of models 76.0% 74.0% (i.e. FS9, FS12 and FS13) with wider context. Accuracy 72.0% On the models, we observed that the unigrams 70.0% and bigrams have better predictive capacity than 68.0% 66.0% the other n-grams. Most of the algorithms got FS1 FS2 FS3 FS4 FS5 FS6 FS7 FS8 FS9 FS10 FS11 FS12 FS13 comparatively good results with FS1, FS2, FS6 Models and FS10 (Figure 5) each of which has the uni- Figure 5: Average performance of models gram closest to the target word (i.e. in the 1 posi- ± tion) in the feature set. Also, models that excluded the closest unigrams on both sides (e.g. FS11) 4.6 Future Research Direction and those with fairly wider context did not per- Although our results show a substantial improve- form comparatively well across algorithms. ment from the baseline accuracy by all the algo- rithms on all the models, there is still a lot of room Comparison of Performances of Algorithms on Models for improvement. Our next experiments will in- 90.0% volve attempts to improve the results by focusing

85.0% on the following key aspects:

80.0% - Reviewing the feature set models: So far we have used instances with similar 75.0% features on both sides of the target words. In AverageAccuracy across models our next experiments, we may consider vary- 70.0% ing these features.

65.0% LDA KNN DTC SVM MNB Algorithms - Exploiting the algorithms: We were more explorative with the algo- Figure 3: Average performance of algorithms rithms and so only the default parameters of the algorithms on Scikit-learn were tested. Subsequent experiments will involve tuning Again, it appears that beyond the three closest the parameters of the algorithms and possibly unigrams (i.e. those in the 3 through +3 posi- − using more evaluation metrics. tions), the classifiers tend to be confused by ad- ditional context information. Generally, FS1 and - Expanding data size and genre: FS2 stood out across all algorithms as the best A major challenge for this research work is models while FS6 and FS7 also did well espe- lack of substantially marked corpora. So al- cially with DTC, KNN and LDA. though, we achieved a lot with the bible data,

59 it is inadequate and not very representative of Nancy Ide and Jean Veronis.´ 1998. Introduction the contemporary use of the language. Future to the special issue on word sense disambiguation: research efforts will apply more resources to The state of the art. Comput. Linguist., 24(1):2–40, March. increasing the data size across other genres. Rada F. Mihalcea and Vivi Nastase. 2002. Letter level - Predicting unknown words: learning for language independent diacritics restora- Our work is yet to properly address the prob- tion. In: Proceedings of CoNLL-2002, Taipei, Tai- lem of unknown words. We are considering a wan, pages 105–111. closer inspection of the structural patterns in Rada F. Mihalcea. 2002. Diacritics Restoration: the target word to see if they contain elements Learning from Letters versus Learning from Words. with predictive capacity. In: Gelbukh, A. (ed.) CICLing LNCS, pages 339– 348. - Broad based standardization: Ikechukwu Onyenwe, Chinedu Uchechukwu, and Beside lack of diacritics online Igbo texts Mark Hepple. 2014. Part-of-speech Tagset and Cor- are riddled with spelling errors, lack of stan- pus Development for igbo, an African Language. dard orthographic and dialectal forms, poor LAW VIII - The 8th Linguistic Annotation Work- shop., pages 93–98. writing styles, foreign words and so on. It may therefore be good to consider a broader Kevin P. Scannell. 2011. Statistical unicodification of based process that includes, not just diacritic african languages. Language Resource Evaluation, 45(3):375–386, September. restoration but other aspects of standardiza- tion. Michel Simard. 1998. Automatic Insertion of Ac- cents in French texts. Proceedings of the Third Con- - Interfacing with other NLP systems: ference on Empirical Methods in Natural Language Although it seems obvious, it will be inter- Processing, pages 27–35. esting to investigate, in empirical terms, the Dan Tufis¸and Adrian Chit¸u. 1999. Automatic Diacrit- relationship between diacritic restoration and ics Insertion in Romanian texts. In Proceedings of others NLP tasks and systems such as POS- COMPLEX99 International Conference on Compu- tagging, morphological analysis and even the tational Lexicography, pages 185–194, Pecs, Hun- gary. broader field of word sense disambiguation. Nikola Santiˇ c,´ Jan Snajder,ˇ and Bojana Dalbelo Basiˇ c,´ Acknowledgements 2009. Automatic Diacritics Restoration in Croatian Texts, pages 126–130. Dept of Info Sci, Faculty of The IgboNLP Project @ The University of Humanities and Social Sciences, University of Za- Sheffield, UK is funded by TETFund & Nnamdi greb , 2009. ISBN: 978-953-175-355-5. Azikiwe University, Nigeria. Peter W. Wagacha, Guy De Pauw, and Pauline W. Githinji. 2006. A Grapheme-based Approach to Accent Restoration in G˜ikuy˜ u.˜ In Proceedings of References the Fifth International Conference on Language Re- John Cocks and Te-Taka Keegan. 2011. A Word- sources and Evaluation, pages 1937–1940, Genoa, based Approach for Diacritic Restoration in Maori.¯ Italy, May. In Proceedings of the Australasian Language Tech- Yorick Wilks and Mark Stevenson. 1996. The gram- nology Association Workshop 2011, pages 126–130, mar of sense: Is word-sense tagging much more than Canberra, Australia, December. part-of-speech tagging? CoRR, cmp-lg/9607028. David Crandall. 2005. Automatic David Yarowsky. 1994a. A Comparison of Corpus- Accent Restoration in Spanish text. based Techniques for Restoring Accents in Spanish http://www.cs.indiana.edu/ djcran/projects/ ∼ and French Text. 
In Proceedings, 2nd Annual Work- 674 final.pdf. [Online: accessed 7-January-2016]. shop on Very Large Corpora, pages 19–32, Kyoto. Guy De Pauw, Gilles maurice De Schryver, Laurette David Yarowsky. 1994b. Decision Lists for Lexical Pretorius, and Lori Levin. 2011. Introduction to Ambiguity Resolution: Application to Accent Res- the Special Issue on African Language Technology. olution in Spanish and French Text. In Proceedings Language Resources and Evaluation, 45:263–269. of ACL-94, Las Cruces, New Mexico. Ignatius Ezeani, Mark Hepple, and Ikechukwu David Yarowsky, 1999. Corpus-based Techniques for Onyenwe, 2016. Automatic Restoration of Diacrit- Restoring Accents in Spanish and French Text, pages ics for Igbo Language, pages 198–205. Springer In- 99–120. Kluwer Academic Publishers. ternational Publishing, Cham.

60 Creating and Validating Multilingual Semantic Representations for Six Languages: Expert versus Non-Expert Crowds

Mahmoud El-Haj, Paul Rayson, Scott Piao and Stephen Wattam School of Computing and Communications, Lancaster University, Lancaster, UK [email protected]

Abstract The creation of large-scale semantic lexical re- sources is a time-consuming and difficult task. For Creating high-quality wide-coverage mul- new languages, regional varieties, dialects, or do- tilingual semantic lexicons to support mains the task will need to be repeated and then knowledge-based approaches is a chal- revised over time as word meanings evolve. In lenging time-consuming manual task. this paper, we report on work in which we adapt This has traditionally been performed by crowdsourcing techniques to speed up the creation linguistic experts: a slow and expensive of new semantic lexical resources. We evaluate process. We present an experiment in how efficient the approach is, and how robust the which we adapt and evaluate crowdsourc- semantic representation is across six languages. ing methods employing native speakers to The task that we focus on here is a particularly generate a list of coarse-grained senses un- challenging one. Given a word, each annotator der a common multilingual semantic tax- must decide on its meaning[s] and assign the word onomy for sets of words in six languages. to single or multiple tags in a pre-existing seman- 451 non-experts (including 427 Mechan- tic taxonomy. This task is similar to that under- ical Turk workers) and 15 expert partic- taken by trained lexicographers during the process ipants semantically annotated 250 words of writing or updating dictionary entries. Even for manually for Arabic, Chinese, English, experts, this is a complex task. Kilgarriff (1997) Italian, Portuguese and Urdu lexicons. In highlighted a number of issues related to lexicog- order to avoid erroneous (spam) crowd- raphers ‘lumping’ or ‘splitting’ senses of a word sourced results, we used a novel task- and cautioned that even lexicographers do not be- specific two-phase filtering process where lieve in words having a “discrete, non-overlapping users were asked to identify synonyms in set of senses”. Veronis´ (2001) showed that inter- the target language, and remove erroneous annotator agreement is very low in sense tagging senses. using a traditional dictionary. For our purpose, we use the USAS taxonomy.1 If a linguist were un- 1 Introduction dertaking this task, as they have done in the past Machine understanding of the meaning of words, with Finnish (Lofberg¨ et al., 2005) and Russian phrases, sentences and documents has challenged (Mudraya et al., 2006) USAS taxonomies, they computational linguists since the 1950s, and much would first spend some time learning the seman- progress has been made at multiple levels. Differ- tic taxonomy. In this experimental scenario, we ent types of semantic annotation have been devel- aim to investigate whether or not non-expert na- oped, such as word sense disambiguation, seman- tive speakers can succeed on the word-to-senses tic role labelling, named entity recognition, senti- classification task without being trained on the tax- ment analysis and content analysis. Common to onomy in advance, therefore mitigating a signifi- all of these tasks, in the supervised setting, is the cant overhead for the work. In addition, further requirement for a wide coverage semantic lexicon motivation for our experiments is to validate the acting as a knowledge base from which to select applicability of the USAS taxonomy (Rayson et or derive potential word or phrase level sense an- 1The UCREL Semantic Analysis System (USAS), see notations. http://ucrel.lancaster.ac.uk/usas/

61 Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pages 61–71, Valencia, Spain, April 4 2017. c 2017 Association for Computational Linguistics al., 2004), with a non-expert crowd, as a frame- a word sense from an existing list that matches work for multilingual sense representation. The provided contextual examples, such as a concor- USAS taxonomy was selected for this experiment dance list. In our work, we require the partici- since it offers a manageable coarse-grained set pants to define the list of all possible senses that a of categories that have already been applied to a word could take in different contexts. We also see number of languages. This taxonomy is distinct that our two-stage filtering process tailored for this from other potential choices, such as WordNet. task helps to improve results. We compare inter- The USAS tagset is originally loosely based on rater scores for two groups of experts and non- the Longman Lexicon of Contemporary English experts to examine the feasibility of extracting (McArthur, 1981) and has a hierarchical structure high-quality semantic lexicons via the untrained with 21 major domains (see table 1) subdividing crowd. Non-experts achieved results between 45- into three levels. Versions of the USAS tagger 97% for accuracy, between 48-92% for complete- or tagset exist in 15 languages in total and for ness, with an average of 18% of tasks having er- each language, native speakers have re-evaluated roneous senses being left in. Experts scored 64- the applicability of the tagset with some specific 96% for accuracy, 72-95% for completeness, but extensions for Chinese (Qian and Piao, 2009) but achieve better results in terms of only 1% of erro- otherwise the tagset is stable across all languages. neous senses left behind. Our experimental results For each language tagger, separate linguistic re- show that the non-expert crowdsourced annotation sources (lexicons) have been created, but they all process is of a good quality and comparable to that use the same taxonomy. of expert linguists in some cases, although there are variations across different languages. Crowd- Domain Description sourcing provides a promising approach for the A General and abstract terms speedy generation and expansion of semantic lex- B The body and the individual icons on a large scale. It also allows us to validate C Arts and crafts the semantic representations embedded in our tax- E Emotion onomy in the multilingual setting. F Food and farming G Government and public 2 Related Work H Architecture, housing and the home I Money and commerce in industry The crowdsourcing approach, in particular Me- K Entertainment, sports and games chanical Turk (MTurk), has been successfully ap- L Life and living things plied for a number of different Natural Language M Movement, location, travel and Processing (NLP) tasks. Alonso and Mizzaro transport (2009) adopted MTurk for five types of NLP tasks, N Numbers and measurement resulting in high agreement between expert gold O Substances, materials, objects and standard labels and non-expert annotations, where equipment a small number of workers can emulate an ex- P Education pert. 
With the possibility of achieving good re- Q Language and communication sults quickly and cheaply, MTurk has been tested S Social actions, states and processes for a variety of tasks, such as image annotation T Time (Sorokin and Forsyth, 2008), Wikipedia article W World and environment quality assessment (Kittur et al., 2008), machine X Psychological actions, states and translation (Callison-Burch, 2009), extracting key processes phrases from documents (Yang et al., 2009), and Y Science and technology summarization (El-Haj et al., 2010). Practical is- Z Names and grammar sues such as payment and task design play an im- portant part in ensuring the quality of the resulting Table 1: USAS top level semantic fields work. Many designers pay between $0.01 to $0.10 for a task taking a few minutes. Quality control In terms of main contributions, our research and evaluation are usually achieved through con- goes beyond the previous work on crowdsourc- fidence scores and gold-standards (Donmez et al., ing word meanings which requires workers to pick 2009; Bhardwaj et al., 2010). Past research has

62 shown (Aker et al., 2012) that the use of radio but- duce the error rate, some manual cleaning was car- ton design seems to lead to better results compared ried out, particularly on the English-Italian bilin- to the free text design. Particularly important in gual lexicons. Because of the substantial amount our case is the language demographics of MTurk of time needed for such manual work, the rest of (Pavlick et al., 2014), since we need to find enough the lexical resources were used with only minor native speakers in a number of languages. sporadic manual checking. Due to the noise intro- There is a growing body of crowdsourcing work duced from the bilingual lexicons, as well as the related to semantic annotation. Snow et al. (2008) ambiguous nature of the translation, the automati- applied MTurk to the Word Sense Disambiguation cally generated semantic lexicons for the three lan- (WSD) task and achieved 100% precision with guages contain errors, including erroneous seman- simple majority voting for the correct sense of tic tags caused by incorrect translations, and inac- the word ‘president’ in 177 example sentences. curate semantic tags caused by ambiguous trans- Rumshisky et al. (2012) derived a sense inven- lations. When these automatically generated lexi- tory and sense-annotated corpus from MTurkers cons were integrated and applied in the USAS se- comparison of senses in pairs of example sen- mantic tagger, the tagger suffered from error rates tences. They used clustering methods to iden- of 23.51%, 12.31% and 10.28% for Italian, Chi- tify the strength of coders’ tags, something that nese and Portuguese respectively. is poorly suited to rejecting work from spammers The improvement of the semantic lexicons is (participants who try to cheat the system with therefore an urgent and challenging task, and we scripts or random answers) and would likely not hypothesise that the crowdsourcing approach can transfer well to our experiment. potentially provide an effective means for address- Akkaya et al. (2010) also performed WSD us- ing this issue on a large scale, while at the same ing MTurk workers. They discuss a number of time allowing us to further validate the representa- methods for ensuring quality, accountability, and tion of word senses in the USAS sense inventory consistency using 9 tasks per word and simple ma- (i.e. the semantic tagset) for these languages. jority voting. Kapelner et al. (2012) increased the scale to 1,000 words for the WSD task and found 3 Semantic Labeling Experiment that workers repeating the task do not learn with- We test the wisdom of the crowd in building out feedback. A set-based agreement metric was lexicons and applying the same multilingual se- used by Passonneau et al. (2006) to assess the mantic representation in six languages: Arabic, validity of polysemous selections of word senses Chinese, English, Italian, Portuguese and Urdu. from WordNet categories. Their objective was to These languages were selected to provide a range take into account similarity between items within a of language families, inflectional and derivational set, however, this may not be desirable in our case morphology, while covering significant number of due to the limited depth of the USAS taxonomy. speakers worldwide. For each language, we ran- Directly related to our research here are the domly selected 250 words. All experiments pre- experiments reported in Piao et al. (2015). 
A sented here use the USAS taxonomy to describe set of prototype semantic lexicons were auto- semantic categories3 (Rayson et al., 2004). matically generated by transferring semantic tags from the existing USAS English semantic lex- 3.1 Gold Standard Semantic Tags icon entries to their translation equivalents in To prepare gold standard data we asked a group Italian, Chinese and Portuguese via dictionaries of linguists (up to two per language) to manu- and bilingual lexicons. While some dictionar- ally check 250 randomly selected words for each ies involved, including Chinese/English and Por- of the six languages, starting from the data pro- tuguese/English dictionaries, provided high qual- vided by Piao et al. (2015). For additional lan- ity lexical translations for core vocabularies of guages (those not in Piao et al. (2015)), the Arabic these languages, the bilingual lexicons, including and Urdu gold standards were completed manu- FreeLang English/Italian, English/Portuguese lex- ally by native speaker linguists who translated the icons2 and LDC English/Chinese word list, con- 250 words, and we instructed the translators to opt tain erroneous and inaccurate translations. To re- for the most familiar Arabic or Urdu equivalents

2http://www.freelang.net/dictionary/ 3http://ucrel.lancaster.ac.uk/usas/semtags.txt

63 of the English words. This was further confirmed the non-experts. All expert and non-expert partici- by checking the list of words by two other Arabic pants (whether MTurk workers or direct-contact) and Urdu native speakers. Here the base form of used the same user interface (Figure 1) as de- verbs in Arabic is taken to be the present simple scribed in section 3.5. form of the verb in the interest of convenience and because there is no part of speech tag for ‘present 3.4 Experimental Design tense of a lexical verb’. Hence, the three-letter Obtaining reliable results from the crowd remains past tense verbs are tagged as ‘past form of lex- a challenging task (Kazai et al., 2009), which ical verb’, rather than as base forms. Also, while requires a careful experimental design and pre- present and past participles (e.g., ‘interesting’, ‘in- selection of crowdsourcers. In our experiments, terested’) are tagged as adjectives in English, these we worked on minimising the effort required by are labeled in Arabic as nouns in stand-alone po- participants through designing a user-friendly in- sitions but they can also function as adjective pre- terface.7 Aside from copying the final output code modifying nouns. Linguists then used the USAS to a text-box everything else is done using mouse semantic tagset to semantically label each word clicks. Poorly designed experiments can nega- with the most suitable senses. tively affect the quality of the results conducted by MTurk workers (Downs et al., 2010; Welinder 3.2 Non-expert Participants et al., 2010; Kazai, 2011). Non-expert participants are defined as those who Feedback from a short sample run with local are not familiar with the USAS taxonomy in ad- testers helped us update the interface and provide vance of the experiment. We selected Amazon’s more information to make the task efficient. Fig- Mechanical Turk4 – an online marketplace for ure 1 shows a sample task for the word ‘car’. The work that requires human intelligence – and pub- majority of the testers were able to complete the lished “Human Intelligence Tasks” (HITs) for En- task within five minutes. In response to feedback glish, Chinese and Portuguese only. For Ara- by some of the testers we provided the “Instruc- bic, Italian, and Urdu initial experiments using tions” section in the six languages under consider- MTurk showed that not enough native speakers ation. are available to complete the tasks. Therefore, 用户指南 Information Contact ﺍﻟﺗﻌﻠﻳﻣﺎﺕ Instructions Istruzioni we employed 12 non-expert participants directly Please label the word below with a number of tags that represent its meaning. Attach as many (no more than 10), or as few, as you deem appropriate for all senses of the word, placing them in with four native speakers for each of the three lan- descending order of importance. Tags can be assigned in a (positive or negative manner (for example, "occasionally" is tagged negatively for frequency guages. All participants used the same interface To assign a tag, click on the add tag button, and select a box from the category selection. This will add (Figure 1). an entry in the list, that .can then be sorted so that the most commonly used tag is at the top Please remove any unrelated tags and make sure you do not exceed 10 tags in total To help you with identifying common senses of a word, we have provided a number of links to On MTurk, we paid workers an average of 7 US dictionaries, thesauri, and corpora (where you can see real­world usage). 
The References button will dollars per hour. We paid Portuguese workers 50% bring these up over the top of other things, so you can still browse the tags Example: "Jordan" could be : 'Geographical name' or 'Personal name' more to try and attract more participants due to Please submit your selections by clicking in the Submit button at the end of the page and wait until you receive a confirmation message where you need to copy the output­code and provide it to us, as in the lack of Portuguese native speakers on MTurk. the following example: f962bed5616612c8c7053f6e97e72b129edb4990-b5f5-47ab-8e7e- bfc61c6515ec We paid the other directly contacted participants an average of 8 British pounds per hour. Those Term to Classify car payments were made using Amazon5 and Apple 6 Tags iTunes vouchers. Please add tags to describe the term, arranging them from the most commonly used sense to the least. Movement/transportation: land 3.3 Expert Participants American football Add Tag References Submit Expert participants are defined as those who were already familiar with the USAS taxonomy before Figure 1: Sample Task for the word “Car” the experiments took place. For four languages (Arabic, English, Chinese and Urdu) we asked a total of 15 participants (3 for English and 4 for 3.5 Online Semantic Labeling the other languages) to carry out the same task as As shown in Figure 1, we asked the participants to label each word presented to them with a number 4http://www.mturk.com 5https://www.amazon.co.uk 7Our underlying code is freely available on GitHub at 6http://www.apple.com https://github.com/UCREL/BLC-WSD-Frontend

64 Anatomy and Health and disease References physiology Terms relating to the (state of Terms relating to the (human) the) physical condition Click on any of the following links for more body and bodily processes information about the word: happy

+ Select ‒ Select + Select ‒ Select  dictionary.com  thesaurus.com Clothes and personal Cleaning and personal  British National Corpus belongings care

Terms relating to clothes Terms relating to and other personal belongings domestic/personal hygiene hide + Select ‒ Select + Select ‒ Select

Figure 2: Dictionary and Thesauri References Figure 4: Subcategories of tags that represent the word’s possible mean- the page and wait until they receive a confirmation ings. The participants were asked to attach as message where they need to copy the output-code many, or as few, as they deemed appropriate for and provide it to us. all senses of the word, placing them in descending For each word we targeted a total of four non- order of likelihood. expert participants and four expert participants to To assign a tag, the participants click on the allow measurement and comparison of the agree- Add Tag button, and navigate to a box from the ment within each group to investigate the variabil- category selection where they can select a subcat- ity of task results and participants, rather than to egory (Figures 3 and 4). By following these steps take a simple weighted combination to produce an the participants add an entry in the list, that can agreed list. then be sorted by dragging and dropping the se- lected tags so that the most commonly used tag is 3.6 Filtering at the top. We asked the users to remove any un- Even though crowdsourcing has been shown to related tags and make sure they do not exceed 10 be effective in achieving expert quality for a va- tags in total. riety of NLP tasks (Snow et al., 2008; Callison- Burch, 2009), we still needed to filter out work- General & Abstract The Body & Arts & Crafts ers who were not taking the task seriously or were Terms Individual 1 subcategories attempting to manipulate the system for personal 15 subcategories 5 subcategories gain (spamming).

Numbers & In order to avoid these spamming crowdsourced Life & Living Things Education Measurement results, we designed a novel task-specific two 3 subcategories 1 subcategories 6 subcategories stage filtering process that we considered more suitable for this type of task than previous filtering Science & Money & Time Technology Commerce approaches. Our two stage process encompasses 4 subcategories 2 subcategories 4 subcategories filters that are appropriate for experts and non- experts, and is applicable whether participants are using MTurk or not. Figure 3: Categories In stage one filtering, we asked the MTurk workers to select the correct synonym of the pre- To help them when identifying common senses sented word from a list of noisy candidates in or- of a given word, we provided a number of links to der to avoid rejection of their HITs. The list con- dictionaries, thesauri, and corpora (where they can tained four words where only one word correctly see real-world usage) for each language. The Ref- fitted as a synonym. In order to set up the first erences are displayed alongside the interface, so filtering task for MTurk workers (on English, Chi- they can still browse the tags (Figure 2). Partici- nese and Portuguese tasks), we used Multilingual pants are free to use other resources as they see fit. WordNet to obtain the most common synonym for The participants then needed to submit their selec- each word. The stage one filtering was not needed tions by clicking the Submit button at the end of for Arabic, Italian and Urdu, since these non-

65 expert participants were directly contacted and we whether the gold standard tags (GT ags) are com- knew that they were native speakers and would not pletely or partially contained within the worker’s submit random results. The synonyms were vali- selection. To compute Completeness we divide the dated by linguists in each of the three languages number of matching tags by the number of gold and choices were randomly shuffled before be- standard tags. ing presented to the workers. Stage one filtering W G removed 12% of English HITS, 2% of Chinese, Completeness = T ags ∩ T ags and 6% of the Portuguese submissions. In our re- GT ags sults presented below, we only considered tasks by Correlation: To test the similarity of tag selec- workers who chose the correct synonyms and re- tion between workers and gold standards we used jected the others. Spearman’s rank correlation coefficient. For the stage two filter, we injected random er- roneous senses for each of the 250 words into the In addition to the three metrics mentioned above initial list of tags and the participants were ex- we used three factors that work as indicators of the pected to remove these in order to pass. We de- quality of the tagging process: liberately injected wrong and unrelated semantic Strict: Whether worker’s tags are identical to tags in between ‘potentially’ correct ones before • the gold standard (same tags in the same or- shuffling the order of the tags. For example, ex- der); amining the pre-selected tags for the word ‘car’ in Figure 1 we can see that the semantic tag ‘Ameri- First Tag Correct: Whether the first tag se- • can Football’ is unrelated to the word ‘car’ and in lected by the worker matches the first tag in fact does not exist in the USAS semantic taxon- the gold standard; omy. The potentially correct tag such as ‘Move- ment/transportation: land’ does exist in the se- Fuzzy: Whether tags selected by a worker are • mantic lexicon. Results where participants fail to contained within the gold standard tags (in pass stage two are still retained in the experiment any order). and we report on the usefulness of this filter in section 4. All participants (MTurk workers and For each language we asked for up to four an- directly-contacted; experts and non-experts) un- notators per word (1,000 HITs per language). For dertook stage 2 filtering. Our experimental design Portuguese, where the participants were all from did not reveal to the participants any details of the MTurk, we only received 694 HITs even though two stage filtering process. we paid participants working on Portuguese 50% more than we paid for the English tasks. 4 Results and Discussion Table 29 shows the aggregate averages of the To evaluate the results8 we adopted three main non-expert HITs. In total, direct-contact partic- metrics (inspired by those used in the SemEval ipants and MTurk workers performed well and procedures): Accuracy, Completeness, and Corre- achieved comparable results to the gold standard lation. in places. Around 50% of the English, Chinese, Accuracy: measure the accuracy of the partic- Urdu, and Portuguese HITs had the correct tags se- lected with around 15% being identical to the gold ipant’s selection of tags (WT ags) by counting the matching tags between the worker’s selection and standards. In nearly all of the cases, Portuguese workers chose the correct tags although they were the gold standards (GT ags). To compute Accu- racy we divide the number of matching tags by the in a different order than the gold standard. 
Arabic number of tags selected by the participants. participants achieved a high completeness score relative to the gold standard tags, but a close anal- W G Accuracy = T ags ∩ T ags ysis of the results show the participants have sug- WT ags gested more tags than the gold standards. Completeness: measure the completeness of the The results for English suggest that the non- participant’s selection of tags (WT ags) by finding expert workers are consistent as can be observed 8All expert and non-expert results are available as 9For this and the following tables, we use: Acc: Accuracy, CSV files on https://github.com/UCREL/Multilingual- Com: Completeness, Corr: Spearman’s, Err: Erroneous tags, USAS/tree/master/papers/eacl2017 sense workshop Str: Strict, 1st: first tag correct, Fuz: Fuzzy.

66 Lang Acc Com Cor Err Str 1st Fuz in knowing the exact sense of an out of context En 61.4 69.3 0.38 29% 16% 65% 38% Arabic word could result in disagreement when it Ar 55.5 87.1 0.35 8% 8% 55% 19% comes to ordering the senses (see Section 3.1). Zh 45.2 56.1 0.22 2% 15% 46% 27% For the Chinese language, the result table shows It 45.7 47.9 0.06 31% 7% 38% 22% that there is a slightly lower correlation between Pt 58.5 56.3 0.21 18% 19% 50% 94% the non-expert workers’ tags and the gold stan- Ur 97.6 91.9 0.51 1% 53% 78% 95% dards. Observing the erroneous results column we can see that the workers have made very few mis- Table 2: Summary of performance [Non experts] takes and deleted the random unrelated tags. The Strict and Fuzzy scores suggest the results to be of high quality. The participants managed to get the by looking at the Accuracy (Acc) and Complete- first tag correct in many cases. ness (Com) results. Spearman’s correlation (Cor) suggests that the workers correlate with the expert For the Italian language, we sourced four non- gold standard tags, which is consistent with previ- expert undergrad student participants who are all ous findings that MTurk is effective for a variety of native Italian speakers but not familiar with the NLP tasks through achieving expert quality (Snow tagset. The participants’ results do not correlate et al., 2008; Callison-Burch, 2009). The majority well with the gold standards. As the tags descrip- of the workers matched the first tag correctly (1st) tion are all in English the annotators found it diffi- by ordering the tags so the most important (core cult to correctly select the senses and had to trans- sense) tag appeared at the top of their selection. late some tags into Italian which could have re- The erroneous tags (Err) column shows that many sulted in shifting the meaning of those tags, to workers did not remove some of the deliberately- communicate with them we had an Italian/English wrong tags (see Section 3.5). This reflects the lack bilingual linguist as a mediator. of training of the workers, but our checking of the As with Arabic and Italian, for the Urdu lan- results showed that the erroneous tags were not se- guage we sourced four non-expert participants lected as first choice. Strict (Str) and Fuzzy (Fuz) who are all native Urdu speakers but not familiar show that many workers were consistent with the with the tagset. Urdu results show that participants gold standard tags in terms of both tag selection correlate well with the gold standards. We also and order. It is worth mentioning that languages notice a lower percentage of erroneous tags than differ in terms of ambiguity (e.g. Urdu is less am- other languages. The First Tag and Fuzzy scores biguous than Arabic) which can be observed in the suggest the results to be of high quality. The par- differences between language results. ticipants also managed to get the first tag correct in As mentioned earlier, we did not use MTurk for many cases. The participants all agreed it was easy the Arabic lexicon, due to the lack of Arabic na- to define word senses with the words being less tives speakers on MTurk. Instead, we found four ambiguous compared to other languages. This is student volunteers who offered to help in seman- shown in the high results achieved when compared tically tagging the words, again without any train- to non-experts of the other languages. ing on the tagset. The results show consistent ac- We received only 694 HITs for Portuguese tasks curacy and completeness. 
It is worth noting that on MTurk, which suggests there are fewer Por- the Arabic participants obtained higher accuracy tuguese speakers compared to English and Chi- and completeness scores by having higher agree- nese speakers among the MTurk workers. The ment with the gold standard tags. The Arabic lan- results for Portuguese in some cases are of very guage participants selected fewer erroneous tags high quality. It should be noted that the gold stan- than the English ones. The majority of the par- dard tags were selected and manually checked by a ticipants got the first tag correct. Arabic partici- Brazilian Portuguese native speaker expert. There pants failed to match the order of tags in the gold is a difference between European and Brazil- standards as reflected by lower correlation. This ian Portuguese which could result in ambiguous is expected due to the fact that Arabic is highly words for speakers from the two regions (Frota inflectional and derivational, which increases am- and Vigario,´ 2001). biguity and presents a challenge to the interpreta- Table 3 shows the results obtained by using the tion of the words (El-Haj et al., 2014). Difficulties second filtering mechanism to discard HITs where

67 Lang Acc Com Cor Lang Acc Com Cor Err Str 1st Fuz En 70.4 69.0 0.36 En 66.1 83 0.61 1% 31% 75% 40% Ar 56.6 87.6 0.34 Ar 78.8 72.4 0.22 1% 39% 51% 73% Zh 45.6 55.9 0.22 Zh 50.4 60.2 0.21 1% 15% 44% 31% It 54.4 53.5 0.09 Ur 96.2 94.8 0.69 1% 63% 89% 93% Pt 61.3 54.0 0.20 Table 4: Summary of performance [Experts] Ur 97.6 91.9 0.51 Language Measure OA Fleiss K–alpha Table 3: Summary of performance with Second English First Tag 0.82 0.46 0.46 Filter [Non experts] Fuzzy 0.64 0.27 0.27 Strict 0.69 0.32 0.32 random erroneous tags were not completely re- Arabic First Tag 0.77 0.55 0.55 moved. This enables us to increase accuracy for Fuzzy 0.84 0.59 0.59 English by 9.0%, Italian by 8.7% and Portuguese Strict 0.21 0.55 0.55 by 2.8% without negatively affecting complete- Chinese First Tag 0.62 0.23 0.24 ness or correlation. Fuzzy 0.75 0.41 0.41 Strict 0.83 0.31 0.32 In order to allow better interpretation of the non- Urdu First Tag 0.83 0.10 0.10 experts’ scores, we repeated the experiments on Fuzzy 0.91 0.35 0.35 a smaller scale with up to four experts per lan- Strict 0.71 0.37 0.38 guage (English, Arabic, Chinese and Urdu), who were already familiar with the USAS taxonomy Table 5: Total Inter-rater agreement [Experts]. and were researchers in the fields of corpus or computational linguistics. Experts used the same task interface to assign senses to 50 words each. overall. This shows that the novel two-phase fil- The results are presented in Table 4. Most notably, tering method used in our experiment is effective experts consistently excel at removing erroneous for maintaining the quality of the results. tags, leaving only a very small number in the data. 5 Conclusion and Future Work English experts performed much better than English non-experts on completeness, correlation In order to accelerate the task of creating multilin- and strict measures while their accuracy scores are gual semantic lexicons with coarse-grained word comparable. Arabic experts performed much bet- senses using a common multilingual semantic rep- ter than Arabic non-experts on the accuracy, strict resentation scheme, we employed non-expert na- and fuzzy scores while the 1st score is compa- tive speakers via MTurk who were not trained rable. Chinese experts performed slightly better with the semantic taxonomy. Overall, the non- than Chinese non-experts on Accuracy, complete- expert participants semantically tagged 250 words ness and Fuzzy while other scores were compa- in each of six languages: Arabic, Chinese, En- rable. Urdu experts scored relatively more highly glish, Italian, Portuguese and Urdu. We analysed on strict and 1st measures while other scores were the results using a number of metrics to consider comparable to Urdu non-experts. Finally, Tables 5 the correct likelihood order of tags relative to a and 6 show the Observed Agreement (OA), Fleiss’ gold-standard, along with correct removal of ran- Kappa and Krippendorff’s alpha scores for the dom erroneous semantic tags, and completeness inter-rater agreement between Expert and Non Ex- of tag lists. Crowdsourcing has been applied suc- pert participants. According to (Landis and Koch, cessfully for other NLP tasks in previous research, 1977) our inter-rater scores show fair agreement and we build on previous success in WSD tasks in between annotators. This serves to illustrate the three ways. Firstly, we have specific requirements task is complex even for experts. 
for semantic tagging purposes in terms of placing Overall these results show that untrained crowd- coarse-grained senses into a semantic taxonomy sourcing workers can produce results that are com- rather than stand-alone definitions. Hence, our ex- parable to those of experts when performing se- perimental set-up allows us to validate the sense mantic annotation tasks. Directly-contacted and inventory in a multilingual setting, carried out here MTurk workers achieved similar levels of results for six languages. Secondly, we extend the usual

68 Language Measure OA Fleiss K–alpha dances, (b) prototypical examples for each seman- English First Tag 0.71 0.36 0.36 tic tag, or (c) semantic tag labels in the same lan- Fuzzy 0.58 0.11 0.11 guage as the task word, as part of the resources Strict 0.79 0.20 0.20 available to participants would further enhance the Arabic First Tag 0.66 0.32 0.32 accuracy of the crowdsourcing annotation process. Fuzzy 0.71 0.05 0.05 Strict 0.86 0.03 0.03 Acknowledgements Chinese First Tag 0.73 0.45 0.45 Fuzzy 0.74 0.38 0.38 The crowdsourcing framework and interface was Strict 0.85 0.41 0.41 originally produced in an EPSRC Impact Accel- Italian First Tag 0.67 0.31 0.31 eration Award at Lancaster University in 2013. Fuzzy 0.67 0.03 0.03 This work has also been supported by the Strict 0.89 0.12 0.13 UK ESRC/AHRC-funded CorCenCC (The Na- Portuguese First Tag 0.64 0.22 0.22 tional Corpus of Contemporary Welsh) Project Fuzzy 0.63 0.13 0.13 (ref. ES/M011348/1), the ESRC-funded Centre Strict 0.80 0.18 0.18 for Corpus Approaches to Social Science (ref. Urdu First Tag 0.72 0.16 0.16 ES/K002155/1), and by the UCREL research cen- Fuzzy 0.95 0.45 0.45 tre, Lancaster University, UK via Wmatrix soft- Strict 0.74 0.49 0.49 ware licence fees. We would also like to ex- Table 6: Total Inter-rater agreement [Non Ex- press our thanks to the many expert and non-expert perts]. participants who were also involved in the ex- periments including: Dawn Archer (Manchester Metropolitan University, UK), Francesca Bianchi classification task of putting a word into one of (Universita del Salento, Italy), Carmen Dayrell an existing list of senses, instead asking partici- (Lancaster University, UK), Xiaopeng Hu (China pants to list all possible senses that a word could Center for Information Industry Development, take in different contexts. Thirdly, we have de- CCID, China), Olga Mudraya (Right Word Up, ployed a novel two-stage filtering approach which UK), Sheryl Prentice (Lancaster University, UK), has been shown to improve the quality of our re- Yuan Qi (China Center for Information Industry sults by filtering out spam responses using a sim- Development, CCID, China), Jawad Shafi (COM- ple synonym recognition task as well as HITs re- SATS Institute of Information technology, Pak- moving random erroneous tags. Our experiment istan), and May Wong (University of Hong Kong, suggests that the crowdsourcing process can pro- Hong Kong, China). duce results of good quality and is comparable to the work done by expert linguists. We showed that it is possible for native speakers to apply the hier- References archical semantic taxonomy without prior training Ahmet Aker, Mahmoud El-Haj, Udo Kruschwitz, and by the application of a graphical browsing inter- M-Dyaa Albakour. 2012. Assessing Crowdsourc- face to assist selection and annotation process. ing Quality through Objective Tasks. In 8th Lan- In the future, we will apply the method on a guage Resources and Evaluation Conference, Istan- bul, Turkey. LREC 2012. larger scale to the full semantic lexicons includ- ing multiword expressions, which are important Cem Akkaya, Alexander Conrad, Janyce Wiebe, and for contextual semantic disambiguation. We will Rada Mihalcea. 2010. Amazon mechanical turk also investigate whether adaptations to our method for subjectivity word sense disambiguation. In Pro- are required to include more languages such as ceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Ama- Czech, Malay and Spanish. 
In order to pursue the zon’s Mechanical Turk, pages 195–203, Los Ange- work beyond the existing languages in the USAS les, USA. Association for Computational Linguis- system, we will extend bootstrapping methods re- tics. ported in Piao et al. (2015) with vector-based tech- Omar Alonso and Stefano Mizzaro. 2009. Can we get niques and evaluate their appropriateness for mul- rid of TREC Assessors? Using Mechanical Turk for tiple languages. Finally, we will test whether Relevance Assessment. In SIGIR ’09: Workshop on (a) provision of words in context through concor- The Future of IR Evaluation, Boston, USA.

69 Vikas Bhardwaj, Rebecca J. Passonneau, Ansaf Salleb- Gabriella Kazai. 2011. In search of quality in crowd- Aouissi, and Nancy Ide. 2010. Anveshan: A frame- sourcing for search engine evaluation. In Paul work for analysis of multiple annotators’ labeling Clough, Colum Foley, Cathal Gurrin, Gareth J.F. behavior. In Proceedings of the Fourth Linguistic Jones, Wessel Kraaij, Hyowon Lee, and Vanessa Annotation Workshop, LAW IV ’10, pages 47–55, Mudoch, editors, Advances in Information Re- Stroudsburg, PA, USA. Association for Computa- trieval, volume 6611 of Lecture Notes in Computer tional Linguistics. Science, pages 165–176. Springer Berlin Heidel- berg. Chris Callison-Burch. 2009. Fast, Cheap, and Cre- ative: Evaluating Translation Quality Using Ama- Adam Kilgarriff. 1997. “I Don’t Believe in Word zon’s Mechanical Turk. In Proceedings of the 2009 Senses”. Computers and the Humanities, 31(2):91– Conference on Empirical Methods in Natural Lan- 113. guage Processing (EMNLP), pages 286–295, Singa- pore. Association for Computational Linguistics. Aniket Kittur, Ed H. Chi, and Bongwon Suh. 2008. Crowdsourcing user studies with mechanical turk. Pinar Donmez, Jaime G. Carbonell, and Jeff Schneider. In Proceedings of the SIGCHI Conference on Hu- 2009. Efficiently learning the accuracy of labeling man Factors in Computing Systems, CHI ’08, pages sources for selective sampling. In Proceedings of 453–456, New York, NY, USA. ACM. the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, J.Richard Landis and Gary Koch. 1977. The mea- pages 259–268, New York, NY, USA. ACM. surement of observer agreement for categorical data. Biometrics, 33(1):159–174. Julie S. Downs, Mandy B. Holbrook, Steve Sheng, and Lorrie Faith Cranor. 2010. Are your partic- Laura Lofberg,¨ Scott Piao, Asko Nykanen, Krista ipants gaming the system?: Screening mechanical Varantola, Paul Rayson, and Jukka-Pekka Juntunen. turk workers. In Proceedings of the SIGCHI Con- 2005. A semantic tagger for the Finnish language. ference on Human Factors in Computing Systems, In Proceedings of Corpus Linguistics 2005, Birm- CHI ’10, pages 2399–2402, New York, NY, USA. ingham, UK. ACM. Tom McArthur. 1981. Longman Lexicon of Contem- Mahmoud El-Haj, Udo Kruschwitz, and Chris Fox. porary English. Longman, London, UK. 2010. Using Mechanical Turk to Create a Cor- pus of Arabic Summaries. In Language Resources Olga Mudraya, Bogdan Babych, Scott Piao, Paul (LRs) and Human Language Technologies (HLT) for Rayson, and Andrew Wilson. 2006. Developing Semitic Languages workshop held in conjunction a Russian semantic tagger for automatic semantic with the 7th International Language Resources and annotation. In Proceedings of Corpus Linguistics Evaluation Conference (LREC 2010)., pages 36–39, 2006, pages 290–297, St. Petersburg, Russia. Valletta, Malta. LREC 2010. Rebecca Passonneau, Nizar Habash, and Owen Ram- Mahmoud El-Haj, Udo Kruschwitz, and Chris Fox. bow. 2006. Inter-annotator agreement on a 2014. Creating language resources for under- multilingual semantic annotation task. In Pro- resourced languages: methodologies, and experi- ceedings of the Fifth International Conference on ments with Arabic. Language Resources and Eval- Language Resources and Evaluation (LREC’06), uation, pages 1–32. Genoa, Italy. European Language Resources Asso- ciation (ELRA). Sonia´ Frota and Marina Vigario.´ 2001. 
On the correlates of rhythmic distinctions: The Euro- Ellie Pavlick, Matt Post, Ann Irvine, Dmitry Kachaev, pean/Brazilian Portuguese case. pages 247–275. and Chris Callison-Burch. 2014. The language de- mographics of Amazon Mechanical Turk. Transac- Adam Kapelner, Krishna Kaliannan, H.Andrew tions of the Association for Computational Linguis- Schwartz, Lyle Ungar, and Dean Foster. 2012. tics, 2:79–92. New insights from coarse word sense disambigua- tion in the crowd. In Proceedings of COLING 2012: Scott Piao, Francesca Bianchi, Carmen Dayrell, An- Posters, pages 539–548, Mumbai, India. The COL- gela D’Egidio, and Paul Rayson. 2015. Develop- ING 2012 Organizing Committee. ment of the multilingual semantic annotation sys- tem. In Proceedings of North American Chapter Gabriella Kazai, Natasa Milic-Frayling, and Jamie of the Association for Computational Linguistics – Costello. 2009. Towards methods for the collec- Human Language Technologies Conference, Den- tive gathering and quality control of relevance as- ver, USA. (NAACL HLT 2015). sessments. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Develop- Yufang Qian and Scott Piao. 2009. The development ment in Information Retrieval, SIGIR ’09, pages of a semantic annotation scheme for chinese kinship. 452–459, New York, NY, USA. ACM. Corpora, 4(2):189–208.

70 Paul Rayson, Dawn Archer, Scott Piao, and Anthony McEnery. 2004. The UCREL semantic analysis system. In Proceedings of the Beyond Named En- tity Recognition Semantic Labelling for NLP tasks workshop, pages 7–12, Lisbon, Portugal. Aanna Rumshisky, Nick Botchan, Sophie Kushkuley, and James Pustejovsky. 2012. Word sense inven- tories by non-experts. In Nicoletta Calzolari (Con- ference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uur Doan, Bente Maegaard, Joseph Mar- iani, Jan Odijk, and Stelios Piperidis, editors, Pro- ceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), pages 4055–4059, Istanbul, Turkey, May. European Language Resources Association (ELRA).

Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Nat- ural Language Tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Lan- guage Processing, pages 254–263. Association for Computational Linguistics. Alexander. Sorokin and David Forsyth. 2008. Utility data annotation with amazon mechanical turk. In In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8. Jean Veronis.´ 2001. Sense tagging: does it make sense? In Proceedings of Corpus Linguistics 2001, Lancaster, UK. UCREL. Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona. 2010. The Multidimensional Wis- dom of Crowds. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, ed- itors, Advances in Neural Information Processing Systems 23, pages 2424–2432. Yin Yang, Nilesh Bansal, Wisam Dakka, Panagio- tis Ipeirotis, Nick Koudas, and Dimitris Papadias. 2009. Query by document. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM ’09, pages 34–43, New York, NY, USA. ACM.

Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation

Alexander Panchenko‡, Stefano Faralli†, Simone Paolo Ponzetto†, and Chris Biemann‡

‡Language Technology Group, Computer Science Dept., University of Hamburg, Germany
†Web and Data Science Group, Computer Science Dept., University of Mannheim, Germany
{panchenko,biemann}@informatik.uni-hamburg.de
{faralli,simone}@informatik.uni-mannheim.de

Abstract

We introduce a new method for unsupervised knowledge-based word sense disambiguation (WSD) based on a resource that links two types of sense-aware lexical networks: one is induced from a corpus using distributional semantics, the other is manually constructed. The combination of two networks reduces the sparsity of sense representations used for WSD. We evaluate these enriched representations within two lexical sample sense disambiguation benchmarks. Our results indicate that (1) features extracted from the corpus-based resource help to significantly outperform a model based solely on the lexical resource; (2) our method achieves results comparable to or better than four state-of-the-art unsupervised knowledge-based WSD systems, including three hybrid systems that also rely on text corpora. In contrast to these hybrid methods, our approach does not require access to web search engines, texts mapped to a sense inventory, or machine translation systems.

1 Introduction

The representation of word senses and the disambiguation of lexical items in context is an ongoing, long-established branch of research (Agirre and Edmonds, 2007; Navigli, 2009). Traditionally, word senses are defined and represented in lexical resources, such as WordNet (Fellbaum, 1998), while more recently, there is an increased interest in approaches that induce word senses from corpora using graph-based distributional approaches (Dorow and Widdows, 2003; Biemann, 2006; Hope and Keller, 2013), word sense embeddings (Neelakantan et al., 2014; Bartunov et al., 2016) and a combination of both (Pelevina et al., 2016). Finally, some hybrid approaches emerged, which aim at building sense representations using information from both corpora and lexical resources, e.g. (Rothe and Schütze, 2015; Camacho-Collados et al., 2015a; Faralli et al., 2016). In this paper, we further explore the last strand of research, investigating the utility of hybrid sense representations for the word sense disambiguation (WSD) task.

In particular, the contribution of this paper is a new unsupervised knowledge-based approach to WSD based on the hybrid aligned resource (HAR) introduced by Faralli et al. (2016). The key difference of our approach from prior hybrid methods based on sense embeddings, e.g. (Rothe and Schütze, 2015), is that we rely on sparse lexical representations that make the sense representation readable and allow us to straightforwardly use this representation for word sense disambiguation, as will be shown below. In contrast to hybrid approaches based on sparse interpretable representations, e.g. (Camacho-Collados et al., 2015a), our method requires no mapping of texts to a sense inventory and thus can be applied to larger text collections. By linking symbolic distributional sense representations to lexical resources, we are able to improve representations of senses, leading to performance gains in word sense disambiguation.

2 Related Work

Several prior approaches combined distributional information extracted from text (Turney and Pantel, 2010) with information available in lexical resources, such as WordNet. Yu and Dredze (2014) proposed a model to learn word embeddings based on lexical relations of words from WordNet and PPDB (Ganitkevitch et al., 2013). The objective function of their model

72 Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pages 72–78, Valencia, Spain, April 4 2017. c 2017 Association for Computational Linguistics combines the objective function of the skip-gram All these diverse contributions indicate the ben- model (Mikolov et al., 2013) with a term that takes efits of hybrid knowledge sources for learning into account lexical relations of a target word. word and sense representations. Faruqui et al. (2015) proposed a related approach that performs a post-processing of word embed- 3 Unsupervised Knowledge-based WSD dings on the basis of lexical relations from the using Hybrid Aligned Resource same resources. Pham et al. (2015) introduced an- We rely on the hybrid aligned lexical semantic re- other model that also aim at improving word vec- source proposed by Faralli et al. (2016) to perform tor representations by using lexical relations from WSD. We start with a short description of this re- WordNet. The method makes representations of source and then discuss how it is used for WSD. synonyms closer than representations of antonyms of the given word. While these three models im- 3.1 Construction of the Hybrid Aligned prove the performance on word relatedness eval- Resource (HAR) uations, they do not model word senses. Jauhar et al. (2015) proposed two models that tackle this The hybrid aligned resource links two lexical shortcoming, learning sense embeddings using the semantic networks using the method of Faralli word sense inventory of WordNet. et al. (2016): a corpus-based distributionally- induced network and a manually-constructed net- Iacobacci et al. (2015) proposed to learn work. Sample entries of the HAR are presented sense embeddings on the basis of the BabelNet in Table 1. The corpus-based part of the resource, lexical ontology (Navigli and Ponzetto, 2012). called proto-conceptualization (PCZ), consists of Their approach is to train the standard skip- sense-disambiguated lexical items (PCZ ID), dis- gram model on a pre-disambiguated corpus us- ambiguated related terms and hypernyms, as well ing the Babelfy WSD system (Moro et al., 2014). as context clues salient to the lexical item. The NASARI (Camacho-Collados et al., 2015a) re- knowledge-based part of the resource, called lies on Wikipedia and WordNet to produce vec- conceptualization, is represented by synsets of tor representations of senses. In this approach, the lexical resource and relations between them a sense is represented in lexical or sense-based (WordNet ID). Each sense in the PCZ network is feature spaces. The links between WordNet and subsequently linked to a sense of the knowledge- Wikipedia are retrieved from BabelNet. MUFFIN based network based on their similarity calculated (Camacho-Collados et al., 2015b) adapts several on the basis of lexical representations of senses ideas from NASARI, extending the method to the and their neighbors. The construction of the PCZ multi-lingual case by using BabelNet synsets in- involves the following steps (Faralli et al., 2016): stead of monolingual WordNet synsets. The approach of Chen et al. (2015) to learn- Building a Distributional Thesaurus (DT). At ing sense embeddings starts from initialization this stage, a similarity graph over terms is induced of sense vectors using WordNet glosses. 
It pro- from a corpus, where each entry consists of the ceeds by performing a more conventional context most similar 200 terms for a given term using the clustering, similar what is found to unsupervised JoBimText method (Biemann and Riedl, 2013). methods such as (Neelakantan et al., 2014; Bar- tunov et al., 2016). Word Sense Induction. In DTs, entries of pol- ysemous terms are mixed, i.e. they contain related Rothe and Schutze¨ (2015) proposed a method terms of several senses. The Chinese Whispers that learns sense embedding using word embed- (Biemann, 2006) graph clustering is applied to the dings and the sense inventory of WordNet. The ego-network (Everett and Borgatti, 2005) of the approach was evaluated on the WSD tasks using each term, as defined by its related terms and con- features based on the learned sense embeddings. nections between then observed in the DT to de- Goikoetxea et al. (2015) proposed a method for rive word sense clusters. learning word embeddings using random walks on a graph of a lexical resource. Nieto Pina˜ and Jo- Labeling Word Senses with Hypernyms. hansson (2016) used a similar approach based on Hearst (1992) patterns are used to extract hy- random walks on a WordNet to learn sense embed- pernyms from the corpus. These hypernyms dings. are assigned to senses by aggregating hypernym

PCZ ID      WordNet ID   Related Terms                           Hypernyms                    Context Clues
mouse:0     mouse:1      rat:0, rodent:0, monkey:0, ...          animal:0, species:1, ...     rat:conj and, white-footed:amod, ...
mouse:1     mouse:4      keyboard:1, computer:0, printer:0 ...   device:1, equipment:3, ...   click:-prep of, click:-nn, ...
keyboard:0  keyboard:1   piano:1, synthesizer:2, organ:0 ...     instrument:2, device:3, ...  play:-dobj, electric:amod, ...
keyboard:1  keyboard:1   keypad:0, mouse:1, screen:1 ...         device:1, technology:0 ...   computer, qwerty:amod ...

Table 1: Sample entries of the hybrid aligned resource (HAR) for the words “mouse” and “keyboard”. Trailing numbers indicate sense identifiers. Relatedness and context clue scores are not shown for brevity.

relations over the list of related terms for the given sense into a weighted list of hypernyms.

sentation with contextual information from the HAR on the basis of the mappings listed below:

Disambiguation of Related Terms and Hyper- WordNet. This baseline model relies solely on nyms. While target words contain sense distinc- the WordNet lexical resource. It builds sense tions (PCZ ID), the related words and hypernyms representations by collecting synonyms and sense do not carry sense information. At this step, each definitions for the given WordNet synset and hypernym and related term is disambiguated with synsets directly connected to it. We removed stop respect to the induced sense inventory (PCZ ID). words and weight words with term frequency. For instance, the word “keyboard” in the list of related terms for the sense “mouse:1” is linked WordNet + Related (news). This model aug- to its “device” sense represented (“keyboard:1”) ments the WordNet-based representation with re- as “mouse:1” and “keyboard:1” share neighbors lated terms from the PCZ items (see Table 1). This from the IT domain. setting is designed to quantify the added value of lexical knowledge in the related terms of PCZ. Retrieval of Context Clues. Salient contexts of senses are retrieved by aggregating salient depen- WordNet + Related (news) + Context (news). dency features of related terms. Context features This model includes all features of the previous that have a high weight for many related terms ob- models and complements them with context clues tain a high weight for the sense. obtained by aggregating features of the words from the WordNet + Related (news) (see Table 1). 3.2 HAR Datasets WordNet + Related (news) + Context (wiki). We experiment with two different corpora for PCZ This model is built in the same way as the pre- induction as in (Faralli et al., 2016), namely a 100 vious model, but using context clues derived from million sentence news corpus (news) from Giga- Wikipedia (see Section 3.2). word (Parker et al., 2011) and LCC (Richter et al., In the last two models, we used up to 5000 2006), and a 35 million sentence Wikipedia cor- most relevant context clues per word sense. This 1 pus (wiki). Chinese Whispers sense clustering is value was set experimentally: performance of the performed with the default parameters, producing WSD system gradually increased with the num- an average number of 2.3 (news) and 1.8 (wiki) ber of context clues reaching a plateau at the value senses per word in a vocabulary of 200 thousand of 5000. During aggregation, we excluded stop words each, with the usual power-law distribution words and numbers from context clues. Besides, of sense cluster sizes. On average, each sense we transformed syntactic context clues presented is related to about 47 senses and has assigned 5 in Table 1 to terms, stripping the dependency type. hypernym labels. These disambiguated distribu- so they can be added to other lexical representa- tional networks were linked to WordNet 3.1 using tions. For instance, the context clue “rat:conj and” the method of Faralli et al. (2016). of the entry “mouse:0” was reduced to the feature “rat”. 3.3 Using the Hybrid Aligned Resource in Table 2 demonstrates features extracted from Word Sense Disambiguation WordNet as compared to feature representations We experimented with four different ways of en- enriched with related terms of the PCZ. riching the original WordNet-based sense repre- Each WordNet word sense is represented with 1The used PCZ and HAR resources are available at: one of the four methods described above. These https://madata.bib.uni-mannheim.de/171 sense representations are subsequently used to per-

74 Model Sense Representation memory, device, floppy, disk, hard, disk, disk, computer, science, computing, diskette, fixed, disk, floppy, magnetic, WordNet disc, magnetic, disk, hard, disc, storage, device recorder, disk, floppy, console, diskette, handset, desktop, iPhone, iPod, HDTV, kit, RAM, Discs, Blu-ray, computer, GB, mi- crochip, site, cartridge, printer, tv, VCR, Disc, player, LCD, software, component, camcorder, cellphone, card, monitor, display, burner, Web, stereo, internet, model, iTunes, turntable, chip, cable, camera, iphone, notebook, device, server, surface, wafer, page, drive, laptop, screen, pc, television, hardware, YouTube, dvr, DVD, product, folder, VCR, radio, phone, circuitry, partition, megabyte, peripheral, format, machine, tuner, website, merchandise, equipment, gb, discs, MP3, hard-drive, piece, video, storage device, memory device, microphone, hd, EP, content, soundtrack, webcam, system, blade, graphic, microprocessor, collection, document, programming, battery, keyboard, HD, handheld, CDs, reel, web, material, hard-disk, ep, chart, debut, configuration, WordNet + Related (news) recording, album, broadcast, download, fixed disk, planet, pda, microfilm, iPod, videotape, text, cylinder, cpu, canvas, label, sampler, workstation, electrode, magnetic disc, catheter, magnetic disk, Video, mobile, cd, song, modem, mouse, tube, set, ipad, signal, substrate, vinyl, music, clip, pad, audio, compilation, memory, message, reissue, ram, CD, subsystem, hdd, touchscreen, electronics, demo, shell, sensor, file, shelf, processor, cassette, extra, mainframe, motherboard, floppy disk, lp, tape, version, kilo- byte, pacemaker, browser, Playstation, pager, module, cache, DVD, movie, Windows, cd-rom, e-book, valve, directory, harddrive, smartphone, audiotape, technology, hard disk, show, computing, computer science, Blu-Ray, blu-ray, HDD, HD-DVD, scanner, hard disc, gadget, booklet, copier, playback, TiVo, controller, filter, DVDs, gigabyte, paper, mp3, CPU, dvd-r, pipe, cd-r, playlist, slot, VHS, film, videocassette, interface, adapter, database, manual, book, channel, changer, storage

Table 2: Original and enriched representations of the third sense of the word “disk” in the WordNet sense inventory. Our sense representation is enriched with related words from the hybrid aligned resource.

form WSD in context. For each test instance consisting of a target word and its context, we select the sense whose corresponding sense representation has the highest cosine similarity with the target word’s context.

improvements in the results. Further expansion of the sense representation with context clues (cf. Table 1) provides a modest further improvement on the SemEval-2007 dataset and yields no further improvement in the case of the Senseval-3 dataset.
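The selection step just described amounts to a nearest-neighbour search between sparse bag-of-words vectors. A minimal sketch of this idea is given below, assuming plain Python dictionaries as sparse vectors, whitespace tokenisation and toy sense representations loosely modelled on Table 1; it illustrates the procedure from Section 3.3 and is not the authors' actual implementation.

import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * v.get(feature, 0.0) for feature, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def disambiguate(context_sentence, sense_representations, stop_words=frozenset()):
    # Pick the sense whose sparse representation is closest to the context.
    # sense_representations maps a sense id to a dict of weighted features,
    # e.g. WordNet synonyms and gloss words, optionally enriched with
    # related terms and context clues from the HAR.
    context = Counter(w for w in context_sentence.lower().split()
                      if w not in stop_words)
    return max(sense_representations,
               key=lambda sense: cosine(sense_representations[sense], context))

# Toy sense inventory loosely following Table 1 (hypothetical weights).
senses = {
    "mouse:0 (animal)": {"rat": 1.0, "rodent": 1.0, "monkey": 0.5, "animal": 1.0},
    "mouse:1 (device)": {"keyboard": 1.0, "computer": 1.0, "printer": 0.5, "device": 1.0},
}
print(disambiguate("click the mouse next to the keyboard", senses))
# -> mouse:1 (device)

In the actual experiments the enriched representations contain up to 5000 context clues per sense, and stop words and numbers are removed before aggregation (Section 3.3).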

4 Evaluation Comparison to the state-of-the-art. We com- pare our approach to four state-of-the-art sys- We perform an extrinsic evaluation and show the tems: KnowNet (Cuadros and Rigau, 2008), Ba- impact of the hybrid aligned resource on word belNet, WN+XWN (Cuadros and Rigau, 2007), sense disambiguation performance. While there and NASARI. KnowNet builds sense representa- exist many datasets for WSD (Mihalcea et al., tions based on snippets retrieved with a web search 2004; Pradhan et al., 2007; Manandhar et al., engine. We use the best configuration reported in 2010, inter alia), we follow Navigli and Ponzetto the original paper (KnowNet-20), which extends (2012) and use the SemEval-2007 Task 16 on each sense with 20 keywords. BabelNet in its the “Evaluation of wide-coverage knowledge re- core relies on a mapping of WordNet synsets and sources” (Cuadros and Rigau, 2007). This task is Wikipedia articles to obtain enriched sense rep- specifically designed for evaluating the impact of resentations. The WN+XWN system is the top- lexical resources on WSD performance. The Sem- ranked unsupervised knowledge-based system of Eval-2007 Task 16 is, in turn, based on two “lexi- Senseval-3 and SemEval-2007 datasets from the cal sample” datasets, from the Senseval-3 (Mihal- original competition (Cuadros and Rigau, 2007). cea et al., 2004) and SemEval-2007 Task 17 (Prad- It alleviates sparsity by combining WordNet with han et al., 2007) evaluation campaigns. The first the eXtended WordNet (Mihalcea and Moldovan, dataset has coarse- and fine-grained annotations, 2001). The latter resource relies on parsing of while the second contains only fine-grained sense WordNet glosses. annotations. In all experiments, we use the offi- For KnowNet, BabelNet, and WN+XWN we cial task’s evaluator to compute standard metrics use the scores reported in the respective original of recall, precision, and F-score. publications. However, as NASARI was not eval- 5 Results uated on the datasets used in our study, we used the following procedure to obtain NASARI-based Impact of the corpus-based features. Figure 1 sense representations: Each WordNet-based sense compares various sense representations in terms representation was extended with all features from of F-score. The results show that, expanding the lexical vectors of NASARI.2 WordNet-based sense representations with distri- Thus, we compare our method to three hybrid butional information gives a clear advantage over systems that induce sense representations on the the original representation on both Senseval-3 and 2We used the version of lexical vectors (July 2016) featur- SemEval-2007 datasets. Using related words spe- ing 4.4 million of BabelNet synsets, yet covering only 72% cific to a given WordNet sense provides dramatic of word senses of the two datasets used in our experiments.

Figure 1: Performance of different word sense representation strategies.

                                              Senseval-3 fine-grained        SemEval-2007 fine-grained
Model                                         Precision  Recall  F-score    Precision  Recall  F-score
Random                                        19.1       19.1    19.1       27.4       27.4    27.4
WordNet                                       29.7       29.7    29.7       44.3       21.0    28.5
WordNet + Related (news)                      47.5       47.5    47.5       54.0       50.0    51.9
WordNet + Related (news) + Context (news)     47.2       47.2    47.2       54.8       51.2    52.9
WordNet + Related (news) + Context (wiki)     46.9       46.9    46.9       55.2       51.6    53.4
BabelNet                                      44.3       44.3    44.3       56.9       53.1    54.9
KnowNet                                       44.1       44.1    44.1       49.5       46.1    47.7
NASARI (lexical vectors)                      32.3       32.2    32.2       49.3       45.8    47.5
WN+XWN                                        38.5       38.0    38.3       54.9       51.1    52.9

Table 3: Comparison of our approach to state-of-the-art unsupervised knowledge-based methods on the SemEval-2007 Task 16 (weighted setting). The best results overall are underlined.

basis of WordNet and texts (KnowNet, BabelNet, NASARI) and one purely knowledge-based system (WN+XWN). Note that we do not include the supervised TSSEM system in this comparison, as in contrast to all other considered methods, it relies on a large sense-labeled corpus.

Table 3 presents results of the evaluation. On the Senseval-3 dataset, our hybrid models show better performance than all unsupervised knowledge-based approaches considered in our experiment. On the SemEval-2007 dataset, the only resource which exceeds the performance of our hybrid model is BabelNet. The extra performance of BabelNet on the SemEval dataset can be explained by its multilingual approach: additional features are obtained using semantic relations across synsets in different languages. Besides, machine translation is used to further enrich the coverage of the resource (Navigli and Ponzetto, 2012).

These results indicate the high quality of the sense representations obtained using the hybrid aligned resource. Using related words of induced senses improves WSD performance by a large margin as compared to the purely WordNet-based model on both datasets. Adding extra contextual features further improves results slightly on one dataset. Thus, we recommend enriching sense representations with related words and optionally with context clues. Finally, note that, while our method shows competitive results compared to other state-of-the-art hybrid systems, it does not require access to web search engines (KnowNet), texts mapped to a sense inventory (BabelNet, NASARI), or machine translation systems (BabelNet).

6 Conclusion

The hybrid aligned resource (Faralli et al., 2016) successfully enriches sense representations of a manually constructed lexical network with features derived from a distributional disambiguated lexical network. Our WSD experiments on two datasets show that this additional information extracted from corpora lets us substantially outperform the model based solely on the lexical resource. Furthermore, a comparison of our sense representation method with existing hybrid approaches leveraging corpus-based features demonstrates its state-of-the-art performance.

Acknowledgments

We acknowledge the support of the Deutsche Forschungsgemeinschaft (DFG) under the project “JOIN-T: Joining Ontologies and Semantics Induced from Text”.

76 References 161–168, Manchester, UK, August. Coling 2008 Or- ganizing Committee. Eneko Agirre and Philip Edmonds. 2007. Word sense disambiguation: Algorithms and applications, vol- Beate Dorow and Dominic Widdows. 2003. Discov- ume 33. Springer Science & Business Media. ering Corpus-Specific Word Senses. In Proceedings of the Tenth Conference on European Chapter of the Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, Association for Computational Linguistics - Volume and Dmitry Vetrov. 2016. Breaking sticks and am- 2, EACL ’03, pages 79–82, Budapest, Hungary. As- biguities with adaptive skip-gram. In Proceedings of sociation for Computational Linguistics. the 19th International Conference on Artificial Intel- ligence and Statistics (AISTATS’2016), pages 130– Martin Everett and Stephen P. Borgatti. 2005. Ego 138, Cadiz, Spain. JMLR: W&CP volume 51. network betweenness. Social Networks, 27(1):31– 38. Chris Biemann and Martin Riedl. 2013. Text: Now in 2D! A Framework for Lexical Expansion with Con- Stefano Faralli, Alexander Panchenko, Chris Biemann, textual Similarity. Journal of Language Modelling, and Simone P. Ponzetto. 2016. Linked disam- 1(1):55–95. biguated distributional semantic networks. In In- ternational Semantic Web Conference (ISWC’2016), Chris Biemann. 2006. Chinese whispers - an efficient pages 56–64, Kobe, Japan. Springer. graph clustering algorithm and its application to nat- ural language processing problems. In Proceedings Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, of TextGraphs: the First Workshop on Graph Based Chris Dyer, Eduard Hovy, and Noah A. Smith. Methods for Natural Language Processing, pages 2015. Retrofitting word vectors to semantic lexi- 73–80, New York City. Association for Computa- cons. In Proceedings of the 2015 Conference of tional Linguistics. the North American Chapter of the Association for Computational Linguistics: Human Language Tech- Jose´ Camacho-Collados, Mohammad Taher Pilehvar, nologies, pages 1606–1615, Denver, Colorado. As- and Roberto Navigli. 2015a. Nasari: a novel sociation for Computational Linguistics. approach to a semantically-aware representation of items. In Proceedings of the 2015 Conference of Christiane Fellbaum. 1998. WordNet: An Electronic the North American Chapter of the Association for Database. MIT Press, Cambridge, MA. Computational Linguistics: Human Language Tech- Juri Ganitkevitch, Benjamin Van Durme, and Chris nologies, pages 567–577, Denver, Colorado. Asso- Callison-Burch. 2013. Ppdb: The paraphrase ciation for Computational Linguistics. database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Jose´ Camacho-Collados, Mohammad Taher Pilehvar, Computational Linguistics: Human Language Tech- and Roberto Navigli. 2015b. A unified multilingual nologies semantic representation of concepts. In Proceedings , pages 758–764, Atlanta, Georgia. Associ- of the 53rd Annual Meeting of the Association for ation for Computational Linguistics. Computational Linguistics and the 7th International Josu Goikoetxea, Aitor Soroa, and Eneko Agirre. Joint Conference on Natural Language Processing 2015. Random walks and neural network language (Volume 1: Long Papers), pages 741–751, Beijing, models on knowledge bases. In Proceedings of the China. Association for Computational Linguistics. 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. Human Language Technologies, pages 1434–1439, 2015. 
Improving distributed representation of word Denver, Colorado. Association for Computational sense via wordnet gloss composition and context Linguistics. clustering. In Proceedings of the 53rd Annual Meet- ing of the Association for Computational Linguistics Marti A. Hearst. 1992. Automatic acquisition of hy- and the 7th International Joint Conference on Natu- ponyms from large text corpora. In Proceedings ral Language Processing (Volume 2: Short Papers), of the 6th International Conference on Computa- pages 15–20, Beijing, China. Association for Com- tional Linguistics (COLING’1992), pages 539–545, putational Linguistics. Nantes, France. Montse Cuadros and German Rigau. 2007. Semeval- David Hope and Bill Keller. 2013. MaxMax: A 2007 task 16: Evaluation of wide coverage knowl- Graph-Based Soft Clustering Algorithm Applied to edge resources. In Proceedings of the Fourth Word Sense Induction. In Computational Linguis- International Workshop on Semantic Evaluations tics and Intelligent Text Processing: 14th Interna- (SemEval-2007), pages 81–86, Prague, Czech Re- tional Conference, CICLing 2013, Samos, Greece, public. Association for Computational Linguistics. March 24-30, 2013, Proceedings, Part I, pages 368– 381. Springer Berlin Heidelberg, Berlin, Heidelberg. Montse Cuadros and German Rigau. 2008. KnowNet: Building a large net of knowledge from the web. In Ignacio Iacobacci, Mohammad Taher Pilehvar, and Proceedings of the 22nd International Conference Roberto Navigli. 2015. Sensembed: Learning on Computational Linguistics (Coling 2008), pages sense embeddings for word and relational similarity.

77 In Proceedings of the 53rd Annual Meeting of the Doha, Qatar. Association for Computational Lin- Association for Computational Linguistics and the guistics. 7th International Joint Conference on Natural Lan- guage Processing (Volume 1: Long Papers), pages Luis Nieto Pina˜ and Richard Johansson. 2016. Em- 95–105, Beijing, China. Association for Computa- bedding senses for efficient graph-based word sense tional Linguistics. disambiguation. In Proceedings of TextGraphs-10: the Workshop on Graph-based Methods for Natural Sujay Kumar Jauhar, Chris Dyer, and Eduard Hovy. Language Processing, pages 1–5, San Diego, CA, 2015. Ontologically grounded multi-sense repre- USA. Association for Computational Linguistics. sentation learning for semantic vector space models. In Proceedings of the 2015 Conference of the North Robert Parker, David Graff, Junbo Kong, Ke Chen, and American Chapter of the Association for Computa- Kazuaki Maeda. 2011. English Gigaword Fifth Edi- tional Linguistics: Human Language Technologies, tion. Linguistic Data Consortium, Philadelphia. pages 683–693, Denver, Colorado. Association for Maria Pelevina, Nikolay Arefiev, Chris Biemann, and Computational Linguistics. Alexander Panchenko. 2016. Making sense of word Proceedings of the 1st Workshop Suresh Manandhar, Ioannis P. Klapaftis, Dmitriy Dli- embeddings. In on Representation Learning for NLP gach, and Sameer S. Pradhan. 2010. SemEval-2010 , pages 174– task 14: Word sense induction & disambiguation. In 183, Berlin, Germany. Association for Computa- Proceedings of the 5th International Workshop on tional Linguistics. Semantic Evaluation (ACL’2010), pages 63–68, Up- Nghia The Pham, Angeliki Lazaridou, and Marco Ba- psala, Sweden. Association for Computational Lin- roni. 2015. A multitask objective to inject lexical guistics. contrast into distributional semantics. In Proceed- ings of the 53rd Annual Meeting of the Association Rada Mihalcea and Dan Moldovan. 2001. extended for Computational Linguistics and the 7th Interna- wordnet: Progress report. In In Proceedings of tional Joint Conference on Natural Language Pro- NAACL Workshop on WordNet and Other Lexical cessing (Volume 2: Short Papers), pages 21–26, Bei- Resources , pages 1–5, Pittsburgh, PA, USA. Asso- jing, China. Association for Computational Linguis- ciation for Computational Linguistics. tics. Rada Mihalcea, Timothy Chklovski, and Adam Kil- Sameer Pradhan, Edward Loper, Dmitriy Dligach, and garriff. 2004. The SENSEVAL-3 English lexical Martha Palmer. 2007. Semeval-2007 task-17: En- sample task. In SENSEVAL-3: Third International glish lexical sample, srl and all words. In Proceed- Workshop on the Evaluation of Systems for the Se- ings of the Fourth International Workshop on Se- mantic Analysis of Text, pages 25–28, Barcelona, mantic Evaluations (SemEval-2007), pages 87–92, Spain. Association for Computational Linguistics. Prague, Czech Republic. Association for Computa- tional Linguistics. Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed rep- Matthias Richter, Uwe Quasthoff, Erla Hallsteinsdottir,´ resentations of words and phrases and their com- and Chris Biemann. 2006. Exploiting the Leipzig positionality. In Proceedings of Advances in Neu- Corpora Collection. In Proceedings of the Fifth ral Information Processing Systems 26 (NIPS’2013), Slovenian and First International Language Tech- pages 3111–3119, Harrahs and Harveys, CA, USA. nologies Conference (IS-LTC), Ljubljana, Slovenia. 
Andrea Moro, Alessandro Raganato, and Roberto Nav- Sascha Rothe and Hinrich Schutze.¨ 2015. Autoex- igli. 2014. Entity linking meets word sense disam- tend: Extending word embeddings to embeddings biguation: a unified approach. Transactions of the for synsets and lexemes. In Proceedings of the Association for Computational Linguistics, 2:231– 53rd Annual Meeting of the Association for Compu- 244. tational Linguistics and the 7th International Joint Conference on Natural Language Processing (Vol- Roberto Navigli and Simone Paolo Ponzetto. 2012. ume 1: Long Papers), pages 1793–1803, Beijing, Babelnet: The automatic construction, evaluation China, July. Association for Computational Linguis- and application of a wide-coverage multilingual se- tics. mantic network. Artificial Intelligence, 193:217– 250. Peter D. Turney and Patrick Pantel. 2010. From fre- quency to meaning: Vector space models of seman- Roberto Navigli. 2009. Word sense disambiguation: A tics. JAIR, 37:141–188. survey. ACM CSUR, 41(2):1–69. Mo Yu and Mark Dredze. 2014. Improving lexical Arvind Neelakantan, Jeevan Shankar, Alexandre Pas- embeddings with semantic knowledge. In Proceed- sos, and Andrew McCallum. 2014. Efficient ings of the 52nd Annual Meeting of the Association non-parametric estimation of multiple embeddings for Computational Linguistics (Volume 2: Short Pa- per word in vector space. In Proceedings of the pers), pages 545–550, Baltimore, Maryland. Asso- 2014 Conference on Empirical Methods in Natural ciation for Computational Linguistics. Language Processing (EMNLP), pages 1059–1069,

One Representation per Word — Does it make Sense for Composition?

Thomas Kober, Julie Weeds, John Wilkie, Jeremy Reffin and David Weir
TAG laboratory, Department of Informatics, University of Sussex
Brighton, BN1 9RH, UK
{t.kober, j.e.weeds, jw478, j.p.reffin, d.j.weir}@sussex.ac.uk

Abstract

In this paper, we investigate whether an a priori disambiguation of word senses is strictly necessary or whether the meaning of a word in context can be disambiguated through composition alone. We evaluate the performance of off-the-shelf single-vector and multi-sense vector models on a benchmark phrase similarity task and a novel task for word-sense discrimination. We find that single-sense vector models perform as well or better than multi-sense vector models despite arguably less clean elementary representations. Our findings furthermore show that simple composition functions such as pointwise addition are able to recover sense-specific information from a single-sense vector model remarkably well.

1 Introduction

Distributional word representations based on counting co-occurrences have a long history in natural language processing and have successfully been applied to numerous tasks such as sentiment analysis, recognising textual entailment, word-sense disambiguation and many other important problems. More recently, low-dimensional and dense neural word embeddings have received a considerable amount of attention in the research community and have become ubiquitous in numerous NLP pipelines in academia and industry. One fundamental simplifying assumption commonly made in distributional semantic models, however, is that every word can be encoded by a single representation. Combining polysemous lexemes into a single vector has the consequence of essentially creating a weighted average of all observed meanings of a lexeme in a given text corpus.

Therefore a number of proposals have been made to overcome the issue of conflating several different senses of an individual lexeme into a single representation. One approach (Reisinger and Mooney, 2010; Huang et al., 2012) is to try directly inferring a predefined number of senses from data and subsequently label any occurrences of a polysemous lexeme with the inferred inventory. Similar approaches are proposed by Reddy et al. (2011) and Kartsaklis et al. (2013), who show that appropriate sense selection or disambiguation typically improves performance for composition of noun phrases (Reddy et al., 2011) and verb phrases (Kartsaklis et al., 2013). Dinu and Lapata (2010) proposed a model that represents the meaning of a word as a probability distribution over latent senses which is modulated based on contextualisation, and report improved performance on a word similarity task and the lexical substitution task. Other approaches leverage an existing lexical resource such as BabelNet or WordNet to obtain sense labels a priori to creating word representations (Iacobacci et al., 2015), or as a postprocessing step after obtaining initial word representations (Chen et al., 2014; Pilehvar and Collier, 2016). While these approaches have exhibited strong performance on benchmark word similarity tasks (Huang et al., 2012; Iacobacci et al., 2015) and some downstream processing tasks such as part-of-speech tagging and relation identification (Li and Jurafsky, 2015), they have been weaker than the single-vector representations at predicting the compositionality of multi-word expressions (Salehi et al., 2015), and at tasks which require the meaning of a word to be considered in context; e.g., word sense disambiguation (Iacobacci et al., 2016) and word similarity in context (Iacobacci et al., 2015).

In this paper we consider what happens when distributional representations are composed to

79 Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pages 79–90, Valencia, Spain, April 4 2017. c 2017 Association for Computational Linguistics form representations for larger units of meaning. SENSEMBED model4 by Iacobacci et al. (2015), In a compositional phrase, the meaning of the which creates word-sense embeddings by per- whole can be inferred from the meaning of its forming word-sense disambiguation prior to run- parts. Thus, assuming compositionality, the rep- ning word2vec. resentation of a phrase such as black mood, should We note that word2vec and dep2vec use be directly inferable from the representations for a single vector per word approach and therefore black and for mood. Further, one might suppose conflate the different senses of a polysemous lex- that composing the correct senses of the individ- eme. On the other hand, SENSEMBED utilises Ba- ual lexemes would result in a more accurate repre- belfy (Moro et al., 2014) as an external knowledge sentation of that phrase. However, our counter- source to perform word-sense disambiguation and hypothesis is that the act of composition con- subsequently creates one vector representation per textualises or disambiguates each of the lexemes word sense. thereby making the representations of individual For composition we use pointwise addition for senses redundant. We investigate this hypothe- all models as this has been shown to be a strong sis by evaluating the performance of single-vector baseline in a number of studies (Hashimoto et representations and multi-sense representations at al., 2014; Hill et al., 2016). We also experi- both a benchmark phrase similarity task and at a mented with pointwise multiplication as compo- novel word-sense discrimination task. sition function but, similar to Hill et al. (2016), Our contributions in this work are thus as fol- found its performance to be very poor (results not lows. First, we provide quantitative and qualita- reported). We model any out-of-vocabulary items tive evidence that even simple composition func- as a vector consisting of all zeros and determine tions have the ability to recover sense-specific in- proximity of composed meaning representations formation from a single-vector representation of in terms of cosine similarity. We lowercase and a polysemous lexeme in context. Second, we in- lemmatise the data in our task but do not perform troduce a novel word-sense discrimination task1, number or date normalisation, or removal of rare which can be seen as the first stage of word-sense words. disambiguation. The goal is to find whether the occurrences of a lexeme in two or more senten- 3 Phrase Similarity tial contexts belong to the same sense or not, with- Our first evaluation task is the benchmark phrase out necessarily labelling the senses. While it has similarity task of Mitchell and Lapata (2010). This received relatively little attention in recent years, dataset consists of 108 adjective-noun (AN), 108 it is an important natural language understanding noun-noun (NN) and 108 verb-object (VO) pairs. problem and can provide important insights into The task is to compare a compositional model’s the process of semantic composition. similarity estimates with human judgements by computing Spearman’s ρ. 
An average ρ of 0.47- 2 Evaluating Distributional Models of 0.48 represents the current state-of-the-art perfor- Composition mance on this task (Hashimoto et al., 2014; Kober For evaluation we use several readily available et al., 2016; Wieting et al., 2015). off-the-shelf word embeddings, that have al- For single-sense representations, the strategy ready been shown to work well for a number for carrying out this task is simple. For each of different NLP applications. We compare the phrase in each pair, we compose the constituent 300-dimensional skip-gram word2vec (Mikolov representations and then compute the similarity et al., 2013) word embeddings2 to the depen- of each pair of phrases using the cosine similar- dency based version of word2vec — henceforth ity. For multi-sense representations, we adapted dep2vec3 (Levy and Goldberg, 2014) — and the the strategy which has been used successfully in various word similarity experiments (Huang et al., 1 Our task is available from https://github.com/ 2012; Iacobacci et al., 2015). Typically, for each tttthomasssss/sense2017 2Available from: https://code.google.com/p/ word pair, all pairs of senses are considered and word2vec/ the similarity of the word pair is considered to be 3Available from: https://levyomer. wordpress.com/2014/04/25/ 4Available from: http://lcl.uniroma1.it/ dependency-based-word-embeddings/ sensembed/

80 the similarity of the closest pair of senses. The Phrase 1 Phrase 2 SE SE* H buy land leave house 0.49 0.28 0.26 fact that this strategy works well suggests that close eye stretch arm 0.40 0.31 0.25 when humans are asked to judge word similar- wave hand leave company 0.42 0.08 0.20 drink water use test 0.29 0.04 0.18 ity, the pairing automatically primes them to se- european state present position 0.28 -0.03 0.19 lect the closest senses. Extending this to phrase high point particular case 0.41 0.10 0.21 similarity requires us to compose each possible Table 2 pair of senses for each phrase and then select Tendency of SENSEMBED (SE) to overestimate the similarity the sense configuration which results in maximal on phrase pairs with low average human similarity when the phrase similarity. For comparison, we also give closest sense strategy is used. results for the configuration which results in mini- mal phrase similarity and the arithmetic mean5 of goal is to find whether two or more occurrences of all sense configurations. the same lexeme express identical senses, without 3.1 Results necessarily labelling the senses yet. It has received relatively little attention despite its potential for Model AN NN VO Average providing important insights into semantic com- word2vec 0.47 0.46 0.45 0.46 position, focusing in particular on to the ability dep2vec 0.48 0.46 0.45 0.46 SENSEMBED:max 0.39 0.39 0.32 0.37 of compositional distributional semantic models to SENSEMBED:min 0.24 0.12 0.22 0.19 appropriately contextualise a polysemous lexeme. SENSEMBED:mean 0.46 0.35 0.37 0.39 Work on word-sense discrimination has suf- Table 1 fered from the absence of a benchmark task as well Results for the Mitchell and Laptata (2010) dataset. as a clear evaluation methodology. For example Schutze¨ (1998) evaluated his model on a dataset Table 1 shows that the simple strategy of adding consisting of 20 polysemous words (10 naturally high quality single-vector representations is very ambiguous lexemes and 10 artificially ambiguous competitive with the state-of-the-art for this task. “pseudo-lexemes”) in terms of accuracy for coarse None of the strategies for selecting a sense config- grained sense distinctions, and an information re- uration for the multi-sense representations could trieval task. Pantel and Lin (2002), and Van de compete with the single sense representations on Cruys (2008) used automatically extracted words this task. One possible explanation is that the com- from various newswire sources and evaluated the monly adopted closest sense strategy is not effec- output of their models in comparison to WordNet tive for composition since the composition of in- and EuroWordNet, respectively. Purandare and correct senses may lead to spuriously high similar- Pedersen (2004) used a subset of the words from ities (for two “implausible” sense configurations). the SENSEVAL-2 task and evaluated their models Table 2 lists a number of example phrase pairs in terms of precision, recall and F1-score of how with low average human similarity scores in the well available sense tags match with clusters dis- Mitchell and Lapata (2010) test set. The results covered by their algorithms. Akkaya et al. (2012) show the tendency of the closest sense strategy used the concatenation of the SENSEVAL-2 and with SENSEMBED (SE) to overestimate the sim- SENSEVAL-3 tasks and evaluated their models in ilarity of dissimilar phrase pairs. For a compari- terms of cluster purity and accuracy. 
Finally, son we manually labelled the lexemes in the sam- Moen et al. (2013) used the semantic textual sim- ple phrases with the appropriate BabelNet senses ilarity (STS) 2012 task, which is based on human prior to composition (SE*). Human (H) similar- judgements of the similarity between two sen- ity scores are normalised and averaged for an eas- tences. ier comparison, model estimates represent cosine One contribution of our work is a novel word- similarities. sense discrimination task, evaluated on a number 4 Word Sense Discrimination of robust baselines in order to facilitate future re- search in that area. In particular, our task offers a Word-sense discrimination can be seen as the first testbed for assessing the contextualisation ability stage of word-sense disambiguation, where the of compositional distributional semantic models. 5We also tried the geometric mean and the median but The goal is, for a given polysemous lexeme in con- these performed comparably with the arithmetic mean. text, to identify the sentence from a list of options

81 that is expressing the same sense of that lexeme of a particular sense for a given polysemous as the given target sentence. These two sentences lexeme. — the target and the “correct answer” — can ex- hibit any degree of semantic similarity as long as The granularity of the sense inventory reflects • 7 they convey the same sense of the target lexeme. common language use . Table 3 shows an example of the polysemous ad- The example sentences are typically free of jective black in our task. The goal of any model • would be to determine that the expressed sense of any domain bias wherever possible. black in the sentence She was going to set him The data is easily accessible via a web API. free from all of the evil and black hatred he had • brought to the world is identical to the expressed By using the data from curated resources we were sense of black in the target sentence Or should they able to avoid a setup as a sentence similarity task rebut the Democrats’ black smear campaign with and any potentially noisy crowd-sourced human the evidence at hand. similarity judgements. Our task assesses the ability of a model to dis- We were furthermore able to collect data from criminate a particular sense in a sentential context varying frequency bands, enabling an assessment from any other senses and thus provides an excel- of the impact of frequency on any model. Fig- lent testbed for evaluating multi-sense word vec- ure 1 shows the number of target lexemes per fre- tor models as well as compositional distributional quency band. While the majority of lexemes, with semantic models. By composing the representa- reference to a cleaned October 2013 Wikipedia tion of a target lexeme with its surrounding con- dump8, is in the middle band, there is a consid- text, it should be possible to determine its sense. erable amount of less frequent lexemes. The most For example, composing black smear campaign frequent target lexeme in our task is the verb be should lead to a compositional representation that with 1.8m occurrences in Wikipedia, and the ≈ is closer to the composed representation of black least frequent lexeme is the verb ruffle with only hatred than to black mood, black sense of humour 57 occurrences. The average target lexeme fre- or black coffee. This essentially uses the similarity quency is 95k for adjectives, and 45k 46k for ≈ ≈ − of the compositional representation of a lexeme’s nouns and verbs9. context to determine its sense. Similar approaches to word-sense disambiguation have already been successfully used in past works (Akkaya et al., 2012; Basile et al., 2014).

4.1 Task Construction For the construction of our dataset we made use of data from two english dictionaries (Oxford Dictio- nary and Collins Dictionary), accessible via their respective web APIs6, as well as examples from the sense annotated corpus SemCor (Miller et al., 1993). Our use of dictionary data is motivated by a number of favourable properties which make it a very suitable data source for our proposed task: Figure 1: Binned frequency distribution of the polysemous target lexemes in our task. The content is of very high-quality and cu- • rated by expert lexicographers. 7The Oxford dictionary lists 5 different senses for the noun “bank”, whereas WordNet 3.0 lists 10 synsets, for ex- All example sentences are carefully crafted in • ample distinguishing “bank” as the concept for a financial order to unambiguously illustrate the usage institution and “bank” as a reference to the building where financial transactions take place. 6https://developer. 8We removed any articles with fewer than 20 page views. oxforddictionaries.com for the Oxford Dictionary, 9The overall number of unique word types is smaller than https://www.collinsdictionary.com/api/ for the number of examples in our task as there are a number of the Collins Dictionary. We use NLTK 3.2 to access SemCor. lexemes that can occur with more than one part-of-speech.

Sense      Definition                                                              Sentence
Target     full of anger or hatred                                                 Or should they rebut the Democrats’ black smear campaign with the evidence at hand?
Option 1   full of anger or hatred                                                 She was going to set him free from all of the evil and black hatred he had brought to the world.
Option 2   (of a person’s state of mind) full of gloom or misery; very depressed   I’ve been in a black mood since September 2001, it’s hanging over me like a penumbra.
Option 3   (of humour) presenting tragic or harrowing situations in comic terms    Over the years I have come to believe that fate either hates me, or has one hell of a black sense of humour.
Option 4   (of coffee or tea) served without milk                                  The young man was reading a paperback novel and sipping a steaming mug of hot, black coffee.

Table 3: Example of the polysemous adjective black in our task. The goal for any model is to predict option 1 as expressing the same sense of black as the target sentence.
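To make the intended use of such an item concrete, the sketch below composes the target lexeme with the words of its sentential context by pointwise addition and returns the option whose composed representation is closest in cosine terms, in the spirit of the setup described later in Section 4.3. The toy three-dimensional embeddings and helper names are hypothetical; the real experiments use pre-trained 300-dimensional vectors, stop-word removal and lemmatisation.

import numpy as np

def compose(tokens, embeddings, dim):
    # Pointwise addition; out-of-vocabulary tokens contribute a zero vector.
    vec = np.zeros(dim)
    for token in tokens:
        vec += embeddings.get(token, np.zeros(dim))
    return vec

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return 0.0 if denom == 0.0 else float(np.dot(u, v) / denom)

def discriminate(target_tokens, options_tokens, embeddings, dim):
    # Return the index of the option expressing the same sense as the target.
    target_vec = compose(target_tokens, embeddings, dim)
    scores = [cosine(target_vec, compose(option, embeddings, dim))
              for option in options_tokens]
    return int(np.argmax(scores))

# Hypothetical toy embeddings, for illustration only.
toy = {
    "black":  np.array([1.0, 1.0, 0.0]),
    "smear":  np.array([0.0, 2.0, 0.0]),
    "hatred": np.array([0.1, 1.5, 0.0]),
    "coffee": np.array([0.0, 0.0, 2.0]),
}
target = ["black", "smear"]                        # "... black smear campaign ..."
options = [["black", "hatred"], ["black", "coffee"]]
print(discriminate(target, options, toy, dim=3))   # -> 0 (the "black hatred" option)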

4.2 Task Setup Details same sense of a given target lexeme. Accuracy has We collected data for 3 different parts-of-speech: the advantage of being much easier to interpret — adjectives, nouns and verbs. We furthermore cre- in absolute terms as well as in the relative differ- ated task setups with varying numbers of senses ence between models — in comparison to other to distinguish (2-5 senses) for a given target lex- commonly used evaluation metrics such as clus- eme. This is to evaluate how well a model is able ter purity measures or correlation metrics such as to discriminate different degrees of polysemy of Spearman ρ and Pearson r. any lexeme. For any task setup evaluating for n senses, we included all lexemes with > n senses 4.3 Experimental Setup and randomly sampled n senses from its inventory. In this paper we compare the compositional mod- For each lexeme, we furthermore ensured that it els outlined earlier with two baselines, a random had at least 2 example sentences per sense. For baseline and a word-overlap baseline of the ex- the available senses of any given lexeme, we ran- tracted contexts. For the single-vector represen- domly chose a sense as the target sense, and from tations, we composed the target lexeme with all of its list of example sentences randomly sampled 2 the words in the context window and compared it sentences, one as the target example and one as with the equivalent representation of each of the the “correct answer” for the list of candidate sen- options (lexeme plus context words). The option tences. Finally we once again randomly sampled with the highest cosine similarity was deemed to the required number of other senses and example be the selected sense. For SENSEMBED, we com- sentences to complete the task setup. Using ran- posed all sense vectors of a target lexeme with the dom sampling of word senses and targets aims to given context and then used the closest sense strat- avoid a predominant sense bias. egy (Iacobacci et al., 2015) on composed represen- For each part-of-speech we created a develop- tations to choose the predicted sense10. The word- ment split for parameter tuning and a test split for overlap baseline is simply the number of words in the final evaluation. Table 4 shows the number of common between the context window for the tar- examples for each setup variant of our task. The get and each of the options. biggest category are polysemous nouns, represent- We experimented with symmetric linear bag- ing roughly half of the data, followed by verbs of-words contexts of size 1, 2 and 4 around the representing another third, and the smallest cate- target lexeme. We also experimented with de- gory are adjectives taking up the remaining 17%. ≈ pendency contexts, where first-order dependency We measure performance in terms of accuracy of contexts performed almost identical to using a 2- word bag-of-words context window (results not 2 senses 3 senses 4 senses 5 senses Adjective 66/209 47/170 37/137 28/115 reported). We excluded stop words prior to ex- Noun 170/618 125/499 100/412 74/345 tracting the context window in order to maximise Verb 127/438 71/354 72/295 56/256 Total 363/1265 263/1023 209/844 164/716 the number of content words. We break ties for any of the methods — including the baselines — Table 4: Number of examples per part-of-speech and number by randomly picking one of the options with the of senses (#dev examples/#train examples). 
10We also tried an all-by-all senses composition, however correctly predicting which two sentences share the found this to be computationally not tractable.

Statistical significance between the best performing model and the word overlap baseline is computed using a randomised pairwise permutation test (Efron and Tibshirani, 1994).

4.4 Results

Table 5 shows the results for all context window sizes across all parts-of-speech and numbers of senses. All models substantially outperform the random baseline for any number of senses. Interestingly, the word overlap baseline is competitive for all context window sizes. While it is a very simple method, it has already been found to be a strong baseline for paraphrase detection and semantic textual similarity (Dinu and Thater, 2012). One possible explanation for its robust performance on our task is an occurrence of the one-sense-per-collocation hypothesis (Yarowsky, 1993). The performance of all other models is roughly in the same ballpark for all parts-of-speech and numbers of senses, suggesting that they form robust baselines for future models. While the results are relatively mixed for adjectives, word2vec appears to be the strongest model for polysemous nouns and verbs.

Perhaps the most interesting observation in Table 5 is that word2vec and dep2vec perform as well as or better than SENSEMBED, despite the fact that the former conflate the senses of a polysemous lexeme in a single vector representation. Figure 2 shows the average performance of all models across parts-of-speech per number of senses and for all context window sizes.

SENSEMBED and Babelfy

One possible explanation for SENSEMBED not outperforming the other methods in the above experiments, despite its cleaner encoding of different word senses, is that at training time it had access to sense labels from Babelfy, whereas at test time on our task it did not have any sense labels available. We therefore sense tagged the 5-sense noun subtask with Babelfy and re-ran SENSEMBED. As Table 6 shows, access to sense labels at test time did not give a substantive performance boost, providing further support for our hypothesis that composition in single-sense vector models might be sufficient to recover sense-specific information.

Frequency Range

To estimate how sensitive our task is to target lexeme frequency, we chose the 2-sense noun subtask, merged the [1, 1k) and [1k, 10k) bands as well as the [50k, 100k) and [100k, ∞) frequency bands from Figure 1, and sampled an equal number of target words from each band. Table 7 reports the results for this experiment. All methods outperform the random and word overlap baselines and appear to work better for less frequent lexemes. One possible explanation for this behaviour is that less frequent lexemes have fewer senses and potentially less subtle sense differences than more frequent lexemes, which would make them easier to discriminate by distributional semantic methods.

5 Discussion

Our results suggest that pointwise addition in a single-sense vector model such as word2vec is able to discriminate the sense of a polysemous lexeme in context in a surprisingly effective way and represents a strong baseline for future work. Distributional composition can therefore be interpreted as a process of contextualising the meaning of a lexeme. In this way, composition acts not only as a way to represent the meaning of a phrase as a whole, but also as a local discriminator for the lexemes in the phrase. For example, the composed representation of dry clothes should only keep contexts that dry shares with clothes while suppressing contexts it shares with weather or wine. Hence, one would expect the same to happen with a polysemous lexeme such as bank in the context of river and account, respectively.

Recent work by Arora et al. (2016) has shown that the different senses of a polysemous lexeme reside in a linear substructure within a single vector and are recoverable by sparse coding. There is furthermore evidence that additive composition in low-dimensional word embeddings approximates an intersection of the contexts of two distributional word vectors (Tian et al., 2015). It therefore seems plausible that an intersective composition function should be able to recover sense-specific information.
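For reference, the randomised pairwise permutation test used for significance testing in Section 4.4 can be approximated along the following lines. This is a generic sketch of the test on per-example correctness indicators, not the exact script used for the reported numbers.

```python
import random

def paired_permutation_test(correct_a, correct_b, n_permutations=10000, seed=0):
    # Approximate (randomised) paired permutation test on the accuracy
    # difference between two systems scored on the same examples;
    # `correct_a` and `correct_b` are equal-length lists of 0/1 outcomes.
    rng = random.Random(seed)
    observed = abs(sum(correct_a) - sum(correct_b))
    exceeds = 0
    for _ in range(n_permutations):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:      # randomly swap the paired outcomes
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            exceeds += 1
    return (exceeds + 1) / (n_permutations + 1)   # two-sided p-value estimate
```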

Symmetric context window of size 1
                Adjective                  Noun                       Verb
  Senses        2     3     4     5        2     3     4     5        2     3     4     5
  Random        0.53  0.32  0.25  0.14     0.47  0.32  0.23  0.19     0.47  0.31  0.23  0.18
  Word Overlap  0.63  0.46  0.47  0.40     0.55  0.40  0.37  0.34     0.54  0.44  0.38  0.29
  word2vec      0.70  0.56  0.61† 0.54†    0.66‡ 0.52‡ 0.50‡ 0.44‡    0.63‡ 0.56‡ 0.52‡ 0.43‡
  dep2vec       0.65  0.64‡ 0.57  0.57‡    0.64‡ 0.50‡ 0.49‡ 0.48‡    0.63‡ 0.55‡ 0.50‡ 0.43‡
  SENSEMBED     0.67  0.54  0.56  0.56†    0.64‡ 0.49‡ 0.50‡ 0.43‡    0.62† 0.53‡ 0.49‡ 0.38†

Symmetric context window of size 2
                Adjective                  Noun                       Verb
  Senses        2     3     4     5        2     3     4     5        2     3     4     5
  Random        0.53  0.32  0.25  0.14     0.47  0.32  0.23  0.19     0.47  0.31  0.23  0.18
  Word Overlap  0.66  0.51  0.55  0.43     0.59  0.47  0.43  0.41     0.58  0.51  0.45  0.36
  word2vec      0.70  0.64† 0.58  0.55     0.71‡ 0.63‡ 0.59‡ 0.54‡    0.68‡ 0.64‡ 0.58‡ 0.49‡
  dep2vec       0.71  0.65‡ 0.58  0.57‡    0.70‡ 0.57‡ 0.55‡ 0.55‡    0.66† 0.64‡ 0.54† 0.46†
  SENSEMBED     0.72‡ 0.62  0.61  0.52     0.69‡ 0.56† 0.57‡ 0.51†    0.67‡ 0.65† 0.57  0.45

Symmetric context window of size 4
                Adjective                  Noun                       Verb
  Senses        2     3     4     5        2     3     4     5        2     3     4     5
  Random        0.53  0.32  0.25  0.14     0.47  0.32  0.23  0.19     0.47  0.31  0.23  0.18
  Word Overlap  0.67  0.55  0.58  0.51     0.62  0.50  0.49  0.45     0.59  0.55  0.50  0.40
  word2vec      0.71  0.65† 0.65  0.57     0.73‡ 0.61‡ 0.62‡ 0.57‡    0.71‡ 0.62† 0.57  0.53‡
  dep2vec       0.72  0.66† 0.60  0.54     0.71‡ 0.55  0.56† 0.53†    0.67  0.62  0.54  0.50
  SENSEMBED     0.75  0.59  0.62  0.55     0.69‡ 0.57† 0.58‡ 0.53†    0.68‡ 0.62† 0.55  0.47

Table 5: Performance overview for all parts-of-speech and numbers of senses. ‡: statistically significant at the p < 0.01 level in comparison to the Word Overlap baseline; †: statistically significant at the p < 0.05 level in comparison to the Word Overlap baseline.

Figure 2: Average performance across parts-of-speech per number of senses and context window.

Noun – 5 Senses
  Context Window Size    1      2      4
  word2vec               0.44   0.54   0.57
  dep2vec                0.48   0.55   0.53
  SENSEMBED              0.43   0.51   0.53
  SENSEMBED & Babelfy    0.45   0.49   0.54

Table 6: Results on the 5-sense noun subtask with SENSEMBED having access to Babelfy sense labels at test time.

Noun – 2 Senses, context window size = 2
  Frequency Band    < 10k   10k–50k   ≥ 50k
  Random            0.51    0.51      0.51
  Word Overlap      0.66    0.60      0.56
  word2vec          0.81    0.64      0.66
  dep2vec           0.77    0.67      0.66
  SENSEMBED         0.74    0.68      0.60

Table 7: Results on a subsample of the 2-sense noun subtask across frequency bands.

To qualitatively analyse this hypothesis, we used the word2vec and SENSEMBED vectors to compose a small number of example phrases by pointwise addition and calculated their top 5 nearest neighbours in terms of cosine similarity. For SENSEMBED we manually sense tagged the phrases with the appropriate BabelNet sense labels prior to composition. We omitted the BabelNet sense labels in the neighbour list for brevity; however, they were consistent with the intended sense in all cases. Table 8 supports the view of composition as a way of contextualising the meaning of a lexeme.
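The top-5 neighbour lists in Table 8 can in principle be reproduced with a few lines of code. The sketch below assumes an in-memory mapping from words (or sense-tagged tokens) to vectors and is meant as an illustration only.

```python
import numpy as np

def phrase_neighbours(phrase, vectors, k=5):
    # Compose the phrase by pointwise addition and return the k vocabulary
    # items that are most cosine-similar to the composed vector.
    words = phrase.split()
    query = np.sum([vectors[w] for w in words], axis=0)
    query = query / np.linalg.norm(query)
    scored = []
    for word, vec in vectors.items():
        sim = float(np.dot(query, vec / np.linalg.norm(vec)))
        scored.append((sim, word))
    return [word for _, word in sorted(scored, reverse=True)[:k]]

# e.g. phrase_neighbours("river bank", vectors) would, on a space like the
# one used here, be expected to return items such as "bank", "river", "creek".
```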

In all cases in our example, the word2vec neighbours reflect the intended sense of the polysemous lexeme, providing evidence for the linear substructure of word senses in a single vector as discovered by Arora et al. (2016), and suggesting that distributional composition is able to recover sense-specific information from a polysemous lexeme. The very fine-grained sense-level vector space of SENSEMBED gives rise to a very focused neighbourhood; however, there does not seem to be any advantage over word2vec from a qualitative point of view when using simple additive composition.

6 Related Work

Perhaps the most popular tasks for evaluating the ability of a model to capture or encode the different senses of a polysemous lexeme in a given context are the English lexical substitution task (McCarthy and Navigli, 2007) and the Microsoft sentence completion challenge (Zweig and Burges, 2011). Both tasks require a model to fill an appropriate word into a pre-defined slot in a given sentential context. The sentence completion challenge provides a list of candidate words while the English lexical substitution task does not. However, neither task focuses on polysemy, and the English lexical substitution task conflates the problems of discriminating word senses and finding meaning-preserving substitutes.

Dictionary definitions have previously been used to evaluate compositional distributional semantic models, where the goal is to match a dictionary entry with its corresponding definition (Kartsaklis et al., 2012; Polajnar and Clark, 2014). These datasets are commonly set up as retrieval tasks, but generally do not test the ability of a model to disambiguate a polysemous word in context, or to discriminate multiple definitions of the same word.

Our task also provides a novel evaluation for compositional distributional semantic models, where the predominant strategy is to estimate the similarity of two short phrases (Bernardi et al., 2013; Grefenstette and Sadrzadeh, 2011; Kartsaklis and Sadrzadeh, 2014; Mitchell and Lapata, 2008; Mitchell and Lapata, 2010) or sentences (Agirre et al., 2016; Huang et al., 2012; Marelli et al., 2014) in comparison to human-provided gold-standard judgements. One problem with these similarity tasks is that the similarity or relatedness of two sentences is very difficult to judge — especially on a fine-grained scale — even for humans. This frequently results in a relatively high variance of judgements and low inter-annotator agreement (Batchkarov et al., 2016). The short phrase datasets typically have a fixed structure that tests only a very small fraction of the possible grammatical constructions in which a lexeme can occur, and they furthermore provide very little context. The use of full sentences remedies the lack of context and grammatical variation, but such datasets can still contain a significant level of noise due to their automatic construction or to the variance in human ratings. In contrast, our task is not set up as a sentence similarity task and therefore avoids the use of human similarity judgements.

Our task is similar to word-sense induction (WSI); however, we only focus on discriminating the sense of a polysemous lexeme in context rather than inducing a set of senses from raw data and appropriately tagging subsequent occurrences of polysemous instances with the inferred inventory. Separating the sense discrimination task from the problem of sense induction has the advantage of making our task applicable to evaluating compositional distributional semantic models in order to test their ability to appropriately contextualise a polysemous lexeme. Since it does not require models to perform an extra sense induction step, our task is easier to evaluate, as no matching between sense clusters identified by a model and some gold-standard sense classes needs to be performed, as typically proposed in the WSI literature (Agirre and Soroa, 2007; Manandhar et al., 2010).

Most closely related to our task are the Stanford Contextual Word Similarity (SCWS) dataset by Huang et al. (2012) and the Usage Similarity (USim) task by Erk et al. (2009). The goal in both tasks is to estimate the similarity of two polysemous words in context in comparison to human-provided gold-standard judgements. In the SCWS dataset typically two different lexemes are considered, whereas in USim and our task the same lexemes with different contexts are compared.

Phrase         | word2vec neighbours                                 | SENSEMBED neighbours
river bank     | bank, river, creek, lake, rivers                    | bank, river, stream, creek, river basin
bank account   | account, bank, accounts, banks, citibank            | bank, banks, the bank, pko bank polski, handlowy
dry weather    | weather, dry, wet weather, wet, unreasonably warm   | dry, weather, humid, cold, cool
dry clothes    | dry, clothes, clothing, rinse thoroughly, wet       | dry, clothes, warm, cold, wet
capital city   | capital, city, cities, downtown, town               | city, capital, the capital city, town, provincial capital
capital asset  | capital, asset, assets, investment, worth           | capital, asset, investment, assets, investor
power plant    | plant, power, plants, coalfired, megawatt           | power, plant, near-limitless, pulse-power, power of the wind
garden plant   | plant, garden, plants, gardens, vegetable garden    | plant, garden, plants, oakville assembly, solanaceous
window bar     | bar, window, windows, doorway, door                 | window, bar, windows, glass window, wall
sandwich bar   | bar, sandwich, restaurant, burger, diner            | sandwich, bar, restaurant, hot dog, cake
gasoline tank  | gasoline, tank, fuel, gallon, tanks                 | gasoline, tank, fuel, petrol, kerosene
armored tank   | armored, tank, tanks, M1A1 Abrams, armored vehicle  | armored, armoured, tank, tanks, light tank
desert rock    | rock, desert, rocks, desolate expanse, arid desert  | desert rock, the desert, deserts, badlands
rock band      | rock, band, rockers, bands, indie rock              | band, rock, group, the band, rock group

Table 8: Nearest neighbours of composed phrases for word2vec and SENSEMBED. Distributional composition in word2vec is able to recover sense-specific information remarkably well. Some neighbours are phrases because they have been encoded as a single token in the original vector space.

Instead of relying on crowd-sourced human gold-standard similarity judgements, which can be prone to a considerable amount of noise11, we leverage the high-quality content of available English dictionaries. Furthermore, our task is not formulated as estimating the similarity between two lexemes in context, but as identifying the sentences that use the same sense of a given polysemous lexeme.

11 For example, the average standard deviation of human ratings in the SCWS dataset is ≈3 on a 10-point scale, and can be up to 4–5 in some cases.

7 Conclusion

While elementary multi-sense representations of words might capture a more fine-grained semantic picture of a polysemous word, that advantage does not appear to transfer to distributional composition in a straightforward way. Our experiments on a popular phrase similarity benchmark and on our novel word-sense discrimination task have demonstrated that semantic composition does not appear to benefit from a fine-grained sense inventory, but that the ability to contextualise a polysemous lexeme in single-sense vector models is sufficient for superior performance. We furthermore have provided qualitative and quantitative evidence that an intersective composition function such as pointwise addition for neural word embeddings is able to discriminate the meaning of a word in context, and is able to recover sense-specific information remarkably well.

Lastly, our experiments have uncovered an important question for multi-sense vector models, namely how to exploit the fine-grained sense-level representations for distributional composition. Our novel word-sense discrimination task provides an excellent testbed for compositional distributional semantic models, following either a single-sense or a multi-sense vector modelling paradigm, due to its focus on assessing the ability of a model to appropriately contextualise the meaning of a word. Our task furthermore provides another evaluation option away from intrinsic evaluations, which are based on often noisy human similarity judgements, while also not being embedded in a downstream task.

In future work we aim to extend our evaluation to more complex compositional distributional semantic models such as the lexical function model (Paperno et al., 2014) or the Anchored Packed Dependency Tree framework (Weir et al., 2016). We furthermore want to investigate how far the sense-discriminating ability of composition can be leveraged for other tasks.

Acknowledgments

We would like to thank our anonymous reviewers for their helpful comments.

References

Eneko Agirre and Aitor Soroa. 2007. SemEval-2007 task 02: Evaluating word sense induction and discrimination systems. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 7–12, Prague, Czech Republic, June. Association for Computational Linguistics.

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511, San Diego, California, June. Association for Computational Linguistics.

87 Cem Akkaya, Janyce Wiebe, and Rada Mihalcea. Katrin Erk, Diana McCarthy, and Nicholas Gaylord. 2012. Utilizing semantic composition in distribu- 2009. Investigations on word senses and word us- tional semantic models for word sense discrimina- ages. In Proceedings of the Joint Conference of the tion and word sense disambiguation. In Proceedings 47th Annual Meeting of the ACL and the 4th In- of ICSC, pages 45–51. ternational Joint Conference on Natural Language Processing of the AFNLP, pages 10–18, Suntec, Sin- Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, gapore, August. Association for Computational Lin- and Andrej Risteski. 2016. Linear algebraic struc- guistics. ture of word senses, with applications to polysemy. CoRR, abs/1601.03764. Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011. Experimental support for a categorical composi- Pierpaolo Basile, Annalina Caputo, and Giovanni Se- tional distributional model of meaning. In Proceed- meraro. 2014. An enhanced lesk word sense dis- ings of the 2011 Conference on Empirical Meth- ambiguation algorithm through a distributional se- ods in Natural Language Processing, pages 1394– mantic model. In Proceedings of COLING 2014, 1404, Edinburgh, Scotland, UK., July. Association the 25th International Conference on Computational for Computational Linguistics. Linguistics: Technical Papers, pages 1591–1600, Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa, Dublin, Ireland, August. Dublin City University and and Yoshimasa Tsuruoka. 2014. Jointly learn- Association for Computational Linguistics. ing word representations and composition functions using predicate-argument structures. In Proceed- Miroslav Batchkarov, Thomas Kober, Jeremy Reffin, ings of the 2014 Conference on Empirical Methods Julie Weeds, and David Weir. 2016. A critique of in Natural Language Processing (EMNLP), pages word similarity as a method of evaluating distribu- 1544–1555, Doha, Qatar, October. Association for tional semantic models. In Proceedings of the 1st Computational Linguistics. Workshop on Evaluating Vector-Space Representa- tions for NLP, pages 7–12. Association for Compu- Felix Hill, KyungHyun Cho, Anna Korhonen, and tational Linguistics. Yoshua Bengio. 2016. Learning to understand phrases by embedding the dictionary. Transactions Raffaella Bernardi, Georgiana Dinu, Marco Marelli, of the Association for Computational Linguistics, and Marco Baroni. 2013. A relatedness benchmark 4:17–30. to test the role of determiners in compositional dis- tributional semantics. In Proceedings of the 51st An- Eric Huang, Richard Socher, Christopher Manning, nual Meeting of the Association for Computational and Andrew Ng. 2012. Improving word represen- Linguistics (Volume 2: Short Papers), pages 53–57, tations via global context and multiple word proto- Sofia, Bulgaria, August. Association for Computa- types. In Proceedings of the 50th Annual Meeting of tional Linguistics. the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 873–882, Jeju Island, Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. Korea, July. Association for Computational Linguis- 2014. A unified model for word sense represen- tics. tation and disambiguation. In Proceedings of the 2014 Conference on Empirical Methods in Natu- Ignacio Iacobacci, Mohammad Taher Pilehvar, and ral Language Processing (EMNLP), pages 1025– Roberto Navigli. 2015. Sensembed: Learning 1035, Doha, Qatar, October. Association for Com- sense embeddings for word and relational similarity. putational Linguistics. 
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the Georgiana Dinu and Mirella Lapata. 2010. Measur- 7th International Joint Conference on Natural Lan- ing distributional similarity in context. In Proceed- guage Processing (Volume 1: Long Papers), pages ings of the 2010 Conference on Empirical Methods 95–105, Beijing, China, July. Association for Com- in Natural Language Processing, pages 1162–1172, putational Linguistics. Cambridge, MA, October. Association for Compu- Ignacio Iacobacci, Mohammad Taher Pilehvar, and tational Linguistics. Roberto Navigli. 2016. Embeddings for word sense disambiguation: An evaluation study. In Proceed- Georgiana Dinu and Stefan Thater. 2012. Saarland: ings of the 54th Annual Meeting of the Association Vector-based models of semantic textual similarity. for Computational Linguistics (Volume 1: Long Pa- In *SEM 2012: The First Joint Conference on Lexi- pers), pages 897–907, Berlin, Germany, August. As- cal and Computational Semantics – Volume 1: Pro- sociation for Computational Linguistics. ceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth Interna- Dimitri Kartsaklis and Mehrnoosh Sadrzadeh. 2014. A tional Workshop on Semantic Evaluation (SemEval study of entanglement in a categorical framework of 2012), pages 603–607, Montreal,´ Canada, 7-8 June. natural language. In Proceedings of the 11th Work- Association for Computational Linguistics. shop on Quantum Physics and Logic (QPL). Bradley Efron and Robert Tibshirani. 1994. An Intro- Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, and Stephen duction to the Bootstrap. CRC press. Pulman. 2012. A unified sentence space for

88 categorical distributional-compositional semantics: tions of words and phrases and their composition- Theory and experiments. In Proceedings of COL- ality. In C.J.C. Burges, L. Bottou, M. Welling, ING 2012: Posters, pages 549–558, Mumbai, India, Z. Ghahramani, and K.Q. Weinberger, editors, Ad- December. The COLING 2012 Organizing Commit- vances in Neural Information Processing Systems tee. 26, pages 3111–3119. Curran Associates, Inc.

Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, and Stephen George A. Miller, Claudia Leacock, Randee Tengi, and Pulman. 2013. Separating disambiguation from Ross T. Bunker. 1993. A semantic concordance. In composition in distributional semantics. In Pro- Proceedings of the Arpa Workshop on Human Lan- ceedings of the Seventeenth Conference on Com- guage Technology, pages 303–308. putational Natural Language Learning, pages 114– 123, Sofia, Bulgaria, August. Association for Com- Jeff Mitchell and Mirella Lapata. 2008. Vector-based putational Linguistics. models of semantic composition. In Proceedings of the 46th Annual Meeting of the Association for Com- Thomas Kober, Julie Weeds, Jeremy Reffin, and David putational Linguistics: Human Language Technol- Weir. 2016. Improving sparse word representations ogy Conference, pages 236–244, Columbus, Ohio, with distributional inference for semantic composi- June. Association for Computational Linguistics. tion. In Proceedings of the 2016 Conference on Em- pirical Methods in Natural Language Processing, pages 1691–1702, Austin, Texas, November. Asso- Jeff Mitchell and Mirella Lapata. 2010. Composition ciation for Computational Linguistics. in distributional models of semantics. Cognitive Sci- ence, 34(8):1388–1429. Omer Levy and Yoav Goldberg. 2014. Dependency- based word embeddings. In Proceedings of the 52nd Hans Moen, Erwin Marsi, and Bjorn¨ Gamback.¨ 2013. Annual Meeting of the Association for Computa- Towards dynamic word sense discrimination with tional Linguistics (Volume 2: Short Papers), pages random indexing. In Proceedings of the Workshop 302–308, Baltimore, Maryland, June. Association on Continuous Vector Space Models and their Com- for Computational Linguistics. positionality, pages 83–90, Sofia, Bulgaria, August. Association for Computational Linguistics. Jiwei Li and Dan Jurafsky. 2015. Do multi-sense em- beddings improve natural language understanding? Andrea Moro, Alessandro Raganato, and Roberto Nav- In Proceedings of the 2015 Conference on Empiri- igli. 2014. Entity linking meets word sense disam- cal Methods in Natural Language Processing, pages biguation: A unified approach. Transactions of the 1722–1732, Lisbon, Portugal, September. Associa- Association for Computational Linguistics, 2:231– tion for Computational Linguistics. 244.

Suresh Manandhar, Ioannis Klapaftis, Dmitriy Dligach, Patrick Pantel and Dekang Lin. 2002. Discovering and Sameer Pradhan. 2010. Semeval-2010 task 14: word senses from text. In Proceedings of the Eighth Word sense induction & disambiguation. In Pro- ACM SIGKDD International Conference on Knowl- ceedings of the 5th International Workshop on Se- edge Discovery and Data Mining, KDD ’02, pages mantic Evaluation, pages 63–68, Uppsala, Sweden, 613–619, New York, NY, USA. ACM. July. Association for Computational Linguistics. Denis Paperno, Nghia The Pham, and Marco Baroni. Marco Marelli, Stefano Menini, Marco Baroni, Luisa 2014. A practical and linguistically-motivated ap- Bentivogli, Raffaella bernardi, and Roberto Zam- proach to compositional distributional semantics. In parelli. 2014. A sick cure for the evaluation of Proceedings of the 52nd Annual Meeting of the As- compositional distributional semantic models. In sociation for Computational Linguistics (Volume 1: Nicoletta Calzolari, Khalid Choukri, Thierry De- Long Papers), pages 90–99, Baltimore, Maryland, clerck, Hrafn Loftsson, Bente Maegaard, Joseph June. Association for Computational Linguistics. Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth Interna- tional Conference on Language Resources and Eval- Mohammad Taher Pilehvar and Nigel Collier. 2016. uation (LREC’14), pages 216–223, Reykjavik, Ice- De-conflated semantic representations. In Proceed- land, May. European Language Resources Associa- ings of the 2016 Conference on Empirical Methods tion (ELRA). ACL Anthology Identifier: L14-1314. in Natural Language Processing, pages 1680–1690, Austin, Texas, November. Association for Computa- Diana McCarthy and Roberto Navigli. 2007. Semeval- tional Linguistics. 2007 task 10: English lexical substitution task. In Proceedings of the Fourth International Workshop Tamara Polajnar and Stephen Clark. 2014. Improv- on Semantic Evaluations (SemEval-2007), pages ing distributional semantic vectors through context 48–53, Prague, Czech Republic, June. Association selection and normalisation. In Proceedings of the for Computational Linguistics. 14th Conference of the European Chapter of the As- sociation for Computational Linguistics, pages 230– Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- 238, Gothenburg, Sweden, April. Association for rado, and Jeff Dean. 2013. Distributed representa- Computational Linguistics.

89 Amruta Purandare and Ted Pedersen. 2004. Word PA, USA. Association for Computational Linguis- sense discrimination by clustering contexts in vec- tics. tor and similarity spaces. In Hwee Tou Ng and Ellen Riloff, editors, HLT-NAACL 2004 Work- Geoffrey Zweig and J.C. Chris Burges. 2011. The shop: Eighth Conference on Computational Natu- microsoft research sentence completion challenge. ral Language Learning (CoNLL-2004), pages 41– Technical report, Microsoft Research. 48, Boston, Massachusetts, USA, May 6 - May 7. Association for Computational Linguistics.

Siva Reddy, Ioannis Klapaftis, Diana McCarthy, and Suresh Manandhar. 2011. Dynamic and static pro- totype vectors for semantic composition. In Pro- ceedings of 5th International Joint Conference on Natural Language Processing, pages 705–713, Chi- ang Mai, Thailand, November. Asian Federation of Natural Language Processing.

Joseph Reisinger and Raymond J. Mooney. 2010. Multi-prototype vector-space models of word mean- ing. In Human Language Technologies: The 2010 Annual Conference of the North American Chap- ter of the Association for Computational Linguistics, pages 109–117, Los Angeles, California, June. As- sociation for Computational Linguistics.

Bahar Salehi, Paul Cook, and Timothy Baldwin. 2015. A word embedding approach to predicting the com- positionality of multiword expressions. In Proceed- ings of the 2015 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 977–983, Denver, Colorado, May–June. Association for Computational Linguistics.

Hinrich Schutze.¨ 1998. Automatic word sense dis- crimination. Computational Linguistics, 24(1):97– 123, mar.

Ran Tian, Naoaki Okazaki, and Kentaro Inui. 2015. The mechanism of additive composition. CoRR, abs/1511.08407.

Tim Van de Cruys. 2008. Using three way data for word sense discrimination. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 929–936, Manch- ester, UK, August. Coling 2008 Organizing Com- mittee.

David Weir, Julie Weeds, Jeremy Reffin, and Thomas Kober. 2016. Aligning packed dependency trees: a theory of composition for distributional semantics. Computational Linguistics, special issue on Formal Distributional Semantics, 42(4):727–761, Decem- ber.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to com- positional paraphrase model and back. Transactions of the Association for Computational Linguistics, 3:345–358.

David Yarowsky. 1993. One sense per collocation. In Proceedings of the Workshop on Human Language Technology, HLT ’93, pages 266–271, Stroudsburg,

Elucidating Conceptual Properties from Word Embeddings

Kyoung-Rok Jang (School of Computing, KAIST, Daejeon, South Korea) [email protected]
Sung-Hyon Myaeng (School of Computing, KAIST, Daejeon, South Korea) [email protected]

Abstract

In this paper, we introduce a method of identifying the components (i.e. dimensions) of word embeddings that strongly signify properties of a word. By elucidating such properties hidden in word embeddings, we could make word embeddings more interpretable, and could also perform property-based meaning comparison. With this capability, we can answer questions like "To what degree does a given word have the property cuteness?" or "In what respects are two words similar?". We verify our method by examining how the strength of property-signifying components correlates with the degree of prototypicality of a target word.

1 Introduction

Modeling the meaning of words has long been studied and has served as a basis for almost every kind of NLP task. Most recent word modeling techniques are based on neural networks, and the word representations produced by such techniques are called word embeddings, which are usually low-dimensional, dense vectors of continuous-valued components. Although word embeddings have proved their usefulness in many tasks, the question of what is represented in them is understudied.

Recent studies report empirical evidence indicating that word embeddings may reflect some property information of a target word (Erk, 2016; Levy et al., 2015). Learning the properties of a word would be helpful because many NLP tasks can be related to "finding words that possess similar properties", which include finding synonyms and named entity recognition (NER). Without a method for explicating what properties are contained in embeddings, however, researchers have mostly focused on improving performance on well-known semantic benchmark tasks (e.g. SimLex-999) as a way to find better embeddings.

Performing well in such benchmark tasks is valuable but provides little help in understanding the inside of the black box. For instance, it is not possible to answer questions like "To what degree does a given word have the property cuteness?". One way to solve this problem is to elucidate the properties that are encoded in word embeddings and associate them with task performance. With this capability, we can not only enhance our understanding of word embeddings but also make it easier to compare heterogeneous word embedding models in more coherent ways. Our immediate goal in this paper is to show the feasibility of explicating the properties contained in word embeddings.

Our research can be seen as an attempt to increase the interpretability of word embeddings. It is in line with attempts to provide a human-understandable explanation for complex machine learning models, with which we can gain enough confidence to use them in decision-making processes.

There has been a line of work devoted to identifying components that are important for performing various NLP tasks such as sentiment analysis or NER (Faruqui et al., 2015; Fyshe et al., 2015; Herbelot and Vecchi, 2015; Karpathy et al., 2016; Li et al., 2016a; Li et al., 2016b; Rothe et al., 2016). Those works are analogous to ours in that they try to inspect the role of the components in word embeddings. However, they only attempt to identify key features for specific tasks rather than to elucidate properties. In contrast, our question is "what comprises word embeddings?", not "what components are important for performing well in a specific task?"


2 Feasibility Study of concepts to a set of categories. Below we de- scribe two datasets we used in our experiment: 2.1 Background HyperLex and Non-Negative Sparse Embedding Word embeddings can be seen as representing (NNSE). concepts of a word. As such, we attempt to design an property-related experiment around manipula- 2.3.1 Dataset: Non-Negative Sparse tion of concepts. In particular, we bring in the cat- Embedding (NNSE) egory theory (Murphy, 2004) where the notion of One desirable quality we wanted from the word category is defined to be “grouping concepts that embeddings to be used in our experiment is that share similar properties”. In other words, prop- there should be clear contrast between informa- erties have a direct bearing on concepts and their tive and non-informative components. In ordinary categories, according to the theory. dense word embeddings, usually every component On the other hand, researchers have argued that is filled with a non-zero value. some concepts are more typical (or central) than The Non-Negative Sparse Embedding (NNSE) others in a category (Rosch, 1973; Rosch, 1975). (Murphy et al., 2012) fulfills the condition in For instance, apple is more typical than olive in the sense that insignificant components are set to the fruit category. The typicality is a graded phe- zero. The NNSE component values falling be- nomenon, and may rise due to the strength of ‘es- tween 0 and 1 (non-negative) are generated by ap- sential’ properties that make a concept a specific plying the non-negative sparse coding algorithm category. (Hoyer, 2002) to ordinary word embeddings (e.g. The key ideas from the above are 1) concepts word2vec). of the same category share similar properties and 2) some concepts that have strong essential prop- 2.3.2 Dataset: HyperLex erties are considered more typical in specific cate- HyperLex is a dataset and evaluation resource gory, and they guided our experiment design. that quantifies the extent of the semantic category membership (Vulic´ et al., 2016). A total of 2,616 2.2 Design concept pairs are included in the dataset, and the The goal of this study is to show the feasibil- strength of category membership is given by na- ity of sifting property information from word em- tive English speakers and recorded in graded man- beddings. We assume that a concept’s property ner. This graded category membership can be in- information is captured and distributed over one terpreted as a ‘typicality score’ (1–10). Some sam- or more components (dimensions) of embeddings ples are shown in Table 1. during the learning process. Since the concepts that belong to the same category are likely to share Concept Category Score basketball activity 10 similar properties, there should be some salient spy agent 8 components that are shared among them. We call handbag bowl 0 such components as SIG-PROPS (for significant Table 1: A HyperLex sample. The score is the properties) of a specific category. answer to the question “To what degree is concept In this feasibility study, we hypothesize that the a type of category?” strength of SIG-PROPS is strongly correlated with the degree of concept’s typicality. This is based on the theory introduced in Section 2.1, that the 3 Experiment typicality phenomenon rises due to the strength of essential properties a target concept possesses. 3.1 Preparation So the concept that has (higher/lower) SIG-PROPS We first prepared pre-trained NNSE embeddings. 
values than others should be (more typical/less The authors released pre-trained model on their typical) than other concepts. website1. We used the model trained with depen- 2.3 Datasets dency context (‘Dependency model’ on the web- site), because as reported in (Levy and Goldberg, For our experiment dealing with typicality of con- 2014), models trained on dependency context tend cepts, we needed both (pre-trained) word embed- dings and a dataset that encodes typicality scores 1http://www.cs.cmu.edu/ bmurphy/NNSE/

92 to prefer functional similarity (hogwarts — sun- 3.2 Identification of SIG-PROPS nydale) rather than topical similarity (hogwarts — The goal of this step is to find SIG-PROPS that 2 dumbledore) . The embeddings are more sparse might represent each category. Simply put, SIG- than ordinary embeddings and have 300 compo- PROPS are the components that have on average nents. high value compared to other components of the Next we fetched HyperLex dataset at the au- concepts in the same category. We identified SIG- 3 thor’s website . To make the settings suitable to PROPS by 1) calculating an average value of each our experiment goal, we selected categories with component across the concepts with the same cat- the following criteria: egory, then 2) choosing those components whose average value is above h. We empirically set h to 1. The categories and instances must be con- 0.2. crete nouns (e.g. food). This is because SIG-PROPS Category people are more coherent in producing the Comp. ID Avg. c88 0.806 properties of concrete nouns (Murphy, 2004). instrument So the embeddings of concrete nouns should c258 0.769 0.587 animal c154 contain more clear property information than c265 0.221 0.550 other types of words. bird c154 c265 0.213 c207 0.298 food 2. The categories must contain enough number c233 0.269 of instances (not 1 or 2). This is to gain reli- c229 0.492 able result. c27 0.369 fruit c156 0.349 c44 0.264 3. Some categories are sub-category of another c233 0.206 selected category while others are not related. Table 3: SIG-PROPS of each category. The strings This is to see the discriminative and overlap- in “Comp. ID” column are the component IDs ping effect of identified SIG-PROPS between (c1–c300). “Avg.” column indicates the average categories. Related categories should share a value of the component across all the concepts un- set of strong SIG-PROPS, while unrelated cat- der that category. egories shouldn’t.

Table 3 shows the identified SIG-PROPS. The As a result, we selected five categories: food, number of SIG-PROPS is different across cate- fruit (sub-category of food), animal, bird (sub- gories. Interestingly, there is component overlap category of animal), and instrument. We fetched between taxonomically similar categories (‘c154’ the concepts that belong to the categories and then and ‘c265’ between animal–bird, ‘c233’ between filtered out those that aren’t contained in the pre- food–fruit), while there is none between unrelated trained NNSE embeddings. The final size of each categories (instrument–animal–food). category is shown in Table 2. This initial observation is encouraging for our Category # of Concepts feasibility study in that indeed SIG-PROPS can food 54 play a role of distinguishing or associating cate- fruit 9 gories. We argue that the identified SIG-PROPS animal 46 bird 16 strongly characterize each category, showing that instrument 14 we can associate vector components with proper- ties. Table 2: The size of selected categories 3.3 Correlation between SIG-PROPS and In the next section, we explain how we identi- concepts typicality scores fied SIG-PROPS of each category. In this section, we check how the strength of SIG- PROPS correlates with the typicality scores. Note 2 We thought “sharing similar function” is more compati- that the range of SIG-PROPS values differ from cat- ble with the notion of sharing similar properties. The topical similarity is less indicative of having properties in common. egory to category — those for instrument are es- 3http://people.ds.cam.ac.uk/iv250/hyperlex.html pecially high, which might indicate they are highly

93 representative. SIG-PROPS Correlation Corr. rank c229 0.743 1st Our assumption is that if the identified SIG- c233 0.540 4th PROPS truly represent the essential quality of a cat- c27 0.516 5th egory, the strength of SIG-PROPS should be pro- c44 0.474 7th c156 -0.663 85th portional to concepts’ typicality scores (equation 1). Table 8: Corr(SIG-PROPS, typicality): fruit

As the results show, there is clear tendency SIG- Strength(SIG-PROPS) (1) PROPS having high correlation with the typicality T ypicality(Concept) scores. Most of the SIG-PROPS showed meaning- ∝ ful correlation (> 0.5) with the typicality score or We observe this phenomenon by calculating placed at the top in the component–typicality cor- Pearson correlation between the strength of SIG- relation ranking. The result strongly indicates that PROPS and typicality scores. For instance, suppose even when we apply the simple method of iden- we calculate the correlation between ‘c88’ (88th tifying SIG-PROPS and regarding them as proper- component) of instrument concepts and their typ- ties, they serve as strong indicators for the con- icality score. We inspect the instrument concepts cept’s typicality. one by one, collect their ‘c88’ values (x) and typ- icality scores (y), and then measure the tendency 4 Conclusion and Future Work of changes in the two variables. Although limited in scale, our work showed the The result is shown in Table 4–8 where the col- feasibility of discovering properties from word umn ‘Rank’ shows the rank of the component’s embeddings. Not only SIG-PROPS can be used to correlation score (compared to other components). increase the interpretability of word embeddings, For instance, in Table 4 ‘c88’ has the highest cor- but also enable us more elaborate, property-based relation with the concept’s typicality score. meaning comparison. Our next step would be checking the applica- SIG-PROPS Correlation Corr. rank bility to general NLP tasks (e.g. NER, synonym c88 0.926 1st c258 0.918 2nd identification). Also, applying our method to word embeddings that have more granular components Table 4: Corr(SIG-PROPS, typicality): instrument (e.g. 2,500) might be helpful for identifying SIG- PROPS in more granular level.

SIG-PROPS Correlation Corr. rank Acknowledgments c154 0.549 1st c265 0.265 2nd This work was supported by Institute for Infor- mation & communications Technology Promo- Table 5: Corr(SIG-PROPS, typicality): animal tion(IITP) grant funded by the Korea govern- ment(MSIP) (No. R0126-16-1002, Development of agro-livestock cloud and application service for SIG-PROPS Correlation Corr. rank balanced production, transparent distribution and c154 0.783 1st safe consumption based on GS1) c265 0.563 2nd

Table 6: Corr(SIG-PROPS, typicality): bird References Katrin Erk. 2016. What do you know about an alliga- tor when you know the company it keeps? Seman- SIG-PROPS Correlation Corr. rank tics and Pragmatics Article, 9(17):1–63. c233 0.255 1st c120 0.224 2nd Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris c207 0.216 4th Dyer, and Noah A. Smith. 2015. Sparse overcom- c192 0.030 104th plete word vector representations. In Proceedings of the 53rd Annual Meeting of the Association for Table 7: Corr(SIG-PROPS, typicality): food Computational Linguistics and the 7th International

94 Joint Conference on Natural Language Processing Sascha Rothe, Sebastian Ebert, and Hinrich Schutze.¨ (Volume 1: Long Papers), pages 1491–1500, Bei- 2016. Ultradense word embeddings by orthogonal jing, China, July. transformation. In Proceedings of the 2016 Con- ference of the North American Chapter of the As- Alona Fyshe, Leila Wehbe, Partha P. Talukdar, Brian sociation for Computational Linguistics: Human Murphy, and Tom M. Mitchell. 2015. A composi- Language Technologies, pages 767–777, San Diego, tional and interpretable semantic space. In Proceed- California, June. ings of the 2015 Conference of the North American Chapter of the Association for Computational Lin- Ivan Vulic,´ Daniela Gerz, Douwe Kiela, Felix Hill, and guistics: Human Language Technologies, pages 32– Anna Korhonen. 2016. HyperLex: A Large-Scale 41, Denver, Colorado, May–June. Evaluation of Graded Lexical Entailment. Arxiv.

Aurelie´ Herbelot and Eva Maria Vecchi. 2015. Build- ing a shared world: Mapping distributional to model-theoretic semantic spaces. In Proceedings of EMNLP, number September, pages 22–32, Lisbon, Portugal.

P. O. Hoyer. 2002. Non-negative sparse coding. In Neural Networks for Signal Processing - Proceed- ings of the IEEE Workshop, volume 2002-Janua, pages 557–565.

Andrej Karpathy, Justin Johnson, and Li Fei-Fei. 2016. Visualizing and Understanding Recurrent Networks. International Conference on Learning Representa- tions (ICLR), pages 1–13.

Omer Levy and Yoav Goldberg. 2014. Dependency- based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computa- tional Linguistics (Volume 2: Short Papers), pages 302–308, Baltimore, Maryland, June.

Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distributional meth- ods really learn lexical inference relations? In Pro- ceedings of the 2015 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 970–976, Denver, Colorado, May–June.

Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Juraf- sky. 2016a. Visualizing and Understanding Neural Models in NLP. In Naacl, pages 1–10.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016b. Un- derstanding Neural Networks through Representa- tion Erasure. In Arxiv.

Brian Murphy, Partha Pratim Talukdar, and Tom Mitchell. 2012. Learning Effective and In- terpretable Semantic Models using Non-Negative Sparse Embedding. In Proceedings of COLING 2012: Technical Papers, number December 2012, pages 1933–1950.

Gregory Murphy. 2004. The big book of concepts. MIT press.

Eleanor H. Rosch. 1973. Natural categories. Cogni- tive Psychology, 4(3):328–350.

Eleanor H. Rosch. 1975. Cognitive representations of semantic categories. Journal of experimental psy- chology: General, 104(3):192.

95 TTCSE : a Vectorial Resource for Computing Conceptual Similarity

Enrico Mensa Daniele P. Radicioni Antonio Lieto University of Turin, Italy University of Turin, Italy University of Turin, Italy Dipartimento di Informatica Dipartimento di Informatica ICAR-CNR, Palermo, Italy [email protected] [email protected] [email protected]

Abstract a particular class of vector representations where knowledge is represented as a set of limited though In this paper we introduce the TTCSE , a cognitively relevant quality dimensions; in this linguistic resource that relies on BabelNet, representation a geometrical structure is associ- NASARI and ConceptNet, that has now ated to each quality dimension, mostly based on been used to compute the conceptual sim- cognitive accounts. In this setting, concepts cor- ilarity between concept pairs. The con- respond to convex regions, and regions with dif- ceptual representation herein provides uni- ferent geometrical properties correspond to dif- form access to concepts based on Babel- ferent sorts of concepts (Gardenfors,¨ 2014). The Net synset IDs, and consists of a vector- geometrical features of CSs have a direct appeal based semantic representation which is for common-sense representation and common- compliant with the Conceptual Spaces, a sense reasoning, since prototypes (the most rele- geometric framework for common-sense vant representatives of a category from a cogni- knowledge representation and reasoning. tive point of view, see (Rosch, 1975)) correspond The TTCSE has been evaluated in a pre- to the geometrical centre of a convex region, the liminary experimentation on a conceptual centroid. Also exemplars-based representations similarity task. can be mapped onto points in a multidimensional space, and their similarity can be computed as 1 Introduction the distance intervening between each two points, The development of robust and wide-coverage re- based on some suitable metrics such as Euclidean sources to use in different sorts of application or Manhattan distance. etc.. (such as text mining, categorization, etc.) has The CS framework has been recently used to known in the last few years a tremendous growth. extend and complement the representational and In this paper we focus on computing conceptual inferential power allowed by formal ontologies similarity between pairs of concepts, based on a —and in general symbolic representation—, that resource that extends and generalizes an attempt are not suited for representing defeasible, proto- carried out in (Lieto et al., 2016a). In particular, typical knowledge, and for dealing with the cor- the TTCSE —so named after Text to Conceptual responding typicality-based conceptual reasoning Spaces-Extended— has been acquired by inte- (Lieto et al., 2017). Also, wide-coverage seman- grating two different sorts of linguistic resources, tic resources such as DBPedia and ConceptNet, in such as the encyclopedic knowledge available fact, in different cases fail to represent the sort of in BabelNet (Navigli and Ponzetto, 2012) and common-sense information based on prototypical NASARI (Camacho-Collados et al., 2015), and and default information which is usually required the common-sense grasped by ConceptNet (Speer to perform forms of plausible reasoning.1 In this and Havasi, 2012). The resulting representation 1Although DBPedia contains information on many sorts enjoys the interesting property of being anchored of entities, due to its explicit encyclopedic commitment, to both resources, thereby providing a uniform common-sense information is dispersed among textual de- conceptual access grounded on the sense identi- scriptions (e.g., in the abstracts) rather than being available in a well-structured formal way. For instance, the fork entity can fiers provided by BabelNet. 
be categorized as an object, whilst there is no structured infor- Conceptual Spaces (CSs) can be thought of as mation about its typical usage. On the other hand, Concept-

96 Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pages 96–101, Valencia, Spain, April 4 2017. c 2017 Association for Computational Linguistics paper we explore whether and to what extent a Extraction The TTCSE takes in input c and linguistic resource describing concepts by means builds a bag-of-concepts C including the concepts of qualitative and synthetic vectorial representa- associated to c through one or more ConceptNet tion is suited to assess the conceptual similarity relationships. All ConceptNet nodes related to between pairs of concepts. the input concept c are collected: namely, we take the corresponding ConceptNet node for each 2 Vector representations with the TTCSE term in the WordNet (Miller, 1995) synset of c, sc WN-syn . For each such term we extract The TTCS has been designed to build resources ∈ c E all terms t linked through d, one of the aforemen- encoded as general conceptual representations. tioned ConceptNet relationships: that is, we col- We presently illustrate how the resource is built, c d deferring to Section 3 the description of the con- lect the terms s t and store them in the set T . c → trol strategy designed to use it in the computation Each s can be considered as a different lexical- of conceptual similarity. ization for the same concept c, so that all t can be grouped in T , that finally contains all terms asso- The TTCSE takes in input a concept c referred through a BabelNet synset ID, and produces as ciated in any way to c. output a vector representation ~c where the in- Since ConceptNet does not provide any direct put concept is described along some semantic di- anchoring mechanism to associate its terms to mensions. In turn, filling each such dimension meaning identifiers, it is necessary to determine which of the terms t T are relevant for the amounts to finding a set of appropriate concepts: ∈ features act like vector space dimensions, and they concept c. In other words, when we access the are based on ConceptNet relationships.2 The di- ConceptNet page for a certain term, we find not mensions are filled with BabelNet synset IDs, so only the association regarding that term with the that finally each concept c residing in the linguis- sense conveyed by c, but also all the associations tic resource can be defined as regarding it in any other meaning. To select only (and possibly all) the associations that concern the ~c = ID , c , , c (1) sense individuated through the previous phase, we {h d { 1 ··· k}i} d introduce the notion of relevance. To give an in- [∈D tuition of this process, the terms found in Con- where IDd is the identifier of the d-th dimension, ceptNet are considered as relevant (and thus re- and c , , c is the set of values chosen for d. { 1 ··· k} tained) either if they exhibit a heavy weight in the The control strategy implemented by the TTCSE NASARI vector corresponding to the considered includes two main steps, semantic extraction concept, or if they share at least some terms with (composed by the extraction and concept identi- the NASARI vector (further details on a similar fication phases) and the vector injection. approach can be found in (Lieto et al., 2016a)). Net is more suited to structurally represent common-sense information related to typicality. However, in ConceptNet Concept identification Once the set of relevant the coverage of this type of knowledge component is some- terms has been extracted, we need to lift them to times not satisfactory. 
For similar remarks on such resources, the corresponding concept(s), which will be used claiming for the need of new resources more suited to repre- sent common-sense information, please also refer to (Basile as value for the features. We stress, in fact, that et al., 2016). dimension fillers are concepts rather than terms 2 The full list of the employed properties, which (please refer to Eq. 1). In the concept identifica- were selected from the most salient properties in Con- ceptNet, includes: INSTANCEOF,RELATEDTO,ISA, tion step, we exploit NASARI in order to provide ATLOCATION, DBPEDIA/GENRE,SYNONYM,DERIVED- each term t T with a BabelNet synset ID, thus ∈ FROM,CAUSES,USEDFOR,MOTIVATEDBYGOAL,HAS- finally converting it into the bag-of-concepts C. SUBEVENT,ANTONYM,CAPABLEOF,DESIRES,CAUS- ESDESIRE,PARTOF,HASPROPERTY,HASPREREQUI- Given a ti T , we distinguish two main cases. ∈ SITE,MADEOF,COMPOUNDDERIVEDFROM,HASFIRST- If ti is contained in one or more synsets inside the SUBEVENT, DBPEDIA/FIELD, DBPEDIA/KNOWNFOR, DB- c c PEDIA/INFLUENCEDBY,DEFINEDAS,HASA,MEMBEROF, NASARI vector of , we obtain i (the concept un- RECEIVESACTION,SIMILARTO, DBPEDIA/INFLUENCED, derlying ti) by directly assigning to ti the identi- SYMBOLOF,HASCONTEXT,NOTDESIRES,OBSTRUCT- fier of the heaviest weighted synset that contains EDBY,HASLASTSUBEVENT,NOTUSEDFOR,NOTCA- 3 PABLEOF,DESIREOF,NOTHASPROPERTY,CREATEDBY, it. Otherwise, if ti is not included in any of the ATTRIBUTE,ENTAILS,LOCATIONOFACTION,LOCATED- NEAR. 3NASARI unified vectors are composed by a head con-

97 synsets in the NASARI vector associated to c, we erage, 14.90 concepts are used to fill each vector.5 need to choose a vector among all possible ones: we first select a list of candidate vectors (that is, 3 Computing Conceptual Similariy those containing ti in their vector head), and then One main assumption underlying our approach is choose the best one by retaining the vector where that two concepts are similar insofar as they share c’s ID has highest weight. some values on the same dimension, such as when For example, given in input the concept bank they are both used for the same ends, they share intended as a financial institution, we inspect the components, etc.. Consequently, our metrics does edges of the ConceptNet node ‘bank’ and its syn- not employ WordNet taxonomy and distances be- onyms. Then, thanks to the relevance notion we tween pairs of nodes, such as in (Wu and Palmer, get rid of associations such as ‘bank ISA flight 1994; Leacock et al., 1998; Schwartz and Gomez, maneuver’ since the term ‘flight maneuver’ is not 2008), nor it depends on information content ac- present in the vector associated to the concept counts either, such as in (Resnik, 1998a; Jiang and bank. Conversely, we accept sentences such as Conrath, 1997). AS ‘bank H A branch’ (i.e., ‘branch’ is added to T ). The representation available to the TTCSE is en- Finally, ‘branch’ goes through the concept identi- tirely filled with conceptual identifiers, so to assess fication phase, resulting in a concept ci and then it the similarity between two such values we check is added to C. whether both the concept vector ~ci and the vec- tor ~c share the same (concept) value for the same The bag-of-concepts C is then j Vector injection dimension d D, and our similarity along each scanned, and each value is injected in the template ∈ dimension basically depends on this simple intu- for ~c. Each value c , . . . , c C is still pro- { 1 n ∈ } ition: vided with the relationship that linked it to c in ConceptNet, so this value is employed to fill the 1 sim(~ci,~cj) = di dj . corresponding feature in ~c. For example, if c is D · | ∩ | k | | d D extracted from the ConceptNet relation USEDFOR X∈ USEDFOR (i.e., c c ), the value c will be added to The score computed by the TTCSE system can → k k the set of entities that are used for c. be justified based on the dimensions actually filled: this explanation can be built automatically, 2.1 Building the TTCSE resource since the similarity between ~ci and ~cj is a func- tion of the sum of the number of shared elements In order to build the set of vectors in the TTCS E in each dimension, so that one can argue that resource, the system took in input 16, 782 con- a given score x is due to the fact that along a cepts. Such concepts have been preliminarily given dimension d both concepts share some val- computed (Lieto et al., 2016b) by starting from the ues (e.g., sim(table, chair) = x because each one 10K most frequent nouns present in the Corpus of is a (ISA) ‘furniture’, both are USEDFOR ‘eating’, Contemporary American English (COCA).4 Then, ‘studying’ and ‘working’; both can be found AT- for each input concept the TTCSE scans some 3M LOCATION ‘home’, ‘office’; and each one HASA ConceptNet nodes to retrieve the terms that appear ‘leg’). into the WordNet synset of the input. 
This step Ultimately, the TTCS collects information allows to browse over 11M associations avail- E along the 44 dimensions listed in footnote 2, so able in ConceptNet, and to extract on average 155 that we are in principle able to assess in how far ConceptNet nodes for each input concept. Sub- similar they are along each and every dimension. sequently, the TTCS exploits the 2.8M NASARI E However, our approach is presently limited by vectors to decide whether each of the extracted the actual average filling factor, and by the noise nodes is relevant or not w.r.t. the input concept, that can be possibly collected by an automatic and then it tries to associate a NASARI vector to procedure built on top of the BabelNet knowledge each of them (concept identification step). On av- base. Since we need to deal with noisy and incom- cept (represented by its ID in the first position) and a body, plete information, some adjustments to the above that is a list of synsets related to the head concept. Each formula have been necessary in order to handle synset ID is followed by a number that grasps the strength of its correlation with the head concept. 5The final resource is available for download at the URL 4http://corpus.byu.edu/full-text/. http://ttcs.di.unito.it.

Ultimately, the TTCS_E collects information along the 44 dimensions listed in footnote 2, so that we are in principle able to assess to what extent two concepts are similar along each and every dimension. However, our approach is presently limited by the actual average filling factor, and by the noise that can possibly be collected by an automatic procedure built on top of the BabelNet knowledge base. Since we need to deal with noisy and incomplete information, some adjustments to the above formula have been necessary in order to handle, within each dimension, the possibly unbalanced number of concepts that characterize the different dimensions, and to prevent, across dimensions, the computation from being biased by more richly defined concepts (i.e., those with more dimensions filled). The computation of the conceptual similarity score is thus based on the following formula:

sim(\vec{c}_i, \vec{c}_j) = \frac{1}{|D^*|} \sum_{d \in D} \frac{|d_i \cap d_j|}{\beta(\alpha a + (1 - \alpha) b) + |d_i \cap d_j|}

where |d_i ∩ d_j| counts the number of concepts that are used as fillers for the dimension d in the concepts c_i and c_j, respectively; a and b are computed as a = min(|d_i − d_j|, |d_j − d_i|) and b = max(|d_i − d_j|, |d_j − d_i|); and |D*| counts the dimensions filled with at least one concept in both vectors.
This formula is known as the Symmetrical Tversky's Ratio Model (Jimenez et al., 2013), a symmetrical reformulation of Tversky's ratio model (Tversky, 1977). It enjoys the following properties: i) it allows grasping the number of common traits between the two vectors (at the numerator); ii) it allows tuning the balance between cardinality differences (through the parameter α), and between |d_i ∩ d_j| and |d_i − d_j|, |d_j − d_i| (through the parameter β). Interestingly, by setting α = .5 and β = 1 the formula equals the popular Dice's coefficient. The parameters α and β were set to .8 and .2 for the experimentation.

4 Evaluation
In the experimentation we addressed the conceptual similarity task at the sense level, that is, the TTCS_E system has been fed with sense pairs. We considered three datasets, namely the RG, MC and WS-Sim datasets,6 first designed in (Rubenstein and Goodenough, 1965), (Miller and Charles, 1991) and (Agirre et al., 2009), respectively. Historically, while the first two (RG and MC) datasets were originally conceived for similarity measures, the WS-Sim dataset was developed as a subset of a former dataset built by (Finkelstein et al., 2001), created to test similarity by including pairs of words related through specific relationships, such as synonymy, hyponymy, and unrelated. All senses were mapped onto WordNet 3.0.
The similarity scores computed by the TTCS_E system have been assessed through Pearson's r and Spearman's ρ correlations, which are largely adopted for the conceptual similarity task (for a recent compendium of the approaches please refer to (Pilehvar and Navigli, 2015)). The former measure captures the linear correlation of two variables as their covariance divided by the product of their standard deviations, thus basically allowing to grasp differences in their values, whilst the Spearman correlation is computed as the Pearson correlation between the rank values of the considered variables, so it is reputed to be best suited to assess results in a similarity ranking setting where relative scores are relevant (Schwartz and Gomez, 2011; Pilehvar and Navigli, 2015).
Table 1 shows the results obtained by the system in a preliminary experimentation.

            ρ      r
RG        0.78   0.85
MC        0.77   0.80
WS-Sim    0.64   0.54

Table 1: Spearman (ρ) and Pearson (r) correlations obtained over the three datasets.

Provided that the present task of sense-level similarity is slightly different from word-level similarity (about this distinction, please refer to (Pilehvar and Navigli, 2015)), so that our results can hardly be compared to those in the literature, the reported figures are still far from the state of the art, where the Spearman correlation ρ reaches 0.92 for the RG dataset (Pilehvar and Navigli, 2015), 0.92 for the MC dataset (Agirre et al., 2009), and 0.81 for the WS-Sim dataset (Halawi et al., 2012; Tau Yih and Qazvinian, 2012).7 However, we remark that the TTCS_E employs vectors of a very limited size w.r.t. the standard vector-based resources used in current models of distributional semantics (as mentioned, each vector is defined, on average, through 14.90 concepts).

6 Publicly available at the URL http://www.seas.upenn.edu/~hansens/conceptSim/.
7 Rich references to state-of-the-art results and works experimenting on the mentioned datasets can be found on the ACL Wiki, at the URL https://goo.gl/NQlb6g.
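For concreteness, the adjusted score and the evaluation protocol can be sketched as follows. This is an illustrative reimplementation under the same dict-of-sets vector layout as the earlier sketch, not the authors' code, and the score lists passed to SciPy are toy values rather than the reported results.

```python
from scipy.stats import pearsonr, spearmanr

def sim_sstrm(ci, cj, alpha=0.8, beta=0.2):
    """Symmetrical Tversky's Ratio Model over dimension fillers,
    averaged over D*, the dimensions filled in both concept vectors."""
    d_star = [d for d in ci if d in cj and ci[d] and cj[d]]
    if not d_star:
        return 0.0
    total = 0.0
    for d in set(ci) | set(cj):
        di, dj = ci.get(d, set()), cj.get(d, set())
        inter = len(di & dj)
        if inter == 0:
            continue  # a dimension with no shared fillers contributes nothing
        a = min(len(di - dj), len(dj - di))
        b = max(len(di - dj), len(dj - di))
        total += inter / (beta * (alpha * a + (1 - alpha) * b) + inter)
    return total / len(d_star)

# Evaluation sketch: correlate system scores with gold human judgements
# over a list of sense pairs (toy numbers, not real data).
gold = [4.0, 3.6, 0.4, 1.2]
system = [1.75, 1.40, 0.10, 0.55]
rho, _ = spearmanr(system, gold)
r, _ = pearsonr(system, gold)
print(f"Spearman rho={rho:.2f}, Pearson r={r:.2f}")
```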

Moreover, due to the explicit grounding provided by connecting the NASARI feature values to the corresponding properties in ConceptNet, the TTCS_E can be used to provide the scores returned as output with an explanation, based on the shared concepts along some given dimension. To the best of our knowledge, this is a unique feature, which cannot be easily reached by methods based on Latent Semantic Analysis (such as those pioneered by (Deerwester et al., 1990)) and can be only partly approached by techniques exploiting taxonomic structures (Resnik, 1998b; Banerjee and Pedersen, 2003). Conversely, few and relevant traits are present in the final linguistic resource, which is thus synthetic and more cognitively plausible (Gärdenfors, 2014).
In some cases (27 concept pairs out of the overall 190 pairs) the system was not able to retrieve an ID for one of the concepts in the pair: such pairs were excluded from the computation of the final accuracy. Missing concepts may be lacking in (at least one of) the resources upon which the TTCS_E is built: including further resources may thus be helpful to overcome this limitation. Also, difficulties stemmed from insufficient information for the concepts at stake: this phenomenon was observed, e.g., when both concepts have been found, but no common dimension has been filled. This sort of difficulty shows that the coverage of the resource still needs to be enhanced by improving the extraction phase, so as to add further concepts per dimension and to fill more dimensions.

5 Conclusions
In this paper we have introduced a novel resource, the TTCS_E, which is compatible with the Conceptual Spaces framework and aims at putting together encyclopedic and common-sense knowledge. The resource has been employed to compute the conceptual similarity between concept pairs. Thanks to its representational features, it allows implementing a simple though effective heuristics to assess similarity: that is, concepts are similar insofar as they share some values along the same dimension. However, further heuristics will be investigated in the near future as well.
A preliminary experimentation has been run, employing three different datasets. While we consider the obtained results encouraging, the experimentation clearly points out that there is room for improvement along two main axes: dimensions must be filled with further information, and the quality of the extracted information should also be improved. Both aspects will be the object of our future efforts.

References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In Procs of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27. ACL.

S. Banerjee and T. Pedersen. 2003. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pages 805–810.

Valerio Basile, Soufian Jebbara, Elena Cabrio, and Philipp Cimiano. 2016. Populating a knowledge base with object-location relations using distributional semantics. In Eva Blomqvist, Paolo Ciancarini, Francesco Poggi, and Fabio Vitali, editors, EKAW, volume 10024 of Lecture Notes in Computer Science, pages 34–50.

José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. NASARI: a novel approach to a semantically-aware representation of items. In Proceedings of NAACL, pages 567–577.

S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, pages 406–414. ACM.

Peter Gärdenfors. 2014. The geometry of meaning: Semantics based on conceptual spaces. MIT Press.

Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. 2012. Large-scale learning of word relatedness with constraints. In Qiang Yang, Deepak Agarwal, and Jian Pei, editors, KDD, pages 1406–1414. ACM.

Jay J. Jiang and David W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of International Conference on Research in Computational Linguistics, Taiwan.

Sergio Jimenez, Claudia Becerra, Alexander Gelbukh, Av Juan Dios Bátiz, and Av Mendizábal. 2013. Softcardinality-core: Improving text overlap with distributional measures for semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics, volume 1, pages 194–201.

Claudia Leacock, George A. Miller, and Martin Chodorow. 1998. Using corpus statistics and wordnet relations for sense identification. Computational Linguistics, 24(1):147–165.

Antonio Lieto, Enrico Mensa, and Daniele P. Radicioni. 2016a. A Resource-Driven Approach for Anchoring Linguistic Resources to Conceptual Spaces. In Procs of the XV International Conference of the Italian Association for Artificial Intelligence, volume 10037 of LNAI, pages 435–449. Springer.

Antonio Lieto, Enrico Mensa, and Daniele P. Radicioni. 2016b. Taming sense sparsity: a common-sense approach. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian.

Antonio Lieto, Daniele P. Radicioni, and Valentina Rho. 2017. Dual PECCS: A Cognitive System for Conceptual Representation and Categorization. Journal of Experimental & Theoretical Artificial Intelligence, 29(2):433–452.

George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and cognitive processes, 6(1):1–28.

George A. Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Mohammad Taher Pilehvar and Roberto Navigli. 2015. From senses to texts: An all-in-one graph-based approach for measuring semantic similarity. Artif. Intell., 228:95–128.

Philip Resnik. 1998a. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11(1).

Philip Resnik. 1998b. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11(1).

Eleanor Rosch. 1975. Cognitive Representations of Semantic Categories. Journal of experimental psychology: General, 104(3):192–233.

Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.

Hansen A. Schwartz and Fernando Gomez. 2008. Acquiring knowledge from the web to be used as selectors for noun sense disambiguation. In Procs of the Twelfth Conference on Computational Natural Language Learning, pages 105–112. ACL.

Hansen A. Schwartz and Fernando Gomez. 2011. Evaluating semantic metrics on tasks of concept similarity. In Proc. Int. Florida Artif. Intell. Res. Soc. Conf. (FLAIRS), page 324.

Robert Speer and Catherine Havasi. 2012. Representing General Relational Knowledge in ConceptNet 5. In LREC, pages 3679–3686.

Wen Tau Yih and Vahed Qazvinian. 2012. Measuring word relatedness using heterogeneous vector space models. In HLT-NAACL, pages 616–620. The Association for Computational Linguistics.

Amos Tversky. 1977. Features of similarity. Psychological review, 84(4):327.

Zhibiao Wu and Martha Palmer. 1994. Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, pages 133–138. ACL.

Measuring the Italian-English lexical gap for action verbs and its impact on translation

Lorenzo Gregori, University of Florence, [email protected]
Alessandro Panunzi, University of Florence, [email protected]

Abstract languages, as in translation tasks (Ivir, 1977). In this latter case, a lexical gap can be defined as the This paper describes a method to measure absence of direct lexeme in one language while the lexical gap of action verbs in Italian comparing two languages during translation (Cvi- and English by using the IMAGACT on- likaite,˙ 2006). The presence of lexical gaps be- tology of action. The fine-grained cate- tween two languages is more than a theoretical gorization of action concepts of the data problem, having a strong impact in several related source allowed to have wide overview of fields: lexicographers need to deal with lexical the relation between concepts in the two gaps in the compilation of bilingual dictionaries languages. The calculated lexical gap for (Gouws, 2002); in knowledge representation the both English and Italian is about 30% of creation of multilanguage linguistic resources re- the action concepts, much higher than pre- quire a strategy to cope with the lack of concepts vious results. Beyond this general num- (Jansseen, 2004); the lexical transfer process is af- bers a deeper analysis has been performed fected by the presence of lexical gaps in automatic in order to evaluate the impact that lexi- translation system, reducing their accuracy (San- cal gaps can have on translation. In partic- tos, 1990). ular a distinction has been made between Even if in literature it’s possible to find many the cases in which the presence of a lexi- examples of gaps, it’s hard to estimate them. This cal gap affects translation correctness and is due to the fact that most of the gaps are re- completeness at a semantic level. The re- lated to small semantic differences that are hard to sults highlight a high percentage of con- identify: available linguistic resources usually rep- cepts that can be considered hard to trans- resent a coarse-grained semantics, so while they late (about 18% from English to Italian are useful to discriminate the prominent senses of and 20% from Italian to English) and con- words, they can’t capture small semantic shifts. In firms that action verbs are a critical lexical addition to it, a multilanguage resource is required class for translation tasks. for this purpose, but these resources are normally built up through a mapping between two or more 1 Introduction monolingual resources and this cause an approx- Lexical gap is a well known phenomenon in lin- imation in concept definitions: similar concepts guistics and its identification allows to discover tend to be grouped together in a unitary concept some relevant features related to the semantic that represent the core-meaning and lose their se- categorization operated by languages. A lexical mantic specificities. gap corresponds to a lack of lexicalization of a certain concept in a given language. This phe- 2 IMAGACT nomenon traditionally emerged from the analysis of a single language by means of the detection of Verbs are a critical lexical class for disambiguation empty spaces in a lexical matrix (see the semi- and translation tasks: they are much more polyse- nal works by Leher (1974) and Lyons (1977); see mous than nouns and, moreover, their ambiguity also Kjellmer (2003)). Anyway, lexical gap be- is hard to resolve (Fellbaum et al., 2001). In par- comes a major issue when comparing two or more ticular the representation of word senses as sepa-

102 Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pages 102–109, Valencia, Spain, April 4 2017. c 2017 Association for Computational Linguistics rate entities is tricky, since their boundaries are of- Type discrimination agreement: a Cohen k of 0,82 ten vague causing the senses to be under-specified for 2 expert annotators and a Fleiss k of 0.73 for 4 and overlapping. From this point of view the sub- non-expert ones3. class of general verbs represent a crucial point, be- For these features IMAGACT ontology is a re- cause these verbs are characterized by both high liable data source to measure the lexical gap be- frequency in the use of language and high ambi- tween Italian and English: in fact verb Types are guity. defined independently, but linked together through IMAGACT1 is a visual ontology of action that the scenes. The comparison of Types in different provides a video-based translation and disam- language through their action prototypes allows to biguation framework for general verbs. The re- identify the action concepts that are shared be- source is built on an ontology containing a fine- tween the two languages and the ones that don’t grained categorization of action concepts, each match with any concept in the other language; in represented by one or more video prototypes as this case we have a lexical gap. recorded scenes and 3D animations. IMAGACT currently contains 1,010 scenes 3 Type relations which encompass the action concepts most com- In this frame we can perform a set-based compar- monly referred to in everyday language usage. ison, considering a Type as just a set of scenes. Data are derived from the manual annotation of A Type is a lexicalized concept, so a partition of verb occurrences in spontaneous spoken corpora the meaning, but semantic features are not repre- (Moneglia et al., 2012); the dataset has been com- sented in the ontology and, in fact, they are un- piled by selecting action verbs with the highest known: data are derived from the ability of com- frequency in the corpora and comprises 522 Ital- petent speaker in performing a categorization of ian and 554 English lemmas. Although the set similar items with respect to a lemma, without any of retrieved actions is necessarily incomplete, this attempt to formalize semes. So if we look at the methodology ensures to have a significant picture database we can say that Types are merely sets of of the main action types performed in everyday scenes. life2. Comparing a Type (T ) of a verb in source lan- The links between verbs and video scenes are 1 guage (V ) with a Type (T ) of a verb in target lan- based on the co-referentiality of different verbs 1 2 guage (V ) we can have 5 possible configurations: with respect to the action expressed by a scene 2 (i.e. different verbs can describe the same ac- 1. T T : two Types are equivalent if they 1 ≡ 2 tion, visualized in the scene). The visual represen- contain the same set of scenes; tations convey the action information in a cross- linguistic environment and IMAGACT may thus 2. T T = : two Types are disjoint if they 1 ∩ 2 ∅ be exploited to discover how the actions are lexi- don’t share any scene; calized in different languages. 3. 
T T : T is a subset of T if any scene of In addition to it IMAGACT contains a semantic 1 ⊂ 2 1 2 classification of each lemma, that is divided into T1 is also a scene of T2 and the 2 Types are Types: each verb Type identifies an action con- not equivalent; cept and contains one ore more scenes, that work 4. T T : T is a superset of T if any scene as prototypes of that concept. Type classification 1 ⊃ 2 1 2 T T is manually performed in Italian and English in of 2 is also a scene of 1 and the 2 Types are parallel, through a corpus-based annotation pro- not equivalent; cedure by native language annotators (Moneglia 5. T1 T2 = T1 * T2 T1 + T2: two Types ∩ 6 ∅∧ ∧ et al., 2012); this allowed to have a discrimina- are partially overlapping if they share some tion of verb Types based only on the annotator scenes and each Type have some scenes not competence, without any attempt to fit the data belonging to the other one. into predefined semantic models. Validation re- sults (Gagliardi, 2014) highlight a good rate of 3Inter-annotator agreement (ITA) measured on WordNet by expert annotators on comparable verb sets (score for fine- 1http://www.imagact.it grained sense inventories): ITA = 72% on SemEval-2007 2see Moneglia et al. (2012) and Moneglia (2014b) for dataset (Pradhan et al., 2007); ITA = 71% on SensEval-2 details about corpora annotation numbers and methodology. dataset (Palmer et al., 2007).
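In code, the five configurations above reduce to plain set comparisons. The following sketch is illustrative only and is not part of IMAGACT: Types are simplified to Python sets of scene identifiers, and the identifiers are invented.

```python
def type_relation(t1: set, t2: set) -> str:
    """Classify the relation between two Types, each represented as the
    set of scene IDs it contains (cases 1-5 above)."""
    if t1 == t2:
        return "equivalent"        # case 1
    if not (t1 & t2):
        return "disjoint"          # case 2
    if t1 < t2:
        return "subset"            # case 3: t1 more specific than t2
    if t1 > t2:
        return "superset"          # case 4: t1 more general than t2
    return "partial-overlap"       # case 5

# Toy example: an equivalent Type pair (scene IDs are invented).
touch_1 = {"scene_012", "scene_034", "scene_051"}
toccare_1 = {"scene_012", "scene_034", "scene_051"}
print(type_relation(toccare_1, touch_1))  # -> "equivalent"
```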

103 Figure 1: Two Equivalent Types belonging to the Figure 2: Two Hierarchically related Types be- Italian verb toccare and to the English verb touch. longing to the Italian verb accoltellare and to the English verb stab. It’s important to discuss these cases separately, because each one of them highlights a different encodes only a sub-part of the concept represented semantic relation between verbs and has different by T1. implications for translation. For example Type 1 of the English verb to stab When two Types are equivalents (case 1) the 2 and Type 1 of the Italian verb accoltellare cate- languages share the action concept the Types rep- gorize action where a sharp object pierces a body, resent: we could say that there is an interlinguistic but while stab can be applied to describe actions concept. This case is not problematic for transla- independently on their aim and the tool used, ac- tion: each occurrence of the verb V1 that belongs coltellare is applicable only when the agent vol- to Type T1 can be translated with V2; moreover untarily injures someone and the action is accom- we can apply V1 to translate any occurrence of V2 plished with a knife. In this case the Italian Type belonging to T2. is more specific than the English one, so transla- For example one Type of the English verb to tion is safe from Italian to English (stab can be touch and one Type of the Italian verb toccare used to translate any occurrence of accoltellare- are equivalent. They share 3 video scenes: Mary Type 1), but not vice versa: stab-Type 1 can not be touches the doll, Mary touches the table and John always translated with accoltellare, because a part touches Mary (see Fig. 1). Each scene is con- of its variation is covered by other Italian verbs nected to a different set of verbs (i.e. to brush, to like trafiggere, penetrare or attraversare. graze, to caress), representing a specific semantic Finally a partial overlap between Types (case concept, but they are kept together by a more gen- 5) doesn’t allow to induce any semantic relation eral concept both in Italian and in English. So in between Types: in these cases we have differ- any of these actions the verb to touch can be safely ent concepts that can refer the same action. Nor- translated in Italian with toccare and vice versa. mally these happen when the action is interpreted If two Types are disjoint (case 2) the Types re- from two different points of view and categorized fer to unrelated semantic concepts and we can as- within unrelated lexical concepts. In this case sume that translation between an occurrence of V1 we have a translation relation between V1 and belonging to T2 can not be translated with V2. V2 without having any semantic relation between In cases 3 and 4 the Types are hierarchically re- their Types. lated and we can assume the existence of a seman- For example the Italian verb abbassare, that is tic relation that links a general Type with a spe- frequently translated with lower in English, can cific one. Although we can not induce the type also be translated with position when applied to of this relation that could be hyponym, entailment, some (but not every) actions belonging to Type 1, troponym and so on. 
In this configuration we can categorizing actions involving the body; moreover see that translation is safe from specific to general, we have the same translation relation from En- but not vice versa: in case 3 any occurrence of V1 glish to Italian where sometimes (but not always) belonging to T1 can be translated with V2, while position-Type 2 can be translated with abbassare. in case 4 V2 can not be safely applied, because it Here there are two Types that represent semanti-

104 several steps: firstly verbs were annotated through a corpus-based procedure and Types were created and validated by mother tongue speakers on the basis of their linguistic competence; then for each concept a scene was produced to provide a proto- typical representation of it; after that a mapping between Italian and English was performed by linking the scenes to the Types of each language; finally annotators were requested to recheck each scene and add the missing verbs that are applica- ble to it. This last revision enriched the scene with more verbs that don’t belong to any Type. Figure 3: Two partially overlapping Types belong- We decided to exclude from the dataset all the ing to the Italian verb abbassare and to the English scenes (and the related Types) that contain untyped verb position. verbs, considering that a partial typing does not ensure the coherence of verb Type discrimination: cally independent concepts, but that can both be in fact it’s not possible to be sure that the creation applied to describe some actions, like Mary posi- of Types for these new instances would preserve tions herself lower and other similar ones. the original Type distinction. This happens rarely in Italian - English (14 After this pruning we obtained a set of 1,000 Types on our dataset) and in any of these cases Italian Types and 1,027 English Types, that refer there are other translation verbs as possible alter- to 501 and 535 verbs respectively (see Table 1). natives. Despite this, Type overlaps identification is very relevant, because it allows to discover un- IT EN expected translation candidates (i.e. target verbs Types 1,000 1,027 that have a translation relation but not a seman- Verbs 501 535 tic relation with the source verb) that can not be Scenes 980 917 extracted from a lexico-semantic resource. In ad- dition to it Type overlaps identification is crucial Table 1: Number of Types, verbs and scenes be- if the target verb is the only one translation pos- longing to the dataset. sibility and this can happen, especially between two languages that are very far: some evidences 4.2 Methodology for example have been discovered in Italian and Chinese (Pan, 2016) through a deep comparison According to our dataset, we can easily estimate of Italian Types with Chinese verbs that refer to the lexical gap by measuring the number of Types the same scenes. This work allowed to identify in source language that don’t have an equivalent some positive occurrences of this interesting phe- Type in target language. Namely for each concept nomenon, but can not be exploited for its numeric in source language we are going to verify if there is quantification: indeed an exhaustive analysis that a concept in target language that refer to the same involves the relation between action concepts can set of actions (represented by video prototypes); be made only between Italian and English, since if the match is not found we have a lexical gap in IMAGACT contains the verb Type discrimination target language. in these two languages only. As we can see in table 2, the action concepts that are lexicalized in Italian and without a correspond- 4 Lexical gap identification ing match in English are 33,6% (English gap); on the contrary the Italian gap for English concepts is 4.1 Dataset building 29,02%. In order to measure the lexical gaps in Italian and Before going ahead we need to do some con- English we created a working dataset by select- siderations about these numbers. 
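The matching step just described can be restated compactly. The sketch below is an assumption-laden simplification, not the project's code: each language's Types are given as a mapping from a Type identifier to its set of scene IDs, and the example entries are invented.

```python
def lexical_gap(source_types, target_types):
    """Fraction of source-language Types with no equivalent Type
    (i.e. no Type with exactly the same scene set) in the target language."""
    target_scene_sets = {frozenset(scenes) for scenes in target_types.values()}
    gaps = [t for t, scenes in source_types.items()
            if frozenset(scenes) not in target_scene_sets]
    return len(gaps) / len(source_types), gaps

# Toy example (invented Types and scenes):
it_types = {"toccare-1": {"s1", "s2"}, "accoltellare-1": {"s7"}}
en_types = {"touch-1": {"s1", "s2"}, "stab-1": {"s7", "s8"}}
ratio, missing = lexical_gap(it_types, en_types)
print(ratio, missing)  # 0.5 ['accoltellare-1'] -> an English gap for accoltellare-1
```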
First of all we ing the set of Types that have a full mapping in can see that these percentages are much higher the two languages. We need to consider that IMA- than the ones calculated by Bentivogli and Pi- GACT annotation process has been carried out in anta (2000), that found 7,4% of gaps for verbs in

105 IT EN EN IT root Types: these Types in source language → → • Total Types 1,000 1,027 represent concepts that are more general than Equiv. Types 664 (66,4%) 729 (70,98%) other ones in target language: the only Type Lexical gaps 336 (33,6%) 298 (29,02%) in target language that have a partial match with the Type in source language is a subset Table 2: Types in source language that have and (case 4); have not an equivalent Type in target language. middle Types: these Types have a partial • match in target language both with a more English-to-Italian comparison. This is a big shift, general Type and with a more specific one but it’s not surprising if we consider the differ- (both cases 3 and 4). ences of the two experiments in terms of method- ology and dataset: As we mentioned before we did not find any case in which a partial overlapping Type (case 5) is IMAGACT Type distinction is more fine- • the only one possible match in Italian and English grained in respect to WordNet synsets (Bar- comparison; so these cases are counted within the tolini et al., 2014); three categories above. the experiment by Bentivogli and Pianta was • 5.1 Root Types and uncertain translations led on MultiWordNet, in which multilan- guage Wordnets are created on the basis of Starting from this classification we can see that the Princeton Wordnet sense distinction (Pi- root Types are the critical ones in terms of trans- anta et al., 2002); this methodology introduce lation: in fact we don’t have a unique lexicalized an approximation in the concepts definition; concept in target language that is able to repre- sent the concept in source language; instead we the 7,4% of Bentivogli and Pianta is a gen- have more than one Type (and multiple verbs) that • eral value on verbs, while our experiment is cover different subparts of the whole general con- focused on action verbs, which are a strongly cept variation. In these cases we need to have extra ambiguous lexical class (Moneglia, 2014a); information about the action in order to translate it properly. From a computational point of view we the dictionary-based methodology proposed • can say that a word sense disambiguation of the by Bentivogli and Pianta is nearly opposite to source verb is not enough to reach a correct trans- IMAGACT reference-based approach. lation verb. Beyond these general considerations a lemma- The two sentences The cell phone lands on the by-lemma comparison with the experiment of carpet and The pole vaulter lands on the mat, for Bentivogli and Pianta (whose dataset is currently example, belong to the same action concept ac- 4 not available) would better explain this numeric cording to the semantics of the verb to land . In difference. Italian there is not a unique Type that collects these two actions: it’s possible to use atterrare for the 5 Lexical gaps and translation problems athlete, but it is not allowed for the phone, for which we need to make a semantic shift and use Besides a general measure of the gaps for action the verb cadere (that is more similar to fall down). concept it’s important to go a step beyond to ver- Again cadere is not appropriate for the athlete, be- ify in which cases the presence of a lexical gap cause it implies that the athlete stumbles and falls. impacts the translation quality. 
In order to do this, So this action concept that is lexicalized in En- we divided the Types without an equivalent in tar- glish with to land does not have a unique trans- get language in three categories: lation verb in Italian, and extra informations are leaf Types: these Types in source language required to translate it properly (if the theme is an • represent concepts that are more specific than human being or an object, in this specific case). other ones in target language; in this case the Table 3 show the number of leaf, root and mid- only Type in target language that have a par- dle Types in Italian and English; we can see that tial match with the Type in source language 4unlike The butterfly lands on the flower or The airplane is a superset (case 3); lands that belong to different concepts of to land.

106 IT EN EN IT a negative effect on translation. Type 1 of the En- → → Lexical gaps 336 298 glish verb to throw and Type 1 of the Italian verb Leaf Types 217 (64.58%) 200 (67.58%) lanciare categorize a wide set of actions in which Root Types 47 (13.99%) 43 (14.43%) an object is thrown by a person independently on Middle Types 72 (21.43%) 55 (18.46%) the presence of a destination or on the action aim (John throws the bowling bowl, John throws the Table 3: Number of Leaf, Root and Middle Types rock in the field, John throws the paper in the box in Italian and English (percentages on the lexical etc.). However these two Types are not equivalent, gaps). because the Italian one comprise also actions per- formed in a limited space with a highly controlled movement, like Marco lancia una monetina, that root Types represent the 14% of the lexical gaps in require another verb in English like to toss (Marco both the languages, corresponding to 4-5% of the tosses a coin). In this case the small gap between total Types. the Italian concept and English one does not affect the translation: in fact we can say that lanciare can 5.2 General Types and lossy translations be used to translate properly any action belonging Root Types are the most critical case for a transla- to throw - Type 1. tion task, because they affect the correctness; be- Given this consideration a measure of the se- sides there are also other kinds of lexical gaps that mantic distance with the translation verb is use- impact on translation. In particular is useful to ful to evaluate the loss: this can be easily done estimate how semantically far is the best transla- from IMAGACT dataset by calculating the ratio tion candidate in the cases in which we can apply between the cardinality (i.e. the number of scenes) a more general Type to translate the concept in the of the source Type, T1, and the one of the nearest source language. In fact in both leaf and middle target Type, T2 (the Type with the minimum cardi- Types we have a Type in target language that is nality among the Types in target language that are more general to the source one, so it is safely ap- supersets of the source Type). This ratio estimates plicable to any occurrence belonging to the source the overlapping between the Types: Type. This is not free from problems, because in T overlap = | 1| translation we use a more general verb, so we miss T2 some semes that are encoded in the source verb. Data are represented| | in Figure 4, reporting the In fact in this case we still have a translation prob- number of Types (Italian and English) for each lem, which is not in finding a possible target verb, overlap values, where this values are divided in 10 but in adding more information in other lexical el- ranges. ement of the sentence to fill the lack of semantic We considered semantically distant those Types information. In this case the gap does not affect with overlap < 0.4 (sharing less than 2 scenes the correctness of the translation, but its complete- over 5). These high distance Types (see Table 4) ness. are 150 for Italian (51.9% of leaf + middle Types For example the English verb to plonk does not and 15% of the total Types) and 145 for English have a correspondence in Italian. In particular a (56.86% of leaf + middle Types and 14.12% of the sentence like John plonks the books on the table total Types). 
belongs to a Type of plonk that is a leaf Type (so Basically we see that not only root Types, but there is a possible translation verb in Italian), but also a relevant part of leaf and middle Types (more for which the nearest Italian Type is much wider, than 50% both in Italian and English) represent a belonging to the very general verb mettere. In this critical point for translation. case it’s possible to translate in Italian with John IT EN EN IT mette i libri sul tavolo, but losing all the informa- → → tion regarding the way the books are placed on the Leaf+Mid T. 289 255 table (mettere is more similar with to put); an ad- Low dist. T. 139 (48.1%) 110 (43.14%) dition of other lexical elements to the sentence is High dist. T. 150 (51.9%) 145 (56.86%) required to fill this gap in Italian. Conversely we can say that a small distance be- Table 4: Distance from the nearest general Type in tween the source and the target Type does not have target language.

107 6 Conclusions In this paper a methodology for measuring the lex- ical gap of action verbs is described and applied to Italian and English, by exploiting IMAGACT on- tology. We measured 33.6% of English gap and 29.02% of Italian gap. Then this result have been investigated, in order to discover when and why a lexical gap can affect a translation task. The re- sults show that 19.7% of Italian Types and 18.3% of English ones represent action concept that are Figure 4: Number of Italian and English Types for critical from a translation perspective: these con- each overlap range. cepts are lexicalized by 27.15% of the Italian verbs and 28.79% of the English verbs that we consid- ered in our analysis. In addition to it the distinction Within this numbers, that are quite homoge- between concepts that can not be correctly trans- neous between the two languages, we can see that lated with a single lemma (root Types) and con- in the overlap range from 0 to 0.2 there are much cepts that can be translated with a sensible seman- more Italian Types than English ones (19% of leaf tic loss (high distance Types) is a relevant informa- + middle Types against 9%); conversely English tion that can lead to a different translation strategy. Types are more distributed in the range from 0.3 Finally we feel important to note that behind to 0.4 (Figure 4). This means that in this area of these numeric values there are lists of verbs and extreme distance between the source and the tar- concepts and this information could be integrated get concept, we have an higher semantic loss in in Machine Translation and Computer Assisted the translation from Italian to English. Translation Systems to improve their accuracy. Finally we can have have an overall value of translation critical Types, by summing up the ones Acknowledgments belonging to high distance Types class and the root Types. The verbs these Types belong to are the This research has been supported by the MOD- verbs for which the selection of a good translation ELACT Project, funded by the Futuro in Ricerca candidate is problematic. Results are reported in 2012 programme (Project Code RBFR12C608); http://modelact.lablita.it. Tables 5 and 6 and confirm that lexical gaps in ac- tion verbs have a strong impact on translation. References IT EN Total Types 1,000 1,027 Roberto Bartolini, Valeria Quochi, Irene De Felice, Irene Russo, and Monica Monachini. 2014. From Root Types 47 (4.7%) 43 (4.2%) synsets to videos: Enriching italwordnet multi- High dist. T. 150 (15.0%) 145 (14.12%) modally. In Nicoletta Calzolari (Conference Chair), Critical Types 197 (19.7%) 188 (18.3%) Khalid Choukri, Thierry Declerck, Hrafn Lofts- son, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Table 5: Number of translation critical Types. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, may. European Language Re- sources Association (ELRA). IT EN Total Verbs 501 535 Luisa Bentivogli and Emanuele Pianta. 2000. Looking Verbs w/ r.T. 39 (7.78%) 38 (7.1%) for lexical gaps. In Proceedings of the ninth EU- Verbs w/ h.d.T. 109 (21.76%) 125 (23.36%) RALEX International Congress, Stuttgart, Germany. Critical Verbs 136 (27.15%) 154 (28.79%) Jurgita Cvilikaite.˙ 2006. Lexical gaps: resolution by functionally complete units of translation. Darbai ir Table 6: Number of verbs with root Types and high dienos, 2006, nr. 45, p. 127-142. distance Types. 
Christiane Fellbaum, Martha Palmer, Hoa Trang Dang, Lauren Delfs, and Susanne Wolf. 2001. Manual and

108 automatic semantic annotation with wordnet. Word- Emanuele Pianta, Luisa Bentivogli, and Christian Gi- Net and Other Lexical Resources, pages 3–10. rardi. 2002. MultiWordNet: developing an aligned multilingual database. In Proceedings of the First Gloria Gagliardi. 2014. Validazione dellontologia del- International Conference on Global WordNet, pages lazione IMAGACT per lo studio e la diagnosi del 21–25. Mild Cognitive Impairment. Ph.D. thesis, Univer- sity of Florence. Sameer S. Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer. 2007. Semeval-2007 task 17: Rufus Hjalmar Gouws. 2002. Equivalent relations, English lexical sample, srl and all words. In Pro- context and cotext in bilingual dictionaries. Hermes, ceedings of the 4th International Workshop on Se- 28(1):195–209. mantic Evaluations, pages 87–92. Association for Computational Linguistics. Vladimir Ivir. 1977. Lexical gaps: A contrastive view. Studia Romanica et Anglica Zagrabiensia, Diana Santos. 1990. Lexical gaps and idioms in (43):167–176. machine translation. In Proceedings of the 13th Conference on Computational Linguistics - Volume M. Jansseen. 2004. Multilingual lexical databases, 2, COLING ’90, pages 330–335, Stroudsburg, PA, lexical gaps, and simullda. International Journal of USA. Association for Computational Linguistics. Lexicography, 17(2):137–154.

Gran Kjellmer. 2003. Lexical gaps. In Extending the scope of corpus-based research, pages 149–158. Brill.

A. Lehrer. 1974. Semantic Fields and Lexical Struc- ture. North-Holland linguistic series, 11. North- Holland.

John Lyons. 1977. Semantics, volume i. Cambridge UP, Cambridge.

Massimo Moneglia, Francesca Frontini, Gloria Gagliardi, Irene Russo, Alessandro Panunzi, and Monica Monachini. 2012. Imagact: deriving an action ontology from spoken corpora. Proceed- ings of the Eighth Joint ACL-ISO Workshop on Interoperable Semantic Annotation (isa-8), pages 42–47.

Massimo Moneglia. 2014a. Natural Language On- tology of Action: A Gap with Huge Consequences for Natural Language Understanding and Machine Translation. In Zygmunt Vetulani and Joseph Mar- iani, editors, Human Language Technology Chal- lenges for Computer Science and Linguistics, vol- ume 8387 of Lecture Notes in Computer Science, pages 379–395. Springer International Publishing.

Massimo Moneglia. 2014b. The semantic variation of action verbs in multilingual spontaneous speech cor- pora. In T. Raso and H. Mello, editors, Spoken Cor- pora and Linguistics Studies, pages 152–190. John Benjamins Publishing Company.

Martha Palmer, Hoa Trang Dang, and Christiane Fellbaum. 2007. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering, 13(02):137–163.

Yi Pan. 2016. Verbi d'azione in Italiano e in Cinese Mandarino. Implementazione e validazione del cinese nell'ontologia interlinguistica dell'azione IMAGACT. Ph.D. thesis, Università degli Studi di Firenze.

Word Sense Filtering Improves Embedding-Based Lexical Substitution

Anne Cocos∗, Marianna Apidianaki∗† and Chris Callison-Burch∗
∗Computer and Information Science Department, University of Pennsylvania
†LIMSI, CNRS, Université Paris-Saclay, 91403 Orsay
{acocos,marapi,ccb}@seas.upenn.edu

Abstract dings (Reisinger and Mooney, 2010; Huang et al., 2012). Iacobacci et al. (2015) disambiguate the The role of word sense disambiguation in words in a corpus using a state-of-the-art WSD lexical substitution has been questioned system and then produce continuous representa- due to the high performance of vector tions of word senses based on distributional infor- space models which propose good sub- mation obtained from the annotated corpus. Mov- stitutes without explicitly accounting for ing from word to sense embeddings generally im- sense. We show that a filtering mecha- proves their performance in word and relational nism based on a sense inventory optimized similarity tasks but is not beneficial in all settings. for substitutability can improve the results Li and Jurafsky (2015) show that although multi- of these models. Our sense inventory sense embeddings give improved performance in is constructed using a clustering method tasks such as semantic similarity, semantic rela- which generates paraphrase clusters that tion identification and part-of-speech tagging, they are congruent with lexical substitution an- fail to help in others, like sentiment analysis and notations in a development set. The re- named entity extraction (Li and Jurafsky, 2015). sults show that lexical substitution can still benefit from senses which can improve the output of vector space paraphrase ranking We show how a sense inventory optimized for models. substitutability can improve the rankings provided by two sense-agnostic, vector-based lexical sub- 1 Introduction stitution models. Lexical substitution requires Word sense has always been difficult to de- systems to predict substitutes for target word in- fine and pin down (Kilgarriff, 1997; Erk et al., stances that preserve their meaning in context 2013). Recent successes of embedding-based, (McCarthy and Navigli, 2007). We consider a sense-agnostic models in various semantic tasks sense inventory with high substitutability to be one cast further doubt on the usefulness of word sense. which groups synonyms or paraphrases that are Why bother to identify senses if even humans can- mutually-interchangeable in the same contexts. In not agree upon their nature and number, and if contrast, sense inventories with low substitutabil- simple word-embedding models yield good results ity might group words linked by different types of without using any explicit sense representation? relations. We carry out experiments with a syntac- Word-based models are successful in various tic vector-space model (Thater et al., 2011; Apid- semantic tasks even though they conflate multiple ianaki, 2016) and a word-embedding model for word meanings into a single representation. Based lexical substitution (Melamud et al., 2015). In- on the hypothesis that capturing polysemy could stead of using the senses to refine the vector rep- further improve their performance, several works resentations as in (Faruqui et al., 2015), we use have focused on creating sense-specific word em- them to improve the lexical substitution rankings beddings. A common approach is to cluster the proposed by the models as a post-processing step. contexts in which the words appear in a corpus Our results show that senses can improve the per- to induce senses, and relabel each word token formance of vector-space models in lexical substi- with the clustered sense before learning embed- tution tasks.

110 Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pages 110–119, Valencia, Spain, April 4 2017. c 2017 Association for Computational Linguistics

2 A sense inventory for substitution

2.1 Paraphrase substitutability
The candidate substitutes used by our ranking models come from the Paraphrase Database (PPDB) XXL package (Ganitkevitch et al., 2013).1 Paraphrase relations in the PPDB are defined between words and phrases which might carry different senses. Cocos and Callison-Burch (2016) used a spectral clustering algorithm to cluster PPDB XXL into senses, but the clusters contain noisy paraphrases and paraphrases linked by different types of relations (e.g. hypernyms, antonyms) which are not always substitutable. We use a slightly modified version of their method to cluster paraphrases where both the number of clusters (senses) and their contents are optimized for substitutability.

Sentence: In this world, one's word is a promise.
Annotated substitutes (count): vow (1), utterance (1), tongue (1), speech (1)

Sentence: Silverplate: code word for the historic mission that would end World War II.
Annotated substitutes (count): phrase (3), term (2), verbiage (1), utterance (1), signal (1), name (1), dictate (1), designation (1), decree (1)

Sentence: I think she only heard the last words of my speech.
Annotated substitutes (count): bit (3), verbiage (2), part (2), vocabulary (1), terminology (1), syllable (1), phrasing (1), phrase (1), patter (1), expression (1), babble (1), anecdote (1)

Table 1: Example annotated sentences for the target word word.N from the CoInCo (Kremer et al., 2014) lexical substitution dataset. Numbers after each word indicate the number of annotators who made that suggestion.

2.2 A measure of substitutability
We define a substitutability metric that quantifies the extent to which a sense inventory aligns with human-generated lexical substitution annotations. We then cluster PPDB paraphrases using the substitutability metric to optimize the sense clusters for substitutability.
Given a sense inventory C, we can define the senses of a target word t as a set of sense clusters, C(t) = {c_1, c_2, ..., c_k}, where each cluster contains words corresponding to a single sense of t. Intuitively, if a sense inventory corresponds with substitutability, then each sense cluster c_i should have two qualities: first, words within c_i should be interchangeable with t in the same set of contexts; and second, c_i should not be missing any words that are interchangeable in those same contexts. We therefore operationalize the definition of substitutability as follows.
We begin measuring substitutability with a lexical substitution dataset, consisting of sentences where content words have been manually annotated with substitutes (see example in Table 1). We then use normalized mutual information (NMI) (Strehl and Ghosh, 2002) to quantify the level of agreement between the automatically generated sense clusters and human-suggested substitutes. NMI is an information theoretic measure of cluster quality. Given two clusterings U and V over a set of items, it measures how much each clustering reduces uncertainty about the other (Vinh et al., 2009) in terms of their mutual information I(U,V) and entropies H(U), H(V):

NMI(U,V) = \frac{I(U,V)}{\sqrt{H(U)\,H(V)}}

To calculate the NMI between a sense inventory for target word t and its set of annotated substitutes, we first define the substitutes as a clustering, B_t = {b_1, b_2, ..., b_n}, where b_i denotes the set of suggested substitutes for each of n sentences. Table 1, for example, gives the clustered substitutes for n = 3 sentences for target word t = word.N, where b_1 = {vow, utterance, tongue, speech}. We then define the substitutability of the sense inventory, C_t, with respect to the annotated substitutes, B_t, as NMI(C_t, B_t).2 Given many target words, we can further aggregate the substitutability of sense inventory C over the set of targets T in B into a single substitutability score:

substitutability_B(C) = \sum_{t \in T} \frac{NMI(C_t, B_t)}{|T|}

2.3 Optimizing for Substitutability
Having defined a substitutability score, we now automatically generate word sense clusters from the Paraphrase Database that maximize it. The idea is to use the substitutability score to choose the best number of senses for each target word, which will be the number of output clusters (k) generated by our spectral clustering algorithm.

1 PPDB paraphrases come into packages of different sizes (going from S to XXXL): small packages contain high-precision paraphrases while larger ones have high coverage. All are available from paraphrase.org
2 In calculating NMI, we ignore words that do not appear in both C_t and B_t.
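A minimal sketch of the substitutability computation follows. It is not the authors' code: the scikit-learn call and the single-cluster-id simplification (each word is mapped to exactly one cluster, whereas annotated substitutes can in principle recur across sentences) are assumptions, and average_method="geometric" is used to match the sqrt(H(U)H(V)) normalisation above.

```python
from sklearn.metrics import normalized_mutual_info_score

def substitutability(sense_clusters, substitute_clusters):
    """NMI-based agreement between a sense inventory C_t and annotated
    substitutes B_t for one target word.  Both arguments map each word
    to a cluster id; words missing from either side are ignored
    (cf. footnote 2)."""
    common = sorted(set(sense_clusters) & set(substitute_clusters))
    u = [sense_clusters[w] for w in common]
    v = [substitute_clusters[w] for w in common]
    return normalized_mutual_info_score(u, v, average_method="geometric")

def corpus_substitutability(per_target_C, per_target_B):
    """Average NMI(C_t, B_t) over all targets T in the annotated data."""
    targets = per_target_C.keys() & per_target_B.keys()
    return sum(substitutability(per_target_C[t], per_target_B[t])
               for t in targets) / len(targets)
```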

111 2.4 Spectral clustering method in the affinity matrix through a technique we call 2.4.1 Constructing the Affinity Matrix ‘masking.’ By masking, we mean allowing pos- itive values in the affinity matrix only where the The spectral clustering algorithm (Yu and Shi, n n row and column correspond to paraphrases that 2003) takes as input an affinity matrix A R × ∈ appear as pairs in PPDB (Figure 1a). All entries encoding n items to be clustered, and an integer corresponding to paraphrase pairs that are not con- k. It generates k non-overlapping clusters of the n nected in the PPDB graph (Figure 1a) are forced to items. Each entry a in the affinity matrix denotes i j 0. a similarity measurement between items i and j. More concretely, in the masked affinity matrix, Entries in A must be nonnegative and symmetric. we set each entry a for which i and j are not para- The affinity matrix can also be thought of as de- i j phrases in PPDB to zero. The masked affinity ma- scribing a graph, where the n rows and columns MASK trix encodes the graph G = PP(t),EPP with correspond to nodes, and each entry ai j gives the { } edges connecting only pairs of words that are in weight of an edge between nodes i and j. Because PPDB, EMASK = (p , p ) p P(p ) . Figure 1c the matrix must be symmetric, the graph is undi- PP { i j | i ∈ j } shows the masked affinity matrix corresponding to rected. the PPDB structure in Figure 1a. Given a target word t, we call its set of PPDB paraphrases PP(t). Note that PP(t) excludes t it- 2.4.3 Optimizing k self. In our most basic clustering method, we clus- Because spectral clustering requires the number ter paraphrases for target t as follows. Given the of output clusters, k, to be specified as input, for length-n set of t’s paraphrases, PP(t), we construct each target word we run the clustering algorithm the n n affinity matrix A where each shared-index × for a range of k between 1 and the minimum of (n, row and column corresponds to some word p ∈ 20). We then choose the k that maximizes the NMI PP(t). We set entries equal to the cosine similarity of the resulting clusters with the human-annotated between the applicable words’ embeddings, plus substitutes for that target in the development data. one: ai j = cos(vi,v j) + 1 (to enforce non-negative similarities). For our implementation we use 300- 2.5 Method variations dimensional part-of-speech-specific word embed- In addition to using the substitutability score to dings vi generated using the gensim word2vec choose the best number of senses for each target package (Mikolov et al., 2013b; Mikolov et al., word, we also experiment with two variations on ˇ 3 2013a; Rehu˚rekˇ and Sojka, 2010). In Figure 1a the basic spectral clustering method to increase the we show a set of paraphrases, linked by PPDB score further: filtering by a paraphrase confidence relations, and in Figure 1b we show the corre- score and co-clustering with WordNet (Fellbaum, sponding basic affinity matrix, encoding the para- 1998). phrases’ distributional similarity. In order to aid further discussion, we point out 2.5.1 PPDB Score Thresholding that the affinity matrix used for the basic cluster- Each paraphrase pair in the PPDB is associated ing method encodes a fully-connected graph G = with a set of scores indicating the strength of ALL PP(t),EPP with paraphrases PP(t) as nodes, the paraphrase relationship. The recently added { } ALL and edges between every pair of words, EPP = PPDB2.0 Score (Pavlick et al., 2015) was calcu- PP(t) PP(t). 
As for all variations on the cluster- lated using a supervised scoring model trained on × ing method, the matrix entries correspond to dis- human judgments of paraphrase quality.4 Apidi- tributional similarity. anaki (2016) showed that the PPDB2.0 Score itself is a good metric for ranking substitution candi- 2.4.2 Masking dates in context, outperforming some vector space The affinity matrix in Figure 1b ignores the graph models when the number of candidates is high. structure inherent in PPDB, where edges connect With this in mind, we experimented with using a only words that are paraphrases of one another. PPDB2.0 Score threshold to discard noisy PPDB We experiment with enforcing the PPDB structure 4The human judgments were used to fit a regression to the 3The word2vec parameters we use are a context win- features available in PPDB 1.0 plus numerous new features dow of size 3, learning rate alpha from 0.025 to 0.0001, min- including cosine word embedding similarity, lexical overlap 4 imum word count 100, sampling parameter 1e− , 10 negative features, WordNet features and distributional similarity fea- samples per target word, and 5 training epochs. tures.
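The affinity construction, masking and clustering steps can be sketched as below. This is an illustrative reconstruction, not the released implementation: scikit-learn's SpectralClustering is used in place of the Yu and Shi (2003) algorithm, the PPDB lookup is abstracted into a set of word pairs, and the k-selection loop of Section 2.4.3 is only indicated in the closing comment.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_paraphrases(words, vectors, ppdb_pairs, k, masked=True):
    """Build the (optionally masked) affinity matrix a_ij = cos(v_i, v_j) + 1
    over PP(t) and cut it into k sense clusters."""
    n = len(words)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            vi, vj = vectors[words[i]], vectors[words[j]]
            a = np.dot(vi, vj) / (np.linalg.norm(vi) * np.linalg.norm(vj)) + 1.0
            if masked and i != j and frozenset((words[i], words[j])) not in ppdb_pairs:
                a = 0.0  # zero out pairs that are not linked in PPDB
            A[i, j] = a
    labels = SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(A)
    return {w: int(label) for w, label in zip(words, labels)}

# Section 2.4.3: the best k would then be chosen by re-clustering for
# k = 1 .. min(n, 20) and keeping the k that maximises NMI against the
# development substitutes (e.g. via substitutability() from the earlier sketch).
```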

(a) PPDB graph with n = 7 paraphrases PP(t) to be clustered. (b) Unmasked affinity matrix for input to the basic clustering algorithm, for the paraphrases in Fig. 1a; this matrix encodes a fully-connected graph G = {PP(t), E_PP^ALL}. (c) Masked affinity matrix for input to the clustering algorithm, enforcing the paraphrase links from the graph in Fig. 1a, G = {PP(t), E_PP^MASK}.

Figure 1: Unclustered PPDB graph and its corresponding affinity matrices, encoding distributional similarity, for input to the basic (1b) and masked (1c) clustering algorithms. Masking zeros-out the values of all cells corresponding to paraphrases not connected in the PPDB graph. Cell shading corresponds to the distributional similarity score between words, with darker colors representing higher measurements.

Our objective was to begin the clustering process with a clean set of paraphrases for each target word, eliminating erroneous paraphrases that might pollute the substitutable sense clusters. We implemented PPDB score thresholds in a range from 0 to 2.5.

2.5.2 Co-Clustering with WordNet
PPDB is large and inherently noisy. WordNet has smaller coverage but well-defined semantic structure in the form of synsets and relations. We sought a way to marry the high coverage of PPDB with the clean structure of WordNet by co-clustering the two resources, in hopes of creating a sense inventory that is both highly-substitutable and high-coverage.

The basic unit in WordNet is the synset, a set of lemmas sharing the same meaning. WordNet also connects synsets via relations, such as hypernymy, hyponymy, entailment, and 'similar-to'. We denote as L(s) the set of lemmas associated with synset s. We denote as R(s) the set of synsets that are related to synset s with a hypernym, hyponym, entailment, or similar-to relationship. Finally, we denote as S(t) the set of synsets to which a word t belongs. We denote as S+(t) the set of t's synsets, plus all synsets to which they are related; S+(t) = S(t) ∪ ⋃_{s'∈S(t)} S(s'). In other words, S+(t) includes all synsets to which t is connected by a path of length at most 2 via one of the relations encoded in R(s).

For co-clustering, we generate the affinity matrix for a graph with m+n nodes corresponding to the n words in PP(t) and the m synsets in S+(t), and edges between every pair of nodes. Because the edge weights are cosine similarity between vector embeddings, we need a way to construct an embedding for each synset in S+(t).[5] We therefore generate compositional embeddings for each synset s that are equal to the weighted average of the embeddings for the lemmas l ∈ L(s), where the weights are the PPDB2.0 scores between t and l:

    v_s = ( Σ_{l∈L(s)} PPDB2.0Score(t,l) · v_l ) / ( Σ_{l∈L(s)} PPDB2.0Score(t,l) )

The unmasked affinity matrix used for input to the co-clustering method, then, encodes the graph G = {PP(t) ∪ S+(t), E_PP^ALL ∪ E_PS^ALL ∪ E_SS^ALL}, where E_PS^ALL contains edges between every paraphrase and synset, and E_SS^ALL contains edges between every pair of synsets.

We also define masked versions of the co-clustering affinity matrix. In a masked affinity matrix, positive entries are only allowed for entries where the row and column correspond to entities (paraphrases or synsets) that are connected by the underlying knowledge base (PPDB or WordNet).

[5] We don't use the NASARI embeddings (Camacho-Collados et al., 2015) because these are available only for nouns.
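A minimal sketch of the compositional synset embedding defined above; word_vectors (the trained word2vec vectors) and ppdb2_score (a PPDB2.0 Score lookup) are hypothetical stand-ins rather than the authors' code.

```python
import numpy as np


def synset_embedding(lemmas, target, word_vectors, ppdb2_score, dim=300):
    """v_s: average of the lemma embeddings v_l for l in L(s), weighted by
    the PPDB2.0 score between the target word t and each lemma l."""
    numerator, denominator = np.zeros(dim), 0.0
    for lemma in lemmas:
        if lemma in word_vectors:
            weight = ppdb2_score(target, lemma)
            numerator += weight * np.asarray(word_vectors[lemma])
            denominator += weight
    return numerator / denominator if denominator > 0 else numerator
```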

[Figure 2 (not reproduced): Paraphrase/synset graph for input to the co-clustering model, and its corresponding affinity matrices for the basic (2b) and masked (2c) clustering algorithms. (a) Graph showing n = 7 PPDB paraphrases PP(t) and m = 4 WordNet synsets S+(t) to be clustered. (b) Unmasked affinity matrix for input to the basic clustering algorithm, for the paraphrases and synsets in Fig 2a; this matrix encodes a fully-connected graph G = {PP(t) ∪ S+(t), E_PP^ALL ∪ E_PS^ALL ∪ E_SS^ALL}. (c) Affinity matrix for input to the clustering algorithm, enforcing the paraphrase-paraphrase (E_PP^MASK) and paraphrase-synset (E_PS^MASK) links from the graph in 2a, but allowing all synset-synset links (E_SS^ALL). Cell shading corresponds to the distributional similarity score between words/synsets.]

Just as we defined masking for paraphrase-paraphrase links (E_PP) to allow only positive values corresponding to paraphrase pairs found in PPDB, here we separately define masking for paraphrase-synset (E_PS) and synset-synset (E_SS) links based on WordNet synsets and relations. When applying the clustering algorithm, it is possible to elect to use the masked version for any or all of E_PP, E_PS, and E_SS. In our experiments we try all combinations.

For the synset-synset links, we define the masked version E_SS^MASK as including only nonzero edge weights where a hypernym, hyponym, entailment or similar-to relationship connects two synsets: E_SS^MASK = {(s_u, s_v) | s_u ∈ R(s_v) or s_v ∈ R(s_u)}. For the paraphrase-synset links, we define the masked version E_PS^MASK to include only nonzero edge weights where the paraphrase is a lemma in the synset, or is a paraphrase of a lemma in the synset (excluding the target word): E_PS^MASK = {(p_i, s_u) | p_i ∈ L(s_u) or |(P(p_i) − t) ∩ L(s_u)| > 0}. We need to exclude the target word when calculating the overlap because otherwise all words in PP(t) would connect to all synsets in S(t). Figure 2 depicts the graph, unmasked and masked affinity matrices for the co-clustering method; a short sketch of these two masking conditions is given after the datasets subsection below.

2.6 Clustering Experiments

2.6.1 Datasets
We run clustering experiments using targets and human-generated substitution data drawn from two lexical substitution datasets. The first is the "Concepts in Context" (CoInCo) corpus (Kremer et al., 2014), containing over 15K sentences corresponding to nearly 4K unique target words. We divide the CoInCo dataset into development and test sets by first finding all target words that have at least 10 sentences. For each of the 327 resulting targets, we randomly divide the corresponding sentences into 60% development instances and 40% test instances. The resulting development and test sets have 4061 and 2091 sentences respectively. We cluster the 327 target words in the resulting subset of CoInCo, performing all optimization using the development portion.

In order to evaluate how well our method generalizes to other data, we also create clusters for target words drawn from the SemEval 2007 English Lexical Substitution shared task dataset (McCarthy and Navigli, 2007). The entire test portion of the SemEval dataset contains 1700 annotated sentences for 170 target words. We filter this data to keep only sentences with one or more human-annotated substitutes that overlap our PPDB XXL paraphrase vocabulary. The resulting test set, which we use for evaluating SemEval targets, has 1178 sentences and 157 target words.
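As referenced above, the two masking conditions can be read directly off the set definitions for E_PS^MASK and E_SS^MASK. The sketch below is only an illustration: synset_lemmas (L(s)), paraphrases_of (P(p)) and related (R(s)) are hypothetical lookups, not the authors' data structures.

```python
def keep_ps_edge(p, synset_lemmas, target, paraphrases_of):
    """E_PS^MASK: keep edge (p, s) if p is a lemma of s, or if p paraphrases
    some lemma of s other than the target word."""
    if p in synset_lemmas:
        return True
    overlap = (set(paraphrases_of(p)) - {target}) & set(synset_lemmas)
    return len(overlap) > 0


def keep_ss_edge(s_u, s_v, related):
    """E_SS^MASK: keep edge (s_u, s_v) if the synsets are linked by a
    hypernym, hyponym, entailment or similar-to relation."""
    return s_u in related(s_v) or s_v in related(s_u)
```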

We cluster each of the 157 targets, using the CoInCo development data to optimize substitutability for the 32 SemEval targets that also appear in CoInCo. For the rest of the SemEval targets we choose a number of senses equal to its WordNet synset count.

2.6.2 Clustering Method Variations
We try all combinations of the following parameters in our clustering model:

• Clustering method: We try the basic clustering – clustering Paraphrases Only – and the WordNet Co-Clustering method.

• PPDB2.0 Score Threshold: We cluster paraphrases of each target having a PPDB2.0 Score above a threshold, ranging from 0-3.0.

• Masking: When clustering paraphrases only, we can either use the PP-PP mask or allow positive similarities between all words. When co-clustering, we try all combinations of the PP-PP, PP-SYN, and SYN-SYN masks.

For each combination, we evaluate the NMI substitutability of the resulting sense inventory over our CoInCo and SemEval test instances. The substitutability results are given in Tables 2 and 3.

3 Filtering Substitutes

3.1 A WSD oracle
We now question whether it is possible to improve the rankings of current state-of-the-art lexical substitution systems by using the optimized sense inventory as a filter. Our general approach is to take a set of ranked substitutes generated by a vector-based model. Then, we see whether filtering the ranked substitutes to bring words belonging to the correct sense of the target to the top of the rankings would improve the overall ranking results.

Assuming that we have a WSD oracle that is able to choose the most appropriate sense for a target word in context, this corresponds to nominating substitutes from the applicable sense cluster and elevating them in the list of ranked substitutes output by the state-of-the-art lexical substitution system. If sense filtering successfully improves the quality of ranked substitutes, we can say that the sense inventory captures substitutability well.

3.2 Ranking Models
Our approach requires a set of rankings produced by a high-quality lexical substitution model to start. We generate substitution rankings for each target/sentence pair in the test sets using a syntactic vector-space model (Thater et al., 2011; Apidianaki, 2016) and a state-of-the-art model based on word embeddings (Melamud et al., 2015).

The syntactic vector space model of Apidianaki (2016) (Syn.VSM) demonstrated an ability to correctly choose appropriate PPDB paraphrases for a target word in context. The vector features correspond to syntactic dependency triples extracted from the English Gigaword corpus[6] analyzed with Stanford dependencies (Marneffe et al., 2006). Syn.VSM produces a score for each (target, sentence, substitute) tuple based on the cosine similarity of the substitute's basic vector representation with the target's contextualized vector (Thater et al., 2011). The contextualized vector is derived from the basic meaning vector of the target word by reinforcing its dimensions that are licensed by the context of the specific instance under consideration. More specifically, the contextualized vector of a target is obtained through vector addition and contains information about the target's direct syntactic dependents.

The second set of rankings comes from the AddCos model of Melamud et al. (2015). AddCos quantifies the fit of substitute word s for target word t in context C by measuring the semantic similarity of the substitute to the target, and the similarity of the substitute to the context:

    AddCos(s,t,W) = ( |W| · cos(s,t) + Σ_{w∈W} cos(s,w) ) / ( 2 · |W| )    (1)

The vectors s and t are word embeddings of the substitute and target generated by the skip-gram with negative sampling model (Mikolov et al., 2013b; Mikolov et al., 2013a). The context W is the set of words appearing within a fixed-width window of the target t in a sentence (we use a window (cwin) of 1), and the embeddings c are context embeddings generated by skip-gram. In our implementation, we train 300-dimensional word and context embeddings over the 4B words in the Annotated Gigaword (AGiga) corpus (Napoles et al., 2012) using the gensim word2vec package (Mikolov et al., 2013b; Mikolov et al., 2013a; Řehůřek and Sojka, 2010).[7]

[6] http://catalog.ldc.upenn.edu/LDC2003T05
[7] The word2vec training parameters we use are a context window of size 3, learning rate alpha from 0.025 to 0.0001, minimum word count 100, sampling parameter 1e-4, 10 negative samples per target word, and 5 training epochs.
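Equation (1) above translates directly into code. The sketch below is only an illustration of the AddCos scoring rule, not the authors' implementation; word_vec and context_vec are hypothetical lookups for the skip-gram word and context embeddings.

```python
import numpy as np


def _cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def add_cos(substitute, target, window_words, word_vec, context_vec):
    """AddCos(s, t, W) = (|W| * cos(s, t) + sum_{w in W} cos(s, w)) / (2 * |W|),
    with the context words w represented by skip-gram context embeddings."""
    s, t = word_vec[substitute], word_vec[target]
    ctx = [context_vec[w] for w in window_words if w in context_vec]
    if not ctx:
        return _cos(s, t)  # degenerate case: no usable context words
    total = len(ctx) * _cos(s, t) + sum(_cos(s, c) for c in ctx)
    return total / (2.0 * len(ctx))
```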

3.3 Substitution metrics
Lexical substitution experiments are usually evaluated using generalized average precision (GAP) (Kishida, 2005). GAP compares a set of predicted rankings to a set of gold standard rankings. Scores range from 0 to 1; a perfect ranking, in which all high-scoring substitutes outrank low-scoring substitutes, has a score of 1. For each sentence in the CoInCo and SemEval test sets, we consider the PPDB paraphrases for the target word to be the candidates, and we set the test set annotator frequency to be the gold score. Words in PPDB that were not suggested by annotators receive a gold score of 0.001. Predicted scores are given by the two ranking models, Syn.VSM and AddCos.

3.4 Filtering Method
Sense filtering is intended to boost the rank of substitutes that belong to the most appropriate sense of the target given the context. We run this experiment as a two-step process.

First, given a target and sentence, we obtain the PPDB XXL paraphrases for the target word and rank them using the Syn.VSM and the AddCos models.[8] We calculate the overall unfiltered GAP score on the test set for each ranking model as the average GAP over sentences.

Next, we evaluate the ability of a sense inventory to improve the GAP score through filtering. We implement sense filtering by adding a large number (10000) to the ranking model's score for words belonging to a single sense. We assume an oracle that finds the cluster which maximally improves the GAP score using this sense filtering method. If the sense inventory corresponds well to substitutability, we should expect this filtering to improve the ranking by downgrading proposed substitutes that do not fall within the correct sense cluster.

We calculate the maximum sense-restricted GAP score for the inventories produced by each variation on our clustering model, and compare this to the unfiltered GAP score for each ranking model.

3.5 Baselines
We compare the extent to which our optimized sense inventories improve lexical substitution rankings to the results of two baseline sense inventories.

• WordNet+: a sense inventory formed from WordNet 3.0. For each CoInCo target word that appears in WordNet, we take its sense clusters to be its synsets, plus lemmas belonging to hypernyms and hyponyms of each synset.

• PPDBClus: a much larger, albeit noisier, sense inventory obtained by automatically clustering words in the PPDB XXL package. To obtain this sense inventory we clustered paraphrases for all targets in the CoInCo dataset using the method outlined in Cocos and Callison-Burch (2016), with PPDB2.0 Score serving as the similarity metric.

We assess the substitutability of these baseline sense inventories with respect to the human-annotated substitutes in the CoInCo and SemEval datasets, and also use them for sense filtering.

Finally, we wish to estimate the impact of the NMI-based optimization procedure (Section 2.4.3) on the quality of the senses used for filtering. We compare the performance of the optimized CoInCo sense inventory, where the number of clusters, k, for a target word is defined through NMI optimization (called 'Choose-K: Optimize NMI'), to an inventory induced from CoInCo where k equals the number of synsets available for the target word in WordNet (called 'Choose-K: #WN Synsets').

4 Results
We report substitutability, and the unfiltered and best sense-filtered GAP scores achieved using the paraphrase-only clustering method and the co-clustering method in Tables 2 and 3.

The average unfiltered GAP scores for the Syn.VSM rankings over the CoInCo and SemEval test sets are 0.528 and 0.673 respectively.[9]

[8] For the SemEval test set, we rank PPDB XXL paraphrases having a PPDB2.0 Score with the target of at least 2.54.
[9] The score reported by Apidianaki (2016) for the Syn.VSM model with XXL PPDB paraphrases on CoInCo was 0.56. The difference in scores is due to excluding from our clustering experiments target words that did not have at least 10 sentences in CoInCo.
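The sense-filtering step in Section 3.4 amounts to a score boost followed by an oracle choice among the target's sense clusters. The sketch below is illustrative only: gap stands for an assumed GAP implementation, and model_scores/gold_scores are hypothetical dictionaries mapping candidate substitutes to predicted and annotator-frequency scores.

```python
def filter_by_sense(model_scores, sense_cluster, boost=10000.0):
    """Add a large constant to every candidate that belongs to the chosen
    sense cluster, leaving the other candidates' scores unchanged."""
    return {w: s + boost if w in sense_cluster else s
            for w, s in model_scores.items()}


def oracle_filtered_gap(model_scores, gold_scores, sense_clusters, gap):
    """Oracle: try each sense cluster of the target as a filter and keep the
    GAP of the best one (falling back to the unfiltered ranking)."""
    best = gap(model_scores, gold_scores)
    for cluster in sense_clusters:
        filtered = filter_by_sense(model_scores, cluster)
        best = max(best, gap(filtered, gold_scores))
    return best
```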

Sense inventory                                   substCoInCo   Syn.VSM Oracle GAP   AddCos (cwin=1) Oracle GAP
PPDBClus                                          0.254         0.661                0.656
WordNet                                           0.252         0.655                0.651
Choose-K: # WN Synsets (avg)                      0.205         0.639                0.636
Choose-K: # WN Synsets (max, no co-clustering)    0.250*        0.695*               0.690*
Choose-K: # WN Synsets (max, co-clustering)       0.241**       0.690**              0.683**
Choose-K: Optimize NMI (avg)                      0.282         0.668                0.662
Choose-K: Optimize NMI (max, no co-clustering)    0.331*        0.719*               0.714***
Choose-K: Optimize NMI (max, co-clustering)       0.314**       0.718****            0.710**
Unfiltered GAP (identical for all inventories)    --            0.528                0.533

Table 2: Substitutability (NMI) of resulting sense inventories, and GAP scores of the unfiltered and best sense-filtered rankings produced by the Syn.VSM and AddCos models, for the CoInCo annotated dataset. Configurations for the best-performing sense inventories were: * Min PPDB Score 2.0, cluster PP's only, use PP-PP mask; ** Min PPDB Score 2.0, co-clustering, use PP-PP mask only; *** Min PPDB Score 1.5, cluster PP's only, use PP-PP mask; **** Min PPDB Score 2.0, co-clustering, use PP-PP and Syn-Syn masks only.

Sense inventory                                   substSemEval   Syn.VSM Oracle GAP   AddCos (cwin=1) Oracle GAP
PPDBClus                                          0.357          0.855                0.634
WordNet                                           0.291          0.774                0.595
Average of all clustered sense inventories        0.367          0.841                0.569
Max basic (no co-clustering) sense inventory      0.448*         0.917*               0.626*
Max co-clustered sense inventory                  0.449**        0.906***             0.612****
Unfiltered GAP (identical for all inventories)    --             0.673                0.410

Table 3: Substitutability (NMI) of resulting sense inventories, and GAP scores of the unfiltered and best sense-filtered rankings produced by the Syn.VSM and AddCos models, for the SemEval07 annotated dataset. Configurations for the best-performing sense inventories were: * Min PPDB Score 2.31, cluster PP's only, use PP-PP mask; ** Min PPDB Score 2.54, co-clustering, use PP-SYN mask only; *** Min PPDB Score 2.54, co-clustering, use PP-SYN mask only; **** Min PPDB Score 2.31, co-clustering, use PP-SYN mask only.

All baseline and cluster sense inventories are capable of improving these GAP scores when we use the best sense as a filter. Syntactic models generally give very good results with small paraphrase sets (Kremer et al., 2014) but their performance seems to degrade when they need to deal with larger and noisier substitute sets (Apidianaki, 2016). Our results suggest that finding the most appropriate sense of a target word in context can improve their lexical substitution results.

The trend in results is similar for the AddCos rankings. The average unfiltered GAP scores for the AddCos rankings over the CoInCo and SemEval test sets are 0.533 and 0.410 respectively. The GAP scores of the unfiltered AddCos rankings are much lower than after filtering with any baseline or cluster sense inventory, showing that lexical substitution rankings based on word embeddings can also be improved using senses.

To assess the impact of the NMI-based optimization procedure on the results, we compare the performance of two sense inventories on the CoInCo rankings: one where the number of clusters (k) for a target word is defined through NMI optimization and another one, where k is equal to the number of synsets available for the target word in WordNet. We find that for both the Syn.VSM and AddCos ranking models, filtering using the sense inventory with the NMI-optimized k outperforms the results obtained when the inventory with k equal to the number of synsets is used.

Furthermore, we find that the NMI substitutability score is a generally good indicator of how much improvement we see in GAP score due to oracle sense filtering. We calculated the Pearson correlation of a sense inventory's NMI with its oracle GAP score to be 0.644 (calculated over all target words in the CoInCo test set, including GAP results for both the Syn.VSM and AddCos ranking models). This suggests that NMI is a reasonable measure of substitutability.

We find that for all methods, applying a PPDB-Score Threshold prior to clustering is an effective way of removing noisy, non-substitutable paraphrases from the sense inventory. When we use the resulting sense inventory for filtering, this effectively elevates only high-quality paraphrases in the lexical substitution rankings. This supports the finding of Apidianaki (2016), who showed that the PPDB2.0 Score itself is an effective lexical substitution ranking metric when large and noisy paraphrase substitute sets are involved.
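The reported correlation of 0.644 corresponds to a computation along the following lines (the names are illustrative, not from the authors' scripts):

```python
from scipy.stats import pearsonr


def nmi_gap_correlation(per_target_results):
    """per_target_results: iterable of (nmi, oracle_gap) pairs, one per
    (target word, ranking model) combination on the CoInCo test set."""
    nmi_values, gap_values = zip(*per_target_results)
    r, _ = pearsonr(nmi_values, gap_values)
    return r
```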

Finally, we discover that co-clustering with WordNet did not produce any significant improvement in NMI or GAP score over clustering paraphrases alone. This could suggest that the added structure from WordNet did not improve overall substitutability of the resulting sense inventories, or that our co-clustering method did not effectively incorporate useful structural information from WordNet.

5 Conclusion
We have shown that despite the high performance of word-based vector-space models, lexical substitution can still benefit from word senses. We have defined a substitutability metric and proposed a method for automatically creating sense inventories optimized for substitutability. The number of sense clusters in an optimized inventory and their contents are aligned with lexical substitution annotations in a development set. Using the best fitting cluster in each context as a filter over the rankings produced by vector-space models boosts good substitutes and improves the models' scores in a lexical substitution task.

For choosing the cluster that best fits a context, we used an oracle experiment which finds the maximum GAP score achievable by a sense by boosting the ranking model's score for words belonging to a single sense. The cluster that achieved the highest GAP score in each case was selected. The task of finding the most appropriate sense in context still remains. But the improvement in lexical substitution results shows that word sense induction and disambiguation can still benefit state-of-the-art word-based models for lexical substitution.

Our sense filtering mechanism can be applied to the output of any vector-space substitution model at a post-processing step. In future work, we intend to experiment with models that account for senses during embedding learning. The models of Huang et al. (2012) and Li and Jurafsky (2015) learn multi-prototype, or sense-specific, embedding representations and are able to choose the best-fitted ones for words in context. These models have up to now been tested in several NLP tasks but have not yet been applied to lexical substitution. We will experiment with using the embeddings chosen by these models for specific word instances for ranking candidate substitutes in context. The comparison with the results presented in this paper will show whether it is preferable to account for senses before or after actual lexical substitution.

Acknowledgments
This material is based in part on research sponsored by DARPA under grant number FA8750-13-2-0017 (the DEFT program). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA and the U.S. Government.

This work has also been supported by the French National Research Agency under project ANR-16-CE33-0013.

We would like to thank our anonymous reviewers for their thoughtful and helpful comments.

References

Marianna Apidianaki. 2016. Vector-space models for PPDB paraphrase ranking in context. In Proceedings of EMNLP, pages 2028–2034, Austin, Texas.

José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. NASARI: a Novel Approach to a Semantically-Aware Representation of Items. In Proceedings of NAACL/HLT 2015, pages 567–577, Denver, Colorado.

Anne Cocos and Chris Callison-Burch. 2016. Clustering paraphrases by word sense. In Proceedings of NAACL/HLT 2016, pages 1463–1472, San Diego, California.

Katrin Erk, Diana McCarthy, and Nicholas Gaylord. 2013. Measuring word meaning in context. Computational Linguistics, 39(3):511–554, 9.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting Word Vectors to Semantic Lexicons. In Proceedings of NAACL/HLT, Denver, Colorado.

Christiane Fellbaum, editor. 1998. WordNet: an electronic lexical database. MIT Press.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of NAACL, Atlanta, Georgia.

Eric Huang, Richard Socher, Christopher Manning, and Andrew Ng. 2012. Improving Word Representations via Global Context and Multiple Word Prototypes. In Proceedings of the ACL (Volume 1: Long Papers), pages 873–882, Jeju Island, Korea.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. SensEmbed: Learning Sense Embeddings for Word and Relational Similarity. In Proceedings of the ACL/IJCNLP, pages 95–105, Beijing, China.

Adam Kilgarriff. 1997. I don't believe in word senses. Computers and the Humanities, 31(2):91–113.

Kazuaki Kishida. 2005. Property of average precision and its generalization: An examination of evaluation indicator for information retrieval experiments. National Institute of Informatics, Tokyo, Japan.

Gerhard Kremer, Katrin Erk, Sebastian Padó, and Stefan Thater. 2014. What Substitutes Tell Us - Analysis of an "All-Words" Lexical Substitution Corpus. In Proceedings of EACL, pages 540–549, Gothenburg, Sweden.

Jiwei Li and Dan Jurafsky. 2015. Do Multi-Sense Embeddings Improve Natural Language Understanding? In Proceedings of the EMNLP, pages 1722–1732, Lisbon, Portugal.

M. Marneffe, B. Maccartney, and C. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. In Proceedings of LREC-2006, Genoa, Italy.

Diana McCarthy and Roberto Navigli. 2007. SemEval-2007 Task 10: English Lexical Substitution Task. In Proceedings of SemEval, pages 48–53, Prague, Czech Republic.

Oren Melamud, Omer Levy, and Ido Dagan. 2015. A Simple Word Embedding Model for Lexical Substitution. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 1–7, Denver, Colorado.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated Gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), pages 95–100, Montreal, Canada.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of ACL/IJCNLP, pages 425–430, Beijing, China.

Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta.

Joseph Reisinger and Raymond J. Mooney. 2010. Multi-Prototype Vector-Space Models of Word Meaning. In Proceedings of HLT/NAACL, pages 109–117, Los Angeles, California.

Alexander Strehl and Joydeep Ghosh. 2002. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of machine learning research, 3(Dec):583–617.

Stefan Thater, Hagen Fürstenau, and Manfred Pinkal. 2011. Word Meaning in Context: A Simple and Effective Vector Model. In Proceedings of IJCNLP, pages 1134–1143, Chiang Mai, Thailand.

Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2009. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1073–1080. ACM.

Stella X. Yu and Jianbo Shi. 2003. Multiclass spectral clustering. In Proceedings of International Conference on Computer Vision (ICCV 03), pages 313–319. IEEE.

Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of Unambiguous Synonyms

Aleksander Wawer
Institute of Computer Science PAS
Jana Kazimierza 5
01-248 Warsaw, Poland
[email protected]

Agnieszka Mykowiecka
Institute of Computer Science PAS
Jana Kazimierza 5
01-248 Warsaw, Poland
[email protected]

Abstract

This paper compares two approaches to word sense disambiguation using word embeddings trained on unambiguous synonyms. The first one is an unsupervised method based on computing log probability from sequences of word embedding vectors, taking into account ambiguous word senses and guessing correct sense from context. The second method is supervised. We use a multilayer neural network model to learn a context-sensitive transformation that maps an input vector of an ambiguous word into an output vector representing its sense. We evaluate both methods on corpora with manual annotations of word senses from the Polish wordnet.

1 Introduction

Ambiguity is one of the fundamental features of natural language, so every attempt to understand NL utterances has to include a disambiguation step. People usually do not even notice ambiguity because of the clarifying role of the context. A word market is ambiguous, and it is still such in the phrase the fish market, while in a longer phrase like the global fish market it is unequivocal because of the word global, which cannot be used to describe a physical place. Thus, distributional semantics methods seem to be a natural way to solve the word sense discrimination/disambiguation task (WSD). One of the first approaches to WSD was context-group sense discrimination (Schütze, 1998) in which sense representations were computed as groups of similar contexts. Since then, distributional semantic methods were utilized in very many ways in supervised, weakly supervised and unsupervised approaches.

Unsupervised WSD algorithms aim at resolving word ambiguity without the use of annotated corpora. There are two popular categories of knowledge-based algorithms. The first one originates from the Lesk (1986) algorithm, and exploits the number of common words in two sense definitions (glosses) to select the proper meaning in a context. The Lesk algorithm relies on the set of dictionary entries and the information about the context in which the word occurs. In (Basile et al., 2014) the concept of overlap is replaced by similarity represented by a DSM model. The authors compute the overlap between the gloss of the meaning and the context as a similarity measure between their corresponding vector representations in a semantic space. A semantic space is a co-occurrence matrix M built by analysing the distribution of words in a large corpus, later reduced using Latent Semantic Analysis (Landauer and Dumais, 1997). The second group of algorithms comprises graph-based methods which use the structure of semantic nets in which different types of word sense relations are represented and linked (e.g. WordNet, BabelNet). They use various graph-induced information, e.g. the PageRank algorithm (Mihalcea et al., 2004).

In this paper we present a method of word sense disambiguation, i.e. inferring an appropriate word sense from those listed in the Polish wordnet, using word embeddings in both supervised and unsupervised approaches. The main tested idea is to calculate sense embeddings using unambiguous synonyms (elements of the same synsets) for a particular word sense. In section 2 we shortly present existing results for WSD for Polish as well as other works related to word embeddings for other languages, while section 3 presents the annotated data we use for evaluation and supervised model training. The next sections describe the chosen method of calculating word sense embeddings, our unsupervised and supervised WSD experiments and some comments on the results.

2 Existing Work

2.1 Polish WSD
There was very little research done in WSD for Polish. The first of the few more visible attempts comprises a small supervised experiment with WSD in which machine learning techniques and a set of a priori defined features were used (Kobyliński, 2012). Next, in (Kobyliński and Kopeć, 2012), an extended Lesk knowledge-based approach and corpus-based similarity functions were used to improve previous results. These experiments were conducted on corpora annotated with a specially designed set of senses. The first one contained general texts with 106 polysemous words manually annotated with 2.85 sense definitions per word on average. The second, smaller, WikiEcono corpus (http://zil.ipipan.waw.pl/plWikiEcono) was annotated with another set of senses for 52 polysemous words. It contains 3.62 sense definitions per word on average. The most recent work on WSD for Polish (Kędzia et al., 2015) utilizes the graph-based approaches of (Mihalcea et al., 2004) and (Agirre et al., 2014). This method uses both plWordnet and the SUMO ontology and was tested on the KPWr data set (Broda et al., 2012) annotated with plWordnet senses — the same data set which we use in our experiments. The highest precision of 0.58 was achieved for nouns. The results obtained by different WSD approaches are very hard to compare because of the different sets of senses and test data used and big differences in results obtained by the same system on different data. (Tripodi and Pelillo, 2017) reports the results obtained by the best systems for English at the level of 0.51-0.85% depending on the approach (supervised or unsupervised) and the data set. The only system for Polish to which to some extent we can compare our approach is (Kędzia et al., 2015).

2.2 WSD and Word Embeddings
The problem of WSD has been approached from various perspectives in the context of word embeddings.

A popular approach is to generate multiple embeddings per word type, often using unsupervised automatic methods. For example, (Reisinger and Mooney, 2010; Huang et al., 2012) cluster contexts of each word to learn senses for each word, then re-label them with the clustered sense for learning embeddings. (Neelakantan et al., 2014) introduce a flexible number of senses: they extend the sense cluster list when a new sense is encountered by a model.

(Iacobacci et al., 2015) use an existing WSD algorithm to automatically generate large sense-annotated corpora to train sense-level embeddings. (Taghipou and Ng, 2015) prepare POS-specific embeddings by applying a neural network with a trainable embedding layer. They use those embeddings to extend the feature space of a supervised WSD tool named IMS.

In (Bhingardive et al., 2015), the authors propose to exploit word embeddings in an unsupervised method for most frequent sense detection from untagged corpora. Like in our work, the paper explores creation of sense embeddings with the use of WordNet. As the authors put it, sense embeddings are obtained by taking the average of word embeddings of each word in the sense-bag. The sense-bag for each sense of a word is obtained by extracting the context words from the WordNet, such as synset members (S), content words in the gloss (G), content words in the example sentence (E), synset members of the hypernymy-hyponymy synsets (HS), and so on.

3 Word-Sense Annotated Treebank

The main obstacle in elaborating a WSD method for Polish is the lack of semantically annotated resources which can be applied for training and evaluation. In our experiment we used an existing one which uses wordnet senses – the semantic annotation (Hajnicz, 2014) of Składnica (Woliński et al., 2011). The set is a rather small but carefully prepared resource and contains constituency parse trees for Polish sentences. The adapted version of Składnica (0.5) contains 8241 manually validated trees. Sentence tokens are annotated with fine-grained semantic types represented by Polish wordnet synsets from plWordnet 2.0 (plWordnet, Piasecki et al., 2009, http://plwordnet.pwr.wroc.pl/wordnet/). The set contains lexical units of three open parts of speech: adjectives, nouns and verbs. Therefore, only tokens belonging to these POS are annotated (as well as abbreviations and acronyms). Składnica contains about 50K nouns, verbs and adjectives for annotation, and 17410 of them, belonging to 2785 (34%) sentences, have already been annotated.

For 2072 tokens (12%), the lexical unit appropriate in the context has not been found in plWordnet.

4 Obtaining Sense Embeddings

In this section we describe the method of obtaining sense-level word embeddings. Unlike most of the approaches described in Section 2.2, our method is applied to manually sense-labeled corpora.

In Wordnet, words either occur in multiple synsets (are therefore ambiguous and subject of WSD), or in one synset (are unambiguous). Our approach is to focus on synsets that contain both ambiguous and unambiguous words. In Składnica 2.0 (Polish WordNet) we found 28766 synsets matching these criteria and therefore potentially suitable for our experiments.

Let us consider a synset containing the following words: 'blemish', 'deface', 'disfigure'. The word 'blemish' appears also in other synsets (is ambiguous) while the words 'deface' and 'disfigure' are specific for this synset and do not appear in any other synset (are unambiguous).

We assume that embeddings specific to a sense or synset can be approximated by the unambiguous part of the synset. While some researchers such as (Bhingardive et al., 2015) take average embeddings of all synset-specific words, even using glosses and hyperonymy, we use unambiguous words to generate the word2vec embedding vector of a sense.

During training, each occurrence of an unambiguous word in the corpus is substituted for a synset identifier. As in the provided example, each occurrence of 'deface' and 'disfigure' would be replaced by its sense identifier, the same for both unambiguous words. We'll later use these sense vectors to distinguish between senses of ambiguous 'blemish' given their contexts.

We train word2vec vectors using the substitution mechanism described above on a dump of all Polish language Wikipedia and a 300-million subset of the National Corpus of Polish (Przepiórkowski et al., 2012). The embedding size is set to 100; all other word2vec parameters have the default value as in (Řehůřek and Sojka, 2010). The model is based on lemmatized text (base word forms), so only the occurrences of forms with identical lemmas are taken into account.

5 Unsupervised Word Sense Recognition

In this section we are proposing a simple unsupervised approach to WSD. The key idea is to use word embeddings in a probabilistic interpretation and an application comparable to language modeling, however without building any additional models or parameter-rich systems. The method is derived from (Taddy, 2015), where it was used with a bayesian classifier and vector embedding inversion to classify documents.

(Mikolov et al., 2013) describe two alternative methods of generating word embeddings: the skip-gram, which represents conditional probability for a word's context (surrounding words), and CBOW, which targets the conditional probability for each word given its context. None of these corresponds to a likelihood model, but as (Taddy, 2015) notes they can be interpreted as components in a composite likelihood approximation. Let w = [w_1 ... w_T] denote an ordered vector of words. The skip-gram in (Mikolov et al., 2013) yields the pairwise composite log likelihood:

    log p_V(w) = Σ_{j=1}^{T} Σ_{k=1}^{T} 1[1 ≤ |k − j| ≤ b] log p_V(w_k | w_j)    (1)

We use the above formula to compute the probability of a sentence. Unambiguous words are represented as their word2vec representations derived directly from the corpus. In the case of ambiguous words, we substitute them with each possible sense vector (generated from unambiguous parts of synsets, as has been previously described). Therefore, for an ambiguous word to be disambiguated, we generate as many variants of a sentence as there are its senses, and compute each variant's likelihood using formula 1. Ambiguous words which occur in the context are omitted (although we might also replace them with an averaged vector representing all their meanings). Finally, we select the most probable variant.

Because the method involves no model training, we evaluate it directly over the whole data set without dividing it into train and test sets for cross-validation.
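A sketch of the two steps described above, assuming gensim (version 4 parameter names): (i) replace unambiguous synonyms with their synset identifier before training, and (ii) score each sense-substituted variant of a test sentence. gensim's score() requires a hierarchical-softmax skip-gram model, whereas the paper keeps default word2vec settings, so this is only one possible realisation of formula (1); unambiguous_sense, senses_of and is_ambiguous are hypothetical lookups, not the authors' code.

```python
from gensim.models import Word2Vec


def substitute_senses(lemmatised_sentences, unambiguous_sense):
    """Replace every unambiguous lemma with its synset identifier so that
    the trained model contains one vector per sense."""
    return [[unambiguous_sense.get(tok, tok) for tok in sent]
            for sent in lemmatised_sentences]


def train_sense_model(lemmatised_sentences, unambiguous_sense):
    corpus = substitute_senses(lemmatised_sentences, unambiguous_sense)
    # sg=1, hs=1, negative=0 so that model.score() can return sentence
    # log-likelihoods (Taddy-style scoring); vector_size=100 as in the paper.
    return Word2Vec(corpus, vector_size=100, sg=1, hs=1, negative=0,
                    min_count=5)


def disambiguate(model, sentence, position, senses_of, is_ambiguous):
    """Score one sense-substituted variant per candidate sense of the word
    at `position`, omitting other ambiguous words, and return the winner."""
    kept, target_idx = [], None
    for i, tok in enumerate(sentence):
        if i == position:
            target_idx = len(kept)
            kept.append(tok)
        elif not is_ambiguous(tok):       # other ambiguous words are omitted
            kept.append(tok)
    senses = list(senses_of(sentence[position]))
    variants = [kept[:target_idx] + [s] + kept[target_idx + 1:] for s in senses]
    log_probs = model.score(variants, total_sentences=len(variants))
    return senses[int(log_probs.argmax())]
```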

6 Supervised Word Sense Recognition

In the supervised approach, we train neural network models to predict word senses. In our experiment, the neural network model acts as a regression function F transforming word embeddings provided at input into sense (synset identifier) vectors.

As the network architecture we selected LSTM (Hochreiter and Schmidhuber, 1997). The neural network model consists of one LSTM layer followed by a dense (perceptron) layer at the output. We train the network using a mean standard error loss function.

Input data consists of sequences of five word2vec embeddings: of the two words that make the left and right symmetric contexts of each input word to be disambiguated, and the word itself represented by the average vector of the vectors representing all its senses. Ambiguous words for which there are no embeddings are represented by zero vectors (padded). Zero vectors are also added if the context is too short. This data is used to train the LSTM model (Keras 1.0.1, https://keras.io/) linked with the subsequent dense layer with a sigmoid activation function.

At the final step, we transform the output into synsets rather than vectors. We select the most appropriate sense from the set of possible senses in the inventory, taking into account the continuous output structure. In this step, the neural network output layer (which is a vector of the same size as the input embeddings, but transformed) is compared with each possible sense vector. To compare vectors, we use the cosine similarity measure, defined between any two vectors.

We compute cosine similarity between the neural network output vector nnv and each sense from the possible sense inventory S, and select the sense with the maximum cosine similarity towards nnv.

To test each neural network set-up we use 30-fold cross-validation.
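A minimal Keras sketch of the supervised setup just described: a single LSTM over the five context embeddings, a dense sigmoid output of the same dimensionality as the embeddings, and cosine-based selection of the closest sense vector. The paper used Keras 1.0.1 and reports a 'mean standard error' loss; the sketch assumes a current Keras API and a mean-squared-error loss, and all helper names are illustrative rather than the authors' code.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

EMB_DIM = 100  # embedding size used in the paper


def build_model(seq_len=5, emb_dim=EMB_DIM):
    """LSTM layer followed by a dense (sigmoid) layer mapping the five-word
    context window onto a sense-vector-sized output."""
    model = Sequential()
    model.add(LSTM(emb_dim, input_shape=(seq_len, emb_dim)))
    model.add(Dense(emb_dim, activation="sigmoid"))
    model.compile(loss="mse", optimizer="adam")
    return model


def choose_sense(output_vector, sense_vectors):
    """Return the synset whose sense vector has maximum cosine similarity
    with the network output (sense_vectors: synset id -> vector)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(sense_vectors, key=lambda s: cos(output_vector, sense_vectors[s]))
```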

7 Results

In this section we put a summary of the results obtained on our test set, as well as two baseline results. The corpus consisted of 2785 sentences and 303 occurrences of annotated ambiguous words which could be disambiguated by our algorithms, i.e. there were unambiguous equivalents of their senses and there were appropriate word embeddings for at least one of the other senses of the word. There were 5571 occurrences of words which occurred only in one sense.

Table 1 presents precision of both tested methods computed over the Składnica dataset. The set contains 344 occurrences of ambiguous words which were eligible for our method. For the unsupervised approach we tested a window of 5 and 10 words around the analyzed word.

The ambiguous words from the sentence other than the one being disambiguated at the moment are either omitted or represented as a vector representing all their occurrences. The uniq variant omits all other ambiguous words from the sentence, while in the all variant we use the non-disambiguated representation of these words.

Method          Settings        Precision
random baseline N/A             0.47
MFS baseline    N/A             0.73
pagerank        N/A             0.52
unsupervised    5 word, all     0.507
unsupervised    5 word, uniq    0.507
unsupervised    10 word, uniq   0.529
unsupervised    10 word, all    0.513
supervised      750 epochs      0.673
supervised      1000 epochs     0.680
supervised      2000 epochs     0.690
supervised      4000 epochs     0.667

Table 1: Precision of word-sense disambiguation methods for Polish.

In the supervised approach the best results were obtained for 2000 epochs but they did not differ much from those obtained after 1000 epochs.

For comparison, we include two baseline values:

• random baseline: select a random sense from a uniform random probability distribution,

• MFS baseline: use the most frequent sense as computed from the same corpus (there is no other available sense frequency data for Polish that could be obtained from manually annotated sources).

The table also includes results computed using the pagerank WSD algorithm developed at the PWR (Kędzia et al., 2015). These results were obtained for all the ambiguous words occurring within the sample, so they cannot be directly compared to our results.

As the results indicate, the unsupervised method performs at the level of random sense selection. Below there are two examples of the analyzed sentences.

• lęk przed nicością łączy się z doświadczeniem pustki 'fear of nothingness combines with the experience of emptiness': in this sentence, the Polish ambiguous words 'nothingness' and 'emptiness' were resolved correctly, while the ambiguous word 'experience' does not have unambiguous equivalents.

• na tym nie kończą się problemy 'that does not stop problems': in this example the ambiguous word 'problem' was not resolved correctly, but this case is difficult also for humans.

The low quality of the results might be the effect of the relatively short context available, as the analysed text is not continuous.

It might also point to the difficulty of the test set. Senses in plWordnet are very numerous and hard to differentiate even for humans. But the results of the supervised method falsify this assumption.

Our supervised approach gave much better results, although they are also not very good as the amount of annotated data is rather small. In this approach more epochs resulted in slight model over-fitting.

8 Conclusions

Our work introduced two methods of word sense disambiguation based on word embeddings, supervised and unsupervised. The first approach assumes a probabilistic interpretation of embeddings and computes log probability from sequences of word embedding vectors. In place of an ambiguous word we put embeddings specific for each possible sense and evaluate the likelihood of the thus obtained sentences. Finally we select the most probable sentence. The second, supervised method is based on a neural network trained to learn a context-sensitive transformation that maps an input vector of an ambiguous word into an output vector representing its sense. We compared the performance of both methods on corpora with manual annotations of word senses from the Polish wordnet (plWordnet). The results show the low quality of the unsupervised method and suggest the superiority of the supervised version in comparison to the pagerank method on the set of words which were eligible for our approach. Although the baseline in which just the most frequent sense is chosen is still a little better, this is probably due to the very limited training set available for Polish.

Acknowledgments
The paper is partially supported by the Polish National Science Centre project Compositional distributional semantic models for identification, discrimination and disambiguation of senses in Polish texts, number 2014/15/B/ST6/05186.

References

Eneko Agirre, Oier López de Lacalle, and Aitor Soroa. 2014. Random walks for knowledge-based word sense disambiguation. Comput. Linguist., 40(1):57–84, March.

Pierpaolo Basile, Annalina Caputo, and Giovanni Semeraro. 2014. An enhanced lesk word sense disambiguation algorithm through a distributional semantic model. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics, Dublin, Ireland. Association for Computational Linguistics.

Sudha Bhingardive, Dhirendra Singh, Rudra Murthy, Hanumant Redkar, and Pushpak Bhattacharyya. 2015. Unsupervised most frequent sense detection using word embeddings. In DENVER.

Bartosz Broda, Michał Marcińczuk, Marek Maziarz, Adam Radziszewski, and Adam Wardyński. 2012. KPWr: Towards a free corpus of Polish. Proceedings of LREC'12.

Elżbieta Hajnicz. 2014. Lexico-semantic annotation of Składnica treebank by means of PLWN lexical units. In Heili Orav, Christiane Fellbaum, and Piek Vossen, editors, Proceedings of the 7th International WordNet Conference (GWC 2014), pages 23–31, Tartu, Estonia. University of Tartu.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780, November.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 873–882, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. Sensembed: Learning sense embeddings for word and relational similarity. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 95–105.

Łukasz Kobyliński and Mateusz Kopeć. 2012. Semantic similarity functions in word sense disambiguation. In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech and Dialogue: 15th International Conference, TSD 2012, Brno, Czech Republic, September 3-7, 2012. Proceedings, pages 31–38, Berlin, Heidelberg. Springer.

Łukasz Kobyliński. 2012. Mining class association rules for word sense disambiguation. In Pascal Bouvry, Mieczysław A. Kłopotek, Franck Leprevost, Małgorzata Marciniak, Agnieszka Mykowiecka, and Henryk Rybiński, editors, Security and Intelligent Information Systems: International Joint Conference, SIIS 2011, Warsaw, Poland, June 13-14, 2011, Revised Selected Papers, volume 7053 of Lecture Notes in Computer Science, pages 307–317, Berlin, Heidelberg. Springer.

Paweł Kędzia, Maciej Piasecki, and Marlena Orlińska. 2015. Word sense disambiguation based on large scale Polish CLARIN heterogeneous lexical resources. Cognitive Studies | Études cognitives, 15:269–292.

Rada Mihalcea, Paul Tarau, and Elizabeth Figa. 2004. Pagerank on semantic networks, with application to word sense disambiguation. In Proceedings of the 20th International Conference on Computational Linguistics, COLING '04, Stroudsburg, PA, USA. Association for Computational Linguistics.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA, pages 746–751.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1059–1069.

Adam Przepiórkowski, Mirosław Bańko, Rafał L. Górski, and Barbara Lewandowska-Tomaszczyk, editors. 2012. Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN, Warsaw.

Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA. http://is.muni.cz/publication/884893/en.

Joseph Reisinger and Raymond J Mooney. 2010. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 109–117. Association for Computational Linguistics.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123.

Matt Taddy. 2015. Document classification by inversion of distributed language representations. CoRR, abs/1504.07295.

Kaveh Taghipou and Hwee Tou Ng. 2015. Semi-supervised word sense disambiguation using word embeddings in general and specific domains. In Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 314–323. Association for Computational Linguistics.

Rocco Tripodi and Marcello Pelillo. 2017. A game-theoretic approach to word sense disambiguation. Computational Linguistics.

Marcin Woliński, Katarzyna Głowińska, and Marek Świdziński. 2011. A preliminary version of Składnica—a treebank of Polish. In Zygmunt Vetulani, editor, Proceedings of the 5th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, pages 299–303, Poznań, Poland.


Author Index

An, Yuan, 31
Apidianaki, Marianna, 110
Arora, Sanjeev, 12
Biemann, Chris, 72
Callison-Burch, Chris, 110
Cocos, Anne, 110
Cook, Paul, 47
El-Haj, Mahmoud, 61
Ezeani, Ignatius, 53
Faralli, Stefano, 72
Fellbaum, Christiane, 12
Gamallo, Pablo, 1
Gregori, Lorenzo, 102
Hasan, Sadid, 31
Hayashi, Yoshihiko, 37
Hepple, Mark, 53
Jang, Kyoung-Rok, 91
Kanada, Kentaro, 37
Khodak, Mikhail, 12
King, Milton, 47
Kobayashi, Tetsunori, 37
Kober, Thomas, 79
Köper, Maximilian, 24
Lieto, Antonio, 96
Ling, Yuan, 31
Mensa, Enrico, 96
Myaeng, Sung-Hyon, 91
Mykowiecka, Agnieszka, 120
Onyenwe, Ikechukwu, 53
Panchenko, Alexander, 72
Panunzi, Alessandro, 102
Pereira-Fariña, Martín, 1
Piao, Scott, 61
Ponzetto, Simone Paolo, 72
Radicioni, Daniele P., 96
Rayson, Paul, 61
Reffin, Jeremy, 79
Risteski, Andrej, 12
Schulte im Walde, Sabine, 24
Wattam, Stephen, 61
Wawer, Aleksander, 120
Weeds, Julie, 79
Weir, David, 79
Wilkie, John, 79
