Latent semantic network induction in the context of linked example senses

Hunter Scott Heidenreich
Department of Computer Science
College of Computing and Informatics
[email protected]

Jake Ryland Williams
Department of Information Science
College of Computing and Informatics
[email protected]

Proceedings of the 2019 EMNLP Workshop W-NUT: The 5th Workshop on Noisy User-generated Text, pages 170–180, Hong Kong, Nov 4, 2019. © 2019 Association for Computational Linguistics

Abstract

The Princeton WordNet is a powerful tool for studying language and developing natural language processing algorithms. With significant work developing it further, one line considers its extension through aligning its expert-annotated structure with other lexical resources. In contrast, this work explores a completely data-driven approach to network construction, forming a wordnet using the entirety of the open-source, noisy, user-annotated dictionary, Wiktionary. Comparing baselines to WordNet, we find compelling evidence that our network induction process constructs a network with useful semantic structure. With thousands of semantically-linked examples that demonstrate sense usage from basic lemmas to multiword expressions (MWEs), we believe this work motivates future research.

1 Introduction

Wiktionary is a free and open-source collaborative dictionary (Wikimedia).¹ With the ability for anyone to add or edit lemmas, definitions, relations, and examples, Wiktionary has the potential to be larger and more diverse than any printable dictionary. Wiktionary features a rich set of examples of sense usage for many of its lemmas which, when converted to a usable format, supports language processing tasks such as sense disambiguation (Meyer and Gurevych, 2010a; Matuschek and Gurevych, 2013; Miller and Gurevych, 2014) and MWE identification (Muzny and Zettlemoyer, 2013; Salehi et al., 2014; Hosseini et al., 2016). With natural alignment to other languages, Wiktionary can likewise be used as a resource for machine translation tasks (Matuschek et al., 2013; Borin et al., 2014; Göhring, 2014). With these uses in mind, this work introduces the creation of a network—much like the Princeton WordNet (Miller, 1995; Fellbaum, 1998)—that is constructed solely from the semi-structured data of Wiktionary. This relies on the noisy annotations of the editors of Wiktionary to naturally induce a network over the entirety of the English portion of Wiktionary. In doing so, the development of this work produces:

• an induced network over Wiktionary, enriched with semantically linked examples, forming a directed acyclic graph (DAG);

• an exploration of the task of relationship disambiguation as a means to induce network construction; and

• an outline for directions of expansion, including increasing precision in disambiguation, cross-linking example usages, and aligning English Wiktionary with other languages.

We make our code freely available², which includes code to download data, to disambiguate relationships between lemmas, to construct networks from disambiguation output, and to interact with networks produced through this work.

¹ https://www.wiktionary.org/
² Code will be available at https://github.com/hunter-heidenreich/lsni-paper

2 Related work

2.1 WordNet

The Princeton WordNet, or WordNet as it's more commonly referred to, is a lexical database originally created for the English language (Miller, 1995; Fellbaum, 1998). It consists of expert-annotated data, and has been more or less continually updated since its creation (Harabagiu et al., 1999; Miller and Hristea, 2006). WordNet is built up of synsets, collections of lexical items that all have the same meaning. For each synset, a definition is provided, and for some synsets, usage examples are also presented. If extracted and attributed properly, the example usages present on Wiktionary could critically enhance WordNet by filling gaps. While significant other work has been done in utilizing Wiktionary to enhance WordNet for purposes like this (discussed in the next sections), this work takes a novel step by constructing a wordnet through entirely computational means, i.e., under the framing of a machine learning task based on Wiktionary's data.

2.2 Wiktionary

Wiktionary is an open-source, Wiki-based, open content dictionary organized by the WikiMedia Foundation (Wikimedia). It has a large and active volunteer editorial community, and from its noisy, crowd-sourced nature, includes many MWEs, colloquial terms, and their example usages, which could ultimately fill difficult-to-resolve gaps left in other linguistic resources, such as WordNet. Thus, Wiktionary has a significant history of exploration for the enhancement of WordNet, including efforts that extend WordNet for better domain coverage of word senses (Meyer and Gurevych, 2011; Gurevych et al., 2012; Miller and Gurevych, 2014), automatically derive new lemmas (Jurgens and Pilehvar, 2015; Rusert and Pedersen, 2016), and develop the creation of multilingual wordnets (de Melo and Weikum, 2009; Gurevych et al., 2012; Bond and Foster, 2013). While these works constitute important steps in the usage of extracted Wiktionary contents for the development of WordNet, none before this effort has attempted to utilize the entirety of Wiktionary alone for the construction of such a network.

Most similarly, Wiktionary has been used in a sense-disambiguated fashion (Meyer and Gurevych, 2012b) and to construct an ontology (Meyer and Gurevych, 2012a). Our work does not create an ontology, but instead attempts to create a semantic wordnet. In this context, our work can be viewed as building on notions of sense-disambiguating Wiktionary to construct a WordNet-like resource.

2.3 Relation Disambiguation

The task of taking definitions, a semantic relationship, and sub-selecting the definitions that belong to that relationship is one of critical importance to our work. Sometimes called sense linking or relationship anchoring, this task has been previously explored in the creation of machine-readable dictionaries (Krovetz, 1992; Pantel and Pennacchiotti, 2006, 2008) and German Wiktionary (Meyer and Gurevych, 2010b).

As mentioned above, Meyer and Gurevych explore relationship disambiguation in the context of Wiktionary, motivating a sense-disambiguated Wiktionary as a powerful resource (Meyer and Gurevych, 2012a,b). This task is frequently viewed as a binary classification: given two linked lemmas, do these pairs of definitions belong to the relationship? While easier to model, this framing can suffer from a combinatorial explosion, as all pairs of definitions must be compared. This work attempts to model the task differently, disambiguating all definitions in the context of a relationship and its lemmas.

3 Model

3.1 Framework

This work starts by identifying a set of lemmas, W, and a set of senses, S. It then proceeds, assuming that S forms the vertex set of a Directed Acyclic Graph (DAG) with edge set E, organizing S by refinement of specificity. That is, if senses s, t ∈ S have a link (t, s) ∈ E—to s—then s is one degree of refinement more specific than t.

Next, we suppose a lemma u ∈ W has relation ∼ (e.g., synonymy) indicated to another lemma v ∈ W. Assuming ∼ is recorded from u to v (e.g., on u's page), we call u the source and v the sink. Working along these lines, the model then assumes a given indicated relation ∼ is qualified by a sense s; this semantic equivalence is denoted u ∼ˢ v.

Like others (Landauer and Dumais, 1997; Blei et al., 2003; Bengio et al., 2003), this work assumes senses exist in a latent semantic space. Processing a dictionary, one can empirically discover relationships like u ∼ˢ v and v ∼ᵗ w. But for a larger network structure one must know if s = t—that is, do s and t refer to the same relationship—and often neither s nor t is known explicitly. Hence, this work sets up approximations of s and t for comparison. Given a lemma, u ∈ W, suppose a set of definitions, Du, exists and forms the basis for disambiguation of a lemma's senses. We then assume that for any d ∈ Du there exists one or more senses, s ∈ S, such that d ⇒ s; that is, the definition d conveys the sense s.
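The framework above can be made concrete with a small sketch. All class and field names here are hypothetical illustrations of the definitions in Sec. 3.1, not the authors' released code: senses form the vertices of a DAG whose edges record one-step refinements, and annotated relations carry a source lemma, a sink lemma, and a relation kind.

```python
from dataclasses import dataclass, field

@dataclass
class SenseGraph:
    """Senses S as vertices of a DAG; an edge (t, s) in E reads:
    s is one degree of refinement more specific than t."""
    senses: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def add_refinement(self, t, s):
        self.senses |= {t, s}
        self.edges.add((t, s))

    def more_specific(self, t):
        """Senses one refinement step more specific than t."""
        return {s for (t2, s) in self.edges if t2 == t}

@dataclass(frozen=True)
class Relation:
    """An annotated relation u ~ v, recorded on the source lemma's page."""
    source: str
    sink: str
    kind: str  # e.g. "synonym" or "antonym"

g = SenseGraph()
g.add_refinement("proper.general", "good.holy")
assert g.more_specific("proper.general") == {"good.holy"}
```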

Having assumed a DAG structure for S, this work denotes specificity of sense by using the formalism of a partial order, ⪯, which, for senses s, t ∈ S having s ⪯ t, indicates that the sense s is comparable to t and more specific. Note that—as with any partial order—senses can be, and often are, non-comparable.

Intuitively, a given definition d might convey multiple senses d ⇒ s, t of differing specificities, s ⪯ t. So for a given definition d, the model's goal is to find the sense t that is least specific in being conveyed. Satisfying this goal implies resolving the sense identification function, f : D → S, for which any lemma u ∈ W and definition d ∈ Du with d ⇒ s ∈ S, it is assured that s ⪯ f(d). Since no direct knowledge of any s ∈ S is assumed known for any annotated relationship between lemmas, systems must approximate senses according to the available resources, e.g., definitions or example usages.

3.2 Task development

On Wiktionary, every lemma has its own page. Each page is commonly broken down into sections such as languages, etymologies, and parts-of-speech (POS). Under each POS, a lemma features a set of definitions that can be automatically extracted. An example of the word induce on English Wiktionary can be seen in Figure 1.

A significant benefit of using Wiktionary as a resource to build a wordnet lies in the wealth of examples it offers. Examples come in two flavors: basic usage and usage from reference material. Currently, each example is linked to its origination definition and lemma; however, in future works, these examples could be segmented and sense disambiguated, offering new network links and densely connected example usages.

For each lemma, Wiktionary may offer relationship annotations between lemmas. These relationships span many categories including acronyms, alternative forms, anagrams, antonyms, compounds, conjugations, derived terms, descendants, holonyms, hypernyms, hyponyms, meronyms, related terms, and synonyms. For this work's purposes, only antonyms and synonyms are considered, exploiting their more typical structure on Wiktionary and their clear theoretical basis in semantic equivalence to induce a network. Exploring more of these relationships is of interest in future work.

Additionally, a minority of annotations present 'gloss' labels, which indicate the definitions that apply to relationships. So from the data there is some knowledge of exact matching, but due to their limited, noisy, and crowd-sourced nature, the labelings may not cover all definitions that belong.

We assume annotations exhibit relationships between lemmas. Finding one, u ∼ˢ v, if u is the source, we assume there exists some definition d ∈ Du that implies the appropriate sense: d ⇒ s. This good-practice assumption models editor behavior as a response to exposure to a particular definition on the source page. Provided this, an editor won't necessarily annotate the relationship on the sink page—even if the sink page has a definition that implies the sense s. Thus, our task doesn't require identification of a definition on the sink's page. More precisely, no d ∈ Dv might exist that implies s (d ⇒ s) for an annotated relationship, u ∼ˢ v.

Altogether, for an annotated relationship the task aims to identify the sense-conveying subset:

    D(u ∼ˢ v) = {d ∈ Du ∪ Dv | d ⇒ s},

for which at least one definition must be drawn from Du. Note that the model does not assume that arbitrary d, d̃ ∈ D(u ∼ˢ v) map through the sense identification function to the same most general sense. Presently, these details are resolved by a separate algorithm (developed below), leaving direct modeling of the sense identification function to future work.³

3.3 Semantic hierarchy induction

This section outlines preliminary work inferring a semantic hierarchy from pairwise relationships. If A is the set of relationships, a model's output, C, will be a collection of sense-conveying subsets, D(u ∼ˢ v), in one-to-one correspondence: A ↔ C. So, for all 𝒟 ∈ P(C), one has a covering of (some) senses by pairwise relationships, D(u ∼ˢ v) ∈ 𝒟.

Under our assumptions, any collection of sense-conveying subsets 𝒟 ∈ P(C) with non-empty intersection restricts to a set of definitions that must convey at least one common sense, s′. Notably, s′ must be at least as general as any sense qualifying a particular annotated relationship, i.e., s ⪯ s′ for any s (implicitly) defining any D(u ∼ˢ v) ∈ 𝒟.

³ A major challenge to this approach is the increased complexity required for the development of evaluation data.
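The sense-conveying subset D(u ∼ˢ v) = {d ∈ Du ∪ Dv | d ⇒ s} can be sketched in code. This is a toy illustration only: the `conveys` callable stands in for whatever model judges that a definition implies the sense qualifying the relation, and the example data is invented.

```python
def sense_conveying_subset(defs_u, defs_v, conveys):
    """Approximate D(u ~s v): definitions from the source (u) and
    sink (v) pages that convey the relation's sense, requiring at
    least one definition from the source side Du."""
    subset = [d for d in list(defs_u) + list(defs_v) if conveys(d)]
    # good-practice assumption: the annotation responds to some source definition
    if not any(conveys(d) for d in defs_u):
        return []
    return subset

# Toy example: the relation's sense is "caused/brought about".
defs_induce = ["to cause or bring about", "to infer by induction"]
defs_cause = ["to produce as an effect", "a reason for an action"]
picked = sense_conveying_subset(defs_induce, defs_cause,
                                lambda d: "cause" in d or "effect" in d)
assert picked == ["to cause or bring about", "to produce as an effect"]
```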

Figure 1: The Verb section of the induce page on English Wiktionary. Definitions are enumerated, with example usages as sub-elements or drop-down quotations. Relationships for this page are well annotated, with gloss labels to indicate the definition that prompted annotation.
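The per-POS definition extraction described in Sec. 3.2 can be sketched against the Wikimedia REST API (footnote 4 below names this API as the download source). The `/page/definition/{term}` endpoint and the JSON shape used here are assumptions based on the public API, not a description of the authors' pipeline:

```python
import json

def extract_definitions(payload, lang="en"):
    """Flatten a /page/definition/{term}-style response into
    (POS, definition) pairs, one per enumerated definition."""
    out = []
    for section in payload.get(lang, []):
        pos = section.get("partOfSpeech", "")
        for d in section.get("definitions", []):
            out.append((pos, d.get("definition", "")))
    return out

# A miniature response in the assumed shape, for the verb "induce":
sample = json.loads("""
{"en": [{"partOfSpeech": "Verb",
         "definitions": [{"definition": "To lead by persuasion or influence; incite."}]}]}
""")
assert extract_definitions(sample) == [
    ("Verb", "To lead by persuasion or influence; incite.")]
```

A live call would GET `https://en.wiktionary.org/api/rest_v1/page/definition/induce` and pass the parsed JSON to `extract_definitions`; the offline sample keeps the sketch self-contained.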

So this work induces the sense-identification function, f, through pre-images: for 𝒟 ∈ P(C), an implicit sense, s, is assumed such that f⁻¹(s) ⊆ ⋂𝒟. Now, if a covering 𝒟′ ⊃ 𝒟 exists with non-empty intersection, then its (smaller) intersection comprises definitions that convey a sense, s′, which is more general than s. So to precisely resolve f through pre-images, the model must 'hole punch' the more-general definitions, constructing the hierarchy by allocating the more general definitions in the intersections of larger coverings 𝒟′ to the more general senses:

    f⁻¹(t) = ( ⋂_{D ∈ 𝒟} D ) \ ( ⋃_{𝒟′ ⊃ 𝒟} ⋂_{D′ ∈ 𝒟′} D′ ).

This allocates each definition to exactly one implicit sense approximation, t, which is the most general sense indicated by the definition. Additionally, all senses then fall under a DAG hierarchy (excepting the singletons, addressed below), as set inclusion, 𝒟′ ⊃ 𝒟, defines a partial order. This deterministic algorithm for hierarchy induction is presented in Algorithm 1.

Algorithm 1 Construction of semantic hierarchy through pairwise collection.
Require: C: collection of subsets D(u ∼ˢ v)
  levels ← List()
  prev ← C
  while prev ≠ ∅ do
    next ← List()
    defs ← ∅
    for p, p′ ∈ prev do
      if p ≠ p′ and p ∩ p′ ≠ ∅ then
        Append(next, p ∩ p′)
        Union(defs, p ∩ p′)
      end if
    end for
    filtered ← List()
    for p ∈ prev do
      Append(filtered, p \ defs)
    end for
    Append(levels, filtered)
    prev ← next
  end while
  return levels

Considering the output of a model, C, if d is not covered by C, the model assumes a singleton sense. These include definitions not selected during relationship disambiguation as well as the definitions of lemmas that feature no relationship annotations. Singletons are then placed in the DAG at the lowest level, disconnected from all other senses. Figure 2 visually represents this full semantic hierarchy.
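Algorithm 1 translates directly into Python. A minimal sketch, treating each sense-conveying subset as a set of definition strings and iterating unordered pairs to avoid the duplicate p ∩ p′ / p′ ∩ p intersections the ordered loop would append:

```python
def build_hierarchy(collection):
    """Algorithm 1: build levels of a semantic hierarchy from a
    collection of sense-conveying subsets (sets of definitions).
    Definitions shared between subsets are 'hole punched' out of
    the current level and pushed to the next, more general one."""
    levels = []
    prev = [set(p) for p in collection]
    while prev:
        nxt = []      # pairwise intersections: the next, more general level
        defs = set()  # definitions removed from the current level
        for i, p in enumerate(prev):
            for q in prev[i + 1:]:  # unordered pairs with p != q
                common = p & q
                if common:
                    nxt.append(common)
                    defs |= common
        levels.append([p - defs for p in prev])  # filtered current level
        prev = nxt
    return levels

# Two overlapping subsets: the shared definitions {b, x} rise to a
# more general level, leaving {a} and {c} as more specific senses.
assert build_hierarchy([{"a", "b", "x"}, {"b", "c", "x"}]) == [[{"a"}, {"c"}], [{"b", "x"}]]
```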

Figure 2: A visualization of 3 lemmas intersecting to create a semantic hierarchy.

4 Evaluation

4.1 Characteristics of Wiktionary data

Data was downloaded from Wiktionary on 1/23/19 using the Wikimedia REST API.⁴ To evaluate performance, a 'gold' dataset was created to compare modeling strategies. In totality, 298,377 synonym and 44,758 antonym links were generated from Wiktionary. 'Gold' links were randomly sampled, selecting 400 synonym and 100 antonym links. For each link, source and sink lemmas were considered independently. Definitions were included if they could plausibly refer to the other lemma. This process is supported by the available examples, testing if one lemma can replace the other lemma in the example usages. This dataset was constructed in contrast to other Wiktionary relationship disambiguation tasks due to the modeling differences and the desire for more synonym- and antonym-specific evaluations (Meyer and Gurevych, 2012a,b).

⁴ https://en.wikipedia.org/api/rest_v1/

4.2 Evaluation strategy

This work's evaluation considers precision, recall, and variants of the Fβ score (biasing averages of precision and recall). As there is selection on both source and sink sides, we consider several averaging schemes. For a final evaluation, each sample is averaged at the side-level and averaged across all relationships. Macro-averages compute an unweighted average, while micro-averages weight performance based on the number of definitions involved in the selection process. Intuitively, micro metrics weight based on size, while macro metrics ignore size (treating all potential links and sides as equal).

4.3 Setting up baselines

For baselines, we present two types of models, which we refer to as return all and vector similarity. The return all baseline model assumes that for a given relationship link, all definitions belong. This is not intended as a model that could produce a useful network, as many definitions and lemmas would be linked that clearly do not belong together. This achieves maximum recall at the expense of precision, demonstrating a base level of precision that must be exceeded.

The vector similarity baseline model takes advantage of semantic vector representations for computing similarity (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014; Joulin et al., 2017). It computes the similarity between lemmas and definitions, utilizing thresholds that flag to either retain similarities above (max), below (min), or with magnitude above the threshold (abs).

Wiktionary features many MWEs and uncommon lemmas, requiring use of a vectorization strategy that allows for handling of lemmas not observed in the representation's training. Thus, FastText was selected for its ability to represent out-of-vocabulary lemmas through its bag-of-character n-gram modeling (Bojanowski et al., 2017). To compute similarity between lemmas and definitions, this model aggregates word vectors of the individual tokens present in a definition. Following other work (Lilleberg et al., 2015; Wu et al., 2018), TF-IDF weighted averages of word vectors were utilized in a very simple averaging scheme.

Initial results indicated that a simple cosine similarity with a linear kernel performed marginally above the return all baseline.⁵ Thus, kernel tricks (Cristianini and Shawe-Taylor, 2000) were explored (to positive effect). The Gaussian kernel is often recommended as a good initial kernel to try as a baseline (Schölkopf et al., 1995; Joachims, 1998). It is formulated using a radial basis function (RBF), only dependent on a measure of distance. The Laplacian kernel is a slight variation of the Gaussian kernel, measuring distance as the L1 distance where the Gaussian measures distance as L2 distance. Both kernels fall in the RBF category with a single regularization parameter, γ, and were used in comparison to cosine similarity.

⁵ This is interesting to note, since previous work has found that word embeddings like GloVe contain a surprising amount of word frequency effects that pollute simple cosine similarity (Schnabel et al., 2015). This may explain why vanilla cosine similarity performed poorly with FastText vectors here and provides more evidence against using it as the default similarity measure.

For these kernels, a grid search over γ was conducted from 10⁻³ to 10³ at steps of powers of 10. Similarly, similarity comparison thresholds were considered from −1.0 to 1.0 at steps of 0.05 for all 3 thresholding schemes (min, max, abs).

When selecting a final model, F1 scores were not considered, as recall scores outweighed precision under a simple harmonic mean. This resulted in models with identical performance to the return all model or worse. Instead, models were considered against full-precision and F0.1 scores.

4.4 Semantic Structure Correlation

Creating a wordnet solely from Wiktionary's noisy, crowd-sourced data begs the question: does the generated network structure resemble the structure present in Princeton's WordNet? To get a sense of this, we compare the capacities of each of these resources as a basis for semantic similarity modeling (using Pearson correlation (Pearson, 1895)). This work considers three notions of graph-based semantic similarity that are present in WordNet: path similarity (PS), Leacock Chodorow similarity (LCH) (Leacock and Chodorow, 1998), and Wu Palmer similarity (WP) (Wu and Palmer, 1994).

The point of this experiment is not to enforce a notion that this network should mirror the structure of WordNet. Given Wiktionary's size, it likely possesses a great deal of information not represented by WordNet (cf. our other experiment on word similarity, Sec. 5.3). But if there is some association between the semantic representation capacities of these two networks, we may possibly draw some insight into a more basic question: "has this model produced some relevant semantic structure?"

For this experiment, only nouns and verbs are considered, as they are the only POS for which WordNet defines these metrics. Additionally, these metrics are defined at the synset level. There is no direct mapping between synsets in our network and WordNet; therefore, scores are considered at a lemma level. By computing values of all pairs of synsets between lemmas, three values per metric are generated: minimum, maximum, and average. Additionally, only lemmas that differ in minimum and maximum similarity are retained, restricting the experiment to the most polysemous portions of the networks.

5 Results

5.1 Baseline model performance

Table 1 shows baseline model performance on the relationship disambiguation task and highlights model parameters. During evaluation, the Laplacian kernel was found to consistently outperform the Gaussian kernel. For this reason, this work presents the scores from the return all baseline and two variants of the Laplacian kernel model—one optimized for precision and the other for F0.1.

Note that in the synonym case, max-threshold selection performed best, while in the antonym case min- and abs-thresholds fared better. This aligns well with the notion that while synonyms are semantically similar, antonyms are semantically anti-similar—an interesting consideration for future model development.

Overall, from the scores in Table 1 one can see that the vector similarity models improve over the return all, but that there is much work to be done to further improve precision and recall.

5.2 Comparison against WordNet

WordNet publishes several statistics⁶ that one can use for quantitative comparison with the network constructed herein. Reviewing the count statistics shows that Wiktionary is an order of magnitude larger than WordNet and that Wiktionary features 344,789 linked example usages to WordNet's 68,411.

⁶ Statistics are taken from WordNet's website for WordNet 3.0, last accessed on 8/11/2019: https://wordnet.princeton.edu/documentation/wnstats7wn

Polysemy. Table 2 reports polysemy statistics. Despite the difference in creation processes, the induced networks do not have polysemy averages drastically different from WordNet. In comparing the three networks induced, there is a common theme of increase in polysemy when shifting from recall to precision. This makes sense due to the fact that the return all model will merge all possible lemmas that overlap in relationship annotations, resulting in lower polysemy statistics,

whereas a precision-based model will result in pair-wise clusters that do not overlap as broadly, resulting in more complex hierarchies.

                        Synonyms                                        Antonyms
  Model      Thresh.     Recall          Precision       Thresh.     Recall          Precision
                         Macro  Micro    Macro  Micro                Macro  Micro    Macro  Micro
  Ret. All               1.000  1.000    0.602  0.268                1.000  1.000    0.527  0.280
  Precision  max 0.35    0.433  0.258    0.847  0.541    min −0.35   0.266  0.196    0.820  0.600
  F0.1       max 0.30    0.535  0.404    0.814  0.532    abs 0.25    0.730  0.763    0.619  0.397

Table 1: Model performance with threshold selection. All γ = 0.1, except for antonym precision where γ = 100.

Structural differences. Intentionally, the presented notion of a semantic hierarchy functions similarly to the hypernym connections within WordNet. Moving up the semantic hierarchy produces sense approximations from definitions that are more general, and moving down the hierarchy produces more specific senses. However, in the induced networks, this is a notion applied to every POS—WordNet only produces these connections for nouns and verbs. An example taken from the F0.1 network is that of the adjective good (referring to Holy) being subsumed by a synset featuring the adjective proper (referring to suitable, acceptable, and following the established standards).

5.3 Word Similarity

In previous works, WordNet and Wiktionary have been used to create vector representations of words. A common method for evaluating the quality of word vectors is performance on word similarity tasks. Performance on these tasks is evaluated through Spearman's rank correlation (Spearman, 2010) between cosine similarity of vector representations and human annotations.

Using Explicit Semantic Analysis (ESA), a technique based on concept vectors, our network constructs vectors using a word's tf-idf scores over concepts, as has been done in prior works (Gabrilovich and Markovitch, 2007; Zesch et al., 2008; Meyer and Gurevych, 2012b). We define concepts as senses of the F0.1 network and compute cosine similarity in this representation.

We compare performance against other ESA methods (Zesch et al., 2008; Meyer and Gurevych, 2012b) on common datasets: Rubenstein and Goodenough's 65 noun pairs (1965, RG-65), Miller and Charles's 30 noun pairs (1991, MC-30), Finkelstein et al.'s 353 word similarity pairs (2002, WS-353, split into Fin-153 and Fin-200 due to different annotators), and Yang and Powers's 130 verb pairs (2006, YP-130). Our results are summarized in Table 3.

We also compare F0.1 against latent word vector representations like word2vec's continuous bag-of-words (CBOW) and skip-grams (SG) (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017). These results are presented in Table 4.

In analyzing these results, the F0.1 network performs well. Against other ESA methods, it is highly competitive, achieving the highest performance in two datasets. When strictly comparing performance against ESA with WordNet as the source, it has approximately equal or better performance in all datasets except YP-130. We hypothesize that this is due to a lack of precision in verb disambiguation, reinforced by the low polysemy seen above. Additionally, the work from Zesch et al. (2008) evaluated on subsets of the data in which all three resources had coverage. In their work, YP-130 performance is computed for only 80 of the 130 pairs.

Comparing F0.1 to latent word vectors, it has the highest performance on noun datasets and is competitive on WS-353. While not directly comparable, it achieves this through 26 million tokens of structured text in contrast to billions of tokens of unstructured text that train latent vectors.

5.4 Network Correlation Results

Table 5 displays correlation values between graph-based semantic similarity metrics of F0.1 and WordNet. Pairs of 1,009 verb and 1,303 noun lemmas were considered. In generating similarities, disconnected lemma pairs were discarded, producing 31,373 verb and 16,530 noun pairs. The table shows that for nouns, the two networks produce similarity values that are weakly to moderately correlated; however, verbs produce values that are, at most, very weakly correlated, if at all. Due to the fact that F0.1 produced better results

             With Monosemous Words                    Without Monosemous Words
  POS        WordNet  F0.1  Precision  Return All     WordNet  F0.1  Precision  Return All
  Noun       1.24     1.17  1.18       1.10           2.79     2.94  2.99       2.66
  Verb       2.17     1.20  1.22       1.10           3.57     3.18  3.33       2.78
  Adjective  1.18     1.18  1.18       1.10           2.71     2.59  2.62       2.33
  Adverb     1.25     1.11  1.12       1.08           2.50     2.34  2.36       2.25

Table 2: Average polysemy statistics.
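Table 2's averages follow from a simple computation over per-lemma sense counts. A sketch with made-up counts, not the paper's data; the "without monosemous" columns simply drop lemmas that carry a single sense:

```python
def average_polysemy(senses_per_lemma, include_monosemous=True):
    """Mean number of senses per lemma, optionally excluding
    monosemous lemmas (exactly one sense), as in Table 2."""
    counts = [n for n in senses_per_lemma.values()
              if include_monosemous or n > 1]
    return sum(counts) / len(counts) if counts else 0.0

# Hypothetical noun lemmas and their sense counts:
nouns = {"bank": 3, "run": 4, "tree": 1, "stone": 1}
assert average_polysemy(nouns) == 2.25                            # (3+4+1+1)/4
assert average_polysemy(nouns, include_monosemous=False) == 3.5   # (3+4)/2
```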

  Dataset                                  RG-65  MC-30  Fin-153  Fin-200  YP-130
  F0.1                                     0.831  0.849  0.723    0.557    0.687
  WordNet* (Zesch et al., 2008)            0.82   0.78   0.61     0.56     0.71
  Wikipedia* (Zesch et al., 2008)          0.76   0.68   0.70     0.50     0.29
  Wiktionary* (Zesch et al., 2008)         0.84   0.84   0.70     0.60     0.65
  Wiktionary (Meyer and Gurevych, 2012b)   -      -      -        -        0.73

Table 3: Spearman’s rank correlation coefficients on word similarity tasks. Best values are in bold.
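The evaluation behind Table 3 pairs cosine similarity of tf-idf concept vectors with Spearman's rank correlation against human judgments. A self-contained sketch: the concept vectors and scores are invented, the Spearman implementation is a toy one without tie handling (evaluations in practice would typically use `scipy.stats.spearmanr`), and sparse vectors are plain `{concept: tf-idf}` dicts:

```python
def spearman(xs, ys):
    """Spearman's rank correlation for equal-length score lists
    without ties: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def cosine(u, v):
    """Cosine similarity of sparse concept vectors ({concept: tf-idf})."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = sum(w * w for w in u.values()) ** 0.5
    nv = sum(w * w for w in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

# Perfectly concordant and discordant rankings:
assert spearman([0.1, 0.5, 0.9], [1, 2, 3]) == 1.0
assert spearman([1, 2, 3], [3, 2, 1]) == -1.0
```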

on noun similarity tasks, we hypothesize that this indicates better semantic structure for nouns than for verbs, further emphasizing that a possible limitation of the current baseline produced is its lack of precision when it comes to polysemous verbs. However, the positive correlation values seen for nouns, coupled with noun similarity performance, offer strong indications that the F0.1 network does provide useful semantic structure that can be further increased through better modeling.

  Dataset      RG-65  MC-30  WS-353
  F0.1         0.831  0.849  0.669
  FastText     -      -      0.73
  CBOW (6B)    0.682  0.656  0.572
  SG (6B)      0.697  0.652  0.628
  GloVe (6B)   0.778  0.727  0.658
  GloVe (42B)  0.829  0.836  0.759
  CBOW (100B)  0.754  0.796  0.684

Table 4: Spearman's correlation on word similarity tasks. Best values are in bold. Number of tokens in training data is featured in parentheses, if reported. FastText is reported from Bojanowski et al. (2017), and all others are from Pennington et al. (2014).

            Noun    Verb
  PS min    0.266   0.132
  PS max    0.495   0.189
  PS avg    0.448   0.082
  LCH min   0.207   0.120
  LCH max   0.384   0.056
  LCH avg   0.359   -0.013
  WP min    0.116   0.090
  WP max    0.219   0.005
  WP avg    0.226   -0.025

Table 5: Correlations between F0.1 and WordNet similarity metrics: path similarity (PS), Leacock Chodorow similarity (LCH), and Wu Palmer similarity (WP).

6 Future work

Here, several directions are highlighted along which we see this work being extended.

Better models. The development of more accurate models for predicting the definitions involved in the pair-wise relations will produce more interesting and useful networks, especially with the magnitude of examples of sense usage. Precision of verb relations seems to be a critical component of a better model.

Supervision. Relationship prediction is currently unsupervised. While it is an interesting task to model in this fashion, crowd-sourcing the annotation of this data would be possible through services like Amazon Mechanical Turk. This would allow for the potential of exploring supervised models for predicting relationship links, particularly for relationships like synonymy and antonymy, which are familiar concepts for a broad community of potential annotators.

WordNet semi-supervision. Another logical transformation of this task would be to use WordNet to inform the induction of a network in a semi-supervised fashion. There are many ways to go about this, such as using statistics from WordNet to create a loss function, or using the structure of

WordNet as a base. As this work aimed to create a network solely from the data of Wiktionary, these ideas were not explored. However, using WordNet in this fashion is one of the directions of greatest interest for exploration in the future.

Sense usage examples. The examples present in Wiktionary have only begun to be used in this work. When examples are pulled, the source definition and lemma are linked. However, these examples have the potential to be linked to other senses and lemmas. This would create an immense amount of structured, sense-usage data that could be used for many machine learning tasks.

Multilingual networks. Wiktionary has been explored as a multilingual resource in previous works (de Melo and Weikum, 2009; Gurevych et al., 2012; Meyer and Gurevych, 2012b; Bond and Foster, 2013), largely due to the natural alignment across languages. Extending this approach to a multilingual setting could prove to be extremely useful for machine translation, and could allow low-resource languages to benefit from alignment with other languages that have more annotations.

7 Conclusion

This paper introduced the idea of constructing a wordnet solely using the data from Wiktionary. Wiktionary is a powerful resource, featuring millions of pages that describe lemmas, their senses, example usages, and the relationships between them. Previous work has explored aligning resources like this with other networks like the Princeton WordNet. However, no work has fully explored the idea of building an entire network from the ground up using just Wiktionary.

This work explores simple baselines for constructing a network from Wiktionary through antonym and synonym relationships and compares induced networks with WordNet, finding similar structures and statistics that appear to highlight strong future directions of particular interest, including but not limited to improving network modeling, linking more semantic examples, and reinforcing network construction using expert-annotated networks, like WordNet.

As conducted, this work is an initial step in transforming Wiktionary from an open-source dictionary into a powerful tool, dataset, and framework, with the hope of driving and motivating further work at endeavors studying languages and developing language processing systems.

References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Francis Bond and Ryan Foster. 2013. Linking and extending an open multilingual wordnet. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1352–1362, Sofia, Bulgaria. Association for Computational Linguistics.

Lars Borin, Jens Allwood, and Gerard de Melo. 2014. Bring vs. MTRoget: Evaluating automatic thesaurus translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2115–2121, Reykjavik, Iceland. European Language Resources Association (ELRA).

Nello Cristianini and John Shawe-Taylor. 2000. An Introduction to Support Vector Machines: And Other Kernel-based Learning Methods. Cambridge University Press, New York, NY, USA.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, USA.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20:116–131.

Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 1606–1611, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Anne Göhring. 2014. Building a Spanish-German dictionary for hybrid MT. In Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra), pages 30–35, Gothenburg, Sweden. Association for Computational Linguistics.

Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer, and Christian Wirth. 2012. Uby - a large-scale unified lexical-semantic resource based on LMF. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 580–590, Avignon, France. Association for Computational Linguistics.

Sanda M. Harabagiu, George A. Miller, and Dan I. Moldovan. 1999. WordNet 2 - a morphologically and semantically enhanced resource. In SIGLEX99: Standardizing Lexical Resources.

Mohammad Javad Hosseini, Noah A. Smith, and Su-In Lee. 2016. UW-CSE at SemEval-2016 task 10: Detecting multiword expressions and supersenses using double-chained conditional random fields. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 931–936, San Diego, California. Association for Computational Linguistics.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Machine Learning: ECML-98, pages 137–142, Berlin, Heidelberg. Springer Berlin Heidelberg.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.

David Jurgens and Mohammad Taher Pilehvar. 2015. Reserating the awesometastic: An automatic extension of the WordNet taxonomy for novel terms. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1459–1465, Denver, Colorado. Association for Computational Linguistics.

Robert Krovetz. 1992. Sense-linking in a machine readable dictionary. In 30th Annual Meeting of the Association for Computational Linguistics, pages 330–332, Newark, Delaware, USA. Association for Computational Linguistics.

Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240.

C. Leacock and M. Chodorow. 1998. Combining local context and WordNet similarity for word sense identification. WordNet: An Electronic Lexical Database, pages 265–283.

J. Lilleberg, Y. Zhu, and Y. Zhang. 2015. Support vector machines and word2vec for text classification with semantic features. In 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), pages 136–140.

Michael Matuschek and Iryna Gurevych. 2013. Dijkstra-WSA: A graph-based approach to word sense alignment. Transactions of the Association for Computational Linguistics, 1:151–164.

Michael Matuschek, Christian M. Meyer, and Iryna Gurevych. 2013. Multilingual knowledge in aligned Wiktionary and OmegaWiki for translation applications. Translation: Computation, Corpora, Cognition, 3(1).

Gerard de Melo and Gerhard Weikum. 2009. Towards a universal wordnet by learning from combined evidence. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pages 513–522, New York, NY, USA. ACM.

Christian M. Meyer and Iryna Gurevych. 2010a. How web communities analyze human language: Word senses in Wiktionary. Proceedings of the 2nd Web Science Conference.

Christian M. Meyer and Iryna Gurevych. 2010b. Worth its weight in gold or yet another resource - a comparative study of Wiktionary, OpenThesaurus and GermaNet. In Computational Linguistics and Intelligent Text Processing, pages 38–49, Berlin, Heidelberg. Springer Berlin Heidelberg.

Christian M. Meyer and Iryna Gurevych. 2011. What psycholinguists know about chemistry: Aligning Wiktionary and WordNet for increased domain coverage. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 883–892, Chiang Mai, Thailand. Asian Federation of Natural Language Processing.

Christian M. Meyer and Iryna Gurevych. 2012a. OntoWiktionary - constructing an ontology from the collaborative online dictionary Wiktionary. In M. T. Pazienza and A. Stellato, editors, Semi-Automatic Ontology Development: Processes and Resources, pages 131–161. IGI Global, Hershey, PA.

Christian M. Meyer and Iryna Gurevych. 2012b. To exhibit is not to loiter: A multilingual, sense-disambiguated Wiktionary for measuring verb similarity. In Proceedings of COLING 2012, pages 1763–1780, Mumbai, India. The COLING 2012 Organizing Committee.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.

George A. Miller and Florentina Hristea. 2006. Squibs and discussions: WordNet nouns: Classes and instances. Computational Linguistics, 32(1):1–3.

Tristan Miller and Iryna Gurevych. 2014. WordNet—Wikipedia—Wiktionary: Construction of a three-way alignment. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2094–2100, Reykjavik, Iceland. European Language Resources Association (ELRA).

Grace Muzny and Luke Zettlemoyer. 2013. Automatic idiom identification in Wiktionary. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1417–1421, Seattle, Washington, USA. Association for Computational Linguistics.

Patrick Pantel and Marco Pennacchiotti. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 113–120, Sydney, Australia. Association for Computational Linguistics.

Patrick Pantel and Marco Pennacchiotti. 2008. Automatically harvesting and ontologizing semantic relations. In Ontology Learning and Population: Bridging the Gap between Text and Knowledge, pages 171–198, Amsterdam, Netherlands. IOS Press.

Karl Pearson. 1895. Notes on regression and inheritance in the case of two parents. In Proceedings of the Royal Society of London, volume 58, pages 240–242. The Royal Society of London.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Herbert Rubenstein and John Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8:627–633.

Jon Rusert and Ted Pedersen. 2016. UMNDuluth at SemEval-2016 task 14: WordNet's missing lemmas. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1346–1350, San Diego, California. Association for Computational Linguistics.

Bahar Salehi, Paul Cook, and Timothy Baldwin. 2014. Detecting non-compositional MWE components using Wiktionary. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1792–1797, Doha, Qatar. Association for Computational Linguistics.

Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 298–307, Lisbon, Portugal. Association for Computational Linguistics.

Bernhard Schölkopf, Chris Burges, and Vladimir Vapnik. 1995. Extracting support data for a given task. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, KDD'95, pages 252–257. AAAI Press.

C. Spearman. 2010. The proof and measurement of association between two things. International Journal of Epidemiology, 39(5):1137–1150.

Wikimedia. Wiktionary, the free dictionary.

Lingfei Wu, Ian En-Hsu Yen, Kun Xu, Fangli Xu, Avinash Balakrishnan, Pin-Yu Chen, Pradeep Ravikumar, and Michael J. Witbrock. 2018. Word mover's embedding: From word2vec to document embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4524–4534, Brussels, Belgium. Association for Computational Linguistics.

Z. Wu and M. Palmer. 1994. Verbs semantics and lexical selection. Proceedings of the 32nd Conference of the Association for Computational Linguistics, pages 133–138.

Dongqiang Yang and David Powers. 2006. Verb similarity on the taxonomy of WordNet. Proceedings of the 3rd International WordNet Conference (GWC), pages 121–128.

Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Using Wiktionary for computing semantic relatedness. In Proceedings of AAAI, pages 861–867.