Improving Word Sense Disambiguation with Linguistic Knowledge from a Sense Annotated Treebank

Kiril Simov, Alexander Popov, Petya Osenova
Linguistic Modeling Department, IICT-BAS
{kivs|alex.popov|petya}@bultreebank.org

Proceedings of Recent Advances in Natural Language Processing, pages 596–603, Hissar, Bulgaria, Sep 7–9 2015.

Abstract

In this paper we present an approach for the enrichment of WSD knowledge bases with data-driven relations from a gold standard corpus (annotated with word senses, valency information, syntactic analyses, etc.). We focus on Bulgarian as a use case, but our approach is scalable to other languages as well. For the purpose of exploring such methods, the Personalized PageRank algorithm was used. The reported results show that the addition of new knowledge improves the accuracy of WSD by approximately 10.5%.

1 Introduction

Solutions to WSD-related tasks usually employ lexical databases, such as wordnets and ontologies. However, lexical databases suffer from sparseness in the availability and density of relations. One approach towards remedying this problem is BabelNet (Navigli and Ponzetto, 2012), which relates several lexical resources: WordNet¹, DBpedia, Wiktionary, etc. Although such a setting takes into consideration the role of lexical and world knowledge, it does not incorporate contextual knowledge learned from actual texts (such as collocational patterns, for example). This happens because the knowledge sources for WSD systems usually capture only a fraction of the relations between entities in the world. Many important relations are not present in ontological resources but could be learned from texts.

¹ In this work we used version 3.0 of Princeton WordNet: https://wordnet.princeton.edu/.

One possible approach to handling this sparseness issue is the incorporation of relations from sense annotated corpora into the lexical databases. We decided to focus on this line of research, by using the Bulgarian sense annotated treebank (Sensed BulTreeBank) in order to extract semantic relations and add them into the lexical resources. The hypothesis that this enrichment would lead to better WSD for Bulgarian was tested in the context of the Personalized PageRank algorithm.

The structure of the paper is as follows: the next section discusses the related work on the topic. Section 3 presents the Bulgarian sense annotated treebank. Section 4 focuses on the Bulgarian Syntactic and Lexical Resources. Section 5 introduces the WSD experiments and results. Section 6 concludes the paper.

2 Related Work

Knowledge-based systems for WSD have proven to be a good alternative to supervised systems, which require large amounts of manually annotated training data. In contrast, knowledge-based systems require only a knowledge base and no additional corpus-dependent information. An especially popular knowledge-based disambiguation approach has been the use of graph-based algorithms known under the name of "Random Walk on Graph" (Agirre et al., 2014). Most approaches exploit variants of the PageRank algorithm (Brin and Page, 2012). Agirre and Soroa (2009) apply a variant of the algorithm to Word Sense Disambiguation by translating WordNet into a graph in which the synsets are represented as vertices and the relations between them are represented as edges between the nodes. The resulting graph is called a knowledge graph in this paper. Calculating the PageRank vector Pr is accomplished through solving the equation:

$$Pr = cM\,Pr + (1 - c)\,v \tag{1}$$

where M is an N × N transition probability matrix (N being the number of vertices in the graph), c is the damping factor and v is an N × 1 vector. In the traditional, static version of PageRank the values of v are all equal (1/N), which means that in the case of a random jump each vertex is equally likely to be selected. Modifying the values of v effectively changes these probabilities and thus makes certain nodes more important. The version of PageRank for which the values in v are not uniform is called Personalized PageRank.

The words in the text that are to be disambiguated are inserted as nodes in the knowledge graph and are connected to their potential senses via directed edges (by default, a context window of at least 20 words is used for each disambiguation). These newly introduced nodes serve to inject initial probability mass (via the v vector) and thus to make their associated sense nodes especially relevant in the knowledge graph. Applying the Personalized PageRank algorithm iteratively over the resulting graph determines the most appropriate sense for each ambiguous word. The method has been boosted by the addition of new relations and by developing variations and optimizations of the algorithm (Agirre and Soroa, 2009). It has also been applied to the task of Named Entity Disambiguation (Agirre et al., 2015).

Montoyo et al. (2005) present a combination of knowledge-based and supervised systems for WSD, which demonstrates that the two approaches can boost one another, due to the fundamentally different types of knowledge they utilise (paradigmatic vs. syntagmatic). They explore a knowledge-based system that uses heuristics for WSD depending on the position of a word's potential senses in the WordNet knowledge base. On the supervised side, they explore a Maximum Entropy model, trained on an annotated corpus, that takes into account multiple features from the context of the to-be-disambiguated word. This earlier line of research demonstrates that combining paradigmatic and syntagmatic information is a fruitful strategy, but it performs the combination in a postprocessing step, i.e. by merging the output of two separate systems; also, it still relies on manually annotated data for the supervised disambiguation. Building on the already mentioned work on graph-based approaches, it is possible to combine paradigmatic and syntagmatic information in another way: by incorporating both into the knowledge graph. This approach is described in the current paper.

The success of knowledge-based WSD approaches apparently depends on the quality of the knowledge graph, that is, on whether the knowledge represented in terms of nodes and relations (arcs) between them is sufficient for the algorithm to pick the correct senses of ambiguous words. Several extensions of the knowledge graph constructed on the basis of WordNet have been proposed and implemented. An approach similar to the one presented here is described in Agirre and Martinez (2002), which explores the extraction of syntactically supported semantic relations from manually annotated corpora. In that piece of research, SemCor, a semantically annotated corpus, was processed with the MiniPar dependency parser and the subject-verb and object-verb relations were subsequently extracted. The new relations were represented on several levels: as word-to-class and class-to-class relations. The extracted selectional relations were then added to WordNet and used in the WSD task. The chief difference with the presently described approach is that the set of relations used here is bigger (it also includes indirect-object-to-verb relations, noun-to-modifier relations, etc.). Another difference is that the new relations in the present piece of research are not added as selectional relations, but as semantic relations between the corresponding synsets. This means that the specific syntactic role of the participant is not taken into account; only the connectedness between the participant and the event is registered in the knowledge graph.

3 The Bulgarian Sense Annotated Treebank

The sense annotation process over BulTreeBank (BTB) was organized in three layers: verb valency frames (Osenova et al., 2012); senses of verbs, nouns, adjectives and adverbs; DBpedia URIs over named entities. However, in the experiment presented here, we used mainly the annotated senses of nouns and verbs (together with the valency information), as well as the concept mappings to WordNet. For that reason we do not discuss the DBpedia annotation here. A brief outline can be found in Popov et al. (2014).

The sense annotation was organized as follows: the lemmatized words per part-of-speech (POS) from BTB received all their possible senses from the explanatory dictionary of Bulgarian and from our Core WordNet². When two competing definitions came from both resources, preference was given to the one that was mapped to WordNet. In the ambiguous cases the correct sense was selected according to the context of usage. For the purposes of the evaluation, some of the files were independently manually checked by two individual annotators. In total, 92,000 running words have been mapped to word senses. Thus, about 43% of all the treebank tokens have been associated with senses.

² Available at http://compling.hss.ntu.edu.sg/omw/

The word forms annotated with senses mapped to WordNet synsets number 69,333, consisting of nouns and verbs. Of these, 12,792 word forms have been used for testing, and the rest have been used for relation extraction. About 20,000 word forms are now in the process of being mapped to WordNet synsets. Most of them are adjectives and adverbs.

[...] open for developing a language-specific hierarchy.

The valency lexicon consists of around 18,000 verb frames extracted from the BTB. The participants in these frames have ontological constraints. At the moment, the verb senses are mapped to WordNet, but the constraints over arguments are not synchronized with the WordNet concepts in their levels of granularity and specificity. This synchronization is planned as a next step in our work, in order to further enrich the knowledge graph.

5 Experiments

5.1 Description of the WSD tool

The experiments that serve to illustrate the out-
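
To make the Personalized PageRank computation of Equation (1) in Section 2 more concrete, the following is a minimal sketch and not the tool used in the experiments. It iterates Pr = cM Pr + (1 − c)v on a toy graph in which context words are linked to their candidate senses by directed edges and carry all of the mass in v. The function name `personalized_pagerank`, the node labels, the toy relations, the damping value 0.85 and the iteration count are all illustrative assumptions; knowledge-graph relations are modelled here as pairs of directed edges.

```python
def personalized_pagerank(edges, personalization, damping=0.85, iterations=100):
    """Iterate Pr = c*M*Pr + (1 - c)*v over directed (source, target) edges.

    Hypothetical helper for illustration; `damping` plays the role of c.
    """
    nodes = sorted({n for edge in edges for n in edge} | set(personalization))
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # Normalise v so it sums to 1; nodes absent from it receive zero mass.
    total = sum(personalization.values())
    v = {n: personalization.get(n, 0.0) / total for n in nodes}

    rank = dict(v)  # start from the personalization vector
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * v[n] for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if targets:
                # Spread the damped mass evenly over outgoing edges.
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: hand its damped mass back according to v.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * v[m]
        rank = new_rank
    return rank


# Toy knowledge graph (assumed relations, not taken from WordNet).
kg_relations = [("bank#shore", "river#stream"), ("bank#finance", "money#currency")]
edges = []
for a, b in kg_relations:
    edges += [(a, b), (b, a)]  # each relation as two directed edges

# Context words are added as nodes pointing at their candidate senses.
edges += [("w:bank", "bank#finance"), ("w:bank", "bank#shore"),
          ("w:river", "river#stream")]

# Only the context-word nodes receive initial probability mass in v.
ranks = personalized_pagerank(edges, {"w:bank": 1.0, "w:river": 1.0})

# Pick the highest-ranked candidate sense for the ambiguous word "bank".
best = max(["bank#finance", "bank#shore"], key=ranks.get)
print(best)
```

In this toy setting the mass injected through the context word "river" flows into the neighbourhood of the shore sense, so that sense outranks the financial one. Adding corpus-derived relations to `kg_relations` is the kind of enrichment the paper argues for.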
