Aligning Wiktionary With Natural Language Processing Resources

Ben Casses
Western Carolina University
Cullowhee, NC 28723
[email protected]

Abstract

Recently, significant progress has been made towards mapping various natural language processing resources together in order to form more robust tools. While most efforts have gone towards connecting existing tools to each other, several recent projects have involved aligning the popular NLP resources to open collaborative projects such as Wikipedia. Such alignments are promising because they link the specific but frequently narrow NLP data to high coverage open resources. This project explores the effectiveness of some variations of the Lesk Algorithm in connecting specific Wiktionary senses to corresponding senses in other NLP resources. The purpose of this project is to present a potential method of semiautonomous alignment for Wiktionary that will serve to augment other NLP resources.

1 Introduction

Existing natural language processing (NLP) resources can be used in each step of the language comprehension task. Some resources also contain mappings to others. Since extensive tasks like language comprehension involve the use of multiple resources, such mappings can be very useful. This project will study some methods that may help to reinforce or partially automate mappings between some of these resources through the use of domain based disambiguation and gloss comparisons as in the Lesk Algorithm (Lesk, 1986). The following introduces some of the popular NLP resources and discusses their relative advantages.

1.1 FrameNet

FrameNet is a collection of semantic frames. It contains detailed breakdowns of the function and contextual meaning of a given verb along with all possible participating semantic participants and examples. FrameNet is organized into a hypernym-hyponym hierarchy, with more specific terms inheriting structure from their hypernyms. FrameNet's exhaustive detail on semantic participants makes it valuable for studying distance relationships between frames. Robbery, for example, inherits directly from Committing crime, indirectly from Misdeed, and uses Theft.[1]

[1] http://framenet.icsi.berkeley.edu/

1.2 OntoNotes

OntoNotes is a collaborative project that aims to produce a "richer model of text meaning" (Hovy et al., 2006). OntoNotes contains 2,445 verb entries organized into different senses with brief definitions and examples. OntoNotes is the most externally connected of the sense based NLP resources; it contains mappings to FrameNet, PropBank, VerbNet[2] and WordNet.

[2] http://verbs.colorado.edu/html_groupings/

1.3 PropBank

PropBank contains information on 5,384 verbs. Each verb entry is divided into different syntactic structures based on common use. Structures are subdivided into components referred to as arguments. PropBank has value as a disambiguating resource: a verb in a given sentence could be matched to one syntactic structure by its surrounding arguments. Once matched, the roles of the arguments are disambiguated according to the roleset. PropBank contains mappings to VerbNet.[3]

[3] http://verbs.colorado.edu/propbank/framesets-english/
1.4 VerbNet

VerbNet is an online database of verbs. Verbs are collected into 274 Levin (1993) style verb classes containing relevant semantic and syntactic information. The value of VerbNet lies in its depth of study into its verb members and their relationships. A given verb may be considered synonymous with its fellow members and hyponymous to its class. This can be useful for the purposes of translation and comprehension if a given verb in VerbNet is not understood but one of its fellow members is. VerbNet contains mappings to FrameNet, OntoNotes, PropBank and WordNet.[4]

[4] http://verbs.colorado.edu/verb-index/index.php

1.5 Wiktionary

The effectiveness of Wiktionary in NLP tasks has already been established by Zesch et al. (2008). At the time of this writing, English Wiktionary had "1,813,199 entries with English definitions from over 350 languages" (wik, 2010). The usefulness of Wiktionary lies in its open, collaborative nature. Because anyone can contribute to Wiktionary, it is far more encompassing than any project developed by an individual or more structured group could be. It is also current: while other dictionaries must be revised and updated occasionally, Wiktionary entries are constantly being appended as the language changes. The author observed the number of entries increasing by 20,000 over a period of three weeks. Wiktionary's advantage can also be a disadvantage, however. Its open nature leads to format inconsistencies between definitions that make automated processing difficult. The potential also exists for incorrect or deliberately misleading entries, such as those known to have occurred in Wikipedia (Snyder, 2007; Chesney, 2006).

1.6 WordNet

WordNet is a long term project hosted by Princeton University. Entries in WordNet are organized into parts of speech and distinguished by sense. WordNet is highly interconnected, with each term referring to related terms including hypernyms, hyponyms, troponyms, frames and peers. Because of its interconnectedness, WordNet is useful for sense disambiguation: an unknown term could be generalized to its hypernym. For example, if the term "shuffle" is not understood, its hypernym, "walk", may be.[5]

[5] http://wordnetweb.princeton.edu/perl/webwn

2 Related Work

Several different mappings between existing NLP resources already exist. Combinations of resources have been created in different ways, with most projects involving multiple alignment techniques to improve accuracy. Alignments have been made:

• Through common fields shared by both tools (Loper et al., 2007; Pazienza et al., 2006; Giuglea and Moschitti, 2004).

• Through mutual restrictions, where a given entry in one database could match a given entry in another database, but restrictions prevent this combination, thus narrowing the possibilities (Shi and Mihalcea, 2005; Loper et al., 2007; Giuglea and Moschitti, 2006).

• Through frequency analysis, such as the greatest number of common synonyms (Shi and Mihalcea, 2005; Giuglea and Moschitti, 2006; Giuglea and Moschitti, 2004).

• Through supervised learning techniques where connections are trained (Loper et al., 2007; Giuglea and Moschitti, 2004).

• And through manual mapping of connections (Pazienza et al., 2006; Shi and Mihalcea, 2005).

In general, these methods can be divided into three categories: manual methods that require human control, statistical methods that involve matching based on comparisons, and learning methods that involve training.

Zesch and Gurevych (2010) compared the effectiveness of several different semantic relatedness analysis methods. They considered four distinct semantic relatedness measures:

• Path based, where the distance between two terms in a graph (edge counting) is considered.

• Information Content based, where the number of documents containing both terms is considered.

• Gloss based, where the quantity of common words in each term's gloss is considered.

• Vector based, where vectors are constructed from multiple documents and the frequency of occurrence of the given term in each document is considered.

Recently, work has begun on mapping and utilizing collaborative resources such as Wiktionary and Wikipedia. The Ubiquitous Knowledge Processing Lab has developed APIs for both Wiktionary[6] and Wikipedia[7] and made use of them as NLP resources (Zesch et al., 2008).

[6] http://www.ukp.tu-darmstadt.de/research/software/jwktl/
[7] http://www.ukp.tu-darmstadt.de/research/software/jwpl/

3 Problem Definition

The task of aligning NLP resources involves several issues. Each of the NLP resources considered in the introduction has different strengths, but some might be more difficult to align to Wiktionary or less effective when combined. There are structural concerns where two resources do not follow the same layout or present the same information. There are also potential problems where two resources do not expose the same granularity. Only two resources containing the same information could have a one-to-one cardinality.

3.1 Structure and Organization

Not all of the NLP resources discussed in the introduction follow the same structure. WordNet, for example, is categorized by frames where each frame contains multiple syntactic structures with interchangeable member verbs. Entries in PropBank, however, are centered around a specific predicate or term and divided into usages.

Wiktionary is organized by term, with each term divided into different senses. It will be more meaningful to map the Wiktionary senses to the senses of another NLP resource that is organized similarly. Of the resources organized in this way that were discussed earlier, OntoNotes offers the most advantages due to its interconnectedness. A given Wiktionary sense mapped to OntoNotes could be followed to each other resource that OntoNotes is already connected to.

3.2 Granularity

Term-to-term matching is generally trivial, involving only a mutual lookup for a given term. Sense-to-sense mapping becomes more difficult for several reasons. A sense-to-sense connection between two different resources indicates that both senses could be considered the "same", but the two senses will generally not contain the same information. For example, one Wiktionary sense of the verb make, "To indicate or suggest to be", was aligned to "cause to become, or to have a certain quality" in OntoNotes even though the two senses contain different information. Additionally, because of the different ways these resources were constructed, they feature a different degree of granularity around their terms. Arrive, for example, has three senses in Wiktionary and one in OntoNotes. Granularity differences indicate that the cardinality of this mapping will be many-to-many.

3.3 Size

Wiktionary is a rapidly growing resource. With the observation that up to 1,000 new definitions may be added to Wiktionary and many existing definitions modified daily, manual mapping is not a feasible method of alignment; results would quickly become incomplete or inaccurate. Since Wiktionary is universally editable, it is not guaranteed that all entries follow the same layout. In some cases formats are inconsistent, and information may be missing, incorrect or presented out of order. Inconsistency makes automated retrieval and matching difficult.
4 Proposed Solution

Due to the differences in content and the lack of existing connections, a gloss based method was selected for aligning Wiktionary to OntoNotes. One advantage of using a gloss method for comparison is that it can act in a naïve fashion. No sense content need be understood or categorized; a sense is just a "bag of words". Even on occasions where sense parsing does not work as anticipated, due to format inconsistency for example, gloss comparisons will not suffer significantly. A missed tag from one resource is highly unlikely to have a correspondence in another resource.

The gloss comparison in this project considers common n-grams between the compared documents or senses as an indication of similarity. Consider the occurrence of the uncommon unigram lathe in table 1.

Wiktionary
turn: To shape (something) symmetrically by rotating it against a stationary cutting tool, as on a lathe.

OntoNotes
Shape by rotating. Examples: After purchasing the wood, I ripped all the pieces to length, then turned the legs on a motorized lathe.

Table 1: aligned Wiktionary and OntoNotes senses of turn

Lathe does not appear in any other senses from either resource; therefore it indicates a relationship between these two senses.
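To make the bag-of-words comparison concrete, the core of such a gloss comparison could be sketched as follows. This is a minimal, illustrative sketch only: the function names and the tokenization are assumptions, not the original implementation, and the value filters of section 5.4 are omitted here.

```python
import re
from collections import Counter

def ngrams(gloss, max_n=3):
    """Break a gloss into its bag of unigrams, bigrams and trigrams."""
    words = re.findall(r"[a-z0-9']+", gloss.lower())
    bag = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            bag[tuple(words[i:i + n])] += 1
    return bag

def overlap(gloss_a, gloss_b):
    """Count the instances of n-grams common to both glosses (unweighted)."""
    common = ngrams(gloss_a) & ngrams(gloss_b)  # Counter '&' keeps minimum counts
    return sum(common.values())

# The uncommon unigram "lathe" links the two senses of "turn" from table 1.
wikt = ("To shape (something) symmetrically by rotating it against a "
        "stationary cutting tool, as on a lathe.")
onto = ("Shape by rotating. After purchasing the wood, I ripped all the "
        "pieces to length, then turned the legs on a motorized lathe.")
print(overlap(wikt, onto))  # shared n-grams include "lathe" and "by rotating"
```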

5 Experiments

Two different experiments were performed; the first involved comparing Wiktionary senses directly to OntoNotes senses. Both experiments attempted to determine the most appropriate sense-to-sense mapping for fifteen verbs, shown in table 2, that were selected from three sets of driving directions from Google Maps[12], Yahoo Maps[13] and MapQuest[14].

arrive   avoid     bear
become   continue  enter
go       head      keep
make     merge     start
take     turn      welcome

Table 2: driving verbs

[12] http://maps.google.com/
[13] http://maps.yahoo.com/
[14] http://www.mapquest.com/

For each comparison between two documents, or one document and a corpus, a similarity value was computed as the sum of the values of each instance of each n-gram in common between the two sources. The comparison with the greatest similarity value was considered to be the correct choice. Ties involving a correct answer were considered incorrect, as they did not effectively disambiguate.

The Wiktionary source text was downloaded from a publicly available data dump[15]. The OntoNotes source text was gathered from the OntoNotes site[16].

[15] http://dumps.wikimedia.org/enwiktionary/latest/
[16] http://verbs.colorado.edu/html_groupings/

5.1 Direct Method

Let $N$ denote the set of senses $\{N_1, N_2, N_3, \ldots, N_j\}$ for a given term in OntoNotes and let $W$ denote the set of senses $\{W_1, W_2, W_3, \ldots, W_k\}$ for the same term in Wiktionary. The matching senses are those with the greatest similarity. To compute similarity, let each sense $S_a$ from $W$ and $N$ contain a set of n-grams $\{S_{a1}, S_{a2}, S_{a3}, \ldots, S_{ap}\}$. The similarity between two glosses is equal to the sum of the values of the intersection of their terms:

$$\mathrm{Sim}_{N_x,W_y} = \sum_{r \in I} V(r) \quad \text{where } I = N_x \cap W_y$$

and $V$ is a value filter discussed in section 5.4. This process derived a closest match $x \in N$ for each sense $y \in W$.

5.2 Transitive Method

The second method, the transitive comparison, involved first comparing glosses for a given term from Wiktionary, $W$, and OntoNotes, $N$, to a set of n-grams in a domain specific corpus $C$. As with the previous method, similarity scores were calculated based on the values of the intersection of terms, but in this case a best similarity score was derived separately for OntoNotes and Wiktionary:

$$\mathrm{Sim}_{N_x} = \sum_{r \in I} V(r) \quad \text{where } I = N_x \cap C$$

$$\mathrm{Sim}_{W_y} = \sum_{r \in I} V(r) \quad \text{where } I = W_y \cap C$$

The senses $N_x$ and $W_y$ with the greatest similarity scores to $C$ were considered the correct sense for the domain and matched to each other. This process derived a single closest disambiguated sense pair $(y \in W, x \in N)$ for each term.
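Both procedures can be sketched directly from these definitions. Again this is a hypothetical reconstruction, not the original code: sense inventories are assumed to be dicts from a sense label to its n-gram bag (as produced by the `ngrams` sketch in section 4), and `V` stands in for one of the value filters of section 5.4.

```python
def similarity(bag_a, bag_b, V):
    """Sim = sum of V(r) over each instance of each n-gram r in I = A ∩ B."""
    common = bag_a & bag_b  # Counter intersection keeps minimum counts
    return sum(V(r) * count for r, count in common.items())

def direct_match(wikt_senses, onto_senses, V):
    """Section 5.1: for each Wiktionary sense y, pick the OntoNotes sense x
    with the greatest similarity Sim_{N_x, W_y}."""
    matches = {}
    for y, w_bag in wikt_senses.items():
        matches[y] = max(onto_senses,
                         key=lambda x: similarity(onto_senses[x], w_bag, V))
    return matches

def transitive_match(wikt_senses, onto_senses, corpus_bag, V):
    """Section 5.2: pick the single pair (y, x) whose senses score highest
    against the domain corpus C, independently for each resource."""
    y = max(wikt_senses, key=lambda s: similarity(wikt_senses[s], corpus_bag, V))
    x = max(onto_senses, key=lambda s: similarity(onto_senses[s], corpus_bag, V))
    return y, x
```

Note that the experiments treat ties involving a correct answer as failures; a fuller version would therefore detect ties explicitly rather than letting `max` break them arbitrarily.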

5.3 Testing

Volunteers were initially asked to select only the most appropriate sense for the driving domain from Wiktionary and from OntoNotes for each of the fifteen verbs in table 2. Across the thirty selections, the volunteers were found to be in agreement on a single sense only 58% of the time. Because this amount of disagreement indicates either a many-to-one correspondence or some incorrect selections by the volunteers, it was decided that further disambiguation was necessary for testing. The volunteers were combined into a single group and asked to decide on a most appropriate sense, or more than one in the case of a deadlock. The volunteers were then asked to select the most appropriate OntoNotes sense for 63 additional senses in Wiktionary from the fifteen verbs.

After the committee decisions, there were only two situations that involved one Wiktionary sense mapping to multiple OntoNotes senses. For analysis in these cases, each of the selected OntoNotes senses was considered to be equally correct; that is, if the sense programmatically determined to be the most appropriate matched any of the correct senses, it was considered a success.

There was one trivial situation where a term had only one sense: the verb arrive had only one OntoNotes sense, making determination trivial and potentially skewing results. Therefore there were 72 non-trivial determinations to be made in the direct gloss comparisons and 30 determinations to be made in the transitive comparisons.

5.4 Filters

To avoid false matches from insignificant, common words such as the, of and is, a word frequency list was formed from the Wiktionary data dump. First, all tags were removed, leaving only text. Next, a list of the frequency of appearance of each individual word was created. This list was used to construct two filters. The first filter, gentle, assigned a value

$$v = -\log_2 \frac{f}{F+1}$$

to a given term found $f$ times in Wiktionary, where the greatest frequency for any word was $F$. The second filter, severe, was created to determine if more restrictive scoring achieved any better results. Values for the second filter were set to

$$v = -\log_2 \frac{f \cdot 20}{F+1} - 10$$

and assigned to $-10$ if zero or less.

Two different metrics were explored for evaluating n-grams. The first was the sum of the values of the terms, $V = v_1 + v_2 + \ldots + v_n$; the second was a geometric mean, $V = \sqrt[n]{v_1 \cdot v_2 \cdots v_n}$. In both cases, the filters placed greater emphasis on n-grams containing rare words. This allowed the cat is, for example, to be slightly more significant than cat alone but not as significant as sneaky orange cat. Only unigrams, bigrams and trigrams were counted.
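The two filters and the two n-gram metrics could be sketched as follows. This is a best-effort reconstruction: the severe filter's formula is partially garbled in the source, and the epsilon guard in the geometric mean is an added assumption, since negative severe values would otherwise make the mean undefined.

```python
import math

def gentle(f, F):
    """Gentle filter: v = -log2(f / (F + 1)) for a word seen f times,
    where F is the greatest frequency of any word."""
    return -math.log2(f / (F + 1))

def severe(f, F):
    """Severe filter, as reconstructed from the damaged source:
    v = -log2(20 * f / (F + 1)) - 10, assigned to -10 if zero or less."""
    v = -math.log2(20 * f / (F + 1)) - 10
    return v if v > 0 else -10.0

def sum_metric(values):
    """V as the sum v1 + v2 + ... + vn over the words of an n-gram."""
    return sum(values)

def geom_mean_metric(values):
    """V as the geometric mean (v1 * v2 * ... * vn) ** (1/n).
    Non-positive values are clamped to a small epsilon as a guard."""
    clamped = [max(v, 1e-6) for v in values]
    return math.prod(clamped) ** (1.0 / len(clamped))
```

With these pieces, the value filter passed to the matching sketches above could be, for instance, `V = lambda ngram: sum_metric(gentle(freq[w], F) for w in ngram)` for some hypothetical word frequency table `freq`.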
5.5 Direct Comparison

The direct comparisons involved evaluating a given sense in Wiktionary against each sense for the same term in OntoNotes. Within Wiktionary, for example, there are 13 senses for the verb make. Each of these senses was compared to the 17 senses for make in OntoNotes. For each Wiktionary sense, the sense pair with the greatest similarity value was considered to be the correct mapping, so for make there were 13 possible correct mappings. These results were compared to the selections made by the volunteers.
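Under the same assumptions as the earlier sketches, the evaluation loop for the direct comparison might look like the following, reusing `direct_match` from the section 5.1 sketch; the sense data and the gold standard are placeholders.

```python
def direct_accuracy(wikt_senses, onto_senses, gold, V):
    """Fraction of Wiktionary senses whose best-scoring OntoNotes sense
    agrees with the volunteers' choices. `gold` maps each Wiktionary
    sense to the set of acceptable OntoNotes senses, so that a match to
    any committee-approved sense counts as a success (section 5.3)."""
    matches = direct_match(wikt_senses, onto_senses, V)
    correct = sum(1 for y, x in matches.items() if x in gold.get(y, set()))
    return correct / len(matches)
```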

5.6 Transitive Comparison

The transitive comparisons involved evaluating each sense from a given resource against a domain specific corpus for disambiguation assistance. Only the domain relevant sense decided by the volunteers was considered to be the correct answer, so there was only one correct sense for each verb in each resource.

Two corpora were created for testing transitive comparisons. The first was formed from the sources for the 15 verbs: driving directions from Google, MapQuest, and Yahoo Maps. The second corpus was created from Google searches for "driving +turn +go +keep +bear".

6 Results

For each method, the results were compared to the volunteer selections. If the sense pair with the greatest similarity score matched the sense pair selected by the volunteers, the matching was considered correct.

6.1 Direct Comparison Results

For the direct comparisons, there were 72 non-trivial matches. Four sets of experiments were performed: Sum n-gram and Geometric Mean n-gram evaluations were performed with the gentle and severe filters. Although they resulted in slightly different sets of correct answers, neither evaluation nor filter method did significantly better. Results are shown in table 3 as accuracy percentages out of 72 possible matches.

For a baseline comparison, a naïve selection was developed that involved matching all Wiktionary senses to the first OntoNotes sense for a given term. The naïve selection achieved 35% accuracy.

                 Sum Eval   Geom Mean
gentle filter       46%         46%
severe filter       47%         46%

Table 3: Direct Comparisons

6.2 Transitive Comparison Results

For the transitive comparison, there were 29 non-trivial matches. Eight comparisons were performed: Sum n-gram and Geometric Mean n-gram evaluations were performed with the gentle and severe filters against both corpora. In the transitive case, the severe filter performed slightly better than the gentle filter. Results are shown in table 4 as accuracy percentages out of 29 possible matches.

A naïve method, selecting the first sense, was used as a baseline comparison for the transitive method. The naïve method achieved 48%.

original corpus   Sum Eval   Geom Mean
gentle filter        48%         48%
severe filter        48%         52%

Google search     Sum Eval   Geom Mean
gentle filter        48%         48%
severe filter        62%         59%

Table 4: Transitive Comparisons

Discarding all but the most effective methods, these results, 47% and 62%, can be compared to the 50%-70% of Lesk (1986), the 25% for verbs of Banerjee and Pedersen (2002), and the 74%-78% of Kilgarriff and Rosenzweig (2000).

7 Problems

Although each method performed as well as or better than naïve selection, 72 potential matches among 15 verbs may not be enough to make any determination about the validity of the hypothesis. Further tests are needed to reinforce the effectiveness of these methods.

The value of the most likely sense, whether correct or not, often stood out strongly from the other senses, sometimes differing in value by an order of magnitude. This could indicate that the analysis method may be "fooled" by false matches. If these false matches could be isolated and disregarded somehow, results might improve.

Some OntoNotes senses were troublesome for gloss comparisons. The fifteenth sense of go, for example, "miscellaneous idioms...", is a catch-all sense containing more text, and therefore more inadvertent matches, than other senses. The seventeenth OntoNotes sense of make, "other verb particle constructions", is informative to a human reader but contains little information for gloss comparisons.

The corpus, in the case of the transitive mapping, also caused some false matches. For example, the phrase "crossing into NEBRASKA" in the corpus and the example "...water turned into ice..." in OntoNotes caused turn to match the become sense in one experiment. Issues like these might be resolved by scaling the values of n-grams based on their distance in the document from the term being considered; a rough sketch of this idea appears at the end of the Future Work section.

It is likely that a single document may use the same word in two different senses. While this was not the case with the transitive comparison corpora used here, it could cause confusion in future experiments.

8 Conclusion

These results are far better than random, 28% for transitive and 46% for direct comparison. The most accurate mapping method found was 62%, for the severe filter sum evaluation on the Google search corpus. While both the direct and transitive methods achieved results slightly better than their corresponding naïve methods, 14% and 17% better respectively, the results are not strong enough to regard these experiments as effective stand-alone semiautonomous alignment methods.

It appears that neither method of evaluating n-grams performed significantly better at disambiguating. It is possible that false matches, n-grams that correspond to the incorrect sense, were more significant to the outcome than the evaluation methods. This could explain why changing the filter caused more of a difference than changing the evaluation method.

9 Future Work

A larger selection of verbs to disambiguate should help to better establish the accuracy of this method. Additionally, a different domain for transitive comparisons would further prove the possibilities of this method.

The corpus used for comparison could be refined to a domain keyword list, which would eliminate some extraneous terms.

The severe filter, which performed better than the gentle filter, was created after the gentle filter in response to the determination that insignificant words still had too much influence. Perhaps a stronger filter could produce better results.

There are several possibilities for augmenting this gloss comparison method. Second order comparisons, as used by Banerjee and Pedersen (2002), might be beneficial. Additionally, it has been suggested that syntactic limitations from a given sense could be considered in disambiguation.

Verbs containing troublesome senses, as mentioned in the Problems section, cannot be aligned through gloss comparisons. In future experiments, such verbs should be discarded from the test set.
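As a rough illustration of the distance-based scaling suggested in the Problems section, a hypothetical weighting function could discount an n-gram's value as it appears farther from the term under consideration; the exponential decay and its half_life parameter are assumptions, not part of the experiments above.

```python
def distance_scaled(value, ngram_pos, term_pos, half_life=10.0):
    """Discount an n-gram's filtered value by its distance (in words)
    from the term being disambiguated: an n-gram half_life words away
    contributes half as much as one adjacent to the term."""
    distance = abs(ngram_pos - term_pos)
    return value * 0.5 ** (distance / half_life)

# e.g. "crossing into NEBRASKA" occurring far from any use of "turn"
# would contribute little to turn's score:
print(distance_scaled(8.0, ngram_pos=50, term_pos=10))  # 8.0 * 0.5**4 = 0.5
```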
References

Satanjeev Banerjee and Ted Pedersen. 2002. An adapted Lesk algorithm for word sense disambiguation using WordNet, pages 117–171.

Thomas Chesney. 2006. An empirical examination of Wikipedia's credibility.

Ana-Maria Giuglea and Alessandro Moschitti. 2004. Knowledge discovering using FrameNet, VerbNet and PropBank.

Ana-Maria Giuglea and Alessandro Moschitti. 2006. Semantic role labeling via FrameNet, VerbNet and PropBank. In Proceedings of COLING-ACL.

Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 57–60, Morristown, NJ, USA. Association for Computational Linguistics.

Adam Kilgarriff and Joseph Rosenzweig. 2000. Framework and results for English SENSEVAL.

Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In SIGDOC '86: Proceedings of the 5th annual international conference on Systems documentation, pages 24–26, New York, NY, USA. ACM.

Beth Levin. 1993. English Verb Classes and Alternations: a preliminary investigation. University of Chicago Press, Chicago and London.

Edward Loper, Szu-ting Yi, and Martha Palmer. 2007. Combining lexical resources: Mapping between PropBank and VerbNet. In Proceedings of the 7th International Workshop on Computational Semantics.

Maria Teresa Pazienza, Marco Pennacchiotti, and Fabio Massimo Zanzotto. 2006. Mixing WordNet, VerbNet and PropBank for studying verb relations.

Lei Shi and Rada Mihalcea. 2005. Putting pieces together: Combining FrameNet, VerbNet and WordNet for robust semantic parsing. In Alexander F. Gelbukh, editor, CICLing, volume 3406 of Lecture Notes in Computer Science, pages 100–111. Springer.

Johnny Snyder. 2007. It's a wiki-world: utilizing Wikipedia as an academic reference.

Wiktionary. 2010. Wiktionary English version main page.

Torsten Zesch and Iryna Gurevych. 2010. Wisdom of crowds versus wisdom of linguists: measuring the semantic relatedness of words. Natural Language Engineering, 16(1):25–59.

Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Using Wiktionary for computing semantic relatedness. In Proceedings of AAAI.