Automatic Enrichment of WordNet with Common-Sense Knowledge

Luigi Di Caro, Guido Boella
University of Turin, Corso Svizzera 185, Turin, Italy
[email protected], [email protected]

Abstract
WordNet represents a cornerstone of the Computational Linguistics field, linking words to meanings (or senses) through a taxonomical representation of synsets, i.e., clusters of words with an equivalent meaning in a specific context, often described by a few definitions (or glosses) and examples. Most approaches to the Word Sense Disambiguation task fully rely on these short texts as a source of contextual information to match with the input text to disambiguate. This paper presents the first attempt to enrich synset data with common-sense definitions, automatically retrieved from ConceptNet 5 and disambiguated according to WordNet. The aim is to exploit the shared and immediate-thinking nature of common-sense knowledge to extend the short but incredibly useful contextual information of the synsets. A manual evaluation on a subset of the entire result (which counts a total of almost 600K synset enrichments) shows a very high precision with an estimated good recall.

Keywords: Semantic Resources, Semantic Enrichment, Common-Sense Knowledge

1. Introduction

In the last 20 years, the Artificial Intelligence (AI) community working on Computational Linguistics (CL) has been using one knowledge base above all, i.e., WordNet (Miller, 1995). In a few words, WordNet was a first answer to the most important question in this area, which is the treatment of language ambiguity.

Generally speaking, a word is a symbolic expression that may refer to multiple meanings (polysemy), while distinct words may share the same meaning. Syntax reflects grammar rules which add complexity to the overall communication medium, making CL one of the most challenging research areas in the AI field.

From a more detailed perspective, WordNet organizes words in synsets, i.e., sets of words sharing a unique meaning in specific contexts (synonyms), further described by definitions (glosses) and examples. Synsets are then structured in a taxonomy which incorporates the semantics of generality/specificity of the referenced concepts.

Although extensively adopted, the limits of this resource are sometimes critical: 1) the top-down and general-purpose nature at the basis of its construction raises questions about the actual need of some underused meanings, and 2) most Word Sense Disambiguation approaches use WordNet glosses to understand the link between an input word (and its context) and the candidate synsets.

In recent years, natural language understanding has traced the line of novel and challenging research directions which have been unveiled under the names of textual entailment and question answering. The former is a form of inference on a lexical basis (not to be intended as in Formal Semantics), while the latter addresses the last-mile goal of every computational linguist's dream: asking questions to a computer in natural language, as with humans.

As a matter of fact, these tasks require an incredibly rich semantic knowledge containing facts related to behavioural rather than conceptual information, such as what an object may or may not do, or what may happen with it after such actions.

In the light of this, an interesting source of additional gloss-like information is represented by common-sense knowledge (CSK), which may be described as a set of shared and possibly general facts or views of the world. Being crowd-sourced, CSK represents a promising (although often complex and incoherent) type of information which can serve complex tasks such as the ones mentioned above. ConceptNet is one of the largest sources of CSK, collecting and integrating data over many years since the beginning of the MIT Open Mind Common Sense project. However, terms in ConceptNet are not disambiguated, so it is difficult to use due to its large amount of lexical ambiguities.

To draw a parallel, this double-edged situation is similar to a well-known research question in the Knowledge Representation and Information Retrieval fields, regarding the dichotomy between taxonomy and folksonomy. The former is a top-down and often human-generated representation of a domain, whereas the latter comes from free tags associated to objects in different contexts. A large effort in the integration of these two types of knowledge has been carried out, e.g., in (Collins and Murphy, 2013), (Di Caro et al., 2011), (Kiu and Tsui, 2011).

This paper presents a novel method for the automatic enrichment of WordNet with disambiguated semantics of ConceptNet 5. In particular, the proposed technique is able to disambiguate common-sense instances by linking them to WordNet synsets. A manual validation of 505 random enrichments showed a promising 88.19% accuracy, demonstrating the validity of the approach.

2. Related Work

The idea of extending WordNet with further semantic information is not new. Plenty of methods have been proposed, based on corpus-based enrichments rather than on alignments with other resources.

In this section we briefly mention some of the most relevant but distinct works related to this task: (Agirre et al., 2000) use WWW contents to enrich synsets with topics (i.e., sets of correlated terms); (Ruiz-Casado et al., 2007) and (Navigli and Ponzetto, 2010) enrich WordNet with Wikipedia and other resources; (Montoyo et al., 2001) use a Machine

Learning classifier trained on general categories; (Navigli et al., 2004) use OntoLearn, a Word Sense Disambiguation (WSD) tool that discovers patterns between words in WordNet glosses to extract relations; (Niles and Pease, 2003) integrate WordNet with the SUMO ontology (Pease et al., 2002); (Bentivogli and Pianta, 2003) extend WordNet with phrases; (Laparra et al., 2010) align synsets with FrameNet semantic information; (Hsu et al., 2008) combine WordNet and ConceptNet knowledge for expanding web queries; (Chen and Liu, 2011) align ConceptNet entries to synsets for WSD; (Bentivogli et al., 2004) integrate WordNet with domain-specific knowledge; (Vannella et al., 2014) extend WordNet via games with a purpose; (Niemi et al., 2012) propose a bilingual resource to add synonyms.

3. Common-sense Knowledge

The Open Mind Common Sense project (http://commons.media.mit.edu/) developed by MIT collected unstructured common-sense knowledge by asking people to contribute over the Web. In this paper, we started focusing on ConceptNet (Speer and Havasi, 2012), a semantic graph that has been directly created from it. In contrast with the mentioned linguistic resources, ConceptNet contains common-sense facts which express conceptual properties but also behavioural information, so it perfectly fits the proposed model. Examples of properties are partOf, madeOf, hasA, definedAs, hasProperty, while examples of functionalities are capableOf, usedFor, hasPrerequisite, motivatedByGoal, desires, causes.

Among the more unusual types of relationships (24 in total), it contains information like "ObstructedBy" (i.e., referring to what would prevent something from happening) and "CausesDesire" (i.e., what does it make you want to do). In addition, it also has classic relationships like "is a" and "part of" as in most linguistic resources (see Table 1 for examples of property-based and function-based semantic relations in ConceptNet). For this reason, ConceptNet has a wider spectrum of semantic relationships but a much sparser coverage. However, it contains a large set of function-based information (e.g., all the actions a concept can be associated with), so it represents a good basis for a complementary enrichment of WordNet.

Table 1: Some of the existing properties (P) and functionalities (F) in ConceptNet, with example sentences in English.

Relation          Example sentence                       Type
MadeOf            NP is made of NP.                      P
DefinedAs         NP is defined as NP.                   P
HasA              NP has NP.                             P
HasProperty       NP is AP.                              P
UsedFor           NP is used for VP.                     F
CapableOf         NP can VP.                             F
HasPrerequisite   NP/VP requires NP/VP.                  F
MotivatedByGoal   You would VP because you want VP.      F
...               ...                                    ...

4. The Proposed Approach

This paper proposes a novel approach for the alignment of linguistic and common-sense semantics based on the exploitation of their intrinsic characteristics: while the former represents a reliable (but strict in terms of semantic scope) knowledge, the latter contains an incredibly wide but ambiguous set of semantic information. In the light of this, we assigned the role of hinge to WordNet, which guides a trusted, multiple and simultaneous retrieval of data from ConceptNet; the retrieved data are then intersected with themselves through a set of heuristics to produce automatically-disambiguated knowledge. ConceptNet (Speer and Havasi, 2012) is a semantic graph that has been directly created from the Open Mind Common Sense project developed by MIT, which collected unstructured common-sense knowledge by asking people to contribute over the Web.

4.1. Common-sense Data: ConceptNet 5

The Open Mind Common Sense project developed by MIT collected unstructured common-sense knowledge by asking people to contribute over the Web. In this paper, we make use of ConceptNet (Speer and Havasi, 2012), a semantic graph that has been directly created from it. In contrast with linguistic resources such as the above-mentioned WordNet, ConceptNet contains semantics which is more related to common-sense facts.

4.2. Basic Idea

The idea of the proposed enrichment approach relies on a fundamental principle, which makes it novel and more robust w.r.t. the state of the art. Indeed, our extension is not based on a similarity computation between words for the estimation of correct alignments. On the contrary, it aims at enriching WordNet with semantics containing direct relation and word overlaps, preventing associations of semantic knowledge on the sole basis of similarity scores (which may also depend on algorithms, similarity measures, and training corpora). This point makes this proposal completely different from what was proposed by (Chen and Liu, 2011), where the authors created word sense profiles to compare with ConceptNet terms using semantic similarity metrics.

4.3. Definitions

Let us consider a WordNet synset Si = <Ti, gi, Ei>, where Ti is the set of synonym terms t1, t2, ..., tk while gi and Ei represent its gloss and the available examples respectively. Each synset represents a meaning ascribed to the terms in Ti in a specific context (described by gi and Ei). Then, for each synset Si we can consider a set of semantic properties Pwordnet(Si) coming from the structure around Si in WordNet. For example, hypernym(Si) represents the direct hypernym synset, while meronyms(Si) is the set of synsets which compose (as a made-of relation) the concept represented by Si. The above-mentioned complete set of semantic properties Pwordnet(Si) of a synset Si contains a set of pairs <rel-word>, where rel is the relation of Si with the other synsets (e.g., is-a) and word is one of the lemmas of such linked synsets. For example, given the synset Scat: cat, true cat (feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats), one resulting <rel-word> pair that comes from hypernym(Scat) will be <isA-feline>, since feline is one lemma of the hypernym synset Sfeline,felid: feline, felid (any of various lithe-bodied roundheaded fissiped mammals, many with retractile claws). Note that in case of multiple synonym words in the related synsets, there will be multiple <rel-word> pairs. Then, ConceptNet can be seen as a large set of semantic triples in the form NPi - relk - NPj, where NPi and NPj are simply non-disambiguated noun phrases whereas relk is one of the semantic relationships in ConceptNet.

4.4. Algorithm and heuristics

At this point, the problem is the alignment of ConceptNet triples with WordNet synsets.
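Before turning to the alignment cycle, the definitions of Section 4.3. can be made concrete with a short sketch; note that the type names, the wn_pairs helper, and the toy data below are our own illustrative assumptions, not the authors' implementation:

```python
from collections import namedtuple

# Illustrative encodings of the structures in Section 4.3.
# (names are ours, not the paper's code).

# A synset S_i = <T_i, g_i, E_i>: synonym terms, gloss, and examples.
Synset = namedtuple("Synset", ["terms", "gloss", "examples"])

# A ConceptNet instance: a non-disambiguated triple NP_i - rel_k - NP_j.
Triple = namedtuple("Triple", ["np1", "rel", "np2"])

def wn_pairs(related_synsets):
    """P_wordnet(S_i): the <rel-word> pairs derived from the synsets
    linked to S_i, one pair per lemma of each linked synset."""
    return {(rel, lemma)
            for rel, linked in related_synsets
            for lemma in linked.terms}

# The paper's running example: the hypernym of S_cat is S_feline,felid.
s_feline = Synset(("feline", "felid"),
                  "any of various lithe-bodied roundheaded fissiped "
                  "mammals, many with retractile claws", ())

pairs = wn_pairs([("isA", s_feline)])
# One pair per lemma of the hypernym synset, including <isA-feline>:
assert pairs == {("isA", "feline"), ("isA", "felid")}
```

A bare triple such as `Triple("burn", "relatedto", "melt")` carries no sense information by itself; deciding whether it may be attached to a given synset is the job of the heuristics described next.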

For this reason, the algorithm consists of a general cycle over all synsets in WordNet. Then, for each synset Si we compute the set of all candidate semantic ConceptNet triples Pconceptnet(Si) as the union of the triples that contain at least one of the terms in Ti. An inner cycle iterates over the candidate triples to identify those that can enrich the synset under consideration. We used a number of heuristics to align each ConceptNet triple ck (of the form NP - rel - NP) to each synset Si:

h1 IF a lemma of an NP of the triple ck is contained in the lemmatized gloss gi of the synset Si. This would mean that ConceptNet contains a relation between a term in Ti and a term in the description gi, making explicit some semantics contained in the gloss. Note that the systematic inclusion of related-to relations with all the terms in the gloss gi would lead to many incorrect enrichments, so a heuristic like h1 is necessary to identify only correct enrichments.

h2 IF a lemma of an NP of ck is also contained in Pwordnet(Si). By traversing the WordNet structure, it is possible to link words of related synsets to Si by exploiting existing semantics in ConceptNet.

h3 IF a lemma of an NP of ck is contained in the lemmatized glosses of the most probable synsets associated to the words in gi. The word sense disambiguation algorithm used for disambiguating the text of gi is a simple match between the words in the triple and the words in the glosses. In case of empty intersections, the most frequent sense is selected.

h4 After taking all the hypernyms of Si, we query ConceptNet with their lemmas, obtaining different sets of triples (one for each hypernym lemma). IF the final part *-rel-word of the triple ck is also contained in one of these sets, we then associate ck to Si. The idea is to intersect different sets of ambiguous common-sense knowledge to make a sort of collaborative filtering of the triples. For example, let Si be Sburn,burning: pain that feels hot as if it were on fire, and consider the two candidate ConceptNet triples c1 = burning-relatedto-suffer and c2 = burn-relatedto-melt. Once retrieved hypernyms(Sburn,burning) = {pain, hurting} from WordNet, we query ConceptNet with both pain and hurting, obtaining the two resulting sets Pconceptnet(pain) and Pconceptnet(hurting). Given that the end of the candidate triple c1 is contained in Pconceptnet(pain), the triple is added to the synset Sburn,burning. On the contrary, the triple c2 is not added to Sburn,burning since relatedto-melt is contained neither in Pconceptnet(pain) nor in Pconceptnet(hurting).

The proposed method was able to link (and disambiguate) a total of 98122 individual ConceptNet instances to 102055 WordNet synsets. Note that a single ConceptNet instance is sometimes mapped to more than one synset (e.g., the semantic relation hasproperty-red has been added to multiple synsets such as [pomegranate, ...] and [pepper, ...]). Therefore, the total number of ConceptNet-to-WordNet alignments was 582467. Note that we only kept those instances which were not present in WordNet (i.e., we removed redundant relations from the output). Table 2 shows an analytical overview of the resulting WordNet enrichment according to the used heuristics.

Table 2: Overview of the WordNet enrichment according to the used heuristics.

Heuristic   # of enrichments
h1          222544
h2          109212
h3          19769
h4          230942

In order to obtain a first and indicative evaluation of the approach, we manually annotated a set of 505 randomly-picked individual synset enrichments. In detail, given a random synset Si which has been enriched with at least one ConceptNet triple ck = <NP - rel - NP>, we verified the semantic correctness of ck when added to the meaning expressed by Si, considering the synonym words in Ti as well as its gloss gi and examples Ei. Table 3 shows the results.

The manual validation revealed a high accuracy of the automatic enrichment. While the total accuracy is 88.31% (note that higher levels of accuracy are generally difficult to reach even by inter-annotator agreement), the extension seems to be highly accurate for relations such as capable-of and has-property. On the contrary, is-a and related-to relations have shown a lower performance.

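Stepping back, the alignment cycle and heuristics of Section 4.4. can be sketched in a few lines, here limited to h1 and h4 for brevity; all function names, the crude lemmatization stand-in, and the extra pain-relatedto-suffer entry are our own assumptions, not the authors' released code:

```python
def lemmas(np):
    # crude stand-in for lemmatization: lowercase bag of words
    return set(np.lower().split())

def candidates(synset, conceptnet):
    """P_conceptnet(S_i): union of the triples containing a term of T_i."""
    return [t for t in conceptnet
            if set(synset["terms"]) & (lemmas(t[0]) | lemmas(t[2]))]

def h1(triple, synset):
    # h1: a lemma of an NP of the triple occurs in the lemmatized gloss g_i
    return bool((lemmas(triple[0]) | lemmas(triple[2]))
                & lemmas(synset["gloss"]))

def h4(triple, synset, conceptnet):
    # h4: the tail *-rel-word of the triple also appears among the triples
    # retrieved for a hypernym lemma of S_i ("collaborative filtering")
    tail = triple[1:]
    for hyper in synset["hypernyms"]:
        if any(t[1:] == tail for t in conceptnet
               if hyper in lemmas(t[0]) | lemmas(t[2])):
            return True
    return False

# The paper's example: S_burn,burning with hypernyms {pain, hurting}.
s_burn = {"terms": ["burn", "burning"],
          "gloss": "pain that feels hot as if it were on fire",
          "hypernyms": ["pain", "hurting"]}

c1 = ("burning", "relatedto", "suffer")
c2 = ("burn", "relatedto", "melt")
# assumed CSK entry making c1's tail visible from the hypernym "pain":
conceptnet = [c1, c2, ("pain", "relatedto", "suffer")]

assert h4(c1, s_burn, conceptnet)      # relatedto-suffer found for "pain"
assert not h4(c2, s_burn, conceptnet)  # relatedto-melt found for no hypernym
assert h1(("burning", "causes", "pain"), s_burn)  # "pain" is in the gloss
```

Heuristics h2 and h3 follow the same pattern, matching triple lemmas against Pwordnet(Si) and against the glosses of the disambiguated gloss words respectively.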
However, this is in line with the type of the used resources: on the one hand, WordNet represents a quite complete taxonomical structure of lexical entities; on the other hand, ConceptNet contains a very large semantic basis related to object behaviours and properties. Finally, related-to relations are more easily identifiable through statistical analysis of co-occurrences in large corpora and advanced topic modeling built on top of LSA (Dumais, 2004), LDA (Blei et al., 2003) and others.

Table 3: Accuracy of some WordNet semantic enrichments obtained by the manual evaluation.

Relation       # correct   # incorr.   Acc.
related-to     121         22          84.62%
is-a           99          17          85.34%
at-location    39          5           88.84%
capable-of     36          1           97.29%
has-property   29          2           93.55%
antonym        27          4           87.10%
derived-from   25          1           96.15%
...            ...         ...         ...
Total          446         59          88.31%

Extending WordNet with non-disambiguated common-sense knowledge may be challenging, also considering the very limited contextual information at disposal. However, such an alignment is feasible due to the scarce presence of common-sense knowledge related to very specific synsets/meanings (e.g., for the term "cat", it is very improbable to find a common-sense fact related to the synset Scat: a method of examining body organs (...)).

5. Bibliographical References

Agirre, E., Ansa, O., Hovy, E., and Martínez, D. (2000). Enriching very large ontologies using the WWW. arXiv preprint cs/0010026.

Bentivogli, L. and Pianta, E. (2003). Beyond lexical units: Enriching wordnets with phrasets. In Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 2, pages 67–70. Association for Computational Linguistics.

Hsu, M.-H., Tsai, M.-F., and Chen, H.-H. (2008). Combining wordnet and conceptnet for automatic query expansion: a learning approach. In Information Retrieval Technology, pages 213–224. Springer.

Kiu, C.-C. and Tsui, E. (2011). Taxofolk: A hybrid taxonomy–folksonomy structure for knowledge classification and navigation. Expert Systems with Applications, 38(5):6049–6058.

Laparra, E., Rigau, G., and Cuadros, M. (2010). Exploring the integration of wordnet and framenet. In Proceedings of the 5th Global WordNet Conference (GWC 2010), Mumbai, India.

Miller, G. A. (1995). Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41.

Montoyo, A., Palomar, M., and Rigau, G. (2001). Wordnet enrichment with classification systems. In Proceedings of WordNet and Other Lexical Resources: Applications, Extensions and Customisations Workshop (NAACL-01), The Second Meeting of the North American Chapter of the Association for Computational Linguistics, pages 101–106.

Navigli, R. and Ponzetto, S. P. (2010). Babelnet: Building a very large multilingual semantic network. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 216–225. Association for Computational Linguistics.

Navigli, R., Velardi, P., Cucchiarelli, A., Neri, F., and Cucchiarelli, R. (2004). Extending and enriching wordnet with ontolearn. In Second Global WordNet Conference, Brno, Czech Republic, January, pages 20–23.

Niemi, J., Lindén, K., Hyvärinen, M., et al. (2012). Using a bilingual resource to add synonyms to a wordnet. In Proceedings of the Global Wordnet Conference.

Niles, I. and Pease, A. (2003). Mapping wordnet to the sumo ontology. In Proceedings of the IEEE international knowledge engineering conference, pages 23–26.

Pease, A., Niles, I., and Li, J. (2002). The suggested upper merged ontology: A large ontology for the semantic web and its applications. In Working notes of the AAAI-2002 workshop on ontologies and the semantic web, volume 28.

Ruiz-Casado, M., Alfonseca, E., and