Indowordnet and Its Linking with Ontology
Total Page:16
File Type:pdf, Size:1020Kb
IndoWordNet and its Linking with Ontology Brijesh Bhatt Pushpak Bhattacharyya Center For Indian Language Technology, Center For Indian Language Technology, Department of Computer Science and Engineering, Department of Computer Science and Engineering, Indian Institute of Technology, Mumbai Indian Institute of Technology, Mumbai [email protected] [email protected] Abstract hierarchical structure of concepts related by sub- sumption relation, which can be shared between Reasoning about natural language requires applications. Though WordNet is often considered combining semantically rich lexical re- as an ontology (the synset corresponds to concept sources with world knowledge, provided and ‘hypernymy-hyponymy’ relation is similar to by ontologies. In this paper, we describe subsumption), it is a language specific resource linking of WordNets of Indian languages which may vary from language to language. The with an upper ontology SUMO (Sug- ontological issues in WordNet discussed in (pease gested Upper Merged Ontology). This and Fellbaum, 2010; Gangemi et al., 2003; Niles creates multilingual resource for Indian and Pease, 2003) are as follows, languages which can be used in various natural language processing applications. This paper presents the architecture of 1. Confusion between concept and individual: IndoWordNet- Linking of WordNets of WordNet synsets do not distinguish between seventeen different Indian languages and concept and individual. For example, both provides a method to link it with upper on- ‘Martial Art’ and ‘Karate’ are considered as tology SUMO. Two different systems: In- concept. doWordNet navigator and SIGMAKEE in- terface for Indian languages are developed 2. Confusion between object level and meta to access this resource. level concept: WordNet covers both object level and meta level concepts as hyponymy 1 Introduction of same concept. For example, concept ‘Ab- WordNet (Fellbaum, 1998) has emerged as a great straction’ includes both object level concept resource for the Natural Language Processing ap- ‘Time’ and meta level concept ‘Attribute’. plications. Following English WordNet, Word- Nets are built for many languages of the world. 3. Heterogeneous level of generality: Two hy- Hindi WordNet (D. Narayan and Bhattacharyya, ponyms of a concept may represent different 2002) is the first WordNet built for an Indian lan- level of generality. For example, as a hy- guage. WordNets for other 16 Indian languages ponymy of concept ‘Animal’, there is a gen- are being built from Hindi WordNet using expan- eral concept ‘chordate’ and a more specific sion approach (Vossen, 1998). Linking of all these concept ‘Work Animal’. WordNets provides rich knowledge base for the Indian languages, which can be useful for infor- 4. Lexical gap: A language may not have an in- mation extraction, retrieval and many other natural digenous lexeme to describe a concept. For language processing applications. example, vehicles can be divided into two classes, 1) Vehicles that run on the road and 1.1 WordNet and ontology 2) Vehicles that run on the rail, but English Ontology is defined as “Explicit specification of language does not have specific words to de- conceptualization”(Gruber, 1993). Ontology is a scribe these classes. “Proceedings of ICON-2011: 9th International Conference on Natural Language Processing, Macmillan Publishers, India. Also accessible from http://ltrc.iiit.ac.in/proceedings/ICON-2011” 1.2 Benefits of linking WordNet with and (17) Urdu. Together these languages repre- ontology sents three language families: Indo-Aryan, Dra- By keeping ontological relations in the formal on- vidian and Tibeto-Barman. The comparative study tology and linguistic relations in the lexicon, one of IndoWordNet with EuroWorNet is presented in can avoid merging two different levels of analy- (Bhattacharyya, 2010) sis and can capture the information needed about 2.1 Synset Categorization formal concepts and linguistic tokens (pease and Fellbaum, 2010). Linking of WordNet with on- Synsets of the Hindi WordNet are used as basis tology allows the language independent semantic to create synsets of WordNets of other languages. relations of ontology to be used for inferencing on As a synset is represented by a gloss and a set language specific words. The benefits of linking of words of a particular language, in many cases, ontology and WordNet are as follows (Niles and the synset representation of a concept may vary Pease, 2003): in sense across the languages. Also there exist some concepts for which there may not be words 1. The formal specifications of the ontology can in all the languages. For example, Kashmiri lan- be used with the WordNet in the sense that guage does not have words for the concepts like the axioms corresponding to the words can g}h (‘graaha’, Planet), som (‘som’), m\gl (‘man- be retrieved from the ontology. gal’). Kinship terms also vary across Indian lan- guages. To handle this concept divergence across 2. The formal axioms of ontology can be used languages, synsets are divided into different cate- with natural language text. gories. Table 1 describes these categories. Such classification of synsets helps in linking 3. Linking ontology concepts with WordNet can concepts of different languages. For example, if be used to check completeness of ontology. a synset belongs to the universal synset then it is 4. Concepts of different languages can be com- present in both Hindi and English. And if a synset pared and linked using ontology. belongs to the Pan-Indian category then it belongs to both Hindi and Gujarati. Thus, WordNet devel- 5. WordNet concepts can be refined and restruc- opment using expansion approach will be faster by tured using ontology. this method. This classification also helps in cross lingual information access. By identifying the cat- 6. Different domain ontologies can be linked us- egory of a synset, its presence in another language ing WordNet concepts. can be easily predicted. Till date, 7163 universal synsets and 1356 Pan- 1.3 Organization of the paper Indian synsets have been identified and are now The remaining of the paper is organized as fol- linked across all languages. Language specific lows: Section 2 describes the IndoWordNet. The synsets are being developed and later they will be selection criteria for upper ontology and a compar- linked by translating them into Hindi and English. ative study of upper ontology is given in Section 3 and 4. Section 5 describes the method to link In- 3 Survey of upper ontology doWordNet with ontology and interfaces designed Ontologies are categorized into three different to access the system. Observations and conclu- types according to their level of generality (Guar- sions are discussed in section 6 and 7 respectively. ino, 1998). Top level/Upper ontology, Domain 2 IndoWordNet specific ontology and application specific ontol- ogy. Upper ontology defines very general con- Seeing the enormous potential of WordNet, 17 cepts independent of application or domain. Up- out of 22 official languages of India have started per ontologies are useful in linking and devel- developing WordNets. These languages are: (1) opment of more specific domain/application on- Hindi (2) Marathi (3) Konkani (4) Sanskrit (5) tology. There are various upper ontologies like Nepali (6) Kashimiri (7) Assamese (8) Tamil (9) SUMO, DOLCE, CYC etc. The ontological Malyalam (10) Telugu (11) Kannad (12) Manipuri choices for designing upper ontologies, discussed (13) Bodo (14) Bangla (15) Punjabi (16) Gujarati in (Oberle et al., 2007), are as follows, Table 2: Comparison of Upper ontologies Case SUMO DOLCE Open CYC License Open Open Free Table 1: Synset categories in IndoWordNet subset Category Description of Cyc Universal These concepts appear in all the Modularity Yes No Yes Language KIF, OWL Cyc languages. These concepts are es- OWL sential and most frequently used. Multiplicative(M)- -MM For example, s y (‘soorya’, sun) Reductionist(R) Descriptive(D)- DDD Pan-Indian Concepts common to Indian lan- prescriptive(Pr) guages and linkable across all In- Endurant(E)- No Yes Yes dian languages but does not have Perdurant(Pd) parallel concepts in English. For Universal(U)- Both Pt Both Particular(Pt) example, tblA (‘tabalaa’, An In- Linking to Word- Full Partial Partial dian rhythm instrument) Net In-Family These concepts are common in specific subset of Indian lan- 1. Endurant-Perdurant : An endurant is an en- guages and linkable across all lan- tity which exists in full in every instant at guages of the family. For example, which it exists at all. A perdurant “unfolds cAcA (‘chacha’ ,paternal uncle), itself over time in successive temporal parts BEtjA (‘bhatija’, brother’s son) or phases.” Both endurants and perdurants are taken to be concrete particulars, i.e., in- Language These are the concepts specific to stances. Specific a culture or a language. These may include local food, festivals, 2. Descriptive-prescriptive :A descriptive on- etc. For example, Ebh (‘bihu’ , tology tries to capture more commonsense Name of a festival celebrated in and social notions based on natural language Assam state of India) word is very usage and human cognition. Concepts are specific to the state and culture and divided into things and events. A prescrip- does not appear in any other lan- tive ontology emphasizes upon the scientific guage. and philosophical perspectives. All the con- structs in revisionary ontology are space-time Rare This includes very specific con- objects. cepts adopted in most of the lan- multiplicative-reductionist multiplicative guages. Specific scientific terms 3. : In like ‘ngram’ belongs to this cate- upper ontology concepts can include any- gory thing that reality seems to require. Contrar- ily, reductionist ontology reduces the number Synthesized Synset created in a language due of concepts to the fewest primitives sufficient to influence of another language. to derive the rest of the complex reality. These synsets are not natural to 4. Universal-particular : Universals are the en- the language but required to link tities that have instances, while particulars WordNets of two different lan- are entities that do not have instances.