Introducing Sanskrit Wordnet

Malhar Kulkarni
Department of Humanities and Social Sciences, Indian Institute of Technology Bombay

Chaitali Dangarikar
Center for Indian Language Technology, Indian Institute of Technology Bombay
chaitali.dangarikar@gmail.com

Irawati Kulkarni
Center for Indian Language Technology, Indian Institute of Technology Bombay

Abhishek Nanda
Center for Indian Language Technology, Indian Institute of Technology Bombay

Pushpak Bhattacharyya
Center for Indian Language Technology, Indian Institute of Technology Bombay

Abstract

How does one build the wordnet of a language that has a rich lexical tradition spanning millennia? The sheer volume of words and their nuances, the rich, deep and diverse grammatical tradition, the pressure of modern developments on the language: all these factors and more combine to pose unique challenges in creating lexical resources for such languages. This paper describes the construction of the Sanskrit wordnet, which is being built using the expansion approach. It presents the processes and challenges involved in this task, which purports to uncover the intimate linkage that underlies Indian languages, most of which have speaker populations numbering 20 to 500 million.

1 Introduction

Sanskrit is historically an Indo-Aryan language (Deshpande, 1992) and one of the 22 official languages of India. It has a vast literature, and interest in analyzing and translating these texts is always on the rise worldwide.

Specifically, our motivation for building the Sanskrit wordnet arises from the following facts:

1. For all languages of the Indo-European family in India, the roots can be traced to Sanskrit. A large part of the vocabulary of these languages is derived from Sanskrit, which can therefore provide the pivot resource for many Indian languages. The speaker population for these languages ranges from 10 million (Konkani) to 500 million (Hindi/Urdu).

2. Sanskrit being a heritage language, there is a need to digitize and preserve ancient texts in Sanskrit. This activity is greatly helped by word lists. An optical character recognition (OCR) system for Sanskrit, for example, would need spelling correction after scanning, and this would need an exhaustive lexicon.

3. Similarly, there exists a real need for translating ancient texts to preserve traditional culture and knowledge. An online wordnet would no doubt be a great help to a translator.

4. Machine aided translation (MAT) is maturing fast, and automatic translation of Sanskrit text is a challenging problem that needs a wordnet.

5. There is an enormous amount of Sanskrit text which should be available in keyword-based searchable form. Text search is greatly helped by wordnets.

6. The tradition of developing lexical resources is very old in Sanskrit. There are diverse koshas (traditional and rich monolingual dictionaries) in Sanskrit (see section 1.2 below). The Sanskrit wordnet will serve as the single reference point representing and pointing to all these resources.

1.1 Sanskrit language

The Indian subcontinent is inhabited by a very large population who speak languages belonging to 4 major families: Indo-Aryan (a sub-family of Indo-European), Dravidian, Tibeto-Burman and Austro-Asiatic. Sanskrit is the oldest member of the Indo-Aryan language family, a sub-branch of Indo-Iranian, which in turn is a branch of the Indo-European language family.

There is a traditional fourfold division of lexical units of Indian languages into:

1. तत्सम tatsama [1] - words having their origin in Sanskrit and accepted in the modern Indo-Aryan languages without any change in their phonology.

2. तद्भव tadbhava [2] - words which have their origin in Sanskrit but whose phonological forms are changed as per the rules of the modern Indo-Aryan languages.

3. देशी deshi - words which are the native words of the particular language.

4. विदेशी videshi - words borrowed from foreign languages.

The links to तत्सम tatsama and तद्भव tadbhava words, in particular, will be a great pan-Indian linguistic resource for computational purposes. Table 1 below lists some examples of Sanskrit words in the Hindi wordnet [3].

Table 1: Tatsama words in the HWN

English meaning: basil
  HWN synset: {तुलसी, पावनी, बहुमंजरी, वृंदा, वृन्दा, वैष्णवी, भारवी, मंजरीक, विश्वपावनी, विश्व-पूजिता, पुष्पसारा, त्रिदशमंजरी, त्रिदशमञ्जरी, तीव्रा, पत्रपुष्पा, श्रीमंजरी, श्रीमञ्जरी, अमृता}
  Tatsama word (HWN synset word): तुलसी (तुलसी), वृन्दा (वृन्दा), वैष्णवी (वैष्णवी), पावनी (पावनी), पत्रपुष्पा (पत्रपुष्पा)

English meaning: eyebrow, brow, supercilium
  HWN synset: {भौंह, भौं, भ्रू, भृकुटी, तेवर, कोदंड, कोडंड, अब}
  Tatsama word (HWN synset word): भ्रू (भ्रू), भृकुटी (भृकुटी)

English meaning: muscle, musculus
  HWN synset: {पेशी, माँस-पेशी, मांस-पेशी, माँसपेशी, मांसपेशी, माँस पेशी, मांस पेशी, नस}
  Tatsama word: पेशी, मांसपेशी

English meaning: eggplant, aubergine, mad_apple
  HWN synset: {बैंगन, बैगन, भंटा, भाँटा, शाकबिल्व, शाकबिल्वक, वृंताक, वृन्ताक, नीलवृषा, शाकश्रेष्ठा, वृंताकी, वागुण, वरा, चित्रफला, रकंठ, रकठ, निद्रालु, नीलफला, नटपत्रिका}
  Tatsama word (HWN synset word): शाकबिल्व (बैंगन), शाकश्रेष्ठा (बैंगन), चित्रफला (बैंगन), वृन्ताक (बैंगन), निद्रालु (बैंगन), नीलफल (बैंगन)

These representative examples show that the synsets in the Hindi wordnet contain 60-70% tatsama (directly borrowed from Sanskrit) words.

[1] Tatsama Shabda Kosha (Tatsama words dictionary) was published by Kendriya Hindi Nideshalaya, Shiksha Vibhaga, Manava Samsadhana Vikasa Mantralaya, Bharata Sarakara in 1988.
[2] See Hindi ki Tadbhava Shabdavali (Sarma, 1968).
[3] www.cfilt.iitb.ac.in/wordnet/webhwn.

1.2 Rich lexical tradition of Sanskrit

Sanskrit has a rich tradition of creating lexica (Kulkarni, 2008). Nighantu [4] (700 BC), on which Yaska is believed to have written a commentary called Nirukta, is the oldest known treatise that arranged lexical material from the point of view of synonymy as well as homonymy, and this practice continued in the Pali [5] tradition as well. The first and most popular lexicon in classical Sanskrit is Amarasimha's Amarakosha (6th century AD) (Oka, 1913). The Catalogus Catalogorum lists at least 40 commentaries on the Amarakosha alone, which shows how important and popular this dictionary of synonyms was in ancient India.

There were many other lexica created more or less in the style of the Amarakosha; 11 of them are given in Appendix A.

The first modern-day dictionary of Sanskrit was the Sanskrit-English Dictionary compiled by Professor H.H. Wilson and published in 1819 (Wilson, 1819). Two Indian dictionaries came out soon after, namely the Shabdakalpadruma [6] (Deb, 1988) of Pt. Sir Raja Radhakanta Dev and the Vacaspatyam [7] (Bhattacharya, 2003) compiled by Pt. Taranatha Tarkavacaspati.

So far the electronic lexical resources available for Sanskrit are mainly online dictionaries [8]. Linguistic resources like the Shabdakalpadruma and the Vacaspatyam are vast. For example, a comparison of the entries for the word war in these electronic dictionaries with the synsets of the same word in the Sanskrit Wordnet is a good indicator of the richness of this lexical tradition in Sanskrit.

1. Spoken Sanskrit Dictionary (7 words): युद्ध, युध्, संग्राम, समर, आयोधन, आहव, रण्य.

2. Apte's Sanskrit-English Dictionary (7 words): विग्रहः, संप्रहारः, वैरारंभः, वैरं, संग्रामः, युद्धं, रणं.

3. Monier Williams Dictionary (56 words): अनीक, अभ्यामर्द, अम्बरीष, अरर, आजि, आनर्त, आयोधन, आहव, आहाव, कण्ठाल, कन्दल, खज, न, नदनु, निग्रहण, पुष्कर, प्रविदारण, प्रसर, बलज, भण्डन, भर, भीमर, युत्कार, युद्ध, योध, योधन, रण्य, राटि, वराक, विदथ, विदार, विदारण, विमर्द, विमर्दन, शम्बर, शिलीमुख, संयत्, संयुग, ...

[4] Nighantu is the Sanskrit term for a collection of words grouped in thematic categories with brief annotations.
[5] Pali is a Middle Indo-Aryan language (or Prakrit) of India. It is best known as the language of the earliest extant Buddhist scriptures.
[6] Shabdakalpadruma is the first Sanskrit monolingual dictionary arranged on modern alphabetical principles. It gives full quotations and definitions from the original Koshas, which were unavailable in print at that time. Sets of synonymous words from the traditional Koshas are arranged under the headword, followed by a brief gloss. Each entry in the lexicon includes the headword, its category, its meaning, and usages in Sanskrit texts.
[7] Vacaspatyam is a modern monolingual Sanskrit lexicon. It arranges words in Sanskrit alphabetical order and gives grammatical information with word derivations as per traditional Sanskrit grammar. It contains about 46970 unique words. Each entry in the lexicon includes the headword, its category, its meaning, a set of synonymous words, usages and some other information.
[8] The online dictionaries available for Sanskrit are: (1) Monier Williams dictionary <http://webapps.uni-koeln.de/tamil/>, (2) Apte's Sanskrit-English Dictionary <http://www.aa.tufs.ac.jp/~tjun/sktdic/>, (3) Apte's English-Sanskrit Dictionary <http://www.sanskrit-lexicon.uni-koeln.de/aequery/index.html> and (4) Spoken Sanskrit Dictionary, an online hypertext dictionary for Sanskrit-English and English-Sanskrit <http://spokensanskrit.de/>. Apart from these, various scanned versions of the printed dictionaries prepared by European scholars are available at <http://www.sanskrit-lexicon.uni-koeln.de/>.

1.4 Expansion approach for Indian language wordnets

Wordnet construction activities in India started in 2000, and the Hindi wordnet (Narayan et al., 2002) was the first to be released on the Web, in 2006. It was built ab initio using words from available lexical resources of Hindi. The design of the Hindi wordnet follows the famous English WordNet.

While following the expand method, the Sanskrit wordnet follows the hierarchy preservation principle (HPP) (Tufis et al., 2008). In the hierarchy of the Hindi wordnet, if synset H2 is a hyponym of synset H1, and the translation equivalents in the Sanskrit wordnet for H1 and H2 are S1 and S2 respectively, then in the hierarchy of the Sanskrit wordnet S2 should be a hyponym of synset S1.
