<<

Introduction to Synskarta: An Online Interface for Synset Creation with Special Reference to

Hanumant Redkar, Jai Paranjape, Nilesh Joshi, Irawati Kulkarni, Malhar Kulkarni, Pushpak Bhattacharyya Center for Indian Language Technology, Indian Institute of Technology Bombay, India.

[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

In this paper, we have taken reference of San- Abstract skrit WordNet 3. Sanskrit is an Indo-Aryan lan- guage and is one of the ancient languages. It has WordNet is a large lexical resource express- vast literature and a rich tradition of creating ing distinct concepts in a language. Synset is léxica. The roots of all languages in the Indo Euro- a basic building block of the WordNet. In this pean family in India can be traced to Sanskrit paper, we introduce a web based lexicogra- (Kulkarni et al., 2010). Sanskrit WordNet is con- pher's interface ‘Synskarta’ which is devel- structed using expansion approach where oped to create synsets from source language to target language with special reference to WordNet is used as a source (Kulkarni et al., Sanskrit WordNet. We focus on introduction 2010). and implementation of Synskarta and how it While developing Sanskrit WordNet, lexicogra- can help to overcome the limitations of the phers create Sanskrit synsets by referring to Hindi existing system. Further, we highlight the fea- synsets and by following the three principles of tures, advantages, limitations and user evalua- synset creation (Bhattacharyya, 2010). Since San- tions of the same. Finally, we mention the skrit came into existence much before Hindi, it has scope and enhancements to the Synskarta and many words which are not present in Hindi its usefulness in the entire IndoWordNet WordNet. These words are frequently used in the community. Sanskrit texts; hence, there was a need to create new Sanskrit synsets. In this case, the lexicogra- 1 Introduction pher creates new Sanskrit synsets by referring to electronic lexical resources such as Monier- WordNet is a lexical resource composed of synsets Williams Dictionary, Apte's Dictionary, Spoken and semantic relations. Synsets are sets of Sanskrit Dictionary, etc. and linguistic resources synonyms. They are linked by semantic relations such as Shabdakalpadruma and Vaacaspatyam , like hypernymy ( is-a), meronymy ( part-of ), etc. (Kulkarni et al., 2010). troponymy, etc. IndoWordNet 1 is a linked structure The IndoWordNet community uses the IL- of of major Indian languages from Indo- Multidict tool to create synsets. This tool is de- signed and developed by researchers at IIT Bom- Aryan, Dravidian and Sino-Tibetan families bay (Bhattacharyya, 2010; Kulkarni et al., 2010). (Bhattacharyya, 2010). WordNets are constructed Though this existing lexicographer's interface is by following the merge approach or the expansion popular and widely used, it has major limitations. approach (Vossen, 1998). IndoWordNet is Some of them are – it uses flat files, has chances of constructed using expansion approach wherein data redundancy, inconsistency, etc. To overcome Hindi is used as the source language; however, the these limitations, we developed a new web based 2 Hindi WordNet is constructed using merge synset creation tool – ‘Synskarta’. The features, approach (Narayan et al., 2002). advantages, limitations and user evaluations of Synskarta are detailed in this paper. 1 http://www.cfilt.iitb.ac.in/indowordnet/ 2 http://www.cfilt.iitb.ac.in/wordnet/webhwn/wn.php 3 http://www.cfilt.iitb.ac.in/wordnet/webswn/wn.php 95 D S Sharma, R Sangal and J D Pawar. Proc. of the 11th Intl. Conference on Natural Language Processing, pages 95–100, , India. December 2014. c 2014 NLP Association of India (NLPAI) The rest of the paper is organized as follows: synset id, search by word, generate synset count, Section 2 describes the existing system – its ad- generate word count, reference to quotations, vantages and disadvantages, section 3 describes commenting on a current synset and linkage to the Synskarta – its features, advantages, limitations corresponding English synset, etc. and the user evaluations. Subsequently, the con- clusion, scope and enhancements to the tool are 2.2 Advantages and Disadvantages of the Ex- presented. isting System 2 Existing System Some of the major advantages of the existing tool are: Firstly, it is a standalone tool. Hence it can be installed easily. Secondly, it is portable, i.e. the 2.1 IL Multidic Development Tool tool can be installed and used over different operat- ing systems. IL Multidict (Indian Language Multidict Though the existing Lexicographer’s Interface development tool) or the Offline Synset Creation has its own advantages and it is widely accepted by tool is developed using Java and works with flat the IndoWordNet community, we can find several files. This tool, popularly known as the limitations with the system. Some of them are 'Lexicographer's Interface' is an offline tool which mentioned below: helps in creating synsets using the expansion • Tool works with flat files hence there is a high approach. possibility of data redundancy, data The interface is vertically divided into the inconsistency, etc. source language panel and the target language • As the number of synsets increases, the panel. At any given time, only the current source processing time to perform various operations synset is displayed in the source language panel like searching, counting, synchronization, etc. and its corresponding target synset is displayed in increases. the target language panel. The source panel • Being a standalone tool, the installation and displays the details of the current synset of the configuration time increases with increase in source language such as number of records in number of machines. source file, current synset id, its part-of-speech • Merging data from different systems may lead (POS) category, gloss or the concept definition, to data loss, data redundancy as well as data example(s) and synonym(s). inconsistency. Similarly, the target panel displays the target • Synset data is not rendered properly in the synset details such as total number of synsets in the interface if there is any formatting mistake in target file, number of complete synsets, number of the source or target file. Also, if any special incomplete synsets and the current synset id character is added in a file then the synsets are (which is the same as source synset id). These not loaded in the system. fields are non-editable. There are also editable fields which allow editing of target synset details 3 Developed System such as gloss, example(s) and synonym(s). The lexicographer translates the source language synset into the target language synset, while the validator 3.1 Synskarta uses same tool to validate these translated synsets. The developed system, ‘Synskarta’ is an online There is a navigation panel which allows a lexi- interface for creating synsets by following the ex- cographer to navigate between synsets. The button pansion approach. This web based tool is devel- ‘Save & Next’ saves the current synset and moves oped using PHP and MySQL which uses relational to the next synset. The source and target synset database management system to store and maintain data is extracted from source and target synset files the synset and related data. The IndoWordNet da- respectively. These files are in Dictionary Standard tabase structure (Prabhu et al., 2012) is used for Format (DSF) with extension '.syns'. Some of the storing and maintaining the synset data while features of the existing system are: search by

96

IndoWordNet APIs (Prabhugaonkar et al., 2012) are used for accessing and manipulating this data. Synskarta overcomes the limitations of the standalone offline tool. The look and feel of the interface is kept similar to that of the existing sys- tem for ease of user adaptability. Most of the basic features of the existing system are incorporated in this developed system. Figure 1 shows the Lexi- cographer’s Interface of the developed system.

3.2 Features of Synskarta

3.2.1 Features of Synskarta incorporated from the Existing System

The features of the existing system which are im- plemented with some improvements in Synskarta Figure 1. The Developed System – Synskarta are as follows – • User Registration Module - This module fields. • allows the system administrator to create user Search – User can search a synset either by profiles and provide necessary access entering 'synset id' or a 'word' in a synset. • privileges to user. The user can login using the Advanced Search – Here user is allowed to access privileges provided to him and search synset data by entering various accordingly the user interface is displayed to parameters such as POS category, words that particular user. appearing in gloss or in example, etc. • • Configuration Module – This module sets all Comment – User can comment on a the necessary parameters such as source particular synset being translated. language, target language and enables or • English Synset – User can check the disables certain features such as Source, corresponding English synset for better Domain, Linking, Comment, References, etc. clarity in translation process. • Main Module – This module allows the user to • Navigation Panel – This panel allows the enter data in the target language panel by user to navigate between synsets. Button referring to data in the source language panel. ‘Save & Next’ allows inserting or updating The source panel and the target panel vertically a current target synset and the data is divide the main module into two equal sized directly stored in the IndoWordNet panels. Following are the major components of database. this module: 3.2.2 Features of Synskarta specific to San- • Source language panel – This panel is skrit Language placed on the left of the screen which has fields for synset id, POS category, gloss, Apart from the features of the existing system, example(s) and synonyms of the source there are features which are specific to the Sanskrit language synset. language. Some of these features can also be appli- • Target language panel – This panel is cable to other languages. placed on the right of the screen. This As far as Sanskrit is concerned, some of the fea- panel has non-editable fields such as tures can be specific to a particular word in the synset id, POS category and editable fields synset. These are given in table 1. To capture these such as gloss, example(s) and synonym(s) word specific features, the ‘Word Options’ button of the target language synset. The user is is provided. This window has various features expected to enter the data in these editable which users can set or unset for a selected word . 97

Feature Details Word Word can have vaidika or laukika Types word types. vaidika words are from vedic literature and laukika words are from post-vedic literature. Accent Accent is an appropriate tone for utterance of any particular syllable. Class There are 10 classes of verbs in Sanskrit. They are bhv ādi, ad ādi, juhoty ādi, div ādi, sv ādi, tud ādi, rudh ādi, tan ādi, kry ādi and cur ādi. Etymol It is the study of the origin of a word ogy and historical development of its meaning. Pada There are padas for suffixes of a verb Figure 2. Word Options - Noun Specific Features

such as parasmaipada , ātmanepada culine for तटः (taṭaḥ) and neuter for तटम ् or ubhayapada . (taṭam). Hence, there is a need to store gender Ittva Verb can be ani ṭ, se ṭ or ve ṭ. information for such type of words . • Table 1. Special Features of Synskarta specific Indication of Preverbs ( उपसग', upasarga ) – In to Sanskrit Sanskrit, there are 22 preverbs. Some of them are (pra ), (par ), (apa ), etc. (Papke, 3.2.2.1 Noun Specific Features  परा ā अप 2005; Ajotikar et al. 2012). For example, for a Noun specific features are displayed only if the word (gandha ), if we add preverbs (su ), गध सु POS category of the synset is NOUN. Figure 2 (dus ) and (upa ) we get words shows a screenshot of the ‘Word Options’ window दसु ् उप सुगध (sugandha ), (durgandha ) and for noun synsets. Following are some of the noun दग'धु उपगध specific features of a word in Sanskrit language – (upagandha ) respectively. • • Indication of Class ( , ga a) – Certain words Indication of Word Type (शद कार , śabda गण ṇ in Sanskrit language belong to prak āra ) – In Sanskrit, words can have वै दक परग)णत (pariga ita ) or ( k ti ) class type. Each (vaidika ) or लौ क क (laukika ) word types. ṇ आकृ ,त ā ṛ • of this class type has class names. For exam- Indication of Accent (वर , swara ) – In San- skrit, if a word has vaidika word type then it ple, a word शव (śiva ) belongs to pariga ṇita class type having class name ( iv di ) can have accents such as उदात (udaatta ), शवा द ś ā (anudaatta ) or (svarita ). Again, and a word (śau ṇḍ a) belongs to ākṛti अनुदात वरत शौ-ड udaata has sub-accents such as class type having class name आयुदात शौ-डा द (aadyudaatta ), (madhyod ātta ) or (śau ṇḍ ādi ). मयोदात • Expectancy (/प, rūpa ) – Certain words expect अतोदात (antodaatta ). It is needed particular- ly in Sanskrit because the meaning of a word its related word to be in specific case(s). For changes according to the place of accent. example, a word अलम ् (alam, ‘ enough’) ex- • pects its related word to be in (t t y , Identification of Gender ( लग , li ṅga ) – In ततीयाृ ṛ ī ā Sanskrit, a gender can be masculine ( , ‘instrumental case’) as (ala ṁ पुंि!लग अलं रोदनेन। rōdan ēna, ‘enough of crying’). pu ṃlli ṅga), feminine ( "ीलग , str īli ṅga ) or • Etymology ( , vyutpatti ) – Most of the neutral ( , napu ṃsakali ṅga ). For ex- 0युपित नपुंसकलग words in Sanskrit have etymology. For exam- ample, a word तट (taṭa) has two genders, mas- 98

ple, a word (k la ka a , ‘ the sea’), related word in (pañcam , ‘ablative कू लकषः ū ṅ ṣ ḥ प>चमी ī (k la ka ati iti, ‘ one who cuts case’) such as (vy ghr t bh ta , कू लं कष,त इ,त। ū ṁ ṣ 0या@ात ् भीतः ā ā ī ḥ the shore’). ‘feared of tiger’).

3.2.2.2 Verb Specific Features

Verb specific feature list is displayed only if the POS category of the synset is VERB. Figure 3 shows the ‘Word Options’ window for verb synsets. Following are some of the verb specific fea- tures of a word in Sanskrit language – • Indication of Word Type ( शद कार , śabda prak āra ) – Verbs can have वै दक (vaidika ) or लौ कक (laukika ) word types. • Indication of Accent ( वर , swara ). • Indication of Transitivity ( कम'कव , karmakatva ) – A verb can be सकम'क (sakarmaka , ‘active’) or अकम'क (akarmaka, ‘passive’). • Indication of I ṭtva ( इ6व , iṭtva ) - A verb can be Figure 3. Word Options – Verb Specific Features अ,न6 (ani ṭ), से6 (se ṭ) or वे6 (ve ṭ). 3.2.3 General Features not specific to any • Language Indication of Class ( गण , ga ṇa) – In Sanskrit, there are 10 classes of verbs. Some of them are Some of the general features which are not specific (bhv di ), (ad di ), 8वा द ā अदा द ā जुहोया द to any particular language are also provided in (juhoty ādi ), etc. Synskarta, they are as follows: • • Indication of Pada ( पद , pada ) – In Sanskrit, Vindication – This feature allows the user to there are padas for suffixes of a verb such as record the special feature of a particular word (parasmaipada ), in a current synset. परमैपद आमनेपद • (ātmanepada ) or (ubhayapada ). Source – This feature allows the user to record उभयपद the information about source of the synset. • Indication of Preverbs ( , upasarga, ) – उपसग' • Domain – This feature allows the user to For verbs also there are 22 preverbs in San- record the information about domain of the skrit. For example, when preverbs are attached synset. to a root गम ् (gam, ‘to go’), we get verb forms • Linking – Many-to-Many linking of words is such as √ ( gam, ‘to come’), √ आ गम ् ā√ अनु गम ् supported in this feature. √ • Quotations – The feature to add quotations as (anu √gam, ‘to follow’), ,नर् गम ् (nir √gam, ‘to additional examples are supported. go out’), √ (vi √gam, ‘to go away’). <व गम ् • Root Verb – This feature allows the user to • Indication of Verbal Root Types ( , धातु कार enter the root verb of a given word. dh ātu prak āra ) – A verbal root can have types • Feedback – Feedback related to the tool and its such as (dh tu ), (s dhitadh tu ), धातु ā सा=धतधातु ā ā features are captured here. (vaidikadh ātu ) and वै दकधातु सौ"धातु 3.3 Advantages and Limitations of Synskarta (sautradh ātu ). • Expectancy (/प, rūpa ) – Certain verbs expect Major advantages of Synskarta are as follows: its related word to be in specific case(s). For • Centralized system • example, a root भी (bh ī, ‘to fear’) expects its No data redundancy and inconsistency. 99

• Online access from anywhere in the world such as automatic translation, transliteration, tran- • No text files to maintain data scription can also be implemented over the time. • Faster processing and updating Further, there can be a feature to link • Multiple users can work at the same time IndoWordNet synsets to the foreign language WordNets. A feature to generate produced words • Can be used by all the language WordNets can also be implemented. Link to FrameNet of

Sanskrit verbs is an important work that will be Synskarta has a few limitations. The major limi- undertaken in future. This will be useful in the tation is that, it is heavily dependent on internet light of development of dependency tree banks. access or networking. Another limitation is that currently there are rendering issues depending on Acknowledgements the device being used. We sincerely thank the members of the CFILT lab 3.4 User Evaluation – IIT Bombay and IndoWordNet community.

The beta version of Synskarta was released to the References Sanskrit WordNet creation team to capture users’ experiences and the feedback which are as follows: Malhar Kulkarni, Chaitali Dangarikar, Irawati Kulkarni, Abhishek Nanda and Pushpak Bhattacharyya. 2010. • Searching of synset from source as well as Introducing Sanskrit Wordnet . In Principles, Con- target language is faster. struction and Application of Multilingual WordNets, • User does not have to maintain files for source Proceedings of the 5th GWC, edited by Pushpak and target language. Bhattacharyya, Christiane Fellbaum and Piek • Separate synchronization is not required. Vossen, Narosa Publishing House, , 2010, Everything is handled by the tool itself. pp 257 – 294. • Malhar Kulkarni, Irawati Kulkarni, Chaitali Dangarikar The system always shows the updated synsets. and Pushpak Bhattacharyya. 2010. Gloss in Sanskrit • User has to depend on the internet while using Wordnet . In Proceedings of Sanskrit Computational the system. Linguistics. Jha. G. Berlin: Springer-Verlag / Heidel- berg. pp 190-197. 4 Conclusion Tanuja Ajotikar Malhar Kulkarni, and Pushpak Bhattacharyya. 2012. Verbs in Sanskrit Wordnet . It has been observed that the existing system for Proceeding of 6th Global WordNet Conference, synset creation activity is not up to the expecta- Matsue, Japan. pp 30 – 35. tions of the user, as it uses text files for storage and Narayan D., Chakrabarty D., Pande P. and processing. This may lead in creation of redundant Bhattacharyya P. 2002. An Experience in Building the Indo WordNet - a WordNet for Hindi . 1st Global synset data and hence, data maintenance becomes WordNet Conference, , India. difficult. A web based tool ‘Synskarta’ is devel- Neha R Prabhugaonkar, Apurva S Nagvenkar, Ramdas oped to overcome the limitations of the existing N Karmali. 2012. IndoWordNet Application Pro- system. This is a centralized system which uses gramming Interfaces . COLING2012, , India. relational database to store and maintain data. The Papke Julia. 2005. Order and Meaning in Sanskrit difficulty of maintaining data in flat files is taken Preverbs . 17 th International Conference on Historical care of. User feedback is found to be positive and Linguistics, Madison, Wisconsin. this tool may prove essential for the IndoWordNet Pushpak Bhattacharyya. 2010. IndoWordNet . In the community. Proceedings of Lexical Resources Engineering Con- ference (LREC), Malta. 5 Future Scope and Enhancements Venkatesh Prabhu, Shilpa Desai, Hanumant Redkar, Neha Prabhugaonkar, Apurva Nagvenkar, Ramdas, In future, the tool can be expanded to link to the Karmali. 2012. An Efficient Database Design for other Indian languages. This tool can have addi- IndoWordNet Development Using Hybrid Approach . COLING 2012, Mumbai, India. p 229. tional features such as capturing appropriate pro- Vossen Piek (ed.). 1998. EuroWordNet: A Multilingual nunciation of the synset, features to capture video, Database with Lexical Semantic Networks, Kluwer, images or documents related to the synset. Features Dordrecht, Netherlands.

100