A Report on the INDRADHANUSH WORDNET CONSORTIUM

1. Introduction

The title of the project is “Development of Indradhanush: an Integrated WordNet for Bengali, Gujarati, Kashmiri, Konkani, Odia, Punjabi and Urdu”. It is executed by a consortium of nine academic institutions from all over India, whose details are given below –

1. Goa University, Goa (Consortium Leader and Konkani WordNet development)
2. IIT Bombay, Powai (Co-Consortium Leader)
3. Indian Statistical Institute (Bengali WordNet development)
4. University of Kashmir (Kashmiri WordNet development)
5. University of Hyderabad, Hyderabad (Odia WordNet development)
6. , (Punjabi WordNet development)
7. Thapar University, Patiala (Punjabi WordNet development)
8. Dharmsinh Desai University, Nadiad (Gujarati WordNet development)
9. Jawaharlal Nehru University (Urdu WordNet development)

The Indradhanush WordNet Consortium comes under the umbrella of IndoWordNet along with two other consortia: the North East WordNet Consortium, which works on “Development of NE WordNet: An Integrated WordNet for North East Languages: Assamese, Bodo, Manipuri and Nepali”, and the Dravidian WordNet Consortium, which works on “Development of Dravidian WordNet: An Integrated WordNet for Telugu, Tamil, and ”. All these WordNets are developed at different institutes in India and co-ordinated by IIT Bombay. These WordNets are linked to WordNet and amongst each other.

This document has been prepared to give the details of the Indradhanush WordNet Project Consortium to the PRSG members before the PRSG meeting scheduled on 30th April 2013 at University of . It starts with brief details of the PRSG meetings held so far in section 2, followed by the current work status in section 3. Sections 4, 5 and 6 list the tools, utilities and websites developed by the members. The financial details are presented in section 7. Sections 8 and 9 give details of the total manpower trained and the total equipment purchased under this consortium respectively. The future work plan is presented in brief in section 10. Section 11 gives the details of the workshops and conferences organised by the members, followed by the list of publications. There are three appendices, A, B and C, which contain additional financial and manpower details.

2. PRSG Details

The details of the recommendations of the first and second PRSG meetings and action taken are presented below -

First PRSG held on 9th August 2011 at Goa University.

1) PRSG expressed that statistics of word senses in a large corpus may resolve the problem of identifying the exact sense, and suggested that a large corpus should be used for determining the sense of the words.

Follow-up action taken – Newspaper corpus has been collected and used by all the Institutes for sense marking. The details of sense marking are presented in the Present Work Status Section under Sense Marking Details.

2) All the research papers on WordNet published by consortium should be made available to Chairman PRSG and DeitY for uploading on the TDIL Data Center.

Follow-up action taken – Proceedings of the Goa Workshop have been submitted to the Chairman of PRSG for feedback and will be published. All published papers will be uploaded on the Indradhanush WordNet Website, and the link to this site will be put up on the TDIL Data Center.

3) PRSG recommended release of the next installment of the GIA to Consortium Leader Goa University subject to receipt of the Compiled Utilization Certificate (UC) for the released grants.

Follow-up action taken – The Compiled U.C. was submitted by the Consortium Leader (CL) on 31st January 2012. DeitY subsequently released the second-year funds to the CL on 11th April 2012, and the CL released them to all members depending on the utilization of their first-year funds.

Second PRSG held on 24th July 2012 at University of Hyderabad –

1) PRSG recommended the extension of project duration till 31st December 2012.

Follow-up action taken – The project duration was extended till 31st December 2012 and the new deliverables set were as under –
• Linking and validation of minimum 27,000 synsets by each member
• Sense marking of minimum 1,00,000 words
• Testing and documenting the tools and utilities developed

2) PRSG recommended release of the balance amount to the consortium leader after submission of the Compiled U.C.

Follow-up action taken – There was sufficient overall balance available with the consortium members and also scope for enhancement of the WordNet work. Hence it was requested to consider the extension for the period till 31st March 2013 instead of 31st December 2012. The request was accepted.

3. Present Work Status

Synset Linking Details

All the seven language groups of the Indradhanush consortium have completed the set target of minimum 27,000 synsets. The details are given below –

Language    Noun    Verb    Adjective  Adverb  Total
Hindi       28227   3098    6075       460     37860
Bengali     27281   2804    5815       445     36345
Gujarati    24896   2805    5828       445     33974
Kashmiri    17959   2354    6382       305     27000
Konkani     22976   2991    5689       474     32130
Odia        27216   2418    5273       377     35284
Punjabi     18898   2836    5828       443     28005
Urdu        21595   2800    5787       443     30625

Table 3.1: Synset Linkage Status
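The row totals in Table 3.1 and the 27,000-synset target can be cross-checked mechanically. The following is a minimal sketch in Python; the figures are copied from the table, while the names `SYNSETS`, `TARGET` and `check_row` are illustrative and not part of any consortium tool.

```python
# Figures copied from Table 3.1: (noun, verb, adjective, adverb, reported total).
SYNSETS = {
    "Hindi":    (28227, 3098, 6075, 460, 37860),
    "Bengali":  (27281, 2804, 5815, 445, 36345),
    "Gujarati": (24896, 2805, 5828, 445, 33974),
    "Kashmiri": (17959, 2354, 6382, 305, 27000),
    "Konkani":  (22976, 2991, 5689, 474, 32130),
    "Odia":     (27216, 2418, 5273, 377, 35284),
    "Punjabi":  (18898, 2836, 5828, 443, 28005),
    "Urdu":     (21595, 2800, 5787, 443, 30625),
}

TARGET = 27000  # minimum number of linked synsets per language, as stated in the report


def check_row(noun, verb, adjective, adverb, reported_total):
    """Return (total matches the sum of the parts, target reached)."""
    computed = noun + verb + adjective + adverb
    return computed == reported_total, computed >= TARGET


results = {lang: check_row(*row) for lang, row in SYNSETS.items()}

for lang, (total_ok, target_met) in results.items():
    print(f"{lang:10s} total consistent: {total_ok}  target met: {target_met}")
```

With the figures as printed, every row total equals the sum of its four part-of-speech counts and every language reaches the target.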

Sense Marking Details

The details of sense marking for all the languages of the Indradhanush consortium are as under –

Language  Corpus name                                        No. of files  No. of files used  Total no.  No. of sense   WordNet
                                                             collected     for sense marking  of words   marked words   coverage
Bengali   Newspaper (Anandabazar patrika)                    9             9                  163360     38637          23.65%
Gujarati  Gujarati News corpus                               101           101                337094     112884         33.49%
Kashmiri  ‘So:n Mira:s’ Kashmiri weekly newspaper            350           350                98350      42290          43.00%
Konkani   ‘Sunaparant’ Konkani daily newspaper               3433          625                213415     103456         48.48%
Odia      Newspaper (Sambad) and Articles                    135           135                236125     100285         42.27%
Punjabi   Online Articles, News Text, Stories                98            98                 216878     93279          43.01%
Urdu      Newspaper: “Jang urdu”, “Nawai waqt” & “BBC urdu”  240           10                 110000     50171          45.61%

Table 3.2: Sense Marking Status
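The WordNet coverage column in Table 3.2 appears to be the number of sense marked words divided by the total number of words, expressed as a percentage. The report does not state the formula, so this reading is an assumption. A minimal Python sketch that recomputes the column from the printed counts follows; the names `SENSE_MARKING` and `coverage_percent` are illustrative only.

```python
# Figures copied from Table 3.2: (total no. of words, no. of sense marked words).
SENSE_MARKING = {
    "Bengali":  (163360, 38637),
    "Gujarati": (337094, 112884),
    "Kashmiri": (98350, 42290),
    "Konkani":  (213415, 103456),
    "Odia":     (236125, 100285),
    "Punjabi":  (216878, 93279),
    "Urdu":     (110000, 50171),
}


def coverage_percent(total_words, marked_words):
    """Assumed definition: sense marked words as a percentage of all words."""
    return round(100.0 * marked_words / total_words, 2)


coverage = {
    lang: coverage_percent(total, marked)
    for lang, (total, marked) in SENSE_MARKING.items()
}

for lang, pct in coverage.items():
    print(f"{lang:10s} {pct:.2f}%")
```

Under this assumed definition, the recomputed values agree with the table to two decimal places for every language except Odia, where 100285 / 236125 gives 42.47% against the printed 42.27%; one of the printed Odia figures may contain a typographical error.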

4. Tools developed

• Synset Categorization Tool – by IIT Bombay
  o To choose common linkable synsets across all languages by classifying them as Universal, Pan-Indian, etc.
  o Released for use by consortium members

• Synset Creation Tool – by IIT Bombay
  o An offline interface to create target language synsets by using Hindi language synsets as source.
  o Released for use by consortium members

• Sense Marker Tool – by IIT Bombay
  o To find the synset coverage of a WordNet.
  o Released for use by consortium members

• Generic Stemmer for Indian Languages – by IIT Bombay
  o To find the possible stems of a given word
  o Released for use by consortium members (http://www.cfilt.iitb.ac.in/~bornali/generic_stemmer/index.php )

• WordNet Linkage Tool – by IIT Bombay
  o To link Hindi WordNet and English WordNet; uses 13 different heuristics to automatically identify the top 5 English synsets for a given Hindi synset.
  o Released for use by consortium members

• Word Sense Disambiguation Tool – by IIT Bombay
  o Provides a single access point to 9 different state-of-the-art word sense disambiguation algorithms
  o Released for use by consortium members

• WordNet CMS – v1.0, v2.0 – by Goa University
  o The WordNet Content Management System (CMS) allows creation of WordNet websites with a good user interface and the desired functionalities in a very short time for many Indian languages.
  o Tested and documentation available
  o Released for use by consortium members (http://indradhanush.unigoa.ac.in/public/downloadTools/downloadTools.php )

• CSS Manager Tool v1.0 – by Goa University
  o Centralized web based tool to assist creation of Concept Specific Synsets (CSS) and manage their linkages to other Indian language WordNets.
  o Tested and documentation available
  o Released for use by consortium members (http://indradhanush.unigoa.ac.in/conceptspace/ )

• Lexical Relation Creation Web Based Tool – by Thapar University, Patiala
  o Tool to verify and create lexical relations in the WordNet
  o This tool is under development

5. Utilities developed

• Sense Marking Statistic Finder Utility – by Goa University
  o Utility to find coverage statistics of the sense marked corpus.
  o Tested and documentation available

• Synset Merger Utility – by Goa University
  o Utility to merge different synset files into one single file.
  o Tested and documentation available

6. Websites and Computational Resources developed

• Indradhanush WordNet Consortium Website v1.0 ( http://indradhanush.unigoa.ac.in/ )

• Bengali WordNet Website v1.0 ( http://www.isical.ac.in/~lru/wordnetnew/ )
• Gujarati WordNet Website v1.0 ( http://www.cfilt.iitb.ac.in/gujarati/ )
• Kashmiri WordNet Website v1.0 ( http://indradhanush.unigoa.ac.in/kashmiriwordnet/ )
• Konkani WordNet Website v2.0 ( http://konkaniwordnet.unigoa.ac.in/ )
• Odia WordNet Website v1.0 ( http://indradhanush.unigoa.ac.in/odiawordnet )
• Punjabi WordNet Website v1.0 ( http://punjabiwordnet.com/ )
• Urdu WordNet Website v1.0 ( http://indradhanush.unigoa.ac.in/urduwordnet )

• IndoWordNet Database v1.0, v2.0, v3.0
  o Relational database structure to store WordNet data and relationships
  o Tested and documentation available
  o Released for use by consortium members (http://indradhanush.unigoa.ac.in/public/downloadTools/downloadTools.php )

• IndoWordNet API – v1.0, v2.0, v3.0 – by Goa University
  o The IndoWordNet Application Programming Interface (IWAPI) provides access to the WordNet resources independent of the underlying storage technology.
  o Implemented in Java as well as in PHP
  o Tested and documentation available
  o Released for use by consortium members (http://indradhanush.unigoa.ac.in/public/downloadTools/downloadTools.php )
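The idea of access that is independent of the underlying storage technology can be illustrated with a small sketch. It is written in Python for brevity and is not the IWAPI itself, which is implemented in Java and PHP. Every name below (`Synset`, `SynsetStore`, `InMemoryStore`, `glosses_for`) and the toy English data are hypothetical and chosen only for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Synset:
    """Hypothetical minimal synset record (not the IWAPI data model)."""
    synset_id: int
    words: Tuple[str, ...]
    gloss: str


class SynsetStore(ABC):
    """Hypothetical storage-neutral interface that client code depends on."""

    @abstractmethod
    def get_synset(self, synset_id: int) -> Synset:
        ...

    @abstractmethod
    def find_by_word(self, word: str) -> List[Synset]:
        ...


class InMemoryStore(SynsetStore):
    """One possible backend; a database- or file-backed class could replace it."""

    def __init__(self, synsets: Iterable[Synset]) -> None:
        self._by_id: Dict[int, Synset] = {s.synset_id: s for s in synsets}

    def get_synset(self, synset_id: int) -> Synset:
        return self._by_id[synset_id]

    def find_by_word(self, word: str) -> List[Synset]:
        return [s for s in self._by_id.values() if word in s.words]


def glosses_for(store: SynsetStore, word: str) -> List[str]:
    """Client code: uses only the interface, never a concrete backend."""
    return [s.gloss for s in store.find_by_word(word)]


# Toy data, made up for this example.
toy_store = InMemoryStore([
    Synset(1, ("bank", "riverbank"), "sloping land beside a body of water"),
    Synset(2, ("bank",), "a financial institution"),
    Synset(3, ("river",), "a large natural stream of water"),
])

print(glosses_for(toy_store, "bank"))
```

Because `glosses_for` is written against the abstract interface, swapping the in-memory backend for a relational one would not require changes to client code, which is the property the API description above refers to.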

7. Financial Details

Financial Details of the Indradhanush Project Consortium as on 2nd February 2013

• Total funds received by Goa University from DeitY - Rs. 281,83,413
• Total interest earned by all institutes on the received funds - Rs. 4,99,687
Total amount including interest earned - Rs. 286,83,100

• Total amount spent by all institutes - Rs. 267,46,182
• Total committed expenditure of all institutes - Rs. 5,63,673
Total amount spent including the committed expenditure - Rs. 273,09,855

• Total balance [Rs. 286,83,100 – Rs. 273,09,855] - Rs. 13,73,245
• Total balance with DeitY [Rs. 299,52,000 – Rs. 286,83,100] - Rs. 12,68,900

Net balance with the Consortium (including the unreleased balance with DeitY) - Rs. 26,42,145
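The arithmetic behind these figures can be reproduced with a short Python sketch. The amounts are copied from the list above and written as plain integers in rupees. The variable names are illustrative, and treating Rs. 299,52,000 as the reference amount held against DeitY is an assumption, since the report does not label that figure.

```python
# Amounts in rupees, copied from the financial details above.
funds_received = 28183413     # Rs. 281,83,413 received by Goa University from DeitY
interest_earned = 499687      # Rs. 4,99,687 interest earned by all institutes
amount_spent = 26746182       # Rs. 267,46,182 spent by all institutes
committed = 563673            # Rs. 5,63,673 committed expenditure
reference_amount = 29952000   # Rs. 299,52,000 (assumption: unlabelled in the report)

total_available = funds_received + interest_earned
total_utilised = amount_spent + committed
balance_with_members = total_available - total_utilised
balance_with_deity = reference_amount - total_available
net_balance = balance_with_members + balance_with_deity

print("Total including interest :", total_available)
print("Spent plus committed     :", total_utilised)
print("Balance with members     :", balance_with_members)
print("Balance with DeitY       :", balance_with_deity)
print("Net balance              :", net_balance)
```

Each computed value matches the corresponding figure stated in the report.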

Budget Head Wise and Institute Wise financial details are placed at Appendix A and Appendix B respectively.

8. Manpower Details

The details of the total manpower trained under the Indradhanush Project Consortium in different roles are as under –

Manpower                    Number
Consortium Leader           1
Co-Consortium Leader        1
Principal Investigator      8
Co-Principal Investigator   9
Project Manager             2
Office Assistant            3
Senior Linguist             11
Lexicographer               32
Computer Scientist          23
Research Scholar            4
Consultant                  7
Total                       101

Table 8.1: Manpower Details

Institute Wise Manpower details are placed at Appendix C

9. Equipment Details

The equipment purchased by the Indradhanush Project Consortium from the sanctioned funds is as under –

Equipment               Number
Desktop                 22
Laptop                  24
Netbook                 2
Server                  1
Scanner                 1
Printer                 5
UPS                     5
LCD Projector           2
Hard Disk               1
DVD Writer              2
Wi-Fi dongle            2
LCD Projector Screen    1
Adapter                 1
KVM Switch              1
Total                   70

Table 9.1: Equipment Details

10. Future Plan

An extension is requested for the period till 31st July 2013. The following set of additional deliverables will be submitted at the end of this period –

1. Report on the preliminary study carried out to give a Semantic Web orientation to the Indradhanush WordNet.
2. Each member will create an additional 2,000 to 5,000 new synsets to increase the coverage of their WordNets.
3. Each member will sense mark an additional 25,000 to 50,000 words from the Newspaper Corpus.
4. All tools will be documented, tested and uploaded on the Indradhanush WordNet Website http://indradhanush.unigoa.ac.in/ (beta version hosted) at Goa University, and the link to it will be put up on the TDIL Data Center.

The balance amount is requested from DeitY to meet the expenses for the project extension period till 31st July 2013.

11. Workshops / Conferences organized and participated

i. DDU organized the 1st Indradhanush WordNet Workshop for Indradhanush Consortium Members (1st to 3rd October 2010)
ii. Goa University organized the 2nd Indradhanush WordNet Workshop for Indradhanush Consortium Members (8th to 10th August 2011)
iii. IITB organized the 3rd Indradhanush WordNet Workshop for IndoWordNet Members (1st to 3rd January 2012)
iv. Thapar University organized the 4th Indradhanush WordNet Workshop (Computational Workshop) for IndoWordNet Members (23rd to 25th March 2012)
v. University of Hyderabad organized the 5th Indradhanush WordNet Workshop for IndoWordNet Members (23rd to 25th July 2012)
vi. Members participated and presented papers in COLING 2012 organized at IIT Bombay (8th December 2012 to 16th December 2012)
vii. IITB organized a one day workshop on 16th December 2013 for all the PIs / Co-PIs of the Indradhanush Consortium.
viii. Telecon Meetings: Regular telecon meetings were held on a monthly basis to discuss the progress made by the members and to address the problems faced by each member.

Publications of the Indradhanush Consortium members

i. Challenges in Multilingual Domain-Specific Sense-marking, Jaya Saraswati, Rajita Shukla, Sonal Pathade, Tina Solanki and Pushpak Bhattacharyya, 5th International Conference on Global Wordnet (GWC2010), Mumbai, Jan, 2010.

ii. Cost and Benefit of Using WordNet Senses for Sentiment Analysis, Balamurali A.R., Aditya Joshi and Pushpak Bhattacharyya, Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May, 2012.

iii. A Study of the Sense Annotation Process: Man v/s Machine, Arindam Chatterjee, Salil Joshi, Pushpak Bhattacharyya, Diptesh Kanojia and Akhlesh Meena, International Conference on Global Wordnets (GWC 2011), Matsue, Japan, Jan, 2012.

iv. Verbal Roots in Wordnet, Anuja Ajotikar, Malhar Kulkarni and Pushpak Bhattacharyya, International Conference on Global Wordnets (GWC 2011), Matsue, Japan, Jan, 2012.

v. Introduction to Gujarati Wordnet, Brijesh Bhatt, Dinesh Chauhan, Pushpak Bhattacharyya, C.K. Bhensdadia and Kirit Patel, International Conference on Global Wordnets (GWC 2011), Matsue, Japan, Jan, 2012.

vi. IndoWordnet and its Linking with Ontology, Brijesh Bhatt and Pushpak Bhattacharyya, Conference on Natural Language Processing (ICON 2011), Chennai, December, 2011.

vii. It takes two to Tango: A Bilingual Unsupervised Approach for Estimating Sense Distributions using Expectation Maximization, Mitesh Khapra, Salil Joshi and Pushpak Bhattacharyya, 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, November 2011.

viii. Insights on Hindi Wordnet Coming from the IndoWordNet, Laxmi Kashyap, Salil Rajeev Joshi, & Pushpak Bhattacharyya, 8th International Conference on NLP (ICON 2010), IIT Kharagpur

ix. Cross-Domain Sentiment Tagging Using Meta-Classifier and a High Accuracy In-Domain Classifier, Balamurali A.R., Debraj Manna and Pushpak Bhattacharyya, 8th International Conference on NLP (ICON 2010), IIT Kharagpur

x. Emotion Analysis of Internet Chat, Shashank and Pushpak Bhattacharyya, 8th International Conference on NLP (ICON 2010), IIT Kharagpur

xi. Probabilistic Approach for Automatic Last Suffix Extraction, Vasudevan N. and Pushpak Bhattacharyya, 8th International Conference on NLP (ICON 2010), IIT Kharagpur

xii. A Fall-Back Strategy for Sentiment Analysis in Hindi: A Case Study, Aditya Joshi, Balamurali A.R. and Pushpak Bhattacharyya, 8th International Conference on NLP (ICON 2010), IIT Kharagpur

xiii. Compound Verbs in a Multilingual Indo Wordnet, Debasri Chakrabarti and Pushpak Bhattacharyya, 8th International Conference on NLP (ICON 2010), IIT Kharagpur

xiv. Query Disambiguation and Expansion Using Pseudo Relevance Feedback with WordNet, Nishikant Dhanuka and Pushpak Bhattacharyya, 8th International Conference on NLP (ICON 2010), IIT Kharagpur

xv. Experiences in Building the Konkani WordNet using the Expansion Approach, Shantaram Walawalikar, Shilpa Desai, Ramdas Karmali, Sushant Naik, Damodar Ghanekar, Chandralekha D’Souza and Jyoti Pawar, 5th International Conference on Global WordNet (GWC2010), Mumbai, Jan 2010.

xvi. Experiences and Challenges in Synset Linkages for Building Konkani WordNet and Augmenting it with Language Specific Contents, Shantaram Walawalikar, Ramdas Karmali, Sushant Naik, Kapila Desai, Shilpa Desai, & Damodar Ghanekar, 2nd National Level Workshop of Indradhanush WordNet Consortium, Goa University.

xvii. WordNet Website Development And Deployment Using Content Management Approach, Neha Prabhugaonkar, Apurva Nagvenkar, Venkatesh Prabhu and Ramdas Karmali, COLING2012 Main Conference.

xviii. An Efficient Database Design for IndoWordNet Development Using Hybrid Approach, Venkatesh Prabhu, Shilpa Desai, Hanumant Redkar, Neha Prabhugaonkar, Apurva Nagvenkar and Ramdas Karmali, COLING2012 for 3rd Workshop on South and Southeast Asian Natural Language Processing (SANLP)

xix. IndoWordNet Application Programming Interfaces, Neha Prabhugaonkar, Apurva Nagvenkar and Ramdas Karmali, COLING2012 for 3rd Workshop on South and Southeast Asian Natural Language Processing (SANLP)

xx. Challenges Faced in Language Specific Synset Creation: Lack of Explicit Standards/Guidelines, Aadil Amin Kak, Farooq Ahmad Sheikh, Nazima Mehdi, & Muneera Hakim, 2nd National Level Workshop of Indradhanush WordNet Consortium, Goa University.

xxi. Problems and Issues in Synset Linkage in Kashmiri, Aadil Amin Kak, Nazima Mehdi, Farooq Ahmad, Mansoor Farooq, & Muneera Hakim, 2nd National Level Workshop of Indradhanush WordNet Consortium, Goa University.

xxii. Towards a Combined Attempt at Simultaneous Synset Linkage and Expansion, Aadil Amin Kak, Nazima Mehdi, Aadil A. Lawaye, Farooq Ahmad Sheikh, Muneera Hakim, 8th International Conference on NLP (ICON 2010), IIT Kharagpur

xxiii. Synset Categorization for Synset Creation: A proposal, Aadil Amin Kak, Nazima Mehdi, Aadil A. Lawaye, Mansoor Farooq, Farooq A. Shiekh, 8th International Conference on NLP (ICON 2010), IIT Kharagpur

xxiv. Wordnet for Indian Languages: Some Issues, Panchanan Mohanty, 5th International Conference on Global Wordnet (GWC2010), Mumbai, Jan, 2010.

xxv. On Polysemy in Tamil and other Indian Languages, Panchanan Mohanty and S. Arulmozi, 5th International Conference on Global Wordnet (GWC2010), Mumbai, Jan, 2010.

xxvi. Issues in the Creation of Synsets in Odia: A Report, Panchanan Mohanty, Ramesh C. Malik, & Bhimasena Bhol, 2nd National Level Workshop of Indradhanush WordNet Consortium, Goa University.

xxvii. Punjabi WordNet Relations and Categorization of Synsets, Rupinderdeep Kaur, R.K Sharma, Suman Preet, Parteek Bhatia, 8th International Conference on NLP (ICON 2010), IIT Kharagpur

xxviii. Language Specific Synsets of Punjabi WordNet, Suman Preet, 2nd National Level Workshop of Indradhanush WordNet Consortium, Goa University.

xxix. Problems and Challenges faced in Linking Punjabi WordNet, Suman Preet, 2nd National Level Workshop of Indradhanush WordNet Consortium, Goa University.

xxx. Synset Structure of Punjabi WordNet, Suman Preet

xxxi. Challenges in Creation of Punjabi WordNet and Bilingual Dictionaries , R.K. Sharma, Parteek Bhatia, & Rekha Rattan, 2nd National Level Workshop of Indradhanush WordNet Consortium, Goa University.

xxxii. Language Specific Synset: Experiences of the Urdu Team, Rizwanur Rahman & Mazhar Mehdi Hussain

xxxiii. Synset Linking: Experiences and Challenges Faced by the Urdu Team, Rizwanur Rahman & Mazhar Mehdi Hussain

xxxiv. Polysemy and Homonymy: A Conceptual Labyrinth, Niladri Sekhar Dash, Workshop on IndoWordnet-2010, IIT Kharagpur, 8th December, 2010.

xxxv. Problems in Defining Language Specific Synsets (LSS) in Bengali for the Indradhanush IndoWordnet: Some Theoretical and Practical Issues, Niladri Sekhar Dash, 2nd National Level Workshop of Indradhanush WordNet Consortium, Goa University.

xxxvi. Problems and Challenges in Translation of Hindi Synsets into Bengali in Indradhanush WordNet, Niladri Sekhar Dash, Abhisek Sarkar, Dipsikha Bose, and Soumi Banerjee, 2nd National Level Workshop of Indradhanush WordNet Consortium, Goa University.

xxxvii. Introduction to Gujarati WordNet, C. K. Bhensdadia, Brijesh Bhatt, Pushpak Bhattacharyya, IndoWordNet Workshop, IIT Kharagpur, Dec. 2010 (ICON2010)

xxxviii. Domain specific ontology extractor for Indian languages, COLING2012 for 10th Workshop of Asian Language Resources

xxxix. WordNet: Issues related to language specific synsets, Brijesh Bhatt, C. K. Bhensdadia, Pushpak Bhattacharyya, Dinesh Chauhan, Kirit Patel, 2nd National Level Workshop of Indradhanush WordNet Consortium, Goa University.
