A Report on the INDRADHANUSH WORDNET CONSORTIUM
1. Introduction
The title of the project is “Development of Indradhanush: an Integrated WordNet for Bengali, Gujarati, Kashmiri, Konkani, Odia, Punjabi and Urdu”. It is executed by a consortium of nine academic institutions from all over India, whose details are given below –
1. Goa University, Goa (Consortium Leader and Konkani WordNet development)
2. IIT Bombay, Powai (Co-Consortium leader)
3. Indian Statistical Institute, Kolkata (Bengali WordNet development)
4. University of Kashmir, Srinagar (Kashmiri WordNet development)
5. University of Hyderabad, Hyderabad (Odia WordNet development)
6. Punjabi University, Patiala (Punjabi WordNet development)
7. Thapar University, Patiala (Punjabi WordNet development)
8. Dharmsinh Desai University, Nadiad (Gujarati WordNet development)
9. Jawaharlal Nehru University, New Delhi (Urdu WordNet development)

The Indradhanush WordNet Consortium comes under the umbrella of IndoWordNet along with two other consortiums, namely the North East WordNet Consortium, which works on “Development of NE WordNet: An Integrated WordNet for North East Languages: Assamese, Bodo, Manipuri and Nepali”, and the Dravidian WordNet Consortium, which works on “Development of Dravidian WordNet: An Integrated WordNet for Telugu, Tamil, Kannada and Malayalam”. All these WordNets are developed at different institutes in India and co-ordinated by IIT Bombay. These WordNets are linked to Hindi WordNet and amongst each other.

This document has been prepared to give the details of the Indradhanush WordNet Project Consortium to the PRSG members before the PRSG meeting scheduled on 30th April 2013 at University of Mysore. It starts with brief details of the PRSG meetings held so far in section 2, followed by the current work status in section 3. Sections 4, 5 and 6 list the tools, utilities and websites developed by the members. The financial details are presented in section 7.
Sections 8 and 9 give details of the total manpower trained and the total equipment purchased under this consortium, respectively. The future work plan is presented in brief under section 10, which is followed by the details of workshops/conferences organised by the members. The list of publications is given under section 11. There are three appendices, A, B and C, which contain additional financial and manpower details.

2. PRSG Details
The details of the recommendations of the first and second PRSG meetings and the action taken are presented below.

First PRSG held on 9th August 2011 at Goa University
1) PRSG expressed that statistics of word senses in a large corpus may resolve the problem of identifying the exact sense, and suggested that a large corpus should be used for determining the sense of the words.
   Follow-up action taken – Newspaper corpus has been collected and used by all the institutes for sense marking. The details of sense marking are presented in the Present Work Status section under Sense Marking Details.
2) All the research papers on WordNet published by the consortium should be made available to the Chairman PRSG and DeitY for uploading on the TDIL Data Center.
   Follow-up action taken – Proceedings of the Goa Workshop have been submitted to the Chairman of PRSG for feedback and will be published. All published papers will be uploaded on the Indradhanush WordNet Website, and the link to this site will be put up on the TDIL Data Center.
3) PRSG recommended release of the next installment of the GIA to the Consortium Leader, Goa University, subject to receipt of the Compiled Utilization Certificate (UC) for the released grants.
   Follow-up action taken – The Compiled U.C was submitted by the Consortium Leader (CL) on 31st January 2012. DeitY subsequently released the second-year funds to the CL on 11th April 2012, and the CL released them to all members depending on the utilization of their first-year funds.
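Recommendation 1 of the first PRSG concerns corpus statistics of word senses. As a minimal sketch only, assuming a sense-marked corpus is available as (word, synset id) pairs, the sense counts can be tallied and the most frequently marked sense of a word read off. The pair format, the function names and the toy data are hypothetical and are not taken from the consortium's tools or file formats.

```python
from collections import Counter, defaultdict


def sense_frequencies(tagged_tokens):
    """Count how often each word is marked with each synset id.

    tagged_tokens: iterable of (word, synset_id) pairs (hypothetical format).
    """
    counts = defaultdict(Counter)
    for word, synset_id in tagged_tokens:
        counts[word][synset_id] += 1
    return counts


def most_frequent_sense(counts, word):
    """Return the most frequently marked synset id for word, or None if unseen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]


# Toy data for illustration only; the ids are invented.
sample = [("bank", 101), ("bank", 101), ("bank", 205), ("river", 310)]
counts = sense_frequencies(sample)
print(most_frequent_sense(counts, "bank"))
```

Such counts only become reliable when the sense-marked corpus is large, which is the point of the recommendation.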
Second PRSG held on 24th July 2012 at University of Hyderabad
1) PRSG recommended the extension of the project duration till 31st December 2012.
   Follow-up action taken – The project duration was extended till 31st December 2012 and the new deliverables set were as under –
   • Linking and validation of minimum 27,000 synsets by each member
   • Sense marking of minimum 1,00,000 words
   • Testing and documenting the tools and utilities developed
2) PRSG recommended release of the balance amount to the consortium leader after submission of the Compiled U.C.
   Follow-up action taken – There was sufficient overall balance available with the consortium members and also scope for enhancement of the WordNet work. Hence it was requested to consider the extension for the period till 31st March 2013 instead of 31st December 2012. The request was accepted.

3. Present Work Status
Synset Linking Details
All the seven language groups of the Indradhanush consortium have completed the set target of minimum 27,000 synsets. The details are given below –

Language   Noun    Verb   Adjective   Adverb   Total
Hindi      28227   3098   6075        460      37860
Bengali    27281   2804   5815        445      36345
Gujarati   24896   2805   5828        445      33974
Kashmiri   17959   2354   6382        305      27000
Konkani    22976   2991   5689        474      32130
Odia       27216   2418   5273        377      35284
Punjabi    18898   2836   5828        443      28005
Urdu       21595   2800   5787        443      30625
Table 3.1: Synset Linkage Status

Sense Marking Details
The details of sense marking for all the languages of the Indradhanush consortium are as under –

Language   Corpus name                                          No. of files   No. of files used   Total no.   No. of sense   WordNet
                                                                collected      for sense marking   of words    marked words   coverage
Bengali    Newspaper (Anandabazar patrika)                      9              9                   163360      38637          23.65%
Gujarati   Gujarati News corpus                                 101            101                 337094      112884         33.49%
Kashmiri   ‘So:n Mira:s’ Kashmiri weekly newspaper              350            350                 98350       42290          43.00%
Konkani    ‘Sunaparant’ Konkani daily newspaper                 3433           625                 213415      103456         48.48%
Odia       Newspaper (Sambad) and Articles                      135            135                 236125      100285         42.27%
Punjabi    Online Articles, News Text, Stories                  98             98                  216878      93279          43.01%
Urdu       Newspaper: “Jang urdu”, “Nawai waqt” & “BBC urdu”    240            10                  110000      50171          45.61%
Table 3.2: Sense Marking Status

4. Tools developed
• Synset Categorization Tool – by IIT Bombay
  o To choose common linkable synsets across all languages by classifying them as Universal, Pan-Indian, etc.
  o Released for use by consortium members
• Synset Creation Tool – by IIT Bombay
  o An offline interface to create target language synsets by using Hindi language synsets as source.
  o Released for use by consortium members
• Sense Marker Tool – by IIT Bombay
  o To find the synset coverage of a WordNet.
  o Released for use by consortium members
• Generic Stemmer for Indian Languages – by IIT Bombay
  o To find the possible stems of a given word
  o Released for use by consortium members (http://www.cfilt.iitb.ac.in/~bornali/generic_stemmer/index.php)
• WordNet Linkage Tool – by IIT Bombay
  o To link Hindi WordNet and English WordNet; uses 13 different heuristics to automatically identify the top 5 English synsets for a given Hindi synset.
  o Released for use by consortium members
• Word Sense Disambiguation Tool – by IIT Bombay
  o Provides a single access point to 9 different state-of-the-art word sense disambiguation algorithms
  o Released for use by consortium members
• WordNet CMS – v1.0, v2.0 – by Goa University
  o The WordNet Content Management System (CMS) allows creation of WordNet websites with a good user interface and the desired functionalities in a very short time for many Indian languages.
  o Tested and documentation available
  o Released for use by consortium members (http://indradhanush.unigoa.ac.in/public/downloadTools/downloadTools.php)
• CSS Manager Tool v1.0 – by Goa University
  o Centralized web-based tool to assist creation of Concept Specific Synsets (CSS) and manage their linkages to other Indian language WordNets.
  o Tested and documentation available
  o Released for use by consortium members (http://indradhanush.unigoa.ac.in/conceptspace/)
• Lexical Relation Creation Web Based Tool – by Thapar University, Patiala
  o Tool to verify and create lexical relations in the WordNet
  o This tool is under development

5. Utilities developed
• Sense Marking Statistic Finder Utility – by Goa University
  o Utility to find coverage statistics of the sense marked corpus.
  o Tested and documentation available
• Synset Merger Utility – by Goa University
  o Utility to merge different synset files into one single file.
  o Tested and documentation available

6. Websites and Computational Resources developed
• Indradhanush WordNet Consortium Website v1.0 (http://indradhanush.unigoa.ac.in/)
• Bengali WordNet Website v1.0 (http://www.isical.ac.in/~lru/wordnetnew/)
• Gujarati WordNet Website v1.0 (http://www.cfilt.iitb.ac.in/gujarati/)
• Kashmiri WordNet Website v1.0 (http://indradhanush.unigoa.ac.in/kashmiriwordnet/)
• Konkani WordNet Website v2.0 (http://konkaniwordnet.unigoa.ac.in/)
• Odia WordNet Website v1.0 (http://indradhanush.unigoa.ac.in/odiawordnet)
• Punjabi WordNet Website v1.0 (http://punjabiwordnet.com/)
• Urdu WordNet Website v1.0 (http://indradhanush.unigoa.ac.in/urduwordnet)
• IndoWordNet Database v1.0, v2.0, v3.0
  o Relational database structure to store WordNet data and relationships
  o Tested and documentation available
  o Released for use by consortium members (http://indradhanush.unigoa.ac.in/public/downloadTools/downloadTools.php)
• IndoWordNet API – v1.0, v2.0, v3.0 – by Goa University
  o The IndoWordNet Application Programming Interface (IWAPI) provides access to the WordNet resources independent of the underlying storage technology.
  o Implemented in Java as well as in PHP
  o Tested and documentation available
  o Released for use by consortium members (http://indradhanush.unigoa.ac.in/public/downloadTools/downloadTools.php)

7. Financial Details
Financial details of the Indradhanush Project Consortium as on 2nd February 2013
• Total funds received by Goa University from DeitY - Rs. 281,83,413
• Total interest earned by all institutes on the received funds - Rs. 4,99,687
  Total amount including interest earned - Rs. 286,83,100
• Total amount spent by all institutes - Rs. 267,46,182
• Total committed expenditure of all institutes - Rs. 5,63,673
  Total amount spent including the committed expenditure - Rs. 273,09,855
• Total balance [Rs.