WILDRE Workshop on Indian Language Data
Workshop on Indian Language and Data: Resources and Evaluation

Workshop Programme
21 May 2012

08:30-08:40 – Welcome by Workshop Chairs
08:40-08:55 – Inaugural Address by Mrs. Swaran Lata, Head, TDIL, Dept of IT, Govt of India
08:55-09:10 – Address by Dr. Khalid Choukri, ELDA CEO
09:10-09:45 – Keynote Lecture by Prof Pushpak Bhattacharyya, Dept of CSE, IIT Bombay

09:45-10:30 – Paper Session I
Chairperson: Sobha L
    Somnath Chandra, Swaran Lata and Swati Arora, Standardization of POS Tag Set for Indian Languages based on XML Internationalization best practices guidelines
    Ankush Gupta and Kiran Pala, A Generic and Robust Algorithm for Paragraph Alignment and its Impact on Sentence Alignment in Parallel Corpora
    Malarkodi C.S and Sobha Lalitha Devi, A Deeper Look into Features for NE Resolution in Indian Languages

10:30-11:00 – Coffee break + Poster Session
Chairperson: Monojit Choudhury
    Akilandeswari A, Bakiyavathi T and Sobha Lalitha Devi, ‘atu’ Difficult Pronominal in Tamil
    Subhash Chandra, Restructuring of Paninian Morphological Rules for Computer processing of Sanskrit Nominal Inflections
    H. Mamata Devi, Th. Keat Singh, Bindia L and Vijay Kumar, On the Development of Manipuri-Hindi Parallel Corpus
    Madhav Gopal, Annotating Bundeli Corpus Using the BIS POS Tagset
    Madhav Gopal and Girish Nath Jha, Developing Sanskrit Corpora Based on the National Standard: Issues and Challenges
    Ajit Kumar and Vishal Goyal, Practical Approach For Developing Hindi-Punjabi Parallel Corpus
    Sachin Kumar, Girish Nath Jha and Sobha Lalitha Devi, Challenges in Developing Named Entity Recognition System for Sanskrit
    Swaran Lata and Swati Arora, Exploratory Analysis of Punjabi Tones in relation to orthographic characters: A Case Study
    Diwakar Mishra, Kalika Bali and Girish Nath Jha, Grapheme-to-Phoneme converter for Sanskrit Speech Synthesis
    Aparna Mukherjee and Alok Dadhekar, Phonetic Dictionary for Indian English
    Sibansu Mukhapadyay, Tirthankar Dasgupta and Anupam Basu, Development of an Online Repository of Bangla Literary Texts and its Ontological Representation for Advance Search Options
    Kumar Nripendra Pathak, Challenges in Sanskrit-Hindi Adjective Mapping
    Nikhil Priyatam Pattisapu, Srikanth Reddy Vadepally and Vasudeva Varma, Hindi Web Page Collection tagged with Tourism Health and Miscellaneous
    Arulmozi S, Balasubramanian G and Rajendran S, Treatment of Tamil Deverbal Nouns in BIS Tagset
    Silvia Staurengo, TschwaneLex Suite (5.0.0.414) Software to Create Italian-Hindi and Hindi-Italian Terminological Database on Food, Nutrition, Biotechnologies and Safety on Nutrition: a Case Study

11:00-12:00 – Paper Session II
Chairperson: Kalika Bali
    Shahid Mushtaq Bhat and Richa Srishti, Building Large Scale POS Annotated Corpus for Hindi & Urdu
    Vijay Sundar Ram R, Bakiyavathi T, Sindhuja Gopalan, Amudha K and Sobha Lalitha Devi, Tamil Clause Boundary Identification: Annotation and Evaluation
    Manjira Sinha, Tirthankar Dasgupta and Anupam Basu, A Complex Network Analysis of Syllables in Bangla through SyllableNet
    Pinkey Nainwani, Blurring the demarcation between Machine Assisted Translation (MAT) and Machine Translation (MT): the case of English and Sindhi

12:00-12:40 – Panel discussion on "India and Europe - making a common cause in LTRs"
Coordinator: Nicoletta Calzolari
Panelists: Khalid Choukri, Joseph Mariani, Pushpak Bhattacharyya, Swaran Lata, Monojit Choudhury, Zygmunt Vetulani, Dafydd Gibbon

12:40-12:55 – Valedictory Address by Prof Nicoletta Calzolari, Director ILC-CNR, Italy
12:55-13:00 – Vote of Thanks

Editors
    Girish Nath Jha, Jawaharlal Nehru University, New Delhi
    Kalika Bali, Microsoft Research Lab India, Bangalore
    Sobha L, AU-KBC Research Centre, Anna University, Chennai

Workshop Organizers/Organizing Committee
    Girish Nath Jha, Jawaharlal Nehru University, New Delhi
    Kalika Bali, Microsoft Research Lab India, Bangalore
    Sobha L, AU-KBC Research Centre, Anna University, Chennai

Workshop Programme Committee
    A. Kumaran, Microsoft Research Lab India, Bangalore
    A. G. Ramakrishnan, IISc Bangalore
    Amba Kulkarni, University of Hyderabad
    Dafydd Gibbon, Universität Bielefeld, Germany
    Dipti Mishra Sharma, IIIT, Hyderabad
    Girish Nath Jha, Jawaharlal Nehru University, New Delhi
    Joseph Mariani, LIMSI-CNRS, France
    Kalika Bali, Microsoft Research Lab India, Bangalore
    Khalid Choukri, ELRA, France
    Monojit Choudhury, Microsoft Research Lab India, Bangalore
    Nicoletta Calzolari, ILC-CNR, Pisa, Italy
    Niladri Shekhar Dash, ISI Kolkata
    Shivaji Bandhopadhyah, Jadavpur University, Kolkata
    Sobha L, AU-KBC Research Centre, Anna University
    Soma Paul, IIIT, Hyderabad
    Umamaheshwar Rao, University of Hyderabad

Table of contents

1 Introduction (p. viii)
2 Standardization of POS Tag Set for Indian Languages based on XML Internationalization best practices guidelines (p. 1)
    Somnath Chandra, Swaran Lata and Swati Arora
3 A Generic and Robust Algorithm for Paragraph Alignment and its Impact on Sentence Alignment in Parallel Corpora (p. 18)
    Ankush Gupta and Kiran Pala
4 A Deeper Look into Features for NE Resolution in Indian Languages (p. 28)
    Malarkodi C.S and Sobha Lalitha Devi
5 ‘atu’ Difficult Pronominal in Tamil (p. 34)
    Akilandeswari A, Bakiyavathi T and Sobha Lalitha Devi
6 Restructuring of Paninian Morphological Rules for Computer processing of Sanskrit Nominal Inflections (p. 39)
    Subhash Chandra
7 On the Development of Manipuri-Hindi Parallel Corpus (p. 45)
    H. Mamata Devi, Th. Keat Singh, Bindia L and Vijay Kumar
8 Annotating Bundeli Corpus Using the BIS POS Tagset (p. 50)
    Madhav Gopal
9 Developing Sanskrit Corpora Based on the National Standard: Issues and Challenges (p. 57)
    Madhav Gopal and Girish Nath Jha
10 Practical Approach for Developing Hindi-Punjabi Parallel Corpus (p. 65)
    Ajit Kumar and Vishal Goyal
11 Challenges in Developing Named Entity Recognition System for Sanskrit (p. 70)
    Sachin Kumar, Girish Nath Jha and Sobha Lalitha Devi
12 Exploratory Analysis of Punjabi Tones in relation to orthographic characters: A Case Study (p. 76)
    Swaran Lata and Swati Arora
13 Grapheme-to-Phoneme converter for Sanskrit Speech Synthesis (p. 81)
    Diwakar Mishra, Kalika Bali and Girish Nath Jha
14 Phonetic Dictionary for Indian English (p. 89)
    Aparna Mukherjee and Alok Dadhekar
15 Development of an Online Repository of Bangla Literary Texts and its Ontological Representation for Advance Search Options (p. 93)
    Sibansu Mukhapadyay, Tirthankar Dasgupta and Anupam Basu
16 Challenges in Sanskrit-Hindi Adjective Mapping (p. 97)
    Kumar Nripendra Pathak
17 Hindi Web Page Collection tagged with Tourism Health and Miscellaneous (p. 102)
    Nikhil Priyatam Pattisapu, Srikanth Reddy Vadepally and Vasudeva Varma
18 Treatment of Tamil Deverbal Nouns in BIS Tagset (p. 106)
    Arulmozi S, Balasubramanian G and Rajendran S
19 TschwaneLex Suite (5.0.0.414) Software to Create Italian-Hindi and Hindi-Italian Terminological Database on Food, Nutrition, Biotechnologies and Safety on Nutrition: a Case Study (p. 111)
    Silvia Staurengo
20 Building Large Scale POS Annotated Corpus for Hindi & Urdu (p. 115)
    Shahid Mushtaq Bhat and Richa Srishti
21 Tamil Clause Boundary Identification: Annotation and Evaluation (p. 122)
    Vijay Sundar Ram R, Bakiyavathi T, Sindhuja Gopalan, Amudha K and Sobha Lalitha Devi
22 A Complex Network Analysis of Syllables in Bangla through SyllableNet (p. 131)
    Manjira Sinha, Tirthankar Dasgupta and Anupam Basu
23 Blurring the demarcation between Machine Assisted Translation (MAT) and Machine Translation (MT): the case of English and Sindhi (p. 139)
    Pinkey Nainwani

Author Index

Akilandeswari, A.: 34
Amudha, K.: 122
Arora, Swati: 1, 76
Arulmozi, S.: 106
Bakiyavathi, T.: 34, 122
Balasubramanian, G.: 106
Bali, Kalika: 81
Basu, Anupam: 93, 131
Bhat, Shahid Mushtaq: 115
Bindia, L: 45
Chandra, Somnath: 1
Chandra, Subhash: 39
Dadhekar, Alok: 89
Dasgupta, Tirthankar: 93, 131
Goyal, Vishal: 65
Gupta, Ankush: 18
Jha, Girish Nath: 57, 70, 81
Kumar, Ajit: 65
Kumar, Sachin: 70
Kumar, Vijay: 45
Lalitha Devi, Sobha: 28, 34, 70, 122
Madhav Gopal: 50, 57
Malarkodi, C.S.: 28
Mamata Devi, H.: 45
Mishra, Diwakar: 81
Mukhapadyay, Sibansu: 93
Mukherjee, Aparna: 89
Nainwani, Pinkey: 139
Pala, Kiran: 18
Pathak, Kumar Nripendra: 97
Pattisapu, Nikhil Priyatam: 102
Rajendran, S.: 106
Sindhuja, Gopalan: 122
Singh, Th. Keat: 45
Sinha, Manjira: 131
Srishti, Richa: 115
Staurengo, Silvia: 111
Swaran Lata: 1, 76
Vadepally, Srikanth Reddy: 102
Varma, Vasudeva: 102
Vijay Sundar Ram, R.: 122

Introduction

WILDRE, the first ‘Workshop on Indian Language Data: Resources and Evaluation’, is being organized in Istanbul, Turkey on 21st May, 2012 under the LREC platform. India has huge linguistic diversity and has seen concerted efforts from the Indian government and industry towards developing language resources. The European Language Resources Association (ELRA) and its associate organizations have been very active and successful in addressing the challenges and opportunities related to language resource creation and evaluation. It is therefore a great opportunity for resource creators of Indian languages to showcase their work on this platform and also to interact with and learn from those involved in similar initiatives all over the world.
The broader objectives of WILDRE are:
    To map the status of Indian Language Resources
    To investigate challenges related to creating and sharing various levels of language resources
    To promote a dialogue between language resource developers and users
    To provide an opportunity for researchers from India to collaborate with researchers from other parts of the world