Book of Proceedings
Total Page:16
File Type:pdf, Size:1020Kb
LREC 2018 Workshop WILDRE4– 4th Workshop on Indian Language Data: Resources and Evaluation PROCEEDINGS Edited by Girish Nath Jha, Kalika Bali, Sobha L, Atul Kr. Ojha ISBN: 979-10-95546-09-2 EAN: 9791095546092 12 May 2018 Proceedings of the LREC 2018 Workshop “WILDRE4 – 4th Workshop on Indian Language Data: Resources and Evaluation” 12 May 2018 – Miyazaki, Japan Edited by Girish Nath Jha, Kalika Bali, Sobha L, Atul Kr. Ojha http://sanskrit.jnu.ac.in/conf/wildre4/index.jsp Acknowledgments: This work has received funding from the Microsoft Research India. Organising Committee • Girish Nath Jha, Jawaharlal Nehru University, India • Kalika Bali, Microsoft Research India Lab, Bangalore • Sobha L, AU-KBC, Anna University Workshop Manager • Atul Kr. Ojha, Jawaharlal Nehru University, India i Programme Committee • Adil Amin Kak, University of Kashmir, India • Anil Kumar Singh, IIT-BHU, India • Arul Mozhi, University of Hyderabad, India • Asif Iqbal, IIT-Patna, India • Bogdan Babych, University of Leeds, UK • Claudia Soria, CNR-ILC, Italy • Dafydd Gibbon, Universität Bielefeld, Germany • Delyth Prys, Bangor University, UK • Dipti Mishra Sharma, IIIT, Hyderabad India • Diwakar Mishra, EZDI, Ahmedabad, India • Dorothee Beermann, NTN Univeristy(NTNU), Norway • Elizabeth Sherley, IITM-Kerala, Trivandrum, India • Esha Banerjee, Google, USA • Eveline Wandl-Vogt, Austrian Academy of Sciences, Austria • Georg Rehm, DFKI, Germany • Girish Nath Jha, Jawaharlal Nehru University, New Delhi, India • Hans Uszkoreit, DFKI, Berlin, Germany • Jan Odijk, Utrecht University, The Netherlands • Jolanta Bachan, Adam Mickiewicz University, Poland • Joseph Mariani, LIMSI-CNRS, France • Jyoti D. Pawar, Goa University, India • Kalika Bali, MSRI, Bangalore, India • Kevin Scannell, St. Louis University, USA • Khalid Choukri, ELRA, France • Lars Hellan, NTNU, Norway • Malhar Kulkarni, IIT-Bombay, India • Manji Bhadra, Bankura University, West Bengal, India • Marko Tadic, Croatian Academy of Sciences and Arts, Croatia ii • Massimo Monaglia, University of Florence, Italy • Monojit Choudhary, MSRI Bangalore, India • Narayan Choudhary, CIIL-Mysore, India • Nicoletta Calzolari, ILC-CNR, Pisa, Italy • Niladri Shekhar Dash, ISI Kolkata, India • Panchanan Mohanty, University of Hyderabad, India • Pinky Nainwani, Optimum Pvt.Ltd, Bangalore, India • Pushpak Bhattacharya, Director, IIT-Patna, India • Qun Liu, Adapt Center, Dublin City University, Ireland • Ritesh Kumar, Dr. B.R. Ambedkar University, Agra, India • Sachin Kumar, CDAC-Pune, India • Shivaji Bandhopadhyay, Director, NIT Silchar, India • Sobha L, AU-KBC Research Centre, Anna University, India • S S Aggarwal, KIIT, Gurgaon, India • Stelios Piperidis, ILSP, Greece • Subhash Chandra, Delhi University, India • Swaran Lata, Head TDIL, MCIT, Govt. of India • Virach Sornlertlamvanich, Thammasat Univeristy, Bangkok, Thailand • Vishal Goyal, Punjabi University Patiala, India • Zygmunt Vetulani, Adam Mickiewicz University, Poland iii Preface WILDRE – the 4th Workshop on Indian Language Data: Resources and Evaluation is being organized in Miyazaki, Japan on 12th May, 2018 under the LREC platform. India has a huge linguistic diversity and has seen concerted efforts from the Indian government and industry towards developing language resources. European Language Resource Association (ELRA) and its associate organizations have been very active and successful in addressing the challenges and opportunities related to language resource creation and evaluation. It is therefore a great opportunity for resource creators of Indian languages to showcase their work on this platform and also to interact and learn from those involved in similar initiatives all over the world. The broader objectives of the 4th WILDRE will be • to map the status of Indian Language Resources • to investigate challenges related to creating and sharing various levels of language resources • to promote a dialogue between language resource developers and users • to provide opportunity for researchers from India to collaborate with researchers from other parts of the world The call for papers received a good response from the Indian language technology community. Out of twenty nine full papers received for review, we selected one papers for oral, four for short oral, seven for poster and three for demo presentation. iv Workshop Programme 12th May 2018 09:00 – 09:45hrs: Inaugural session 09:00 – 09:05hrs – Welcome by Workshop Chairs 09:05 – 09:25hrs – Inaugural Address 09:25 – 09:45hrs – Keynote Lecture 09:45 – 10:30hrs – Panel discussion Coordinator: Zygmunt Vetulani Panellists – TBD 10:30 – 11:00hrs – Coffee break + Poster/Demo Chairperson: Kalika Bali • Royal Jain, Anger Detection in Social Media for Resource Scarce Languages • Aniketh Janardhan Reddy, Monica Adusumilli, Sai Kiranmai Gorla, Lalita Bhanu Murthy Neti and Aruna Malapati, Named Entity Recognition for Telugu using LSTM-CRF • K. V. S. Prasad, Shafqat Mumtaz Virk, Miki Nishioka and C. A. G. Kaushik, Crowd-sourced Technical Texts can help Revitalise Indian Languages • Debopam Das, Discourse Segmentation in Bangla • Priya Rani, Atul Kr. Ojha and Girish Nath Jha, Automatic Language Identification System for Hindi and Magahi • Abdul Basit and Ritesh Kumar, Towards a Part-of-Speech Tagger for Awadhi : Corpus and Experiments • Rajneesh Pandey, Atul Kr. Ojha and Girish Nath Jha, Demo of Sanskrit-Hindi SMT System • Srishti Singh and Girish Nath Jha, Demo: Parts-of-Speech Tagger for BhojpuriDeveloping • Mayank Jain, Yogesh Dawer, Nandini Chauhan and Anjali Gupta, Resources for a Less Resourced Language: Braj Bhasha • Atul Kr. Ojha and Girish Nath Jha, Demo: Graphic-based Statistical Machine Translator v 11:00 – 12:45hrs – Paper Session II (Oral and Short Oral Presentation ) Chairperson: S. S. Aggarwal • Massimo Moneglia, Alessandro Panunzi and Lorenzo Gregori, “Taking Events” in Hindi. A Case Study from the Annotation of Indian Languages in IMAGACT • Gaurav Mohanty, Pruthwik Mishra and Radhika Mamidi, Kabithaa: An Annotated Corpus of Odia Poems with Sentiment Polarity Information • Manas Jyoti Bora and Ritesh Kumar, Automatic Word-level Identification of Language in Assamese – English – Hindi Code-mixed Data • Shagun Sinha and Girish Nath Jha, Issues in Conversational Sanskrit to Bhojpuri MT • Ritesh Kumar, Bornini Lahiri, Deepak Alok, Atul Kr. Ojha, Mayank Jain, Abdul Basit and Yogesh Dawer, Automatic Identification of Closely-related Indian Languages: Resources and Experiments 12:45 – 12:55hrs – Valedictory Address 12:55 – 13:00hrs – Vote of Thanks vi Table of Contents Anger Detection in Social Media for Resource Scarce Languages Royal Jain.......................................................................................................................01 Named Entity Recognition for Telugu using LSTM-CRF Aniketh Janardhan Reddy, Monica Adusumilli, Sai Kiranmai Gorla, Lalita Bhanu Murthy Neti and Aruna Malapati...................................................................................06 Crowd-sourced Technical Texts can help Revitalise Indian Languages K. V. S. Prasad, Shafqat Mumtaz Virk, Miki Nishioka and C. A. G. Kaushik...............11 Discourse Segmentation in Bangla Debopam Das.................................................................................................................17 Automatic Language Identification System for Hindi and Magahi Priya Rani, Atul Kr. Ojha and Girish Nath Jha...............................................................23 Towards a Part-of-Speech Tagger for Awadhi: Corpus and Experiments Abdul Basit and Ritesh Kumar.......................................................................................29 Demo of Sanskrit-Hindi SMT System Rajneesh Pandey, Atul Kr. Ojha and Girish Nath Jha....................................................34 Demo: Parts-of-Speech Tagger for Bhojpuri Srishti Singh and Girish Nath Jha..................................................................................36 Developing Resources for a Less Resourced Language: Braj Bhasha Mayank Jain, Yogesh Dawer, Nandini Chauhan and Anjali Gupta................................38 Demo: Graphic-based Statistical Machine Translator Atul Kr. Ojha and Girish Nath Jha.................................................................................44 “Taking Events” in Hindi. A Case Study from the Annotation of Indian Languages in IMAGACT Massimo Moneglia, Alessandro Panunzi and Lorenzo Gregori.....................................46 Kabithaa: An Annotated Corpus of Odia Poems with Sentiment Polarity Information Gaurav Mohanty, Pruthwik Mishra and Radhika Mamidi.............................................52 vii Automatic Word-level Identification of Language in Assamese-English-Hindi Code- mixed Data Manas Jyoti Bora and Ritesh Kumar..............................................................................58 Issues in Conversational Sanskrit to Bhojpuri MT Shagun Sinha and Girish Nath Jha.................................................................................63 Automatic Identification of Closely-related Indian Languages: Resources and Experiments Ritesh Kumar, Bornini Lahiri, Deepak Alok, Atul Kr. Ojha, Mayank Jain, Abdul Basit and Yogesh Dawer..........................................................................................................68 viii 1 Anger Detection in Social Media for Resource Scarce Languages Royal Jain Seernet Technologies LLC [email protected] Abstract Emotion Detection from text is a recent field of research that is closely related