Applications and Resources for Telugu Code-Mixing
Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Exact Humanities by Research

by

SRIRANGAM VAMSHI KRISHNA
201456111
[email protected]

International Institute of Information Technology
Hyderabad - 500 032, INDIA
June 2020

Copyright © Srirangam Vamshi Krishna, 2020
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Applications and Resources for Telugu Code-mixing” by SRIRANGAM VAMSHI KRISHNA, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date
Adviser: Prof. MANISH SHRIVASTAVA

To teachers, family, friends and IIIT Hyderabad

Acknowledgments

I would like to take this opportunity to thank my advisor Prof. Manish Shrivastava, my friends and my family. Prof. Manish accepted me as his student during a difficult time. He inspired and encouraged me throughout the research process. He taught me ‘Patience’ and ‘Perseverance’, two important aspects of any research, which helped me complete this work and which will remain instrumental in my further research life. I am indebted to my advisor, LTRC and IIIT Hyderabad for shaping me as a researcher.

I would like to thank the late Prof. Navjyoti Singh, who influenced my thought process and introduced me to a wide spectrum of ideas in computer science and the humanities and their confluence. Abhinav Appidi, Vinay Singh, Venu madhav, SVK Rohit and Pruthwik helped in my research and kept me motivated to complete it. I would like to thank my friends and well-wishers, who shared memories with me and stayed with me through my highs and lows during the five years at IIIT. I would like to thank my family for its continual, unwavering support and love.
Abstract

The recent surge of data and advances in machine learning and deep learning have improved our understanding of language and led to many applications. The same is not true for many under-resourced languages, and Telugu is one such low-resource Indian language. This work aims to advance research on the Telugu language. We present three corpora and their analysis in the areas of Conversational Dialogue Systems, Named Entity Recognition and Emotion Prediction.

The first is a Telugu conversational corpus, the first such corpus to the best of our knowledge. We built an end-to-end dialogue system using the corpus and performed experiments with a sequence-to-sequence encoder and attention decoder model, varying word order, translation, vocabulary size, transliteration and word representations.

The second is a Telugu-English code-mixed social media corpus for Named Entity Recognition (NER), the first such corpus to the best of our knowledge. We experimented with traditional machine learning methods, namely Conditional Random Fields (CRFs), an undirected probabilistic graphical model, and Decision Trees, and with a deep learning method, Long Short-Term Memory networks (LSTMs). We proposed feature functions for Named Entity Recognition, which were used in the CRF. We report F1-scores of 0.96, 0.94 and 0.95 with CRFs, Decision Trees and Bidirectional LSTMs respectively.

The third is a Telugu-English code-mixed social media corpus for Emotion Prediction, the first such corpus to the best of our knowledge. We proposed feature functions for Emotion Prediction, which were used in an experiment with Support Vector Machines (SVM). We also experimented with deep learning methods, namely LSTMs and Bidirectional LSTMs. SVM, LSTMs and Bidirectional LSTMs achieved accuracies of 58%, 60.92% and 70.74% respectively.
Contents

1 Introduction
  1.1 Motivation
  1.2 Main Contribution
  1.3 Organisation of thesis
2 Background and Related Work
  2.1 Code-mixing
  2.2 Conversational Dialogue Systems for Telugu
  2.3 Named Entity Recognition
  2.4 Emotion Prediction
3 Conversational Dialogue System for Telugu
  3.1 Corpus Creation
    3.1.1 Annotation
    3.1.2 Corpus Statistics
  3.2 Experimental Analysis
    3.2.1 Sequence to Sequence Encoder with Attention Decoder
  3.3 Result Analysis
4 Named Entity Recognition for Telugu-English Code-Mixed Social Media Data
  4.1 Corpus Creation
  4.2 Annotation
  4.3 Inter Annotator Agreement
  4.4 Data Statistics
  4.5 Experimental Analysis
    4.5.1 Conditional Random Fields (CRF)
    4.5.2 Decision Tree
    4.5.3 Bidirectional LSTM
    4.5.4 Features
    4.5.5 Results and Discussion
5 Emotion Prediction for Telugu-English Code-Mixed Social Media Data
  5.1 Corpus Creation
    5.1.1 Pre-Processing
    5.1.2 Annotation
    5.1.3 Inter Annotator Agreement
    5.1.4 Corpus Statistics
  5.2 Experimental Analysis
  5.3 SVM
    5.3.1 Feature Identification and Extraction
  5.4 LSTMs
  5.5 Results and Discussion
6 Conclusion and Future Work
  6.1 Conversational Dialogue System for Telugu
  6.2 Named Entity Recognition
  6.3 Emotion Prediction
Bibliography

List of Figures

4.1 Results from a Decision Tree
4.2 BiLSTM model architecture

List of Tables

3.1 An example from the corpus
3.2 Example (Input, Output) pair for Dialogue System
3.3 Transliterated Telugu 4000 training pairs, 1000 test pairs
3.4 Transliterated Telugu 14500 training pairs and 3500 test pairs
3.5 Translated Cornell English-Movie Dialogs 4000 training pairs, 1000 test pairs
3.6 Example of (Input, Output) pairs from Cornell Movie-Dialog data
3.7 Cornell English Movie-Dialog 4000 training pairs, 1000 test pairs
3.8 Telugu-English combined 8000 training pairs and 2000 test pairs
3.9 Telugu-word ordered English combined 8000 training pairs, 2000 test pairs
4.1 Inter Annotator Agreement
4.2 Tags and their Count in Corpus
4.3 CRF Model with ‘c2=0.1’ and ‘l2sgd’ algo.
4.4 Feature Specific Results for CRF
4.5 Decision Tree Model with ‘max-depth=32’
4.6 Feature Specific Results for Decision tree
4.7 An Example Prediction of our CRF Model
4.8 Bi-LSTM model with optimizer = ‘adam’ and has a weighted f1-score of 0.95
5.1 Inter Annotator Agreement
5.2 Data Distribution
5.3 Causal Language Distribution
5.4 Total Word Distribution
5.5 Unique Word Distribution
5.6 Results of SVM
5.7 Results of LSTM on our corpus
5.8 LSTM Results on all emotions
5.9 Results of BiLSTM on our corpus
5.10 BiLSTM Results on all emotions

Chapter 1

Introduction

1.1 Motivation

India is a land of many languages. It has around 780 languages, which is the second highest number in the world. In this work we focus on Telugu, the most spoken Dravidian language. It is spoken predominantly in the Indian states of Telangana and Andhra Pradesh and in the neighboring states of Tamil Nadu, Karnataka, Kerala and Odisha. It is one of six languages designated as a classical language of India by the Indian government; classical language status is given to languages which have a rich heritage and independent nature. Telugu is one of the twenty-two scheduled languages of the Republic of India. It is also the fastest growing language in the United States of America, where there is a large Telugu-speaking community.
Globalization has led people to interact in multiple languages for social, political and economic reasons. Understanding the language of people belonging to multilingual societies is interesting and challenging. People tend to mix two or more languages while speaking or writing. This phenomenon is described by two terms, ‘Code-Mixing’ and ‘Code-Switching’.

Code-Mixing: If linguistic units such as affixes, words, phrases and clauses from two or more languages are used within a sentence and speech context, we refer to it as code-mixing.

Code-Switching: Code-switching refers to the mixing of linguistic units such as words, phrases and clauses from two or more languages within a speech context.

The relative ordering of the words helps in understanding the difference between the two. Intrasentential mixing of words occurs in code-mixing, whereas intersentential mixing of words occurs in code-switching [10].

With the advent of the internet, social media platforms and online applications, English has become a part of Indian life as a medium of instruction and interaction in several educational and social institutions. The daily interactions of Telugu-speaking people involve the use of linguistic units from English in many facets of life. We do not differentiate between code-mixing and code-switching computationally; we treat them alike and use the umbrella term code-mixing in this thesis. In this work we focus on Telugu-English code-mixed language. Code-mixing happens at different levels.

• Morpheme level: Telugu suffixes are attached to English morphemes, resulting in words that indicate case and number. An example is ‘caru’, which means ‘a car’: the word ‘car’ is from English and the suffix ‘u’ is added to it. Similarly, in ‘ticketlu’, meaning multiple tickets, the word ‘ticket’ is from English and the suffix ‘lu’ is added to it.
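The morpheme-level pattern described above can be illustrated with a minimal Python sketch. It is not the method used in this thesis. The function name `split_code_mixed`, the two-word English lexicon and the two-suffix list are hypothetical and are taken only from the examples ‘caru’ and ‘ticketlu’; a real system would need a much larger lexicon and suffix inventory.

```python
# Illustrative resources only (assumptions, not the thesis's actual data):
# the English stems and Telugu suffixes are those named in the examples above.
ENGLISH_STEMS = {"car", "ticket"}
TELUGU_SUFFIXES = ("lu", "u")  # longer suffix is tried first


def split_code_mixed(token):
    """Split a romanized token into (English stem, Telugu suffix).

    Returns None when the token does not look like an English stem
    followed by one of the listed Telugu suffixes.
    """
    word = token.lower()
    for suffix in TELUGU_SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in ENGLISH_STEMS:
            return stem, suffix
    return None


if __name__ == "__main__":
    for token in ["caru", "ticketlu", "car"]:
        print(token, "->", split_code_mixed(token))
```

With these toy resources, ‘caru’ splits into the stem ‘car’ and the suffix ‘u’, ‘ticketlu’ splits into ‘ticket’ and ‘lu’, and a plain English token such as ‘car’ is left unsplit.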