Applications and Resources for Telugu Code-mixing

Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science in Exact Humanities by Research

by

SRIRANGAM VAMSHI 201456111 [email protected]

International Institute of Information Technology
Hyderabad - 500 032, INDIA
June 2020

Copyright © Srirangam Vamshi Krishna, 2020
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled "Applications and Resources for Telugu Code-mixing" by SRIRANGAM VAMSHI KRISHNA, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                             Adviser: Prof. MANISH SHRIVASTAVA

To teachers, family, friends and IIIT Hyderabad

Acknowledgments

I would like to take this opportunity to thank my advisor Prof. Manish Shrivastava, my friends and my family. Prof. Manish accepted me as his student in my difficult times. He inspired and encouraged me through the research process. He taught me 'Patience' and 'Perseverance', two important aspects of any research, which helped me complete this work and which will stay instrumental in my further research life as well. I am indebted to my advisor, LTRC and IIIT Hyderabad for shaping me as a researcher. I would like to thank the late Prof. Navjyoti Singh, who influenced my thought process and introduced me to a wide spectrum of ideas in computer science and the humanities and their confluence. Abhinav Appidi, Vinay Singh, Venu Madhav, SVK Rohit and Pruthwik helped in my research and kept me motivated to complete it. I would like to thank my friends and well-wishers who made my memories and stayed with me through my highs and lows throughout the five years at IIIT. I would like to thank my family for its continual, unwavering support and love.

Abstract

The recent surge of data and the trends in machine learning and deep learning have increased our understanding of language, resulting in many applications. But the same is not true for many under-resourced languages; Telugu is one such low-resource Indian language. This work aims to shed light on research for Telugu and Telugu-English code-mixed language. We present corpora and their analysis in the areas of Conversational Dialogue Systems, Named Entity Recognition and Emotion Prediction respectively. We present a Telugu conversational corpus, to the best of our knowledge the first such corpus. We have built an end-to-end dialogue system using the corpus and performed various experiments with a sequence to sequence encoder and attention decoder model involving word order, translation, vocabulary size, transliteration and word representations. The second is a Telugu-English code-mixed social media corpus for Named Entity Recognition (NER), to the best of our knowledge the first such corpus. We have experimented with traditional machine learning methods such as Conditional Random Fields (CRFs), an undirected probabilistic graphical model, and Decision Trees, and also with a deep learning method, Long Short Term Memory Networks (LSTMs). We have proposed feature functions for Named Entity Recognition which were used in the CRF. We report an F1-score of 0.96, 0.94 and 0.95 with CRFs, Decision Trees and Bidirectional LSTMs respectively. The third is a Telugu-English code-mixed social media corpus for Emotion Prediction, to the best of our knowledge the first such corpus. We have proposed feature functions for Emotion Prediction which were used in an experiment with Support Vector Machines (SVM). We have also experimented with deep learning methods such as Long Short Term Memory Networks (LSTMs) and Bidirectional LSTMs. SVM, LSTMs and Bidirectional LSTMs reported an accuracy of 58%, 60.92% and 70.74% respectively.

Contents

Chapter Page

1 Introduction ...... 1
   1.1 Motivation ...... 1
   1.2 Main Contribution ...... 2
   1.3 Organisation of thesis ...... 5

2 Background and Related Work ...... 6
   2.1 Code-mixing ...... 6
   2.2 Conversational Dialogue Systems for Telugu ...... 7
   2.3 Named Entity Recognition ...... 8
   2.4 Emotion Prediction ...... 9

3 Conversational Dialogue System for Telugu ...... 11
   3.1 Corpus Creation ...... 11
      3.1.1 Annotation ...... 11
      3.1.2 Corpus Statistics ...... 11
   3.2 Experimental Analysis ...... 12
      3.2.1 Sequence to Sequence Encoder with Attention Decoder ...... 13
   3.3 Result Analysis ...... 15

4 Named Entity Recognition for Telugu-English Code-Mixed Social Media Data ...... 18
   4.1 Corpus Creation ...... 18
   4.2 Annotation ...... 19
   4.3 Inter Annotator Agreement ...... 19
   4.4 Data Statistics ...... 20
   4.5 Experimental Analysis ...... 20
      4.5.1 Conditional Random Fields (CRF) ...... 21
      4.5.2 Decision Tree ...... 21
      4.5.3 Bidirectional LSTM ...... 22
      4.5.4 Features ...... 23
      4.5.5 Results and Discussion ...... 24

5 Emotion Prediction for Telugu-English Code-Mixed Social Media Data ...... 28
   5.1 Corpus Creation ...... 28
      5.1.1 Pre-Processing ...... 28
      5.1.2 Annotation ...... 28


      5.1.3 Inter Annotator Agreement ...... 34
      5.1.4 Corpus Statistics ...... 34
   5.2 Experimental Analysis ...... 34
   5.3 SVM ...... 35
      5.3.1 Feature Identification and Extraction ...... 35
   5.4 LSTMs ...... 37
   5.5 Results and Discussion ...... 38

6 Conclusion and Future Work ...... 40
   6.1 Conversational Dialogue System for Telugu ...... 40
   6.2 Named Entity Recognition ...... 41
   6.3 Emotion Prediction ...... 41

Bibliography ...... 43

List of Figures

Figure Page

4.1 Results from a Decision Tree ...... 26
4.2 BiLSTM model architecture ...... 27

List of Tables

Table Page

3.1 An example from the corpus ...... 12
3.2 Example (Input,Output) pair for Dialogue System ...... 13
3.3 Transliterated Telugu 4000 training pairs, 1000 test pairs ...... 13
3.4 Transliterated Telugu 14500 training pairs and 3500 test pairs ...... 14
3.5 Translated Cornell English-Movie Dialogs 4000 training pairs, 1000 test pairs ...... 15
3.6 Example of (Input,Output) pairs from Cornell Movie-Dialog data ...... 15
3.7 Cornell English Movie-Dialog 4000 training pairs, 1000 test pairs ...... 16
3.8 Telugu-English combined 8000 training pairs and 2000 test pairs ...... 16
3.9 Telugu- word ordered English combined 8000 training pairs, 2000 test pairs ...... 16

4.1 Inter Annotator Agreement ...... 20
4.2 Tags and their Count in Corpus ...... 20
4.3 CRF Model with 'c2=0.1' and 'l2sgd' algo. ...... 21
4.4 Feature Specific Results for CRF ...... 22
4.5 Decision Tree Model with 'max-depth=32' ...... 22
4.6 Feature Specific Results for Decision tree ...... 23
4.7 An Example Prediction of our CRF Model ...... 25
4.8 Bi-LSTM model with optimizer = 'adam' and a weighted F1-score of 0.95 ...... 25

5.1 Inter Annotator Agreement ...... 34
5.2 Data Distribution ...... 35
5.3 Causal Language Distribution ...... 35
5.4 Total Word Distribution ...... 36
5.5 Unique Word Distribution ...... 36
5.6 Results of SVM ...... 37
5.7 Results of LSTM on our corpus ...... 38
5.8 LSTM Results on all emotions ...... 38
5.9 Results of BiLSTM on our corpus ...... 39
5.10 BiLSTM Results on all emotions ...... 39

Chapter 1

Introduction

1.1 Motivation

India is a land of several languages; it has around 780 languages, the second highest number in the world. In this research work we focus on a language spoken in India called Telugu. Telugu is the most widely spoken Dravidian language. It is spoken predominantly in the Indian states of Telangana and Andhra Pradesh and in the neighbouring states of Tamil Nadu, Karnataka, Kerala and Odisha. It is one of six languages designated as classical languages of India by the Indian government; classical language status is given to languages which have a rich heritage and an independent nature. Telugu is one of the twenty-two scheduled languages of the Republic of India. It is also the fastest growing language in the United States of America, where there is a large Telugu-speaking community.

Globalization has led people to interact in multiple languages for social, political and economic reasons. Understanding the language of people belonging to multilingual societies is interesting and challenging. People tend to mix up two or more languages while speaking or writing. This phenomenon of mixing languages is described by two terms, 'Code-Mixing' and 'Code-Switching'. Code-Mixing: if linguistic units such as affixes, words, phrases and clauses from two or more languages are used within a sentence and speech context, we refer to it as code-mixing. Code-Switching: code-switching refers to the mixing of linguistic units such as words, phrases and clauses from two or more languages within a speech context. The relative ordering of the words helps us gain a better understanding of the difference between the two: intrasentential mixing of words occurs in code-mixing, whereas intersentential mixing of words occurs in code-switching [10].

With the advent of the internet, social media platforms and online applications, the English language has become a part of Indian life as a medium of instruction and interaction in several educational and social institutions. The daily interactions of Telugu-speaking people involve the use of linguistic units from English in multiple facets of life. We do not differentiate code-mixing and code-switching computationally; we treat them alike and use the umbrella term code-mixing in this thesis. In this work we focus on Telugu-English code-mixed language. Code-mixing happens at different levels.

• Morpheme level: Morphemes from the English language take suffixes from the Telugu language, resulting in words that indicate case and number. Examples of such words are 'caru', which means 'a car', where the word 'car' is from English and the suffix 'u' is added to it, and 'ticketlu', meaning multiple tickets, where the word 'ticket' is from English and the suffix 'lu' is added to it.

• Word level: Code-mixing at the word level involves taking an entire word from the English language into Telugu usage, for example 'naku oka pen isthava', which means 'can you give me a pen?'; here the word 'pen' is borrowed from English.

• Phrase level: In the sentence 'Mahesh babu lanti actor dorakadam mana tollywood ke kadu motham movie industry ke pattina luck, chala manchi artist athanu', meaning 'Mahesh babu is a boon not only to Tollywood but to the entire film industry; he is a very good artist', phrases from the English language are embedded in a sentence which follows a Telugu structure.

• Syntactic level: This type of mixing is generally referred to as inter-sentential code-switching. The sentence 'What an acting, uthiki aresadu, awesome screenplay, tirugu ledu enka', meaning 'What acting! Very well done, awesome screenplay, no one can stop this', is an example of syntactic-level mixing of languages.

This research presents work in the areas of Conversational Dialogue Systems, Named Entity Recognition and Emotion Prediction for Telugu and English-Telugu code-mixed language. The work covers corpus creation, analysis, the building of machine learning models and their evaluation for the respective tasks.

1.2 Main Contribution

Conversational Dialogue System: Telugu is an Indian language which lacks digital conversational data. Conversational data exists for many resource-rich languages such as English; movie scripts are one of the main sources for such languages. In Telugu we have neither digital conversational data nor movie scripts, and there exists no conversational corpus for Telugu. Manual creation of such a corpus is time-consuming and labour-intensive. This work presents, to the best of our knowledge, the first conversational corpus for Telugu. The corpus is created by extracting movie dialogues from Telugu movies, and the creation is done by Telugu speakers who are fluent in both Telugu and English and have a background in computational linguistics. Each data sample in the corpus consists of a time stamp, a speaker and the dialogue. The time stamp encodes the beginning and ending time of the dialogue, extracted from the video; it acts as a mapping between the audio-visual data and the textual data. The character name represents the name of the character delivering the dialogue, and this character-level data helps in encoding persona-level information. We have performed experiments with sequence to sequence encoder and attention decoder models involving translation of sentences, transliteration of

sentences, word order and vocabulary size. The detailed explanation and analysis are given in the following chapters.

Named Entity Recognition: As part of our work we focus on Telugu-English code-mixed language. Telugu is a Dravidian language spoken in the Indian states of Telangana and Andhra Pradesh, and to a lesser extent in other neighbouring states. Code-mixing is often seen on social media platforms like Twitter and Facebook. In this work we focus on the code-mixed language in Twitter tweets. The work covers corpus creation and analysis of Telugu-English code-mixed data for Named Entity Recognition. To the best of our knowledge we have created the first Telugu-English code-mixed corpus annotated with Named Entities. The Named Entity tags used in the annotation are 'Person', 'Organisation' and 'Location', along with an 'other' tag. Each of the first three tags is further divided into two tags, a Beginner tag and an Intermediate tag; thus there are a total of seven tags, the six Beginner and Intermediate tags and the 'other' tag. The Beginner tag is used to tag a word if it is the beginning word of a Named Entity. The Intermediate tag is used when a Named Entity is split into multiple continuous words and is assigned to the words which follow the Beginner word. The detailed explanation and analysis is given in the following chapters. We have performed experiments with traditional machine learning methods like Conditional Random Fields and Decision Trees, and also with modern deep learning methods such as Long Short Term Memory Networks.

The following are instances taken from Twitter depicting Telugu-English code-mixing; each word in the examples is annotated with its respective Named Entity and language tags ('Eng' for English and 'Tel' for Telugu).

T1 : “Sir/other/Eng Rajanna/Person/Tel Siricilla/Location/Tel district/other/Eng loni/other/Tel ee/other/Tel government/other/Eng school/other/Eng ki/other/Tel computers/other/Eng fans/other/Eng vochi/other/Tel samvastharam/other/Tel avthunna/other/Tel Inka/other/Tel permanent/other/Eng electricity/other/Eng raledu/other/Tel Could/other/Eng you/other/Eng please/other/Eng respond/other/Eng @KTRTRS/person/Tel @Collector RSL/other/Eng”

Translation: “Sir, it has been a year since this government school in Rajanna Siricilla district got computers and fans, but still there is no permanent electricity. Could you please respond @KTRTRS @Collector RSL”

T2 : “mana/other/Tel mahesh/Person/Tel anna/other/Tel Telangana/Location/Tel ki/other/Tel doriki- ran/other/Tel varam/other/Tel, athanu/other/Tel tollywood/Organisation/Eng ke/other/Tel kadu/other/Tel Bollywood/Organisation/Eng ki/other/Tel kuda/other/Tel vella/other/Tel galige/other/Tel range/other/End vundi/other/Tel. Jai/other/Tel Mahesh/Person/Tel anna/other/Tel, Jai/other/Tel Babu/Person/Tel”

Translation: “Mahesh Babu is like a boon to Telangana. He can not only shine in Tollywood but

can also enter into Bollywood. Victory to Mahesh Brother. Victory to Babu.”

T3: “Pawankalyan/Person/Tel movie/other/Eng ante/other/Tel ah/other/Tel martham/other/Tel bayam/other/Tel undali/other/Tel TELANGANA/Location/Tel government/Organisation/Eng ki/other/Tel” Translation: “The TELANGANA Government must have at least that much fear for a PawanKalyan movie”

Emotion Prediction: Social media platforms enable users to express their emotions and opinions. In this research work we focus on analyzing the emotions in Twitter tweets, which helps us in understanding trends, popular opinions, human behaviour etc. The work presents corpus creation and analysis of code-mixed English-Telugu social media tweets for Emotion Prediction. The tweets are annotated using the emotions 'Happy', 'Angry', 'Sad', 'Fear', 'Disgust', 'Surprise' and a 'Multiple Emotion' tag. There are also tweets with other emotions, but the above are the emotions existing predominantly. We have performed experiments with a traditional machine learning method, Support Vector Machines, and have also proposed feature functions for it. The feature functions which we propose are at the character and word level; they help in capturing the information related to emotions in a sentence. Experiments with deep learning models such as Long Short Term Memory Networks (LSTMs) and Bidirectional LSTMs have also been performed. The detailed explanation and analysis is given in the following chapters. Below are examples of code-mixed Telugu-English tweets from Twitter and their translations.

T4: “Pawankalyan movie ante ah martham bayam undali TELANGANA government ki” Translation: “The TELANGANA Government must have at least that much fear for a PawanKalyan movie”

T5: “I hate this guy called ashok, Telangana elections apudu chesina athi inka gurtundi” Translation: “I hate this guy called ashok, I still remember the over action he has done during the Telangana elections”

T6: “Really you are awesome sir... Mee laanti young dynamic leader and good human being unna minister telangana prajalu chesukunna goppa ardrustam we are really proud of you sir.” Translation: “Really you are awesome sir... Its a great luck and blessing to the people of Telangana to have a minister like you who is young, dynamic and a good human being. We are really proud of you sir.”

1.3 Organisation of thesis

1. The thesis is organised into six chapters. In the first chapter we discuss the motivation of the research, our contribution as part of this thesis work and the thesis organisation.

2. In the second chapter we discuss the background work related to research in Code-mixing, Con- versational Dialogue Systems, Named Entity Recognition and Emotion Prediction.

3. In the third chapter we discuss the corpus creation, annotation and corpus statistics of Conversa- tional Dialogue Corpus for Telugu. We describe in detail the various experiments performed with Sequence to Sequence Encoder and Attention decoder models and an analysis of the results.

4. In the fourth chapter we discuss the corpus creation and analysis of Telugu-English code-mixed corpus for Named Entity Recognition and various experiments performed using conditional ran- dom fields, decision trees and Bidirectional Long Short Term Memory Networks.

5. In the fifth chapter we discuss the corpus creation and analysis of Telugu-English code-mixed corpus for Emotion Prediction and various experiments performed using Support Vector Machines and Long Short Term Memory Networks.

6. In the final chapter we conclude the thesis with important observations in each of the respective research areas and suggestions for future work.

Chapter 2

Background and Related Work

2.1 Code-mixing

We have seen that code-mixing happens between languages at different levels: morpheme level, word level, phrase level and syntactic level. Here we discuss some of the important primary and recent works which give a glimpse of the wide area of research in code-mixing.

Socio-linguistic Approach: Code-mixing is a result of multilingualism. Here we quote some important research observations on code-mixing from a socio-linguistic perspective. According to [14], the speaker and his goals play an important role in the phenomenon of code-mixing and code-switching. As per [1], there is non-deterministic behaviour with respect to speaker choice, involving various factors, when there exist multiple choices of communication and exchange. [4] explains that code-mixing is also done to signal or imply pragmatic functions like a change in topic, emphasis on a particular aspect, or the usage of puns. The research by [38] is one of the earliest studies of English-Arabic email communication; they observed that English was used for searching and in formal mail communication, and also reported that Arabic script was used instead of the original script in chat conversations. [29] studied the validity and universality of three linguistic constraints, 'equivalence of structure', 'free morpheme' and 'size of the constituent', by examining code-switching statements in two syntactically divergent languages, Moroccan Arabic and French.

Computational Approach: [5] provided an annotation framework and some initial experiments on functions of code-switching in tweets. [3] provided an analysis of the extent of code-mixing in English tweets; their classification of code-mixed words based on frequency and linguistic typology underlined the fact that while there are easily identifiable cases of borrowing and mixing at the two ends, a large majority of the words form a continuum in the middle, emphasizing the need to handle these at different levels for automatic processing of the data. [28] showed that grammatically valid code-mixed data generated by equivalence constraint theory, when used for training together with real-world code-mixed data and monolingual data, helped to reduce the perplexity of an RNN model.

2.2 Conversational Dialogue Systems for Telugu

Initial Work: The first conversational systems were chatbots like ELIZA[39] in 1966 and PARRY in 1971.

• ELIZA had a widespread influence on popular perceptions of artificial intelligence.

• It raised some of the first ethical questions in Natural Language Processing.

• It raised the issues of privacy and the role of algorithms in decision-making.

• These issues led its creator, Joseph Weizenbaum, to fight for social responsibility in AI and computer science in general.

Frame-based systems: GUS, a frame-based dialogue system established by [8] in 1977, became the dominant industry paradigm for the next 30 years.

Stochastic modelling for slot-filling: Later, in the 1990s, stochastic-model-based methods in Natural Language Processing were applied to dialog slot-filling.

Commercialization: By 2010, frame-based systems were being used widely in the commercial space, in phone-based dialog systems such as Apple Siri and other virtual assistants.

Rise of the web: With the rise of the web and online chatbots, the surge of data raised interest in corpus-based chatbots. Initially the corpus-based chatbots worked with information-retrieval-based methods; with the rise of deep learning in the 2010s, the usage of sequence-to-sequence models began. Work on conversational dialogue systems for English was done by Cornell researchers, who released the Cornell Movie Dialog Corpus together with the research paper [15], Chameleons in Imagined Conversations: A New Approach to Understanding Coordination of Linguistic Style in Dialogues. The corpus contains a large collection of fictional conversations extracted from raw movie scripts: 220579 conversational exchanges between 10292 pairs of movie characters, involving 9035 characters from 617 movies, in a total of 304713 utterances. The corpus which we present here includes timestamp information; the time stamp denotes the start and end time of the dialogue in the movie, which helps in creating a mapping between speech and text data. This information is not present in the Cornell Movie Dialogue Corpus.

Deep learning: [31] worked on building an end-to-end dialogue system using a generative hierarchical neural network model. They investigate the task of building a conversational, open-domain dialogue system based on a large dialogue corpus using generative models. The generative model produces system responses that are generated at the word level, opening up the possibility for realistic and flexible interactions. The hierarchical recurrent encoder-decoder architecture used in that paper was proposed by [34] in the paper A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion. [11] gives a survey of recent advances and new frontiers in dialogue systems. In general they divide the existing work on dialogue systems into

• task-oriented models

• task independent models

Then they detail how deep learning techniques help, with representative algorithms, and discuss research approaches that can give dialogue-system research a new frontier. In task-oriented systems there is a specific goal to be accomplished; the system works in a slot-filling manner, as in railway, hotel and flight booking systems, and tries to push the user to fill the specific slots needed to complete the task. In task-independent systems there is no specific end goal, but they keep the user engaged in open-domain conversations. A multi-turn response selection approach for chatbots with a deep attention matching network was proposed by [44]; they investigate matching a response with its multi-turn context using dependency information based entirely upon attention.

Reinforcement learning: A deep reinforcement learning approach for dialogue generation was proposed by [20]. In this paper they argue that, though neural networks are good for dialogue generation, they tend to be shortsighted, ignoring future outcomes; in order to make dialogues coherent and natural, there is a need for reinforcement learning in natural language processing. The model simulates dialogues between two virtual agents, using a policy gradient method to reward sequences that display three useful properties: informativity (non-repetitive turns), coherence between dialogues and ease of answering in conversation. This work is a step towards learning conversational models based on the long-term success of dialogues.

2.3 Named Entity Recognition

Initial work: In the field of Named Entity Recognition there has been a considerable amount of work in resource-rich languages [16]: English [30], German [36], French [2] and Spanish [41]. This is not true in the case of code-mixed Indian languages. The FIRE (Forum for Information Retrieval and Extraction) 1 tasks have shed some light on Named Entity Recognition in Indian languages as well as in code-mixed language data. Though hand-crafted grammar-based systems for Named Entity Recognition obtain better precision, they suffer from low recall. The usage of Conditional Random Fields for Named Entity Recognition is a typical choice.

Machine Learning: [6] proposed an algorithm which is a hybrid approach of dictionary lookup and supervised classification for identifying entities such as Person, Organisation, Product and Location in code-mixed text of Indian languages such as Hindi-English and Tamil-English. [24] worked on annotating code-mixed English-Telugu data collected from the social media site Facebook and created automatic POS taggers for the corpus. They considered Parts of Speech tagging as a classification problem and experimented with linear SVMs, CRFs and Multinomial Bayes, using different combinations of features and capturing both the context of the word and its internal structure. [32] presented an exploration of

1 http://fire.irsi.res.in/fire/2018/home

automated Named Entity Recognition in code-mixed data. They presented a corpus for Named Entity Recognition in Hindi-English code-mixed data and also presented experiments with machine learning models. Named Entity Recognition using bidirectional LSTM-CNNs was proposed by [12], where they present a novel neural network architecture that automatically detects word- and character-level features using a hybrid bidirectional LSTM and CNN architecture, eliminating the need for most feature engineering. They also proposed a novel approach for encoding partial lexicon matches in the neural network and compared it to existing approaches. Neural architectures for Named Entity Recognition were proposed by [19]; they introduce two new neural architectures, one based on bidirectional Long Short Term Memory networks and conditional random fields, and the other constructing and labeling segments using a transition-based approach inspired by shift-reduce parsers. The sources of information for their models are character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora. A semi-supervised sequence tagging approach with bidirectional language models was proposed in [27]; they demonstrate a general semi-supervised method for adding pre-trained contextual embeddings from a bidirectional language model to natural language processing systems and apply it to sequence labeling tasks. End-to-end sequence labeling with Bi-directional LSTM-CNNs-CRF (Long Short Term Memory networks, convolutional neural networks, conditional random fields) was proposed by [21]. Their system is a truly end-to-end system, requiring no feature engineering or data preprocessing, thus making it directly applicable to a wide range of sequence labeling tasks; the neural network architecture benefits from both word- and character-level representations automatically, using a combination of bidirectional Long Short Term Memory networks, a convolutional neural network and a conditional random field. A hybrid semi-Markov Conditional Random Field for neural sequence labeling was proposed by [40]. Semi-Markov CRFs are designed for the task of assigning labels to segments by extracting features from, and describing transitions between, segments instead of words.

2.4 Emotion Prediction

In the work [22], sentiment analysis of tweets is done by lightweight discourse analysis. There is a significant amount of research work on emotion prediction for resource-rich languages, but the same is not true for code-mixed English-Telugu. Limitations in the availability of corpora are a reason for the research hindrance in this area. [23] provided a gold standard corpus called ACTSA (Annotated Corpus for Telugu Sentiment Analysis); they collected Telugu sentences from different sources, and the sentences were pre-processed and annotated according to their annotation guidelines. [13] presented an approach to extract social media micro-blogs such as tweets; they created a corpus for multilingual sentiment analysis and emoji prediction in Hindi, Bengali and Telugu. [26] extracted adjectives, adverbs and verbs from OntoSenseNet and annotated them at the word level using language experts. They developed a benchmark corpus (BCSAT) as an extension to SentiWordNet. They proposed

a baseline accuracy for models using lexeme-level annotations for sentiment prediction. [25] worked on the annotation of POS tags for English-Telugu code-mixed sentences from the social media site Facebook and the creation of automatic POS taggers for this corpus. [33] provided unique language- and POS-tagged code-mixed English-Hindi tweets related to five events from India. [18] provided an aggression-annotated corpus of Hindi-English code-mixed data from two of the most popular social networking sites, Twitter and Facebook. [37] created a corpus for emotion prediction of Hindi-English code-mixed social media text. [9] created a dataset of Hindi-English code-mixed social media text for hate speech detection. [35] created a corpus of English-Hindi code-mixed tweets for sarcasm detection. A convolutional neural network approach for sentence classification was proposed by [17]; they show that a simple convolutional neural network with little hyper-parameter tuning and static vectors achieves excellent results on multiple benchmarks, and they additionally propose a simple modification to the architecture that allows the use of both task-specific and static vectors. Character-level convolutional neural networks for text classification were proposed by [42]; they showed that their model achieved competitive results compared with traditional models such as bag of words, n-grams and their TF-IDF variants, and with deep learning models such as word-based convolutional neural networks and recurrent neural networks. [43] investigated capsule networks with dynamic routing for the task of text classification.

Chapter 3

Conversational Dialogue System for Telugu

3.1 Corpus Creation

Here we present the work of creating and analysing a conversational corpus for Telugu dialogue systems. Conversational corpora are available online for English and other widely spoken languages, but no such corpus is available online for the Telugu language. Manual creation of such corpora is also expensive; as a result, no such corpus exists for Telugu. Here we present a conversational corpus for the Telugu language. The conversations are movie dialogues taken from Telugu movies. Each dialogue is annotated with a time stamp and character information. The time stamp has a start time and an end time which denote the start and the end of the dialogue in the movie; it helps in mapping the dialogue information to the audio-visual information from the movie. The character data gives us persona-related information and the social factors associated with the data.

3.1.1 Annotation:

The dialogue information is collected from Telugu movies spanning the period 2000-2018. Movies are selected from various genres such as drama, action, love, comedy and fantasy, so that there is diversity in the corpus. The dialogue information is collected from eight different movies; examples are 'Mirchi' and 'Bhale Dongalu'. The dialogues are collected by native Telugu speakers who are fluent in both Telugu and English and have a linguistic background. The respective videos for the movies are publicly available on YouTube. Each dialogue is written in the format (time stamp) $ character $ dialogue, where '$' is a delimiter. The time stamp has a start time and an end time written in (hh:mm:ss) format. The corpus will be made available online in the future.
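As an illustration of this format, the following minimal Python sketch parses one annotated line into its time stamp, character and dialogue fields; the example line and the parsing helper are illustrative, not part of any released corpus tooling.

def parse_dialogue_line(line):
    """Split one corpus line of the form '(start - end) $ character $ dialogue'."""
    timestamp, character, dialogue = [part.strip() for part in line.split("$", 2)]
    start, end = timestamp.strip("()").split("-", 1)
    return start.strip(), end.strip(), character, dialogue

example = "(01:59:40 - 02:00:45) $ lawyer $ hospital opening twaralo vundhi..apudu mimmalni pilustaru"
print(parse_dialogue_line(example))
# The full corpus can be read the same way, one annotated dialogue per line.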

3.1.2 Corpus Statistics:

The corpus has a total of 8590 dialogues involving 780 unique movie characters. There are a total of 15899 unique words, out of which 14265 words belong to the Telugu language and 1634 words belong to the English language.

Time Stamp            Character   Dialogue
01:54:30 - 01:59:40   villan      aa intlo oni la funcion ata e intlo evrni pilavaledhani edupu..
01:59:40 - 02:00:45   lawyer      hospital opening twaralo vundhi..apudu mimmalni pilustaru
02:00:45 - 02:01:48   thataya     ma jivitham lo theruvaanu anukunna hospital ni malli theripicharu me runam ela thirchukovali.. goppa koduku ni kannaru andi. aaa adhrustam andhariki vundadhu

Table 3.1 An example from the corpus

These statistics depict the code-mixed nature of recent Telugu movies. Table 3.1 shows an example from the corpus. The Telugu conversational corpus has widespread use in areas such as speech recognition, using the given dialogue and the respective audio segment extracted from the time stamp. It is useful in vision when juxtaposed with the dialogue and the video segment extracted from the corresponding time stamp. Applications like persona-based or character-dependent dialogue systems (using the character data associated with the dialogue), task-independent end-to-end dialogue systems and multi-party chat summarization systems can be built using it. The data set can be further enriched by annotating it with information like coherence relations between the dialogues and dialogue states. This data set can also be used in building a Telugu-English translation corpus, as the respective English subtitles are also available. Due to its richness, the dialogue corpus helps us in understanding the differences between the general underlying conversational structures of movies, which are scripted by directors, and the regular social interactions between people. Opinion mining on this data at the sentence and scene level can be of great assistance in understanding the flow of emotions through a story. Since the dialogue data is collected from different movies, it also helps us in understanding the writing styles of the artists. The corpus makes task-oriented systems more conversational rather than mere information delivery systems.

3.2 Experimental Analysis

In order to gain a better understanding of the corpus, experiments like language identification, word representation generation using fastText and the design of end-to-end task-independent dialogue systems were performed on the data. We have performed several experiments with the data while building an end-to-end task-independent dialogue system. The (Input, Output) pairs for the dialogue system are taken from the corpus. At the time of corpus creation, scene changes were also annotated, based upon factors like change in location, change in actors and change in topic of discussion. The (Input, Output) pairs are collected from within these scenes only, so that the dialogues remain coherent. The (Input, Output) pairs do not include the character and time stamp information. Table 3.2 shows example (Input, Output) instances.

Input: vellaku entha dabbu echavo teliyadu gani vellu nekosam entha risk ayina chesetattu unnaru. mundhu ee china vadu vasthadu vediki dookudu ekkuva.
Output: Dani valla problems kani, solve avvavu, kasepu open ga matlaudunkunte pani

Input: entee veerapratap office kaaa
Output: avunandi ma owner veeraprataap

Input: ma friend jai
Output: hello

Table 3.2 Example (Input,Output) pair for Dialogue System

Model               Perplexity
RNN                 2900.4309
RNN with fastText   3438.7359

Table 3.3 Transliterated Telugu 4000 training pairs, 1000 test pairs

3.2.1 Sequence to Sequence Encoder with Attention Decoder

We have experimented with the Sequence to Sequence Encoder with Attention Decoder model on the corpus described above and evaluated the model's performance. Sequence to Sequence models are widely employed in tasks like machine translation, text summarization and image captioning. They generally consist of an encoder and a decoder: the encoder captures the context of the input sentence, and the decoder produces the output sequence using the information encoded in the hidden state. Generally, Recurrent Neural Networks (RNNs) and Long Short Term Memory Networks (LSTMs) are used in building encoders and decoders. RNNs have the problem of vanishing gradients, where they tend to forget the information captured at the beginning of the input sequence. The gradient vanishes because, during back propagation, it tends to become smaller and smaller; as a result, learning in the earlier parts of the network becomes slower compared to the later ones. In order to overcome this problem, Long Short Term Memory Networks (LSTMs) were proposed, which contain a cell state, an input gate, an update gate and a forget gate; the network learns to retain the required information while forgetting unnecessary information. Attention models focus on different parts of the input sentence at every stage of the output sequence, enabling the context to be preserved from start to end.

Conversational Corpus Transliterated to Telugu: The (Input, Output) pairs are transliterated to Telugu using a language identification and transliteration tool [7]. We trained the Sequence to Sequence Encoder with Attention Decoder model on this data. Word perplexity is used as the evaluation measure. Initially we trained on 4000 samples and tested on 1000 samples. The perplexity became considerably lower when the training corpus was increased from 4000 to 14500 samples. Table 3.3 shows the results for training on 4000 and testing on 1000 samples; Table 3.4 shows the results for training on 14500 samples and testing on 3500 samples.
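Since word perplexity is the evaluation measure used throughout this chapter, the small sketch below shows how such a score can be computed from the per-token probabilities assigned by the decoder; the numbers in the example are illustrative, and the actual training and evaluation code is not reproduced here.

import math

def word_perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per predicted token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Illustrative usage: the decoder assigned probabilities 0.2, 0.05 and 0.1
# to the three gold tokens of a test response.
print(word_perplexity([math.log(0.2), math.log(0.05), math.log(0.1)]))   # ~10.0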

We have also performed the same experiments on different data sets. The same model was trained on the Cornell movie data set translated to Telugu. The following describes the data and the experiments on it.

Cornell Movie Dialogues Data Translated to Telugu: We sampled 5000 (Input, Output) pairs from the Cornell Movie Dialogues corpus and translated them to Telugu using the Google translator. The corpus consists of a large collection of fictional conversations extracted from movie scripts: 2,20,579 conversational exchanges between 10,292 pairs of characters, involving 9,035 characters from 617 movies, in a total of 3,04,713 utterances. The same Sequence to Sequence Encoder with Attention Decoder model was used on this corpus. Experiments were conducted with and without the fastText embeddings. Table 3.5 shows the results of the Seq2Seq Encoder with Attention Decoder model on 4000 training and 1000 test pairs translated to Telugu. For a better understanding of our corpus, we conducted the same experiments with the Cornell English Movie Dialogues data. The following explains the data creation and the experimental results.

Cornell English Movie Dialogue Data: We have experimented with the same Sequence to Sequence Encoder with Attention Decoder on the Cornell English Movie Dialogue data. We sampled a total of 5000 (Input, Output) pairs, out of which 4000 pairs are used for training and 1000 pairs for testing. We conducted the experiments with and without the fastText embeddings. The perplexity results are comparatively low when we used the fastText embeddings. We can also observe that the perplexity here is low compared to the experiment with the same model on the Cornell English Movie corpus translated to Telugu. Tables 3.6 and 3.7 show sample instances from the data created and the perplexity results of the model respectively.

Combined Telugu Data and Cornell English Movie Dialogue Data: We combined 5000 (Input, Output) pairs each from the Cornell English Movie Dialogue corpus and the Telugu conversational corpus. The training and testing data consist of 8000 and 2000 (Input, Output) pairs respectively. When the Sequence to Sequence Encoder with Attention Decoder model is trained and tested on the combined data, the perplexity of the model is reduced. However, the perplexity of the model is high when tested only on the Telugu (Input, Output) pairs. Table 3.8 shows the result of the Sequence

Model               Perplexity
RNN                 32.3145
RNN with fastText   30.6573

Table 3.4 Transliterated Telugu 14500 training pairs and 3500 test pairs

Model               Perplexity
RNN                 1654.5184
RNN with fastText   2177.2824

Table 3.5 Translated Cornell English-Movie Dialogs 4000 training pairs, 1000 test pairs

to Sequence Encoder with Attention Decoder on the combined Cornell English Movie corpus and the Telugu conversational corpus, with 8000 (Input, Output) training pairs and 2000 (Input, Output) testing pairs.

Word Ordering: The difference in word order between English and Indian languages can be one cause of the high perplexity. Here we changed the word order of the English data to align with that of Indian languages. The word-ordered 5000 (Input, Output) pairs are combined with 5000 Telugu conversational (Input, Output) pairs. The data is split into 8000 (Input, Output) training pairs and 2000 (Input, Output) testing pairs. Table 3.9 shows the result of the Sequence to Sequence Encoder with Attention Decoder on the above data.

3.3 Result Analysis

We can clearly see that the perplexity of the model decreased when word embeddings from fastText were used in the experiment with the Cornell English Movie-Dialog corpus. The perplexity of the Cornell Movie-Dialog data translated to Telugu is comparatively lower than that of our transliterated data. The perplexity of the model initially decreased in the case of training data consisting of both Telugu and English pairs because the testing data also has English pairs along with Telugu pairs; the perplexity of the model increased when tested only on Telugu pairs. Though the combined data was created in order to leverage the English data, the knowledge gained from the English data could not be used, because the word order in English is different from that of Telugu. We can say from the above observations that the size of the training data is one of the factors affecting the perplexity of the system. The model is only trained on 4000 (Input, Output) pairs. This

Input: Too late, they wont come back out till morning
Output: That must be him. Water taxi. Get us one

Input: Look, I cant help you with Quincey if thats what youre after. This has nothing to do with him
Output: So youre just attracted to me, is that it?

Table 3.6 Example of (Input,Output) pairs from Cornell Movie-Dialog data

Model               Perplexity
RNN                 1144.6391
RNN with fastText   1067.9600

Table 3.7 Cornell English Movie-Dialog 4000 training pairs, 1000 test pairs

Model                                                      Perplexity
RNN on combined Telugu and English (Input, Output) pairs   2340.7413
RNN on Telugu (Input, Output) pairs                        3635.1392

Table 3.8 Telugu-English combined 8000 training pairs and 2000 test pairs

is low for a deep learning sequence-to-sequence model. The perplexity decreased with the increase of the training data to 14500 pairs. The other important factor is the word embeddings: having more training data helps in creating richer word representations. Since the training data used for our system is transliterated from English, the errors in transliteration could be responsible for the higher perplexity of the system; this also explains why the perplexity on the translated data is lower than on the transliterated data. This is one of the important factors to be considered. The perplexity of the combined Telugu and English pairs decreased when the English data was word-ordered in accordance with that of Indian languages. Perplexity decreased when the testing data consisted of both Telugu and word-ordered English pairs, and increased when the testing data consisted only of Telugu pairs. Word order is thus one of the important factors in reducing perplexity. The increased perplexity is due to errors in changing the word order of English to that of Indian languages and to errors in transliteration.

Model                                                      Perplexity
RNN on combined Telugu and English (Input, Output) pairs   887.5810
RNN on Telugu (Input, Output) pairs                        3889.5080

Table 3.9 Telugu- word ordered English combined 8000 training pairs, 2000 test pairs

One more important point regarding the vocabulary of our data is that a few words are written with multiple spellings, as the dialogue corpus was collected by different individuals who wrote the Telugu data in Roman script. For example, the word "ekkada", meaning "where" in English, is written as ("ekada", "ekkada", "ekkadda"), leading to a wide variety of representations for some words in the vocabulary and therefore increasing perplexity. The dialogue corpus has to be normalized in order to minimize multiple representations of a single word, and we tried a couple of algorithms to clean the data. The first is edit distance between words, where a set of words is replaced with a single word; the cost function for the edit distance is the minimum number of character edits or removals required to convert one string into the other. This algorithm did not work well because the cost function replaced minimum-distance words which were not supposed to be replaced, in cases like "mat" and "cat", and failed to replace words like "ekada" and "ekkadaa" because their edit distance was higher. The other algorithm is cleaning words based upon the semantic distance between word vectors. The word vectors are created from the corpus using fastText, but this approach failed too, as the corpus was small and similar or neighbouring words did not have similar enough representations and were placed apart. We did not use it due to the poor quality of the word vectors.
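The first normalisation attempt described above can be sketched as follows; the edit-distance threshold and the list of canonical spellings are illustrative assumptions, and, as noted, such a fixed threshold conflates short unrelated words while missing variants such as "ekkadaa".

def edit_distance(a, b):
    """Levenshtein distance: minimum character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))      # substitution
        prev = curr
    return prev[-1]

def normalise(word, canonical_forms, max_dist=1):
    """Replace word with the closest canonical spelling if it is within max_dist edits."""
    best = min(canonical_forms, key=lambda c: edit_distance(word, c))
    return best if edit_distance(word, best) <= max_dist else word

print(normalise("ekada", ["ekkada", "enka"]))    # -> 'ekkada'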

Chapter 4

Named Entity Recognition for Telugu-English Code-Mixed Social Media Data

4.1 Corpus Creation

The corpus created consists of Telugu-English code-mixed tweets along with their Named Entity tags. The tweets are mined from Twitter using the Twitter Python API, which uses the advanced search option of Twitter. The tweets collected span the last two years and cover topics such as politics, movies, sports and social events. Certain hashtags were used as search queries in the extraction of tweets; the hashtags relate to sports, sportsmen, politicians, film stars, festivals etc.

Pre-Processing: Extensive pre-processing of the tweets is done. The following are the steps taken in the pre-processing stage:

• Tweets which are noisy and useless are removed, i.e. tweets which contain URLs or many hashtags are removed

• Tweets which contain words either from only English language or from only Telugu language are removed in order to maintain the code-mixed nature of the corpus.

• Tweets which contain words both from English and Telugu language are considered

• Tokenization of tweets is done using Twitter Tokenizer

• Tweets with a length of five words or above are only considered

A total of 2,16,800 tweets were extracted; after the pre-processing steps we are left with only 3968 tweets.
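A hedged sketch of these filtering steps is given below. It assumes NLTK's TweetTokenizer as the Twitter tokenizer, uses an illustrative hashtag limit, and stands in a toy word-level language identifier for the one actually used.

from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()

def keep_tweet(text, language_of, max_hashtags=3, min_len=5):
    """Apply the filtering steps listed above to a single raw tweet."""
    if "http://" in text or "https://" in text:        # drop tweets containing URLs
        return False
    tokens = tokenizer.tokenize(text)
    if sum(tok.startswith("#") for tok in tokens) > max_hashtags:
        return False                                   # too many hashtags: treated as noise
    if len(tokens) < min_len:                          # keep tweets of five or more tokens
        return False
    langs = {language_of(tok) for tok in tokens if tok.isalpha()}
    return {"Eng", "Tel"} <= langs                     # keep only genuinely code-mixed tweets

# Illustrative stand-in for the word-level language identifier actually used.
ENGLISH_HINTS = {"government", "school", "electricity", "please", "respond"}
def naive_language_of(token):
    return "Eng" if token.lower() in ENGLISH_HINTS else "Tel"

print(keep_tweet("Sir ee government school ki inka permanent electricity raledu",
                 naive_language_of))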

4.2 Annotation

The annotation of the data set was carried out by two individuals who are proficient in both Telugu and English and have a linguistic background. The annotation of the corpus was done using three Named Entity tags, 'Person' (Per), 'Organization' (Org) and 'Location' (Loc). Each of these three tags is further divided into two tags, a Beginner tag and an Intermediate tag. Thus there are a total of seven tags: these six Beginner and Intermediate tags and an 'Other' tag.

The Beginner tag is used to tag a word if it is the beginning word of a Named Entity. The Intermediate tag is used when a Named Entity is split into multiple continuous words; it is assigned to the words which follow the Beginner word. The following explains the use of these seven tags.

The ‘Per’ tag refers to the ‘Person’ Entity which is the names of persons, twitter handles and nick names of people. The ‘B-Per’ tag is given to the Beginning word of the ‘Person’ entity and the ‘I-Per’ tag is given if the ‘Person’ Entity is split into multiple continuous words, all the words that follow the Beginner word are given the ‘I-Per’ tag.

The 'Org' tag refers to the 'Organization' entity, which covers social, political, economic and cultural organizations like 'Hindus', 'Muslims', 'BJP', 'TRS', 'Reserve Bank of India' and social media organizations like 'Twitter', 'Facebook', 'Google' etc. The 'B-Org' tag is assigned to the beginning word of the 'Organization' entity; the 'I-Org' tag is assigned if the 'Organization' entity is split into multiple continuous words and is given to the words that follow the beginning word.

The 'Loc' tag refers to the 'Location' entity, which covers names of locations and places like 'Hyderabad', 'USA' and 'India'. The 'B-Loc' tag is assigned to the beginning word of the 'Location' entity; the 'I-Loc' tag is assigned if the 'Location' entity is split into multiple continuous words and is given to the words that follow the beginning word.

The 'Other' tag is assigned if a word does not belong to any of the above six tags.

4.3 Inter Annotator Agreement

The annotation of the data set was carried out by two persons who are proficient in the Telugu and English languages and have a linguistic background. In order to assess the quality of the data set, the inter-annotator agreement was calculated between the two annotated data sets, consisting of 3968 tweets and 115772 tokens, using Cohen's kappa coefficient. We can see that the agreement is significantly high. The agreement on the 'Location' tags is high, whereas that on the 'Person' and 'Organization' tags is comparatively low because of uncommon and confusing names of persons and organisations. Table 4.1 shows the inter-annotator agreement.

Tag     Cohen's Kappa
B-Loc   0.97
B-Org   0.95
B-Per   0.94
I-Loc   0.97
I-Org   0.92
I-Per   0.93

Table 4.1 Inter Annotator Agreement.
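The per-tag scores in Table 4.1 can be computed with scikit-learn's implementation of Cohen's kappa; the sketch below assumes a tag-versus-rest scoring for each tag and uses illustrative toy annotations in place of the real token-level tag sequences.

from sklearn.metrics import cohen_kappa_score

def per_tag_kappa(tags_a, tags_b, tag):
    """Agreement on a single tag, scored tag-vs-rest over all tokens."""
    a = [int(t == tag) for t in tags_a]
    b = [int(t == tag) for t in tags_b]
    return cohen_kappa_score(a, b)

# Toy token-level annotations from the two annotators (one tag per token).
tags_a = ["B-Loc", "other", "B-Per", "other", "B-Loc"]
tags_b = ["B-Loc", "other", "other", "other", "B-Loc"]
print(per_tag_kappa(tags_a, tags_b, "B-Loc"))   # 1.0 for perfect agreement on B-Loc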

Tag               Count of Tokens
B-Loc             5429
B-Org             2257
B-Per             4888
I-Loc             352
I-Org             201
I-Per             782
Total NE tokens   13909

Table 4.2 Tags and their Count in Corpus

4.4 Data Statistics

We have retrieved a total of 2,16,800 tweets using the python twitter API. We are left with 3968 tweets after the extensive cleaning and pre-processing steps. We have annotated a total of 115772 tokens using ‘B-Per’, ‘I-per’, ‘B-Org’, ‘I-Org’, ‘B-Loc’, ‘I-Loc’ and an ‘other’ tag. Each tweet has an average length of 29 words. The following table 4.2 shows the distribution of tags.

4.5 Experimental Analysis

In this section we present experiments with our corpus using different features and systems. Experiments were conducted using all the features and using subsets of the features in order to determine the effect of each feature, while simultaneously changing the parameters of the models, such as the regularization parameters and the optimization algorithms ('L2 regularization', 'Avg. Perceptron' and 'Passive Aggressive') for the CRF.

Tag            Precision   Recall   F1-score
B-Loc          0.958       0.890    0.922
I-Loc          0.867       0.619    0.722
B-Org          0.802       0.600    0.687
I-Org          0.385       0.100    0.159
B-Per          0.908       0.832    0.869
I-Per          0.715       0.617    0.663
OTHER          0.974       0.992    0.983
weighted avg   0.963       0.966    0.964

Table 4.3 CRF Model with ‘c2=0.1’ and ‘l2sgd’ algo.

For the decision tree model we varied the criterion ('Information Gain', 'Gini Index') and the maximum depth of the tree, and for the LSTMs we varied the optimization algorithms and loss functions. We used 5-fold cross validation in order to verify our models, and used the scikit-learn and Keras libraries for the implementation of the above algorithms.

4.5.1 Conditional Random Fields (CRF)

A Conditional Random Field is a probabilistic machine learning model often used for structured prediction. A CRF can take context into account while making a structured prediction. CRFs find application in Part of Speech tagging, Named Entity Recognition, gene finding etc.; in computer vision, CRFs are used for object recognition and image segmentation.

In sequence prediction tasks like Parts of Speech tagging, an adjective is more likely to be followed by a noun than by a verb. In Named Entity Recognition using the BIO-style tags, I-Org is less likely to be followed by I-Per, and similarly B-Per is more likely to be followed by I-Per; thus we need to look at the sentence level rather than the word level. These sentence-level correlations help us in understanding the context and are beneficial for Named Entity Recognition. Hence we chose to work with Conditional Random Fields for the task of Named Entity Recognition. We have experimented with regularization parameters and optimization algorithms like 'Passive Aggressive', 'Avg. Perceptron' and 'L2 Regularization'.
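A minimal sketch of the CRF configuration reported in Table 4.3 is given below, assuming the sklearn-crfsuite wrapper around CRFsuite; the toy data stands in for the real per-token feature dictionaries of Section 4.5.4.

import sklearn_crfsuite

# Toy data in the expected shape: one list of per-token feature dicts per sentence,
# with a matching list of tags per sentence.
X_train = [[{"word.lower": "rajanna", "first_upper": True},
            {"word.lower": "siricilla", "first_upper": True},
            {"word.lower": "district", "first_upper": False}]]
y_train = [["B-Loc", "I-Loc", "other"]]

crf = sklearn_crfsuite.CRF(
    algorithm="l2sgd",            # SGD training with L2 regularization
    c2=0.1,                       # regularization strength reported in Table 4.3
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train)
print(crf.predict(X_train))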

4.5.2 Decision Tree

Decision Trees belong to a class of supervised machine learning algorithms used for classification and regression. They have a tree-like structure where the internal nodes represent tests on attributes, the branches represent the outcomes of the tests and the leaf nodes represent the class labels.

Feature              Precision   Recall   F1-score
Char N-Grams         0.73        0.56     0.62
Word N-Grams         0.88        0.59     0.70
Capitalization       0.15        0.02     0.03
Mentions, Hashtags   0.36        0.14     0.19
Numbers in String    0.01        0.01     0.01
Previous Word tag    0.78        0.19     0.15
Common Symbols       0.21        0.06     0.09

Table 4.4 Feature Specific Results for CRF

Tag            Precision   Recall   F1-score
B-Org          0.55        0.61     0.58
I-Per          0.43        0.50     0.47
B-Per          0.76        0.76     0.76
I-Loc          0.50        0.59     0.54
OTHER          0.98        0.97     0.97
B-Loc          0.83        0.84     0.84
I-Org          0.09        0.13     0.11
weighted avg   0.94        0.94     0.94

Table 4.5 Decision Tree Model with ‘max-depth=32’

The important challenge in Decision Trees is attribute selection, i.e. selecting an attribute for the nodes at each level of the tree. There are two popular attribute selection methods: Information Gain and Gini Index.

Information Gain is a measure of the change in entropy: whenever an attribute is introduced into a Decision Tree, it changes the entropy. The Gini Index is a measure of how often a randomly chosen element would be misclassified. We have experimented with parameters like the criterion (Information Gain, Gini Index) and the maximum depth of the tree.
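A minimal sketch of the decision-tree setting reported in Table 4.5 using scikit-learn follows; the toy feature dictionaries stand in for the real per-token features of Section 4.5.4, and DictVectorizer is used here only as one reasonable way to turn those dictionaries into vectors.

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy per-token feature dicts and tags.
X_train = [{"word.lower": "hyderabad", "first_upper": True},
           {"word.lower": "lo", "first_upper": False},
           {"word.lower": "trs", "all_upper": True},
           {"word.lower": "ki", "first_upper": False}]
y_train = ["B-Loc", "other", "B-Org", "other"]

vec = DictVectorizer(sparse=True)
X_train_v = vec.fit_transform(X_train)            # feature dicts -> sparse matrix

tree = DecisionTreeClassifier(criterion="gini",   # or "entropy" for information gain
                              max_depth=32)       # depth setting reported in Table 4.5
tree.fit(X_train_v, y_train)
print(tree.predict(vec.transform([{"word.lower": "hyderabad", "first_upper": True}])))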

4.5.3 Bidirectional LSTM

Long short Term Memory Network is deep learning model introduced to curb the problems of a Re- current Neural Networks. If the gradient is too small or too big, when the gradient propagates through the network it will become either too small or too big leading to the problems of vanishing or exploding gradient, due to this they tend to forget the information they have seen earlier. In order to curb this problem LSTMs were introduced. These Long Short Term Memory Networks have a cell state and three gates forget gate, input gate and update gate. Using these three gates and the cell state the LSTM

Feature             Precision  Recall  F1-score
Char N-Grams        0.42       0.72    0.51
Word N-Grams        0.57       0.59    0.58
Capitalization      0.19       0.31    0.23
Mentions, Hashtags  0.29       0.20    0.22
Numbers in String   0.06       0.16    0.07
Previous Word Tag   0.14       0.20    0.16
Common Symbols      0.16       0.20    0.16

Table 4.6 Feature Specific Results for Decision Tree

These networks are often used in structured prediction tasks such as POS tagging and Named Entity Recognition, and for encoding information for classification tasks.

A Bidirectional LSTM consists of two LSTM networks that encode the information in the forward and backward directions. The hidden state at each time step therefore encodes information about its neighbouring words. We experimented with the loss functions and optimization algorithms of the LSTM.
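A minimal Keras sketch of such a BiLSTM tagger is shown below; the dropout, activation, optimizer and loss echo those reported in Section 4.5.5, while the vocabulary size, sequence length and layer widths are placeholder assumptions rather than the thesis’s exact values.

```python
# Toy BiLSTM sequence tagger over padded, integer-encoded token sequences.
from tensorflow.keras.layers import (LSTM, Bidirectional, Dense, Dropout,
                                     Embedding, TimeDistributed)
from tensorflow.keras.models import Sequential

vocab_size, num_tags = 20000, 7   # assumed corpus-dependent values

model = Sequential([
    Embedding(vocab_size, 100),                      # randomly initialized embeddings
    Bidirectional(LSTM(64, return_sequences=True)),  # forward + backward context
    Dropout(0.3),
    TimeDistributed(Dense(num_tags, activation='softmax')),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=15, validation_split=0.1)
```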

4.5.4 Features

We use character-level, lexical and word-level features for Named Entity Recognition. They help in capturing information such as numbers, punctuation marks, special mentions, emoticons and numbers embedded in strings. We use the following features; a toy sketch that encodes several of them as a per-token feature dictionary follows the list.

1. Character N-Grams

N-Grams are contiguous sequences of n tokens taken from a text or speech sample. Character N-Grams are an important feature not only for Named Entity Recognition but for most classification or prediction tasks, since they capture sub-word information. This is particularly important here because we deal with informal, uncommon and incomplete words in Telugu-English code-mixed data. A single word is often written in many ways: in different tenses, different parts of speech, different (mis)spellings, or with suffixes or prefixes. A character N-Gram is able to capture the information common to all these variants, and character N-Grams have proven to be very efficient in many natural language processing tasks. Thus this is an important feature for Named Entity Recognition on Telugu-English code-mixed data.

2. Word N-Grams N-Grams represent a contiguous sequence of n items taken from a text or speech sample. Here we take the previous and next word as features to our model. These features provide contextual information to the model.

3. Capitalization We used two binary features to capture capitalization: one represents whether the first letter of the word is capitalized and the other whether the entire word is capitalized. People on social media generally write names of people, places and locations using capital letters to give them special importance or to show aggression.

4. Mentions and Hashtags People use the ‘@’ symbol on Twitter to refer to a person, and ‘#’ tags to make a topic trend or to follow trends. They use hashtags to denote an important place or organisation they visited or an important person they have met. These features help in identifying named entities.

5. Numbers in Strings People often use alphanumeric characters when writing on social media to shorten a message or as a different style of writing, e.g. ‘night’ as ‘n8’. Such words are generally not named entities, so their presence helps us identify negative samples.

6. Previous Word Tag Contextual features play an important role in predicting the tag of the current word, and the previous word’s tag provides this information. Every ‘I-’ tag follows a ‘B-’ tag, which makes this an important feature.

7. Common Symbols Symbols like ‘(’, ‘)’ and ‘[’ are generally followed by numbers or mentions of little importance, so the presence of these symbols indicates that the words before and after them are not named entities.
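The sketch below illustrates how several of the cues above could be encoded as a per-token feature dictionary of the kind a CRF consumes. The feature names, context window and symbol list are illustrative assumptions, and the previous-tag cue is normally handled by the CRF’s transition features rather than computed explicitly here.

```python
import re

def token_features(tokens, i):
    """Toy encoding of some of the cues listed above for token i of a tweet."""
    w = tokens[i]
    feats = {
        # character n-grams, here approximated by 2/3-character prefixes and suffixes
        'char_bi_prefix': w[:2], 'char_tri_prefix': w[:3],
        'char_bi_suffix': w[-2:], 'char_tri_suffix': w[-3:],
        # capitalization cues
        'first_upper': w[:1].isupper(),
        'all_upper': w.isupper(),
        # mentions and hashtags
        'is_mention': w.startswith('@'),
        'is_hashtag': w.startswith('#'),
        # alphanumeric shortenings such as 'n8'
        'has_digit': bool(re.search(r'\d', w)),
        # common bracketing symbols
        'is_symbol': w in {'(', ')', '['},
    }
    # word n-grams: previous and next word as context
    feats['prev_word'] = tokens[i - 1] if i > 0 else '<BOS>'
    feats['next_word'] = tokens[i + 1] if i < len(tokens) - 1 else '<EOS>'
    return feats
```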

4.5.5 Results and Discussion

Table 4.3 shows the results of the CRF model with the ‘l2sgd’ (Stochastic Gradient Descent with an L2 regularization term) algorithm run for 100 iterations. The c2 value corresponds to the L2 regularization term, which is used to restrict our estimation of w*. Experiments using the algorithms ‘ap’ (Averaged Perceptron) and ‘pa’ (Passive Aggressive) yielded almost identical F1-scores of 0.96. Table 4.4 shows the weighted-average feature-specific results for the CRF model, calculated excluding the ‘OTHER’ tag. Table 4.5 shows the results for the Decision Tree model with a maximum depth of 32; the F1-score is 0.94. Figure 4.1 shows the results of a Decision Tree with max depth = 32. Table 4.6 shows the weighted-average feature-specific results for the Decision Tree model, again calculated excluding the ‘OTHER’ tag.

Word            Truth  Predicted
Today           OTHER  OTHER
paper           OTHER  OTHER
clippings       OTHER  OTHER
manam           B-Org  OTHER
vartha          I-Org  OTHER
@Thirumalagiri  B-Loc  B-Per
@Nagaram        B-Loc  B-Per
Telangana       B-Org  B-Loc
Jagruthi        I-Org  OTHER
Thungathurthy   B-Loc  B-Loc
Niyojakavargam  OTHER  OTHER
@MedayRajeev    B-Per  B-Org
@JagruthiFans   B-Org  B-Org

Table 4.7 An Example Prediction of our CRF Model

Tag    Precision  Recall  F1-score
BL     0.94       0.86    0.89
BO     0.76       0.56    0.64
BP     0.80       0.70    0.74
IL     0.41       0.55    0.47
IO     0.04       0.09    0.056
IP     0.33       0.52    0.40
OTHER  0.97       0.98    0.97

Table 4.8 Bi-LSTM model with optimizer ‘adam’ (weighted F1-score of 0.95)

In the experiments with the BiLSTM we varied the optimizer, the activation function, the number of units and the number of epochs. After several experiments, the best result was obtained using ‘softmax’ as the activation function, ‘adam’ as the optimizer and ‘categorical cross entropy’ as the loss function. Table 4.8 shows the results of the BiLSTM on our corpus using a dropout of 0.3, 15 epochs and random initialization of the embedding vectors; the F1-score is 0.95. Figure 4.2 shows the BiLSTM model architecture. Table 4.7 shows an example prediction by our CRF model. This is a good example of the areas in which the model struggles to learn. The model predicted the tag of ‘@Thirumalagiri’ as ‘B-Per’ instead of ‘B-Loc’ because there are person names which are lexically similar to it. The tag of the word ‘Telangana’ is predicted as ‘B-Loc’ instead of ‘B-Org’ because ‘Telangana’ is a Location in most of the examples and an Organization in very few cases. We can also see that ‘@MedayRajeev’ is predicted as ‘B-Org’ instead of ‘B-Per’.

Figure 4.1 Results from a Decision Tree

The model performs well for the ‘OTHER’ and Location tags. Lexically similar words having different tags and insufficient data make training difficult at times, as a result of which we see some incorrect tag predictions.

Figure 4.2 BiLSTM model architecture

Chapter 5

Emotion Prediction for Telugu-English Code-Mixed Social Media Data

5.1 Corpus Creation

The corpus consists of tweets scraped from Twitter through a Python API that uses Twitter’s advanced search option. Tweets are extracted based on hashtags; we used hashtags of social events, movies, movie stars, popular people, places, festivals, political events, politicians, sports, sportspersons etc. The retrieved tweets are in JSON format and contain information such as the timestamp, URL, text, user, retweets, replies, full name, id and likes.

5.1.1 Pre-Processing

The pre-processing of the tweets consists of the following steps; a toy filter implementing them is sketched after the list.

• URLs, links and multiple spaces are removed from tweets

• Only tweets that contain both Telugu and English words are considered, in order to maintain the code-mixed nature of the tweets

• Tweets that contain only Telugu words or only English words are removed

• Only tweets that contain a minimum of five words are considered

• Tweets that do not express the code-mixed nature predominantly, i.e. tweets that contain only one or two linguistic units from the other language as suffixes or prefixes, are removed too
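A toy filter implementing these steps might look as follows. It assumes word-level language tags (e.g. 'te'/'en') are available for each tweet, and the threshold of three minority-language tokens is an illustrative reading of the “only one or two linguistic units” rule, not a value taken from the thesis.

```python
import re

def keep_tweet(text, lang_tags, min_words=5, min_minority=3):
    """Return the cleaned tweet if it passes the filters above, else None."""
    # remove URLs/links and collapse multiple spaces
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    words = text.split()
    if len(words) < min_words:              # at least five words
        return None
    te = sum(1 for t in lang_tags if t == 'te')
    en = sum(1 for t in lang_tags if t == 'en')
    if te == 0 or en == 0:                  # must contain both languages
        return None
    if min(te, en) < min_minority:          # drop weakly code-mixed tweets
        return None
    return text
```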

5.1.2 Annotation

The tweets were annotated by two persons who are proficient in both Telugu and English and have a linguistic background. Annotation was done in two stages: Emotion Annotation and Causal Language Annotation.

Emotion Annotation

The tweets are annotated with the emotion tags ‘Happy’, ‘Angry’, ‘Sad’, ‘Fear’, ‘Disgust’, ‘Surprise’ and ‘Multiple Emotion’. There are also tweets with other emotions, but the above are the emotions that occur predominantly. The following tweets are examples of the emotions ‘Happy’, ‘Angry’, ‘Sad’, ‘Fear’, ‘Disgust’, ‘Surprise’ and ‘Multiple Emotion’ respectively.

Happy

T1: “All the best Raj Tarun Anna.. e sari Super hit confirm..Naku Nammakam undhi Anna..Success Track ekkuthav malli..All the best” Translation: “All the best Raj Tarun Brother. A super hit is confirmed this time. I have a strong belief. You will head upon the success track again. All the best.”

In the above tweet, the text “All the best Raj Tarun Anna.. e sari Super hit confirm..Naku Nammakam undhi Anna..Success Track ekkuthav malli..All the best” expresses the emotion Happy.

T2: “manchi time lo manchi talk tho manchi collections bhayya congrats” Translation: “In good time with good talk we have got good collections Brother. Congrats”

In the above tweet, the text “manchi time lo manchi talk tho manchi collections bhayya congrats” expresses the emotion Happy.

Angry

T3: “ cheputo kottali ye ninnu Oka okaru publicity kosam pawan kalyan gari ni troll cheyadam Baga nerchukunaru Shame on u To Use That words on Pawan sir’s mother ... Bloddy Buger Nuv ni pandhi mutgi dhana Nikante pandhi melu Loafer” Translation: “You should be beaten up with shoes. Someone has learned to troll pawan kalyan for publicity. Shame on you to use that words on Pawan sir’s mother. Bloddy buger, Pig face. A pig is better than you. Loafer”

In the above tweet, the text “cheputo kottali ye ninnu”, “ Shame on u”, “Bloddy Buger Nuv ni pandhi mutgi dhana Nikante pandhi melu Loafer” expresses the emotion Angry.

T4: “Rey!! nee pendrive kosam okka state CM meda edustunav.. thupp ey tweet tho impression poyindi velli ne wives batalu vutukora jaffa .. srireddy better nee kanna” Translation: “Hey! you are weeping upon a state CM for the sake of your pendrive, thu, I have lost my impression of you with this tweet. Go and wash your wife’s clothes, jaffa, sri reddy is better than you.”

In the above tweet, the text “ey tweet tho impression poyindi velli ne wives batalu vutukora jaffa .. srireddy better nee kanna” expresses the emotion Angry.

Sad

T5: “Manobhavalu debba thintayi kabatti.. Ayina as actor he did his best.. Story bagopothe rc em chesthadu papam” Translation: “Since he did not want to hurt others’ feelings; also, as an actor he did his best. What can RC do if the story is not good? Poor guy.”

In the above tweet, the text “Story bagopothe rc em chesthadu papam” expresses the emotion Sad.

T6: “always felt bad for Pk fans kindoff directors or movies he choose, papam ee Thala Ajith fans paristiti mari darunam aa Meher 2.0 shiva tho 4 movies ante” Translation: “always felt bad for Pk fans for the directors or movies he chooses. Poor Thala Ajith fans, their situation is even worse as they are doing 4 movies with Meher 2.0 and Shiva”

In the above tweet, the text “always felt bad for Pk fans”, “papam ee Thala Ajith fans paristiti mari darunam” expresses the emotion Sad.

Fear

T7: “thanks andi. Vere valla talks chustunte bayam estundi. Thanks sir.” Translation: “Thank you. I am afraid by looking at others talk. Thank you.”

In the above tweet, the text “Vere valla talks chustunte bayam estundi.” expresses the emotion Fear.

T8: “Asale Second time combination lopala chinna bhayam start aindhi dude... ki workout ainattu aithe baguntadhi” Translation: “It’s a second time combination. A small fear has started in me dude. Hope it works out like the Businessman movie.”

In the above tweet, the text “lopala chinna bhayam start aindhi dude...” expresses the emotion Fear.

Disgust

T9: “endku edusthunnav not an easy pitch to bat ani chepthunnaru ga commentry kuda vinu already India have hit more than 20 runs than par score antunnadu em telikunda edusthunnav thu” Translation: “Why do you weep? They are already saying that it is not an easy pitch to bat, listen to commentary too, India have hit more than 20 runs, thu!, why do you yell without even knowing any- thing?”

In the above tweet, the text “endku edusthunnav”, “ em telikunda edusthunnav thu” expresses the emotion Disgust.

T10: “nela ticket one of the worst film ever seen. Trailer ke chiraku vachindi. songs too worst” Translation: “nela ticket one of the worst film ever seen. I got irritated with the trailer itself. Songs are too worst”

In the above tweet, the text “Nela ticket one of the worst film ever seen. Trailer ke chiraku vachindi. songs too worst” expresses the emotion Disgust.

Surprise

T11: “vammo idhi epudu nunchi anyways welcome change” Translation: “OMG! Since when is this happening? anyways welcome change”

In the above tweet, the text “vammo idhi epudu nunchi” expresses the emotion Surprise.

T12: “Ento ah surprise Mamul excitement ledu naaku Waiting .. Waiting .... . Actully adi Maku Pedda surprise” Translation: “What that surprise is going to be? I am very much excited. I am Waiting. Actually this itself is a big surprise for us.”

In the above tweet, the text “Ento ah surprise Mamul excitement ledu naaku Waiting .. Waiting .... . Actully adi Maku Pedda surprise” expresses the emotion Surprise.

Multiple Emotion

T13: “look marchu Mahesh anna. appude brm failure ni happy ga marchipotam” Translation: “Please change your looks Mahesh brother. At least then we can happily forget the failure

of Brm Movie.”

In the above tweet, the text “look marchu Mahesh anna. appude brm failure ni happy ga marchipotam” expresses multiple emotions.

T14: “Hey when you came bro ? Wowwww cherry birthday ki sudden surprise ichaava maaku ? Great amma chaala great” Translation: “Hey when did you come? Did you give us a sudden surprise for cherry’s birthday? It is so great of you.”

In the above tweet, the text “Hey when you came bro ? Wowwww cherry birthday ki sudden surprise ichaava maaku ? Great amma chaala great” expresses multiple emotions.

Causal Language Annotation

The tweets are annotated with the causal language tags ‘English’, ‘Telugu’, ‘Both’ and ‘Mixed’. This tag indicates the language through which the emotion in a text is expressed.

• If the Emotion in a text is expressed solely through ‘English’ language then the causal language is English.

• If the Emotion in a text is expressed solely through ‘Telugu’ language then the causal language is Telugu.

• If the emotion in a text is expressed using both the languages independent of each other then the causal language is ‘Both’

• If the emotion in a text is expressed using linguistic units from both languages at once, then the causal language is ‘Mixed’

The following explains the Causal Language annotation for each tag using examples.

English

In the below examples the emotion is expressed solely through the English language.

T15: “velu maatram okate pani and they start trolling or abusing using personal life !! Cheap shit” Translation: “They have only one work and they start trolling and abusing using personal life! Cheap shit”

T16: “congratulations sir...pls fill some spirit in our andhra people also to elect leaders irrespective of castism..ippatikaina ma CM ki ardham ayyi untadhi..pakkodi intlo velu pettakudadhu ani”

Translation: “Congratulations sir, please fill some spirit in our Andhra people also to elect leaders irrespective of casteism. At least now our CM might have understood that one should not interfere in others’ personal matters.”

Telugu

In the below examples the emotion is expressed solely through the Telugu language.

T17: “This is the reason why jadeja is called as SIR, a ball spin vastaro a ball straight vestaro papam Anna garike arthamkaledu” Translation: “This is the reason why jadeja is called as SIR, he himself does not know if he is going to bowl a straight one or a spin.”

T18: “but u need to reinvent ur self wat abt tarun and navadeep but they have very good family support adhi ledhu papam uday ki” Translation: “But you need to reinvent yourself, what about tarun and navadeep but they have very good family support, poor uday does not have that”

Both

In the below examples the emotion is expressed using both languages independently.

T19: “Sir first telangana ki budget lo amount ippavandhi, raithulaku kopam vasthe modi a tokka,, be great ful to kcr that he is talking on behalf of farmers” Translation: “Sir please allocate money to telangana in the budget, if the farmers are angry they do not even care about modi, be grateful to KCR sir, he is talking on behalf of farmers”

T20: “Vote for note case appudu evaru tappinchukundi. PK clearly said all this shit things done by not only BJP, TDP all parties should avoid such things ani” Translation: “Who got escaped in the vote for note case? PK clearly said all this shit things done by not only BJP, TDP but also all other parties and they should avoid doing such things ”

Mixed

In the below examples the emotion is expressed using linguistic units from both languages at once.

T21: “ snr fans mundu mana tucham anthe close inka” Translation: “We have spit in front of senior fans, that’s it, we are finished!”

T22: “So lo thappa anni chotla super antav nice..” Translation: “So, you are saying that except Eluru all other locations are Super. Nice”

Tag               Cohen's Kappa
Happy             0.90
Angry             0.91
Sad               0.92
Fear              0.89
Disgust           0.90
Surprise          0.92
Multiple Emotion  0.93
Causal Language   0.91

Table 5.1 Inter Annotator Agreement.

5.1.3 Inter Annotator Agreement

The annotation was done by two persons who are proficient in both Telugu and English and have a linguistic background. The Inter Annotator Agreement was calculated between the two sets of annotations, containing 2924 tweets, using Cohen's kappa coefficient, and was significantly high. Table 5.1 shows the Inter Annotator Agreement.
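For a single tag, the agreement value can be computed with scikit-learn; `annotator_a_labels` and `annotator_b_labels` below are assumed to be the two annotators' label sequences for the same tweets, so this is only a minimal sketch of the calculation behind Table 5.1.

```python
# Cohen's kappa between two annotators' labels for one annotation dimension.
from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(annotator_a_labels, annotator_b_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```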

5.1.4 Corpus Statistics

We mined a total of 26,61,629 tweets; after the pre-processing steps, in which we removed all the noisy tweets, we are left with 2924 tweets. The distributions of the Emotion and Causal Language tags in the corpus are shown in Tables 5.2 and 5.3 respectively. To understand the corpus better, we performed language identification for each word in the corpus using the tool1 from the research in [7]. The distribution of words between the languages helps us understand the code-mixing nature of the corpus. Table 5.4 shows the distribution of all words in the corpus between English and Telugu, and Table 5.5 shows the distribution of unique words. Each tweet on average consists of 25 words.

5.2 Experimental Analysis

The Experiments on the data are conducted using Support Vector Machines(SVMs) and Long Short Term Memory Networks(LSTMs).

1https://github.com/irshadbhat/litcm

Emotion            Sentences
Happy              825
Angry              764
Sad                709
Fear               30
Disgust            171
Surprise           38
Multiple Emotions  387
Total              2924

Table 5.2 Data Distribution

Causal Language  Sentences
English          125
Telugu           68
Mixed            2526
Both             205

Table 5.3 Causal Language Distribution

5.3 SVM

Support Vector Machines belong to the class of supervised machine learning algorithms and work with labelled data. An SVM maps the training samples to points in a space such that there is a gap between the two classes. Given a new sample, it predicts the class by mapping the sample into the same space and assigning a class based on which side of the gap it falls on. Support Vector Machines are also capable of non-linear classification using the kernel trick, which maps the points into a higher-dimensional space.

5.3.1 Feature Identification and Extraction

In our work, we used the following feature vectors to train our supervised machine learning model; a sketch combining several of them appears after the list.

1. Character N-Grams: This feature is language independent and is one of the important features for classifying text. Since code-mixed English-Telugu social media text contains spelling variations and informal words which differ from standard English and Telugu words, character N-Grams help us capture this information. We used character N-Grams as a feature, where n varies from 2 to 3.

Language  Word Count
English   17037
Telugu    158618
Total     176555

Table 5.4 Total Word Distribution

Language  Unique Words
English   1634
Telugu    14265
Total     15899

Table 5.5 Unique Word Distribution

2. Word N-Grams: Word N-Grams are widely used to capture emotion in text; we used word N-Grams, where n varies from 1 to 2, as a feature in our model.

3. Emoticons: We used the count of emoticons present in a tweet for each emotion as a feature. Emoticons are used to express emotions in social media. ‘:)’ expresses Happiness, ‘:(’ expresses sadness etc. We use a list of Western Emoticons from Wikipedia2.

4. Punctuation: Punctuation marks are also used as a feature. For example, multiple exclamation marks and question marks depict feelings of astonishment and anger respectively. We count the occurrences of each punctuation mark in a sentence and use them as a feature.

5. Repetitive Characters: People on social media often repeat characters in words to stress an emotion or feeling; a few examples are ‘yaaay’, ‘happpyy’, ‘loooool’ and ‘parrttyyy’. We count all words in which a particular character is repeated more than two times in a row and use the count as a feature.

6. Capitalization: People often write words with capital letters to denote shouting or anger. We made a count of all such words in a tweet which are completely written in capital letters and used it as a feature.

7. Intensifiers: Intensifiers are used to lay emphasis on emotion or sentiment. We used a list of intensifier words from Wikipedia3; we obtained a list of Telugu intensifiers by translating the English intensifiers and adding a few common Telugu intensifier words from the corpus. We count all such words present in a tweet and use the count as a feature.

2https://en.wikipedia.org/wiki/List_of_emoticons 3https://en.wikipedia.org/wiki/Intensifier

Feature                Accuracy
Char N-Grams           0.49
Word N-Grams           0.50
Emoticons              0.14
Punctuation            0.29
Repetitive Characters  0.14
Capitalization         0.16
Intensifiers           0.17
Negation Words         0.15
Emotion Words          0.47

Table 5.6 Results of SVM

8. Negation Words: We select negation words to address deviation from the apparent emotion caused by negated phrases like ‘not sad’ or ‘not happy’; ‘not happy’ depicts sadness even though it contains the word ‘happy’. A list of English negation words was taken from Christopher Potts’ sentiment tutorial4, and Telugu negation words were manually selected from the corpus. We count all such words in a tweet and use the count as a feature.

9. Emotion Words: We created a list of words for each emotion, taken from both English and Telugu. We used the count of these words present in a tweet for each emotion as a feature, e.g. ‘joy’, ‘merry’, ‘santhosham’, ‘glad’, ‘cheers’ ‘’ and others are in the list for ‘HAPPY’.
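A compact sketch of how such feature vectors could be assembled and fed to an SVM is given below; the emoticon list, the hand-crafted counts and the vectorizer settings are illustrative assumptions, not the exact thesis pipeline, and `tweets`/`labels` are assumed to be parallel lists of texts and emotion tags.

```python
# Toy pipeline: character and word n-gram counts plus a few hand-crafted
# counts, concatenated and fed to an RBF-kernel SVM.
import re
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

def handcrafted_counts(text):
    """Counts of emoticons, punctuation, elongated and all-caps words (toy lists)."""
    words = text.split()
    emoticons = {':)', ':(', ':D', ':P'}   # assumed small emoticon list
    return [
        sum(w in emoticons for w in words),
        text.count('!') + text.count('?'),
        sum(bool(re.search(r'(.)\1{2,}', w)) for w in words),  # repeated characters
        sum(w.isupper() and len(w) > 1 for w in words),        # all-caps words
    ]

char_vec = CountVectorizer(analyzer='char_wb', ngram_range=(2, 3))
word_vec = CountVectorizer(ngram_range=(1, 2))

X = hstack([
    char_vec.fit_transform(tweets),
    word_vec.fit_transform(tweets),
    csr_matrix(np.array([handcrafted_counts(t) for t in tweets])),
])
clf = SVC(kernel='rbf', max_iter=100)
clf.fit(X, labels)
```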

5.4 LSTMs

Long Short-Term Memory networks were proposed in order to overcome the vanishing gradient problem of Recurrent Neural Networks. They are widely used in sequential data processing (speech recognition, handwriting recognition etc.) and for the encoding, classification and prediction of sequential data in deep learning. LSTMs overcome the vanishing gradient problem with the help of a cell state, which helps in capturing context. Each cell has an input gate, an output gate and a forget gate, which help the LSTM remember useful information and learn to forget useless information.

Table 5.6 shows the feature-specific results for the SVM. We experimented on only three emotions (‘Happy’, ‘Angry’ and ‘Sad’), since the number of instances of the other emotions is relatively small. Using the SVM we conducted several experiments, changing parameters such as the gamma value and the maximum number of iterations. The best accuracy we obtained is 58%, with an RBF kernel and 100 iterations.

4http://sentiment.christopherpotts.net/lingstruc.html

Emotion      Precision  Recall  F1-score
Happy        0.75       0.65    0.70
Angry        0.89       0.41    0.56
Sad          0.45       0.77    0.57
avg / total  0.70       0.61    0.61

Table 5.7 Results of LSTM on our corpus

Emotion   Precision  Recall  F1-score
Happy     0.78       0.53    0.63
Angry     0.43       0.60    0.50
Sad       0.40       0.42    0.41
Fear      0.01       0.01    0.01
Disgust   0.50       0.09    0.15
Surprise  0.01       0.01    0.01
Mixed     0.20       0.27    0.23

Table 5.8 LSTM Results on all emotions

In the experiments with the LSTM we varied parameters such as the dropout, the activation function, the loss function, the optimizer and the number of epochs. The best accuracy we obtained is 60.92%, with an F1-score of 0.61, using a dropout of 0.2, ‘softmax’ as the activation function, ‘sparse_categorical_crossentropy’ as the loss function, ‘adam’ as the optimizer, randomly initialized word embeddings and training for 20 epochs. Table 5.7 shows the results of the LSTM on our corpus, and Table 5.8 shows its results on all emotions. In our experiments with a Bidirectional LSTM (BiLSTM), after varying the parameter values we obtained a best accuracy of 70.74% and an F1-score of 0.71, using a dropout of 0.2, ‘softmax’ as the activation function, ‘sparse_categorical_crossentropy’ as the loss function, ‘adam’ as the optimizer, randomly initialized word embeddings and training for 15 epochs. Table 5.9 shows the results of the BiLSTM on our corpus, and Table 5.10 shows its results on all emotions. The training and test sets were split in the ratio of 80 to 20. The low results of the LSTM and BiLSTM on the emotions ‘Fear’ and ‘Surprise’ are due to the small number of such samples in the training and test data.
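A minimal Keras sketch of the classifier described above is given below. It uses the quoted hyperparameters, while the vocabulary size, sequence length and layer width are assumptions; dropping the Bidirectional wrapper gives the plain LSTM variant.

```python
# Toy emotion classifier over padded, integer-encoded tweets.
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential

vocab_size, num_classes = 16000, 3   # assumed corpus-dependent values

model = Sequential([
    Embedding(vocab_size, 100),      # randomly initialized word embeddings
    Bidirectional(LSTM(64)),         # remove Bidirectional(...) for the plain LSTM
    Dropout(0.2),
    Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=15, validation_split=0.1)
```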

5.5 Results and Discussion

The accuracy of the system increased from 58% using the SVM to 71% using the BiLSTM. The increase can be attributed to the rich vector representations generated by the BiLSTM models, which also capture contextual information: the BiLSTM captures context from both directions using a forward and a backward LSTM.

Emotion      Precision  Recall  F1-score
Happy        0.76       0.76    0.76
Angry        0.76       0.73    0.74
Sad          0.60       0.63    0.61
avg / total  0.71       0.71    0.71

Table 5.9 Results of BiLSTM on our corpus

Emotion   Precision  Recall  F1-score
Happy     0.65       0.60    0.62
Angry     0.57       0.42    0.48
Sad       0.54       0.49    0.51
Fear      0.01       0.01    0.01
Disgust   0.10       0.35    0.15
Surprise  0.33       0.14    0.20
Mixed     0.16       0.13    0.14

Table 5.10 BiLSTM Results on all emotions

Considering the above results and the quality of the Inter Annotator Agreement, our corpus looks promising and serves as a stepping stone for research on Telugu-English code-mixed language.

Chapter 6

Conclusion and Future Work

The following sections describe the conclusions and future work for the three research areas discussed above.

6.1 Conversational Dialogue System for Telugu

We are experimenting with word representations for transliterated Telugu words from the publicly available fastText representations for Telugu1. As these vectors are created from a huge dump of data, they might be helpful for us. Techniques like semantic modelling and clustering for word normalization are to be experimented with for better word representations. The perplexity of the model can be reduced by creating richer word representations with the help of POS tags and language tags [24]. Telugu being an agglutinative language, multiple words are combined to form a single word; example words are (eedukuntuvelladu = eedu + kuntu + vellal + du), meaning ‘he went away swimming’, and (Parugethukuntuvelladu = Parugu + ethu + kuntu + vellal + du), meaning ‘he went away running’. Splitting such words to the morpheme level using morph analyzers results in a richer vocabulary and, as a result, better word representations, which help in tasks such as domain transfer. In order to make use of the large amount of English data available, we can alter the word order of English so that it matches that of Telugu; this experiment can also be done with other resource-rich languages by altering their word order, and the knowledge obtained from those datasets can be applied to tasks related to Telugu. Future work on the corpus, such as sentiment analysis of dialogues, understanding the differences between dialogue and monologue data (turn-taking behaviour, convergence) and dialogue state recognition, encourages us to continue research in this field.

1 https://fasttext.cc/docs/en/crawl-vectors.html

6.2 Named Entity Recognition

The following are the contributions of our work towards Named Entity Recognition for code-mixed English-Telugu. The work presented an annotated code-mixed English-Telugu corpus for Named Entity Recognition, which to the best of our knowledge is the first of its kind. It also presented experiments with the machine learning models Conditional Random Fields (CRFs), Decision Trees and Bidirectional Long Short-Term Memory networks (BiLSTMs), which resulted in F1-scores of 0.96, 0.94 and 0.95 respectively. These results are encouraging considering the amount of research done in this field. This work introduces and addresses Named Entity Recognition of code-mixed English-Telugu data as a research problem. As future work, the corpus can be enriched with parts-of-speech tags for each word, the number of Named Entities in the corpus can be increased, and the research can be extended to Named Entity Recognition for text involving three or more languages.

6.3 Emotion Prediction

The following are the contributions of our work towards Emotion Prediction for code-mixed English-Telugu. Our work presented an annotated corpus of code-mixed English-Telugu sentences for Emotion Prediction; to the best of our knowledge this is the first ever corpus in this field of research. The work presented experiments with the machine learning models Support Vector Machines (SVMs), Long Short-Term Memory networks (LSTMs) and Bidirectional LSTMs, which reported accuracies of 58%, 60.92% and 70.74% respectively. This work introduces and addresses Emotion Prediction for code-mixed English-Telugu as a research problem. The corpus can be enriched by adding parts-of-speech tags for each word, and its size can be increased, which helps resource creation and further research. The work can be extended to Emotion Prediction for text involving linguistic units from three or more languages.

Related Publications

Corpus Creation and Analysis for Named Entity Recognition in Telugu-English Code-Mixed Social Media Data, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop Vamshi Krishna Srirangam, Appidi Abhinav Reddy, Vinay Singh, Manish Shrivastava.

Creation and Analysis of Telugu Conversational Corpus, 20th International Conference on Computational Linguistics and Intelligent Text Processing. Vamshi Krishna Srirangam, Koushik Reddy Sane, Sairam Kolla, Manish Shrivastava.

Bibliography

[1] P. Auer. The pragmatics of code-switching: a sequential approach, pages 115–135. Cambridge University Press, 1995.
[2] A. Azpeitia, M. Cuadros, S. Gaines, and G. Rigau. Nerc-fr: supervised named entity recognition for french. In International Conference on Text, Speech, and Dialogue, pages 158–165. Springer, 2014.
[3] K. Bali, J. Sharma, M. Choudhury, and Y. Vyas. “I am borrowing ya mixing?” an analysis of English-Hindi code mixing in Facebook. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 116–126, Doha, Qatar, Oct. 2014. Association for Computational Linguistics.
[4] I. M. Barredo. Pragmatic functions of code-switching among basque-spanish bilinguals.
[5] R. Begum, K. Bali, M. Choudhury, K. Rudra, and N. Ganguly. Functions of code-switching in tweets: An annotation framework and some initial experiments. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1644–1650, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA).
[6] R. Bhargava, B. Vamsi, and Y. Sharma. Named entity recognition for code mixing in indian languages using hybrid approach. Facilities, 23(10), 2016.
[7] I. A. Bhat, V. Mujadia, A. Tammewar, R. A. Bhat, and M. Shrivastava. Iiit-h system submission for fire2014 shared task on transliterated search. In Proceedings of the Forum for Information Retrieval Evaluation, FIRE ’14, pages 48–53, New York, NY, USA, 2015. ACM.
[8] D. G. Bobrow, R. M. Kaplan, M. Kay, D. A. Norman, H. S. Thompson, and T. Winograd. Gus, a frame-driven dialog system. Artif. Intell., 8:155–173, 1977.
[9] A. Bohra, D. Vijay, V. Singh, S. S. Akhtar, and M. Shrivastava. A dataset of Hindi-English code-mixed social media text for hate speech detection. In Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media, pages 36–41, New Orleans, Louisiana, USA, June 2018. Association for Computational Linguistics.
[10] E. G. Bokamba. Code-mixing, language variation, and linguistic theory: Evidence from bantu languages. Lingua, 76(1):21–62, 1988.
[11] H. Chen, X. Liu, D. Yin, and J. Tang. A survey on dialogue systems: Recent advances and new frontiers, 2017.

[12] J. P. Chiu and E. Nichols. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370, 2016.

[13] N. Choudhary, R. Singh, V. Anvesh Rao, and M. Shrivastava. Twitter corpus of resource-scarce languages for sentiment analysis and multilingual emoji prediction. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1570–1577, Santa Fe, New Mexico, USA, Aug. 2018. Association for Computational Linguistics.

[14] D. Crystal. Language and the Internet. Cambridge University Press, 2001.

[15] C. Danescu-Niculescu-Mizil and L. Lee. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, pages 76–87, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

[16] J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the 43rd annual meeting on association for computational linguistics, pages 363–370. Association for Computational Linguistics, 2005.

[17] Y. Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, Oct. 2014. Association for Computational Linguistics.

[18] R. Kumar, A. N. Reganti, A. Bhatia, and T. Maheshwari. Aggression-annotated corpus of hindi-english code-mixed data. CoRR, abs/1803.09402, 2018.

[19] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California, June 2016. Association for Computational Linguistics.

[20] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation, 2016.

[21] X. Ma and E. Hovy. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074, Berlin, Germany, Aug. 2016. Association for Computational Linguistics.

[22] S. Mukherjee and P. Bhattacharyya. Sentiment analysis in Twitter with lightweight discourse analysis. In Proceedings of COLING 2012, pages 1847–1864, , India, Dec. 2012. The COLING 2012 Organizing Committee.

[23] S. S. Mukku and R. Mamidi. ACTSA: Annotated corpus for Telugu sentiment analysis. In Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems, pages 54–58, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics.

[24] K. Nelakuditi, D. S. Jitta, and R. Mamidi. Part-of-speech tagging for code mixed english-telugu social media data. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 332–342. Springer, 2016.
[25] K. Nelakuditi, D. S. Jitta, and R. Mamidi. Part-of-speech tagging for code mixed english-telugu social media data. In A. Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, pages 332–342, Cham, 2018. Springer International Publishing.
[26] S. Parupalli, V. A. Rao, and R. Mamidi. BCSAT: A benchmark corpus for sentiment analysis in telugu using word-level annotations. CoRR, abs/1807.01679, 2018.
[27] M. E. Peters, W. Ammar, C. Bhagavatula, and R. Power. Semi-supervised sequence tagging with bidirectional language models, 2017.
[28] A. Pratapa, G. Bhat, M. Choudhury, S. Sitaram, S. Dandapat, and K. Bali. Language modeling for code-mixing: The role of linguistic theory based synthetic data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1543–1553, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[29] R. Redouane. Linguistic constraints on codeswitching and codemixing of bilingual moroccan arabic-french speakers in canada. 2005.
[30] K. Sarkar. A hidden markov model based system for entity extraction from social media english text at fire 2015. arXiv preprint arXiv:1512.03950, 2015.
[31] I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[32] K. Singh, I. Sen, and P. Kumaraguru. Language identification and named entity recognition in hinglish code mixed tweets. In Proceedings of ACL 2018, Student Research Workshop, pages 52–58, 2018.
[33] K. Singh, I. Sen, and P. Kumaraguru. A twitter corpus for Hindi-English code mixed POS tagging. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 12–17, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[34] A. Sordoni, Y. Bengio, H. Vahabi, C. Lioma, J. Grue Simonsen, and J.-Y. Nie. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, pages 553–562, New York, NY, USA, 2015. Association for Computing Machinery.
[35] S. Swami, A. Khandelwal, V. Singh, S. S. Akhtar, and M. Shrivastava. A corpus of english-hindi code-mixed tweets for sarcasm detection. CoRR, abs/1805.11869, 2018.
[36] E. F. Tjong Kim Sang and F. De Meulder. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4, pages 142–147. Association for Computational Linguistics, 2003.

[37] D. Vijay, A. Bohra, V. Singh, S. S. Akhtar, and M. Shrivastava. Corpus creation and emotion prediction for hindi-english code-mixed social media text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 128–135, 2018.
[38] M. Warschauer, G. R. E. Said, and A. G. Zohry. Language choice online: Globalization and identity in egypt. Journal of Computer-Mediated Communication, 7(4):JCMC744, 2002.
[39] J. Weizenbaum. Eliza - a computer program for the study of natural language communication between man and machine. Commun. ACM, 9(1):36–45, Jan. 1966.
[40] Z. Ye and Z.-H. Ling. Hybrid semi-Markov CRF for neural sequence labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 235–240, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[41] J. L. C. Zea, J. E. O. Luna, C. Thorne, and G. Glavaš. Spanish ner with word representations and conditional random fields. In Proceedings of the Sixth Named Entity Workshop, pages 34–40, 2016.
[42] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification, 2015.
[43] W. Zhao, J. Ye, M. Yang, Z. Lei, S. Zhang, and Z. Zhao. Investigating capsule networks with dynamic routing for text classification, 2018.
[44] X. Zhou, L. Li, D. Dong, Y. Liu, Y. Chen, W. X. Zhao, D. Yu, and H. Wu. Multi-turn response selection for chatbots with deep attention matching network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1118–1127, Melbourne, Australia, July 2018. Association for Computational Linguistics.
