Word Level Language Identification in English Telugu
Total Page:16
File Type:pdf, Size:1020Kb
PACLIC 32 Word Level Language Identification in English Telugu Code Mixed Data Sunil Gundapu Radhika Mamidi Language Technologies Research Centre Language Technologies Research Centre KCIS, IIIT Hyderabad KCIS, IIIT Hyderabad Telangana, India Telangana, India gundapusunil@gmail.com radhika.mamidi@iiit.ac.in Abstract tion in a single discourse between two languages, where the switching occurs within a sentence). Code In a multilingual or sociolingual configura- mixing is inconsistently elucidated in disparate sub- tion Intra-sentential Code Switching (ICS) or Code Mixing (CM) is frequently observed fields of linguistics and frequent examination of nowadays. In the world most of the people phrases, words, inflectional, derivational morphol- know more than one language. The CM us- ogy and syntax use of a term as an equivalent to age is especially apparent in social media plat- Code Mixing. forms. Moreover, ICS is particularly signifi- Code mixing is defined as the embedding of lin- cant in the context of technology, health and guistic units of one language into utterance of an- law where conveying the upcoming develop- other language. CM is not only used in commonly ments are difficult in ones native language. used spoken form of multilingual setting, but also In applications like dialog systems, machine translation, semantic parsing, shallow pars- used in social media websites in the form of com- ing, etc. CM and Code Switching pose seri- ments and replies, posts and especially in chat con- ous challenges. To do any further advance- versations. Most of the chat conversations are in a ment in code-mixed data, the necessary step formal or semi-formal setting and CM is often used. is Language Identification. So, in this pa- It commonly takes place in scenarios where a stan- per we present a study of various models - dard formal education is received in a language dif- Nave Bayes Classifier, Random Forest Clas- ferent than the persons native language or mother sifier, Conditional Random Field (CRF) and Hidden Markov Model (HMM) for Language tongue. Identification in English - Telugu Code Mixed Most of the case studies state that CM is very pop- Data. Considering the paucity of resources ular in the world in the present day, especially in in code mixed languages, we proposed CRF countries like India, with more than 20 official lan- model and HMM model for word level lan- guages. One such language, Telugu is a Dravidian guage identification. Our best performing sys- language used by a total of 7.19% people of Telan- tem is CRF-based with an f1-score of 0.91. gana and Andra Pradesh states in India. Because of influence of English, most people use a combination 1 Introduction of Telugu and English in conversations. There is lot of research work being carried out in code mixed Code switching is characterized as the use of two data in Telugu as well as other Indian languages. or more languages, diversity and verbal style by Consider this example sentence from a Telugu a speaker within a statement, pronouncement or website1 which consists of movie reviews, short discourse, or between different speakers or situa- stories and articles in cross script code-mixed lan- tions. This code switching can be classified as inter- guages. This example sentence illustrates the mix- sentential (the language switching is done at sen- tence boundaries) and intra-sentential (the alterna- 1www.chaibisket.com 180 32nd Pacific Asia Conference on Language, Information and Computation Hong Kong, 1-3 December 2018 Copyright 2018 by the authors PACLIC 32 ing being addressed in this paper: take one word in a sentence as a data point. To fig- Example: “John/NE nuvvu/TE exams/EN ure out this sequence labeling classification problem baaga/TE prepare/EN aithene/TE ,/UNIV first/EN they consider the words as observations and POS classlo/TE pass/EN avuthav/TE ./UNIV”. (Trans- tags as the hidden states. Kovida at el. (2018), with lation into English: John, you will pass in the first the use of internal structure and context of word per- class, even if the exams are well prepared). The form POS tagging for Telugu-English Code Mixed words followed by /NE, /TE, /EN and /UNIV cor- data. respond to Named Entity, Telugu, English and Uni- Language Identification issue has been addressed versal tags respectively. In the above example some for English and Hindi CM data. For this problem words exhibit morpheme level code mixing, like in Harsh at el. (2014) constructed a new model with the “classlo” : “class” (English word) + “lo” (plural help of character level n-grams of a word and POS morpheme in Telugu). We also consider the clitiques tag of adjoining words. Amitava Das and Bjrn Gam- like supere: super (English root word) + e (clitique) bck (2014) modeled various approaches like unsu- as code mixed. pervised dictionary based approach and supervised We present some approaches to solve the prob- SVM word level classification by considering con- lem of word level language identification in Telugu textual clues, lexical borrowings, and phonetic typ- English Code Mixed data. The rest of this paper is ing for LI in code mixed Indian social media text. divided into five sections. Ben King and Steven Abney (2013) elaborate the In section 2, we discuss the related work, followed LI problem as a monolingual text of sequence label- by data set and its annotation in section 3. Section ing problem. The authors used a weakly supervised 4 describes the approaches for language identifica- method for recognizing the language of words in the tion. And finally section 5 reports Results, Conclu- mixed language documents. For language identifica- sion and Future Work. tion in Nepali - English and Spanish - English data Utsab Barman et al. (2014) extract the features like 2 Related Work word contextual clues, capitalization, char N-Grams and length of word and more features for each word. Research on code switching is decades old Gold These features are input for the traditional super- (1967) and a lot of progress is yet to be made. Braj vised algorithms like K-NN and SVM. B., Kachru (1976) explained about the structure of Indian languages are relatively resource poor lan- multilingual languages organization and language guages. Siva Reddy and Serge Sharoff (2011) are dependency in linguistic convergence of code mix- modeled a cross language POS Tagger for Indian ing in an Indian Perspective. The examination of languages. For an experimental purpose they used syntactic properties and sociolinguistic constraints the Kannada language by using the Telugu resources for bilingual code switching data was explored by because Kannada is more similar to Telugu. To do Sridhar et al. (1980) who contended that how ICS effective text analysis in Hindi English Code mixed shows impact on bilingual processing. According social media data Sharma et al. (2016) are stimulate to Noor Al-Qaysi, Mostafa Al-Emran (2017) code the shallow parsing pipeline. In this work the au- switching in social networking websites like Face- thors modeled a language identifier, a normalizer, a book, Twitter and WhatsApp is very high. The re- POS tagger and a shallow parser. sult of such case studies indicated that 86.40% of the Most of the experiments have been carried out by students using code switching on social networks, using dictionaries, supervised classification models whereas 81% of the educators do so. and Markov models for language identification. Fi- One of the basic tasks in text processing is Part of nally, most of the authors concluded that language Speech (POS) Tagging and it is primary step in most modeling techniques are more robust than dictionary of NLP. Word level Language Identification can be based models. looked at as a task similar to POS tagging. For POS We attempted this language identification prob- tagging task Nisheeth Joshi et al. (2013) developed lem with different classification algorithms to ana- a Hidden Markov Model based tagger in which they lyze the results. According to our knowledge, until 181 32nd Pacific Asia Conference on Language, Information and Computation Hong Kong, 1-3 December 2018 Copyright 2018 by the authors PACLIC 32 now no work has been done on language identifica- Mixed data produced estimable results. We are con- tion in Telugu-English Code-mixed data. sidered this LI problem similar to a Text Classifica- tion problem. We consider each sentence as a doc- 3 Dataset Annotation and Statistics ument and each word in it as a term, for which we calculate the conditional probability for each class We use the English Telugu code mixed dataset for label. The Nave Bayes assumption is that indepen- language identification from the Twelfth Interna- dence among the features, each feature is indepen- tional Conference on Natural Language Processing dent to the other features. We first convert the in- (ICON-2015). The dataset is comprised of Face- put code mixed words into TF(Term Frequency) - book posts, comments, replies to comments, tweets IDF(Inverse Document Frequency) vectors. and WhatsApp chat conversations. This dataset con- Word to Vector: Initially, the raw text data will tains 1987 code-mixed sentences. These sentences be processed into feature vectors using the existing are tokenized into words. And the tokenized words dataset. We used the TF-IDF Vector as the features of each sentence are separated by new line. The for both the trigrams and character trigrams. TF-IDF dataset contains 29503 tokens. calculates a score that represent the importance of a Language Label Percentage word in a document. Label Frequency of Label Telugu 8828 29.92 • TF: Number of times term word(w) appears in English 8886 30.11 a document / Total no. of terms in the document Universal 11033 37.39 • Inverse Document Frequency (IDF): loge (To- Named Entity 756 2.56 tal no.