Quantitative Characterization of Code Switching Patterns in Complex Multi-Party Conversations: A Case Study on Hindi Movie Scripts

Adithya Pratapa, Microsoft Research, India, [email protected]
Monojit Choudhury, Microsoft Research, India, [email protected]

S Bandyopadhyay, D S Sharma and R Sangal. Proc. of the 14th Intl. Conference on Natural Language Processing, pages 75–84, Kolkata, India. December 2017. ©2016 NLP Association of India (NLPAI)

Abstract

In this paper, we present a framework for quantitative characterization of code-switching patterns in multi-party conversations, which allows us to compare and contrast the socio-cultural and functional aspects of code-switching within a set of cultural contexts. Our method applies some of the proposed metrics for quantification of code-switching (Gamback and Das, 2016; Guzman et al., 2017) at the level of entire conversations, dyads and participants. We apply this technique to analyze the conversations from 18 recent Hindi movies. In the process, we are able to tease apart the use of code-switching as a device for establishing identity, socio-cultural contexts of the characters and the events in a movie.

1 Introduction

Code-switching (henceforth CS) or code-mixing refers to the juxtaposition of linguistic units from more than one language in a single conversation, or in a single utterance. Linguists have extensively studied the structural (i.e., the grammatical constraints on CS) and functional (i.e., the motivation and intention behind CS) aspects of CS in various mediums, contexts, languages and geographies (Myers-Scotton, 2005; Auer, 1995, 2013). However, most of these studies are limited to qualitative analysis of small datasets, which makes it hard to make statistically valid quantitative claims about the nature and distribution of CS.

Recently, due to the availability of large code-switched datasets, gathered mostly from social media, there have been some quantitative studies on socio-linguistic and functional aspects of CS (Rudra et al., 2016; Rijhwani et al., 2017; Guzman et al., 2017). Nevertheless, there are no large-scale quantitative studies of code-switched conversations, primarily because the only large-scale datasets currently available come from social media. These are either micro-blogs without any conversational context or data from Facebook or WhatsApp with very short conversations. On the other hand, functions of CS are most relevant and discernible in relatively long multi-party conversations embedded in a social context. For instance, it is well documented (Auer, 2013) that CS is motivated by complex social functions, such as identity, social power and style accommodation, which are difficult to elicit and establish from short social media texts.

In this work, we propose a set of techniques for analyzing CS styles and functions in conversations grounded over social networks. Our approach builds on two previously proposed metrics of CS: the Code-mixing Index (CMI) (Gamback and Das, 2016) and the corpus-level metrics proposed in (Guzman et al., 2017), applied to conversations at the level of dyads, participants, conversation scenes and the entire social network of the participants. We apply this new approach to analyze scripts of 18 recent Hindi movies with various degrees and styles of Hindi-English CS. Through this analysis technique, we are able to bring out the social functions of CS at different levels.

The primary contributions of this work are: (a) development of a set of quantitative conversation analysis techniques for CS; (b) some visualization techniques for CS patterns in conversations that can help linguists and social scientists to get a holistic view of the switching styles in interactions; (c) analysis of CS patterns in recent Hindi movies that adds to the existing rich literature of similar but small-scale qualitative studies of CS in Indian cinema.

The rest of this paper is organized as follows: Sec 2 describes related work on functions of CS with particular emphasis on CS in Indian cinema. Sec 3 introduces our analysis technique, which is later applied and illustrated in the context of movie scripts in Sec 5 and 6. Sec 4 introduces the movie dataset, preprocessing of the scripts and word-level language labeling of the dialogues. Sec 7 concludes the paper by summarizing the contributions and discussing potential future work.

2 Related Work

In this section, we start with a brief review of the linguistics literature on functional and socio-linguistic aspects of CS, followed by a discussion of recent computational models. In order to put the case study on Hindi movies in perspective, we also review relevant literature on CS in Indian cinema.

2.1 Functions of Code-Switching

Code-switching is a common phenomenon in all multilingual communities, though usually it is unpredictable whether in a given context a speaker will code-switch or not (Auer, 1995). Nevertheless, linguists have observed that there are preferred languages for communicating certain kinds of functions. For instance, certain speech activities might be exclusively or more commonly related to a certain language choice (e.g. Fishman (1971) reports use of English for professional purposes and Spanish for informal chat for English-Spanish bilinguals from Puerto Rico). Language switching is also used as a signaling device that serves specific communicative functions (Barredo, 1997; Sanchez, 1983; Nishimura, 1995; Maschler, 1991, 1994) such as: (a) reported speech, (b) narrative to evaluative switch, (c) reiterations or emphasis, (d) topic shift, (e) puns and language play, (f) topic/comment structuring, etc. Attempts at predicting the preferred language, or even exhaustively listing such functions, have failed. However, linguists agree that language alternation in multilingual communities is not a random process.

Code-switching is also strongly linked to social identity and the principle of linguistic style accommodation (Melhim and Rahman, 1991; Auer, 2013). For instance, two Hindi-English bilingual speakers could code-switch just to establish a connection or in-group identity because CS is the norm for a large section of urban Indians, and English is attached to aspirational values by a large section of the Indian society (see Sec. 2.3 for a detailed discussion of this).

2.2 Computational and Quantitative Studies

Over the last decade, research in computational processing of code-switching has gained significant interest (Solorio and Liu, 2008, 2010; Vyas et al., 2014; Peng et al., 2014; Sharma et al., 2016). In particular, word-level language identification, which is the first step towards processing of CS text, has received a lot of attention (see Rijhwani et al. (2017) for a review). In this work, we use the word-level language labeler by Gella et al. (2013) for labeling the Hindi movie dialogues.

Nevertheless, to the best of our knowledge, there has been very little work on automatic identification of functional aspects of CS or any large-scale data-driven study of its socio-linguistic aspects. Of the few studies that exist, the most notable are the ones by Rudra et al. (2016) on language preference by Hindi-English bilinguals on Twitter and Rijhwani et al. (2017) on extent and patterns of CS across European languages from 24 cities. Rudra et al. (2016) analyzed 430K unique tweets for opinion and sentiment, and concluded that Hindi-English bilinguals prefer to express negative opinions in Hindi; they further report that a large fraction of the CS tweets exhibited the narrative-evaluative function. Rijhwani et al. (2017) examined more than 50M tweets from across the world; the study shows that the percentage of CS tweets varies from 1 to 11% across the cities, and that more CS is observed in the cities where English is not the primary language of communication. They also show that English-Spanish CS patterns in a predominantly Spanish-speaking region (e.g., Barcelona) are different from those where English is the primary language (e.g., Houston).

In an excellent survey on computational socio-linguistics, Nguyen et al. (2016) report a few other studies on socio-linguistic aspects of multilingual communities.

2.3 Code-switching in Indian Cinema

Hindi-English CS, commonly called Hinglish, is extremely widespread in India. There is historical attestation of, as well as recent studies on, the growing use of Hinglish in general conversation, and in entertainment and media (see Parshad et al. (2016) and references therein). Several recent studies (Bali et al., 2014; Barman et al., 2014; Sequiera et al., 2015) also provide evidence of Hinglish and other instances of CS on online social media, such as Twitter and Facebook.

Hindi movies provide a rich data source for studying CS in the Indian context. According to the Conversational Analysis approach to CS (Auer, 2013; Wei, 2002), in any given context a particular language is preferred or unmarked. Therefore, “speakers, and in turn script writers, choose marked or unmarked codes on the basis of which one will bring them the best outcomes” (Vaish, 2011). Myers-Scotton (2005) suggested that the matrix or unmarked code for Hindi movies is Hindi.

3.1 Metrics for Quantification of CS

The first corpus-level quantification of the extent and nature of CS was proposed by Gamback and Das (2016). Referred to as the Code-mixing Index, this metric tries to capture the language distribution and the switching, both at the level of utterances and the entire corpus. Let $N$ be the number of languages and $x$ an utterance; let $t_{L_i}$ be the tokens in language $L_i$ and $P$ be the number of code alternation points in $x$; also, let $w_m$ and $w_p$ be the weights for the two components of the metric. Then, the Code-mixing Index per utterance, $C_u(x)$ for $x$ is: