Predicting Code-Switching in Multilingual Communication for Immigrant Communities
Evangelos E. Papalexakis, Carnegie Mellon University, Pittsburgh, USA
Dong Nguyen, University of Twente, Enschede, The Netherlands
A. Seza Doğruöz, Netherlands Institute for Advanced Study, Wassenaar, The Netherlands

Abstract

Immigrant communities host multilingual speakers who switch across languages and cultures in their daily communication practices. Although there are in-depth linguistic descriptions of code-switching across different multilingual communication settings, there is a need for automatic prediction of code-switching in large datasets. We use emoticons and multi-word expressions as novel features to predict code-switching in a large online discussion forum for the Turkish-Dutch immigrant community in the Netherlands. Our results indicate that multi-word expressions are powerful features to predict code-switching.

1 Introduction

Multilingualism is the norm rather than an exception in face-to-face and online communication for millions of speakers around the world (Auer and Wei, 2007). 50% of the EU population is bilingual or multilingual (European Commission, 2012). Multilingual speakers in immigrant communities switch across different languages and cultures depending on the social and contextual factors present in the communication environment (Auer, 1988; Myers-Scotton, 2002; Romaine, 1995; Toribio, 2002; Bullock and Toribio, 2009). Example (1) illustrates Turkish-Dutch code-switching in a post about video games in an online discussion forum for the Turkish immigrant community in the Netherlands.

Example (1)
user1: <dutch>vette spellllllllll </dutch>.. <turkish>bir girdimmi cikamiyomm .. yendikce yenesi geliyo insanin</turkish>
Translation: <dutch> awesome gameeeee </dutch>.. <turkish>once you are in it, it is hard to leave .. the more you win, the more you want to win</turkish>

Mixing two or more languages is not a random process. There are in-depth linguistic descriptions of code-switching across different multilingual contexts (Poplack, 1980; Silva-Corvalán, 1994; Owens and Hassan, 2013). Although these studies provide invaluable insights about code-switching from a variety of aspects, there is a growing need for computational analysis of code-switching in large datasets (e.g. social media) where manual analysis is not feasible. In immigrant settings, multilingual/bilingual speakers switch between minority (e.g. Turkish) and majority (e.g. Dutch) languages. Code-switching marks multilingual, multi-cultural (Luna et al., 2008; Grosjean, 2014) and ethnic identities (De Fina, 2007) of the speakers. By predicting code-switching patterns in Turkish-Dutch social media data, we aim to raise awareness of mixed language communication patterns in immigrant communities.

Our study is innovative in the following ways:

• We performed experiments on the longest and largest bilingual dataset analyzed so far.
• We are the first to predict code-switching in social media data, which allows us to investigate features such as emoticons.
• We are the first to exploit multi-word expressions to predict code-switching.
• We use automatic language identification at the word level to create our dataset and features that capture previous language choices.

The rest of this paper is structured as follows: we discuss related work on code-switching and multilingualism in Section 2, our dataset in Section 3, a qualitative analysis in Section 4, our experimental setup and features in Section 5, our results in Section 6 and our conclusion in Section 7.

Proceedings of The First Workshop on Computational Approaches to Code Switching, pages 42–50, October 25, 2014, Doha, Qatar.
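To make the word-level setup concrete, the following Python sketch turns the tagged post from Example (1) into (word, language) pairs and derives, for each word, a label stating whether the next word is in a different language. The tag format follows Example (1). The function names, the feature names and the decision to skip tokens without letters are assumptions made here for illustration; they are not the feature set or the preprocessing used in our experiments.

```python
import re

# Tag format as in Example (1). Everything else in this sketch is hypothetical.
TAGGED_SPAN = re.compile(r"<(dutch|turkish)>(.*?)</\1>", re.DOTALL)


def word_languages(tagged_post):
    """Return (token, language) pairs for the tokens inside tagged spans.

    Tokens without any letter (e.g. "..") are skipped; this is a
    simplification made for the sketch.
    """
    pairs = []
    for match in TAGGED_SPAN.finditer(tagged_post):
        language = match.group(1)
        for token in match.group(2).split():
            if any(ch.isalpha() for ch in token):
                pairs.append((token, language))
    return pairs


def switch_instances(pairs):
    """Build one instance per word that has a successor.

    Features only look at the current word and the history; the label says
    whether the NEXT word is in a different language.
    """
    instances = []
    for i in range(len(pairs) - 1):
        token, language = pairs[i]
        features = {
            "current_word": token,
            "current_language": language,
            "previous_language": pairs[i - 1][1] if i > 0 else None,
        }
        label = pairs[i + 1][1] != language
        instances.append((features, label))
    return instances


post = (
    "<dutch>vette spellllllllll </dutch>.. "
    "<turkish>bir girdimmi cikamiyomm .. yendikce yenesi geliyo insanin</turkish>"
)
pairs = word_languages(post)
instances = switch_instances(pairs)

for features, label in instances:
    print(features["current_word"], features["current_language"], label)
```

In this post the only positive instance is the last Dutch word, since the word that follows it is Turkish. Any classifier could be trained on such instances; the sketch deliberately leaves that choice open.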
© 2014 Association for Computational Linguistics

2 Related Work

Code-switching in sociolinguistics. There is rarely any consensus on the terminology about mixed language use. Wei (1998) considers alternations between languages at or above clause levels as code-mixing. Romaine (1995) refers to both inter-sentential and intra-sentential switches as code-switching. Bilingual speakers may shift from one language to another entirely (Poplack et al., 1988) or they mix languages partially within the single speech (Gumperz, 1982). In this study, we focus on code-switching within the same post in an online discussion forum used by Turkish-Dutch bilinguals.

There are different theoretical models which support (Myers-Scotton, 2002; Poplack, 1980) or reject (MacSwan, 2005; Thomason and Kaufman, 2001) linguistic constraints on code-switching. According to Thomason and Kaufman (2001) and Gardner-Chloros and Edwards (2004), linguistic factors are mostly unpredictable since social factors govern the multilingual environments in most cases. Bhatt and Bolonyai (2011) present an extensive study of socio-cognitive factors that lead to code-switching across different multilingual communities.

Although multilingual communication has been widely studied through spoken data analyses, research on online communication is relatively recent. In terms of linguistic factors, Cárdenas-Claros and Isharyanti (2009) report differences between Indonesian-English and Spanish-English speakers in their amount of code-switching on MSN (an instant messaging client). Durham (2003) finds a tendency to switch to English over time in an online multilingual (German, French, Italian) discussion forum in Switzerland.

The media (e.g. IRC, Usenet, email, online discussions) used for multilingual conversations influence the amount of code-switching as well (Paolillo, 2001; Hinrichs, 2006). Androutsopoulos and Hinnenkamp (2001), Tsaliki (2003) and Hinnenkamp (2008) have done qualitative analyses of switch patterns across German-Greek-Turkish, Greek-English and Turkish-German in online environments respectively.

In terms of social factors, a number of studies have investigated the link between topic and language choices qualitatively (Ho, 2007; Androutsopoulos, 2007; Tang et al., 2011). These studies share a similar conclusion: multilingual speakers use minority languages to discuss topics related to their ethnic identity and to reinforce intimacy and self-disclosure (e.g. homeland, cultural traditions, joke telling), whereas they use the majority language for sports, education, world politics, science and technology.

Computational approaches to code-switching. Recently, an increasing amount of research within NLP has focused on dealing with multilingual documents. For example, corpora with multilingual documents have been created to support studies on code-switching (e.g. Cotterell et al., 2014). To enable the automatic processing and analysis of documents with mixed languages, there is a shift in focus toward language identification at the word level (King and Abney, 2013; Nguyen and Doğruöz, 2013; Lui et al., 2014). Most closely related to our work is the study by Solorio and Liu (2008), who predict code-switching in recorded English-Spanish conversations. Compared to their work, we use a large-scale social media dataset that enables us to explore novel features.

The task most closely related to automatic prediction of code-switching is automatic language identification (King and Abney, 2013; Nguyen and Doğruöz, 2013; Lui et al., 2014). While automatic language detection uses the words to identify the language, automatic prediction of code-switching involves predicting whether the language of the next word is the same without having access to the next word itself.

Language practices of the Turkish community in the Netherlands. Turkish has been in contact with Dutch due to labor immigration since the 1960s and the Turkish community is the largest minority group (2% of the whole population) in the Netherlands (Centraal Bureau voor de Statistiek, 2013). In addition to their Dutch fluency, second and third generations are also fluent in Turkish through speaking it within the family and community, regular family visits to Turkey and watching Turkish TV through satellite dishes. These speakers grow up speaking both languages simultaneously rather than learning one language after the other (De Houwer, 2009). In addition to constant switches between Turkish and Dutch, there are also literally translated Dutch multi-word expressions (Doğruöz and Backus, 2007; Doğruöz and Backus, 2009). Due to the religious backgrounds of the Turkish-Dutch community, Arabic words and phrases (e.g. greetings) are part of daily communication. In addition, English words and phrases are used both in Dutch and Turkish due to the exposure to American and British media.

Although the necessity of studying immigrant languages in Dutch online environments has been voiced earlier (Dorleijn and Nortier, 2012), the current study is the first to investigate mixed language communication patterns of Turkish-Dutch bilinguals in online environments.

3 Dataset

Our data comes from a large online forum (Hababam) used by Turkish-Dutch

4 Types of Code-Switching

In this section, we provide a qualitative analysis of code-switching in the online forum. We differentiate between two types of code-switching: code-switching across posts and code-switching within the same post.

4.1 Code-switching across posts

Within the same discussion thread, users react to posts of other users in different languages. In example (2), user 1 posts in Dutch to tease User 2. User 2 reacts to this message with a humorous idiomatic expression in Turkish (i.e.