POS Tagging for Improving Code-Switching Identification In
Total Page:16
File Type:pdf, Size:1020Kb
POS Tagging for Improving Code-Switching Identification in Arabic Mohammed Attia1 Younes Samih2 Ali Elkahky1 Hamdy Mubarak2 Ahmed Abdelali2 Kareem Darwish2 1Google LLC, New York City, USA 2Qatar Computing Research Institute, HBKU Research Complex, Doha, Qatar 1{attia,alielkahky}@google.com 2{ysamih,hmubarak,aabdelali,kdarwish}@hbku.edu.qa Abstract and sociolinguistic reasons. Second, existing the- oretical and computational linguistic models are When speakers code-switch between their na- tive language and a second language or lan- based on monolingual data and cannot adequately guage variant, they follow a syntactic pattern explain or deal with the influx of CS data whether where words and phrases from the embedded spoken or written. language are inserted into the matrix language. CS has been studied for over half a century This paper explores the possibility of utiliz- from different perspectives, including theoretical ing this pattern in improving code-switching linguistics (Muysken, 1995; Parkin, 1974), ap- identification between Modern Standard Ara- plied linguistics (Walsh, 1969; Boztepe, 2003; Se- bic (MSA) and Egyptian Arabic (EA). We try to answer the question of how strong is the tati, 1998), socio-linguistics (Barker, 1972; Heller, POS signal in word-level code-switching iden- 2010), psycho-linguistics (Grosjean, 1989; Prior tification. We build a deep learning model en- and Gollan, 2011; Kecskes, 2006), and more re- riched with linguistic features (including POS cently computational linguistics (Solorio and Liu, tags) that outperforms the state-of-the-art re- 2008a; Çetinoglu˘ et al., 2016; Adel et al., 2013b). sults by 1.9% on the development set and 1.0% In this paper, we investigate the possibility of on the test set. We also show that in intra- using POS tagging to improve word-level lan- sentential code-switching, the selection of lex- guage identification for diglossic Arabic in a ical items is constrained by POS categories, where function words tend to come more often deep-learning system. We present some syn- from the dialectal language while the majority tactic characterization of intra-sentential code- of content words come from the standard lan- switching, and show that POS can be a power- guage. ful signal for code-switching identification. We also pay special attention to intra-sentential code- 1 Introduction switching and examine the distribution of POS Code-switching (CS) is common in multilingual categories involved in this type of data. communities as well as diglossic ones, where the The paper is organized as follows: in the re- language of information and education is different mainder of this introduction we present chal- from the language of speaking and daily interac- lenges, definitions, and types of CS, and the partic- tion. With the increased level of education, mobil- ular aspects involved in Arabic CS. Section2 gives ity, globalization, multiculturalism, and multilin- an overview of related works. In Section3, we de- gualism in modern societies, combined with the scribe and record our observations on the data used rise of social media, where people write in the in our experiments. Section4 presents a descrip- way they speak, CS has become a pervasive phe- tion of our system and the features used. Section5 nomenon, particularly in user-generated data, and gives the details of our experiments and discusses a major challenge for NLP systems dealing with the results, and finally we conclude in Section6. that data. CS is interesting for two reasons: first, there is a 1.1 Why is CS Computationally large population of bilingual and diglossic speak- Challenging? ers, or at least speakers with some exposure to a When two languages are blended together in foreign language, who tend to mix and blend two a single utterance, the traditional phonological languages for various pragmatic, psycholinguistic and morphosyntactic rules are perturbed. When 18 Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 18–29 Florence, Italy, August 1, 2019. c 2019 Association for Computational Linguistics judged by a standard monolingual model, these ut- computational models when dealing with CS data terances can be deemed as ungrammatical or un- (Adel et al., 2015). natural. Therefore, CS should generally be treated The study of CS does not only help downstream in its own terms and not to be conceived of as tasks (like ASR (automatic speech recognition), a peripheral phenomenon that can be understood IR (information retrieval), parsing, etc.), but it is by tweaking and twisting monolingual models and also crucial for language generation (e.g. TTS theories. When two languages come in contact, (text to speech), MT (machine translation), and this implies the cross-fertilization and the emer- automated responses by virtual assistants) in order gence of structures that may be absent in either to allow computational models to produce natural languages. When code-switching, speakers com- sentences that closely match how modern societies promise the syntactic rules of the two languages talk. involved, sometime adding in or leaving out a de- terminer, or applying a system of affixation from 1.2 Definition and Defining Perspectives one language and not the other. The definition of CS has varied greatly depend- CS has conveniently been used as a cover term ing on the different researchers’ attitude and per- (Myers-Scotton, 1997; Çetinoglu˘ et al., 2016) for spectives of the operation involved. While some all operations where two languages are used si- viewed it as a process where two languages are ac- multaneously or alternately by the same speaker. tively interacting with each other (ultimately cre- When the user speaks one sentence in one lan- ating a new code), other viewed the operation just guage and another sentence in another language, as two separate languages sitting side-by-side as this has been referred to as inter-sentential code- isolated islands. Following the first perspective, switching, while mixing elements from the two Joshi(1982) defined code-switching as the situ- languages together in the same sentence has been ation when two languages systematically interact termed intra-sentential. The language that pro- with each other in the production of sentences vides the function words and grammatical struc- in a framework which consists of two grammat- ture is called the host (Bokamba, 1989) or ma- ical systems and a mechanism for switching be- trix language, while the language being inserted tween the two. Following the second perspec- is called the guest or embedded language. tive, Muysken(1995) defined CS as “the alterna- While inter-sentential CS is relatively less chal- tive use by bilinguals of two or more languages lenging for computational analysis, as each sen- in the same conversation”, while other researchers tence still follows a monolingual model, intra- (Auer, 1999; Nilep, 2006) defined it as the “jux- sentential CS poses a bottleneck challenge. It taposition” of elements from two different gram- needs a special amount of attention, because it is matical systems within the same speech. only this type that involves the lexical and syn- The juxtaposition definition has been widely tactic integration and activation of two language cited in the research on code-switching, advanc- models at the same time. NLP systems trained on ing a monolingual view on the topic and promot- monolingual data suffer significantly when trying ing the idea that bilingual speech is the sum (or to process this kind bilingual text or utterance. juxtaposition) of two monolingual utterances. The CS has proved challenging for NLP technolo- literal meaning suggests placing two heteroge- gies, not only because current tools are geared neous and isolated pieces from different languages toward the processing of one language at a time next to each other, but, in fact, foreign phrases (AlGhamdi et al., 2016), but also because code- are usually syntactically integrated and may often switched data is typically associated with addi- change phonologically, morphologically and prag- tional challenges such as the non-conventional or- matically to fit homogeneously in the new posi- thography, non-canonicity (nonstandard or incom- tion. The term also has a sense of randomness, plete) of syntactic structures, and the large number which departs from the fact that CS is patterned of OOV-words (Çetinoglu˘ et al., 2016), which sug- and predictable. gest the need for larger training data than what is The view we adopt is that when people code- typically used in monolingual models. Unfortu- switch, they interweave (Lipski, 2005) or blend nately, shortage of training data has usually been two languages together, and the grammar of code- cited as the reason for the under-performance of switching depends, to a large extent, on which lan- 19 guages are being interwoven, where, when, how, diglossic code-switching, the shift is more likely and by whom. The where and when relates to the to be lexical, morphological, and structural, rather sociolinguistic factors, such as the situation and than phonological, unlike the other two cases power relations, and the how and by whom to the when we have two completely distinct language psycholinguistic factors, such as speakers’ compe- systems. tence and proficiency in either or both languages. This is why we see a wide range of regular patterns 1.4 Peculiarities of Arabic CS as well as highly idiosyncratic behavior. Arabic is a diglossic language, where the lan- guage of education is different from the language 1.3 CS Types and Categories of speaking. Dialectal Arabic has traditionally not enjoyed the same prestige, socio-economic sta- A speaker can turn from one language to the tus, and official recognition as MSA. Dialects, by other at the sentence level, or he/she can make nature, diverge from the standard language, and, the turn within the same sentence. Some re- therefore, they can easily and freely draw from the searchers (Muysken et al., 2000) use the term larger repository of the standard language.