
Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data

Adithya Pratapa1, Gayatri Bhat2∗, Monojit Choudhury1, Sunayana Sitaram1, Sandipan Dandapat3, Kalika Bali1
1 Microsoft Research, Bangalore, India
2 Language Technologies Institute, Carnegie Mellon University
3 Microsoft R&D, Hyderabad, India
1{t-pradi, monojitc, sunayana.sitaram, kalikab}@microsoft.com, [email protected], [email protected]

Abstract

Training language models for code-mixed (CM) language is known to be a difficult problem because of lack of data, compounded by the increased confusability due to the presence of more than one language. We present a computational technique for creation of grammatically valid artificial CM data based on the Equivalence Constraint Theory. We show that when training examples are sampled appropriately from this synthetic data and presented in a certain order (aka training curriculum) along with monolingual and real CM data, it can significantly reduce the perplexity of an RNN-based language model. We also show that randomly generated CM data does not help in decreasing the perplexity of the LMs.

1 Introduction

Code-switching or code-mixing (CM) refers to the juxtaposition of linguistic units from two or more languages in a single conversation or sometimes even a single utterance.1 It is quite commonly observed in speech conversations of multilingual societies across the world. Although, traditionally, CM has been associated with informal or casual speech, there is evidence that in several societies, such as urban India and Mexico, CM has become the default code of communication (Parshad et al., 2016), and it has also pervaded written text, especially in computer-mediated communication and social media (Rijhwani et al., 2017).

It is, therefore, imperative to build NLP technology for CM text and speech. There have been some efforts towards building Automatic Speech Recognition systems and TTS for CM speech (Li and Fung, 2013, 2014; Gebhardt, 2011; Sitaram et al., 2016), and towards tasks like language identification (Solorio et al., 2014; Barman et al., 2014), POS tagging (Vyas et al., 2014; Solorio and Liu, 2008), parsing and sentiment analysis (Sharma et al., 2016; Prabhu et al., 2016; Rudra et al., 2016) for CM text. Nevertheless, the accuracies of all these systems are much lower than their monolingual counterparts, primarily due to the lack of enough data.

Intuitively, since CM happens between two (or more) languages, one would typically need twice as much, if not more, data to train a CM system. Furthermore, any CM corpus will contain large chunks of monolingual fragments and relatively far fewer code-switching points, which are extremely important to learn patterns of CM from data. This implies that the amount of data required would not just be twice, but probably 10 or 100 times more than that for training a monolingual system with similar accuracy. On the other hand, apart from user-generated content on the Web and social media, it is extremely difficult to gather large volumes of CM data because (a) CM is rare in formal text, and (b) speech data is hard to gather and even harder to transcribe.

In order to circumvent the data scarcity issue, in this paper we propose the use of linguistically-motivated synthetically generated CM data (as a supplement to real CM data) for development of CM NLP systems. In particular, we use the Equivalence Constraint Theory (Poplack, 1980; Sankoff, 1998) for generating linguistically valid CM sentences from a pair of parallel sentences in the two languages.

∗ Work done during the author's internship at Microsoft Research.
1 According to some linguists, code-switching refers to inter-sentential mixing of languages, whereas code-mixing refers to intra-sentential mixing. Since the latter is more general, we will use code-mixing in this paper to mean both.

We then use these generated sentences, along with monolingual and a little amount of real CM data, to train a CM Language Model (LM). Our experiments show that, when trained following certain sampling strategies and training curricula, the synthetic CM sentences are indeed able to improve the perplexity of the trained LM over a baseline model that uses only monolingual and real CM data.

The LM is useful for a variety of downstream NLP tasks such as Speech Recognition and Machine Translation. By definition, it is a discriminator between natural and unnatural language data. The fact that linguistically constrained synthetic data can be used to develop a better LM for CM text is, on the one hand, an indirect statistical and task-based validation of the linguistic theory used to generate the data, and on the other hand, an indication that the approach in general is promising and can help solve the issue of data scarcity for a variety of NLP tasks for CM text and speech.

2 Generating Synthetic Code-mixed Data

There is a large and growing body of linguistic research regarding the occurrence, syntactic structure and pragmatic functions of code-mixing in multilingual communities across the world. This includes many attempts to explain the grammatical constraints on CM, with three of the most widely accepted being the Embedded-Matrix (Joshi, 1985; Myers-Scotton, 1993, 1995), the Equivalence Constraint (EC) (Poplack, 1980; Sankoff, 1998) and the Functional Head Constraint (DiSciullo et al., 1986; Belazi et al., 1994) theories.

For our experiments, we generate CM sentences as per the EC theory, since it explains a range of interesting CM patterns beyond lexical substitution and is also suitable for computational modeling. Further, in a brief human evaluation we conducted, we found that it is representative of real CM data. In this section, we list the assumptions made by the EC theory, briefly explain the theory, and then describe how we generate CM sentences as per this theory.

2.1 Assumptions of the EC Theory

Consider two languages L1 and L2 that are being mixed. The EC Theory assumes that both languages are defined by context-free grammars G1 and G2. It also assumes that every non-terminal category X1 in G1 has a corresponding non-terminal category X2 in G2, and that every terminal symbol (or word) w1 in G1 has a corresponding terminal symbol w2 in G2. Finally, it assumes that every production rule in L1 has a corresponding rule in L2, i.e., the non-terminal categories on the left-hand side of the two rules correspond to each other, and every category/symbol on the right-hand side of one rule corresponds to a category/symbol on the right-hand side of the other rule.

All these correspondences must also hold vice-versa (between languages L2 and L1), which implies that the two grammars can only differ in the ordering of categories/symbols on the right-hand side of any production rule. As a result, any sentence in L1 has a corresponding translation in L2, with their parse trees being equivalent except for the ordering of sibling nodes. Fig. 1(a) and (b) illustrate one such sentence pair in English and Spanish and their parse trees. The EC Theory describes a CM sentence as a constrained combination of two such equivalent sentences.

While the assumptions listed above are quite strong, they do not prevent the EC Theory from being applied to two natural languages whose grammars do not correspond as described above. We apply a simple but effective strategy to reconcile the structures of a sentence and its translation: if any corresponding subtrees of the two parse trees do not have equivalent structures, we collapse each of these subtrees to a single node. Accounting for the actual asymmetry between a pair of languages will certainly allow for the generation of more CM variants of any L1-L2 sentence pair. However, in our experiments, this strategy retains most of the structural information in the parse trees, and allows for the generation of up to thousands of CM variants of a single sentence pair.
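To make the rule-correspondence assumption concrete, here is a toy sketch in Python (ours, not from the paper; the grammar fragment is hypothetical) of two corresponding CFGs that differ only in the ordering of their right-hand-side symbols:

    # Toy illustration of the EC Theory's grammar assumptions (hypothetical
    # fragment, not from the paper): corresponding English and Spanish CFG
    # rules use the same right-hand-side categories, possibly reordered.
    G_EN = {
        "S":  [["NP", "VP"]],
        "VP": [["VBZ", "PP"]],
        "PP": [["IN", "NP"]],
        "NP": [["PRP"], ["DT", "JJ", "NN"]],
    }
    G_ES = {
        "S":  [["NP", "VP"]],
        "VP": [["VBZ", "PP"]],
        "PP": [["IN", "NP"]],
        "NP": [["PRP"], ["DT", "NN", "JJ"]],  # the adjective follows the noun
    }

    # Every rule in one grammar must have a counterpart in the other with the
    # same multiset of right-hand-side categories.
    for lhs, rules in G_EN.items():
        for rhs_en, rhs_es in zip(rules, G_ES[lhs]):
            assert sorted(rhs_en) == sorted(rhs_es)
    print("The grammars correspond, differing only in RHS order.")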

2.2 The Equivalence Constraint Theory

Sentence production. Given two monolingual sentences (such as those introduced in Fig. 1), a CM sentence is created by traversing all the leaf nodes in the parse tree of either of the two sentences. At each node, either the word at that node or the word at the corresponding node in the other sentence's parse is generated. While the traversal may start at any leaf node, once the production enters one constituent, it will exhaust all the lexical slots (leaf nodes) in that constituent or its equivalent constituent in the other language before entering into a higher-level constituent or a sister constituent (Sankoff, 1998). This guarantees that the parse tree of a sentence so produced will have the same hierarchical structure as the two monolingual parse trees (Fig. 1(c) and (d)).

The EC theory also requires that any monolingual fragment that occurs in the CM sentence must occur in one of the monolingual sentences (in the running example, the fragment una blanca would be disallowed since it does not appear in the Spanish sentence).

[Figure 1 here: parse trees of (a) the English sentence "She lives in a white house" and (b) the Spanish sentence "Elle vive en una casa blanca", along with (c) an incorrectly code-mixed variant, "Elle lives en una casa white", and (d) a correctly code-mixed variant, "Elle lives in una casa blanca".]
Figure 1: Parse trees of a pair of equivalent (a) English and (b) Spanish sentences, with corresponding hierarchical structure (due to production rules), internal nodes (non-terminal categories) and leaf nodes (terminal symbols), and parse trees of (c) incorrectly code-mixed and (d) correctly code-mixed variants of these sentences (as per the EC theory).

Switch-point identification. To ensure that the CM sentence does not at any point deviate from both monolingual grammars, the EC theory imposes certain constraints on its parse tree. To this end, and in order to identify the code-switching points in a generated sentence, nodes in its parse tree are assigned language labels according to the following rules: All leaf nodes are labeled by the languages of their symbols. If all the children of any internal node share a common label, the internal node is also labeled with that language. Any node that is out of rank-order among its siblings according to one language is labeled with the other language. (See the labeling in Fig. 1(c) and (d).) If any node acquires labels of both languages during this process (such as the node marked with an asterisk in Fig. 1(c)), the sentence is disallowed as per the EC theory. In the labeled tree, any pair of adjacent sibling nodes with contrasting labels are said to be at a switch-point (SP).

Equivalence constraint. Every switch-point identified in the generated sentence must abide by the EC. Let U -> U1 U2 ... Un and V -> V1 V2 ... Vn be corresponding rules applied in the two monolingual parse trees, and nodes Ui and Vi+1 be adjacent in the CM parse tree. This pair of nodes is a switch-point, and it only abides by the EC if every node in U1 ... Ui has a corresponding node in V1 ... Vi. This is true for the switch-point in Fig. 1(d), and indicates that the two grammars are 'equivalent' at the code-switch point. More importantly, it shows that switching languages at this point does not require another switch later in the sentence. If every switch-point in the generated sentence abides by the EC, the generated sentence is allowed by the EC theory.

2.3 System Description

We assume that the input to the generation model is a pair of parallel sentences in L1 and L2, along with word-level alignments. For our experiments, L1 and L2 are English and Spanish, and Sec 3.2 describes how we create the input set. We use the Stanford Parser (Klein and Manning, 2003) to parse the English sentence.

Projecting parses. We use the alignments to project the English parse tree onto the Spanish sentence in two steps: (1) we first replace every word in the English parse tree with its Spanish equivalent; (2) we re-order the child nodes of each internal node in the tree such that their left-to-right order is as in the Spanish sentence. For instance, after replacing every English word in Fig. 1(a) with its corresponding Spanish word, we interchange the positions of casa and blanca to arrive at Fig. 1(b). For a pair of parallel sentences that follow all the assumptions of the EC theory, these steps can be performed without exception and result in the creation of a Spanish parse tree with the same hierarchical structure as the English parse.

We use various techniques to address cases in which the grammatical structures of the two sentences deviate. English words that are unaligned to any Spanish words are replaced by empty strings. (See Fig. 2, wherein the English word she has no Spanish counterpart, since this pronoun is dropped in the Spanish sentence.) Contiguous word sequences in one sentence that are aligned to the same word(s) in the other language are collapsed into a single multi-word node, and the entire subtree between these collapsed nodes and their closest common ancestor is flattened to accommodate this change (example in Fig. 2). While these changes do result in slightly unnatural or simplified parse trees, they are used very sparingly, since English and Spanish have very compatible grammars.

[Figure 2 here: the parse of "She will do it" projected onto the parallel Spanish sentence "Lo hará"; she is unaligned and will do is collapsed into a single MD+VB node.]

Figure 2: (a) The parse of an English sentence as per Stanford CoreNLP. This parse is projected onto the parallel Spanish sentence Lo hará and modified during this process, to produce corresponding (b) English and (c) Spanish parse trees.

Generating CS sentences. The number of CS sentences that can be produced by combining a corresponding pair of English and Spanish sentences increases exponentially with the length of the sentences. Instead of generating these sentences exhaustively, we use the parses to construct a finite-state automaton that succinctly captures the acceptable CS sentences. Since the CS sentence must have the same hierarchical structure as the monolingual sentences, we construct the automaton during a post-order traversal of the monolingual parses. An automaton is constructed at each node by (1) concatenating the automatons constructed at its child nodes, and (2) splitting states and removing transitions to ensure that the EC theory is not violated. The last automaton to be constructed, which is associated with the root node, accepts all the CS sentences that can be generated using the monolingual parses. We do not provide the exact details of automaton construction here, but we plan to release our code in the near future.
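Purely as an illustration (our own simplification, not the authors' automaton), the sketch below enumerates code-mixed renderings of the Fig. 1 sentence pair by brute force. It applies a conservative version of the constraint: a constituent may mix its children only when the English and Spanish child orders coincide; otherwise it is rendered monolingually. The real EC licenses more switch-points than this.

    from itertools import product

    # A leaf is ("leaf", en_word, es_word); an internal node is
    # ("node", children_in_English_order, indices giving the Spanish order).
    FIG1 = ("node",
            [("leaf", "she", "elle"),
             ("node",
              [("leaf", "lives", "vive"),
               ("node",
                [("leaf", "in", "en"),
                 ("node",
                  [("leaf", "a", "una"),
                   ("leaf", "white", "blanca"),
                   ("leaf", "house", "casa")],
                  [0, 2, 1])],      # NP: Spanish order is una casa blanca
                [0, 1])],           # PP: same order in both languages
              [0, 1])],             # VP
            [0, 1])                 # S

    def mono(tree, lang):
        """Fully monolingual rendering of a subtree."""
        if tree[0] == "leaf":
            return [tree[1] if lang == "en" else tree[2]]
        _, kids, es_order = tree
        order = range(len(kids)) if lang == "en" else es_order
        return [w for i in order for w in mono(kids[i], lang)]

    def cm(tree):
        """Renderings where each constituent is either kept monolingual or,
        when the two child orders coincide, mixes its children freely."""
        out = {tuple(mono(tree, "en")), tuple(mono(tree, "es"))}
        if tree[0] == "node":
            _, kids, es_order = tree
            if list(es_order) == list(range(len(kids))):
                for combo in product(*(cm(k) for k in kids)):
                    out.add(tuple(w for part in combo for w in part))
        return out

    for s in sorted(cm(FIG1)):
        print(" ".join(s))  # includes "elle lives in una casa blanca";
                            # never the disallowed "... una casa white"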

3 Datasets

In this work, we use three types of language data: monolingual data in English and Spanish (Mono), real code-mixed data (rCM), and artificial or generated code-mixed data (gCM). In this section, we describe these datasets and their CM properties. We begin with a description of some metrics that we shall use for quantifying the complexity of a CM dataset.

3.1 Measuring CM Complexity

CM data, both real and artificial, can vary in the relative usage and ordering of L1 and L2 words, and can thereby significantly affect downstream applications like language modeling. We use the following metrics to estimate the amount and complexity of code-mixing in the datasets.

Switch-point (SP): As defined in the last section, switch-points are points within a sentence where the languages of the words on the two sides are different. Intuitively, sentences that have a larger number of SPs are inherently more complex. We also define the metric SP Fraction (SPF) as the number of SPs in a sentence divided by the total number of word boundaries in the sentence.

Code mixing index (CMI): Proposed by Gamback and Das (2014, 2016), CMI quantifies the amount of code mixing in a corpus by accounting for the language distribution as well as the switching between them. Let N(x) be the number of language tokens in an utterance x, t_{Li}(x) the number of tokens in language Li, and P(x) the number of code-switching points in x. Then the code-mixing index per utterance, C_u(x), is computed as follows:

    C_u(x) = ((N(x) - max_{Li in L} t_{Li}(x)) + P(x)) / N(x)        (1)

Note that all the metrics can be computed at the sentence level as well as at the corpus level, by averaging the values for all the sentences in a corpus.
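Both metrics are straightforward to compute from language-tagged tokens; a direct implementation of SPF and Equation (1), assuming tags such as "en"/"es" per token, looks like this:

    from collections import Counter

    def spf(tags):
        """Switch-point fraction: switch-points / word boundaries."""
        boundaries = len(tags) - 1
        if boundaries <= 0:
            return 0.0
        switches = sum(1 for a, b in zip(tags, tags[1:]) if a != b)
        return switches / boundaries

    def cmi(tags):
        """Code-mixing index of Gamback and Das (2014, 2016), Equation (1)."""
        n = len(tags)
        if n == 0:
            return 0.0
        max_lang = max(Counter(tags).values())   # tokens of the majority language
        p = sum(1 for a, b in zip(tags, tags[1:]) if a != b)
        return ((n - max_lang) + p) / n

    # "Elle lives in una casa blanca" -> es en en es es es
    tags = ["es", "en", "en", "es", "es", "es"]
    print(round(spf(tags), 3), round(cmi(tags), 3))  # 0.4, ((6-4)+2)/6 = 0.667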

3.2 Real Datasets

We chose to conduct all our experiments on English-Spanish CM tweets because English-Spanish CM is well documented (Solorio and Liu, 2008), is one of the most commonly mixed language pairs on social media (Rijhwani et al., 2017), and a couple of CM tweet datasets are readily available (Solorio et al., 2014; Rijhwani et al., 2017).

    Dataset              # Tweets    # Words       CMI    SPF
    Mono   English       100K        850K (48K)    0      0
           Spanish       100K        860K (61K)    0      0
    rCM    Train         100K        1.4M (91K)    0.31   0.105
           Validation    100K        1.4M (91K)    0.31   0.106
           Test-17       83K         1.1M (82K)    0.31   0.104
           Test-14       13K         138K (16K)    0.12   0.06
    gCM                  31M         463M (79K)    0.75   0.35

Table 1: Size of the datasets. Numbers in parentheses show the vocabulary size, i.e., the number of unique words.

[Figure 3 here.]

Figure 3: Average number of gCM sentences (y-axis) vs. mean input sentence length (x-axis).

For our experiments, we use a subset of the tweets collected by Rijhwani et al. (2017) that were automatically identified as English, Spanish or English-Spanish CM. The authors provided us around 4.5M monolingual tweets per language, and 283K CM tweets. These were already deduplicated and tagged for hashtags, URLs, emoticons and language labels automatically through the method proposed in the paper. Table 1 shows the sizes of the various datasets, which are also described below.

Mono: 50K tweets each were sampled for Spanish and English from the entire collection of monolingual tweets. The Spanish tweets were translated to English and vice versa, which gives us a total of 100K monolingual tweets in each language. We shall refer to this dataset as Mono. The sampling strategy and the reason for generating translations will become apparent in Sec. 3.3.

rCM: We use two real CM datasets in our experiments. The 283K real CM tweets provided by Rijhwani et al. (2017) were randomly divided into training, validation and test sets of nearly equal sizes. Note that for most of our experiments, we will use a very small subset of the training set consisting of 5000 tweets as train data, because the fundamental assumption of this work is that very little CM data is available for most language pairs (which is in fact true for most pairs beyond some very popularly mixed languages like English-Spanish). Nevertheless, the much larger training set is required for studying the effect of varying the amount of real CM data on our models. We shall refer to this training dataset as rCM. The test set with 83K tweets will be referred to as Test-17. We also use another dataset of English-Spanish CM tweets for testing our models, which was released during the language labeling shared task at the Workshop on "Computational Approaches to Code-switching, EMNLP 2014" (Solorio et al., 2014). We mixed the training, validation and test datasets released during this shared task to construct a set of 13K tweets, which we shall refer to as Test-14. The two test datasets consist of tweets that were collected three years apart, and therefore will help us estimate the robustness of the language models. As shown in Table 1, these datasets are quite different in terms of CMI and the average number of SPs per tweet. For computing the CMI and SP, we used an English-Spanish LID to language-tag the words. In fact, 9500 tweets in the Test-14 dataset are monolingual, but we chose to retain them because this reflects the real distribution of CM data. Further, Test-14 also has manually annotated language labels, which will be helpful while conducting an in-depth analysis of the models.

3.3 Synthetic Code-Mixed Data

As described in the previous section, we use parallel monolingual sentences to generate grammatically valid code-mixed sentences. The entire process involves the following four steps.

Step 1: We created the parallel corpus by generating translations for all the monolingual English and Spanish tweets (4.5M each) using the Bing Translator API (https://www.microsoft.com/en-us/translator/translatorapi.aspx). We have found that the translation quality varies widely across different sentences. Thus, we rank the translated sentences using the Pseudo Fuzzy-match Score (PFS) (He et al., 2010). First, the forward translation engine (e.g. English-to-Spanish) translates a monolingual source sentence s into a target t. Then the reverse translation system (e.g. Spanish-to-English) translates the target t into a pseudo source s'. Equation 2 computes the PFS between s and s':

    PFS = EditDistance(s, s') / max(|s|, |s'|)        (2)

After manual inspection, we decided to select translation pairs whose PFS <= 0.7. The edit distance is based on Wagner and Fischer (1974).
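To illustrate Equation 2, here is a small sketch (ours; the paper does not specify the tokenization, so word-level edit distance is an assumption, and the translation calls are stubbed out):

    # Pseudo Fuzzy-match Score via the classic Wagner-Fischer dynamic program.
    def edit_distance(a, b):
        dp = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, y in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                         dp[j - 1] + 1,    # insertion
                                         prev + (x != y))  # substitution
        return dp[len(b)]

    def pfs(source, pseudo_source):
        s, s2 = source.split(), pseudo_source.split()
        return edit_distance(s, s2) / max(len(s), len(s2))

    # Keep a pair only if round-tripping it (s -> t -> s') changes little:
    s  = "she lives in a white house"
    s2 = "she lives in a white home"   # hypothetical back-translation
    print(pfs(s, s2))                  # 1/6 ~ 0.167, below the 0.7 cutoff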

Step 2: We used the fast_align toolkit (https://github.com/clab/fast_align; Dyer et al., 2013) to generate the word alignments from these parallel sentences.

Step 3: The constituency parses for all the English tweets were obtained using the Stanford PCFG parser (Klein and Manning, 2003).

Step 4: Using the parallel sentences, alignments and parse trees, we apply the Equivalence Constraint theory (Sec 2.2) to generate all syntactically valid CM sentences while allowing for lexical substitution.

We randomly selected 50K monolingual Spanish and English tweets whose PFS <= 0.7. This gave us 200K monolingual tweets in all (the Mono dataset), and the total number of CM sentences generated from these 100K translation pairs was 31M, which we shall refer to as gCM. Note that even though we consider Mono and gCM as two separate sets, in reality the EC model also generates the monolingual sentences; further, the existence of gCM presumes the existence of Mono. Hence, we also use Mono as part of all training experiments which use gCM.

We would also like to point out that the choice of experimenting with a much smaller set of tweets, only 50K per language, was made because the number of generated tweets even from this small set of monolingual tweet pairs is almost prohibitively large to allow experimentation with several models and their respective configurations.

4 Approach

Language modeling is a very widely researched topic (Rosenfeld, 2000; Bengio et al., 2003; Sundermeyer et al., 2015). In recent times, deep learning has been successfully employed to build efficient LMs (Mikolov et al., 2010; Sundermeyer et al., 2012; Arisoy et al., 2012; Che et al., 2017). Baheti et al. (2017) recently showed that there is a significant effect of the training curriculum, that is, the order in which data is presented to an RNN-based LM, on the perplexity of the learnt English-Spanish CM language model on tweets. Along similar lines, in this study we focus our experiments on the training curriculum, especially regarding the use of gCM data during training, which is the primary contribution of this paper.

We do not attempt to innovate in terms of the architecture or computational structure of the LM, and use a standard LSTM-based RNN LM (Sundermeyer et al., 2012) for all our experiments. Indeed, there are enough reasons to believe that CM language is not fundamentally different from non-CM language, and therefore should not require an altogether different LM architecture. Rather, the difference arises in terms of added complexity due to the presence of lexical items and syntactic structures from two linguistic systems, which blows up the space of valid grammatical and lexical configurations and makes it essential to train the models on large volumes of data.

4.1 Training Curricula

Baheti et al. (2017) showed that rather than randomly mixing the monolingual and CM data during training, the best performance is achieved when the LM is first trained with a mixture of monolingual texts from both languages in nearly equal proportions, ending with CM data. Motivated by this finding, we define the following basic training curricula ("X | Y" indicates training the model first with data X and then data Y):

(1) rCM, (2) Mono, (3) Mono | rCM,
(4a) Mono | gCM, (4b) gCM | Mono,
(5a) Mono | gCM | rCM,
(5b) gCM | Mono | rCM

Curricula 1-3 are baselines, where gCM data is not used. Note that curriculum 3 is the best case according to Baheti et al. (2017). Curricula 4a and 4b help us examine how far generated data can substitute real data. Finally, curricula 5a and 5b use all the data, and we would expect them to perform the best.

Note that we do not experiment with other potential combinations (e.g., rCM | gCM | Mono) because it is known (and we also see this in our experiments) that adding rCM data at the end always leads to better models.
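As a sketch, phase-wise training under such a curriculum string can be driven by a small loop like the following (train_one_phase is a stub standing in for the actual RNN-LM training; the corpus sizes are placeholders roughly matching Sec. 3):

    # Minimal curriculum runner: train to convergence on each corpus in turn,
    # carrying the learned weights forward between phases.
    def train_one_phase(model_state, corpus_name, corpora):
        data = corpora[corpus_name]
        # ... run LM training on `data`, starting from `model_state` ...
        print(f"training on {corpus_name}: {len(data)} sentences")
        return model_state  # updated weights in a real implementation

    def train_curriculum(curriculum, corpora, model_state=None):
        for corpus_name in [c.strip() for c in curriculum.split("|")]:
            model_state = train_one_phase(model_state, corpus_name, corpora)
        return model_state

    corpora = {"Mono": ["..."] * 200_000,   # placeholder corpora
               "gCM":  ["..."] * 500_000,
               "rCM":  ["..."] * 5_000}
    train_curriculum("gCM | Mono | rCM", corpora)   # curriculum 5b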

4.2 Sampling from gCM

As we have seen in Sec 3.3 (Fig. 3), in the EC model, a pair of monolingual parallel tweets gives rise to a large number (typically exponential in the length of the tweet) of CM tweets. On the other hand, in reality, only a few of those tweets would be observed. Further, if all the generated sentences are used to train an LM, it is not only computationally expensive, it also leads to undesirable results, because the statistical properties of the distribution of the gCM corpus are very different from real data. We see this in our experiments (not reported in this paper for paucity of space), and also in Fig. 4, where we plot the ratio of the frequencies of the words in the gCM and Mono corpora (y-axis) against their original frequencies in Mono (x-axis). We can clearly see that the frequencies of the words are scaled up non-uniformly, the ratios varying between 1 and 500,000 for low-frequency words.

In order to reduce this skew, instead of selecting the entire gCM data, we propose three sampling techniques for creating the training data from gCM:

Random: For each monolingual pair of parallel tweets, we randomly pick a fixed number, k, of CM tweets. We shall refer to the resultant training corpus as χ-gCM.

CMI-based: For each monolingual pair of parallel tweets, we randomly pick k CM tweets and bucket them using CMI (in 0.1 intervals). Thus, in this case we can define two different curricula, where we present the data in increasing or decreasing order of CMI during training, which will be represented by the notations ↑-gCM and ↓-gCM respectively.

SPF-based: For each monolingual pair of parallel tweets, we randomly pick k CM tweets such that the SPF distribution (Section 3.1) of these tweets is similar to that of the rCM data (as estimated from the validation set). This strategy will be referred to as ρ-gCM.

Thus, depending on the gCM sampling strategy used, curricula 4a-b and 5a-b can have three different versions each. Note that since the CMI for Mono is 0, ↑-gCM is not meaningful for 4b and 5b and, similarly, ↓-gCM is not meaningful for 4a and 5a.
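As an illustration of the SPF-based strategy, here is a sketch (ours, not the released code; the paper does not spell out the matching procedure, so the 0.1-wide bins and the use of random.choices are assumptions):

    import random

    def _spf(tags):  # switch-points / word boundaries, as in Sec 3.1
        b = len(tags) - 1
        return 0.0 if b <= 0 else sum(x != y for x, y in zip(tags, tags[1:])) / b

    def sample_rho_gcm(variants, target_hist, k=5, seed=0):
        """Draw k of one pair's generated variants (given as language-tag
        sequences) so that, in expectation, the sample follows a target SPF
        histogram estimated from real CM data."""
        rng = random.Random(seed)
        by_bin = {}
        for v in variants:
            by_bin.setdefault(min(int(_spf(v) * 10), 9), []).append(v)
        bins = sorted(by_bin)
        weights = [target_hist.get(b, 0.0) for b in bins]
        if not any(weights):
            weights = [1.0] * len(bins)
        return [rng.choice(by_bin[rng.choices(bins, weights=weights)[0]])
                for _ in range(k)]

    variants = [["en"] * 6,
                ["en", "es", "en", "es", "en", "es"],
                ["en", "en", "en", "es", "es", "es"]]
    target = {0: 0.5, 2: 0.4, 9: 0.1}   # hypothetical rCM SPF histogram
    print(sample_rho_gcm(variants, target, k=5))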

[Figure 4 here.]

Figure 4: Scatter plot of fractional increase in word frequency in gCM (y-axis) vs. original frequency (x-axis).

5 Experiments and Results

For all our experiments, we use a 2-layered RNN with LSTM units and a hidden layer dimension of 100. While training, we use sampled softmax with 5000 samples instead of a full softmax to speed up the training process. The sampling is based on the word frequency in the training corpus. We use momentum SGD with a learning rate of 0.002. We have used the CNTK toolkit (https://www.microsoft.com/en-us/cognitive-toolkit/) for building our models. We use a fixed k=5 (from each monolingual pair) for sampling the gCM data. We observed the performance of ↑-gCM to be the best when trained till CMI 0.4 and, similarly, of ↓-gCM when trained from 1.0 to 0.6.
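For reference only, a roughly equivalent model can be sketched in PyTorch (the paper itself uses CNTK; PyTorch has no drop-in sampled-softmax, so this sketch uses a full softmax, and the momentum coefficient and vocabulary size below are assumptions, since only the learning rate is reported):

    import torch
    import torch.nn as nn

    class CMLanguageModel(nn.Module):
        def __init__(self, vocab_size, emb_dim=100, hidden=100, layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden, num_layers=layers,
                                batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, token_ids):                 # (batch, time)
            h, _ = self.lstm(self.embed(token_ids))   # (batch, time, hidden)
            return self.out(h)                        # next-token logits

    model = CMLanguageModel(vocab_size=91_000)        # ~rCM train vocabulary
    opt = torch.optim.SGD(model.parameters(), lr=0.002, momentum=0.9)
    x = torch.randint(0, 91_000, (8, 20))             # a dummy batch
    logits = model(x[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), x[:, 1:].reshape(-1))
    loss.backward(); opt.step()
    print(float(loss))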

5.1 Results

Table 2 presents the perplexities on the validation, Test-14 and Test-17 datasets for all the models (Col. 3, 4 and 5). We observe the following trends: (1) Model 5(b)-ρ has the lowest perplexity (significantly different from the second lowest value in the column, p < 0.00001 for a paired t-test). (2) There is a 55 and 90 point reduction in perplexity on the Test-17 and Test-14 sets respectively from the baseline experiment 3, which does not use gCM data. Thus, addition of gCM data is helpful. (3) Only the 4a and 4b models are worse than 3, while the 5a and 5b models are better. Hence, rCM is indispensable, even though gCM helps. (4) SPF-based sampling performs significantly better (again p < 0.00001) than the other sampling techniques.

To put these numbers in perspective, we also trained our model on 50k monolingual English data, which gave a PPL of 264. This shows that the high PPL values our models obtain are due to the inherent complexity of modeling CM language. This is further substantiated by the PPL values computed only at the code-switch points, which are shown in Table 2, col. 6, 7 and 8. Even for the best model, which in this case is 5(a)-χ, the PPL is four times higher than the overall PPL on Test-17.

    ID        Training curriculum      Overall PPL                Avg. SP PPL
                                       Valid  Test-17  Test-14    Valid   Test-17  Test-14
    1         rCM                      1995   2018     1822       5598    5670     8864
    2         Mono                     1588   1607     892        23378   23790    26901
    3         Mono | rCM               1029   1041     861        4734    4824     7913
    4(a)-χ    Mono | χ-gCM             1749   1771     1119       5752    5869     6065
    4(a)-↑    Mono | ↑-gCM             1852   1872     1208       9074    9167     8803
    4(a)-ρ    Mono | ρ-gCM             1599   1618     1116       6534    6618     7293
    4(b)-χ    χ-gCM | Mono             1659   1680     903        20634   21028    20300
    4(b)-↓    ↓-gCM | Mono             1900   1917     973        28422   28722    25006
    4(b)-ρ    ρ-gCM | Mono             1622   1641     871        26191   26710    22557
    5(a)-χ    Mono | χ-gCM | rCM       1026   1038     836        4317    4386     5958
    5(a)-↑    Mono | ↑-gCM | rCM       1045   1058     961        4983    5078     6861
    5(a)-ρ    Mono | ρ-gCM | rCM       999    1011     830        4736    4829     6807
    5(b)-χ    χ-gCM | Mono | rCM       1006   1019     790        4878    4987     7018
    5(b)-↓    ↓-gCM | Mono | rCM       1012   1025     800        5396    5489     7476
    5(b)-ρ    ρ-gCM | Mono | rCM       976    986      772        4810    4912     6547

Table 2: Perplexity of the LM models on all tweets (Overall PPL, left block) and only at SPs (Avg. SP PPL, right block).

    RL    3      5(a)-χ  5(a)-ρ  5(a)-↑  5(b)-↓  5(b)-χ  5(b)-ρ
    1     13222  12815   13717   14017   13761   13494   13077
    2     2201   2120    2064    2078    2155    2256    2108
    3     970    926     902     896     914     966     911
    4     643    594     567     575     573     608     571
    5     574    540     509     517     502     553     503
    6     593    545     529     543     520     566     529
    ≥7    507    465     444     460     431     479     440

Table 3: Perplexities of minor language runs for various run lengths (RL) on Test-17.

    Sample size (k)    1     2     5     10
    # tweets           93K   184K  497K  952K
    5(b)-ρ             1081  1053  986   1019

Table 5: Variation of PPL on Test-17 with gCM sample size k. Similar trends for other models.

Run length: The complexity of modeling CM is also apparent from Table 3, which reports the perplexity values of the 3 and 5 models for monolingual fragments of various run lengths. We define run length as the number of words in a maximal monolingual fragment or run within a tweet. In our analysis, we only consider runs of the embedded language, defined as the language that has fewer words. As one would expect, model 5(a)-χ performs the best for run length 1 (recall that it has the lowest PPL at SPs), but as the run length increases, the models sampling the gCM data using CMI (5(a)-↑ and 5(b)-↓) are better than the randomly sampled (χ) models. Runs of length 1 are typically cases of word borrowing and lexical substitution; higher run length segments are typically an indication of CM. Clearly, modeling the shorter runs of the embedded language seems to be one of the most challenging aspects of CM LM.

Significance of Linguistic Constraints: To understand the importance of the linguistic constraints imposed by EC on the generation of gCM, we conducted an experiment where a synthetic CM corpus was created by combining random contiguous segments from the monolingual tweets, such that the generated CM tweets' SPF distribution matched that of rCM. When we replaced gCM by this corpus in 5(b)-ρ, the PPL on Test-17 was 1060, which is worse than the baseline PPL.

Effect of rCM size: Table 4 shows the PPL values for models 3 and 5(b)-ρ when trained with different amounts of rCM data, keeping other parameters constant. As expected, the PPL drops for both models as the rCM size increases. However, even with high rCM data, gCM does help in improving the LM, until we have 50k rCM data (comparable to monolingual, and an unrealistic scenario in practice), where the returns of adding gCM start diminishing. We also observe that in general, model 3 needs twice the amount of rCM data to perform as well as model 5(b)-ρ.

    # rCM     0.5K   1K     2.5K   5K    10K   50K
    3         1238   1186   1120   1041  991   812
    5(b)-ρ    1181   1141   1068   986   951   808

Table 4: Perplexity variation on Test-17 with changes in the amount of rCM train data. Similar trends for other models (left out for paucity of space).

Effect of gCM size: In our sampling methods on gCM data, we fixed our sample size k as 5 for consistency and feasibility of experiments. To understand the effect of k (and hence the size of the gCM data), we experimented with k = 1, 2, and 10, keeping everything else fixed. Table 5 reports the results for the models 3 and 5(b)-ρ. We observe that unlike rCM data, increasing gCM data or k does not necessarily decrease PPL after a point. We speculate that there is a trade-off between k and the amount of rCM data, and also probably between these and the amount of monolingual data. We plan to explore this further in future.

6 Related Work

We briefly describe the various types of approaches used for building LMs for CM text. Bilingual models: These models combine data from monolingual data sources in both languages (Weng et al., 1997). Factored models: Gebhardt (2011) uses Factored Language Models for rescoring n-best lists during ASR decoding. The factors used include POS tags, CS point probability and LID. In Adel et al. (2014b; 2014a; 2013), RNNLMs are combined with n-gram based models, or converted to backoff models, giving improvements in perplexity and mixed error rate. Models that incorporate linguistic constraints: Li and Fung (2013) use inversion constraints to predict CS points and integrate this prediction into the ASR decoding process. Li and Fung (2014) integrate Functional Head Constraints (FHC) for code-switching into the language model for Mandarin-English speech recognition. This work uses parsing techniques to restrict the lattice paths during decoding of speech to those permissible under the FHC theory. Our method instead imposes grammatical constraints (EC theory) to generate synthetic data, which can potentially be used to augment real CM data. This allows the flexibility to deploy any sophisticated LM architecture, and the synthetic data generated can also be used for CM tasks other than speech recognition. Training curricula for CM: Baheti et al. (2017) show that a training curriculum where an RNN-LM is trained first with interleaved monolingual data in both languages followed by CM data gives the best results for English-Spanish LM. The perplexity of this model is 4544, which then reduces to 298 after interpolation with a statistical n-gram LM. However, these numbers are not directly comparable to our work because the datasets are different. Our work is an extension of this approach, showing that adding synthetic data further improves results.

We do not know of any work that uses synthetically generated CM data for training LMs.

7 Conclusion

In this paper, we presented a computational method for generating synthetic CM data based on the EC theory of code-mixing, and showed that sampling text from the synthetic corpus (according to the distribution of SPF found in real CM data) helps in reducing the PPL of the RNN-LM by an amount which is equivalently achieved by doubling the amount of real CM data. We also showed that randomly generated CM data doesn't improve the LM. Thus, the linguistic theory based generation is of crucial significance.

There is no unanimous theory in linguistics on the syntactic structure of CM language. Hence, as future work, we would like to compare the usefulness of different linguistic theories and different constraints within each theory in our proposed LM framework. This can also provide an indirect validation of the theories. Further, we would like to study sampling techniques motivated by natural distributions of linguistic structures.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable suggestions.

References

Heike Adel, K. Kirchhoff, N. T. Vu, D. Telaar, and T. Schultz. 2014a. Combining recurrent neural networks and factored language models during decoding of code-switching speech. In INTERSPEECH, pages 1415–1419.

Heike Adel, K. Kirchhoff, N. T. Vu, D. Telaar, and T. Schultz. 2014b. Comparing approaches to convert recurrent neural networks into backoff language models for efficient decoding. In INTERSPEECH, pages 651–655.

Heike Adel, N. T. Vu, and T. Schultz. 2013. Combination of recurrent neural networks and factored language models for code-switching language modeling. In ACL (2), pages 206–211.

Ebru Arisoy, Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran. 2012. Deep neural network language models. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, pages 20–28. Association for Computational Linguistics.

Ashutosh Baheti, Sunayana Sitaram, Monojit Choudhury, and Kalika Bali. 2017. Curriculum design for code-switching: Experiments with language identification and language modeling with deep neural networks. In Proc. of ICON-2017, Kolkata, India, pages 65–74.

Utsab Barman, Amitava Das, Joachim Wagner, and Jennifer Foster. 2014. Code mixing: A challenge for language identification in the language of social media. In The 1st Workshop on Computational Approaches to Code Switching, EMNLP 2014.

Hedi M. Belazi, Edward J. Rubin, and Almeida Jacqueline Toribio. 1994. Code switching and X-bar theory: The functional head constraint. Linguistic Inquiry, pages 221–237.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Tong Che, Yanran Li, Ruixiang Zhang, R. Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. 2017. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983.

A.-M. DiSciullo, Pieter Muysken, and R. Singh. 1986. Government and code-mixing. Journal of Linguistics, 22:1–24.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of NAACL-HLT 2013, pages 644–648. Association for Computational Linguistics.

B. Gamback and A. Das. 2014. On measuring the complexity of code-mixing. In Proc. of the 1st Workshop on Language Technologies for Indian Social Media (Social-India).

B. Gamback and A. Das. 2016. Comparing the level of code-switching in corpora. In Proc. of the 10th International Conference on Language Resources and Evaluation (LREC).

Jan Gebhardt. 2011. Speech recognition on English-Mandarin code-switching data using factored language models.

Yifan He, Yanjun Ma, Andy Way, and Josef Van Genabith. 2010. Integrating n-best SMT outputs into a TM system. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 374–382. Association for Computational Linguistics.

A. K. Joshi. 1985. Processing of sentences with intrasentential code switching. In D. R. Dowty, L. Karttunen, and A. M. Zwicky, editors, Natural Language Parsing: Psychological, Computational, and Theoretical Perspectives, pages 190–205. Cambridge University Press, Cambridge.

D. Klein and C. D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Ying Li and P. Fung. 2013. Improved mixed language speech recognition using asymmetric acoustic model and language model with code-switch inversion constraints. In ICASSP, pages 7368–7372.

Ying Li and P. Fung. 2014. Language modeling with functional head constraint for code switching speech recognition. In EMNLP.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Carol Myers-Scotton. 1993. Duelling Languages: Grammatical Structure in Code-switching. Clarendon Press, Oxford.

Carol Myers-Scotton. 1995. A lexically based model of code-switching. In Lesley Milroy and Pieter Muysken, editors, One Speaker, Two Languages: Cross-disciplinary Perspectives on Code-switching, pages 233–256. Cambridge University Press, Cambridge.

Rana D. Parshad, Suman Bhowmick, Vineeta Chand, Nitu Kumari, and Neha Sinha. 2016. What is India speaking? Exploring the "Hinglish" invasion. Physica A, 449:375–389.

Shana Poplack. 1980. Sometimes I'll start a sentence in Spanish y termino en español. Linguistics, 18:581–618.

Ameya Prabhu, Aditya Joshi, Manish Shrivastava, and Vasudeva Varma. 2016. Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2482–2491.

Shruti Rijhwani, R. Sequiera, M. Choudhury, K. Bali, and C. S. Maddila. 2017. Estimating code-switching on Twitter with a novel generalized word-level language identification technique. In ACL.

Ronald Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270–1278.

Koustav Rudra, S. Rijhwani, R. Begum, K. Bali, M. Choudhury, and N. Ganguly. 2016. Understanding language preference for expression of opinion and sentiment: What do Hindi-English speakers do on Twitter? In EMNLP, pages 1131–1141.

David Sankoff. 1998. A formal production-based explanation of the facts of code-switching. Bilingualism: Language and Cognition, 1(01):39–50.

A. Sharma, S. Gupta, R. Motlani, P. Bansal, M. Srivastava, R. Mamidi, and D. M. Sharma. 2016. Shallow parsing pipeline for Hindi-English code-mixed social media text. In Proceedings of NAACL-HLT.

Sunayana Sitaram, Sai Krishna Rallabandi, Shruti Rijhwani, and Alan W. Black. 2016. Experiments with cross-lingual systems for synthesis of code-mixed text. In 9th ISCA Speech Synthesis Workshop.

Thamar Solorio and Yang Liu. 2008. Part-of-speech tagging for English-Spanish code-switched text. In Proc. of EMNLP.

Thamar Solorio et al. 2014. Overview for the first shared task on language identification in code-switched data. In 1st Workshop on Computational Approaches to Code Switching, EMNLP, pages 62–72.

Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. IEEE Transactions on Audio, Speech, and Language Processing, 23(3):517–529.

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.

Yogarshi Vyas, S. Gella, J. Sharma, K. Bali, and M. Choudhury. 2014. POS tagging of English-Hindi code-mixed social media content. In Proc. EMNLP, pages 974–979.

Robert A. Wagner and Michael J. Fischer. 1974. The string-to-string correction problem. Journal of the ACM (JACM), 21(1):168–173.

Fuliang Weng, H. Bratt, L. Neumeyer, and A. Stolcke. 1997. A study of multilingual speech recognition. In EUROSPEECH, volume 1997, pages 359–362. Citeseer.
