Automatic Detection of Code-switching Style from Acoustics

SaiKrishna Rallabandi*, Sunayana Sitaram†, Alan W Black*
*Language Technologies Institute, Carnegie Mellon University, USA
†Microsoft Research India
[email protected], [email protected], [email protected]

Abstract

Multilingual speakers switch between languages displaying inter sentential, intra sentential, and congruent lexicalization based transitions. While monolingual ASR systems may be capable of recognizing a few words from a foreign language, they are usually not robust enough to handle these varied styles of code-switching. There is also a lack of large code-switched speech corpora capturing all these styles, making it difficult to build code-switched speech recognition systems. We hypothesize that it may be useful for an ASR system to be able to first detect the switching style of a particular utterance from acoustics, and then use specialized language models or other adaptation techniques for decoding the speech. In this paper, we look at the first problem of detecting code-switching style from acoustics. We classify code-switched Spanish-English and Hindi-English corpora using two metrics and show that features extracted from acoustics alone can distinguish between different kinds of code-switching in these language pairs.

Index Terms: speech recognition, code-switching, language identification

1 Introduction

Code-switching refers to the phenomenon where bilingual speakers alternate between languages while speaking. It occurs in multilingual societies around the world. As Automatic Speech Recognition (ASR) systems now recognize conversational speech, it becomes important that they handle code-switching. Furthermore, code-switching affects co-articulation and context dependent acoustic modeling (Elias et al., 2017). Therefore, developing systems for such speech requires careful handling of unexpected language switches that may occur in a single utterance. We hypothesize that in such scenarios it would be desirable to condition the recognition systems on the type (Muysken, 2000) or style of language mixing that might be expected in the signal. In this paper, we present approaches to detecting code-switching 'style' from acoustics. We first define the style of an utterance based on two metrics that indicate the level of mixing in the utterance: the Code Mixing Index (CMI) and the Code Mixing Span Index. Based on these, we classify each mixed utterance into 5 style classes. We also obtain an utterance level acoustic representation for each of the utterances using a variant of SoundNet. Using this acoustic representation as features, we try to predict the style of the utterance.

2 Related Work

Prior work on building Acoustic and Language Models for ASR systems for code-switched speech can be categorized into the following approaches: (1) Detecting code-switching points in an utterance, followed by the application of monolingual acoustic and language models to the individual segments (Chan et al., 2004; Lyu and Lyu, 2008; Shia et al., 2004). (2) Employing a shared phone set to build acoustic models for mixed speech with standard language models trained on code-switched text (Imseng et al., 2011; Li et al., 2011; Bhuvanagiri and Kopparapu, 2010; Yeh et al., 2010). (3) Training Acoustic or Language models on monolingual data in both languages with little or no code-switched data (Lyu et al., 2006; Vu et al., 2012; Bhuvanagirir and Kopparapu, 2012; Yeh and Lee, 2015). We attempt to approach this problem by first identifying the style of code-mixing from acoustics.

Table 1: Distribution of CMI classes for Hinglish and Spanglish

Class   CMI Range   Hi-En Utts   En-Es Utts
C1      0           6771         41624
C2      0-0.15      13986        2284
C3      0.15-0.30   492          2453
C4      0.30-0.45   8865         1025
C5      0.45-1      2496         1562

Table 2: Distribution of span based classes for Hinglish and Spanglish. Note that the term 'Matrix' is used here only notionally, to indicate the larger span of the language.

Class   Description    Hi-En   En-Es
S1      Mono En        5413    27960
S2      Mono Hi/Es     0       12749
S3      En Matrix      626     2883
S4      Hi/Es Matrix   36454   1986
S5      Others         8307    3345

This is similar to the problem of language identification from acoustics, which is typically done over the span of an entire utterance.

Deep Learning based methods have recently proven very effective in speaker and language recognition tasks. Prior work in Deep Neural Network (DNN) based language recognition can be grouped into two categories: (1) Approaches that use DNNs as feature extractors followed by separate classifiers to predict the identity of the language (Jiang et al., 2014; Matejka et al., 2014; Song et al., 2013) and (2) Approaches that employ DNNs to directly predict the language ID (Richardson et al., 2015b,a; Lopez-Moreno et al., 2014). Although DNN based systems outperform the iVector based approaches, the output decision depends on the outcome from every frame. This limits the real time deployment capabilities of such systems. Moreover, such systems typically use a fixed contextual window spanning hundreds of milliseconds of speech, while the language effects in a code-switched scenario are suprasegmental and typically span a longer range. In addition, the accuracies of such systems, especially ones that employ some variant of iVectors, drop as the duration of the utterance is reduced. We follow the approach of using DNNs as utterance level feature extractors. Our interest is in adding long term information to influence the recognition model, particularly at the level of the complete utterance, representing stylistic aspects of the degree and style of code-switching throughout the utterances.

3 Style of Mixing and Motivation

Multiple metrics have been proposed to quantify code-mixing (Guzmán et al., 2017; Gambäck and Das, 2014), such as the span of the participating languages, burstiness and complexity. For our current study, we categorize the utterances into different styles based on two metrics: (1) the Code Mixing Index (Gamback and Das, 2014), which attempts to quantify the code-mixing based on word counts, and (2) Code Mixed Span information, which attempts to quantify the code-mixing of an utterance based on the span of the participating languages.

3.1 Categorization based on Code Mixing Index

The Code Mixing Index (Gamback and Das, 2014) was introduced to quantify the level of mixing between the participating languages in a code-mixed utterance. CMI can be calculated at the corpus and utterance level. We use utterance CMI, which is defined as:

C_u(x) = 100 \cdot \frac{w_m \left( N(x) - \max_{L_i \in L} t_{L_i}(x) \right) + w_p \, P(x)}{N(x)} \qquad (1)

where N(x) is the number of tokens in utterance x, t_{L_i}(x) is the number of tokens in language L_i, P(x) is the number of code alternation points in x, and w_m and w_p are weights. In our current study, we quantize the range of the code mixing index (0 to 1) into 5 styles and categorize each utterance as shown in Table 1. A CMI of 0 indicates that the utterance is monolingual. We experimented with various CMI ranges and found that the chosen ranges led to a reasonable distribution within the corpus. For example, the C2 CMI class in Hindi-English code-switched data has utterances such as "पंधरा पे start किये थे यारा, पंधरा पे यार अभी तो कुछ नहीं हुआ" ('started at fifteen, eleven or fifteen but buddy nothing has happened so far'). The C4 class, on the other hand, has utterances such as "actual में आज यह rainy season का मौसम था ना" ('actually the weather today was like rainy season, right?'). An example of a C5 utterance is "ohh English अच्छा English कौनसा favourite singer मतलब English में?" ('Ohh English, ok who is your favorite English singer?').
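To make Equation 1 concrete, here is a minimal Python sketch (not from the paper) of utterance-level CMI and the quantization of Table 1. It assumes each token already carries a language tag from some upstream identifier, and the weights w_m and w_p are illustrative placeholders, since their values are not fixed above.

```python
from collections import Counter

def utterance_cmi(lang_tags, w_m=0.5, w_p=0.5):
    """Utterance-level CMI per Equation 1, normalized to the range 0-1.

    lang_tags: one language label per token, e.g. ['hi', 'en', 'hi'].
    The weights w_m and w_p are illustrative; the paper does not state them.
    """
    n = len(lang_tags)                                  # N(x): token count
    if n == 0:
        return 0.0
    max_lang = max(Counter(lang_tags).values())         # max_{Li} t_Li(x)
    p = sum(a != b for a, b in zip(lang_tags, lang_tags[1:]))  # P(x)
    return (w_m * (n - max_lang) + w_p * p) / n         # 0-1 scale (factor 100 dropped)

def cmi_class(cmi):
    """Quantize a normalized CMI value into the classes C1-C5 of Table 1."""
    if cmi == 0:
        return 'C1'
    for upper, label in [(0.15, 'C2'), (0.30, 'C3'), (0.45, 'C4')]:
        if cmi <= upper:
            return label
    return 'C5'

# Example: a mostly-Hindi utterance with one English insertion -> 'C3'.
print(cmi_class(utterance_cmi(['hi', 'hi', 'en', 'hi', 'hi'])))
```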

3.2 Categorization based on Span of Code Mixing

While CMI captures the level of mixing, it does not take into account the span information (regularity) of mixing. Therefore, we use language span information (Guzmán et al., 2017) to categorize the utterances into 5 different styles, as shown in Table 2. We divide each utterance based on the span of the participating languages into five classes: monolingual English, monolingual Hindi or Spanish, two classes where one of the languages is dominant (70% or more), and all other utterances. The classes S3 and S4 indicate that the primary language in the utterance has a span of at least 70% with respect to the length of the utterance. This criterion makes these classes notionally similar to the construct of a 'matrix' language. However, we do not consider any information related to word identity in this approach. As we can see from both the CMI and span-based classes, the distributions of the two language pairs are very different. The Spanglish data contains much more monolingual data, while the Hinglish data is predominantly Hindi matrix with English embeddings. The Hinglish data set does not have monolingual Hindi utterances, which is due to the way the data was selected, as explained in Section 4.1.
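As a companion to the CMI sketch above, the following hypothetical snippet implements the span-based categorization of Table 2, again assuming token-level language tags and using token-count fractions as a simple proxy for span length; the 70% threshold is the dominance criterion from this section.

```python
def span_class(lang_tags, primary='en', secondary='hi', threshold=0.70):
    """Assign a span-based style class S1-S5 (Table 2) to one utterance.

    lang_tags: per-token language labels; 'primary'/'secondary' would be
    'en'/'es' for Spanglish. The 70% threshold follows Section 3.2.
    """
    n = len(lang_tags)
    if n == 0:
        return 'S5'
    frac_primary = lang_tags.count(primary) / n
    frac_secondary = lang_tags.count(secondary) / n
    if frac_primary == 1.0:
        return 'S1'    # monolingual English
    if frac_secondary == 1.0:
        return 'S2'    # monolingual Hindi/Spanish
    if frac_primary >= threshold:
        return 'S3'    # English is the dominant ("matrix") language
    if frac_secondary >= threshold:
        return 'S4'    # Hindi/Spanish is the dominant ("matrix") language
    return 'S5'        # all other mixing patterns

# Example: a Hindi-matrix utterance with a single English insertion -> 'S4'.
print(span_class(['hi', 'hi', 'en', 'hi', 'hi']))
```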
3.3 Style Modeling using Modified SoundNet

SoundNet (Aytar et al., 2016) is a deep convolutional network that takes raw waveforms as input and is trained to predict objects and scenes in video streams. Once the network is trained, the activations of intermediate layers can be treated as a high level representation usable for other tasks. However, SoundNet is a fully convolutional network in which the frame rate decreases with each layer: each convolutional layer doubles the number of feature maps and halves the frame rate. The network is trained to minimize the KL divergence from the ground truth distributions to the predicted distributions. The higher layers in SoundNet are subsampled too much to be used directly for feature extraction. To alleviate this, we train a fully connected variant of SoundNet (Wang and Metze, 2017): instead of using convolutional layers all the way up, we switch to fully connected layers after the 5th layer. We also changed the input sampling rate to 16 kHz to match the rate of the provided data.

Figure 1: Architecture for style modeling using modified SoundNet
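The sketch below illustrates the overall shape of such a fully connected SoundNet variant in PyTorch, together with the penultimate-layer mean pooling used later in Section 4.2. The channel widths, kernel sizes and output dimension are illustrative stand-ins rather than the exact configuration of Aytar et al. (2016) or Wang and Metze (2017).

```python
import torch
import torch.nn as nn

class FCSoundNet(nn.Module):
    """Sketch of the fully connected SoundNet variant described above.

    The first five layers remain convolutional and operate on raw 16 kHz
    waveforms; layers 6-8 are fully connected. All sizes are illustrative.
    """

    def __init__(self, n_targets=1401):
        super().__init__()
        blocks, channels = [], [1, 16, 32, 64, 128, 256]
        for c_in, c_out in zip(channels, channels[1:]):
            blocks += [nn.Conv1d(c_in, c_out, kernel_size=8, stride=2, padding=3),
                       nn.BatchNorm1d(c_out),
                       nn.ReLU(),
                       nn.MaxPool1d(2)]
        self.convs = nn.Sequential(*blocks)
        self.fc6 = nn.Sequential(nn.Linear(256, 512), nn.ReLU())
        self.fc7 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())  # penultimate layer
        self.fc8 = nn.Linear(512, n_targets)  # stand-in head for object/scene targets

    def embed(self, wav):
        """wav: (batch, 1, samples) raw 16 kHz audio -> (batch, 512) embedding."""
        frames = self.convs(wav).transpose(1, 2)   # (batch, frames, 256)
        h = self.fc7(self.fc6(frames))             # 7th-layer activations per frame
        return h.mean(dim=1)                       # mean pooling over frames
```

An utterance embedding would then be obtained as FCSoundNet().embed(wav) for a (1, 1, n_samples) waveform tensor.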

4 Experimental Setup

4.1 Data

We use the code-switched Spanish-English data (referred to as Spanglish hereafter) released as part of the Miami Corpus (Deuchar et al., 2014) for training and testing. The corpus consists of 56 audio recordings and their corresponding transcripts of informal conversation between two or more speakers, involving a total of 84 speakers. We segment the files based on the transcriptions provided and obtain a total of 51,993 utterances. For Hinglish, we use an in-house speech corpus of conversational speech. Participants were given a topic and asked to have a conversation in Hindi with another speaker. 40% of the data had at least one English word in it, which was transcribed in English, while the Hindi portion of the data was transcribed in Devanagari script. We split the data into Hindi and Hinglish by filtering for English words; hence the Hinglish data does not contain monolingual Hindi utterances. Note that this data did contain a few monolingual English sentences, but they were typically single word sentences. Such English utterances were considered to be part of the Hinglish class. The number of Hinglish utterances is 54,279.

4.2 Style Identification

For style identification we perform the following procedure. We first categorize the utterances into 5 styles based on the criteria described in Section 3. We then pass each utterance through the pretrained modified SoundNet and obtain the representations at all the layers. We use the representation from the 7th (penultimate) layer as the embedding for the utterance. We experimented with combining the representations at multiple layers but found that they do not outperform the representation at the 7th layer alone. Therefore, for the purposes of this paper, we restrict ourselves to the representation at the penultimate layer. The embedding is obtained by performing mean pooling on this representation. Finally, we train a Random Forest classifier using the obtained embeddings to predict the style of mixing.
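As an illustration of this two stage pipeline, the sketch below trains a Random Forest on precomputed utterance embeddings and reports per-class precision, recall and F1 as in Figure 2. The file names, train/test split and forest size are hypothetical, as the paper does not report these details.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Hypothetical precomputed inputs: X holds the 512-dim mean-pooled
# embeddings from the sketch above, y the style labels (C1-C5 or S1-S5).
X = np.load('utterance_embeddings.npy')   # shape: (n_utterances, 512)
y = np.load('style_labels.npy')           # shape: (n_utterances,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Per-class precision, recall and F1, as plotted in Figure 2.
p, r, f1, _ = precision_recall_fscore_support(
    y_test, clf.predict(X_test), labels=clf.classes_, average=None)
for cls, scores in zip(clf.classes_, zip(p, r, f1)):
    print(cls, ['%.3f' % s for s in scores])
```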

4.3 Results and Discussion

Figure 2: Precision, Recall and F1 scores for 5 way style classification of Hinglish and Spanglish

Figure 2 shows the results for 5 class classification for Hinglish and Spanglish based on CMI (classes C1-C5) and span (classes S1-S5). Some classes (C1, C2, C3, S1 and S4 for Hi-En; C1, C4, C5, S1, S4 and S5 for En-Es) are easier to predict and are not always the majority classes. In our current implementation, we use a two stage approach for feature extraction and classification. We hypothesize that there might be better approaches to performing each of these components independently. It might also be possible to incorporate a style discovery module in an end to end fashion (Wang et al., 2018). As we plan to include the predicted style information in our recognition system, we also evaluate our approach using language models. For this, we build style specific language models, test them on style specific test sets, and report the average perplexity values in Table 3. Ground Truth indicates that the model was built on the classes segregated based on the approaches described in Section 3. Predicted indicates that the language model was built on the classes predicted by the model described in Section 4.2. We also build a language model on utterances from the majority class for CMI and Span, as well as on all the Spanglish data with no style information. As can be observed, the perplexity shows a considerable reduction when using style specific information, while the majority style does not lead to the same reduction over the model with no style information. This further validates our hypothesis that style specific models may help decrease LM perplexities and ASR error rates.

Table 3: Language Model Experiments

Language    Metric   Model            Avg Ppl
Spanglish   CMI      Ground Truth     54.8
                     Predicted        56.2
                     Majority Class   81.2
            Span     Ground Truth     59.1
                     Predicted        62.8
                     Majority Class   80.2
                     No Style Info    82.1
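To make the language model comparison concrete, the toy sketch below trains a bigram model with add-one smoothing on utterances of one style and measures perplexity on a matching held-out set. The toolkit, n-gram order and smoothing actually used for Table 3 are not specified, so this is only a schematic stand-in.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized utterances."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ['<s>'] + sent.split() + ['</s>']
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def perplexity(sentences, unigrams, bigrams):
    """Add-one smoothed bigram perplexity of a test set."""
    vocab = len(unigrams)
    log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        toks = ['<s>'] + sent.split() + ['</s>']
        for prev, cur in zip(toks, toks[1:]):
            p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# style_train / style_test: lists of transcribed utterances for one style class.
# uni, bi = train_bigram(style_train)
# print(perplexity(style_test, uni, bi))
```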

5 Conclusion

In this paper, we present a preliminary attempt at categorizing code-switching style from acoustics, which can be used as a first pass by a speech recognition system. Language model experiments indicate promising results, with a considerable reduction in perplexity for style-specific models. In future work, we plan to improve our feature extraction and classification models and test our language models on code-switched speech recognition.

References

Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pages 892–900.

K. Bhuvanagiri and Sunil Kopparapu. 2010. An approach to mixed language automatic speech recognition. Oriental COCOSDA, Kathmandu, Nepal.

Kiran Bhuvanagirir and Sunil Kumar Kopparapu. 2012. Mixed language speech recognition without explicit identification of language. American Journal of Signal Processing 2(5):92–97.

Joyce Y. C. Chan, P. C. Ching, Tan Lee, and Helen M. Meng. 2004. Detection of language boundary in code-switching utterances by bi-phone probabilities. In Chinese Spoken Language Processing, 2004 International Symposium on. IEEE, pages 293–296.

Margaret Deuchar, Peredur Davies, Jon Herring, M. Carmen Parafita Couto, and Diana Carter. 2014. Building bilingual corpora. Advances in the Study of Bilingualism, pages 93–111.

Vanessa Elias, Sean McKinnon, and Ángel Milla-Muñoz. 2017. The effects of code-switching and lexical stress on vowel quality and duration of heritage speakers of Spanish. Languages 2(4):29.

B. Gamback and A. Das. 2014. On measuring the complexity of code-mixing. In Proceedings of the 1st Workshop on Language Technologies for Indian Social Media (Social-India).

Björn Gambäck and Amitava Das. 2014. On measuring the complexity of code-mixing. In Proceedings of the 11th International Conference on Natural Language Processing, Goa, India, pages 1–7.

Gualberto Guzmán, Joseph Ricard, Jacqueline Serigos, Barbara E. Bullock, and Almeida Jacqueline Toribio. 2017. Metrics for modeling code-switching across corpora. Proc. Interspeech 2017, pages 67–71.

David Imseng, Hervé Bourlard, Mathew Magimai Doss, and John Dines. 2011. Language dependent universal phoneme posterior estimation for mixed language speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, pages 5012–5015.

Bing Jiang, Yan Song, Si Wei, Jun-Hua Liu, Ian Vince McLoughlin, and Li-Rong Dai. 2014. Deep bottleneck features for spoken language identification. PloS ONE 9(7):e100795.

Ying Li, Pascale Fung, Ping Xu, and Yi Liu. 2011. Asymmetric acoustic modeling of mixed language speech. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, pages 5004–5007.

Ignacio Lopez-Moreno, Javier Gonzalez-Dominguez, Oldrich Plchot, David Martinez, Joaquin Gonzalez-Rodriguez, and Pedro Moreno. 2014. Automatic language identification using deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, pages 5337–5341.

Dau-Cheng Lyu and Ren-Yuan Lyu. 2008. Language identification on code-switching utterances using multiple cues. In Ninth Annual Conference of the International Speech Communication Association.

Dau-Cheng Lyu, Ren-Yuan Lyu, Yuang-chin Chiang, and Chun-Nan Hsu. 2006. Speech recognition on code-switching among the Chinese dialects. In 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings. IEEE, volume 1, pages I–I.

Pavel Matejka, Le Zhang, Tim Ng, Harish Sri Mallidi, Ondrej Glembek, Jeff Ma, and Bing Zhang. 2014. Neural network bottleneck features for language identification. Proceedings of IEEE Odyssey, pages 299–304.

Pieter Muysken. 2000. Bilingual Speech: A Typology of Code-mixing, volume 11. Cambridge University Press.

Fred Richardson, Douglas Reynolds, and Najim Dehak. 2015a. Deep neural network approaches to speaker and language recognition. IEEE Signal Processing Letters 22(10):1671–1675.

Fred Richardson, Douglas Reynolds, and Najim Dehak. 2015b. A unified deep neural network for speaker and language recognition. arXiv preprint arXiv:1504.00923.

Chi-Jiun Shia, Yu-Hsien Chiu, Jia-Hsin Hsieh, and Chung-Hsien Wu. 2004. Language boundary detection and identification of mixed-language speech based on MAP estimation. In Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP'04). IEEE, volume 1, pages I–381.

Yan Song, Bing Jiang, YeBo Bao, Si Wei, and Li-Rong Dai. 2013. I-vector representation based on bottleneck features for language identification. Electronics Letters 49(24):1569–1570.

Ngoc Thang Vu, Dau-Cheng Lyu, Jochen Weiner, Dominic Telaar, Tim Schlippe, Fabian Blaicher, Eng-Siong Chng, Tanja Schultz, and Haizhou Li. 2012. A first speech recognition system for Mandarin-English code-switch conversational speech. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pages 4889–4892.

Yun Wang and Florian Metze. 2017. A transfer learning based feature extractor for polyphonic sound event detection using connectionist temporal classification. Proceedings of Interspeech, ISCA, pages 3097–3101.

Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, and Rif A. Saurous. 2018. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. arXiv preprint arXiv:1803.09017.

Ching Feng Yeh, Chao Yu Huang, Liang Che Sun, and Lin Shan Lee. 2010. An integrated framework for transcribing Mandarin-English code-mixed lectures with improved acoustic and language modeling. In Chinese Spoken Language Processing (ISCSLP), 2010 7th International Symposium on. IEEE, pages 214–219.

Ching-Feng Yeh and Lin-Shan Lee. 2015. An improved framework for recognizing highly imbalanced bilingual code-switched lectures with cross-language acoustic modeling and frame-level language identification. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23(7):1144–1159.
