Homophone Identification and Merging for Code-switched Speech Recognition

Brij Mohan Lal Srivastava and Sunayana Sitaram

Microsoft Research India
[email protected], [email protected]

Abstract

Code-switching or mixing is the use of multiple languages in a single utterance or conversation. Borrowing occurs when a word from a foreign language becomes part of the vocabulary of a language. In multilingual societies, switching/mixing and borrowing are not always clearly distinguishable. Due to this, the transcription of code-switched and borrowed words is often not standardized, which leads to the presence of homophones in the training data. In this work, we automatically identify and disambiguate homophones in code-switched data to improve recognition of code-switched speech. We use a WX-based common pronunciation scheme for both languages being mixed and unify the homophones during training, which results in a lower word error rate for systems built using this data. We also extend this framework to propose a metric for code-switched speech recognition that takes into account homophones in both languages while calculating WER, which can help provide a more accurate picture of the errors the ASR system makes on code-switched speech.

Index Terms: speech recognition, code-switching, multilinguality, pronunciation, homophones

1. Introduction

Code-switching or mixing is the use of more than one language in a single utterance or conversation, and is common in multilingual communities all over the world. Automatic Speech Recognition (ASR) systems that can handle code-switching need to be trained with appropriate code-switched data that is transcribed in the two (or more) languages being switched. In many cases, the scripts that the two languages use are different, while in other cases, code-switching may occur in language pairs that have similar or the same script, such as English and Spanish. In case the two languages use different scripts, a transcriber may choose to use either language's script while transcribing code-switched speech.

Lexical borrowing occurs when a word from a foreign language becomes part of the vocabulary of a language due to language contact. In some language pairs, the difference between code-switching, mixing and borrowing is often not very clear, and mixing and borrowing can be thought of as a continuum [1]. Due to this, transcription of code-switched speech may not always be standardized, leading to the same word being transcribed using both scripts. This may lead to less data per word for building acoustic models, and more lexical variants for the language model. Ultimately, this may influence the accuracy of ASR systems built with such data.

A related phenomenon exists in monolingual scenarios in the form of homophones in a single language (e.g. "meat" and "meet"). Typically, a language model is able to disambiguate homophones based on context and pick the correct variant during decoding. Human transcribers are also able to disambiguate such words using context. Additional homophones can occur when languages are mixed, in which case words having the same phonetic realization exist in both languages but have different meanings (e.g. "come" in English and "कम", meaning "less", in Hindi). A strong language model built using a large amount of code-switched text should be able to disambiguate such homophones based on context.

In this work, we focus on homophones that are created due to transcription choices and errors. We carry out experiments on conversational Hindi-English code-switched speech and use a common pronunciation scheme to automatically collapse homophones to decrease the Word Error Rate (WER) of the ASR. We also propose a modification to the WER that takes into account potential homophones, which may help give a more accurate picture of the errors made by a code-switched ASR system. Our contributions are as follows:

1. We identify that a large number of homophones are created due to transcription choices and errors in code-switched databases.
2. We propose using a common pronunciation scheme to automatically identify and merge potential homophones.
3. We propose a modification to the traditional WER metric to take homophones into account, to obtain a better description of code-switched ASR performance.

2. Relation to Prior Work

Ali et al. propose alternative word error rates based on multiple crowd-sourced references [2, 3] to evaluate ASR for dialectal speech, where there is no gold reference transcription due to the non-standard orthography of the language.

Jyothi et al. [4] describe a technique to use transcripts created by non-native speakers for training an ASR system. They model the noise in the transcripts to account for the biases of non-native transcribers, and estimate the information lost as a function of how many transcriptions exist for an utterance. In an isolated word transcription task for Hindi, they recover 85% of correct transcriptions. Our problem differs from this formulation in that the transcribers are assumed to be bilinguals who know the two languages being mixed, so the expected noise is much smaller and is usually in the form of spelling variants and cross-script transcription.

In this work, we use the WX notation for mapping the pronunciation of English and Hindi words to a common pronunciation scheme [5], similar to [6]. The Unitran mapping is a similar scheme that maps all Unicode characters to a phoneme in the X-SAMPA phoneset [7]. The GlobalPhone project proposed by Schultz et al. [8] attempts to unify phonetic units based on their articulatory similarity shared across 12 languages. Both Unitran and GlobalPhone can be used in place of WX in our proposed framework.

3. Data

The dataset used in this work contains conversational speech recordings spoken in code-switched Hindi and English. Hindi-English bilinguals were given a topic and asked to have a conversation with another bilingual. They were not explicitly asked to code-switch during recording, but around 40% of the data contains at least one English word per utterance. The conversations were transcribed by Hindi-English bilinguals, who transcribed Hindi words in the Devanagari script and English words in the Roman script. No distinction was made between borrowed words and code-switching, which led to some inconsistencies in transcription. Hindi, like other Indian languages, is usually Romanized on social media, in user-generated content and in casual communication, which could have made transcription of Hindi words in Devanagari even more difficult for the transcribers.

The code-switched dataset contains 51158 utterances spoken by 535 different speakers. Hindi words made up 85% of the data, making it the predominant language in this corpus. However, 40% of the total words in the vocabulary were in English. The distribution of Hindi and English words and vocabulary was similar in the train, test and dev sets. A summary of the code-switched dataset is shown in Table 1.

Table 1: Hindi-English code-switched data

Data  | Utts  | Spkrs | Hrs | Total Words | En (%) | Unique Words | Unique En Words (%)
Train | 41276 | 429   | 46  | 560893      | 16.6   | 18900        | 40.23
Test  | 5193  | 53    | 5.6 | 69177       | 16.5   | 6000         | 41.01
Dev   | 4689  | 53    | 5.7 | 68995       | 16.05  | 6051         | 40.04

4. Homophone merging

Errors and ambiguities in the transcription of code-switched speech occur because bilinguals have access to two transcription schemes; these inflate the word error rate of a code-switched ASR system and complicate its evaluation. One approach to solving this problem would be to come up with more standardized schemes for the transcription of code-switching; however, this is difficult due to the fuzziness of borrowing vs. mixing. Enforcing a standard may be even harder if we rely on relatively untrained transcribers such as crowd-workers to provide transcriptions for audio. Instead, in this work, we focus on approaches that can automatically smooth out these irregularities by identifying them across the corpus. Due to the nature of our proposed approach, we not only identify homophones, but also spelling variants and errors.

4.1. Common pronunciation scheme

In order to identify words in different languages with similar pronunciation, they must be decoded using a common pronunciation scheme. We choose the WX notation as the pronunciation scheme, since Hindi words written in Devanagari can be directly transcribed into WX, and the pronunciation of a word can then be obtained using a simple rule-based method. To get the pronunciation of a Devanagari word, it is first converted to its WX representation using wxILP¹. Then, each character is separated, and special characters like the nukta and anunasika are processed to either ignore the character or make it part of the phoneme.

To get the pronunciation of English words written in Roman characters, we train a sequence-to-sequence neural network using CMUDict as training data, which takes a sequence of Roman characters as input and produces a sequence of CMUDict phonemes. We used Microsoft CNTK [9] to build the sequence-to-sequence recurrent network with Long Short-Term Memory (LSTM) cells, along with attention on the hidden layers. The CMUDict phone sequence obtained by applying this model is then converted to WX using a mapping that we created. If a word-final schwa (ə) exists, it is deleted.

¹ https://github.com/irshadbhat/litcm/blob/master/litcm/wxILP.py

4.2. Approach

Result: Lexicon, Rmap
EN = {valid English words with frequency};
HI = {valid Hindi words with frequency};
V = {vocab to be merged};
PMAP ← {} (pronunciation map);
for each v_i in V do
    PMAP[pron(v_i)] ← PMAP[pron(v_i)] ∪ {v_i};
end
for each wordlist in PMAP do
    if len(wordlist) == 1 then
        Lexicon ← (w_i, pron(w_i));
    else
        select w_i amongst wordlist with the highest frequency in EN or HI;
        Lexicon ← (w_i, pron(w_i));
        anchor = w_i;
        Rmap[anchor] ← {wordlist − anchor};
    end
end

Algorithm 1: Homophone merging algorithm, which returns the compact lexicon and Rmap. Rmap contains potential suggestions for words that can be replaced (replacees) by an anchor word (replacer). Each group of candidates is searched for an anchor word, which replaces the others in the group based on its frequency in the corpus.
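The merging step above can be sketched in a few lines of Python. The pronunciation strings and corpus frequencies below are hypothetical toy values standing in for the WX lexicon, and the single frequency table collapses the separate EN/HI lists of Algorithm 1 for brevity:

```python
from collections import defaultdict

def merge_homophones(vocab, pron, freq):
    """Sketch of Algorithm 1: group words sharing one pronunciation and
    pick the most frequent word in each group as the anchor (replacer)."""
    pmap = defaultdict(list)              # pronunciation -> candidate words
    for v in vocab:
        pmap[pron(v)].append(v)

    lexicon, rmap = {}, {}
    for p, wordlist in pmap.items():
        # the anchor is the candidate with the highest corpus frequency
        anchor = max(wordlist, key=lambda w: freq.get(w, 0))
        lexicon[anchor] = p
        if len(wordlist) > 1:
            # every other candidate becomes a replacee of this anchor
            rmap[anchor] = [w for w in wordlist if w != anchor]
    return lexicon, rmap

# Toy WX-like pronunciations and frequencies, purely illustrative:
pron = {"benifit": "benePit", "benefit": "benePit",
        "room": "rUm", "रूम": "rUm", "service": "sarvis"}
freq = {"benifit": 3, "benefit": 40, "room": 12, "रूम": 55, "service": 20}

lexicon, rmap = merge_homophones(pron, pron.get, freq)
print(rmap)   # {'benefit': ['benifit'], 'रूम': ['room']}
```

In Algorithm 1 proper, the valid-word lists EN and HI additionally gate which candidates may serve as anchors; the sketch only keeps the frequency-based anchor selection.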
In our corpus we have around 19k unique words, out of which 7k are written in the Roman script and 12k in Devanagari. We found 442 candidate words which could be merged with anchor words. Anchor words (replacer in Table 2) are defined as the words selected to replace candidate homophones (replacee in Table 2). The frequency of these 442 words is around 1% of that of the words in the entire corpus, and we observe that they are replaced by more frequent words whose frequency is around 9% of that of the words in the entire corpus. We also observe that 288/442 replacements are either English-to-English or Hindi-to-Hindi, which indicates misspellings or alternate spellings, while 154/442 replacements are either English-to-Hindi or Hindi-to-English, which corresponds to borrowed words or cross-transcriptions. Table 2 presents some of the instances identified in our corpus using Algorithm 1.

Table 2: Instances of word-pairs identified for homophone merging (Replacee → Replacer)

Misspellings/Alternate spelling: benifit → benefit; x-boyfriend → ex-boyfriend; compair → compare; suprise → surprise; हौक → हॉक; डॉक → डॉक्; ऑर → और; थाळी → थाली
Cross-transcription/Borrowing: टफ → tough; ब्रेकप → breakup; skin → स्किन; सैलरी → salary; saavan → सावन; rohu → रोहू; mochi → मोची; biriyani → बिरयानी
Misidentified instances: maal → male; केम → chem; oks → ox; uae → you; lic → लिक; eg → egg; umar → अमर; jonny → जानी; tt → टीटी; colours → colour's

Algorithm 1 accepts a vocabulary of words to be merged across the languages and lists of possibly valid words in each language, and yields a Lexicon of final merged words together with a key-value map Rmap, whose key is the anchor word that will replace all the words in the corresponding value list. First, PMAP aggregates all the homophones according to the common pronunciation scheme. Then, for each pronunciation, we go through the word list and disambiguate the words as actual homophones or different words. Finally, we select an anchor word which will replace all the selected homophone variants, and compose Rmap using this information.

We trained ASR systems using the Kaldi toolkit [10] with a phoneset of 89 phones: 39 English and 50 Hindi phones. We used the CMU Pronouncing Dictionary [11] as the lexicon for all the English words in the data. To generate pronunciations for the Hindi words, we used the Festvox Indic front end [12]. We used a word-level trigram language model built on the training transcripts of the code-switched data for all experiments. After merging all such instances (WER comb. in Table 3), we observe around 1.2% relative improvement in WER compared to the baseline (Baseline comb. in Table 3).

We conducted another experiment to ascertain that homophone merging leads to better ASR performance by reducing language model perplexity. We trained an RNNLM and an SRILM model on the training transcripts before applying homophone merging, and tested them on the test transcriptions. We repeated this after applying homophone merging and observed a relative improvement of 0.5% in the perplexity of the LMs trained on the merged data compared to the LMs trained on the original transcriptions.

The errors made by the homophone merging algorithm can be attributed to the following. Firstly, the monolingual vocabularies contain most of the valid words in both languages, so lexical features such as inflections are not modeled and can introduce noise into the pronunciations. Secondly, the mapping between ARPAbet and WX takes into consideration the sound change that occurs when English words are pronounced by Hindi speakers; this can lead to errors, since it assumes that the mapping is an accurate manifestation of the sound change [13].

To overcome these issues, the algorithm must take into account the orthographic and phonetic features of each language and build a robust model to disambiguate homophones. In future work, we plan to combine character-level language models with the existing model to make homophone merging more resilient to the errors mentioned above.
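The perplexity experiment above requires rewriting the training transcripts with Rmap before rebuilding the LM. This reduces to token substitution; a minimal sketch, where `invert_rmap` and the sample sentence are our own illustrative additions:

```python
def invert_rmap(rmap):
    """Rmap maps an anchor (replacer) to its replacees; invert it so that
    each replacee looks up its anchor directly."""
    return {replacee: anchor
            for anchor, replacees in rmap.items()
            for replacee in replacees}

def normalize(tokens, table):
    # replace every replacee by its anchor, leave all other tokens alone
    return [table.get(t, t) for t in tokens]

rmap = {"surprise": ["suprise"], "benefit": ["benifit"]}   # hypothetical entries
table = invert_rmap(rmap)
print(normalize("what a suprise benifit".split(), table))
# ['what', 'a', 'surprise', 'benefit']
```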
5. Pronunciation-optimized Word Error Rate (poWER)

Conventional word error rate is not sufficient for measuring the performance of code-mixed acoustic models, due to cross-transcription, misspellings and borrowing of words. Therefore, we extend the common pronunciation scheme developed in Section 4 to propose a metric which smooths out these inconsistencies and lets us do a more accurate error analysis of ASR performance by precluding the inflation of WER caused by such irregularities.

Table 3: WER and poWER for all systems for two different phonesets. The WX phoneset used for acoustic modeling contains 45 phonemes, as per the mapping described in Section 4. The combined (comb.) phoneset contains 89 phones aggregated from English and Hindi.

Models                | Baseline comb. | WER WX | poWER WX | WER comb. | poWER comb.
GMM                   | 52.63          | 40.21  | 37.73    | 52.42     | 48.72
NNET (DBN Pretrained) | 45.42          | 33.78  | 31.80    | 44.88     | 41.22
TDNN                  | 42.15          | 31.78  | 29.97    | 41.77     | 38.73
TDNN Discriminative   | 40.26          | 29.80  | 28.03    | 39.86     | 36.92

We call this metric the pronunciation-optimized word error rate (poWER). It is defined as the Levenshtein distance D between the pronunciation-optimized hypothesis (H) and reference (R) sentences, normalized by the number of words N in the reference sentence (Equation 1):

    poWER = D(po(H), po(R)) / N                       (1)
    po(S) = g2p(w_1^S), SIL, ..., SIL, g2p(w_N^S)     (2)
    S = w_1^S, w_2^S, ..., w_N^S                      (3)

Here, the pronunciation optimization po(.) of a sentence S is defined as the grapheme-to-phoneme (g2p) conversion of each word in S, as per the rules established in Section 4, with a SIL token inserted between words to demarcate word boundaries.
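Equations 1-3 can be computed directly once a g2p function is fixed. A self-contained sketch follows; the toy g2p table is hypothetical, standing in for the WX pipeline of Section 4, and deliberately maps the borrowed pair to identical phones:

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance (dynamic programming, rolling row)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution/match
    return dp[-1]

def po(sentence, g2p):
    """Equation 2: g2p-convert each word, separating words with a SIL token."""
    phones = []
    for w in sentence:
        if phones:
            phones.append("SIL")
        phones.extend(g2p(w))
    return phones

def power(hyp, ref, g2p):
    """Equation 1: Levenshtein distance between pronunciation-optimized
    sequences, normalized by the number of reference words."""
    return edit_distance(po(hyp, g2p), po(ref, g2p)) / len(ref)

# Hypothetical g2p table in which the borrowed pair shares one pronunciation:
g2p_table = {"रूम": ["r", "U", "m"], "room": ["r", "U", "m"],
             "service": ["s", "a", "r", "v", "i", "s"]}
ref, hyp = ["रूम", "service"], ["room", "service"]
print(power(hyp, ref, g2p_table.__getitem__))        # 0.0
print(edit_distance(hyp, ref) / len(ref))            # conventional WER: 0.5
```

On this two-word toy pair the conventional WER is 50% while the poWER is 0; the paper's five-word example below shows the same contrast as 20% vs. 0%.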

For example, consider the reference-hypothesis pair:

    Ref: रूम service आपको कैसी लगी
    Hyp: room service आपको कैसी लगी

The WER for this pair is 20%, whereas the poWER is 0%, because "रूम" is a borrowed version of "room" and the two are often cross-transcribed.

We build four ASR models using the WX phoneset and the combined phoneset to compare conventional WER to poWER. WX as a phoneset performs better across all models than the combined phoneset, because it minimally models all the pronunciation variants in English and Hindi spoken by a native bilingual speaker. We build a standard GMM-HMM model trained using feature-space maximum likelihood linear regression (fMLLR) features. The alignments obtained from this model, along with the fMLLR features, are used to train a feed-forward neural network (NNET-DBN Pretrained), pre-trained using a stack of Restricted Boltzmann Machines (RBMs) forming a Deep Belief Network (DBN). In addition, we train a chain time-delay neural network (TDNN) [14] model, and perform discriminative training (TDNN Discriminative) on the TDNN system.

5.1. WER vs poWER comparison

The WER and poWER of each of the above systems are listed in Table 3. We observe that the poWER is lower than the WER in all cases, and that the difference becomes smaller as the acoustic model gets better. This might be because lower phone error rates result in relatively better word prediction accuracy.

Next, we compare WER and poWER for various types of code-switching found in the corpus. We aggregate the WER and poWER of utterances falling in different ranges of the code-mixing index (CMI) [15], defined as a linear combination of the fraction of embedded words (in English) in the matrix language (Hindi) and the number of language alternation points. A higher CMI implies frequent code-switching points and larger phrases of embedded language within the matrix language. It can be compared across utterances of varying lengths, since it is normalized by utterance length.

Figure 1: WER vs poWER comparison aggregated over CMI bins of size 5. The results were computed on the output of our best performing acoustic model (TDNN Discriminative). The CMI range 40-45 has no test samples.

From Figure 1 we can see that the difference between WER and poWER is highest when the CMI is either very low or very high. We also see that the WER is highest in these ranges. Utterances with low CMI contain sporadically embedded words in a matrix-language sentence. Such instances are hard to model with a code-switched language model due to the lack of context at switch-points; the predicted words might be among the irregularities mentioned previously, hence the high WER. Such irregularities may be normalized using poWER, yielding a more authentic error rate that reflects a better transcript. Among the other CMI buckets, poWER is consistently better than the WER. Both poWER and WER are high in the highest CMI range, which indicates utterances with a high number of switch points and an almost equal fraction of both languages, i.e., highly code-mixed utterances. We observe a steady rise in WER as the CMI of sentences increases. This follows intuitively from the fact that sentences with frequent code-mixing are more difficult to recognize than those with less frequent code-mixing.

6. Conclusion

In this paper, we introduce the problem of homophone identification and merging for code-switched ASR. We propose a technique to merge homophones in Hindi-English conversational speech. Systems with homophone merging have 1.2% lower relative WER compared to the baseline without merging. Our technique correctly identifies cross-transcribed words, borrowed words, spelling variants and errors. We also propose a modified WER metric called poWER, built on the technique provided by homophone merging, which takes these irregularities into account to provide a more accurate picture of system performance. We compare WER and poWER across different acoustic models and levels of code-switching in the corpus, and find that poWER can reduce the inflation of WER caused by the complexity of code-switching within utterances, across various types of code-switching.

7. References

[1] J. Sharma, K. Bali, M. Choudhury, and Y. Vyas, ""I am borrowing ya mixing?" An Analysis of English-Hindi Code Mixing in Facebook," EMNLP 2014, p. 116, 2014.
[2] A. Ali, W. Magdy, and S. Renals, "Multi-Reference Evaluation for Dialectal Speech Recognition System: A Study for Egyptian ASR," in Proceedings of the Second Workshop on Arabic Natural Language Processing, 2015, pp. 118-126.
[3] A. Ali, P. Nakov, P. Bell, and S. Renals, "WERd: Using social text spelling variants for evaluating dialectal speech recognition," arXiv preprint arXiv:1709.07484, 2017.
[4] P. Jyothi and M. Hasegawa-Johnson, "Acquiring Speech Transcriptions Using Mismatched Crowdsourcing," in AAAI, 2015, pp. 1263-1269.
[5] R. Gupta, P. Goyal, and S. Diwakar, "Transliteration among Indian Languages using WX Notation," in KONVENS, 2010, pp. 147-150.
[6] A. Pandey, B. M. L. Srivastava, and S. V. Gangashetty, "Adapting monolingual resources for code-mixed Hindi-English speech recognition," in 2017 International Conference on Asian Language Processing (IALP). IEEE, 2017, pp. 218-221.
[7] T. Qian, K. Hollingshead, S.-y. Yoon, K.-y. Kim, and R. Sproat, "A Python Toolkit for Universal Transliteration," in LREC, Malta, 2010.
[8] T. Schultz and A. Waibel, "Language-independent and language-adaptive acoustic modeling for speech recognition," Speech Communication, vol. 35, no. 1-2, pp. 31-51, 2001.
[9] F. Seide and A. Agarwal, "CNTK: Microsoft's Open-Source Deep-Learning Toolkit," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 2135-2135.
[10] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[11] R. L. Weide, "The CMU pronouncing dictionary," URL: http://www.speech.cs.cmu.edu/cgi-bin/cmudict, 1998.
[12] A. Parlikar, S. Sitaram, A. Wilkinson, and A. W. Black, "The Festvox Indic frontend for grapheme to phoneme conversion," in WILDRE: Workshop on Indian Language Data, Resources and Evaluation, 2016.
[13] J. G. Heath, "Language contact and language change," Annual Review of Anthropology, vol. 13, no. 1, pp. 367-384, 1984.
[14] V. Peddinti, D. Povey, and S. Khudanpur, "A Time Delay Neural Network Architecture for Efficient Modeling of Long Temporal Contexts," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[15] B. Gambäck and A. Das, "Comparing the Level of Code-Switching in Corpora," in LREC, 2016.