Using Alternate Representations of Text for Natural Language Understanding

Venkata Sai Varada, Charith Peris, Yangsook Park, Christopher DiPersio
Amazon Alexa, USA
{vnk, perisc, yangsoop, dipersio}@amazon.com

Abstract

One of the core components of voice assistants is the Natural Language Understanding (NLU) model. Its ability to accurately classify the user's request (or "intent") and recognize named entities in an utterance is pivotal to the success of these assistants. NLU models can be challenged in some languages by code-switching or morphological and orthographic variations. This work explores the possibility of improving the accuracy of NLU models for Indic languages via the use of alternate representations of input text for NLU, specifically ISO-15919 and IndicSOUNDEX, a custom SOUNDEX designed to work for Indic languages. We used a deep neural network based model to incorporate the information from alternate representations into the NLU model. We show that using alternate representations significantly improves the overall performance of NLU models when the amount of training data is limited.

Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 1-10, July 9, 2020. © 2020 Association for Computational Linguistics

1 Introduction

Recent times have seen a significant surge of interest in voice-enabled smart assistants, such as Amazon's Alexa, Google Assistant, and Apple's Siri. These assistants are powered by several components, which include Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) models. The input to an NLU model is the text returned by the ASR model. With this input text, there are two major tasks for an NLU model: 1) Intent Classification (IC), and 2) Named Entity Recognition (NER).

Code-switching is a phenomenon in which two or more languages are interchanged within a sentence or between sentences. An example of code-switching from Hindi is 'मेरी shopping list में dove soap bar जोड़' ('add dove soap bar to my shopping list'). NLU models expect ASR to return the text in the above form, which matches the transcription of the training data for NLU models. The NLU model would then return the action Add Items to Shopping List, while also recognizing 'dove soap bar' as the actual item name to be added to the shopping list. However, ASR could inconsistently return 'मेरी shopping list में dove soap बार जोड़', where the English word 'bar' is recognized as the Hindi word 'बार'. Note that while the Hindi word 'बार' means something different than the English 'bar', their pronunciations are similar enough to mislead the ASR model in the context of a mixed-language utterance. In this case, the NLU model should become robust against such ASR mistakes, and learn that 'dove soap बार' is equivalent to 'dove soap bar' in order to correctly recognize it as an item name. The phenomenon of code-switching is common amongst other Indic languages as well.

Building NLU models can be more challenging for languages that involve code-switching. ASR inconsistencies can cause significant challenges for NLU models by introducing new data that the models were not trained on. Such inconsistencies can occur due to 1) code-switching in the data, especially when the input text contains more than one script (character set), 2) orthographic or morphological variations that exist in the input text, or 3) requirements for multilingual support. In a language such as English in the United States, where code-switching is less common, both ASR and NLU models can perform quite well, as input data tend to be more consistent. However, when it comes to Indic languages such as Hindi, where code-switching is much more common, the question arises as to which representation of the input text would work best for NLU models.

There is ongoing research on solving modeling tasks for data with code-switching. In Geetha et al. (2018), the authors explore using character- and word-level embeddings for NER tasks. Some of this research focuses on using alternate representations of text for NLU modeling. For example, Johnson et al. (2019) explore the use of cross-lingual transfer learning from English to Japanese for NER by romanizing Japanese input into a Latin-based character set to overcome the script differences between the language pair. Zhang and LeCun (2017) have shown that using romanized text for Japanese sentiment classification hurt the performance of the monolingual model.

In this paper, we explore the possibility of mitigating the problems related to ASR inconsistency and code-switching in our input data by using two alternate representations of text in our NLU model: ISO-15919 and IndicSOUNDEX. ISO-15919 [1] was developed as a standardized Latin-based representation for Indic languages and scripts. SOUNDEX is an algorithm that provides phonetic-like representations of words. Most work on SOUNDEX algorithms has focused primarily on monolingual solutions. One of the best known implementations with a multilingual component is Beider-Morse Phonetic Matching (Beider and Morse, 2010); however, it only identifies the language in which a given word is written in order to choose which pronunciation rules to apply. Other attempts at multilingual SOUNDEX algorithms, particularly for Indic languages, were smaller studies focused on two Indic languages, with or without English as a third language. Chaware and Rao (2012) developed a custom SOUNDEX algorithm for monolingual Hindi and Marathi word-pair matching. Shah and Singh (2014) describe a multilingual SOUNDEX implementation designed to cover Hindi, Gujarati, and English, which, in addition to the algorithm itself, was aided by a matching threshold declaring two conversions a match even if they differed in up to two characters.

In this study, we use ISO-15919 and IndicSOUNDEX representations of text in a deep neural network (DNN) to perform multi-task modeling of IC and NER. We experiment on one high-resource Indic language (Hindi) and three low-resource Indic languages (Tamil, Marathi, Telugu). In Section 2, we describe the two alternate representations of text that we explore. In Section 3, we describe our data and model architecture, and detail our experimental setup. In Section 4, we present our results, followed by our conclusions in Section 5.

[1] https://en.wikipedia.org/wiki/ISO_15919

2 Alternate representations of text

Using data transcribed in the original script of a language can cause problems for both monolingual and multilingual NLU models. For a monolingual model in a language where code-switching, orthographic variations, or rich morphological inflections are common, NLU models may not be able to perform well on all the variations, depending on the frequency of these variations in the training data. For multilingual models, words with similar sound and meaning across different languages (e.g., loanwords, common entities) cannot be captured if the words are written in their original scripts. For example, the same proper noun 'telugu' is written as 'तेलुगु' in Hindi, as 'தெலுங்கு' in Tamil, and as 'తెలుగు' in Telugu. Although they sound similar and mean the same thing, the NLU model will see them as unrelated tokens if they are represented in their original scripts in the input data.

From the NLU point of view, a text representation that can minimize variations of the same or similar words within a language and across different languages would be beneficial for both IC and NER tasks. In this section, we explore two alternative ways of representing text for Indic languages: ISO-15919 and a SOUNDEX-based algorithm, which we call IndicSOUNDEX. Compared to using the original scripts, these two alternatives can represent the variants of the same word or root in the same way. For example, in the original Hindi script (i.e., Devanagari), the word for 'volume/voice' can be represented in two forms: 'आवाज़' and 'आवाज'. These two forms, however, are uniformly represented as 'āvāj' in ISO-15919 and as the string 'av6' in IndicSOUNDEX. Similarly, the English word 'list' may be transcribed as 'list' in Latin script or written in Telugu script; however, both map to the same IndicSOUNDEX representation, 'ls8'.
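As a rough illustration of the kind of collapsing these representations perform, the Python sketch below strips combining marks such as the Devanagari nukta, so spelling variants like 'आवाज़' and 'आवाज' receive the same key. The rule is invented for illustration; it is not the actual IndicSOUNDEX algorithm, which relies on full per-script mapping tables.

```python
import unicodedata

def collapse_variants(word: str) -> str:
    # Decompose precomposed characters (NFD) so that diacritics such as the
    # Devanagari nukta (U+093C) become separate combining marks, then drop
    # all combining marks. A toy normalizer only, not the real IndicSOUNDEX.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# The two attested spellings of Hindi 'volume/voice' collapse to one key,
# while already-plain tokens are left unchanged.
assert collapse_variants("आवाज़") == collapse_variants("आवाज")
assert collapse_variants("list") == "list"
```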

2.1 ISO-15919

ISO-15919 is a standardized representation of Indic scripts based on Latin characters, which maps the corresponding Indic characters onto the same Latin character. For example, 'क' in Devanagari, 'క' in Telugu, and 'க' in Tamil are all represented by 'k' in the ISO-15919 standard. Consequently, ISO-15919 can be used as a neutral common representation between Indic scripts. However, ISO-15919 has some downsides. Indic scripts often rely on implicit vowels which are not represented in the orthography, which means they cannot be reliably added to a transliterated word. Additionally, Indic scripts have a character called a halant, or virama, which is used to suppress an inherent vowel. This character, although usually included orthographically in a word, does not have an ISO-15919 representation and so is lost in an exact, one-to-one conversion. Finally, it is not always the case that the same word in two different Indic scripts will have the same ISO-15919 conversion, due to script-to-script and language-to-language differences and irregularities. Table 1 shows some examples of conversion using ISO-15919.

2.2 IndicSOUNDEX

SOUNDEX algorithms provide phonetic-like representations of words that attempt to reduce spelling and other orthographic variations for the same or similarly-sounding words, mainly to increase recall in information retrieval or text search tasks. This is done by following a set of simple steps, which may include removing vowels, reducing character duplication, and mapping sets of characters to a single character based on whether they 1) sound similar or 2) are used in similar environments in similar-sounding words. For example, at its most extreme, American SOUNDEX maps 'c', 'g', 'j', 'k', 'q', 's', 'x', and 'z' to the same character: '2'.

IndicSOUNDEX is a custom SOUNDEX approach designed to work on Hindi, Marathi, Telugu, Tamil, Punjabi, Bengali, Kannada, Gujarati, and English. Table 1 shows some examples of conversion using IndicSOUNDEX.

Table 1: Examples of conversion using ISO-15919 and IndicSOUNDEX

Script                 Indic Original  ISO-15919   IndicSOUNDEX
Devanagari (Hindi)     इदके           indikē      i034
Devanagari (Marathi)   इतहासाचा       itihāsācā   i8hs2
Telugu                 పన             pḍindi      p303
Tamil                  பர             pirvīṇ      plv0

3 Experimental Setup

3.1 Data

For our experiments, we chose datasets from four Indic languages: Hindi, Tamil, Marathi, and Telugu. For Hindi, we use an internal large-scale real-world dataset; for the other three languages, we use relatively small datasets collected and annotated by third-party vendors. We perform a series of experiments to evaluate the use of the alternate representations of text, ISO-15919 and IndicSOUNDEX, described in Section 2.

Our Hindi training dataset consists of 6M data points. We separate a portion (~1%) of the data into an independent test set. We execute a stratified split on the remainder, based on intent, and choose 10% of this data for validation and the rest as training data. For the three smaller-scale datasets, we execute a stratified split with 60% for training, 20% for validation, and 20% for testing. Table 2 shows the data partitions across different languages used for our experiments.

Table 2: Data partition across languages

Language  Train  Test  Valid  Total
Hindi     5.4M   600K  54K    6M
Tamil     27K    9K    9K     45K
Marathi   27K    8.5K  8.5K   42K
Telugu    27K    9K    9K     46K

Each of the four datasets contains code-switching. The transcription was done either in the original script of the language (for words from that language) or in standard Latin script (for words from other languages, including English). However, the transcription was not always consistent, especially in the third-party data, so some Indic words were transcribed in so-called 'colloquial' Latin (i.e., a casual, non-standard way of representing Indic languages in online resources) and some English words were represented in the original script of the Indic languages (e.g., the English word 'list' written in Telugu script). See Table 3 for the total counts of tokens in each script in each of the training and test sets, which reflects the use of code-switching in each training dataset. Note that Hindi and Marathi both use the same script (Devanagari) in their writing systems.

3.2 Model architecture

For our NLU models, we used a multi-task modeling framework that predicts Intent and Named Entities, given an input sentence. A schematic of our model is given in Figure 1 and Figure 2.

Figure 1: Embedding for a token

Figure 2: Modeling architecture with tokens and alternate representation (IndicSOUNDEX) as input

For a given corpus, we built alternate text representations using the ISO-15919 and IndicSOUNDEX approaches described in Section 2. We used both the original sentence and the alternate representations of the sentence to inform our models via added embedding layers. First, we built a baseline model without any alternate representations, using the Baseline Embedding Block architecture below. Next, we built two candidate models: the first with an embedding layer using alternate representations from IndicSOUNDEX, and the second with alternate representations from ISO-15919. Our modeling architecture is designed as follows:

Baseline Embedding Block: For our baseline model, we designed our embedding block with two layers and concatenated the outputs. The first layer is a token embedding layer of dimension 100, while the second is a character-based token embedding built using convolutional neural network (CNN) subsequence embedding with a dimension of 16. The final embedding used in our models is generated by concatenating these two embeddings.

Embedding block with alternate representations: For our alternate representation models, we modified the baseline embedding block to accommodate these representations. All other components, including the encoder and decoder, stayed the same. The embedding block for this setting was modified to have four layers, with the final embedding used in our model being the concatenation of these. A schematic is shown in Figure 1. The four layers are as follows:

1. Token embedding layer of dimension 100.

2. Token embedding layer for alternate representations of tokens, of dimension 100.

3. Character embedding built using convolutional neural network (CNN) subsequence embedding with a dimension of 16.

4. Alternate-representation character embedding built using convolutional neural network (CNN) subsequence embedding with a dimension of 16.

Finally, we concatenated the outputs of the embedding layers to obtain the final word embedding.
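A minimal sketch of the four-layer embedding block, in plain Python. Random lookup tables stand in for the trained token-embedding and CNN character-embedding layers; only the dimensions (100 + 100 + 16 + 16 = 232) and the concatenation follow the description above, and the class and method names are our own.

```python
import random

class EmbeddingBlock:
    """Toy stand-in for the four-layer embedding block (not trained)."""
    TOKEN_DIM = 100  # token embedding and alternate-representation embedding
    CHAR_DIM = 16    # CNN character embedding for each representation

    def __init__(self):
        self._tables = {}

    def _lookup(self, key, dim):
        # One fixed random vector per (layer, symbol) pair, as an
        # initialized embedding table would provide.
        if key not in self._tables:
            rng = random.Random(key)
            self._tables[key] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        return self._tables[key]

    def embed_token(self, token, alt):
        # Concatenation of the four layers: token (100) + alternate
        # representation (100) + token characters (16) + alternate
        # representation characters (16) = 232 dimensions.
        return (self._lookup("tok:" + token, self.TOKEN_DIM)
                + self._lookup("alt:" + alt, self.TOKEN_DIM)
                + self._lookup("tok-char:" + token, self.CHAR_DIM)
                + self._lookup("alt-char:" + alt, self.CHAR_DIM))

block = EmbeddingBlock()
vec = block.embed_token("आवाज़", "av6")  # Hindi token plus its IndicSOUNDEX code
assert len(vec) == 232
```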

Table 3: The number of tokens in each script within each training and test set

            Hindi            Tamil           Marathi         Telugu
Script      Train    Test    Train   Test    Train   Test    Train   Test
Latin       13.9M    101K    22.5K   7.5K    29.1K   9.7K    40.7K   13.4K
Devanagari  15.5M    102K    0       0       110K    36K     0       0
Tamil       8        0       108K    36K     0       0       0       0
Telugu      0        0       0       0       0       0       112K    37.5K
Other       1K       0       2       0       0       0       0       0
Total       29.5M    203K    131K    43.5K   139K    46K     153K    51K
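Counts like those in Table 3 can be produced by bucketing tokens by Unicode script. The paper does not specify its exact counting procedure; this sketch simply assigns each token the script of its first alphabetic character, using the code-switched example from the introduction.

```python
import unicodedata

def script_of(token):
    # Rough per-token script bucketing via the Unicode name of the first
    # alphabetic character (a simplification for illustration only).
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            for script in ("DEVANAGARI", "TAMIL", "TELUGU"):
                if name.startswith(script):
                    return script.title()
            if name.startswith("LATIN"):
                return "Latin"
            return "Other"
    return "Other"

# Count scripts in the code-switched Hindi example from the introduction.
counts = {}
for token in "मेरी shopping list में dove soap बार".split():
    counts[script_of(token)] = counts.get(script_of(token), 0) + 1
assert counts == {"Devanagari": 3, "Latin": 4}
```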

Encoder: We defined a biLSTM (bidirectional Long Short-Term Memory) encoder with 5 hidden layers and a hidden dimension of 512. A schematic is given in Figure 2. In order to handle our multi-task prediction of Intent and NER, we have two output layers:

1. A multi-layer perceptron (MLP) classifier with the number of classes set to the vocabulary size of Intent.

2. A conditional random field (CRF) sequence labeler with the number of classes set to the vocabulary size of Named Entities.

Decoder: We used a joint task decoder for our purpose. The Intent-prediction task is accomplished by a single-class decoder, while the label-prediction task is achieved by a sequence labeling decoder. For this setup, the loss function is a multi-component loss, with cross-entropy loss for Intent and CRF loss (Lafferty et al., 2001) for NER.

Evaluation metric: We use Semantic Error Rate (SemER) (Su et al., 2018) to evaluate the performance of our models. The definition of SemER is as follows:

SemER = (D + I + S) / (C + D + S)    (1)

where D is the number of deletions, I insertions, S substitutions, and C correct slots. As the Intent is treated as a slot in this metric, an Intent error is considered a substitution.

Experimental Process: Our experimental setup consists of the following process. For each language, we build three models with the model architecture explained in the model architecture section:

• Baseline model - a baseline model with the architecture explained in the model architecture section, using word and character embeddings from tokens.

• IndicSOUNDEX model - a model with IndicSOUNDEX alternate representations, i.e., using an embedding layer with tokens, the IndicSOUNDEX representation of tokens, characters from tokens, and the IndicSOUNDEX representations of characters.

• ISO-15919 model - a model with ISO-15919 alternate representations, i.e., using an embedding layer with tokens, the ISO-15919 representation of tokens, characters from tokens, and the ISO-15919 representations of characters.

4 Results

To evaluate each model, we used our withheld test dataset and measured the SemER scores associated with the three different models. To obtain confidence intervals, we bootstrapped our test dataset by sampling ten times with replacement, and evaluated our models on each of these ten datasets. The final SemER was taken as the average over the ten iterations. We used a Student's t-test to calculate the significance of the improvements of the IndicSOUNDEX and ISO-15919 models with respect to the baseline. Here we present the results of our three models (the Baseline DNN model, the IndicSOUNDEX model, and the ISO-15919 model) on each of the four languages.
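For concreteness, equation (1) translates directly into code. The slot counts below are invented for illustration:

```python
def semer(correct, deletions, insertions, substitutions):
    # SemER = (D + I + S) / (C + D + S); an intent error counts as a
    # substitution, so a wrong intent is penalized like a wrong slot.
    return (deletions + insertions + substitutions) / (correct + deletions + substitutions)

# Hypothetical utterance set: 8 correct slots, 1 deleted, 1 substituted.
score = semer(correct=8, deletions=1, insertions=0, substitutions=1)
assert abs(score - 0.2) < 1e-9
```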

Table 4: Comparison of ISO-15919 and IndicSOUNDEX model performance on Hindi w.r.t. baseline. '+' indicates degradation, '-' indicates improvement. '*' denotes results that are not statistically significant

Language  Model         Topics improved  Average improvement  Topics degraded  Average degradation  Overall change
Hindi     IndicSOUNDEX  3                8%                   1                5%                   -0.07%*
Hindi     ISO-15919     3                4%                   4                9%                   +1.08%

4.1 Results for a high-resource language - Hindi

Our Hindi data consisted of data from 22 topics, including Music, Reminders, Alarms, Weather, Search, News, and so on. See the Appendix for the full list of topics. Table 4 shows the performance on Hindi. Results revealed that, out of the 22 topics, the use of IndicSOUNDEX resulted in 3 topics showing improvement in SemER, with an average improvement of 8%, and 1 topic showing a 5.36% degradation. The use of ISO-15919 resulted in 3 topics showing an average improvement of 4% and 4 topics showing an average degradation of 9%. The rest of the topics showed results that are not statistically significant. We note that two topics showed improvement across both models: SmartHome and Global.

Our results show that there is no overall (i.e., across all topics together) statistically significant improvement for the IndicSOUNDEX model. However, we note that the improvements in three topics (Global, SmartHome, and ToDoLists) are significant. The one topic that showed a degradation was Shopping. On the other hand, the ISO-15919 model shows an overall 1.08% degradation that is statistically significant. The ISO-15919 model shows statistically significant improvements in the Global, Video, and SmartHome topics, and degradation in Weather, Music, QuestionAndAnswer, and Media.

In summary, for a high-resource language such as Hindi, we find that neither IndicSOUNDEX nor ISO-15919 shows an overall significant improvement. However, there are certain topics that could benefit from using these alternate representations, either for IC or NER. Note that the majority of the training data for Hindi were transcribed by well-trained internal human transcribers and went through some cleaning processes for common transcription errors. Also, given the size of the training data, the NLU models were well trained on the various spelling variations represented in the original script. Owing to this relatively high consistency in transcription and the existence of various tokens with similar meaning and sound in the training data, we believe that using the alternate representations of text was not effective for improving the performance of the NLU model.

4.2 Results for low-resource languages - Tamil, Marathi, and Telugu

Unlike the case of Hindi, we see much more significant overall improvements where training data are sparse. All three languages showed significant improvement in overall performance for the ISO-15919 model, whereas IndicSOUNDEX showed significant improvement for Tamil and Marathi. Within Tamil and Marathi, IndicSOUNDEX showed a larger improvement than ISO-15919.

Our data for the low-resource languages consisted of 19 different topics. Table 5 shows the performance for each language. At the topic level for Tamil, we found that using the IndicSOUNDEX model improved the performance in 15 out of 19 topics, with an average SemER drop of 11%. With the ISO-15919 representation, 13 out of 19 topics showed improvement, with an average SemER drop of 13%. For Marathi, IndicSOUNDEX improved 7 topics with an average drop in SemER of 18%, whereas ISO-15919 improved 9 topics but with a lower average drop in SemER (~10%). For Telugu, using IndicSOUNDEX or ISO-15919 improved the performance of 4 topics each, with the same average drop in SemER of 15%. Two topics showed improvement across all three languages with the IndicSOUNDEX model: MovieTimes and Media. Furthermore, the Calendar and Music topics showed significant improvement across all three languages with the ISO-15919 model.

Table 5: Comparison of ISO-15919 and IndicSOUNDEX model performance w.r.t. baseline on low-resource languages. '+' indicates degradation, '-' indicates improvement. '*' denotes results that are not statistically significant

Language  Model         Topics improved  Average improvement  Topics degraded  Average degradation  Overall change
Tamil     IndicSOUNDEX  15               11%                  1                7%                   -7.6%
Tamil     ISO-15919     13               13%                  2                2%                   -6.2%
Marathi   IndicSOUNDEX  7                18%                  6                8%                   -2.30%
Marathi   ISO-15919     9                10%                  6                11%                  -1.50%
Telugu    IndicSOUNDEX  4                15%                  5                7%                   -0.20%*
Telugu    ISO-15919     4                15%                  4                8%                   -2.4%

Table 6: Percentage change in SemER for candidate models w.r.t. baseline model across languages. '+' indicates degradation, '-' indicates improvement

Language  % Change in IndicSOUNDEX  % Change in ISO-15919
Hindi     -0.07%                    +1.08%
Tamil     -7.6%                     -6.2%
Marathi   -2.3%                     -1.5%
Telugu    -0.2%                     -2.4%

In Table 6, we provide the relative change in performance w.r.t. baseline for all the models across the high- and low-resource languages.

5 Conclusion

In this work, we explored the effect of using alternate representations of text for IC and NER models on a high-resource Indic language (Hindi) and three low-resource Indic languages (Tamil, Telugu, Marathi). We adopted a neural network based model architecture in which the alternate representations are incorporated in the embedding layers.

Based on the performance analysis over the baseline models, we saw that the alternate representations, while helping specific topics, do not help as much for the high-resource language overall. This is possibly due to the relatively high consistency in transcription and the existence of various tokens with similar meaning and sound in the training data. However, they helped significantly boost performance in the low-resource languages, thus creating potential applications in bootstrapping new languages quickly and cost-effectively.

In the case of the low-resource languages, we saw significant improvements overall when using either ISO-15919 or IndicSOUNDEX. This suggests that smaller datasets stand to benefit from the use of alternative representations. Based on the Tamil and Marathi results, where IndicSOUNDEX performed better than both the baseline and ISO-15919, we can conclude that the mitigation of the impact of multiple different scripts and inconsistent Latin usage accomplished by IndicSOUNDEX seems to produce better results for our NLU models. The lack of impact of IndicSOUNDEX for Telugu merits further investigation.

Our future work includes exploring different model architectures, including transformer models, and exploring these representations further by pre-training models with a combination of original tokens and alternate representations of tokens. We also plan to explore the use of character bigrams (of both original text and alternate representations of text) instead of unigram characters in the embedding layers.

6 Acknowledgments

We would like to thank Karolina Owczarzak for her support and invaluable discussions on this work, and Wael Hamza for useful discussions. We would also like to thank our extended teammates for helpful discussions, and the anonymous reviewers for valuable feedback.

References

Beider, A. and Morse, S. P. (2010). Phonetic matching: A better Soundex. Association of Professional Genealogists Quarterly.

Chaware, S. M. and Rao, S. (2012). Analysis of phonetic matching approaches for Indic languages.

Geetha, P., Chandu, K., and Black, A. W. (2018). Tackling code-switched NER: Participation of CMU. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 126-131.

Johnson, A., Karanasou, P., Gaspers, J., and Klakow, D. (2019). Cross-lingual transfer learning for Japanese named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), pages 182-189.

Lafferty, J. D., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282-289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Shah, R. and Singh, D. K. (2014). Improvement of Soundex algorithm for Indian languages based on phonetic matching. International Journal of Computer Science, Engineering and Applications, 4(3).

Su, C., Gupta, R., Ananthakrishnan, S., and Matsoukas, S. (2018). A re-ranker scheme for integrating large scale NLU models. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 670-676.

Zhang, X. and LeCun, Y. (2017). Which encoding is the best for text classification in Chinese, English, Japanese and Korean? CoRR.

Appendix A: Results across different topics on all languages

Table 7: Results for Hindi. Relative change in performance w.r.t. baseline on the average of ten bootstrapped test data sets. '+' indicates degradation, '-' indicates improvement

Topic                % Change in IndicSOUNDEX  Is Significant  % Change in ISO-15919  Is Significant
Reservations         -1.64%                    NO              -0.65%                 NO
Books                -0.43%                    NO              -5.18%                 NO
Calendar             -7.59%                    NO              -7.05%                 NO
CallingAndMessaging  -4.36%                    NO              -0.08%                 NO
News                 +2.78%                    NO              +0.92%                 NO
Photos               -6.85%                    NO              -12.16%                NO
Media                +7.79%                    NO              +11.13%                YES
Global               -2.89%                    YES             -4.09%                 YES
Help                 +6.52%                    NO              +5.76%                 NO
SmartHome            -5.58%                    YES             -4.32%                 YES
QuestionAndAnswer    +3.10%                    NO              +4.99%                 YES
Search               -3.98%                    NO              +3.78%                 NO
Music                -0.45%                    NO              +2.18%                 YES
Notifications        +2.74%                    NO              -2.49%                 NO
OriginalContent      -3.66%                    NO              -1.63%                 NO
Recipes              -0.44%                    NO              -0.39%                 NO
Shopping             +5.36%                    YES             -0.82%                 NO
Sports               0%                        NO              0%                     NO
ToDoLists            -16.50%                   YES             +1.08%                 NO
Video                -2.73%                    NO              -4%                    YES
Weather              -3.89%                    NO              +16.36%                YES
Overall              -0.07%                    NO              +1.08%                 YES

Table 8: Results for Tamil. Relative change in performance w.r.t. baseline on the average of ten bootstrapped test data sets. '+' indicates degradation, '-' indicates improvement

Topic                % Change in IndicSOUNDEX  Is Significant  % Change in ISO-15919  Is Significant
Books                -14.06%                   YES             -18.28%                YES
Calendar             -8.52%                    YES             -11.93%                YES
MovieTimes           -7.91%                    YES             -7.24%                 NO
CallingAndMessaging  -12.44%                   YES             -6.38%                 YES
News                 -15.50%                   YES             -6.14%                 YES
Media                -12.23%                   YES             0%                     NO
Global               -4.20%                    YES             -4.41%                 YES
Help                 -20.84%                   YES             -17.72%                YES
SmartHome            -7.29%                    YES             +0.05%                 YES
Search               +7.15%                    YES             +3.08%                 NO
Music                -8.03%                    YES             -8.08%                 YES
Notifications        -10.98%                   YES             -6.81%                 YES
OriginalContent      -9.92%                    YES             -23.21%                YES
Shopping             -16.22%                   YES             -13.44%                YES
Sports               +8.01%                    NO              -30.10%                YES
ToDoLists            -2.14%                    NO              +3.03%                 YES
Video                -4.01%                    YES             -14.65%                YES
Weather              -10.23%                   YES             -10.91%                YES
Overall              -7.56%                    YES             -6.23%                 YES

Table 9: Results for Marathi. Relative change in performance w.r.t. baseline on the average of ten bootstrapped test data sets. '+' indicates degradation, '-' indicates improvement

Topic                % Change in IndicSOUNDEX  Is Significant  % Change in ISO-15919  Is Significant
Books                -6.95%                    YES             -11.81%                YES
Calendar             +9.28%                    YES             -0.11%                 YES
MovieTimes           -13.74%                   YES             -11.03%                YES
CallingAndMessaging  +3.09%                    YES             +4.97%                 YES
News                 +16.76%                   YES             +1.90%                 YES
Media                -24.50%                   YES             -12.84%                YES
Global               +0.13%                    NO              -2.81%                 YES
Help                 -8.51%                    YES             -0.76%                 NO
SmartHome            -14.36%                   YES             -5.04%                 YES
Search               -0.63%                    NO              -1.70%                 YES
Music                -0.45%                    NO              -4.66%                 YES
Notifications        +2.49%                    YES             +5.50%                 YES
OriginalContent      -22.84%                   YES             +8.10%                 YES
Shopping             +2.97%                    NO              +12.17%                YES
Sports               -38.48%                   YES             -35.97%                YES
ToDoLists            +0.70%                    NO              -0.23%                 NO
Video                +5.25%                    YES             +1.59%                 NO
Weather              +12.29%                   YES             +32.18%                YES
Overall              -2.29%                    YES             -1.47%                 YES

Table 10: Results for Telugu. Relative change in performance w.r.t. baseline on the average of ten bootstrapped test data sets. '+' indicates degradation, '-' indicates improvement

Topic                % Change in IndicSOUNDEX  Is Significant  % Change in ISO-15919  Is Significant
Books                +2.42%                    NO              +0.37%                 NO
Calendar             +4.51%                    YES             -10.33%                YES
MovieTimes           -13.57%                   YES             -8.83%                 YES
CallingAndMessaging  +5.41%                    YES             +3.79%                 YES
News                 +0.97%                    NO              +15.24%                YES
Media                -13.91%                   YES             -0.61%                 NO
Global               +2.93%                    YES             +3.57%                 NO
Help                 +12.35%                   YES             -0.45%                 NO
SmartHome            -0.15%                    NO              +0.31%                 NO
Search               -7.41%                    YES             +2.90%                 YES
Music                +1.02%                    NO              -4.55%                 YES
Notifications        +1.33%                    NO              -4.15%                 NO
OriginalContent      -3.94%                    NO              +2.10%                 NO
Shopping             +0.55%                    NO              +1.59%                 NO
Sports               -4.43%                    NO              -14.31%                NO
ToDoLists            +10.44%                   YES             +10.05%                YES
Video                -0.72%                    NO              -0.97%                 NO
Weather              -25.98%                   YES             -35.37%                YES
Overall              -0.21%                    NO              -2.42%                 YES
