Phonetic and Visual Priors for Decipherment of Informal Romanization
Maria Ryskina¹  Matthew R. Gormley²  Taylor Berg-Kirkpatrick³
¹Language Technologies Institute, Carnegie Mellon University
²Machine Learning Department, Carnegie Mellon University
³Computer Science and Engineering, University of California, San Diego
{mryskina,mgormley}@cs.cmu.edu  [email protected]

Abstract

Informal romanization is an idiosyncratic process used by humans in informal digital communication to encode non-Latin script languages into Latin character sets found on common keyboards. Character substitution choices differ between users but have been shown to be governed by the same main principles observed across a variety of languages; namely, character pairs are often associated through phonetic or visual similarity. We propose a noisy-channel WFST cascade model for deciphering the original non-Latin script from observed romanized text in an unsupervised fashion. We train our model directly on romanized data from two languages: Egyptian Arabic and Russian. We demonstrate that adding inductive bias through phonetic and visual priors on character mappings substantially improves the model's performance on both languages, yielding results much closer to the supervised skyline. Finally, we introduce a new dataset of romanized Russian, collected from a Russian social network website and partially annotated for our experiments.¹

horosho  [Phonetically romanized]
хорошо   [Underlying Cyrillic]
xopowo   [Visually romanized]

Figure 1: Example transliterations of the Russian word хорошо [horošo, 'good'] (middle) based on phonetic (top) and visual (bottom) similarity, with character alignments displayed. The phonetic-visual dichotomy gives rise to one-to-many mappings such as ш /ʃ/ → sh / w.

1 Introduction

Written online communication poses a number of challenges for natural language processing systems, including the presence of neologisms, code-switching, and the use of non-standard orthography. One notable example of orthographic variation in social media is informal romanization²: speakers of languages written in non-Latin alphabets encode their messages in Latin characters, for convenience or due to technical constraints (improper rendering of native script or keyboard layout incompatibility). An example of such a sentence can be found in Figure 2. Unlike named entity transliteration, where the change of script represents the change of language, here Latin characters serve as an intermediate symbolic representation to be decoded by another speaker of the same source language. This calls for a completely different transliteration mechanism: instead of expressing the pronunciation of the word according to the phonetic rules of another language, informal transliteration can be viewed as a substitution cipher, in which each source character is replaced with a similar Latin character.

In this paper, we focus on decoding informally romanized texts back into their original scripts. We view the task as a decipherment problem and propose an unsupervised approach, which allows us to save annotation effort, since parallel data for informal transliteration does not occur naturally. We propose a weighted finite-state transducer (WFST) cascade model that learns to decode informal romanization without parallel text, relying only on transliterated data and a language model over the original orthography. We test it on two languages, Egyptian Arabic and Russian, collecting our own dataset of romanized Russian from the Russian social network website vk.com.

¹The code and data are available at https://github.com/ryskina/romanization-decipherment
²Our focus on informal transliteration excludes formal settings such as pinyin for Mandarin, where transliteration conventions are well established.

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8308–8319, July 5–10, 2020. © 2020 Association for Computational Linguistics

4to mowet bit' ly4we?     [Romanized]
Что может быть лучше?     [Latent Cyrillic]
Čto možet byt' lučše?     [Scientific]
/ʃto ˈmoʒɨt bɨtʲ ˈlut͡ʃʃɨ/  [IPA]
What can be better?       [Translated]

Figure 2: Example of an informally romanized sentence from the dataset presented in this paper, containing a many-to-one mapping ж / ш → w. Scientific transliteration, broad phonetic transcription, and translation are not included in the dataset and are presented for illustration only.

Since informal transliteration is not standardized, converting romanized text back to its original orthography requires reasoning about the specific user's transliteration preferences and handling many-to-one (Figure 2) and one-to-many (Figure 1) character mappings, which is beyond the capabilities of traditional rule-based converters. Although user behaviors vary, there are two dominant patterns in informal romanization that have been observed independently across different languages, such as Russian (Paulsen, 2014), dialectal Arabic (Darwish, 2014), and Greek (Chalamandaris et al., 2006):

Phonetic similarity: Users represent source characters with Latin characters or digraphs associated with similar phonemes (e.g. м /m/ → m, л /l/ → l in Figure 2). This substitution method requires implicitly tying the Latin characters to the phonetic system of an intermediate language (typically English).

Visual similarity: Users replace source characters with similar-looking symbols (e.g. ч /tʃʲ/ → 4, у /u/ → y in Figure 2). Visual similarity choices often involve numerals, especially when the corresponding source-language phoneme has no English equivalent (e.g. Arabic ع /ʕ/ → 3).

Taking this consistency across languages into account, we show that incorporating these style patterns into our model as priors on the emission parameters (also constructed from naturally occurring resources) improves the decoding accuracy on both languages. We compare the proposed unsupervised WFST model with a supervised WFST, an unsupervised neural architecture, and commercial systems for decoding romanized Russian (translit) and Arabic (Arabizi). Our unsupervised WFST outperforms the unsupervised neural baseline on both languages.

2 Related work

Prior work on informal transliteration uses supervised approaches with character substitution rules either manually defined or learned from automatically extracted character alignments (Darwish, 2014; Chalamandaris et al., 2004). Typically, such approaches are pipelined: they produce candidate transliterations and rerank them using modules encoding knowledge of the source language, such as morphological analyzers or word-level language models (Al-Badrashiny et al., 2014; Eskander et al., 2014). Supervised finite-state approaches have also been explored (Wolf-Sonkin et al., 2019; Hellsten et al., 2017); these WFST cascade models are similar to the one we propose, but they encode a different set of assumptions about the transliteration process due to being designed for abugida scripts (using consonant-vowel syllables as units) rather than alphabets. To our knowledge, there is no prior unsupervised work on this problem.

Named entity transliteration, a task closely related to ours, is better explored, but there is little unsupervised work on it as well. In particular, Ravi and Knight (2009) propose a fully unsupervised version of the WFST approach introduced by Knight and Graehl (1998), reframing the task as a decipherment problem and learning cross-lingual phoneme mappings from monolingual data. We take a similar path, although it should be noted that named entity transliteration methods cannot be straightforwardly adapted to our task due to the different nature of the transliteration choices. The goal of the standard transliteration task is to communicate the pronunciation of a sequence in the source language (SL) to a speaker of the target language (TL) by rendering it appropriately in the TL alphabet; in contrast, informal romanization emerges in communication between SL speakers only, and TL is not specified. If we picked any specific Latin-script language to represent TL (e.g. English, which is often used to ground phonetic substitutions), many of the informally romanized sequences would still not conform to its pronunciation rules: the transliteration process is character-level rather than phoneme-level and does not take possible TL digraphs into account (e.g. Russian сх /sx/ → sh), and it often involves eclectic visual substitution choices such as numerals or punctuation (e.g. Arabic تحت [tHt, 'under']³ → ta7t, Russian для [dlja, 'for'] → dl9|).
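The one-to-many and many-to-one character mappings that these substitution patterns produce can be made concrete with a small sketch. The candidate sets below are invented for illustration; they are not the paper's learned parameters or priors.

```python
from collections import defaultdict

# Hypothetical similarity-based candidate sets (illustrative only):
# one source character may map to several Latin options (one-to-many,
# phonetic digraph vs. visual look-alike), and several source
# characters may share a Latin option (many-to-one).
CANDIDATES = {
    "ш": ["sh", "w"],  # /ʃ/ -> phonetic digraph "sh", or visual "w"
    "ж": ["zh", "w"],  # /ʒ/ -> phonetic digraph "zh", or visual "w"
    "ч": ["ch", "4"],  # /tʃʲ/ -> phonetic "ch", or visual numeral "4"
    "у": ["u", "y"],   # /u/ -> phonetic "u", or visual "y"
}

# Invert the table to expose many-to-one collisions on the Latin side.
inverse = defaultdict(list)
for src, options in CANDIDATES.items():
    for latin in options:
        inverse[latin].append(src)

print(inverse["w"])  # both ш and ж can surface as "w"
```

A decoder therefore cannot rely on an invertible substitution table: resolving a Latin "w" back to ш or ж requires context, which is what the language model over the original orthography provides.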
Finally, another relevant task is translating between closely related languages, possibly written in different scripts. An approach similar to ours is proposed by Pourdamghani and Knight (2017). They also take an unsupervised decipherment approach: the cipher model, parameterized as a WFST, is trained to encode the source language character sequences into the target language alphabet as part of a character-level

3.1 Model

If we view the process of romanization as encoding a source sequence o into Latin characters, we can consider each observation l to have originated via o being generated from a distribution p(o) and then transformed to Latin script according to another distribution p(l | o). We can write the probability of the observed Latin sequence as:

p(l) = Σ_o p(o; γ) · p(l | o; θ) · p_prior(θ; α)    (1)
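As a toy illustration of the noisy-channel factorization in Equation (1), the sketch below computes the marginal probability of a single romanized character and the corresponding MAP source character. All distributions are invented for illustration, the sequences are length one, and the prior term over θ is omitted.

```python
# Toy source-side model p(o) and channel p(l | o) over single
# characters; all probabilities are invented for illustration.
p_src = {"ш": 0.5, "ж": 0.3, "ч": 0.2}
p_channel = {  # p(latin | source)
    "ш": {"w": 0.6, "sh": 0.4},
    "ж": {"w": 0.7, "zh": 0.3},
    "ч": {"4": 0.9, "ch": 0.1},
}

def marginal(latin: str) -> float:
    """p(l) = sum over o of p(o) * p(l | o)."""
    return sum(p_src[o] * p_channel[o].get(latin, 0.0) for o in p_src)

def decode(latin: str) -> str:
    """MAP decoding: argmax over o of p(o) * p(l | o)."""
    return max(p_src, key=lambda o: p_src[o] * p_channel[o].get(latin, 0.0))

print(round(marginal("w"), 2))  # 0.5*0.6 + 0.3*0.7 = 0.51
print(decode("w"))              # ш wins: 0.30 vs 0.21 for ж
```

The WFST cascade in the paper implements the same computation over full character sequences, where the sum over o is carried out by transducer composition rather than explicit enumeration.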