Phonetic and Visual Priors for Decipherment of Informal Romanization

Maria Ryskina¹, Matthew R. Gormley², Taylor Berg-Kirkpatrick³
¹Language Technologies Institute, Carnegie Mellon University
²Machine Learning Department, Carnegie Mellon University
³Computer Science and Engineering, University of California, San Diego
{mryskina,mgormley}@cs.cmu.edu [email protected]

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8308–8319, July 5 - 10, 2020. © 2020 Association for Computational Linguistics

Abstract

Informal romanization is an idiosyncratic process used by humans in informal digital communication to encode non-Latin script languages into Latin character sets found on common keyboards. Character substitution choices differ between users but have been shown to be governed by the same main principles observed across a variety of languages; namely, character pairs are often associated through phonetic or visual similarity. We propose a noisy-channel WFST cascade model for deciphering the original non-Latin script from observed romanized text in an unsupervised fashion. We train our model directly on romanized data from two languages: Egyptian Arabic and Russian. We demonstrate that adding inductive bias through phonetic and visual priors on character mappings substantially improves the model's performance on both languages, yielding results much closer to the supervised skyline. Finally, we introduce a new dataset of romanized Russian, collected from a Russian social network website and partially annotated for our experiments.¹

horosho [Phonetically romanized]
хорошо [Underlying Cyrillic]
xopowo [Visually romanized]

Figure 1: Example transliterations of a Russian word хорошо [horošo, 'good'] (middle) based on phonetic (top) and visual (bottom) similarity, with character alignments displayed. The phonetic-visual dichotomy gives rise to one-to-many mappings such as ш /ʃ/ → sh / w.

1 Introduction

Written online communication poses a number of challenges for natural language processing systems, including the presence of neologisms, code-switching, and the use of non-standard orthography. One notable example of orthographic variation in social media is informal romanization²: speakers of languages written in non-Latin alphabets encoding their messages in Latin characters, for convenience or due to technical constraints (improper rendering of native script or keyboard layout incompatibility). An example of such a sentence can be found in Figure 2. Unlike named entity transliteration, where the change of script represents the change of language, here Latin characters serve as an intermediate symbolic representation to be decoded by another speaker of the same source language, calling for a completely different transliteration mechanism: instead of expressing the pronunciation of the word according to the phonetic rules of another language, informal transliteration can be viewed as a substitution cipher, where each source character is replaced with a similar Latin character.

In this paper, we focus on decoding informally romanized texts back into their original scripts. We view the task as a decipherment problem and propose an unsupervised approach, which allows us to save annotation effort since parallel data for informal transliteration does not occur naturally. We propose a weighted finite-state transducer (WFST) cascade model that learns to decode informal romanization without parallel text, relying only on transliterated data and a language model over the original orthography. We test it on two languages, Egyptian Arabic and Russian, collecting our own dataset of romanized Russian from a Russian social network website vk.com.

4to mowet bit' ly4we? [Romanized]
Что может быть лучше? [Latent Cyrillic]
Čto možet byt' lučše? [Scientific]
/ʃto ˈmoʒɨt bɨtʲ ˈlut͡ʃʃɨ/ [IPA]
What can be better? [Translated]

Figure 2: Example of an informally romanized sentence from the dataset presented in this paper, containing a many-to-one mapping ж / ш → w. Scientific transliteration, broad phonetic transcription, and translation are not included in the dataset and are presented for illustration only.

Since informal transliteration is not standardized, converting romanized text back to its original orthography requires reasoning about the specific user's transliteration preferences and handling many-to-one (Figure 2) and one-to-many (Figure 1) character mappings, which is beyond traditional rule-based converters. Although user behaviors vary, there are two dominant patterns in informal romanization that have been observed independently across different languages, such as Russian (Paulsen, 2014), dialectal Arabic (Darwish, 2014) or Greek (Chalamandaris et al., 2006):

Phonetic similarity: Users represent source characters with Latin characters or digraphs associated with similar phonemes (e.g. м /m/ → m, л /l/ → l in Figure 2). This substitution method requires implicitly tying the Latin characters to a phonetic system of an intermediate language (typically, English).

Visual similarity: Users replace source characters with similar-looking symbols (e.g. ч /t͡ʃʲ/ → 4, у /u/ → y in Figure 2). Visual similarity choices often involve numerals, especially when the corresponding source language phoneme has no English equivalent (e.g. Arabic ع /ʕ/ → 3).

Taking that consistency across languages into account, we show that incorporating these style patterns into our model as priors on the emission parameters (also constructed from naturally occurring resources) improves the decoding accuracy on both languages. We compare the proposed unsupervised WFST model with a supervised WFST, an unsupervised neural architecture, and commercial systems for decoding romanized Russian (translit) and Arabic (Arabizi). Our unsupervised WFST outperforms the unsupervised neural baseline on both languages.

¹The code and data are available at https://github.com/ryskina/romanization-decipherment
²Our focus on informal transliteration excludes formal settings such as pinyin for Mandarin where transliteration conventions are well established.

2 Related work

Prior work on informal transliteration uses supervised approaches with character substitution rules either manually defined or learned from automatically extracted character alignments (Darwish, 2014; Chalamandaris et al., 2004). Typically, such approaches are pipelined: they produce candidate transliterations and rerank them using modules encoding knowledge of the source language, such as morphological analyzers or word-level language models (Al-Badrashiny et al., 2014; Eskander et al., 2014). Supervised finite-state approaches have also been explored (Wolf-Sonkin et al., 2019; Hellsten et al., 2017); these WFST cascade models are similar to the one we propose, but they encode a different set of assumptions about the transliteration process due to being designed for abugida scripts (using consonant-vowel syllables as units) rather than alphabets. To our knowledge, there is no prior unsupervised work on this problem.

Named entity transliteration, a task closely related to ours, is better explored, but there is little unsupervised work on this task as well. In particular, Ravi and Knight (2009) propose a fully unsupervised version of the WFST approach introduced by Knight and Graehl (1998), reframing the task as a decipherment problem and learning cross-lingual phoneme mappings from monolingual data. We take a similar path, although it should be noted that named entity transliteration methods cannot be straightforwardly adapted to our task due to the different nature of the transliteration choices. The goal of the standard transliteration task is to communicate the pronunciation of a sequence in the source language (SL) to a speaker of the target language (TL) by rendering it appropriately in the TL alphabet; in contrast, informal romanization emerges in communication between SL speakers only, and TL is not specified. If we picked any specific Latin-script language to represent TL (e.g. English, which is often used to ground phonetic substitutions), many of the informally romanized sequences would still not conform to its pronunciation rules: the transliteration process is character-level rather than phoneme-level and does not take possible TL digraphs into account (e.g. Russian сх /sx/ → sh), and it often involves eclectic visual substitution choices such as numerals or punctuation (e.g. Arabic تحت [tHt, 'under']³ → ta7t, Russian для [dlja, 'for'] → dl9|).

Finally, another relevant task is translating between closely related languages, possibly written in different scripts. An approach similar to ours is proposed by Pourdamghani and Knight (2017). They also take an unsupervised decipherment approach: the cipher model, parameterized as a WFST, is trained to encode the source language character sequences into the target language alphabet as part of a character-level …

3.1 Model

If we view the process of romanization as encoding a source sequence $o$ into Latin characters, we can consider each observation $l$ to have originated via $o$ being generated from a distribution $p(o)$ and then transformed to Latin script according to another distribution $p(l \mid o)$. We can write the probability of the observed Latin sequence as:

$$p(l) = \sum_{o} p(o; \gamma) \cdot p(l \mid o; \theta) \cdot p_{\text{prior}}(\theta; \alpha) \qquad (1)$$
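
As a rough illustration of the noisy-channel factorization in Equation (1), the Python sketch below enumerates a tiny set of candidate source words and sums p(o) · p(l | o) over them. Everything in it is an assumption made for illustration only: the two-word language model `LM`, the emission table `EMISSION` and its probabilities, the restriction to exactly one Latin character per source character, and the choice to hold θ fixed so that the prior term is a constant and is left out. The paper's model is a WFST cascade, which this brute-force enumeration does not reproduce.

```python
from math import prod

# Toy language model p(o; gamma) over candidate source words.
# Hypothetical values, not estimated from any data.
LM = {"шар": 0.3, "жар": 0.7}

# Toy emission table p(latin_char | source_char; theta).
# Hypothetical values; digraphs such as "sh" are deliberately left out.
EMISSION = {
    "ш": {"w": 0.7, "s": 0.3},
    "ж": {"w": 0.5, "z": 0.5},
    "а": {"a": 1.0},
    "р": {"r": 0.6, "p": 0.4},
}


def channel_prob(latin, source):
    """p(l | o; theta), assuming one Latin character per source character."""
    if len(latin) != len(source):
        return 0.0
    return prod(EMISSION.get(s, {}).get(c, 0.0) for s, c in zip(source, latin))


def marginal(latin):
    """p(l) = sum over o of p(o) * p(l | o), with theta held fixed."""
    return sum(p_o * channel_prob(latin, o) for o, p_o in LM.items())


def decode(latin):
    """Source word maximizing p(o) * p(l | o), or None if p(l) is zero."""
    scores = {o: p_o * channel_prob(latin, o) for o, p_o in LM.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0.0 else None


if __name__ == "__main__":
    print(marginal("war"))
    print(decode("war"))
```

With these toy numbers, the observed string `war` is compatible with both шар and жар (an instance of the many-to-one mapping ж / ш → w), and the language model term is what breaks the tie between them.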