<<

Out-of-the-box Universal Tool uroman

Ulf Hermjakob, Jonathan May, Kevin Knight Information Sciences Institute, University of Southern California {ulf,jonmay,knight}@isi.edu

Abstract distance-based metrics (Levenshtein, 1966; Win- kler, 1990) and phonetic-based metrics such as We present uroman, a tool for converting Metaphone (Philips, 2000). text in myriads of and scripts Hindi, for example, is written in the Devanagari such as Chinese, Arabic and Cyrillic into a script and Urdu in the Arabic script, so any words common Latin-script representation. The between those two languages will superficially tool relies on data and other ta- appear to be very different, even though the two bles, and handles nearly all character sets, languages are closely related. After romanization, including some that are quite obscure such however, the similarities become apparent, as can as Tibetan and Tifinagh. uroman converts be seen in Table1: digital numbers in various scripts to West- ern Arabic numerals. Romanization en- ables the application of string-similarity metrics to texts from different scripts with- out the need and complexity of an in- termediate phonetic representation. The tool is freely and publicly available as a Table 1: Example of Hindi and Urdu romanization Perl script suitable for inclusion in data processing pipelines and as an interactive Foreign scripts also present a massive cognitive demo web page. barrier to humans who are not familiar with them. We devised a utility that allows people to trans- 1 Introduction late text from languages they don’ know, using String similarity is a useful feature in many natural the same information available to a machine trans- processing tasks. In machine translation, lation system (Hermjakob et al., 2018). We found it can be used to improve the alignment of bitexts, that when we asked native English speakers to use and for low-resource languages with a related lan- this utility to translate text from languages such as guage of larger resources, it can help to decode Uyghur or Bengali to English, they strongly pre- out-of-vocabulary words. For example, suppose ferred working on the romanized version of the we have to translate degustazione del vino with- source language compared to its original form and out any occurrence of degustazione in any train- indeed found using the native, unfamiliar script to ing corpora, but we do know that in a related lan- be a nearly impossible task. guage degustation´ de vin means wine tasting, we can use string similarity to infer the meaning of degustazione. 1.1 Scope of Romanization String similarity metrics typically assume that Romanization maps characters or groups of char- the strings are in the same script, but many cross- acters in one script to a character or group of char- lingual tasks such as machine translation often in- acters in the (ASCII) with the goal to volve multiple scripts. If we can romanize text approximate the pronunciation of the original text from a non-Latin script to Latin, standard string and to map cognates in various languages to simi- similarity metrics can be applied, including edit lar words in the Latin script, typically without the

13 Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics-System Demonstrations, pages 13–18 Melbourne, Australia, July 15 - 20, 2018. 2018 Association for Computational Linguistics Table 2: Romanization examples for 10 scripts use of any large-scale lexical resources. As a sec- Κρήτη γεωλογία μπανάνα ondary goal, romanization standards tend to pre- Pronunciation Kriti yeoloyia banana fer reversible mappings. For example, as stand- Romanization Krete geologia banana alone , the Greek letters ι () and υ (up- English Crete geology banana silon) are romanized to and respectively, even German Kreta Geologie Banane though they have the same pronunciation in Mod- ern Greek. Table 3: Examples of Greek romanization uroman generally follows such preference, but uroman is not always fully reversible. For exam- occurring text in Arabic and Hebrew (i.., without ple, since uroman maps letters to ASCII charac- /points); see Table4. ters, the romanized text does not contain any dia- critics, so the French word ou (“or”) and its homo- phone ou` (“where”) both map to romanized ou. uroman provides the option to map to a plain string or to a lattice of romanized text, which al- lows the system to output alternative romaniza- Table 4: Romanization with and without diacritics tions. This is particularly useful for source lan- guages that use the same character for significantly different sounds. The Hebrew Pe for exam- 1.2 Features ple can stand for both p and . Lattices are output uroman has the following features: in JSON format. 1. Input: UTF8-encoded text and an optional Note that romanization does not necessarily ISO-639-3 language code capture the exact pronunciation, which varies 2. Output: Romanized text (default) or lattice of across time and space (due to language change romanization alternatives in JSON format and dialects) and can be subject to a number of 1 processes of phonetic assimilation. It also is not 3. Nearly universal romanization a translation of names and cognates to English 4. -to- mapping for groups of characters that (or any other target language). See Table3 for are non-decomposable with respect to roman- examples for Greek. ization

A romanizer is not a full transliterator. For 5. Context-sensitive and source language- example, this tool does not vowelize text that specific romanization rules lacks explicit vowelization such as normally 1See Section4 for a few limitations.

14 6. Romanization includes (digital) numbers currently 1,088 entries. The entries in these tables 7. Romanization includes punctuation can be restricted by conditions, for example to spe- cific languages or to the beginning of a word, and 8. Preserves capitalization can express alternative . This data 9. Freely and publicly available table is a core contribution of the tool. uroman additionally includes a few special Romanization tools have long existed for spe- heuristics cast in code, such as for the voweliza- 2 cific individual languages such as the Kakasi tions of a number of Indian languages and Ti- kanji-to-kana/romaji converter for Japanese, but to betan, dealing with diacritics, and a few language- the best of our knowledge, we present the first pub- specific idiosyncrasies such as the Japanese licly available (near) universal romanizer that han- sokuon and Thai - swaps. dles n-to-m character mappings. Many romaniza- Building these uroman resources has been tion examples are shown in Table2 and examples greatly facilitated by information drawn from of n-to-m character mapping rules are shown in Wikipedia,4 Richard Ishida’ script notes,5 and Table7. ALA-LC Romanization Tables.6 2 System Description 2.3 Characters without Unicode Description 2.1 Unicode Data The Unicode table does not include character de- As its basis, uroman uses the character descrip- scriptions for all scripts. tions of the Unicode table.3 For the characters of For Chinese characters, we use a Mandarin most scripts, the Unicode table contains descrip- pinyin table for romanization. tions such as CYRILLIC SMALL LETTER SHORT For Korean, we use a short standard Hangul ro- U or CYRILLIC CAPITAL LETTER TE WITH 7 MIDDLE HOOK. Using a few heuristics, uroman manization algorithm. identifies the phonetic token in that description, For Egyptian hieroglyphs, we added single- i.e. U and TE for the examples above. The heuris- sound phonetic characters and numbers to uro- tics use a list of anchor keywords such as letter man’s additional tables. and as well as a number of modifier pat- terns that can be discarded. Given the phonetic 2.4 Numbers token of the Unicode description, uroman then uroman also romanizes numbers in digital form. uses a second set of heuristics to predict the ro- For some scripts, number characters map one- manization for these phonetic tokens, i.e. u and t. to-one to Western Arabic numerals 0-9, e.. for For example, if the phonetic token is one of more Bengali, Eastern Arabic and Hindi. followed by one or more vowels, the predicted romanization is the leading sequence of For other scripts, such as Amharic, Chinese, consonants, e.g. SHA → . and Egyptian hieroglyphics, written numbers are structurally different, e.g. the Amharic num- 2.2 Additional Tables ber character sequence 10·9·100·90·8 = 1998 However, these heuristics often fail. An exam- and the Chinese number character sequence ple of a particularly spectacular failure is SCHWA 2·10·5·10000·6·1000 = 256000. uroman includes → schw instead of the desired e. Addition- a special number module to accomplish this latter ally, there are sequences of characters with non- type of mapping. Examples are shown in Table5. compositional romanization. For example, the Note that for phonetically spelled-out numbers standard romanization for the Greek sequence such as Greek οκτώ, uroman romanizes to the omikron+, (ου) is the Latin ou rather than spelled-out Latin okto rather than the digital 8. the character-by-character romanization oy. As a remedy, we manually created additional correction tables that map sequences of one or 4https://en.wikipedia.org more characters to the desired romanization, with 5https://r12a.github.io/scripts/featurelist 6https://www.loc.gov/catdir/cpso/roman.html 2http://kakasi.namazu.org 7http://gernot-katzers-spice-pages.com/var/ 3ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt korean hangul unicode.html

15 Figure 1: Screenshot of Uyghur romanization on demo site at bit.ly/uroman

Table 5: Romanization/Arabization of numbers

2.5 Romanization of Latin Text Table 7: Romanization rules with two examples Some Latin script-based languages have words for each for Greek, Uyghur, Japanese, and English, which spelling and pronunciation differ substan- with a variety of n-to-m mappings. tially, e.g. the English name Knight (IPA: /naIt/) (::s = source; ::t = target; ::lcode = language code) and French Bordeaux (/bOK.do/), which compli- cates string similarity matching if the correspond- 3 Download and Demo Sites ing spelling of the word in the non-Latin script is based on pronunciation. uroman v1.2 is publicly available for download uroman therefore offers alternative romaniza- at bit.ly/isi-nlp-software. The fully self-sufficient tions for words such as Knight and Bordeaux software package includes the implementation of (see Table6 for an example of the former), but, uroman in Perl with all necessary data tables. The as a policy uroman always preserves the original software is easy to install (gunzip and tar), with- Latin spelling, minus any diacritics, as the top out any need for compilation or any other software romanization alternative. (other than Perl). Typical call (for plain text output): uroman.pl --lc uig < STDIN > STDOUT where –lc uig specifies the (optional) language code (e.g. Uyghur). There is also an interactive demo site at bit.ly/uroman. Users may enter text in the lan- Table 6: Romanization with alternatives guage and script of their choice, optionally specify a language code, and then have uroman romanize Table7 includes examples of the Romanization the text. rules in uroman, including n-to-m mappings. Additionally, the demo page includes sample texts in 290 languages in a wide variety of scripts. 2.6 Caching Texts in 21 sample languages are available on the uroman caches token romanizations for speed. demo start page and more are accessible as ran-

16 dom texts. After picking the first random text, useful in cross-lingual information retrieval. additional random texts will be available from There is a body of work mapping text to pho- three corpora to choose from (small, large, and netic representations. Deri and Knight(2016) Wikipedia articles about the US). Users can then use Wiktionary and Wikipedia resources to learn restrict the randomness in a special option field. text-to- mappings. Phonetic representa- For example, -- will exclude texts in Latin+ tions are used in a number of end-to-end translit- scripts. For further information about possible re- eration systems (Knight and Graehl, 1998; Yoon strictions, hover over the word restriction (a dot- et al., 2007). Qian et al.(2010) describe the ted underline indicates that additional info will be toolkit ScriptTranscriber, which extracts cross- shown when a user hovers over it). lingual transliteration pairs from comparable cor- The romanization of the output at the demo site pora. A core component of ScriptTranscriber is mouse sensitive. Hovering over characters of maps text to an ASCII variant of the International either the original or romanized text, the page will Phonetic (IPA). highlight corresponding characters. See Figure1 Andy Hu’s transliterator8 is a fairly universal for an example. Hovering over the original text romanizer in JavaScript, limited to romanizing one will also display additional information such as Unicode character at a time, without context. the Unicode name and any numeric value. To sup- port this interactive demo site, the uroman package 5.2 Applications Using uroman also includes fonts for Burmese, Tifinagh, Klin- gon, and Egyptian hieroglyphs, as they are some- Ji et al.(2017) and Mayfield et al.(2017) use uroman times missing from standard browser font pack- for named entity recognition. Mayhew uroman ages. et al.(2016) use for (end-to-end) translit- eration. Cheung et al.(2017) use uroman for ma- 4 Limitations and Future Work chine translation of low-resource languages. uroman has also been used in our aforemen- uroman The current version of has a few limita- tioned translation utility (Hermjakob et al., 2018), tions, some of which we plan to address in fu- where humans translate text to another language, uroman ture versions. For Japanese, currently ro- with computer support, with high fluency in the manizes hiragana and katakana as expected, but target language (English), but no prior knowledge kanji are interpreted as Chinese characters and ro- of the source language. manized as such. For Egyptian hieroglyphs, only uroman has been partially ported by third par- single-sound phonetic characters and numbers are ties to Python and Java.9 currently romanized. For Linear , only phonetic syllabic characters are romanized. For some other 6 Conclusion extinct scripts such as cuneiform, no romanization is provided. Romanization tools have long existed for specific uroman allows the user to specify an ISO-639-3 individual languages, but to the best of our knowl- source language code, e.g. uig for Uyghur. This edge, we present the first publicly available (near) invokes any language-specific romanization rules universal romanizer that handles n-to-m character for languages that share a script with other lan- mappings. The tool offers both simple plain text guages. Without source language code specifi- as well as lattice output with alternatives, and in- cation, uroman assumes a default language, e.g. cludes romanization of numbers in digital form. Arabic for text in Arabic script. We are consid- It has been successfully deployed in a number of ering adding a source language detection com- multi-lingual natural language systems. ponent that will automatically determine whether an Arabic-script source text is Arabic, Farsi, or Uyghur etc. Acknowledgment This work is supported by DARPA (HR0011-15- 5 Romanization Applications C-0115). 5.1 Related Work

Gey(2009) reports that romanization based on 8https://github.com/andyhu/transliteration ALA-LC romanization tables (see Section 2.2) is 9https://github.com/BBN-E/bbn-transliterator

17 References Su-Youn Yoon, Kyoung-Young Kim, and Richard Sproat. 2007. Multilingual transliteration using fea- Leon Cheung, Thamme Gowda, Ulf Hermjakob, Nel- ture based phonetic method. In Proceedings of the son Liu, Jonathan May, Alexandra Mayn, Nima 45th annual meeting of the Association of Computa- Pourdamghani, Michael Pust, Kevin Knight, Niko- tional Linguistics, pages 112–119. laos Malandrakis, et al. 2017. Elisa system descrip- tion for LoReHLT 2017.

Aliya Deri and Kevin Knight. 2016. -to- phoneme models for (almost) any language. In Pro- ceedings of the 54th Annual Meeting of the Associa- tion for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 399–408.

Fredric Gey. 2009. Romanization–an untapped re- source for out-of-vocabulary machine translation for CLIR. In SIGIR Workshop on Information Access in a Multilingual World, pages 49–51.

Ulf Hermjakob, Jonathan May, Michael Pust, and Kevin Knight. 2018. Translating a language you don’t know in the Chinese Room. In Proceedings of the 56th Annual Meeting of Association for Com- putational Linguistics, Demo Track.

Heng Ji, Xiaoman Pan, Boliang Zhang, Joel Nothman, James Mayfield, Paul McNamee, and Cash Costello. 2017. Overview of TAC-KBP2017 13 languages en- tity discovery and linking. In Text Analysis Confer- ence – Knowledge Base Population track.

Kevin Knight and Jonathan Graehl. 1998. Ma- chine transliteration. Computational Linguistics, 24(4):599–612.

Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pages 707–710.

James Mayfield, Paul McNamee, and Cash Costello. 2017. Language-independent named entity analysis using parallel projection and rule-based disambigua- tion. In Proceedings of the 6th Workshop on Balto- Slavic Natural Language Processing, pages 92–96.

Stephen Mayhew, Christos Christodoulopoulos, and Dan Roth. 2016. Transliteration in any lan- guage with surrogate languages. arXiv preprint arXiv:1609.04325.

Lawrence Philips. 2000. The double Metaphone search algorithm. C/C++ Users ., 18(6):38–43.

Ting Qian, Kristy Hollingshead, Su-youn Yoon, Kyoung-young Kim, and Richard Sproat. 2010. A Python toolkit for universal transliteration. In LREC.

William E. Winkler. 1990. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Sec- tion on Survey Research Methods, pages 354–359. ERIC.

18